EM algorithm

  • Article type: Journal Article
    Phase IV clinical trials are designed to monitor long-term side effects of medical treatment. For instance, childhood cancer survivors treated with chest radiation and/or anthracycline are often at risk of developing cardiotoxicity during adulthood. Often the primary focus of a study is estimating the cumulative incidence of a particular outcome of interest, such as cardiotoxicity. However, it is challenging to evaluate patients continuously, and this information is usually collected through cross-sectional surveys while following patients longitudinally. This leads to interval-censored data, since the exact time of onset of the toxicity is unknown. Rai et al. computed the transition intensity rates using a parametric model and estimated the parameters by maximum likelihood in an illness-death model. However, such an approach may not be suitable if the underlying parametric assumptions do not hold. This manuscript proposes a semi-parametric model, with a logit relationship for the treatment intensities in two groups, to estimate the transition intensity rates within the context of an illness-death model. The parameters are estimated using an EM algorithm with profile likelihood. Results from the simulation studies suggest that the proposed approach is easy to implement and yields results comparable to those of the parametric model.

  • Article type: Journal Article
    Studies/trials assessing the status and progression of periodontal disease (PD) usually focus on quantifying the relationship between clustered (teeth within subjects) bivariate endpoints, such as probed pocket depth (PPD) and clinical attachment level (CAL), and the covariates. Although assumptions of multivariate normality can be invoked for the random terms (random effects and errors) under a linear mixed model (LMM) framework, violations of those assumptions may lead to imprecise inference. Furthermore, the response-covariate relationship may not be linear, as assumed under an LMM fit, and the regression estimates obtained therein do not provide an overall summary of the risk of PD, as obtained from the covariates. Motivated by a PD study on Gullah-speaking African-American Type-2 diabetics, we cast the asymmetric clustered bivariate (PPD and CAL) responses into a non-linear mixed model framework, where both random terms follow the multivariate asymmetric Laplace distribution (ALD). In order to provide a one-number risk summary, the possible non-linearity in the relationship is modeled via a single-index model, powered by polynomial spline approximations for the index functions and the normal mixture expression for the ALD. To proceed with a maximum-likelihood inferential setup, we devise an elegant EM-type algorithm. Moreover, the large-sample theoretical properties are established under some mild conditions. Simulation studies using synthetic data generated under a variety of scenarios were used to study the finite-sample properties of our estimators, and they demonstrate that our proposed model and estimation algorithm can efficiently handle asymmetric, heavy-tailed data with outliers. Finally, we illustrate our proposed methodology via application to the motivating PD study.

  • Article type: Journal Article
    Dating phylogenetic trees to obtain branch lengths in units of time is essential for many downstream applications but has remained challenging. Dating requires inferring substitution rates that can change across the tree. While we can assume that information about a small subset of nodes is available from the fossil record or sampling times (for fast-evolving organisms), inferring the ages of the other nodes essentially requires extrapolation and interpolation. Assuming a distribution of branch rates, we can formulate dating as a constrained maximum likelihood (ML) estimation problem. While ML dating methods exist, their accuracy degrades in the face of model misspecification, where the assumed parametric statistical distribution of branch rates differs vastly from the true distribution. Notably, most existing methods assume rigid, often unimodal, branch rate distributions. A second challenge is that the likelihood function involves an integral over the continuous domain of the rates and often leads to difficult non-convex optimization problems. To tackle these two challenges, we propose a new method called Molecular Dating using Categorical-models (MD-Cat). MD-Cat uses a categorical model of rates inspired by non-parametric statistics and can approximate a large family of models by discretizing the rate distribution into k categories. Under this model, we can use the Expectation-Maximization (EM) algorithm to co-estimate rate categories and branch lengths in time units. Our model makes fewer assumptions about the true distribution of branch rates than parametric models such as the Gamma or LogNormal distribution. Our results on two simulated and real datasets of Angiosperms and HIV and a wide selection of rate distributions show that MD-Cat is often more accurate than the alternatives, especially on datasets with exponential or multimodal rate distributions.

  • Article type: Journal Article
    Cognitive diagnostic models (CDMs) are a popular family of discrete latent variable models that model students' mastery or deficiency of multiple fine-grained skills. CDMs have been most widely used to model categorical item response data such as binary or polytomous responses. With advances in technology and the emergence of varying test formats in modern educational assessments, new response types, including continuous responses such as response times, and count-valued responses from tests with repetitive tasks or eye-tracking sensors, have also become available. Variants of CDMs have been proposed recently for modeling such responses. However, whether these extended CDMs are identifiable and estimable is entirely unknown. We propose a very general cognitive diagnostic modeling framework for arbitrary types of multivariate responses with minimal assumptions, and establish identifiability in this general setting. Surprisingly, we prove that our general-response CDMs are identifiable under Q-matrix-based conditions similar to those for traditional categorical-response CDMs. Our conclusions set up a new paradigm of identifiable general-response CDMs. We propose an EM algorithm to efficiently estimate a broad class of exponential family-based general-response CDMs. We conduct simulation studies under various response types. The simulation results not only corroborate our identifiability theory, but also demonstrate the superior empirical performance of our estimation algorithms. We illustrate our methodology by applying it to a TIMSS 2019 response time dataset.

  • Article type: Journal Article
    Proportional data arise frequently in a wide variety of fields of study. Such data often exhibit extra variation such as over-/under-dispersion, sparseness and zero inflation. For example, the hepatitis data present both sparseness and zero inflation: of the 83 annual age groups, 19 contribute non-zero denominators of 5 or fewer and 36 have zero seropositives. The whitefly data consist of 640 observations with 339 zeros (53%), which demonstrates extra zero inflation. The catheter management data involve excessive zeros, with over 60% zeros on average across 193 urinary tract infection outcomes, 194 catheter blockage outcomes and 193 catheter displacement outcomes. However, the existing models cannot always address such features appropriately. In this paper, a new two-parameter probability distribution called the Lindley-binomial (LB) distribution is proposed to analyze proportional data with such features. The probabilistic properties of the distribution, such as the moments and the moment generating function, are derived. The Fisher scoring algorithm and the EM algorithm are presented for computing the parameter estimates in the proposed LB regression model. Goodness-of-fit issues for the LB model are discussed. A limited simulation study is also performed to evaluate the performance of the derived EM algorithms for estimating the parameters in the model with and without covariates. The proposed model is illustrated through the three aforementioned proportional datasets.

  • Article type: Journal Article
    Motivated by a DNA methylation application, this article addresses the problem of fitting and inferring a multivariate binomial regression model for outcomes that are contaminated by errors and exhibit extra-parametric variations, also known as dispersion. While dispersion in univariate binomial regression has been extensively studied, addressing dispersion in the context of multivariate outcomes remains a complex and relatively unexplored task. The complexity arises from a noteworthy data characteristic observed in our motivating dataset: non-constant yet correlated dispersion across outcomes. To address this challenge and account for possible measurement error, we propose a novel hierarchical quasi-binomial varying coefficient mixed model, which enables flexible dispersion patterns through a combination of additive and multiplicative dispersion components. To maximize the Laplace-approximated quasi-likelihood of our model, we further develop a specialized two-stage expectation-maximization (EM) algorithm, where a plug-in estimate for the multiplicative scale parameter enhances the speed and stability of the EM iterations. Simulations demonstrated that our approach yields accurate inference for smooth covariate effects and exhibits excellent power in detecting non-zero effects. Additionally, we applied our proposed method to investigate the association between DNA methylation, measured across the genome through targeted custom capture sequencing of whole blood, and levels of anti-citrullinated protein antibodies (ACPA), a preclinical marker for rheumatoid arthritis (RA) risk. Our analysis revealed 23 significant genes that potentially contribute to ACPA-related differential methylation, highlighting the relevance of cell signaling and collagen metabolism in RA. We implemented our method in the R Bioconductor package called "SOMNiBUS."

  • Article type: Journal Article
    A new empirical Bayes approach to variable selection in the context of generalized linear models is developed. The proposed algorithm scales to situations in which the number of putative explanatory variables is very large, possibly much larger than the number of responses. The coefficients in the linear predictor are modeled as a three-component mixture allowing the explanatory variables to have a random positive effect on the response, a random negative effect, or no effect. A key assumption is that only a small (but unknown) fraction of the candidate variables have a non-zero effect. This assumption, in addition to treating the coefficients as random effects, facilitates an approach that is computationally efficient. In particular, the number of parameters that have to be estimated is small and remains constant regardless of the number of explanatory variables. The model parameters are estimated using a Generalized Alternating Maximization algorithm, which is scalable and leads to significantly faster convergence than simulation-based fully Bayesian methods.

  • Article type: Journal Article
    We consider the Bayesian estimation of the parameters of a finite mixture model from independent order statistics arising from imperfect ranked set sampling (RSS) designs. As a cost-effective method, ranked set sampling enables us to incorporate easily attainable characteristics, as ranking information, into data collection and Bayesian estimation. To handle the special structure of the ranked set samples, we develop a Bayesian estimation approach that uses the Expectation-Maximization (EM) algorithm to estimate the ranking parameters and Metropolis-within-Gibbs sampling to estimate the parameters of the underlying mixture model. Our findings show that the proposed RSS-based Bayesian estimation method outperforms the commonly used Bayesian counterpart based on simple random sampling. Finally, the developed method is applied to estimate the bone disorder status of women aged 50 and older.

  • Article type: Journal Article
    In health and clinical research, medical indices (e.g., BMI) are commonly used for monitoring and/or predicting health outcomes of interest. While single-index modeling can be used to construct such indices, methods that use single-index models to analyze longitudinal data with multiple correlated binary responses are underdeveloped, although there are abundant applications with such data (e.g., prediction of multiple medical conditions based on longitudinally observed disease risk factors). This article aims to fill the gap by proposing a generalized single-index model that can incorporate multiple single indices and mixed effects for describing observed longitudinal data of multiple binary responses. Compared to the existing methods, which focus on constructing marginal models for each response, the proposed method can make use of the correlation information in the observed data about different responses when estimating different single indices for predicting response variables. Estimation of the proposed model is achieved by using a local linear kernel smoothing procedure, together with methods designed specifically for estimating single-index models and traditional methods for estimating generalized linear mixed models. Numerical studies show that the proposed method is effective in the various cases considered. It is also demonstrated using a dataset from the English Longitudinal Study of Aging project.

  • Article type: Journal Article
    Autoregressive models in time series are useful in various areas. In this article, we propose a skew-t autoregressive model. We estimate its parameters using the expectation-maximization (EM) method and develop an influence methodology based on local perturbations for its validation. We obtain the normal curvatures for four perturbation strategies to identify influential observations, and then assess their performance through Monte Carlo simulations. An example of financial data analysis is presented to study daily log-returns for Brent crude futures and to investigate the possible impact of the COVID-19 pandemic.
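
The entries listed above all rest on the same alternation between an E-step (computing expectations over latent quantities) and an M-step (maximizing the resulting expected log-likelihood). As a generic illustration only, and not the method of any article listed here, the following minimal sketch fits a two-component univariate Gaussian mixture by EM. It assumes NumPy is available; the function name `em_gmm_1d`, the quantile-based initialisation and the stopping tolerance are hypothetical choices made for this example.

```python
import numpy as np


def em_gmm_1d(x, n_iter=200, tol=1e-8):
    """Illustrative EM for a two-component univariate Gaussian mixture.

    Returns (pi, mu, var, ll): the weight of the second component, the two
    means, the two variances, and the observed-data log-likelihood evaluated
    in the last E-step (i.e. before the final M-step update).
    """
    x = np.asarray(x, dtype=float)
    # Crude initialisation (an assumption of this sketch): means at the
    # lower and upper quartiles, equal weights, pooled variance.
    pi = 0.5
    mu = np.percentile(x, [25.0, 75.0])
    var = np.array([x.var(), x.var()])
    prev_ll = -np.inf
    ll = prev_ll
    for _ in range(n_iter):
        # E-step: responsibilities, i.e. posterior probability that each
        # observation was generated by each component.
        weights = np.array([1.0 - pi, pi])
        dens = (weights * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2.0 * np.pi * var))
        total = dens.sum(axis=1)
        resp = dens / total[:, None]
        ll = np.log(total).sum()
        # M-step: responsibility-weighted updates of weight, means, variances.
        nk = resp.sum(axis=0)
        pi = nk[1] / x.size
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        var = np.maximum(var, 1e-12)  # guard against a collapsing component
        # Stop once the log-likelihood gain becomes negligible.
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return pi, mu, var, ll
```

The models in the abstracts replace the Gaussian components with their own latent structure (for example rate categories, skill profiles or scale mixtures), so the E-step and M-step formulas differ, but the loop has the same shape.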