compositional data

  • Article type: Journal Article
    DNA methylation (DNAm)-based deconvolution estimates are relative data that form a composition, which standard methods (testing directly on cell proportions) are ill-suited to handle. In this study, we examined the performance of an alternative method, analysis of compositions of microbiomes (ANCOM), for the analysis of DNAm-based deconvolution estimates. We performed two simulation studies comparing ANCOM with a standard approach (a two-sample t-test performed directly on cell proportions) and analyzed real-world data from the Women's Health Initiative to evaluate the applicability of ANCOM to DNAm-based deconvolution estimates. Our findings indicate that ANCOM can effectively account for the compositional nature of DNAm-based deconvolution estimates. ANCOM adequately controls the false discovery rate while maintaining statistical power comparable to that of standard methods.
    DNA methylation (DNAm)-based deconvolution provides highly accurate estimates of the proportion of each cell type in a mixed-cell-type biological sample (e.g., whole blood). These estimates can be used to examine the association between cell type proportions and biological or clinical end points; for example, comparing the estimated neutrophil proportion in whole blood between smokers and non-smokers. Cell proportion data have unique features that present challenges for traditional and widely used statistical methods. In response to this issue, our work presents two simulation studies and a real-world analysis that benchmark the performance of current standard statistical methods against an alternative method called analysis of compositions of microbiomes (ANCOM), which was originally developed for the analysis of microbiome data. In our real-world analysis, we used DNAm data collected from the Women's Health Initiative Long Life Study I and compared the results of each method against a gold standard that is typically not available for these analyses. In each of our simulation studies, ANCOM was able to detect true differences in cell proportions between the groups being compared but had a much lower rate of false discovery than the standard statistical methods. Our real-world analysis demonstrated similar findings. Overall, our study highlights the potential of ANCOM as a powerful and robust method for analyzing DNAm-derived deconvolution estimates when the interest is in comparing cell type proportions across biological or clinical end points. ANCOM's ability to minimize false discovery while maintaining robust statistical power positions it as a valuable addition to the epigenomic analysis toolkit.
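The compositional problem described in this abstract can be illustrated with a small sketch. It is not ANCOM itself and does not come from the paper; the cell counts are hypothetical values chosen for illustration. It shows that when only one cell type changes in absolute terms, the proportions of the other cell types shift as well, while the ratio between two unchanged cell types stays the same. Log-ratio methods such as ANCOM build on this second property.

```python
def close(counts):
    """Rescale non-negative counts so they sum to 1 (the closure operation)."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical absolute cell counts: neutrophils, lymphocytes, monocytes.
control = [4000.0, 2000.0, 500.0]
# In the second group only neutrophils differ in absolute terms.
exposed = [6000.0, 2000.0, 500.0]

p_control = close(control)
p_exposed = close(exposed)

for name, a, b in zip(["neutrophils", "lymphocytes", "monocytes"], p_control, p_exposed):
    print(f"{name:12s} {a:.3f} -> {b:.3f}")

# The lymphocyte-to-monocyte ratio is unaffected by closure,
# even though both proportions dropped.
ratio_control = p_control[1] / p_control[2]
ratio_exposed = p_exposed[1] / p_exposed[2]
print(ratio_control, ratio_exposed)
```

A test applied directly to the lymphocyte proportion would flag a difference here even though the absolute lymphocyte count did not change, which is the kind of false discovery the abstract refers to.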

  • Article type: Journal Article
    The relationships among bacterial flora, diseases, and diet have been described by many authors. Operational taxonomic units (OTUs) are the result of clustering 16S rRNA gene sequences at a certain cutoff value, and they are considered compositional data. As Pearson's correlation coefficient is difficult to interpret for such data, Aitchison's ratio analysis was used to develop a method for handling compositional data. Multivariate analysis was developed because univariate analysis can be subject to large biases. In this study, we conducted simulations of absolute abundance under certain assumptions, together with several analyses, including nonparametric multidimensional scaling (NMDS), principal component analysis (PCA), and ratio analysis. The same content as a 100% stacked bar graph could be expressed in low dimensions using PCA. However, the relative diversity was not reproducible with NMDS. Various assumptions were made regarding absolute abundance based on the relative abundance, but which of these assumptions are true could not be determined. In summary, ratio analysis and PCA are useful for analyzing compositional data and the gut microbiota.
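As a minimal sketch of the ratio-based approach mentioned in this abstract, the centered log-ratio (clr) transform from Aitchison's framework can be written in a few lines. The read counts below are hypothetical, and the sketch assumes all parts are strictly positive (real OTU tables contain zeros that need separate handling). It shows that the clr values are identical whether they are computed from absolute or relative abundances.

```python
import math

def clr(parts):
    """Centered log-ratio: log of each part minus the mean log of all parts."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical OTU read counts for one sample (all positive).
absolute = [120.0, 30.0, 50.0]
relative = [a / sum(absolute) for a in absolute]

clr_abs = clr(absolute)
clr_rel = clr(relative)

print(clr_abs)
print(clr_rel)
```

Because the transform depends only on ratios between parts, the unknown total abundance drops out, which is why ratio analysis does not require an assumption about absolute abundance.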

  • Article type: Journal Article
    Bipolar psychometric scale data are widely used in psychological healthcare. Adequate psychological profiling benefits patients and saves time and costs. Grant funding depends on the quality of psychotherapeutic measures. Bipolar Likert scales yield compositional data because any degree of agreement with an item assertion implies a complementary degree of disagreement. Using an isometric log-ratio (ilr) transformation, the bivariate information can be mapped to a real-valued interval scale, yielding unbiased statistical results and increasing the statistical power of the Pearson correlation significance test, provided the Central Limit Theorem (CLT) is satisfied. In practice, however, the applicability of the CLT depends on the number of summands (i.e., the number of items) and the variance of the data generating process (DGP) of the ilr-transformed data. Via simulation, we provide evidence that the ilr approach also works satisfactorily if the CLT is violated. That is, the ilr approach is robust to extremely large or infinite variances of the underlying DGP, increasing the statistical power of the correlation test. The study generalizes earlier results, pointing out the universality and reliability of the ilr approach in psychometric big data analysis, with implications for psychometric health economics, patient welfare, grant funding, economic decision making, and profits.
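For a two-part composition (agreement, disagreement), the ilr transformation reduces to a single coordinate, the log-ratio of the two parts divided by the square root of 2. The sketch below illustrates this under stated assumptions: the 7-point scale and the rescaling of responses to shares strictly between 0 and 1 are hypothetical choices made for the example, not the procedure used in the study.

```python
import math

def ilr_two_part(agree):
    """ilr coordinate of the composition (agree, 1 - agree), with 0 < agree < 1."""
    if not 0.0 < agree < 1.0:
        raise ValueError("agree must lie strictly between 0 and 1")
    return math.log(agree / (1.0 - agree)) / math.sqrt(2.0)

def ilr_two_part_inverse(z):
    """Map an ilr coordinate back to the agreement share."""
    return 1.0 / (1.0 + math.exp(-math.sqrt(2.0) * z))

# Hypothetical 7-point bipolar item; responses are shifted by 0.5 so that
# the shares never reach 0 or 1, where the log-ratio is undefined.
responses = [1, 4, 7]
shares = [(r - 0.5) / 7 for r in responses]
coords = [ilr_two_part(s) for s in shares]

print(coords)
```

The midpoint of the scale maps to zero and the two endpoints map to values of equal size and opposite sign, so the transformed scale is symmetric and unbounded.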

  • Article type: Journal Article
    The microbiome represents a hidden world of tiny organisms populating not only our surroundings but also our own bodies. By enabling comprehensive profiling of these invisible creatures, modern genomic sequencing tools have given us an unprecedented ability to characterize these populations and uncover their outsize impact on our environment and health. Statistical analysis of microbiome data is critical to infer patterns from the observed abundances. The application and development of analytical methods in this area require careful consideration of the unique aspects of microbiome profiles. We begin this review with a brief overview of microbiome data collection and processing and describe the resulting data structure. We then provide an overview of statistical methods for key tasks in microbiome data analysis, including data visualization, comparison of microbial abundance across groups, regression modeling, and network inference. We conclude with a discussion and highlight interesting future directions.

  • Article type: Journal Article
    Deterministic variables are variables that are functionally determined by one or more parent variables. They commonly arise when a variable has been functionally created from one or more parent variables, as with derived variables, and in compositional data, where the 'whole' variable is determined from its 'parts'. This article introduces how deterministic variables may be depicted within directed acyclic graphs (DAGs) to help with identifying and interpreting causal effects involving derived variables and/or compositional data. We propose a two-step approach in which all variables are initially considered, and a choice is made whether to focus on the deterministic variable or its determining parents. Depicting deterministic variables within DAGs brings several benefits. It is easier to identify and avoid misinterpreting tautological associations, i.e., self-fulfilling associations between deterministic variables and their parents, or between sibling variables with shared parents. In compositional data, it is easier to understand the consequences of conditioning on the 'whole' variable, and correctly identify total and relative causal effects. For derived variables, it encourages greater consideration of the target estimand and greater scrutiny of the consistency and exchangeability assumptions. DAGs with deterministic variables are a useful aid for planning and interpreting analyses involving derived variables and/or compositional data.
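The tautological association described in this abstract can be sketched with a short simulation. This is an illustration under assumed conditions, not an analysis from the article: three hypothetical 'parts' are drawn independently from a standard normal distribution, and the 'whole' is defined as their sum.

```python
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 5000

# Three independent hypothetical parts; none has any effect on the others.
part_a = [rng.gauss(0, 1) for _ in range(n)]
part_b = [rng.gauss(0, 1) for _ in range(n)]
part_c = [rng.gauss(0, 1) for _ in range(n)]

# The deterministic 'whole' is fully determined by its parents.
whole = [a + b + c for a, b, c in zip(part_a, part_b, part_c)]

r_parts = pearson(part_a, part_b)  # close to zero: the parts are independent
r_whole = pearson(part_a, whole)   # clearly positive purely by construction

print(round(r_parts, 3), round(r_whole, 3))
```

With three independent parts of equal variance, the correlation between one part and the whole is 1/sqrt(3) in expectation, even though nothing causal links them beyond the definition of the sum.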

  • Article type: Journal Article
    Taxonomic marker gene analysis allows uncovering taxonomic profiles of microbial communities at low cost, making it omnipresent in microbiome research. There is an ever-expanding set of tools to extract further biological information from this kind of data. In this perspective, we enunciate several concerns regarding the biological validity of predicting functional potential from taxonomic profiles, especially when they are generated by short-read sequencing. The taxonomic resolution of marker genes, intragenomic variability of marker genes, and the compositional nature of microbiome data are discussed. Combining actual measurements of microbiome functions with predicted functional potentials is proposed as a powerful approach to better understand microbiome functioning. In this context, the significance of predicted functional potentials for generating and testing hypotheses is highlighted. We argue that functions of microbiomes predicted from microbiome DNA read count data generated by short-read amplicon sequencing should not serve as the only basis to draw biological inferences.

  • Article type: Journal Article
    Urban areas are characterized by a constant anthropogenic input, which is manifested in the chemical composition of the surface layer of urban soil. The consequence is the formation of intense anomalies of chemical elements, including lead (Pb), that are atypical for this landscape. Therefore, this study aims to explore the compositional-geochemical characteristics of soil Pb anomalies in the urban areas of Yerevan, Gyumri, and Vanadzor, and to identify the geochemical associations of Pb that emerge under prevalent anthropogenic influences in these urban areas. The results obtained through the combined use of compositional data analysis and geospatial mapping showed that the investigated Pb anomalies in different cities form source-specific geochemical associations influenced by historical and ongoing activities, as well as by the natural geochemical behavior of chemical elements occurring in these areas. Specifically, in Yerevan, Pb was closely linked with Cu and Zn, forming a group of persistent anthropogenic tracers of urban areas. In contrast, in Gyumri and Vanadzor, Pb was linked with Ca, suggesting that complexation of Pb by Ca carbonates occurred over decades. These patterns of compositional-geochemical characteristics of Pb anomalies are directly linked to the socio-economic development of the cities and the various emission sources present in their environments during different periods. The human health risk assessment showed that children face a Pb-induced non-carcinogenic risk with a certainty of 63.59% in Yerevan and 50% in both Gyumri and Vanadzor.

  • Article type: Journal Article
    Soil contamination in outdoor shooting ranges (OSRs) is a major threat to human health, particularly when, after the end of activities, the land is used for recreational areas or agricultural production. The status of land degradation of an OSR in southern Italy was assessed using a multisensor approach. It was based on: i) proximal sensors, including electromagnetic induction (EMI) for measuring soil electrical conductivity (ECa) and magnetic susceptibility (MSa), γ-ray spectrometry for K, eU, and eTh analyses, and ultrasonic penetrometry for detecting cone index (CI) data representative of soil strength; ii) field surveys of soil thickness (ST); and iii) laboratory analyses of potentially toxic elements (PTEs) by portable X-ray fluorescence spectrometry and of polycyclic aromatic hydrocarbons (PAHs) by gas chromatography. The spatial variability of the measurements was modelled and mapped using geostatistical methods. The most densely measured covariate (i.e., the ECa of the topsoil) was used within kriging with external drift to improve the PTE predictions. The PTE maps were complemented by maps of spatial uncertainty. A robust multivariate principal component analysis (rPCA) was applied to the proximal sensor and laboratory data. It identified associations of PAHs, lead, and CI with the topsoil ECa along the first component (PC1), highlighting the correlation between anthropogenic effects on the land and the EMI measures. The association of the ST (estimating the depth of underground travertine hard layers) with the bottom soil ECa and MSa along the second component (PC2) showed the influence of soil stratigraphy on the EMI measures. This study demonstrates that the simultaneous use of different proximal sensors together with laboratory analysis makes it possible to assess and model the spatial variability of the land degradation status of an OSR, including soil compaction and organic and inorganic contamination. The correlation between EMI data and PTE content highlights the potential of this technique in the field of soil contamination.

  • Article type: Journal Article
    Ordinal responses are commonly found in medicine, biology, and other fields. In many situations, the predictors for an ordinal response are compositional, which means that the sum of the predictors for each sample is fixed. Examples of compositional data include the relative abundance of species in microbiome data and the relative frequency of nutrition concentrations. Moreover, predictors that are strongly correlated tend to have a similar influence on the response outcome. Conventional cumulative logistic regression models for ordinal responses ignore the fixed-sum constraint on predictors and their associated interrelationships, and thus are not appropriate for analyzing compositional predictors. To solve this problem, we propose Bayesian Compositional Models for Ordinal Response to analyze the relationship between compositional data and an ordinal response, with a structured regularized horseshoe prior for the compositional coefficients and a soft sum-to-zero restriction on the coefficients through the prior distribution. The method was implemented with the R package rstan using an efficient Hamiltonian Monte Carlo algorithm. We performed simulations to compare the proposed approach with existing methods for ordinal responses. The results revealed that our proposed method outperformed the existing methods in terms of parameter estimation and prediction. We also applied the proposed method to a microbiome study, HMP2Data, to find microorganisms linked to ordinal inflammatory bowel disease levels. To make this work reproducible, the code and data used in this paper are available at https://github.com/Li-Zhang28/BCO.
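The role of the sum-to-zero restriction can be illustrated with a small sketch of a log-contrast linear predictor. This is not the authors' Bayesian model and does not use rstan; the coefficients and counts are hypothetical. It only shows why the restriction matters: when the coefficients on the log-transformed parts sum to zero, the linear predictor is the same for raw counts and for proportions, so it does not depend on the arbitrary total.

```python
import math

def log_contrast(composition, beta):
    """Linear predictor sum_j beta_j * log(x_j) for strictly positive parts."""
    return sum(b * math.log(x) for b, x in zip(beta, composition))

# Hypothetical taxon counts for one sample and the matching proportions.
counts = [40.0, 25.0, 35.0]
proportions = [c / sum(counts) for c in counts]

beta_constrained = [0.8, -0.5, -0.3]  # hypothetical, sums to zero
beta_free = [0.8, -0.5, 0.3]          # hypothetical, sums to 0.6

eta_counts = log_contrast(counts, beta_constrained)
eta_props = log_contrast(proportions, beta_constrained)

free_counts = log_contrast(counts, beta_free)
free_props = log_contrast(proportions, beta_free)

print(eta_counts, eta_props)    # agree up to rounding
print(free_counts, free_props)  # differ by sum(beta_free) * log(total)
```

In the paper the restriction is imposed softly through the prior rather than exactly, so fitted coefficients would sum to approximately zero instead of exactly zero as in this sketch.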

  • Article type: Journal Article
    Effective forecasting of the energy consumption structure is vital for China to reach its "dual carbon" objective. However, existing studies have paid little attention to the holistic nature and internal properties of the energy consumption structure. Therefore, this paper incorporates the theory of compositional data into the study of the energy consumption structure, which not only takes into account the specific internal features of the structure but also digs deeper into the relative information. Based on minimizing the squared Aitchison distance between compositions, a combined model is constructed from three single models, namely the metabolism grey model (MGM), the back-propagation neural network (BPNN) model, and the autoregressive integrated moving average (ARIMA) model. The forecast results for 2023-2040 indicate that the future energy consumption structure of China will evolve towards a more diversified pattern, but the proportions of natural gas and non-fossil energy have yet to meet the policy goals set by the government. This paper not only suggests that combined prediction models for compositional data have high applicability in the energy sector, but also has some theoretical significance for adapting and improving the energy consumption structure in China.
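The Aitchison distance mentioned in this abstract, and one simple way to combine several compositional forecasts, can be sketched as follows. The forecast shares and the weights are hypothetical, and the paper's own procedure for choosing the weights is not reproduced here. The sketch uses a known property of the Aitchison geometry: for fixed weights that sum to 1, the closed weighted geometric mean of the forecasts minimizes the weighted sum of squared Aitchison distances to them.

```python
import math

def clr(parts):
    """Centered log-ratio transform of strictly positive parts."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

def aitchison_distance(x, y):
    """Euclidean distance between the clr transforms of two compositions."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(clr(x), clr(y))))

def combine(forecasts, weights):
    """Weighted geometric mean of compositions, closed to sum to 1."""
    n_parts = len(forecasts[0])
    raw = [
        math.exp(sum(w * math.log(f[j]) for f, w in zip(forecasts, weights)))
        for j in range(n_parts)
    ]
    total = sum(raw)
    return [r / total for r in raw]

def weighted_sq_dist(x, forecasts, weights):
    """Weighted sum of squared Aitchison distances from x to each forecast."""
    return sum(w * aitchison_distance(x, f) ** 2 for f, w in zip(forecasts, weights))

# Hypothetical shares of coal, oil, natural gas, and non-fossil energy
# from three single models (stand-ins for MGM, BPNN, and ARIMA).
forecasts = [
    [0.50, 0.18, 0.10, 0.22],
    [0.48, 0.19, 0.11, 0.22],
    [0.52, 0.17, 0.09, 0.22],
]
weights = [0.5, 0.3, 0.2]  # hypothetical weights that sum to 1

combined = combine(forecasts, weights)
print([round(c, 4) for c in combined])
print(weighted_sq_dist(combined, forecasts, weights))
```

The combined forecast is itself a valid composition, which a plain weighted average of log-transformed shares would not guarantee without the final closure step.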