compositional data

  • Article type: Journal Article
    Phylogenetic association analysis plays a crucial role in investigating the correlation between microbial compositions and specific outcomes of interest in microbiome studies. However, existing methods for testing such associations have limitations related to the assumption of a linear association in high-dimensional settings and the handling of confounding effects. Hence, there is a need for methods capable of characterizing complex associations, including nonmonotonic relationships. This article introduces a novel phylogenetic association analysis framework and associated tests to address these challenges by employing conditional rank correlation as a measure of association. The proposed tests account for confounders in a fully nonparametric manner, ensuring robustness against outliers and the ability to detect diverse dependencies. The proposed framework aggregates conditional rank correlations for subtrees using weighted sum and maximum approaches to capture both dense and sparse signals. The significance level of the test statistics is determined by calibration through a nearest-neighbour bootstrapping method, which is straightforward to implement and can accommodate additional datasets when these are available. The practical advantages of the proposed framework are demonstrated through numerical experiments using both simulated and real microbiome datasets.






  • Article type: Journal Article
    Air pollution is an environmental risk to child mental health, but proven relationships have so far been observed only in urban areas. Understanding the impact of pollution in rural settings is equally crucial. The novelty of this article lies in its study of the relationship between air pollution and behavioural and developmental disorders, attention deficit hyperactivity disorder (ADHD), anxiety, and eating disorders in children under 15 living in a rural area. The methodology combines spatio-temporal models, Bayesian inference and Compositional Data (CoDa), which make it possible to study areas with few pollution monitoring stations. Exposure to nitrogen dioxide (NO2), ozone (O3), and sulphur dioxide (SO2) is related to behavioural and developmental disorders; anxiety is related to particulate matter (PM10), O3 and SO2; and overall pollution is associated with ADHD and eating disorders. In sum, like their urban counterparts, rural children are subject to mental health risks related to air pollution, and the combination of spatio-temporal models, Bayesian inference and CoDa makes it possible to relate mental health problems to pollutant concentrations in rural settings with few monitoring stations. Certain limitations persist, related to misclassification of exposure to air pollutants and to the covariates available in the data sources used.






  • Article type: Journal Article
    DNA methylation (DNAm)-based deconvolution estimates are relative data that form a composition, which standard methods (testing directly on cell proportions) are ill-suited to handle. In this study we examined the performance of an alternative method, analysis of compositions of microbiomes (ANCOM), for the analysis of DNAm-based deconvolution estimates. We performed two simulation studies comparing ANCOM to a standard approach (a two-sample t-test performed directly on cell proportions) and analyzed a real-world dataset from the Women's Health Initiative to evaluate the applicability of ANCOM to DNAm-based deconvolution estimates. Our findings indicate that ANCOM can effectively account for the compositional nature of DNAm-based deconvolution estimates. ANCOM adequately controls the false discovery rate while maintaining statistical power comparable to that of standard methods.
    DNA methylation (DNAm)-based deconvolution provides highly accurate estimates of the proportion of each cell type in a mixed-cell-type biological sample (e.g., whole blood). These estimates can be used to examine the association between cell type proportions and biological or clinical end points; for example, comparing the estimated neutrophil proportion in whole blood between smokers and non-smokers. Cell proportion data have unique features that present challenges for traditional and widely used statistical methods. In response to this issue, our work presents two simulation studies and a real-world analysis that benchmark the performance of current standard statistical methods against an alternative method called analysis of compositions of microbiomes (ANCOM), which was originally developed for the analysis of microbiome data. In our real-world analysis we used DNAm data collected from the Women’s Health Initiative Long Life Study I and compared the results of each method against a gold standard that is typically not available for these analyses. In each of our simulation studies, ANCOM was able to detect true differences in cell proportions between the groups being compared but had a much lower rate of false discovery than the standard statistical methods. Our real-world analysis demonstrated similar findings. Overall, our study highlights the potential of ANCOM as a powerful and robust method for analyzing DNAm-derived deconvolution estimates when the interest is in comparing cell type proportions across biological or clinical end points. ANCOM’s ability to minimize false discovery while maintaining robust statistical power positions it as a valuable addition to the epigenomic analysis toolkit.
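
    The compositional problem described in this abstract can be illustrated with a minimal sketch. It is not ANCOM itself; it only shows the log-ratio idea that ANCOM builds on. The cell counts, the group names and the helper `close` are hypothetical and chosen purely for illustration.

```python
import math

def close(counts):
    """Rescale non-negative counts so they sum to 1 (the 'closure' operation)."""
    total = sum(counts)
    return [c / total for c in counts]

# Hypothetical absolute cell counts (arbitrary units), not real data.
# Order: neutrophils, lymphocytes, monocytes.
group_a = [50.0, 30.0, 20.0]
group_b = [100.0, 30.0, 20.0]  # only neutrophils differ in absolute terms

prop_a = close(group_a)
prop_b = close(group_b)

# Raw proportions: lymphocytes appear to drop although their count is unchanged.
lymph_shift = prop_b[1] - prop_a[1]

# The log-ratio of two unchanged parts is unaffected by the neutrophil increase.
logratio_a = math.log(prop_a[1] / prop_a[2])
logratio_b = math.log(prop_b[1] / prop_b[2])

print(f"lymphocyte proportion shift: {lymph_shift:+.3f}")
print(f"log(lymph/mono): {logratio_a:.4f} vs {logratio_b:.4f}")
```

    In this toy case a test applied directly to the lymphocyte proportion would flag a difference that exists only because the proportions must sum to one, whereas the lymphocyte-to-monocyte log-ratio is identical in both groups.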






  • Article type: Journal Article
    The relationships among bacterial flora, diseases, and diet have been described by many authors. Operational taxonomic units (OTUs) are the result of clustering 16S rRNA gene sequences at a certain cutoff value, and they are considered compositional data. Because Pearson's correlation coefficient is difficult to interpret for such data, Aitchison's ratio analysis was used to develop a method to handle compositional data. Multivariate analysis was developed because univariate analysis can be subject to large biases. In this study, simulations of absolute abundance under certain assumptions were conducted, together with analyses such as nonmetric multidimensional scaling (NMDS), principal component analysis (PCA), and ratio analysis. The same content as a 100% stacked bar graph could be expressed in low dimensions using PCA. However, the relative diversity was not reproducible with NMDS. Various assumptions were made regarding absolute abundance based on the relative abundance, but it could not be determined which assumptions are true. In summary, ratio analysis and PCA are useful for analyzing compositional data and the gut microbiota.
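
    As a hedged illustration of the log-ratio approach mentioned above, the sketch below applies the centred log-ratio (clr) transform, one common transform in Aitchison's framework. It is not necessarily the procedure used in the study, and the OTU counts are hypothetical.

```python
import math

def clr(parts):
    """Centred log-ratio transform of a strictly positive composition."""
    logs = [math.log(p) for p in parts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Hypothetical OTU read counts for one sample (illustrative values only).
counts = [120.0, 30.0, 50.0]
relative = [c / sum(counts) for c in counts]

clr_counts = clr(counts)
clr_relative = clr(relative)

# Both vectors agree: the transform depends only on ratios between parts,
# so it gives the same result for raw counts and relative abundances.
print([round(v, 4) for v in clr_counts])
print([round(v, 4) for v in clr_relative])
```

    Because the transform takes logarithms, zero counts must be handled first (for example with a pseudocount, which is itself an assumption). PCA is then commonly applied to the clr-transformed values rather than to the raw proportions.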






  • Article type: Journal Article
    Data from bipolar psychometric scales are widely used in psychological healthcare. Adequate psychological profiling benefits patients and saves time and costs. Grant funding depends on the quality of psychotherapeutic measures. Bipolar Likert scales yield compositional data because any order of magnitude of agreement with an item assertion implies a complementary order of magnitude of disagreement. Using an isometric log-ratio (ilr) transformation, the bivariate information can be mapped to the real-valued interval scale, yielding unbiased statistical results and increasing the statistical power of the Pearson correlation significance test if the Central Limit Theorem (CLT) is satisfied. In practice, however, the applicability of the CLT depends on the number of summands (i.e., the number of items) and the variance of the data generating process (DGP) of the ilr-transformed data. Via simulation we provide evidence that the ilr approach also works satisfactorily if the CLT is violated. That is, the ilr approach is robust to extremely large or infinite variances of the underlying DGP, increasing the statistical power of the correlation test. The study generalizes former results, pointing out the universality and reliability of the ilr approach in psychometric big data analysis, which affects psychometric health economics, patient welfare, grant funding, economic decision making and profits.
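
    For a two-part composition (agreement, disagreement) the ilr transformation reduces to a single coordinate, z = ln(p / (1 - p)) / sqrt(2). The sketch below shows this under stated assumptions: the mapping from a Likert response to an agreement share (`likert_to_agreement`) is hypothetical and is not taken from the study.

```python
import math

def likert_to_agreement(response, n_points):
    """Hypothetical mapping of a 1..n_points response to an agreement share in (0, 1)."""
    if not 1 <= response <= n_points:
        raise ValueError("response outside the scale")
    return (response - 0.5) / n_points

def ilr_two_part(agreement):
    """ilr coordinate of the two-part composition (agreement, 1 - agreement)."""
    return math.log(agreement / (1.0 - agreement)) / math.sqrt(2.0)

def ilr_inverse(z):
    """Map an ilr coordinate back to the agreement share."""
    return 1.0 / (1.0 + math.exp(-math.sqrt(2.0) * z))

# ilr scores for every response on an assumed 5-point bipolar scale.
scores = [ilr_two_part(likert_to_agreement(r, 5)) for r in range(1, 6)]
print([round(s, 3) for s in scores])
```

    The resulting scores are unbounded, symmetric around the scale midpoint, and can be averaged across items before a correlation test is applied; the midpoint offset of 0.5 keeps the shares strictly inside (0, 1) so that the logarithm is defined.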






  • Article type: Journal Article
    The microbiome represents a hidden world of tiny organisms populating not only our surroundings but also our own bodies. By enabling comprehensive profiling of these invisible creatures, modern genomic sequencing tools have given us an unprecedented ability to characterize these populations and uncover their outsize impact on our environment and health. Statistical analysis of microbiome data is critical to infer patterns from the observed abundances. The application and development of analytical methods in this area require careful consideration of the unique aspects of microbiome profiles. We begin this review with a brief overview of microbiome data collection and processing and describe the resulting data structure. We then provide an overview of statistical methods for key tasks in microbiome data analysis, including data visualization, comparison of microbial abundance across groups, regression modeling, and network inference. We conclude with a discussion and highlight interesting future directions.






  • Article type: Journal Article
    Deterministic variables are variables that are functionally determined by one or more parent variables. They commonly arise when a variable has been functionally created from one or more parent variables, as with derived variables, and in compositional data, where the 'whole' variable is determined from its 'parts'. This article introduces how deterministic variables may be depicted within directed acyclic graphs (DAGs) to help with identifying and interpreting causal effects involving derived variables and/or compositional data. We propose a two-step approach in which all variables are initially considered, and a choice is made whether to focus on the deterministic variable or its determining parents. Depicting deterministic variables within DAGs brings several benefits. It is easier to identify and avoid misinterpreting tautological associations, i.e., self-fulfilling associations between deterministic variables and their parents, or between sibling variables with shared parents. In compositional data, it is easier to understand the consequences of conditioning on the 'whole' variable, and correctly identify total and relative causal effects. For derived variables, it encourages greater consideration of the target estimand and greater scrutiny of the consistency and exchangeability assumptions. DAGs with deterministic variables are a useful aid for planning and interpreting analyses involving derived variables and/or compositional data.
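
    The tautological association and the effect of conditioning on the 'whole' can be sketched with a small simulation. This is an illustration under assumed conditions (two independent standard-normal parts, a fixed seed, and a narrow band of 'whole' values standing in for conditioning), not the authors' analysis.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

rng = random.Random(0)  # fixed seed so the sketch is reproducible
n = 5000
part_a = [rng.gauss(0, 1) for _ in range(n)]
part_b = [rng.gauss(0, 1) for _ in range(n)]      # independent of part_a
whole = [a + b for a, b in zip(part_a, part_b)]   # deterministic child

r_parts = pearson(part_a, part_b)       # near 0: the parts are unrelated
r_tautological = pearson(part_a, whole) # near 1/sqrt(2): self-fulfilling

# Crude conditioning on the whole: keep only samples with whole close to 0.
band = [(a, b) for a, b, w in zip(part_a, part_b, whole) if abs(w) < 0.1]
r_conditioned = pearson([a for a, _ in band], [b for _, b in band])

print(f"parts: {r_parts:.3f}")
print(f"part vs whole: {r_tautological:.3f}")
print(f"parts given whole: {r_conditioned:.3f}")
```

    Although the two parts are generated independently, each part correlates with the whole by construction, and holding the whole approximately fixed makes the parts appear strongly negatively related.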






  • Article type: Journal Article
    Taxonomic marker gene analysis allows uncovering taxonomic profiles of microbial communities at low cost, making it omnipresent in microbiome research. There is an ever-expanding set of tools to extract further biological information from this kind of data. In this perspective, we enunciate several concerns regarding the biological validity of predicting functional potential from taxonomic profiles, especially when they are generated by short-read sequencing. The taxonomic resolution of marker genes, intragenomic variability of marker genes, and the compositional nature of microbiome data are discussed. Combining actual measurements of microbiome functions with predicted functional potentials is proposed as a powerful approach to better understand microbiome functioning. In this context, the significance of predicted functional potentials for generating and testing hypotheses is highlighted. We argue that functions of microbiomes predicted from microbiome DNA read count data generated by short-read amplicon sequencing should not serve as the only basis to draw biological inferences.






  • Article type: Journal Article
    Urban areas are characterized by a constant anthropogenic input, which is manifested in the chemical composition of the surface layer of urban soil. The consequence is the formation of intense anomalies of chemical elements, including lead (Pb), that are atypical for this landscape. Therefore, this study aims to explore the compositional-geochemical characteristics of soil Pb anomalies in the urban areas of Yerevan, Gyumri, and Vanadzor, and to identify the geochemical associations of Pb that emerge under prevalent anthropogenic influences in these urban areas. The results obtained through the combined use of compositional data analysis and geospatial mapping showed that the investigated Pb anomalies in different cities form source-specific geochemical associations influenced by historical and ongoing activities, as well as the natural geochemical behavior of chemical elements occurring in these areas. Specifically, in Yerevan, Pb was closely linked with Cu and Zn, forming a group of persistent anthropogenic tracers of urban areas. In contrast, in Gyumri and Vanadzor, Pb was linked with Ca, suggesting that over decades, complexation of Pb by Ca carbonates occurred. These patterns of compositional-geochemical characteristics of Pb anomalies are directly linked to the socio-economic development of cities and the various emission sources present in their environments during different periods. The human health risk assessment showed that children face a Pb-induced non-carcinogenic risk with a certainty of 63.59% in Yerevan and 50% in both Gyumri and Vanadzor.






  • Article type: Journal Article
    Soil contamination in outdoor shooting ranges (OSRs) is a major threat to human health, particularly when, after activities end, the land is used for recreational areas or agricultural production. The land degradation status of an OSR in southern Italy was assessed using a multisensor approach. It was based on: i) proximal sensors, including electromagnetic induction (EMI) for measuring soil electrical conductivity (ECa) and magnetic susceptibility (MSa), γ-ray spectrometry for K, eU and eTh analyses, and ultrasonic penetrometry for cone index (CI) data representative of soil strength; ii) field surveys of soil thickness (ST); and iii) laboratory analyses of potentially toxic elements (PTEs) by portable X-ray fluorescence spectrometry and of polycyclic aromatic hydrocarbons (PAHs) by gas chromatography. Spatial variability of the measurements was modelled and mapped using geostatistical methods. The most densely measured covariate (i.e., the ECa of the topsoil) was used within kriging with external drift to improve the PTE predictions. The PTE maps were complemented by maps of spatial uncertainty. A robust multivariate principal component analysis (rPCA) was applied to the proximal sensor and laboratory data. It identified associations of PAHs, lead and CI with the topsoil ECa along the first component (PC1), highlighting the correlation between anthropogenic effects on the land and EMI measures. The association of the ST (estimating the depth of underground travertine hard layers) with the bottom soil ECa and MSa along the second component (PC2) showed the influence of soil stratigraphy on the EMI measures. This study demonstrates that the simultaneous use of different proximal sensors combined with laboratory analysis makes it possible to assess and model the spatial variability of the land degradation status of an OSR, including soil compaction and organic and inorganic contamination. The correlation between EMI data and PTE content highlights the potential of this technique in the field of soil contamination.





