Data integration

  • Article type: Journal Article
    Constructing gene regulatory networks is a widely adopted approach for investigating gene regulation, with diverse applications in biology and medicine. A great deal of research focuses on using time series data or single-cell RNA-sequencing data to infer gene regulatory networks. However, such gene expression data lack either cellular or temporal information. Fortunately, the advent of time-lapse confocal laser microscopy enables biologists to obtain tree-shaped gene expression data of Caenorhabditis elegans, achieving both cellular and temporal resolution. Although such tree-shaped data provide abundant knowledge, they pose challenges such as non-pairwise time series, which lead to inaccurate downstream analysis. To address this issue, a comprehensive framework for data integration and a novel Bayesian approach based on a Boolean network with time delay are proposed. A pre-screening process and a Markov chain Monte Carlo algorithm are applied to obtain the parameter estimates. Simulation studies show that our method outperforms existing Boolean network inference algorithms. Leveraging the proposed approach, gene regulatory networks for five subtrees are reconstructed from real tree-shaped datasets of Caenorhabditis elegans, and some gene regulatory relationships confirmed in previous genetic studies are recovered. In addition, heterogeneity of regulatory relationships across cell lineage subtrees is detected. Furthermore, potential gene regulatory relationships that bear importance in human diseases are explored. All source code is available at the GitHub repository https://github.com/edawu11/BBTD.git.
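
    To make the idea of a Boolean network with time delay concrete, the following is a minimal forward-simulation sketch. It is not the Bayesian inference method of the abstract above and does not reproduce anything from the BBTD repository; the gene names `A`, `B`, `C`, the delays, and the `simulate` helper are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict, List, Tuple

# A rule is (boolean function, [(regulator gene, delay in time steps), ...]).
# All names here are hypothetical; delays are assumed to be >= 1.
Rule = Tuple[Callable[..., bool], List[Tuple[str, int]]]


def simulate(history: List[Dict[str, bool]],
             rules: Dict[str, Rule],
             steps: int) -> List[Dict[str, bool]]:
    """Extend `history` by `steps` time points using delayed Boolean rules."""
    states = [dict(s) for s in history]
    max_delay = max(d for _, inputs in rules.values() for _, d in inputs)
    if len(states) < max_delay:
        raise ValueError("history must cover the longest delay")
    for _ in range(steps):
        t = len(states)                  # index of the state being computed
        new_state = dict(states[-1])     # genes without a rule keep their value
        for gene, (func, inputs) in rules.items():
            # Each regulator is read at time t - delay, not at time t - 1.
            new_state[gene] = func(*(states[t - d][reg] for reg, d in inputs))
        states.append(new_state)
    return states


# Gene A switches on at t = 1; B copies A with delay 2; C copies B with delay 1.
history = [
    {"A": False, "B": False, "C": False},
    {"A": True, "B": False, "C": False},
]
rules = {
    "B": (lambda a: a, [("A", 2)]),
    "C": (lambda b: b, [("B", 1)]),
}
trajectory = simulate(history, rules, steps=3)
```

    In this toy trajectory, B turns on two steps after A and C one step after B, which is the kind of lagged dependence a delay-aware model is meant to capture.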

  • Article type: Journal Article
    The double burden of diseases and scarce resources in developing countries highlights the need to change the conceptualization of health problems and translational research. In contrast to the traditional paradigm focused on genetics, the exposome paradigm, proposed in 2005 as a complement to the genome, is an innovative theory. It involves a holistic approach to understanding the complexity of the interactions between a person’s environment throughout life and their health. This paper outlines a scalable framework for exposome research, integrating diverse data sources for comprehensive public health surveillance and policy support. The Chilean exposome-based system for ecosystems (CHiESS) project proposes a conceptual model based on the ecological and One Health approaches, and the development of a dynamic technological platform for exposome research, which leverages available administrative data routinely collected by national agencies, in clinical records, and by biobanks. CHiESS considers multilevel exposure for exposome operationalization, including the ecosystem, community, population, and individual levels. CHiESS will include four consecutive stages for development into an informatics platform: (1) environmental data integration and harmonization system, (2) clinical and omics data integration, (3) advanced analytical algorithm development, and (4) visualization interface development and targeted population-based cohort recruitment. The CHiESS platform aims to integrate and harmonize available secondary administrative data and provide a complete geospatial mapping of the external exposome. Additionally, it aims to analyze complex interactions between environmental stressors of the ecosystem and molecular processes of the human being and their effect on human health. Moreover, by identifying exposome-based hotspots, CHiESS allows the targeted and efficient recruitment of population-based cohorts for translational research and impact evaluation. Utilizing advanced technologies such as Artificial Intelligence (AI), Internet of Things (IoT), and blockchain, this framework enhances data security, real-time monitoring, and predictive analytics. The CHiESS model is adaptable for international use, promoting global health collaboration and supporting sustainable development goals.

  • Article type: Journal Article
    Genomic medicine has transformed the lives of patients with cancer by enabling individualised and evidence-based clinical decision-making. Despite this progress, the implementation of precision cancer medicine is limited by its dependence on isolated biomarkers. The development of bulk and single-cell multiomic technologies has revealed the enormous complexity of the cancer ecosystem. Beyond the cancer cell, the tumour microenvironment, macroenvironment and host factors, including the microbiome, profoundly influence the cancer phenotype, and accounting for these enhances the resolution of precision medicine. The advent of robust multiomic profiling and interpretable machine learning algorithms marks the dawn of a new postgenomic era of personalised cancer medicine. In Precision Cancer Medicine 2.0, high-resolution personalised clinical decision-making is informed by the comprehensive multiomic profiling of tumour and host, integrated using artificial intelligence.

  • Article type: Journal Article
    We consider the setting where (1) an internal study builds a linear regression model for prediction based on individual-level data, (2) some external studies have fitted similar linear regression models that use only subsets of the covariates and provide coefficient estimates for the reduced models without individual-level data, and (3) there is heterogeneity across these study populations. The goal is to integrate the external model summary information into fitting the internal model to improve prediction accuracy. We adapt the James-Stein shrinkage method to propose estimators that are no worse and are oftentimes better in the prediction mean squared error after information integration, regardless of the degree of study population heterogeneity. We conduct comprehensive simulation studies to investigate the numerical performance of the proposed estimators. We also apply the method to enhance a prediction model for patella bone lead level in terms of blood lead level and other covariates by integrating summary information from published literature.
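
    As a point of reference for the shrinkage idea, here is a minimal sketch of the classical positive-part James-Stein estimator shrinking one coefficient vector toward a fixed target. It is not the estimator proposed in the abstract above, which adapts the method to reduced external models; the function name `james_stein_toward`, the assumption of a known common variance `sigma2`, and the example numbers are illustrative assumptions.

```python
import numpy as np


def james_stein_toward(internal, target, sigma2):
    """Positive-part James-Stein shrinkage of `internal` toward `target`.

    Assumes internal ~ N(theta, sigma2 * I) with known sigma2 and dimension
    p >= 3 (the textbook setting, not the setting of the paper).
    """
    internal = np.asarray(internal, dtype=float)
    target = np.asarray(target, dtype=float)
    p = internal.size
    if p < 3:
        raise ValueError("James-Stein shrinkage needs at least 3 dimensions")
    diff = internal - target
    dist2 = float(diff @ diff)
    if dist2 == 0.0:
        return target.copy()
    # Weight on the internal estimate: near 1 when the two estimates are far
    # apart, and 0 when they are close relative to the noise level.
    weight = max(0.0, 1.0 - (p - 2) * sigma2 / dist2)
    return target + weight * diff


external = np.zeros(4)  # hypothetical external summary
moderate = james_stein_toward([2.0, 0.0, 0.0, 0.0], external, sigma2=1.0)
distant = james_stein_toward([100.0, 0.0, 0.0, 0.0], external, sigma2=1.0)
```

    The data-driven weight is what gives the intuition behind the abstract's claim: when the external information disagrees strongly with the internal estimate, as under large population heterogeneity, the estimator stays close to the internal fit.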

  • Article type: Journal Article
    OBJECTIVE: To describe the use of privacy preserving linkage methods operationally in Australia, and to present insights and key learnings from their implementation.
    METHODS: Privacy preserving record linkage (PPRL) utilising Bloom filters provides a unique practical mechanism that allows linkage to occur without the release of personally identifiable information (PII), while still ensuring high accuracy.
    RESULTS: The methodology has received wide uptake within Australia, with four state linkage units now having privacy preserving capability. It has enabled access to general practice and private pathology data, among others, both much sought-after datasets previously inaccessible for linkage.
    CONCLUSIONS: The Australian experience suggests privacy preserving linkage is a practical solution for improving data access for policy, planning and population health research. It is hoped interest in this methodology internationally continues to grow.
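
    The Bloom filter mechanism mentioned under METHODS can be sketched as follows: each party encodes the character bigrams of an identifier into bit positions with a keyed hash, and only those positions are compared, here with the Dice coefficient. This is a generic illustration of the technique, not the software used by the Australian linkage units; the filter size, number of hash functions, secret key, and function names are assumptions.

```python
import hashlib
import hmac
from typing import Set


def bigrams(value: str) -> Set[str]:
    """Character bigrams of a normalised, padded string."""
    padded = f"_{value.strip().lower()}_"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def bloom_encode(value: str, secret: bytes,
                 size: int = 1024, num_hashes: int = 4) -> Set[int]:
    """Return the set bit positions of a Bloom filter encoding `value`.

    The secret key is shared by the data custodians but not by the linkage
    unit, so the unit sees bit patterns rather than names.
    """
    positions = set()
    for gram in bigrams(value):
        for k in range(num_hashes):
            message = bytes([k]) + gram.encode("utf-8")
            digest = hmac.new(secret, message, hashlib.sha256).digest()
            positions.add(int.from_bytes(digest[:8], "big") % size)
    return positions


def dice(a: Set[int], b: Set[int]) -> float:
    """Dice similarity between two sets of bit positions."""
    if not a and not b:
        return 1.0
    return 2 * len(a & b) / (len(a) + len(b))


key = b"shared-secret"  # hypothetical key agreed between custodians
catherine = bloom_encode("Catherine", key)
similar = dice(catherine, bloom_encode("Katherine", key))
different = dice(catherine, bloom_encode("Robert", key))
```

    Because similar strings share most of their bigrams, spelling variants still score highly while unrelated names do not, which is what allows approximate matching without exchanging the identifiers themselves.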

  • Article type: Journal Article
    Deciphering the intricate relationships between transcription factors (TFs), enhancers, and genes through the inference of enhancer-driven gene regulatory networks (eGRNs) is crucial in understanding gene regulatory programs in a complex biological system. This study introduces STREAM, a novel method that leverages a Steiner forest problem model, a hybrid biclustering pipeline, and submodular optimization to infer eGRNs from jointly profiled single-cell transcriptome and chromatin accessibility data. Compared to existing methods, STREAM demonstrates enhanced performance in terms of TF recovery, TF-enhancer linkage prediction, and enhancer-gene relation discovery. Application of STREAM to an Alzheimer's disease dataset and a diffuse small lymphocytic lymphoma dataset reveals its ability to identify TF-enhancer-gene relations associated with pseudotime, as well as key TF-enhancer-gene relations and TF cooperation underlying tumor cells.

  • Article type: Journal Article
    Objective: This article aims to describe the implementation of a new health information technology system called Health Connect that is harmonizing cancer data in the Canadian province of Newfoundland and Labrador; explain high-level technical details of this technology; provide concrete examples of how this technology is helping to improve cancer care in the province; and discuss its future expansion and implications. Methods: We give a technical description of the Health Connect architecture, explain how it integrates numerous data sources into a single, scalable health information system for cancer data, and highlight its artificial intelligence and analytics capacity. Results: We illustrate two practical achievements of Health Connect: first, an analytical dashboard that was used to pinpoint variations in colon cancer screening uptake in small defined geographic regions of the province; and second, a natural language processing algorithm that provided AI-assisted decision support in interpreting appropriate follow-up action based on assessments of breast mammography reports. Conclusion: Health Connect is a cutting-edge health systems solution for harmonizing cancer screening data for practical decision-making. The long-term goal is to integrate all cancer care data holdings into Health Connect to build a comprehensive health information system for cancer care in the province.

  • Article type: Journal Article
    Alzheimer's disease (AD) is affecting a growing number of individuals. As a result, there is a pressing need for accurate and early diagnosis methods. This study aims to achieve this goal by developing an optimal data analysis strategy to enhance computational diagnosis. Although various modalities of AD diagnostic data are collected, past research on computational methods of AD diagnosis has mainly focused on using single-modal inputs. We hypothesize that integrating, or "fusing," various data modalities as inputs to prediction models could enhance diagnostic accuracy by offering a more comprehensive view of an individual's health profile. However, a potential challenge arises as this fusion of multiple modalities may result in significantly higher dimensional data. We hypothesize that employing suitable dimensionality reduction methods across heterogeneous modalities would not only help diagnosis models extract latent information but also enhance accuracy. Therefore, it is imperative to identify optimal strategies for both data fusion and dimensionality reduction. In this paper, we have conducted a comprehensive comparison of over 80 statistical machine learning methods, considering various classifiers, dimensionality reduction techniques, and data fusion strategies to assess our hypotheses. Specifically, we have explored three primary strategies: (1) Simple data fusion, which involves straightforward concatenation (fusion) of datasets before inputting them into a classifier; (2) Early data fusion, in which datasets are concatenated first, and then a dimensionality reduction technique is applied before feeding the resulting data into a classifier; and (3) Intermediate data fusion, in which dimensionality reduction methods are applied individually to each dataset before concatenating them to construct a classifier. For dimensionality reduction, we have explored several commonly used techniques such as principal component analysis (PCA), autoencoder (AE), and LASSO. Additionally, we have implemented a new dimensionality-reduction method called the supervised encoder (SE), which involves slight modifications to standard deep neural networks. Our results show that SE substantially improves prediction accuracy compared to PCA, AE, and LASSO, especially in combination with intermediate fusion for multiclass diagnosis prediction.
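
    The three fusion strategies differ only in where the dimensionality reduction step sits relative to concatenation, which a short sketch can make explicit. The code below uses PCA via singular value decomposition as the reducer and random matrices as stand-ins for two modalities; the modality names, shapes, and component counts are hypothetical, and the supervised encoder and the classifiers from the abstract are not reproduced here.

```python
import numpy as np


def pca_reduce(X: np.ndarray, k: int) -> np.ndarray:
    """Project the rows of X onto their first k principal components."""
    Xc = X - X.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:k].T


def simple_fusion(blocks):
    """(1) Concatenate modalities and pass them on unchanged."""
    return np.hstack(blocks)


def early_fusion(blocks, k):
    """(2) Concatenate first, then reduce the joint feature matrix."""
    return pca_reduce(np.hstack(blocks), k)


def intermediate_fusion(blocks, k_each):
    """(3) Reduce each modality separately, then concatenate."""
    return np.hstack([pca_reduce(block, k_each) for block in blocks])


rng = np.random.default_rng(0)
imaging = rng.normal(size=(50, 30))    # hypothetical modality, 50 subjects
genetics = rng.normal(size=(50, 100))  # hypothetical modality, same subjects
blocks = [imaging, genetics]

X_simple = simple_fusion(blocks)
X_early = early_fusion(blocks, k=10)
X_intermediate = intermediate_fusion(blocks, k_each=5)
```

    Each resulting matrix would then be passed to a classifier. In a real evaluation the reduction should be fitted on training data only and then applied to held-out data, a step omitted here for brevity.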

  • Article type: Journal Article
    CONCLUSIONS: One of the first steps in single-cell omics data analysis is visualization, which allows researchers to see how well separated cell types are from each other. When visualizing multiple datasets at once, data integration/batch correction methods are used to merge the datasets. While needed for downstream analyses, these methods modify the feature space (e.g., gene expression) or PCA space in order to mix cell types between batches as well as possible. This obscures sample-specific features and breaks down local embedding structures that can be seen when a sample is embedded alone. Therefore, to improve visual comparisons between large numbers of samples (e.g., multiple patients, omic modalities, different time points), we introduce Compound-SNE, which performs what we term a soft alignment of samples in embedding space. We show that Compound-SNE is able to align cell types in embedding space across samples while preserving the local embedding structures obtained when samples are embedded independently.
    METHODS: Python code for Compound-SNE is available for download at https://github.com/HaghverdiLab/Compound-SNE.
    BACKGROUND: Available online. Provides algorithmic details and additional tests.

  • Article type: Journal Article
    BACKGROUND: Integrative analysis of spatially resolved transcriptomics datasets empowers a deeper understanding of complex biological systems. However, integrating multiple tissue sections presents challenges for batch effect removal, particularly when the sections are measured by various technologies or collected at different times.
    RESULTS: We propose spatiAlign, an unsupervised contrastive learning model that employs the expression of all measured genes and the spatial location of cells to integrate multiple tissue sections. It enables the joint downstream analysis of multiple datasets not only in low-dimensional embeddings but also in the reconstructed full expression space.
    CONCLUSIONS: In benchmarking analysis, spatiAlign outperforms state-of-the-art methods in learning joint and discriminative representations for tissue sections, each potentially characterized by complex batch effects or distinct biological characteristics. Furthermore, we demonstrate the benefits of spatiAlign for the integrative analysis of time-series brain sections, including spatial clustering, differential expression analysis, and particularly trajectory inference that requires a corrected gene expression matrix.