Data integration

  • Article type: Editorial
    No abstract available.

  • Article type: Journal Article
    Constructing gene regulatory networks is a widely adopted approach for investigating gene regulation, with diverse applications in biology and medicine. Much research focuses on inferring gene regulatory networks from time series data or single-cell RNA-sequencing data. However, such gene expression data lack either cellular or temporal information. Fortunately, the advent of time-lapse confocal laser microscopy enables biologists to obtain tree-shaped gene expression data of Caenorhabditis elegans with both cellular and temporal resolution. Although such tree-shaped data provide abundant knowledge, they pose challenges such as non-pairwise time series, which lead to inaccuracy in downstream analysis. To address this issue, a comprehensive framework for data integration and a novel Bayesian approach based on a Boolean network with time delay are proposed. A pre-screening process and a Markov chain Monte Carlo algorithm are applied to obtain the parameter estimates. Simulation studies show that our method outperforms existing Boolean network inference algorithms. Using the proposed approach, gene regulatory networks for five subtrees are reconstructed from real tree-shaped datasets of Caenorhabditis elegans, recovering some gene regulatory relationships confirmed in previous genetic studies. Heterogeneity of regulatory relationships across cell lineage subtrees is also detected, and potential gene regulatory relationships of importance in human diseases are explored. All source code is available at the GitHub repository https://github.com/edawu11/BBTD.git.
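    To make the phrase "Boolean network with time delay" concrete, the following is a minimal sketch of the forward model only: each gene's on/off state at time t is a Boolean function of its regulators' states at earlier times. The gene names, rules, delays, and the `simulate` helper are all hypothetical illustrations; this is not the BBTD code, and it does not attempt the Bayesian inference, pre-screening, or MCMC steps described in the abstract.

```python
from typing import Callable, Dict, List, Tuple

# A rule maps a target gene to (inputs, function), where inputs is a list of
# (regulator, delay) pairs and function combines the delayed regulator states.
Rule = Tuple[List[Tuple[str, int]], Callable[..., bool]]


def simulate(rules: Dict[str, Rule],
             history: Dict[str, List[bool]],
             steps: int) -> Dict[str, List[bool]]:
    """Extend each gene's Boolean trajectory by `steps` time points."""
    traj = {gene: list(states) for gene, states in history.items()}
    max_delay = max((d for inputs, _ in rules.values() for _, d in inputs), default=0)
    if any(len(states) < max_delay for states in traj.values()):
        raise ValueError("history must cover at least the largest delay")
    for _ in range(steps):
        t = len(next(iter(traj.values())))  # index of the time point being computed
        new_states = {}
        for gene, (inputs, fn) in rules.items():
            delayed = [traj[reg][t - delay] for reg, delay in inputs]
            new_states[gene] = fn(*delayed)
        for gene in traj:
            # Genes without a rule simply keep their previous state.
            traj[gene].append(new_states.get(gene, traj[gene][-1]))
    return traj


# Hypothetical three-gene example: B copies A with delay 1;
# C is activated by A (delay 2) and repressed by B (delay 1).
rules: Dict[str, Rule] = {
    "B": ([("A", 1)], lambda a: a),
    "C": ([("A", 2), ("B", 1)], lambda a, b: a and not b),
}
history = {"A": [True, True], "B": [False, False], "C": [False, False]}
trajectory = simulate(rules, history, steps=3)
print(trajectory["C"])
```

    In this toy run, C switches on for one time point and is then shut off once the delayed repression from B arrives, which is the kind of transient pattern a delay-free Boolean network cannot express.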

  • Article type: Journal Article
    The double burden of disease and scarce resources in developing countries highlights the need to change the conceptualization of health problems and translational research. In contrast to the traditional paradigm focused on genetics, the exposome paradigm, proposed in 2005 as a complement to the genome, is an innovative theory. It involves a holistic approach to understanding the complexity of the interactions between a person's environment throughout life and their health. This paper outlines a scalable framework for exposome research, integrating diverse data sources for comprehensive public health surveillance and policy support. The Chilean exposome-based system for ecosystems (CHiESS) project proposes a conceptual model based on the ecological and One Health approaches, and the development of a dynamic technological platform for exposome research, which leverages available administrative data routinely collected by national agencies, in clinical records, and by biobanks. CHiESS considers multilevel exposure for exposome operationalization, including the ecosystem, community, population, and individual levels. CHiESS will include four consecutive stages for development into an informatics platform: (1) environmental data integration and harmonization system, (2) clinical and omics data integration, (3) advanced analytical algorithm development, and (4) visualization interface development and targeted population-based cohort recruitment. The CHiESS platform aims to integrate and harmonize available secondary administrative data and provide a complete geospatial mapping of the external exposome. Additionally, it aims to analyze complex interactions between environmental stressors of the ecosystem and human molecular processes, and their effect on human health. Moreover, by identifying exposome-based hotspots, CHiESS allows the targeted and efficient recruitment of population-based cohorts for translational research and impact evaluation. Utilizing advanced technologies such as Artificial Intelligence (AI), Internet of Things (IoT), and blockchain, this framework enhances data security, real-time monitoring, and predictive analytics. The CHiESS model is adaptable for international use, promoting global health collaboration and supporting sustainable development goals.

  • Article type: Journal Article
    We consider the setting where (1) an internal study builds a linear regression model for prediction based on individual-level data, (2) some external studies have fitted similar linear regression models that use only subsets of the covariates and provide coefficient estimates for the reduced models without individual-level data, and (3) there is heterogeneity across these study populations. The goal is to integrate the external model summary information into fitting the internal model to improve prediction accuracy. We adapt the James-Stein shrinkage method to propose estimators that are no worse and are oftentimes better in the prediction mean squared error after information integration, regardless of the degree of study population heterogeneity. We conduct comprehensive simulation studies to investigate the numerical performance of the proposed estimators. We also apply the method to enhance a prediction model for patella bone lead level in terms of blood lead level and other covariates by integrating summary information from published literature.
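    As background for the shrinkage idea mentioned above, here is a minimal sketch of the textbook positive-part James-Stein estimator, which pulls a vector of noisy estimates toward a fixed target vector. It assumes independent estimates with a known common variance `sigma2` and at least three coordinates; the function name and the example numbers are hypothetical. The estimators proposed in the paper adapt this idea to regression coefficients and external summary information, which this sketch does not reproduce.

```python
from typing import List, Sequence


def james_stein_toward(x: Sequence[float],
                       target: Sequence[float],
                       sigma2: float) -> List[float]:
    """Positive-part James-Stein shrinkage of x toward target.

    Assumes each x[i] is an independent estimate with known variance sigma2.
    """
    p = len(x)
    if p != len(target):
        raise ValueError("x and target must have the same length")
    if p < 3:
        raise ValueError("James-Stein shrinkage needs at least 3 coordinates")
    diff = [xi - ti for xi, ti in zip(x, target)]
    squared_distance = sum(d * d for d in diff)
    if squared_distance == 0.0:
        return list(x)  # x already equals the target
    # Shrinkage factor in [0, 1]: 1 keeps x unchanged, 0 returns the target.
    factor = max(0.0, 1.0 - (p - 2) * sigma2 / squared_distance)
    return [ti + factor * d for ti, d in zip(target, diff)]


# Hypothetical numbers: internal estimates x, external-style target values.
internal = [2.0, 0.0, 1.0, 3.0]
external = [1.0, 1.0, 1.0, 1.0]
shrunk = james_stein_toward(internal, external, sigma2=1.5)
print(shrunk)  # [1.5, 0.5, 1.0, 2.0]
```

    The factor adapts to the data: when the internal estimates sit far from the target relative to their noise, little shrinkage is applied, which matches the intuition of borrowing less from external sources when the populations look heterogeneous.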

  • Article type: Editorial
    No abstract available.

  • Article type: Journal Article
    Deciphering the intricate relationships between transcription factors (TFs), enhancers, and genes through the inference of enhancer-driven gene regulatory networks (eGRNs) is crucial in understanding gene regulatory programs in a complex biological system. This study introduces STREAM, a novel method that leverages a Steiner forest problem model, a hybrid biclustering pipeline, and submodular optimization to infer eGRNs from jointly profiled single-cell transcriptome and chromatin accessibility data. Compared to existing methods, STREAM demonstrates enhanced performance in terms of TF recovery, TF-enhancer linkage prediction, and enhancer-gene relation discovery. Application of STREAM to an Alzheimer's disease dataset and a diffuse small lymphocytic lymphoma dataset reveals its ability to identify TF-enhancer-gene relations associated with pseudotime, as well as key TF-enhancer-gene relations and TF cooperation underlying tumor cells.

  • Article type: Journal Article
    Alzheimer's disease (AD) is affecting a growing number of individuals. As a result, there is a pressing need for accurate and early diagnosis methods. This study aims to achieve this goal by developing an optimal data analysis strategy to enhance computational diagnosis. Although various modalities of AD diagnostic data are collected, past research on computational methods of AD diagnosis has mainly focused on using single-modal inputs. We hypothesize that integrating, or "fusing," various data modalities as inputs to prediction models could enhance diagnostic accuracy by offering a more comprehensive view of an individual's health profile. However, a potential challenge arises as this fusion of multiple modalities may result in significantly higher dimensional data. We hypothesize that employing suitable dimensionality reduction methods across heterogeneous modalities would not only help diagnosis models extract latent information but also enhance accuracy. Therefore, it is imperative to identify optimal strategies for both data fusion and dimensionality reduction. In this paper, we have conducted a comprehensive comparison of over 80 statistical machine learning methods, considering various classifiers, dimensionality reduction techniques, and data fusion strategies to assess our hypotheses. Specifically, we have explored three primary strategies: (1) Simple data fusion, which involves straightforward concatenation (fusion) of datasets before inputting them into a classifier; (2) Early data fusion, in which datasets are concatenated first, and then a dimensionality reduction technique is applied before feeding the resulting data into a classifier; and (3) Intermediate data fusion, in which dimensionality reduction methods are applied individually to each dataset before concatenating them to construct a classifier. For dimensionality reduction, we have explored several commonly used techniques such as principal component analysis (PCA), autoencoder (AE), and LASSO. Additionally, we have implemented a new dimensionality-reduction method called the supervised encoder (SE), which involves slight modifications to standard deep neural networks. Our results show that SE substantially improves prediction accuracy compared to PCA, AE, and LASSO, especially in combination with intermediate fusion for multiclass diagnosis prediction.
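    The three fusion strategies differ only in where the dimensionality reduction step sits relative to concatenation, which the sketch below tries to make explicit. As a deliberately simple stand-in for PCA, AE, LASSO, or SE, it keeps the k highest-variance columns; that reducer, the function names, and the toy "imaging" and "genetics" matrices are assumptions for illustration, not the authors' pipeline. No classifier is fitted here.

```python
from statistics import pvariance
from typing import List

Matrix = List[List[float]]  # rows are samples, columns are features


def concat(blocks: List[Matrix]) -> Matrix:
    """Column-wise concatenation of modality blocks that share the same samples."""
    n_samples = len(blocks[0])
    return [sum((list(block[i]) for block in blocks), []) for i in range(n_samples)]


def reduce_top_variance(x: Matrix, k: int) -> Matrix:
    """Stand-in reducer: keep the k columns with the largest variance."""
    n_features = len(x[0])
    variances = [pvariance([row[j] for row in x]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])  # preserve original column order
    return [[row[j] for j in keep] for row in x]


def simple_fusion(blocks: List[Matrix]) -> Matrix:
    return concat(blocks)  # (1) concatenate, no reduction


def early_fusion(blocks: List[Matrix], k: int) -> Matrix:
    return reduce_top_variance(concat(blocks), k)  # (2) concatenate, then reduce


def intermediate_fusion(blocks: List[Matrix], k_each: int) -> Matrix:
    # (3) reduce each modality separately, then concatenate
    return concat([reduce_top_variance(block, k_each) for block in blocks])


# Toy data: three samples measured in two hypothetical modalities.
imaging = [[1.0, 10.0], [2.0, 20.0], [9.0, 30.0]]
genetics = [[0.0, 5.0, 1.0], [1.0, 5.0, 3.0], [0.0, 5.0, 5.0]]
blocks = [imaging, genetics]

print(simple_fusion(blocks)[0])
print(early_fusion(blocks, k=2))
print(intermediate_fusion(blocks, k_each=1))
```

    In this toy case, early fusion keeps two imaging columns and drops genetics entirely because the imaging values have larger variance, while intermediate fusion retains one column from each modality. That contrast is specific to this stand-in reducer and data, but it illustrates why the ordering of reduction and concatenation can matter.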

  • Article type: Journal Article
    SUMMARY: One of the first steps in single-cell omics data analysis is visualization, which allows researchers to see how well separated cell types are from each other. When visualizing multiple datasets at once, data integration/batch correction methods are used to merge the datasets. While needed for downstream analyses, these methods modify the feature space (e.g., gene expression) or PCA space in order to mix cell types between batches as well as possible. This obscures sample-specific features and breaks down local embedding structures that can be seen when a sample is embedded alone. Therefore, in order to improve visual comparisons between large numbers of samples (e.g., multiple patients, omic modalities, different time points), we introduce Compound-SNE, which performs what we term a soft alignment of samples in embedding space. We show that Compound-SNE is able to align cell types in embedding space across samples, while preserving local embedding structures from when samples are embedded independently.
    AVAILABILITY AND IMPLEMENTATION: Python code for Compound-SNE is available for download at https://github.com/HaghverdiLab/Compound-SNE.
    SUPPLEMENTARY INFORMATION: Available online. Provides algorithmic details and additional tests.

  • Article type: Journal Article
    BACKGROUND: Integrative analysis of spatially resolved transcriptomics datasets empowers a deeper understanding of complex biological systems. However, integrating multiple tissue sections presents challenges for batch effect removal, particularly when the sections are measured by various technologies or collected at different times.
    RESULTS: We propose spatiAlign, an unsupervised contrastive learning model that employs the expression of all measured genes and the spatial location of cells to integrate multiple tissue sections. It enables the joint downstream analysis of multiple datasets not only in low-dimensional embeddings but also in the reconstructed full expression space.
    CONCLUSIONS: In benchmarking analysis, spatiAlign outperforms state-of-the-art methods in learning joint and discriminative representations for tissue sections, each potentially characterized by complex batch effects or distinct biological characteristics. Furthermore, we demonstrate the benefits of spatiAlign for the integrative analysis of time-series brain sections, including spatial clustering, differential expression analysis, and particularly trajectory inference that requires a corrected gene expression matrix.

  • Article type: Journal Article
    With technological innovations, enterprises in the real world are managing every iota of data, as it can be mined to derive business intelligence (BI). However, when data comes from multiple sources, it may contain duplicate records. Because data is given paramount importance, eliminating duplicate entities is also important for data integration, performance, and resource optimization. To realize reliable systems for record deduplication, deep learning has lately offered promising options through a learning-based approach. Deep ER is one of the deep learning-based methods used recently for eliminating duplicates in structured data. Using it as a reference model, in this paper we propose a framework known as Enhanced Deep Learning-based Record Deduplication (EDL-RD) for improving performance further. To this end, we exploit a variant of Long Short-Term Memory (LSTM) along with various attribute compositions, similarity metrics, and numerical and null value resolution. We propose an algorithm known as Efficient Learning based Record Deduplication (ELbRD). The algorithm extends the reference model with the aforementioned enhancements. An empirical study shows that the proposed framework with extensions outperforms existing methods.
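    The ingredients named above (attribute-level similarity metrics, numerical value handling, and null value resolution) can be illustrated without any learning component. The sketch below compares two records attribute by attribute, skips attributes that are null on either side, and flags a pair as duplicate when the mean similarity passes a fixed threshold. The record fields, the 0.85 threshold, and all function names are hypothetical, and the fixed threshold is a plain substitute for the LSTM-based model in EDL-RD, not a reproduction of it.

```python
from difflib import SequenceMatcher
from typing import Dict, Optional, Union

Value = Optional[Union[str, int, float]]
Record = Dict[str, Value]


def attribute_similarity(a: Value, b: Value) -> Optional[float]:
    """Similarity in [0, 1], or None when either value is null."""
    if a is None or b is None:
        return None  # null resolution: the attribute is not comparable
    if isinstance(a, (int, float)) and isinstance(b, (int, float)):
        scale = max(abs(a), abs(b))
        if scale == 0:
            return 1.0  # both values are zero
        return max(0.0, 1.0 - abs(a - b) / scale)  # relative numeric closeness
    # Case-insensitive string similarity from the standard library.
    return SequenceMatcher(None, str(a).lower(), str(b).lower()).ratio()


def record_similarity(r1: Record, r2: Record) -> float:
    """Mean similarity over attributes that are comparable in both records."""
    scores = []
    for key in r1.keys() & r2.keys():
        score = attribute_similarity(r1[key], r2[key])
        if score is not None:
            scores.append(score)
    return sum(scores) / len(scores) if scores else 0.0


def is_duplicate(r1: Record, r2: Record, threshold: float = 0.85) -> bool:
    return record_similarity(r1, r2) >= threshold


# Hypothetical records from two sources.
rec_a: Record = {"name": "Acme Corp.", "city": "Boston", "employees": 120}
rec_b: Record = {"name": "ACME Corp", "city": None, "employees": 120}
rec_c: Record = {"name": "Zenith Ltd", "city": "Austin", "employees": 15}

print(is_duplicate(rec_a, rec_b))
print(is_duplicate(rec_a, rec_c))
```

    A learned model would replace the unweighted mean and fixed threshold with a trained scoring function, but the per-attribute similarity vector computed here is the kind of input such a model could consume.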