Data integration

  • Article type: Journal Article
    Retrieval and visualization of biological data are essential for understanding complex systems. With the increasing volume of data generated by high-throughput sequencing technologies, effective and optimized data visualization tools have become indispensable. This is particularly relevant in the post-pandemic COVID-19 period, where understanding the diversity and interactions of microbial communities (i.e., viral and bacterial) is an important asset for developing and planning suitable interventions. In this chapter, we show the usage and potential of the ExTaxsI (Exploring Taxonomy Information) tool to retrieve viral biodiversity data stored in National Center for Biotechnology Information (NCBI) databases and create the related visualizations. In addition, by integrating different functions and modules, the tool generates relevant types of plots that facilitate the exploration of microbial biodiversity communities, useful for delving into ecological and taxonomic relationships among species and identifying potentially significant targets. Using the Monkeypox virus as a case study, this work points out significant perspectives on biological data visualization that can be used to gain insights into the ecology, evolution, and pathogenesis of viruses. Accordingly, we show the potential of ExTaxsI to organize and describe the available/downloaded data in an easy, simple, and interpretable way, allowing the user to interact dynamically with the plots through specific filter, zoom, and explore functions.
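    ExTaxsI builds on programmatic access to NCBI's Entrez services. The chapter does not reproduce its internals here, so the following is only a minimal sketch of that kind of retrieval using Biopython's Entrez module; the search term, retmax, and printed fields are illustrative rather than ExTaxsI's actual defaults.

    ```python
    from Bio import Entrez

    Entrez.email = "user@example.org"  # NCBI requires a contact address

    # Search the Nucleotide database for Monkeypox virus records
    # (term and retmax are illustrative)
    handle = Entrez.esearch(db="nucleotide",
                            term="Monkeypox virus[Organism]", retmax=20)
    ids = Entrez.read(handle)["IdList"]
    handle.close()

    # Fetch record summaries (accession, title) for downstream plotting
    handle = Entrez.esummary(db="nucleotide", id=",".join(ids))
    for s in Entrez.read(handle):
        print(s["Caption"], "-", s["Title"][:60])
    handle.close()
    ```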

  • Article type: Journal Article
    Glioblastoma (GBM) is the most aggressive and common malignant primary brain tumor; however, treatment remains a significant challenge. This study aims to identify drug repurposing or repositioning candidates for GBM by developing an integrative rare disease profile network containing heterogeneous types of biomedical data.
    We developed a Glioblastoma-based Biomedical Profile Network (GBPN) by extracting and integrating biomedical information pertinent to GBM-related diseases from the NCATS GARD Knowledge Graph (NGKG). We further clustered the GBPN based on modularity classes which resulted in multiple focused subgraphs, named mc_GBPN. We then identified high-influence nodes by performing network analysis over the mc_GBPN and validated those nodes that could be potential drug repurposing or repositioning candidates for GBM.
    We developed the GBPN with 1,466 nodes and 107,423 edges and, consequently, the mc_GBPN with 41 modularity classes. The ten most influential nodes were identified from the mc_GBPN. These notably include riluzole, stem cell therapy, cannabidiol, and VK-0214, each with supporting evidence for treating GBM.
    Our GBM-targeted network analysis allowed us to effectively identify potential candidates for drug repurposing or repositioning. Further validation will be conducted by using other different types of biomedical and clinical data and biological experiments. The findings could lead to less invasive treatments for glioblastoma while significantly reducing research costs by shortening the drug development timeline. Furthermore, this workflow can be extended to other disease areas.
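    The abstract reports modularity-based clustering and influence ranking without naming the specific algorithms; the following is a minimal sketch of that workflow with networkx, using a toy graph, and eigenvector centrality as an assumed influence measure (the paper's measure may differ).

    ```python
    import networkx as nx
    from networkx.algorithms.community import greedy_modularity_communities

    # Toy stand-in for the GBPN (the real graph has 1,466 nodes and
    # 107,423 edges extracted from the NCATS GARD Knowledge Graph)
    G = nx.karate_club_graph()

    # Partition into modularity classes (the paper reports 41 for the GBPN)
    classes = greedy_modularity_communities(G)
    print(f"{len(classes)} modularity classes")

    # Rank nodes by centrality to shortlist high-influence candidates
    centrality = nx.eigenvector_centrality(G, max_iter=1000)
    top10 = sorted(centrality, key=centrality.get, reverse=True)[:10]
    print("10 most influential nodes:", top10)
    ```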

  • Article type: Journal Article
    The pharmaceutical industry continuously looks for ways to improve its development and manufacturing efficiency. In recent years, such efforts have been driven by the transition from batch to continuous manufacturing and by digitalization in process development. To facilitate this transition, integrated data management and informatics tools need to be developed and implemented within the framework of Industry 4.0 technology. In this regard, this work aims to guide the development of data integration for continuous pharmaceutical manufacturing processes under the Industry 4.0 framework, improving digital maturity and enabling the development of digital twins. This paper demonstrates two instances where a data integration framework has been successfully employed in academic continuous pharmaceutical manufacturing pilot plants. Details of the integration structure and information flows are comprehensively showcased. Approaches to mitigate concerns in incorporating complex data streams, including integrating multiple process analytical technology tools and legacy equipment, connecting cloud data and simulation models, and safeguarding cyber-physical security, are discussed. Critical challenges and opportunities arising from practical considerations are highlighted.
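    The paper's integration structure is plant-specific and not reproduced here. As a hedged illustration of one pattern it describes, normalizing heterogeneous streams (PAT tools, legacy equipment, simulation models) into a single time-stamped data model, here is a sketch with hypothetical source tags, variables, and units.

    ```python
    from dataclasses import dataclass, asdict
    from datetime import datetime, timezone
    import json

    @dataclass
    class ProcessRecord:
        """One time-stamped observation in a unified plant data model."""
        source: str     # e.g., a PAT tool, legacy PLC, or simulation model
        tag: str        # variable name within that source
        value: float
        unit: str
        timestamp: str  # ISO 8601 in UTC, so streams can be aligned later

    def normalize(source: str, tag: str, value: float, unit: str) -> ProcessRecord:
        """Wrap a raw reading from any source in the shared record format."""
        return ProcessRecord(source, tag, value, unit,
                             datetime.now(timezone.utc).isoformat())

    # Readings from heterogeneous sources land in one queryable stream
    records = [
        normalize("nir_probe_01", "api_concentration", 4.93, "mg/mL"),
        normalize("legacy_plc_3", "feeder_speed", 12.0, "rpm"),
        normalize("twin_sim", "predicted_blend_uniformity", 0.97, "fraction"),
    ]
    print(json.dumps([asdict(r) for r in records], indent=2))
    ```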

  • Article type: Journal Article
    Integration of data from multiple domains can greatly enhance the quality and applicability of knowledge generated in analysis workflows. However, working with health data is challenging, requiring careful preparation in order to support meaningful interpretation and robust results. Ontologies encapsulate relationships between variables that can enrich the semantic content of health datasets to enhance interpretability and inform downstream analyses.
    We developed an R package for electronic health data preparation, "eHDPrep," demonstrated on a multimodal colorectal cancer dataset (661 patients, 155 variables; Colo-661); a further demonstrator is taken from The Cancer Genome Atlas (459 patients, 94 variables; TCGA-COAD). eHDPrep offers user-friendly methods for quality control, including internal consistency checking and redundancy removal with information-theoretic variable merging. Semantic enrichment functionality is provided, enabling generation of new informative "meta-variables" according to ontological common ancestry between variables, demonstrated with SNOMED CT and the Gene Ontology in the current study. eHDPrep also facilitates numerical encoding, variable extraction from free text, completeness analysis, and user review of modifications to the dataset.
    eHDPrep provides effective tools to assess and enhance data quality, laying the foundation for robust performance and interpretability in downstream analyses. Application to a multimodal colorectal cancer dataset resulted in improved data quality, structuring, and robust encoding, as well as enhanced semantic information. We make eHDPrep available as an R package from CRAN (https://cran.r-project.org/package=eHDPrep) and GitHub (https://github.com/overton-group/eHDPrep).
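    eHDPrep itself is an R package (see the CRAN and GitHub links above); as a language-neutral illustration of two of the checks it automates, completeness analysis and internal consistency checking, here is a pandas sketch with hypothetical clinical variables, not eHDPrep's actual API.

    ```python
    import pandas as pd

    # Toy stand-in for a clinical table (variable names are hypothetical)
    df = pd.DataFrame({
        "patient_id":   [1, 2, 3, 4],
        "tumour_stage": ["II", "III", None, "IV"],
        "metastasis":   ["no", "yes", "no", "no"],
    })

    # Completeness analysis: fraction of missing values per variable
    print(df.isna().mean())

    # Internal consistency check: stage IV should imply metastasis
    bad = df[(df["tumour_stage"] == "IV") & (df["metastasis"] == "no")]
    print(f"{len(bad)} internally inconsistent row(s)")
    ```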

  • Article type: Journal Article
    The increasing sophistication of mobile and sensing technology has enabled the collection of intensive longitudinal data (ILD) concerning dynamic changes in an individual's state and context. ILD can be used to develop dynamic theories of behavior change which, in turn, can provide a conceptual framework for the development of just-in-time adaptive interventions (JITAIs) that leverage advances in mobile and sensing technology to determine when and how to intervene. As such, JITAIs hold tremendous potential in addressing major public health concerns such as cigarette smoking, which can recur and arise unexpectedly. In tandem, a growing number of studies have utilized multiple methods to collect data on a particular dynamic construct of interest from the same individual. This approach holds promise in providing investigators with a significantly more detailed view of how behavior change processes unfold within the same individual than ever before. However, it introduces nuanced challenges relating to coarse data, noisy data, and incoherence among data sources. In this manuscript, we use a mobile health (mHealth) study on smokers motivated to quit (Break Free; R01MD010362) to illustrate these challenges. Practical approaches to integrate multiple data sources are discussed within the greater scientific context of developing dynamic theories of behavior change and JITAIs.
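    One concrete instance of the incoherence challenge is aligning sparse self-reports with dense sensor streams sampled on different clocks. The paper discusses such approaches conceptually; the following is a minimal sketch of one alignment strategy using pandas.merge_asof, with invented data.

    ```python
    import pandas as pd

    # Two streams about one participant, sampled on different clocks:
    # sparse self-reported craving (EMA) and dense wearable heart rate.
    # All values here are invented for illustration.
    ema = pd.DataFrame({
        "time": pd.to_datetime(["2024-01-01 09:02", "2024-01-01 13:15"]),
        "craving": [3, 7],
    })
    hr = pd.DataFrame({
        "time": pd.date_range("2024-01-01 09:00", periods=6, freq="h"),
        "heart_rate": [68, 75, 71, 90, 86, 73],
    })

    # Attach to each self-report the nearest preceding sensor reading,
    # tolerating up to 30 minutes of mismatch between the two sources
    merged = pd.merge_asof(ema.sort_values("time"), hr.sort_values("time"),
                           on="time", tolerance=pd.Timedelta("30min"))
    print(merged)
    ```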

  • Article type: Journal Article
    BACKGROUND: The amount of data in health care is rapidly rising, leading to multiple datasets generated for any given individual. Data integration involves mapping variables in different datasets together to form a combined dataset which can then be used to conduct different types of analyses. However, with increasing numbers of variables, manual mapping of a dataset can become inefficient. Another approach is to use text classification through machine learning to classify the variables to a schema.
    OBJECTIVE: Our aim was to create and evaluate the use of machine learning methods for the integration of data from datasets across health information-seeking behavior (HISB) databases.
    METHODS: Four online databases relevant to the research field were selected for integration. Two experiments were designed for dataset mapping: intra-database mapping using a single data source, and inter-database mapping to map datasets between the four databases. We compared logistic regression (LR), random forest classifier (RFC), and neural network (NN) models by F1 score for the two methods of integration. A third experiment was an ablation study that used all the available data to create a model for classifying HISB variables in a dataset.
    RESULTS: In intra-database mapping, the mean F1 score of the LR classifier (0.787) was better than those of the RFC (0.767) and the fully connected NN (0.735). In inter-database mapping, LR again scored best (0.245); however, this depended on which database was used as the training source. Using all the databases, the three models correctly classified 90%-91% of the variables. Removing one dataset improved scores and resulted in a model able to correctly classify 95%-96% of the HISB variables.
    CONCLUSIONS: As part of data integration, a neural network can be used as an approach to map the variables of a dataset. The developed models can be used to classify the HISB terms in a database.
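    The paper's exact features and hyperparameters are not given here, so the following is a minimal sketch of the intra-database setup, an LR classifier over variable text scored by macro F1, with hypothetical variable descriptions and schema classes.

    ```python
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline

    # Hypothetical variable descriptions and their schema classes
    variables = [
        "how often do you look for health information online",
        "trust in doctors as an information source",
        "frequency of internet health searches",
        "confidence in physician advice",
        "used social media for health questions",
        "asked a nurse about symptoms",
    ]
    schema = ["seeking", "trust", "seeking", "trust", "seeking", "trust"]

    # TF-IDF features of the variable text feed a logistic regression
    model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
    scores = cross_val_score(model, variables, schema, cv=3, scoring="f1_macro")
    print("mean macro F1:", scores.mean().round(3))
    ```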

  • Article type: Journal Article
    Knowledge of demography is essential for understanding wildlife population dynamics and developing appropriate conservation plans. However, population survey and demographic data (e.g., capture-recapture) are not always aligned in space and time, hindering our ability to robustly estimate population size and demographic processes. Integrated population models (IPMs) can provide inference for population dynamics with poorly aligned but jointly analysed population and demographic data. In this study, we used an IPM to analyse partially aligned population and demographic data of a migratory shorebird species, the snowy plover (Charadrius nivosus). Snowy plover populations have declined dramatically during the last two decades, yet the demographic mechanisms and environmental drivers of these declines remain poorly understood, hindering the development of appropriate conservation strategies. We analysed 21 years (1998-2018) of partially aligned population survey, nest survey, and capture-recapture-resight data from three snowy plover populations (in Texas, New Mexico, and Oklahoma) in the Southern Great Plains of the US. By using IPMs, we aimed to achieve better precision while evaluating the effects of wetland habitat and climatic factors (minimum temperature, wind speed) on snowy plover demography. Our IPM provided reasonable precision for productivity measures even with missing data, but population and survival estimates had greater uncertainty in years without corresponding data. Our model also uncovered the complex relationships between wetland habitat, climate, and demography with reasonable precision. Wetland habitat had positive effects on snowy plover productivity (i.e., clutch size and clutch fate), indicating the importance of protecting wetland habitat, under climate change and other human stressors, for the conservation of this species. We also found a positive effect of minimum temperature on snowy plover productivity, suggesting potential benefits of warmer nights for the population. Based on our results, we suggest prioritizing population and capture-recapture surveys for understanding population dynamics and underlying demographic processes when data collection is limited by time and/or financial resources. Our modelling approach can be used to allocate limited conservation resources for evidence-based decision-making.
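    The defining feature of an IPM is a joint likelihood in which count, capture-recapture-resight, and productivity data share demographic parameters. The standard factorization, which assumes independence among the data sets (the paper's exact model structure may differ), can be written as:

    ```latex
    % Joint likelihood of an integrated population model: count data,
    % capture-recapture-resight data, and nest-survey data are analysed
    % together, sharing the demographic parameters \theta.
    L_{\mathrm{IPM}}(\theta \mid y) =
      L_{\mathrm{count}}(\theta \mid y_{\mathrm{count}}) \,
      L_{\mathrm{CRR}}(\theta \mid y_{\mathrm{cap}}) \,
      L_{\mathrm{prod}}(\theta \mid y_{\mathrm{nest}})
    ```

    Sharing parameters across component likelihoods is what lets years with missing data in one stream borrow strength from the others, which is why the authors' productivity estimates stayed precise despite gaps.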

  • Article type: Journal Article
    Motivation: Drug-induced liver injury (DILI) is one of the primary problems in drug development. Early prediction of DILI, based on the chemical properties of substances and experiments performed on cell lines, would bring a significant reduction in the cost of clinical trials and faster development of drugs. The current study aims to build predictive models of DILI risk for chemical compounds using multiple sources of information. Methods: Using several supervised machine learning algorithms, we built predictive models for several alternative splits of compounds between the DILI and non-DILI classes. To this end, we used the chemical properties of the given compounds, their effects on gene expression levels in six human cell lines treated with them, and their toxicological profiles. First, we identified the most informative variables in all data sets. Then, these variables were used to build machine learning models. Finally, composite models were built with the Super Learner approach. All modeling was performed using multiple repeats of cross-validation for unbiased and precise estimates of performance. Results: With one exception, the gene expression profiles of human cell lines were non-informative and resulted in random models. Toxicological reports were not useful for the prediction of DILI. The best results were obtained for models discerning between harmless compounds and those for which any level of DILI was observed (AUC = 0.75). These models were built with the Random Forest algorithm using molecular descriptors.
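    A minimal sketch of the best-performing configuration as described, Random Forest over molecular descriptors evaluated with repeated cross-validated AUC. The data here are random stand-ins, so the score will sit near chance rather than the reported 0.75.

    ```python
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 50))    # stand-in for molecular descriptors
    y = rng.integers(0, 2, size=200)  # stand-in DILI / non-DILI labels

    # Repeated cross-validation, as in the study, for unbiased and
    # precise performance estimates; on these random stand-ins the AUC
    # hovers around 0.5, whereas the paper reports 0.75 on real data
    cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=0)
    auc = cross_val_score(RandomForestClassifier(random_state=0), X, y,
                          cv=cv, scoring="roc_auc")
    print(f"AUC = {auc.mean():.2f} +/- {auc.std():.2f}")
    ```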

  • Article type: Journal Article
    Biomedical information mining is increasingly recognized as a promising technique to accelerate drug discovery and development. In particular, integrative approaches that mine data from several (open) data sources have become more attractive with the increasing possibilities to programmatically access data through Application Programming Interfaces (APIs). The use of open data in conjunction with free, platform-independent analytic tools provides the additional advantage of flexibility, re-usability, and transparency. Here, we present a strategy for performing ligand-based in silico drug repurposing with the analytics platform KNIME. We demonstrate the usefulness of the developed workflow on the basis of two different use cases: a rare disease (here: Glucose Transporter Type 1 (GLUT-1) deficiency), and a new disease (here: COVID-19). The workflow includes a targeted download of data through web services, data curation, detection of enriched structural patterns, as well as substructure searches in DrugBank and a recently deposited data set of antiviral drugs provided by Chemical Abstracts Service. Developed workflows, tutorials with detailed step-by-step instructions, and the information gained by the analysis of data for GLUT-1 deficiency syndrome and COVID-19 are made freely available to the scientific community. The provided framework can be reused by researchers for other in silico drug repurposing projects, and it should serve as a valuable teaching resource for conveying integrative data mining strategies.
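    The published workflows run in KNIME; as a scripted equivalent of one step, a substructure search for an enriched pattern, here is an RDKit sketch in which the SMARTS pattern and candidate molecules are illustrative, not taken from the paper.

    ```python
    from rdkit import Chem

    # An "enriched structural pattern" expressed as SMARTS; the pattern
    # and the candidate molecules below are invented for illustration
    pattern = Chem.MolFromSmarts("c1ccccc1C(=O)N")  # aryl amide motif

    candidates = {
        "benzamide": "O=C(N)c1ccccc1",      # ring-C(=O)-N: matches
        "acetanilide": "CC(=O)Nc1ccccc1",   # N sits on the ring: no match
        "ethanol": "CCO",                   # no match
    }
    for name, smiles in candidates.items():
        mol = Chem.MolFromSmiles(smiles)
        if mol.HasSubstructMatch(pattern):
            print(name, "contains the enriched substructure")
    ```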

  • Article type: Journal Article
    Advances in technology within biomedical sciences have led to an inundation of data across many fields, raising new challenges in how best to integrate and analyze these resources. For example, rapid chemical screening programs like the US Environmental Protection Agency's ToxCast and the collaborative effort, Tox21, have produced massive amounts of information on putative chemical mechanisms where assay targets are identified as genes; however, systematically linking these hypothesized mechanisms with in vivo toxicity endpoints like disease outcomes remains problematic. Herein we present a novel use of normalized pointwise mutual information (NPMI) to mine biomedical literature for gene associations with biological concepts as represented by Medical Subject Headings (MeSH terms) in PubMed. Resources that tag genes to articles were integrated, then cross-species orthologs were identified using UniRef50 clusters. MeSH term frequency was normalized to reflect the MeSH tree structure, and then the resulting GeneID-MeSH associations were ranked using NPMI. The resulting network, called Entity MeSH Co-occurrence Network (EMCON), is a scalable resource for the identification and ranking of genes for a given topic of interest. The utility of EMCON was evaluated with the use case of breast carcinogenesis. Topics relevant to breast carcinogenesis were used to query EMCON and retrieve genes important to each topic. A breast cancer gene set was compiled through expert literature review (ELR) to assess performance of the search results. We found that the results from EMCON ranked the breast cancer genes from ELR higher than randomly selected genes with a recall of 0.98. Precision of the top five genes for selected topics was calculated as 0.87. This work demonstrates that EMCON can be used to link in vitro results to possible biological outcomes, thus aiding in generation of testable hypotheses for furthering understanding of biological function and the contribution of chemical exposures to disease.
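    NPMI rescales pointwise mutual information by the co-occurrence self-information, so scores are bounded and comparable across gene-MeSH pairs. A minimal sketch of the computation, with illustrative counts:

    ```python
    import math

    def npmi(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
        """Normalized PMI of a gene-MeSH pair from co-occurrence counts
        over n_total articles: -1 (never together), 0 (independent),
        +1 (always together)."""
        p_xy = n_xy / n_total
        pmi = math.log(p_xy / ((n_x / n_total) * (n_y / n_total)))
        return pmi / -math.log(p_xy)

    # Illustrative counts: a gene tagged in 120 articles, a MeSH term in
    # 300, co-occurring in 90, out of a 10,000-article corpus
    print(round(npmi(90, 120, 300, 10_000), 3))  # ~0.683
    ```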