Keywords: bioinformatics; colorectal cancer; data integration; health data; medical informatics; ontology; quality assessment; quality control; semantic enrichment

MeSH: Humans; Semantics; Gene Ontology; Data Accuracy; Quality Control; Colorectal Neoplasms / genetics

Source: DOI:10.1093/gigascience/giad030

Abstract:
Integration of data from multiple domains can greatly enhance the quality and applicability of knowledge generated in analysis workflows. However, working with health data is challenging, requiring careful preparation in order to support meaningful interpretation and robust results. Ontologies encapsulate relationships between variables that can enrich the semantic content of health datasets to enhance interpretability and inform downstream analyses.
We developed an R package for electronic health data preparation, "eHDPrep," demonstrated upon a multimodal colorectal cancer dataset (661 patients, 155 variables; Colo-661); a further demonstrator is taken from The Cancer Genome Atlas (459 patients, 94 variables; TCGA-COAD). eHDPrep offers user-friendly methods for quality control, including internal consistency checking and redundancy removal with information-theoretic variable merging. Semantic enrichment functionality is provided, enabling generation of new informative "meta-variables" according to ontological common ancestry between variables, demonstrated with SNOMED CT and the Gene Ontology in the current study. eHDPrep also facilitates numerical encoding, variable extraction from free text, completeness analysis, and user review of modifications to the dataset.
eHDPrep provides effective tools to assess and enhance data quality, laying the foundation for robust performance and interpretability in downstream analyses. Application to multimodal colorectal cancer datasets resulted in improved data quality, structuring, and robust encoding, as well as enhanced semantic information. We make eHDPrep available as an R package from CRAN (https://cran.r-project.org/package=eHDPrep) and GitHub (https://github.com/overton-group/eHDPrep).
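The idea of deriving "meta-variables" from ontological common ancestry can be illustrated with a minimal sketch. It is written in Python rather than R and does not use or reproduce the eHDPrep API. The toy hierarchy, the variable names, the `MV_` prefix, and the choice to combine member variables with a logical OR are all assumptions made for illustration; none of them is taken from SNOMED CT, the Gene Ontology, or the package itself.

```python
from itertools import combinations

# Toy is-a hierarchy (child -> parent). Hypothetical terms, not SNOMED CT content.
PARENT = {
    "diabetes_type_1": "diabetes",
    "diabetes_type_2": "diabetes",
    "diabetes": "metabolic_disorder",
    "hypertension": "cardiovascular_disorder",
    "angina": "cardiovascular_disorder",
    "metabolic_disorder": "disorder",
    "cardiovascular_disorder": "disorder",
}


def ancestors(term):
    """Return the path from a term up to the root, nearest first, term included."""
    path = [term]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(a, b):
    """Return the first term on a's path to the root that also lies on b's path."""
    b_path = set(ancestors(b))
    for term in ancestors(a):
        if term in b_path:
            return term
    return None


def meta_variables(rows, variables):
    """Group variables by pairwise lowest common ancestor, then OR them per row.

    The OR aggregation is an assumption of this sketch, not the package's method.
    """
    groups = {}
    for a, b in combinations(variables, 2):
        lca = lowest_common_ancestor(a, b)
        if lca is not None:
            groups.setdefault(lca, set()).update((a, b))
    return [
        {f"MV_{lca}": int(any(row[v] for v in members))
         for lca, members in groups.items()}
        for row in rows
    ]


if __name__ == "__main__":
    variables = ["diabetes_type_1", "diabetes_type_2", "hypertension", "angina"]
    rows = [
        {"diabetes_type_1": 1, "diabetes_type_2": 0, "hypertension": 0, "angina": 0},
        {"diabetes_type_1": 0, "diabetes_type_2": 0, "hypertension": 0, "angina": 1},
    ]
    for row in meta_variables(rows, variables):
        print(row)
```

In this toy setting, two sparse sibling variables are summarised by a denser variable named after their shared ancestor, which is the general intuition behind enriching a dataset with ontological structure.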