Data curation

  • Article type: Journal Article
    In recent years, the role of artificial intelligence (AI) in medical imaging has become increasingly prominent: in 2023, the majority of AI applications approved by the FDA were in imaging and radiology. The surge in AI model development to tackle clinical challenges underscores the need for high-quality medical imaging data. Proper data preparation is crucial because it fosters standardized and reproducible AI models while minimizing bias. Data curation transforms raw data into a valuable, organized, and dependable resource and is fundamental to the success of machine learning and analytical projects. Given the plethora of tools available for the different stages of data curation, it is crucial to stay informed about the most relevant tools within specific research areas. In the current work, we propose a descriptive outline of the steps of data curation and, for each stage, provide a compilation of tools collected from a survey conducted among members of the Society for Imaging Informatics in Medicine (SIIM). This collection has the potential to enhance researchers' decision-making as they select the most appropriate tool for their specific tasks.

  • Article type: Journal Article
    Publicly available compound and bioactivity databases provide an essential basis for data-driven applications in life-science research and drug design. By analyzing several bioactivity repositories, we discovered differences in compound and target coverage, which argue for the combined use of data from multiple sources. Using data from ChEMBL, PubChem, IUPHAR/BPS, BindingDB, and Probes & Drugs, we assembled a consensus dataset focusing on small molecules with bioactivity on human macromolecular targets. This allowed improved coverage of compound space and targets, as well as automated comparison and curation of structural and bioactivity data to reveal potentially erroneous entries and increase confidence. The consensus dataset comprises more than 1.1 million compounds with over 10.9 million bioactivity data points, annotated with assay type and bioactivity confidence, and provides a useful ensemble for computational applications in drug design and chemogenomics.
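    The cross-source comparison described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the record layout, the identifiers, the activity values, and the `max_spread` threshold are all assumptions made for illustration. Records are grouped by compound-target pair, and a pair is flagged when the values reported by different sources disagree by more than the threshold.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical records: (source, compound_id, target_id, activity value).
# Identifiers and values are invented for illustration only.
records = [
    ("ChEMBL", "CPD1", "TGT1", 7.1),
    ("PubChem", "CPD1", "TGT1", 7.3),
    ("BindingDB", "CPD1", "TGT1", 4.0),  # deviates strongly from the others
    ("ChEMBL", "CPD2", "TGT1", 6.0),
    ("IUPHAR/BPS", "CPD3", "TGT2", 8.2),
    ("PubChem", "CPD3", "TGT2", 8.0),
]


def build_consensus(records, max_spread=1.0):
    """Group records per (compound, target) pair and flag disagreeing pairs.

    max_spread is an assumed tolerance, not a value taken from the abstract.
    """
    grouped = defaultdict(list)
    for source, compound, target, value in records:
        grouped[(compound, target)].append((source, value))

    consensus = {}
    for key, entries in grouped.items():
        values = [value for _, value in entries]
        sources = sorted({source for source, _ in entries})
        consensus[key] = {
            "sources": sources,
            "n_sources": len(sources),
            "mean_value": round(mean(values), 2),
            # A large spread between sources marks a potentially erroneous entry.
            "flagged": (max(values) - min(values)) > max_spread,
        }
    return consensus


consensus = build_consensus(records)
for pair, summary in sorted(consensus.items()):
    print(pair, summary)
```

    In this sketch, a pair reported by several sources with consistent values gains confidence, while a flagged pair would be a candidate for manual curation.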

  • Article type: Journal Article
    When deciding which genes to assess in larger Next-Generation Sequencing (NGS) datasets for the molecular genetic diagnosis of intellectual disability (ID), geneticists today have a variety of gene-phenotype databases and expert-curated gene lists available. To quantify their respective completeness, we compare an ID gene selection auto-generated from the Human Phenotype Ontology gene-phenotype association database with expert-curated ID gene lists from three reputable sources (sysID, the DDD consortium and Genomics England) and analyse some of their differences. We give examples of what we regard as genuine gaps ("missing ID genes") for each of these and conclude that a complementary or consensus approach is needed to maximise diagnostic yield in ID patients. We propose several consensus gene lists with ID-associated genes of different confidence levels.
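    One simple way to picture consensus gene lists with different confidence levels is to count how many sources support each gene. The Python sketch below is an assumption-laden illustration, not the method used in the paper: the gene symbols are placeholders, the list contents are invented, and `consensus_tiers` and `missing_from` are hypothetical helper names.

```python
from collections import Counter

# Hypothetical gene lists with placeholder symbols; the real lists differ.
gene_lists = {
    "HPO": {"GENE_A", "GENE_B", "GENE_C", "GENE_D"},
    "sysID": {"GENE_A", "GENE_B", "GENE_C"},
    "DDD": {"GENE_A", "GENE_B", "GENE_E"},
    "Genomics England": {"GENE_A", "GENE_C", "GENE_E"},
}


def consensus_tiers(gene_lists):
    """Map a support count (number of lists containing a gene) to its genes."""
    support = Counter(gene for genes in gene_lists.values() for gene in genes)
    tiers = {}
    for gene, count in support.items():
        tiers.setdefault(count, set()).add(gene)
    return tiers


def missing_from(gene_lists, source, min_support=2):
    """Genes absent from `source` but present in at least `min_support` other lists."""
    others = Counter(
        gene
        for name, genes in gene_lists.items()
        if name != source
        for gene in genes
    )
    return {
        gene
        for gene, count in others.items()
        if count >= min_support and gene not in gene_lists[source]
    }


tiers = consensus_tiers(gene_lists)
for count in sorted(tiers, reverse=True):
    print(count, sorted(tiers[count]))
print("candidate gaps in sysID:", sorted(missing_from(gene_lists, "sysID")))
```

    Under these assumptions, genes supported by all four lists form the highest-confidence tier, and a gene missing from one list but present in several others is a candidate gap worth manual review.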

  • Article type: Journal Article
    We present the outcome of an annotation effort targeting the content-sensitive segmentation of German clinical reports into sections. We recruited an annotation team of up to eight medical students to annotate a clinical text corpus sentence by sentence in four pre-annotation iterations and one final main annotation step. The annotation scheme we devised adheres to the categories for section headings developed for clinical documents in the HL7-CDA (Clinical Document Architecture) standard. Once the scheme became stable, we ran the main annotation campaign on the complete set of roughly 1,000 clinical documents. Because it relies on the CDA standard, the annotation scheme allows legacy and newly produced clinical documents to be integrated within a common pipeline. We then made direct use of the annotations by training a baseline classifier to automatically identify sections in clinical reports.

  • Article type: Journal Article
    Knowledge of the full target space of bioactive substances, approved and investigational drugs as well as chemical probes, provides important insights into therapeutic potential and possible adverse effects. The existing compound-target bioactivity data resources are often incomparable due to non-standardized and heterogeneous assay types and variability in endpoint measurements. To extract higher value from the existing and future compound target-profiling data, we implemented an open-data web platform, named Drug Target Commons (DTC), which features tools for crowd-sourced compound-target bioactivity data annotation, standardization, curation, and intra-resource integration. We demonstrate the unique value of DTC with several examples related to both drug discovery and drug repurposing applications and invite researchers to join this community effort to increase the reuse and extension of compound bioactivity data.

  • Article type: Journal Article
    The Consensus Coding Sequence (CCDS) project provides a dataset of protein-coding regions that are identically annotated on the human and mouse reference genome assembly in genome annotations produced independently by NCBI and the Ensembl group at EMBL-EBI. This dataset is the product of an international collaboration that includes NCBI, Ensembl, HUGO Gene Nomenclature Committee, Mouse Genome Informatics and University of California, Santa Cruz. Identically annotated coding regions, which are generated using an automated pipeline and pass multiple quality assurance checks, are assigned a stable and tracked identifier (CCDS ID). Additionally, coordinated manual review by expert curators from the CCDS collaboration helps in maintaining the integrity and high quality of the dataset. The CCDS data are available through an interactive web page (https://www.ncbi.nlm.nih.gov/CCDS/CcdsBrowse.cgi) and an FTP site (ftp://ftp.ncbi.nlm.nih.gov/pub/CCDS/). In this paper, we outline the ongoing work, growth and stability of the CCDS dataset and provide updates on new collaboration members and new features added to the CCDS user interface. We also present expert curation scenarios, with specific examples highlighting the importance of an accurate reference genome assembly and the crucial role played by input from the research community.

  • Article type: Consensus Development Conference
    Thyroid nodule detection has increased with widespread use of ultrasound, which is currently the main tool for detection, monitoring, diagnosis and, in some instances, treatment of thyroid nodules. Knowledge of ultrasound and adequate instruction on its use require a position statement by the scientific societies concerned. The working groups on thyroid cancer and ultrasound techniques of the Spanish Society of Endocrinology and Nutrition have promoted this document, based on a thorough analysis of the current literature, the results of multicenter studies and expert consensus, in order to set the requirements for the best use of ultrasound in clinical practice. The objectives include the adequate framework for use of thyroid ultrasound, the technical and legal requirements, the clinical situations in which it is recommended, the levels of knowledge and learning processes, the associated responsibility, and the establishment of a standardized reporting of results and integration into hospital information systems and endocrinology units.

  • Article type: Consensus Development Conference
    To truly achieve personalized medicine in oncology, it is critical to catalog and curate cancer sequence variants for their clinical relevance. The Somatic Working Group (WG) of the Clinical Genome Resource (ClinGen), in cooperation with ClinVar and multiple cancer variant curation stakeholders, has developed a consensus set of minimal variant level data (MVLD). MVLD is a framework of standardized data elements to curate cancer variants for clinical utility. With implementation of MVLD standards, and in a working partnership with ClinVar, we aim to streamline the somatic variant curation efforts in the community and reduce redundancy and time burden for the interpretation of cancer variants in clinical practice.
    We developed MVLD through a consensus approach by i) reviewing clinical actionability interpretations from institutions participating in the WG, ii) conducting an extensive literature search of clinical somatic interpretation schemas, and iii) surveying cancer variant web portals. A forthcoming guideline on cancer variant interpretation from the Association for Molecular Pathology (AMP) can be incorporated into MVLD.
    Along with harmonizing standardized terminology for allele interpretive and descriptive fields that are collected by many databases, the MVLD includes unique fields for cancer variants such as Biomarker Class, Therapeutic Context and Effect. In addition, MVLD includes recommendations for controlled semantics and ontologies. The Somatic WG is collaborating with ClinVar to evaluate MVLD use for somatic variant submissions. ClinVar is an open and centralized repository where sequencing laboratories can report summary-level variant data with clinical significance, and ClinVar accepts cancer variant data.
    We expect the use of the MVLD to streamline clinical interpretation of cancer variants, enhance interoperability among multiple redundant curation efforts, and increase submission of somatic variants to ClinVar, all of which will enhance translation to clinical oncology practice.

  • Article type: Journal Article
    No abstract available.

  • Article type: Journal Article
    Semi-automatic text analysis involves manual inspection of text. Often, different text annotations (like part-of-speech or named entities) are indicated by using distinctive text highlighting techniques. In typesetting there exist well-known formatting conventions, such as bold typeface, italics, or background coloring, that are useful for highlighting certain parts of a given text. Also, many advanced techniques for visualization and highlighting of text exist; yet, standard typesetting is common, and the effects of standard typesetting on the perception of text are not fully understood. As such, we surveyed and tested the effectiveness of common text highlighting techniques, both individually and in combination, to discover how to maximize pop-out effects while minimizing visual interference between techniques. To validate our findings, we conducted a series of crowdsourced experiments to determine: i) a ranking of nine commonly-used text highlighting techniques; ii) the degree of visual interference between pairs of text highlighting techniques; iii) the effectiveness of techniques for visual conjunctive search. Our results show that increasing font size works best as a single highlighting technique, and that there are significant visual interferences between some pairs of highlighting techniques. We discuss the pros and cons of different combinations as a design guideline to choose text highlighting techniques for text viewers.