Data curation - 医云文献

Literature (991 articles)

1 Survey on large language model annotation of cellular senescence from figures in review articles.

Impact index: not available
Published: Jun 2024 17
Journal: Genomics Inform; PMID: 38907285

DOI: 10.1186/s44342-024-00011-6
Article type: Journal Article

This study evaluated large language models (LLMs), particularly the GPT-4 with vision (GPT-4 V) and GPT-4 Turbo, for annotating biomedical figures, focusing on cellular senescence. We assessed the ability of LLMs to categorize and annotate complex biomedical images to enhance their accuracy and efficiency. Our experiments employed prompt engineering with figures from review articles, achieving more than 70% accuracy for label extraction and approximately 80% accuracy for node-type classification. Challenges were noted in the correct annotation of the relationship between directionality and inhibitory processes, which were exacerbated as the number of nodes increased. Using figure legends was a more precise identification of sources and targets than using captions, but sometimes lacked pathway details. This study underscores the potential of LLMs in decoding biological mechanisms from text and outlines avenues for improving inhibitory relationship representations in biomedical informatics.

2 Understanding the value of curation: A survey of US data repository curation practices and perceptions.

Impact index: 3.752
Published: 2024
Journal: PLoS One; PMID: 38875230

DOI: 10.1371/journal.pone.0301171
Article type: Journal Article

Data curators play an important role in assessing data quality and take actions that may ultimately lead to better, more valuable data products. This study explores the curation practices of data curators working within US-based data repositories. We performed a survey in January 2021 to benchmark the levels of curation performed by repositories and assess the perceived value and impact of curation on the data sharing process. Our analysis included 95 responses from 59 unique data repositories. Respondents primarily were professionals working within repositories and examined curation performed within a repository setting. A majority 72.6% of respondents reported that "data-level" curation was performed by their repository and around half reported their repository took steps to ensure interoperability and reproducibility of their repository's datasets. Curation actions most frequently reported include checking for duplicate files, reviewing documentation, reviewing metadata, minting persistent identifiers, and checking for corrupt/broken files. The most "value-add" curation action across generalist, institutional, and disciplinary repository respondents was related to reviewing and enhancing documentation. Respondents reported high perceived impact of curation by their repositories on specific data sharing outcomes including usability, findability, understandability, and accessibility of deposited datasets; respondents associated with disciplinary repositories tended to perceive higher impact on most outcomes. Most survey participants strongly agreed that data curation by the repository adds value to the data sharing process and that it outweighs the effort and cost. We found some differences between institutional and disciplinary repositories, both in the reported frequency of specific curation actions as well as the perceived impact of data curation. Interestingly, we also found variation in the perceptions of those working within the same repository regarding the level and frequency of curation actions performed, which exemplifies the complexity of a repository curation work. Our results suggest data curation may be better understood in terms of specific curation actions and outcomes than broadly defined curation levels and that more research is needed to understand the resource implications of performing these activities. We share these results to provide a more nuanced view of curation, and how curation impacts the broader data lifecycle and data sharing behaviors.

3 IPAD-DB: a manually curated database for experimentally verified inhibitors of proteins associated with Alzheimer's disease.

Impact index: 4.462
Published: Jun 2024 12
Journal: Database (Oxford); PMID: 38865432

DOI: 10.1093/database/baae048
Article type: Journal Article

Alzheimer's disease (AD) is a universal neurodegenerative disease with the feature of progressive dementia. Currently, there are only seven Food and Drug Administration-approved drugs for the treatment of AD, which merely offer temporary relief from symptom deterioration without reversing the underlying disease process. The identification of inhibitors capable of interacting with proteins associated with AD plays a pivotal role in the development of effective therapeutic interventions. However, a vast number of such inhibitors are dispersed throughout numerous published articles, rendering it inconvenient for researchers to explore potential drug candidates for AD. In light of this, we have manually compiled inhibitors targeting proteins associated with AD and constructed a comprehensive database known as IPAD-DB (Inhibitors of Proteins associated with Alzheimer's Disease Database). The curated inhibitors within this database encompass a diverse range of compounds, including natural compounds, synthetic compounds, drugs, natural extracts and nano-inhibitors. To date, the database has compiled >4800 entries, each representing a correspondent relationship between an inhibitor and its target protein. IPAD-DB offers a user-friendly interface that facilitates browsing, searching and downloading of its records. We firmly believe that IPAD-DB represents a valuable resource for screening potential AD drug candidates and investigating the underlying mechanisms of this debilitating disease. Access to IPAD-DB is freely available at http://www.lamee.cn/ipad-db/ and is compatible with all major web browsers. Database URL: http://www.lamee.cn/ipad-db/.

4 DUVEL: an active-learning annotated biomedical corpus for the recognition of oligogenic combinations.

Impact index: 4.462
Published: May 2024 28
Journal: Database (Oxford); PMID: 38805753

DOI: 10.1093/database/baae039
Article type: Journal Article

While biomedical relation extraction (bioRE) datasets have been instrumental in the development of methods to support biocuration of single variants from texts, no datasets are currently available for the extraction of digenic or even oligogenic variant relations, despite the reports in literature that epistatic effects between combinations of variants in different loci (or genes) are important to understand disease etiologies. This work presents the creation of a unique dataset of oligogenic variant combinations, geared to train tools to help in the curation of scientific literature. To overcome the hurdles associated with the number of unlabelled instances and the cost of expertise, active learning (AL) was used to optimize the annotation, thus getting assistance in finding the most informative subset of samples to label. By pre-annotating 85 full-text articles containing the relevant relations from the Oligogenic Diseases Database (OLIDA) with PubTator, text fragments featuring potential digenic variant combinations, i.e. gene-variant-gene-variant, were extracted. The resulting fragments of texts were annotated with ALAMBIC, an AL-based annotation platform. The resulting dataset, called DUVEL, is used to fine-tune four state-of-the-art biomedical language models: BiomedBERT, BiomedBERT-large, BioLinkBERT and BioM-BERT. More than 500 000 text fragments were considered for annotation, finally resulting in a dataset with 8442 fragments, 794 of them being positive instances, covering 95% of the original annotated articles. When applied to gene-variant pair detection, BiomedBERT-large achieves the highest F1 score (0.84) after fine-tuning, demonstrating significant improvement compared to the non-fine-tuned model, underlining the relevance of the DUVEL dataset. This study shows how AL may play an important role in the creation of bioRE dataset relevant for biomedical curation applications. DUVEL provides a unique biomedical corpus focusing on 4-ary relations between two genes and two variants. It is made freely available for research on GitHub and Hugging Face. Database URL: https://huggingface.co/datasets/cnachteg/duvel or https://doi.org/10.57967/hf/1571.
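
Editorial illustration: the DUVEL abstract says active learning was used to find the most informative samples to label, but it does not say which query strategy ALAMBIC applies. The sketch below shows one common strategy, least-confidence uncertainty sampling for a binary classifier, purely as an assumption. The function name `least_confident` and the scores are hypothetical and are not taken from the paper.

```python
def least_confident(probabilities, batch_size):
    """Return indices of the unlabeled samples whose predicted
    positive-class probability lies closest to 0.5, i.e. the samples
    a binary classifier is least sure about."""
    ranked = sorted(
        range(len(probabilities)),
        key=lambda i: abs(probabilities[i] - 0.5),
    )
    return ranked[:batch_size]


# Hypothetical model scores for five unlabeled text fragments.
scores = [0.97, 0.52, 0.10, 0.45, 0.80]

# The two fragments that would be sent to the annotator next.
query = least_confident(scores, batch_size=2)
print(query)
```

In a full loop, the annotated fragments would be added to the training set, the model retrained, and the remaining pool scored again.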

5 MSGD: a manually curated database of genomic, transcriptomic, proteomic and drug information for multiple sclerosis.

Impact index: 4.462
Published: May 2024 24
Journal: Database (Oxford); PMID: 38788333

DOI: 10.1093/database/baae037
Article type: Journal Article

Multiple sclerosis (MS) is the most common inflammatory demyelinating disease of the central nervous system. 'Omics' technologies (genomics, transcriptomics, proteomics) and associated drug information have begun reshaping our understanding of multiple sclerosis. However, these data are scattered across numerous references, making them challenging to fully utilize. We manually mined and compiled these data within the Multiple Sclerosis Gene Database (MSGD) database, intending to continue updating it in the future. We screened 5485 publications and constructed the current version of MSGD. MSGD comprises 6255 entries, including 3274 variant entries, 1175 RNA entries, 418 protein entries, 313 knockout entries, 612 drug entries and 463 high-throughput entries. Each entry contains detailed information, such as species, disease type, detailed gene descriptions (such as official gene symbols), and original references. MSGD is freely accessible and provides a user-friendly web interface. Users can easily search for genes of interest, view their expression patterns and detailed information, manage gene sets and submit new MS-gene associations through the platform. The primary principle behind MSGD's design is to provide an exploratory platform, aiming to minimize filtration and interpretation barriers while ensuring highly accessible presentation of data. This initiative is expected to significantly assist researchers in deciphering gene mechanisms and improving the prevention, diagnosis and treatment of MS. Database URL: http://bio-bigdata.hrbmu.edu.cn/MSGD.

6 iCEED: Integrated customized extraction of enzyme data.

Impact index: 1.204
Published: Apr 2024 22
Journal: J Bioinform Comput Biol; PMID: 38779780

DOI: 10.1142/S0219720024500057
Article type: Journal Article

Enzymes catalyze diverse biochemical reactions and are building blocks of cellular and metabolic pathways. Data and metadata of enzymes are distributed across databases and are archived in various formats. The enzyme databases provide utilities for efficient searches and downloading enzyme records in batch mode but do not support organism-specific extraction of subsets of data. Users are required to write scripts for parsing entries for customized data extraction prior to downstream analysis. Integrated Customized Extraction of Enzyme Data (iCEED) has been developed to provide organism-specific customized data extraction utilities for seven commonly used enzyme databases and brings these resources under an integrated portal. iCEED provides dropdown menus and search boxes using typehead utility for submission of queries as well as enzyme class-based browsing utility. A utility to facilitate mapping and visualization of functionally important features on the three-dimensional (3D) structures of enzymes is integrated. The customized data extraction utilities provided in iCEED are expected to be useful for biochemists, biotechnologists, computational biologists, and life science researchers to build curated datasets of their choice through an easy to navigate web-based interface. The integrated feature visualization system is useful for a fine-grained understanding of the enzyme structure-function relationship. Desired subsets of data, extracted and curated using iCEED can be subsequently used for downstream processing, analyses, and knowledge discovery. iCEED can also be used for training and teaching purposes.

7 Best practices for machine learning in antibody discovery and development.

Impact index: 8.369
Published: May 2024 17
Journal: Drug Discov Today; PMID: 38762089

DOI: 10.1016/j.drudis.2024.104025
Article type: Journal Article

In the past 40 years, therapeutic antibody discovery and development have advanced considerably, with machine learning (ML) offering a promising way to speed up the process by reducing costs and the number of experiments required. Recent progress in ML-guided antibody design and development (D&D) has been hindered by the diversity of data sets and evaluation methods, which makes it difficult to conduct comparisons and assess utility. Establishing standards and guidelines will be crucial for the wider adoption of ML and the advancement of the field. This perspective critically reviews current practices, highlights common pitfalls and proposes method development and evaluation guidelines for various ML-based techniques in therapeutic antibody D&D. Addressing challenges across the ML process, best practices are recommended for each stage to enhance reproducibility and progress.

8 PMBC: a manually curated database for prognostic markers of breast cancer.

Impact index: 4.462
Published: May 2024 15
Journal: Database (Oxford); PMID: 38748636

DOI: 10.1093/database/baae033
Article type: Journal Article

Breast cancer is notorious for its high mortality and heterogeneity, resulting in different therapeutic responses. Classical biomarkers have been identified and successfully commercially applied to predict the outcome of breast cancer patients. Accumulating biomarkers, including non-coding RNAs, have been reported as prognostic markers for breast cancer with the development of sequencing techniques. However, there are currently no databases dedicated to the curation and characterization of prognostic markers for breast cancer. Therefore, we constructed a curated database for prognostic markers of breast cancer (PMBC). PMBC consists of 1070 markers covering mRNAs, lncRNAs, miRNAs and circRNAs. These markers are enriched in various cancer- and epithelial-related functions including mitogen-activated protein kinases signaling. We mapped the prognostic markers into the ceRNA network from starBase. The lncRNA NEAT1 competes with 11 RNAs, including lncRNAs and mRNAs. The majority of the ceRNAs in ABAT belong to pseudogenes. The topology analysis of the ceRNA network reveals that known prognostic RNAs have higher closeness than random. Among all the biomarkers, prognostic lncRNAs have a higher degree, while prognostic mRNAs have significantly higher closeness than random RNAs. These results indicate that the lncRNAs play important roles in maintaining the interactions between lncRNAs and their ceRNAs, which might be used as a characteristic to prioritize prognostic lncRNAs based on the ceRNA network. PMBC renders a user-friendly interface and provides detailed information about individual prognostic markers, which will facilitate the precision treatment of breast cancer. PMBC is available at the following URL: http://www.pmbreastcancer.com/.
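
Editorial illustration: the PMBC abstract compares the degree and closeness of prognostic RNAs in a ceRNA network. The sketch below shows how these two measures are commonly computed on an unweighted, undirected graph, using breadth-first search for shortest paths. The toy network and its node names are hypothetical placeholders, not data from starBase or PMBC, and the abstract does not state which software the authors used.

```python
from collections import deque


def degree(graph, node):
    """Number of direct neighbours of a node."""
    return len(graph[node])


def closeness(graph, node):
    """Closeness centrality: number of reachable nodes divided by the
    sum of shortest-path distances to them (0.0 for an isolated node)."""
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbour in graph[current]:
            if neighbour not in dist:
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0


# Toy undirected network stored as adjacency sets (placeholder names).
toy = {
    "lncA": {"mRNA1", "mRNA2", "mRNA3"},
    "mRNA1": {"lncA"},
    "mRNA2": {"lncA", "mRNA3"},
    "mRNA3": {"lncA", "mRNA2"},
}

for name in toy:
    print(name, degree(toy, name), round(closeness(toy, name), 3))
```

In this toy graph the hub node is adjacent to every other node, so it has both the highest degree and the highest closeness.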

9 OMD Curation Toolkit: a workflow for in-house curation of public omics datasets.

Impact index: 3.307
Published: May 2024 9
Journal: BMC Bioinformatics; PMID: 38724907

DOI: 10.1186/s12859-024-05803-9
Article type: Journal Article

BACKGROUND: Major advances in sequencing technologies and the sharing of data and metadata in science have resulted in a wealth of publicly available datasets. However, working with and especially curating public omics datasets remains challenging despite these efforts. While a growing number of initiatives aim to re-use previous results, these present limitations that often lead to the need for further in-house curation and processing.
RESULTS: Here, we present the Omics Dataset Curation Toolkit (OMD Curation Toolkit), a python3 package designed to accompany and guide the researcher during the curation process of metadata and fastq files of public omics datasets. This workflow provides a standardized framework with multiple capabilities (collection, control check, treatment and integration) to facilitate the arduous task of curating public sequencing data projects. While centered on the European Nucleotide Archive (ENA), the majority of the provided tools are generic and can be used to curate datasets from different sources.
CONCLUSIONS: Thus, it offers valuable tools for the in-house curation previously needed to re-use public omics data. Due to its workflow structure and capabilities, it can be easily used and benefit investigators in developing novel omics meta-analyses based on sequencing data.

10 Automating literature screening and curation with applications to computational neuroscience.

Impact index: 7.942
Published: Jun 2024 20
Journal: J Am Med Inform Assoc; PMID: 38722233

DOI: 10.1093/jamia/ocae097
Article type: Journal Article

OBJECTIVE: ModelDB (https://modeldb.science) is a discovery platform for computational neuroscience, containing over 1850 published model codes with standardized metadata. These codes were mainly supplied from unsolicited model author submissions, but this approach is inherently limited. For example, we estimate we have captured only around one-third of NEURON models, the most common type of models in ModelDB. To more completely characterize the state of computational neuroscience modeling work, we aim to identify works containing results derived from computational neuroscience approaches and their standardized associated metadata (eg, cell types, research topics).
METHODS: Known computational neuroscience work from ModelDB and identified neuroscience work queried from PubMed were included in our study. After pre-screening with SPECTER2 (a free document embedding method), GPT-3.5, and GPT-4 were used to identify likely computational neuroscience work and relevant metadata.
RESULTS: SPECTER2, GPT-4, and GPT-3.5 demonstrated varied but high abilities in identification of computational neuroscience work. GPT-4 achieved 96.9% accuracy and GPT-3.5 improved from 54.2% to 85.5% through instruction-tuning and Chain of Thought. GPT-4 also showed high potential in identifying relevant metadata annotations.
DISCUSSION: Accuracy in identification and extraction might further be improved by dealing with ambiguity of what are computational elements, including more information from papers (eg, Methods section), improving prompts, etc.
CONCLUSIONS: Natural language processing and large language model techniques can be added to ModelDB to facilitate further model discovery, and will contribute to a more standardized and comprehensive framework for establishing domain-specific resources.
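
Editorial illustration: the abstract above describes pre-screening with a document embedding method before passing candidates to GPT models, but it does not say how the embeddings were compared. The sketch below assumes a cosine-similarity threshold against a reference vector, which is one simple way to pre-screen. The vectors, paper names, and threshold are made up and are not SPECTER2 output.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def prescreen(candidates, reference, threshold):
    """Keep the candidates whose embedding is at least `threshold`
    similar to the reference embedding."""
    return [
        name
        for name, vector in candidates.items()
        if cosine(vector, reference) >= threshold
    ]


# Hypothetical 3-dimensional embeddings; real embeddings are much longer.
reference = [1.0, 0.0, 1.0]
candidates = {
    "paper_a": [0.9, 0.1, 1.1],
    "paper_b": [0.0, 1.0, 0.0],
    "paper_c": [1.0, 1.0, 1.0],
}

kept = prescreen(candidates, reference, threshold=0.9)
print(kept)
```

Only the candidates that pass this cheap filter would then be sent to the more expensive language-model step.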

