Information Extraction

  • Article type: Journal Article
    In ecology and evolutionary biology, the synthesis and modelling of data from published literature are commonly used to generate insights and test theories across systems. However, the tasks of searching, screening, and extracting data from literature are often arduous. Researchers may manually process hundreds to thousands of articles for systematic reviews, meta-analyses, and compiling synthetic datasets. As relevant articles expand to tens or hundreds of thousands, computer-based approaches can increase the efficiency, transparency and reproducibility of literature-based research. Methods available for text mining are rapidly changing owing to developments in machine learning-based language models. We review the growing landscape of approaches, mapping them onto three broad paradigms (frequency-based approaches, traditional Natural Language Processing and deep learning-based language models). This serves as an entry point to learn foundational and cutting-edge concepts, vocabularies, and methods to foster integration of these tools into ecological and evolutionary research. We cover approaches for modelling ecological texts, generating training data, developing custom models, and interacting with large language models, and we discuss challenges and possible solutions to implementing these methods in ecology and evolution.
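
    To make the first of the review's three paradigms concrete, here is a minimal sketch of frequency-based screening: abstracts are scored by counts of domain keywords and kept if they clear a threshold. The keyword list and threshold are illustrative assumptions, not taken from the review.

    ```python
    from collections import Counter
    import re

    # Illustrative screening keywords for ecology abstracts (assumed, not from the review)
    KEYWORDS = {"dispersal", "phenology", "trait", "abundance", "occurrence"}

    def keyword_score(text: str) -> int:
        """Count occurrences of screening keywords in a lower-cased, tokenized abstract."""
        tokens = re.findall(r"[a-z]+", text.lower())
        counts = Counter(tokens)
        return sum(counts[k] for k in KEYWORDS)

    abstracts = [
        "Seed dispersal traits predict species abundance across habitat gradients.",
        "A new museum database web interface is described.",
    ]
    # Keep abstracts whose keyword frequency clears an arbitrary threshold of 2
    screened = [a for a in abstracts if keyword_score(a) >= 2]
    print(screened)
    ```
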
  • Article type: Journal Article
    Flood disasters cause significant casualties and economic losses annually worldwide. During disasters, accurate and timely information is crucial for disaster management. However, remote sensing cannot balance temporal and spatial resolution, and the coverage of specialized equipment is limited, making continuous monitoring challenging. Real-time disaster-related information shared by social media users offers new possibilities for monitoring. We propose a framework for extracting and analyzing flood information from social media, validated through the 2018 Shouguang flood in China. This framework innovatively combines deep learning and regular expression matching to automatically extract key flood-related information from Weibo textual data, such as problems, floodings, needs, rescues, and measures, achieving an accuracy of 83% and surpassing traditional models such as the Biterm Topic Model (BTM). In the spatiotemporal analysis of the disaster, our research identifies critical time points during the disaster through quantitative analysis of the information and explores the spatial distribution of calls for help using Kernel Density Estimation (KDE), followed by identifying the core affected areas using the Hierarchical Density-Based Spatial Clustering of Applications with Noise (HDBSCAN) algorithm. For semantic analysis, we adopt the Latent Dirichlet Allocation (LDA) algorithm to perform topic modeling on Weibo texts from different regions, identifying the types of disasters affecting each township. Additionally, through correlation analysis, we investigate the relationship between disaster rescue requests and response measures to evaluate the adequacy of flood response measures in each township. The research results demonstrate that this analytical framework can accurately extract disaster information, precisely identify critical time points in flood disasters, locate core affected areas, uncover primary regional issues, and further validate the sufficiency of response measures, thereby enhancing both the efficiency of disaster information collection and the capacity for analysis.
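
    A minimal sketch of the spatial and semantic analysis stages the abstract names (KDE, HDBSCAN, LDA), using scikit-learn. The coordinates and texts are toy stand-ins, the deep-learning and regex extraction step is omitted, and the HDBSCAN import assumes scikit-learn >= 1.3.

    ```python
    import numpy as np
    from sklearn.neighbors import KernelDensity
    from sklearn.cluster import HDBSCAN  # available in scikit-learn >= 1.3
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.decomposition import LatentDirichletAllocation

    # Toy stand-ins for geocoded help requests (lon, lat) and Weibo-like texts
    coords = np.array([[118.74, 36.86], [118.75, 36.86], [118.73, 36.85], [118.90, 36.70]])
    texts = ["water entered the house need rescue", "road flooded cars stranded",
             "need boats and food", "power outage in the village"]

    # 1) Spatial density of calls for help (KDE), evaluated at the same points
    kde = KernelDensity(bandwidth=0.02).fit(coords)
    density = np.exp(kde.score_samples(coords))

    # 2) Core affected areas via HDBSCAN (label -1 marks noise points)
    labels = HDBSCAN(min_cluster_size=2).fit_predict(coords)

    # 3) Regional topics via LDA over a bag-of-words representation
    X = CountVectorizer().fit_transform(texts)
    lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
    print(density, labels, lda.transform(X).argmax(axis=1))
    ```
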
  • Article type: Journal Article
    OBJECTIVE: Alzheimer's disease (AD) is the most common form of dementia in the United States. Sleep is one of the lifestyle-related factors that has been shown to be critical for optimal cognitive function in old age. However, there is a lack of research studying the association between sleep and AD incidence. A major bottleneck for conducting such research is that the traditional way to acquire sleep information is time-consuming, inefficient, non-scalable, and limited to patients' subjective experience. We aim to automate the extraction of specific sleep-related patterns, such as snoring, napping, poor sleep quality, daytime sleepiness, night wakings, other sleep problems, and sleep duration, from clinical notes of AD patients. These sleep patterns are hypothesized to play a role in the incidence of AD, providing insight into the relationship between sleep and AD onset and progression.
    METHODS: A gold standard dataset was created from manual annotation of 570 randomly sampled clinical note documents from adSLEEP, a corpus of 192,000 de-identified clinical notes of 7266 AD patients retrieved from the University of Pittsburgh Medical Center (UPMC). We developed a rule-based natural language processing (NLP) algorithm, machine learning models, and large language model (LLM)-based NLP algorithms to automate the extraction of sleep-related concepts, including snoring, napping, sleep problem, bad sleep quality, daytime sleepiness, night wakings, and sleep duration, from the gold standard dataset.
    RESULTS: The annotated dataset of 482 patients comprised a predominantly White (89.2%) older adult population with an average age of 84.7 years; females represented 64.1%, and a vast majority were non-Hispanic or Latino (94.6%). The rule-based NLP algorithm achieved the best F1 performance across all sleep-related concepts. In terms of positive predictive value (PPV), the rule-based NLP algorithm achieved the highest PPV scores for daytime sleepiness (1.00) and sleep duration (1.00), the machine learning models had the highest PPV for napping (0.95) and bad sleep quality (0.86), and LLAMA2 with fine-tuning had the highest PPV for night wakings (0.93) and sleep problem (0.89).
    DISCUSSION: Although sleep information is infrequently documented in clinical notes, the proposed rule-based NLP algorithm and LLM-based NLP algorithms still achieved promising results. In comparison, the machine learning-based approaches did not perform well, owing to the scarcity of sleep information in the training data.
    CONCLUSIONS: The results show that the rule-based NLP algorithm consistently achieved the best performance across all sleep concepts. This study focused on the clinical notes of patients with AD but could be extended to general sleep information extraction for other diseases.
  • Article type: Journal Article
    BACKGROUND: Named entity recognition (NER) is a fundamental task in natural language processing. However, it is typically preceded by named entity annotation, which poses several challenges, especially in the clinical domain. For instance, determining entity boundaries is one of the most common sources of disagreements between annotators due to questions such as whether modifiers or peripheral words should be annotated. If unresolved, these can induce inconsistency in the produced corpora, yet, on the other hand, strict guidelines or adjudication sessions can further prolong an already slow and convoluted process.
    OBJECTIVE: The aim of this study is to address these challenges by evaluating 2 novel annotation methodologies, lenient span and point annotation, aiming to mitigate the difficulty of precisely determining entity boundaries.
    METHODS: We evaluate their effects through an annotation case study on a Japanese medical case report data set. We compare annotation time, annotator agreement, and the quality of the produced labeling and assess the impact on the performance of an NER system trained on the annotated corpus.
    RESULTS: We saw significant improvements in labeling process efficiency, with up to a 25% reduction in overall annotation time and even a 10% improvement in annotator agreement compared to the traditional boundary-strict approach. However, even the best-performing NER model showed some drop in performance compared to the traditional annotation methodology.
    CONCLUSIONS: Our findings demonstrate a balance between annotation speed and model performance. Although disregarding boundary information affects model performance to some extent, this is counterbalanced by significant reductions in the annotator's workload and notable improvements in the speed of the annotation process. These benefits may prove valuable in various applications, offering an attractive compromise for developers and researchers.
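
    One way to operationalize boundary-lenient evaluation is to credit any overlap between a predicted span and a gold span; the scoring rule below is an assumption for illustration, not necessarily the study's exact agreement metric.

    ```python
    def overlaps(a: tuple[int, int], b: tuple[int, int]) -> bool:
        """True if two (start, end) character spans overlap at all."""
        return a[0] < b[1] and b[0] < a[1]

    def lenient_f1(gold: list[tuple[int, int]], pred: list[tuple[int, int]]) -> float:
        """Boundary-lenient F1: any overlap with a gold span counts as a hit."""
        tp = sum(any(overlaps(p, g) for g in gold) for p in pred)
        precision = tp / len(pred) if pred else 0.0
        recall = sum(any(overlaps(g, p) for p in pred) for g in gold) / len(gold) if gold else 0.0
        return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

    # Strict matching would score 0 here; lenient matching credits the partial overlap.
    print(lenient_f1(gold=[(10, 25)], pred=[(12, 25)]))
    ```
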
  • Article type: Journal Article
    Artificially extracted agricultural phenotype information exhibits high subjectivity and low accuracy, while the utilization of image-extracted information is susceptible to interference from haze. Furthermore, the effectiveness of the agricultural image dehazing methods used for extracting such information is limited by unclear texture details and color representation in the images. To address these limitations, we propose AgriGAN (unpaired image dehazing via a cycle-consistent generative adversarial network) for enhancing dehazing performance in agricultural plant phenotyping. The algorithm incorporates an atmospheric scattering model to improve the discriminator model and employs a whole-detail consistent discrimination approach to enhance discriminator efficiency, thereby accelerating convergence towards the Nash equilibrium within the adversarial network. Finally, by training with the network adversarial loss plus a cycle-consistency loss, clear images are obtained after the dehazing process. Experimental evaluations and comparative analysis were conducted to assess the algorithm's performance, demonstrating improved accuracy in dehazing agricultural images while preserving detailed texture information and mitigating color deviation issues.
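
    A generic CycleGAN-style loss sketch in PyTorch for unpaired dehazing, since the abstract trains with adversarial loss plus cycle-consistency loss. AgriGAN's atmospheric-scattering discriminator and whole-detail consistent discrimination are not reproduced here, and the networks are placeholders.

    ```python
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def cycle_gan_losses(G, F_inv, D_clear, hazy, clear, lam=10.0):
        """Adversarial + cycle-consistency generator losses, CycleGAN-style.
        G: hazy -> clear generator, F_inv: clear -> hazy generator,
        D_clear: discriminator on clear images."""
        fake_clear = G(hazy)
        # Least-squares adversarial term: the generator wants D_clear(fake) close to 1
        score = D_clear(fake_clear)
        adv = F.mse_loss(score, torch.ones_like(score))
        # Cycle consistency: hazy -> clear -> hazy (and clear -> hazy -> clear) reconstructs
        cyc = F.l1_loss(F_inv(fake_clear), hazy) + F.l1_loss(G(F_inv(clear)), clear)
        return adv + lam * cyc

    # Toy smoke test with identity "networks" (placeholders, not real generators)
    g = f_inv = d = nn.Identity()
    x = torch.rand(1, 3, 8, 8)
    print(cycle_gan_losses(g, f_inv, d, x, x).item())
    ```
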
  • Article type: Journal Article
    OBJECTIVE: The integration of preventive care guidelines with Electronic Health Record (EHR) systems, coupled with the generation of personalized preventive care recommendations, holds significant potential for improving healthcare outcomes. Our study investigates the feasibility of using Large Language Models (LLMs) to automate the extraction of assessment criteria and risk factors from the guidelines for future analysis against medical records in EHRs.
    METHODS: We annotated the criteria, risk factors, and preventive medical services described in the adult guidelines published by the United States Preventive Services Task Force and evaluated 3 state-of-the-art LLMs on automatically extracting information in these categories from the guidelines.
    RESULTS: We included 24 guidelines in this study. The LLMs can automate the extraction of all criteria, risk factors, and medical services from 9 guidelines. All 3 LLMs perform well on extracting information regarding the demographic criteria or risk factors. Some LLMs perform better on extracting the social determinants of health, family history, and preventive counseling services than the others.
    DISCUSSION: While LLMs demonstrate the capability to handle lengthy preventive care guidelines, several challenges persist, including constraints related to the maximum length of input tokens and the tendency to generate content rather than adhering strictly to the original input. Moreover, the utilization of LLMs in real-world clinical settings necessitates careful ethical consideration. It is imperative that healthcare professionals meticulously validate the extracted information to mitigate biases, ensure completeness, and maintain accuracy.
    CONCLUSIONS: We developed a data structure to store the annotated preventive guidelines and made it publicly available. Employing state-of-the-art LLMs to extract preventive care criteria, risk factors, and preventive care services paves the way for the future integration of these guidelines into the EHR.
  • Article type: Journal Article
    BACKGROUND: Extractive question-answering (EQA) is a useful natural language processing (NLP) application for answering patient-specific questions by locating answers in their clinical notes. Realistic clinical EQA can yield multiple answers to a single question and multiple focus points in 1 question, which are lacking in existing data sets for the development of artificial intelligence solutions.
    OBJECTIVE: This study aimed to create a data set for developing and evaluating clinical EQA systems that can handle natural multianswer and multifocus questions.
    METHODS: We leveraged the annotated relations from the 2018 National NLP Clinical Challenges corpus to generate an EQA data set. Specifically, the 1-to-N, M-to-1, and M-to-N drug-reason relations were included to form the multianswer and multifocus question-answering entries, which represent more complex and natural challenges in addition to the basic 1-drug-1-reason cases. A baseline solution was developed and tested on the data set.
    RESULTS: The derived RxWhyQA data set contains 96,939 QA entries. Among the answerable questions, 25% of them require multiple answers, and 2% of them ask about multiple drugs within 1 question. Frequent cues were observed around the answers in the text, and 90% of the drug and reason terms occurred within the same or an adjacent sentence. The baseline EQA solution achieved a best F1-score of 0.72 on the entire data set, and on specific subsets, it was 0.93 for the unanswerable questions, 0.48 for single-drug questions versus 0.60 for multidrug questions, and 0.54 for the single-answer questions versus 0.43 for multianswer questions.
    CONCLUSIONS: The RxWhyQA data set can be used to train and evaluate systems that need to handle multianswer and multifocus questions. Specifically, multianswer EQA appears to be challenging and therefore warrants more investment in research. We created and shared a clinical EQA data set with multianswer and multifocus questions that would channel future research efforts toward more realistic scenarios.
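
    A sketch of how 1-to-N drug-reason relations could be folded into multianswer QA entries, as the data set construction describes; the field names and example tuples are illustrative, not the actual RxWhyQA schema.

    ```python
    from collections import defaultdict

    # Toy drug-reason relation tuples; the real entries come from the 2018 n2c2 corpus
    relations = [("lisinopril", "hypertension"), ("metformin", "diabetes"),
                 ("lisinopril", "heart failure")]

    # Group reasons per drug so 1-to-N relations become one multianswer question
    reasons = defaultdict(list)
    for drug, reason in relations:
        reasons[drug].append(reason)

    qa_entries = [
        {"question": f"Why was {drug} prescribed?", "answers": answer_list,
         "is_impossible": False}
        for drug, answer_list in reasons.items()
    ]
    # Unanswerable questions pair a drug with no documented reason
    qa_entries.append({"question": "Why was aspirin prescribed?", "answers": [],
                       "is_impossible": True})
    print(qa_entries)
    ```
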
  • Article type: Journal Article
    OBJECTIVE: Biomedical named entity recognition (bioNER) is the task of recognizing named entities in biomedical texts. This paper introduces a new model that addresses bioNER by considering additional external contexts. Different from prior methods that mainly use the original input sequences for sequence labeling, the model takes additional contexts into account to enhance the representation of entities in the original sequences, since additional contexts can provide enhanced information for the concept explanation of biomedical entities.
    METHODS: To exploit additional context, given an original input sequence, the model first retrieves relevant sentences from PubMed and then ranks the retrieved sentences to form the context. It next combines the context with the original input sequence to form a new enhanced sequence. The original and new enhanced sequences are fed into PubMedBERT to learn feature representations. To obtain more fine-grained features, the model stacks a BiLSTM layer on top of PubMedBERT. The final named entity label prediction is done using a CRF layer. The model is jointly trained in an end-to-end manner to take advantage of the additional context for NER on the original sequence.
    RESULTS: Experimental results on six biomedical datasets show that the proposed model achieves promising performance compared to strong baselines and confirm the contribution of additional contexts to bioNER.
    CONCLUSIONS: The promising results confirm three important points. First, the additional context from PubMed helps to improve the quality of the recognition of biomedical entities. Second, PubMed is more appropriate than the Google search engine for providing relevant information for bioNER. Finally, more relevant sentences from the context are more beneficial than irrelevant ones for providing enhanced information for the original input sequences. The model is flexible enough to integrate any additional context type for the NER task.
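
    A condensed sketch of the described architecture (encoder, BiLSTM, CRF) in PyTorch. It assumes the transformers and pytorch-crf packages, a public PubMedBERT checkpoint, and that the retrieved PubMed context has already been concatenated to the input before tokenization.

    ```python
    import torch.nn as nn
    from transformers import AutoModel
    from torchcrf import CRF  # pip install pytorch-crf (assumed dependency)

    class ContextEnhancedNER(nn.Module):
        """Encoder -> BiLSTM -> CRF tagger over a context-augmented input sequence."""
        def __init__(self, num_labels: int,
                     encoder_name: str = "microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract"):
            super().__init__()
            self.encoder = AutoModel.from_pretrained(encoder_name)
            hidden = self.encoder.config.hidden_size
            self.bilstm = nn.LSTM(hidden, hidden // 2, batch_first=True, bidirectional=True)
            self.emit = nn.Linear(hidden, num_labels)
            self.crf = CRF(num_labels, batch_first=True)

        def forward(self, input_ids, attention_mask, labels=None):
            h = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
            h, _ = self.bilstm(h)                  # finer-grained sequence features
            emissions = self.emit(h)               # per-token label scores
            if labels is not None:                 # training: negative CRF log-likelihood
                return -self.crf(emissions, labels, mask=attention_mask.bool())
            return self.crf.decode(emissions, mask=attention_mask.bool())
    ```
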
  • Article type: Journal Article
    This article presents a semantic web-based solution for automatically extracting relevant information from the annual financial reports of banks and financial institutions and presenting this information in a queryable form through a knowledge graph. The information in these reports is highly sought by various stakeholders for making key investment decisions. However, it is available in an unstructured format, making it complex and challenging to understand and query manually or even through digital systems. Another challenge that complicates the understanding of this information is the variation in terminology among the financial reports of different banks or financial institutions. The solution presented in this article applies an ontological approach to the standardization of terminology in this domain. It further addresses the issue of semantic differences to extract relevant data sharing common semantics. Such semantics are then incorporated by implementing their representation as a knowledge graph, making the information understandable and queryable. Our results highlight the usage of knowledge graphs in search engines, recommender systems, and question-answering (Q-A) systems. This financial knowledge graph can also be used to serve the task of financial storytelling. The proposed solution is implemented and tested on the datasets of various banks, and the results are presented through answers to competency questions, evaluated on precision and recall measures.
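
    A small rdflib sketch of the idea: extracted report facts become triples under an illustrative ontology namespace, and a competency question is answered with SPARQL. The namespace, property names, and figures are invented for the example.

    ```python
    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import RDF, XSD

    FIN = Namespace("http://example.org/fin#")  # illustrative ontology namespace
    g = Graph()
    g.bind("fin", FIN)

    # One extracted fact: a bank's net interest income for FY2023 (toy values)
    g.add((FIN.AlphaBank, RDF.type, FIN.Bank))
    g.add((FIN.AlphaBank, FIN.netInterestIncome, Literal(1250.0, datatype=XSD.decimal)))
    g.add((FIN.AlphaBank, FIN.fiscalYear, Literal(2023, datatype=XSD.integer)))

    # Competency question as SPARQL: which banks report net interest income above 1000?
    q = """PREFIX fin: <http://example.org/fin#>
    SELECT ?bank ?nii WHERE {
      ?bank a fin:Bank ; fin:netInterestIncome ?nii .
      FILTER(?nii > 1000)
    }"""
    for row in g.query(q):
        print(row.bank, row.nii)
    ```
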
  • Article type: Journal Article
    Fine-grained, descriptive information on the habitats and reproductive conditions of plant species is crucial in forest restoration and rehabilitation efforts. Precise timing of fruit collection and knowledge of species' habitat preferences and reproductive status are necessary, especially for tropical plant species that have short-lived recalcitrant seeds and those that exhibit complex reproductive patterns, e.g., species with supra-annual mass flowering events that may occur at irregular intervals. Understanding plant regeneration in a way that supports planning for effective reforestation can be aided by access to structured information, e.g., in knowledge bases, that spans years if not decades and covers a wide range of geographic locations. The content of such a resource can be enriched with literature-derived information on species' time-sensitive reproductive conditions and location-specific habitats.
    We sought to develop unsupervised approaches to extract relationships pertaining to habitats and their locations, and reproductive conditions of plant species and corresponding temporal information. Firstly, we handcrafted rules for a traditional rule-based pattern matching approach. We then developed a relation extraction approach building upon transformer models, i.e., the Text-to-Text Transfer Transformer (T5), casting the relation extraction problem as a question answering and natural language inference task. We then propose a novel unsupervised hybrid approach that combines our rule-based and transformer-based approaches.
    Evaluation of our hybrid approach on an annotated corpus of biodiversity-focused documents demonstrated an improvement of up to 15 percentage points in recall and the best performance over solely rule-based and transformer-based methods, with F1-scores ranging from 89.61% to 96.75% for relations between reproductive conditions and temporal expressions, and from 85.39% to 89.90% for relations between habitats and geographic locations. Our work shows that, even without training models on any domain-specific labeled dataset, we are able to extract relationships between biodiversity concepts from the literature with satisfactory performance.
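
    A sketch of the question-answering casting of relation extraction with an off-the-shelf T5 checkpoint via transformers. The checkpoint, prompt format, and questions are assumptions for illustration; the paper's handcrafted rules and NLI casting are not shown.

    ```python
    from transformers import pipeline

    # "t5-small" is a stand-in checkpoint; the paper builds on T5, but this exact
    # model and prompt format are assumptions
    qa = pipeline("text2text-generation", model="t5-small")

    passage = ("Dipterocarpus grandiflorus flowers from March to April "
               "and inhabits lowland dipterocarp forests in the Philippines.")

    # One question per candidate relation; the answer links the entity pair
    for question in ["When does Dipterocarpus grandiflorus flower?",
                     "What habitat does Dipterocarpus grandiflorus occupy?"]:
        out = qa(f"question: {question} context: {passage}", max_new_tokens=32)
        print(question, "->", out[0]["generated_text"])
    ```
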