Keywords: MIMICSQL dataset; NLP; T5 model; healthcare domain; large language model; text-to-SQL conversion; transformers

Source: DOI:10.3389/fdata.2024.1371680

Abstract:
As electronic medical records (EMRs) are increasingly stored in databases, healthcare staff with limited technical expertise in database operations have difficulty retrieving them. Because these records are crucial for delivering appropriate medical care, staff need an accessible way to retrieve EMRs.
To address this, natural language processing (NLP) for Text-to-SQL has emerged as a solution, enabling non-technical users to generate SQL queries from natural language text. This research assesses existing work on Text-to-SQL conversion and proposes the MedT5SQL model, designed specifically for EMR retrieval. The proposed model builds on the Text-to-Text Transfer Transformer (T5), a Large Language Model (LLM) commonly used in text-based NLP tasks. The model is fine-tuned on the MIMICSQL dataset, the first Text-to-SQL dataset for the healthcare domain. Performance is evaluated by benchmarking MedT5SQL with two optimizers, varying numbers of training epochs, and two datasets, MIMICSQL and WikiSQL.
On the MIMICSQL dataset, the model is effective at generating SQL queries for the given questions, achieving 80.63% on the exact match accuracy metric, 98.937% on approximate string matching, and 90% in manual evaluation. On the WikiSQL dataset, the model achieves an accuracy of 44.2% and an approximate string-matching score of 94.26%.
Results indicate that performance improves as the number of training epochs increases. This work highlights the potential of a fine-tuned T5 model to convert medical questions written in natural language into Structured Query Language (SQL) in the healthcare domain, providing a foundation for future research in this area.
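The abstract reports two automatic metrics, exact match accuracy and approximate string matching, without stating how they are computed. The sketch below shows one plausible way to compute both for predicted and reference SQL strings. It is an illustration under assumptions, not the authors' evaluation code: the whitespace and case normalization, the use of Python's standard `difflib.SequenceMatcher` as the similarity measure, the function names, and the example queries are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize_sql(query: str) -> str:
    """Lowercase and collapse whitespace (an assumed normalization step)."""
    return " ".join(query.lower().split())


def exact_match_accuracy(predictions, references):
    """Fraction of predictions identical to their reference after normalization."""
    matches = sum(
        normalize_sql(p) == normalize_sql(r)
        for p, r in zip(predictions, references)
    )
    return matches / len(references)


def mean_string_similarity(predictions, references):
    """Average character-level similarity ratio in [0, 1].

    difflib is an assumed stand-in; the source does not name the
    approximate string-matching method it uses.
    """
    ratios = [
        SequenceMatcher(None, normalize_sql(p), normalize_sql(r)).ratio()
        for p, r in zip(predictions, references)
    ]
    return sum(ratios) / len(ratios)


# Hypothetical predictions and references, for illustration only.
predictions = [
    "SELECT name FROM patients WHERE age > 60",
    "select count(*) from patients",
]
references = [
    "select name from patients where age > 60",
    "SELECT COUNT(*) FROM admissions",
]

print(exact_match_accuracy(predictions, references))
print(mean_string_similarity(predictions, references))
```

Under these definitions, a query that differs from the reference by only a table name counts as a miss for exact match but still scores high on string similarity. This is one possible explanation for the gap between the two reported figures, though the abstract itself does not give one.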