Keywords: Big data; Data mining; Feature extraction; Medical data annotation; Multimodal representation; Unsupervised machine learning

Source: DOI:10.1186/s13040-024-00373-1

Abstract:
BACKGROUND: The use of machine learning in medical diagnosis and treatment has grown significantly in recent years with the development of computer-aided diagnosis systems, often based on annotated medical radiology images. However, the lack of large annotated image datasets remains a major obstacle, as the annotation process is time-consuming and costly. This study aims to overcome this challenge by proposing an automated method for annotating a large database of medical radiology images based on their semantic similarity.
RESULTS: An automated, unsupervised approach is used to create a large annotated dataset of medical radiology images originating from the Clinical Hospital Centre Rijeka, Croatia. The pipeline is built by data-mining three different types of medical data: images, DICOM metadata and narrative diagnoses. The optimal feature extractors are then integrated into a multimodal representation, which is then clustered to create an automated pipeline for labelling a precursor dataset of 1,337,926 medical images into 50 clusters of visually similar images. The quality of the clusters is assessed by examining their homogeneity and mutual information, taking into account the anatomical region and modality representation.
CONCLUSIONS: The results indicate that fusing the embeddings of all three data sources together provides the best results for the task of unsupervised clustering of large-scale medical data and leads to the most concise clusters. Hence, this work marks the initial step towards building a much larger and more fine-grained annotated dataset of medical radiology images.
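The fusion and evaluation steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how the three embeddings are fused or which clustering algorithm is used, so the concatenation of L2-normalised vectors below is an assumption, and the names `l2_normalize`, `fuse`, `entropy`, `mutual_information` and `homogeneity` are hypothetical helpers. Cluster assignments are taken as given, from any clustering algorithm.

```python
import math
from collections import Counter


def l2_normalize(vec):
    """Scale a vector to unit length so no modality dominates by magnitude."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(image_emb, metadata_emb, text_emb):
    """ASSUMPTION: fuse by concatenating L2-normalised embeddings.

    The abstract only states that image, DICOM metadata and diagnosis
    embeddings are integrated; the actual fusion scheme may differ.
    """
    fused = []
    for emb in (image_emb, metadata_emb, text_emb):
        fused.extend(l2_normalize(emb))
    return fused


def entropy(labels):
    """Shannon entropy (natural log) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a, b):
    """Mutual information (natural log) between two labelings of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * math.log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in joint.items()
    )


def homogeneity(reference, clusters):
    """1.0 when every cluster contains items of a single reference class."""
    h = entropy(reference)
    return 1.0 if h == 0 else mutual_information(reference, clusters) / h


if __name__ == "__main__":
    # Toy data: reference labels could be modality or anatomical region.
    modality = ["CT", "CT", "XR", "XR"]
    clusters = [0, 0, 1, 1]
    print(fuse([3.0, 4.0], [1.0, 0.0], [0.0, 2.0]))
    print(homogeneity(modality, clusters))
    print(mutual_information(modality, clusters))
```

In this sketch, homogeneity is computed as the mutual information divided by the entropy of the reference labels, so a clustering that separates modalities perfectly scores 1 and a single all-inclusive cluster scores 0.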