Data scarcity

  • Article type: Journal Article
    PURPOSE: Training data fuel and shape the development of artificial intelligence (AI) models. Intensive data requirements are a major bottleneck limiting the success of AI tools in sectors with inherently scarce data. In health care, training data are difficult to curate, triggering growing concerns that the current lack of access to health care by underprivileged social groups will translate into future bias in health care AIs. In this report, we developed an autoencoder to grow and enhance inherently scarce datasets to alleviate our dependence on big data.
    DESIGN: Computational study with open-source data.
    PARTICIPANTS: The data were obtained from 6 open-source datasets comprising patients aged 40-80 years in Singapore, China, India, and Spain.
    METHODS: The reported framework generates synthetic images based on real-world patient imaging data. As a test case, we used an autoencoder to expand publicly available training sets of optic disc photos and evaluated the ability of the resultant datasets to train AI models in the detection of glaucomatous optic neuropathy.
    MAIN OUTCOME MEASURES: Area under the receiver operating characteristic curve (AUC) was used to evaluate the performance of the glaucoma detector. A higher AUC indicates better detection performance.
    RESULTS: Results show that enhancing datasets with synthetic images generated by the autoencoder led to superior training sets that improved the performance of AI models.
    CONCLUSIONS: Our findings help address the increasingly untenable data volume and quality requirements for AI model development and have implications beyond health care, toward empowering AI adoption in all similarly data-challenged fields.
    FINANCIAL DISCLOSURES: The authors have no proprietary or commercial interest in any materials discussed in this article.
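    A minimal sketch of this kind of autoencoder-based augmentation is shown below: encode real disc photos, perturb the latent codes with Gaussian noise, and decode synthetic look-alikes to enlarge the training set. The architecture, the 64x64 RGB input size, and the noise scale are illustrative assumptions, not the authors' published configuration.

```python
# Minimal latent-perturbation augmentation with a convolutional autoencoder.
# Architecture, image size, and noise scale are illustrative assumptions.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, latent_dim: int = 128):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),   # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 32 -> 16
            nn.Flatten(),
            nn.Linear(64 * 16 * 16, latent_dim),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64 * 16 * 16), nn.ReLU(),
            nn.Unflatten(1, (64, 16, 16)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(x))

def synthesize(model: ConvAutoencoder, images: torch.Tensor,
               noise_scale: float = 0.1) -> torch.Tensor:
    """Encode real images, jitter the latent codes, decode synthetic variants."""
    with torch.no_grad():
        z = model.encoder(images)
        return model.decoder(z + noise_scale * torch.randn_like(z))

# After training on real disc photos:
# synthetic = synthesize(model, real_batch)  # append to the training set
```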

  • Article type: Journal Article
    Data collection, curation, and cleaning constitute a crucial phase in Machine Learning (ML) projects. In biomedical ML, it is often desirable to leverage multiple datasets to increase sample size and diversity, but this poses unique challenges, which arise from heterogeneity in study design, data descriptors, file system organization, and metadata. In this study, we present an approach to the integration of multiple brain MRI datasets with a focus on homogenization of their organization and preprocessing for ML. We use our own fusion example (approximately 84,000 images from 54,000 subjects, 12 studies, and 88 individual scanners) to illustrate and discuss the issues faced by study fusion efforts, and we examine key decisions necessary during dataset homogenization, presenting in detail a database structure flexible enough to accommodate multiple observational MRI datasets. We believe our approach can provide a basis for future similarly-minded biomedical ML projects.
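    The sketch below illustrates one plausible shape for such a flexible database structure (study, scanner, subject, and scan tables keyed for fusion). The table and column names are our assumptions, not the schema published in the study.

```python
# Hypothetical normalized schema for fusing multiple observational MRI studies:
# study -> subject -> scan, with scanner metadata factored out. Names are
# illustrative assumptions, not the authors' published database structure.
import sqlite3

conn = sqlite3.connect("mri_fusion.db")
conn.executescript("""
CREATE TABLE study   (study_id INTEGER PRIMARY KEY, name TEXT, design TEXT);
CREATE TABLE scanner (scanner_id INTEGER PRIMARY KEY, vendor TEXT, model TEXT,
                      field_strength REAL);
CREATE TABLE subject (subject_id INTEGER PRIMARY KEY,
                      study_id INTEGER REFERENCES study(study_id),
                      source_label TEXT,   -- ID as used by the source study
                      age REAL, sex TEXT);
CREATE TABLE scan    (scan_id INTEGER PRIMARY KEY,
                      subject_id INTEGER REFERENCES subject(subject_id),
                      scanner_id INTEGER REFERENCES scanner(scanner_id),
                      modality TEXT,       -- e.g. T1w, T2w, FLAIR
                      path TEXT,           -- homogenized file location
                      preprocessed INTEGER DEFAULT 0);
""")
conn.commit()
```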

  • Article type: Journal Article
    Worldwide manufacturing industries have been significantly affected by the COVID-19 pandemic because of production characteristics such as low-cost country sourcing, globalization, and inventory levels. To analyze the resulting correlated time series, spatio-temporal models have become more attractive, and graph convolutional networks (GCNs) are commonly used to propagate information between nodes and their neighbors in the graph. Recently, the attention-adjusted graph spatio-temporal network (AGSTN) was proposed to address the problem of pre-defined graphs in GCNs by combining multi-graph convolution with attention adjustment to learn spatial and temporal correlations over time. However, AGSTN can struggle on small non-sensor datasets, in particular exhibiting convergence issues. This study proposes several variants of AGSTN and applies them to non-sensor data. We suggest data augmentation and regularization techniques, such as edge selection, time-series decomposition, and prevention policies, to improve AGSTN. An empirical study of worldwide manufacturing industries in the pandemic era was conducted to validate the proposed variants. The results show that the proposed variants improve prediction performance by at least around 20% in mean squared error (MSE) and mitigate the convergence problem.
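    As a hedged illustration of one of the named augmentation techniques, the sketch below decomposes a series into trend and residual with a centered moving average and builds augmented copies by bootstrapping the residuals; the window size and augmentation rule are assumptions, not the paper's exact procedure.

```python
# Hedged sketch of time-series decomposition as an augmentation step: split a
# series into trend and residual, then resample residuals to create new copies.
import numpy as np

def decompose(series: np.ndarray, window: int = 7):
    """Return (trend, residual) via a centered moving average."""
    kernel = np.ones(window) / window
    trend = np.convolve(series, kernel, mode="same")
    return trend, series - trend

def augment(series: np.ndarray, seed: int = 0) -> np.ndarray:
    """Keep the trend, resample the residuals with replacement."""
    trend, residual = decompose(series)
    rng = np.random.default_rng(seed)
    return trend + rng.choice(residual, size=residual.size, replace=True)

# weekly_output = np.array([...])  # one manufacturing series
# training_copy = augment(weekly_output)
```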

  • Article type: Journal Article
    The integration of emotional intelligence in machines is an important step in advancing human-computer interaction. This demands the development of reliable end-to-end emotion recognition systems. However, the scarcity of public affective datasets presents a challenge. In this literature review, we emphasize the use of generative models to address this issue in neurophysiological signals, particularly electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS). We provide a comprehensive analysis of the different generative models used in the field, examining their input formulations, deployment strategies, and methodologies for evaluating the quality of synthesized data. This review serves as a comprehensive overview, offering insights into the advantages, challenges, and promising future directions in the application of generative models to emotion recognition systems. Through this review, we aim to facilitate the progression of neurophysiological data augmentation, thereby supporting the development of more efficient and reliable emotion recognition systems.
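    To make the typical input formulation concrete, here is a minimal, generic GAN pair for fixed-length multichannel EEG windows (noise vector in, synthetic window out); the channel count, window length, and layer sizes are assumptions and do not come from the review.

```python
# Illustrative (not from the review) minimal GAN for fixed-length 1-D EEG
# windows: noise vector in, synthetic multichannel window out.
import torch
import torch.nn as nn

CHANNELS, SAMPLES, NOISE_DIM = 8, 256, 64   # assumed sizes

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 512), nn.ReLU(),
    nn.Linear(512, CHANNELS * SAMPLES), nn.Tanh(),  # signals scaled to [-1, 1]
)
discriminator = nn.Sequential(
    nn.Linear(CHANNELS * SAMPLES, 512), nn.LeakyReLU(0.2),
    nn.Linear(512, 1),                              # real/fake logit
)

z = torch.randn(16, NOISE_DIM)                      # batch of noise vectors
fake = generator(z).view(16, CHANNELS, SAMPLES)     # synthetic EEG windows
logits = discriminator(fake.flatten(1))             # discriminator scores
```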

  • Article type: Journal Article
    OBJECTIVE: To evaluate the potential of synthetic radiomic data generation in addressing data scarcity in radiomics/radiogenomics models.
    METHODS: This study was conducted on a retrospectively collected cohort of 386 colorectal cancer patients (n = 2570 lesions) for whom matched contrast-enhanced CT images and TP53 gene mutational status were available. The full cohort data was divided into a training cohort (n = 2055 lesions) and an independent and fixed test set (n = 515 lesions). Differently sized training sets were subsampled from the training cohort to measure the impact of sample size on model performance and assess the added value of synthetic radiomic augmentation at different sizes. Five different tabular synthetic data generation models were used to generate synthetic radiomic data based on "real-world" radiomics data extracted from this cohort. The quality and reproducibility of the generated synthetic radiomic data were assessed. Synthetic radiomics were then combined with "real-world" radiomic training data to evaluate their impact on the predictive model's performance.
    RESULTS: A prediction model was generated using only "real-world" radiomic data, revealing the impact of data scarcity in this particular data set through a lack of predictive performance at low training sample numbers (n = 200, 400, 1000 lesions with average AUC = 0.52, 0.53, and 0.56 respectively, compared to 0.64 when using 2055 training lesions). Synthetic tabular data generation models created reproducible synthetic radiomic data with properties highly similar to "real-world" data (for n = 1000 lesions, average Chi-square = 0.932, average basic statistical correlation = 0.844). The integration of synthetic radiomic data consistently enhanced the performance of predictive models trained with small sample size sets (AUC enhanced by 9.6%, 11.3%, and 16.7% for models trained on n_samples = 200, 400, and 1000 lesions, respectively). In contrast, synthetic data generated from randomised/noisy radiomic data failed to enhance predictive performance, underlining the requirement for true signal data.
    CONCLUSIONS: Synthetic radiomic data, when combined with real radiomics, could enhance the performance of predictive models. Tabular synthetic data generation might help to overcome limitations in medical AI stemming from data scarcity.
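    As a hedged illustration of tabular synthesis (the paper evaluated five generator models, which are not reproduced here), the sketch below fits a simple Gaussian copula to real radiomic features and samples synthetic rows with matched marginals and correlation structure.

```python
# Gaussian-copula-style sketch of tabular synthetic data generation.
# This is a generic illustration, not one of the paper's five models.
import numpy as np
from scipy import stats

def fit_and_sample(real: np.ndarray, n_synthetic: int, seed: int = 0) -> np.ndarray:
    """Fit a Gaussian copula to real radiomic features and sample from it."""
    rng = np.random.default_rng(seed)
    n, d = real.shape
    # 1. Map each feature to normal scores via its empirical CDF.
    ranks = stats.rankdata(real, axis=0) / (n + 1)
    normal_scores = stats.norm.ppf(ranks)
    # 2. Estimate the correlation structure in normal-score space.
    corr = np.corrcoef(normal_scores, rowvar=False)
    # 3. Sample correlated normals and map back through empirical quantiles.
    samples = rng.multivariate_normal(np.zeros(d), corr, size=n_synthetic)
    u = stats.norm.cdf(samples)
    return np.column_stack([np.quantile(real[:, j], u[:, j]) for j in range(d)])

# real_radiomics = np.loadtxt("features.csv", delimiter=",")  # hypothetical file
# synth = fit_and_sample(real_radiomics, 1000)
# Training then proceeds on np.vstack([real_radiomics, synth]).
```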

  • Article type: Journal Article
    Data-driven machine learning (ML) provides a promising approach to understanding and predicting the rejection of trace organic contaminants (TrOCs) by polyamide (PA) membranes. However, various confounding variables, coupled with data scarcity, restrict the direct application of data-driven ML. In this study, we developed a data-knowledge co-driven ML model via domain-knowledge embedding and explored its application in understanding TrOC rejection by PA membranes. Domain-knowledge embedding enhanced both the predictive performance and the interpretability of the ML model. The contributions of the key mechanisms dominating the rejection of the three TrOC categories (neutral hydrophilic, neutral hydrophobic, and charged TrOCs), including size exclusion, the charge effect, and hydrophobic interaction, were quantified. Log D and molecular charge emerged as key factors underlying the discernible variations in rejection among the three TrOC categories. Furthermore, we quantitatively compared the TrOC rejection mechanisms between nanofiltration (NF) and reverse osmosis (RO) PA membranes. The charge effect and hydrophobic interactions carried higher weights in TrOC rejection for NF, while size exclusion played a more important role in RO. This study demonstrates the effectiveness of the data-knowledge co-driven ML method in understanding TrOC rejection by PA membranes, providing a methodology for formulating targeted TrOC removal strategies.
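    A minimal sketch of domain-knowledge embedding in this spirit: derive mechanism-informed features (size exclusion, charge effect, hydrophobicity) alongside raw descriptors and feed them to a standard regressor. The column names and formulas are illustrative assumptions, not the paper's model.

```python
# Hedged sketch of domain-knowledge embedding: mechanism-informed features are
# derived from physicochemical descriptors and fed to a standard regressor.
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

def add_mechanism_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Size exclusion: solute size relative to membrane pore size.
    out["size_ratio"] = df["molecular_width_nm"] / df["pore_size_nm"]
    # Charge effect: solute charge interacting with membrane surface charge.
    out["charge_interaction"] = df["molecular_charge"] * df["membrane_zeta_sign"]
    # Hydrophobic interaction proxied by log D at the experimental pH.
    out["hydrophobicity"] = df["logD"]
    return out

# df = pd.read_csv("troc_rejection.csv")  # hypothetical file and columns
# X = add_mechanism_features(df).drop(columns=["rejection"])
# model = GradientBoostingRegressor().fit(X, df["rejection"])
```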

  • Article type: Journal Article
    The RNA modification N4-acetylcytidine (ac4C) is enzymatically catalyzed by N-acetyltransferase 10 (NAT10) and plays an essential role across tRNA, rRNA, and mRNA. It influences various cellular functions, including mRNA stability and rRNA biosynthesis. Wet-lab detection of ac4C modification sites is highly resource-intensive and costly. Therefore, various machine learning and deep learning techniques have been employed for computational detection of ac4C modification sites. However, the known ac4C modification sites are too limited to train an accurate and stable prediction model. This study introduces GANSamples-ac4C, a novel framework that synergizes transfer learning and a generative adversarial network (GAN) to generate synthetic RNA sequences for training a better ac4C modification site prediction model. Comparative analysis reveals that GANSamples-ac4C outperforms existing state-of-the-art methods in identifying ac4C sites. Moreover, our results underscore the potential of synthetic data in mitigating data scarcity for biological sequence prediction tasks. Another major advantage of GANSamples-ac4C is its interpretable decision logic. Multi-faceted interpretability analyses detect key regions in the ac4C sequences that influence the discriminating decision between positive and negative samples, a pronounced enrichment of G in this region, and ac4C-associated motifs. These findings may offer novel insights for ac4C research. The GANSamples-ac4C framework and its source code are publicly accessible at http://www.healthinformaticslab.org/supp/.
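    The sketch below shows the generic input formulation behind GAN-based sequence synthesis of this kind: one-hot encoded RNA in, per-position base probabilities out. The sequence length and layer sizes are assumptions; the published GANSamples-ac4C architecture differs.

```python
# Illustrative one-hot formulation for GAN-based RNA sequence synthesis.
# Sizes are assumptions; this is not the published GANSamples-ac4C model.
import torch
import torch.nn as nn

BASES, SEQ_LEN, NOISE_DIM = "ACGU", 201, 100

def one_hot(seq: str) -> torch.Tensor:
    """Encode an RNA string as a (4, len) one-hot tensor (discriminator input)."""
    idx = torch.tensor([BASES.index(b) for b in seq])
    return nn.functional.one_hot(idx, num_classes=4).T.float()

generator = nn.Sequential(
    nn.Linear(NOISE_DIM, 256), nn.ReLU(),
    nn.Linear(256, 4 * SEQ_LEN),
)

z = torch.randn(8, NOISE_DIM)
logits = generator(z).view(8, 4, SEQ_LEN)
# Softmax over the base axis gives per-position base probabilities; argmax
# (or sampling) converts them to discrete synthetic sequences.
seqs = ["".join(BASES[i] for i in s) for s in logits.softmax(1).argmax(1)]
```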

  • Article type: Journal Article
    Several recent studies have evidenced the relevance of machine learning for soil salinity mapping using Sentinel-2 reflectance as input data and field soil salinity measurements (i.e., electrical conductivity, EC) as the target. As soil EC monitoring is costly and time-consuming, most learning databases used for training/validation rely on a limited number of soil samples, which can affect model consistency. Based on the low soil salinity variation at the Sentinel-2 pixel resolution, this study proposes to increase the learning database's number of observations by assigning the EC value obtained on the sampled pixel to the eight neighboring pixels. The method allowed extending the original learning database of 97 field EC measurements (OD) to an enhanced learning database of 691 observations (ED). Two classification machine-learning models (Random Forest, RF, and Support Vector Machine, SVM) were trained with both OD and ED to assess the efficiency of the proposed method by comparing the models' outcomes with EC observations not used in training. The use of ED led to a significant increase in both models' consistency, with the overall accuracy of the RF (SVM) model increasing from 0.25 (0.26) when using the OD to 0.77 (0.55) when using the ED, corresponding to improvements of approximately 208% and 111%, respectively. Besides the improved accuracy reached with the ED database, the results showed that the RF model provided better soil salinity estimations than the SVM model and that feature selection (Variance Inflation Factor, VIF, and/or Genetic Algorithm, GA) increases both models' reliability, with GA being the most efficient. This study highlights the potential of combining machine learning and Sentinel-2 imagery for soil salinity monitoring in a data-scarce context, and shows the importance of both model and feature selection for an optimal machine-learning set-up.
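    The neighbor-expansion step is simple enough to state in a few lines; the sketch below propagates each sampled pixel's EC value to its eight neighbors, turning one field measurement into up to nine training observations. The array layout is an assumption.

```python
# Sketch of the neighbor-expansion step described above: each sampled pixel's
# EC value is inherited by its eight neighbors.
import numpy as np

def expand_observations(reflectance: np.ndarray,
                        samples: list[tuple[int, int, float]]):
    """reflectance: (H, W, B) Sentinel-2 band stack.
    samples: (row, col, EC) field measurements.
    Returns features X and targets y including each sample's 8 neighbors."""
    H, W, _ = reflectance.shape
    X, y = [], []
    for r, c, ec in samples:
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                rr, cc = r + dr, c + dc
                if 0 <= rr < H and 0 <= cc < W:
                    X.append(reflectance[rr, cc])  # neighbor's spectrum
                    y.append(ec)                   # inherited EC label
    return np.array(X), np.array(y)

# X, y = expand_observations(s2_stack, ec_points)  # 97 points -> ~691 rows
```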

  • Article type: Journal Article
    Tumor boundary identification during colorectal cancer surgery can be challenging, and incomplete tumor removal occurs in approximately 10% of patients operated on for advanced rectal cancer. In this paper, a deep learning framework for automatic tumor segmentation in colorectal ultrasound images was developed to provide real-time guidance on resection margins using intra-operative ultrasound. A colorectal ultrasound dataset of 179 images from 74 patients was acquired, with ground-truth tumor annotations based on histopathology results. To address data scarcity, transfer learning techniques were used to optimize models pre-trained on breast ultrasound data for colorectal ultrasound data. A new custom gradient-based loss function (GWDice) was developed, which emphasizes the clinically relevant top margin of the tumor while training the networks. Lastly, ensemble learning methods were applied to combine the tumor segmentation predictions of multiple individual models and further improve overall tumor segmentation performance. Transfer learning outperformed training from scratch, with an average Dice coefficient over all individual networks of 0.78 compared to 0.68. The new GWDice loss function clearly decreased the average tumor margin prediction error from 1.08 mm to 0.92 mm without compromising the segmentation of the overall tumor contour. Ensemble learning further improved the Dice coefficient to 0.84 and the tumor margin prediction error to 0.67 mm. Using transfer and ensemble learning strategies, good tumor segmentation performance was achieved despite the relatively small dataset. The developed ultrasound segmentation model may contribute to more accurate colorectal tumor resections by providing real-time intra-operative feedback on tumor margins.
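    The exact GWDice formulation is not reproduced here; as a hypothetical sketch in its spirit, the loss below weights image rows so that pixels toward the clinically relevant top margin contribute more to the Dice overlap.

```python
# Hypothetical margin-weighted Dice loss in the spirit of GWDice: rows nearer
# the top of the image get larger weights. Not the authors' exact formulation.
import torch

def weighted_dice_loss(pred: torch.Tensor, target: torch.Tensor,
                       eps: float = 1e-6) -> torch.Tensor:
    """pred, target: (B, H, W) probabilities / binary masks."""
    B, H, W = target.shape
    # Linearly decaying row weights: 2.0 at the top row, 1.0 at the bottom.
    row_w = torch.linspace(2.0, 1.0, H).view(1, H, 1).expand(B, H, W)
    inter = (row_w * pred * target).sum(dim=(1, 2))
    denom = (row_w * (pred + target)).sum(dim=(1, 2))
    return 1.0 - (2.0 * inter + eps) / (denom + eps)

# loss = weighted_dice_loss(model(x).sigmoid(), masks).mean()
```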

  • Article type: Journal Article
    Large high-quality datasets are essential for building powerful artificial intelligence (AI) algorithms capable of supporting advancement in cardiac clinical research. However, researchers working with electrocardiogram (ECG) signals struggle to access and/or build one. The aim of the present work is to shed light on a potential solution to the lack of large and easily accessible ECG datasets. Firstly, the main causes of this lack are identified and examined. Afterward, the potential and limitations of cardiac data generation via deep generative models (DGMs) are analyzed in depth. These very promising algorithms have been found capable not only of generating large quantities of ECG signals but also of supporting data anonymization processes, simplifying data sharing while respecting patients' privacy. Their application could help research progress and cooperation in the name of open science. However, several aspects, such as standardized synthetic data quality evaluation and algorithm stability, need to be further explored.
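    On the open problem of standardized synthetic-data quality evaluation, one generic fidelity score sometimes used for signal generators is the maximum mean discrepancy (MMD) between real and synthetic batches; the sketch below is an illustration, not a metric endorsed by the review.

```python
# Generic fidelity check for synthetic signals: squared MMD with an RBF kernel
# between real and synthetic batches. An illustration, not from the review.
import numpy as np

def rbf_mmd2(x: np.ndarray, y: np.ndarray, sigma: float = 1.0) -> float:
    """x, y: (n, d) flattened signal windows; smaller value = more similar."""
    def k(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-d2 / (2 * sigma ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

# score = rbf_mmd2(real_windows, synthetic_windows)
```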