Oversampling

Literature (74 articles)

1 Oversampling for Enhanced Spatial Resolution of Zebrafish by Top-Hat IR-MALDESI-MSI.

Impact index: 3.262
Published: Jul 10, 2024
Journal: J Am Soc Mass Spectrom PMID: 38985437

DOI: 10.1021/jasms.4c00219
Article type: Journal Article

Mass spectrometry imaging (MSI) has become a significant tool for measuring chemical species in biological tissues, where much of the impact of these platforms lies in their capability to report the spatial distribution of analytes for correlation to sample morphology. As a result, enhancement of spatial resolution has become a frontier of innovation in the field, and necessary developments are dependent on the ionization source. More particularly, laser-based imaging sources may require modifications to the optical train or alternative sampling techniques. These challenges are heightened for systems with infrared (IR) lasers, as their operating wavelength generates spot sizes that are inherently larger than their ultraviolet counterparts. Recently, the infrared matrix-assisted laser desorption electrospray ionization (IR-MALDESI) source has shown the utility of a diffractive optical element (DOE) to produce square ablation patterns, termed top-hat IR-MALDESI. If the DOE optic is combined with oversampling methods, smaller ablation volumes can be sampled to render higher spatial resolution imaging experiments. Further, this approach enables reproducible spot sizes and ablation volumes for better comparison between scans. Herein, we investigate the utility of oversampling with top-hat IR-MALDESI to enhance the spatial resolution of measured lipids localized within the head of sectioned zebrafish tissue. Four different spatial resolutions were evaluated for data quality (e.g., mass measurement accuracy, spectral accuracy) and quantity of annotations. Other experimental parameters to consider for high spatial resolution imaging are also discussed. Ultimately, 20 μm spatial resolution was achieved in this work and supports feasibility for use in future IR-MALDESI studies.
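
The oversampling idea in this abstract can be sketched numerically. When the stage step is smaller than the laser spot, each new shot inside the raster ablates only the material not already removed, so the step (not the spot) sets the pixel size. The sketch below assumes a square top-hat spot and complete ablation per shot. The 100 μm spot size is a hypothetical value chosen for illustration and does not come from the paper; 20 μm is the resolution the abstract reports.

```python
def oversampled_pixel(spot_um: float, step_um: float) -> dict:
    """Geometry of oversampling with a square spot (illustrative assumptions only)."""
    if not 0 < step_um <= spot_um:
        raise ValueError("step must be positive and no larger than the spot")
    return {
        # Effective pixel edge equals the stage step.
        "pixel_um": step_um,
        # Fresh (not yet ablated) area sampled by each interior shot.
        "fresh_area_um2": step_um ** 2,
        # Share of the full spot area that each shot actually samples.
        "fraction_of_full_spot": (step_um / spot_um) ** 2,
    }


# Hypothetical 100 um spot, 20 um step as reported in the abstract.
result = oversampled_pixel(spot_um=100.0, step_um=20.0)
print(result)
```

The smaller sampled area per shot is also why the abstract evaluates data quality and annotation counts at each resolution: less material per pixel generally means less signal.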

2 Feature group partitioning: an approach for depression severity prediction with class balancing using machine learning algorithms.

Impact index: 4.612
Published: Jun 3, 2024
Journal: BMC Med Res Methodol PMID: 38831346

DOI: 10.1186/s12874-024-02249-8
Article type: Journal Article

In contemporary society, depression has emerged as a prominent mental disorder that exhibits exponential growth and exerts a substantial influence on premature mortality. Although numerous research applied machine learning methods to forecast signs of depression. Nevertheless, only a limited number of research have taken into account the severity level as a multiclass variable. Besides, maintaining the equality of data distribution among all the classes rarely happens in practical communities. So, the inevitable class imbalance for multiple variables is considered a substantial challenge in this domain. Furthermore, this research emphasizes the significance of addressing class imbalance issues in the context of multiple classes. We introduced a new approach Feature group partitioning (FGP) in the data preprocessing phase which effectively reduces the dimensionality of features to a minimum. This study utilized synthetic oversampling techniques, specifically Synthetic Minority Over-sampling Technique (SMOTE) and Adaptive Synthetic (ADASYN), for class balancing. The dataset used in this research was collected from university students by administering the Burn Depression Checklist (BDC). For methodological modifications, we implemented heterogeneous ensemble learning stacking, homogeneous ensemble bagging, and five distinct supervised machine learning algorithms. The issue of overfitting was mitigated by evaluating the accuracy of the training, validation, and testing datasets. To justify the effectiveness of the prediction models, balanced accuracy, sensitivity, specificity, precision, and f1-score indices are used. Overall, comprehensive analysis demonstrates the discrimination between the Conventional Depression Screening (CDS) and FGP approach. In summary, the results show that the stacking classifier for FGP with SMOTE approach yields the highest balanced accuracy, with a rate of 92.81%. 
The empirical evidence has demonstrated that the FGP approach, when combined with the SMOTE, able to produce better performance in predicting the severity of depression. Most importantly the optimization of the training time of the FGP approach for all of the classifiers is a significant achievement of this research.
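
Balanced accuracy, the headline metric in this abstract, is the mean of the per-class recalls, which is why it is preferred over plain accuracy for imbalanced multiclass problems. A minimal sketch follows; the severity labels and counts are invented for illustration and are not data from the study.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    total = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical example: a classifier that always predicts the majority class.
y_true = ["mild"] * 8 + ["severe"] * 2
y_pred = ["mild"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # looks good despite missing every "severe" case
print(balanced_accuracy(y_true, y_pred))  # exposes the failure on the minority class
```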

3 Structure-aware machine learning strategies for antimicrobial peptide discovery.

Impact index: 4.996
Published: May 25, 2024
Journal: Sci Rep PMID: 38796582

DOI: 10.1038/s41598-024-62419-y
Article type: Journal Article

Machine learning models are revolutionizing our approaches to discovering and designing bioactive peptides. These models often need protein structure awareness, as they heavily rely on sequential data. The models excel at identifying sequences of a particular biological nature or activity, but they frequently fail to comprehend their intricate mechanism(s) of action. To solve two problems at once, we studied the mechanisms of action and structural landscape of antimicrobial peptides as (i) membrane-disrupting peptides, (ii) membrane-penetrating peptides, and (iii) protein-binding peptides. By analyzing critical features such as dipeptides and physicochemical descriptors, we developed models with high accuracy (86-88%) in predicting these categories. However, our initial models (1.0 and 2.0) exhibited a bias towards α-helical and coiled structures, influencing predictions. To address this structural bias, we implemented subset selection and data reduction strategies. The former gave three structure-specific models for peptides likely to fold into α-helices (models 1.1 and 2.1), coils (1.3 and 2.3), or mixed structures (1.4 and 2.4). The latter depleted over-represented structures, leading to structure-agnostic predictors 1.5 and 2.5. Additionally, our research highlights the sensitivity of important features to different structure classes across models.

4 A comprehensive study on machine learning models combining with oversampling for bronchopulmonary dysplasia-associated pulmonary hypertension in very preterm infants.

Impact index: not available
Published: May 8, 2024
Journal: Respir Res PMID: 38720331

DOI: 10.1186/s12931-024-02797-z
Article type: Journal Article

BACKGROUND: Bronchopulmonary dysplasia-associated pulmonary hypertension (BPD-PH) remains a devastating clinical complication seriously affecting the therapeutic outcome of preterm infants. Hence, early prevention and timely diagnosis prior to pathological change is the key to reducing morbidity and improving prognosis. Our primary objective is to utilize machine learning techniques to build predictive models that could accurately identify BPD infants at risk of developing PH.
METHODS: The data utilized in this study were collected from neonatology departments of four tertiary-level hospitals in China. To address the issue of imbalanced data, an oversampling algorithm, the synthetic minority over-sampling technique (SMOTE), was applied to improve the model.
RESULTS: Seven hundred sixty one clinical records were collected in our study. Following data pre-processing and feature selection, 5 of the 46 features were used to build models, including duration of invasive respiratory support (day), the severity of BPD, ventilator-associated pneumonia, pulmonary hemorrhage, and early-onset PH. Four machine learning models were applied to predictive learning, and after comprehensive selection a model was ultimately selected. The model achieved 93.8% sensitivity, 85.0% accuracy, and 0.933 AUC. A score of the logistic regression formula greater than 0 was identified as a warning sign of BPD-PH.
CONCLUSIONS: We comprehensively compared different machine learning models and ultimately obtained a good prognosis model which was sufficient to support pediatric clinicians to make early diagnosis and formulate a better treatment plan for pediatric patients with BPD-PH.
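
The "score greater than 0" rule in the results can be read as a probability threshold. Assuming the score is the linear predictor (log-odds) of a standard logistic regression, which the abstract does not state explicitly, a score above 0 corresponds to a predicted probability above 0.5. The sketch below shows only that correspondence; it does not reproduce the paper's coefficients, which are not given here.

```python
import math


def sigmoid(score: float) -> float:
    """Map a logistic-regression linear predictor (log-odds) to a probability."""
    return 1.0 / (1.0 + math.exp(-score))


def is_warning(score: float) -> bool:
    """Decision rule described in the abstract: flag when the score exceeds 0."""
    return score > 0


for score in (-0.3, 0.0, 0.3):  # hypothetical scores
    print(score, round(sigmoid(score), 3), is_warning(score))
```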

5 Pruning-based oversampling technique with smoothed bootstrap resampling for imbalanced clinical dataset of Covid-19.

Impact index: 8.839
Published: Oct 2022
Journal: J King Saud Univ Comput Inf Sci PMID: 38620726

DOI: 10.1016/j.jksuci.2021.09.021
Article type: Journal Article

The Coronavirus Disease (COVID-19) was declared a pandemic disease by the World Health Organization (WHO), and it has not ended so far. Since the infection rate of the COVID-19 increases, the computational approach is needed to predict patients infected with COVID-19 in order to speed up the diagnosis time and minimize human error compared to conventional diagnoses. However, the number of negative data that is higher than positive data can result in a data imbalance situation that affects the classification performance, resulting in a bias in the model evaluation results. This study proposes a new oversampling technique, i.e., TRIM-SBR, to generate the minor class data for diagnosing patients infected with COVID-19. It is still challenging to develop the oversampling technique due to the data's generalization issue. The proposed method is based on pruning by looking for specific minority areas while retaining data generalization, resulting in minority data seeds that serve as benchmarks in creating new synthesized data using bootstrap resampling techniques. Accuracy, Specificity, Sensitivity, F-measure, and AUC are used to evaluate classifier performance in data imbalance cases. The results show that the TRIM-SBR method provides the best performance compared to other oversampling techniques.
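
The smoothed bootstrap step named in the title can be illustrated generically: draw a seed with replacement, then perturb it with a small amount of noise so the synthetic points are not exact copies. This is a sketch of the general smoothed bootstrap idea only, not the TRIM-SBR algorithm; the pruning stage is omitted, and the Gaussian kernel, the `bandwidth` value, and the seed points are all assumptions made for illustration.

```python
import random


def smoothed_bootstrap(seeds, n_new, bandwidth=0.1, seed=0):
    """Resample minority seeds with replacement and jitter each feature.

    `bandwidth` is the standard deviation of the Gaussian noise (an assumed kernel).
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(seeds)  # bootstrap draw, with replacement
        synthetic.append([x + rng.gauss(0.0, bandwidth) for x in base])
    return synthetic


# Hypothetical minority-class seeds with two features each.
minority_seeds = [[1.0, 2.0], [1.2, 1.9], [0.9, 2.1]]
new_points = smoothed_bootstrap(minority_seeds, n_new=5)
print(len(new_points), new_points[0])
```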

6 Transformer fault diagnosis method based on SMOTE and NGO-GBDT.

Impact index: 4.996
Published: Mar 26, 2024
Journal: Sci Rep PMID: 38531936

DOI: 10.1038/s41598-024-57509-w
Article type: Journal Article

In order to improve the accuracy of transformer fault diagnosis and improve the influence of unbalanced samples on the low accuracy of model identification caused by insufficient model training, this paper proposes a transformer fault diagnosis method based on SMOTE and NGO-GBDT. Firstly, the Synthetic Minority Over-sampling Technique (SMOTE) was used to expand the minority samples. Secondly, the non-coding ratio method was used to construct multi-dimensional feature parameters, and the Light Gradient Boosting Machine (LightGBM) feature optimization strategy was introduced to screen the optimal feature subset. Finally, Northern Goshawk Optimization (NGO) algorithm was used to optimize the parameters of Gradient Boosting Decision Tree (GBDT), and then the transformer fault diagnosis was realized. The results show that the proposed method can reduce the misjudgment of minority samples. Compared with other integrated models, the proposed method has high fault identification accuracy, low misjudgment rate and stable performance.
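
The SMOTE step mentioned here (and in several other abstracts on this page) creates each synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The following is a minimal pure-Python sketch of that core idea, not the implementation used in any of these papers; the function name, `k`, and the sample points are illustrative assumptions.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x (excluding x itself), by Euclidean distance.
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to nb
        synthetic.append([a + gap * (b - a) for a, b in zip(x, nb)])
    return synthetic


# Hypothetical minority samples on the corners of the unit square.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
samples = smote(minority, n_new=5)
print(samples)
```

Because every synthetic point lies on a segment between two existing minority points, SMOTE never leaves the region spanned by the minority class, but it can still place points where the majority class is dense.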

7 Enhancing and improving the performance of imbalanced class data using novel GBO and SSG: A comparative analysis.

Impact index: 9.657
Published: May 2024
Journal: Neural Netw PMID: 38335796

DOI: 10.1016/j.neunet.2024.106157
Article type: Journal Article

Class imbalance problem (CIP) in a dataset is a major challenge that significantly affects the performance of Machine Learning (ML) models resulting in biased predictions. Numerous techniques have been proposed to address CIP, including, but not limited to, Oversampling, Undersampling, and cost-sensitive approaches. Due to its ability to generate synthetic data, oversampling techniques such as the Synthetic Minority Oversampling Technique (SMOTE) are the most widely used methodology by researchers. However, one of SMOTE's potential disadvantages is that newly created minor samples overlap with major samples. Therefore, the probability of ML models' biased performance toward major classes increases. Generative adversarial network (GAN) has recently garnered much attention due to their ability to create real samples. However, GAN is hard to train even though it has much potential. Considering these opportunities, this work proposes two novel techniques: GAN-based Oversampling (GBO) and Support Vector Machine-SMOTE-GAN (SSG) to overcome the limitations of the existing approaches. The preliminary results show that SSG and GBO performed better on the nine imbalanced benchmark datasets than several existing SMOTE-based approaches. Additionally, it can be observed that the proposed SSG and GBO methods can accurately classify the minor class with more than 90% accuracy when tested with 20%, 30%, and 40% of the test data. The study also revealed that the minor sample generated by SSG demonstrates Gaussian distributions, which is often difficult to achieve using original SMOTE and SVM-SMOTE.

8 Evaluation of penalized and machine learning methods for asthma disease prediction in the Korean Genome and Epidemiology Study (KoGES).

Impact index: 3.307
Published: Feb 2, 2024
Journal: BMC Bioinformatics PMID: 38308205

DOI: 10.1186/s12859-024-05677-x
Article type: Journal Article

BACKGROUND: Genome-wide association studies have successfully identified genetic variants associated with human disease. Various statistical approaches based on penalized and machine learning methods have recently been proposed for disease prediction. In this study, we evaluated the performance of several such methods for predicting asthma using the Korean Chip (KORV1.1) from the Korean Genome and Epidemiology Study (KoGES).
RESULTS: First, single-nucleotide polymorphisms were selected via single-variant tests using logistic regression with the adjustment of several epidemiological factors. Next, we evaluated the following methods for disease prediction: ridge, least absolute shrinkage and selection operator, elastic net, smoothly clipped absolute deviation, support vector machine, random forest, boosting, bagging, naïve Bayes, and k-nearest neighbor. Finally, we compared their predictive performance based on the area under the curve of the receiver operating characteristic curves, precision, recall, F1-score, Cohen's Kappa, balanced accuracy, error rate, Matthews correlation coefficient, and area under the precision-recall curve. Additionally, three oversampling algorithms are used to deal with imbalance problems.
CONCLUSIONS: Our results show that penalized methods exhibit better predictive performance for asthma than that achieved via machine learning methods. On the other hand, in the oversampling study, random forest and boosting methods overall showed better prediction performance than penalized methods.
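
Of the metrics listed in the results, the Matthews correlation coefficient (MCC) is among the least sensitive to class imbalance because it uses all four cells of the confusion matrix. A minimal sketch for the binary case is below; the confusion counts are invented for illustration, and returning 0.0 when the denominator vanishes is a common convention rather than something stated in the abstract.

```python
import math


def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # convention when any marginal total is zero
    return (tp * tn - fp * fn) / denom


# Hypothetical imbalanced case: 10 positives, 90 negatives, everything predicted negative.
accuracy = (0 + 90) / 100
print(accuracy)                       # high accuracy
print(mcc(tp=0, fp=0, fn=10, tn=90))  # MCC shows no predictive skill
print(mcc(tp=5, fp=0, fn=0, tn=5))    # perfect prediction
```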

9 Identification of a Histone Deacetylase 8 Inhibitor through Drug Screenings Based on Machine Learning.

Impact index: 1.903
Published: 2024
Journal: Chem Pharm Bull (Tokyo) PMID: 38296560

DOI: 10.1248/cpb.c23-00577
Article type: Journal Article

Histone deacetylase 8 (HDAC8) is a zinc-dependent HDAC that catalyzes the deacetylation of nonhistone proteins. It is involved in cancer development and HDAC8 inhibitors are promising candidates as anticancer agents. However, most reported HDAC8 inhibitors contain a hydroxamic acid moiety, which often causes mutagenicity. Therefore, we used machine learning for drug screening and attempted to identify non-hydroxamic acids as HDAC8 inhibitors. In this study, we established a prediction model based on the random forest (RF) algorithm for screening HDAC8 inhibitors because it exhibited the best predictive accuracy in the training dataset, including data generated by the synthetic minority over-sampling technique (SMOTE). Using the trained RF-SMOTE model, we screened the Osaka University library for compounds and selected 50 virtual hits. However, the 50 hits in the first screening did not show HDAC8-inhibitory activity. In the second screening, using the RF-SMOTE model, which was established by retraining the dataset including 50 inactive compounds, we identified non-hydroxamic acid 12 as an HDAC8 inhibitor with an IC50 of 842 nM. Interestingly, its IC50 values for HDAC1 and HDAC3-inhibitory activity were 38 and 12 µM, respectively, showing that compound 12 has high HDAC8 selectivity. Using machine learning, we expanded the chemical space for HDAC8 inhibitors and identified non-hydroxamic acid 12 as a novel HDAC8 selective inhibitor.
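
The selectivity claim can be checked with simple arithmetic on the IC50 values quoted in the abstract (842 nM for HDAC8, 38 µM for HDAC1, 12 µM for HDAC3). Expressing selectivity as a ratio of IC50 values is a common convention assumed here; the abstract itself does not report fold values.

```python
def fold_selectivity(off_target_ic50_nm: float, target_ic50_nm: float) -> float:
    """Ratio of off-target to on-target IC50; larger means more selective for the target."""
    return off_target_ic50_nm / target_ic50_nm


HDAC8_NM = 842.0       # from the abstract
HDAC1_NM = 38_000.0    # 38 uM converted to nM
HDAC3_NM = 12_000.0    # 12 uM converted to nM

print(fold_selectivity(HDAC1_NM, HDAC8_NM))  # roughly 45-fold over HDAC1
print(fold_selectivity(HDAC3_NM, HDAC8_NM))  # roughly 14-fold over HDAC3
```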

10 A quantum-based oversampling method for classification of highly imbalanced and overlapped data.

Impact index: 4.088
Published: Dec 28, 2023
Journal: Exp Biol Med (Maywood) PMID: 38281087

DOI: 10.1177/15353702231220665
Article type: Journal Article

Data imbalance is a challenging problem in classification tasks, and when combined with class overlapping, it further deteriorates classification performance. However, existing studies have rarely addressed both issues simultaneously. In this article, we propose a novel quantum-based oversampling method (QOSM) to effectively tackle data imbalance and class overlapping, thereby improving classification performance. QOSM utilizes the quantum potential theory to calculate the potential energy of each sample and selects the sample with the lowest potential as the center of each cover generated by a constructive covering algorithm. This approach optimizes cover center selection and better captures the distribution of the original samples, particularly in the overlapping regions. In addition, oversampling is performed on the samples of the minority class covers to mitigate the imbalance ratio (IR). We evaluated QOSM using three traditional classifiers (support vector machines [SVM], k-nearest neighbor [KNN], and naive Bayes [NB] classifier) on 10 publicly available KEEL data sets characterized by high IRs and varying degrees of overlap. Experimental results demonstrate that QOSM significantly improves classification accuracy compared to approaches that do not address class imbalance and overlapping. Moreover, QOSM consistently outperforms existing oversampling methods tested. With its compatibility with different classifiers, QOSM exhibits promising potential to improve the classification performance of highly imbalanced and overlapped data.
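
The imbalance ratio (IR) that oversampling is meant to reduce is conventionally the majority-class count divided by the minority-class count. The sketch below computes it before and after plain random oversampling (duplicating minority samples), used here only as the simplest stand-in; it is not QOSM, and the label counts are invented for illustration.

```python
import random
from collections import Counter


def imbalance_ratio(labels) -> float:
    """Majority-class count divided by minority-class count."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


def random_oversample(labels, seed=0):
    """Duplicate randomly chosen minority labels until all classes match the majority."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out = list(labels)
    for cls, n in counts.items():
        pool = [y for y in labels if y == cls]
        out.extend(rng.choice(pool) for _ in range(target - n))
    return out


labels = [0] * 90 + [1] * 10  # hypothetical labels
balanced = random_oversample(labels)
print(imbalance_ratio(labels), imbalance_ratio(balanced))
```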
