Instance selection

  • Article type: Journal Article
    The dimension and size of data are growing rapidly with the extensive application of computer science and lab-based engineering in daily life. Such data often contain vagueness, the uncertainty that follows from it, redundancy, irrelevancy, and noise, which raise concerns in building effective learning models. Fuzzy rough sets and their extensions have been applied to these issues through various data reduction approaches. However, constructing a model that copes with all of these issues simultaneously remains a challenging task, and no study to date has addressed them all at once. This paper investigates a method based on the notions of intuitionistic fuzzy (IF) and rough sets to avoid these obstacles simultaneously by putting forward a data reduction technique. To accomplish this task, firstly, a novel IF similarity relation is introduced. Secondly, an IF rough set model is established on the basis of this similarity relation. Thirdly, an IF granular structure is presented using the established similarity relation and the lower approximation. Next, mathematical theorems are used to validate the proposed notions. Then, the importance degree of the IF granules is employed to eliminate redundant instances (size reduction). Further, significance-degree-preserving dimensionality reduction is discussed. Hence, simultaneous instance and feature selection can be performed for large volumes of high-dimensional data to eliminate redundancy and irrelevancy in both dimension and size, where vagueness and the ensuing uncertainty are handled with rough sets and IF sets respectively, whilst noise is tackled with the IF granular structure. Thereafter, a comprehensive experiment is carried out on benchmark datasets to demonstrate the effectiveness of the simultaneous feature and data point selection method. Finally, a framework aided by the proposed methodology is discussed to enhance regression performance for the IC50 of antiviral peptides.
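    For reference, a minimal background sketch in standard notation of the intuitionistic fuzzy set and the generic implicator-based lower approximation that IF rough set models build on; the paper's specific similarity relation and granular structure are not given in the abstract, so only the generic forms are shown.

```latex
% An intuitionistic fuzzy (IF) set A over a universe U assigns each x a
% membership degree \mu_A(x) and a non-membership degree \nu_A(x):
A = \{\, \langle x, \mu_A(x), \nu_A(x) \rangle : x \in U \,\}, \qquad
0 \le \mu_A(x) + \nu_A(x) \le 1 .

% The hesitancy (residual uncertainty) of x in A:
\pi_A(x) = 1 - \mu_A(x) - \nu_A(x) .

% Generic implicator-based lower approximation of A under a similarity
% relation R; the paper's IF rough set model refines this with its own
% IF-valued similarity relation:
\underline{R}A(x) = \inf_{y \in U} \mathcal{I}\bigl( R(x,y),\, A(y) \bigr) .
```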

  • Article type: Journal Article
    Telehealth services are becoming more and more popular, leading to an increasing amount of data to be monitored by health professionals. Machine learning can support them in managing these data, but the right machine learning algorithms need to be applied to the right data. We have implemented and validated different algorithms for selecting optimal time instances from time series data derived from a diabetes telehealth service. Intrinsic, supervised, and unsupervised instance selection algorithms were analysed. Instance selection had a large impact on the accuracy of our random forest model for dropout prediction. The best results were achieved with a one-class support vector machine, which improved the area under the receiver operating characteristic curve of the original algorithm from 69.91% to 75.88%. We conclude that, although hardly mentioned in the telehealth literature so far, instance selection has the potential to significantly improve the accuracy of machine learning algorithms.
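    A minimal sketch of the unsupervised variant described above, assuming a scikit-learn setup: a one-class SVM filters out likely outlier instances, and a random forest is then trained on the retained inliers. The synthetic data, features, and hyperparameters are placeholders, not the study's actual telehealth pipeline.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from sklearn.ensemble import RandomForestClassifier

def select_inliers_and_train(X_train, y_train, nu=0.1):
    """Keep only instances the one-class SVM marks as inliers, then train a RF.

    nu is the assumed fraction of instances to discard as outliers.
    """
    ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=nu)
    ocsvm.fit(X_train)
    inlier_mask = ocsvm.predict(X_train) == 1     # +1 = inlier, -1 = outlier

    rf = RandomForestClassifier(n_estimators=200, random_state=0)
    rf.fit(X_train[inlier_mask], y_train[inlier_mask])
    return rf, inlier_mask

# Synthetic data standing in for the telehealth time-series features.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 8))
y = (X[:, 0] + 0.5 * rng.normal(size=500) > 0).astype(int)
model, kept = select_inliers_and_train(X, y)
print(f"kept {kept.sum()} of {len(kept)} instances")
```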

  • Article type: Journal Article
    Amid the rapid proliferation of thousands of new websites daily, distinguishing safe ones from potentially harmful ones has become an increasingly complex task. These websites often collect user data, and, without adequate cybersecurity measures such as the efficient detection and classification of malicious URLs, users' sensitive information could be compromised. This study aims to develop models based on machine learning algorithms for the efficient identification and classification of malicious URLs, contributing to enhanced cybersecurity. Within this context, this study leverages support vector machines (SVMs), random forests (RFs), decision trees (DTs), and k-nearest neighbors (KNNs) in combination with Bayesian optimization to accurately classify URLs. To improve computational efficiency, instance selection methods are employed, including data reduction based on locality-sensitive hashing (DRLSH), border point extraction based on locality-sensitive hashing (BPLSH), and random selection. The results show the effectiveness of RFs in delivering high precision, recall, and F1 scores, with SVMs also providing competitive performance at the expense of increased training time. The results also emphasize the substantial impact of the instance selection method on the performance of these models, indicating its significance in the machine learning pipeline for malicious URL classification.
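    As a rough illustration of why locality-sensitive hashing helps here, the sketch below reduces a dataset by keeping one representative per random-hyperplane hash bucket. It shows only the generic LSH reduction idea, not the DRLSH or BPLSH algorithms evaluated in the study; the bit count and seed are arbitrary choices.

```python
import numpy as np

def lsh_reduce(X, y, n_bits=12, seed=0):
    """Keep one representative instance per random-hyperplane LSH bucket."""
    rng = np.random.default_rng(seed)
    planes = rng.normal(size=(X.shape[1], n_bits))
    codes = (X @ planes > 0)              # boolean signature per instance
    seen, keep = set(), []
    for i, row in enumerate(codes):
        key = row.tobytes()               # bucket identifier
        if key not in seen:               # first instance seen in this bucket
            seen.add(key)
            keep.append(i)
    keep = np.array(keep)
    return X[keep], y[keep], keep
```

    The reduced set returned here would then be fed to whichever classifier is being tuned (RF, SVM, DT, or KNN in the study), trading a small loss of local detail for faster training.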

  • Article type: Journal Article
    Support vector machines (SVMs) are powerful statistical learning tools, but their application to large datasets can lead to time-consuming training. To address this issue, various instance selection (IS) approaches have been proposed, which choose a small fraction of critical instances and screen out the others before training. However, existing methods have not been able to balance accuracy and efficiency well. Some methods miss critical instances, while others use complicated selection schemes that require even more execution time than training with all original instances, thus violating the initial intention of IS. In this work, we present a newly developed IS method called Valid Border Recognition (VBR). VBR selects the closest heterogeneous neighbors as valid border instances and incorporates this process into the creation of a reduced Gaussian kernel matrix, thus minimizing the execution time. To improve reliability, we propose a strengthened version of VBR (SVBR). Based on VBR, SVBR gradually adds farther heterogeneous neighbors as complements until the Lagrange multipliers of the already selected instances become stable. In numerical experiments, the effectiveness of the proposed methods is verified on benchmark and synthetic datasets in terms of accuracy, execution time, and inference time.
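    A rough sketch of the "closest heterogeneous neighbor" idea in isolation, assuming scikit-learn: every instance nominates its nearest opposite-class neighbor, and the union of nominees serves as the border set on which an SVM is trained. The actual VBR method folds this step into building a reduced Gaussian kernel matrix, which is not reproduced here.

```python
import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.svm import SVC

def select_border_instances(X, y):
    """Return indices of instances that are someone's closest opposite-class neighbor."""
    D = pairwise_distances(X)
    same_class = y[:, None] == y[None, :]
    D[same_class] = np.inf                # ignore neighbors of the same class
    nhn = D.argmin(axis=1)                # nearest heterogeneous neighbor of each instance
    return np.unique(nhn)                 # these points lie near the class border

# Illustrative usage on two synthetic Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, (300, 2)), rng.normal(2.5, 1.0, (300, 2))])
y = np.repeat([0, 1], 300)
idx = select_border_instances(X, y)
svm = SVC(kernel="rbf", gamma="scale").fit(X[idx], y[idx])
print(f"trained on {len(idx)} of {len(y)} instances")
```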

  • Article type: Journal Article
    EEG signals can non-invasively monitor brain activity and have been widely used in brain-computer interfaces (BCI). One research area is to recognize emotions objectively through EEG. In practice, people's emotions change over time; however, most existing affective BCIs process data and recognize emotions offline and thus cannot be applied to real-time emotion recognition.
    To solve this problem, we introduce an instance selection strategy into transfer learning and propose a simplified style transfer mapping algorithm. In the proposed method, informative instances are first selected from the source domain data, and the update strategy of the hyperparameters for style transfer mapping is also simplified, making model training quicker and more accurate for a new subject.
    To verify the effectiveness of our algorithm, we carry out experiments on SEED, SEED-IV, and an offline dataset collected by ourselves, achieving recognition accuracies of up to 86.78%, 82.55%, and 77.68% with computing times of 7 s, 4 s, and 10 s, respectively. Furthermore, we also develop a real-time emotion recognition system that integrates modules for EEG signal acquisition, data processing, emotion recognition, and result visualization.
    The results of both the offline and online experiments show that the proposed algorithm can accurately recognize emotions in a short time, meeting the needs of real-time emotion recognition applications.
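    The abstract does not state how informative source instances are chosen, so the sketch below substitutes one simple, hypothetical criterion: keep the source-domain feature vectors closest to the centroid of the new subject's calibration data before applying any transfer mapping. It is only an illustration of source-instance filtering, not the paper's selection rule.

```python
import numpy as np

def select_source_instances(X_source, X_target, keep_ratio=0.5):
    """Keep the source instances closest to the target-domain centroid.

    A hypothetical stand-in for the paper's 'informative instance' criterion.
    """
    centroid = X_target.mean(axis=0)
    dist = np.linalg.norm(X_source - centroid, axis=1)
    n_keep = max(1, int(keep_ratio * len(X_source)))
    return np.argsort(dist)[:n_keep]      # indices of the closest source instances
```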

  • Article type: Journal Article
    OBJECTIVE: Deep learning (DL) has been applied in proofs of concept across biomedical imaging, including across modalities and medical specialties. Labeled data are critical to training and testing DL models, but human expert labelers are limited. In addition, DL traditionally requires copious training data, which are computationally expensive to process and iterate over. Consequently, it is useful to prioritize using those images that are most likely to improve a model's performance, a practice known as instance selection. The challenge is determining how best to prioritize. It is natural to prefer straightforward, robust, quantitative metrics as the basis for prioritization; however, in current practice, such metrics are not tailored to, and almost never used for, image datasets.
    METHODS: To address this problem, we introduce ENRICH (Eliminate Noise and Redundancy for Imaging Challenges), a customizable method that prioritizes images based on how much diversity each image adds to the training set.
    RESULTS: First, we show that medical datasets are special in that, in general, each image adds less diversity than in nonmedical datasets. Next, we demonstrate that ENRICH achieves nearly maximal performance on classification and segmentation tasks on several medical image datasets using only a fraction of the available images and without up-front data labeling. ENRICH outperforms random image selection, the negative control. Finally, we show that ENRICH can also be used to identify errors and outliers in imaging datasets.
    CONCLUSIONS: ENRICH is a simple, computationally efficient method for prioritizing images for expert labeling and use in DL.
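    A minimal sketch of the diversity intuition, assuming images have already been mapped to embedding vectors: greedy farthest-point (max-min) selection ranks images by how much they add to what has already been selected. This illustrates the general notion of "diversity added", not ENRICH's actual scoring procedure.

```python
import numpy as np

def diversity_ranking(embeddings, n_select):
    """Greedily pick images that are farthest from everything selected so far."""
    selected = [0]                                    # start from an arbitrary image
    min_dist = np.linalg.norm(embeddings - embeddings[0], axis=1)
    while len(selected) < n_select:
        nxt = int(min_dist.argmax())                  # image adding the most diversity
        selected.append(nxt)
        d = np.linalg.norm(embeddings - embeddings[nxt], axis=1)
        min_dist = np.minimum(min_dist, d)            # distance to the selected set
    return selected
```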

  • Article type: Journal Article
    Early detection of cancers has been much explored due to its paramount importance in biomedical fields. Among the different types of data used to answer this biological question, studies based on T cell receptors (TCRs) have recently come under the spotlight due to the growing appreciation of the roles of the host immune system in tumor biology. However, the one-to-many correspondence between a patient and multiple TCR sequences hinders researchers from simply adopting classical statistical/machine learning methods. There have been recent attempts to model this type of data in the context of multiple instance learning (MIL). Despite the novel application of MIL to cancer detection using TCR sequences and the demonstrated adequate performance in several tumor types, there is still room for improvement, especially for certain cancer types. Furthermore, explainable neural network models have not been fully investigated for this application. In this article, we propose multiple instance neural networks based on sparse attention (MINN-SA) to enhance performance in cancer detection and explainability. The sparse attention structure drops out uninformative instances in each bag, achieving both interpretability and better predictive performance in combination with the skip connection. Our experiments show that MINN-SA yields the highest area under the ROC curve on average across 10 different types of cancer, compared to existing MIL approaches. Moreover, we observe from the estimated attentions that MINN-SA can identify the TCRs that are specific for tumor antigens in the same T cell repertoire.
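    A small numerical sketch of sparse attention pooling over a bag of instances, using plain NumPy and a fixed scoring vector: only the top-k scoring instances receive attention weight, the rest are dropped, and the kept instances are softmax-weighted into one bag-level representation. This is a generic illustration of the mechanism, not the MINN-SA network itself.

```python
import numpy as np

def sparse_attention_pool(instance_feats, w, k):
    """Top-k sparse attention pooling over a bag.

    instance_feats: (n_instances, d) array; w: (d,) scoring vector; k: instances kept.
    """
    scores = instance_feats @ w                   # one attention logit per instance
    top = np.argsort(scores)[-k:]                 # keep only the k highest-scoring instances
    kept = scores[top]
    weights = np.exp(kept - kept.max())
    weights /= weights.sum()                      # softmax over the kept instances only
    bag_repr = weights @ instance_feats[top]      # (d,) bag-level representation
    return bag_repr, top, weights
```

    In a full model, w would be learned and the bag representation would feed a classifier head; the point here is only that instances outside the top-k contribute nothing, which is what makes the attention pattern interpretable.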

  • Article type: Journal Article
    The COVID-19 pandemic rapidly put heavy pressure on hospital centers, especially on intensive care units. There was an urgent need for tools to understand the typology of COVID-19 patients and identify those most at risk of aggravation during their hospital stay. The data included more than 400 patients hospitalized due to COVID-19 during the first wave in France (spring of 2020), with clinical and biological features. Machine learning and explainability methods were used to construct an aggravation risk score and to analyze feature effects. The model had a robust AUC-ROC score of 81%. The most important features were age, chest CT severity, and biological variables such as CRP, O2 saturation, and eosinophils. Several features showed strong non-linear effects, especially CT severity. Interaction effects were also detected between age and gender, as well as between age and eosinophils. Clustering techniques stratified inpatients into three main subgroups (low aggravation risk with no risk factor, medium risk due to high age, and high risk mainly due to high CT severity and abnormal biological values). This in-depth analysis identified significantly distinct typologies of inpatients, which facilitated the definition of medical protocols to deliver the most appropriate care for each profile. The graphical abstract represents the main methods used and results found, with a focus on feature impact on aggravation risk and the identified groups of patients.

  • Article type: Journal Article
    OBJECTIVE: For the image classification problem, the construction of appropriate training data is important for improving the generalization ability of the classifier, particularly when the training data are small. We propose a method that quantitatively evaluates the typicality of a hematoxylin-and-eosin (H&E)-stained tissue slide from a set of immunohistochemical (IHC) stains and applies this typicality to instance selection for the construction of classifiers that predict the subtype of malignant lymphoma, in order to improve the generalization ability.
    METHODS: We define the typicality of the H&E-stained tissue slides by the ratio of the probability densities of the IHC staining patterns on a low-dimensional embedding space. Employing a multiple-instance-learning-based convolutional neural network to construct the subtype classifier without annotations indicating cancerous regions in whole-slide images, we select the training data by referring to the evaluated typicality to improve the generalization ability. We demonstrate the effectiveness of instance selection based on the proposed typicality in a three-class subtype classification of 262 malignant lymphoma cases.
    RESULTS: In the experiment, we confirmed that the subtypes of typical instances could be predicted more accurately than those of atypical instances. Furthermore, instance selection of the training data based on the proposed typicality improved the generalization ability of the classifier: the classification accuracy improved from 0.664 with the baseline method to 0.683 when the training data were constructed focusing on typical instances.
    CONCLUSIONS: The experimental results showed that the typicality of H&E-stained tissue slides computed from IHC staining patterns is useful as a criterion for instance selection to enhance the generalization ability, and that this typicality can be employed for instance selection under some practical limitations.
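    One plausible reading of the density-ratio typicality, sketched with scikit-learn kernel density estimates: score each case by the within-subtype density over the overall density of the embedded IHC staining patterns. The embedding, the bandwidth, and this particular ratio interpretation are all assumptions for illustration, not the authors' definitions.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def typicality_scores(Z, labels, target_label, bandwidth=0.5):
    """Density-ratio typicality on a low-dimensional embedding Z (n_cases, n_dims)."""
    kde_sub = KernelDensity(bandwidth=bandwidth).fit(Z[labels == target_label])
    kde_all = KernelDensity(bandwidth=bandwidth).fit(Z)
    log_ratio = kde_sub.score_samples(Z) - kde_all.score_samples(Z)
    return np.exp(log_ratio)              # higher = more typical of the target subtype

# Illustrative selection of the most typical cases for training:
# typ = typicality_scores(Z, labels, target_label=0)
# train_idx = np.argsort(typ)[::-1][:n_train]
```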

  • Article type: Journal Article
    The purpose of instance selection is to reduce the data size while preserving as much of the useful information stored in the data as possible, and to detect and remove erroneous and redundant information. In this work, we analyze instance selection in regression tasks and apply the NSGA-II multi-objective evolutionary algorithm to direct the search for the optimal subset of the training dataset, with the k-NN algorithm used to evaluate the solutions during the selection process. A key advantage of the method is that it yields a pool of solutions situated on the Pareto front, each of which is best for a certain RMSE-compression balance. We discuss different parameters of the process and their influence on the results, and put special effort into reducing the computational complexity of our approach. The experimental evaluation shows that the proposed method achieves good performance in terms of minimizing both the prediction error and the dataset size.
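    A sketch of the two objectives such a search would evaluate for each candidate subset, assuming a scikit-learn k-NN regressor and a held-out validation split: the RMSE of a k-NN fitted on the selected instances, and the fraction of instances retained. The NSGA-II loop itself (population, crossover, non-dominated sorting) is omitted; this is only the fitness evaluation, not the authors' implementation.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error

def evaluate_mask(mask, X_train, y_train, X_val, y_val, k=5):
    """Return (RMSE, retained fraction) for one boolean instance-selection mask.

    Both objectives are minimized; an NSGA-II-style search evolves a population
    of such masks and keeps the Pareto-optimal RMSE-compression trade-offs.
    """
    if mask.sum() < k:                    # too few instances to fit a k-NN
        return np.inf, 1.0
    knn = KNeighborsRegressor(n_neighbors=k).fit(X_train[mask], y_train[mask])
    rmse = mean_squared_error(y_val, knn.predict(X_val)) ** 0.5
    retention = mask.mean()               # lower = stronger compression
    return rmse, retention
```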