Supervised learning

  • Article type: Journal Article
    This study aims to employ supervised machine learning algorithms to examine factors that negatively impacted academic performance among college students on probation (underperforming students). We used the Knowledge Discovery in Databases (KDD) methodology on a sample of N = 6514 college students spanning 11 years (from 2009 to 2019) provided by a major public university in Oman. We used the Information Gain (InfoGain) algorithm to select the most effective features, and ensemble methods to compare the accuracy with more robust algorithms, including LogitBoost, Vote, and Bagging. The algorithms were evaluated on performance metrics such as accuracy, precision, recall, F-measure, and ROC curve, and then validated using 10-fold cross-validation. The study revealed that the main factors affecting student academic achievement include study duration in the university and previous performance in secondary school. In the experimental results, these features were consistently ranked as the top factors that negatively impacted academic performance. The study also indicated that gender, estimated graduation year, cohort, and academic specialization contributed significantly to whether a student was on probation. Domain experts and other students were involved in verifying some of the results. The theoretical and practical implications of this study are discussed.
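
The feature-ranking step mentioned in the abstract above can be illustrated with a minimal sketch of information gain, computed as the entropy of the class label minus the weighted entropy after splitting on a feature. The abstract does not give the study's implementation, so the function names, the feature names, and the toy records below are all hypothetical and serve only to show the calculation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the weighted label entropy after splitting on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(group) / total * entropy(group) for group in groups.values())
    return entropy(labels) - remainder


# Hypothetical toy records (not data from the study).
on_probation = ["yes", "yes", "no", "no", "yes", "no", "no", "no"]
features = {
    "long_study_duration": ["y", "y", "n", "n", "y", "n", "n", "n"],
    "gender": ["f", "m", "f", "m", "f", "m", "f", "m"],
}

# Rank features from most to least informative about the label.
ranking = sorted(
    features,
    key=lambda name: information_gain(features[name], on_probation),
    reverse=True,
)
print(ranking)  # ['long_study_duration', 'gender']
```

In this toy data the first feature separates the two classes perfectly, so its information gain equals the full label entropy, while the second feature carries almost no information.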

  • Article type: Journal Article
    Traffic accidents are of worldwide concern, as they are one of the leading causes of death globally. One policy designed to cope with them is the design and deployment of road safety systems. These aim to predict crashes based on historical records, provided by new Internet of Things (IoT) technologies, to enhance traffic flow management and promote safer roads. Increasing data availability has helped machine learning (ML) to address the prediction of crashes and their severity. The literature reports numerous contributions regarding survey papers, experimental comparisons of various techniques, and the design of new methods at the point where crash severity prediction (CSP) and ML converge. Despite such progress, and as far as we know, there are no comprehensive research articles that theoretically and practically approach the model selection problem (MSP) in CSP. Thus, this paper introduces a bibliometric analysis and experimental benchmark of ML and automated machine learning (AutoML) as a suitable approach to automatically address the MSP in CSP. Firstly, 2318 bibliographic references were consulted to identify relevant authors, trending topics, keyword evolution, and the most common ML methods used in related case studies, which revealed an opportunity for the use of AutoML in the transportation field. Then, we compared AutoML (AutoGluon, Auto-sklearn, TPOT) and ML (CatBoost, Decision Tree, Extra Trees, Gradient Boosting, Gaussian Naive Bayes, Light Gradient Boosting Machine, Random Forest) methods in three case studies using open data portals belonging to the cities of Medellín, Bogotá, and Bucaramanga in Colombia. Our experimentation reveals that AutoGluon and CatBoost are competitive and robust ML approaches to deal with various CSP problems. In addition, we concluded that general-purpose AutoML effectively supports the MSP in CSP without developing domain-focused AutoML methods for this supervised learning problem. Finally, based on the results obtained, we introduce challenges and research opportunities that the community should explore to enhance the contributions that ML and AutoML can bring to CSP and other transportation areas.

  • Article type: Journal Article
    Supplemental material is available for this article. Keywords: Conventional Radiography, Thorax, Trauma, Ribs, Catheters, Segmentation, Diagnosis, Classification, Supervised Learning, Machine Learning © RSNA, 2021.

  • Article type: Journal Article
    Currently, the identification of infectious disease re-emergence is performed without describing specific quantitative criteria that can be used to identify re-emergence events consistently. This practice may lead to ineffective mitigation. In addition, identification of factors contributing to local disease re-emergence and assessment of global disease re-emergence require access to data about disease incidence and a large number of factors at the local level for the entire world. This paper presents Re-emerging Disease Alert (RED Alert), a web-based tool designed to help public health officials detect and understand infectious disease re-emergence.
    Our objective is to bring together a variety of disease-related data and analytics needed to help public health analysts answer the following 3 primary questions for detecting and understanding disease re-emergence: Is there a potential disease re-emergence at the local (country) level? What are the potential contributing factors for this re-emergence? Is there a potential for global re-emergence?
    We collected and cleaned disease-related data (eg, case counts, vaccination rates, and indicators related to disease transmission) from several data sources including the World Health Organization (WHO), Pan American Health Organization (PAHO), World Bank, and Gideon. We combined these data with machine learning and visual analytics into a tool called RED Alert to detect re-emergence for the following 4 diseases: measles, cholera, dengue, and yellow fever. We evaluated the performance of the machine learning models for re-emergence detection and reviewed the output of the tool through a number of case studies.
    Our supervised learning models were able to identify 82%-90% of the local re-emergence events, although with 18%-31% (except 46% for dengue) false positives. This is consistent with our goal of identifying all possible re-emergences while allowing some false positives. The review of the web-based tool through case studies showed that local re-emergence detection was possible and that the tool provided actionable information about potential factors contributing to the local disease re-emergence and trends in global disease re-emergence.
    To the best of our knowledge, this is the first tool that focuses specifically on disease re-emergence and addresses the important challenges mentioned above.
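
The detection figures quoted in the abstract above (share of re-emergence events identified, and share of false positives) can be related to a confusion matrix as in the following minimal sketch. The abstract does not say which denominator its false-positive percentage uses, so the sketch reports both common readings; the function name and the toy labels are hypothetical and are not the study's data.

```python
def detection_metrics(actual, predicted):
    """Compute recall and two false-positive measures from boolean labels.

    Raises ZeroDivisionError if a denominator is zero (for example, no
    actual events), which a production version would need to handle.
    """
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a and p)          # events correctly flagged
    fn = sum(1 for a, p in pairs if a and not p)      # events missed
    fp = sum(1 for a, p in pairs if not a and p)      # non-events flagged
    tn = sum(1 for a, p in pairs if not a and not p)  # non-events left alone
    return {
        "recall": tp / (tp + fn),                # share of true events detected
        "false_positive_rate": fp / (fp + tn),   # share of non-events flagged
        "false_discovery_rate": fp / (tp + fp),  # share of alerts that were wrong
    }


# Hypothetical toy labels: True means a re-emergence event.
actual = [True, True, True, True, True, False, False, False, False, False]
predicted = [True, True, True, True, False, True, False, False, False, False]
metrics = detection_metrics(actual, predicted)
print(metrics)
```

A tool that aims to catch every possible re-emergence, as the abstract describes, would favor high recall and accept a higher value for either false-positive measure.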

  • Article type: Journal Article
    Whole transcriptome studies typically yield large amounts of data, with expression values for all genes or transcripts of the genome. The search for genes of interest in a particular study setting can thus be a daunting task, usually relying on automated computational methods. Moreover, most biological questions imply that such a search should be performed in a multivariate setting, to take into account inter-gene relationships. Differential expression analysis commonly yields large lists of genes deemed significant, even after adjustment for multiple testing, making the subsequent study possibilities extensive. Here, we explore the use of supervised learning methods to rank large ensembles of genes defined by their expression values measured with RNA-Seq in a typical two-class sample set. First, we use one of the variable importance measures generated by the random forests classification algorithm as a metric to rank genes. Second, we define the EPS (extreme pseudo-samples) pipeline, making use of VAEs (Variational Autoencoders) and regressors to extract a ranking of genes while leveraging the feature space of both virtual and comparable samples. We show that, on 12 cancer RNA-Seq data sets ranging from 323 to 1,210 samples, using either a random forests-based gene selection method or the EPS pipeline outperforms differential expression analysis for 9 and 8 out of the 12 data sets, respectively, in terms of identifying subsets of genes associated with survival. These results demonstrate the potential of supervised learning-based gene selection methods in RNA-Seq studies and highlight the need to use such multivariate gene selection methods alongside the widely used differential expression analysis.

  • Article type: Journal Article
    The field of machine learning has allowed researchers to generate and analyse vast amounts of data using a wide variety of methodologies. Artificial Neural Networks (ANN) are some of the most commonly used statistical models and have been successful in biomarker discovery studies in multiple disease types. This review seeks to explore and evaluate an integrated ANN pipeline for biomarker discovery and validation in Alzheimer's disease, the most common form of dementia worldwide with no proven cause and no available cure. The proposed pipeline consists of analysing public data with a categorical and continuous stepwise algorithm and further examination through network inference to predict gene interactions. This methodology can reliably generate novel markers and further examine known ones and can be used to guide future research in Alzheimer's disease.

  • Article type: Journal Article
    The artificial neural network (ANN) is one of the most widely used methods for developing accurate predictive models based on artificial intelligence and machine learning. In the present study, the important practical aspects of developing a reliable ANN model (e.g., appropriate assignment of the number of neurons, number of hidden layers, transfer function, training algorithm, dataset division, and initialization of the network) are discussed. As a case study, the predictability of the flash point for a dataset of 740 organic compounds using ANNs was investigated via a total of 484,220 ANNs, to cover a wide range of parameters affecting the performance of an ANN. Among all studied parameters, the number of neurons or layers was found to be the most important parameter for developing a reliable ANN with low overfitting risk. To evaluate the appropriate number of neurons and layers, a value equal to or greater than 10 for the ratio of the training samples to the ANN constants was suggested as a rule of thumb. Moreover, a strategy for evaluating the authentic performance of ANNs and deciding on the reliability of an ANN model was proposed, which is applicable to other models developed by supervised learning. Based on the introduced considerations, an ANN model was proposed for predicting the flash point of pure organic compounds. According to the results, the new model was found to produce the lowest error compared to other available models.
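
The rule of thumb in the abstract above (at least 10 training samples per ANN constant) can be checked with a short sketch. It assumes that "ANN constants" means the trainable weights and biases of a fully connected feed-forward network, which the abstract does not state explicitly. The layer sizes and the training-set size of 518 are hypothetical, since the abstract gives only the total of 740 compounds.

```python
def count_constants(layer_sizes):
    """Count weights and biases in a fully connected feed-forward network.

    layer_sizes lists the number of units per layer, input layer first.
    Each layer transition contributes (inputs + 1 bias) * outputs constants.
    """
    return sum((n_in + 1) * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def satisfies_rule_of_thumb(n_training_samples, layer_sizes, min_ratio=10.0):
    """Return True if training samples per constant reaches min_ratio."""
    return n_training_samples / count_constants(layer_sizes) >= min_ratio


# Hypothetical split: 518 of the 740 compounds used for training.
n_train = 518
small_net = [5, 4, 1]   # 5 inputs, 4 hidden neurons, 1 output (assumed sizes)
large_net = [5, 10, 1]  # same inputs and output, 10 hidden neurons

print(count_constants(small_net))                   # 29
print(count_constants(large_net))                   # 71
print(satisfies_rule_of_thumb(n_train, small_net))  # True
print(satisfies_rule_of_thumb(n_train, large_net))  # False
```

Under these assumptions, adding hidden neurons quickly pushes the ratio below 10, which matches the abstract's point that the number of neurons or layers drives overfitting risk.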

  • Article type: Journal Article
    Recent advances in computer technology offer the medical profession specialized tools for gathering medical data, processing power, and fast storage and retrieval capabilities. Artificial intelligence (AI), an emerging field of computer science, studies the issues of human problem solving and decision making. Furthermore, rule-based systems and knowledge-based systems, which are other fields of AI, have been adopted by many scientists in an effort to develop intelligent medical diagnostic systems. In this study, artificial neural networks (ANN) are introduced as a tool for building an intelligent diagnostic system; the system does not attempt to replace the physician as the decision maker but to enhance one's facilities for reaching a correct decision. An integrated diagnostic system for assessing certain neuromuscular disorders is used in this study as an example for demonstrating the proposed methodology. The diagnostic system is composed of modules that independently provide numerical data to the system from the clinical examination of a patient and from the various laboratory tests that are performed. The examination procedure has been standardized by developing protocols for each specialized area, in cooperation with experts in the area. At the conclusion of the clinical examination and laboratory tests, data in the form of a numerical vector represent a medical examination snapshot of the subject. ANN models were developed using the unsupervised self-organizing feature maps algorithm. Data from 71 subjects were collected. The ANN models were trained with the data from 41 subjects and tested with the data from the remaining 30 subjects. Two sets of models were developed: those trained with the data from only the clinical examinations, and those trained by combining the clinical and the laboratory test data. The diagnostic yield obtained for the unknown cases is in the region of 73 to 93% for the models trained with only the clinical data, and in the region of 73 to 100% for those trained by combining both the clinical and laboratory data. The pictorial representation of the diagnostic models through the self-organized two-dimensional feature maps provides the physician with a friendly human-computer interface and a comprehensive tool that can be used for further observations, for example in monitoring the disease progression of a subject.
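
The self-organizing feature map named in the abstract above can be sketched in a few lines. This is a generic textbook-style version, not the authors' implementation: the grid size, the linearly decaying learning rate, the Gaussian neighborhood, and the toy two-dimensional samples are all assumptions made for illustration.

```python
import math
import random


def best_matching_unit(weights, sample):
    """Return the grid coordinate whose weight vector is closest to the sample."""
    return min(
        weights,
        key=lambda node: sum((w - x) ** 2 for w, x in zip(weights[node], sample)),
    )


def train_som(samples, rows, cols, epochs=50, seed=0):
    """Train a rows x cols self-organizing map on a list of numeric vectors."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Each grid node starts with a random weight vector in [0, 1).
    weights = {
        (r, c): [rng.random() for _ in range(dim)]
        for r in range(rows)
        for c in range(cols)
    }
    for epoch in range(epochs):
        remaining = 1.0 - epoch / epochs          # decays linearly toward 0
        learning_rate = 0.5 * remaining           # assumed schedule
        radius = max(rows, cols) / 2 * remaining + 0.5
        for sample in samples:
            bmu_row, bmu_col = best_matching_unit(weights, sample)
            for (r, c), w in weights.items():
                grid_dist2 = (r - bmu_row) ** 2 + (c - bmu_col) ** 2
                # Nodes near the best matching unit on the grid move more.
                influence = math.exp(-grid_dist2 / (2 * radius ** 2))
                for i in range(dim):
                    w[i] += learning_rate * influence * (sample[i] - w[i])
    return weights


# Hypothetical examination vectors scaled to [0, 1] (not data from the study).
samples = [[0.1, 0.1], [0.15, 0.05], [0.9, 0.95], [0.85, 0.9]]
som = train_som(samples, rows=3, cols=3)
print(best_matching_unit(som, [0.1, 0.1]))
```

To use such a map for diagnosis, as the abstract describes, one common approach is to label each grid node with the diagnoses of the training subjects that map to it and then assign an unseen subject the label of its best matching unit.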