synthetic minority oversampling technique

  • Article type: Journal Article
    Human survival is unimaginable without air. Sustained development in almost every realm of modern society has adversely affected air quality. Daily industrial, transport, and domestic activities release hazardous pollutants into the environment. Monitoring and predicting air quality has therefore become essential, especially in developing countries such as India. Compared with traditional methods, prediction techniques based on machine learning have proved to be the most efficient tools for studying such modern hazards. The present work investigates six years of air pollution data from 23 Indian cities for air quality analysis and prediction. The dataset is preprocessed and key features are selected through correlation analysis. An exploratory data analysis is carried out to reveal hidden patterns in the dataset, and the pollutants that directly affect the air quality index are identified. A significant fall in almost all pollutants is observed in the pandemic year, 2020. The data imbalance problem is addressed with a resampling technique, and five machine learning models are employed to predict air quality. The results of these models are compared using standard metrics. The Gaussian Naive Bayes model achieves the highest accuracy, while the Support Vector Machine model exhibits the lowest. The performance of the models is also evaluated and compared through established performance parameters. The XGBoost model performs best among the models and attains the highest linearity between the predicted and actual data.
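    The abstract says only that class imbalance was handled "with a resampling technique"; the keyword this page is indexed under suggests the synthetic minority oversampling technique (SMOTE), but that is an inference, not something the abstract confirms. As a rough illustration of the idea, the sketch below creates synthetic minority samples by interpolating between a minority point and one of its nearest minority neighbours. The function name `smote`, its parameters, and the toy data are hypothetical and are not taken from the paper; this simplified variant also picks base points at random rather than iterating over every minority sample.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic points interpolated between minority samples.

    minority: list of equal-length feature vectors (lists of floats).
    k: number of nearest minority neighbours considered for each base point.
    All names here are illustrative assumptions, not the paper's code.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)  # cannot have more neighbours than other points
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the base point.
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the remaining minority samples by squared Euclidean distance.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        # Choose one of the k nearest neighbours and a random point on the
        # line segment between the base point and that neighbour.
        neighbour = rng.choice(others[:k])
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Toy example: four minority samples with two features each (made-up values).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote(minority, n_new=6, k=3)
balanced_minority = minority + new_points
```

    Because each synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies. In practice a maintained implementation (for example the one in the imbalanced-learn package) would normally be used, and resampling would be applied to the training split only.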