Keywords: land use; particulate matter; random forest; regression modelling; sensitivity analysis

Source: DOI:10.3390/s24134193

Abstract:
Machine learning (ML) methods are widely used in particulate matter prediction modelling, especially with air quality sensor data. Despite their advantages, the black-box nature of these methods obscures how a prediction has been made. Major issues with these types of models include data quality and computational intensity. In this study, we applied two feature selection methods, recursive feature elimination and global sensitivity analysis (GSA), to a random-forest (RF)-based land-use regression model developed for the city of Berlin, Germany. Land-use-based predictors, including local climate zones, leaf area index, daily traffic volume, population density, building types, building heights, and street types, were used to create a baseline RF model. Five additional models, three using recursive feature elimination and two using a Sobol-based GSA, were implemented, and their performance was compared against that of the baseline RF model. The predictors that both methods identified as having a large effect on the prediction are discussed. Through feature elimination, the number of predictors was reduced from 220 in the baseline model to eight in the parsimonious models without sacrificing model performance. A comparison of the model metrics showed that the GSA_parsimonious model performs better than the baseline model, reducing the mean absolute error (MAE) from 8.69 µg/m³ to 3.6 µg/m³ and the root mean squared error (RMSE) from 9.86 µg/m³ to 4.23 µg/m³ when the trained model is applied to reference station data. The better performance of the GSA_parsimonious model is made possible by removing multicollinear and redundant predictors, which curtails the uncertainties propagated through the model. Validated against reference stations, the parsimonious model predicted PM2.5 concentrations with an MAE of less than 5 µg/m³ for 10 out of 12 locations.
The GSA_parsimonious model performed best in all model metrics and improved the R² from 3% in the baseline model to 17%. However, the predictions exhibited a degree of uncertainty that makes the model unreliable for regional-scale modelling. The GSA_parsimonious model can nevertheless be adapted to local scales to highlight the land-use parameters that are indicative of PM2.5 concentrations in Berlin. Overall, population density, leaf area index, and traffic volume are the major predictors of PM2.5, while building type and local climate zones are less significant predictors. Feature selection based on sensitivity analysis has a large impact on model performance. Optimising models through sensitivity analysis can enhance the interpretability of the model dynamics and potentially reduce computational costs and time when modelling is performed for larger areas.
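
The idea behind Sobol-based feature selection mentioned in the abstract can be illustrated with a minimal sketch. It is not the study's implementation: `toy_model` is a hypothetical linear stand-in for a trained regressor, the input names `x1` to `x3` and their coefficients are invented for illustration, and the inputs are assumed to be independent and uniform on [0, 1]. The sketch estimates first-order Sobol indices with a pick-freeze (Saltelli-type) estimator using only the Python standard library, then keeps the two most influential inputs as a "parsimonious" predictor set.

```python
import random


def toy_model(x):
    """Hypothetical stand-in for a trained regressor (not the study's RF model)."""
    return 4.0 * x[0] + 2.0 * x[1] + 0.5 * x[2]


def sobol_first_order(f, n_inputs, n_samples, seed=0):
    """Estimate first-order Sobol indices for independent U(0, 1) inputs."""
    rng = random.Random(seed)
    # Two independent sample matrices A and B, each n_samples x n_inputs.
    a = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    b = [[rng.random() for _ in range(n_inputs)] for _ in range(n_samples)]
    f_a = [f(row) for row in a]
    f_b = [f(row) for row in b]

    # Total output variance, estimated from both sample sets together.
    outputs = f_a + f_b
    mean = sum(outputs) / len(outputs)
    variance = sum((y - mean) ** 2 for y in outputs) / (len(outputs) - 1)

    indices = []
    for i in range(n_inputs):
        # Pick-freeze: take each row of A but replace column i with B's value.
        partial = 0.0
        for j in range(n_samples):
            mixed = list(a[j])
            mixed[i] = b[j][i]
            partial += f_b[j] * (f(mixed) - f_a[j])
        indices.append(partial / n_samples / variance)
    return indices


names = ["x1", "x2", "x3"]  # hypothetical predictor names
indices = sobol_first_order(toy_model, n_inputs=3, n_samples=20000)

# Rank inputs by estimated sensitivity and keep the top two.
ranking = sorted(zip(names, indices), key=lambda pair: pair[1], reverse=True)
selected = [name for name, _ in ranking[:2]]

for name, s in ranking:
    print(f"{name}: first-order index ~ {s:.3f}")
print("selected predictors:", selected)
```

For this linear toy model the exact first-order indices are a_i² / Σ a_j², roughly 0.79, 0.20, and 0.01, so the estimates can be checked against a known answer. With a real model and correlated land-use predictors, the independence assumption no longer holds and a dedicated sensitivity-analysis library would be the more appropriate tool.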