Keywords: Ensemble machine learning; Flood risk assessment; Multivariable flood damage model; SHAP values; SMOTE

Source: DOI:10.1016/j.scitotenv.2023.167872

Abstract:
Flooding is a global threat, and accurate flood risk prediction is vital for effective mitigation and for raising public awareness of the negative impacts of floods. Over the years, researchers have developed physical and data-driven models to predict flood damage, striving to improve both accuracy and understanding. However, the comprehensive datasets needed to develop these models are scarce and limited. This study aims to enhance the National Flood Insurance Program (NFIP) claims dataset from Hurricane Katrina in coastal Alabama so that it is adequate for multi-variable flood damage assessment. The NFIP claims dataset was combined with the Alabama property dataset, simulated flood hazard information, and property location characteristics. Oversampling techniques were employed to address data imbalance in the datasets. Several ensemble machine learning approaches, including random forest, extra trees, extreme gradient boosting, and categorical boosting, were then used to develop multi-variable flood damage models. Validation shows that extreme gradient boosting performs best, achieving satisfactory results both in identifying damaged properties (precision 0.89, recall 0.90, F1-score 0.90) and in determining relative damage (R-squared 0.59, root mean squared error 0.21, Spearman correlation 0.70). Data oversampling improves model performance on imbalanced flood damage datasets. Despite the dataset's limitations and the data augmentation techniques employed, the explanation of the model's output based on SHapley Additive exPlanations (SHAP) is constructive, as it aligns with the study's expectations regarding how different features interact to produce the final results.
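The keywords list SMOTE, one of the oversampling techniques the abstract alludes to. As a rough illustration only, and not the study's actual pipeline, the core idea can be sketched in plain Python: each synthetic sample is placed on the line segment between a minority-class sample and one of its nearest minority-class neighbours. The function name `smote_sketch`, the two toy features (flood depth and building age), and all data values are assumptions made up for this example.

```python
import math
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Hypothetical helper for illustration; not taken from the paper."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base.
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic


# Toy data (assumed): (flood depth in m, building age in years).
majority = [(0.1, 10.0), (0.2, 25.0), (0.3, 5.0), (0.0, 40.0),
            (0.4, 15.0), (0.2, 30.0), (0.1, 20.0), (0.3, 12.0)]
minority = [(1.5, 40.0), (2.0, 35.0), (1.8, 50.0)]

# Generate enough synthetic minority samples to match the majority class.
new_points = smote_sketch(minority, n_new=len(majority) - len(minority))
balanced_minority = minority + new_points
```

In practice a maintained implementation such as the `SMOTE` class in the imbalanced-learn package would normally be preferred, and oversampling is usually applied to the training split only so that validation data stay untouched.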