Keywords: Binning; Categorical data; Dataset coherence analysis; Deep learning models; Imbalanced datasets; Learning space dimensions; Machine learning; Pareto analysis; Principal component analysis

Source: DOI:10.1186/s40537-021-00428-8

Abstract:
Deep learning models are data analysis tools suited to approximating (non-linear) relationships among variables in order to best predict an outcome. While these models can be used to answer many important questions, their utility is still harshly criticized, because it is extremely challenging to identify which data descriptors most adequately represent a given phenomenon of interest. From recent experience developing a deep learning model designed to detect failures in mechanical water meter devices, we have learnt that prediction accuracy can deteriorate appreciably if one tries to train a deep learning model by adding specific device descriptors based on categorical data. This can happen because of an excessive increase in the dimensions of the data, with a corresponding loss of statistical significance. After several unsuccessful experiments with alternative methodologies that either reduce the dimensionality of the data space or employ more traditional machine learning algorithms, we changed the training strategy and reconsidered the categorical data in the light of a Pareto analysis. In essence, we used those categorical descriptors not as an input on which to train our deep learning model, but as a tool to give a new shape to the dataset, based on the Pareto rule. With this data adjustment, we trained a better-performing deep learning model able to detect defective water meter devices with a prediction accuracy in the range 87-90%, even in the presence of categorical descriptors.
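The abstract does not spell out how the Pareto rule was applied, so the following is only a minimal sketch of one possible reading: the categorical descriptor is used to select which rows stay in the dataset, and is then dropped instead of being encoded as model input. The field name `brand`, the helper names `pareto_categories` and `reshape_by_pareto`, the sample rows, and the 0.8 coverage threshold are all illustrative assumptions, not details taken from the paper.

```python
from collections import Counter


def pareto_categories(values, coverage=0.8):
    """Return the most frequent categories, in descending order of frequency,
    until their cumulative share of the samples reaches `coverage`.
    The 0.8 default is an assumed Pareto-style threshold."""
    counts = Counter(values)
    total = sum(counts.values())
    kept, cumulative = [], 0
    for category, count in counts.most_common():
        if cumulative >= coverage * total:
            break
        kept.append(category)
        cumulative += count
    return kept


def reshape_by_pareto(rows, key, coverage=0.8):
    """Keep only the rows whose `key` value is a dominant category, then drop
    `key` so the categorical descriptor never becomes a model input."""
    kept = set(pareto_categories([row[key] for row in rows], coverage))
    return [
        {name: value for name, value in row.items() if name != key}
        for row in rows
        if row[key] in kept
    ]


# Hypothetical water meter records: one categorical descriptor, one numeric feature.
rows = (
    [{"brand": "A", "reading": 10.0}] * 5
    + [{"brand": "B", "reading": 12.5}] * 3
    + [{"brand": "C", "reading": 9.1}]
    + [{"brand": "D", "reading": 30.2}]
)

dominant = pareto_categories([row["brand"] for row in rows])
reshaped = reshape_by_pareto(rows, "brand")
print(dominant, len(reshaped))
```

In this reading the descriptor changes which samples the model sees without adding any input dimensions, which is the contrast with one-hot encoding that the abstract draws.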