Keywords: Chemometrics; Data fusion; Multi-block; NIR; Pre-processing; Spectroscopy

Source: DOI:10.1016/j.aca.2024.342965

Abstract:
BACKGROUND: Spectral data from multiple sources can be integrated into multi-block data-fusion chemometric models, such as sequentially orthogonalized partial least squares (SO-PLS), to improve the prediction of sample quality features. Pre-processing techniques are often applied to mitigate extraneous variability that is unrelated to the response variables. However, selecting suitable pre-processing methods and identifying informative data blocks becomes increasingly complex and time-consuming as the number of blocks grows. This work addresses the efficient pre-processing, selection, and ordering of data blocks for targeted SO-PLS applications.
RESULTS: We introduce the PROSAC-SO-PLS methodology, which employs pre-processing ensembles with response-oriented sequential alternation calibration (PROSAC). The approach identifies the best pre-processed data blocks and their sequential order for a specific SO-PLS application. It uses a stepwise forward selection strategy, facilitated by the fast Gram-Schmidt process, to prioritize the blocks that yield the lowest prediction residuals. To validate the approach, we present results for three empirical near-infrared (NIR) datasets. PROSAC-SO-PLS was compared against partial least squares (PLS) regression on single-block pre-processed datasets and against a methodology relying solely on PROSAC. It consistently outperformed these methods, yielding significantly lower prediction errors: the root-mean-squared error of prediction (RMSEP) was reduced by 5 to 25 % for seven of the eight response variables analyzed.
CONCLUSIONS: The PROSAC-SO-PLS methodology offers a versatile and efficient technique for ensemble pre-processing in NIR data modeling. It enables the use of SO-PLS with minimal concern about pre-processing sequence or block order, and it effectively manages a large number of data blocks. This innovation significantly streamlines data pre-processing and model building, enhancing the accuracy and efficiency of chemometric models.
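
The stepwise forward selection described under RESULTS can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (`snv`, `first_derivative`, `forward_select_blocks`), the synthetic spectra, and the choice of one PLS-like score per block are all assumptions made for illustration, and the sketch ranks blocks by calibration residuals rather than by cross-validated prediction residuals. It only shows the general idea of orthogonalizing each candidate pre-processed block against the scores already selected (a Gram-Schmidt step) and keeping the block that leaves the smallest residual.

```python
import numpy as np

def snv(X):
    """Standard normal variate: center and scale each spectrum (row)."""
    return (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

def first_derivative(X):
    """Crude finite-difference first derivative along the wavelength axis."""
    return np.gradient(X, axis=1)

def forward_select_blocks(blocks, y, max_blocks=3):
    """Greedy forward selection of pre-processed blocks (illustrative only).

    blocks: dict mapping a block name to an (n_samples, n_vars) array.
    y: (n_samples,) response vector.
    Returns the selected block order and the residual sum of squares
    after each selection step.
    """
    r = y - y.mean()                       # current response residual
    remaining = {k: X - X.mean(axis=0) for k, X in blocks.items()}
    basis, order, sse_path = [], [], []    # basis holds orthonormal scores
    for _ in range(min(max_blocks, len(blocks))):
        best = None
        for name, X in remaining.items():
            Xo = X.copy()
            for q in basis:                # Gram-Schmidt: remove selected scores
                Xo -= np.outer(q, q @ Xo)
            w = Xo.T @ r                   # response-oriented weight vector
            if np.linalg.norm(w) < 1e-12:
                continue
            t = Xo @ (w / np.linalg.norm(w))
            if np.linalg.norm(t) < 1e-12:
                continue
            q_new = t / np.linalg.norm(t)
            r_new = r - q_new * (q_new @ r)  # residual after using this block
            sse = float(r_new @ r_new)
            if best is None or sse < best[0]:
                best = (sse, name, q_new, r_new)
        if best is None:                   # no block explains anything further
            break
        sse, name, q_new, r = best
        basis.append(q_new)
        order.append(name)
        sse_path.append(sse)
        del remaining[name]
    return order, sse_path

# Synthetic data (hypothetical): one absorption band plus a random baseline.
rng = np.random.default_rng(0)
n, p = 40, 60
conc = rng.uniform(0.0, 1.0, n)
peak = np.exp(-0.5 * ((np.arange(p) - 30) / 5.0) ** 2)
raw = (np.outer(conc, peak)
       + rng.uniform(0.5, 1.5, (n, 1))
       + 0.01 * rng.normal(size=(n, p)))
blocks = {"raw": raw, "snv": snv(raw), "d1": first_derivative(raw)}
order, sse_path = forward_select_blocks(blocks, conc, max_blocks=3)
```

Here `order` lists the pre-processed blocks in the sequence they were picked, and `sse_path` gives the residual sum of squares after each pick, which cannot increase from one step to the next. A fuller treatment would fit several latent variables per block and score candidates on held-out samples, as the RMSEP comparison in the abstract implies.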