Keywords: PARAFAC, censored least squares, imputation, tensor

Source: DOI:10.1101/2024.07.05.602272

Abstract:
Tensor factorization is a family of dimensionality reduction methods applied to multidimensional arrays. These methods are useful for identifying patterns in a variety of biomedical datasets because they preserve the organizational structure of experiments, which aids in generating meaningful insights. However, missing data in the datasets being analyzed can pose challenges. Tensor factorization can be performed with some level of missing data and still reconstruct a complete tensor, but the choice of fitting algorithm may influence the fidelity of the imputed values. Previous approaches, based on alternating least squares with prefilled values or on direct optimization, suffer from introduced bias or slow computational performance. In this study, we propose that censored least squares can better handle missing values in data structured in tensor form. We ran censored least squares on four different biological datasets and compared its performance against alternating least squares with prefilled values and against direct optimization. We benchmarked each method's handling of missing data by its imputation error and its ability to infer masked values. By accuracy and convergence metrics across several studies, censored least squares appeared best suited for the analysis of high-dimensional biological data.
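To make the idea of censored least squares concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a 3-way array in which missing entries are marked with NaN, and it fits a PARAFAC (CP) model by alternating least squares in which each factor row is solved using only the observed entries of the corresponding unfolded row, so no prefilled values enter the fit. The function names `khatri_rao`, `censored_als_cp`, and `reconstruct`, the random initialization, and the fixed iteration count are all illustrative assumptions.

```python
import numpy as np


def khatri_rao(A, B):
    """Column-wise Kronecker product; row index runs over (row of A, row of B)."""
    rank = A.shape[1]
    return (A[:, None, :] * B[None, :, :]).reshape(-1, rank)


def reconstruct(factors):
    """Rebuild the full 3-way array from its three CP factor matrices."""
    A, B, C = factors
    return np.einsum("ir,jr,kr->ijk", A, B, C)


def censored_als_cp(X, rank, n_iter=200, seed=0):
    """Illustrative CP fit of a 3-way array X whose missing entries are NaN."""
    rng = np.random.default_rng(seed)
    mask = ~np.isnan(X)  # True where an entry was observed
    factors = [rng.standard_normal((dim, rank)) for dim in X.shape]
    for _ in range(n_iter):
        for mode in range(3):
            # Unfold so rows index `mode` and columns index the other two modes.
            Xm = np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)
            Mm = np.moveaxis(mask, mode, 0).reshape(X.shape[mode], -1)
            others = [factors[m] for m in range(3) if m != mode]
            K = khatri_rao(others[0], others[1])
            for i in range(X.shape[mode]):
                obs = Mm[i]
                if obs.any():
                    # Censored step: drop the equations of missing entries
                    # instead of filling them with a guess.
                    factors[mode][i] = np.linalg.lstsq(
                        K[obs], Xm[i, obs], rcond=None
                    )[0]
    return factors


# Small synthetic check: hide about 20% of a rank-2 tensor, then impute it.
rng = np.random.default_rng(1)
truth = reconstruct([rng.standard_normal((d, 2)) for d in (8, 7, 6)])
held_out = rng.random(truth.shape) < 0.2
observed = truth.copy()
observed[held_out] = np.nan

estimate = reconstruct(censored_als_cp(observed, rank=2))
rel_err = np.linalg.norm(estimate[held_out] - truth[held_out]) / np.linalg.norm(
    truth[held_out]
)
print(f"relative imputation error on held-out entries: {rel_err:.3g}")
```

The row-by-row solve is what distinguishes this sketch from alternating least squares with prefilled values: a prefilled entry contributes an equation with a made-up target, whereas here that equation is simply omitted. The cost is one small least-squares problem per row rather than one shared solve per mode.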