dimension reduction

  • Article Type: Journal Article
    The progressive evolution of the spatial and temporal resolutions of Earth observation satellites has brought multiple benefits to scientific research. The increasing volume of data with higher frequencies and spatial resolutions offers precise and timely information, making it an invaluable tool for environmental analysis and enhanced decision-making. However, it also presents a formidable challenge for large-scale environmental analyses and socioeconomic applications based on spatial time series, often compelling researchers to resort to lower-resolution imagery, which can introduce uncertainty and affect results. In response, our key contribution is a novel machine learning approach for dense geospatial time series rooted in superpixel segmentation, which serves as a preliminary step in mitigating the high dimensionality of data in large-scale applications. While effectively reducing dimensionality, this approach preserves valuable information to the maximum extent, thereby substantially enhancing data accuracy and subsequent environmental analyses. The method was empirically applied in a comprehensive case study covering the 2002-2022 period, using 8-day normalized difference vegetation index (NDVI) data at 250-m resolution over an area spanning 43,470 km². Its efficacy was assessed through a comparative analysis against results derived from 1000-m-resolution satellite data and an existing superpixel algorithm for time series data. An evaluation of the time-series deviations revealed that using coarser-resolution pixels introduced an error exceeding that of the proposed algorithm by 25%, and that the proposed methodology outperformed other algorithms by more than 9%. Notably, this methodological innovation concurrently facilitates the aggregation of pixels sharing similar land-cover classifications, thus mitigating subpixel heterogeneity within the dataset. Further, the proposed methodology, used as a preprocessing step, improves the clustering of pixels according to their time series and can enhance large-scale environmental analyses across a wide range of applications.
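    As a rough illustration of the idea (a sketch, not the paper's algorithm), the Python snippet below runs a SLIC-style superpixel pass on an (H, W, T) time-series cube, where the assignment distance combines time-series dissimilarity with spatial proximity; the weight m, the grid initialization, and the mean-series summary are all assumptions. Each superpixel is then represented by one mean series, reducing H x W series to roughly n_segments.

    import numpy as np

    def timeseries_slic(cube, n_segments=100, m=0.5, n_iter=10):
        """SLIC-style superpixels for an (H, W, T) time-series cube (a sketch).

        A pixel joins the nearby cluster whose center minimizes
        ||series_pixel - series_center|| + m * spatial_distance.
        """
        H, W, T = cube.shape
        step = max(int(np.sqrt(H * W / n_segments)), 1)
        ys = np.arange(step // 2, H, step)
        xs = np.arange(step // 2, W, step)
        centers = [[y, x, cube[y, x].copy()] for y in ys for x in xs]
        labels = -np.ones((H, W), dtype=int)
        for _ in range(n_iter):
            dist = np.full((H, W), np.inf)
            for k, (cy, cx, cseries) in enumerate(centers):
                y0, y1 = max(cy - step, 0), min(cy + step + 1, H)
                x0, x1 = max(cx - step, 0), min(cx + step + 1, W)
                patch = cube[y0:y1, x0:x1]                      # (h, w, T)
                d_series = np.linalg.norm(patch - cseries, axis=-1)
                yy, xx = np.mgrid[y0:y1, x0:x1]
                d = d_series + m * np.hypot(yy - cy, xx - cx)
                better = d < dist[y0:y1, x0:x1]
                dist[y0:y1, x0:x1][better] = d[better]
                labels[y0:y1, x0:x1][better] = k
            for k in range(len(centers)):                       # recenter clusters
                ys_k, xs_k = np.nonzero(labels == k)
                if len(ys_k):
                    centers[k] = [int(ys_k.mean()), int(xs_k.mean()),
                                  cube[ys_k, xs_k].mean(axis=0)]
        # One representative series per superpixel: the dimensionality reduction.
        return labels, np.array([c[2] for c in centers])

    On an NDVI cube, labels gives the segmentation and the returned mean series replace the raw pixel series in downstream analyses.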

  • Article Type: Journal Article
    An increasingly common viewpoint is that protein dynamics datasets reside in a nonlinear subspace of low conformational energy. Ideal data analysis tools should therefore account for such nonlinear geometry. The Riemannian geometry setting is suitable for a variety of reasons. First, it comes with a rich mathematical structure to account for a wide range of geometries that can be modeled after an energy landscape. Second, many standard data analysis tools developed for data in Euclidean space can be generalized to Riemannian manifolds. In the context of protein dynamics, a conceptual challenge comes from the lack of guidelines for constructing a smooth Riemannian structure based on an energy landscape. In addition, computational feasibility in computing geodesics and related mappings poses a major challenge. This work considers both challenges. The first part of the paper develops a local approximation technique for computing geodesics and related mappings on Riemannian manifolds in a computationally feasible manner. The second part constructs a smooth manifold and a Riemannian structure based on an energy landscape for protein conformations. The resulting Riemannian geometry is tested on several data analysis tasks relevant to protein dynamics data. In particular, geodesics with given start and end points approximately recover the corresponding molecular dynamics trajectories for proteins that undergo relatively ordered transitions with medium-sized deformations. The Riemannian protein geometry also gives physically realistic summary statistics and retrieves the underlying dimension, even for large deformations, within seconds on a laptop.
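    To make the geodesic computation concrete, here is a minimal Python sketch (not the paper's local approximation technique): a discrete path between two endpoints is optimized under an assumed conformal metric g(x) = exp(E(x)) I built from a toy energy landscape E, so the geodesic bends away from high-energy regions.

    import numpy as np
    from scipy.optimize import minimize

    def energy(x):
        """Toy 2D 'energy landscape' (stand-in for conformational energy)."""
        return np.sin(x[..., 0]) ** 2 + 0.5 * x[..., 1] ** 2

    def path_energy(flat, x0, x1, n):
        """Discrete Riemannian path energy under the conformal metric
        g(x) = exp(energy(x)) * I, evaluated at segment midpoints."""
        pts = np.vstack([x0, flat.reshape(n, 2), x1])
        segs = np.diff(pts, axis=0)
        mids = 0.5 * (pts[:-1] + pts[1:])
        return np.sum(np.exp(energy(mids)) * np.sum(segs ** 2, axis=1))

    def geodesic(x0, x1, n_interior=18):
        """Approximate geodesic: optimize the interior points of a discrete path."""
        x0, x1 = np.asarray(x0, float), np.asarray(x1, float)
        t = np.linspace(0, 1, n_interior + 2)[1:-1, None]
        init = ((1 - t) * x0 + t * x1).ravel()      # straight-line warm start
        res = minimize(path_energy, init, args=(x0, x1, n_interior))
        return np.vstack([x0, res.x.reshape(n_interior, 2), x1])

    path = geodesic([-2.0, 1.0], [2.0, -1.0])       # (20, 2) discrete geodesic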

  • Article Type: Journal Article
    This article proposes a distance-based framework motivated by the paradigm shift towards feature aggregation for high-dimensional data, which relies on neither the sparse-feature assumption nor permutation-based inference. Focusing on distance-based outcomes that preserve information without truncating any features, a class of semiparametric regression models is developed that encapsulates multiple sources of high-dimensional variables using pairwise outcomes of between-subject attributes. Further, we propose a strategy to address the interlocking correlations among pairs via U-statistics-based estimating equations (UGEE), which correspond to their unique efficient influence function (EIF). Hence, the resulting semiparametric estimators are robust to distributional misspecification while enjoying root-n consistency and asymptotic optimality to facilitate inference. In essence, the proposed approach not only circumvents information loss due to feature selection but also improves the model's interpretability and computational feasibility. Simulation studies and applications to human microbiome and wearable-device data are provided, where the feature dimensions are in the tens of thousands.
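    A schematic toy version of the pairwise-outcome idea (the moment conditions below are a plain illustration, not the paper's UGEE estimator or its EIF weighting): each pair of subjects contributes an outcome d(z_i, z_j) computed from thousands of features, which is regressed on a between-subject covariate attribute.

    import numpy as np
    from itertools import combinations
    from scipy.spatial.distance import pdist
    from scipy.optimize import least_squares

    rng = np.random.default_rng(0)
    n, p = 60, 5000                      # many features, modest sample size
    x = rng.normal(size=n)               # subject-level covariate
    z = rng.normal(size=(n, p)) + 0.3 * x[:, None]   # features drift with x

    y = pdist(z)                         # pairwise outcomes, one per pair (i < j)
    dx = np.array([abs(x[i] - x[j]) for i, j in combinations(range(n), 2)])

    def score(beta):
        """Stacked moment conditions over all pairs: E[residual] = 0 and
        E[residual * covariate contrast] = 0."""
        resid = y - (beta[0] + beta[1] * dx)
        return np.array([resid.mean(), (resid * dx).mean()])

    beta_hat = least_squares(score, x0=np.zeros(2)).x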

  • Article Type: Journal Article
    BACKGROUND: Chemical space embedding methods are widely utilized in various research settings for dimensional reduction, clustering and effective visualization. The maps generated by the embedding process can provide valuable insight to medicinal chemists in terms of the relationships between structural, physicochemical and biological properties of compounds. However, these maps are known to be difficult to interpret, and the "landscape" on the map is prone to "rearrangement" when embedding different sets of compounds.
    RESULTS: In this study we present the Hilbert-Curve Assisted Space Embedding (HCASE) method, which was designed to create maps by organizing structures according to a logic familiar to medicinal chemists. First, a chemical space is created with the help of a set of "reference scaffolds". These scaffolds are sorted according to the medicinal-chemistry-inspired Scaffold-Key algorithm found in prior art. Next, the ordered scaffolds are mapped to a line which is folded into a higher-dimensional (here: 2D) space. The intricately folded line is referred to as a pseudo-Hilbert-Curve. A compound is embedded by locating its most similar reference scaffold on the pseudo-Hilbert-Curve and assuming the respective position. Through a series of experiments, we demonstrate the properties of the maps generated by the HCASE method. Subjects of the embeddings were compounds of the DrugBank and CANVASS libraries, and the chemical spaces were defined by scaffolds extracted from the ChEMBL database.
    CONCLUSIONS: The novelty of the HCASE method lies in generating robust and intuitive chemical space embeddings that reflect a medicinal chemist's reasoning, and in the precedential use of a space-filling (Hilbert) curve in the process.
    AVAILABILITY: https://github.com/ncats/hcase.
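    A compact Python sketch of the two ingredients named above, with assumed inputs (boolean fingerprint matrices, and scaffolds already sorted by a scaffold key standing in for the Scaffold-Key algorithm): a classic Hilbert-curve decoder places the i-th scaffold in a 2D grid, and each compound adopts the cell of its most similar scaffold under Tanimoto similarity.

    import numpy as np

    def d2xy(order, d):
        """Decode position d along a Hilbert curve into (x, y) on a
        2**order x 2**order grid (classic bit-twiddling formulation)."""
        x = y = 0
        s, t = 1, d
        while s < (1 << order):
            rx = 1 & (t // 2)
            ry = 1 & (t ^ rx)
            if ry == 0:                  # rotate the quadrant
                if rx == 1:
                    x, y = s - 1 - x, s - 1 - y
                x, y = y, x
            x += s * rx
            y += s * ry
            t //= 4
            s *= 2
        return x, y

    def hcase_embed(compound_fps, scaffold_fps, order=4):
        """Embed compounds at the grid cell of their most similar reference
        scaffold. scaffold_fps holds at most 4**order scaffolds, already
        sorted by a scaffold key; fingerprints are boolean arrays."""
        coords = []
        for fp in compound_fps:
            inter = (scaffold_fps & fp).sum(axis=1)
            union = (scaffold_fps | fp).sum(axis=1)
            best = int(np.argmax(inter / np.maximum(union, 1)))   # Tanimoto
            coords.append(d2xy(order, best))
        return np.array(coords)

    Because compound positions are anchored to a fixed, pre-sorted scaffold grid, embedding a different compound set cannot rearrange the map's landscape.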

  • Article Type: Journal Article
    The incidence of thyroid cancer continues to increase even though a large number of inspection tools have been developed in recent years. Since there is no standard, definitive procedure for thyroid cancer diagnosis, clinicians must conduct various tests. This scrutiny process yields multidimensional big data, and the lack of a common approach leads to randomly distributed missing (sparse) data, both of which are formidable challenges for machine learning algorithms. This paper aims to develop an accurate and computationally efficient deep learning algorithm for diagnosing thyroid cancer. To this end, the singularity in learning problems stemming from randomly distributed missing data is treated, and dimensionality reduction with inner- and target-similarity approaches is developed to select the most informative input datasets. In addition, size reduction with a hierarchical clustering algorithm is performed to eliminate considerably similar data samples. Four machine learning algorithms are trained and then tested with unseen data to validate their generalization and robustness. The results yield 100% training accuracy and 83% testing accuracy on unseen data. The computational time efficiency of the algorithms is also examined under equal conditions.
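    The size-reduction step described above can be sketched with standard tools (the linkage method, distance threshold, and representative choice are assumptions, not the paper's settings): hierarchically cluster the samples and keep one representative per tight cluster of near-duplicates.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster

    def deduplicate(X, threshold=4.0):
        """Size reduction: group highly similar samples by hierarchical
        clustering and keep the first member of each cluster."""
        Z = linkage(X, method="average")
        labels = fcluster(Z, t=threshold, criterion="distance")
        keep = np.array([np.nonzero(labels == c)[0][0]
                         for c in np.unique(labels)])
        return X[keep], keep

    X = np.random.default_rng(1).normal(size=(500, 30))
    X_small, kept_idx = deduplicate(X, threshold=4.0)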

  • Article Type: Journal Article
    The growing complexity of biological data has spurred the development of innovative computational techniques to extract meaningful information and uncover hidden patterns within vast datasets. Biological networks, such as gene regulatory networks and protein-protein interaction networks, hold critical insights into biological features' connections and functions. Integrating and analyzing high-dimensional data, particularly in gene expression studies, stands prominent among the challenges in deciphering these networks. Clustering methods play a crucial role in addressing these challenges, with spectral clustering emerging as a potent unsupervised technique considering intrinsic geometric structures. However, spectral clustering's user-defined cluster number can lead to inconsistent and sometimes orthogonal clustering regimes. We propose the Multi-layer Bundling (MLB) method to address this limitation, combining multiple prominent clustering regimes to offer a comprehensive data view. We call the outcome clusters "bundles". This approach refines clustering outcomes, unravels hierarchical organization, and identifies bridge elements mediating communication between network components. By layering clustering results, MLB provides a global-to-local view of biological feature clusters, enabling insights into intricate biological systems. Furthermore, the method enhances bundle network predictions by integrating the bundle co-cluster matrix with the affinity matrix. The versatility of MLB extends beyond biological networks, making it applicable to various domains where understanding complex relationships and patterns is needed.
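    A minimal Python sketch of the layering-plus-co-cluster idea (the cluster numbers, RBF affinity, and 50/50 blend are assumptions; MLB's actual construction is more elaborate): run spectral clustering under several cluster numbers, average the co-cluster indicator matrices, and blend the result with the affinity matrix before a final clustering pass.

    import numpy as np
    from sklearn.cluster import SpectralClustering

    def bundle(X, ks=(3, 5, 8), gamma=1.0):
        """Combine several spectral-clustering regimes into 'bundles'."""
        n = X.shape[0]
        co = np.zeros((n, n))
        for k in ks:                                  # layer multiple regimes
            labels = SpectralClustering(n_clusters=k, affinity="rbf",
                                        gamma=gamma,
                                        random_state=0).fit_predict(X)
            co += (labels[:, None] == labels[None, :])
        co /= len(ks)                                 # co-cluster matrix
        d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        affinity = np.exp(-gamma * d2)                # RBF affinity matrix
        blended = 0.5 * co + 0.5 * affinity
        final = SpectralClustering(n_clusters=max(ks), affinity="precomputed",
                                   random_state=0).fit_predict(blended)
        return final, co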

  • Article Type: Journal Article
    This paper presents a Bayesian reformulation of covariate-assisted principal regression for covariance matrix outcomes to identify low-dimensional components in the covariance associated with covariates. By introducing a geometric approach to the covariance matrices and leveraging Euclidean geometry, we estimate dimension reduction parameters and model covariance heterogeneity based on covariates. This method enables joint estimation and uncertainty quantification of relevant model parameters associated with heteroscedasticity. We demonstrate our approach through simulation studies and apply it to analyze associations between covariates and brain functional connectivity using data from the Human Connectome Project.
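    For context, the frequentist model being reformulated can be written schematically as follows (a sketch from the covariate-assisted principal regression literature; the paper's exact Bayesian parameterization may differ): for subject $i$ with covariance outcome $\Sigma_i$ and covariates $x_i$, one seeks a projection $\gamma$ whose variance is log-linear in the covariates,
    \[
      \log\!\left(\gamma^{\top} \Sigma_i \, \gamma\right) = x_i^{\top} \beta ,
      \qquad i = 1, \dots, n .
    \]
    The Bayesian reformulation places priors on $(\gamma, \beta)$, yielding joint estimation and posterior uncertainty quantification for the heteroscedasticity parameters.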

  • Article Type: Journal Article
    Due to the high dimensionality, redundancy, and non-linearity of near-infrared (NIR) spectral data, as well as the influence of sample attributes such as producing area and grade, similarity measures between samples can be distorted. This paper proposes a t-distributed stochastic neighbor embedding algorithm based on the Sinkhorn distance (St-SNE), combined with multi-attribute data information. First, the Sinkhorn distance is introduced, which addresses problems such as the asymmetry of KL divergence and sparse data distributions in high-dimensional space, thereby constructing probability distributions that make the low-dimensional space resemble the high-dimensional one. In addition, to address the impact of multi-attribute sample features on the similarity measure, a multi-attribute distance matrix is constructed using information entropy and then combined with the numerical matrix of the spectral data to obtain a mixed data matrix. To validate the effectiveness of the St-SNE algorithm, dimensionality-reduction projection was performed on NIR spectral data and compared with the PCA, LPP, and t-SNE algorithms. The results demonstrate that St-SNE effectively distinguishes samples with different attribute information and produces more distinct projection boundaries between sample categories in low-dimensional space. We then tested the classification performance of St-SNE for different attributes using the tobacco and mango datasets, comparing it with the LPP, t-SNE, UMAP, and Fisher t-SNE algorithms; St-SNE achieved the highest classification accuracy for each attribute. Finally, we compared the algorithms on retrieving the sample most similar to a target tobacco for cigarette formulas, and experiments showed that St-SNE agreed most closely with expert recommendations. It can provide strong support for the maintenance and design of product formulas.
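    The Sinkhorn-distance ingredient can be sketched as follows (a toy pipeline covering only the distance construction, not the information-entropy attribute matrix; the wavelength-gap ground cost and regularization value are assumptions): treat each normalized spectrum as a distribution, collect pairwise Sinkhorn distances, and run t-SNE on the precomputed matrix.

    import numpy as np
    from sklearn.manifold import TSNE

    def sinkhorn_distance(a, b, M, reg=0.1, n_iter=200):
        """Entropy-regularized optimal-transport distance between
        histograms a and b with ground-cost matrix M."""
        K = np.exp(-M / reg)
        u = np.ones_like(a)
        for _ in range(n_iter):
            v = b / (K.T @ u)
            u = a / (K @ v)
        P = u[:, None] * K * v[None, :]       # transport plan
        return float((P * M).sum())

    def st_sne(spectra, reg=0.1):
        """Toy St-SNE-style pipeline on an (n_samples, n_bands) array."""
        n, b = spectra.shape
        P = spectra / spectra.sum(axis=1, keepdims=True)
        wl = np.arange(b, dtype=float)
        M = np.abs(wl[:, None] - wl[None, :])  # cost: wavelength gap
        M /= M.max()
        D = np.zeros((n, n))
        for i in range(n):
            for j in range(i + 1, n):
                D[i, j] = D[j, i] = sinkhorn_distance(P[i], P[j], M, reg)
        return TSNE(metric="precomputed", init="random",
                    perplexity=min(30, n - 1)).fit_transform(D)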

  • Article Type: Journal Article
    MOTIVATION: With the increased reliance on multi-omics data for bulk and single-cell analyses, the availability of robust approaches to perform unsupervised analysis for clustering, visualization, and feature selection is imperative. Joint dimensionality reduction methods can be applied to multi-omics datasets to derive a global sample embedding analogous to single-omic techniques such as Principal Components Analysis (PCA). Multiple co-inertia analysis (MCIA) is a method for joint dimensionality reduction that maximizes the covariance between block- and global-level embeddings. Current implementations of MCIA are not optimized for large datasets, such as those arising from single-cell studies, and lack the ability to embed new data.
    RESULTS: We introduce nipalsMCIA, an MCIA implementation that solves the objective function using an extension of Non-linear Iterative Partial Least Squares (NIPALS) and shows a significant speed-up over earlier implementations that rely on eigendecompositions for single-cell multi-omics data. It also removes the dependence on an eigendecomposition for calculating the variance explained and allows users to perform out-of-sample embedding of new data. nipalsMCIA provides users with a variety of pre-processing and parameter options, as well as easy-to-use functionality for downstream analysis of single-omic and global-embedding factors.
    AVAILABILITY: nipalsMCIA is available as a Bioconductor package at https://bioconductor.org/packages/release/bioc/html/nipalsMCIA.html, and includes detailed documentation and application vignettes. Supplementary materials are available online.
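    For readers unfamiliar with the solver family, here is a minimal single-block NIPALS iteration for PCA in Python (an illustration of the iterative score/loading updates that nipalsMCIA extends to the multi-block MCIA objective; it is not the package's API). Out-of-sample embedding then amounts to projecting centered new data onto the learned loadings.

    import numpy as np

    def nipals_pca(X, n_components=2, tol=1e-8, max_iter=500):
        """NIPALS for PCA: extract one component at a time by alternating
        score/loading updates, then deflate the data matrix."""
        X = X - X.mean(axis=0)
        scores, loadings = [], []
        for _ in range(n_components):
            t = X[:, np.argmax(X.var(axis=0))].copy()  # high-variance start
            for _ in range(max_iter):
                p = X.T @ t / (t @ t)                  # loading update
                p /= np.linalg.norm(p)
                t_new = X @ p                          # score update
                if np.linalg.norm(t_new - t) < tol:
                    t = t_new
                    break
                t = t_new
            X = X - np.outer(t, p)                     # deflation
            scores.append(t)
            loadings.append(p)
        return np.array(scores).T, np.array(loadings).T

    # Out-of-sample embedding of centered new data X_new: T_new = X_new @ P.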

  • Article Type: Journal Article
    In this article, we develop CausalEGM, a deep learning framework for nonlinear dimension reduction and generative modeling of the dependency among covariate features affecting treatment and response. CausalEGM can be used for estimating causal effects in both binary and continuous treatment settings. By learning a bidirectional transformation between the high-dimensional covariate space and a low-dimensional latent space, and then modeling the dependencies of different subsets of the latent variables on the treatment and response, CausalEGM can extract the latent covariate features that affect both treatment and response. By conditioning on these features, one can mitigate the confounding effect of the high-dimensional covariates on the estimation of the causal relation between treatment and response. In a series of experiments, the proposed method is shown to achieve superior performance over existing methods in both binary and continuous treatment settings. The improvement is substantial when the sample size is large and the covariates are high-dimensional. Finally, we establish excess risk bounds and consistency results for our method, and discuss how our approach relates to and improves upon other dimension reduction approaches in causal inference.
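    A structural sketch of the idea in PyTorch (the layer sizes, latent split, and heads are assumptions; the actual CausalEGM couples this architecture with generative training): covariates are encoded into a low-dimensional latent vector that is split into subsets, with only designated subsets feeding the treatment and response models, plus a decoder for the round trip.

    import torch
    import torch.nn as nn

    class CausalEGMSketch(nn.Module):
        """Encode covariates v into latent z = (z0, z1, z2, z3), where z0
        drives both treatment and response, z1 only treatment, z2 only
        response, and z3 neither (sizes here are placeholders)."""

        def __init__(self, v_dim, z_dims=(2, 2, 2, 2)):
            super().__init__()
            self.z_dims = z_dims
            z_total = sum(z_dims)
            self.encoder = nn.Sequential(nn.Linear(v_dim, 64), nn.ReLU(),
                                         nn.Linear(64, z_total))
            self.decoder = nn.Sequential(nn.Linear(z_total, 64), nn.ReLU(),
                                         nn.Linear(64, v_dim))
            z0, z1, z2, _ = z_dims
            self.treat_net = nn.Linear(z0 + z1, 1)     # x ~ f(z0, z1)
            self.resp_net = nn.Linear(z0 + z2 + 1, 1)  # y ~ g(z0, z2, x)

        def forward(self, v, x):
            z = self.encoder(v)
            z0, z1, z2, _ = torch.split(z, self.z_dims, dim=1)
            x_hat = self.treat_net(torch.cat([z0, z1], dim=1))
            y_hat = self.resp_net(torch.cat([z0, z2, x], dim=1))
            v_hat = self.decoder(z)                    # reconstruction
            return x_hat, y_hat, v_hat

    Conditioning the response head on only (z0, z2, x) is what operationalizes the deconfounding argument: the confounding content of the covariates is funneled through the low-dimensional z0.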