dimension reduction

  • Article type: Journal Article
    The progressive evolution of the spatial and temporal resolutions of Earth observation satellites has brought multiple benefits to scientific research. The increasing volume of data with higher frequencies and spatial resolutions offers precise and timely information, making it an invaluable tool for environmental analysis and enhanced decision-making. However, this volume presents a formidable challenge for large-scale environmental analyses and socioeconomic applications based on spatial time series, often compelling researchers to resort to lower-resolution imagery, which can introduce uncertainty and affect results. In response, our key contribution is a novel machine learning approach for dense geospatial time series rooted in superpixel segmentation, which serves as a preliminary step in mitigating the high dimensionality of data in large-scale applications. While effectively reducing dimensionality, this approach preserves valuable information to the maximum extent, thereby substantially enhancing data accuracy and subsequent environmental analyses. The method was applied in a comprehensive case study covering the 2002-2022 period, using 8-day normalized difference vegetation index (NDVI) data at 250-m resolution over an area spanning 43,470 km². Its efficacy was assessed by comparing our results with those derived from 1000-m-resolution satellite data and from an existing superpixel algorithm for time series data. An evaluation of the time-series deviations revealed that using coarser-resolution pixels introduced an error that exceeded that of the proposed algorithm by 25% and that the proposed methodology outperformed other algorithms by more than 9%. Notably, this methodological innovation concurrently facilitates the aggregation of pixels sharing similar land-cover classifications, thus mitigating subpixel heterogeneity within the dataset. Further, the proposed methodology, used as a preprocessing step, improves the clustering of pixels according to their time series and can enhance large-scale environmental analyses across a wide range of applications.
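
The comparison of time-series deviations described in this abstract can be illustrated with a small sketch. The abstract does not give the exact metric, so the function below is an assumption: it measures how far each pixel's time series lies from the mean series of the group it was merged into (a superpixel or a coarse-resolution cell), using a per-pixel root-mean-square error. The name `mean_series_deviation` and the toy NDVI values are hypothetical.

```python
import math
from collections import defaultdict


def mean_series_deviation(series, labels):
    """Average RMSE between each pixel's series and its group's mean series.

    series: list of equal-length time series, one per pixel.
    labels: group id per pixel (superpixel id or coarse-cell id).
    This metric is an illustrative assumption, not the paper's definition.
    """
    groups = defaultdict(list)
    for s, g in zip(series, labels):
        groups[g].append(s)
    # Mean series of every group, computed time step by time step.
    means = {
        g: [sum(vals) / len(members) for vals in zip(*members)]
        for g, members in groups.items()
    }
    total = 0.0
    for s, g in zip(series, labels):
        m = means[g]
        total += math.sqrt(sum((a - b) ** 2 for a, b in zip(s, m)) / len(s))
    return total / len(series)


# Toy data: two pixels with a rising signal, two with a falling one.
ndvi = [[0.2, 0.4, 0.6], [0.2, 0.4, 0.6], [0.8, 0.7, 0.6], [0.8, 0.7, 0.6]]
homogeneous = mean_series_deviation(ndvi, [0, 0, 1, 1])  # similar pixels grouped
mixed = mean_series_deviation(ndvi, [0, 1, 0, 1])        # dissimilar pixels grouped
```

Grouping pixels with similar series gives a lower deviation than grouping dissimilar ones, which is the kind of information loss the abstract attributes to coarser-resolution pixels.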

  • Article type: Journal Article
    An increasingly common viewpoint is that protein dynamics datasets reside in a nonlinear subspace of low conformational energy. Ideal data analysis tools should therefore account for such nonlinear geometry. The Riemannian geometry setting can be suitable for a variety of reasons. First, it comes with a rich mathematical structure to account for a wide range of geometries that can be modeled after an energy landscape. Second, many standard data analysis tools developed for data in Euclidean space can be generalized to Riemannian manifolds. In the context of protein dynamics, a conceptual challenge comes from the lack of guidelines for constructing a smooth Riemannian structure based on an energy landscape. In addition, computational feasibility in computing geodesics and related mappings poses a major challenge. This work considers these challenges. The first part of the paper develops a local approximation technique for computing geodesics and related mappings on Riemannian manifolds in a computationally feasible manner. The second part constructs a smooth manifold and a Riemannian structure that is based on an energy landscape for protein conformations. The resulting Riemannian geometry is tested on several data analysis tasks relevant for protein dynamics data. In particular, the geodesics with given start- and end-points approximately recover corresponding molecular dynamics trajectories for proteins that undergo relatively ordered transitions with medium-sized deformations. The Riemannian protein geometry also gives physically realistic summary statistics and retrieves the underlying dimension even for large-sized deformations within seconds on a laptop.

  • Article type: Journal Article
    This article proposes a distance-based framework motivated by the paradigm shift towards feature aggregation for high-dimensional data, which relies on neither the sparse-feature assumption nor permutation-based inference. Focusing on distance-based outcomes that preserve information without truncating any features, a class of semiparametric regression models has been developed, which encapsulates multiple sources of high-dimensional variables using pairwise outcomes of between-subject attributes. Further, we propose a strategy to address the interlocking correlations among pairs via U-statistics-based estimating equations (UGEE), which correspond to their unique efficient influence function (EIF). Hence, the resulting semiparametric estimators are robust to distributional misspecification while enjoying root-n consistency and asymptotic optimality to facilitate inference. In essence, the proposed approach not only circumvents information loss due to feature selection but also improves the model's interpretability and computational feasibility. Simulation studies and applications to human microbiome and wearables data are provided, where the feature dimensions are in the tens of thousands.

  • Article type: Journal Article
    BACKGROUND: Chemical space embedding methods are widely used in various research settings for dimensionality reduction, clustering and effective visualization. The maps generated by the embedding process can provide valuable insight to medicinal chemists in terms of the relationships between structural, physicochemical and biological properties of compounds. However, these maps are known to be difficult to interpret, and the "landscape" on the map is prone to "rearrangement" when embedding different sets of compounds.
    RESULTS: In this study we present the Hilbert-Curve Assisted Space Embedding (HCASE) method, which was designed to create maps by organizing structures according to a logic familiar to medicinal chemists. First, a chemical space is created with the help of a set of "reference scaffolds". These scaffolds are sorted according to the medicinal-chemistry-inspired Scaffold-Key algorithm found in prior art. Next, the ordered scaffolds are mapped to a line which is folded into a higher-dimensional (here: 2D) space. The intricately folded line is referred to as a pseudo-Hilbert-Curve. A compound is embedded by locating its most similar reference scaffold on the pseudo-Hilbert-Curve and assuming the respective position. Through a series of experiments, we demonstrate the properties of the maps generated by the HCASE method. The embedded compounds came from the DrugBank and CANVASS libraries, and the chemical spaces were defined by scaffolds extracted from the ChEMBL database.
    The novelty of the HCASE method lies in generating robust and intuitive chemical space embeddings that reflect a medicinal chemist's reasoning, and in the unprecedented use of a space-filling (Hilbert) curve in the process.
    Repository: https://github.com/ncats/hcase.
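
The step of folding an ordered line into 2D can be sketched with the standard Hilbert-curve index-to-coordinate conversion. This is not the HCASE implementation: the function names are hypothetical, and the sketch assumes that the scaffold at rank `i` is simply placed at curve index `i`, whereas the published method may scale or bin ranks differently.

```python
def hilbert_d2xy(order, d):
    """Map index d along a Hilbert curve to (x, y) on a 2**order by 2**order grid."""
    n = 1 << order
    x = y = 0
    t = d
    s = 1
    while s < n:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:
            # Rotate or flip the quadrant so sub-curves connect end to end.
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y


def place_on_curve(ordered_items, order):
    """Assign each item of an ordered list to consecutive curve positions.

    Hypothetical helper: assumes rank i maps directly to curve index i.
    """
    if len(ordered_items) > 4 ** order:
        raise ValueError("curve of this order has too few cells")
    return {item: hilbert_d2xy(order, i) for i, item in enumerate(ordered_items)}


# Items that are neighbours in the ordering stay neighbours on the map.
layout = place_on_curve(["scaffold_a", "scaffold_b", "scaffold_c"], order=1)
```

Because consecutive curve indices always land on adjacent grid cells, scaffolds that are close in the sorted order remain close in 2D, which is the property the abstract relies on for stable, interpretable maps.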

  • Article type: Journal Article
    The incidence of thyroid cancer continues to increase even though a large number of inspection tools have been developed recently. Since there is no standard, definitive procedure for diagnosing thyroid cancer, clinicians must conduct various tests. This scrutiny process yields multi-dimensional big data, and the lack of a common approach leads to randomly distributed missing (sparse) data; both are formidable challenges for machine learning algorithms. This paper aims to develop an accurate and computationally efficient deep learning algorithm to diagnose thyroid cancer. To this end, the singularity in learning problems that stems from randomly distributed missing data is treated, and dimensionality reduction with inner and target similarity approaches is developed to select the most informative input datasets. In addition, size reduction with a hierarchical clustering algorithm is performed to eliminate highly similar data samples. Four machine learning algorithms are trained and then tested with unseen data to validate their generalization and robustness. The results show 100% preciseness on the training data and 83% on the unseen test data. The computational time efficiencies of the algorithms are also examined under equal conditions.

  • Article type: Journal Article
    The growing complexity of biological data has spurred the development of innovative computational techniques to extract meaningful information and uncover hidden patterns within vast datasets. Biological networks, such as gene regulatory networks and protein-protein interaction networks, hold critical insights into the connections and functions of biological features. Integrating and analyzing high-dimensional data, particularly in gene expression studies, stands prominent among the challenges in deciphering these networks. Clustering methods play a crucial role in addressing these challenges, with spectral clustering emerging as a potent unsupervised technique that accounts for intrinsic geometric structures. However, the user-defined cluster number in spectral clustering can lead to inconsistent and sometimes orthogonal clustering regimes. We propose the Multi-layer Bundling (MLB) method to address this limitation, combining multiple prominent clustering regimes to offer a comprehensive data view. We call the resulting clusters "bundles". This approach refines clustering outcomes, unravels hierarchical organization, and identifies bridge elements mediating communication between network components. By layering clustering results, MLB provides a global-to-local view of biological feature clusters, enabling insights into intricate biological systems. Furthermore, the method enhances bundle network predictions by integrating the bundle co-cluster matrix with the affinity matrix. The versatility of MLB extends beyond biological networks, making it applicable to various domains where understanding complex relationships and patterns is needed.
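
The idea of layering several clustering results can be sketched with a co-cluster matrix. The abstract does not define the bundle co-cluster matrix, so the construction below is an assumption borrowed from consensus clustering: entry (i, j) is the fraction of clustering layers in which elements i and j share a cluster. The function name and the toy label vectors are hypothetical, and the label vectors stand in for spectral clustering runs with different cluster numbers.

```python
def co_cluster_matrix(labelings):
    """Fraction of clustering layers in which each pair of elements co-clusters.

    labelings: list of label vectors, one per clustering layer, all of the
    same length. This is a generic consensus-style construction, assumed
    here for illustration only.
    """
    n = len(labelings[0])
    counts = [[0] * n for _ in range(n)]
    for labels in labelings:
        if len(labels) != n:
            raise ValueError("all labelings must cover the same elements")
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    counts[i][j] += 1
    layers = len(labelings)
    return [[c / layers for c in row] for row in counts]


# Three layers over four elements, e.g. clusterings with k = 2, 2 and 3.
layers = [[0, 0, 1, 1], [0, 0, 0, 1], [0, 1, 2, 2]]
C = co_cluster_matrix(layers)
```

Pairs with values near 1 stay together across layers, while an element with intermediate values towards several groups behaves like the bridge elements mentioned in the abstract.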

  • Article type: Journal Article
    With the increased reliance on multi-omics data for bulk and single-cell analyses, the availability of robust approaches to perform unsupervised analysis for clustering, visualization, and feature selection is imperative. Joint dimensionality reduction methods can be applied to multi-omics datasets to derive a global sample embedding analogous to single-omic techniques such as Principal Components Analysis (PCA). Multiple co-inertia analysis (MCIA) is a method for joint dimensionality reduction that maximizes the covariance between block- and global-level embeddings. Current implementations of MCIA are not optimized for large datasets such as those arising from single-cell studies, and lack capabilities with respect to embedding new data.
    We introduce nipalsMCIA, an MCIA implementation that solves the objective function using an extension to Non-linear Iterative Partial Least Squares (NIPALS), and shows significant speed-up over earlier implementations that rely on eigendecompositions for single-cell multi-omics data. It also removes the dependence on an eigendecomposition for calculating the variance explained, and allows users to perform out-of-sample embedding for new data. nipalsMCIA provides users with a variety of pre-processing and parameter options, as well as ease of functionality for downstream analysis of single-omic and global-embedding factors.
    nipalsMCIA is available as a BioConductor package at https://bioconductor.org/packages/release/bioc/html/nipalsMCIA.html, and includes detailed documentation and application vignettes. Supplementary Materials are available online.
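
To show why NIPALS avoids an eigendecomposition, here is a minimal sketch of the classic single-block NIPALS iteration for the first principal component. It is not the multi-block extension used by nipalsMCIA and does not use that package's API. The function name, tolerance and iteration cap are assumptions, and the input columns are assumed to be mean-centered already.

```python
import math


def nipals_first_component(X, tol=1e-10, max_iter=500):
    """First principal component of column-centered X via NIPALS.

    X: list of rows (lists of floats). Returns (scores, loadings).
    Only matrix-vector products are used; no eigendecomposition.
    """
    n, m = len(X), len(X[0])
    # Start from the column with the largest sum of squares.
    col = max(range(m), key=lambda j: sum(row[j] ** 2 for row in X))
    t = [row[col] for row in X]
    p = [0.0] * m
    for _ in range(max_iter):
        tt = sum(v * v for v in t)
        if tt == 0.0:
            raise ValueError("X has no variance")
        # Loadings: regress every column of X on the current scores.
        p = [sum(X[i][j] * t[i] for i in range(n)) / tt for j in range(m)]
        norm = math.sqrt(sum(v * v for v in p))
        p = [v / norm for v in p]
        # Scores: project every row of X on the unit-length loadings.
        t_new = [sum(X[i][j] * p[j] for j in range(m)) for i in range(n)]
        diff = sum((a - b) ** 2 for a, b in zip(t_new, t))
        t = t_new
        if diff < tol * sum(v * v for v in t):
            break
    return t, p


# Centered toy data lying along the direction (1, 1).
scores, loadings = nipals_first_component([[-2.0, -2.0], [-1.0, -1.0], [1.0, 1.0], [2.0, 2.0]])
```

Each pass costs only two matrix-vector products, which is one reason iterative schemes of this kind scale to large single-cell matrices; new samples can be embedded by projecting them on the stored loadings.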

  • Article type: Journal Article
    In this article, we develop CausalEGM, a deep learning framework for nonlinear dimension reduction and generative modeling of the dependency among covariate features affecting treatment and response. CausalEGM can be used for estimating causal effects in both binary and continuous treatment settings. By learning a bidirectional transformation between the high-dimensional covariate space and a low-dimensional latent space and then modeling the dependencies of different subsets of the latent variables on the treatment and response, CausalEGM can extract the latent covariate features that affect both treatment and response. By conditioning on these features, one can mitigate the confounding effect of the high-dimensional covariate on the estimation of the causal relation between treatment and response. In a series of experiments, the proposed method is shown to achieve superior performance over existing methods in both binary and continuous treatment settings. The improvement is substantial when the sample size is large and the covariate is of high dimension. Finally, we establish excess risk bounds and consistency results for our method, and discuss how our approach is related to and improves upon other dimension reduction approaches in causal inference.

  • Article type: Journal Article
    A major challenge of precision oncology is the identification and prioritization of suitable treatment options based on molecular biomarkers of the considered tumor. In pursuit of this goal, large cancer cell line panels have successfully been studied to elucidate the relationship between cellular features and treatment response. Due to the high dimensionality of these datasets, machine learning (ML) is commonly used for their analysis. However, choosing a suitable algorithm and set of input features can be challenging. We performed a comprehensive benchmarking of ML methods and dimension reduction (DR) techniques for predicting drug response metrics. Using the Genomics of Drug Sensitivity in Cancer cell line panel, we trained random forests, neural networks, boosting trees and elastic nets for 179 anti-cancer compounds with feature sets derived from nine DR approaches. We compare the results regarding statistical performance, runtime and interpretability. Additionally, we provide strategies for assessing model performance compared with a simple baseline model and measuring the trade-off between models of different complexity. Lastly, we show that complex ML models benefit from using an optimized DR strategy, and that standard models, even when using considerably fewer features, can still be superior in performance.

  • Article type: Journal Article
    Integrative analysis of multi-omics data has the potential to yield valuable and comprehensive insights into the molecular mechanisms underlying complex diseases such as cancer and Alzheimer's disease. However, a number of analytical challenges complicate multi-omics data integration. For instance, -omics data are usually high-dimensional, and sample sizes in multi-omics studies tend to be modest. Furthermore, when genes in an important pathway have relatively weak signals, it can be difficult to detect them individually. There is a growing body of literature on knowledge-guided learning methods that can address these challenges by incorporating biological knowledge such as functional genomics and functional proteomics into multi-omics data analysis. These methods have been shown to outperform their counterparts that do not utilize biological knowledge in tasks including prediction, feature selection, clustering, and dimension reduction. In this review, we survey recently developed methods for knowledge-guided multi-omics data integration and their applications, and discuss future research directions.