dimensionality reduction

  • Article type: Journal Article
    Biomolecules often exhibit complex free energy landscapes in which long-lived metastable states are separated by large energy barriers. Overcoming these barriers to robustly sample transitions between the metastable states with classical molecular dynamics (MD) simulations presents a challenge. To circumvent this issue, collective variable (CV)-based enhanced sampling MD approaches are often employed. Traditional CV selection relies on intuition and prior knowledge of the system. This approach introduces bias, which can lead to incomplete mechanistic insights. Thus, automated CV detection is desired to gain a deeper understanding of the system/process. Analysis of MD data with various machine learning algorithms, such as Principal Component Analysis (PCA), Support Vector Machine (SVM), and Linear Discriminant Analysis (LDA)-based approaches, has been implemented for automated CV detection. However, the performance of these methods has not been systematically evaluated on structurally and mechanistically complex biological systems. Here, we applied these methods to MD simulations of the MFSD2A (Major Facilitator Superfamily Domain 2A) lysolipid transporter in multiple functionally relevant metastable states with the goal of identifying optimal CVs that would structurally discriminate these states. Specific emphasis was on the automated detection and interpretive power of LDA-based CVs. We found that LDA methods, which included a novel gradient descent-based multiclass harmonic variant developed here, termed GDHLDA, outperform PCA in class separation, exhibiting remarkable consistency in extracting CVs critical for distinguishing metastable states. Furthermore, the identified CVs included features previously associated with conformational transitions in MFSD2A. Specifically, conformational shifts in transmembrane helix 7 and in residue Y294 on this helix emerged as critical features discriminating the metastable states in MFSD2A. This highlights the effectiveness of LDA-based approaches in automatically extracting from MD trajectories CVs of functional relevance that can be used to drive biased MD simulations to efficiently sample conformational transitions in the molecular system.
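
The contrast between PCA and LDA in class separation can be illustrated with a minimal sketch. It is not the GDHLDA method from the abstract: it uses plain two-class Fisher LDA, and the toy data, the function names (`pca_direction`, `lda_direction`, `separation`) and the Fisher-style separation score are all assumptions made for illustration. The data are built so that the feature with the largest variance carries no class information, which is the situation where an unsupervised projection such as PCA tends to miss the discriminating direction.

```python
import numpy as np

def pca_direction(X):
    """First principal component: direction of largest total variance."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are principal axes, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[0]

def lda_direction(X, y):
    """Two-class Fisher LDA direction, Sw^{-1} (mu1 - mu0), normalized."""
    X0, X1 = X[y == 0], X[y == 1]
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Within-class scatter matrix pooled over both classes.
    Sw = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    w = np.linalg.solve(Sw, mu1 - mu0)
    return w / np.linalg.norm(w)

def separation(X, y, w):
    """Fisher-style score of the 1D projection X @ w:
    squared gap between class means over the summed class variances."""
    z = X @ w
    z0, z1 = z[y == 0], z[y == 1]
    return (z1.mean() - z0.mean()) ** 2 / (z0.var() + z1.var())

rng = np.random.default_rng(0)
n = 500
# Hypothetical toy data: feature 0 has large variance but no class signal,
# feature 1 has small variance and a class-dependent shift.
scale = np.array([5.0, 0.5])
X = np.vstack([
    rng.normal(size=(n, 2)) * scale,
    rng.normal(size=(n, 2)) * scale + np.array([0.0, 2.0]),
])
y = np.repeat([0, 1], n)

w_pca = pca_direction(X)
w_lda = lda_direction(X, y)
sep_pca = separation(X, y, w_pca)
sep_lda = separation(X, y, w_lda)
print(f"PCA separation: {sep_pca:.3f}")
print(f"LDA separation: {sep_lda:.3f}")
```

In this toy setting the PCA axis aligns with the high-variance feature while the LDA axis aligns with the class-shifted one, so the LDA projection separates the two classes far better. Real MD feature sets are much higher-dimensional and multiclass, so this should be read only as an intuition aid.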

  • Article type: Journal Article
    Subclinical mastitis is a common and economically significant disease that affects dairy sheep production. Thermal imaging presents a promising avenue for non-invasive detection, but existing methodologies often rely on simplistic temperature differentials, potentially leading to inaccurate assessments. This study proposes an advanced algorithmic approach integrating thermal imaging processing with statistical texture analysis and t-distributed stochastic neighbor embedding (t-SNE). Our method achieves a high classification accuracy of 84% using a support vector machine (SVM) classifier. Furthermore, we introduce another commonly employed evaluation metric, correlating thermal images with commercial California mastitis test (CMT) results after establishing threshold conditions on statistical features, yielding a sensitivity (the true positive rate) of 80% and a specificity (the true negative rate) of 92.5%. The evaluation metrics underscore the efficacy of our approach in detecting subclinical mastitis in dairy sheep, offering a robust tool for improved management practices.
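
As a reminder of how the two reported metrics are defined, the following sketch computes sensitivity and specificity from binary labels. The label vectors are hypothetical and are not the study's data; the function name `sensitivity_specificity` is likewise an assumption introduced here.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels, 1 = positive.

    sensitivity = TP / (TP + FN)   (true positive rate)
    specificity = TN / (TN + FP)   (true negative rate)
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical reference labels (e.g. from a reference test) and predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity = {sens:.3f}, specificity = {spec:.3f}")
```

Here three of the four positives are detected and five of the six negatives are correctly rejected, so the sketch reports a sensitivity of 0.750 and a specificity of 0.833.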

  • Article type: Journal Article
    Cognitive functioning is increasingly considered when making treatment decisions for patients with a brain tumor in view of a personalized onco-functional balance. Ideally, one can predict cognitive functioning of individual patients to make treatment decisions considering this balance. To make accurate predictions, an informative representation of tumor location is pivotal, yet comparisons of representations are lacking. Therefore, this study compares brain atlases and principal component analysis (PCA) to represent voxel-wise tumor location. Pre-operative cognitive functioning was predicted for 246 patients with a high-grade glioma across eight cognitive tests while using different representations of voxel-wise tumor location as predictors. Voxel-wise tumor location was represented using 13 different frequently used population average atlases, 13 randomly generated atlases, and 13 representations based on PCA. ElasticNet predictions were compared between representations and against a model solely using tumor volume. Pre-operative cognitive functioning could only partly be predicted from tumor location. Performances of different representations were largely similar. Population average atlases did not result in better predictions compared to random atlases. PCA-based representations did not clearly outperform other representations, although summary metrics indicated that PCA-based representations performed somewhat better in our sample. Representations with more regions or components resulted in less accurate predictions. Population average atlases possibly cannot distinguish between functionally distinct areas when applied to patients with a glioma. This stresses the need to develop and validate methods for individual parcellations in the presence of lesions. Future studies may test if the observed small advantage of PCA-based representations generalizes to other data.

  • Article type: Journal Article
    We present EPR-Net, a novel and effective deep learning approach that tackles a crucial challenge in biophysics: constructing potential landscapes for high-dimensional non-equilibrium steady-state systems. EPR-Net leverages the mathematical fact that the desired negative potential gradient is simply the orthogonal projection of the driving force of the underlying dynamics in a weighted inner-product space. Remarkably, our loss function has an intimate connection with the steady entropy production rate (EPR), enabling simultaneous landscape construction and EPR estimation. We introduce an enhanced learning strategy for systems with small noise, and extend our framework to include dimensionality reduction and the state-dependent diffusion coefficient case in a unified fashion. Comparative evaluations on benchmark problems demonstrate the superior accuracy, effectiveness and robustness of EPR-Net compared to existing methods. We apply our approach to challenging biophysical problems, such as an eight-dimensional (8D) limit cycle and a 52D multi-stability problem, for which it provides accurate solutions and interesting insights on constructed landscapes. With its versatility and power, EPR-Net offers a promising solution for diverse landscape construction problems in biophysics.

  • Article type: Journal Article
    One of the critical technologies to ensure cyberspace security is network traffic anomaly detection, which detects malicious attacks by analyzing and identifying network traffic behavior. The rapid development of the network has led to explosive growth in network traffic, which seriously impacts the user's information security. Researchers have delved into intrusion detection as an active defense technology to address this challenge. However, traditional machine learning methods struggle to capture complex threats and attack patterns when dealing with large-scale network data. In contrast, deep learning methods have the advantages of automatically extracting features from network traffic data and strong generalization capabilities. Aiming to enhance the ability of network anomaly traffic detection, this paper proposes a network traffic anomaly detection method based on the Deep Residual Shrinkage Network (DRSN), namely "GSOOA-1DDRSN". This method uses an improved Osprey optimization algorithm to select the most relevant and essential features in network traffic, reducing the features' dimensionality. For better detection performance of network traffic anomalies, a one-dimensional deep residual shrinkage network (1DDRSN) is designed as a classifier. Validation is performed using the NSL-KDD and UNSW-NB15 datasets and compared with other methods. The experimental results show that GSOOA-1DDRSN has improved multi-classification accuracy, precision, recall, and F1 score by approximately 2% and 3%, respectively, compared to the 1DDRSN model on the two datasets. Additionally, it reduces the time computation costs by 20% and 30% on these datasets. Furthermore, compared to other models, GSOOA-1DDRSN offers superior classification accuracy and effectively reduces the number of features.

  • Article type: Journal Article
    The dimension and size of data are growing rapidly with the extensive applications of computer science and lab-based engineering in daily life. Vagueness, later uncertainty, redundancy, irrelevancy, and noise in such data raise concerns in building effective learning models. Fuzzy rough sets and their extensions have been applied to deal with these issues through various data reduction approaches. However, constructing a model that can cope with all these issues simultaneously is always a challenging task. No study to date has addressed all these issues simultaneously. This paper investigates a method based on the notions of intuitionistic fuzzy (IF) and rough sets to avoid these obstacles simultaneously by putting forward an interesting data reduction technique. To accomplish this task, firstly, a novel IF similarity relation is proposed. Secondly, we establish an IF rough set model on the basis of this similarity relation. Thirdly, an IF granular structure is presented by using the established similarity relation and the lower approximation. Next, mathematical theorems are used to validate the proposed notions. Then, the importance-degree of the IF granules is employed for redundant size elimination. Further, significance-degree-preserved dimensionality reduction is discussed. Hence, simultaneous instance and feature selection for large volumes of high-dimensional datasets can be performed to eliminate redundancy and irrelevancy in both dimension and size, where vagueness and later uncertainty are handled with rough and IF sets respectively, whilst noise is tackled with the IF granular structure. Thereafter, a comprehensive experiment is carried out over benchmark datasets to demonstrate the effectiveness of the simultaneous feature and data point selection method. Finally, a framework aided by our proposed methodology is discussed to enhance the regression performance for IC50 of antiviral peptides.

  • Article type: Journal Article
    Metric multidimensional scaling is one of the classical methods for embedding data into low-dimensional Euclidean space. It creates the low-dimensional embedding by approximately preserving the pairwise distances between the input points. However, current state-of-the-art approaches only scale to a few thousand data points. For larger data sets such as those occurring in single-cell RNA sequencing experiments, the running time becomes prohibitively large and thus alternative methods such as PCA are widely used instead. Here, we propose a simple neural network-based approach for solving the metric multidimensional scaling problem that is orders of magnitude faster than previous state-of-the-art approaches, and hence scales to data sets with up to a few million cells. At the same time, it provides a non-linear mapping between high- and low-dimensional space that can place previously unseen cells in the same embedding.
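
The objective behind metric multidimensional scaling, preserving pairwise distances in a low-dimensional embedding, can be sketched as follows. This is not the neural network approach described in the abstract: it minimizes the raw stress directly over the embedding coordinates with plain gradient descent, and the toy data, step size, iteration count and function names are assumptions chosen for illustration. Because it builds the full n-by-n distance matrix, it also shows why naive formulations do not scale to large data sets.

```python
import numpy as np

def pairwise_distances(P):
    """Euclidean distance matrix between the rows of P."""
    diff = P[:, None, :] - P[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def stress(Y, D):
    """Raw stress: sum over pairs i < j of (||y_i - y_j|| - D_ij)^2."""
    E = pairwise_distances(Y)
    iu = np.triu_indices(len(Y), k=1)
    return float(((E[iu] - D[iu]) ** 2).sum())

def stress_gradient(Y, D, eps=1e-12):
    """Gradient of the raw stress with respect to each embedded point."""
    diff = Y[:, None, :] - Y[None, :, :]      # shape (n, n, k)
    E = np.sqrt((diff ** 2).sum(axis=-1))     # embedded distances
    coeff = (E - D) / (E + eps)               # zero on the diagonal
    return 2.0 * (coeff[:, :, None] * diff).sum(axis=1)

rng = np.random.default_rng(0)
n, high_dim, low_dim = 50, 5, 2

# Hypothetical toy data: points that lie on a 2D plane inside a 5D space,
# so a distance-preserving 2D embedding exists.
Z = rng.normal(size=(n, low_dim))
Q, _ = np.linalg.qr(rng.normal(size=(high_dim, low_dim)))  # orthonormal columns
X = Z @ Q.T
D = pairwise_distances(X)                     # target distances

Y = rng.normal(size=(n, low_dim))             # random initial embedding
initial_stress = stress(Y, D)

lr = 0.2 / n                                  # assumed conservative step size
for _ in range(1000):
    Y -= lr * stress_gradient(Y, D)

final_stress = stress(Y, D)
print(f"stress before: {initial_stress:.3f}, after: {final_stress:.3f}")
```

A learned parametric map, as proposed in the abstract, would replace the free coordinates `Y` with the output of a network applied to `X`, which is what allows previously unseen points to be embedded without re-optimizing.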

  • Article type: Journal Article
    In recent years, the application of single-cell transcriptomics and spatial transcriptomics analysis techniques has become increasingly widespread. Whether dealing with single-cell transcriptomic or spatial transcriptomic data, dimensionality reduction and clustering are indispensable. Both single-cell and spatial transcriptomic data are often high-dimensional, making the analysis and visualization of such data challenging. Through dimensionality reduction, it becomes possible to visualize the data in a lower-dimensional space, allowing for the observation of relationships and differences between cell subpopulations. Clustering enables the grouping of similar cells into the same cluster, aiding in the identification of distinct cell subpopulations and revealing cellular diversity, providing guidance for downstream analyses. In this review, we systematically summarized the most widely recognized algorithms employed for the dimensionality reduction and clustering analysis of single-cell transcriptomic and spatial transcriptomic data. This endeavor provides valuable insights and ideas that can contribute to the development of novel tools in this rapidly evolving field.

  • Article type: Journal Article
    Financial distress identification remains an essential topic in the scientific literature due to its importance for society and the economy. The advancements in information technology and the escalating volume of stored data have led to the emergence of financial distress identification that transcends the realm of financial statements and their indicators (ratios). The feature space could be expanded by incorporating new perspectives on feature data categories such as macroeconomics, sectors, social, board, management, judicial incidents, etc. However, the increased dimensionality results in sparse data and overfitted models. This study proposes a new approach for efficient financial distress classification assessment by combining dimensionality reduction and machine learning techniques. The proposed framework aims to identify a subset of features leading to the minimization of the loss function describing the financial distress in an enterprise. During the study, 15 dimensionality reduction techniques with different numbers of features and 17 machine learning models were compared. Overall, 1,432 experiments were performed using Lithuanian enterprise data covering the period from 2015 to 2022. Results revealed that the artificial neural network (ANN) model with 30 ranked features identified using the Random Forest mean decreasing Gini (RF_MDG) feature selection technique provided the highest AUC score. Moreover, this study has introduced a novel approach for feature extraction, which could improve financial distress classification models.

  • Article type: Journal Article
    We propose a computational framework for high-dimensional brain-body states as transient embodiments of nested internal and external dynamics governed by interoception. Unifying recent theoretical work, we suggest ways to reduce arbitrary state complexity to an observable number of features in order to accurately predict and intervene in pathological trajectories.