dimensionality reduction

  • Article type: Journal Article
    Remote sensing datasets and methods are suitable for mapping and managing natural resources such as minerals, clean water, and energy, and for governing their sustainability. Hyperspectral (HS) imaging has immense potential for rock type classification, mineral mapping, and identification. This work demonstrates the potential of feature extraction techniques and unsupervised machine learning methods applied to space-borne hyperspectral remote sensing data for characterizing and identifying minerals and classifying rock types in Banswara, Rajasthan, India. Feature extraction techniques can reveal variations within the data, which can help identify geological areas, reduce noise, and assess the dimensionality of the data. Singular value decomposition (SVD)-based principal component analysis (PCA), kernel PCA (KPCA), minimum noise fraction (MNF), and independent component analysis (ICA) were tested for lithological mapping using recently launched DLR Earth Sensing Imaging Spectrometer (DESIS) and PRecursore IperSpettrale della Missione Applicativa (PRISMA) data in order to map geologically significant areas. Unsupervised machine learning methods, such as the Iterative Self-Organizing Data Analysis Technique (ISODATA) and K-means, were also employed. Vertex component analysis (VCA) was used to check for similarity and identify various spectral features. Our work demonstrates the advantages of feature extraction algorithms such as PCA and KPCA over MNF and ICA for geological mapping and interpretability. We recommend K-means as the preferred method for lithological classification of hyperspectral remote sensing data. Our work highlights the potential of advanced feature extraction algorithms for mineral mapping with hyperspectral data, providing different ways to identify minerals and ultimately leading to better mineral resource management.
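    As a rough sketch of this kind of pipeline (not the authors' exact processing chain), SVD-based PCA, kernel PCA, and K-means clustering can be chained as below; the cube dimensions, band count, subsampling, and cluster number are illustrative assumptions.

    ```python
    # Minimal sketch: feature extraction + unsupervised clustering of a
    # hyperspectral cube. Random data stands in for calibrated reflectance.
    import numpy as np
    from sklearn.decomposition import PCA, KernelPCA
    from sklearn.cluster import KMeans

    rows, cols, bands = 100, 100, 235            # assumed DESIS-like band count
    cube = np.random.rand(rows, cols, bands)     # stand-in reflectance cube
    pixels = cube.reshape(-1, bands)             # (n_pixels, n_bands)

    pca = PCA(n_components=10, svd_solver="full")    # SVD-based PCA
    pcs = pca.fit_transform(pixels)

    kpca = KernelPCA(n_components=10, kernel="rbf")  # nonlinear alternative
    kpcs = kpca.fit_transform(pixels[::10])          # subsampled: KPCA is O(n^2)

    labels = KMeans(n_clusters=8, n_init=10).fit_predict(pcs)
    class_map = labels.reshape(rows, cols)           # per-pixel lithology clusters
    ```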

  • Article type: Journal Article
    X-ray fluorescence (XRF) spectrometry has proven to be a core, non-destructive analytical technique in cultural heritage studies, mainly because of its non-invasive character and its ability to rapidly reveal the elemental composition of the analyzed artifacts. Being able to penetrate deeper into matter than visible light, X-rays allow further analysis that may eventually lead to the extraction of information pertaining to the substrate(s) of an artifact. The recently developed scanning macroscopic X-ray fluorescence method (MA-XRF) allows for the extraction of elemental distribution images. The present work aimed at comparing two different analysis methods for interpreting the large number of XRF spectra collected in the framework of MA-XRF analysis. The measured spectra were analyzed in two ways: a purely spectroscopic approach and an exploratory data analysis approach. The potential of the applied methods is showcased on a notable 18th-century Greek religious panel painting. The spectroscopic approach analyzes each measured spectrum separately and leads to the construction of single-element spatial distribution images (element maps). The statistical data analysis approach groups all spectra into distinct clusters with common features, after which dimensionality reduction algorithms help reduce thousands of channels of XRF spectra to an easily perceived dataset of two-dimensional images. The two analytical approaches allow detailed information to be extracted about the pigments used and the paint-layer stratigraphy (i.e., painting technique), as well as restoration interventions and the state of preservation.
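    The exploratory route can be sketched as follows; the scan grid, channel count, cluster number, and Poisson stand-in counts are assumptions, not the paper's data or code.

    ```python
    # Minimal sketch: cluster MA-XRF spectra, then reduce thousands of
    # channels to a few 2D score images via PCA.
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA

    ny, nx, channels = 40, 50, 1024                  # assumed scan grid / detector
    spectra = np.random.poisson(5.0, (ny * nx, channels)).astype(float)

    clusters = KMeans(n_clusters=6, n_init=10).fit_predict(spectra)
    cluster_map = clusters.reshape(ny, nx)           # pixels grouped by common features

    scores = PCA(n_components=3).fit_transform(spectra)
    score_images = scores.reshape(ny, nx, 3)         # easily perceived 2D images
    ```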

  • Article type: Journal Article
    A coreset is usually a small weighted subset of an input set of items that provably approximates their loss function for a given set of queries (models, classifiers, hypotheses). That is, the maximum (worst-case) error over all queries is bounded. To obtain smaller coresets, we suggest a natural relaxation: coresets whose average error over the given set of queries is bounded. We provide both deterministic and randomized (generic) algorithms for computing such a coreset for any finite set of queries. Unlike most corresponding coresets for the worst-case error, the size of the coreset in this work is independent of both the input size and its Vapnik-Chervonenkis (VC) dimension. The main technique is to reduce the average-case coreset problem to the vector summarization problem, where the goal is to compute a weighted subset of the n input vectors that approximates their sum. We then suggest the first algorithm for computing this weighted subset in time that is linear in the input size, for n ≫ 1/ε, where ε is the approximation error, improving, e.g., both [ICML'17] and applications for principal component analysis (PCA) [NIPS'16]. Experimental results show significant and consistent improvement also in practice. Open source code is provided.
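    As a rough illustration of the vector summarization goal, the sketch below draws a weighted subset by norm-proportional importance sampling. This is a simple randomized baseline, not the paper's linear-time deterministic algorithm, and all names are illustrative.

    ```python
    # Sketch: a weighted subset whose weighted sum approximates the full sum.
    import numpy as np

    def vector_summary(X, m, rng=np.random.default_rng(0)):
        """Return indices/weights with sum(w_i * X[i]) ≈ X.sum(axis=0)."""
        norms = np.linalg.norm(X, axis=1)
        p = norms / norms.sum()                 # sample large vectors more often
        idx = rng.choice(len(X), size=m, p=p)
        w = 1.0 / (m * p[idx])                  # unbiased inverse-probability weights
        return idx, w

    X = np.random.randn(10000, 20)
    idx, w = vector_summary(X, m=200)
    err = np.linalg.norm((w[:, None] * X[idx]).sum(axis=0) - X.sum(axis=0))
    ```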

  • Article type: Journal Article
    BACKGROUND: The process of retrieving similar cases in a case-based reasoning system is considered a big challenge for gene expression data sets. The huge number of gene expression values generated by microarray technology leads to complex data sets, and similarity measures for high-dimensional data are problematic. Hence, gene expression similarity measurements require numerous machine-learning and data-mining techniques, such as feature selection and dimensionality reduction, to be incorporated into the retrieval process.
    METHODS: This article proposes a case-based retrieval framework that uses a k-nearest-neighbor classifier with a weighted-feature-based similarity to retrieve previously treated patients based on their gene expression profiles.
    RESULTS: The herein-proposed methodology is validated on several data sets: a childhood leukemia data set collected from The Children's Hospital at Westmead, as well as the Colon cancer, the National Cancer Institute (NCI), and the Prostate cancer data sets. Results obtained by the proposed framework in retrieving patients of the data sets who are similar to new patients are as follows: 96% accuracy on the childhood leukemia data set, 95% on the NCI data set, 93% on the Colon cancer data set, and 98% on the Prostate cancer data set.
    CONCLUSIONS: The designed case-based retrieval framework is an appropriate choice for retrieving previous patients who are similar to a new patient, on the basis of their gene expression data, for better diagnosis and treatment of childhood leukemia. Moreover, this framework can be applied to other gene expression data sets using some or all of its steps.
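    A minimal sketch of the retrieval step, assuming ANOVA-based feature selection and a weighted Euclidean distance; the data, gene count, and weighting scheme are illustrative placeholders, not the authors' framework.

    ```python
    # Sketch: weighted-feature k-NN retrieval over gene expression profiles.
    import numpy as np
    from sklearn.feature_selection import SelectKBest, f_classif

    rng = np.random.default_rng(0)
    X = rng.random((120, 5000))                  # stand-in expression matrix
    y = rng.integers(0, 2, 120)                  # prior-patient class labels

    selector = SelectKBest(f_classif, k=50).fit(X, y)   # dimensionality reduction
    Xr = selector.transform(X)
    w = selector.scores_[selector.get_support()]        # per-gene feature weights
    w = w / w.sum()

    def retrieve_similar(cases, query, w, k=5):
        """Indices of the k prior patients closest to `query`."""
        d = np.sqrt((((cases - query) ** 2) * w).sum(axis=1))  # weighted Euclidean
        return np.argsort(d)[:k]

    neighbors = retrieve_similar(Xr, Xr[0], w)   # cases similar to patient 0
    ```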

  • Article type: Journal Article
    Chemical liabilities, such as adverse effects and toxicity, play a significant role in the modern drug discovery process. In silico assessment of chemical liabilities is an important step aimed at reducing costs and animal testing by complementing or replacing in vitro and in vivo experiments. Herein, we propose an approach combining several classification and chemography methods to predict chemical liabilities and to interpret the obtained results in the context of the impact of structural changes of compounds on their pharmacological profile. To our knowledge, the supervised extension of Generative Topographic Mapping is proposed here for the first time as an effective new chemography method. A new approach for mapping new data using supervised Isomap, without rebuilding the models from scratch, is also proposed. Two approaches for estimating a model's applicability domain are used in our study, to our knowledge for the first time in chemoinformatics. As a result of model interpretation, structural alerts responsible for the negative characteristics of the pharmacological profile of chemical compounds have been identified.
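    The out-of-sample mapping idea (projecting new compounds without refitting) can be sketched with sklearn's unsupervised Isomap, whose transform method maps unseen points; this stands in for the paper's supervised Isomap, and the descriptor matrices are placeholders.

    ```python
    # Sketch: project new compounds onto a learned 2D map without rebuilding.
    import numpy as np
    from sklearn.manifold import Isomap

    train = np.random.rand(300, 64)    # stand-in molecular descriptors
    new = np.random.rand(10, 64)       # newly assayed compounds

    iso = Isomap(n_neighbors=10, n_components=2).fit(train)
    coords_new = iso.transform(new)    # out-of-sample mapping, no refit
    ```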

  • Article type: Journal Article
    Dementia is one of the most common neurological disorders among the elderly. Identifying those who are at high risk of developing dementia is important for administering early treatment in order to slow down the progression of dementia symptoms. However, to achieve accurate classification, a significant amount of subject feature information is involved. Hence, the identification of demented subjects can be transformed into a pattern recognition problem with high-dimensional nonlinear datasets. In this paper, we introduce trace ratio linear discriminant analysis (TR-LDA) for dementia diagnosis. An improved ITR algorithm (iITR) is developed to solve the TR-LDA problem. This novel method can be integrated with advanced missing-value imputation methods and utilized for the analysis of nonlinear datasets in many real-world medical diagnosis problems. Finally, extensive simulations are conducted to show the effectiveness of the proposed method. The results demonstrate that our method can achieve higher accuracy in identifying demented patients than other state-of-the-art algorithms.
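    The trace-ratio criterion behind TR-LDA can be sketched with the classic fixed-point iteration below: repeatedly take the top eigenvectors of Sb - λ·Sw and update λ. This is an illustrative reimplementation on random placeholder data, not the paper's iITR algorithm.

    ```python
    # Sketch: maximize tr(W^T Sb W) / tr(W^T Sw W) over orthonormal W.
    import numpy as np

    def trace_ratio_lda(Sb, Sw, dim, iters=30):
        lam = 0.0
        for _ in range(iters):
            _, vecs = np.linalg.eigh(Sb - lam * Sw)   # symmetric eigendecomposition
            W = vecs[:, -dim:]                        # top-`dim` eigenvectors
            lam = np.trace(W.T @ Sb @ W) / np.trace(W.T @ Sw @ W)
        return W, lam

    rng = np.random.default_rng(0)
    X = rng.random((100, 10))
    y = rng.integers(0, 2, 100)
    mu = X.mean(axis=0)
    Sb = sum((y == c).sum() * np.outer(X[y == c].mean(0) - mu,
                                       X[y == c].mean(0) - mu)
             for c in np.unique(y))                   # between-class scatter
    Sw = sum((X[y == c] - X[y == c].mean(0)).T @ (X[y == c] - X[y == c].mean(0))
             for c in np.unique(y))                   # within-class scatter
    W, ratio = trace_ratio_lda(Sb, Sw, dim=1)         # 1-D discriminant direction
    ```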

  • Article type: Journal Article
    A long-standing hypothesis in the neuroscience community is that the central nervous system (CNS) generates muscle activities to accomplish movements by combining a relatively small number of stereotyped patterns of muscle activation, often referred to as "muscle synergies." Different definitions of synergies have been given in the literature. The most well-known are those of synchronous, time-varying, and temporal muscle synergies. Each of them is based on a different mathematical model used to factor EMG array recordings, collected during the execution of a variety of motor tasks, into a well-determined spatial, temporal, or spatio-temporal organization. This plurality of definitions and their separate application to complex tasks have so far complicated the comparison and interpretation of results obtained across studies, and it has remained unclear why and when one synergistic decomposition should be preferred to another. By using well-understood motor tasks such as elbow flexions and extensions, we aimed in this study to clarify which motor features are characterized by each kind of decomposition and to assess whether, when, and why one of them should be preferred to the others. We found that three temporal synergies, each accounting for specific temporal phases of the movements, could account for the majority of the data variation. Similar performance could be achieved by two synchronous synergies, encoding the agonist-antagonist nature of the two muscles considered, and by two time-varying muscle synergies, each encoding a task-related feature of the elbow movements, specifically their direction. Our findings support the notion that each EMG decomposition provides a set of well-interpretable muscle synergies, identifying dimensionality reduction in different aspects of the movements. Taken together, our findings suggest that the decompositions are not all equivalent and may imply different underlying neurophysiological substrates.
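    Synchronous synergies are commonly extracted with non-negative matrix factorization; the sketch below uses sklearn's NMF as an assumed stand-in for the paper's factorization, applied to random stand-in EMG envelopes for a two-muscle (agonist-antagonist) recording.

    ```python
    # Sketch: factor EMG envelopes (muscles x time) as W @ H, where columns
    # of W are muscle weightings and rows of H are activation time courses.
    import numpy as np
    from sklearn.decomposition import NMF

    emg = np.abs(np.random.randn(2, 1000))    # stand-in rectified EMG, 2 muscles
    model = NMF(n_components=2, init="nndsvda", max_iter=500)
    W = model.fit_transform(emg)              # (muscles, synergies): weightings
    H = model.components_                     # (synergies, time): activations
    vaf = 1 - np.linalg.norm(emg - W @ H) ** 2 / np.linalg.norm(emg) ** 2
    ```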