curse of dimensionality

  • Article type: Journal Article
    The Markov method is a common reliability assessment method. It is often used to describe the dynamic characteristics of a system, such as its repairability, fault sequences and multiple degradation states. However, the "curse of dimensionality", which refers to the exponential growth of the system state space as system complexity increases, presents a challenge to Markov-based reliability assessment of complex systems. In response to this challenge, a novel reliability assessment method for complex systems based on non-homogeneous Markov processes is proposed. This method decomposes a complex system into multilevel subsystems, each with a relatively small state space, in accordance with the system function. A homogeneous or non-homogeneous Markov model is established for each subsystem/system from the bottom up. To use the outcomes of the lower-level subsystem models as inputs to the upper-level subsystem model, an algorithm is proposed for converting the unavailability curve of a subsystem into its corresponding 2×2 dynamic state transition probability matrix (STPM). The STPM is then employed as an input to the upper-level system's non-homogeneous Markov model. A case study on the reliability assessment of a Reactor Protection System (RPS) is presented using the proposed method, and the resulting model is compared with models built with two other reference methods. This comparison verifies the effectiveness and accuracy of the proposed method.
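
    The linchpin of the bottom-up composition is turning a subsystem's unavailability curve U(t) into a time-varying 2×2 STPM that the upper-level non-homogeneous Markov model can consume. The paper's conversion algorithm is not reproduced here; the sketch below shows one simple construction under an assumed two-state (available/unavailable) reading of the curve, with the function name and the net-degradation/net-repair split chosen purely for illustration.

```python
import numpy as np

def unavailability_to_stpm(times, U):
    """Convert a sampled unavailability curve U(t) into a sequence of 2x2
    dynamic state transition probability matrices (STPMs).

    States: 0 = available, 1 = unavailable.
    Assumption (not the paper's algorithm): within each step the matrix is
    chosen so that a two-state chain started in state 0 reproduces U(t),
    with either a pure-failure or a pure-repair transition per step.
    """
    stpms = []
    for k in range(len(times) - 1):
        u0, u1 = U[k], U[k + 1]
        if u1 >= u0:                      # net degradation over the step
            p01 = (u1 - u0) / (1.0 - u0)  # failure probability in this step
            p10 = 0.0
        else:                             # net repair over the step
            p01 = 0.0
            p10 = (u0 - u1) / u0          # repair probability in this step
        P = np.array([[1.0 - p01, p01],
                      [p10, 1.0 - p10]])
        stpms.append(P)
    return stpms

# Toy usage: an exponentially saturating unavailability curve.
t = np.linspace(0.0, 10.0, 11)
U = 0.2 * (1.0 - np.exp(-0.3 * t))
for P in unavailability_to_stpm(t, U)[:2]:
    print(P)
```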

  • Article type: Journal Article
    The curse of dimensionality taxes computational resources heavily, with exponentially increasing computational cost as the dimension increases. This poses great challenges in solving high-dimensional partial differential equations (PDEs), as Richard E. Bellman first pointed out over 60 years ago. While there has been some recent success in solving numerical PDEs in high dimensions, such computations are prohibitively expensive, and true scaling of general nonlinear PDEs to high dimensions has never been achieved. We develop a new method of scaling up physics-informed neural networks (PINNs) to solve arbitrary high-dimensional PDEs. The new method, called Stochastic Dimension Gradient Descent (SDGD), decomposes the gradient of the PDEs' and PINNs' residual into pieces corresponding to different dimensions and randomly samples a subset of these dimensional pieces in each iteration of training PINNs. We prove theoretically the convergence and other desired properties of the proposed method. We demonstrate in various diverse tests that the proposed method can solve many notoriously hard high-dimensional PDEs, including the Hamilton-Jacobi-Bellman (HJB) and Schrödinger equations, in tens of thousands of dimensions very fast on a single GPU using the PINNs mesh-free approach. Notably, we solve nonlinear PDEs with nontrivial, anisotropic, and inseparable solutions in less than one hour for 1,000 dimensions and in 12 hours for 100,000 dimensions on a single GPU using SDGD with PINNs. Since SDGD is a general training methodology for PINNs, it can be applied to any current and future variants of PINNs to scale them up for arbitrary high-dimensional PDEs.
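
    The essential trick in SDGD is that when the PDE residual, and hence its gradient, decomposes into a sum of per-dimension terms, an unbiased full-gradient estimate can be built from a random subset of dimensions at each iteration. The sketch below illustrates only this sampling-and-rescaling idea on a toy separable objective; it is not a PINN, and the quadratic toy loss, step size, and function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a loss that decomposes over dimensions:
# L(theta) = sum_d 0.5 * (theta_d - target_d)^2.  A PINN residual for, e.g.,
# a high-dimensional Laplacian likewise splits into per-dimension terms.
dim = 10_000
target = rng.normal(size=dim)
theta = np.zeros(dim)

def per_dim_grad(theta, dims):
    """Gradient contribution of the sampled dimensions only."""
    g = np.zeros_like(theta)
    g[dims] = theta[dims] - target[dims]
    return g

batch_dims = 128     # dimensions sampled per iteration (SDGD-style)
lr = 0.01
for it in range(3000):
    dims = rng.choice(dim, size=batch_dims, replace=False)
    # Rescale by dim / batch_dims so the estimate is unbiased for the full gradient.
    g = (dim / batch_dims) * per_dim_grad(theta, dims)
    theta -= lr * g

print("final loss:", 0.5 * np.sum((theta - target) ** 2))
```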

  • Article type: Journal Article
    The causes of many complex human diseases are still largely unknown. Genetics plays an important role in uncovering the molecular mechanisms of complex human diseases. A key step in characterizing the genetics of a complex human disease is to unbiasedly identify disease-associated gene transcripts on a whole-genome scale. Confounding factors could cause false positives. A paired design, such as measuring gene expression before and after treatment for the same subject, can reduce the effect of known confounding factors. However, not all known confounding factors can be controlled in a paired/matched design. Model-based clustering, such as mixtures of hierarchical models, has been proposed to detect gene transcripts differentially expressed between paired samples. To the best of our knowledge, no model-based gene clustering method yet has the capacity to adjust for the effects of covariates. In this article, we propose a novel mixture of hierarchical models with covariate adjustment for identifying differentially expressed transcripts using high-throughput whole-genome data from a paired design. Both a simulation study and a real data analysis show the good performance of the proposed method.
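
    The following is not the paper's mixture of hierarchical models but a much simpler stand-in that uses the same two ingredients named in the abstract: covariate adjustment of paired differences followed by model-based clustering into down-regulated, non-differential, and up-regulated groups. All data and settings below are simulated and assumed purely for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)

# Simulated paired data: per-gene log-expression differences (after - before)
# across subjects, confounded by a subject-level covariate (e.g., age).
n_genes, n_subj = 2000, 30
age = rng.normal(50, 10, size=n_subj)
true_effect = np.r_[rng.normal(2, 0.5, 100),      # up-regulated genes
                    rng.normal(-2, 0.5, 100),     # down-regulated genes
                    np.zeros(n_genes - 200)]      # non-differential genes
diffs = (true_effect[:, None]
         + 0.05 * (age - age.mean())[None, :]     # covariate effect
         + rng.normal(0, 1, size=(n_genes, n_subj)))

# Step 1 (covariate adjustment): per gene, regress the paired differences on the
# centered covariate; the intercept is the covariate-adjusted mean difference.
Xc = (age - age.mean()).reshape(-1, 1)
scores = np.empty(n_genes)
for g in range(n_genes):
    scores[g] = LinearRegression().fit(Xc, diffs[g]).intercept_

# Step 2 (model-based clustering): 3-component mixture for down-, non-, and
# up-regulated genes, fitted on the covariate-adjusted scores.
gmm = GaussianMixture(n_components=3, random_state=0).fit(scores.reshape(-1, 1))
labels = gmm.predict(scores.reshape(-1, 1))
print("component means:", gmm.means_.ravel())
print("genes per component:", np.bincount(labels))
```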

  • Article type: Journal Article
    The marginality principle guides analysts to avoid omitting lower-order terms from models in which higher-order terms are included as covariates. Lower-order terms are viewed as "marginal" to higher-order terms. We consider how this principle applies to three cases: regression models that may include the ratio of two measured variables; polynomial transformations of a measured variable; and factorial arrangements of defined interventions. For each case, we show that which terms or transformations are considered to be lower-order, and therefore marginal, depends on the scale of measurement, which is frequently arbitrary. Understanding the implications of this point leads to an intuitive understanding of the curse of dimensionality. We conclude that the marginality principle may be useful to analysts in some specific cases but caution against invoking it as a context-free recipe.
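
    The polynomial case can be illustrated numerically: a fit on x² alone changes when the origin of x is shifted, whereas adding the marginal linear term makes the fitted predictions invariant to that arbitrary choice of scale. The sketch below is an illustration of this point, not code from the article; the data and shift value are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=200)
y = 1.0 + 0.5 * x + 0.2 * x**2 + rng.normal(0, 0.5, size=200)

def fit_and_predict(x, y, include_linear, x_new):
    """Least-squares fit of y on x^2 (optionally also on x), then predict at x_new."""
    def design(v):
        cols = [np.ones_like(v), v, v**2] if include_linear else [np.ones_like(v), v**2]
        return np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(design(x), y, rcond=None)
    return design(x_new) @ beta

x_new = np.array([2.0, 5.0, 8.0])
shift = 3.0  # an arbitrary change of origin of the measurement scale

for include_linear in (False, True):
    pred_raw = fit_and_predict(x, y, include_linear, x_new)
    pred_shift = fit_and_predict(x - shift, y, include_linear, x_new - shift)
    print("with x term" if include_linear else "x^2 only   ",
          "max prediction change under shift:",
          np.max(np.abs(pred_raw - pred_shift)))
```

    The quadratic-only model gives different predictions after recentering x, while the model that keeps the marginal linear term is unchanged up to rounding error.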

  • Article type: Journal Article
    We reformulate and reframe a series of increasingly complex parametric statistical topics into a framework of response-vs.-covariate (Re-Co) dynamics that is described without any explicit functional structures. We then resolve these topics' data analysis tasks by discovering major factors underlying such Re-Co dynamics, making use only of the data's categorical nature. The major factor selection protocol at the heart of the Categorical Exploratory Data Analysis (CEDA) paradigm is illustrated and carried out by employing Shannon's conditional entropy (CE) and mutual information (I[Re;Co]) as the two key information-theoretic measurements. Through the process of evaluating these two entropy-based measurements and resolving statistical tasks, we acquire several computational guidelines for carrying out the major factor selection protocol in a do-and-learn fashion. Specifically, practical guidelines are established for evaluating CE and I[Re;Co] in accordance with the criterion called [C1:confirmable]. Following the [C1:confirmable] criterion, we make no attempt to acquire consistent estimates of these theoretical information measurements. All evaluations are carried out on a contingency table platform, upon which the practical guidelines also provide ways of lessening the effects of the curse of dimensionality. We explicitly work through six examples of Re-Co dynamics, within each of which several widely extended scenarios are also explored and discussed.
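
    The two measurements at the core of the protocol have standard contingency-table forms, sketched below. This shows only the textbook definitions of H(Re|Co) and I[Re;Co] on a table of counts, not the CEDA major factor selection protocol or the [C1:confirmable] guidelines; the toy table is an assumption.

```python
import numpy as np

def conditional_entropy_and_mi(table):
    """Shannon conditional entropy H(Re|Co) and mutual information I[Re;Co]
    from a contingency table with rows = response categories (Re) and
    columns = covariate categories (Co). Entropies are in nats."""
    p = table / table.sum()
    p_re = p.sum(axis=1)          # marginal distribution of the response
    p_co = p.sum(axis=0)          # marginal distribution of the covariate
    nz = p > 0
    h_joint = -np.sum(p[nz] * np.log(p[nz]))
    h_re = -np.sum(p_re[p_re > 0] * np.log(p_re[p_re > 0]))
    h_co = -np.sum(p_co[p_co > 0] * np.log(p_co[p_co > 0]))
    h_re_given_co = h_joint - h_co           # H(Re|Co) = H(Re,Co) - H(Co)
    mi = h_re - h_re_given_co                # I[Re;Co] = H(Re) - H(Re|Co)
    return h_re_given_co, mi

# Toy 3x4 contingency table of observed counts.
counts = np.array([[30,  5,  2,  1],
                   [ 4, 25,  6,  3],
                   [ 2,  6, 20, 10]], dtype=float)
ce, mi = conditional_entropy_and_mi(counts)
print(f"H(Re|Co) = {ce:.3f} nats, I[Re;Co] = {mi:.3f} nats")
```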

  • Article type: Journal Article
    The goal of this paper is to provide a way for Bayesian statisticians to incorporate subsampling directly into the Bayesian hierarchical model of their choosing without imposing additional restrictive model assumptions. We are motivated by the fact that the rise of "big data" has created difficulties for statisticians to directly apply their methods to big datasets. We introduce a "data subset model" to the popular "data model, process model, and parameter model" framework used to summarize Bayesian hierarchical models. The hyperparameters of the data subset model are specified constructively in that they are chosen such that the implied size of the subset satisfies pre-defined computational constraints. Thus, these hyperparameters effectively calibrate the statistical model to the computer itself to obtain predictions/estimations in a pre-specified amount of time. Several properties of the data subset model are provided, including propriety, partial sufficiency, and semi-parametric properties. Simulated datasets will be used to assess the consequences of subsampling, and results will be presented across different computers to show the effect of the computer on the statistical analysis. Additionally, we provide a joint analysis of a high-dimensional dataset (roughly 10 gigabytes) consisting of 2018 5-year period estimates from the US Census Bureau's Public Use Micro-Sample (PUMS).
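
    The paper's Bayesian data subset model is not reproduced here; the sketch below only illustrates the calibration idea in its crudest form, choosing a subsample size from a measured per-observation cost so that fitting completes within a time budget. The regression example, timing heuristic, and all names are assumptions for illustration.

```python
import time
import numpy as np

rng = np.random.default_rng(3)

# Full dataset: a large regression problem standing in for "big data".
n_full, p = 1_000_000, 10
X = rng.normal(size=(n_full, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + rng.normal(size=n_full)

# Step 1: estimate the per-observation cost of the fitting routine on a pilot subset.
pilot = rng.choice(n_full, size=10_000, replace=False)
t0 = time.perf_counter()
np.linalg.lstsq(X[pilot], y[pilot], rcond=None)
cost_per_obs = (time.perf_counter() - t0) / len(pilot)

# Step 2: choose the inclusion probability so the expected subset size fits the budget.
time_budget_s = 0.5
n_target = min(n_full, int(time_budget_s / cost_per_obs))
include_prob = n_target / n_full

# Step 3: Bernoulli-subsample the data and fit on the subset only.
subset = rng.random(n_full) < include_prob
beta_hat, *_ = np.linalg.lstsq(X[subset], y[subset], rcond=None)
print(f"subset size = {subset.sum()}, "
      f"max |beta_hat - beta_true| = {np.max(np.abs(beta_hat - beta_true)):.4f}")
```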

  • Article type: Journal Article
    The foundations of geometric morphometrics were worked out about 30 years ago and have continually been refined and extended. What has remained as a central thrust and source of debate in the morphometrics community is the shared goal of meaningful biological inference through a tight connection between biological theory, measurement, multivariate biostatistics, and geometry. Here we review the building blocks of modern geometric morphometrics: the representation of organismal geometry by landmarks and semilandmarks, the computation of shape or form variables via superimposition, the visualization of statistical results as actual shapes or forms, the decomposition of shape variation into symmetric and asymmetric components and into different spatial scales, the interpretation of various geometries in shape or form space, and models of the association between shape or form and other variables, such as environmental, genetic, or behavioral data. We focus on recent developments and current methodological challenges, especially those arising from the increasing number of landmarks and semilandmarks, and emphasize the importance of thorough exploratory multivariate analyses rather than single scalar summary statistics. We outline promising directions for further research and for the evaluation of new developments, such as "landmark-free" approaches. To illustrate these methods, we analyze three-dimensional human face shape based on data from the Avon Longitudinal Study of Parents and Children (ALSPAC).
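
    One of the building blocks listed above, superimposition, has a compact standard form. The sketch below performs an ordinary Procrustes alignment of one landmark configuration onto another (translation, unit centroid size, optimal rotation); generalized Procrustes analysis, as used on samples in geometric morphometrics, iterates this step against a consensus configuration. This is textbook material, not code from the review, and the toy configurations are assumptions.

```python
import numpy as np

def ordinary_procrustes(ref, target):
    """Align `target` (k landmarks x d dims) onto `ref` by translation,
    scaling to unit centroid size, and rotation; returns the aligned configuration."""
    ref_c = ref - ref.mean(axis=0)
    tgt_c = target - target.mean(axis=0)
    ref_c = ref_c / np.linalg.norm(ref_c)     # unit centroid size
    tgt_c = tgt_c / np.linalg.norm(tgt_c)
    # Optimal rotation via SVD of the cross-covariance matrix.
    u, _, vt = np.linalg.svd(tgt_c.T @ ref_c)
    rot = u @ vt
    if np.linalg.det(rot) < 0:                # avoid reflections
        u[:, -1] *= -1
        rot = u @ vt
    return tgt_c @ rot

# Toy usage: a 2-D square and a rotated, scaled, and translated copy of it.
square = np.array([[0, 0], [1, 0], [1, 1], [0, 1]], dtype=float)
angle = np.deg2rad(30)
R = np.array([[np.cos(angle), -np.sin(angle)], [np.sin(angle), np.cos(angle)]])
moved = 2.5 * square @ R.T + np.array([4.0, -1.0])

aligned = ordinary_procrustes(square, moved)
ref_normed = square - square.mean(axis=0)
ref_normed = ref_normed / np.linalg.norm(ref_normed)
print("residual shape distance:", np.linalg.norm(aligned - ref_normed))
```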

  • Article type: Journal Article
    We propose a method for inference on moderately high-dimensional, nonlinear, non-Gaussian, partially observed Markov process models for which the transition density is not analytically tractable. Markov processes with intractable transition densities arise in models defined implicitly by simulation algorithms. Widely used particle filter methods are applicable to nonlinear, non-Gaussian models but suffer from the curse of dimensionality. Improved scalability is provided by ensemble Kalman filter methods, but these are inappropriate for highly nonlinear and non-Gaussian models. We propose a particle filter method having improved practical and theoretical scalability with respect to the model dimension. This method is applicable to implicitly defined models having analytically intractable transition densities. Our method is developed based on the assumption that the latent process is defined in continuous time and that a simulator of this latent process is available. In this method, particles are propagated at intermediate time intervals between observations and are resampled based on a forecast likelihood of future observations. We combine this particle filter with parameter estimation methodology to enable likelihood-based inference for highly nonlinear spatiotemporal systems. We demonstrate our methodology on a stochastic Lorenz 96 model and a model for the population dynamics of infectious diseases in a network of linked regions.
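
    For orientation, the sketch below is a standard bootstrap particle filter on a toy one-dimensional state-space model. It shows the propagate-weight-resample cycle that the proposed method builds on, but not the intermediate-time propagation and forecast-likelihood resampling that give the method its improved scaling; the toy model and all parameter values are assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def particle_filter(y_obs, n_particles, step_fn, obs_loglik, init_fn):
    """Standard bootstrap particle filter: propagate particles with a simulator,
    weight them by the observation likelihood, and resample. Returns an estimate
    of the log-likelihood of the observation sequence."""
    particles = init_fn(n_particles)
    loglik = 0.0
    for y in y_obs:
        particles = step_fn(particles)                 # simulate latent dynamics
        logw = obs_loglik(y, particles)                # weight by observation density
        m = logw.max()
        w = np.exp(logw - m)
        loglik += m + np.log(w.mean())                 # log-likelihood increment
        idx = rng.choice(n_particles, size=n_particles, p=w / w.sum())
        particles = particles[idx]                     # multinomial resampling
    return loglik

# Toy model: 1-D latent random walk observed with Gaussian noise.
T, sigma_x, sigma_y = 50, 0.3, 0.5
x = np.cumsum(rng.normal(0, sigma_x, size=T))
y_obs = x + rng.normal(0, sigma_y, size=T)

ll = particle_filter(
    y_obs, n_particles=500,
    step_fn=lambda p: p + rng.normal(0, sigma_x, size=p.shape),
    obs_loglik=lambda y, p: -0.5 * ((y - p) / sigma_y) ** 2 - np.log(sigma_y * np.sqrt(2 * np.pi)),
    init_fn=lambda n: np.zeros(n),
)
print("estimated log-likelihood:", ll)
```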

  • Article type: Journal Article
    We present a new heuristic feature-selection (FS) algorithm that integrates in a principled algorithmic framework the three key FS components: relevance, redundancy, and complementarity. Thus, we call it relevance, redundancy, and complementarity trade-off (RRCT). The association strength between each feature and the response and between feature pairs is quantified via an information theoretic transformation of rank correlation coefficients, and the feature complementarity is quantified using partial correlation coefficients. We empirically benchmark the performance of RRCT against 19 FS algorithms across four synthetic and eight real-world datasets in indicative challenging settings evaluating the following: (1) matching the true feature set and (2) out-of-sample performance in binary and multi-class classification problems when presenting selected features into a random forest. RRCT is very competitive in both tasks, and we tentatively make suggestions on the generalizability and application of the best-performing FS algorithms across settings where they may operate effectively.
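
    The exact RRCT scoring rule is not reproduced below; the sketch only combines the three named ingredients in an assumed way: relevance and redundancy from rank correlations passed through the Gaussian correlation-to-mutual-information transform, complementarity from partial correlations, and a simple greedy trade-off. The equal weighting of the three terms and the toy data are assumptions.

```python
import numpy as np
from scipy.stats import spearmanr

def corr_to_mi(rho):
    """Information-theoretic transform of a correlation coefficient
    (Gaussian-case mutual information): I = -0.5 * log(1 - rho^2)."""
    rho = np.clip(rho, -0.999999, 0.999999)
    return -0.5 * np.log(1.0 - rho ** 2)

def partial_corr(x, z, y):
    """Partial Spearman correlation between x and y, controlling for z."""
    rxy, rxz, rzy = spearmanr(x, y)[0], spearmanr(x, z)[0], spearmanr(z, y)[0]
    return (rxy - rxz * rzy) / np.sqrt((1 - rxz ** 2) * (1 - rzy ** 2))

def greedy_select(X, y, k):
    """Greedy selection trading off relevance, redundancy, and complementarity
    (a simplified stand-in for RRCT, not the paper's exact scoring)."""
    n, p = X.shape
    relevance = np.array([corr_to_mi(spearmanr(X[:, j], y)[0]) for j in range(p)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best_j, best_score = None, -np.inf
        for j in range(p):
            if j in selected:
                continue
            red = np.mean([corr_to_mi(spearmanr(X[:, j], X[:, s])[0]) for s in selected])
            comp = np.mean([corr_to_mi(partial_corr(X[:, j], X[:, s], y)) for s in selected])
            score = relevance[j] - red + comp
            if score > best_score:
                best_j, best_score = j, score
        selected.append(best_j)
    return selected

# Toy usage: y depends on features 0 and 1; feature 2 duplicates feature 0
# and should be penalized as redundant.
rng = np.random.default_rng(5)
X = rng.normal(size=(400, 6))
X[:, 2] = X[:, 0] + 0.05 * rng.normal(size=400)
y = X[:, 0] + X[:, 1] + 0.3 * rng.normal(size=400)
print("selected features:", greedy_select(X, y, k=3))
```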

  • Article type: Journal Article
    Despite the numerous band selection (BS) algorithms reported in the field, most if not all have exhibited maximal accuracy when more spectral bands are utilized for classification. This apparently disagrees with the theoretical model of the 'curse of dimensionality' phenomenon, without apparent explanations. If it were true, then BS would be deemed an academic piece of research without real benefits to practical applications. This paper presents a spatial spectral mutual information (SSMI) BS scheme that utilizes a spatial feature extraction technique as a preprocessing step, followed by the clustering of the mutual information (MI) of spectral bands to enhance the efficiency of the BS. Through the SSMI BS scheme, a sharp 'bell'-shaped accuracy-dimensionality characteristic that peaks at about 20 bands has been observed for the very first time. The performance of the proposed SSMI BS scheme has been validated on 6 hyperspectral imaging (HSI) datasets (Indian Pines, Botswana, Barrax, Pavia University, Salinas, and Kennedy Space Center (KSC)), and its classification accuracy is shown to be approximately 10% better than seven state-of-the-art BS schemes (Saliency, HyperBS, SLN, OCF, FDPC, ISSC, and Convolutional Neural Network (CNN)). The present result confirms that high efficiency of the BS scheme is essentially important to observe and validate the Hughes phenomenon in the analysis of HSI data. Experiments also show that the classification accuracy can be affected by as much as approximately 10% when a single 'crucial' band is included or missed out for classification.
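
    The SSMI scheme's spatial feature extraction preprocessing is not reproduced here; the sketch below illustrates only the subsequent step of clustering spectral bands by mutual information and keeping one representative band per cluster. The histogram binning, average-linkage clustering, variance-based representative choice, and synthetic cube are all assumptions made for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from sklearn.metrics import mutual_info_score

def select_bands(cube, n_bands, n_hist_bins=32):
    """Pick representative spectral bands by clustering bands on pairwise mutual
    information and keeping the highest-variance band per cluster."""
    h, w, b = cube.shape
    flat = cube.reshape(-1, b)
    # Discretize each band for histogram-based MI estimation.
    binned = np.stack([np.digitize(flat[:, i], np.histogram_bin_edges(flat[:, i], n_hist_bins))
                       for i in range(b)], axis=1)
    mi = np.zeros((b, b))
    for i in range(b):
        for j in range(i, b):
            mi[i, j] = mi[j, i] = mutual_info_score(binned[:, i], binned[:, j])
    # Turn MI similarity into a distance and cluster bands hierarchically.
    dist = mi.max() - mi
    np.fill_diagonal(dist, 0.0)
    condensed = dist[np.triu_indices(b, k=1)]
    labels = fcluster(linkage(condensed, method="average"), t=n_bands, criterion="maxclust")
    variances = flat.var(axis=0)
    return sorted(int(np.flatnonzero(labels == c)[np.argmax(variances[labels == c])])
                  for c in np.unique(labels))

# Toy usage: a synthetic 20x20 "image" with 30 bands built from 3 correlated groups.
rng = np.random.default_rng(6)
base = rng.normal(size=(20, 20, 3))
cube = np.concatenate([base[..., [k % 3]] + 0.1 * rng.normal(size=(20, 20, 1))
                       for k in range(30)], axis=2)
print("selected bands:", select_bands(cube, n_bands=3))
```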