Keywords: Bayesian hierarchical model; Big data; Curse of dimensionality; Gibbs sampler; Markov chain Monte Carlo; Non-Gaussian; Small sample theory

Source: DOI:10.1080/10618600.2021.1923518

Abstract:
The goal of this paper is to provide a way for Bayesian statisticians to incorporate subsampling directly into the Bayesian hierarchical model of their choosing without imposing additional restrictive model assumptions. We are motivated by the fact that the rise of "big data" has made it difficult for statisticians to apply their methods directly to big datasets. We introduce a "data subset model" to the popular "data model, process model, and parameter model" framework used to summarize Bayesian hierarchical models. The hyperparameters of the data subset model are specified constructively, in that they are chosen such that the implied size of the subset satisfies pre-defined computational constraints. Thus, these hyperparameters effectively calibrate the statistical model to the computer itself to obtain predictions/estimations in a pre-specified amount of time. Several properties of the data subset model are provided, including propriety, partial sufficiency, and semi-parametric properties. Simulated datasets will be used to assess the consequences of subsampling, and results will be presented across different computers to show the effect of the computer on the statistical analysis. Additionally, we provide a joint analysis of a high-dimensional dataset (roughly 10 gigabytes) consisting of 2018 5-year period estimates from the US Census Bureau's Public Use Micro-Sample (PUMS).
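The idea of choosing the subset size to meet a computational constraint can be illustrated with a minimal sketch. This is not the paper's data subset model or its hyperparameter specification; it only assumes, for illustration, that the cost of one MCMC iteration grows linearly with the number of observations used, and that this per-observation cost has been timed on the target computer. The names `subset_size`, `draw_subset`, `time_budget_s`, `n_iterations`, and `cost_per_obs_s` are hypothetical.

```python
import math
import random


def subset_size(time_budget_s, n_iterations, cost_per_obs_s, n_total):
    """Largest subset size whose estimated run time fits the time budget.

    Assumption (not from the paper): one iteration costs roughly
    cost_per_obs_s * n seconds when n observations are used.
    """
    if time_budget_s <= 0 or n_iterations <= 0 or cost_per_obs_s <= 0:
        raise ValueError("budget, iterations, and cost must be positive")
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    # Time available for a single iteration.
    per_iteration_budget_s = time_budget_s / n_iterations
    # Number of observations that fit into one iteration's share of the budget.
    n = math.floor(per_iteration_budget_s / cost_per_obs_s)
    # Use at least one observation and never more than the full dataset.
    return max(1, min(n, n_total))


def draw_subset(n_total, n_subset, rng):
    """Draw n_subset distinct row indices uniformly without replacement."""
    return rng.sample(range(n_total), n_subset)


if __name__ == "__main__":
    # Illustrative numbers only: a 64-second budget, 1024 iterations, and a
    # measured cost of 2**-10 seconds per observation per iteration.
    n = subset_size(64.0, 1024, 2 ** -10, n_total=10_000)
    indices = draw_subset(10_000, n, random.Random(0))
    print(n, len(indices))
```

Because `cost_per_obs_s` is measured on a particular machine, the same time budget yields a different subset size on a faster or slower computer, which is the sense in which the analysis is calibrated to the hardware.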