count data

  • Article type: Journal Article
    This paper assesses analytical strategies that respect the bounded-count nature of health outcomes often encountered in empirical applications. Absent in the literature is a comprehensive discussion and critique of strategies for analyzing and understanding such data. The paper's goal is to provide an in-depth consideration of prominent issues arising in, and strategies for undertaking, such analyses, emphasizing the merits and limitations of various analytical tools empirical researchers may contemplate. Three main topics are covered. First, the measurement properties of bounded-count health outcomes are reviewed and their implications assessed. Second, issues arising when bounded-count outcomes are the objects of concern in evaluations are described. Third, the (conditional) probability and moment structures of bounded-count outcomes are derived, and corresponding specification and estimation strategies are presented with particular attention to partial effects. Many questions may be asked of such data in health research, and a researcher's choice of analytical method is often consequential.

  • Article type: Journal Article
    A major issue in the clinical management of epilepsy is the unpredictability of seizures. Yet, traditional approaches to seizure forecasting and risk assessment in epilepsy rely heavily on raw seizure frequencies, which are a stochastic measurement of seizure risk. We consider a Bayesian non-homogeneous hidden Markov model for unsupervised clustering of zero-inflated seizure count data. The proposed model allows for a probabilistic estimate of the sequence of seizure risk states at the individual level. It also offers significant improvement over prior approaches by incorporating a variable selection prior for the identification of clinical covariates that drive seizure risk changes and accommodating highly granular data. For inference, we implement an efficient sampler that employs stochastic search and data augmentation techniques. We evaluate model performance on simulated seizure count data. We then demonstrate the clinical utility of the proposed model by analyzing daily seizure count data from 133 patients with Dravet syndrome collected through the Seizure Tracker™ system, a patient-reported electronic seizure diary. We report on the dynamics of seizure risk cycling, including validation of several known pharmacologic relationships. We also uncover novel findings characterizing the presence and volatility of risk states in Dravet syndrome, which may directly inform counseling to reduce the unpredictability of seizures for patients with this devastating cause of epilepsy.

  • Article type: Journal Article
    The zero-inflated Poisson (ZIP) model is widely used for count data with excess zeros. Multicollinearity is common among the explanatory variables of count data. In this context, maximum likelihood estimation (MLE) typically yields unsatisfactory results because the mean squared error (MSE) is inflated. Ridge parameters are usually used to address this problem. In this study, we propose a new modified zero-inflated Poisson ridge regression model to reduce the problem of multicollinearity. We ran experiments under a specified simulation design and recorded the behavior of the proposed estimators. We also apply the proposed estimator to a real-life data set and examine how well it performs in the presence of multicollinearity under the ZIP model for count data.
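For readers unfamiliar with the ZIP model named in this abstract, the following is a minimal sketch of its probability mass function and first two moments. It does not implement the ridge estimator the study proposes; the names `lam` (Poisson rate) and `pi` (structural-zero probability) are illustrative choices, not taken from the paper.

```python
import math


def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson variable.

    lam: rate of the Poisson component (illustrative name).
    pi:  probability of a structural zero (illustrative name).
    """
    poisson = math.exp(-lam) * lam ** k / math.factorial(k)
    if k == 0:
        # Zeros come from the structural-zero state or from the Poisson part.
        return pi + (1.0 - pi) * poisson
    return (1.0 - pi) * poisson


def zip_mean_var(lam, pi):
    """Mean and variance of the zero-inflated Poisson distribution."""
    mean = (1.0 - pi) * lam
    var = (1.0 - pi) * lam * (1.0 + pi * lam)
    return mean, var
```

Whenever `pi > 0` the variance exceeds the mean, which is why excess zeros show up as overdispersion relative to a plain Poisson model.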

  • Article type: Journal Article
    The "meningitis belt" is a region in sub-Saharan Africa where annual outbreaks of meningitis occur, with epidemics observed cyclically. While we know that meningitis is heavily dependent on seasonal trends, the exact pathways for contracting the disease are not fully understood and warrant further investigation. Most previous approaches have used large-sample inference to assess impacts of weather on meningitis rates. However, in the case of rare events, the validity of such assumptions is uncertain. This work examines meningitis trends in the context of rare events, with the specific objective of quantifying the underlying seasonal patterns in meningitis rates. We compare three main classes of models: the Poisson generalized linear model, the Poisson generalized additive model, and a Bayesian hazard model extended to accommodate count data and a changing at-risk population. We compare the accuracy and robustness of the models through the bias, RMSE, and standard deviation of the estimators, and also provide a detailed case study of meningitis patterns for data collected in Navrongo, Ghana.
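To make the idea of count data with a changing at-risk population concrete, here is a small sketch of the simplest case: a Poisson model with a log link, an offset equal to the log of the population at risk, and a single categorical season covariate. In that special case the maximum likelihood rate for each season is total cases divided by total exposure. The record layout and the numbers in the usage note are hypothetical, not data from the Navrongo study, and this is not the Bayesian hazard model the authors extend.

```python
from collections import defaultdict


def seasonal_rates(records):
    """Per-season incidence rates from (season, cases, population_at_risk) records.

    For a Poisson model with offset log(population) and one indicator per
    season, the MLE of each season's rate is sum(cases) / sum(population).
    """
    cases = defaultdict(float)
    exposure = defaultdict(float)
    for season, n_cases, population in records:
        cases[season] += n_cases
        exposure[season] += population
    return {season: cases[season] / exposure[season] for season in cases}
```

A call such as `seasonal_rates([("dry", 12, 10000), ("dry", 8, 10500), ("wet", 2, 10200)])` returns one rate per season; dividing by exposure rather than averaging raw counts is what keeps the comparison fair when the population changes over time.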

  • Article type: Journal Article
    In this paper, we present an efficient statistical method (denoted 'Adaptive Resources Allocation CUSUM') to robustly and efficiently detect hotspots with limited sampling resources. Our main idea is to combine multi-armed bandit (MAB) and change-point detection methods to balance the exploration and exploitation of resource allocation for hotspot detection. Further, a Bayesian weighted update is used to update the posterior distribution of the infection rate. Then, the upper confidence bound (UCB) is used for resource allocation and planning. Finally, CUSUM monitoring statistics are used to detect the change point as well as the change location. For performance evaluation, we compare the proposed method with several benchmark methods in the literature and show that the proposed algorithm achieves a lower detection delay and higher detection precision. The method is also applied to hotspot detection in a real case study of county-level daily positive COVID-19 cases in Washington State (WA), where it demonstrates its effectiveness with very limited distributed samples.
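As background for the CUSUM component mentioned above, the sketch below shows a plain one-sided log-likelihood-ratio CUSUM for Poisson counts at a single location. It is the textbook building block only, not the authors' adaptive method: there is no bandit-based sampling, Bayesian update, or UCB step. The parameter names `lam0` (in-control rate), `lam1` (out-of-control rate), and `threshold` are assumptions made for illustration.

```python
import math


def poisson_cusum(counts, lam0, lam1, threshold):
    """Return the index of the first alarm, or None if no alarm is raised.

    Each observation adds the Poisson log-likelihood ratio of rate lam1
    versus rate lam0; the statistic is reset to zero whenever it goes
    negative, and an alarm fires once it reaches the threshold.
    """
    log_ratio = math.log(lam1 / lam0)
    stat = 0.0
    for t, y in enumerate(counts):
        stat = max(0.0, stat + y * log_ratio - (lam1 - lam0))
        if stat >= threshold:
            return t
    return None
```

A larger `threshold` lowers the false-alarm rate at the cost of a longer detection delay, which is the trade-off the abstract's performance comparison is about.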

  • Article type: Journal Article
    BACKGROUND: Fractures are rare events and can occur because of a fall. Fracture counts are distinct from other count data in that they are positively skewed, inflated by excess zero counts, and events can recur over time. Analytical methods used to assess fracture data and account for these characteristics are limited in the literature.
    METHODS: Commonly used models for count data include Poisson regression, negative binomial regression, hurdle regression, and zero-inflated regression models. In this paper, we compare four alternative statistical models for fitting fracture counts using data from a large UK-based clinical trial evaluating the clinical and cost-effectiveness of alternative falls prevention interventions in older people (Prevention of Falls Injury Trial; PreFIT).
    RESULTS: The Akaike information criterion and Bayesian information criterion, used as goodness-of-fit statistics, were lowest for the negative binomial model. The likelihood ratio test of no dispersion in the data showed strong evidence of dispersion (chi-square = 225.68, p-value < 0.001). This indicates that the negative binomial model fits the data better than the Poisson regression model. We also compared the standard negative binomial regression and mixed-effects negative binomial models. The LR test showed no gain in fit from the mixed-effects negative binomial model (chi-square = 1.67, p-value = 0.098) compared with the standard negative binomial model.
    CONCLUSIONS: The negative binomial regression model was the most appropriate and best-fitting model for fracture count analyses.
    TRIAL REGISTRATION: The PreFIT trial was registered as ISRCTN71002650.
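The two likelihood ratio statistics quoted in this abstract can be turned into p-values with a few lines of arithmetic. The sketch below assumes each test has one degree of freedom, which the abstract does not state. It also computes a halved p-value, the usual correction when the tested parameter (a dispersion or random-effect variance) lies on the boundary of its range; whether the authors applied that correction is likewise an assumption, although the halved value for 1.67 does round to the reported 0.098.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Test of no dispersion (Poisson versus negative binomial); 1 df is assumed.
p_dispersion = chi2_sf_1df(225.68)

# Mixed-effects versus standard negative binomial; 1 df is assumed.
p_mixed_naive = chi2_sf_1df(1.67)

# Boundary-corrected version: a 50:50 mixture of chi-square(0) and
# chi-square(1) halves the naive p-value.
p_mixed_boundary = 0.5 * p_mixed_naive
```

Either way the second test gives no evidence that the random effect improves the fit at the conventional 5% level, consistent with the abstract's conclusion.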

  • Article type: Randomized Controlled Trial
    BACKGROUND: Two characteristics of commonly used outcomes in medical research are zero inflation and non-negative integers; examples include the number of hospital admissions or emergency department visits, where the majority of patients will have zero counts. Zero-inflated regression models were devised to analyze this type of data. However, the performance of zero-inflated regression models or the properties of data best suited for these analyses have not been thoroughly investigated.
    METHODS: We conducted a simulation study to evaluate the performance of two generalized linear models, negative binomial and zero-inflated negative binomial, for analyzing zero-inflated count data. Simulation scenarios assumed a randomized controlled trial design and varied the true underlying distribution, sample size, and rate of zero inflation. We compared the models in terms of bias, mean squared error, and coverage. Additionally, we used logistic regression to determine which data properties are most important for predicting the best-fitting model.
    RESULTS: We first found that, regardless of the rate of zero inflation, there was little difference between the conventional negative binomial and its zero-inflated counterpart in terms of bias of the marginal treatment group coefficient. Second, even when the outcome was simulated from a zero-inflated distribution, a negative binomial model was favored above its ZI counterpart in terms of the Akaike Information Criterion. Third, the mean and skewness of the non-zero part of the data were stronger predictors of model preference than the percentage of zero counts. These results were not affected by the sample size, which ranged from 60 to 800.
    CONCLUSIONS: We recommend that the rate of zero inflation and overdispersion in the outcome should not be the sole and main justification for choosing zero-inflated regression models. Investigators should also consider other data characteristics when choosing a model for count data. In addition, if the performance of the NB and ZINB regression models is reasonably comparable even with ZI outcomes, we advocate the use of the NB regression model due to its clear and straightforward interpretation of the results.

  • Article type: Journal Article
    Estimating the rank of a corrupted data matrix is an important task in data analysis, most notably for choosing the number of components in PCA. Significant progress on this task was achieved using random matrix theory by characterizing the spectral properties of large noise matrices. However, utilizing such tools is not straightforward when the data matrix consists of count random variables, e.g., Poisson, in which case the noise can be heteroskedastic with an unknown variance in each entry. In this work, we focus on a Poisson random matrix with independent entries and propose a simple procedure, termed biwhitening, for estimating the rank of the underlying signal matrix (i.e., the Poisson parameter matrix) without any prior knowledge. Our approach is based on the key observation that one can scale the rows and columns of the data matrix simultaneously so that the spectrum of the corresponding noise agrees with the standard Marchenko-Pastur (MP) law, justifying the use of the MP upper edge as a threshold for rank selection. Importantly, the required scaling factors can be estimated directly from the observations by solving a matrix scaling problem via the Sinkhorn-Knopp algorithm. Aside from the Poisson, our approach is extended to families of distributions that satisfy a quadratic relation between the mean and the variance, such as the generalized Poisson, binomial, negative binomial, gamma, and many others. This quadratic relation can also account for missing entries in the data. We conduct numerical experiments that corroborate our theoretical findings, and showcase the advantage of our approach for rank estimation in challenging regimes. Furthermore, we demonstrate the favorable performance of our approach on several real datasets of single-cell RNA sequencing (scRNA-seq), High-Throughput Chromosome Conformation Capture (Hi-C), and document topic modeling.
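The row-and-column scaling step described in this abstract can be sketched with a basic Sinkhorn-Knopp iteration. Several choices below are assumptions of the sketch rather than details confirmed by the abstract: the counts themselves are used as plug-in estimates of the entrywise Poisson variances, the targets are row sums equal to the number of columns and column sums equal to the number of rows, and the data are scaled by the square roots of the resulting factors. The matrix is assumed to be strictly positive so that the iteration converges; the rank-selection step against the Marchenko-Pastur upper edge is not shown.

```python
import math


def sinkhorn_scale(Y, n_iter=1000, tol=1e-10):
    """Find positive u, v so that u[i] * Y[i][j] * v[j] has row sums n and column sums m."""
    m, n = len(Y), len(Y[0])
    u = [1.0] * m
    v = [1.0] * n
    for _ in range(n_iter):
        # Alternate between fixing the row sums and fixing the column sums.
        u = [n / sum(Y[i][j] * v[j] for j in range(n)) for i in range(m)]
        v = [m / sum(Y[i][j] * u[i] for i in range(m)) for j in range(n)]
        row_err = max(
            abs(u[i] * sum(Y[i][j] * v[j] for j in range(n)) - n) for i in range(m)
        )
        if row_err < tol:
            break
    return u, v


def biwhiten(Y):
    """Scale the count matrix by the square roots of the Sinkhorn factors."""
    u, v = sinkhorn_scale(Y)
    return [
        [math.sqrt(u[i]) * Y[i][j] * math.sqrt(v[j]) for j in range(len(v))]
        for i in range(len(u))
    ]
```

Because the variance of a scaled entry is `u[i] * v[j]` times the original variance, matching the row and column sums of the scaled variance matrix is what gives the noise an average variance of one in every row and column.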

  • Article type: Journal Article
    BACKGROUND: The ideal number of children (INC) is the number of children that a woman or man would have if they could go back to the time when they did not have any children and could choose exactly the number of children to have in their lifetime. Despite numerous studies on the prevalence and associated factors of the ideal number of children, there is a lack of studies that incorporate spatial and multilevel analysis. Thus, this study aimed to conduct a spatial and multilevel analysis of the ideal number of children and associated factors.
    METHODS: The study was cross-sectional, with data obtained from the 2016 Ethiopian Demographic and Health Survey (EDHS). About 13,961 women aged 15-49 who fulfilled the inclusion criteria were considered. A negative binomial regression model that incorporates spatial and multilevel analysis was employed.
    RESULTS: About 33% and 12.8% of the women reported an ideal number of four and six children, respectively. The highest INC per woman was recorded in the Oromia region, 5055 (36.1%), and the lowest in Harare, 35 (0.2%). The INC per woman was high in rural areas, 10,726 (76.6%), compared with urban areas, 3277 (23.4%). The ideal number of children was spatially clustered (Global Moran's I = 0.1439, p < .00043). Significant hotspot clusters were found in the Somali region, such as in the Afder, Shabelle, Korahe, and Doolo zones.
    CONCLUSIONS: The spatial analysis revealed significant clustering of the ideal number of children in the zones of Ethiopia. Higher INC was observed in the Somali region, specifically in the Afder, Shabelle, Korahe, and Doolo zones. Among the various factors considered, women's age, region, place of residence, women's education level, contraception use, religion, marital status, family size, and age at first birth were identified as significant predictors of the ideal number of children. These findings indicate that these factors play a crucial role in shaping reproductive preferences and decisions among women in the study population. Based on these findings, responsible bodies should prioritize targeted interventions and policies in high-risk regions to address women's specific reproductive needs.
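The Global Moran's I statistic reported in this abstract measures spatial autocorrelation. A minimal sketch of the standard formula follows; the argument names and the toy neighbourhood structure in the usage note are illustrative, and no EDHS data are used.

```python
def morans_i(values, weights):
    """Global Moran's I for values observed at n locations.

    weights[i][j] is the spatial weight between locations i and j
    (for example 1 for neighbours and 0 otherwise, with a zero diagonal).
    """
    n = len(values)
    mean = sum(values) / n
    dev = [x - mean for x in values]
    w_total = sum(sum(row) for row in weights)
    # Weighted cross-products of deviations between pairs of locations.
    cross = sum(
        weights[i][j] * dev[i] * dev[j] for i in range(n) for j in range(n)
    )
    sum_sq = sum(d * d for d in dev)
    return (n / w_total) * (cross / sum_sq)
```

For four locations in a row where each is a neighbour of the next, values that rise steadily along the row give a positive statistic (similar values sit together), while values that alternate high and low give a negative one.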

  • Article type: Journal Article
    We develop a generalized linear mixed model (GLMM) for bivariate count responses to statistically analyze dragonfly population data from the Northern Netherlands. The populations of the threatened dragonfly species Aeshna viridis were counted in the years 2015-2018 at 17 different locations (ponds and ditches). Two widely applied population size measures were used to quantify the population sizes, namely the number of exoskeletons found ('exuviae') and the number of egg-laying females spotted. Since both measures (responses) led to many zero counts but also featured very large counts, our GLMM builds on a zero-inflated bivariate geometric (ZIBGe) distribution, which we show can be easily parameterized in terms of a correlation parameter and its two marginal medians. We model the medians with linear combinations of fixed (environmental covariates) and random (location-specific intercepts) effects. Modeling the medians yields a decreased sensitivity to overly large counts, in particular in light of growing marginal zero-inflation rates. Because of the relatively small sample size (n = 114), we follow a Bayesian modeling approach and use Metropolis-Hastings Markov chain Monte Carlo (MCMC) simulations to generate posterior samples.