Dependent data

  • Article type: Journal Article
    A growing number of modern scientific problems in areas such as genomics, neurobiology, and spatial epidemiology involve the measurement and analysis of thousands of related features that may be stochastically dependent at arbitrarily strong levels. In this work, we consider the scenario where the features follow a multivariate Normal distribution. We demonstrate that dependence is manifested as random variation shared among features, and that standard methods may yield highly unstable inference due to dependence, even when the dependence is fully parameterized and utilized in the procedure. We propose a "cross-dimensional inference" framework that alleviates the problems due to dependence by modeling and removing the variation shared among features, while also properly regularizing estimation across features. We demonstrate the framework on both simultaneous point estimation and multiple hypothesis testing in scenarios derived from the scientific applications of interest.

  • Article type: Journal Article
    In this paper, we develop a novel nonlinear modal regression for time series under stationary α-mixing dependent samples, and establish the consistency and asymptotic properties of the proposed nonlinear modal estimator with a shrinking bandwidth h under certain regularity conditions. The asymptotic distribution is shown to be identical to the one derived from independent observations, whereas the convergence rate (√(nh³), where n is the sample size) is slower than that of nonlinear mean regression. We numerically estimate the proposed nonlinear modal regression model using a modified modal expectation-maximization (MEM) algorithm in conjunction with Taylor expansion. Monte Carlo simulations are presented to demonstrate the good finite-sample (prediction) performance of the newly proposed model. We also construct a specified nonlinear modal regression to match the available daily new cases and new deaths data of the COVID-19 outbreak at the state/region level in the United States, and provide forward predictions up to 130 days ahead (from 24 August 2020 to 31 December 2020). Compared with traditional nonlinear regressions, the suggested model fits the COVID-19 data better and produces more precise predictions. The prediction results indicate that there are systematic differences in spreading distributions among states/regions. Most western and eastern states bear more serious COVID-19 burdens than the Midwest. We hope that the nonlinear modal regression developed here can help policymakers act quickly to curb the spread of the infection, avoid overburdening the health system, and understand the development of COVID-19 from several perspectives.

  • Article type: Journal Article
    In classical causal inference, inferring cause-effect relations from data relies on the assumption that units are independent and identically distributed. This assumption is violated in settings where units are related through a network of dependencies. An example of such a setting is ad placement in sponsored search advertising, where the likelihood of a user clicking on a particular ad is potentially influenced by where it is placed and where other ads are placed on the search result page. In such scenarios, confounding arises due to not only the individual ad-level covariates but also the placements and covariates of other ads in the system. In this paper, we leverage the language of causal inference in the presence of interference to model interactions among the ads. Quantification of such interactions allows us to better understand the click behavior of users, which in turn impacts the revenue of the host search engine and enhances user satisfaction. We illustrate the utility of our formalization through experiments carried out on the ad placement system of the Bing search engine.

  • Article type: Journal Article
    In this study, we present a novel method for quantifying dependencies in multivariate datasets, based on estimating the Rényi mutual information by minimum spanning trees (MSTs). The extent to which random variables are dependent is an important question, e.g., for uncertainty quantification and sensitivity analysis. The latter is closely related to the question of how strongly the output of, e.g., a computer simulation depends on the individual random input variables. To estimate the Rényi mutual information from data, we use a method due to Hero et al. that relies on computing the MST of the data and uses the length of the MST in an estimator for the entropy. To reduce the computational cost of constructing the exact MST for large datasets, we explore methods to compute approximations to the exact MST, and find the multilevel approach introduced recently by Zhong et al. (2015) to be the most accurate. Because the MST computation does not require knowledge (or estimation) of the distributions, our methodology is well-suited for situations where only data are available. Furthermore, we show that, in the case where only the ranking of several dependencies is required rather than their exact value, it is not necessary to compute the Rényi divergence, but only an estimator derived from it. The main contributions of this paper are the introduction of this quantifier of dependency, as well as the novel combination of using approximate methods for MSTs with estimating the Rényi mutual information via MSTs. We applied our proposed method to an artificial test case based on the Ishigami function, as well as to a real-world test case involving an El Niño dataset.
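
The entropy step mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the commonly cited form of the MST-based estimator, in which the Rényi entropy of order α = (d − γ)/d is estimated from the power-weighted MST length L_γ as ln(L_γ / n^α) / (1 − α), up to an additive constant that does not depend on the data. The function names `mst_length` and `renyi_entropy_up_to_constant` are hypothetical, the exact MST is built with a plain O(n²) Prim's algorithm rather than an approximate method, and the sketch stops at the entropy; it does not compute mutual information.

```python
import math
import random


def mst_length(points, gamma=1.0):
    """Sum of (edge length ** gamma) over the Euclidean MST (Prim, O(n^2))."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # shortest known distance from each point to the tree
    best[0] = 0.0          # start the tree at point 0 (contributes 0 to the sum)
    total = 0.0
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u] ** gamma
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total


def renyi_entropy_up_to_constant(points, gamma=1.0):
    """MST-based Renyi entropy estimate, omitting the data-independent constant.

    Assumes 0 < gamma < d and distinct points; alpha = (d - gamma) / d.
    """
    n = len(points)
    d = len(points[0])
    alpha = (d - gamma) / d
    return math.log(mst_length(points, gamma) / n ** alpha) / (1.0 - alpha)


# Usage: spreading the same sample over a larger area raises the estimate.
rng = random.Random(0)
sample = [(rng.random(), rng.random()) for _ in range(200)]
spread = [(2 * x, 2 * y) for x, y in sample]
print(renyi_entropy_up_to_constant(sample))
print(renyi_entropy_up_to_constant(spread))
```

Because the omitted constant is the same for every dataset of a given dimension and γ, comparing two such values is enough when only a ranking is needed, which is in the spirit of the ranking remark in the abstract.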

  • Article type: Journal Article
    In this paper, we study the asymptotic properties of the internal estimator of nonparametric regression with independent and dependent data. Under some weak conditions, we present results on the asymptotic normality of the estimator. Our results extend some corresponding existing ones.

  • Article type: Journal Article
    The assumption that no subject's exposure affects another subject's outcome, known as the no-interference assumption, has long held a foundational position in the study of causal inference. However, this assumption may be violated in many settings, and in recent years has been relaxed considerably. Often this has been achieved with either the aid of a known underlying network, or the assumption that the population can be partitioned into separate groups, between which there is no interference, and within which each subject's outcome may be affected by all the other subjects in the group via the proportion exposed (the stratified interference assumption). In this article, we instead consider a complete interference setting, in which each subject affects every other subject's outcome. In particular, we make the stratified interference assumption for a single group consisting of the entire sample. We show that a targeted maximum likelihood estimator for the i.i.d. setting can be used to estimate a class of causal parameters that includes direct effects and overall effects under certain interventions. This estimator remains doubly-robust, semiparametric efficient, and continues to allow for incorporation of machine learning under our model. We conduct a simulation study, and present results from a data application where we study the effect of a nurse-based triage system on the outcomes of patients receiving HIV care in Kenyan health clinics.

  • Article type: Journal Article
    We study the framework for semi-parametric estimation and statistical inference for the sample average treatment-specific mean effects in observational settings where data are collected on a single network of connected units (e.g., in the presence of interference or spillover). Despite recent advances, many current statistical methods rely on estimation techniques that assume a particular parametric model for the outcome, even though some of the most important statistical assumptions required by these models are most likely violated in observational network settings, often resulting in invalid and anti-conservative statistical inference. In this manuscript, we rely on recent methodological advances in targeted maximum likelihood estimation (TMLE) of causal effects in a network of causally connected units to describe an estimation approach that permits more realistic classes of data-generative models and provides valid statistical inference in the context of network-dependent data. The approach is applied to an observational setting with a single time point stochastic intervention. We start by assuming that the true observed data-generating distribution belongs to a large class of semi-parametric statistical models. We then impose some restrictions on the possible set of data-generative distributions that may belong to our statistical model. For example, we assume that the dependence among units can be fully described by the known network, and that the dependence on other units can be summarized via some known (but otherwise arbitrary) summary measures. We show that under our modeling assumptions, our estimand is equivalent to an estimand in a hypothetical iid data distribution, where the latter distribution is a function of the observed network data-generating distribution. With this key insight in mind, we show that the TMLE for our estimand in dependent network data can be described as a certain iid data TMLE algorithm, also resulting in a new simplified approach to conducting statistical inference. We demonstrate the validity of our approach in a network simulation study. We also extend prior work on dependent-data TMLE towards estimation of novel causal parameters, e.g., the unit-specific direct treatment effects under interference and the effects of interventions that modify the initial network structure.

  • Article type: Journal Article
    We consider the partial least squares algorithm for dependent data and study the consequences of ignoring the dependence both theoretically and numerically. Ignoring nonstationary dependence structures can lead to inconsistent estimation, but a simple modification yields consistent estimation. A protein dynamics example illustrates the superior predictive power of the proposed method.

  • Article type: Journal Article
    BACKGROUND: The disadvantages of using the odds ratio (OR) as a measure of association have already been pointed out for designs such as cohort and cross-sectional studies, for which the relative risk (RR) or prevalence ratio (PR) is preferable. The model that directly estimates RR or PR and correctly specifies the distribution of the outcome as binomial is the log-binomial model; however, convergence problems occur very often. Robust Poisson regression also estimates these measures, but it can produce probabilities greater than 1.
    RESULTS: In this paper, the use of a Bayesian approach to solve the convergence problem of the log-binomial model is illustrated. Furthermore, the method is extended to incorporate dependent data, as in cluster clinical trials and studies with multilevel design, and also to analyse polytomous outcomes. Comparisons between methods are made by analysing four data sets.
    CONCLUSIONS: In all cases analysed, it was observed that Bayesian methods are capable of estimating the measures of interest, always within the correct parametric space of probabilities.
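
The distinction between the odds ratio and the relative risk drawn in the abstract above can be made concrete with a small arithmetic sketch. The counts below are invented for illustration and do not come from the paper, and the helper names `risk_ratio` and `odds_ratio` are hypothetical. The sketch computes crude 2×2 measures only; it does not fit a log-binomial or Bayesian model.

```python
def risk_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Relative risk (or prevalence ratio): ratio of the two event proportions."""
    return (events_exposed / n_exposed) / (events_unexposed / n_unexposed)


def odds_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Odds ratio: ratio of the two event odds (events / non-events)."""
    odds_exposed = events_exposed / (n_exposed - events_exposed)
    odds_unexposed = events_unexposed / (n_unexposed - events_unexposed)
    return odds_exposed / odds_unexposed


# Common outcome (made-up counts): 40/100 exposed vs 20/100 unexposed.
print(risk_ratio(40, 100, 20, 100), odds_ratio(40, 100, 20, 100))

# Rare outcome (made-up counts): 4/1000 exposed vs 2/1000 unexposed.
print(risk_ratio(4, 1000, 2, 1000), odds_ratio(4, 1000, 2, 1000))
```

With a common outcome the odds ratio lies noticeably further from 1 than the relative risk, while for a rare outcome the two nearly coincide, which is one reason RR or PR is often preferred in cohort and cross-sectional designs.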

  • Article type: Journal Article
    We describe, evaluate, and recommend statistical methods for the analysis of paired binomial proportions. A total of 24 methods are considered. The best tests for association include the asymptotic McNemar test and the McNemar mid-p test. For the difference between proportions, we recommend two simple confidence intervals with closed-form expressions and the asymptotic score interval. The asymptotic score interval is also recommended for the ratio of proportions, as is an interval with closed-form expression based on combining two Wilson score intervals for the single proportion. For the odds ratio, we recommend a transformation of the Wilson score interval and a transformation of the Clopper-Pearson mid-p interval. We illustrate the practical application of the methods using data from a recently published study of airway reactivity in children before and after stem cell transplantation and a matched case-control study of the association between floppy eyelid syndrome and obstructive sleep apnea-hypopnea syndrome.
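
The two tests for association named in the abstract above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' code: `b` and `c` denote the two discordant-pair counts of the paired 2×2 table, the asymptotic test uses the statistic (b − c)² / (b + c) with a chi-square reference distribution on one degree of freedom, and the mid-p value is taken as twice the one-sided Binomial(b + c, 1/2) tail with only half the probability of the observed count included. The function names are hypothetical, and returning 1.0 when there are no discordant pairs is a convention chosen here.

```python
import math


def mcnemar_asymptotic(b, c):
    """Two-sided p-value of the asymptotic McNemar test (no continuity correction)."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: convention assumed in this sketch
    z = (b - c) / math.sqrt(n)
    # Chi-square (1 df) upper tail at z**2 equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2.0))


def mcnemar_midp(b, c):
    """Two-sided mid-p value based on Binomial(b + c, 1/2)."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: convention assumed in this sketch
    k = min(b, c)
    pmf = [math.comb(n, i) * 0.5 ** n for i in range(k + 1)]
    # Twice the lower tail, minus the point probability of the observed count.
    return min(1.0, 2.0 * sum(pmf) - pmf[-1])


# Usage with made-up discordant counts (not data from the studies cited above).
print(mcnemar_asymptotic(10, 3))
print(mcnemar_midp(10, 3))
```

Both functions depend only on the discordant pairs, so the concordant cells of the table never enter the calculation.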