Dependent data

  • Article type: Journal Article
    A growing number of modern scientific problems in areas such as genomics, neurobiology, and spatial epidemiology involve the measurement and analysis of thousands of related features that may be stochastically dependent at arbitrarily strong levels. In this work, we consider the scenario where the features follow a multivariate Normal distribution. We demonstrate that dependence is manifested as random variation shared among features, and that standard methods may yield highly unstable inference due to dependence, even when the dependence is fully parameterized and utilized in the procedure. We propose a "cross-dimensional inference" framework that alleviates the problems due to dependence by modeling and removing the variation shared among features, while also properly regularizing estimation across features. We demonstrate the framework on both simultaneous point estimation and multiple hypothesis testing in scenarios derived from the scientific applications of interest.






  • Article type: Journal Article
    In this paper, we develop a novel nonlinear modal regression for time series under stationary α-mixing dependent samples, and establish the consistency and asymptotic properties of the proposed nonlinear modal estimator with a shrinking bandwidth h under certain regularity conditions. The asymptotic distribution is shown to be identical to the one derived from independent observations, whereas the convergence rate ($\sqrt{nh^3}$, in which n is the sample size) is slower than that in nonlinear mean regression. We numerically estimate the proposed nonlinear modal regression model using a modified modal expectation-maximization (MEM) algorithm in conjunction with Taylor expansion. Monte Carlo simulations are presented to demonstrate the good finite-sample (prediction) performance of the newly proposed model. We also construct a specified nonlinear modal regression to match the available daily new cases and new deaths data of the COVID-19 outbreak at the state/region level in the United States, and provide forward predictions up to 130 days ahead (from 24 August 2020 to 31 December 2020). In comparison to traditional nonlinear regressions, the suggested model fits the COVID-19 data better and produces more precise predictions. The prediction results indicate that there are systematic differences in spreading distributions among states/regions. Most western and eastern states carry heavier COVID-19 burdens than those in the Midwest. We hope that the nonlinear modal regression built here can help policymakers act quickly to curb the spread of the infection, avoid overburdening the health system, and understand the development of COVID-19 from several perspectives.
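    The E-step/M-step structure of a modal EM iteration can be illustrated on the simplest possible case, an intercept-only model in which the "regression" is a single location parameter. This is a minimal sketch under stated assumptions, not the modified MEM algorithm of the paper: the Gaussian kernel, the function name `modal_em_location`, the bandwidth, and the toy data are all illustrative choices, and the Taylor-expansion step needed for a nonlinear regression function is omitted.

```python
from math import exp


def modal_em_location(y, h, start, tol=1e-10, max_iter=500):
    """Find a local mode of the sample y by a modal EM iteration.

    Assumptions (illustrative only): Gaussian kernel, fixed bandwidth h,
    intercept-only model, so the M-step is a plain weighted mean.
    """
    m = start
    for _ in range(max_iter):
        # E-step: kernel weight of each observation given the current mode m.
        w = [exp(-0.5 * ((yi - m) / h) ** 2) for yi in y]
        total = sum(w)
        if total == 0.0:
            break  # all weights underflowed; keep the current value
        # M-step: weighted least squares for a constant is the weighted mean.
        m_new = sum(wi * yi for wi, yi in zip(w, y)) / total
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m


# Hypothetical toy data: a cluster near 1.0 plus one outlier at 5.0.
y = [1.0, 1.1, 0.9, 1.0, 5.0]
mean_y = sum(y) / len(y)                      # pulled toward the outlier
mode_y = modal_em_location(y, h=0.3, start=mean_y)
print(f"mean = {mean_y:.3f}, modal estimate = {mode_y:.3f}")
```

    On this toy sample the mean is dragged toward the outlier while the modal estimate stays with the bulk of the data, which is the kind of robustness that motivates modal rather than mean regression.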






  • Article type: Journal Article
    In classical causal inference, inferring cause-effect relations from data relies on the assumption that units are independent and identically distributed. This assumption is violated in settings where units are related through a network of dependencies. An example of such a setting is ad placement in sponsored search advertising, where the likelihood of a user clicking on a particular ad is potentially influenced by where it is placed and where other ads are placed on the search result page. In such scenarios, confounding arises due to not only the individual ad-level covariates but also the placements and covariates of other ads in the system. In this paper, we leverage the language of causal inference in the presence of interference to model interactions among the ads. Quantification of such interactions allows us to better understand the click behavior of users, which in turn impacts the revenue of the host search engine and enhances user satisfaction. We illustrate the utility of our formalization through experiments carried out on the ad placement system of the Bing search engine.






  • Article type: Journal Article
    In this study, we present a novel method for quantifying dependencies in multivariate datasets, based on estimating the Rényi mutual information by minimum spanning trees (MSTs). The extent to which random variables are dependent is an important question, e.g., for uncertainty quantification and sensitivity analysis. The latter is closely related to the question of how strongly the output of, e.g., a computer simulation depends on the individual random input variables. To estimate the Rényi mutual information from data, we use a method due to Hero et al. that relies on computing the MST of the data and uses the length of the MST in an estimator for the entropy. To reduce the computational cost of constructing the exact MST for large datasets, we explore methods to compute approximations to the exact MST, and find the multilevel approach introduced recently by Zhong et al. (2015) to be the most accurate. Because the MST computation does not require knowledge (or estimation) of the distributions, our methodology is well-suited for situations where only data are available. Furthermore, we show that, in the case where only the ranking of several dependencies is required rather than their exact value, it is not necessary to compute the Rényi divergence, but only an estimator derived from it. The main contributions of this paper are the introduction of this quantifier of dependency, as well as the novel combination of using approximate methods for MSTs with estimating the Rényi mutual information via MSTs. We applied our proposed method to an artificial test case based on the Ishigami function, as well as to a real-world test case involving an El Niño dataset.
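    The role of the MST length in an entropy estimate can be sketched as follows. This is a hedged illustration, not the paper's implementation: it builds the exact MST with a simple O(n²) Prim's algorithm (not the approximate multilevel method), and it uses the commonly stated form of the MST-based Rényi entropy estimator, ln(L_γ / n^α) / (1 − α) with α = (d − γ)/d, with the distribution-independent additive constant omitted. The function names and the toy points are assumptions made for the example.

```python
from math import dist, log


def mst_length(points, gamma=1.0):
    """Sum of edge lengths raised to the power gamma over the Euclidean MST.

    Uses Prim's algorithm in O(n^2); fine for small illustrative samples.
    """
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [float("inf")] * n   # cheapest known edge from the tree to each point
    best[0] = 0.0
    total = 0.0
    for _ in range(n):
        # Pick the point outside the tree that is closest to the tree.
        u = min((i for i in range(n) if not in_tree[i]), key=best.__getitem__)
        in_tree[u] = True
        total += best[u] ** gamma
        for v in range(n):
            if not in_tree[v]:
                d_uv = dist(points[u], points[v])
                if d_uv < best[v]:
                    best[v] = d_uv
    return total


def renyi_entropy_up_to_constant(points, gamma=1.0):
    """MST-based Renyi entropy estimate with the additive constant omitted.

    Assumes distinct points and 0 < gamma < d, where d is the dimension.
    """
    n, d = len(points), len(points[0])
    alpha = (d - gamma) / d
    return log(mst_length(points, gamma) / n ** alpha) / (1.0 - alpha)


# Hypothetical toy data: corners of the unit square, then the same shape doubled.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
doubled = [(2 * x, 2 * y) for x, y in square]
print(mst_length(square))
print(renyi_entropy_up_to_constant(doubled) - renyi_entropy_up_to_constant(square))
```

    Because the omitted constant is the same for every sample of a given dimension and γ, differences between such estimates are unaffected by it, which is consistent with the abstract's remark that ranking dependencies does not require the exact value.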







  • Article type: Journal Article
    In this paper, we study the asymptotic properties of the internal estimator of nonparametric regression with independent and dependent data. Under some weak conditions, we present results on the asymptotic normality of the estimator. Our results extend some corresponding existing results.







  • Article type: Journal Article
    The assumption that no subject's exposure affects another subject's outcome, known as the no-interference assumption, has long held a foundational position in the study of causal inference. However, this assumption may be violated in many settings, and in recent years has been relaxed considerably. Often this has been achieved with either the aid of a known underlying network, or the assumption that the population can be partitioned into separate groups, between which there is no interference, and within which each subject's outcome may be affected by all the other subjects in the group via the proportion exposed (the stratified interference assumption). In this article, we instead consider a complete interference setting, in which each subject affects every other subject's outcome. In particular, we make the stratified interference assumption for a single group consisting of the entire sample. We show that a targeted maximum likelihood estimator for the i.i.d. setting can be used to estimate a class of causal parameters that includes direct effects and overall effects under certain interventions. This estimator remains doubly-robust, semiparametric efficient, and continues to allow for incorporation of machine learning under our model. We conduct a simulation study, and present results from a data application where we study the effect of a nurse-based triage system on the outcomes of patients receiving HIV care in Kenyan health clinics.







  • Article type: Journal Article
    We study the framework for semi-parametric estimation and statistical inference for the sample average treatment-specific mean effects in observational settings where data are collected on a single network of connected units (e.g., in the presence of interference or spillover). Despite recent advances, many of the current statistical methods rely on estimation techniques that assume a particular parametric model for the outcome, even though some of the most important statistical assumptions required by these models are most likely violated in observational network settings, often resulting in invalid and anti-conservative statistical inference. In this manuscript, we rely on recent methodological advances for the targeted maximum likelihood estimation (TMLE) of causal effects in a network of causally connected units to describe an estimation approach that permits more realistic classes of data-generative models and provides valid statistical inference in the context of network-dependent data. The approach is applied to an observational setting with a single time point stochastic intervention. We start by assuming that the true observed data-generating distribution belongs to a large class of semi-parametric statistical models. We then impose some restrictions on the possible set of the data-generative distributions that may belong to our statistical model. For example, we assume that the dependence among units can be fully described by the known network, and that the dependence on other units can be summarized via some known (but otherwise arbitrary) summary measures. We show that under our modeling assumptions, our estimand is equivalent to an estimand in a hypothetical iid data distribution, where the latter distribution is a function of the observed network data-generating distribution. With this key insight in mind, we show that the TMLE for our estimand in dependent network data can be described as a certain iid data TMLE algorithm, also resulting in a new simplified approach to conducting statistical inference. We demonstrate the validity of our approach in a network simulation study. We also extend prior work on dependent-data TMLE towards estimation of novel causal parameters, e.g., the unit-specific direct treatment effects under interference and the effects of interventions that modify the initial network structure.






  • Article type: Journal Article
    We consider the partial least squares algorithm for dependent data and study the consequences of ignoring the dependence both theoretically and numerically. Ignoring nonstationary dependence structures can lead to inconsistent estimation, but a simple modification yields consistent estimation. A protein dynamics example illustrates the superior predictive power of the proposed method.






  • Article type: Journal Article
    BACKGROUND: The disadvantages of using the odds ratio (OR) as a measure of association have already been pointed out for designs such as cohort and cross-sectional studies, for which the relative risk (RR) or prevalence ratio (PR) is preferable. The model that directly estimates RR or PR and correctly specifies the distribution of the outcome as binomial is the log-binomial model; however, convergence problems occur very often. Robust Poisson regression also estimates these measures, but it can produce probabilities greater than 1.
    RESULTS: In this paper, the use of Bayesian approach to solve the problem of convergence of the log-binomial model is illustrated. Furthermore, the method is extended to incorporate dependent data, as in cluster clinical trials and studies with multilevel design, and also to analyse polytomous outcomes. Comparisons between methods are made by analysing four data sets.
    CONCLUSIONS: In all cases analysed, it was observed that Bayesian methods are capable of estimating the measures of interest, always within the correct parametric space of probabilities.
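    Two points in the background above lend themselves to a short numerical illustration: the OR diverges from the RR when the outcome is common, and a model with a log link can return a fitted "probability" above 1. The sketch below uses made-up counts and made-up coefficients chosen purely for illustration; the function names are hypothetical and nothing here reproduces the paper's Bayesian method or its data sets.

```python
from math import exp, log


def risk_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Relative risk (or prevalence ratio) from a 2x2 table."""
    return (events_exposed / n_exposed) / (events_unexposed / n_unexposed)


def odds_ratio(events_exposed, n_exposed, events_unexposed, n_unexposed):
    """Odds ratio from the same 2x2 table."""
    odds_exposed = events_exposed / (n_exposed - events_exposed)
    odds_unexposed = events_unexposed / (n_unexposed - events_unexposed)
    return odds_exposed / odds_unexposed


# Hypothetical cohort: 60/100 events among exposed, 30/100 among unexposed.
rr = risk_ratio(60, 100, 30, 100)
or_ = odds_ratio(60, 100, 30, 100)
print(f"RR = {rr:.2f}, OR = {or_:.2f}")  # the OR overstates the RR here


def log_link_fitted_value(intercept, slope, x):
    """Fitted value exp(intercept + slope * x) under a log link.

    Nothing constrains this value to lie in [0, 1].
    """
    return exp(intercept + slope * x)


# Hypothetical coefficients: baseline risk 0.6 and a risk ratio of 2.
fitted = log_link_fitted_value(log(0.6), log(2.0), x=1.0)
print(f"fitted value = {fitted:.2f}")  # exceeds 1, so it is not a probability
```

    With a common outcome the OR (3.5) is far from the RR (2.0), and the log-link fitted value (about 1.2) falls outside the probability scale, which is the difficulty that constrained or Bayesian estimation is meant to avoid.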






  • Article type: Journal Article
    We describe, evaluate, and recommend statistical methods for the analysis of paired binomial proportions. A total of 24 methods are considered. The best tests for association include the asymptotic McNemar test and the McNemar mid-p test. For the difference between proportions, we recommend two simple confidence intervals with closed-form expressions and the asymptotic score interval. The asymptotic score interval is also recommended for the ratio of proportions, as is an interval with closed-form expression based on combining two Wilson score intervals for the single proportion. For the odds ratio, we recommend a transformation of the Wilson score interval and a transformation of the Clopper-Pearson mid-p interval. We illustrate the practical application of the methods using data from a recently published study of airway reactivity in children before and after stem cell transplantation and a matched case-control study of the association between floppy eyelid syndrome and obstructive sleep apnea-hypopnea syndrome.
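    The two recommended tests for association depend only on the discordant pair counts of the paired 2x2 table. The sketch below assumes the commonly stated forms: the asymptotic statistic (b − c)² / (b + c) referred to a chi-squared distribution with one degree of freedom, and a mid-p value equal to twice the lower binomial tail minus the point probability at min(b, c). The function names and the example counts are illustrative assumptions, not data from the paper.

```python
from math import comb, erfc, sqrt


def mcnemar_asymptotic(b, c):
    """Asymptotic McNemar test from the discordant counts b and c.

    Returns (statistic, p_value); the statistic is chi-squared with 1 df.
    """
    n = b + c
    if n == 0:
        return 0.0, 1.0  # no discordant pairs, no evidence of a difference
    stat = (b - c) ** 2 / n
    # Upper tail of chi-squared(1): P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    return stat, erfc(sqrt(stat / 2.0))


def mcnemar_mid_p(b, c):
    """McNemar mid-p test: 2 * P(X <= min(b, c)) - P(X = min(b, c)),
    with X ~ Binomial(b + c, 1/2), capped at 1."""
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)

    def pmf(j):
        return comb(n, j) * 0.5 ** n

    lower_tail = sum(pmf(j) for j in range(k + 1))
    return min(1.0, 2.0 * lower_tail - pmf(k))


# Hypothetical discordant counts: 1 pair changed one way, 5 the other.
stat, p_asym = mcnemar_asymptotic(1, 5)
p_mid = mcnemar_mid_p(1, 5)
print(f"statistic = {stat:.3f}, asymptotic p = {p_asym:.3f}, mid-p = {p_mid:.3f}")
```

    Both functions ignore the concordant pairs by design, since under the null hypothesis only the split of the discordant pairs carries information about a difference between the paired proportions.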





