directed acyclic graphs

    BACKGROUND: With growing interest in causal inference and machine learning among epidemiologists, there is increasing discussion of causal discovery algorithms for guiding covariate selection. We present a case study of novice application of causal discovery tools and attempt to validate the results against a well-established causal relationship.
    METHODS: As a case study, we attempted causal discovery of relationships relevant to the effect of adherence on mortality in the placebo arm of the Coronary Drug Project (CDP) dataset. We used four algorithms available as existing software implementations and varied several model inputs.
    RESULTS: We identified 15 adjustment sets from 17 model parameterizations. When applied to a baseline covariate adjustment analysis, these 15 adjustment sets returned effect estimates with similar magnitude and direction of bias as prior published results. When using methods to control for time-varying confounding, there was generally more residual bias than with expert-selected adjustment sets.
    CONCLUSIONS: Although causal discovery algorithms can perform on par with expert knowledge, we do not recommend novice use of causal discovery without the input of experts in causal discovery. Expert support is recommended to aid in choosing the algorithm, selecting input parameters, assessing underlying assumptions, and finalizing selection of the adjustment variables.






    Incorporating prior knowledge into causal discovery is important for improving performance. We consider the incorporation of marginal causal relations, which correspond to the presence or absence of directed paths in a causal model. We propose the Marginal Prior Causal Knowledge PC (MPPC) algorithm to incorporate marginal causal relations into a constraint-based structure learning algorithm. We provide theorems on the conditional independence properties that arise from combining observational data and marginal causal relations. We compare the MPPC algorithm with other structure learning methods in both simulation studies and real-world networks. The results indicate that, compared with other constraint-based structure learning methods, the MPPC algorithm can incorporate marginal causal relations and is both more effective and more efficient.
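
    To make the notion of a marginal causal relation concrete, the following minimal sketch checks whether a candidate DAG agrees with a set of required and forbidden directed paths. It is not the MPPC algorithm itself; the function names and the toy graph are hypothetical and serve only as an illustration.

```python
from collections import defaultdict


def descendants(edges, source):
    """Return all nodes reachable from `source` via directed paths."""
    children = defaultdict(list)
    for parent, child in edges:
        children[parent].append(child)
    seen, stack = set(), [source]
    while stack:
        node = stack.pop()
        for child in children[node]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def consistent_with_marginal_prior(edges, causes, non_causes):
    """Check a DAG against marginal causal relations (illustrative helper).

    causes:     pairs (x, y) for which a directed path from x to y must exist.
    non_causes: pairs (x, y) for which no directed path from x to y may exist.
    """
    for x, y in causes:
        if y not in descendants(edges, x):
            return False
    for x, y in non_causes:
        if y in descendants(edges, x):
            return False
    return True


# Hypothetical toy DAG: A -> B -> C <- D
dag = [("A", "B"), ("B", "C"), ("D", "C")]
ok = consistent_with_marginal_prior(dag, causes=[("A", "C")], non_causes=[("D", "A")])
bad = consistent_with_marginal_prior(dag, causes=[("C", "A")], non_causes=[])
```

    Here `ok` is true because A reaches C through B and D has no path to A, while `bad` is false because C has no descendants in the toy graph.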






    The global health burden associated with exposure to heat is a grave concern and is projected to further increase under climate change. While physiological studies have demonstrated the role of humidity alongside temperature in exacerbating heat stress for humans, epidemiological findings remain conflicted. Understanding the intricate relationships between heat, humidity, and health outcomes is crucial to inform adaptation and drive increased global climate change mitigation efforts. This article introduces 'directed acyclic graphs' (DAGs) as causal models to elucidate the analytical complexity in observational epidemiological studies that focus on humid-heat-related health impacts. DAGs are employed to delineate implicit assumptions often overlooked in such studies, depicting humidity as a confounder, mediator, or an effect modifier. We also discuss complexities arising from using composite indices, such as wet-bulb temperature. DAGs representing the health impacts associated with wet-bulb temperature help to understand the limitations in separating the individual effect of humidity from the perceived effect of wet-bulb temperature on health. General examples for regression models corresponding to each of the causal assumptions are also discussed. Our goal is not to prioritize one causal model but to discuss the causal models suitable for representing humid-heat health impacts and highlight the implications of selecting one model over another. We anticipate that the article will pave the way for future quantitative studies on the topic and motivate researchers to explicitly characterize the assumptions underlying their models with DAGs, facilitating accurate interpretations of the findings. This methodology is applicable to similarly complex compound events.
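
    As a generic illustration of how the three causal roles of humidity could translate into regression specifications, consider the sketch below. The notation is assumed for this example and is not taken from the article: Y is the health outcome, T temperature, H humidity, and g a link function.

```latex
% Y: health outcome, T: temperature, H: humidity, g: link function (assumed notation)
\begin{align}
\text{Confounder:} \quad
  & g\big(\mathbb{E}[Y \mid T, H]\big) = \beta_0 + \beta_1 T + \beta_2 H \\
\text{Mediator (total effect of } T\text{):} \quad
  & g\big(\mathbb{E}[Y \mid T]\big) = \alpha_0 + \alpha_1 T \\
\text{Effect modifier:} \quad
  & g\big(\mathbb{E}[Y \mid T, H]\big) = \gamma_0 + \gamma_1 T + \gamma_2 H + \gamma_3 T H
\end{align}
```

    Under the confounder assumption, humidity is adjusted for. Under the mediator assumption, adjusting for humidity would remove part of the temperature effect, so the total effect is estimated without it. Under the effect-modifier assumption, the product term lets the temperature effect vary with humidity.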






    Bayesian Networks (BNs) represent conditional probability relations among a set of random variables (nodes) in the form of a directed acyclic graph (DAG), and have found diverse applications in knowledge discovery. We study the problem of learning the sparse DAG structure of a BN from continuous observational data. The central problem can be modeled as a mixed-integer program with an objective function composed of a convex quadratic loss function and a regularization penalty subject to linear constraints. The optimal solution to this mathematical program is known to have desirable statistical properties under certain conditions. However, the state-of-the-art optimization solvers are not able to obtain provably optimal solutions to the existing mathematical formulations for medium-size problems within reasonable computational times. To address this difficulty, we tackle the problem from both computational and statistical perspectives. On the one hand, we propose a concrete early stopping criterion to terminate the branch-and-bound process in order to obtain a near-optimal solution to the mixed-integer program, and establish the consistency of this approximate solution. On the other hand, we improve the existing formulations by replacing the linear "big-M" constraints that represent the relationship between the continuous and binary indicator variables with second-order conic constraints. Our numerical results demonstrate the effectiveness of the proposed approaches.
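
    The contrast between the two kinds of linking constraints can be sketched as follows. This is one common way to write such constraints and may differ from the exact formulation used in the paper; the symbols (edge weight, binary edge indicator, bound M, and auxiliary variable s) are assumptions for illustration.

```latex
% \beta_{jk}: edge weight, g_{jk}: binary edge indicator,
% M: assumed bound on |\beta_{jk}|, s_{jk}: auxiliary variable (assumed notation)
\begin{align}
\text{big-}M\text{ link:} \quad
  & -M g_{jk} \le \beta_{jk} \le M g_{jk}, \qquad g_{jk} \in \{0, 1\} \\
\text{conic link:} \quad
  & \beta_{jk}^2 \le s_{jk}\, g_{jk}, \qquad s_{jk} \ge 0, \; g_{jk} \in \{0, 1\}
\end{align}
```

    Both forms force the edge weight to zero whenever the indicator is zero. In the conic form no bound M has to be chosen in advance, and the auxiliary variable bounds the squared weight when the edge is present.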






    BACKGROUND: Empirical evaluation of inverse probability weighting (IPW) for self-selection bias correction is inaccessible without the full source population. We aimed to: (i) investigate how self-selection biases frequency and association measures and (ii) assess self-selection bias correction using IPW in a cohort with register linkage.
    METHODS: The source population included 17 936 individuals invited to the Copenhagen Aging and Midlife Biobank during 2009-11 (ages 49-63 years). Of these, 7185 (40.1%) participated. Register data were obtained for every invited person from 7 years before invitation to the end of 2020. The association between education and mortality was estimated using Cox regression models among participants, inverse-probability-weighted participants and the source population.
    RESULTS: Participants had higher socioeconomic position and fewer hospital contacts before baseline than the source population. Frequency measures of participants approached those of the source population after IPW. Compared with primary/lower secondary education, upper secondary, short tertiary, bachelor and master/doctoral were associated with reduced risk of death among participants (adjusted hazard ratio [95% CI]: 0.60 [0.46; 0.77], 0.68 [0.42; 1.11], 0.37 [0.25; 0.54], 0.28 [0.18; 0.46], respectively). IPW changed the estimates marginally (0.59 [0.45; 0.77], 0.57 [0.34; 0.93], 0.34 [0.23; 0.50], 0.24 [0.15; 0.39]) but not consistently towards those of the source population (0.57 [0.51; 0.64], 0.43 [0.32; 0.60], 0.38 [0.32; 0.47], 0.22 [0.16; 0.29]).
    CONCLUSIONS: Frequency measures of study participants may not reflect the source population in the presence of self-selection, but the impact on association measures can be limited. IPW may be useful for (self-)selection bias correction, but the returned results can still reflect residual or other biases and random errors.
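
    The weighting idea can be illustrated with a minimal sketch in which each participant is weighted by the inverse of the participation probability in their stratum. The data and the function name are hypothetical, and the stratified estimate is an assumption made for brevity; in practice participation probabilities would typically be estimated with a model on register covariates.

```python
from collections import Counter


def ipw_weights(invited_strata, participant_strata):
    """Weight each participant by 1 / P(participation | stratum)."""
    invited = Counter(invited_strata)
    participated = Counter(participant_strata)
    # P(participation | s) = participated[s] / invited[s], so the weight is its inverse.
    return [invited[s] / participated[s] for s in participant_strata]


# Hypothetical source population: one stratum label per invited person.
invited = ["low"] * 60 + ["high"] * 40
# Self-selection: the "high" stratum participates more often.
participants = ["low"] * 15 + ["high"] * 30

weights = ipw_weights(invited, participants)

# Frequency measure among participants, before and after weighting.
crude_share_high = participants.count("high") / len(participants)
weighted_share_high = (
    sum(w for w, s in zip(weights, participants) if s == "high") / sum(weights)
)
```

    In this toy example the crude share of the "high" stratum among participants overstates its share in the source population, and the weighted share recovers it. This works here only because the weights are built from the same variable that drives selection.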






    The causal structure of a system imposes constraints on the joint probability distribution of variables that can be generated by the system. Archetypal constraints consist of conditional independencies between variables. However, particularly in the presence of hidden variables, many causal structures are compatible with the same set of independencies inferred from the marginal distributions of observed variables. Additional constraints allow further testing for the compatibility of data with specific causal structures. An existing family of causally informative inequalities compares the information about a set of target variables contained in a collection of variables, with a sum of the information contained in different groups defined as subsets of that collection. While procedures to identify the form of these groups-decomposition inequalities have been previously derived, we substantially enlarge the applicability of the framework. We derive groups-decomposition inequalities subject to weaker independence conditions, with weaker requirements in the configuration of the groups, and additionally allowing for conditioning sets. Furthermore, we show how constraints with higher inferential power may be derived with collections that include hidden variables, and then converted into testable constraints using data processing inequalities. For this purpose, we apply the standard data processing inequality of conditional mutual information and derive an analogous property for a measure of conditional unique information recently introduced to separate redundant, synergistic, and unique contributions to the information that a set of variables has about a target.
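
    The data processing inequality mentioned above can be checked numerically on a small discrete example. The sketch below uses the unconditional mutual information as a simplification (the abstract refers to the conditional version), and the joint distribution is a hypothetical toy example.

```python
from collections import defaultdict
from math import log2


def mutual_information(joint):
    """I(X;Y) in bits for a joint distribution given as {(x, y): probability}."""
    px, py = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        px[x] += p
        py[y] += p
    return sum(
        p * log2(p / (px[x] * py[y])) for (x, y), p in joint.items() if p > 0
    )


def coarsen(joint, f):
    """Joint distribution of (X, f(Y)), i.e. Y passed through a deterministic map f."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(x, f(y))] += p
    return dict(out)


# Hypothetical toy distribution: X is binary, Y takes values 0..3.
joint = {
    (0, 0): 0.30, (0, 1): 0.10, (0, 2): 0.05, (0, 3): 0.05,
    (1, 0): 0.05, (1, 1): 0.05, (1, 2): 0.10, (1, 3): 0.30,
}

i_full = mutual_information(joint)
# Processing Y (merging values 0,1 and 2,3) cannot increase information about X.
i_processed = mutual_information(coarsen(joint, lambda y: y // 2))
```

    The processed value never exceeds the full one, which is what allows a constraint derived with a hidden variable to be turned into a weaker but testable constraint on observed variables.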






    Deterministic variables are variables that are functionally determined by one or more parent variables. They commonly arise when a variable has been functionally created from one or more parent variables, as with derived variables, and in compositional data, where the 'whole' variable is determined from its 'parts'. This article introduces how deterministic variables may be depicted within directed acyclic graphs (DAGs) to help with identifying and interpreting causal effects involving derived variables and/or compositional data. We propose a two-step approach in which all variables are initially considered, and a choice is made whether to focus on the deterministic variable or its determining parents. Depicting deterministic variables within DAGs brings several benefits. It is easier to identify and avoid misinterpreting tautological associations, i.e., self-fulfilling associations between deterministic variables and their parents, or between sibling variables with shared parents. In compositional data, it is easier to understand the consequences of conditioning on the 'whole' variable, and correctly identify total and relative causal effects. For derived variables, it encourages greater consideration of the target estimand and greater scrutiny of the consistency and exchangeability assumptions. DAGs with deterministic variables are a useful aid for planning and interpreting analyses involving derived variables and/or compositional data.
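
    A tautological association can be demonstrated with a short simulation. In this minimal sketch, two independent 'parts' are summed into a 'whole', and the whole is then correlated with one of its parts purely by construction. The variable names, sample size, and distributions are assumptions chosen for illustration.

```python
import random


def pearson(xs, ys):
    """Sample Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5


rng = random.Random(0)  # fixed seed so the sketch is reproducible
part_a = [rng.gauss(0, 1) for _ in range(5000)]
part_b = [rng.gauss(0, 1) for _ in range(5000)]
# Deterministic 'whole' variable: fully determined by its parts.
whole = [a + b for a, b in zip(part_a, part_b)]

r_parts = pearson(part_a, part_b)        # parts are independent, so close to 0
r_whole_part = pearson(whole, part_a)    # self-fulfilling association, clearly positive
```

    With two independent parts of equal variance, the correlation between the whole and either part is 1/sqrt(2) in expectation, even though neither part has any causal effect on the other.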






    Prevalent Gene Regulatory Network (GRN) construction methods rely on generalized correlation analysis. However, in biological systems, regulation is essentially a causal relationship that cannot be adequately captured solely through correlation. Therefore, it is more reasonable to infer GRNs from a causal perspective. Existing causal discovery algorithms typically rely on Directed Acyclic Graphs (DAGs) to model causal relationships, but they often require traversing the entire network, so computational demands grow rapidly with the number of nodes and the algorithms are suitable only for small networks of one or two hundred nodes or fewer. In this study, we propose the SLIVER (cauSaL dIscovery Via dimEnsionality Reduction) algorithm, which integrates a causal structural equation model with graph decomposition. SLIVER introduces a set of factor nodes, serving as abstractions of different functional modules to integrate the regulatory relationships between genes based on their respective functions or pathways, thus reducing the GRN to the product of two low-dimensional matrices. Subsequently, we employ the structural causal model (SCM) to learn the GRN within the gene node space, enforce the DAG constraint in the low-dimensional space, and guide each factor to aggregate various functions through cosine similarity. We evaluate the performance of the SLIVER algorithm on 12 real single-cell transcriptomic datasets and demonstrate that it outperforms 12 other widely used methods in both GRN inference performance and computational resource usage. Analysis of the gene information integrated by the factor nodes also demonstrates the biological interpretability of factor nodes in GRNs. We apply SLIVER to scRNA-seq data from Type 2 diabetes mellitus to capture the transcriptional regulatory structural changes of β cells under high insulin demand.






    How do we construct our causal directed acyclic graphs (DAGs), for example, for life-course modeling and analysis? In this commentary, I review how the data-driven construction of causal DAGs (causal discovery) has evolved, what promises it holds, and what limitations or caveats must be considered. I find that expert- or theory-driven model-building might benefit from some more checking against the data and that causal discovery could bring new ideas to old theories.





