    In many biomedical applications, we are more interested in the predicted probability that a numerical outcome is above a threshold than in the predicted value of the outcome. For example, it might be known that antibody levels above a certain threshold provide immunity against a disease, or a threshold for a disease severity score might reflect conversion from the presymptomatic to the symptomatic disease stage. Accordingly, biomedical researchers often convert numerical outcomes to binary outcomes, which loses information, in order to conduct logistic regression, which offers a probabilistic interpretation. We address this bad statistical practice by modelling the binary outcome with logistic regression, modelling the numerical outcome with linear regression, transforming the predicted values from linear regression to predicted probabilities, and combining the predicted probabilities from logistic and linear regression. Analysing high-dimensional simulated and experimental data, namely clinical data for predicting cognitive impairment, we obtain significantly improved predictions of dichotomised outcomes. Thus, the proposed approach effectively combines binary with numerical outcomes to improve binary classification in high-dimensional settings. An implementation is available in the R package cornet on GitHub (https://github.com/rauschenberger/cornet) and CRAN (https://CRAN.R-project.org/package=cornet).
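
The four steps described above can be sketched for a single observation as follows. This is a minimal illustration and not the cornet implementation: the normal-CDF transformation, the scale `sigma`, the mixing `weight`, and all numeric values are assumptions chosen for the example.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear_to_probability(y_hat, threshold, sigma):
    """Turn a predicted numerical value into P(outcome > threshold).

    Assumption: prediction errors are roughly normal with scale sigma.
    """
    return normal_cdf((y_hat - threshold) / sigma)


def combine(p_linear, p_logistic, weight):
    """Convex combination of the two predicted probabilities."""
    return weight * p_linear + (1.0 - weight) * p_logistic


# Hypothetical predictions for one patient.
y_hat = 12.0       # predicted numerical outcome from linear regression
p_logistic = 0.60  # predicted probability from logistic regression

p_linear = linear_to_probability(y_hat, threshold=10.0, sigma=2.0)
p_combined = combine(p_linear, p_logistic, weight=0.5)
print(round(p_linear, 4), round(p_combined, 4))
```

In practice, the scale and the weight would have to be chosen from the data, for example by cross-validation; here they are fixed by hand.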






    The main concern of this paper is to provide a flexible discrete model that captures every kind of dispersion (equi-, over- and under-dispersion). Based on the balanced discretization method, a new discrete version of the Burr-Hatke distribution with the partial moment-preserving property is introduced. Some statistical properties of the new distribution are derived, and the applicability of the proposed model is evaluated by considering counting series. A new integer-valued autoregressive (INAR) process with discrete Burr-Hatke innovations, based on mixing the Pegram and binomial thinning operators, is introduced, which can model contagious data properly. Different approaches to estimating the parameters of the new process are provided and compared through Monte Carlo simulation. The performance of the proposed process is evaluated on four data sets of daily COVID-19 death counts in Austria, Switzerland, Nigeria and Slovenia, in comparison with some competitor INAR(1) models, along with a Pearson residual analysis of the fitted model. The goodness-of-fit measures affirm the adequacy of the proposed process in modeling all COVID-19 data sets. Forecasts from the new process are obtained by classic, modified sieve bootstrap and Bayesian forecasting methods for all COVID-19 data sets, and we conclude that the Bayesian forecasting approach provides more reliable results.
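
To make the INAR idea concrete, the sketch below simulates a plain INAR(1) process using binomial thinning only. It is a simplified stand-in, not the proposed process: the Pegram mixing step is omitted, and Poisson innovations are swapped in for the discrete Burr-Hatke innovations. The function names and parameter values are hypothetical.

```python
import math
import random


def binomial_thinning(x, alpha, rng):
    """Thin a count x: each of the x units survives independently with probability alpha."""
    return sum(1 for _ in range(x) if rng.random() < alpha)


def poisson(lam, rng):
    """Draw a Poisson(lam) variate with Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_inar1(n, alpha, lam, x0=0, seed=0):
    """Simulate X_t = thinning(X_{t-1}) + innovation_t for t = 1..n."""
    rng = random.Random(seed)
    series, x = [], x0
    for _ in range(n):
        # Survivors from the previous count plus new arrivals.
        x = binomial_thinning(x, alpha, rng) + poisson(lam, rng)
        series.append(x)
    return series


counts = simulate_inar1(n=200, alpha=0.5, lam=2.0)
print(counts[:10])
```

With Poisson innovations the stationary mean is `lam / (1 - alpha)`, which gives a quick sanity check on simulated series.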






    Association testing has been widely used to study the relationship between genetic variants and phenotypes. Most association testing methods are genotype-based, i.e. they first estimate the genotype and then regress the phenotype on the estimated genotype and other variables. Direct testing methods based on next generation sequencing (NGS) data without genotype calling have been proposed and have shown an advantage over genotype-based methods in scenarios where genotype calling is not accurate. NGS data-based single-variant testing methods have been proposed, including our previously proposed UNC combo method [1]. We have also proposed NGS data-based group testing methods for continuous phenotypes, using a linear model framework that can handle continuous responses [2]. In this paper, we extend our linear model-based framework to a generalized linear model-based framework so that the methods can handle other types of responses, especially binary responses, which are commonly encountered in association studies. We have conducted extensive simulation studies to evaluate the performance of different estimators and to compare our estimators with their corresponding genotype-based methods. We found that all methods control the Type I error, and that our NGS data-based testing methods perform better than their corresponding genotype-based methods in the literature for other types of responses, including binary responses (logistic regression) and count responses (Poisson regression), especially when sequencing depth is low. In conclusion, we have extended our previous linear model (LM) framework to a generalized linear model (GLM) framework and derived NGS data-based testing methods for a group of genetic variants. Compared with our previously proposed LM-based methods [2], the new GLM-based methods can handle more complex responses (for example, binary responses and count responses) in addition to continuous responses. Our methods fill a gap in the literature and show an advantage over their corresponding genotype-based methods.






    Beta distributions are commonly used to model proportion-valued response variables, often encountered in longitudinal studies. In this article, we develop semi-parametric Beta regression models for proportion-valued responses, where the aggregate covariate effect is summarized and flexibly modeled using an interpretable monotone time-varying single index transform of a linear combination of the potential covariates. We utilize the potential of single index models, which are effective dimension reduction tools and accommodate link function misspecification in generalized linear mixed models. Our Bayesian methodology incorporates the missing-at-random feature of the proportion response and utilizes Hamiltonian Monte Carlo sampling to conduct inference. We explore finite-sample frequentist properties of our estimates and assess their robustness via detailed simulation studies. Finally, we illustrate our methodology via application to a motivating longitudinal dataset from obesity research recording proportion body fat.






    With a single circulating vector-borne virus, the basic reproduction number incorporates contributions from tick-to-tick (co-feeding), tick-to-host and host-to-tick transmission routes. With two different circulating vector-borne viral strains, resident and invasive, and under the assumption that co-feeding is the only transmission route in a tick population, the invasion reproduction number depends on whether the model system of ordinary differential equations possesses the property of neutrality. We show that a simple model, with two populations of ticks infected with one strain, resident or invasive, and one population of co-infected ticks, does not have Alizon's neutrality property. We present model alternatives that are capable of representing the invasion potential of a novel strain by including populations of ticks dually infected with the same strain. The invasion reproduction number is analysed with the next-generation method and via numerical simulations.






    Statistical learning of the structures of cellular networks, such as protein signaling pathways, is a topical research field in computational systems biology. To get the most information out of experimental data, it is often necessary to develop a tailored statistical approach rather than applying one of the off-the-shelf network reconstruction methods. The focus of this paper is on learning the structure of the mTOR protein signaling pathway from immunoblotting protein phosphorylation data. Under two experimental conditions, eleven phosphorylation sites of eight key proteins of the mTOR pathway were measured at ten non-equidistant time points. For the statistical analysis, we propose a new advanced hierarchically coupled non-homogeneous dynamic Bayesian network (NH-DBN) model, and we consider various data imputation methods for dealing with non-equidistant temporal observations. Because of the absence of a true gold standard network, we propose to use predictive probabilities in combination with a leave-one-out cross-validation strategy to objectively cross-compare the accuracies of different NH-DBN models and data imputation methods. Finally, we employ the best combination of model and data imputation method for predicting the structure of the mTOR protein signaling pathway.






    Several statistical models have been proposed in recent years, among them semiparametric regression. In medicine, there are several situations in which it is impracticable to use linear regression for statistical modeling, especially when the data contain explanatory variables that have a nonlinear relationship with the response variable. Another common situation is when the response variable does not have a unimodal shape, and it is not possible to adopt distributions belonging to the symmetric or asymmetric classes. In this context, a semiparametric heteroskedastic regression is proposed based on an extension of the normal distribution. We then show the usefulness of this model for analyzing the cost of prostate cancer surgery. The predictor variables refer to two groups of patients, such that one group receives a multimodal local anesthetic solution (Preemptive Target Anesthetic Solution) and the second group is treated with neuraxial blockade (spinal anesthesia/traditional standard). Other relevant predictor variables are also evaluated, thus allowing for in-depth interpretation of the predictor variables with a nonlinear effect on the dependent variable cost. The penalized maximum likelihood method is adopted to estimate the model parameters. The new regression is a useful statistical tool for analyzing medical data.






    Current methods for clustering adult obesity prevalence by state focus on creating a single map of obesity prevalence for a given year in the United States. Comparing these maps for different years may limit our understanding of the progression of state and regional obesity prevalence over time for the purpose of developing targeted regional health policies. In this application note, we adopt the non-parametric Dynamic Time Warping method for clustering longitudinal time series of obesity prevalence by state. This method captures the lead and lag relationship between the time series as part of the temporal alignment, allowing us to produce a single map that captures the regional and temporal clusters of obesity prevalence from 1990 to 2019 in the United States. We identify six regions of obesity prevalence in the United States and forecast future estimates of obesity prevalence based on ARIMA models.
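
The temporal alignment that Dynamic Time Warping performs can be illustrated with a textbook dynamic-programming sketch. The absolute-difference local cost, the function name, and the three prevalence series below are assumptions for illustration; they are not data or code from the application note.

```python
def dtw_distance(a, b):
    """Dynamic Time Warping distance with absolute-difference local cost."""
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # repeat b[j-1] against a new point of a
                cost[i][j - 1],      # repeat a[i-1] against a new point of b
                cost[i - 1][j - 1],  # advance both series
            )
    return cost[n][m]


# Hypothetical prevalence series (percent) for three states.
state_a = [10.0, 12.0, 15.0, 19.0, 24.0]
state_b = [10.0, 10.0, 12.0, 15.0, 19.0, 24.0]  # same trajectory, one step later
state_c = [20.0, 21.0, 22.0, 23.0, 24.0]        # different trajectory

print(dtw_distance(state_a, state_b), dtw_distance(state_a, state_c))
```

Because the lagged series `state_b` aligns perfectly with `state_a`, its distance is zero, whereas a pointwise comparison would penalise the lag. A matrix of such pairwise distances could then be passed to any standard clustering routine.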






    We study the performance of shape-constrained methods for evaluating immune response profiles from early-phase vaccine trials. The motivating problem for this work involves quantifying and comparing the IgG binding immune responses to the first and second variable loops (V1V2 region) arising in the HVTN 097 and HVTN 100 HIV vaccine trials. We consider unimodal and log-concave shape-constrained methods to compare the immune profiles of the two vaccines, which is reasonable because the data support that the underlying densities of the immune responses could have these shapes. To this end, we develop novel shape-constrained tests of stochastic dominance and shape-constrained plug-in estimators of the squared Hellinger distance between two densities. Our techniques are either tuning parameter free or rely on only one tuning parameter, yet their performance is either better than that of the nonparametric methods (the tests of stochastic dominance) or comparable with it (the estimators of the squared Hellinger distance). The minimal dependence on tuning parameters is especially desirable in clinical contexts where analyses must be prespecified and reproducible. Our methods are supported by theoretical results and simulation studies.
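
For readers unfamiliar with the squared Hellinger distance, the sketch below computes it for two discrete probability vectors. It assumes the convention H^2 = 1 - sum_i sqrt(p_i q_i), so that values lie in [0, 1], and it is a plain plug-in on given probabilities, not the shape-constrained estimators developed in the paper. The example vectors are hypothetical.

```python
import math


def squared_hellinger(p, q):
    """Squared Hellinger distance between two probability vectors.

    Assumed convention: H^2 = 1 - sum_i sqrt(p_i * q_i), which lies in [0, 1].
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    return 1.0 - sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


# Hypothetical binned response distributions for two vaccine arms.
arm_1 = [0.1, 0.2, 0.4, 0.3]
arm_2 = [0.3, 0.4, 0.2, 0.1]

print(round(squared_hellinger(arm_1, arm_2), 4))
```

Identical distributions give a distance of zero, and distributions with disjoint support give a distance of one.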






    In this paper, we present a statistical method (denoted 'Adaptive Resources Allocation CUSUM') to robustly and efficiently detect the hotspot with limited sampling resources. Our main idea is to combine multi-armed bandit (MAB) and change-point detection methods to balance the exploration and exploitation of resource allocation for hotspot detection. Further, a Bayesian weighted update is used to update the posterior distribution of the infection rate. Then, the upper confidence bound (UCB) is used for resource allocation and planning. Finally, CUSUM monitoring statistics are used to detect the change point as well as the change location. For performance evaluation, we compare the proposed method with several benchmark methods in the literature and show that the proposed algorithm achieves a lower detection delay and higher detection precision. This method is also applied to hotspot detection in a real case study of county-level daily positive COVID-19 cases in Washington State (WA), where it demonstrates its effectiveness with very limited distributed samples.
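
The CUSUM monitoring step can be illustrated with a standard one-sided CUSUM on a single stream. This sketch covers only that step under simple assumptions: it does not include the bandit-based resource allocation or the Bayesian update, and the target mean, slack, threshold, and case counts are hypothetical.

```python
def cusum_alarm(observations, target_mean, slack, threshold):
    """One-sided CUSUM for an upward shift in the mean.

    Returns the index of the first observation at which the statistic
    exceeds the threshold, or None if no alarm is raised.
    """
    stat = 0.0
    for t, x in enumerate(observations):
        # Accumulate evidence above target_mean + slack; reset at zero.
        stat = max(0.0, stat + (x - target_mean - slack))
        if stat > threshold:
            return t
    return None


# Hypothetical daily case counts for one county; the level rises at index 5.
daily_cases = [2, 3, 2, 3, 2, 6, 7, 6, 8, 7]
alarm = cusum_alarm(daily_cases, target_mean=2.5, slack=0.5, threshold=8.0)
print(alarm)
```

A larger threshold reduces false alarms at the price of a longer detection delay, which is the trade-off the performance comparison above refers to.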





