Generate outcome variable y and merge with predictor data matrix X.

Generate simulated data according to the model f(X) = f_prog(X) + trt*(b0 + b1*f_pred(X)) for different outcome distributions. For continuous data, f(X) is the conditional mean; for binary data, the logit of the response probability; for count data, the log-rate; and for survival data, the log-hazard rate.
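
To make the role of f(X) on each scale concrete, the following is a minimal base-R sketch of the model above. It is an illustration under stated assumptions, not the internal implementation of generate_y: the covariates, the expressions chosen for f_prog and f_pred, and the way lambda0 enters the hazard are all hypothetical choices made for this example.

```r
set.seed(1)
n <- 1000

# Hypothetical covariates and treatment assignment (illustration only)
X <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
trt <- rbinom(n, 1, 0.5)

# Hypothetical prognostic and predictive expressions
f_prog <- 0.5 * X$X1
f_pred <- as.numeric(X$X2 > 0)
b0 <- 0
b1 <- 1

# Model from the description: f(X) = f_prog(X) + trt*(b0 + b1*f_pred(X))
f <- f_prog + trt * (b0 + b1 * f_pred)

# Continuous: f is the conditional mean, sigma_error the residual SD
y_cont <- f + rnorm(n, sd = 1)

# Binary: f is the logit of the response probability
y_bin <- rbinom(n, 1, plogis(f))

# Count: f is the log-rate; negative binomial with overdispersion theta = 1
y_count <- rnbinom(n, size = 1, mu = exp(f))

# Survival: f is the log-hazard; assumes an exponential model in which
# the baseline hazard lambda0 multiplies exp(f) (an assumption of this sketch)
lambda0 <- 0.1
y_time <- rexp(n, rate = lambda0 * exp(f))
```

In the control arm (trt = 0) the treatment term drops out and f reduces to f_prog, so the predictive expression only affects treated subjects.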

generate_y(
  X,
  trt,
  prog,
  pred,
  b0,
  b1 = NULL,
  sd_te = NULL,
  type = c("binary", "continuous", "count", "survival"),
  sigma_error = 1,
  theta = 1,
  cens_time = NULL,
  lambda0 = NULL,
  include_truth = FALSE,
  sign_better = 1
)

Arguments

X: Matrix or data frame of predictor variables
trt: Binary treatment indicator variable
prog: Character string giving the expression for the prognostic effects (defined in terms of the column names of X)
pred: Character string giving the expression for the predictive effects (defined in terms of the column names of X)
b0: Main treatment effect (the b0 term in the model)
b1: Coefficient of the predictive effects defined in pred
sd_te: Standard deviation of the treatment effects defined via pred. If given, b1 is ignored. For binary data this is assumed to be on the log-odds ratio scale. For survival and count data this is assumed to be on the log scale.
type: Outcome data type to generate ("continuous", "binary", "count" and "survival" are allowed here)
sigma_error: Residual error, only needed for type = "continuous"
theta: Overdispersion parameter, only needed for type = "count" (the variance of the negative binomial distribution in this parameterization is mu + mu^2/theta)
cens_time: Function to generate the censoring time, only needed for type = "survival"
lambda0: Intercept of exponential regression (on non-log scale)
include_truth: Logical; whether the true prognostic and predictive effects are included in the output data set
sign_better: Whether a larger response is better (used to determine whether b1 is negative or positive if it is not given)
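
The variance formula quoted for theta can be checked numerically. The sketch below uses base R's rnbinom with size = theta, which follows the same mu + mu^2/theta parameterization; the values of mu and theta are arbitrary choices for illustration, and whether generate_y calls rnbinom internally is not stated here.

```r
set.seed(42)
mu <- 4      # mean of the counts (arbitrary illustration value)
theta <- 2   # overdispersion parameter (arbitrary illustration value)

# Negative binomial draws in the (mu, size) parameterization
y <- rnbinom(1e5, size = theta, mu = mu)

# Compare the empirical variance with mu + mu^2/theta
theoretical_var <- mu + mu^2 / theta
empirical_var <- var(y)
c(empirical = empirical_var, theoretical = theoretical_var)
```

Smaller values of theta give more overdispersion; as theta grows, the variance approaches mu, the Poisson case.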

Value

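A data set containing the generated outcome (Y in the examples below), the treatment indicator trt and the predictor variables from X.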

Examples

X <- generate_X_dist(n = 10000, p = 10, rho = 0.5)
## observational data set
trt <- generate_trt(n = nrow(X), type = "random", X = X, prop = "X2")
dat <- generate_y(X, trt,
  prog = "0.5*((X1=='Y')+X3)",
  pred = "X3>0", b0 = 0, b1 = 1,
  type = "continuous", sigma_error = 3
)


#### generate data from user-specified covariates X
X <- sapply(1:10, function(ii) {
  rnorm(500)
})
X <- as.data.frame(X)
colnames(X) <- paste0("Z", 1:10)
trt <- generate_trt(nrow(X), p_trt = 0.5)
dat <- generate_y(X, trt,
  prog = "0.5*Z1",
  pred = "(Z5>0)", b0 = 0, b1 = 1,
  type = "binary"
)
glm(Y ~ trt * ., data = dat, family = binomial)
#> 
#> Call:  glm(formula = Y ~ trt * ., family = binomial, data = dat)
#> 
#> Coefficients:
#> (Intercept)          trt           Z1           Z2           Z3           Z4  
#>     0.04097      0.66630      0.48682     -0.27904     -0.10889     -0.18297  
#>          Z5           Z6           Z7           Z8           Z9          Z10  
#>    -0.13666      0.06119     -0.01345     -0.21496      0.14267     -0.11297  
#>      trt:Z1       trt:Z2       trt:Z3       trt:Z4       trt:Z5       trt:Z6  
#>    -0.06650      0.09557     -0.01194      0.06109      0.48794     -0.10708  
#>      trt:Z7       trt:Z8       trt:Z9      trt:Z10  
#>    -0.06243      0.24723     -0.09472     -0.07955  
#> 
#> Degrees of Freedom: 499 Total (i.e. Null);  478 Residual
#> Null Deviance:	    679.6 
#> Residual Deviance: 625.7 	AIC: 669.7