Generate simulated data according to the model f(X) = f_prog(X) + trt*(b0 + b1*f_pred(X)) for different outcome distributions. For continuous data, f(X) is the conditional mean; for binary data, the logit of the response probability; for count data, the log-rate; and for survival data, the log-hazard rate.
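As a rough illustration of how f(X) is read on each outcome scale, the base R sketch below evaluates the model by hand for a toy covariate set. It does not call generate_y(); the variable names (f_prog, f_pred, f) and the chosen prognostic and predictive expressions are assumptions made only for this example.

```r
set.seed(1)
n <- 1000
X <- data.frame(X1 = rnorm(n), X2 = rnorm(n))
trt <- rbinom(n, size = 1, prob = 0.5)  # binary treatment indicator

# Hypothetical prognostic and predictive expressions (illustration only)
f_prog <- 0.5 * X$X1
f_pred <- as.numeric(X$X2 > 0)
b0 <- 0
b1 <- 1

# Model: f(X) = f_prog(X) + trt * (b0 + b1 * f_pred(X))
f <- f_prog + trt * (b0 + b1 * f_pred)

# Reading f(X) on the scale of each outcome type
mean_continuous <- f          # continuous: conditional mean
prob_binary     <- plogis(f)  # binary: inverse logit gives the response probability
rate_count      <- exp(f)     # count: f is the log-rate
hazard_survival <- exp(f)     # survival: f is the log-hazard (baseline scaling not shown)
```

For untreated observations (trt = 0) the treatment term drops out, so f reduces to the prognostic part f_prog.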

generate_y(
  X,
  trt,
  prog,
  pred,
  b0,
  b1 = NULL,
  sd_te = NULL,
  type = c("binary", "continuous", "count", "survival"),
  sigma_error = 1,
  theta = 1,
  cens_time = NULL,
  lambda0 = NULL,
  include_truth = FALSE,
  sign_better = 1
)

Arguments

X

Matrix or data frame of predictor variables

trt

Binary treatment indicator variable

prog

Character string giving an expression for the prognostic effects (defined in terms of the column names of X)

pred

Character string giving an expression for the predictive effects (defined in terms of the column names of X)

b0

Main effect of treatment

b1

Coefficient of the predictive effects defined in pred

sd_te

Standard deviation of the treatment effects defined via pred. If given, b1 is ignored. For binary data this is assumed to be on the log-odds ratio scale. For survival and count data this is assumed to be on the log scale.

type

Outcome data type to generate ("continuous", "binary", "count" and "survival" are allowed here)

sigma_error

Residual error, only needed for type = "continuous"

theta

Overdispersion parameter, only needed for type = "count" (the variance of the negative binomial distribution in this parameterization is mu + mu^2/theta)

cens_time

Function to generate the censoring time, only needed for type = "survival"

lambda0

Intercept of exponential regression (on non-log scale)

include_truth

Logical; whether the true prognostic and predictive effects are included in the returned data set

sign_better

Whether a larger response is better (used to determine the sign of b1 if b1 is not given)
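The link between sd_te and b1 follows from the model: the treatment effect is b0 + b1*f_pred(X), so its standard deviation is |b1| times the standard deviation of f_pred(X). The base R sketch below rescales accordingly. The helper name b1_from_sd_te is hypothetical, and using sign_better to pick the sign is an assumption about the intended behaviour, not a copy of the package's code.

```r
# Hypothetical helper: choose b1 so that sd(b0 + b1 * f_pred) equals sd_te
b1_from_sd_te <- function(f_pred, sd_te, sign_better = 1) {
  stopifnot(sd(f_pred) > 0, sd_te >= 0)
  sign_better * sd_te / sd(f_pred)
}

set.seed(42)
x3 <- rnorm(500)
f_pred <- as.numeric(x3 > 0)  # same form as pred = "X3>0"

b1 <- b1_from_sd_te(f_pred, sd_te = 0.3, sign_better = 1)
te <- 0 + b1 * f_pred         # treatment effect with b0 = 0
sd(te)                        # equals the requested sd_te up to floating-point error
```

Because b0 only shifts the treatment effect, it does not change the standard deviation, which is why the helper ignores it.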

Value

Data set

Examples

X <- generate_X_dist(n = 10000, p = 10, rho = 0.5)
## observational data set
trt <- generate_trt(n = nrow(X), type = "random", X = X, prop = "X2")
dat <- generate_y(X, trt,
  prog = "0.5*((X1=='Y')+X3)",
  pred = "X3>0", b0 = 0, b1 = 1,
  type = "continuous", sigma_error = 3
)


#### generate data from user-specified covariates X
X <- sapply(1:10, function(ii) {
  rnorm(500)
})
X <- as.data.frame(X)
colnames(X) <- paste0("Z", 1:10)
trt <- generate_trt(nrow(X), p_trt = 0.5)
dat <- generate_y(X, trt,
  prog = "0.5*Z1",
  pred = "(Z5>0)", b0 = 0, b1 = 1,
  type = "binary"
)
glm(Y ~ trt * ., data = dat, family = binomial)
#> 
#> Call:  glm(formula = Y ~ trt * ., family = binomial, data = dat)
#> 
#> Coefficients:
#> (Intercept)          trt           Z1           Z2           Z3           Z4  
#>     0.04097      0.66630      0.48682     -0.27904     -0.10889     -0.18297  
#>          Z5           Z6           Z7           Z8           Z9          Z10  
#>    -0.13666      0.06119     -0.01345     -0.21496      0.14267     -0.11297  
#>      trt:Z1       trt:Z2       trt:Z3       trt:Z4       trt:Z5       trt:Z6  
#>    -0.06650      0.09557     -0.01194      0.06109      0.48794     -0.10708  
#>      trt:Z7       trt:Z8       trt:Z9      trt:Z10  
#>    -0.06243      0.24723     -0.09472     -0.07955  
#> 
#> Degrees of Freedom: 499 Total (i.e. Null);  478 Residual
#> Null Deviance:	    679.6 
#> Residual Deviance: 625.7 	AIC: 669.7
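
The variance formula given for theta under Arguments (mu + mu^2/theta) is the parameterization base R's rnbinom() uses when called with mu and size = theta. The sketch below checks this empirically. It is independent of generate_y() and makes no claim about how the function samples counts internally; the values of mu and theta are arbitrary.

```r
set.seed(123)
mu <- 4
theta <- 2

# Negative binomial draws in the mean/size parameterization
y <- rnbinom(1e5, size = theta, mu = mu)

theoretical_var <- mu + mu^2 / theta  # 4 + 16 / 2 = 12
c(mean = mean(y), var = var(y), theoretical_var = theoretical_var)
```

Smaller values of theta give stronger overdispersion; as theta grows, the variance approaches the Poisson value mu.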