| Title: | Regression with Interval-Censored Covariates |
|---|---|
| Description: | Provides functions to simulate and analyze data for a regression model with an interval censored covariate, as described in Morrison et al. (2021) <doi:10.1111/biom.13472>. |
| Authors: | Douglas Ezra Morrison [aut, cre, cph] (ORCID: <https://orcid.org/0000-0002-7195-830X>), Ron Brookmeyer [aut] |
| Maintainer: | Douglas Ezra Morrison <[email protected]> |
| License: | MIT + file LICENSE |
| Version: | 0.2.0.9000 |
| Built: | 2026-05-30 09:14:45 UTC |
| Source: | https://github.com/d-morrison/rwicc |
Build table of event date possibilities
build_event_date_possibilities_table( participant_level_data, bin_width = 1, omega_hat = build_omega_table(participant_level_data, bin_width = bin_width) )build_event_date_possibilities_table( participant_level_data, bin_width = 1, omega_hat = build_omega_table(participant_level_data, bin_width = bin_width) )
participant_level_data |
a data.frame with columns:
|
bin_width |
number of days per bin |
omega_hat |
a data.frame from |
a data.frame with columns:
ID: participant identifier
Stratum: indicator for which population stratum the participant belongs to
S: possible seroconversion dates
library(dplyr) simulate_interval_censoring()$pt_data |> mutate(Stratum = 1) |> build_event_date_possibilities_table()library(dplyr) simulate_interval_censoring()$pt_data |> mutate(Stratum = 1) |> build_event_date_possibilities_table()
Build table of dates with possible seroconversions in dataset
build_omega_table(participant_level_data, bin_width = 1)build_omega_table(participant_level_data, bin_width = 1)
participant_level_data |
a data.frame with columns:
|
bin_width |
number of days per bin |
convert a pair of simple logistic regression coefficients into P(Y|T) curve:
build_phi_function_from_coefs(coefs)build_phi_function_from_coefs(coefs)
coefs |
numeric vector of coefficients |
function(t) P(Y=1|T=t)
compute mean window period duration from simple logistic regression coefficients
compute_mu(theta)compute_mu(theta)
theta |
numeric vector of coefficients |
numeric scalar: mean window period duration
This function fits a logistic regression model for a binary outcome Y with an interval-censored covariate T, using an EM algorithm, as described in Morrison et al (2021); doi:10.1111/biom.13472.
fit_joint_model( participant_level_data, obs_level_data, model_formula = stats::formula(Y ~ T), mu_function = compute_mu, bin_width = 1, denom_offset = 0.1, EM_toler_loglik = 0.1, EM_toler_est = 1e-04, EM_max_iterations = Inf, glm_tolerance = 1e-07, glm_maxit = 20, initial_S_estimate_location = 0.25, coef_change_metric = "max abs rel diff coefs", verbose = FALSE )fit_joint_model( participant_level_data, obs_level_data, model_formula = stats::formula(Y ~ T), mu_function = compute_mu, bin_width = 1, denom_offset = 0.1, EM_toler_loglik = 0.1, EM_toler_est = 1e-04, EM_max_iterations = Inf, glm_tolerance = 1e-07, glm_maxit = 20, initial_S_estimate_location = 0.25, coef_change_metric = "max abs rel diff coefs", verbose = FALSE )
participant_level_data |
a data.frame or tibble with the following variables:
|
obs_level_data |
a data.frame or tibble with the following variables:
|
model_formula |
the functional form for the regression model for p(y|t) (as a formula() object) |
mu_function |
a function taking a vector of regression coefficient estimates as input and outputting an estimate of mu (mean duration of MAA-positive infection). |
bin_width |
the number of days between possible seroconversion dates (should be an integer) |
denom_offset |
an offset value added to the denominator of the hazard estimates to improve numerical stability |
EM_toler_loglik |
the convergence cutoff for the log-likelihood criterion ("Delta_L" in the paper) |
EM_toler_est |
the convergence cutoff for the parameter estimate criterion ("Delta_theta" in the paper) |
EM_max_iterations |
the number of EM iterations to perform before giving up if still not converged. |
glm_tolerance |
the convergence cutoff for the glm fit in the M step |
glm_maxit |
the iterations cutoff for the glm fit in the M step |
initial_S_estimate_location |
determines how seroconversion date is guessed to initialize the algorithm; can be any decimal between 0 and 1; 0.5 = midpoint imputation, 0.25 = 1st quartile, 0 = last negative, etc. |
coef_change_metric |
a string indicating the type of parameter estimate criterion to use:
|
verbose |
whether to print algorithm progress details to the console |
a list with the following elements:
Theta: the estimated regression coefficients for the model of p(Y|T)
Mu: the estimated mean window period (a transformation of Theta)
Omega: a table with the estimated parameters for the model of p(S|E).
converged: indicator of whether the algorithm reached its cutoff
criteria before reaching the specified maximum iterations. 1 = reached
cutoffs, 0 = not.
iterations: the number of EM iterations completed before the
algorithm stopped.
convergence_metrics: the four convergence metrics
convergence_stats: a table of the log-likelihood at each iteration
Morrison, Laeyendecker, and Brookmeyer (2021). "Regression with interval-censored covariates: Application to cross-sectional incidence estimation". Biometrics. doi:10.1111/biom.13472.
## Not run: # simulate data: study_data <- simulate_interval_censoring() # fit model: EM_algorithm_outputs <- fit_joint_model( obs_level_data = study_data$obs_data, participant_level_data = study_data$pt_data ) ## End(Not run)## Not run: # simulate data: study_data <- simulate_interval_censoring() # fit model: EM_algorithm_outputs <- fit_joint_model( obs_level_data = study_data$obs_data, participant_level_data = study_data$pt_data ) ## End(Not run)
Fit model using midpoint imputation
fit_midpoint_model( participant_level_data, obs_level_data, maxit = 1000, tolerance = 1e-08 )fit_midpoint_model( participant_level_data, obs_level_data, maxit = 1000, tolerance = 1e-08 )
participant_level_data |
a data.frame or tibble with the following variables:
|
obs_level_data |
a data.frame or tibble with the following variables:
|
maxit |
maximum iterations, passed to |
tolerance |
convergence criterion, passed to |
a vector of logistic regression coefficient estimates
sim_data <- simulate_interval_censoring( "theta" = c(0.986, -3.88), "study_cohort_size" = 4500, "preconversion_interval_length" = 365, "hazard_alpha" = 1, "hazard_beta" = 0.5 ) theta_est_midpoint <- fit_midpoint_model( obs_level_data = sim_data$obs_data, participant_level_data = sim_data$pt_data )sim_data <- simulate_interval_censoring( "theta" = c(0.986, -3.88), "study_cohort_size" = 4500, "preconversion_interval_length" = 365, "hazard_alpha" = 1, "hazard_beta" = 0.5 ) theta_est_midpoint <- fit_midpoint_model( obs_level_data = sim_data$obs_data, participant_level_data = sim_data$pt_data )
Fit model using uniform imputation
fit_uniform_model( participant_level_data, obs_level_data, maxit = 1000, tolerance = 1e-08, n_imputations = 10 )fit_uniform_model( participant_level_data, obs_level_data, maxit = 1000, tolerance = 1e-08, n_imputations = 10 )
participant_level_data |
a data.frame or tibble with the following variables:
|
obs_level_data |
a data.frame or tibble with the following variables:
|
maxit |
maximum iterations, passed to |
tolerance |
convergence criterion, passed to |
n_imputations |
number of imputed data sets to create |
a vector of logistic regression coefficient estimates
sim_data <- simulate_interval_censoring( "theta" = c(0.986, -3.88), "study_cohort_size" = 4500, "preconversion_interval_length" = 365, "hazard_alpha" = 1, "hazard_beta" = 0.5 ) theta_est_midpoint <- fit_uniform_model( obs_level_data = sim_data$obs_data, participant_level_data = sim_data$pt_data )sim_data <- simulate_interval_censoring( "theta" = c(0.986, -3.88), "study_cohort_size" = 4500, "preconversion_interval_length" = 365, "hazard_alpha" = 1, "hazard_beta" = 0.5 ) theta_est_midpoint <- fit_uniform_model( obs_level_data = sim_data$obs_data, participant_level_data = sim_data$pt_data )
Graph seroconversion hazard model
graph_omega(omega)graph_omega(omega)
omega |
a data.frame containing parameter values for the seroconversion hazard model |
example_model <- system.file("extdata", "example_model.rds", package = "rwicc") |> readRDS() omega_est_EM <- example_model$Omega omega_est_EM |> graph_omega()example_model <- system.file("extdata", "example_model.rds", package = "rwicc") |> readRDS() omega_est_EM <- example_model$Omega omega_est_EM |> graph_omega()
plot estimated and true CDFs for seroconversion date distribution
plot_CDF(true_hazard_alpha, true_hazard_beta, omega_hat)plot_CDF(true_hazard_alpha, true_hazard_beta, omega_hat)
true_hazard_alpha |
The data-generating hazard at the start of the study |
true_hazard_beta |
The change in data-generating hazard per calendar year |
omega_hat |
tibble of estimated discrete hazards |
a ggplot
## Not run: hazard_alpha <- 1 hazard_beta <- 0.5 study_data <- simulate_interval_censoring( "hazard_alpha" = hazard_alpha, "hazard_beta" = hazard_beta ) # fit model: EM_algorithm_outputs <- fit_joint_model( obs_level_data = study_data$obs_data, participant_level_data = study_data$pt_data ) plot1 <- plot_CDF( true_hazard_alpha = hazard_alpha, true_hazard_beta = hazard_beta, omega_hat = EM_algorithm_outputs$Omega ) print(plot1) ## End(Not run)## Not run: hazard_alpha <- 1 hazard_beta <- 0.5 study_data <- simulate_interval_censoring( "hazard_alpha" = hazard_alpha, "hazard_beta" = hazard_beta ) # fit model: EM_algorithm_outputs <- fit_joint_model( obs_level_data = study_data$obs_data, participant_level_data = study_data$pt_data ) plot1 <- plot_CDF( true_hazard_alpha = hazard_alpha, true_hazard_beta = hazard_beta, omega_hat = EM_algorithm_outputs$Omega ) print(plot1) ## End(Not run)
Plot censoring data
plot_censoring_data( dataset, included_IDs = unique(dataset$pt_data$ID), label_size = 5, point_size = 5, s_vjust = 2, labelled_IDs = included_IDs, xmin = min(dataset$pt_data$E) - 28, xmax = max(dataset$obs_data$O) )plot_censoring_data( dataset, included_IDs = unique(dataset$pt_data$ID), label_size = 5, point_size = 5, s_vjust = 2, labelled_IDs = included_IDs, xmin = min(dataset$pt_data$E) - 28, xmax = max(dataset$obs_data$O) )
dataset |
output from |
included_IDs |
|
label_size |
numeric:
passed to |
point_size |
numeric: passed to ggplot2::geom_point's |
s_vjust |
passed to ggrepel::geom_text_repel's |
labelled_IDs |
|
xmin |
minimum displayed value for x-axis |
xmax |
maximum displayed value for x-axis |
a ggplot
Plot true and estimated curves for P(Y=1|T=t)
plot_phi_curves( theta_true, theta.hat_joint, theta.hat_midpoint, theta.hat_uniform )plot_phi_curves( theta_true, theta.hat_joint, theta.hat_midpoint, theta.hat_uniform )
theta_true |
the coefficients of the data-generating model P(Y=1|T=t) |
theta.hat_joint |
the estimated coefficients from the joint model |
theta.hat_midpoint |
the estimated coefficients from midpoint imputation |
theta.hat_uniform |
the estimated coefficients from uniform imputation |
a ggplot
## Not run: theta_true <- c(0.986, -3.88) hazard_alpha <- 1 hazard_beta <- 0.5 sim_data <- simulate_interval_censoring( "theta" = theta_true, "study_cohort_size" = 4500, "preconversion_interval_length" = 365, "hazard_alpha" = hazard_alpha, "hazard_beta" = hazard_beta ) # extract the participant-level and observation-level simulated data: sim_participant_data <- sim_data$pt_data sim_obs_data <- sim_data$obs_data rm(sim_data) # joint model: EM_algorithm_outputs <- fit_joint_model( obs_level_data = sim_obs_data, participant_level_data = sim_participant_data, bin_width = 7, verbose = FALSE ) # midpoint imputation: theta_est_midpoint <- fit_midpoint_model( obs_level_data = sim_obs_data, participant_level_data = sim_participant_data ) # uniform imputation: theta_est_uniform <- fit_uniform_model( obs_level_data = sim_obs_data, participant_level_data = sim_participant_data ) plot2 <- plot_phi_curves( theta_true = theta_true, theta.hat_uniform = theta_est_uniform, theta.hat_midpoint = theta_est_midpoint, theta.hat_joint = EM_algorithm_outputs$Theta ) print(plot2) ## End(Not run)## Not run: theta_true <- c(0.986, -3.88) hazard_alpha <- 1 hazard_beta <- 0.5 sim_data <- simulate_interval_censoring( "theta" = theta_true, "study_cohort_size" = 4500, "preconversion_interval_length" = 365, "hazard_alpha" = hazard_alpha, "hazard_beta" = hazard_beta ) # extract the participant-level and observation-level simulated data: sim_participant_data <- sim_data$pt_data sim_obs_data <- sim_data$obs_data rm(sim_data) # joint model: EM_algorithm_outputs <- fit_joint_model( obs_level_data = sim_obs_data, participant_level_data = sim_participant_data, bin_width = 7, verbose = FALSE ) # midpoint imputation: theta_est_midpoint <- fit_midpoint_model( obs_level_data = sim_obs_data, participant_level_data = sim_participant_data ) # uniform imputation: theta_est_uniform <- fit_uniform_model( obs_level_data = sim_obs_data, participant_level_data = sim_participant_data ) plot2 <- plot_phi_curves( theta_true = theta_true, theta.hat_uniform = theta_est_uniform, theta.hat_midpoint = theta_est_midpoint, theta.hat_joint = EM_algorithm_outputs$Theta ) print(plot2) ## End(Not run)
The rwicc package implements a regression model with an
interval-censored covariate using an EM algorithm, as described in
Morrison et al (2021); doi:10.1111/biom.13472.
The main rwicc functions are:
Maintainer: Douglas Ezra Morrison [email protected] (ORCID) [copyright holder]
Authors:
Douglas Ezra Morrison [email protected] (ORCID) [copyright holder]
Ron Brookmeyer
Morrison, Laeyendecker, and Brookmeyer (2021). "Regression with interval-censored covariates: Application to cross-sectional incidence estimation". Biometrics. doi:10.1111/biom.13472.
Useful links:
Report bugs at https://github.com/d-morrison/rwicc/issues
This function determines the seroconversion date corresponding to a provided probability of survival. See doi:10.1111/biom.13472, Supporting Information, Section A.4.
seroconversion_inverse_survival_function(u, e, hazard_alpha, hazard_beta)seroconversion_inverse_survival_function(u, e, hazard_alpha, hazard_beta)
u |
a vector of seroconversion survival probabilities |
e |
a vector of time differences between study start and enrollment (in years) |
hazard_alpha |
the instantaneous hazard of seroconversion on the study start date |
hazard_beta |
the change in hazard per year after study start date |
numeric vector of time differences between study start and seroconversion (in years)
Morrison, Laeyendecker, and Brookmeyer (2021). "Regression with interval-censored covariates: Application to cross-sectional incidence estimation". Biometrics, doi:10.1111/biom.13472.
simulate_interval_censoring generates a simulated data set from a
data-generating model based on the typical structure of a cohort study of HIV
biomarker progression, as described in Morrison et al (2021);
doi:10.1111/biom.13472.
simulate_interval_censoring( study_cohort_size = 4500, probability_of_ever_seroconverting = 0.05, n_at_risk = stats::rbinom(n = 1, size = study_cohort_size, prob = probability_of_ever_seroconverting), hazard_alpha = 1, hazard_beta = 0.5, preconversion_interval_length = 84, theta = c(0.986, -3.88), years_in_study = 10, max_scheduling_offset = 7, days_from_study_start_to_recruitment_end = 365, study_start_date = lubridate::ymd("2001-01-01") )simulate_interval_censoring( study_cohort_size = 4500, probability_of_ever_seroconverting = 0.05, n_at_risk = stats::rbinom(n = 1, size = study_cohort_size, prob = probability_of_ever_seroconverting), hazard_alpha = 1, hazard_beta = 0.5, preconversion_interval_length = 84, theta = c(0.986, -3.88), years_in_study = 10, max_scheduling_offset = 7, days_from_study_start_to_recruitment_end = 365, study_start_date = lubridate::ymd("2001-01-01") )
study_cohort_size |
the number of participants to simulate (N_0 in the paper) |
probability_of_ever_seroconverting |
the probability that each participant is at risk of HIV seroconversion |
n_at_risk |
number of participants who are at risk of infection; by
default, this number is determined stochastically from
|
hazard_alpha |
the hazard (instantaneous risk) of seroconversion at the start date of the cohort study for those participants at risk of seroconversion |
hazard_beta |
the change in hazard per calendar year |
preconversion_interval_length |
the number of days between tests for seroconversion |
theta |
the parameters of a logistic model (with linear functional from) specifying the probability of MAA-positive biomarkers as a function of time since seroconversion |
years_in_study |
the duration of follow-up for each participant |
max_scheduling_offset |
the maximum divergence of pre-seroconversion followup visits from the prescribed schedule |
days_from_study_start_to_recruitment_end |
the length of the recruitment period |
study_start_date |
the date when the study starts recruitment ("d_0" in the main text). The value of this parameter does not affect the simulation results; it is only necessary as a reference point for generating E, L, R, O, and S. |
A list containing the following two tibbles:
pt_data: a tibble of participant-level information, with the following
columns:
ID: participant ID
E: enrollment date
L: date of last HIV test prior to seroconversion
R: date of first HIV test after seroconversion
obs_data: a tibble of longitudinal observations with the following
columns:
ID: participant ID
O: dates of biomarker sample collection
Y: MAA classifications of biomarker samples
Morrison, Laeyendecker, and Brookmeyer (2021). "Regression with interval-censored covariates: Application to cross-sectional incidence estimation". Biometrics. doi:10.1111/biom.13472.
study_data <- simulate_interval_censoring() participant_characteristics <- study_data$pt_data longitudinal_observations <- study_data$obs_datastudy_data <- simulate_interval_censoring() participant_characteristics <- study_data$pt_data longitudinal_observations <- study_data$obs_data