Last updated: 06/28/2019

Background

This is a Bayesian partial adaptation and extension of Frank Harrell’s post titled “Assessing Heterogeneity of Treatment Effect, Estimating Patient-Specific Efficacy, and Studying Variation in Odds Ratios, Risk Ratios, and Risk Differences”. The goal of this project is to provide a Bayesian framework with which we can: (1) explore the presence/absence of heterogeneity of treatment effect (HTE) and (2) model patient-specific treatment efficacy on multiple scales (e.g., ARR, RR, OR). While some consider HTE to be present if risk varies on either the relative or absolute scale, others consider these to be separate phenomena (i.e., true HTE on the relative scale vs. risk magnification on the absolute scale). Regardless of terminology, a framework to model and adequately present these variations in risk is, in my opinion, essential when interpreting and applying the results of randomized controlled trials. There are several excellent reviews of this topic that cover the nuances of both its terminology and application in detail (1, 2).

For this project, we will use data from the GUSTO-I trial by Eric Topol and colleagues. This trial (published in 1993, prior to PCI becoming standard of care for acute STEMI) enrolled patients with myocardial infarction and examined the relative efficacy of four different thrombolytic regimens: (1) streptokinase (SK) + subcutaneous heparin; (2) SK + intravenous heparin; (3) tPA + intravenous heparin; and (4) tPA + SK + intravenous heparin. For the purposes of this project, we will condense the two groups receiving SK and will primarily examine the relative difference between the SK-only regimens (n = 20,162) and the tPA-only regimen (n = 10,348); we will not specifically examine the combined SK + tPA group. We will still use patients in the SK + tPA group to build our models, though, as their data can help inform the relationship between the additional baseline variables and risk. The major outcome in this trial was 30-day mortality, and several baseline variables were collected.

For this project we will walk through the modeling process step-by-step using the brms package, with visualizations using tidybayes, bayesplot and, of course, ggplot2. We will do some prior-predictive simulation to pick reasonable priors, and will perform some basic model diagnostics.

As a final note, I am in no way an expert on trials or Bayesian modeling. These are very simple models and the model building process is far from comprehensive. This serves as a learning exercise for me, and I hope it is helpful for others who are equally interested in these concepts. With that said, I greatly appreciate any and all feedback that may improve on this work. Feel free to reach out with any comments. The code for this project is freely available on GitHub.

Data Preparation and Exploration

The dataset used here is complete (no missing observations) and contains the following variables:

  • day30: mortality status at 30 days (1 / 0)
  • pulse: Heart rate (BPM) at study entry
  • age: Age (years)
  • sysbp: Systolic blood pressure (mm Hg) at study entry
  • Killip: Killip class (1, 2, 3, or 4)
  • miloc: Infarct location (anterior, inferior, or other)
  • pmi: History of previous myocardial infarction (yes / no)
  • tx: Treatment (tPA, SK, or both)

A quick glance at the data tells us what we’re working with:


8 Variables   40830 Observations

day30
       n  missing distinct     Info      Sum     Mean      Gmd
   40830        0        2    0.195     2851  0.06983   0.1299

pulse : Heart Rate (beats/min)
       n  missing distinct     Info     Mean      Gmd
   40830        0      166    0.999    75.41    19.55
     .05      .10      .25      .50      .75      .90      .95
      50       55       62       73       86       98      107
 lowest :   0   1   6   8   9,  highest: 200 205 210 220 246

age
       n  missing distinct     Info     Mean      Gmd
   40830        0     5575        1     60.9    13.62
     .05      .10      .25      .50      .75      .90      .95
   40.89    44.69    52.09    61.52    69.86    76.25    79.45
 lowest : 19.027 19.031 20.289 20.781 20.969,  highest: 92.953 95.000 96.547 108.000 110.000

sysbp : Systolic Blood Pressure (mm Hg)
       n  missing distinct     Info     Mean      Gmd
   40830        0      209    0.999      129    26.56
     .05      .10      .25      .50      .75      .90      .95
      92      100      112      130      144      160      170
 lowest :   0  30  32  36  40,  highest: 270 274 275 276 280

Killip : Killip Class
       n  missing distinct
   40830        0        4
 Value          I    II   III    IV
 Frequency  34825  5141   551   313
 Proportion 0.853 0.126 0.013 0.008

miloc : MI Location
       n  missing distinct
   40830        0        3
 Value      Inferior    Other Anterior
 Frequency     23495     1435    15900
 Proportion    0.575    0.035    0.389

pmi : Previous MI
       n  missing distinct
   40830        0        2
 Value         no   yes
 Frequency  34104  6726
 Proportion 0.835 0.165

tx : Tx in 3 groups
       n  missing distinct
   40830        0        3
 Value      SK+tPA     SK    tPA
 Frequency   10320  20162  10348
 Proportion  0.253  0.494  0.253

Some of these values seem to fall outside the plausible range (e.g., pulse or SBP of zero), but we’ll ignore these for the sake of simplicity.

Next we can plot the distributions of each variable, stratified by outcome. For the continuous variables we’ll also show the centered/scaled version as this is what we’ll use for modeling.

# Packages assumed loaded throughout: ggplot2, dplyr, tidyr (gather),
# brms (inv_logit_scaled), tidybayes (geom_halfeyeh, median_qi),
# bayesplot (mcmc_trace), and cowplot (plot_grid)
base_theme <-
  theme_classic() +
  theme(text = element_text(family = "Gill Sans MT"))

facet_labels <- as_labeller(
  c(age = "Age (years)", 
    pulse = "HR (BPM)",
    sysbp = "SBP (mm Hg)",
    age_s = "Age (centered/scaled)",
    pulse_s = "HR (centered/scaled)",
    sbp_s = "SBP (centered/scaled)"))

plt_1 <- 
  ggplot(
    data = gusto %>% 
      select(day30, age, age_s, sysbp, sbp_s, pulse, pulse_s) %>%
      gather(key = "key", value = "value", -day30) %>%
      mutate(key = factor(key, levels = c("age", "pulse", "sysbp",
                                          "age_s", "pulse_s", "sbp_s")))
  ) + 
  geom_density(
    aes(x = value,
        fill = factor(day30)),
    adjust = 1.75,
    alpha = 0.5
  ) + 
  labs(
    x = "Value",
    y = "Density",
    subtitle = "Continuous Variables Stratified by Outcome"
  ) + 
  scale_fill_brewer(type = "qual", palette = "Dark2",
                    name = "Status at Day 30:",
                    breaks = c(0, 1),
                    labels = c("Alive", "Dead")) + 
  facet_wrap(~ key, scales = "free", labeller = facet_labels) + 
  base_theme + 
  theme(
    strip.background = element_rect(fill = "grey80", color = "white"),
    strip.text = element_text(face = "bold"),
    legend.position = "bottom",
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank()
  )
 
facet_labels <- as_labeller(
  c(tx = "Treatment", 
    Killip = "Killip Class",
    miloc = "Infarct Location",
    pmi = "History of MI"))

plt_2 <- 
  ggplot(
      data = gusto %>%
        select(day30, Killip, miloc, pmi, tx) %>%
        gather(key = "key", value = "value", -day30) %>%
        mutate(key = factor(key, levels = c("tx", "Killip", "miloc", "pmi")))
    ) +
    geom_bar(
      aes(x = value,
          fill = factor(day30)),
      position = "fill"
    ) + 
    scale_fill_brewer(type = "qual", palette = "Dark2",
                    name = "Status at Day 30:",
                    breaks = c(0, 1),
                    labels = c("Alive", "Dead")) + 
    facet_wrap(~ key, scales = "free", labeller = facet_labels) + 
    labs(
      x = "Value",
      y = "Proportion",
      subtitle = "Categorical Variables Stratified by Outcome"
    ) +
    base_theme + 
    theme(
      strip.background = element_rect(fill = "grey80", color = "white"),
      strip.text = element_text(face = "bold"),
      legend.position = "bottom",
      axis.text.y = element_blank(),
      axis.ticks.y = element_blank()
      )

plot_grid(plt_1, plt_2, ncol = 2, align = "hv")

Model Building

We’ll build up our model sequentially, examining the change in predicted out-of-sample fit as we go. For this task we’ll use WAIC.

Base Model

We’ll start with a base model containing just the treatment. When defining this model, we’ll use the following parameterization:

\(day30_i \sim Bernoulli(p_i)\)

\(\text{logit}(p_i) = intercept + \beta_{tPA}(tPA_i) + \beta_{SK}(SK_i)\)


Where \(tPA_i\) is the tPA status (0/1) of each patient, and \(SK_i\) is the SK status (0/1) of each patient. Thus, the intercept here represents the log odds of 30-day mortality in patients receiving SK + tPA. We can use this information to start thinking about reasonable priors. In a similar study published in the Lancet a few years prior to the GUSTO-I trial, patients with suspected acute myocardial infarction received either tPA or SK. In-hospital mortality in this study was 8.9% and 8.5% in the two treatment groups, respectively. We can use these data to inform our prior for the intercept in this base model. Let’s imagine we use a prior that is broad on the probability scale and rather uninformative, such as \(intercept \sim Normal(0, 1.5)\). Below I’ve plotted the result of 10,000 random draws from this prior on the log odds scale (left) and the probability scale (right). As a note: in the remainder of this project, when I plot distributions I will use an accompanying median +/- 89% credible interval. As Richard McElreath likes to point out, there is nothing special about 95%, and using it will only make it more difficult to mentally move away from the concept of NHST and p-values.

This is a very broad and unnecessarily uninformative prior that gives too much weight to unrealistic values. Let’s instead use the data from our prior study to develop a moderately informative prior. With an expected in-hospital mortality less than 10% we should aim for a prior that places most of its weight in the low probability zone, while still allowing for larger than expected effects to shine through (if present in the data). For this purpose I settled on \(intercept \sim Normal(-2.5, 0.75)\). Here are the resultant prior predictive simulations:

This looks more appropriate. The bulk of the density is below a 20% probability of 30-day mortality in the combined SK + tPA group, but the extended tail does allow for more extreme cases. The mean probability of 30-day mortality in the SK + tPA group under this prior is 0.093.
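For reference, here is a minimal sketch of how these intercept draws (and the summary above) can be reproduced; inv_logit_scaled() is the brms helper for the inverse-logit transform, and the seed is arbitrary:

set.seed(10)  # arbitrary seed, for reproducibility only
intercept_prior <- rnorm(10000, -2.5, 0.75)   # draws on the log odds scale
p_prior <- inv_logit_scaled(intercept_prior)  # same draws on the probability scale

mean(p_prior)                      # ~0.093
median_qi(p_prior, .width = 0.89)  # median +/- 89% credible interval (tidybayes)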

Now let’s turn our attention to the priors for \(\beta_{SK}\) and \(\beta_{tPA}\). I like Richard McElreath’s approach to prior simulation for these coefficients. We can again start with a relatively broad and uninformative prior: \(\beta \sim Normal(0, 3)\). Below I’ve plotted results from another 10,000 simulations from this distribution. This time we’ll look at the difference in probability between groups that is implied by the prior (top row in the figure below). For these plots, we’ll use our intercept distribution, and then generate a distribution of probabilities of 30-day mortality for each of the three groups based on our \(\beta\) priors. We can then plot the absolute value of the differences in these probabilities to get an idea of how large an effect our priors assume the treatment might have. We’ll also plot the prior distributions for the probability of 30-day mortality by group (bottom row).

set.seed(11)
intercept <- rnorm(10000, -2.5, 0.75)
beta_1 <- rnorm(10000, 0, 3)

set.seed(12)
beta_2 <- rnorm(10000, 0, 3)

df <- 
  tibble(both = inv_logit_scaled(intercept),
         tpa = inv_logit_scaled(intercept + beta_1),
         sk = inv_logit_scaled(intercept + beta_2)) %>%
  mutate(diff_1 = abs(both - tpa), diff_2 = abs(both - sk),
         diff_3 = abs(tpa - sk))

plt_1 <- 
  ggplot(data = df, aes(x = diff_1, y = 1)) + 
  geom_halfeyeh(fill = "blue", .width = 0.89) + 
  labs(
    x = "abs[Pr(day30 = 1 | tx = SK + tPA) - Pr(day30 = 1 | tx = tPA)]",
    y = "Density",
    subtitle = "Prior: Treatment Contrast (SK + tPA vs. tPA)"
  ) + 
  scale_x_continuous(limits = c(0, 1)) + 
  base_theme + 
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title = element_text(size = 8)
  )

plt_2 <- 
  ggplot(data = df, aes(x = diff_2, y = 1)) + 
  geom_halfeyeh(fill = "darkgreen", .width = 0.89) + 
  labs(
    x = "abs[Pr(day30 = 1 | tx = SK + tPA) - Pr(day30 = 1 | tx = SK)]",
    y = "Density",
    subtitle = "Prior: Treatment Contrast (SK + tPA vs. SK)"
  ) + 
  scale_x_continuous(limits = c(0, 1)) + 
  base_theme + 
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title = element_text(size = 8)
  )

plt_3 <- 
  ggplot(data = df, aes(x = diff_3, y = 1)) + 
  geom_halfeyeh(fill = "darkred", .width = 0.89) + 
  labs(
    x = "abs[Pr(day30 = 1 | tx = tPA) - Pr(day30 = 1 | tx = SK)]",
    y = "Density",
    subtitle = "Prior: Treatment Contrast (tPA vs. SK)"
  ) + 
  scale_x_continuous(limits = c(0, 1)) + 
  base_theme + 
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title = element_text(size = 8)
  )

plt_4 <- 
  ggplot(data = df, aes(x = tpa, y = 1)) + 
  geom_halfeyeh(fill = "purple", .width = 0.89) + 
  labs(
    x = "Pr(day30 = 1 | tx = tPA)",
    y = "Density",
    subtitle = "Prior: Specific Treatment (tPA)"
  ) + 
  scale_x_continuous(limits = c(0, 1)) + 
  base_theme + 
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title = element_text(size = 8)
  )

plt_5 <- 
  ggplot(data = df, aes(x = sk, y = 1)) + 
  geom_halfeyeh(fill = "orange", .width = 0.89) + 
  labs(
    x = "Pr(day30 = 1 | tx = SK)",
    y = "Density",
    subtitle = "Prior: Specific Treatment (SK)"
  ) + 
  scale_x_continuous(limits = c(0, 1)) + 
  base_theme + 
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title = element_text(size = 8)
  )

plt_6 <- 
  ggplot(data = df, aes(x = both, y = 1)) + 
  geom_halfeyeh(fill = "grey", .width = 0.89) + 
  labs(
    x = "Pr(day30 = 1 | tx = SK + tPA)",
    y = "Density",
    subtitle = "Prior: Specific Treatment (SK + tPA)"
  ) + 
  scale_x_continuous(limits = c(0, 1)) + 
  base_theme + 
  theme(
    axis.text.y = element_blank(),
    axis.ticks.y = element_blank(),
    axis.title = element_text(size = 8)
  )

plot_grid(plt_1, plt_2, plt_3, plt_4, plt_5, plt_6, ncol = 3, align = "hv")

A few things to note about these results. While the bulk of the density for the differences in probability between treatments lies below 25%, a large portion of each distribution extends to extremely large and improbable values. Notice also how the uncertainty in the distribution for the difference between the tPA and SK groups (top right plot) is greater than the uncertainty in the contrasts between either the tPA or SK group and the SK + tPA group (top left and top middle plots). This is because, in our parameterization of the model, the reference group (SK + tPA) doesn’t get its own prior but is influenced only by the intercept. The prior probability of 30-day mortality for the other two groups (tPA and SK), on the other hand, is influenced by both the intercept and the \(\beta\) terms. So our parameterization has inherently assumed that we are less certain at baseline about the priors for the tPA and SK groups, and more certain about the prior for the SK + tPA (reference) group. We know this isn’t true, and in fact in Statistical Rethinking, Richard McElreath advocates an index variable approach to avoid this issue (i.e., parameterizing the model such that each level of a factor variable has its own coefficient, and thus its own prior). I did try this approach; however, inefficiencies in sampling (I think) made it impractical for this exercise (drastically increased run time, very small effective sample size). I would imagine this is due to the high degree of correlation between some of the terms in the over-parameterized version of the model. Regardless, while our model parameterization does imply some less-than-ideal characteristics about our priors, the large sample size in our dataset will quickly wash out these minor differences.

The minor issues discussed above aside, we do need to work on these priors to make them a bit more informative and skeptical of unrealistic differences between groups. In the end I settled on \(\beta \sim Normal(0, 0.5)\). Using this prior leads to the following prior predictive plots:

set.seed(11)
intercept <- rnorm(10000, -2.5, 0.75)
beta_1 <- rnorm(10000, 0, 0.5)

set.seed(12)
beta_2 <- rnorm(10000, 0, 0.5)

df <- 
  tibble(both = inv_logit_scaled(intercept),
         tpa = inv_logit_scaled(intercept + beta_1),
         sk = inv_logit_scaled(intercept + beta_2)) %>%
  mutate(diff_1 = abs(both - tpa), diff_2 = abs(both - sk),
         diff_3 = abs(tpa - sk))

# plt_1 through plt_6 are rebuilt exactly as in the previous chunk,
# now using the updated draws in `df`

plot_grid(plt_1, plt_2, plt_3, plt_4, plt_5, plt_6, ncol = 3, align = "hv")

These look much better. In the top row, we see that the predicted differences in probabilities between groups implied by our priors are appropriately clumped near the lower end of the spectrum. Again we see that the predicted 30-day mortality in the combined SK + tPA group has less uncertainty (a narrower prior distribution), but this is much less pronounced with our new \(\beta\) priors due to their reduced variability. We’ll move forward with this set of priors, making our final base model:

\(day30_i \sim Bernoulli(p_i)\)

\(\text{logit}(p_i) = intercept + \beta_{tPA}(tPA_i) + \beta_{SK}(SK_i)\)

\(intercept \sim Normal(-2.5, 0.75)\)

\(\beta \sim Normal(0, 0.5)\)


We’ll run this in brms. For all of our models we’ll run 4 chains, each with 2000 total iterations (1000 of which will be warmup iterations).
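The full fitting code isn’t shown here, but a sketch of the call would look something like the following; the model name (m_base) is my own, and the 0 + intercept parameterization (which lets us place an explicit prior on the intercept) mirrors the interaction model shown later:

m_base <-
  brm(
    data = gusto,
    family = bernoulli,
    formula = day30 ~ 0 + intercept + tx,
    prior = c(prior(normal(-2.5, 0.75), class = b, coef = "intercept"),
              prior(normal(0, 0.5), class = b, coef = "txSK"),
              prior(normal(0, 0.5), class = b, coef = "txtPA")),
    iter = 2000, warmup = 1000,
    chains = 4, cores = 4,
    seed = 11
  )

m_base <- add_criterion(m_base, "waic")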

Diagnostics

For this project we’ll look at a few items to diagnose any problems with the MCMC sampling process: (1) trace plots, (2) Rhat values, and (3) effective sample size ratios. This only scratches the surface of MCMC diagnostics, but should cover any major issues we may run into.

Our trace plots look good and all the Rhat values are essentially 1. Our effective N / N ratios aren’t great here, though. That said, we still have over 1000 effective samples for each parameter which is more than enough to get a good posterior sample. Truth be told, I’m not sure why this particular parameterization causes this problem, but we’ll see in the next model that with added terms our sampling efficiency improves dramatically.
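A sketch of how these diagnostics can be pulled, assuming the model object from the sketch above; rhat() and neff_ratio() are brms helpers, and mcmc_trace() comes from bayesplot:

mcmc_trace(as.array(m_base), regex_pars = "^b_")  # trace plots by chain
rhat(m_base)        # potential scale reduction factors; want ~1.00
neff_ratio(m_base)  # effective sample size / total post-warmup draws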

Results

From this base model we can look at some of our treatment effect estimates. First we can use the brms marginal_effects() function to display the marginal effect estimates for our model’s variables (only tx in this case).

If we want to get a bit more detailed, we can plot the entire distribution of predicted probabilities of 30-day mortality for patients receiving either tPA or SK on the probability scale. We can then also plot the contrast between these two distributions (SK - tPA), which represents the distribution of predicted ARR. Finally, we can plot the risk ratio for tPA vs. SK. Each distribution below is displayed with an 89% credible interval.
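A sketch of how these posterior distributions and contrasts can be computed (model and object names assumed from the sketches above); fitted() with summary = FALSE returns one predicted probability per posterior draw:

nd <- tibble(tx = c("tPA", "SK"))
post <- fitted(m_base, newdata = nd, summary = FALSE)  # draws x 2 matrix

arr <- post[, 2] - post[, 1]  # Pr(death | SK) - Pr(death | tPA)
rr  <- post[, 1] / post[, 2]  # Pr(death | tPA) / Pr(death | SK)

median_qi(arr, .width = 0.89)
median_qi(rr, .width = 0.89)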

These results are well aligned with the primary findings of the GUSTO-I trial which is reassuring.

Next let’s add in some of our predictors. We’ll add the categorical variables first.

Categorical Model

We’ll now add in the three baseline categorical variables in the dataset: (1) Killip class (a categorization of heart failure symptoms); (2) history of previous MI; and (3) location of MI. Now that we have additional categorical variables, each with its own reference class, our intercept represents the log odds of 30-day mortality in patients receiving SK + tPA who are Killip class I, have not had a prior MI, and have a new MI in the inferior wall of the heart. While this slightly changes the interpretation of the intercept, it won’t make a major difference, and I’ll leave its prior consistent with the base model. We’ll also continue to use the same prior for our new \(\beta\) coefficients.
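A sketch of the categorical model fit (model name assumed); the class-level prior(normal(0, 0.5), class = b) covers all of the new coefficients, while the coef-specific prior overrides it for the intercept term:

m_cat <-
  brm(
    data = gusto,
    family = bernoulli,
    formula = day30 ~ 0 + intercept + tx + Killip + pmi + miloc,
    prior = c(prior(normal(-2.5, 0.75), class = b, coef = "intercept"),
              prior(normal(0, 0.5), class = b)),
    iter = 2000, warmup = 1000,
    chains = 4, cores = 4,
    seed = 11
  )

m_cat <- add_criterion(m_cat, "waic")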

Diagnostics

Again our trace plots and Rhat values are reassuring. Now, though, we see a much better set of N eff / N ratios.

Comparison

But is this model any better than the base model? We can compare these two models using WAIC (widely applicable information criterion), effectively comparing their predicted out-of-sample fit. Below I’ve plotted the difference between the WAIC estimates for the models we want to compare, along with the standard error of those differences.
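The comparison itself is a one-liner (model names assumed; add_criterion() must already have been run on each model):

loo_compare(m_base, m_cat, criterion = "waic")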



Based on the above plot we can conclude that the categorical model represents an improvement over the base model with respect to predicted out-of-sample performance.

Continuous Model

Next we can add the continuous (centered and scaled) variables to our model. The same \(\beta\) priors will apply here, except now these coefficients tell us the expected change in the log odds of 30-day mortality for a one-SD change in the centered/scaled continuous variable.

Comparison

Using the same method we again find that our new model represents an incremental improvement over the base and categorical models with respect to predicted out-of-sample performance.

Spline Model

Now we’ll extend our current model by using thin plate splines to model the continuous variables, allowing for a non-linear relationship between these variables and the log odds of 30-day mortality. I’ve used \(Exponential(1)\) as the prior for the spline standard deviation terms and kept the remainder of the priors the same.
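A sketch of the spline model (name assumed; priors as described); the continuous model from the previous section is identical except that it uses linear age_s + pulse_s + sbp_s terms in place of the s() smooths:

m_spline <-
  brm(
    data = gusto,
    family = bernoulli,
    formula = day30 ~ 0 + intercept + tx +
                s(age_s) + s(pulse_s) + s(sbp_s) +
                Killip + pmi + miloc,
    prior = c(prior(normal(-2.5, 0.75), class = b, coef = "intercept"),
              prior(normal(0, 0.5), class = b),
              prior(exponential(1), class = sds)),
    iter = 2000, warmup = 1000,
    chains = 4, cores = 4,
    seed = 11,
    control = list(adapt_delta = 0.99)
  )

m_spline <- add_criterion(m_spline, "waic")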

Comparison

Here we see that modeling our continuous variables using splines provided a further (albeit smaller than in the previous comparisons) improvement in predicted out-of-sample performance.

Interaction Model

Next we will fit a full interaction model (treatment x covariate interactions for all 6 covariates). In Dr. Harrell’s post, this is the point at which penalized maximum likelihood estimation is used to shrink the interaction coefficients in an effort to avoid overfitting. In our case, we will instead use our priors to employ regularization. Prior selection for regularization/shrinkage in this scenario is far beyond my current knowledge base, but will play an important role in the future implementation of these types of models. For our purposes here, rather than get into the weeds with more complicated regularizing priors (Cauchy, Horseshoe, etc.), we’ll use a more restrictive version of the \(\beta\) prior we’ve been using: \(\beta_{interaction} \sim Normal(0, 0.1)\).
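The prior predictive simulation for this prior is straightforward; a quick sketch on both scales:

set.seed(11)
beta_int <- rnorm(10000, 0, 0.1)  # interaction prior, log odds scale
or_int <- exp(beta_int)           # same draws on the OR scale

quantile(or_int, probs = c(0.055, 0.5, 0.945))  # 89% interval, centered on 1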

The prior predictive plots above show us that this prior is skeptical of very large interaction effects. On the left I’ve plotted the simulated prior on the log odds scale, and on the right on the OR scale. In reality this prior could probably be much narrower, but it will work for our purposes moving forward. We fit the full model below:

m_int <-
  brm(
    data = gusto,
    family = bernoulli,
    formula =
      day30 ~ 0 + intercept + tx + 
                  s(age_s) + s(pulse_s) + s(sbp_s) + 
                  s(age_s, by = tx) + s(pulse_s, by = tx) + s(sbp_s, by = tx) +
                  Killip + pmi + miloc + 
                  tx:Killip + tx:pmi + tx:miloc,
    prior = c(prior(normal(-2.5, 0.75), class = b, coef = "intercept"),
              prior(normal(0, 0.5), class = b, coef = "txSK"),
              prior(normal(0, 0.5), class = b, coef = "txtPA"),
              prior(normal(0, 0.5), class = b, coef = "sage_s_1"),
              prior(normal(0, 0.5), class = b, coef = "spulse_s_1"),
              prior(normal(0, 0.5), class = b, coef = "ssbp_s_1"),
              prior(normal(0, 0.5), class = b, coef = "KillipII"),
              prior(normal(0, 0.5), class = b, coef = "KillipIII"),
              prior(normal(0, 0.5), class = b, coef = "KillipIV"),
              prior(normal(0, 0.5), class = b, coef = "milocAnterior"),
              prior(normal(0, 0.5), class = b, coef = "milocOther"),
              prior(normal(0, 0.5), class = b, coef = "pmiyes"),
              prior(normal(0, 0.1), class = b, coef = "sage_s:txSK_1"),
              prior(normal(0, 0.1), class = b, coef = "sage_s:txSKPtPA_1"),
              prior(normal(0, 0.1), class = b, coef = "sage_s:txtPA_1"),
              prior(normal(0, 0.1), class = b, coef = "spulse_s:txSK_1"),
              prior(normal(0, 0.1), class = b, coef = "spulse_s:txSKPtPA_1"),
              prior(normal(0, 0.1), class = b, coef = "spulse_s:txtPA_1"),
              prior(normal(0, 0.1), class = b, coef = "ssbp_s:txSK_1"),
              prior(normal(0, 0.1), class = b, coef = "ssbp_s:txSKPtPA_1"),
              prior(normal(0, 0.1), class = b, coef = "ssbp_s:txtPA_1"),
              prior(normal(0, 0.1), class = b, coef = "txSK:KillipII"),
              prior(normal(0, 0.1), class = b, coef = "txSK:KillipIII"),
              prior(normal(0, 0.1), class = b, coef = "txSK:KillipIV"),
              prior(normal(0, 0.1), class = b, coef = "txtPA:KillipII"),
              prior(normal(0, 0.1), class = b, coef = "txtPA:KillipIII"),
              prior(normal(0, 0.1), class = b, coef = "txtPA:KillipIV"),
              prior(normal(0, 0.1), class = b, coef = "txSK:pmiyes"),
              prior(normal(0, 0.1), class = b, coef = "txtPA:pmiyes"),
              prior(normal(0, 0.1), class = b, coef = "txSK:milocAnterior"),
              prior(normal(0, 0.1), class = b, coef = "txSK:milocOther"),
              prior(normal(0, 0.1), class = b, coef = "txtPA:milocAnterior"),
              prior(normal(0, 0.1), class = b, coef = "txtPA:milocOther"),
              prior(exponential(1), class = sds, coef = "s(age_s)"),
              prior(exponential(1), class = sds, coef = "s(pulse_s)"),
              prior(exponential(1), class = sds, coef = "s(sbp_s)"),
              prior(exponential(1), class = sds, coef = "s(age_s,by=tx)"),
              prior(exponential(1), class = sds, coef = "s(pulse_s,by=tx)"),
              prior(exponential(1), class = sds, coef = "s(sbp_s,by=tx)")
              ),
    iter = 2000, warmup = 1000, 
    chains = 4, cores = 4,
    seed = 11,
    control = list(adapt_delta = 0.99),
    sample_prior = TRUE
  )

m_int <- add_criterion(m_int, "waic")

Comparison

For the first time it appears that the change in our model did not provide a meaningful change in WAIC. While we did use somewhat narrow priors, the amount of regularization provided by \(\beta_{interaction} \sim Normal(0, 0.1)\) shouldn’t preclude a real interaction effect present in the data from effectively shining through in the final posterior distribution. Thus, we can take this as evidence against true systematic HTE (on the relative scale).

Results

While the lack of improvement in WAIC is evidence against HTE, we can also examine the posterior distributions of the interaction parameters to get a sense for the uncertainty in these estimates. In the ridgeline plot below, each distribution is an interaction parameter. Those with a leading “b_” are the categorical interaction terms, while those with a leading “bs_” are the continuous spline interaction terms.

All of the continuous interaction terms are centered on zero, and, with the exception of two or three small deviations, the categorical interaction terms are as well. From this plot we don’t see any strong evidence of HTE.

We can also look at the marginal effect plots from our model to examine the magnitude of interaction present from the perspective of the outcome.

We can see from these plots what we might have expected based on the ridgeline plot above: there is essentially no interaction between the continuous variables and treatment, and any interaction between the categorical variables and treatment is small.

Subgroups Model

Another way to approach the question of HTE is to define, a priori, subgroups of patients that are hypothesized to be good candidates for HTE. For the purposes of this project I created a set of 8 unique subgroups based on centered/scaled age (with a cut point at 0), Killip class (I & II vs. III & IV), and history of previous MI (yes or no). We can then use a multilevel (hierarchical random effects) model, where the intercept and treatment effect estimates are clustered on subgroup. In this parameterization we can take advantage of the “partial pooling” effect, whereby estimates for the intercept and treatment effect coefficients are shrunk towards the respective grand mean estimates in the entire population. For this model we have a new prior to define: the correlation matrix for the intercept/treatment effect coefficients. For this prior we’ll use \(R \sim LKJ(2)\), a somewhat informative prior that is generally skeptical of extreme correlation values (i.e., those very close to -1 or 1). The prior predictive plot below shows the prior distribution of \(\rho\) values implied by this prior:
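This plot is easy to simulate directly: for a 2 x 2 correlation matrix, the \(LKJ(\eta)\) prior implies that \((\rho + 1) / 2 \sim Beta(\eta, \eta)\), so a minimal sketch is:

rho <- 2 * rbeta(10000, 2, 2) - 1  # implied correlation draws for eta = 2
median_qi(rho, .width = 0.89)      # centered on 0, skeptical of |rho| near 1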

Using this prior (along with the same priors used in previous models) we’ll fit the multilevel model in brms.

Comparison

Once again we don’t notice an appreciable change in WAIC for this model when compared to the spline model. If these subgroups were a priori hypothesized to have differential treatment effects, this would be evidence against true HTE.

Cluster Model

We can also use unsupervised clustering to derive our subgroups a priori. This approach was used by Fernando Zampieri and colleagues to investigate HTE in ARDS therapies. First we employ a clustering technique (in this case KAMILA, which may be better suited for mixed data such as the GUSTO dataset) and identify the optimal number of clusters. In their analysis, Zampieri and colleagues used cross-validated prediction strength to identify the optimal number of clusters. I won’t go into more detail here, but certainly the clustering process plays a very important role in these sorts of approaches. For our project, I’ve arbitrarily selected 3 as the number of clusters and run the KAMILA algorithm on the GUSTO dataset. We can then use cluster membership to partially pool intercept and treatment effect estimates in a manner identical to our previous multilevel model approach. We’ll use all of the same priors for this model.

library(kamila)  # KAMILA clustering for mixed continuous/categorical data

cont_vars <- 
  gusto %>% 
  select(sbp_s, pulse_s, age_s)

cat_vars <-
  gusto %>%
  select(Killip, pmi, miloc)

clust <- kamila(conVar = cont_vars, catFactor = cat_vars, 
                numClust = 3,
                numInit = 100)

gusto <- 
  gusto %>%
  mutate(cluster = factor(clust$finalMemb))

m_clust <-
  brm(
    data = gusto,
    family = bernoulli,
    formula =
      day30 ~ 1 + tx + (1 + tx | cluster) + s(age_s) + s(pulse_s) + s(sbp_s) + 
                  Killip + pmi + miloc,
    prior = c(prior(normal(-2.5, 0.75), class = Intercept),
              prior(normal(0, 0.5), class = b, coef = "txSK"),
              prior(normal(0, 0.5), class = b, coef = "txtPA"),
              prior(normal(0, 0.5), class = b, coef = "sage_s_1"),
              prior(normal(0, 0.5), class = b, coef = "spulse_s_1"),
              prior(normal(0, 0.5), class = b, coef = "ssbp_s_1"),
              prior(normal(0, 0.5), class = b, coef = "KillipII"),
              prior(normal(0, 0.5), class = b, coef = "KillipIII"),
              prior(normal(0, 0.5), class = b, coef = "KillipIV"),
              prior(normal(0, 0.5), class = b, coef = "milocAnterior"),
              prior(normal(0, 0.5), class = b, coef = "milocOther"),
              prior(normal(0, 0.5), class = b, coef = "pmiyes"),
              prior(exponential(1), class = sds, coef = "s(age_s)"),
              prior(exponential(1), class = sds, coef = "s(pulse_s)"),
              prior(exponential(1), class = sds, coef = "s(sbp_s)"),
              prior(exponential(1), class = sd, coef = "Intercept", group = "cluster"),
              prior(exponential(1), class = sd, coef = "txSK", group = "cluster"),
              prior(exponential(1), class = sd, coef = "txtPA", group = "cluster"),
              prior(lkj(2), class = cor, group = "cluster")),
    iter = 2000, warmup = 1000, 
    chains = 4, cores = 4,
    seed = 11,
    control = list(adapt_delta = 0.99)
  )

m_clust <- add_criterion(m_clust, "waic")

Comparison

Again we see no incremental improvement in WAIC with this approach, and could use this as evidence against true HTE (on the relative scale) in our identified clusters.

So, we now have a few models to use as we embark on the next section of the project: estimation of patient-specific efficacy. While the interaction model did not show an improvement in WAIC, we’ll use this model moving forward to allow our relative treatment effect estimates to potentially vary with patient characteristics.

Patient Specific Efficacy Estimates

First let’s get a sense for the distribution of treatment effects implied by our interaction model for the study population. For each patient in the trial, we can take all samples (n = 4000) from the posterior distribution and determine their predicted probability of 30-day mortality by taking the median of these values. We can then plot the distribution of these median probabilities of 30-day mortality as absolute risks (under tPA and SK), absolute risk reductions (SK - tPA), risk ratios, or odds ratios.
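A sketch of how these per-patient summaries can be computed (model and object names assumed); note that the full draws-by-patients matrices are large, so in practice this may need to be done in chunks:

nd_sk  <- gusto %>% mutate(tx = factor("SK",  levels = levels(gusto$tx)))
nd_tpa <- gusto %>% mutate(tx = factor("tPA", levels = levels(gusto$tx)))

p_sk  <- fitted(m_int, newdata = nd_sk,  summary = FALSE)  # draws x patients
p_tpa <- fitted(m_int, newdata = nd_tpa, summary = FALSE)

arr_med <- apply(p_sk - p_tpa, 2, median)  # per-patient median ARR
rr_med  <- apply(p_tpa / p_sk, 2, median)  # per-patient median RR
or_med  <- apply((p_tpa / (1 - p_tpa)) / (p_sk / (1 - p_sk)), 2, median)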


Here we see that most of the density for the median patient ARR falls between 0 and 4%, with some values below 0 (i.e., risk is higher with tPA than SK) and some extending as far as 8%. On the relative scale, the majority of the density for the median patient RR falls between 0.7 and 0.9, although some values do cross 1. Likewise, the bulk of the density for the median patient OR falls between 0.7 and 0.9, again with some values falling above 1.

We can also look at how our model’s predicted median risks on the absolute (ARR) and relative scales (RR and OR) vary with respect to predicted absolute baseline risk (under SK).

Here we see that as baseline risk increases, we have an initial rise in predicted ARR followed by a downturn after a predicted baseline risk of ~50%. In the relative measures, we see some small deviations in the early risk range, followed by a slow rise over the remainder of the range. This rise is more pronounced for RR than OR.

These plots (and the previous ones) are informative, but they don’t fully capture the true variability that is present in these values. Each patient’s predicted values have been summarized by the median, when in fact we know that each patient has a distribution of predicted values (ARR, RR, OR, etc.). To get a sense for some of the within-patient variability implied by our interaction model, we can evaluate the predicted risks for several “mock patients” with varying vectors of baseline characteristics. We’ll look at the following four hypothetical patients:

  1. An average age patient (age_s = 0) with history of a previous MI (pmi = yes) who presents with a new anterior wall MI (miloc = Anterior), Killip class 2 HF symptoms (Killip = II), an average pulse (pulse_s = 0) but a much lower than average blood pressure (sbp_s = -1.5).

  2. An older patient (age_s = 2) without a history of previous MI (pmi = no) who presents with a new inferior wall MI (miloc = Inferior), Killip class 2 HF symptoms (Killip = II), a moderately elevated blood pressure (sbp_s = 0.5), and a very fast heart rate (pulse_s = 2).

  3. An average age patient (age_s = 0) without history of previous MI (pmi = no) who presents with a new inferior wall MI (miloc = Inferior), Killip class 3 HF symptoms (Killip = III) and an average pulse and blood pressure (pulse_s = 0, sbp_s = 0).

  4. A patient of above-average age (age_s = 0.5) with a history of previous MI (pmi = yes) who presents with a new MI in neither the anterior nor inferior walls (miloc = Other), Killip class 2 HF symptoms (Killip = II), a moderately elevated pulse (pulse_s = 0.5), and a moderately reduced blood pressure (sbp_s = -0.5).

These plots are very illuminating with regard to patient-specific efficacy predictions on both the relative and absolute scales, and to our ability to present visualizations of these estimates (along with their inherent uncertainty). We see instances of patients with similar predicted relative efficacies but very different (or more/less uncertain) predicted absolute efficacies. Likewise, we see instances of patients with different relative efficacy estimates and how those translate to the absolute scale. If we wanted to, we could now calculate the probability of relative (RR < 1) or absolute (ARR > 0) benefit for each individual patient. We aren’t constrained by these typical thresholds, though; by presenting the entire posterior distribution we can calculate Pr(X > A) for any combination of efficacy measure (X: RR, ARR, OR) and value of interest (A), as sketched below.
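For example, given the vectors of posterior draws for a single patient (here called arr_draws and rr_draws, hypothetical names), these probabilities are just proportions of draws:

mean(arr_draws > 0)     # Pr(any absolute benefit)
mean(rr_draws < 1)      # Pr(any relative benefit)
mean(arr_draws > 0.01)  # Pr(ARR exceeds, e.g., a 1% threshold)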

As a proof of concept, I created a simple Shiny application that uses the interaction model to generate these patient-specific efficacy posterior distributions based on patient characteristics. In the clinical setting, clinicians could use this sort of tool to understand the predicted absolute and relative benefit of various interventions and make decisions using the entire posterior distribution combined with knowledge of the clinical scenario and the patient’s value system.

Conclusions

Regardless of the presence of differential treatment effects on the relative scale, patients have inherently different risks for mortality (or any study outcome) at baseline. These risks, and their distributions, can be predicted from our model and presented to both clinicians and patients, allowing for more informed decision making. Furthermore, by retaining the full posterior distribution, no premature decisions on the value system of a patient or clinical scenario are made by the analysis. Rather, those individuals making the decision are presented with a complete set of information that they can use as is appropriate for their given situation.

Acknowledgements

The only reason I am able to learn via these sorts of projects is the incredible generosity of so many people who produce high-quality educational content for free. As previously mentioned, this project was inspired by Frank Harrell’s original blog post. Much of my (novice) understanding of Bayesian modeling is thanks to Richard McElreath’s Statistical Rethinking. The actual implementation of this knowledge through brms is thanks to Solomon Kurz’s adaptation of the Statistical Rethinking text and all of the wonderful vignettes and resources from Paul Buerkner. Finally, my interest in this topic has only been further spurred on by many discussions with Dan Lane who is incredibly passionate about reframing our approach to diagnostic and prediction models in clinical medicine.

To-Do List

This is very much a work in progress. Below is a list of planned future improvements and other changes.

  1. Re-fit models with Killip appropriately coded as an ordinal variable in the brms syntax.
  2. Explore PSIS-LOO for model comparison in addition to WAIC.
  3. Add more functionality to the Shiny application (calculation of various probabilities of interest, for example).