Primer on Logistic Regression in R
Logistic Regression
<- gravier |>
gravier_wide bind_cols() |>
as_tibble()
gravier_wide
# A tibble: 168 × 2,906
g2E09 g7F07 g1A01 g3C09 g3H08 g1A08 g1B01 g1int1
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 -0.00144 -0.00144 -0.0831 -0.0475 0.0158 -0.0336 -0.136 0.0180
2 -0.0604 0.0129 -0.00144 0.0104 0.0316 0.108 0.0158 0.0800
3 0.0398 0.0524 -0.0786 0.0635 -0.0395 0.0342 0.00288 0.0594
4 0.0101 0.0314 -0.0218 0.0215 0.0868 0.0272 -0.0160 0.0759
5 0.0496 0.0201 0.0370 0.0311 0.0207 -0.0174 0.111 -0.0469
6 -0.0664 0.0468 0.00720 -0.370 0.00288 0.0243 0.0909 0.0482
7 -0.00289 -0.0816 -0.0291 -0.0249 -0.0174 0.0172 -0.170 -0.00578
8 -0.198 -0.0499 -0.0634 -0.0298 0.0300 0.00144 -0.0529 -0.0401
9 0.00288 0.0201 0.0272 0.0174 -0.0000789 -0.0634 0.0370 0.0314
10 -0.0574 -0.0574 -0.0831 -0.0897 -0.101 -0.144 -0.167 0.0172
# ℹ 158 more rows
# ℹ 2,898 more variables: g1E11 <dbl>, g8G02 <dbl>, g1H04 <dbl>, g1C01 <dbl>,
# g1F11 <dbl>, g3F05 <dbl>, g3B09 <dbl>, g1int2 <dbl>, g2C01 <dbl>,
# g1A05 <dbl>, g1E01 <dbl>, g1B05 <dbl>, g3C05 <dbl>, g3A07 <dbl>,
# g1F01 <dbl>, g2D01 <dbl>, g1int3 <dbl>, g1int4 <dbl>, g1D05 <dbl>,
# g1E05 <dbl>, g1G05 <dbl>, g1C05 <dbl>, g1G11 <dbl>, g2D08 <dbl>,
# g2E06 <dbl>, g3H09 <dbl>, g2F09 <dbl>, g3G06 <dbl>, g2G08 <dbl>, …
What if we wanted to understand how a continous variable like e.g. gene_expression_level
of a given gene, determined whether the plant would survive or sucomb to heat shock at a given temperature? In that case, we would be interested in understanding survived
yes/no as a function of gene_expression_level
:
\[survived_{yes/no} \sim gene\_expresssion\_level\]
Moreover, we are interested in estimating the probability of survival given a certain gene_expression_level
.
Data
As before, let’ us simply simulate some data and let’s say that we have 100 plants at 50 degrees celsius and the mean gene_expression_level
for plants, which did not survive (denoted 0
) was 90
and for those who did survive (denoted 1
), it was 100
:
set.seed(718802)
<- 90
mean_survived_no <- 100
mean_survived_yes <- tibble(
survival_data gene_expression_level = c(rnorm(n = 50,
mean = mean_survived_no,
sd = 3),
rnorm(n = 50,
mean = mean_survived_yes,
sd = 3)),
survived = c(rep(x = 0, times = 50), rep(x = 1, times = 50))
)|>
survival_data sample_n(10)
# A tibble: 10 × 2
gene_expression_level survived
<dbl> <dbl>
1 95.3 0
2 96.6 1
3 104. 1
4 93.6 0
5 95.0 0
6 91.8 0
7 90.2 0
8 88.2 0
9 86.2 0
10 89.3 0
Visualising
As mentioned, we are interested in survived
as a function of gene_expression_level
, visualising this looks like so:
|>
survival_data ggplot(aes(x = gene_expression_level,
y = survived)) +
geom_point(alpha = 0.5,
size = 3)
Doing what we did before with a straight line evidently isn’t super meaningful. What we’re interested in understanding is at what gene_expression_level
does the plant not survive/survive? Clearly, at e.g. 80
, the plant does not survive and at 105
it clear does! But what about 95
? Well, that isn’t really clear. Here, plants are both observed to survive and not-survive. What about 93
? Here, most do survive, but not all, although it seems that there is more chance of not-surviving and vice versa for 97
, than surviving. It’s this chance we’re interested in. So, at different values of gene_expression_levels
, what is the probability of surviving-/not-surviving respectively? This is the quesion a logistic regression answers, so let’s get to it!
Modelling
To do a logistic regression, we use the glm()
-function:
<- glm(formula = survived ~ gene_expression_level,
my_glm_mdl family = binomial(link = "logit"),
data = survival_data)
my_glm_mdl
Call: glm(formula = survived ~ gene_expression_level, family = binomial(link = "logit"),
data = survival_data)
Coefficients:
(Intercept) gene_expression_level
-86.0134 0.9095
Degrees of Freedom: 99 Total (i.e. Null); 98 Residual
Null Deviance: 138.6
Residual Deviance: 37.75 AIC: 41.75
Now, with the model in place, we can visualise. Below here, the points are the observed data and the line is the model of how the probability of survival changes with gene_expression_level
:
|>
survival_data mutate(my_glm_model = pluck(my_glm_mdl, "fitted.values")) |>
ggplot(aes(x = gene_expression_level, y = survived)) +
geom_point(alpha = 0.5,
size = 3) +
geom_line(aes(y = my_glm_model))
This allows us to answer the question from before:
- At
80
, the plant does not survive? - At
105
it clearly does! - What about
95
? - What about
93
? - Is it vice versa for
97
?
predict.glm(object = my_glm_mdl,
newdata = tibble(gene_expression_level = c(80, 105, 95, 93, 97)),
type = "response")
1 2 3 4 5
1.755625e-06 9.999240e-01 5.962687e-01 1.932429e-01 9.010511e-01
So:
- At
80
, the plant does not survive? TRUE, the probability of survival is close to zero - At
105
it clearly does! TRUE, the probability of survival is close to one - What about
95
? Around 60% survival probability - What about
93
? Around 20% survival probability - Is it vice versa for
97
? Around 90% survival probability
Again, this model object is a bit quirke, so broom
to the rescue:
|>
my_glm_mdl tidy(conf.int = TRUE,
conf.level = 0.95) |>
mutate(estimate = exp(estimate))
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 4.41e-38 18.1 -4.74 2.11e-6 -130. -56.6
2 gene_expression_level 2.48e+ 0 0.191 4.76 1.97e-6 0.599 1.37