Primer on Linear Models in R
In a basic linear regression model, one independent variable and one dependent variable are involved. The terms independent and dependent have literal meaning: the value of the dependent variable depends on the value of the independent variable, whereas the independent variable does not depend on the dependent one. In a basic linear regression, we find the line that best fits the data. We will illustrate this with the following example.
Example
Background
Let’s say we wanted to study the genetic mechanism protecting a plant from heat shock, then:
- Independent: Environmental Condition (temperature)
- Dependent: Gene Expression Level (related to heat shock protection)
Here, the independent variable is the temperature and the dependent variable is the gene expression level. It is clear that the temperature does not depend on the gene expression level, but the expression level of heat-shock-related genes does depend on the temperature.
So, we keep plants under different temperatures and collect samples, from which we can extract RNA and run a transcriptomics analysis to uncover gene expression levels.
Data
For the data here, we are going to simulate the relationship between gene expression level and temperature as a function in R:
run_simulation <- function(temp){
  measurement_error <- rnorm(n = length(temp), mean = 0, sd = 3)
  gene_expression_level <- 2 * temp + 3 + measurement_error
  return( gene_expression_level )
}
Note how we’re adding some measurement error to our simulation; otherwise we would get a perfect relationship, which we all know never happens.
Now, we can easily run simulations:
run_simulation(temp = c(15, 20, 25, 30, 35))
[1] 34.31363 45.48291 48.46203 63.74417 70.57002
Let’s just go ahead and create some data we can work with. For this example, we take samples starting at 5 degrees Celsius and then in increments of 1 up to 50 degrees:
library("tidyverse")   # provides tibble, dplyr and ggplot2 used below
set.seed(806017)
experiment_data <- tibble(
  temperature = seq(from = 5, to = 50, by = 1),
  gene_expression_level = run_simulation(temp = temperature)
)
experiment_data |>
  sample_n(10) |>
  arrange(temperature)
# A tibble: 10 × 2
temperature gene_expression_level
<dbl> <dbl>
1 12 28.7
2 13 30.2
3 16 35.6
4 19 44.0
5 22 48.8
6 27 64.6
7 32 61.1
8 33 73.9
9 35 76.7
10 40 84.9
Visualising
Now that we have the data, we can visualise the relationship between the temperature- and gene_expression_level-variables:
my_viz <- experiment_data |>
  ggplot(aes(x = temperature,
             y = gene_expression_level)) +
  geom_point() +
  geom_vline(xintercept = 0) +
  geom_hline(yintercept = 0)
my_viz
Now, we can easily add the best-fit line using the geom_smooth()-function, where we specify that we want to use method = "lm", and for now we exclude the confidence interval by setting se = FALSE:
my_viz +
  geom_smooth(method = "lm",
              se = FALSE)
What happens here is that a best-fit line is added to the plot by calculating the line such that the sum of the squared errors is as small as possible, where the error is the vertical distance from the line to a given point. This is a basic linear regression and is known as Ordinary Least Squares (OLS). But what if we want to work with this regression model beyond just adding a line to a plot?
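To make the “sum of squared errors” idea concrete, here is a minimal sketch; the helper sum_squared_errors() and the candidate values 3 and 2 are introduced purely for illustration:
# Sum of squared errors (SSE) for a candidate intercept and slope
sum_squared_errors <- function(data, intercept, slope){
  predicted <- intercept + slope * data$temperature
  errors <- data$gene_expression_level - predicted
  return( sum(errors^2) )
}
# Try the (here, arbitrary) candidate line y = 3 + 2 * x
sum_squared_errors(experiment_data, intercept = 3, slope = 2)
OLS simply finds the intercept and slope for which this number is as small as possible.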
Modelling
One of the superpowers of R is the built-in capability to do modelling. Because we simulated the data (see above), we know that the true intercept is 3 and the true slope of the temperature variable is 2. Let’s see what we get if we run a linear model:
my_lm_mdl <- lm(formula = gene_expression_level ~ temperature,
                data = experiment_data)
my_lm_mdl
Call:
lm(formula = gene_expression_level ~ temperature, data = experiment_data)
Coefficients:
(Intercept) temperature
2.816 2.021
Importantly, the formula notation gene_expression_level ~ temperature is central to R and should be read as: “gene_expression_level modelled as a function of temperature”, i.e. gene_expression_level is the dependent variable, often denoted y, and temperature is the independent variable, often denoted x.
Okay, that’s pretty close! Recall that the reason for the difference is that we are adding measurement error when we run the simulation (see above).
In other words, our model says that:
\[gene\_expression\_level = 2.816 + 2.021 \cdot temperature\]
I.e. the estimate of the intercept is 2.816 and the estimate of the slope is 2.021, meaning that when temperature = 0, we estimate that the gene_expression_level is 2.816, and for each 1-degree increase in temperature, we estimate that the gene_expression_level increases by 2.021.
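As a quick sanity check, we can plug a few temperatures into the fitted model using base R’s predict(); a minimal sketch:
# Predictions follow the fitted equation: intercept + slope * temperature
predict(my_lm_mdl,
        newdata = tibble(temperature = c(0, 1, 25)))
# temperature = 0 returns (approximately) the intercept of 2.816,
# and each additional degree adds the slope of 2.021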
These estimates are pretty close to the true model underlying our simulation:
\[gene\_expression\_level = 3 + 2 \cdot temperature\]
In general form, such a linear model can be written like so:
\[y = \beta_{0} + \beta_{1} \cdot x_{1}\]
Where the \(\beta\)-coefficients are termed estimates, because that is exactly what we do: given the observed data, we estimate their values.
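The estimated \(\beta\)-coefficients can also be pulled directly out of the model object with coef() and used to reproduce the equation by hand; a minimal sketch:
beta <- coef(my_lm_mdl)                          # named vector of estimates
beta["(Intercept)"] + beta["temperature"] * 25   # y-hat at 25 degrees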
Working with an lm-object
The model format you saw above is a bit quirky, but luckily there is a really nice way to get this kind of model object into a more tidy format:
library("broom")
my_lm_mdl |>
  tidy()
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.82 1.16 2.44 1.89e- 2
2 temperature 2.02 0.0378 53.4 1.16e-41
Briefly, here we get the columns term, estimate, std.error, statistic and p.value. We have already discussed term and estimate. The std.error quantifies the uncertainty of the estimate, and the statistic is used to calculate the p.value.
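For a linear model, the statistic is a t-statistic: for each term it is simply the estimate divided by its std.error. A minimal sketch verifying this:
my_lm_mdl |>
  tidy() |>
  mutate(statistic_by_hand = estimate / std.error)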
The P-value
Now that we have a tidy object, we can simply plug-‘n’-play with other tidyverse tools, so let us visualise the p.value. Note that because of the often very large differences in p-values, we use a -log10-transformation; this means that larger values are “more significant”. Below, the dashed line signifies \(p=0.05\), so anything above that line is considered “statistically significant”:
my_lm_mdl |>
  tidy() |>
  ggplot(aes(x = term,
             y = -log10(p.value))) +
  geom_point() +
  geom_hline(yintercept = -log10(0.05),
             linetype = "dashed")
Now, as mentioned, the p-values are computed based on the statistic and are defined as: “the probability of observing a statistic as or more extreme than the one observed, given that the null hypothesis is true”, where the null hypothesis is that there is no effect, i.e. the estimate for the term is zero.
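Under the hood, the statistic is compared to a t-distribution with the model’s residual degrees of freedom, so the two-sided p.value can be recomputed by hand; a minimal sketch using base R’s pt():
my_lm_mdl |>
  tidy() |>
  mutate(p_by_hand = 2 * pt(q = abs(statistic),
                            df = df.residual(my_lm_mdl),
                            lower.tail = FALSE))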
From this, it is quite clear that there very likely is a relationship between gene_expression_level and temperature. In fact, we know there is, because we simulated the data.
The Confidence Intervals
We can furthermore easily include the confidence intervals of the estimates:
my_lm_mdl_tidy <- my_lm_mdl |>
  tidy(conf.int = TRUE,
       conf.level = 0.95)
my_lm_mdl_tidy
# A tibble: 2 × 7
term estimate std.error statistic p.value conf.low conf.high
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 2.82 1.16 2.44 1.89e- 2 0.488 5.14
2 temperature 2.02 0.0378 53.4 1.16e-41 1.94 2.10
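These limits follow the usual estimate ± t-quantile × std.error recipe; a minimal sketch recomputing them by hand:
my_lm_mdl_tidy |>
  mutate(conf.low_by_hand = estimate -
           qt(p = 0.975, df = df.residual(my_lm_mdl)) * std.error,
         conf.high_by_hand = estimate +
           qt(p = 0.975, df = df.residual(my_lm_mdl)) * std.error)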
And as before, we can easily do a plug-’n’-play into ggplot:
my_lm_mdl_tidy |>
  ggplot(aes(x = estimate,
             y = term,
             xmin = conf.low,
             xmax = conf.high)) +
  geom_errorbarh(height = 0.1) +
  geom_point()
Note what the 0.95 = 95% confidence interval means: “If we were to repeat this experiment 100 times, then in 95 of those times, on average, the generated confidence interval would contain the true value”.
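Because we simulated the data, we can illustrate this interpretation directly: repeat the experiment many times, refit the model each time, and count how often the interval for the slope contains the true value of 2. A minimal sketch (the helper name and the 100 repetitions are chosen purely for illustration):
simulate_slope_ci <- function(i){
  sim_data <- tibble(
    temperature = seq(from = 5, to = 50, by = 1),
    gene_expression_level = run_simulation(temp = temperature)
  )
  lm(formula = gene_expression_level ~ temperature, data = sim_data) |>
    tidy(conf.int = TRUE) |>
    filter(term == "temperature")
}
slope_cis <- map_dfr(1:100, simulate_slope_ci)
# Proportion of intervals containing the true slope; should be close to 0.95
mean(slope_cis$conf.low <= 2 & 2 <= slope_cis$conf.high)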
Summary
What we have gone through here is a basic linear regression, where we aim to model the continuous variable gene_expression_level as a function of another continuous variable, temperature. We simulated data where the true intercept and slope were 3 and 2, respectively, and by fitting a linear regression model, the estimates of the intercept and slope were 2.82 [0.49; 5.14] and 2.02 [1.94; 2.10], respectively.
Linear models allow us to gain insights into data, by modelling relationships.