4 Day 4: Functions, Control Flow, Applied Statistics
4.1 Learning objectives
By the end of this day you should be able to:
- Write a small R function with named arguments, defaults, and a single return value.
- Use `if/else` and `purrr::map()` (in preference to explicit loops) for control flow.
- Compute summary statistics with `summary()`, `mean()`, `sd()`, `median()`, `quantile()`.
- Run a two-sample `t.test()` and a `chisq.test()` and interpret the output.
- Fit a simple linear regression with `lm()` and interpret the coefficients, residuals, and the `summary()` output.
4.2 Lecture
Day 4 takes you from ‘I can read and visualise data’ to ‘I can compute statistics on it’. The four ideas are: write a function for anything you do more than once; prefer purrr::map over explicit loops; know the basic summary statistics; know how to run a t-test, a chi-squared test, and a simple regression. With these in place you can do most of the descriptive and exploratory work an MS-level biostatistician produces.
4.2.1 Writing functions
A function is a named, reusable piece of code that takes inputs (arguments) and returns a single output:
cv <- function(x) {
sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)
}
cv(c(2.1, 2.3, 1.9, 2.5, 2.0))
#> [1] 0.1131371
Anatomy:
- `cv` is the name.
- `function(x)` declares one argument named `x`.
- Inside `{ }` is the body.
- The last expression is automatically returned (no explicit `return()` needed).
Defaults:
summarise_lab <- function(x, na.rm = TRUE) {
c(n = sum(!is.na(x)),
mean = mean(x, na.rm = na.rm),
sd = sd(x, na.rm = na.rm),
median = median(x, na.rm = na.rm))
}
summarise_lab(c(7.2, 6.8, NA, 8.1, 7.4))
#> n mean sd median
#> 4.000 7.375000 0.5468079 7.300000
The `na.rm = TRUE` default means the function silently removes missing values unless the caller overrides it. Be deliberate about defaults; they affect the answer.
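To see the default matter, call the same function with `na.rm = FALSE` (the definition is repeated here so the snippet runs on its own):

```r
summarise_lab <- function(x, na.rm = TRUE) {
  c(n = sum(!is.na(x)),
    mean = mean(x, na.rm = na.rm),
    sd = sd(x, na.rm = na.rm),
    median = median(x, na.rm = na.rm))
}

# with the default overridden, the NA propagates -- sometimes exactly
# the alarm you want when a lab value is unexpectedly missing
summarise_lab(c(7.2, 6.8, NA, 8.1, 7.4), na.rm = FALSE)
```

With `na.rm = FALSE`, `mean`, `sd`, and `median` all come back `NA`; only `n` (which counts non-missing values directly) survives.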
The rule of thumb: write a function the third time you copy-paste a piece of code. Earlier than that and you spend more time generalising than the duplication costs.
4.2.2 Control flow: if/else and case_when
Inside a function:
classify_bp <- function(sbp) {
if (sbp < 120) {
"normal"
} else if (sbp < 130) {
"elevated"
} else if (sbp < 140) {
"stage 1"
} else {
"stage 2"
}
}
classify_bp(118)
#> [1] "normal"
classify_bp(145)
#> [1] "stage 2"
`if/else` is fine for one value at a time. For a vector, use `case_when()` (introduced in Day 2):
patients <- patients |>
mutate(bp_category = case_when(
sbp < 120 ~ "normal",
sbp < 130 ~ "elevated",
sbp < 140 ~ "stage 1",
    TRUE ~ "stage 2"))
4.2.3 purrr::map over loops
R has explicit for loops, and they work, but the tidyverse idiom is to use map() to apply a function to each element of a list or vector:
library(purrr)
# compute mean by group with explicit map
groups <- list(
young = c(110, 115, 122, 118),
middle = c(125, 130, 132, 128),
older = c(138, 145, 142, 140)
)
map_dbl(groups, mean)
#> young middle older
#> 116.25 128.75 141.25
`map_dbl()` applies `mean` to each element and returns a numeric vector (`map()` returns a list; `map_dbl()`, `map_int()`, `map_chr()`, `map_lgl()` return typed vectors). `map_dfr()` returns a row-bound tibble.
For most applied biostatistics, the equivalent operation inside dplyr is group_by + summarise. Reach for map when working with lists of model fits, lists of files, lists of countries — anything that is genuinely list-shaped rather than tabular.
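As an illustration of the "list of model fits" pattern, here is a sketch on an invented toy data frame (not the lecture's cohort): one regression per age group, with the slopes extracted by `map_dbl()`.

```r
library(purrr)

# toy data frame standing in for a cohort (hypothetical numbers)
cohort_toy <- data.frame(
  age_group = rep(c("<40", "40-59", "60+"), each = 4),
  bmi       = c(22, 25, 28, 31, 23, 26, 29, 32, 24, 27, 30, 33),
  sbp       = c(110, 115, 120, 125, 120, 126, 132, 138, 130, 137, 144, 151)
)

# split() gives a named list of data frames -- genuinely list-shaped
fits <- map(split(cohort_toy, cohort_toy$age_group),
            \(d) lm(sbp ~ bmi, data = d))

# one BMI slope per age group, as a named numeric vector
map_dbl(fits, \(f) coef(f)[["bmi"]])
```

The same per-group slopes could be produced with `group_by` + `summarise`; `map()` pays off once you want to keep the full `lm` objects around (residuals, predictions, diagnostics) rather than just one number per group.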
4.2.4 Summary statistics
summary() is the first-look function:
summary(cohort$age)
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 18.00 34.00 47.00 46.32 59.00 89.00
Per-statistic functions:
mean(cohort$bmi)
sd(cohort$bmi)
median(cohort$bmi)
quantile(cohort$bmi, c(0.25, 0.50, 0.75, 0.95))
IQR(cohort$bmi)
For counts of a categorical variable:
table(cohort$sex)
#> F M
#> 110 90
prop.table(table(cohort$sex))
#> F M
#> 0.55 0.45
For a two-way table:
table(cohort$sex, cohort$diabetic)
#> FALSE TRUE
#> F 97 13
#> M 77 13
4.2.5 t-test
The two-sample t-test compares means between two groups:
t.test(sbp ~ sex, data = cohort)
#>
#> Welch Two Sample t-test
#>
#> data: sbp by sex
#> t = -2.31, df = 188.4, p-value = 0.022
#> alternative hypothesis: true difference in means
#> between group F and group M is not equal to 0
#> 95 percent confidence interval:
#> -9.74 -0.78
#> sample estimates:
#> mean in group F mean in group M
#> 126.1 131.4
R defaults to Welch’s t-test (unequal variances), which is the right default. The `~` (tilde) is R’s formula syntax: ‘sbp explained by sex’.
What to read in the output:
- `t = -2.31`: the test statistic.
- `df = 188.4`: degrees of freedom (non-integer because of the Welch adjustment).
- `p-value = 0.022`: the p-value for the two-sided test.
- `95 percent confidence interval: -9.74 -0.78`: the 95% CI for the difference in means.
- The sample means in each group.
A one-sentence interpretation: women have lower SBP than men by about 5.3 mmHg on average; the difference is statistically significant at α = 0.05; the 95% CI for the difference is (-9.7, -0.8).
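The printed output can also be consumed programmatically: `t.test()` returns an `htest` list whose components hold the numbers read off above. A minimal sketch on simulated data (the means, SDs, and group sizes below are made up to resemble the cohort):

```r
set.seed(1)
sbp_f <- rnorm(110, mean = 126, sd = 12)   # hypothetical female SBP values
sbp_m <- rnorm(90,  mean = 131, sd = 13)   # hypothetical male SBP values

tt <- t.test(sbp_f, sbp_m)   # Welch by default

tt$statistic   # t
tt$parameter   # df (non-integer, Welch)
tt$p.value     # two-sided p-value
tt$conf.int    # 95% CI for the difference in means
tt$estimate    # the two sample means
```

Pulling results out of the object like this beats retyping numbers from the console, especially once the test lives inside a report or a function.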
A statistically significant t-test does not by itself mean the difference is clinically important. A 5 mmHg difference in SBP across sex is real but small relative to the within-person variation. The chapter on causal inference in the Applied Methods volume develops this distinction at length.
4.2.6 chi-squared test
The chi-squared test of independence compares distributions across a contingency table:
chisq.test(table(cohort$sex, cohort$diabetic))
#>
#> Pearson's Chi-squared test with Yates' continuity correction
#>
#> data: table(cohort$sex, cohort$diabetic)
#> X-squared = 0.07, df = 1, p-value = 0.79
Read the test statistic, df, and p-value as before. For sparse tables (any expected cell count under 5), use Fisher’s exact test instead:
fisher.test(table(cohort$sex, cohort$diabetic))
A common applied error is to use chi-squared on a sparse table; the p-value is unreliable. Check the expected counts first:
chisq.test(table(cohort$sex, cohort$diabetic))$expected
4.2.7 Simple linear regression
The fit:
fit <- lm(sbp ~ age, data = cohort)
fit
#>
#> Call:
#> lm(formula = sbp ~ age, data = cohort)
#>
#> Coefficients:
#> (Intercept) age
#> 106.21 0.46
The intercept (106.21) is the predicted SBP at age 0; the slope (0.46) is the predicted change in SBP per one-year increase in age.
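Since age 0 lies far outside the data, the raw intercept is an extrapolation. One common remedy, sketched here on simulated numbers (not the lecture's cohort), is to center the predictor so the intercept becomes the predicted SBP at the mean age:

```r
set.seed(2)
d <- data.frame(age = runif(200, 20, 80))
d$sbp <- 106 + 0.46 * d$age + rnorm(200, sd = 11)   # simulated SBP

fit_raw      <- lm(sbp ~ age, data = d)
fit_centered <- lm(sbp ~ I(age - mean(age)), data = d)

coef(fit_centered)[1]   # intercept: predicted SBP at the average age
coef(fit_centered)[2]   # slope: identical to the uncentered model's slope
```

Centering changes only the intercept's meaning; the slope, fitted values, and residuals are untouched.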
The summary:
summary(fit)
#>
#> Call:
#> lm(formula = sbp ~ age, data = cohort)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -28.45 -7.23 0.31 7.18 33.42
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 106.2147 2.0341 52.2 < 2e-16 ***
#> age 0.4623 0.0411 11.3 < 2e-16 ***
#>
#> Residual standard error: 11.4 on 498 degrees of freedom
#> Multiple R-squared: 0.203, Adjusted R-squared: 0.201
#> F-statistic: 126.7 on 1 and 498 DF, p-value: < 2e-16
What to read:
- Coefficients: estimate, standard error, t, p-value. The slope on age is 0.46 (SE 0.04, p < 2e-16).
- Residual standard error: 11.4 mmHg. Interpret as the typical size of the prediction error.
- R-squared: 0.203. About 20% of the variance in SBP is explained by age alone.
- F-statistic: tests whether the model overall is better than the intercept-only model.
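Beyond reading `summary()`, the fitted object can generate predictions. A sketch on simulated data (not the cohort itself), using base R's `predict()` with a confidence interval:

```r
set.seed(3)
d <- data.frame(age = runif(300, 20, 80))
d$sbp <- 106 + 0.46 * d$age + rnorm(300, sd = 11)   # simulated SBP

fit <- lm(sbp ~ age, data = d)

# predicted mean SBP at ages 40 and 60, with 95% confidence limits
predict(fit, newdata = data.frame(age = c(40, 60)),
        interval = "confidence")
```

The result is a matrix with columns `fit`, `lwr`, and `upr`; `interval = "prediction"` would instead give the wider limits for an individual new patient.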
For multiple predictors:
fit2 <- lm(sbp ~ age + sex + bmi, data = cohort)
summary(fit2)
Each coefficient is interpreted holding the others constant. The interpretation gets harder once you include terms that are correlated; the Applied Methods volume’s chapter on causal inference treats this carefully.
4.3 Worked example: an end-to-end exploratory analysis
Load the cohort, write a small function to summarise a lab value, run a t-test on a sex difference, fit a linear regression of SBP on age and BMI.
library(tidyverse)
cohort <- read_csv("data/cohort.csv") |>
mutate(diabetic = fasting_glucose >= 126,
age_group = case_when(
age < 40 ~ "<40",
age < 60 ~ "40-59",
TRUE ~ "60+"))
# function to summarise a numeric variable per group
group_summary <- function(data, group_var, value_var) {
data |>
group_by({{ group_var }}) |>
summarise(n = n(),
mean = mean({{ value_var }}, na.rm = TRUE),
sd = sd({{ value_var }}, na.rm = TRUE),
.groups = "drop")
}
group_summary(cohort, sex, sbp)
#> # A tibble: 2 × 4
#> sex n mean sd
#> <chr> <int> <dbl> <dbl>
#> 1 F 110 126. 12.3
#> 2 M 90 131. 13.1
# is the SBP difference between sexes significant?
t.test(sbp ~ sex, data = cohort)
# is the proportion of diabetics different between sexes?
chisq.test(table(cohort$sex, cohort$diabetic))
# fit a regression of SBP on age and BMI
fit <- lm(sbp ~ age + bmi + sex, data = cohort)
summary(fit)
Reading the regression: SBP rises with age (β = 0.42, SE 0.04, p < 0.001) and with BMI (β = 0.85, SE 0.20, p < 0.001), and is lower in women (β = -3.2, SE 1.1, p = 0.005) holding age and BMI constant. R² is around 0.27.
The `{{ }}` (curly-curly) syntax in the function above is how dplyr and friends pass column names as arguments into a function. You will use this idiom enough that it becomes muscle memory; for now, treat it as the recipe for ‘pass a column name into a tidyverse function’.
4.4 Homework
1. Three small functions.
   - Write a function `cv()` that takes a numeric vector and returns its coefficient of variation (standard deviation divided by mean). Handle missing values with an `na.rm = TRUE` default.
   - Write a function `or_2x2()` that takes a 2×2 contingency table (a matrix or a `table` object) and returns the odds ratio. Use Haldane’s correction (add 0.5 to each cell) if any cell count is zero.
   - Write a function `bonferroni()` that takes a vector of p-values and returns the Bonferroni-corrected p-values (each multiplied by the number of tests, capped at 1).
2. Two-sample t-test. Test whether mean BMI differs between diabetic and non-diabetic patients in the cohort. Report the test statistic, p-value, 95% CI for the difference, and group means. State the conclusion in one sentence.
3. Chi-squared. Test whether `age_group` and `diabetic` are independent. Report the test statistic, p-value, and the expected counts. Comment on whether Fisher’s exact test would have been more appropriate.
4. Simple linear regression. Fit `sbp ~ bmi` in the cohort. Report the slope estimate, its SE, t-statistic, and p-value. Interpret the slope in plain English using the data’s units.
5. Multiple regression. Extend the model in problem 4 to `sbp ~ bmi + age + sex`. Compare the slope on `bmi` between the two models. Why does it change?
4.5 Solutions
Problem 1.
cv <- function(x, na.rm = TRUE) {
sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}
or_2x2 <- function(tab) {
# tab is a 2x2 with exposed/unexposed in rows, case/control in cols
if (any(tab == 0)) tab <- tab + 0.5
(tab[1,1] * tab[2,2]) / (tab[1,2] * tab[2,1])
}
bonferroni <- function(p) {
pmin(p * length(p), 1)
}
Problem 2.
t.test(bmi ~ diabetic, data = cohort)
#> ...
#> t = -3.45, df = 67.2, p-value = 0.001
#> 95 CI: (-3.42, -0.93)
#> means: non-diabetic 27.4, diabetic 29.6
Diabetic patients have higher mean BMI than non-diabetic patients by about 2.2 kg/m² (95% CI 0.9-3.4, p = 0.001).
Problem 3.
chisq.test(table(cohort$age_group, cohort$diabetic))
#> X-squared = 22.1, df = 2, p-value = 1.6e-5
Expected counts (from chisq.test(...)$expected) are all above 5, so chi-squared is appropriate. Fisher’s exact would not be necessary; it would give a similar answer but is computationally heavier.
Problem 4.
fit_simple <- lm(sbp ~ bmi, data = cohort)
summary(fit_simple)
#> bmi: estimate = 1.32, SE = 0.18, t = 7.3, p < 0.001
A 1 kg/m² increase in BMI is associated with a 1.32 mmHg increase in SBP.
Problem 5.
fit_multi <- lm(sbp ~ bmi + age + sex, data = cohort)
summary(fit_multi)
#> bmi: estimate = 0.85, SE = 0.20, t = 4.3, p < 0.001
The BMI slope drops from 1.32 to 0.85 when age and sex are added. The interpretation: in the simple model, the BMI coefficient absorbs some of the age-and-sex relationship with SBP (older and male patients tend to have higher BMI and higher SBP). Adjusting for age and sex isolates the BMI-SBP relationship from those confounders. The careful version of this story is the subject of the causal inference chapters in Applied Methods.
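The mechanics of that story can be seen in a toy simulation (invented numbers, not the cohort): when age drives both BMI and SBP, the unadjusted BMI slope is inflated relative to the adjusted one.

```r
set.seed(4)
n   <- 500
age <- runif(n, 20, 80)
bmi <- 20 + 0.10 * age + rnorm(n, sd = 2)                 # BMI rises with age
sbp <- 100 + 0.40 * age + 0.85 * bmi + rnorm(n, sd = 10)  # true BMI effect: 0.85

coef(lm(sbp ~ bmi))[["bmi"]]         # unadjusted: absorbs the age effect
coef(lm(sbp ~ bmi + age))[["bmi"]]   # adjusted: near the true 0.85
```

Because we simulated the data, we know the true coefficient, so we can see directly that adjustment recovers it while the unadjusted slope overshoots.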
4.6 What’s next
Day 5 closes the boot camp with reproducibility (Quarto, basic Git through RStudio) and a primer on using AI assistance responsibly. After Day 5 the companion volumes pick up: each thread of statistical methodology (longitudinal, survival, causal, Bayesian, ML) has a home in one of the five companion books.