3 Day 3: Visualisation with ggplot2
3.1 Learning objectives
By the end of this day you should be able to:
- Build a ggplot2 figure from scratch using the grammar of graphics: data, aesthetics, geoms.
- Choose the right geom for the data (geom_point, geom_line, geom_bar, geom_histogram, geom_boxplot).
- Customise scales, labels, and themes.
- Use facet_wrap to produce small multiples.
- Save figures as PNG and PDF at appropriate dimensions and resolution.
3.2 Lecture
ggplot2 is the standard graphics package for the R ecosystem. It is built on the grammar of graphics: a deliberate vocabulary that lets you compose any figure from a small number of orthogonal pieces. Once you know the grammar, every plot is a recombination.
3.2.1 The grammar in one paragraph
A ggplot2 figure has three required pieces and several optional ones:
- data: a tibble or data frame.
- aesthetic mapping: which columns map to which visual channels (x, y, colour, shape, size).
- geom: the geometric form (point, line, bar, histogram, boxplot).
Plus optional:
- scales: how aesthetic values map to visual values (continuous to log, categorical to a colour palette).
- facets: a small-multiples grid by some categorical variable.
- theme: non-data ink (axes, grid, fonts).
A minimal plot:
library(tidyverse)
ggplot(cohort, aes(x = age, y = sbp)) +
  geom_point()
Read it as: ‘Take the cohort tibble, map age to the x-axis and sbp to the y-axis, and draw points.’ Each new layer is added with +.
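To make the three required pieces concrete, here is a minimal sketch. The `toy` data frame is a made-up stand-in for `cohort` (simulated values, not the course data), so the block runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in for the cohort tibble: simulated values only.
set.seed(1)
toy <- data.frame(age = round(runif(200, 20, 80)))
toy$sbp <- 100 + 0.5 * toy$age + rnorm(200, sd = 10)

# data + aesthetic mapping: this alone draws axes but no marks.
base <- ggplot(toy, aes(x = age, y = sbp))

# Adding a geom creates the first (and here only) layer.
p <- base + geom_point()

length(base$layers)  # number of layers before a geom is added
length(p$layers)     # number of layers after geom_point()
```

Printing `base` on its own shows an empty panel with labelled axes, which is a quick way to check the mapping before choosing a geom.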
3.2.2 Common geoms
geom_point for scatter:
ggplot(cohort, aes(x = age, y = sbp, colour = sex)) +
  geom_point(alpha = 0.5)
alpha = 0.5 makes points semi-transparent, which helps when points overplot.
geom_histogram for the distribution of one continuous variable:
ggplot(cohort, aes(x = bmi)) +
  geom_histogram(binwidth = 1)
binwidth controls bin width in data units. If you omit it, ggplot2 falls back to 30 bins, which is often too coarse or too fine; try a few values.
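One way to iterate on bin width is to build the same histogram twice and compare the bins ggplot2 computes. This is only a sketch on simulated data; `toy` and its `bmi` values are invented for illustration.

```r
library(ggplot2)

# Simulated BMI-like values; not the course data.
set.seed(2)
toy <- data.frame(bmi = rnorm(500, mean = 27, sd = 5))

p_fine   <- ggplot(toy, aes(x = bmi)) + geom_histogram(binwidth = 1)
p_coarse <- ggplot(toy, aes(x = bmi)) + geom_histogram(binwidth = 5)

# layer_data() returns the computed bins, one row per bar.
n_fine   <- nrow(layer_data(p_fine))
n_coarse <- nrow(layer_data(p_coarse))
c(fine = n_fine, coarse = n_coarse)
```

The narrower bin width produces more bars; whether that extra detail is signal or noise depends on the sample size.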
geom_boxplot for distributions across categories:
ggplot(cohort, aes(x = age_group, y = sbp)) +
  geom_boxplot()
geom_bar for counts:
ggplot(cohort, aes(x = race)) +
  geom_bar()
geom_line for time series or any x-y where x is ordered:
ggplot(daily_admits, aes(x = date, y = n_admissions)) +
  geom_line()
geom_smooth adds a smoother. The default method depends on sample size: LOESS when the largest group has fewer than 1,000 observations, and a GAM (generalised additive model, fitted with mgcv) otherwise. Use method = "lm" for a linear fit, or specify any other method explicitly:
ggplot(cohort, aes(x = age, y = sbp)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm")
Layers compose. The points and the line above are drawn on top of one another in the order they are added.
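The sketch below shows layer order and the linear smoother on simulated data (the `toy` data frame is hypothetical). It also pulls out the fitted values that geom_smooth draws, which can be useful for checking what the line represents.

```r
library(ggplot2)

# Simulated stand-in for the cohort: SBP rising linearly with age plus noise.
set.seed(3)
toy <- data.frame(age = runif(300, 20, 80))
toy$sbp <- 100 + 0.5 * toy$age + rnorm(300, sd = 10)

p <- ggplot(toy, aes(x = age, y = sbp)) +
  geom_point(alpha = 0.4) +                    # layer 1, drawn first
  geom_smooth(method = "lm", formula = y ~ x)  # layer 2, drawn on top

# The fitted line and its confidence band, evaluated on a grid of x values.
fit <- layer_data(p, 2)
head(fit[, c("x", "y", "ymin", "ymax")])
```

Swapping the two geom lines would draw the points over the fitted line instead.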
3.2.3 Aesthetic mapping vs. fixed values
Inside aes(): column → visual channel. Outside aes(): a fixed value applied to every observation.
# Wrong (inside aes(), 0.5 is treated as data to be mapped, so you get an unwanted alpha legend)
ggplot(cohort, aes(x = age, y = sbp, alpha = 0.5)) +
geom_point()
# Right (alpha is a fixed value)
ggplot(cohort, aes(x = age, y = sbp)) +
geom_point(alpha = 0.5)
# Right (colour mapped to the sex column)
ggplot(cohort, aes(x = age, y = sbp, colour = sex)) +
  geom_point()
When you see strange figures (points in a default palette colour with a legend entry reading "blue", or an alpha legend you did not ask for), it is usually aes()-vs.-not confusion.
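The same distinction applies to colour. A minimal sketch, again on invented data (`toy` is hypothetical), compares a fixed colour with the same string placed inside aes():

```r
library(ggplot2)

set.seed(4)
toy <- data.frame(age = runif(50, 20, 80), sbp = rnorm(50, 125, 15))

# Fixed value: every point is drawn in blue, with no legend.
p_fixed <- ggplot(toy, aes(x = age, y = sbp)) +
  geom_point(colour = "blue")

# Mapped "value": ggplot2 treats the string as a one-level variable,
# picks a colour from the default palette, and adds a legend.
p_mapped <- ggplot(toy, aes(x = age, y = sbp, colour = "blue")) +
  geom_point()

unique(layer_data(p_fixed)$colour)   # the literal colour that was set
unique(layer_data(p_mapped)$colour)  # a palette colour, not "blue"
```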
3.2.4 Scales
Continuous scales:
ggplot(cohort, aes(x = bmi, y = fasting_glucose)) +
  geom_point() +
  scale_y_log10() # log y axis
For colour:
ggplot(cohort, aes(x = age, y = sbp, colour = bmi)) +
  geom_point() +
  scale_colour_viridis_c() # continuous viridis
For categorical colour, the default is fine for a few levels; for more, scale_colour_brewer() and scale_colour_manual() give control.
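As an illustration of scale_colour_manual(), the sketch below assigns colours by level name. The data and the two colour choices are arbitrary assumptions made for the example.

```r
library(ggplot2)

# Simulated stand-in for the cohort.
set.seed(5)
toy <- data.frame(
  age = runif(100, 20, 80),
  sbp = rnorm(100, 125, 15),
  sex = sample(c("F", "M"), 100, replace = TRUE)
)

# Illustrative palette: a named vector ties each colour to a level.
sex_colours <- c(F = "darkorange", M = "steelblue")

p <- ggplot(toy, aes(x = age, y = sbp, colour = sex)) +
  geom_point() +
  scale_colour_manual(values = sex_colours)

# Number of points drawn in each colour; compare with table(toy$sex).
drawn <- layer_data(p)$colour
table(drawn)
```

Naming the vector is the safer habit: an unnamed vector is matched by level order, which changes silently if the factor levels are reordered.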
3.2.5 Labels and themes
ggplot(cohort, aes(x = age, y = sbp, colour = sex)) +
  geom_point(alpha = 0.5) +
  geom_smooth(method = "lm") +
  labs(title = "Systolic blood pressure rises with age",
       subtitle = "NHANES-style synthetic cohort, n = 500",
       x = "Age (years)",
       y = "Systolic blood pressure (mmHg)",
       colour = "Sex",
       caption = "Source: synthetic data") +
  theme_minimal(base_size = 12)
theme_minimal() strips the default grey background. Other themes: theme_bw(), theme_classic(), theme_void(). base_size sets the base font size in points.
3.2.6 Facets
Small multiples are the right way to compare a relationship across groups:
ggplot(cohort, aes(x = age, y = sbp)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm") +
  facet_wrap(~ sex)
facet_wrap(~ var) makes one panel per level of var, arranged in a grid. facet_grid(rows ~ cols) makes a two-way grid.
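To see the difference between the two facet functions, the following sketch builds both layouts from the same base plot and counts the panels. The `toy` data frame and its age-group labels are invented for the example.

```r
library(ggplot2)

# Simulated stand-in for the cohort with two grouping variables.
set.seed(6)
toy <- data.frame(
  age = runif(240, 20, 80),
  sex = sample(c("F", "M"), 240, replace = TRUE),
  age_group = sample(c("20-39", "40-59", "60+"), 240, replace = TRUE)
)
toy$sbp <- 100 + 0.5 * toy$age + rnorm(240, sd = 10)

base <- ggplot(toy, aes(x = age, y = sbp)) + geom_point(alpha = 0.4)

p_wrap <- base + facet_wrap(~ sex)            # one panel per sex
p_grid <- base + facet_grid(sex ~ age_group)  # rows = sex, columns = age group

# Count the panels each layout produces.
n_wrap <- length(unique(layer_data(p_wrap)$PANEL))
n_grid <- length(unique(layer_data(p_grid)$PANEL))
c(wrap = n_wrap, grid = n_grid)
```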
3.2.7 Saving figures
p <- ggplot(cohort, aes(x = age, y = sbp)) +
  geom_point() + geom_smooth()
ggsave("figures/sbp-by-age.png", p, width = 6, height = 4,
       dpi = 300)
ggsave("figures/sbp-by-age.pdf", p, width = 6, height = 4)
PNG for screen and most slides; PDF for print and publication. The dpi = 300 for PNG keeps the figure crisp at print size; PDF is vector and has no dpi.
A figure for a journal article typically wants a specific width (often 3.5 inches single-column or 7 inches two-column). Match the journal’s specification at save time so the figure does not need rescaling.
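A sketch of saving at a journal's single-column width follows. The 3.5 by 2.5 inch size is an assumed specification, the data are simulated, and the files go to a temporary folder rather than figures/ so the example leaves nothing behind.

```r
library(ggplot2)

set.seed(7)
toy <- data.frame(age = runif(100, 20, 80), sbp = rnorm(100, 125, 15))
p <- ggplot(toy, aes(x = age, y = sbp)) + geom_point()

# Assumed journal spec (hypothetical): single column, 3.5 in wide.
width_in  <- 3.5
height_in <- 2.5
dpi       <- 300

out_dir  <- tempdir()  # temporary folder for the example
png_path <- file.path(out_dir, "sbp-by-age.png")
pdf_path <- file.path(out_dir, "sbp-by-age.pdf")

ggsave(png_path, p, width = width_in, height = height_in, dpi = dpi)
ggsave(pdf_path, p, width = width_in, height = height_in)

# Pixel width of the PNG is inches times dpi.
width_in * dpi
```

At 3.5 inches and 300 dpi the PNG is 1050 pixels wide. Text set with base_size stays the same point size regardless of the saved width, so a figure designed at 6 inches can look crowded at 3.5; check the saved file, not the on-screen preview.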
3.3 Worked example: figures from yesterday’s analysis
Take yesterday’s cohort and produce three figures: a histogram of age, a coloured scatter of SBP vs. BMI, and a boxplot of fasting glucose by age group, split by sex.
library(tidyverse)
cohort <- read_csv("data/cohort.csv")
# 1. Histogram of age
p1 <- ggplot(cohort, aes(x = age)) +
  geom_histogram(binwidth = 5, fill = "steelblue",
                 colour = "white") +
  labs(title = "Age distribution",
       x = "Age (years)", y = "Count") +
  theme_minimal(base_size = 12)
# 2. Scatter of SBP vs. BMI, coloured by sex
p2 <- ggplot(cohort, aes(x = bmi, y = sbp, colour = sex)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(x = "BMI (kg/m^2)", y = "SBP (mmHg)",
       colour = "Sex") +
  theme_minimal(base_size = 12)
# 3. Boxplot of fasting glucose by age group, filled by sex
p3 <- ggplot(cohort, aes(x = age_group, y = fasting_glucose,
                         fill = sex)) +
  geom_boxplot() +
  labs(x = "Age group", y = "Fasting glucose (mg/dL)",
       fill = "Sex") +
  theme_minimal(base_size = 12)
# Save each figure as a 300 dpi PNG
ggsave("figures/age-hist.png", p1, width = 5, height = 4,
       dpi = 300)
ggsave("figures/sbp-bmi.png", p2, width = 6, height = 4,
       dpi = 300)
ggsave("figures/glucose-box.png", p3, width = 6, height = 4,
       dpi = 300)
These three figures together would make a respectable ‘descriptive figures’ panel in a journal article.
3.4 Homework
1. Recreate four figures. For each of the four specifications below, produce the figure from yesterday’s cohort dataset.
   (a) Histogram of bmi with bin width 1 and the title ‘BMI distribution’.
   (b) Scatter of fasting_glucose (y) vs. bmi (x), coloured by diabetic, with a linear smoother per group, log-scaled y-axis.
   (c) Boxplot of sbp by age_group faceted by sex.
   (d) Bar chart of count by race, sorted from most common to least common.
2. Log scale. Plot fasting_glucose (y) against bmi (x) with both natural and log y-axes (two figures, side by side or sequential). When does the log scale change what you see? When does it not?
3. Journal theme. Customise figure 1(b) with a journal-style theme: white background, grid only on the y-axis, larger axis font, a specified colour palette (e.g., scale_colour_manual(values = c("FALSE" = "grey50", "TRUE" = "firebrick"))).
4. Faceted comparison. Plot the relationship between sbp (y) and bmi (x) with geom_point plus geom_smooth, faceted by age_group (three panels). Comment in 1-2 sentences on what the facets reveal.
5. Save. Save figures (a) through (d) from problem 1 as both PNG (300 dpi) and PDF, with widths 5 inches for the histogram and 6 inches for the others.
3.5 Solutions
Problem 1.
# Each figure is assigned to an object (p_a to p_d) so it can be saved in Problem 5.
# Type the object name, e.g. p_a, to display it.
# (a)
p_a <- ggplot(cohort, aes(x = bmi)) +
  geom_histogram(binwidth = 1, fill = "steelblue",
                 colour = "white") +
  labs(title = "BMI distribution",
       x = "BMI (kg/m^2)", y = "Count") +
  theme_minimal()
# (b)
p_b <- ggplot(cohort, aes(x = bmi, y = fasting_glucose,
                          colour = diabetic)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_log10() +
  labs(x = "BMI (kg/m^2)",
       y = "Fasting glucose (mg/dL, log scale)",
       colour = "Diabetic") +
  theme_minimal()
# (c)
p_c <- ggplot(cohort, aes(x = age_group, y = sbp)) +
  geom_boxplot() +
  facet_wrap(~ sex) +
  labs(x = "Age group", y = "SBP (mmHg)") +
  theme_minimal()
# (d)
p_d <- cohort |>
  count(race) |>
  mutate(race = fct_reorder(race, n, .desc = TRUE)) |>
  ggplot(aes(x = race, y = n)) +
  geom_col(fill = "steelblue") +
  labs(x = "Race", y = "Count") +
  theme_minimal()
fct_reorder() reorders the factor levels by the count, and .desc = TRUE puts the most common level first; geom_col() is geom_bar(stat = "identity") (use the y-values as-is rather than counting).
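An alternative for part (d), sketched here on made-up categories rather than the race column, is to skip the count() step and let forcats::fct_infreq() order the levels by frequency before geom_bar() does the counting.

```r
library(ggplot2)
library(forcats)

# Made-up categories standing in for the race column.
toy <- data.frame(group = rep(c("A", "B", "C"), times = c(10, 30, 20)))

# fct_infreq() orders levels by how often they occur, most common first.
toy$group <- fct_infreq(toy$group)
levels(toy$group)

# geom_bar() counts the rows per level, so no count() step is needed.
p <- ggplot(toy, aes(x = group)) + geom_bar()
```

Both routes give the same bar order; the count() plus geom_col() version is the better fit when the counts are already tabulated.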
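Problem 2. The log scale changes what you see when the plotted values are right-skewed or span a wide range (typical for fasting glucose and most lab values). The log axis turns a multiplicative relationship into a straight line, compresses the high end, and spreads out the low end, so a few large values no longer dominate the panel. It changes little when the values cover a narrow range, because the logarithm is close to linear over a short interval.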
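One possible way to produce the two figures is sketched below. It uses simulated right-skewed values in a hypothetical `toy` data frame so that it runs without the course file; with the real data, replace `toy` with `cohort`.

```r
library(ggplot2)

# Simulated right-skewed glucose values; a stand-in for the cohort columns.
set.seed(8)
toy <- data.frame(bmi = rnorm(300, mean = 27, sd = 4))
toy$fasting_glucose <- exp(4.3 + 0.01 * toy$bmi + rnorm(300, sd = 0.25))

# Natural y-axis
p_natural <- ggplot(toy, aes(x = bmi, y = fasting_glucose)) +
  geom_point(alpha = 0.5) +
  labs(x = "BMI (kg/m^2)", y = "Fasting glucose (mg/dL)")

# Same plot with a log10 y-axis
p_log <- p_natural +
  scale_y_log10() +
  labs(y = "Fasting glucose (mg/dL, log scale)")

# On the log plot the y positions are log10 of the raw values.
range(layer_data(p_natural)$y)
range(layer_data(p_log)$y)
```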
Problem 3.
ggplot(cohort, aes(x = bmi, y = fasting_glucose,
                   colour = diabetic)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = "lm", se = FALSE) +
  scale_y_log10() +
  scale_colour_manual(values = c("FALSE" = "grey50",
                                 "TRUE" = "firebrick"),
                      labels = c("No", "Yes")) +
  labs(x = "BMI (kg/m^2)",
       y = "Fasting glucose (mg/dL, log)",
       colour = "Diabetic") +
  theme_classic(base_size = 14) +
  theme(panel.grid.major.y = element_line(),
        panel.grid.minor.y = element_line())
theme_classic() is the closest to a typical journal look. The theme() modifier adds the y-axis grid back.
Problem 4.
ggplot(cohort, aes(x = bmi, y = sbp)) +
  geom_point(alpha = 0.4) +
  geom_smooth(method = "lm") +
  facet_wrap(~ age_group) +
  labs(x = "BMI", y = "SBP") +
  theme_minimal()
The relationship between BMI and SBP is positive in every age group; the slope appears similar across groups but the intercept is higher in the older group (because SBP rises with age independent of BMI).
Problem 5.
# Assumes each Problem 1 figure was assigned to an object: p_a <- ggplot(...), and so on for p_b, p_c, p_d.
ggsave("figures/bmi-hist.png", p_a, width = 5, height = 4, dpi = 300)
ggsave("figures/bmi-hist.pdf", p_a, width = 5, height = 4)
ggsave("figures/glucose-bmi.png", p_b, width = 6, height = 4, dpi = 300)
ggsave("figures/glucose-bmi.pdf", p_b, width = 6, height = 4)
ggsave("figures/sbp-box.png", p_c, width = 6, height = 4, dpi = 300)
ggsave("figures/sbp-box.pdf", p_c, width = 6, height = 4)
ggsave("figures/race-bar.png", p_d, width = 6, height = 4, dpi = 300)
ggsave("figures/race-bar.pdf", p_d, width = 6, height = 4)
3.6 What’s next
Day 4 covers writing functions, control flow, and basic applied statistics. We will revisit the cohort dataset from Days 2 and 3 and run statistical tests on it.