1 Day 1: Setup and First Steps in R
1.1 Learning objectives
By the end of this day you should be able to:
- Install R and RStudio on macOS, Windows, or Linux.
- Distinguish the console (interactive) from a script (saved code).
- Create and assign variables; perform arithmetic.
- Recognise and work with R’s core data types: numeric, integer, character, logical, factor.
- Index vectors by position, by name, and by logical condition.
- Recognise missing values (NA) and avoid the common pitfalls.
1.2 Lecture
R is a programming language built specifically for statistical computing. You will spend the next few years of your graduate career running R in some form every day. This first hour gets you to the point where R is installed, you can type expressions and have them evaluated, and you understand what R is doing with the characters you type.
1.2.1 Installing R and RStudio
Two pieces of software, in this order.
R itself is the language and the engine. Download from https://cran.r-project.org/ and run the installer for your operating system. R 4.5+ is current as of writing; any 4.x release is fine for this book.
RStudio is the integrated development environment (IDE). It is the application you actually open every day; it talks to R behind the scenes. Download from https://posit.co/download/rstudio-desktop/ and install.
When you open RStudio for the first time you see four panes:
- Source (top-left): your code, saved in .R files.
- Console (bottom-left): the live R session.
- Environment / History (top-right): the objects you have created, and a log of recent commands.
- Files / Plots / Help / Packages (bottom-right): filesystem, figures you have produced, help pages, installed packages.
You will use the source pane (writing scripts) and the console pane (running them) most of the time.
1.2.2 The console: arithmetic and assignment
Click in the Console pane (bottom-left). You see a prompt (>). Type:
2 + 2
#> [1] 4
R evaluated 2 + 2 and printed 4. The [1] means ‘the first element of the result is printed here’, which matters when results have many elements.
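To see why the bracketed index is useful, print a result with many elements. The following is a small sketch; the name many_values is purely illustrative, and where the output wraps depends on the width of your console, so no output is shown here.

```r
# 101:130 is shorthand for the integers 101 through 130.
many_values <- 101:130

# Printing a long vector wraps over several lines. Each printed line
# begins with the position of its first element: [1] on the first
# line, and a larger index on each later line.
many_values

# The vector holds 30 elements in total.
length(many_values)
```

If you resize the Console pane and print many_values again, the bracketed indices change with the wrapping, but the values do not.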
The arithmetic operators are what you expect: +, -, *, /, ^ for exponentiation, %% for modulo (remainder).
17 %% 5
#> [1] 2
To save a value for later, assign it to a name. The R convention is to use <- (left-arrow):
x <- 17
x
#> [1] 17
x * 2
#> [1] 34
Names can use letters, digits, dot, and underscore, but must start with a letter or dot. The convention in modern R is snake_case:
patient_age <- 47
mean_blood_pressure <- 128.5
1.2.3 Vectors
The fundamental data structure in R is the vector: an ordered collection of values of the same type. Create one with c() (combine):
ages <- c(24, 47, 31, 62, 19, 55)
ages
#> [1] 24 47 31 62 19 55
length(ages)
#> [1] 6
Arithmetic on a vector applies element-wise:
ages_in_months <- ages * 12
ages_in_months
#> [1] 288 564 372 744 228 660
Many R functions take a vector and return either a vector of the same length or a single summary value:
mean(ages)
#> [1] 39.66667
sd(ages)
#> [1] 17.52332
range(ages)
#> [1] 19 62
This vectorised behaviour is the single most important idiom in R. You almost never write loops to apply an operation element-by-element; R does it for you.
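To make the contrast concrete, here is a minimal sketch that computes ages in months twice: once with an explicit loop and once with the vectorised expression. The names months_loop and months_vec are illustrative only, and the for loop is shown purely for comparison.

```r
ages <- c(24, 47, 31, 62, 19, 55)

# Loop version: create an empty result, then fill it one element at a time.
months_loop <- numeric(length(ages))
for (i in seq_along(ages)) {
  months_loop[i] <- ages[i] * 12
}

# Vectorised version: one expression handles every element.
months_vec <- ages * 12

# Both approaches produce exactly the same vector.
identical(months_loop, months_vec)
#> [1] TRUE
```

The vectorised line is shorter, harder to get wrong, and usually faster.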
1.2.4 Data types
Every value in R has a type. The five you will meet this week:
typeof(42L) # integer (note the L)
#> [1] "integer"
typeof(3.14) # double-precision floating-point
#> [1] "double"
typeof("hello") # character (string)
#> [1] "character"
typeof(TRUE) # logical (boolean)
#> [1] "logical"
typeof(factor(c("low", "high"))) # factor
#> [1] "integer"
A factor is R’s way of representing a categorical variable. Internally it is an integer vector with a set of labels (its levels) attached, which is why typeof() reports "integer". You will encounter factors immediately when reading data; treat them as categorical for now.
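The relationship between a factor's labels and its integer codes can be sketched as follows. The severity ratings are hypothetical values invented for illustration, and the levels argument is used here to fix the order of the categories.

```r
# Hypothetical severity ratings for five patients.
severity <- factor(c("low", "high", "medium", "low", "high"),
                   levels = c("low", "medium", "high"))

class(severity)       # how R treats the object: "factor"
typeof(severity)      # how it is stored: "integer"
levels(severity)      # the labels, in the order given above
as.integer(severity)  # the underlying codes: 1 3 2 1 3
table(severity)       # number of patients in each category
```

Without the levels argument, R orders the levels alphabetically ("high", "low", "medium"), which is rarely the order you want for an ordered scale.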
R coerces between types automatically when it can:
TRUE + 1
#> [1] 2 # TRUE is treated as 1
"5" + 1
#> Error: non-numeric argument to binary operator
as.numeric("5") + 1
#> [1] 6
Coercion is the source of many beginner bugs. When in doubt, check the type with typeof() or class().
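One place where coercion happens without any message is inside c(). The sketch below shows how mixed inputs are promoted to the most general type, and how to convert numbers stored as text. The doses_text values are hypothetical.

```r
# c() promotes mixed inputs to the most general type present:
# logical -> integer -> double -> character.
typeof(c(TRUE, 2L))   # "integer"
typeof(c(2L, 3.5))    # "double"
typeof(c(3.5, "a"))   # "character"

# Numbers stored as text must be converted before arithmetic.
doses_text <- c("5", "10", "2.5")   # hypothetical values
typeof(doses_text)                  # "character"
doses <- as.numeric(doses_text)
sum(doses)                          # 17.5

# Text that is not a number converts to NA (R's missing value) and
# triggers a warning; suppressWarnings() hides the warning here.
suppressWarnings(as.numeric("ten"))
```

A column that should be numeric but reports "character" usually contains at least one stray non-numeric entry.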
1.2.5 Indexing vectors
Three ways to extract elements from a vector.
By position (integer index, starting at 1):
ages <- c(24, 47, 31, 62, 19, 55)
ages[1]
#> [1] 24
ages[c(1, 3, 5)]
#> [1] 24 31 19
ages[-1] # all but the first
#> [1] 47 31 62 19 55
By logical (a vector of TRUE/FALSE of the same length):
adults <- ages >= 18
adults
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE
ages[ages >= 50]
#> [1] 62 55
By name (when the vector has names):
biomarkers <- c(hba1c = 7.2, ldl = 145, sbp = 132)
biomarkers["ldl"]
#> ldl
#> 145
Logical indexing is the workhorse of data manipulation in R. Read it as ‘keep the elements where the condition is TRUE’.
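Conditions can be combined before indexing. The sketch below uses the element-wise operators & (and) and | (or), plus which() and sum(), none of which appear above; it is meant as an illustration rather than a complete tour.

```r
ages <- c(24, 47, 31, 62, 19, 55)

# & means "and": both conditions must hold.
ages[ages >= 30 & ages < 60]
#> [1] 47 31 55

# | means "or": at least one condition must hold.
ages[ages < 20 | ages > 60]
#> [1] 62 19

# which() returns the positions where the condition is TRUE.
which(ages >= 50)
#> [1] 4 6

# sum() of a logical vector counts the TRUE values.
sum(ages >= 50)
#> [1] 2
```

Selected elements keep their original order, which is why 62 appears before 19 in the second result.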
1.2.6 Missing values
Real biomedical data has missing values. R represents them as NA (Not Available):
labs <- c(7.2, NA, 6.8, 8.1, NA)
labs
#> [1] 7.2 NA 6.8 8.1 NA
mean(labs)
#> [1] NA
mean(labs, na.rm = TRUE)
#> [1] 7.366667
mean() returns NA by default when any input is NA. Adding na.rm = TRUE removes the missing values before computing. Many summary functions (sum(), sd(), median(), max(), and others) follow this convention; it is deliberate, forcing you to decide how to handle the missingness rather than silently ignoring it.
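The same na.rm argument works in other summary functions. A brief sketch using the labs vector from above:

```r
labs <- c(7.2, NA, 6.8, 8.1, NA)

sum(labs)                    # NA: one unknown makes the total unknown
sum(labs, na.rm = TRUE)      # total of the three observed values
max(labs, na.rm = TRUE)      # largest observed value: 8.1
median(labs, na.rm = TRUE)   # middle observed value: 7.2
```

Note that na.rm = TRUE changes the effective sample size: these summaries describe three patients, not five.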
Test for missingness with is.na():
is.na(labs)
#> [1] FALSE TRUE FALSE FALSE TRUE
labs[!is.na(labs)]
#> [1] 7.2 6.8 8.1
Do not test for NA with ==:
labs == NA
#> [1] NA NA NA NA NA
Any comparison with NA is itself NA. Even NA == NA is NA, because the result of comparing an unknown to an unknown is itself unknown.
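The same logic applies to arithmetic and to every other comparison. A short sketch of how NA propagates; anyNA() is a base R helper not mentioned above:

```r
NA + 1     # arithmetic with an unknown is unknown: NA
NA > 5     # so is a comparison: NA
NA == NA   # and so is comparing two unknowns: NA

labs <- c(7.2, NA, 6.8, 8.1, NA)
labs > 7      # TRUE NA FALSE TRUE NA: the missing positions stay NA
anyNA(labs)   # TRUE: at least one value is missing
```

Because NA spreads through calculations this way, it is worth checking for missing values before you start computing.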
1.3 Worked example: a small biomedical dataset
You see ten patients in a clinic. Their ages and systolic blood pressures (SBP) are recorded.
patient_id <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10)
ages <- c(34, 67, 52, 41, 58, 23, 71, 49, NA, 38)
sbp <- c(122, 148, 135, 128, 142, 118, 156, 130,
         139, 124)
Sample size, mean age (handling the missing), proportion hypertensive (SBP >= 140):
length(patient_id)
#> [1] 10
mean(ages, na.rm = TRUE)
#> [1] 48.11111
mean(sbp >= 140)
#> [1] 0.3
Identify the hypertensive patients and report their IDs and ages:
hypertensive <- sbp >= 140
hypertensive
#> [1] FALSE TRUE FALSE FALSE TRUE FALSE TRUE FALSE FALSE FALSE
patient_id[hypertensive]
#> [1] 2 5 7
ages[hypertensive]
#> [1] 67 58 71
In a handful of lines you have done what an analyst might otherwise do by hand: counted the cohort, computed summary statistics with explicit missing-data handling, and identified the subset of clinical interest. This is the core rhythm of R for biostatistics.
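One pitfall is worth seeing with this dataset: when the condition itself contains an NA, logical indexing returns an NA in that position. The sketch below selects SBP for patients aged 50 or over; the name older and the cut-off of 50 are illustrative choices, not part of the example above.

```r
ages <- c(34, 67, 52, 41, 58, 23, 71, 49, NA, 38)
sbp  <- c(122, 148, 135, 128, 142, 118, 156, 130, 139, 124)

older <- ages >= 50   # NA for patient 9, whose age is unknown
sbp[older]            # the result carries an NA for that patient
#> [1] 148 135 142 156  NA

# Option 1: which() keeps only the positions that are definitely TRUE.
sbp[which(older)]
#> [1] 148 135 142 156

# Option 2: state the exclusion explicitly in the condition.
mean(sbp[!is.na(ages) & ages >= 50])
#> [1] 145.25
```

Either option works; the second has the advantage that the handling of the unknown age is visible in the code.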
1.4 Homework
Each problem is solvable with what was covered in the lecture. Attempt each one in your R session before checking the solution.
Problem 1. Reproduce. In your R session, type all of the examples from the ‘Worked example’ section. Confirm you get the same output.
Problem 2. Five small expressions. Write R expressions that compute each of these:
- The cube root of 27.
- The remainder when 1000 is divided by 7.
- The mean of the integers 1 through 100.
- The number of values in c(2, 4, 6, 8, 10) that are greater than 5.
- The maximum minus the minimum of c(11, 4, 19, 7, 2).
Problem 3. Inspect three suspicious expressions. For each, run it in R and explain what happens. If it errors, explain why; if it runs, explain whether the result is what a careful R user would write.
- mean(c(1, 2, 3, "4"))
- c(1, 2, 3) + c(10, 20)
- x = 5; x ** 2
Problem 4. Index by group. Given ages <- c(8, 22, 47, 65, 13, 51, 73, 29, 17, 55), compute the mean age in three groups: children (under 18), working-age adults (18-64 inclusive), seniors (65+). Use logical indexing.
Problem 5. Missing values. Given labs <- c(7.2, NA, 6.8, 8.1, NA, 5.9, 7.4, NA):
- How many values are missing?
- What is the mean of the non-missing values?
- Replace each NA with the mean of the non-missing values, producing a vector of length 8 with no NAs.
1.5 Solutions
Problem 1. Each line in the worked example should produce the output shown. If yours differs, common causes are: typo (check parentheses and commas), wrong type (e.g., quotes around numbers), or a left-over object from earlier in the session. Restart R if the session is in a strange state.
Problem 2.
# (a)
27 ^ (1/3)
#> [1] 3
# (b)
1000 %% 7
#> [1] 6
# (c)
mean(1:100)
#> [1] 50.5
# (d)
sum(c(2, 4, 6, 8, 10) > 5)
#> [1] 3
# (e)
max(c(11, 4, 19, 7, 2)) - min(c(11, 4, 19, 7, 2))
#> [1] 17
# (or, more cleanly:)
diff(range(c(11, 4, 19, 7, 2)))
#> [1] 17
Problem 3.
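mean(c(1, 2, 3, "4")). The "4" (in quotes) is a character; c() coerces the whole vector to character because mixed types must collapse to one type. mean() on a character vector does not stop with an error: it returns NA with a warning that the argument is not numeric or logical. Fix: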
mean(c(1, 2, 3, 4))
#> [1] 2.5
c(1, 2, 3) + c(10, 20). R recycles the shorter vector, so the result is 11 22 13, and it warns because the longer length is not a multiple of the shorter. Fix to matching lengths:
c(1, 2, 3) + c(10, 20, 30)
#> [1] 11 22 33
x = 5; x ** 2. The = is allowed for assignment, but the convention in R is <-. R’s parser silently rewrites ** to ^ (see ?Arithmetic), so this expression returns 25. The expression is not strictly broken, but ** is not idiomatic in R and relies on a parser quirk that is easy to forget; prefer ^ explicitly:
x <- 5
x ^ 2
#> [1] 25
Problem 4.
ages <- c(8, 22, 47, 65, 13, 51, 73, 29, 17, 55)
mean(ages[ages < 18]) # children
#> [1] 12.66667
mean(ages[ages >= 18 & ages < 65]) # working-age
#> [1] 40.8
mean(ages[ages >= 65]) # seniors
#> [1] 69
Problem 5.
labs <- c(7.2, NA, 6.8, 8.1, NA, 5.9, 7.4, NA)
# (a)
sum(is.na(labs))
#> [1] 3
# (b)
mean(labs, na.rm = TRUE)
#> [1] 7.08
# (c)
labs[is.na(labs)] <- mean(labs, na.rm = TRUE)
labs
#> [1] 7.20 7.08 6.80 8.10 7.08 5.90 7.40 7.08
Note: imputing missing values with the mean is sometimes acceptable for exploratory work but can bias inferences in a real analysis. Day 4 introduces statistical tests; the Practicum and Applied Methods volumes treat missing-data handling at the level you will need for publishable work.
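One way to see the problem is to compare the spread of the data before and after imputation. This is a minimal sketch using the labs vector from Problem 5; the names observed and imputed are illustrative, and it shows only one of the distortions mean imputation can introduce.

```r
labs <- c(7.2, NA, 6.8, 8.1, NA, 5.9, 7.4, NA)

observed <- labs[!is.na(labs)]             # the five measured values
imputed <- labs
imputed[is.na(imputed)] <- mean(observed)  # fill each gap with the mean

mean(observed)   # 7.08
mean(imputed)    # also 7.08: the mean is unchanged by construction
sd(observed)     # spread of the measured values
sd(imputed)      # smaller: the filled-in values add no variation
```

The imputed vector looks like eight measurements with less variability than the five real ones, so anything computed from its standard deviation will appear more precise than the data justify.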
1.6 What’s next
Day 2 covers reading real data into R and manipulating it with the dplyr verbs. Bring your laptop with R and RStudio installed and tested.