2  Session 2: Working with data

In this session we will learn how to manipulate and summarise data using the dplyr package (with a little help from the tidyr package too).

Learning Objectives

At the end of this session, learners should be able to:

  1. Use the pipe (%>%) to chain multiple functions together

  2. Design chains of dplyr functions to manipulate data frames

  3. Understand how to identify and handle missing values in a data frame

  4. Apply grouping for more complex analysis of data

  5. Recall how to save data frames to a file

Both dplyr and tidyr are contained within the tidyverse (along with readr) so we can load all of these packages at once using library(tidyverse):

# don't forget to load tidyverse!
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

2.1 Chaining functions together with pipes

Pipes are a powerful feature of the tidyverse that allow you to chain multiple functions together. Pipes are useful because they allow you to break down complex operations into smaller steps that are easier to read and understand.

For example, take the following code:

my_vector <- c(1, 2, 3, 4, 5)
as.character(round(mean(my_vector)))
[1] "3"

What do you think this code does? It calculates the mean of my_vector, rounds the result to the nearest whole number, and then converts the result to a character. But the code is a bit hard to read because you have to start from the inside of the brackets and work your way out.

Instead, we can use the pipe operator (%>%) to chain these functions together in a more readable way:

my_vector <- c(1, 2, 3, 4, 5)
my_vector %>% mean() %>% round() %>% as.character()
[1] "3"

See how the code reads naturally from left to right? You can think of the pipe as being like the phrase “and then”. Here, we’re telling R: “Take my_vector, and then calculate the mean, and then round the result, and then convert it to a character.”

You’ll notice that we didn’t need to specify the input to each function. That’s because the pipe automatically passes the output of the previous function as the first input to the next function. We can still specify additional arguments to each function if we need to. For example, if we wanted to round the mean to 2 decimal places, we could do this:

my_vector %>% mean() %>% round(digits = 2) %>% as.character()
[1] "3"

R is clever enough to know that the first argument to round() is still the output of mean(), even though we’ve now specified the digits argument.

Plenty of pipes

There is another style of pipe in R, called the ‘base R pipe’ |>, which is available in R version 4.1.0 and later. The base R pipe works in a similar way to the magrittr pipe (%>%) that we use in this course, but it is not as flexible. We recommend using the magrittr pipe for now.

Fun fact: the magrittr package is named after the artist René Magritte, who made a famous painting of a pipe.

To type the pipe operator more easily, you can use the keyboard shortcut Cmd-shift-M (although once you get used to it, you might find it easier to type %>% manually).

Practice exercises

Try these practice questions to test your understanding

1. What is NOT a valid way to re-write the following code using the pipe operator: round(sqrt(sum(1:10)), 1). If you’re not sure, try running the different options in the console to see which one gives the same answer.

1:10 %>% sum() %>% sqrt() %>% round(1)

sum(1:10) %>% sqrt(1) %>% round()

1:10 %>% sum() %>% sqrt() %>% round(digits = 1)

sum(1:10) %>% sqrt() %>% round(digits = 1)

2. What is the output of the following code? letters %>% head() %>% toupper() Try to guess it before copy-pasting into R.

"A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R" "S" "T" "U" "V" "W" "X" "Y" "Z"

"a" "b" "c" "d" "e" "f"

An error

"A" "B" "C" "D" "E" "F"

Solutions

  1. The invalid option is sum(1:10) %>% sqrt(1) %>% round(). This is because the sqrt() function only takes one argument, so you can’t specify 1 as an argument in addition to what is being piped in from sum(1:10). Note that some options used the pipe to send 1:10 to sum() (like 1:10 %>% sum()), and others just used sum(1:10) directly. Both are valid ways to use the pipe, it’s just a matter of personal preference.
  2. The output of the code letters %>% head() %>% toupper() is "A" "B" "C" "D" "E" "F". The letters vector contains the lowercase alphabet, and the head() function returns the first 6 elements of the vector. Finally, the toupper() function then converts these elements to uppercase.

2.2 Basic data manipulation

To really see the power of the pipe, we will use it together with the dplyr package that provides a set of functions to easily filter, sort, select, and summarise data frames. These functions are designed to work well with the pipe, so you can chain them together to create complex data manipulations in a readable format.

For example, even though we haven’t covered the dplyr functions yet, you can probably guess what the following code does:

# use the pipe to chain together our data manipulation steps
mousezempic_dosage_data %>%
  filter(cage_number == "3E") %>%
  pull(weight_lost_g) %>%
  mean()

This code filters the mousezempic_dosage_data data frame to only include data from cage 3E, then pulls out the weight_lost_g column, and finally calculates the mean of the values in that column. The first argument to each function is the output of the previous function, and any additional arguments (like the column name in pull()) are specified in the brackets (like round(digits = 2) from the previous example).

We also used the enter key after each pipe %>% to break up the code into multiple lines to make it easier to read. This isn’t required, but is a popular style in the R community, so all the code examples in this session will follow this format.

We will now introduce some of the most commonly used dplyr functions for manipulating data frames. To showcase these, we will use the mousezempic_dosage_data that we practiced reading in last session. This imaginary dataset contains information on the weight lost by different strains of mice after being treated with different doses of MouseZempic®.

# read in the data, like we did in session 1
mousezempic_dosage_data <- read_delim("~/Desktop/R_course/data/mousezempic_dosage_data.csv")
Rows: 344 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (4): mouse_strain, cage_number, replicate, sex
dbl (5): weight_lost_g, drug_dose_g, tail_length_mm, initial_weight_g, id_num

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Before we start, let’s use what we learned in the previous session to take a look at mousezempic_dosage_data:

# it's a tibble, so prints nicely
mousezempic_dosage_data
# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                  NA    rep1      <NA>      NA      
 5 CD-1         1A                   3.45 rep1      female     0.00193
 6 CD-1         1A                   3.65 rep1      male       0.0019 
 7 CD-1         1A                   3.62 rep1      female     0.00181
 8 CD-1         1A                   4.68 rep1      male       0.00195
 9 CD-1         1A                   3.48 rep1      <NA>       0.00193
10 CD-1         1A                   4.25 rep1      <NA>       0.0019 
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

You might also like to use View() to open the data in a separate window and get a closer look.

Using RStudio autocomplete

Although it’s great to give our data a descriptive name like mousezempic_dosage_data, it can be a bit of a pain to type out every time. Luckily, RStudio has a handy autocomplete feature that can solve this problem. Just start typing the name of the object, and you’ll see it will popup:

RStudio autocomplete

You can then press Tab to autocomplete it. If there are multiple objects that start with the same letters, you can use the arrow keys to cycle through the options.

Try using autocomplete this session to save yourself some typing!

2.2.1 Sorting data

Often, one of the first things you might want to do with a dataset is sort it. In dplyr, this is called ‘arranging’ and is done with the arrange() function.

Arrange orders rows by their values in one or more columns

By default, arrange() sorts in ascending order (smallest values first). For example, let’s sort the mousezempic_dosage_data data frame by the weight_lost_g column:

mousezempic_dosage_data %>%
  arrange(weight_lost_g)
# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 BALB C       2B                   2.7  rep2      female     0.00192
 2 CD-1         3E                   2.85 rep2      female     0.00181
 3 CD-1         3E                   2.85 rep2      female     0.00184
 4 CD-1         3E                   2.9  rep2      female     0.00187
 5 CD-1         2B                   2.9  rep2      female     0.00178
 6 CD-1         1A                   2.9  rep3      female     0.00188
 7 BALB C       2B                   2.9  rep1      female     0.00187
 8 CD-1         3E                   2.92 rep3      female     0.00193
 9 CD-1         2B                   2.98 rep1      <NA>       0.00179
10 CD-1         2B                   3    rep1      female     0.00185
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

If we compare this to when we just printed our data above, we can see that the rows are now sorted so that the mice that lost the least weight are at the top.

Sometimes you might want to sort in descending order instead (largest values first). You can do this by putting the desc() function around your column name, inside arrange():

mousezempic_dosage_data %>%
  # put desc() around the column name to sort in descending order
  arrange(desc(weight_lost_g))
# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
 1 Black 6      3E                   6.3  rep1      male      0.00221
 2 Black 6      3E                   6.05 rep1      male      0.0023 
 3 Black 6      3E                   6    rep2      male      0.0022 
 4 Black 6      3E                   6    rep3      male      0.00222
 5 Black 6      3E                   5.95 rep2      male      0.00223
 6 Black 6      3E                   5.95 rep3      male      0.00229
 7 Black 6      3E                   5.85 rep1      male      0.00213
 8 Black 6      3E                   5.85 rep1      male      0.00217
 9 Black 6      3E                   5.85 rep3      male      0.0023 
10 Black 6      3E                   5.8  rep2      male      0.00229
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Now we can see the mice that lost the most weight are at the top.

Comments and pipes

Notice how in the previous example we have written a comment in the middle of the pipe chain. This is a good practice to help you remember what each step is doing, especially when you have a long chain of functions, and won’t cause any errors as long as you make sure that the comment is on its own line.

You can also write comments at the end of the line, just make sure it’s after the pipe operator %>%.

For example, these comments are allowed:

mousezempic_dosage_data %>% # a comment here is fine
  # a comment here is fine
  arrange(desc(weight_lost_g))
# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
 1 Black 6      3E                   6.3  rep1      male      0.00221
 2 Black 6      3E                   6.05 rep1      male      0.0023 
 3 Black 6      3E                   6    rep2      male      0.0022 
 4 Black 6      3E                   6    rep3      male      0.00222
 5 Black 6      3E                   5.95 rep2      male      0.00223
 6 Black 6      3E                   5.95 rep3      male      0.00229
 7 Black 6      3E                   5.85 rep1      male      0.00213
 8 Black 6      3E                   5.85 rep1      male      0.00217
 9 Black 6      3E                   5.85 rep3      male      0.0023 
10 Black 6      3E                   5.8  rep2      male      0.00229
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

But this will cause an error, because the # is before the pipe, so R treats it as part of the comment (notice how the %>% has changed colour?) and doesn’t know how the two lines relate to each other. It tries to run them separately, which for the first line is ok (it will just print mousezempic_dosage_data):

mousezempic_dosage_data # this comment will cause an error %>%
# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                  NA    rep1      <NA>      NA      
 5 CD-1         1A                   3.45 rep1      female     0.00193
 6 CD-1         1A                   3.65 rep1      male       0.0019 
 7 CD-1         1A                   3.62 rep1      female     0.00181
 8 CD-1         1A                   4.68 rep1      male       0.00195
 9 CD-1         1A                   3.48 rep1      <NA>       0.00193
10 CD-1         1A                   4.25 rep1      <NA>       0.0019 
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>
  arrange(desc(weight_lost_g))
Error in eval(expr, envir, enclos): object 'weight_lost_g' not found

But for the second line, there is an error that R doesn’t know what the weight_lost_g object is. That’s because it’s a column in the mousezempic_dosage_data data frame, so R only knows what it is in the context of the pipe chain containing that data frame.

You can also sort by multiple columns by passing multiple column names to arrange(). For example, to sort by the strain first and then by the amount of weight lost:

# sort by strain first, then by weight lost
mousezempic_dosage_data %>%
  arrange(mouse_strain, weight_lost_g)
# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 BALB C       2B                   2.7  rep2      female     0.00192
 2 BALB C       2B                   2.9  rep1      female     0.00187
 3 BALB C       2B                   3.2  rep2      female     0.00187
 4 BALB C       2B                   3.25 rep1      female     0.00178
 5 BALB C       2B                   3.25 rep3      male       0.00187
 6 BALB C       2B                   3.25 rep3      female     0.00191
 7 BALB C       2B                   3.3  rep1      male       0.00197
 8 BALB C       2B                   3.3  rep1      female     0.00195
 9 BALB C       2B                   3.32 rep3      female     0.00199
10 BALB C       2B                   3.35 rep2      female     0.00187
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

This will sort the data frame by strain (according to alphabetical order, as it is a character column), and within each strain, they are then sorted by the amount of weight lost.

Piping into View()

In the above example, we sorted the data by strain and then by weight lost, but because there are so many mice in each strain, the preview shown in our console doesn’t allow us to see the full effect of the sorting.

One handy trick you can use with pipes is to add View() at the end of your chain to open the data in a separate window. Try running this code, and you’ll be able to scroll through the full dataset to check that the other mouse strains have also been sorted correctly:

# sort by strain first, then by weight lost
mousezempic_dosage_data %>%
  arrange(mouse_strain, weight_lost_g) %>%
  View()

This is a great way to check that your code has actually done what you intended!

2.2.1.1 Extracting rows with the smallest or largest values

Slice functions are used to select rows based on their position in the data frame. The slice_min() and slice_max() functions are particularly useful, because they allow you to select the rows with the smallest or largest values in a particular column.

This is equivalent to using arrange() followed by head(), but is more concise:

# get the 10 mice with the lowest drug dose
mousezempic_dosage_data %>%
  # slice_min() requires the column to sort by, and n = the number of rows to keep
  slice_min(drug_dose_g, n = 10)
# A tibble: 13 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         3E                   3.15 rep1      female     0.00172
 2 CD-1         3E                   3.4  rep1      female     0.00174
 3 CD-1         1A                   3.45 rep3      female     0.00176
 4 CD-1         2B                   3.25 rep1      female     0.00178
 5 CD-1         2B                   3.9  rep1      male       0.00178
 6 CD-1         2B                   2.9  rep2      female     0.00178
 7 BALB C       2B                   3.25 rep1      female     0.00178
 8 CD-1         2B                   2.98 rep1      <NA>       0.00179
 9 CD-1         1A                   3.7  rep1      <NA>       0.0018 
10 CD-1         3E                   3.6  rep1      male       0.0018 
11 CD-1         3E                   3.8  rep1      male       0.0018 
12 CD-1         3E                   3.95 rep1      male       0.0018 
13 CD-1         2B                   3.55 rep1      female     0.0018 
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>
# get the top 5 mice that lost the most weight
mousezempic_dosage_data %>%
  # slice_max() has the same arguments as slice_min()
  slice_max(weight_lost_g, n = 5)
# A tibble: 6 × 9
  mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
  <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
1 Black 6      3E                   6.3  rep1      male      0.00221
2 Black 6      3E                   6.05 rep1      male      0.0023 
3 Black 6      3E                   6    rep2      male      0.0022 
4 Black 6      3E                   6    rep3      male      0.00222
5 Black 6      3E                   5.95 rep2      male      0.00223
6 Black 6      3E                   5.95 rep3      male      0.00229
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

But wait— neither of those pieces of code actually gave the number of rows we asked for! In the first example, we asked for the 10 mice with the lowest drug dose, but we got 13. And in the second example, we asked for the top 5 mice that lost the most weight, but we got 6. Why aren’t the slice_ functions behaving as expected?

If we take a look at the help page (type ?slice_min in the console), we learn that slice_min() and slice_max() have an argument called with_ties that is set to TRUE by default. If we want to make sure we only get the number of rows we asked for, we would have to set it to FALSE, like so:

# get the top 5 mice that lost the most weight
mousezempic_dosage_data %>%
  # no ties allowed!
  slice_max(weight_lost_g, n = 5, with_ties = FALSE)
# A tibble: 5 × 9
  mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
  <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
1 Black 6      3E                   6.3  rep1      male      0.00221
2 Black 6      3E                   6.05 rep1      male      0.0023 
3 Black 6      3E                   6    rep2      male      0.0022 
4 Black 6      3E                   6    rep3      male      0.00222
5 Black 6      3E                   5.95 rep2      male      0.00223
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

This is an important lesson: sometimes functions will behave in a way that is unexpected, and you might need to read their help page or use other guides/google/AI to understand why.

Practice exercises

Try these practice questions to test your understanding

1. Which code would you use to sort the mousezempic_dosage_data data frame from biggest to smallest initial weight?

mousezempic_dosage_data %>% sort(initial_weight_g)

mousezempic_dosage_data %>% arrange(initial_weight_g)

mousezempic_dosage_data %>% sort(descending(initial_weight_g))

mousezempic_dosage_data %>% arrange(desc(initial_weight_g))

2. Which code would you use to extract the 3 mice with the highest initial weight from the mousezempic_dosage_data data frame?

mousezempic_dosage_data %>% slice_max(initial_weight_g, n = 3)

mousezempic_dosage_data %>% arrange(desc(initial_weight_g))

mousezempic_dosage_data %>% slice_min(initial_weight_g, n = 3)

mousezempic_dosage_data %>% arrange(initial_weight_g)

3. I’ve written the below code, but one of the comments is messing it up! Which one?

# comment A
mousezempic_dosage_data # comment B %>%
  # comment C
  slice_max(weight_lost_g, n = 5, with_ties = FALSE) # comment D

Comment A

Comment B

Comment C

Comment D

Solutions
  1. The correct code to sort the mousezempic_dosage_data data frame from biggest to smallest initial weight is mousezempic_dosage_data %>% arrange(desc(initial_weight_g)). The arrange() function is used to sort the data frame (although there is a sort() function in R, that’s not part of dplyr and won’t work the same way), and the desc() function is used to sort in descending order.

  2. The correct code to extract the 3 mice with the highest initial weight from the mousezempic_dosage_data data frame is mousezempic_dosage_data %>% slice_max(initial_weight_g, n = 3). The slice_max() function is used to select the rows with the largest values in the initial_weight_g column, and the n = 3 argument specifies that we want to keep 3 rows. The arrange() function is not needed in this case, because slice_max() will automatically sort the data frame by the specified column.

  3. The comment that is messing up the code is Comment B. The # symbol is before the pipe operator %>%, so R treats it as part of the comment and this breaks our chain of pipes. The other comments are fine, because they are either at the end of the line or on their own line. Basically, if a comment is changing the colour of the pipe operator (or any other bits of your code), it’s in the wrong place!

2.2.2 Filtering data (rows)

Filter allows you to filter rows using a logical test

In dplyr, the filter() function is used to subset rows based on their values. You provide a logical test, and filter() will keep the rows where the test is TRUE. We can write these tests using the comparison operators we learned in the previous session (e.g. ==, < and !=, see Section 1.3).

For example, to filter the mousezempic_dosage_data data frame to only include mice that lost more than 6g:

mousezempic_dosage_data %>%
  filter(weight_lost_g > 6)
# A tibble: 2 × 9
  mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
  <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
1 Black 6      3E                   6.3  rep1      male      0.00221
2 Black 6      3E                   6.05 rep1      male      0.0023 
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Or to only include mice from cage 3E:

mousezempic_dosage_data %>%
  # remember that == is used for testing equality
  filter(cage_number == "3E") # don't forget the quotes either!
# A tibble: 168 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         3E                   3.4  rep1      female     0.00174
 2 CD-1         3E                   3.6  rep1      male       0.0018 
 3 CD-1         3E                   3.8  rep1      female     0.00189
 4 CD-1         3E                   3.95 rep1      male       0.00185
 5 CD-1         3E                   3.8  rep1      male       0.0018 
 6 CD-1         3E                   3.8  rep1      female     0.00187
 7 CD-1         3E                   3.55 rep1      male       0.00183
 8 CD-1         3E                   3.2  rep1      female     0.00187
 9 CD-1         3E                   3.15 rep1      female     0.00172
10 CD-1         3E                   3.95 rep1      male       0.0018 
# ℹ 158 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

2.2.2.1 Combining logical tests

Sometimes we want to filter based on multiple conditions. Here we will show some more advanced operators that can be used to combine logical tests.

The & operator is used to combine two logical tests with an ‘and’ condition. For example, to filter the data frame to only include mice that have a tail length greater than 19mm and are female:

mousezempic_dosage_data %>%
  filter(tail_length_mm > 19 & sex == "female")
# A tibble: 5 × 9
  mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
  <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
1 CD-1         1A                   3.45 rep1      female     0.00193
2 CD-1         3E                   3.8  rep1      female     0.00189
3 CD-1         2B                   3.3  rep1      female     0.00181
4 CD-1         3E                   3.9  rep3      female     0.00191
5 BALB C       2B                   3.52 rep3      female     0.00194
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

The | operator is used to combine two logical tests with an ‘or’ condition. For example, to filter the data frame to only include mice that have an initial weight less than 35g or a tail length less than 14mm:

mousezempic_dosage_data %>%
  filter(initial_weight_g < 35 | tail_length_mm < 14)
# A tibble: 30 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.48 rep1      <NA>       0.00193
 2 CD-1         1A                   4.4  rep1      male       0.00198
 3 CD-1         1A                   3.32 rep1      female     0.00184
 4 CD-1         3E                   2.9  rep2      female     0.00187
 5 CD-1         1A                   3.6  rep2      female     0.0019 
 6 CD-1         1A                   3.2  rep2      female     0.00189
 7 CD-1         2B                   3.4  rep2      female     0.00185
 8 CD-1         2B                   2.9  rep2      female     0.00178
 9 CD-1         2B                   3.05 rep3      female     0.00188
10 Black 6      3E                   4.5  rep1      female     0.00211
# ℹ 20 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

The %in% operator can be used to filter based on a vector of multiple values (c(x, y)). It’s particularly useful when you have a few character values you want to filter on, as it is shorter to type than | (or).

For example, to filter the data frame to only include mice from cages 3E or 1A, we could use | like this:

mousezempic_dosage_data %>%
  filter(cage_number == "3E" | cage_number == "1A")
# A tibble: 220 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                  NA    rep1      <NA>      NA      
 5 CD-1         1A                   3.45 rep1      female     0.00193
 6 CD-1         1A                   3.65 rep1      male       0.0019 
 7 CD-1         1A                   3.62 rep1      female     0.00181
 8 CD-1         1A                   4.68 rep1      male       0.00195
 9 CD-1         1A                   3.48 rep1      <NA>       0.00193
10 CD-1         1A                   4.25 rep1      <NA>       0.0019 
# ℹ 210 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Or we could use %in% like this:

mousezempic_dosage_data %>%
  filter(cage_number %in% c("3E", "1A"))
# A tibble: 220 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                  NA    rep1      <NA>      NA      
 5 CD-1         1A                   3.45 rep1      female     0.00193
 6 CD-1         1A                   3.65 rep1      male       0.0019 
 7 CD-1         1A                   3.62 rep1      female     0.00181
 8 CD-1         1A                   4.68 rep1      male       0.00195
 9 CD-1         1A                   3.48 rep1      <NA>       0.00193
10 CD-1         1A                   4.25 rep1      <NA>       0.0019 
# ℹ 210 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>
Practice exercises

Try these practice questions to test your understanding

1. Which code would you use to filter the mousezempic_dosage_data data frame to only include mice from replicate 2?

mousezempic_dosage_data %>% filter(replicate == 2)

mousezempic_dosage_data %>% filter(replicate == rep2)

mousezempic_dosage_data %>% filter(replicate == "rep2")

mousezempic_dosage_data %>% filter(replicate = "rep2")

2. What is NOT a valid way to filter the mousezempic_dosage_data data frame to only include mice that lost more than 4g, and have an initial weight less than 40g?

mousezempic_dosage_data %>% filter(weight_lost_g > 4) %>% filter(initial_weight_g < 40)

mousezempic_dosage_data %>% filter(weight_lost_g > 4) %>% (initial_weight_g < 40)

mousezempic_dosage_data %>% filter(weight_lost_g > 4 & initial_weight_g < 40)

mousezempic_dosage_data %>% filter(initial_weight_g < 40) %>% filter(weight_lost_g > 4)

3. Which option correctly describes what the following code is doing?

mousezempic_dosage_data %>%
  filter(mouse_strain %in% c("BALB C", "Black 6")) %>%
  filter(weight_lost_g > 3 & weight_lost_g < 5) %>%
  arrange(desc(drug_dose_g))

Filters the data frame to remove mice from the “BALB C” and “Black 6” strains, who only lost between 3 and 5g of weight, and then sorts the data frame by drug dose.

Filters the data frame to remove mice from the “BALB C” and “Black 6” strains, that lost between 3 and 5g of weight, and then sorts the data frame by drug dose in descending order.

Filters the data frame to only include mice from the “BALB C” and “Black 6” strains, that lost between 3 and 5g of weight, and then sorts the data frame by drug dose.

Filters the data frame to only include mice from the “BALB C” and “Black 6” strains, that lost between 3 and 5g of weight, and then sorts the data frame by drug dose in descending order.

Solutions
  1. The correct code to filter the mousezempic_dosage_data data frame to only include mice from replicate 2 is mousezempic_dosage_data %>% filter(replicate == "rep2"). Option A is incorrect because 2 is not a value of replicate (when filtering you need to know what values are actually in your columns! So make sure to View() your data first). Option B is incorrect because the replicate column is a character column, so you need to use quotes around the value you are filtering on. Option D is incorrect because = is not the correct way to test for equality, you need to use ==.

  2. The invalid option is mousezempic_dosage_data %>% filter(weight_lost_g > 4) %>% (initial_weight_g < 40). This is because the second filtering step is missing the name of the filter function, so R doesn’t know what to do with (initial_weight_g < 40). The other options are valid ways to filter the data frame based on the specified conditions; note that we can use multiple filter() functions in a row to apply multiple conditions, or the & operator to combine them into a single filter() function. It’s just a matter of personal preference.

  3. The correct description of the code is that it filters the data frame to only include mice from the “BALB C” and “Black 6” strains, then filters those further to only those that lost between 3 and 5g of weight, and finally sorts the data frame by drug dose in descending order.

2.2.3 Dealing with missing values

Missing values are a common problem in real-world datasets. In R, missing values are represented by NA. In fact, if you look at the mousezempic_dosage_data data frame we’ve been using, you’ll see that some of the cells contain NA: try spotting them with the View() function.

You can also find missing values in a data frame using the is.na() function in combination with filter(). For example, to find all the rows in the mousezempic_dosage_data data frame that have a missing value for the drug_dose_g column:

mousezempic_dosage_data %>%
  filter(is.na(drug_dose_g))
# A tibble: 2 × 9
  mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
  <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
1 CD-1         1A                     NA rep1      <NA>           NA
2 Black 6      3E                     NA rep3      <NA>           NA
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

The problem with missing values is that they can cause problems when you try to perform calculations on your data. For example, if you try to calculate the mean of a column that contains even a single missing value, the result will also be NA:

# try to calculate the mean of the drug_dose_g column
# remember from session 1 that we can use $ to access columns in a data frame
mousezempic_dosage_data$drug_dose_g %>% mean()
[1] NA

NA values in R are therefore referred to as ‘contagious’: if you put an NA in you usually get an NA out. If you think about it, that makes sense— when we don’t know the value of a particular mouse’s drug dose, how can we calculate the average? That missing value could be anything.

For this reason, it’s important to deal with missing values before performing calculations. Many functions in R will have an argument called na.rm that you can set to TRUE to remove missing values before performing the calculation. For example, to calculate the mean of the drug_dose_g column with the missing values excluded:

# try to calculate the mean of the drug_dose_g column
# remember from session 1 that we can use $ to access columns in a data frame
mousezempic_dosage_data$drug_dose_g %>% mean(na.rm = TRUE)
[1] 0.002009152

This time, the result is a number, because the missing values have been removed before the calculation.

But not all functions have an na.rm argument. In these cases, you can remove rows with missing values. This can be done for a single column, using the filter() function together with is.na():

# remove rows with missing values in the drug_dose_g column
mousezempic_dosage_data %>%
  # remember the ! means 'not', it negates the result of is.na()
  filter(!is.na(drug_dose_g))
# A tibble: 342 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                   3.45 rep1      female     0.00193
 5 CD-1         1A                   3.65 rep1      male       0.0019 
 6 CD-1         1A                   3.62 rep1      female     0.00181
 7 CD-1         1A                   4.68 rep1      male       0.00195
 8 CD-1         1A                   3.48 rep1      <NA>       0.00193
 9 CD-1         1A                   4.25 rep1      <NA>       0.0019 
10 CD-1         1A                   3.3  rep1      <NA>       0.00186
# ℹ 332 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Or, you can remove rows with missing values in any column using the na.omit() function:

# remove rows with missing values in any column
mousezempic_dosage_data %>%
  na.omit()
# A tibble: 333 × 9
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                   3.45 rep1      female     0.00193
 5 CD-1         1A                   3.65 rep1      male       0.0019 
 6 CD-1         1A                   3.62 rep1      female     0.00181
 7 CD-1         1A                   4.68 rep1      male       0.00195
 8 CD-1         1A                   3.2  rep1      female     0.00182
 9 CD-1         1A                   3.8  rep1      male       0.00191
10 CD-1         1A                   4.4  rep1      male       0.00198
# ℹ 323 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Sometimes, instead of removing rows with missing values, you might want to replace them with a specific value. This can be done using the replace_na() function from the tidyr package. replace_na() takes a list() which contains each of the column names you want to edit, and the value that should be used.

For example, to replace missing values in the weight_lost_g columns with 0, replace missing values in the sex column with ‘unknown’ and leave the rest of the data frame unchanged:

# replace missing values in the drug_dose_g column with 0
mousezempic_dosage_data %>%
  # here we need to provide the column_names = values_to_replace
  # this needs to be contained within a list()
  replace_na(list(weight_lost_g = 0, sex = "unknown"))
# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex     drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>         <dbl>
 1 CD-1         1A                   3.75 rep1      male        0.00181
 2 CD-1         1A                   3.8  rep1      female      0.00186
 3 CD-1         1A                   3.25 rep1      female      0.00195
 4 CD-1         1A                   0    rep1      unknown    NA      
 5 CD-1         1A                   3.45 rep1      female      0.00193
 6 CD-1         1A                   3.65 rep1      male        0.0019 
 7 CD-1         1A                   3.62 rep1      female      0.00181
 8 CD-1         1A                   4.68 rep1      male        0.00195
 9 CD-1         1A                   3.48 rep1      unknown     0.00193
10 CD-1         1A                   4.25 rep1      unknown     0.0019 
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

When deciding how to handle missing values, you might have prior knowledge that NA should be replaced with a specific value, or you might decide that removing rows with NA is the best approach for your analysis.

For example, maybe we knew that the mice were given a weight_lost_g of NA if they didn’t lose any weight, it would then make sense to replace those with 0 (as we did in the code above). However, if the drug_dose_g column was missing simply because the data was lost, we might choose to remove those rows entirely.

It’s important to think carefully about how missing values should be handled in your analysis.

Practice exercises

Try these practice questions to test your understanding

1. What would be the result of running this R code: mean(c(1, 2, 4, NA))

2.333333

0

NA

An error

2. Which line of code would you use to filter the mousezempic_dosage_data data frame to remove mice that have a missing value in the tail_length_mm column?

mousezempic_dosage_data %>% filter(tail_length_mm != NA)

mousezempic_dosage_data %>% filter(is.na(tail_length_mm))

mousezempic_dosage_data %>% na.omit()

mousezempic_dosage_data %>% filter(!is.na(tail_length_mm))

3. How would you replace missing values in the initial_weight_g column with the value 35?

mousezempic_dosage_data %>% replace_na(list(initial_weight_g = 35))

mousezempic_dosage_data %>% replace_na(initial_weight_g = 35)

mousezempic_dosage_data %>% replace_na(list(initial_weight_g == 35))

mousezempic_dosage_data %>% replace_na(35)

Solutions

  1. The result of running the code mean(c(1, 2, 4, NA)) is NA. This is because the NA value is ‘contagious’, so when you try to calculate the mean of a vector that contains an NA, the result will also be NA. If we wanted to calculate the mean of the vector without the NA, we would need to use the na.rm = TRUE argument.
  2. The correct line of code to filter the mousezempic_dosage_data data frame to remove mice that have a missing value in the tail_length_mm column is mousezempic_dosage_data %>% filter(!is.na(tail_length_mm)). The ! symbol is used to negate the result of is.na(), so we are filtering to keep the rows where tail_length_mm is not NA. We can’t use the first option with the != NA because NA is a special value in R that represents missing data, and it can’t be compared to anything, and the third option is incorrect because na.omit() removes entire rows with missing values, rather than just filtering based on a single column.
  3. The correct line of code to replace missing values in the initial_weight_g column with the value 35 is mousezempic_dosage_data %>% replace_na(list(initial_weight_g = 35)). The replace_na() function takes a list() that contains the column names you want to replace and the values you want to replace them with. We only need to use a single equal sign here as we’re not testing for equality, we’re assigning a value.

2.2.4 Selecting columns

Select allows you to select only certain columns

While filter() is used to subset rows, select() is used to subset columns. You can use select() to keep only the columns you’re interested in, or to drop columns you don’t need.

The select() function takes the names of the columns that you want to keep/remove (no vector notation c() or quotation marks "" necessary). For example, to select only the mouse_strain, initial_weight_g, and weight_lost_g columns from the mousezempic_dosage_data data frame:

mousezempic_dosage_data %>%
  select(mouse_strain, initial_weight_g, weight_lost_g)
# A tibble: 344 × 3
   mouse_strain initial_weight_g weight_lost_g
   <chr>                   <dbl>         <dbl>
 1 CD-1                     39.1          3.75
 2 CD-1                     39.5          3.8 
 3 CD-1                     40.3          3.25
 4 CD-1                     NA           NA   
 5 CD-1                     36.7          3.45
 6 CD-1                     39.3          3.65
 7 CD-1                     38.9          3.62
 8 CD-1                     39.2          4.68
 9 CD-1                     34.1          3.48
10 CD-1                     42            4.25
# ℹ 334 more rows

We can see that all the other columns have been removed from the data frame.

If you want to keep all columns except for a few, you can use - to drop columns. For example, to keep all columns except for cage_number and sex:

mousezempic_dosage_data %>%
  select(-cage_number, -sex)
# A tibble: 344 × 7
   mouse_strain weight_lost_g replicate drug_dose_g tail_length_mm
   <chr>                <dbl> <chr>           <dbl>          <dbl>
 1 CD-1                  3.75 rep1          0.00181           18.7
 2 CD-1                  3.8  rep1          0.00186           17.4
 3 CD-1                  3.25 rep1          0.00195           18  
 4 CD-1                 NA    rep1         NA                 NA  
 5 CD-1                  3.45 rep1          0.00193           19.3
 6 CD-1                  3.65 rep1          0.0019            20.6
 7 CD-1                  3.62 rep1          0.00181           17.8
 8 CD-1                  4.68 rep1          0.00195           19.6
 9 CD-1                  3.48 rep1          0.00193           18.1
10 CD-1                  4.25 rep1          0.0019            20.2
# ℹ 334 more rows
# ℹ 2 more variables: initial_weight_g <dbl>, id_num <dbl>

There are also some helper functions that can be used to select columns based on their names :

There are several helper functions that can be used with the select function
Function Description Example
starts_with() select column(s) that start with a certain string

select all columns starting with the letter i

select(starts_with("i"))

ends_with() select column(s) that end with a certain string

select all columns ending with _g

select(ends_with("_g"))

contains() select column(s) that contain a certain string

select all columns containing the word ‘weight’

select(contains("weight"))

You need to use quotation marks around the arguments in these helper functions, as they aren’t full column names, just strings of characters.

Try using these helper functions to select columns from the mousezempic_dosage_data data frame!

Reordering columns

Relocate allows you to move columns around We can reorder columns using the relocate() function, which works similarly to select() (except it just moves columns around rather than dropping/keeping them). For example, to move the sex column to before the cage_number column:

mousezempic_dosage_data %>%
  # first the name of the column to move, then where it should go
  relocate(sex, .before = cage_number)
# A tibble: 344 × 9
   mouse_strain sex    cage_number weight_lost_g replicate drug_dose_g
   <chr>        <chr>  <chr>               <dbl> <chr>           <dbl>
 1 CD-1         male   1A                   3.75 rep1          0.00181
 2 CD-1         female 1A                   3.8  rep1          0.00186
 3 CD-1         female 1A                   3.25 rep1          0.00195
 4 CD-1         <NA>   1A                  NA    rep1         NA      
 5 CD-1         female 1A                   3.45 rep1          0.00193
 6 CD-1         male   1A                   3.65 rep1          0.0019 
 7 CD-1         female 1A                   3.62 rep1          0.00181
 8 CD-1         male   1A                   4.68 rep1          0.00195
 9 CD-1         <NA>   1A                   3.48 rep1          0.00193
10 CD-1         <NA>   1A                   4.25 rep1          0.0019 
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>

Two useful helper functions here are the everything() and last_col() functions, which can be used to move columns to the start/end of the data frame.

# move id_num to the front
mousezempic_dosage_data %>%
  relocate(id_num, .before = everything()) # don't forget the brackets
# A tibble: 344 × 9
   id_num mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
    <dbl> <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1      1 CD-1         1A                   3.75 rep1      male       0.00181
 2      2 CD-1         1A                   3.8  rep1      female     0.00186
 3      3 CD-1         1A                   3.25 rep1      female     0.00195
 4      4 CD-1         1A                  NA    rep1      <NA>      NA      
 5      5 CD-1         1A                   3.45 rep1      female     0.00193
 6      6 CD-1         1A                   3.65 rep1      male       0.0019 
 7      7 CD-1         1A                   3.62 rep1      female     0.00181
 8      8 CD-1         1A                   4.68 rep1      male       0.00195
 9      9 CD-1         1A                   3.48 rep1      <NA>       0.00193
10     10 CD-1         1A                   4.25 rep1      <NA>       0.0019 
# ℹ 334 more rows
# ℹ 2 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>
# move mouse_strain to the end
mousezempic_dosage_data %>%
  relocate(mouse_strain, .after = last_col())
# A tibble: 344 × 9
   cage_number weight_lost_g replicate sex    drug_dose_g tail_length_mm
   <chr>               <dbl> <chr>     <chr>        <dbl>          <dbl>
 1 1A                   3.75 rep1      male       0.00181           18.7
 2 1A                   3.8  rep1      female     0.00186           17.4
 3 1A                   3.25 rep1      female     0.00195           18  
 4 1A                  NA    rep1      <NA>      NA                 NA  
 5 1A                   3.45 rep1      female     0.00193           19.3
 6 1A                   3.65 rep1      male       0.0019            20.6
 7 1A                   3.62 rep1      female     0.00181           17.8
 8 1A                   4.68 rep1      male       0.00195           19.6
 9 1A                   3.48 rep1      <NA>       0.00193           18.1
10 1A                   4.25 rep1      <NA>       0.0019            20.2
# ℹ 334 more rows
# ℹ 3 more variables: initial_weight_g <dbl>, id_num <dbl>, mouse_strain <chr>

Re-ordering columns isn’t necessary, but it makes it easier to see the data you’re most interested in within the console (since often not all of the columns will fit on the screen at once). For example, if we are doing a lot of computation on the initial_weight_g column, we’d probably like to have that near the start so we can easily check it.

Note that the output of the select() function is a new data frame, even if you only select a single column:

# select the mouse_strain column
mousezempic_dosage_data %>%
  select(mouse_strain) %>%
  # recall from session 1 that class() tells us the type of an object
  class()
[1] "tbl_df"     "tbl"        "data.frame"

Sometimes, we instead want to get the values of a column as a vector.

Pull allows you to pull acolumn out of a data frame as a vector

We can do this by using the pull() function, which extracts a single column from a data frame as a vector:

# get the mouse_strain column as a vector
mousezempic_dosage_data %>%
  pull(mouse_strain) %>%
  class()
[1] "character"

We can see that the class of the output is now a vector, rather than a data frame. This is important because some functions only accept vectors, not data frames, like mean() for example:

# this will give an error
mousezempic_dosage_data %>% select(initial_weight_g) %>% mean(na.rm = TRUE)
Warning in mean.default(., na.rm = TRUE): argument is not numeric or logical:
returning NA
[1] NA
# this will work
mousezempic_dosage_data %>% pull(initial_weight_g) %>% mean(na.rm = TRUE)
[1] 43.92193

Note how both times we used na.rm = TRUE to remove missing values before calculating the mean.

You might remember that we used the $ operator in the previous session to extract a single column from a data frame, so why use pull() instead? The main reason is that pull() works within a chain of pipes, whereas $ doesn’t.

For example, let’s say we want to know the average initial weight of mice that lost at least 4g. We can do this by chaining filter() and pull() together:

mousezempic_dosage_data %>%
  # filter to mice that lost at least 4g
  filter(weight_lost_g >= 4) %>%
  # get the initial_weight_g column as a vector
  pull(initial_weight_g) %>%
  # calculate mean, removing NA values
  mean(na.rm = TRUE)
[1] 46.48023
Practice exercises

Try these practice questions to test your understanding

1. Which line of code would NOT be a valid way to select the drug_dose_g, initial_weight_g, and weight_lost_g columns from the mousezempic_dosage_data data frame?

mousezempic_dosage_data %>% select(drug_dose_g, initial_weight_g, weight_lost_g)

mousezempic_dosage_data %>% select(contains("g"))

mousezempic_dosage_data %>% select(ends_with("_g"))

mousezempic_dosage_data %>% select(-cage_number, -tail_length_mm, -id_num, -mouse_strain, -sex, -replicate)

2. How would I extract the initial_weight_g column from the mousezempic_dosage_data data frame as a vector?

mousezempic_dosage_data %>% filter(initial_weight_g)

mousezempic_dosage_data %>% $initial_weight_g

mousezempic_dosage_data %>% select(initial_weight_g)

mousezempic_dosage_data %>% pull(initial_weight_g)

3. How would you move the sex column to the end of the mousezempic_dosage_data data frame?

mousezempic_dosage_data %>% relocate(sex)

mousezempic_dosage_data %>% relocate(sex, .after = last_col)

mousezempic_dosage_data %>% relocate(sex, .after = last_col())

mousezempic_dosage_data %>% reorder(sex, .after = last_col())

Solutions

  1. The line of code that would NOT be a valid way to select the drug_dose_g, initial_weight_g, and weight_lost_g columns from the mousezempic_dosage_data data frame is mousezempic_dosage_data %>% select(contains("g")). This line of code would select all columns that contain the letter ‘g’, which would include columns like cage_number and tail_length_mm. We need to specify either ends_with("g") or contains("_g") to only get those with _g at the end. The other options are valid ways to select the specified columns, although some are more efficient than others!
  2. The correct way to extract the initial_weight_g column from the mousezempic_dosage_data data frame as a vector is mousezempic_dosage_data %>% pull(initial_weight_g). The pull() function is used to extract a single column from a data frame as a vector. The other options are incorrect because filter() is used to subset rows, $ is not used in a pipe chain, and select() is outputs a data frame, not extract them as vectors.
  3. The correct way to move the sex column to the end of the mousezempic_dosage_data data frame is using the relocate() function like this: mousezempic_dosage_data %>% relocate(sex, .after = last_col()). The last_col() function is used to refer to the last column in the data frame. The other options are incorrect because reorder() is not a valid function, and you need to remember to include the brackets () when using last_col().

2.2.5 Modifying data

So far, we’ve learned how to filter rows and select columns from a data frame. But what if we want to change the data itself? This is where the mutate() function comes in.

Mutate allows you to add new columns to a data frame

The mutate() function is used to add new columns to a data frame, or modify existing columns, often by performing some sort of calculation. For example, we can add a new column to mousezempic_dosage_data that contains the drug dose in mg (rather than g):

mousezempic_dosage_data %>%
  # add a new column called drug_dose_mg
  # convert drug_dose_g to mg by multiplying by 1000
  mutate(drug_dose_mg = drug_dose_g * 1000) %>%
  # just select the drug dose columns so we can compare them
  select(drug_dose_g, drug_dose_mg)
# A tibble: 344 × 2
   drug_dose_g drug_dose_mg
         <dbl>        <dbl>
 1     0.00181         1.81
 2     0.00186         1.86
 3     0.00195         1.95
 4    NA              NA   
 5     0.00193         1.93
 6     0.0019          1.9 
 7     0.00181         1.81
 8     0.00195         1.95
 9     0.00193         1.93
10     0.0019          1.9 
# ℹ 334 more rows

You can see that the drug_dose_mg column has been added to the data frame, and it contains, for each row, the value of the drug_dose_g column multiplied by 1000 (NA values are preserved).

These calculations can be as complex as you like, and involve multiple different columns. For example, to add a new column to the mousezempic_dosage_data data frame that calculates the weight lost as a percentage of the initial weight:

mousezempic_dosage_data %>%
  # calculate the % of initial weight that was lost
  mutate(weight_lost_percent = (weight_lost_g / initial_weight_g) * 100)
# A tibble: 344 × 10
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                  NA    rep1      <NA>      NA      
 5 CD-1         1A                   3.45 rep1      female     0.00193
 6 CD-1         1A                   3.65 rep1      male       0.0019 
 7 CD-1         1A                   3.62 rep1      female     0.00181
 8 CD-1         1A                   4.68 rep1      male       0.00195
 9 CD-1         1A                   3.48 rep1      <NA>       0.00193
10 CD-1         1A                   4.25 rep1      <NA>       0.0019 
# ℹ 334 more rows
# ℹ 4 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>, weight_lost_percent <dbl>

A useful helper function for mutate() is the case_when() function, which allows you to create new columns based on multiple conditions. We do this with the notation case_when(condition1 ~ value1, condition2 ~ value2, ...).

For example, to add a new column to the mousezempic_dosage_data data frame that categorises the mice based on how much weight they lost:

mousezempic_dosage_data %>%
  # create a new column called weight_loss_category
  mutate(weight_loss_category = case_when(
    weight_lost_g < 4 ~ "Low", # separate conditions with a comma
    weight_lost_g <= 5 ~ "Medium",
    weight_lost_g > 5 ~ "High"
  )) %>%
  select(weight_lost_g, weight_loss_category)
# A tibble: 344 × 2
   weight_lost_g weight_loss_category
           <dbl> <chr>               
 1          3.75 Low                 
 2          3.8  Low                 
 3          3.25 Low                 
 4         NA    <NA>                
 5          3.45 Low                 
 6          3.65 Low                 
 7          3.62 Low                 
 8          4.68 Medium              
 9          3.48 Low                 
10          4.25 Medium              
# ℹ 334 more rows

Note that the conditions are evaluated in order, and the first condition that is TRUE is the one that is used. So if a mouse lost 4.5g, it case_when() would first test if it fits the ‘Low’ category (by checking if 4.5 is less than 4, which it isn’t), and then if it fits the ‘Medium’ category (by checking if 4.5 is less than or equal to 5). Since it is, the mouse would be categorised as ‘Medium’.

Fallback with default value(s)

In the above example, what would happen if a mouse lost -1g (gained weight)? It wouldn’t fit any of the conditions, so it would get an NA in the weight_loss_category column. Sometimes you might want this behaviour, but other times you would prefer to specify a ‘fallback’ category that will be assigned to everything that doesn’t fit in the other categories. You can do this by including a .default = argument at the end of the case_when() function. For example:

mousezempic_dosage_data %>%
  # create a new column called weight_loss_category
  mutate(weight_loss_category = case_when(
    weight_lost_g < 4 ~ "Low", # separate conditions with a comma
    weight_lost_g <= 5 ~ "Medium",
    weight_lost_g > 5 ~ "High",
    .default = "Unknown"
  )) %>%
  select(weight_lost_g, weight_loss_category)
# A tibble: 344 × 2
   weight_lost_g weight_loss_category
           <dbl> <chr>               
 1          3.75 Low                 
 2          3.8  Low                 
 3          3.25 Low                 
 4         NA    Unknown             
 5          3.45 Low                 
 6          3.65 Low                 
 7          3.62 Low                 
 8          4.68 Medium              
 9          3.48 Low                 
10          4.25 Medium              
# ℹ 334 more rows

Notice how the NA value in the fourth row is now categorised as ‘Unknown’.

One final thing to note is that mutate() can be used to modify existing columns as well as add new ones. To do this, just use the name of the existing column as the ‘new’ one.

For example, let’s use mutate() together with case_when() to modify the sex column so that it uses M and F instead male and female:

mousezempic_dosage_data %>%
  # modify sex column
  mutate(sex = case_when(
    sex == "female" ~ "F",
    sex == "male" ~ "M",
    # if neither, code it as 'X'
    .default = "X"))
# A tibble: 344 × 9
   mouse_strain cage_number weight_lost_g replicate sex   drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>       <dbl>
 1 CD-1         1A                   3.75 rep1      M         0.00181
 2 CD-1         1A                   3.8  rep1      F         0.00186
 3 CD-1         1A                   3.25 rep1      F         0.00195
 4 CD-1         1A                  NA    rep1      X        NA      
 5 CD-1         1A                   3.45 rep1      F         0.00193
 6 CD-1         1A                   3.65 rep1      M         0.0019 
 7 CD-1         1A                   3.62 rep1      F         0.00181
 8 CD-1         1A                   4.68 rep1      M         0.00195
 9 CD-1         1A                   3.48 rep1      X         0.00193
10 CD-1         1A                   4.25 rep1      X         0.0019 
# ℹ 334 more rows
# ℹ 3 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>
Practice exercises

Try these practice questions to test your understanding

1. What line of code would you use to add a new column to the mousezempic_dosage_data data frame that converts the tail_length_mm column to cm?

mousezempic_dosage_data %>% create(tail_length_cm = tail_length_mm / 10)

mousezempic_dosage_data %>% mutate(tail_length_cm == tail_length_mm / 10)

mousezempic_dosage_data %>% mutate(tail_length_cm = tail_length_mm / 10)

mousezempic_dosage_data %>% tail_length_cm = tail_length_mm / 10

2. Explain in words what the following code does:

mousezempic_dosage_data %>%
  arrange(desc(weight_lost_g)) %>%
  mutate(weight_lost_rank = row_number())

Hint: the row_number() function returns the number of each row in the data frame (1 being the first row and so on).

Adds a new column to the data frame that ranks the mice based on how much weight they lost, with 1 being the mouse that lost the least weight.

Adds a new column to the data frame that ranks the mice based on how much weight they lost, with 1 being the mouse that lost the most weight.

Adds a new column to the data frame that ranks the mice

Does nothing, because the row_number() function has no arguments

3. What is wrong with this R code?

mousezempic_dosage_data %>%
  mutate(weight_lost_category = case_when(
    weight_lost_g < 4 ~ "Low"
    weight_lost_g <= 5 ~ "Medium"
    weight_lost_g > 5 ~ "High"
  ))
Error: <text>:4:5: unexpected symbol
3:     weight_lost_g < 4 ~ "Low"
4:     weight_lost_g
       ^

You didn’t include a .default = condition at the end of the case_when() function to act as a fallback

You can’t use the case_when() function with the mutate() function

weight_lost_g is not a valid column name

You need to separate the conditions in the case_when() function with a comma

4. Explain in words what the following code does:

mousezempic_dosage_data %>%
  mutate(mouse_strain = case_when(
    mouse_strain == "Black 6" ~ "B6",
    .default = mouse_strain
  ))

Hint: if you’re not sure, try running the code, but pipe it into View() so that you can take a good look at what’s happening in the mouse_strain column.

Renames the strains of all the mice to “B6”, regardless of their original strain

This code will produce an error

Adds a new column that categorises the mice based on their strain, so that any mice from the “Black 6” strain are now called “B6”, and all other strains are left unchanged.

Modifies the mouse_strain column so that any mice from the “Black 6” strain are now called “B6”, and all other strains are left unchanged.

Solutions

  1. The correct line of code to add a new column to the mousezempic_dosage_data data frame that converts the tail_length_mm column to cm is mousezempic_dosage_data %>% mutate(tail_length_cm = tail_length_mm / 10).
  2. The code mousezempic_dosage_data %>% arrange(desc(weight_lost_g)) %>% mutate(weight_lost_rank = row_number()) adds a new column to the data frame that ranks the mice based on how much weight they lost, with 1 being the mouse that lost the most weight. First, the arrange(desc(weight_lost_g)) function sorts the data frame by the weight_lost_g column in descending order, and then the mutate(weight_lost_rank = row_number()) function adds a new column that assigns a rank to each row based on its position (row number) in the sorted data frame.
  3. The error is that the conditions in the case_when() function are not separated by commas. Each condition should be followed by a comma because these are like the arguments in a function. Remeber that it’s optional to include the .default = condition at the end of the case_when() function.
  4. The code mousezempic_dosage_data %>% mutate(mouse_strain = case_when(mouse_strain == "Black 6" ~ "B6", .default = mouse_strain)) modifies the mouse_strain column so that any mice from the “Black 6” strain are now called “B6”, and all other strains are left unchanged. As we are calling our column mouse_strain, no new column is being created (we are modifying the existing one) and the .default = mouse_strain condition acts as a fallback to keep the original values (that already exist in the mouse_strain column) for any rows that don’t match our first condition (strain being “Black 6”).

2.2.6 Summarising data

Summarise allows you to calculate summary statistics that collapse many rows into one

The summarise() (or summarize(), if you prefer US spelling) function is used to calculate summary statistics on your data. It takes similar arguments to mutate(), but instead of adding a new column to the data frame, it returns a new data frame with a single row and one column for each summary statistic you calculate.

For example, to calculate the mean weight lost by the mice in the mousezempic_dosage_data data frame:

mousezempic_dosage_data %>%
  summarise(mean_weight_lost = mean(weight_lost_g, na.rm = TRUE))
# A tibble: 1 × 1
  mean_weight_lost
             <dbl>
1             4.20

We can also calculate multiple summary statistics at once. For example, to calculate the mean, median, and standard deviation of the weight lost by the mice:

mousezempic_dosage_data %>%
  summarise(
    mean_weight_lost = mean(weight_lost_g, na.rm = TRUE),
    median_weight_lost = median(weight_lost_g, na.rm = TRUE),
    sd_weight_lost = sd(weight_lost_g, na.rm = TRUE)
  )
# A tibble: 1 × 3
  mean_weight_lost median_weight_lost sd_weight_lost
             <dbl>              <dbl>          <dbl>
1             4.20               4.05          0.802

The power of summarising data is really seen when combined with grouping, which we will cover in the next section.

Practice exercises

Try these practice questions to test your understanding

1. Explain in words what the following code does:

mousezempic_dosage_data %>%
  summarise(average_tail = mean(tail_length_mm, na.rm = TRUE),
            min_tail = min(tail_length_mm, na.rm = TRUE),
            max_tail = max(tail_length_mm, na.rm = TRUE))

Calculates the average, minimum, and maximum tail length of the mice in the mousezempic_dosage_data data frame.

Produces a data frame containing one column for each of the average, minimum, and maximum tail length of the mice in the mousezempic_dosage_data data frame.

Finds the average tail length of the mice in the mousezempic_dosage_data data frame.

Produces a vector containing the average, minimum, and maximum tail length of the mice in the mousezempic_dosage_data data frame.

2. What is NOT a valid way to calculate the mean weight lost by the mice in the mousezempic_dosage_data data frame?

mousezempic_dosage_data %>% summarise(mean_weight_lost = mean(weight_lost_g, na.rm = TRUE))

mousezempic_dosage_data %>% pull(weight_lost_g) %>% mean(na.rm = TRUE)

mousezempic_dosage_data %>% summarize(mean_weight_lost = mean(weight_lost_g, na.rm = TRUE))

mousezempic_dosage_data %>% mean(weight_lost_g, na.rm = TRUE)

Solutions

1. The code produces a data frame containing one column for each of the average, minimum, and maximum tail length of the mice in the mousezempic_dosage_data data frame.

2. The line of code that is NOT a valid way to calculate the mean weight lost by the mice in the mousezempic_dosage_data data frame is mousezempic_dosage_data %>% mean(weight_lost_g, na.rm = TRUE). This line of code is incorrect because the mean() function is being used directly on the data frame, rather than within a summarise() function. The other options are valid ways to calculate the mean weight lost by the mice in the mousezempic_dosage_data data frame (although note that the second option uses pull() to extract the weight_lost_g column as a vector before calculating the mean, so the mean value is stored in a vector rather than in a data frame).

2.3 Grouping

Using summarise with the .by option allows you to create separate summaries for different ‘groups’ of data

Grouping is a powerful concept in in dplyr that allows you to perform operations on subsets of your data. For example, you might want to calculate the mean weight lost by mice in each cage, or find the mouse with the longest tail in each strain.

We can group data using the .by argument that exists in many dplyr functions, like summarise() and mutate(), and passing it the name(s) of column(s) to group by. For example, to calculate the mean weight lost by mice in each cage:

mousezempic_dosage_data %>%
  summarise(
    mean_weight_lost = mean(weight_lost_g, na.rm = TRUE),
    # don't forget it's .by, not by!
    .by = cage_number)
# A tibble: 3 × 2
  cage_number mean_weight_lost
  <chr>                  <dbl>
1 1A                      3.71
2 3E                      4.72
3 2B                      3.71

Like when we first learned the summarise function above, we give our new column a name (mean_weight_lost), and then we assign its value to be the mean of the weight_lost_g column (with NAs removed). But this time, we also added the .by argument to specify the column we want to group by (cage_number, in this case). This will return a data frame with the mean weight lost by mice in each cage.

Grouping is a powerful tool for exploring your data and can help you identify patterns that might not be obvious when looking at the data as a whole. For example, notice how this grouped summary reveals that mice in cage 3E lost more weight than those in the other two cages.

It’s also possible to group by multiple columns by passing a vector of column names to the .by argument. For example, to calculate the mean weight lost by mice in each cage and strain:

mousezempic_dosage_data %>%
  summarise(mean_weight_lost = mean(weight_lost_g, na.rm = TRUE),
  # group by both cage_number and mouse_strain
    .by = c(cage_number, mouse_strain))
# A tibble: 5 × 3
  cage_number mouse_strain mean_weight_lost
  <chr>       <chr>                   <dbl>
1 1A          CD-1                     3.71
2 3E          CD-1                     3.71
3 2B          CD-1                     3.69
4 3E          Black 6                  5.08
5 2B          BALB C                   3.73

Of course, mean() is not the only function that we can use within summarise(). We can use any function that takes a vector of values and returns a single value, like median(), sd(), or max(). We can also use multiple functions at once, by giving each column a name and specifying the function we want to use:

mousezempic_dosage_data %>%
  summarise(
    n = n(),
    mean_weight_lost = mean(weight_lost_g, na.rm = TRUE),
    median_weight_lost = median(weight_lost_g, na.rm = TRUE),
    sd_weight_lost = sd(weight_lost_g, na.rm = TRUE),
    max_weight_lost = max(weight_lost_g, na.rm = TRUE),
    min_weight_lost = min(weight_lost_g, na.rm = TRUE),
    .by = cage_number)
# A tibble: 3 × 7
  cage_number     n mean_weight_lost median_weight_lost sd_weight_lost
  <chr>       <int>            <dbl>              <dbl>          <dbl>
1 1A             52             3.71               3.7           0.445
2 3E            168             4.72               4.78          0.783
3 2B            124             3.71               3.69          0.417
# ℹ 2 more variables: max_weight_lost <dbl>, min_weight_lost <dbl>

Here, we also used the n() function to calculate the number of mice in each cage. This is a special helper function that works within summarise to count the number of rows in each group.

To .by or not to .by?

In the dplyr package, there are two ways to group data: using the .by argument within various functions (as we have covered so far), or using the group_by() function, then performing your operations and ungrouping with ungroup().

For example, we’ve seen above how to calculate the mean weight lost by mice in each cage using the .by argument:

mousezempic_dosage_data %>%
  summarise(mean_weight_lost = mean(weight_lost_g, na.rm = TRUE), .by = cage_number)
# A tibble: 3 × 2
  cage_number mean_weight_lost
  <chr>                  <dbl>
1 1A                      3.71
2 3E                      4.72
3 2B                      3.71

But we can also do the same using group_by() and ungroup():

mousezempic_dosage_data %>%
  group_by(cage_number) %>%
  summarise(mean_weight_lost = mean(weight_lost_g, na.rm = TRUE)) %>%
  ungroup()
# A tibble: 3 × 2
  cage_number mean_weight_lost
  <chr>                  <dbl>
1 1A                      3.71
2 2B                      3.71
3 3E                      4.72

The two methods are equivalent, but using the .by argument within functions can be more concise and easier to read. Still, it’s good to be aware of group_by() and ungroup() as they are widely used, particularly in older code.

Although grouping is most often used with summarise(), it can be used with dplyr functions too. For example mutate() function can also be used with grouping to add new columns to the data frame based on group-specific calculations. Let’s say we wanted to calculate the Z-score (also known as the standard score) to standardise the weight lost by each mouse within each strain.

As a reminder, the formula for calculating the Z-score is \(\frac{x - \mu}{\sigma}\), where \(x\) is the value (in our case the weight_lost_g column), \(\mu\) is the mean, and \(\sigma\) is the standard deviation.

We can calculate this for each mouse in each strain using the following code:

mousezempic_dosage_data %>%
  # remove NAs before calculating the mean and SD
  filter(!is.na(weight_lost_g)) %>%
  mutate(weight_lost_z = (weight_lost_g - mean(weight_lost_g)) / sd(weight_lost_g), .by = mouse_strain) %>%
  # select the relevant columns
  select(mouse_strain, weight_lost_g, weight_lost_z)
# A tibble: 342 × 3
   mouse_strain weight_lost_g weight_lost_z
   <chr>                <dbl>         <dbl>
 1 CD-1                  3.75         0.108
 2 CD-1                  3.8          0.217
 3 CD-1                  3.25        -0.983
 4 CD-1                  3.45        -0.547
 5 CD-1                  3.65        -0.110
 6 CD-1                  3.62        -0.165
 7 CD-1                  4.68         2.12 
 8 CD-1                  3.48        -0.492
 9 CD-1                  4.25         1.20 
10 CD-1                  3.3         -0.874
# ℹ 332 more rows

Unlike when we used .by with summarise(), we still get the same number of rows as the original data frame, but now we have a new column weight_lost_z that contains the Z-score for each mouse within each strain. This could be useful for identifying outliers or comparing the weight lost by each mouse to the average for its strain.

Practice exercises

Try these practice questions to test your understanding

1. Which line of code would you use to calculate the median tail length of mice belonging to each strain in the mousezempic_dosage_data data frame?

mousezempic_dosage_data %>% summarise(median_tail_length = median(tail_length_mm), .by = mouse_strain)

mousezempic_dosage_data %>% summarise(median_tail_length = median(tail_length_mm, na.rm = TRUE), .by = mouse_strain)

mousezempic_dosage_data %>% summarise(median_tail_length = median(tail_length_mm, na.rm = TRUE), by = mouse_strain)

mousezempic_dosage_data %>% mutate(median_tail_length = median(tail_length_mm, na.rm = TRUE), .by = mouse_strain)

2. Explain in words what the following code does:

mousezempic_dosage_data %>%
  summarise(max_tail_len = max(tail_length_mm, na.rm = TRUE), .by = c(mouse_strain, replicate))

Calculates the maximum tail length of all mice for each strain in the mousezempic_dosage_data data frame

Calculates the maximum tail length of all mice for each replicate in the mousezempic_dosage_data data frame

Calculates the maximum tail length of all mice in the mousezempic_dosage_data data frame

Calculates the maximum tail length of mice in each unique combination of strain and replicate in the mousezempic_dosage_data data frame.

3. I want to count how many male and how many female mice there are for each strain in the mousezempic_dosage_data data frame. Which line of code would I use?

mousezempic_dosage_data %>% summarise(count = n(), .by = sex)

mousezempic_dosage_data %>% summarise(count = n(), .by = mouse_strain)

mousezempic_dosage_data %>% summarise(count = n(), .by = c(mouse_strain, sex))

mousezempic_dosage_data %>% summarise(count = n(), .by = mouse_strain, sex)

4. I want to find the proportion of weight lost by each mouse in each cage in the mousezempic_dosage_data data frame. Which line of code would I use?

mousezempic_dosage_data %>% summarise(weight_lost_proportion = weight_lost_g / sum(weight_lost_g, na.rm = TRUE), .by = cage_number)

mousezempic_dosage_data %>% mutate(weight_lost_proportion = weight_lost_g / sum(weight_lost_g, na.rm = TRUE), .by = cage_number)

mousezempic_dosage_data %>% mutate(weight_lost_proportion = weight_lost_g / sum(weight_lost_g, na.rm = TRUE, .by = cage_number))

mousezempic_dosage_data %>% mutate(weight_lost_proportion = weight_lost_g / sum(weight_lost_g, na.rm = TRUE))

Solutions

  1. The correct line of code to calculate the median tail length of mice belonging to each strain in the mousezempic_dosage_data data frame is mousezempic_dosage_data %>% summarise(median_tail_length = median(tail_length_mm, na.rm = TRUE), .by = mouse_strain). Remember to use na.rm = TRUE to remove any missing values before calculating the median, and to use .by to specify the column to group by (not by). Seeing as we want to calculate the median (collapse down to a single value per group), we need to use summarise() rather than mutate().
  2. The code mousezempic_dosage_data %>% summarise(max_tail_len = max(tail_length_mm, na.rm = TRUE), .by = c(mouse_strain, replicate)) calculates the maximum tail length of mice in each unique combination of strain and replicate in the mousezempic_dosage_data data frame.
  3. The correct line of code to count how many male and how many female mice there are for each strain in the mousezempic_dosage_data data frame is mousezempic_dosage_data %>% summarise(count = n(), .by = c(mouse_strain, sex)). We need to group by both mouse_strain and sex to get the count for each unique combination of strain and sex. Don’t forget that we specify the column names as a vector when grouping by multiple columns.
  4. The correct line of code to find the proportion of weight lost by each mouse in each cage in the mousezempic_dosage_data data frame is mousezempic_dosage_data %>% mutate(weight_lost_proportion = weight_lost_g / sum(weight_lost_g, na.rm = TRUE), .by = cage_number). We use mutate() because we want a value for each mouse (each row in our data), rather than to collapse down to a single value for each group (cage number in this case). Be careful that you use the .by argument within the mutate() function call, not within the sum() function by mistake (this is what is wrong with the third option).

2.4 Saving data to a file

Once you’ve cleaned and transformed your data, you’ll often want to save it to a file so that you can use it in other programs or share it with others. The write_csv() and write_tsv() functions from the readr package are a great way to do this. They take two arguments - the data frame you want to save and the file path where you want to save it.

For example, let’s say I want to save my summary table of the weight lost by mice in each cage to a CSV file called cage_summary_table.csv:

# create the summary table
# and assign it to a variable
cage_summary_table <- mousezempic_dosage_data %>%
  summarise(
    n = n(),
    mean_weight_lost = mean(weight_lost_g, na.rm = TRUE),
    median_weight_lost = median(weight_lost_g, na.rm = TRUE),
    sd_weight_lost = sd(weight_lost_g, na.rm = TRUE),
    .by = cage_number)

# save the data to a CSV file
write_csv(cage_summary_table, "cage_summary_table.csv")

CSV files are particularly great because they can be easily read into other software, like Excel.

It’s also possible to use the write_*() functions along with a pipe:

mousezempic_dosage_data %>%
  summarise(
    n = n(),
    mean_weight_lost = mean(weight_lost_g, na.rm = TRUE),
    median_weight_lost = median(weight_lost_g, na.rm = TRUE),
    sd_weight_lost = sd(weight_lost_g, na.rm = TRUE),
    .by = cage_number) %>%
  write_csv("cage_summary_table.csv")

Remember here that the first argument (the data frame to save) is passed on by the pipe, so the only argument in the brackets is the second one: the file path.

2.5 Summary

Here’s what we’ve covered in this session:

  • The pipe operator %>% and how we can use it to chain together multiple function calls, making our code more readable and easier to understand.

  • The basic dplyr verbs filter(), select(), mutate(), and arrange() and how they can be used to tidy and analyse data.

  • Missing values (NA) and how to remove or replace them

  • The summarise() function and how it can be used to calculate summary statistics on your data, as well as the power of grouping data with the .by argument.

Why does data need to be tidy anyway?

In this session, we’ve been focusing on making our data ‘tidy’: that is, structured in a consistent way that makes it easy to work with. A nice visual illustration of tidy data and its importance can be found here.

2.5.1 Practice questions

  1. What is the purpose of the pipe operator %>%? Keeping this in mind, re-write the following code to use the pipe.

    1. round(mean(c(1, 2, 3, 4, 5)))
    2. print(as.character(1 + 10))
  2. What would be the result of evaluating the following expressions? You don’t need to know these off the top of your head, use R to help! (Hint: some expressions might give an error. Try to think about why)

    1. mousezempic_dosage_data %>% filter(weight_lost_g > 10)
    2. mousezempic_dosage_data %>% select(tail_length_mm, weight_lost_g)
    3. mousezempic_dosage_data %>% mutate(weight_lost_kg = weight_lost_g / 1000)
    4. mousezempic_dosage_data %>% arrange(tail_length_mm)
    5. mousezempic_dosage_data %>% filter(initial_Weight_g > 10) %>% arrange(mouse_strain)
    6. mousezempic_dosage_data %>% relocate(mouse_strain, .after = cage_number)
    7. mousezempic_dosage_data %>% pull(weight_lost_g)
    8. mousezempic_dosage_data %>% filter(!is.na(weight_lost_g))
    9. mousezempic_dosage_data %>% replace_na(list(weight_lost_g = 0))
    10. mousezempic_dosage_data %>% summarise(mean_weight_lost = mean(weight_lost_g, na.rm = TRUE))
    11. mousezempic_dosage_data %>% summarise(mean_weight_lost = mean(weight_lost_g, na.rm = TRUE), .by = cage_number)
  3. What is a missing value in R? What are two ways to deal with missing values in a data frame?

  4. I want to add a new column to the mousezempic_dosage_data data frame that converts the mouse_strain column to lowercase. Hint: you can use the tolower() function in R to convert characters to lowercase. Look up its help page by typing ?tolower in the R console to see how to use it.

  5. How could you find the maximum tail length for each unique combination of sex and mouse strain in the mousezempic_dosage_data data frame?

  6. Write a line of code to save the result of Q5 to a CSV file called max_tail_length.csv.

Solutions
  1. The pipe operator %>% is used to chain together multiple function calls, passing the result of one function to the next. Here’s how you could re-write the code to use the pipe:

    1. c(1, 2, 3, 4, 5) %>% mean() %>% round()
    2. as.character(1 + 10) %>% print()
  2. The result of evaluating the expressions would be:

    1. A data frame containing only the rows where weight_lost_g is greater than 10.
    2. A data frame containing only the tail_length_mm and weight_lost_g columns.
    3. A data frame with an additional column weight_lost_kg that contains the weight lost in kilograms.
    4. A data frame sorted by tail_length_mm, in ascending order.
    5. An error because initial_Weight_g is not a column in the data frame.
    6. A data frame with the mouse_strain column moved to be after the cage_number column.
    7. A vector containing the values of the weight_lost_g column.
    8. A data frame containing only the rows where weight_lost_g is not NA.
    9. A data frame with missing values in the weight_lost_g column replaced with 0.
    10. A data frame with the mean weight lost by all mice.
    11. A data frame with the mean weight lost by mice in each cage.
  3. A missing value in R is represented by NA. Two ways to deal with missing values in a data frame are to remove them using filter(!is.na(column_name)) or to replace them with a specific value using replace_na(list(column_name = value)).

  4. To add a new column to the mousezempic_dosage_data data frame that converts the mouse_strain column to lowercase, you can use mutate() as follows`:

mousezempic_dosage_data %>%
  mutate(mouse_strain_lower = tolower(mouse_strain))
# A tibble: 344 × 10
   mouse_strain cage_number weight_lost_g replicate sex    drug_dose_g
   <chr>        <chr>               <dbl> <chr>     <chr>        <dbl>
 1 CD-1         1A                   3.75 rep1      male       0.00181
 2 CD-1         1A                   3.8  rep1      female     0.00186
 3 CD-1         1A                   3.25 rep1      female     0.00195
 4 CD-1         1A                  NA    rep1      <NA>      NA      
 5 CD-1         1A                   3.45 rep1      female     0.00193
 6 CD-1         1A                   3.65 rep1      male       0.0019 
 7 CD-1         1A                   3.62 rep1      female     0.00181
 8 CD-1         1A                   4.68 rep1      male       0.00195
 9 CD-1         1A                   3.48 rep1      <NA>       0.00193
10 CD-1         1A                   4.25 rep1      <NA>       0.0019 
# ℹ 334 more rows
# ℹ 4 more variables: tail_length_mm <dbl>, initial_weight_g <dbl>,
#   id_num <dbl>, mouse_strain_lower <chr>
  1. You can use the max() function within summarise(.by = c(sex, mouse_strain)) to find the maximum tail length of each unique sex/mouse strain combination:
mousezempic_dosage_data %>%
  summarise(max_tail_length = max(tail_length_mm, na.rm = TRUE), .by = c(sex, mouse_strain))
# A tibble: 8 × 3
  sex    mouse_strain max_tail_length
  <chr>  <chr>                  <dbl>
1 male   CD-1                    21.5
2 female CD-1                    20.7
3 <NA>   CD-1                    20.2
4 female Black 6                 15.5
5 male   Black 6                 17.3
6 <NA>   Black 6                 15.7
7 female BALB C                  19.4
8 male   BALB C                  20.8
  1. To save the result of Q5 to a CSV file called max_tail_length.csv, you can use the write_csv() function, either by using a pipe to connect it to the code you wrote previously:
mousezempic_dosage_data %>%
  summarise(max_tail_length = max(tail_length_mm, na.rm = TRUE), .by = c(sex, mouse_strain)) %>%
  write_csv("max_tail_length.csv")

Or by assigning this result to a variable and then saving it to a file:

max_tail_length <- mousezempic_dosage_data %>%
  summarise(max_tail_length = max(tail_length_mm, na.rm = TRUE), .by = c(sex, mouse_strain))

write_csv(max_tail_length, "max_tail_length.csv")