ICO Workshop R & RStudio

Part 3

Data manipulation with dplyr

Sven De Maeyer & Tine van Daal

2nd - 4th July, 2024

1 / 28

Overview

Tidyverse --- (Click here)
The dplyr package --- (Cliick here)
Implementation --- (Click here)

2 / 28

1. Tidyverse3 / 28

Welcom in the `tidyverse`

4 / 28

Why `tidyverse`?

more accessible for beginners
consistent approach for all potential tasks
powerful potential applications mith minimum 'effort'
can give confidence to explore R

5 / 28

Tibble

Normally we work with a dataframe in R but we can have very complex data-structures as well (e.g., lists, matrices, ...)

In the tidyverse ecosystem we work with a simple form of data-structure: a tibble

A tibble is a dataframe that fits the tidy data principle

Friends

## # A tibble: 108 × 4
##    student occassion condition fluency
##      <dbl>     <dbl>     <dbl>   <dbl>
##  1       1         1         1   101. 
##  2       1         2         1   104. 
##  3       1         3         1   117. 
##  4       2         1         2    98.8
##  5       2         2         2   107. 
##  6       2         3         2   111. 
##  7       3         1         3   105. 
##  8       3         2         3   102. 
##  9       3         3         3   101. 
## 10       4         1         1   102. 
## # ℹ 98 more rows

6 / 28

What is tidy data?

Artwork by @allison_horst

7 / 28

What is tidy data?

Artwork by @allison_horst

8 / 28

What is tidy data?

Artwork by @allison_horst

9 / 28

2. The dplyr package10 / 28

`dplyr` ...

is THE package to work with tidy data !

VERBS are at the core:

filter()
mutate()
select()
group_by() + summarise()
arrange()
rename()
relocate()
join()

11 / 28

https://raw.githubusercontent.com/rstudio/cheatsheets/master/data-transformation.pdf

12 / 28

The `%>%` operator (a 'pipe')

To create
a chain of functions

Instead of

mean(c(1,2,3,4))

Numbers <- c(1,2,3,4)
mean(Numbers)

you can do

c(1,2,3,4) %>% 
  mean( )

With the %>% you can write a sentence like:

I %>% woke up %>%, took a shower %>%, got breakfast %>%, took the train %>% and arrived at the ICO course %>% …

13 / 28

`filter()`

Artwork by @allison_horst

14 / 28

Let's apply `filter()`

With the FRIENDS data:

We only select observations from the first measurement occassion in condition 1

Friends_Occ1 <- Friends %>%
  filter(occassion == 1 & condition == 1)

== is equals (notice the 2 = signs!)

Let's clean some data, and remove observations with fluency values above 300 and that do not equal fluence value 0

Friends_clean <- Friends %>%
  filter(fluency < 300 & fluency != 0)

!= means not equal to

15 / 28

`mutate()`

Artwork by @allison_horst

16 / 28

Let's apply `mutate()`

With the Friends data:

We calculate a new variable containing the fluency scores minus the average of fluency

Friends <- Friends %>%
  mutate(
    fluency_centered = fluency - mean(fluency, na.rm = T)
    )

17 / 28

Let's apply `mutate()`

With the Friends data:

We create a factor for condition

Friends <- Friends %>%
  mutate(
    condition_factor = as.factor(condition)
  )
str(Friends$condition_factor)

##  Factor w/ 3 levels "1","2","3": 1 1 1 2 2 2 3 3 3 1 ...

18 / 28

Let's apply `select()`

To select variables.

Some examples with the Friends data:

We only select condition and occasion and inspect the result with the str()function

Friends %>%
  select(
    condition, occassion
  ) %>%
  str()

## tibble [108 × 2] (S3: tbl_df/tbl/data.frame)
##  $ condition: num [1:108] 1 1 1 2 2 2 3 3 3 1 ...
##   ..- attr(*, "value.labels")= Named chr [1:3] "3" "2" "1"
##   .. ..- attr(*, "names")= chr [1:3] "No subtitles" "Spanish" "English"
##  $ occassion: num [1:108] 1 2 3 1 2 3 1 2 3 1 ...
##  - attr(*, "variable.labels")= Named chr(0) 
##   ..- attr(*, "names")= chr(0) 
##  - attr(*, "codepage")= int 1252

19 / 28

Rename variables with `rename()`

Notice how the variable occassion is misspelled! Pretty enoying when coding... But we can easily rename variables.

Function rename(new_name = old_name)

Rename the variable occassion to occasion

Friends <- Friends %>%
  rename(
    occasion = occassion
  )

20 / 28

Super combo 1: `group_by() + summarize( )`

transform a tibble to a grouped tibble making use of group_by()
calculate summary stats per group making use of summarize()

Calculate the average fluency and standard deviation per condition

Friends %>%
  group_by(
    condition
  ) %>%
  summarize(
    mean_fluency = mean(fluency),
    sd_fluency   = sd(fluency)
  )

## # A tibble: 3 × 3
##   condition mean_fluency sd_fluency
##       <dbl>        <dbl>      <dbl>
## 1         1         109.       9.08
## 2         2         108.       6.02
## 3         3         103.       4.17

21 / 28

Super combo 1: `group_by() + summarize( )`

Calculate the number of observations for each combination of condition and occasion

Friends %>%
  group_by(
    occasion, condition
  ) %>%
  summarize(
    n_observations  = n()
  )

## # A tibble: 9 × 3
## # Groups:   occasion [3]
##   occasion condition n_observations
##      <dbl>     <dbl>          <int>
## 1        1         1             12
## 2        1         2             12
## 3        1         3             12
## 4        2         1             12
## 5        2         2             12
## 6        2         3             12
## 7        3         1             12
## 8        3         2             12
## 9        3         3             12

22 / 28

Super combo 2: `mutate() + case_when( )`

Artwork by @allison_horst

23 / 28

Super combo 2: `mutate() + case_when( )`

To recode variables into new variables!

We create a new categorical variant of fluency with 3 groups, then we select this new variable and have a look to the top 5 observations...

Friends %>%
  mutate(
    fluency_grouped = case_when(
      fluency < 106.625 - 7.1 ~ 'low',
      fluency >= 106.625 - 7.1 & fluency < 106.625 + 7.1 ~ 'average',
      fluency >= 106.625 + 7.1 ~ 'high'
    )
  ) %>% 
  select(
    fluency,
    fluency_grouped
    ) %>%
  head(5)

## # A tibble: 5 × 2
##   fluency fluency_grouped
##     <dbl> <chr>          
## 1   101.  average        
## 2   104.  average        
## 3   117.  high           
## 4    98.8 low            
## 5   107.  average

24 / 28

How to define conditions

x == y $\to$ 'x is equal to y'
x != y $\to$ 'x is NOT equal to y'
x < y $\to$ 'x is smaller than y'
x <= y $\to$ 'x is smaller or equal to y'
x > y $\to$ 'x is higher than y'
x >= y $\to$ 'x is higher or equal to y'

25 / 28

Bolean operators

We can combine conditions!

& $\to$ 'and' $\to$ example: gender == 1 & age <=18
| $\to$ 'or' $\to$ example: gender == 1 | gender == 2
! $\to$ 'not' $\to$ example: gender == 1 & !age <=18

26 / 28

Interactive tutorial about `dplyr()`

If you want some more material and a place to exercise your skills? This online and freetutorial (made with the package learnr) is strongly advised!

https://allisonhorst.shinyapps.io/dplyr-learnr/#section-welcome

27 / 28

Exercise `dplyr`

You can find the qmd-file Exercises_dplyr.qmd in the Exercises folder (you created the project yesterday!) (Exercises > Exercise2_dplyr)
Open this document
You get a set of tasks with empty code blocks to start coding
Write and test the necessary code
Stuck? No Worries!
- We are there
- Help each other
- There is a solution key (Exercises_dplyr_solutions.qmd)

↑, ←, Pg Up, k	Go to previous slide
↓, →, Pg Dn, Space, j	Go to next slide
Home	Go to first slide
End	Go to last slide
Number + Return	Go to specific slide
b / m / f	Toggle blackout / mirrored / fullscreen mode
c	Clone slideshow
p	Toggle presenter mode
t	Restart the presentation timer
?, h	Toggle this help

ICO Workshop R & RStudio

Overview

1. Tidyverse

Welcom in the tidyverse

Why tidyverse?

Tibble

What is tidy data?

What is tidy data?

What is tidy data?

2. The dplyr package

dplyr ...

The %>% operator (a 'pipe')

filter()

Let's apply filter()

mutate()

Let's apply mutate()

Let's apply mutate()

Let's apply select()

Rename variables with rename()

Super combo 1: group_by() + summarize( )

Super combo 1: group_by() + summarize( )

Super combo 2: mutate() + case_when( )

Super combo 2: mutate() + case_when( )

How to define conditions

Bolean operators

Interactive tutorial about dplyr()

Exercise dplyr

Overview

Help

Welcom in the `tidyverse`

Why `tidyverse`?

2. The `dplyr` package

`dplyr` ...

The `%>%` operator (a 'pipe')

`filter()`

Let's apply `filter()`

`mutate()`

Let's apply `mutate()`

Let's apply `mutate()`

Let's apply `select()`

Rename variables with `rename()`

Super combo 1: `group_by() + summarize( )`

Super combo 1: `group_by() + summarize( )`

Super combo 2: `mutate() + case_when( )`

Super combo 2: `mutate() + case_when( )`

Interactive tutorial about `dplyr()`

Exercise `dplyr`