Solutions to integrated exercise

Introduction

During this workshop you have learned to import, manipulate, visualise and analyse data. Time to use all these skills and create a research report!

This research report will focus upon the relation of motivation with math scores. We will use data that was collected in 2010 during a large project on the transition of secondary to higher education or the labour market. We followed almost 4000 students for two years and collected data during 5 waves. If you want to know more about the project, you can have a look at this or this article. Today, we only use the data collected during the first two measurement waves. Students were in their final year of secondary education.

The data set contains information on 15 variables:
- ID_School: ID of the school
- ID_Class: ID of the class
- ID_Student: ID of the student
- Educ_level: educational track of the studion (1: general education, 2: technical education, 3: vocational education, 4: arts education)
- Gender: gender of student (0: female, 1: male)
- SES_language: language indicator of social-economic status that was used by the government to indicate whether or not a student was at risk (0: does not apply, 1: applies)
- SES_neighbourhood: neighbourhood indicator of social-economic status that was used by the government to indicate whether or not a student was at risk (0: does not apply, 1: applies)
- SES_educ_level: education indicator of social-economic status that was used by the government to indicate whether or not a student was at risk (0: does not apply, 1: applies)
- IQ_score: raw score on a non-verbal IQ-test (min. score: 0, max. score: 40) - Math_score: raw score on a math test (problem-solving) (min. score: 0, max. score: 24)
- Reading_score: raw score on a reading comprehension test (min. score: 0, max. score: 24)
- Autonomous_motivation: autonomous motivation for studying (based on average score on 6 items, min. score: 0, max. score: 5)
- Controlled_motivation: controlled motivation for studying (based on average score on 6 items, min. score: 0, max. score: 5)
- Amotivation: amotivation for studying (based on average score on 3 items, min. score: 0, max. score: 5)

💡 Don’t forget to load the necessary packages: tidyverse and here.

library(tidyverse) # for data wrangling and visualisation

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.2 ──
✔ ggplot2 3.3.6      ✔ purrr   0.3.5 
✔ tibble  3.1.8      ✔ dplyr   1.0.10
✔ tidyr   1.2.1      ✔ stringr 1.4.1 
✔ readr   2.1.3      ✔ forcats 0.5.2 
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

library(here) # for facilitating sharing of projects

here() starts at /Users/demaeyer/Dropbox/ICO_R_2022/Exercises

1. Import and clean data

1.1 Import the data

Import the data

transition_data <- read.table(
  here("Data", "transition_data.csv"), 
        header = TRUE,
        sep = ",", 
        dec = ".")

1.2 Check data

Check all variables:

Are they correctly defined (numeric, character, …)?
Which variables should be converted to another ‘type’ (numeric, character, …)?

str(transition_data)

'data.frame':   3782 obs. of  15 variables:
 $ X                    : int  1 2 3 4 5 6 7 8 9 10 ...
 $ ID_School            : int  1 1 2 2 4 5 11 11 11 11 ...
 $ ID_Class             : chr  "6LOS                         " "6THa                         " "6 CH                         " "6 STW                        " ...
 $ ID_Student           : int  10089 10112 20008 20092 40101 50031 110145 110148 110154 110157 ...
 $ Educ_level           : int  2 2 2 2 2 2 2 2 2 2 ...
 $ Gender               : int  1 0 1 1 1 1 1 1 1 1 ...
 $ SES_language         : int  NA 0 0 0 0 0 0 0 0 0 ...
 $ SES_neighbourhood    : int  NA 1 1 0 0 0 0 0 0 0 ...
 $ SES_educ_level       : int  NA 0 0 0 0 1 0 0 0 0 ...
 $ IQ_score             : int  32 31 29 33 29 30 34 31 29 35 ...
 $ Math_score           : int  3 6 15 7 4 8 16 16 17 16 ...
 $ Reading_score        : int  9 20 13 12 16 15 16 20 16 17 ...
 $ Autonomous_motivation: num  2 4.33 2.83 2.33 3.17 ...
 $ Controlled_motivation: num  3 2.33 2 2.17 3.17 ...
 $ Amotivation          : num  3 1 1.67 4.33 2.33 ...

What is your conclusion?

Most categorical variables (ID_Class, ID_Student, Educ_level, Gender, SES_language, SES_neighbourhood, SES_educ_level) are defined as ‘integer’. These variables should be converted to type ‘character’. The variable X contains row numbers and should be removed.

You might have already noticed that several variables contain missing values (NA’s). However, it is always save to check this for all variables and also to look for ‘strange’ values. Get an overview of the variables using the function summary().

summary(transition_data)

       X            ID_School        ID_Class           ID_Student     
 Min.   :   1.0   Min.   :  1.00   Length:3782        Min.   :  10001  
 1st Qu.: 946.2   1st Qu.: 11.00   Class :character   1st Qu.: 110254  
 Median :1891.5   Median : 31.00   Mode  :character   Median : 310154  
 Mean   :1891.5   Mean   : 59.18                      Mean   : 591910  
 3rd Qu.:2836.8   3rd Qu.:112.00                      3rd Qu.:1120068  
 Max.   :3782.0   Max.   :129.00                      Max.   :1290037  
                                                                       
   Educ_level        Gender        SES_language     SES_neighbourhood
 Min.   :1.000   Min.   :0.0000   Min.   :0.00000   Min.   :0.0000   
 1st Qu.:1.000   1st Qu.:0.0000   1st Qu.:0.00000   1st Qu.:0.0000   
 Median :2.000   Median :1.0000   Median :0.00000   Median :0.0000   
 Mean   :1.796   Mean   :0.5786   Mean   :0.04737   Mean   :0.2181   
 3rd Qu.:3.000   3rd Qu.:1.0000   3rd Qu.:0.00000   3rd Qu.:0.0000   
 Max.   :4.000   Max.   :1.0000   Max.   :1.00000   Max.   :1.0000   
 NA's   :83      NA's   :11       NA's   :214       NA's   :215      
 SES_educ_level      IQ_score       Math_score    Reading_score 
 Min.   :0.0000   Min.   : 2.00   Min.   : 1.00   Min.   : 3.0  
 1st Qu.:0.0000   1st Qu.:26.00   1st Qu.: 6.00   1st Qu.:14.0  
 Median :0.0000   Median :30.00   Median :10.00   Median :17.0  
 Mean   :0.1761   Mean   :29.39   Mean   :10.36   Mean   :16.6  
 3rd Qu.:0.0000   3rd Qu.:33.00   3rd Qu.:14.00   3rd Qu.:20.0  
 Max.   :1.0000   Max.   :40.00   Max.   :23.00   Max.   :24.0  
 NA's   :221      NA's   :392     NA's   :353     NA's   :415   
 Autonomous_motivation Controlled_motivation  Amotivation    
 Min.   : 1.000        Min.   : 1.000        Min.   : 1.000  
 1st Qu.: 2.500        1st Qu.: 2.333        1st Qu.: 1.333  
 Median : 3.167        Median : 3.000        Median : 2.000  
 Mean   :14.712        Mean   :14.423        Mean   :13.235  
 3rd Qu.: 3.833        3rd Qu.: 3.667        3rd Qu.: 3.000  
 Max.   :99.000        Max.   :99.000        Max.   :99.000

What is your conclusion?

Almost all variables have missing values. The maximum value of the motivation variables (Autonomous_motivation, Controlled_motivation, Amotivation) is 99. This is, of course, an impossible value (maximum was 5). To be sure that these variables don’t contain other strange values, you make a quick barplot of the distribution of each of these variables. You save the plots as objects P1, P2 and P3.
(A barplot is most of the time not a good choice to visualise quantitative variables. However, to get an idea of every value in a quantitative variable, a barplot is more informative than a histogram or density plot.)

💡 Attention! You have to create three barplots. These plots can be combined into one plot using the package patchwork. You have to load this package and add code to combine the three plots. This code has already been included in the code block below. This code now has a # in front of it. Remove that # so the code will be executed.

library(patchwork)

P1 <- transition_data %>% 
  ggplot(aes(x = Autonomous_motivation)) + 
  geom_bar() + 
  ggtitle("Autonomous motivation") +
  scale_x_continuous(breaks = c(0, 5, seq(25, 200, 25))) + 
  theme_minimal()  +
  theme(axis.title.x = element_blank())

P2 <- transition_data %>% 
  ggplot(aes(x = Controlled_motivation)) + 
  geom_bar() + 
  ggtitle("Controlled motivation") + 
  scale_x_continuous(breaks = c(0, 5, seq(25, 200, 25))) + 
  theme_minimal()  +
  theme(axis.title.x = element_blank())

P3 <- transition_data %>% 
  ggplot(aes(x = Amotivation)) +
  geom_bar() + 
  ggtitle("Amotivation") + 
  scale_x_continuous(breaks = c(0, 5, seq(25, 200, 25))) + 
  theme_minimal() +
  theme(axis.title.x = element_blank())

P1 / P2 / P3

What is your conclusion?

The motivation variables (Autonomous_motivation, Controlled_motivation, Amotivation) don’t contain other strange values.

You decide to also visualise the distribution of the categorical variables: Educ_level, Gender, SES_language, SES_neighbourhood en SES_educ_level (using a barplot).

You already know that these variables are included as numeric variables, while they are of type character. To make sure that you get correct values on the x-axis of your barplots, specify the x aesthetic as follows ggplot(aes(x = as.character(Educ_level))).

💡 Attention! You have to create five barplots. These plots can be combined into one plot using the package patchwork. You have to add code to combine the five plots. This code has already been included in the code block below. This code now has a # in front of it. Remove that # so the code will be executed.

P1 <- transition_data %>% 
  ggplot(aes(x = as.character(Educ_level))) + 
  geom_bar() + 
  ggtitle("Educational level") +
  theme_minimal() +
  theme(axis.title.x = element_blank())

P2 <- transition_data %>% 
  ggplot(aes(x = as.character(Gender))) + 
  geom_bar() + 
  ggtitle("Gender") +
  theme_minimal() +
  theme(axis.title.x = element_blank()) 

P3 <- transition_data %>% 
  ggplot(aes(x = as.character(SES_language))) + 
  geom_bar() + 
  ggtitle("SES 'Language'") + theme_minimal() +
  theme(axis.title.x = element_blank()) 

P4 <- transition_data %>% 
  ggplot(aes(x = as.character(SES_neighbourhood))) + geom_bar() + 
  ggtitle("SES 'Neighbourhood'") + 
  theme_minimal() +
  theme(axis.title.x = element_blank()) 

P5 <- transition_data %>% 
  ggplot(aes(x = as.character(SES_educ_level))) + 
  geom_bar() + 
  ggtitle("SES 'Educational level'") + 
  theme_minimal() +
  theme(axis.title.x = element_blank()) 

(P1 + P2) /
(P3 + P4 + P5)

What is your conclusion?

The distribution of the five categorical variables looks fine. (They all have missing values!) Category 4 of the variable Educ_level contains little observations compared to that of the three other categories. It would be nice if the categories of all these variables would be appropriately labelled. (In that case, you would immediately know what category 4 of the variable Educ_level refers to.)

1.3 Clean data

After having checked the data, you know that the following actions needs to be performed:

convert all categorical variables to a character variable
label the categories of the variables Educ_level, Gender, SES_language, SES_neighbourhood and SES_educ_level and store them as new variables
recode the values 99 in the motivation variables (Autonomous_motivation, Controlled_motivation, and Amotivation) to NA and store them as new variables
remove the first column

Write the code to perform the actions listed above and save as a new data frame transition_data_cleaned

transition_data_cleaned <- 
  transition_data %>%
  ## Convert ID variables to categorical variables (1)
  mutate_at(vars(ID_School, ID_Class, ID_Student,
                 Educ_level, Gender,
                 SES_language, SES_neighbourhood, SES_educ_level),
            as.character) %>%
  ## Recode categories of categorical variables and store as a new' variable (2)
  mutate(
    Educ_level_cat = case_when(
      Educ_level == 1 ~ "General education",
      Educ_level == 2 ~ "Technical education",
      Educ_level == 3 ~ "Vocational education",
      Educ_level == 4 ~ "Arts education"
      ),
    ## function `ifelse(condition == true, yes = do this, no = do this)`
    Gender_cat = ifelse(
      Gender == 0, yes = "female", no = "male"
      ),
    SES_language_cat = ifelse(
      SES_language == 0, yes = "does not apply", no = "applies"
      ),
    SES_neighbourhood_cat = ifelse(
      SES_neighbourhood == 0, yes = "does not apply", no = "applies"
      ),
    SES_educ_level_cat = ifelse(
      SES_educ_level == 0, yes = "does not apply", no = "applies"
      ),
    ## Replace 99 with NA and store as a new variable  (3)
    Autonomous_motivationR = na_if(Autonomous_motivation, 99),
    Controlled_motivationR = na_if(Controlled_motivation, 99),
    AmotivationR = na_if(Amotivation, 99)
    ) %>%
  ## Remove first column (4)
  select(-1)

Don’t forget to check your recoding work!

First, check whether each variable is defined correctly. (Are all categorical variables defined as a character variable? Is the redundant variable removed from the data?)

str(transition_data_cleaned)

'data.frame':   3782 obs. of  22 variables:
 $ ID_School             : chr  "1" "1" "2" "2" ...
 $ ID_Class              : chr  "6LOS                         " "6THa                         " "6 CH                         " "6 STW                        " ...
 $ ID_Student            : chr  "10089" "10112" "20008" "20092" ...
 $ Educ_level            : chr  "2" "2" "2" "2" ...
 $ Gender                : chr  "1" "0" "1" "1" ...
 $ SES_language          : chr  NA "0" "0" "0" ...
 $ SES_neighbourhood     : chr  NA "1" "1" "0" ...
 $ SES_educ_level        : chr  NA "0" "0" "0" ...
 $ IQ_score              : int  32 31 29 33 29 30 34 31 29 35 ...
 $ Math_score            : int  3 6 15 7 4 8 16 16 17 16 ...
 $ Reading_score         : int  9 20 13 12 16 15 16 20 16 17 ...
 $ Autonomous_motivation : num  2 4.33 2.83 2.33 3.17 ...
 $ Controlled_motivation : num  3 2.33 2 2.17 3.17 ...
 $ Amotivation           : num  3 1 1.67 4.33 2.33 ...
 $ Educ_level_cat        : chr  "Technical education" "Technical education" "Technical education" "Technical education" ...
 $ Gender_cat            : chr  "male" "female" "male" "male" ...
 $ SES_language_cat      : chr  NA "does not apply" "does not apply" "does not apply" ...
 $ SES_neighbourhood_cat : chr  NA "applies" "applies" "does not apply" ...
 $ SES_educ_level_cat    : chr  NA "does not apply" "does not apply" "does not apply" ...
 $ Autonomous_motivationR: num  2 4.33 2.83 2.33 3.17 ...
 $ Controlled_motivationR: num  3 2.33 2 2.17 3.17 ...
 $ AmotivationR          : num  3 1 1.67 4.33 2.33 ...

Check whether you labelled the categorical variables correct using the function table().

table(transition_data_cleaned$Educ_level,
      transition_data_cleaned$Educ_level_cat)

   
    Arts education General education Technical education Vocational education
  1              0              1731                   0                    0
  2              0                 0                1020                    0
  3              0                 0                   0                  920
  4             28                 0                   0                    0

table(transition_data_cleaned$Gender,
      transition_data_cleaned$Gender_cat)

   
    female male
  0   1589    0
  1      0 2182

table(transition_data_cleaned$SES_language,
      transition_data_cleaned$SES_language_cat)

   
    applies does not apply
  0       0           3399
  1     169              0

table(transition_data_cleaned$SES_neighbourhood,
      transition_data_cleaned$SES_neighbourhood_cat)

   
    applies does not apply
  0       0           2789
  1     778              0

table(transition_data_cleaned$SES_educ_level,
      transition_data_cleaned$SES_educ_level_cat)

   
    applies does not apply
  0       0           2934
  1     627              0

1.4 Prepare data for analysis

Create a data frame transition_data_for_analysis that only contains the following variables:

ID_School
ID_Class
ID_Student
Educ_level and Educ_level_cat
Gender and Gender_cat
all SES-indicators (original and recoded variables)
IQ_score, Reading_score, Wisk_score
the motivation variables (only the ones that you recoded)

To select these variables efficiently, the ‘selection helpers’ of the select()-function are handy. More information on these helpers can be found here. Also include a standardized version of every quantitative variable in your data set. To standardize a variable, you can use the function scale(). Finally, you decide to exclude all students in the arts education track (based on the results of the barplot you created in 1.2).

transition_data_for_analysis <- 
  transition_data_cleaned %>%
  select(contains("ID"),
         starts_with("Educ_"),
         contains("Gender"),
         contains("SES"),
         contains("_score"),
         contains("motivationR")) %>%
  mutate(IQ_Z = scale(IQ_score),
         Reading_Z = scale(Reading_score),
         Math_Z = scale(Math_score),
         Auto_mot_Z = scale(Autonomous_motivationR), # use the recoded variable!
         Contr_mot_Z = scale(Controlled_motivationR),
         Amot_Z = scale(AmotivationR)) %>%
  filter(Educ_level != 4)

Check you new data frame transition_data_for_analysis.

str(transition_data_for_analysis)

'data.frame':   3671 obs. of  25 variables:
 $ ID_School             : chr  "1" "1" "2" "2" ...
 $ ID_Class              : chr  "6LOS                         " "6THa                         " "6 CH                         " "6 STW                        " ...
 $ ID_Student            : chr  "10089" "10112" "20008" "20092" ...
 $ Educ_level            : chr  "2" "2" "2" "2" ...
 $ Educ_level_cat        : chr  "Technical education" "Technical education" "Technical education" "Technical education" ...
 $ Gender                : chr  "1" "0" "1" "1" ...
 $ Gender_cat            : chr  "male" "female" "male" "male" ...
 $ SES_language          : chr  NA "0" "0" "0" ...
 $ SES_neighbourhood     : chr  NA "1" "1" "0" ...
 $ SES_educ_level        : chr  NA "0" "0" "0" ...
 $ SES_language_cat      : chr  NA "does not apply" "does not apply" "does not apply" ...
 $ SES_neighbourhood_cat : chr  NA "applies" "applies" "does not apply" ...
 $ SES_educ_level_cat    : chr  NA "does not apply" "does not apply" "does not apply" ...
 $ IQ_score              : int  32 31 29 33 29 30 34 31 29 35 ...
 $ Math_score            : int  3 6 15 7 4 8 16 16 17 16 ...
 $ Reading_score         : int  9 20 13 12 16 15 16 20 16 17 ...
 $ Autonomous_motivationR: num  2 4.33 2.83 2.33 3.17 ...
 $ Controlled_motivationR: num  3 2.33 2 2.17 3.17 ...
 $ AmotivationR          : num  3 1 1.67 4.33 2.33 ...
 $ IQ_Z                  : num [1:3671, 1] 0.5249 0.3235 -0.0793 0.7263 -0.0793 ...
  ..- attr(*, "scaled:center")= num 29.4
  ..- attr(*, "scaled:scale")= num 4.97
 $ Reading_Z             : num [1:3671, 1] -1.946 0.872 -0.921 -1.177 -0.153 ...
  ..- attr(*, "scaled:center")= num 16.6
  ..- attr(*, "scaled:scale")= num 3.9
 $ Math_Z                : num [1:3671, 1] -1.56 -0.924 0.984 -0.712 -1.348 ...
  ..- attr(*, "scaled:center")= num 10.4
  ..- attr(*, "scaled:scale")= num 4.72
 $ Auto_mot_Z            : num [1:3671, 1] -1.293 1.691 -0.227 -0.867 0.199 ...
  ..- attr(*, "scaled:center")= num 3.01
  ..- attr(*, "scaled:scale")= num 0.782
 $ Contr_mot_Z           : num [1:3671, 1] 0.225 -0.643 -1.077 -0.86 0.442 ...
  ..- attr(*, "scaled:center")= num 2.83
  ..- attr(*, "scaled:scale")= num 0.768
 $ Amot_Z                : num [1:3671, 1] 0.992 -1.186 -0.46 2.445 0.266 ...
  ..- attr(*, "scaled:center")= num 2.09
  ..- attr(*, "scaled:scale")= num 0.918

Check if the data is filtered correctly.

table(transition_data_for_analysis$Educ_level_cat)


   General education  Technical education Vocational education 
                1731                 1020                  920

3. Some descriptives

Before doing the analysis, you want to check some things.

First, you want to get an idea of the number of observations per school. To do so, you create a lollipop plot. To help the reader of your rapport, you make sure that the ‘bars’ are arranged in ascending order. Color bars with counts less than 20 in red. (You first have to create a new variable color_ID to be avble to do that!) Use your ggplot2-skills to make the plot “publication-ready”.

transition_data_for_analysis %>%
  group_by(ID_School) %>%
  count() %>%
  ## Create the color indicator 
  mutate(
    color_ID = ifelse(n < 20, yes = "less than 20", no = "more than 20")
    ) %>%
  ## Adding fct_rev() revers the ordering from descending to ascending
  ggplot(aes(x = fct_rev(reorder(ID_School, n)), y = n)) + 
  geom_point( 
    aes(col = color_ID),
    size = .3
  ) +  
  geom_segment(  
    aes(x = ID_School, xend = ID_School, 
        y = 0, yend = n, color = color_ID),
    ## Makes the size of the lines thinner
    size = .4
    ) +
  coord_flip() +
  ## Place counts on top (to specify like this after having used coord_flip)
  scale_y_continuous(position = "right",
                     breaks = c(seq(50, 200, 50), 233), expand = c(0,0.4)) +
  scale_color_manual(values = c("red", "black")) +
  labs(
    title = "Number of observations per school (arranged in descending order)",
    subtitle = "If less than 20 observations are available bars are colored in red",
    x =  "ID of school",
    y = " "
  ) +
  theme_minimal() +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(face = "italic"),
        panel.grid.minor = element_blank(),
        panel.grid.major.y = element_blank(),
        axis.text.y = element_text(size = 8))

You also create a table with the descriptive statistics (mean, SD, min, max) of the raw (unstandardized) variables IQ_score, Reading_score and Math_score. To so, you first create a vector per variable (using the function c()) that contains those four values. To calculate these values you use the functions mean(), sd(), min() and max(). Then, you combine the three vectors into one tibble (descriptives) using the function bind_rows(). (This function pastes the rows below each other.) Don’t forget to print the resulting tibble.

desc_IQ <- c(
  mean = mean(transition_data_for_analysis$IQ_score, na.rm = TRUE),
  SD = sd(transition_data_for_analysis$IQ_score, na.rm = TRUE),
  min = min(transition_data_for_analysis$IQ_score, na.rm = TRUE),
  max = max(transition_data_for_analysis$IQ_score, na.rm=TRUE))

desc_Reading <- c(
  mean = mean(transition_data_for_analysis$Reading_score, na.rm = TRUE),
  SD = sd(transition_data_for_analysis$Reading_score, na.rm = TRUE),
  min = min(transition_data_for_analysis$Reading_score, na.rm = TRUE),
  max = max(transition_data_for_analysis$Reading_score, na.rm=TRUE))

desc_Math <- c(
  mean = mean(transition_data_for_analysis$Math_score, na.rm = TRUE),
  SD = sd(transition_data_for_analysis$Math_score, na.rm = TRUE),
  min = min(transition_data_for_analysis$Math_score, na.rm = TRUE),
  max = max(transition_data_for_analysis$Math_score, na.rm=TRUE))

descriptives <- bind_rows(desc_IQ, desc_Reading, desc_Math)

descriptives

# A tibble: 3 × 4
   mean    SD   min   max
  <dbl> <dbl> <dbl> <dbl>
1  29.4  4.97     2    40
2  16.6  3.91     3    24
3  10.4  4.72     1    23

Before turning the tibble into a real table, you must round all numbers to two places after the decimal combining the functions mutate_all and round(). After that, you also create a new column Name that contains the names of the variables. Make sure it appears before the column mean.

descriptives %>%
  mutate_all(round, 2) %>%
  mutate(Name = c("IQ score", "Reading score", "Mathematics score")) %>%
  relocate(Name, .before = "mean")

# A tibble: 3 × 5
  Name               mean    SD   min   max
  <chr>             <dbl> <dbl> <dbl> <dbl>
1 IQ score           29.4  4.97     2    40
2 Reading score      16.6  3.91     3    24
3 Mathematics score  10.4  4.72     1    23

Reuse the code you wrote and turn your tibble into a real table by using the function kable().** Check out the help-page of the kable()-function to find out how you can align the five columns! Center all but the first columns. (The first one can be “left” aligned.)

💡 Attention! To use the function kable() you have to install and load the package knitr.

library(knitr)

descriptives %>%
  mutate_all(round, 2) %>%
  mutate(Name = c("IQ score", "Reading score", "Mathematics score")) %>%
  relocate(Name, .before = "mean") %>%
  kable(
    ## To align the content of the columns
    align = c("l", "c", "c", "c", "c"))

Name	mean	SD	min	max
IQ score	29.39	4.97	2	40
Reading score	16.60	3.91	3	24
Mathematics score	10.37	4.72	1	23

Much more options are available when you also install the package kableExtra. This website provides more information on the use of that package: https://cran.r-project.org/web/packages/kableExtra/vignettes/awesome_table_in_html.html One of the many nice features of kableExtra are the built-in themes:
- kable_classic()
- kable_classic2()
- kable_paper()
- kable_minimal()
- kable_material()
- kable_material_dark()

Reuse the code your wrote above and add a theme. Play around with these themes to get to know them.

💡 Attention! To use these functions you have to install and load the package kableExtra.

library(kableExtra)


Attaching package: 'kableExtra'

The following object is masked from 'package:dplyr':

    group_rows

descriptives %>%
  mutate_all(round, 2) %>%
  mutate(Name = c("IQ score", "Reading score", "Mathematics score")) %>%
  relocate(Name, .before = "mean") %>%
  kable(
    ## To align the content of the columns
    align = c("l", "c", "c", "c", "c")
    ) %>%
  kable_paper()

Name	mean	SD	min	max
IQ score	29.39	4.97	2	40
Reading score	16.60	3.91	3	24
Mathematics score	10.37	4.72	1	23

4. Analyse the data

Fit a first linear regression model with math score as dependent variable and the thee motivation variables as independent variables.

Use standardized variables. Save your model as an object Model_lm1 and print the model summary.

Model_lm1 <- lm(Math_Z ~ Auto_mot_Z + Contr_mot_Z + Amot_Z,
               data = transition_data_for_analysis)
summary(Model_lm1)


Call:
lm(formula = Math_Z ~ Auto_mot_Z + Contr_mot_Z + Amot_Z, data = transition_data_for_analysis)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.1265 -0.7964 -0.0523  0.7524  2.5859 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.009546   0.017649   0.541   0.5886    
Auto_mot_Z  -0.025564   0.021440  -1.192   0.2332    
Contr_mot_Z  0.043759   0.017925   2.441   0.0147 *  
Amot_Z      -0.138788   0.021543  -6.442 1.36e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.9837 on 3103 degrees of freedom
  (564 observations deleted due to missingness)
Multiple R-squared:  0.0165,    Adjusted R-squared:  0.01555 
F-statistic: 17.35 on 3 and 3103 DF,  p-value: 3.594e-11

Fit a second linear regression model that builds on the previous one but adds gender, educational level, IQ-score and reading-score as control variables.

Use standardized variables. Save your model as an object Model_lm2 and print the model summary.

Model_lm2 <- lm(Math_Z ~ IQ_Z + Reading_Z + Gender + Educ_level_cat +
                 Auto_mot_Z + Contr_mot_Z + Amot_Z,
               data = transition_data_for_analysis)
summary(Model_lm2)


Call:
lm(formula = Math_Z ~ IQ_Z + Reading_Z + Gender + Educ_level_cat + 
    Auto_mot_Z + Contr_mot_Z + Amot_Z, data = transition_data_for_analysis)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.52511 -0.50194 -0.00795  0.47441  2.29684 

Coefficients:
                                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)                         0.120175   0.023871   5.034 5.07e-07 ***
IQ_Z                                0.238068   0.014250  16.707  < 2e-16 ***
Reading_Z                           0.297417   0.016110  18.461  < 2e-16 ***
Gender1                             0.355800   0.026990  13.183  < 2e-16 ***
Educ_level_catTechnical education  -0.409010   0.031607 -12.941  < 2e-16 ***
Educ_level_catVocational education -0.869544   0.042801 -20.316  < 2e-16 ***
Auto_mot_Z                         -0.006012   0.015663  -0.384   0.7011    
Contr_mot_Z                        -0.012878   0.013209  -0.975   0.3296    
Amot_Z                             -0.042446   0.015891  -2.671   0.0076 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7026 on 3024 degrees of freedom
  (638 observations deleted due to missingness)
Multiple R-squared:  0.4974,    Adjusted R-squared:  0.4961 
F-statistic: 374.1 on 8 and 3024 DF,  p-value: < 2.2e-16

Before reporting on the results, **you have to check the assumptions of both models. Several R-packages perform these kind of tasks. One example is the performance package. Install this package and use the function check_model() to do a visual check of the assumptions.

💡 Attention! To use the function check_model() you have to install and load the package performance.

Check the assumptions of the model Model_lm1 first.

library(performance)

check_model(Model_lm1)

Do the same for the other model.

check_model(Model_lm2)

These results look quite good. However, you also want to do an additional check for outliers. Go to this website and look for the function that you need to do an additional check on outliers.

# Check on outliers for model_lm1

check_outliers(Model_lm1)

OK: No outliers detected.
- Based on the following method and threshold: cook (0.84).
- For variable: (Whole model)

# Check on outliers for model_lm2

check_outliers(Model_lm2)

OK: No outliers detected.
- Based on the following method and threshold: cook (0.93).
- For variable: (Whole model)

Another important assumption concerns the distribution of the dependent variable (should be normally distributed). You do not need a package to check on that distribution, but use your ggplot2-skills to make a histogram of the distribution of the dependent variable. ‘Style’ the histogram so you can use it in a publication.

transition_data_for_analysis %>%
  ggplot(aes(
         x = IQ_Z
       )) +
  ## Use geom_density
  geom_histogram(
    fill = "grey",
    color = "grey",
    alpha = .7
  ) +
  labs(
    title = "Checking normality of the distribution of the dependent variable",
    subtitle = "Histogram plot of standardized IQ-score",
    x = "IQ-score (Z)"
  ) + 
  theme_minimal() +
  ## I reused the theme-settings of the lollipop plot.
  ## In this way, you can easily 'align' the style of the plots you make
  ## Of course, you can also create your own ggplot2-theme
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"),
        plot.subtitle = element_text(face = "italic"),
        panel.grid.minor = element_blank(),
        panel.grid.major.x = element_blank(),
        axis.text.y = element_text(size = 8))

`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Warning: Removed 307 rows containing non-finite values (stat_bin).

The tail on the left-hand side of the distribution worries you a bit. You decide to discuss this with your supervisor.

To finish up the report, you create a nice table of the estimates of both models. To create this table, you use the function tab_model() of the package sjPlot. Go the help-page of this function and figure out how to turn off the printing of p-values, print a 90% confidence interval, rename the column names to ‘Variables’, ‘Est.’ and ‘90% CI’ and remove the brackets around Intercept.

💡 Attention! To use the function tab_model() you have to install and load the package sjPlot.

library(sjPlot)

#refugeeswelcome

tab_model(Model_lm1, Model_lm2,
          show.p = FALSE, show.ci = .9,
          string.pred = "Variables",
          string.est = "Est.", string.ci = "90% CI",
          string.intercept = "Intercept")

	Math Z		Math Z
Variables	Est.	90% CI	Est.	90% CI
Intercept	0.01	-0.02 – 0.04	0.12	0.08 – 0.16
Auto mot Z	-0.03	-0.06 – 0.01	-0.01	-0.03 – 0.02
Contr mot Z	0.04	0.01 – 0.07	-0.01	-0.03 – 0.01
Amot Z	-0.14	-0.17 – -0.10	-0.04	-0.07 – -0.02
IQ Z			0.24	0.21 – 0.26
Reading Z			0.30	0.27 – 0.32
Gender [1]			0.36	0.31 – 0.40
Educ level cat [Technical education]			-0.41	-0.46 – -0.36
Educ level cat [Vocational education]			-0.87	-0.94 – -0.80
Observations	3107		3033
R² / R² adjusted	0.016 / 0.016		0.497 / 0.496

Before sending the report to your supervisor, you compare both models. To do so, you first make a data set transition_data_omitted that excludes all missing values for the variables included in Model_lm2. (This is necessary in order to compare the models.)

transition_data_omitted <- 
  transition_data_for_analysis %>%
  filter(!is.na(Math_Z), !is.na(IQ_Z), !is.na(Reading_Z), !is.na(Gender), !is.na(Educ_level_cat), !is.na(Auto_mot_Z), !is.na(Contr_mot_Z), !is.na(Amot_Z))

Then, you refit both models using the new data set.

Model_lm1_B <- lm(Math_Z ~ Auto_mot_Z + Contr_mot_Z + Amot_Z,
                 data = transition_data_omitted)
Model_lm2_B <- lm(Math_Z ~ IQ_score + Reading_score + Gender + Educ_level_cat +
                  Auto_mot_Z + Contr_mot_Z + Amot_Z,
                  data = transition_data_omitted)

Finally, you compare both models using the function compare_performance of the package performance.

compare_performance(Model_lm1_B, Model_lm2_B)

# Comparison of Model Performance Indices

Name        | Model |  AIC (weights) |  BIC (weights) |    R2 | R2 (adj.) |  RMSE | Sigma
-----------------------------------------------------------------------------------------
Model_lm1_B |    lm | 8503.2 (<.001) | 8533.2 (<.001) | 0.017 |     0.016 | 0.981 | 0.982
Model_lm2_B |    lm | 6477.5 (>.999) | 6537.6 (>.999) | 0.497 |     0.496 | 0.702 | 0.703

To be able to sent the report to your supervisor, you render it as an html-file

A Quarto tip!

Please note, if you Render the document to send it to someone else you want to have all the information in one single file so that you can send the .html file to that person. To make sure all the figures etc are in the file, specify the option self-contained: true in the YAML. See it in action in the .qmd file for this exercise.