Analysing exam data with R and the tidyverse: a walkthrough

I often use R and the tidyverse to analyse student data. In this post I’ll walk through the process of analysing 20 fictional students’ performance in an exam paper, a process that I’ve used in the past to analyse real data. If you’d like to follow along, you can download the markbook.csv and questions.csv data files.

Loading packages and data

First let’s load the packages and data files we’ll need. In this case we’re loading CSV data files, but these could easily be Excel spreadsheets loaded with the readxl package’s read_excel function.

library(tidyverse)  # load core tidyverse packages
library(magrittr)  # for %<>% (compound assignment) operator
library(scales)  # for pretty_breaks() and percent_format()
library(knitr)  # for kable (outputs nicely formatted tables)

markbook <- read_csv('markbook.csv')
questions <- read_csv('questions.csv')

Markbook data

The markbook tibble we’ve created is a sample markbook containing students’ marks for an end-of-topic test and our exam (with marks for each sub-part included separately) in the wide format generally used for storing this kind of data. Here’s a preview of the first few rows and columns:

markbook %>%
  arrange(Name) %>%
  head() %>%
  select(1:7) %>%
  kable(format='markdown')

Name	Gender	TopicTest	ExamQ1a	ExamQ1b	ExamQ1c	ExamQ1d
Amelia	F	53	1	1	1	2
Charlie	M	91	1	1	1	3
Charlotte	F	32	0	1	1	2
Ella	F	63	1	1	1	3
Emily	F	49	1	1	1	3
George	M	46	1	1	1	3

You might think it’s extra work to input each student’s mark for each sub-question rather than just the total mark for each student, but I don’t find it any more time consuming; I’d probably add up the marks using a calculator anyway so might as well input them into a spreadsheet!

Exam paper data

The questions tibble we’ve created above contains a list of all the sub-questions in our fictional exam paper and how many marks could be achieved in each sub-question (the Possible column). We’ll use mutate to add a QuestionID column – this will contain an ID for each question which matches the column headers from the markbook (e.g. ExamQ1a) – then we’ll preview the first few rows.

questions %<>%
  mutate(QuestionID = paste0('ExamQ', Number, Part))

questions %>%
  head() %>%
  kable(format='markdown')

Number	Part	Possible	QuestionID
1	a	1	ExamQ1a
1	b	1	ExamQ1b
1	c	2	ExamQ1c
1	d	4	ExamQ1d
2	a	1	ExamQ2a
2	b	1	ExamQ2b

Joining the markbook and exam paper data

For analysis and plotting using the tidyverse, it’s more useful to have the exam marks in a tidy (i.e. long) format rather than the wide format generally used for a markbook. Let’s create an exam tibble for the exam marks, using the gather function to tidy the markbook data and joining the questions tibble to add the maximum marks for each sub-question (the Possible column). Then we’ll preview the first few rows for one of our students: Oliver.

exam <- markbook %>%
  select(Name, starts_with('ExamQ')) %>%
  gather(key=QuestionID, value=Score, starts_with('ExamQ')) %>%
  left_join(questions)

exam %>%
  filter(Name == 'Oliver') %>%
  head() %>%
  kable(format='markdown')

Name	QuestionID	Score	Number	Part	Possible
Oliver	ExamQ1a	1	1	a	1
Oliver	ExamQ1b	1	1	b	1
Oliver	ExamQ1c	1	1	c	2
Oliver	ExamQ1d	3	1	d	4
Oliver	ExamQ2a	0	2	a	1
Oliver	ExamQ2b	1	2	b	1

Students’ overall performance

Let’s use group_by and summarise to calculate the total mark for each pupil as well as their percentage score. We can then create a new Grade column using mutate and assign a grade based on the percentage score achieved using case_when. Finally we’ll preview the first few rows.

exam %>%
  group_by(Name) %>%
  summarise(Total = sum(Score),
            Percentage = sum(Score) / sum(Possible)) %>%
  mutate(Grade = case_when(
    Percentage >= 0.75  ~ 'A*',
    Percentage >= 0.675 ~ 'A',
    Percentage >= 0.60  ~ 'B',
    Percentage >= 0.525 ~ 'C'
  )) %>%
  head() %>%
  kable(format='markdown')

Name	Total	Percentage	Grade
Amelia	28	0.700	A
Charlie	26	0.650	B
Charlotte	25	0.625	B
Ella	31	0.775	A*
Emily	27	0.675	A
George	30	0.750	A*

To the end of the previous chain of commands, we’ll add some more in order to visualise students’ overall exam performance. Let’s use fct_relevel to sort the grades A*, A, B, C rather than A, A*, B, C. Then we’ll plot a bar chart using ggplot. We use fct_reorder to reorder the names on the x axis depending on the percentage score achieved. The scale_y_continuous options improve the y axis labels.

exam %>%
  group_by(Name) %>%
  summarise(Total = sum(Score),
            Percentage = sum(Score) / sum(Possible)) %>%
  mutate(Grade = case_when(
      Percentage >= 0.75  ~ 'A*',
      Percentage >= 0.675 ~ 'A',
      Percentage >= 0.60  ~ 'B',
      Percentage >= 0.525 ~ 'C'
    ),
    Grade = fct_relevel(Grade, 'A*')  # move A* before A
  ) %>%
  ggplot(aes(fct_reorder(Name, Percentage),  # reorder names by percentage
             Percentage, fill=Grade)) +  # colour bars by grade
  geom_col() +
  scale_y_continuous(breaks=pretty_breaks(),
                     labels=percent_format()) +
  labs(title='Total percentage scores', x=NULL) +
  coord_flip()  # flip x/y axes to create horizontal bars

Performance for each question

Now let’s analyse which questions the students found easier and harder. This could be useful in order to direct future teaching.

We use the exam tibble again, this time grouping by sub-question instead of by pupil. Here’s a preview of the first few rows:

exam %>%
  group_by(Number, Part) %>%
  # calculate average percentage score for each sub-question
  summarise(Percentage = sum(Score) / sum(Possible)) %>%
  head() %>%
  kable(format='markdown')

Number	Part	Percentage
1	a	0.8500
1	b	1.0000
1	c	0.5500
1	d	0.6625
2	a	0.8000
2	b	0.6500

Adding on to the commands above, we’ll use ggplot to visualise the average percentage mark for each sub-question. This time we’ll use facet_grid to split the plot into sections, one per question number.

exam %>%
  group_by(Number, Part) %>%
  summarise(Percentage = sum(Score) / sum(Possible)) %>%
  ggplot(aes(Part, Percentage)) +
  geom_col() +
  scale_y_continuous(labels=percent_format()) +
  facet_grid(~ Number, scales='free', space='free') +
  labs(title='Average percentage score by question',
       x='Question number')

Detailed breakdown for each student

Above, we’ve created a summary for each student and each question. Now let’s create a graph for each student to help them understand their individual performance in each question.

In order to automate the process of creating a similar graph for each pupil, we pipe the list of pupil names from the markbook into the purrr package’s map function. The argument given to map is a function that we create which summarises the relevant student’s exam data and pipes it into ggplot.

markbook %>%
  pull(Name) %>%
  map(function(pupil) {
    exam %>%
      filter(Name == pupil) %>%
      group_by(Number) %>%
      summarise(Score = sum(Score),
                Possible = sum(Possible)) %>%
      # fct_rev to put lowest question number at top of y axis
      ggplot(aes(x=fct_rev(as.factor(Number)))) +
      geom_col(aes(y=Possible), data=questions, fill='red', alpha=0.3) +
      geom_col(aes(y=Score), fill='forestgreen') +
      labs(title=paste('Mark breakdown for', pupil),
           x='Question number', y='Marks') +
      coord_flip()
  })

One of the graphs created is shown below.

If each question included sub-questions covering different topics or using different techniques, we could’ve included a category for each sub-question in the questions.csv file and used this category on the y axis instead of the question number.

Learning to use R and the tidyverse

If you haven’t already, why not read my other post about analysing student data using R and the tidyverse? If you’d like to learn more about R and the tidyverse, Hadley Wickham and Garrett Grolemund’s book R for Data Science (available electronically for free) is a great place to start. If you have any questions about using R to analyse student data, why not get in touch with me on Twitter?