Analysing exam data with R and the tidyverse: a walkthrough
I often use R and the tidyverse to analyse student data. In this post I’ll walk through the process of analysing 20 fictional students’ performance in an exam paper, a process that I’ve used in the past to analyse real data. If you’d like to follow along, you can download the markbook.csv
and questions.csv
data files.
Loading packages and data
First let’s load the packages and data files we’ll need. In this case we’re loading CSV data files, but these could easily be Excel spreadsheets loaded with the readxl package’s read_excel
function.
library(tidyverse) # load core tidyverse packages
library(magrittr) # for %<>% (compound assignment) operator
library(scales) # for pretty_breaks() and percent_format()
library(knitr) # for kable (outputs nicely formatted tables)
markbook <- read_csv('markbook.csv')
questions <- read_csv('questions.csv')
Markbook data
The markbook
tibble we’ve created is a sample markbook containing students’ marks for an end-of-topic test and our exam (with marks for each sub-part included separately) in the wide format generally used for storing this kind of data. Here’s a preview of the first few rows and columns:
markbook %>%
arrange(Name) %>%
head() %>%
select(1:7) %>%
kable(format='markdown')
Name | Gender | TopicTest | ExamQ1a | ExamQ1b | ExamQ1c | ExamQ1d |
---|---|---|---|---|---|---|
Amelia | F | 53 | 1 | 1 | 1 | 2 |
Charlie | M | 91 | 1 | 1 | 1 | 3 |
Charlotte | F | 32 | 0 | 1 | 1 | 2 |
Ella | F | 63 | 1 | 1 | 1 | 3 |
Emily | F | 49 | 1 | 1 | 1 | 3 |
George | M | 46 | 1 | 1 | 1 | 3 |
You might think it’s extra work to input each student’s mark for each sub-question rather than just the total mark for each student, but I don’t find it any more time consuming; I’d probably add up the marks using a calculator anyway so might as well input them into a spreadsheet!
Exam paper data
The questions
tibble we’ve created above contains a list of all the sub-questions in our fictional exam paper and how many marks could be achieved in each sub-question (the Possible
column). We’ll use mutate
to add a QuestionID
column – this will contain an ID for each question which matches the column headers from the markbook (e.g. ExamQ1a
) – then we’ll preview the first few rows.
questions %<>%
mutate(QuestionID = paste0('ExamQ', Number, Part))
questions %>%
head() %>%
kable(format='markdown')
Number | Part | Possible | QuestionID |
---|---|---|---|
1 | a | 1 | ExamQ1a |
1 | b | 1 | ExamQ1b |
1 | c | 2 | ExamQ1c |
1 | d | 4 | ExamQ1d |
2 | a | 1 | ExamQ2a |
2 | b | 1 | ExamQ2b |
Joining the markbook and exam paper data
For analysis and plotting using the tidyverse, it’s more useful to have the exam marks in a tidy (i.e. long) format rather than the wide format generally used for a markbook. Let’s create an exam tibble for the exam marks, using the gather
function to tidy the markbook data and joining the questions tibble to add the maximum marks for each sub-question (the Possible
column). Then we’ll preview the first few rows for one of our students: Oliver.
exam <- markbook %>%
select(Name, starts_with('ExamQ')) %>%
gather(key=QuestionID, value=Score, starts_with('ExamQ')) %>%
left_join(questions)
exam %>%
filter(Name == 'Oliver') %>%
head() %>%
kable(format='markdown')
Name | QuestionID | Score | Number | Part | Possible |
---|---|---|---|---|---|
Oliver | ExamQ1a | 1 | 1 | a | 1 |
Oliver | ExamQ1b | 1 | 1 | b | 1 |
Oliver | ExamQ1c | 1 | 1 | c | 2 |
Oliver | ExamQ1d | 3 | 1 | d | 4 |
Oliver | ExamQ2a | 0 | 2 | a | 1 |
Oliver | ExamQ2b | 1 | 2 | b | 1 |
Students’ overall performance
Let’s use group_by
and summarise
to calculate the total mark for each pupil as well as their percentage score. We can then create a new Grade
column using mutate
and assign a grade based on the percentage score achieved using case_when
. Finally we’ll preview the first few rows.
exam %>%
group_by(Name) %>%
summarise(Total = sum(Score),
Percentage = sum(Score) / sum(Possible)) %>%
mutate(Grade = case_when(
Percentage >= 0.75 ~ 'A*',
Percentage >= 0.675 ~ 'A',
Percentage >= 0.60 ~ 'B',
Percentage >= 0.525 ~ 'C'
)) %>%
head() %>%
kable(format='markdown')
Name | Total | Percentage | Grade |
---|---|---|---|
Amelia | 28 | 0.700 | A |
Charlie | 26 | 0.650 | B |
Charlotte | 25 | 0.625 | B |
Ella | 31 | 0.775 | A* |
Emily | 27 | 0.675 | A |
George | 30 | 0.750 | A* |
To the end of the previous chain of commands, we’ll add some more in order to visualise students’ overall exam performance. Let’s use fct_relevel
to sort the grades A*, A, B, C rather than A, A*, B, C. Then we’ll plot a bar chart using ggplot
. We use fct_reorder
to reorder the names on the x axis depending on the percentage score achieved. The scale_y_continuous
options improve the y axis labels.
exam %>%
group_by(Name) %>%
summarise(Total = sum(Score),
Percentage = sum(Score) / sum(Possible)) %>%
mutate(Grade = case_when(
Percentage >= 0.75 ~ 'A*',
Percentage >= 0.675 ~ 'A',
Percentage >= 0.60 ~ 'B',
Percentage >= 0.525 ~ 'C'
),
Grade = fct_relevel(Grade, 'A*') # move A* before A
) %>%
ggplot(aes(fct_reorder(Name, Percentage), # reorder names by percentage
Percentage, fill=Grade)) + # colour bars by grade
geom_col() +
scale_y_continuous(breaks=pretty_breaks(),
labels=percent_format()) +
labs(title='Total percentage scores', x=NULL) +
coord_flip() # flip x/y axes to create horizontal bars
Performance for each question
Now let’s analyse which questions the students found easier and harder. This could be useful in order to direct future teaching.
We use the exam
tibble again, this time grouping by sub-question instead of by pupil. Here’s a preview of the first few rows:
exam %>%
group_by(Number, Part) %>%
# calculate average percentage score for each sub-question
summarise(Percentage = sum(Score) / sum(Possible)) %>%
head() %>%
kable(format='markdown')
Number | Part | Percentage |
---|---|---|
1 | a | 0.8500 |
1 | b | 1.0000 |
1 | c | 0.5500 |
1 | d | 0.6625 |
2 | a | 0.8000 |
2 | b | 0.6500 |
Adding on to the commands above, we’ll use ggplot
to visualise the average percentage mark for each sub-question. This time we’ll use facet_grid
to split the plot into sections, one per question number.
exam %>%
group_by(Number, Part) %>%
summarise(Percentage = sum(Score) / sum(Possible)) %>%
ggplot(aes(Part, Percentage)) +
geom_col() +
scale_y_continuous(labels=percent_format()) +
facet_grid(~ Number, scales='free', space='free') +
labs(title='Average percentage score by question',
x='Question number')
Detailed breakdown for each student
Above, we’ve created a summary for each student and each question. Now let’s create a graph for each student to help them understand their individual performance in each question.
In order to automate the process of creating a similar graph for each pupil, we pipe the list of pupil names from the markbook into the purrr package’s map
function. The argument given to map
is a function that we create which summarises the relevant student’s exam data and pipes it into ggplot
.
markbook %>%
pull(Name) %>%
map(function(pupil) {
exam %>%
filter(Name == pupil) %>%
group_by(Number) %>%
summarise(Score = sum(Score),
Possible = sum(Possible)) %>%
# fct_rev to put lowest question number at top of y axis
ggplot(aes(x=fct_rev(as.factor(Number)))) +
geom_col(aes(y=Possible), data=questions, fill='red', alpha=0.3) +
geom_col(aes(y=Score), fill='forestgreen') +
labs(title=paste('Mark breakdown for', pupil),
x='Question number', y='Marks') +
coord_flip()
})
One of the graphs created is shown below.
If each question included sub-questions covering different topics or using different techniques, we could’ve included a category for each sub-question in the questions.csv
file and used this category on the y axis instead of the question number.
Learning to use R and the tidyverse
If you haven’t already, why not read my other post about analysing student data using R and the tidyverse? If you’d like to learn more about R and the tidyverse, Hadley Wickham and Garrett Grolemund’s book R for Data Science (available electronically for free) is a great place to start. If you have any questions about using R to analyse student data, why not get in touch with me on Twitter?