Tidytext walkthrough: correcting spellings and creating reproducible word clouds
In this post I’ll walk through the process of using hunspell to correct spellings automatically in a tidytext analysis. We’ll create a word cloud using purrr’s map
function and set.seed
to try out various different layouts and – given the same input data – reliably reproduce our favourite.
We’ll use the GNU GPL (in plain text format) as our example text; it may not be the most interesting piece of prose you’ll ever read (although an important one) but the process could be applied to other text such as responses to open-ended survey questions.
Loading packages and data
First we’ll load the required packages and put each individual word from the licence text into a tibble called words
.
library(readr)
library(dplyr)
library(tidytext)
words <- tibble(line = read_lines('gpl-3.0.txt')) %>%
# convert to lowercase later as this affects spellchecking
unnest_tokens('word', line, to_lower = FALSE)
head(words)
word |
---|
GNU |
GENERAL |
PUBLIC |
LICENSE |
Version |
3 |
Correcting spellings
Next we’ll define a function correct_spelling
which uses hunspell_check
to check spellings and hunspell_suggest
to suggest corrections, both using the British English dictionary. hunspell_suggest
returns a list of suggestions but we just want to return the first suggestion so we pipe its output through map(1)
and unlist
.
Before the automatic corrections, we define any manual corrections. In this case the American spelling ‘license’ is recognised as valid by hunspell_check
but I’d like to use the British spelling ‘licence’ so we’ll add a manual case to case_when
.
library(hunspell)
library(purrr)
correct_spelling <- function(input) {
output <- case_when(
# any manual corrections
input == 'license' ~ 'licence',
# check and (if required) correct spelling
!hunspell_check(input, dictionary('en_GB')) ~
hunspell_suggest(input, dictionary('en_GB')) %>%
# get first suggestion, or NA if suggestions list is empty
map(1, .default = NA) %>%
unlist(),
TRUE ~ input # if word is correct
)
# if input incorrectly spelled but no suggestions, return input word
ifelse(is.na(output), input, output)
}
Originally I had some problems using the correct_spelling
function with mutate
because if there were no suggestions for an incorrectly spelled word, nothing was returned; this meant that the output vector was shorter than the input vector, causing mutate
to fail. The .default = NA
argument to map
prevents this: if the suggestions list is empty it outputs NA
, which we then catch later and change it back to the input word.
Let’s see what words from the input text will be corrected by our correct_spelling
function. We’ll remove stop words first as we won’t include these in our word cloud later on and it will speed up the process.
words %>%
anti_join(stop_words, by = 'word') %>%
rename(original = word) %>%
group_by(original) %>%
summarise(count = n()) %>%
ungroup() %>% # so we can mutate word
mutate(suggestion = correct_spelling(original)) %>%
filter(suggestion != original)
original | count | suggestion |
---|---|---|
6b | 1 | 6 |
6d | 1 | 6 |
defenses | 1 | defences |
favor | 1 | favour |
fsf.org | 1 | forgoes |
Inc | 1 | Inca |
license | 27 | licence |
noncommercially | 1 | non commercially |
Sublicensing | 1 | Sub licensing |
WIPO | 1 | WIPE |
Some American spellings will be converted to British ones because we’re using hunspell’s British English (en_GB
) dictionary. There are also some words being corrected wrongly but as they only have one occurrence that won’t be an issue for our word cloud, which will only include the most common words.
Generating word clouds
Our correct_spelling
function takes a while to run so we’ll store the data we’re going to use to generate word clouds to avoid having to correct spellings multiple times. The table below shows the 5 most common words.
cloud_words <- words %>%
mutate(word = tolower(word)) %>%
anti_join(stop_words, by = 'word') %>%
# group and summarise here because correcting spellings is expensive
group_by(word) %>%
summarise(count = n()) %>%
ungroup() %>% # so we can mutate word
mutate(word = correct_spelling(word)) %>%
# group and summarise again to combine different
# original spellings of same word
group_by(word) %>%
summarise(count = sum(count))
cloud_words %>%
arrange(desc(count)) %>%
head(n = 5)
word | count |
---|---|
licence | 102 |
program | 49 |
source | 42 |
covered | 41 |
code | 34 |
Now we’ll create a word cloud to show the most common words in our text. The exact layout depends on the seed used by the random number generator, so we’ll produce multiple versions by using purrr’s map
function to set the seed before generating each version. This also means that – given the same input data – we could reproduce our favourite layout using set.seed
.
Strictly, we don’t need to pipe cloud_words
into wordcloud
: we could just use cloud_words$word
instead of the word
argument. However, I’ve used this method to demonstrate the use of with
, which can help if you need to pipe into wordcloud
following adjustments to the data earlier in the pipeline, such as anti_join
ing some words you’d like to remove.
Below are three different possible layouts for our word cloud.
library(wordcloud)
1:3 %>% map(function(seed) {
set.seed(seed)
cloud_words %>%
with(wordcloud(word, count, min.freq = 10,
colors = brewer.pal(5, 'Blues')))
})
If you have any questions or suggestions, why not get in touch with me on Twitter?