From now on, we work inside a project to keep the workspace clear and coherent. Go to File –> New Project (choose either a new directory or an existing one). For instance, you can choose the folder you downloaded for this session from ILIAS. You can see an overview of the folder in the output pane.
Before getting to the actual content, the following section introduces the example story: You have left academia and decided to take over a local dog shelter. The previous owner kept very crappy records in Excel, so you decide to clean them up and complement them with more information. To make your furry friends more relatable, you contact a dog name expert and ask them to choose names for your dogs. You also use ChatGPT to create nice descriptions for them.
Along the way, we covered the following topics in the previous session:
- Concatenating strings with paste() or str_c()
- Changing case with str_to_upper() and str_to_lower()
- Extracting substrings with str_sub()
Our data from the previous session are saved as RDS files. Let’s bring them into R.
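library(tidyverse) # provides read_rds(), the pipe, and the dplyr, tidyr, and stringr functions used below
dogs <- read_rds("files/dogs.rds")
website_content <- read_rds("files/website_content.rds")
collar <- read_rds("files/collar.rds")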
Story: You decide to add a chart of your dogs’ characteristics to the website. For that, you specify some features such as “active”, “friendly”, “playful”, and “athletic”, and you look for these features in the descriptions in order to add them to the chart.
We use grepl() or the tidyverse function str_detect() to
look for a specified pattern. Both test for the presence
of a pattern, specified by a regular expression, in a
character vector. A regular expression is a string that describes a
search pattern.
The easiest way of using grepl() or
str_detect() is to look for an exact string in a
column.
website_content <- website_content %>%
  mutate(friendly = grepl("friendly", description), # grepl() already returns TRUE/FALSE
         playful  = grepl("playful", description),
         active   = grepl("active", description),
         athletic = grepl("athletic", description))
chart <- website_content %>%
  select(name, friendly, playful, active, athletic)
head(chart)
## name friendly playful active athletic
## 1 Sophronia TRUE TRUE FALSE FALSE
## 2 Buddy TRUE FALSE TRUE FALSE
## 3 Mara TRUE FALSE FALSE FALSE
## 4 Gracelyn FALSE TRUE TRUE FALSE
## 5 Broderick FALSE TRUE FALSE FALSE
## 6 Conrad FALSE FALSE TRUE TRUE
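One caveat with exact-string matching is that the pattern is found anywhere in the text, including inside longer words. The sketch below uses three made-up descriptions (not the shelter data) to compare grepl() and str_detect(), and to show how word boundaries (\b, written "\\b" in an R string) restrict the match to whole words.

```r
library(stringr)

# Hypothetical descriptions for illustration only
toy_description <- c("A friendly and active dog.",
                     "Rather inactive, but playful.",
                     "Calm and quiet.")

# Base R: pattern first, then the character vector
base_hit <- grepl("active", toy_description)

# stringr: character vector first, then the pattern
stringr_hit <- str_detect(toy_description, "active")

# "\\b" marks a word boundary, so "inactive" no longer counts
whole_word_hit <- str_detect(toy_description, "\\bactive\\b")

base_hit        # TRUE TRUE FALSE
whole_word_hit  # TRUE FALSE FALSE
```

Whether whole-word matching is wanted depends on the feature list; for the chart above it would keep an “inactive” dog from being flagged as “active”.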
Regular expressions (regex) define search patterns for text within strings or files.
Simple Example:
Finding “cat” in a text.
## [1] "cat"
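A call along the following lines would give this result. The input sentence is an assumption, borrowed from the examples further down in this section.

```r
library(stringr)

# Assumed input sentence; any text containing "cat" would do
sentence <- "The cat is sitting on the table."

# str_extract() returns the first piece of text that matches the pattern
found <- str_extract(sentence, pattern = "cat")
found  # "cat"
```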
Regular expressions use specific characters and symbols to define patterns.
. - Matches any single character except newline.
* - Matches the preceding element zero or more times.
text <- "cat mat bat"
text2 <- "The cat is sitting on the table."
str_extract(text, pattern = ".*") # Matches every character
## [1] "cat mat bat"
str_extract(text2, pattern = "sitting.*") # Matches everything from 'sitting' onward
## [1] "sitting on the table."
^ - Matches the start of a string.
text <- "First text"
text2 <- "Second text"
str_extract(text, pattern = "^First.*") # Matches because 'First' is at the beginning of text
## [1] "First text"
str_extract(text2, pattern = "^First") # No match: text2 does not begin with 'First'
## [1] NA
$ - Matches the end of a string.
text <- "here is the end"
text2 <- "here is the ending"
str_extract(text, pattern = "end$") # Matches because 'end' is at the end of text
## [1] "end"
str_extract(text2, pattern = "end$") # No match: text2 ends in 'ending', not 'end'
## [1] NA
str_extract(text2, pattern = "end.*$") # Matches from 'end' through the end of text2
## [1] "ending"
+ - Matches the preceding element one or more times.
text <- "Faaa Faaa"
str_extract(text, pattern = "a+") # Matches the first run of one or more 'a'
## [1] "aaa"
str_extract_all(text, pattern = "a+") # Matches every run of one or more 'a'
## [[1]]
## [1] "aaa" "aaa"
Please note that str_extract() is used to extract the
first instance of a pattern in each string, while
str_extract_all() is used to extract all instances of a
pattern in each string.
? - Matches the preceding element zero or one time.
text <- "colour color"
str_extract_all(text, pattern = "colou?r") # Matches both 'colour' and 'color'
## [[1]]
## [1] "colour" "color"
[] - Matches any one of the enclosed characters.
text <- "cat bat rat mat"
str_extract_all(text, pattern = "[br]at") # Matches 'bat' and 'rat'
## [[1]]
## [1] "bat" "rat"
() - Groups elements.
text <- "recode decode encode"
str_extract_all(text, pattern = "(re|de)code") # Matches recode and decode
## [[1]]
## [1] "recode" "decode"
| - Logical OR.
text <- "cat bat mat"
str_extract_all(text, pattern = "cat|bat") # Matches 'cat' and 'bat'
## [[1]]
## [1] "cat" "bat"
Character classes provide a way to match predefined sets of characters.
\d - Matches any digit (equivalent to [0-9]).
text <- "Room 101 10483839"
str_extract(text, pattern = "\\d") # Matches the first digit, '1'
## [1] "1"
str_extract(text, pattern = "\\d+") # Matches the first whole number
## [1] "101"
str_extract_all(text, pattern = "\\d+") # Matches every whole number
## [[1]]
## [1] "101" "10483839"
\w - Matches any word character (letters, digits, and underscores).
text <- "Room 101 , people"
str_extract(text, pattern = "\\w") # Matches the first word character
## [1] "R"
str_extract(text, pattern = "\\w+") # Matches the first run of word characters
## [1] "Room"
str_extract_all(text, pattern = "(\\w+|\\,)") # Matches every run of word characters and every comma
## [[1]]
## [1] "Room" "101" "," "people"
\s - Matches any whitespace character (spaces, tabs, line breaks).
text <- "Room 101"
str_extract(text, pattern = "\\s+") # Matches the first run of whitespace
## [1] " "
Custom Character Sets:
- Use square brackets to define a custom set of characters.
- Example: [A-Za-z] matches any uppercase or lowercase letter.
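As a minimal sketch of custom character sets, the example below extracts runs of letters and runs of digits from a made-up string. The string and the variable names are illustrative assumptions.

```r
library(stringr)

# Made-up example string
label <- "Room 101, Building B"

# One or more letters in a row (upper or lower case)
letters_only <- str_extract_all(label, "[A-Za-z]+")[[1]]

# One or more digits in a row, written as a range
digits_only <- str_extract_all(label, "[0-9]+")[[1]]

letters_only  # "Room" "Building" "B"
digits_only   # "101"
```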
Quantifiers control how many times an element is matched.
{} - Specifies a number (or range) of repetitions.
text <- "wooooow"
str_extract(text, pattern = "o{3}") # Matches 'o' exactly 3 times
## [1] "ooo"
text <- "aaaab"
text2 <- "ab"
# Greedy: Matches at least two 'a's
str_extract(text, "a{2,}")
## [1] "aaaa"
str_extract(text2, "a{2,}")
## [1] NA
text <- "aaab"
# Lazy: Matches as few 'a's as possible
str_extract(text, "a*?")
## [1] ""
str_extract(text, "a+?")
## [1] "a"
str_extract(text, "a{2,}?")
## [1] "aa"
| Regex | Match |
|---|---|
| \\. | . |
| \\! | ! |
| \\? | ? |
| \\\\ | \ |
| \\( | ( |
| \\n | new line |
| \\t | tab |
text <- "Here is a text in (parenthesis)"
str_extract(text, "\\(.*\\)")
## [1] "(parenthesis)"
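Escaped parentheses also make a handy test case for greedy versus lazy matching. The sketch below uses a made-up string with two parenthesised parts to show how far .* and .*? reach.

```r
library(stringr)

# Made-up example with two parenthesised parts
note <- "Call (today) or write (tomorrow)"

# Greedy: .* runs as far as possible, up to the LAST closing parenthesis
greedy <- str_extract(note, "\\(.*\\)")

# Lazy: .*? stops at the FIRST closing parenthesis
lazy <- str_extract(note, "\\(.*?\\)")

greedy  # "(today) or write (tomorrow)"
lazy    # "(today)"
```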
Very good cheatsheet for all string operations
Okay, time to get back to our shelter.
Story: You believe that the collar IDs would sound even friendlier if a “y” were added to the end of those collar IDs that do not end in a vowel or the letter “y”. Therefore, you plan to first identify the collar IDs that fulfill this criterion, and then append a “y” to the end of those collar IDs.
In the code below, the ifelse() function conditionally
assigns values to the collar_id_friendly column. The grepl()
or str_detect() function checks whether a string ends in a vowel
or “y” (i.e., “a”, “e”, “y”, “o”, “u”, “i”), using the regular
expression [aeyoui]$. The $ symbol indicates that the pattern
must match at the end of the string. If grepl() returns FALSE
(i.e., the string does not end in one of the letters “aeyoui”),
paste() concatenates the original collar_id value with the
character “y”. If grepl() returns TRUE, the original collar_id
value is assigned to collar_id_friendly without modification.
collar <- collar %>%
  mutate(collar_id_friendly = ifelse(grepl("[aeyoui]$", collar_id),
                                     collar_id,
                                     paste(collar_id, "y", sep = "")))
# alternative solution
collar <- collar %>%
  mutate(collar_id_friendly_alt =
           ifelse(str_detect(collar_id, "[aeyoui]$"),
                  collar_id,
                  str_c(collar_id, "y")))
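To see what the condition does, here is a small sketch on hypothetical collar IDs. These values are invented for illustration and are not taken from collar.rds.

```r
library(stringr)

# Hypothetical collar IDs for illustration only
toy_id <- c("Rex", "Fifi", "Bo", "Max", "Lucy")

# TRUE where the ID already ends in a vowel or "y"
ends_friendly <- str_detect(toy_id, "[aeyoui]$")

# Keep friendly endings, append "y" to the rest
toy_friendly <- ifelse(ends_friendly, toy_id, str_c(toy_id, "y"))

ends_friendly  # FALSE TRUE TRUE FALSE TRUE
toy_friendly   # "Rexy" "Fifi" "Bo" "Maxy" "Lucy"
```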
?grepl
?str_detect
Story: Since why not, you also choose to modify the collar IDs ending with “e” by changing it to “ie”.
We use the replacement functions gsub() or
str_replace() to replace a specified pattern with something
else. In the code below, grepl() checks whether the
string ends with “e”; if it does, gsub()
replaces that “e” with “ie”.
collar <- collar %>%
  mutate(collar_id_friendly = ifelse(grepl("e$", collar_id_friendly),
                                     gsub("e$", "ie", collar_id_friendly),
                                     collar_id_friendly))
# alternative solution (works on the _alt column, so the "e" is not replaced twice)
collar <- collar %>%
  mutate(collar_id_friendly_alt = ifelse(str_detect(collar_id_friendly_alt, "e$"),
                                         str_replace(collar_id_friendly_alt, "e$", "ie"),
                                         collar_id_friendly_alt))
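Both replacement functions leave strings without a match unchanged, so the surrounding ifelse() check is a readability choice rather than a requirement. The sketch below illustrates this on hypothetical IDs (invented for this example).

```r
library(stringr)

# Hypothetical collar IDs for illustration only
toy_id <- c("Rexy", "Abe", "Bo", "Elle")

# Only a final lowercase "e" is replaced; everything else is returned as is
with_gsub <- gsub("e$", "ie", toy_id)
with_str_replace <- str_replace(toy_id, "e$", "ie")

with_gsub                               # "Rexy" "Abie" "Bo" "Ellie"
identical(with_gsub, with_str_replace)  # TRUE
```

Because the pattern is anchored with $, it can match at most once per string, which is why gsub() (replace all) and str_replace() (replace first) agree here.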
Story: As mentioned, you have used ChatGPT to create descriptions for each dog. One day, a linguist friend of yours, Maria, comes to visit the shelter and is thoroughly impressed with the descriptions. She proposes collecting all the descriptions to form a corpus, a body of written texts. You are thrilled with the idea and gladly give Maria permission to use the descriptions. In return, you ask for her assistance in reviewing the descriptions for any potential weaknesses. First, you send Maria the descriptions and a unique ID for each dog (concatenating the name, breed, sex, and height of the dog).
dog_descriptions <- dogs %>%
  select(name, breed, sex, height, description) %>%
  mutate(unique_id = paste(name, breed, sex, height, sep = "_")) %>%
  select(unique_id, description)
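As a quick illustration of how paste() builds the ID, the sketch below concatenates hypothetical vectors whose values mirror two rows of the separate() output shown later in this section. The toy_ variable names are assumptions for this example.

```r
# Hypothetical vectors for illustration only
toy_name   <- c("Buddy", "Mara")
toy_breed  <- c("Labrador", "Saint Bernard")
toy_sex    <- c("male", "female")
toy_height <- c(62, 59)

# paste() works element by element and converts numbers to text
toy_unique_id <- paste(toy_name, toy_breed, toy_sex, toy_height, sep = "_")

toy_unique_id  # "Buddy_Labrador_male_62" "Mara_Saint Bernard_female_59"
```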
Story: Reviewing the data, Maria notices that there are some redundant white spaces in the description column. She decides to trim the description column to get rid of them.
The str_trim() function removes whitespace from the start
and end of a string.
dog_descriptions <- dog_descriptions %>%
  mutate(description = str_trim(description))
The str_squish() function removes whitespace at the start and end, and replaces all internal whitespace with a single space.
spaces <- "  This   is a string   with a lot  of spaces  "
str_trim(spaces)
## [1] "This   is a string   with a lot  of spaces"
str_squish(spaces)
## [1] "This is a string with a lot of spaces"
Story: Since ChatGPT can potentially generate wrong or incomplete descriptions, Maria wants to make sure that the data is accurate. She wants to check two things: (1) whether or not the breed of each dog has been mentioned in the description, and (2) whether the unit of measurement for height has been reported in the right format (cm in this case).
The first step is to split the concatenated unique_id column into its building blocks.
The code below uses the separate() function to split the unique_id column into four new columns: name, breed, sex, and height. In unique_id, these values are separated from each other by an underscore. The sep argument defines the separator used in the original column (here, the underscore). The into argument contains the names of the new columns. The remove argument is set to FALSE, so the original unique_id column is kept in the data after the transformation.
dog_descriptions <- dog_descriptions %>%
  separate(unique_id,
           into = c("name", "breed", "sex", "height"),
           sep = "_",
           remove = FALSE)
head(dog_descriptions[1:5])
## unique_id name breed sex height
## 1 Sophronia_Pug_female_32 Sophronia Pug female 32
## 2 Buddy_Labrador_male_62 Buddy Labrador male 62
## 3 Mara_Saint Bernard_female_59 Mara Saint Bernard female 59
## 4 Gracelyn_Dalmatian_female_57 Gracelyn Dalmatian female 57
## 5 Broderick_Bull Terrier_male_48 Broderick Bull Terrier male 48
## 6 Conrad_Weimaraner_male_66 Conrad Weimaraner male 66
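By default, separate() returns the new columns as text, so height would be a character column. The sketch below, on two hypothetical IDs, adds the optional convert = TRUE argument, which asks tidyr to guess better column types. Whether that is wanted here is an assumption.

```r
library(tidyr)

# Two hypothetical IDs in the same format as unique_id
toy_ids <- data.frame(unique_id = c("Buddy_Labrador_male_62",
                                    "Mara_Saint Bernard_female_59"))

toy_parts <- separate(toy_ids, unique_id,
                      into = c("name", "breed", "sex", "height"),
                      sep = "_",
                      remove = FALSE,
                      convert = TRUE)  # height becomes a number instead of text

toy_parts$breed   # "Labrador" "Saint Bernard"
toy_parts$height  # 62 59
```

Note that the space inside “Saint Bernard” is harmless because only the underscore acts as a separator.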
To check for the breed, the code below uses an ifelse function to check if the string in the “breed” column is detected in the “description” column.
dog_descriptions <- dog_descriptions %>%
  mutate(is_breed_there = ifelse(str_detect(description, breed), "yes", "no"))
table(dog_descriptions$is_breed_there)
##
## no yes
## 2 25
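A “no” does not necessarily mean the breed is missing. One possible cause (an assumption; the two cases above were not inspected here) is a difference in capitalisation, because str_detect() is case-sensitive by default. The sketch below shows this on made-up text, using stringr's regex() helper with ignore_case = TRUE.

```r
library(stringr)

# Hypothetical breeds and descriptions for illustration only
toy_breed <- c("Pug", "Labrador")
toy_text  <- c("Sophronia is a cuddly pug.", "Buddy is a gentle Labrador.")

# Default: case-sensitive, so "Pug" does not match "pug"
strict <- str_detect(toy_text, toy_breed)

# Case-insensitive matching via regex()
relaxed <- str_detect(toy_text, regex(toy_breed, ignore_case = TRUE))

strict   # FALSE TRUE
relaxed  # TRUE TRUE
```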
The code below first uses str_detect() to check whether the unit reported in the description is “cm” or something else (“other”). The result is stored in the column cm_vs_inch.
In the second mutate() call, if the value of cm_vs_inch is “other”, str_replace_all() replaces all instances of “ inches ” and “ inch ” with “ cm ” in the description column. If the value of cm_vs_inch is already “cm”, the description remains unchanged. Note that this only swaps the unit label; the number itself is not converted.
dog_descriptions <- dog_descriptions %>%
  mutate(cm_vs_inch = ifelse(str_detect(description, " cm"), "cm", "other")) %>%
  mutate(description = ifelse(cm_vs_inch == "other",
                              str_replace_all(description, " inch(es)? ", " cm "),
                              description))
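The pattern " inch(es)? " requires a space on both sides of the unit, so a unit directly followed by punctuation is left alone. The sketch below compares it with a word-boundary variant on made-up sentences. The variant is a suggestion for illustration, not part of the original code.

```r
library(stringr)

# Made-up sentences for illustration only
toy_text <- c("He is 24 inches tall.",
              "She grew 1 inch last month.",
              "She measures 20 inches.")

# Pattern from the code above: needs a space on both sides
spaced <- str_replace_all(toy_text, " inch(es)? ", " cm ")

# Variant with word boundaries: also catches "inches" before a period
bounded <- str_replace_all(toy_text, "\\binch(es)?\\b", "cm")

spaced[3]   # "She measures 20 inches."
bounded[3]  # "She measures 20 cm."
```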
Story: And again, because why not, Maria wants to check whether the name of each dog is mentioned often enough in the descriptions. Her assumption is that if the name appears only once, the text is not friendly enough and the description should be improved.
In the code below, str_count() counts how many times the value in the name column appears within the description column.
dog_descriptions <- dog_descriptions %>%
  mutate(friendliness = str_count(description, name))
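A small sketch of the counting step on made-up data: str_count() pairs each description with its own name, and a simple comparison then flags texts where the name appears at most once. The needs_rewrite flag is a hypothetical helper for this illustration.

```r
library(stringr)

# Hypothetical names and descriptions for illustration only
toy_name <- c("Buddy", "Mara")
toy_text <- c("Buddy loves long walks. Buddy is great with kids.",
              "Mara is a calm and gentle dog.")

# Number of times each name occurs in its own description
mentions <- str_count(toy_text, toy_name)

# Flag descriptions in which the name appears only once (or never)
needs_rewrite <- mentions <= 1

mentions       # 2 1
needs_rewrite  # FALSE TRUE
```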
Story: Finally, it is time for Maria to create the corpus of dog descriptions. She first decides to split the descriptions sentence-wise. She assumes that the punctuation mark period (.) marks the end of a sentence.
The code below uses separate_rows() to split the description column into one row per sentence, using the period (“.”) as the separator. Because sep is interpreted as a regular expression, the period is escaped as "\\." so that it matches a literal period rather than any character.
dog_corpus <- dog_descriptions %>%
  select(unique_id, description) %>%
  separate_rows(description, sep = "\\.")
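The sketch below runs the same split on a single made-up description. Splitting consumes the periods and can leave leading spaces or empty pieces behind, so the example also trims the results and drops empty rows.

```r
library(dplyr)
library(tidyr)
library(stringr)

# One hypothetical dog with a two-sentence description
toy <- tibble(unique_id = "Buddy_Labrador_male_62",
              description = "Buddy is kind. He loves walks.")

toy_sentences <- toy %>%
  separate_rows(description, sep = "\\.") %>%      # split at literal periods
  mutate(description = str_trim(description)) %>%  # remove the leading space
  filter(description != "")                        # drop empty leftovers

toy_sentences$description  # "Buddy is kind" "He loves walks"
```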
The code below shows a big pipeline containing various functions we talked about earlier and some new ones.
Here is a breakdown of what it does:
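- The description column is renamed to sentence, surrounding whitespace is trimmed, and empty rows are dropped.
- Within each dog (unique_id), the number of sentences, the position of each sentence, and the number of characters and words per sentence are computed with the n(), seq(), and str_count() functions, respectively.
- The previous and next sentences are retrieved with the lag() and lead() functions. The argument n in these functions defines from how many rows back or ahead the values should be chosen.
- Finally, all words starting with “f” or “F” are extracted with str_extract_all().
dog_corpus <- dog_corpus %>%
  rename(sentence = description) %>%
  mutate(sentence = str_trim(sentence)) %>%
  filter(sentence != "") %>%
  group_by(unique_id) %>%
  mutate(number_of_sentences = n(),
         which_sentence = seq(n()),
         count_char = str_count(sentence),
         count_word = str_count(sentence, '\\w+')) %>%
  mutate(previous_sentence = lag(sentence, n = 1),
         next_sentence = lead(sentence, n = 1)) %>%
  ungroup() %>%
  mutate(words_with_letterF = str_extract_all(sentence, "\\b[fF]\\w+\\b"))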
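To make the grouped steps easier to follow, the sketch below applies n(), seq(), lag(), and lead() to a tiny hypothetical corpus with two dogs. The IDs "a" and "b" and the sentences are placeholders.

```r
library(dplyr)

# Tiny hypothetical corpus: dog "a" has three sentences, dog "b" has one
toy_corpus <- tibble(unique_id = c("a", "a", "a", "b"),
                     sentence  = c("One", "Two", "Three", "Solo"))

toy_context <- toy_corpus %>%
  group_by(unique_id) %>%
  mutate(number_of_sentences = n(),                  # sentences per dog
         which_sentence = seq(n()),                  # position within the dog
         previous_sentence = lag(sentence, n = 1),   # one row back
         next_sentence = lead(sentence, n = 1)) %>%  # one row ahead
  ungroup()

toy_context$which_sentence     # 1 2 3 1
toy_context$previous_sentence  # NA "One" "Two" NA
toy_context$next_sentence      # "Two" "Three" NA NA
```

Because of group_by(), lag() and lead() never reach across dogs: the first sentence of each dog has no previous sentence and the last has no next one.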
Story: Maria sends the corpus back to you at the shelter. After much thinking, you and Maria decide to call it “FurryFriends Corpus”.
You look at the FurryFriends Corpus and you find it very cool. However, you are not 100% happy with it. You think you can do more with this corpus. You decide to activate your linguist side and create a better corpus. For the creation of this corpus, you decide to use automatic annotation with the udpipe package.
#install.packages("udpipe")
library(udpipe)
# Download a pre-trained model (for English in this case)
ud_model <- udpipe_download_model(language = "english")
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2023_12_13_Session9/english-ewt-ud-2.5-191206.udpipe
## - This model has been trained on version 2.5 of data from https://universaldependencies.org
## - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
## - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
## - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
## Downloading finished, model stored at 'C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2023_12_13_Session9/english-ewt-ud-2.5-191206.udpipe'
model <- udpipe_load_model(ud_model$file_model)
# Apply the udpipe model
annotations <- udpipe_annotate(model, x = dog_descriptions$description)
annotations_df <- as.data.frame(annotations)
dog_corpus <- annotations_df %>%
  select(1:12)