From now on, we work inside a project to keep a clear and coherent workspace. Go to File -> New Project and choose either a new directory or an existing one. For instance, you can choose the folder you downloaded for this session from ILIAS. The output pane then shows an overview of the folder.

Introduction and recap of the previous session

Before getting to the actual content, this section introduces the example story: You have left academia and decided to take over a local dog shelter. The previous owner kept very crappy records in Excel, so you decide to clean them up and complement them with more information. To make your furry friends more relatable, you contact a dog name expert and ask them to choose names for your dogs. You also use ChatGPT to create nice descriptions for them.

Through this process, we learned the following topics in the previous session:

Import libraries

Set up working directory

Get data into R

Our data from the previous session is saved in RDS files. Let’s bring them into R.

library(tidyverse)  # provides read_rds(), %>% and the stringr functions used below

dogs <- read_rds("files/dogs.rds")

website_content <- read_rds("files/website_content.rds")

collar <- read_rds("files/collar.rds")
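For anyone who missed the previous session, here is a minimal sketch of how an object gets into an RDS file and back out again. The data frame toy_dogs and the temporary file are made up for illustration; write_rds() and read_rds() come from the readr package.

```r
library(readr)

# A made-up data frame standing in for the real shelter data
toy_dogs <- data.frame(name = c("Buddy", "Mara"), height = c(62, 59))

# Write to a temporary file so nothing in the project folder is touched
rds_path <- tempfile(fileext = ".rds")
write_rds(toy_dogs, rds_path)

# Reading the file gives back the same R object, column types included
toy_dogs_again <- read_rds(rds_path)
identical(toy_dogs, toy_dogs_again)
```

Unlike a CSV file, an RDS file stores a single R object together with its column types, so nothing has to be re-parsed on import.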

String operations

Pattern matching and modifications

Story: You decide to add a chart showing the characteristics of your dogs to the website. For that, you specify some features such as “active”, “friendly”, “playful”, and “athletic”, and you look for these features in the descriptions in order to add them to the chart.

We use the base R function grepl() or the tidyverse function str_detect() to look for the specified pattern. Both search for the presence of a pattern, specified by a regular expression, in a character vector and return TRUE or FALSE for each element. A regular expression is a string that describes a search pattern.

The easiest way of using grepl() or str_detect() is to look for an exact string in a column.

website_content <- website_content %>% 
  mutate(friendly = grepl("friendly", description),  # grepl() already returns TRUE/FALSE
         playful  = grepl("playful", description),
         active   = grepl("active", description),
         athletic = grepl("athletic", description))

chart <- website_content %>% 
  select(name, friendly, playful, active, athletic)

head(chart)
##        name friendly playful active athletic
## 1 Sophronia     TRUE    TRUE  FALSE    FALSE
## 2     Buddy     TRUE   FALSE   TRUE    FALSE
## 3      Mara     TRUE   FALSE  FALSE    FALSE
## 4  Gracelyn    FALSE    TRUE   TRUE    FALSE
## 5 Broderick    FALSE    TRUE  FALSE    FALSE
## 6    Conrad    FALSE   FALSE   TRUE     TRUE
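To see what the two functions return on their own, here is a small sketch on a made-up vector of descriptions (the texts are invented for illustration). It also shows that matching is case-sensitive unless you ask otherwise.

```r
library(stringr)

# Made-up descriptions, for illustration only
descriptions <- c("A friendly and playful pug.",
                  "Calm, but very Friendly.",
                  "An athletic runner.")

# Base R: pattern first, then the character vector
found_base <- grepl("friendly", descriptions)

# stringr: character vector first, then the pattern
found_stringr <- str_detect(descriptions, "friendly")

# Matching is case-sensitive by default, so "Friendly" is missed above;
# regex(..., ignore_case = TRUE) relaxes that
found_any_case <- str_detect(descriptions, regex("friendly", ignore_case = TRUE))

found_base
found_any_case
```

Both functions return one logical value per element, which is why their result can be stored directly in a column.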

An introduction to Regular Expressions

What are Regular Expressions?

Regular expressions (regex) define search patterns for text within strings or files.

Key Points:

  • Regex is a sequence of characters defining a search pattern.
  • Used for text processing in various languages and tools (e.g., R, Python)
  • Can match, find, and replace strings.

Applications of regular expressions

  • Identifying patterns in texts.
  • Cleaning data
  • Data validation (e.g., format of emails)
  • Search and Replace

Simple Example:

Finding “cat” in a text.

## [1] "cat"
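The call that produced this output is not shown in the notes. A minimal sketch that gives the same result, assuming a made-up sentence and the str_extract() function from stringr:

```r
library(stringr)

sentence <- "The cat sat on the mat."  # made-up sentence

# The pattern "cat" contains no special symbols, so it matches itself
found <- str_extract(sentence, pattern = "cat")
found
```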

Basics of Regex Syntax

Fundamental Characters and Symbols

Regular expressions use specific characters and symbols to define patterns.

  • Dot . - Matches any single character except newline.
  • Asterisk * - Matches the preceding element zero or more times.
text <- "cat mat bat"
text2 <- "The cat is sitting on the table."

str_extract(text, pattern = ".*") # Matches every character
## [1] "cat mat bat"
str_extract(text2, pattern = "sitting.*") # Matches everything from 'sitting' onward
## [1] "sitting on the table."
  • Caret ^ - Matches the start of a string.
text <- "First text"
text2 <- "Second text"

str_extract(text, pattern = "^First.*") # Matches: 'First' is at the beginning of text
## [1] "First text"
str_extract(text2, pattern = "^First") # No match: text2 does not begin with 'First'
## [1] NA
  • Dollar $ - Matches the end of a string.
text <- "here is the end"
text2 <- "here is the ending"

str_extract(text, pattern = "end$") # Matches if 'end' is at the end
## [1] "end"
str_extract(text2, pattern = "end$") # No match: text2 ends with 'ending', not 'end'
## [1] NA
str_extract(text2, pattern = "end.*$") # Matches 'end' and everything after it up to the end of the string
## [1] "ending"
  • Plus + - Matches the preceding element one or more times.
text <- "Faaa Faaa"

str_extract(text, pattern = "a+") # First run of one or more 'a'
## [1] "aaa"
str_extract_all(text, pattern = "a+") # Every run of one or more 'a'
## [[1]]
## [1] "aaa" "aaa"

Please note that str_extract() is used to extract the first instance of a pattern in each string, while str_extract_all() is used to extract all instances of a pattern in each string.

  • Question Mark ? - Matches the preceding element zero or one time.
text <- "colour color"

str_extract_all(text, pattern = "colou?r") # Matches both 'colour' and 'color'
## [[1]]
## [1] "colour" "color"
  • Square Brackets [] - Matches any one of the enclosed characters.
text <- "cat bat rat mat"

str_extract_all(text, pattern = "[br]at") # Matches 'bat' and 'rat'
## [[1]]
## [1] "bat" "rat"
  • Parentheses () - Groups elements.
text <- "recode decode encode"

str_extract_all(text, pattern = "(re|de)code") # Matches recode and decode
## [[1]]
## [1] "recode" "decode"
  • Pipe | - Logical OR.
text <- "cat bat mat"
str_extract_all(text, pattern = "cat|bat") # Matches 'cat' and 'bat'
## [[1]]
## [1] "cat" "bat"

Character Classes and Sets

Character classes provide a way to match predefined sets of characters.

  • Digit \d - Matches any digit (equivalent to [0-9]).
text <- "Room 101 10483839"

str_extract(text, pattern = "\\d") # Matches '1'
## [1] "1"
str_extract(text, pattern = "\\d+") # Matches the first whole number
## [1] "101"
str_extract_all(text, pattern = "\\d+") # Matches every whole number
## [[1]]
## [1] "101"      "10483839"
  • Word \w - Matches any word character (letters, digits, and underscores).
text <- "Room 101 , people"

str_extract(text, pattern = "\\w")
## [1] "R"
str_extract(text, pattern = "\\w+") # Matches the first run of word characters
## [1] "Room"
str_extract_all(text, pattern = "(\\w+|\\,)") # Matches every run of word characters and every comma
## [[1]]
## [1] "Room"   "101"    ","      "people"
  • Whitespace \s - Matches any whitespace character (spaces, tabs, line breaks)
text <- "Room      101"
str_extract(text, pattern = "\\s+") # Matches the run of spaces between 'Room' and '101'
## [1] "      "

Custom Character Sets:

  • Use square brackets to define a custom set of characters.
  • Example: [A-Za-z] matches any uppercase or lowercase letter.
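A short sketch of custom sets in action, using a made-up room label. The ranges A-Z, a-z and 0-9 can be combined inside one pair of brackets or used in separate brackets next to each other.

```r
library(stringr)

text <- "Room B12, Floor 3"  # made-up example string

# One or more letters in a row, upper or lower case
letters_found <- str_extract_all(text, "[A-Za-z]+")

# One capital letter directly followed by one or more digits
room_code <- str_extract(text, "[A-Z][0-9]+")

letters_found
room_code
```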

Quantifiers and Repetitions

Quantifiers control how many times an element is matched.

  • Curly Braces {} - Specify a specific number of repetitions.
text <- "wooooow"

str_extract(text, pattern = "o{3}") # Matches 'o' exactly 3 times
## [1] "ooo"
  • Greedy Quantifiers: Greedy quantifiers try to match as much of the string as possible. Common greedy quantifiers are * (zero or more), + (one or more), and {n,} (at least n).
text <- "aaaab"

text2 <- "ab"

# Greedy: Matches at least two 'a's
str_extract(text, "a{2,}")
## [1] "aaaa"
str_extract(text2, "a{2,}")
## [1] NA
  • Lazy Quantifiers: Lazy quantifiers try to match as little of the string as possible. They are often represented by adding a ? after a greedy quantifier, such as *?, +?, or {n,}?.
text <- "aaab"

# Lazy: Matches as few 'a's as possible
str_extract(text, "a*?")
## [1] ""
str_extract(text, "a+?")
## [1] "a"
str_extract(text, "a{2,}?")
## [1] "aa"
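The difference between greedy and lazy matching matters most when a pattern has a clear end marker that occurs more than once. A small sketch, assuming a made-up string with two tagged names:

```r
library(stringr)

text <- "<b>Buddy</b> and <b>Mara</b>"  # made-up example string

# Greedy: .* grabs as much as it can, so the match runs to the last </b>
greedy <- str_extract(text, "<b>.*</b>")

# Lazy: .*? grabs as little as it can, so the match stops at the first </b>
lazy <- str_extract(text, "<b>.*?</b>")

greedy
lazy
```

The greedy version returns the whole string, the lazy version only the first tagged name.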

Very common special characters to match:

Regex (as typed in R)   Matches
\\.                     .
\\!                     !
\\?                     ?
\\\\                    \
\\(                     (
\\n                     new line
\\t                     tab
text <- "Here is a text in (parenthesis)"

str_extract(text, "\\(.*\\)")
## [1] "(parenthesis)"
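Why the escape matters is easiest to see with the dot. In the sketch below (made-up file names), the unescaped dot matches any character, while the escaped one only matches a literal period. The backslash is doubled because R itself uses the backslash as an escape character inside strings. stringr's fixed() is an alternative that treats the whole pattern as plain text.

```r
library(stringr)

files <- c("dogs.rds", "dogs_rds", "collar.rds")  # made-up file names

# Unescaped dot: matches any character, so "dogs_rds" is a hit too
any_char <- str_detect(files, "dogs.rds")

# Escaped dot: only a literal "." matches
literal_dot <- str_detect(files, "dogs\\.rds")

# fixed(): the pattern is plain text, no escaping needed
plain_text <- str_detect(files, fixed("dogs.rds"))

any_char
literal_dot
plain_text
```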

Very good cheatsheet for all string operations

Okay, time to get back to our shelter.

Story: You believe that the collar IDs would sound even friendlier if a “y” were added to the end of those collar IDs that do not end in a vowel or the letter “y”. Therefore, you plan to first identify the collar IDs that fulfill this criterion, and then append a “y” to the end of those collar IDs.

In the code below, ifelse() conditionally assigns values to the collar_id_friendly column. grepl() (or str_detect()) checks whether a collar ID ends in a vowel or “y”, using the regular expression [aeyoui]$. The $ symbol anchors the pattern to the end of the string. If grepl() returns FALSE (i.e., the string does not end in one of the letters “aeyoui”), paste() appends “y” to the original collar_id value. If grepl() returns TRUE, the original collar_id value is copied to collar_id_friendly without modification.

collar <- collar %>% 
  mutate(collar_id_friendly = ifelse(grepl("[aeyoui]$", collar_id),
                                     collar_id,
                                     paste(collar_id, "y", sep = "")))

#alternative solution
collar <- collar %>% 
  mutate(collar_id_friendly_alt =
           ifelse(str_detect(collar_id,"[aeyoui]$"),
                                         collar_id,
                                         str_c(collar_id, "y")))

?grepl       # opens the help page for grepl()
?str_detect  # opens the help page for str_detect()
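The same logic can be tried out on a plain vector before touching the data frame. The collar IDs below are made up for illustration; only base R is used.

```r
# Made-up collar IDs, for illustration only
ids <- c("Rex", "Luna", "Fluffy", "Max")

# TRUE where the ID already ends in a vowel or "y"
ends_in_vowel_or_y <- grepl("[aeyoui]$", ids)

# Keep those IDs as they are, append "y" to the rest
friendly_ids <- ifelse(ends_in_vowel_or_y, ids, paste0(ids, "y"))

ends_in_vowel_or_y
friendly_ids
```

Note that the character set only contains lower-case letters, so an ID ending in a capital vowel would still get a “y” appended.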

Story: Since why not, you also choose to modify the collar IDs ending with “e” by changing it to “ie”.

We use the replacement functions gsub() or str_replace() to replace a specified pattern with something else. In the code below, grepl() checks whether the string ends with “e”, and if it does, gsub() replaces that final “e” with “ie”. gsub() replaces every match and str_replace() only the first one, but with the $ anchor there is at most one match per string, so both give the same result here.

collar <- collar %>% 
  mutate(collar_id_friendly = ifelse(grepl("e$",collar_id_friendly),
                                     gsub("e$", "ie", collar_id_friendly),
                                     collar_id_friendly))

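#alternative solution (works on collar_id_friendly_alt, because collar_id_friendly already ends in "ie" at this point)
collar <- collar %>% 
  mutate(collar_id_friendly_alt = ifelse(str_detect(collar_id_friendly_alt, "e$"),
                                         str_replace(collar_id_friendly_alt, "e$", "ie"),
                                         collar_id_friendly_alt))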
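Two details of this replacement are worth checking on a toy vector (the IDs are made up). First, the replacement functions leave strings without a match untouched, so the surrounding ifelse() is a safeguard rather than a necessity. Second, running the replacement twice on the same column keeps adding an “i”, because “ie” also ends in “e”.

```r
ids <- c("Rexy", "Belle", "Luna")  # made-up IDs

# sub() returns non-matching strings unchanged
once <- sub("e$", "ie", ids)

# Applying the same replacement again changes "ie" into "iie"
twice <- sub("e$", "ie", once)

once
twice
```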

Story: As mentioned, you have used ChatGPT to create descriptions for each dog. One day, a linguist friend of yours, Maria, comes to visit the shelter and is thoroughly impressed with the descriptions. She proposes collecting all the descriptions to form a corpus, a body of written texts. You are thrilled with the idea and gladly give Maria permission to use the descriptions. In return, you ask for her assistance in reviewing the descriptions for any potential weaknesses. First, you send Maria the descriptions and a unique ID for each dog (concatenating name, breed, sex, and height of the dog).

dog_descriptions <- dogs %>% 
  select(name, breed, sex, height, description) %>% 
  mutate(unique_id = paste(name, breed, sex, height, sep = "_")) %>% 
  select(unique_id, description)

Trimming

Story: Reviewing the data, Maria notices that there is some redundant whitespace in the description column. She decides to trim the description column to get rid of it.

The str_trim() function removes whitespace from the start and end of a string.

dog_descriptions <- dog_descriptions %>% 
  mutate(description = str_trim(description)) 

The str_squish() function removes whitespace at the start and end, and replaces each run of internal whitespace with a single space.

spaces <- "  This is   a string   with a lot of    spaces "

str_trim(spaces)
## [1] "This is   a string   with a lot of    spaces"
str_squish(spaces)
## [1] "This is a string with a lot of spaces"

Separating a column into multiple columns

Story: Since ChatGPT can potentially generate wrong or incomplete descriptions, Maria wants to make sure that the data is accurate. She wants to check two things: (1) whether or not the breed of each dog has been mentioned in the description, and (2) whether the unit of measurement for height has been reported in the right format (cm in this case).

The first step is to split the concatenated unique_id column into its building blocks.

The code below uses the separate() function to split the column unique_id into several new columns, namely name, breed, sex, and height. In the column unique_id, these values are separated from each other by an underscore. The argument sep defines the separator used in the original column (in this case the underscore). The argument into contains the names of the new columns. The remove argument is set to FALSE, which means the original unique_id column is not removed and stays in the data after the transformation.

dog_descriptions <- dog_descriptions %>% 
  separate(., unique_id, 
           into = c("name", "breed", "sex", "height"), 
           sep = "_", 
           remove = FALSE)

head(dog_descriptions[1:5])
##                        unique_id      name         breed    sex height
## 1        Sophronia_Pug_female_32 Sophronia           Pug female     32
## 2         Buddy_Labrador_male_62     Buddy      Labrador   male     62
## 3   Mara_Saint Bernard_female_59      Mara Saint Bernard female     59
## 4   Gracelyn_Dalmatian_female_57  Gracelyn     Dalmatian female     57
## 5 Broderick_Bull Terrier_male_48 Broderick  Bull Terrier   male     48
## 6      Conrad_Weimaraner_male_66    Conrad    Weimaraner   male     66
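The effect of the remove argument is easiest to see side by side. A small sketch on a made-up two-row data frame, using separate() from tidyr:

```r
library(tidyr)

# Made-up IDs in the same name_breed_sex_height format
toy <- data.frame(unique_id = c("Buddy_Labrador_male_62",
                                "Mara_Saint Bernard_female_59"))

# Default behaviour: the source column is dropped
dropped <- separate(toy, unique_id,
                    into = c("name", "breed", "sex", "height"), sep = "_")

# remove = FALSE keeps unique_id next to the new columns
kept <- separate(toy, unique_id,
                 into = c("name", "breed", "sex", "height"), sep = "_",
                 remove = FALSE)

names(dropped)
names(kept)

# The new columns are character vectors unless convert = TRUE is set
class(kept$height)
```

Because the split happens on the underscore only, a breed with a space in it (“Saint Bernard”) stays in one piece.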

Pattern matching and replacements

To check for the breed, the code below uses an ifelse function to check if the string in the “breed” column is detected in the “description” column.

dog_descriptions <- dog_descriptions %>% 
  mutate(is_breed_there = ifelse(str_detect(description, breed), "yes", "no"))

table(dog_descriptions$is_breed_there)
## 
##  no yes 
##   2  25
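str_detect() works element by element here: each description is searched for the breed in the same row. A “no” therefore does not necessarily mean the breed is missing, since a difference in capitalisation is enough to prevent a match. The sketch below uses made-up descriptions and breeds to show both points. Lower-casing both sides is one simple way to make the check case-insensitive.

```r
library(stringr)

# Made-up data, for illustration only
description <- c("Buddy is a cheerful Labrador.",
                 "Mara is a gentle saint bernard.",
                 "Conrad loves long runs.")
breed <- c("Labrador", "Saint Bernard", "Weimaraner")

# Element-wise: description[i] is searched for breed[i]
exact <- str_detect(description, breed)

# Lower-casing both sides removes the effect of capitalisation
any_case <- str_detect(str_to_lower(description), str_to_lower(breed))

exact
any_case
```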

The code below first uses str_detect() to check whether the unit reported in the description is “cm” or something else (“other”). The result is stored in the column cm_vs_inch.

In the second mutate() call, if the value of cm_vs_inch is “other”, str_replace_all() replaces all instances of “ inches ” and “ inch ” with “ cm ” in the description column. If the value of cm_vs_inch is already “cm”, the value of description remains unchanged.

dog_descriptions <- dog_descriptions %>%
   mutate(cm_vs_inch = ifelse(str_detect(description, " cm"), "cm", "other")) %>% 
   mutate(description = ifelse(cm_vs_inch == "other", 
                               str_replace_all(description, " inch(es)? ", " cm "),
                               description)) 
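What the pattern “ inch(es)? ” does and does not catch can be checked on a few made-up sentences. Two things stand out in this sketch: the replacement only swaps the unit label and leaves the number as it is, and a unit directly followed by a period is not matched because the pattern requires a trailing space.

```r
library(stringr)

# Made-up sentences, for illustration only
desc <- c("He is 24 inches tall.",
          "About 1 inch of fur.",
          "She measures 60 cm.",
          "Height: 24 inches.")

# (es)? makes the plural ending optional, so "inch" and "inches" both match
swapped <- str_replace_all(desc, " inch(es)? ", " cm ")

swapped
```

The first two sentences change, the third has nothing to replace, and the fourth is left untouched because of the period.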

Counting instances

Story: And again, since why not, Maria wants to check whether the name of each dog is mentioned often enough in the descriptions. Her assumption is that if the name is mentioned only once, the text is not friendly enough and the description should be improved.

In the code below, str_count() counts how many times each dog’s name (the name column) appears in its description (the description column).

dog_descriptions <- dog_descriptions %>% 
  mutate(friendliness = str_count(description, name))
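Like str_detect(), str_count() pairs each string with the pattern in the same position. A minimal sketch with two made-up descriptions:

```r
library(stringr)

# Made-up data, for illustration only
description <- c("Buddy is fun. Buddy loves walks.",
                 "A calm dog named Mara.")
name <- c("Buddy", "Mara")

# Counts "Buddy" in the first description and "Mara" in the second
name_count <- str_count(description, name)

name_count
```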

Row-wise split

Story: Finally, it is time for Maria to create the corpus of dog descriptions. She first decides to split the descriptions sentence-wise. She assumes that the punctuation mark period (.) marks the end of a sentence.

The code below uses the function separate_rows() to split the description column into one row per sentence, using the period (“.”) as the separator. Because the period is a special character in regular expressions, it is escaped and written as “\\.”.

dog_corpus <- dog_descriptions %>% 
  select(unique_id, description) %>% 
  separate_rows(description, sep = "\\.")
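A sketch of the row-wise split on a single made-up description, using separate_rows() from tidyr. The pieces keep their leading space, and depending on the tidyr version an empty piece may be left over after the final period, so the sketch trims the text and drops empty strings afterwards.

```r
library(tidyr)

# Made-up one-row data frame, for illustration only
toy <- data.frame(unique_id = "Buddy_Labrador_male_62",
                  description = "Buddy is fun. He loves walks.")

# One row per piece of text between periods; unique_id is repeated
sentences <- separate_rows(toy, description, sep = "\\.")

# Tidy up: remove surrounding whitespace, then drop empty pieces
sentences$description <- trimws(sentences$description)
sentences <- sentences[sentences$description != "", ]

sentences
```

Splitting on every period is a simplification: abbreviations such as “Dr.” or decimal numbers would be cut apart as well.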

A recap of different tidyverse operations + some new ones

The code below shows a big pipeline containing various functions we talked about earlier and some new ones.

Here is a breakdown of what it does:

  • The function rename() in the second line renames the description column to sentence.
  • The function str_trim() in the third line trims any leading or trailing whitespace from the sentence column.
  • The function filter() in the fourth line filters out any rows where the sentence column is empty.
  • The group_by() function in the fifth line groups the rows by their unique_id column.
  • The function mutate() [lines 6-9] calculates various sentence properties within each group, including the total number of sentences (number_of_sentences), the order of the sentence within the group (which_sentence), the count of characters in each sentence (count_char), and the count of words in each sentence (count_word). These calculations use the n(), seq(), and str_count() functions (str_count() is used for both counts).
  • The function mutate() [lines 10-11] adds new columns to each row containing the previous (previous_sentence) and next (next_sentence) sentences within the group using the lag() and lead() functions. The argument n in these functions defines how many rows back or ahead the value is taken from.
  • The function ungroup() on line 12 removes the grouping again.
  • Finally, the function str_extract_all() on line 13 extracts all words in each sentence starting with the letter “f” or “F” and saves the results in a new column called words_with_letterF. This is done using the str_extract_all() function from the stringr package with the regular expression pattern \\b[fF]\\w+\\b.
dog_corpus <- dog_corpus %>% 
  rename(sentence = description) %>% 
  mutate(sentence = str_trim(sentence)) %>% 
  filter(sentence != "") %>% 
  group_by(unique_id) %>% 
  mutate(number_of_sentences = n(),
         which_sentence = seq(n()),
         count_char = str_count(sentence),
         count_word = str_count(sentence, '\\w+')) %>%
  mutate(previous_sentence = lag(sentence, n = 1),
         next_sentence = lead(sentence, n = 1)) %>% 
  ungroup() %>% 
  mutate( words_with_letterF = str_extract_all(sentence, "\\b[fF]\\w+\\b"))
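The grouped steps are easier to follow on a handful of rows. The sketch below rebuilds the core of the pipeline on a made-up four-sentence tibble with two groups. It shows that n(), seq(n()), lag() and lead() all restart at each group boundary, and what the word-boundary pattern returns.

```r
library(dplyr)
library(stringr)

# Made-up sentences: three for dog "a", one for dog "b"
toy <- tibble(unique_id = c("a", "a", "a", "b"),
              sentence  = c("Fido is friendly", "He runs fast",
                            "Fun dog", "Mara naps"))

toy_corpus <- toy %>%
  group_by(unique_id) %>%
  mutate(number_of_sentences = n(),               # rows in the current group
         which_sentence = seq(n()),               # 1, 2, ... within the group
         previous_sentence = lag(sentence, n = 1),
         next_sentence = lead(sentence, n = 1)) %>%
  ungroup()

# \\b marks a word boundary, so only whole words starting with f/F are returned
f_words <- str_extract_all(toy$sentence, "\\b[fF]\\w+\\b")

toy_corpus
f_words
```

The first sentence of a group has no previous sentence and the last one has no next sentence, so those cells are NA. The result of str_extract_all() is a list with one character vector per sentence, which is why the corresponding column in the corpus is a list column.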

Story: Maria sends the corpus back to you at the shelter. After much thinking, you and Maria decide to call it “FurryFriends Corpus”.

You look at the FurryFriends Corpus and you find it very cool. However, you are not 100% happy with it. You think you can do more with this corpus. You decide to activate your linguist side and create a better corpus. For the creation of this corpus, you decide to use automatic annotation with the udpipe package.

#install.packages("udpipe")
library(udpipe)

# Download a pre-trained model (for English in this case)
ud_model <- udpipe_download_model(language = "english")
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2023_12_13_Session9/english-ewt-ud-2.5-191206.udpipe
##  - This model has been trained on version 2.5 of data from https://universaldependencies.org
##  - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
##  - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
##  - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
## Downloading finished, model stored at 'C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2023_12_13_Session9/english-ewt-ud-2.5-191206.udpipe'
model <- udpipe_load_model(ud_model$file_model)


# Apply the udpipe model
annotations <- udpipe_annotate(model, x = dog_descriptions$description)
annotations_df <- as.data.frame(annotations)

dog_corpus <- annotations_df %>% 
  select(1:12)