From now on, we do things inside a project to keep a clear and coherent workspace. Go to File –> New Project (choose either a New directory or an existing one). For instance, you can choose the folder you downloaded for this session from ILIAS. You can have an overview of the folder in the output pane.

Introduction and recap of the previous session

Before getting to the actual content, the following section introduces the example story: You have left academia and decided to take over a local dog shelter. The previous owner has kept very crappy records in Excel, so you decide to clean them up and complement them with more information. To make your furry friends more relatable, you contact a dog name expert and ask them to choose names for your dogs. You also use ChatGPT to create nice descriptions for them.

Through this process, we learned the following topics in the previous session:

Using ifelse() for conditional statements.
Row-binding and column-binding different dataframes.
The concept of categorization.
Merging different dataframes.

Import libraries

Set up working directory

Get data into R

Our data from the previous session is saved in 3 formats, namely RDS, CSV, and TSV. Let’s bring them into R.

dogs <- read_rds("files/dogs.rds")

Reshaping

Story: The annual weighing of dogs at the shelter is an important practice. You have a specific scale to do so. Unfortunately, the manufacturers of the scale have recently come forward with news that their product is not accurately measuring the weight of dogs under 5 kilograms and that these values need to be increased by 10%.

pivot_longer()

Given that the weight information is currently spread across multiple columns (weight_20, weight_21, weight_22, and weight_23), consolidating it into a single column could make the necessary changes more manageable.

The function pivot_longer() helps us reshape the data from a “wide” format to a “long” format, where each variable is in a single column and each observation is in a separate row. This is useful when working with data that has multiple values for a single observation in different columns (e.g., weight info in our dogs dataframe).

Let’s see how we can apply this to our dog dataframe.

For simplicity and visual reasons, I create a smaller dataframe (dog_weight) by selecting only the name, breed, and the weight columns.

Please note that you can run pivot_longer on the full dataframe. You just need to mention which columns need to be pivoted.
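As a minimal sketch of pivoting only some columns of a larger dataframe, the example below uses a small made-up dataframe (toy_dogs and its values are hypothetical and not part of the shelter data). The helper starts_with() selects all columns whose names begin with “weight_”, so they do not have to be typed out one by one.

```r
library(tidyr)

# Hypothetical toy data; names and values are made up for illustration
toy_dogs <- data.frame(
  name = c("Rex", "Mia"),
  sex = c("male", "female"),
  weight_22 = c(4.0, 12.5),
  weight_23 = c(4.2, NA)
)

# Only the columns starting with "weight_" are pivoted;
# name and sex are repeated for every resulting row
toy_long <- pivot_longer(toy_dogs,
  cols = starts_with("weight_"),
  names_to = "year",
  values_to = "weight",
  values_drop_na = TRUE # Mia's missing 2023 weight does not become a row
)

toy_long
```

The columns that are not listed in cols stay as they are, which is why the full dataframe can be pivoted directly.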

dog_weight <- dogs %>%
  select(name, breed, weight_20, weight_21, weight_22, weight_23)

The code below uses the function pivot_longer() to reshape the columns weight_20, weight_21, weight_22, and weight_23 into a longer format.

The cols argument is used to specify the columns that we want to pivot, in this case, the columns “weight_20”, “weight_21”, “weight_22”, and “weight_23”. The names_to argument is used to specify the new column to create from the information stored in the column names of data specified by cols. The values_to argument is used to specify the new column for storing the data stored in cell values. The argument values_drop_na is set to TRUE, so any missing weight values in the columns are not added as extra rows.

longer_weight <- pivot_longer(dog_weight,
    cols = c(weight_20, weight_21, weight_22, weight_23),
    names_to = "year", #translated as: write name of the column to the column year
    values_to = "weight", #write values of the columns to the column weight
    values_drop_na = TRUE
  )

#let us arrange the values to see which ones are under 5 kilo

head(arrange(longer_weight, weight), n=7)

## # A tibble: 7 × 4
##   name      breed             year      weight
##   <chr>     <chr>             <chr>      <dbl>
## 1 Harvey    Chihuahua         weight_22   1.34
## 2 Harvey    Chihuahua         weight_23   1.82
## 3 Emmett    Yorkshire Terrier weight_23   3.21
## 4 Emmett    Yorkshire Terrier weight_22   3.46
## 5 Sophronia Pug               weight_22   6.2 
## 6 Sophronia Pug               weight_21   7.27
## 7 Sophronia Pug               weight_20   7.54

# values that need to be changed
#Harvey: 1.34
#Harvey: 1.82
#Emmett: 3.21
#Emmett: 3.46

After creating the long dataframe “longer_weight”, we want to update the values in the weight column.

The code below updates the column weight in the data frame longer_weight by using the ifelse function.

The ifelse function checks each value in the weight column to see if it is less than 5.

If a value is less than 5, it is increased by 10% of its original value. This increase is calculated by multiplying the original value by 1.1 (for example, 1.34 becomes 1.34 * 1.1 = 1.474). If a value is not less than 5, it remains unchanged. We then round the values to two decimal places.

longer_weight <- longer_weight %>% 
  mutate(weight = ifelse(weight < 5,
                         weight * 1.1,
                         weight)) %>% 
  mutate(weight = round(weight, digits = 2))

head(arrange(longer_weight, weight), n= 5)

## # A tibble: 5 × 4
##   name      breed             year      weight
##   <chr>     <chr>             <chr>      <dbl>
## 1 Harvey    Chihuahua         weight_22   1.47
## 2 Harvey    Chihuahua         weight_23   2   
## 3 Emmett    Yorkshire Terrier weight_23   3.53
## 4 Emmett    Yorkshire Terrier weight_22   3.81
## 5 Sophronia Pug               weight_22   6.2
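To see the correction in isolation, here is a small sketch that applies the same ifelse() logic to a plain numeric vector. The three values are taken from the output above; the vector names are only for illustration.

```r
# Two weights under 5 kg and one above, taken from the table above
weights <- c(1.34, 3.46, 6.2)

# Values under 5 are multiplied by 1.1, all others are kept as they are
corrected <- round(ifelse(weights < 5, weights * 1.1, weights), digits = 2)

corrected
```

The first two values change to 1.47 and 3.81, as in the table above, while 6.2 is left untouched.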

pivot_wider()

The function pivot_wider() is the opposite of pivot_longer().

It is used to reshape a data frame from long format to wide format. The pivot_wider function takes a column with multiple values and spreads them out into multiple columns, while collapsing multiple rows into one.

In the previous section, we increased the weight of dogs weighing less than 5 kilos by 10%.

Now, we use the pivot_wider function to transform the long dataframe “longer_weight” into its previous wide format.

The “names_from” argument specifies that the unique values in the “year” column of the “longer_weight” dataframe (i.e., weight_20, weight_21, weight_22, weight_23) will become the new column names in the “wider_weight” dataframe.

For now, we call this dataframe wider_weight; it is in fact laid out like the dog_weight dataframe, now with the corrected weights.

wider_weight <- pivot_wider(longer_weight, names_from = year, values_from = weight)

#names_from means: name of the new columns should be taken from the values in the year column.
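The following sketch shows the same step on a made-up long dataframe (toy_long is hypothetical). It also illustrates what happens to a dog that has no row for a given year, for instance because the missing value was dropped earlier with values_drop_na = TRUE.

```r
library(tidyr)

# Hypothetical long data: Mia has no row for weight_23
toy_long <- data.frame(
  name = c("Rex", "Rex", "Mia"),
  year = c("weight_22", "weight_23", "weight_22"),
  weight = c(4.4, 4.62, 12.5)
)

# One column per distinct value in year, one row per dog
toy_wide <- pivot_wider(toy_long, names_from = year, values_from = weight)

toy_wide
```

Since there is no weight_23 row for Mia, the corresponding cell in the wide dataframe is filled with NA.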

Before moving to the next task, let’s remove the dataframes we do not need anymore.

In the code below, ls() lists the names of all objects in the current environment, and the grep function searches these names for the pattern “dogs”. The invert = TRUE argument inverts the search so that it returns the names that do NOT contain “dogs”, and value = TRUE makes grep return the names themselves rather than their positions. Finally, the rm function removes all objects whose names are stored in “toremove”, together with “toremove” itself.

toremove <- grep("dogs", ls(),
                 invert = TRUE,
                 value = TRUE)

rm(list = c(toremove, "toremove"))
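To try out the search without deleting anything, the sketch below runs grep() on a plain character vector instead of ls(). The vector object_names is only a stand-in for the names in your environment.

```r
# Stand-in for the output of ls()
object_names <- c("dogs", "dog_weight", "longer_weight", "wider_weight")

# Names that do NOT contain "dogs"
to_remove <- grep("dogs", object_names, invert = TRUE, value = TRUE)
to_remove

# Without value = TRUE, grep() returns positions instead of names
positions <- grep("dogs", object_names, invert = TRUE)
positions
```

Note that “dog_weight” is also returned, because it contains “dog” but not “dogs”.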

Group-wise operations

Group-wise operations refer to the process of performing operations on subsets of data, based on the values in one or more columns.

In what follows, we talk about the functions group_by() and then summarise().

group_by()

With group_by(), you can specify one or more variables that you want to use as the basis for grouping your data.

The function will then create groups based on the unique values of the specified variables. It does not reorder or change the rows; it only records the grouping, so that the functions that follow are applied to each group separately.

For instance, we can group our dogs based on their breed, and then apply some functions to each group.

Story: At the shelter you want to know how many members of each breed you have, with the purpose of adding more members to groups with only one member. Here are the steps:

For simplicity, I reduce the dimensions of the “dogs” dataframe to only a few columns we will use here.
Then, we group_by() the dogs by their breed.
Then, we use the mutate() function to create a new column called “number_of_members” that contains the number of members in each breed group. The function n() counts the number of observations (rows) within each group.
(IMPORTANT): Finally, the ungroup() function is used to remove the grouping of the data, returning the data to its original format.

#step 1
dog_groups <- dogs %>% 
  select(name, breed, sex, height) %>% 
  group_by(breed) %>% 
  mutate(number_of_members = n()) %>% 
  ungroup()
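As a small sketch of how the new column answers the question from the story, the example below repeats the steps on a made-up dataframe (toy and its values are hypothetical) and then keeps only the breeds that have a single member.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  name = c("Rex", "Mia", "Bo", "Ada"),
  breed = c("Pug", "Pug", "Beagle", "Pug")
)

toy_counts <- toy %>%
  group_by(breed) %>%
  mutate(number_of_members = n()) %>% # n() is evaluated per breed
  ungroup()

# Breeds with only one member
lonely <- toy_counts %>%
  filter(number_of_members == 1)

lonely
```

Unlike summarise(), mutate() keeps one row per dog, so the count is simply repeated for every member of a breed.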

Story: Also, for your database, you want to assign IDs to members of each breed based on their height (smallest to largest). Here are the steps:

We group_by() the dogs by their breed.
We then “arrange” the dogs based on their height.
The mutate function is then used to create a new column called “breed_group_id” that contains a running number for each dog within its breed group. The seq(n()) function is used within the mutate function to generate a sequence of numbers based on the number of observations (rows) in each group, which is given by n().
By ungrouping the data, you ensure that the data is in the correct format for future operations and analysis.

#step 3
dog_groups <- dog_groups %>% 
  group_by(breed) %>% 
  arrange(height) %>% 
  mutate(breed_group_id = seq(n())) %>% 
  ungroup() # step 4

Concise way to do the two operations above:

#step 1
dog_groups <- dogs %>% 
  select(name, breed, sex, height) %>% #step 1 
  group_by(breed) %>% 
  mutate(number_of_members = n()) %>% #step 2
  arrange(height) %>% 
  mutate(breed_group_id = seq(n())) %>% #step 3
  ungroup() # step 4
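The sketch below runs the ID step on a made-up dataframe (toy and its values are hypothetical) to show what the IDs look like. By default, arrange() sorts the whole dataframe and ignores the grouping, but since the rows are in ascending height order when seq(n()) is evaluated per breed, the IDs still run from the smallest to the largest dog within each breed.

```r
library(dplyr)

# Hypothetical toy data
toy <- data.frame(
  name = c("Rex", "Mia", "Bo", "Ada"),
  breed = c("Pug", "Pug", "Beagle", "Pug"),
  height = c(30, 28, 38, 33)
)

toy_ids <- toy %>%
  group_by(breed) %>%
  arrange(height) %>% # sorts all rows by height
  mutate(breed_group_id = seq(n())) %>% # numbering restarts for each breed
  ungroup()

toy_ids
```

Here the three Pugs receive the IDs 1 to 3 in order of height, and the only Beagle receives the ID 1.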

group_by() and summarise()

The group_by() and summarise() functions are often used together to perform data summarization and aggregation. group_by is used to group the data based on one or more variables, and summarise is used to apply summary functions to the subgroups.

Note that, unlike the combination of group_by() and mutate() above, this combination aggregates the data of each group down to one row.

Useful calculations you can do with summarise (Taken from the documentation: https://dplyr.tidyverse.org/reference/summarise.html)

Center: mean(), median()
Spread: sd(), IQR(), mad()
Range: min(), max()
Position: first(), last(), nth()
Count: n(), n_distinct()
Logical: any(), all()

Story: One day, you receive a request from a prestigious animal organization called “Furry Friends Foundation”. The organization is conducting a study on the health and well-being of dogs in shelters across the country, and wants to get a more in-depth understanding of any potential gender-based differences in the population. So, the organization asks you to provide the summary statistics of the dogs at the shelter based on their gender.

Let us first calculate the number of members in each sex group.

First, the data in the “dogs” dataframe is grouped based on the “sex” column.

For each group defined by the “sex” column (male vs. female), the count of observations is calculated using the n() function.

The result is stored in a new variable called “n_dogs.” Since the “sex” column has two distinct values, male and female, the summary statistics will be given on two rows (one for female dogs and the other for male dogs).

gender_groups <- dogs %>% 
  group_by(sex) %>% 
  summarise(n_dogs = n()) %>% 
  ungroup()

gender_groups

## # A tibble: 2 × 2
##   sex    n_dogs
##   <chr>   <int>
## 1 female     13
## 2 male       14

Story: Since you enjoyed the combination of group_by() and summarise() a lot, you decide to also calculate a bunch of other values for each gender.

After grouping the dogs by their sex, the following variables are calculated in the code below using the summarise() function:

n_dogs, which is the number of dogs in each group
mean_height, which is the average height of the dogs in each group
mean_weight2023 and mean_weight2022, which are the average weights of the dogs in 2023 and 2022, respectively. Since some weights are missing for 2022, we add the argument na.rm = TRUE, which removes the NA values before the mean is calculated; without it, the mean of a group containing an NA would itself be NA.
min_height, which is the minimum height among the dogs in each group
max_height, which is the maximum height among the dogs in each group
cage_small, cage_medium, and cage_large, which are the number of dogs in each group that have a cage of size “small”, “medium”, or “large”.

gender_groups <- dogs %>% 
  group_by(sex) %>%
  summarise(n_dogs = n(),
            mean_height = mean(height),
            mean_weight2023 = mean(weight_23),
            mean_weight2022 = mean(weight_22, na.rm = TRUE),
            min_height = min(height),
            max_height = max(height),
            cage_small = sum(cage == "small"),
            cage_medium = sum(cage == "medium"),
            cage_large = sum(cage == "large")) %>% 
  ungroup()

#install.packages("DT")

# library(DT)
# DT::datatable(gender_groups)
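To make the effect of na.rm = TRUE and of sum() on a logical condition visible, here is a small sketch on a made-up dataframe (toy and its values are hypothetical).

```r
library(dplyr)

# Hypothetical toy data with one missing weight
toy <- data.frame(
  sex = c("female", "female", "male", "male"),
  weight_22 = c(10, NA, 20, 30),
  cage = c("small", "large", "small", "small")
)

toy_summary <- toy %>%
  group_by(sex) %>%
  summarise(
    mean_with_na = mean(weight_22), # NA if the group contains an NA
    mean_without_na = mean(weight_22, na.rm = TRUE), # NA values are dropped first
    cage_small = sum(cage == "small") # TRUE counts as 1, FALSE as 0
  )

toy_summary
```

The comparison cage == "small" returns TRUE or FALSE for each dog, and sum() adds up the TRUE values, which gives the number of dogs with a small cage in each group.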

String operations

Next, we turn to string operations, a type of data manipulation that involves working with character strings. In R, there are two main ways to perform string operations: using base R functions and using the tidyverse (its stringr package). Some common tasks involving strings are:

Combining multiple strings into one, which is also called concatenation
Formatting, e.g., converting strings to lower or uppercase
Extracting substrings from a string
Pattern matching and modifications
Checking if a string contains a specific pattern or word
Splitting a string into separate pieces

Combining multiple strings into one

Story: You have started building a website for your shelter. You decide to add images of the dogs to the website, with a short title for each image. You want to use the following info for this purpose: name, sex, and breed columns. You want to create a title such as “Sophronia is a female pug.”

We can use the paste() function to concatenate the values in the columns name, sex, and breed, in addition to the string “is a” and create a short title for each dog image. The default separator in the paste function is whitespace; you can define any other separator (e.g., comma, nothing, underscore).

website_content <- dogs %>% 
  select(name, breed, sex, description) %>% 
  mutate(title = paste(name, "is a", sex, breed, sep = " ")) #default sep is a whitespace 

head(website_content)

##        name         breed    sex
## 1 Sophronia           Pug female
## 2     Buddy      Labrador   male
## 3      Mara Saint Bernard female
## 4  Gracelyn     Dalmatian female
## 5 Broderick  Bull Terrier   male
## 6    Conrad    Weimaraner   male
##                                                                                                                                                                                                                                                                                                                                                                                                                                                                             description
## 1                                                               Sophronia is a female Pug who stands at 32 cm tall. This small and affectionate breed is known for their playful personality and charming wrinkles. Pugs are great family dogs, as they love to cuddle and are always up for a game of fetch. Sophronia is a friendly and outgoing pup who enjoys belly rubs and treats. She would make a great companion for someone who is looking for a low-maintenance, loving dog.
## 2 Buddy is a male Labrador who stands at 62 cm tall. This friendly and active breed is known for their obedience and trainability. Labrador Retrievers are one of the most popular dog breeds in the world and are known for their friendly and outgoing personality. Buddy is a social butterfly who loves meeting new people and dogs. He is also a big fan of playing fetch and going for long walks. Buddy will make a great companion for an active family who loves the outdoors.
## 3             Mara is a female Saint Bernard who stands at 59 cm tall. Saint Bernards are a giant breed known for their size and strength, but also for their gentle and friendly nature. They make great family dogs as they are patient and affectionate with children. Mara is a gentle giant who loves belly rubs and cuddles. She is also a great watchdog, always keeping a watchful eye over her family. Mara will need a large living space and plenty of room to run and play.
## 4                       Gracelyn is a female Dalmatian who stands at 57 cm tall. Dalmatians are an energetic and playful breed known for their distinctive black and white spotted coat. They are an active breed that loves to run and play and make great family pets for those who can keep up with their energy. Gracelyn is a fast and agile pup who loves to play games of chase. She is also known to be a bit of a clown, always making her family laugh with her silly antics.
## 5                                                                   Broderick is a male Bull Terrier who stands at 48 cm tall. Bull Terriers are a muscular and energetic breed known for their tenacity and loyalty. They make great family dogs for those who are prepared for their high energy and playfulness. Broderick is a playful and energetic pup who loves to run and play. He is also known for his fierce loyalty to his family and will always be there to protect them.
## 6                                                                Conrad is a male Weimaraner who stands at 66 cm tall. Weimaraners are an athletic and energetic breed known for their hunting instincts and loyalty. They make great family pets for those who can keep up with their high energy and need for exercise. Conrad is an active and energetic pup who loves to run and play. He is also known for his protective nature and will always be there to keep his family safe.
##                              title
## 1        Sophronia is a female Pug
## 2         Buddy is a male Labrador
## 3   Mara is a female Saint Bernard
## 4   Gracelyn is a female Dalmatian
## 5 Broderick is a male Bull Terrier
## 6      Conrad is a male Weimaraner

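Note: The paste() function is a base R function. Its closest tidyverse counterpart is the function str_c(), whose default separator is an empty string (like paste0() in base R), so you need to set sep = " " to get the same result as above. As previously noted, R offers a variety of options for performing the same operation, and the choice of which to use often comes down to personal preference.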
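The short sketch below compares the separators of paste() and str_c() on single strings. The variable names are only for illustration.

```r
library(stringr)

# paste() puts a whitespace between the pieces by default
with_space <- paste("Sophronia", "is a", "female", "Pug")
with_space

# Any other separator can be set with sep
with_underscore <- paste("Sophronia", "Pug", sep = "_")
with_underscore

# str_c() uses no separator by default
no_separator <- str_c("Sophronia", "Pug")
no_separator

# With sep = " ", str_c() gives the same result as paste()
str_c_space <- str_c("Sophronia", "is a", "female", "Pug", sep = " ")
str_c_space
```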

Formatting the strings

Story: You like the titles you have created, but you are not sure about their format. You decide to write the titles in other formats (e.g., all in uppercase, only the first word in uppercase, the first letter of each word in uppercase, lexical words in uppercase) to see which version fits the images better.

In the code below, the titles are passed through a series of functions that modify their format. First, the function “str_to_upper” converts all the characters of the titles to uppercase letters. This is equivalent to the toupper function in base R.

Next, the “str_to_lower” function converts all the characters to lowercase letters, like the tolower function in base R.

Then, the “str_to_title” function converts the titles to title case, where the first letter of each word is capitalized.

Finally, the “str_to_sentence” function converts the titles to sentence case, where only the first letter of the first word is capitalized and everything else is written in lowercase (note that “Pug” becomes “pug” in the output below).

website_content <- website_content %>% 
  mutate(uppercase = str_to_upper(title)) %>%
  mutate(lowercase = str_to_lower(title)) %>%
  mutate(title_format = str_to_title(title)) %>% 
  mutate(sentence_format = str_to_sentence(title)) 


head(website_content[6:9])

##                          uppercase                        lowercase
## 1        SOPHRONIA IS A FEMALE PUG        sophronia is a female pug
## 2         BUDDY IS A MALE LABRADOR         buddy is a male labrador
## 3   MARA IS A FEMALE SAINT BERNARD   mara is a female saint bernard
## 4   GRACELYN IS A FEMALE DALMATIAN   gracelyn is a female dalmatian
## 5 BRODERICK IS A MALE BULL TERRIER broderick is a male bull terrier
## 6      CONRAD IS A MALE WEIMARANER      conrad is a male weimaraner
##                       title_format                  sentence_format
## 1        Sophronia Is A Female Pug        Sophronia is a female pug
## 2         Buddy Is A Male Labrador         Buddy is a male labrador
## 3   Mara Is A Female Saint Bernard   Mara is a female saint bernard
## 4   Gracelyn Is A Female Dalmatian   Gracelyn is a female dalmatian
## 5 Broderick Is A Male Bull Terrier Broderick is a male bull terrier
## 6      Conrad Is A Male Weimaraner      Conrad is a male weimaraner

Extracting substrings from a string

Story: At the dog shelter, you are approached by a collar company looking to create unique collars for each of your dogs. You are torn between using the dog’s name, breed, or both as the identifier on the collar. After a period of intense contemplation, you decide to use a bit of both. You want to create a column in the dog data frame called “collar_id” and use the first three letters of the dogs’ names followed by the last four letters of their breed.

In the code below, a new column named “collar_id” is being created, the value of which is the combination of two separate str_sub() functions. The str_sub function is a string manipulation function used to extract a portion of a character string, specified by a starting and ending position.

The first str_sub function str_sub(name, start = 1 , end = 3) is used to extract the first three characters of the “name” column. The start argument is set to 1, indicating the start of the string, and end is set to 3, indicating the index of the last character that should still be included in the desired substring.

The second str_sub function str_sub(breed, start =-4, end =-1) is used to extract the last four characters of the breed column. In this case, start is set to -4, indicating that the extraction should start four characters from the end of the string, and end is set to -1, indicating that the extraction should end at the last character of the string.

collar <- website_content %>% 
  select(name, breed) %>% 
  mutate(name_letters  = str_sub(name, start = 1 , end = 3) ) %>% 
  mutate(breed_letters = str_sub(breed, start = -4 , end = -1)) %>% 
  mutate(collar_id = str_c(name_letters, breed_letters))

head(collar)

##        name         breed name_letters breed_letters collar_id
## 1 Sophronia           Pug          Sop           Pug    SopPug
## 2     Buddy      Labrador          Bud          ador   Budador
## 3      Mara Saint Bernard          Mar          nard   Marnard
## 4  Gracelyn     Dalmatian          Gra          tian   Gratian
## 5 Broderick  Bull Terrier          Bro          rier   Brorier
## 6    Conrad    Weimaraner          Con          aner   Conaner
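To see how the positions work, the sketch below applies str_sub() to single strings taken from the table above. It also shows why the breed “Pug” is returned in full: the string is shorter than four characters, so str_sub() simply returns what is there.

```r
library(stringr)

# Positive positions count from the start of the string
first_three <- str_sub("Sophronia", start = 1, end = 3)
first_three

# Negative positions count from the end of the string
last_four <- str_sub("Labrador", start = -4, end = -1)
last_four

# A string with fewer than four characters is returned as it is
short_breed <- str_sub("Pug", start = -4, end = -1)
short_breed
```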

Pattern matching and modifications

Story: You decide to add a chart regarding the characteristics of your dogs to the website. For that, you specify some features such as “active”, “friendly”, “playful”, and “athletic” and you decide to look for these features in the descriptions in order to add them to the chart.

We use grepl() to look for the specified pattern. grepl() is used to search for the presence of a pattern, specified by a regular expression, in a character vector. A regular expression is a string that describes a search pattern. The easiest way of using grepl() is to search for an exact string in a column.

website_content <- website_content %>% 
  mutate(friendly = ifelse( grepl("friendly", description), TRUE, FALSE)) %>% 
  mutate(playful = ifelse( grepl("playful", description), TRUE, FALSE)) %>% 
  mutate(active = ifelse( grepl("active", description), TRUE, FALSE)) %>% 
  mutate(athletic = ifelse( grepl("athletic", description), TRUE, FALSE))

chart <- website_content %>% 
  select(name, friendly, playful, active, athletic)

head(chart)

##        name friendly playful active athletic
## 1 Sophronia     TRUE    TRUE  FALSE    FALSE
## 2     Buddy     TRUE   FALSE   TRUE    FALSE
## 3      Mara     TRUE   FALSE  FALSE    FALSE
## 4  Gracelyn    FALSE    TRUE   TRUE    FALSE
## 5 Broderick    FALSE    TRUE  FALSE    FALSE
## 6    Conrad    FALSE   FALSE   TRUE     TRUE
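Two details of this kind of search are worth keeping in mind, and the sketch below illustrates them on made-up sentences (the example texts are hypothetical): grepl() is case-sensitive by default, and it also finds the pattern inside longer words.

```r
# Hypothetical example texts
texts <- c("A friendly dog", "Friendly and calm", "A rather inactive dog")

# Case-sensitive by default: "Friendly" is not matched
friendly_exact <- grepl("friendly", texts)
friendly_exact

# ignore.case = TRUE matches both spellings
friendly_any_case <- grepl("friendly", texts, ignore.case = TRUE)
friendly_any_case

# The pattern is also found inside other words: "active" matches "inactive"
active_found <- grepl("active", texts)
active_found
```

Depending on the descriptions, it can therefore be a good idea to check a few of the matches by hand.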

Story: You believe that the collar IDs would sound even friendlier if a “y” were added to the end of those collar IDs that do not already end in a vowel or the letter “y”. Therefore, you plan to first identify the collar IDs that fulfill this criterion, and then append a “y” to the end of those collar IDs.

As mentioned, grepl() takes regular expressions. The pattern can include simple characters, such as letters and digits, as well as special characters, such as dot . (which matches any character), * (which matches zero or more occurrences of the preceding character), and ^ (which matches the start of a line).

In the code below, the ifelse function is used to conditionally assign values to the collar_id_friendly column. The grepl function is used to search for the presence of vowels (i.e., “a”, “e”, “y”, “o”, “u”, “i”) at the end of a string, represented by the regular expression “[aeyoui]$”. The $ symbol in the regular expression indicates that the pattern should match the end of the string. If the grepl function returns FALSE (i.e., if the string does not end with the letters “aeyoui”), then the paste function is used to concatenate the original collar_id value with the character “y”. If grepl returns TRUE, then the original value in the collar_id column is assigned to the collar_id_friendly column without modification.

The tidyverse alternative to grepl() is str_detect().

collar <- collar %>% 
  mutate(collar_id_friendly = ifelse(grepl("[aeyoui]$", collar_id),
                                     collar_id,
                                     paste(collar_id, "y", sep = "")))

#alternative solution
collar %>% 
  mutate(collar_id_friendly_alt = ifelse(str_detect(collar_id,"[aeyoui]$"),
                                         collar_id,
                                         str_c(collar_id, "y")))

?grepl


?str_detect
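The sketch below shows what the $ at the end of the pattern changes. The first two IDs come from the collar table above; “Gratia” and “Rexy” are made up for illustration.

```r
ids <- c("SopPug", "Budador", "Gratia", "Rexy")

# With $: does the ID END in one of the letters a, e, y, o, u, i?
ends_in_vowel <- grepl("[aeyoui]$", ids)
ends_in_vowel

# Without $: does the ID CONTAIN one of these letters anywhere?
contains_vowel <- grepl("[aeyoui]", ids)
contains_vowel
```

Only the last two IDs end in one of the listed letters, while all four contain at least one of them somewhere.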

Story: Since why not, you also choose to modify the collar IDs ending with “e” by changing it to “ie”.

We use replace functions gsub() or str_replace() to replace a specified pattern with something else. In the code below, the “grepl” function is used to check if the string ends with “e”, and if the condition is true, the “gsub” function is used to replace the “e” with “ie”.

collar <- collar %>% 
  mutate(collar_id_friendly = ifelse(grepl("e$",collar_id_friendly), 
                                     gsub("e$", "ie", collar_id_friendly),
                                     collar_id_friendly))

#alternative solution
collar %>% 
  mutate(collar_id_friendly_alt = ifelse(str_detect(collar_id_friendly,"e$"), 
                                     str_replace(collar_id_friendly,"e$", "ie"),
                                     collar_id_friendly))
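As a side note, gsub() and str_replace() leave a string unchanged when the pattern is not found. The sketch below shows this on two IDs (“Gratiane” is made up, “Conaner” comes from the table above), so the check with grepl() or str_detect() mainly makes the condition explicit for the reader.

```r
library(stringr)

ids <- c("Gratiane", "Conaner")

# Only the ID ending in "e" is changed
with_gsub <- gsub("e$", "ie", ids)
with_gsub

# The tidyverse version takes the string first and the pattern second
with_str_replace <- str_replace(ids, "e$", "ie")
with_str_replace
```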

Story: As mentioned, you have used ChatGPT to create descriptions for each dog. One day, a linguist friend of yours, Maria, comes to visit the shelter and is thoroughly impressed with the descriptions. She proposes collecting all the descriptions to form a corpus, a body of written texts. You are thrilled with the idea and gladly give Maria permission to use the descriptions. In return you ask for her assistance in reviewing the descriptions for any potential weaknesses. First, you send over the descriptions and a unique ID for each dog (concatenating name, breed, sex, and height of the dog) to Maria.

dog_descriptions <- dogs %>% 
  select(name, breed, sex, height, description) %>% 
  mutate(unique_id = paste(name, breed, sex, height, sep = "_")) %>% 
  select(unique_id, description)

Trimming

Story: Reviewing the data, Maria notices that there are some redundant white spaces in the description column. She decides to trim the description column to get rid of them.

The str_trim() function removes whitespace from start and end of string.

dog_descriptions <- dog_descriptions %>% 
  mutate(description = str_trim(description))

The str_squish() function removes whitespace at the start and end, and also replaces any run of internal whitespace with a single space.

spaces <- "  This is   a string   with a lot of    spaces "

str_trim(spaces)

## [1] "This is   a string   with a lot of    spaces"

str_squish(spaces)

## [1] "This is a string with a lot of spaces"

Separating a column into multiple columns

Story: Since ChatGPT can potentially generate wrong or incomplete descriptions, Maria wants to make sure that the data is accurate. She wants to check two things: (1) whether or not the breed of each dog has been mentioned in the description, and (2) whether the unit of measurement for height has been reported in the right format (cm in this case).

The first step is to split the concatenated unique_id column into its building blocks.

The code below uses the separate() function to separate the column unique_id into several new columns, namely name, breed, sex, and height. In column unique_id, these values are separated from each other with an underscore. The argument sep defines the type of separator used in the original column (in this case underscore). The argument “into” contains the names of the new columns. The “remove” argument is set to “FALSE”, which means the original “unique_id” column is not removed, and remains in the data after the transformation.

dog_descriptions <- dog_descriptions %>% 
  separate(., unique_id, 
           into = c("name", "breed", "sex", "height"), 
           sep = "_", 
           remove = FALSE)

head(dog_descriptions[1:5])

##                        unique_id      name         breed    sex height
## 1        Sophronia_Pug_female_32 Sophronia           Pug female     32
## 2         Buddy_Labrador_male_62     Buddy      Labrador   male     62
## 3   Mara_Saint Bernard_female_59      Mara Saint Bernard female     59
## 4   Gracelyn_Dalmatian_female_57  Gracelyn     Dalmatian female     57
## 5 Broderick_Bull Terrier_male_48 Broderick  Bull Terrier   male     48
## 6      Conrad_Weimaraner_male_66    Conrad    Weimaraner   male     66
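One thing to be aware of is that separate() creates character columns, so the height is stored as text after the split. The sketch below, on two IDs from the table above, uses the convert = TRUE argument of separate() to turn columns that look like numbers back into numbers. The dataframe names ids and parts are only for illustration.

```r
library(tidyr)

ids <- data.frame(
  unique_id = c("Sophronia_Pug_female_32", "Mara_Saint Bernard_female_59")
)

parts <- separate(ids, unique_id,
  into = c("name", "breed", "sex", "height"),
  sep = "_",
  remove = FALSE,
  convert = TRUE # height becomes a number instead of text
)

parts
class(parts$height)
```

The split works for “Saint Bernard” because the breed contains a space and not an underscore; a value containing the separator itself would be split into too many pieces.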

Pattern matching and replacements

To check for the breed, the code below uses an ifelse function to check if the string in the “breed” column is detected in the “description” column.

dog_descriptions <- dog_descriptions %>% 
  mutate(is_breed_there = ifelse(str_detect(description, breed), "yes", "no"))
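When both arguments of str_detect() are columns, the function works row by row: the first description is searched for the first breed, the second description for the second breed, and so on. The sketch below shows this on two made-up descriptions (the texts are hypothetical).

```r
library(stringr)

# Hypothetical descriptions and the breeds we expect to find in them
descriptions <- c("Sophronia is a female Pug", "Buddy is a friendly dog")
breeds <- c("Pug", "Labrador")

# Element-wise check: is breed i mentioned in description i?
breed_found <- str_detect(descriptions, breeds)
breed_found

is_breed_there <- ifelse(breed_found, "yes", "no")
is_breed_there
```

Keep in mind that the breed is interpreted as a regular expression and that the search is case-sensitive, so “labrador” would not count as a match for “Labrador”.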

The code below first uses str_detect and checks whether the unit reported in the description is “cm” or something else (“other”). The result is stored in the column “cm_vs_inch”.

In the second mutate function, if the value of cm_vs_inch is “other”, then the function uses str_replace_all to replace all instances of “ inches ” and “ inch ” with “ cm ” in the description variable. If the value of cm_vs_inch is already “cm”, then the value of description remains unchanged.

dog_descriptions <- dog_descriptions %>%
   mutate(cm_vs_inch = ifelse(str_detect(description, " cm"), "cm", "other"))%>% 
   mutate(description = ifelse(cm_vs_inch == "other", 
                               str_replace_all(description, c(" inch(es)? ")," cm "),
                               description))
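The pattern " inch(es)? " covers both spellings at once: the question mark makes the group (es) optional. The sketch below tries it on three made-up phrases (the texts are hypothetical).

```r
library(stringr)

# Hypothetical phrases with different units
phrases <- c(
  "stands at 12 inches tall",
  "stands at 1 inch tall",
  "stands at 32 cm tall"
)

# " inch " and " inches " are both replaced by " cm "
replaced <- str_replace_all(phrases, " inch(es)? ", " cm ")
replaced
```

Note that only the unit label is replaced; the number in front of it stays the same.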

Counting instances

Story: And again, since why not, Maria wants to check whether the name of each dog has been mentioned sufficiently in the descriptions. Her assumption is that if the name appears only once, the text is not too friendly, and the description should be improved.

In the code below, the str_count function is counting the number of instances of the “name” column that appear within the “description” column.

dog_descriptions <- dog_descriptions %>% 
  mutate(friendliness = str_count(description, name))
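As with str_detect(), str_count() works row by row when both arguments are columns. A small sketch with two made-up descriptions (the texts are hypothetical):

```r
library(stringr)

# Hypothetical descriptions and the matching dog names
descriptions <- c("Buddy is kind. Buddy loves walks.", "Mara is calm.")
dog_names <- c("Buddy", "Mara")

# How often does name i appear in description i?
friendliness <- str_count(descriptions, dog_names)
friendliness
```

The first description mentions the name twice, the second only once, so by Maria's rule the second one would need to be improved.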

Row-wise split

Story: Finally, it is time for Maria to create the corpus of dog descriptions. She first decides to split the descriptions sentence-wise. She assumes that the punctuation mark period (.) marks the end of a sentence.

The code below uses the function separate_rows() to split the description column into separate rows for each sentence. It does this by using the period (“.”) as the separator to split the text into separate rows.

dog_corpus <- dog_descriptions %>% 
  select(unique_id, description) %>% 
  separate_rows(description, sep = "\\.")
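The sketch below applies separate_rows() to a single made-up description (toy and its text are hypothetical). The resulting pieces can still carry a leading space, and there may be an empty piece after the final period, which is why the sentences are trimmed and empty ones are dropped afterwards.

```r
library(tidyr)

# Hypothetical description with two sentences
toy <- data.frame(
  unique_id = "Rex_Pug_male_30",
  description = "Rex is small. Rex loves treats."
)

# "\\." is an escaped period; a plain "." would match any character
toy_rows <- separate_rows(toy, description, sep = "\\.")
toy_rows

# Tidy up the pieces: trim whitespace and drop empty strings
sentences <- trimws(toy_rows$description)
sentences <- sentences[sentences != ""]
sentences
```

The unique_id is repeated on every row, so each sentence can still be traced back to its dog.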

A recap of different tidyverse operations + some new ones

The code below shows a big pipeline containing various functions we talked about earlier and some new ones.

Here is a breakdown of what it does:

The function rename() in the second line renames the description column to sentence.
The function str_trim() in the third line trims any leading or trailing whitespace from the sentence column.
The function filter() in the fourth line filters out any rows where the sentence column is empty.
The group_by() function in the fifth line groups the rows by their unique_id column.
The function mutate() [lines 6-9] calculates various sentence properties within each group, including the total number of sentences (number_of_sentences), the position of the sentence within the group (which_sentence), the count of characters in each sentence (count_char), and the count of words in each sentence (count_word). These calculations use the n(), seq(), and str_count() functions; called without a pattern, str_count() counts characters, and with the pattern \w+ it counts words.
The function mutate() [lines 10-11] adds new columns to each row containing the previous (previous_sentence) and next (next_sentence) sentences within the group using the lag() and lead() functions. The argument n in these functions defines how many rows back or ahead the values should be taken from.
The function ungroup() on line 12 removes the grouping again.
Finally, the function str_extract_all() on line 13 extracts all words in each sentence starting with the letter “f” or “F” and saves the results in a new column called words_with_letterF. This is done with the regular expression pattern \b[fF]\w+\b.

dog_corpus <- dog_corpus %>% 
  rename(sentence = description) %>% 
  mutate(sentence = str_trim(sentence)) %>% 
  filter(sentence != "") %>% 
  group_by(unique_id) %>% 
  mutate(number_of_sentences = n(),
         which_sentence = seq(n()),
         count_char = str_count(sentence),
         count_word = str_count(sentence, '\\w+')) %>%
  mutate(previous_sentence = lag(sentence, n = 1),
         next_sentence = lead(sentence, n = 1)) %>% 
  ungroup() %>% 
  mutate( words_with_letterF = str_extract_all(sentence, "\\b[fF]\\w+\\b"))
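To look at the new functions one at a time, the sketch below applies them to a plain vector of three made-up sentences (the texts are hypothetical) instead of a grouped dataframe.

```r
library(dplyr)
library(stringr)

# Hypothetical sentences belonging to one dog
sentences <- c("Rex is small", "Rex loves fetch and food", "Rex is friendly")

# lag() shifts the values one position down, lead() one position up;
# the position without a neighbour is filled with NA
previous_sentence <- dplyr::lag(sentences, n = 1)
next_sentence <- dplyr::lead(sentences, n = 1)
previous_sentence
next_sentence

# Without a pattern, str_count() counts characters; with "\\w+", words
count_char <- str_count(sentences)
count_word <- str_count(sentences, "\\w+")
count_char
count_word

# str_extract_all() returns a list with one character vector per sentence
f_words <- str_extract_all(sentences, "\\b[fF]\\w+\\b")
f_words
```

Because a sentence can contain zero, one, or several matching words, words_with_letterF in the pipeline above is a list column rather than an ordinary character column.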

Story: Maria sends the corpus back to you at the shelter. After much thinking, you and Maria decide to call it “FurryFriends Corpus”.

Pivoting, grouping, string operations

2023-12-06