From now on, we do things inside a project to keep a clear and coherent workspace. Go to File –> New Project (choose either a New directory or an existing one). For instance, you can choose the folder you downloaded for this session from ILIAS. You can have an overview of the folder in the output pane.
Before getting to the actual content, the following section introduces the example story: You have left academia and decided to take over a local dog shelter. The previous owner kept very crappy records in Excel, so you decide to clean them up and complement them with more information. To make your furry friends more relatable, you contact a dog name expert and ask them to choose names for your dogs. You also use ChatGPT to create nice descriptions for them.
Through this process, we learned the following topics in the previous session:
ifelse() for conditional statements.
Our data from the previous session is saved in three formats, namely RDS, CSV, and TSV. Let’s bring the data into R.
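library(tidyverse) # provides read_rds(), the pipe, and the dplyr, tidyr, and stringr functions used below
dogs <- read_rds("files/dogs.rds")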
Story: The annual weighing of dogs at the shelter is an important practice. You have a specific scale to do so. Unfortunately, the manufacturers of the scale have recently come forward with news that their product is not accurately measuring the weight of dogs under 5 kilograms and that these values need to be increased by 10%.
Given that the weight information is currently spread across multiple columns (weight_20, weight_21, weight_22, and weight_23), consolidating it into a single column could make the necessary changes more manageable.
pivot_longer() helps us reshape the data from a “wide” format to a “long” format, where each variable is in a single column and each observation is in a separate row. This is useful when working with data that stores multiple values for a single observation in different columns (e.g., the weight info in our dogs dataframe).
Let’s see how we can apply this to our dog dataframe.
For simplicity and visual reasons, I create a smaller dataframe (dog_weight) by selecting only the name, breed, and the weight columns.
Please note that you can run pivot_longer on the full dataframe. You just need to mention which columns need to be pivoted.
dog_weight <- dogs %>%
select(name, breed, weight_20, weight_21, weight_22, weight_23)
The code below uses the function pivot_longer() to reshape the columns weight_20, weight_21, weight_22, and weight_23 into a longer format.
The cols argument is used to specify the columns that we want to pivot, in this case, the columns “weight_20”, “weight_21”, “weight_22”, and “weight_23”. The names_to argument is used to specify the new column to create from the information stored in the column names of data specified by cols. The values_to argument is used to specify the new column for storing the data stored in cell values. The argument values_drop_na is set to TRUE, so any missing weight values in the columns are not added as extra rows.
longer_weight <- pivot_longer(dog_weight,
  cols = c(weight_20, weight_21, weight_22, weight_23),
  names_to = "year",      # the names of the pivoted columns go into the column year
  values_to = "weight",   # the values of the pivoted columns go into the column weight
  values_drop_na = TRUE
)
#let us arrange the values to see which ones are under 5 kilo
head(arrange(longer_weight, weight), n=7)
## # A tibble: 7 × 4
## name breed year weight
## <chr> <chr> <chr> <dbl>
## 1 Harvey Chihuahua weight_22 1.34
## 2 Harvey Chihuahua weight_23 1.82
## 3 Emmett Yorkshire Terrier weight_23 3.21
## 4 Emmett Yorkshire Terrier weight_22 3.46
## 5 Sophronia Pug weight_22 6.2
## 6 Sophronia Pug weight_21 7.27
## 7 Sophronia Pug weight_20 7.54
# values that need to be changed
#Harvey: 1.34
#Harvey: 1.82
#Emmett: 3.21
#Emmett: 3.46
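As a small, self-contained sketch of the same reshaping step, the toy dataframe below (made-up rows, not the shelter data) is pivoted in the same way. It additionally uses the names_prefix argument of pivot_longer() to strip the “weight_” prefix, so that the year column holds “22” and “23” instead of “weight_22” and “weight_23”. This is an optional variation; the rest of the session does not rely on it.

```r
library(tidyr)

# Toy data (hypothetical rows), one weight column per year
toy_weight <- tibble::tribble(
  ~name,       ~weight_22, ~weight_23,
  "Harvey",          1.34,       1.82,
  "Sophronia",       6.20,         NA
)

toy_long <- pivot_longer(toy_weight,
  cols = c(weight_22, weight_23),
  names_to = "year",
  names_prefix = "weight_",  # drop the common prefix from the new year values
  values_to = "weight",
  values_drop_na = TRUE      # Sophronia's missing 2023 weight produces no row
)

toy_long
```

The result has three rows instead of four, because the missing value is dropped rather than stored as an NA row.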
After creating the long dataframe “longer_weight”, we want to update the values in the weight column.
The code below updates the column weight in the data frame longer_weight by using the ifelse function.
The ifelse function checks each value in the weight column to see if it is less than 5.
If a value is less than 5, it is increased by 10% of its original value. This increase is calculated by multiplying the original value by 1.1. If a value is not less than 5, it remains unchanged. We then round the values to two decimal places.
longer_weight <- longer_weight %>%
mutate(weight = ifelse(weight < 5,
weight * 1.1,
weight)) %>%
mutate(weight = round(weight, digits = 2))
head(arrange(longer_weight, weight), n= 5)
## # A tibble: 5 × 4
## name breed year weight
## <chr> <chr> <chr> <dbl>
## 1 Harvey Chihuahua weight_22 1.47
## 2 Harvey Chihuahua weight_23 2
## 3 Emmett Yorkshire Terrier weight_23 3.53
## 4 Emmett Yorkshire Terrier weight_22 3.81
## 5 Sophronia Pug weight_22 6.2
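To see the correction in isolation, the sketch below applies the same ifelse() and round() calls to a plain numeric vector. The vector holds the four weights under 5 kilos from the output above plus one heavier value; the object names weight and corrected are only used for this illustration.

```r
# The four weights under 5 kg from the output above, plus one heavier dog
weight <- c(1.34, 1.82, 3.21, 3.46, 6.20)

# ifelse() is vectorised: the test, the "yes" value and the "no" value
# are evaluated element by element
corrected <- round(ifelse(weight < 5, weight * 1.1, weight), digits = 2)

corrected
```

The first four values come out as 1.47, 2, 3.53 and 3.81, matching the table above, while 6.2 is left as it is.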
pivot_wider() is the opposite of pivot_longer().
It is used to reshape a data frame from long format to wide format. The pivot_wider function takes columns with multiple values and spreads them out into multiple columns, while collapsing multiple rows into one.
In the previous section, we increased the weight of dogs weighing less than 5 kilos by 10%.
Now, we use the pivot_wider function to transform the long dataframe “longer_weight” into its previous wide format.
The “names_from” argument specifies that the unique values in the “year” column of the “longer_weight” dataframe (i.e., weight_20, weight_21, weight_22, weight_23) will become the new column names in the “wider_weight” dataframe.
For now, we call this dataframe wider_weight; it has the same layout as the dog_weight dataframe, now with the corrected weights.
wider_weight <- pivot_wider(longer_weight, names_from = year, values_from = weight)
#names_from means: name of the new columns should be taken from the values in the year column.
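The round trip from wide to long and back can be sketched on a toy dataframe (made-up rows, hypothetical object names). Note how the missing value that was dropped by values_drop_na = TRUE reappears as NA once the data is wide again.

```r
library(tidyr)

toy_wide <- tibble::tribble(
  ~name,       ~weight_22, ~weight_23,
  "Harvey",          1.34,       1.82,
  "Sophronia",       6.20,         NA
)

toy_long <- pivot_longer(toy_wide,
  cols = c(weight_22, weight_23),
  names_to = "year",
  values_to = "weight",
  values_drop_na = TRUE
)

# Back to wide: one column per distinct value of year;
# combinations that have no row in the long data are filled with NA
toy_back <- pivot_wider(toy_long, names_from = year, values_from = weight)

toy_back
```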
Before moving to the next task, let’s remove the dataframes we do not need anymore.
In the code below, the grep() function searches the names of the objects in the current environment, as returned by ls(), for the pattern “dogs”. The invert = TRUE argument inverts the search so that it returns the objects that do NOT contain “dogs” in their name, and value = TRUE returns the names themselves rather than their positions. Finally, the rm() function removes all objects listed in “toremove”, as well as the “toremove” object itself.
toremove <- grep("dogs", ls(),
invert = TRUE,
value = TRUE)
rm(list = c(toremove, "toremove"))
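If you want to try this clean-up without touching your real workspace, the sketch below runs the same grep() and rm() logic inside a separate environment. The environment scratch and the empty placeholder objects are assumptions made only for this illustration.

```r
# A scratch environment, so the sketch does not touch your real workspace
scratch <- new.env()
assign("dogs", data.frame(), envir = scratch)
assign("dog_weight", data.frame(), envir = scratch)
assign("longer_weight", data.frame(), envir = scratch)

# Names that do NOT contain "dogs" (invert = TRUE), returned as text (value = TRUE)
toremove <- grep("dogs", ls(scratch), invert = TRUE, value = TRUE)
toremove

rm(list = toremove, envir = scratch)
ls(scratch)
```

Only “dogs” is left afterwards. Note that “dog_weight” is removed because it contains “dog” but not “dogs”.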
Group-wise operations refer to the process of performing operations on subsets of data, based on the values in one or more columns.
In what follows, we talk about the functions group_by() and then summarise().
With group_by(), you can specify one or more variables that you want to use as the basis for grouping your data.
The function then creates groups based on the unique values of the specified variables. The rows themselves are not reordered; the grouping only determines that subsequent operations are carried out separately for each group.
For instance, we can group our dogs based on their breed, and then apply some functions to each group.
Story: At the shelter you want to know how many members of each breed you have, with the purpose of adding more members to groups with only one member. Here are the steps:
For simplicity, I reduce the dimensions of the “dogs” dataframe to only a few columns we will use here.
Then, we group_by() the dogs by their breed.
Then, we use the mutate() function to create a new column called “number_of_members” that contains the number of members in each breed group. The function n() counts the number of observations (rows) within each group.
(IMPORTANT): Finally, the ungroup() function is used to remove the grouping, so that later operations are applied to the whole dataframe again rather than breed by breed.
#step 1
dog_groups <- dogs %>%
select(name, breed, sex, height) %>%
group_by(breed) %>%
mutate(number_of_members = n()) %>%
ungroup()
Story: Also, for your database, you want to assign IDs to members of each breed based on their height (smallest to largest). Here are the steps:
We group_by() the dogs by their breed.
We then arrange() the dogs based on their height.
The mutate() function is then used to create a new column called “breed_group_id” that numbers the members within each breed group. The seq(n()) function is used within the mutate() function to generate a sequence of numbers from 1 to the number of observations (rows) in each group, which is given by n().
By ungrouping the data, you ensure that the data is in the correct format for future operations and analysis.
#step 3
dog_groups <- dog_groups %>%
group_by(breed) %>%
arrange(height) %>%
mutate(breed_group_id = seq(n())) %>%
ungroup() # step 4
Concise way to do the two operations above:
#step 1
dog_groups <- dogs %>%
select(name, breed, sex, height) %>% #step 1
group_by(breed) %>%
mutate(number_of_members = n()) %>% #step 2
arrange(height) %>%
mutate(breed_group_id = seq(n())) %>% #step 3
ungroup() # step 4
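One detail worth knowing: by default, arrange() ignores the grouping and sorts the whole dataframe. The pipeline above still numbers the dogs correctly, because sorting all rows by height also puts the rows of every breed in height order. The sketch below is a variation on a toy dataframe (made-up rows) that sets .by_group = TRUE, so the rows are sorted by breed first and by height within each breed, and that uses row_number() as an alternative to seq(n()).

```r
library(dplyr)

toy <- tibble::tribble(
  ~name, ~breed,     ~height,
  "A",   "Pug",           32,
  "B",   "Pug",           30,
  "C",   "Labrador",      62
)

toy_groups <- toy %>%
  group_by(breed) %>%
  arrange(height, .by_group = TRUE) %>%       # sort by breed first, then by height
  mutate(number_of_members = n(),             # size of the breed group
         breed_group_id = row_number()) %>%   # running number within each breed
  ungroup()

toy_groups
```

The Labrador comes first with ID 1, followed by the two Pugs with IDs 1 (30 cm) and 2 (32 cm).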
The group_by() and summarise() functions
are often used together to perform data summarization and aggregation.
group_by is used to group the data based on one or more variables, and
summarise is used to apply summary functions to the subgroups.
Note that, unlike the combination of group_by() and mutate() above, this combination aggregates the data of each group down to one row.
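Useful calculations you can do with summarise() (taken from the documentation: https://dplyr.tidyverse.org/reference/summarise.html):
Center: mean(), median()
Spread: sd(), IQR(), mad()
Range: min(), max()
Position: first(), last(), nth()
Count: n(), n_distinct()
Logical: any(), all()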
Story: One day, you receive a request from a prestigious animal organization called “Furry Friends Foundation”. The organization is conducting a study on the health and well-being of dogs in shelters across the country, and wants to get a more in-depth understanding of any potential gender-based differences in the population. So, the organization asks you to provide the summary statistics of the dogs at the shelter based on their gender.
Let us first calculate the number of members in each sex group.
First, the data in the “dogs” dataframe is grouped based on the “sex” column.
For each group defined by the “sex” column (male vs. female), the count of observations is calculated using the n() function.
The result is stored in a new variable called “n_dogs.” Since the “sex” column has two distinct values, male and female, the summary statistics will be given on two rows (one for female dogs and the other for male dogs).
gender_groups <- dogs %>%
group_by(sex) %>%
summarise(n_dogs = n()) %>%
ungroup()
gender_groups
## # A tibble: 2 × 2
## sex n_dogs
## <chr> <int>
## 1 female 13
## 2 male 14
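Story: Since you enjoyed the combination of group_by() and summarise() a lot, you decide to also calculate a bunch of other values for each gender.
After grouping the dogs by their sex, the following variables are calculated in the code below using the summarise() function: the number of dogs (n_dogs), the mean height (mean_height), the mean weight in 2023 and in 2022 (mean_weight2023, mean_weight2022), the minimum and maximum height (min_height, max_height), and the number of dogs in small, medium, and large cages (cage_small, cage_medium, cage_large). For the 2022 weights, na.rm = TRUE removes missing values before the mean is calculated.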
gender_groups <- dogs %>%
  group_by(sex) %>%
  summarise(n_dogs = n(),
            mean_height = mean(height),
            mean_weight2023 = mean(weight_23),
            mean_weight2022 = mean(weight_22, na.rm = TRUE),
            min_height = min(height),
            max_height = max(height),
            cage_small = sum(cage == "small"),
            cage_medium = sum(cage == "medium"),
            cage_large = sum(cage == "large")) %>%
  ungroup()
#install.packages("DT")
# library(DT)
# DT::datatable(gender_groups)
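Two details of the summary above can be sketched on a toy dataframe (made-up rows, hypothetical object names): what na.rm = TRUE changes when a group contains a missing value, and why sum() on a comparison counts rows.

```r
library(dplyr)

toy <- tibble::tribble(
  ~name,       ~sex,     ~height, ~weight_22, ~cage,
  "Sophronia", "female",      32,        6.2, "small",
  "Mara",      "female",      59,         NA, "large",
  "Buddy",     "male",        62,       30.0, "large"
)

toy_summary <- toy %>%
  group_by(sex) %>%
  summarise(n_dogs = n(),
            mean_height = mean(height),
            mean_weight_naive = mean(weight_22),               # NA if any value in the group is missing
            mean_weight2022 = mean(weight_22, na.rm = TRUE),   # missing values are skipped
            cage_large = sum(cage == "large"))                 # each TRUE counts as 1

toy_summary
```

For the female group, mean_weight_naive is NA, whereas mean_weight2022 is 6.2, the mean of the one weight that is present.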
Next, we turn to string operations, a type of data manipulation that involves working with character strings. In R, there are two main ways to perform string operations: using base R functions and using the stringr package from the tidyverse. Some common tasks involving strings, which we cover below, are concatenating strings, changing their case, extracting substrings, detecting and replacing patterns, removing whitespace, and splitting and counting.
Story: You have started building a website for your shelter. You decide to add images of the dogs to the website, with a short title for each image. You want to use the following info for this purpose: name, sex, and breed columns. You want to create a title such as “Sophronia is a female pug.”
We can use the paste() function to concatenate the values in the columns name, sex, and breed with the string “is a” and create a short title for each dog image. The default separator in the paste() function is a single space; you can define any other separator (e.g., a comma, an underscore, or nothing at all).
website_content <- dogs %>%
  select(name, breed, sex, description) %>%
  mutate(title = paste(name, "is a", sex, breed, sep = " ")) # sep = " " is the default
head(website_content)
## name breed sex
## 1 Sophronia Pug female
## 2 Buddy Labrador male
## 3 Mara Saint Bernard female
## 4 Gracelyn Dalmatian female
## 5 Broderick Bull Terrier male
## 6 Conrad Weimaraner male
## description
## 1 Sophronia is a female Pug who stands at 32 cm tall. This small and affectionate breed is known for their playful personality and charming wrinkles. Pugs are great family dogs, as they love to cuddle and are always up for a game of fetch. Sophronia is a friendly and outgoing pup who enjoys belly rubs and treats. She would make a great companion for someone who is looking for a low-maintenance, loving dog.
## 2 Buddy is a male Labrador who stands at 62 cm tall. This friendly and active breed is known for their obedience and trainability. Labrador Retrievers are one of the most popular dog breeds in the world and are known for their friendly and outgoing personality. Buddy is a social butterfly who loves meeting new people and dogs. He is also a big fan of playing fetch and going for long walks. Buddy will make a great companion for an active family who loves the outdoors.
## 3 Mara is a female Saint Bernard who stands at 59 cm tall. Saint Bernards are a giant breed known for their size and strength, but also for their gentle and friendly nature. They make great family dogs as they are patient and affectionate with children. Mara is a gentle giant who loves belly rubs and cuddles. She is also a great watchdog, always keeping a watchful eye over her family. Mara will need a large living space and plenty of room to run and play.
## 4 Gracelyn is a female Dalmatian who stands at 57 cm tall. Dalmatians are an energetic and playful breed known for their distinctive black and white spotted coat. They are an active breed that loves to run and play and make great family pets for those who can keep up with their energy. Gracelyn is a fast and agile pup who loves to play games of chase. She is also known to be a bit of a clown, always making her family laugh with her silly antics.
## 5 Broderick is a male Bull Terrier who stands at 48 cm tall. Bull Terriers are a muscular and energetic breed known for their tenacity and loyalty. They make great family dogs for those who are prepared for their high energy and playfulness. Broderick is a playful and energetic pup who loves to run and play. He is also known for his fierce loyalty to his family and will always be there to protect them.
## 6 Conrad is a male Weimaraner who stands at 66 cm tall. Weimaraners are an athletic and energetic breed known for their hunting instincts and loyalty. They make great family pets for those who can keep up with their high energy and need for exercise. Conrad is an active and energetic pup who loves to run and play. He is also known for his protective nature and will always be there to keep his family safe.
## title
## 1 Sophronia is a female Pug
## 2 Buddy is a male Labrador
## 3 Mara is a female Saint Bernard
## 4 Gracelyn is a female Dalmatian
## 5 Broderick is a male Bull Terrier
## 6 Conrad is a male Weimaraner
Note: The paste() function is a base R function. Its tidyverse counterpart is the function str_c(), which joins strings without a separator unless you set sep. As previously noted, R offers a variety of options for performing the same operation, and the choice of which to use often comes down to personal preference.
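The sketch below puts the base R and stringr variants side by side on a single title. The object names are only used for this illustration.

```r
library(stringr)

title_paste  <- paste("Sophronia", "is a", "female", "Pug")              # default sep = " "
title_str_c  <- str_c("Sophronia", "is a", "female", "Pug", sep = " ")   # sep has to be set
id_paste0    <- paste0("Sop", "Pug")                                     # paste() without a separator
id_str_c     <- str_c("Sop", "Pug")                                      # default sep = ""

title_paste
title_str_c
id_paste0
id_str_c
```

Both title variants give “Sophronia is a female Pug”, and both ID variants give “SopPug”.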
Story: You like the titles you have created, but you are not sure about their format. You decide to write the titles in other formats (e.g., all in uppercase, only the first word in uppercase, the first letter of each word in uppercase, lexical words in uppercase) to see which version fits the images better.
In the code below, the titles are passed through a series of functions that modify their format. First, str_to_upper() converts all characters of the titles to uppercase letters; this is equivalent to the toupper() function in base R. Then, str_to_lower() converts all characters to lowercase letters (tolower() in base R).
Next, str_to_title() converts the titles to title case, where the first letter of each word is capitalized.
Finally, str_to_sentence() converts the titles to sentence case, where only the first letter of the first word is capitalized; as the output shows, this also lowercases breed names such as “Saint Bernard”.
website_content <- website_content %>%
mutate(uppercase = str_to_upper(title)) %>%
mutate(lowercase = str_to_lower(title)) %>%
mutate(title_format = str_to_title(title)) %>%
mutate(sentence_format = str_to_sentence(title))
head(website_content[6:9])
## uppercase lowercase
## 1 SOPHRONIA IS A FEMALE PUG sophronia is a female pug
## 2 BUDDY IS A MALE LABRADOR buddy is a male labrador
## 3 MARA IS A FEMALE SAINT BERNARD mara is a female saint bernard
## 4 GRACELYN IS A FEMALE DALMATIAN gracelyn is a female dalmatian
## 5 BRODERICK IS A MALE BULL TERRIER broderick is a male bull terrier
## 6 CONRAD IS A MALE WEIMARANER conrad is a male weimaraner
## title_format sentence_format
## 1 Sophronia Is A Female Pug Sophronia is a female pug
## 2 Buddy Is A Male Labrador Buddy is a male labrador
## 3 Mara Is A Female Saint Bernard Mara is a female saint bernard
## 4 Gracelyn Is A Female Dalmatian Gracelyn is a female dalmatian
## 5 Broderick Is A Male Bull Terrier Broderick is a male bull terrier
## 6 Conrad Is A Male Weimaraner Conrad is a male weimaraner
Story: At the dog shelter, you are approached by a collar company looking to create unique collars for each of your dogs. You are torn between using the dog’s name, breed, or both as the identifier on the collar. After a period of intense contemplation, you decide to use a bit of both. You want to create a column in the dog data frame called “collar_id” and use the first three letters of the dogs’ names followed by the last four letters of their breed.
In the code below, a new column named “collar_id” is being created,
the value of which is the combination of two separate
str_sub() functions. The str_sub function is a string
manipulation function used to extract a portion of a character string,
specified by a starting and ending position.
The first str_sub function
str_sub(name, start = 1 , end = 3) is used to extract the
first three characters of the “name” column. The start argument is set
to 1, indicating the start of the string, and end is set to 3,
indicating the index of the last character that should still be included
in the desired substring.
The second str_sub function
str_sub(breed, start =-4, end =-1) is used to extract the
last four characters of the breed column. In this case, start is set to
-4, indicating that the extraction should start four characters from the
end of the string, and end is set to -1, indicating that the extraction
should end at the last character of the string.
collar <- website_content %>%
select(name, breed) %>%
mutate(name_letters = str_sub(name, start = 1 , end = 3) ) %>%
mutate(breed_letters = str_sub(breed, start = -4 , end = -1)) %>%
mutate(collar_id = str_c(name_letters, breed_letters))
head(collar)
## name breed name_letters breed_letters collar_id
## 1 Sophronia Pug Sop Pug SopPug
## 2 Buddy Labrador Bud ador Budador
## 3 Mara Saint Bernard Mar nard Marnard
## 4 Gracelyn Dalmatian Gra tian Gratian
## 5 Broderick Bull Terrier Bro rier Brorier
## 6 Conrad Weimaraner Con aner Conaner
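The same extraction can be tried on two plain character vectors, taken from the first two rows of the output above. The sketch also shows what happens when a breed has fewer than four letters, and the base R counterpart substr() for the positive positions.

```r
library(stringr)

name  <- c("Sophronia", "Buddy")
breed <- c("Pug", "Labrador")

first_three <- str_sub(name, start = 1, end = 3)
# "Pug" has only three letters, so the whole string is returned
last_four   <- str_sub(breed, start = -4, end = -1)

collar_id <- str_c(first_three, last_four)
collar_id

# Base R counterpart for the first three letters
substr(name, 1, 3)
```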
Story: You decide to add a chart regarding the characteristics of your dogs to the website. For that, you specify some features such as “active”, “friendly”, “playful”, and “athletic”, and you decide to look for these features in the descriptions in order to add them to the chart.
We use grepl() to look for the specified pattern. grepl() searches for the presence of a pattern, specified by a regular expression, in a character vector and returns TRUE or FALSE for each element. A regular expression is a string that describes a search pattern. The easiest way of using grepl() is to search for an exact string in a column.
website_content <- website_content %>%
  # grepl() already returns TRUE/FALSE, so no ifelse() wrapper is needed
  mutate(friendly = grepl("friendly", description),
         playful = grepl("playful", description),
         active = grepl("active", description),
         athletic = grepl("athletic", description))
chart <- website_content %>%
select(name, friendly, playful, active, athletic)
head(chart)
## name friendly playful active athletic
## 1 Sophronia TRUE TRUE FALSE FALSE
## 2 Buddy TRUE FALSE TRUE FALSE
## 3 Mara TRUE FALSE FALSE FALSE
## 4 Gracelyn FALSE TRUE TRUE FALSE
## 5 Broderick FALSE TRUE FALSE FALSE
## 6 Conrad FALSE FALSE TRUE TRUE
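Searching for an exact string also finds it inside longer words and is case sensitive. The sketch below uses three made-up sentences (not taken from the dog descriptions) to show the difference between a plain substring search, a whole-word search with the word boundary \\b, and a case-insensitive search. Whether this stricter matching is needed depends on your data; it is shown here as an option only.

```r
# Hypothetical sentences for illustration
description <- c("Buddy is friendly and active.",
                 "Luna enjoys interactive toys.",
                 "Rex is Active all day.")

substring_match  <- grepl("active", description)
whole_word_match <- grepl("\\bactive\\b", description, perl = TRUE)
any_case_match   <- grepl("\\bactive\\b", description, perl = TRUE, ignore.case = TRUE)

substring_match    # also TRUE for "interactive"
whole_word_match   # only the stand-alone, lowercase word
any_case_match     # stand-alone word, upper or lower case
```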
Story: You believe that the collar IDs would sound even friendlier if a “y” were added to the end of those collar IDs that do not end in a vowel or the letter “y”. Therefore, you plan to first identify the collar IDs that fulfill this criterion, and then append a “y” to the end of those collar IDs.
As mentioned, grepl() takes regular expressions. The pattern can include simple characters, such as letters and digits, as well as special characters, such as the dot . (which matches any character), * (which matches zero or more occurrences of the preceding character), and ^ (which matches the start of a line).
In the code below, the ifelse() function is used to conditionally assign values to the collar_id_friendly column. The grepl() function is used to search for the presence of a vowel or “y” (i.e., “a”, “e”, “y”, “o”, “u”, “i”) at the end of a string, represented by the regular expression “[aeyoui]$”. The square brackets match any single one of the listed characters, and the $ symbol indicates that the pattern should match at the end of the string. If grepl() returns FALSE (i.e., if the string does not end with one of the letters in “aeyoui”), then the paste() function is used to concatenate the original collar_id value with the character “y”. If grepl() returns TRUE, then the original value in the collar_id column is assigned to the collar_id_friendly column without modification.
The tidyverse alternative to grepl() is
str_detect().
collar <- collar %>%
mutate(collar_id_friendly = ifelse(grepl("[aeyoui]$", collar_id),
collar_id,
paste(collar_id, "y", sep = "")))
#alternative solution
collar %>%
mutate(collar_id_friendly_alt = ifelse(str_detect(collar_id,"[aeyoui]$"),
collar_id,
str_c(collar_id, "y")))
?grepl
?str_detect
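The effect of the $ anchor can be seen on a small character vector. “SopPug” and “Budador” are taken from the collar IDs above; “Roxy” and “Maxie” are made-up IDs added for contrast.

```r
collar_id <- c("SopPug", "Budador", "Roxy", "Maxie")  # "Roxy" and "Maxie" are hypothetical

# With the anchor: does the string END in one of the listed letters?
ends_in_vowel <- grepl("[aeyoui]$", collar_id)
ends_in_vowel

# Without the anchor: does the string contain one of the letters ANYWHERE?
contains_vowel <- grepl("[aeyoui]", collar_id)
contains_vowel

collar_id_friendly <- ifelse(ends_in_vowel, collar_id, paste0(collar_id, "y"))
collar_id_friendly
```

Only the first two IDs get a “y” appended, although all four contain a vowel somewhere.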
Story: Since why not, you also choose to modify the collar IDs ending with “e” by changing the “e” to “ie”.
We use the replace functions gsub() or str_replace() to replace a specified pattern with something else. In the code below, the grepl() function is used to check if the string ends with “e”, and if the condition is true, the gsub() function is used to replace the “e” with “ie”. Note that gsub() replaces every match while str_replace() replaces only the first one; since the pattern “e$” can match at most once per string, both give the same result here.
collar <- collar %>%
mutate(collar_id_friendly = ifelse(grepl("e$",collar_id_friendly),
gsub("e$", "ie", collar_id_friendly),
collar_id_friendly))
#alternative solution
collar %>%
mutate(collar_id_friendly_alt = ifelse(str_detect(collar_id_friendly,"e$"),
str_replace(collar_id_friendly,"e$", "ie"),
collar_id_friendly))
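As a side note, the replace functions return strings without a match unchanged, so the replacement can also be applied to the whole vector directly. The sketch below compares sub(), gsub(), and str_replace() on three made-up collar IDs.

```r
library(stringr)

ids <- c("Luke", "Conanery", "Eevee")   # hypothetical collar IDs

with_sub         <- sub("e$", "ie", ids)           # base R, first match only
with_gsub        <- gsub("e$", "ie", ids)          # base R, all matches (at most one here, because of $)
with_str_replace <- str_replace(ids, "e$", "ie")   # stringr, first match only

with_sub
```

All three calls return “Lukie”, “Conanery”, and “Eeveie”: only the final “e” is replaced, and “Conanery” is left untouched because it does not end in “e”.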
Story: As mentioned, you have used ChatGPT to create descriptions for each dog. One day, a linguist friend of yours, Maria, comes to visit the shelter and is thoroughly impressed with the descriptions. She proposes collecting all the descriptions to form a corpus, a body of written texts. You are thrilled with the idea and gladly give Maria permission to use the descriptions. In return, you ask for her assistance in reviewing the descriptions for any potential weaknesses. First, you send over the descriptions and a unique ID for each dog (concatenating name, breed, sex, and height of the dog) to Maria.
dog_descriptions <- dogs %>%
select(name, breed, sex, height, description) %>%
mutate(unique_id = paste(name, breed, sex, height, sep = "_")) %>%
select(unique_id, description)
Story: Reviewing the data, Maria notices that there are some redundant white spaces in the description column. She decides to trim the description column to get rid of them.
The str_trim() function removes whitespace from the start and end of a string.
dog_descriptions <- dog_descriptions %>%
mutate(description = str_trim(description))
The str_squish() function removes whitespace at the start and end of a string, and replaces all internal whitespace with a single space.
spaces <- "   This   is a string   with a lot   of spaces   "
str_trim(spaces)
## [1] "This   is a string   with a lot   of spaces"
str_squish(spaces)
## [1] "This is a string with a lot of spaces"
Story: Since ChatGPT can potentially generate wrong or incomplete descriptions, Maria wants to make sure that the data is accurate. She wants to check two things: (1) whether or not the breed of each dog has been mentioned in the description, and (2) whether the unit of measurement for height has been reported in the right format (cm in this case).
The first step is to split the concatenated unique_id column into its building blocks.
The code below uses the separate() function to split the column unique_id into several new columns, namely name, breed, sex, and height. In the column unique_id, these values are separated from each other by an underscore. The argument sep defines the separator used in the original column (in this case, the underscore). The argument “into” contains the names of the new columns. The “remove” argument is set to FALSE, which means the original “unique_id” column is not removed and remains in the data after the transformation.
dog_descriptions <- dog_descriptions %>%
separate(., unique_id,
into = c("name", "breed", "sex", "height"),
sep = "_",
remove = FALSE)
head(dog_descriptions[1:5])
## unique_id name breed sex height
## 1 Sophronia_Pug_female_32 Sophronia Pug female 32
## 2 Buddy_Labrador_male_62 Buddy Labrador male 62
## 3 Mara_Saint Bernard_female_59 Mara Saint Bernard female 59
## 4 Gracelyn_Dalmatian_female_57 Gracelyn Dalmatian female 57
## 5 Broderick_Bull Terrier_male_48 Broderick Bull Terrier male 48
## 6 Conrad_Weimaraner_male_66 Conrad Weimaraner male 66
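Since the new columns are cut out of a text column, they are all stored as text, including height. As an optional variation, the sketch below uses the convert argument of separate() on two of the IDs shown above to turn columns that look like numbers into numeric columns. The object names are only used for this illustration.

```r
library(tidyr)

toy_ids <- tibble::tibble(
  unique_id = c("Sophronia_Pug_female_32", "Mara_Saint Bernard_female_59")
)

toy_split <- separate(toy_ids, unique_id,
  into = c("name", "breed", "sex", "height"),
  sep = "_",
  remove = FALSE,
  convert = TRUE   # run a type conversion on the new columns: height becomes a number
)

toy_split
```

The space in “Saint Bernard” is not a problem, because only the underscore is treated as a separator.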
To check for the breed, the code below uses an ifelse function to check if the string in the “breed” column is detected in the “description” column.
dog_descriptions <- dog_descriptions %>%
mutate(is_breed_there = ifelse(str_detect(description, breed), "yes", "no"))
The code below first uses str_detect and checks whether the unit reported in the description is “cm” or something else (“other”). The result is stored in the column “cm_vs_inch”.
In the second mutate function, if the value of cm_vs_inch is “other”, then the function uses str_replace_all to replace all instances of “ inches ” and “ inch ” with “ cm ” in the description variable. If the value of cm_vs_inch is already “cm”, then the value of description remains unchanged.
dog_descriptions <- dog_descriptions %>%
mutate(cm_vs_inch = ifelse(str_detect(description, " cm"), "cm", "other"))%>%
mutate(description = ifelse(cm_vs_inch == "other",
str_replace_all(description, c(" inch(es)? ")," cm "),
description))
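The regular expression “ inch(es)? ” can be tried on a few sentences. The first two sentences below are made up; the third follows the wording of the descriptions shown earlier.

```r
library(stringr)

description <- c("Rex stands at 12 inches tall.",
                 "Pip is 1 inch tall.",
                 "Sophronia stands at 32 cm tall.")

cm_vs_inch <- ifelse(str_detect(description, " cm"), "cm", "other")
cm_vs_inch

# "(es)?" makes the "es" optional, so the pattern covers " inch " and " inches "
replaced <- str_replace_all(description, " inch(es)? ", " cm ")
replaced
```

Keep in mind that this only rewrites the unit label. The number in front of it stays the same, so “12 inches” becomes “12 cm”; a real unit conversion would also have to change the value.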
Story: And again, since why not, Maria wants to check whether the name of each dog has been mentioned sufficiently in the descriptions. Her assumption is that if the name is repeated only once, the text is not too friendly, and the description should be improved.
In the code below, the str_count function is counting the number of instances of the “name” column that appear within the “description” column.
dog_descriptions <- dog_descriptions %>%
mutate(friendliness = str_count(description, name))
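str_count() works element by element on both of its arguments, which is why a whole column of names can be used as the pattern. A minimal sketch on two shortened, made-up descriptions:

```r
library(stringr)

description <- c("Buddy is a male Labrador. Buddy loves fetch. Buddy will make a great companion.",
                 "Mara is a gentle giant.")
name <- c("Buddy", "Mara")

# Element i of name is counted in element i of description
friendliness <- str_count(description, name)
friendliness
```

The first description mentions its dog three times, the second one only once.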
Story: Finally, it is time for Maria to create the corpus of dog descriptions. She first decides to split the descriptions sentence-wise. She assumes that the punctuation mark period (.) marks the end of a sentence.
The code below uses the function separate_rows() to split the description column into separate rows for each sentence. It does this by using the period (“.”) as the separator to split the text into separate rows.
dog_corpus <- dog_descriptions %>%
select(unique_id, description) %>%
separate_rows(description, sep = "\\.")
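The sketch below splits a single made-up description in the same way. Because the period is removed but the space after it is not, the resulting pieces may start with a space, and an empty piece may be left over after the final period. The sketch therefore trims the pieces and drops empty ones, which is the kind of clean-up the next pipeline also performs.

```r
library(dplyr)
library(tidyr)
library(stringr)

toy_desc <- tibble::tibble(
  unique_id = "Mara_Saint Bernard_female_59",
  description = "Mara is a gentle giant. She loves belly rubs."
)

toy_sentences <- toy_desc %>%
  # "\\." is a literal period; a bare "." would match any character
  separate_rows(description, sep = "\\.") %>%
  mutate(description = str_trim(description)) %>%
  filter(description != "")

toy_sentences
```

The unique_id is repeated on every row, so each sentence can still be traced back to its dog.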
The code below shows a big pipeline containing various functions we talked about earlier and some new ones.
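Here is a breakdown of what it does:
The description column is renamed to sentence with rename().
Leading and trailing whitespace is removed from each sentence with str_trim(), and empty sentences are dropped with filter().
The sentences are grouped by unique_id, so that the following values are calculated separately for each dog.
The number of sentences per description, the position of each sentence within its description, and the number of characters per sentence are calculated with the n(), seq(), and str_count() functions, respectively.
The number of words per sentence is calculated with str_count() and the regular expression "\\w+", which matches one run of word characters.
The previous and the next sentence are added with the lag() and lead() functions. The argument n in these functions defines from how many rows back or ahead the values should be chosen.
After ungrouping, str_extract_all() collects all words of each sentence that start with the letter “f” or “F” in the column words_with_letterF.
dog_corpus <- dog_corpus %>%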
rename(sentence = description) %>%
mutate(sentence = str_trim(sentence)) %>%
filter(sentence != "") %>%
group_by(unique_id) %>%
mutate(number_of_sentences = n(),
which_sentence = seq(n()),
count_char = str_count(sentence),
count_word = str_count(sentence, '\\w+')) %>%
mutate(previous_sentence = lag(sentence, n = 1),
next_sentence = lead(sentence, n = 1)) %>%
ungroup() %>%
mutate( words_with_letterF = str_extract_all(sentence, "\\b[fF]\\w+\\b"))
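The newer functions in this pipeline can be tried on a plain character vector first. The three sentences below are shortened, made-up examples, and the object names are only used for this illustration.

```r
library(dplyr)
library(stringr)

sentence <- c("Buddy is a friendly fan of fetch",
              "He loves long walks",
              "Buddy will make a great companion")

previous_sentence <- dplyr::lag(sentence, n = 1)    # one row back; the first element has none, so NA
next_sentence     <- dplyr::lead(sentence, n = 1)   # one row ahead; the last element has none, so NA

count_char <- str_count(sentence)            # no pattern given: counts characters
count_word <- str_count(sentence, "\\w+")    # counts runs of word characters, i.e. words

# Returns a list with one character vector per sentence
words_with_letterF <- str_extract_all(sentence, "\\b[fF]\\w+\\b")

count_word
words_with_letterF
```

Because str_extract_all() returns a list, words_with_letterF becomes a list column in the dataframe; sentences without a matching word get an empty character vector.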
Story: Maria sends the corpus back to you at the shelter. After much thinking, you and Maria decide to call it “FurryFriends Corpus”.