Import libraries

Set up working directory


# set working directory to source file location
if (requireNamespace("rstudioapi", quietly = TRUE) && rstudioapi::isAvailable()) {
  setwd(dirname(rstudioapi::getSourceEditorContext()$path))
} else {
  setwd(dirname(knitr::current_input(dir = TRUE)))
}

getwd()

Introduction

The example data for today’s class follows an example story. Because of this most (sub-)sections start with a short paragraph which explains the example story for each presented task (indicated by “Story: …”). Additionally, each (sub-)section usually covers one task, including explanations, code and (optionally) notes.

Before getting to the actual content, the following section introduces the example story.

Example story

You have left academia and decided to take over a local dog shelter.

Right away you realize that keeping a record of the dogs is highly useful for managing the shelter. Beyond just keeping an overview, it can be used for all sorts of things: Managing the weight of the dogs, knowing how much food needs to be ordered, how many and which cages are needed, advertising the dogs to potential buyers and many more.

The previous manager of the shelter has been keeping a “record” as well. Unfortunately, the record is in a rather poor state. Apparently, the previous owner was not really a “data person”.

Luckily, in your previous occupation you have acquired just the right habits and skills to clean up this data mess so you can use it for your management tasks.

Get data into (and out of) R

loading tables from MS Excel

Story: The previous owner has been using Excel for managing his records. Obviously, a bad idea! Surely, there has to be a way to get the data into R…

The read_excel() function from the package readxl can be used to load excel tables. Main argument of the function is the file path (with extension).

#install.packages("readxl")

library(readxl)
getwd()

## [1] "C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2023_11_22_Session6"

old_records <- read_excel("old_records.xlsx")


if (interactive()) {
  View(old_records)
  ?read_excel
}

Notes:

If there are multiple sheets in an Excel file, you can specify the sheet using the sheet argument: sheet = “sheetname”
Since we have set the working directory to where the data is (see above), we only need the filename, not the full path.

loading tables from csv

Story: During a quiet hour after feeding the dogs (it seems that eating makes them sleepy), you spend some time online. While googling, you have found a data set with info on different dog breeds. You have a feeling that this might come in handy at some point.

There is a number of functions for loading csv files. Each function has slightly different default values concerning some parameters (e.g. which delimiter, header/no header, …). For choosing the right parameters going through the menu of RStudio helps.

Go to: File > Import Dataset > From Text (base)

But don’t forget to copy and paste resulting command into your script to make it reproducible!

dog_breed_info <- read_csv("AKC Breed Info.csv")

## Rows: 150 Columns: 5
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (5): Breed, height_low_inches, height_high_inches, weight_low_lbs, weigh...
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

#View(dog_breed_info)

inputting data by hand

Story: The year is still young, but already there have been a number of new arrivals. Clearly, their data needs to be collected as well. You went ahead and identified their breed and sex and also put each dog on a scale.

The data.frame() function can be used to input data ‘by hand’. The syntax has the following structure:

data.frame(var_name = data_vector, …)

new_arrivals <- data.frame(
  breed  = c("Poodle", "Great Dane", "French Bulldog", "German Shepherd", "Siberian Husky", "Siberian Husky", "Boxer"),
  sex    = c("male", "female", "female", "female", "female", "male", "male"),
  weight = c(28.54, 60.37, 8.62, 24.50, 17.24, 21.84, 35.47)
)

#view(new_arrivals)

Notes:

You can input as many columns as needed.
The vectors need to have the same length.

saving and loading using RDS

Story: To cut down on expenses, you want to cancel your Microsoft Office subscription in the future. Worried that you won’t be able to access the Excel records of the old owner anymore, you decide it is best to save that data in a different format for safekeeping.

write_rds() can be used to save any R object to a file. First argument is the object, second the file path (extension .rds).

write_rds(old_records, "old_records.rds")

To check if this worked, let’s first remove the data frame from the environment and then load it again.

# rm(remove) is used for removing a dataframe

rm(old_records)

# Let's bring the file back in. 

old_records <- read_rds("old_records.rds")

Notice the different structure of saving and loading (it also applies to other saving/loading functions).

Note:

Saving (and loading) using RDS also saves additional info such as formatting (especially useful for factors).

saving using csv

Story: An hour later you feel somewhat paranoid (maybe because of all the barking). Afraid to loose data, you decide to additionally save the data in csv and tsv formats as well. Just to be “safe”. This might also be useful in case you need to share it with someone who does not use R.

write_csv(old_records, "old_records.csv")

write_tsv(old_records, "old_records.tsv")

when to use to which format

Suggestion:

for loading

use the process that fits the format that the data is in

saving

when only using R -> use rds
when it is necessary to be platform/software independent -> use csv or tsv
and only when working with colleagues who can only handle Excel files -> use xlsx

Working with single columns I

creating columns

Story: As a first step, you want to prepare the info on dog breeds. Looking at the dog_breed_info dataframe, you notice that it includes weight and height info in units of lbl (=pounds) and inches, respectively. However, you prefer measurements in kg and cm. Transforming the values should be easy using the formulas you found online.

lbs to kg conversion formula = multiplying the given lbs value by 0.453 kg.

inch to cm conversion formula = multiplying the given inch value by 2.54 cm

# # transform height info to cm

colnames(dog_breed_info)

## [1] "Breed"              "height_low_inches"  "height_high_inches"
## [4] "weight_low_lbs"     "weight_high_lbs"

# dog_breed_info <- dog_breed_info %>%
#   mutate(height_low = height_low_inches * 2.54) %>%
#   mutate(height_high = height_high_inches * 2.54)

str(dog_breed_info)

## spc_tbl_ [150 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Breed             : chr [1:150] "Akita" "Anatolian Sheepdog" "Bernese Mountain Dog" "Bloodhound" ...
##  $ height_low_inches : chr [1:150] "26" "27" "23" "24" ...
##  $ height_high_inches: chr [1:150] "28" "29" "27" "26" ...
##  $ weight_low_lbs    : chr [1:150] "80" "100" "85" "80" ...
##  $ weight_high_lbs   : chr [1:150] "120" "150" "110" "120" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Breed = col_character(),
##   ..   height_low_inches = col_character(),
##   ..   height_high_inches = col_character(),
##   ..   weight_low_lbs = col_character(),
##   ..   weight_high_lbs = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

This fails. (Don’t forget to comment it or put error option in the code chunk to TRUE (error = TRUE) to compile the markdown file. Why doesn’t it work? Let’s take a closer look.

There are several data types in R:

numeric (numbers)
character (text)
factor (gender)

You can check the current type of each column with the str() function.

str(dog_breed_info$height_low_inches)

##  chr [1:150] "26" "27" "23" "24" "26" "25" "32" "27" "23" "28" "28" "27" ...

str(dog_breed_info)

## spc_tbl_ [150 × 5] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
##  $ Breed             : chr [1:150] "Akita" "Anatolian Sheepdog" "Bernese Mountain Dog" "Bloodhound" ...
##  $ height_low_inches : chr [1:150] "26" "27" "23" "24" ...
##  $ height_high_inches: chr [1:150] "28" "29" "27" "26" ...
##  $ weight_low_lbs    : chr [1:150] "80" "100" "85" "80" ...
##  $ weight_high_lbs   : chr [1:150] "120" "150" "110" "120" ...
##  - attr(*, "spec")=
##   .. cols(
##   ..   Breed = col_character(),
##   ..   height_low_inches = col_character(),
##   ..   height_high_inches = col_character(),
##   ..   weight_low_lbs = col_character(),
##   ..   weight_high_lbs = col_character()
##   .. )
##  - attr(*, "problems")=<externalptr>

All the values of each column have to be of one of these types. All numeric value can be stored as character/text (“12” instead of 12), but not all text can be numeric (“mouse” can not sensibly be turned into a number).

If any value in a column is of the type character, then the whole column is.

What happened here: There are character values in the columns (“na” and “not found”). Therefore these columns can not be numeric. Multiplication does not work on character columns. Hence, the error.

How to deal with this?

handling “na” and “not found” string values

Missings are values that are not available in the data (for some reason). This is a common phenomenon for a lot of datasets. In R there is a common way of indicating them so that the functions of R know how to deal with them: Set them to NA. Note that NA is not a value of a certain data type. Because of this columns that otherwise contain only numeric values can still contain NAs and be of the type numeric.

With the following commands, you can replace “na” and “not found” with NA. This is a way of subsetting or indexing a dataframe in R. What it does is select all elements in dataframe that meet the condition specified.

dog_breed_info[dog_breed_info == "na"] <- NA
dog_breed_info[dog_breed_info == "not found"] <- NA


dog_breed_info[dog_breed_info == "na" | dog_breed_info == "not found" ] <- NA

By doing that, we tell R that these are missing values. Any operation done to the column is done on the rest of the values. The missings stay NAs.

Changing columns’ data types

Setting the missings to NA does not automatically change the data type of the variables, however. They are still of type character (although they could be numeric now).

So to prepare the columns for the multiplication we need to make them numeric, first:

# transform height info to numeric

# as.numeric()
# as.character()
# as.factor()

dog_breed_info <- dog_breed_info %>% 
  mutate(height_low_inches = as.numeric(height_low_inches)) %>%
  mutate(height_high_inches = as.numeric(height_high_inches)) %>% 
  mutate(weight_low_lbs = as.numeric(weight_low_lbs)) %>% 
  mutate(weight_high_lbs = as.numeric(weight_high_lbs))


str(dog_breed_info)

## tibble [150 × 5] (S3: tbl_df/tbl/data.frame)
##  $ Breed             : chr [1:150] "Akita" "Anatolian Sheepdog" "Bernese Mountain Dog" "Bloodhound" ...
##  $ height_low_inches : num [1:150] 26 27 23 24 26 25 32 27 23 28 ...
##  $ height_high_inches: num [1:150] 28 29 27 26 28 27 32 32 28 35 ...
##  $ weight_low_lbs    : num [1:150] 80 100 85 80 70 100 120 95 130 90 ...
##  $ weight_high_lbs   : num [1:150] 120 150 110 120 100 130 160 120 150 150 ...

And now, we should (finally) be able to perform the transformation to cm and kg:

# transform height info to cm and weight info to kg

dog_breed_info <- dog_breed_info %>% 
  mutate(height_low = height_low_inches * 2.54) %>% 
  mutate(height_high = height_high_inches * 2.54) %>% 
  mutate(weight_low = weight_low_lbs * 0.453) %>% 
  mutate(weight_high = weight_high_lbs * 0.453)

Let’s round the values of the new columns to only one decimal places.

?round

## starting httpd help server ... done

dog_breed_info <- dog_breed_info %>% 
  mutate(height_low = round(height_low, digits = 1)) %>% 
  mutate(height_high = round(height_high, digits = 1)) %>% 
  mutate(weight_low = round(weight_low, digits = 1)) %>% 
  mutate(weight_high = round(weight_high, digits = 1))

To make our dataframe cleaner, let’s get rid of the unwanted inch and lbs columns. Also, let’s change “Breed” to “breed”

dog_breed_info <- dog_breed_info %>% 
  select(-c(height_low_inches,
            height_high_inches,
            weight_low_lbs,
            weight_high_lbs)) %>% 
  rename(breed = Breed)

Working with single columns

recoding categorical variables

Story: The previous owner seems to be inconsistent in providing the information on dogs’ breed and sex. For instance, he sometimes use “female” and other times “woman”! You decide to make it more consistent by changing “woman” values to “female” and “man” values to “male”.

One solution would be to use the subsetting rules you learned in the second/third session.

# SOLUTION 1: SUBSETTING


# check which labels are used in the sex column
table(old_records$sex)

## 
## female   male    man  woman 
##      6      8      3      3

# Just create a copy of the sex column for showing purposes
old_records$sex_subset <- old_records$sex

# change them
old_records$sex_subset[old_records$sex_subset == "woman"] <- "female" #What this means is: Assign the value "female" to entries in the sex column of the old_records dataframe where they currently have the value "woman".

old_records$sex_subset[old_records$sex_subset == "man"] <- "male"

# check again
table(old_records$sex_subset)

## 
## female   male 
##      9     11

Second solution is to use conditional statements provided by the ifelse() function. Its syntax is as follows:

result <- ifelse(condition, result if cond. == TRUE, result if cond. == FALSE)

The recoding as seen in the code chunk above can be done with ifelse(), too. Since the values have been already changed in the old_records dataframe, I just use the copy_old_records to show this solution.

# SOLUTION 2: ifelse()

# create a copy of the sex column for showing purposes
old_records$sex_ifelse <- old_records$sex

# change man -> male
old_records$sex_ifelse <- ifelse(old_records$sex_ifelse == "man",   # condition
                          "male",                     # wherever TRUE use "male"
                          old_records$sex_ifelse)            # otherwise, keep values as they are


table(old_records$sex_ifelse)

## 
## female   male  woman 
##      6     11      3

# SOLUTION 2: using pipe
# change woman -> female

old_records <- old_records %>% 
  mutate(sex_ifelse = ifelse(sex_ifelse == "woman", # condition
                      "female", # wherever TRUE use "female"
                      sex_ifelse)) # otherwise, keep values as they are


table(old_records$sex_ifelse)

## 
## female   male 
##      9     11

## NOW, we want to combine both ifelse conditions and nest the second ifelse inside the first one. Just for showing purposes, I use the dataframe copy2_old_records, so you can compare all results together.


old_records <- old_records %>% 
  mutate(sex = ifelse(sex == "man", # condition
                      "male", # wherever TRUE use "male"
                      ifelse(sex == "woman", # otherwise, another condition
                             "female", # wherever that one is TRUE, use female
                             sex))) # all other cases, use original

table(old_records$sex)

## 
## female   male 
##      9     11

table(old_records$sex_subset)

## 
## female   male 
##      9     11

table(old_records$sex_ifelse)

## 
## female   male 
##      9     11

# Let's remove unwanted categories

old_records <- old_records %>% 
  select(-c(sex_subset, sex_ifelse))

Furthermore, we want to reduce the inconsistencies in the breed info:

“Golden Retriever” is spelled “golden retr.” (row 15 of breed column)
Instead of “German Shepherd” “Schäferhund” is used (row 16 of breed column)
There is a typo in “Labrador”, instead it is spelled “Labradoer” (row 20 of breed column)

old_records <- old_records %>% 
  mutate(breed = ifelse( breed == "golden retr.", "Golden Retriever",
                        ifelse( breed == "Schäferhund", "German Shepherd",
                                ifelse( breed == "Labradoer", "Labrador", breed))))

## Let's check the values of the breed column 
old_records$breed

##  [1] "Pug"               "Labrador"          "Saint Bernard"    
##  [4] "Dalmatian"         "Bull Terrier"      "Weimaraner"       
##  [7] "Chow Chow"         "Pointer"           "Greyhound"        
## [10] "Finnish Spitz"     "Alaskan Malamute"  "Boxer"            
## [13] "Yorkshire Terrier" "Poodle"            "Golden Retriever" 
## [16] "German Shepherd"   "Chihuahua"         "Beagle"           
## [19] "Golden Retriever"  "Labrador"

Combining data sets

Story: After spending all this time preparing these separate data sets (and learning a lot along the way!), it is time to combine them into one complete data record.

There are different ways of combining two different data sets:

adding columns (in cases where the row structure is similar)
adding rows (in cases where the columns structure is similar)
so called “merging” or “joining”

Adding columns: cbind

Story: You notice that the weight measurements from this year are not in the old records. Before combining data sets it might make sense to correct that. You have your intern weigh all these dogs and give you their measurement (in the same order as they appear in the old records). You want to use cbind to add a single column.

The cbind() function (short for column bind) can be used to add/bind (multiple) columns to a data frame. A requirement of this is that the number of rows is similar in both pieces.

# put values into a vector with the intended column name
# weight_23 <- c(7.66, 25.47, 66.80, 31.69, 28.80, 29.26, 24.13, 34.45, 27.22, 14.12, 38.00, 26.97, 3.21, 22.95, 24.16, 32.96, 1.82, 11.93, 26.18)

# # combine old records with new column 
#old_records <- cbind(old_records, weight_23)

Story: An error! You realize that you have missed the last weight value. Gotta remember to be more careful with this. For once you are thankful for a (rather descriptive) error message. Who would have thought…

# put values into a vector with the intended column name

weight_23 <- c(7.66, 25.47, 66.80, 31.69, 28.80, 29.26, 24.13, 34.45, 27.22, 14.12, 38.00, 26.97, 3.21, 22.95, 24.16, 32.96,  1.82, 11.93, 26.18, 18.62)

# combine old records with new column 
old_records <- cbind(old_records, weight_23)

if (interactive()) {
  View(old_records)
}

Adding rows: rbind

Story: Next, you want to add the data of the new arrivals to the data set.

The rbind() function (short for rowbind) can be used to add/bind (multiple) rows to a data frame.

A requirement of this is that the number of columns (and the data type of each column) is similar in both pieces of data.

First we must make the new arrival data and the old records structurally similar.

# do they have different columns?
colnames(new_arrivals)

## [1] "breed"  "sex"    "weight"

colnames(old_records)

## [1] "id"        "breed"     "sex"       "weight_20" "weight_21" "weight_22"
## [7] "weight_23"

# get rid of useless id column in old_records
old_records <- select(old_records, -id)

# rename weight into weight_23 in new_arrivals data
new_arrivals <- rename(new_arrivals, weight_23 = weight)

# how about now?
colnames(new_arrivals)

## [1] "breed"     "sex"       "weight_23"

colnames(old_records)

## [1] "breed"     "sex"       "weight_20" "weight_21" "weight_22" "weight_23"

Story: What about the missing weight columns in the new_arrivals data? Is the function “smart” enough to add NAs and put the values in the correct columns even though the number of columns is still different?

# old_records <- rbind(old_records, new_arrivals)

Story: Oh no, apparently it is not. Alright, then you have to add them first! Probably this is what structurally similar means.

# add empty columns to new arrival data

new_arrivals <- new_arrivals %>% 
  mutate(weight_20 = NA) %>% 
  mutate(weight_21 = NA) %>% 
  mutate(weight_22 = NA)


old_records <- rbind(old_records, new_arrivals)

This time it worked. Note that the order of the columns does not necessarily have to be similar. As long as the names are similar the rbind function places the values in the correct position.

merge

Story: Now you want to add the information about the breed of each dog to the data. That operation is a different animal. You can’t just take the breed information dataset and add the columns to your dog data. The values from the breed information dataset need to be placed in the correct positions, according to what the breed column says. Some rows in the breed data (many actually) are about breeds that you don’t have at the shelter. You do not need those. And some of your dogs have the same breed, the information about that breed which only exists once in the breed information needs to be automatically duplicated where necessary. Oh my, that is quite the task! If only there would be a function that does exactly this…

The merge() function is perfect for this. It automatically detects which column exists in both datasets and uses that to add the values of all the columns that don’t exist in the first data set in the correct position. Let’s see it in action.

# Let's give the result a new name
dogs <- merge(old_records, dog_breed_info)

#View(dogs)

Something went wrong here. Actually, two unfortunate things have happened:

The rows have been reordered (alphabetically after the breed). This is bad as you do not have IDs or names to identify them. So, for instance you don’t know which Labrador is which.
Because some breeds that exist in your data do not appear in the breed info, dogs with these breeds have been dropped from the data.

Both are things we definitely can not allow!

The second is actually easy to fix: To ensure that all dogs remain in the data we can use all.x = TRUE

dogs <- merge(old_records, dog_breed_info, all.x = TRUE)

#View(dogs)

A solution is to “manually” add an ID to the data, then do the merge (as above) and then restore the original ordering.

# Here's a base and a tidyverse solution


# BASE: add ID as first column: a numeric sequence of 1 to the number of rows

id <- 1:nrow(old_records)

old_records <- cbind(id, old_records)

# TIDYVERSE

old_records <- old_records %>% 
  rowid_to_column("id2")

Now let’s merge the two dataframes.

# merge

dogs <- merge(old_records, dog_breed_info, all.x = TRUE)

# Let's arrange their ids.

dogs <- dogs %>% 
  arrange(id)


#View(dogs)

Now all dogs are kept in the data and the row order is retained.

However, some of the breeds do not have any info added. This is because of a third issue that is common with using merge: While all breeds exist in the breed info, they sometimes have different names compared to the names in your data. We can rename those in the breed information to solve it.

# which of the breed names of the dogs data do not appear in the breed info data?
# What we want to do is to Select the elements in the 'breed' column of the 'old_records' that are not present in the 'breed' column of the 'dog_breed_info' dataframe.

old_records$breed[!(old_records$breed %in% dog_breed_info$breed)] # hurra! subsetting with conditions to the rescue

## [1] "Labrador"        "Poodle"          "German Shepherd" "Labrador"       
## [5] "Poodle"          "German Shepherd"

# Now let's use ifelse() to clean up these names in dog_breed_info

dog_breed_info <- dog_breed_info %>% 
  mutate(breed = ifelse(breed == "Labrador Retriever", "Labrador", 
                        ifelse(breed == "Poodle Standard", "Poodle", 
                               ifelse(breed == "German Shepherd Dog", "German Shepherd", breed))))


# merge
dogs <- merge(old_records, dog_breed_info)

dogs <- dogs %>% 
  arrange(breed)

#View(dogs)

Finally!

merge() can also handle different applications:

The combined data should keep all rows from both datasets (set all = TRUE).
The variable to merge by has different names in the datasets (use by.x and by.y arguments).

categorization

Story: Now that you have finally put all the data together, it is time to use the full power of the information to help the dogs (and your shelter) prosper! After all, there is a reason why “knowledge is power” is a popular saying. First you want to determine which cages (for sleeping) are appropriate for which dogs. In your storage you have cages of three different sizes (small, medium and large). You dig up the manufacturer instructions for the cage model. In them you find for which height each size is appropriate. You want to “calculate” the appropriate cage size for each dog. Checking whether the calculated cage size matches with the currently used one seems like a good job for your intern…

This task could be called categorization: Transforming a numeric vector/column into a categorical using pre-defined thresholds.

Creating a new variable and then recoding it by ifelse() is a solution for this task:

# create height column
dogs$height <- c(32, 62, 59, 57, 48, 66, 53, 57, 85, 43, 63, 59, 15, 56, 63, 60, 19, 34, 55, 44, 53, 73, 35, 64, 65, 63, 68)


dogs <- dogs %>% 
  mutate(cage = ifelse(height < 40, "small",
                       ifelse(height >= 40 & height < 63, "medium", "large")))

Story: Next you want to know to combat overweight - a growing problem in the (dog) population? Not if you can help it! Knowing which weight is typical for the breed of each dog can help you determine whether its weight is appropriate which can help you with feeding them. However, just comparing a dogs actual weight with the typical weight range of its breed seems too simple to you. You feel like the dogs height should play a role as well. Clearly, a dog that is heavier than typical for its breed can still have a healthy weight if it is bigger than typical as well. FOr now, we just check if their weight and height is in the normal range. If you like the challenge, you can later combine the values in height_vs_range and weight_vs_range, and come up with an idea if they receive too much, too little, or within the range amount of food.

# compared height range, using nested ifelse()

dogs <- dogs %>% 
  mutate(height_vs_range = ifelse(height > height_high, "over", 
                                  ifelse(height < height_low, "under", 
                                         "in range")))
# compared weight range
dogs <- dogs %>% 
  mutate(weight_vs_range = ifelse(weight_23 > weight_high, "over", 
                                  ifelse(weight_23 < weight_low, "under", 
                                         "in range")))



#View(dogs)

# listing (all) value combinations and assigning result

Data Cleaning and Preprocessing

2023-11-22