Import libraries
Set up working directory
Introduction and recap of the previous session
Get data into R
Ifelse() conditional statements
Combining data sets
Adding some more info
Reshaping
Group-wise operations

From now on, we do things inside a project to keep a clear and coherent workspace. Go to File –> New Project (choose either a New directory or an existing one). For instance, you can choose the folder you downloaded for this session from ILIAS. You can have an overview of the folder in the output pane.

Import libraries

Set up working directory

Introduction and recap of the previous session

The example data for today’s class follows an example story. Because of this most (sub-)sections start with a short paragraph which explains the example story for each presented task (indicated by “Story: …”). Additionally, each (sub-)section usually covers one task, including explanations, code and (optionally) notes.

Before getting to the actual content, the following section introduces the example story: You have left academia and decided to take over a local dog shelter. The previous owner has kep very crappy records in excel, you decide to clean them up and complement them with more information. Thrpugh this process, we learned the following topics in the previous session:

Write data to other formatds (tsv, csv, rds) with commands such as write_tsv(), write_csv(), and write_rds()
Changing the column type, such as character to numeric, character to factor, etc, with commands like as.factor(), as.character(), as.numeric().
Rounding numeric values with the round() function.
handling “na” and “not found” string values (handling real NAs comes later).
Using ifelse() for conditional statement.

Get data into R

Our data from the previous session is saved in 3 formats, namely RDS, CSV, and TSV. Let’s bring them into R.

dog_breed_info <- read_rds("files/dog_breed_info.rds")


# instead of read_tsv() and read_csv(), I use read_delim() and define my delimiter inside it.

old_records <- read_delim("files/old_records.tsv", delim = "\t")

## Rows: 20 Columns: 6
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (3): breed, sex, weight_21
## dbl (3): id, weight_20, weight_22
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

new_arrivals <- read_delim("files/new_arrivals.csv", delim = ",")

## Rows: 7 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (2): breed, sex
## dbl (1): weight
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Ifelse() conditional statements

We notice some inconsistency in the breed of the dogs from the old records and decide to fix them using ifelse() syntax:

result <- ifelse(condition, result if cond. == TRUE, result if cond. == FALSE)

We want to reduce the following inconsistencies in the breed column of the old_records dataframe:

“Golden Retriever” is spelled “golden retr.” (row 15 of breed column)
Instead of “German Shepherd” “Schäferhund” is used (row 16 of breed column)
There is a typo in “Labrador”, instead it is spelled “Labradoer” (row 20 of breed column)

old_records <- old_records %>% 
  mutate(breed = ifelse( breed == "golden retr.", "Golden Retriever",
                        ifelse( breed == "Schäferhund", "German Shepherd",
                                ifelse( breed == "Labradoer", "Labrador", breed))))

## Let's check the values of the breed column 
old_records$breed

##  [1] "Pug"               "Labrador"          "Saint Bernard"    
##  [4] "Dalmatian"         "Bull Terrier"      "Weimaraner"       
##  [7] "Chow Chow"         "Pointer"           "Greyhound"        
## [10] "Finnish Spitz"     "Alaskan Malamute"  "Boxer"            
## [13] "Yorkshire Terrier" "Poodle"            "Golden Retriever" 
## [16] "German Shepherd"   "Chihuahua"         "Beagle"           
## [19] "Golden Retriever"  "Labrador"

Combining data sets

Story: After spending all this time preparing these separate data sets (and learning a lot along the way!), it is time to combine them into one complete data record.

There are different ways of combining two different data sets:

adding columns (in cases where we have the same number of rows)
adding rows (in cases where we have the same number of columns)
so-called “merging” or “joining”

Adding columns: cbind

Story: You notice that the weight measurements from this year are not in the old records. Before combining data sets it might make sense to correct that. You have your intern weigh all these dogs and give you their measurement (in the same order as they appear in the old records). You want to use cbind to add a single column.

The cbind() function (short for column bind) can be used to add/bind (multiple) columns to a data frame. A requirement of this is that the number of rows is similar in both pieces.

# # put values into a vector with the intended column name
# weight_23 <- c(7.66, 25.47, 66.80, 31.69, 28.80, 29.26, 24.13, 34.45, 27.22, 14.12, 38.00, 26.97, 3.21, 22.95, 24.16, 32.96, 1.82, 11.93, 26.18)
# 
# # combine old records with new column
# old_records <- cbind(old_records, weight_23)

Story: An error! You realize that you have missed the last weight value. Gotta remember to be more careful with this. For once you are thankful for a (rather descriptive) error message. Who would have thought…

# put values into a vector with the intended column name

weight_23 <- c(7.66, 25.47, 66.80, 31.69, 28.80, 29.26, 24.13, 34.45, 27.22, 14.12, 38.00, 26.97, 3.21, 22.95, 24.16, 32.96,  1.82, 11.93, 26.18, 18.62)

# combine old records with new column 
old_records <- cbind(old_records, weight_23)

colnames(old_records)

## [1] "id"        "breed"     "sex"       "weight_20" "weight_21" "weight_22"
## [7] "weight_23"

Adding rows: rbind

Story: Next, you want to add the data of the new arrivals to the data set.

The rbind() function (short for rowbind) can be used to add/bind (multiple) rows to a data frame.

A requirement of this is that the number of columns (and the data type of each column) is similar in both pieces of data.

First we must make the new arrival data and the old records structurally similar.

# do they have different columns?

colnames(new_arrivals)

## [1] "breed"  "sex"    "weight"

colnames(old_records)

## [1] "id"        "breed"     "sex"       "weight_20" "weight_21" "weight_22"
## [7] "weight_23"

# get rid of useless id column in old_records
old_records <- select(old_records, -id)

# rename weight into weight_23 in new_arrivals data
new_arrivals <- rename(new_arrivals, weight_23 = weight)

# how about now?
colnames(new_arrivals)

## [1] "breed"     "sex"       "weight_23"

colnames(old_records)

## [1] "breed"     "sex"       "weight_20" "weight_21" "weight_22" "weight_23"

Story: What about the missing weight columns in the new_arrivals data? Is the function “smart” enough to add NAs and put the values in the correct columns even though the number of columns is still different?

#old_records <- rbind(old_records, new_arrivals)

Story: Oh no, apparently it is not.

Alright, then you have to add them first!

# add empty columns to new arrival data

new_arrivals <- new_arrivals %>% 
  mutate(weight_20 = NA) %>% 
  mutate(weight_21 = NA) %>% 
  mutate(weight_22 = NA)


all_records <- rbind(old_records, new_arrivals)

This time it worked.

Note that the order of the columns does not necessarily have to be similar. As long as the names are similar the rbind function places the values in the correct position.

The second solution is to use bind_rows() from tidyverse.

copy_all_records <- bind_rows(all_records, new_arrivals)


rm(copy_all_records, old_records, new_arrivals)

merge

Story: Now you want to add the information about the breed of each dog to the data. That operation is a different animal. You can’t just take the breed information dataset and add the columns to your dog data. The values from the breed information dataset need to be placed in the correct positions, according to what the breed column says.

Some rows in the breed data (many actually) are about breeds that you don’t have at the shelter.

You do not need those.

And some of your dogs have the same breed, the information about that breed which only exists once in the breed information needs to be automatically duplicated where necessary. Oh my, that is quite the task! If only there would be a function that does exactly this…

The merge() function is perfect for this.

It automatically detects which column exists in both datasets and uses that to add the values of all the columns that don’t exist in the first data set in the correct position.

Let’s see it in action.

# Let's give the result a new name

dogs <- merge(all_records, dog_breed_info)

#View(dogs)

Something went wrong here.

Actually, two unfortunate things have happened:

The rows have been reordered (alphabetically after the breed). This is bad as you do not have IDs or names to identify them. So, for instance you don’t know which Labrador is which.
Because some breeds that exist in your data do not appear in the breed info, dogs with these breeds have been dropped from the data.

Both are things we definitely can not allow!

The second is actually easy to fix: To ensure that all dogs remain in the data we can use all.x = TRUE

dogs <- merge(all_records, dog_breed_info, all.x = TRUE)

#View(dogs)

A solution is to “manually” add an ID to the data, then do the merge (as above) and then restore the original ordering.

# Here's a base and a tidyverse solution


# BASE: add ID as first column: a numeric sequence of 1 to the number of rows

id <- 1:nrow(all_records)

all_records <- cbind(id, all_records)

# TIDYVERSE

all_records <- all_records %>% 
  rowid_to_column("id2")

Now let’s merge the two dataframes.

# merge

dogs <- merge(all_records, dog_breed_info, all.x = TRUE)

# Let's arrange their ids.

dogs <- dogs %>% 
  arrange(id)


#View(dogs)

Now all dogs are kept in the data and the row order is retained.

However, some of the breeds do not have any info added.

This is because of a third issue that is common with using merge: While all breeds exist in the breed info, they sometimes have different names compared to the names in your data.

We can rename those in the breed information to solve it.

# which of the breed names of the dogs data do not appear in the breed info data?
# What we want to do is to choose the elements in the 'breed' column of the 'old_records' that are not present in the 'breed' column of the 'dog_breed_info' dataframe.

#Conditional subsetting
all_records$breed[!(all_records$breed %in% dog_breed_info$breed)]

## [1] "Labrador"        "Poodle"          "German Shepherd" "Labrador"       
## [5] "Poodle"          "German Shepherd"

#tidyverse
all_records %>% filter(!(breed %in% dog_breed_info$breed)) %>% select(breed)

##             breed
## 1        Labrador
## 2          Poodle
## 3 German Shepherd
## 4        Labrador
## 5          Poodle
## 6 German Shepherd

# Now let's use ifelse() to clean up these names in dog_breed_info

dog_breed_info <- dog_breed_info %>% 
  mutate(breed = ifelse(breed == "Labrador Retriever", "Labrador", 
                        ifelse(breed == "Poodle Standard", "Poodle", 
                               ifelse(breed == "German Shepherd Dog", "German Shepherd", breed))))

Now finally we can merge them proprly.

# merge

dogs <- merge(all_records, dog_breed_info) %>% 
  arrange(id)

rm(all_records, dog_breed_info)

#View(dogs)

merge() can also handle different applications:

The combined data should keep all rows from both datasets (set all = TRUE).
The variable to merge by has different names in the datasets (use by.x and by.y arguments).

categorization

Story: Now that you have finally put all the data together, it is time to use the full power of the information to help the dogs (and your shelter) prosper! After all, there is a reason why “knowledge is power” is a popular saying.

First you want to determine which cages (for sleeping) are appropriate for which dogs. In your storage you have cages of three different sizes (small, medium and large). You dig up the manufacturer instructions for the cage model. In them you find for which height each size is appropriate. You want to “calculate” the appropriate cage size for each dog. Checking whether the calculated cage size matches with the currently used one seems like a good job for your intern…

This task could be called categorization: Transforming a numeric vector/column into a categorical using pre-defined thresholds.

Creating a new variable and then recoding it by ifelse() is a solution for this task:

# create height column

dogs <- dogs %>% 
  mutate(height = c(32, 62, 59, 57, 48, 66, 53, 57, 85, 43, 63, 59, 15, 56, 63, 60, 19, 34, 55, 44, 53, 73, 35, 64, 65, 63, 68))

dogs <- dogs %>% 
  mutate(cage = ifelse(height < 40, "small",
                       ifelse(height >= 40 & height < 63, "medium", "large")))

Story: Next you want to know to combat overweight - a growing problem in the (dog) population? Not if you can help it! Knowing which weight is typical for the breed of each dog can help you determine whether its weight is appropriate which can help you with feeding them. However, just comparing a dogs actual weight with the typical weight range of its breed seems too simple to you. You feel like the dogs height should play a role as well. Clearly, a dog that is heavier than typical for its breed can still have a healthy weight if it is bigger than typical as well.

For now, we just check if their weight and height is in the normal range.

If you like the challenge, you can later combine the values in height_vs_range and weight_vs_range, and come up with an idea if they receive too much, too little, or within the range amount of food.

# compared height range, using nested ifelse()

dogs <- dogs %>% 
  mutate(height_vs_range = ifelse(height > height_high, "over", 
                                  ifelse(height < height_low, "under", 
                                         "in range")))
# compared weight range
dogs <- dogs %>% 
  mutate(weight_vs_range = ifelse(weight_23 > weight_high, "over", 
                                  ifelse(weight_23 < weight_low, "under", 
                                         "in range")))


#View(dogs)

# listing (all) value combinations and assigning result

## We have a lot of conditions here. Using ifelse will be a bit cumbersome since there are many nested ifelse() conditions. I instead use case_when() function from tidyverse, which allows for allows for each condition and its corresponding action to be clearly and separately stated.

dogs <- dogs %>%
  mutate(feed_instruction = NA) %>% 
  mutate(feed_instruction = case_when(
    height_vs_range == "under" & weight_vs_range == "over" ~ "much less",
    height_vs_range == "under" & weight_vs_range == "in range" ~ "less",
    height_vs_range == "under" & weight_vs_range == "under" ~ "normal",
    height_vs_range == "in range" & weight_vs_range == "over" ~ "less",
    height_vs_range == "in range" & weight_vs_range == "in range" ~ "normal",
    height_vs_range == "in range" & weight_vs_range == "under" ~ "more",
    height_vs_range == "over" & weight_vs_range == "over" ~ "normal",
    height_vs_range == "over" & weight_vs_range == "in range" ~ "more",
    height_vs_range == "over" & weight_vs_range == "under" ~ "much more",
    TRUE ~ feed_instruction # Default case to keep existing values
  ))

Adding some more info

Story: Next you want to fix something that has been bugging you for quite a while now.

It does not seem right to refer to your furry friends just using IDs. Instead they should have names. This will also make them more relateable to potential future buyers.

You reach out to a name expert and provided images of the dogs for reference. Based on the pictures, the expert curates a list of names. Time to make the dogs and their data more personal.

The following code uses the read_csv() function to read in the name of the dogs as a dataframe.

dog_names <- read_csv("files/dog_names.txt")

## Rows: 27 Columns: 1
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): name
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Story: As part of their efforts to promote the adoption of your lovable dogs, you decided to create individualized descriptions for each of them, highlighting their unique characteristics. To accomplish this, you turned to ChatGPT, which generates descriptions tailored to the specific features of each dog.

The following code uses the read_csv() function to read in the description of the dogs as a dataframe.

dog_descriptions <- read_tsv("files/reports.txt")

## Rows: 27 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: "\t"
## chr (2): name, description
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Lets combine this with the previous data:

dogs <- cbind(dogs, dog_descriptions)

Let’s reorder the columns.

dogs <- dogs %>% 
  select(id, breed, name, everything())

Finally some more clean up, before moving on.

str(dogs)

## 'data.frame':    27 obs. of  19 variables:
##  $ id              : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ breed           : chr  "Pug" "Labrador" "Saint Bernard" "Dalmatian" ...
##  $ name            : chr  "Sophronia" "Buddy" "Mara" "Gracelyn" ...
##  $ id2             : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ sex             : chr  "female" "male" "female" "female" ...
##  $ weight_20       : num  7.54 23.75 NA NA NA ...
##  $ weight_21       : chr  "7.27" "28.88" "57.79" "18.89" ...
##  $ weight_22       : num  6.2 19.3 78.4 26.6 34.1 ...
##  $ weight_23       : num  7.66 25.47 66.8 31.69 28.8 ...
##  $ height_low      : num  25.4 53.3 63.5 48.3 53.3 63.5 48.3 53.3 68.6 38.1 ...
##  $ height_high     : num  27.9 61 71.1 58.4 55.9 68.6 55.9 61 76.2 50.8 ...
##  $ weight_low      : num  6.3 24.9 49.8 20.4 22.7 31.7 20.4 19.9 27.2 14 ...
##  $ weight_high     : num  10 36.2 86.1 31.7 31.7 38.5 24.9 29.9 31.7 15.9 ...
##  $ height          : num  32 62 59 57 48 66 53 57 85 43 ...
##  $ cage            : chr  "small" "medium" "medium" "medium" ...
##  $ height_vs_range : chr  "over" "over" "under" "in range" ...
##  $ weight_vs_range : chr  "in range" "in range" "in range" "in range" ...
##  $ feed_instruction: chr  "more" "more" "less" "normal" ...
##  $ description     : chr  "Sophronia is a female Pug who stands at 32 cm tall. This small and affectionate breed is known for their playfu"| __truncated__ "Buddy is a male Labrador who stands at 62 cm tall. This friendly and active breed is known for their obedience "| __truncated__ "Mara is a female Saint Bernard who stands at 59 cm tall. Saint Bernards are a giant breed known for their size "| __truncated__ "Gracelyn is a female Dalmatian who stands at 57 cm tall. Dalmatians are an energetic and playful breed known fo"| __truncated__ ...

dogs <- dogs %>% 
  mutate(weight_21 = as.numeric(weight_21))

## Warning: There was 1 warning in `mutate()`.
## ℹ In argument: `weight_21 = as.numeric(weight_21)`.
## Caused by warning:
## ! NAs introduced by coercion

Reshaping

Story: The annual weighing of dogs at the shelter is an important practice. You have a specific scale to do so.

Unfortunately, the manufacturers of the scale have recently come forward with news that their product is not accurately measuring the weight of dogs under 5 kilograms and that these values need to be increased by 10%.

Pivot_longer

Given that the weight information is currently spread across multiple columns (weight_20, weight_21, weight_22, and weight_23), consolidating it into a single column could make the necessary changes more manageable.

Pivot_longer() helps us reshape the data from a “wide” format to a “long” format, where each variable is in a single column and each observation is in a separate row. This is useful when working with data that has multiple values for a single observation in different columns (e.g., weight info in our dogs dataframe).

Let’s first look at a very simple artificial example.

In the code below, we first create a data frame called df with columns “id”, “x”, “y”, and “z”. Let’s call “x”, “y”, and “z” our features. Next, we create a new data frame called df_long using the pivot_longer function. The cols argument is used to specify the columns that we want to pivot, in this case, the columns “x”, “y”, and “z”. The names_to argument is used to specify the new column name for the feature names, which is “feature”. The values_to argument is used to specify the new column name for the feature values, which is “value”.

df <- data.frame(
  id = c(1, 2),
  x = c("a", "b"),
  y = c("c", "d"),
  z = c("e", "f")
)

df

##   id x y z
## 1  1 a c e
## 2  2 b d f

df_long <- df %>% 
  pivot_longer(cols = c(x, y, z), 
               names_to = "feature", values_to = "value")

df_long

## # A tibble: 6 × 3
##      id feature value
##   <dbl> <chr>   <chr>
## 1     1 x       a    
## 2     1 y       c    
## 3     1 z       e    
## 4     2 x       b    
## 5     2 y       d    
## 6     2 z       f

Now let’s see how we can apply this to our dog dataframe.

For simplicity and visual reasons, I create a smaller dataframe (dog_weight) by selecting only the name, breed, and the weight columns.

Please note that you can run pivot_longer on the full dataframe. You just need to mention which columns need to be pivoted.

dog_weight <- dogs %>%
  select(name, breed, weight_20, weight_21, weight_22, weight_23)

The code below uses the function pivot_longer() to reshape the columns weight_20, weight_21, weight_22, and weight_23 into a longer format.

The argument values_drop_na is set to TRUE, so any missing weight values in the columns are not added as extra rows.

longer_weight <- pivot_longer(dog_weight,
    cols = c(weight_20, weight_21, weight_22, weight_23),
    names_to = "year", 
    values_to = "weight",
    values_drop_na = TRUE
  )

#let us arrange the values to see which ones are under 5 kilo

head(arrange(longer_weight, weight), n=7)

## # A tibble: 7 × 4
##   name      breed             year      weight
##   <chr>     <chr>             <chr>      <dbl>
## 1 Harvey    Chihuahua         weight_22   1.34
## 2 Harvey    Chihuahua         weight_23   1.82
## 3 Emmett    Yorkshire Terrier weight_23   3.21
## 4 Emmett    Yorkshire Terrier weight_22   3.46
## 5 Sophronia Pug               weight_22   6.2 
## 6 Sophronia Pug               weight_21   7.27
## 7 Sophronia Pug               weight_20   7.54

# values that need to be changed
#Harvey: 1.34
#Harvey: 1.82
#Emmett: 3.21
#Emmett: 3.46

After creating the long dataframe “longer_weight”, we want to update the values in the weight column.

The code below updates the column weight in the data frame longer_weight by using the ifelse function.

The ifelse function checks each value in the weight column to see if it is less than 5.

If a value is less than 5, it is increased by 10% of its original value. This increase is calculated by multiplying the original value with 1.1. If a value is not less than 5, it remains unchanged. We then round the values to two decimal places.

longer_weight <- longer_weight %>% 
  mutate(weight = ifelse(weight < 5,
                         weight * 1.1,
                         weight)) %>% 
  mutate(weight = round(weight, digits = 2))

head(arrange(longer_weight, weight), n= 5)

## # A tibble: 5 × 4
##   name      breed             year      weight
##   <chr>     <chr>             <chr>      <dbl>
## 1 Harvey    Chihuahua         weight_22   1.47
## 2 Harvey    Chihuahua         weight_23   2   
## 3 Emmett    Yorkshire Terrier weight_23   3.53
## 4 Emmett    Yorkshire Terrier weight_22   3.81
## 5 Sophronia Pug               weight_22   6.2

Pivot_wider

Pivot_wider is the opposite of pivot_longer.

It is used to reshape a data frame from long format to wide format. The pivot_wider function takes columns with multiple values and spreads them out into multiple columns, while collapsing multiple rows into one.

Again we will first demonstrate this using a simplified artificial example. The code below first creates a data frame “df2” with three columns: “name”, “features”, and “value”.

The “value” column contains the corresponding values for each feature for each individual.

Next, the code uses the “pivot_wider” function to reshape the data frame from long format to wide format.

The “names_from” argument is set to “features”, meaning that the unique values in the “features” column (i.e., height, weight, gender) should become the new column names.

The “values_from” argument is set to “value” to specify that the values in the “value” column should be spread out into the new columns.

The result is a new data frame, “df_wide”, which has columns for height, weight, and gender, and a row for each individual, with the corresponding values for each variable in the same row.

# Create a long-format data frame
df2 <- data.frame(
  name = c("John", "Jane", "Mary", 
           "John", "Jane", "Mary", 
           "John", "Jane", "Mary"),
  features = c("height", "height", "height" , 
               "weight", "weight", "weight", 
               "gender", "gender", "gender"),
  value = c(180, 170, 160, 
            70, 65, 60, 
            "male", "female", "female"))

head(df2)

##   name features value
## 1 John   height   180
## 2 Jane   height   170
## 3 Mary   height   160
## 4 John   weight    70
## 5 Jane   weight    65
## 6 Mary   weight    60

# Use pivot_wider to spread out the "value" column into multiple columns
df_wide <- df2 %>% pivot_wider(names_from = features, values_from = value)

df_wide

## # A tibble: 3 × 4
##   name  height weight gender
##   <chr> <chr>  <chr>  <chr> 
## 1 John  180    70     male  
## 2 Jane  170    65     female
## 3 Mary  160    60     female

Going back to our dog business: In the previous section, we increased the weight of dogs weighing less than 5 kilos by 10%.

Now, we use the pivot_wider function to transform the long dataframe “longer_weight” into its previous wide format.

The “names_from” argument specifies that the unique values in the “year” column of the “longer_weight” dataframe (i.e., weight_20, weight_21, weight_22, weight_23) will become the new column names in the “wider_weight” dataframe.

For now, we call this dataframe wider_weight; but it is in fact similar to the dog_weight dataframe

wider_weight <- pivot_wider(longer_weight, names_from = year, values_from = weight)

Before moving to the next task, let’s remove the dataframes we do not need anymore. In the code below, the grep function is used to search for objects in the current environment that match the pattern “dogs”. The invert = TRUE argument inverts the search so that it returns objects that do NOT contain the word “dogs” in their name. Finally, the rm function removes all the dataframes stored in the “toremove” object.

toremove <- grep("dogs", ls(),
                 invert = TRUE,
                 value = TRUE)

rm(list = c(toremove, "toremove"))

Group-wise operations

Group-wise operations refer to the process of performing operations on subsets of data, based on the values in one or more columns.

In what follows, we talk about the functions group-by() and then summarise().

Group-by()

With group_by(), you can specify one or more variables that you want to use as the basis for grouping your data.

The function will then create groups based on the unique values of the specified variables and arrange the data accordingly.

For instance, we can group our dogs based on their bread, and then apply some functions to each group.

Story: At the shelter you want to know how many members of each breed you have. For instance, to add more members to groups with only one member. Also, for your database, you want to assign IDs to members of each breed based on their height (smallest to largest).

Again for representational purposes, I reduce the dimensions of the “dogs” dataframe to only a few columns we will use here (step 1).

In step 2, we first group-by the dogs by their breed. The mutate function is then used to create a new column called “number_of_members” that contains the number of members in each breed group.

The function n() counts the number of observations (rows) within each group.

Step 3: Since owners want to assign IDs to breed members based on their height (smallest to largest), we first “arrange” members of a group based on their height.

The mutate function is then used to create a new column called breed_group_id that contains a unique identifier for each breed group.

The seq(n()) function is used within the mutate function to generate a sequence of numbers based on the number of observations (rows) in each group, which is given by n().

The seq function generates a sequence of numbers from 1 to n() and assigns that sequence to the breed_group_id column.

Step 4 (IMPORTANT): Finally, the ungroup() function is used to remove the grouping of the data, returning the data to its original format.

By ungrouping the data, you ensure that the data is in the correct format for future operations and analysis.

#step 1
dog_groups <- dogs %>% 
  select(name, breed, sex, height) 

#step 2
dog_groups <- dog_groups %>% 
  group_by(breed) %>% 
  mutate(number_of_members = n()) %>% 
  ungroup() # step 4

#step 3
dog_groups <- dog_groups %>% 
  group_by(breed) %>% 
  arrange(height) %>% 
  mutate(breed_group_id = seq(n())) %>% 
  ungroup() # step 4

Group-by() and summarise()

The group_by() and summarise() functions are often used together to perform data summarization and aggregation. group_by is used to group the data based on one or more variables, and summarise is used to apply summary functions to the subgroups.

Note that different from the application of group_by above, this combination aggregates the data of each group down to one row.

Useful calculations you can do with summarise (Taken from the documentation: https://dplyr.tidyverse.org/reference/summarise.html)

Center: mean(), median()
Spread: sd(), IQR(), mad()
Range: min(), max(),
Position: first(), last(), nth(),
Count: n(), n_distinct()
Logical: any(), all()

Story: One day, you receive a request from a prestigious animal organization called “Furry Friends Foundation”. The organization is conducting a study on the health and well-being of dogs in shelters across the country, and want to get a more in-depth understanding of any potential gender-based differences in the population.

So, the organization asks you to provide summary statistics of the dogs at the shelter based on their gender.

Let us first calculate the number of members in each sex group.

First, the data in the “dogs” dataframe is grouped based on the “sex” column.

For each group defined by the “sex” column (male vs. female), the count of observations is calculated using the n() function.

The result is stored in a new variable called “n_dogs.” Since the “sex” column has two distinct values, male and female, the summary statistics will be given on two rows (one for female dogs and the other for male dogs).

gender_groups <- dogs %>% 
  group_by(sex) %>% 
  summarise(n_dogs = n()) %>% 
  ungroup()

gender_groups

## # A tibble: 2 × 2
##   sex    n_dogs
##   <chr>   <int>
## 1 female     13
## 2 male       14

Story: Since you enjoyed the combination of group-by() and summarise() a lot, you decide to also calculate bunch of other values for each gender.

After grouping the dogs by their sex, the following variables are calculated in the code below using the summarise() function:

n_dogs, which is the number of dogs in each group
mean_height, which is the average height of the dogs in each group
mean_weight2023 and mean_weight2022, which are the average weights of the dogs in 2023 and 2022, respectively. Since there are some missing values in year 22, we include the following (na.rm = TRUE) which removes the NA values from the calculations.
min_height, which is the minimum height among the dogs in each group
max_height, which is the maximum height among the dogs in each group
cage_small, cage_medium, and cage_large, which are the number of dogs in each group that have a cage of size “small”, “medium”, or “large”.

gender_groups <- dogs %>% 
  group_by(sex) %>%
  summarise(n_dogs = n(),
            mean_height = mean(height),
            mean_weight2023 = mean(weight_23),
            mean_weight2022 = mean(weight_22, na.rm = TRUE),
            min_height = min(height),
            max_height = max(height),
            cage_small = sum(cage == "small"),
            cage_medium = sum(cage == "medium"),
            cage_large = sum(cage == "large")) %>% 
  ungroup()

#install.packages("DT")

#DT::datatable(gender_groups)

Data Cleaning and Preprocessing

2023-11-29