From now on, we do things inside a project to keep a clear and coherent workspace. Go to File –> New Project (choose either a new directory or an existing one). For instance, you can choose the folder you downloaded for this session from ILIAS. You can then see the folder’s contents in the Files pane.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dog_descriptions <- read_rds("data/dog_descriptions.rds")
Remember you sent your Dog corpus to your friend Maria to create a dog corpus for you? Here’s the rest of the story:
Story: Finally, it is time for Maria to create the corpus of dog descriptions. She first decides to split the descriptions sentence-wise, assuming that a period (.) marks the end of a sentence.
The code below uses the function separate_rows() to split the description column into one row per sentence, using the period (“.”) as the separator.
dog_corpus <- dog_descriptions %>%
select(unique_id, description) %>%
separate_rows(description, sep = "\\.")
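To see what separate_rows() does on a single description, here is a toy example (the text is made up):

```r
library(tidyr)

toy <- tibble::tibble(
  unique_id = 1,
  description = "Buddy is friendly. He loves walks. Great with kids."
)

separate_rows(toy, description, sep = "\\.")
# One row per sentence; note the leading spaces and the empty string
# produced after the final period -- the pipeline below cleans these up.
```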
The code below shows a big pipeline containing various functions we talked about earlier and some new ones.
Here is a breakdown of what it does:
- rename() renames the description column to sentence; str_trim() strips leading and trailing whitespace, and filter() then drops empty sentences.
- Within each unique_id group, the number of sentences, the position of each sentence, and the character and word counts are computed with the n(), seq(), and str_count() functions, respectively.
- The previous and next sentences are retrieved with the lag() and lead() functions. The argument n in these functions defines from how many rows back or ahead the values should be chosen.
- Finally, str_extract_all() collects all words beginning with the letter f or F.
dog_corpus <- dog_corpus %>%
  rename(sentence = description) %>%
  mutate(sentence = str_trim(sentence)) %>%
  filter(sentence != "") %>%
  group_by(unique_id) %>%
  mutate(number_of_sentences = n(),
         which_sentence = seq(n()),
         count_char = str_count(sentence),
         count_word = str_count(sentence, '\\w+')) %>%
  mutate(previous_sentence = lag(sentence, n = 1),
         next_sentence = lead(sentence, n = 1)) %>%
  ungroup() %>%
  mutate(words_with_letterF = str_extract_all(sentence, "\\b[fF]\\w+\\b"))
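To see the counting and windowing helpers in isolation, here is a small toy example (the sentences are made up):

```r
library(dplyr)
library(stringr)

s <- c("Buddy loves long walks", "He is gentle", "Adopt him today")

str_count(s)          # characters per sentence
str_count(s, "\\w+")  # words per sentence: 4, 3, 3

lag(s, n = 1)   # previous element: NA, then the sentence before
lead(s, n = 1)  # next element: the sentence after, then NA
```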
Story: Maria sends the corpus back to you at the shelter. After much thinking, you and Maria decide to call it “FurryFriends Corpus”.
You look at the FurryFriends Corpus and find it very cool. However, you are not 100% happy with it: you think you can do more with this corpus. You decide to activate your linguist side and create a better corpus using automatic annotation libraries. Since you know a bit about dependency parsing and Universal Dependencies, you google around and find a cool package called udpipe, which does dependency annotation for many languages.
#install.packages("udpipe")
library(udpipe)
We use the command udpipe_download_model() to download a pre-trained UDPipe model for the language you want to annotate. To find the model names this function accepts, you can:
1- Check the udpipe package manual, page 81 (link).
2- Check the list of models here. However, please note that not all of these models are implemented in the R udpipe wrapper.
# Download a pre-trained model (for English in this case). In some cases, you can
# give just the name of the language (without a specific model name), and udpipe
# downloads one of that language's models for you.
ud_model_eng <- udpipe_download_model(language = "english")
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/english-ewt-ud-2.5-191206.udpipe
## - This model has been trained on version 2.5 of data from https://universaldependencies.org
## - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
## - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
## - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
## Downloading finished, model stored at 'C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/english-ewt-ud-2.5-191206.udpipe'
ud_model_eng_ewt <- udpipe_download_model(language = "english-ewt")
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/english-ewt-ud-2.5-191206.udpipe
## Downloading finished, model stored at 'C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/english-ewt-ud-2.5-191206.udpipe'
ud_model_german_gsd <- udpipe_download_model(language = "german-gsd")
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/german-gsd-ud-2.5-191206.udpipe to C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/german-gsd-ud-2.5-191206.udpipe
## Downloading finished, model stored at 'C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/german-gsd-ud-2.5-191206.udpipe'
ud_model_danish_ddt <- udpipe_download_model(language = "danish-ddt")
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/danish-ddt-ud-2.5-191206.udpipe to C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/danish-ddt-ud-2.5-191206.udpipe
## Downloading finished, model stored at 'C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/danish-ddt-ud-2.5-191206.udpipe'
You can either download the model and use it directly; or, for subsequent use, you can load the downloaded model from your directory.
# directly from your environment
model_eng <- udpipe_load_model(ud_model_eng$file_model)
# reading the model from your directory
model_eng <- udpipe_load_model("english-ewt-ud-2.5-191206.udpipe")
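If you prefer to keep models out of your project root, udpipe_download_model() also accepts a model_dir argument (the "models" folder name below is just an example):

```r
library(udpipe)

# Download once into a dedicated folder, then load from there in later sessions
m <- udpipe_download_model(language = "english-ewt", model_dir = "models")
model_eng <- udpipe_load_model(m$file_model)
```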
Next, we can use the model to automatically annotate our data. The data we are interested in is the “description” column of the “dog_descriptions” dataframe.
# Apply the udpipe model
annotations <- udpipe_annotate(model_eng, x = dog_descriptions$description)
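A useful optional argument of udpipe_annotate() is doc_id; passing the dogs’ unique IDs there lets you trace every annotated token back to the dog it came from (using unique_id here is just one sensible choice):

```r
library(udpipe)

# Each token row in the output will carry the dog's unique_id as its doc_id
annotations <- udpipe_annotate(model_eng,
                               x = dog_descriptions$description,
                               doc_id = dog_descriptions$unique_id)
```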
After the annotation, you can turn the annotated output into a dataframe using the command as.data.frame().
annotations_df <- as.data.frame(annotations)
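The resulting dataframe has one row per token, with the standard CoNLL-U columns (doc_id, sentence_id, token, lemma, upos, head_token_id, dep_rel, among others). For instance, you could tabulate the part-of-speech distribution of the corpus:

```r
library(dplyr)

# One row per token; upos holds the Universal POS tag
annotations_df %>%
  count(upos, sort = TRUE)
```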
Furthermore, you can write the annotated output in the CoNLL-U format, using the command as_conllu().
The following line first converts the annotated data in the “annotations_df” dataframe into the CoNLL-U format, then writes it to a file named “annotations.conllu” in the working directory, using UTF-8 encoding.
cat(as_conllu(annotations_df), file = file("annotations.conllu", encoding = "UTF-8"))
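To reload the file in a later session, udpipe provides udpipe_read_conllu(), which parses a CoNLL-U file back into a token-level dataframe:

```r
library(udpipe)

# Read the annotations back from disk
annotations_back <- udpipe_read_conllu("annotations.conllu")
```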