From now on, we do things inside a project to keep a clear and coherent workspace. Go to File –> New Project (choose either a new directory or an existing one). For instance, you can choose the folder you downloaded for this session from ILIAS. You can then see the folder’s contents in the Files pane.
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.2.1 ✔ readr 2.2.0
## ✔ forcats 1.0.1 ✔ stringr 1.6.0
## ✔ ggplot2 4.0.2 ✔ tibble 3.3.1
## ✔ lubridate 1.9.5 ✔ tidyr 1.3.2
## ✔ purrr 1.2.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
dog_descriptions <- read_rds("data/dog_descriptions.rds")
Remember you sent your Dog corpus to your friend Maria to create a dog corpus for you? Here’s the rest of the story:
Story: Finally, it is time for Maria to create the corpus of dog descriptions. She first decides to split the descriptions sentence-wise, assuming that a period (.) marks the end of a sentence.
The code below uses the function separate_rows() to split the description column into one row per sentence, using the period (“.”) as the separator.
dog_corpus <- dog_descriptions %>%
select(unique_id, description) %>%
separate_rows(description, sep = "\\.")
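To see what separate_rows() does on a single description, here is a toy example (the text is made up):

```r
library(tidyr)

toy <- tibble::tibble(
  unique_id = 1,
  description = "Buddy is friendly. He loves walks. Great with kids."
)

separate_rows(toy, description, sep = "\\.")
# One row per sentence; note the leading spaces and the empty string
# produced after the final period -- the pipeline below cleans these up.
```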
The code below shows a big pipeline containing various functions we talked about earlier and some new ones.
Here is a breakdown of what it does:
- rename() renames the description column to sentence; str_trim() strips leading and trailing whitespace, and filter() then drops empty sentences.
- Within each unique_id group, the number of sentences, the position of each sentence, and the character and word counts are computed with the n(), seq(), and str_count() functions, respectively.
- The previous and next sentences are retrieved with the lag() and lead() functions. The argument n in these functions defines from how many rows back or ahead the values should be chosen.
- Finally, str_extract_all() collects all words beginning with the letter f or F.
dog_corpus <- dog_corpus %>%
  rename(sentence = description) %>%
  mutate(sentence = str_trim(sentence)) %>%
  filter(sentence != "") %>%
  group_by(unique_id) %>%
  mutate(number_of_sentences = n(),
         which_sentence = seq(n()),
         count_char = str_count(sentence),
         count_word = str_count(sentence, '\\w+')) %>%
  mutate(previous_sentence = lag(sentence, n = 1),
         next_sentence = lead(sentence, n = 1)) %>%
  ungroup() %>%
  mutate(words_with_letterF = str_extract_all(sentence, "\\b[fF]\\w+\\b"))
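To see the counting and windowing helpers in isolation, here is a small toy example (the sentences are made up):

```r
library(dplyr)
library(stringr)

s <- c("Buddy loves long walks", "He is gentle", "Adopt him today")

str_count(s)          # characters per sentence
str_count(s, "\\w+")  # words per sentence: 4, 3, 3

lag(s, n = 1)   # previous element: NA, then the sentence before
lead(s, n = 1)  # next element: the sentence after, then NA
```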
Story: Maria sends the corpus back to you at the shelter. After much thinking, you and Maria decide to call it “FurryFriends Corpus”.
You look at the FurryFriends Corpus and find it very cool. However, you are not 100% happy with it: you think you can do more with this corpus. You decide to activate your linguist side and create a better corpus using automatic annotation libraries. Since you know a bit about dependency parsing and Universal Dependencies, you google around and find a cool package called udpipe, which does dependency annotation for many languages.
#install.packages("udpipe")
library(udpipe)
We use the command udpipe_download_model() to download a pre-trained UDPipe model for the language you want to annotate. To find the model names this function accepts, you can:
1- Check the udpipe package manual, page 81 (link).
2- Check the list of models here. However, please note that not all of these models are implemented in the R udpipe wrapper.
# Download a pre-trained model (for English in this case). In some cases, you can
# give just the name of the language (without a specific model name), and udpipe
# downloads one of that language's models for you.
ud_model_eng <- udpipe_download_model(language = "english")
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/english-ewt-ud-2.5-191206.udpipe
## - This model has been trained on version 2.5 of data from https://universaldependencies.org
## - The model is distributed under the CC-BY-SA-NC license: https://creativecommons.org/licenses/by-nc-sa/4.0
## - Visit https://github.com/jwijffels/udpipe.models.ud.2.5 for model license details.
## - For a list of all models and their licenses (most models you can download with this package have either a CC-BY-SA or a CC-BY-SA-NC license) read the documentation at ?udpipe_download_model. For building your own models: visit the documentation by typing vignette('udpipe-train', package = 'udpipe')
## Downloading finished, model stored at 'C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/english-ewt-ud-2.5-191206.udpipe'
ud_model_eng_ewt <- udpipe_download_model(language = "english-ewt")
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/english-ewt-ud-2.5-191206.udpipe to C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/english-ewt-ud-2.5-191206.udpipe
## Downloading finished, model stored at 'C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/english-ewt-ud-2.5-191206.udpipe'
ud_model_german_gsd <- udpipe_download_model(language = "german-gsd")
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/german-gsd-ud-2.5-191206.udpipe to C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/german-gsd-ud-2.5-191206.udpipe
## Downloading finished, model stored at 'C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/german-gsd-ud-2.5-191206.udpipe'
ud_model_danish_ddt <- udpipe_download_model(language = "danish-ddt")
## Downloading udpipe model from https://raw.githubusercontent.com/jwijffels/udpipe.models.ud.2.5/master/inst/udpipe-ud-2.5-191206/danish-ddt-ud-2.5-191206.udpipe to C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/danish-ddt-ud-2.5-191206.udpipe
## Downloading finished, model stored at 'C:/Projects/github/WISE2023_PracticalSkillsForWorkingWithLinguisticData/Lectures/2024_01_10_Session11/danish-ddt-ud-2.5-191206.udpipe'
You can either download the model and use it directly; or, for subsequent use, you can load the downloaded model from your directory.
# directly from your environment
model_eng <- udpipe_load_model(ud_model_eng$file_model)
# reading the model from your directory
model_eng <- udpipe_load_model("english-ewt-ud-2.5-191206.udpipe")
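If you prefer to keep models out of your project root, udpipe_download_model() also accepts a model_dir argument (the "models" folder name below is just an example):

```r
library(udpipe)

# Download once into a dedicated folder, then load from there in later sessions
m <- udpipe_download_model(language = "english-ewt", model_dir = "models")
model_eng <- udpipe_load_model(m$file_model)
```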
Next, we can use the model to automatically annotate our data. The data we are interested in is the “description” column of the “dog_descriptions” dataframe.
# Apply the udpipe model
annotations <- udpipe_annotate(model_eng, x = dog_descriptions$description)
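A useful optional argument of udpipe_annotate() is doc_id; passing the dogs’ unique IDs there lets you trace every annotated token back to the dog it came from (using unique_id here is just one sensible choice):

```r
library(udpipe)

# Each token row in the output will carry the dog's unique_id as its doc_id
annotations <- udpipe_annotate(model_eng,
                               x = dog_descriptions$description,
                               doc_id = dog_descriptions$unique_id)
```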
After the annotation, you can turn the annotated output into a dataframe using the command as.data.frame().
annotations_df <- as.data.frame(annotations)
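The resulting dataframe has one row per token, with the standard CoNLL-U columns (doc_id, sentence_id, token, lemma, upos, head_token_id, dep_rel, among others). For instance, you could tabulate the part-of-speech distribution of the corpus:

```r
library(dplyr)

# One row per token; upos holds the Universal POS tag
annotations_df %>%
  count(upos, sort = TRUE)
```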
Furthermore, you can write the annotated output in the CoNLL-U format, using the command as_conllu().
The following line first converts the annotated data in the “annotations_df” dataframe into the CoNLL-U format, then writes it to a file named “annotations.conllu” in the working directory, using UTF-8 encoding.
cat(as_conllu(annotations_df), file = file("annotations.conllu", encoding = "UTF-8"))
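To reload the file in a later session, udpipe provides udpipe_read_conllu(), which parses a CoNLL-U file back into a token-level dataframe:

```r
library(udpipe)

# Read the annotations back from disk
annotations_back <- udpipe_read_conllu("annotations.conllu")
```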