From now on, we work inside a project to keep the workspace clear and coherent. Go to File –> New Project and choose either a new directory or an existing one. For instance, you can choose the folder you downloaded for this session from ILIAS. You can then see an overview of the folder in the output pane.

Story

We are getting close to the holidays, and you want to compile a good list of movies to watch. You decide to create a corpus of 2023 movies to help you make wise decisions for your holidays. To compile this list, you decide to do some web scraping.

What is web scraping?

“Web scraping is a technique used to extract specific information or data from websites.”

And what is web crawling? “The automated download of HTML pages is called Crawling.”

However, before getting to the fun part, you first need to learn a bit about hierarchical structures.

Hierarchical data structures: XML and HTML

XML

XML (eXtensible Markup Language) and HTML (Hypertext Markup Language) are both markup languages. These structures are essential for representing and organizing data in a way that reflects relationships and nesting of elements.

Feature	XML	HTML
Primary Use	Data storage and interchange.	Web page structure and content display.
Structure Type	Tree-like hierarchical structure.	Document Object Model (DOM) hierarchical structure.
Root Element	Single root element containing all other elements.	`<html>` tag as the root element for a web page.
Elements	Custom tags defined by the user.	Predefined tags (like `<div>`, `<p>`, `<h1>`, etc.).
Nesting	Deep nesting to represent complex data relationships.	Nesting to define page layout and content structure.
Attributes	Attributes provide metadata for elements.	Attributes define properties of elements (like `class`, `id`, `style`).
Parent-Child Relationship	Crucial for data structure and parsing.	Essential for webpage layout and CSS styling.
Sibling Relationship	Less emphasized compared to HTML.	Important for layout, particularly in CSS design.
Purpose in Web Development	Data transmission between systems and applications.	Front-end development and user interface design.

Important concepts in hierarchical structures:

Elements
Attributes
Opening and closing tags
Nodes
Parent-child relationship
Siblings
Nesting
Root element

Parent-Child Relationships:

The <students> element is the root and the parent node of two <student> child elements. (Alternative phrasing: the <student> elements are the children of the <students> element.)
Each <student> element is a parent node of a <name> element and a <minitasks> element.
The <minitasks> element is the parent node of <minitask1> and <minitask2> elements.
<name> has a text node child with the student’s name.

Sibling Relationships:

The two <student> elements are siblings of each other.
Within each <student> element, the <name> element and the <minitasks> element are siblings.
Within each <minitasks> element, the <minitask1> and <minitask2> elements are siblings of each other.
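
The relationships above can be checked directly in R. The following is a minimal sketch, assuming a small <students> document reconstructed from the description; the second student's name and the task values are made up for illustration. It uses xml2 functions to list children, siblings, and parents.

```r
library(xml2)

# Hypothetical document, reconstructed from the description above
students_xml <- '
<students>
  <student id="std_1">
    <name>Olga Lapinskaya</name>
    <minitasks>
      <minitask1>passed</minitask1>
      <minitask2>passed</minitask2>
    </minitasks>
  </student>
  <student id="std_2">
    <name>Max Muster</name>
    <minitasks>
      <minitask1>passed</minitask1>
      <minitask2>failed</minitask2>
    </minitasks>
  </student>
</students>'

doc <- read_xml(students_xml)

# Root element and its children
root_name <- xml_name(doc)
student_nodes <- xml_children(doc)

# Children of the first <student>
first_student <- student_nodes[[1]]
child_names <- xml_name(xml_children(first_student))

# Siblings and parent of the first <name> element
name_node <- xml_find_first(first_student, "./name")
sibling_names <- xml_name(xml_siblings(name_node))
parent_name <- xml_name(xml_parent(name_node))

# The text node inside <name>
student_name <- xml_text(name_node)

print(child_names)
print(sibling_names)
```

Note that xml_siblings() leaves out the node itself, so <name> has exactly one sibling here.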

HTML

HTML is organized using tags, which are surrounded by <> symbols. Different tags perform different functions. Together, many tags will form and contain the content of a web page. Whereas HTML provides the content and structure of a web page, CSS provides information about how a web page should be styled.

What is XPath?

XPath, which stands for XML Path Language, is a query language that allows you to navigate through elements and attributes in an XML or HTML document.

Some Examples

Absolute path: /students/student selects all <student> elements that are children of the <students> root element.
Relative path (does not start with a slash and is evaluated relative to the current node): student/minitasks/minitask1 selects <minitask1> elements that are children of <minitasks>, which are, in turn, children of a <student> element.
Predicates (expressions in square brackets that filter nodes based on conditions): /students/student[name='Olga Lapinskaya'] selects <student> elements that have a <name> child whose text is exactly “Olga Lapinskaya”.
Wildcards: An asterisk (*) matches any element node: /students/* selects all child elements of <students>, which would be all <student> elements in this case.
Attribute Selection: The @ symbol is used to select attributes: /students/student[@id='std_1'] selects the <student> element with an attribute id that has the value “std_1”.
Selecting Text: The text() function selects the text within nodes. /students/student/name/text() selects the text within the <name> element of each <student>.
Path Operators: you can use / for direct children and // for any descendants: /students/student//minitask1 would select any <minitask1> that is a descendant of <student>, not just direct children.
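
These expressions can be tried out with xml2::xml_find_all(). The following is a sketch on a made-up <students> document; only the name “Olga Lapinskaya” and the id “std_1” come from the examples above, everything else is an assumption for illustration.

```r
library(xml2)

# Hypothetical sample document for trying out the XPath examples
doc <- read_xml('
<students>
  <student id="std_1">
    <name>Olga Lapinskaya</name>
    <minitasks><minitask1>passed</minitask1><minitask2>passed</minitask2></minitasks>
  </student>
  <student id="std_2">
    <name>Max Muster</name>
    <minitasks><minitask1>failed</minitask1><minitask2>passed</minitask2></minitasks>
  </student>
</students>')

# Absolute path: all <student> children of the root
all_students <- xml_find_all(doc, "/students/student")

# Predicate on a child element and on an attribute
by_name <- xml_find_all(doc, "/students/student[name='Olga Lapinskaya']")
by_id <- xml_find_all(doc, "/students/student[@id='std_1']")

# Wildcard: every child element of <students>
wild_names <- xml_name(xml_find_all(doc, "/students/*"))

# text(): the text nodes inside each <name>
student_names <- xml_text(xml_find_all(doc, "/students/student/name/text()"))

# //: any <minitask1> descendant of a <student>
task1 <- xml_text(xml_find_all(doc, "/students/student//minitask1"))

# Relative path, evaluated from the first <student> node
first_student <- all_students[[1]]
rel <- xml_find_all(first_student, "minitasks/minitask1")

print(student_names)
print(task1)
```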

OKAY! ENOUGH IS ENOUGH! LET’S DO SOME WEBSCRAPING!

Install rvest

#install.packages("rvest") #The package needed for web scraping
#install.packages("xml2")
#install.packages("strex")

Import libraries

library(rvest)   # read_html(), html_nodes(), html_text(), html_table(), html_attr()
library(xml2)
library(strex)   # str_before_first()
library(dplyr)   # rename(), filter(), mutate()
library(stringr) # str_extract(), str_trim(), str_c()

Set up working directory

Retrieve the HTML Page

First, we read an HTML page into R with the command read_html().

IMDB best 2023 movies

As mentioned earlier, you are getting ready for the holidays and need some movies to watch. You also want their summaries and runtime information.

imdb_link <- "https://www.imdb.com/list/ls562300956/"

imdb <- read_html(imdb_link)

After reading this HTML page into R, we need a way to find the relevant information. We can do this in two different ways:

Using inspect: Right-click the data you wish to scrape and select “Inspect” to see where it sits in the HTML structure. This opens the Elements pane. Right-click the selected element –> Copy –> Copy XPath (or Copy selector for a CSS selector).

html_nodes() accepts either an XPath expression or a CSS selector. In CSS, selectors are patterns used to select the element(s) you want to style; here we use them to select the element(s) we want to scrape.

#xpath and css selector for the title "Are You There God? It's Me, Margaret."

xpath1 <- '//*[@id="main"]/div/div[3]/div[3]/div[1]/div[2]/h3/a'

selector1 <- '#main > div > div.lister.list.detail.sub-list > div.lister-list > div:nth-child(1) > div.lister-item-content > h3 > a'

# This lengthy XPath means: start at the element whose id attribute is "main", go to its div children, then to the third div inside them, then to the third div inside that, then to the first div, then to its second div, then to the h3 heading, and finally select the hyperlink (a element) inside that heading.

# The XPath above selects only the node for "Are You There God? It's Me, Margaret." What we want is all the movie titles, not just one node. Here is a more general XPath in which one positional index is dropped, so it matches every div at that level. (div[*] would also work here, but strictly it means "a div that has at least one child element".)

xpath2 <- '//*[@id="main"]/div/div[3]/div[3]/div/div[2]/h3/a'

selector2 <- '#main > div > div.lister.list.detail.sub-list > div.lister-list > div > div.lister-item-content > h3 > a'

# In words: find any element whose id attribute equals 'main'. From there, go to its div children, then to the third div inside them, and then to the third div inside that. At the next level, consider every div instead of one specific div. Inside each of those, take the second div child, then the h3 heading inside it, and finally the hyperlink (a element) inside the heading. The expression selects that hyperlink for every matching item.

test <- imdb %>% 
  html_nodes(selector2) %>% 
  html_text()

test

##   [1] "Are You There God? It's Me, Margaret."               
##   [2] "Evil Dead Rise"                                      
##   [3] "Der Super Mario Bros. Film"                          
##   [4] "Guy Ritchie's Der Pakt"                              
##   [5] "Tetris"                                              
##   [6] "A Good Person"                                       
##   [7] "Flamin' Hot"                                         
##   [8] "Infinity Pool"                                       
##   [9] "Champions"                                           
##  [10] "Ant-Man and the Wasp: Quantumania"                   
##  [11] "Dungeons & Dragons: Ehre unter Dieben"               
##  [12] "BlackBerry"                                          
##  [13] "Renfield"                                            
##  [14] "Somewhere in Queens"                                 
##  [15] "Tyler Rake: Extraction 2"                            
##  [16] "Sisu: Rache ist süss"                                
##  [17] "Knock at the Cabin"                                  
##  [18] "Big George Foreman"                                  
##  [19] "Jesus Revolution"                                    
##  [20] "Missing"                                             
##  [21] "Beau Is Afraid"                                      
##  [22] "Ponnlyin Selvan: Part Two"                           
##  [23] "Stan Lee"                                            
##  [24] "Air - Der große Wurf"                                
##  [25] "John Wick: Kapitel 4"                                
##  [26] "Jemand, den ich mal kannte"                          
##  [27] "Ein Mann namens Otto"                                
##  [28] "Shazam! Fury of the Gods"                            
##  [29] "The Pope's Exorcist"                                 
##  [30] "Scream VI"                                           
##  [31] "M3gan"                                               
##  [32] "Plane"                                               
##  [33] "Creed III: Rocky's Legacy"                           
##  [34] "Operation Fortune"                                   
##  [35] "You People"                                          
##  [36] "Magic Mike: The last Dance"                          
##  [37] "Catch the Killer"                                    
##  [38] "True Spirit"                                         
##  [39] "Sharper"                                             
##  [40] "Boston Strangler"                                    
##  [41] "Pinball: The Man Who Saved the Game"                 
##  [42] "Luther: The Fallen Sun"                              
##  [43] "The Last Kingdom: Seven Kings Must Die"              
##  [44] "Polite Society"                                      
##  [45] "Fast & Furious 10"                                   
##  [46] "Dalíland"                                            
##  [47] "Chevalier"                                           
##  [48] "Kandahar"                                            
##  [49] "Ghosted"                                             
##  [50] "Rye Lane"                                            
##  [51] "How to Blow Up a Pipeline"                           
##  [52] "Reality"                                             
##  [53] "Nefarious"                                           
##  [54] "What's Love Got to Do with It?"                      
##  [55] "The Artifice Girl"                                   
##  [56] "Wildflower"                                          
##  [57] "Linoleum"                                            
##  [58] "Cocaine Bear"                                        
##  [59] "Murder Mystery 2"                                    
##  [60] "Brady's Ladies"                                      
##  [61] "Mafia Mamma"                                         
##  [62] "Your Place or Mine"                                  
##  [63] "Paint"                                               
##  [64] "Blue Jean"                                           
##  [65] "We Have a Ghost"                                     
##  [66] "The Mother"                                          
##  [67] "Hypnotic"                                            
##  [68] "Legion der Superhelden"                              
##  [69] "Batman: The Doom That Came to Gotham"                
##  [70] "Der Elefant des Magiers"                             
##  [71] "Teen Wolf: The Movie"                                
##  [72] "When You Finish Saving the World"                    
##  [73] "Dog Gone"                                            
##  [74] "Inside"                                              
##  [75] "Mumien - Ein total verwickeltes Abenteuer"           
##  [76] "Prom Pact"                                           
##  [77] "Chupa"                                               
##  [78] "Cocaine Bear: The True Story"                        
##  [79] "Power Rangers: Once & Always"                        
##  [80] "Justice League x RWBY: Superhelden und Jäger: Teil 1"
##  [81] "Sweetwater"                                          
##  [82] "Love Again"                                          
##  [83] "Spinning Gold"                                       
##  [84] "Ride on"                                             
##  [85] "Und dann kam Dad"                                    
##  [86] "On Sacred Ground"                                    
##  [87] "On a Wing and a Prayer"                              
##  [88] "Robots"                                              
##  [89] "The Wedding Veil Journey"                            
##  [90] "The Wedding Veil Inspiration"                        
##  [91] "The Wedding Veil Expectations"                       
##  [92] "At Midnight"                                         
##  [93] "The Magic Flute"                                     
##  [94] "65"                                                  
##  [95] "Shotgun Wedding"                                     
##  [96] "Marlowe"                                             
##  [97] "Seriously Red"                                       
##  [98] "Blood"                                               
##  [99] "The Old Way"                                         
## [100] "Jung_E: Gedächtnis des Krieges"
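
To see that a CSS selector and an XPath expression can pick out the same nodes, here is a minimal sketch on a tiny inline page. The class names mimic the ones in the selectors above, but the page content and links are made up, so nothing is downloaded.

```r
library(rvest)

# Hypothetical mini page; read_html() also accepts literal HTML
page <- read_html('
<div id="main">
  <div class="lister-list">
    <div class="lister-item">
      <div class="lister-item-content"><h3><a href="/title/tt001/">Movie A</a></h3></div>
    </div>
    <div class="lister-item">
      <div class="lister-item-content"><h3><a href="/title/tt002/">Movie B</a></h3></div>
    </div>
  </div>
</div>')

# CSS selector: passed as the first (css) argument
by_css <- page %>%
  html_nodes("div.lister-item-content > h3 > a") %>%
  html_text()

# XPath: must be passed by name through the xpath argument
by_xpath <- page %>%
  html_nodes(xpath = '//div[@class="lister-item-content"]/h3/a') %>%
  html_text()

print(by_css)
print(by_xpath)
```

Short selectors built on class names tend to survive small layout changes better than long positional paths copied from the browser.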

Using SelectorGadget: Click on the SelectorGadget icon in your browser to activate it. Click on the elements in the web page you wish to scrape. SelectorGadget will suggest a CSS selector for the elements you’ve selected. This is just a guess, and the first guess is usually not fully accurate. You can deselect elements that were chosen wrongly by clicking on them, and in this way narrow the selection down to your elements of interest.

THIS IS MUCH MORE FUN!

Extract Elements

Use html_nodes() with the suggested CSS selector or XPath to select the elements. Then use html_text() to extract the text from those elements.

#Extract name of the movies
name <- imdb %>% 
  html_nodes(".lister-item-header a") %>% 
  html_text()

#Extract duration of the movies
runtime <- imdb %>% 
  html_nodes(".runtime") %>% 
  html_text()

#Extract rating of the movies
rating <- imdb %>% 
  html_nodes(".ipl-rating-star.small .ipl-rating-star__rating") %>%
  html_text()

text_primary <- imdb %>% 
  html_nodes(".text-primary") %>% 
  html_text()


summary <- imdb %>% 
  html_nodes(".ipl-rating-widget+ p , .ratings-metascore+ p") %>% 
  html_text()

votes <- imdb %>% 
  html_nodes(".text-small+ .text-small") %>% 
  html_text()

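Now we bind these vectors together into one data frame. This only lines up correctly if all vectors have the same length, i.e. if no movie is missing a rating, runtime, or summary.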

movies2023 <- cbind(name, rating, runtime, summary) %>%
  as.data.frame()
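
If some items lack a field, separately scraped vectors end up with different lengths and no longer line up. One possible way around this, sketched here on a made-up inline page (the class names follow the selectors above, the movies are hypothetical), is to select each item container first and then use html_node(), which returns exactly one result per item and NA where nothing matches.

```r
library(rvest)

# Hypothetical list page: the second movie has no runtime
page <- read_html('
<div class="lister-item">
  <h3 class="lister-item-header"><a>Movie A</a></h3>
  <span class="runtime">106 min</span>
</div>
<div class="lister-item">
  <h3 class="lister-item-header"><a>Movie B</a></h3>
</div>
<div class="lister-item">
  <h3 class="lister-item-header"><a>Movie C</a></h3>
  <span class="runtime">97 min</span>
</div>')

# One node per movie
items <- page %>% html_nodes(".lister-item")

# html_node() (singular) keeps one entry per item, NA if the field is missing
movies <- data.frame(
  name = items %>% html_node(".lister-item-header a") %>% html_text(),
  runtime = items %>% html_node(".runtime") %>% html_text(),
  stringsAsFactors = FALSE
)

# Keep only the digits of the runtime and convert them to integers
movies$runtime_min <- as.integer(gsub("\\D", "", movies$runtime))

print(movies)
```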

Scraping a book

Well! You have watched enough movies. Let’s get some books to read. Let’s get Alice in Wonderland.

aliceBook <- "https://www.gutenberg.org/cache/epub/11/pg11-images.html"

alice <- read_html(aliceBook)


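# All chapters: html_nodes() returns every node with the class "chapter", so chapter1 holds one element per match; use chapter1[1] for the first one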

chapter1 <- alice %>% 
  html_nodes(".chapter") %>% 
  html_text()

# Paragraphs
paragraphs <- alice %>% 
  html_nodes("pre , p") %>% 
  html_text() %>% 
  as.data.frame() %>% 
  rename(par = 1) %>% 
  filter(par != "")
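
The same clean-up can be sketched without a data frame. This example assumes a made-up chapter snippet with placeholder text rather than the real book; it trims whitespace while extracting and then drops the empty paragraphs.

```r
library(rvest)

# Hypothetical chapter snippet with one empty paragraph
page <- read_html('
<div class="chapter">
  <h2>CHAPTER I.</h2>
  <p>  First paragraph of the chapter.  </p>
  <p></p>
  <p>Second paragraph of the chapter.</p>
</div>')

# trim = TRUE removes leading and trailing whitespace from each text
paragraphs <- page %>%
  html_nodes("p") %>%
  html_text(trim = TRUE)

# Drop paragraphs that are empty after trimming
paragraphs <- paragraphs[paragraphs != ""]

print(paragraphs)
```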

Rotten Tomatoes

rottenTomatoes <- "https://editorial.rottentomatoes.com/guide/best-movies-of-2023/"

rotten <- read_html(rottenTomatoes)

#let's get movie titles, critics consensus, and synopsis

names <- rotten %>% 
  html_nodes(".article_movie_title a") %>% 
  html_text()

#WRITE YOUR CODE HERE



# Alternative way of getting name, year, and rating with a bit of text processing

names_full <- rotten %>% 
  html_nodes(".col-sm-20") %>% 
  html_text() %>% 
  as.data.frame() %>% 
  rename(info = 1) %>% 
  # rating: the first run of digits followed by a percent sign
  mutate(rating = str_extract(info, "\\d+%")) %>% 
  # year: the digits inside the first pair of parentheses
  mutate(year = str_extract(info, "\\(\\d+\\)") %>% 
           str_extract(., "\\d+")) %>% 
  # name: everything before the first opening parenthesis, without surrounding whitespace
  mutate(name = str_before_first(info, "\\(")) %>% 
  mutate(name = str_trim(name))
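
To see what these regular expressions do, here is a small sketch on made-up strings. The "Title (year) rating%" layout is an assumption about what the scraped text looks like, and str_remove() from stringr stands in for str_before_first() so that the example needs only one package.

```r
library(stringr)

# Hypothetical scraped strings in the assumed "Title (year) rating%" layout
info <- c("  Hypothetical Movie One (2023) 97%  ", "Another Title (2023) 88%")

# Digits followed by a percent sign
rating <- str_extract(info, "\\d+%")

# Four digits enclosed in parentheses; the lookarounds leave the parentheses out
year <- str_extract(info, "(?<=\\()\\d{4}(?=\\))")

# Remove everything from the first "(" onwards, then trim whitespace
name <- str_trim(str_remove(info, "\\(.*"))

print(data.frame(name, year, rating))
```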

Tables

wikilink <- "https://en.wikipedia.org/wiki/List_of_films_with_a_100%25_rating_on_Rotten_Tomatoes"

wiki_movies <- read_html(wikilink)

table <- wiki_movies %>% 
  html_nodes(xpath = '//*[@id="mw-content-text"]/div[1]/table') %>% 
  html_table() %>% 
  as.data.frame()
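
When html_table() is applied to a set of table nodes, it returns a list with one table per node, so it is often safer to pick a single table from that list than to convert the whole list at once. A minimal sketch on a made-up inline table:

```r
library(rvest)

# Hypothetical table; the films and years are invented
page <- read_html('
<table>
  <tr><th>Film</th><th>Year</th></tr>
  <tr><td>Film A</td><td>1995</td></tr>
  <tr><td>Film B</td><td>2010</td></tr>
</table>')

# One list entry per <table> node
tables <- page %>%
  html_nodes("table") %>%
  html_table()

# Pick the table you need by position
films <- tables[[1]]

print(films)
```

On a page with several tables, length(tables) shows how many were found.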


movie_names <- wiki_movies %>% 
  html_nodes("i a") %>% 
  html_text()

name_links <- wiki_movies %>%
  html_nodes("i a") %>% 
  html_attr("href")
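
Links scraped with html_attr("href") are often relative, so they need the site address in front before they can be visited. The sketch below uses a made-up list of links and xml2::url_absolute() to complete them; the film names and paths are hypothetical.

```r
library(rvest)

# Hypothetical list of italicised film links with relative hrefs
page <- read_html('
<ul>
  <li><i><a href="/wiki/Film_A">Film A</a></i></li>
  <li><i><a href="/wiki/Film_B">Film B</a></i></li>
</ul>')

links <- page %>% html_nodes("i a")

# Taking text and href from the same nodes keeps titles and links aligned
film_links <- data.frame(
  title = html_text(links),
  url = xml2::url_absolute(html_attr(links, "href"), "https://en.wikipedia.org/"),
  stringsAsFactors = FALSE
)

print(film_links)
```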

Often, however, not all the content is contained in a single HTML page. Instead, the page links to other pages that hold the rest of the content. What we will do next is collect these nested links and try to scrape the pages behind them.

book_link <- "https://www.goodreads.com/list/show/183940.Best_Books_of_2023"

books <- read_html(book_link)

title <- books %>% 
  html_nodes(".bookTitle span") %>% 
  html_text()

name_link <- imdb %>% 
  html_nodes(".lister-item-header a") %>% 
  html_attr("href") %>% 
  str_c("https://www.imdb.com",.)


print(name_link[1])

## [1] "https://www.imdb.com/title/tt9185206/?ref_=ttls_li_tt"

## creating a function

get_info <- function(movie_link){
  # movie_link is the URL of one movie page, e.g.
  # 'https://www.imdb.com/title/tt9185206/?ref_=ttls_li_tt'
  movie_page <- read_html(movie_link)
  info <- movie_page %>% html_nodes(".primary_photo+ td a") %>% html_text()
  return(info)
}
# to be continued
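
A function like this is usually applied to every link in turn. The following is a sketch under stated assumptions: get_cast() is a hypothetical helper, and the "pages" are literal HTML strings standing in for real URLs, so the example runs without a network connection.

```r
library(rvest)

# Hypothetical stand-ins for movie pages; in practice these would be URLs
pages <- c(
  '<table>
     <tr><td class="primary_photo"></td><td><a>Actor One</a></td></tr>
     <tr><td class="primary_photo"></td><td><a>Actor Two</a></td></tr>
   </table>',
  '<table>
     <tr><td class="primary_photo"></td><td><a>Actor Three</a></td></tr>
   </table>'
)

# Hypothetical helper: read one page and return its cast as a single string
get_cast <- function(page_source) {
  movie_page <- read_html(page_source)
  cast <- movie_page %>%
    html_nodes(".primary_photo + td a") %>%
    html_text(trim = TRUE)
  paste(cast, collapse = ", ")
}

# Apply the helper to every page; vapply() guarantees one string per page
cast_per_movie <- vapply(pages, function(p) {
  Sys.sleep(0)  # use a pause of a second or more when requesting real pages
  get_cast(p)
}, character(1), USE.NAMES = FALSE)

print(cast_per_movie)
```

Collapsing the cast into one string per movie keeps the result the same length as the list of links, so it can be added as a column next to the movie names.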

Web scraping in R

2023-12-20