IMDB best 2023 movies
As mentioned earlier, you are getting ready for the holidays and need
some movies to watch. You also want their summaries and runtime
information.
library(rvest) # provides read_html(), html_nodes(), html_text() and the %>% pipe
imdb_link <- "https://www.imdb.com/list/ls562300956/"
imdb <- read_html(imdb_link)
After reading this HTML page into R, we need a way to find the
relevant information. We can do it in two different ways:
- Using Inspect: Right-click an element on the page and select “Inspect” to
find the HTML structure of the data you wish to scrape. This opens the
Elements pane. Right-click the highlighted element –> Copy
–> Copy XPath.
From the same Copy menu you can copy either the XPath or the CSS selector.
In CSS, selectors are patterns used to select the element(s) you want to
style; for scraping, we use the same patterns to pick the elements we want
to extract.
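To make the two notations concrete, here is a minimal sketch on a made-up page. The markup and the `lister-item` class are assumptions for illustration, not IMDb's real structure. It shows that an XPath and a CSS selector can describe the same nodes.

```r
library(rvest)

# Toy page, invented for illustration (not IMDb's real markup)
toy <- read_html('
<div id="main">
  <div class="lister-list">
    <div class="lister-item"><h3><a href="/t1">Movie A</a></h3></div>
    <div class="lister-item"><h3><a href="/t2">Movie B</a></h3></div>
  </div>
</div>')

# The same target written as a CSS selector ...
by_css <- toy %>%
  html_nodes("#main div.lister-item > h3 > a") %>%
  html_text()

# ... and as an XPath expression
by_xpath <- toy %>%
  html_nodes(xpath = '//*[@id="main"]//div[@class="lister-item"]/h3/a') %>%
  html_text()

by_css
by_xpath
identical(by_css, by_xpath)
```

Which notation you use is mostly a matter of taste; `html_nodes()` accepts a CSS selector as its second argument and an XPath through the `xpath` argument.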
# XPath and CSS selector for the title "Are You There God? It's Me, Margaret."
xpath1 <- '//*[@id="main"]/div/div[3]/div[3]/div[1]/div[2]/h3/a'
selector1 <- '#main > div > div.lister.list.detail.sub-list > div.lister-list > div:nth-child(1) > div.lister-item-content > h3 > a'
# Read left to right, this lengthy XPath means: start at the element whose id attribute is "main", go to its div child, then to that div's third div child, then to the third div inside that one, then to its first div, then to that div's second div, then to the h3 inside it, and finally select the hyperlink (a) inside the h3. (This description is adapted from one generated by ChatGPT. I'm not crazy enough to translate it by hand.)
# The XPath above selects only the node for "Are You There God? It's Me, Margaret." What we want is every movie title, not just one node. Here is a more general XPath in which I have dropped the position from one of the steps (div[1] becomes div), so it matches every div at that level. Note that div[*] is not a positional wildcard: it means "a div that has at least one child element".
xpath2 <- '//*[@id="main"]/div/div[3]/div[3]/div/div[2]/h3/a'
selector2 <- '#main > div > div.lister.list.detail.sub-list > div.lister-list > div > div.lister-item-content > h3 > a'
# In words, xpath2 says: look for any element in the document whose 'id' attribute equals 'main'. From there, go to its 'div' child, then to that div's third 'div' child, then to the third 'div' nested inside it. Now, instead of choosing a specific 'div', consider every 'div' at this level of nesting. Inside each of those 'div' elements, find the second 'div' child. That 'div' contains a heading element marked as 'h3', and inside the heading there is a hyperlink ('a' element). The expression selects that hyperlink for every match.
test <- imdb %>%
  html_nodes(selector2) %>%
  html_text()
test
## [1] "Are You There God? It's Me, Margaret."
## [2] "Evil Dead Rise"
## [3] "Der Super Mario Bros. Film"
## [4] "Guy Ritchie's Der Pakt"
## [5] "Tetris"
## [6] "A Good Person"
## [7] "Flamin' Hot"
## [8] "Infinity Pool"
## [9] "Champions"
## [10] "Ant-Man and the Wasp: Quantumania"
## [11] "Dungeons & Dragons: Ehre unter Dieben"
## [12] "BlackBerry"
## [13] "Renfield"
## [14] "Somewhere in Queens"
## [15] "Tyler Rake: Extraction 2"
## [16] "Sisu: Rache ist süss"
## [17] "Knock at the Cabin"
## [18] "Big George Foreman"
## [19] "Jesus Revolution"
## [20] "Missing"
## [21] "Beau Is Afraid"
## [22] "Ponnlyin Selvan: Part Two"
## [23] "Stan Lee"
## [24] "Air - Der große Wurf"
## [25] "John Wick: Kapitel 4"
## [26] "Jemand, den ich mal kannte"
## [27] "Ein Mann namens Otto"
## [28] "Shazam! Fury of the Gods"
## [29] "The Pope's Exorcist"
## [30] "Scream VI"
## [31] "M3gan"
## [32] "Plane"
## [33] "Creed III: Rocky's Legacy"
## [34] "Operation Fortune"
## [35] "You People"
## [36] "Magic Mike: The last Dance"
## [37] "Catch the Killer"
## [38] "True Spirit"
## [39] "Sharper"
## [40] "Boston Strangler"
## [41] "Pinball: The Man Who Saved the Game"
## [42] "Luther: The Fallen Sun"
## [43] "The Last Kingdom: Seven Kings Must Die"
## [44] "Polite Society"
## [45] "Fast & Furious 10"
## [46] "Dalíland"
## [47] "Chevalier"
## [48] "Kandahar"
## [49] "Ghosted"
## [50] "Rye Lane"
## [51] "How to Blow Up a Pipeline"
## [52] "Reality"
## [53] "Nefarious"
## [54] "What's Love Got to Do with It?"
## [55] "The Artifice Girl"
## [56] "Wildflower"
## [57] "Linoleum"
## [58] "Cocaine Bear"
## [59] "Murder Mystery 2"
## [60] "Brady's Ladies"
## [61] "Mafia Mamma"
## [62] "Your Place or Mine"
## [63] "Paint"
## [64] "Blue Jean"
## [65] "We Have a Ghost"
## [66] "The Mother"
## [67] "Hypnotic"
## [68] "Legion der Superhelden"
## [69] "Batman: The Doom That Came to Gotham"
## [70] "Der Elefant des Magiers"
## [71] "Teen Wolf: The Movie"
## [72] "When You Finish Saving the World"
## [73] "Dog Gone"
## [74] "Inside"
## [75] "Mumien - Ein total verwickeltes Abenteuer"
## [76] "Prom Pact"
## [77] "Chupa"
## [78] "Cocaine Bear: The True Story"
## [79] "Power Rangers: Once & Always"
## [80] "Justice League x RWBY: Superhelden und Jäger: Teil 1"
## [81] "Sweetwater"
## [82] "Love Again"
## [83] "Spinning Gold"
## [84] "Ride on"
## [85] "Und dann kam Dad"
## [86] "On Sacred Ground"
## [87] "On a Wing and a Prayer"
## [88] "Robots"
## [89] "The Wedding Veil Journey"
## [90] "The Wedding Veil Inspiration"
## [91] "The Wedding Veil Expectations"
## [92] "At Midnight"
## [93] "The Magic Flute"
## [94] "65"
## [95] "Shotgun Wedding"
## [96] "Marlowe"
## [97] "Seriously Red"
## [98] "Blood"
## [99] "The Old Way"
## [100] "Jung_E: Gedächtnis des Krieges"
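The difference between a positional step and a step without a position can be tried offline. The sketch below uses a small invented page (the `poster` and `content` classes are assumptions, not IMDb's markup) and compares a selector that pins one list item with one that matches every item.

```r
library(rvest)

# Toy list with three items, invented for illustration
toy <- read_html('
<div id="main">
  <div class="lister-list">
    <div><div class="poster"></div><div class="content"><h3><a>Movie A</a></h3></div></div>
    <div><div class="poster"></div><div class="content"><h3><a>Movie B</a></h3></div></div>
    <div><div class="poster"></div><div class="content"><h3><a>Movie C</a></h3></div></div>
  </div>
</div>')

# div[1] pins the first item; a plain div matches every item at that level
one_xpath <- '//*[@id="main"]/div/div[1]/div[2]/h3/a'
all_xpath <- '//*[@id="main"]/div/div/div[2]/h3/a'

# CSS counterparts: div:nth-child(1) versus a plain div
one_css <- '#main > div > div:nth-child(1) > div.content > h3 > a'
all_css <- '#main > div > div > div.content > h3 > a'

one_title  <- toy %>% html_nodes(xpath = one_xpath) %>% html_text()
all_titles <- toy %>% html_nodes(xpath = all_xpath) %>% html_text()

one_title_css  <- toy %>% html_nodes(one_css) %>% html_text()
all_titles_css <- toy %>% html_nodes(all_css) %>% html_text()

one_title
all_titles
```

Removing the position from a single step is usually enough to go from one node to the whole list, as long as every item shares the same internal structure.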
- Using SelectorGadget: Click the SelectorGadget
icon in your browser to activate it. Click on the elements in the web
page you wish to scrape. SelectorGadget will suggest a CSS selector for
the elements you’ve selected. This is just a guess, and the first
guess is usually not totally accurate. You can deselect the elements that
were chosen wrongly and, this way, narrow the selection down to your
elements of interest.
THIS IS MUCH MORE FUN!
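SelectorGadget typically suggests short, class-based selectors, not long paths. The sketch below shows how such selectors could be used to collect titles, runtimes, and summaries into one table. The page and the class names (`item`, `title`, `runtime`, `summary`) are hypothetical, invented for illustration; on the real page you would substitute whatever SelectorGadget suggests.

```r
library(rvest)

# Toy markup with made-up class names (hypothetical, not IMDb's real ones)
toy <- read_html('
<div class="item"><h3 class="title"><a>Movie A</a></h3>
  <span class="runtime">106 min</span><p class="summary">First plot.</p></div>
<div class="item"><h3 class="title"><a>Movie B</a></h3>
  <span class="runtime">96 min</span><p class="summary">Second plot.</p></div>')

# First select one node per movie ...
items <- toy %>% html_nodes(".item")

# ... then pull each field from inside every item.
# html_node() returns exactly one result per item (NA if a field is missing),
# so the columns stay aligned.
movies <- data.frame(
  title   = items %>% html_node(".title a") %>% html_text(),
  runtime = items %>% html_node(".runtime") %>% html_text(),
  summary = items %>% html_node(".summary") %>% html_text(),
  stringsAsFactors = FALSE
)

# Keep only the digits of "106 min" and convert them to a number
movies$runtime_min <- as.numeric(gsub("[^0-9]", "", movies$runtime))
movies
```

Selecting the item containers first and extracting fields within each one is a design choice: if a movie lacks a runtime or summary, you get an `NA` in that row instead of columns of different lengths.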