Web Scraping

Author

Jason Hilton

Introduction

We are going to scrape some text data from a website created specifically for scraping practice.

As you may remember from our discussions in the lecture, if we were applying this on ‘real’ websites, we would have to be careful about the ethical, legal and privacy implications of the data we were planning to collect.

We also need to be careful about the rate at which we scrape webpages from a site: too many page requests can overwhelm the server or consume an unreasonable share of its resources, worsening the experience for other users. It could also lead to restrictions being applied to your ability to make future requests for webpages from this site.

We therefore must practice polite scraping by identifying ourselves and intentionally limiting the number of requests we make of the server.

Recap: HTML pages

To remind ourselves of the material covered in the lecture, we wish to extract data from a webpage provided in html format. Normally html pages are provided to our web-browsers in response to a request, which might occur, for instance, when we click on a link. However, we are going to request a web-page using R, and use the R package rvest to extract data from the result.

A very simple example of an html document is given below. We have opening and closing html tags which define html elements, which may be nested within each other. In the example below, there is a paragraph element <p> with some text content, which is contained within a <div> element (div for ‘division’). A <div> is generally used to divide up the page and group together elements that are related somehow.

<!DOCTYPE html>
<html lang="en">
  <head>
    <!-- Meta-information, title, scripts. -->
  </head>
  <body>
    <div>
      <p> I am here! </p>
    </div>
  </body>
</html>
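
As a quick check, the example document above can be parsed directly in R. The following is a minimal sketch, assuming the rvest package is installed. It passes the html to read_html as a character string rather than a url, and the object names simple_html, simple_page and p_text are only illustrative.

```r
library(rvest)

# The example document from above, held as a character string
simple_html <- '<!DOCTYPE html>
<html lang="en">
  <head>
    <!-- Meta-information, title, scripts. -->
  </head>
  <body>
    <div>
      <p> I am here! </p>
    </div>
  </body>
</html>'

# read_html() accepts a string of html as well as a url
simple_page <- read_html(simple_html)

# Select the paragraph nested inside the div and pull out its text
p_text <- simple_page %>%
  html_element("div p") %>%
  html_text2()

p_text
```

The selector "div p" follows the nesting of the document: it asks for a p element that sits somewhere inside a div element.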

[Figure: a more visual example of an html page structure.]

Recap: html attributes

HTML elements may have attributes. These describe certain properties relating to that element. These can help us extract information from a web-page.

IDs and Classes

The two most useful attributes are id and class. IDs uniquely identify particular html elements, so that two html elements on the same page can’t share the same ID. IDs are specified as below.

<p id="introduction">

Classes identify particular elements that are related in some way. Classes are often used to provide uniform formatting across such related elements.

<div class="bio">
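
Attributes can also be read from R. The sketch below is only an illustration, assuming rvest is installed: the page content is made up, built with rvest's minimal_html helper, and the object names are hypothetical. It uses html_attr to read the value of a named attribute.

```r
library(rvest)

# A tiny made-up page containing the two elements shown above
attr_page <- minimal_html('
  <p id="introduction">Welcome to my page.</p>
  <div class="bio">A short biography.</div>
')

# html_attr() reads the value of a named attribute from an element
p_id <- attr_page %>% html_element("p") %>% html_attr("id")
div_class <- attr_page %>% html_element("div") %>% html_attr("class")

p_id       # "introduction"
div_class  # "bio"
```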

Recap: CSS selectors

When we identify a page from which we want to scrape data, it is helpful to investigate the html source of this page. This will help us write the code that will allow us to select the elements from which we want to extract content.

You may remember from the lecture that we can do this using css selectors. CSS stands for cascading style sheets, and it is the language through which website programmers specify how particular groups of html elements should appear.

The style of HTML content is the way it appears when it is viewed through a web browser.

This is generally determined by instructions written in the CSS (Cascading Style Sheets) language.

  • These instructions provide information about fonts, colours, size etc.

Why do we care about this?

  • In order to determine which elements of an HTML page should be styled, CSS uses selectors
  • This is a specific way of referring to particular elements, classes and ids in CSS code
  • We can use these CSS selectors to specify which html elements we want to extract for analysis

The most important css rules are:

  • To select by element, the name of the element is simply written. For example p, h2, div, etc
  • To select by class, we prefix the class name with the . symbol. For example, .big-title
  • To select by id, we prefix the id with the # symbol. For example, #bio

We can chain these selectors together:

  • p.body-text selects all paragraph elements of class body-text
  • .body-text.intro selects all elements with both classes body-text and intro
  • .body-text .intro (note the space) selects all elements of class intro that are descendants of elements of class body-text.
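
The difference between these chained selectors is easiest to see on a small example. The sketch below assumes rvest is installed and uses a made-up page built with minimal_html. The class names match the bullets above, but the page content and object names are purely illustrative.

```r
library(rvest)

# A made-up page with four elements carrying the classes discussed above
selector_page <- minimal_html('
  <p class="body-text">Plain body paragraph</p>
  <p class="body-text intro">Introductory body paragraph</p>
  <div class="body-text"><span class="intro">Nested intro span</span></div>
  <h2 class="intro">Intro heading</h2>
')

# p elements of class body-text: the two paragraphs
p_body <- selector_page %>% html_elements("p.body-text") %>% html_text2()

# Elements carrying both classes: only the second paragraph
both_classes <- selector_page %>% html_elements(".body-text.intro") %>% html_text2()

# Elements of class intro nested inside a body-text element: only the span
nested_intro <- selector_page %>% html_elements(".body-text .intro") %>% html_text2()

length(p_body)        # 2
length(both_classes)  # 1
length(nested_intro)  # 1
```

Note that the heading of class intro is matched by neither chained selector: it has only one of the two classes, and it is not nested inside a body-text element.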

Full reference here: https://www.w3schools.com/cssref/css_selectors.php

Now that we have done our revision, we can try to scrape data from a particular site.

Visit https://quotes.toscrape.com/ and examine the structure of the site. This site contains a set of quotations from various famous people. The information is spread over several pages. We would like to extract information from each of these quotes, and load them into a sensible dataframe.

We start by loading in the packages tidyverse and rvest.

library(tidyverse)
Warning: package 'ggplot2' was built under R version 4.3.3
Warning: package 'tidyr' was built under R version 4.3.3
Warning: package 'readr' was built under R version 4.3.3
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ purrr     1.0.2     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(rvest)
Warning: package 'rvest' was built under R version 4.3.3

Attaching package: 'rvest'

The following object is masked from 'package:readr':

    guess_encoding

We can use the rvest function read_html to read the html of the quotes page mentioned previously. The function issues an http request and parses the result.

quotes_html  <- read_html("https://quotes.toscrape.com/")
quotes_html
{html_document}
<html lang="en">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body>\n    <div class="container">\n        <div class="row header-box"> ...

The result is an R object quotes_html that contains an R representation of the web page.

We can use the function html_elements to extract all elements that match a particular css selector. But first we need to work out which elements we want to extract, and therefore which css selector to use.

In the quotes to scrape website, right click anywhere on the page and click ‘View page source’ or similar (the exact menu option may depend upon your browser). You should be able to see the html source corresponding to this page.

Take a moment to examine the structure of the page, and see how it compares to what you see when you open the web page with the browser.

You may notice that each quote is contained within a div element of class quote. Within each of these divs is a span of class text, which contains the text of the quote, and a small element of class author, which has the name of the author.

Therefore, we can extract all the div elements of class quote using the css selector div.quote.

quotes_html %>% html_elements("div.quote")
{xml_nodeset (10)}
 [1] <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\ ...
 [2] <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\ ...
 [3] <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\ ...
 [4] <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\ ...
 [5] <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\ ...
 [6] <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\ ...
 [7] <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\ ...
 [8] <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\ ...
 [9] <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\ ...
[10] <div class="quote" itemscope itemtype="http://schema.org/CreativeWork">\ ...

Rather than extracting the whole quote div, we might want to extract just the text span within it. We can do this by using the selector div.quote .text, which selects all elements of class text that are descendants of div elements of class quote.

quotes_html %>% html_elements("div.quote .text")
{xml_nodeset (10)}
 [1] <span class="text" itemprop="text">“The world as we have created it is a ...
 [2] <span class="text" itemprop="text">“It is our choices, Harry, that show  ...
 [3] <span class="text" itemprop="text">“There are only two ways to live your ...
 [4] <span class="text" itemprop="text">“The person, be it gentleman or lady, ...
 [5] <span class="text" itemprop="text">“Imperfection is beauty, madness is g ...
 [6] <span class="text" itemprop="text">“Try not to become a man of success.  ...
 [7] <span class="text" itemprop="text">“It is better to be hated for what yo ...
 [8] <span class="text" itemprop="text">“I have not failed. I've just found 1 ...
 [9] <span class="text" itemprop="text">“A woman is like a tea bag; you never ...
[10] <span class="text" itemprop="text">“A day without sunshine is like, you  ...

Finally, we can extract the actual quote text content from the html elements we extracted using the html_text2 function. Putting it all together:

quote_text <- quotes_html %>% html_elements("div.quote .text") %>%
  html_text2()

length(quote_text)
[1] 10
quote_text[10]
[1] "“A day without sunshine is like, you know, night.”"

TASK: Try to write similar code to extract the name of the author of each quote.

Solution
authors <- quotes_html %>% html_elements("div.quote .author") %>%
  html_text2()

authors
 [1] "Albert Einstein"   "J.K. Rowling"      "Albert Einstein"  
 [4] "Jane Austen"       "Marilyn Monroe"    "Albert Einstein"  
 [7] "André Gide"        "Thomas A. Edison"  "Eleanor Roosevelt"
[10] "Steve Martin"     

We could now also combine the text and the authors’ names in a dataframe:

quote_df <- tibble(Author=authors, Quote_text=quote_text)
quote_df
# A tibble: 10 × 2
   Author            Quote_text                                                 
   <chr>             <chr>                                                      
 1 Albert Einstein   “The world as we have created it is a process of our think…
 2 J.K. Rowling      “It is our choices, Harry, that show what we truly are, fa…
 3 Albert Einstein   “There are only two ways to live your life. One is as thou…
 4 Jane Austen       “The person, be it gentleman or lady, who has not pleasure…
 5 Marilyn Monroe    “Imperfection is beauty, madness is genius and it's better…
 6 Albert Einstein   “Try not to become a man of success. Rather become a man o…
 7 André Gide        “It is better to be hated for what you are than to be love…
 8 Thomas A. Edison  “I have not failed. I've just found 10,000 ways that won't…
 9 Eleanor Roosevelt “A woman is like a tea bag; you never know how strong it i…
10 Steve Martin      “A day without sunshine is like, you know, night.”         
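
Building the two columns from separate selections relies on every quote having exactly one text element and one author element. An alternative, sketched below on a made-up miniature of the page (built with minimal_html, with illustrative object names), is to select each quote div first and then look inside it with html_element, which returns one result per input element. This is only one possible approach, not the method used in the rest of this worksheet.

```r
library(rvest)
library(tibble)

# A made-up miniature of the page structure described above
mini_page <- minimal_html('
  <div class="quote">
    <span class="text">First quote.</span>
    <small class="author">Author One</small>
  </div>
  <div class="quote">
    <span class="text">Second quote.</span>
    <small class="author">Author Two</small>
  </div>
')

# Select each quote div once ...
quote_divs <- mini_page %>% html_elements("div.quote")

# ... then look inside each div, so author and text stay paired row by row
mini_df <- tibble(
  Author     = quote_divs %>% html_element(".author") %>% html_text2(),
  Quote_text = quote_divs %>% html_element(".text") %>% html_text2()
)

mini_df
```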

Investigating the site more fully, you may notice that this is just the first page of several within the quotes to scrape website. We would like to extract data from each of these pages.

However, as we mentioned, we would like to do this in a ‘polite’ manner, obeying the instructions in the site’s robots.txt file (which tells us where we are allowed to scrape), and not making page requests too quickly.

We can do this using the aptly-named polite R package, which interacts well with rvest.

There are three functions we need to use from the polite package: bow, scrape and nod.

  • bow takes the base url of the site, parses its robots.txt file, and sets the rate at which we should scrape.
  • scrape issues the http request to actually download the page.
  • nod specifies any additional page we would like to scrape.
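
To identify ourselves and limit our request rate explicitly, bow accepts user_agent and delay arguments. The helper below is a hypothetical wrapper written only for illustration: its name, its arguments and the contact string in the usage comment are all assumptions, and it needs the polite package and an internet connection when it is actually called.

```r
# Hypothetical helper: the name, arguments and default are illustrative only
polite_session <- function(url, contact, delay = 5) {
  # user_agent tells the site who is making the requests;
  # delay is the desired number of seconds between requests
  polite::bow(url, user_agent = contact, delay = delay)
}

# Example use (needs an internet connection, so it is not run here):
# session <- polite_session(
#   "https://quotes.toscrape.com/",
#   contact = "Your Name (your.email@example.com)"
# )
# polite::scrape(session)
```

Giving a contact address in the user agent means the site owner can get in touch if the scraping causes problems.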

We can use these functions to extract the same information as previously.

library(polite)
Warning: package 'polite' was built under R version 4.3.3
polite_connection <- bow("https://quotes.toscrape.com/")


polite_connection %>% scrape() %>% 
  html_elements("div.quote .text") %>%
  html_text2()
 [1] "“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”"                
 [2] "“It is our choices, Harry, that show what we truly are, far more than our abilities.”"                                              
 [3] "“There are only two ways to live your life. One is as though nothing is a miracle. The other is as though everything is a miracle.”"
 [4] "“The person, be it gentleman or lady, who has not pleasure in a good novel, must be intolerably stupid.”"                           
 [5] "“Imperfection is beauty, madness is genius and it's better to be absolutely ridiculous than absolutely boring.”"                    
 [6] "“Try not to become a man of success. Rather become a man of value.”"                                                                
 [7] "“It is better to be hated for what you are than to be loved for what you are not.”"                                                 
 [8] "“I have not failed. I've just found 10,000 ways that won't work.”"                                                                  
 [9] "“A woman is like a tea bag; you never know how strong it is until it's in hot water.”"                                              
[10] "“A day without sunshine is like, you know, night.”"                                                                                 

Now we are ready to extract quotes from all the pages on the site.

These pages have a predictable web address:

  • https://quotes.toscrape.com/page/1/
  • https://quotes.toscrape.com/page/2/

We can use the nod function to direct our scraping to page 2:

polite_connection %>% 
  nod("page/2") %>% 
  scrape() %>% 
  html_elements("div.quote .text") %>%
  html_text2()
 [1] "“This life is what you make it. No matter what, you're going to mess up sometimes, it's a universal truth. But the good part is you get to decide how you're going to mess it up. Girls will be your friends - they'll act like it anyway. But just remember, some come, some go. The ones that stay with you through everything - they're your true best friends. Don't let go of them. Also remember, sisters make the best friends in the world. As for lovers, well, they'll come and go too. And baby, I hate to say it, most of them - actually pretty much all of them are going to break your heart, but you can't give up because if you give up, you'll never find your soulmate. You'll never find that half who makes you whole and that goes for everything. Just because you fail once, doesn't mean you're gonna fail at everything. Keep trying, hold on, and always, always, always believe in yourself, because if you don't, then who will, sweetie? So keep your head high, keep your chin up, and most importantly, keep smiling, because life's a beautiful thing and there's so much to smile about.”"
 [2] "“It takes a great deal of bravery to stand up to our enemies, but just as much to stand up to our friends.”"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                 
 [3] "“If you can't explain it to a six year old, you don't understand it yourself.”"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              
 [4] "“You may not be her first, her last, or her only. She loved before she may love again. But if she loves you now, what else matters? She's not perfect—you aren't either, and the two of you may never be perfect together but if she can make you laugh, cause you to think twice, and admit to being human and making mistakes, hold onto her and give her the most you can. She may not be thinking about you every second of the day, but she will give you a part of her that she knows you can break—her heart. So don't hurt her, don't change her, don't analyze and don't expect more than she can give. Smile when she makes you happy, let her know when she makes you mad, and miss her when she's not there.”"                                                                                                                                                                                                                                                                                                                                                                                                   
 [5] "“I like nonsense, it wakes up the brain cells. Fantasy is a necessary ingredient in living.”"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
 [6] "“I may not have gone where I intended to go, but I think I have ended up where I needed to be.”"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             
 [7] "“The opposite of love is not hate, it's indifference. The opposite of art is not ugliness, it's indifference. The opposite of faith is not heresy, it's indifference. And the opposite of life is not death, it's indifference.”"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            
 [8] "“It is not a lack of love, but a lack of friendship that makes unhappy marriages.”"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          
 [9] "“Good friends, good books, and a sleepy conscience: this is the ideal life.”"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                
[10] "“Life is what happens to us while we are making other plans.”"                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               

We can therefore now write an R function to extract the text and author name from each page:

get_quote_text <- function(page_no, polite_con){
  # Request the page once, then reuse the parsed html for both fields
  page_html <- polite_con %>% 
    nod(paste0("page/", page_no)) %>% 
    scrape()
  
  quote_text <- page_html %>% 
    html_elements("div.quote .text") %>%
    html_text2()
  author <- page_html %>% 
    html_elements("div.quote .author") %>%
    html_text2()
  
  # One row per quote, with the author alongside the quote text
  out_df <- tibble(Author=author, Quote_text=quote_text)
  return(out_df)
}


page_1_df <- get_quote_text(1,polite_connection)

At present we don’t know the last page number of the site. A simple way to find this out is to try a few:

get_quote_text(10,polite_connection)
# A tibble: 10 × 2
   Author             Quote_text                                                
   <chr>              <chr>                                                     
 1 J.K. Rowling       "“The truth.\" Dumbledore sighed. \"It is a beautiful and…
 2 Jimi Hendrix       "“I'm the one that's got to die when it's time for me to …
 3 J.M. Barrie        "“To die will be an awfully big adventure.”"              
 4 E.E. Cummings      "“It takes courage to grow up and become who you really a…
 5 Khaled Hosseini    "“But better to get hurt by the truth than comforted with…
 6 Harper Lee         "“You never really understand a person until you consider…
 7 Madeleine L'Engle  "“You have to write the book that wants to be written. An…
 8 Mark Twain         "“Never tell the truth to people who are not worthy of it…
 9 Dr. Seuss          "“A person's a person, no matter how small.”"             
10 George R.R. Martin "“... a mind needs books as a sword needs a whetstone, if…
get_quote_text(20,polite_connection)
# A tibble: 0 × 2
# ℹ 2 variables: Author <chr>, Quote_text <chr>
get_quote_text(11,polite_connection)
# A tibble: 0 × 2
# ℹ 2 variables: Author <chr>, Quote_text <chr>

So it turns out there are 10 pages of quotes. We can therefore write a for-loop to loop over all the pages and extract the information we need.

quote_df <- get_quote_text(1,polite_connection)

for (i in 2:10){
  quote_df <- rbind(quote_df, get_quote_text(i, polite_connection))
  
}

dim(quote_df)
[1] 100   2
quote_df
# A tibble: 100 × 2
   Author            Quote_text                                                 
   <chr>             <chr>                                                      
 1 Albert Einstein   “The world as we have created it is a process of our think…
 2 J.K. Rowling      “It is our choices, Harry, that show what we truly are, fa…
 3 Albert Einstein   “There are only two ways to live your life. One is as thou…
 4 Jane Austen       “The person, be it gentleman or lady, who has not pleasure…
 5 Marilyn Monroe    “Imperfection is beauty, madness is genius and it's better…
 6 Albert Einstein   “Try not to become a man of success. Rather become a man o…
 7 André Gide        “It is better to be hated for what you are than to be love…
 8 Thomas A. Edison  “I have not failed. I've just found 10,000 ways that won't…
 9 Eleanor Roosevelt “A woman is like a tea bag; you never know how strong it i…
10 Steve Martin      “A day without sunshine is like, you know, night.”         
# ℹ 90 more rows
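
If the number of pages is not known in advance, the trial-and-error search and the loop can be combined: keep requesting pages until one comes back empty. The sketch below shows the pattern offline, using a made-up stand-in for get_quote_text (fake_get_quote_text, which pretends the site has three pages of two quotes each). The stand-in, the max_pages cap and the other object names are all hypothetical.

```r
library(dplyr)
library(tibble)

# Stand-in for get_quote_text(): pretends the site has 3 pages of 2 quotes each
fake_get_quote_text <- function(page_no) {
  if (page_no > 3) {
    return(tibble(Author = character(), Quote_text = character()))
  }
  tibble(
    Author     = paste("Author", page_no, 1:2),
    Quote_text = paste("Quote", page_no, 1:2)
  )
}

max_pages <- 50   # safety cap so the loop always terminates
pages <- list()

for (page_no in seq_len(max_pages)) {
  page_df <- fake_get_quote_text(page_no)
  if (nrow(page_df) == 0) break   # an empty page means we have run out
  pages[[page_no]] <- page_df
}

# Bind all the pages together in one step at the end
all_quotes <- bind_rows(pages)
dim(all_quotes)
```

Collecting the pages in a list and binding them once at the end avoids growing the data frame inside the loop, which becomes slow when there are many pages.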

We now have a data frame of 100 quotes, which we have scraped from several pages across the site. We can use the same technique to scrape text data from a much larger number of pages, and then apply the text analysis techniques discussed elsewhere in the course.

Extra task

If you would like to practice, you may wish to have a go at scraping information from this site:

https://books.toscrape.com/

Resources