Web Scraping...continued

Recap

  • We use the HTML structure of web pages to extract data from them
  • We use the requests library to acquire HTML
  • We use the BeautifulSoup library to parse and search HTML
  • BeautifulSoup lets us select HTML tags via a CSS selector with .select()
  • This returns a ResultSet class we can loop over
  • Each item in the result set is a Tag object, which is the subsection of the HTML tree consisting of our matched tag and all its children
  • We can access the tag's attributes as a Python dictionary via the .attrs property
  • We can access all the text contained within a tag via the .text property
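The recap above can be sketched in a few lines. This is a minimal illustration, assuming the bs4 package is installed; the HTML snippet and the variable names are made up for the example, and in a real scraper the HTML would come from something like requests.get(url).text.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a page fetched with requests.get(url).text
html = """
<ul>
  <li class="item"><a href="/a">First <b>link</b></a></li>
  <li class="item"><a href="/b">Second link</a></li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# .select() takes a CSS selector and returns a ResultSet we can loop over
links = soup.select("li.item a")

# Each item is a Tag: .attrs is a dict of its attributes,
# and .text is all the text inside the tag and its children
hrefs = [tag.attrs["href"] for tag in links]
texts = [tag.text for tag in links]

print(hrefs)
print(texts)
```

Note that .text on the first link includes the text of the nested <b> tag, since a Tag covers the matched tag and all its children.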

Writing a web scraper tends to be a one-off project

schema.org

  • Collaborative community to create, maintain, and promote schemas for structured data on the Internet
  • Launched by Bing, Google, & Yahoo
  • Provides a common standard for HTML markup to use for representing common kinds of structured information

Examples:

  • Events
  • People
  • Consumer products
  • Books and movies
  • Local businesses and restaurants
  • Recipes

Can we build a generic scraper for sites that use schema.org's schema for recipes?

Yes!

If a recipe page follows the schema, each piece of information it provides is enclosed in a tag that carries an itemprop="attributeName" attribute.


<ul>
    <li itemprop="recipeIngredient">3 garlic cloves minced</li>
    <li itemprop="recipeIngredient">1 stalk celery, diced</li>
    <li itemprop="recipeIngredient">1 carrot, diced</li>
    <li itemprop="recipeIngredient">2 tbsp. of olive oil</li>
</ul>
    

This means we can call:


        soup.select('[itemprop="recipeIngredient"]')
    

...to get all the ingredients on a recipe page.
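As a quick check, here is that selector applied to the ingredient list shown above. This is a sketch that assumes the bs4 package is installed; the variable names are chosen only for illustration.

```python
from bs4 import BeautifulSoup

# The ingredient list from the example above
html = """
<ul>
    <li itemprop="recipeIngredient">3 garlic cloves minced</li>
    <li itemprop="recipeIngredient">1 stalk celery, diced</li>
    <li itemprop="recipeIngredient">1 carrot, diced</li>
    <li itemprop="recipeIngredient">2 tbsp. of olive oil</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# The attribute selector matches any tag whose itemprop is "recipeIngredient",
# regardless of the tag name or where it sits in the page
tags = soup.select('[itemprop="recipeIngredient"]')
ingredients = [tag.text.strip() for tag in tags]

for ingredient in ingredients:
    print(ingredient)
```

Because the selector keys on the itemprop attribute rather than on the page layout, the same call should work on any page that marks up its ingredients this way.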

Let's look at a test recipe.

This simple recipe follows the schema.org standard.

Example Recipe Title


    <h1 itemprop="name">Mom's World Famous Banana Bread</h1>
    

Example Recipe Ingredient List


<h2>Ingredients:</h2>
<ul>
    <li itemprop="recipeIngredient">3 or 4 ripe bananas</li>
    <li itemprop="recipeIngredient">1 egg</li>
    <li itemprop="recipeIngredient">3/4 cup of sugar</li>
    <li itemprop="recipeIngredient">1 1/2 cup of flour</li>
</ul>
    

Example Recipe Instructions


<div itemprop="recipeInstructions">
    <span itemprop="step">Preheat the oven to 350 degrees.</span>
    <span itemprop="step">Mix in the ingredients in a bowl.</span>
    <span itemprop="step">Add the flour last.</span>
    <span itemprop="step">Pour the mixture into a loaf pan and bake for one hour.</span>
</div>
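Putting the pieces of the test recipe together, a generic scraper might look like the sketch below. It assumes the bs4 package is installed; scrape_recipe is a hypothetical helper name, and the itemprop="step" selector reflects how this test recipe marks up its steps, which other sites may not do.

```python
from bs4 import BeautifulSoup


def scrape_recipe(html):
    """Extract name, ingredients, and steps from schema.org-style recipe HTML.

    Hypothetical helper for illustration; assumes steps carry itemprop="step"
    inside the recipeInstructions container, as in the test recipe.
    """
    soup = BeautifulSoup(html, "html.parser")
    name_tag = soup.select_one('[itemprop="name"]')
    ingredient_tags = soup.select('[itemprop="recipeIngredient"]')
    step_tags = soup.select('[itemprop="recipeInstructions"] [itemprop="step"]')
    return {
        # select_one returns None when nothing matches, so guard against that
        "name": name_tag.text.strip() if name_tag else None,
        "ingredients": [tag.text.strip() for tag in ingredient_tags],
        "steps": [tag.text.strip() for tag in step_tags],
    }


# The test recipe fragments from above, combined into one page
test_html = """
<h1 itemprop="name">Mom's World Famous Banana Bread</h1>
<h2>Ingredients:</h2>
<ul>
    <li itemprop="recipeIngredient">3 or 4 ripe bananas</li>
    <li itemprop="recipeIngredient">1 egg</li>
    <li itemprop="recipeIngredient">3/4 cup of sugar</li>
    <li itemprop="recipeIngredient">1 1/2 cup of flour</li>
</ul>
<div itemprop="recipeInstructions">
    <span itemprop="step">Preheat the oven to 350 degrees.</span>
    <span itemprop="step">Mix in the ingredients in a bowl.</span>
    <span itemprop="step">Add the flour last.</span>
    <span itemprop="step">Pour the mixture into a loaf pan and bake for one hour.</span>
</div>
"""

recipe = scrape_recipe(test_html)
print(recipe["name"])
print(recipe["ingredients"])
print(recipe["steps"])
```

The function never refers to h1, ul, or div, only to itemprop values, which is what lets one scraper handle differently laid-out pages that follow the same schema.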