Remember, HTML provides a way to add structure and semantic meaning to a document.
It's not exactly structured...but at least semi-structured.
You can manually produce structured
information from semi-structured data.
Let's say I want to create a spreadsheet of the prices and customer ratings of cheeses found on gourmetfoodstore.com.
You could click through every page of cheese and manually collect the data.
😞 This will be:
Today we’re going to be talking about web scraping: the practice of automatically extracting data of interest from webpages into a structured representation.
Python is a great language for doing this.
The requests library is an open-source package for downloading web pages with Python.
pip install requests
import requests
response = requests.get("https://cheese.com")
response.text
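Real requests can fail or hang, so here is a slightly more defensive take on the same download step. This is only a sketch: the helper name fetch_html and the 10-second timeout are illustrative choices, not something requests defines for you.

```python
import requests


def fetch_html(url, timeout=10):
    """Download a page and return its HTML as a string.

    fetch_html is a hypothetical helper name; the timeout value is arbitrary.
    """
    response = requests.get(url, timeout=timeout)
    # Raise requests.HTTPError for 4xx/5xx responses instead of
    # quietly returning the text of an error page.
    response.raise_for_status()
    return response.text


# Example use (needs network access):
# html = fetch_html("https://cheese.com")
# print(html[:200])
```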
What you get back from
requests.get("https://cheese.com/bleu-dauvergne/").text
...is a huge string representing the HTML of the webpage.
Imagine that you have to use the string manipulation you already know to find all the links on that page and then list where they go.
Kinda scary, right? 😱
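To see why, here is a rough sketch of the string-manipulation approach using a regular expression. The HTML snippet is made up for illustration; the point is only that a pattern that looks reasonable quietly misses links written in a slightly different style.

```python
import re

# A tiny made-up snippet: two links written in slightly different styles.
html = (
    '<p><a href="/alphabetical/">Alphabetical</a> and '
    "<a class='nav' href='/by_type/'>By type</a></p>"
)

# Naive pattern: assumes href is the first attribute and uses double quotes.
naive_links = re.findall(r'<a href="([^"]*)"', html)
print(naive_links)  # ['/alphabetical/'] (the second link is silently missed)
```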
BeautifulSoup is a scraping library that takes an HTML string and lets you, the programmer, do “easy” searches for interesting stuff in that string.
pip install beautifulsoup4
import requests
from bs4 import BeautifulSoup
html = requests.get("https://cheese.com/bleu-dauvergne/").text
soup = BeautifulSoup(html, 'html.parser')
So let’s say we still want to find all the links. (The links are <a> tags in HTML.)
soup.select("a")
[<a href="https://cheese.com/advertise/"><img src="/media/img/advertise/cheese-banner-bottom.jpg"/></a>,
<a href="/"><h1 class="logo" title="Cheese.com"></h1></a>,
<a href="/alphabetical/">Alphabetical</a>,
<a class="dropdown-toggle" data-toggle="dropdown" href="#">Cheeses by Category <span></span></a>,
<a href="/by_type/"><span></span>Cheeses by type</a>,
<a href="/by_country/"><span></span>Cheeses by country</a>,
<a href="/by_milk/"><span></span>Cheeses by milk</a>,
<a href="/by_texture/"><span></span>Cheeses by texture</a>,
<a href="/by_color/"><span></span>Cheeses by color</a>,
...]
What does something like soup.select("a") give you?
A ResultSet. You can loop over it (for tag in soup.select("a"):) or index into it (soup.select("a")[2]).
What is each of the things in the ResultSet?
Tag elements.
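A small sketch of both answers, run on made-up HTML rather than a real page. It treats the ResultSet like a list (which is how it behaves) and pulls one value out of each Tag.

```python
from bs4 import BeautifulSoup

# Made-up HTML standing in for a real page.
html = '<p><a href="/a/">A</a> <a href="/b/">B</a> <a href="/c/">C</a></p>'
soup = BeautifulSoup(html, "html.parser")

links = soup.select("a")

print(len(links))               # 3
print(isinstance(links, list))  # True: you can loop over it and index it
print(type(links[2]).__name__)  # Tag

# Looping over the results and reading one attribute from each Tag.
hrefs = [tag["href"] for tag in links]
print(hrefs)                    # ['/a/', '/b/', '/c/']
```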
# will give you a ResultSet with three Tag objects:
soup.select("li")
.select() understands CSS selectors:
| Call | What it gets |
| --- | --- |
| soup.select("a.nav-links") | Gets all links with class="nav-links" |
| soup.select("#firstHeader") | Gets the tag with id="firstHeader" (there can only be one) |
| soup.select('[itemprop="name"]') | Gets all tags with the attribute "itemprop" set to "name" |
| soup.select('li a') | Gets all the links <a> that appear inside of list items <li> |
| soup.select('h1, h2, h3') | Gets all level 1, 2, or 3 headers |
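Here is each of those selectors run against a small page. The HTML is invented for this example (only the selector strings come from the table above), so treat it as a sketch of the behavior rather than of any real site.

```python
from bs4 import BeautifulSoup

# Invented HTML that contains something for every selector to match.
html = """
<h1 id="firstHeader">Cheeses</h1>
<ul>
  <li><a class="nav-links" href="/by_type/">By type</a></li>
  <li><a class="nav-links" href="/by_milk/">By milk</a></li>
</ul>
<h2>Featured</h2>
<p><span itemprop="name">Bleu d'Auvergne</span> is sold <a href="/shop/">here</a>.</p>
"""
soup = BeautifulSoup(html, "html.parser")

nav_links = soup.select("a.nav-links")
first_header = soup.select("#firstHeader")
names = soup.select('[itemprop="name"]')
links_in_items = soup.select("li a")
headers = soup.select("h1, h2, h3")

print(len(nav_links))              # 2
print(first_header[0].text)        # Cheeses
print(names[0].text)               # Bleu d'Auvergne
print(len(links_in_items))         # 2 (the /shop/ link is not inside an <li>)
print([t.name for t in headers])   # ['h1', 'h2']
```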
Each Tag object is its own little mini tag soup:
unordered_list = soup.select("ul")[0]
for list_item in unordered_list.select("li"):
    for child in list_item.children:
        # Each of these is going to be an <a> Tag instance.
        print(child)
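A runnable sketch of the same idea on made-up HTML. It shows two things worth knowing: searching from a Tag only looks inside that tag, and .children can hand you plain text as well as tags when a list item contains more than a link.

```python
from bs4 import BeautifulSoup

# Made-up HTML: a list with two links, plus one link outside the list.
html = (
    '<ul><li><a href="/brie/">Brie</a></li>'
    '<li><a href="/feta/">Feta</a> (Greek)</li></ul>'
    '<p><a href="/about/">About</a></p>'
)
soup = BeautifulSoup(html, "html.parser")

unordered_list = soup.select("ul")[0]

# Searching from a Tag only looks inside that tag, so the
# /about/ link in the <p> is not found here.
inner_links = [a["href"] for a in unordered_list.select("a")]
print(inner_links)  # ['/brie/', '/feta/']

# .children yields direct children, which can be text as well as tags.
second_item = unordered_list.select("li")[1]
kinds = [type(child).__name__ for child in second_item.children]
print(kinds)  # ['Tag', 'NavigableString']
```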
Navigating around the HTML tree:
tag.previous_sibling (→ Tag)
tag.next_sibling (→ Tag)
tag.children (→ iterator over the child nodes)
tag.parent (→ Tag)
Once you get a tag with .select(), you can access its attributes and text.
Here’s an example:
for anchor in soup.select('a'):
    print(anchor.text)
    if 'href' in anchor.attrs:
        print(anchor.attrs['href'])
Notice that .attrs is a dictionary, and we have to make sure that href is in it so we avoid key errors for links that don’t have href attributes. (It happens, actually. A lot.)
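Two other ways to dodge that KeyError, sketched on made-up HTML: Tag.get, which returns None when the attribute is missing, and the CSS selector a[href], which only matches anchors that have an href in the first place.

```python
from bs4 import BeautifulSoup

# Made-up HTML: one normal link and one anchor with no href.
html = '<a href="/brie/">Brie</a><a name="top">Top of page</a>'
soup = BeautifulSoup(html, "html.parser")

for anchor in soup.select("a"):
    # Tag.get returns None (or a default you pass) when the attribute is missing.
    href = anchor.get("href")
    if href is not None:
        print(anchor.text, "->", href)

# CSS can also do the filtering: a[href] matches only anchors that have an href.
with_href = [a["href"] for a in soup.select("a[href]")]
print(with_href)  # ['/brie/']
```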
The text of a tag is the text inside the tag, which is what actually gets displayed on the webpage.
The attribute value is whatever appears in the quotes in the tag attributes:
<a href="http://umbc.edu">UMBC </a>
tag.attrs["href"] is "http://umbc.edu"
tag.text is "UMBC " (the trailing space inside the tag is included)
tag.name is the name of the tag itself. Here that's "a".
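You can check all three on that exact tag. A quick sketch; get_text(strip=True) is included as one way to drop the stray whitespace around the text.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<a href="http://umbc.edu">UMBC </a>', "html.parser")
tag = soup.select("a")[0]

print(tag.attrs["href"])          # http://umbc.edu
print(repr(tag.text))             # 'UMBC ' (the trailing space is part of the text)
print(tag.get_text(strip=True))   # UMBC
print(tag.name)                   # a
```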
Given this HTML:
<ul>
<li>This is <a href="example.com">link number 1</a>.</li>
<li>This is <a href="example.com">link number 2</a>.</li>
</ul>
This Python:
for li in soup.select("li"):
    print(li.text)
...will print "This is link number 1." and "This is link number 2."
In addition to select, you can also use the find method:
first_link = soup.find("a")
email_form = soup.find(id="email-form")
find returns a single Tag object (the first match) instead of a ResultSet.
Here's the HTML with cheese links:
Here's the HTML where the cheese name appears:
Here's the HTML where the cheese properties are:
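Once you know where the names and properties live in the markup, the scraping code follows the same pattern every time. Below is a minimal sketch of that pattern. The HTML is entirely made up (the cheese, price, and rating class names are assumptions, not the real site's markup), and it ends by writing the rows as CSV, since a spreadsheet was the goal.

```python
import csv
import io

from bs4 import BeautifulSoup

# Entirely made-up markup: the class names and structure below are
# assumptions for illustration, not any real site's HTML.
html = """
<div class="cheese"><h2 itemprop="name">Brie</h2>
  <span class="price">$12.50</span><span class="rating">4.5</span></div>
<div class="cheese"><h2 itemprop="name">Feta</h2>
  <span class="price">$8.00</span><span class="rating">4.0</span></div>
"""
soup = BeautifulSoup(html, "html.parser")

# One dictionary per cheese: this is the structured representation.
rows = []
for card in soup.select("div.cheese"):
    rows.append({
        "name": card.select('[itemprop="name"]')[0].text.strip(),
        "price": card.select(".price")[0].text.strip(),
        "rating": card.select(".rating")[0].text.strip(),
    })

# Write the rows as CSV text (a real file object works the same way).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["name", "price", "rating"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```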