Due Date: Friday, April 1st, 11:59:59 PM
Value: 150 points
GitHub Invite: Click Here
Collaboration: For Assignment 4, collaboration is not allowed; you must work individually. You may still see your TA and come to office hours for help, but you may not work with any other CMSC 210 students. You may post questions on Discord, but you may not post code.
The state of Maryland is home to many beautiful species of butterfly. The website marylandbutterflies.com is designed to aid in the identification of these insects, and includes information on and photos of hundreds of native species. The assignment is to generate a CSV file containing data about these insects by scraping a mirrored version of the site.
An example of what the CSV file should contain is included in the GitHub assignment repository in the file example_output.csv.
The marylandbutterflies.com website is maintained as a free resource by a single individual.
To avoid incurring large bandwidth costs for this person, your Python screen scraper should never
hit the site directly. Instead, an exact replica of the site is available at http://161.35.185.186/.
Make sure your code only retrieves pages from this mirrored site.
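One way to enforce this rule is to route every request through a small guard that rejects any URL whose host is not the mirror. The sketch below is only an illustration: the names `MIRROR_BASE`, `is_mirror_url`, and `fetch` are hypothetical and not part of the assignment starter code.

```python
from urllib.parse import urljoin, urlparse

# Address of the mirrored site, taken from the assignment text.
MIRROR_BASE = "http://161.35.185.186/"
MIRROR_HOST = urlparse(MIRROR_BASE).hostname


def is_mirror_url(url):
    """Return True only if the URL points at the mirror host."""
    return urlparse(url).hostname == MIRROR_HOST


def fetch(url):
    """Hypothetical helper: download a page, refusing any non-mirror URL."""
    if not is_mirror_url(url):
        raise ValueError(f"Refusing to fetch non-mirror URL: {url}")
    # Imported here so the guard above needs only the standard library.
    import requests

    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return response.text
```

Relative links found on a page can be turned into absolute mirror URLs with `urljoin(MIRROR_BASE, href)` before they are passed to a helper like `fetch`.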
Your code should access the detail page for every Maryland butterfly. You should use BeautifulSoup
to extract these detail page links from the site's home page.
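As a rough sketch of link extraction with BeautifulSoup, the example below parses a made-up HTML snippet rather than the real home page. The markup, file paths, and the name `extract_links` are assumptions for illustration; the actual page structure must be inspected before deciding which links are detail pages.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup

MIRROR_BASE = "http://161.35.185.186/"

# Made-up stand-in for the home page; the real markup will differ.
SAMPLE_HOME_HTML = """
<html><body>
  <a href="species/example-one.html">Example One</a>
  <a href="/species/example-two.html">Example Two</a>
  <a href="species/example-one.html">Example One (duplicate link)</a>
  <a>No href here</a>
</body></html>
"""


def extract_links(html, base_url):
    """Return absolute URLs for every <a href=...> in the page, without duplicates."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute at all.
    for anchor in soup.find_all("a", href=True):
        absolute = urljoin(base_url, anchor["href"])
        if absolute not in links:
            links.append(absolute)
    return links


detail_links = extract_links(SAMPLE_HOME_HTML, MIRROR_BASE)
print(detail_links)
```

This sketch collects every link it finds. A real solution would likely need an extra filter so that only butterfly detail pages are kept.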
An example detail page is this one for the pink-edged sulphur.
On that page we see the following facts:
One final note: the home page contains a handful of sections of butterflies:
The butterflies in the last, non-Maryland section should not be scraped or appear in your output CSV.
You may use the requests and BeautifulSoup libraries.
Generating the final CSV file can be handled using the csv.writer function or the csv.DictWriter class in the csv module of the standard library.
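A minimal sketch of the csv.DictWriter approach follows. The column names and rows are placeholders chosen for illustration; the required columns are the ones shown in example_output.csv. The example writes to an in-memory buffer so it can be run without creating a file.

```python
import csv
import io

# Hypothetical column names and placeholder rows, not the required schema.
FIELDNAMES = ["name", "url"]
rows = [
    {"name": "Example One", "url": "http://161.35.185.186/example-one.html"},
    {"name": "Example, Two", "url": "http://161.35.185.186/example-two.html"},
]


def write_rows(out_file, fieldnames, rows):
    """Write a header line followed by one CSV line per dictionary."""
    writer = csv.DictWriter(out_file, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


buffer = io.StringIO()
write_rows(buffer, FIELDNAMES, rows)
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, the same function can be given a file opened with `open("output.csv", "w", newline="", encoding="utf-8")`; the csv documentation recommends `newline=""` so that line endings are handled correctly. Note that the writer quotes the field containing a comma automatically.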
Following coding standards | 30 points |
CSV Includes data about all butterflies | 60 points |
CSV columns contain correct data | 60 points |
Extra credit question: which page or pages contain the largest number of butterfly photos? (Note: your submission must include Python code that answers this question.) | 20 points |
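Because the extra credit question asks for the "page or pages", ties need to be reported rather than a single winner. The sketch below shows one way to handle that, assuming the photo counts have already been collected into a dictionary; the function name and the sample counts are hypothetical.

```python
def pages_with_most_photos(photo_counts):
    """Return every page whose photo count equals the maximum, sorted by name."""
    if not photo_counts:
        return []
    largest = max(photo_counts.values())
    return sorted(page for page, count in photo_counts.items() if count == largest)


# Made-up counts for illustration only.
example_counts = {"page-a.html": 3, "page-b.html": 5, "page-c.html": 5}
print(pages_with_most_photos(example_counts))
```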