JSON
or XML
Some random, free data sets:
Let's look at the best-seller list:
Position | Title | Units Sold |
---|---|---|
1 | The Heart is a Lonely Blacksmith | 11293 |
2 | Hush Hush, Sweet Bruce | 9853 |
3 | "Freedom," I Yodeled | 5071 |
4 | C:\Windows\For\Morons | 3124 |
>>> row = '1,The Heart is a Lonely Blacksmith,11293'
>>> row.split(",")
['1', 'The Heart is a Lonely Blacksmith', '11293']
>>> row = '2,"Hush Hush, Sweet Bruce",9853'
>>> row.split(",")
['2', '"Hush Hush', ' Sweet Bruce"', '9853']
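row = '2,"Hush Hush, Sweet Bruce",9853'
quote_open = False
columns = []
current_column = ""
for char in row:
    if char == '"':
        # Toggle quoting; the quote character itself is not kept
        quote_open = not quote_open
    elif char == "," and not quote_open:
        # An unquoted comma ends the current column
        columns.append(current_column)
        current_column = ""
    else:
        # Everything else, including quoted commas, is column text
        current_column += char
# The last column has no trailing comma
columns.append(current_column)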
row_3 = '3,""Freedom," I Yodeled",5071'
row = r'3,"\"Freedom,\" I Yodeled",5071'
😱
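To see why escaped quotes are a problem, here is a minimal sketch that wraps the hand-rolled splitter in a function and feeds it a row with doubled quotes. The function name split_row is made up for this illustration.

```python
def split_row(row):
    """Split one CSV line on commas, ignoring commas inside double quotes."""
    quote_open = False
    columns = []
    current_column = ""
    for char in row:
        if char == '"':
            quote_open = not quote_open
        elif char == "," and not quote_open:
            columns.append(current_column)
            current_column = ""
        else:
            current_column += char
    columns.append(current_column)
    return columns


simple = split_row('2,"Hush Hush, Sweet Bruce",9853')
print(simple)  # ['2', 'Hush Hush, Sweet Bruce', '9853']

# Doubled quotes ("") are meant to be literal quote characters,
# but this splitter treats every quote as a toggle and drops them.
tricky = split_row('3,"""Freedom,"" I Yodeled",5071')
print(tricky)  # ['3', 'Freedom, I Yodeled', '5071']
```

The second title comes back without its quotation marks, so the naive approach silently changes the data.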
Standard library to the rescue:
import csv

# newline="" lets the csv module handle line endings itself
with open("data_file.csv", newline="") as file_handle:
    reader = csv.reader(file_handle)
    for row in reader:
        # Example row:
        # ["1", "The Heart is a Lonely Blacksmith", "11293"]
        rank = int(row[0])
        title = row[1]
        units_sold = int(row[2])
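A quick way to try this without a file on disk: the sketch below reads the same kind of rows from an in-memory string. The csv_text contents are assumed for illustration and stand in for data_file.csv.

```python
import csv
import io

# In-memory stand-in for data_file.csv (contents assumed for illustration)
csv_text = (
    '1,The Heart is a Lonely Blacksmith,11293\n'
    '2,"Hush Hush, Sweet Bruce",9853\n'
    '3,"""Freedom,"" I Yodeled",5071\n'
)

# csv.reader accepts any iterable of lines, including a StringIO object
rows = list(csv.reader(io.StringIO(csv_text)))
for row in rows:
    print(row)
# ['1', 'The Heart is a Lonely Blacksmith', '11293']
# ['2', 'Hush Hush, Sweet Bruce', '9853']
# ['3', '"Freedom," I Yodeled', '5071']
```

Quoted commas and doubled quotes both come out as intended, with no hand-written parsing.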
CSVs often have header rows
rank,title,units sold
1,The Heart is a Lonely Blacksmith,11293
2,"Hush Hush, Sweet Bruce",9853
3,"""Freedom,"" I Yodeled",5071
import csv

# newline="" lets the csv module handle line endings itself
with open("data_file.csv", newline="") as file_handle:
    reader = csv.DictReader(file_handle)
    for row in reader:
        # Example row:
        # {"rank": "1",
        #  "title": "The Heart is a Lonely Blacksmith",
        #  "units sold": "11293"}
        rank = int(row["rank"])
        title = row["title"]
        units_sold = int(row["units sold"])
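As a small follow-up, this sketch uses DictReader on an in-memory copy of the header-row data and converts the numeric columns, since every CSV value arrives as a string. The names csv_text, books, and total_units are chosen for this example only.

```python
import csv
import io

# In-memory copy of the header-row example (stands in for data_file.csv)
csv_text = (
    'rank,title,units sold\n'
    '1,The Heart is a Lonely Blacksmith,11293\n'
    '2,"Hush Hush, Sweet Bruce",9853\n'
    '3,"""Freedom,"" I Yodeled",5071\n'
)

reader = csv.DictReader(io.StringIO(csv_text))

# Column names come from the header row; values must be converted by hand
books = [
    {
        "rank": int(row["rank"]),
        "title": row["title"],
        "units_sold": int(row["units sold"]),
    }
    for row in reader
]

total_units = sum(book["units_sold"] for book in books)
print(total_units)  # 26217
```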
Basic types (unlike CSV)
"string"
3.14
null
false
Lists
[1, "A", true]
Objects
{"a": 1, "b": 2, "c": 3}
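To make the type mapping concrete, here is a short sketch that parses the values above with Python's json module and prints the Python type each one becomes.

```python
import json

# One JSON list holding each kind of value shown above
json_text = '["string", 3.14, null, false, [1, "A", true], {"a": 1, "b": 2, "c": 3}]'

values = json.loads(json_text)
types = [type(value).__name__ for value in values]

for value, type_name in zip(values, types):
    print(type_name, value)
# str string
# float 3.14
# NoneType None
# bool False
# list [1, 'A', True]
# dict {'a': 1, 'b': 2, 'c': 3}
```

Unlike CSV, no manual int() or float() conversion is needed after parsing.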
[
{
"position": 1,
"title": "The Heart is a Lonely Blacksmith",
"units sold": 11293
},
{
"position": 2,
"title": "Hush Hush, Sweet Bruce",
"units sold": 9853
},
{
"position": 3,
"title": "\"Freedom,\" I Yodeled",
"units sold": 5071
},
{
"position": 4,
"title": "C:\\Windows\\For\\Morons",
"units sold": 3124
}
]
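The backslash escapes in the titles do not have to be written by hand. As a sketch, json.dumps produces them from ordinary Python strings, and json.loads reverses the process. The books list here copies two entries from the data above.

```python
import json

# Two of the trickier best-sellers as plain Python data
books = [
    {"position": 3, "title": '"Freedom," I Yodeled', "units sold": 5071},
    {"position": 4, "title": "C:\\Windows\\For\\Morons", "units sold": 3124},
]

# json.dumps escapes quotes and backslashes inside strings
encoded = json.dumps(books)
print(encoded)

# Parsing the text gives back the original data
decoded = json.loads(encoded)
print(decoded == books)  # True
```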
Nested & complex data
{"name": "Ted",
"places_lived": ["Switzerland", "Canada", "Mexico"],
"pet": null,
"siblings": [{"name": "Martine", "age": 24, "pets": ["Rolf"]},
{"name": "Fred", "age": 31,
"pets": ["Boto", "Formica", "Rover"]}]
}
import json
json_data = """{"name": "Ted",
"places_lived": ["Switzerland", "Canada", "Mexico"],
"pet": null,
"siblings": [{"name": "Martine", "age": 24, "pets": ["Rolf"]},
{"name": "Fred", "age": 31,
"pets": ["Boto", "Formica", "Rover"]}]}"""
ted_data = json.loads(json_data)
The ted_data variable is now a Python dictionary:
{'name': 'Ted',
'pet': None,
'places_lived': ['Switzerland', 'Canada', 'Mexico'],
'siblings': [{'age': 24, 'name': 'Martine', 'pets': ['Rolf']},
{'age': 31,
'name': 'Fred',
'pets': ['Boto', 'Formica', 'Rover']}]}
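Once parsed, nested data is reached with ordinary dictionary and list indexing. A minimal sketch, reusing the same JSON text (the variable names after ted_data are chosen for this example):

```python
import json

json_data = """{"name": "Ted",
"places_lived": ["Switzerland", "Canada", "Mexico"],
"pet": null,
"siblings": [{"name": "Martine", "age": 24, "pets": ["Rolf"]},
{"name": "Fred", "age": 31,
"pets": ["Boto", "Formica", "Rover"]}]}"""

ted_data = json.loads(json_data)

# Lists are indexed by position, objects by key
first_place = ted_data["places_lived"][0]

# JSON null becomes Python None
has_pet = ted_data["pet"] is not None

# Walk the nested list of sibling objects
pets_by_sibling = {
    sibling["name"]: sibling["pets"] for sibling in ted_data["siblings"]
}
total_pets = sum(len(pets) for pets in pets_by_sibling.values())

print(first_place)      # Switzerland
print(has_pet)          # False
print(pets_by_sibling)  # {'Martine': ['Rolf'], 'Fred': ['Boto', 'Formica', 'Rover']}
print(total_pets)       # 4
```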
Although JSON and CSV are very common, there is an almost endless array of data formats. The one to choose will depend on many factors:
The Social Security Administration maintains data on the most popular names in the US since 1880!
You can download the data!
Why are we seeing a drop in popularity for all the names we looked at?