CMSC 210: Lecture 7

Lecture 6

Data formats & Plotting

Announcements

Assignment two is due this Wednesday (March 2).

Structured Data...

Is Quantitative
Has a defined length and format
Is Machine-readable

Structured Data Examples

Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or otherwise).
Data from a database
Data interchange formats: JSON or XML
Time series data, multidimensional arrays (matrices)
Data retrieved from web-based APIs

Unstructured data

Information that is not organized in a pre-defined manner
Generally designed for humans, not machines.
Books and other documents. Prose in general...
Images, audio, and video files
Sometimes machines can learn the structure within

Sources of data

Some random, free data sets:

Common data formats

CSV (comma-separated values)

From the days when spreadsheets roamed the land 🦕
Designed for tabular data
Human-readable
Ubiquitous
Don't parse it yourself.

Let's look at the best-seller list:

Position	Title	Units Sold
1	The Heart is a Lonely Blacksmith	11293
2	Hush Hush, Sweet Bruce	9853
3	"Freedom," I Yodeled	5071
4	C:\Windows\For\Morons	3124


    >>> row = '1,The Heart is a Lonely Blacksmith,11293'
    >>> row.split(",")
    ['1', 'The Heart is a Lonely Blacksmith', '11293']


    >>> row = '2,"Hush Hush, Sweet Bruce",9853'
    >>> row.split(",")
    ['2', '"Hush Hush', ' Sweet Bruce"', '9853']


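    row = '2,"Hush Hush, Sweet Bruce",9853'
    quote_open = False
    columns = []
    current_column = ""
    for char in row:
        if char == '"':
            # Toggle quoted mode; the quote itself is not part of the value.
            quote_open = not quote_open
        elif char == "," and not quote_open:
            # An unquoted comma ends the current column.
            columns.append(current_column)
            current_column = ""
        else:
            # Everything else, including commas inside quotes, is kept.
            current_column += char
    columns.append(current_column)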


    row_3 = '3,""Freedom," I Yodeled",5071'


    row = r'3,"\"Freedom,\" I Yodeled",5071'

😱

In the real world, parsing CSV data is very complicated.

Don't do it.

Standard library to the rescue:


        import csv

        # newline="" lets the csv module handle line endings itself.
        with open("data_file.csv", newline="") as file_handle:
            reader = csv.reader(file_handle)
            for row in reader:
                # Example row:
                #    ["1", "The Heart is a Lonely Blacksmith", "11293"]
                rank = int(row[0])
                title = row[1]
                units_sold = int(row[2])
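
To see how the module copes with the tricky rows, here is a minimal sketch that feeds in-memory text to `csv.reader` instead of a file. It assumes quotes inside a field are escaped by doubling them (`""`), which is the module's default dialect; the variable names are only illustrative.

```python
import csv
import io

# Rows from the best-seller list, with embedded quotes doubled ("").
raw = (
    '1,The Heart is a Lonely Blacksmith,11293\n'
    '2,"Hush Hush, Sweet Bruce",9853\n'
    '3,"""Freedom,"" I Yodeled",5071\n'
)

# csv.reader accepts any iterable of lines, so a StringIO works like a file.
parsed_rows = list(csv.reader(io.StringIO(raw)))

for parsed in parsed_rows:
    print(parsed)
```

The comma inside the quoted title stays in the title, and the doubled quotes come back as single quote characters.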

CSVs often have header rows


        rank,title,units sold
        1,The Heart is a Lonely Blacksmith,11293
        2,"Hush Hush, Sweet Bruce",9853
        3,"""Freedom,"" I Yodeled",5071


        import csv

        # newline="" lets the csv module handle line endings itself.
        with open("data_file.csv", newline="") as file_handle:
            reader = csv.DictReader(file_handle)
            for row in reader:
                # Example row:
                #    {"rank": "1",
                #     "title": "The Heart is a Lonely Blacksmith",
                #     "units sold": "11293"}
                rank = int(row["rank"])
                title = row["title"]
                units_sold = int(row["units sold"])

Problems with CSV

Data needs to be tabular or it won't work well.
Has no notion of types: everything is a string
There is no single binding standard: RFC 4180 documents a common format, but dialects vary in delimiters, quoting, and escaping.
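
The "everything is a string" problem can be sketched as follows. The `converters` mapping is a hypothetical helper, not part of the `csv` module; it just records which conversion each column needs.

```python
import csv
import io

# One header row and one data row, held in memory for the example.
raw = 'rank,title,units sold\n1,The Heart is a Lonely Blacksmith,11293\n'

# Hypothetical mapping from column name to the converter for that column.
converters = {"rank": int, "title": str, "units sold": int}

typed_rows = []
for row in csv.DictReader(io.StringIO(raw)):
    # Every value arrives as a string, e.g. row["rank"] == "1".
    typed_rows.append({key: converters[key](value) for key, value in row.items()})

print(typed_rows)
```

The file itself carries no type information, so the reader has to know (or guess) what each column means.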

JSON (JavaScript Object Notation)

More recent format
Standard in web-based data sources
A much more free-form data format than CSV
Data can be nested
Human-readable

Basic types (unlike CSV)


        "string"
        3.14
        null
        false

Lists


    [1, "A", true]

Objects


    {"a": 1, "b": 2, "c": 3}
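
As a quick check of how these values arrive in Python, the sketch below parses each literal with `json.loads` from the standard library and prints the resulting type. The `samples` list is just an illustration.

```python
import json

# Each JSON literal is parsed on its own to show the resulting Python type.
samples = [
    '"string"',
    '3.14',
    'null',
    'false',
    '[1, "A", true]',
    '{"a": 1, "b": 2, "c": 3}',
]

parsed = [json.loads(text) for text in samples]

for text, value in zip(samples, parsed):
    print(f"{text:26} -> {type(value).__name__}: {value!r}")
```

Strings, numbers, `null`, and booleans map to `str`, `float` or `int`, `None`, and `bool`; lists become `list` and objects become `dict`.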


    [
      {
        "position": 1,
        "title": "The Heart is a Lonely Blacksmith",
        "units sold": 11293
      },
      {
        "position": 2,
        "title": "Hush Hush, Sweet Bruce",
        "units sold": 9853
      },
      {
        "position": 3,
        "title": "\"Freedom,\" I Yodeled",
        "units sold": 5071
      },
      {
        "position": 4,
        "title": "C:\\Windows\\For\\Morons",
        "units sold": 3124
      }
    ]
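
A minimal sketch of reading such a list in Python, using two of the entries. The JSON text sits in a raw string here so the backslash escapes reach the parser unchanged; in practice the text would usually come from a file or a web response.

```python
import json

# Raw string so the JSON backslash escapes reach the parser untouched.
best_sellers_json = r"""
[
  {"position": 3, "title": "\"Freedom,\" I Yodeled", "units sold": 5071},
  {"position": 4, "title": "C:\\Windows\\For\\Morons", "units sold": 3124}
]
"""

best_sellers = json.loads(best_sellers_json)

for book in best_sellers:
    # Numbers are already ints, and the escapes have been decoded.
    print(book["position"], book["title"], book["units sold"])

# json.dumps re-applies the escaping when writing the data back out.
round_trip = json.dumps(best_sellers[0]["title"])
print(round_trip)
```

Unlike the CSV version, no type conversion is needed, and the quoting rules are the same everywhere.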

Nested & complex data


    {"name": "Ted",
     "places_lived": ["Switzerland", "Canada", "Mexico"],
     "pet": null,
     "siblings": [{"name": "Martine", "age": 24, "pets": ["Rolf"]},
                  {"name": "Fred", "age": 31,
                   "pets": ["Boto", "Formica", "Rover"]}]
    }


    import json

    json_data = """{"name": "Ted",
        "places_lived": ["Switzerland", "Canada", "Mexico"],
        "pet": null,
        "siblings": [{"name": "Martine", "age": 24, "pets": ["Rolf"]},
                    {"name": "Fred", "age": 31,
                    "pets": ["Boto", "Formica", "Rover"]}]}"""

    ted_data = json.loads(json_data)

Our ted_data variable is a Python dictionary:


        {'name': 'Ted',
        'pet': None,
        'places_lived': ['Switzerland', 'Canada', 'Mexico'],
        'siblings': [{'age': 24, 'name': 'Martine', 'pets': ['Rolf']},
                     {'age': 31,
                      'name': 'Fred',
                      'pets': ['Boto', 'Formica', 'Rover']}]}
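
Once loaded, nested values are reached by chaining dictionary keys and list indexes. A short sketch that repeats the data so it runs on its own:

```python
import json

json_data = """{"name": "Ted",
    "places_lived": ["Switzerland", "Canada", "Mexico"],
    "pet": null,
    "siblings": [{"name": "Martine", "age": 24, "pets": ["Rolf"]},
                 {"name": "Fred", "age": 31,
                  "pets": ["Boto", "Formica", "Rover"]}]}"""

ted_data = json.loads(json_data)

# Dictionary keys and list indexes chain together to reach nested values.
first_place = ted_data["places_lived"][0]
fred_pets = ted_data["siblings"][1]["pets"]

# JSON null becomes None, so an identity check tells us Ted has no pet.
has_pet = ted_data["pet"] is not None

# Collect every pet name across all siblings.
all_pets = [pet for sibling in ted_data["siblings"] for pet in sibling["pets"]]

print(first_place, fred_pets, has_pet, all_pets)
```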

Although JSON and CSV are very common, there is an almost endless array of data formats. The one to choose will depend on many factors:

Language you are using
Whether the format needs to be human-readable
How fast the format should be to read and write
The structure of the data
The size of the data
The systems that need to interoperate with the data

Let's look at some real data!

The Social Security Administration maintains data on the most popular names in the US since 1880!

You can download the data!

What is the most popular name for a given year?
What is the least popular name for a given year?
How has a given name changed in popularity over time?

Let's write some code! 💻
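
One possible starting point, as a sketch only. It assumes each yearly file holds lines of the form `name,sex,count` with no header row, so check the files you download. The sample rows are made up, and `load_counts` is a hypothetical helper.

```python
import csv
import io

# Made-up sample standing in for one year's file; not real SSA figures.
# Assumed layout: name,sex,count with no header row.
sample_year_file = io.StringIO(
    "Alpha,F,300\n"
    "Beta,F,120\n"
    "Gamma,M,450\n"
    "Delta,M,5\n"
)

def load_counts(file_handle):
    """Return a list of (name, sex, count) tuples with count as an int."""
    return [(name, sex, int(count)) for name, sex, count in csv.reader(file_handle)]

records = load_counts(sample_year_file)

# max/min with a key function pick the rows with the largest/smallest count.
most_popular = max(records, key=lambda record: record[2])
least_popular = min(records, key=lambda record: record[2])

print("Most popular:", most_popular)
print("Least popular:", least_popular)
```

To use a real file, replace the `StringIO` object with `open(path, newline="")` inside a `with` block.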

Why are we seeing a drop in popularity for all the names we looked at?

Diversity of names has increased
Diversity of spellings has increased
Birth rate has decreased
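
One way to separate these effects is to look at a name's share of all births rather than its raw count. A minimal sketch with made-up numbers (not real figures) for one hypothetical name:

```python
# Made-up counts for one hypothetical name and for all births; not real figures.
name_counts = {1950: 8000, 2000: 6000}
total_births = {1950: 4_000_000, 2000: 3_000_000}

# Share of all recorded births that received the name in each year.
shares = {year: name_counts[year] / total_births[year] for year in name_counts}

for year in sorted(shares):
    print(f"{year}: count={name_counts[year]}, share={shares[year]:.2%}")
```

In this invented case the raw count falls while the share stays the same, which is what a drop in births alone would look like. A falling share would instead point to more diverse names or spellings.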