Lecture 6

Data formats & Plotting

Announcements

  • Assignment two is due this Wednesday (March 2).

Structured Data...

  • Is Quantitative
  • Has a defined length and format
  • Is Machine-readable

Structured Data Examples

  • Tabular or spreadsheet-like data in which each column may be a different type (string, numeric, date, or otherwise).
  • Data from a database
  • Data interchange formats: JSON or XML
  • Time series data, multidimensional arrays (matrices)
  • Data retrieved from web-based APIs

Unstructured data

  • Information that is not organized in a pre-defined manner
  • Generally designed for humans, not machines.
  • Books and other documents. Prose in general...
  • Images, audio, and video files
  • Sometimes machines can learn the structure within

Sources of data

Some random, free data sets:

Common data formats

CSV (comma-separated values)

  • From the days when spreadsheets roamed the land 🦕
  • Designed for tabular data
  • Human-readable
  • Ubiquitous
  • Don't parse it yourself.

Let's look at the best-seller list:

Position  Title                             Units Sold
1         The Heart is a Lonely Blacksmith  11293
2         Hush Hush, Sweet Bruce            9853
3         "Freedom," I Yodeled              5071
4         C:\Windows\For\Morons             3124

     >>> row = '1,The Heart is a Lonely Blacksmith,11293'
     >>> row.split(",")
     ['1', 'The Heart is a Lonely Blacksmith', '11293']
    

        >>> row = '2,"Hush Hush, Sweet Bruce",9853'
        >>> row.split(",")
        ['2', '"Hush Hush', ' Sweet Bruce"', '9853']
       

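    row = '2,"Hush Hush, Sweet Bruce",9853'
    quote_open = False
    columns = []
    current_column = ""
    for char in row:
        if char == '"':
            # Toggle quoted mode; the quote character itself is not kept.
            quote_open = not quote_open
        elif char == "," and not quote_open:
            # An unquoted comma ends the current column.
            columns.append(current_column)
            current_column = ""
        else:
            # Any other character, including a comma inside quotes, is data.
            current_column += char
    columns.append(current_column)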
    

    row_3 = '3,""Freedom," I Yodeled",5071'
    

    row = r'3,"\"Freedom,\" I Yodeled",5071'
    

😱

In the real world, parsing CSV data is very complicated.

Don't do it.

Standard library to the rescue:


        import csv

        with open("data_file.csv") as file_handle:
            reader = csv.reader(file_handle)
            for row in reader:
                # Example row:
                #    ["1", "The Heart is a Lonely Blacksmith", "11293"]
                rank = int(row[0])
                title = row[1]
                units_sold = int(row[2])
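
As a quick check, here is a minimal sketch that feeds the best-seller rows to csv.reader from an in-memory string instead of data_file.csv. The names csv_text and books are made up for this example, and the sample assumes the doubled-quote style of escaping.

```python
import csv
import io

# In-memory stand-in for data_file.csv (no header row in this sketch).
csv_text = (
    '1,The Heart is a Lonely Blacksmith,11293\n'
    '2,"Hush Hush, Sweet Bruce",9853\n'
    '3,"""Freedom,"" I Yodeled",5071\n'
    '4,C:\\Windows\\For\\Morons,3124\n'
)

books = []
for row in csv.reader(io.StringIO(csv_text)):
    # Every field arrives as a string, so convert the numeric ones.
    books.append((int(row[0]), row[1], int(row[2])))

for rank, title, units_sold in books:
    print(rank, title, units_sold)
```

The quoted comma and the embedded quotes both come back as part of the title, with no hand-written parsing.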
    

CSVs often have header rows


        rank,title,units sold
        1,The Heart is a Lonely Blacksmith,11293
        2,"Hush Hush, Sweet Bruce",9853
        3,"""Freedom,"" I Yodeled",5071
    

        import csv

        with open("data_file.csv") as file_handle:
            reader = csv.DictReader(file_handle)
            for row in reader:
                # Example row:
                #    {"rank": "1",
                #     "title": "The Heart is a Lonely Blacksmith",
                #     "units sold": "11293"}
                rank = int(row["rank"])
                title = row["title"]
                units_sold = int(row["units sold"])
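
Writing works the same way. A minimal sketch, assuming we build the file in memory with io.StringIO: csv.writer decides which fields need quoting, so we never add quotes by hand. The names books, buffer and written_lines are illustrative.

```python
import csv
import io

books = [
    (1, "The Heart is a Lonely Blacksmith", 11293),
    (2, "Hush Hush, Sweet Bruce", 9853),
    (3, '"Freedom," I Yodeled', 5071),
    (4, "C:\\Windows\\For\\Morons", 3124),
]

buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["rank", "title", "units sold"])  # header row
writer.writerows(books)

# The writer ends lines with \r\n by default; splitlines() handles that.
written_lines = buffer.getvalue().splitlines()
for line in written_lines:
    print(line)
```

When writing to a real file, the csv documentation recommends opening it with newline="" so line endings are not translated twice.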
    

Problems with CSV

  • Data needs to be tabular or it won't work well.
  • Has no notion of types: everything is a string
  • There is no single enforced standard for it (RFC 4180 describes a common form, but dialects vary).
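
The "everything is a string" problem is easy to trip over. A small sketch, using a made-up two-row sample, of what happens when we compare the unconverted values:

```python
import csv
import io

csv_text = (
    "rank,title,units sold\n"
    "1,The Heart is a Lonely Blacksmith,11293\n"
    '2,"Hush Hush, Sweet Bruce",9853\n'
)

rows = list(csv.DictReader(io.StringIO(csv_text)))

# Compared as strings, "9853" sorts after "11293" because "9" > "1".
top_as_text = max(rows, key=lambda row: row["units sold"])
# Compared as integers, we get the ordering we actually want.
top_as_int = max(rows, key=lambda row: int(row["units sold"]))

print(type(rows[0]["units sold"]).__name__)
print(top_as_text["title"])
print(top_as_int["title"])
```

Converting each column to its real type right after reading avoids this kind of silent mistake.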

JSON (JavaScript Object Notation)

  • More recent format
  • Standard in web-based data sources
  • A much more free-form data format than CSV
  • Data can be nested
  • Human-readable

Basic types (unlike CSV)


        "string"
        3.14
        null
        false
    

Lists


    [1, "A", true]
    

Objects


    {"a": 1, "b": 2, "c": 3}
    

        [
        {
          "position": 1,
          "title": "The Heart is a Lonely Blacksmith",
          "units sold": 11293
        },
        {
          "position": 2,
          "title": "Hush Hush, Sweet Bruce",
          "units sold": 9853
        },
        {
          "position": 3,
          "title": "\"Freedom,\" I Yodeled",
          "units sold": 5071
        },
        {
          "position": 4,
          "title": "C:\\Windows\\For\\Morons",
          "units sold": 3124
        }
      ]
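
A minimal sketch of loading this list with the json module. The text is held in a raw string (an assumption of this example, so the JSON backslash escapes reach the parser unchanged), and the names json_text, books and total_units are illustrative.

```python
import json

# Raw string: backslashes are passed through to the JSON parser as written.
json_text = r"""
[
  {"position": 1, "title": "The Heart is a Lonely Blacksmith", "units sold": 11293},
  {"position": 2, "title": "Hush Hush, Sweet Bruce", "units sold": 9853},
  {"position": 3, "title": "\"Freedom,\" I Yodeled", "units sold": 5071},
  {"position": 4, "title": "C:\\Windows\\For\\Morons", "units sold": 3124}
]
"""

books = json.loads(json_text)

# Numbers are already numbers, so no int() conversion is needed.
total_units = sum(book["units sold"] for book in books)

for book in books:
    print(book["position"], book["title"], book["units sold"])
print("total:", total_units)
```

Unlike the CSV version, the commas, quotes and backslashes in the titles need no special handling on our side, and the counts arrive as integers.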
    

Nested & complex data


    {"name": "Ted",
    "places_lived": ["Switzerland", "Canada", "Mexico"],
    "pet": null,
    "siblings": [{"name": "Martine", "age": 24, "pets": ["Rolf"]},
                 {"name": "Fred", "age": 31,
                  "pets": ["Boto", "Formica", "Rover"]}]
   }
   

    import json

    json_data = """{"name": "Ted",
        "places_lived": ["Switzerland", "Canada", "Mexico"],
        "pet": null,
        "siblings": [{"name": "Martine", "age": 24, "pets": ["Rolf"]},
                    {"name": "Fred", "age": 31,
                    "pets": ["Boto", "Formica", "Rover"]}]}"""

    ted_data = json.loads(json_data)
    
Our ted_data variable is a Python dictionary:

        {'name': 'Ted',
        'pet': None,
        'places_lived': ['Switzerland', 'Canada', 'Mexico'],
        'siblings': [{'age': 24, 'name': 'Martine', 'pets': ['Rolf']},
                     {'age': 31,
                      'name': 'Fred',
                      'pets': ['Boto', 'Formica', 'Rover']}]}
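
Once loaded, nested data is reached with ordinary dictionary and list indexing. A short sketch; the variable names other than json_data and ted_data are made up for illustration.

```python
import json

json_data = """{"name": "Ted",
    "places_lived": ["Switzerland", "Canada", "Mexico"],
    "pet": null,
    "siblings": [{"name": "Martine", "age": 24, "pets": ["Rolf"]},
                 {"name": "Fred", "age": 31,
                  "pets": ["Boto", "Formica", "Rover"]}]}"""
ted_data = json.loads(json_data)

# A list inside the dictionary.
first_home = ted_data["places_lived"][0]

# JSON null becomes Python None.
has_pet = ted_data["pet"] is not None

# Dictionaries inside a list inside the dictionary.
pets_per_sibling = {s["name"]: len(s["pets"]) for s in ted_data["siblings"]}
all_pet_names = [pet for s in ted_data["siblings"] for pet in s["pets"]]

# json.dumps goes the other way: Python objects back to JSON text.
round_trip = json.dumps(ted_data, indent=2)

print(first_home, has_pet, pets_per_sibling, all_pet_names)
```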
    

Although JSON and CSV are very common, there is an almost endless array of data formats. The one to choose will depend on many factors:

  • Language you are using
  • Whether the format needs to be human-readable
  • How fast the format should be to read and write
  • The structure of the data
  • The size of the data
  • The systems that need to interoperate with the data

Let's look at some real data!

The Social Security Administration maintains data on the most popular names in the US since 1880!

You can download the data!

  • What is the most popular name for a given year?
  • What is the least popular name for a given year?
  • How has a given name changed in popularity over time?

Let's write some code! 💻
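
One possible starting point, as a sketch rather than the definitive solution. It assumes one file per year with lines of the form name,sex,count and no header row; check the download before relying on that. The sample names and counts below are made up, and all function names are illustrative.

```python
import csv
import io

# Hypothetical sample data; names and counts are invented for illustration.
# Assumed layout: one file per year, each line "name,sex,count".
sample_files = {
    1990: "Alice,F,120\nBea,F,45\nCarl,M,98\nDov,M,5\n",
    1991: "Alice,F,110\nBea,F,60\nCarl,M,99\nDov,M,7\n",
}

def load_counts(file_handle):
    """Read one year's file into a list of (name, sex, count) tuples."""
    return [(name, sex, int(count)) for name, sex, count in csv.reader(file_handle)]

records_by_year = {
    year: load_counts(io.StringIO(text)) for year, text in sample_files.items()
}

def most_popular(records):
    return max(records, key=lambda record: record[2])

def least_popular(records):
    return min(records, key=lambda record: record[2])

def count_over_time(records_by_year, name, sex):
    """Map each year to the count for one name, or 0 if it is absent."""
    result = {}
    for year, records in sorted(records_by_year.items()):
        result[year] = sum(c for n, s, c in records if n == name and s == sex)
    return result

print(most_popular(records_by_year[1990]))
print(least_popular(records_by_year[1990]))
print(count_over_time(records_by_year, "Bea", "F"))
```

With the real download, io.StringIO(text) would be replaced by an open file handle for each year's file.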

Why are we seeing a drop in popularity for all the names we looked at?

  • Diversity of names has increased
  • Diversity of spellings has increased
  • Birth rate has decreased
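
One way to tell these explanations apart is to look at a name's share of all recorded births, not just its raw count. A sketch with made-up numbers (not real data); the names name_by_year and share_of_births are hypothetical.

```python
# Hypothetical numbers for illustration only:
# {year: (count_for_one_name, total_recorded_births)}
name_by_year = {
    1950: (50_000, 2_000_000),
    2000: (20_000, 2_000_000),
}

def share_of_births(count, total):
    """Fraction of all recorded births that received this name."""
    return count / total

shares = {
    year: share_of_births(count, total)
    for year, (count, total) in name_by_year.items()
}

for year, share in sorted(shares.items()):
    print(year, f"{share:.2%}")
```

If the raw count falls while the share holds steady, fewer births is the likelier cause. If the share itself falls, as in this invented example, the births are being spread over more names or spellings.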