City College, Fall 2018

Intro to Data Science

Week 2: How to Get Data

September 5, 2018

Today's Agenda
  1. The data science lifecycle.
  2. Structured vs unstructured data.
  3. Common sources of data.
  4. Common ways to access data.
BUT FIRST
Please go to this link and get a Census API key:
api.census.gov/data/key_signup.html
Recap
  • Working with Pandas
  • Things I didn't know
    • Three of the top ten hiring hospitals are in New York.
    • West Virginia is the third state in terms of median income.
    • The median salary for hair is $165,000, higher than data science.
  • Who needed help and where did it come from?
Week 1 Recap.
Where Does Data Come From?
  • Government Agencies
  • Private Firms
  • Individuals
Government Agencies


Private Firms


Individuals
Structured vs. Unstructured
  • Structured Data
    • Data that has a well-defined model and is clearly organized.
    • Structured data has a clear definition of what constitutes an observation, and is typically carefully collected and often well-documented.
    • Common Examples: stock prices, employee records, medical test results.
  • Unstructured Data
    • Data that lacks clear organization or does not follow a set model.
    • Unstructured data often requires significant effort to turn into a useful data set.
    • Common Examples: transcripts and other collections of text, code, or activity logs.
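To make the "significant effort" point concrete, here is a minimal sketch of turning unstructured activity-log text into structured records. The log format, the field names, and the `parse_log` helper are all assumptions made up for illustration; real logs are usually messier.

```python
import re

# Hypothetical activity-log lines; the format is an assumption for illustration.
RAW_LOG = """\
2018-09-05 18:31:02 user=alice action=login
2018-09-05 18:32:47 user=bob action=view_page
this line is noise and does not match
2018-09-05 18:35:10 user=alice action=logout
"""

# One named group per field we want to end up as a column.
LINE_PATTERN = re.compile(
    r"^(?P<date>\d{4}-\d{2}-\d{2}) (?P<time>\d{2}:\d{2}:\d{2}) "
    r"user=(?P<user>\w+) action=(?P<action>\w+)$"
)


def parse_log(text):
    """Return one dict per line that matches the expected pattern."""
    records = []
    for line in text.splitlines():
        match = LINE_PATTERN.match(line.strip())
        if match:  # lines that do not fit the model are skipped
            records.append(match.groupdict())
    return records


records = parse_log(RAW_LOG)
for record in records:
    print(record)
```

Each resulting dict is one observation with the same four fields, which is roughly what "structured" means above. Note that the noise line is silently dropped; in practice you would probably want to count or inspect the lines that fail to match.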
Is it structured or unstructured?
Stock Market Data
Call Center Transcripts
Facebook Likes
Credit Card Statements
Database of New York Times Articles
The Selfies on Your Phone
Click Data
Website HTML Code
Why is this important?
Common Ways to Access Data
  1. Databases
  2. Flat files
  3. APIs
  4. Scraping
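As a rough illustration of the first two access methods, the sketch below reads a small flat file (CSV) and then queries the same rows from a database. It uses only the Python standard library; the table name, column names, and numbers are made up for the example, and the CSV is held in a string so the code runs without any files on disk.

```python
import csv
import io
import sqlite3

# Made-up rows standing in for a flat file on disk (values are not real data).
CSV_TEXT = "borough,n_permits\nBrooklyn,150\nQueens,90\nManhattan,300\n"


def read_flat_file(text):
    """Parse CSV text into a list of dicts, converting the count to an int."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({"borough": row["borough"], "n_permits": int(row["n_permits"])})
    return rows


def load_into_database(rows):
    """Create an in-memory SQLite table and insert the rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE permits (borough TEXT, n_permits INTEGER)")
    conn.executemany(
        "INSERT INTO permits (borough, n_permits) VALUES (:borough, :n_permits)",
        rows,
    )
    conn.commit()
    return conn


rows = read_flat_file(CSV_TEXT)
conn = load_into_database(rows)

# With a database, filtering and sorting happen in the query itself.
result = conn.execute(
    "SELECT borough, n_permits FROM permits WHERE n_permits > ? "
    "ORDER BY n_permits DESC",
    (100,),
).fetchall()
print(result)
```

With a flat file you load everything and filter in Python (or pandas); with a database you ask for only the rows you need.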
Tools to access data
This Week's Data: Part 1
  • U.S. Census Bureau
    • Conducts a full count of the U.S. population every 10 years
    • Estimates and projects U.S. population between counts
    • Conducts economic surveys of manufacturing, retail, service, and other establishments and of domestic governments
  • American Community Survey (ACS)
    • Ongoing survey - conducted continuously
    • Includes ancestry, educational attainment, income, language proficiency, migration, disability, employment, and housing characteristics
    • Sent to approximately 295,000 addresses monthly (or 3.5 million per year) [source]
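Here is a minimal sketch of how the Census API key from the start of class might be used. Several things are assumptions chosen for illustration rather than taken from the slides: the dataset path (`2016/acs/acs5`), the variable `B01003_001E` (total population), and the helper names. The API returns JSON as a list of rows with the header in the first row, so the sketch converts that into one dict per row. The sample rows are placeholders, not real figures.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumed dataset: 2016 ACS 5-year estimates. Swap in the dataset you need.
BASE_URL = "https://api.census.gov/data/2016/acs/acs5"


def build_url(variables, geography, api_key):
    """Build a query URL for the given variables and geography."""
    params = {"get": ",".join(variables), "for": geography, "key": api_key}
    # Keep ',', ':' and '*' readable instead of percent-encoding them.
    return BASE_URL + "?" + urlencode(params, safe=",:*")


def rows_to_records(rows):
    """Convert [header, row, row, ...] into a list of dicts."""
    header, *data = rows
    return [dict(zip(header, row)) for row in data]


def fetch_records(url):
    """Download the JSON response and convert it to records (needs network)."""
    with urlopen(url) as response:
        return rows_to_records(json.load(response))


# Placeholder rows in the response shape, so the example runs offline.
SAMPLE_ROWS = [
    ["NAME", "B01003_001E", "state"],
    ["Example State", "12345", "99"],
]

url = build_url(["NAME", "B01003_001E"], "state:*", "YOUR_KEY")
records = rows_to_records(SAMPLE_ROWS)
print(url)
print(records)
```

To query the live API, replace `YOUR_KEY` with your own key and call `fetch_records(url)`. Values come back as strings, so numeric columns need to be converted before analysis.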
This Week's Data: Part 2
Web Scraping
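As a small taste of what scraping involves, the sketch below pulls links out of a page's HTML using Python's built-in `html.parser`. The HTML snippet and the `LinkCollector` class are made up for illustration; a real scraper would first download the page and would more likely use a dedicated library.

```python
from html.parser import HTMLParser

# Made-up page content standing in for HTML downloaded from a website.
SAMPLE_HTML = """
<html><body>
  <h1>Data sources</h1>
  <a href="https://example.com/census">Census data</a>
  <p>Some text <a href="/permits">Film permits</a> and more text.</p>
  <a>anchor without a link</a>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._current_href = None
        self._text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            # attrs is a list of (name, value) pairs
            self._current_href = dict(attrs).get("href")
            self._text_parts = []

    def handle_data(self, data):
        if self._current_href is not None:
            self._text_parts.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._current_href is not None:
            text = "".join(self._text_parts).strip()
            self.links.append((self._current_href, text))
            self._current_href = None


collector = LinkCollector()
collector.feed(SAMPLE_HTML)
links = collector.links
print(links)
```

The page is unstructured from our point of view: the parser walks the tags and we decide which pieces become fields in our data set.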
Web Scraping Resources
Assignment 2: Due Monday, September 17 by 6:30pm

Part I: DataCamp's Cleaning Data in Python

  • By now, everyone in the class should have received an invitation to join the course group at DataCamp, sent to the address you gave me as your preferred email in Assignment 1. Please accept that invitation and complete all assignments in DataCamp through the account associated with that email. Assignments completed under other accounts will not be accepted. If you have not received an invitation to the course group at DataCamp, please email me as soon as possible.
  • For this part of the assignment, there is nothing to submit formally, as I will have reports on your progress from DataCamp.
  • Note that the exercises in the course should be straightforward, but the course does take 4 hours. Please plan your time accordingly.
Part II: Querying Data through an API
  • The assignment notebook has a series of exercises that walk through accessing Film Permit Data from NYC Open Data. Please complete the exercises as instructed in each individual notebook cell to obtain full credit for the assignment. To submit your assignment, please upload a completed Jupyter notebook (*.ipynb files only) through Blackboard.