City College, Fall 2018
Intro to Data Science
Week 3: Processing and Cleaning Data
September 17, 2018
Today's Agenda
Why Data Cleaning is Important
Elements of the ETL Process
Processing Tools
Handling Missing Data
Goal today: talk about data engineering, familiarize students with concepts in data engineering.
Week 2 Recap
Structured vs. Unstructured Data
Where Data Comes From: DBs, APIs, Flat Files, Web Scraping
Sources of Data: Government, Private Firms, Personal
Packages: Pandas, Requests, Beautiful Soup
Assignment 2 Recap
How was DataCamp?
Jupyter Issues
Interesting Findings in the Film Permit Data
Broadly: gathering input data from source(s)
Practically: capturing and recording events
Broadly: convert raw event data to usable form; enforce data model
Practically: applying rules and functions to extracted data
Rules: split, group, drop
Transformations: map, apply, normalize
Broadly: Push transformed data to datastore
Practically: Publish data in usable form
Database, flat file, api
Dashboard, table, report
This Can Get Complex
Bash Scripts
ETL Process Stakeholders
Considerations for Data Scientists
Handling Missing Values
Drop observations with missing fields.
Impute values:
Central tendancy: median, mean, mode.
Modeled value.
We'll touch on this throughout the rest of the course.
We'll cover this in more detail later in the course.
Handling Outliers
We'll touch on this throughout the rest of the course.
We'll cover this in more detail later in the course.
Normalizing Data
Adjust range.
Adjust scale.
We'll touch on this throughout the rest of the course.
We'll cover this in more detail later in the course.
Drawbacks and Imperfections
Wrap Up
Why Data Cleaning is Important
Elements of the ETL Process
Processing Tools
Handling Missing Values
Goal today: talk about data engineering, familiarize students with concepts in data engineering.
These are all crucial things to consider on the job market!
Project: Components
Apply Python to load, clean, and process data sets.
Identify key elements of and patterns in your data set using computational analysis and statistical methods.
Apply principles of statistical modeling and machine learning to your data.
Explain, visualize, and communicate empirical findings within your analysis.
Demonstrate effective team collaboration
Project: Teams
Four people per group.
Requests to work together will be honored to the extent practical.
Contributions to the team's efforts, as measured by GitHub commits and peer review, will be a significant portion of the project grade.
Project: Data Resources
Be Creative
Project: Key Dates (Tentative)
Project Teams Formed, October 1.
Project Proposals Due via Email, October 15.
First Project Update, November 5.
Second Project Update, November 26.
Projects Due Finals period.
Be Creative
Assignment 3: Due Monday, September 24 by 6:30pm
Part I: DataCamp's Statistical Thinking in Python (Part 1)
The course should appear as assignment within your existing DataCamp account.
Course takes ~3 hours, plan your time accordingly.
Part II: Project Prep
Identify three datasets you're interested in using for your project, including at least two that we have not discussed in class .
For each of the three datasets, identify a question you'd like to use the data to answer.
For one of the data sets we have not discussed in class, provide one interesting summary statistic and one chart, and explain why you find this interesting. If you already have a team in mind, feel free to collaborate, but each team member must provide their own unique answer (chart and stat) to this question.
(Optional) List up to three of your classmates you'd like to work with.
Submit a word or pdf file to Blackboard with your response.