City College, Fall 2019

Intro to Data Science

Week 5: Intro to Linear Models

October 7, 2019

Week 4 Recap
  1. Types of Data
  2. Useful Statistical Distributions
  3. Important Summary Statistics
  4. Key Theorems
Today's Agenda
  1. Modeling Theory
  2. Linear Regression
  3. Assumptions for Linear Models
  4. Measuring Performance for Linear Models
Why Start Simple?

Occam's Razor

"Entities should not be multiplied unnecessarily."

All else being equal, simpler models are preferable to more complex ones.

See also: Wikipedia, and a brief history.

George Box

"All models are wrong, but some are useful."

Data Science Models
What is linear regression?
Why do we call this a linear regression?

A model is linear when each term is either a constant or the product of a parameter and a predictor variable.
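
To make the definition concrete, here is a sketch in standard notation. The symbols (y for the response, x_j for predictors, beta_j for parameters, epsilon for the error term) are assumed for illustration and are not taken from the slides.

```latex
% General form: a constant plus parameter-times-predictor terms.
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon

% Still linear: x^2 is just another predictor, and each term is
% a parameter multiplied by a predictor.
y = \beta_0 + \beta_1 x + \beta_2 x^2 + \varepsilon

% Not linear: the parameter \beta_1 sits inside the exponent.
y = \beta_0 \, e^{\beta_1 x} + \varepsilon
```

"Linear" here refers to the parameters, not to the shape of the fitted curve.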

What are we trying to solve for?

You try.
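
One possible way to work through the question above: with a single predictor, we solve for the intercept and slope that minimize the sum of squared residuals. The sketch below uses the closed-form solution and only the standard library. The function name `fit_simple_ols` and the data are made up for illustration.

```python
# Minimal sketch: closed-form ordinary least squares for one predictor.
# The function name and the data are illustrative assumptions.

def fit_simple_ols(x, y):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    # Slope = (co-variation of x and y) / (variation of x)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    # The fitted line always passes through (x_mean, y_mean)
    intercept = y_mean - slope * x_mean
    return intercept, slope


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]  # invented data, roughly y = 1 + 2x plus noise

intercept, slope = fit_simple_ols(x, y)
print(f"intercept = {intercept:.3f}, slope = {slope:.3f}")
```

In practice a library such as scikit-learn would handle this, including the case of several predictors.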
Assumptions of Linear Regression
  1. Data is linear in form.
  2. Sample is random.
  3. Error terms have constant variance (homoscedasticity).
  4. Error terms have a mean of zero, given the observed predictors.
  5. Predictors are independent (no multicollinearity).
  6. Errors are normally distributed.
Data is linear in form.
Sample is random.
Error terms have constant variance (homoscedasticity).
Errors are uncorrelated.
Predictors are independent (no multicollinearity).

in the case of:

Errors are normally distributed.
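
Some of these assumptions can be checked roughly from the residuals of a fitted model. The sketch below is a minimal illustration using only the standard library, with invented data. The helper names, the split-in-half variance comparison, and the two-predictor VIF shortcut are assumptions chosen for brevity, not a standard diagnostic suite.

```python
# Minimal sketch of three rough assumption checks (illustrative data only).
from statistics import mean, pvariance


def fit_simple_ols(x, y):
    """Closed-form least squares for one predictor: (intercept, slope)."""
    x_mean, y_mean = mean(x), mean(y)
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    return y_mean - slope * x_mean, slope


def pearson_r(a, b):
    """Pearson correlation between two equal-length sequences."""
    a_mean, b_mean = mean(a), mean(b)
    cov = sum((ai - a_mean) * (bi - b_mean) for ai, bi in zip(a, b))
    var_a = sum((ai - a_mean) ** 2 for ai in a)
    var_b = sum((bi - b_mean) ** 2 for bi in b)
    return cov / (var_a * var_b) ** 0.5


x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
y = [2.9, 5.2, 6.8, 9.1, 11.2, 12.7, 15.1, 16.9]

b0, b1 = fit_simple_ols(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Check 1: with an intercept, OLS residuals average to zero by construction.
resid_mean = mean(residuals)

# Check 2: rough constant-variance check. Compare the residual spread for
# the lower half of x against the upper half; a ratio far from 1 hints at
# heteroscedasticity.
pairs = sorted(zip(x, residuals))
half = len(pairs) // 2
low_var = pvariance([r for _, r in pairs[:half]])
high_var = pvariance([r for _, r in pairs[half:]])
spread_ratio = high_var / low_var

# Check 3: multicollinearity. With exactly two predictors, the variance
# inflation factor (VIF) reduces to 1 / (1 - r^2).
x2 = [2.1, 3.9, 6.2, 8.0, 9.8, 12.1, 14.2, 15.9]  # nearly 2 * x
r = pearson_r(x, x2)
vif = 1.0 / (1.0 - r ** 2)

print(f"mean residual = {resid_mean:.2e}")
print(f"variance ratio (upper/lower) = {spread_ratio:.2f}")
print(f"VIF for x vs x2 = {vif:.1f}")
```

A very large VIF, as with the nearly duplicated predictor here, suggests the two predictors carry almost the same information. Normality of errors is usually judged with a histogram or Q-Q plot of the residuals, which this sketch does not cover.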
Key Metrics of Success
  1. R-squared
  2. Adjusted R-squared
  3. Coefficients
  4. P-values
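
As a rough illustration of how these metrics relate to each other, the sketch below computes them by hand for a one-predictor model using only the standard library. The function name `regression_metrics` and the data are invented. The p-value itself is not computed, because it requires a t distribution with n - 2 degrees of freedom, which the standard library lacks; a statistics package would normally report it.

```python
# Minimal sketch: key regression metrics for one predictor (illustrative data).
import math


def regression_metrics(x, y):
    """Return coefficients, R-squared, adjusted R-squared, and slope t-statistic."""
    n = len(x)
    p = 1  # number of predictors (excluding the intercept)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    s_xy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_mean) ** 2 for xi in x)
    slope = s_xy / s_xx
    intercept = y_mean - slope * x_mean

    fitted = [intercept + slope * xi for xi in x]
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained
    ss_tot = sum((yi - y_mean) ** 2 for yi in y)               # total

    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalizes each additional predictor.
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - p - 1)

    # Standard error and t-statistic of the slope; the p-value would be
    # read from a t distribution with n - 2 degrees of freedom.
    se_slope = math.sqrt(ss_res / (n - 2) / s_xx)
    t_slope = slope / se_slope

    return {
        "intercept": intercept,
        "slope": slope,
        "r2": r2,
        "adj_r2": adj_r2,
        "t_slope": t_slope,
    }


x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]  # invented data

metrics = regression_metrics(x, y)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Adjusted R-squared is never larger than R-squared for the same fit, which is why it is the safer number when comparing models with different numbers of predictors.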
Adjusted R-squared
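
For reference, the usual textbook definitions are sketched below. The notation is assumed, not taken from the slides: n is the number of observations, p the number of predictors, and y-hat the fitted values.

```latex
% R-squared: share of the total variation explained by the model.
R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}

% Adjusted R-squared: penalizes each additional predictor.
\bar{R}^2 = 1 - (1 - R^2) \, \frac{n - 1}{n - p - 1}
```

Adding a predictor can never lower R-squared, but it lowers adjusted R-squared unless the predictor improves the fit enough to offset the penalty.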

Does linear regression prove causality?
This Week's Data
Your turn.

Details on the project are now available here.

Email me teammate requests by October 14.

Assignment 5: Due Wednesday, October 16 by 6:30pm

DataCamp's Supervised Learning with scikit-learn

  • The course should appear as an assignment within your existing DataCamp account.
  • The course takes ~4 hours; plan your time accordingly.