City College, Spring 2019

Intro to Data Science

Week 5: Intro to Linear Models

March 11, 2019

Week 4 Recap
  1. Types of Data
  2. Useful Statistical Distributions
  3. Important Summary Statistics
  4. Key Theorems
Course Schedule
  1. Project Proposals: due March 18, details here
  2. Midterm: March 25
    • Drawn primarily from course lectures
    • Mix of multiple choice and short answer
    • Similar to fall's exam, but longer in content and time
  3. Project Update: April 29
  4. Project Deadline & Presentation: Wednesday May 15
Today's Agenda
  1. Modeling Theory
  2. Linear Regression
  3. Assumptions for Linear Models
  4. Measuring Performance for Linear Models
Why Start Simple?

Occam's Razor

"Entities should not be multiplied unnecessarily."

All else being equal, simpler models are preferable to more complex ones.

See also: Wikipedia, and a brief history.

George Box

All models are wrong...

but some are useful.

Data Science Models
What is linear regression?
Why do we call this a linear regression?

A model is linear when each term is either a constant or the product of a parameter and a predictor variable.
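
A minimal sketch of that definition, assuming NumPy and made-up toy data (the variable names and the "true" values 1.0, 2.0, and 0.5 are illustrative only, not from the course materials). The model below includes a squared predictor, yet it is still linear because every term is a constant or a parameter times a predictor. Least squares then estimates the parameters.

```python
import numpy as np

# Hypothetical toy data: y depends on x and x**2, plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200)
y = 1.0 + 2.0 * x + 0.5 * x**2 + rng.normal(scale=0.1, size=x.size)

# Design matrix: a column of ones (the constant term), then each predictor.
# x**2 is treated as just another predictor column, so the model
# y = b0 + b1*x + b2*x**2 remains linear in the parameters b0, b1, b2.
X = np.column_stack([np.ones_like(x), x, x**2])

# Least squares picks the parameters that minimize the sum of squared errors.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("estimated parameters:", beta)
```

By contrast, a model such as y = b0 * exp(b1 * x) is not linear in this sense, because b1 appears inside the exponential rather than multiplying a predictor.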

What are we trying to solve for?

You try.
Assumptions of Linear Regression
  1. Data is linear in form.
  2. Sample is random.
  3. Error terms have constant variance (homoscedasticity).
  4. Error terms have a mean of zero based on the observed data.
  5. Predictors are independent (no multicollinearity).
  6. Errors are normally distributed.
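
Some of these assumptions can be checked roughly from the residuals of a fitted model. The sketch below assumes NumPy and simulated data (all names and values are hypothetical); the checks are crude illustrations, not formal statistical tests.

```python
import numpy as np

# Simulated data that satisfies the assumptions by construction.
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)
y = 3.0 + 1.5 * x1 - 2.0 * x2 + rng.normal(scale=1.0, size=n)

# Fit by ordinary least squares with an intercept column.
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
fitted = X @ beta
residuals = y - fitted

# Mean of zero: with an intercept, OLS residuals average to zero
# on the observed data (up to floating-point error).
resid_mean = residuals.mean()

# Constant variance: compare residual variance for the lower and upper
# halves of the fitted values. A ratio far from 1 hints at heteroscedasticity.
order = np.argsort(fitted)
low, high = residuals[order[: n // 2]], residuals[order[n // 2:]]
variance_ratio = high.var() / low.var()

# Multicollinearity: a correlation near +1 or -1 between predictors is a warning sign.
predictor_corr = np.corrcoef(x1, x2)[0, 1]

print(f"residual mean:         {resid_mean:.2e}")
print(f"variance ratio:        {variance_ratio:.2f}")
print(f"predictor correlation: {predictor_corr:.2f}")
```

In practice, plotting residuals against fitted values is usually more informative than any single number.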
Data is linear in form.
Sample is random.
Error terms have constant variance (homoscedasticity).
Errors are uncorrelated.
Predictors are independent (no multicollinearity).

in the case of:

Errors are normally distributed.
Key Metrics of Success
  1. R-squared
  2. Adjusted R-squared
  3. Coefficients
  4. P-values
Adjusted R-squared
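
A short sketch of how R-squared and adjusted R-squared are commonly computed, assuming NumPy. The function names and the tiny data set are hypothetical. Here n is the number of observations and p is the number of predictors, not counting the intercept.

```python
import numpy as np

def r_squared(y, y_hat):
    """Share of the variance in y explained by the predictions y_hat."""
    ss_res = np.sum((y - y_hat) ** 2)       # residual sum of squares
    ss_tot = np.sum((y - np.mean(y)) ** 2)  # total sum of squares
    return 1.0 - ss_res / ss_tot

def adjusted_r_squared(r2, n, p):
    """R-squared penalized for the number of predictors p, given n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

# Made-up observed values and predictions from a one-predictor model.
y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y_hat = np.array([1.1, 1.9, 3.2, 3.8, 5.0])

r2 = r_squared(y, y_hat)
adj = adjusted_r_squared(r2, n=len(y), p=1)
print(f"R-squared: {r2:.4f}, adjusted R-squared: {adj:.4f}")
```

Plain R-squared never decreases when a predictor is added, while adjusted R-squared rises only if the new predictor improves the fit by more than would be expected by chance. That makes it the fairer metric for comparing models with different numbers of predictors.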

Does linear regression prove causality?
This Week's Data
Your turn.

Project Proposal: Due Monday, March 18 by 6:30 pm

Details on the project are now available here.