Lecture 5

City College, Spring 2019

Intro to Data Science

Week 5: Intro to Linear Models

March 11, 2019

Week 4 Recap

Types of Data
Useful Statistical Distributions
Important Summary Statistics
Key Theorems

Course Schedule

Project Proposals: due March 18, details here
Midterm: March 25

Drawn primarily from course lectures
Mix of multiple choice and short answer
Similar to fall's exam, but longer in content and time

Project Update: April 29
Project Deadline & Presentation: Wednesday May 15

Today's Agenda

Modeling Theory
Linear Regression
Assumptions for Linear Models
Measuring Performance for Linear Models

Why Start Simple?

Occam's Razor

"Entities should not be multiplied unnecessarily."

All else being equal, simpler models are preferable to more complex ones.

See also: Wikipedia, and a brief history.

George Box

"All models are wrong, but some are useful."

Data Science Models

What is linear regression?

Why do we call this a linear regression?

A model is linear when each term is either a constant or the product of a parameter and a predictor variable.
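
As a small illustration of this definition, the sketch below evaluates a model whose terms are a constant plus parameter-times-predictor products. The coefficient values and the function name `predict` are hypothetical, chosen only for the example. Note that a predictor may itself be a curved function of the raw input (here x squared) and the model is still linear, because it is linear in the parameters.

```python
def predict(params, features):
    """Linear model: a constant plus a sum of parameter * predictor terms."""
    intercept, *slopes = params
    return intercept + sum(b * x for b, x in zip(slopes, features))

# Hypothetical coefficients, for illustration only: b0, b1, b2.
params = [1.0, 2.0, 0.5]

# The predictors are x and x**2. The curve is quadratic in x,
# but every term is still "parameter times predictor".
x = 3.0
features = [x, x ** 2]
y_hat = predict(params, features)  # 1 + 2*3 + 0.5*9

# By contrast, y = b0 * exp(b1 * x) is NOT linear in its parameters,
# because b1 sits inside the exponential rather than multiplying a predictor.
print(y_hat)
```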

What are we trying to solve for?
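
Assuming the question refers to ordinary least squares with a single predictor, the unknowns are the intercept and the slope that minimize the sum of squared residuals. The sketch below uses the standard closed-form solution; the function name `fit_simple_ols` and the toy data are assumptions made for illustration.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Slope = covariance of x and y divided by the variance of x.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    # The fitted line always passes through (x_bar, y_bar).
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Toy data lying exactly on y = 1 + 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [3.0, 5.0, 7.0, 9.0, 11.0]
intercept, slope = fit_simple_ols(xs, ys)
print(intercept, slope)
```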

You try.

Demo

Assumptions of Linear Regression

Data is linear in form.
Sample is random.
Error terms have constant variance (homoscedasticity).
Error terms have a mean of zero based on the observed data.
Predictors are independent (no multicollinearity).
Errors are normally distributed.

Data is linear in form.

Sample is random.

Error terms have constant variance (homoscedasticity).
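
One crude way to look for non-constant variance is to sort the residuals by the predictor and compare their spread in the lower and upper halves. This is only an informal sketch, not a formal test; the helper names and the residual values are made up for illustration.

```python
def mean_square(values):
    """Average squared value; approximates variance when the mean is near zero."""
    return sum(v * v for v in values) / len(values)

def variance_ratio(residuals_sorted_by_x):
    """Spread of residuals at high x divided by spread at low x."""
    half = len(residuals_sorted_by_x) // 2
    low = residuals_sorted_by_x[:half]
    high = residuals_sorted_by_x[-half:]
    return mean_square(high) / mean_square(low)

# Hypothetical residuals, already ordered by the predictor value.
steady = [0.5, -0.4, 0.6, -0.5, 0.4, -0.6, 0.5, -0.5]    # similar spread throughout
fanning = [0.1, -0.1, 0.1, -0.1, 1.0, -1.0, 1.0, -1.0]   # spread grows with x

steady_ratio = variance_ratio(steady)
fanning_ratio = variance_ratio(fanning)
print(steady_ratio, fanning_ratio)
```

A ratio near 1 is consistent with constant variance; a ratio far from 1 suggests the spread changes with the predictor and is worth inspecting in a residual plot.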

Errors are uncorrelated.
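
A common summary of first-order correlation between consecutive residuals is the Durbin-Watson statistic, which ranges from 0 to 4, with values near 2 suggesting little autocorrelation. The sketch below computes it directly from its definition; the residual sequences are invented for illustration.

```python
def durbin_watson(residuals):
    """Sum of squared successive differences divided by sum of squared residuals."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator

# Hypothetical residuals in observation order.
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips every step: negative autocorrelation
clustered = [1.0, 1.0, -1.0, -1.0]    # neighbors look alike: positive autocorrelation

dw_alternating = durbin_watson(alternating)
dw_clustered = durbin_watson(clustered)
print(dw_alternating, dw_clustered)
```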

Predictors are independent (no multicollinearity).

Errors are normally distributed.
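
A quick numeric complement to a histogram or Q-Q plot is to compute the sample skewness and excess kurtosis of the residuals; both are 0 for an exact normal distribution. The sketch below is an informal check under that assumption, not a formal normality test, and the residual values are made up.

```python
def central_moment(values, order):
    """Average of (value - mean) raised to the given power."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** order for v in values) / len(values)

def skewness(values):
    """Third standardized moment: 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Fourth standardized moment minus 3: 0 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3

# Hypothetical residuals: symmetric, but flatter than a normal distribution.
residuals = [-2.0, -1.0, 0.0, 1.0, 2.0]
skew = skewness(residuals)
kurt = excess_kurtosis(residuals)
print(skew, kurt)
```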

Output

Key Metrics of Success

R-squared
Adjusted R-squared
Coefficients
P-values

R-Squared
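
R-squared is usually defined as 1 minus the ratio of the residual sum of squares to the total sum of squares, that is, the share of the variation in the response that the model accounts for. A minimal sketch of that definition follows; the observed and predicted values are toy numbers chosen for illustration.

```python
def r_squared(ys, y_hats):
    """1 - SS_res / SS_tot for observed values ys and predictions y_hats."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Toy observed values and hypothetical model predictions.
ys = [1.0, 2.0, 3.0, 4.0]
y_hats = [1.5, 1.5, 3.5, 3.5]
r2 = r_squared(ys, y_hats)  # SS_res = 1.0, SS_tot = 5.0
print(r2)
```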

Adjusted R-squared
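
Adjusted R-squared penalizes R-squared for the number of predictors, so adding a predictor that explains nothing lowers it. The sketch below uses the standard formula 1 - (1 - R²)(n - 1) / (n - p - 1), where n is the number of observations and p the number of predictors; the input values are hypothetical.

```python
def adjusted_r_squared(r2, n_observations, n_predictors):
    """Adjust R-squared for model size; requires n_observations > n_predictors + 1."""
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)

# Hypothetical fit: R-squared of 0.8 on 11 observations.
adj_two_predictors = adjusted_r_squared(0.8, 11, 2)   # 1 - 0.2 * 10 / 8
adj_five_predictors = adjusted_r_squared(0.8, 11, 5)  # 1 - 0.2 * 10 / 5
print(adj_two_predictors, adj_five_predictors)
```

With the same R-squared, the model that uses more predictors receives the lower adjusted value.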

Coefficients

P-values
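
Regression output typically reports p-values from a t-test on each coefficient. To show the underlying idea using only the standard library, the sketch below swaps in a different technique, a permutation test: it shuffles the response many times and counts how often a shuffled slope is at least as extreme as the observed one. The data, function names, and number of permutations are assumptions for illustration.

```python
import random

def slope(xs, ys):
    """Least-squares slope for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    return sxy / sxx

def permutation_p_value(xs, ys, n_permutations=2000, seed=0):
    """Two-sided permutation p-value for the null hypothesis of zero slope."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between x and y
        if abs(slope(xs, shuffled)) >= observed:
            extreme += 1
    # Add 1 to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_permutations + 1)

# Toy data with a strong linear trend (roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 10.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.0, 13.8, 16.1, 18.2, 19.9]
p_value = permutation_p_value(xs, ys)
print(p_value)
```

A small p-value says the observed slope would be unusual if the predictor and response were unrelated; it does not measure the size or practical importance of the effect.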

xkcd

Does linear regression prove causality?

This Week's Data

Your turn.

Project Proposal: Due Monday, March 18 by 6:30pm

Details on the project are now available here.