Lecture 5

City College, Fall 2019

Intro to Data Science

Week 5: Intro to Linear Models

October 7, 2019

Week 4 Recap

Types of Data
Useful Statistical Distributions
Important Summary Statistics
Key Theorems

Today's Agenda

Modeling Theory
Linear Regression
Assumptions for Linear Models
Measuring Performance for Linear Models

Why Start Simple?

Occam's Razor

"Entities should not be multiplied unnecessarily."

All else being equal, a simpler model is preferable to a more complex one.

See also: Wikipedia, and a brief history.

George Box

All models are wrong...

but some are useful.

Data Science Models

What is linear regression?

Why do we call this a linear regression?

A model is linear when each term is either a constant or the product of a parameter and a predictor variable.
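
A quick way to see what "linear" means here: the model must be linear in its parameters, not necessarily in the raw predictor. The sketch below is illustrative only; the function name `predict` and the choice of a squared term are assumptions, not something taken from the slides. It checks that adding two parameter vectors adds their predictions, which holds only for models that are linear in the parameters.

```python
def predict(params, x):
    """Curved in x, but still linear in the parameters b0, b1, b2.

    Each term is a constant (b0) or a parameter times a predictor
    (b1 * x, and b2 * x**2 where x**2 is treated as a second predictor).
    """
    b0, b1, b2 = params
    return b0 + b1 * x + b2 * x ** 2


# Hypothetical parameter vectors, chosen only for illustration.
p = (1, 2, 3)
q = (4, 5, 6)
p_plus_q = tuple(a + b for a, b in zip(p, q))

x = 2
combined = predict(p_plus_q, x)
separate = predict(p, x) + predict(q, x)

# Linearity in the parameters: predicting with p + q equals the sum of predictions.
is_linear_in_params = combined == separate
print(is_linear_in_params)
```

By contrast, a term such as `x ** b1` would fail this check, because the parameter no longer simply multiplies a predictor.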

What are we trying to solve for?
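
One common answer, assuming a single predictor and ordinary least squares (the slides do not spell out the method): we solve for the intercept and slope that minimize the sum of squared residuals. The following is a minimal sketch with made-up data; `fit_simple_ols` is a hypothetical helper name.

```python
def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared residuals."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Closed-form least-squares solution for one predictor.
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Made-up data that lies exactly on the line y = 1 + 2x.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

intercept, slope = fit_simple_ols(xs, ys)
print(intercept, slope)
```

With several predictors the same idea applies, but the solution is usually computed with matrix algebra or a library such as scikit-learn rather than by hand.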

You try.

Demo

Assumptions of Linear Regression

Data is linear in form.
Sample is random.
Error terms have constant variance (homoscedasticity).
Error terms have a mean of zero based on the observed data.
Predictors are independent (no multicollinearity).
Errors are normally distributed.
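
Several of the assumptions above concern the error terms, so a natural first check is to look at the residuals of a fitted line. The sketch below uses made-up data and hypothetical helper names (`fit_line`, `variance`); it computes the residual mean and a crude comparison of residual spread between low and high values of the predictor. It is a rough illustration, not a formal test.

```python
def fit_line(xs, ys):
    """Least-squares intercept and slope for a single predictor."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def variance(values):
    """Population variance of a list of numbers."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Made-up observations, already sorted by x.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2]

intercept, slope = fit_line(xs, ys)
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

# With an intercept in the model, least-squares residuals average to zero.
mean_residual = sum(residuals) / len(residuals)

# Crude homoscedasticity check: compare spread for low x versus high x.
half = len(residuals) // 2
low_x_variance = variance(residuals[:half])
high_x_variance = variance(residuals[half:])

print(mean_residual, low_x_variance, high_x_variance)
```

In practice a residual plot (residuals against fitted values) is the more usual way to judge constant variance by eye.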

Data is linear in form.

Sample is random.

Error terms have constant variance (homoscedasticity).

Errors are uncorrelated.

Predictors are independent (no multicollinearity).

in the case of:

Errors are normally distributed.

Output

Key Metrics of Success

R-squared
Adjusted R-squared
Coefficients
P-values

R-squared

Adjusted R-squared
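
As a sketch of how these two metrics are computed, assuming the standard definitions: R-squared is 1 minus the ratio of residual to total sum of squares, and adjusted R-squared penalizes it for the number of predictors. The function names and the data below are made up for illustration.

```python
def r_squared(actual, predicted):
    """Share of the variance in `actual` explained by `predicted`."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def adjusted_r_squared(r2, n_obs, n_predictors):
    """Penalize R-squared for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_obs - 1) / (n_obs - n_predictors - 1)


# Made-up observations and model predictions.
actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]

r2 = r_squared(actual, predicted)
adj_r2 = adjusted_r_squared(r2, n_obs=len(actual), n_predictors=1)
print(round(r2, 4), round(adj_r2, 4))
```

Adjusted R-squared never exceeds R-squared, and the gap widens as predictors are added without a matching gain in fit.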

Coefficients

P-values

xkcd

Does linear regression prove causality?

This Week's Data

Your turn.

Details on the project are now available here.

Email me teammate requests by October 14.

Assignment 5: Due Wednesday, October 16 by 6:30pm

DataCamp's Supervised Learning with scikit-learn

The course should appear as an assignment within your existing DataCamp account.
The course takes ~4 hours; plan your time accordingly.