City College, Spring 2019

Intro to Data Science

Week 4: Statistics and the Stories We Tell Ourselves

February 25, 2019

Today's Agenda
  1. Types of Data
  2. Useful Statistical Distributions
  3. Important Summary Statistics
  4. Independence
  5. Key Theorems
Week 3 Recap
  • Elements of the ETL Process
  • Processing Tools: Luigi, Airflow
  • Handling Missing Data: Drop, Impute
HW Recap
  1. How was DataCamp?
  2. How do we feel about projects?
sta·tis·tics

noun

The practice or science of collecting and analysing numerical data in large quantities, especially for the purpose of inferring proportions in a whole from those in a representative sample.

Source

xkcd
Types of Data
Boolean
Categorical
Continuous
Probability Distributions

A mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment.

Source
A Few Important Distributions
Binomial

Describes the probability of exactly k successes in n independent trials, each with success probability p, where:

$$P(X = k) = \binom{n}{k} p^k (1 - p)^{n - k}$$

Wikipedia
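
A minimal sketch of the binomial probability, assuming the standard formula P(X = k) = C(n, k) p^k (1 - p)^(n - k). The function name `binomial_pmf` and the coin-flip numbers are illustrative, not from the slides.

```python
from math import comb


def binomial_pmf(k: int, n: int, p: float) -> float:
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


# Probability of exactly 5 heads in 10 fair coin flips
prob_five_heads = binomial_pmf(5, 10, 0.5)

# The probabilities over every possible k (0 through n) sum to 1
total_probability = sum(binomial_pmf(k, 10, 0.5) for k in range(11))
```

Summing over all k is a quick sanity check that the function describes a valid probability distribution.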
Normal

Wikipedia
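
A minimal sketch of the normal density, assuming the usual parameterization by mean `mu` and standard deviation `sigma`. The function name `normal_pdf` is illustrative, not from the slides.

```python
import math


def normal_pdf(x: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """Density of a normal distribution with mean mu and std dev sigma."""
    coefficient = 1 / (sigma * math.sqrt(2 * math.pi))
    exponent = -((x - mu) ** 2) / (2 * sigma**2)
    return coefficient * math.exp(exponent)


# The density peaks at the mean and is symmetric around it
peak = normal_pdf(0.0)
left = normal_pdf(-1.0)
right = normal_pdf(1.0)
```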
Uniform

Wikipedia
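
A minimal sketch of the continuous uniform density on an interval [a, b], assuming every value in the interval is equally likely. The function name `uniform_pdf` is illustrative, not from the slides.

```python
def uniform_pdf(x: float, a: float = 0.0, b: float = 1.0) -> float:
    """Density of a continuous uniform distribution on [a, b]."""
    if a <= x <= b:
        return 1 / (b - a)
    return 0.0


# Constant density inside the interval, zero outside it
inside = uniform_pdf(3.0, a=2.0, b=6.0)
outside = uniform_pdf(7.0, a=2.0, b=6.0)
```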
How to Describe Distributions
Central Tendency

[1, 1, 1, 1, 6, 2, 4, 2, 9]

Central Tendency

Mean

Central Tendency

Median

Central Tendency

Mode
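
The three measures of central tendency can be computed for the example list with Python's standard `statistics` module. This is a sketch; the variable names are illustrative.

```python
import statistics

data = [1, 1, 1, 1, 6, 2, 4, 2, 9]

mean_value = statistics.mean(data)      # sum of values / number of values
median_value = statistics.median(data)  # middle value of the sorted data
mode_value = statistics.mode(data)      # most frequent value

print(mean_value, median_value, mode_value)  # prints: 3 2 1
```

The three values differ here because a few large values (6 and 9) pull the mean above the median and the mode.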

Variation

Range

Variation

Min, Max

Variation

Variance, Standard Deviation

Variation

Percentiles
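
A sketch of the variation measures on the same example list, using the standard `statistics` module. It assumes the sample (n - 1) versions of variance and standard deviation; the population versions are `statistics.pvariance` and `statistics.pstdev`.

```python
import statistics

data = [1, 1, 1, 1, 6, 2, 4, 2, 9]

minimum = min(data)
maximum = max(data)
value_range = maximum - minimum

# Sample variance divides the squared deviations by n - 1
sample_variance = statistics.variance(data)
# Standard deviation is the square root of the variance
sample_std_dev = statistics.stdev(data)

# Cut points splitting the data into four equal-probability groups;
# the middle cut point is the median (50th percentile)
quartiles = statistics.quantiles(data, n=4)
```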

Dependence

How do we describe the relationship between two variables?


formal definition
Dependence

Covariance


formal definition
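
A minimal sketch of sample covariance, assuming paired observations of equal length and the n - 1 denominator. The function name and the example data are illustrative, not from the slides.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Average product of paired deviations from each mean
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


hours_studied = [1, 2, 3, 4, 5]
quiz_scores = [2, 4, 6, 8, 10]

cov = covariance(hours_studied, quiz_scores)  # positive: they rise together
```

The sign shows the direction of the relationship, but the magnitude depends on the units of both variables.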
Dependence

Correlation


formal definition
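
A minimal sketch of the Pearson correlation coefficient, assuming it is computed as the sample covariance divided by the product of the two sample standard deviations. The function names and the example data are illustrative, not from the slides.

```python
import statistics


def covariance(xs, ys):
    """Sample covariance of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Pearson correlation: covariance rescaled to lie between -1 and 1."""
    return covariance(xs, ys) / (statistics.stdev(xs) * statistics.stdev(ys))


xs = [1, 2, 3, 4, 5]
r_increasing = correlation(xs, [2, 4, 6, 8, 10])  # perfectly linear, rising
r_decreasing = correlation(xs, [10, 8, 6, 4, 2])  # perfectly linear, falling
```

Dividing by the standard deviations removes the units, so correlations can be compared across different pairs of variables.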
Key Theorems

Law of Large Numbers


The average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.


Demo
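
A small simulation sketch of the Law of Large Numbers, assuming a fair six-sided die (expected value 3.5). The seed, roll counts, and function name are illustrative choices, not from the slides.

```python
import random


def average_die_roll(num_rolls: int, rng: random.Random) -> float:
    """Average of num_rolls simulated rolls of a fair six-sided die."""
    return sum(rng.randint(1, 6) for _ in range(num_rolls)) / num_rolls


rng = random.Random(0)  # fixed seed so the run is repeatable

# Averages for increasingly large numbers of rolls
averages = {n: average_die_roll(n, rng) for n in (10, 1_000, 100_000)}

for num_rolls, average in averages.items():
    print(f"{num_rolls:>7} rolls: average = {average:.3f}")
```

With more rolls, the printed averages should settle near 3.5.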
Key Theorems

Central Limit Theorem


When independent random variables are added, their properly normalized sum tends toward a normal distribution (informally a "bell curve") even if the original variables themselves are not normally distributed.


Demo
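
A small simulation sketch of the Central Limit Theorem, assuming draws from a uniform distribution on [0, 1) (mean 0.5, variance 1/12). The seed, sample size, and number of samples are illustrative choices, not from the slides.

```python
import math
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
sample_size = 30
num_samples = 10_000

population_mean = 0.5
population_std = math.sqrt(1 / 12)
# Standard deviation of the mean of sample_size independent draws
standard_error = population_std / math.sqrt(sample_size)

z_scores = []
for _ in range(num_samples):
    sample_mean = sum(rng.random() for _ in range(sample_size)) / sample_size
    # Normalize: subtract the expected mean, divide by the standard error
    z_scores.append((sample_mean - population_mean) / standard_error)

# For a standard normal distribution, about 68% of values fall within 1 of 0
share_within_one_sd = sum(abs(z) <= 1 for z in z_scores) / num_samples
print(f"Share of normalized means within 1 standard deviation: {share_within_one_sd:.3f}")
```

The individual draws are flat, not bell-shaped, yet the normalized sample means should behave approximately like a standard normal variable.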
Let's Code!

Wrap Up
  1. Types of Data
  2. Useful Statistical Distributions
  3. Important Summary Statistics
  4. Independence
  5. Key Theorems

Reference: Data Science from Scratch

About the Project
Project: Components
  1. Apply Python to load, clean, and process data sets.
  2. Identify key elements of and patterns in your data set using computational analysis and statistical methods.
  3. Apply principles of statistical modeling and machine learning to your data.
  4. Explain, visualize, and communicate empirical findings within your analysis.
  5. Demonstrate effective team collaboration.
Project: Data Resources

Be Creative

Project: Key Dates (Tentative)
  • Project Teams Formed, February 25.
  • Project Proposals Due via Email, March 18.
  • Project Update, April 29.
  • Projects Due, May 13.

Be Creative

Project Teams
Team Member 1 Member 2 Member 3 Member 4
Team 1 Angelique Nabil Baivab Shravan
Team 2 Brandon Haojin Mohamed Andy
Team 3 Daniel Carlton Michael Tan
Team 4 Chantelle Khristian Ahalya
Team 5 Chieh Melvin Michael Li
Team 6 Andrew Arjun Farhan Tadjoudine
Team 7 Ana Bon Kirstyn
Team 8 Weicheng Joy Ka Shing
Team 9 Gong Alan Rongjun
Assignment 4: Due Monday, March 4 by 6:30pm

DataCamp's Statistical Thinking in Python (Part 2)

  • The course should appear as an assignment within your existing DataCamp account.
  • The course takes 4+ hours, so plan your time accordingly.