City College, Fall 2019

Intro to Data Science

Week 13: Concepts and Solutions in Big Data

December 2, 2019

Today's Agenda
  1. End of Semester Reminders
  2. What is Big Data?
  3. Frameworks for Handling Big Data
  4. Tools for Processing Big Data
  5. Case Studies
Reminder: DataCamp Courses Still Available

If you haven't completed an assignment:

  • Assignments not turned in by the set deadline are eligible to be completed for half credit by the final class on December 9!

Reminder: Final Project Submission Due December 7 by 11:59pm

Due this Saturday:

  1. An email to the professor containing:
    1. The team name, the names of all members of the team, and a link to the team’s repository on GitHub.
    2. An attached CSV with predictions against test3.csv. The CSV must consist of exactly two columns, a header row with the titles rental_id and predictions, and the 2,000 rental IDs and corresponding rent predictions. Suggested methodology for creating the submission file can be found here.

  2. Posted in the team's GitHub repo:
    1. A markdown file entitled project_findings.md containing answers and supporting evidence for all of the points in the Questions and Tasks section of the project description. Please be sure to address each section!
    2. A Jupyter notebook allowing for complete replication of the modeling process.

Do not use others' work without citation!

Peer Reviews

Each student should send me an email listing each team member's primary responsibilities and overall share of contribution to the project. It should look something like this:

Dear Grant, my review on my team contributions is as follows:

  1. Scarlet Witch: moved most heavy objects, consistent hand to hand combat. 35%
  2. Captain Marvel: absent through most of the semester, but crucial last minute contributions. 30%
  3. Thor: took time to motivate, but eventually did most summoning, hammering. 25%
  4. Peter Quill: dead most of the time, only a handful of last minute one-liners. 10%

Don't just dance around the questions.

What makes big data "big"?

The Age of Big Data

Over the past twenty years, data has become dramatically cheaper to:

  1. Collect
  2. Store
  3. Process
What makes big data "big"?


A deeper discussion.

Variety of Data
  • Covered in class:
    • Continuous / categorical / binary
    • Text as data
    • Survey data

  • Not covered in class:
    • Photos, video
    • Geospatial data - mobile trace, fitness trackers, etc.
    • Website tracking data
Velocity of Data
  • Concepts:
    • 24/7 news cycles, accessibility, service
    • Mobile first technology
    • On-demand economy

  • Requirements:
    • Real time alerts
    • High volume, low latency processing
Volume of Data
  • Concepts:
    • Terabytes, petabytes, and exabytes
    • On-premise vs. off-premise storage
    • Distributed computing
  • Examples:
    • Click data
    • Search histories
    • Large collections of text, photos, or videos
My Take

Big data is data that can't be easily stored and processed on a single machine.

  • With roughly 20,000 sales listings a month, most of my work at StreetEasy is not big data.
  • An exception: dozens or hundreds of individual users will look at a single listing, making our activity data much trickier to deal with.

Processing Big Data

Key Concepts
  • Distributed Storage and Computing
  • Streaming and Batch Processing
  • Queues (see the sketch below)
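
To make the queue idea concrete: a queue decouples the code that produces events from the code that consumes them, which is the basic pattern behind high-volume, low-latency stream processing. Below is a minimal single-process sketch using only Python's standard library; a real pipeline would use a dedicated system such as Kafka or AWS Kinesis rather than an in-memory queue, and the event fields here are made up for illustration.

import queue
import threading

events = queue.Queue()          # in-memory stand-in for a message queue

def producer():
    # Simulate a stream of incoming events (clicks, searches, etc.).
    for i in range(5):
        events.put({"event_id": i, "action": "page_view"})
    events.put(None)            # sentinel: no more events

def consumer():
    # Pull events off the queue one at a time and process them.
    while True:
        event = events.get()
        if event is None:
            break
        print("processed", event)

threading.Thread(target=producer).start()
consumer()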
Distributed Storage and Computing

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage.


Rather than rely on hardware to deliver high-availability, the library itself is designed to detect and handle failures at the application layer, so delivering a highly-available service on top of a cluster of computers, each of which may be prone to failures.


Source
Storage: HDFS

HDFS = Hadoop Distributed File System


Source
Storage: AWS S3

S3 = Simple Storage Service

How do we read large data sets stored remotely?

One row at a time!

Python Generators

Generators are functions that can be paused and resumed on the fly, returning an object that can be iterated over. Unlike lists, they are lazy: they produce items one at a time and only when asked, which makes them much more memory efficient when dealing with large datasets.


def fibonacci():
    # Generate the Fibonacci sequence lazily, one value at a time.
    x, y = 0, 1
    while True:
        yield x          # pause here and hand the current value back to the caller
        x, y = y, x + y  # resume here on the next call to next()
Source
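
Applying the same idea to the question above, a minimal sketch of a generator that streams rows from a large CSV one at a time instead of loading the whole file into memory (the file name and process() call are placeholders, not part of any real pipeline):

import csv

def stream_rows(path):
    # Lazily yield one parsed row at a time; the file is never fully in memory.
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            yield row

# Rows are read only as the loop asks for them, so memory use stays flat.
for row in stream_rows("listings.csv"):   # hypothetical file name
    process(row)                           # placeholder for your own logic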

MapReduce

  • MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
    • Map stage: each worker node applies the map function to the local data, and writes the output to a temporary storage. A master node ensures that only one copy of redundant input data is processed.
    • Shuffle: worker nodes redistribute data based on the output keys (produced by the map function), such that all data belonging to one key is located on the same worker node.
    • Reduce: worker nodes now process each group of output data, per key, in parallel.


Example: Counting bi-grams in documents
  • Map stage: within each document, extract and count bi-grams, emitting key-value pairs, e.g. ("hot dog", 3), ("potato salad", 1), ("fried chicken", 3).
  • Shuffle: sort the map output by key and redistribute it so that all pairs for a given bi-gram land on the same worker.
  • Reduce: sum the counts per key so that each bi-gram ends up with a single count across all documents (see the sketch below).
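
A minimal single-machine sketch of those three stages in plain Python, with no Hadoop involved; the documents and function names are made up for illustration:

from collections import defaultdict

def map_stage(text):
    # Emit (bi-gram, count) pairs for one document.
    words = text.lower().split()
    counts = defaultdict(int)
    for pair in zip(words, words[1:]):
        counts[" ".join(pair)] += 1
    return list(counts.items())

def shuffle(mapped_outputs):
    # Group all counts for the same bi-gram together, as the framework would.
    groups = defaultdict(list)
    for output in mapped_outputs:
        for key, value in output:
            groups[key].append(value)
    return groups

def reduce_stage(key, values):
    # Collapse each bi-gram's counts into a single total.
    return key, sum(values)

docs = ["hot dog hot dog", "potato salad and a hot dog"]
mapped = [map_stage(text) for text in docs]
totals = dict(reduce_stage(k, v) for k, v in shuffle(mapped).items())
print(totals)   # e.g. {'hot dog': 3, 'dog hot': 1, 'potato salad': 1, ...}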

Application: Google n-grams

Lots of tools can save us time

Database Tools
Processing Tools: ML

Spark use case at Facebook
Processing Tools: Handling Streams
Processing Tools: Ecosystems
Getting this to work requires teamwork!
  • Data Scientists
  • Data Engineers
  • Dev Ops
  • Product Managers
  • Machine Learning Engineers, Applied Scientists, etc.
What can big data analytics enable?
  • Better analysis and predictions of rare events
  • More personalized services for individuals
  • Broader feature sets for more traditional problems

Example: NYC Taxi Data

  • ~8 million trips per month
  • Flat files stored in AWS S3 (see the sketch below)
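
A minimal sketch of processing flat files like these in manageable pieces with pandas' chunked CSV reading. The S3 path is hypothetical, reading s3:// URLs this way assumes the s3fs package is installed, and the fare_amount column name follows the published yellow taxi schema:

import pandas as pd

# Hypothetical path; the real trip records are published as monthly flat files.
path = "s3://example-bucket/yellow_tripdata_2019-06.csv"

total_trips = 0
total_fares = 0.0
# Read 100,000 rows at a time so ~8 million rows never sit in memory at once.
for chunk in pd.read_csv(path, chunksize=100_000):
    total_trips += len(chunk)
    total_fares += chunk["fare_amount"].sum()

print(total_trips, total_fares / total_trips)   # trip count and average fare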

A cool demo using taxi data.
This Can Go Wrong
Who's excited to try?