EECS6893 BigDataAnalytics Lecture1
www.stanford.edu/~cdel/2014.asplos.quasar.pdf
• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization
8 E6893 Big Data Analytics – Lecture 1: Overview © 2019 CY Lin, Columbia University
Why Big Data now?
• High-Volume
• High-Velocity
• High-Variety
➔ Artificial Intelligence
https://www.youtube.com/watch?v=BV8qFeZxZPE
Human brain is a graph/network of 100B nodes and 700T edges.
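To give a feel for that scale, here is a rough back-of-the-envelope sketch using only the two figures quoted above. The 16 bytes per edge (two 8-byte vertex IDs in a plain edge list) is an assumption made for illustration, not a number from the lecture.

```python
# Figures quoted on the slide.
NODES = 100 * 10**9    # 100B nodes
EDGES = 700 * 10**12   # 700T edges

# Assumption for illustration: a plain edge list storing two 8-byte vertex IDs.
BYTES_PER_EDGE = 16

edges_per_node = EDGES / NODES
# If edges are undirected, each edge touches two nodes, so the average degree doubles.
avg_degree_undirected = 2 * EDGES / NODES
edge_list_petabytes = EDGES * BYTES_PER_EDGE / 10**15

print(f"edges per node: {edges_per_node:,.0f}")
print(f"average degree (undirected): {avg_degree_undirected:,.0f}")
print(f"raw edge list size: {edge_list_petabytes:.1f} PB")
```

Even under this minimal encoding the edge list alone runs to petabytes, which is why a graph of this size cannot be held on a single machine.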
• Graph Database: Large-Scale Native Store
[Figure label: memory]
The Apache™ Hadoop® project develops open-source software for reliable, scalable,
distributed computing.
The Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so delivering
a highly-available service on top of a cluster of computers, each of which may be prone to
failures.
http://hadoop.apache.org
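The two ideas in the description above, a simple programming model and failure handling in software rather than hardware, can be sketched in a few lines of plain Python. This is a single-process toy and not Hadoop's API: the names `map_reduce`, `run_with_retry`, `healthy_worker`, and `broken_worker`, and the sample records, are all hypothetical.

```python
from collections import defaultdict


class WorkerFailure(Exception):
    """Raised by a simulated worker to mimic a failed machine."""


def healthy_worker(task, partition):
    return task(partition)


def broken_worker(task, partition):
    raise WorkerFailure("simulated node failure")


def run_with_retry(task, partition, workers, max_attempts=3):
    """Re-run a task on the next worker when the current one fails."""
    last_error = None
    for attempt in range(max_attempts):
        worker = workers[attempt % len(workers)]
        try:
            return worker(task, partition)
        except WorkerFailure as err:
            last_error = err
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


def map_reduce(partitions, mapper, reducer, workers):
    """Map each partition, group values by key (shuffle), then reduce per key."""
    grouped = defaultdict(list)
    for partition in partitions:
        for key, value in run_with_retry(mapper, partition, workers):
            grouped[key].append(value)
    return {key: reducer(values) for key, values in grouped.items()}


def emit_bytes(records):
    """Mapper: emit one (host, bytes) pair per input record."""
    return [(host, nbytes) for host, nbytes in records]


# Hypothetical input, split into two partitions as if stored on two machines.
partitions = [
    [("host-a", 120), ("host-b", 70)],
    [("host-a", 230)],
]

# The first worker always fails, so every task is retried on the second one.
totals = map_reduce(partitions, emit_bytes, sum, [broken_worker, healthy_worker])
print(totals)
```

The caller only supplies a mapper and a reducer; retrying failed work is handled by the framework layer, which mirrors the division of labor the Hadoop description outlines.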
Four distinctive layers of Hadoop
14 E6893 Big Data Analytics – Lecture 2: Big Data Platform © 2017 CY Lin, Columbia University
Course Main Thrust 2: Apache Spark and ML
Main Spark Stack
Course Main Thrust 3: Linked Big Data — Graph Analysis
Course Main Thrust 5: Big Data Visualization
Course Main Thrust 6: Big Data and AI Solutions
Why you want to take this class
Course Information
▪ Website:
http://www.ee.columbia.edu/~cylin/course/bigdata/
▪ Textbook:
-- None, but reference book(s) and/or articles/papers will be provided each lecture.
Course Outline
Assignments and Submissions
Other Issues
▪ Professor Lin:
▪ Office Hours:
Friday after the class: 9:40pm – 10:00pm (lecture room)
Or by appointment
▪ Contact: c.lin@columbia.edu
▪ TAs (CAs/IAs/Graders) —
Reading Reference for Lecture 1
Dynamic networks of 400,000+ IBMers:
– On BusinessWeek four times, including being the Top Story of the Week, April 2009
– Helped IBM earn the 2012 Most Admired Knowledge Enterprise Award
– Wharton School study: $7,010 gain per user per year using the tool
– In 2012, contributed about 1/3 of the GBS Practitioner Portal's $228.5 million savings and benefits
– APQC (WW leader in Knowledge Practice), April 2013: “The Industry Leader and Best Practice in Expertise Location”
[Figure labels: item, user; Enhancing; Graph Visualizations; Shortest Paths; Social Capital; Bridges and Hubs; Expertise Search; Graph Search; Graph Recomm.]
31 © CY Lin, Columbia University
E6895 Advanced Big Data Analytics – Lecture 1
Use Case 2: Personalized Recommendation
▪ Data Source:
– Relationships among 7,594 companies, data mined from NYT, 1981 to 2009
– Build analytics applications (e.g. personalized advertisement) based on the extracted customer social profiles
[Figure labels: profiles; System G Analysis; BigInsights]
[Figure labels: Enhancing; symptom terms (headache, chill, migraine, high fever, stomachache, cough); Graph Communities]
http://systemg.ibm.com/apps/whisper/index.html
SocialHelix: Visualization of Sentiment Divergence in Social Media
http://systemg.ibm.com/apps/socialhelix/index.html
Use Case 9: Graph Search
[Figure labels: query; context; ranking; re-ranking; Info-Socio networks; Graph analysis; interest / social network based content recommendations]
Detecting DoS attack
Normal: (1) Clique-like (2) Two-way links
Attacker: Near-Star
[Figure labels. Inputs: Emails, Instant Messaging, Web Access, Click streams, Executed Processes, Feed subscription, Printing, Copying, Database access, Log On/Off. Components: Social sensors capturer; Graph analysis; Behavior analysis; Semantics analysis; Psychological analysis; Multimodality Analysis; Detection, Prediction & Exploration Interface; Graph Visualizations.]
Graph Visualizations
Bayesian Network
* 3 timesteps
* 63 variables
* 3.9 avg states
* 4.0 avg indegree
* 16,858 CPT entries
Junction Tree
* 67 cliques
* 873,064 PT entries in cliques
KPI time series (e.g., server performance/load, network performance/load) → Causality analyzer → graph varying over time
Node: KPI (a time series)
Edge: (potential) pairwise relationship (e.g., causality)
Select KPI pairs (sampling) → Test link existence → Estimate unsampled links based on history → Overall graph
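The sample, test, and carry-over steps of that pipeline can be sketched as follows. The slide does not say how link existence is tested, so this sketch assumes a lagged Pearson correlation with a fixed threshold as a stand-in for a real causality test. The function names, the threshold, and the sample KPI values are all hypothetical.

```python
import itertools
import random


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def lagged_link(cause, effect, lag=1, threshold=0.8):
    """Stand-in link test (assumption): does `cause` at t-lag track `effect` at t?"""
    return abs(pearson(cause[:-lag], effect[lag:])) >= threshold


def update_graph(series, history, sample_size, rng):
    """Test a random sample of ordered KPI pairs; reuse history for the rest."""
    pairs = list(itertools.permutations(sorted(series), 2))
    sampled = set(rng.sample(pairs, min(sample_size, len(pairs))))
    graph = {}
    for src, dst in pairs:
        if (src, dst) in sampled:
            graph[(src, dst)] = lagged_link(series[src], series[dst])
        else:
            # Unsampled pair: fall back to the last known state, defaulting to no link.
            graph[(src, dst)] = history.get((src, dst), False)
    return graph


# Hypothetical KPIs: latency follows cpu with a one-step delay.
cpu = [1, 3, 2, 5, 4, 6]
series = {"cpu": cpu, "latency": [0] + cpu[:-1]}

graph = update_graph(series, history={}, sample_size=2, rng=random.Random(0))
print(graph[("cpu", "latency")])
```

Sampling keeps the cost of each update bounded: with n KPIs there are n(n-1) ordered pairs, so only a subset is tested per round and the remaining edges are taken from history.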
Category 5: Data Warehouse Augmentation
[Figure labels: Graph application and Graph objects (shown twice); Vertex Correspondence; Attribute Transformation; ARG s; ARG t]
Use Case 19: Graph Matching for Genomic Medicine
1. Warm-Up Exercises:
• Set up Google Cloud account and environment
• Install Google Cloud SDK
• Create a Spark cluster
• Word Count using Google Cloud Storage and Spark
• Hive and BigQuery
https://docs.google.com/document/d/1MWBVItrLL0MizDR-9q7-SY986SqcPcib-OrlB10fYh0/
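For orientation, the word count exercise might look roughly like the PySpark sketch below. It is not taken from the assignment document, which may ask for a different approach. The function names and the tokenization rule are assumptions, and the input and output locations are left as parameters for you to supply. `local_word_count` is included only so the logic can be checked without a cluster.

```python
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and split it into words (letters, digits, apostrophes)."""
    return re.findall(r"[a-z0-9']+", line.lower())


def local_word_count(lines):
    """Plain-Python reference result, useful for checking small inputs."""
    return Counter(word for line in lines for word in tokenize(line))


def run_word_count(input_uri, output_uri):
    """Count words with Spark; on a Dataproc cluster both URIs can be gs:// paths."""
    # Imported lazily so the helpers above work where PySpark is not installed.
    from operator import add
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCountSketch").getOrCreate()
    try:
        counts = (
            spark.sparkContext.textFile(input_uri)
            .flatMap(tokenize)              # one record per word
            .map(lambda word: (word, 1))    # pair each word with a count of 1
            .reduceByKey(add)               # sum the counts per word
        )
        counts.saveAsTextFile(output_uri)   # output directory must not already exist
    finally:
        spark.stop()
```

To run it on the cluster, call `run_word_count` with the Cloud Storage location of your input text and an output directory in your own bucket.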
57 E6893 Big Data Analytics – Lecture 1: Big Data Introduction © 2019 CY Lin, Columbia University
Homework Late Submission Policy