EECS E6893 Big Data Analytics Lecture 1:

Overview of Big Data Analytics

Ching-Yung Lin, Ph.D.


Adjunct Professor, Depts. of Electrical Engineering and Computer Science
IEEE Fellow

September 6th, 2019


E6893 Big Data Analytics — Lecture 1 © CY Lin, 2019 Columbia University
Definition and Characteristics of Big Data

“Big data is high-volume, high-velocity and high-variety information assets that
demand cost-effective, innovative forms of information processing for
enhanced insight and decision making.” -- Gartner

which was derived from:

“While enterprises struggle to consolidate systems and collapse redundant
databases to enable greater operational, analytical, and collaborative
consistencies, changing economic conditions have made this job more difficult.
E-commerce, in particular, has exploded data management challenges along
three dimensions: volumes, velocity and variety. In 2001/02, IT organizations
must compile a variety of approaches to have at their disposal for dealing
with each.” – Doug Laney

What made Big Data needed?

“Big Data Analytics”, David Loshin, 2013


Key Computing Resources for Big Data

• Processing capability: CPU, processor, or node.
• Memory
• Storage
• Network

“Big Data Analytics”, David Loshin, 2013


Scalability — Scale Up & Scale Out
● Scale out
  ● Use more resources to distribute workload in parallel
  ● Higher data access latency is typically incurred
● Scale up
  ● Efficiently use the resources
  ● Architecture-aware algorithm design
Example: Resource utilization for a large production cluster at a Twitter data center
www.stanford.edu/~cdel/2014.asplos.quasar.pdf

• For independent data ==> scale up may have no obvious advantage over scale out
• For linked data ==> use scale up as much as possible before scaling out
Contrasting Approaches in Adopting High-Performance Capabilities

“Big Data Analytics”, David Loshin, 2013


Techniques towards Big Data

• Massive Parallelism
• Huge Data Volumes Storage
• Data Distribution
• High-Speed Networks
• High-Performance Computing
• Task and Thread Management
• Data Mining and Analytics
• Data Retrieval
• Machine Learning
• Data Visualization

➔ These techniques have existed for years to decades. Why is Big Data hot now?
Why Big Data now?

• More data are being collected and stored
• Open source code
• Commodity hardware / Cloud

➔ High-Volume, High-Velocity, High-Variety
➔ Artificial Intelligence

https://www.youtube.com/watch?v=BV8qFeZxZPE

The human brain is a graph/network of 100B nodes and 700T edges.

• Machine Cognition:
  • Robot Cognition Tools
  • Feeling
• Machine Reasoning:
  • Bayesian Networks
  • Game Theory Tools
• Machine Learning:
  • Machine Learning Tools
  • Deep Learning Tools
• Graph Analytics:
  • Network Analysis
  • Matching and Search
  • Flow Prediction
• Graph Visualization:
  • Dynamic Graph
  • Big Graph
• Graph Database:
  • Large-Scale Native Store

[Figure labels: recognition, perception, comprehension, sensors, strategy, representation, memory]

Course Main Thrust 1: Apache Hadoop and Big Data

The Apache™ Hadoop® project develops open-source software for reliable, scalable,
distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing
of large data sets across clusters of computers using simple programming models. It is
designed to scale up from single servers to thousands of machines, each offering local
computation and storage. Rather than rely on hardware to deliver high-availability, the
library itself is designed to detect and handle failures at the application layer, so delivering
a highly-available service on top of a cluster of computers, each of which may be prone to
failures.

The project includes these modules:


• Hadoop Common: The common utilities that support the other Hadoop modules.
• Hadoop Distributed File System (HDFS™): A distributed file system that provides high-
throughput access to application data.
• Hadoop YARN: A framework for job scheduling and cluster resource management.
• Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.
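
The MapReduce module listed above follows a map, shuffle, reduce pattern. The sketch below imitates that pattern in plain, single-process Python so the data flow is visible; it does not use the Hadoop API, and the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative assumptions rather than anything defined by Hadoop or by the course.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(line: str) -> Iterator[Tuple[str, int]]:
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group emitted values by key, the step a framework performs between map and reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key: str, values: List[int]) -> Tuple[str, int]:
    """Sum the counts collected for one word."""
    return key, sum(values)


def word_count(lines: Iterable[str]) -> Dict[str, int]:
    """Run map, shuffle and reduce in sequence over all input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

In a real cluster the map and reduce calls would run on different machines and the shuffle would move data over the network; here everything runs in one process purely for illustration.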

http://hadoop.apache.org
Four distinctive layers of Hadoop

Course Main Thrust 2: Apache Spark and ML

Main Spark Stack

Course Main Thrust 3: Linked Big Data — Graph Analysis

The human brain is a graph of 100B nodes and 700T edges.
Course Main Thrust 4: Streaming Big Data Analytics

Course Main Thrust 5: Big Data Visualization

Course Main Thrust 6: Big Data and AI Solutions

• Big Data and AI for Finance


• Big Data and AI for Healthcare

Why you want to take this class

• Key Differentiator of this class: Focusing on building a full-spectrum
understanding of the latest Big Data Analytics technologies and using
them to build real-world industry solutions.

• Sapphire Big Data Analytics Open Source Applications: Create Big
Data open source toolsets for various industries (and disciplines)

• Dataset and Use Cases: Welcome!!

Course Grading
▪ 5 Homeworks: 50%
-- Individual work (except HW #4); Language Requirement: C/C++, Java, JavaScript, Python
-- Report and source code
▪ HW #0: Big Data Environment Setup and Testing
▪ HW #1: Big Data Analytics and Machine Learning
▪ HW #2: Linked Big Data Analytics
▪ HW #3: Streaming Big Data Analytics
▪ HW #4: Big Data Analytics Visualization (2 students per team)

▪ Final Project: 50%


-- Teamwork: 2 - 3 students per team (on campus); 1 - 3 students per team for CVN
▪ Proposal (slides — short presentation in the class, 5 mins presentation with video on YouTube)
▪ Final Report (paper, up to 10 pages)
▪ Workshop Presentation (Oral and Demo)
▪ Open Source Codes
▪ Video Presentation (on YouTube)

Course Information
▪ Website:
http://www.ee.columbia.edu/~cylin/course/bigdata/

▪ Textbook:
-- None, but reference book(s) and/or articles/papers will be provided each lecture.

Course Outline

Assignments and Submissions

Other Issues

▪ Professor Lin:
▪ Office Hours:
Friday after the class: 9:40pm – 10:00pm (lecture room)
Or by appointment

▪ Contact: c.lin@columbia.edu

▪ TAs (CAs/IAs/Graders) —

▪ Frank Ouyang (ho2271)
▪ Tingyu Li (tl2861)
▪ Yunan Lu (yl4021)
▪ Others TBD; there will probably be 6-7 TAs in total.

Reading Reference for Lecture 1

Chapter 1: Market and Business Drivers for Big Data Analysis
Chapter 2: Business Problems Suited to Big Data Analytics
Chapter 3: Achieving Organizational Alignment for Big Data Analytics
Chapter 4: Developing a Strategy for Integrating Big Data Analytics into the Enterprise
Chapter 5: Data Governance for Big Data Analytics: Considerations for Data Policies and Processes
Chapter 6: Introduction to High-Performance Appliances for Big Data Management
Chapter 7: Big Data Tools and Techniques
Chapter 8: Developing Big Data Applications
Chapter 9: NoSQL Data Management for Big Data
Chapter 10: Using Graph Analytics for Big Data
Chapter 11: Developing the Big Data Roadmap

5 Example Big Data Use Case Categories


• Big Data Exploration: Find, visualize, understand all big data to improve decision making
• Enhanced 360º View of the Customer: Extend existing customer views (MDM, CRM, etc.) by incorporating additional internal and external information sources
• Security/Intelligence Extension: Lower risk, detect fraud and monitor cyber security in real-time
• Operations Analysis: Analyze a variety of machine data for improved business results
• Data Warehouse Augmentation: Integrate big data and data warehouse capabilities to increase operational efficiency

Big Data Examples -- Application Use Cases
1. Expertise Location
2. Recommendation
3. Commerce
4. Financial Analysis
5. Social Media Monitoring
6. Telco Customer Analysis
7. Healthcare Analysis
8. Data Exploration and Visualization
9. Personalized Search
10. Anomaly Detection
11. Fraud Detection
12. Cybersecurity
13. Sensor Monitoring (Smarter another Planet)
14. Cellular Network Monitoring
15. Cloud Monitoring
16. Code Life Cycle Management
17. Traffic Navigation
18. Image and Video Semantic Understanding
19. Genomic Medicine
20. Brain Network Analysis
21. Data Curation
22. Near Earth Object Analysis
Category 1: 360º View
Enhancing: Recommendation (user, item)

Graph Visualizations
Communities        Graph Search     Network Info Flow   Bayesian Networks
Centralities       Graph Query      Shortest Paths      Latent Net Inference
Ego Net Features   Graph Matching   Graph Sampling      Markov Networks
Middleware and Database


Use Case 1: Social Network Analysis in Enterprise for Productivity
Production live system used by IBM GBS since 2009 – verified ~$100M contribution
15,000 contributors in 76 countries; 92,000 annual unique IBM users
25,000,000+ emails & SameTime messages (incl. content features)
1,000,000+ Learning clicks; 14M KnowledgeView, SalesOne, …, access data
1,000,000+ Lotus Connections (blogs, file sharing, bookmark) data
200,000 people’s consulting project & earning data

Dynamic networks of 400,000+ IBMers:

– Featured in BusinessWeek four times, including as the Top Story of the Week, April 2009
– Helped IBM earn the 2012 Most Admired Knowledge Enterprise Award
– Wharton School study: $7,010 gain per user per year using the tool
– In 2012, contributing about 1/3 of GBS Practitioner Portal $228.5 million savings and benefits
– APQC (WW leader in Knowledge Practice) April 2013: “The Industry Leader and Best Practice in Expertise Location”

[Figure labels: Shortest Paths, Centralities, Graph Search, Social Capital, Bridges and Hubs, Expertise Search, Graph Recomm.]
Use Case 2: Personalized Recommendation

Use Case 3: Customer Behavior Sequence Analytics
Models: Markov Network, Latent Network, Bayesian Network

• Behavior Pattern Detection
• Help Needed Detection

[Figure: customer behavior states: login, browsing, search, comparing, checkout]

Use Case 4: Graph Analytics for Financial Analysis
Goal: Injecting network graph effects into financial analysis. Estimating company performance considering correlated companies, network properties and evolutions, causal parameter analysis, etc.

[Figure: company relationship networks around IBM, 2003 and 2009]

▪ Data source: relationships among 7594 companies, data mining from NYT 1981 ~ 2009
▪ Targets: 20 Fortune companies’ normalized profits
▪ Goal: learn from previous 5 years, and predict next year
▪ Model: Support Vector Regression (RBF kernel)
▪ Network features:
  s (current year network feature),
  t (temporal network feature),
  d (delta value of network feature)
▪ Financial feature:
  p (historical profits and revenues)

Profit prediction by joint network and financial analysis outperforms network-only by 130% and financial-only by 33%.
Use Case 5: Social Media Monitoring

[Figure: monitoring dashboard. Panels: monitoring categories; monitoring filter; Real-Time Translation, Locat; Live Tweets, Sentiment, Keywords; Dynamic Graphs; Zooming / Panning; Top Retweets]
Use Case 6: Customer Social Analysis for Telco
Goal: Extract customer social network behaviors to enable Call Detail Records (CDRs) data monetization for Telco.

▪ Applications based on the extracted social profiles
− Personalized advertisement (beyond the scope of traditional campaign in Telco)
− High value customer identification and targeting
− Viral marketing campaign
▪ Approach
− Construct social graphs from CDRs based on {caller, callee, call time, call duration}
− Extract customer social features (e.g. influence, communities, etc.) from the constructed social graph as customer social profiles
− Build analytics applications (e.g. personalized advertisement) based on the extracted customer social profiles

[Figure: CDR data flows through BigInsights and System G Analysis (Degree Centrality, Weakly Connected Component, Maximal Cliques, Pagerank, Community Detection, K-core) into Customer Profiles (influence, community, etc.), which enable the applications: Personalized Advertisement, High Value Customer Identification & targeting, Viral marketing campaign]

PoCs with Chinese and Indian telecom companies
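
The first two approach steps (build a social graph from CDRs, then extract a social feature) can be sketched as follows. This is a minimal illustration, not the System G implementation: the record layout `(caller, callee, call_time, call_duration)` mirrors the fields named on the slide, while the function names and the choice of an undirected graph are assumptions.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

# Assumed CDR layout: (caller, callee, call_time, call_duration_seconds).
CDR = Tuple[str, str, str, int]


def build_social_graph(cdrs: Iterable[CDR]) -> Dict[str, Set[str]]:
    """Undirected graph: two subscribers are linked if either one called the other."""
    graph: Dict[str, Set[str]] = defaultdict(set)
    for caller, callee, _time, _duration in cdrs:
        if caller != callee:  # ignore self-calls
            graph[caller].add(callee)
            graph[callee].add(caller)
    return dict(graph)


def degree_centrality(graph: Dict[str, Set[str]]) -> Dict[str, float]:
    """Fraction of the other subscribers that each subscriber is directly linked to."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(neighbors) / (n - 1) for node, neighbors in graph.items()}
```

A production pipeline would likely weight edges by call count or duration and keep call direction; both are dropped here to keep the sketch short.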


Category 2: Data Exploration


Enhancing: Huge Network Visualization, Network Propagation, I2 3D Network Visualization, Geo Network Visualization, Graphical Model Visualization

Communities        Graph Search     Network Info Flow   Bayesian Networks
Centralities       Graph Query      Shortest Paths      Latent Net Inference
Ego Net Features   Graph Matching   Graph Sampling      Markov Networks
Middleware and Database


Use Case 7: Graph Analytics and Visualization

[Figure: Graph Matching (a query graph and its matches) and Graph Communities. Node labels: headache, chill, migraine, high fever, stomachache, cough]

Use Case 8: Visualization for Navigation and Exploration

Whisper: Tracing the information diffusion in Social Media
http://systemg.ibm.com/apps/whisper/index.html

SocialHelix: Visualization of Sentiment Divergence in Social Media
http://systemg.ibm.com/apps/socialhelix/index.html
Use Case 9: Graph Search

[Figure: an existing search engine (query, index, ranking) is combined with Graph Search: graph analysis of info-socio networks and query context drives re-ranking, giving improved search results and interest / social network based content recommendations]

Category 3: Security
Enhancing: Ponzi scheme detection and detecting DoS attack (using Network Info Flow and Ego Net Features)

Normal: (1) Clique-like (2) Two-way links
Attacker: Near-Star

Graph Visualizations
Communities        Graph Search     Network Info Flow   Bayesian Networks
Centralities       Graph Query      Shortest Paths      Latent Net Inference
Ego Net Features   Graph Matching   Graph Sampling      Markov Networks
Middleware and Database


Use Case 10: Anomaly Detection at Multiple Scales


Based on President Executive Order 13587

Goal: System for detecting and predicting abnormal behaviors in an organization, through large-scale social network & cognitive analytics and data mining, to decrease insider threats such as espionage, sabotage, colleague-shooting, suicide, etc.

“Enterprise Information Leakage Impacted economy and jobs” Feb 2013
“What's emerged is a multibillion dollar detective industry” npr, Jan 10, 2013

Data sources: Emails, Instant Messaging, Web Access, Executed Processes, Printing, Copying, Log On/Off, Click streams, Feed subscription, Database access
Components: Social sensors, capturer, Graph analysis, Behavior analysis, Semantics analysis, Psychological analysis, Multimodality Analysis, Detection, Prediction & Exploration Interface

Infrastructure + ~ 490 Analytics


Use Case 11: Fraud Detection for Bank
Analytics used: Network Info Flow, Ego Net Features

Ponzi scheme detection
Normal: (1) Clique-like (2) Two-way links
Attacker: Near-Star
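
The contrast above (normal accounts sit in clique-like ego nets with two-way links, attackers in near-star ego nets) can be turned into two simple numeric features. The sketch below is one possible formulation, assumed for illustration; the feature names `neighbor_density` and `reciprocity` and the function itself are not from the slides.

```python
from itertools import combinations
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]  # directed edge (source, target)


def ego_net_features(edges: Set[Edge], ego: str) -> Dict[str, float]:
    """Compute two illustrative ego-net features on a directed graph.

    neighbor_density: share of neighbor pairs that are linked (1.0 = clique, 0.0 = star)
    reciprocity: share of the ego's neighbors linked to it in both directions
    """
    neighbors = {v for u, v in edges if u == ego} | {u for u, v in edges if v == ego}
    neighbors.discard(ego)
    if not neighbors:
        return {"neighbor_density": 0.0, "reciprocity": 0.0}

    pairs = list(combinations(sorted(neighbors), 2))
    linked = sum(1 for a, b in pairs if (a, b) in edges or (b, a) in edges)
    density = linked / len(pairs) if pairs else 0.0

    two_way = sum(1 for n in neighbors if (ego, n) in edges and (n, ego) in edges)
    return {"neighbor_density": density, "reciprocity": two_way / len(neighbors)}
```

Under this formulation a near-star account with one-way links scores close to 0 on both features, while a clique-like account with two-way links scores close to 1. Any decision threshold would have to be chosen from data.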

Use Case 12: Detecting Cyber Attacks
Analytics used: Network Info Flow, Ego Net Features

Detecting DoS attack

Category 4: Operations Analysis
Enhancing: Cloud Service Placement (using Graph Matching and Bayesian Network)

[Figure: KPI time series (e.g., server performance/load, network performance/load) from Network KPIs and Server KPIs feed a causality analyzer. Each node is a KPI (a time series); each edge is a (potential) pairwise relationship (e.g., causality), varying over time]

Graph Visualizations
Communities        Graph Search     Network Info Flow   Bayesian Networks
Centralities       Graph Query      Shortest Paths      Latent Net Inference
Ego Net Features   Graph Matching   Graph Sampling      Markov Networks
Middleware and Database


Use Case 13: Smarter another Planet
Goal: The Atmospheric Radiation Measurement (ARM) climate research facility provides 24x7 continuous field observations of cloud, aerosol and radiative processes. Graphical models can automate the validation with improved efficiency and performance.

Approach: A Bayesian network (BN) is built to represent the dependence among sensors and replicated across timesteps. BN parameters are learned from over 15 years of ARM climate data to support distributed climate sensor validation. Inference validates sensors in the connected instruments.

Bayesian Network
* 3 timesteps
* 63 variables
* 3.9 avg states
* 4.0 avg indegree
* 16,858 CPT entries
Junction Tree
* 67 cliques
* 873,064 PT entries in cliques

Use Case 14: Cellular Network Analytics in Telco Operation
Goal: Efficiently and uniquely identify the internal state of Cellular/Telco networks (e.g., performance and load of network elements/links) using probes between monitors placed at selected network elements & end hosts

▪ Applied Graph Analytics to telco network analytics based on CDRs (call detail records): estimate traffic load on the CSP network with low monitoring overhead
(1) CDRs, already collected for billing purposes, contain information about voice/data calls
(2) Traditional NMS* and EMS** typically lack end-to-end visibility and topology across vendors
(3) Employ graph algorithms to analyze network elements which are not reported by the usage data from CDR information
▪ Approach
– Cellular network comprises a hierarchy of network elements
– Map CDR onto network topology and infer load on each network element using graph analysis
– Estimate network load and localize potential problems

[Figure: CDR and network topology feed graph analysis, which produces a network load level report]
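
The approach above (map CDRs onto a hierarchy of network elements and infer the load on each) can be sketched as a walk up a tree. The topology encoding (each element maps to its parent), the element names and the `infer_load` function are assumptions made for this example; the slides do not specify the actual graph analysis.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# Assumed topology encoding: each element maps to its parent (None at the root).
Topology = Dict[str, Optional[str]]


def infer_load(topology: Topology, cdr_usage: Iterable[Tuple[str, float]]) -> Dict[str, float]:
    """Attribute each CDR's usage to its serving cell and to every element above it.

    Assumes the topology is a tree (no cycles); a cycle would make the walk loop forever.
    """
    load: Dict[str, float] = defaultdict(float)
    for cell, usage in cdr_usage:
        node: Optional[str] = cell
        while node is not None:
            load[node] += usage
            node = topology.get(node)
    return dict(load)
```

With this aggregation, elements that never appear in a CDR still receive a load estimate from the cells below them, which is the point of mapping usage data onto the topology.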

Use Case 15: Monitoring Large Cloud
Goal: Monitoring technology that can track the time-varying state (e.g., causality relationships between KPIs) of a large Cloud when the processing power of the monitoring system cannot keep up with the scale of the system & the rate of change
• Causality relationships (e.g., Granger causality) are crucial in performance monitoring & root cause analysis
• Challenge: easy to test pairwise relationship, but hard to test multi-variate relationship (e.g., a large number of KPIs)

[Figure: KPI time series (e.g., server performance/load, network performance/load) from Network KPIs and Server KPIs feed a causality analyzer. Each node is a KPI (a time series); each edge is a (potential) pairwise relationship (e.g., causality), varying over time]

Our approach: Probabilistic monitoring via sampling & estimation
• Basic analytics engine (e.g., pairwise Granger causality)
• Link sampling & estimation

Select KPI pairs (sampling) → Test link existence → Estimate unsampled links based on history → Overall graph
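
The pipeline above (sample KPI pairs, test the sampled links, fall back on history for the rest) can be sketched as below. Several simplifications are assumed: links are treated as undirected, the pairwise test is a caller-supplied function standing in for something like a Granger causality test, and "estimate from history" is reduced to reusing the last known state. None of these names come from the slides.

```python
import random
from itertools import combinations
from typing import Callable, Dict, FrozenSet, List, Optional

Pair = FrozenSet[str]


def estimate_links(
    kpis: List[str],
    test_link: Callable[[str, str], bool],
    history: Dict[Pair, bool],
    sample_fraction: float,
    rng: Optional[random.Random] = None,
) -> Dict[Pair, bool]:
    """Test a random sample of KPI pairs; reuse the last known state for the others."""
    rng = rng or random.Random()
    all_pairs = [frozenset(p) for p in combinations(kpis, 2)]
    if not all_pairs:
        return {}
    # Always test at least one pair so the picture keeps refreshing.
    k = min(len(all_pairs), max(1, round(sample_fraction * len(all_pairs))))
    sampled = set(rng.sample(all_pairs, k))

    graph: Dict[Pair, bool] = {}
    for pair in all_pairs:
        if pair in sampled:
            a, b = sorted(pair)
            graph[pair] = test_link(a, b)  # fresh (expensive) pairwise test
        else:
            graph[pair] = history.get(pair, False)  # assumption: unseen pairs have no link
    return graph
```

Calling the function repeatedly, with each result passed back in as `history`, gradually refreshes the whole graph while testing only a fraction of the pairs per round.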
Category 5: Data Warehouse Augmentation

Use Case 16: Code Life Cycle Improvement

[Figure: in the traditional (relational) model, a graph application converts its graph objects to and from a relational DB; in the graph DB model, the graph application works with graph objects stored directly in a graph DB]

● Advantages of working directly with graph DB for graph applications
(1) Smaller and simpler code
(2) Flexible schema → easy schema evolution
(3) Code is easier and faster to write, debug and manage
(4) Code and data are easier to transfer and maintain

Use Case 17: Smart Navigation Utilizing Real-time Road
Information
Goal: Enable an unprecedented level of accuracy in traffic scheduling (for a fleet of transportation vehicles) and navigation of individual cars, utilizing dynamic real-time information on changing road conditions and predictive analysis on the data

• Dynamic graph algorithms implemented in System G provide highly efficient graph query computation (e.g. shortest path computation) on time-varying graphs (orders of magnitude improvement over existing solutions)

• High-throughput real-time predictive analytics on graphs makes it possible to estimate the future traffic condition on the route, to make sure that the decision taken now is optimal overall

Our approach: Querying over dynamic graph + predictive analytics on graph properties

[Figure: a graph store receives real-time updates and historical data; predictive analytics for graphs produces predictive results; the dynamic graph query problem handles query & response]
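
To make "shortest path computation on time-varying graphs" concrete, the sketch below applies a real-time travel-time update to a road graph and re-runs a query. It uses plain Dijkstra recomputed from scratch as a stand-in; the dynamic graph algorithms in System G are not described on the slide, and the function names and graph encoding here are assumptions.

```python
import heapq
from typing import Dict

# Assumed encoding: graph[u][v] is the current travel time on road segment u -> v.
Graph = Dict[str, Dict[str, float]]


def shortest_travel_time(graph: Graph, source: str, target: str) -> float:
    """Dijkstra's algorithm; returns inf if the target is unreachable."""
    best = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        cost, node = heapq.heappop(queue)
        if node == target:
            return cost
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for neighbor, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(queue, (new_cost, neighbor))
    return float("inf")


def update_road(graph: Graph, u: str, v: str, travel_time: float) -> None:
    """Apply a real-time update to one directed road segment."""
    graph.setdefault(u, {})[v] = travel_time
```

Recomputing after every update is the simple baseline; a dynamic algorithm would instead repair only the part of the shortest-path tree affected by the changed segment.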
Use Case 18: Graph Analysis for Image and Video Analysis

[Figure: graph matching between two attributed relational graphs, ARG s and ARG t, via Vertex Correspondence and Attribute Transformation]
Use Case 19: Graph Matching for Genomic Medicine

Use Case 20: Data Curation for Enterprise Data Management

Use Case 21: Understanding Brain Network

Use Case 22: Planet Security
• Big Data on Large-Scale Sky Monitoring

Homework #0: Big Data Environment Setup and Test (due September 20, 5pm)

1. Warm-Up Exercises:
• Set up Google Cloud account and environment
• Install Google Cloud SDK
• Create a Spark cluster
• Word Count using Google Cloud Storage and Spark
• Hive and BigQuery

2. Data Analysis — NYC Bike Expert:


• Load data to a Cloud Storage
• Simple Analyses through BigQuery

3. Data Analysis — Understanding Shakespeare:


• Load data to a Cloud Storage
• Simple Analyses through Word Counts
• Analyses after running Natural Language Toolkit

https://docs.google.com/document/d/1MWBVItrLL0MizDR-9q7-SY986SqcPcib-OrlB10fYh0/

Homework Late Submission Policy

Friday 5pm: submission deadline


Saturday 5pm: 10% penalty
Sunday 5pm: 20% penalty
Monday 5pm: 30% penalty
Any submission after Monday 5pm will not be accepted.
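
The schedule above amounts to a step function of lateness. The sketch below encodes one reading of it, namely that a submission made within 24, 48 or 72 hours after the Friday 5pm deadline receives a 10%, 20% or 30% penalty respectively; the function name `late_penalty` and this interpretation of the boundaries are assumptions.

```python
from datetime import datetime, timedelta
from typing import Optional


def late_penalty(deadline: datetime, submitted: datetime) -> Optional[float]:
    """Return the penalty fraction, or None if the submission is no longer accepted."""
    if submitted <= deadline:
        return 0.0
    # Each extra day (Saturday, Sunday, Monday 5pm) adds 10 percentage points.
    for days_late, penalty in ((1, 0.10), (2, 0.20), (3, 0.30)):
        if submitted <= deadline + timedelta(days=days_late):
            return penalty
    return None  # after Monday 5pm
```

For example, with the Homework #0 deadline of Friday, September 20, 2019 at 5pm, a submission on Saturday morning would fall in the 10% bracket under this reading.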
