Prof. Randy Paffenroth, Data Science Program, Department of Mathematical Sciences, Worcester Polytechnic Institute, rcpaffenroth@wpi.edu, 2014


Musings on Data Science and

Students Experiencing Data Analytics


New England SENCER Center for Innovation

Prof. Randy Paffenroth


Data Science Program
Department of Mathematical Sciences
Worcester Polytechnic Institute

rcpaffenroth@wpi.edu

2014
My Research

"Internet Connectivity Access layer" by User:Ludovic.ferre -


Internet_Connectivity_Overview2_Access.svg. Licensed under Creative Commons Attribution-Share Alike 3.0 via Wikimedia Commons - http://commons.wikimedia.org/wiki/File:Internet_Connectivity_Access_layer.svg#mediaviewer/File:Internet_Connectivity_Access_layer.svg
This is a panel, so I want to be
provocative!
Provocative
Adjective
1. tending or serving to provoke; inciting, 
stimulating, irritating, or vexing.
So, I will be a little sad if I don’t end up irritating
anyone 
The first war: Terminology
• Analyzing data has a long history!
• There have been many terms that have been
used to describe such endeavors:
• Statistics
• Artificial Intelligence
• Machine learning
• Data analytics
• Since I happen to work in a “Data Science”
program perhaps I may be allowed the
indulgence of using that terminology…
Whatever we call it, what makes
things different now?
Experiments, observations, and numerical simulations in many
areas of science and business are currently generating terabytes of
data, and in some cases are on the verge of generating petabytes
and beyond. Analyses of the information contained in these data
sets have already led to major breakthroughs in fields ranging from
genomics to astronomy and high-energy physics and to the
development of new information-based industries.
- Frontiers in Massive Data Analysis, National Research Council of the National Academies

Given a large mass of data, we can by judicious selection construct perfectly plausible unassailable theories—all of which, some of which, or none of which may be right.
- Paul Arnold Srere
The ability to take data—to be able to understand it, to process it, to extract value from it, to
visualize it, to communicate it—that’s going to be a hugely important skill in the next decades,
not only at the professional level but even at the educational level for elementary school kids,
for high school kids, for college kids. Because now we really do have essentially free and
ubiquitous data. So the complimentary scarce factor is the ability to understand that data and
extract value from it.
- Hal Varian, Google's Chief Economist, http://www.mckinsey.com/insights/innovation/hal_varian_on_how_the_web_challenges_managers

My personal goal: Getting students to be able to think critically about data.
What is Big Data?

There are many examples of "data", but what makes some of
it “big”? The classic definition revolves around the three
Vs.
• Volume, velocity, and variety.

Volume: There is just a lot of it being generated all
the time. Things get interesting and “big” when you
can’t fit it all on one computer anymore. Why? There
are many ideas here such as MapReduce, Hadoop, etc.
that all revolve around being able to process data that
goes from Terabytes, to Petabytes, to Exabytes.
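
To make the MapReduce idea a little more concrete, here is a minimal single-machine sketch of the pattern in plain Python. It is not Hadoop and not any framework's API; the function names (`map_phase`, `shuffle`, `reduce_phase`) and the two text "chunks" are hypothetical, chosen only to show how work on separate pieces of data can be combined.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Collapse each key's list of values into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


# Hypothetical chunks, standing in for pieces of a data set that
# would live on different machines in a real cluster.
chunks = ["big data is big", "data science is not big data"]

mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
counts = reduce_phase(shuffle(mapped))
print(counts)
```

Each chunk is mapped without looking at any other chunk, which is what lets the map step be spread over many computers once the data no longer fits on one.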
Velocity: Data is being generated very quickly. Can
you even store it all? If not, then what do you get rid of
and what do you keep?

http://pl.wikipedia.org/wiki/Green_Giant#mediaviewer/Plik:Jolly_green_giant.jpg
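
One possible answer to "what do you keep?" is to keep a fixed-size random sample of the stream. The sketch below uses reservoir sampling (often called Algorithm R), which is my choice of illustration and not something the slides prescribe. The function name `reservoir_sample` and the stand-in stream `range(100_000)` are hypothetical.

```python
import random


def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length.

    Only k items are ever held in memory, no matter how long the stream is.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a kept item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# A stand-in stream: in practice this would be data arriving over time.
sample = reservoir_sample(range(100_000), k=5, rng=random.Random(0))
print(sample)
```

The trade-off is explicit: everything outside the sample is thrown away, so this only helps when a random sample is enough to answer the question being asked.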

Variety: The data types you mention all take different
shapes. What does it mean to store them so that you
can play with or compare them?
Is Big Data the same as Data
Science?
• Are Big Data and Data Science the same thing?
• I wouldn't say so...
• Data Science can be done on small data sets.
• And not everything done using Big Data would
necessarily be called Data Science.
• But there certainly is a substantial overlap!

[Venn diagram: Big Data and Data Science, overlapping]
Can you even be certain?
• For real world problems, I claim that you will never be certain of any inferences from data.
I mean, what happens to your carefully thought out marketing plan for some rocking slacks when the Martians land?
• What is unacceptable is when the data you actually have does not support the conclusion you report.

Public domain image


It can be easy to fool yourself!
Human beings are really good at pattern detection...
Perhaps a bit too good!

http://en.wikipedia.org/wiki/Cydonia_(region_of_Mars)
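
The same trap shows up with numbers as with pictures. The sketch below is an illustration of my own, not taken from the talk: it generates series of pure random noise and then searches all pairs for the strongest correlation. The sizes (`n_series`, `n_points`) and the seed are arbitrary assumptions.

```python
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(42)
n_series, n_points = 50, 10  # many unrelated "variables", few observations each

# Every series is independent Gaussian noise: there is no real relationship.
series = [[rng.gauss(0, 1) for _ in range(n_points)] for _ in range(n_series)]

# "Judicious selection": look at every pair and keep the most impressive one.
correlations = [abs(pearson(a, b)) for a, b in combinations(series, 2)]
strongest = max(correlations)
print(f"{len(correlations)} pairs tested, strongest |r| = {strongest:.2f}")
```

With this many pairs and so few points per series, some pair will usually look strongly related by chance alone, which is why the number of comparisons tried matters as much as the best result found.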
Skills for Data Science

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
Which is most important?

http://en.wikipedia.org/wiki/View_of_the_World_from_9th_Avenue

http://drewconway.com/zia/2013/3/26/the-data-science-venn-diagram
WPI Data Science Program:
A Collaboration

Computer Science Department
Mathematical Sciences Department
Business School
M.S. in Data Science Program
GRADUATE QUALIFYING PROJECT OR MS THESIS (3 TO 9 CREDITS)

CONCENTRATION AND ELECTIVES (9 TO 15 CREDITS)

MATHEMATICAL ANALYTICS (3 CREDITS)
DATA ACCESS & MANAGEMENT (3 CREDITS)
DATA ANALYTICS & MINING (3 CREDITS)
BUSINESS INTELLIGENCE & CASE STUDIES (3 CREDITS)

INTEGRATIVE DATA SCIENCE (3 CREDITS)


Data Science Core
INTEGRATIVE DATA SCIENCE:
DS 501 INTRODUCTION TO DATA SCIENCE (NEW COURSE)

MATHEMATICAL ANALYTICS (SELECT ONE):
MA 543/DS 502 STATISTICAL METHODS FOR DATA SCIENCE (NEW COURSE)
MA 542 REGRESSION ANALYSIS
MA 554 APPLIED MULTIVARIATE ANALYSIS

DATA ACCESS AND MANAGEMENT (SELECT ONE):
CS 542 DATABASE MANAGEMENT SYSTEMS
MIS 571 DATABASE APPLICATIONS DEVELOPMENT
CS 561 ADVANCED TOPICS IN DATABASE SYSTEMS
CS 585/DS 503 BIG DATA MANAGEMENT (NEW COURSE)

DATA ANALYTICS AND MINING (SELECT ONE):
CS 548 KNOWLEDGE DISCOVERY AND DATA MINING
CS 539 MACHINE LEARNING
CS 586/DS 504 BIG DATA ANALYTICS (NEW COURSE)

BUSINESS INTELLIGENCE AND CASE STUDIES (SELECT ONE):
MIS 584 BUSINESS INTELLIGENCE
MKT 568 DATA MINING BUSINESS APPLICATIONS

Data Science Certificate Program (18 credits):
• 15 CREDIT DATA SCIENCE CORE
plus
• 3 CREDIT ELECTIVE
2014 Data Science Cohort

EDUCATIONAL FOUNDATION
QUANTITATIVE/COMPUTATIONAL BACKGROUNDS
PROGRAMMING WITH DATA STRUCTURES AND ALGORITHMS FOR COMPUTATIONAL SKILLS
QUANTITATIVE SKILLS
CALCULUS, LINEAR ALGEBRA AND STATISTICS

10% FULBRIGHT SCHOLARS

EMPLOYMENT HISTORIES
SENIOR RESEARCH ANALYST
SENIOR BUSINESS ANALYST
PATIENT FINANCIAL SERVICES
DATA BASE ANALYST-ARCHITECT
DECISION SCIENTIST
MINISTRY OF FINANCE
LAHEY HEALTH
TECHNICAL PROGRAM MANAGEMENT
U.S. DEPARTMENT OF STATE

GENDER
66.70% Male
33.3% Female

NATIONALITY
CAMBODIA
INDIA
CHINA
PAKISTAN
TAIWAN
IRAN
U.S.A.
BRAZIL
NEPAL
AFGHANISTAN
INDONESIA
2014 Data Science Cohort

FALL 2014
Total Applicants 126
Total acceptances 33
Fulbright Scholars 3
Brazil Science Mobility Student 1
Countries Represented 9
Domestic Students 5
International Students 28

Many hold more than one earned Bachelor’s Degree
US Universities include Columbia, UNH and WPI
Dean Oates gave two Awards of $5K to outstanding students.
These awards help attract top students.
Skills Acquired by Our Students
Fundamental/Technical:
SQL / Data Modeling / Cleaning
Data Integration / Warehousing
Statistical Learning / Machine Learning
Distributed Computing
Big Data Management
Classif./Regression/DecisionTrees
Business Intelligence
Distributed Mining Algorithms

Tools:
Oracle / MySQL / DB2 / SQLServer
R / SAS / SciKit
Weka / RapidMiner / MATLAB
IBM Cognos / SPSS Modeler
Hadoop / Mahout / Cassandra
Python / Java / Cloud Computing
Storm / Spark / InfoSphere Streams
Spotfire / Tableau

Professional Skills:
Story Telling / Visualization
Business Use Cases / Entrepreneurship
Presentations / Reports
Interdisciplinary Teams / Leadership
Data Science Tools for Students:
Free!
Software:
• Python: http://www.python.org/
  • iPython: http://ipython.org/
  • Numpy: http://www.numpy.org/
  • Pandas: http://pandas.pydata.org/
  • Matplotlib: http://matplotlib.org/
  • Mayavi: http://mayavi.sourceforge.net/
  • Scikit-learn: http://scikit-learn.org/stable/

Data:
• UCI Machine learning repository: http://archive.ics.uci.edu/ml/
• Kaggle: https://www.kaggle.com/
• U.S. Government: https://www.data.gov/