
www.pwc.com

Big Data Analytics


Learning Lab 1

UN Data Innovation Lab 4


University of Nairobi
March 13-14, 2017
Agenda

I. Introduction to Big Data: What it is and why it matters
II. Big Data Analytics: Putting Big Data to work
III. Creating a Big Data-Enabled Organization: Bringing Big Data Analytics home
IV. Case Study: 'Nowcasting' economic activity in Colombia

PwC
01: Introduction to Big Data (What it is and Why it Matters)
What is Big Data?
“Big Data” exceeds the capacity of traditional analytics and information management
paradigms across what is known as the 4 V’s: Volume, Variety, Velocity, and Veracity

• Volume (Scale of Data): Reflects the size of a data set. New information is generated daily and in some cases hourly, creating data sets that are measured in terabytes and petabytes.

• Variety (Different Forms of Data): Represents the diversity of the data. Data sets vary by type (e.g. social networking, media, text) and in how well they are structured.

• Velocity (Analysis of Streaming Data): The speed at which data is generated and used. New data is created every second, and in some cases it may need to be analyzed just as quickly.

• Veracity (Uncertainty of Data): With exponential increases of data from unfiltered and constantly flowing sources, data quality often suffers, and new methods must find ways to "sift" through junk to find meaning.

PwC 4
The Promise of Big Data
Even more important than its definition is what Big Data promises to achieve:
intelligence in the moment.
Traditional Techniques & Issues vs. Big Data Differentiators:

Veracity
• Traditional: Does not account for biases, noise, and abnormality in data.
• Big Data: Data is stored and mined meaningful to the problem being analyzed; keeps data clean, with processes to keep 'dirty data' from accumulating in your systems.

Velocity
• Traditional: No real-time analysis.
• Big Data, in real time: Dynamically analyze data; consistently integrate new information; auto-delete unwanted data to ensure optimal storage.

Variety
• Traditional: Compatibility issues; advanced analytics struggle with non-numerical data.
• Big Data: Frameworks accommodate varying data types and data models; insightful analysis with very few parameters.

Volume
• Traditional: Analysis is limited to small data sets; analyzing large data sets = high costs & high memory.
• Big Data: Scalable for huge amounts of multi-sourced data; facilitation of massively parallel processing; low-cost data storage.
PwC 5
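The "Velocity" differentiator above can be made concrete with a short sketch: an online (streaming) computation that updates summary statistics one record at a time, instead of re-running a batch job over the full data set. This is an illustrative Python example only; the class name and sample readings are invented, not taken from the deck.

```python
# Illustrative sketch of "analyze data as it arrives": Welford's online
# algorithm updates mean and variance incrementally, one record at a time.

class RunningStats:
    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self._m2 = 0.0  # running sum of squared deviations from the mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self._m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self._m2 / self.n if self.n else 0.0

stats = RunningStats()
for reading in [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]:
    stats.update(reading)  # each record is absorbed as it streams in

print(stats.mean)      # 5.0
print(stats.variance)  # 4.0
```

Because each update is O(1), the same code keeps working whether eight readings arrive or eight billion, which is the point of velocity-oriented design.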
Types of Big Data
Variety is the most distinctive aspect of Big Data. New technologies and new types of data
have driven much of the evolution around Big Data.

• Social Media: Twitter, LinkedIn, Facebook, Tumblr, Blog, SlideShare, YouTube, Google+, Instagram, Flickr, Pinterest, Vimeo, WordPress, IM, RSS, Review, Chatter, Jive, Yammer, etc.

• Media: Images, videos, audio, Flash, live streams, podcasts, etc.

• Docs: XLS, PDF, CSV, email, Word, PPT, HTML, HTML 5, plain text, XML, JSON, etc.

• Sensor data: Medical devices, smart electric meters, car sensors, road cameras, satellites, traffic recording devices, processors found within vehicles, video games, cable boxes, assembly lines, office buildings, cell towers, jet engines, air conditioning units, refrigerators, trucks, farm machinery, etc.

• Public Web: Government, weather, competitive, traffic, regulatory, compliance, health care services, economic, census, public finance, stock, OSINT, the World Bank, SEC/EDGAR, Wikipedia, IMDb, etc.

• Machine Log Data: Event logs, server data, application logs, business process logs, audit logs, call detail records (CDRs), mobile location, mobile app usage, clickstream data, etc.

• Archive: Archives of scanned documents, statements, insurance forms, medical records and customer correspondence, paper archives, and print stream files that contain original systems of record between organizations and their customers.

• Business Apps: Project management, marketing automation, productivity, CRM, ERP, content management systems, HR, storage, talent management, procurement, expense management, Google Docs, intranets, portals, etc.
PwC 6
"Single sources of data are no longer sufficient to cope with the increasingly complicated problems in many policy arenas." 1

Big data "is not notable because of its size, but because of its relationality to other data. Due to efforts to mine and aggregate data, Big Data is fundamentally networked." 2

(1) M. Milakovich, “Anticipatory Government: Integrating big data for Smaller Government”, in Oxford Internet Institute “Internet, Politics, Policy 2012” Conference, Oxford, 2012
(2) D. Boyd and K. Crawford, “Six Provocations for big data,” in A Decade in Internet Time: Symposium on the Dynamics of the Internet and Society, 2011

PwC 7
Why is Big Data valuable?
We have identified 5 key areas where Big Data is uniquely valuable:

• Accessibility to Data: Enhanced visibility of relevant information and better transparency to massive amounts of data. Improved reporting to stakeholders.

• Decision Making: Next-generation analytics can enable automated decision making (inventory management, financial risk assessment, sensor data management, machinery tuning).

• Marketing Trends: Segmentation of population to customize offerings and marketing campaigns (consumer goods, retail, social, clinical data, etc.).

• Performance Improvement: Exploration for, and discovery of, new needs can drive organizations to fine-tune for optimal performance and efficiency (employee data).

• New Business Models/Services: Discovery of trends will lead organizations to form new business models, adapting by creating new service offerings for their customers. Intermediary companies with big data expertise will provide analytics to third parties.

PwC 8
$1 Trillion: One study estimated the potential value of big data in U.S. health care, European public sector administration, global personal location data, U.S. retail, and global manufacturing to be over $1 trillion U.S. dollars per year.1

$41 Billion: Another study estimated the value of big data in the areas of customer intelligence, supply chain intelligence, performance improvements, fraud detection, and quality and risk management to be $41 billion per year in the UK alone.2
(1) J. Manyika, M. Chui, B. Brown, J. Bughin, R. Dobbs, C. Roxburgh and A. H. Byers, “Big data: The next frontier for innovation, competition, and productivity,” McKinsey & Company,
2011.
(2) Centre for Economics and Business Research, “Data equity: unlocking the value of big data,” SAS, 2012.
PwC 9
Not to be confused with…

• Big Data: Structured, semi-structured, or unstructured information distinguished by one or more of the four "V"s: Veracity, Velocity, Variety, Volume.

• Open Data: Public, freely available data.

• Crowdsourced Data: Data collected through contributions from a large number of individuals.

Graphic and definitions based on “Big Data in Action for Development,” World Bank, worldbank.org

PwC 10
02: Big Data Analytics (Putting Big Data to Work)
It's not just about the data…
It is important to understand the distinction between Big Data sets (large, unstructured, fast, and uncertain data) and 'Big Data Analytics'.

Big Data refers to the DATA only; Big Data Analytics refers to methods of using Big Data to generate insight:

1. Machine Learning/Deep Learning: Leveraging a computer's ability to learn without being explicitly programmed to solve business problems.

2. IoT (Internet of Things) & Sensor Analytics: Understanding value drivers from the ever-growing network of connected physical objects and the communication between them.

3. Modeling Willingness-to-Pay: Mining product reviews to estimate willingness-to-pay for product features.

4. Natural Language Processing: Understanding human speech as it is spoken through application of computer science, AI, and computational linguistics.

5. Analyzing Data @ Scale: Using distributed computing and machine learning tools to analyze hundreds of gigabytes of data.

6. Creating a Streaming Consumer Behavior Data Lake: Mining social data in real time to understand when and where consumers are making choices.
PwC 12
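Method 1, machine learning, can be illustrated with a minimal sketch: a k-nearest-neighbour classifier that "learns" labels from examples rather than from hand-coded rules. The customer features, segments, and function name below are invented for illustration; they are not from the deck or any specific library.

```python
# Hypothetical sketch of "learning without being explicitly programmed":
# classify a new point by majority vote among its k closest labelled examples.

from collections import Counter

def knn_predict(train, query, k=3):
    """train: list of (features, label) pairs; query: a feature tuple."""
    dist = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    nearest = sorted(train, key=lambda ex: dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Toy "customer" examples: (monthly_spend, visits) -> segment label.
train = [((10, 1), "low"), ((12, 2), "low"), ((90, 9), "high"),
         ((95, 8), "high"), ((15, 1), "low"), ((85, 10), "high")]

print(knn_predict(train, (88, 9)))  # "high"
print(knn_predict(train, (11, 2)))  # "low"
```

No rule such as "spend over 50 means high" was ever written; the boundary emerges from the labelled examples, which is the essence of the method.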
… It’s also about what, how, and why you use it
Big Data Analytics – the process of harnessing Big Data to yield actionable insights – is a
combination of five key elements:

• Decisions: The value of Big Data Analytics is driven by the unique decisions facing leaders, companies, and countries today. In turn, the type, frequency, speed, and complexity of decisions drive how Big Data Analytics is deployed.

• Analytics: To leverage the variety and volume of Big Data while managing its volatility, advanced analytical approaches are necessary, such as natural language processing, network analysis, simulative modeling, artificial intelligence, etc.

• Data: Big Data Analytics is about operationalizing new and more data, but it is also about data quality, data interoperability, data disaggregation, and the ability to modularize data structures to quickly absorb new data and new types of data.

• Technology: To store, manage, and use Big Data often requires investments in new technologies and data processing methods, such as distributed processing (e.g., Hadoop), NoSQL storage, and Cloud computing.

• Mindset & Skills: Big Data Analytics requires firm commitment to using analytics in decision-making; a decisive mentality capable of employing in-the-moment intelligence; and investment in analytical technology, resources, and skills.

PwC 13
Big Data Analytical Capabilities
Continuing increases in processing capacity have opened the door to a range of advanced
algorithms and modeling techniques that can produce valuable insights from Big Data.

The capabilities span a grid from structured to unstructured data, and from traditional to emerging techniques:

Traditional:
• Regression Analysis: Discover relationships between variables.
• Time Series: Discover relationships over time.
• Signal Analysis: Distinguish between noise and meaningful information.
• Cluster Analysis: Discover meaningful groupings of data points.
• A/B/N Testing: Experiment to find the most effective variation of a website, product, etc.
• Simulation Modeling: Experiment with a system virtually.
• Spatial Analysis: Extract geographic or topological information.
• Classification: Organize data points into known categories.

Emerging:
• Visualization: Use visual representations of data to find and communicate information.
• Predictive Modeling: Use data to forecast or infer behavior.
• Complex Event Processing: Combine data sources to recognize events.
• Sentiment Analysis: Extract consumer reactions based on social media behavior.
• Optimization: Improve a process or function based on criteria.
• Network Analysis: Discover meaningful nodes and relationships on networks.
• Deep QA: Find answers to human questions using artificial intelligence.
• Natural Language Processing: Extract meaning from human speech or writing.

PwC * For more information on these analytic methods, see Appendix. 14
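The first capability in the grid, regression analysis, can be sketched from first principles in a few lines. The data points are synthetic and the helper name is ours; real work would typically use a statistics library rather than hand-rolled formulas.

```python
# Minimal sketch of regression analysis: ordinary least squares for the
# line y = a + b*x, computed directly from the textbook formulas.

def ols_fit(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b  # intercept, slope

xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]  # roughly y = 2x, with noise

a, b = ols_fit(xs, ys)
print(round(b, 2))  # slope close to 2
```

The fitted slope quantifies the "relationship between variables" the slide describes: here, roughly two units of y per unit of x.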


Forward-Looking vs. Rear-View Analytics
Big Data Analytics improves the speed and efficiency with which we understand the past,
and opens up entirely new avenues for preparing for and adapting to the future.

Rear-view:

• Descriptive Analytics: What happened? Describe, summarize, and analyze historical data.
  - Observed behavior or events
  - Non-traditional data sources such as social listening and web crawling

• Diagnostic Analytics: Why did it happen? Identify causes of past trends and outcomes.
  - Observed behavior or events
  - Non-traditional data sources such as social listening and web crawling
  - Statistical and regression analysis
  - Dynamic visualization
  - Behavioral economics

Forward-looking:

• Predictive Analytics: What could happen? Predict future outcomes based on past data.
  - Real-time product and service propositions (graph analysis, entity resolution on data lakes to infer present customer need)
  - Sentiment scoring
  - Graph analysis and Natural Language Processing to identify hidden relationships and themes

• Prescriptive Analytics: What should be done? Recommend 'right' or optimal actions or decisions.
  - Forward-looking view of current and future value
  - Rapid evaluation of multiple 'what-if' scenarios
  - Optimization decisions and actions
  - Dual objective models

• Continuous Analytics: How do we adapt to change? Monitor, decide, and act autonomously or semi-autonomously.
  - Monitor results on a continuous basis
  - Dynamically adjust strategies based on a changing environment and improved predictions
  - Agent-based and dynamic simulation models, time-series analysis

Business value and the sophistication of data and analytics both increase from descriptive to continuous analytics.
PwC 15
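The descriptive and predictive rungs of the ladder can be illustrated on a toy monthly series. The numbers are synthetic and the "trend holds" assumption is a deliberate simplification: descriptive analytics summarizes what happened, while a naive predictive step extrapolates what could happen.

```python
# Illustrative sketch of descriptive vs. predictive analytics on synthetic data.

sales = [100, 104, 109, 113, 118, 122]  # six months of observations

# Descriptive: what happened?
total = sum(sales)
mean = total / len(sales)

# Diagnostic flavour: how is it moving? Average month-over-month change.
deltas = [b - a for a, b in zip(sales, sales[1:])]
trend = sum(deltas) / len(deltas)

# Predictive: what could happen next month, assuming the trend holds?
forecast = sales[-1] + trend

print(mean)      # 111.0
print(trend)     # 4.4
print(forecast)  # 126.4
```

Prescriptive and continuous analytics would go further still: choosing an action given the forecast, then re-running the whole loop as each new month of data arrives.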
Examples of Big Data Analytics in Action
Market Leaders are leveraging Big Data Analytics to generate value by starting with a
business need and focusing on implementing actionable insights quickly and decisively

• Business Need: Greater tailoring of credit card offers to fit customer needs.
  Data and Analytics: Statistical model based on public credit and demographic data to target customized products to customers.
  Impact: Net revenue grew at a CAGR of 32% from 1994 to 2003; prompted competitors to shift focus to data and analytics.

• Business Need: Data-enabled engine prognostics, monitoring, maintenance, and repair.
  Data and Analytics: Analysis of sensor data from hundreds of sensors in 4,000 engines to identify and solve issues weeks in advance.
  Impact: Over 70% of annual revenue from the aircraft engine division attributable to this service.

• Business Need: Search-to-purchase conversion by anticipating the intent of a shopper's search and delivering relevant results.
  Data and Analytics: Semantic search, which enables discovery using algorithms that rank results via social signals from around the web.
  Impact: Increases by 10-15% the likelihood that a customer will complete their purchase, translating to millions of dollars in revenue.

• Business Need: Transformation from subscription streaming service to original content producer.
  Data and Analytics: Analysis of data from 66 million subscribers' viewing habits and preferences.
  Impact: Revenue and subscriber base increased by 15% and 9% respectively in 2013.

• Business Need: Leverage the Internet of Things (IoT) by connecting machines to facilitate data-enabled prognostics, increase efficiency, and reduce downtime.
  Data and Analytics: Launched software to help airlines and railroads move their data to the cloud and predict mechanical malfunctions, improve safety, and reduce trip cancellations and cost.
  Impact: Estimated 1% reduction in fuel costs, projected to save the airline industry $30 billion over 15 years.
PwC 16
Big Data Analytics in Development
Big Data Analytics is making an equally impressive impact on Development interventions
– allowing decision-makers to reach and serve previously neglected populations.

• Need: More transparent, reliable, and low-cost method to track inflation in Argentina.
  Data and Analytics: Web scraping of online price data used to produce price indices, and econometric analysis used to model disaggregated impacts of policies.
  Impact: Government statistical offices shifting to accept Big Data; central banks using Big Data to see day-to-day volatility.

• Need: Understand how migrants act as arbitrageurs to bring labor markets into equilibrium.
  Data and Analytics: Iterative analysis of call detail records (CDRs) to track movement of migrants in response to local shocks to labor demand (weather, economy, conflict, etc.).
  Impact: Informing labor policy design in low-income countries to incentivize or disincentivize migratory behavior.

• Need: The city of Rio de Janeiro wanted to improve its emergency response by better predicting heavy rainfall and subsequent severe landslides and flooding.
  Data and Analytics: The city combines data from 30 city agencies (including weather, satellite, video, GPS, historic rainfall, and topographic survey data) in a central Operations Center.
  Impact: Rio has improved emergency response time by 30%, catalogued 200+ flood points, and can now predict heavy rains 48 hours in advance on a half-km basis.

• Need: Create a better ecosystem for mobile services in the agricultural sectors of Kenya, Tanzania, and Mozambique.
  Data and Analytics: Remote crowdsourced data gathered via cell phones used to connect farmers to markets, assess farmers' credit worthiness, and incubate new mobile businesses with greater predictors of success.
  Impact: M-PESA is being used to lower costs for farmers to receive loans and perform transactions with distributors and buyers, as well as to provide geography-specific market information.
PwC 17
03: Creating a Big Data-Enabled Organization (Bringing Big Data home)
Step 1: Be Yourself
Beginning with a clear understanding of the specific questions you intend to use Big Data
Analytics to address can help guide where and which data solutions are deployed.

The pyramid runs from Operational (value enablement) up to Strategic (value enhancement):

• Strategic: Delivering future value
  - Data-driven decision-making in real time
  - Use analytics to develop new programs/opportunities
  - Relies heavily on data supplied by others
  - Often struggles to move away from exclusively intuitive decision-making

• Tactical: Enabling strategy and improving performance
  - Use analytics to reduce political divergence and drive consensus
  - Real-time analytics to enable quick responses to events
  - Use data to develop personalized services
  - Need for more objective and higher quality data

• Operational: Day-to-day operations
  - Struggle to move from a narrow focus on reactive operations to more proactive, comprehensive management of daily operations
  - High value for digitization of operational processes across program units
  - Often already proficient in traditional business intelligence

PwC 19
Step 2: Secure People & Skills
The competencies required of “data scientists” within an analytics organization or project
converge from multiple skill domains.

• Statistical & Mathematical: Expertise in statistical techniques, tools, and languages used to run analyses that generate insights, and to effectively determine and communicate actionable insights.

• Subject Area or Domain Expertise: Deep understanding of the industry, subject area, or research domain, to help determine which questions need answering and at what frequency, specificity, or geography.

• Computer Science & Programming: Comfort in programming across various languages; a thorough understanding of external and internal data sources; data gathering, storing, and retrieving.

• Information Knowledge: Organization-specific knowledge about data assets (including enterprise "metadata"), their location, and the appropriate business context for use in advanced analytics; organization-specific methods which help combine disparate data sources to generate unique insights.

PwC 20
Step 3: Let objectives dictate structure, not vice versa
How analytics efforts or organizations are structured – whether reporting is vertically or
horizontally aligned, how interconnected or autonomous separate units are, how resources
and successes are shared – can influence efficiency and impact.
Three common structures (Distributed, Federated, and Centralized Analytics), each built around a central Analytics Competency Center and a stack of metadata repository, ETL, data warehouse, data marts, and BI applications:

• Objectives
  - Distributed: Adopt previously proven practices; highly focused analytics support
  - Federated: Subject area-specific innovations; repeatable models
  - Centralized: Governance; aligning analytics to organization-wide strategy

• Data Warehouses, Marts, etc.
  - Distributed: Deployed locally
  - Federated: Deployed locally; some data and models shared across groups
  - Centralized: Deployed and managed centrally

• Analytics Tools
  - Distributed: Managed locally
  - Federated: Managed locally, but connected to a group framework
  - Centralized: Controlled centrally, with units having access to shared resources

• Analytics Staff/Competencies
  - Distributed: Placed within individual units
  - Federated: Placed within individual units; skills tailored to a specific region or subject matter
  - Centralized: Placed within a central analytics team, available as needed to support individual units
PwC 21
The 'Hub-Spoke' operating model often serves as a well-synchronized, connected system.

[Diagram: Sample Hub-Spoke Interaction Model]
1. Central Decision Hub: Global Business Strategy ('Standards'/'Standardization')
2. Competency Center
3. Centers of Excellence (Regional)
4. Local 'Spokes': Local Business Operations and Adoption of Practices
PwC 22
Step 4: Invest in Appropriate Infrastructure
Big Data introduces challenges related to data volume and variety, processing constraints, and new data structures that traditional data infrastructure is not equipped to support.

• Analytics Capabilities. Objective: Identify the type of analysis that will be conducted and define which analytics capabilities will be employed.
  - Analysis Type: Dictates performance needs, along with data structures and processing architecture
  - Analysis Flexibility: The interface could restrict the ability to perform ad hoc analysis and the ability to update
  - Analysis Structures: Support for analysis-specific data structures can improve performance and reduce analysis effort

• Data. Objective: Define the data set that will be used for the analysis, including its sources, size, and structure.
  - Size: The size of data sets introduces the need for scalable infrastructure and performance
  - Structure: Variability of source data models and data set structure requires data model flexibility
  - Sources: Diverse sources will require scalability, model flexibility, and flexible interfaces

• Application. Objective: Define the timeliness and frequency of the analysis results for reporting and downstream systems.
  - Frequency: The frequency of analysis will dictate the processing architecture (batch or real time)
  - Speed: The timeliness of the analysis will impact the need for scalability and performance
  - Interfaces: Inbound and outbound interfaces are defined by the use of data and the required flexibility
PwC 23
Emerging Infrastructure Options
To harness Big Data, storage solutions must be able to support targeted analytics
capabilities, data diversity and performance needs

• Distributed Processing: Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware.

• NoSQL: Embedded and persisted storage that implements data models through document, graph, and dictionary structures.

• Cloud Computing: Cloud computing can improve flexibility, scalability, and cost management, and enable a cohesive business strategy across an organization.

Traditional challenges being addressed:

• Scalability issues: Big Data information extraction and queries require large volumes of processing cycles that must scale quickly.
• Data storage solutions need to provide flexible data models to better ingest unstructured and semi-structured data.
• Need to combine and link multiple data sources.

PwC * For more information on these infrastructure options, see Appendix. 24


PwC 25
Summary: Key Guiding Principles for developing a best-in-class analytics organization

Guiding Principles – Illustrative, May Be Customized

1. Establish the Analytics organization as an objective advisor for insight generation.

2. Ensure responsiveness to business needs by balancing 'consolidation' with 'distribution' of analytics functionality where it makes sense.

3. Innovate, invest in, and build new analytics capabilities, and gradually push them out to the business as user sophistication matures (e.g., data visualization).

4. Prioritize strategic business value delivery over tactical outputs.

5. Ensure adequate attention to user experience.

6. Focus on speed, accuracy, and reusability.

7. Optimize and manage work-flow to achieve maximum resource efficiency.

8. Allow distributed analytics where it makes sense, but tightly govern and ensure cataloguing.

9. Ensure a consistent feedback loop for all outputs that are created.

PwC 26
04: Case Study ('Nowcasting' Economic Activity in Colombia)
Situation

In Colombia, the leading economic indicators used to analyze economic activity have an average lag of 10 weeks. This presents challenges for the well-timed design of economic policy and the monitoring of economic shocks or trends.

The Colombian Ministry of Finance looked for coincident indicators that would allow tracking of the short-term trends of economic activity.

Characteristics of Data Needed:
• Real-time
• Highly disaggregated – by sector, geography, etc.
• Statistically correlated with key economic trends (consumption, GDP, etc.)
• A sample robust enough to be representative of the economy as a whole

PwC 28
Group Discussion

What Big Data sources could the Colombian Ministry of Finance


potentially use to reliably approximate sectorial economic activity in
real-time?

PwC 29
Brainstorming Breakout

In groups of 3-4, take five-ten minutes to brainstorm how the Ministry


could approach answering the following questions:
• What data should it consider using?
• Is this data the Ministry already has available, or will this require the Ministry to
acquire an entirely new source of data?
• How does the cost of acquiring this data – whether by their own collection or through
external data partnership – compare to the expected benefits of using it?
• If this data is new to the Ministry, what entities may already have this data in
possession?
• How might the Ministry ensure its staff have the skills necessary to acquire, manage,
and use this data? Is this data uniquely complex such that it may require more
advanced or entirely new skillsets?
• What should the Ministry consider in the way of data storage and security? How
extensively may it be required to overhaul data storage infrastructure to accommodate
using this data?

PwC 30
Solution

Based on web searches performed by Google users, Google Trends (GT) provides daily
information about the query volume for a given search term in a given geographic region.
For Colombia, GT data are available at the departmental level and also for the largest
municipalities.
The Colombian Administrative Department for National Statistics (DANE – for its
acronym in Spanish) combined indexes built using GT data with its own official economic
activity data (both at the aggregate level and at the sectorial level) – both of which are
publicly available – to construct leading indicators that determine, in real-time, the short-
term trend of different economic sectors, as well as their turning points.
In some sense, the GT data takes the place of traditional consumer-sentiment surveys. For
example, the use of data for a certain keyword (such as the brand of a certain product)
might be justified if a drop or surge in web searches for that keyword can be linked to a
fall or increase in its demand and, therefore, to lower or higher production in the specific
sector producing that product.

PwC 31
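The validation step this approach implies, checking whether a search-volume index actually tracks an official indicator, amounts to computing a correlation coefficient. A minimal sketch in Python follows; both series below are synthetic stand-ins invented for illustration, and real Google Trends and DANE data would take their place.

```python
# Illustrative check: Pearson correlation between a toy search-volume index
# and a toy official indicator. All numbers here are synthetic.

from math import sqrt

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

gt_index     = [40, 42, 47, 55, 60, 58, 52, 45]            # hypothetical search volume
unemployment = [8.9, 9.0, 9.4, 10.1, 10.6, 10.4, 9.8, 9.2] # hypothetical rate, %

r = pearson(gt_index, unemployment)
print(round(r, 2))  # strongly positive on this toy data
```

A high coefficient on historical data is what justifies using the freely available, real-time search series as a coincident proxy for the lagging official one.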
Example: “Ahorro” vs. Unemployment Rate

Ahorro – savings

PwC 32
Example: “Ahorro” vs. Unemployment Rate

These trends showed a high coefficient of correlation with traditional measures of unemployment.

PwC 33
Example: “Zapatos” vs. Employment Rate

Zapatos – shoes

PwC 34
Example: “Zapatos” vs. Employment Rate

These trends showed a high coefficient of correlation with traditional measures of employment.

PwC 35
Find Out More

Melanie Thomas Armstrong
Leading Partner, International Public Sector
+1 (202) 320-7098
Melanie.Thomas.Armstrong@us.pwc.com

Jean Young
Managing Director, International Public Sector Data Analytics
+1 (703) 918-1001
Jean.M.Young@us.pwc.com

Bill Stephens
Director, International Public Sector Data Analytics
+1 (703) 635-0800
William.L.Stephens.Jr@us.pwc.com

Mariola Pogacnik
Director, United Nations & International Public Sector
+1 (646) 471-5467
Mariola.Pogacnik@us.pwc.com

Ashraf Faramawi
Manager, International Public Sector Data Analytics
+1 (202) 271-5711
Ashraf.Faramawi@us.pwc.com

Jared Nyarumba
Manager, Data Analytics, Africa
+254 710 623 426
Jared.Nyarumba@ke.pwc.com

This publication has been prepared for general guidance on matters of interest only, and does not constitute professional advice. You should not act upon the
information contained in this publication without obtaining specific professional advice. No representation or warranty (express or implied) is given as to the
accuracy or completeness of the information contained in this publication, and, to the extent permitted by law, PricewaterhouseCoopers LLP, its members,
employees and agents do not accept or assume any liability, responsibility or duty of care for any consequences of you or anyone else acting, or refraining to
act, in reliance on the information contained in this publication or for any decision based on it.

© 2017 PricewaterhouseCoopers LLP. All rights reserved. In this document, “PwC” refers to PricewaterhouseCoopers LLP which is a member firm of
PricewaterhouseCoopers International Limited, each member firm of which is a separate legal entity.
PwC 36
Appendix: Emerging Data Storage and Infrastructure Options
Building an Analytics Organization: Critical Components
Emerging Infrastructure – Data Storage Options

• Distributed Processing: Hadoop and similar solutions that provide scalable distributed storage and distributed computation on commodity hardware.

Introduction to Hadoop:
• Hadoop is based on work done by Google in the early 2000s (a combination of the Google File System (GFS) and MapReduce).
• Useful for analyzing copious amounts of complex data across multiple data sources.
• Distributes data as it is initially stored in the system.
• Applications are written in high-level code.
• Computation happens where the data is stored, whenever possible.
• Data is replicated multiple times on the system for increased availability and reliability.

Benefits: faster and lower-cost analysis, linear scalability, and greater flexibility.
PwC 39
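The computation model named above, MapReduce, can be simulated in a few lines of ordinary Python. This toy, single-process sketch illustrates the map → shuffle → reduce pattern only; it is not the Hadoop API, which distributes the same phases across many machines.

```python
# Toy word count in the MapReduce style: map emits (key, value) pairs,
# the shuffle groups values by key, and reduce aggregates each group.

from collections import defaultdict

def map_phase(line):
    return [(word, 1) for word in line.lower().split()]

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big analytics", "big data at scale"]
pairs = [pair for line in lines for pair in map_phase(line)]
counts = reduce_phase(shuffle(pairs))

print(counts["big"])   # 3
print(counts["data"])  # 2
```

Because each map call and each reduce call is independent, a framework like Hadoop can run them in parallel on whichever node already holds the data, which is exactly the "computation happens where data is stored" point above.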
Distributed Storage and Analytics: Hadoop vs. Traditional Data
Stores
Compared to traditional data stores, Hadoop provides greater flexibility when
it comes to storing data and scaling to meet demand.
Hadoop vs. Traditional Data Stores:

• Data Structure: Hadoop supports both structured and unstructured data; traditional stores support only structured data.

• Data Size: Hadoop is effectively unlimited; traditional stores are limited depending on the selected RDBMS.

• Data Formats: Hadoop supports various serialization and data formats (e.g. text, JSON, XML, etc.); traditional stores support a single tabular data format.

• Scaling: Hadoop offers distributed scaling from the ground up (simply add more nodes to increase capacity); traditional scaling is possible, but it is typically more complex and cannot be performed at the node level.
Sources: http://hadoop.apache.org/
PwC 40
Building an Analytics Organization: Critical Components
Emerging Infrastructure – Data Storage Options

NoSQL: embedded and persisted stores that implement data models through
document, graph, and dictionary structures

NoSQL - Storage Types (in order of increasing data complexity)

• Key-Value Store. Pros: simplicity and scalability. Cons: lack of advanced features/queries.
• Columnar Store. Pros: scalability and flexibility. Cons: complexity.
• Document Store. Pros: easy to use. Cons: scalability.
• Graph Store. Pros: graph joins. Cons: flexibility.

Solution Examples

PwC 41
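At the simplest end of this spectrum, a key-value store is essentially a persisted dictionary that treats each value as an opaque blob. The sketch below is our own toy illustration of the get/put contract, not a production engine such as the solutions listed above:

```python
import json

class TinyKVStore:
    """Toy key-value store: opaque values addressed only by key."""
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        # Values are serialized blobs; the store never inspects them,
        # which is what keeps key-value stores simple and scalable
        self._data[key] = json.dumps(value)

    def get(self, key, default=None):
        raw = self._data.get(key)
        return json.loads(raw) if raw is not None else default

store = TinyKVStore()
store.put("user:42", {"name": "Amina", "city": "Nairobi"})
print(store.get("user:42")["city"])   # Nairobi
```

Because the store never interprets values, it cannot answer queries over their contents; document, columnar, and graph stores trade some of this simplicity for richer query capabilities.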
Building an Analytics Organization: Critical Components
Emerging Infrastructure – Data Storage Options

Cloud Computing: the model is compelling. Cloud computing can improve flexibility,
scalability, and cost management. The businesses best able to realize its potential
will establish a cohesive business strategy, because cloud computing can transform
your entire organization: people, processes, and systems.

• Cloud transformation begins at the infrastructure level and leads to more agile applications, resulting in faster speed to market and more flexibility to meet client needs.

• The key benefits, beyond consolidation, include standardized application and development environments, resulting in better controlled and more efficient application lifecycles.

Source: PwC, “Digital IQ Snapshot: Cloud,”; PwC, “FS Viewpoint: Clouds is the forecast”

PwC 42
Text Mining and
Natural Language
Processing
Data Mining, Text Mining, and Natural Language Processing
What are they and how are they used?

• Data Mining: extraction of implicit, previously unknown, and potentially useful information from data.
• Text Mining: analysis of large quantities of natural language text and detection of lexical or linguistic usage patterns to extract probably useful information.
• Natural Language Processing: a theoretically motivated range of computational techniques for analyzing and representing naturally occurring texts at one or more levels of linguistic analysis, for the purpose of achieving human-like language processing for a range of tasks or applications.
Source: Text Mining, Ian Witten, 2004

PwC 44
Natural Language Processing and Text Mining
What are they and how are they used?

Natural Language Processing

Purpose and Overview: NLP (Natural Language Processing) applies statistical or
rules-based computational techniques to evaluate and model texts at various levels
of linguistic analysis, in order to identify key concepts, enable intelligent
processing, and draw inferences.

Objectives
• Deep analysis and structuring of individual texts through phrase identification, part-of-speech tagging, and word disambiguation
• Identification of a text's message or meaning through the use of linguistic analysis:
  • Syntactic: sentence structure or breakdown
  • Lexical: meaning of words within the context of use
  • Semantic: logical meaning of phrases or text
  • Discourse: connections among sentences and phrases that define the topic
• Generate natural language sentences or texts as a response to an input/question, using a context text or knowledge base

Text Mining

Purpose and Overview: text mining represents a system of statistical analysis and
classification algorithms that are employed to explore groups of natural language
texts and identify useful patterns, relationships, and knowledge.

Objectives
• Use of data mining techniques and statistical methods to conduct a shallow analysis of groups of documents and make accessible the knowledge within structured/semi-structured texts
• Develop a structured view of a document's contents in order to develop linkages among texts for classification, categorization, knowledge discovery, and search
• Conduct a statistical analysis of word/sentence usage and attributes in order to identify key phrases, summarize texts, extract information from groups of texts, and discover new knowledge using the information within texts
PwC 45
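The "shallow analysis" that text mining performs over groups of documents can be as simple as tokenizing and counting content words. A minimal sketch (our own helper names, with the stop-word list abbreviated for illustration):

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "of", "and", "to", "in", "is"}  # abbreviated for illustration

def tokenize(text):
    # Lowercase and keep only runs of letters: a crude but common first step
    return re.findall(r"[a-z]+", text.lower())

def key_terms(documents, top_n=3):
    # Count content words across the whole document group
    counts = Counter(
        tok for doc in documents for tok in tokenize(doc)
        if tok not in STOP_WORDS
    )
    return [term for term, _ in counts.most_common(top_n)]

docs = [
    "The analysis of large quantities of text.",
    "Text mining is the analysis of natural language text.",
]
print(key_terms(docs))
```

Real toolkits add weighting (e.g. TF-IDF), stemming, and phrase detection on top of this same counting foundation.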
NLP Tools
Tools and APIs that provide capabilities to parse and structure natural
language texts for machine analysis

• OpenNLP: a machine learning based toolkit for the processing of natural language text. Analysis: tokenization, sentence segmentation, part-of-speech tagging, named entity extraction, chunking, parsing, coreference resolution.
• GATE: a Java suite of tools that can perform natural language processing tasks for multiple languages. Analysis: information extraction, part-of-speech tagging, tokenization, sentence splitting.
• NLTK: a suite of libraries and programs for symbolic and statistical natural language processing in Python. Analysis: information extraction, part-of-speech tagging, tokenization, word categorization, text classification.
• Stanford NLP: statistical NLP toolkits for various computational linguistics problems that can be incorporated into applications with human language technology needs. Analysis: tokenization, part-of-speech tagging, named entity recognition, parsing, classification, segmentation, coreference resolution.
• LingPipe: a toolkit for processing text using computational linguistics. Analysis: sentiment analysis, entity recognition, clustering, topic classification, part-of-speech tagging, sentence detection, disambiguation.
• MontyLingua: a suite of libraries and programs for symbolic and statistical natural language processing for both Python and Java. Analysis: information extraction, part-of-speech tagging, tokenization, word categorization, text generation, stemming, phrase chunking.
• Rosette Linguistics Platform: a suite of linguistic analysis components that integrate into applications for mining unstructured data. Analysis: language identification; name, place, and key concept extraction; name matching; name translation.
PwC 46
Text Mining/Analytics Tools
Tool kits that provide capabilities for identifying and analyzing features
within individual or groups of texts

• RapidMiner: an open source environment for machine learning, data mining, text mining, predictive analytics, and business analytics. Analysis: document classification, sentiment analysis, topic tracking, data mining, traditional analytics.
• SAS Text Miner: a suite of text processing and analysis tools. Analysis: text parsing, filtering, feature extraction, topic clustering.
• VisualText: an integrated development environment for building information extraction systems, natural language processing systems, and text analyzers. Analysis: information extraction, summarization, categorization, data mining, document filtering, natural language search.
• SAS Sentiment Analysis: a commercial tool dedicated to customer sentiment analysis. Analysis: customer sentiment monitoring, sentiment discovery.
• Textifier: a tool for sorting large amounts of unstructured text with The Public Comment Analysis Toolkit (PCAT). Analysis: topic modeling, information retrieval, document analysis, social media analysis.
• Infinite Insight: a system for automatically preparing and transforming unstructured text attributes into a structured representation. Analysis: term frequency, term frequency-inverse document frequency, customization of stop words, stemming rules, root word coding, concept merging, synonym identification.
• Clustify: software for grouping related documents into clusters, providing an overview of the document set and aiding with categorization. Analysis: document clustering.
PwC 47
Text Mining/Analytics Tools Cont.
Tool kits that provide capabilities for identifying and analyzing features
within individual or groups of texts

• Attensity Analyze: customer analytics applications that help analyze high volumes of customer conversations across multiple channels. Analysis: unstructured communication analysis, sentiment analysis, consumer profiling.
• ReVerb: a program that automatically identifies and extracts binary relationships from English sentences. Analysis: information extraction, topic identification, topic linking.
• Open Text Summarizer: an open source tool for summarizing texts. Analysis: document summarization.
• Open Calais: a web-based API used to analyze content and extract topics or information. Analysis: attribute/feature extraction, fact identification.
• Knowledge Search: a family of techniques and tools for searching and organizing large data collections. Analysis: semantic analysis.
• KH Coder: free software for quantitative content analysis and text mining. Analysis: text parsing, network analysis, document search.
PwC 48
Resources
Tutorials, Tools, Applications, and Research Groups

Link and Description


Text Mining Overview
Text Mining Activities

Text Mining Tutorial


Tutorials and Text Mining Process
Overviews NLP Introduction
NLP Overview
NLP Overview
NLP Concepts
http://ai.cs.washington.edu/projects/open-information-extraction

Research Groups and http://www.nactem.ac.uk/index.php


Papers http://nlp.stanford.edu/
http://research.microsoft.com/en-us/groups/nlp/
NLP Toolkit List
NLP Tools
Tools and Data Sets Text Mining Tools
Tools by Function
Text Mining Tools

PwC 49
DeepQA, Image
Analytics, and
Audio Analytics
DeepQA
Overview and Introduction

What is DeepQA?
• DeepQA forms the core of Watson, the open-domain question analysis and answering system
• The DeepQA stack comprises a set of search, NLP, learning, and scoring algorithms
• DeepQA operates on a distributed computing infrastructure that leverages MapReduce and the Unstructured Information Management Architecture (UIMA)
What is the target problem set?
• Understanding the meaning and context of human language
• Searching and retrieving information from a large library of unstructured information
• Identifying accurate and precise answers to questions that are complex and must be sourced from a large knowledge set

PwC 51
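At a very high level, DeepQA generates many candidate answers and scores each one against evidence retrieved from its corpus, returning the best candidate with a confidence value. The toy sketch below mimics that generate-then-score shape; it is our own drastic simplification (keyword overlap as the only evidence scorer), not IBM's actual algorithms:

```python
def score(candidate_passage, question_terms):
    # Evidence scorer: fraction of question terms found in the passage
    passage_terms = set(candidate_passage.lower().split())
    return len(question_terms & passage_terms) / len(question_terms)

def answer(question, corpus):
    # Candidate generation: every (answer, passage) pair in the corpus;
    # final answer = highest-scoring candidate plus its confidence value
    question_terms = set(question.lower().split())
    best = max(corpus, key=lambda c: score(c["passage"], question_terms))
    return best["answer"], score(best["passage"], question_terms)

corpus = [
    {"answer": "Nairobi", "passage": "nairobi is the capital city of kenya"},
    {"answer": "Mombasa", "passage": "mombasa is a port city in kenya"},
]
ans, confidence = answer("what is the capital of kenya", corpus)
print(ans, round(confidence, 2))
```

The real system runs hundreds of scorers in parallel and combines them with trained models; what carries over is the pipeline shape of candidates, evidence, and confidence.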
DeepQA Infrastructure Technology
Data Management and Search

• Unstructured Information Management Architecture: UIMA
• SQL Server: MySQL, Apache Derby
• Java Natural Language Toolkit: OpenNLP, Stanford NLP
• Map/Reduce: Apache Hadoop
• Commonsense Knowledgebase: OpenCYC, Open Mind Common Sense
• Triple Store: Apache Jena, OpenAnzo
• Text Search: Lucene, OpenFTS

PwC 52
DeepQA Infrastructure Technology
Platform and Administration

• Web Server: Apache
• Virtualization Host: VMware, Xen
• Distributed File System: Apache Hadoop, OpenAFS
• File Management/Archival: rsync
• OS: Fedora
• Cloud Management: Extreme Cloud Administration Toolkit, OpenNebula

PwC 53
Business Applications
DeepQA provides capabilities that can facilitate knowledge discovery, improve
customer interaction, and uncover hidden facts
• Knowledge Discovery: search internal and external unstructured/structured information assets to uncover previously unknown knowledge. Objectives: identify information about a subject through deep analysis of internal and external information sources; answer questions about a business problem or trend that may be difficult to analyze within traditional data sources.
• E-Discovery: search documents and communications to uncover relevant information associated with a specific topic. Objectives: identify business topics and trends within communications and documents; search for non-compliance activities within internal and external data sources.
• Contract Evaluations: search through single or multiple contracts to answer specific questions about the nature of the contract. Objectives: identify key facts or issues that comprise a contract or sets of contracts; identify contracts or legal documents that contain similar entities or features.
• Relationship Management: provide the ability to interact with consumers, providing precise responses to technical and open-domain questions. Objectives: provide a platform for automatically answering consumer questions about products or services; reduce reliance on call centers and improve interaction with consumers.
• Consumer Discovery: search consumer communications, social media, and sales information to identify opportunities and demographics. Objectives: identify background information about consumers; identify consumer qualities that create risks or represent opportunities.
• Technical Troubleshooting: find answers to technical and process problems. Objectives: utilize unstructured data and communications to identify solutions or root causes to system and process problems.

PwC 54
Areas for Further Research
Infrastructure/Tools and Search Technologies/Concepts

Tools
• Hadoop Map/Reduce: the tool is used to distribute queries, analysis, and other processing activities across multiple CPUs. Further research is required to understand the tool's architecture and how to integrate it with other toolkits (OpenNLP, UIMA, Lucene, etc.).
• OpenNLP: a Java library for NLP tasks. Need to evaluate the tool's capabilities and gaps, as well as how it can be incorporated into UIMA.
• OpenCYC: an open common sense reasoning platform. Need to better understand the tool's role, as well as how it fits with the other technologies.
• UIMA: an architecture for managing unstructured data. Further research is needed to understand how to run it in parallel and how the SDK can be applied to NLP activities.
• Lucene: a text search platform. Further research is needed to understand the library and how to incorporate it into UIMA.

Search
• Text Search Scoring: algorithms are used to score search results based on their alignment with the question. Further research is needed to understand what models and scoring metrics can be applied to search results at various phases of DeepQA.
• Triple Store Search: triple stores maintain data in a subject-predicate-object structure and are used for quickly turning around facts. Further research is needed to understand the philosophy and technologies behind these data storage mechanisms.
• Commonsense Reasoning: research is required to understand this branch of AI, its technologies, and its role within DeepQA.
• Document/Information Retrieval: generate research on information and document retrieval practices. Technologies and algorithms need to be reviewed. Falls within a broader research topic for enterprise search.

PwC 55
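The subject-predicate-object model mentioned above is easy to prototype. The sketch below is our own toy code (not Jena or OpenAnzo): it stores facts as triples and answers pattern queries where `None` acts as a wildcard, in the spirit of SPARQL-style matching:

```python
class TinyTripleStore:
    """Toy triple store: facts held as (subject, predicate, object) tuples."""
    def __init__(self):
        self.triples = []

    def add(self, subject, predicate, obj):
        self.triples.append((subject, predicate, obj))

    def query(self, subject=None, predicate=None, obj=None):
        # None acts as a wildcard in the pattern, so query(subject="DeepQA")
        # returns every fact about DeepQA
        return [
            t for t in self.triples
            if (subject is None or t[0] == subject)
            and (predicate is None or t[1] == predicate)
            and (obj is None or t[2] == obj)
        ]

store = TinyTripleStore()
store.add("Watson", "builtOn", "DeepQA")
store.add("DeepQA", "uses", "Lucene")
store.add("DeepQA", "uses", "Hadoop")
print(store.query(subject="DeepQA", predicate="uses"))
```

This linear scan illustrates the data model only; real triple stores index all three positions so that such pattern queries turn around quick facts efficiently.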
Areas for Further Research
Machine Learning and Natural Language Processing
Machine Learning
• MetaLearners: research the concept and how metalearners are used to evaluate learning models and assign a confidence score, based on the learning models that are used to rank search results.
• Question Classification: identify techniques and models that can be employed to analyze and classify questions.
• Search Ranking Models: research which models are available for ranking search results, based on the various search and recall techniques that are employed for a question.

NLP
• Logical Form Analysis: research how SNA is used to discover logical relationships within text and produce an understanding of the information within the text.
• Semantic Structure Analysis: identify tools and algorithms that are employed to uncover semantic relationships within texts/phrases, and how these relationships can be applied to extract relevant information for question analysis and search.
• Relationship Analysis: research techniques and tools for uncovering temporal, geospatial, and spatial relationships within a knowledge set.
• Feature Extraction: evaluate tools and algorithms that are used to extract features of entities from text, and identify methods for structuring the data for search.
• Phrase Analysis: identify algorithms and tools that can be applied to extract key phrases from text based on a search context.
PwC 56
URLs
Overviews and Applications

Links
The AI Behind Watson
How to build a Watson Jr.

Background Building your own Watson


Documents Algorithms behind Watson
Overview of the technology behind Watson
DeepQA Project Page
Watson and your business

Applications and Understanding the DeepQA Process


Articles The future of DeepQA
DeepQA for e-discovery

PwC 57
Image Analytics Overview
How can we extract insight from images and video?

Overview
• The process of pulling relevant information from an image or sets of images for advanced classification and traditional analysis
• Applies image capture, image processing, and machine learning techniques to extract, quantify, and structure image information

Advantages
• Provides a method to structure, organize, and search information that is stored within images
• Offers an additional data set that can be applied to understanding consumer behavior, automating business processes, and discovering knowledge in enterprise content
PwC 58
Image Analytics Tools
There are few standalone packages that are capable of performing robust
image analysis; however, solutions can be developed using existing
frameworks and analytics toolkits
• OpenCV: open source library of computer vision functions, accessible via C, Java, and Python (image processing, computer vision, machine learning)
• PAXit Image Analysis: integrated image analysis platform that provides basic feature identification functions (image processing, computer vision)
• ImageJ: Java-based image processing platform that can be accessed via an API and extended with custom plugins (image processing)
• PIL: Python image processing library (image processing)
• PyBrain: a modular machine learning library for Python (machine learning)

PwC 59
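Even without a full toolkit, the "extract, quantify, and structure" step can be shown on a raw grayscale pixel grid. The sketch below is pure Python with our own helper names; it computes two simple global features of the kind such libraries provide as building blocks:

```python
def mean_brightness(pixels):
    # Global feature 1: average intensity of the image (0-255 scale)
    values = [v for row in pixels for v in row]
    return sum(values) / len(values)

def edge_count(pixels, threshold=50):
    # Global feature 2: count of sharp horizontal intensity jumps,
    # a crude proxy for edge content / visual complexity
    return sum(
        1
        for row in pixels
        for left, right in zip(row, row[1:])
        if abs(left - right) > threshold
    )

# Tiny synthetic "image": a dark region next to a bright region
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]
print(mean_brightness(image), edge_count(image))
```

Features like these, computed per image or per region, become the structured columns that downstream classification and traditional analysis consume.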
URLs
Tutorials, Tools, Applications, and Research Groups

Link and Description


Tutorial on Image Processing and Analysis

Tutorials Online Book of Algorithms for Computer Vision

Online Machine Vision Book


Computer Vision Group
Research Groups and CMU Machine Vision Group
Papers
Stanford Machine Vision Group
Image Analysis and Mining Framework
Tools and Data Sets
Image Mining Software

PwC 60
Audio Analytics Overview
How can we extract insight from audio and voice media?

Overview
• The process of capturing audio and analyzing its features so as to extract the content and context of an event
• Applies speech analysis and signal processing principles to structure audio information for analysis via NLP or traditional analytics techniques

Advantages
• Provides a method for identifying events or common patterns within sound bites
• Offers a way of capturing not only the content and topics within a conversation, but also the emotions and context

PwC 61
Audio Analytics: Capabilities and Insights
What data can we capture from sound bites that can be used to enhance other
data or analysis?

Audio Event Information Points

• Event: audio events are identified as changes in sound patterns and/or their intensity over time
• Rate: defines how quickly a sound or a pattern of sound is occurring, and can be used to evaluate the nature of an exchange, the state of the sound source, and the context of the topic
• Power and Intensity: measures the loudness of the sound or event and provides a way of evaluating the mood or emotion of the sound source
• Sound and Pitch: a measure of the sound quality; can serve as a tool for isolating separate audio events or sources, as well as measuring changes to the sound source

[Figures: loudness/intensity of sound over time; sound or pitch (frequency) over time]
PwC 62
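The power/intensity feature described above is typically computed as root-mean-square (RMS) energy over short windows of samples, yielding a loudness curve over time. A minimal sketch on a synthetic signal (our own function names, no audio library required):

```python
import math

def rms(samples):
    # Root-mean-square energy: the standard loudness/intensity measure
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def loudness_curve(samples, window=400):
    # Slice the signal into fixed windows and measure each one,
    # turning raw samples into loudness over time
    return [
        rms(samples[i:i + window])
        for i in range(0, len(samples), window)
    ]

# Synthetic signal: a quiet 440 Hz tone followed by a loud one (8 kHz sample rate)
quiet = [0.1 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(400)]
loud = [0.8 * math.sin(2 * math.pi * 440 * t / 8000) for t in range(400)]
curve = loudness_curve(quiet + loud, window=400)
print([round(v, 3) for v in curve])   # loudness jumps between the two segments
```

Rate and pitch are extracted similarly, by counting events per window and by frequency analysis of each window respectively.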
Audio Analytics Applications

• Voice Recognition: analyze conversations to capture speech as text-based dialog. Objectives: capture and structure the content of conversations; utilize structured speech as an input to text mining and natural language processing capabilities; combine phone-based conversations with other interaction data sets.
• Sound Matching: analyze sound clips to identify specific events taking place. Objectives: monitor customer interactions or business operations to capture events in real time; use captured events for comparison, categorization, and analysis with other data points.
• Sentiment Analysis: monitor phone calls with customers to uncover sentiment towards the experience and/or products/services. Objectives: capture the content of the conversation and conduct sentiment analysis based on word choice; analyze the pitch, loudness, and rate of consumer speech to identify the emotional state during the conversation and its cause.
• Employee/Customer Screening: monitor customer and job-candidate conversations to extract information from word usage and speech patterns that can inform or improve a screening process. Objectives: analyze pre-screening phone conversations to assess a job candidate's personality, interest in the job, and fit to the job requirements; analyze customer conversations to assess the level of risk and honesty when applying for a product or filing a claim/complaint.

PwC 63
Audio Analytics Tools
There are few tools on the market that provide a broad range of audio
analysis capabilities. However, basic audio analysis and natural language
tool kits can be combined for robust analytics
• CLAM: a C++ library that provides varying levels of audio processing and information retrieval capabilities (audio processing, information retrieval)
• CallMiner: a tool that is capable of translating calls into a more structured text data set and combining it with other communication forms (information retrieval)
• Nuance: logs calls and structures audio for text-based search and retrieval (information retrieval)
• yaafe: audio feature extraction toolkit with wrappers for several languages (audio processing)
• PRAAT: multi-platform audio analysis toolkit (audio processing)

PwC 64
URLs
Tutorials, Tools, Applications, and Research Groups

Link and Description


Overview of audio features for sentiment analysis
Tutorials Lecture on Audio Features and Information
Overview of audio analysis

Research Groups National Center for Voice and Speech


and Papers
Tools and Data Audio analysis package
Sets Audio Mining Software

PwC 65
Social Network
Analysis
Applications
Analyze organizational structures to identify opportunities that can improve
communication, productivity, and collaboration

• Collaboration Analysis: evaluate team structures, information flows among team members, and information exchanges with other teams to improve working structures. Objectives: identify team structures that are not effective; identify informal organizational structures; identify individuals/roles or groups that are influential to collaborative work environments.
• Content/Knowledge Management: evaluate how knowledge or content is diffused and accessed within an organization. Objectives: improve content and knowledge distribution; identify content bottlenecks, open communication flows, and establish channels; explore the impact of new communication methods.
• Community Mining: identify groups or informal teams that share knowledge, communicate frequently, solve problems, or work together to perform specific tasks. Objectives: improve structures for key organizational functions; improve information flows; identify potential bottlenecks for organizational functions; identify cultural patterns to build other communities.
• Organization Development: explore formal and informal organization structures and how individuals work with one another to improve the design of the organization. Objectives: improve the hierarchy and structure of the organization to better align with informal practices; identify team members that are effective leaders and would impact the organization if promoted.

PwC 67
Applications
Analyze network structures, communication channels, and information flows
to identify operational enhancements
• Disaster Recovery Planning: assess organizational structures and communication patterns as they relate to the groups that play a role in disaster recovery plans. Objectives: identify communication improvements for disaster recovery teams; identify weak links among functional groups to improve collaboration during recovery plan execution.
• Data/Information Dissemination: assess how data points or information sets originate or are distributed across the enterprise to their intended targets. Objectives: identify overlapping information sets and bottlenecks for information dissemination; assess how organization structures or information architecture impact the flow of information to its targets.
• Fraud Detection/Prevention: assess the organization or external network to identify communication or collaboration patterns that align with known fraudulent activity. Objectives: identify network agents that collaborate with known fraudulent agents; identify activities that align with known fraudulent behavior.
• Process Discovery/Improvement: analyze the organization structure and communication patterns to uncover process improvements or identify new processes. Objectives: identify process improvements through discovery of hidden process steps, communication flows, and actors; discover undocumented or informal processes that are hidden within frequent collaboration and communication paths.
• Supply Chain Analysis: evaluate the structure of a supply network and the interactions among the entities that comprise it to identify gaps, bottlenecks, and sourcing strategies. Objectives: identify communication gaps that could impact dependent processes or operations; identify strategic relationships to optimize the supply network; identify supply nodes that create inefficiencies.

PwC 68
Applications
Analyze social media networks and consumer feedback to improve product
offerings and market interactions
• Novelty/Sentiment Diffusion Analysis: observe how a specific topic, news article, or sentiment diffuses through a consumer network. Objectives: assess how the target consumers/market will react to a piece of news or a campaign; evaluate how long news, data, or sentiment will be retained within a system and how far it will spread.
• Market Influencer Identification: monitor and analyze connections within social media networks to identify markets or consumers that are influential within communities. Objectives: identify individuals or groups that influence markets and adoption; identify untapped markets; identify market segments as targets for ad campaigns to improve product/service adoption.
• Consumer Segmentation: analyze the connections and consumer attributes within the target market to discover communities or groups with common characteristics. Objectives: improve product or service offerings based on the attributes that connect the consumer market; develop strategies to target new or existing consumers based on identified segmentation characteristics.
• Product or Brand Diffusion Analysis: analyze the flow of communication or ideas through a market segment to evaluate how a product may diffuse. Objectives: identify segments or individuals that are likely to be early adopters; identify incentives or campaigns that will improve product/service adoption.
• Recommendation Systems: analyze consumer network connections and common features among consumers to develop recommendations. Objectives: identify new feature sets for products and services; assess new markets for selling similar or new products; target consumers with specific products or services.

PwC 69
Tools
Social network analysis plug-ins and APIs for development/scripting
languages and data analysis tools

• SNAP: a general-purpose network analysis and graph mining library for C++ (analysis, manipulation)
• statnet: a package for R that provides capabilities for social network statistical analysis (analysis)
• libSNA, graph-tool, networkX: Python libraries for network analysis and manipulation (analysis, manipulation)
• JUNG: Java package for network analysis and modeling (analysis, visualization, manipulation)
• NodeXL: Excel plug-in that provides an easy-to-use, interactive interface to explore and visualize networks (analysis, visualization)

PwC 70
Tools
Proprietary and open source social network analysis interactive application
suites

• Gephi: interactive open source platform for network analysis and visualization (analysis, visualization, manipulation)
• UCINET: commercial social network analysis tool with a separate visualization component (analysis, visualization)
• Graphviz: open source graph visualization package (visualization)
• NetMiner: proprietary package that provides the ability to develop and implement custom algorithms (analysis, visualization, manipulation)
• KXEN Social Network Analysis: network analysis package that provides predictive analytics and customer MDM integration (analysis, visualization, manipulation)
• ProM: open source package for mining business process networks (analysis, visualization, manipulation)
• Cytoscape: open source tool for network modeling and analysis; can connect to external data sources (analysis, visualization, manipulation)
• Network Workbench: large-scale network analysis, modeling, and visualization toolkit for biomedical, social science, and physics research (analysis, visualization, manipulation)

PwC 71
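The core measurements these packages provide are straightforward on small graphs. The sketch below is pure Python with our own helper names (rather than one of the listed toolkits); it computes degree centrality, a basic indicator of the influential individuals mentioned in the applications above:

```python
from collections import defaultdict

def degree_centrality(edges):
    # Degree centrality: a node's degree divided by the maximum possible
    # degree (n - 1), a basic measure of how connected/influential it is
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}

# Toy collaboration network: who communicates with whom
edges = [("ann", "bob"), ("ann", "carl"), ("ann", "dina"), ("bob", "carl")]
centrality = degree_centrality(edges)
print(max(centrality, key=centrality.get))   # ann
```

Betweenness, closeness, and community detection follow the same pattern of turning an edge list into per-node or per-group scores; the toolkits above add efficient algorithms and visualization on top.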
Resources
Tutorials, Tools, Applications, and Research Groups

Link and Description


Introduction for Beginners

Introductory Lecture

Paper on Business Applications


Tutorials
Network Analysis Process
Online Introductory Book
Introduction to Network Analysis Application and Theory (Open source book)
SNA Group at Stanford: Tool, Lectures, and Papers
Complex Networks and Systems Research Collaboration
Research Groups and
SNA Group and Indiana University: Lectures, Papers, and Tools
Papers
Reality Mining at MIT
Papers from International Conference on Advances in Social Networks Analysis and Mining

Wiki List of Social Network Analysis Software


Review of 100+ Social Network Analysis Tools
List of Tools from The SAGE Handbook of Social Network Analysis
Tools and Data Sets More Tool Reviews
Twitter Data Sets
Web/Blog Data Sets
Facebook Data Sets

PwC 72
Additional Case
Studies
Example 1

Advanced natural language processing and deep question-


answering technology are being applied to address clinical
decision-making
Memorial Sloan-Kettering Cancer Center

• Memorial Sloan-Kettering Cancer Center is applying DeepQA technology


(technology that relies on advanced analytics powered by IBM’s Watson)
to develop a decision-support application for cancer treatment
• Doctors will be able to generate and evaluate hypotheses on evidence
and treatment, and the Cancer Center will be able to better identify and
personalize cancer therapies for individual patients

WellPoint and Cedars-Sinai


• WellPoint and the Cedars-Sinai Samuel Oschin Comprehensive Cancer Institute
will work together to help improve patient care and support physicians in their
efforts to make the most informed, personalized treatment decisions possible.
• It is estimated that new clinical research and medical information doubles every
five years, and nowhere is this knowledge advancing more quickly than in the
complex area of cancer care.
• The WellPoint health care solutions will use DeepQA technology to draw from
vast libraries of information including medical evidence-based scientific and
health care data, and clinical insights from institutions like Cedars-Sinai.
Source: Memorial Sloan-Kettering Cancer Institute Press Release March 2012, WellPoint Press Release, December 2011;
PwC 74
Example 2

Large volumes of real-time sensor data are empowering


individuals to take more control of their health

Quantified Health – P4 Medicine (Predictive, Preventive, Personalized, Participatory)

• Non-invasive wearable sensors are creating a new


‘Quantified Health’ movement and one of the fastest
growing sectors in the tech industry, let alone in the
field of Big Data Analytics
• The number of connected industrial and medical
devices is projected to reach 16 billion by 2015
• The mHealth market is estimated to reach a value of
$23 billion by 2017

Source: Bruce Bigelow, Big Data, Big Biology, and the ‘Tipping Point’ in Quantified Health: Takeaways from Xconomy’s On-the-Record Dinner,
Xconomy, April 26, 2012

PwC 75
Example 3

Advanced machine learning and visualization techniques


are being used to model drug interactions

Modeling Adverse Drug Reactions

When biological and phenotypic features were integrated alongside chemical structures to
predict adverse drug reactions, prediction accuracy increased from 0.9054 to 0.9524.

Source: Liu M, Wu Y, Chen Y, et al. Large-scale prediction of adverse drug reactions by integrating chemical, biological, and phenotypic properties
of drugs. J Am Med Inform Assoc 2012;19:e28–35.

PwC 76
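The feature-integration idea behind the Liu et al. result can be sketched in a few lines: concatenating heterogeneous feature vectors lets a classifier separate cases that a single feature set cannot. Everything below (the nearest-centroid classifier, the drug features, the labels) is an invented toy illustration, not the paper's actual model or data.

```python
# Toy sketch (not the paper's actual pipeline): concatenating
# heterogeneous feature vectors before classification, to show why
# integrated features can separate classes a single set cannot.
# All data below is made up for illustration.

def nearest_centroid(train, labels, x):
    """Classify x by the label of the closest class centroid."""
    by_label = {}
    for vec, lab in zip(train, labels):
        by_label.setdefault(lab, []).append(vec)
    best_lab, best_dist = None, float("inf")
    for lab, vecs in by_label.items():
        dim = len(vecs[0])
        centroid = [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]
        dist = sum((a - b) ** 2 for a, b in zip(centroid, x))
        if dist < best_dist:
            best_lab, best_dist = lab, dist
    return best_lab

# Chemical features alone cannot separate these two classes...
chem = [[1.0], [1.0], [0.0], [0.0]]
# ...but adding a phenotypic feature makes them separable.
pheno = [[0.0], [1.0], [0.0], [1.0]]
labels = ["no_adr", "adr", "no_adr", "adr"]

combined = [c + p for c, p in zip(chem, pheno)]
print(nearest_centroid(combined, labels, [1.0, 0.9]))  # -> adr
```

The point mirrors the slide: adding biological and phenotypic signals alongside chemical structure gives the classifier information that chemistry alone does not carry.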
Other Examples

Companies in other sectors are also pursuing various
applications of ‘Big Data’ and ‘Smart Analytics’.

Allianz

Allianz is ‘mashing’ satellite data, third-party street-level data, images, and
other internal data (location, property-specific and map data) to better
understand risk concentrations and manage concentration risk in commercial
property insurance.

Hartford Steam Boiler

Hartford Steam Boiler is using sensors and real-time sensor data to monitor
assets, reduce losses and manage risks better. Hartford Steam Boiler has been
able to manage concentration risks and reduce losses, having one of the lowest
combined ratios for a commercial insurer.

Procter & Gamble

Procter & Gamble is investing in analytics talent for quicker decision
making, with the CIO planning to increase fourfold the number of staff with
expertise in business analytics

Executives are currently using big data to uncover what is going on in their business, to
understand why, to predict future performance and to understand what actions P&G should take

Source: “Procter & Gamble – Business Sphere and Decision Cockpits”, Ravi Kalakota, Practical Analytics WordPress, Feb. 2012; mskcc.org/cancer-care;
eWeek.com; Healthcare IT News, IBM Watson to Aid Sloan-Kettering With Cancer Research, March 2012
PwC 77
Big Data Analytics
Technology & Vendor Mappings
Big Data Analytics – Technology & Vendor Mappings

Layer              L1             Technology              Vendor

1. Infrastructure  Private Cloud  EMC Private Cloud       EMC
                                  HP Private Cloud        HP
                                  Teradata Private Cloud  Teradata
                                  Dell Private Cloud      Dell
                   Public Cloud   Azure SQL               Microsoft
                                  Amazon Web Services     Amazon
                                  Google Cloud Platform   Google
                   Hybrid Cloud   EMC Hybrid Cloud        EMC
                                  HP Helion               HP
                                  IBM Hybrid Cloud        IBM
PwC 79
Big Data Analytics – Technology & Vendor Mappings

Layer                  L1              L2                   Technology              Vendor

3. Data Acquisition &  Data Ingestion  Batch/Micro          Apache Kafka            Apache Software Foundation
   Integration                                              Fluentd                 Open Source
                                                            Sqoop                   Apache Software Foundation
                                                            RabbitMQ                RabbitMQ
                                                            AWS Kinesis             Amazon Web Services
                                                            Apache Spark            Apache Software Foundation
                                       Real time/Streaming  Apache Storm            Apache Software Foundation
                                                            Apache Spark Streaming  Apache Software Foundation
                                                            Samza                   Apache Software Foundation
                                                            NiFi                    Apache Software Foundation

PwC 80
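The batch/micro-batch ingestion pattern behind tools such as Kafka consumers and Spark Streaming can be sketched in plain Python: events accumulate in a buffer and are flushed downstream in fixed-size batches. This is an illustrative sketch only; no vendor API is used, and the size-based trigger stands in for the time- or size-based triggers real systems offer.

```python
# Minimal sketch of micro-batch ingestion (the pattern behind the
# tools listed above): events accumulate in a buffer and are flushed
# downstream in fixed-size batches. Purely illustrative.
from collections import deque

def micro_batch(events, batch_size=3):
    """Yield lists of events, batch_size at a time (last may be smaller)."""
    buffer = deque()
    for event in events:
        buffer.append(event)
        if len(buffer) == batch_size:
            yield list(buffer)
            buffer.clear()
    if buffer:  # flush the final partial batch
        yield list(buffer)

stream = ({"sensor": i % 2, "value": i} for i in range(7))
for batch in micro_batch(stream):
    print(len(batch), "events:", batch)
```

Real-time/streaming engines in the table process each event (or very small batch) as it arrives instead of waiting for the buffer to fill; the trade-off is latency versus per-record overhead.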
Big Data Analytics – Technology & Vendor Mappings

Layer                L1                L2                             Technology  Vendor

3. Data Ingestion &  Data Quality      Data Profiling/Cleansing       *Need assistance in locating
   Integration                         Data Matching/Deduplication    *Need assistance in locating
                                       Standardization/Normalization  *Need assistance in locating
                     Data Integration  ETL/ELT                        Hadoop      Apache Hadoop
                                                                      Talend      Talend
                                                                      Hive        Apache Software Foundation
                                                                      Drill       Apache Software Foundation
                                       Staging                        *Need assistance in locating
                                       Persistent Staging             *Need assistance in locating
                                       File Exchange                  *Need assistance in locating
                                       File Storage                   *Need assistance in locating
PwC 81
Big Data Analytics – Technology & Vendor Mappings

Layer               L1                   L2                    Technology  Vendor

3.5 Execution/      Data Processing      Custom Compilers      *Need assistance in locating
    Data Processing                      Batch MapReduce       MapReduce   Apache Hadoop
                                                               Spark       Apache Software Foundation
                                                               AWS EMR     Amazon Web Services
                                                               Tez         Apache Software Foundation
                                         In-Memory Processing  Spark       Apache Software Foundation
                                         Computing Framework   *Need assistance in locating
                    Resource Management  Cluster Management    YARN        Apache Hadoop
                                                               Mesos       Apache Software Foundation
                                                               Zookeeper   Apache Software Foundation
                                                               Oozie       Apache Software Foundation

PwC 82
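The MapReduce model listed above (and generalized by engines such as Spark and Tez) can be illustrated with a single-process word count: map emits key/value pairs, a shuffle groups them by key, and reduce aggregates each group. Real frameworks distribute these phases across a cluster; this sketch only shows the data flow.

```python
# Sketch of the MapReduce programming model in plain Python:
# map emits (key, 1) pairs, shuffle groups them by key, and
# reduce sums each group. Single-process and illustrative only.
from collections import defaultdict

def map_phase(docs):
    for doc in docs:
        for word in doc.split():
            yield (word.lower(), 1)

def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    return {key: sum(values) for key, values in groups.items()}

docs = ["big data", "Big Data Analytics", "data"]
counts = reduce_phase(shuffle(map_phase(docs)))
print(counts)  # {'big': 2, 'data': 3, 'analytics': 1}
```

In a cluster, the map and reduce functions stay this simple; it is the shuffle (moving grouped data between machines) that YARN- or Mesos-managed frameworks spend most of their engineering on.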
Big Data Analytics – Technology & Vendor Mappings

Layer                 L1                       L2                    Technology       Vendor

3.5 Execution/        Resource Management      Workflow Management   Hue              Open Source
    Data Processing                                                  Ambari           Apache Software Foundation
                                                                     Lipstick         Netflix
                                                                     Ganglia          The Ganglia Project
4. Data Repositories  Relational Database      Traditional Database  SQL Server       Microsoft
                                                                     Oracle 10g       Oracle
                                               Parallel Database     Teradata         Teradata
                                               Appliances            HP Vertica       HP
                                                                     IBM BigInsights  IBM
                                                                     EMC Greenplum    EMC
                                               NewSQL                ClustrixDB       Clustrix
                                                                     MemSQL           MemSQL
                      Distributed File System  Hadoop DFS            HDFS             Apache Hadoop
                                                                     AWS Packaged     Amazon Web Services
                                                                     Solutions
                                                                     Tachyon          Tachyon Project
                      Operational Data Store   ODS
PwC 83
Big Data Analytics – Technology & Vendor Mappings

Layer                 L1                L2                 Technology    Vendor

4. Data Repositories  In-Memory         Relational/NewSQL  MySQL         Open Source
                                                           PostgreSQL    Open Source
                                                           AWS RDS       Amazon Web Services
                                        Columnar DB        Cassandra     Apache Software Foundation
                                                           HBase         Apache Hadoop
                                                           AWS Redshift  Amazon Web Services
                                        NoSQL              Hazelcast     Open Source
                                                           Aerospike     Aerospike
                      Metadata Storage                     *Need assistance in locating
PwC 84
Big Data Analytics – Technology & Vendor Mappings

Layer                 L1     L2                 Technology    Vendor

4. Data Repositories  NoSQL  Key Value          Redis         Open Source
                                                Riak          Basho
                                                AWS DynamoDB  Amazon Web Services
                             Column Store       Cassandra     Apache Software Foundation
                                                HBase         Apache Hadoop
                                                AWS Redshift  Amazon Web Services
                             Graph Database     Neo4j         Neo Technology
                                                OrientDB      Orient Technologies
                                                ArangoDB      Open Source
                             Document Database  MongoDB       MongoDB, Inc.
                                                Elastic       Elastic
                                                Couchbase     Couchbase
PwC 85
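The key-value model used by stores like Redis and DynamoDB can be sketched with a toy in-memory class supporting set/get and per-key expiry. This illustrates the data-model semantics only; it is not how Redis is implemented, and the class name and TTL handling are invented for the example.

```python
# Toy in-memory key-value store with expiry, sketching the semantics
# of the key-value stores listed above (set/get with optional TTL).
# Data-model illustration only, not any vendor's implementation.
import time

class TinyKV:
    def __init__(self):
        self._data = {}  # key -> (value, expiry timestamp or None)

    def set(self, key, value, ttl=None):
        expiry = time.monotonic() + ttl if ttl is not None else None
        self._data[key] = (value, expiry)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expiry = item
        if expiry is not None and time.monotonic() > expiry:
            del self._data[key]  # lazy expiry on read
            return None
        return value

kv = TinyKV()
kv.set("session:42", {"user": "alice"}, ttl=30)
print(kv.get("session:42"))  # {'user': 'alice'}
```

The contrast with the other rows of the table is the lookup contract: a key-value store answers only "give me the value for this exact key", which is what lets it scale and stay fast; column, graph, and document stores trade some of that simplicity for richer queries.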
Big Data Analytics – Technology & Vendor Mappings

Layer             L1                      Technology                    Vendor

6. Presentation/  Reporting &             *Need assistance in locating  Microstrategy
   Data           Dashboards                                            Datameer
   Visualization  Visualization tools/    Qlik Sense                    Qlik
                  Interactive Visual      Tableau                       Tableau
                  Analytics
                  Real-time Alerts        *Need assistance in locating
                  Website Front-end       D3                            Open Source
                                          AngularJS                     Google
                                          Flask                         Open Source
                                          Highcharts                    Highcharts
                                          Django                        Django Software Foundation
                  API                     *Need assistance in locating
PwC 86
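What the presentation layer ultimately does, aggregate metrics and render them for decision-makers, can be sketched without any BI tool: a text bar chart stands in for a Tableau or D3 dashboard. The function name and the sales figures below are invented for illustration.

```python
# Sketch of the presentation layer's job: take aggregated metrics
# and render them. A text bar chart stands in for the dashboard
# tools listed above; the data is invented for illustration.
def bar_chart(metrics, width=20):
    """Render a dict of name -> value as a descending text bar chart."""
    peak = max(metrics.values())
    lines = []
    for name, value in sorted(metrics.items(), key=lambda kv: -kv[1]):
        bar = "#" * round(width * value / peak)
        lines.append(f"{name:<12} {bar} {value}")
    return "\n".join(lines)

sales_by_region = {"EMEA": 120, "APAC": 90, "Americas": 150}
print(bar_chart(sales_by_region))
```

Dashboard tools add interactivity (drill-down, filtering, alerts) on top of exactly this kind of aggregate-then-render loop.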
