
KALI CHARAN NIGAM INSTITUTE OF

TECHNOLOGY, BANDA

“BIG DATA”
KCS-061

UNIT – I

Introduction to Big Data: Types of digital data, history of Big Data innovation,
introduction to Big Data platform, drivers for Big Data, Big Data architecture and
characteristics, 5 Vs of Big Data, Big Data technology components, Big Data
importance and applications, Big Data features – security, compliance, auditing
and protection, Big Data privacy and ethics, Big Data Analytics, Challenges of
conventional systems, intelligent data analysis, nature of data, analytic processes
and tools, analysis vs reporting, modern data analytic tools.

Compiled by
Shristhi Gupta
(Assistant Professor)
CSE Department
Introduction to Big Data
Data: The quantities, characters, or symbols on which operations are performed by a computer,
which may be stored and transmitted in the form of electrical signals and recorded on magnetic,
optical, or mechanical recording media.

Big Data: “Big data is high-volume, high-velocity, and high-variety information assets that demand
cost-effective, innovative forms of information processing for enhanced insight and decision
making.”
Hence, Big Data refers to complex and large data sets that have to be processed and analyzed to
uncover valuable information that can benefit businesses and organizations.
Big Data is a collection of data that is huge in volume, yet growing exponentially with time. It is
data of such large size and complexity that none of the traditional data management tools can store
or process it efficiently.

Data that is very large in size is called Big Data. Normally we work with data of MB size
(Word documents, Excel sheets) or at most GB size (movies, code), but data on the scale of petabytes,
i.e. 10^15 bytes, is called Big Data. It is estimated that almost 90% of today's data has been
generated in the past three years.

Examples:
 Social networking sites: Facebook, Google, LinkedIn and similar sites generate huge
amounts of data on a day-to-day basis, as they have billions of users worldwide.
 E-commerce sites: Sites like Amazon, Flipkart and Alibaba generate huge amounts of logs
from which users' buying trends can be traced.
 Weather stations: Weather stations and satellites produce very large volumes of data, which
are stored and processed to forecast the weather.

 Telecom companies: Telecom giants like Airtel and Vodafone study user trends and
publish their plans accordingly, and for this they store the data of millions of users.

 Share market: Stock exchanges across the world generate huge amounts of data through
their daily transactions.

Types of digital data:

 Structured
 Unstructured
 Semi Structured Data
Structured Data:
Structured is one of the types of big data and by structured data, we mean data that can be
processed, stored, and retrieved in a fixed format. It refers to highly organized information that
can be readily and seamlessly stored and accessed from a database by simple search engine
algorithms. For instance, the employee table in a company database will be structured as the
employee details, their job positions, their salaries, etc., will be present in an organized manner.

 Structured data refers to any data that resides in a fixed field within a record or file.
 Having a particular Data Model.
 Meaningful data.
 Data arranged in rows and columns.
 Structured data has the advantage of being easily entered, stored, queried and analysed.
 E.g.: Relational Data Base, Spread sheets.
 Structured data is often managed using Structured Query Language (SQL); a minimal sketch
follows the list of sources below.

Sources of Structured Data:


 SQL Databases
 Spreadsheets such as Excel
 OLTP Systems
 Online forms
 Sensors such as GPS or RFID tags
 Network and Web server logs
 Medical devices
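
As a small illustration of how structured data is stored and queried with SQL, the following sketch
uses Python's built-in sqlite3 module. The employee table and its columns are hypothetical, mirroring
the employee example above; a real system would use an RDBMS server rather than an in-memory database.

import sqlite3

# In-memory database purely for illustration.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Structured data: a fixed schema of rows and columns.
cur.execute(
    "CREATE TABLE employee (id INTEGER PRIMARY KEY, name TEXT, position TEXT, salary REAL)"
)
cur.executemany(
    "INSERT INTO employee (name, position, salary) VALUES (?, ?, ?)",
    [("Asha", "Analyst", 55000.0), ("Ravi", "Engineer", 72000.0)],
)

# A simple SQL query over the structured table.
cur.execute("SELECT name, salary FROM employee WHERE salary > ?", (60000.0,))
print(cur.fetchall())   # [('Ravi', 72000.0)]
conn.close()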

Unstructured Data:
Unstructured data refers to the data that lacks any specific form or structure whatsoever. This
makes it very difficult and time-consuming to process and analyze unstructured data. Email is an
example of unstructured data. Structured and unstructured are two important types of big data.

 Unstructured data cannot readily be classified and fitted into a neat box.
 Also called unclassified data.
 Does not conform to any data model.
 Business rules are not applied.
 Indexing is not required.
 E.g.: photos and graphic images, videos, streaming instrument data, web pages, PDF files,
PowerPoint presentations, emails, blog entries, wikis and word processing documents.
Sources of Unstructured Data:
 Web pages
 Images (JPEG, GIF, PNG, etc.)
 Videos
 Reports
 Word documents and PowerPoint presentations
 Surveys

Semi structured Data:


Semi-structured data is the third type of big data. It contains elements of both formats mentioned
above, that is, structured and unstructured data. To be precise, it refers to data that, although not
classified under a particular repository (database), contains vital information or tags that segregate
individual elements within the data.

 Self-describing data.
 Metadata (data about data).
 Sometimes called quasi-structured data: data in between structured and unstructured.
 It is a form of structured data that does not follow a rigid data model.
 Data which does not have a rigid structure.
 E.g.: e-mails, word processing documents.
 XML and other markup languages are often used to manage semi-structured data; a small
parsing sketch follows the list of sources below.

Sources of semi-structured Data:


 E-mails
 XML and other markup languages
 Binary executables
 TCP/IP packets
 Zipped files
 Integration of data from different sources
 Web pages
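
To illustrate why semi-structured data is called self-describing, the short sketch below parses a
hypothetical JSON document and a hypothetical XML fragment using Python's standard library;
the field names are invented for the example.

import json
import xml.etree.ElementTree as ET

# JSON: keys (tags) describe the values, but there is no rigid, table-like schema.
record = json.loads('{"from": "alice@example.com", "subject": "Report", "attachments": 2}')
print(record["subject"])          # Report

# XML: markup tags segregate the individual elements within the data.
doc = ET.fromstring("<email><from>alice@example.com</from><subject>Report</subject></email>")
print(doc.find("subject").text)   # Report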
The History of Big Data:
Although the concept of big data itself is relatively new, the origins of large data sets go back to
the 1960s and '70s when the world of data was just getting started with the first data centers and
the development of the relational database.

Around 2005, people began to realize just how much data users generated through Facebook,
YouTube, and other online services. Hadoop (an open-source framework created specifically to
store and analyze big data sets) was developed that same year. NoSQL also began to gain
popularity during this time.

The development of open-source frameworks, such as Hadoop (and more recently, Spark) was
essential for the growth of big data because they make big data easier to work with and cheaper
to store. In the years since then, the volume of big data has skyrocketed. Users are still
generating huge amounts of data—but it’s not just humans who are doing it.

With the advent of the Internet of Things (IoT), more objects and devices are connected to the
internet, gathering data on customer usage patterns and product performance. The emergence of
machine learning has produced still more data.

While big data has come far, its usefulness is only just beginning. Cloud computing has
expanded big data possibilities even further. The cloud offers truly elastic scalability, where
developers can simply spin up ad hoc clusters to test a subset of data.

Introduction to Big Data platform:


A big data platform is a type of IT solution that combines the features and capabilities of several
big data applications and utilities within a single solution. It is an enterprise-class IT platform that
enables organizations to develop, deploy, operate and manage a big data infrastructure
/environment.

A big data platform generally consists of big data storage, servers, databases, big data management,
business intelligence and other big data management utilities. It also supports custom
development, querying and integration with other systems. The primary benefit of a big data
platform is that it reduces the complexity of multiple vendors/solutions into one cohesive solution.
Big data platforms are also delivered through the cloud, where the provider offers an all-inclusive
big data solution and services.

A Big Data Platform provides an approach to data management that combines servers, big data
tools, and analytical and machine learning capabilities into one cloud platform, for managing data
as well as producing real-time insights.

Big data Platform workflow is divided into the following stages:


1. Data Collection
2. Data Storage
3. Data Processing
4. Data Analytics
5. Data Management and Warehousing
6. Data Catalog and Metadata Management
7. Data Observability
8. Data Intelligence

What is the need for a Big Data Platform?


This comprehensive solution consolidates the capabilities and features of multiple applications
into a single, unified platform. It encompasses servers, storage, databases, management utilities,
and business intelligence tools.
The primary focus of this platform is to provide users with efficient analytics tools
specifically designed for handling massive datasets. Data engineers often utilize these platforms
to aggregate, clean, and prepare data for insightful business analysis. Data scientists, on the other
hand, leverage this platform to uncover valuable relationships and patterns within large datasets
using advanced machine learning algorithms. Furthermore, users have the flexibility to build
custom applications tailored to their specific use cases, such as calculating customer loyalty in
the e-commerce industry, among countless other possibilities.

Different Types of Big Data Platforms and Tools


A big data platform is expected to deliver four things, abbreviated S, A, P and S: Scalability,
Availability, Performance, and Security. There are various tools responsible for managing the
hybrid data of IT systems. The platforms are listed below:

1. Hadoop Delta Lake Migration Platform


2. Data Catalog and Data Observability Platform
3. Data Ingestion and Integration Platform
4. Big Data and IoT Analytics Platform
5. Data Discovery and Management Platform
6. Cloud ETL Data Transformation Platform

1. Hadoop - Delta Lake Migration Platform


It is an open-source software platform managed by Apache Software Foundation. It is used to
collect and store large data sets cheaply and efficiently.

2. Big Data and IoT Analytics Platform


It provides a wide range of tools to work on; this functionality comes in handy while using it
over the IoT case.

3. Data Ingestion and Integration Platform


This layer is the first step for the data from variable sources to start its journey. This means the
data here is prioritized and categorized, making data flow smoothly in further layers in this
process flow.
4. Data Mesh and Data Discovery Platform
A data mesh introduces the concept of a self-serve data platform to avoid duplication of effort.
Data engineers set up technologies so that all business units can process and store their data
products.

5. Data Catalog and Data Observability Platform


It provides a single self-service environment to users, helping them find, understand, and
trust the data sources. It also helps users discover new data sources, if any. Discovering and
understanding data sources are the initial steps for registering them in the catalog. Users search the
data catalog tools and filter the appropriate results based on their needs. In enterprises, a data
lake serves business intelligence users, data scientists, and ETL developers, all of whom need the
correct data. The users use catalog discovery to find the data that fits their needs.

6. Cloud ETL Data Transformation Platform


This Platform can be used to build pipelines and even schedule the running of the same for data
transformation.

Drivers for Big Data:


Six main business drivers can be identified:
1. The digitization of society;
2. The plummeting of technology costs;
3. Connectivity through cloud computing;
4. Increased knowledge about data science;
5. Social media applications;
6. The upcoming Internet-of-Things (IoT).

1. The digitization of society


Big Data is largely consumer driven and consumer oriented. Most of the data in the world is
generated by consumers, who are nowadays ‘always-on’. Most people now spend 4-6 hours per
day consuming and generating data through a variety of devices and (social) applications. With
every click, swipe or message, new data is created in a database somewhere around the world.
Because everyone now has a smartphone in their pocket, the data creation sums to
incomprehensible amounts. Some studies estimate that 60% of data was generated within the last
two years, which is a good indication of the rate with which society has digitized.

2. The plummeting of technology costs


Technology related to collecting and processing massive quantities of diverse (high variety) data
has become increasingly more affordable. The costs of data storage and processors keep
declining, making it possible for small businesses and individuals to become involved with Big
Data. For storage capacity, the often-cited Moore’s Law still holds: storage density (and
therefore capacity) doubles roughly every two years.
Besides the plummeting of the storage costs, a second key contributing factor to the
affordability of Big Data has been the development of open source Big Data software
frameworks. The most popular software framework (nowadays considered the standard for Big
Data) is Apache Hadoop, for distributed storage and processing. Because these software frameworks
are freely available as open source, it has become increasingly inexpensive to start Big
Data projects in organizations.

3. Connectivity through cloud computing


Cloud computing environments (where data is remotely stored in distributed storage systems)
have made it possible to quickly scale up or scale down IT infrastructure and facilitate a pay-as-
you-go model. This means that organizations that want to process massive quantities of data (and
thus have large storage and processing requirements) do not have to invest in large quantities of
IT infrastructure. Instead, they can license the storage and processing capacity they need and
only pay for the amounts they actually use. As a result, most Big Data solutions leverage the
possibilities of cloud computing to deliver their solutions to enterprises.

4. Increased knowledge about data science


In the last decade, the terms data science and data scientist have become tremendously popular. In
October 2012, Harvard Business Review called the data scientist the “sexiest job of the 21st
century” and many other publications have featured this new job role in recent years. The
demand for data scientists (and similar job titles) has increased tremendously and many people
have actively become engaged in the domain of data science.
As a result, knowledge and education about data science have greatly professionalized,
and more information becomes available every day. While statistics and data analysis previously
remained a mostly academic field, the subject is quickly becoming popular among students
and the working population.

5. Social media applications


Everyone understands the impact that social media has on daily life. However, in the study of
Big Data, social media plays a role of paramount importance, not only because of the sheer
volume of data produced every day through platforms such as Twitter, Facebook, LinkedIn
and Instagram, but also because social media provides nearly real-time data about human
behavior.
Social media data provides insights into the behaviors, preferences and opinions of ‘the
public’ on a scale that has never been known before. Due to this, it is immensely valuable to
anyone who is able to derive meaning from these large quantities of data. Social media data can
be used to identify customer preferences for product development, target new customers for
future purchases, or even target potential voters in elections. Social media data might even be
considered one of the most important business drivers of Big Data.

6. The upcoming internet of things (IoT)


The Internet of things (IoT) is the network of physical devices, vehicles, home appliances and
other items embedded with electronics, software, sensors, actuators, and network connectivity
which enables these objects to connect and exchange data. It is increasingly gaining popularity as
consumer goods providers start including ‘smart’ sensors in household appliances. Whereas the
average household in 2010 had around 10 devices that connected to the internet, this number is
expected to rise to 50 per household by 2020. Examples of these devices include thermostats,
smoke detectors, televisions, audio systems and even smart refrigerators.
Big Data architecture and characteristics:
A big data architecture is designed to handle the ingestion, processing, and analysis of data that
is too large or complex for traditional database systems.

Big data solutions typically involve one or more of the following types of workload:

 Batch processing of big data sources at rest.


 Real-time processing of big data in motion.
 Interactive exploration of big data.
 Predictive analytics and machine learning.

Most big data architectures include some or all of the following components:

 Data sources: All big data solutions start with one or more data sources. Examples
include:
o Application data stores, such as relational databases.
o Static files produced by applications, such as web server log files.
o Real-time data sources, such as IoT devices.

 Data storage: Data for batch processing operations is typically stored in a distributed
file store that can hold high volumes of large files in various formats. This kind of store is
often called a data lake. Options for implementing this storage include Azure Data Lake
Store or blob containers in Azure Storage.

 Batch processing: Because the data sets are so large, often a big data solution must
process data files using long-running batch jobs to filter, aggregate, and otherwise prepare
the data for analysis. Usually these jobs involve reading source files, processing them, and
writing the output to new files. Options include running U-SQL jobs in Azure Data Lake
Analytics, using Hive, Pig, or custom Map/Reduce jobs in an HDInsight Hadoop cluster,
or using Java, Scala, or Python programs in an HDInsight Spark cluster (a minimal PySpark
batch-job sketch follows this list of components).

 Real-time message ingestion: If the solution includes real-time sources, the


architecture must include a way to capture and store real-time messages for stream
processing. This might be a simple data store, where incoming messages are dropped into a
folder for processing. However, many solutions need a message ingestion store to act as a
buffer for messages, and to support scale-out processing, reliable delivery, and other
message queuing semantics. Options include Azure Event Hubs, Azure IoT Hubs, and
Kafka.

 Stream processing: After capturing real-time messages, the solution must process
them by filtering, aggregating, and otherwise preparing the data for analysis. The
processed stream data is then written to an output sink. Azure Stream Analytics provides a
managed stream processing service based on perpetually running SQL queries that operate
on unbounded streams. You can also use open source Apache streaming technologies like
Spark Streaming in an HDInsight cluster.

 Analytical data store: Many big data solutions prepare data for analysis and then
serve the processed data in a structured format that can be queried using analytical tools.
The analytical data store used to serve these queries can be a Kimball-style relational data
warehouse, as seen in most traditional business intelligence (BI) solutions. Alternatively,
the data could be presented through a low-latency NoSQL technology such as HBase, or an
interactive Hive database that provides a metadata abstraction over data files in the
distributed data store. Azure Synapse Analytics provides a managed service for large-scale,
cloud-based data warehousing. HDInsight supports Interactive Hive, HBase, and Spark
SQL, which can also be used to serve data for analysis.

 Analysis and reporting: The goal of most big data solutions is to provide insights into
the data through analysis and reporting. To empower users to analyze the data, the
architecture may include a data modeling layer, such as a multidimensional OLAP cube or
tabular data model in Azure Analysis Services. It might also support self-service BI, using
the modeling and visualization technologies in Microsoft Power BI or Microsoft Excel.
Analysis and reporting can also take the form of interactive data exploration by data
scientists or data analysts. For these scenarios, many Azure services support analytical
notebooks, such as Jupyter, enabling these users to leverage their existing skills with
Python or R. For large-scale data exploration, you can use Microsoft R Server, either
standalone or with Spark.

 Orchestration: Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data between multiple
sources and sinks, load the processed data into an analytical data store, or push the results
straight to a report or dashboard. To automate these workflows, you can use an
orchestration technology such as Azure Data Factory or Apache Oozie and Sqoop.
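
As a minimal sketch of the batch-processing component described above, the following PySpark job
reads hypothetical web-server log files from a data lake path, filters and aggregates them, and writes
the prepared output back out. The paths and column names are assumptions for illustration, not part
of any particular architecture.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-log-aggregation").getOrCreate()

# Read raw source files from the data lake (path is hypothetical).
logs = spark.read.json("/datalake/raw/weblogs/2024/*.json")

# Filter, aggregate, and otherwise prepare the data for analysis.
daily_errors = (
    logs.filter(F.col("status") >= 500)
        .groupBy("date", "url")
        .count()
)

# Write the prepared output to a new location for the analytical data store.
daily_errors.write.mode("overwrite").parquet("/datalake/curated/daily_errors")
spark.stop()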
Big Data Characteristics / 5 Vs of Big Data:

 Volume: The name Big Data itself is related to an enormous size. Big Data is a vast
‘volume’ of data generated from many sources daily, such as business processes,
machines, social media platforms, networks, human interactions, and many more.
 Variety: Big Data can be structured, unstructured, or semi-structured, collected from
different sources. In the past, data was collected only from databases and spreadsheets, but
these days data arrives in many forms: PDFs, emails, audio, social media posts, photos,
videos, etc.
 Veracity: Veracity means how reliable the data is. It covers the many ways to filter or
translate the data, and the ability to handle and manage data efficiently. Veracity is also
essential in business development.
 Value: Value is an essential characteristic of big data. It is not just any data that we process
or store; it is valuable and reliable data that we store, process, and also analyze.
 Velocity: Velocity plays an important role compared to the other characteristics. Velocity
refers to the speed at which data is created in real time. It covers the speed of incoming
data streams, the rate of change, and bursts of activity. A primary aspect of Big Data is to
provide the demanded data rapidly.

Big Data technology components:
We can categorize the leading big data technologies into the following four sections:

 Data Storage
 Data Mining
 Data Analytics
 Data Visualization

Data Storage
Let us first discuss leading Big Data Technologies that come under Data Storage:

 Hadoop: When it comes to handling big data, Hadoop is one of the leading technologies
that come into play. This technology is based entirely on map-reduce architecture and is
mainly used to process batch information. Also, it is capable enough to process tasks in
batches. The Hadoop framework was mainly introduced to store and process data in a
distributed data processing environment across clusters of commodity hardware, using a simple
programming execution model.
Apart from this, Hadoop is also best suited for storing and analyzing the data from
various machines with a faster speed and low cost. That is why Hadoop is known as one
of the core components of big data technologies. The Apache Software
Foundation introduced it in Dec 2011. Hadoop is written in Java programming language.
 MongoDB: MongoDB is another important component of big data technologies in terms
of storage. Relational and RDBMS properties do not apply to MongoDB because it
is a NoSQL database. This is not the same as traditional RDBMS databases that use
structured query languages; instead, MongoDB stores schema-flexible documents.
The structure of the data storage in MongoDB is also different from traditional RDBMS
databases. This enables MongoDB to hold massive amounts of data. It is based on a
simple cross-platform document-oriented design. The database in MongoDB uses
JSON-like documents with flexible schemas. This ultimately helps operational data
storage use cases, which can be seen in most financial organizations. As a result,
MongoDB is replacing traditional mainframes and offering the flexibility to handle a
wide range of high-volume data types in distributed architectures. (A minimal
document-store sketch using the pymongo driver follows this Data Storage list.)
MongoDB Inc. introduced MongoDB in Feb 2009. It is written with a combination of
C++, Python, JavaScript, and Go.
 RainStor: RainStor is a popular database management system designed to manage and
analyze organizations' Big Data requirements. It uses deduplication strategies that help
manage storing and handling vast amounts of data for reference.
RainStor was designed in 2004 by a RainStor Software Company. It operates just like
SQL. Companies such as Barclays and Credit Suisse are using RainStor for their big data
needs.
 Hunk: Hunk is mainly helpful when data needs to be accessed in remote Hadoop clusters
using virtual indexes. This lets us use the Splunk Search Processing Language (SPL) to
analyze data. Also, Hunk allows us to report and visualize vast amounts of data from
Hadoop and NoSQL data sources.
Hunk was introduced in 2013 by Splunk Inc. It is based on the Java programming
language.
 Cassandra: Cassandra is one of the leading big data technologies among the list of top
NoSQL databases. It is open-source, distributed and has extensive column storage
options. It is freely available and provides high availability without fail. This ultimately
helps in handling data efficiently on large groups of commodity servers. Cassandra's
essential features include fault-tolerant mechanisms, scalability, MapReduce support,
distributed nature, eventual consistency, query language property, tunable consistency,
and multi-datacenter replication, etc.
Cassandra was originally developed at Facebook for its inbox search feature and was released
as open source in 2008; it is now an Apache Software Foundation project. It is written in the
Java programming language.
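
As a minimal sketch of MongoDB's document-oriented storage, the snippet below uses the pymongo
driver and assumes a MongoDB server is running locally; the connection string, database, collection,
and fields are all hypothetical.

from pymongo import MongoClient

# Connection string is an assumption; adjust for your deployment.
client = MongoClient("mongodb://localhost:27017")
db = client["shop"]

# Documents are JSON-like and need no fixed schema.
db.orders.insert_one({"customer": "Asha", "items": ["pen", "notebook"], "total": 120.5})

# Query by field, much as you would filter rows in SQL.
for order in db.orders.find({"total": {"$gt": 100}}):
    print(order["customer"], order["total"])

client.close()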

Data Mining:
Let us now discuss leading Big Data Technologies that come under Data Mining:

 Presto: Presto is an open-source and a distributed SQL query engine developed to run
interactive analytical queries against huge-sized data sources. The size of data sources
can vary from gigabytes to petabytes. Presto helps in querying the data in Cassandra,
Hive, relational databases and proprietary data storage systems.
Presto is a Java-based query engine that was developed at Facebook and released as open
source in 2013. Companies like Repro, Netflix, Airbnb, Facebook and Checkr are using this
big data technology and making good use of it.
 RapidMiner: RapidMiner is defined as the data science software that offers us a very
robust and powerful graphical user interface to create, deliver, manage, and maintain
predictive analytics. Using RapidMiner, we can create advanced workflows and scripting
support in a variety of programming languages.
RapidMiner is a Java-based centralized solution developed in 2001 by Ralf
Klinkenberg, Ingo Mierswa, and Simon Fischer at the Technical University of
Dortmund's AI unit. It was initially named YALE (Yet Another Learning Environment).
A few sets of companies that are making good use of the RapidMiner tool are Boston
Consulting Group, InFocus, Domino's, Slalom, and Vivint.SmartHome.
 ElasticSearch: When it comes to finding information, ElasticSearch is known as an
essential tool. It is the core component of the ELK stack, whose other main components are
Logstash and Kibana. In simple words, ElasticSearch is a search engine based on the Lucene
library and works similarly to Solr. It provides a distributed, multi-tenant-capable, full-text
search engine that stores schema-free JSON documents behind an HTTP web interface.
(A minimal indexing-and-search sketch using the Python client follows this Data Mining list.)
ElasticSearch is primarily written in a Java programming language and was developed in
2010 by Shay Banon. Now, it has been handled by Elastic NV since 2012. ElasticSearch
is used by many top companies, such as LinkedIn, Netflix, Facebook, Google, Accenture,
StackOverflow, etc.
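
A minimal indexing-and-search sketch using the official Python client (elasticsearch-py, 8.x-style
API) is shown below; the index name, document fields, and local endpoint are assumptions, and a
node is assumed to be running without security enabled.

from elasticsearch import Elasticsearch

# Assumes an Elasticsearch node is reachable locally.
es = Elasticsearch("http://localhost:9200")

# Index a schema-free JSON document.
es.index(index="articles", id=1, document={"title": "Big Data basics", "views": 120})

# Full-text search against the indexed documents.
result = es.search(index="articles", query={"match": {"title": "big data"}})
for hit in result["hits"]["hits"]:
    print(hit["_source"]["title"])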

Data Analytics:
Now, let us discuss leading Big Data Technologies that come under Data Analytics:

 Apache Kafka: Apache Kafka is a popular streaming platform. It is primarily known for
three core capabilities: publishing and subscribing to streams of records, storing them, and
processing them. It is referred to as a distributed streaming platform and can be described as an
asynchronous messaging broker system that can ingest and perform data processing on
real-time streaming data. The platform is similar to an enterprise messaging system or
message queue.
Besides, Kafka also provides a retention period, and data is transmitted through a
producer-consumer mechanism (a minimal producer/consumer sketch follows this Data
Analytics list). Kafka has received many enhancements to date and includes additional
components and properties, such as the schema registry, KTables, KSQL, etc. It is written in
Java and Scala, was originally developed at LinkedIn, and was open-sourced through the
Apache Software Foundation in 2011. Some top companies using the Apache Kafka platform
include Twitter, Spotify, Netflix, Yahoo, LinkedIn etc.
 Splunk: Splunk is known as one of the popular software platforms for capturing,
correlating, and indexing real-time streaming data in searchable repositories. Splunk can
also produce graphs, alerts, summarized reports, data visualizations, and dashboards, etc.,
using related data. It is mainly beneficial for generating business insights and web
analytics. Besides, Splunk is also used for security purposes, compliance, application
management and control.
Splunk Inc. released the first version of Splunk in 2004. It is written with a combination of AJAX,
Python, C++ and XML. Companies such as Trustwave, QRadar, and 1Labs are making
good use of Splunk for their analytical and security needs.
 KNIME: KNIME is used to draw visual data flows, execute specific steps and analyze
the obtained models, results, and interactive views. It also allows us to execute all the
analysis steps altogether. It consists of an extension mechanism that can add more
plugins, giving additional features and functionalities.
KNIME is based on Eclipse and written in a Java programming language. It was
developed in 2008 by KNIME Company. A list of companies that are making use of
KNIME includes Harnham, Tyler, and Paloalto.
 Spark: Apache Spark is one of the core technologies in the list of big data technologies.
It is one of those essential technologies which are widely used by top companies. Spark is
known for offering In-memory computing capabilities that help enhance the overall speed
of the operational process. It also provides a generalized execution model to support more
applications. Besides, it includes top-level APIs (e.g., Java, Scala, and Python) to ease the
development process.
Also, Spark allows users to process and handle real-time streaming data using batching
and windowing techniques. Higher-level abstractions such as Datasets and DataFrames are
built on top of the RDDs provided by Spark Core. Components like Spark MLlib, GraphX,
and SparkR help with machine learning and data science workloads. Spark is written using
Java, Scala, Python and R. It was originally developed at UC Berkeley in 2009 and is now an
Apache Software Foundation project. Companies like Amazon, ORACLE, CISCO,
VerizonWireless, and Hortonworks are using this big data technology and making good use of it.
 R-Language: R is defined as the programming language, mainly used in statistical
computing and graphics. It is a free software environment used by leading data miners,
practitioners and statisticians. The language is primarily beneficial in the development of
statistics-based software and data analytics.
R 1.0.0 was released in Feb 2000 by the R Foundation. It is written primarily in C, Fortran, and R itself.
Companies like Barclays, American Express, and Bank of America use R-Language for
their data analytics needs.
 Blockchain: Blockchain is a technology that can be used in several applications related
to different industries, such as finance, supply chain, manufacturing, etc. It is primarily
used in processing operations like payments and escrow. This helps in reducing the risks
of fraud. Besides, it enhances the transaction's overall processing speed, increases
financial privacy, and helps internationalize markets. Additionally, it is also used to fulfill
the needs of shared ledger, smart contract, privacy, and consensus in any Business
Network Environment.
Blockchain technology was first introduced in 1991 by two researchers, Stuart
Haber and W. Scott Stornetta. However, blockchain has its first real-world application
in Jan 2009 when Bitcoin was launched. It is a specific type of database based on Python,
C++, and JavaScript. ORACLE, Facebook, and MetLife are a few of those top companies
using Blockchain technology.
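
As a minimal sketch of Kafka's producer-consumer mechanism, the snippet below uses the
third-party kafka-python client; the broker address and topic name are assumptions, and a broker
is assumed to be running locally.

from kafka import KafkaProducer, KafkaConsumer

# Producer publishes messages to a topic (broker address is hypothetical).
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("page-views", b'{"user": "asha", "url": "/home"}')
producer.flush()

# Consumer subscribes to the same topic and reads the stream.
consumer = KafkaConsumer(
    "page-views",
    bootstrap_servers="localhost:9092",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,   # stop iterating if no new messages arrive
)
for message in consumer:
    print(message.value)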

Data Visualization:
Let us discuss leading Big Data Technologies that come under Data Visualization:

 Tableau: Tableau is one of the fastest and most powerful data visualization tools used by
leading business intelligence industries. It helps in analyzing data at very high speed.
Tableau helps in creating visualizations and insights in the form of dashboards
and worksheets.
Tableau is developed and maintained by Tableau Software. It was introduced
in May 2013. It is written using multiple languages, such as Python, C, C++, and Java.
Comparable business intelligence tools include Cognos, Qlik, and ORACLE Hyperion.
 Plotly: As the name suggests, Plotly is best suited for plotting or creating graphs and
relevant components at a faster speed in an efficient way. It consists of several rich
libraries and APIs, such as MATLAB, Python, Julia, REST API, Arduino, R, Node.js,
etc. This helps in creating interactively styled graphs in Jupyter Notebook and PyCharm (a
minimal Plotly sketch follows this list).
Plotly was introduced in 2012 by Plotly company. It is based on JavaScript. Paladins and
Bitbank are some of those companies that are making good use of Plotly.
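
A minimal Plotly sketch using the plotly.express module of the Python library is shown below;
the monthly sales values are invented purely for illustration.

import plotly.express as px

# Hypothetical monthly sales figures.
months = ["Jan", "Feb", "Mar", "Apr"]
sales = [120, 150, 90, 180]

# Build an interactive bar chart and display it in the browser or notebook.
fig = px.bar(x=months, y=sales, labels={"x": "Month", "y": "Sales"}, title="Monthly sales")
fig.show()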

Emerging Big Data Technologies

Apart from the above mentioned big data technologies, there are several other emerging big data
technologies. The following are some essential technologies among them:

 TensorFlow: TensorFlow combines multiple comprehensive libraries, flexible ecosystem


tools, and community resources that help researchers implement the state of the art in
Machine Learning. Besides, this ultimately allows developers to build and deploy
machine learning-powered applications in specific environments (a minimal Keras training
sketch follows this list).
TensorFlow was released in 2015 by the Google Brain Team. It is mainly based on C++,
CUDA, and Python. Companies like Google, eBay, Intel, and Airbnb are using this
technology for their business requirements.
 Beam: Apache Beam consists of a portable API layer that helps build and maintain
sophisticated parallel-data processing pipelines. Apart from this, it also allows the
execution of built pipelines across a diversity of execution engines or runners.
Apache Beam was introduced in June 2016 by the Apache Software Foundation. It is
written in Python and Java. Some leading companies like Amazon, ORACLE, Cisco, and
VerizonWireless are using this technology.
 Docker: Docker is defined as the special tool purposely developed to create, deploy, and
execute applications easier by using containers. Containers usually help developers pack
up applications properly, including all the required components like libraries and
dependencies. Typically, containers bind all components and ship them all together as a
package.
Docker was introduced in March 2013 by Docker Inc. It is based on the Go language.
Companies like Business Insider, Quora, Paypal, and Splunk are using this technology.
 Airflow: Airflow is a technology that is defined as a workflow automation and
scheduling system. This technology is mainly used to control and maintain data
pipelines. It contains workflows designed using the DAGs (Directed Acyclic Graphs)
mechanism and consisting of different tasks. The developers can also define workflows
in codes that help in easy testing, maintenance, and versioning.
Airflow was originally created at Airbnb and became a top-level Apache Software Foundation
project in 2019. It is based on the Python language. Companies like Checkr and Airbnb are using this leading
technology.
 Kubernetes: Kubernetes is defined as a vendor-agnostic cluster and container
management tool made open-source in 2014 by Google. It provides a platform for
automation, deployment, scaling, and application container operations in the host
clusters.
Kubernetes was introduced in July 2015 by the Cloud Native Computing Foundation.
It is written in the Go language. Companies like American Express, Pear Deck,
PeopleSource, and Northwestern Mutual are making good use of this technology.
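
As a minimal sketch of building and training a model with TensorFlow's Keras API, the snippet
below fits a single dense layer to a tiny synthetic dataset; the data and layer sizes are arbitrary
choices for illustration only.

import numpy as np
import tensorflow as tf

# Tiny synthetic dataset: learn y = 2x + 1 from a handful of points.
x = np.array([[0.0], [1.0], [2.0], [3.0], [4.0]])
y = 2 * x + 1

# A single dense layer is enough for this linear relationship.
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(1,))])
model.compile(optimizer="sgd", loss="mse")
model.fit(x, y, epochs=200, verbose=0)

print(model.predict(np.array([[10.0]])))   # should be close to 21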

These are emerging technologies. However, they are not limited because the ecosystem of big
data is constantly emerging. That is why new technologies are coming at a very fast pace based
on the demand and requirements of IT industries.

L-5
Big Data importance and applications:
Importance of Big Data in IT Sectors:
 Many old IT companies are heavily dependent on big data to modernize their
outdated mainframes, by identifying the root causes of failures and issues in real time
in antiquated code bases. Many organizations are replacing their traditional
systems with open-source platforms like Hadoop.
 Most big data solutions are based on Hadoop, which allows designs to scale up from
a single machine to thousands of machines, each offering local computation and
storage. Moreover, it is a “free” open-source platform, minimizing the capital
investment required for an organization to acquire new platforms.
 With the help of big data technologies, IT companies are able to process third-party
data quickly, which would otherwise be hard to digest at once, thanks to the inherently high
horsepower and parallelized working of these platforms.

Applications of Big Data:
The term Big Data refers to large amounts of complex and unprocessed data. Nowadays,
companies use Big Data to make business more informative and to support business
decisions by enabling data scientists, analytical modelers and other professionals to analyse large
volumes of transactional data. Big data is the valuable and powerful fuel that drives large IT
industries of the 21st century. Big data is a spreading technology used in every business sector. In
this section, we will discuss applications of Big Data.

Travel and Tourism


Travel and tourism are major users of Big Data. It enables us to forecast travel facility
requirements at multiple locations, improve business through dynamic pricing, and much more.
Financial and banking sector
The financial and banking sectors use big data technology extensively. Big data analytics
helps banks understand customer behaviour on the basis of investment patterns, shopping trends,
motivation to invest, and inputs that are obtained from personal or financial backgrounds.

Healthcare
Big data has started making a massive difference in the healthcare sector. With the help
of predictive analytics, medical professionals and healthcare personnel can
provide personalized healthcare to individual patients as well.

Telecommunication and media


Telecommunications and the multimedia sector are the main users of Big Data. There
are zettabytes of data generated every day, and handling data at this scale requires big data
technologies.
Government and Military
The government and military also use this technology at high rates. Consider the figures that
the government keeps on record; in the military, a fighter plane needs to
process petabytes of data. Government agencies use Big Data to run many departments, manage
utilities, deal with traffic jams, and tackle crimes like hacking and online fraud.

Aadhar Card: The government has records of 1.21 billion citizens. This vast data is analyzed
and stored to find things like the number of youths in the country. Schemes are built to target
the maximum population. Such data cannot be stored in a traditional database, so it is stored and
analyzed using Big Data analytics tools.

E-commerce
E-commerce is also an application of Big Data. Maintaining relationships with customers is
essential for the e-commerce industry. E-commerce websites use Big Data to market and retail
merchandise to customers, manage transactions, and implement better, more innovative strategies
to improve their businesses.

Amazon: Amazon is a tremendous e-commerce website dealing with lots of traffic daily. But when there
is a pre-announced sale on Amazon, traffic increases rapidly and may crash the website. So, to handle this
type of traffic and data, it uses Big Data. Big Data helps in organizing and analyzing the data for future use.

Social Media
Social Media is the largest data generator. The statistics have shown that around 500+ terabytes
of fresh data are generated from social media daily, particularly on Facebook. The data mainly
consists of videos, photos, message exchanges, etc. A single activity on a social media site
generates a large amount of stored data that gets processed when required. Since the data stored
runs into terabytes (TB), it takes a lot of time to process. Big Data is a solution to this problem.
Big Data features – security, compliance, auditing and protection:
Big Data Security: Big data security is a set of data security measures and practices to
safeguard large volumes of data, known as "big data," from malware attacks, unauthorized
access, and other security threats. Security measures enhance safety, prevent incidents, protect
property, and contribute to the overall well-being of individuals and communities.

Big Data Compliance: Data compliance is the formal governance structure in place to
ensure an organization complies with laws, regulations, and standards around its data. The
process governs the possession, organization, storage, and management of digital assets or data
to protect it from loss, theft, misuse, or compromise.

Data compliance involves identifying the relevant governance frameworks that ensure the
highest standards of data security, storage, and protection. The process includes setting up
procedures, policies, and protocols to protect the company data (customer lists) from
unauthorized access and cybersecurity threats.

Big Data auditing and protection:


What is big data auditing?
Big data is based on volume, variety, and velocity. Businesses that utilize big data in their audit
programs have huge volumes of data from different locations. Big data adoption allows
businesses to combine structured and unstructured data in a uniform manner to derive valuable
insights. It allows auditors to more effectively audit the large amounts of data held and processed
in the IT systems of larger clients. Auditors can extract and manipulate client data and analyse it.
Big data security's mission is clear enough: keep out unauthorized users and intrusions with
firewalls, strong user authentication, end-user training, and intrusion protection systems (IPS)
and intrusion detection systems (IDS). In case someone does gain access, encrypt your data in
transit and at rest.

Big Data privacy and ethics:


Big data privacy ethics delves into the ethical considerations and principles that guide the
responsible collection, storage, and utilization of massive amounts of data while protecting
individuals' privacy and data protection rights.

Ethical concerns arise when companies do not provide clear and transparent information about
data collection practices, making it difficult for individuals to make informed decisions and give
meaningful consent. The field of big data ethics itself is defined as outlining, defending and
recommending concepts of right and wrong practice when it comes to the use of data, with
particular emphasis on personal data. Big data ethics aims to create an ethical and moral code of
conduct for data use.

The ethics of privacy involve many different concepts such as liberty, autonomy, security, and in
a more modern sense, data protection and data exposure.
You can understand the concept of big data privacy by breaking it down into three categories:
1. The condition of privacy
2. The right to privacy
3. The loss of privacy and invasion

The scale and velocity of big data pose a serious concern as many traditional privacy processes
cannot protect sensitive data, which has led to an exponential increase in cybercrime and data
leaks.

Big Data Analytics:

Big-Data Analytics is like having a digital detective that helps us make sense of the massive
amount of data we create in our online lives. Whether it’s the things we purchase online (on
Flipkart, Amazon, etc.), the videos we watch on different platforms, or the posts we share, our
digital actions generate a ton of information. It’s like searching for a needle in a haystack,
where the needle is the valuable information. Big data analytics acts as a super-detective for
data, using advanced technology and clever mathematical techniques to make sense of all this
messy information. Its ability to assist people in many occupations is really interesting. For
example, if you own a cafe, it can give you insights into your customers’ tastes, helping you
create new recipes. In finance, it can analyze trends and patterns to help investors make
fruitful decisions. When you shop online on an e-commerce platform, websites use Big-Data
Analytics to recommend products based on what you’ve bought in the past. In healthcare, it
can predict disease outbreaks by analyzing patient data. So, it’s like having a super-smart
friend who takes all the messy data and turns it into valuable insights, whether you’re
shopping, staying healthy, or making business decisions.

Challenges of conventional systems:

Conventional Systems:
 Conventional systems are the traditional data management systems, such as relational
databases running on a single server, that are designed for structured data of limited size.
 Big data is a huge amount of data which is beyond the processing capacity of conventional
database systems to manage and analyze within a specific time interval.

List of challenges of Conventional Systems


1. Uncertainty of Data Management Landscape
2. The Big Data Talent Gap
3. Getting data into the big data platform
4. Need for synchronization across data sources
5. Getting important insights through the use of Big data analytics

 Uncertainty of Data Management Landscape: Because big data is continuously


expanding, new companies and technologies are being developed every day.
A big challenge for companies is to find out which technology works best for them
without the introduction of new risks and problems.

 The Big Data Talent Gap: While Big Data is a growing field, there are very few experts
available in this field. This is because Big data is a complex field and people who
understand the complexity and intricate nature of this field are few and far between.

 Getting data into the big data platform: Data is increasing every single day. This means
that companies have to tackle limitless amounts of data on a regular basis. The scale and variety
of data available today can overwhelm any data practitioner, and that is why it is important to
make data accessibility simple and convenient for brand managers and owners.

 Need for synchronization across data sources: As data sets become more diverse, there
is a need to incorporate them into an analytical platform. If this is ignored, it can create
gaps and lead to wrong insights and messages.

 Getting important insights through the use of Big data analytics: It is important that
companies gain proper insights from big data analytics and it is important that the correct
department has access to this information. A major challenge in the big data analytics is
bridging this gap in an effective fashion.

Intelligent data analysis:


Intelligent Data Analysis (IDA) is one of the hot issues in the field of artificial intelligence
and information.

IDA is an interdisciplinary study concerned with the effective analysis of data.

IDA is used for extracting useful information from large quantities of online data; extracting
desirable knowledge or interesting patterns from existing databases;

 the distillation of information that has been collected, classified, organized, integrated,
abstracted and value-added;

 at a level of abstraction higher than the data, and information on which it is based and can
be used to deduce new information and new knowledge;

 usually in the context of human expertise used in solving problems.



Note: The goal of intelligent data analysis is to extract useful knowledge; the process demands a
combination of extraction, analysis, conversion, classification, organization, reasoning, and
so on.

NATURE OF DATA:
 Data is a set of values of qualitative or quantitative variables; restated, pieces of data are
individual pieces of information.

 Data is measured, collected and reported, and analyzed, where upon it can be visualized
using graphs or images.

Properties of Data:
1. Amenability of use
2. Clarity
3. Accuracy
4. Essence
5. Aggregation
6. Compression
7. Refinement

 Amenability of use: From the dictionary meaning of data it is learnt that data are facts
used in deciding something. In short, data are meant to be used as a base for arriving at
definitive conclusions.

 Clarity: Data are a crystallized presentation. Without clarity, the meaning desired to be
communicated will remain hidden.

 Accuracy: Data should be real, complete and accurate. Accuracy is thus, an essential
property of data.

 Essence: Large quantities of data are collected and they have to be compressed and
refined. Data so refined can present the essence, or derived qualitative value, of the
matter.

 Aggregation: Aggregation is cumulating or adding up.

 Compression: Large amounts of data are always compressed to make them more
meaningful and to bring them to a manageable size. Graphs and charts are some examples of
compressed data.

 Refinement: Data require processing or refinement. When refined, they are capable of
leading to conclusions or even generalizations. Conclusions can be drawn only when data
are processed or refined.

TYPES OF DATA:
 In order to understand the nature of data it is necessary to categorize them into various
types.
 Different categorizations of data are possible.

 The first such categorization may be on the basis of disciplines, e.g., Sciences, Social
Sciences, etc. in which they are generated.

 Within each of these fields, there may be several ways in which data can be categorized
into types.

There are four types of data:

 Nominal
 Ordinal
 Interval
 Ratio
Each offers a unique set of characteristics, which impacts the type of analysis that can be
performed.

The distinction between the four types of scales centers on three different characteristics:
 The order of responses – whether it matters or not
 The distance between observations – whether it matters or is interpretable
 The presence or inclusion of a true zero

Nominal Scales:
Nominal scales measure categories and have the following characteristics:
Order: The order of the responses or observations does not matter.

Distance: Nominal scales do not hold distance. The distance between a 1 and a 2 is not the
same as a 2 and 3.

True Zero: There is no true or real zero. In a nominal scale, zero is uninterpretable.

Appropriate statistics for nominal scales: mode, count, frequencies

Displays: histograms or bar charts


Ordinal Scales:
At the risk of providing a tautological definition, ordinal scales measure, well, order. So, our
characteristics for ordinal scales are:
Order: The order of the responses or observations matters.

Distance: Ordinal scales do not hold distance. The distance between first and second is
unknown as is the distance between first and third along with all observations.

True Zero: There is no true or real zero. An item, observation, or category cannot finish in
position zero.

Appropriate statistics for ordinal scales: count, frequencies, mode

Displays: histograms or bar charts.

Interval Scales:
Interval scales provide insight into the variability of the observations or data.
Classic interval scales are Likert scales (e.g., 1 - strongly agree and 9 - strongly disagree) and
Semantic Differential scales (e.g., 1 - dark and 9 - light).
In an interval scale, users could respond to “I enjoy opening links to the website from a
company email” with a response ranging on a scale of values.

The characteristics of interval scales are:


Order: The order of the responses or observations does matter.

Distance: Interval scales do offer distance. That is, the distance from 1 to 2 appears the same
as from 4 to 5. Hence, we can perform addition and subtraction on the data, although ratios (such
as “twice as much”) are not meaningful without a true zero.

True Zero: There is no zero with interval scales. However, data can be rescaled in a manner
that contains zero. An interval scale measure from 1 to 9 remains the same as 11 to 19
because we added 10 to all values. Similarly, a 1 to 9 interval scale is the same as a -4 to 4 scale
because we subtracted 5 from all values. Although the new scale contains zero, zero remains
uninterpretable because it only appears in the scale through the transformation.

Appropriate statistics for interval scales: count, frequencies, mode, median, mean,
standard deviation (and variance), skewness, and kurtosis.

Displays: histograms or bar charts, line charts, and scatter plots.

Ratio Scales:
Ratio scales appear as interval scales with a true zero.
They have the following characteristics:
Order: The order of the responses or observations matters.

Distance: Ratio scales do have an interpretable distance.

True Zero: There is a true zero.

Income is a classic example of a ratio scale:


Order is established. We would all prefer $100 to $1!

Zero dollars means we have no income (or, in accounting terms, our revenue exactly equals
our expenses!)

Distance is interpretable, in that $20 appears as twice $10 and $50 is half of a $100.

For the web analyst, the statistics for ratio scales are the same as for interval scales.

Appropriate statistics for ratio scales: count, frequencies, mode, median, mean, standard
deviation (and variance), skewness, and kurtosis.

Displays: histograms or bar charts, line charts, and scatter plots.

The table below summarizes the characteristics of all four types of scales.

Scale      Order matters   Distance interpretable   True zero
Nominal    No              No                       No
Ordinal    Yes             No                       No
Interval   Yes             Yes                      No
Ratio      Yes             Yes                      Yes

(A short sketch computing the appropriate statistics for each scale with pandas follows the table.)
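
As a small sketch of computing the appropriate statistics for each scale with pandas, see below;
the sample survey values and column names are invented for illustration.

import pandas as pd

df = pd.DataFrame({
    "city": ["Banda", "Kanpur", "Banda", "Lucknow"],      # nominal
    "satisfaction": [1, 3, 2, 3],                          # ordinal (1 = low, 3 = high)
    "likert_score": [4, 7, 5, 9],                          # interval (1-9 Likert)
    "income": [25000, 40000, 31000, 52000],                # ratio (true zero exists)
})

print(df["city"].mode()[0], df["city"].value_counts().to_dict())  # mode and frequencies
print(df["satisfaction"].median())                                # order matters, distance does not
print(df["likert_score"].mean(), df["likert_score"].std())        # mean and standard deviation
print(df["income"].mean(), df["income"].skew())                   # ratio scale supports all of the above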

ANALYTIC PROCESS AND TOOLS:


There are 6 analytic processes:
1. Deployment
2. Business Understanding
3. Data Exploration
4. Data Preparation
5. Data Modeling
6. Data Evaluation
Step 1: Deployment
• Here we need to:
– plan the deployment and monitoring and maintenance,
– we need to produce a final report and review the project.
– In this phase,
• we deploy the results of the analysis.
• This is also known as reviewing the project.

Step 2: Business Understanding


 The very first step consists of business understanding.
 Whenever any requirement occurs, firstly we need to determine the business objective,
 assess the situation,
 determine data mining goals and then
 produce the project plan as per the requirement.
Business objectives are defined in this phase.

Step 3: Data Exploration


• This step consists of data understanding.
– For the further process, we need to gather the initial data, describe and explore it, and verify
data quality to ensure it contains the data we require.
– Data collected from the various sources is described in terms of its application and the need for
the project in this phase.
– This is also known as data exploration.
• Data exploration is necessary to verify the quality of the data collected.
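
A minimal Python sketch of this step is shown below; the file name weblogs.csv and its columns
are hypothetical, and pandas is assumed to be available:

    import pandas as pd

    df = pd.read_csv("weblogs.csv")      # gather the initial data (hypothetical file)

    df.info()                            # describe the data: columns, types, non-null counts
    print(df.describe())                 # summary statistics for the numeric columns
    print(df.isnull().sum())             # verify data quality: missing values per column
    print(df.duplicated().sum())         # number of duplicated records, if any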
Step 4: Data Preparation
From the data collected in the last step,
– we need to select data as per the need, clean it, construct it to get useful information and
– then integrate it all.
• Finally, we need to format the data to get the appropriate data.
• Data is selected, cleaned, and integrated into the format finalized for the analysis in this phase.
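
As an illustration only, the following pandas sketch walks through select, clean, construct,
integrate, and format on two hypothetical source files (orders.csv and customers.csv):

    import pandas as pd

    orders = pd.read_csv("orders.csv")           # hypothetical source 1
    customers = pd.read_csv("customers.csv")     # hypothetical source 2

    orders = orders[["order_id", "customer_id", "amount"]]          # select the needed columns
    orders = orders.dropna(subset=["amount"])                       # clean: drop incomplete rows
    orders["amount_usd"] = orders["amount"] / 100.0                 # construct a derived field
    merged = orders.merge(customers, on="customer_id", how="left")  # integrate the sources
    merged["order_id"] = merged["order_id"].astype(str)             # format for the analysis
    merged.to_csv("prepared.csv", index=False)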

Step 5: Data Modeling


• We need to
– select a modeling technique, generate test design, build a model and assess the model
built.
• The data model is built to
– analyze relationships between the various selected objects in the data;
– test cases are built for assessing the model, and the model is tested and implemented on the
data in this phase.
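
A minimal modeling sketch with scikit-learn is shown below; it uses synthetic data so that it
runs as-is, and the choice of logistic regression is purely illustrative:

    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.linear_model import LogisticRegression

    # Synthetic data stands in for the prepared data set.
    X, y = make_classification(n_samples=500, n_features=10, random_state=42)

    # Test design: hold out a quarter of the data for assessment.
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

    model = LogisticRegression(max_iter=1000)        # chosen modeling technique
    model.fit(X_train, y_train)                      # build the model
    print("accuracy:", model.score(X_test, y_test))  # assess the model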
Step 6: Data Evaluation
• In this phase, the results of the model are evaluated against the business objectives defined
earlier, the whole process is reviewed, and the next steps (such as deployment) are decided.

Big Data analytics (BDA) tools can be characterized by the following questions:
• Where is processing hosted?
– Distributed servers / cloud (e.g. Amazon EC2)
• Where is data stored?
– Distributed storage (e.g. Amazon S3)
• What is the programming model?
– Distributed processing (e.g. MapReduce)
• How is data stored and indexed?
– High-performance schema-free databases (e.g. MongoDB)
• What operations are performed on the data?
– Analytic / semantic processing
• Big Data tools for HPC and supercomputing
– MPI
• Big Data tools on clouds
– MapReduce model
– Iterative MapReduce model
– DAG model
– Graph model
– Collective model
• Other BDA tools
– SAS
– R
– Hadoop
Thus, BDA tools are used throughout the development of BDA applications.

L-9
ANALYSIS AND REPORTING:
What is Analysis?
• The process of exploring data and reports
– in order to extract meaningful insights,
– which can be used to better understand and improve business performance.
What is Reporting ?
• Reporting is
– “the process of organizing data
– into informational summaries
– in order to monitor how different areas of a business are performing.”

COMPARING ANALYSIS WITH REPORTING:


• Reporting is “the process of organizing data into informational summaries in order to monitor
how different areas of a business are performing.”

• Measuring core metrics and presenting them — whether in an email, a slidedeck, or online
dashboard — falls under this category.

• Analytics is “the process of exploring data and reports in order to extract meaningful insights,
which can be used to better understand and improve business performance.”

• Reporting helps companies to monitor their online business and be alerted to when data falls
outside of expected ranges.

• Good reporting should raise questions about the business from its end users.

• The goal of analysis is to answer questions by interpreting the data at a deeper level and
providing actionable recommendations.

• A firm may be focused on the general area of analytics (strategy, implementation, reporting,
etc.)

– but not necessarily on the specific aspect of analysis.


• It’s almost like some organizations run out of gas after the initial set-up-related activities and
don’t make it to the analysis stage.

A reporting activity typically prompts a subsequent analysis activity.

CONTRAST BETWEEN ANALYSIS AND REPORTING


The basic differences between analysis and reporting are as follows:
• Reporting translates raw data into information.
• Analysis transforms data and information into insights.
• Reporting shows you what is happening,
• while analysis focuses on explaining why it is happening and what you can do about it.

Reporting usually raises the question: What is happening?

Analysis transforms the data into insights: Why is it happening? What can you do about it?

Thus, analysis and reporting are not synonyms but complements: both are needed, and each serves a
distinct purpose in its own context.

MODERN ANALYTIC TOOLS:


Current Analytic tools concentrate on three classes:
a) Batch processing tools
b) Stream Processing tools and
c) Interactive Analysis tools.

Big Data Tools Based on Batch Processing:


a) Batch processing tools :-
• Batch Processing System involves
– collecting a series of processing jobs and carrying them out periodically
as a group (or batch) of jobs.

• It allows a large volume of jobs to be processed at the same time.

• An organization can schedule batch processing for a time when there is little
activity on their computer systems, for example overnight or at weekends.

• One of the most famous and powerful batch process-based Big Data tools is
Apache Hadoop.

It provides infrastructures and platforms for other specific Big Data applications.
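
To give a feel for the batch/MapReduce style that Hadoop popularized, here is a small,
self-contained Python word-count sketch. In a real Hadoop job the map and reduce phases run as
separate tasks over distributed input (e.g. via Hadoop Streaming); this standalone version only
illustrates the programming model on a couple of made-up lines of text.

    from collections import defaultdict

    def map_phase(lines):
        # Emit (word, 1) pairs for every word in the input.
        for line in lines:
            for word in line.strip().split():
                yield word.lower(), 1

    def reduce_phase(pairs):
        # Sum the counts for each word.
        counts = defaultdict(int)
        for word, one in pairs:
            counts[word] += one
        return counts

    if __name__ == "__main__":
        sample = ["big data needs batch processing", "batch processing with hadoop"]
        for word, total in sorted(reduce_phase(map_phase(sample)).items()):
            print(word, total)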

b) Stream Processing tools


• Stream processing
– Analyzing data as and when it is generated, i.e., while the data is still in motion.
• The key strength of stream processing is that it can provide insights faster, often within
milliseconds to seconds.
– It helps understanding the hidden patterns in millions of data records in real
time.
– It translates into processing of data from single or multiple sources
– in real or near-real time applying the desired business logic and emitting the
processed information to the sink.

• Stream processing serves multiple purposes in today’s business arena.
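
The framework-free Python sketch below illustrates the idea: each record is processed the moment
it “arrives” and a running aggregate is emitted immediately, instead of waiting for a complete
batch. The event source is simulated with a generator and made-up fields.

    import random
    import time

    def event_stream(n=10):
        # Simulate events arriving from a source (e.g. page views with a latency value).
        for _ in range(n):
            yield {"page": random.choice(["home", "cart"]), "latency_ms": random.randint(10, 500)}
            time.sleep(0.1)   # pretend the events arrive spread out over time

    running_count = {}
    for event in event_stream():
        page = event["page"]
        running_count[page] = running_count.get(page, 0) + 1
        print("so far:", running_count)   # insight emitted moments after each event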

Real time data streaming tools are:


Storm
• Storm is a stream processing engine without batch support,
• a true real-time processing framework,
• taking in a stream as an entire ‘event’ instead of a series of small batches.
• Apache Storm is a distributed real-time computation system.
• Its applications are designed as directed acyclic graphs.

Apache Flink
• Apache Flink is
– an open-source platform,
– a streaming dataflow engine that provides data distribution, communication, and
fault tolerance for distributed computations over data streams.
– Flink is a top-level Apache project; it is a scalable data-analytics framework that is
fully compatible with Hadoop.
– Flink can execute both stream processing and batch processing easily.
– Flink was designed as an alternative to MapReduce.

Kinesis
– Amazon Kinesis is an out-of-the-box streaming data tool.
– Kinesis streams are composed of shards, which are analogous to Kafka’s partitions.
– For organizations that need real-time or near real-time access to large stores of data,
Amazon Kinesis is a strong fit.
– Kinesis Streams solves a variety of streaming data problems.
– One common use is the real-time aggregation of data, followed by loading the aggregated
data into a data warehouse.
– Data is put into Kinesis streams; this ensures durability and elasticity.
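
As a small example of putting data into a stream with the AWS SDK for Python (boto3), the sketch
below assumes AWS credentials are already configured and that a stream named "clickstream" (a
made-up name, not from these notes) already exists.

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    event = {"user_id": 42, "action": "add_to_cart"}   # made-up payload
    kinesis.put_record(
        StreamName="clickstream",                      # hypothetical stream name
        Data=json.dumps(event).encode("utf-8"),        # record payload (bytes)
        PartitionKey=str(event["user_id"]),            # determines which shard receives the record
    )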

Interactive Analysis -Big Data Tools


• The interactive analysis presents
– the data in an interactive environment,
– allowing users to undertake their own analysis of information.

• Users are directly connected to


– the computer and hence can interact with it in real time.

• The data can be :


– reviewed,
– compared and
– analyzed

• in tabular or graphic format or both at the same time.
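
A minimal sketch of what this looks like in practice, e.g. inside a notebook session, is given
below; the DataFrame contents are made up and matplotlib is assumed for the graphic view.

    import pandas as pd
    import matplotlib.pyplot as plt

    df = pd.DataFrame({
        "channel": ["search", "email", "social", "search", "email"],
        "visits":  [120, 45, 80, 150, 60],
    })

    summary = df.groupby("channel")["visits"].sum()       # tabular view for review and comparison
    print(summary)

    summary.plot(kind="bar", title="Visits by channel")   # graphic view of the same data
    plt.show()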

Google Dremel
• Google proposed an interactive analysis system, named Dremel, in 2010,
– which is scalable for processing nested data.
– Dremel provides

• a very fast SQL like interface to the data by using a different technique than
MapReduce.
• Dremel has a very different architecture
– compared with the well-known Apache Hadoop, and
– acts as a successful complement to MapReduce-based computations.

• Dremel has capability to:


– run aggregation queries over trillion-row tables in seconds
– by means of:
• combining multi-level execution trees and
• columnar data layout.
Apache Drill
• Apache Drill is:
– an Apache open-source SQL query engine for Big Data exploration,
– similar to Google’s Dremel.
• Compared with Dremel, Drill offers:
– more flexibility to support
• various query languages,
• data formats, and
• data sources.
• Drill is designed from the ground up to:
– support high-performance analysis on the semi-structured and
– rapidly evolving data coming from modern Big Data applications.
• Drill provides plug-and-play integration with existing Apache Hive and Apache HBase
deployments.

______________________________________________________________________________
