Module 1, Part II: Big Data

CABD0056:

BIG DATA MANAGEMENT

MCA 3rd Semester

Dr. Jonti Deuri, ADBU


Introduction to Hadoop

Big Data Management-Dr. Jonti Deuri, ADBU


Key Features of Hadoop
• Distributed Storage: stores data across multiple machines in a distributed, fault-tolerant manner.
• Distributed Processing: divides jobs into smaller sub-tasks and distributes them across the cluster.
• Scalability: designed to scale horizontally by adding more machines to the cluster.
• Fault Tolerance: resilient to hardware failures; data is replicated so work can continue when nodes fail.
• Cost-Effective: runs on commodity hardware, making it an economical option for organizations.
• Flexibility: supports a variety of data types, including structured, semi-structured, and unstructured data.

• Parallel Processing: enables efficient execution of data-intensive tasks, leading to faster data processing and analysis.



History of Hadoop
Hadoop was started by Doug Cutting and Mike Cafarella in 2002. Its origin was the Google File System paper, published by Google.


• Early Development.
• Named "Hadoop"
• Hadoop Core Components.
• Ecosystem Expansion.
• Commercial Adoption.
• Hadoop's Influence.
• Challenges and Evolution.
• Apache Hadoop Releases.
• Transition to Modern Data Platforms.
• Legacy and Impact.



Hadoop – Architecture
The Hadoop architecture mainly consists of four components:

• MapReduce
• HDFS(Hadoop Distributed File System)
• YARN(Yet Another Resource Negotiator)
• Common Utilities or Hadoop Common




1. MapReduce:
• MapReduce processes data in two phases: in the first phase the Map function is applied to the input, and in the second phase the Reduce function aggregates the intermediate results.
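The two phases can be sketched locally with ordinary Unix tools, using word count as the classic example (illustrative only; a real job would run distributed across the cluster):

```shell
# Map phase: emit one "word<TAB>1" pair per input word
printf 'big data\nbig insights\n' | tr ' ' '\n' | awk '{print $1 "\t1"}' > mapped.txt

# Shuffle/sort: bring identical keys together
sort mapped.txt > sorted.txt

# Reduce phase: sum the counts for each key
awk -F'\t' '{c[$1] += $2} END {for (w in c) print w, c[w]}' sorted.txt | sort
# -> big 2
#    data 1
#    insights 1
```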


2. HDFS:
• HDFS follows a master/slave architecture.
• This architecture consists of:
• a single NameNode (master)
• multiple DataNodes (slaves).


3. YARN (Yet Another Resource Negotiator)


• YARN is the resource management and job scheduling component of Hadoop.
• It separates the resource management and job scheduling functions, allowing different processing
frameworks (not just MapReduce) to run on the same Hadoop cluster.
• YARN manages resources, schedules tasks, and ensures efficient utilization of cluster resources.


4. Hadoop Common (Common Utilities)


• The Java libraries, files, and scripts needed by all the other components present in a Hadoop cluster.
• These utilities are used by HDFS, YARN, and MapReduce to run the cluster.
• Hadoop Common assumes that hardware failure in a Hadoop cluster is common, so failures need to be handled automatically in software by the Hadoop framework.



Analysing Data with Hadoop

• Analyzing data with Hadoop involves using the Hadoop framework to process and analyze large volumes of data in a distributed and parallel manner.

• Data Ingestion: Hadoop command-line tools, HDFS APIs, or higher-level tools such as Apache Sqoop.
• Data Processing with MapReduce: directly, or through Apache Pig or Hive.
• Higher-Level Abstractions: Pig Latin, HiveQL.
• Distributed Computing.
• Data Analysis Libraries: Apache Spark, Apache Flink, Apache Mahout.
• Visualization and Reporting: Apache Zeppelin, Jupyter notebooks.



Hadoop Streaming

• A utility that allows any executable or script to be used as a mapper or reducer in Hadoop MapReduce jobs.

• It enables writing MapReduce jobs in languages other than Java, such as Python, Perl, Ruby, or even shell scripts.



How Hadoop Streaming works

• Input
• Mapper and Reducer Scripts
• Execution
• Shuffling and Sorting
• Reducer Phase
• Output



Syntax for Hadoop Streaming
Syntax to run MapReduce code written in a language other than Java with the Hadoop MapReduce framework:

$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar \
    -input myInputDirs \
    -output myOutputDir \
    -mapper /bin/cat \
    -reducer /usr/bin/wc

Parameter               Description
-input myInputDirs      Input location for the mapper
-output myOutputDir     Output location for the reducer
-mapper /bin/cat        Mapper executable
-reducer /usr/bin/wc    Reducer executable
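Because the mapper and reducer are ordinary executables reading stdin and writing stdout, the streaming contract can be exercised locally with a plain Unix pipeline before submitting to a cluster. A minimal sketch of the job above (/bin/cat is an identity mapper; wc counts what the reducer receives):

```shell
# Emulate the streaming data flow without a cluster:
# input -> mapper -> sort (shuffle) -> reducer
printf 'alpha beta\ngamma delta\n' | /bin/cat | sort | wc -l   # counts the records (2 here)
```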



Hadoop Ecosystem
Hadoop has a rich ecosystem of tools and frameworks that build upon its capabilities

• Hive: A data warehousing and SQL-like query language that allows users to query and analyze
data stored in Hadoop.

• Pig: A high-level scripting language for processing and analyzing large datasets without writing MapReduce code.

• HBase: A distributed, scalable NoSQL database that provides real-time access to large amounts of sparse data.


• Spark: A fast and flexible data processing framework that supports batch processing, interactive
queries, and machine learning.

• Sqoop: A tool for transferring data between Hadoop and relational databases.

• Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.

• Oozie: A workflow scheduling and coordination system for managing Hadoop jobs.
• ZooKeeper: A distributed coordination service for managing configuration information, synchronization, and group services.



Analyzing data with Unix tools

• Analyzing data with Unix tools is a powerful and versatile way to manipulate, process, and gain insights from large datasets using command-line utilities available in Unix-like operating systems.

• The Unix philosophy emphasizes the use of small, single-purpose tools that can
be combined to accomplish complex tasks.

• Here's how you can analyze data using some commonly used Unix tools:


1. Getting Data:
• Use commands like curl, wget, or scp to retrieve data from remote sources or copy files.
2. Viewing Data:
• cat: Display the content of a file.
• head: Display the beginning of a file (default: first 10 lines).
• tail: Display the end of a file (default: last 10 lines).
• less or more: View files interactively, scrolling up and down.
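The get-then-view workflow can be sketched on a small sample file (the file name and contents are hypothetical):

```shell
# Create a tiny sample dataset, then inspect slices of it.
printf 'id,city\n1,Pune\n2,Delhi\n3,Goa\n' > cities.csv

cat cities.csv          # display the whole file
head -n 2 cities.csv    # header plus the first record
tail -n 1 cities.csv    # the last record
```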

3. Filtering Data:
• grep: Search for specific patterns in text data.
• awk and sed: Process and manipulate text data using patterns and actions

4. Transforming Data:
• cut: Extract specific columns from text data.
• tr: Translate or replace characters in text data.
• sort: Sort lines of text data.
• uniq: Display unique lines in sorted data.
• paste: Merge lines from different files.
• join: Join lines from two sorted files based on a common field.
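A short sketch of filtering and transforming on a made-up name,age file:

```shell
# Sample data (hypothetical): one "name,age" record per line
printf 'alice,30\nbob,25\nalice,25\n' > people.csv

grep '^alice' people.csv               # filter: rows starting with "alice"
cut -d, -f2 people.csv | sort -n       # transform: ages, sorted numerically
cut -d, -f1 people.csv | sort | uniq   # distinct names (uniq needs sorted input)
```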

5. Aggregating Data:
• wc: Count lines, words, and characters in text data.
• grep with options (e.g., grep -c): Count occurrences of a specific pattern.
• awk or sed: For custom aggregation.

6. Calculating Statistics:
• sort and uniq: Count occurrences and perform basic statistics on data.
• awk or perl for more complex calculations.

7. Data Transformation:
• sed and awk: For more advanced data manipulation.
• cut, paste, and join: Combining and reshaping data
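The aggregation and counting tools above can be illustrated on a tiny, made-up access log:

```shell
# Sample data (hypothetical): one "METHOD path" request per line
printf 'GET /a\nGET /b\nPOST /a\nGET /c\n' > access.log

wc -l < access.log                  # total requests
grep -c '^GET' access.log           # requests matching a pattern
awk '{print $1}' access.log | sort | uniq -c | sort -rn   # frequency per method
```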

8. Plotting and Visualization:

• Use third-party tools like gnuplot or matplotlib to create visualizations from processed data.

9. Pipelining:
• Unix tools are designed to work together using pipes (|) to pass output from one command as
input to another.
• This allows you to create complex data processing pipelines.



IBM Big Data Strategy

• IBM, a US-based computer hardware and software manufacturer, implemented a Big Data strategy.

• The company offered solutions to store, manage, and analyze the huge amounts of data generated daily, equipping large and small companies to make informed business decisions.

• The company believed that its Big Data and analytics products and services would help its clients become more competitive and drive growth.


• IBM's approach to big data involves a comprehensive strategy that encompasses various
products, services, and solutions to help organizations manage and gain insights
from large and complex datasets.

• IBM has been a significant player in the big data space for many years, offering a range
of tools and technologies that cater to different aspects of big data management and
analysis.

IBM's big data strategy typically includes the following components:
• Data Collection and Integration
• Data Storage and Management
• Data Processing and Analytics
• Machine Learning and AI
• Data Governance and Security
• Hybrid and Multi-Cloud Deployment
• Industry-Specific Solutions
• Consulting and Services

Data Collection and Integration:

• IBM provides tools and platforms for collecting, integrating, and ingesting
data from various sources.

• This includes data warehouses, data lakes, and integration technologies that help
organizations bring together structured and unstructured data from different
systems.

Data Storage and Management:

• IBM offers data storage solutions that cater to the scalability and performance
requirements of big data.

• This includes cloud-based storage services and on-premises solutions for managing large volumes of data.

Data Processing and Analytics:

• IBM provides platforms for processing and analyzing big data.

• This includes solutions for batch processing, real-time streaming analytics, and machine learning.

• The IBM Watson platform, for instance, offers a suite of AI and analytics tools
to derive insights from data.

Machine Learning and AI:

• IBM emphasizes the integration of machine learning and artificial intelligence in its big data strategy.

• IBM Watson includes machine learning capabilities that allow organizations to build and deploy predictive models for various use cases.

Data Governance and Security:

• Managing and securing big data is a critical aspect of the strategy.

• IBM offers tools and practices for ensuring data quality, governance, and
compliance with data regulations.

Hybrid and Multi-Cloud Deployment:

• IBM recognizes the importance of hybrid and multi-cloud deployment models in modern big data strategies.

• Their offerings allow organizations to seamlessly deploy and manage big data
solutions across different environments.

Industry-Specific Solutions:

• IBM tailors its big data offerings to various industries, including finance,
healthcare, retail, and more.

• This allows organizations to leverage industry-specific solutions and best practices.

Consulting and Services:

• IBM provides consulting services to assist organizations in designing, implementing, and managing their big data solutions.

• These services cover a wide range of topics, from architecture design to data
science.



Introduction to InfoSphere

• InfoSphere Information Server provides a single platform for data integration and
governance.

• The components in the suite combine to create a unified foundation for enterprise
information architectures, capable of scaling to meet any information volume requirements.

• InfoSphere Information Server helps your business and IT personnel collaborate to understand the meaning, structure, and content of information across a wide variety of sources.

• InfoSphere is a brand and suite of data integration and governance software products developed by IBM.

• It encompasses a range of tools and solutions designed to help organizations manage, integrate, cleanse, and govern their data.

• By using InfoSphere Information Server, your business can access and use
information in new ways to drive innovation, increase operational efficiency, and
lower risk.
Big Insights

• IBM BigInsights is an enterprise platform that combines Hadoop and Spark for fast analysis and processing of data.

• The solution includes Spark, SQL, text analytics and more to help you easily integrate and
analyze big data.

• With IBM, spend less time creating an enterprise-ready Hadoop infrastructure, and more
time gaining valuable insights.


• It allows organizations to cost-effectively analyze large volumes of data.

• It includes a variety of IBM technologies that enhance and extend the value of open-source Hadoop software.



BigSheets

• BigSheets is a spreadsheet-style tool for business analysts provided with IBM InfoSphere
BigInsights.

• BigSheets enables non-programmers to iteratively explore, manipulate, and visualize data stored in
your distributed file system.

• BigSheets can process huge amounts of data because user commands, expressed through a graphical interface, are translated into Pig scripts and can be run as MapReduce jobs in parallel on many nodes.


• For example, on social networking sites like Twitter, BigSheets is widely used for sentiment analysis and for finding insights from big data.
