Module 1 - Part II
• Parallel Processing: enables efficient execution of data-intensive tasks, leading to faster data processing and analysis.
A brief history of Hadoop:
• Early Development.
• Named "Hadoop"
• Hadoop Core Components.
• Ecosystem Expansion.
• Commercial Adoption.
• Hadoop's Influence.
• Challenges and Evolution.
• Apache Hadoop Releases.
• Transition to Modern Data Platforms.
• Legacy and Impact.
Hadoop Core Components:
• MapReduce
• HDFS (Hadoop Distributed File System)
• YARN (Yet Another Resource Negotiator)
• Common Utilities (Hadoop Common)
1. MapReduce:
• Processing runs in two phases: the Map function is applied in the first phase, and the Reduce function in the next.
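A minimal pure-Python sketch of the two phases, using the classic word-count example (the function names and sample data are illustrative, not Hadoop's actual API):

    from collections import defaultdict

    def map_phase(document):
        # Map: emit (key, value) pairs -- here (word, 1) for every word
        for word in document.split():
            yield (word.lower(), 1)

    def reduce_phase(key, values):
        # Reduce: combine all values that share the same key
        return (key, sum(values))

    docs = ["the quick brown fox", "the lazy dog", "the fox"]

    # Shuffle: group intermediate values by key (Hadoop does this for you)
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_phase(doc):
            grouped[key].append(value)

    results = [reduce_phase(k, v) for k, v in sorted(grouped.items())]
    print(results)  # [('brown', 1), ('dog', 1), ('fox', 2), ...]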
2. HDFS:
• It has a master/slave architecture.
• This architecture consists of:
• a single NameNode (master)
• multiple DataNodes (slaves)
Data Analysis with Hadoop:
• Data Ingestion: Hadoop command-line tools, HDFS APIs, or higher-level tools such as Apache Sqoop (see the sketch after this list).
• Data Processing with MapReduce: Apache Pig or Hive.
• Higher-Level Abstractions: Pig Latin, HiveQL
• Distributed Computing
• Data Analysis Libraries: Apache Spark, Apache Flink, Apache Mahout
• Visualization and Reporting: Apache Zeppelin, Jupyter notebooks
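As a concrete taste of the ingestion step above, a small sketch that drives the HDFS command-line client from Python (assumes a configured Hadoop client on the PATH; the file and directory names are placeholders):

    import subprocess

    # Create a target directory in HDFS and copy a local file into it
    subprocess.run(["hdfs", "dfs", "-mkdir", "-p", "/data/raw"], check=True)
    subprocess.run(["hdfs", "dfs", "-put", "sales.csv", "/data/raw/"], check=True)

    # List the directory to confirm the upload
    listing = subprocess.run(["hdfs", "dfs", "-ls", "/data/raw"],
                             capture_output=True, text=True, check=True)
    print(listing.stdout)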
Running a MapReduce job with Hadoop Streaming:
• Input
• Mapper and Reducer Scripts
• Execution
• Shuffling and Sorting
• Reducer Phase
• Output
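In Hadoop Streaming the mapper and reducer are ordinary scripts that read stdin and write stdout. A minimal word-count sketch of the two scripts (file names are illustrative):

    #!/usr/bin/env python3
    # mapper.py -- emit "word<TAB>1" for every word on stdin
    import sys

    for line in sys.stdin:
        for word in line.split():
            print(f"{word.lower()}\t1")

    #!/usr/bin/env python3
    # reducer.py -- input arrives sorted by key, so counts accumulate per word
    import sys

    current, count = None, 0
    for line in sys.stdin:
        word, value = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print(f"{current}\t{count}")
            current, count = word, 0
        count += int(value)
    if current is not None:
        print(f"{current}\t{count}")

A typical submission passes both scripts to the streaming jar, e.g. `hadoop jar hadoop-streaming-*.jar -input /data/in -output /data/out -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py` (the jar's location varies by Hadoop version).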
• Hive: A data warehousing and SQL-like query language that allows users to query and analyze data stored in Hadoop.
• Pig: A high-level scripting language for processing and analyzing large datasets without writing MapReduce code.
• HBase: A distributed, scalable NoSQL database that provides real-time access to large amounts of sparse data.
• Spark: A fast and flexible data processing framework that supports batch processing, interactive queries, and machine learning (see the PySpark sketch after this list).
• Sqoop: A tool for transferring data between Hadoop and relational databases.
• Flume: A distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data.
• Oozie: A workflow scheduling and coordination system for managing Hadoop jobs.
• ZooKeeper: A distributed coordination service for managing configuration information, synchronization, and group services.
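To give the Spark entry above some substance, a minimal PySpark sketch (assumes a local Spark installation; the input path is a placeholder, and a file:/// path works when running without HDFS):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("WordCount").getOrCreate()

    # Classic word count on a text file via the RDD API
    counts = (spark.sparkContext.textFile("hdfs:///data/raw/notes.txt")
              .flatMap(lambda line: line.split())
              .map(lambda word: (word, 1))
              .reduceByKey(lambda a, b: a + b))
    print(counts.take(5))
    spark.stop()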
• The Unix philosophy emphasizes the use of small, single-purpose tools that can
be combined to accomplish complex tasks.
• Here's how you can analyze data using some commonly used Unix tools:
1. Getting Data:
• Use commands like curl, wget, or scp to retrieve data from remote sources or copy files.
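A Python analogue of the retrieval step, in case you are scripting rather than at a shell (the URL and file name are placeholders):

    from urllib.request import urlretrieve

    # Roughly what `wget https://example.com/data.csv` does
    urlretrieve("https://example.com/data.csv", "data.csv")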
2. Viewing Data:
• cat: Display the content of a file.
• head: Display the beginning of a file (default: first 10 lines).
• tail: Display the end of a file (default: last 10 lines).
• less or more: View files interactively, scrolling up and down.
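The head/tail idioms translate directly to Python if needed:

    from collections import deque
    from itertools import islice

    with open("data.csv") as f:
        print(*islice(f, 10), sep="", end="")        # head: first 10 lines

    with open("data.csv") as f:
        print(*deque(f, maxlen=10), sep="", end="")  # tail: last 10 lines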
4. Transforming Data:
• cut: Extract specific columns from text data.
• tr: Translate or replace characters in text data.
• sort: Sort lines of text data.
• uniq: Display unique lines in sorted data.
• paste: Merge lines from different files.
• join: Join lines from two sorted files based on a common field.
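A sketch of the join idea in Python, assuming two hypothetical comma-delimited files keyed on their first field:

    # Roughly: join -t, users.csv orders.csv  (both keyed on field 1)
    with open("users.csv") as f:
        users = dict(line.rstrip("\n").split(",", 1) for line in f)

    with open("orders.csv") as f:
        for line in f:
            key, rest = line.rstrip("\n").split(",", 1)
            if key in users:                     # inner join on the key
                print(f"{key},{users[key]},{rest}")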
6. Calculating Statistics:
• sort and uniq: Count occurrences and perform basic statistics on data.
• awk or perl: For more complex calculations.
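The classic `sort | uniq -c | sort -rn` counting pattern maps onto collections.Counter (the file name and column index are placeholders):

    from collections import Counter

    # Equivalent of: cut -d, -f2 data.csv | sort | uniq -c | sort -rn
    with open("data.csv") as f:
        counts = Counter(line.rstrip("\n").split(",")[1] for line in f)

    for value, n in counts.most_common():
        print(n, value)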
7. Data Transformation:
• sed and awk: For more advanced data manipulation.
• cut, paste, and join: Combining and reshaping data.
8. Visualization:
• Use third-party tools like gnuplot or matplotlib to create visualizations from processed data.
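Since matplotlib is named above, a minimal bar-chart sketch (the categories and counts are stand-ins for whatever a pipeline produced):

    import matplotlib.pyplot as plt

    labels = ["error", "warn", "info"]   # stand-in categories
    counts = [12, 45, 230]               # stand-in counts
    plt.bar(labels, counts)
    plt.xlabel("log level")
    plt.ylabel("occurrences")
    plt.savefig("counts.png")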
9. Pipelining:
• Unix tools are designed to work together using pipes (|) to pass output from one command as
input to another.
• This allows you to create complex data processing pipelines.
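A sketch of driving such a pipeline from Python; the pipeline itself is the classic word-frequency one-liner (the input file name is a placeholder):

    import subprocess

    # Top 5 most frequent words in a file, built from small single-purpose tools
    pipeline = (
        "tr -cs '[:alpha:]' '\\n' < words.txt"
        " | tr '[:upper:]' '[:lower:]'"
        " | sort | uniq -c | sort -rn | head -5"
    )
    result = subprocess.run(pipeline, shell=True, check=True,
                            capture_output=True, text=True)
    print(result.stdout)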
• IBM, a US-based computer hardware and software manufacturer, had implemented a Big Data
strategy.
• The company offered solutions to store, manage, and analyze the huge amounts of data generated daily, equipping both large and small companies to make informed business decisions.
• The company believed that its Big Data and analytics products and services would help its
clients become more competitive and drive growth.
• IBM's approach to big data involves a comprehensive strategy that encompasses various
products, services, and solutions to help organizations manage and gain insights
from large and complex datasets.
• IBM has been a significant player in the big data space for many years, offering a range
of tools and technologies that cater to different aspects of big data management and
analysis.
• IBM provides tools and platforms for collecting, integrating, and ingesting
data from various sources.
• This includes data warehouses, data lakes, and integration technologies that help
organizations bring together structured and unstructured data from different
systems.
• IBM offers data storage solutions that cater to the scalability and performance
requirements of big data.
• The IBM Watson platform, for instance, offers a suite of AI and analytics tools
to derive insights from data.
• IBM offers tools and practices for ensuring data quality, governance, and
compliance with data regulations.
• Their offerings allow organizations to seamlessly deploy and manage big data
solutions across different environments.
• IBM tailors its big data offerings to various industries, including finance,
healthcare, retail, and more.
• IBM's consulting and support services cover a wide range of topics, from architecture design to data science.
Introduction to InfoSphere
• InfoSphere Information Server provides a single platform for data integration and governance.
• The components in the suite combine to create a unified foundation for enterprise
information architectures, capable of scaling to meet any information volume requirements.
• By using InfoSphere Information Server, your business can access and use
information in new ways to drive innovation, increase operational efficiency, and
lower risk.
BigInsights
• IBM BigInsights is an enterprise platform that combines Hadoop and Spark for fast analysis and processing of data.
• The solution includes Spark, SQL, text analytics and more to help you easily integrate and
analyze big data.
• With IBM, spend less time creating an enterprise-ready Hadoop infrastructure, and more
time gaining valuable insights.
• It includes a variety of IBM technologies that enhance and extend the value of open-source Hadoop software.
• BigSheets is a spreadsheet-style tool for business analysts provided with IBM InfoSphere
BigInsights.
• BigSheets enables non-programmers to iteratively explore, manipulate, and visualize data stored in
your distributed file system.
• BigSheets can process huge amounts of data because user commands, expressed through a graphical interface, are translated into Pig scripts that run as MapReduce jobs in parallel on many nodes.
• E.g., on social networking sites like Twitter, BigSheets is widely used for sentiment analysis.