Lec1 Special


Big Data Fundamentals

Dr. Ayman Alhelbawy, 20 February 2024

Course Overview
Structure & Grading

Midterm Exam: 15 marks
Final Exam: 50 marks
Lab Work: 20 marks (Project)
Oral Test: 15 marks (Assignments + )
Course Overview
Syllabus

• Introduction to Big Data
• Business Intelligence and Big Data
• The Dimensions of Big Data
• Big Data Challenges and Tools
• HDFS and the Hadoop ecosystem
• The basics of HDFS, MapReduce, and Hadoop clusters
• Writing MapReduce programs
• PageRank technique
• Hive, HBase, Pig, MapReduce design patterns
Course Overview
Lab Project

Use what you learn in Big Data to develop a
distributed search engine that crawls 100,000 web
pages from news websites and indexes them using the
PageRank technique. All work must run in a
distributed fashion on a Spark/Hadoop cluster.
Finally, provide search services on the crawled pages.
(More details will be provided in the lab.)
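
To make the indexing step more concrete, here is a minimal single-machine sketch of PageRank by power iteration. It is only an illustration, not the lab's required implementation: the function name `pagerank`, the damping factor of 0.85, the iteration count, and the four-page `toy_web` graph are all assumptions chosen for the example. The lab version has to run distributed on the cluster.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank by power iteration.

    `links` maps each page to the list of pages it links to.
    The damping factor and iteration count are illustrative defaults.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    ranks = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly over all pages.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        new_ranks = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            if targets:
                # Each page passes an equal share of its rank to every page it links to.
                share = damping * ranks[page] / len(targets)
                for target in targets:
                    new_ranks[target] += share
        ranks = new_ranks
    return ranks


# Hypothetical toy web: "c" is linked to by every other page, "d" by none.
toy_web = {
    "a": ["b", "c"],
    "b": ["c"],
    "c": ["a"],
    "d": ["c"],
}
ranks = pagerank(toy_web)
for page, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(page, round(score, 3))
```

In this toy graph the page with the most incoming links ends up ranked first and the page with no incoming links ranked last, which is the behaviour a search index relies on.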
What is Big Data?
IBM Big Hard Drive in 1956
Big Data Types

• Structured Data
• Data that can be stored, processed, and retrieved in a
fixed format. For many years, DBMSs have been used to
manage structured data in databases, as in ERP systems.

• Unstructured
• Data that is not in a structured form, such as free text, news,
social media posts, emails, etc.

• Semi-Structured
• Data that contains both structured and unstructured forms,
e.g. XML holding the structure part where data itself
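
As a small illustration of semi-structured data, the sketch below parses a made-up XML news record with Python's standard `xml.etree.ElementTree` module. The tags and attribute give the structured part (id, title, date), while the body is free text. The document content and the field names are hypothetical, chosen only for this example.

```python
import xml.etree.ElementTree as ET

# Hypothetical news record: tags carry structure, <body> carries free text.
doc = """
<article id="42">
  <title>Storm hits the coast</title>
  <published>2024-02-20</published>
  <body>Heavy rain and strong winds were reported overnight.</body>
</article>
""".strip()

root = ET.fromstring(doc)

# Structured part: fixed fields that could go straight into a table row.
record = {
    "id": root.get("id"),
    "title": root.findtext("title"),
    "published": root.findtext("published"),
}

# Unstructured part: free text that needs text processing before analysis.
body_words = root.findtext("body").split()

print(record)
print(len(body_words), "words in body")
```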
Big Data Characteristics

• Volume
• Data size is enormous.
• Velocity
• The speed of generation of data.
• Variety
• Data comes in inconsistent formats and from different sources.
Important Notes

• “Big” is a subjective word.
• Big data is not a matter of size, as many people think.
Big Data Examples

• A Boeing jet engine generates 1 terabyte of data per day.
• 84 terabytes of tweets are posted every week.
• The CERN Data Centre processes on average one petabyte of data
per day.
• The LHC (Large Hadron Collider) experiments produce about
90 petabytes of data per year, and an additional 25 petabytes of
data are produced per year by other (non-LHC)
experiments at CERN.
• In 2017, CERN had 200 petabytes of data permanently archived.


Big Data Sources

• Users (Social media, Blogs, news, etc)


• Applications (Payroll, gaming, etc)
• Systems (Mobile devices, etc)
• Sensors (Science facilities, microphones, etc)
Big Data Analytics

Big data analytics is the process of examining big
data to discover hidden information, such as
hidden patterns, correlations, and market trends.

It can help organisations make informed strategic
and operational business decisions in order
to gain a competitive advantage.
Risks of Big Data

• Costs are very high and escalate quickly.
• It may breach many human rights.
• It raises many privacy concerns.
Storage and processing of Big Data
The Apache Hadoop

• The Apache Hadoop software library is a framework that allows
for the distributed processing of large data sets across clusters
of computers using simple programming models.

• It is designed to scale up from a single server to thousands of
machines, each offering local computation and storage.

• It is designed to detect and handle failures at the application
layer.

• Data is stored in a distributed file system. Hadoop Distributed
File System (HDFS) is a very well known
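
One of the "simple programming models" Hadoop is known for is MapReduce, which appears in the syllabus. The sketch below simulates its three phases (map, shuffle, reduce) for a word count in plain Python on one machine. It does not use any Hadoop API: the function names `map_phase`, `shuffle`, and `reduce_phase` and the two input lines are assumptions made for illustration. On a real cluster the framework distributes these phases across machines.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single count."""
    return key, sum(values)


# Hypothetical input; on a cluster each line could sit on a different node.
lines = ["big data is big", "data is everywhere"]

mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())
print(counts)
```

Because each call to `map_phase` looks at only one line and each call to `reduce_phase` at only one key, the work can be split across many machines without the functions needing to know about each other.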
Other Hadoop Related Projects

• HBase: A scalable, distributed database that
supports structured data storage for large tables.
• Hive: A data warehouse infrastructure that
provides data summarisation and ad hoc querying.
• Spark: A fast and general compute engine for
Hadoop data. Spark provides a simple and
expressive programming model that supports a
wide range of applications, including machine
learning, stream processing, and graph
computation.
Business Intelligence (BI)

Traditional BI methodology is based on the principle
of grouping all business data into a central server.
Typically, this data is analyzed in offline mode, after
the information has been stored in an environment called a
Data Warehouse. The data is structured in a
conventional relational database with an additional
set of indexes and forms of access to the tables.
Business Intelligence vs Big Data

• In a Big Data environment, information is stored on a
distributed file system, rather than on a central server.

• Big Data can analyze data in different formats, both structured
and unstructured.

• Data processed by Big Data solutions can be historical or come
from real-time sources. Thus, companies can make decisions
that affect their business in an agile and efficient way.

• Big Data technology uses parallel mass processing concepts,
which improves the speed of analysis.
Data Warehouse (DW)

• It is a collection of corporate information and data derived
from operational systems and external data sources.
• It is designed to support business decisions by allowing data
consolidation, analysis and reporting at different aggregate
levels.
• Data is populated into the Data Warehouse through the
processes of extraction, transformation and loading (ETL tools).
• Data analysis tools, such as business intelligence software,
access the data within the warehouse.
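
The extract, transform, load cycle can be sketched in a few lines of Python using only the standard library. This is a toy illustration under stated assumptions, not a real warehouse: the CSV text, the `sales` table, and its columns are hypothetical, and an in-memory SQLite database stands in for the Data Warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system (note the messy region names).
raw_csv = """region,amount
north,100
south,250
 North ,50
"""

# Extract: read rows from the source.
rows = list(csv.DictReader(io.StringIO(raw_csv)))

# Transform: normalise region names and convert amounts to numbers.
cleaned = [(row["region"].strip().title(), float(row["amount"])) for row in rows]

# Load: write the cleaned rows into the warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)", cleaned)

# Reporting at two aggregate levels: per region and overall.
totals = dict(conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region"))
grand_total = conn.execute("SELECT SUM(amount) FROM sales").fetchone()[0]
print(totals, grand_total)
```

The transform step is what makes consolidation possible: without normalising "north" and " North " to the same value, the per-region report would split one region into two.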
Big Data analytics

It is the process of examining large data sets
containing a variety of data types to discover
knowledge in databases, identify
interesting patterns, and establish relationships that
help solve problems and reveal market trends, customer
preferences, and other useful information.
Thank You
Questions?
