Unit 1 and Unit 2 Notes - Big Data Analytics (BDA)

The document provides an overview of data, its characteristics, and the evolution of Big Data technology, highlighting key milestones such as data warehousing, Hadoop, and cloud computing. It discusses the definition, types, and characteristics of Big Data, along with its sources, challenges, and the importance of big data analytics. Additionally, it covers the role of data science and the responsibilities of data scientists in extracting insights from data, as well as domain-specific examples of Big Data applications.

Unit 1

What is data?
Data refers to raw facts, figures, or symbols that represent information. It can be in various forms such as
numbers, text, images, or multimedia.

What are the characteristics of data?

1. Accuracy: Data should be free from errors and represent the true value or state of what it's measuring.
2. Completeness: Data should encompass all the necessary information required for its intended purpose.
3. Relevance: Data should be pertinent and applicable to the context or problem being addressed.
4. Consistency: Data should be uniform and coherent across different sources and time periods.
5. Timeliness: Data should be current and up-to-date to maintain its relevance.
6. Accessibility: Data should be easily accessible and retrievable for analysis and decision-making.
7. Granularity: Data can be fine-grained or coarse-grained, depending on the level of detail required.
8. Reliability: Data should be dependable and consistent in its quality and accuracy.
9. Validity: Data should measure what it claims to measure and be relevant to the objectives at hand.
10. Security: Data should be protected from unauthorized access, alteration, or destruction to maintain its integrity and confidentiality.
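To make a few of these characteristics concrete, here is a minimal sketch of how completeness, consistency, and validity might be checked programmatically. It assumes pandas is available and uses a small made-up employee table; the column names and thresholds are illustrative only.

```python
# Minimal sketch: checking completeness, consistency (duplicates), and validity.
# The data and column names below are made up for illustration.
import pandas as pd

records = pd.DataFrame({
    "employee_id": [101, 102, 102, 104],
    "age": [29, None, 41, 230],          # None -> incomplete, 230 -> invalid
    "department": ["HR", "IT", "IT", "Sales"],
})

completeness = records.notna().mean()                         # share of non-missing values per column
duplicates = records.duplicated(subset="employee_id").sum()   # consistency: repeated IDs
valid_age = records["age"].between(18, 100).mean()            # validity: ages within an expected range

print("Completeness per column:\n", completeness)
print("Duplicate employee_ids:", duplicates)
print("Share of valid ages:", valid_age)
```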

Evolution of Big Data:

Over the last few decades, Big Data technology has grown enormously. The key milestones in its evolution are described below:

1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large volumes of
structured data.
2. Hadoop:
Hadoop was introduced in 2006 by Doug Cutting and Mike Cafarella. It is an open-source framework that provides distributed storage and large-scale data processing.
3. NoSQL Databases:
Around 2009, NoSQL databases rose to prominence, providing a flexible way to store and retrieve unstructured data.
4. Cloud Computing:
Cloud Computing technology helps companies to store their important data in data centers that
are remote, and it saves their infrastructure cost and maintenance costs.
5. Machine Learning:
Machine Learning algorithms work on large datasets, analyzing huge amounts of data to extract meaningful insights. This has led to the development of artificial intelligence (AI) applications.
6. Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of data in real
time.
7. Edge Computing:
Edge Computing is a distributed computing paradigm that allows data processing to be done at the edge of the network, closer to the source of the data.
Overall, big data technology has come a long way since the early days of data warehousing. The
introduction of Hadoop, NoSQL databases, cloud computing, machine learning, data streaming, and edge
computing has revolutionized how we store, process, and analyze large volumes of data. As technology
evolves, we can expect Big Data to play a very important role in various industries.

Definition of big data


Big Data refers to complex and large data sets that have to be processed and analyzed to uncover valuable information that can benefit businesses and organizations. A few basic tenets make it simpler to answer the question of what Big Data is:
● It refers to a massive amount of data that keeps on growing exponentially with time.
● It is so voluminous that it cannot be processed or analyzed using conventional data processing
techniques.
● It includes data mining, data storage, data analysis, data sharing, and data visualization.
● The term is all-encompassing, covering the data itself, data frameworks, and the tools and techniques used to process and analyze the data.
Types of Big Data

a) Structured: Structured data is data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in and accessed from a database by simple search algorithms. For instance, the employee table in a company database is structured: employee details, job positions, salaries, and so on are present in an organized manner.

b) Unstructured: Unstructured data lacks any specific form or structure whatsoever, which makes it very difficult and time-consuming to process and analyze. Email is an example of unstructured data. Structured and unstructured are two important types of big data.

c) Semi-structured: Semi-structured data is the third type of big data. It contains elements of both formats mentioned above, structured and unstructured. To be precise, it refers to data that has not been organized into a particular repository (database) yet contains tags or markers that separate individual elements within it; JSON and XML documents are typical examples. A short sketch below illustrates the three types.

Characteristics of Big Data

Back in 2001, analyst Doug Laney (then at META Group, which later became part of Gartner) listed the 3 ‘V’s of Big Data – Variety, Velocity, and Volume.

a) Variety: Variety refers to the structured, unstructured, and semi-structured data that is gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio files, social media posts, and much more. Variety is one of the important characteristics of big data.

b) Velocity: Velocity essentially refers to the speed at which data is being created in real time. In a broader perspective, it comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and bursts of activity.

c) Volume: Volume is one of the defining characteristics of big data. Big Data indicates huge ‘volumes’ of data being generated on a daily basis from various sources such as social media platforms, business processes, machines, networks, human interactions, and so on. Such large amounts of data are stored in data warehouses.
Sources of Big Data

● Stock Exchange
● Bank Data
● Social Media Data
● Video sharing portals
● Transport Data
● Search Engine Data

Explain the challenges with big data.

● Insufficient understanding and acceptance of big data
● Confusion while selecting big data tools
● High costs
● Data integration
● Data security
● Data capture
● Storage of data
● Analysis of data
● Presentation of results

Explain big data analytics, its classification, and its importance.

Big data analytics refers to the process of examining large and complex datasets to uncover hidden
patterns, correlations, trends, and insights. It involves using advanced analytical techniques and
technologies to extract actionable information from massive volumes of data, ultimately helping
organizations make informed decisions and gain valuable insights into their operations, customers, and
markets.

Analytics can be classified into three main categories:

1. Descriptive Analytics: Describes what has happened in the past by analyzing historical data to gain
insights into trends, patterns, and relationships.

2. Predictive Analytics: Forecasts future outcomes by applying statistical algorithms and machine
learning techniques to historical data to identify potential trends and predict future events.

3. Prescriptive Analytics: Provides recommendations on the best course of action to achieve a specific
outcome by combining historical data, predictive models, optimization algorithms, and business rules to
optimize decision-making processes.

Big data analytics is important because it enables organizations to extract valuable insights from large and
diverse datasets, leading to informed decision-making, improved efficiency, innovation, and competitive
advantage.
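As an illustration of the three categories, here is a toy Python sketch on a made-up monthly sales series. It is only a sketch: a straight-line fit stands in for a predictive model, and a simple threshold rule stands in for a prescriptive optimization.

```python
# Toy sketch contrasting descriptive, predictive, and prescriptive analytics
# on made-up monthly sales figures.
import numpy as np

sales = np.array([120, 135, 150, 160, 172, 185], dtype=float)  # hypothetical data
months = np.arange(len(sales))

# Descriptive: what happened? Summarize the historical data.
print("Average monthly sales:", sales.mean())

# Predictive: what is likely to happen? Fit a simple trend and extrapolate.
slope, intercept = np.polyfit(months, sales, deg=1)
next_month_forecast = slope * len(sales) + intercept
print("Forecast for next month:", round(next_month_forecast, 1))

# Prescriptive: what should we do? Apply a business rule to the prediction.
action = "increase stock" if next_month_forecast > sales[-1] else "hold stock"
print("Recommended action:", action)
```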

Explain data science.

Data science is an interdisciplinary field that utilizes scientific methods, algorithms, processes, and
systems to extract insights and knowledge from structured and unstructured data. It combines elements of
statistics, mathematics, computer science, and domain expertise to uncover patterns, trends, and
relationships within data, ultimately enabling data-driven decision-making and innovation.
Responsibilities of a data scientist include:

1. Data Collection: Gathering, cleaning, and preprocessing data from various sources to ensure its quality
and suitability for analysis.

2. Data Analysis: Applying statistical methods, machine learning algorithms, and data mining techniques
to analyze data and extract meaningful insights.

3. Model Development: Developing and refining predictive models and algorithms to solve specific
business problems or generate actionable insights.

4. Visualization: Creating visual representations of data using charts, graphs, and dashboards to
communicate findings and insights effectively.

5. Interpretation: Interpreting analysis results and providing actionable recommendations or insights to stakeholders to inform decision-making processes.

6. Collaboration: Collaborating with cross-functional teams, including business analysts, engineers, and
domain experts, to understand business requirements and translate them into analytical solutions.

7. Continuous Learning: Staying updated on the latest advancements in data science, tools, and techniques
to enhance skills and capabilities.

Data scientists play a crucial role in extracting actionable insights from data to drive business growth,
innovation, and competitive advantage.
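The following compact sketch suggests how responsibilities 1 to 4 might look in code on a synthetic dataset. It assumes scikit-learn is available; the features and the churn label are invented for illustration.

```python
# Compact sketch of data collection, analysis, and model development
# (assumptions: scikit-learn installed; data and feature meanings are made up).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Data collection / preprocessing: synthetic customer features and churn labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                       # e.g. tenure, usage, complaints
y = (X[:, 0] + 0.5 * X[:, 2] + rng.normal(scale=0.5, size=200) > 0).astype(int)

# 2./3. Data analysis and model development: train and evaluate a predictive model.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
model = LogisticRegression().fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 4./5. Visualization and interpretation would follow, e.g. inspecting the learned
# weights and turning them into recommendations for stakeholders.
print("Feature weights:", model.coef_.round(2))
```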

Terminologies used in big data environments:

1. **Volume**: Refers to the vast amount of data generated and collected, often in petabytes or exabytes.

2. **Velocity**: Denotes the speed at which data is generated, processed, and analyzed, often in real-time
or near real-time.

3. **Variety**: Indicates the diverse types of data, including structured, semi-structured, and unstructured
data from various sources such as text, images, videos, and sensor data.

4. **Veracity**: Refers to the quality, accuracy, and reliability of data, including issues such as
inconsistency, incompleteness, and uncertainty.

5. **Value**: Represents the potential insights and actionable intelligence that can be derived from
analyzing big data to drive business decisions and innovation.

6. **Batch Processing**: Involves processing large volumes of data at once, typically in scheduled
batches, using distributed computing frameworks like Hadoop MapReduce.

7. **Real-time Processing**: Involves processing and analyzing data streams as they are generated,
enabling immediate insights and actions.

8. **Hadoop**: An open-source distributed computing framework for storing and processing large
datasets across clusters of commodity hardware.

9. **MapReduce**: A programming model for processing and generating large data sets through parallel, distributed computations on Hadoop clusters (a minimal sketch follows this section).

10. **Spark**: An open-source distributed computing framework that provides fast in-memory
processing capabilities for big data analytics and machine learning.

These terminologies form the foundation of understanding and working with big data environments,
encompassing the challenges, technologies, and opportunities associated with managing and analyzing
large and complex datasets.
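The sketch below illustrates the MapReduce programming model (item 9) with a single-machine word count in plain Python. A real Hadoop or Spark job would execute the same map, shuffle, and reduce phases distributed across a cluster; the documents here are made up.

```python
# Minimal, single-machine sketch of the MapReduce programming model (word count).
from collections import defaultdict

documents = ["big data needs big storage", "data streams need fast processing"]

# Map phase: emit (key, value) pairs -- here, (word, 1) for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle phase: group values by key.
grouped = defaultdict(list)
for word, count in mapped:
    grouped[word].append(count)

# Reduce phase: aggregate the values for each key.
word_counts = {word: sum(counts) for word, counts in grouped.items()}
print(word_counts)   # e.g. {'big': 2, 'data': 2, ...}
```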
Unit 2

Domain-specific Examples of Big Data


1. Web: In the web domain, big data is used by online platforms and e-commerce websites to analyze
user behavior. This includes tracking user clicks, page views, and interactions. For instance, platforms like
Amazon use big data to personalize recommendations based on a user's browsing and purchase history,
enhancing the overall shopping experience.

2. Financial: Financial institutions leverage big data for risk management, fraud detection, and customer
insights. Credit card companies analyze transaction data in real-time to identify unusual patterns that may
indicate fraudulent activities. Additionally, big data analytics is employed for predicting market trends
and optimizing investment strategies.

3. Healthcare: In healthcare, big data is applied to enhance patient care and optimize healthcare
processes. Electronic health records (EHRs) are analyzed to identify patterns, improve treatment plans,
and predict disease outbreaks. Big data analytics also plays a crucial role in genomics, helping researchers
and clinicians analyze large-scale genomic data for personalized medicine.

4. Internet of Things (IoT): IoT devices generate massive amounts of data, and big data analytics is
essential for extracting meaningful insights. In smart cities, sensors on traffic lights, waste management
systems, and public transportation are interconnected. Big data is used to analyze this data in real-time,
optimizing traffic flow, reducing energy consumption, and improving overall city management.

5. Environment: Big data is instrumental in environmental studies, particularly in climate monitoring and
prediction. Climate scientists analyze vast datasets from satellites, weather stations, and ocean buoys to
model climate patterns. This information is crucial for understanding climate change, predicting natural
disasters, and making informed decisions for environmental conservation.

6. Logistics & Transportation: In logistics and transportation, big data is used for route optimization,
predictive maintenance, and supply chain management. Companies like UPS use big data analytics to
optimize delivery routes, reduce fuel consumption, and enhance overall operational efficiency. Predictive
maintenance helps prevent breakdowns, ensuring continuous and reliable transportation services.

7. Industry: Manufacturing industries leverage big data for quality control, process optimization, and
predictive maintenance. Sensors on production lines generate vast amounts of data, which is analyzed in
real-time to identify defects, optimize production processes, and predict when machinery requires
maintenance. This improves overall efficiency and reduces downtime.

8. Retail: Retailers use big data to analyze customer purchasing patterns, optimize inventory
management, and personalize marketing strategies. For instance, supermarkets analyze customer purchase
data to optimize inventory levels, ensuring products are always available. Online retailers use big data to
personalize recommendations and promotions based on customer preferences and browsing history.

Analytics flow for big data:

1. *Data Collection*: This is where we gather all the relevant data from various sources such as
databases, sensors, or social media platforms. Think of it as collecting pieces of a puzzle.

2. *Data Preparation*: After collecting the data, we need to clean and organize it. This step involves
removing any errors or inconsistencies and formatting the data in a way that's suitable for analysis. It's
like sorting and arranging the puzzle pieces so they fit together neatly.

3. *Analysis Types*: There are different ways we can analyze the data depending on what we want to
find out. For example, we might use descriptive analysis to summarize the data, predictive analysis to
forecast future trends, or prescriptive analysis to recommend actions based on the data.
4. *Analysis Modes*: Once we know what type of analysis we want to perform, we choose the mode of
analysis. This could be batch processing, where we analyze a large amount of data at once, or real-time
processing, where we analyze data as it's generated. It's like deciding whether to solve the puzzle all at
once or piece by piece as we go.

5. *Visualizations*: Finally, we present the results of our analysis in a visual format, such as charts,
graphs, or dashboards. This makes it easier for people to understand the insights gained from the data.
Think of it as putting together the completed puzzle so others can see the big picture.
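A minimal sketch of this flow in Python (collection, preparation, descriptive analysis, visualization) might look as follows. The file name "sales.csv" and its columns are hypothetical, and pandas/matplotlib are assumed to be installed.

```python
# Small sketch of the analytics flow; "sales.csv" and its columns are hypothetical.
import pandas as pd
import matplotlib.pyplot as plt

# 1. Data collection: read raw data gathered from a source system.
raw = pd.read_csv("sales.csv")                 # columns assumed: date, region, amount

# 2. Data preparation: remove errors/inconsistencies and fix types.
clean = raw.dropna(subset=["amount"]).copy()
clean["date"] = pd.to_datetime(clean["date"])

# 3./4. Analysis (descriptive, batch mode): total sales per region.
per_region = clean.groupby("region")["amount"].sum()

# 5. Visualization: present the result so others see the big picture.
per_region.plot(kind="bar", title="Sales by region")
plt.show()
```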

Big data stack:

1. *Raw Data Sources*: These are the original sources where data is generated or collected, such as
databases, sensors, or social media platforms. It's like the starting point where all the data comes from.

2. *Data Access Connectors*: These are tools or interfaces that allow us to access and retrieve data from
different sources. They act as bridges between the raw data sources and the rest of the big data stack,
ensuring smooth data flow. Think of them as connectors that link the raw data sources to the rest of the
system.

3. *Data Storage*: This is where the collected data is stored for future use and analysis. It could be in
traditional databases, data lakes, or distributed file systems like Hadoop Distributed File System (HDFS).
It's like the storage room where we keep all the puzzle pieces safe and organized.

4. *Batch Analytics*: Batch analytics involves processing and analyzing large volumes of data in
batches or chunks. It's useful for tasks that don't require immediate results, such as historical analysis or
periodic reporting. Think of it as solving the puzzle piece by piece, but not necessarily in real-time.

5. *Real-time Analytics*: Real-time analytics, on the other hand, involves processing and analyzing data
as it's generated, providing immediate insights and responses. It's like solving the puzzle as soon as you
receive each piece, allowing for quick reactions and decision-making.

6. *Interactive Querying*: This refers to the ability to interactively query and explore the data stored in
the system. It allows users to ask ad-hoc questions and receive instant responses, facilitating exploratory
data analysis and troubleshooting. Think of it as being able to search for specific puzzle pieces and get
instant answers.

7. *Serving Databases*: Databases designed for serving data quickly and efficiently, such as NoSQL databases (e.g., MongoDB or Apache Cassandra), play a role in storing processed data for easy retrieval. It's like having a well-organized library where you can easily find the book you need.

8. *Web & Visualization Frameworks*: These are the tools and frameworks used to serve the analyzed
data to end-users, whether through databases, web applications, or visualization tools like Tableau or
Power BI. They make the insights gained from the data accessible and understandable to non-technical
users. It's like putting the puzzle together in a way that others can see and understand the complete
picture.
Mapping the analytics flow to the Big Data Stack means aligning the stages and processes involved in data analytics with the various components of a Big Data technology stack. It involves understanding how different tools and technologies within the Big Data ecosystem can be employed to handle the various aspects of data processing, storage, and analysis. Here's a simplified breakdown:

Raw Data Sources: Identify where the data is coming from, such as sensors, databases, logs, or any other sources.

Data Access Connectors: Determine how to connect and collect data from these sources efficiently. Use connectors or pipelines to move data to the next stages.

Data Storage: Choose appropriate storage systems to house the data, considering factors like volume, velocity, and variety of the data. This could involve distributed file systems, databases, or cloud storage.

Batch Analytics: Decide how to process large volumes of data in batches to gain insights over time. Utilize technologies like Apache Spark, Hadoop, or other batch processing frameworks.

Real-time Analytics: Address the need for immediate insights by implementing real-time analytics using technologies such as Apache Flink, Apache Kafka Streams, or other stream processing frameworks.

Interactive Querying: Provide a way for users to interactively query and explore the data. Use databases optimized for quick querying, like Apache Cassandra, or other interactive query systems.

Serving Databases: Store processed and analyzed data in serving databases for quick and efficient retrieval. This could involve NoSQL databases like MongoDB or traditional relational databases.

Web & Visualization Frameworks: Present the results to end-users using web and visualization frameworks. Utilize tools like Tableau, Power BI, or custom-built dashboards with frameworks like D3.js for effective data visualization.

By mapping the analytics flow to the Big Data Stack, organizations can optimize their data processing and
analysis workflows, making use of the capabilities offered by different components of the Big Data
ecosystem. This ensures efficient handling of large datasets and facilitates the extraction of valuable
insights from the data.
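As an illustrative (not prescriptive) example, the PySpark sketch below shows how the batch-analytics stage might read raw events from data storage, aggregate them, and write the results where a serving database or dashboard could pick them up. The application name, HDFS paths, and column names are assumptions for illustration.

```python
# Hedged PySpark sketch of the batch-analytics stage in the stack:
# read from data storage (e.g. HDFS), aggregate, and persist for the serving layer.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("batch-analytics-sketch").getOrCreate()

# Data storage -> batch analytics: load raw events from the distributed file system.
events = spark.read.parquet("hdfs:///data/raw/events")   # assumed path and schema

# Batch analytics: aggregate events per user per day.
daily = (events
         .groupBy("user_id", F.to_date("timestamp").alias("day"))
         .agg(F.count("*").alias("event_count")))

# Serving layer: write the aggregates where a web/visualization layer can query them.
daily.write.mode("overwrite").parquet("hdfs:///data/serving/daily_counts")

spark.stop()
```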
Case study on Genome Data Analysis:

A research institution is conducting a study to understand the genetic factors influencing a rare disease.
They have collected genomic data from a large group of individuals and aim to analyze this data to
identify potential genetic markers associated with the disease.

Big Data Stack Implementation:

Raw Data Sources:
● Genomic data is collected from various sources, including DNA sequencing machines
and public genetic databases. This data is incredibly large and complex, consisting of
millions of DNA sequences.
Data Access Connectors:
● Data connectors are used to aggregate genomic data from different sources and ensure
compatibility for further processing. These connectors help in dealing with the diverse
formats and structures of genomic data.
Data Storage:
● The raw genomic data is stored in a distributed file system or a specialized genomic
database. This storage system is designed to handle the massive volume of data and
provide efficient retrieval.
Batch Analytics:
● Batch analytics processes involve running complex algorithms on the entire genomic
dataset. This may include identifying variations, mutations, and patterns in the genetic
code that could be linked to the rare disease. Technologies like Apache Spark or Hadoop
MapReduce could be employed for this.
Real-time Analytics:
● Real-time analytics could be applied for immediate analysis of newly sequenced
genomes. This is particularly useful for identifying urgent insights or adjusting the
analysis approach based on ongoing findings.
Interactive Querying:
● Researchers may need to interactively query specific genes, regions, or mutations.
Interactive querying tools, possibly built on top of distributed databases like Apache
HBase, allow researchers to explore specific aspects of the genomic data.
Serving Databases:
● Processed and analyzed genomic data is stored in databases optimized for quick access.
This allows researchers to efficiently retrieve relevant genetic information during the
study. NoSQL databases like MongoDB could be utilized for this purpose.
Web & Visualization Frameworks:
● Results from the genomic analysis are presented using web-based visualization
frameworks. Researchers can use tools like GenomeBrowse or custom-built
visualizations to explore and interpret genetic variations, making it easier to identify
potential genetic markers associated with the rare disease.

Outcomes:

● The research institution can uncover potential genetic markers associated with the rare disease.
● Insights gained can contribute to a better understanding of the disease's genetic basis.
● This information may lead to the development of targeted treatments or interventions.

In this case study, the Big Data Stack plays a crucial role in handling the vast and complex genomic data,
performing in-depth analysis, and presenting the findings in a way that aids researchers in understanding
the genetic factors influencing the rare disease.
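One way the interactive-querying step described above might look is sketched below with Spark SQL over a hypothetical processed variants table. The table schema, the HDFS path, and the gene name 'GENE_X' are placeholders, not details from the case study.

```python
# Hedged sketch of interactive querying over processed genomic data with Spark SQL.
# The "variants" table, its columns, and 'GENE_X' are assumptions for illustration.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("genome-query-sketch").getOrCreate()

variants = spark.read.parquet("hdfs:///genomics/processed/variants")  # assumed path
variants.createOrReplaceTempView("variants")

# Compare how often variants in a gene of interest appear in cases vs controls.
result = spark.sql("""
    SELECT variant_id, cohort, COUNT(*) AS carriers
    FROM variants
    WHERE gene = 'GENE_X'
    GROUP BY variant_id, cohort
    ORDER BY carriers DESC
""")
result.show(10)
```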
Case study on Weather Data Analysis:

A meteorological agency wants to improve its weather prediction models by analyzing historical and
real-time weather data. The goal is to enhance accuracy in forecasting and provide more timely and
precise information to the public.

Big Data Stack Implementation:

Raw Data Sources:
● Meteorological stations, satellites, and other sensors collect vast amounts of weather data,
including temperature, humidity, wind speed, and atmospheric pressure. Real-time data is
continuously streamed from these sources.
Data Access Connectors:
● Data connectors are used to aggregate data from various sources, ensuring a smooth flow
of information from meteorological stations, satellites, and other data-producing devices
to the central data processing system.
Data Storage:
● Raw weather data is stored in a distributed and scalable data storage system. This could
be a combination of cloud-based storage solutions and on-premise databases, capable of
handling large volumes of historical and real-time data.
Batch Analytics:
● Batch analytics processes historical weather data to identify long-term trends, patterns,
and seasonal variations. This analysis helps in refining predictive models and improving
the accuracy of long-term weather forecasts. Technologies like Apache Spark or Hadoop
can be employed for this purpose.
Real-time Analytics:
● Real-time analytics processes the continuously streaming data from various sensors. This
analysis provides insights into the current weather conditions, enabling more accurate
short-term predictions. Stream processing frameworks like Apache Flink or Apache
Kafka Streams may be utilized for real-time analytics.
Interactive Querying:
● Meteorologists and weather analysts may need to interactively query specific regions,
time periods, or meteorological parameters. Interactive querying tools built on top of
databases like Apache Cassandra or Amazon DynamoDB allow for quick and flexible
access to specific weather data.
Serving Databases:
● Processed and analyzed weather data is stored in serving databases optimized for quick
access. This enables the meteorological agency to provide timely and accurate
information to the public, as well as other stakeholders. NoSQL databases like MongoDB
or traditional relational databases could be used for this purpose.
Web & Visualization Frameworks:
● Weather forecasts and predictions are presented to the public through web and
visualization frameworks. Interactive maps, charts, and dashboards, possibly built using
tools like D3.js or Leaflet, help convey the weather information in an easily
understandable format.

Outcomes:

● Improved accuracy in long-term weather forecasts.
● Enhanced ability to provide real-time weather updates and warnings.
● Better-informed decision-making for various sectors, including agriculture, transportation, and emergency response.

In this case study, the Big Data Stack facilitates the efficient handling of vast and dynamic weather data,
enabling comprehensive analysis, accurate predictions, and timely communication of weather information
to the public and other stakeholders.
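To illustrate the real-time analytics idea from this case study, here is a minimal sliding-window sketch in plain Python over simulated temperature readings. A production system would use a stream processor such as Apache Flink or Kafka Streams, as noted above; the readings and threshold are made up.

```python
# Hedged sketch of a sliding-window average over a simulated stream of readings.
from collections import deque

window = deque(maxlen=5)                                   # keep the 5 most recent readings
readings = [29.1, 29.4, 30.2, 31.0, 30.7, 30.9, 31.5]      # simulated sensor stream

for value in readings:
    window.append(value)
    moving_avg = sum(window) / len(window)
    print(f"latest={value:.1f}  5-reading average={moving_avg:.2f}")
    if moving_avg > 31.0:
        print("  -> threshold crossed: flag for short-term forecast update")
```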
Analytics patterns:

Analytics patterns refer to recurring approaches or methodologies used in data analysis to solve common problems or achieve specific goals. These patterns provide guidance on how to structure and conduct data analysis tasks efficiently and effectively. Here are some common analytics patterns:

1. *Descriptive Analytics*: This pattern involves summarizing historical data to gain insights into past
events or trends. It focuses on answering questions like "What happened?" and includes techniques such
as data aggregation, summarization, and visualization.

2. *Diagnostic Analytics*: Diagnostic analytics aims to understand why certain events occurred by
identifying patterns, correlations, or relationships in the data. It helps uncover root causes behind
observed phenomena and supports troubleshooting and problem-solving efforts.
3. *Predictive Analytics*: Predictive analytics uses historical data to forecast future outcomes or trends.
It involves building predictive models using statistical techniques, machine learning algorithms, or other
predictive modeling approaches to make educated predictions based on past patterns.

4. *Prescriptive Analytics*: Prescriptive analytics goes beyond predicting future outcomes by recommending specific actions or interventions to achieve desired objectives. It combines predictive models with optimization algorithms or decision-making frameworks to provide actionable insights and guidance.

5. *Text Analytics*: Text analytics focuses on extracting insights and patterns from unstructured text
data, such as emails, social media posts, or customer reviews. It includes techniques like natural language
processing (NLP), sentiment analysis, and topic modeling to analyze and interpret textual data.
6. *Spatial Analytics*: Spatial analytics deals with analyzing data that has a geographic or spatial
component, such as maps, GPS coordinates, or spatial databases. It involves techniques like spatial
clustering, spatial interpolation, and spatial regression to understand spatial relationships and patterns in
the data.

7. *Temporal Analytics*: Temporal analytics focuses on analyzing data over time to identify temporal
trends, patterns, or seasonality. It involves time series analysis, event sequence analysis, and trend
detection techniques to uncover insights related to temporal changes in the data.
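
As a small illustration of the temporal analytics pattern, the sketch below computes a rolling mean over a made-up monthly series with pandas and checks whether the smoothed values rise consistently; the data and window size are assumptions.

```python
# Small sketch of temporal analytics: a rolling mean smooths month-to-month noise,
# and comparing successive smoothed values flags an upward trend. Data is made up.
import pandas as pd

series = pd.Series(
    [100, 98, 103, 110, 115, 121, 128],
    index=pd.date_range("2024-01-01", periods=7, freq="MS"),
)

rolling = series.rolling(window=3).mean()       # smooth out short-term noise
trend_up = rolling.diff().dropna().gt(0).all()  # did the smoothed series keep rising?

print(rolling.round(1))
print("Consistent upward trend:", bool(trend_up))
```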

These analytics patterns serve as building blocks for designing and implementing data analysis workflows
and can be combined or adapted to address specific analytical challenges or business objectives
effectively.
