Unit 1 and Unit 2 Notes (Big Data Analytics)
What is data?
Data refers to raw facts, figures, or symbols that represent information. It can be in various forms such as
numbers, text, images, or multimedia. Key characteristics of data include:
1. Accuracy: Data should be free from errors and represent the true value or state of what it's measuring.
2. Completeness: Data should encompass all the necessary information required for its intended purpose.
3. Relevance: Data should be pertinent and applicable to the context or problem being addressed.
4. Consistency: Data should be uniform and coherent across different sources and time periods.
5. Timeliness: Data should be current and up-to-date to maintain its relevance.
6. Accessibility: Data should be easily accessible and retrievable for analysis and decision-making.
7. Granularity: Data can be fine-grained or coarse-grained, depending on the level of detail required.
8. Reliability: Data should be dependable and consistent in its quality and accuracy.
9. Validity: Data should measure what it claims to measure and be relevant to the objectives at hand.
10. Security: Data should be protected from unauthorized access, alteration, or destruction to maintain its
integrity and confidentiality.
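Several of these characteristics can be checked programmatically. Below is a minimal Python sketch of completeness, consistency, and timeliness checks; the employee records and column names are made up purely for illustration.

```python
import pandas as pd

# Hypothetical employee dataset used only to illustrate basic quality checks.
df = pd.DataFrame({
    "emp_id": [101, 102, 103, 103],
    "salary": [50000, None, 62000, 62000],
    "join_date": ["2021-01-05", "2020-07-19", "2022-03-01", "2022-03-01"],
})

# Completeness: fraction of missing values per column.
print(df.isna().mean())

# Consistency / validity: duplicate identifiers and impossible values.
print("duplicate ids:", df["emp_id"].duplicated().sum())
print("invalid salaries:", (df["salary"] < 0).sum())

# Timeliness: how recent is the newest record?
print("latest record:", pd.to_datetime(df["join_date"]).max())
```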
Big Data technology has grown enormously over the last few decades. The major milestones in the evolution of Big Data are described below:
1. Data Warehousing:
In the 1990s, data warehousing emerged as a solution to store and analyze large volumes of
structured data.
2. Hadoop:
Hadoop, an open-source framework introduced in 2006 by Doug Cutting and Mike Cafarella, provides distributed storage and large-scale data processing.
3. NoSQL Databases:
Around 2009, NoSQL databases emerged, providing a flexible way to store and retrieve unstructured data.
4. Cloud Computing:
Cloud computing enables companies to store their data in remote data centers, saving infrastructure and maintenance costs.
5. Machine Learning:
Machine learning algorithms analyze huge amounts of data to extract meaningful insights. This has led to the development of artificial intelligence (AI) applications.
6. Data Streaming:
Data Streaming technology has emerged as a solution to process large volumes of data in real
time.
7. Edge Computing:
Edge Computing is a distributed computing paradigm that allows data processing to be done at the edge of the network, closer to the source of the data.
Overall, big data technology has come a long way since the early days of data warehousing. The
introduction of Hadoop, NoSQL databases, cloud computing, machine learning, data streaming, and edge
computing has revolutionized how we store, process, and analyze large volumes of data. As technology
evolves, we can expect Big Data to play a very important role in various industries.
a) Structured: Structured data is data that can be processed, stored, and retrieved in a fixed format. It refers to highly organized information that can be readily and seamlessly stored in and accessed from a database by simple search-engine algorithms. For instance, the employee table in a company database is structured: the employee details, their job positions, their salaries, etc., are present in an organized manner.
b) Unstructured: Unstructured data lacks any specific form or structure, which makes it difficult and time-consuming to process and analyze. Email is a common example of unstructured data.
c) Semi-structured: Semi-structured data contains a mix of the two formats above, structured and unstructured. More precisely, it refers to data that has not been classified under a particular repository (database) yet contains vital information or tags that segregate individual elements within the data.
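As a rough illustration of the three types, the snippet below contrasts a structured record, an unstructured email body, and a semi-structured JSON document; all of the sample values are hypothetical.

```python
import json

# Structured: fixed schema, like a row in an employee table.
structured_row = {"emp_id": 101, "name": "Asha", "position": "Analyst", "salary": 50000}

# Unstructured: free-form text, e.g. the body of an email, with no fixed fields.
unstructured_text = "Hi team, attached is the Q3 report. Let me know if anything looks off."

# Semi-structured: no rigid schema, but tags/keys separate individual elements (e.g. JSON).
semi_structured = json.loads('{"user": "asha01", "tags": ["report", "Q3"], "attachments": 1}')

print(structured_row["salary"])        # direct field access via the schema
print(len(unstructured_text.split()))  # only crude processing is possible without NLP
print(semi_structured["tags"])         # elements are reachable through tags/keys
```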
Back in 2001, Gartner analyst Doug Laney listed the 3 ‘V’s of Big Data – Variety, Velocity, and Volume.
a) Variety: Variety refers to structured, unstructured, and semi-structured data gathered from multiple sources. While in the past data could only be collected from spreadsheets and databases, today data comes in an array of forms such as emails, PDFs, photos, videos, audio, social media posts, and much more.
b) Velocity: Velocity refers to the speed at which data is being created in real time. In a broader sense, it comprises the rate of change, the linking of incoming data sets arriving at varying speeds, and bursts of activity.
c) Volume: Volume refers to the huge 'volumes' of data generated daily from various sources such as social media platforms, business processes, machines, networks, human interactions, etc. Such large amounts of data are stored in data warehouses.
Sources of Big Data
● Stock Exchange
● Bank Data
● Social Media Data
● Video sharing portals
● Transport Data
● Search Engine Data
Big data analytics refers to the process of examining large and complex datasets to uncover hidden
patterns, correlations, trends, and insights. It involves using advanced analytical techniques and
technologies to extract actionable information from massive volumes of data, ultimately helping
organizations make informed decisions and gain valuable insights into their operations, customers, and
markets. The main types of big data analytics are:
1. Descriptive Analytics: Describes what has happened in the past by analyzing historical data to gain
insights into trends, patterns, and relationships.
2. Predictive Analytics: Forecasts future outcomes by applying statistical algorithms and machine
learning techniques to historical data to identify potential trends and predict future events.
3. Prescriptive Analytics: Provides recommendations on the best course of action to achieve a specific
outcome by combining historical data, predictive models, optimization algorithms, and business rules to
optimize decision-making processes.
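A small Python sketch can make the contrast concrete; the monthly sales figures and the stocking rule below are purely hypothetical.

```python
import numpy as np

# Hypothetical monthly sales figures used only to contrast the three analytics types.
sales = np.array([120, 132, 129, 145, 151, 160], dtype=float)
months = np.arange(len(sales))

# Descriptive: what happened? Summarize historical data.
print("mean:", sales.mean(), "growth:", sales[-1] - sales[0])

# Predictive: what is likely to happen? Fit a linear trend and extrapolate one month ahead.
slope, intercept = np.polyfit(months, sales, deg=1)
forecast = slope * len(sales) + intercept
print("next-month forecast:", round(forecast, 1))

# Prescriptive: what should we do? Apply a simple business rule to the prediction.
action = "increase stock" if forecast > sales[-1] else "hold stock"
print("recommended action:", action)
```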
Big data analytics is important because it enables organizations to extract valuable insights from large and
diverse datasets, leading to informed decision-making, improved efficiency, innovation, and competitive
advantage.
Data science is an interdisciplinary field that utilizes scientific methods, algorithms, processes, and
systems to extract insights and knowledge from structured and unstructured data. It combines elements of
statistics, mathematics, computer science, and domain expertise to uncover patterns, trends, and
relationships within data, ultimately enabling data-driven decision-making and innovation.
Responsibilities of a data scientist include:
1. Data Collection: Gathering, cleaning, and preprocessing data from various sources to ensure its quality
and suitability for analysis.
2. Data Analysis: Applying statistical methods, machine learning algorithms, and data mining techniques
to analyze data and extract meaningful insights.
3. Model Development: Developing and refining predictive models and algorithms to solve specific
business problems or generate actionable insights.
4. Visualization: Creating visual representations of data using charts, graphs, and dashboards to
communicate findings and insights effectively.
5. Collaboration: Collaborating with cross-functional teams, including business analysts, engineers, and
domain experts, to understand business requirements and translate them into analytical solutions.
6. Continuous Learning: Staying updated on the latest advancements in data science, tools, and techniques
to enhance skills and capabilities.
Data scientists play a crucial role in extracting actionable insights from data to drive business growth,
innovation, and competitive advantage.
Key terminologies used in big data environments include:
1. **Volume**: Refers to the vast amount of data generated and collected, often in petabytes or exabytes.
2. **Velocity**: Denotes the speed at which data is generated, processed, and analyzed, often in real-time
or near real-time.
3. **Variety**: Indicates the diverse types of data, including structured, semi-structured, and unstructured
data from various sources such as text, images, videos, and sensor data.
4. **Veracity**: Refers to the quality, accuracy, and reliability of data, including issues such as
inconsistency, incompleteness, and uncertainty.
5. **Value**: Represents the potential insights and actionable intelligence that can be derived from
analyzing big data to drive business decisions and innovation.
6. **Batch Processing**: Involves processing large volumes of data at once, typically in scheduled
batches, using distributed computing frameworks like Hadoop MapReduce.
7. **Real-time Processing**: Involves processing and analyzing data streams as they are generated,
enabling immediate insights and actions.
8. **Hadoop**: An open-source distributed computing framework for storing and processing large
datasets across clusters of commodity hardware.
9. **Spark**: An open-source distributed computing framework that provides fast in-memory processing capabilities for big data analytics and machine learning (a minimal word-count sketch using Spark is shown below).
These terminologies form the foundation of understanding and working with big data environments,
encompassing the challenges, technologies, and opportunities associated with managing and analyzing
large and complex datasets.
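As a minimal sketch of batch processing on such a stack, the PySpark word count below reads a text file and counts word occurrences. It assumes pyspark is installed and that an input file exists; logs.txt is only a placeholder path.

```python
from pyspark.sql import SparkSession

# Minimal batch-processing sketch: a word count over a text file on HDFS or local disk.
# "logs.txt" is a placeholder path, not a file referenced in these notes.
spark = SparkSession.builder.appName("batch-word-count").getOrCreate()

lines = spark.read.text("logs.txt")            # each row holds one line of text
words = lines.rdd.flatMap(lambda row: row.value.split())
counts = words.map(lambda w: (w, 1)).reduceByKey(lambda a, b: a + b)

for word, count in counts.take(10):            # inspect a small sample of the result
    print(word, count)

spark.stop()
```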
Unit 2
Applications of Big Data across different domains include the following:
2. Financial: Financial institutions leverage big data for risk management, fraud detection, and customer
insights. Credit card companies analyze transaction data in real-time to identify unusual patterns that may
indicate fraudulent activities. Additionally, big data analytics is employed for predicting market trends
and optimizing investment strategies.
3. Healthcare: In healthcare, big data is applied to enhance patient care and optimize healthcare
processes. Electronic health records (EHRs) are analyzed to identify patterns, improve treatment plans,
and predict disease outbreaks. Big data analytics also plays a crucial role in genomics, helping researchers
and clinicians analyze large-scale genomic data for personalized medicine.
4. Internet of Things (IoT): IoT devices generate massive amounts of data, and big data analytics is
essential for extracting meaningful insights. In smart cities, sensors on traffic lights, waste management
systems, and public transportation are interconnected. Big data is used to analyze this data in real-time,
optimizing traffic flow, reducing energy consumption, and improving overall city management.
5. Environment: Big data is instrumental in environmental studies, particularly in climate monitoring and
prediction. Climate scientists analyze vast datasets from satellites, weather stations, and ocean buoys to
model climate patterns. This information is crucial for understanding climate change, predicting natural
disasters, and making informed decisions for environmental conservation.
6. Logistics & Transportation: In logistics and transportation, big data is used for route optimization,
predictive maintenance, and supply chain management. Companies like UPS use big data analytics to
optimize delivery routes, reduce fuel consumption, and enhance overall operational efficiency. Predictive
maintenance helps prevent breakdowns, ensuring continuous and reliable transportation services.
7. Industry: Manufacturing industries leverage big data for quality control, process optimization, and
predictive maintenance. Sensors on production lines generate vast amounts of data, which is analyzed in
real-time to identify defects, optimize production processes, and predict when machinery requires
maintenance. This improves overall efficiency and reduces downtime.
8. Retail: Retailers use big data to analyze customer purchasing patterns, optimize inventory
management, and personalize marketing strategies. For instance, supermarkets analyze customer purchase
data to optimize inventory levels, ensuring products are always available. Online retailers use big data to
personalize recommendations and promotions based on customer preferences and browsing history.
The typical big data analytics flow consists of the following stages:
1. *Data Collection*: This is where we gather all the relevant data from various sources such as
databases, sensors, or social media platforms. Think of it as collecting pieces of a puzzle.
2. *Data Preparation*: After collecting the data, we need to clean and organize it. This step involves
removing any errors or inconsistencies and formatting the data in a way that's suitable for analysis. It's
like sorting and arranging the puzzle pieces so they fit together neatly.
3. *Analysis Types*: There are different ways we can analyze the data depending on what we want to
find out. For example, we might use descriptive analysis to summarize the data, predictive analysis to
forecast future trends, or prescriptive analysis to recommend actions based on the data.
4. *Analysis Modes*: Once we know what type of analysis we want to perform, we choose the mode of
analysis. This could be batch processing, where we analyze a large amount of data at once, or real-time
processing, where we analyze data as it's generated. It's like deciding whether to solve the puzzle all at
once or piece by piece as we go.
5. *Visualizations*: Finally, we present the results of our analysis in a visual format, such as charts,
graphs, or dashboards. This makes it easier for people to understand the insights gained from the data.
Think of it as putting together the completed puzzle so others can see the big picture.
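Putting these stages together, a compact pandas sketch of the flow might look like the following; the file name orders.csv and its columns are hypothetical.

```python
import pandas as pd
import matplotlib.pyplot as plt

# 1. Data collection: load raw records (the CSV path and columns are hypothetical).
orders = pd.read_csv("orders.csv")          # e.g. columns: order_date, region, amount

# 2. Data preparation: fix types, drop rows with errors or missing values.
orders["order_date"] = pd.to_datetime(orders["order_date"], errors="coerce")
orders = orders.dropna(subset=["order_date", "amount"])

# 3. Analysis (descriptive, batch mode): total sales per region.
summary = orders.groupby("region")["amount"].sum().sort_values(ascending=False)
print(summary)

# 4. Visualization: present the result as a simple chart.
summary.plot(kind="bar", title="Sales by region")
plt.tight_layout()
plt.show()
```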
The main components of the Big Data Stack are:
1. *Raw Data Sources*: These are the original sources where data is generated or collected, such as
databases, sensors, or social media platforms. It's like the starting point where all the data comes from.
2. *Data Access Connectors*: These are tools or interfaces that allow us to access and retrieve data from
different sources. They act as bridges between the raw data sources and the rest of the big data stack,
ensuring smooth data flow. Think of them as connectors that link the raw data sources to the rest of the
system.
3. *Data Storage*: This is where the collected data is stored for future use and analysis. It could be in
traditional databases, data lakes, or distributed file systems like Hadoop Distributed File System (HDFS).
It's like the storage room where we keep all the puzzle pieces safe and organized.
4. *Batch Analytics*: Batch analytics involves processing and analyzing large volumes of data in
batches or chunks. It's useful for tasks that don't require immediate results, such as historical analysis or
periodic reporting. Think of it as solving the puzzle piece by piece, but not necessarily in real-time.
5. *Real-time Analytics*: Real-time analytics, on the other hand, involves processing and analyzing data
as it's generated, providing immediate insights and responses. It's like solving the puzzle as soon as you
receive each piece, allowing for quick reactions and decision-making.
6. *Interactive Querying*: This refers to the ability to interactively query and explore the data stored in
the system. It allows users to ask ad-hoc questions and receive instant responses, facilitating exploratory
data analysis and troubleshooting. Think of it as being able to search for specific puzzle pieces and get
instant answers.
7. *Serving Databases*: Databases designed for serving data quickly and efficiently, such as NoSQL databases (e.g., MongoDB or Apache Cassandra), play a role in storing processed data for easy retrieval (a small sketch follows this list). It's like having a well-organized library where you can easily find the book you need.
8. *Web & Visualization Frameworks*: These are the tools and frameworks used to serve the analyzed
data to end-users, whether through databases, web applications, or visualization tools like Tableau or
Power BI. They make the insights gained from the data accessible and understandable to non-technical
users. It's like putting the puzzle together in a way that others can see and understand the complete
picture.
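As a sketch of the serving and interactive-querying layers, the snippet below writes a processed record to MongoDB and runs an ad-hoc query against it; the connection string, database, and collection names are all assumptions.

```python
from pymongo import MongoClient

# Serving-database sketch: store processed results in MongoDB and query them interactively.
# The connection string, database, and collection names are hypothetical.
client = MongoClient("mongodb://localhost:27017")
results = client["analytics"]["daily_sales"]

# Write a processed record produced by the batch or real-time layer.
results.insert_one({"date": "2024-01-15", "region": "south", "total": 48200})

# Interactive query: an ad-hoc question answered directly from the serving layer.
for doc in results.find({"region": "south"}).sort("date", -1).limit(5):
    print(doc["date"], doc["total"])

client.close()
```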
Mapping the analytics flow to the Big Data Stack means aligning the stages and processes involved in
data analytics with the various components of a Big Data technology stack. It involves understanding how
different tools and technologies within the Big Data ecosystem can be employed to handle the various
aspects of data processing, storage, and analysis. Here's a simplified breakdown:
Raw Data Sources: Identify where the data is coming from, such as sensors, databases, logs, or any other sources.
Data Access Connectors: Determine how to connect and collect data from these sources efficiently. Use connectors or pipelines to move data to the next stages.
Data Storage: Choose appropriate storage systems to house the data, considering factors like volume, velocity, and variety of the data. This could involve distributed file systems, databases, or cloud storage.
Batch Analytics: Decide how to process large volumes of data in batches to gain insights over time. Utilize technologies like Apache Spark, Hadoop, or other batch processing frameworks.
Real-time Analytics: Address the need for immediate insights by implementing real-time analytics using technologies such as Apache Flink, Apache Kafka Streams, or other stream processing frameworks.
Interactive Querying: Provide a way for users to interactively query and explore the data. Use databases optimized for quick querying, like Apache Cassandra, or other interactive query systems.
Serving Databases: Store processed and analyzed data in serving databases for quick and efficient retrieval. This could involve NoSQL databases like MongoDB or traditional relational databases.
Web & Visualization Frameworks: Present the results to end-users using web and visualization frameworks. Utilize tools like Tableau, Power BI, or custom-built dashboards with frameworks like D3.js for effective data visualization.
By mapping the analytics flow to the Big Data Stack, organizations can optimize their data processing and
analysis workflows, making use of the capabilities offered by different components of the Big Data
ecosystem. This ensures efficient handling of large datasets and facilitates the extraction of valuable
insights from the data.
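For the real-time part of such a mapping, a minimal Spark Structured Streaming sketch that counts events per minute from a Kafka topic could look like this; the broker address and topic name are placeholders, and the Kafka connector package must be available on the cluster.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, window

# Real-time layer sketch: count events per minute arriving on a Kafka topic.
# The broker address and topic name ("events") are placeholders.
spark = SparkSession.builder.appName("stream-counts").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "events")
          .load())

# The Kafka source exposes a "timestamp" column; group events into 1-minute windows.
counts = events.groupBy(window(col("timestamp"), "1 minute")).count()

query = (counts.writeStream
         .outputMode("complete")   # emit the full updated table of window counts
         .format("console")
         .start())

query.awaitTermination()
```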
Case study on Genome Data Analysis:
A research institution is conducting a study to understand the genetic factors influencing a rare disease.
They have collected genomic data from a large group of individuals and aim to analyze this data to
identify potential genetic markers associated with the disease.
In this case study, the Big Data Stack plays a crucial role in handling the vast and complex genomic data,
performing in-depth analysis, and presenting the findings in a way that aids researchers in understanding
the genetic factors influencing the rare disease.
Case study on Weather Data Analysis:
A meteorological agency wants to improve its weather prediction models by analyzing historical and
real-time weather data. The goal is to enhance accuracy in forecasting and provide more timely and
precise information to the public.
In this case study, the Big Data Stack facilitates the efficient handling of vast and dynamic weather data,
enabling comprehensive analysis, accurate predictions, and timely communication of weather information
to the public and other stakeholders.
Analytics patterns refer to recurring approaches or methodologies used in data analysis to solve common
problems or achieve specific goals. These patterns provide guidance on how to structure and conduct data
analysis tasks efficiently and effectively. Here are some common analytics patterns:
1. *Descriptive Analytics*: This pattern involves summarizing historical data to gain insights into past
events or trends. It focuses on answering questions like "What happened?" and includes techniques such
as data aggregation, summarization, and visualization.
2. *Diagnostic Analytics*: Diagnostic analytics aims to understand why certain events occurred by
identifying patterns, correlations, or relationships in the data. It helps uncover root causes behind
observed phenomena and supports troubleshooting and problem-solving efforts.
3. *Predictive Analytics*: Predictive analytics uses historical data to forecast future outcomes or trends.
It involves building predictive models using statistical techniques, machine learning algorithms, or other
predictive modeling approaches to make educated predictions based on past patterns.
4. *Text Analytics*: Text analytics focuses on extracting insights and patterns from unstructured text
data, such as emails, social media posts, or customer reviews. It includes techniques like natural language
processing (NLP), sentiment analysis, and topic modeling to analyze and interpret textual data.
5. *Spatial Analytics*: Spatial analytics deals with analyzing data that has a geographic or spatial
component, such as maps, GPS coordinates, or spatial databases. It involves techniques like spatial
clustering, spatial interpolation, and spatial regression to understand spatial relationships and patterns in
the data.
6. *Temporal Analytics*: Temporal analytics focuses on analyzing data over time to identify temporal
trends, patterns, or seasonality. It involves time series analysis, event sequence analysis, and trend
detection techniques to uncover insights related to temporal changes in the data.
These analytics patterns serve as building blocks for designing and implementing data analysis workflows
and can be combined or adapted to address specific analytical challenges or business objectives
effectively.
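As a small example of the temporal pattern, the sketch below smooths a synthetic daily metric with a rolling average and checks for an upward trend; the series is made up for illustration only.

```python
import pandas as pd

# Temporal-analytics sketch: detect a trend in a daily metric with a rolling average.
dates = pd.date_range("2024-01-01", periods=14, freq="D")
values = [10, 12, 11, 13, 15, 14, 16, 18, 17, 19, 21, 20, 22, 24]
series = pd.Series(values, index=dates)

# Smooth short-term noise with a 7-day rolling mean, then compare start vs end.
rolling = series.rolling(window=7).mean()
trend = "upward" if rolling.dropna().iloc[-1] > rolling.dropna().iloc[0] else "flat/downward"

print(rolling.tail(3))
print("detected trend:", trend)
```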