Key Terms:-
Device: A device is a unit of physical hardware or equipment that provides one or more computing functions within a
computer system. It can provide input to the computer, accept output or both. A device can be any electronic element with
some computing ability that supports the installation of firmware or third-party software.
Examples: mouse, monitor, CPU, keyboard, etc.
Machine: A machine is a physical system that uses power to apply forces and control movement to perform an action. The
term is commonly applied to artificial devices, such as those employing engines or motors, but also to natural biological
macromolecules, such as molecular machines. Example: a wide range of vehicles, such as trains, automobiles, boats and
airplanes; appliances in the home and office, including computers, building air handling and water handling systems; as well
as farm machinery, machine tools and factory automation systems and robots.
IoT: The Internet of Things (IoT) describes devices with sensors, processing ability, software, and other technologies that connect and exchange data with other devices and systems over the Internet or other communication networks. Examples: home security systems, activity trackers, industrial security and safety systems, augmented reality glasses, motion detection.
Social Media: Social media are interactive technologies that facilitate the creation, sharing and aggregation of content, ideas,
interests, and other forms of expression through virtual communities and networks.
Examples: Facebook, Instagram, YouTube, Telegram, etc.
Data: Data is raw material: it can be numbers, symbols, characters, words, codes, graphs, etc. Information, on the other hand, is data put into context; information is utilised by humans in some significant way (such as to make decisions, forecasts etc).
Information: Information is organized or classified data, which has some meaningful values for the receiver. Information is
the processed data on which decisions and actions are based.
Network: A computer network is a set of computers sharing resources located on or provided by network nodes. A network
consists of two or more computers that are linked in order to share resources (such as printers and CDs), exchange files, or
allow electronic communications. The computers on a network may be linked through cables, telephone lines, radio waves,
satellites, or infrared light beams.
Internet: The Internet is the global system of interconnected computer networks that uses the Internet protocol
suite (TCP/IP) to communicate between networks and devices. It is a network of networks that consists of private, public,
academic, business, and government networks of local to global scope, linked by a broad array of electronic, wireless,
and optical networking technologies. The Internet carries a vast range of information resources and services, such as the
interlinked hypertext documents and applications of the World Wide Web (WWW), electronic mail, telephony, and file
sharing.
Digital Data: Digital data is the electronic representation of information in a format or language that machines can read and
understand. In more technical terms, digital data is a binary format of information that's converted into a machine-readable
digital format. The power of digital data is that any analog inputs, from very simple text documents to genome sequencing
results, can be represented with the binary system.
Big Data: Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that
continues to grow exponentially over time. These datasets are so huge and complex in volume, velocity, and variety, that
traditional data management systems cannot store, process, and analyze them.
The amount and availability of data is growing rapidly, spurred on by digital technology advancements, such as connectivity,
mobility, the Internet of Things (IoT), and artificial intelligence (AI). As data continues to expand and proliferate, new big data
tools are emerging to help companies collect, process, and analyze data at the speed needed to gain the most value from it.
Big data describes large and diverse datasets that are huge in volume and also rapidly grow in size over time. Big data is used
in machine learning, predictive modeling, and other advanced analytics to solve business problems and make informed
decisions.
“Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand
database management tools or traditional data processing applications”.
Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application
software. Big data is a combination of structured, semi-structured, and unstructured data collected by organizations that can be
mined for information and used in projects.
Types of Big Data: Big data / digital data is classified into three types.
1. Structured Data: Structured data is created using a fixed schema and is maintained in tabular format. The elements in
structured data are addressable for effective analysis. It contains all the data which can be stored in the SQL database in a
tabular format. It stores data in form of Rows and Columns.
Consider relational data as an example: suppose you must maintain student records for a university, including each student's name, ID, address, and email. To store these records, you could use the following relational schema and table.
STUDENT(S_ID, S_Name, S_Address, S_Email)
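As a minimal sketch of how such a fixed-schema table works in practice, the following Python snippet uses the standard-library sqlite3 module to create and query the student table above (the sample name and values are made up for illustration):

```python
import sqlite3

# In-memory database for the example; use a file path for persistent storage.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Fixed schema: every row has exactly these columns.
cur.execute("""
    CREATE TABLE student (
        S_ID      INTEGER PRIMARY KEY,
        S_Name    TEXT NOT NULL,
        S_Address TEXT,
        S_Email   TEXT
    )
""")

cur.execute(
    "INSERT INTO student (S_ID, S_Name, S_Address, S_Email) VALUES (?, ?, ?, ?)",
    (1, "Asha Verma", "Lucknow", "asha@example.com"),
)
conn.commit()

# Rows and columns are directly addressable for analysis.
for row in cur.execute("SELECT S_ID, S_Name FROM student"):
    print(row)  # (1, 'Asha Verma')
```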
2. Unstructured Data: Unstructured data does not follow a pre-defined schema or any organized format. This kind of data does not fit the relational model, because a relational database expects data in a pre-defined, organized form. Unstructured data is nonetheless very important in the big data domain, and there are many platforms, such as NoSQL databases, for storing and managing it.
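As a rough sketch of storing schema-free records in a NoSQL database, the snippet below uses MongoDB via the pymongo driver (this assumes pymongo is installed and a MongoDB server is running locally; the database, collection, and field names are hypothetical):

```python
from pymongo import MongoClient

# Assumes a MongoDB server on the default local port.
client = MongoClient("mongodb://localhost:27017")
posts = client["demo_db"]["posts"]

# No pre-defined schema: each document can carry different fields.
posts.insert_one({"user": "asha", "text": "Hello big data!", "tags": ["intro"]})
posts.insert_one({"user": "ravi", "image_url": "photo.jpg", "likes": 42})

for doc in posts.find({"user": "asha"}):
    print(doc)
```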
3. Semi-Structured Data: Semi-structured data is information that does not reside in a relational database but has some organizational properties, such as tags or markers, that make it easier to analyze. With some processing it can be stored in a relational database, although this is very hard for some kinds of semi-structured data.
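JSON is a common semi-structured format: records carry tags (field names) that aid analysis, but no fixed schema is enforced. Here is a minimal sketch using Python's standard-library json module, with made-up field names:

```python
import json

# Semi-structured records: tagged fields, but no fixed schema.
records = [
    {"S_ID": 1, "S_Name": "Asha", "S_Email": "asha@example.com"},
    {"S_ID": 2, "S_Name": "Ravi", "phones": ["999-000-1111"]},  # extra field
]

text = json.dumps(records, indent=2)  # serialize to JSON text
parsed = json.loads(text)             # parse it back

# The tags make irregular data easy to navigate.
for rec in parsed:
    print(rec["S_Name"], rec.get("S_Email", "no email on record"))
```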
Applications of Big Data:-
1) Transportation
2) Advertising and Marketing
3) Banking and Financial Services
4) Government Projects
5) Media and Entertainment
6) Meteorology
7) Healthcare
8) Cybersecurity
Evolution of Big Data:-
The early days of computing laid the foundation for data processing. Mainframe computers were used to handle large
volumes of data for scientific and business applications.
Databases like IBM's Information Management System (IMS) and the emergence of relational databases in the 1970s
were crucial for managing structured data.
The concept of data warehousing gained popularity as organizations started centralizing data from various sources
into a single repository for analysis.
Technologies like Online Analytical Processing (OLAP) and data mining became more prominent during this period.
With the rise of the internet, data started to explode, and organizations had to deal with vast amounts of information
generated by websites, online transactions, and more.
Search engines like Google emerged, and the need for efficient ways to process and analyze large datasets became
apparent.
The open-source movement played a significant role in big data innovation. Technologies like Apache Hadoop,
developed by the Apache Software Foundation, provided a scalable and distributed framework for processing large
datasets.
Traditional relational databases faced challenges in handling unstructured and semi-structured data. NoSQL
databases, like MongoDB and Cassandra, emerged to address these issues, offering flexible and scalable alternatives.
Cloud computing platforms, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform,
made it easier for organizations to store and process massive amounts of data without the need for substantial
infrastructure investments.
The integration of advanced analytics and machine learning with big data became a key trend. Organizations started
leveraging data to gain insights, make predictions, and improve decision-making processes.
The proliferation of IoT devices added another dimension to big data. The massive amounts of data generated by
sensors, devices, and connected systems required new approaches to storage, processing, and analysis.
The demand for real-time analytics and processing led to the development of technologies like Apache Spark, which
allows for faster data processing and analysis in-memory.
As data became more abundant, concerns about data governance, security, and privacy became critical. Regulations
like GDPR (General Data Protection Regulation) aimed at protecting individual privacy had a significant impact on
how organizations handle and manage big data.
Components of a Big Data Platform:-
1. Data Ingestion:
Definition: The process of collecting and importing data from various sources into the big data platform.
Tools/Technologies: Apache Kafka, Apache NiFi, AWS Kinesis, Google Cloud Pub/Sub.
2. Storage:
Definition: The storage layer for housing large volumes of structured, semi-structured, and unstructured data.
Technologies: Hadoop Distributed File System (HDFS), Amazon S3, Google Cloud Storage, Azure Data Lake Storage.
3. Processing:
Definition: The capability to process and analyze data at scale, often in parallel across distributed computing resources.
4. Query and Analysis:
Definition: Tools and technologies for querying and analyzing data stored in the big data platform (a short query sketch appears after this list).
Query Engines: Apache Hive, Apache Impala, Presto, Google BigQuery, AWS Athena.
5. Machine Learning and Analytics:
Definition: Integration of machine learning frameworks and analytics tools for deriving insights, predictions, and patterns from the data.
6. Data Warehousing:
Definition: Storage and management of structured data optimized for analytical queries.
7. Real-time Processing:
Definition: Handling and processing of data in real-time or near-real-time to enable instant insights and actions.
8. Data Governance and Security:
Definition: Ensuring the integrity, security, and compliance of data within the big data platform.
9. Monitoring and Management:
Definition: Tools and processes to monitor the performance, health, and resource utilization of the big data platform.
10. Scalability:
Definition: The ability to scale resources horizontally or vertically to handle growing data volumes and processing
requirements.
Scaling Methods: Horizontal scaling (adding more nodes), vertical scaling (increasing resources per node).
11. Cloud Integration:
Definition: Leveraging cloud computing resources for flexibility, scalability, and cost-effectiveness.
Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP).
12. Metadata Management:
Definition: Cataloging and managing metadata to provide a centralized view of data assets.
13. Data Pipelines:
Definition: The end-to-end flow of data from source to destination, involving processing, transformation, and storage.
14. Data Quality:
Definition: Ensuring the accuracy, reliability, and consistency of data across the big data platform.
15. Interoperability:
Definition: Coexistence and integration of big data platforms with existing IT infrastructure and enterprise systems.
A well-designed Big Data Platform enables organizations to harness the power of large-scale data analytics, supporting
advanced capabilities such as real-time processing, machine learning, and deep analytics to gain valuable insights and drive
informed decision-making. The choice of specific technologies and components depends on the organization's requirements,
use cases, and existing infrastructure.
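As a minimal sketch of the query-and-analysis component (item 4 above), the snippet below registers a small DataFrame as a SQL view in PySpark and queries it, the same SQL-on-big-data style that engines such as Hive or Presto run at scale (assumes PySpark is installed; the table and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-sketch").getOrCreate()

# Tiny stand-in for a table that would normally live in a data lake.
df = spark.createDataFrame(
    [("web", 120), ("mobile", 340), ("web", 80)],
    ["channel", "visits"],
)
df.createOrReplaceTempView("traffic")

# SQL-style analysis over the registered view.
spark.sql(
    "SELECT channel, SUM(visits) AS total FROM traffic GROUP BY channel"
).show()

spark.stop()
```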
Drivers of Big Data:-
1. Data Growth:
Description: The exponential growth of data volumes generated by organizations, including structured, semi-structured,
and unstructured data.
Impact: Traditional data processing systems struggle to handle the sheer volume of data, necessitating scalable and
efficient solutions.
2. Data Variety:
Description: The increasing diversity of data types, sources, and formats, such as text, images, videos, social media, and
sensor data.
Impact: Organizations need flexible systems that can handle diverse data types and integrate information from various
sources.
3. Data Velocity:
Description: The speed at which data is generated, processed, and made available for analysis, especially with the rise
of real-time and streaming data sources.
Impact: Organizations require systems that can process and analyze data in real-time or near-real-time for timely
decision-making.
4. Advanced Analytics:
Description: The growing demand for advanced analytics, predictive modeling, and business intelligence to derive
actionable insights from data.
Impact: Big data platforms provide the infrastructure and tools needed for complex analytics, enabling organizations to
gain a competitive edge.
5. Cost Effectiveness:
Description: The need for cost-effective solutions for storing and processing large volumes of data, especially as
traditional storage and processing methods become expensive.
Impact: Big data technologies, often based on distributed computing and storage, offer cost-effective alternatives to
traditional data management systems.
6. Competitive Advantage:
Description: The recognition that harnessing big data can provide a competitive advantage by enabling innovation,
identifying new business opportunities, and improving operational efficiency.
Impact: Organizations investing in big data gain insights that lead to better decision-making, improved customer
experiences, and innovative product development.
7. Internet of Things (IoT):
Description: The proliferation of connected devices and sensors, generating vast amounts of data that need to be
collected, processed, and analyzed.
Impact: Big data platforms are essential for handling the massive streams of data generated by IoT devices and
extracting meaningful insights.
8. Regulatory Compliance:
Description: The increasing focus on data privacy and regulatory compliance, such as GDPR, HIPAA, and other data
protection laws.
Impact: Organizations must implement robust data governance and security measures, which often involve big data
solutions to comply with regulations.
9. Real-time Decision-Making:
Description: The need for real-time or near-real-time insights to support rapid decision-making in dynamic business
environments.
Impact: Big data platforms with real-time processing capabilities enable organizations to respond quickly to changing
conditions and make data-driven decisions on the fly.
10. Technological Advancements:
Description: Continuous advancements in big data technologies, including distributed computing frameworks, machine
learning algorithms, and cloud computing.
Impact: Ongoing innovation in big data technologies makes these platforms more powerful, accessible, and capable of
addressing complex data challenges.
11. Customer Experience:
Description: The desire to understand and improve customer experiences by analyzing customer behavior, preferences,
and feedback.
Impact: Big data analytics provides insights that help organizations tailor products, services, and marketing strategies to
meet customer expectations.
12. Open-Source Ecosystem:
Description: The availability and widespread adoption of open-source big data technologies that provide cost-effective
and flexible solutions.
Impact: Open-source ecosystems like Apache Hadoop, Apache Spark, and others have democratized access to big data
technologies, enabling a broad range of organizations to leverage them.
Big Data Architecture: A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems.
Data Sources:-
Internal Sources: Data generated from within an organization, including transactional databases, log files, and operational
systems.
External Sources: Data obtained from outside the organization, such as social media, open data sets, and third-party data
providers.
Streaming Data: Real-time data generated continuously by sources like sensors, IoT devices, and social media feeds.
Data Storage:-
Data Warehouses: Storing structured data for analytical purposes. Traditional relational databases, as well as cloud-based data
warehouses like Amazon Redshift and Google BigQuery, are common choices.
Data Lakes: Storing diverse and large volumes of raw and processed data. Technologies like Apache Hadoop Distributed File
System (HDFS) and cloud-based solutions like Amazon S3 and Azure Data Lake Storage are used.
Data Ingestion:-
Batch Processing: Collecting and processing large volumes of data at scheduled intervals. Technologies like Apache Hadoop MapReduce and Apache Spark are commonly used for batch processing. Because the data sets are so large, a big data solution must often process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs involve reading source files, processing them, and writing the output to new files. Options include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom MapReduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
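A minimal sketch of this read-filter-aggregate-write batch pattern as a PySpark program, one of the options mentioned above (the input path, column names, and output location are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-prep").getOrCreate()

# Read the source files (hypothetical path).
logs = spark.read.json("raw_logs/*.json")

# Filter and aggregate to prepare the data for analysis.
daily = (
    logs.filter(F.col("status") == 200)
        .groupBy("date")
        .agg(F.count("*").alias("requests"))
)

# Write the output to new files, completing the batch job.
daily.write.mode("overwrite").parquet("prepared/daily_requests")

spark.stop()
```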
Stream Processing: Handling real-time data as it is generated. Technologies like Apache Kafka and Apache Flink are popular for stream processing. After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually running SQL queries that operate on unbounded streams.
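A minimal sketch of the capture-filter-sink flow using Spark Structured Streaming reading from Kafka (assumes Spark with the Kafka connector and a broker at localhost:9092; the topic name is hypothetical, and the console sink stands in for a real output sink):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Capture real-time messages from a Kafka topic (an unbounded stream).
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "events")
         .load()
)

# Filter/prepare the stream, then write it to an output sink.
query = (
    events.selectExpr("CAST(value AS STRING) AS body")
          .filter("body IS NOT NULL")
          .writeStream.format("console")
          .outputMode("append")
          .start()
)
query.awaitTermination()
```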
Machine learning:-
Machine learning (ML) plays a crucial role in big data architectures by enabling organizations to extract valuable insights,
predictions, and patterns from large and complex datasets.
Machine learning by itself is a branch of artificial intelligence that has a large variety of algorithms and applications.
Machine learning is a field of study in artificial intelligence concerned with the development and study of statistical algorithms
that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions.
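As a minimal sketch of that definition, learning from data and generalizing to unseen data, the snippet below trains a simple classifier with scikit-learn on synthetic data standing in for features produced by a big data pipeline (assumes scikit-learn is installed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for features extracted from a big data pipeline.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn from the training data...
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ...then generalize to unseen data.
print("accuracy on unseen data:", model.score(X_test, y_test))
```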
Orchestration:
Most big data solutions consist of repeated data processing operations, encapsulated in workflows that transform source data,
move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results
straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie with Apache Sqoop.
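Azure Data Factory and Oozie workflows are usually defined through their own configuration interfaces. As a rough code-level illustration of the same idea, here is a minimal sketch using Apache Airflow, a popular open-source orchestrator not covered above (Airflow 2.4+ assumed; the DAG and task names are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("read source files")

def transform():
    print("filter and aggregate")

def load():
    print("load results into the analytical data store")

# A daily workflow: extract -> transform -> load.
with DAG(dag_id="daily_pipeline", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
```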
A big data architecture such as this is typically used when you need to:
Store and process data in volumes too large for a traditional database.
Transform unstructured data for analysis and reporting.
Capture, process, and analyze unbounded streams of data in real time, or with low latency.
Use Azure Machine Learning or Microsoft Cognitive Services.
Characteristics of Big Data (The 5 V's):-
Volume:-
The name big data itself relates to enormous size. Big data refers to the vast volumes of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
Variety:-
Big data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in many forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Example: Web server logs, i.e., the log file is created and maintained by some server that contains a list of activities.
Veracity:-
Veracity refers to how reliable the data is. Because data arrives from many sources, it often has to be filtered or translated, and veracity also covers the ability to handle and manage such data efficiently.
For example, Facebook posts with hashtags.
Value:-
Value is an essential characteristic of big data. What matters is not merely the data that we process or store, but the valuable and reliable data that we store, process, and analyze.
Velocity:-
Velocity refers to the speed at which data is created, often in real time. It encompasses the rate of incoming data, the rate of change, and bursts of activity. A primary aspect of big data is supplying data rapidly on demand. Big data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, and mobile devices.
Big Data technology has four main components: data capture, data storage, data processing, and data visualization.
1. Data capture refers to the process of collecting data from a variety of sources. This can include everything from social media
posts to sensor readings.
2. Data storage is the process of storing this data in a way that makes it accessible for further analysis.
3. Data processing is where the real magic happens. This is where algorithms are used to analyze the data and extract insights.
4. And finally, data visualization is the process of representing this data in a way that is easy for humans to understand (a toy end-to-end sketch follows this list).
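A toy end-to-end sketch of the four components using pandas and matplotlib (assumed installed); the sensor readings are made up, and a real system would capture from live sources and store in HDFS, S3, or similar rather than in memory:

```python
import pandas as pd
import matplotlib.pyplot as plt

# 1. Capture: tiny made-up readings standing in for collected data.
readings = [("sensor_a", 21.5), ("sensor_b", 19.0),
            ("sensor_a", 22.5), ("sensor_b", 18.0)]

# 2. Store: load into a DataFrame (in-memory stand-in for real storage).
df = pd.DataFrame(readings, columns=["sensor", "temp_c"])

# 3. Process: aggregate to extract a simple insight.
avg = df.groupby("sensor")["temp_c"].mean()

# 4. Visualize: render the result for a human audience.
avg.plot(kind="bar", ylabel="avg temp (C)", title="Average by sensor")
plt.show()
```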
Challenges of Big Data:-
Storage:-
With vast amounts of data generated daily, the greatest challenge is storage (especially when the data is in different formats)
within legacy systems. Unstructured data cannot be stored in traditional databases.
Processing:-
Processing Big Data refers to the reading, transforming, extraction, and formatting of useful information from raw information.
The input and output of information in unified formats continue to present difficulties.
Security:-
Security is a big concern for organizations. Big data security is the collective term for all the measures and tools used to guard
both the data and analytics processes from attacks, theft, or other malicious activities that could harm or negatively affect them.
Non-encrypted information is at risk of theft or damage by cyber-criminals. Therefore, data security professionals must balance
access to data against maintaining strict security protocols.
Data compliance is the practice of ensuring that sensitive data is organized and managed in such a way as to enable organizations
to meet enterprise business rules along with legal and governmental regulations.
Organizations also face ethical and privacy challenges in handling big data. The following are five principles for using data responsibly:
1. Private customer data and identity should remain private: Privacy does not mean secrecy, as personal data might need to be audited based on legal requirements; however, private data obtained from a person with their consent should not be exposed for use by other businesses or individuals in ways traceable to their identity.
2. Shared private information should be treated confidentially: Sensitive data shared with third-party companies, whether medical, financial, or locational, needs restrictions on whether and how that information can be shared further.
3. Customers should have a transparent view of how their data is being used or sold, and the ability to manage the flow of their private information across massive, third-party analytical systems.
4. Big Data should not interfere with human will: Big data analytics can moderate and even determine who we are before we
make up our minds. Companies need to consider the kind of predictions and inferences that should be allowed and those that
should not.
5. Big data should not institutionalize unfair biases like racism or sexism. Machine learning algorithms can absorb unconscious
biases in a population and amplify them via training samples.
Big Data Analytics:-
Big data analytics is the use of advanced analytic techniques against very large, diverse big data sets that include structured, semi-structured, and unstructured data, from different sources, and in different sizes from terabytes to zettabytes.
Benefits of Big Data Analytics:-
Faster, better decision making: Businesses can access a large volume of data and analyze a large variety of data sources to gain new insights and take action. They can start small and scale to handle data from historical records and in real time.
Cost reduction and operational efficiency: Flexible data processing and storage tools can help organizations save costs in storing and analyzing large amounts of data, and discover patterns and insights that help them identify ways to do business more efficiently.
Improved data-driven go-to-market: Analyzing data from sensors, devices, video, logs, transactional applications, web and
social media empowers an organization to be data-driven. Gauge customer needs and potential risks and create new products
and services.
How Big Data Analytics Works:-
1. Data professionals collect data from a variety of different sources. Often, it is a mix of semi-structured and unstructured data; common sources include transactional systems, log files, social media content, and sensor data from IoT devices.
2. Data is prepared and processed. After data is collected and stored in a data warehouse or data lake, data professionals must
organize, configure and partition the data properly for analytical queries. Thorough data preparation and processing makes
for higher performance from analytical queries.
3. Data is cleansed to improve its quality. Data professionals scrub the data using scripting tools or data quality software. They look for any errors or inconsistencies, such as duplications or formatting mistakes, and organize and tidy up the data (see the cleansing sketch after this list).
4. The collected, processed and cleaned data is analyzed with analytics software. This includes tools for:
Data mining, which sifts through data sets in search of patterns and relationships
Predictive analytics, which builds models to forecast customer behavior and other future actions, scenarios and trends
Machine learning, which taps various algorithms to analyze large data sets
Deep learning, which is a more advanced offshoot of machine learning
Text mining and statistical analysis software
Artificial intelligence (AI)
Mainstream business intelligence software
Data visualization tools
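A minimal sketch of step 3, data cleansing, using pandas (assumed installed); the records, duplications, and formatting mistakes are made up for illustration:

```python
import pandas as pd

# Hypothetical messy records: duplicates, stray whitespace, a missing value.
df = pd.DataFrame({
    "customer": [" Asha ", "Ravi", "Ravi", "meena"],
    "spend":    [120.0,    80.0,  80.0,   None],
})

df["customer"] = df["customer"].str.strip().str.title()  # fix formatting
df = df.drop_duplicates()                                # remove duplications
df["spend"] = df["spend"].fillna(df["spend"].median())   # fill the gap

print(df)
```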
Reporting vs Analytics
1. Purpose: Reporting involves extracting data from different sources within an organization and monitoring it to gain an understanding of the performance of the various functions. By linking data from across functions, it helps create a cross-channel view that facilitates comparison and makes the data easier to understand. Analysis means interpreting data at a deeper level and providing recommendations on actions.
2. The Specifics: Reporting involves activities such as building, consolidating, organizing, configuring, formatting, and
summarizing. It requires clean, raw data and reports that may be generated periodically, such as daily, weekly, monthly,
quarterly, and yearly. Analytics includes asking questions, examining, comparing, interpreting, and confirming. Enriching
data with big data can help predict future trends as well.
3. The Final Output: In the case of reporting, outputs such as canned reports, dashboards, and alerts push information to users.
Through analysis, analysts try to extract answers using business queries and present them in the form of ad hoc responses,
insights, recommended actions, or a forecast. Understanding this key difference can help businesses leverage analytics better.
4. People: Reporting requires repetitive tasks that can be automated. It is often used by functional business heads who monitor
specific business metrics. Analytics requires customization and therefore depends on data analysts and scientists. Also, it is
used by business leaders to make data-driven decisions.
5. Value Proposition: Comparing the two is like comparing apples to oranges: reporting and analytics each serve a different purpose. By understanding those purposes and using them correctly, businesses can derive immense value from both.
Big Data Analytics Tools:-
1. R and Python
2. Microsoft Excel
3. Tableau
4. RapidMiner
5. KNIME
6. Power BI
7. Apache Spark
8. QlikView
9. Talend
10. Splunk
Types of Big Data Analytics:-
1. Descriptive Analysis: What is happening now, based on incoming data. Example: Google Analytics.
2. Predictive Analysis: What might happen in the future.
3. Prescriptive Analysis: What action should be taken. Example: Google's self-driving car.
4. Diagnostic Analysis: Why did it happen. Example: analyzing the impact of a campaign, advertisement, promotion, or target-based service.