Key Terms:-
Device: A device is a unit of physical hardware or equipment that provides one or more computing functions within a
computer system. It can provide input to the computer, accept output or both. A device can be any electronic element with
some computing ability that supports the installation of firmware or third-party software.
Examples: mouse, monitor, CPU, keyboard, etc.
Machine: A machine is a physical system that uses power to apply forces and control movement to perform an action. The
term is commonly applied to artificial devices, such as those employing engines or motors, but also to natural biological
macromolecules, such as molecular machines. Example: a wide range of vehicles, such as trains, automobiles, boats and
airplanes; appliances in the home and office, including computers, building air handling and water handling systems; as well
as farm machinery, machine tools and factory automation systems and robots.
IoT: The Internet of Things (IoT) describes devices with sensors, processing ability, software, and other technologies that connect and exchange data with other devices and systems over the Internet or other communication networks. Examples: home security systems, activity trackers, industrial security and safety systems, augmented reality glasses, motion detection.
Social Media: Social media are interactive technologies that facilitate the creation, sharing and aggregation of content, ideas,
interests, and other forms of expression through virtual communities and networks.
Examples: Facebook, Instagram, YouTube, Telegram, etc.
Data: Data is raw material: it can be numbers, symbols, characters, words, codes, graphs, etc. Information, on the other hand, is data put into context; information is utilised by humans in some significant way (such as to make decisions, forecasts etc).
Information: Information is organized or classified data, which has some meaningful values for the receiver. Information is
the processed data on which decisions and actions are based.
Network: A computer network is a set of computers sharing resources located on or provided by network nodes. A network
consists of two or more computers that are linked in order to share resources (such as printers and CDs), exchange files, or
allow electronic communications. The computers on a network may be linked through cables, telephone lines, radio waves,
satellites, or infrared light beams.
Internet: The Internet is the global system of interconnected computer networks that uses the Internet protocol
suite (TCP/IP) to communicate between networks and devices. It is a network of networks that consists of private, public,
academic, business, and government networks of local to global scope, linked by a broad array of electronic, wireless,
and optical networking technologies. The Internet carries a vast range of information resources and services, such as the
interlinked hypertext documents and applications of the World Wide Web (WWW), electronic mail, telephony, and file
sharing.
Digital Data: Digital data is the electronic representation of information in a format or language that machines can read and
understand. In more technical terms, digital data is a binary format of information that's converted into a machine-readable
digital format. The power of digital data is that any analog inputs, from very simple text documents to genome sequencing
results, can be represented with the binary system.
Big Data: Big data refers to extremely large and diverse collections of structured, unstructured, and semi-structured data that
continues to grow exponentially over time. These datasets are so huge and complex in volume, velocity, and variety, that
traditional data management systems cannot store, process, and analyze them.
The amount and availability of data is growing rapidly, spurred on by digital technology advancements, such as connectivity,
mobility, the Internet of Things (IoT), and artificial intelligence (AI). As data continues to expand and proliferate, new big data
tools are emerging to help companies collect, process, and analyze data at the speed needed to gain the most value from it.
Big data describes large and diverse datasets that are huge in volume and also rapidly grow in size over time. Big data is used
in machine learning, predictive modeling, and other advanced analytics to solve business problems and make informed
decisions.
“Big Data is the term for a collection of data sets so large and complex that it becomes difficult to process using on-hand
database management tools or traditional data processing applications”.
Big data primarily refers to data sets that are too large or complex to be dealt with by traditional data-processing application
software. Big data is a combination of structured, semi-structured, and unstructured data collected by organizations that can be
mined for information and used in projects.
Types of Big Data: Big data / digital data is classified into three types.
1. Structured Data: Structured data is created using a fixed schema and is maintained in tabular format. The elements in
structured data are addressable for effective analysis. It contains all the data which can be stored in the SQL database in a
tabular format. It stores data in form of Rows and Columns.
Consider relational data as an example: suppose you must maintain student records for a university, including each student's name, ID, address, and email. To store these records, you could use the following relational schema and table.
STUDENT(S_ID, S_Name, S_Address, S_Email)
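As a minimal sketch of how such a fixed-schema table works in practice, the following Python snippet uses the standard-library sqlite3 module to create and query the student table above (the sample name and values are made up for illustration):

```python
import sqlite3

# In-memory database for the example; use a file path for persistent storage.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Fixed schema: every row has exactly these columns.
cur.execute("""
    CREATE TABLE student (
        S_ID      INTEGER PRIMARY KEY,
        S_Name    TEXT NOT NULL,
        S_Address TEXT,
        S_Email   TEXT
    )
""")

cur.execute(
    "INSERT INTO student (S_ID, S_Name, S_Address, S_Email) VALUES (?, ?, ?, ?)",
    (1, "Asha Verma", "Lucknow", "asha@example.com"),
)
conn.commit()

# Rows and columns are directly addressable for analysis.
for row in cur.execute("SELECT S_ID, S_Name FROM student"):
    print(row)  # (1, 'Asha Verma')
```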
2. Unstructured Data: Unstructured data does not follow a pre-defined schema or any organized format. This kind of data does not fit the relational model, because a relational database expects data in a pre-defined, organized form. Unstructured data is nonetheless very important in the big data domain, and there are many platforms, such as NoSQL databases, for storing and managing it.
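As a rough sketch of storing schema-free records in a NoSQL database, the snippet below uses MongoDB via the pymongo driver (this assumes pymongo is installed and a MongoDB server is running locally; the database, collection, and field names are hypothetical):

```python
from pymongo import MongoClient

# Assumes a MongoDB server on the default local port.
client = MongoClient("mongodb://localhost:27017")
posts = client["demo_db"]["posts"]

# No pre-defined schema: each document can carry different fields.
posts.insert_one({"user": "asha", "text": "Hello big data!", "tags": ["intro"]})
posts.insert_one({"user": "ravi", "image_url": "photo.jpg", "likes": 42})

for doc in posts.find({"user": "asha"}):
    print(doc)
```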
3. Semi-Structured Data: Semi-structured data is information that does not reside in a relational database but has some organizational properties, such as tags or markers, that make it easier to analyze. With some processing it can be stored in a relational database, although this is very hard for some kinds of semi-structured data.
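JSON is a common semi-structured format: records carry tags (field names) that aid analysis, but no fixed schema is enforced. Here is a minimal sketch using Python's standard-library json module, with made-up field names:

```python
import json

# Semi-structured records: tagged fields, but no fixed schema.
records = [
    {"S_ID": 1, "S_Name": "Asha", "S_Email": "asha@example.com"},
    {"S_ID": 2, "S_Name": "Ravi", "phones": ["999-000-1111"]},  # extra field
]

text = json.dumps(records, indent=2)  # serialize to JSON text
parsed = json.loads(text)             # parse it back

# The tags make irregular data easy to navigate.
for rec in parsed:
    print(rec["S_Name"], rec.get("S_Email", "no email on record"))
```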
Applications of Big Data:-
1) Transportation
2) Advertising and Marketing
3) Banking and Financial Services
4) Government Projects
5) Media and Entertainment
6) Meteorology
7) Healthcare
8) Cybersecurity
Evolution of Big Data:-
The early days of computing laid the foundation for data processing. Mainframe computers were used to handle large
volumes of data for scientific and business applications.
Databases like IBM's Information Management System (IMS) and the emergence of relational databases in the 1970s
were crucial for managing structured data.
The concept of data warehousing gained popularity as organizations started centralizing data from various sources
into a single repository for analysis.
Technologies like Online Analytical Processing (OLAP) and data mining became more prominent during this period.
With the rise of the internet, data started to explode, and organizations had to deal with vast amounts of information
generated by websites, online transactions, and more.
Search engines like Google emerged, and the need for efficient ways to process and analyze large datasets became
apparent.
The open-source movement played a significant role in big data innovation. Technologies like Apache Hadoop,
developed by the Apache Software Foundation, provided a scalable and distributed framework for processing large
datasets.
Traditional relational databases faced challenges in handling unstructured and semi-structured data. NoSQL
databases, like MongoDB and Cassandra, emerged to address these issues, offering flexible and scalable alternatives.
Cloud computing platforms, such as Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform,
made it easier for organizations to store and process massive amounts of data without the need for substantial
infrastructure investments.
The integration of advanced analytics and machine learning with big data became a key trend. Organizations started
leveraging data to gain insights, make predictions, and improve decision-making processes.
The proliferation of IoT devices added another dimension to big data. The massive amounts of data generated by
sensors, devices, and connected systems required new approaches to storage, processing, and analysis.
The demand for real-time analytics and processing led to the development of technologies like Apache Spark, which
allows for faster data processing and analysis in-memory.
As data became more abundant, concerns about data governance, security, and privacy became critical. Regulations
like GDPR (General Data Protection Regulation) aimed at protecting individual privacy had a significant impact on
how organizations handle and manage big data.
Components of a Big Data Platform:-
1. Data Ingestion:
Definition: The process of collecting and importing data from various sources into the big data platform.
Tools/Technologies: Apache Kafka, Apache NiFi, AWS Kinesis, Google Cloud Pub/Sub.
2. Storage:
Definition: The storage layer for housing large volumes of structured, semi-structured, and unstructured data.
Technologies: Hadoop Distributed File System (HDFS), Amazon S3, Google Cloud Storage, Azure Data Lake Storage.
3. Processing:
Definition: The capability to process and analyze data at scale, often in parallel across distributed computing resources.
4. Query and Analysis:
Definition: Tools and technologies for querying and analyzing data stored in the big data platform (a short query sketch appears after this list).
Query Engines: Apache Hive, Apache Impala, Presto, Google BigQuery, AWS Athena.
5. Machine Learning and Analytics:
Definition: Integration of machine learning frameworks and analytics tools for deriving insights, predictions, and patterns from the data.
6. Data Warehousing:
Definition: Storage and management of structured data optimized for analytical queries.
7. Real-time Processing:
Definition: Handling and processing of data in real-time or near-real-time to enable instant insights and actions.
8. Data Governance and Security:
Definition: Ensuring the integrity, security, and compliance of data within the big data platform.
9. Monitoring and Management:
Definition: Tools and processes to monitor the performance, health, and resource utilization of the big data platform.
10. Scalability:
Definition: The ability to scale resources horizontally or vertically to handle growing data volumes and processing
requirements.
Scaling Methods: Horizontal scaling (adding more nodes), vertical scaling (increasing resources per node).
11. Cloud Integration:
Definition: Leveraging cloud computing resources for flexibility, scalability, and cost-effectiveness.
Cloud Platforms: Amazon Web Services (AWS), Microsoft Azure, Google Cloud Platform (GCP).
12. Metadata Management:
Definition: Cataloging and managing metadata to provide a centralized view of data assets.
13. Data Pipelines:
Definition: The end-to-end flow of data from source to destination, involving processing, transformation, and storage.
14. Data Quality:
Definition: Ensuring the accuracy, reliability, and consistency of data across the big data platform.
15. Interoperability:
Definition: Coexistence and integration of big data platforms with existing IT infrastructure and enterprise systems.
A well-designed Big Data Platform enables organizations to harness the power of large-scale data analytics, supporting
advanced capabilities such as real-time processing, machine learning, and deep analytics to gain valuable insights and drive
informed decision-making. The choice of specific technologies and components depends on the organization's requirements,
use cases, and existing infrastructure.
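As a minimal sketch of the query-and-analysis component (item 4 above), the snippet below registers a small DataFrame as a SQL view in PySpark and queries it, the same SQL-on-big-data style that engines such as Hive or Presto run at scale (assumes PySpark is installed; the table and column names are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("query-sketch").getOrCreate()

# Tiny stand-in for a table that would normally live in a data lake.
df = spark.createDataFrame(
    [("web", 120), ("mobile", 340), ("web", 80)],
    ["channel", "visits"],
)
df.createOrReplaceTempView("traffic")

# SQL-style analysis over the registered view.
spark.sql(
    "SELECT channel, SUM(visits) AS total FROM traffic GROUP BY channel"
).show()

spark.stop()
```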
Drivers of Big Data:-
1. Data Growth:
Description: The exponential growth of data volumes generated by organizations, including structured, semi-structured,
and unstructured data.
Impact: Traditional data processing systems struggle to handle the sheer volume of data, necessitating scalable and
efficient solutions.
2. Data Variety:
Description: The increasing diversity of data types, sources, and formats, such as text, images, videos, social media, and
sensor data.
Impact: Organizations need flexible systems that can handle diverse data types and integrate information from various
sources.
3. Data Velocity:
Description: The speed at which data is generated, processed, and made available for analysis, especially with the rise
of real-time and streaming data sources.
Impact: Organizations require systems that can process and analyze data in real-time or near-real-time for timely
decision-making.
4. Advanced Analytics:
Description: The growing demand for advanced analytics, predictive modeling, and business intelligence to derive
actionable insights from data.
Impact: Big data platforms provide the infrastructure and tools needed for complex analytics, enabling organizations to
gain a competitive edge.
5. Cost Effectiveness:
Description: The need for cost-effective solutions for storing and processing large volumes of data, especially as
traditional storage and processing methods become expensive.
Impact: Big data technologies, often based on distributed computing and storage, offer cost-effective alternatives to
traditional data management systems.
6. Competitive Advantage:
Description: The recognition that harnessing big data can provide a competitive advantage by enabling innovation,
identifying new business opportunities, and improving operational efficiency.
Impact: Organizations investing in big data gain insights that lead to better decision-making, improved customer
experiences, and innovative product development.
7. Internet of Things (IoT):
Description: The proliferation of connected devices and sensors, generating vast amounts of data that need to be
collected, processed, and analyzed.
Impact: Big data platforms are essential for handling the massive streams of data generated by IoT devices and
extracting meaningful insights.
8. Regulatory Compliance:
Description: The increasing focus on data privacy and regulatory compliance, such as GDPR, HIPAA, and other data
protection laws.
Impact: Organizations must implement robust data governance and security measures, which often involve big data
solutions to comply with regulations.
9. Real-time Decision-Making:
Description: The need for real-time or near-real-time insights to support rapid decision-making in dynamic business
environments.
Impact: Big data platforms with real-time processing capabilities enable organizations to respond quickly to changing
conditions and make data-driven decisions on the fly.
10. Technological Advancements:
Description: Continuous advancements in big data technologies, including distributed computing frameworks, machine
learning algorithms, and cloud computing.
Impact: Ongoing innovation in big data technologies makes these platforms more powerful, accessible, and capable of
addressing complex data challenges.
11. Customer Experience:
Description: The desire to understand and improve customer experiences by analyzing customer behavior, preferences,
and feedback.
Impact: Big data analytics provides insights that help organizations tailor products, services, and marketing strategies to
meet customer expectations.
12. Open-Source Ecosystem:
Description: The availability and widespread adoption of open-source big data technologies that provide cost-effective
and flexible solutions.
Impact: Open-source ecosystems like Apache Hadoop, Apache Spark, and others have democratized access to big data
technologies, enabling a broad range of organizations to leverage them.
Big Data Architecture: A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too large or complex for traditional database systems.
Data Sources:-
Internal Sources: Data generated from within an organization, including transactional databases, log files, and operational
systems.
External Sources: Data obtained from outside the organization, such as social media, open data sets, and third-party data
providers.
Streaming Data: Real-time data generated continuously by sources like sensors, IoT devices, and social media feeds.
Data Storage:-
Data Warehouses: Storing structured data for analytical purposes. Traditional relational databases, as well as cloud-based data
warehouses like Amazon Redshift and Google BigQuery, are common choices.
Data Lakes: Storing diverse and large volumes of raw and processed data. Technologies like Apache Hadoop Distributed File
System (HDFS) and cloud-based solutions like Amazon S3 and Azure Data Lake Storage are used.
Data Ingestion:-
Batch Processing: Collecting and processing large volumes of data at scheduled intervals. Technologies like Apache Hadoop MapReduce and Apache Spark are commonly used for batch processing. Because the data sets are so large, a big data solution must often process data files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for analysis. Usually these jobs involve reading source files, processing them, and writing the output to new files. Options include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or custom MapReduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python programs in an HDInsight Spark cluster.
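A minimal sketch of this read-filter-aggregate-write batch pattern as a PySpark program, one of the options mentioned above (the input path, column names, and output location are hypothetical):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("batch-prep").getOrCreate()

# Read the source files (hypothetical path).
logs = spark.read.json("raw_logs/*.json")

# Filter and aggregate to prepare the data for analysis.
daily = (
    logs.filter(F.col("status") == 200)
        .groupBy("date")
        .agg(F.count("*").alias("requests"))
)

# Write the output to new files, completing the batch job.
daily.write.mode("overwrite").parquet("prepared/daily_requests")

spark.stop()
```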
Stream Processing: Handling real-time data as it is generated. Technologies like Apache Kafka and Apache Flink are popular for stream processing. After capturing real-time messages, the solution must process them by filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data is then written to an output sink. Azure Stream Analytics provides a managed stream processing service based on perpetually running SQL queries that operate on unbounded streams.
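A minimal sketch of the capture-filter-sink flow using Spark Structured Streaming reading from Kafka (assumes Spark with the Kafka connector and a broker at localhost:9092; the topic name is hypothetical, and the console sink stands in for a real output sink):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("stream-sketch").getOrCreate()

# Capture real-time messages from a Kafka topic (an unbounded stream).
events = (
    spark.readStream.format("kafka")
         .option("kafka.bootstrap.servers", "localhost:9092")
         .option("subscribe", "events")
         .load()
)

# Filter/prepare the stream, then write it to an output sink.
query = (
    events.selectExpr("CAST(value AS STRING) AS body")
          .filter("body IS NOT NULL")
          .writeStream.format("console")
          .outputMode("append")
          .start()
)
query.awaitTermination()
```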
Machine learning:-
Machine learning (ML) plays a crucial role in big data architectures by enabling organizations to extract valuable insights,
predictions, and patterns from large and complex datasets.
Machine learning by itself is a branch of artificial intelligence that has a large variety of algorithms and applications.
Machine learning is a field of study in artificial intelligence concerned with the development and study of statistical algorithms
that can learn from data and generalize to unseen data, and thus perform tasks without explicit instructions.
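As a minimal sketch of that definition, learning from data and generalizing to unseen data, the snippet below trains a simple classifier with scikit-learn on synthetic data standing in for features produced by a big data pipeline (assumes scikit-learn is installed):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for features extracted from a big data pipeline.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Learn from the training data...
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# ...then generalize to unseen data.
print("accuracy on unseen data:", model.score(X_test, y_test))
```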
Orchestration:
Most big data solutions consist of repeated data processing operations, encapsulated in workflows that transform source data,
move data between multiple sources and sinks, load the processed data into an analytical data store, or push the results
straight to a report or dashboard. To automate these workflows, you can use an orchestration technology such as Azure Data Factory or Apache Oozie with Apache Sqoop.
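Azure Data Factory and Oozie workflows are usually defined through their own configuration interfaces. As a rough code-level illustration of the same idea, here is a minimal sketch using Apache Airflow, a popular open-source orchestrator not covered above (Airflow 2.4+ assumed; the DAG and task names are hypothetical):

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():
    print("read source files")

def transform():
    print("filter and aggregate")

def load():
    print("load results into the analytical data store")

# A daily workflow: extract -> transform -> load.
with DAG(dag_id="daily_pipeline", start_date=datetime(2024, 1, 1),
         schedule="@daily", catchup=False) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="transform", python_callable=transform)
    t3 = PythonOperator(task_id="load", python_callable=load)
    t1 >> t2 >> t3
```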
A big data architecture such as this is typically used when you need to:
Store and process data in volumes too large for a traditional database.
Transform unstructured data for analysis and reporting.
Capture, process, and analyze unbounded streams of data in real time, or with low latency.
Use Azure Machine Learning or Microsoft Cognitive Services.
Characteristics of Big Data (The 5 V's):-
Volume:-
The name big data itself relates to enormous size. Big data refers to the vast volumes of data generated daily from many sources, such as business processes, machines, social media platforms, networks, human interactions, and many more.
Variety:-
Big data can be structured, unstructured, or semi-structured, collected from different sources. In the past, data was collected only from databases and spreadsheets, but these days data comes in many forms: PDFs, emails, audio, social media posts, photos, videos, etc.
Example: Web server logs, i.e., the log file is created and maintained by some server that contains a list of activities.
Veracity:-
Veracity refers to how reliable the data is. Because data arrives from many sources, it often has to be filtered or translated, and veracity also covers the ability to handle and manage such data efficiently.
For example, Facebook posts with hashtags.
Value:-
Value is an essential characteristic of big data. What matters is not merely the data that we process or store, but the valuable and reliable data that we store, process, and analyze.
Velocity:-
Velocity refers to the speed at which data is created, often in real time. It encompasses the rate of incoming data, the rate of change, and bursts of activity. A primary aspect of big data is supplying data rapidly on demand. Big data velocity deals with the speed at which data flows in from sources such as application logs, business processes, networks, social media sites, sensors, and mobile devices.
Big Data technology has four main components: data capture, data storage, data processing, and data visualization.
1. Data capture refers to the process of collecting data from a variety of sources. This can include everything from social media
posts to sensor readings.
2. Data storage is the process of storing this data in a way that makes it accessible for further analysis.
3. Data processing is where the real magic happens. This is where algorithms are used to analyze the data and extract insights.
4. And finally, data visualization is the process of representing this data in a way that is easy for humans to understand (a toy end-to-end sketch follows this list).
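A toy end-to-end sketch of the four components using pandas and matplotlib (assumed installed); the sensor readings are made up, and a real system would capture from live sources and store in HDFS, S3, or similar rather than in memory:

```python
import pandas as pd
import matplotlib.pyplot as plt

# 1. Capture: tiny made-up readings standing in for collected data.
readings = [("sensor_a", 21.5), ("sensor_b", 19.0),
            ("sensor_a", 22.5), ("sensor_b", 18.0)]

# 2. Store: load into a DataFrame (in-memory stand-in for real storage).
df = pd.DataFrame(readings, columns=["sensor", "temp_c"])

# 3. Process: aggregate to extract a simple insight.
avg = df.groupby("sensor")["temp_c"].mean()

# 4. Visualize: render the result for a human audience.
avg.plot(kind="bar", ylabel="avg temp (C)", title="Average by sensor")
plt.show()
```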
Challenges of Big Data:-
Storage:-
With vast amounts of data generated daily, the greatest challenge is storage (especially when the data is in different formats)
within legacy systems. Unstructured data cannot be stored in traditional databases.
Processing:-
Processing Big Data refers to the reading, transforming, extraction, and formatting of useful information from raw information.
The input and output of information in unified formats continue to present difficulties.
Security:-
Security is a big concern for organizations. Big data security is the collective term for all the measures and tools used to guard
both the data and analytics processes from attacks, theft, or other malicious activities that could harm or negatively affect them.
Non-encrypted information is at risk of theft or damage by cyber-criminals. Therefore, data security professionals must balance
access to data against maintaining strict security protocols.
Data compliance is the practice of ensuring that sensitive data is organized and managed in such a way as to enable organizations
to meet enterprise business rules along with legal and governmental regulations.
Organizations also face ethical and privacy challenges in handling big data. The following are five principles for using data responsibly:
1. Private customer data and identity should remain private: Privacy does not mean secrecy, as personal data might need to be audited based on legal requirements; however, private data obtained from a person with their consent should not be exposed for use by other businesses or individuals in ways traceable to their identity.
2. Shared private information should be treated confidentially: Sensitive data shared with third-party companies, whether medical, financial, or locational, needs restrictions on whether and how that information can be shared further.
3. Customers should have a transparent view of how their data is being used or sold, and the ability to manage the flow of their private information across massive, third-party analytical systems.
4. Big Data should not interfere with human will: Big data analytics can moderate and even determine who we are before we
make up our minds. Companies need to consider the kind of predictions and inferences that should be allowed and those that
should not.
5. Big data should not institutionalize unfair biases like racism or sexism. Machine learning algorithms can absorb unconscious
biases in a population and amplify them via training samples.
Big Data Analytics:-
Big data analytics is the use of advanced analytic techniques against very large, diverse big data sets that include structured, semi-structured, and unstructured data, from different sources, and in different sizes from terabytes to zettabytes.
Benefits of Big Data Analytics:-
Faster, better decision making: Businesses can access a large volume of data and analyze a large variety of data sources to gain new insights and take action. They can start small and scale to handle data from historical records and in real time.
Cost reduction and operational efficiency: Flexible data processing and storage tools can help organizations save costs in storing and analyzing large amounts of data, and discover patterns and insights that help them identify ways to do business more efficiently.
Improved data-driven go-to-market: Analyzing data from sensors, devices, video, logs, transactional applications, web and
social media empowers an organization to be data-driven. Gauge customer needs and potential risks and create new products
and services.
How Big Data Analytics Works:-
1. Data professionals collect data from a variety of different sources. Often, it is a mix of semi-structured and unstructured data; common sources include transactional systems, log files, social media content, and sensor data from IoT devices.
2. Data is prepared and processed. After data is collected and stored in a data warehouse or data lake, data professionals must
organize, configure and partition the data properly for analytical queries. Thorough data preparation and processing makes
for higher performance from analytical queries.
3. Data is cleansed to improve its quality. Data professionals scrub the data using scripting tools or data quality software. They look for any errors or inconsistencies, such as duplications or formatting mistakes, and organize and tidy up the data (see the cleansing sketch after this list).
4. The collected, processed and cleaned data is analyzed with analytics software. This includes tools for:
Data mining, which sifts through data sets in search of patterns and relationships
Predictive analytics, which builds models to forecast customer behavior and other future actions, scenarios and trends
Machine learning, which taps various algorithms to analyze large data sets
Deep learning, which is a more advanced offshoot of machine learning
Text mining and statistical analysis software
Artificial intelligence (AI)
Mainstream business intelligence software
Data visualization tools
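A minimal sketch of step 3, data cleansing, using pandas (assumed installed); the records, duplications, and formatting mistakes are made up for illustration:

```python
import pandas as pd

# Hypothetical messy records: duplicates, stray whitespace, a missing value.
df = pd.DataFrame({
    "customer": [" Asha ", "Ravi", "Ravi", "meena"],
    "spend":    [120.0,    80.0,  80.0,   None],
})

df["customer"] = df["customer"].str.strip().str.title()  # fix formatting
df = df.drop_duplicates()                                # remove duplications
df["spend"] = df["spend"].fillna(df["spend"].median())   # fill the gap

print(df)
```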
Reporting vs Analytics
1. Purpose: Reporting involves extracting data from different sources within an organization and monitoring it to gain an understanding of the performance of the various functions. By linking data from across functions, it helps create a cross-channel view that facilitates comparison and makes the data easier to understand. Analysis means interpreting data at a deeper level and providing recommendations on actions.
2. The Specifics: Reporting involves activities such as building, consolidating, organizing, configuring, formatting, and
summarizing. It requires clean, raw data and reports that may be generated periodically, such as daily, weekly, monthly,
quarterly, and yearly. Analytics includes asking questions, examining, comparing, interpreting, and confirming. Enriching
data with big data can help predict future trends as well.
3. The Final Output: In the case of reporting, outputs such as canned reports, dashboards, and alerts push information to users.
Through analysis, analysts try to extract answers using business queries and present them in the form of ad hoc responses,
insights, recommended actions, or a forecast. Understanding this key difference can help businesses leverage analytics better.
4. People: Reporting requires repetitive tasks that can be automated. It is often used by functional business heads who monitor
specific business metrics. Analytics requires customization and therefore depends on data analysts and scientists. Also, it is
used by business leaders to make data-driven decisions.
5. Value Proposition: Comparing the two is like comparing apples to oranges: reporting and analytics each serve a different purpose. By understanding those purposes and using them correctly, businesses can derive immense value from both.
Big Data Analytics Tools:-
1. R and Python
2. Microsoft Excel
3. Tableau
4. RapidMiner
5. KNIME
6. Power BI
7. Apache Spark
8. QlikView
9. Talend
10. Splunk
Types of Big Data Analytics:-
1. Descriptive Analysis: What is happening now, based on incoming data. Example: Google Analytics.
2. Predictive Analysis: What might happen in the future.
3. Prescriptive Analysis: What action should be taken. Example: Google's self-driving car.
4. Diagnostic Analysis: Why did it happen. Example: analyzing the impact of a campaign, advertisement, promotion, or target-based service.