Unit - 1
Processing big data is a critical step that converts vast data reservoirs into actionable insights. This
section delves into the methodologies and technologies that enable rapid and efficient big data
processing at scale.
1. Distributed computing
Distributed computing is the backbone of big data processing. It involves breaking down large data processing tasks into smaller, more manageable pieces that can be processed concurrently across a network of computers.
MapReduce: This programming model is fundamental for processing large data sets with a parallel, distributed algorithm on a cluster.
Beyond MapReduce: Newer models, such as Apache Spark's in-memory processing, allow for faster data processing cycles and are becoming the norm for big data tasks that require iterative processing.
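As a concrete illustration of the map/reduce model, the following is a minimal word-count sketch in PySpark; the input file name logs.txt and the local Spark session are assumptions made for illustration.

# Hedged PySpark sketch of the map/reduce model: count word occurrences.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

lines = sc.textFile("logs.txt")                      # assumed input file
counts = (lines.flatMap(lambda line: line.split())   # map: emit one record per word
               .map(lambda word: (word, 1))          # map: (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # reduce: sum counts per word

for word, count in counts.take(10):                  # inspect a few results
    print(word, count)

spark.stop()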
2. Stream processing
For real-time analytics, stream processing systems process data as it is produced or received.
Stream processing technologies: Tools like Apache Kafka and Apache Storm are designed for high-throughput, scalable stream processing and can handle continuous real-time data streams.
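As a hedged sketch of how such a stream might be consumed, the snippet below uses the kafka-python client; the topic name clickstream, the broker address, and the JSON message format are assumptions.

# Assumed Kafka topic and broker; messages are assumed to be JSON-encoded events.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "clickstream",
    bootstrap_servers="localhost:9092",
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:      # yields records continuously as they are produced
    event = message.value
    print(event)              # stand-in for real per-event processing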
3. Techniques for faster processing
Several techniques and practices can significantly improve big data processing speeds:
In-memory computing: by storing data in RAM instead of on disk, in-memory computing dramatically
reduces data access times, enabling real-time analytics and faster processing speeds.
Data sharding: This technique involves splitting a larger database into smaller, more manageable pieces, or shards, that can be processed in parallel (a minimal hash-sharding sketch follows this list).
Indexing: creating indexes allows for quicker retrieval of information from databases, thereby
accelerating query response times.
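The sketch below illustrates the hash-sharding idea referred to above: records are assigned to shards by hashing their key so that each shard can be stored and processed in parallel. The shard count and record layout are made up for illustration.

# Minimal hash-sharding sketch; NUM_SHARDS and the record fields are assumptions.
import hashlib

NUM_SHARDS = 4

def shard_for(key: str) -> int:
    # Map a record key to a shard number deterministically.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

shards = {i: [] for i in range(NUM_SHARDS)}
records = [{"user_id": f"user{i}", "amount": i * 10} for i in range(12)]

for record in records:
    shards[shard_for(record["user_id"])].append(record)

for shard_id, rows in shards.items():
    print(shard_id, len(rows))     # each shard now holds roughly a quarter of the records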
4. Machine learning and AI
Machine learning algorithms and AI play an increasingly important role in big data processing:
Automated data processing pipelines: machine learning can automate the creation of data processing
workflows, improving efficiency and reducing the likelihood of errors.
Predictive analytics: AI-driven predictive models process vast data sets to forecast future trends and
behaviors, providing organizations with valuable foresight.
5. The challenge of processing complexity
As data grows in complexity, so do the processing requirements. Techniques to manage this complexity
include:
Data partitioning: Organizing data into partitions can make it easier to manage and process (see the sketch after this list).
Optimized query execution: advanced query optimization techniques ensure that the most efficient
processing paths are used.
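The data-partitioning sketch referred to above, written with PySpark: data is written partitioned by a column so that queries touching only one partition read far less data. The paths and the event_date column are assumptions.

# Hedged sketch: write data partitioned by date, then query a single partition.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("PartitionDemo").getOrCreate()

events = spark.read.json("raw/events/")          # assumed raw input location

(events.write
       .mode("overwrite")
       .partitionBy("event_date")                # one directory per date value
       .parquet("curated/events/"))

# A filter on the partition column lets Spark prune all untouched partitions.
spark.read.parquet("curated/events/") \
     .filter("event_date = '2024-01-01'") \
     .count()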
The speed and efficiency of big data processing are not just operational concerns; they are strategic
imperatives that can differentiate and define an organization's success. Ineffective techniques or a less
optimized algorithm might lead to a higher demand for infrastructure resources and significant cost
spikes. To manage and process this deluge of data efficiently, a variety of data engineering tools are
utilized. These tools are tailored to meet the specific needs of the infrastructure, whether for batch
processing, stream processing, data integration, or constructing data pipelines.
Benefits of Big Data Infrastructure
a) Data management: It provides a robust framework for efficient data management, ensuring seamless
storage, retrieval, and processing of vast volumes of information.
b) Real-time insights: With distributed processing capabilities, the infrastructure enables organisations
to analyse data in real time, leading to faster decision-making and swift responses to changing trends.
c) Enhanced customer experiences: Leveraging Data Analytics through Big Data Infrastructure helps
businesses gain valuable insights into customer preferences and behaviours. This allows them to deliver
personalised and tailored experiences.
d) Predictive analytics: It also empowers businesses to employ predictive analytics, anticipating future
trends and potential challenges, thus enabling proactive decision-making and risk mitigation.
e) Competitive advantage: By harnessing the power of its infrastructure, companies can gain a
competitive edge. It is because they make data-driven decisions that optimise operations, identify new
opportunities, and stay ahead in the market.
f) Innovation and research: Researchers and scientists benefit from Big Data Infrastructure to process and analyse large datasets, facilitating groundbreaking discoveries and advancements across various fields.
g) Resource scalability: It offers scalable resources, allowing organisations to adjust storage and
processing power as per changing data demands, ensuring cost-effectiveness.
h) Data security: Implementing proper security measures within Big Data Infrastructure safeguards
sensitive data and protects against potential cyber threats, ensuring compliance with data regulations.
i) Business growth: The ability to harness and utilise data effectively through Big Data Infrastructure
directly contributes to business growth. It helps drive efficiency, productivity, and revenue generation.
j) Data-driven decision making: It enables data-driven decision-making processes and reduces reliance
on intuition and gut feelings. This also fosters a culture of data-centricity within organisations.
Challenges with Big Data
Technical challenges:
Quality of data:
Collecting and storing large amounts of data comes at a cost, yet big companies, business leaders, and IT leaders often want ever larger data stores.
For better results and conclusions, big data should focus on storing quality, relevant data rather than irrelevant data.
This raises further questions: how can it be ensured that the data is relevant, how much data is enough for decision making, and whether the stored data is accurate.
Fault tolerance:
Fault tolerance is another technical challenge; fault-tolerant computing is extremely hard and involves intricate algorithms.
Newer technologies such as cloud computing and big data platforms are designed so that when a failure occurs, the damage stays within an acceptable threshold and the whole task does not have to restart from scratch.
Scalability:
Big data projects can grow and evolve rapidly, and the scalability issue of big data has driven the move towards cloud computing.
This raises challenges such as how to run and schedule the various jobs so that the goal of each workload is achieved cost-effectively.
It also requires dealing with system failures in an efficient manner, which again raises the question of what kinds of storage devices should be used.
Real-world examples of successful Big Data Infrastructures abound, with prominent companies and
organisations leveraging these cutting-edge technologies to revolutionise their operations and deliver
outstanding results. Here are some compelling examples:
a) Netflix: The entertainment giant employs a robust Big Data Infrastructure to analyse user
preferences, viewing habits, and content interactions. This enables Netflix to provide highly personalised
recommendations, enhancing user satisfaction and retention.
b) Amazon: As one of the world's largest e-commerce platforms, Amazon utilises Big Data Infrastructure
to optimise its supply chain, forecast demand, and manage inventory efficiently. This data-driven
approach ensures timely deliveries and exceptional customer experiences.
c) Google: Google’s search engine handles an immense amount of data each day. By employing
sophisticated Big Data Infrastructure, Google can process user queries quickly and accurately, delivering
relevant search results in real time.
d) Facebook: With billions of active users, Facebook relies on Big Data Infrastructure to analyse vast
amounts of user-generated content, identify trends, and target advertisements with precision, resulting
in a highly profitable advertising model.
e) NASA: The space agency employs Big Data Infrastructure to process and analyse complex data from
satellites, telescopes, and rovers. This data analysis has led to a deeper understanding of the universe
and groundbreaking scientific discoveries.
f) Uber: Uber relies on Big Data Infrastructure to optimise ride-sharing routes, predict rider demand, and
dynamically adjust pricing. This data-driven approach enhances efficiency and reliability for both drivers
and riders.
g) Walmart: This retail giant utilises Big Data Infrastructure to analyse customer purchase patterns, optimise inventory management, and offer personalised promotions, ensuring a seamless shopping experience for customers.
Big Data Architecture
A big data architecture is designed to handle the ingestion, processing, and analysis of data that is too
large or complex for traditional database systems. The threshold at which organizations enter into the
big data realm differs, depending on the capabilities of the users and their tools. For some, it can mean
hundreds of gigabytes of data, while for others it means hundreds of terabytes. As tools for working
with big datasets advance, so does the meaning of big data. More and more, this term relates to the
value you can extract from your data sets through advanced analytics, rather than strictly the size of the
data, although in these cases they tend to be quite large.
Big data solutions typically involve one or more of the following types of workload:
Store and process data in volumes too large for a traditional database.
Transform unstructured data for analysis and reporting.
Capture, process, and analyze unbounded streams of data in real time, or with low latency.
Most big data architectures include some or all of the following components:
Data sources. All big data solutions start with one or more data sources. Examples include application data stores (such as relational databases), static files produced by applications (such as web server log files), and real-time data sources (such as IoT devices).
Data storage. Data for batch processing operations is typically stored in a distributed file store
that can hold high volumes of large files in various formats. This kind of store is often called a
data lake. Options for implementing this storage include Azure Data Lake Store or blob
containers in Azure Storage.
Batch processing. Because the data sets are so large, often a big data solution must process data
files using long-running batch jobs to filter, aggregate, and otherwise prepare the data for
analysis. Usually these jobs involve reading source files, processing them, and writing the output
to new files. Options include running U-SQL jobs in Azure Data Lake Analytics, using Hive, Pig, or
custom Map/Reduce jobs in an HDInsight Hadoop cluster, or using Java, Scala, or Python
programs in an HDInsight Spark cluster.
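A hedged sketch of such a batch job in a Spark program (one of the options listed above): read source files, filter and aggregate them, and write the prepared output to new files. The paths, column names, and CSV layout are assumptions.

# Minimal PySpark batch job: read, filter, aggregate, write.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("DailySalesBatch").getOrCreate()

sales = spark.read.option("header", True).csv("landing/sales/*.csv")

prepared = (sales
    .filter(F.col("status") == "COMPLETED")             # drop irrelevant rows
    .groupBy("store_id", "sale_date")                   # aggregate per store and day
    .agg(F.sum("amount").alias("total_amount"),
         F.count("*").alias("order_count")))

prepared.write.mode("overwrite").parquet("curated/daily_sales/")
spark.stop()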
Real-time message ingestion. If the solution includes real-time sources, the architecture must
include a way to capture and store real-time messages for stream processing. This might be a
simple data store, where incoming messages are dropped into a folder for processing. However,
many solutions need a message ingestion store to act as a buffer for messages, and to support
scale-out processing, reliable delivery, and other message queuing semantics. This portion of a
streaming architecture is often referred to as stream buffering. Options include Azure Event
Hubs, Azure IoT Hub, and Kafka.
Stream processing. After capturing real-time messages, the solution must process them by
filtering, aggregating, and otherwise preparing the data for analysis. The processed stream data
is then written to an output sink. Azure Stream Analytics provides a managed stream processing
service based on perpetually running SQL queries that operate on unbounded streams. You can
also use open source Apache streaming technologies like Spark Streaming in an HDInsight
cluster.
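As a hedged sketch of this step using open source Spark Structured Streaming (one of the options mentioned above): messages are read from a Kafka topic, aggregated over one-minute windows, and written to an output sink. The topic, broker address, and console sink are assumptions, and running it requires the spark-sql-kafka connector package.

# Minimal Spark Structured Streaming sketch: Kafka in, windowed counts out.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("StreamDemo").getOrCreate()

raw = (spark.readStream
            .format("kafka")
            .option("kafka.bootstrap.servers", "localhost:9092")
            .option("subscribe", "clickstream")
            .load())

# Kafka delivers binary key/value columns plus a timestamp; cast value to text.
events = raw.selectExpr("CAST(value AS STRING) AS page", "timestamp")

counts = (events
    .withWatermark("timestamp", "5 minutes")
    .groupBy(F.window("timestamp", "1 minute"), "page")
    .count())

query = (counts.writeStream
               .outputMode("update")
               .format("console")        # stand-in for a real output sink
               .start())
query.awaitTermination()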
Machine learning. Reading the prepared data for analysis (from batch or stream processing),
machine learning algorithms can be used to build models that can predict outcomes or classify
data. These models can be trained on large datasets, and the resulting models can be used to
analyze new data and make predictions. This can be done using Azure Machine Learning.
Analytical data store. Many big data solutions prepare data for analysis and then serve the
processed data in a structured format that can be queried using analytical tools. The analytical
data store used to serve these queries can be a Kimball-style relational data warehouse, as seen
in most traditional business intelligence (BI) solutions. Alternatively, the data could be presented
through a low-latency NoSQL technology such as HBase, or an interactive Hive database that
provides a metadata abstraction over data files in the distributed data store. Azure Synapse
Analytics provides a managed service for large-scale, cloud-based data warehousing. HDInsight
supports Interactive Hive, HBase, and Spark SQL, which can also be used to serve data for
analysis.
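A hedged sketch of serving prepared data through Spark SQL so that it can be queried with ordinary SQL, as described above; the database, table, and path names are assumptions.

# Minimal serving-layer sketch: persist prepared data as a table and query it.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("ServingLayer")
         .enableHiveSupport()            # use the Hive metastore if one is available
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS analytics")

prepared = spark.read.parquet("curated/daily_sales/")
prepared.write.mode("overwrite").saveAsTable("analytics.daily_sales")

spark.sql("""
    SELECT store_id, SUM(total_amount) AS revenue
    FROM analytics.daily_sales
    GROUP BY store_id
    ORDER BY revenue DESC
    LIMIT 10
""").show()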
Analysis and reporting. The goal of most big data solutions is to provide insights into the data
through analysis and reporting. To empower users to analyze the data, the architecture may
include a data modeling layer, such as a multidimensional OLAP cube or tabular data model in
Azure Analysis Services. It might also support self-service BI, using the modeling and
visualization technologies in Microsoft Power BI or Microsoft Excel. Analysis and reporting can
also take the form of interactive data exploration by data scientists or data analysts. For these
scenarios, many Azure services support analytical notebooks, such as Jupyter, enabling these
users to leverage their existing skills with Python or R. For large-scale data exploration, you can
use Microsoft R Server, either standalone or with Spark.
Orchestration. Most big data solutions consist of repeated data processing operations,
encapsulated in workflows, that transform source data, move data between multiple sources
and sinks, load the processed data into an analytical data store, or push the results straight to a
report or dashboard. To automate these workflows, you can use an orchestration technology
such as Azure Data Factory or Apache Oozie and Sqoop.
Lambda Architecture
Batch Layer: The batch layer of the lambda architecture saves incoming data in its entirety and uses it to pre-compute the batch views from which indexes are prepared. The master dataset is immutable and append-only: the original data is never changed, only copies of it are created and preserved, which ensures consistency. Managing the master dataset and pre-computing the batch views are handled in this way.
Speed Layer: The speed layer handles the most recent data that the batch layer has not yet processed, computing incremental, real-time views as the data arrives. It keeps latency low by performing only incremental computations instead of recomputing everything, and any approximation it introduces is later corrected when the batch layer recomputes the complete views.
Serving Layer: The batch views from the batch layer and the real-time results from the speed layer flow into the serving layer. The serving layer indexes the views and parallelizes them to ensure users' queries are answered quickly and without delays.
Kappa Architecture
Like the Lambda architecture, the Kappa architecture is intended to handle both real-time streaming and batch-style processing of data. In addition to avoiding the extra cost that comes from maintaining two processing paths in the Lambda architecture, the Kappa architecture replaces the data sourcing medium with message queues.
The messaging engine stores a sequence of data which is read, converted into an appropriate format, and then saved in analytical databases for the end user.
The architecture makes it easy to access real-time information by reading and transforming the message data into a format that is easily accessible to end users. It also provides additional outputs by allowing previously saved data to be taken into account (replayed).
The batch layer is eliminated in the Kappa architecture, and the speed layer is enhanced with reprocessing (replay) capabilities. The key difference with the Kappa architecture is that all the data is presented as a series or stream. Data transformation is achieved through the stream processing engine, which is the central engine for data processing.
Big Data Tools and Techniques
Based on how they are used in practice, big data tools can be classified into the four broad categories discussed below.
A distributed database is a set of data storage chunks distributed over a network of computers. Data centres may have their own processing units for distributed databases. The distributed databases may be physically located in the same place or dispersed over an interconnected network of computers. A distributed database can be homogeneous (having the same software and hardware across all instances) or heterogeneous (having a variety of software and hardware, with different sites supported by distinct hardware).
The leading big data processing and distribution platforms are Hadoop HDFS, Snowflake, Qubole,
Apache Spark, Azure HDInsight, Azure Data Lake, Amazon EMR, Google BigQuery, Google Cloud
Dataflow, MS SQL.
The purpose of working with big data is to process it, and Hadoop is one of the most effective solutions for this purpose. It is used to store, process, and analyse huge amounts of data.
Loading and processing tools are required to handle this vast amount of data.
The MapReduce programming model facilitates the efficient formatting and organisation of huge quantities of data into precise sets by compiling and organising the data sets.
Hadoop is an open-source software project from the Apache Foundation for the distributed storage and processing of large data sets. Spark, another open-source project from the Apache Foundation, speeds up such computations by processing data in memory.
Cloud Computing Tools refers to network-based computing services that are delivered over the Internet. A shared pool of configurable computing resources, available at any time and from anywhere, is shared by all network-based services. The service is provided by the service provider and paid for as it is used. The platform is very useful for handling large amounts of data.
Amazon Web Services (AWS) is the most popular cloud computing platform, followed by Microsoft Azure (including services such as Blob Storage), Google Cloud, and Databricks. Oracle, IBM, and Alibaba also offer popular cloud platforms.
An important aspect of a big data architecture is how it uses and applies big data applications. In particular, a big data architecture supports the following:
Because of its data ingestion procedure and its data lake storage, the architecture allows sensitive data to be deleted right at the beginning.
A big data architecture ingests data both in batch and in real time. Batch processing runs on a recurring schedule. The ingestion process and the job scheduling for batch data are simplified because the data files can be partitioned, and partitioning the tables also improves query performance. Hive, U-SQL, or SQL queries are used to partition the table data.
Distributed batch files can be split further to exploit parallelism and reduce job time; another application is to disperse the workload across processing units. The static batch files are created and saved in formats that can be split further. The Hadoop Distributed File System (HDFS) can cluster hundreds of nodes and process the files in parallel, eventually decreasing job times.
High-performance parallel computing: Big data architectures employ parallel computing, in which
multiprocessor servers perform lots of calculations at the same time to speed up the process. Large data
sets can be processed quickly by parallelising them on multiprocessor servers. Part of the job can be
handled simultaneously.
Elastic scalability: Big Data architectures can be scaled horizontally, allowing the environment to be
tuned to the size of the workloads. A big data solution is usually operated in the cloud, where you only
have to pay for the storage and processing resources you actually utilise.
Freedom of choice: Big Data architectures may use various platforms and solutions in the marketplace,
such as Azure-managed services, MongoDB Atlas, and Apache technologies. You can pick the right
combination of solutions for your specific workloads, existing systems, and IT expertise levels to achieve
the best result.
The ability to interoperate with other systems: You can use Big Data architecture components for IoT
processing and BI as well as analytics workflows to create integrated platforms across different types of
workloads.
Security: For static big data, the data lake is the norm, and robust security is required to safeguard this data from intrusion and theft. In addition, setting up secure access can be difficult, because other applications must also consume the data in order to function.
Complexity: A Big Data architecture typically consists of many interlocking moving parts. These components may have their own data-ingestion pipelines and various configuration settings to tune for performance, in addition to many cross-component configuration interventions. Big Data procedures are demanding in nature and require a lot of knowledge and skill.
Evolving technologies: Choosing the right solutions and components is critical to meeting Big Data
business objectives. It can be challenging to determine which Big Data technologies, practices, and
standards are still in the midst of a period of advancement, as many of them are relatively new and still
evolving. Core Hadoop components such as Hive and Pig have reached a stable stage, but other
technologies and services are still in development and are likely to change over time.
Expertise in a specific domain: Although Big Data APIs built on mainstream languages are gradually becoming popular, Big Data architectures and solutions still generally use atypical, highly specialised languages and frameworks that impose a substantial learning curve for developers and data analysts alike.
Data Warehouse
A data warehouse is an enterprise system used for the analysis and reporting of structured and semi-
structured data from multiple sources, such as point-of-sale transactions, marketing automation,
customer relationship management, and more. A data warehouse is suited for ad hoc analysis as well as custom reporting. A data warehouse can store both current and historical data in one place and is
designed to give a long-range view of data over time, making it a primary component of business
intelligence.
Issues
When and how to gather data: In a source-driven architecture for gathering data, the data sources
transmit new information, either continually (as transaction processing takes place), or periodically
(nightly, for example). In a destination-driven architecture, the data warehouse periodically sends
requests for new data to the sources. Unless updates at the sources are replicated at the warehouse via
two-phase commit, the warehouse will never be quite up to date with the sources. Two-phase commit is
usually far too expensive to be an option, so data warehouses typically have slightly out-of-date data.
That, however, is usually not a problem for decision-support systems.
What schema to use: Data sources that have been constructed independently are likely to have
different schemas. In fact, they may even use different data models. Part of the task of a warehouse is to
perform schema integration, and to convert data to the integrated schema before they are stored. As a
result, the data stored in the warehouse are not just a copy of the data at the sources. Instead, they can
be thought of as a materialized view of the data at the sources.
Data transformation and cleansing: The task of correcting and preprocessing data is called data
cleansing. Data sources often deliver data with numerous minor inconsistencies, which can be
corrected. For example, names are often misspelled, and addresses may have street, area, or city names
misspelled, or postal codes entered incorrectly. These can be corrected to a reasonable extent by
consulting a database of street names and postal codes in each city. The approximate matching of data
required for this task is referred to as fuzzy lookup.
How to propagate update: Updates on relations at the data sources must be propagated to the data
warehouse. If the relations at the data warehouse are exactly the same as those at the data source, the
propagation is straightforward. If they are not, the problem of propagating updates is basically the view-
maintenance problem.
What data to summarize: The raw data generated by a transaction-processing system may be too large
to store online. However, we can answer many queries by maintaining just summary data obtained by
aggregation on a relation, rather than maintaining the entire relation. For example, instead of storing
data about every sale of clothing, we can store total sales of clothing by item name and category.
Benefits
Better business analytics: A data warehouse plays an important role in every business by storing and enabling analysis of all the company's past data and records, which further improves the company's understanding and analysis of its data.
Faster queries: The data warehouse is designed to handle large queries, which is why it runs queries faster than an operational database.
Improved data quality: The data gathered from different sources is stored and analysed in the warehouse; the warehouse does not interfere with or add data by itself, so data quality is maintained, and if any data-quality issue arises, the data warehouse team resolves it.
Historical insight: The warehouse stores all historical data, which contains details about the business, so that it can be analysed at any time and insights extracted from it.
Attributes of Data Warehouse
Types of Databases
Data Warehousing
A Database Management System (DBMS) stores data in the form of tables, uses an ER model, and aims to guarantee the ACID properties. For example, a DBMS of a college has tables for students, faculty, etc.
A Data Warehouse is separate from a DBMS; it stores a huge amount of data, which is typically collected from multiple heterogeneous sources like files, DBMSs, etc. The goal is to produce statistical results that may help in decision-making. For example, a college might want to quickly see different results, like how the placement of CS students has improved over the last 10 years in terms of salaries, counts, etc.
An ordinary database can store MBs to GBs of data, and that too for a specific purpose. For storing data of TB size, storage shifted to the data warehouse. Besides this, a transactional database doesn't lend itself to analytics. To perform analytics effectively, an organization keeps a central Data Warehouse to closely study its business by organizing, understanding, and using its historical data for making strategic decisions and analyzing trends.
Database vs. Data Warehouse
A database generally stores current, up-to-date data that is used for daily operations, whereas a data warehouse maintains historical data over time. Historical data is data kept over years and can be used for trend analysis, making future predictions, and decision support.
A data warehouse is generally integrated at the organization level, by combining data from different databases.
Example: A data warehouse integrates the data from one or more databases so that analysis can be done to get results, such as the best-performing school in a city.
Data Warehousing can be applied anywhere where we have a huge amount of data and we want to see
statistical results that help in decision making.
Social Media Websites: Social networking websites like Facebook, Twitter, LinkedIn, etc. are based on analyzing large data sets. These sites gather data related to members, groups, locations, etc., and store it in a single central repository. Because the amount of data is so large, a data warehouse is needed to implement this.
Banking: Most of the banks these days use warehouses to see the spending patterns of
account/cardholders. They use this to provide them with special offers, deals, etc.
Government: Government uses a data warehouse to store and analyze tax payments which are used to
detect tax thefts.
Features
Centralized Data Repository: Data warehousing provides a centralized repository for all enterprise data
from various sources, such as transactional databases, operational systems, and external sources. This
enables organizations to have a comprehensive view of their data, which can help in making informed
business decisions.
Data Integration: Data warehousing integrates data from different sources into a single, unified view,
which can help in eliminating data silos and reducing data inconsistencies.
Historical Data Storage: Data warehousing stores historical data, which enables organizations to analyze
data trends over time. This can help in identifying patterns and anomalies in the data, which can be used
to improve business performance.
Query and Analysis: Data warehousing provides powerful query and analysis capabilities that enable
users to explore and analyze data in different ways. This can help in identifying patterns and trends, and
can also help in making informed business decisions.
Data Transformation: Data warehousing includes a process of data transformation, which involves
cleaning, filtering, and formatting data from various sources to make it consistent and usable. This can
help in improving data quality and reducing data inconsistencies.
Data Mining: Data warehousing provides data mining capabilities, which enable organizations to
discover hidden patterns and relationships in their data. This can help in identifying new opportunities,
predicting future trends, and mitigating risks.
Data Security: Data warehousing provides robust data security features, such as access controls, data
encryption, and data backups, which ensure that the data is secure and protected from unauthorized
access.
Advantages
Intelligent Decision-Making: With centralized data in warehouses, decisions may be made more quickly
and intelligently.
Historical Analysis: Predictions and trend analysis are made easier by storing past data.
Data Quality: Guarantees data quality and consistency for trustworthy reporting.
Scalability: Capable of managing massive data volumes and expanding to meet changing requirements.
Effective Queries: Fast and effective data retrieval is made possible by an optimized structure.
Cost reductions: Data warehousing can result in cost savings over time by reducing data management
procedures and increasing overall efficiency, even when there are setup costs initially.
Data security: Data warehouses employ security protocols to safeguard confidential information,
guaranteeing that only authorized personnel are granted access to certain data.
Disadvantages
Complexity: Data warehousing can be complex, and businesses may need to hire specialized personnel
to manage the system.
Time-consuming: Building a data warehouse can take a significant amount of time, requiring businesses
to be patient and committed to the process.
Data integration challenges: Data from different sources can be challenging to integrate, requiring
significant effort to ensure consistency and accuracy.
Data security: Data warehousing can pose data security risks, and businesses must take measures to
protect sensitive data from unauthorized access or breaches.
Shared Nothing Architecture
Shared Nothing Architecture is a distributed computing approach where each node in a system operates
independently and has its own dedicated resources, such as processors, memory, and storage. In this
architecture, there is no shared memory or shared disk among the nodes. Instead, each node
communicates with others through a network.
This decentralized approach allows for horizontal scalability, fault tolerance, and high performance. As
the system grows, additional nodes can be added to handle the increasing workload.
In Shared Nothing Architecture, the data is partitioned and distributed across multiple nodes in the
system. Each node is responsible for processing a subset of the data and executing specific tasks
independently.
When a query or request is made to the system, it is broken down into smaller tasks that can be
processed by different nodes simultaneously. Each node operates on its subset of data and produces
results that are then combined to provide the final output.
The communication between nodes is typically achieved using message passing protocols over a
network. Nodes exchange data and coordinate their actions to complete the desired computation.
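The toy sketch below simulates this scatter-gather pattern with Python processes standing in for nodes: each "node" owns its own partition of the data (nothing is shared), computes a partial result, and a coordinator combines the partials. The data and worker count are made up.

# Toy shared-nothing simulation: per-node aggregation plus a coordinator merge.
from multiprocessing import Pool

PARTITIONS = [                                   # each node's private data slice
    [("electronics", 120), ("books", 40)],
    [("electronics", 75), ("toys", 20)],
    [("books", 60), ("toys", 15)],
]

def node_aggregate(partition):
    # Runs on one node: sum sales per category for its local partition only.
    local = {}
    for category, amount in partition:
        local[category] = local.get(category, 0) + amount
    return local

def combine(partials):
    # Coordinator step: merge the partial results returned by each node.
    total = {}
    for partial in partials:
        for category, amount in partial.items():
            total[category] = total.get(category, 0) + amount
    return total

if __name__ == "__main__":
    with Pool(processes=len(PARTITIONS)) as pool:
        partials = pool.map(node_aggregate, PARTITIONS)
    print(combine(partials))                     # e.g. {'electronics': 195, 'books': 100, 'toys': 35}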
Importance
Shared Nothing Architecture offers several benefits that make it important for businesses:
Scalability: Shared Nothing Architecture allows systems to scale horizontally by adding more nodes to
handle increased workloads. This scalability ensures that the system can accommodate growing
amounts of data and user requests.
Fault Tolerance: Since each node operates independently and has its own resources, failures in one
node do not impact the entire system. The failure of a single node does not lead to the loss of data or
service disruption.
High Performance: By distributing the workload across multiple nodes, Shared Nothing Architecture
enables parallel processing. This parallelism allows for faster data processing and analysis, leading to
improved performance and reduced response times.
Flexibility: Shared Nothing Architecture allows for flexibility in hardware and software choices. Each
node can have different configurations, which enables organizations to choose the best hardware and
software combinations that suit their needs and budget.
Use Cases
Distributed Databases: Shared Nothing Architecture is often used in distributed database systems to
achieve scalability, fault tolerance, and high availability. Each node can store and process a portion of
the data independently.
Data Warehousing: Shared Nothing Architecture can be employed in data warehousing systems to
handle complex queries and provide fast access to large amounts of data. The distributed nature of the
architecture allows for efficient data retrieval and analysis.
Shared Everything Architecture
In Shared Everything Architecture, all data is stored in a shared storage infrastructure that can be
accessed by multiple processing units simultaneously. The processing units, such as CPUs or GPUs, are
connected to the shared storage via a high-speed network. This architecture enables efficient parallel
data processing, as each processing unit can access and process data in parallel, eliminating data
bottlenecks.
Importance
Centralized Data Management: By centralizing data storage and processing, Shared Everything
Architecture simplifies data management and eliminates data duplication.
Improved Performance: With parallel data processing and direct access to shared storage, Shared
Everything Architecture offers improved performance and reduced latency.
Scalability: Shared Everything Architecture allows for easy scalability by adding more processing units as
needed.
Data Consistency: Since all data is stored centrally, data consistency is ensured across all processing
units.
Use Cases
Big Data Processing: Shared Everything Architecture provides a scalable and efficient solution for
processing large volumes of big data.
Data Analytics: This architecture enables efficient data analytics by providing a unified view of data
across multiple sources.
Real-time Data Processing: Shared Everything Architecture allows for real-time data processing and
analysis, enabling organizations to make timely and informed decisions.
Data Warehousing: With its centralized data storage and processing capabilities, Shared Everything
Architecture is well-suited for data warehousing applications.
Shared Nothing Architecture vs. Shared Disk Architecture
Disks: In Shared Nothing Architecture, each disk belongs to an individual node and cannot be shared. In Shared Disk Architecture, the disks are shared by the active nodes, which helps in case of failures.
Pros of Shared Nothing Architecture: it is easy to scale, reduces single points of failure, makes upgrades easier, and avoids downtime.
Pros of Shared Disk Architecture: it can scale up to a fair number of CPUs; each processor possesses its own memory, so the memory bus is not an obstruction; and it offers fault tolerance, as the database is stored on disks that remain accessible to all nodes.
Cons of Shared Nothing Architecture: performance may deteriorate, and it is expensive.
Cons of Shared Disk Architecture: scalability is limited because the disk subsystem's interconnection becomes the bottleneck; CPU-to-CPU communication is slower because it passes through a communication network; and adding more CPUs slows down the existing CPUs because of increased competition for memory access and network bandwidth.
Data Re-engineering
Data Re-engineering involves the restructuring and reorganizing of data from various sources, formats,
and systems to create an optimized and consolidated dataset. This process often includes data
extraction, cleansing, transformation, and integration to ensure data quality and consistency.
Working
Data Discovery: Identify and understand the data sources, formats, and systems that need to be re-
engineered.
Data Extraction: Extract data from different sources, such as databases, files, APIs, and external systems.
Data Cleansing: Remove or fix any errors, inconsistencies, duplicates, or incomplete data (a minimal sketch of these steps follows this list).
Data Transformation: Convert data into a unified format and structure that is suitable for analysis and
processing.
Data Integration: Combine data from multiple sources into a single dataset, ensuring data consistency
and integrity.
Data Storage: Store the re-engineered data in a data lake, data warehouse, or a data lakehouse
environment.
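A minimal sketch of the cleansing, transformation, integration, and storage steps above, written with pandas; the file names, column names, and the specific fixes applied are assumptions made for illustration.

# Hedged data re-engineering sketch: extract, cleanse, transform, integrate, store.
import pandas as pd

# Extraction: pull data from two assumed source files.
crm = pd.read_csv("crm_customers.csv")
billing = pd.read_csv("billing_export.csv")

# Cleansing: drop duplicates and rows missing the join key, normalise text fields.
crm = crm.drop_duplicates(subset="customer_id").dropna(subset=["customer_id"])
crm["email"] = crm["email"].str.strip().str.lower()

# Transformation: unify formats (dates, numeric amounts) across sources.
billing["invoice_date"] = pd.to_datetime(billing["invoice_date"], errors="coerce")
billing["amount"] = pd.to_numeric(billing["amount"], errors="coerce").fillna(0)

# Integration: combine the sources into a single consolidated dataset.
customers = crm.merge(billing, on="customer_id", how="left")

# Storage: persist the re-engineered dataset (a Parquet file standing in for a lake or warehouse).
customers.to_parquet("reengineered_customers.parquet", index=False)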
Importance
Improved Data Quality: By cleansing and transforming data, Data Re-engineering ensures the accuracy,
completeness, and reliability of the dataset.
Enhanced Data Processing Efficiency: Optimizing data structures and formats enables faster data
processing, reducing the time required for analysis and insights generation.
Streamlined Data Integration: Data Re-engineering simplifies the integration of data from different
sources, allowing organizations to have a unified view of their data.
Scalability: Re-engineered data can be easily scaled to handle large volumes of data, ensuring seamless
data processing even as data grows.
Enabling Advanced Analytics: By preparing and organizing data, Data Re-engineering facilitates the
application of advanced analytics techniques such as machine learning, predictive modeling, and data
mining.
Spark Components
The Spark project consists of different types of tightly integrated components. At its core, Spark is a
computational engine that can schedule, distribute and monitor multiple applications.
Spark Core
The Spark Core is the heart of Spark and performs the core functionality.
It holds the components for task scheduling, fault recovery, interacting with storage systems
and memory management.
Spark SQL
The Spark SQL is built on the top of Spark Core. It provides support for structured data.
It allows querying the data via SQL (Structured Query Language) as well as via the Apache Hive variant of SQL, called HQL (Hive Query Language).
It supports JDBC and ODBC connections that establish a relation between Java objects and
existing databases, data warehouses and business intelligence tools.
It also supports various sources of data like Hive tables, Parquet, and JSON.
Spark Streaming
Spark Streaming is a Spark component that supports scalable and fault-tolerant processing of
streaming data.
It uses Spark Core's fast scheduling capability to perform streaming analytics.
It accepts data in mini-batches and performs RDD transformations on that data.
Its design ensures that the applications written for streaming data can be reused to analyze
batches of historical data with little modification.
The log files generated by web servers can be considered as a real-time example of a data
stream.
MLlib
The MLlib is a Machine Learning library that contains various machine learning algorithms.
These include correlations and hypothesis testing, classification and regression, clustering,
and principal component analysis.
It is nine times faster than the disk-based implementation used by Apache Mahout.
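As a hedged illustration, the sketch below runs one of these algorithms (k-means clustering) with MLlib's DataFrame-based API; the tiny in-memory feature values are made up.

# Minimal MLlib sketch: k-means clustering on a toy dataset.
from pyspark.sql import SparkSession
from pyspark.ml.linalg import Vectors
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("MLlibDemo").getOrCreate()

data = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),),
     (Vectors.dense([1.0, 1.0]),),
     (Vectors.dense([9.0, 8.0]),),
     (Vectors.dense([8.0, 9.0]),)],
    ["features"])

model = KMeans(k=2, seed=1).fit(data)
print(model.clusterCenters())            # the two learned cluster centres
model.transform(data).show()             # adds a "prediction" column per row

spark.stop()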
GraphX
The GraphX is a library that is used to manipulate graphs and perform graph-parallel
computations.
It facilitates creating a directed graph with arbitrary properties attached to each vertex and edge.
To manipulate graphs, it supports various fundamental operators like subgraph, joinVertices, and aggregateMessages.
Spark Environment:
Local: The Spark cluster mode is available immediately upon running the shell. Simply run sc and you will get the context information.
Local mode is the default mode and does not require any resource manager. When we start the spark-shell command, it is already up and running. Local mode is also good for testing purposes and quick setup scenarios, and by default the number of partitions equals the number of CPU cores on the local machine.
By default the spark-shell executes in local mode, and you can pass the master argument with the local attribute to specify how many threads the Spark application should run with. Spark is optimised for parallel computation, but Spark in local mode runs with a single thread unless told otherwise. By passing the number of CPU cores to the local attribute, you can execute multi-threaded computation.
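A hedged sketch of starting Spark in local mode from a Python program rather than the shell; "local[4]" asks for four worker threads, and "local[*]" would use all available cores. The application name is an assumption.

# Minimal local-mode sketch (roughly equivalent to: pyspark --master local[4]).
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("local[4]")             # local mode with 4 threads
         .appName("LocalModeDemo")
         .getOrCreate())

print(spark.sparkContext.master)               # -> local[4]
print(spark.sparkContext.defaultParallelism)   # parallelism available locally
spark.stop()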
RDD
The RDD (Resilient Distributed Dataset) is the Spark's core abstraction. It is a collection of elements,
partitioned across the nodes of the cluster so that we can execute various parallel operations on it.
Parallelized Collections
To create a parallelized collection, call SparkContext's parallelize method on an existing collection in the driver program. Each element of the collection is copied to form a distributed dataset that can be operated on in parallel.
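A minimal sketch, assuming an existing SparkContext named sc (as in the spark-shell or pyspark shell):

# Create a parallelized collection and run parallel operations on it.
data = [1, 2, 3, 4, 5]
dist = sc.parallelize(data, 4)             # distribute the list into 4 partitions

print(dist.map(lambda x: x * x).sum())     # parallel map + sum -> 55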
External Datasets
In Spark, the distributed datasets can be created from any type of storage sources supported by Hadoop
such as HDFS, Cassandra, HBase and even our local file system. Spark provides the support for text files,
SequenceFiles, and other types of Hadoop InputFormat.
SparkContext's textFile method can be used to create an RDD from a text file. This method takes a URI for the file (either a local path on the machine or an hdfs:// path) and reads the data of the file as a collection of lines.
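A minimal sketch, again assuming an existing SparkContext sc and a text file named README.md on the local file system or in HDFS:

# Create an RDD from an external text file and run simple parallel operations.
lines = sc.textFile("README.md")
print(lines.count())                                   # number of lines in the file
print(lines.filter(lambda l: "Spark" in l).count())    # lines that mention "Spark"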