Fault-Tolerant Parallel Algorithms[1]
Topic:
NAME
AMIDU KIKAH ZULKANANI
INDEX NUMBER
UGW0202320126
Table of contents
1.1 Introduction..............................................................................................................................1
1.2 How Fault Tolerance Works...................................................................................................1
1.3 Types of Faults in Parallel Computing..................................................................................2
1.3.1 Process Failures......................................................................................................................................... 2
1.3.2 Node Failures.............................................................................................................................................. 2
1.3.3 Network Failures....................................................................................................................................... 3
1.4 Methods of Attaining Fault Tolerance...................................................................................3
1.4.1 Checkpointing............................................................................................................................................. 3
1.4.1.2 Advantages of Checkpointing......................................................................................................3
1.4.2 Replication.................................................................................................................................................... 4
1.4.2.1 There are two kinds of replication............................................................................................4
1.4.3 Redundancy................................................................................................................................................. 4
1.4.4 Error Correction......................................................................................................................................... 4
1.5 Fault-Tolerant Parallel Algorithms.......................................................................................5
1.5.1 Master-Worker Algorithm.....................................................................................................................5
1.5.1.1 How Master-Worker Handles Faults................................................................................ 5
1.5.2 Distributed Memory Algorithm...........................................................................................................5
1.5.2.1 Fault Handling by distributed memory algorithms...........................................................5
1.5.3 Parallel Task Queue Algorithm............................................................................................................6
1.5.3.1 Fault Handling................................................................................................................... 6
1.6 Issues in Designing Fault-Tolerant Algorithms....................................................................6
1.6.1 Scalability Issues..................................................................................................................... 6
1.6.2 Overhead Issues................................................................................................................................... 7
1.6.3 Complexity of Design............................................................................................................................... 7
1.7 Applications of Fault-Tolerant Parallel Algorithms............................................................8
1.7.1 High-Performance Computing (HPC)...............................................................................................8
1.7.2 Cloud Computing....................................................................................................................................... 8
1.7.3 Real-Time Systems.................................................................................................................................... 9
1.7.4 Internet of Things (IoT).......................................................................................................................... 9
1.7.5 Edge Computing...................................................................................................................................... 10
Reference......................................................................................................................................11
1.1 Introduction
Fault-tolerant parallel algorithms are designed to tolerate faults while still offering
performance and reliability in large computing systems such as clusters and clouds. These
systems are typically beset by failures such as process, node, and network failures, which can
bring execution to a critical standstill. The importance of fault tolerance cannot be overstressed,
because it allows the system to keep executing, thus ensuring data integrity and providing
continuous service.
1.2 How Fault Tolerance Works
Fault tolerance is typically attained through a combination of techniques:
Checkpointing: Periodically saving the state of a computation so that it can be resumed after a failure.
Replication: Creating duplicate copies of processes or data for the intent of providing backups.
Redundancy: Employing redundant resources that will be utilized in case primary ones fail.
Error Correction: Applying error-detecting and error-correcting algorithms during data transmission and storage.
These techniques work in combination to detect failures and dynamically reorganize the system.
Applications are thus able to keep running with little interruption, hence enhancing the reliability
of the system.
1.3 Types of Faults in Parallel Computing
The types of faults that can occur need to be known so that efficient fault-tolerant methods can
be adopted. The most prominent types of faults are process failures, node failures, and network
failures (Kale & Krishnan, 2018).
1.3.1 Process Failures
A process failure occurs when an individual process terminates unexpectedly for any of
numerous possible reasons, including software faults or resource unavailability.
Example
When a matrix multiplication program runs in parallel, if one process computing a row of
the result matrix encounters an unhandled exception, it can terminate the whole computation
unless fault-tolerance measures are in place.
1.3.2 Node Failures
A node failure occurs when an entire compute node becomes unavailable. Hardware problems
such as power loss, overheating, or physical damage can lead to this type of failure.
Example
Take the case of a cloud computing setup in which a virtual machine becomes unavailable
because of hardware failure. If the primary computation is on that node, other processes that
depend on it will also be affected, and operations such as failover to backup nodes or task
migration become necessary.
1.3.3 Network Failures
Network failures involve lost messages, very high latency, or a complete collapse of
communication. These kinds of failures can drastically degrade performance and leave the
system in an inconsistent state.
Example
For a distributed algorithm in which nodes exchange computational outcomes, a sudden network
failure could induce inconsistent data states, where some nodes hold stale information, leading
to incorrect results.
1.4 Methods of Attaining Fault Tolerance
Several techniques allow systems to recover from failure with minimal interruption. We address
four central techniques in depth here: checkpointing, replication, redundancy, and error
correction. Each of these techniques has unique mechanisms and applications for providing
reliability in parallel systems.
1.4.1 Checkpointing
Checkpointing involves saving the state of a computation at regular intervals so that in the event
of failure, the system can resume the earlier saved state (Gupta & Trivedi, 2016).
Figure 1.1 Checkpointing
Users may select the checkpoint frequency based on performance demands and failure
rate analysis.
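The save-and-resume cycle described above can be sketched in Python. This is a minimal illustration: the checkpoint file path, the state layout, and the interval of ten steps are assumptions for the example, not part of the original text.

```python
import os
import pickle
import tempfile

# Illustrative checkpoint location (any stable path would do).
CHECKPOINT = os.path.join(tempfile.gettempdir(), "ckpt.pkl")

def save_checkpoint(state):
    # Write atomically: dump to a temp file, then rename over the old checkpoint,
    # so a crash mid-write never corrupts the last good checkpoint.
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)

def load_checkpoint():
    # Resume from the last saved state, or start fresh if none exists.
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "total": 0}

def run(n_steps, checkpoint_every=10):
    state = load_checkpoint()
    for step in range(state["step"], n_steps):
        state["total"] += step        # stand-in for the real computation
        state["step"] = step + 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state)    # periodic checkpoint
    save_checkpoint(state)
    return state["total"]
```

With `checkpoint_every=10`, a crash mid-run loses at most the last ten steps; the next call to `run` resumes from the last saved state rather than from step 0.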
1.4.2 Replication
Replication means copying data or processes across multiple nodes so that if one fails,
another can substitute for it without interruption (Huang & Kintala, 2017).
Figure 1.2 Replication
There are two kinds of replication:
Active Replication: All replicas execute each task simultaneously, so the failure of any one
replica is masked by the others.
Passive Replication: One active process executes tasks and the rest remain passive until a
failure occurs, at which point a passive replica is promoted.
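A toy sketch of passive replication over an in-memory key-value service; the class names and the `alive` flag that simulates node health are illustrative assumptions, not part of the original text.

```python
class Replica:
    """One copy of the service; `alive` simulates node health."""
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.state = {}

    def apply(self, key, value):
        self.state[key] = value

class PassiveGroup:
    """Passive replication: the primary serves requests while every update
    is propagated to the backups, so a backup can be promoted on failure."""
    def __init__(self, replicas):
        self.replicas = replicas

    def primary(self):
        # The first live replica acts as primary; promotion is implicit.
        for r in self.replicas:
            if r.alive:
                return r
        raise RuntimeError("all replicas failed")

    def write(self, key, value):
        # Apply the update on every live replica so backups stay current.
        for r in self.replicas:
            if r.alive:
                r.apply(key, value)

    def read(self, key):
        return self.primary().state[key]
```

After the primary fails, reads transparently hit the promoted backup, which already holds the propagated state.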
1.4.3 Redundancy
Redundancy provides additional resources which might be invoked when the primary resources
are down. This might be in the form of hardware, software, or network configuration (Shooman,
2017).
Examples include:
1. Using dormant (standby) servers or components that come into play on the failure of their
operational peers.
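The dormant-server pattern above might look like the following sketch; the `Server` class and its `healthy` flag are hypothetical stand-ins for real hardware or service health checks.

```python
class Server:
    """A hypothetical server; `healthy` stands in for a real health check."""
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name}:{request}"

def dispatch(request, primary, standbys):
    """Standby redundancy: dormant spares are only invoked
    when the operational primary fails."""
    for server in [primary, *standbys]:
        try:
            return server.handle(request)
        except RuntimeError:
            continue  # fail over to the next spare
    raise RuntimeError("no healthy server available")
```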
1.4.4 Error Correction
Error-correction techniques detect and repair corrupted data during transmission or storage.
Forward Error Correction (FEC): Appending redundant information to transmitted data so that
errors can be detected and corrected by the receiver without retransmission.
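As a minimal FEC illustration, here is a 3x repetition code, one of the simplest forward error-correcting codes; real systems use stronger codes such as Hamming or Reed-Solomon.

```python
def fec_encode(bits):
    # Repetition code: transmit each bit three times.
    return [b for bit in bits for b in (bit, bit, bit)]

def fec_decode(coded):
    # Majority vote over each triple corrects any single flipped bit
    # per triple, with no retransmission needed.
    out = []
    for i in range(0, len(coded), 3):
        triple = coded[i:i + 3]
        out.append(1 if sum(triple) >= 2 else 0)
    return out
```

The receiver recovers the original message even when one bit in each triple is corrupted in transit.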
1.5 Fault-Tolerant Parallel Algorithms
Fault-tolerant parallel algorithms are designed to keep computations progressing despite
failures (Kale & Krishnan, 2018). Three popular algorithms, the Master-Worker, the Distributed
Memory, and the Parallel Task Queue algorithms, are discussed here in terms of how they
handle faults.
1.5.1 Master-Worker Algorithm
In the Master-Worker algorithm, a central "master" process dispatches work to multiple
subordinate processes known as "workers." A worker executes its assigned task and reports
back to the master, which coordinates the workers. When a worker crashes, the master
transfers the task to an alternate worker so that the computation continues.
1.5.1.1 How Master-Worker Handles Faults
The master holds intermediate outcomes and the active task state. In case a worker crashes, the
master reassigns that worker's task to another available worker.
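A condensed sketch of this reassignment logic; the worker functions and the integer-based pairing of tasks to workers are illustrative assumptions.

```python
def master(tasks, workers, max_retries=3):
    """Master-Worker fault handling sketch: the master dispatches each task
    and, when a worker fails, transfers the task to an alternate worker."""
    results = {}
    for task in tasks:  # tasks are small integers in this sketch
        for attempt in range(max_retries):
            worker = workers[(task + attempt) % len(workers)]
            try:
                results[task] = worker(task)  # worker executes the task
                break
            except RuntimeError:
                continue                      # worker crashed; reassign
        else:
            raise RuntimeError(f"task {task} failed on every worker")
    return results

def reliable_worker(task):
    return task * task  # the actual computation (squaring, for the demo)

def crashed_worker(task):
    raise RuntimeError("simulated worker crash")
```

Even with one permanently crashed worker in the pool, every task completes because the master retries it on an alternate.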
1.5.2 Distributed Memory Algorithm
In distributed memory algorithms, each node has its own local memory, and nodes share data
via a network. This type of architecture is used in clusters.
1.5.2.1 Fault Handling by Distributed Memory Algorithms
If a node containing critical data shuts down, other nodes holding replicated copies of that data
can still run the computation.
1.5.3 Parallel Task Queue Algorithm
The Parallel Task Queue algorithm distributes work among concurrently processing units in an
efficient manner. Tasks are queued and dynamically allocated to ready processors depending
on the processors' current load, maximizing resource utilization.
Figure 1.6 Parallel Task Queue
1.5.3.1 Fault Handling
If a processing unit fails, the scheduler can rerun its task on another available unit from the
queue, allowing uninterrupted progress.
Traceability: The task queue maintains records of task status, enabling failed tasks to be
identified and retried.
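The queue-based rerun and the traceability record described above can be sketched as follows; unit failures are simulated with an exception, and the round-robin choice of unit is an illustrative assumption.

```python
from collections import deque

def run_queue(tasks, units, max_attempts=3):
    """Parallel task queue sketch: tasks are queued and handed to units;
    a task whose unit fails is re-enqueued and rerun on another unit."""
    queue = deque((t, 0) for t in tasks)    # (task, attempts so far)
    status = {t: "pending" for t in tasks}  # traceability: per-task status
    results = {}
    unit_idx = 0
    while queue:
        task, attempts = queue.popleft()
        unit = units[unit_idx % len(units)]  # round-robin stands in for load balancing
        unit_idx += 1
        try:
            results[task] = unit(task)
            status[task] = "done"
        except RuntimeError:
            if attempts + 1 >= max_attempts:
                status[task] = "failed"
            else:
                queue.append((task, attempts + 1))  # rerun on another unit
                status[task] = "retrying"
    return results, status
```

Tasks that first land on a failing unit are simply re-enqueued, and the status map records every transition for later inspection.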
1.6 Issues in Designing Fault-Tolerant Algorithms
Several issues in the design of fault-tolerant algorithms must be addressed to ensure
performance and reliability. Scalability, overhead, and complexity of design are some of the
most significant such issues (Chen & Lee, 2017).
1.6.1 Scalability Issues
Ensuring that fault-tolerant algorithms scale well becomes more challenging as systems get
larger and more complex.
Scalability issues include:
Resource Management: Allocating and scheduling resources effectively across many nodes
can become a bottleneck, particularly in the event of a failure.
Task Redistribution: Redistributing tasks to maintain high system performance becomes
harder as the number of nodes grows.
Low scalability results in poor performance, since large systems take longer to recover from
failures.
1.6.2 Overhead Issues
The expenses associated with replicating data or operations have the potential to degrade
overall performance.
1.7 Applications of Fault-Tolerant Parallel Algorithms
Fault-tolerant parallel algorithms guarantee continuity of operation despite potential failures in
computing systems. Below, we discuss the main applications of such algorithms in high-performance
computing, cloud computing, and real-time systems, as well as the Internet of Things
and edge computing.
1.7.1 High-Performance Computing (HPC)
In HPC environments, where hardware failures are a normal affair, fault-tolerant parallel
algorithms help researchers and scientists deliver results reliably.
Scientific Simulations: HPC applications that compute over large datasets, such as astrophysical
simulations, molecular dynamics, and climate modeling, frequently consume large amounts of
computing power, making restarts from scratch prohibitively expensive.
Monte Carlo Techniques: Monte Carlo techniques, which are widely used in statistical analysis
and computational finance, typically require extensive computational simulations. By using fault-tolerant
techniques, researchers can make sure that simulations recover from process failures
without restarting from scratch.
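A sketch of a fault-tolerant Monte Carlo run, assuming per-batch seeding so that a rerun batch reproduces the same samples; the simulated `fail_on` batches are an illustrative stand-in for real process failures.

```python
import random

def mc_batch(seed, n):
    # One Monte Carlo batch: count samples inside the unit quarter circle.
    rng = random.Random(seed)
    return sum(1 for _ in range(n)
               if rng.random() ** 2 + rng.random() ** 2 <= 1.0)

def estimate_pi(n_batches, batch_size, fail_on=()):
    """Accumulate batches; a batch that fails is simply rerun with the same
    seed, so one crashed simulation never forces a restart from scratch."""
    failures = set(fail_on)  # batches that will crash once (simulated)
    hits = 0
    for b in range(n_batches):
        while True:
            try:
                if b in failures:
                    failures.discard(b)          # fail only once
                    raise RuntimeError("batch crashed")
                hits += mc_batch(seed=b, n=batch_size)
                break
            except RuntimeError:
                continue                         # recover: rerun the batch
    return 4.0 * hits / (n_batches * batch_size)
```

Because each batch is seeded independently, the estimate with injected failures is identical to the failure-free run: only the failed batches are repeated.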
1.7.2 Cloud Computing
Cloud platforms, rooted in diverse physical resources distributed over different locations, rely
on undisturbed operation.
Data Storage and Management: Without replication of data across nodes, cloud systems
would be prone to data loss. Replication protocols and other fault-tolerant approaches
guarantee automatic data recovery in the event of failures, in addition to granting continued
access. Cloud-based databases, which guarantee availability and consistency by actively
replicating data, are a prime example.
Cloud applications are composed of multiple components which need to collaborate and
function smoothly in unison. Fault-tolerant algorithms support dynamic task rebalancing and
load distribution so that even when one service crashes, others can take up the load,
guaranteeing continuity of service.
1.7.3 Real-Time Systems
Real-time systems require immediate responsiveness and reliability. Algorithms with fault
tolerance stop unacceptably large delays or losses from occurring when a component fails
(Chen & Lee, 2017).
Healthcare Monitoring Systems: These keep tabs on vital signs and other important
metrics. When one sensor fails, fault-tolerant handling makes sure that data continues to be
collected from the remaining sensors.
Economic Trading Platforms: In trading systems, speed and dependability are essential
because delay can result in losses. Error-correction techniques and automated recovery allow
transaction processing to continue without interruption when faults occur.
1.7.4 Internet of Things (IoT)
Building fault tolerance into distributed sensor networks will be imperative for offering
fault-resistant data collection and processing, particularly for mission-critical systems like smart
cities and healthcare (Hwang & Xu, 2018).
1.7.5 Edge Computing
As computation moves closer to data sources at the network edge, fault-tolerant approaches
will be a necessity in ensuring continuity of service between distributed edge nodes.
Reference
Gupta, S., & Trivedi, K. S. (2016). Checkpointing and recovery in distributed systems. Journal of
Huang, Y., & Kintala, C. (2017). Replication and fault tolerance in distributed systems. IEEE
Lin, S., & Costello, D. J. (2017). Error control coding: Fundamentals and applications.
Atzori, L., et al. (2017). The Internet of Things: A survey. Computer Networks, 121, 125-145
Dongarra, J., et al. (2018). High-performance computing: Clusters, constellations, MPPs, and
Chen, P. M., & Lee, E. K. (2017). Fault tolerance in distributed systems. Journal of Systems and