
CATHOLIC UNIVERSITY OF GHANA, FIAPRE-SUNYANI

FACULTY OF COMPUTING, ENGINEERING AND MATHEMATICAL SCIENCE

PROGRAM: BSc. COMPUTER SCIENCE

Topic:

Fault-Tolerant Parallel Algorithms for Large-Scale Systems

NAME
AMIDU KIKAH ZULKANANI

INDEX NUMBER
UGW0202320126
Table of contents
1.1 Introduction
1.2 How Fault Tolerance Works
1.3 Types of Faults in Parallel Computing
1.3.1 Process Failures
1.3.2 Node Failures
1.3.3 Network Failures
1.4 Methods of Attaining Fault Tolerance
1.4.1 Checkpointing
1.4.1.2 Advantages of Checkpointing
1.4.2 Replication
1.4.2.1 The Two Kinds of Replication
1.4.3 Redundancy
1.4.4 Error Correction
1.5 Fault-Tolerant Parallel Algorithms
1.5.1 Master-Worker Algorithm
1.5.1.1 How Master-Worker Handles Faults
1.5.2 Distributed Memory Algorithm
1.5.2.1 Fault Handling by Distributed Memory Algorithms
1.5.3 Parallel Task Queue Algorithm
1.5.3.1 Fault Handling
1.6 Issues in Designing Fault-Tolerant Algorithms
1.6.1 Scalability Issues
1.6.2 Overhead Issues
1.6.3 Complexity of Design
1.7 Applications of Fault-Tolerant Parallel Algorithms
1.7.1 High-Performance Computing (HPC)
1.7.2 Cloud Computing
1.7.3 Real-Time Systems
1.7.4 Internet of Things (IoT)
1.7.5 Edge Computing
Reference
1.1 Introduction
Fault-tolerant parallel algorithms are designed to withstand failures while still delivering performance and reliability in large computing systems such as clusters and clouds. These systems are typically beset by various kinds of failures, such as process, node, and network failures, which can bring execution to a critical standstill. The importance of fault tolerance cannot be overstressed, because it allows the system to continue executing, thereby ensuring data integrity and availability of service (Kale & Krishnan, 2018).

1.2 How Fault Tolerance Works


From a simple point of view, fault-tolerant parallel algorithms employ more than one approach to manage and recover from failure. Among the best-known methods are:

 Checkpointing: Taking periodic snapshots of the state of the computation to aid recovery from saved points after a failure (Gupta & Trivedi, 2016).

 Replication: Creating duplicate copies of processes or data to provide alternatives after a fault (Huang & Kintala, 2017).

 Redundancy: Employing spare resources that will be utilized in case primary ones fail (Shooman, 2017).

 Error Correction: Applying algorithms that detect and correct errors during data processing (Lin & Costello, 2017).

These techniques work in combination to detect failures and dynamically reorganize the system. Applications are thus able to keep running with little interruption, enhancing the reliability of cloud and high-performance computing systems.


1.3 Types of Faults in Parallel Computing
In parallel computing systems, several types of faults may lead to system failure. These faults need to be understood so that efficient fault-tolerant methods can be adopted. The most prominent types are process failures, node failures, and network failures (Kale & Krishnan, 2018).

1.3.1 Process Failures


Process failures occur when a processing unit fails to execute as it should. This may happen for numerous reasons, including software faults or resource unavailability. For example, if a data processing task running in a multi-threaded environment crashes because of a memory leak, the entire operation based on that task can fail.

Example
When a matrix multiplication program is running, if one process computing a row of the result matrix encounters an unhandled exception, it will terminate the whole computation unless handled by fault-tolerant methods.

1.3.2 Node Failures


Node failures are failures of an entire computing unit or machine in a distributed system. Hardware faults such as power loss, overheating, or physical damage to server components can lead to this type of failure.

Example
Take the case of a cloud computing setup in which a virtual machine becomes unavailable because of hardware failure. If the primary computation runs on that node, other processes that depend on it will also be affected, and operations such as failover to backup nodes or task reallocation will be required.


1.3.3 Network Failures
Network failures are failures in the data channels between nodes. They may take the form of packet loss, very high latency, or a complete collapse of communication. Such failures can drastically affect the coordination of processes that depends on consistent data exchange.

Example
In a distributed algorithm where nodes exchange computational outcomes, a sudden network failure could induce inconsistent data states, in which some nodes hold stale information, leading to erroneous results or deadlocks.

1.4 Methods of Attaining Fault Tolerance


Fault tolerance in parallel computing relies on several crucial techniques that cooperate to enable systems to recover from failure with minimal interruption. We address in depth here four central techniques: checkpointing, replication, redundancy, and error correction. Each of these techniques has unique mechanisms and applications for providing reliability in parallel systems.

1.4.1 Checkpointing
Checkpointing involves saving the state of a computation at regular intervals so that, in the event of failure, the system can resume from the earlier saved state (Gupta & Trivedi, 2016).
Figure 1.1 Checkpointing

1.4.1.2 Advantages of Checkpointing


 Recovery efficiency: by making it possible for programs to restore from their last checkpoint, overall computation time is reduced, especially when failures are frequent.

 Tunable frequency: users may select the checkpoint frequency based on performance demands and failure-rate analysis.

1.4.2 Replication
Replication means copying data or processes across multiple nodes so that if one fails, another can substitute for it without interruption (Huang & Kintala, 2017).
Figure 1.2 Replication

1.4.2.1 The Two Kinds of Replication

 Active Replication: all of the replicas take part in processing tasks simultaneously, providing real-time failover support.

 Passive Replication: one active process executes tasks and the rest remain passive until a failure occurs.
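Passive replication can be sketched as follows. The Replica class, the simulated alive flag, and the failover logic are hypothetical constructs for illustration; a real system would replicate state over a network rather than in-process.

```python
class Replica:
    """A process that can execute tasks; liveness is simulated."""
    def __init__(self, name):
        self.name = name
        self.alive = True
        self.state = 0

    def execute(self, task):
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        self.state += task
        return self.state

# Passive replication: one primary runs; backups only receive state updates.
primary = Replica("primary")
backups = [Replica("backup-1")]

def run_task(task):
    global primary
    try:
        result = primary.execute(task)
    except RuntimeError:
        # Failover: promote a backup that already holds the replicated state.
        primary = backups.pop(0)
        result = primary.execute(task)
    # Propagate the primary's state to the remaining backups.
    for b in backups:
        b.state = primary.state
    return result

run_task(5)            # primary executes; backup state is synced to 5
primary.alive = False  # simulate a primary crash
result = run_task(3)   # the backup is promoted and continues from state 5
```

In active replication, by contrast, every replica would call execute on every task and a client would take the first (or majority) answer, trading extra work for zero failover delay.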

1.4.3 Redundancy
Redundancy provides additional resources that can be invoked when the primary resources are down. These may take the form of hardware, software, or network configuration (Shooman, 2017).

Figure 1.3 Redundancy

Examples include
1. Keeping dormant servers or components that come into play on the failure of their operational peers.

2. Holding replica copies of valuable information on distinct nodes or in distinct locations.

1.4.4 Error Correction
Error correction refers to the routines used to find and correct errors during data processing (Lin & Costello, 2017). They operate at varying levels, including:

 Forward Error Correction (FEC): appending redundant information so that errors can be identified and fixed without the information having to be resent.

 Checksums: simple methods in which data integrity is verified by computing checksums before and after transmission.
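The checksum approach can be illustrated with a SHA-256 digest in Python. Note that a checksum only detects corruption (the receiver would then request retransmission); it cannot repair data the way FEC can. The payload bytes and the simulated corruption are made up for the example.

```python
import hashlib

def with_checksum(payload: bytes):
    # Sender side: attach a SHA-256 digest of the payload.
    return payload, hashlib.sha256(payload).hexdigest()

def verify(payload: bytes, checksum: str) -> bool:
    # Receiver side: recompute the digest and compare; a mismatch
    # signals corruption in transit.
    return hashlib.sha256(payload).hexdigest() == checksum

data, tag = with_checksum(b"partial result from node 3")
intact_ok = verify(data, tag)               # True: transmission was intact
corrupted = b"partial result from node 9"   # simulate a bit flip in transit
damaged_ok = verify(corrupted, tag)         # False: corruption detected
```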

1.5 Fault-Tolerant Parallel Algorithms


Fault-tolerant parallel algorithms comprise a number of specialized methodologies, each offering a way of managing faults without compromising efficiency and performance (Kale & Krishnan, 2018). Three popular algorithms, the Master-Worker, the Distributed Memory, and the Parallel Task Queue algorithms, are discussed here in terms of how they manage fault conditions in parallel computing environments.

1.5.1 Master-Worker Algorithm


The Master-Worker algorithm is structured around a core coordinator called the "master," which dispatches work to multiple subordinate processes known as "workers." Each worker executes its given task independently and returns the outcome.

Figure 1.4 Master-Worker

1.5.1.1 How Master-Worker Handles Faults

 Redundancy: to cope with worker failure, the master can assign jobs to redundant workers. When a worker crashes, the master transfers the task to an alternate so that computation does not suffer significant pauses.

 Checkpointing: the master holds intermediate outcomes and the state of active tasks. If a worker crashes, the checkpoints help resume lost computation from a saved state.
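A minimal sketch of the reassignment behaviour, under simplifying assumptions: workers are plain callables rather than remote processes, a crash is modelled as a raised exception, and the names good_worker and crashing_worker are invented for the example.

```python
def master(tasks, workers):
    """Dispatch each task to a worker; on a crash, fall back to an alternate."""
    results = {}
    for task in tasks:
        for worker in workers:
            try:
                results[task] = worker(task)
                break                 # task completed, move to the next task
            except RuntimeError:
                continue              # worker crashed: try an alternate worker
        else:
            raise RuntimeError(f"task {task} failed on every worker")
    return results

def good_worker(task):
    return task * task

def crashing_worker(task):
    raise RuntimeError("worker crashed")

# The crashing worker is tried first; the master reassigns each task
# to the healthy worker, so no task is lost.
results = master([1, 2, 3], [crashing_worker, good_worker])
```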

1.5.2 Distributed Memory Algorithm


In the Distributed Memory architecture, each node in a system possesses its own dedicated memory, and nodes share data via a network. This type of architecture is used in clusters consisting of separate physical machines, improving scalability and resource usage.

Figure 1.5 Distributed Memory

1.5.2.1 Fault Handling by Distributed Memory Algorithms

 Replication: data can be replicated on multiple nodes to prevent loss. If the node containing critical data shuts down, other nodes holding replicated data can still run processes, keeping the system intact.

 Error Correction: Distributed Memory algorithms generally incorporate sophisticated error-checking techniques that actively monitor communication between nodes, allowing detection and correction of communication errors.

1.5.3 Parallel Task Queue Algorithm


The Parallel Task Queue algorithm employs a common task queue to distribute the workload efficiently among concurrently processing units. Tasks are queued and dynamically allocated to ready processors depending on their current load, maximizing resource utilization.
Figure 1.6 Parallel Task Queue

1.5.3.1 Fault Handling

 Dynamic Reassignment: if a processing unit crashes during task execution, the system can rerun the task on another available unit from the queue, allowing uninterrupted processing without user intervention.

 Traceability: the task queue maintains records of task status, enabling recovery attempts and avoiding duplication or loss of tasks.
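Dynamic reassignment can be sketched with Python's thread-safe queue.Queue. The fail_once set simulates a unit crashing on its first attempt at one task; the failed task is simply requeued for another thread to pick up. This is an illustrative sketch, not a production scheduler.

```python
import queue
import threading

tasks = queue.Queue()
results = {}
results_lock = threading.Lock()
fail_once = {3}   # simulate: the unit handling task 3 crashes on its first try

def worker():
    while True:
        task = tasks.get()
        if task is None:                 # sentinel: shut this worker down
            tasks.task_done()
            return
        try:
            if task in fail_once:        # simulated crash mid-task
                fail_once.discard(task)
                raise RuntimeError("processing unit failed")
            with results_lock:
                results[task] = task * 2
        except RuntimeError:
            tasks.put(task)              # dynamic reassignment: requeue
        finally:
            tasks.task_done()

for t in range(5):
    tasks.put(t)
threads = [threading.Thread(target=worker) for _ in range(3)]
for th in threads:
    th.start()
tasks.join()                             # blocks until every task completes
for _ in threads:
    tasks.put(None)                      # release the idle workers
for th in threads:
    th.join()
```

The pairing of put and task_done gives the traceability described above: queue.join() only returns once every task, including requeued ones, has actually completed.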

1.6 Issues in Designing Fault-Tolerant Algorithms


Fault-tolerant parallel algorithm design is inherently challenging because numerous issues must be addressed to ensure performance and reliability. Scalability, overhead, and complexity are among the most significant of these issues (Chen & Lee, 2017).

1.6.1 Scalability Issues

Ensuring that fault-tolerant algorithms scale well becomes more challenging as systems get larger and more complex. Scalability includes:

 Resource management: effective resource allocation and scheduling across nodes can create bottlenecks, particularly in the event of a failure.

 Task redistribution: redistributing tasks so as to maintain high system performance in the event of node failures is a difficult problem.

 Low scalability results in poor performance, since systems take longer to recover from failure, causing increased downtime or degradation of service.

1.6.2 Overhead Issues


Increased memory and computational overhead are the costs of fault-tolerant schemes:

 Continuous checkpointing: because storing states regularly takes considerable time, the overhead reduces overall performance.

 Replication costs: the expense of replicating data or operations can considerably raise the storage space and resources required.

1.6.3 Complexity of Design


Designing fault-tolerant algorithms involves a number of additional considerations on top of the technical complexity:

 Algorithm complexity: to prevent errors and inefficiencies, the complex interconnections between different components must be carefully managed.

 Robust error management: incorporating robust error handling into algorithms necessitates rigorous testing and careful planning.
1.7 Applications of Fault-Tolerant Parallel Algorithms
Fault-tolerant parallel algorithms are critical in many sectors, ensuring valid and efficient operation despite the possibility of failure in computing systems. Below, we discuss the main applications of such algorithms in high-performance computing, cloud computing, real-time systems, the Internet of Things, and edge computing.

1.7.1 High-Performance Computing (HPC)


In the domain of high-performance computing, where big simulations and complex computations are the norm, fault-tolerant parallel algorithms help researchers and scientists deliver results with minimal risk of data loss (Dongarra et al., 2018).

Scientific Simulations: HPC applications that process large datasets, such as astrophysical simulations, molecular dynamics, and climate modeling, frequently demand enormous computing power.

Monte Carlo Techniques: Monte Carlo methods, widely used in statistical analysis and computational finance, typically require extensive computational simulations. By using fault-tolerant techniques, researchers can ensure that simulations recover from process failures without compromising the integrity of the results.

1.7.2 Cloud Computing


Cloud adoption has increased the need for robust fault-tolerance mechanisms. Cloud services, built on diverse physical resources distributed over different locations, depend on uninterrupted access and performance (Atzori et al., 2017).

 Data Storage and Management: without replication of data across nodes, cloud systems would be prone to data loss. Replication protocols and other fault-tolerant approaches guarantee automatic data recovery in the event of failures, in addition to preserving access. Cloud-based databases, which guarantee availability and consistency by actively replicating data between servers, are a perfect example.

 Microservices Architecture: microservices in today's cloud applications typically rely on multiple components that need to collaborate and function smoothly in unison. Fault-tolerant algorithms support dynamic task rebalancing and load distribution so that even when a service crashes, others can take up the load, guaranteeing continuity of service.

1.7.3 Real-Time Systems


Real-time systems, such as those found in financial trading platforms or healthcare monitoring, require immediate responsiveness and reliability. Fault-tolerant algorithms prevent unacceptably large delays or losses when a system component fails (Chen & Lee, 2017).

 Healthcare monitoring systems: these track vital signs and other important metrics. When one sensor fails, fault-tolerant handling ensures that data is collected and processed continually and without interruption, protecting patients' health.

 Financial trading platforms: in trading systems, speed and dependability are essential because delay can result in losses. Error-correction techniques and automated recovery allow transaction processing to continue without interruption, thanks to fault-tolerant parallel algorithms.


1.7.4 Internet of Things (IoT)
With IoT devices permeating all areas of our lives, incorporating fault-tolerant techniques into distributed sensor networks is imperative for fault-resistant data collection and processing, particularly in mission-critical systems such as smart cities and healthcare (Hwang & Xu, 2018).

1.7.5 Edge Computing


With edge computing on the rise, where computations are performed nearer the data sources, fault-tolerant approaches will be necessary to ensure continuity of service among distributed nodes with unreliable interconnections (Li & Liu, 2019).


Reference
Kale, L. V., & Krishnan, S. (2018). Fault-tolerant parallel computing. In Encyclopedia of Parallel Computing (pp. 643-654). Springer.

Gupta, S., & Trivedi, K. S. (2016). Checkpointing and recovery in distributed systems. Journal of Systems and Software, 117, 345-355.

Huang, Y., & Kintala, C. (2017). Replication and fault tolerance in distributed systems. IEEE Transactions on Parallel and Distributed Systems, 28(10), 2611-2624.

Shooman, M. L. (2017). Redundancy and fault tolerance in computer systems. IEEE Transactions on Reliability, 66(3), 531-543.

Lin, S., & Costello, D. J. (2017). Error control coding: Fundamentals and applications. Prentice Hall.

Atzori, L., et al. (2017). The Internet of Things: A survey. Computer Networks, 121, 125-145.

Dongarra, J., et al. (2018). High-performance computing: Clusters, constellations, MPPs, and more. In Encyclopedia of Parallel Computing (pp. 755-766). Springer.

Chen, P. M., & Lee, E. K. (2017). Fault tolerance in distributed systems. Journal of Systems and Software, 131, 245-256.
