
Strong Points:: REMO Review - Binod Kumar

The paper presents an approach for redundant execution using spatial and temporal redundancy to detect errors with low overhead. Spatial redundancy uses an inorder checker to detect errors, while temporal redundancy re-executes instructions in the original functional units. The technique achieves low area overhead of 0.4% and power overhead of 9% with negligible performance impact during fault-free operation. However, the paper lacks fault injection experiments to validate its error coverage claims and does not fully address scenarios like double bit-flips or faults in the instruction decoder. Suggestions are made to improve the evaluation and comparison to other schemes.

Uploaded by

BINOD KUMAR

REMO review ---- Binod Kumar

Strong points:
The paper presents an approach to redundant execution for fault tolerance with minimal
overhead. Errors are detected by exploiting spatial and temporal redundancy. Spatial redundancy is
checked by a simple in-order checker module, and temporal redundancy is achieved by re-computation
in the original functional units. Although the paper lacks novelty, as this idea has already
been attempted in some form or other, its contribution lies in keeping the area, power and
performance overhead to a minimum. Since re-execution of the instructions is initiated after the
dependencies between them have been resolved, the redundant execution is expected to incur a very
low performance penalty. The authors claim that the proposed technique has very low detection
latency and a high degree of fault coverage, comparable to a full dual modular redundant
architecture. The key results of the proposed technique are an area increase of only 0.4%, a power
overhead close to 9%, and a negligible performance penalty during fault-free runs (when no recovery
is needed).
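
The temporal-redundancy idea summarized above (re-compute on the same functional unit, then compare) can be illustrated with a minimal sketch. This is a toy model and not the paper's microarchitecture: the names `alu_add` and `execute_with_replay`, the 32-bit width, and the use of an XOR mask to model a transient fault are all assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF

def alu_add(a: int, b: int, fault_mask: int = 0) -> int:
    """Hypothetical 32-bit adder; fault_mask XORs bits into the result
    to model a transient fault striking the unit during this execution."""
    return ((a + b) & MASK32) ^ fault_mask

def execute_with_replay(a: int, b: int, first_fault: int = 0, replay_fault: int = 0):
    """Execute once, replay on the same unit, and compare the two results.

    Returns (first_result, replay_result, mismatch). A mismatch signals
    that at least one of the two executions was corrupted.
    """
    first = alu_add(a, b, first_fault)
    replay = alu_add(a, b, replay_fault)
    return first, replay, first != replay

# Fault-free run: both executions agree.
print(execute_with_replay(5, 7))
# Transient fault flips bit 3 of the first execution only: mismatch is raised.
print(execute_with_replay(5, 7, first_fault=1 << 3))
```

In this model a fault that has decayed before the replay is always caught, while a fault that corrupts both executions in the same way would pass the comparison unnoticed.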

Weak points:
It is a common issue with many fault-tolerant architecture papers that they do not perform
fault-injection experiments to evaluate their proposal. This paper suffers from the same problem.
The authors claim, "The simulation results show almost full soft error coverage......" Without a
fault-injection experiment, this claim is not very convincing, although the authors discuss three
cases of fault occurrence in Section IV. The assumption that the instruction decoder is outside the
SOR (sphere of replication) means that errors in the decoder can have a severe impact. How does
fault recovery proceed in such a scenario? The authors consider fault coverage only for single
bit-flips. Although a single bit-flip serves as a sufficient model for soft errors, the authors
should still comment on how the architecture behaves in double/triple bit-flip scenarios.
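
As a rough illustration of the kind of fault-injection experiment requested above, the following sketch injects random bit-flips into two redundant copies of a result and measures how often a simple comparator catches the corruption. It is a toy model under stated assumptions, not the paper's setup: the 32-bit width, the uniform choice of copy and bit position, and the names `run_trial` and `coverage` are all hypothetical.

```python
import random

WIDTH = 32  # assumed result width in bits

def run_trial(rng: random.Random, flips: int):
    """Inject `flips` random bit-flips across two redundant copies of one result.

    Returns (corrupted, detected): whether either copy differs from the
    golden value, and whether the comparator sees a mismatch.
    """
    golden = rng.getrandbits(WIDTH)
    masks = [0, 0]  # XOR fault masks for copy 0 and copy 1
    for _ in range(flips):
        copy = rng.randrange(2)                   # which copy the fault hits
        masks[copy] ^= 1 << rng.randrange(WIDTH)  # which bit is flipped
    a, b = golden ^ masks[0], golden ^ masks[1]
    return (a != golden or b != golden), (a != b)

def coverage(flips: int, trials: int = 20000, seed: int = 0) -> float:
    """Fraction of corrupted trials in which the comparator raised a mismatch."""
    rng = random.Random(seed)
    corrupted = detected = 0
    for _ in range(trials):
        c, d = run_trial(rng, flips)
        corrupted += c
        detected += c and d
    return detected / corrupted if corrupted else 1.0

print("single bit-flip coverage:", coverage(1))
print("double bit-flip coverage:", coverage(2))
```

In this model a single flip is always detected, because only one copy can be wrong. With two flips, the case where both copies are hit at the same bit position escapes the comparator, which is the kind of scenario the double/triple bit-flip question is about.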

Disagreement:
Reusing the same unit for temporal redundancy may not always be helpful from a performance
perspective when programs have a large fraction of floating-point instructions. The assumption
that memory state can be recovered after 1 billion instructions is a very optimistic one.
The authors state that "A single bit flip in any of the field of ROB would result in incorrect
result of either the verifier or the main OOO processor but not both." For double bit-flips, an
incorrect result may appear in both. How should such cases be dealt with?

Suggestions for improvement:
The authors should have experimented with different replay buffer sizes and their impact on the
performance penalty. A fault-injection experiment should be performed for accurate estimation of
fault coverage. A quantitative comparison must be made with state-of-the-art schemes. Even if the
comparison is made only with DIVA (which has a complete in-order core for the verifier part), the
authors can bring out the contribution of the time-replay part alone to fault tolerance.

Points which are not clear:
How do the authors calculate/estimate that the checkpoint overhead ranges from 30 to 50 cycles? In
the case of temporal redundancy for "A single bit flips in FU", the authors claim that "Within this
time span even multi-cycle fault is expected to decay,......." What is the reasoning behind that?
How is performance affected if the transient fault has not decayed?

Points to be discussed in class:
Are the read ports to the ROB and ARF sufficient for the verifier's accesses? This doubt arises
because, during re-execution, the verifier accesses the ROB and ARF to read operands.
The paper states that checkpoints are taken at intervals of one billion instructions. While it is
definitely possible to restore architectural state that far back, restoring memory that far back
may not be feasible.
The recovery mechanism should be explained in more detail.
