Making a better TMR

Submitted by winstead on Sat, 12/22/2012 - 09:12

All electronic and computing systems are sensitive to various types of errors. For several decades, a simple procedure known as Triple Modular Redundancy (TMR) has been a leading technique for producing highly error-resilient systems (see the Wikipedia entry on TMR). TMR is based on a simple idea: make three independently-running copies of any computing module or system. Each copy gets the same inputs. Then, at the output, take a majority vote to determine the answer. With this procedure, any single module can fail without affecting the correct operation.
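The voting step can be sketched in a few lines of Python. This is only a minimal illustration: the names `majority_vote`, `tmr`, `good` and `stuck_at_zero` are hypothetical, and the voter itself is assumed to be fault-free.

```python
def majority_vote(a, b, c):
    """Return the value held by at least two of the three inputs.

    If all three inputs differ there is no majority; this sketch then
    falls back to `b`, which is an arbitrary choice.
    """
    if a == b or a == c:
        return a
    return b


def tmr(modules, x):
    """Feed the same input to three module copies and vote on the outputs."""
    y1, y2, y3 = (m(x) for m in modules)
    return majority_vote(y1, y2, y3)


def good(x):
    # Hypothetical fault-free module.
    return x + 1


def stuck_at_zero(x):
    # Hypothetical faulty copy: ignores its input.
    return 0


print(tmr([good, stuck_at_zero, good], 4))  # 5: the faulty copy is outvoted
```

Any one of the three copies can be replaced by `stuck_at_zero` and the voted result stays the same. Two faulty copies would outvote the good one.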

It's quite difficult to improve on the TMR method without sacrificing universality or effectiveness. Here at the LE/FT lab, we developed a version of TMR, called Restorative Feedback (RFB), that captures all the traditional benefits but is much more effective at rejecting temporary faults or "glitches." The RFB method also provides for self-correction in the voting circuits. In traditional TMR, the voters need to be highly reliable, but this requirement is loosened in the RFB approach. As a bonus, our method can be directly applied to circuits that employ multiple-valued logic (i.e. more signal levels than just 0 and 1). TMR is a very expensive method for fault tolerance (think three computers instead of just one), so it is only used in specialized "mission-critical" applications like space, nuclear and military electronics. Although it is rarely used in practice, TMR is a fundamental design case that helps us understand and evaluate error-correction strategies for electronics.

We named our solution "Restorative Feedback" (RFB) because it uses a feedback mechanism to control errors that may happen in the majority-vote module. We first published the concept in 2011 [1]. An expanded version was accepted for publication in the Journal of Multiple Valued Logic and Soft Computing [2], and is now awaiting publication. While we wait for the paper to appear, we can share some of the basic details here on our lab blog. The RFB method is a general concept that can be applied whenever data is latched, e.g. for protecting hardware registers or logic pipelines. The concept is as follows: three copies are made of all modules and signals in the system, as shown in the schematic figure below.

Figure 1. Generic schematic diagram for the RFB method.

In this figure, $M_1$, $M_2$ and $M_3$ are three copies of a logic function. The input signals $x_1,\,x_2,\,x_3$ should all be exact copies, unless a fault occurs. Similarly, the output signals $y_1,\,y_2,\,y_3$ should all be identical, unless a fault occurs. The modules labeled $C_1,\,C_2,\,C_3$ are modified Muller C-elements. The C-element is a unanimous-vote device: when both of its inputs are the same, its output is set to that common value. When its inputs differ, the C-element simply retains its memory state.
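The C-element behavior described above can be modeled as a small state-holding object. This is a behavioral sketch only: the class name `CElement` and its `update` method are hypothetical, and it models just the set-or-hold rule, not whatever circuit-level modifications the actual "modified" C-elements include.

```python
class CElement:
    """Behavioral model of a two-input C-element with a memory state."""

    def __init__(self, state=0):
        self.state = state  # the value currently held at the output

    def update(self, a, b):
        """Apply inputs a and b and return the resulting output."""
        if a == b:
            # Unanimous inputs: the output takes their common value.
            self.state = a
        # Otherwise the inputs disagree and the old state is retained.
        return self.state


c = CElement(state=0)
print(c.update(1, 1))  # 1: inputs agree, output follows them
print(c.update(1, 0))  # 1: inputs differ, output holds
print(c.update(0, 0))  # 0: inputs agree again
```

Because the rule only tests equality, the same model works unchanged for multiple-valued signals, for example `c.update(2, 2)`.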

RFB error-correction is a two-step process. In the first step, the C-element memories are initialized in a "barrel-shift" configuration, so that

\(\begin{align*} z_1 &:= y_3 \\ z_2 &:= y_1 \\ z_3 &:= y_2. \end{align*}\)

In the second step, the C-elements operate normally, with the output of one C-element feeding back into the input of the next C-element.

This procedure is able to correct all the same errors as TMR, but also corrects types of errors that TMR cannot. An example of error-correction is shown in the figures below, in which the propagation of a fault is shown by the dashed lines, and faulty signals are labeled with an asterisk*.

Figure 2(a), Phase 1: A fault occurs in module $M_1$.
Figure 2(b), Phase 2: During the initialization step, the error is copied onto $z_2$.
Figure 2(c), Phase 3: After some settling time, the C-element $C_2$ corrects the error on $z_2$, so that all outputs are correct.
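The two-step procedure and the example in Figure 2 can be traced with a small discrete-time sketch. Several things here are assumptions and not taken from the schematic: each $C_i$ is assumed to take $y_i$ and the previous C-element's output $z_{i-1}$ (cyclically) as its inputs, all three C-elements are assumed to update at the same instant, and the function names are hypothetical. The real circuit is asynchronous, so this only illustrates the logic.

```python
def c_element(a, b, held):
    """Output the common value when a == b, otherwise keep the held value."""
    return a if a == b else held


def rfb_initialize(y):
    """Step 1, barrel-shift initialization: z_1 := y_3, z_2 := y_1, z_3 := y_2."""
    return [y[2], y[0], y[1]]


def rfb_step(y, z):
    """Step 2, one feedback update.

    Assumed wiring: C_i sees y_i and z_{i-1}; z[i - 1] wraps around for i = 0.
    """
    return [c_element(y[i], z[i - 1], z[i]) for i in range(3)]


def rfb_trace(y, steps=2):
    """Return the list of z vectors: the initial state, then each update."""
    z = rfb_initialize(y)
    trace = [z]
    for _ in range(steps):
        z = rfb_step(y, z)
        trace.append(z)
    return trace


# The correct module output is 0; a fault in M_1 makes y_1 = 1.
for z in rfb_trace([1, 0, 0]):
    print(z)
# [0, 1, 0]  initialization copies the error onto z_2
# [0, 0, 0]  C_2 sees y_2 == z_1 and corrects z_2
# [0, 0, 0]  settled
```

Under these assumptions, $C_1$ never adopts the faulty $y_1$ because its other input disagrees, so it simply holds the correct value it was initialized with.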

In addition to correcting single errors, RFB is able to reject faults that occur after it has settled (i.e. in "phase 3"). Many double-fault events are rejected, and single C-element upsets are suppressed or quickly restored, giving significant advantages over traditional TMR for situations where momentary glitches are a primary concern. The statistical performance of RFB in comparison to TMR is indicated in the plot below, which shows the rate of uncorrectable error events for the two methods.
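The restoration of a single C-element upset after settling can be illustrated with a short sketch. It rests on assumptions that are not stated in the text: $C_i$ is taken to have inputs $y_i$ and $z_{i-1}$ (cyclically), all C-elements update simultaneously in discrete steps, and the names `c_element`, `rfb_step` and `upset` are hypothetical. It covers only the single-upset case with fault-free modules, not the double-fault cases.

```python
def c_element(a, b, held):
    """Output the common value when a == b, otherwise keep the held value."""
    return a if a == b else held


def rfb_step(y, z):
    """One feedback update with assumed wiring: C_i sees y_i and z_{i-1}."""
    return [c_element(y[i], z[i - 1], z[i]) for i in range(3)]


def upset(z, k, wrong_value):
    """Return a copy of z with the state of C-element k overwritten."""
    flipped = list(z)
    flipped[k] = wrong_value
    return flipped


y = [0, 0, 0]        # all module outputs correct
settled = [0, 0, 0]  # C-element states after settling

for k in range(3):
    glitched = upset(settled, k, 1)
    restored = rfb_step(y, glitched)
    print(k, glitched, restored)
```

In this model each upset is cleared after one update: the upset element sees two agreeing correct inputs and is reset, while its downstream neighbor sees disagreeing inputs and holds its correct state.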

Figure 3: Bit error rate simulation results comparing TMR to RFB.

These results were obtained by injecting errors in the $x_i$ input signals at a rate of 0.05 errors per time unit (using an arbitrary time scale). The "Gate Error Rate" refers to the rate of momentary glitches injected in the $M_i$ and $C_i$ modules and, in the case of TMR, in the voter modules. The RFB circuit is allowed to settle for $T$ time units before errors are counted, which implies a possible settling-time disadvantage for the RFB method.

After developing the original RFB concept, we began investigating methods to embed RFB into larger error-correcting structures, especially low-density parity-check (LDPC) decoders. This investigation is part of our ongoing collaboration with Prof. Emmanuel Boutillon of Université de Bretagne Sud, and Dr. Yangyang Tang of Huawei. This work is also part of Prof. Winstead's Fulbright scholar research agenda during his visit to France. You can read about these ideas in some of our papers [3]-[5].

References:

[1] Chris Winstead, Abiezer Tejeda, Eduardo Monzon, Yi Luo, “An error-correction method for binary and multiple-valued logic,” IEEE International Symposium on Multiple-Valued Logic, Tuusula, Finland, May 2011. [Link to article on IEEE Xplore].

[2] Chris Winstead, Abiezer Tejeda, Eduardo Monzon, Yi Luo, “Error Correction via Restorative Feedback in M-ary Logic Circuits,” Journal of Multiple Valued Logic and Soft Computing, accepted for publication in June 2012, in press.

[3] Yangyang Tang, Emmanuel Boutillon, Chris Winstead, Christophe Jego, and Michel Jezéquel, “Muller C-element based Decoder (MCD): A Decoder Against Transient Faults,” IEEE International Symposium on Circuits and Systems (ISCAS), May 2013. [Link to article on IEEE Xplore].

[4] Chris Winstead, Yangyang Tang, Emmanuel Boutillon, Christophe Jego, and Michel Jezéquel, “A Space-Time Redundancy Technique for Embedded Stochastic Error Correction in Digital Logic Systems,” International Symposium on Turbo Codes (ISTC), Aug. 2012. [Link to article on IEEE Xplore].

[5] Yangyang Tang, Chris Winstead, Emmanuel Boutillon, Christophe Jego, and Michel Jezéquel, “An LDPC Decoding Method for Fault-Tolerant Digital Logic,” IEEE International Symposium on Circuits and Systems (ISCAS), May 2012. [Link to article on IEEE Xplore].
