Software-based permanent fault recovery techniques using inherent hardware redundancy
Recent advances in deep submicron (DSM) technology have adversely affected the long-term lifetime reliability of semiconductor devices. According to the reliability report from the International Technology Roadmap for Semiconductors (ITRS), smaller feature sizes and higher power densities make DSM devices more susceptible to wear-out failures. As a consequence, permanent faults are more likely to occur in DSM devices at runtime. To ensure system reliability and availability, fault-tolerant techniques must be applied to overcome these runtime permanent faults. Systems requiring non-stop computation usually need full duplication of their hardware components, which incurs a high hardware cost. For systems that can tolerate a short period of downtime, however, low-cost software techniques that exploit the inherent hardware redundancy of computing devices, such as Field-Programmable Gate Arrays (FPGAs) and Very Long Instruction Word (VLIW) processors, can potentially be applied as an intermediate fault recovery step. These techniques reconfigure the computation on a faulty device to keep the system operating until the device can be replaced.
To maintain correct computation on a faulty device, operations originally assigned to faulty resources must be moved to fault-free resources on the same device. This process requires two phases: a testing phase to locate faults and a recovery phase to eliminate the use of faulty resources in the computation. In this dissertation, we present software techniques that address specific testing and recovery challenges for FPGAs and VLIW processors.
For FPGAs, we focus on testing for and recovering from path delay faults. A path delay fault occurs when a permanent fault causes the delay of at least one critical path to exceed the maximum allowable system delay. To locate paths with delay faults, a built-in self-test (BIST) approach is presented that evaluates all combinations of signal transitions along critical paths. To recover from path delay faults, a timing-driven incremental router is used to reroute the paths affected by the faults. To speed up fault recovery, information from the initial design route is used to guide the rerouting process. Since many embedded systems have limited local computational resources, a network-based recovery system has been developed in which a computationally more powerful server performs the FPGA fault recovery and sends the results back to the affected client, completing the recovery process. Experiments on the recovery system have shown that the incremental router provides a speedup of up to 12x compared with a commercial incremental flow.
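The definition of a path delay fault given above can be illustrated with a minimal sketch. It assumes, purely for illustration, that a path's delay is the sum of per-segment delays and that those delays are already known; in the dissertation the faulty paths are located on the device by BIST, not computed this way. The function name `find_delay_faults`, the path names, and the delay values are all hypothetical.

```python
def find_delay_faults(paths, max_delay_ns):
    """Return the names of paths whose total delay exceeds the allowable delay.

    paths: dict mapping a path name to a list of segment delays (ns).
    Assumption: path delay is modeled as the sum of its segment delays.
    """
    faulty = []
    for name, segment_delays in paths.items():
        if sum(segment_delays) > max_delay_ns:
            faulty.append(name)
    return faulty


# Hypothetical delays: a permanent fault has slowed one segment of p1.
paths = {
    "p0": [1.0, 2.0, 1.5],
    "p1": [2.0, 2.5, 1.0],
}
faults = find_delay_faults(paths, max_delay_ns=5.0)
```

In this toy model, only the paths returned by the check would need to be handed to the incremental router; the remaining routes stay untouched.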
For VLIW processors, we focus on recovering from permanent faults in registers. To maintain VLIW functionality after faulty registers are detected, programs must be recompiled so that variables are assigned only to fault-free registers. One issue with recompilation is a possible performance loss, because fewer registers remain available to meet the program's register requirements. To address this problem, a register pressure control technique is presented to reduce register requirements. To demonstrate its advantages, the technique has been integrated into an academic VLIW compiler. Experimental results have shown that the technique improves performance by 14% compared with an academic VLIW flow.
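The effect of removing faulty registers from the allocatable set can be sketched as follows. This is not the register pressure control technique of the dissertation; it is a simplified linear-scan style assignment, swapped in only to show why recompiling around a faulty register can increase spilling. The function name `allocate`, the live intervals, and the register count are hypothetical.

```python
def allocate(intervals, num_regs, faulty=frozenset()):
    """Assign variables to registers, never using a faulty register.

    intervals: dict mapping a variable to its live range (start, end),
               with start inclusive and end exclusive.
    Returns (assignment, spilled), where spilled lists the variables
    that could not be given a register.
    """
    free = [r for r in range(num_regs) if r not in faulty]
    active = []  # (end, variable, register) for currently live variables
    assignment, spilled = {}, []
    for var, (start, end) in sorted(intervals.items(), key=lambda kv: kv[1][0]):
        # Release registers held by variables that are no longer live.
        still_active = []
        for e, v, r in active:
            if e <= start:
                free.append(r)
            else:
                still_active.append((e, v, r))
        active = still_active
        if free:
            reg = free.pop(0)
            assignment[var] = reg
            active.append((end, var, reg))
        else:
            spilled.append(var)  # no fault-free register is available
    return assignment, spilled


# Hypothetical live ranges on a 3-register machine.
intervals = {"a": (0, 4), "b": (1, 3), "c": (2, 5), "d": (4, 6)}
ok_assign, ok_spilled = allocate(intervals, num_regs=3)
flt_assign, flt_spilled = allocate(intervals, num_regs=3, faulty={1})
```

With all three registers usable, every variable in this example receives a register; once register 1 is marked faulty, one variable must be spilled. Spills of this kind are the performance loss that a technique for reducing register requirements aims to limit.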
0984: Computer science