Networking issues in distributed real-time systems
Networking encompasses every aspect of designing the network infrastructure, from the selection or synthesis of the interconnection topology to the choice of communication protocols and the way the network is deployed and maintained. A large body of literature is available on these issues. We contribute to it by examining two specific issues: the synthesis of networks that satisfy multiple properties and the design of fault-tolerant communication services for high-speed networks.
Synthesizing networks that satisfy multiple requirements, such as high reliability, low diameter, and good embeddability, is a difficult problem for which no completely satisfactory solution exists. Our approach is a simple filtration process that takes as input a large number of randomly generated graphs. Multiple filters, one per requirement, are arranged so that each feeds the next, and the final output is a short list of networks from which the designer can choose. Our experimental results show that this approach is both practical and powerful. Perhaps our biggest achievement here is showing that this seemingly simple approach can generate networks that are serious competitors to several well-known traditional networks. We further highlight the practical applicability of these networks by considering how they can be used effectively in a packaging environment.
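The filtration idea can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the procedure used in this work: the function names (`random_graph`, `filter_chain`, `shortlist`), the uniform random edge sampling, and the two example filters. In particular, a minimum-degree test stands in as a crude proxy for a reliability requirement, and a breadth-first-search diameter bound stands in for the low-diameter requirement.

```python
import itertools
import random
from collections import deque


def random_graph(n, m, rng):
    """Return an adjacency map for a graph with n nodes and m random edges."""
    edges = rng.sample(list(itertools.combinations(range(n), 2)), m)
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    return adj


def eccentricity(adj, src):
    """BFS from src; return the largest hop count, or None if a node is unreachable."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values()) if len(dist) == len(adj) else None


def diameter(adj):
    """Return the diameter, or None if the graph is disconnected."""
    eccs = [eccentricity(adj, v) for v in adj]
    return None if None in eccs else max(eccs)


def max_diameter(limit):
    """Filter: keep connected graphs whose diameter is at most limit."""
    def keep(adj):
        d = diameter(adj)
        return d is not None and d <= limit
    return keep


def min_degree(k):
    """Filter: keep graphs in which every node has at least k neighbours
    (an assumed, simplified stand-in for a reliability requirement)."""
    return lambda adj: all(len(nbrs) >= k for nbrs in adj.values())


def filter_chain(candidates, filters):
    """Feed candidates through the filters in order; each filter feeds the next."""
    stream = candidates
    for f in filters:
        stream = filter(f, stream)
    return list(stream)


def shortlist(n, m, trials, filters, seed=0):
    """Generate `trials` random graphs and return those that pass every filter."""
    rng = random.Random(seed)
    candidates = (random_graph(n, m, rng) for _ in range(trials))
    return filter_chain(candidates, filters)
```

A call such as `shortlist(8, 12, 200, [min_degree(2), max_diameter(3)])` returns only the candidates that pass both filters. Placing the cheaper filter first is one plausible ordering, since it reduces the number of graphs that reach the more expensive diameter computation.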
The interconnection network can have a dominant effect on the reliability of a distributed system. While existing network software has been optimized for performance, it has not dealt effectively with network failures. We have developed a lightweight fault detection and recovery technique that provides coverage for almost all network interface failures. Detection is based on software watchdog timers, and recovery is based on delta-logging. We have implemented the schemes as a fault tolerance layer over Myrinet, a commercially available networking technology. The implementation showed that a fault detection time of 1 ms and a complete recovery time of around 0.5 seconds can be achieved with a performance impact of less than 10%. The effectiveness of our fault tolerance schemes was evaluated using a versatile performance and recovery analysis tool called RAPIDS.
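The two mechanisms named above can be sketched generically. The following is a minimal illustration under stated assumptions, not the Myrinet implementation: the class names `Watchdog` and `DeltaLog`, the injected `clock` function, and the key/value state model are all hypothetical. The watchdog declares a fault when the monitored component has not reported progress within a timeout, and the delta log keeps only the changes made since the last checkpoint so that state can be rebuilt after a reset.

```python
class Watchdog:
    """Software watchdog: reports a fault if it is not kicked within `timeout`."""

    def __init__(self, timeout, clock):
        self.timeout = timeout      # seconds allowed between kicks
        self.clock = clock          # callable returning the current time (assumed)
        self.last_kick = clock()

    def kick(self):
        """Called by the monitored component to signal that it is still alive."""
        self.last_kick = self.clock()

    def expired(self):
        """Return True if the component has been silent for longer than `timeout`."""
        return self.clock() - self.last_kick > self.timeout


class DeltaLog:
    """Checkpoint plus a log of the changes (deltas) made since that checkpoint."""

    def __init__(self, initial):
        self.checkpoint = dict(initial)
        self.deltas = []

    def record(self, key, value):
        """Log one state change instead of copying the whole state."""
        self.deltas.append((key, value))

    def restore(self):
        """Rebuild the current state by replaying the deltas over the checkpoint."""
        state = dict(self.checkpoint)
        for key, value in self.deltas:
            state[key] = value
        return state

    def take_checkpoint(self):
        """Fold the logged deltas into a new checkpoint and clear the log."""
        self.checkpoint = self.restore()
        self.deltas = []
```

In this sketch a recovery routine would poll `expired()`, reset the failed interface when it returns True, and then reload the state returned by `restore()`. Passing the clock in as a parameter keeps the timing logic testable without real delays.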
0984: Computer science