On-chip monitoring infrastructures and strategies for multi-core and many-core systems
On-chip sensors are widely used in processors to closely monitor system temperature, performance, and supply power fluctuation, among other environmental conditions. As the number of cores integrated on a single die increases, the total number of on-chip sensors increases correspondingly. Information from these sensors needs to be collected and processed efficiently and effectively at run-time to achieve high performance and low power consumption at the system level. This dissertation examines dedicated infrastructures for sensor data collection, sensor measurement calibration, and the use of sensor information to overcome thermal emergencies, voltage droops, and soft errors. These problems are addressed in both multi-core and many-core environments.
This dissertation first shows that a dedicated on-chip monitoring infrastructure (monitor network-on-chip, MNoC) can achieve better performance than a bus in interconnecting on-chip sensors in multi-core systems. Our experimental results show that this dedicated on-chip network can provide consistently low latency for sensor data packets without affecting application on-chip network traffic. The necessity of a dedicated infrastructure for monitoring is then addressed in a many-core environment. A two-level hierarchical network-on-chip (NoC), which allows for efficient sensor data collection in many-core systems, is introduced. This design is evaluated using benchmark-driven simulations for a three-dimensional many-core system. For a 256-core system, the two-level NoC is shown to improve sensor data latency by an average of 65% versus a flat sensor data NoC structure.
As the number of on-chip sensors increases, the accuracy of these sensors' measurements becomes increasingly important, since it directly affects processor performance and reliability. For example, on-chip thermal sensors are used for monitoring system temperature, and their measurements may affect the system frequency and operating voltage. This dissertation introduces a new approach that determines models for imprecise thermal sensor measurements using probability distributions based on device parameters. Thermal measurements that are determined to be imprecise can be excluded from thermal management strategies. The collection of on-chip sensor measurements is facilitated by dedicated on-chip monitoring infrastructures. Experiments show that a sensor operating outside a desired precision can be identified with a detection rate of 87% and an average false alarm rate of less than 6%, at a confidence level of 90%.
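As a rough illustration of excluding imprecise thermal measurements, the following minimal sketch flags a sensor whose average error falls outside a two-sided confidence interval. It is not the model developed in the dissertation: the Gaussian error assumption, the availability of a trusted reference temperature, and the names `is_imprecise`, `usable_readings`, and `sigma` are all hypothetical choices made for this example.

```python
from statistics import NormalDist, mean


def is_imprecise(readings, reference, sigma, confidence=0.90):
    """Return True if the sensor's mean error lies outside the confidence interval.

    Assumptions (not from the source): measurement error is zero-mean Gaussian
    with standard deviation `sigma`, and `reference` is a trusted temperature.
    """
    # Two-sided critical value, e.g. about 1.645 for 90% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)
    mean_error = mean(readings) - reference
    # Standard error of the mean shrinks with the number of samples.
    bound = z * sigma / len(readings) ** 0.5
    return abs(mean_error) > bound


def usable_readings(sensors, reference, sigma):
    """Keep only the averaged readings of sensors that pass the precision check."""
    return {
        name: mean(values)
        for name, values in sensors.items()
        if not is_imprecise(values, reference, sigma)
    }
```

In this sketch a thermal manager would consume only the dictionary returned by `usable_readings`, so a sensor flagged as imprecise never influences frequency or voltage decisions.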
The introduction of dedicated infrastructures for on-chip monitoring opens the door to more advanced run-time system control strategies. In this dissertation, run-time voltage droop compensation and soft error protection in multi-core systems are targeted. High voltage droops in modern processors may cause serious reliability problems. To address this issue, a voltage droop compensation method that accounts for ambient temperature changes is proposed. In the proposed method, different reduced frequencies are used at different temperatures. A voltage droop signature sharing method for multi-core systems is also proposed for early detection and remediation of high voltage droops. These two methods are combined and implemented in an 8-core system, achieving an average performance benefit of 5%.
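The combination of temperature-dependent reduced frequencies and shared droop signatures might be sketched as follows. This is an illustrative sketch only: the temperature bins, the frequency values, the string-valued signatures, and the names `SignatureTable` and `select_frequency` are hypothetical placeholders, and the table values imply nothing about how frequency actually varies with temperature in the dissertation's design.

```python
from bisect import bisect_right

# Hypothetical temperature bin edges (degrees Celsius) and the reduced
# frequency (MHz) applied in each bin; all values are placeholders.
TEMP_BOUNDS_C = [50.0, 70.0, 90.0]
REDUCED_FREQ_MHZ = [1900, 1800, 1700, 1600]
NOMINAL_FREQ_MHZ = 2000


class SignatureTable:
    """Droop signatures shared among cores (assumed here to be hashable values)."""

    def __init__(self):
        self._signatures = set()

    def share(self, signature):
        """Record a signature that preceded a high droop on some core."""
        self._signatures.add(signature)

    def matches(self, signature):
        """Return True if any core has already reported this signature."""
        return signature in self._signatures


def select_frequency(temp_c, signature, table):
    """Pick the operating frequency for the next interval.

    The core runs at nominal frequency unless its current signature matches
    a shared one, in which case a temperature-dependent reduced frequency
    is chosen from the lookup table.
    """
    if not table.matches(signature):
        return NOMINAL_FREQ_MHZ
    return REDUCED_FREQ_MHZ[bisect_right(TEMP_BOUNDS_C, temp_c)]
```

Because the table is shared, a signature recorded by one core lets the other cores slow down before they experience the same droop, which is the intent of early detection as described above.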
Soft errors are caused by alpha particles and cosmic rays, among other sources, and can be detected through redundancy in processor components. An approach to selectively enable redundancy to combat soft errors is also proposed in this dissertation. Both dual modular redundancy (DMR) and chip-level redundant threading (CRT) are used for adaptive redundancy protection. Both methods achieve power and energy savings of over 8% compared to conventional methods. A multi-core architecture vulnerability factor (AVF) is also calculated for a multi-core environment, using the MNoC infrastructure.
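To make the AVF concept concrete, the sketch below uses the commonly cited definition of AVF for a hardware structure: the fraction of bit-cycles occupied by bits required for architecturally correct execution (ACE bits). How the dissertation computes its multi-core AVF, and how it decides when to enable DMR or CRT, is not stated here; the per-core sampling, the threshold rule, and the names `avf` and `cores_needing_redundancy` are assumptions made purely for illustration.

```python
def avf(ace_bits_per_cycle, total_bits):
    """Architecture vulnerability factor of one structure.

    AVF = (sum of ACE bits resident in each cycle) / (total bits * cycles).
    """
    cycles = len(ace_bits_per_cycle)
    if cycles == 0 or total_bits <= 0:
        raise ValueError("need at least one cycle and a positive bit count")
    return sum(ace_bits_per_cycle) / (total_bits * cycles)


def cores_needing_redundancy(per_core_samples, total_bits, threshold=0.3):
    """Return the cores whose AVF exceeds a (hypothetical) threshold.

    `per_core_samples` maps a core id to its per-cycle ACE bit counts, as
    might be gathered over a monitoring network. The threshold policy is an
    assumption for this sketch, not the dissertation's selection rule.
    """
    return [
        core
        for core, samples in sorted(per_core_samples.items())
        if avf(samples, total_bits) > threshold
    ]
```

Under this assumed policy, redundancy would be enabled only on the cores returned by `cores_needing_redundancy`, leaving less vulnerable cores unprotected to save power.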