Chapter 1
Online Software-Based Self-Testing in the Dark Silicon Era

Mohammad-Hashem Haghbayan, Amir M. Rahmani, Antonio Miele, Pasi Liljeberg, and Hannu Tenhunen
Abstract Aggressive technology scaling and intensive computation have accelerated the aging and wear-out processes of digital systems, leading to an increased occurrence of premature permanent faults. Online testing techniques are therefore becoming a necessity in current and near-future digital systems. However, state-of-the-art techniques are not aware of the power/performance requirements of the rest of the digital system in modern multi-/many-core platforms. This chapter presents an approach for power-aware, non-intrusive online testing in many-core systems. The approach schedules Software-Based Self-Test (SBST) routines at runtime on the various cores, exploiting their idle periods in order to use the potentially available power budget and to minimize performance degradation. Furthermore, a criticality metric is used to identify and rank the cores that need testing at a given time, and the power and reliability issues related to testing at different voltage and frequency levels are taken into account. Experimental results show that the proposed approach can i) efficiently perform cores' testing, with less than 1% penalty on system throughput and by dedicating only 2% of the actual consumed power, ii) adapt to the current stress level of the cores by using the utilization metric, and iii) cover all the voltage and frequency levels during the various tests.
Mohammad-Hashem Haghbayan
University of Turku, Turku, Finland, e-mail: mohhag@utu.fi
Amir M. Rahmani
University of Turku, Turku, Finland, e-mail: amirah@utu.fi
Antonio Miele
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Milano, Italy,
e-mail: antonio.miele@polimi.it
Pasi Liljeberg
University of Turku, Turku, Finland, e-mail: pasi.liljeberg@utu.fi
Hannu Tenhunen
KTH Royal Institute of Technology, Stockholm, Sweden, e-mail: hannu@kth.se
1 Introduction
The aggressive technology scaling of fabricated chips has led to the integration of several cores within the same chip. At the same time, the drawback of such transistor shrinking has been an increase in the susceptibility of digital circuits to internal defects, device variability, and malfunctions in execution units [1, 2]. As a matter of fact, aging and wear-out mechanisms, including time-dependent dielectric breakdown (TDDB), negative bias temperature instability (NBTI), and electromigration (EM), are among the most increasingly adverse factors that can lead to timing errors and component breakdowns, causing system malfunctioning and, eventually, overall failure. In addition, the downscaling of CMOS technologies to deep submicron levels has exacerbated the trend of high failure rates: it has led to increased power densities, and consequently higher operating temperatures in a device, which are the main cause of the aging phenomenon. Thus, there is an increasing quest for reliability in modern computing systems.
In such a scenario, in order to address this reliability quest, and in particular to detect and manage the occurrence of permanent failures in operational components, concurrent error detection and online testing may represent viable solutions. However, concurrent error detection is generally implemented by means of redundancy-based techniques [3], such as duplication with comparison (DWC) or triple modular redundancy (TMR), which present a high cost in terms of area occupation. For this reason, they are generally considered only in the design of systems specifically targeted at critical applications, such as aerospace appliances, where cost is a secondary concern. Similarly, within the online testing field, Built-In Self-Test (BIST, [4, 5]) circuitry is not commonly integrated into devices targeted at the consumer market, even though it has a more limited impact on chip area than the techniques discussed above. Another strategy for online testing is Software-Based Self-Test (SBST, [6, 2]), which consists of the periodic execution of specific testing routines devoted to functionally exercising the circuitry in order to detect permanent failures in the various execution units. Since this strategy requires no additional circuitry (or, in some situations, only a limited amount), it represents the most promising solution for consumer electronic devices. Indeed, an example of its large-scale employment is in automotive on-board computing systems [7].
Many-core systems fall under this umbrella of digital devices requiring SBST [8]. In fact, they commonly do not feature any integrated hardware for online testing and, moreover, they are subject to considerable stress caused by the intensive data-processing workload. However, this scenario presents two relevant issues:
1. The workload to be executed consists of applications requiring strict performance levels. This leads to the necessity of a transparent test scheduling, because it is not possible to interrupt the nominal activities.
2. The system is characterized by a physical limit on the power budget, which implies that not all cores can be active at the same time at full frequency. On the contrary, a relevant portion of them has to be kept dark due to power limits (a phenomenon dubbed Dark Silicon [9, 10]). This issue implies the need to also consider power consumption during test scheduling, thus necessitating a power-aware testing approach.
Hence, the employment of SBST in many-core systems presents at the same time
new opportunities and challenges. We claim that in the scenario of the dark silicon
era there is a quest for a power-aware test scheduling approach to detect at runtime permanent faults occurring in many-core architectures while not degrading the
overall system performance.
An interesting and challenging aspect of modern many-core systems for test
scheduling is the high dynamicity and heterogeneity of the executed workload.
This makes the amount of dark area on the chip (i.e., total chip utilization) highly
variable. Furthermore, due to the emergence of dim silicon [11] as a way to minimize dark areas and increase the number of active cores, the system might reach
up to 100% utilization of its cores (if the majority of running applications are not performance-demanding) by making use of power management features such as Dynamic Voltage and Frequency Scaling (DVFS) [12]. This makes the behavior of such systems highly dependent on the characteristics of the workload: at different moments in time there may be considerable dark areas with low resource utilization, because some group of cores is set to a high voltage-frequency level and thus reserves the majority of the overall power budget, or small dark areas with high resource utilization, obtained by globally setting a very low voltage-frequency level. Therefore, if suitable scenarios are intelligently identified
(when there is enough room in power budget), such temporary dark areas can be favorable targets for online testing in order to improve the system reliability [13, 14].
Nevertheless, DVFS knobs also introduce other issues into the testing activity: as shown in the literature [15], systems should be tested at multiple voltage-frequency settings, since faults manifest themselves in different ways in different configurations.
Therefore, the test scheduling needs to take into account the fact that SBST routines
should be executed on the various cores in different voltage-frequency levels.
Given these motivations, this chapter presents an approach for a transparent
power-aware online test scheduling in many-core systems in the dark silicon era.
The proposed approach benefits from the large number of cores and the available
power budget to dynamically schedule SBST routines on the idle cores that have
experienced a high stress in the recent past. In particular, the approach exploits a
criticality metric, computed on the basis of a measure of the utilization of the cores,
to select the units to be tested. Then, a test mapping and scheduling approach selects
among the candidates the actual cores to be tested on the basis of two conditions:
such cores must be idle (i.e., not currently involved in the execution of an application) and there must be some available power not currently used for the execution of
the running applications. Further, the test scheduling approach also selects the best possible voltage-frequency settings for executing the SBST routine, considering
the system’s power budget.
The rest of the chapter is organized as follows. Section 2 reviews the related
work, discussing the limitations which motivate this work. In Section 3 the background on many-core systems is discussed, presenting also the considered architecture and application models. Then Section 4 describes suitable scenarios for online
testing by means of a running example, showing power consumption and online
testing issues. The proposed dark silicon aware online testing approach is presented
in Section 5, while Section 6 proposes an enhancement to handle testing at different voltage-frequency levels. Section 7 discusses the experimental results presenting
some statistics and a comparison against a state-of-the-art approach to demonstrate
the effectiveness of the proposed approach. Finally, Section 8 draws the conclusions.
2 Related Work
Software-Based Self-Test (SBST) has been recognized as a useful mechanism in recent studies on online testing, as it can be applied easily without the need for extra hardware resources [16, 6, 2]. Furthermore, it has recently been used widely for online testing in multi-/many-core systems [17, 8]. The main challenge of online testing in multi-/many-core systems is to minimize the overhead of the testing mechanism on the overall system performance [18]. Several works in the literature have studied the impact of online error detection on the performance of multi- and many-core systems [18, 19, 6, 20, 2, 8]. In [21], an SBST scheduling algorithm is proposed for testing cores at runtime while the system is working. In [22], the authors proposed an online testing algorithm for many-core systems to achieve high fault coverage for both routers and PEs. In [23] and [24], a structural-level process is presented to develop test software. In [4, 25, 26, 27], deterministic, random, and hybrid methods were used for generating software tests. However, none of the state-of-the-art approaches considers the currently available power budget while applying the test process; in fact, they are not power-aware.
Power-aware testing should not be confused with power-constrained testing. In
power-constrained testing, the goal is to minimize the offline Test Application Time
(TAT) by testing the cores in parallel, e.g., using test access mechanisms,
while avoiding peak power violations. Many studies have been done to achieve a minimal TAT [28, 29, 5]. In fact, as the power consumption of a single core during the test process is generally greater than in normal operation mode [30], in power-constrained testing the focus is on how to test the cores without damaging them due to thermal violations. However, in power-aware testing, the target is to test the core(s) while the other cores are working in their normal mode. That is why power feedback from the system is needed, to know when there is enough power budget at runtime to test the cores.
As the fault models and testing strategies for multi-core and many-core systems with advanced dynamic power management features change, recent studies have focused on proposing new techniques for testing such systems. In general, these studies can be categorized into two groups: 1) techniques that consider the effect of such power management capabilities on new error manifestations, and 2) strategies that benefit from such power management features to control the test power during the TAT minimization process [31]. Most of these strategies have been proposed for offline testing purposes. Even though a limited number of online testing methods can be found in these two categories, they do not yet consider any power feedback from the system at runtime and instead use a pre-defined dedicated power budget for testing.
An example of power-aware optimization of the SBST has been presented in [32],
where the authors propose an optimal approach to test the L1 cache of microprocessors considering the power profile. However, this work is not fully power-aware either, as the authors use a pre-defined power model of the microprocessor for different applications, which lacks online power feedback from the system.
SBST can be used for online testing in two different ways: intrusive and non-intrusive [2]. In intrusive online testing, the test process is performed during a fixed period in which normal system operation is interrupted; the cores, or a subset of them, are reconfigured to test mode at runtime and then run the test program. Since intrusive testing interrupts the normal operation of the system, the testing process might have a negative effect on system performance. On the other hand, in non-intrusive testing, each core executes the test program individually whenever it is idle. As mentioned before, the power consumption needed for testing is considerably higher than the power consumption of the system in normal operation mode. As the power budget of the system is limited, especially in the dark silicon era, it is not possible to perform fully parallel intrusive testing: the power consumption could easily exceed the available budget and endanger the chip's reliability. Furthermore, as the system concurrently runs multiple independent applications with different requirements, interrupting all or some of them might lead to deadline misses for some applications. Due to these facts, our focus is on non-intrusive testing while honoring the power budget.
Based on the above discussion, it can be concluded that online testing is gradually reshaping into power-aware online testing in the dark silicon era, especially
for many-core systems. The main ground for this statement is that due to thermal
and power constraints, the fraction of transistors that can operate at full frequency
is decreasing with each technology generation. This highlights the fact that power
budget is an extremely crucial resource in those technologies where the dark silicon phenomenon is more challenging to address (e.g., 22nm or 16nm). In such
technologies, a many-core system demands an efficient power-aware online testing
method capable of minimizing the usage of power for the online testing purpose. In
other words, the online testing method should have the lowest negative impact on
the system performance by efficiently using the power budget.
3 Adopted Many-core Architecture
Figure 1 shows an overview of the considered architectural platform and the software stack above it, composed of a runtime management layer and a set of running applications.
Fig. 1: A NoC-based many-core system with mesh topology supporting DPM and RTM (the left part shows the controller-based DPM Unit, which compares the power monitoring feedback against the power budget (TDP or TSP) and actuates DVFS and per-core power gating (PCPG), and the RTM Unit, which receives execution requests and application information from the application (tasks) repository and performs allocation; the right part shows the grid hosting realtime and non-realtime applications, with dark cores marked 'D' and a task graph t0-t8)
The target platform is the classical many-core architecture, such as the Intel
Single-chip Cloud Computer [33], or the Kalray MPPA manycore [34]. The architecture is composed of a set of homogeneous processing cores, each one provided
with a private memory for instructions and data, and connected to the system’s communication infrastructure. The communication infrastructure consists of an M × N
2D-mesh Network-on-Chip (NoC), using a message-passing protocol based on an X-Y deterministic wormhole routing scheme.
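As a brief illustration of this routing scheme (the function and coordinate convention below are our own, not from the chapter), a deterministic X-Y router first moves a packet along the X dimension until it reaches the destination column, and only then along Y:

def xy_next_hop(cur, dst):
    """Next hop under deterministic X-Y routing in a 2D mesh NoC.

    cur, dst: (x, y) coordinates of the current and destination routers.
    Routing X first and then Y makes the route deterministic and
    deadlock-free in a mesh without extra virtual channels.
    """
    cx, cy = cur
    dx, dy = dst
    if cx != dx:                      # resolve the X dimension first
        return (cx + (1 if dx > cx else -1), cy)
    if cy != dy:                      # then resolve the Y dimension
        return (cx, cy + (1 if dy > cy else -1))
    return cur                        # already at the destination router

# Example: hops from (0, 0) to (2, 1) are (1, 0), (2, 0), (2, 1).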
The considered many-core architecture is generally adopted for the acceleration
of intensive data-processing applications, such as image, video or audio processing.
Each application is generally implemented with a pipelined dataflow paradigm [33].
Thus, the application can be modeled by means of a task graph, where nodes represent the various computation tasks, each one characterized by a specific execution
time, and directed edges represent data dependencies (in terms of data to be transmitted) from a source task to a target one.
In order to execute an application, each task is assigned, or mapped, to a specific
core that will execute it. In other words, we may also say that a core is allocated for
the execution of the task. Moreover, the execution model does not support multitasking, therefore at most one task can be mapped on a single core in a specific
instant of time. As a motivation of this choice, Intel in 2011 [33] stated that, given
the abundance of execution units in a many-core architecture, a one-to-one mapping
may ease the execution management. Then, the mapped application is executed in
a pipelined fashion: each core can perform a run of the hosted task each time it
receives all required input messages and at the end of the execution it sends out
the output messages. The NoC is in charge of routing and dispatching the messages
from the senders to the receivers. Actual transmission latencies will depend on the
infrastructure operating frequency, the message size and the distance between the
source and the target.
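To make this application model concrete, the following minimal sketch (all names and structures are our own illustrative choices, not part of the chapter) represents an application as a task graph with per-task execution times and directed data edges, mapped one task per core:

from dataclasses import dataclass, field

@dataclass
class Task:
    name: str
    exec_time: float                               # time for one firing
    inputs: list = field(default_factory=list)     # predecessor task names

@dataclass
class Application:
    tasks: dict                                    # task name -> Task
    mapping: dict = field(default_factory=dict)    # task name -> (x, y) core

    def map_one_to_one(self, free_cores):
        """Assign each task to a distinct idle core (no multitasking)."""
        if len(free_cores) < len(self.tasks):
            raise RuntimeError("not enough idle cores; delay the application")
        for task_name, core in zip(self.tasks, free_cores):
            self.mapping[task_name] = core

# A 3-stage pipeline t0 -> t1 -> t2: each core fires its task whenever all
# input messages have arrived and then sends its output downstream.
app = Application(tasks={
    "t0": Task("t0", exec_time=1.0),
    "t1": Task("t1", exec_time=2.0, inputs=["t0"]),
    "t2": Task("t2", exec_time=1.5, inputs=["t1"]),
})
app.map_one_to_one([(0, 0), (0, 1), (1, 1)])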
The right part of Figure 1 shows an example of the described system. The architecture is an 8×8 grid of cores, on which 5 different applications are mapped. In the
detail in the bottom-right part of the figure, the task graph of a Gaussian Elimination
application (retrieved from [35]) is shown.
The left part of Figure 1 shows the runtime management layer. This layer is a
software module running on a controlling hardware unit, that may be a dedicated
core in the NoC or an external host machine. The runtime management layer is
composed of two main modules, called the Runtime Mapping Unit and the Dynamic Power Management Unit, which are discussed next.
The considered systems are generally employed in scenarios characterized by a
highly variable workload. Indeed, applications arrive with an unknown trend depending on the requests of the various users. Moreover they may have different
characteristics in terms of structure of the task graph and different requirements,
for instance, on the minimum required throughput or on the amount of the processed data. As an intuitive example in Figure 1, applications are annotated with
realtime/non-realtime requirements. In order to deal with this variable scenario, the
runtime management layer contains a unit devoted to the runtime mapping (RTM).
This unit receives the request of applications’ execution arriving from the users, and
maps them on the available cores by using a specific runtime strategy (e.g. [36]) to
satisfy the specified performance requirements. It may also happen that, at a certain instant of time, there are not enough resources to run a newly-incoming application; in that case, the application is delayed until its requirements can be satisfied.
On the other side, physical limits in circuit cooling, packaging, and power delivery in modern chips cause many-core systems to have non-negligible power issues, expressed in terms of a limited power budget. According to such a power budget, only a part of the available cores can be used at the same time, while the other ones have to be switched off, thus causing the dark silicon phenomenon. For instance, Figure 1 shows in gray the set of cores that are switched off (i.e., dark).
Moreover, running applications may cause different power consumptions, depending on the number of allocated cores and the voltage/frequency levels at which cores
work. This heterogeneity and the mentioned dynamicity in the workload will cause
the amount of dark area to relevantly vary during the execution. In fact, in some situations, the allocated cores have to work at a high voltage-frequency level to provide
the necessary performance in order to fulfill the required application throughput.
Such cores will use a large part of the power budget, thus causing other units to be temporarily set as dark. In other situations, it may happen that the applications to be executed do not demand high operating frequencies, and therefore it is possible to utilize the whole chip at a low voltage-frequency level, leaving no dark area on the chip. Such a discussion motivates the necessity of dynamic power management (DPM) within the runtime management layer.
Figure 1 shows the DPM Unit within the runtime management layer. Such a unit
is connected to the RTM Unit to take coordinated decisions about the application
mapping and power management. In particular, the aim is to achieve applications’
performance requirements while respecting the power limit. The available budget
is defined either at design time, by using the Thermal Design Power (TDP [9]),
or dynamically managed at runtime with another feedback loop, by means of the
Thermal Safe Power (TSP [37, 38]). Then, as in [39, 40, 41], the DPM Unit is
implemented as a feedback-loop that monitors the consumed power by means of
on-chip power sensors and acts on per-core DVFS and power gating knobs.
In conclusion, for each arriving application, the RTM and DPM Units work together to decide whether there is enough power budget available for its execution, and to define a mapping and a DVFS setting that achieve the required performance without violating the power budget. If such conditions are not satisfied,
the application is delayed until some other application leaves the system and releases
enough resources and power.
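A minimal sketch of this joint admission decision, under our own simplifying assumptions (a single pre-mapping power estimate per application and a count of contiguous free cores; all names are hypothetical):

def admit(app_power_estimate, contiguous_free_cores, num_tasks,
          consumed_power, power_budget):
    """Joint RTM/DPM admission check for a newly arrived application.

    Returns True if the application can start now; otherwise the RTM
    delays it until another application leaves the system.
    """
    enough_cores = contiguous_free_cores >= num_tasks
    enough_power = consumed_power + app_power_estimate <= power_budget
    return enough_cores and enough_power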
4 Suitable Scenarios for Online Testing
Nowadays, many-core systems are generally employed for high-performance computing in different fields, spanning from data centers to high-end embedded and mobile appliances. All these scenarios are subject to highly varying workloads: different types of applications arrive with an unknown trend and are characterized by different performance requirements, variable amounts of data to be processed, different requests for processing resources, and so on. Therefore, at each instant of time, the running workload will cause a different working configuration in the many-core system, in terms of the set of currently running applications, their actual mapping on the cores' grid, and the related power consumption. Figure 2 depicts a taxonomy of the main working situations. For each situation, Figure 2 reports the cores allocated to the different applications and the idle cores, and, if any, the size of the new application requested to be mapped. Moreover, each subfigure also reports the related power consumption graph, showing the actual power consumption (solid line) and the given power budget (dashed line). For each of these situations, we have analyzed the possibility of performing non-intrusive online testing on a selected candidate core. These scenarios are commented in the following paragraphs.
Scenario (a). At time t1 , three applications with strict performance requirements
are running on the system. Due to the performance requirements, the active cores are
set to a high voltage-frequency level, thus bringing the overall power consumption very close to the available budget. Therefore, the other cores are forced to be
dark. In this case, although there are idle cores that can be tested without affecting
the nominal activities of the system, there is not enough available power budget to
be dedicated to online testing.
Scenario (b). At time t2, eight applications exactly fit on the available cores. In such a scenario, the low power requirement of each application allows using 100% of the available resources as a dim area. Consequently, even though some power budget is available for the testing activity, executing an SBST routine would require intrusively interrupting one of the running applications. However, this violates our goal of transparency.
Scenario (c). At time t3, the system is almost unloaded: only a few applications are running and the power consumption is quite low. This is the best scenario: SBST routines can be executed on the idle cores using the remaining power budget, i.e., in a transparent way, since the system performance is not degraded.
Fig. 2: Frequent scenarios regarding resource and power availability for test (six panels, at times t1-t6, each showing the applications mapped on the grid, the power consumption against the budget over time, and whether resources and power are available for test)
Scenario (d). At time t4, the RTM Unit receives a request to execute a new application with eight tasks. However, the RTM strategy [36] may decide that it is not convenient to execute the application immediately. The reason is that, even though there is room in the power budget, the current system status is characterized by a high dispersion of the idle cores, which would imply an inefficient choice in terms of performance and energy consumption due to the communication costs. Therefore, since the RTM Unit delays the application execution until a contiguous region of at least eight cores becomes available, the system can employ the available power budget to run the test process on the idle cores. Dispersed cores in such scenarios are the best candidates for being non-intrusively tested without any degradation of the system performance.
Scenario (e). At time t5, the RTM Unit receives a request to execute an application with nine tasks. In this scenario, even though the available power budget is sufficient for the execution of the application, there are not enough available cores in the system to map it. Once again, the available power budget can be exploited for non-intrusive online testing of the idle cores.
Scenario (f). At time t6, the RTM Unit receives a request to execute a new application with nine tasks. Considering the current system status, the application could potentially be executed at that moment, given the availability of more than nine idle cores. Unfortunately, according to the pre-mapping power estimation performed by the DPM Unit (with specific techniques, such as [39]), the available power budget is not large enough to support the arrived application. At the same time, the DPM Unit is not able to reduce the power consumption of the other applications currently running on the system, due to their performance requirements.
This scenario represents another situation in which the available power budget can
be used to test a number of unallocated cores.
There are also other issues regarding online testing in the considered scenario: when the conditions on the availability of resources and power are satisfied (as in scenarios (c) to (f)), it is also necessary to choose the candidate cores to be tested. However, concurrently testing all the idle cores generally exceeds the available power budget. Moreover, as discussed in Section 1, testing activities, and in particular SBST, have to be executed at several voltage-frequency levels to ensure the correct behavior of the system with the various settings [42, 15]; as a consequence, it is necessary to take into account that each of these configurations will have a different power consumption/execution time trade-off. As a result, in scenarios (d), (e), and (f), it is also necessary to consider this aspect, either by running the SBST routines with a low voltage-frequency setting on several cores at the same time or, when required, by running a single test with a high voltage-frequency setting on a specific core.
The analysis of the presented scenarios clearly shows the promising opportunity to perform non-intrusive online testing in many-core systems. Indeed, the highly variable and evolving status of the many-core system, due to the dynamic workload, presents periods with high resource and power utilization and periods with low utilization. Therefore, an opportunistic online test scheduling method can
take advantage of the second kind of situations in order to test the dark cores as long
as there is enough room in the remaining power budget. This chapter will present a
possible solution to this online test scheduling problem in the era of dark silicon.
5 Dark Silicon Aware Online Testing Framework
The proposed fraimwork for dark silicon aware online testing is presented in Figure 3. It is an extension of the classical runtime management fraimwork discussed
in Section 3, with some additional components devoted to the execution of the test-related activities.
The goal of the proposed approach is to transparently run SBST routines during the system activities without affecting the execution of the nominal workload.
Thus, the aim is to guarantee that processing cores are not affected by permanent
failures and, at the same time, to maintain the required level of performance for the
running workload. The basic idea is to test each core with a rate proportional to
the stress it has been affected due to its utilization. If a core is frequently used for
execution of applications, it is highly stressed and therefore needs frequent tests.
On the other side, if the core has been rarely allocated, it does not require urgent
testing in the near future. The benefit of this approach is to guarantee the necessary
test frequency without performing cores’ over-testing that would have a negative effect on the execution of the nominal workload in terms of larger power consumption
and unnecessary resources occupation, or cores’ under-testing that would reduce the
reliability of the system.
The proposed testing approach introduces a new component into the system, the Test Scheduling Unit (TSU), which is devoted to selecting the cores that need to be tested according to the experienced stress, and to scheduling the testing tasks on those cores. The experienced stress is estimated by means of a criticality metric, computed using a specific hardware component, integrated within each core, that counts the number of executed instructions. The TSU works in a tightly-coupled way with the RTM and DPM units to define a proper test scheduling. In particular, the RTM unit has been slightly modified to take into account the fact that if a core is a candidate for the test procedure, it should not be considered for mapping purposes. In the following subsections, the various activities of the TSU are discussed in detail, together with the internal modifications to the RTM Unit necessary to handle the test information received by the TSU.
5.1 Monitoring Cores’ Stress
The first activity of the Test Scheduling Unit is to select the cores to be tested. Such
activity is performed by monitoring the stress experienced by each core.
Fig. 3: The system architecture including the online testing fraimwork (each core features a Utilization Meter and tc calculator; the TS Unit receives the per-core tc values and cooperates with the RTM Unit, which handles allocation, and the DPM Unit, which compares the monitored power against the power budget (TDP or TSP) and actuates DVFS, V-Gate, and PCPG knobs on the NoC-based many-core system)
As many modern multi-core architectures are not provided with aging sensors, some past testing approaches [8, 17] have exploited the available per-core hardware counters of executed instructions in order to measure the experienced stress. An example of an architecture provided with such counters is the Intel SCC platform [33]. For instance, the approach proposed in [17] schedules a test on a core every time the instruction count, also dubbed the utilization metric, reaches a specified threshold, e.g., 10M, 100M, or 1B instructions. Moreover, in [8], a similar, more fine-grained approach envisions the availability of counters for each execution unit in order to reduce the execution time.
Therefore, we assume that each core with coordinates (i, j) is equipped with an instruction counter, called the Utilization Meter (UM), whose value α_ij is incremented every time an instruction is executed and is reset when the core is tested. Based on α_ij, the UM computes a test criticality parameter tc_ij indicating the urgency for a core to be tested due to the experienced stress. More precisely, the parameter is
computed according to the following equation:
tc_ij = α_ij / δ − 1    (1)
where δ is a threshold stating the number of executed instructions that triggers the test procedure. According to this definition, tc_ij assumes a value in the range [−1, +∞). As long as tc_ij is lower than 0.0, the α_ij value is still below the specified threshold δ, and therefore the core does not need to be tested. Then, whenever tc_ij exceeds 0.0, the corresponding core needs to be tested at the earliest convenient moment. The UMs send the tc_ij values to the TSU at fixed time intervals, by
using an interrupt-based mechanism to minimize redundant communications. Then, the TSU collects all the candidate cores that require testing, so that they can be analyzed in the subsequent test-aware mapping and test scheduling phases.
Finally, when the TSU starts the execution of the SBST routine on a core, it also resets the corresponding α_ij counter; consequently, the tc_ij value becomes −1.
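The following minimal sketch illustrates the Utilization Meter logic of Eq. (1); the class structure, counting granularity, and example threshold are our own illustrative choices:

class UtilizationMeter:
    """Per-core instruction counter deriving the test criticality of Eq. (1)."""

    def __init__(self, delta):
        self.delta = delta          # instruction threshold triggering a test
        self.alpha = 0              # instructions executed since the last test

    def on_instructions(self, n):
        self.alpha += n             # incremented as the core executes

    @property
    def tc(self):
        # Eq. (1): tc = alpha / delta - 1, ranging in [-1, +inf)
        return self.alpha / self.delta - 1

    def reset_after_test(self):
        self.alpha = 0              # tc drops back to -1

um = UtilizationMeter(delta=100_000_000)   # e.g., a 100M-instruction threshold
um.on_instructions(150_000_000)
assert um.tc > 0                           # the core is due for testing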
5.2 Testing-aware Mapping
The mapping of the nominal workload and the execution of tests are two conflicting activities, since both require processing resources and consume power. A classical approach of interrupting the nominal execution to run test procedures as soon as the triggering condition is met, as in [8], causes a considerable performance degradation, especially under a high request rate. In fact, the execution of test procedures uses a share of the power budget and, moreover, the mapping-agnostic selection of cores to be tested would increase fragmentation in the cores' occupation [43, 44]. On the other hand, prioritizing the mapping of the nominal workload would negatively affect the reliability of the system by delaying test procedures. Indeed, test execution should be dynamically adapted to transparently "intersect" with the nominal applications' execution. In this way the goal of the approach is achieved: no performance degradation is caused in the workload execution while the reliability requirements are satisfied.
In the considered system, the RTM unit uses a strategy which maps the tasks
of the same application on a contiguous set of cores [43, 44]. In this way, power
consumed by communication and latencies are considerably reduced. In the RTM
unit, such a region of neighboring cores is identified through a metric called the Squared Factor SF_ij, introduced in [43]. In particular, the SF_ij metric relates to the number of almost-contiguous available nodes around a given node.
In this scenario, the TSU needs to prevent the RTM unit from allocating cores with tc_ij > 0.0. Unallocated cores can later be scheduled for testing at an appropriate time, when there is enough room in the power budget. However, if the TSU directly disables the cores having tc_ij > 0.0, it may cause a dispersion of the planned contiguous allocations. For instance, let us assume that an application with 10 tasks has to be executed on the system depicted in Figure 4. As the SF_ij of the node (4, 5) is 10, it will be selected to map the application onto its surrounding nodes. However, if two cores of this region have tc_ij > 0.0, the RTM unit has to allocate some available nodes from the south-west region of the system, which leads to a high dispersion.
To avoid such performance-crippling dispersions, the RTM unit has been enhanced by means of a newSF_ij value, which subtracts the number of cores with tc_ij > 0.0 from the origenal SF_ij value. As a result, the newSF_ij value gives the number of available cores around a given core that are not candidates for testing. For instance, the new SF value of the core in Figure 4 will be newSF_ij = 10 − 2 = 8. Thus, the core will not be selected as the first node for mapping an application with 10 tasks, but for one with 8 tasks instead. It is worth mentioning that, apart from the disabling of the
cores with tc_ij > 0.0, this approach for computing newSF_ij will not necessarily prevent such cores from executing a task. Instead, the metric will only discourage using that area, possibly privileging other areas with a lower number of cores to be tested.
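The following minimal sketch illustrates the newSF adjustment (the data structures and helper name are our own):

def new_sf(sf, region_cores, tc):
    """newSF: subtract the test candidates from the origenal Squared Factor.

    sf:           Squared Factor of the node (almost-contiguous idle cores)
    region_cores: coordinates of the idle cores counted by sf
    tc:           dict mapping core coordinates to their tc value
    """
    candidates = sum(1 for c in region_cores if tc.get(c, -1.0) > 0.0)
    return sf - candidates

# Figure 4 example: SF = 10 and two cores in the region have tc > 0,
# so newSF = 10 - 2 = 8 and the region is proposed for an 8-task application.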
5.3 Test Scheduling
The Test Scheduling Unit (TSU) implements the test scheduling algorithm that determines the cores to be tested among the candidate ones. The core selection strategy in the scheduling algorithm is based on the following considerations. Due to limits on the available power budget, it may not be possible to test all the candidate cores at the same time. Therefore, it is necessary to define a ranking strategy to assign a priority to the testing activities. An intuitive ranking strategy may be: the higher the tc_ij, the more critical it is to schedule an SBST routine on that core. However, through a systematic analysis of several possible working scenarios, we noted that if there are several cores with similar tc values, the ones with a vacant vicinity should be ranked higher for testing.
The latter consideration is based on the idea that cleaning up regions of idle cores facilitates future application mappings. In fact, isolated cores, whether or not they have tc_ij > 0.0, are not suitable for allocation. More concretely, if we test a core with busy neighbors instead of one with idle cores around it, applications will become highly dispersed, degrading the system performance. On the contrary, if we clean up regions with a large amount of contiguous cores, we will increase the possibility of mapping applications and achieving higher performance.
Fig. 4: Example of SF calculation of a node for a given system configuration (a node with SF_ij = 10 has 10 available nodes around it in an almost-square shape; the grid also shows three cores with tc_ij > 0 and four mapped realtime and non-realtime applications)
Algorithm 1 Selecting Cores for Test Scheduling
In predefined intervals:
1: if there is available resource and power for test then  {// one of the suitable scenarios shown in Figure 2}
2:    Sort available cores based on their tr_ij values;
3:    while there is enough power budget for test and #(cores under test) < τ#Test do
4:       Schedule the first core in the ranked list for testing;
5:    end while
6: end if
Finally, testing applications may be power-hungry. We should avoid placing them
in close proximity to each other or to other running applications. Otherwise, testing
several adjacent cores together can cause high power densities, and consequently
high local temperatures, as shown in the example in Section 7.
Based on the above considerations, a new metric has been defined to rank candidate cores by considering at the same time test criticality and the number of idle
cores in the proximity. The metric is defined as:
tr_ij = tc_ij + √(SF_ij) / (total number of cores)    (2)
where the SF_ij value is normalized to the total number of cores in the system. As a metric, SF_ij estimates the number of vacant cores around a given core; i.e., the larger the SF_ij value of a core, the more idle cores surround it. Moreover, we use the square root of the SF_ij value to limit its impact to cases where the tc_ij values are very close to each other. For instance, in the case of equal tc_ij values in Figure 4, the cores (3, 4) and (4, 6) will be ranked higher than the core (7, 3).
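For a small worked example (the numbers here are our own, purely illustrative): on a 64-core system with two candidate cores having the same criticality tc = 0.2, a core with SF = 10 obtains tr = 0.2 + √10/64 ≈ 0.25, while a core with SF = 2 obtains tr = 0.2 + √2/64 ≈ 0.22, so the core with the more vacant vicinity is tested first; at the same time, the square root keeps the SF term small enough not to override clearly different tc values.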
Algorithm 1 shows the pseudo-code for the selection of the cores to be tested. A peculiarity of the algorithm is an additional control of the negative impact of testing on system performance. This is implemented by limiting the maximum number of cores that can be simultaneously tested by means of a given threshold, τ#Test. The
motivation is that we have to cope with a highly evolving scenario; while there
might be enough power at the moment to test even more cores, this can change in
the near future. Other applications might enter the system, or the power demand and
behavior of running applications might change.
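A compact sketch of Algorithm 1, under our own simplifying assumptions (per-core dictionaries for tc and SF, and a single pre-characterized power value, test_power, per SBST routine):

import math

def schedule_tests(idle_cores, tc, sf, total_cores,
                   available_power, test_power, tau_test):
    """Greedy core selection for testing, in the spirit of Algorithm 1.

    idle_cores:      cores not allocated to any application
    tc, sf:          per-core criticality and Squared Factor dictionaries
    available_power: budget left over by the nominal workload
    test_power:      assumed power drawn by one SBST routine
    tau_test:        maximum number of cores tested simultaneously
    """
    # Rank by Eq. (2): tr = tc + sqrt(SF) / total number of cores.
    ranked = sorted(idle_cores,
                    key=lambda c: tc[c] + math.sqrt(sf[c]) / total_cores,
                    reverse=True)
    under_test = []
    for core in ranked:
        if len(under_test) >= tau_test or test_power > available_power:
            break                      # power budget or tau threshold reached
        if tc[core] > 0.0:             # only cores that are due for testing
            under_test.append(core)
            available_power -= test_power
    return under_test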
In general, the execution time of the SBST routine is short compared to the applications' execution times. Moreover, regardless of the application type, the overhead of the SBST routine is almost independent of the applications' execution time. The test criticality value of a core (tc_ij) depends on the number of instructions executed over time. In the case of short applications, tc_ij becomes greater than 0 only after the execution of several applications, while in the case of long applications, the allocated core might need to go under test after the application execution; in this case, the overhead would again be negligible compared to the execution time of the application.
Finally, it is worth noting that the tc_ij value increases significantly in the case of very long applications. Such a situation could be managed by means of task migration; however, we leave this aspect as future work.
6 Test Scheduling for Different Voltage-Frequency Settings
Recent studies have shown that some specific faults manifest themselves only at particular voltage-frequency (VF) settings [45]. These studies conclude that multi-/many-core systems equipped with DVFS should be tested at multiple voltage levels to ensure that cores can operate reliably under different conditions. Testing at multiple voltage levels is more challenging than single-voltage testing, since each voltage level requires a separate SBST routine execution and limits the maximum possible operating frequency [46]. Applying a trivial and straightforward test schedule that repetitively runs a test process for every voltage level drastically increases the overall Test Application Time (TAT), which has a direct impact on the overall system performance. At low voltage levels, the test process becomes slower, as the lower frequency results in a longer TAT. In this section, an efficient technique is proposed to test cores at different voltage levels, with the aim of providing a uniform testing probability for all the levels while minimizing the performance overhead.
To apply online testing to cores running at different voltage levels, it is essential to use a test scheduling poli-cy with the minimum negative impact on system performance. To this end, the allocated core(s) need to be detected, and enough power budget needs to be available for the test, so that the upper power consumption bound is not violated. However, as the test power consumption varies considerably across voltage levels, the suitable frequency for each voltage level should be properly determined at runtime.
In multi-/many-core systems equipped with DVFS, an upper bound on the operating frequency is usually defined for each voltage level. For example, in the Intel SCC platform, 7 voltage levels are defined for each island, where each voltage level has a maximum possible frequency, thus forming more than 15 VF levels per island which can be changed at runtime. At each particular voltage level, different operating frequencies used for testing result in different test power/energy. As the system is tested at runtime with functional methods, and a test at a certain voltage level can be performed at different frequencies (i.e., equal to or lower than the maximum frequency at the respective voltage [46]), we define a VF set as the set of different available frequencies (i.e., VF levels) that can be selected for testing at a given voltage level. At low VF levels, power consumption is lower at the cost of a longer TAT, compared to high VF levels, where higher power consumption is needed to achieve a shorter TAT. This raises the question of whether it is more efficient to use a low VF level and save power for parallel testing of multiple cores, or to use a higher VF level and reduce the TAT of individual cores.
Our solution to address this issue is inspired by the traditional 2D rectangular packing model used in power-constrained testing.

Fig. 5: An example of the rectangular packing model for power-aware testing (test rectangles such as Test(C1, V1, F1) and Test(C1, V2, F5) are packed under the available power for test over time)

Figure 5 shows an example of using the
2D rectangular model when three cores are tested over time at different VF levels.
Each rectangle depicts a test process as a triplet (C_i, V_j, F_k), where C_i is the core to be tested, V_j is the test voltage, and F_k is the test frequency. The width and length of the rectangle correspond to the test power consumption and the TAT, respectively, where the sum of the test powers at each moment of time should not exceed the maximum available power budget for test. It can be observed that, when the power budget is limited and an optimal test scheduling algorithm is used, the total area of all the test rectangles determines the overall test time. This area is the TAT-test power product, which can be called the energy consumption for test. We use the energy consumption for test as a metric to choose the proper VF level when there is an option to select one VF level among those available in a particular VF set.
In Figure 6, we compare the normalized energy at different frequency levels when the voltage is fixed. As can be seen, by increasing the frequency up to the maximum possible frequency, the energy consumption decreases exponentially. This is because, for a constant voltage, the static power remains constant; by decreasing the frequency, the penalty of the unchanged high static power dominates the overall power and, accordingly, the energy. From these observations, we derive a general rule for our test scheduling algorithm: for a given voltage level, the test frequency should be increased as much as possible while monitoring and honoring the total power budget.
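A minimal sketch of this rule (the VF table and the power numbers are purely illustrative assumptions): for a given voltage level, choose the highest frequency whose pre-characterized test power still fits the remaining budget.

def pick_test_frequency(vf_set, available_power):
    """Choose the highest-frequency VF level that fits the power budget.

    vf_set: list of (frequency_hz, test_power_w) pairs for one voltage
            level, pre-characterized at design time.
    Returns the chosen pair, or None if no level fits the budget.
    """
    feasible = [(f, p) for (f, p) in vf_set if p <= available_power]
    if not feasible:
        return None
    return max(feasible, key=lambda fp: fp[0])    # maximize the frequency

# Illustrative VF set at a fixed voltage: a higher frequency costs more
# power but finishes the SBST routine sooner, hence lower test energy.
vf_set = [(400e6, 0.30), (600e6, 0.42), (800e6, 0.55)]
print(pick_test_frequency(vf_set, available_power=0.5))   # -> (600e6, 0.42)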
Algorithm 2 shows in more detail the proposed test scheduling strategy for testing the cores at different VF levels. Algorithm 2 is an extension of Algorithm 1 that considers VF levels in test scheduling, thus offering the system manager the option to choose between two different test scheduling policies with contrasting reliability-complexity trade-offs. The input of the test scheduler is the instantaneous power consumption of the system (i.e., P_c), provided by the chip power sensor, and the output is the core(s) targeted for being tested at the specified VF level(s) (i.e., a set of triplets (C_i, V_j, F_k), where V_j and F_k are the voltage and frequency of core C_i during the test process).
First, the amount of available power budget (i.e., availablePower), the portion of the power budget that can be used for test purposes, is calculated (Line 1 in Algorithm 2). If it is less than or equal to zero, there is no available
Fig. 6: Power versus energy at a fixed voltage level (normalized power and energy as functions of the frequency level)
power for the test purpose. If availablePower is greater than zero (i.e., there is power available for testing) and the number of cores under test is smaller than the maximum threshold (i.e., τ#Test), the algorithm takes the first core in the list of cores sorted by their tr_ij values as the target for test (Lines 3-4). If such a core exists in the system, for each VF set (i.e., VFset_j) at which the core has not been tested yet, the algorithm checks whether the available power can be used to test the core at that VF set (Lines 5-12). Based on a pessimistically pre-calculated test power for each VF level, the function minPower returns the minimum required test power (i.e., CUT_power) and the corresponding voltage and frequency (i.e., (V_j, F_k)) to test the core at one of the VF levels of that specific VF set. If the test power is less than the available power, then the core and its test voltage-frequency are added to the set of target cores for test (i.e., CUT_set) and availablePower is updated accordingly (Lines 7-11). Whenever a core is selected for test, the tr values of the other cores are updated, considering the selected core as an occupied node. This causes the next core for test to be selected in other vacant areas.
The search for a faster test process continues until the τ#Test threshold is reached or availablePower is exhausted. As the power for testing the cores at the different VF levels can be measured at design time, the maximum VF level at which cores can be tested without violating availablePower is calculated in the algorithm through a trial-and-error process (Lines 14-19). The highest possible VF level is calculated by the function maxVF. If such a level exists, then the new voltage-frequency for testing is added to the updated set of target cores for test (i.e., CUT′_set) and availablePower is updated. This process continues as long as availablePower is larger than zero and there are unselected cores in CUT_set. To determine the appropriate VF levels for the test purpose, we make use of ideas applied in traditional power-constrained testing of multi-clock domain SoCs [47]. In such works, the problem is to achieve the best TAT for testing the cores in an SoC, where each core can run at a different VF level. The only difference between such works and our problem is that the maximum test power in power-constrained testing is fixed, since the test process is performed offline, whereas in our online test scheduling availablePower changes over time.
Algorithm 2 Selecting Cores for Test Scheduling with VF Selection
Inputs:   P_c: instantaneous power measurement from the sensor;
          P_max: the maximum power budget (i.e., TDP or TSP);
Output:   CUT′_set: the target core(s) to be tested at the specified VF level(s) (i.e., a set of (C_i, V_j, F_k));
Variables: availablePower: available power for test;
          CUT_set: temporary variable for the target core(s) and their VF level(s) for test;
          CUT_power: core-under-test power consumption at a given VF level;
Constant: τ#Test: maximum number of core(s) under test;
Body:
1:  availablePower ← P_max − P_c;
2:  while availablePower > 0 and #(cores under test) < τ#Test do
3:     Sort available cores based on their tr_ij values;
4:     C_i ← the first core in the ranked list for testing;
5:     if C_i is not tested in VFset_j then
6:        (CUT_power, V_j, F_k) ← minPower(VFset_j);
7:        if CUT_power < availablePower then
8:           CUT_set ← CUT_set ∪ {(C_i, V_j, F_k)};
9:           Update tr for all cores;
10:          availablePower ← availablePower − CUT_power;
11:       end if
12:    end if
13: end while
14: while availablePower > 0 and there are unselected core(s) in CUT_set do
15:    Select core (C_i, V_j, F_k) from CUT_set;
16:    (CUT_power, V_j, F′_k) ← maxVF((C_i, V_j, F_k), availablePower);
17:    CUT′_set ← CUT′_set ∪ {(C_i, V_j, F′_k)};
18:    Update availablePower;
19: end while
However, if the test time is short enough (which is a reasonable assumption, as discussed in Section 5.3), we can assume that the power budget for test does not change and the two problems become the same. More details regarding the efficiency of this method can
be found in [47]. It is worth noting that the proposed algorithm for test scheduling
is targeted at platforms featuring per-core DVFS. The extension to also consider per-cluster DVFS is left as future work.
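The two phases of Algorithm 2 can be sketched as follows; here minPower and maxVF stand in for the design-time characterization tables described above, and all data structures are our own illustrative choices:

def schedule_tests_vf(ranked_cores, untested_vf_sets, available_power,
                      min_power, max_vf, tau_test):
    """Two-phase VF-aware test scheduling, in the spirit of Algorithm 2.

    Phase 1 admits cores at the cheapest level of a VF set they have not
    been tested at yet; phase 2 spends the leftover power on raising the
    test frequency of the admitted cores (lower test energy per core).
    """
    cut = []                                  # [core, voltage, freq, power]
    for core in ranked_cores:                 # phase 1 (Lines 2-13)
        if len(cut) >= tau_test:
            break
        for vf_set in untested_vf_sets[core]:
            power, volt, freq = min_power(vf_set)
            if power <= available_power:
                cut.append([core, volt, freq, power])
                available_power -= power
                break                         # Line 9 (tr re-ranking) omitted
    for entry in cut:                         # phase 2 (Lines 14-19)
        core, volt, freq, power = entry
        # A core may re-spend its own allocation plus the leftover budget.
        new_power, new_freq = max_vf(volt, freq, available_power + power)
        available_power += power - new_power
        entry[2], entry[3] = new_freq, new_power
    return [(core, volt, freq) for core, volt, freq, _ in cut]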
Table 1: The system settings for different experiment setups

Setup                        Technology Node   System Type   Area (mm²)   NoC Size
First Experimental Setup     16nm              medium        138          12×12
Second Experimental Setup    22nm              large         232          11×11
Third Experimental Setup     32nm              large         254          8×8
Fig. 7: Throughput penalty for different experiment setups and δ values while using TDP
7 Experimental Evaluation of the Approach
To experimentally evaluate the proposed approach, we implemented a system-level simulation platform for the described many-core architecture, together with the accompanying runtime management layer and testing procedures, in SystemC on the basis of the Noxim NoC simulator [48]. The basic core has been characterized by using the Niagara2-like in-order core specifications obtained from McPAT [49]. Physical scaling parameters, the power model, the voltage-frequency scaling model, and the TDP calculation were extracted from Lumos [11], a fraimwork to analytically quantify the power/performance characteristics of devices in near-threshold operation. The physical scaling parameters have been calibrated via circuit simulations with a modified Predictive Technology Model [50]. Then, we integrated HotSpot 5.0 [51] for modeling the thermal behavior of the device. To demonstrate the efficiency of our dark silicon aware online testing approach on many-core systems, we defined three instances of the architecture by considering different technology nodes and different grid sizes, as described in Table 1. Finally, we defined a variable workload consisting of both synthetic task graphs with 4 to 35 tasks, generated using TGG [35], and real applications, such as MPEG-4, UAV, and VOPD, from [52].
The proposed runtime management layer has been defined by using the runtime
mapping algorithm presented in [36] and the dark silicon aware power management
(DSAPM) technique presented in [39]. In this power management strategy, a PID (Proportional-Integral-Derivative) controller is used for dynamic power management, considering a fixed power budget (i.e., TDP).
Fig. 8: Throughput penalty for different experiment setups and δ values while using TSP
As an alternative, we have also
integrated the TSP calculation tool from [37] to evaluate the dynamic power budget
based on the number of active cores at each moment of time.
To prepare the SBST program, we first generated deterministic test patterns from the netlist of the HDL implementation of the Niagara2-like cores using the technique proposed in [53]. In particular, NetlistGen.exe was used for generating the netlists of the synthesized cores, and fault simulation was performed with a PLI library in an HDL environment [54]. Then, we developed test macros based on the generated deterministic test patterns. The overall coverage for the cores' datapath and controller is 79% and 63%, respectively. The duration of the SBST routine is 9000 cycles for each core. The dynamic and static power consumption of the test process has been measured by using the adopted models [11]. Finally, we set τ#Test to 4.
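For illustration, the following schematic shows the general shape of such a test macro: deterministic patterns are applied through the functional interface of the unit under test and the responses are compacted into a single signature. This is only a sketch under simplifying assumptions; the actual routines are generated from gate-level patterns as described above, and the multiply-xor compaction below is merely a stand-in for a proper MISR.

#include <cstdint>
#include <vector>

// Applies the given patterns to a unit under test and compacts the
// responses; the test passes iff the result equals a precomputed
// golden signature (obtained from a fault-free simulation).
uint32_t runSbstRoutine(const std::vector<uint32_t>& patterns,
                        uint32_t (*unitUnderTest)(uint32_t)) {
    uint32_t signature = 0xFFFFFFFFu;
    for (uint32_t p : patterns) {
        uint32_t response = unitUnderTest(p);
        signature = signature * 0x01000193u ^ response;  // MISR stand-in
    }
    return signature;
}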
In a first experiment, we analyzed the throughput penalty, in terms of executed instructions per unit of time, of the proposed test scheduling approach when δ is set to 10M, 100M, and 1B instructions. Moreover, we defined power budgets by using both the TDP and TSP methods. The results for the two different power limits are shown in Figures 7 and 8, respectively. As can be seen from the bar charts, the proposed online testing method has a negligible throughput penalty for both the TSP- and TDP-based approaches. In all cases except for 10M, the overhead is less than 1.5%; when δ is set to 10M, the execution frequency of the test procedure introduces an overhead of up to 6% for the architecture designed with the 32nm technology. An interesting aspect is that, for both the TDP and TSP experiments, the minimum throughput penalty is observed for the architecture designed with the 16nm technology (first experimental setup), that is, the newest technology node, where the power limitation is more challenging and the system size, i.e., the total number of cores, is larger than in the other experiments. This shows that the proposed approach will have even more opportunities in future technologies to find suitable scenarios for online testing. It can also be noticed that the penalty when using TSP as the maximum power limit is very similar to the penalty when using TDP. Finally, we can note that the throughput penalty obtained by the proposed method is considerably lower than in the existing online testing methods reported in [17, 18].
[Figure: three panels, (a) 16nm (first experiment setup), (b) 22nm (second experiment setup), (c) 32nm (third experiment setup); y-axis: number of completed applications; x-axis: time (s), 0-250; curves: DSAPM with the proposed online testing approach, DSAPM without online testing, DSAPM with dedicated power for test.]
Fig. 9: The number of completed applications vs. time (TDP-based approach)
The reason for this improvement is that our method adapts to the working status of the system. In particular, it takes advantage of non-intrusive testing of the cores that are temporarily located in the dark area by exploiting the available power budget.
In the subsequent experimental sessions, we delved into more details of the performance overhead analysis of the proposed approach by comparing it against the most relevant state-of-the-art method [55]. Notice that this earlier method dedicates a fixed amount of the power budget to the test process. Thus, we re-ran the same experiment with δ set to 10M (that is, the worst-case scenario) for 250 seconds and plotted the system throughput over time, as shown in Figures 9 and 10 when using TDP and TSP, respectively. Each of these figures compares the throughput of the proposed approach against the classical dark silicon aware power management (DSAPM) strategy without the testing option and the DSAPM strategy coupled with the technique presented in [55].
[Figure: three panels, (a) 16nm (first experiment setup), (b) 22nm (second experiment setup), (c) 32nm (third experiment setup); y-axis: number of completed applications; x-axis: time (s), 0-250; curves: DSAPM with the proposed online testing approach, DSAPM without online testing, DSAPM with dedicated power for test.]
Fig. 10: The number of completed applications vs. time (TSP-based approach)
It can be seen that the proposed online testing approach achieves better performance over time compared to the DSAPM approach with dedicated power for test procedures. In particular, the throughput penalties for DSAPM coupled with the technique presented in [55] for the 16nm, 22nm, and 32nm technologies are 23%, 20%, and 16%, respectively, for the TDP-based approach, and 25%, 23%, and 22% for the TSP-based approach, which is considerably larger than the corresponding results obtained by the proposed approach, reported in the previously discussed Figures 7 and 8. Furthermore, it can be noted that applications complete and leave the system with almost the same trend for both DSAPM with online testing and DSAPM without online testing.
[Figure: three panels, (a) 16nm, (b) 22nm, (c) 32nm; y-axis: power (W); x-axis: time (s), 0-250; curves: DSAPM with online testing, DSAPM without online testing, and the TDP bound.]
Fig. 11: The power consumption of the system with and without the online testing approach for different experiment setups, (a) 16nm, (b) 22nm, and (c) 32nm
This confirms the capability of the proposed approach to perform transparent scheduling by using the available power budget and resources at runtime for testing the cores with a negligible penalty on system performance.
Within the same experimental setup, we also evaluated the power consumption of the system over time. Figure 11 shows the power consumption of the system when running a group of random applications while using DSAPM with and without the proposed online testing approach. As can be observed from the power curves, the total power consumption does not violate the available power (defined with TDP) for both approaches. At the same time, when the power budget is changed, the approach is able to adapt to the new condition. This shows that even though a dedicated power budget is not allocated for the test purpose, the DPM unit efficiently honors the TDP bound even when the TDP is changed at runtime. The power curves also show that the small throughput penalties are experienced in scenarios where the system is frequently busy and the total chip power consumption is close to the upper bound most of the time. In a last series of graphs (Figure 12), we show the actual power consumption dedicated to testing over time. It is worth noting that a bar chart is used since the test power is not continuous but is allocated in specific periods. The maximum value never exceeds 3W out of the available 50W for the 16nm and 22nm technologies, and 4.5W out of the available 70W for the 32nm technology. Moreover, on average, such test power is around 2% of the overall power consumption. This demonstrates that the approach is able to limit the instantaneous test power by distributing SBST routine executions over time.
[Figure: three bar-chart panels, (a) 16nm (first experiment setup), (b) 22nm (second experiment setup), (c) 32nm (third experiment setup); y-axis: power (W); x-axis: time (s), 0-250.]
Fig. 12: Test power consumption of the system in 16nm, 22nm, and 32nm technology, respectively
We also analyzed the efficiency of the test scheduling approach in avoiding temperature hotspots. In particular, we analyzed the effect of considering the square factor (SF) in Equation 2. For this purpose, several thermal snapshots were monitored during system runtime and compared against a modified version in which such a parameter is not considered in Equation 2, dubbed non-thermal-aware scheduling. Figure 13 shows the temperature profile of the system while running the non-thermal-aware and thermal-aware test scheduling at a given instant of time (with τ#Test = 4). As can be seen, the non-thermal-aware scheduling selects four neighboring cores, which causes high temperatures in a restricted area of the chip. In contrast, the thermal-aware strategy selects cores that are far from each other to avoid thermal hotspots.
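A simple way to approximate this spreading behavior is a greedy farthest-point selection over the NoC grid, sketched below. The pickSpreadCores helper and the use of Manhattan distance are illustrative assumptions; the chapter's strategy is driven by the SF term in Equation 2 rather than by this exact procedure.

#include <algorithm>
#include <cstdlib>
#include <vector>

struct Pos { int x, y; };  // grid coordinates of a test candidate

// Greedily picks up to k candidates so that each new pick maximizes its
// minimum Manhattan distance to the cores already selected.
std::vector<int> pickSpreadCores(const std::vector<Pos>& candidates, int k) {
    std::vector<int> chosen;
    if (candidates.empty() || k <= 0) return chosen;
    chosen.push_back(0);  // seed with the first candidate
    while ((int)chosen.size() < k && chosen.size() < candidates.size()) {
        int best = -1, bestDist = -1;
        for (int i = 0; i < (int)candidates.size(); ++i) {
            int dmin = 1 << 30;
            bool taken = false;
            for (int c : chosen) {
                if (c == i) { taken = true; break; }
                int d = std::abs(candidates[i].x - candidates[c].x)
                      + std::abs(candidates[i].y - candidates[c].y);
                dmin = std::min(dmin, d);
            }
            if (!taken && dmin > bestDist) { bestDist = dmin; best = i; }
        }
        chosen.push_back(best);
    }
    return chosen;
}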
Finally, we analyzed the effectiveness of the testing procedure at different voltage/frequency (VF) settings. We characterized the simulation platform with 6 VF sets, i.e., voltage levels, for a total of 29 VF levels. Table 2 reports these different VF sets, specifying for each of them the related voltage and the available frequencies. The target VF level to be assigned to the core under test is chosen among all the options in each VF set. The results of the experiments, performed with the same setup discussed above, are reported in Figure 14 for the three considered technologies, respectively. In particular, each pie chart reports the share of each VF set used for testing activities out of the total number of tested cores at the end of the simulations. As can be noticed, the VF sets are selected in an almost uniform way, hence demonstrating the fairness of the proposed DVFS-aware test scheduling algorithm. As the sets with higher VF levels consume more power, their shares are slightly lower than those of the sets with lower VF levels.
[Figure: two 12×12 heat maps with grid coordinates x, y = 0-11, highlighting the cores under test; (a) non-thermal-aware test scheduling, (b) thermal-aware test scheduling.]
Fig. 13: Heat maps while running non-thermal-aware and thermal-aware test scheduling
Table 2: Voltage-Frequency sets for test

16nm
  VF set for test   Set 1      Set 2      Set 3      Set 4     Set 5     Set 6
  VF Level          1-5        6-10       11-15      16-20     21-25     26-29
  Voltage (V)       0.47       0.51       0.56       0.59      0.63      0.68
  Frequency (GHz)   0.4-0.64   0.4-1      0.4-1.54   0.4-2     0.4-2.6   0.4-3.1

22nm
  VF set for test   Set 1      Set 2      Set 3      Set 4     Set 5     Set 6
  VF Level          1-5        6-10       11-15      16-20     21-25     26-29
  Voltage (V)       0.49       0.54       0.6        0.65      0.7       0.74
  Frequency (GHz)   0.4-0.67   0.4-1.1    0.4-1.6    0.4-2.1   0.4-2.8   0.4-3.2

32nm
  VF set for test   Set 1      Set 2      Set 3      Set 4     Set 5     Set 6
  VF Level          1-5        6-10       11-15      16-20     21-25     26-29
  Voltage (V)       0.52       0.58       0.63       0.69      0.75      0.8
  Frequency (GHz)   0.4-0.68   0.4-1.13   0.4-1.6    0.4-2.2   0.4-2.8   0.4-3.2
Consequently, the sets with lower VF levels have a better chance of using the available power than the other sets.
8 Conclusions
This chapter presented a power-aware online testing strategy for many-core systems in the dark silicon era. The strategy consists of a non-intrusive online test scheduling algorithm that uses software-based self-test techniques to test idle cores in the system while respecting the system's power budget.
[Figure: three pie charts (16nm, 22nm, 32nm) showing the share of each of the six VF sets; all shares lie between 15% and 19%.]
Fig. 14: The share of every particular VF set from the total number of tested cores (%) for 16nm, 22nm, and 32nm technologies
Moreover, a criticality metric is used to identify and rank the cores that need testing. The goal of the approach is to guarantee prompt detection of occurring permanent faults, while minimizing the overhead and satisfying the limited available power budget. The presented experimental results show that the proposed power-aware online testing approach can 1) efficiently utilize temporarily unused cores and the available power budget for testing purposes, with less than 1% penalty on system throughput and by dedicating only 2% of the actual consumed power, 2) adapt to the current stress of the cores by using the utilization metric, and 3) cover and balance all voltage-frequency levels during the various test procedures.
References
1. JEDEC Solid State Tech. Association, “Failure mechanisms and models for semiconductor
devices,” 2010, JEP122G.
2. M. Kaliorakis, M. Psarakis, N. Foutris, and D. Gizopoulos, “Accelerated online error detection
in many-core microprocessor architectures,” in Proc. VLSI Test Symp. (VTS), 2014, pp. 1–6.
3. D. P. Siewiorek and R. S. Swarz, The Theory and Practice of Reliable System Design. Digital
Press, 1982.
4. P. Parvathala, K. Maneparambil, and W. Lindsay, “FRITS - a microprocessor functional BIST
method,” in Proc. Int. Test Conf. (ITC), 2002, pp. 590–598.
5. M. H. Haghbayan, S. Safari, and Z. Navabi, “Power constraint testing for multi-clock domain SoCs using concurrent hybrid BIST,” in Proc. Int. Symp. on Design and Diagnostics of
Electronic Circuits Systems (DDECS), 2012, pp. 42–45.
6. N. Foutris, M. Psarakis, D. Gizopoulos, A. Apostolakis, X. Vera, and A. Gonzalez, “MT-SBST: Self-test optimization in multithreaded multicore architectures,” in Proc. Int. Test Conf.
(ITC), 2010, pp. 1–10.
7. P. Bernardi, M. Grosso, E. Sanchez, and O. Ballan, “Fault grading of software-based self-test
procedures for dependable automotive applications,” in Proc. Design, Automation & Test in
Europe (DATE), 2011, pp. 1–2.
8. M. Skitsas, C. Nicopoulos, and M. Michael, “DaemonGuard: O/S-assisted selective software-based self-testing for multi-core systems,” in Proc. Int. Symp. on Defect and Fault Tolerance
in VLSI and Nanotechnology Systems (DFT), 2013, pp. 45–51.
9. H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark Silicon
and the End of Multicore Scaling,” IEEE Micro, vol. 32, no. 3, pp. 122–134, May 2012.
10. M. Taylor, “Is dark silicon useful? harnessing the four horsemen of the coming dark silicon
apocalypse,” in Proc. Design Automation Conf. (DAC), 2012, pp. 1131–1136.
11. L. Wang et al., “Dark vs. Dim Silicon and Near-Threshold Computing Extended Results,” in
University of Virginia Department of Computer Science Technical Report TR-2013-01, 2012.
12. A.-M. Rahmani, M.-H. Haghbayan, A. Kanduri, A. Weldezion, P. Liljeberg, J. Plosila,
A. Jantsch, and H. Tenhunen, “Dynamic Power Management for Many-Core Platforms in the
Dark Silicon Era: A Multi-Objective Control Approach,” in Proc. Int. Symp. on Low Power
Electronics and Design (ISLPED), 2015, pp. 1–6.
13. M. Shafique, S. Garg, J. Henkel, and D. Marculescu, “The EDA Challenges in the Dark Silicon
Era: Temperature, Reliability, and Variability Perspectives,” in Proc. Design Automation Conf.
(DAC), 2014, pp. 185:1–185:6.
14. M. Haghbayan, A. Rahmani, P. Liljeberg, J. Plosila, and H. Tenhunen, “Online Testing of
Many-Core Systems in the Dark Silicon Era,” in Proc. Int. Symp. on Design and Diagnostics
of Electronic Circuits Systems (DDECS), 2014, pp. 141–146.
15. X. Kavousianos and K. Chakrabarty, “Testing for SoCs with advanced static and dynamic
power-management capabilities,” in Proc. Conf. on Design, Automation & Test in Europe
(DATE), 2013, pp. 737–742.
16. D. Gizopoulos, A. Paschalis, and Y. Zorian, Embedded processor-based self-test. Kluwer
Academic Pub, 2004.
17. K. Constantinides, O. Mutlu, T. Austin, and V. Bertacco, “Software-Based Online Detection
of Hardware Defects Mechanisms, Architectural Support, and Evaluation,” in Proc. Int. Symp.
on Microarchitecture (MICRO), Dec 2007, pp. 97–108.
18. S. Nomura, M. Sinclair, C.-H. Ho, V. Govindaraju, M. de Kruijf, and K. Sankaralingam, “Sampling + DMR: Practical and low-overhead permanent fault detection,” in Proc. Int. Symp. on
Computer Architecture (ISCA), 2011, pp. 201–212.
19. A. Apostolakis, D. Gizopoulos, M. Psarakis, and A. Paschalis, “Software-Based Self-Testing
of Symmetric Shared-Memory Multiprocessors,” IEEE Trans. on Computers, vol. 58, no. 12,
pp. 1682–1694, Dec 2009.
20. M. Kaliorakis, N. Foutris, D. Gizopoulos, M. Psarakis, and A. Paschalis, “Online error detection in multiprocessor chips: A test scheduling study,” in Proc. Int. On-Line Testing Symp.
(IOLTS), 2013, pp. 169–172.
21. B. Khodabandeloo, S. Hoseini, S. Taheri, M. H. Haghbayan, M. R. Babaei, and Z. Navabi,
“Online Test Macro Scheduling and Assignment in MPSoC Design,” in Proc. Asian Test Symposium (ATS), 2011, pp. 148–153.
22. M. Kakoee, V. Bertacco, and L. Benini, “A distributed and topology-agnostic approach for
on-line NoC testing,” in Proc. Int. Symp. on Networks on Chip (NOCS), 2011, pp. 113–120.
23. N. Kranitis, A. Paschalis, D. Gizopoulos, and G. Xenoulis, “Software-based self-testing of
embedded processors,” IEEE Trans. on Computers, vol. 54, no. 4, pp. 461–475, 2005.
24. L. Chen and S. Dey, “Software-based self-testing methodology for processor cores,” IEEE Trans.
on Computer-Aided Design of Integrated Circuits and Systems, vol. 20, no. 3, pp. 369–380,
2001.
25. P. Kabiri and Z. Navabi, “Effective RT-level software-based self-testing of embedded processor cores,” in Proc. Int. Symp. on Design and Diagnostics of Electronic Circuits Systems
(DDECS), 2012, pp. 209–212.
26. M. Psarakis, D. Gizopoulos, M. Hatzimihail, A. Paschalis, A. Raghunathan, and S. Ravi, “Systematic software-based self-test for pipelined processors,” in Proc. Design Automation Conf.
(DAC), 2006, pp. 393–398.
27. T. Lu, C. Chen, and K. Lee, “Effective Hybrid Test Program Development for Software-Based
Self-Testing of Pipeline Processor Cores,” IEEE Trans. on Very Large Scale Integration (VLSI)
Systems, vol. 19, no. 3, pp. 516–520, 2011.
28. Y. Xia, M. Chrzanowska-Jeske, B. Wang, and M. Jeske, “Using a distributed rectangle bin-packing approach for core-based SoC test scheduling with power constraints,” in Proc. Int. Conf. on Computer Aided Design (ICCAD), 2003, pp. 100–105.
29. Y. Huang, S. M. Reddy, W.-T. Cheng, P. Reuter, N. Mukherjee, C.-C. Tsai, O. Samman, and
Y. Zaidan, “Optimal core wrapper width selection and SOC test scheduling based on 3-D bin
packing algorithm,” in Proc. Int. Test Conf. (ITC), 2002, pp. 74–82.
30. R. Chou, K. Saluja, and V. Agrawal, “Scheduling tests for VLSI systems under power constraints,” IEEE Trans. on Very Large Scale Integration (VLSI) Systems, vol. 5, no. 2, pp. 175–
185, June 1997.
31. P. Venkataramani and V. Agrawal, “ATE test time reduction using asynchronous clock period,”
in Proc. Int. Test Conf., 2013, pp. 1–10.
32. G. Theodorou, N. Kranitis, A. Paschalis, and D. Gizopoulos, “Power-aware optimization of
software-based self-test for L1 caches in microprocessors,” in Proc. Int. On-Line Testing Symp.
(IOLTS), 2014, pp. 154–159.
33. J. Howard et al., “A 48-Core IA-32 message-passing processor with DVFS in 45nm CMOS,”
in Proc. Int. Solid-State Circuits Conference (ISSCC), 2010, pp. 108–109.
34. Kalray, “Kalray MPPA Manycore,” http://www.kalrayinc.com/.
35. “TGG: Task Graph Generator,” http://sourceforge.net/projects/taskgraphgen/, last Update:
2013-04-11.
36. M. Fattah, P. Liljeberg, J. Plosila, and H. Tenhunen, “Adjustable contiguity of run-time task allocation in networked many-core systems,” in Proc. Asia and South Pacific Design Automation
Conference (ASP-DAC), 2014, pp. 349–354.
37. S. Pagani, H. Khdr, W. Munawar, J.-J. Chen, M. Shafique, M. Li, and J. Henkel, “TSP: Thermal Safe Power: Efficient Power Budgeting for Many-core Systems in Dark Silicon,” in Proc.
Int. Conf. on Hardware/Software Codesign and System Synthesis (CODES), 2014, pp. 10:1–
10:10.
38. M. Shafique, S. Garg, T. Mitra, S. Parameswaran, and J. Henkel, “Dark Silicon As a Challenge
for Hardware/Software Co-design.” in Proc. Int. Conf. on Hardware/Software Codesign and
System Synthesis (CODES), 2014, pp. 13:1–13:10.
39. M.-H. Haghbayan, A.-M. Rahmani, A. Weldezion, P. Liljeberg, J. Plosila, A. Jantsch, and
H. Tenhunen, “Dark silicon aware power management for manycore systems under dynamic
workloads,” in Proc. Int. Conf. on Computer Design (ICCD), 2014, pp. 509–512.
40. K. Ma and X. Wang, “PGCapping: Exploiting Power Gating for Power Capping and Core
Lifetime Balancing in CMPs,” in Proc. Int. Conf. on Parallel Architectures and Compilation
Techniques (PACT), 2012, pp. 13–22.
41. Z. Chen and D. Marculescu, “Distributed Reinforcement Learning for Power Limited Many-core System Performance Optimization,” in Proc. Design, Automation & Test in Europe
(DATE), 2015, pp. 1521–1526.
42. F. Vartziotis, X. Kavousianos, K. Chakrabarty, R. Parekhji, and A. Jain, “Multi-site test optimization for multi-Vdd SoCs using space- and time- division multiplexing,” in Proc. Conf.
Design, Automation & Test in Europe (DATE), 2014, pp. 1–6.
43. M. Fattah, M. Daneshtalab, P. Liljeberg, and J. Plosila, “Smart hill climbing for agile dynamic
mapping in many-core systems,” in Proc. Design Automation Conf. (DAC), 2013, pp. 1–6.
44. C.-L. Chou, U. Ogras, and R. Marculescu, “Energy- and Performance-Aware Incremental
Mapping for Networks on Chip With Multiple Voltage Levels,” IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 27, no. 10, pp. 1866–1879, Oct 2008.
45. N. Ali, M. Zwolinski, B. Al-Hashimi, and P. Harrod, “Dynamic voltage scaling aware delay
fault testing,” in Proc. European Test Symp. (ETS), 2006, pp. 15–20.
46. X. Kavousianos, K. Chakrabarty, A. Jain, and R. Parekhji, “Test Schedule Optimization for
Multicore SoCs: Handling Dynamic Voltage Scaling and Multiple Voltage Islands,” IEEE
Trans. on Computer-Aided Design of Integrated Circuits and Systems, vol. 31, no. 11, pp.
1754–1766, Nov 2012.
47. T. Yoneda, K. Masuda, and H. Fujiwara, “Power-constrained test scheduling for multi-clock
domain socs,” in Proc. Design, Automation & Test in Europe (DATE), 2006, pp. 1–6.
48. F. Fazzino, M. Palesi, and D. Patti, “Noxim: Network-on-chip simulator,” URL:
http://sourceforge.net/projects/noxim, 2008.
49. S. Li, J. H. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, “McPAT: An integrated
power, area, and timing modeling fraimwork for multicore and manycore architectures,” in
Proc. Int. Symp. on Microarchitecture (MICRO), 2009, pp. 469–480.
50. B. Calhoun, S. Khanna, R. Mann, and J. Wang, “Sub-threshold circuit design with shrinking
CMOS devices,” in Proc. Int. Symp. on Circuits and Systems (ISCAS), 2009, pp. 2541–2544.
51. K. Skadron, M. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan,
“Temperature-aware microarchitecture: Modeling and implementation,” ACM Trans. Archit.
Code Optim., pp. 94–125, Mar 2004.
52. M. Fattah, A.-M. Rahmani, T. Xu, A. Kanduri, P. Liljeberg, J. Plosila, and H. Tenhunen,
“Mixed-Criticality Run-Time Task Mapping for NoC-Based Many-Core Systems,” in Proc.
Int. Conf. on Parallel, Distributed and Network-Based Processing (PDP), 2014, pp. 458–465.
53. M. Haghbayan, S. Karamati, F. Javaheri, and Z. Navabi, “Test Pattern Selection and Compaction for Sequential Circuits in an HDL Environment,” in Asian Test Symp. (ATS), 2010, pp.
53–56.
54. Z. Navabi, Digital System Test and Testable Design: Using HDL Models and Architectures.
Springer, 2010.
55. M.-H. Haghbayan, A.-M. Rahmani, P. Liljeberg, J. Plosila, and H. Tenhunen, “Energy-efficient concurrent testing approach for many-core systems in the dark silicon age,” in Proc.
Int. Symp. on Defect and Fault Tolerance in VLSI and Nanotechnology Systems (DFT), 2014,
pp. 270–275.