Divisha: Low-Power Design Through Voltage Scaling

7.3 Low-Power Design Through Voltage Scaling

The switching power dissipation in CMOS digital integrated circuits is a strong (quadratic) function of the power supply voltage. Therefore, reduction of VDD emerges as a very effective means of limiting the power consumption. Given a certain technology, the circuit designer may utilize on-chip DC-DC converters and/or separate power pins to achieve this goal. As we have already discussed briefly in Section 7.2, however, the savings in power dissipation come at a significant cost in terms of increased circuit delay. When considering drastic reduction of the power supply voltage below the new standard of 3.3 V, the issue of time-domain performance should also be addressed carefully. In the following, we will examine reduction of the power supply voltage with a corresponding scaling of threshold voltages, in order to compensate for the speed degradation. At the system level, architectural measures such as the use of parallel processing blocks and/or pipelining techniques also offer feasible alternatives for maintaining the system performance (throughput) despite aggressive reduction of the power supply voltage.

The propagation delay expression (7.4) clearly shows that the negative effect of reducing the power supply voltage upon delay can be compensated for if the threshold voltage of the transistor is scaled down accordingly. When scaled linearly, reduced threshold voltages allow the circuit to produce the same speed performance at a lower VDD. However, this approach is limited by the fact that the threshold voltage cannot be scaled to the same extent as the supply voltage. Figure 7.9 shows the variation of the propagation delay of a CMOS inverter as a function of the power supply voltage, and for different threshold voltage values.

Figure-7.9: Variation of the normalized propagation delay of a CMOS inverter, as a function of the power supply voltage VDD and the threshold voltage VT.

We can see, for example, that reducing the threshold voltage from 0.8 V to 0.2 V can improve the delay at VDD = 2 V by a factor of 2. The influence of threshold voltage reduction upon propagation delay is especially pronounced at low power supply voltages. It should be noted, however, that the threshold voltage reduction approach is restricted by concerns about noise margins and subthreshold conduction. Smaller threshold voltages lead to smaller noise margins for the CMOS logic gates. The subthreshold conduction current also severely limits how far the threshold voltage can be reduced. For threshold voltages smaller than 0.2 V, leakage power dissipation due to subthreshold conduction may become a very significant component of the overall power consumption.
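The trend read from Fig. 7.9 can be reproduced numerically. Expression (7.4) is not reproduced in this section, so the sketch below assumes the common first-order inverter delay model, normalized by the constant factor C_load / k. The function name `normalized_delay` and the exact form of the model are assumptions made for illustration, not a quotation of (7.4).

```python
import math


def normalized_delay(vdd, vt):
    """Inverter delay up to the constant factor C_load / k (assumed model).

    Assumed first-order form, standing in for expression (7.4):
        t_P ~ 1/(VDD - VT) * [2*VT/(VDD - VT) + ln(4*(VDD - VT)/VDD - 1)]
    The derivation behind this form requires VDD > 2*VT.
    """
    if vdd <= 2.0 * vt:
        raise ValueError("model assumes VDD > 2*VT")
    overdrive = vdd - vt
    log_term = math.log(4.0 * overdrive / vdd - 1.0)
    return (2.0 * vt / overdrive + log_term) / overdrive


# Speed-up obtained by lowering VT from 0.8 V to 0.2 V at two supply voltages.
speedup_at_2v = normalized_delay(2.0, 0.8) / normalized_delay(2.0, 0.2)
speedup_at_5v = normalized_delay(5.0, 0.8) / normalized_delay(5.0, 0.2)
print(f"speed-up at VDD = 2 V: {speedup_at_2v:.2f}")
print(f"speed-up at VDD = 5 V: {speedup_at_5v:.2f}")
```

Under this assumed model the speed-up is roughly 2 at VDD = 2 V but only about 1.3 at VDD = 5 V, which matches the observation that threshold voltage reduction matters most at low supply voltages.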

In certain types of applications, the reduction of circuit speed which comes as a result of voltage scaling can be compensated for at the expense of more silicon area. In the following, we will examine the use of architectural measures such as pipelining and hardware replication to offset the loss of speed at lower supply voltages.

Pipelining Approach

First, consider the single functional block shown in Fig. 7.10 which implements a logic function F(INPUT) of the input vector, INPUT. Both the input and the output vectors are sampled through register arrays, driven by a clock signal CLK. Assume that the critical path in this logic block (at a power supply voltage of VDD) allows a maximum sampling frequency of fCLK; in other words, the maximum input-to-output propagation delay tP,max of this logic block is equal to or less than TCLK = 1/fCLK. Figure 7.10 also shows the simplified timing diagram of the circuit. A new input vector is latched into the input register array at each clock cycle, and the output data becomes valid with a latency of one cycle.

Figure-7.10: Single-stage implementation of a logic function and its simplified timing diagram.

Let Ctotal be the total capacitance switched every clock cycle. Here, Ctotal consists of (i) the capacitance switched in the input register array, (ii) the capacitance switched to implement the logic function, and (iii) the capacitance switched in the output register array. Then, the dynamic power consumption of this structure can be found as

$$P_{\text{reference}} = C_{\text{total}} \cdot V_{DD}^{2} \cdot f_{CLK} \tag{7.11}$$

Now consider an N-stage pipelined structure for implementing the same logic function, as shown in Fig. 7.11. The logic function F(INPUT) has been partitioned into N successive stages, and a total of (N-1) register arrays have been introduced, in addition to the original input and output registers, to create the pipeline. All registers are clocked at the original sample rate, fCLK. If all stages of the partitioned function have approximately equal delay of

$$t_{P}(\text{pipeline stage}) \approx \frac{t_{P,\max}(\text{input-to-output})}{N} \tag{7.12}$$

then the logic blocks between two successive registers can operate N times slower while maintaining the same functional throughput as before. This implies that the power supply voltage can be reduced to a value of VDD,new, to effectively slow down the circuit by a factor of N. The supply voltage that achieves this slowdown can be found by solving (7.4).
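Solving the delay expression for the reduced supply voltage usually has to be done numerically. The sketch below uses simple bisection and, because (7.4) is not reproduced in this section, again assumes the common first-order inverter delay model. The names `normalized_delay` and `scaled_supply` are hypothetical helpers introduced only for this illustration.

```python
import math


def normalized_delay(vdd, vt):
    """Assumed first-order delay model (stand-in for (7.4)), valid for VDD > 2*VT."""
    if vdd <= 2.0 * vt:
        raise ValueError("model assumes VDD > 2*VT")
    overdrive = vdd - vt
    log_term = math.log(4.0 * overdrive / vdd - 1.0)
    return (2.0 * vt / overdrive + log_term) / overdrive


def scaled_supply(vdd, vt, slowdown, tol=1e-6):
    """Find VDD,new such that delay(VDD,new) = slowdown * delay(VDD).

    Bisection over (2*VT, VDD); the assumed delay model decreases as the
    supply voltage rises over this interval.
    """
    target = slowdown * normalized_delay(vdd, vt)
    lo, hi = 2.0 * vt + 1e-9, vdd
    if normalized_delay(lo, vt) < target:
        raise ValueError("requested slowdown is outside the model's valid range")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if normalized_delay(mid, vt) > target:
            lo = mid  # too slow at mid, so the solution lies at a higher voltage
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Four-stage pipeline: each stage may be four times slower.
vdd_new = scaled_supply(vdd=5.0, vt=0.8, slowdown=4.0)
print(f"VDD,new = {vdd_new:.2f} V")
```

For VDD = 5 V, VT = 0.8 V and a slowdown of 4, this assumed model gives a value slightly above 2 V, in line with the figure of approximately 2 V used in the example of this section.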

Figure-7.11: N-stage pipeline structure realizing the same logic function as in Fig. 7.10. The maximum pipeline stage delay is equal to the clock period, and the latency is N clock cycles.

The dynamic power consumption of the N-stage pipelined structure with a lower supply voltage and with the same functional throughput as the single-stage structure can be approximated by

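$$P_{\text{pipeline}} = \left[ C_{\text{total}} + (N-1) \, C_{\text{reg}} \right] \cdot V_{DD,\text{new}}^{2} \cdot f_{CLK} \tag{7.13}$$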

where Creg represents the capacitance switched by each pipeline register. Then, the power reduction factor achieved in an N-stage pipeline structure is

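$$\frac{P_{\text{pipeline}}}{P_{\text{reference}}} = \left[ 1 + \frac{C_{\text{reg}}}{C_{\text{total}}} (N-1) \right] \frac{V_{DD,\text{new}}^{2}}{V_{DD}^{2}} \tag{7.14}$$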

As an example, consider replacing a single-stage logic block (VDD = 5 V, fCLK = 20 MHz) with a four-stage pipeline structure, running at the same clock frequency. This means that the propagation delay of each pipeline stage can be increased by a factor of 4 without sacrificing the data throughput. Assuming that the magnitude of the threshold voltage of all transistors is 0.8 V, the desired speed reduction can be achieved by reducing the power supply voltage from 5 V to approximately 2 V (see Fig. 7.9). With a typical ratio of (Creg/Ctotal) = 0.1, the overall power reduction factor is found from (7.14) as approximately 1/5. This means that replacing the original single-stage logic block with a four-stage pipeline running at the same clock frequency and reducing the power supply voltage from 5 V to 2 V will provide a dynamic power saving of about 80%, while maintaining the same throughput as before.
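The arithmetic of this example can be checked with a few lines of code. The sketch assumes that the reduction factor in (7.14) has the form [1 + (Creg/Ctotal)(N - 1)] (VDD,new/VDD)^2, which is consistent with the result of about 1/5 quoted above. The function name `pipeline_power_ratio` is hypothetical.

```python
def pipeline_power_ratio(n_stages, creg_over_ctotal, vdd, vdd_new):
    """Dynamic power of the N-stage pipeline relative to the single-stage block.

    Assumed form of (7.14): the (N - 1) added register arrays raise the
    switched capacitance, while the lower supply scales power quadratically.
    """
    capacitance_factor = 1.0 + creg_over_ctotal * (n_stages - 1)
    voltage_factor = (vdd_new / vdd) ** 2
    return capacitance_factor * voltage_factor


# Values taken from the example: N = 4, Creg/Ctotal = 0.1, 5 V reduced to 2 V.
ratio = pipeline_power_ratio(n_stages=4, creg_over_ctotal=0.1, vdd=5.0, vdd_new=2.0)
print(f"power ratio  = {ratio:.3f}")      # 1.3 * 0.16 = 0.208
print(f"power saving = {1.0 - ratio:.1%}")
```

The register overhead raises the switched capacitance by 30%, but the quadratic voltage term (0.16) dominates, which is why the saving stays close to 80%.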

The architectural modification described here has a relatively small area overhead. A total of (N-1) register arrays have to be added to convert the original single-stage structure into a pipeline. While trading off area for lower power, this approach also increases the latency from one to N clock cycles. Yet in many applications such as signal processing and data encoding, latency is not a very significant concern.

Parallel Processing Approach (Hardware Replication)

Another possibility of trading off area for lower power dissipation is to use parallelism, or hardware replication. This approach could be useful especially when the logic function to be implemented is not suitable for pipelining. Consider N identical processing elements, each implementing the logic function F(INPUT) in parallel, as shown in Fig. 7.12. Assume that the consecutive input vectors arrive at the same rate as in the single-stage case examined earlier. The input vectors are routed to all the registers of the N processing blocks. Gated clock signals, each with a clock period of (N TCLK), are used to load each register every N clock cycles. This means that the clock signals to each input register are skewed by TCLK, such that each one of the N consecutive input vectors is loaded into a different input register. Since each input register is clocked at a lower frequency of (fCLK / N), the time allowed to compute the function for each input vector is increased by a factor of N. This implies that the power supply voltage can be reduced until the critical path delay equals the new clock period of (N TCLK). The outputs of the N processing blocks are multiplexed and sent to an output register which operates at a clock frequency of fCLK, ensuring the same data throughput rate as before. The timing diagram of this parallel arrangement is given in Fig. 7.13.
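The round-robin loading described above can be made concrete with a small scheduling sketch. It assumes that input vector k arrives in clock cycle k, is loaded into block (k mod N), and is passed to the output register N cycles later. The function name `parallel_schedule` and the cycle numbering are illustrative assumptions, not taken from Fig. 7.13.

```python
def parallel_schedule(num_blocks, num_inputs):
    """List which block handles each input vector and when its result appears.

    Assumption: input k is loaded in cycle k by the gated clock of block
    (k mod N) and reaches the output register N cycles later.
    """
    schedule = []
    for k in range(num_inputs):
        schedule.append({
            "input": k,
            "block": k % num_blocks,          # skewed gated clocks select one block per cycle
            "load_cycle": k,
            "output_cycle": k + num_blocks,   # each block has N clock periods to compute
        })
    return schedule


for entry in parallel_schedule(num_blocks=4, num_inputs=8):
    print(entry)
```

Each block is reloaded only every N cycles, yet one result reaches the output register in every cycle, so the throughput matches that of the single-stage structure.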

Since the time allowed to compute the function for each input vector is increased by a factor of N, the power supply voltage can be reduced to a value of VDD,new, to effectively slow down the circuit. The new supply voltage can be found, as in the pipelined case, by solving (7.4). The total dynamic power dissipation of the parallel structure (neglecting the dissipation of the multiplexor) is found as the sum of the power dissipated by the input registers and the logic blocks operating at a clock frequency of (fCLK / N), and the output register operating at a clock frequency of fCLK.

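$$P_{\text{parallel}} = N \, C_{\text{total}} \, V_{DD,\text{new}}^{2} \, \frac{f_{CLK}}{N} + C_{\text{reg}} \, V_{DD,\text{new}}^{2} \, f_{CLK} = \left( 1 + \frac{C_{\text{reg}}}{C_{\text{total}}} \right) C_{\text{total}} \, V_{DD,\text{new}}^{2} \, f_{CLK} \tag{7.15}$$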

Figure-7.12: N-block parallel structure realizing the same logic function as in Fig. 7.10. Notice that the input registers are clocked at a lower frequency of (fCLK / N).

Note that there is also an additional overhead which consists of the input routing capacitance, the output routing capacitance and the capacitance of the output multiplexor structure, all of which are increasing functions of N. If this overhead is neglected, the amount of power reduction achievable in an N-block parallel implementation is

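$$\frac{P_{\text{parallel}}}{P_{\text{reference}}} = \frac{V_{DD,\text{new}}^{2}}{V_{DD}^{2}} \left( 1 + \frac{C_{\text{reg}}}{C_{\text{total}}} \right) \tag{7.16}$$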
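As a rough numerical illustration, the sketch below evaluates the parallel reduction factor under the assumption that, with overhead neglected, it takes the form (1 + Creg/Ctotal)(VDD,new/VDD)^2. The numbers (N = 4, Creg/Ctotal = 0.1, 5 V reduced to 2 V) are borrowed from the pipelining example for comparison and are not given in the text for the parallel case. The function name `parallel_power_ratio` is hypothetical.

```python
def parallel_power_ratio(creg_over_ctotal, vdd, vdd_new):
    """Dynamic power of the N-block parallel structure relative to one block.

    Assumed form of (7.16): the N blocks run at f_CLK / N, so their combined
    power matches one block at f_CLK; only the shared output register adds
    capacitance. Routing and multiplexer overhead are neglected.
    """
    return (1.0 + creg_over_ctotal) * (vdd_new / vdd) ** 2


# Same supply reduction as the four-stage pipeline example (slowdown of 4).
ratio = parallel_power_ratio(creg_over_ctotal=0.1, vdd=5.0, vdd_new=2.0)
print(f"power ratio = {ratio:.3f}")  # 1.1 * 0.16 = 0.176
```

Note that N does not appear explicitly in this simplified ratio; it enters only through VDD,new and through the routing and multiplexer overhead that the simplification leaves out.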

The lower bound of dynamic power reduction realizable with architecture-driven voltage scaling is found, assuming zero threshold voltage, as

$$\frac{P_{\text{parallel}}}{P_{\text{reference}}} \ge \frac{1}{N^{2}} \tag{7.17}$$
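The bound follows in a few steps if the delay is assumed to behave as in the usual first-order model, where a zero threshold voltage leaves the delay inversely proportional to the supply voltage. The derivation below is a sketch under that assumption, with routing and multiplexer overhead neglected:

```latex
\begin{align*}
V_T = 0 \;&\Rightarrow\; t_P \propto \frac{1}{V_{DD}} \\
t_P(V_{DD,\text{new}}) = N \, t_P(V_{DD}) \;&\Rightarrow\; V_{DD,\text{new}} = \frac{V_{DD}}{N} \\
\frac{P_{\text{parallel}}}{P_{\text{reference}}}
  = \left( 1 + \frac{C_{\text{reg}}}{C_{\text{total}}} \right) \frac{V_{DD,\text{new}}^{2}}{V_{DD}^{2}}
  &= \left( 1 + \frac{C_{\text{reg}}}{C_{\text{total}}} \right) \frac{1}{N^{2}} \;\ge\; \frac{1}{N^{2}}
\end{align*}
```

With any nonzero threshold voltage the supply cannot be lowered all the way to VDD / N for the same slowdown, so the achievable ratio stays above this bound.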

Figure-7.13: Simplified timing diagram of the N-block parallel structure shown in Fig. 7.12.

Two obvious consequences of this approach are the increased area and the increased latency. A total of N identical processing blocks must be used to slow down the operation (clocking) speed by a factor of N. In fact, the silicon area will grow even faster than the number of processors because of signal routing and the overhead circuitry. The timing diagram in Fig. 7.13 shows that the parallel implementation has a latency of N clock cycles, as in the N-stage pipelined implementation. Considering its smaller area overhead, however, the pipelined approach offers a more efficient alternative for reducing the power dissipation while maintaining the throughput.

Divisha

Wednesday, 26 December 2012
