Shawn McCloud, Bryan Bowyer and Vikas Tyagi - Calypto

1/7/2013 8:00 AM EST

Low power is a central concern of digital design, especially for handheld and wireless devices, but also for servers and other computation-intensive applications where the cost of cooling and packaging can be quite high. As a consequence, power optimization is essential to meeting and improving quality of results, alongside optimizing performance and area.

Thus far, power optimization efforts have centered on RTL models and gate-level netlists, which are not sufficient for achieving optimal power savings. Optimizing for power should occur at all levels of design, from architecture to board. It is at the architecture, or electronic system level (ESL), where the potential for power savings is the greatest. Indeed, the opportunities for optimizing power are significantly higher at the architectural level of abstraction, with as much as a 10X improvement over gate-level optimizations. Yet, ironically, this is where low power methodologies and tools are the weakest. This deficiency drives the need for tools that not only allow designers to explore the best architecture for power at a higher level of abstraction but also automatically implement lower-level transformations, like sequential clock gating, in the RTL produced.

The answer is found in the integration of high-level synthesis (HLS) and power analysis to create a new HLS product capable of optimizing across three dimensions: power, performance, and area (PPA). HLS allows designers to synthesize different RTL architectures from C++ or SystemC ESL models. The different hardware architectures are generated through user constraints that specify such things as clock period, resource limitations, I/O protocol, and the desired level of concurrency. Such a low power HLS solution can implement a wide range of low power techniques in the synthesized RTL, including bit-width optimization, multiple clock domain partitioning, memory access minimization, resource sharing, frequency exploration, power gating, and clock gating.

In this article, we will discuss, in general, the ESL to RTL low power design flow, and then share the results of two case studies using real customer designs to evaluate the efficacy of a unique solution for ESL synthesis and power architecting.

7 basics of architectural power exploration
There are seven basic concepts that designers should focus on when looking for ways to save power, while satisfying performance goals, during architectural exploration.

1. Numerical refinement: The first design step for controlling power is numerical refinement. Algorithmic C bit-accurate data types support arbitrary precision, allowing designers to specify any desired bit width for both integer and fixed-point data types. SystemC data types can be used interchangeably. At higher design abstraction levels, this lets designers represent each value with the minimum number of bits that stays within error tolerances, which minimizes area and power.
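The trade-off between bit width and error can be sketched in plain C++. This is only an illustration using `double` arithmetic, not the Algorithmic C or SystemC types themselves; the function names `quantize` and `max_quantization_error` are hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Quantize x to a signed fixed-point value with `total_bits` bits,
// `frac_bits` of which are fractional, rounding to nearest and saturating.
// Hypothetical helper; a real flow would use a bit-accurate data type.
double quantize(double x, int total_bits, int frac_bits) {
    assert(total_bits > 1 && total_bits <= 32);
    assert(frac_bits >= 0 && frac_bits < total_bits);
    const double scale = std::ldexp(1.0, frac_bits);               // 2^frac_bits
    const double max_code = std::ldexp(1.0, total_bits - 1) - 1.0; // largest signed code
    const double min_code = -std::ldexp(1.0, total_bits - 1);      // smallest signed code
    double code = std::round(x * scale);
    if (code > max_code) code = max_code;
    if (code < min_code) code = min_code;
    return code / scale;
}

// Worst-case absolute quantization error over a set of coefficients.
double max_quantization_error(const double* coeffs, int n,
                              int total_bits, int frac_bits) {
    double worst = 0.0;
    for (int i = 0; i < n; ++i) {
        double err = std::fabs(coeffs[i] - quantize(coeffs[i], total_bits, frac_bits));
        if (err > worst) worst = err;
    }
    return worst;
}
```

Sweeping `total_bits` downward and stopping when `max_quantization_error` exceeds the allowed tolerance is one simple way to find a candidate minimum width before synthesis.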

2. Interfaces: If a design’s interface is hammering the bus or memory, the designer can widen the interface to perform several reads and writes at once and store the data locally. In pure C++ designs, this can be achieved simply by using HLS interface synthesis technology without modifying the source code. In SystemC, constraints may work in limited cases, but the change can always be made by modifying the source code.
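As a rough software model of a widened interface, the sketch below fetches four 8-bit samples per 32-bit access and keeps them in local storage. The word width, lane order, and the names `unpack_word` and `sum_wide` are assumptions chosen for illustration; in an HLS flow the widening would typically be expressed through interface constraints rather than hand-written code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Unpack one 32-bit bus word into four 8-bit samples (lowest byte first).
void unpack_word(std::uint32_t word, std::uint8_t out[4]) {
    assert(out != nullptr);
    for (int lane = 0; lane < 4; ++lane)
        out[lane] = static_cast<std::uint8_t>(word >> (8 * lane));
}

// Sum all samples, reading memory one wide word at a time.
// `accesses` reports how many bus reads were issued.
std::uint32_t sum_wide(const std::vector<std::uint32_t>& mem, std::size_t& accesses) {
    std::uint32_t sum = 0;
    accesses = 0;
    for (std::uint32_t word : mem) {
        ++accesses;             // one read fetches four samples
        std::uint8_t local[4];  // local storage for the fetched samples
        unpack_word(word, local);
        for (std::uint8_t s : local) sum += s;
    }
    return sum;
}
```

Eight samples are covered by two accesses here, where a byte-wide interface would need eight.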

3. Memory architecture: For many algorithms, power, performance, and area are highly dependent on memory architectures. For example, a FIR filter can be implemented using a shift register, a rotational shift register, or a circular buffer [Figure-1].

Figure 1: Filter tap implementation in FIR filtering

A shift register based implementation can consume more power at higher frequencies because all taps switch with each shift. This approach is typically suited to filters with a smaller number of taps and gives the highest performance.

Rotational shift is an intermediate solution. It removes the MUX feeding the multiplier, which becomes a bottleneck as the number of taps grows. Rotation occurs as part of the MAC loop, after the +=.

A circular buffer based implementation is good for a filter with a large number of taps and is ideal for mapping into a memory. It uses one pointer to set the write point (advancing forward) and another pointer to set the read point (decrementing in reverse and wrapping around the array).
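The difference between the shift register and circular buffer styles can be sketched in C++ as follows. The tap count `kTaps` and the struct names are hypothetical, and the code is a behavioral illustration rather than the coding style any particular HLS tool requires.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t kTaps = 4;  // hypothetical tap count

// Shift-register FIR: every tap register is rewritten on each new sample.
struct ShiftFir {
    std::array<int, kTaps> taps{};
    int step(int x, const std::array<int, kTaps>& coeff) {
        for (std::size_t i = kTaps - 1; i > 0; --i) taps[i] = taps[i - 1];  // all taps move
        taps[0] = x;
        int acc = 0;
        for (std::size_t i = 0; i < kTaps; ++i) acc += coeff[i] * taps[i];
        return acc;
    }
};

// Circular-buffer FIR: one write per sample; the read index walks backwards and wraps.
struct CircularFir {
    std::array<int, kTaps> buf{};
    std::size_t wr = 0;  // write pointer, advances forward
    int step(int x, const std::array<int, kTaps>& coeff) {
        assert(wr < kTaps);
        buf[wr] = x;          // only one storage element changes
        std::size_t rd = wr;  // read pointer starts at the newest sample
        int acc = 0;
        for (std::size_t i = 0; i < kTaps; ++i) {
            acc += coeff[i] * buf[rd];
            rd = (rd == 0) ? kTaps - 1 : rd - 1;  // decrement with wrap-around
        }
        wr = (wr + 1) % kTaps;
        return acc;
    }
};
```

Both structures produce the same output sequence for the same inputs; what differs is how many storage elements change per sample, which is what drives the switching power.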

Additional basic concepts
4. Micro-architecture: Optimization and exploration of the micro-architecture is a powerful technology for improving an algorithm’s performance and adjusting for power. For algorithms, it’s not just about operating frequency. The important measurements are latency (how long it takes to get the first result) and throughput (how fast the data can be fed).

An eight tap FIR filter with eight multipliers may have a latency of one cycle if the period is long and the adder tree can be done with the multipliers, but it might have a latency of two or three, or even more (if using pipelined multipliers), yet the throughput might remain constant at one clock cycle.

If only one multiplier is used and the coefficients and tap register are restricted to a single RAM, then the latency might be nine or ten clock cycles (or more) and the throughput similarly longer, but this comes with the benefit of considerably reduced area and power [Figure-2].

Figure 2: FIR serial versus FIR parallel implementation

One may reduce clock frequency to reduce power. This may then require increased parallelism in the design (using loop unrolling in the HLS tool) to balance latency. Applying loop pipelining and unrolling constraints in an HLS tool helps achieve these implementations quickly. Designers can then compare the power, performance, and area of each implementation. The right implementation depends on the design goals for frequency, latency, and throughput taken together.
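A back-of-envelope model makes the frequency versus parallelism trade-off concrete. The functions below are hypothetical and deliberately simple: they assume a non-pipelined MAC loop and a fixed overhead term, and they do not represent what an HLS scheduler actually computes.

```cpp
#include <cassert>

// Cycles needed per output sample when `taps` multiply-accumulates are
// spread over `multipliers` units, plus a fixed `overhead_cycles` term.
int cycles_per_sample(int taps, int multipliers, int overhead_cycles) {
    assert(taps > 0 && multipliers > 0 && overhead_cycles >= 0);
    int mac_cycles = (taps + multipliers - 1) / multipliers;  // ceiling division
    return mac_cycles + overhead_cycles;
}

// Clock frequency needed to sustain `sample_rate_hz` with that schedule,
// assuming one sample finishes before the next one starts.
double required_clock_hz(double sample_rate_hz, int taps,
                         int multipliers, int overhead_cycles) {
    return sample_rate_hz * cycles_per_sample(taps, multipliers, overhead_cycles);
}
```

With eight taps, one multiplier, and one overhead cycle the model gives nine cycles per sample, while eight multipliers with no overhead give one cycle, so the fully unrolled version can meet the same sample rate at a much lower clock frequency at the cost of area.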

5. Frequency: The golden source code (SystemC/C++/C) is independent of technology details. The same code can be retargeted to different target technologies [Figure-3] because frequency is just a parameter. Through frequency explorations, designers can set or adjust the clock frequency; the HLS tool then figures out how to get things to fit in a clock cycle. Also, since the implementation can be controlled down to the resources used, designers can experiment with using different operators like pipelined multipliers and adders.

Figure 3: Target optimized RTL code generation

For example, if the analysis tools show that a design actually has some extra slack in one implementation, the designer can reduce voltage to save power. Or, with a little faster implementation, they can share more operators. In this way, they can balance dynamic power with parallelism for better performance.
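For reference, the standard first-order model of CMOS dynamic power shows why spare slack can be traded for a lower supply voltage. The symbols are the usual textbook ones and are not taken from the case-study data:

$$P_{\text{dyn}} = \alpha \, C \, V_{DD}^{2} \, f$$

Here $\alpha$ is the switching activity, $C$ the switched capacitance, $V_{DD}$ the supply voltage, and $f$ the clock frequency. As a hypothetical illustration, lowering $V_{DD}$ from 1.0 V to 0.9 V at the same frequency scales dynamic power by $0.9^{2} = 0.81$, a reduction of about 19%.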

6. Block hierarchy: Having hierarchical blocks naturally lends itself to multi-clock design. More advanced HLS tools support running the blocks at different clock speeds and handling the data transfer between blocks through FIFOs. Designs with decimation are well suited to multi-clock design [Figure-4].

Figure 4: Decimation

Blocks with lower data rates may run with a slower clock, reducing the switching power and the static power by decreasing block area. In more general cases, the clock frequency can be tuned to match the best implementation for either throughput or latency and power, based on the technology target, with the same source code.
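The rate reduction behind a multi-clock decimating design can be illustrated with a minimal sketch. It assumes the downstream block handles one sample per cycle and omits the anti-alias filtering a real decimator would include; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Keep every `factor`-th sample, starting with the first.
std::vector<int> decimate(const std::vector<int>& in, std::size_t factor) {
    assert(factor > 0);
    std::vector<int> out;
    for (std::size_t i = 0; i < in.size(); i += factor) out.push_back(in[i]);
    return out;
}

// Clock needed downstream of the decimator if it processes one sample per cycle.
double downstream_clock_hz(double input_clock_hz, std::size_t factor) {
    assert(factor > 0);
    return input_clock_hz / static_cast<double>(factor);
}
```

Under these assumptions, a decimate-by-4 stage fed at 400 MHz leaves the downstream block with a quarter of the samples, so it could run from a 100 MHz clock.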

7. LVFS (Low Voltage Frequency Scaling): In low power mode, the HLS tool can insert an idle signal (1-bit output port) in the design. This signal is set when the block is in an idle state (not processing any data, not reading any input, and not writing to any output). This signal can be used in a system-level power management strategy, like LVFS or gating the clock power to a block.

Evaluating a low power hardware realization flow
Using customer designs, we ran two types of tests to evaluate a new ESL flow centered on the integration of an HLS tool and an RTL power analysis tool. The tool set we used was Calypto® Catapult® LP, which embeds the Calypto PowerPro® technology “under the hood” of the Catapult HLS product. Catapult LP enables designers to explore low power architectures at the ESL while leveraging automated RTL power optimization techniques.

Case study #1 Clock gating
This test focused on leveraging a sequential analysis engine for clock gating insertion. We benchmarked a low power HLS flow against a normal “baseline” HLS flow using real customer designs. Some of the designs were written in pure C++ and some were in SystemC. These designs were already tied to certain performance requirements for a given area.

We compared the power consumption of the RTL produced with the low power HLS flow (LP) against the RTL produced with a normal HLS flow (Base) for a given architecture. Since most of the signal processing applications had a minimum data path width of 8, we used a clock gating width of 8 for all of these designs. Power for the RTL synthesized with the normal HLS flow was estimated using the PowerPro power estimation engine in standalone mode.

Data was collected for different customer designs by running the LP and Base HLS flows. We used a variety of designs from different applications; for example, FFT, Video Encoder I, and JPEG require high performance, whereas Automotive requires extremely low power. The designs ranged in size from 11.4k to 58.1k gates. Table 1 summarizes the data.

Table 1: Power optimization data

From Table 1, we observe that a higher clock gating percentage does not always indicate power savings. One reason is that a gated flop that has to switch every clock cycle will not save power. CG (%) indicates the percentage of total flops in the design that are gated. CG Efficiency (%) measures the percentage of cycles for which a gated clock is inactive, based on a representative vector set (SAIF, FSDB). For example, a 30% clock gating efficiency means the flop is inactive for 3 out of 10 cycles of the representative vector set [Figure-5].
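The two metrics can be expressed as short calculations. The sketch below follows the definitions given above, with a per-cycle enable trace standing in for the activity data a real flow would read from a SAIF or FSDB file; the function names are hypothetical and this is not how PowerPro computes its figures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Percentage of cycles in which the gated clock is inactive (enable is low).
double cg_efficiency_percent(const std::vector<bool>& enable_trace) {
    assert(!enable_trace.empty());
    std::size_t inactive = 0;
    for (bool en : enable_trace)
        if (!en) ++inactive;
    return 100.0 * static_cast<double>(inactive) /
           static_cast<double>(enable_trace.size());
}

// Percentage of flops in the design that sit behind a clock gate.
double cg_percent(std::size_t gated_flops, std::size_t total_flops) {
    assert(total_flops > 0 && gated_flops <= total_flops);
    return 100.0 * static_cast<double>(gated_flops) /
           static_cast<double>(total_flops);
}
```

A flop whose enable is high on every cycle counts fully toward `cg_percent` but scores 0% efficiency, which is why a high gating percentage alone does not guarantee savings.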

Figure 5: Clock gating efficiency

PowerPro ranks design registers by clock gating efficiency and accepts those transformations that result in significant improvement in clock gating efficiency. The sequential enable signals along with the corresponding enable logic are automatically inserted into the resulting RTL. The positive effect of this can be seen using the Video Encoder I design as an example. The encoder already had 98.7% clock gating, but just by strengthening the enable signal, Catapult LP was able to substantially increase clock gating efficiency to further optimize the design’s power by almost 50%.

The data in Table 1 also shows that, for these designs, better clock gating efficiency consistently resulted in better power savings. The Catapult LP flow was able to improve clock gating efficiency in all cases [Figure-6].

Figure 6: Clock gating efficiency percentage improvement

Table 1 shows that the power consumption of the designs used in this case study varied from 37 µW to 12.3 mW. The percentage improvement between the Base flow and the LP flow is shown in Figure-7.

Figure 7: Average power savings

The graph shows a general trend in power savings improvement when Catapult LP optimizations were turned on. For an extremely low power application like Automotive, design improvement was approximately 12%; whereas, for a high performance design like Image Scaler, there was a 50% improvement. Absolute numbers for power can be inferred from Table 1.

Second case study
Case study #2 Architectural exploration
This test focused on the impact of architectural exploration on reducing power. It provides examples of the effect on power of selecting bit widths during quantization and then demonstrates the automatic power optimizations.

Many design teams have general guidelines used to decide on the bit widths required in a float-to-fixed conversion. However, these guidelines may be wrong because they were developed based on older ASIC technologies and over-simplify the problem of power optimization. The following example shows numerical refinement of a FIR filter running at 400 MHz on a 65 nm technology. The change to the bit width is done by editing the C++ source and entering the technology and clock speeds as constraints to Catapult LP.

Table 4: Numerical refinement of a 64 tap FIR filter — 65 nm and 400 MHz

The selection of the optimal bit widths depends on the percent error versus floating point, area, and power consumption. Note that the lowest power design is not the smallest and that the percent error also does not correlate with average power consumption.

The optimal bit width depends on the underlying technology and clock speed. For example, here is the same experiment run with a 90 nm technology at 200 MHz. At the 90 nm technology node the FIR with eight register and coefficient bits has about average power consumption, but at 65 nm that solution has the best power and area.

Table 5: Numerical refinement of a 64 tap FIR filter - 90 nm and 200 MHz

Conclusion
Raising the abstraction level above the RTL provides additional power optimization opportunities. A successful ESL hardware implementation flow should allow the designer to explore architectures for power, produce RTL that is power efficient, and quickly compare different architectural solutions for power usage. Our tests showed that using an HLS tool with a low power option allows designers to produce the lowest-power RTL within a seamless, automated flow.

About the authors

Shawn McCloud is Vice President of Marketing at Calypto Design Systems. Previously, he was the Product Line Director for the Mentor Graphics HLS technology after several years as a senior system architect responsible for RISC and CISC based microprocessor design. Shawn received his B.S. degree in electrical and computer engineering from Case Western Reserve University.

Vikas Tyagi is an Application Engineer at Calypto Design Systems. Previously, he worked as a Technical Marketing Engineer for Mentor Graphics HLS technology after several years as a system architect for SDH systems. Vikas received his M.S. degree in electrical engineering from the National Institute of Technology, Kurukshetra, in India.

Bryan Bowyer leads the Product Design team at Calypto Design Systems. He was previously the Product Manager for HLS at Mentor Graphics and has worked on HLS tools for the past 13 years. Bryan received his B.S. in Computer Engineering from Oregon State University.

Original link: https://www.cnblogs.com/bluefish/archive/2013/06/09/3129345.html

Making ESL power optimization a reality
