Reconfigurable CORDIC-Based Low-Power DCT Architecture Based On Data Priority


This article has been accepted for inclusion in a future issue of this journal.

Content is final as presented, with the exception of pagination.


IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS

Min-Woo Lee, Student Member, IEEE, Ji-Hwan Yoon, Student Member, IEEE,
and Jongsun Park, Senior Member, IEEE
Abstract: This paper presents a low-power coordinate rotation digital computer (CORDIC)-based reconfigurable discrete cosine transform (DCT) architecture. The main idea is that not all computations in the DCT are equally important in generating the frequency-domain outputs. By exploiting the differing importance of the DCT coefficients, the number of CORDIC iterations can be changed dynamically to trade off image quality for power consumption. Thus, the computational energy can be significantly reduced without seriously compromising the image quality. The proposed CORDIC-based 2-D DCT architecture is implemented in a 0.13-μm CMOS process, and the experimental results show that our reconfigurable DCT achieves power savings ranging from 22.9% to 52.2% over the CORDIC-based Loeffler DCT at the cost of minor image quality degradation.
Index Terms: Coordinate rotation digital computer (CORDIC), data priority, discrete cosine transform (DCT), low-power, reconfigurable architecture.

I. INTRODUCTION

WITH THE explosive growth of multimedia services running on portable applications, the demand for low-power implementations of complex signal processing algorithms is increasing tremendously. The most significant part of multimedia systems is the set of applications involving image and video processing, which are computationally intensive and must therefore be implemented at low cost because of the limited battery lifetime of portable devices. Many previous research efforts have focused on reducing the power dissipation of image and video applications [1]–[3]. In particular, the low-power design of the discrete cosine transform (DCT) [4] has attracted interest, since the DCT is one of the most computationally intensive operations in video and image compression, and it is widely adopted in many standards such as JPEG [5], MPEG [6], and H.264 [7].
Manuscript received May 26, 2012; revised November 1, 2012 and February
8, 2013; accepted April 21, 2013. This work was supported in part by the
Basic Science Research Program through the National Research Foundation
of Korea funded by the Ministry of Education, Science and Technology under
Grant 2010-0004484, and also supported by the National Research Foundation
of Korea under Grant funded by the Korea Government (MEST) under Grant
2011-0020128.
M.-W. Lee was with the School of Electrical Engineering, Korea University, Seoul 110-810, Korea. He is now with DTV SoC Development
Team, SIC R&D Lab., LG Electronics Co., Seoul 157-030, Korea (e-mail:
minwoo3264.lee@lge.com).
J.-H. Yoon and J. Park are with the School of Electrical Engineering,
Korea University, Seoul 136-701, Korea (e-mail: improma@korea.ac.kr;
jongsun@korea.ac.kr).
Color versions of one or more of the figures in this paper are available
online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TVLSI.2013.2263232

Since it was first proposed in 1959 [8], the coordinate rotation digital computer (CORDIC) has been widely used to calculate trigonometric functions in signal processing applications such as QR decomposition [9], the fast Fourier transform [10], and singular value decomposition [11], [12]. Because CORDIC can be implemented simply with iterative additions and shifts, it has been widely used for multiplierless low-power DCT architectures [13]–[18].
Many previous research works have focused on reducing the hardware complexity of the DCT, such as the distributed arithmetic (DA)-based DCT [19] and the multiple constant multiplication (MCM)-based approach [20]. Although the bit-serial DA-based approach offers a regular and simple DCT architecture, a large hardware area is needed for bit-parallel operation because of the additional ROMs and control logic. The MCM-based DCT [20] can be implemented with a smaller number of shift-and-add operations; however, to trade off image quality against computation energy, the computation sharing in the different datapaths must be completely reconsidered. In the low-power CORDIC-based DCT architecture presented in [14], data correlations between neighboring pixels are used to skip internal CORDIC iterations. Approximation techniques and the incorporation of compensation steps into the quantization have also been exploited to reduce the power consumption of the CORDIC-based DCT architecture [16]. Most of the previous research works have focused mainly on reducing the number of arithmetic units; the inherent data priorities in the DCT coefficients, however, have not been exploited in the CORDIC-based DCT.
In the DCT, not all computations are equally important in generating the frequency-domain outputs (DCT coefficients). In other words, some of the computations in the DCT are critical for determining the output image quality, while others play relatively less important roles. This property can be used to trade off output image quality against power dissipation [21]–[24]. In this paper, we present a low-power CORDIC-based DCT architecture in which the importance differences among the DCT coefficients are exploited to achieve power savings with minimum image quality degradation. To apply the priority-based data processing, lookahead CORDIC architectures [25]–[27] are adopted to overcome the inherent data dependencies in the conventional CORDIC architecture. Thus, the number of CORDIC iterations is dynamically controlled according to the importance of the DCT coefficients, by which considerable power savings are achieved.
The rest of this paper is organized as follows. The basics of the CORDIC algorithm and the conventional CORDIC-based DCT are presented in Section II. The proposed low-power CORDIC-based DCT architecture and its hardware implementation are presented in Section III. Based on the proposed DCT architecture, a reconfigurable CORDIC-based DCT is presented in Section IV. Finally, conclusions are drawn in Section V.

1063-8210/$31.00 © 2013 IEEE

II. CONVENTIONAL CORDIC-BASED DCT ARCHITECTURE
Fig. 1.

A. CORDIC Architecture
The basic principle of CORDIC is to iteratively rotate a vector using a rotation matrix [8], which is represented as follows:

$$
\begin{aligned}
x_i &= x_{i-1} - \sigma_i\, 2^{1-i}\, y_{i-1} \\
y_i &= y_{i-1} + \sigma_i\, 2^{1-i}\, x_{i-1} \\
z_i &= z_{i-1} - \sigma_i\, \alpha_i
\end{aligned}
\tag{1}
$$

where $x$ and $y$ are the vector components along the x and y axes, respectively, $i$ is the iteration step, $\sigma_i$ is the sign-bit that can be +1 or −1, indicating the direction of the vector rotation, $z$ is the accumulated rotation angle, and $\alpha_i$ is the predefined angle of each microrotation step, $\alpha_i = \arctan(2^{1-i})$. In the CORDIC architecture, the amplitude and argument of a given vector can be calculated using the vectoring mode, while the sine and cosine values of a given angle are obtained with the rotation mode [28]. The hardware architecture of the CORDIC iteration is shown in Fig. 1, which is referred to as the crossing-architecture in the following.
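The recurrence in (1) can be illustrated with a short floating-point sketch. This is only a behavioral model under stated assumptions: the function names `cordic_rotate` and `cordic_gain` are hypothetical, the sign-bit is chosen in rotation mode from the sign of the residual angle $z$, and real hardware would use fixed-point shifts rather than floating-point multiplications.

```python
import math

def cordic_rotate(x, y, angle, n=16):
    """Rotation-mode CORDIC following (1): rotate (x, y) by `angle` radians."""
    z = angle
    for i in range(1, n + 1):
        shift = 2.0 ** (1 - i)           # 2^(1-i), a wired right shift in hardware
        sigma = 1 if z >= 0 else -1      # sign-bit: steer the residual angle toward zero
        x, y = x - sigma * shift * y, y + sigma * shift * x
        z -= sigma * math.atan(shift)    # alpha_i = arctan(2^(1-i))
    return x, y, z

def cordic_gain(n=16):
    """Magnitude growth accumulated over n micro-rotations."""
    g = 1.0
    for i in range(1, n + 1):
        g *= math.sqrt(1.0 + 4.0 ** (1 - i))
    return g

x, y, z = cordic_rotate(1.0, 0.0, math.pi / 6, n=24)
g = cordic_gain(24)
print(x / g, y / g)   # close to cos(30 deg) and sin(30 deg)
```

Dividing by the accumulated gain corresponds to the scale-factor compensation discussed later in this section.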
1) Lookahead CORDIC Approach: In the CORDIC equation shown in (1), the output of the current stage can be calculated only after the results of the previous-stage iterations have been computed. These data dependencies are the main performance bottleneck in conventional CORDIC hardware. To overcome them, the lookahead CORDIC [25]–[27] was developed, where lookahead means that a number of CORDIC iterations are computed ahead so that they finish at one time. An example of a four-iteration lookahead CORDIC [25]–[27] is shown in (2). It is noteworthy that if the sign-bits $\sigma_k$ ($k = 1, \ldots, 4$) are known ahead, the outputs of the following stages can be computed directly from the input vector of the present stage without computing the intermediate results:

$$
\begin{bmatrix} x_4 \\ y_4 \end{bmatrix}
=
\begin{bmatrix} A_4 & -B_4 \\ B_4 & A_4 \end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}
\tag{2}
$$

with

$$
\begin{aligned}
A_4 &= 1 - \sigma_1\sigma_2 2^{-1} - \sigma_1\sigma_3 2^{-2} - \sigma_1\sigma_4 2^{-3} - \sigma_2\sigma_3 2^{-3} - \sigma_2\sigma_4 2^{-4} - \sigma_3\sigma_4 2^{-5} + \sigma_1\sigma_2\sigma_3\sigma_4 2^{-6} \\
B_4 &= \sigma_1 2^{0} + \sigma_2 2^{-1} + \sigma_3 2^{-2} + \sigma_4 2^{-3} - \sigma_1\sigma_2\sigma_3 2^{-3} - \sigma_1\sigma_2\sigma_4 2^{-4} - \sigma_1\sigma_3\sigma_4 2^{-5} - \sigma_2\sigma_3\sigma_4 2^{-6}.
\end{aligned}
$$
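The lookahead idea can be checked numerically: once the sign-bits are known, the micro-rotations of (1) collapse into a single 2 × 2 matrix that is applied directly to the input vector. The sketch below uses exact fractions; the helper names and the example sign-bits are hypothetical and only serve the illustration.

```python
from fractions import Fraction

def micro_rotation(sigma, k):
    """2x2 matrix of one CORDIC micro-rotation with shift 2^(1-k), as in (1)."""
    s = sigma * Fraction(1, 2 ** (k - 1))
    return [[Fraction(1), -s], [s, Fraction(1)]]

def matmul(a, b):
    return [[sum(a[r][t] * b[t][c] for t in range(2)) for c in range(2)]
            for r in range(2)]

def lookahead_matrix(sigmas):
    """Collapse the micro-rotations for known sign-bits into one matrix."""
    m = [[Fraction(1), Fraction(0)], [Fraction(0), Fraction(1)]]
    for k, sigma in enumerate(sigmas, start=1):
        m = matmul(micro_rotation(sigma, k), m)   # later steps multiply from the left
    return m

def iterate(sigmas, x, y):
    """Reference: run the iterations of (1) one after another."""
    for k, sigma in enumerate(sigmas, start=1):
        s = sigma * Fraction(1, 2 ** (k - 1))
        x, y = x - s * y, y + s * x
    return x, y

sigmas = [+1, -1, +1, +1]               # hypothetical sign-bits, known ahead
m = lookahead_matrix(sigmas)
x0, y0 = Fraction(3), Fraction(5)
x4 = m[0][0] * x0 + m[0][1] * y0        # computed without intermediate results
y4 = m[1][0] * x0 + m[1][1] * y0
print(x4, y4, iterate(sigmas, x0, y0))
```

Because every matrix entry is a sum of signed powers of two, each output can be built from shifts and additions of the inputs alone.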
2) Scale-Factor in CORDIC Operations: In the CORDIC operation, the magnitude of the rotated vector is scaled at every iteration, and the scaling accumulates according to the following equation:

$$K_i = \frac{1}{\sqrt{1 + 2^{2(1-i)}}}. \tag{3}$$

Fig. 1. Hardware architecture of CORDIC iteration.

After a series of iterations, the accumulated $K_i$ value in (3) converges to a constant as follows:

$$K(n) = \prod_{i=1}^{n} K_i = \prod_{i=1}^{n} \frac{1}{\sqrt{1 + 2^{2(1-i)}}}, \qquad \lim_{n \to \infty} K(n) \approx 0.60725\ldots \tag{4}$$

where $n$ is the number of iterations. The constant above is the scale-factor used to restore the scaled magnitude of the rotated vector. The scale-factor is determined by the number of iterations. In the following sections, we use a low-power CORDIC architecture in which the number of iterations is modified and the vector rotates to the target angle in only one direction. The corresponding scale-factor must be modified as well according to the executed iterations. The scale-factor is discussed further in Section III-A.
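As a quick numerical check of (3) and (4), the accumulated scale-factor can be evaluated for increasing iteration counts. The function name `accumulated_scale` is a hypothetical helper for this sketch.

```python
import math

def accumulated_scale(n):
    """K(n) from (4): product of the per-iteration factors K_i of (3)."""
    k = 1.0
    for i in range(1, n + 1):
        k *= 1.0 / math.sqrt(1.0 + 2.0 ** (2 * (1 - i)))
    return k

for n in (1, 2, 4, 8, 16):
    print(n, accumulated_scale(n))   # approaches 0.60725... as n grows
```

The product settles after a handful of iterations, which is why skipping late iterations changes the required scale-factor only slightly, while skipping early ones changes it substantially.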

B. CORDIC-Based DCT Architecture

The 2-D DCT process is decomposed into a 1-D DCT (row DCT) followed by another 1-D DCT (column DCT), which is expressed as the following equation:

$$Y = T x T^{T} = T (T x^{T})^{T} \tag{5}$$

where $x$ and $Y$ are the 8 × 8 image data matrix and the 2-D DCT transformed output matrix, respectively, and $T$ is the 8 × 8 1-D DCT basis matrix. The 2-D DCT process with separable 1-D DCT is shown in Fig. 2.
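A minimal sketch of the row-column decomposition in (5) is given below. It assumes the standard orthonormal 8-point DCT basis for $T$ (the 1-D definition used in this section), and the helper names are hypothetical. It is a floating-point model, not the CORDIC datapath.

```python
import math

N = 8

def dct_basis():
    """8 x 8 basis matrix T with T[k][i] = c(k)/2 * cos((2i+1) k pi / 16)."""
    t = []
    for k in range(N):
        ck = 1 / math.sqrt(2) if k == 0 else 1.0
        t.append([ck / 2 * math.cos((2 * i + 1) * k * math.pi / 16)
                  for i in range(N)])
    return t

def matmul(a, b):
    return [[sum(a[r][t] * b[t][c] for t in range(len(b)))
             for c in range(len(b[0]))] for r in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def dct2(block):
    """Y = T (T x^T)^T: a 1-D DCT on the rows, then a 1-D DCT on the columns."""
    t = dct_basis()
    rows = matmul(t, transpose(block))      # T x^T
    return matmul(t, transpose(rows))       # T (T x^T)^T

flat = [[1.0] * N for _ in range(N)]
print(dct2(flat)[0][0])   # a constant block puts all of its energy in the DC term
```

The same 1-D transform is reused for both passes, which is what allows a single 1-D DCT datapath plus a transposition step to serve as the 2-D processor.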
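The 8 × 8 1-D DCT transform is expressed as

$$
X(k) = \frac{c(k)}{2} \sum_{i=0}^{7} x(i) \cos\!\left(\frac{(2i+1)k\pi}{16}\right), \quad k = 0, 1, 2, \ldots, 7, \qquad
c(k) = \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & \text{otherwise} \end{cases}
\tag{6}
$$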

where $x(i)$ is the input data and $X(k)$ is the 1-D DCT transformed output data. In vector-matrix form, the 1-D DCT is represented as $X = T x^{T}$, where $T$ is the 8 × 8 DCT basis matrix, and $X$ and $x$ are the output and input vectors, respectively.

Fig. 2. 8 × 8 2-D DCT processor with separable 1-D DCT.

Since the 8 × 8 DCT basis matrix $T$ has a symmetric property, the 1-D DCT transform is represented as follows:

$$
\begin{aligned}
\begin{bmatrix} X(0) \\ X(4) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix} c_4 & c_4 \\ c_4 & -c_4 \end{bmatrix}
\begin{bmatrix} x(0) + x(7) + x(3) + x(4) \\ x(1) + x(6) + x(2) + x(5) \end{bmatrix} \\
\begin{bmatrix} X(2) \\ X(6) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix} c_2 & c_6 \\ c_6 & -c_2 \end{bmatrix}
\begin{bmatrix} x(0) + x(7) - x(3) - x(4) \\ x(1) + x(6) - x(2) - x(5) \end{bmatrix} \\
\begin{bmatrix} X(1) \\ X(3) \\ X(5) \\ X(7) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix}
c_1 & c_3 & c_5 & c_7 \\
c_3 & -c_7 & -c_1 & -c_5 \\
c_5 & -c_1 & c_7 & c_3 \\
c_7 & -c_5 & c_3 & -c_1
\end{bmatrix}
\begin{bmatrix} x(0) - x(7) \\ x(1) - x(6) \\ x(2) - x(5) \\ x(3) - x(4) \end{bmatrix}
\end{aligned}
\tag{7}
$$

where $c_k = \cos(k\pi/16)$. The cosine elements in (7) can be changed into sine elements through the trigonometric symmetry
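property, and (7) can be rearranged as the following equations:

$$
\begin{aligned}
\begin{bmatrix} X(4) \\ X(0) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix} c_4 & -s_4 \\ s_4 & c_4 \end{bmatrix}
\begin{bmatrix} x(0) + x(7) + x(3) + x(4) \\ x(1) + x(6) + x(2) + x(5) \end{bmatrix} \\
\begin{bmatrix} X(6) \\ X(2) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix} c_6 & -s_6 \\ s_6 & c_6 \end{bmatrix}
\begin{bmatrix} x(0) + x(7) - x(3) - x(4) \\ x(1) + x(6) - x(2) - x(5) \end{bmatrix} \\
\begin{bmatrix} X(1) \\ X(7) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix} c_7 & s_7 \\ -s_7 & c_7 \end{bmatrix}
\begin{bmatrix} x(3) - x(4) \\ x(0) - x(7) \end{bmatrix}
+ \frac{1}{2}
\begin{bmatrix} c_3 & s_3 \\ -s_3 & c_3 \end{bmatrix}
\begin{bmatrix} x(1) - x(6) \\ x(2) - x(5) \end{bmatrix} \\
\begin{bmatrix} X(3) \\ X(5) \end{bmatrix}
&= \frac{1}{2}
\begin{bmatrix} c_3 & -s_3 \\ s_3 & c_3 \end{bmatrix}
\begin{bmatrix} x(0) - x(7) \\ x(3) - x(4) \end{bmatrix}
- \frac{1}{2}
\begin{bmatrix} c_1 & s_1 \\ -s_1 & c_1 \end{bmatrix}
\begin{bmatrix} x(2) - x(5) \\ x(1) - x(6) \end{bmatrix}
\end{aligned}
\tag{8}
$$

where $s_m = \sin(m\pi/16) = c_k$, and $m = 8 - k$. The rearranged 1-D DCT equations now take the form of vector rotation matrices, which are realized with consecutive CORDIC iterations as shown in Fig. 3. The DCT can thus be implemented using only shifters and adders, without multipliers [13]. Note that the sign-bits and the scale-factors are known ahead, since the input angles of the CORDIC modules are given by the DCT bases.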
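The equivalence between the rotation form (8) and the direct definition (6) can be verified with a small floating-point sketch. The rotations here are ideal (computed with `math.cos` and `math.sin`) rather than CORDIC approximations, and the function names are hypothetical.

```python
import math

def dct1d_direct(x):
    """8-point DCT straight from the definition in (6)."""
    out = []
    for k in range(8):
        ck = 1 / math.sqrt(2) if k == 0 else 1.0
        out.append(ck / 2 * sum(x[i] * math.cos((2 * i + 1) * k * math.pi / 16)
                                for i in range(8)))
    return out

def rot_ccw(m, a, b):
    """[[c_m, -s_m], [s_m, c_m]] applied to (a, b)."""
    c, s = math.cos(m * math.pi / 16), math.sin(m * math.pi / 16)
    return c * a - s * b, s * a + c * b

def rot_cw(m, a, b):
    """[[c_m, s_m], [-s_m, c_m]] applied to (a, b)."""
    c, s = math.cos(m * math.pi / 16), math.sin(m * math.pi / 16)
    return c * a + s * b, -s * a + c * b

def dct1d_rotations(x):
    """8-point DCT built only from butterflies and plane rotations, as in (8)."""
    a = x[0] + x[7] + x[3] + x[4]
    b = x[1] + x[6] + x[2] + x[5]
    c = x[0] + x[7] - x[3] - x[4]
    d = x[1] + x[6] - x[2] - x[5]
    d0, d1, d2, d3 = x[0] - x[7], x[1] - x[6], x[2] - x[5], x[3] - x[4]
    X = [0.0] * 8
    X[4], X[0] = rot_ccw(4, a, b)
    X[6], X[2] = rot_ccw(6, c, d)
    p, q = rot_cw(7, d3, d0), rot_cw(3, d1, d2)
    X[1], X[7] = p[0] + q[0], p[1] + q[1]
    p, q = rot_ccw(3, d0, d3), rot_cw(1, d2, d1)
    X[3], X[5] = p[0] - q[0], p[1] - q[1]
    return [v / 2 for v in X]

sample = [52, 55, 61, 66, 70, 61, 64, 73]   # arbitrary example pixel row
print(dct1d_direct(sample))
print(dct1d_rotations(sample))
```

Only six plane rotations are needed per 1-D transform, and each of them is a fixed-angle rotation, which is the property the CORDIC modules exploit.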
After the 2-D DCT operation, the input data in the spatial domain are transformed to the frequency domain, which is the 8 × 8 block of 64 DCT coefficients shown in Fig. 4. Because the DCT has an energy compaction property, the signal energy of the output data (DCT coefficients) is mostly concentrated in a few low-frequency components, while the higher frequency components are associated with small signal energy. The high-frequency DCT coefficients become even smaller after the quantization step [5], and the human eye is more sensitive to the low-frequency components (such as DC) than to the high-frequency components.
Fig. 3. Hardware architecture of CORDIC-based 1-D DCT.

Fig. 4. Sensitivity difference of 8 × 8 2-D DCT coefficients.

The main idea in this paper is based on the fact that low-frequency DCT coefficients are relatively more important than high-frequency coefficients. Our CORDIC-based DCT architecture is designed considering the importance differences between the low- and high-frequency DCT coefficients. Generally, the more iterations are performed in CORDIC, the more accurate the results are. Therefore, in the proposed DCT architecture, a larger number of CORDIC iterations is assigned to generate the low-frequency DCT coefficients, whereas a smaller number of iterations is used for the high-frequency components. The number of CORDIC iterations is judiciously selected such that the image quality degradation caused by the smaller number of iterations is minimized. Detailed explanations

on the DCT hardware will be presented in the following sections.

TABLE I
REQUIRED ITERATIONS AND DIRECTIONS FOR VECTOR ROTATION (+: CLOCKWISE DIRECTION, −: COUNTER-CLOCKWISE DIRECTION)

Angle       Required Iterations        Directions (Sign-Bits)
π/16+       i = 0, 1, 3, 10            σ = −1, +1, +1, +1
3π/16+      i = 1, 3, 10               σ = −1, −1, −1
3π/16−      i = 1, 3, 10               σ = +1, +1, +1
4π/16       i = 0                      σ = +1
6π/16       i = (90°), 2, 3, 5, 7      σ = −1, +1, +1, +1, −1
7π/16+      i = 0, 1, 3, 10            σ = −1, −1, −1, −1

TABLE II
CORDIC SCALE-FACTORS AND THE APPROXIMATION VALUES FOR MULTIPLIERLESS IMPLEMENTATION

Angle       Desired Scale-Factor       Approximation Value
π/16        0.3137856...               $2^{-2} + 2^{-4} + 2^{-10}$
3π/16       0.4437599...               $2^{-1} - 2^{-4} + 2^{-7} - 2^{-9}$
4π/16       0.3535533...               $2^{-2} + 2^{-4} + 2^{-5} + 2^{-7} + 2^{-9}$
6π/16       0.4810759...               $2^{-1} - 2^{-6} - 2^{-8}$
7π/16       0.3137856...               $2^{-2} + 2^{-4} + 2^{-10}$

Fig. 5. Differences between (a) crossing-architecture and (b) lookahead approach-based architecture of CORDIC module.

III. PRIORITY-BASED LOW-POWER DCT ARCHITECTURE USING LOOKAHEAD CORDIC APPROACH

A. Data Priority Considered Lookahead CORDIC Architecture


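In the conventional CORDIC structure shown in Fig. 1, changing the number of iterations separately for the two CORDIC datapaths is not feasible because of the crossing datapath. To assign a different number of iterations to each of the two CORDIC datapaths, we adopt the lookahead CORDIC approach [25]–[27] in the proposed DCT architecture. Following (2), the three-step lookahead CORDIC can be expressed as follows:

$$
\begin{bmatrix} x_3 \\ y_3 \end{bmatrix}
=
\begin{bmatrix} A_3 & -B_3 \\ B_3 & A_3 \end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix},
\qquad
\begin{aligned}
A_3 &= 1 - \sigma_1\sigma_2 2^{-1} - \sigma_1\sigma_3 2^{-2} - \sigma_2\sigma_3 2^{-3} \\
B_3 &= \sigma_1 2^{0} + \sigma_2 2^{-1} + \sigma_3 2^{-2} - \sigma_1\sigma_2\sigma_3 2^{-3}.
\end{aligned}
\tag{9}
$$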
If the CORDIC result for $x$ requires four iterations whereas three iterations suffice for $y$, then, from (2) and (9), the lookahead CORDIC equation for both results can be expressed as follows, which means that the two CORDIC outputs can be calculated separately:

$$
\begin{bmatrix} x_4 \\ y_3 \end{bmatrix}
=
\begin{bmatrix} A_4 & -B_4 \\ B_3 & A_3 \end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}
\tag{10}
$$

where $A_n$ and $B_n$ are the diagonal and off-diagonal entries of the $n$-step lookahead matrix, that is, the real and imaginary parts of $\prod_{k=1}^{n} (1 + j\,\sigma_k 2^{1-k})$.
Fig. 5 presents the difference between the conventional crossing CORDIC architecture and the lookahead-based approach. When the lookahead approach is applied to the CORDIC architecture, the number of iterations can be easily controlled because all the internal datapaths become independent.
In the proposed CORDIC-based DCT architecture, where different numbers of iterations are assigned for generating the DCT coefficients, the number of iterations must be chosen carefully to minimize the error between the desired input angle and the corresponding accumulated angle. Table I shows the iterations executed at the $i$th stages and the corresponding rotation directions (sign-bits). For example, to rotate the vector by π/16, only the iterations $i = 0, 1, 3, 10$ are executed, and the rest of the iterations can be skipped for power savings. The lookahead algorithm for the π/16 CORDIC rotator can be written as follows:

$$
\begin{bmatrix} x \\ y \end{bmatrix}
=
\begin{bmatrix} 1 & -\sigma_3 2^{-3} \\ \sigma_3 2^{-3} & 1 \end{bmatrix}
\begin{bmatrix} 1 & -\sigma_{10} 2^{-10} \\ \sigma_{10} 2^{-10} & 1 \end{bmatrix}
\begin{bmatrix} 1 & -\sigma_1 2^{-1} \\ \sigma_1 2^{-1} & 1 \end{bmatrix}
\begin{bmatrix} 1 & -\sigma_0 2^{0} \\ \sigma_0 2^{0} & 1 \end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}
\tag{11}
$$
where $\sigma_0 = -1$, $\sigma_1 = +1$, $\sigma_3 = +1$, and $\sigma_{10} = +1$. In Table I, $i = (90°)$ represents the optional first iteration of the CORDIC [8]. In our DCT, the iterations to be skipped are carefully selected such that the error between the desired angle and the corresponding accumulated angle does not exceed 0.004° for any of the given angles. For example, for the π/16 CORDIC rotator, the error between the desired angle and the angle rotated using the combination of CORDIC iterations presented in Table I is 0.00397958°. The combination of CORDIC iterations used to derive the lookahead CORDIC algorithm can be decided using the software modeling process presented in Section III-C.
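The angle-error bound can be reproduced with a few lines of arithmetic. The sketch below assumes, as in (11), that iteration $i$ contributes a micro-rotation of $\arctan(2^{-i})$, and it models the optional first iteration as a plain 90° step; the dictionary layout and names are hypothetical.

```python
import math

def micro_angle_deg(i):
    """Angle of iteration i in degrees; None denotes the optional 90-degree step."""
    return 90.0 if i is None else math.degrees(math.atan(2.0 ** -i))

# (target angle in degrees, [(iteration, sign-bit), ...]) transcribed from Table I
TABLE_I = {
    "pi/16+":  (11.25, [(0, -1), (1, +1), (3, +1), (10, +1)]),
    "3pi/16+": (33.75, [(1, -1), (3, -1), (10, -1)]),
    "3pi/16-": (33.75, [(1, +1), (3, +1), (10, +1)]),
    "4pi/16":  (45.00, [(0, +1)]),
    "6pi/16":  (67.50, [(None, -1), (2, +1), (3, +1), (5, +1), (7, -1)]),
    "7pi/16+": (78.75, [(0, -1), (1, -1), (3, -1), (10, -1)]),
}

def angle_error_deg(target, steps):
    """Magnitude error between the target and the accumulated rotation angle."""
    accumulated = sum(sigma * micro_angle_deg(i) for i, sigma in steps)
    return abs(abs(accumulated) - target)

errors = {name: angle_error_deg(t, s) for name, (t, s) in TABLE_I.items()}
for name, err in errors.items():
    print(f"{name:8s} error = {err:.8f} deg")
```

Only the magnitude of the accumulated angle is compared here, since the sign convention of the rotation direction does not affect the size of the residual error.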
As mentioned in Section II-A2, the scale-factor is determined by the executed CORDIC iterations. Because the iterations are known ahead, the scale-factors are predetermined, as shown in Table II. In the table, the scale-factors are represented in signed power-of-two format, and the quantization error of each scale-factor is below 10E−4.
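The tabulated scale-factors can be cross-checked against the iteration sets of Table I. The sketch below assumes that the factor 1/2 of the 1-D DCT is folded into each scale-factor (an assumption, made because it reproduces the tabulated values) and that iteration $i$ uses the shift $2^{-i}$; the names are hypothetical.

```python
import math

def desired_scale(iterations):
    """Scale-factor of the executed iterations, with the DCT factor 1/2 folded in."""
    k = 0.5
    for i in iterations:
        k /= math.sqrt(1.0 + 4.0 ** -i)
    return k

def signed_power_sum(terms):
    """Evaluate a signed power-of-two approximation given as (sign, exponent) pairs."""
    return sum(sign * 2.0 ** -e for sign, e in terms)

# angle: (executed iterations, tabulated value, approximation terms from Table II)
TABLE_II = {
    "pi/16":  ([0, 1, 3, 10], 0.3137856, [(+1, 2), (+1, 4), (+1, 10)]),
    "3pi/16": ([1, 3, 10],    0.4437599, [(+1, 1), (-1, 4), (+1, 7), (-1, 9)]),
    "4pi/16": ([0],           0.3535533, [(+1, 2), (+1, 4), (+1, 5), (+1, 7), (+1, 9)]),
    "6pi/16": ([2, 3, 5, 7],  0.4810759, [(+1, 1), (-1, 6), (-1, 8)]),
    "7pi/16": ([0, 1, 3, 10], 0.3137856, [(+1, 2), (+1, 4), (+1, 10)]),
}

for angle, (iters, tabulated, terms) in TABLE_II.items():
    exact = desired_scale(iters)
    approx = signed_power_sum(terms)
    print(f"{angle:7s} exact={exact:.7f} approx={approx:.7f} error={abs(exact - approx):.2e}")
```

Each approximation needs at most five shift-and-add terms, so the scaling stage stays multiplierless.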

One interesting observation when the lookahead approach is applied to CORDIC is that removing the high shift-terms has an effect similar to that of a lookahead CORDIC with fewer iterations. For example, if the CORDIC rotation by π/16 is executed using three iterations ($i = 0, 1, 3$), the lookahead CORDIC algorithm and its corresponding scale-factor are as follows:

$$
\begin{aligned}
x &= (1 + 2^{-1} + 2^{-4})\,x_0 + (2^{-2} + 2^{-4})\,y_0 \\
y &= (-2^{-2} - 2^{-4})\,x_0 + (1 + 2^{-1} + 2^{-4})\,y_0
\end{aligned}
\tag{12}
$$

$$K_{\pi/16^{+}} = 0.3137858\ldots \tag{13}$$

When the higher shift-terms (terms smaller than $2^{-9}$) are eliminated from (11), the equation reduces to (12) and (13). Note that (11) represents four iterations ($i = 0, 1, 3, 10$), whereas (12) represents three iterations ($i = 0, 1, 3$). The number of CORDIC iterations can therefore be controlled simply by removing the high shift-terms.
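The relation between (11) and (12) can be checked by expanding the product of micro-rotations. The sketch below represents each micro-rotation matrix as the complex number $1 + j\,\sigma_i 2^{-i}$, which is only a bookkeeping convenience of this example; the function name is hypothetical.

```python
def lookahead_coeffs(steps):
    """Return (d, o) such that x = d*x0 - o*y0 and y = o*x0 + d*y0."""
    m = complex(1.0, 0.0)
    for i, sigma in steps:
        m *= complex(1.0, sigma * 2.0 ** -i)   # one micro-rotation per factor
    return m.real, m.imag

three = [(0, -1), (1, +1), (3, +1)]            # iterations of (12)
four = three + [(10, +1)]                      # iterations of (11)

d3, o3 = lookahead_coeffs(three)
d4, o4 = lookahead_coeffs(four)

print(d3, -o3)            # coefficients of x0 and y0 in the x row of (12)
print(d4 - d3, o4 - o3)   # what the extra iteration adds to each coefficient
```

The extra iteration changes each coefficient by less than $2^{-9}$, so discarding those small shift-terms from the four-iteration expansion leaves exactly the three-iteration equations.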

B. Proposed Low-Power CORDIC-Based DCT Architecture

As mentioned at the end of Section III-A, considering the data priorities of the DCT coefficients, the high shift-terms of the lookahead CORDIC can be carefully removed, which has the same effect as using fewer CORDIC iterations. Because fewer CORDIC iterations mean lower computational complexity, a low-power CORDIC-based DCT architecture can be derived; its detailed implementation is as follows.
Fig. 6(a) shows the hardware architecture of the proposed CORDIC-based DCT. Inside the CORDIC module, the lookahead CORDIC is derived using the parameters in Table I. The scale-factors are specified in Table II. An example of the lookahead CORDIC algorithm for the 7π/16 rotation and the corresponding scale-factors is presented in the equations shown in Fig. 6(b). To reduce the number of iterations, the high shift-terms are removed as presented in Section III-A; the resulting implementation is drawn with solid lines in Fig. 6(b). We further remove the less important components by considering the data priorities of the DCT coefficients. In Fig. 6(b), the CORDIC output $Kx$ is more important than $Ky$ because it is used later for X(1), whereas $Ky$ is needed for the higher frequency component X(7). Thus, the high shift-terms for $y$ and $Ky$ are further removed, which is indicated by the dotted lines in Fig. 6(b).
In the proposed hardware architecture, all the shift components of each lookahead CORDIC algorithm and of the scale-factors are precomputed using the lookahead CORDIC equations. In Fig. 6(c), the numbers in the circles represent shift operations, and a black circle denotes the 2's complement of the shifted component, which is used for subtraction. The dotted lines in Fig. 6(c) represent the omitted computations. The two results of a lookahead CORDIC module therefore have different numbers of terms, which leads to power savings owing to the smaller number of iterations.

TABLE III
HARDWARE IMPLEMENTATION AND COMPARISON RESULTS FOR VARIOUS DCT ARCHITECTURES

Architecture    [19]     [20]     [13]     [16]     [17]     Proposed
PSNR (dB)       31.63    31.49    31.72    30.61    31.57    31.45
Gate count      36.2k    24.6k    41.6k    27.3k    31.5k    22.4k
Power (mW)      6.76     5.42     7.72     6.54     5.62     5.11

C. Experimental Results of the Proposed Low-Power CORDIC-Based DCT Architecture

In this section, the experimental results of the proposed CORDIC-based DCT architecture are presented. First, the number of CORDIC iterations is decided according to the target PSNR of 31.5 dB, which is the average PSNR obtained using the nine benchmark images listed in Table IV. The PSNRs of the benchmark images are obtained using the following equations:

$$\mathrm{PSNR} = 20 \log_{10}\!\left(\frac{255}{\sqrt{\mathrm{MSE}}}\right) \tag{14}$$

$$\mathrm{MSE} = \frac{1}{mn} \sum_{x=0}^{m-1} \sum_{y=0}^{n-1} \left[ I(x, y) - K(x, y) \right]^2 \tag{15}$$

where $I$ is the original image of size $m \times n$, and $K$ is the reconstructed image. The data bit-widths inside the proposed DCT architecture are specified in Fig. 2.
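For reference, (14) and (15) translate directly into code. The sketch below operates on images stored as lists of rows of 8-bit pixel values; the function names are hypothetical, and identical images are mapped to infinity by convention.

```python
import math

def mse(original, reconstructed):
    """Mean squared error of (15) for two equally sized images (lists of rows)."""
    m, n = len(original), len(original[0])
    total = sum((original[x][y] - reconstructed[x][y]) ** 2
                for x in range(m) for y in range(n))
    return total / (m * n)

def psnr(original, reconstructed):
    """PSNR of (14) in dB for 8-bit images."""
    e = mse(original, reconstructed)
    return math.inf if e == 0 else 20 * math.log10(255 / math.sqrt(e))

img = [[10, 20], [30, 40]]
noisy = [[11, 19], [31, 39]]        # every pixel off by one, so MSE = 1
print(psnr(img, noisy))
```

Averaging this value over the benchmark images gives the figure that is compared against the target PSNR when the iterations are selected.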
For comparison, various DCT architectures, namely the DA-based DCT [19], the MCM-based DCT [20], the CORDIC-based DCT [13], and the CORDIC-based Loeffler DCT [16], [17], are implemented using a 0.13-μm CMOS standard cell library. The implemented 2-D DCT is marked with a dotted line in Fig. 2, and Table III shows the implementation results. In the table, the power consumption of the different DCT architectures is measured using nanosim [29] at a 100-MHz clock frequency and a 1.2-V supply voltage. More than 500 input vectors are used to obtain the average power. Compared with the DA-based architecture [19], the proposed DCT shows 38.1% area savings and 24% power savings. Compared with the MCM-based DCT [20], the proposed DCT shows comparable power consumption and approximately 10% smaller area with a minor image quality degradation of 0.04 dB. Because some of the higher order shift-terms in the CORDIC iterations can be removed by considering the importance differences of the DCT coefficients, our proposed DCT architecture shows the lowest gate count and power consumption among the CORDIC-based architectures [13], [16], and [17]. In particular, the proposed DCT architecture shows 21.87% power savings compared with the CORDIC-based Loeffler DCT [16], with an even better PSNR.
IV. RECONFIGURABLE CORDIC-BASED DCT ARCHITECTURE

A. Proposed Reconfigurable Low-Power CORDIC-Based DCT Architecture

Using the low-power DCT architecture presented in the previous section, we propose in this section a reconfigurable CORDIC-based DCT architecture that further reduces the power consumption at the expense of a minor image quality degradation. Several tradeoff modes are presented, and the proposed reconfigurable architecture can dynamically change the CORDIC iterations to adaptively trade off image quality for computation energy in the same hardware.

Fig. 6. (a) Hardware architecture of the proposed low-power CORDIC-based 1-D DCT. (b) An example of lookahead CORDIC algorithm (7π/16) and (c) its hardware architecture.

Generally, in the lookahead CORDIC, the shift-terms for calculating low-frequency DCT coefficients (the terms for calculating X(0) and X(1) in (8)) are more important than the shift-terms for calculating high-frequency coefficients. Additionally, among the shift-terms in one lookahead CORDIC equation, the low shift-terms are the most important, while the high shift-terms are relatively less important. To save computation power at the expense of minimum image quality degradation, the least important shift-term in X(7) is removed first, based on the greedy algorithm [30]. We then search for the next least important shift-term and remove its computation. As the process is repeated, more shift-terms are removed, which means that the computation power is reduced with minimum image quality degradation.
Fig. 7 shows the pseudocode for the shift-term reduction process in the proposed CORDIC-based DCT. In step 1, the high shift-terms of the CORDIC rotation part (EQ_Terms) and the scale-factor part (SC_Terms) of the lookahead CORDIC equation are initialized to those of the normal mode described in Section III-B. Once the target PSNR constraint is decided in step 2, the loop of steps 3–21 is performed until the minimum number of CORDIC terms that satisfies the target PSNR is found. In the inner loop, we repeatedly search for the least sensitive shift-terms inside EQ_Terms and SC_Terms. Then, the least sensitive shift-term, that is, the one whose removal causes the lowest PSNR degradation, is selected between EQ_Terms and SC_Terms. Because the best choice (the least sensitive shift-term) is taken based on the lookahead equation, which is updated in every iteration of the loop, the approach described in Fig. 7 is based on the greedy algorithm [30]. The selected shift-term is removed, and the CORDIC equations of the current iteration are updated. The iteration continues until no further shift-term reduction is possible under the imposed PSNR constraint. For the PSNR calculation, we use the average PSNR of nine benchmark images [22]–[24].

Fig. 7. Pseudocode for shift-term reduction process in the proposed CORDIC-based DCT.
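The greedy selection described above can be sketched as follows. This is not the pseudocode of Fig. 7, which is not reproduced in the text; it is a minimal illustration in which `evaluate_psnr` is a hypothetical callback standing in for the software model (average PSNR over the benchmark images for a given set of active shift-terms), and the toy cost table is invented purely for the example.

```python
def greedy_term_reduction(terms, evaluate_psnr, target_psnr):
    """Remove shift-terms one at a time while the PSNR constraint still holds."""
    active = set(terms)
    while active:
        # Try removing each remaining term; keep the removal that hurts PSNR least.
        best_term, best_psnr = None, float("-inf")
        for term in sorted(active):
            candidate = evaluate_psnr(active - {term})
            if candidate > best_psnr:
                best_term, best_psnr = term, candidate
        if best_psnr < target_psnr:
            break                       # no removal satisfies the constraint
        active.remove(best_term)
    return active

# Toy stand-in for the image-quality model: each term has a fixed PSNR cost (dB).
toy_cost = {"x1_2^-1": 5.0, "x7_2^-4": 0.5, "x3_2^-3": 2.0, "k7_2^-10": 1.0}

def toy_psnr(active):
    return 40.0 - sum(cost for term, cost in toy_cost.items() if term not in active)

print(sorted(greedy_term_reduction(toy_cost, toy_psnr, target_psnr=37.0)))
```

In this toy model the term costs are independent, so the greedy order is simply cheapest first; in the real flow the sensitivities are re-evaluated after every removal because the lookahead equations change.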
With the approach shown in Fig. 7, we propose three tradeoff levels: the normal mode, mode 1, and mode 2. At higher tradeoff levels (sacrificing image quality in favor of lower power), the number of shift-terms composing the lookahead CORDIC equations is reduced. Table IV shows the PSNR results of the benchmark images for the three tradeoff levels. The image quality constraints for the normal mode, mode 1, and mode 2 are average PSNRs of 31.5, 30, and 27 dB, respectively, over the nine benchmark images. The number of tradeoff modes and the minimum allowable PSNRs can be changed according to the user's choice.
In Fig. 8, we present the number of shift-terms in the lookahead CORDIC equations and the scale-factors for the three modes of operation. As an example, to calculate the X(3) component, both the 3π/16 and π/16 CORDIC rotators are needed, and they are expressed as the following lookahead CORDIC equations in the normal mode:

$$X_{3\pi/16^{-}} = (1 - 2^{-4})\,x_0 + (-2^{-1} - 2^{-3})\,y_0 \tag{16}$$

$$X_{\pi/16^{+}} = (1 + 2^{-1})\,x_0 + (2^{-2})\,y_0. \tag{17}$$

The scale-factors in the normal mode are as follows:

$$K_{3\pi/16^{-}} = 2^{-1} - 2^{-4} \tag{18}$$

$$K_{\pi/16^{+}} = 2^{-2} + 2^{-4}. \tag{19}$$

According to the equations above, four shift-terms are used for the $X_{3\pi/16^{-}}$ CORDIC rotator, while three terms are used for the $X_{\pi/16^{+}}$ rotator. Thus, the normal mode of the X(3) CORDIC in Fig. 8 is denoted as 4 | 3. At tradeoff level 1, the 3π/16 CORDIC rotator is reduced as follows:

$$X_{3\pi/16^{-}} = 1 \cdot x_0 + (-2^{-1})\,y_0. \tag{20}$$

At higher tradeoff levels, the number of shift-terms is further reduced, as specified in Fig. 8.

TABLE IV
PSNR DIFFERENCES IN EACH MODE OF PROPOSED RECONFIGURABLE DCT ARCHITECTURE WITH VARIOUS IMAGE DATA

PSNR (dB)    Normal    Mode 1    Mode 2
baboon       27.41     26.59     23.89
clegg        28.33     26.40     21.95
frymire      26.03     23.17     19.30
lena         34.30     33.54     31.75
monarch      34.98     33.69     30.55
peppers      35.93     34.61     30.71
sail         31.41     30.73     28.35
serrano      29.57     28.17     25.28
tulips       35.08     33.91     30.93

Fig. 8. Number of shift-terms inside the lookahead CORDIC rotators and scale-factors of our proposed reconfigurable DCT architecture (+: clockwise direction, *: counter-clockwise direction).

Fig. 9. (a) Turnoff gate schematic [24]. (b) Dynamic bit-width control using turnoff gate.

Fig. 10. Overall hardware architecture of the proposed reconfigurable CORDIC-based DCT.

B. Hardware Implementation of the Reconfigurable DCT and Experimental Results

The image quality and computational energy tradeoff approach proposed in the previous section can be realized as reconfigurable hardware using the DCT architecture shown in Fig. 6. In the normal mode of operation, the low-power DCT architecture in Section III-B is used. At tradeoff level 1, some
of the shift-terms (2 ) are removed as shown in Fig. 6(b).
In the DCT hardware architecture, removing the higher shift-terms means that the number of addition operations is reduced by turning off the corresponding datapaths to save computation energy. A simple turnoff gate [24], shown in Fig. 9(a), is used to turn off the datapaths of the high shift-terms. An example of the proposed approach is illustrated in Fig. 9(b), where the bit-width of the datapath is dynamically controlled using a dynamic bit-width control (DBC) circuit.
The overall hardware architecture of the proposed reconfigurable CORDIC-based DCT is shown in Fig. 10. For the different tradeoff modes, the proposed DCT architecture can be dynamically reconfigured by simply changing the control signal to trade off minor image quality for computation energy. The left side of Fig. 10 shows the proposed dynamically reconfigurable CORDIC module. Once a tradeoff mode is determined, the control signal controls the turnoff gate arrays for both the CORDIC equation terms and the scaling terms. It is noteworthy that the proposed architecture and the design parameters can be changed according to the required amount of power savings.

TABLE V
POWER CONSUMPTION AT DIFFERENT TRADEOFF MODES

                  Normal    Mode 1    Mode 2
PSNR (dB)         31.45     30.09     26.97
Power (mW)        5.11      3.58      3.13
Percentage (%)    100       70.15     61.27

Fig. 11. Lena images obtained using the proposed reconfigurable CORDIC-based DCT. (a) Normal mode. (b) Mode 1. (c) Mode 2.

The power consumption of our DCT architecture in the different modes is shown in Table V. The power consumption is measured with nanosim [29] at a 100-MHz clock frequency and a 1.2-V supply voltage. The PSNR in Table V is the average PSNR of the nine benchmark images. As shown in the table, the proposed architecture offers significant power savings as the image quality decreases. Compared with the normal mode, mode 2 provides 38.73% power savings at the cost of image quality degradation. Compared with the CORDIC-based Loeffler DCT [16] shown in Table III, the proposed architecture shows 45.3% power savings in mode 1 at the expense of a 0.52-dB image quality degradation. At tradeoff level 2, the proposed DCT architecture achieves up to 59.5% power savings compared with the conventional CORDIC-based DCT [13], with considerable image quality degradation. It is noteworthy that the area increase for the reconfigurable architecture is only 7% when the turnoff gates [24] are used. Examples of Lena images under the various tradeoff modes are presented in Fig. 11.
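The relative savings quoted above follow from the power figures in Tables III and V. A small worked calculation is sketched below; the dictionary keys and the helper name `saving_percent` are hypothetical labels for this example.

```python
# Power figures in mW, transcribed from Tables III and V.
power_mw = {
    "[13]": 7.72, "[16]": 6.54, "[19]": 6.76,
    "normal": 5.11, "mode1": 3.58, "mode2": 3.13,
}

def saving_percent(reference, proposed):
    """Relative power saving of `proposed` with respect to `reference`, in percent."""
    return 100.0 * (power_mw[reference] - power_mw[proposed]) / power_mw[reference]

for ref, mode in (("[16]", "normal"), ("[16]", "mode1"),
                  ("[13]", "mode2"), ("[19]", "normal")):
    print(f"{mode:6s} vs {ref}: {saving_percent(ref, mode):.2f} % saved")
```

Small differences in the last digit relative to the quoted percentages can arise because the tabulated power values are themselves rounded.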
V. C ONCLUSION
In the conventional DCT architecture, not all the computations
are equally important in generating the frequency
domain outputs. This paper presented a low-power CORDIC-based
DCT architecture, where the importance differences in
DCT coefficients were efficiently exploited to allocate the
numbers of CORDIC iterations and internal data bit-widths.
Lookahead CORDIC architectures were effectively used to
overcome the inherent data dependencies in the conventional
crossing architecture of CORDIC. The proposed reconfigurable
CORDIC-based DCT architecture can dynamically
change the tradeoff modes, with power savings ranging
from 22.9% to 52.2% compared with the CORDIC-based
Loeffler DCT architecture [16]. The idea presented in this
paper can assist the low-power design of image and video
compression applications.
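To illustrate why the number of CORDIC iterations can be traded against accuracy, the following Python sketch implements a conventional (textbook) floating-point CORDIC rotation with a configurable iteration count. It is only an illustration under stated assumptions: it is not the lookahead, fixed-point architecture proposed in this paper, and the function names cordic_rotate and rotation_error are hypothetical.

```python
import math


def cordic_rotate(x, y, angle, iterations):
    """Rotate (x, y) by `angle` radians with `iterations` micro-rotations.

    Conventional rotation-mode CORDIC in floating point (an assumption for
    illustration; real hardware uses fixed-point shifts and adds).
    """
    z = angle          # residual angle still to be rotated
    gain = 1.0         # accumulated CORDIC scaling factor
    for i in range(iterations):
        d = 1.0 if z >= 0.0 else -1.0      # rotation direction
        shift = 2.0 ** -i                  # models a right shift by i bits
        x, y = x - d * y * shift, y + d * x * shift
        z -= d * math.atan(shift)
        gain *= math.sqrt(1.0 + shift * shift)
    # Compensate the scaling so the result has the input's magnitude.
    return x / gain, y / gain


def rotation_error(angle, iterations):
    """Distance between the CORDIC result and the exact rotation of (1, 0)."""
    xr, yr = cordic_rotate(1.0, 0.0, angle, iterations)
    return math.hypot(xr - math.cos(angle), yr - math.sin(angle))


if __name__ == "__main__":
    for n in (4, 8, 12):
        print(n, rotation_error(math.pi / 8, n))
```

For angles inside the CORDIC convergence range, the residual angle after n iterations is bounded by atan(2^-(n-1)), so fewer iterations cost less computation but leave a larger worst-case rotation error. This is the kind of accuracy-versus-effort knob that the data-priority scheme exploits.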
ACKNOWLEDGMENT
The authors would like to thank the IC Design Education
Center (IDEC) for its software assistance.

REFERENCES
[1] T. Liu, T. Lin, S. Wang, and C. Lee, "A low-power dual-mode video
decoder for mobile applications," IEEE Commun. Mag., vol. 44, no. 8,
pp. 119-126, Aug. 2006.
[2] M. Parlak and I. Hamzaoglu, "Low power H.264 deblocking filter
hardware implementations," IEEE Trans. Consum. Electron., vol. 54,
no. 2, pp. 808-816, May 2008.
[3] A. Bahari, T. Arslan, and A. T. Erdogan, "Low-power H.264
video compression architectures for mobile communication," IEEE
Trans. Circuits Syst. Video Technol., vol. 19, no. 9, pp. 1251-1261,
Sep. 2009.
[4] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform,"
IEEE Trans. Comput., vol. 23, no. 1, pp. 90-93, Jan. 1974.
[5] G. K. Wallace, "The JPEG still picture compression standard,"
IEEE Trans. Consum. Electron., vol. 38, no. 1, pp. 18-34, Feb. 1992.
[6] D. L. Gall, "MPEG: A video compression standard for multimedia
applications," Commun. ACM, vol. 34, no. 4, pp. 46-58, Apr. 1991.
[7] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview
of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst.
Video Technol., vol. 13, no. 7, pp. 560-576, Jul. 2003.
[8] J. E. Volder, "The CORDIC trigonometric computing technique,"
IRE Trans. Electron. Comput., vol. 8, no. 3, pp. 330-334, Sep. 1959.
[9] A. Maltsev, V. Pestretsov, R. Maslennikov, and A. Khoryaev, "Triangular
systolic array with reduced latency for QR-decomposition of
complex matrices," in Proc. IEEE Int. Symp. Circuits Syst., May 2006,
pp. 385-388.
[10] A. M. Despain, "Fourier transform computers using CORDIC iterations,"
IEEE Trans. Comput., vol. 23, no. 10, pp. 993-1001, Oct. 1974.
[11] S. Hsiao and J. Delosme, "Parallel singular value decomposition of
complex matrices using multidimensional CORDIC algorithms," IEEE Trans.
Signal Process., vol. 44, no. 3, pp. 685-697, Mar. 1996.
[12] J. R. Cavallaro and F. T. Luk, "CORDIC arithmetic for an SVD
processor," J. Parallel Distrib. Comput., vol. 5, no. 3, pp. 271-290,
Jun. 1988.
[13] E. P. Mariatos, D. E. Metafas, J. A. Hallas, and C. E. Goutis, "A fast
DCT processor, based on special purpose CORDIC rotators," in Proc.
IEEE Int. Symp. Circuits Syst., Jun. 1994, pp. 271-274.
[14] H. Jeong, J. Kim, and W. Cho, "Low-power multiplierless DCT
architecture using image data correlation," IEEE Trans. Consum. Electron.,
vol. 50, no. 1, pp. 262-267, Feb. 2004.
[15] T. Sung, Y. Shieh, C. Yu, and H. Hsin, "High-efficiency and low-power
architectures for 2-D DCT and IDCT based on CORDIC rotation,"
in Proc. Int. Parallel Distrib. Comput. Appl. Technol., Dec. 2006,
pp. 191-196.
[16] C. C. Sun, S. J. Ruan, B. Heyne, and J. Goetze, "Low-power and
high-quality CORDIC-based Loeffler DCT for signal processing," IET
Circuits, Devices, Syst., vol. 1, no. 6, pp. 453-461, Dec. 2007.
[17] Z. Wu, J. Sha, Z. Wang, and L. Li, "An improved scaled
DCT architecture," IEEE Trans. Consum. Electron., vol. 55, no. 2,
pp. 685-689, May 2009.
[18] S. Hsiao, Y. Hu, T. Juang, and C. Lee, "Efficient VLSI implementations
of fast multiplierless approximated DCT using parameterized
hardware modules for silicon intellectual property design," IEEE
Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 8, pp. 1568-1579,
Aug. 2005.
[19] S. Yu and E. E. Swartzlander, "DCT implementation with distributed
arithmetic," IEEE Trans. Comput., vol. 50, no. 9, pp. 985-991,
Sep. 2001.
[20] B. Kim and S. G. Ziavras, "Low-power multiplierless DCT for
image/video coders," in Proc. IEEE Int. Symp. Consum. Electron.,
May 2009, pp. 133-136.
[21] J. Bracamonte, M. Ansorge, and F. Pellandini, "VLSI systems for
image compression: A power-consumption/image-resolution trade-off
approach," in Proc. Digit. Compress. Technol. Syst. Video Commun.
Conf., 1994, pp. 271-274.
[22] G. Karakonstantis, N. Banerjee, and K. Roy, "Process-variation resilient
and voltage-scalable DCT architecture for robust low-power computing,"
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 10,
pp. 1461-1470, Oct. 2010.
[23] J. Park and K. Roy, "A low power reconfigurable DCT architecture to
trade off image quality for computational complexity," in Proc. IEEE
Int. Conf. Acoust., Speech, Signal Process., May 2004, pp. 17-20.
[24] J. Park, J. H. Choi, and K. Roy, "Dynamic bit-width adaptation in
DCT: An approach to trade off image quality and computation energy,"
IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 5,
pp. 787-793, May 2010.

[25] J. Li, "Sign lookahead CORDIC," M.S. thesis, Dept. Electr. Eng., Nat.
Cheng Kung Univ., Tainan, Taiwan, 2008.
[26] S. Wang and E. E. Swartzlander, "Merged CORDIC algorithm," in Proc.
IEEE Int. Symp. Circuits Syst., May 1995, pp. 1988-1991.
[27] B. Gisuthan and T. Srikanthan, "Pipelining flat CORDIC based
trigonometric function generators," Microelectron. J., vol. 33, nos. 1-2,
pp. 77-89, Jan. 2002.
[28] P. K. Meher, J. Valls, T. Juang, K. Sridharan, and K. Maharatna,
"50 years of CORDIC," IEEE Trans. Circuits Syst. I, Reg. Papers,
vol. 56, no. 9, pp. 1893-1907, Sep. 2009.
[29] NanoSim User Guide, Version A-2008.03, Synopsys Inc., Mountain
View, CA, USA, 2008.
[30] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to
Algorithms. Cambridge, MA, USA: MIT Press, 1998.

Min-Woo Lee (S'12) received the B.S. and M.S.
degrees in electrical engineering from Korea University, Seoul, Korea, in 2009 and 2012, respectively.
Since February 2012, he has been with the Department of DTV SoC Development, SIC R&D Lab,
LG Electronics Corporation, Seoul, as a Research
Engineer. His current research interests include
CORDIC-based DSP systems and low-power, high-performance VLSI architectures.

Ji-Hwan Yoon (S'13) received the B.S. degree in
electrical engineering from Korea University, Seoul,
Korea, in 2009, where he is currently pursuing the
M.S. and Ph.D. degrees with the Department of
Electrical and Computer Engineering.
His current interests include low-power, high-throughput LDPC decoder architectures, CORDIC-based
DSP systems, and ultra-low-power system
design.

Jongsun Park (M'05-SM'13) received the B.S.
degree in electronics engineering from Korea University, Seoul, Korea, in 1998, and the M.S. and
Ph.D. degrees in electrical and computer engineering
from Purdue University, West Lafayette, IN, USA,
in 2000 and 2005, respectively.
He joined the Electrical Engineering Faculty,
Korea University, in 2008. From 2005 to 2008, he
was with the Signal Processing Technology Group,
Marvell Semiconductor, Inc., Santa Clara, CA, USA.
He was with the Digital Radio Processor System
Design Group, Texas Instruments, Dallas, TX, USA, in 2002. His current
research interests include variation-tolerant, low-power and high-performance
VLSI architectures, and circuit designs for digital signal processing and digital
communications.
