Reconfigurable CORDIC-Based Low-Power DCT Architecture Based On Data Priority
I. INTRODUCTION
Since it was first proposed in 1959 [8], the coordinate rotation digital computer (CORDIC) has been
widely used to calculate trigonometric functions in signal processing applications
such as QR decomposition [9], fast Fourier transform [10], and
singular value decomposition [11], [12]. Because
CORDIC can be implemented with simple iterative additions and shifts, it has also been widely used in
multiplierless low-power DCT architectures [13]–[18].
Many previous works focused on reducing the
hardware complexity of the DCT, such as the distributed arithmetic
(DA)-based DCT [19] and the multiple constant multiplication
(MCM)-based approach [20]. Although the bit-serial DA-based
approach offers a regular and simple DCT architecture, a large
hardware area is needed for bit-parallel operation because of
the additional ROMs and control logic. The MCM-based DCT [20]
can be implemented with a smaller number of shift-and-add operations; however, to trade off
image quality against computation energy, the computation sharing among the different datapaths
has to be completely reconsidered.
For the low-power CORDIC-based DCT architecture presented in [14], data correlations between neighboring pixels
are used to skip internal CORDIC iterations.
Approximation techniques, or compensation steps incorporated
into the quantization, have also been exploited to reduce the power
consumption of the CORDIC-based DCT architecture [16]. Most
previous works focused mainly on reducing
the number of arithmetic units; the inherent data priorities
in the DCT coefficients, however, have not been exploited in the
CORDIC-based DCT.
In the DCT, not all computations are equally important in
generating the frequency-domain outputs (DCT coefficients).
In other words, some of the computations are critical
for determining the output image quality, while others play
relatively less important roles. This property can
be used to trade off output
image quality against power dissipation [21]–[24]. In this paper,
we present a low-power CORDIC-based DCT architecture
in which the differences in importance among the DCT coefficients
are exploited to achieve power savings with minimum image quality degradation. To apply the priority-based
data processing, lookahead CORDIC architectures [25]–[27]
are adopted to overcome the inherent data dependencies of
the conventional CORDIC architecture. The number of
CORDIC iterations is then dynamically controlled according to the
importance of the DCT coefficients, by which considerable power
savings are achieved.
The rest of this paper is organized as follows. The basics
of CORDIC algorithm and the conventional CORDIC-based
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
A. CORDIC Architecture
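The basic principle of CORDIC is to iteratively rotate a
vector using a rotation matrix [8], which is represented as
follows:
$$
\begin{aligned}
x_i &= x_{i-1} - \sigma_i 2^{1-i} y_{i-1} \\
y_i &= y_{i-1} + \sigma_i 2^{1-i} x_{i-1} \\
z_i &= z_{i-1} - \sigma_i \alpha_i
\end{aligned} \tag{1}
$$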
where $x$ and $y$ are the vector components along the $x$
and $y$ axes, respectively, $i$ is the iteration step, $\sigma_i$ is the
sign-bit, which can be $+1$ or $-1$ and indicates the direction of
the vector rotation, $z$ is the accumulated rotation angle, and
$\alpha_i$ is the predefined angle of each microrotation step,
$\alpha_i = \arctan(2^{1-i})$. In the CORDIC architecture, the amplitude
and argument of a given vector can be calculated using the
vectoring mode, while the sine and cosine values of a given
angle are obtained with the rotation mode [28]. The hardware
architecture of the CORDIC iteration is shown in Fig. 1, which
is referred to as a crossing architecture in the following.
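As an illustration of the iteration in (1), the following Python sketch runs CORDIC in rotation mode with floating-point arithmetic. It is a minimal model under stated assumptions, not the hardware described here: the function name `cordic_rotate`, the iteration count `n`, and the choice of the sign-bit from the sign of the residual angle are illustrative, and the gain accumulated by the microrotations is divided out at the end.

```python
import math

def cordic_rotate(x0, y0, angle, n=16):
    """Rotate (x0, y0) by `angle` radians with n CORDIC microrotations (rotation mode)."""
    x, y, z = x0, y0, angle
    gain = 1.0
    for i in range(1, n + 1):
        t = 2.0 ** (1 - i)                # shift factor 2^(1-i), as in (1)
        sigma = 1.0 if z >= 0 else -1.0   # assumed rule: drive the residual angle to zero
        x, y = x - sigma * t * y, y + sigma * t * x
        z -= sigma * math.atan(t)         # subtract the microrotation angle
        gain *= math.sqrt(1.0 + t * t)    # each microrotation stretches the vector
    return x / gain, y / gain, z

# Rotating the unit vector (1, 0) yields approximately (cos, sin) of the angle.
c, s, residual = cordic_rotate(1.0, 0.0, math.pi / 6)
```

In a shift-and-add implementation the multiplications by `t` become right shifts, and the division by `gain` is replaced by a constant scale-factor.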
1) Lookahead CORDIC Approach: In the CORDIC equation shown in (1), the output of the current stage
can be calculated only after the results of the previous stage iterations have been computed. These data dependencies are the main performance
bottleneck in the conventional CORDIC hardware. To overcome
the data dependencies, the lookahead CORDIC [25]–[27] was
developed, where lookahead means that a number of CORDIC
iterations are computed ahead so that they finish at one
time. An example of the four-iteration lookahead CORDIC
[25]–[27] is shown in (2). It is noteworthy that if the sign-bits $\sigma_k$ ($k = 1, \ldots, 4$) are known in advance, the following stage
iterations can be directly computed using the input vectors of
the present stage iteration without computing the intermediate
results:
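$$
\begin{bmatrix} x_4 \\ y_4 \end{bmatrix} =
\begin{bmatrix}
-\sigma_1\sigma_2 2^{0} - \sigma_1\sigma_3 2^{-1} - \sigma_1\sigma_4 2^{-2} + \sigma_1\sigma_2\sigma_3\sigma_4 2^{-3} &
-\sigma_1 2^{0} + \sigma_1\sigma_2\sigma_3 2^{-1} + \sigma_1\sigma_2\sigma_4 2^{-2} + \sigma_1\sigma_3\sigma_4 2^{-3} \\
+\sigma_1 2^{0} - \sigma_1\sigma_2\sigma_3 2^{-1} - \sigma_1\sigma_2\sigma_4 2^{-2} - \sigma_1\sigma_3\sigma_4 2^{-3} &
-\sigma_1\sigma_2 2^{0} - \sigma_1\sigma_3 2^{-1} - \sigma_1\sigma_4 2^{-2} + \sigma_1\sigma_2\sigma_3\sigma_4 2^{-3}
\end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}. \tag{2}
$$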
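The lookahead idea can be checked numerically with a short Python sketch. It expands a chain of microrotations into two precomputed coefficients and compares the one-step result with the stage-by-stage result. The function names, the example sign-bits, and the shift factors `2^0 ... 2^-3` are assumptions chosen for illustration; they are not taken from the hardware configuration of this paper.

```python
from itertools import combinations
import math

def iterate(x, y, signs, shifts):
    """Conventional CORDIC: each stage needs the outputs of the previous stage."""
    for s, t in zip(signs, shifts):
        x, y = x - s * t * y, y + s * t * x
    return x, y

def lookahead_coeffs(signs, shifts):
    """Expand the product of microrotations into (a, b) such that
    x_n = a*x0 - b*y0 and y_n = b*x0 + a*y0."""
    terms = [s * t for s, t in zip(signs, shifts)]
    a = b = 0.0
    for r in range(len(terms) + 1):
        for subset in combinations(terms, r):
            p = math.prod(subset)            # product of r sign-bit/shift factors
            if r % 2 == 0:
                a += (-1) ** (r // 2) * p    # even-order products: diagonal entries
            else:
                b += (-1) ** (r // 2) * p    # odd-order products: off-diagonal entries
    return a, b

signs = [+1, -1, +1, -1]                      # example sign-bits, assumed known ahead
shifts = [2.0 ** -k for k in range(4)]        # example shift factors
x0, y0 = 3.0, 4.0

a, b = lookahead_coeffs(signs, shifts)
x_la, y_la = a * x0 - b * y0, b * x0 + a * y0   # single-step lookahead result
x_it, y_it = iterate(x0, y0, signs, shifts)     # stage-by-stage result
```

Because `a` and `b` depend only on the sign-bits and shift factors, they can be fixed before any input arrives, which is what removes the stage-to-stage dependency.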
2) Scale-Factor in CORDIC Operations: In the CORDIC
operation, the magnitude of the rotated vector is scaled and
accumulated after every iteration according to the following
equation:
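$$
K_i = \frac{1}{\sqrt{1 + 2^{2(1-i)}}}. \tag{3}
$$
$$
K(n) = \prod_{i=1}^{n} K_i = \prod_{i=1}^{n} \frac{1}{\sqrt{1 + 2^{2(1-i)}}} \tag{4}
$$
$$
\lim_{n \to \infty} K(n) \approx 0.60725\ldots \tag{5}
$$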
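The accumulated scale-factor can be evaluated directly. The following Python sketch multiplies the per-iteration factors $1/\sqrt{1 + 2^{2(1-i)}}$ for $i = 1, \ldots, n$; the function name `scale_factor` is illustrative.

```python
import math

def scale_factor(n):
    """Accumulated CORDIC scale-factor K(n) after n iterations, i = 1..n."""
    k = 1.0
    for i in range(1, n + 1):
        k /= math.sqrt(1.0 + 2.0 ** (2 * (1 - i)))
    return k

k1 = scale_factor(1)     # a single 45-degree microrotation: 1/sqrt(2)
k24 = scale_factor(24)   # already very close to the limit value 0.60725...
```

Since the factors approach 1 quickly, a handful of iterations determines the scale-factor to several digits, so skipping iterations changes the required constant.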
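$$
X(k) = \frac{1}{2}\, c(k) \sum_{i=0}^{7} x(i) \cos\frac{(2i+1)k\pi}{16}, \qquad k = 0, 1, 2, \ldots, 7,
\qquad c(k) = \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & \text{otherwise} \end{cases} \tag{6}
$$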
where $x(i)$ is the input data and $X(k)$ is the 1-D DCT transformed
output data. In vector-matrix form, the 1-D DCT is represented
as $X = T \cdot x^{T}$, where $T$ is the $8 \times 8$ DCT basis matrix, and $X$ and
$x$ are the output and input vectors, respectively. Since the $8 \times 8$
DCT basis matrix $T$ has a symmetric property, the 1-D DCT
LEE et al.: RECONFIGURABLE CORDIC-BASED LOW-POWER DCT ARCHITECTURE BASED ON DATA PRIORITY
Fig. 2.
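$$
\begin{bmatrix} X(1) \\ X(3) \\ X(5) \\ X(7) \end{bmatrix} = \frac{1}{2}
\begin{bmatrix}
c_1 & c_3 & c_5 & c_7 \\
c_3 & -c_7 & -c_1 & -c_5 \\
c_5 & -c_1 & c_7 & c_3 \\
c_7 & -c_5 & c_3 & -c_1
\end{bmatrix}
\begin{bmatrix} x(0) - x(7) \\ x(1) - x(6) \\ x(2) - x(5) \\ x(3) - x(4) \end{bmatrix} \tag{7}
$$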
where $c_k = \cos(k\pi/16)$. The cosine elements in (7) can be
changed into sine elements through the trigonometric symmetry
property, and (7) can be rearranged as the following equations:
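$$
\begin{aligned}
\begin{bmatrix} X(4) \\ X(0) \end{bmatrix} &= \frac{1}{2}
\begin{bmatrix} c_4 & -s_4 \\ s_4 & c_4 \end{bmatrix}
\begin{bmatrix} x(0) + x(7) + x(3) + x(4) \\ x(1) + x(6) + x(2) + x(5) \end{bmatrix} \\
\begin{bmatrix} X(6) \\ X(2) \end{bmatrix} &= \frac{1}{2}
\begin{bmatrix} c_6 & -s_6 \\ s_6 & c_6 \end{bmatrix}
\begin{bmatrix} x(0) + x(7) - x(3) - x(4) \\ x(1) + x(6) - x(2) - x(5) \end{bmatrix} \\
\begin{bmatrix} X(1) \\ X(7) \end{bmatrix} &= \frac{1}{2}
\begin{bmatrix} c_7 & s_7 \\ -s_7 & c_7 \end{bmatrix}
\begin{bmatrix} x(3) - x(4) \\ x(0) - x(7) \end{bmatrix}
+ \frac{1}{2}
\begin{bmatrix} c_3 & s_3 \\ -s_3 & c_3 \end{bmatrix}
\begin{bmatrix} x(1) - x(6) \\ x(2) - x(5) \end{bmatrix} \\
\begin{bmatrix} X(3) \\ X(5) \end{bmatrix} &= \frac{1}{2}
\begin{bmatrix} c_3 & -s_3 \\ s_3 & c_3 \end{bmatrix}
\begin{bmatrix} x(0) - x(7) \\ x(3) - x(4) \end{bmatrix}
- \frac{1}{2}
\begin{bmatrix} c_1 & s_1 \\ -s_1 & c_1 \end{bmatrix}
\begin{bmatrix} x(2) - x(5) \\ x(1) - x(6) \end{bmatrix}
\end{aligned} \tag{8}
$$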
where $s_m = \sin(m\pi/16) = c_k$ and $m = 8 - k$. The rearranged
1-D DCT is now expressed as vector rotation matrices,
each realized by consecutive CORDIC iterations as shown in
Fig. 3. The DCT can therefore be implemented using only shifters and
adders, without multipliers [13]. Note that the sign-bits
and the scale-factor are known in advance, since the input angles
of the CORDIC modules are given by the DCT bases.
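The rotation form of the 8-point DCT can be checked against the direct cosine sum. The Python sketch below is a floating-point check, not the shift-and-add datapath. The helper names are illustrative, and the rotation directions (clockwise for the $\pi/16$, $7\pi/16$ and one of the $3\pi/16$ rotators, counter-clockwise for the others) are worked out here from the DCT definition with the factor $\frac{1}{2}c(k)$, so they should be read as an assumption wherever the printed signs are unclear.

```python
import math

def dct8_direct(x):
    """8-point DCT evaluated directly from the cosine sum."""
    out = []
    for k in range(8):
        ck = 1 / math.sqrt(2) if k == 0 else 1.0
        acc = sum(x[i] * math.cos((2 * i + 1) * k * math.pi / 16) for i in range(8))
        out.append(0.5 * ck * acc)
    return out

def rot(theta, u, v, clockwise=False):
    """Plane rotation of (u, v) by theta; this is what a CORDIC rotator realizes."""
    c, s = math.cos(theta), math.sin(theta)
    if clockwise:
        return c * u + s * v, -s * u + c * v
    return c * u - s * v, s * u + c * v

def dct8_rotations(x):
    """8-point DCT computed with butterflies followed by six plane rotations."""
    p = math.pi / 16
    a, b, c, d = x[0] - x[7], x[1] - x[6], x[2] - x[5], x[3] - x[4]
    e0 = x[0] + x[7] + x[3] + x[4]
    e1 = x[1] + x[6] + x[2] + x[5]
    f0 = x[0] + x[7] - x[3] - x[4]
    f1 = x[1] + x[6] - x[2] - x[5]
    X = [0.0] * 8
    X[4], X[0] = rot(4 * p, e0, e1)
    X[6], X[2] = rot(6 * p, f0, f1)
    r1 = rot(7 * p, d, a, clockwise=True)
    r2 = rot(3 * p, b, c, clockwise=True)
    X[1], X[7] = r1[0] + r2[0], r1[1] + r2[1]
    r3 = rot(3 * p, a, d)
    r4 = rot(1 * p, c, b, clockwise=True)
    X[3], X[5] = r3[0] - r4[0], r3[1] - r4[1]
    return [0.5 * v for v in X]

sample = [52, 55, 61, 66, 70, 61, 64, 73]   # arbitrary example pixel row
direct = dct8_direct(sample)
rotated = dct8_rotations(sample)
```

Every rotation angle is a fixed multiple of $\pi/16$, which is why the sign-bits of each rotator can be fixed at design time.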
After the 2-D DCT operation, the input data in the spatial domain
are transformed to the frequency domain, which is the
$8 \times 8$ block of 64 DCT coefficients shown in Fig. 4. Because
the DCT has the energy compaction property, the signal energy
of the output data (DCT coefficients) is mostly concentrated
in a few low-frequency components, while the higher
frequency components carry little signal energy.
The high-frequency DCT coefficients become even smaller
after the quantization step [5], consistent with the fact that human eyes are more sensitive to the low-frequency components (DC)
than to the high-frequency components.
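The energy compaction property can be illustrated with a small Python sketch that applies a separable 2-D DCT to one block and measures how much of the energy lands in the four lowest-frequency coefficients. The smooth synthetic block and the choice of a $2 \times 2$ low-frequency corner are assumptions made only for this example; natural image blocks compact less strongly.

```python
import math

def dct8(v):
    """Orthonormal 8-point DCT of a list of 8 samples."""
    return [0.5 * (1 / math.sqrt(2) if k == 0 else 1.0)
            * sum(v[i] * math.cos((2 * i + 1) * k * math.pi / 16) for i in range(8))
            for k in range(8)]

def dct8x8(block):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct8(r) for r in block]
    cols = [dct8([rows[i][j] for i in range(8)]) for j in range(8)]
    return [[cols[j][i] for j in range(8)] for i in range(8)]

# Smooth synthetic block (a gentle ramp), chosen only for illustration.
block = [[100 + 5 * i + 3 * j for j in range(8)] for i in range(8)]
coeffs = dct8x8(block)

total = sum(c * c for row in coeffs for c in row)                 # total energy
low = sum(coeffs[u][v] ** 2 for u in range(2) for v in range(2))  # 2x2 low-frequency corner
ratio = low / total
```

For this block almost all of the energy sits in the low-frequency corner, which is the property the priority-based processing relies on.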
The main idea in this paper is based on the fact that
low-frequency DCT coefficients are relatively more important
Fig. 3.
Fig. 4.
TABLE I
REQUIRED ITERATIONS AND DIRECTIONS FOR VECTOR ROTATION
(+: CLOCKWISE DIRECTION, −: COUNTER-CLOCKWISE DIRECTION)

Angle            Required Iterations        Directions (Sign-Bits)
$\pi/16$ (+)     i = 0, 1, 3, 10            $\sigma$ = −1, +1, +1, +1
$3\pi/16$ (+)    i = 1, 3, 10               $\sigma$ = −1, −1, −1
$3\pi/16$ (−)    i = 1, 3, 10               $\sigma$ = +1, +1, +1
$4\pi/16$ (−)    i = 0                      $\sigma$ = +1
$6\pi/16$ (−)    i = (90°), 2, 3, 5, 7
$7\pi/16$ (+)    i = 0, 1, 3, 10            $\sigma$ = −1, −1, −1, −1
TABLE II

Considered Lookahead CORDIC    Desired Scale-Factor    Approximation Value
$\pi/16$                       0.3137856...            $2^{-2} + 2^{-4} + 2^{-10}$
$3\pi/16$                      0.4437599...            $2^{-1} - 2^{-4} + 2^{-7} - 2^{-9}$
$4\pi/16$                      0.3535533...            $2^{-2} + 2^{-4} + 2^{-5} + 2^{-7} + 2^{-9}$
$6\pi/16$                      0.4810759...            $2^{-1} - 2^{-6} - 2^{-8}$
$7\pi/16$                      0.3137856...            $2^{-2} + 2^{-4} + 2^{-10}$
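$$
\begin{bmatrix} x_3 \\ y_3 \end{bmatrix} =
\begin{bmatrix}
-\sigma_1\sigma_2 2^{0} - \sigma_1\sigma_3 2^{-1} & -\sigma_1 2^{0} + \sigma_1\sigma_2\sigma_3 2^{-1} \\
+\sigma_1 2^{0} - \sigma_1\sigma_2\sigma_3 2^{-1} & -\sigma_1\sigma_2 2^{0} - \sigma_1\sigma_3 2^{-1}
\end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}. \tag{9}
$$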
Assuming that the CORDIC result requires four iterations
for $x$ whereas three iterations are needed for $y$, as shown in (2)
and (9), the lookahead CORDIC equation for both results can
be expressed as follows, which means that the two CORDIC outputs
can be calculated separately:
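$$
\begin{bmatrix} x_4 \\ y_3 \end{bmatrix} =
\begin{bmatrix}
-\sigma_1\sigma_2 2^{0} - \sigma_1\sigma_3 2^{-1} - \sigma_1\sigma_4 2^{-2} + \sigma_1\sigma_2\sigma_3\sigma_4 2^{-3} &
-\sigma_1 2^{0} + \sigma_1\sigma_2\sigma_3 2^{-1} + \sigma_1\sigma_2\sigma_4 2^{-2} + \sigma_1\sigma_3\sigma_4 2^{-3} \\
+\sigma_1 2^{0} - \sigma_1\sigma_2\sigma_3 2^{-1} &
-\sigma_1\sigma_2 2^{0} - \sigma_1\sigma_3 2^{-1}
\end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix}. \tag{10}
$$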
Fig. 5 presents the difference between the conventional
crossing CORDIC architecture and the lookahead-based
approach. When the lookahead approach is applied to the
CORDIC architecture, the number of iterations can be easily
controlled because all the internal datapaths become independent.
In the proposed CORDIC-based DCT architecture, where
a different number of iterations is assigned for generating
each DCT coefficient, the number of iterations should be carefully
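$$
\cdots
\begin{bmatrix} 1 & -\sigma_1 2^{-1} \\ \sigma_1 2^{-1} & 1 \end{bmatrix}
\begin{bmatrix} 1 & -\sigma_0 2^{0} \\ \sigma_0 2^{0} & 1 \end{bmatrix}
\begin{bmatrix} x_0 \\ y_0 \end{bmatrix} \tag{11}
$$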
where $\sigma_0 = -1$, $\sigma_1 = +1$, $\sigma_3 = +1$, and $\sigma_{10} = +1$. In Table I,
i = (90°) represents the optional first iteration of the CORDIC
[8]. In our DCT, the iterations to be skipped are carefully
selected such that the error between the desired angle and
the corresponding accumulated angle does not exceed 0.004°
for all the given angles. For example, for the $\pi/16$
CORDIC rotator, the error between the desired angle and the angle rotated
using the combination of CORDIC iterations presented in
Table I is 0.00397958°. The combination of CORDIC iterations
used to derive the lookahead CORDIC algorithm
can be decided using the software modeling process presented in
Section III-C.
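The angle error of each iteration combination can be recomputed with a few lines of Python. The sketch assumes that the iteration index i in Table I selects the microrotation angle $\arctan(2^{-i})$ in degrees and that a positive sign-bit rotates counter-clockwise; the dictionary layout and names are illustrative. The $6\pi/16$ rotator is left out because its sign-bits are not listed.

```python
import math

# (target in units of pi/16, clockwise?, [(iteration index i, sign-bit)]) read from Table I.
ROTATORS = {
    "pi/16 cw":   (1, True,  [(0, -1), (1, +1), (3, +1), (10, +1)]),
    "3pi/16 cw":  (3, True,  [(1, -1), (3, -1), (10, -1)]),
    "3pi/16 ccw": (3, False, [(1, +1), (3, +1), (10, +1)]),
    "4pi/16 ccw": (4, False, [(0, +1)]),
    "7pi/16 cw":  (7, True,  [(0, -1), (1, -1), (3, -1), (10, -1)]),
}

def angle_error_deg(units, clockwise, steps):
    """Absolute difference, in degrees, between the desired and accumulated angle."""
    target = (-1 if clockwise else 1) * units * 180.0 / 16
    reached = sum(s * math.degrees(math.atan(2.0 ** -i)) for i, s in steps)
    return abs(target - reached)

errors = {name: angle_error_deg(*row) for name, row in ROTATORS.items()}
```

Under these assumptions the $\pi/16$, $3\pi/16$ and $7\pi/16$ rotators all share the same residual, because they differ only in the sign pattern applied to the same three or four microrotation angles.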
As mentioned in Section II-A2, the scale-factor is determined
by the number of executed CORDIC iterations.
Because the number of iterations is known in advance, the scale-factors
are predetermined, as shown in Table II. In the table,
the scale-factors are represented in signed power-of-two
format, and the quantization error of each scale-factor is below
10E−4.
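The signed power-of-two representation can be checked and exercised with a short Python sketch. The term lists are transcribed from Table II as (sign, shift) pairs; the function names and the integer test value are illustrative assumptions, and truncation by the right shifts is ignored except in the chosen example, where the value is exactly divisible.

```python
# name: (desired scale-factor, [(sign, right-shift amount), ...]) transcribed from Table II.
TABLE_II = {
    "pi/16":  (0.3137856, [(+1, 2), (+1, 4), (+1, 10)]),
    "3pi/16": (0.4437599, [(+1, 1), (-1, 4), (+1, 7), (-1, 9)]),
    "4pi/16": (0.3535533, [(+1, 2), (+1, 4), (+1, 5), (+1, 7), (+1, 9)]),
    "6pi/16": (0.4810759, [(+1, 1), (-1, 6), (-1, 8)]),
    "7pi/16": (0.3137856, [(+1, 2), (+1, 4), (+1, 10)]),
}

def spt_value(terms):
    """Numeric value of a signed power-of-two constant."""
    return sum(sign * 2.0 ** -shift for sign, shift in terms)

def spt_scale(value, terms):
    """Scale an integer by the constant using only shifts, additions and subtractions."""
    return sum(sign * (value >> shift) for sign, shift in terms)

quant_error = {name: abs(desired - spt_value(terms))
               for name, (desired, terms) in TABLE_II.items()}
scaled = spt_scale(1024, TABLE_II["pi/16"][1])   # 256 + 64 + 1
```

Each term costs one shifter and one adder input, so removing a term from a constant directly removes hardware activity.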
TABLE III
HARDWARE IMPLEMENTATION OR COMPARISON RESULTS FOR
VARIOUS DCT ARCHITECTURES

Architecture    PSNR (dB)    Gate count    Power (mW)
[19]            31.63        36.2k         6.76
[20]            31.49        24.6k         5.42
[13]            31.72        41.6k         7.72
[16]            30.61        27.3k         6.54
[17]            31.57        31.5k         5.62
Proposed        31.45        22.4k         5.11
Proposed
Low-Power
Fig. 6. (a) Hardware architecture of the proposed low-power CORDIC-based 1-D DCT. (b) An example of the lookahead CORDIC algorithm ($7\pi/16$) and
(c) its hardware architecture.
we propose a reconfigurable CORDIC-based DCT architecture in this section. Several tradeoff modes are presented,
and the proposed reconfigurable architecture can dynamically change the CORDIC iterations to adaptively trade off
the computation energy for the image quality in the same
hardware.
Generally, in the lookahead CORDIC, the shift-terms for
calculating low-frequency DCT coefficients (the terms for calculating X(0) and X(1) in (8)) are more important than the shift-terms
for calculating high-frequency coefficients. Additionally,
among the shift-terms in one lookahead CORDIC equation,
the most important terms are the low shift-terms, while the high shift-terms
are relatively less important. To save
computation power at the expense of minimum image quality
degradation, the least important shift-term in X(7) is first
removed based on a greedy algorithm [30]. We then search for
the next least important shift-term and cancel its computation.
As the process is repeated, more shift-terms are
removed, which means that the computation power is reduced
with minimum image quality degradation.
Fig. 7 shows pseudocode for the shift-term reduction process
in the proposed CORDIC-based DCT. In step 1, the high
shift-terms of the CORDIC rotation part (EQ_Terms) and the
scale-factor part (SC_Terms) of the lookahead CORDIC equation
are initialized to those of the normal mode described in
Section III-B. Once the target PSNR constraint is decided
in step 2, the loop of steps 3–21 is performed until
the minimum number of CORDIC terms that
satisfies the target PSNR is found. In the inner loop, we repeatedly
search for the least sensitive shift-terms inside EQ_Terms and
SC_Terms. Then, the least sensitive shift-term, i.e., the one whose removal shows the
lowest PSNR degradation, is selected between EQ_Terms and SC_Terms.
Because the best choice (the least sensitive shift-term) is taken
based on the lookahead equation, which is updated in every
loop iteration, the approach described in Fig. 7 is a
greedy algorithm [30]. The selected shift-term is removed and
the CORDIC equations of the current iteration are updated.
The iteration continues until no further shift-term reduction is
Fig. 7. Pseudocode for shift-term reduction process in the proposed CORDIC-based DCT.
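The greedy selection can be sketched in Python on a toy problem. This is not the pseudocode of Fig. 7: it reduces the shift-terms of a single signed power-of-two constant instead of the full set of lookahead equations, and the quality function `psnr_like` is a hypothetical stand-in for the image PSNR evaluated by the software model. All names and the 40 dB target are assumptions for illustration.

```python
import math

def psnr_like(terms, exact, samples):
    """Hypothetical quality score: PSNR-style measure of shift-and-add scaling error."""
    approx = sum(s * 2.0 ** -k for s, k in terms)
    mse = sum(((exact - approx) * v) ** 2 for v in samples) / len(samples)
    return float("inf") if mse == 0 else 10 * math.log10(255.0 ** 2 / mse)

def greedy_reduce(terms, target, quality):
    """Repeatedly drop the least sensitive shift-term while the target quality holds."""
    terms = list(terms)
    while len(terms) > 1:
        # The least sensitive term is the one whose removal leaves the highest quality.
        best = max(terms, key=lambda t: quality([u for u in terms if u != t]))
        trial = [u for u in terms if u != best]
        if quality(trial) < target:
            break                      # any further removal would violate the target
        terms = trial
    return terms

exact = 0.3535533                                         # 4*pi/16 scale-factor
start = [(+1, 2), (+1, 4), (+1, 5), (+1, 7), (+1, 9)]     # its shift-terms as (sign, shift)
samples = range(256)

reduced = greedy_reduce(start, 40.0, lambda t: psnr_like(t, exact, samples))
```

With these toy settings the two highest shift-terms are dropped first, which matches the observation that high shift-terms are the least important ones.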
TABLE IV
PSNR DIFFERENCES IN EACH MODE OF PROPOSED RECONFIGURABLE
DCT ARCHITECTURE WITH VARIOUS IMAGE DATA

PSNR (dB)    Normal    Mode 1    Mode 2
baboon       27.41     26.59     23.89
clegg        28.33     26.40     21.95
frymire      26.03     23.17     19.30
lena         34.30     33.54     31.75
monarch      34.98     33.69     30.55
peppers      35.93     34.61     30.71
sail         31.41     30.73     28.35
serrano      29.57     28.17     25.28
tulips       35.08     33.91     30.93
Fig. 9. (a) Turnoff gate schematic [24]. (b) Dynamic bit-width control using
turnoff gate.
Fig. 10. Overall hardware architecture of the proposed reconfigurable CORDIC-based DCT.
At higher tradeoff levels, the number of shift-terms is further reduced, as specified in Fig. 8.
TABLE V
POWER CONSUMPTION AT DIFFERENT TRADEOFF MODES

                  Normal    Mode 1    Mode 2
PSNR (dB)         31.45     30.09     26.97
Power (mW)        5.11      3.58      3.13
Percentage (%)    100       70.15     61.27
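The figures in Table V can be cross-checked against Table IV with simple arithmetic. The Python sketch below averages the per-image PSNR values of each mode and recomputes the power percentages from the rounded power figures. The averages reproduce the PSNR row of Table V; the recomputed percentages agree with the listed ones only to within about 0.1 percentage point, presumably because the listed values were derived from unrounded power numbers.

```python
# Per-image PSNR in dB for (Normal, Mode 1, Mode 2), transcribed from Table IV.
TABLE_IV = {
    "baboon":  (27.41, 26.59, 23.89),
    "clegg":   (28.33, 26.40, 21.95),
    "frymire": (26.03, 23.17, 19.30),
    "lena":    (34.30, 33.54, 31.75),
    "monarch": (34.98, 33.69, 30.55),
    "peppers": (35.93, 34.61, 30.71),
    "sail":    (31.41, 30.73, 28.35),
    "serrano": (29.57, 28.17, 25.28),
    "tulips":  (35.08, 33.91, 30.93),
}
POWER_MW = {"Normal": 5.11, "Mode 1": 3.58, "Mode 2": 3.13}   # from Table V

# Average PSNR of each mode over the nine test images.
mean_psnr = [sum(row[m] for row in TABLE_IV.values()) / len(TABLE_IV) for m in range(3)]

# Power of each mode relative to the normal mode, in percent.
percent = {mode: 100.0 * p / POWER_MW["Normal"] for mode, p in POWER_MW.items()}
```

Read together, the two tables indicate that Mode 1 gives up a little over 1 dB of average PSNR for roughly a 30% power reduction.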
Fig. 11. Lena images obtained using the proposed reconfigurable CORDIC-based DCT. (a) Normal mode. (b) Mode 1. (c) Mode 2.
REFERENCES
[1] T. Liu, T. Lin, S. Wang, and C. Lee, "A low-power dual-mode video decoder for mobile applications," IEEE Commun. Mag., vol. 44, no. 8, pp. 119–126, Aug. 2006.
[2] M. Parlak and I. Hamzaoglu, "Low power H.264 deblocking filter hardware implementations," IEEE Trans. Consum. Electron., vol. 54, no. 2, pp. 808–816, May 2008.
[3] A. Bahari, T. Arslan, and A. T. Erdogan, "Low-power H.264 video compression architectures for mobile communication," IEEE Trans. Circuits Syst. Video Technol., vol. 19, no. 9, pp. 1251–1261, Sep. 2009.
[4] N. Ahmed, T. Natarajan, and K. R. Rao, "Discrete cosine transform," IEEE Trans. Comput., vol. 23, no. 1, pp. 90–93, Jan. 1974.
[5] G. K. Wallace, "The JPEG still picture compression standard," IEEE Trans. Consum. Electron., vol. 38, no. 1, pp. 18–34, Feb. 1992.
[6] D. L. Gall, "MPEG: A video compression standard for multimedia applications," Commun. ACM, vol. 34, no. 4, pp. 46–58, Apr. 1991.
[7] T. Wiegand, G. J. Sullivan, G. Bjontegaard, and A. Luthra, "Overview of the H.264/AVC video coding standard," IEEE Trans. Circuits Syst. Video Technol., vol. 13, no. 7, pp. 560–576, Jul. 2003.
[8] J. E. Volder, "The CORDIC trigonometric computing technique," IRE Trans. Electron. Comput., vol. 8, no. 3, pp. 330–334, Sep. 1959.
[9] A. Maltsev, V. Pestretsov, R. Maslennikov, and A. Khoryaev, "Triangular systolic array with reduced latency for QR-decomposition of complex matrices," in Proc. IEEE Int. Symp. Circuits Syst., May 2006, pp. 385–388.
[10] A. M. Despain, "Fourier transform computers using CORDIC iterations," IEEE Trans. Comput., vol. 23, no. 10, pp. 993–1001, Oct. 1974.
[11] S. Hsiao and J. Delosme, "Parallel singular value decomposition of complex matrices using multidimensional CORDIC algorithms," IEEE Trans. Signal Process., vol. 44, no. 3, pp. 685–697, Mar. 1996.
[12] J. R. Cavallaro and F. T. Luk, "CORDIC arithmetic for an SVD processor," J. Parallel Distrib. Comput., vol. 5, no. 3, pp. 271–290, Jun. 1988.
[13] E. P. Mariatos, D. E. Metafas, J. A. Hallas, and C. E. Goutis, "A fast DCT processor, based on special purpose CORDIC rotators," in Proc. IEEE Int. Symp. Circuits Syst., Jun. 1994, pp. 271–274.
[14] H. Jeong, J. Kim, and W. Cho, "Low-power multiplierless DCT architecture using image data correlation," IEEE Trans. Consum. Electron., vol. 50, no. 1, pp. 262–267, Feb. 2004.
[15] T. Sung, Y. Shieh, C. Yu, and H. Hsin, "High-efficiency and low-power architectures for 2-D DCT and IDCT based on CORDIC rotation," in Proc. Int. Parallel Distrib. Comput. Appl. Technol., Dec. 2006, pp. 191–196.
[16] C. C. Sun, S. J. Ruan, B. Heyne, and J. Goetze, "Low-power and high-quality CORDIC-based Loeffler DCT for signal processing," IET Circuits, Devices, Syst., vol. 1, no. 6, pp. 453–461, Dec. 2007.
[17] Z. Wu, J. Sha, Z. Wang, and L. Li, "An improved scaled DCT architecture," IEEE Trans. Consum. Electron., vol. 55, no. 2, pp. 685–689, May 2009.
[18] S. Hsiao, Y. Hu, T. Juang, and C. Lee, "Efficient VLSI implementations of fast multiplierless approximated DCT using parameterized hardware modules for silicon intellectual property design," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 52, no. 8, pp. 1568–1579, Aug. 2005.
[19] S. Yu and E. E. Swartzlander, "DCT implementation with distributed arithmetic," IEEE Trans. Comput., vol. 50, no. 9, pp. 985–991, Sep. 2001.
[20] B. Kim and S. G. Ziavras, "Low-power multiplierless DCT for image/video coders," in Proc. IEEE Int. Symp. Consum. Electron., May 2009, pp. 133–136.
[21] J. Bracamonte, M. Ansorge, and F. Pellandini, "VLSI systems for image compression: A power-consumption/image-resolution trade-off approach," in Proc. Digit. Compress. Technol. Syst. Video Commun. Conf., 1994, pp. 271–274.
[22] G. Karakonstantis, N. Banerjee, and K. Roy, "Process-variation resilient and voltage-scalable DCT architecture for robust low-power computing," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 10, pp. 1461–1470, Oct. 2010.
[23] J. Park and K. Roy, "A low power reconfigurable DCT architecture to trade off image quality for computational complexity," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., May 2004, pp. 17–20.
[24] J. Park, J. H. Choi, and K. Roy, "Dynamic bit-width adaptation in DCT: An approach to trade off image quality and computation energy," IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 18, no. 5, pp. 787–793, May 2010.
[25] J. Li, "Sign lookahead CORDIC," M.S. thesis, Dept. Electr. Eng., Nat. Cheng Kung Univ., Tainan, Taiwan, 2008.
[26] S. Wang and E. E. Swartzlander, "Merged CORDIC algorithm," in Proc. IEEE Int. Symp. Circuits Syst., May 1995, pp. 1988–1991.
[27] B. Gisuthan and T. Srikanthan, "Pipelining flat CORDIC based trigonometric function generators," Microelectron. J., vol. 33, nos. 1–2, pp. 77–89, Jan. 2002.
[28] P. K. Meher, J. Valls, T. Juang, K. Sridharan, and K. Maharatna, "50 years of CORDIC," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 56, no. 9, pp. 1893–1907, Sep. 2009.
[29] NanoSim User Guide, Version A-2008.03, Synopsys Inc., Mountain View, CA, USA, 2008.
[30] T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms. Cambridge, MA, USA: MIT Press, 1998.