Multiplier-Less Hardware Realization of Trigonometric Functions For High Speed Applications


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/334167226


Conference Paper · December 2018


DOI: 10.1109/ASPCON.2018.8748709



Proceedings of 2018 IEEE Applied Signal Processing Conference (ASPCON)

Multiplier-less Hardware Realization of Trigonometric Functions for High Speed Applications

Debaprasad De
Techno India, Salt Lake, Kolkata, India

Archisman Ghosh, K Gaurav Kumar, Anurup Saha, Mrinal Kanti Naskar
ADES Lab, Dept. of ETCE, Jadavpur University, Kolkata, India

Abstract: This paper presents a unified architecture for trigonometric functions using the CORDIC algorithm and its implementation on an FPGA. The hardware finds applications in signal processing, mathematical calculators, and various other engineering fields. The CORDIC algorithm, based on the principle of vector rotations, computes the trigonometric functions using only add and shift operations. The proposed architecture is a structural model, coded in Verilog HDL and implemented on a Virtex-4 FPGA kit. The proposed architecture, which determines five trigonometric functions, utilizes fewer hardware resources than previously reported architectures that can compute only sine and cosine functions. This architecture also provides an improvement in terms of chip area consumed and the maximum frequency of operation of the hardware. The proposed architecture can also be easily reconfigured for applications where higher accuracy is required. The novelty of this paper is implementing the unified architecture along with reducing the device resource utilization and the number of clock cycles required with respect to previous works.

Keywords: Unified architecture, Trigonometric functions, CORDIC, FPGA

I. INTRODUCTION

Trigonometric functions like sine, cosine, tangent, arcsine and arccosine, derived from complex exponential functions, are used in a wide range of applications, like digital signal processing, wireless communications, biomedical engineering, robotics, etc. There are several approaches to realize hardware that performs these functionalities, i.e., Lookup Table (LUT), CORDIC (Coordinate Rotation Digital Computer) and Polynomial Approximation.

The lookup table method uses memory blocks to store the values of the functions to be computed, for every possible input argument [1]. This approach is relatively simple, but the hardware requires more registers, i.e., more memory. Moreover, if fewer entries are stored in the LUT, the results obtained are inaccurate. If more entries are used, the method produces accurate results, but the realized hardware becomes uneconomical and memory-inefficient.

Approximation algorithms such as polynomial approximation use the Maclaurin series to calculate trigonometric functions. The hardware realized by this method requires a large number of multipliers, adders and shifters, making it area inefficient. As the values of the factorials remain fixed, LUTs are used to store these fixed values.

CORDIC, invented by J. E. Volder in 1959 [2], is an efficient algorithm to calculate trigonometric, hyperbolic, exponential and logarithmic functions. This algorithm uses only add and shift operations, which are easy to implement in hardware that requires less computational time and consumes fewer device resources.

In this work, multiplier-less unified hardware for calculating trigonometric functions is proposed. The architecture is then programmed in Verilog HDL and implemented on a Virtex 4 FPGA. The present architecture has also been compared in detail with previously published works.

The structure of the rest of the paper is as follows. In Section II, CORDIC, its mathematical background and the unified algorithm are discussed. The proposed architecture is presented in Section III. In Section IV, implementation details are highlighted along with results. Section V presents the performance of our architecture in comparison with other architectures available in the literature. Section VI concludes the paper.

II. CORDIC ALGORITHM

The CORDIC algorithm is used in calculators, digital signal processors, etc. [3]. Vector rotations are used to compute all the trigonometric functions. The algorithm provides an efficient iterative method to calculate the rotation of a vector by an angle, using only add and shift operations.

Fig. 1 shows the rotation of a vector P(x, y) through an angle α in the anti-clockwise direction to give a vector P′(x′, y′). The coordinates of the point P′ are given by:

$$x' = x\cos\alpha - y\sin\alpha = \cos\alpha\,(x - y\tan\alpha) \qquad (1)$$

$$y' = y\cos\alpha + x\sin\alpha = \cos\alpha\,(y + x\tan\alpha) \qquad (2)$$

Fig. 1. Rotating vector on 2-D plane.

If the angle of rotation is restricted such that $\tan\alpha = 2^{-i}$, where $i$ is an integer, then multiplication by the tangent can be performed using only shift operations, and the total calculation can be done using addition and shifting [4]. By choosing a proper sequence of angles $\alpha_0, \alpha_1, \ldots$ and the direction of each rotation, we can calculate the trigonometric functions [5].
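The shift-and-add rotation described above can be sketched in software. The following Python model is an illustrative sketch only, not the authors' Verilog design: the fixed-point format (`FRAC_BITS = 16`), the names `ATAN_TABLE`, `K`, `X0` and `cordic_sin_cos`, and the choice of N = 12 micro-rotations are assumptions made for this example. It computes sine and cosine by rotating the vector (K, 0) through the input angle, where K is the product of the cos α_i factors (about 0.6073).

```python
import math

FRAC_BITS = 16   # assumed fixed-point format: 16 fractional bits
N = 12           # assumed number of micro-rotations

# alpha_i = atan(2^-i), stored as fixed-point constants (plays the role of a lookup table)
ATAN_TABLE = [round(math.atan(2.0 ** -i) * (1 << FRAC_BITS)) for i in range(N)]

# Product of cos(alpha_i); starting from x0 = K cancels the growth of the vector length
K = 1.0
for i in range(N):
    K *= math.cos(math.atan(2.0 ** -i))
X0 = round(K * (1 << FRAC_BITS))


def cordic_sin_cos(angle):
    """Return (sin, cos) of angle in radians; valid for roughly |angle| < 1.74."""
    x, y = X0, 0
    z = round(angle * (1 << FRAC_BITS))
    for i in range(N):
        if z >= 0:
            # rotate anti-clockwise by alpha_i: only shifts and adds on integers
            x, y = x - (y >> i), y + (x >> i)
            z -= ATAN_TABLE[i]
        else:
            # rotate clockwise by alpha_i
            x, y = x + (y >> i), y - (x >> i)
            z += ATAN_TABLE[i]
    scale = float(1 << FRAC_BITS)
    return y / scale, x / scale
```

For example, `cordic_sin_cos(math.pi / 6)` should agree with `math.sin` and `math.cos` to about three decimal places; the floating-point calls are used only to build the constants and to convert the result, while the loop itself uses integer shifts, additions and subtractions.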


ISBN: 978-1-5386-6686-9 149 PART: CFP18P52-ART



A. Unified Algorithm

The algorithm is initialized with a vector (x0, y0) and an angle z0. At every iteration, the vector is rotated either clockwise or anti-clockwise, depending on the values of $x_{i-1}$, $y_{i-1}$, $z_{i-1}$ obtained in the previous iteration, and the angle $z_i$ is updated accordingly. The algorithm stops after a predefined number of iterations.

ALGORITHM 1: Unified Algorithm

Require: Input α and the function to be computed
1: Initialize x0, y0 and z0.
2: for i = 0 to N-1 do
3:   Choose dx, dy, dz   // d is the direction of rotation
4:   $x_{i+1} = x_i + d_x\, y_i\, 2^{-i}$
5:   $y_{i+1} = y_i + d_y\, x_i\, 2^{-i}$
6:   $z_{i+1} = z_i + d_z \tan^{-1}(2^{-i})$
7: end for

Here, N denotes the number of iterations. Values of x0, y0, z0 are chosen appropriately for each function. Table I shows the values of x0, y0, z0 for computing the corresponding trigonometric functions.

TABLE I. SELECTION OF X0, Y0, Z0

Required Function     x0        y0    z0
sin α, cos α          0.6073    0     α
sin⁻¹ α, cos⁻¹ α      0.6073    0     0
tan⁻¹ α               1         α     0

Depending upon the direction of rotation, dx, dy, dz can be +1 or -1. Their values are selected from Table II.

TABLE II. SELECTION OF DIRECTION OF ROTATIONS

Required Function     dx                 dy                 dz
sin α, cos α          -1 if zi ≥ 0,      +1 if zi ≥ 0,      -1 if zi ≥ 0,
                      +1 otherwise       -1 otherwise       +1 otherwise
sin⁻¹ α               -1 if yi < α,      +1 if yi < α,      +1 if yi < α,
                      +1 otherwise       -1 otherwise       -1 otherwise
cos⁻¹ α               +1 if xi < α,      -1 if xi < α,      -1 if xi < α,
                      -1 otherwise       +1 otherwise       +1 otherwise
tan⁻¹ α               +1 if yi ≥ 0,      -1 if yi ≥ 0,      +1 if yi ≥ 0,
                      -1 otherwise       +1 otherwise       -1 otherwise

The number of iterations (N) is chosen according to the desired accuracy of results. Higher N implies better precision, but demands more computation time. N = 12 implies that the precision of results is $2^{-12}$ and the hardware requires 12 clock cycles. For applications that require less precision, the value of N can be reduced to optimize computation time. From the algorithm, it is clear that its running time complexity is O(N).

III. PROPOSED ARCHITECTURE

The unified algorithm has been implemented as FPGA-based hardware. The proposed architecture is a structural model with the capability of calculating the values of five trigonometric functions, namely sine, cosine, arcsine, arccosine and arctangent. Components like barrel shifter, adder/subtractor, lookup table, etc. are designed to perform the various micro-operations of the algorithm.

A barrel shifter is used for the controlled shift operation. An N-bit barrel shifter has been implemented which shifts the input number to the right by an arbitrary number of bits. The circuit has an N-bit input and a control input that dictates the amount of shift.

A controlled adder/subtractor is used for controlled addition and subtraction to update x, y, and z as and when required. Lookup tables are used to store the $\tan^{-1}$ values.

The proposed architecture for the unified algorithm is shown in Fig. 2. Here, three registers xi, yi and zi store the values of x, y, z after the i-th iteration. The rotating vector is denoted by (x, y) and the angle of rotation is represented by z. After each iteration, the values of xi, yi and zi are calculated from the previous values and the registers are updated at the positive edge of the clock.

IV. FPGA IMPLEMENTATION

The architecture presented in Section III has been coded in Verilog HDL. The code, synthesized in Xilinx ISE Design Suite 14.7, has been implemented on a Virtex 4 FPGA kit (xc4vlx60). The timing waveform analysis report and the device utilization summary are discussed in the following subsections to justify the novelty of the proposed architecture.

A. Timing Waveform Analysis

The simulation uses a clock with 100 MHz frequency. Table III and Figure 3 provide the timing analysis of the proposed hardware for N = 12, 13, 14. It is observed that the number of clock cycles required to get the output increases linearly with N. Thus, an optimized value of N has to be chosen based on the trade-off between accuracy and speed of operation.

B. Device Utilization Summary

A device utilization summary is also provided for the proposed architecture. There are 26,624 slices, 53,248 slice flip-flops, 53,248 four-input LUTs and 448 bonded IOBs present in the Virtex 4 FPGA (xc4vlx60) kit. Table IV provides the number of components of each type needed for different values of N. This table clearly shows that the device utilization of the hardware goes up as the number of iterations (N) increases. The trade-off between the accuracy and the area of the hardware must be taken into consideration while choosing the value of N. Figure 4 shows the graphical representation of the device utilization summary for different N.

The value of N for the proposed hardware is chosen as 12, based on the accuracy, the maximum operating frequency of the application and the area of the hardware.

V. COMPARATIVE ANALYSIS

The performance of existing designs reported in the literature is compared with the proposed unified architecture. To have the same parameter setting for reference, our proposed design is synthesized using Xilinx ISE Design Suite with Virtex 4 as the target device and N = 12. To compare the performance, a parameter, area multiplied by delay (A × T), is defined. The parameter Area (A) is the number of resources required (i.e., A = Slices + Slice flip-flops + LUTs + Bonded IOBs), while


the parameter delay (T) denotes the time required to perform the computation [9]. It is desirable to have the smallest area and the shortest delay, i.e., the highest frequency of operation, for the hardware. It must be noted that the proposed architecture is more general, in the sense that it computes five trigonometric functions, whereas the ones used for comparison compute only sine and cosine values.

TABLE V. PERFORMANCE COMPARISON

Parameters        Proposed   Ref [6]   Ref [7]   Ref [8]   Ref [9]   Ref [10]
Slices            200        945       1104      203       191       358
LUTs              306        1658      1748      378       359       697
fmax (MHz)        197.6      52.5      154.7     60.8      59.77     188.6
A×T (10^-7)       29.2       495.4     184.4     95.6      92.0      55.9

Table V shows that the proposed architecture compares favorably with the others reported in the literature, in terms of area, time delay, frequency of operation and number of trigonometric functions computed. Hence, the architecture may be used when speed of operation is of major concern.

Fig. 2. Proposed architecture for Unified CORDIC Algorithm.

TABLE III. SIMULATION TIME ANALYSIS

Number of Iterations   Frequency of Clock (MHz)   Time taken (in microseconds)
12                     100                        0.12
13                     100                        0.13
14                     100                        0.14

Fig. 3. Timing analysis for varying number of iterations of CORDIC algorithm.

TABLE IV. RESOURCE UTILIZATION

Components         N = 12   N = 13   N = 14
Slices             160      170      178
Slice Flip-flops   40       43       46
Four-input LUTs    306      318      335
Bonded IOBs        51       55       59

Fig. 4. Device utilization summary for varying number of iterations of CORDIC algorithm.

VI. CONCLUSION

In this paper, we have proposed a fast, multiplier-less, area-efficient and unified architecture for calculating trigonometric functions using the CORDIC algorithm and implemented the same on an FPGA. We have also discussed the method to optimize the number of iterations required for the CORDIC algorithm, depending on the accuracy and speed requirements of the application. The performance of the proposed hardware, a completely structural model, is compared with that of others present in the literature. The


area-delay product is almost 50% better than that of the best previously reported architecture considered. Based on this performance analysis, it can be concluded that the proposed architecture is better than the compared implementations based on area of hardware, maximum operating frequency (MOF) and area-delay product (A×T). Future work involves introducing parallel processing and pipelining techniques, along with approximate circuits, to further increase the MOF and decrease the area.

Fig. 5. Comparative analysis for validating performance of the hardware.

ACKNOWLEDGMENT

The authors are thankful to Asim Maiti, Dr. Swarup Kumar Mitra, and Rathindra Nath Biswas for their insightful suggestions and inputs during the preparation of the manuscript.

REFERENCES
[1] P. T. P. Tang, "Table-lookup Algorithms for Elementary Functions and Their Error Analysis," Proceedings 10th IEEE Symposium on Computer Arithmetic, pp. 232-236, 1991.
[2] J. E. Volder, "The CORDIC Trigonometric Computing Technique," IRE Transactions on Electronic Computers, vol. EC-8, pp. 330-334, 1959.
[3] V. Considine, "CORDIC Trigonometric Function Generator for DSP," International Conference on Acoustics, Speech, and Signal Processing, pp. 2381-2384, 1984.
[4] A. Saha, K. G. Kumar, and A. Ghosh, "Area Efficient Architecture of Hyperbolic Functions for High Frequency Applications," International Conference on Circuits, Controls and Communications, pp. 139-142, 2017.
[5] J. S. Walther, "A Unified Algorithm for Elementary Functions," Spring Joint Computer Conference, pp. 379-385, 1971.
[6] K. Maharatna, A. Troya, S. Banerjee, and E. Grass, "Virtually scaling-free adaptive CORDIC rotator," IEE Proceedings - Computers and Digital Techniques, vol. 151, no. 6, pp. 448-456, 2004.
[7] E. Garcia, R. Cumplido, and M. Arias, "Pipelined CORDIC design on FPGA for a digital sine and cosine waves generator," 3rd International Conference on Electrical and Electronics Engineering, IEEE, pp. 1-4, 2006.
[8] L. Vachhani, K. Sridharan, and P. K. Meher, "Efficient CORDIC algorithms and architectures for low area and high throughput implementation," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 56, no. 1, pp. 61-65, 2009.
[9] S. Aggarwal and K. Khare, "Hardware efficient architecture for generating sine/cosine waves," 25th International Conference on VLSI Design (VLSID), IEEE, pp. 57-61, 2012.
[10] A. P. Renardy, N. Ahmadi, A. A. Fadila, N. Shidqi, and T. Adiono, "FPGA Implementation of CORDIC Algorithms for Sine and Cosine Generator," 5th International Conference on Electrical Engineering and Informatics, Bali, Indonesia, 2015.
