
IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 37, NO. 8, AUGUST 2018

The Promise and Challenge of Stochastic Computing

Armin Alaghi, Member, IEEE, Weikang Qian, Member, IEEE, and John P. Hayes, Life Fellow, IEEE

Abstract—Stochastic computing (SC) is an unconventional method of computation that treats data as probabilities. Typically, each bit of an N-bit stochastic number (SN) X is randomly chosen to be 1 with some probability pX, and X is generated and processed by conventional logic circuits. For instance, a single AND gate performs multiplication. The value X of an SN is measured by the density of 1s in it, an information-coding scheme also found in biological neural systems. SC has uses in massively parallel systems and is very tolerant of soft errors. Its drawbacks include low accuracy, slow processing, and complex design needs. Its ability to efficiently perform tasks like communication decoding and neural network inference has rekindled interest in the field. Many challenges remain to be overcome, however, before SC becomes widespread. In this paper, we discuss the evolution of SC, mostly focusing on recent developments. We highlight the main challenges and discuss potential methods of overcoming them.

Index Terms—Approximate computing, pulse circuits, stochastic circuits, unconventional computing methods.

Manuscript received March 8, 2017; revised July 31, 2017; accepted October 18, 2017. Date of publication November 28, 2017; date of current version July 17, 2018. This work was supported in part by the National Natural Science Foundation of China under Grant 61472243 and Grant 61204042, and in part by the U.S. National Science Foundation under Grant CCF-1318091. This paper was recommended by Associate Editor J. Henkel. (Corresponding author: Armin Alaghi.)
A. Alaghi is with the Computer Science and Engineering Department, University of Washington, Seattle, WA 98195 USA (e-mail: amin@cs.washington.edu).
W. Qian is with the University of Michigan—Shanghai Jiao Tong University Joint Institute, Shanghai Jiao Tong University, Shanghai 200240, China (e-mail: qianwk@sjtu.edu.cn).
J. P. Hayes is with the Computer Science and Engineering Division, University of Michigan, Ann Arbor, MI 48109 USA (e-mail: jhayes@umich.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCAD.2017.2778107
0278-0070 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

From its beginnings in the 1940s, electronic computing has relied on weighted binary numbers (BNs) of the form X = x1 x2 . . . xk to represent numerical data [16]. Typical is the use of X to denote a fixed-point fraction v = Σ_{i=1}^{k} 2^{−i} xi lying in the unit interval [0,1]. Efficient arithmetic circuits for processing such BNs were soon developed. There were, however, concerns about the cost and reliability of these circuits, which led to the consideration of alternative number formats. Notable among the latter are stochastic numbers (SNs), where the xi bits are randomly chosen to make X's value be the probability pX that xi = 1. Again the resulting data values are in the unit interval [0,1]. In the late 1960s, research groups led by Gaines [28], [29] and Poppelbaum et al. [74] investigated data processing with SNs, a field that soon came to be known as stochastic computing (SC). Their pioneering work identified key features of SC, including its ability to implement arithmetic operations by means of tiny logic circuits, its redundant and highly error-tolerant data formats, and its low precision levels comparable to analog computing.

Fig. 1. Stochastic number generator (SNG).

Like some early binary computers, stochastic circuits process data serially in the form of bit-streams. Fig. 1 shows an SN generator (SNG) that converts a given BN B to stochastic bit-stream form. The SNG samples a random BN R which it compares with B, and outputs an SN of probability B/2^k at a rate of one bit per clock cycle. After N clock cycles, it has produced an N-bit SN X with pX ≈ B/2^k. The value pX is the frequency or rate at which 1s appear, so an estimate p̂X of pX can be made simply by counting the 1s in X. In general, the estimate's accuracy depends on the randomness of X's bit-pattern, as well as its length N. Rather than a true random source, an SNG normally employs a logic circuit like a linear feedback shift register (LFSR) whose outputs are repeatable and have many of the characteristics of true random numbers [32]. Mathematically speaking, the SNG approximates a Bernoulli process that generates random binary sequences of the coin-flipping type, where each new bit is independent of all earlier bits.

The essence of SC can be seen in how it is used to perform basic multiplication. Let X and Y be two N-bit SNs that are applied synchronously to a two-input AND gate, as in Fig. 2. A 1 appears in the AND's output bit-stream Z if and only if the corresponding values of X and Y are both 1, hence

p̂Z ≈ pX × pY. (1)

In other words, the AND gate serves as a multiplier of probabilities, and can be orders of magnitude smaller than a comparable BN multiplier. The SC multiplier's output bit-pattern Z varies with the randomness of the SNGs generating X and Y. These variations have little impact on the multiplier's output value p̂Z, however, indicating a naturally high degree

Fig. 2. AND gate as a stochastic multiplier, with pX = 8/16, pY = 6/16, and pZ = pX × pY = 3/16. Equivalently, Z = X × Y = 3/16.

Fig. 3. Spike trains in a biological neural network and equivalent SNs.

of error tolerance. On the other hand, the precision with which p̂Z reflects pX × pY tends to be rather low. This situation can be improved by increasing N, but N must be doubled for every desired extra bit of precision, a property that leads to very long bit-streams and slow computations. For example, BNs of length k = 8 provide 8-bit precision. To obtain similar precision with SC requires SNs of length N = 2^k = 256 or more. Hence, SC tends to be restricted to low-precision applications where the bit-streams are not excessively long. More troublesome is the need for X and Y to have statistically independent or uncorrelated bit-patterns in order for p̂X and p̂Y to be treated as independent probabilities, as required by (1). In the extreme case where exactly the same bit-pattern X is applied to both inputs of the AND gate, the output bit-stream's value becomes pX instead of pX², implying a potentially large computation error which cannot be corrected simply by extending N.
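The two behaviors just described, the probability product of (1) and the correlated-input failure, are easy to check numerically. The following short simulation is an illustrative sketch (not part of the original paper); a software random() call stands in for the SNG's random source R:

```python
import random

def sng(p, n, rng):
    """Comparator-style SNG (cf. Fig. 1): emit a 1 whenever the random
    sample falls below the target probability p, one bit per clock cycle."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def value(bits):
    """Estimate an SN's value by counting the 1s in its bit-stream."""
    return sum(bits) / len(bits)

rng = random.Random(42)
N = 4096
X = sng(0.5, N, rng)    # pX = 8/16
Y = sng(0.375, N, rng)  # pY = 6/16, generated independently of X

# A single AND gate per bit pair multiplies the two probabilities, as in (1).
Z = [x & y for x, y in zip(X, Y)]

# Fanning the SAME bit-stream into both AND inputs computes X, not X^2:
# x AND x = x, so the output value collapses to pX regardless of N.
Z_corr = [x & x for x in X]

print(value(Z))       # close to pX * pY = 0.1875
print(value(Z_corr))  # stuck near pX = 0.5; extending N does not help
```

Note that the second estimate stays near pX however long the streams become, illustrating why the correlation error, unlike random fluctuation, cannot be averaged away.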
As the cost and reliability of conventional binary computing (BC) improved in the 1960s and 1970s with the development of integrated circuits tracked by Moore's law, interest in SC waned. It was seen as poorly suited to general-purpose computation, where high speed, accuracy, and compact storage were routinely expected. However, SC continued to find niche applications in areas such as image processing, control systems, and models of neural networks, which can take advantage of some of its unique features.

Neural networks, both natural and artificial, constitute an interesting case. As Fig. 3 suggests, biological neurons process noisy sequences of voltage spikes which loosely resemble SNs [31], [57]. Information is encoded in both the timing and the frequency of the spikes; the exact nature of the neural code is one of nature's mysteries. However, significant information, such as the intensity of a muscular action, is embedded SN-like in the spike rate over some time window; the spike positions also exhibit SN-style randomness. Moreover, the operation of a single neuron is commonly modeled by an inner-product function of the form

F = Σ_{i=1}^{N} Wi × Xi (2)

where the Xi's are signals from other neurons, and the Wi's are synaptic weights denoting the influence of those neurons. Since the number of interneural connections and multiplications N can be in the thousands, SC-based implementations of (2) are attractive because of their relatively low hardware cost [14], [46].

The state of SC circa 2000 can be characterized as focused on a handful of old and specialized applications [3], [59]. The situation changed dramatically when Gaudet and Rapley [30] observed that SC could be applied successfully to the difficult task of decoding low-density parity check (LDPC) codes.

Fig. 4. Stochastic circuits for LDPC decoding [30]. (a) Parity check node. (b) Equality node.

Although LDPC codes, like SC, were discovered in the 1960s, there was little practical interest in them until the advent of suitable decoding methods and circuits, as well as the inclusion of LDPC codes in new wireless communication standards such as digital video broadcasting (DVB-S2) and WiMAX (IEEE 802.16). The LDPC decoding employs a probabilistic algorithm that passes messages around a code representation called a Tanner graph, while repeatedly performing two basic operations, parity checking and equality checking. It turns out that these operations are implemented efficiently by the stochastic circuits in Fig. 4. Many copies of these circuits can be operated in parallel, resulting in fast, low-cost decoding, and demonstrating the potential of SC to provide massive parallelism. Recent developments have shown that SC-based LDPC decoders are competitive in performance and cost with conventional binary designs [47].

Other new applications and technology developments supported this revival of interest in SC. With the emergence of mobile devices such as smart phones and medical implants, extremely small size and power, as well as low-cost digital signal processing (DSP), have become major system goals [48]. An illustrative application of SC in the medical field is the design of retinal implants to aid the blind. An implant chip can be placed in the eye to receive and process images and transfer the results via pulse trains through the optic nerve directly to the brain. The chip must satisfy extraordinarily severe size and power constraints, which SC is particularly well-suited to meet [4].

Significant aspects of SC that had been ignored in the past—why does the apparently simple logic circuit of

Fig. 4(b) implement such a complex arithmetic function?—now began to receive attention. The relation between the logic circuits and the stochastic functions they implement has been clarified, resulting in general design procedures for implementing arithmetic operations [75]. Correlation effects in SC have recently been quantified, leading to the surprising conclusion that correlation can serve as a valuable computational resource [5]. Bit-stream length can be reduced by careful management of correlation and precision (progressive precision [6]). The high contribution of stochastic-BN conversion circuits to overall SC costs [75] is being recognized and addressed. New technologies, notably memristors, have appeared that have naturally stochastic properties which reduce data-conversion needs [43].

Despite these successes, SC still has limitations that must be considered when used in certain applications. Most importantly, the run time of SC circuits increases prohibitively when high precision or highly accurate computations are needed. Recent investigations have shown that the long computation time may lead to excessive energy consumption, thus making low-precision BC a better choice [1], [58], [62]. Manohar [58] provided a theoretical comparison between SC and BC and showed that even for multiplication, SC ends up having more gate invocations (i.e., the number of times an AND gate is called). De Aguiar and Khatri [1] performed a similar comparison, but instead of comparing the number of gate invocations, they actually implemented BC and SC multipliers with different bit widths. They concluded that SC multiplication is more energy efficient for computations that require 6 bits of precision (or lower). However, if conversion circuits are needed, SC is almost always worse than BC [1].

This poses an important challenge to SC designers: their designs must be competitive in terms of energy efficiency with BC circuits of similar accuracy/precision. Some of the topics that can potentially address this problem are as follows.
1) Exploiting progressive precision to reduce overall run time.
2) Exploiting SC's error tolerance to improve energy usage.
3) Reducing or eliminating the cost of data conversion.
Examples of these techniques appear in the current literature.

This paper focuses on more recent SC work than the survey [3], and attempts to highlight the big challenges facing SC and their potential solutions. The remainder of this paper is organized as follows. Section II provides a formal introduction to SC and its terminology, including SC data formats, basic operations, and randomness requirements. Readers familiar with the topic can skip this section. General synthesis methods for combinational and sequential SC circuits are discussed in Section III. Section IV examines the application domains of SC, as well as some emerging new applications. The conclusion and future challenges of SC are discussed in Section V.

II. BASIC CONCEPTS

Probabilities are inherently analog quantities that correspond to continuous real numbers. Stochastic circuits can therefore be interpreted as hybrid analog–digital circuits because they employ digital components and signals to process analog data. Theoretically, the AND gate of Fig. 2 can perform multiplication on numbers with arbitrary precision. However, to find the probability pZ = pX × pY we must obtain a finite number of discrete samples of the circuit's output from which to estimate pZ. The estimation's accuracy increases slowly with the number of samples, and is limited by noise considerations, making it impractical to estimate pZ with high precision.

A. Stochastic Number Formats

Interpreting SNs as probabilities is natural, but it limits them to the unit interval [0,1]. To implement arithmetic operations outside this interval, we need to scale the number range in application-dependent ways. For example, integers in the range [0, 256] can be mapped to [0, 1] by dividing them by a scaling factor of 256, so that {0, 1, 2, . . . , 255, 256} is replaced by {0, 1/256, 2/256, . . . , 255/256, 1}. Such scaling can be considered as a preprocessing step required by SC.

SC can readily be defined to handle signed numbers. An SN X whose numerical value is interpreted in the most obvious fashion as pX is said to have the unipolar format. To accommodate negative numbers, many SC systems employ the bipolar format where the value of X is interpreted as 2pX − 1, so the SC range effectively becomes [−1, 1]. Thus, an all-0 bit-stream has unipolar value 0 and bipolar value −1, while a bit-stream with equal numbers of 0s and 1s has unipolar value 0.5, but bipolar value 0. Note that the function of an SC circuit usually changes with the data format used. For instance, the AND gate of Fig. 2 does not perform multiplication in the bipolar domain. Instead, an XNOR gate must be used, as shown in Example 1 below. On the other hand, both formats can use the same adder circuit. In what follows, to reduce confusion, we use X to denote the numerical value of the SN X. With this convention, X = pX in the unipolar domain, while X = 2pX − 1 in the bipolar domain.

Several other SN formats have appeared in the literature. Inverted bipolar is used in [2] to simplify the notation for spectral transforms. In [61] the value of a bit-stream is interpreted as the ratio of 1s to 0s, which creates a very wide, albeit sparse, number range. Table I shows the various number formats mentioned so far. These formats deal with single bit-streams only. Dual-rail and multirail representations have also been proposed. Gaines [29], for example, presented dual-rail unipolar and bipolar number formats, along with the basic circuits for each format. Toral et al. [94] proposed another dual-rail encoding that represents a ternary SN X = x1 x2 . . . xN, where each xi ∈ {−1, 0, 1}; it will be discussed in Section IV-A. The binomial distribution generator of [75], which is discussed in Section III, produces a multirail SN.

B. Stochastic Number Generation

We can map an ordinary BN to an SN in unipolar format using the SNG in Fig. 1. To convert the unipolar SN back to binary, it suffices to count the number of 1s in the bit-stream using a plain (up) counter. Slight changes to these circuits allow for conversion between bipolar SNs and BNs. In

TABLE I
POSSIBLE INTERPRETATIONS OF A BIT-STREAM OF LENGTH N CONTAINING N1 1s AND N0 0s

SC, number-conversion circuits tend to cost much more than number-processing circuits. For example, to multiply two 8-bit BNs using the SC multiplier in Fig. 2, we need two SNGs and a counter. A rough gate count reveals that the conversion circuits have about 250 gates, while the computation part is just a single AND gate. Extensive use of conversion circuits can severely affect the cost of SC circuits. Qian et al. [76] reported that the conversion circuits consume up to 80% of the total area of several representative designs. For this reason, it is highly desirable to reduce the cost of conversion circuits.

Methods to reduce the cost of constant number generation are investigated in [25] and [79]. For massively parallel applications such as LDPC decoding, a single copy of a random number generator can be shared among multiple copies of SC circuits to provide random inputs, thus effectively amortizing the cost of conversion circuits [21], [89]. Furthermore, inherently stochastic nanotechnologies like memristors offer the promise of very low-cost SNGs [43]. The cost of data conversion can also be lowered if analog inputs are provided to the SC circuit. In this case, it may be feasible to directly convert the inputs from analog to stochastic using ramp-compare analog-to-digital converters [46], [64] or delta-sigma converters [83].

C. Accuracy and Randomness

The generation of an SN X resembles an ideal Bernoulli process producing an infinite sequence of random 0s and 1s. In such a process, each 1 is generated independently with fixed probability pX; 0s thus appear with probability 1 − pX. The difference between the exact value pX and its estimated value p̂X (estimated over N samples) indicates the accuracy of X. This difference is usually expressed by the mean square error (MSE) EX given by

EX = E[(p̂X − pX)²] = pX(1 − pX)/N. (3)

Equation (3) implies that inaccuracies due to random fluctuations in the SN bit-patterns can be reduced as much as desired by increasing the bit-stream length N. Hence the precision of X can be increased by increasing N or, loosely speaking, the quality of a stochastic computation tends to improve over time. This property is termed progressive precision, and is a feature of SC that will be discussed further later.

Stochastic circuits are subject to another error source which is much harder to deal with, namely insufficient independence or correlation among the input bit-streams of a stochastic circuit. Correlation is due to signal reuse caused by reconvergent fanout, shared randomness sources, and the like. As noted in Section I, if a bit-stream representing X is fanned out to both inputs of the AND gate in Fig. 2, the gate computes X instead of X squared. This major error is due to maximal (positive) correlation between the AND's input signals. In general, if correlation changes the output number, the resulting error does not necessarily go toward zero as N increases.

It is instructive to interpret SN generation as a Monte Carlo sampling process [6]. Consider again the SNG in Fig. 1 and, for simplicity, assume that both input B and random source R have arbitrary precision. Assume further that the value pX of B is unknown. The SNG effectively generates a sequence X of N samples, and we can get an estimate p̂X of pX by counting the number of 1s in X. It is known that p̂X converges to the exact value pX at the rate of O(1/√N).

For most stochastic designs, LFSRs are used as the random number sources (RNSs) to produce stochastic bit-streams. Although these random sources are, strictly speaking, deterministic, they pass various randomness tests [32], [44] and so are considered pseudo-random. Such tests measure certain properties of a bit-stream, e.g., the frequency of 1s, the frequency of runs of k 1s, etc., and check the extent to which these properties match the behavior of a true random number generator.

Despite what is commonly believed, SNs do not need to pass many randomness tests. In fact, in order to have p̂X = pX we only need X to have the correct frequency of 1s. So it is possible to replace RNSs by so-called deterministic sources, which employ predictable patterns and lack most of the usual randomness attributes [6], [38]. An example of a deterministic format is where all the 1s of an SN are grouped together and followed by all the 0s, as in 111111100000 [13].

To generate a deterministic bit-stream of the above form, we can use a counter to generate a sequence of deterministic values 0, 1/N, 2/N, . . . , (N − 1)/N and feed it to the comparator of Fig. 1. It can be proved that the difference between p̂X (the value of the generated bit-stream) and pX (the constant number fed to the comparator) is no more than 1/N, implying that p̂X converges to pX at the faster rate of O(1/N). This motivates the use of deterministic number sources in SC, and indeed some SC circuits use such deterministic numbers [6]. However, there are several challenges to overcome when deterministic number formats are used, including limited scalability and the cost of number generation to preserve the deterministic formats.

When many mutually uncorrelated SNs are needed, we can still extend the foregoing deterministic number generation approach, but its cost significantly increases with the number of inputs. Gupta and Kumaresan [34] described an SN multiplier that produces exact results for any given input precision.

Fig. 5. Multiplexer serving as a stochastic adder, with pX = 8/16, pY = 6/16, pR = 8/16, and pZ = 1/2(pX + pY) = 7/16.

Fig. 6. Counter-based stochastic divider.

However, to multiply k m-bit numbers using their method requires bit-streams of length 2^{km}, which becomes impractical for circuits with a large number of inputs.

By employing the deterministic approach, one gains better control over the progressive precision of the SNs. RNSs provide this property naturally to some degree. To fully exploit it, quasi-random or low-discrepancy sources may be used [6]. SNs generated via low-discrepancy sequences converge at the rate of O(1/N). However, the benefits of using low-discrepancy sequences also diminish as the number of inputs increases, because the cost of generating them is much higher than pseudo-random number generation.

In summary, it may be beneficial to use deterministic number sources for SC circuits that have few inputs (three or fewer uncorrelated inputs). For circuits with more number sources, it appears better to use LFSRs and settle for the slower O(1/√N) convergence rate.

D. Basic Arithmetic Operations

SC multiplication was discussed in the previous sections. SC addition is usually performed by a multiplexer (MUX) implementing the Boolean function z = (x ∧ r̄) ∨ (y ∧ r), where x and y are the primary (data) inputs and r is the select input. A purely random bit-stream of probability pR = 0.5 is applied to r. The bit-streams X, Y, and Z can be interpreted either as unipolar or bipolar. As Fig. 5 shows, half the output bit-stream Z comes from X (blue) and the other half from Y (red), as decided by R. It follows that pZ = 0.5pX + 0.5pY. Therefore, with either the unipolar or bipolar format, the output value Z = 0.5X + 0.5Y. Notice that R provides a scaling factor of 0.5 and maps the sum to [0, 1] in the unipolar case, or to [−1, 1] in the bipolar case. This type of scaled addition entails a loss of precision, since half of the information in the input bit-streams is effectively discarded. Thus, in the case of Fig. 5 where the input precision is log2 N = 4 bits, the precision of the output also drops to 4 bits (as opposed to the expected 5 bits of precision). To ensure that Z has a precision of 5 bits, the length of all the bit-streams would have to be doubled to 32. It should also be noted that the probability pZ can be expected to fluctuate around 0.5(pX + pY) due to random fluctuations in R.

Several other adder designs have been proposed in the literature. A novel, scaling-free stochastic adder is proposed in [99], which operates on the ternary stochastic encoding proposed in [94]. Its key idea is to use a counter to remember carries of 1 and −1 and release them at a later time slot. Lee et al. [46] described an adder that eliminates the need for a separate random source. Since adding is expensive, Ting and Hayes [91] proposed using accumulative parallel counters (APCs) [71] in computations that end with an adding reduction, e.g., matrix multiplication. An APC performs addition and stochastic-to-binary conversion simultaneously.

SC subtraction is easily implemented in the bipolar domain. Because inverting a bit-stream negates its bipolar value, we can use an inverter and a MUX to implement a bipolar subtractor. However, with unipolar encoding, since the value range [0, 1] does not include negative numbers, implementing subtraction becomes complicated. Various methods of approximating unipolar subtraction exist in [5] and [27].

SC division is the most difficult of the basic operations. First, the result Z = X1/X2 falls outside the range [0, 1] if X1 > X2, so we must assume that X1 ≤ X2. Second, as will be discussed in Section II-E, SC combinational circuits are only capable of implementing multilinear functions, but division is naturally a nonlinear function. Nevertheless, SC circuits that implement division have been proposed in the literature. These circuits either include sequential elements, or exploit correlation among the input SNs.

Gaines [29] implemented division using a feedback loop (Fig. 6). First, an initial guess of the result is stored in a binary variable pZ, and then Y = Z × X2 is calculated using an SC multiplier. If Z were a correct guess of the division result X1/X2, then X1 = Y must hold. So based on the observed value of Y, the guessed result pZ is updated. If Y > X1, pZ is reduced, and if Y < X1, pZ is increased. Given sufficient time, pZ eventually converges to X1/X2. Note that this method needs a binary register to hold the guessed result pZ, and an SNG to generate an SN representing Z. Furthermore, the convergence time of the circuit can be long.

An approximate divider can be implemented by a JK flip-flop. If we connect the J and K inputs to X1 and X2, respectively, then the SN appearing at the output of the flip-flop implements Z = X1/(X1 + X2). The JK flip-flop of Fig. 4(b) is used for the purpose of division. Recently, a new SC divider has been proposed by Chen and Hayes [20]. This divider exploits correlation among its inputs and implements an exact division function.

E. Stochastic Functions

As shown in the previous sections, SC addition and multiplication can be implemented by simple combinational circuits. A related question is: given an arbitrary combinational circuit, what SC function does it compute?

Consider a combinational circuit C implementing the Boolean function f(x1, . . . , xk). When supplied with

uncorrelated SNs, C implements the (unipolar) stochastic


function F(X1 , . . . , Xk ) defined by
F(X1 , . . . , Xk ) = (1 − X1 )(1 − X2 ) . . . (1 − Xk )f (0, 0, . . . , 0)
+ (1 − X1 )(1 − X2 ) . . . (Xk )f (0, 0, . . . , 1) (a) (b)
··· Fig. 7. (a) Regeneration-based and (b) isolation-based decorrelation of
+ (X1 )(X2 ) . . . (Xk )f (1, 1, . . . , 1). (4) a squarer circuit. DFF denotes a D flip-flop isolator.

When expanded out, (4) takes the form of a multilinear polynomial. Consequently, combinational circuits with uncorrelated inputs can only approximate their target function via a suitable multilinear polynomial (see Section III).

Example 1: Let f(x1, x2) be the logic function of an XOR gate. Then from (4)

Z = F(X1, X2) = (1 − X1)X2 + X1(1 − X2) = X1 + X2 − 2X1X2

or, equivalently

pZ = pX1 + pX2 − 2pX1 pX2.    (5)

Thus X1 + X2 − 2X1X2 is the unipolar stochastic function of XOR. If we treat the SNs as bipolar numbers, where X = 2pX − 1, (5) can be rewritten as 2pZ − 1 = −(2pX1 − 1)(2pX2 − 1), i.e., Z = −X1X2. Hence, an XOR gate serves as a bipolar multiplier with negation. Using the inverted bipolar or IBP format (see Table I), we get F = X1X2, and the XOR gate becomes an IBP multiplier without negation. Clearly, an XNOR gate is the basic bipolar multiplier.

We can extend the functionality of SC circuits by incorporating sequential elements, as in the examples of Fig. 4. In particular, sequential elements enable implementation of rational functions. Section III shows how arbitrary functions can be implemented efficiently using sequential SC circuits.

F. Correlation in Stochastic Operations

Although correlation in input SNs is usually detrimental to the functional correctness of stochastic circuits, careful use of correlation may be beneficial. Indeed, by feeding a circuit with inputs that are intentionally correlated, we obtain a different SC function, which may sometimes be very useful. For example, an XOR gate with maximally correlated inputs X and Y implements the absolute difference function |X − Y|, as shown in [5].

To measure the correlation between SNs, Alaghi and Hayes [5] introduced a similarity measure called SC correlation (SCC), which is quite different from the more usual Pearson correlation measure [22]. It is claimed in [5] that SCC is more suitable for SC circuit design because, unlike the Pearson correlation, it is independent of the values of the SNs. However, SCC cannot be easily extended to more than two SNs.

Maintaining a desired level of correlation between SNs is difficult. Consider the problem of decorrelation, i.e., systematic elimination of undesired correlation. There are two main ways to reduce correlation. One is regeneration, which converts a corrupted SN to binary form and then back to stochastic form using a new SNG. An example of this is shown in Fig. 7(a), which computes Z = X². This decorrelation method has very high hardware cost, and may eliminate desirable properties such as progressive precision. An alternative method called isolation is illustrated in Fig. 7(b). A D flip-flop (DFF) is inserted into line x and clocked at the bit-stream frequency, so it delays X by one clock cycle. If the bits of X are independent, as is normally the case, then X and a k-cycle-delayed version of X are statistically independent in any given clock period. In general, isolation-based decorrelation has far lower cost than regeneration, but the numbers and positions of the isolators must be carefully chosen. Ting and Hayes [92] have developed a theory for placing isolators and have obtained conditions for a placement to be valid.

As noted earlier, a stochastic multiplier requires independent inputs for correct operation. However, Alaghi and Hayes noticed that some operations, including MUX-based addition, do not require their inputs to be independent [8]; such circuits are called correlation insensitive (CI). Fig. 8 shows how correlation insensitivity can be exploited in an SC adder. The original design of Fig. 8(a) assumes that inputs X and Y are generated independently. Because an SC adder is CI, the input RNSs can be shared as shown in Fig. 8(b). Correlation between X and Y does not lead to errors, since the output bit z at any time is taken from X or Y, but not both (see Fig. 5).

Fig. 8. Exploiting correlation insensitivity in an SC adder. (a) Original design and (b) design sharing an RNS.

III. DESIGN METHODS

Until recently, stochastic circuits were designed manually. The circuits of Fig. 4 are examples of clever designs that implement complex functions with a handful of gates. Designing stochastic circuits for arbitrary functions is not easy. This problem has been studied intensively in the last few years, and several general synthesis methods have been
ALAGHI et al.: PROMISE AND CHALLENGE OF SC 1521

proposed [2], [7], [19], [49], [52], [82], [101]. These methods can be classified into two types depending on whether the target design is reconfigurable or fixed. A reconfigurable design has some programmable inputs that allow the same design to be reused for different functions. A fixed design can only implement one target function. In this section, unless otherwise specified, we only discuss SC design in the unipolar domain.
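Before turning to synthesis, the basic stochastic behavior described above (Example 1) can also be checked by simulating bit-streams directly. A rough Monte Carlo sketch; the stream length, seed, and input values are arbitrary illustrative choices:

```python
import random

def sn(p, n, rng):
    """Generate a unipolar SN of length n with bit-probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def bipolar_value(bits):
    """Bipolar value X = 2*pX - 1 encoded by a bit-stream."""
    return 2.0 * sum(bits) / len(bits) - 1.0

rng = random.Random(1)
n = 100_000
x1 = sn(0.8, n, rng)   # bipolar value  2*0.8 - 1 =  0.6
x2 = sn(0.3, n, rng)   # bipolar value  2*0.3 - 1 = -0.4
z = [a ^ b for a, b in zip(x1, x2)]   # bitwise XOR of the two streams

# In bipolar terms Z should approximate -X1*X2 = -(0.6)(-0.4) = 0.24
print(bipolar_value(z))
```

Because the two streams are generated from independent random draws, the simulated XOR output converges (up to random fluctuation on the order of 1/sqrt(n)) to the bipolar product with negation derived in Example 1.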

A. Reconfigurable Stochastic Circuits

The basic form of a reconfigurable stochastic circuit is shown in Fig. 9. Its computing core consists of a distribution-generating circuit (DGC) and a MUX. The DGC has m inputs x1, . . . , xm and outputs a binary value s in the range {0, 1, . . . , n}. The inputs x1, . . . , xm are fed with independent SNs X1, X2, . . . , Xm, all encoding the same variable value X. The port s then outputs a sequence of random numbers. The probability of s assuming the value i (0 ≤ i ≤ n) is a function of the variable X, denoted by Fi(X). With different DGCs, different probability distributions F0(X), . . . , Fn(X) of s can be achieved. The signal s is used as the select input of the MUX. The data inputs of the MUX are n + 1 SNs B0, . . . , Bn, which encode constant probabilities B0, . . . , Bn. The value of the output SN Y of the MUX can be expressed as

Y = P(y = 1) = Σ_{i=0}^{n} P(y = 1 | s = i) P(s = i) = Σ_{i=0}^{n} Bi Fi(X).    (6)

Fig. 9. Reconfigurable stochastic circuit; examples of the DGC include an adder [77] and an up/down counter [52].

As (6) shows, the final output is a linear combination of the distribution functions F0(X), . . . , Fn(X). This type of circuit is reconfigurable because, with different sets of constant values Bi, different functions can be realized using the same design. Of course, not every function can be realized exactly by (6). Given a target function G(X), an optimal set of constant values is determined by minimizing the approximation error between the linear combination and G(X) [76].

Prior research on synthesizing reconfigurable stochastic circuits can be distinguished by the form of the DGC proposed. The first work in this category employed an adder as the DGC [77]. The adder takes n Boolean inputs and computes their sum as the output signal s. Given that the n Boolean inputs are independent and have the same probability X of being 1, the output s follows the well-known binomial distribution

P(s = i) = C(n, i) (1 − X)^{n−i} X^i

for i = 0, 1, . . . , n, where C(n, i) is the binomial coefficient. Therefore, the computation realized has the following form:

Y = Σ_{i=0}^{n} Bi C(n, i) (1 − X)^{n−i} X^i    (7)

which is known as a Bernstein polynomial [55], [78]. The approach of [76] finds a Bernstein polynomial that is closest to the target function and realizes it using the reconfigurable stochastic circuit. The drawback of this method is that n SNGs are required to generate the n SNs X1, . . . , Xn. To address this issue, later work explored the use of sequential circuits as the DGC. The key is to find a simple circuit which produces a distribution that approximates arbitrary functions closely. Li et al. [52] first studied the use of an up/down counter as the DGC. The counter has a Boolean input x and outputs the current count value. If x = 1, the count increases by one; otherwise it decreases by one. The count value remains unchanged for x = 1 if it has reached its maximal value. Also, it remains unchanged for x = 0 if it has reached its minimal value.

If the input x carries an SN X, the state behavior of the counter can be modeled as a time-homogeneous Markov chain [82]. A Markov chain has an equilibrium distribution (π0(X), . . . , πn(X)), where πi(X) is the probability of the state i at equilibrium, which is a function of the input value X. The equilibrium probability distribution can be used as the DGC in Fig. 9, yielding Fi(X) = πi(X) and

Y = Σ_{i=0}^{n} Bi πi(X).

However, the reconfigurable stochastic circuit using the counter as the DGC is not able to approximate a wide range of functions. To enhance the representation capability, extensions were proposed in [49] and [84]. These extensions use FSMs with extra degrees of freedom, thus allowing a wider range of functions to be implemented.

B. Fixed Stochastic Circuits

In many applications, the computation does not change, so a fixed stochastic circuit is enough. The design of fixed stochastic circuits based on combinational logic has been studied in several recent papers [2], [7], [101].

The work in [7] proposes a synthesis method called STRAUSS, based on the Fourier transform of a Boolean function. The Fourier transform maps a vector T representing a Boolean function into a spectrum vector S as follows:

S = (1/2^n) Hn × T.    (8)

Here T is obtained by replacing 0 and 1 in the output column of the truth table by +1 and −1, respectively, and Hn is the

Walsh matrix recursively defined as

H1 = | +1  +1 |        Hn = | H_{n−1}   H_{n−1} |
     | +1  −1 |,            | H_{n−1}  −H_{n−1} |.
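This recursion is straightforward to implement. The sketch below builds Hn and applies (8) to a truth-table vector; the XOR case reproduces the spectrum worked out in Example 2:

```python
import math

def walsh(n):
    """Sylvester-ordered Walsh matrix H_n of size 2^n x 2^n."""
    h = [[1]]
    for _ in range(n):
        top = [row + row for row in h]              # [H_{n-1}  H_{n-1}]
        bottom = [row + [-v for v in row] for row in h]  # [H_{n-1} -H_{n-1}]
        h = top + bottom
    return h

def spectrum(truth):
    """Spectrum S = (1/2^n) H_n T, where T replaces 0/1 outputs by +1/-1."""
    n = int(math.log2(len(truth)))
    t = [1 - 2 * b for b in truth]   # 0 -> +1, 1 -> -1
    return [sum(h * v for h, v in zip(row, t)) / 2 ** n for row in walsh(n)]

print(walsh(1))                 # [[1, 1], [1, -1]]
print(spectrum([0, 1, 1, 0]))   # XOR -> [0.0, 0.0, 0.0, 1.0]
```

The single nonzero entry in the last position of the XOR spectrum corresponds to the coefficient of X1X2, i.e., IBP multiplication.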
Alaghi and Hayes [7] first demonstrated a fundamental relation between the computation of a stochastic circuit and its spectrum vector. They used the IBP format for SNs defined in Table I. The Boolean function f(x1, . . . , xn) then corresponds to the stochastic function

F(X1, . . . , Xn) = Σ_{(a1,...,an) ∈ {0,1}^n} c(a1, . . . , an) Π_{j=1}^{n} Xj^{aj}

where the c(a1, . . . , an)'s are constant coefficients. This is a multilinear polynomial on X1, . . . , Xn [see (4)]. An important finding in [7] is that the coefficient vector c = [c(0, . . . , 0), . . . , c(1, . . . , 1)]^T is the spectrum vector S specified by (8).

Example 2: Consider an XOR gate, which serves as a multiplier in IBP format (see Example 1). Its original truth table vector is [0 1 1 0]^T. Replacing 0s and 1s by +1s and −1s, we get the vector T = [+1 −1 −1 +1]^T. Applying (8) to perform the Fourier transform yields the spectrum vector

S = (1/2²) | +1 +1 +1 +1 | | +1 |   | 0 |
           | +1 −1 +1 −1 | | −1 | = | 0 |
           | +1 +1 −1 −1 | | −1 |   | 0 |
           | +1 −1 −1 +1 | | +1 |   | 1 |.

This again shows that the stochastic function of XOR is IBP multiplication.

Based on the relation between spectral transforms and stochastic circuits, a method to synthesize a stochastic circuit for a target function, specified by its spectrum vector S, is proposed in [7]. The basic idea is to apply the inverse Fourier transform T = Hn S to obtain the vector T. However, this vector may contain entries that are neither +1 nor −1, implying that S does not correspond to a Boolean function. For example, consider the scaled addition function 1/2(X1 + X2). Its S (coefficient) vector is [0 1/2 1/2 0]^T, and the inverse Fourier transform T = H2 S yields T = [1 0 0 −1]^T, which contains the non-Boolean element zero. This problem is implicitly resolved in the standard MUX-based scaled adder (Fig. 5), which has a third input r that introduces the constant probability 0.5.

In general, an entry −1 < q < 1 in the T vector corresponds to an SN of constant probability (1 − q)/2. STRAUSS employs extra SNs of probability 0.5 to generate these SNs, since a probability of 0.5 can be easily obtained from an LFSR. A heuristic method is introduced to synthesize a low-cost circuit that produces multiple constant probabilities simultaneously.

A synthesis problem similar to that of [7] is addressed in [101]. The authors first analyzed the stochastic behavior of a general combinational circuit whose inputs comprise n variable SNs X1, . . . , Xn and m constant input SNs R1, . . . , Rm of value 0.5, as shown in Fig. 10. If the Boolean function of the combinational circuit is f(x1, . . . , xn, r1, . . . , rm), then the stochastic circuit in Fig. 10 realizes a polynomial of the form

F(X1, . . . , Xn) = Σ_{(a1,...,an) ∈ {0,1}^n} [g(a1, . . . , an)/2^m] Π_{j=1}^{n} Xj^{aj} (1 − Xj)^{1−aj}.    (9)

Fig. 10. General form of a fixed stochastic circuit based on a combinational circuit.

In this equation, for any (a1, . . . , an) ∈ {0, 1}^n, g(a1, . . . , an) denotes the weight of the Boolean function f(a1, . . . , an, r1, . . . , rm) on r1, . . . , rm, i.e., the number of input vectors (b1, . . . , bm) ∈ {0, 1}^m such that f(a1, . . . , an, b1, . . . , bm) = 1.

Example 3: Consider the case where the combinational circuit in Fig. 10 is a MUX, with x1 and x2 as its data inputs and r1 as its select input. Then, the circuit's Boolean function is f(x1, x2, r1) = (x1 ∧ ¬r1) ∨ (x2 ∧ r1). We have f(0, 0, r1) = 0, f(0, 1, r1) = r1, f(1, 0, r1) = ¬r1, and f(1, 1, r1) = 1. Correspondingly, we have g(0, 0) = 0, g(0, 1) = g(1, 0) = 1, and g(1, 1) = 2. According to (9), the circuit's stochastic function is

F(X1, X2) = (1/2)(1 − X1)X2 + (1/2)X1(1 − X2) + (2/2)X1X2 = 1/2(X1 + X2).

This again shows that the stochastic function of a MUX is a scaled addition.

A synthesis method is further proposed in [101] to realize a general polynomial. It first converts the target to a multilinear polynomial. Then, it transforms the multilinear polynomial to a polynomial of the form shown in (9). This transformation is unique and can be easily obtained. After that, the problem reduces to finding an optimal Boolean function f*(x1, . . . , xn, r1, . . . , rm) such that for each (a1, . . . , an) ∈ {0, 1}^n, the weight of f*(a1, . . . , an, r1, . . . , rm) is equal to the value g(a1, . . . , an) specified by the multilinear polynomial. A greedy method is applied to find a good Boolean function. The authors also found that in synthesizing polynomials of degree more than 1, all (a1, . . . , an) ∈ {0, 1}^n can be partitioned into a number of equivalence classes, and the weight constraint can be relaxed so that the sum of the weights of f(a1, . . . , an, r1, . . . , rm) over all (a1, . . . , an)'s in each equivalence class is equal to a fixed value derived from the target polynomial. Zhao and Qian [101] exploited this freedom to further reduce the circuit cost.

IV. APPLICATIONS

SC has been applied to a variety of application domains, including artificial neural networks (ANNs) [12], [14], [15], [17], [24], [39], [46], [93], [95], control systems [59], [100], reliability estimation [35], data mining [21], DSP [4], [18], [40], [48], [50], [54], [83], and decoding of modern error-correcting codes (ECCs) [26], [30], [47], [63], [85], [86], [89], [90], [96], [97]. Most of these applications are characterized by the need for a large amount of arithmetic computation, which can leverage the simple circuitry provided by SC. They also have low precision requirements for the final results, which avoids the need for excessively long SNs to represent data values. In this section, we review four important applications for which SC has had some success: 1) filter design; 2) image processing; 3) LDPC decoding; and 4) ANNs.

A. Filter Design

The design of finite impulse response (FIR) filters is considered in [18] and [36]. A general M-tap FIR filter computes an output based on the M most recent inputs as follows:

Y[n] = H0 X[n] + H1 X[n − 1] + · · · + H_{M−1} X[n − M + 1]    (10)

where X[n] is the input signal, Y[n] is the output signal, and Hi is the ith filter coefficient. The FIR filter thus computes the inner product of two vectors [see (2)]. A conventional binary implementation of (10) requires M multipliers and M − 1 adders, which has high hardware complexity. SC-based designs can potentially mitigate this problem.

Since the values of H, X, and Y may be negative, bipolar SNs are used to encode them. A straightforward way to implement (10) uses M XNOR gates for the multiplications and an M-to-1 MUX for the additions. However, this implementation has the problem that the output of the MUX is 1/M times the desired output. Such down-scaling causes severe accuracy loss when M is large.

To address the foregoing problem, a stochastic design based on an uneven-weighted MUX tree has been proposed [18], [36]. Fig. 11 shows such a design for a five-tap FIR filter. The input Sign(Hi) is a stream of bits, each equal to the sign bit of Hi in its 2s-complement binary representation. The probability for the select input of each MUX is shown in the figure. The output probability of the design is Y[n]/Σ_{i=0}^{4} |Hi|. In the general case, the output probability of an uneven-weighted MUX tree is Y[n]/Σ_{i=0}^{M−1} |Hi|. Note that the scaling factor is reduced to Σ_{i=0}^{M−1} |Hi| ≤ M. In the case where Σ_{i=0}^{M−1} |Hi| < 1, the proposed design will even scale up the result.

Fig. 11. Stochastic implementation of a five-tap FIR filter with an uneven-weighted MUX tree.

Although the datapath of the stochastic FIR filter consists of just a few logic gates, as shown in Fig. 11, the interface SNGs (not shown) may occupy a large area, offsetting the potential area benefit brought by the simple datapath. To further reduce the area of the SNGs, techniques for sharing the RNSs used in the SNGs and for circularly shifting the outputs of the RNS to generate multiple random numbers with low correlation are proposed in [36].

Area-efficient stochastic designs for the discrete Fourier transform (DFT) and the fast Fourier transform (FFT), which are important transformation techniques between the time and frequency domains, are described in [99]. An M-point DFT for discrete signals X[n] (n = 0, 1, . . . , M − 1) computes the frequency-domain values Y[k] (k = 0, 1, . . . , M − 1) as follows:

Y[k] = Σ_{n=0}^{M−1} X[n] W_M^{kn}

where W_M = e^{−j(2π/M)}. The FFT is an efficient way to realize the DFT by using a butterfly architecture [70].

The basic DFT computation resembles that of an FIR filter. Although the technique of the uneven-weighted MUX tree can be applied [98], the accuracy of the result degrades as the number of points becomes larger due to the growing scaling factor. To address this problem, the work in [99] proposes a scaling-free stochastic adder based on a two-line stochastic encoding scheme [94]. This encoding represents a value in the interval [−1, 1] by a magnitude stream M(X) and a sign stream S(X). Fig. 12(a) shows an example of encoding the value −0.5. Indeed, this encoding can be viewed as employing a ternary stochastic stream X = x1 x2 . . . xN with each xi ∈ {−1, 0, 1}. The magnitude and the sign of xi are represented by the ith bit in the magnitude stream and the sign stream, respectively. If the sign bit is 0 (1), the value is positive (negative). Fig. 12(b) shows the multiplier for this encoding. Experimental results indicate that using the stochastic multiplier and the special stochastic adder to implement the DFT/FFT can achieve much higher accuracy than an implementation based on the uneven-weighted MUX tree when the number of points M is large.

Fig. 12. Two-line stochastic encoding. (a) Example of encoding the value −0.5. (b) Multiplier for the encoding.
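To see why the two-line encoding supports signed multiplication, the streams can be mimicked in software. In the sketch below the element-wise product is formed by XORing the sign bits and ANDing the magnitude bits; this is one plausible realization, and the actual gate-level multiplier of Fig. 12(b) [99] may differ:

```python
import random

def two_line(value, n, rng):
    """Encode value in [-1, 1] as (sign, magnitude) bit-streams:
    sign bit 0/1 for positive/negative, magnitude bit-probability |value|."""
    s = 1 if value < 0 else 0
    return [s] * n, [1 if rng.random() < abs(value) else 0 for _ in range(n)]

def decode(sign, mag):
    """Value of the ternary stream x_i = (-1)^sign_i * mag_i."""
    return sum(-m if s else m for s, m in zip(sign, mag)) / len(mag)

rng = random.Random(7)
n = 100_000
s1, m1 = two_line(-0.5, n, rng)
s2, m2 = two_line(0.6, n, rng)

# Element-wise product: XOR the sign bits, AND the magnitude bits.
sp = [a ^ b for a, b in zip(s1, s2)]
mp = [a & b for a, b in zip(m1, m2)]

print(decode(sp, mp))   # should be close to -0.5 * 0.6 = -0.3
```

With independent magnitude streams, the expectation of the ternary product stream factors into the product of the two encoded values, so no down-scaling of the result is needed.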

The design of infinite impulse response (IIR) filters is considered in [53], [54], and [72]. Compared to FIR filters, the implementation of IIR filters using SC is more challenging. The main difficulty is the feedback loop in the IIR filter, which causes correlation among the stochastic bit-streams, whereas the correct operation of SC usually requires that the bit-streams be independent. To address this problem, Liu and Parhi [54] proposed transforming the IIR filter into a lattice structure via the Schur algorithm [88]. The benefit of such a lattice structure is that its states are orthogonal and, hence, uncorrelated, which makes the design suitable for stochastic implementation. To reduce error due to state overflow (where the state value may fall outside the range [−1, 1] of the bipolar stochastic representation), the authors further proposed a scaling method that derives a normalized lattice structure as the implementation target.

B. Image Processing

A DSP application that is well-suited to SC is image processing [4], [50], [67]. It can exploit the massive parallelism provided by simple stochastic circuits, because many image-processing operations are applied pixel-wise or block-wise across an entire image [33]. Also, long SNs are not required for image-processing applications, because the precision demands are low; in many cases, 8-bit precision is enough.

Li et al. [50] proposed stochastic implementations for five image-processing tasks: 1) edge detection; 2) median filter-based noise reduction; 3) image contrast stretching; 4) frame difference-based image segmentation; and 5) kernel density estimation (KDE)-based image segmentation. Their designs introduce some novel SC elements based on sequential logic. All the designs show smaller area than their conventional counterparts. The reduction in area is greatest for KDE-based image segmentation, due to its high computational complexity. This work demonstrates that stochastic designs are advantageous for relatively complicated computations.

Najafi and Salehi [67] applied SC to a local image-thresholding algorithm called the Sauvola method [87]. Image thresholding is an important step in optical character recognition. It selects a threshold and uses that threshold to determine whether a pixel should be set to 1 (background) or 0 (foreground). The Sauvola method determines the threshold for each pixel in an image and involves calculating products, sums, means, squares, absolute differences, and square roots. All these operations can be realized efficiently by SC units.

Improved stochastic designs for several image-processing applications were also proposed in [4]. An example is real-time edge detection. The authors considered the Roberts cross operator, which takes an input image and produces an output image with edges highlighted. Let Xi,j and Zi,j denote the pixel values at row i and column j in the input and the output images, respectively. The operator calculates Zi,j in the following way:

Z_{i,j} = 0.5 (|X_{i,j} − X_{i+1,j+1}| + |X_{i,j+1} − X_{i+1,j}|).    (11)

A stochastic implementation of (11) is shown in Fig. 13(a). It consists of only two XOR gates and one MUX. By deliberately correlating the two input SNs of an XOR gate so that they have the maximum overlap of 0s and 1s, the XOR computes the absolute difference between the two input SNs [5]. The MUX further performs a scaled addition on the two absolute differences. In contrast, a conventional implementation of (11) on BNs is much more complicated, as suggested by Fig. 13(b); it has two subtractors, two absolute-value calculators, and an adder.

Fig. 13. Two implementations of the Roberts cross operator. (a) Stochastic. (b) Conventional.

Although a stochastic implementation often needs a large number of clock cycles to obtain the final result, the critical path delay of the stochastic implementation is much smaller than a conventional implementation's due to the simplicity of the stochastic circuit. For instance, the overall delay of the circuit of Fig. 13(a) is only 3× higher than the delay of its binary counterpart [Fig. 13(b)].

Another benefit of a stochastic implementation is its error tolerance. Fig. 14 visually demonstrates this advantage by comparing the stochastic implementation of edge detection with conventional binary implementations for different levels of noise injected into the input sensor [4]. As shown in the first row of Fig. 14, when the noise level is 10% to 20%, the conventional design generates useless outputs. In contrast, the SC implementation in the second row is almost unaffected by noise and is able to detect the edges even at a noise level of 20%.

Fig. 14. Edge-detection performance for two implementation methods (top row: conventional binary; bottom row: stochastic computing) with noise levels of (a) 5%, (b) 10%, and (c) 20%.

C. Decoding Error-Correcting Codes

One successful application of SC is the decoding of certain modern ECCs. Researchers have proposed stochastic decoder designs for several ECCs, such as turbo codes [26], polar codes [96], [97], binary LDPC codes [30], [47], [89], [90], and nonbinary LDPC codes [85], [86].

The earliest stochastic decoders were proposed for binary LDPC codes (for simplicity, hereafter referred to as LDPC codes), whose decoding performance approaches the Shannon capacity limit [81]. They have been adopted in several recent digital communication standards, such as the DVB-S2, IEEE 802.16e (WiMAX), and IEEE 802.11n (WiFi) standards.

A binary LDPC code is characterized by a bipartite factor graph consisting of two groups of nodes: 1) variable nodes (VNs) and 2) parity-check nodes (PNs). A widely used method to decode an LDPC code applies the sum-product algorithm (SPA) to the factor graph. The SPA iteratively passes a probability value, which represents the belief that a bit in the code block is 1, from a VN to a connected PN, or vice versa. The codeword is determined by comparing the final probabilities against a threshold.

The major computation in the decoder involves the following two operations on probabilities:

pC = pA(1 − pB) + pB(1 − pA)    (12)
pZ = pX pY / [pX pY + (1 − pX)(1 − pY)].    (13)

Binary implementation of (12) and (13) requires complicated arithmetic circuits, such as adders, multipliers, and dividers. To alleviate this problem, Gaudet and Rapley [30] proposed a stochastic implementation of LDPC decoding in which (12) and (13) are realized efficiently by the circuits in Fig. 4(a) and (b), respectively.

Besides reducing the area of the processing units, SC also reduces routing area. In a conventional binary implementation, the communication of probability values of precision k between two nodes requires k wires connecting the two nodes, which leads to a large routing area. However, with SC, due to its bit-serial nature, communication between two nodes requires only a single wire. Another benefit of SC is its support of an asynchronous pipeline. In SN representation, bit order does not matter, so we do not require the input of the PNs and VNs to be the output bits of the immediately previous cycle. This allows different edges to use different numbers of pipeline stages, thus increasing the clock frequency and throughput [89].

To improve the SPA convergence rate, Tehrani et al. [89] added a module called an edge memory (EM) to each edge in the factor graph. Since one EM is assigned to each edge, the hardware usage of the EMs can be large. To further reduce this hardware cost, Tehrani et al. [90] introduced a module called a majority-based tracking forecast memory, which is assigned to each VN. This method has been integrated into a fully parallel stochastic decoder ASIC that decodes the (2048, 1723) LDPC code from the IEEE 802.3an (10GBASE-T) standard [90]. This decoder turns out to be one of the most area-efficient fully parallel LDPC decoders.

Stochastic LDPC decoders essentially implement the belief propagation algorithm [73]. This fundamental approach can also be used to decode other ECCs, such as polar codes and nonbinary LDPC codes. Given their algorithm-level similarity, researchers have proposed SC-based decoders for these codes [85], [86], [96], [97]. For example, to resolve the slow convergence problem of a pure stochastic decoder for nonbinary LDPC codes, a way of mixing binary computation and stochastic computation units has been proposed [86]. A technique of splitting and shuffling stochastic bit-streams is described in [97] to simultaneously mitigate the costs of long stochastic bit-streams and rerandomization in a stochastic decoder for polar codes.

D. Artificial Neural Networks

ANNs, mimicking aspects of biological neural networks, are an early application of SC [14], [15], [24], [68], [93]. Only recently, with advances in machine learning algorithms and computer hardware technology, have they found commercial success in applications such as computer vision and speech recognition [45]. ANNs are usually implemented in software on warehouse-scale computing platforms, which are extremely costly in size and energy needs. These shortcomings have stimulated renewed interest in using SC in ANNs [10], [11], [41], [46], [80]. Furthermore, many classification tasks performed by ANNs do not require high accuracy; it suffices that their classification decisions be correct most of the time [51]. Hence, SC's drawbacks of low precision and stochastic variability are well-tolerated in ANN applications.

A widely used type of ANN is the feed-forward network shown in Fig. 15 [37]. It is composed of an input layer, several hidden layers, and an output layer. A node in the network is referred to as a neuron. Each neuron in a hidden or an output layer is connected to a number of neurons in the previous layer via weighted edges. The output 0 (inactive) or 1 (active) of a neuron is determined by applying an activation function to the weighted sum of its inputs. For example, the output of the neuron Y1 in Fig. 15 is given by

Y1 = F(Σ_{i=1}^{n} Wi Xi)    (14)

where Xi is the signal produced by the ith input neuron of Y1, Wi is the weight of the edge from Xi to Y1, and F(Z) is the activation function. A frequent choice for F is the sigmoid function defined by

F(Z) = 1 / (1 + e^{−βZ})

where β is the slope parameter.

A key problem in ANN design is the addition of the large number of items supplied to a neuron; a similar problem occurs in FIR filters with a large number of taps. The straightforward use of MUX-based adders to perform the scaled addition is not a good solution, because the scaling factor is proportional to the number of a neuron's connections. When rescaling the final MUX output, even a very small error due to stochastic variation may be enlarged significantly. To address this problem, Li et al. [51] revived the old idea of using an OR gate as an adder [29]. OR combines two unipolar SNs X and Y as follows:

Z = X + Y − XY.

This is not strictly addition, but when either X ≪ 1 or Y ≪ 1, the output Z is approximately the sum of the two inputs. To make the inputs close to zero, Li et al. [51] applied a moderate scaling factor to scale down the inputs.

Fig. 15. Typical feed-forward network structure.

Some other studies have addressed the addition problem with new stochastic data representations [11], [17]. In [17], an encoding scheme called extended stochastic logic (ESL) is proposed which uses two bipolar SNs X and Y to represent the number X/Y. ESL addition has the advantage of being exact, with no scaling factor. Moreover, ESL encoding allows easy implementation of multiplication, division, and the sigmoid function. Together, these operations lead to an efficient neuron design.

Ardakani et al. [11] have proposed the concept of the integer SN (ISN), in which a sequence of random integers represents a value equal to the mean of those integers. For example, the sequence 2, 0, 4, 1 represents 7/4. With this encoding, any real number can be represented without prior scaling. The weights in an ANN, which can lie outside the range [−1, 1], do not need to be scaled. The addition of two ISNs uses a conventional binary adder, which makes the sum exact. Multiplication of two ISNs requires a conventional binary multiplier, which is expensive. Fortunately, in the ANN implementation proposed in [11], one input to the multiplier, which corresponds to the neuron signal, is always a binary SN. Then, the conventional multiplier is reduced to several AND gates. The sigmoid activation function is implemented by a counter similar to that in [14]. Although the hardware cost of the ISN implementation is larger than that of a binary stochastic implementation, the former has much lower latency and energy consumption. Compared to the conventional binary design, the ISN design produces fewer misclassification errors, while reducing energy and area cost substantially.

Another recent work [41] proposes two new ways to design ANNs with SC. The first considers training in the design phase to make the network friendly to a stochastic implementation. The second is based on the observation that most of the input data can be classified easily because they are far from the decision boundary. For these input data, computation with low-precision SNs is enough to obtain the correct results. Based on this, the authors devised an early decision termination (EDT) strategy which adaptively selects the number of bits used in the computation depending on the difficulty of the classification task. The resulting design has a misclassification error rate very close to that of the conventional implementation. Furthermore, EDT reduces energy consumption with only a slight increase in misclassification errors.

Efficient stochastic implementation of convolutional neural networks (CNNs), a special type of feed-forward ANN, is the focus of [10]. In a CNN, the signals of all the neurons in a layer are obtained by first convolving a kernel with the signals in the input layer (a special kind of filtering operation) and then applying an activation function. The size of the kernel is much less than that of the input layer, which means a neuron signal only depends on a subset of the neurons in its input layer. CNNs have been successfully applied to machine learning tasks such as face and speech recognition. A major contribution of [10] is an efficient stochastic implementation of the convolution operation. Unlike SC that uses SNs to encode real values, the proposed method uses the probability mass function of a random variable to represent an array of real values. An efficient implementation of convolution is developed based on this representation. Furthermore, a few other techniques are introduced in [10] to implement other components of a CNN, such as the pooling and nonlinear activation components. Compared to a conventional binary CNN, the proposed SC implementation achieves large improvements in performance and power efficiency.

Efficient stochastic implementation of CNNs has also been studied by Ren et al. [80]. They performed a comprehensive study of SC operators and how they should be optimized to obtain energy-efficient CNNs. Ren et al. [80] adopted the approximate APC of [42] to add a large number of input stochastic bit-streams. Kim et al. [42] reported that the approximate APC has negligible accuracy loss and is about 40% smaller than the exact APC.

V. DISCUSSION

Since the turn of the present century, significant progress has been made in developing the theory and application of SC. New questions and challenges have emerged, many of which still need to be addressed. With the notable exception of LDPC decoder chips, few large-scale SC-based systems have actually been built and evaluated. As a result, real-world experience with SC is limited, making it likely that many practical aspects of SC such as its true design costs, run-time performance, and energy consumption are not yet fully appreciated. Small-scale theoretical and simulation-based studies are
The authors observed that weights close to zero, which corre-
fairly plentiful, but they often consider only a narrow range
spond to (bipolar) SNs of probability 0.5, contribute the most
of issues under restrictive assumptions.
to random fluctuation errors. Therefore, they proposed to iter-
atively drop near-zero weights and then retrained the network
to derive a network with high classification accuracy but no A. Conclusion
near-zero weights. The second technique is to exploit the pro- Based on what is now known, we can draw some general
gressive precision property of SC. The authors observed that conclusions about what SC is, and is not, good for.
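A concrete feel for these trade-offs can be had from a few lines of simulation. The sketch below is illustrative code, not drawn from the paper (the stream lengths, values, and seed are arbitrary choices): it encodes values as unipolar bit-streams, multiplies them with a single AND gate, and shows how accuracy improves as the streams get longer, SC's progressive-precision property.

```python
import random

def sn(p, n, rng):
    """Unipolar stochastic number: a bit-stream of length n with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def value(stream):
    """Decode a bit-stream as the fraction of 1s it contains."""
    return sum(stream) / len(stream)

rng = random.Random(0)
X, Y = 0.4, 0.6  # values to multiply; the exact product is 0.24

# A single AND gate multiplies two *independent* unipolar SNs.
for n in (16, 256, 4096, 65536):
    x, y = sn(X, n, rng), sn(Y, n, rng)
    z = [a & b for a, b in zip(x, y)]
    print(f"n = {n:6d}: AND-gate product = {value(z):.4f} (exact {X * Y:.4f})")
```

The random fluctuation shrinks only as 1/√n, so the stream length must grow exponentially with the number of bits of precision required, which is the precision limitation discussed in this section.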
ALAGHI et al.: PROMISE AND CHALLENGE OF SC 1527
1) Precision and Errors: SC is inherently approximate and inexact. Its probability-based and redundant data encoding makes it a relatively low-precision technology, but one that is very tolerant of errors. It has been successfully applied to image processing using 256-bit SNs, which correspond roughly to 8-bit (fixed-point) BNs. SC is unsuited to the very high 32- or 64-bit precision error-sensitive calculations that are the domain of BNs and BC. This is seen in the random noise-like fluctuations that are normal to SNs, in the way SNs are squeezed into the unit interval, producing errors near the boundaries, and in the fact that SNs grow in length exponentially faster than BNs as the desired level of precision increases. Also, the stochastic encoding of numbers does not provide a dynamic range similar to the one provided by floating-point numbers.

On the other hand, low precision and error tolerance have definite advantages. They have evolved in the natural world for use by the human brain and nervous system. Similar features are increasingly seen in artificial constructs like deep learning networks that aim to mimic brain operations [23]. Thus it seems pointless to compare SC and BC purely on the basis of precision or precision-related costs alone [1], [58].

Finally, we observe that while BC circuits have fixed precision, SC circuits have the advantage of inherently variable precision in their bit-streams. Moreover, the bit-streams can be endowed with progressive precision, where accuracy improves monotonically as computation proceeds, as has been demonstrated for some image-processing tasks [4]. If variable precision cannot be exploited, a simple bit-reduction technique in BC often provides better energy efficiency than SC. As reported in recent work, with fixed precision, SC becomes worse for designs above 6 bits of precision [1], [46].

2) Area-Related Costs: The use of tiny circuits for operations like multiplication and addition remains SC's strongest selling point. A stochastic multiplier contains orders-of-magnitude fewer gates than a typical BC multiplier. However, many arithmetic operations, including multiplication, require uncorrelated inputs to function correctly. This implies a need for randomization or decorrelation circuits incorporating many independent random sources or phase-shifting delay elements (isolators), whose combined area can easily exceed that of the arithmetic logic [92]. The low-power benefit of stochastic components must be weighed against the additional power consumed by their randomization circuitry.

3) Speed-Related Costs: Perhaps the clearest drawback of SC is its need for long, multicycle SNs to generate satisfactory results. This leads to long run-times, which are compensated for, in part, by the fact that the clock cycles tend to be very short. Parallel processing, where long bit-streams are partitioned into segments that are processed in parallel, is a speed-up possibility that has often been proposed but not been studied much [21]. The same can be said of progressive precision.

Small stochastic circuits have relatively low power consumption. However, since energy = power × time, the longer run-times of stochastic circuits can lead to higher energy use than their BC counterparts [62]. Reducing energy usage is therefore emerging as a significant challenge for SC.

4) Design Issues: Until recently, SC design was an ad hoc process with little theory to guide it. However, thanks to a deeper understanding of the properties of stochastic functions and circuits, several general synthesis techniques have been developed, which can variously be classified as reconfigurable or fixed, and combinational or sequential [7], [49], [76]. The new understanding has revealed unexpected and novel solutions to some of SC's basic problems.

For example, it has come to be recognized that different circuits realizing different logic functions can have the same stochastic behavior [19]. Far from just being the enemy, correlation can sometimes be harnessed as a design resource to reduce circuit size and cost, as the edge detectors of Fig. 13(a) vividly illustrate. Common circuits like the MUX-based scaled adder turn out to have correlation insensitivity that enables RNSs to be removed or shared (see Fig. 8). A fundamental redesign of the SC scaled adder itself is shown in Fig. 16, which converts it from a three-input to a two-input element, while improving both its accuracy and correlation properties [46]. Despite such progress, many questions concerning the properties of stochastic circuits that influence design requirements remain unanswered.

5) Circuit-Level Aspects: Since SC employs digital components, conventional digital design processes (synthesis, automatic placement and routing, timing closure, etc.) have been used to implement SC ASIC and FPGA-based designs. However, as discussed in this paper, SC shares similarities with analog circuits, so its digital design aspects may differ from those of conventional digital circuits.

Various circuit-level aspects of SC designs have been investigated very recently as a means of improving SC's energy efficiency [9], [65]. They suggest that SC circuits are probably not optimal if they are designed using standard digital design tools. Najafi et al. [65] demonstrated that SC circuits do not need clock trees. Eliminating the clock tree significantly reduces the energy consumption of the circuit. In fact, employing analog components, rather than digital, can lead to significant energy savings [66]. One example is the use of analog integrators, instead of counters, to collect the computation results.

Alaghi et al. [9] have investigated a different circuit-level aspect of SC. They showed that SC's inherent error tolerance makes it robust against errors caused by voltage overscaling. Voltage overscaling, i.e., the process of reducing the power consumption of the circuit without reducing the frequency, usually leads to critical-path timing failures and catastrophic errors in regular digital circuits. However, timing violations in SC manifest as extra or missing pulses on the output SN. The extra and missing pulses tend to cancel each other out, leading to negligible error. An optimization method is described in [9] that balances the circuit paths to guarantee maximum error cancellation. It is worth noting that the observations of [9] have been confirmed through a fabricated chip.

The new results suggest that circuit-level aspects of SC must be considered at design time, as they provide valuable sources of energy saving. As a result, SC circuits should either be designed manually [64] or new CAD tools must be provided [9].

Fig. 16. SC adder built around a toggle flip-flop [46].

6) Applications: As discussed in detail in Section IV, SC has been successfully applied to a relatively small range of applications, notably filter design, image processing, LDPC decoding, and ANN design. A common aspect of these applications is a need for very large numbers of low-precision arithmetic operations, which can take advantage of the small size of stochastic circuits. They also typically have a high degree of error tolerance. It is worth noting that current trends in embedded and cloud computing, e.g., the increasing use of fast online image recognition and machine learning techniques by smart-phones and automobiles, call for algorithms for which SC is well suited. The so-called Internet of Things is likely to create a big demand for tiny, ultralow-cost processing circuits with many of the characteristics of SC.

B. Future Challenges

The issues covered in the preceding sections are by no means completely understood, and many of them deserve further study. There are, however, other important topics that have received little recognition or attention; we briefly discuss four of them next.

1) Accuracy Management: In conventional BC, the accuracy goals of a new design, such as its precision level and error bounds, are determined a priori during the specification phase. As the design progresses and prototypes are produced, fine tuning may be needed to ensure that these goals and related performance requirements are actually met. This approach is much harder to apply to stochastic circuit design. Interacting factors including bit-stream length, RNS placement, and correlation can drastically affect accuracy in complex ways. For example, it is pointed out in [92] that cascading two well-designed squarer circuits, each computing X², does not implement X⁴, as might be expected; instead the cascaded circuit implements X³.

Because of hard-to-predict behavior like this, extensive simulation is almost always used to determine the basic accuracy limits and error sensitivities of a new SC design. SC projects often have a cut-and-try flavor which involves multiple design-and-simulate iterations that resemble design-space exploration rather than the fine tuning of well-founded designs. It would be very useful to be able to incorporate into an SC design flow an "accuracy manager" that can comprehend and automatically adjust the relations among the design parameters affecting accuracy. A first step in this direction can be found in [69], while automatic decorrelation methods to enhance accuracy are addressed in [92].

2) Design Optimization: Despite recent advances in SC synthesis, a number of open problems remain. It is now recognized that many different Boolean functions can realize the same computation [7], [19], [101]. For instance, the Boolean functions f1(x1, x2, r1) = (x1 ∧ r1) ∨ (x2 ∧ ¬r1) and f2(x1, x2, r1) = (x1 ∧ r1) ∨ (x2 ∧ r1) ∨ (x1 ∧ x2) both realize the same stochastic addition function F(X1, X2) = (1/2)(X1 + X2). An open question is: among the numerous Boolean functions that have the same stochastic behavior, how can we find an optimal one? All the previous work on synthesis assumes that the input SNs are independent. However, as shown in [5] and [20], sometimes taking advantage of correlated input SNs helps reduce circuit area. Another open problem is how to develop a synthesis approach that takes correlation into consideration and exploits it when necessary. Finally, most work on synthesis has been restricted to combinational logic. This has led to a deeper understanding of combinational synthesis, for example, the existence of stochastic equivalence classes [19]. In contrast, far fewer theoretical advances have been made in understanding sequential stochastic design. How to synthesize optimal stochastic circuits based on sequential logic therefore remains an unsolved problem.

3) Energy Harvesting: With the development of the Internet of Things, many future computing systems are expected to be powered by energy harvested from the environment. The potential energy sources include solar energy, as well as ambient RF, motion, and temperature energy [56]. A difficulty with such energy sources is that they tend to be highly variable and unstable. This can significantly degrade the performance of BC systems. SC, on the other hand, has strong tolerance of errors caused by random fluctuation of the supply voltage [9]. A problem for SC is the potentially large energy needs of its many randomness sources for number conversion and decorrelation. This may be solved by emerging technologies that have naturally stochastic behaviors. For example, very compact random sources can be constructed from memristors. Moreover, a single memristor source can supply independent random bit-streams to multiple destinations simultaneously [43].

4) Biomedical Devices: It was remarked in Section I that stochastic bit-streams can mimic the low-power spike trains used for communication in natural neural networks (see Fig. 3). This has suggested the use of SC in implantable devices such as retinal implants to treat the visually impaired [4]. Retinal implants are ICs that are placed directly on the retina, sense visual images in the form of pixel arrays, and convert the pixel information into bit-streams that are sent directly to the brain via the optic nerve, where they produce flashes of light that the brain can be trained to interpret. With better understanding of the information coding and data processing involved, SC may be found applicable to other applications that involve interfacing stochastic circuits with natural neural networks. A particular advantage of SC in this domain is its very low power consumption, which is necessary to avoid heat damage to human tissue. So far, however, we know of no current work to incorporate SC into implantable medical devices.
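Both the correlation pitfall behind the squarer cascade and the equivalence of distinct Boolean functions are easy to reproduce in simulation. The sketch below is illustrative code, not from the paper; it assumes one common squarer design that ANDs a stream with a one-cycle-delayed (isolated) copy of itself. Cascading two such squarers ANDs three consecutive bits, so the result tracks X³ rather than X⁴, while the MUX function f1 and the majority function f2, though different Boolean functions, both decode to the scaled sum (X1 + X2)/2.

```python
import random

def sn(p, n, rng):
    """Unipolar stochastic bit-stream of length n with P(1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def value(s):
    """Decode a stream as the fraction of 1s."""
    return sum(s) / len(s)

def squarer(s):
    """Isolator-based squarer: AND each bit with a one-cycle-delayed copy.
    On a fresh stream this computes X^2 because adjacent bits are independent."""
    return [a & b for a, b in zip(s[1:], s[:-1])]

rng = random.Random(2)
n = 200_000
X = 0.8
x = sn(X, n, rng)

once = squarer(x)      # ~ X^2, as intended
twice = squarer(once)  # cascade: x(t) & x(t-1) & x(t-2), so ~ X^3, not X^4

print(f"squarer(x) = {value(once):.3f} (X^2 = {X**2:.3f})")
print(f"cascade    = {value(twice):.3f} (X^3 = {X**3:.3f}, X^4 = {X**4:.3f})")

# f1 (MUX) and f2 (majority) are different Boolean functions with the same
# stochastic behavior: both compute (X1 + X2)/2 when R1 = 1/2.
x1, x2, r1 = sn(0.3, n, rng), sn(0.9, n, rng), sn(0.5, n, rng)
f1 = [(a & r) | (b & (1 - r)) for a, b, r in zip(x1, x2, r1)]
f2 = [(a & r) | (b & r) | (a & b) for a, b, r in zip(x1, x2, r1)]
print(f"MUX = {value(f1):.3f}, MAJ = {value(f2):.3f}, target = 0.600")
```

The cascade fails because the second squarer's two inputs share the bit x(t−1), exactly the kind of hidden correlation that an accuracy manager would need to track.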
REFERENCES

[1] J. M. de Aguiar and S. P. Khatri, “Exploring the viability of stochastic computing,” in Proc. Int. Conf. Comput. Design (ICCD), New York, NY, USA, 2015, pp. 391–394.
[2] A. Alaghi and J. P. Hayes, “A spectral transform approach to stochastic circuits,” in Proc. Int. Conf. Comput. Design (ICCD), Montreal, QC, Canada, 2012, pp. 315–321.
[3] A. Alaghi and J. P. Hayes, “Survey of stochastic computing,” ACM Trans. Embedded Comput. Syst., vol. 12, no. 2, pp. 1–19, May 2013.
[4] A. Alaghi, C. Li, and J. P. Hayes, “Stochastic circuits for real-time image-processing applications,” in Proc. Design Autom. Conf. (DAC), Austin, TX, USA, 2013, pp. 1–6.
[5] A. Alaghi and J. P. Hayes, “Exploiting correlation in stochastic circuit design,” in Proc. Int. Conf. Comput. Design (ICCD), Asheville, NC, USA, Oct. 2013, pp. 39–46.
[6] A. Alaghi and J. P. Hayes, “Fast and accurate computation using stochastic circuits,” in Proc. Design Autom. Test Europe Conf. (DATE), Dresden, Germany, 2014, pp. 1–4.
[7] A. Alaghi and J. P. Hayes, “STRAUSS: Spectral transform use in stochastic circuit synthesis,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 34, no. 11, pp. 1770–1783, Nov. 2015.
[8] A. Alaghi and J. P. Hayes, “Dimension reduction in statistical simulation of digital circuits,” in Proc. Symp. Theory Model. Simulat. (TMS DEVS), Alexandria, VA, USA, 2015, pp. 1–8.
[9] A. Alaghi, W.-T. J. Chan, J. P. Hayes, A. B. Kahng, and J. Li, “Optimizing stochastic circuits for accuracy-energy tradeoffs,” in Proc. ICCAD, Austin, TX, USA, 2015, pp. 178–185.
[10] M. Alawad and M. Lin, “Stochastic-based deep convolutional networks with reconfigurable logic fabric,” IEEE Trans. Multi-Scale Comput. Syst., vol. 2, no. 4, pp. 242–256, Oct./Dec. 2016.
[11] A. Ardakani, F. Leduc-Primeau, N. Onizawa, T. Hanyu, and W. J. Gross, “VLSI implementation of deep neural network using integral stochastic computing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 10, pp. 2688–2699, Oct. 2017.
[12] S. L. Bade and B. L. Hutchings, “FPGA-based stochastic neural networks-implementation,” in Proc. IEEE Workshop FPGAs Custom Comput. Mach., Napa County, CA, USA, 1994, pp. 189–198.
[13] D. Braendler, T. Hendtlass, and P. O’Donoghue, “Deterministic bit-stream digital neurons,” IEEE Trans. Neural Netw., vol. 13, no. 6, pp. 1514–1525, Nov. 2002.
[14] B. D. Brown and H. C. Card, “Stochastic neural computation I: Computational elements,” IEEE Trans. Comput., vol. 50, no. 9, pp. 891–905, Sep. 2001.
[15] B. D. Brown and H. C. Card, “Stochastic neural computation II: Soft competitive learning,” IEEE Trans. Comput., vol. 50, no. 9, pp. 906–920, Sep. 2001.
[16] A. W. Burks, H. H. Goldstine, and J. Von Neumann, Preliminary Discussion of the Logical Design of an Electronic Computing Instrument. Princeton, NJ, USA: Inst. Adv. Study, Jan. 1946.
[17] V. Canals, A. Morro, A. Oliver, M. L. Alomar, and J. L. Rosselló, “A new stochastic computing methodology for efficient neural network implementation,” IEEE Trans. Neural Netw. Learn. Syst., vol. 27, no. 3, pp. 551–564, Mar. 2016.
[18] Y.-N. Chang and K. K. Parhi, “Architectures for digital filters using stochastic computing,” in Proc. Int. Conf. Acoust. Speech Signal Process. (ICASSP), Vancouver, BC, Canada, 2013, pp. 2697–2701.
[19] T.-H. Chen and J. P. Hayes, “Equivalence among stochastic logic circuits and its application,” in Proc. Design Autom. Conf. (DAC), San Francisco, CA, USA, 2015, pp. 131–136.
[20] T.-H. Chen and J. P. Hayes, “Design of division circuits for stochastic computing,” in Proc. IEEE Symp. VLSI (ISVLSI), Pittsburgh, PA, USA, 2016, pp. 116–121.
[21] V. K. Chippa, S. Venkataramani, K. Roy, and A. Raghunathan, “StoRM: A stochastic recognition and mining processor,” in Proc. Int. Symp. Low Power Electron. Design (ISLPED), 2014, pp. 39–44.
[22] S. S. Choi, S. H. Cha, and C. Tappert, “A survey of binary similarity and distance measures,” J. Syst. Cybern. Informat., vol. 8, no. 1, pp. 43–48, 2010.
[23] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training deep neural networks with binary weights during propagations,” in Proc. Int. Conf. Neural Inf. Process. Syst. (NIPS), Montreal, QC, Canada, 2015, pp. 3123–3131.
[24] J. A. Dickson, R. D. McLeod, and H. C. Card, “Stochastic arithmetic implementations of neural networks with in situ learning,” in Proc. Int. Conf. Neural Netw., San Francisco, CA, USA, 1993, pp. 711–716.
[25] Y. Ding, Y. Wu, and W. Qian, “Generating multiple correlated probabilities for MUX-based stochastic computing architecture,” in Proc. ICCAD, San Jose, CA, USA, 2014, pp. 519–526.
[26] Q. T. Dong, M. Arzel, C. Jego, and W. J. Gross, “Stochastic decoding of turbo codes,” IEEE Trans. Signal Process., vol. 58, no. 12, pp. 6421–6425, Dec. 2010.
[27] D. Fick, G. Kim, A. Wang, D. Blaauw, and D. Sylvester, “Mixed-signal stochastic computation demonstrated in an image sensor with integrated 2D edge detection and noise filtering,” in Proc. IEEE Custom Integr. Circuits Conf. (CICC), San Jose, CA, USA, 2014, pp. 1–4.
[28] B. R. Gaines, “Stochastic computing,” in Proc. AFIPS Spring Joint Comput. Conf., 1967, pp. 149–156.
[29] B. R. Gaines, “Stochastic computing systems,” in Advances in Information Systems Science, vol. 2, J. T. Tou, Ed. Boston, MA, USA: Springer-Verlag, 1969, pp. 37–172.
[30] V. C. Gaudet and A. C. Rapley, “Iterative decoding using stochastic computation,” Electron. Lett., vol. 39, no. 3, pp. 299–301, Feb. 2003.
[31] W. Gerstner and W. M. Kistler, Spiking Neuron Models. Cambridge, U.K.: Cambridge Univ. Press, 2002.
[32] S. W. Golomb, Shift Register Sequences. Laguna Hills, CA, USA: Aegean Park Press, 1982.
[33] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 2002.
[34] P. K. Gupta and R. Kumaresan, “Binary multiplication with PN sequences,” IEEE Trans. Acoust., Speech, Signal Process., vol. 36, no. 4, pp. 603–606, Apr. 1988.
[35] J. Han et al., “A stochastic computational approach for accurate and efficient reliability evaluation,” IEEE Trans. Comput., vol. 63, no. 6, pp. 1336–1350, Jun. 2014.
[36] H. Ichihara, T. Sugino, S. Ishii, T. Iwagaki, and T. Inoue, “Compact and accurate digital filters based on stochastic computing,” IEEE Trans. Emerg. Topics Comput., to be published. [Online]. Available: http://ieeexplore.ieee.org/document/7565493/
[37] A. K. Jain, J. Mao, and K. M. Mohiuddin, “Artificial neural networks: A tutorial,” Computer, vol. 29, no. 3, pp. 31–44, Mar. 1996.
[38] D. Jenson and M. Riedel, “A deterministic approach to stochastic computation,” in Proc. Int. Conf. Comput.-Aided Design (ICCAD), Austin, TX, USA, 2016, pp. 1–8.
[39] Y. Ji, F. Ran, C. Ma, and D. J. Lilja, “A hardware implementation of a radial basis function neural network using stochastic logic,” in Proc. Design Autom. Test Europe Conf. (DATE), Grenoble, France, 2015, pp. 880–883.
[40] H. Jiang, C. Shen, P. Jonker, F. Lombardi, and J. Han, “Adaptive filter design using stochastic circuits,” in Proc. IEEE Symp. VLSI (ISVLSI), Pittsburgh, PA, USA, 2016, pp. 122–127.
[41] K. Kim et al., “Dynamic energy-accuracy trade-off using stochastic computing in deep neural networks,” in Proc. Design Autom. Conf. (DAC), Austin, TX, USA, 2016, Art. no. 124.
[42] K. Kim, J. Lee, and K. Choi, “Approximate de-randomizer for stochastic circuits,” in Proc. Int. SoC Design Conf., 2015, pp. 123–124.
[43] P. Knag, W. Lu, and Z. Zhang, “A native stochastic computing architecture enabled by memristors,” IEEE Trans. Nanotechnol., vol. 13, no. 2, pp. 283–293, Mar. 2014.
[44] D. E. Knuth, The Art of Computer Programming, vol. 2: Seminumerical Algorithms, 2nd ed. Redwood City, CA, USA: Addison-Wesley, 1998.
[45] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436–444, May 2015.
[46] V. T. Lee, A. Alaghi, J. P. Hayes, V. Sathe, and L. Ceze, “Energy-efficient hybrid stochastic-binary neural networks for near-sensor computing,” in Proc. Design Autom. Test Europe Conf. (DATE), Lausanne, Switzerland, 2017, pp. 13–18.
[47] X.-R. Lee, C.-L. Chen, H.-C. Chang, and C.-Y. Lee, “A 7.92 Gb/s 437.2 mW stochastic LDPC decoder chip for IEEE 802.15.3c applications,” IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 62, no. 2, pp. 507–516, Feb. 2015.
[48] P. Li and D. J. Lilja, “Using stochastic computing to implement digital image processing algorithms,” in Proc. Int. Conf. Comput. Design (ICCD), Amherst, MA, USA, 2011, pp. 154–161.
[49] P. Li, D. J. Lilja, W. Qian, K. Bazargan, and M. Riedel, “The synthesis of complex arithmetic computation on stochastic bit streams using sequential logic,” in Proc. Int. Conf. Comput.-Aided Design (ICCAD), San Jose, CA, USA, 2012, pp. 480–487.
[50] P. Li, D. J. Lilja, W. Qian, K. Bazargan, and M. D. Riedel, “Computation on stochastic bit streams: Digital image processing case studies,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 22, no. 3, pp. 449–462, Mar. 2014.
[51] B. Li, M. H. Najafi, and D. J. Lilja, “Using stochastic computing to reduce the hardware requirements for a restricted Boltzmann machine classifier,” in Proc. Int. Symp. FPGA, Monterey, CA, USA, 2016, pp. 36–41.
[52] P. Li, W. Qian, M. D. Riedel, K. Bazargan, and D. J. Lilja, “The synthesis of linear finite state machine-based stochastic computational elements,” in Proc. Asia South Pac. Design Autom. Conf. (ASP-DAC), Sydney, NSW, Australia, 2012, pp. 757–762.
[53] Y. Liu and K. K. Parhi, “Architectures for stochastic normalized and modified lattice IIR filters,” in Proc. Asilomar Conf. Signals Syst. Comput., Pacific Grove, CA, USA, 2015, pp. 1351–1381.
[54] Y. Liu and K. K. Parhi, “Architectures for recursive digital filters using stochastic computing,” IEEE Trans. Signal Process., vol. 64, no. 14, pp. 3705–3718, Jul. 2016.
[55] G. G. Lorentz, Bernstein Polynomials, 2nd ed. New York, NY, USA: AMS Chelsea, 1986.
[56] K. Ma et al., “Architecture exploration for ambient energy harvesting nonvolatile processors,” in Proc. Int. Symp. High Perform. Comput. Archit. (HPCA), Burlingame, CA, USA, 2015, pp. 526–537.
[57] W. Maass and C. M. Bishop, Eds., Pulsed Neural Networks. Cambridge, MA, USA: MIT Press, 1999.
[58] R. Manohar, “Comparing stochastic and deterministic computing,” IEEE Comput. Archit. Lett., vol. 14, no. 2, pp. 119–122, Jul./Dec. 2015.
[59] S. L. T. Marin, J. M. Q. Reboul, and L. G. Franquelo, “Digital stochastic realization of complex analog controllers,” IEEE Trans. Ind. Electron., vol. 49, no. 5, pp. 1101–1109, Oct. 2002.
[60] P. Mars and W. J. Poppelbaum, Stochastic and Deterministic Averaging Processors. London, U.K.: Peter Peregrinus, 1981.
[61] S.-J. Min, E.-W. Lee, and S.-I. Chae, “A study on the stochastic computation using the ratio of one pulses and zero pulses,” in Proc. Int. Symp. Circuits Syst. (ISCAS), London, U.K., 1994, pp. 471–474.
[62] B. Moons and M. Verhelst, “Energy-efficiency and accuracy of stochastic computing circuits in emerging technologies,” IEEE J. Emerg. Sel. Topics Power Electron., vol. 4, no. 4, pp. 475–486, Dec. 2014.
[63] A. Naderi, S. Mannor, M. Sawan, and W. J. Gross, “Delayed stochastic decoding of LDPC codes,” IEEE Trans. Signal Process., vol. 59, no. 11, pp. 5617–5626, Nov. 2011.
[64] M. H. Najafi et al., “Time-encoded values for highly efficient stochastic circuits,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 25, no. 5, pp. 1644–1657, May 2017.
[65] M. H. Najafi, D. J. Lilja, M. Riedel, and K. Bazargan, “Polysynchronous stochastic circuits,” in Proc. Asia South Pac. Design Autom. Conf. (ASP-DAC), Macau, China, 2016, pp. 492–498.
[66] M. H. Najafi and D. J. Lilja, “High-speed stochastic circuits using synchronous analog pulses,” in Proc. Asia South Pac. Design Autom. Conf. (ASP-DAC), 2017, pp. 481–487.
[67] M. H. Najafi and M. E. Salehi, “A fast fault-tolerant architecture for Sauvola local image thresholding algorithm using stochastic computing,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 24, no. 2, pp. 808–812, Feb. 2016.
[68] N. Nedjah and L. de Macedo Mourelle, “Stochastic reconfigurable hardware for neural networks,” in Proc. Euromicro Conf. Digit. Syst. Design (DSD), 2003, pp. 438–442.
[69] F. Neugebauer, I. Polian, and J. P. Hayes, “Framework for quantifying and managing accuracy in stochastic circuit design,” in Proc. Design Autom. Test Europe Conf. (DATE), Lausanne, Switzerland, 2017, pp. 1–6.
[70] A. V. Oppenheim, A. S. Willsky, and S. H. Nawab, Signals & Systems, 2nd ed. Upper Saddle River, NJ, USA: Prentice-Hall, 1996.
[71] B. Parhami and C.-H. Yeh, “Accumulative parallel counters,” in Proc. Asilomar Conf. Signals Syst. Comput., Pacific Grove, CA, USA, 1995, pp. 966–970.
[72] K. K. Parhi and Y. Liu, “Architectures for IIR digital filters using stochastic computing,” in Proc. Int. Symp. Circuits Syst. (ISCAS), Melbourne, VIC, Australia, 2014, pp. 373–376.
[73] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. San Mateo, CA, USA: Morgan Kaufmann, 1988.
[74] W. J. Poppelbaum, C. Afuso, and J. W. Esch, “Stochastic computing elements and systems,” in Proc. AFIPS Fall Joint Comput. Conf., Anaheim, CA, USA, 1967, pp. 635–644.
[75] W. Qian, “Digital yet deliberately random: Synthesizing logical computation on stochastic bit streams,” Ph.D. dissertation, Dept. Elect. Comput. Eng., Univ. Minnesota, Minneapolis, MN, USA, 2011.
[76] W. Qian, X. Li, M. D. Riedel, K. Bazargan, and D. J. Lilja, “An architecture for fault-tolerant computation with stochastic logic,” IEEE Trans. Comput., vol. 60, no. 1, pp. 93–105, Jan. 2011.
[77] W. Qian and M. D. Riedel, “The synthesis of robust polynomial arithmetic with stochastic logic,” in Proc. Design Autom. Conf. (DAC), Anaheim, CA, USA, 2008, pp. 648–653.
[78] W. Qian, M. D. Riedel, and I. Rosenberg, “Uniform approximation and Bernstein polynomials with coefficients in the unit interval,” Eur. J. Combinatorics, vol. 32, no. 3, pp. 448–463, 2011.
[79] W. Qian, M. D. Riedel, H. Zhou, and J. Bruck, “Transforming probabilities with combinational logic,” IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., vol. 30, no. 9, pp. 1279–1292, Sep. 2011.
[80] A. Ren et al., “SC-DCNN: Highly-scalable deep convolutional neural network using stochastic computing,” in Proc. Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), Xi’an, China, 2017, pp. 405–418.
[81] T. J. Richardson and R. L. Urbanke, “The capacity of low-density parity-check codes under message-passing decoding,” IEEE Trans. Inf. Theory, vol. 47, no. 2, pp. 599–618, Feb. 2001.
[82] N. Saraf, K. Bazargan, D. J. Lilja, and M. D. Riedel, “Stochastic functions using sequential logic,” in Proc. Int. Conf. Comput. Design (ICCD), Asheville, NC, USA, 2013, pp. 507–510.
[83] N. Saraf, K. Bazargan, D. J. Lilja, and M. Riedel, “IIR filters using stochastic arithmetic,” in Proc. Design Autom. Test Europe Conf. (DATE), Dresden, Germany, 2014, pp. 1–6.
[84] N. Saraf and K. Bazargan, “Polynomial arithmetic using sequential stochastic logic,” in Proc. Great Lakes Symp. VLSI (GLSVLSI), Boston, MA, USA, 2016, pp. 245–250.
[85] G. Sarkis and W. J. Gross, “Efficient stochastic decoding of non-binary LDPC codes with degree-two variable nodes,” IEEE Commun. Lett., vol. 16, no. 3, pp. 389–391, Mar. 2012.
[86] G. Sarkis, S. Hemati, S. Mannor, and W. J. Gross, “Stochastic decoding of LDPC codes over GF(q),” IEEE Trans. Commun., vol. 61, no. 3, pp. 939–950, Mar. 2013.
[87] J. Sauvola and M. Pietikäinen, “Adaptive document image binarization,” Pattern Recognit., vol. 33, no. 2, pp. 225–236, 2000.
[88] I. Schur, “Über Potenzreihen, die im Innern des Einheitskreises beschränkt sind,” J. für die Reine und Angewandte Mathematik, vol. 147, pp. 205–232, 1917.
[89] S. S. Tehrani, S. Mannor, and W. J. Gross, “Fully parallel stochastic LDPC decoders,” IEEE Trans. Signal Process., vol. 56, no. 11, pp. 5692–5703, Nov. 2008.
[90] S. S. Tehrani et al., “Majority-based tracking forecast memories for stochastic LDPC decoding,” IEEE Trans. Signal Process., vol. 58, no. 9, pp. 4883–4896, Sep. 2010.
[91] P.-S. Ting and J. P. Hayes, “Stochastic logic realization of matrix operations,” in Proc. Euromicro Conf. Digit. Syst. Design (DSD), Verona, Italy, 2014, pp. 356–364.
[92] P.-S. Ting and J. P. Hayes, “Isolation-based decorrelation of stochastic circuits,” in Proc. Int. Conf. Comput. Design (ICCD), Scottsdale, AZ, USA, 2016, pp. 88–95.
[93] J. E. Tomberg and K. K. K. Kaski, “Pulse-density modulation technique in VLSI implementations of neural network algorithms,” IEEE J. Solid-State Circuits, vol. 25, no. 5, pp. 1277–1286, Oct. 1990.
[94] S. L. Toral, J. M. Quero, and L. G. Franquelo, “Stochastic pulse coded arithmetic,” in Proc. Int. Symp. Circuits Syst. (ISCAS), Geneva, Switzerland, 2000, pp. 599–602.
[95] D. E. Van Den Bout and T. K. Miller, III, “A digital architecture employing stochasticism for the simulation of Hopfield neural nets,” IEEE Trans. Circuits Syst., vol. 36, no. 5, pp. 732–738, May 1989.
[96] B. Yuan and K. K. Parhi, “Successive cancellation decoding of polar codes using stochastic computing,” in Proc. Int. Symp. Circuits Syst. (ISCAS), Lisbon, Portugal, 2015, pp. 3040–3043.
[97] B. Yuan and K. K. Parhi, “Belief propagation decoding of polar codes using stochastic computing,” in Proc. Int. Symp. Circuits Syst. (ISCAS), Montreal, QC, Canada, 2016, pp. 157–160.
[98] B. Yuan, Y. Wang, and Z. Wang, “Area-efficient error-resilient discrete Fourier transformation design using stochastic computing,” in Proc. Great Lakes Symp. VLSI (GLSVLSI), Boston, MA, USA, 2016, pp. 33–38.
[99] B. Yuan, Y. Wang, and Z. Wang, “Area-efficient scaling-free DFT/FFT design using stochastic computing,” IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 63, no. 12, pp. 1131–1135, Dec. 2016.
[100] D. Zhang and H. Li, “A stochastic-based FPGA controller for an induction motor drive with integrated neural network algorithms,” IEEE Trans. Ind. Electron., vol. 55, no. 2, pp. 551–561, Feb. 2008.
[101] Z. Zhao and W. Qian, “A general design of stochastic circuit and its synthesis,” in Proc. Design Autom. Test Europe Conf. (DATE), Grenoble, France, 2015, pp. 1467–1472.
Armin Alaghi (S’06–M’15) received the B.Sc. degree in electrical engineering and the M.Sc. degree in computer architecture from the University of Tehran, Tehran, Iran, in 2006 and 2009, respectively, and the Ph.D. degree from the Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, MI, USA, in 2015.

From 2005 to 2009, he was a Research Assistant with the Field-Programmable Gate-Array Laboratory and the Computer-Aided Design Laboratory, University of Tehran, where he researched field-programmable gate-array testing and network-on-chip testing. From 2009 to 2015, he was with the Advanced Computer Architecture Laboratory, University of Michigan. He is currently a Research Associate with the University of Washington, Seattle, WA, USA. His current research interests include digital system design, embedded systems, very large-scale integration circuits, computer architecture, and electronic design automation.

Weikang Qian (S’07–M’11) received the B.Eng. degree in automation from Tsinghua University, Beijing, China, in 2006, and the Ph.D. degree in electrical engineering from the University of Minnesota, Minneapolis, MN, USA, in 2011.

He is an Assistant Professor with the University of Michigan–Shanghai Jiao Tong University Joint Institute, Shanghai Jiao Tong University, Shanghai, China. His current research interests include electronic design automation and digital design for emerging technologies.

Dr. Qian was a recipient of Best Paper Award Nominations at the 2009 International Conference on Computer-Aided Design and the 2016 International Workshop on Logic and Synthesis for his research works.

John P. Hayes (S’67–M’70–SM’81–F’85–LF’10) received the B.E. degree from the National University of Ireland, Dublin, Ireland, and the M.S. and Ph.D. degrees from the University of Illinois, Urbana–Champaign, Champaign, IL, USA, all in electrical engineering.

He participated in the design of the ILLIAC III computer at the University of Illinois. In 1970, he joined the Operations Research Group, Shell Benelux Computing Center, The Hague, The Netherlands, where he researched mathematical programming and software development. From 1972 to 1982, he was a Faculty Member with the Department of Electrical Engineering and the Department of Systems and Computer Science, University of Southern California, Los Angeles, CA, USA. Since 1982, he has been with the Electrical Engineering and Computer Science Department, University of Michigan, Ann Arbor, MI, USA, where he holds the Claude E. Shannon Chair in Engineering Science. He has authored over 300 technical papers, several patents, and seven books, including Computer Architecture and Organization (3rd ed., 1998) and Design, Analysis and Test of Logic Circuits Under Uncertainty (2012). His current research interests include computer-aided design, verification, and testing, very large-scale integration circuits, computer architecture, and unconventional computing systems.

Prof. Hayes was a recipient of the University of Michigan’s Distinguished Faculty Achievement Award in 1999, the Alexander von Humboldt Foundation’s Research Award in 2004, the IEEE Lifetime Contribution Medal for outstanding contributions to test technology in 2013, and the ACM Pioneering Achievement Award for contributions to logic design, fault-tolerant computing, and testing in 2014. He has served as an Editor for the Communications of the ACM and the IEEE Transactions on Parallel and Distributed Systems. He was elected as a Fellow of the ACM in 2001.