Stochastic Computing: Techniques and Applications

Warren J. Gross • Vincent C. Gaudet, Editors

Warren J. Gross
Department of Electrical and Computer Engineering
McGill University
Montréal, QC, Canada

Vincent C. Gaudet
ECE Department
University of Waterloo
Waterloo, ON, Canada

This Springer imprint is published by the registered company Springer Nature Switzerland AG.
The registered company address is: Gewerbestrasse 11, 6330 Cham, Switzerland
To Amy, Ester, Margaret, and Katherine,
who inspire us at every step
Foreword
mainframe computers and patented a number of techniques. Ravi made the research investigation a thrilling ride that has sustained my interest in the subject over the years.
This area has developed both in depth of insight and in breadth of application due
to the efforts of many researchers globally over the years, including many important
contributions by Profs. Gross and Gaudet and their respective graduate students and
research collaborators.
Now is an exciting time to reacquaint yourself with the latest developments in the field of stochastic computing, and this book is the best place to start.
Preface

Stochastic computing has fascinated both of us for several years. After a wave of publications in the 1960s, the area largely fell silent until the early 2000s. Although we had both been introduced to the area by our doctoral advisor, Glenn Gulak,
it really came into our consciousness after seeing a publication by Howard Card
and one of his colleagues in a 2001 issue of the IEEE Transactions on Computers.
Since then, stochastic computing has come back to life, especially in terms of
its application to error-control decoding (and, more broadly, in belief propagation
networks), image processing, and neural networks.
This book presents a contemporary view of the field
of stochastic computing, as seen from the perspective of its leading researchers. We
wish to thank Kenneth C. Smith and Brian Gaines, both retired faculty members,
who worked in the area during their illustrious careers and who have shared with
us their memories of the previous boom in activity in stochastic computing in the
1960s.
There are three main parts to this book. The first part, comprising Chaps. 1 to
3, provides a history of the technical developments in stochastic computing and a
tutorial overview of the field for both novice and seasoned stochastic computing
researchers. In the second part, comprising Chaps. 4 to 9, we review both well-
established and emerging design approaches for stochastic computing systems,
with a focus on accuracy, correlation, sequence generation, and synthesis. The last
part, comprising Chaps. 10 and 11, provides insights into applications in machine
learning and error-control coding.
We hope you enjoy reading the book. May you discover new ways to make
computation more efficient and effective and, above all, exciting!
Contents
Contributors
Xiaotao Jia Fert Beijing Institute, BDBC and School of Microelectronics, Beihang
University, Beijing, China
Phil Knag University of Michigan, Ann Arbor, MI, USA
François Leduc-Primeau École Polytechnique de Montréal, Montréal, QC,
Canada
Vincent T. Lee University of Washington, Seattle, WA, USA
Wei Lu University of Michigan, Ann Arbor, MI, USA
Naoya Onizawa Tohoku University, Sendai, Miyagi, Japan
Weikang Qian Shanghai Jiao Tong University, Shanghai, China
Yuanzhuo Qu Department of Electrical and Computer Engineering, University of Alberta, Edmonton, AB, Canada
Marc Riedel University of Minnesota, Minneapolis, MN, USA
Kenneth C. Smith University of Toronto, Toronto, ON, Canada
Paishun Ting University of Michigan, Ann Arbor, MI, USA
You Wang Fert Beijing Institute, BDBC and School of Microelectronics, Beihang
University, Beijing, China
Chris Winstead Department of Electrical and Computer Engineering, UMC 4120,
Utah State University, Logan, UT, USA
Jianlei Yang Fert Beijing Institute, BDBC and School of Computer Science and
Engineering, Beihang University, Beijing, China
Yue Zhang Fert Beijing Institute, BDBC and School of Microelectronics, Beihang
University, Beijing, China
Zhengya Zhang University of Michigan, Ann Arbor, MI, USA
Weisheng Zhao Fert Beijing Institute, BDBC and School of Microelectronics,
Beihang University, Beijing, China
Abbreviations
Introduction to Stochastic Computing

Abstract In addition to opening the book, this chapter serves as a brief overview of the historical advances in stochastic computing. We highlight four distinct eras: early work from the 1950s, fundamental work in the 1960s, an era of
slower progress in the 1970s, 1980s, and 1990s, and a return to prominence in the
2000s.
Introduction
In addition to opening the book, this chapter serves as an overview of the historical
advances in stochastic computing. We highlight four distinct eras in its evolution:
Section “The Early Years (1950s)” describes early work mainly led by John von
Neumann in the 1950s; Section “Exciting Times (1960s)” describes the fundamental
advances from the 1960s that form the basis for what we know today as stochastic
computing; Section "Into the Darkness (1970s, 1980s and 1990s)" discusses the decline in stochastic computing research from the 1970s through the 1990s and provides
some insights into why this may have happened; then, Section “Rebirth (2000s
and Beyond)” points to advances from the past 20 years that brought stochastic
computing back to life. Finally, in Section “Overview of the Book” we provide a
high-level overview of the rest of this textbook.
V. C. Gaudet
University of Waterloo, Waterloo, ON, Canada
e-mail: vcgaudet@uwaterloo.ca
W. J. Gross
McGill University, Montréal, QC, Canada
e-mail: warren.gross@mcgill.ca
K. C. Smith
University of Toronto, Toronto, ON, Canada
Fig. 2 Multiplication circuit proposed in [1] and based on the Sheffer stroke, redrawn for clarity
In work [6] sponsored by the Office of Naval Research (ONR), Poppelbaum had identified as early
as 1963 the need in stochastic systems for multiple independent random-sequence
generators and had done initial experiments using avalanche diodes for this purpose, performed by his graduate student Chushin Afuso [7] (then working as a research assistant on topics unrelated to his thesis). Subsequently, Afuso realized that a
system using a single random generator was possible, which he referred to as a
Synchronous Random Pulse Sequence (SRPS) system (described in J.W. Esch’s
doctoral thesis [8]).
Chushin Afuso
John William Esch (born in 1942) graduated in Electrical Engineering from the University of Wisconsin, Madison, with a variety of experiences: he had worked as a programmer in the College of Letters and Science and as an electronics technician for the Space and Astronomy Laboratory, and later participated in a NASA grant with the Institute of Space Physics at Columbia University. In 1965, he began
working on his Master’s in Electrical Engineering at the University of Illinois
with a research assistantship from the Department of Computer Science under
Poppelbaum. In 1967, he received his Master’s degree and then contributed to the
1967 paper with Poppelbaum and Afuso [6]. In 1969, he completed his Ph.D. thesis
titled “RASCEL – A Programmable Analog Computer Based on a Regular Array of
Stochastic Computing Element Logic” [8], which described an elaborate hardware
implementation of a stochastic computer. Subsequently, he was employed by Sperry
then Unisys in the software area.
Sergio Telles Ribeiro (born in 1933 in Brazil) completed his Master's degree at the University of Illinois in 1961 on BJT saturation, and his Ph.D., titled "Phase Plane Theory of Transistor Bistable Circuits" [12], under Poppelbaum in
1963. He notably published a paper on pulsed-data hybrid computers in 1964 [13].
By the time of his publication on “Random-Pulse Machines” in 1967 [14], he was
working for Carson Laboratories, Bristol, Connecticut. In his 1967 paper [14], an
intriguing footnote indicates that in 1963 he had recognized the potential importance
of stochastic computing as described by von Neumann in 1956 [2], and encouraged
the Poppelbaum group to pursue research in the area, reporting Afuso’s work on
stochastic generators documented in a 1965 progress report.
Brian Gaines
Brian R. Gaines (1938–) began his technical career as a polymath at the age of 12, fascinated by science, including math, physics, chemistry, and philosophy. Fortunately, his father did not share the passion for chemistry and requested a departure from it. Thus, Brian redirected his intense interest to electronics, building various measuring instruments, including his own oscilloscope. This hobby led to
early employment following high school at the Standard Telephone Cables Ltd.
(STC) in the semiconductor laboratory, a connection which he maintained for
many years. Subsequently, during his time at Cambridge taking math, his talents in
electronics allowed him to work as an electronics assistant for a professor working
in applied psychology. Intrigued by this work in cognitive psychology, he considered pursuing an advanced degree in psychology, which required an appropriate first degree that he subsequently completed. Later, with all of his
degrees from Cambridge, and as a chartered engineer and chartered psychologist,
he led a diverse career including chairmanship of Electrical Engineering Science at
Essex, Professor of Industrial Engineering at the University of Toronto, and finally
Killam Memorial Research Professor at the University of Calgary, from which he
retired in the late 1990s to Vancouver Island. Along the way, Gaines’ work in human
learning and the potential for machine intelligence led to his interest in what became
stochastic computing, publishing many of the seminal works [15–18].
Reed Lawlor
1967 was an unusual year! Besides being the 100th anniversary of the founding
of Canada (marked by hosting a hugely successful world fair called Expo 67
in Montreal), it was also a milestone in the publication of papers in stochastic
computing.
Brian Gaines (1938–), born in the United Kingdom and working at Standard Telecommunication Laboratories, published "Stochastic Computing"; Ted Poppelbaum (1924–1993), born in Germany and working at the University of Illinois, Urbana, published (with Afuso and Esch) "Stochastic Computing Elements and Systems"; Sergio Telles Ribeiro (1933–), born in Brazil and working at Carson Laboratories, Connecticut, published "Random-Pulse Machines".
While Gaines’ and Popplebaum’s work was done independently, Ribeiro
obtained his Ph.D. under Poppelbaum in 1963, on a different topic (phase
plane theory of transistor bistable circuits), but soon started publishing in related
topics.
The 1970s started off well for stochastic computing. Poppelbaum continued his
work in the area [23, 24]. Kenneth C. Smith, by then at the University of Toronto,
supervised Gary Black’s doctoral thesis on random pulse-density modulation [25].
There was also some discussion at the International Symposium on Multiple-Valued
Logic (ISMVL), including a panel that included Gaines, Poppelbaum, and Smith at
the 1976 ISMVL in Logan, Utah [26]. Finally, a dedicated conference on stochastic
computing and its applications was held in 1978 in Toulouse, France [27]; it appears
to have been well-received, with a 420-page conference record. However, it was to
be the only conference on stochastic computing to be held until 2016.
Progress on stochastic computing became more sporadic in the late 1970s
and 1980s. In many ways, the continued progress in semiconductor scaling, and the high performance it afforded, gave more traditional digital approaches a competitive edge over stochastic computing.
However, by the 1990s, there appeared to be a slight upsurge in interest,
fueled by the advent of field-programmable gate arrays (FPGAs). In his Master’s
thesis completed under Glenn Gulak at the University of Toronto, Ravi Ananth
proposed field-programmable stochastic processors [28]. There was also work
on sequence generation [29]. The 1990s also saw significant interest in efficient
hardware implementations of artificial neural networks. There were several papers
on digit-serial approaches including pulse-width modulation (PWM) [30] and pulse-
density modulation (PDM) [31–33]. The latter can be seen as a form of stochastic
computing, as pointed out by Howard Card and his colleagues [34, 35].
Indeed, Howard Card, by then a Distinguished Professor at the University of
Manitoba, was one of the instigators of the renewed interest in stochastic computing.
Howard Card (1947–2006) earned his B.Sc. and M.Sc. degrees at the University of
Manitoba and then went on to complete a Ph.D. at the University of Manchester in
1971. He subsequently became an Assistant Professor at the University of Waterloo,
and then an Associate Professor at Columbia University. He served as a consultant
for the IBM T.J. Watson Research Center and was an instructor at AT&T Bell Labs.
He returned to the University of Manitoba in 1980 to take on a Professorship in
Electrical Engineering, and was eventually honoured as a Distinguished Professor.
He profoundly cared about his students, and in 1983 he received the Olive Beatrice
Stanton Award for Excellence in Teaching. Card believed that “if a thing is worth
doing, it’s worth doing to excess.” That mantra characterized all aspects of his life,
including his mentorship, teaching and research.
shifts its focus towards other sources of randomness, such as timing variations, and
then introduces us to post-CMOS approaches. This serves as a transition to chap-
ters “RRAM Solutions for Stochastic Computing” and “Spintronic Solutions for
Stochastic Computing”, which look into the post-CMOS world. Chapter “RRAM
Solutions for Stochastic Computing”, by Zhengya Zhang and his colleagues Phil
Knag, Siddharth Gaba, and Wei Lu, looks at memristive devices and their inherent
randomness as a way to generate stochastic sequences. Then, in chapter “Spintronic
Solutions for Stochastic Computing”, Jie Han and his colleagues Xiaotao Jia, You
Wang, Zhe Huang, Yue Zhang, Jianlei Yang, Yuanzhuo Qu, Bruce Cockburn, and
Weisheng Zhao, look at similar phenomena in spintronic devices such as magnetic
tunneling junctions. These last two chapters show that stochastic computing may have a significant impact in the future world of nanotechnology.
The last two chapters look into applications of stochastic computing. Chapter
“Brain-Inspired Computing”, by Naoya Onizawa and his colleagues Warren Gross
and Takahiro Hanyu, reports very compelling applications in the domain of machine
learning, postulating that we are entering the era of brainware LSI, or “BLSI.”
Finally, chapter "Stochastic Decoding of Error-Correcting Codes", by François Leduc-Primeau, Saied Hemati, Vincent Gaudet, and Warren Gross, reviews progress over the past 15 years in the area of stochastic decoding of error-control codes.
References
1. J. von Neumann, Lectures on probabilistic logics and the synthesis of reliable organisms from
unreliable components, California Institute of Technology, notes by R. S. Pierce, Jan. 1952.
2. J. von Neumann, “Probabilistic logics and the synthesis of reliable organisms from unreliable
components,” Automata Studies, pp. 43-98, 1956.
3. P. Elias, “Computation in the Presence of Noise,” IBM Journal of Research and Development,
vol. 2, no. 4, pp. 346-353, 1958.
4. A.A. Mullin, “Reliable Stochastic Sequential Switching Circuits,” Transactions of the Amer-
ican Institute of Electrical Engineers, Part I: Communication and Electronics, vol. 77, no. 5,
pp. 606-611, Nov. 1958.
5. J.D. Cowan, “Toward a Proper Logic for Parallel Computation in the Presence of Noise,”
Bionics Symposium, Dayton, OH,
6. W.J. Poppelbaum, C. Afuso, and J. W. Esch, “Stochastic Computing Elements and Systems,”
American Federation of Information Processing Societies, Fall Joint Computer Conference,
vol. 31, pp. 635-644, Books, Inc., New York, 1967.
7. W.J. Poppelbaum and C. Afuso, Noise Computer, University of Illinois, Dept. Computer
Science, Quarterly Technical Progress Reports, April 1965-January 1966.
8. J. Esch, Rascel – A Programmable Analog Computer Based on a Regular Array of Stochastic
Computing Element Logic, doctoral thesis, University of Illinois, 1969.
9. C. Afuso, “Two Wire System Computer Circuits Using Transistor Difference Amplifier,”
The Science Bulletin of the Division of Agriculture, Home Economics & Engineering, The
University of the Ryukyus, no. 9, pp. 308-321, Dec. 1962. Abstract Available: http://ir.lib.u-
ryukyu.ac.jp/handle/20.500.12000/23153. Accessed on Aug. 2, 2018.
10. C. Afuso, Analog Computing with Random Pulse Sequences, doctoral thesis, University of
Illinois report #255, 1968.
11. Y. Nagata and C. Afuso, “A Method of Test Pattern Generation for Multiple-Valued PLAs,”
International Symposium on Multiple-Valued Logic, pp. 87–91, 1993.
12. S.T. Ribeiro, Phase Plane Theory of Transistor Bistable Circuits, doctoral thesis, University of
Illinois, 1963.
13. S. T. Ribeiro, “Comments on Pulsed-Data Hybrid Computers,” IEEE Transactions on Elec-
tronic Computers, vol. EC-13, no. 5, pp. 640-642, Oct. 1964.
14. S. T. Ribeiro, “Random-Pulse Machines,” IEEE Transactions on Electronic Computers, vol.
EC-16, no. 3, pp. 261-276, June 1967.
15. B. R. Gaines, “Stochastic Computing,” American Federation of Information Processing
Societies, Spring Joint Computer Conference, vol. 30, pp. 149-156, Books, Inc., New York,
1967.
16. B. R. Gaines, “Techniques of Identification with the Stochastic Computer,” International
Federation of Automatic Control Symposium on Identification, Prague, June 1967.
17. B. R. Gaines, “Stochastic Computer Thrives on Noise,” Electronics, vol. 40, no. 14, pp. 72-79,
July 10, 1967.
18. B. R. Gaines, "Stochastic Computing," Encyclopaedia of Linguistics, Information and Control,
pp. 766-781, Pergamon Press, New York and London, 1968.
19. R.C. Lawlor, Computer Utilizing Random Pulse Trains, U.S. patent 3,612,845, priority date
July 5, 1968, granted Oct. 12, 1971.
20. G. White, “The Generation of Random-Time Pulses at an Accurately Known Mean Rate and
Having a Nearly Perfect Poisson Distribution,” Journal of Scientific Instruments, vol. 41, no.
6, p. 361, 1964.
21. R.C. Lawlor, “What Computers can do: Analysis and Prediction of Judicial Decisions,”
American Bar Association Journal, vol. 49, no. 4, pp. 337-344, April 1963.
22. W. Peakin, “Alexa, Guilty or Not Guilty?” posted Nov. 13, 2016. Available: <http://
futurescot.com/alexa-guilty-not-guilty/>, Accessed: August 31, 2018.
23. W.J. Poppelbaum, “Statistical Processors,” Advances in Computers, vol. 14, pp. 187–230,
1976.
24. P. Mars and W. J. Poppelbaum, Stochastic and Deterministic Averaging Processors, Peter
Peregrinus Press, 1981.
25. G.A. Black, Analog Computation Based on Random Pulse Density Modulation, doctoral thesis,
University of Toronto, 1974.
26. —, “Applications of Multiple-Valued Logic,” panel session at the International Symposium on
Multiple-Valued Logic, chair: M.S. Michalski, panelists: B.R. Gaines, S. Haack, T. Kitahashi,
W.J. Poppelbaum, D. Rine, and K.C. Smith, Logan, UT, May 1976.
27. —, First International Symposium on Stochastic Computing and Its Applications, Toulouse,
France, 420 pages, 1978.
28. R. Ananth, A Field Programmable Stochastic Computer for Signal Processing Applications,
Master of Applied Science thesis, University of Toronto, 1992.
29. P. Jeavons, D.A. Cohen, and J. Shawe-Taylor, "Generating Binary Sequences for Stochastic Computing," IEEE Transactions on Information Theory, vol. 40, pp. 716–720, 1994.
30. A.F. Murray, S. Churcher, A. Hamilton, et al., “Pulse Stream VLSI Neural Networks,” IEEE
Micro, vol. 13, no. 3, pp. 29-39.
31. Y. Hirai, "PDM Digital Neural Network System," in K. W. Przytula and V.K. Prasanna (eds.),
Parallel Digital Implementations of Neural Networks, pp. 283-311, Englewood Cliffs: Prentice
Hall, 1993.
32. J.E. Tomberg and K. Kaski, “Pulse Density Modulation Technique in VLSI Implementation of
Neural Network Algorithms," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1277-
1286, 1990.
33. L. Zhao, Random Pulse Artificial Neural Network Architecture, Master of Applied Science
Thesis, University of Ottawa, May 1998. Available: https://www.collectionscanada.gc.ca/obj/
s4/f2/dsk2/tape17/PQDD_0006/MQ36758.pdf. Accessed on August 2, 2018.
34. J.A. Dickson, R.D. McLeod, and H.C. Card, “Stochastic Arithmetic Implementations of Neural
Networks with In Situ Learning,” International Conference on Neural Networks, pp. 711-716,
1993.
35. H. Card, “Limits to Neural Computations in Digital Arrays,” Asilomar Conference on Signals,
Systems and Computers, vol. 2, pp. 1125-1129, 1997.
36. B.D. Brown and H.C. Card, “Stochastic Neural Computation. I. Computational Elements,”
IEEE Transactions on Computers, vol. 50, no. 9, pp. 891-905, 2001.
37. B.D. Brown and H.C. Card, "Stochastic Neural Computation. II. Soft Competitive Learning," IEEE Transactions on Computers, vol. 50, no. 9, pp. 906-920, 2001.
Origins of Stochastic Computing
Brian R. Gaines
Abstract In the early 1960s research groups at the University of Illinois, USA,
and Standard Telecommunication Laboratories (STL), UK, each independently
conceived of a constructive use of random noise to implement analog computers in
which the probability of a pulse in a digital pulse stream represented a continuous
variable. The USA group initially termed this a noise computer but shortly adopted
the UK terminology of stochastic computer. The target application of the USA
group was visual pattern recognition, and that of the UK group was learning
machines; both developed trial hardware implementations. However, as they
investigated applications they both came to recognize that the technology of their
era did not support stochastic computing systems that could compete with avail-
able computational technologies, and they moved on to develop other computing
architectures, some of which derived from the stochastic computing concepts.
Both groups published expositions of stochastic computing which provided a
comprehensive account of the technology, the architecture of its functional modules,
its potential applications and its then current limitations. These have become highly
cited in recent years as new technologies and issues have made stochastic computing
a competitive technology for a number of significant applications. This chapter
provides a historical analysis of the motivations of the pioneers and
how they arrived at the notion of stochastic computing.
B. R. Gaines
University of Victoria, Victoria, BC, Canada
University of Calgary, Calgary, AB, Canada
e-mail: gaines@uvic.ca; gaines@ucalgary.ca

Introduction

Stochastic computing was conceived independently by Sergio Ribeiro and Brian Gaines in the early 1960s. Ribeiro was a graduate student of Ted Poppelbaum in the Information
Engineering Laboratory (IEL) at the University of Illinois, Champaign, Illinois,
USA, and Gaines was a graduate student of Richard Gregory in the Department of
Experimental Psychology, Cambridge University, UK and also a consultant to John
Andreae’s learning machines group at Standard Telecommunications Laboratory
(STL), UK.
The US and UK groups both implemented digitally-based analog computers
using probabilistic pulse trains, the IEL group initially terming this a noise computer
but shortly adopting the terminology of the STL group, stochastic computer,
which became the common designation in later research. As both groups evaluated
applications of stochastic computing, for IEL primarily image processing and for
STL navigational aids and radar tracking, it became apparent that the stochastic
computer based on the digital circuitry then available was not competitive with
alternative techniques. They began to develop other computer architectures to
address those applications such as burst and bundle processing [58], and phase
computers [37] and microprogrammed computers [21], respectively.
Both groups published extensively on stochastic computing in the late 1960s
[24, 26, 30, 61, 68], which stimulated research in other groups world-wide, and many of those publications continue to be widely cited in the current renaissance
of stochastic computing as they provide tutorial material on the fundamentals
and the commonly adopted terminology for stochastic computer components,
representations and applications. They also contain critical commentaries on the
strengths and weaknesses of stochastic computing which are still applicable today.
Ted Poppelbaum
When I was asked to contribute a foreword to this collection of articles on the current
state of the art in stochastic computing and its applications, my first reaction was
sorrow that Ted Poppelbaum was no longer available to co-author it with me. Ted
died in 1993 at the age of 68 and did not live to see the massive resurgence of
stochastic computing research in the past decade.
Wolfgang (Ted) Johan Poppelbaum was born in Germany in 1924 and studied
Physics and Mathematics at the University of Lausanne from 1944 to 1953. In 1954
he joined the Solid State Research Group under Bardeen at the University of Illinois
and researched an electrolytic analog of a junction transistor. In 1955 he joined
the faculty of the Department of Computer Science and became a member of the
Digital Computer Laboratory developing the circuits for the ILLIAC II and later
computers. In 1960 he received a patent for his invention of the transistor flip-flop
storage module [59]. In 1972 he became Director of the Information Engineering
Laboratory and received a major Navy contract to support his research on statistical
computers and their applications. He retired in 1989.
Ted had many and varied projects in his laboratory. His 1973 report [57] on
the achievements and plans of the Information Engineering Laboratory summarizes
some 45 distinct projects during the post-ILLIAC II phase from 1964 to 1973. They are
grouped under the categories: Storage/Hybrid Techniques; Stochastic and Bundle
Processing; Displays and Electro-Optics; Communication/Coding; World Models
and Pattern Recognition; Electronic Prostheses.
Ted and I became aware of our common interests in stochastic computing in 1967
as we both commenced publishing about the research and he invited me to present a
paper on stochastic computing [18] at the IEEE Convention in March 1968 in New
York where he was organizing a session on New Ideas in Information Processing.
I also visited his laboratory, saw the many systems he had developed including the
Paramatrix image processor (Fig. 1) which was one of his target applications for
stochastic computing, and met John Esch who had built the RASCEL stochastic
computing system.
Ted and I found we had a common background in solid state electronics and
computer innovation, and discussed them at length as if we had been colleagues for
many years. I met with him again and introduced him to John Andreae and David
Hill at the IFIP conference in August 1968 in Edinburgh (Fig. 2). We kept in touch
intermittently and planned a joint book on stochastic computing but I had moved
on to other projects and introduced him to Phil Mars at Robert Gordon Institute
in Aberdeen who was actively pursuing stochastic computing research. They co-
published Stochastic and Deterministic Averaging Processors in 1981 [47].
Fig. 2 Three pioneers of computational intelligence: from left to right, John Andreae (learning
machines), David Hill (speech recognition), Ted Poppelbaum (stochastic computing in image
processing), IFIP Congress, August 1968, Edinburgh
Ted published several additional major articles that placed stochastic computing
in the context of other computing technologies, notably his surveys in Advances
in Computers in 1969 on what next in computer technology? [60], in 1976 on
statistical processors [58] and in 1987 on unary processing [62]. His 1972 textbook
on Computer Hardware Theory [56] that was widely used in engineering courses
includes a chapter on analog, hybrid and stochastic circuits.
Sergio Telles Ribeiro was born in Brazil in 1933, received an Engineering degree
there in 1957 and taught electronics at the Institute of Technology and Aeronautics.
In 1960 he received a fellowship from the Brazilian Government to study in the
USA and entered the University of Illinois, receiving his masters in 1961 and his
doctorate in 1963. His doctoral topic was a phase plane theory of transistor bistable
circuits [67] reflecting Ted’s continuing interest in the circuit he had invented and
its dynamic behavior that determined the speed and reliability of its operation.
After his doctorate Ribeiro continued as a research assistant working with
Ujhelyi on the electronic deflection [65] and intensity modulation [79] of laser
beams, and in 1964 they joined Carson Laboratories to pursue the industrial
applications of that research. In July 1966 he submitted a paper to the IEEE Transactions on Electronic Computers on random pulse machines [68] that has become one
of the classics in the stochastic computing literature.
It appears that Ribeiro’s research on study of the architecture and potential of
random pulse machines was theoretical. He notes in footnote 2 that “In the spring
of 1963 while working with Dr. W.J. Poppelbaum at the University of Illinois the
author suggested that a research program be undertaken to investigate theoretical
and practical aspects of random-pulse systems." He thanks Dr. Carson for his support of the writing of the paper, without implying that it was a project at Carson Laboratories.
Ribeiro had left Ted's laboratory before I visited, so I never met him, and I have not been able to trace any publications by him after a Carson Laboratories 1966 patent
for a display device based on his research with Ujhelyi [80]. There is no specific
information about how Ribeiro came to be interested in random pulse computing.
However, there is some strong circumstantial evidence that indicates how the notion
may have occurred to him.
In 1964 Ribeiro [66] published a correspondence item in the IEEE Computer
Transactions critiquing Schmid's [71] 1963 paper on providing analog-type computation with digital elements. He corrects some errors in Schmid's discussion,
suggests improvements in his implementation and then, whilst discussing the utility
of pulse rate computers, notes that studies of artificial neurons show that the
implementation could be simple. Ribeiro cites three papers on electronic neuron
models [7, 48, 50] from the Bionics session at the 1961 National Electronics Conference held in Chicago, about an hour away from Champaign, suggesting he may have attended that meeting, and a 1963 paper [44] from the IEEE Transactions on Biomedical Electronics suggesting he continued to follow the related literature.
However, none of the cited papers mention the notion that neurons had stochastic
behavior, which was common in the neurological literature going back at least to Lashley in 1942 [42, p. 311]. In 1962 Jenik [40, 41] showed that the rates of the
non-coherent pulse trains of two neurons afferent to a third were multiplied in its
efferent train. Ribeiro might have become aware of such analyses or he might have
considered the optoelectronic approximate multiplier described in one of the neuron
model papers [7] and realized that if the pulse streams were independent random
events then the output of an AND-gate would be the product of their generating
probabilities.
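To make the arithmetic behind that observation concrete, here is a minimal simulation sketch in Python (my illustration, not a circuit from the period; all names and parameters are mine): two independent Bernoulli bit-streams are ANDed bit by bit, and the fraction of 1s in the result approaches the product of the two generating probabilities.

```python
import random

rng = random.Random(42)
n = 100_000

def bernoulli_stream(p):
    """A stochastic bit-stream of length n whose bits are 1 with probability p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

a = bernoulli_stream(0.6)  # encodes the value 0.6
b = bernoulli_stream(0.5)  # encodes 0.5, generated independently of a

# For independent streams, P(a_i = 1 and b_i = 1) = 0.6 * 0.5, so a single
# AND gate per bit multiplies the two encoded values.
product = [x & y for x, y in zip(a, b)]
print(sum(product) / n)    # prints roughly 0.30
```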
In his 1967 paper Ribeiro mentions neurons in his abstract and index terms,
commences the introduction with a presentation of von Neumann's book The Computer and the Brain [81], discusses the neural analogy extensively throughout,
and has a Bionics subsection in his references with 12 citations. However, he does
not attribute his introduction of the notion of random pulses into Schmid's architecture to any specific material that he cites.
In 1964 Ted initiated a research program to study the computational potential
of random-pulse systems by making it the topic of Afuso’s doctoral research in
1964 and that of Esch in 1967. Chushin Afuso was born in 1933 in Japan, studied
for his masters at the University of Illinois in 1959–1960, and returned for his
doctorate in 1964–1968. He states that his 1968 dissertation, Analog computation
with random pulse sequences [1], is "a feasibility study of a stochastic computing system", taking the operations of an analog computer as his target and showing how multipliers, dividers, adders and subtractors may be implemented.
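As an illustration of how such elements work, the following is a hedged sketch of one classic stochastic adder (the multiplexer-based scaled adder that later became standard in the literature; Afuso's actual circuits are not reproduced here, and the code is my own reconstruction):

```python
import random

rng = random.Random(7)
n = 100_000

def stream(p):
    """A bit-stream of length n encoding the value p in [0, 1]."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

a, b = stream(0.2), stream(0.9)  # encode 0.2 and 0.9
sel = stream(0.5)                # unbiased select line

# A 2-to-1 multiplexer picks each output bit from a or b with probability
# 1/2, so the output encodes (0.2 + 0.9) / 2: addition scaled into [0, 1].
out = [x if s else y for x, y, s in zip(a, b, sel)]
print(sum(out) / n)              # prints roughly 0.55
```

The division by two is inherent to the technique: it keeps the result inside the [0, 1] range that a probability can represent.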
Fig. 3 John Esch presenting his RASCEL stochastic computer, Information Engineering Labora-
tory, University of Illinois, 1969
John W. Esch was born in the USA in 1942, studied for his masters at the
University of Illinois in 1965–1967, and for his doctorate in 1967–1969. He states
in his 1969 dissertation, RASCEL - A programmable analog computer based on a
regular array of stochastic computing element logic [11] (Fig. 3) that “in February
of 1967 this author joined Afuso and worked with him to extend the SRPS system to
a sign-magnitude number representation and to develop a programmable arithmetic
Brian Gaines
It should be easier to describe my own research and the influences on it, and
I do have some detailed recollections, but, after five decades, much has been
forgotten and I have had to go back to files of notes, documentation, reports, papers,
memoranda and correspondence from the 1960s that I have dragged around the
world for over 50 years but not previously opened—there were many surprises.
I was born in the UK in 1940, studied at Trinity College, Cambridge, from 1959
to 1967 for my bachelors in mathematics and theoretical physics and doctorate in
psychology. Electronics became my primary hobby when I was 12 after my father
banned analytical chemistry when hydrogen sulphide permeated the house. The
availability of low-cost government surplus electronics components and systems
after the war made it feasible to create a professional laboratory at home and I built
my first oscilloscope from the components of a rocket test set when I was 14.
My school library had several of Norbert Wiener’s books and I became fascinated
by his notion of cybernetics as the common science of people and machines and his
portrayal of what it was to be a mathematician. The headmaster taught a small
group of students philosophy in the library and I audited his lectures, becoming very
interested in Kant and the issues of human understanding of the world and of the
nature of scientific knowledge. I found Ashby’s writings on cybernetics and admired
the way that he solved very general problems using algebraic techniques and I also
found Carnap’s logical structure of the world and Wittgenstein’s tractatus provided
formal approaches to the issues that Wiener and Kant had raised.
I was on the science side at school and obtained a state scholarship in math-
ematics to attend University in 1958 and applied to Trinity College, Cambridge
but they made it a condition of acceptance that I also qualify in Latin and delay
entry until 1959. I went to the Latin teacher’s home for an hour on Saturdays
for 3 months to prepare for the examination, and spent the interim year working
as a laboratory assistant at ITT's semiconductor research laboratory, working on
phosphorus and boron diffusion, epitaxial growth of silicon and the fabrication of
gallium arsenide tunnel diodes. I also designed and built a nanoamp measuring
instrument to determine the greatly reduced leakage current in transistors as we
experimented with the planar process and was surprised to find it still in routine use
at the end of a 74n integrated circuit family production line when I visited STC at
Footscray again some 5 years later.
When I went up to Cambridge I planned to continue my activities in electronics
and took with me many samples of the transistor and tunnel diodes that I had
fabricated. At the Societies Fair in my first term I asked the chairman of the Wireless
Society, Steve Salter, whether he knew anyone who might be interested in using
them as I hoped to find a home in some electronics laboratory. Steve was Richard
Gregory’s instrumentation engineer and introduced me to Richard who agreed that
I could act as Steve’s electronics assistant. Richard’s primary research was how the
brain reconstructed depth information from the disparate images of the separated
eyes. I built an oscilloscope with two cathode ray tubes and prisms that allowed the
eyes to be stimulated separately. This enabled me to display the 3D projection of
a rotatable 4D cube and I studied how the projection was perceived as the subject
manipulated it.
In 1961 I saw an advertisement in Nature for staff for a new learning machines project at STL, ITT Europe's primary research laboratories, about an hour away
from Cambridge. I applied to John Andreae, the Project Leader, to be employed
there in vacations and became his part-time assistant in mathematics, electronics
and computing. In particular, I worked on the interpretation of neural net simulations
and on the theoretical foundations of the STeLLA learning robot [5] which John
was simulating on the KDF9 and his electronics assistant, Peter Joyce, had built in
the laboratory (Fig. 4).
Richard and John’s laboratories were my focus of attention during my Cambridge
years. Trinity only required me to attend a one-hour tutorial with a college Fellow once
a week and work on questions from past examination papers, and eventually take
the part II mathematics tripos exam to qualify for a degree. Lectures were offered
by the university and open to all students but not compulsory or assessed. I largely
went to philosophy topics that interested me and lectures by renowned visitors such
as Murray Gell-Mann and Richard Feynman in cutting-edge research areas where
it was fascinating to meet those who were creating new theories of the nature of
matter.
In June 1962 I took the part II tripos examination in mathematics, and asked the
state scholarship committee if I could have a further year of funding to study for the
mathematics tripos part III. However, Richard was a consultant to the Ministry of
Defence and had been offered funding for a graduate student to undertake a study
of the adaptive training of perceptual-motor skills. He offered me the opportunity
but Oliver Zangwill, the head of the department, said I needed a psychology degree
to do so. My scholarship sponsors approved this variation, and Richard asked Alan
Watson, the eminent behavioral psychologist, to be my tutor. My positivist leanings
suited him well and he set me to write essays for him on a very wide range of topics
in psychology, debating my extremely behavioristic mathematical interpretations.
In June 1963 I took the part II tripos examination in psychology and became
Richard’s graduate student funded through the contract. Adaptive training is a
technique to generate a learning progression for a skill by automatically adjusting
the task difficulty based on the trainee’s performance, thus turning a simulator into
a teaching machine [29]. Common sense suggests it should be effective but nearly
all studies to date had negative outcomes. I analyzed the situation Ashby-style, assuming that a skill was constituted as a number of dependent sub-skills, ordered such that the probability of learning one was low if prior ones had not been learned, and showed that in such a situation adaptive training should be very effective even
if one has no knowledge of the sub-skill structure or trainee’s state in terms of it. I
examined previous laboratory studies and felt that the tasks investigated had been
insufficiently challenging. The task of interest to the sponsor was classified but the
training literature suggested that a tandem-propeller submarine, in which position is managed by controlling the rate of change of acceleration, was an extremely difficult task, and I decided to simulate that.
Fig. 5 Brian Gaines working with the analog computer and stereoscopic oscilloscope that he built,
Department of Experimental Psychology, Cambridge, 1964
innovation that became structured through the various filtering processes discussed
by Kant, Carnap and Wittgenstein. From my experiences in constructing analog
computer multipliers the simplicity of the multiplication of probabilities of the
conjunction of uncorrelated events seemed to have engineering potential. From
Richard I developed an interest in the neurological basis of depth perception and
proposed that the representation of visual intensity by neuronal discharges could be
used to extract depth information by spatial correlation through a simple conjunctive
processes if the discharges were essentially asynchronous and hence uncorrelated.
In addition, my studies of adaptive training had three components: a theoretical
one to show, within a very general framework of what it was for any system to learn a skill, that adaptive training accelerated learning; an empirical one
of training humans; and an empirical one of training learning machines undertaking
the same task as the humans. For the last I used a digital version of Rosenblatt’s
[69] perceptron which did not have the same convergence properties as an analog
version. I had noticed this when analyzing Novikoff’s [53] proof of perceptron
convergence as one of steepest descent. I had previously deconstructed Pontryagin’s
[55] maximum principle to understand why there was no discrete version even
though there were several erroneous attempts in the literature to derive one. It
seemed to me that a discrete perceptron would have similar problems because it
could only approximate steepest descent and might enter a non-convergent limit
cycle. I hypothesized that random variation in the training sequence might overcome
this as might random variation in the weight changes and showed empirically that
the limit cycles did prevent convergence and theoretically that randomness in the
training sequence or weight changes could overcome this [30]. However, even
though I envisioned a discrete perceptron with random weight changes I did not
at that time extend the notion to more general applications. I also did not implement
at that time a stochastic perceptron but I found the issues of training one that had
problems of convergence very useful to my analysis of the dynamics of training both
people and learning machines [17].
All these notions came together when I visited STL in May 1965 and found that
Peter Joyce had designed a digital module that John Andreae termed an ADDIE
that enabled the reinforcing weight adjustments for the STeLLA learning robot to be
made automatically rather than by manually adjusting potentiometers. The weight
update equation was in the form of a running average, w ← αw + (1 − α)x, and
Peter had approximated this with a complex of integrated circuits. I noted that the
component count could be greatly reduced and the approximation improved if the
variables were represented as the generating probability of a digital pulse train, and
sketched out circuit diagrams for ADDIEs with various resolutions using 74n series
integrated circuit up-down counters, flip-flop arrays with logic gates to generate a
pseudo-random number sequence and adders acting as binary number comparators.
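A rough behavioural model of such an ADDIE may help here; this is my own reconstruction under stated assumptions (the counter width, update rule and seeds are illustrative), not the STL schematic. An up/down counter holds the estimate, a comparison against a fresh pseudo-random number produces the output bit, and the counter steps whenever input and output pulses disagree:

```python
import random

def addie(bits, n_bits=8, seed=1):
    """Behavioural model of an ADDIE: an up/down counter whose normalized
    count tracks the generating probability of the input pulse train,
    approximating the running average w <- a*w + (1 - a)*x."""
    rng = random.Random(seed)
    top = 2 ** n_bits
    count = top // 2                # start the counter mid-scale
    for x in bits:
        # Comparator: emit an output pulse with probability count / top.
        y = 1 if rng.randrange(top) < count else 0
        if x == 1 and y == 0 and count < top:
            count += 1              # input pulse but no output pulse: count up
        elif x == 0 and y == 1 and count > 0:
            count -= 1              # output pulse but no input pulse: count down
    return count / top

rng = random.Random(0)
pulses = [1 if rng.random() < 0.7 else 0 for _ in range(50_000)]
print(addie(pulses))                # settles near 0.7
```

At equilibrium the probability of counting up, p(1 − c), equals the probability of counting down, (1 − p)c, which forces the normalized count c to the input probability p.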
John approved further investigation and during the next week I developed the
statistical theory for the behaviour of an ADDIE, Peter breadboarded the circuit,
and we were able to confirm that theory and practice conformed, yielding a module that provided the functionality required in the STeLLA architecture. I realized
the ADDIE emulated an operational amplifier with a negative feedback loop in my
analog computer, that a logic gate acted in the same way as my chopping multiplier
and that the [0, 1] range of probabilities could be used to emulate [−1, +1], [0, ∞]
and [−∞, +∞] ranges through appropriate transformations.
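The transformations themselves are one-liners. In the sketch below, the bipolar and ratio mappings are the ones that became standard in the later stochastic computing literature; the doubly infinite mapping shown is only one possible choice, labelled as an assumption, and not necessarily the transformation used at STL:

```python
import math

def bipolar(p):
    """[0, 1] -> [-1, +1]: the standard bipolar stochastic coding."""
    return 2.0 * p - 1.0

def ratio(p):
    """[0, 1) -> [0, +inf): a ratio coding in which p = 0.5 encodes 1.0."""
    return p / (1.0 - p)

def log_odds(p):
    """(0, 1) -> (-inf, +inf): one possible doubly infinite mapping
    (an illustrative assumption, not necessarily the STL transformation)."""
    return math.log(p / (1.0 - p))

print(bipolar(0.75), ratio(0.5), log_odds(0.5))  # 0.5 1.0 0.0
```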
Fig. 6 On right, stochastic analog computer designed by Brian Gaines and built by Peter Joyce;
on left, visual display of STeLLA learning controller’s trajectory to the specified region of state
space, Standard Telecommunication Laboratories, 1965
on, but also the simulation of dynamic systems, digital control, solution of partial
differential equations, and so on. STL’s primary output was ITT patents and I
worked with our resident agent who had trained as a mathematician and understood
the principles of the computer to file in March 1966 a comprehensive patent that had
54 claims covering the computer, representations and applications [27].
Once the patent was filed ITT approved the publication of details of the stochastic
computer. The first public disclosure was at the IFAC Congress in June 1966
where I presented a paper with John on A Learning Machine in the Context of
the General Control Problem [35], which updated his paper from the 1963 Congress [5].
The stochastic computer was the focus of my discussion contribution reporting
progress since the paper was submitted, which was transcribed in the published proceedings [35].
The first full paper I wrote on stochastic computing was requested late in 1965 by Roger Meetham at the National Physical Laboratory (NPL) who had heard a non-
disclosure presentation that John and I gave to some NPL researchers and requested
an article for the Encyclopaedia of Linguistics, Information and Control that he was
editing. The paper was written early in 1966 and approved for submission by ITT in
April but the book did not appear until 1969 [25].
The IFAC discussion [35], encyclopaedia entry [25], an internal presentation in
December 1965 [16] and the patent [27] together provide a good account of how we
perceived the stochastic computer at the time of its invention and before we were
aware of a similar invention at the University of Illinois.
The magazine, Electronics, had published a short news item in December 1966
noting that “at the University of Illinois, a group of computer researchers has
designed a series of analog computer circuits that depend on noise and therefore
needn’t be protected from it” and providing circuit examples [10]. I asked the editor
if they would like an article on the similar research at STL and he took my draft,
redrew all my diagrams as hand-drawn sketches to make them appear doodles from
a research notebook, and retitled it as Stochastic computer thrives on noise [24].
I submitted a paper to the analog computing session of the Spring Joint Computer Conference in Atlantic City [26] as part of my first trip to the USA where I visited
IBM, Bell Laboratories and Xerox research laboratories, under a research-liaison
agreement between the major electronics companies. The doyens of analog and
hybrid computers, Granino Korn and Walter Karplus, also presented and, in leading the discussion on my paper, Walter remarked that he had never expected to see
further radical innovations in analog computing.
At the conference exhibition I met Gene Clapper from IBM who was exhibiting
his research on character recognition and speech recognition based on digital
perceptrons [9]. He remarked that he had been surprised to find character recognition
less accurate but ascribed it to a lower variety in the training sequences, and we
discussed the role of noise in aiding the convergence of perceptrons. I also presented
a paper [30] at the IFAC conference on system identification in Prague, which focused on the modelling applications of the stochastic computer such as gradient techniques, the digital perceptron and the Bayesian predictor.
Fig. 7 Phase computer designed by Brian Gaines and built by Peter Joyce, Standard Telecommu-
nication Laboratories, 1966
purpose computers and I still remember the joking, but perceptive, comment of one
discussant that buying a general-purpose machine was safer because if it turns out
to be unsuitable you can always find another use for it.
I was asked to summarize the conference by the editor of Automatica and
Ray and I wrote an overview that concluded “Whatever the state of the special-
purpose/general-purpose controversy, it is clear that the advent of low-cost inte-
grated circuits has opened up a tremendous range of possibilities for new develop-
ments in control hardware; the re-development of DDA-like incremental computing
techniques is one of the more attractive possibilities which is likely to lead to
practical applications” [38]. I was also asked by the editor of the IEEE Computer
Transactions to write a commentary on Ribeiro’s 1967 paper and concluded that
“the main obstacle to the practical application of the stochastic computer is, at
present, the generation of the random variables required in a reliable and economical
manner. It may well be that we should look to truly random physical processes, such
as photon-photon interactions, to provide the hardware foundation for stochastic
computing systems” [23].
In May 1966 ITT decided to tender for the Boeing 747 simulators required in the
UK on the basis of the simulation capabilities of LMT, their French company, but
needed to establish a British company to manage the tendering process and support
the products if they were successful. I was told I was to be appointed chief engineer
of the new company rather than head of the new advanced developments and
computing division. I had already arranged for that division to have research links
with the Electrical Engineering Science department at the newly formed University of Essex.
Fig. 8 Minic 8-bit microprogrammed microcomputer designed by Brian Gaines and built by
Tony De'ath, Essex University, 1968; on left, the university prototype; on right, the commercial version
6 Earl Hunt was one of the first to cite this chapter (in the context of von Neumann’s book [81])
in his 1971 paper on "what kind of computer is man?" and came to the conclusion that man is a stochastic computer. Earl unfortunately died in 2016 just before the advent of stochastic deep
learning neural networks [45] and the assessment of how the behaviour of deep networks emulated
human visual perception [54] that begins to validate his conjecture.
Ted and I took for granted the independent simultaneous invention of stochastic
computing at the University of Illinois and STL and never discussed it or tried to
ascertain who was ‘first.’ We became aware of earlier pulse rate computers and of
statistical linearization techniques in polarity coincidence correlators [86] and saw
noise/stochastic computing as an extension of such techniques.
Multiple discovery and invention [51] is a common and well-studied phe-
nomenon across many disciplines [83] and the usual explanation is that those
involved were stimulated by the availability of the same, or similar, information.
I have tried to identify the common inspiration for Ribeiro's and my research, and have suggested that it lies in the overlapping neural analogy: Ribeiro's considering artificial neurons as modules of pulse rate computers, and my considering the
multiplicative processes implementing correlation in the interaction of the pulse
streams of natural neurons.
In addition, the history of stochastic computing also exhibits another phe-
nomenon of multiple discovery/invention where later researchers are unaware of
previous work. One of my colleagues at STL, David Hill, found in a patent search
in the early 1970s an invention filed by William G. Thistle in 1962, entitled Integrating Apparatus [77], that carried out computations using random pulse trains.
Thistle was an electronics engineer conducting research for the Canadian
Ministry of Defence at the Canadian Armament Research and Development Estab-
lishment in Québec. David contacted him for further information and received both
the patent and an internal report entitled A novel special purpose computer [76]. He
sent me copies at the time and I recollect reading the patent and noting it was related
to stochastic computing but have only now read the report in detail whilst writing
this paper.
The abstract of Thistle's report states: "A type of computer is described for the real time evaluation of integrals of the form I = ∫ y dx, where x and y are functions of time. It is believed to be novel in its use of quasi-random processes, operating
on pulse trains, to achieve the desired result. The method may be extended to cases
where y is a function of several variables dependent on time. Accuracies comparable
to analog methods are obtainable without the drift problems usually associated with
analog methods.”
Thistle describes techniques for addition, subtraction, multiplication, division
and integration using random pulse trains, provides circuit diagrams, and describes
an application to a simple navigation system. His computer encompasses the basic
architecture of the stochastic computers developed at the University of Illinois and
STL and would constitute prior art from a patent perspective.
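As a rough illustration of pulse-train integration in this spirit (a sketch under my own assumptions, not Thistle's circuit; the signal choices and scaling are illustrative), one can accumulate I = ∫ y dx by counting coincidences between a pulse stream whose probability encodes y and a stream of x-increment pulses:

```python
import random

rng = random.Random(3)
steps = 200_000
p_dx = 0.5      # probability of an x-increment pulse per clock
count = 0       # coincidence counter accumulating the integral

for k in range(steps):
    t = k / steps
    y = t                              # integrand y(t) = t, with x(t) = t
    y_pulse = rng.random() < y         # pulse with probability y(t)
    dx_pulse = rng.random() < p_dx     # x advances on this pulse
    if y_pulse and dx_pulse:           # AND gate: count the coincidences
        count += 1

# Each dx pulse advances x by 1 / (steps * p_dx), so the counter estimates
# the integral of t dt over [0, 1], which is 1/2.
print(count / (steps * p_dx))          # prints roughly 0.5
```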
His report was not widely circulated. The distribution list shows that only 3
copies were issued (to the superintendents of systems and of electronics, and to the chief superintendent) and 25 were lodged in the documents library. Thistle has three
other patents (for power supplies and a gas discharge matrix display), and seems to
have no other publications, although there are likely other internal reports. It is probable that much of his work was associated with classified systems.
A Google Scholar search on his name returns two of his patents, one of which
is the US version of his Integrating Apparatus retitled Computer for evaluating
integrals using a statistical computing process. His patent is not cited in other
patents as prior art, and it seems unlikely that, even today, a content-based automated
search would be able to link his text to the stochastic or pulse rate computing
literature. As far as I know, Thistle’s research is completely unrecognized and has
had no influence, and there is no indication of how he came to invent a stochastic
computer, but it deserves recognition in the history of computing as the earliest
documented invention of a fully-functional stochastic computer.
Thistle’s invention is also relevant to another question frequently asked about
discoveries and inventions: what would have happened if neither the Illinois nor the STL team had developed stochastic computers, would others have done so? The answer
is clearly yes—it had already happened but no one knew. There was also research
in neurology where it became known empirically, possibly as early as the 1950s,
that the coincidence of neurons firing could result in a common afferent neuron
firing and that this might be the basis of motion detection [64]. This led to an
empirical analysis of the jitter in neural firing that was shown to be sufficient for the
afferent neuron to be acting as a pulse frequency multiplier [75]. Thus, stochastic
bit-stream neural networks [8] were conceived from biological studies uninfluenced
by stochastic computing (even though the similarity to the stochastic computer is
often noted in that literature, e.g. [43]).
Conclusions
In the three decades after Ted and I completed our research in stochastic computing, research continued elsewhere but at a low intensity. We received papers to referee,
were asked to be thesis examiners, and were aware that there was continuing activity
by colleagues across the world, such as Phil Mars in the UK, Sadamu Ohteru in
Japan, Robert Massen in Germany (who in 1977 wrote the first book on stochastic
computer technology [49]) and others, but no major growth in interest. However,
in the recent past there has been a significant growth in research as illustrated in
Fig. 9 which shows the citations to my 1969 survey (a more robust estimator based
on a basket of several commonly cited articles shows a similar pattern). This book
provides a much-needed overview of this burgeoning literature through tutorials and overviews by some of those who have made major contributions to its growth.
Fig. 9 Five-year counts of citations (1970–2018) to Gaines's 1969 survey of stochastic computing
Acknowledgements I am grateful to John Esch for his help in verifying my commentary on the
research at the University of Illinois and for providing the photograph of his RASCEL stochastic
computer. I am grateful to David Hill for prompting my memory of certain dates and events and
for providing the material by Thistle documenting his early stochastic computer. I would also like
to thank the editors of this volume for providing the opportunity to contribute this account of the
origins of stochastic computing knowing that there are very few of us still alive to do so. I hope it
will be of interest to the stochastic computing research community of this era, and wish them well.
References
1. Chusin Afuso. “Analog computation with random pulse sequences”. PhD thesis. University of
Illinois, 1968.
2. A. Alaghi and J. P. Hayes. “Exploiting correlation in stochastic circuit design”. IEEE 31st
International Conference on Computer Design (ICCD). 2013, pp. 39–46.
3. A. Alaghi, W. Qian, and J. P. Hayes. “The promise and challenge of stochastic computing”.
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 37.8 (2018),
pp. 1515–1531.
4. Armin Alaghi, Cheng Li, and John P. Hayes. “Stochastic circuits for real-time image-
processing applications”. Proceedings of the 50th Annual Design Automation Conference.
Austin, Texas: ACM, 2013, pp. 1–6.
5. John H. Andreae. “STeLLA: A scheme for a learning machine”. Proceedings 2nd IFAC
Congress: Automation & Remote Control. Ed. by V. Broida. London: Butterworths, 1963,
pp. 497–502.
6. A. Basu et al. “Low-power, adaptive neuromorphic systems: Recent progress and future
directions”. IEEE Journal on Emerging and Selected Topics in Circuits and Systems 8.1 (2018),
pp. 6–27.
7. T. E. Bray. “An optoelectronic-magnetic neuron component”. Proceedings National Electron-
ics Conference. 1961, pp. 302–310.
8. Peter S. Burge et al. “Stochastic bit-stream neural networks”. Pulsed Neural Networks. Ed. by
Wolfgang Maass and Christopher M. Bishop. Cambridge, MA: MIT Press, 1999, pp. 337–352.
9. Gene L. Clapper. “Machine looks, listens, learns”. Electronics (Oct. 30, 1967).
10. Electronics. “Dropping the guard”. Electronics (Dec. 12, 1966), p. 48.
11. John W. Esch. “RASCEL - A Programmable Analog Computer Based on a Regular Array of
Stochastic Computing Element Logic”. PhD thesis. University of Illinois, 1969.
12. John W. Esch. “System and method for frame and unit-like symbolic access to knowledge
represented by conceptual structures”. Pat. 4964063. Unisys Corporation. Sept. 15, 1988.
13. John W. Esch and Robert Levinson. “An implementation model for contexts and negation
in conceptual graphs”. Conceptual Structures: Applications, Implementation and Theory.
Springer Berlin Heidelberg, 1995.
14. Peter V. Facey and Brian R. Gaines. “Real-time system design under an emulator embedded in
a high-level language”. Proceedings DATAFAIR 73. London: British Computer Society, 1973,
pp. 285–291.
15. Brian R. Gaines. “A mixed-code approach to commercial microcomputer applications”.
Conference on Microprocessors in Automation and Communications. London: IERE, 1978,
pp. 291–301.
16. Brian R. Gaines. A Stochastic Computer: Some Notes on the Application of Digital Circuits
to the Operations of Arithmetic and Differential Calculus by Means of a Probabilistic
Representation of Quantities. Tech. rep. Standard Telecommunication Laboratories, Dec. 9,
1965.
17. Brian R. Gaines. “Adaptive control theory: the structural and behavioural properties of adaptive
controllers”. Encyclopaedia of Linguistics, Information & Control. Ed. by A.R. Meetham and
R.A. Hudson. London: Pergamon Press, 1969, pp. 1–9.
18. Brian R. Gaines. “Foundations of stochastic computing systems”. Digest of IEEE International
Convention. New York: IEEE, 1968, p. 33.
19. Brian R. Gaines. “Interpretive kernels for microcomputer software”. Proceedings Symposium
Microprocessors at Work. London: Society of Electronic & Radio Technicians, 1976, pp. 56–69.
20. Brian R. Gaines. “Linear and nonlinear models of the human controller”. International Journal
of Man-Machine Studies 1.4 (1969), pp. 333–360.
21. Brian R. Gaines. MINIC I Manual. Colchester, UK: Department of Electrical Engineering
Science, 1969.
22. Brian R. Gaines. MINSYS Manual. Colchester, Essex: Department Electrical Engineering
Science, 1974.
23. Brian R. Gaines. “R68-18 Random Pulse Machines”. IEEE Transactions on Computers C-17.4
(1968), p. 410.
24. Brian R. Gaines. “Stochastic computer thrives on noise”. Electronics (July 10, 1967), pp. 72–
79.
25. Brian R. Gaines. “Stochastic computers”. Encyclopaedia of Linguistics, Information &
Control. Ed. by A.R. Meetham and R.A. Hudson. London: Pergamon Press, 1969, pp. 66–76.
26. Brian R. Gaines. “Stochastic computing”. Spring Joint Computer Conference. Vol. 30. Atlantic
City: AFIPS, 1967, pp. 149–156.
27. Brian R. Gaines. “Stochastic Computing Arrangement”. British pat. 184652. Standard Tele-
phones & Cables Ltd. Mar. 7, 1966.
28. Brian R. Gaines. “Stochastic computing systems”. Advances in Information Systems Science,
2. Ed. by J. Tou. New York: Plenum Press, 1969, pp. 37–172.
29. Brian R. Gaines. “Teaching machines for perceptual-motor skills”. Aspects of Educational
Technology. Ed. by D. Unwin and J. Leedham. London: Methuen, 1967, pp. 337–358.
30. Brian R. Gaines. “Techniques of identification with the stochastic computer”. Proceedings
IFAC Symposium on The Problems of Identification in Automatic Control Systems. 1967, pp. 1–
10.
31. Brian R. Gaines. “Training the human adaptive controller”. Proceedings Institution Electrical
Engineers 115.8 (1968), pp. 1183–1189.
32. Brian R. Gaines. “Trends in stochastic computing”. Colloquium on Parallel Digital Computing
Methods—DDA’s and Stochastic Computing. Vol. 30. London: IEE, 1976, pp. 1–2.
33. Brian R. Gaines. “Uncertainty as a foundation of computational power in neural networks”.
Proceedings of IEEE First International Conference on Neural Networks. Vol.3. Ed. by M.
Caudhill and C. Butler. 1987, pp. 51–57.
34. Brian R. Gaines. “Varieties of computer: their applications and inter-relationships”.
Proceedings of IFAC Symposium on Pulse Rate and Pulse Number Signals in Automatic
Control, Budapest: IFAC, 1968, pp. 1–16.
35. Brian R. Gaines and John H. Andreae. “A learning machine in the context of the general control
problem”. Proceedings of the 3rd Congress of the International Federation for Automatic
Control. London: Butterworths, 1966, 342–348 (discussion, session 14, p.93).
36. Brian R. Gaines, M. Haynes, and D. Hill. “Integration of protection and procedures in a high-
level minicomputer”. Proceedings IEE 1974 Computer Systems and Technology Conference.
London: IEE, 1974.
37. Brian R. Gaines and Peter L. Joyce. “Phase computers”. Proceedings of 5th Congress of
International Association for Analog Computation. 1967, pp. 48–57.
38. Brian R. Gaines and Ray A. Shemer. “Fitting Control Mathematics to Control Hardware: An
Aspect of the 1968 IFAC Pulse-Symposium”. Automatica 5 (1969), pp. 37–40.
39. Brian R. Gaines et al. “Design objectives for a descriptor-organized minicomputer”. European
Computing Congress Conference Proceedings, EUROCOMP 74. London: Online, 1974,
pp. 29–45.
40. F. Jenik. “Electronic neuron models as an aid to neurophysiological research”. Ergebnisse der
Biologie 25 (1962), pp. 206–245.
41. F. Jenik. “Pulse processing by neuron models”. Neural Theory and Modeling: Proceedings of
the 1962 Ojai Symposium. Ed. by Richard F. Reiss. Stanford, CA: Stanford University Press,
1964, pp. 190–212.
42. K. S. Lashley. “The problem of cerebral organization in vision”. Visual Mechanisms. Ed. by H.
Klüver. Oxford: Cattell, 1942.
43. Robert Legenstein and Wolfgang Maass. “Ensembles of spiking neurons with noise support
optimal probabilistic inference in a dynamically changing environment”. PLoS Computational
Biology 10.10 (2014), e1003859 1–27.
44. E. R. Lewis. “The locus concept and its application to neural analogs”. IEEE Transactions on
Bio-medical Electronics 10.4 (1963), pp. 130–137.
45. Y. Liu et al. “An energy-efficient online-learning stochastic computational deep belief net-
work”. IEEE Journal on Emerging and Selected Topics in Circuits and Systems (2018), pp. 1–1.
46. A.R. Luria. The Role of Speech in the Regulation of Normal and Abnormal Behavior. Oxford:
Pergamon Press, 1961.
47. P. Mars and W. J. Poppelbaum. Stochastic and Deterministic Averaging Processors. Stevenage:
IEE/Peregrinus, 1981.
48. T. B. Martin. “Analog signal processing by neural networks”. Proceedings National Electron-
ics Conference. 1961, pp. 317–321.
49. Robert Massen. Stochastische Rechentechnik: Eine Einführung in die Informationsverarbeitung
mit zufälligen Pulsfolgen. Munich: Hanser, 1977.
50. E. P. McGrogan. “Improved transistor neural models”. Proceedings National Electronics
Conference. 1961, pp. 302–310.
51. Robert King Merton. The Sociology of Science: Theoretical and Empirical Investigations.
Chicago: University of Chicago Press, 1973.
52. M. Hassan Najafi, David J. Lilja, and Marc Riedel. “Deterministic methods for stochas-
tic computing using low-discrepancy sequences”. IEEE/ACM International Conference On
Computer-Aided Design (ICCAD ’18). New York: ACM, 2018.
53. Albert B. Novikoff. On Convergence Proofs for Perceptrons. Tech. rep. SRI Project 3605.
Menlo Park, CA: Stanford Research Institute, 1963.
54. Joshua C. Peterson, Joshua T. Abbott, and Thomas L. Griffiths. “Evaluating (and improving)
the correspondence between deep neural networks and human representations”. Cognitive
Science (2018), 42.8, 2648–2699.
55. L. S. Pontryagin et al. The Mathematical Theory of Optimal Processes. Oxford, England:
Pergamon Press, 1962.
56. W. J. Poppelbaum. Computer Hardware Theory. New York, Macmillan, 1972.
57. W. J. Poppelbaum. Record of Achievements and Plans of the Information Engineering
Laboratory. Champaign, Urbana, IL: Department of Computer Science, University of Illinois,
1973.
58. W. J. Poppelbaum. “Statistical processors”. Advances in Computers. Ed. by Morris Rubinoff
and Marshall C. Yovits. Vol. 14. Elsevier, 1976, pp. 187–230.
59. W. J. Poppelbaum. “Transistor Flip-Flop Circuit”. Pat. 2933621. University of Illinois Founda-
tion. Aug. 2, 1956.
60. W. J. Poppelbaum. “What next in computer technology?” Advances in Computers. Ed. by
Franz L. Alt and Morris Rubinoff. Vol. 9. Elsevier, 1969, pp. 1–21.
61. W. J. Poppelbaum, C. Afuso, and J.W. Esch. “Stochastic computing elements and systems”.
Fall Joint Computer Conference. Vol. 31. New York: Books, Inc, 1967, pp. 635–644.
62. W. J. Poppelbaum et al. “Unary Processing”. Advances in Computers. Ed. by Marshall C.
Yovits. Vol. 26. Elsevier, 1987, pp. 47–92.
63. W. Qian et al. “An architecture for fault-tolerant computation with stochastic logic”. IEEE
Transactions on Computers 60.1 (2011), pp. 93–105.
64. Werner Reichardt. “Evaluation of optical motion information by movement detectors”. Journal
of Comparative Physiology A 161 (1987), pp. 533–547.
65. S. T. Ribeiro and G. K. Ujhelyi. “Electro-Optical Modulation of Radiation Pattern Using
Curved Electrodes”. U.S. pat. 3433554. Secretary of the Navy. May 1, 1964.
66. Sergio Telles Ribeiro. “Comments on Pulsed-Data Hybrid Computers”. IEEE Transactions on
Electronic Computers EC-13.5 (1964), pp. 640–642.
67. Sergio Telles Ribeiro. “Phase Plane Theory of Transistor Bistable Circuits”. PhD thesis.
University of Illinois, 1963.
68. Sergio Telles Ribeiro. “Random pulse machines”. IEEE Trans. Electronic Computers EC-16.6
(1967), pp. 261–276.
69. Frank Rosenblatt. “The perceptron: A probabilistic model for information storage and organi-
zation in the brain”. Psychological Review 65.6 (1958), pp. 386–408.
70. Sayed Ahmad Salehi et al. “Computing mathematical functions using DNA via fractional
coding”. Scientific Reports 8.1 (2018), p. 8312.
71. Hermann Schmid. “An operational hybrid computing system provides analog-type compu-
tation with digital elements”. IEEE Transactions on Electronic Computers EC-12.6 (1963),
pp. 715–732.
72. Ray A. Shemer. “A hybrid-mode modular computing system”. Proceedings of IFAC Symposium
on Pulse Rate and Pulse Number Signals in Automatic Control. Budapest, 1968.
73. Ray A. Shemer. “A Hybrid-Mode Modular Computing System”. PhD thesis. 1970.
74. I. W. Smith, D. A. Hearn, and P. Williamson. “Software development for Batchmatic computer
numerical control system”. Proceedings of the Fourteenth International Machine Tool Design
and Research Conference. Ed. by F. Koenigsberger and S. A. Tobias. London: Macmillan,
1974, pp. 381–389.
75. Mandyam V. Srinivasan and Gary D. Bernard. “A proposed mechanism for multiplication of
neural signals”. Biological Cybernetics 21.4 (1976), pp. 227–236.
76. William G. Thistle. A Novel Special Purpose Computer. Tech. rep. CADRE Technical Note
1460. Valcartier, Québec: Canadian Armament Research and Development Establishment,
1962.
77. William G. Thistle. “Integrating Apparatus”. Canadian pat. 721406. Ministry of National
Defence. Nov. 30, 1962.
78. Financial Times. “George Kent backs a micro-computer venture”. Financial Times (Feb. 5,
1970), p. 11.
79. G. K. Ujhelyi and S. T. Ribeiro. “An electro-optical light intensity modulator”. Proceedings of
the IEEE 52.7 (1964), pp. 845–845.
80. Gabor K. Ujhelyi, Sergio T. Ribeiro, and Andras M. Bardos. “Data Display Device”. U.S. pat.
3508821. Carson Laboratories. Aug. 11, 1966.
81. John Von Neumann. The Computer and the Brain. New Haven, Yale University Press, 1958.
82. Computer Weekly. “MINIC system is bought by George Kent”. Computer Weekly (Feb. 12,
1970).
83. Wikipedia. List of multiple discoveries. 2018. https://en.wikipedia.org/wiki/List_of_multiple_
discoveries.
84. F.K. Williamson et al. “A high-level minicomputer”. Information Processing 74. Amsterdam:
North-Holland, 1974, pp. 44–48.
85. Yiu Kwan Wo. “APE machine: A novel stochastic computer based on a set of autonomous
processing elements”. PhD thesis. 1973.
86. S. Wolff, J. Thomas, and T. Williams. “The polarity-coincidence correlator: A nonparametric
detection device”. IRE Transactions on Information Theory 8.1 (1962), pp. 5–9.
Tutorial on Stochastic Computing
Chris Winstead
Introduction
Stochastic computing circuits have a number of features that have attracted the
attention of researchers for several decades. This chapter introduces the fundamental
concepts of stochastic computation and describes their attraction for applications in
approximate arithmetic, error correction, image processing and neural networks.
Some disadvantages and limitations are also discussed, along with future circuits
that exploit native non-determinism to avoid some of these drawbacks.
Some background on digital design, probability theory and stochastic processes
is necessary. Prior knowledge of image processing or machine learning topics—
particularly Bayesian networks and neural networks—is also helpful for following
the application examples. For deeper study on the topic, the reader is directed to
several recent key references that explore each of the sub-topics in greater detail. A
good starting point is the recent survey article by Alaghi et al. [2]. This chapter does
not provide a comprehensive bibliography on stochastic computing; references in
the chapter bibliography are selected to provide the greatest benefit to new students
of the subject.
C. Winstead ()
Department of Electrical and Computer Engineering, UMC 4120, Utah State University, Logan,
UT, USA
e-mail: chris.winstead@usu.edu
Fundamental Concepts
Stochastic computing circuits are able to realize arithmetic functions with very few
logic gates. This is achieved by encoding numerical values within the statistics of
random (or pseudorandom) binary sequences. For example, the number x = 1/3
can be represented by the sequence 0, 0, 1, 0, 1, 0, . . . , wherein the frequency of
1’s is equal to 1/3. A numerical value encoded in this way is called a stochastic
number. Throughout this chapter, we will use the terms bit sequence or bit stream
interchangeably to refer to the specific random bits that encode a stochastic number.
Since many different sequences can encode the same value, stochastic numbers are
defined by the sequence statistics rather than the particular order of bits.
Definition 1 (Stochastic Number) Given a probability pX , 0 ≤ pX ≤ 1, the
corresponding stochastic number X is a sequence of random binary numbers
X0 , X1 , . . . for which any Xj ∈ {0, 1} may equal 1 with probability pX .
Throughout this chapter, we will use capital letters (X) to refer to stochastic
numbers, and subscripted capital letters (Xj) to refer to the individual sequence
bits, where the subscript (j) indicates the clock cycle index. When analyzing
logic operations, we will often omit the subscript when stating combinational
relationships that hold for all clock cycles. We will also use capital letters to refer to
binary values, integers, and parameters, which will be clearly indicated by context.
A lower-case letter (x) represents the real value associated with the stochastic
number, and the sequence probability is indicated by the letter p with a subscript to
indicate the stochastic number (as in pX ).
Given a sufficiently long bit sequence, the sequence’s mean value is expected to
converge to the probability pX. Stochastic numbers have precision that improves
with sequence length, a property called progressive precision. For instance, the
value 3/7 can be represented exactly by a sequence of length at least 7, whereas
the value 5/14 requires a sequence of at least 14 bits. Since the individual bits in
the sequence are random, a much longer sequence is typically required before the
sequence’s actual average converges to within the desired precision.
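To make the notion of progressive precision concrete, the following Python sketch (an illustration added here, not a circuit from the chapter; the names, seed, and stream lengths are arbitrary choices) samples ideal Bernoulli bit streams and shows the estimate tightening as the sequence grows:

```python
import random

rng = random.Random(1)

def generate_sn(p, n):
    """Generate an ideal stochastic number: n independent bits with P(bit = 1) = p."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def estimate(bits):
    """Estimate the encoded value as the mean of the bit sequence."""
    return sum(bits) / len(bits)

p_x = 3 / 7
for n in (7, 70, 7000):
    print(n, estimate(generate_sn(p_x, n)))  # estimates approach 3/7 as n grows
```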
In practice, most stochastic computing circuits produce non-ideal stochastic
numbers in which the bits depend on the sequence history. The resulting auto-
correlation can sometimes distort or reduce the efficiency of stochastic computa-
tions, posing a serious challenge that takes several forms that are discussed in later
sections of this chapter. Because of the potential inaccuracies, we need to carefully
distinguish between ideal and non-ideal stochastic numbers.
Definition 2 (Ideal Stochastic Number) An ideal stochastic number has the
properties of a Bernoulli process, wherein the sequence bits are all statistically
independent from each other.
It is sometimes said that an ideal stochastic number is memoryless, because each
bit has no statistical dependence on the sequence history. A non-ideal stochastic
number does depend on the sequence history, and can be considered as a hidden
Markov model.
Binary sequences can directly represent positive real numbers between zero and one.
In order to represent larger numbers, they must be mapped onto the unit interval.
This can be done in a variety of different ways, resulting in distinct numerical
representations or formats. Some of the more common formats are defined in this
section. We begin with the most common unipolar format.
In the log-likelihood ratio (LLR) format, the represented real value is the logarithm of the ratioed value:
LX = log X. (1)
The simplest stochastic operation is the NOT gate, which complements every bit of its input stream, so that
pQ = 1 − pX. (2)
Fig. 2 The NOT gate as a stochastic 1 − pX operation. For the bipolar format, x = 1/3 and
q = −1/3 (q = −x). For the ratioed format, x = 2 and q = 1/2 (q = 1/x). For the LLR format,
x = 0.693 and q = −0.693 (q = −x)
Unipolar Case For a unipolar stochastic input with non-unit scale constant M, the
equivalent real-valued output can be expressed as
q/M = 1 − x/M, which gives q = M − x. (3)
For the LLR format, the NOT gate computes
LQ = log(1/X) = −log X = −LX. (5)
If both X and Q are LLR stochastic numbers, then the corresponding real-valued
computation is
q = −x. (6)
Fig. 3 The AND gate as a unipolar stochastic multiplier. For the bipolar case, a = 0, b = 1/3 and
q = −1/3. For the ratioed case, a = 1, b = 2 and q = 1/2
Unipolar Case If the inputs are unipolar stochastic numbers, then the AND gate
can be interpreted as a multiplier. If the unipolar inputs have scale constants MA
and MB , then the output scale constant is MQ = MA MB . Supposing a = 3 with
MA = 6, and b = 2 with MB = 3, then the output is expected to be q = 6 with
MQ = 18. This is consistent with the example probabilities shown in Fig. 3.
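The following sketch (added for illustration; the seed and stream length are arbitrary) simulates the AND-gate multiplier with the scale constants of this example, assuming ideal, independent input streams:

```python
import random

rng = random.Random(2)
N = 100_000
p_a, p_b = 3 / 6, 2 / 3               # a = 3 with M_A = 6; b = 2 with M_B = 3
A = [rng.random() < p_a for _ in range(N)]
B = [rng.random() < p_b for _ in range(N)]
Q = [a and b for a, b in zip(A, B)]   # AND gate: p_Q = p_A p_B for independent inputs
M_Q = 6 * 3                           # output scale constant M_Q = M_A * M_B
print(M_Q * sum(Q) / N)               # close to q = 6, as in the worked example
```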
Bipolar Case The bipolar case is more complicated, and is left as an exercise for
the reader. Given bipolar inputs with unit scale constants, the reader should find that
q = (a + b + ab − 1) /2.
Ratioed Case The behavior is also interesting for ratioed stochastic numbers:
pQ = pA pB ⇒ q/(q + 1) = [a/(a + 1)] · [b/(b + 1)] ⇒ q = ab/(1 + a + b). (7)
For large inputs, where a + b ≫ 1, this is approximately
q ≈ ab/(a + b). (8)
When a and b are small, q ≈ ab, and in the LLR format
LQ = log Q ≈ log(AB) = log A + log B. (9)
If A, B, and Q are all LLR stochastic numbers, then the corresponding computation
is approximated by
q ≈ a + b. (10)
Example 4 (OR Gate)
The OR gate outputs a 1 whenever either input is 1, so for independent unipolar inputs the output probability is
pQ = pA + pB − pA pB. (11)
When both input probabilities are small, the product term is negligible and the OR gate acts as an approximate adder:
pQ ≈ pA + pB. (12)
A numerical example for this approximation is shown in Fig. 5. In this case, the
input probabilities are 1/6 and 1/12, and the output probability is their sum, 1/4.
Unipolar Case For unipolar inputs with scale constants MA, MB, the output is
q = MA MB (a/MA + b/MB − ab/(MA MB)). (13)
Repeating the same scale constants and values from Example 3, the output is
expected to be q = 18(3/6 + 2/3 − 6/18) = 15 with MQ = 18, which corresponds
to pQ = 5/6 as shown in Fig. 4.
Fig. 4 Stochastic behavior of the OR gate. The unipolar input probabilities are pA = 1/2 and
pB = 2/3, and the output probability is pQ = 5/6. For the bipolar case, the corresponding values
are a = 0, b = 1/3, q = 1/3. For the ratioed case, a = 1, b = 2, q = 5. For the LLR case, a = 0,
b = 0.693, and q = 1.609
Fig. 5 The OR gate as an approximate adder for small input probabilities in the unipolar and
ratioed representations. Numerical values for the unipolar case are indicated in the figure. For the
ratioed case, a = 1/5, b = 1/11 and q = 1/3, which is close to a + b
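A minimal simulation of the OR-gate adder, using the input probabilities of Fig. 5 and assuming ideal, independent streams (an added sketch, not part of the original example):

```python
import random

rng = random.Random(3)
N = 100_000
p_a, p_b = 1 / 6, 1 / 12
A = [rng.random() < p_a for _ in range(N)]
B = [rng.random() < p_b for _ in range(N)]
Q = [a or b for a, b in zip(A, B)]
print(sum(Q) / N)                # measured p_Q
print(p_a + p_b - p_a * p_b)     # exact OR output, Eq. (11)
print(p_a + p_b)                 # small-probability approximation, Eq. (12)
```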
Bipolar Case As with the bipolar AND operation from Example 3, the reader
can verify that, given bipolar inputs with unit scale constants, the result is q =
(a + b − ab + 1) /2.
Ratioed Case If a and b are ratioed stochastic inputs, then the output is
q/(q + 1) = a/(a + 1) + b/(b + 1) − ab/((a + 1)(b + 1)) = (a + b + ab)/(1 + a + b + ab), (14)
which gives q = a + b + ab.
LLR Case When a and b are large, the product term ab dominates in Eq. (14), so that
LQ ≈ log(AB) = log A + log B. (15)
If A, B, and Q are all understood to be LLR stochastic numbers, then the OR
gate acts as an LLR adder:
q ≈ a + b. (16)
Example 5 (Multiplexer)
A two-input multiplexer selects input A when its select bit S = 0 and input B when S = 1, so the output probability is
pQ = (1 − pS) pA + pS pB. (17)
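A behavioral sketch of the multiplexer adder (added here; the probabilities are arbitrary choices), assuming ideal, independent streams:

```python
import random

rng = random.Random(4)
N = 100_000
p_a, p_b, p_s = 0.3, 0.8, 0.5
Q = []
for _ in range(N):
    a, b = rng.random() < p_a, rng.random() < p_b
    s = rng.random() < p_s      # a select stream with p_S = 1/2 averages the inputs
    Q.append(b if s else a)     # 2-to-1 multiplexer
print(sum(Q) / N)               # approaches (1 - p_S) p_A + p_S p_B = 0.55
```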
Fig. 7 Stochastic behavior of the XOR gate. The unipolar computation is indicated in the figure.
In the bipolar domain, the corresponding values are a = 0, b = 1/3 and q = −ab = 0. In the ratioed
domain, the values are a = 1, b = 2 and q = 1
For the unipolar case, this can be viewed as either a weighted sum or an averaging
of the two inputs. The reader can verify that the same result holds for the bipolar
case. For ratioed and LLR stochastic numbers, the result is less convenient and is
omitted from this example.
Example 6 (XOR Gate)
Another fundamental logic operation is the XOR gate. An example of its unipolar
behavior is shown in Fig. 7.
Unipolar Case The output Q is 1 only if the two inputs are unequal, so the output
probability is
pQ = pA (1 − pB) + (1 − pA) pB = pA + pB − 2 pA pB. (18)
In the ratioed domain the behavior is very similar to that of the AND gate. For
small values of a and b, such that ab ≪ 1, the XOR gate acts as an adder. For large
values it implements the operation
q ≈ (a + b)/(ab). (21)
The LLR behavior is not illuminating and is omitted from this example.
Lastly, we substitute the ratioed format expressions from Definition 5 and find
that
q/(1 + q) = ab/(1 + ab), (24)
which gives q = ab.
Hence this circuit operates as a multiplier for ratioed stochastic numbers. In the LLR
domain, it operates as an adder.
To give a precise example, we suppose the D element initially holds a zero value,
and evaluate the circuit for two ratioed stochastic inputs representing a = 4 and
b = 1/2. The corresponding probabilities are pA = 4/5 and pB = 1/3. Example
sequences for these probabilities are
A= 1110_1111_1011
B= 0110_0000_0101
A&B = 0110_0000_0001
A|B = 1110_1111_1111
Q= 0_1110_1010_1011
Ignoring the initial state of Q, the remaining sequence has a mean of 8/12, which
corresponds to a real value of q = 2 in the ratioed format. This is the expected result.
Example 8 (J/K Flip-Flop)
The J/K flip-flop is a classical memory element that sets Q+1 := J K̄ + Q J̄ K̄ +
Q̄ J K. The typical schematic symbol and operation table are shown in Fig. 9.
Assuming that the inputs J and K are ideal stochastic numbers, the output Q is
delayed by one clock cycle and therefore independent of the two inputs. Then the
flip-flop’s logical definition maps directly to a probability expression:
Fig. 9 J/K flip-flop schematic symbol and operation table:
J K | Operation
0 0 | Hold
1 0 | Set
0 1 | Reset
1 1 | Toggle
pQ = pJ (1 − pK) + pQ (1 − pJ)(1 − pK) + (1 − pQ) pJ pK
   = pQ + pJ − pQ (pJ + pK) ⇒ pQ = pJ / (pJ + pK). (25)
This can be described as a unipolar normalizing operation, useful for computing the
relative proportion of two signals.
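The normalizing behavior is easy to check in simulation. The sketch below (an added illustration with arbitrary input probabilities) steps the J/K flip-flop’s set/reset/toggle/hold rule directly:

```python
import random

rng = random.Random(5)
N = 200_000
p_j, p_k = 0.2, 0.3
q = 0                                # flip-flop state
count = 0
for _ in range(N):
    j, k = rng.random() < p_j, rng.random() < p_k
    if j and not k:   q = 1          # set
    elif k and not j: q = 0          # reset
    elif j and k:     q = 1 - q      # toggle
    # j == k == 0: hold
    count += q
print(count / N)                     # approaches p_J / (p_J + p_K) = 0.4
```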
Example 9 (Bayes’ Law)
Bayesian inference is a point of intersection for many applications ranging from
error correction to neural networks. For readers who may be unfamiliar with the
concept, suppose we are uncertain about some fact, for instance I may worry that
I’ve left the oven on. Let A be a binary random variable representing the event that
the oven is on. Based on past experience, I can guess at a prior probability pA
that I left it turned on. Additionally suppose that I have a remote sensor B which
is supposed to indicate whether the oven is on. The sensor is unreliable for some
reason, and it is only accurate with probability pB . According to Bayes’ Law, I
can combine my prior probability with my sensor evidence to obtain an improved
posterior probability pQ :
pQ = Pr(A | B) = Pr(B | A) Pr(A) / Pr(B)
   = Pr(B | A) Pr(A) / [ Σ over a ∈ {on, off} of Pr(B | A = a) Pr(A = a) ]. (26)
Evaluating this in terms of the stream probabilities gives
pQ = pA pB / ((1 − pA)(1 − pB) + pA pB). (27)
This operation can be implemented in a number of different ways. One of the common
implementations uses a J/K flip flop as shown in Fig. 10. In this circuit, the J/K flip-
flop sets Q := 1 if A and B are both 1, and resets Q := 0 if A and B are both 0.
If A = B in a given clock cycle, then the value of Q does not change. We assume
that A and B are ideal stationary stochastic numbers, so that their statistics do not
vary over time. In that case, it can be shown that Q converges in mean so that pQ
does not vary from one clock cycle to the next. Q is not ideal, but in any given clock
cycle Q is statistically independent of A and B. Then the output probability can be
expressed as
pQ = pA pB / (pA pB + (1 − pA)(1 − pB)). (28)
Fig. 11 Muller C-element device symbol. The C-element is a classical asynchronous gate
functionally equivalent to the Bayes’ Law circuit in Fig. 10
This is the expression for Bayes’ Law. By an interesting coincidence, the output
probability is the same as for the ratioed multiplier in Example 7, so the ratioed
multiplier can just as well be used to implement Bayes’ Law. Conversely, the Bayes
circuit can serve as a ratioed multiplier, or as an adder in the LLR domain.
The J/K-based circuit shown in Fig. 10 is functionally equivalent to a classic
logic gate known as the Muller C-element, which is widely used in asynchronous
digital circuits and has several alternative implementations. In future examples, we
will use the C-element symbol shown in Fig. 11 to stand in for the J/K Bayes’ Law
circuit.
As a concrete example of the C-element’s function, we consider input streams
with probabilities pA = 5/12 and pB = 1/3. The output probability should be
pQ = 0.263, which is a little more than 1/4.
A = 0101_1100_0100 (5/12)
B = 0011_0001_0010 (1/3)
Q = 0001_1100_0000 (1/4)
In the ratioed domain, the corresponding values are a = 0.7143, b = 0.5 and
q = 1/3, which is close to the product ab = 0.3572. The accuracy tends to improve
as the stream length increases.
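The set/reset/hold rule can be simulated directly. The following sketch (added for illustration) uses the same input probabilities as the example above and compares the measured output against the Bayes’ Law expression of Eq. (27):

```python
import random

rng = random.Random(6)
N = 200_000
p_a, p_b = 5 / 12, 1 / 3
q = 0
count = 0
for _ in range(N):
    a, b = rng.random() < p_a, rng.random() < p_b
    if a and b:           q = 1     # set when both inputs are 1
    elif not a and not b: q = 0     # reset when both inputs are 0
    count += q                      # otherwise hold the previous value
print(count / N)                                         # measured p_Q
print(p_a * p_b / (p_a * p_b + (1 - p_a) * (1 - p_b)))   # Bayes posterior, ~0.263
```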
Example 10 (Toggle Flip-Flop)
The toggle flip-flop (TFF) is an interesting case where the output has probability
1/2, independent of the input probability (so long as the input is non-zero). A
numerical example is shown in Fig. 12. The TFF can be used to generate a known
constant stochastic number without requiring an additional RNG.
The main drawback to the TFF is that it introduces substantial correlation into
the output sequence. For example, if the input probability is close to 1, then the
TFF’s output will follow a regular 1-0-1-0 pattern, giving it a nearly deterministic
behavior. This can potentially interfere with the accuracy of some circuits. Never-
theless there are some important applications as demonstrated in the following two
examples.
Example 11 (Unipolar Divide-by-Two)
One immediate application of the TFF is a divide-by-two circuit shown in
Fig. 13. Since the TFF generates a unipolar output with probability 1/2, this can
be multiplied into the original input stream using an AND gate. If the circuit’s input
is an ideal stochastic number, then the TFF’s delay ensures statistically independent
inputs at the AND gate, which is required for proper function as a stochastic
multiplier (consequences of statistical dependence are studied in Example 14). This
circuit inherits some correlation effects from the TFF. For instance, if the input has
probability 1, then the output will follow the same deterministic 1-0-1-0 pattern as
the TFF.
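A behavioral sketch of the divide-by-two circuit (added here; the input probability 0.6 is an arbitrary choice), assuming an ideal input stream:

```python
import random

rng = random.Random(7)
N = 200_000
p_x = 0.6
s = 0                                # toggle flip-flop state
count = 0
for _ in range(N):
    x = rng.random() < p_x
    z = x and s == 1                 # AND the input with the delayed TFF output
    s ^= x                           # TFF toggles on every input 1
    count += z
print(count / N)                     # approaches p_x / 2 = 0.3
```

Note that the AND gate sees the TFF state computed from past inputs only, which is what makes the two AND inputs statistically independent.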
Example 12 (Unipolar/Bipolar Adder)
A second TFF application is a non-weighted adder circuit, shown in Fig. 14, that
works for unipolar and bipolar formats. In this circuit, when A = B the common
input value passes directly to the output; when A ≠ B the TFF toggles and its output
is selected instead. The output probability is therefore
pZ = pA pB + (1/2) ((1 − pA) pB + pA (1 − pB)) = (1/2)(pA + pB). (29)
As with the other TFF circuits, the output of this adder is not ideal. For instance,
suppose that pA = 1 and pB = 0. In that circumstance, the TFF’s output is always
selected, and we again see the regular 1-0-1-0 pattern.
[Figure: measured autocorrelation of the adder output versus clock cycles]
Isolation Methods
For example, delaying a stream by one clock cycle and ANDing it with itself squares its value:
0101_1100_0011
& 0010_1110_0001
= 0000_1100_0001
The delay method only works for ideal stochastic numbers. If successive bits are
correlated, then the result cannot be trusted. Consider the case when all 1s appear at
the beginning of the sequence, like this:
1111_1100_0000
& 0111_1110_0000
= 0111_1100_0000
Here the output sequence is nearly the same as the input sequence, and has pQ =
5/12, which is approximately pX rather than pX².
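Both the ideal and the correlated cases are easy to reproduce. The sketch below (an added illustration) applies the one-cycle-delay isolation method first to an ideal stream and then to a deliberately bunched stream of the same value:

```python
import random

rng = random.Random(8)
N = 100_000
p_x = 0.5
X = [1 if rng.random() < p_x else 0 for _ in range(N)]
# Isolation: AND the stream with a one-cycle-delayed copy of itself.
Q = [X[t] & X[t - 1] for t in range(1, N)]
print(sum(Q) / len(Q))           # close to p_x**2 = 0.25 for an ideal input

# With a correlated input (all 1s bunched together) the same circuit fails:
X_bad = [1] * (N // 2) + [0] * (N - N // 2)
Q_bad = [X_bad[t] & X_bad[t - 1] for t in range(1, N)]
print(sum(Q_bad) / len(Q_bad))   # close to p_x, not p_x**2
```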
[Fig. 20: regeneration schematic in which X drives a chain of D flip-flop delays (a sample memory) whose taps are combined to produce the output Z]
Fig. 21 Change in TFF autocorrelation after applying the isolation method shown in Fig. 20 when
the input probability is close to zero
schematic for this simulation is shown in Fig. 20. For this input probability, the TFF
output Q has runs of 1s and 0s with an average run length of about 20. This results
in a very long-lived autocorrelation for Q, as seen in the simulation results shown in
Fig. 21. After regeneration via the sample memory, the correlation effect is reduced
but not completely eliminated.
This simulation used a memory depth of 64 delays, with 8 taps spaced 7 delays
apart, at delay indices 64, 57, 50, . . . , 8. The simulation was repeated for a high
input probability, pX = 0.95, and we see from the result in Fig. 22 that the
regeneration is more successful. Again, however, the autocorrelation is reduced but
not entirely eliminated.
Regeneration Methods
Fig. 22 Change in TFF autocorrelation after applying the isolation method shown in Fig. 20 when
the input probability is close to one
the system, since it is not possible to instantaneously sample the input statistics. In
some special cases, the latency effects may prove advantageous, for instance it may
help stabilize iterative calculations where feedback is present. But most of the time
the latency is a drawback.
Example 16 (Counter-Based Regeneration)
The most direct means of regenerating a stochastic number is to use a binary
counter to estimate the input stream’s probability. A generic schematic for this
approach is shown in Fig. 23, where the output stream Z is generated using a
uniform RNG and a comparator. This circuit is a modification of the stochastic
number generator from Example 1. If the counter has K bits, then the counter may
accumulate for 2^K clock cycles, yielding the unsigned integer count C representing
the total number of non-zero bits that occurred in that time window. Then the
probability estimate is
p̂X = C / 2^K. (30)
If the RNG produces uniformly distributed random numbers in the interval
[0, 2^K − 1], then the output stream has pZ = p̂X.
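A behavioral sketch of counter-based regeneration (added here; K and pX are arbitrary choices), assuming an ideal input stream and a uniform RNG:

```python
import random

rng = random.Random(9)
K = 8                      # counter width; accumulation window of 2**K samples
p_x = 0.7
X = [1 if rng.random() < p_x else 0 for _ in range(2 ** K)]
C = sum(X)                 # count of 1s in the window
p_hat = C / 2 ** K         # probability estimate, Eq. (30)

# Regenerate: compare a fresh uniform random integer R in [0, 2**K - 1] to C.
Z = [1 if rng.randrange(2 ** K) < C else 0 for _ in range(100_000)]
print(p_hat, sum(Z) / len(Z))   # the regenerated stream has p_Z = p_hat
```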
There are several tradeoffs associated with the simple counter approach in
Fig. 23. Since the counter must accumulate 2^K samples, the output probability can
only be updated with a period of 2^K clock cycles. Alternatively, the probability
Fig. 23 Counter-based regeneration schematic
Fig. 24 Free-running regeneration based on an up/down counter
estimate can be updated every clock cycle if a shift-register is used with depth 2^K,
so that the oldest sample can be subtracted out at the same time as the newest sample
arrives.
An alternative counter solution is to use an up/down counter (UDC) as shown in
Fig. 24. In this approach, the input sequence is treated as a bipolar stream. The
UDC estimates the bipolar average, so the count C is treated as a signed integer.
Whenever the input is X = 1, the counter increments; whenever X = 0 the counter
decrements. Then the bipolar average is converted to a probability by
p̂X = (C + 2^K) / 2^(K+1). (31)
This mapping is trivial to implement. One major advantage for the UDC is that it
can run continuously, providing a revised probability estimate every clock cycle,
without the need for a shift register.
Feedback Methods
Fig. 25 Simplified signal-flow schematic for a tracking forecast memory
[Fig. 26: estimated probability versus clock cycles for β = 1/8, 1/32, and 1/128]
The feedback estimator nudges its running estimate toward each incoming bit:
p̂X := p̂X + β (X − p̂X), (32)
where β is a parameter, 0 < β < 1, that controls the step size. In essence, this
process acts as a low-pass filter to track the constant (or slow-changing) probability
value from the fast-switching bit stream. A small β means slower tracking but
better accuracy and stability. Feedback estimation is the basis of Tracking Forecast
Memory (TFM) regeneration methods, which have proved very effective in error
correction decoders.
In practice, the β parameter can be chosen as a power of 2 in order to
minimize implementation complexity. As a practical example, the generic TFM
from Fig. 25 was simulated for three different values of β. The simulation results,
shown in Fig. 26, demonstrate the time/stability tradeoff for estimating a stochastic
number with pX = 0.75. In this example, the estimator’s input is an ideal
stochastic number with pX = 0.75, and the estimate starts from an initial value
of 0.5. With smaller values of β, the circuit has higher latency, but the estimate is
more stable and accurate.
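The tracking update of Eq. (32) is a one-line loop in software. The sketch below (an added illustration) repeats the experiment of Fig. 26 with the same β values and initial estimate:

```python
import random

rng = random.Random(10)
p_x = 0.75
for beta in (1 / 8, 1 / 32, 1 / 128):
    p_hat = 0.5                        # initial estimate
    for _ in range(400):
        x = 1 if rng.random() < p_x else 0
        p_hat += beta * (x - p_hat)    # low-pass tracking update, Eq. (32)
    print(beta, round(p_hat, 3))       # smaller beta: slower but steadier tracking
```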
The preceding examples treat regeneration as a modular problem. It is also
common to merge stochastic regeneration into the design of functional circuits.
This opens the possibility of using feedback in interesting ways, as illustrated by
the divider circuit in Example 18.
Fig. 27 Regenerative unipolar stochastic divider circuit
Applications
Error Correction
A parity-check code constrains one bit to equal the XOR of the others, for instance
u2 = u0 ⊕ u1, (33)
giving four valid codewords:
u0 u1 u2
0 0 0
0 1 1
1 0 1
1 1 0
This behavior corresponds to that of an XOR gate. Now suppose that we are able
to retrieve only two bits of the sequence, say u0 and u2 , but the remaining bit u1 is
unknown. In that situation we can infer the value of u1 by applying the parity rule:
u1 = u0 ⊕ u2 .
Now let’s alter the situation and say that our retrieval system is able to estimate
probabilities for each of the three bits, p0 , p1 , and p2 , where each pj indicates
the probability that uj = 1, based on a measured signal. In this situation we can
estimate the extrinsic probability of one bit (say, u1 ) based on evidence from the
other two:
p1 | 0,2 = p0 (1 − p2 ) + (1 − p0 ) p2 . (34)
[Figure: extrinsic probability circuits built from C-elements, computing Û0, Û1, Û2 from pairs of input streams]
Now consider a length-3 repetition code, which has only two valid codewords:
u0 u1 u2
0 0 0
1 1 1
As we did in Example 19, suppose that we are able to retrieve only u0 and u2 , but the
remaining bit u1 is lost. We can infer the value of u1 only if u0 = u2 , in which case
u1 must have the same value. Now suppose the retrieval system estimates bit probabilities
p0 , p1 , and p2 . In the belief propagation algorithm, we use Bayes’ Law to obtain
the extrinsic probabilities. For bit u1 , the extrinsic probability is
p1|0,2 = p0 p2 / (p0 p2 + (1 − p0)(1 − p2)). (35)
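Both extrinsic computations can be verified behaviorally. The sketch below (an added illustration with arbitrary input probabilities) estimates Eq. (34) with an XOR and Eq. (35) with a C-element-style set/reset/hold element:

```python
import random

rng = random.Random(11)
N = 200_000
p0, p2 = 0.8, 0.7

xor_count = 0   # parity-check extrinsic estimate, Eq. (34)
q = 0
c_count = 0     # repetition-code extrinsic estimate, Eq. (35)
for _ in range(N):
    u0, u2 = rng.random() < p0, rng.random() < p2
    xor_count += u0 != u2               # XOR of the two retrieved streams
    if u0 and u2:           q = 1       # C-element: set on 11,
    elif not u0 and not u2: q = 0       # reset on 00, hold otherwise
    c_count += q
print(xor_count / N, p0 * (1 - p2) + (1 - p0) * p2)            # ~0.38
print(c_count / N, p0 * p2 / (p0 * p2 + (1 - p0) * (1 - p2)))  # ~0.903
```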
1 The subgraph associated with a trapping set should usually contain both degree-1 and degree-2
parity check nodes. Here we have omitted the degree-1 nodes since they have no relevant effect on
the circuit’s behavior.
Fig. 32 A deterministic fixed-state on a trapping set in a stochastic decoder
Image Processing
[Figure: a linear stochastic finite-state machine with states S0, S1, . . . , SN−1, stepped up or down by the input X, with output Q = f(S)]
[Figure: implementation of the FSM as an unsigned up/down counter holding state S, followed by the output mapping f(S) to produce Q]
Fig. 35 Simulation results for a stochastic FSM-based exponentiation circuit with G = 2, 4, and 8.
Dotted curves indicate ideal behavior, solid curves indicate measured results
and high values of pQ , but degrades markedly when the output probability is low.
The output from this circuit is also non-ideal, since the state-machine mapping tends
to emit long runs of 1’s and 0’s, similar to the toggle flip-flop.
Example 23 (Tanh Function)
Supposing a bipolar stochastic input X and a bipolar output Q, a stochastic tanh
function is achieved by this mapping:
f(S) = 1 if S ≥ N/2, otherwise f(S) = 0. (38)
This behavior was simulated for several values of N , with sequence lengths of ten
thousand clock cycles. The results are shown in Fig. 36.
The simulation curves show that the tanh function becomes threshold-like for
relatively small values of N . Since the input is a bipolar number, the threshold
occurs close to a real value of x = 0, corresponding to the probability pX = 0.5.
This makes the tanh circuit useful as a component for stochastic comparators and
sorting circuits.
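A behavioral sketch of the tanh FSM (added here; N and the test probabilities are arbitrary choices) reproduces the threshold-like transition around pX = 0.5:

```python
import random

rng = random.Random(12)
N_states = 16
trials = 10_000
for p_x in (0.3, 0.45, 0.5, 0.55, 0.7):
    s = N_states // 2                  # saturating up/down counter state
    ones = 0
    for _ in range(trials):
        # Count up on an input 1, down on a 0, saturating at the ends.
        s = min(N_states - 1, s + 1) if rng.random() < p_x else max(0, s - 1)
        ones += s >= N_states // 2     # output mapping, Eq. (38)
    print(p_x, ones / trials)          # sharp transition near p_x = 0.5
```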
Fig. 36 Simulation results for a stochastic FSM-based tanh circuit with N = 8, 16, and 32. Dotted
curves indicate ideal behavior, solid curves indicate measured results
pS = 0.5
f(S) = S mod 2 if S < N/2, (S + 1) mod 2 if S ≥ N/2. (40)
Neural Networks
y = Σ from j = 0 to K−1 of wj xj , (41)
followed by a non-linear activation function, for example the logistic function
fA(y) = 1 / (1 + e^(−ky)). (42)
Other popular activation functions include tanh (which we saw in Example 23),
arctan and other functions with a sigmoid shape. Also popular are non-linear or
piecewise-linear functions, such as the “rectified linear unit” (ReLU), among others.
The ReLU and logistic activation functions are plotted in Fig. 41.
[Figure: a single neuron: inputs x1, . . . , xK−1 weighted by wj are summed to y, which passes through the activation fA(y) to produce q]
Fig. 41 Rectified linear unit (left) and logistic (right) activation functions
Fig. 42 Attenuation in a two-layer MUX-based stochastic adder (each layer selects with pS = 0.5)
Fig. 43 An OR gate as an approximate unipolar adder when probabilities are small (pA, pB ≪ 1):
pQ ≈ pA + pB
Fig. 44 Stochastic finite state machine design for the rectified linear unit function
Since the topology is not linear, this FSM needs a slightly different implementation
from the up/down counter circuit used in section “Image Processing”. The FSM
implementation was simulated for ten thousand samples per data point to obtain the
results shown in Fig. 45. The FSM behavior is close to the ReLU function, but is not
perfectly discontinuous. It could be said to behave more like a “soft” ReLU.
[Fig. 45: simulated output pQ versus input for the FSM-based ReLU circuit of Fig. 44]
Fig. 46 (a) Formation of conductive filaments due to ion migration in a resistance switching
device. (b) Experimental results showing stochastic sub-threshold switching as reported in [3]
Time (ms)
Fig. 47 Distribution of resistance switching delay due to sub-threshold pulsing, as reported in [3]
Conclusion
References
1. Al-Shedivat, M., Naous, R., Cauwenberghs, G., Salama, K.: Memristors empower spiking
neurons with stochasticity. IEEE Journal on Emerging and Selected Topics in Circuits and
Systems 5(2), 242–253 (2015). https://doi.org/10.1109/JETCAS.2015.2435512
2. Alaghi, A., Qian, W., Hayes, J.P.: The promise and challenge of stochastic computing. IEEE
Transactions on Computer-Aided Design of Integrated Circuits and Systems 37(8), 1515–1531
(2018). https://doi.org/10.1109/TCAD.2017.2778107
3. Gaba, S., Sheridan, P., Zhou, J., Choi, S., Lu, W.: Stochastic memristive devices for computing
and neuromorphic applications. Nanoscale 5(13), 5872–5878 (2013)
4. Gaudet, V.C., Rapley, A.C.: Iterative decoding using stochastic computation. Electronics
Letters 39(3), 299–301 (2003). https://doi.org/10.1049/el:20030217
5. Huang, K.L., Gaudet, V.C., Salehi, M.: Trapping sets in stochastic LDPC decoders. In: 2015 49th
Asilomar Conference on Signals, Systems and Computers, pp. 1601–1605 (2015). https://doi.
org/10.1109/ACSSC.2015.7421418
6. Knag, P., Lu, W., Zhang, Z.: A native stochastic computing architecture enabled by memristors.
IEEE Transactions on Nanotechnology 13(2), 283–293 (2014). https://doi.org/10.1109/
TNANO.2014.2300342
7. Li, P., Lilja, D.J., Qian, W., Bazargan, K., Riedel, M.D.: Computation on stochastic bit streams
digital image processing case studies. IEEE Transactions on Very Large Scale Integration
(VLSI) Systems 22(3), 449–462 (2014). https://doi.org/10.1109/TVLSI.2013.2247429
8. Li, P., Lilja, D.J., Qian, W., Riedel, M.D., Bazargan, K.: Logical computation on stochastic bit
streams with linear finite-state machines. IEEE Transactions on Computers 63(6), 1474–1486
(2014). https://doi.org/10.1109/TC.2012.231
9. Onizawa, N., Katagiri, D., Gross, W.J., Hanyu, T.: Analog-to-stochastic converter using
magnetic tunnel junction devices for vision chips. IEEE Transactions on Nanotechnology
15(5), 705–714 (2016). https://doi.org/10.1109/TNANO.2015.2511151
10. Smithson, S.C., Boga, K., Ardakani, A., Meyer, B.H., Gross, W.J.: Stochastic computing can
improve upon digital spiking neural networks. In: 2016 IEEE International Workshop on Signal
Processing Systems (SiPS), pp. 309–314 (2016). https://doi.org/10.1109/SiPS.2016.61
11. Tehrani, S.S., Mannor, S., Gross, W.J.: Fully parallel stochastic LDPC decoders. IEEE
Transactions on Signal Processing 56(11), 5692–5703 (2008). https://doi.org/10.1109/TSP.
2008.929671
12. Tehrani, S.S., Naderi, A., Kamendje, G.A., Hemati, S., Mannor, S., Gross, W.J.: Majority-
based tracking forecast memories for stochastic LDPC decoding. IEEE Transactions on Signal
Processing 58(9), 4883–4896 (2010). https://doi.org/10.1109/TSP.2010.2051434
Accuracy and Correlation in Stochastic
Computing
Fig. 1 Structure of a generic stochastic circuit annotated with known sources of inaccuracy
many forms. Cross correlation, or simply correlation, occurs between two or more
non-independent SNs. For example, the SN X = 10111010 is highly correlated in
a negative sense with Y1 = 01000101, since their 1s and 0s never overlap. The
SN Y2 = 10011000 is also correlated with X because its 1s always overlap the
1s of X. The SN Y3 = 01011101, which is generated by rotating or shifting X
to the right by one bit, is not significantly cross correlated with X, but the one-
cycle-delayed version of Y3 is. Cross correlation may change the functionality
of both combinational and sequential stochastic circuits by favoring certain input
patterns. On the other hand, temporal correlation, or autocorrelation, refers to
correlation between a bit-stream or part of a bit-stream and a delayed version of
itself. For instance, Y4 = 011001110 contains some autocorrelation due to the fact
that 01 is always followed by 1. Autocorrelation can severely affect the functionality
of a sequential stochastic circuit by biasing it towards certain state-transition
behavior.
Defining and measuring correlation is surprisingly difficult. A survey made in
2010 by Choi et al. [11] catalogs 76 different correlation metrics developed in
different fields over many years, none of which is well suited to SC! Relatively
easy to define is the independence or no-correlation assumption, which allows a
stochastic circuit C’s SN inputs to be treated as Bernoulli processes, and the function
of C to be expressed and analyzed using basic probability theory. For example, if
two independent SNs X1 and X2 of value X1 and X2 , respectively, are applied to
an AND gate, the output value Z is the arithmetic product X1 X2 . This reflects the
fact that the probability of the AND gate outputting a 1 is the probability of a 1 at
the first input multiplied by the probability of a 1 at the second input, provided the
inputs are not cross correlated. If X1 and X2 are correlated, Z can deviate from X1 X2
in complex ways, as we will see shortly.
Random number sources (RNSs) play a central role in the design and operation
of stochastic circuits. They provide the stochasticity needed by stochastic number
generators (SNGs) to produce SNs with a sufficient level of independence, but they
are a big contributor to overall hardware cost [25]. SC designers generally rely on
linear feedback shift registers (LFSRs) as RNSs because of their relatively small
size and low cost. An LFSR is a deterministic finite-state machine (FSM) whose
behavior is pseudo-random, meaning that it only approximates a true random source
[14]. An SC designer must usually optimize the use of RNSs in a way that provides
sufficient randomness while meeting a cost budget.
SC is a type of approximate computing and trades off computational errors for
other benefits. It has several error sources, as shown in Fig. 1. These error sources
are peculiar to SC and do not include physical errors due to unreliable hardware or
soft errors caused by environmental effects like cosmic radiation [8]. The errors in
question are briefly summarized next.
Rounding Errors Errors caused by rounding or quantization reflect the fact that
with N bits, a bit-stream can only represent exactly the N + 1 numbers in the set
SN = {0, 1/N, 2/N, . . . , (N−1)/N, 1}. If a desired number X is not in this set, then
it must be rounded off to the nearest member of SN . For instance, with N = 16 and
X = 0.1555, we can round X down to 2/16 = 0.1250 or round it up to 3/16 = 0.1875.
Rounding errors can be mitigated by increasing N to
expand SN . Note, however, that N must be doubled just to add 1 bit of precision to
the numbers in SN .
Approximation Errors These errors result from the fact that most arithmetic
functions of interest cannot be implemented exactly by a stochastic circuit. As a
result, they must be approximated by stochastic functions that are implementable.
All stochastic function values must be scaled to lie in the unit interval [0,1].
Without constant Ri ’s as inputs, the only single-variable stochastic functions that
can be combinationally realized exactly are the trivial cases X and 1−X. Hence,
common arithmetic functions like X², √X and sin(X) must be approximated by
some synthesizable stochastic function of the form Z(X, R1 , R2 , . . . , Rk ). Only a
few general methods for finding such functions are known; all are relatively complex
and have particular design styles [4, 26]. For example, the ReSC synthesis method
employs Bernstein polynomials with constant coefficients in the unit interval to
approximate Z(X) [25].
Random Fluctuations The (pseudo) random nature of the bits forming an N-bit
SN X as it emerges from an SNG is also a major error source. Fluctuations in X’s
bit-pattern cause its estimated or measured value X̂ to deviate from the target or
exact value X. Since X can have any of 2^N different bit-patterns, X̂ and X can differ
significantly, especially when N is small. Figure 2 shows how three SNG-generated
SNs fluctuate around their target value 0.5 as N changes. Such random fluctuation
errors can be quantified by the mean square error (MSE) EX = E[(X̂ − X)²]. Like
Fig. 2 Random fluctuations in three SNs with the exact value 0.5 as bit-stream length N
increases [31]
rounding errors, random fluctuation errors tend to diminish with increasing N. Note,
however, that when N is odd, X̂ must differ from X = 0.5 by at least one bit. Hence as
N increases toward infinity, the graphs plotted in Fig. 2 continue to oscillate around
0.5 with a steadily decreasing MSE that approaches, but never reaches, zero.
Constant-Induced Errors It was recently observed that the ancillary SNs R1 , R2 ,
. . . , Rk (see Fig. 1) found in most SC designs are an unexpected and significant
error source [31]. This is because their influence on the output value Z is subject
to time-dependent random variations. Interestingly, constant-induced errors can
be eliminated completely by removing the Ri ’s and transferring their function to
sequential subcircuits inside C that track the behavior of the Ri ’s. A systematic
algorithm called CEASE has been devised for efficiently removing constants and
the errors they produce [31].
Correlation To maintain accuracy, it is often desirable that the bit-streams applied
to a stochastic circuit retain their independence as they are being processed. This
independence is reduced by correlation from several sources including: interactions
among bit-streams during normal computation that introduce dependencies and
similarities, poor randomness properties of individual RNSs that cause successive
bits to be related, sharing of RNSs either directly or indirectly across the SNGs to
reduce overall hardware costs, and temporal dependencies injected by sequential
circuits. As a result, correlation errors tend to increase with circuit size and the
number of layers of processing. They cannot be eliminated merely by increasing
bit-stream length N.
At this point, we see that the accuracy of a stochastic circuit is impacted by many
loosely related factors that are addressed by many different methods and are by no
means fully understood. Correlation is amongst the most intractable of these factors.
Figure 3 illustrates an example of how cross correlation can introduce errors and
how to appropriately fix such errors. The problem here is to design a stochastic
squarer to compute X2 using the standard AND-gate-based multiplier described
previously. To use it for squaring requires two independent, and therefore different,
bit-streams with the same value X. This may be achieved by generating the bit-
streams from two independent RNSs. However, the design of Fig. 3a uses a
single input bit-stream X that fans out into two identical, and therefore highly
correlated copies that have a shared RNS and re-converge at the AND gate.
Consequently, Z = X instead of X2 . This illustrates correlation due to RNS sharing
and reconvergent fanout.
Figure 3b, c shows two ways to mitigate the correlation problem. The circuit in
Fig. 3b converts one copy of X from stochastic to binary and then back to stochastic
again using a new RNS; this process is known as regeneration. As a result, the AND
gate sees two independent SNs of value X and so computes a good approximation
to X2 . The design of Fig. 3c employs a D flip-flop called an isolator [13] to delay
one copy of X by a clock cycle. Instead of seeing the same bit X(t) twice in clock
cycle t, the AND gate sees X(t) and X(t−1), which are independent by the Bernoulli
Fig. 3 Stochastic squarer designs: (a) a single bit-stream X fanned out to both inputs of an AND
gate, which incorrectly computes Z = X; (b) decorrelation by regeneration using a new RNS R;
(c) decorrelation by isolation using a D flip-flop
property. This method of decorrelation is termed isolation and is usually much less
expensive than regeneration [29].
Some stochastic operations, notably the scaled addition of Eq. (1) implemented
by a multiplexer, do not require their inputs to be independent. Such circuits are
said to be correlation insensitive (CI) [5]. The CI property allows the two input SNs
X1 and X2 of the adder to share a common RNS without producing correlation-
based errors of the type illustrated by Fig. 3a. This can be explained by the fact that
the adder’s output bit Z(t) at clock cycle t is either X1 (t) or X2 (t), so there is no
interaction between the two data inputs.
While correlation usually reduces the accuracy of stochastic circuits, in some
cases its deliberate use can change a circuit’s function to a new one that is advan-
tageous in some way [2]. For example, an XOR (exclusive-OR) gate supplied with
uncorrelated inputs X1 and X2 realizes the not-so-useful function X1 + X2 − X1 X2 .
If the inputs are positively correlated by enforcing maximum overlap of 1s, the XOR
realizes the absolute difference function |X1 − X2 |. This has been used to design an
edge-detector for image processing that contains orders of magnitude fewer gates
than a comparable non-stochastic circuit [2]. Correlation is similarly used in the
design of a stochastic division circuit called CORDIV that has accuracy advantages [10].
The design and optimization of RNSs for correlation management are also
an important issue in SC [2, 23]. The problems fall into two categories: (1)
designing RNSs and SNGs to generate bit-streams with desirable cross correlation
and autocorrelation properties, and (2) strategically reducing the use of RNSs to
decrease hardware cost while maintaining moderate independence requirements for
SNs. The latter problem usually requires inexpensive re-randomization techniques
and can take advantage of any CI properties for RNS sharing. Making effective use
of correlation in SC is by no means well understood and is a subject of on-going
research.
The rest of the chapter is organized as follows. Section “Measuring Correlation”
reviews the SC correlation metric for correlation measurement and describes how
Measuring Correlation
An early effort to quantify correlation for SC was made by Jeavons et al. [16].
Instead of directly providing a correlation measure for SC, they define two SNs
X and Y as independent or uncorrelated if the value of the SN Z obtained
from ANDing X and Y is XY. This definition effectively says that two SNs
are independent if a stochastic multiplier can compute their product accurately.
Obviously, this definition of independence assumes the computation to be otherwise
error-free, i.e., it has no random fluctuation errors, rounding errors, etc. However, it
is rarely the case that Z’s value is exactly XY, even when X and Y are generated
using independent RNSs. With only this definition of independence, it remains
challenging to quantify the behavior of stochastic circuits under different levels of
correlation.
Table 1 shows how the function of an AND-based multiplier changes under the
influence of correlation. The multiplier performs as expected when the inputs X
and Y are independent. However, it computes Z = min(X, Y) when X and Y are
Table 1 SC functions implemented by a two-input AND gate with different levels of input SN
correlation

Correlation level     | X              | Y               | X∧Y              | Function
Uncorrelated          | 01010101 (0.5) | 11110011 (0.75) | 01010001 (0.375) | X × Y
Positively correlated | 11110000 (0.5) | 11111100 (0.75) | 11110000 (0.5)   | min(X, Y)
Negatively correlated | 11110000 (0.5) | 00111111 (0.75) | 00110000 (0.25)  | max(0, X + Y − 1)
maximally correlated in the positive sense, i.e., when the 1s in X and Y overlap as
much as possible. On the other hand, it computes Z = max(0, X + Y − 1) when
the 1s in X and in Y overlap as little as possible. Instead of using vague terms like
maximally correlated or negatively correlated, it is desirable to be able to rigorously
quantify correlation for SC. Unfortunately, none of the 76 correlation measures
summarized in [11] perfectly fits the needs of SC, including the Pearson correlation
measure ρ which is widely used in statistical analysis. Pearson correlation presents
a problem for SC, because its value depends on the actual value of the bit-
streams being compared. For example, the maximum Pearson correlation value
ρ = +1 implies that the bit-streams are identical. This means that bit-streams having
different values, even if their 1s maximally overlap, fail to attain the maximum value
of ρ.
A suitable correlation metric for SNs would yield the value +1 for maximal
overlap of 1s and 0s, the value −1 for minimal overlap of 1s and 0s, and the
value 0 for independent SNs. The metric should not be affected by the actual values
of the SNs, and should also provide an intuitive functional interpolation for
correlation values other than +1, −1, and 0.
The correlation measure called the SC correlation coefficient or stochastic cross
correlation (SCC) has been proposed to fit SC’s needs [2]. For a pair of SNs X and
Y, SCC is defined as follows
\mathrm{SCC}(X, Y) =
\begin{cases}
\dfrac{p_{X \wedge Y} - p_X\, p_Y}{\min(p_X, p_Y) - p_X\, p_Y}, & \text{if } p_{X \wedge Y} > p_X\, p_Y \\[2ex]
\dfrac{p_{X \wedge Y} - p_X\, p_Y}{p_X\, p_Y - \max(p_X + p_Y - 1,\, 0)}, & \text{otherwise}
\end{cases}
\qquad (2)
where the numerator p_{X∧Y} − p_X p_Y = (N_{11}N_{00} − N_{10}N_{01})/N² is common to many
correlation measures, including Pearson correlation. Here N_{ij} denotes the number of
bit positions in which X = i and Y = j, and N is the bit-stream length.
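To make Eq. (2) concrete, the following Python sketch (ours, not from [2]) computes SCC directly from a pair of bit-streams; the three calls reproduce the rows of Table 1:

```python
def scc(x, y):
    """Stochastic cross correlation (SCC) of two equal-length bit-streams,
    computed directly from Eq. (2)."""
    n = len(x)
    p_x = sum(x) / n
    p_y = sum(y) / n
    p_and = sum(a & b for a, b in zip(x, y)) / n   # value of the ANDed streams
    num = p_and - p_x * p_y
    if num > 0:
        denom = min(p_x, p_y) - p_x * p_y
    else:
        denom = p_x * p_y - max(p_x + p_y - 1, 0)
    return 0.0 if denom == 0 else num / denom

print(scc([0,1,0,1,0,1,0,1], [1,1,1,1,0,0,1,1]))  #  0: independent
print(scc([1,1,1,1,0,0,0,0], [1,1,1,1,1,1,0,0]))  # +1: 1s maximally overlap
print(scc([1,1,1,1,0,0,0,0], [0,0,1,1,1,1,1,1]))  # -1: 1s minimally overlap
```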
The major difference between SCC and ρ lies in the denominator. SCC normalizes
the measure in a way that maximally positively/negatively correlated SNs
produce a +1/−1 correlation value. Pearson correlation, on the other hand,
is normalized by the variance of the bit-streams, which does depend on the value of
the bit-streams.
Table 2 shows some examples of SN pairs and compares their ρ and SCC values.
Note that ρ and SCC are the same for independent SNs and for SNs with equal
values. When the SNs have different values, SCC consistently gives the values
+1 and −1 when the maximal overlap and minimal overlap of 1s and 0s occur,
respectively.
The SCC metric of correlation provides a precise way to define a circuit’s
stochastic behavior under the influence of various (cross) correlation levels. It
further allows us to explore new SC designs enabled by intentionally introducing
non-zero correlations. Figure 4 shows a pair of SNs X and Y having SCC(X,
Y) = +1 applied to an XOR gate, which computes X + Y − 2XY if X and Y are
independent. The correlation between the inputs changes the circuit’s functionality
to the potentially more useful absolute difference function, which leads to a highly
efficient way of implementing edge detection in SC-based vision chips [6]. This
illustrates the usefulness of deliberately injected correlation in designing stochastic
circuits.
So far, we have only discussed cross correlation between SNs. Autocorrelation in
stochastic circuits is much less well understood. Except for the standard autocorrelation
metric used in signal processing, an autocorrelation measure that is suitable for
SC appears to be lacking. Almost all existing SC designs therefore assume the
input SNs to be free of autocorrelation.
Fig. 4 XOR gate with maximally positively correlated inputs, implementing the absolute-difference subtraction function |X − Y| [2]: inputs X = 01101110 (5/8) and Y = 01001110 (4/8) yield Z = 00100000 (1/8)
Table 1 shows the functionality of the AND gate at SCC 0, +1, and −1. To derive
the stochastic function of the AND gate at any other SCC level, we need to calculate
the linear combination of the function at SCC = 0 and the function at SCC = +1
or −1, depending on the direction of the correlation [2]. For instance, the AND gate
with SCC = 0.5 implements the function Z = 0.5(min(X, Y) + XY). In the general
case, if we have a circuit implementing a two-input Boolean function z = f (x, y)
with input SNs X and Y having arbitrary correlation level SCC, the value of SN Z
at the output of the circuit will be
Z =
\begin{cases}
(1 + \mathrm{SCC})\, F_0 - \mathrm{SCC}\, F_{-1}, & \text{if } \mathrm{SCC}(X, Y) < 0 \\
(1 - \mathrm{SCC})\, F_0 + \mathrm{SCC}\, F_{+1}, & \text{otherwise}
\end{cases}
\qquad (3)

Here F_0, F_{−1}, and F_{+1} denote the stochastic functions implemented by the same
circuit at SCC levels 0, −1, and +1, respectively. Using probabilistic transfer
matrices (PTMs), Alaghi and Hayes [2] show that for any two-input combinational
circuit, we can derive F0 , F−1 , and F+1 via the following matrix multiplication
[i_0 \; i_1 \; i_2 \; i_3] \cdot [t_0 \; t_1 \; t_2 \; t_3]^T
in which the tk ’s denote the truth table of the corresponding Boolean function and
the ik ’s are obtained from Table 3. As an example, suppose we want to derive the
stochastic function implemented by an XOR gate at SCC levels 0 and +1. The
truth table PTM of the XOR gate is [0 1 1 0]^T, so we will have F_0 = (1 − X)·Y + (1 − Y)·X
and F_{+1} = max(Y − X, 0) + max(X − Y, 0) = |X − Y|. To find
the stochastic function of the XOR gate with SCC = 0.25, we simply calculate the
linear combination F0.25 = 0.75F0 +0.25F+1 .
Table 3 PTM elements used to derive the stochastic function of a two-input combinational circuit at SCC levels 0, −1, and +1 [2]

     | F_0 (SCC = 0)   | F_{−1} (SCC = −1)  | F_{+1} (SCC = +1)
i_0  | (1 − X)(1 − Y)  | max(1 − X − Y, 0)  | min(1 − X, 1 − Y)
i_1  | (1 − X)·Y       | min(1 − X, Y)      | max(Y − X, 0)
i_2  | X·(1 − Y)       | min(1 − Y, X)      | max(X − Y, 0)
i_3  | X·Y             | max(X + Y − 1, 0)  | min(X, Y)
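To see how Table 3 and Eq. (3) combine in practice, the following Python sketch (ours, written from the equations above rather than taken from [2]) evaluates the stochastic function of any two-input gate at an arbitrary SCC level:

```python
def stochastic_value(truth_table, X, Y, scc):
    """Output SN value of a two-input combinational circuit with input values
    X, Y in [0, 1] at correlation level scc, per Eq. (3).
    truth_table = [t0, t1, t2, t3] for input patterns (x, y) = 00, 01, 10, 11."""
    # PTM elements from Table 3 at SCC levels 0, -1, and +1.
    i_scc0 = [(1 - X) * (1 - Y), (1 - X) * Y, X * (1 - Y), X * Y]
    i_neg1 = [max(1 - X - Y, 0), min(1 - X, Y), min(1 - Y, X), max(X + Y - 1, 0)]
    i_pos1 = [min(1 - X, 1 - Y), max(Y - X, 0), max(X - Y, 0), min(X, Y)]
    dot = lambda i, t: sum(a * b for a, b in zip(i, t))
    F0, Fneg, Fpos = (dot(v, truth_table) for v in (i_scc0, i_neg1, i_pos1))
    if scc < 0:
        return (1 + scc) * F0 - scc * Fneg
    return (1 - scc) * F0 + scc * Fpos

XOR = [0, 1, 1, 0]
print(stochastic_value(XOR, 0.625, 0.5, 0.0))   # F0 = X + Y - 2XY = 0.5
print(stochastic_value(XOR, 0.625, 0.5, 1.0))   # F+1 = |X - Y| = 0.125
AND = [0, 0, 0, 1]
print(stochastic_value(AND, 0.5, 0.75, 0.5))    # 0.5*(min(X,Y) + XY) = 0.4375
```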
Deriving the stochastic function of circuits or Boolean functions with more than
two inputs is not trivial, because SCC does not extend easily to multiple inputs.
The most convenient way to quantify correlation among more than two
inputs is to use PTMs, which enumerate the probability distribution of
all combinations of 0s and 1s among the signals. However, a systematic method of
handling multi-input functions with arbitrary SCC levels is not known, except in a
few special cases. One such case is when all a function’s inputs are independent [1].
When all the inputs are maximally positively correlated with SCC = +1, we may
also be able to derive the circuit’s stochastic function. For instance, a k-input AND
gate with maximally correlated inputs X1 , X2 , . . . , Xk implements the function
min(X1 , X2 , . . . , Xk ).
Correlation-Controlling Units
Regeneration-Based Decorrelation
Perhaps the most direct way to eliminate correlation is through regeneration, where
SNs are first converted to binary form using stochastic-to-binary converters, and
then are converted back to SNs by SNGs with suitably independent RNSs. A
regenerated SN has a value which is the same as, or very close to, its original value.
However, the positions of its 1s are expected to be different.
An example of regeneration-based decorrelation is shown in Fig. 3b, where the
goal is to produce one of the two copies of X using an RNS that is independent
of the original X. In this example, it is sufficient to regenerate X such that the two
inputs of the multiplier are not cross correlated, as the multiplier is a combinational
circuit.
Shuffle-Based Decorrelation
Fig. 5 Shuffle-based decorrelator of depth 3, where R is a random number uniformly distributed among 0, 1, 2, and 3 [19]
Isolation-Based Decorrelation
Unlike the aforementioned decorrelation methods, isolation does not alter the
positions of 0s and 1s in the SN. It was proposed in the 1960s [13] mainly to cope
with cross correlation by adding appropriate delays to SNs. The added delays shift
SNs temporally so that correlated bits from different SNs are staggered. An example
of isolation-based decorrelation appears in Fig. 3c, where the isolator (a delay
element implemented by a D flip-flop) is inserted into one of the two inputs of the
squarer. By delaying one copy of X by one clock cycle, the output Z(t) = p(X(t) = 1,
X(t − 1) = 1) = p(X(t) = 1)p(X(t − 1) = 1), so Z = X2 as expected, provided that
X(t) and X(t − 1) are statistically independent for all t, as asserted by the Bernoulli
property.
The major advantages of isolation are very low hardware cost and low latency,
compared to regeneration. However, the application of isolators tends to be difficult.
Carelessly placing isolators in a stochastic circuit can lead to several problems, such
as failure to decorrelate correctly and unexpected changes to the circuit's function.
These problems occur when the placement fails to track and delay correlated signals
properly on some signal lines; moreover, isolators can inject undesired autocorrelation
into the circuit, and some isolators can turn out to be unnecessary. Figure 6a shows
a stochastic circuit that is intended to compute X4 by naïvely cascading two squarer
circuits of the kind in Fig. 3c. While this construction appears to make sense at the
first sight, the resulting circuit does not compute X4 as expected; instead, it computes
Z = X3 , a huge functional error! To see this, observe that at time t, the output of the
first AND gate is X(t) ∧ X(t − 1), and therefore the inputs to the second AND
gate are Y1 (t) = X(t) ∧ X(t − 1) and Y2 (t) = X(t − 1) ∧ X(t − 2). By ANDing
these two bit-streams, we get the final output as Z(t) = Y1 (t) ∧ Y2 (t) = X(t) ∧
X(t − 1) ∧ X(t − 2), implying that Z = XXX = X3 . The cause of this error is
unanticipated autocorrelation. Note that the squarer is implemented by an AND gate
and an isolator, which effectively makes the circuit sequential. The adjacent bits of
the squarer's output bit-stream are correlated. Therefore, delaying this bit-stream by
only one clock cycle yields a cross-correlated SN. A correct implementation of X4
is given in Fig. 6b, where the second squarer has two isolators inserted in the bottom
input line.

Fig. 6 Cascades of two AND-based squarers: (a) the naïve design, where Y1(t) = X(t)X(t−1), Y2(t) = X(t−1)X(t−2), and Z(t) = X(t)X(t−1)X(t−2); (b) the corrected design with two isolators, where Y2(t) = X(t−2)X(t−3) and Z(t) = X(t)X(t−1)X(t−2)X(t−3)
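The failure mode is easy to reproduce in simulation. The following sketch (ours; it assumes an ideal Bernoulli input stream) models both cascades of Fig. 6 and shows the naïve one converging to X3 while the corrected one converges to X4:

```python
import random

def cascade(x, extra_isolator=False):
    """Two cascaded AND-based squarers. The second squarer's bottom input is
    delayed by one isolator (naive, Fig. 6a) or two isolators (Fig. 6b)."""
    d = 2 if extra_isolator else 1
    z = []
    for t in range(len(x)):
        y1 = x[t] & (x[t - 1] if t >= 1 else 0)   # first squarer's output
        s = t - d                                  # delayed copy of y1
        y2 = (x[s] & x[s - 1]) if s >= 1 else 0
        z.append(y1 & y2)
    return sum(z) / len(z)

random.seed(1)
p = 0.8
x = [1 if random.random() < p else 0 for _ in range(200000)]
print(cascade(x, False), p ** 3)  # naive cascade tracks X^3 = 0.512
print(cascade(x, True),  p ** 4)  # corrected cascade tracks X^4 = 0.4096
```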
Generally speaking, isolators must be inserted in a way that all undesired
correlations between interacting SNs are eliminated. Finding a correct isolator
placement while minimizing the isolator usage is a challenging problem. An isolator
insertion algorithm called VAIL has been proposed for combinational stochastic
circuits [29]. It formulates isolator insertion as a linear integer program, where
the objective is to minimize the isolator count. A set of constraints are enforced
on the number of isolators that can be placed on each line of the circuit to be
decorrelated. These constraints, when satisfied, ensure that undesired correlation
between interacting SNs is removed without affecting other SN interactions.
While almost all stochastic circuits are designed to work with uncorrelated inputs,
there exist circuits implementing useful functions enabled by positively or nega-
tively correlated inputs. For example, if the XOR gate in Fig. 4 is used to compute
absolute difference, it requires its two inputs to be maximally correlated. To generate
inputs with predetermined correlation for such circuits, one can resort to special
types of SNGs that are capable of controlling the amount of correlation. However,
regenerating SNs with specific correlation levels in the middle of an SC system is
expensive, both in hardware cost and in system latency.
In error-tolerant SC applications such as many machine-learning and image-
processing tasks, another way to inject correlation is to use a sequential unit called
a synchronizer, which attempts to maximize the correlation level between a pair
of SNs [19]. This approach, while providing no guarantee of attaining the desired
correlation, is usually far less expensive than regeneration in terms of hardware and
latency cost. Figure 7a shows the state-transition graph of a three-state synchronizer,
whose key idea is to align the bits with the same value from inputs X and Y as much
as possible. For example, when the synchronizer receives the pattern X(t)Y(t) = 01,
it will output 00 and then go from state S0 to S2 , which remembers the 1 received
from Y for later release. If X(t)Y(t) = 10 is received, then the synchronizer will
return to S0 and output 11. This effectively transforms X(t)Y(t) = (01, 10) to (00,
11), which has obviously become more correlated.
Observe that the synchronizer in Fig. 7a does not guarantee that its outputs
will have exactly the same value as X and Y. This synchronizer-induced error
occurs when the computation ends at any state other than S0 , and hence there are
some remembered bits yet to be released into the outputs. Also, the synchronizer
only increases the correlation level; it does not guarantee that the output will
be maximally correlated. In fact, it does not provide any promises on the final
correlation level of the outputs. This is because this synchronizer can only remember
one unreleased bit from either X or Y.

Fig. 7 State-transition graphs for correlation-controlling units that inject correlation between a pair of SNs: (a) synchronizer that increases SCC; (b) desynchronizer that reduces SCC [19]

Thus, at state S0, if two consecutive bit
patterns XY = (01, 01) are received, the synchronizer will have no choice but to
release a 1 from Y without matching it with another 1 from X. In that case, the output
will be (00, 01), and the synchronizer will end at state S2 . In general, increasing the
number of states allows the synchronizer to remember more yet-to-be-aligned bits,
and hence can produce outputs that are more correlated. But this comes at the cost of
more synchronizer-induced error, because the probability of ending at a state other
than the initial state is higher.
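The following Python sketch is a behavioral model of such a three-state synchronizer (our reading of the FSM described above, not code from [19]):

```python
def synchronize(x, y):
    """Behavioral model of a three-state synchronizer: state 0 = empty,
    1 = a 1 from X is buffered, 2 = a 1 from Y is buffered. Lone 1s are
    buffered and released once they can be paired with a 1 on the other input."""
    state, out_x, out_y = 0, [], []
    for a, b in zip(x, y):
        if a == b:                                  # 00 / 11 pass through
            out_x.append(a); out_y.append(b)
        elif state == 0:                            # buffer the lone 1
            out_x.append(0); out_y.append(0)
            state = 1 if a else 2
        elif (state == 1 and b) or (state == 2 and a):
            out_x.append(1); out_y.append(1)        # release, now aligned
            state = 0
        else:                                       # buffer occupied: pass as-is
            out_x.append(a); out_y.append(b)
    return out_x, out_y

# (01, 10) becomes (00, 11), as described in the text:
print(synchronize([0, 1], [1, 0]))   # ([0, 1], [0, 1])
```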
Based on the synchronizer concept, we can push the SCC of two SNs towards −1
using a desynchronizer. The state-transition graph of a four-state desynchronizer is
depicted in Fig. 7b. Like the synchronizer, the desynchronizer takes two input SNs X
and Y, and generates two output SNs with the same value but with stronger negative
correlation or an SCC closer to −1. The key idea in the desynchronizer design is
to intentionally misalign bits of the same value while still preserving the encoded
SN value. To do this, the desynchronizer selectively absorbs and releases bits to
maximize the occurrence of the patterns XY = (10) and (01), and minimize the
occurrence of the patterns XY = (11) and (00). If the desynchronizer receives the
pattern XY = (11), it will pass one of the bits and save the other bit to emit later.
In the desynchronizer design shown in Fig. 7b, the FSM alternates between storing
X and Y when it receives XY = (11) but alternative variants are possible. When
the desynchronizer receives the pattern XY = (00) it will emit the stored bit in the
FSM to misalign the bits. If the desynchronizer receives the pattern XY = (01) or
(10), it will simply pass the inputs to the outputs since the bits at that SN offset
Accuracy and Correlation in Stochastic Computing 93
are already different. This effectively yields more negatively correlated SNs. For
instance, the input pattern XY = (11, 00) becomes XY = (01, 10) after passing
through the desynchronizer.
The desynchronizer has similar tradeoffs to the synchronizer. Bits that get saved
in the desynchronizer may not be emitted before the end of execution which can
yield a slight negative bias. Notice also that the desynchronizer FSM can only save
one bit at a time. As a result, there are cases where it may be forced to pass the
pattern XY = (11) or (00). For instance, if the desynchronizer receives the pattern
XY = (11, 11) it will output (01, 11). In this case, the desynchronizer absorbs a bit
from the first occurrence of XY = (11) but not from the second XY = (11). This
forces the desynchronizer to simply pass XY = (11) to the output on the second
occurrence. This limitation can be addressed by augmenting the desynchronizer to
allow it to absorb more bits to improve its efficacy. Again, this increases the potential
error due to bits that get saved in the FSM but are not released before the end of
execution.
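A matching behavioral model of the desynchronizer (again our sketch, implementing the one-bit buffer and the alternation behavior described above rather than the exact FSM of Fig. 7b) is:

```python
def desynchronize(x, y):
    """On input 11, absorb one of the two 1s (alternating which stream it is
    owed to) if the buffer is free; on 00, release any owed 1; pass 01 and 10."""
    saved = None          # None, 'x', or 'y': which output is owed a 1
    next_side = 'x'       # alternate which bit gets stored on 11
    out_x, out_y = [], []
    for a, b in zip(x, y):
        if a == 1 and b == 1:
            if saved is None:                     # absorb one of the two 1s
                saved, next_side = next_side, ('y' if next_side == 'x' else 'x')
                out_x.append(0 if saved == 'x' else 1)
                out_y.append(0 if saved == 'y' else 1)
            else:                                 # buffer full: forced to pass 11
                out_x.append(1); out_y.append(1)
        elif a == 0 and b == 0 and saved is not None:
            out_x.append(1 if saved == 'x' else 0)  # release the owed 1
            out_y.append(1 if saved == 'y' else 0)
            saved = None
        else:
            out_x.append(a); out_y.append(b)
    return out_x, out_y

print(desynchronize([1, 0], [1, 0]))  # (11, 00) -> ([0, 1], [1, 0]), i.e. (01, 10)
```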
To illustrate the strengths and weaknesses of each correlation manipulation
technique, consider an image processing pipeline which consists of a 3 × 3 Gaussian
blur followed by a Roberts Cross edge detector. The Gaussian blur kernel requires
input SNs for each multiply in the kernel to be uncorrelated, while the Roberts Cross
edge detector requires inputs to the subtractor to be positively correlated. Figure 8
shows the resulting image along with energy efficiency and average absolute error
for three different configurations: (1) no correlation correction between kernels,
(2) regeneration before the edge detector, and (3) synchronizers before the edge
detector. Absolute error is measured as the deviation from a floating-point baseline
implementation. The resulting image without any correlation correction at all clearly
suffers from significant accuracy losses. Using correlation controlling circuits like
regeneration or the synchronizer, on the other hand, leads to much more accurate
results. The synchronizer is more energy efficient and yields comparable accuracy
to regeneration.
Fig. 8 Image processing case study results (image outputs, energy efficiency, and average absolute error) for a Gaussian blur kernel followed by a Roberts Cross edge detector, comparing a floating-point baseline with the no-correction, regeneration, and synchronizer configurations [19]
Cross-Correlation Insensitivity
Fig. 9 Multiplexer computing the scaled addition Z = 0.5(X + Y); the example bit-streams are R = 01101001 (4/8), X = 00101011 (4/8), Y = 01111011 (6/8), and Z = 01101011 (5/8)
Two inputs xi and xj of a Boolean function z form a CI pair if

(dz/dxi) ∧ (dz/dxj) = 0 \qquad (4)

where 0 denotes the zero Boolean function and dz/dxi denotes the Boolean
difference of z with respect to xi, i.e., dz/dxi = z(xi = 0) ⊕ z(xi = 1).
A proof of Eq. (4) can be found in [5]; here we provide a brief intuitive
explanation. The Boolean difference dz/dxi is a Boolean function of (x1, x2, . . . ,
xn) whose minterms correspond to the input assignments such that a change of xi's
value will lead to a change of z's value. Therefore, Eq. (4) simply says that if there
is no input assignment such that xi's value change and xj's value change can each
change z's value, then xi and xj form a CI pair.
The preceding definition is useful for identifying CI pairs in a given stochastic
circuit. For example, recall that the multiplexer in Fig. 9 implements the function
Z = 0.5(X + Y) in the stochastic domain, and the function z = (x ∧ r) ∨ (y ∧ ¬r) in the
Boolean domain. Here x and y form a CI pair, because

dz/dx = (y ∧ ¬r) ⊕ (r ∨ y) = r
dz/dy = (x ∧ r) ⊕ (¬r ∨ x) = ¬r

and r ∧ ¬r = 0, so Eq. (4) is satisfied.
Fig. 12 Implementation of X4 in canonical SRB form: a 3-tap shift register feeding a combinational component [30]
process guarantees that there will be a single 1 released into the output, whenever
the adder receives two 1s from X or Y, thereby computing 0.5(X + Y). In general,
CEASE-generated circuits not only avoid the potential correlation problems induced
by ancillary inputs, but also are insensitive to autocorrelation. This is because the
number of 1s in the output is completely determined by the number of times the
modulo counter overflows, which is obviously independent of the ordering of the
input pattern.
Shift-Register-Based Circuits In general, sequential stochastic circuits have strict
correlation specifications on their inputs, which usually requires them to be
autocorrelation-free. However, sequential stochastic circuits also inject autocorrela-
tion into the SNs they process. This makes it difficult to cascade sequential designs,
since autocorrelation introduced by an upstream circuit will degrade the accuracy
of a downstream circuit. For example, sequential circuits employing the linear FSM
architecture [21] require their inputs to be autocorrelation-free, but at the same time
they produce output SNs with a high level of autocorrelation. It is therefore difficult
to connect multiple linear circuits without sacrificing accuracy. Autocorrelation
injected by a linear FSM has a diminishing but continuing effect over time. A current
output bit can be correlated with all previous output bits, although the correlation
level is lower with bits that are further away in time. This implies that when its input
value changes, the output of a linear FSM may take a very long time to respond to
the change, so the change can have an extended accuracy-reducing impact.
Thus, it is sometimes desirable to use alternative designs that have less severe
autocorrelation problems. There is a class of sequential stochastic circuits called
shift-register-based (SRB) which have a highly desirable property: their output
autocorrelation is bounded in time [30]. SRB circuits realize a type of FSM
termed a definite machine that has finite input memory [18]. They also have a
canonical implementation consisting of a feed-forward shift register built around
a combinational component. Many SC designs, including those generated by the
STRAUSS synthesizer [4], belong to the SRB class. For example, Fig. 12 shows
the canonical SRB implementation of X4 , which has a 3-tap shift register that
produces three delayed copies of the input SN X. SRB circuits have their output
autocorrelation bounded in time, because each output bit is completely determined
by the m most recent input bits, where m − 1 is the number of taps of the shift
register. Therefore, output bits that are separated by m clock cycles are determined
by different and independent sets of input bits, and hence must be uncorrelated. In
the X4 example, Z(4) = X(4)X(3)X(2)X(1), while Z(8) = X(8)X(7)X(6)X(5), so
Z(8) and Z(4), which are four clock cycles apart, are uncorrelated. The definiteness
of the SRB circuits guarantees that m clock cycles after an input value change, the
output value will have fully responded to the change. Furthermore, it is possible to
sample the output SN every m cycles to get a completely autocorrelation-free bit-
stream, which facilitates the use of such SNs as inputs to circuits that must avoid
autocorrelation.
Random number sources provide the randomness to drive the dynamics of the
stochastic signals. RNSs with insufficient randomness can result in significant
accuracy loss for stochastic circuits that require independent inputs. This can occur,
for example, if a shared RNS is used to drive multiple SNGs for SN generation.
The quality of RNSs also plays an important role in the accuracy of SC. It has
been shown that, instead of using an RNS that has good randomness properties, like
an LFSR, using carefully designed deterministic number sequences can sometimes
result in significantly improved accuracy. For specialized circuits that work with
correlated inputs, SNs with any SCC level can be generated by interpolating
independent SNs and maximally correlated SNs with multiple independent RNSs.
The CI property also is important in SNG design, as it allows a single RNS to be
shared by multiple SNGs without compromising accuracy.
In most existing stochastic circuits, it is desirable to have SNGs that can generate
high quality uncorrelated SNs, i.e., SNs that have SCC = 0. In SNG design, arguably
the most common RNSs are obtained by tapping an LFSR of maximum period,
and are essentially pseudo-random. However, it has been shown that deterministic
number sources such as plain binary counters can also be used in SN generation
without compromising accuracy [17]. In fact, circuits that use such deterministic
number sources are usually more accurate than the ones using LFSRs, because
random fluctuation errors are eliminated and correlation control is easier.
To achieve fast convergence rates during a stochastic computation, researchers
have also looked into using quasi-Monte Carlo methods and low-discrepancy
sequences [3, 22]. While these methods provide good convergence when generating
a few uncorrelated SNs, they are affected by the curse of dimensionality and are no
better than counter-based SNGs. In many cases, the convergence rate of the SNs is
not relevant, and only the accuracy at the end of computation matters. In such cases,
Fig. 13 SNG that generates a pair of SNs with a user-specified SCC level [2]. The circuit combines three RNSs (R1, R2, R3) with comparators (output Z = 1 when A < B) and selection logic driven by the control signals SCCneg and SCCmag
RNSs provide the randomness required by stochastic circuits, and are a key design
resource. While there are stochastic systems that use non-LFSR-based RNSs,
LFSRs remain the popular choice due to their compatibility with digital logic
and their relatively small hardware area, on top of the fact that they have been
intensively studied for many years. The design considerations around deploying
RNSs include: providing adequate randomness, minimizing hardware overhead, and
reducing unnecessary use of the RNSs.
As discussed earlier, an SN X can be derived from its binary counterpart B by an
SNG containing an RNS and a comparator that compares B with the random
number R from the RNS at each clock cycle. The SNG outputs a 1 whenever
B > R; otherwise it outputs a 0. A common approach is to treat k taps
from the LFSR as the k-bit random number R. Sharing exactly the same random
number R with other SNGs can reduce overall hardware cost, but will result in a
maximally correlated SN which is usually undesirable. Previous work attempts to
squeeze out more randomness from a single LFSR by adding a re-wiring layer that
shuffles the order of the bits in R. In [15], the authors show that circularly shifting
R is a low-cost and effective way to reduce the SCC of two SNs sharing the same
LFSR. Specifically, they experimentally demonstrate that by circularly shifting a
k-bit random number by approximately k/2 bits, the SCC level can be reduced by
around 75%, compared to random shuffling which achieves only 40% reduction in
SCC on average. They further show that taking advantage of the CI property can
reduce the need for RNSs.
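The sketch below (ours; the tap positions are one standard choice for a maximal-length 8-bit LFSR, and the shift amount k = 4 follows the roughly k/2 rule above) shows a comparator-based SNG and the circular-shift sharing trick:

```python
def lfsr_states(seed=0b00000001):
    """8-bit maximal-length Fibonacci LFSR (taps 8, 6, 5, 4); yields all
    255 nonzero states, used as the shared random numbers R."""
    state = seed
    for _ in range(255):
        yield state
        fb = ((state >> 7) ^ (state >> 5) ^ (state >> 4) ^ (state >> 3)) & 1
        state = ((state << 1) | fb) & 0xFF

def sn_bits(b, rand_stream):
    """Comparator-based SNG: emit 1 whenever B > R (value approx. B/256)."""
    return [1 if b > r else 0 for r in rand_stream]

def rotate(r, k, n=8):
    """Circularly shift an n-bit random number by k bits (the re-wiring layer)."""
    return ((r << k) | (r >> (n - k))) & ((1 << n) - 1)

R = list(lfsr_states())
X = sn_bits(100, R)                           # SN with value approx. 100/256
Y = sn_bits(160, [rotate(r, 4) for r in R])   # shares the same LFSR via rotation
print(sum(X) / len(X), sum(Y) / len(Y))       # approx. 0.39 and 0.63
# Feeding Y from the rotated copy of R avoids the SCC = +1 that direct sharing causes.
```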
Conclusions
Abbreviations
CI Correlation insensitive
FSM Finite-state machine
LDPC Low density parity check code
LFSR Linear feedback shift register
MSE Mean square error
PTM Probabilistic transfer matrix
RNS Random number source
SC Stochastic computing
SCC Stochastic correlation coefficient
SCH Single-ended counter hysteresis
SN Stochastic number
SNG Stochastic number generator
SRB Shift register based
References
1. A. Alaghi and J.P. Hayes, “A Spectral Transform Approach to Stochastic Circuits,” Proc. Intl.
Conf. Computer Design (ICCD), pp. 315–321, 2012.
2. A. Alaghi and J.P. Hayes, “Exploiting Correlation in Stochastic Circuit Design,” Proc. Intl
Conf. on Computer Design (ICCD), pp. 39–46, Oct. 2013.
3. A. Alaghi and J.P. Hayes, “Fast and Accurate Computation Using Stochastic Circuits,” Proc.
Design, Automation, and Test in Europe Conf. (DATE), pp. 1–4, 2014.
4. A. Alaghi and J.P. Hayes, “STRAUSS: Spectral Transform Use in Stochastic Circuit Synthe-
sis,” IEEE Trans. CAD of Integrated Circuits and Systems, vol. 34, pp. 1770–1783, 2015.
5. A. Alaghi and J.P. Hayes, “Dimension Reduction in Statistical Simulation of Digital Circuits,”
Proc. Symp. on Theory of Modeling & Simulation (TMS-DEVS), pp. 1–8, 2015.
6. A. Alaghi, C. Li and J.P. Hayes, “Stochastic Circuits for Real-time Image-Processing Applica-
tions,” Proc. Design Autom. Conf. (DAC), article 136, 6p, 2013.
7. A. Alaghi, W. Qian and J.P. Hayes, "The Promise and Challenge of Stochastic Computing,"
IEEE Trans. CAD, vol. 37, pp.1515–1531, Aug. 2018.
8. R. Baumann, “Soft Errors in Advanced Computer Systems,” IEEE Design & Test of Comput-
ers, vol.22, pp. 258–266, 2005.
9. B.D. Brown and H.C. Card, “Stochastic Neural Computation I: Computational Elements,”
IEEE Trans. Comp., vol. 50, pp. 891–905, 2001.
10. T-H. Chen and J.P. Hayes, “Design of Division Circuits for Stochastic Computing,” Proc. IEEE
Symp. on VLSI (ISVLSI), pp. 116–121, 2016.
11. S.S. Choi, S.H. Cha and C. Tappert, “A Survey of Binary Similarity and Distance Measures,”
Jour. Systemics, Cybernetics and Informatics, vol. 8, pp. 43–48, 2010.
12. J. Friedman et al. “Approximation Enhancement for Stochastic Bayesian Inference,” Elsevier
Int. Jour. of Approximate Reasoning, 85, pp.139–158, 2017.
13. B.R. Gaines, “Stochastic Computing Systems,” Advances in Information Systems Science, vol.
2, J.T. Tou (ed.), Springer, pp. 37–172, 1969.
14. S.W. Golomb, Shift Register Sequences. Revised ed., Aegean Park Press, Laguna Hills, CA,
1982.
15. H. Ichihara et al., “Compact and Accurate Digital Filters Based on Stochastic Computing,”
IEEE Trans. Emerging Topics in Comp., 2018 (early access).
16. P. Jeavons, D.A. Cohen and J. Shawe-Taylor, "Generating Binary Sequences for Stochastic
Computing,” IEEE Trans. Info. Theory, vol. 40, pp. 716–720, 1994.
17. D. Jenson and M. Riedel, “A Deterministic Approach to Stochastic Computation,” Proc. Intl.
Conf. Computer-Aided Design (ICCAD), pp. 1–8, 2016.
18. Z. Kohavi and N.K. Jha, Switching and Finite Automata Theory, 3rd ed. Cambridge Univ.
Press, 2010.
19. V.T. Lee, A. Alaghi and L. Ceze, “Correlation Manipulating Circuits for Stochastic Comput-
ing,” Proc. 2018 Design, Automation & Test in Europe (DATE) Conf., pp. 1417–1422, 2018.
20. V.T. Lee et al., “Energy-Efficient Hybrid Stochastic-Binary Neural Networks for Near-Sensor
Computing,” Proc. Design, Automation and Test in Europe Conf. (DATE), pp. 13–18, 2017.
21. P. Li et al., “Logical Computation on Stochastic Bit Streams with Linear Finite-State
Machines,” IEEE Trans. Computers, vol. 63, pp. 1474–1486, 2014.
22. S. Liu and J. Han, “Energy Efficient Stochastic Computing with Sobol Sequences,” Proc.
Design, Automation, and Test in Europe Conf. (DATE), pp. 650–653, 2017.
23. Y. Liu et al., “Synthesis of Correlated Bit Streams for Stochastic Computing,” Proc. Asilomar
Conf. on Signals, Systems and Computers, pp. 167–174, 2016.
24. A. Naderi et al., “Delayed Stochastic Decoding of LDPC Codes,” IEEE Trans. Signal
Processing, vol. 59, pp. 5617–5626, 2011.
25. W. Qian et al., “An Architecture for Fault-Tolerant Computation with Stochastic Logic,” IEEE
Trans. Comp., vol. 60, pp. 93–105, 2011.
26. W. Qian and M. D. Riedel, “The Synthesis of Robust Polynomial Arithmetic with Stochastic
Logic,” Proc. Design Autom. Conf. (DAC), pp. 648–653, 2008.
27. S.S. Tehrani, W. J. Gross, and S. Mannor, “Stochastic Decoding of LDPC Codes,” IEEE Comm.
Letters, vol. 10, pp. 716–718, 2006.
28. S. Tehrani et al., “Relaxation Dynamics in Stochastic Iterative Decoders,” IEEE Trans. Signal
Processing, vol. 58, pp. 5955–5961, 2010.
29. P.S. Ting and J.P. Hayes, “Isolation-Based Decorrelation of Stochastic Circuits,” Proc. Intl.
Conf. Computer Design (ICCD), pp. 88–95, 2016.
30. P.S. Ting and J.P. Hayes, “On the Role of Sequential Circuits in Stochastic Computing,” Proc.
Great Lakes VLSI Symp. (GLSVLSI), pp. 475–478, 2017.
31. P. S. Ting and J.P. Hayes, “Eliminating a Hidden Error Source in Stochastic Circuits,” Proc.
Symp. Defect & Fault Tolerance in VLSI and Nano. Systems (DFT), pp.44–49, Oct. 2017.
Synthesis of Polynomial Functions
Abstract This chapter addresses the fundamental question: what functions can
stochastic logic compute? We show that, given stochastic inputs, any combinational
circuit computes a polynomial function. Conversely, we show that, given any
polynomial function, we can synthesize stochastic logic to compute this function.
The only restriction is that we must have a function that maps the unit interval [0, 1]
to the unit interval [0, 1], since the stochastic inputs and outputs are probabilities.
Our approach is both general and efficient in terms of area. It can be used to
synthesize arbitrary polynomial functions. Through polynomial approximations, it
can also be used to synthesize non-polynomial functions.
Introduction
First introduced by Gaines [1] and Poppelbaum [2, 3] in the 1960s, the field of
stochastic computing has seen widespread interest in recent years. Much of the
work, both early and recent, has had more of an applied than a theoretical flavor.
The work of Gaines, Poppelbaum, Brown & Card [4], as well as recent papers
pertaining to image processing [5] and neural networks [6] all demonstrate how
to compute specific functions for particular applications.
This chapter has a more theoretical flavor. It addresses the fundamental question:
can we characterize the class of functions that stochastic logic can compute? Given
a combinational circuit, that is to say a circuit with no memory elements, the answer
M. Riedel ()
University of Minnesota, Minneapolis, MN, USA
e-mail: mriedel@umn.edu
W. Qian
Shanghai Jiao Tong University, Shanghai, China
e-mail: qianwk@sjtu.edu.cn
is rather easy: given stochastic inputs, we show such a circuit computes a polynomial
function. Since the stochastic inputs and outputs are probabilities, this polynomial
function maps inputs from the unit interval [0, 1] to outputs in the unit interval [0, 1].
The converse question is much more challenging: given a target polynomial
function, can we synthesize stochastic logic to compute it? The answer is yes:
we prove that there exists a combinational circuit that computes any polynomial
function that maps the unit interval to the unit interval. So the characterization
of stochastic logic is complete. Our proof method is constructive: we describe a
synthesis methodology for polynomial functions that is general and efficient in terms
of area. Through polynomial approximations, it can also be used to synthesize non-
polynomial functions.
Consider basic logic gates. Table 1 describes the functions that they implement given
stochastic inputs. These are all straightforward to derive algebraically. For instance,
given a stochastic input x representing the probability of seeing a 1 in a random
stream of 1s and 0s, a NOT gate implements the function
NOT(x) = 1 − x. (1)
It is well known that any Boolean function can be expressed in terms of AND
and NOT operations (or entirely in terms of NAND operations). Accordingly, the
function of any combinational circuit can be expressed as a nested sequence of
multiplications and 1 − x type operations. It can easily be shown that this nested
sequence results in a polynomial function. (Note that special treatment is needed for
any reconvergent paths.)
We will make the argument based upon truth tables. Here we will consider only
univariate functions, that is to say stochastic logic that receives multiple independent
copies of a single variable t. (Technically, t is the Bernoulli coefficient of a random
variable Xi, where t = Pr[Xi = 1].) Please see [7] for a generalization to
multivariate polynomials.
Consider a combinational circuit computing a function f(X1, X2, X3) with
the truth table shown in Table 2. Now suppose that each variable has independent
probability t of being 1:

Pr[X1 = 1] = t, \qquad (5)
Pr[X2 = 1] = t, \qquad (6)
Pr[X3 = 1] = t. \qquad (7)
The probability that the function evaluates to 1 is equal to the sum of the probabilities
of occurrence of each row that evaluates to 1. The probability of each row, in turn, is
obtained from the assignments to the variables, as shown in Table 3. Summing up
the rows that evaluate to 1, we obtain
Generalizing from this example, suppose we are given any combinational circuit with
n inputs that each evaluate to 1 with independent probability t. We conclude that the
probability that the output of the circuit evaluates to 1 is equal to the sum of terms of
the form t i (1 − t)j , where 0 ≤ i ≤ n, 0 ≤ j ≤ n, i + j = n, corresponding to rows
of the truth table of the circuit that evaluate to 1. Expanding out this expression, we
always obtain a polynomial in t.
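This argument is mechanical enough to automate. The following Python sketch (ours) expands each 1-row's t^i (1 − t)^{n−i} term to obtain the power-form coefficients of the polynomial computed by an arbitrary truth table:

```python
from math import comb

def polynomial_in_t(truth_table):
    """Given the truth table of an n-input Boolean function (length 2^n, row r
    corresponding to the binary expansion of r), return coefficients c[0..n] of
    Pr(output = 1) = sum_k c[k] * t^k, assuming independent Bernoulli(t) inputs."""
    n = (len(truth_table) - 1).bit_length()
    coeffs = [0] * (n + 1)
    for row, out in enumerate(truth_table):
        if out:
            i = bin(row).count("1")        # number of inputs set to 1 in this row
            for j in range(n - i + 1):     # binomial expansion of (1 - t)^(n - i)
                coeffs[i + j] += comb(n - i, j) * (-1) ** j
    return coeffs

# 2-input AND: only minterm 11 evaluates to 1, so Pr = t^2.
print(polynomial_in_t([0, 0, 0, 1]))   # [0, 0, 1]
# 2-input XOR: minterms 01 and 10, so Pr = 2t - 2t^2.
print(polynomial_in_t([0, 1, 1, 0]))   # [0, 2, -2]
```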
We note that the analysis here was presented as early as 1975 in [8]. Algorithmic
details for such analysis were first fleshed out by the testing community [9].
They have also found mainstream application for tasks such as timing and power
analysis [10, 11].
In this chapter, we will explore the more challenging task of synthesizing logical
computation on stochastic bit streams that implements the functionality that we
want. Naturally, since we are mapping probabilities to probabilities, we can only
implement functions that map the unit interval [0, 1] onto the unit interval [0, 1].
Consider the behavior of a multiplexer, shown in Fig. 1. It implements scaled
addition: with stochastic inputs a, b and a stochastic select input s, it computes a
stochastic output c:
c = sa + (1 − s)b. (11)
(We use the convention of upper case letters for random variables and lower case
letters for the corresponding probabilities.)
Based on the constructs for multiplication (an AND gate) and scaled addition
(a multiplexer), we can readily implement polynomial functions of a specific form,
namely polynomials with non-negative coefficients that sum up to a value no more
than one:
g(t) = \sum_{i=0}^{n} a_i t^i

where, for all i = 0, . . . , n, a_i ≥ 0 and \sum_{i=0}^{n} a_i ≤ 1.
For example, suppose that we want to implement the polynomial g(t) = 0.3t 2 +
0.3t + 0.2. We first decompose it in terms of multiplications of the form a · b and
scaled additions of the form sa + (1 − s)b, where s is a constant:
w1 = t · t,
w2 = 0.5w1 + (1 − 0.5)t,
w3 = 0.75w2 + (1 − 0.75) · 1,
w4 = 0.8 · w3 .
Fig. 2 Computation on stochastic bit streams implementing the polynomial g(t) = 0.3t² + 0.3t + 0.2
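The decomposition is easy to sanity-check numerically. The sketch below (ours) evaluates the chain w1, . . . , w4 in ordinary arithmetic, mirroring the multiply and scaled-add constructs of Fig. 2:

```python
def g_decomposed(t):
    """Evaluate g(t) = 0.3 t^2 + 0.3 t + 0.2 via the multiply / scaled-add
    decomposition used by the stochastic circuit of Fig. 2."""
    w1 = t * t                        # AND gate: multiplication
    w2 = 0.5 * w1 + (1 - 0.5) * t     # multiplexer: scaled addition
    w3 = 0.75 * w2 + (1 - 0.75) * 1   # multiplexer with constant input 1
    w4 = 0.8 * w3                     # AND gate with a constant 0.8 stream
    return w4

for t in (0.0, 0.25, 0.5, 1.0):
    assert abs(g_decomposed(t) - (0.3 * t * t + 0.3 * t + 0.2)) < 1e-12
```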
Fig. 3 A generalized multiplexing circuit implementing the polynomial g(t) = 3/4 − t + (3/4)t²

Note that the coefficients of the Bernstein polynomial are 3/4, 1/4, and 1/2, all of which
are in the unit interval.
2. Implement the Bernstein polynomial with a multiplexing circuit, as shown in
Fig. 3. The block labeled “+” counts the number of ones among its two inputs;
this is either 0, 1, or 2. The multiplexer selects one of its three inputs as its output
according to this value. Note that the inputs with probability t are each fed with
independent stochastic streams with bits that have probability t.
Bernstein Polynomials
The coefficients βk,n are called Bernstein coefficients and the polynomials
b0,n (t), b1,n (t), . . . , bn,n (t) are called Bernstein basis polynomials of degree
n.
We list some pertinent properties of Bernstein polynomials.
1. The positivity property:
For all k = 0, 1, . . . , n and all t in [0, 1], we have b_{k,n}(t) ≥ 0.
To convert the power basis into the Bernstein basis, write

t^j = t^j \,(t + (1 - t))^{n - j}

and perform a binomial expansion on the right-hand side. This gives

t^j = \sum_{k=j}^{n} \frac{\binom{k}{j}}{\binom{n}{j}}\, b_{k,n}(t),

for j = 0, 1, . . . , n. (Here \binom{n}{k} denotes the binomial coefficient "n choose k.")
Substituting Eqs. (16) and (17) into Eq. (18) and comparing the Bernstein
coefficients, we have

\beta_{k,n} = \sum_{j=0}^{k} a_{j,n}\, \sigma_{jk} = \sum_{j=0}^{k} \frac{\binom{k}{j}}{\binom{n}{j}}\, a_{j,n}. \qquad (20)

Equation (20) provides a means for obtaining Bernstein coefficients from power-form coefficients.
4. Degree elevation:
Based on Eq. (13), we have that for all k = 0, 1, . . . , m,

\frac{1}{\binom{m+1}{k}}\, b_{k,m+1}(t) + \frac{1}{\binom{m+1}{k+1}}\, b_{k+1,m+1}(t)
= t^k (1 - t)^{m+1-k} + t^{k+1} (1 - t)^{m-k}
= t^k (1 - t)^{m-k} = \frac{1}{\binom{m}{k}}\, b_{k,m}(t),

or

b_{k,m}(t) = \frac{\binom{m}{k}}{\binom{m+1}{k}}\, b_{k,m+1}(t) + \frac{\binom{m}{k}}{\binom{m+1}{k+1}}\, b_{k+1,m+1}(t)
= \frac{m+1-k}{m+1}\, b_{k,m+1}(t) + \frac{k+1}{m+1}\, b_{k+1,m+1}(t). \qquad (21)
Writing g in Bernstein form at degrees m and m + 1, we have

\sum_{k=0}^{m} \beta_{k,m}\, b_{k,m}(t) = \sum_{k=0}^{m+1} \beta_{k,m+1}\, b_{k,m+1}(t). \qquad (22)
Substituting Eq. (21) into the left-hand side of Eq. (22) and comparing the
Bernstein coefficients, we have
\beta_{k,m+1} =
\begin{cases}
\beta_{0,m}, & \text{for } k = 0 \\
\dfrac{k}{m+1}\, \beta_{k-1,m} + \left(1 - \dfrac{k}{m+1}\right) \beta_{k,m}, & \text{for } 1 \le k \le m \\
\beta_{m,m}, & \text{for } k = m + 1.
\end{cases}
\qquad (23)
Equation (23) provides a means for obtaining the coefficients of the Bernstein
polynomial of degree m+1 of g from the coefficients of the Bernstein polynomial
of degree m of g. We will call this procedure degree elevation.
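Equation (23) is directly implementable. The sketch below (ours) performs one elevation step using exact rational arithmetic; the example reproduces the degree-2 to degree-3 elevation of Example 1 below:

```python
from fractions import Fraction

def elevate(beta):
    """One degree-elevation step, Eq. (23): given the Bernstein coefficients of
    a degree-m polynomial, return the coefficients of the same polynomial
    expressed at degree m + 1. Exact rationals avoid floating-point round-off."""
    m = len(beta) - 1
    return ([beta[0]]
            + [Fraction(k, m + 1) * beta[k - 1]
               + Fraction(m + 1 - k, m + 1) * beta[k]
               for k in range(1, m + 1)]
            + [beta[m]])

b2 = [Fraction(5, 8), Fraction(-5, 16), Fraction(1)]
print(elevate(b2))   # [5/8, 0, 1/8, 1] -- all coefficients now in [0, 1]
```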
The second result pertains to a special type of Bernstein polynomials: those with
coefficients that are all in the unit interval. We are interested in this type of Bernstein
polynomial since we can show that it can be implemented by logical computation on
stochastic bit streams.
Definition 2 Define U to be the set of Bernstein polynomials with coefficients that
are all in the unit interval [0, 1]:
U = \left\{ p(t) \;\middle|\; \exists\, n \ge 1,\ 0 \le \beta_{0,n}, \beta_{1,n}, \ldots, \beta_{n,n} \le 1, \text{ such that } p(t) = \sum_{k=0}^{n} \beta_{k,n}\, b_{k,n}(t) \right\}.
We prove that the set U and the set V are equivalent, thus giving a clear
characterization of the set U .
Theorem 2
V = U.
The proof of the above theorem utilizes Theorem 1. Please see [7] for the proof.
We end this section with two examples illustrating Theorem 2. In what follows,
we will refer to a Bernstein polynomial of degree n converted from a polynomial g
as the Bernstein polynomial of degree n of g.
Example 1 Consider the polynomial g(t) = 5/8 − (15/8)t + (9/4)t², which maps
(0, 1) into (0, 1) with g(0) = 5/8 and g(1) = 1. Thus, g is in the set V. Based on
Theorem 2, we have that g is in the set U . We verify this by considering Bernstein
polynomials of increasing degree.
• The Bernstein polynomial of degree 2 of g is

g(t) = \frac{5}{8}\, b_{0,2}(t) - \frac{5}{16}\, b_{1,2}(t) + 1 \cdot b_{2,2}(t).

The coefficient −5/16 lies outside the unit interval, so we elevate the degree.
• The Bernstein polynomial of degree 3 of g is

g(t) = \frac{5}{8}\, b_{0,3}(t) + 0 \cdot b_{1,3}(t) + \frac{1}{8}\, b_{2,3}(t) + 1 \cdot b_{3,3}(t).
Note that all the coefficients are in [0, 1].
Since the Bernstein polynomial of degree 3 of g satisfies Definition 2, we conclude
that g is in the set U .
Example 2 Consider the polynomial g(t) = 1/4 − t + t². Since g(0.5) = 0,
g is not in the set V. Based on Theorem 2, we have that g is not in the
set U. We verify this. Suppose, for the sake of contradiction, that there exist n ≥ 1 and
0 ≤ β_{0,n}, β_{1,n}, . . . , β_{n,n} ≤ 1 such that

g(t) = \sum_{k=0}^{n} \beta_{k,n}\, b_{k,n}(t).

Since g(0.5) = 0, we have \sum_{k=0}^{n} \beta_{k,n}\, b_{k,n}(0.5) = 0. Note that for all k = 0, 1, . . . , n,
b_{k,n}(0.5) > 0. Thus, we have that for all k = 0, 1, . . . , n, β_{k,n} = 0. Therefore,
g(t) ≡ 0, which contradicts the original assumption about g. Thus, g is not in the
set U.
If all the coefficients of a Bernstein polynomial are in the unit interval, i.e., 0 ≤
bi,n ≤ 1, for all 0 ≤ i ≤ n, then we can implement it with the construct shown in
Fig. 4.
y = P(Y = 1) = \sum_{k=0}^{n} P\left(Y = 1 \,\Big|\, \sum_{i=1}^{n} X_i = k\right) P\left(\sum_{i=1}^{n} X_i = k\right). \qquad (25)

Since the multiplexer sets Y equal to Z_k when \sum_{i=1}^{n} X_i = k, we have

P\left(Y = 1 \,\Big|\, \sum_{i=1}^{n} X_i = k\right) = P(Z_k = 1) = b_{k,n}. \qquad (26)

Since each X_i equals 1 with independent probability t, we have P(\sum_{i=1}^{n} X_i = k) = B_{k,n}(t), and therefore

y = \sum_{k=0}^{n} b_{k,n}\, B_{k,n}(t) = B_n(t). \qquad (27)
For example, consider the Bernstein polynomial

g_1(t) = \frac{2}{8}\, B_{0,3}(t) + \frac{5}{8}\, B_{1,3}(t) + \frac{3}{8}\, B_{2,3}(t) + \frac{6}{8}\, B_{3,3}(t).
Figure 6 shows a circuit that implements this Bernstein polynomial. The function is
evaluated at t = 0.5. The stochastic bit streams X1 , X2 and X3 are independent,
each with probability t = 0.5. The stochastic bit streams Z0 , . . . , Z3 have
probabilities 28 , 58 , 38 , and 68 , respectively. As expected, the computation produces
the correct output value: g1 (0.5) = 0.5.
Fig. 6 Computation on stochastic bit streams that implements the Bernstein polynomial g1(t) = (2/8)B_{0,3}(t) + (5/8)B_{1,3}(t) + (3/8)B_{2,3}(t) + (6/8)B_{3,3}(t) at t = 0.5
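A Monte Carlo model of this generalized multiplexing circuit (our sketch; Python's pseudo-random generator stands in for the SNGs and the coefficient streams) reproduces the computation:

```python
import random

def bernstein_mux(t, coeffs, length=100000, seed=0):
    """At each clock, count the 1s among n independent Bernoulli(t) inputs and
    output the bit of the coefficient stream Z_k selected by that count."""
    rng = random.Random(seed)
    n = len(coeffs) - 1
    ones = 0
    for _ in range(length):
        k = sum(rng.random() < t for _ in range(n))    # the adder's output
        ones += rng.random() < coeffs[k]               # multiplexer picks Z_k
    return ones / length

# g1 at t = 0.5 should evaluate to 0.5:
print(bernstein_mux(0.5, [2/8, 5/8, 3/8, 6/8]))
```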
In the previous section, we saw that we can implement a polynomial through logical
computation on stochastic bit streams if the polynomial can be represented as a
Bernstein polynomial with coefficients in the unit interval. A question that arises
is: what kind of polynomials can be represented in this form? Generally, we seek
to implement polynomials given to us in power form. In [16], we proved that any
polynomial that satisfies Theorem 3—so essentially any polynomial that maps the
unit interval onto the unit interval—can be converted into a Bernstein polynomial
with all coefficients in the unit interval.2 Based on this result and Theorem 4, we can
see that the necessary condition shown in Theorem 3 is also a sufficient condition for
a polynomial to be implemented by logical computation on stochastic bit streams.
Example 5 Consider the polynomial g2(t) = 3t − 8t² + 6t³ of degree 3. Since
g2(t) ∈ (0, 1) for all t ∈ (0, 1), and g2(0) = 0, g2(1) = 1, it satisfies the necessary
condition shown in Theorem 3. Note that
g_2(t) = B_{1,3}(t) - \frac{2}{3}\, B_{2,3}(t) + B_{3,3}(t)
       = \frac{3}{4}\, B_{1,4}(t) + \frac{1}{6}\, B_{2,4}(t) - \frac{1}{4}\, B_{3,4}(t) + B_{4,4}(t)
       = \frac{3}{5}\, B_{1,5}(t) + \frac{2}{5}\, B_{2,5}(t) + B_{5,5}(t).
Thus, the polynomial g2 (t) can be converted into a Bernstein polynomial with
coefficients in the unit interval. The degree of such a Bernstein polynomial is 5,
greater than that of the original power form polynomial.
2 The degree of the equivalent Bernstein polynomial with coefficients in the unit interval may be
greater than the degree of the original polynomial.
Given a power-form polynomial g(t) = \sum_{i=0}^{n} a_{i,n} t^i that satisfies the condition
of Theorem 3, we can synthesize it in the following steps:
1. Let m = n. Obtain b_{0,m}, b_{1,m}, . . . , b_{m,m} from a_{0,n}, a_{1,n}, . . . , a_{n,n} by Eq. (16).
2. Check to see if 0 ≤ b_{i,m} ≤ 1 for all i = 0, 1, . . . , m. If so, go to step 4.
3. Let m = m + 1. Calculate b_{0,m}, b_{1,m}, . . . , b_{m,m} from b_{0,m−1}, b_{1,m−1}, . . . , b_{m−1,m−1} based on Eq. (13). Go to step 2.
4. Synthesize the Bernstein polynomial

B_m(t) = \sum_{i=0}^{m} b_{i,m}\, B_{i,m}(t)
subject to
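The first three steps translate into a short procedure. The sketch below (ours) combines Eq. (20) for step 1 with repeated degree elevation for steps 2 and 3, using exact rationals; the cap on the degree is our own safeguard, not part of the method:

```python
from fractions import Fraction
from math import comb

def power_to_bernstein(a):
    """Eq. (20): power-form coefficients a[0..n] -> Bernstein coefficients."""
    n = len(a) - 1
    return [sum(Fraction(comb(k, j), comb(n, j)) * Fraction(a[j])
                for j in range(k + 1))
            for k in range(n + 1)]

def elevate(b):
    """Eq. (23), as in the earlier sketch: degree m -> degree m + 1."""
    m = len(b) - 1
    return ([b[0]]
            + [Fraction(k, m + 1) * b[k - 1] + Fraction(m + 1 - k, m + 1) * b[k]
               for k in range(1, m + 1)]
            + [b[m]])

def synthesize(a, max_degree=64):
    """Steps 1-3: elevate until every Bernstein coefficient lies in [0, 1]."""
    b = power_to_bernstein(a)
    while not all(0 <= c <= 1 for c in b):
        if len(b) > max_degree:
            raise ValueError("gave up: no unit-interval form found")
        b = elevate(b)
    return b

print(synthesize([0, 3, -8, 6]))   # [0, 3/5, 2/5, 0, 0, 1]: g2 at degree 5
```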
Discussion
This chapter presented a necessary and sufficient condition for synthesizing stochas-
tic functions with combinational logic: the target function must be a polynomial that
maps the unit interval [0, 1] to the unit interval [0, 1]. The “necessary” part was easy:
given stochastic inputs, any combinational circuit produces a polynomial. Since the
inputs and outputs are probabilities, this polynomial maps the unit interval to the unit
interval.
The “sufficient” part entailed some mathematics. First we showed that any
polynomial given in power form can be transformed into a Bernstein polynomial.
This was well known [13]. Next we showed that, by elevating the degree of the
Bernstein polynomial, we always obtain a Bernstein polynomial with coefficients in
the unit interval. This was a new result, published in [16]. Finally, we showed that
any Bernstein polynomial with coefficients in the unit interval can be implemented
by a form of “general multiplexing”. These results were published in [17, 18].
The synthesis method is both general and efficient. For a wide variety of applications,
it produces stochastic circuits that have remarkably small area, compared to
circuits that operate on conventional binary positional encodings [18]. We note
that our characterization applies only to combinational circuits, that is to say logic
circuits without memory elements. Dating back to very interesting work by Brown
& Card [4], researchers have explored stochastic computing with sequential circuits,
that is to say logic circuits with memory elements. With sequential circuits, one
can implement a much larger class of functions than polynomials. For instance,
Brown & Card showed that a sequential circuit can implement the tanh function. A
complete characterization of what sort of stochastic functions can be computed by
sequential circuits has not been established. However, we point the reader to recent
work on the topic: [19–22].
References
Deterministic Approaches to Bitstream Computing

Marc Riedel
M. Riedel ()
University of Minnesota, Minneapolis, MN, USA
e-mail: mriedel@umn.edu
Introduction
As detailed throughout this book, the topic of stochastic computing has been
investigated from many angles, by many different researchers. In spite of the
activity, it is fair to say that the practical impact of the research has been modest.
In our view, interest has been sustained because of the intellectual appeal of the
paradigm. It presents a completely different way of computing functions with digital
logic. Complex functions can be computed with remarkably simple structures.
For instance, multiplication can be performed with a single AND gate. Complex
functions such as exponentiation, absolute value, square roots, and hyperbolic
tangent can each be computed with a very small number of gates [1]. Although
this is a claim that can only be justified through design examples, stochastic designs
consistently achieve 50× to 100× reductions in gate count over a wide range of
applications in signal, image and video processing, compared to conventional binary
radix designs [1]. Savings in area correlate well with savings in power, a critical
metric.
Note that while stochastic computation is digital—operating on 0s and 1s—and
performed with ordinary logic gates, it has an “analog” flavor: conceptually, the
computation consists of mathematical operations on real values, the probabilities
of the streams. The approach is a compelling and natural fit for computing
mathematical functions, for applications such as image processing and neural
processing.
The intellectual appeal notwithstanding, the approach has a glaring weakness: the
latency it incurs. A stochastic representation is not compact: to represent 2^M distinct
numbers, it requires roughly 2^{2M} bits, whereas a conventional binary representation
requires only M bits. When computing on serial bit streams, this results in an
exponential, near-disastrous increase in latency. The simplicity of the logic generally
translates to very short critical paths, so one could, in principle, bump up the clock
to very high rates. This could mitigate the increase in latency. But there are practical
limitations to increasing the clock rate [2, 3].
Another issue is the cost of generating randomness. Most implementations have
used pseudo-random number generators such as linear-feedback shift registers
(LFSRs). The cost of these easily overwhelms the total area cost, completely
offsetting the gains made in the structures for computation [4, 5]. Researchers have
explored sources of true randomness [6, 7]. Indeed, with emerging technologies such
as nanomagnetic logic, exploiting true randomness from physical sources could tip
the scales, making stochastic computing a winning proposition [8]. Still, the latency
and the cost of interfacing random signals with deterministic signals make it a hard
sell.
In this chapter, we reexamine the foundations of stochastic computing, and
come to some surprising conclusions. Why is computing on probabilities so
powerful, conceptually? Why can complex functions be computed with such simple
structures? Intuition might suggest that somehow we are harnessing deep aspects
of probability theory; perhaps we are computing approximate answers to hard problems.
Fig. 2 A Pulse-Width Modulated (PWM) signal. The value represented is the fraction of the time
that the signal is high in each cycle, in this case 0.687
Fig. 4 Multiplication with a single AND gate: operating on deterministic periodic signals. Signal
A represents 0.5 with a period of 20ns; Signal B represents 0.6 with a period of 13ns. The output
signal C from t=0ns to 260ns represents 0.30, the expected value from multiplication of the inputs
not equal to 1/4, the value required. However, suppose that one adopts the following
strategy when generating the bit streams: hold each bit of one stream, while cycling
through all the bits of the other stream. Figure 3 gives an example. Here the value
1/3 is represented by the bits 100 repeating, while the value 2/3 is represented by the
bits 110, clock-divided by three. The result is 2/9, as expected. This method works in
general for all stochastic constructs.
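This bookkeeping is simple to emulate. The sketch below (ours) reproduces the Fig. 3 example of multiplying 1/3 by 2/3 with an AND gate on clock-divided streams:

```python
def clock_divide_multiply(a_bits, b_bits):
    """Deterministic AND-gate multiplication: hold each bit of stream B while
    cycling through all of stream A (clock division), so every bit of one
    operand meets every bit of the other exactly once."""
    out = [a & b for b in b_bits for a in a_bits]
    return sum(out) / len(out)

# 1/3 (bits 100, repeating) times 2/3 (bits 110, clock-divided by three) = 2/9:
print(clock_divide_multiply([1, 0, 0], [1, 1, 0]))   # 0.2222...
```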
In an analogous way, we can perform operations on PWM signals. For instance,
one can use periodic signals with relatively prime frequencies. Figure 4 shows an
example of multiplying two values, 0.5 and 0.6, represented as PWM signals. The
period of the first is 20ns and that of the second is 13ns. The figure shows that,
after performing the operation for 260ns, the fraction of the total time the output
signal is high equals the value expected when multiplying the two input values,
namely 0.3.
The idea of computing on time-encoded signals has a long history [9–12]. We
have been exploring the idea of time-based computing with constructs developed
for stochastic computing [13, 14]. We note that other researchers have explored
very similar ideas in the context of LDPC decoding [15].
As we will argue, compared to computing on stochastic bit streams, we can
reduce the latency significantly—by an exponential factor—with deterministic
approaches. Of course, compared to binary radix, uniform bit streams still incur
high latency. However, with PWM signals, the precision is no longer dependent
on the length of pulses, but rather on how accurately the duty cycle can
be set.
As technology has scaled and device sizes have gotten smaller, the supply
voltages have dropped while the device speeds have improved [16]. Control of the
dynamic range in the voltage domain is limited; however, control of the length
of pulses in the time domain can be precise [16, 17]. Encoding data in the time
Fig. 5 Comparison of (a) the conventional approach, namely digital computation on binary radix (low latency, but high gate count and power); to (b) our methodology on uniform bit streams (high latency, very low gate count and power); and (c) our methodology on pulse-width modulated (PWM) signals (relatively low latency, very low gate count and power)
domain can be done more accurately and more efficiently than converting signals
into binary radix. Given how precisely values can be encoded in time, our method
could produce designs that are much faster than conventional ones—operating in
the terahertz range. Figure 5 compares the conventional approach, consisting of an
analog-to-digital converter (ADC) that produces binary radix, to the new methods
that we are proposing here.
A Deterministic Approach
Intuitive View
Fig. 6 Discrete convolution. (a) Mathematical operation on two bit streams, X and Y. (b)
Intuition: convolution is equivalent to sliding one bit stream past the other
Figure 7 illustrates that the result is pC = pS pA + (1 − pS )pB = 2/9 + 2/9 = 4/9.
A stochastic representation maintains the property that each bit of one stream meets
every bit of another stream the same number of times, but this property holds only on
average, meaning the bit streams have to be much longer than the resolution they
represent due to random fluctuations. The bit stream length N required to estimate
the average proportion within an error margin ε is

N > \frac{p(1 - p)}{\varepsilon^2}.
(This is proved in [5].) To represent a value within a binary resolution 1/2^n, the
error margin must equal 1/2^{n+1}. Therefore, the bit stream must be more than
2^{2n} uniform bits long, as the p(1 − p) term is at most 2^{−2}. This means that the
length of a stochastic bit stream increases exponentially with the desired resolution.
This results in enormously long bit streams. For example, if we want to find the
proportion of a random bit stream with 10-bit resolution (1/2^{10}), we will have to
observe at least 2^{20} bits. This is over a thousand times longer than the bit stream
required by a deterministic uniform representation.
The computations also suffer from some level of correlation between bit streams.
This can cause the results to bias away from the correct answer. For these reasons,
stochastic logic has only been used to perform approximate computations. Another
related issue is that the LFSRs must be at least as long as the desired resolution
in order to produce bit streams that are sufficiently random. A “Randomizer Unit”,
described in [22], uses a comparator and LFSR to convert a binary encoded number
into a random bit stream. Each independent random bit stream requires its own
generator. Therefore, circuits requiring i independent inputs with n-bit resolution
need i LFSRs with length L approximately equal to 2n. This results in the LFSRs
dominating a majority of the circuit area.
By using deterministic bit streams, we avoid all problems associated with
randomness while retaining all the computational benefits associated with a stochas-
tic representation. However, we can use much shorter bit streams to achieve
the same precision: to represent a value with resolution 1/2^n in a deterministic
representation, the bit stream must be 2^n bits long. The computations are also
completely accurate; they do not suffer from correlation. The next section discusses
three methods for generating independent deterministic bit streams and gives their
circuit implementations. Without the requirement of randomness, the hardware cost
of the bit stream generators is reduced, so it is a win in every respect.
Deterministic Methods
The “relatively prime”’ method maintains independence by using bit streams that
have relatively prime lengths. Here the ranges [0, Ri ) between converter modules
Deterministic Approaches to Bitstream Computing 129
Fig. 10 Circuit
implementation of the
“relatively prime” method
are relatively prime. Figure 9 demonstrates the method with two bit streams A and
B, one with operand length four and the other with operand length three. The bit
streams are shown in array notation to show the position of each bit in time.
Independence between bit streams is maintained because the remainder, equal to
the overlap between bit streams, always results in a new rotation (or initial phase)
of stream. Intuitively, this occurs because the bit lengths share no common factors.
This results in every bit of each operand seeing every bit of the other operand. For
example, a0 sees b0 , b1 , and b2 ; b0 sees a0 , a3 , a2 , and a1 ; and so on. Using two bit
streams with relatively prime lengths j and k, the output of a logic gate repeats
with period jk. In multi-level circuits, the periods of the gate outputs can therefore
also be kept relatively prime to one another. This allows for the same arithmetic
logic as a stochastic representation.
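As a minimal illustration in Python (ours; the stream values 2/4 and 2/3 are assumed for concreteness, since the exact values in Fig. 9 are not reproduced here), an AND gate multiplying two unary bit streams of relatively prime lengths recovers the product exactly over one common period:

    def stream(value, length, t):
        # deterministic unary bit stream encoding value/length
        return 1 if t % length < value else 0

    # Lengths 4 and 3 are relatively prime, so over lcm(4, 3) = 12 cycles
    # every bit of stream A meets every bit of stream B exactly once.
    out = [stream(2, 4, t) & stream(2, 3, t) for t in range(12)]
    print(sum(out) / len(out))  # 4/12 = 1/3 = (2/4) * (2/3), exactly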
A circuit implementation of the “relatively-prime” method is shown in Fig. 10.
Each converter module uses a counter as a number source for iterating through
each bit of the stream. The state of the counter Qi is compared with the stream
constant Ci . The relatively prime counter ranges Ri between modules maintain
independence. In terms of general circuit components, the circuit uses i counters
and i comparators, where i is the number of generated independent bit streams.
Assuming the maximum range is a binary resolution 2^n and all module ranges are
close to this value (e.g., 256, 255, 253, 251, . . .), the circuit contains approximately
i n-bit counters and i n-bit comparators.
Rotation
In contrast to the previous method, the “rotation” method allows bit streams of
arbitrary length to be used. Instead of relying on relatively prime lengths, the bit
streams are explicitly rotated.

Fig. 12 Circuit implementation of the “rotation” method

This requires the sequence generated by the number
source to change after it iterates through its entire range. For example, a simple way
to generate a bit stream that rotates in time is to inhibit or stall a counter every
2^n clock cycles (where n is the bit width of the counter). Figure 11
demonstrates this method with two bit streams, both of length four.
By rotating bit stream B, it is straightforward to see that each bit of one
bit stream sees every bit in the other stream. Assuming all streams have the same
length, we can extend the example with two bit streams to examples with multiple
bit streams; here we would be inhibiting counters at powers of the operand length.
This allows the operands to rotate relative to longer bit streams.
A circuit implementation, shown in Fig. 12, follows from the previous example.
We can generate any number of independent bit streams as long as the counter of
every ith converter module is inhibited every 2^{ni} clock cycles. This can be managed
by adding additional counters between each module. These counters control the
phase of each converter module and maintain the property that each converter
module rotates relative to the other modules. Using n-bit binary counters and
comparators, the circuit requires i n-bit comparators and 2i − 1 n-bit counters. The
advantage of using rotation as a method for generating independent bit streams is
that we can use operands with the same resolution, but this requires slightly more
circuitry than the “relatively-prime” method.
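The rotation method admits a similar sketch (again ours, with stream values 2/4 and 3/4 assumed): one stream is held fixed while the other advances its phase by one position after every full period:

    def fixed(value, length, t):
        return 1 if t % length < value else 0

    def rotated(value, length, t):
        phase = t // length  # the phase advances after each full period
        return 1 if (t % length + phase) % length < value else 0

    # Both streams have length 4; the rotation supplies the independence.
    out = [fixed(2, 4, t) & rotated(3, 4, t) for t in range(16)]
    print(sum(out) / len(out))  # 6/16 = (2/4) * (3/4), exactly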
Fig. 14 Circuit implementation of the “clock division” method
Clock Division
The “clock division” method works by clock dividing operands. Similar to the
“rotation” method, it operates on streams of arbitrary lengths. (This method was first
seen in Examples 1 and 2 in the section “Intuitive View”.) Figure 13 demonstrates
this method with two bit streams, both of length four. Bit stream B
is clock divided by the length of bit stream A.
Assuming all operands have the same length, we can generate an arbitrary
number of independent bit streams as long as the counter of every ith converter
module increments every 2^{ni} clock cycles. This can be implemented in circuit form
by simply chaining the converter module counters together, as shown in Fig. 14.
Using n-bit binary counters and comparators, the circuit requires i n-bit comparators
and i n-bit counters. This means the “clock division” method allows operands of
the same length to be used with approximately the same hardware complexity as the
“relatively-prime” method.
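A corresponding sketch of the clock division method (values 2/4 and 3/4 assumed): stream B's counter advances once per full period of A, so each held bit of B meets every bit of A:

    def stream(value, length, t):
        return 1 if t % length < value else 0

    # B is clock-divided by the length of A (4), so over 4 * 4 = 16 cycles
    # each bit of B is held for a full period of A.
    out = [stream(2, 4, t) & stream(3, 4, t // 4) for t in range(16)]
    print(sum(out) / len(out))  # 6/16 = (2/4) * (3/4), exactly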
Here we compare the hardware complexity and latency of the deterministic methods
with conventional stochastic methods. Perfectly precise computations require the
output resolution to be at least equal to the product of the independent input
resolutions. For example, with input bit stream lengths of n and m, the precise output
contains nm bits.
Consider a stochastic representation implemented with LFSRs. As discussed in
the section “Comparing Stochastic and Deterministic Representations”, a stochastic
representation requires bit streams that are 2^{2n} bits long to represent a value with 1/2^n
precision. In order to ensure that the generated bit streams are sufficiently random
and independent, each LFSR must have at least as many states as the required output
bit stream. Therefore, to compute with perfect precision each LFSR must have at
least length 2ni.
With our deterministic methods, the resolution n of each of the i inputs is deter-
mined by the length of its converter module counter. The output resolution is simply
the product of the counter ranges. For example, with the “clock division” method,
each converter module counter is connected in series. The series connection forms a
large counter with 2^{ni} states. This shows that output resolution is not determined by
the length of each individual number source, but by their concatenation. This allows
for a large reduction in circuit area compared to stochastic designs.
To compare the area of the circuits, we assume three gates for every cell of a
comparator and six gates for each flip-flop of a counter or LFSR (this is similar
to the hardware complexity used in [29] in terms of fan-in-two NAND gates). For
i inputs with n-bit binary resolution, the gate count for each basic component
is given by Table 1. Table 2 gives the total gate count and bit stream length
for precise computations in terms of independent inputs i with resolution n for
prior stochastic methods as well as the deterministic methods that we propose
here. The basic component totals for each deterministic method were discussed
in section “Deterministic Methods”. For stochastic methods, we assume that each
“Randomizer Unit” needs one comparator and one LFSR per input.
The equations of Table 2 show that our deterministic methods use less area and
compute to the same precision in exponentially less time. It is a win on both metrics,
but the reduction in latency is especially compelling: consider a reduction in bit
stream length from 2^{20} = 1,048,576 bits to just 2^{10} = 1,024 bits!
An Analog Approach
Fig. 15 An example of the scaled addition of two PWM signals using a MUX. Here IN1 and IN2
represent 0.2 and 0.6 with a period of 5 ns. Sel represents 0.5 with a period of 4 ns. The output
signal from t = 0 ns to 20 ns represents 0.40 (8 ns/20 ns = 4/10), the expected value from the
scaled addition of the inputs
With PWM signals, a value is encoded by the duty cycle: the fraction of time the
signal spends in the high (on) state compared to the low (off) state in each cycle.
An example was shown in Fig. 2 in the introduction.
As we will show, the key is choosing different periods for the PWM signals, and
letting the system run over multiple cycles. If we choose relatively prime periods
and run the signals to their common multiple, we achieve the effect of “convolving”
the signals. This is analogous to the approach that we took with deterministic digital
bit streams in the section “Relatively Prime Bit Lengths”, where we used relatively
prime bit stream lengths.
Figure 4 in the introduction showed an example of multiplication on PWM
signals. Here we show an example of addition. Recall that with stochastic logic,
scaled addition can be performed with a multiplexer (MUX). The performance
of a MUX as a stochastic scaled adder/subtracter is insensitive to the correlation
between its inputs. This is because only one input is connected to the output
at a time [24]. Thus, highly overlapped inputs like PWM signals with the same
frequency can be connected to the inputs of a MUX. The important point when
performing scaled addition and subtraction with a MUX on PWM signals is that
the period of the select signal should be relatively prime to the period of the input
signals.
Figure 15 shows an example of scaled addition on two numbers, 0.2 and 0.6,
represented by two PWM signals. Both have periods of 5ns. A PWM signal with a
duty cycle of 50% and period of 4ns is connected to the select input of the MUX.
As shown, after performing the operation for 20ns, the fraction of the total time the
output signal is high equals the expected value, 0.40.
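Because every signal edge in Fig. 15 falls on a 1 ns grid, the example can be checked exactly with a discrete-time Python sketch (the convention that Sel = 1 routes IN1 to the output, and the phase alignment at t = 0, are assumptions chosen to match the figure):

    def pwm(duty_ns, period_ns):
        return lambda t: 1 if t % period_ns < duty_ns else 0

    in1 = pwm(1, 5)  # 0.2 with a 5 ns period
    in2 = pwm(3, 5)  # 0.6 with a 5 ns period
    sel = pwm(2, 4)  # 0.5 with a 4 ns period, relatively prime to 5

    # MUX output over the common period lcm(5, 4) = 20 ns, in 1 ns steps
    out = [in1(t) if sel(t) else in2(t) for t in range(20)]
    print(sum(out) / len(out))  # 0.4 = 0.5*0.2 + 0.5*0.6 (8 ns high out of 20 ns)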
Conclusion
Applying these deterministic methods to analog periodic signals could produce
designs that are much faster than conventional ones, operating in the
terahertz range. This remains a work in progress. Potentially, this paradigm could
deliver circuits that are as efficient in terms of area and power as stochastic circuits,
with considerably lower latency.
References
18. P. Li, D. Lilja, W. Qian, K. Bazargan, and M. D. Riedel, “The synthesis of complex arithmetic
computation on stochastic bit streams using sequential logic,” in International Conference on
Computer-Aided Design, 2012, pp. 480–487.
19. W. Qian, X. Li, M. D. Riedel, K. Bazargan, and D. J. Lilja, “An architecture for fault-tolerant
computation with stochastic logic,” IEEE Transactions on Computers, vol. 60, no. 1, pp. 93–
105, 2011.
20. W. Qian and M. D. Riedel, “The synthesis of robust polynomial arithmetic with stochastic
logic,” in Design Automation Conference, 2008, pp. 648–653.
21. W. Qian, M. D. Riedel, K. Bazargan, and D. Lilja, “The synthesis of combinational logic
to generate probabilities,” in International Conference on Computer-Aided Design, 2009, pp.
367–374.
22. W. Qian, M. D. Riedel, H. Zhou, and J. Bruck, “Transforming probabilities with combinational
logic,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (to
appear), 2011.
23. W. Qian, C. Wang, P. Li, D. Lilja, K. Bazargan, and M. D. Riedel, “An efficient imple-
mentation of numerical integration using logical computation on stochastic bit streams,” in
International Conference on Computer-Aided Design, 2012, pp. 156–162.
24. A. Alaghi and J. P. Hayes, “On the functions realized by stochastic computing circuits,” in
Proceedings of the 25th Edition on Great Lakes Symposium on VLSI, ser. GLSVLSI ’15.
New York, NY, USA: ACM, 2015, pp. 331–336. [Online]. Available: http://doi.acm.org/10.
1145/2742060.2743758
25. S. S. Tehrani, A. Naderi, G.-A. Kamendje, S. Hemati, S. Mannor, and W. J. Gross, “Majority-
based tracking forecast memories for stochastic ldpc decoding,” IEEE Transactions on Signal
Processing, vol. 58, pp. 4883–4896, 2010.
26. B. Gaines, “Stochastic computing systems,” in Advances in Information Systems Science.
Plenum Press, 1969, vol. 2, ch. 2, pp. 37–172.
27. B. Brown and H. Card, “Stochastic neural computation I: Computational elements,” IEEE
Transactions on Computers, vol. 50, no. 9, pp. 891–905, 2001.
28. P. Li, D. J. Lilja, W. Qian, K. Bazargan, and M. D. Riedel, “Case studies of logical computation
on stochastic bit streams,” in Lecture Notes in Computer Science: Proceedings of Power
and Timing Modeling, Optimization and Simulation Workshop, G. Goos, J. Hartmanis, and
J. Leeuwen, Eds. Springer, 2012.
29. P. Li, W. Qian, and D. J. Lilja, “A stochastic reconfigurable architecture for fault-tolerant
computation with sequential logic,” IEEE 30th International Conference on Computer Design
(ICCD), 2012.
30. M. H. Najafi and D. Lilja, “High quality down-sampling for deterministic approaches to
stochastic computing,” IEEE Transactions on Emerging Topics in Computing, pp. 1–1, 2018.
Generating Stochastic Bitstreams
Abstract Stochastic computing (SC) hinges on the generation and use of stochastic
bitstreams—streams of randomly generated 1s and 0s, with the probabilities of p
and 1 − p, respectively. We examine approaches for stochastic bitstream generation,
considering randomness, circuit area/performance/cost, and the impact of the
various approaches on SC accuracy. We first review the widely used Linear-Feedback
Shift Register (LFSR)-based approach and variants. Alternative low-discrepancy
sequences are then discussed, followed by techniques that leverage post-CMOS
technologies and metastability of devices as sources of randomness. We conclude
with a discussion on correlations between bitstreams, and how (1) correlations can
be reduced/eliminated, and (2) correlations may actually be leveraged to positive
effect in certain circumstances.
Introduction
Overview
Sequence Generation
The most widely used implementation for the random-number generator in Fig. 1
is a linear-feedback shift register (LFSR). An N -bit LFSR comprises N flip-flops,
as well as XOR gates in a feedback configuration. Figure 2 shows an example 3-bit
LFSR. With N bits, there are 2^N possible binary numbers; however, the all-zeros
state is generally not used, leaving 2^N − 1 numbers. A maximal-length LFSR walks
through all such 2^N − 1 numbers in a deterministic, yet pseudo-random, order. For
example, the LFSR in Fig. 2 walks through all seven nonzero states in a fixed
pseudo-random sequence (output bits are Q2 Q1 Q0).
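As a behavioral illustration, a 3-bit maximal-length LFSR can be modeled in a few lines of Python; since the tap positions of Fig. 2 are not reproduced here, the feedback polynomial x^3 + x^2 + 1 below is an assumption:

    def lfsr3(seed=0b001, n=7):
        # 3-bit Fibonacci LFSR; feedback is Q2 XOR Q1 (polynomial x^3 + x^2 + 1, assumed)
        state, out = seed, []
        for _ in range(n):
            out.append(state)
            fb = ((state >> 2) ^ (state >> 1)) & 1
            state = ((state << 1) | fb) & 0b111
        return out

    print([format(s, '03b') for s in lfsr3()])
    # ['001', '010', '101', '011', '111', '110', '100'] -- all 7 nonzero states, then repeat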
Low-Discrepancy Sequences
The van der Corput sequence [3] {X1 , X2 , X3 . . .}, where 0 < Xi < 1, is an infinite-
length sequence with distinct values for each term. It is “seeded” by picking a base
b, and constructed by reversing the base-b representation of natural numbers i ≥ 1.
Every natural number i can be expressed in base b as

i = Σ_{j=0}^{∞} a_j(i) b^j

where a_j(i) ∈ {0, 1, . . . , b − 1} and a_j(i) = 0 for all sufficiently large j. The i-th
term of the van der Corput sequence is then

X_i = Σ_{j=0}^{∞} a_j(i) b^{−j−1}
For example, in base b = 3 the natural numbers i = 1, 2, 3, . . . are written as

001 002 010 011 012 020 021 022 100 101 ...
where the individual digits in each number are the aj (i) coefficients. The van der
Corput sequence flips the digits of i at the radix point, giving
0.100 0.200 0.010 0.110 0.210 0.020 0.120 0.220 0.001 0.101 ...
which, read as base-3 fractions, are the values

1/3, 2/3, 1/9, 4/9, 7/9, 2/9, 5/9, 8/9, 1/27, 10/27, . . .
The van der Corput sequence is a one-dimensional LD sequence, and the Halton
sequence generalizes the van der Corput sequence to higher dimensions. The Halton
sequence uses co-prime numbers as its bases for each dimension. For a circuit that
requires k uncorrelated inputs, a k-dimensional Halton sequence generator will be
needed, each using a different base b.
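A short Python sketch (illustrative software, not the hardware generator of Fig. 3) reproduces the base-3 values above by digit reversal and assembles a Halton point from co-prime bases:

    def van_der_corput(i, base):
        # i-th term: reverse the base-b digits of i about the radix point
        x, denom = 0.0, base
        while i:
            i, digit = divmod(i, base)
            x += digit / denom
            denom *= base
        return x

    print([van_der_corput(i, 3) for i in range(1, 5)])
    # [0.333..., 0.666..., 0.111..., 0.444...], i.e. 1/3, 2/3, 1/9, 4/9

    # a k-dimensional Halton point uses k co-prime bases, e.g. (2, 3):
    point = [van_der_corput(7, b) for b in (2, 3)]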
Hardware Implementation Figure 3 shows the structure of the Halton sequence
generator proposed by Alaghi and Hayes [6]. It consists of a binary-coded base-b
counter, where b is a prime number. The order of the output digits from the counter
is reversed and the resulting number is converted to a binary number with the base-
b-to-binary converters and the adder. When b = 2, Fig. 3 reduces to a simple binary
counter. For k inputs, k copies of the circuit with different prime bases are needed.
Sobol’ Sequences
The Sobol’ sequence is a base-2 (t, s)-sequence [3, 5] that is infinite in length, with
distinct values for each term. We describe the steps required to generate the Sobol’
sequence based on the algorithms proposed by Bratley and Fox [7].
Let b_k denote the k-th bit (counting from the least significant end) of the binary
representation of i. The i-th Sobol' number is formed by XOR-ing together a set of
“direction vectors” v_j selected by these bits:

X_i = b_1 v_1 ⊕ b_2 v_2 ⊕ . . .   (1)

Equivalently, using the bits g_k of the Gray code of i,

X_i = g_1 v_1 ⊕ g_2 v_2 ⊕ . . .   (2)

Since the Gray code of i can be obtained with g_k = b_k ⊕ b_{k+1}, and the Gray
codes of i and i + 1 differ in exactly one bit position c (the position of the
least-significant zero bit of i), the expression in Eq. (2) can be rewritten as the
recurrence

X_{i+1} = X_i ⊕ v_c   (3)

Each direction vector v_j is a binary fraction:

v_j = 0.v_{j1} v_{j2} v_{j3} . . . v_{jk} . . .   or, equivalently,   v_j = m_j / 2^j

where m_j is an odd integer with m_j < 2^j. The m_j are constructed from a
primitive polynomial x^d + a_1 x^{d−1} + . . . + a_{d−1} x + 1 with coefficients
a_i ∈ {0, 1}, using a recurrence relation in which j > d and the last term is v_{j−d}
shifted right by d bit positions. For example, the primitive polynomial

x^3 + x + 1

(d = 3, a_1 = 0, a_2 = 1) gives the recurrence

m_j = 4m_{j−2} ⊕ 8m_{j−3} ⊕ m_{j−3}
To use the recurrence equation, initial values for the first d m_j’s need to be
assigned. The initial values can be chosen freely as long as they are odd and
m_j < 2^j. Assuming we assign m_1 = 1, m_2 = 3, and m_3 = 7, then
m4 = 4(3) ⊕ 8(1) ⊕ 1 = 5
m5 = 4(7) ⊕ 8(3) ⊕ 3 = 7
m6 = 4(5) ⊕ 8(7) ⊕ 7 = 43
The first 6 direction vectors are shown in Table 1. Using these precomputed
direction vectors and Eq. (2), the Sobol’ sequence can be computed, as shown in
Table 2. Here, we are assuming that X0 = 0.
The i-th value in the sequence can also be obtained directly using Eq. (2). For
example, for i = 23, the Gray code representation is i = 11100. X23 can therefore
be computed as:
X_23 = v_3 ⊕ v_4 ⊕ v_5
     = 0.11100 ⊕ 0.01010 ⊕ 0.00111
     = 0.10001
     = 17/32
[Fig. 4: Sobol’ sequence generator circuit: a counter, a least-significant-zero (LSZ)
detector selecting the direction vector v_c from a RAM, and an XOR update
producing the next value X_i]
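The construction fits in a few lines of Python (a software sketch of the Gray-code recurrence of Bratley and Fox [7], using the initial values m1 = 1, m2 = 3, m3 = 7 from the text; the 5-bit precision is an illustrative choice):

    def direction_ints(n):
        # m_j for x^3 + x + 1: m_j = 4*m_{j-2} XOR 8*m_{j-3} XOR m_{j-3}
        m = [1, 3, 7]
        while len(m) < n:
            m.append(4 * m[-2] ^ 8 * m[-3] ^ m[-3])
        return m

    def sobol(n_terms, n_bits=5):
        m = direction_ints(n_bits)
        v = [mj << (n_bits - j - 1) for j, mj in enumerate(m)]  # v_j = m_j / 2^j, scaled
        x, seq = 0, []
        for i in range(n_terms):
            seq.append(x / 2 ** n_bits)
            c = (~i & (i + 1)).bit_length() - 1  # least-significant zero bit of i
            x ^= v[c]                            # X_{i+1} = X_i XOR v_c, Eq. (3)
        return seq

    print(sobol(24)[23])  # 0.53125 = 17/32, matching the worked example for X_23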
Comparison of LD Sequences
Liu and Han [10] compared the energy efficiency of stochastic number generators
(SNGs) based on LFSRs, Halton sequences, and Sobol’ sequences. They used
the root-mean-square error (RMSE) as a measure of accuracy, then compared the
performance of different circuit implementations for a given accuracy. Figure 5
shows the accuracy results achieved by the stochastic computing implementations
using LFSRs, the Halton sequence, and the Sobol’ sequence, for a 2-input multiplier
circuit and a 3rd-order Bernstein polynomial [11] circuit. On the same graph, the
functionally equivalent circuit implemented with the conventional binary number
system is plotted, providing a comparison with the stochastic computing imple-
mentations. For a simple 2-input multiplier circuit, the RMSE of the LD sequences
decreases at a rate similar to that of binary numbers, as the length of the LD sequence
increases at the rate of 2^N while the bitwidth of the binary number increases at the rate
of N. The LFSR’s RMSE decreases at a much slower rate as the length increases.
For a 3rd-order Bernstein polynomial circuit [11], the gap between how accuracy
scales for binary and LD sequences widens slightly. However, the LD sequences
still provide better accuracy than an LFSR in terms of how RMSE scales as length
increases. Overall, Liu and Han [10] report that the LD sequences provide higher
accuracy than an LFSR for the same sequence length in these two applications.
Similar observations were made by Seva et al. [12] when they compared the use of
an LD sequence versus an LFSR for edge detection applications.
Fig. 5 Accuracy comparison between different random number generators with different sequence
lengths for (a) a 2-input multiplier circuit and (b) a 3rd-order Bernstein polynomial circuit [10].
The horizontal axis is the resolution (2^N-bit sequence length or N-bit width) and the vertical
axis is the RMSE of the LFSR, Halton, Sobol, and binary implementations
Liu and Han quantified the performance of the circuits using energy per operation
(EPO) and throughput per area (TPA). EPO = Power × T_clk × L, where L is the
sequence length, T_clk is the clock period, and Power is the power measured at T_clk.
TPA = (# of effective bits)/(t_c × L × area), where t_c is the critical path delay and
(# of effective bits) is the bitwidth of the corresponding binary representation,
log2(L). For LD sequences requiring few dimensions (i.e., a small number of
uncorrelated inputs), the EPO is lower and the TPA higher than for the LFSR-based
circuits. With increasing dimension, the area overhead required to generate
uncorrelated sequences grows faster for LD sequences than for LFSRs, resulting in
comparable EPO and TPA between the LD sequences and the LFSR-based circuits.
Detailed performance results can be found in the work by Liu and Han [10].
Fig. 7 RS latches-based TRNG on FPGA [14]; (a) RS latch; (b) FPGA RS latch implementation;
(c) employing multiple RS latches for higher entropy
Beyond ring oscillators, some recent works such as [15] further utilize physically
unclonable functions (PUFs), which arise from physical manufacturing variations
within the integrated circuit, as a source of variability/randomness.
Because ring oscillator-based TRNGs are power-hungry, other implementations,
such as latch-based TRNGs, have also been studied [14, 16]. For example, for the
RS latch in Fig. 7a, activating both R and S inputs simultaneously is normally
prohibited, as the latch enters a metastable state that resolves to (Q, Q̄) being
either (0, 1) or (1, 0). A
TRNG can be constructed by feeding the same Clk to R and S of the RS latch—
when Clk = 1, metastability is realized. In [14], this concept is implemented
to generate entropy using two lookup tables (LUTs) in an FPGA, as shown in
Fig. 7b. Although in theory a single RS latch could realize a TRNG, one latch alone
cannot generate sufficient entropy. Therefore, multiple copies of the RS latch are
needed, with their outputs combined by XOR (see Fig. 7c). Furthermore, using
distant LUTs to realize the RS latch-based TRNG offers the opportunity to
“capture” additional noise, such as thermal noise (i.e., larger entropy), through the
longer wires, leading to higher-quality randomness. However, this affects the
throughput of the TRNG. The authors of [14] studied the suitable number of latches
in terms of quality and throughput (leaving the place-and-route task to the vendor
tool), and found that 64–256 latches achieve a reasonable trade-off of quality
versus cost.
[Figure: MTJ device stack: electrode, free layer, insulator, fixed layer, electrode,
with current I flowing through the stack]
A natural question is whether circuitry can be shared when generating multiple
bitstreams. This question has been considered in a number of recent works, e.g.
[24–26], which propose sharing portions of the bitstream-generation circuitry. With
sharing of circuitry, however, comes the possibility of correlations among bitstreams.
Ichihara et al. [25] proposed that the same random number generator (LFSR)
be used to generate multiple stochastic bitstreams, with a unique comparator for
each stream. To reduce correlations, the LFSR output bits are rotated by a fixed and
unique number of bits for each bitstream, thereby making the random numbers fed to
each comparator appear different. The idea is depicted in Fig. 9 for the two-bitstream
case. Rotating the LFSR output bits by a fixed amount is “free” in hardware, as it
can be done solely with wiring.
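A minimal Python sketch of the idea (ours; the 8-bit width and the rotate-by-k-per-stream assignment are assumptions): every stream sees the same LFSR word, rotated by a stream-specific amount, at its own comparator:

    def shared_lfsr_streams(lfsr_words, consts, n_bits=8):
        # stream k compares the LFSR word rotated left by k bits against its constant;
        # the rotation is free in hardware, since it is wiring only
        mask = (1 << n_bits) - 1
        streams = []
        for k, c in enumerate(consts):
            rotated = [((w << k) | (w >> (n_bits - k))) & mask for w in lfsr_words]
            streams.append([1 if r < c else 0 for r in rotated])
        return streams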
Ichihara et al. [25] also noted that for certain functions, such as stochastic
scaled addition using a 2-to-1 multiplexer, correlations between the MUX data
inputs are acceptable in that they do not impact output accuracy. Ding et al.
[27] leverage this property in the generation of stochastic bitstreams by using a
single LFSR, along with a logic circuit, to generate multiple correlated bitstreams
to feed MUX data inputs in the MUX-based stochastic computing architecture
proposed in [1]. Specifically, the authors in [27] note that logic functions applied to
stochastic bitstreams having a probability of 0.5 can be used to generate bitstreams
with specifically chosen probabilities. The individual LFSR bits are fed into an
optimized multi-output logic circuit to produce multiple output bitstreams with
desired probabilities at limited area cost.
More recently, Neugebauer et al. [26] proposed an alternative hardware-sharing
approach that feeds the LFSR output bits into an S-box, an n-input-to-n-output
Boolean function borrowed from cryptography. As in [25], the work in [26] uses
different bit-wise rotation amounts for different stochastic bitstreams; however, the
presence of the S-box provides improved statistical independence between the
generated streams. Finally, Li and Lilja [24] propose techniques for generating a
lengthy stochastic bitstream, where overlapping portions of the stream are used as
shorter bitstreams having specific desired probabilities.
Fig. 9 Sharing the LFSR to reduce the hardware cost of stochastic bitstream generation [25]
Summary
References
1. W. Qian, X. Li, M. D. Riedel, K. Bazargan, and D. J. Lilja, “An architecture for fault-tolerant
computation with stochastic logic,” IEEE Transactions on Computers, vol. 60, no. 1, pp. 93–
105, 2011.
Introduction
Stochastic Switching
The temporal variations in resistive switching devices have been studied recently
[23]. A binary memristive device is usually engineered to have a relatively high
threshold voltage of approximately 5 V. Bias voltages lower than the threshold
voltage can be applied to uncover a device’s temporal variations. In such an
experiment, the current through the device under test is continuously monitored
until a sharp jump in the current is observed, indicating the device has turned ON.
The wait time leading to the switch is recorded. The device is then reset to the OFF
state and the experiment is repeated.
As Fig. 1 indicates, the wait time is not constant; it varies
from cycle to cycle even for the same device. The memristive filament formation
associated with the OFF to ON transition is driven by thermodynamics and involves
oxidation and ion transport. Since all these physical processes are thermally
Fig. 1 Random distribution of the wait time prior to switching. (a–c) Distributions of wait times
for applied voltages of 2.5 V (a), 3.5 V (b) and 4.5 V (c). Solid lines: fitting to the Poisson
distribution (1) using τ as the only fitting parameter. τ = 340 ms, 4.7 ms and 0.38 ms for (a)–
(c), respectively. Insets: (a) DC switching curve, (b) example of a wait time measurement and (c)
scanning electron micrograph of a typical device (scale bar: 2.5 μm). (d) Dependence of τ on the
programming voltage. Solid squares were obtained from fitting of the wait time distributions while
the solid line is an exponential fit. Reproduced from [23]
activated and thermal activation over the dominant energy barrier is probabilistic
in nature if only a dominant filament is involved [24]. Therefore, in theory, the wait
time should follow a Poisson distribution, and the probability of a switching event
occurring within an interval Δt at time t is given by

P(t)Δt = (Δt/τ) e^{−t/τ}   (1)

where τ is the characteristic wait time.
The switching of a memristive device is determined by the characteristic wait
time. The characteristic wait time decreases exponentially with the applied voltage.
This is consistent with the dominant filament model since both oxidation and ion
transport are dependent on the electric field. As the applied voltage is increased,
the effective activation barrier is reduced, resulting in an exponential speed-up in
switching [24–26].
By integrating the distribution in (1), the switching probability by a certain
time t after the external voltage is applied, C(t), can be obtained:

C(t) = 1 − e^{−t/τ}   (2)
Fig. 2 Generation of non-correlated bit streams. Different devices (A–D) when programmed
with identical bias conditions give non-correlated bit streams but with very similar bias (∼0.6).
Reproduced from [23]
Fig. 3 Native stochastic computing system using RRAM-based stochastic memory. Reproduced
from [29]
required for the deterministic use of RRAM. The approach combines stochastic
computing with temporally varying RRAM devices to enable an efficient computing
system that is not possible with either stochastic computing or RRAM alone.
The in-memory stochastic computing system is illustrated in Fig. 3. The system
consists of RRAM integrated with CMOS periphery and logic circuits. The system
directly accepts analog input. RRAM converts the analog input to a stochastic bit
stream. Computing is entirely done in the bit stream domain. The output bit stream
is written to RRAM. A write to RRAM allows the input to be converted to a new bit
stream, which serves the purpose of reshuffling needed in stochastic computing to
prevent reconvergent fanout.
The in-memory stochastic computing system accepts analog inputs directly.
Binary to bit stream conversions are eliminated, but amplifiers and sample and hold
circuitry may be needed. In comparison, a classic stochastic computing system is
entirely digital and requires analog-to-digital conversion to accept analog inputs.
The in-memory stochastic computing system takes advantage of the randomness
inherent in RRAM devices that is only present when the operating voltage is
relatively low, which naturally leads to good energy efficiency. The in-memory
stochastic computing also inherits all the benefits of a classic stochastic computing
system: if a high performance is needed, simple stochastic arithmetic circuits can be
parallelized in a flat topology; the independence between bits in a bit stream cuts the
critical path delay and simplifies routing; and stochastic computing is error-resilient,
tolerating noise and soft errors.
In the following subsections, we describe the aspects of the in-memory stochastic
computing system.
Stochastic Programming
In an RRAM device, the high-resistance state represents OFF or 0, and the low-
resistance state represents ON or 1. As explained previously, the switching of a
RRAM device from 0 to 1 is a stochastic process. We can use voltage and pulse
width to adjust the switching probability. To save energy, short pulses and low
voltage are preferred. A low voltage also prevents device wear-out and prolongs
device’s lifetime.
Write Compensation
Fig. 4 (a) Stochastic group write to memristor using pulse train, (b) voltage pre-distortion, and
(c) parallel single-pulse write. Reproduced from [29]
cell is less than 0.125 × 2 = 0.25 due to the nonlinear relationship between switching
probability and the number of pulses, or pulse width. The nonlinear write process
introduces inaccuracy, and a compensation scheme is needed.
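The sublinearity is easy to quantify under a simple independence assumption (ours): if each pulse switches the cell with probability p, the ON probability after n identical pulses is 1 − (1 − p)^n. A two-line Python check (p = 0.125, as in the example above) shows the gap the compensation schemes must undo:

    def group_write_prob(n_pulses, p=0.125):
        # ON probability after n independent, identical pulses; sublinear in n
        return 1 - (1 - p) ** n_pulses

    print(group_write_prob(2))  # 0.234375, which is less than 2 * 0.125 = 0.25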
Voltage Predistortion The approach is illustrated in Fig. 4b. The write voltage is
increased for each subsequent pulse to undo the nonlinearity of the write process.
The approach is expensive since it requires many voltage levels to be provided.
A piecewise approximation can be applied to reduce the number of voltage levels
needed to reduce cost. In the above example, a three-piece approximation, i.e., three
voltages, reduces the relative error to 2.5%.
Downscaled Write and Upscaled Read A downscaled write scales a value to a
lower range. Within a lower range, the nonlinearity error is reduced even without
applying any compensation. To recover the downscaled value, an upscaled read
through a scalar gain function, such as [24], can be applied in the readout. The
downscaled write and upscaled read approach avoids using multiple voltages, but
small nonlinearity errors remain.
Parallel Single-Pulse Write The approach is illustrated in Fig. 4c, where single
pulses in a pulse train are applied to multiple columns of RRAM cells in parallel.
Using this approach, only one pulse is applied to a column of cells, avoiding the
nonlinearity issue experienced in successive writes. The parallel single-pulse write
requires an extra merge step to compress a 2D array of bits to a 1D bit stream by
OR’ing the bits in every row. An error can be introduced when there are two or more
1s in a row, where an OR produces only a single 1 in the output. Such an inaccuracy
can be compensated through a simple offset correction. The parallel single-pulse
write approach is relatively simple to implement, but it uses more memory.
Test Applications
Fig. 6 Stochastic gradient descent algorithm using (a) 32-Kbit stochastic bit stream with ideal
write, (b) 32-Kbit stochastic bit stream with voltage predistortion, (c) 256-Kbit stochastic bit
stream with downscaled write and upscaled read. Reproduced from [29]
k-means is a popular clustering algorithm [28] for placing a set of data points into
different clusters whose members are similar. The k-means algorithm involves three
steps: (1) select k cluster centers (centroids); (2) place a data point in one of the
clusters to minimize the distance between the data point and the cluster centroid;
(3) update the centroid of each cluster based on all the data points placed in the
cluster. Steps (2) and (3) are iterated until convergence.
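For reference, the three steps correspond to the following plain-Python k-means with an L1 metric (a floating-point sketch; in the design below, these distances and averages are instead computed on RRAM-derived bit streams):

    import random

    def kmeans_l1(points, k, iters=10, rng=random.Random(0)):
        centroids = rng.sample(points, k)            # step (1): pick k centroids
        for _ in range(iters):                       # iterate steps (2) and (3)
            clusters = [[] for _ in range(k)]
            for p in points:                         # step (2): nearest centroid, L1 distance
                j = min(range(k),
                        key=lambda c: sum(abs(a - b) for a, b in zip(p, centroids[c])))
                clusters[j].append(p)
            for j, cl in enumerate(clusters):        # step (3): recompute the centroids
                if cl:
                    centroids[j] = tuple(sum(x) / len(cl) for x in zip(*cl))
        return centroids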
The hardware design of a k-means clustering processor is illustrated in Fig. 7,
assuming an L1 distance metric. In a stochastic computing implementation, input data
points and centroids are stored in RRAMs and the readouts are in bit streams; and L1
distances and comparisons are calculated by stochastic arithmetic. Once an iteration
of k-means clustering is done, stochastic averaging is done to update the cluster
centroids.
The stochastic design of a k-means clustering processor is simulated using 4-Kbit
stochastic bit streams following a bipolar stochastic number representation. Sets
of 256 data points are placed in three clusters based on the L1 distance metric. The
ideal write and the voltage-predistortion compensation technique are used in the
simulations. The simulations produce satisfactory clustering, as shown in Fig. 8.
Fig. 8 256-point k-means clustering with 4-Kbit stochastic bit stream using (a) ideal write, (b)
voltage pre-distortion with number of voltage levels chosen to meet 0.1% error bound, (c) voltage
pre-distortion with number of voltage levels chosen to meet 0.001% error bound. Reproduced from
[29]
Concluding Remarks
Acknowledgements This work was supported in part by NSF CCF-1217972. The work of W. Lu
was supported by the NSF ECCS-0954621 and in part by the AFOSR under MURI grant FA9550-
12-1-0038.
References
11. Hammadou, Tarik, Magnus Nilson, Amine Bermak, and Philip Ogunbona. “A 96/spl times/64
intelligent digital pixel array with extended binary stochastic arithmetic.” In Circuits and
Systems, 2003. ISCAS’03. Proceedings of the 2003 International Symposium on, vol. 4, pp.
IV-IV. IEEE, 2003.
12. Gaudet, Vincent C., and Anthony C. Rapley. “Iterative decoding using stochastic computation.”
Electronics Letters 39, no. 3 (2003): 1.
13. Tehrani, S. Sharifi, Warren J. Gross, and Shie Mannor. “Stochastic decoding of LDPC codes.”
IEEE Communications Letters 10, no. 10 (2006): 716–718.
14. Govoreanu, B., G. S. Kar, Y. Y. Chen, V. Paraschiv, S. Kubicek, A. Fantini, I. P. Radu et al.
“10×10 nm² Hf/HfOx crossbar resistive RAM with excellent performance, reliability and low-
energy operation.” In Electron Devices Meeting (IEDM), 2011 IEEE International, pp. 31–6.
IEEE, 2011.
15. Lee, Myoung-Jae, Chang Bum Lee, Dongsoo Lee, Seung Ryul Lee, Man Chang, Ji Hyun Hur,
Young-Bae Kim et al. “A fast, high-endurance and scalable non-volatile memory device made
from asymmetric Ta2O5−x/TaO2−x bilayer structures.” Nature Materials 10, no. 8 (2011):
625.
16. Chin, Albert, C. H. Cheng, Y. C. Chiu, Z. W. Zheng, and Ming Liu. “Ultra-low switching power
RRAM using hopping conduction mechanism.” ECS Transactions 50, no. 4 (2013): 3–8.
17. Strachan, John Paul, Antonio C. Torrezan, Gilberto Medeiros-Ribeiro, and R. Stanley
Williams. “Measuring the switching dynamics and energy efficiency of tantalum oxide
memristors.” Nanotechnology 22, no. 50 (2011): 505402.
18. Park, Jubong, K. P. Biju, Seungjae Jung, Wootae Lee, Joonmyoung Lee, Seonghyun Kim,
Sangsu Park, Jungho Shin, and Hyunsang Hwang. “Multibit Operation of TiOx -Based ReRAM
by Schottky Barrier Height Engineering.” IEEE Electron Device Letters 32, no. 4 (2011): 476–
478.
19. Baek, I. G., D. C. Kim, M. J. Lee, H-J. Kim, E. K. Yim, M. S. Lee, J. E. Lee et al. “Multi-layer
cross-point binary oxide resistive memory (OxRRAM) for post-NAND storage application.”
In Electron Devices Meeting, 2005. IEDM Technical Digest. IEEE International, pp. 750–753.
IEEE, 2005.
20. Baek, I. G., C. J. Park, H. Ju, D. J. Seong, H. S. Ahn, J. H. Kim, M. K. Yang et al. “Realization
of vertical resistive memory (VRRAM) using cost effective 3D process.” In Electron Devices
Meeting (IEDM), 2011 IEEE International, pp. 31–8. IEEE, 2011.
21. Yu, Shimeng, Ximeng Guan, and H-S. Philip Wong. “On the switching parameter variation
of metal oxide RRAM, Part II: Model corroboration and device design strategy.” IEEE
Transactions on Electron Devices 59, no. 4 (2012): 1183–1188.
22. Chen, An, and Ming-Ren Lin. “Variability of resistive switching memories and its impact
on crossbar array performance.” In Reliability Physics Symposium (IRPS), 2011 IEEE
International, pp. MY-7. IEEE, 2011.
23. Gaba, Siddharth, Patrick Sheridan, Jiantao Zhou, Shinhyun Choi, and Wei Lu. “Stochastic
memristive devices for computing and neuromorphic applications.” Nanoscale 5, no. 13 (2013):
5872–5878.
24. Jo, Sung Hyun, Kuk-Hwan Kim, and Wei Lu. “Programmable resistance switching in
nanoscale two-terminal devices.” Nano letters 9, no. 1 (2008): 496–500.
25. Strukov, Dmitri B., and R. Stanley Williams. “Exponential ionic drift: fast switching and low
volatility of thin-film memristors.” Applied Physics A 94, no. 3 (2009): 515–519.
26. Schroeder, Herbert, Victor V. Zhirnov, Ralph K. Cavin, and Rainer Waser. “Voltage-time
dilemma of pure electronic mechanisms in resistive switching memory cells.” Journal of
applied physics 107, no. 5 (2010): 054517.
27. Nocedal, Jorge, and Stephen J. Wright. Numerical Optimization, 2nd ed. Springer, 2006.
28. MacQueen, James. “Some methods for classification and analysis of multivariate obser-
vations.” In Proceedings of the fifth Berkeley symposium on mathematical statistics and
probability, vol. 1, no. 14, pp. 281–297. 1967.
29. Knag, Phil, Wei Lu, and Zhengya Zhang. “A native stochastic computing architecture enabled
by memristors.” IEEE Transactions on Nanotechnology 13, no. 2 (2014): 283–293.
Spintronic Solutions for Stochastic
Computing
Xiaotao Jia, You Wang, Zhe Huang, Yue Zhang, Jianlei Yang, Yuanzhuo Qu,
Bruce F. Cockburn, Jie Han, and Weisheng Zhao
Introduction
An MTJ consists of an oxide barrier sandwiched between two ferromagnetic (FM)
layers; the Tunnel MagnetoResistance (TMR) effect in such junctions was
discovered by Julliere [12] in 1975.
The resistance of the MTJ depends on the relative magnetization orientation of the
two FM layers (R_P in the parallel (P) state and R_AP in the antiparallel (AP)
state). As the MTJ resistance can be made comparable with that of CMOS
transistors, the device can be integrated into memories and logic circuits to represent
logic ‘0’ or ‘1’. Its characteristic is quantified by the TMR ratio, (R_AP − R_P)/R_P.
Since the first experimental demonstration of the TMR effect, the development of
the MTJ has been rapidly propelled by improvements in the TMR ratio and by
reductions in the energy consumed by the switching approaches (between R_P and
R_AP). The switching method has evolved from field-induced magnetic switching
(FIMS, ∼10 mA) and thermally assisted switching (TAS, ∼1 mA) to the currently
widely used spin transfer torque switching (STT, ∼100 μA). Without the need for a
magnetic field, STT makes it possible to achieve high-density, low-power
magnetoresistive random access memory (MRAM). The MTJ with interfacial
perpendicular magnetic anisotropy (PMA-MTJ), reported by Ikeda et al. [10],
features a low switching current (49 μA) and high thermal stability. Recently,
atom-thick tungsten layers have been integrated in the PMA-MTJ in place of the
conventional tantalum layers to obtain a larger TMR ratio and higher thermal
stability [18, 26].
Figure 1 shows the typical structure of the STT-PMA-MTJ, which mainly consists
of three layers: two FM layers separated by an insulating oxide barrier. With
the STT mechanism, the MTJ switches between its two states when a bidirectional
current I exceeds the critical current I_c0. The switching of the MTJ state is
not immediate after the injection of current, resulting in an incubation delay. The
dynamics of the MTJ are mainly characterized by the average switching delay τ_sw
(the delay at 50% switching probability). Depending on the magnitude of the
switching current, the dynamic behavior of the MTJ can be divided into two
regimes [14]: the Sun model (I > I_c0) and the Neel-Brown model (I < 0.8 I_c0).
The former, also called precessional switching, provides fast switching (down to
below 3 ns) but consumes more energy due to the high current density [29].
Conversely, the latter consumes less energy with a low current density but leads to
slower, thermally-assisted switching [8]. The two regimes are derived
from the Landau-Lifshitz-Gilbert equation. τ_sw can be calculated from Eqs. (1)
and (2):

τ_sw = τ_0 · exp( (E_b / (k_B T)) (1 − I/I_c0) ),   when I < 0.8 I_c0   (1)

1/τ_sw = ( 2 μ_B P_ref (I − I_c0) ) / ( (C + ln(π²ζ/4)) e m_m (1 + P_ref P_free) ),   when I > I_c0   (2)
where t_pulse is the voltage pulse width and delay is a fitting parameter. Figure 3
demonstrates the switching probability as a function of stress voltage and pulse width.
Fig. 2 The precession of magnetization under the influence of a spin current [28]: time
dependence of (a) Mz and (b) Mx; (c) the reversal process of the magnetic moment
Fig. 3 Switching probability Psw as a function of pulse width [27]: the lines are theoretical
values plotted from Eq. (3) and the markers are statistical results from 1000 Monte Carlo
simulations in Cadence
A tunable switching current I_sw can be applied to control the switching probability
and thus obtain a random bitstream during the circuit design phase.
Compared with a single device, the stochastic bitstream generated by parallel MTJs
will have a smaller standard deviation in its probability. In other words, the biased
probabilities of the individual MTJs are averaged, so that the overall probability
gets closer to 50%.
The schematic of this parallel MTJ TRNG design is shown in Fig. 4. According
to the precision requirements of the stochastic bitstream, the number of parallel
MTJs can be adjusted. To generate an N-bit stochastic number, the circuit
needs N + 2 phases: a reset phase, a write phase, and N read phases, with each phase
set to 5 ns. During each phase, the corresponding control signal is driven high while
the others are held low. All MTJs work simultaneously during the first two phases
while one MTJ is sensed each time in the read phases. Here the N + 2 phases are
explained in detail:
(1) Reset Phase
After the output of the previous cycle is completed, it is necessary to reset all MTJs
back to the initial state before the write phase of the next cycle begins.
In this phase, the control signal Reset will be high while others are low. The voltage
controller provides Vreset and current flows from the free layer to the pinned layer
until all MTJs are switched to the P state. The voltage difference between Vreset and
Vb is high enough to ensure deterministic switching.
(2) Write Phase
In this phase, the control signal Write is high while others are low. The Vwrite
should be lower than Vb to generate switching current from the pinned layer to the
free layer. In the testing process, the voltage is selected for a 50% switching probability
in 5 ns for each MTJ. In an actual stochastic computing application,
the switching probability can be set to any required value, so the stochastic
bitstream is generated directly without external circuitry. Because all MTJs are connected in
parallel, the voltages across each MTJ and the corresponding transistors are the
same. All MTJs are written under the same bias voltage simultaneously, but each MTJ
switches independently. At the end of the write phase, some of the MTJs will have
switched to the AP state while the others remain in the P state.
(3) Read Phase
In the read phases, only one of the N Readn is high, from Read1 to ReadN , while
others are low. The current flows from VDD to GND passing through only one MTJ.
Depending on the resistance variation of the MTJ, the output voltages differ. The
comparator determines the state of this MTJ by comparing the output voltage with
the reference value, generating a single stochastic bit. After N read phases,
the RNG has output an N-bit stochastic bitstream.
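Behaviorally, one full reset/write/read cycle of the parallel TRNG can be sketched as follows (a Python model; the Bernoulli draw stands in for the physical STT switching with probability p_switch):

    import random

    def parallel_mtj_trng(n_mtjs, p_switch=0.5, rng=random.Random(0)):
        # one N + 2 phase cycle: reset, one parallel stochastic write, N reads
        states = [0] * n_mtjs                       # reset phase: all MTJs to the P state
        states = [int(rng.random() < p_switch)      # write phase: every MTJ is biased
                  for _ in states]                  # identically but switches independently
        return states                               # read phases: one bit per MTJ

    bits = parallel_mtj_trng(16)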
Compared to other RNG designs, the significant advantage of the parallel structure
is that the switching probability is controllable. In actual stochastic computing,
the RNG needs to generate stochastic bitstreams with different proportions of ‘0’ and
‘1’; a digital comparator and other external circuitry are normally necessary to achieve
this. However, by varying V_write, the parallel-structure TRNG can
output stochastic bitstreams directly with any proportion of ‘0’ and ‘1’.
Moreover, all MTJs work simultaneously in the reset and write phases, which requires
less time than a single-MTJ structure. Supposing that each phase takes
5 ns, the generation speed in Mbit/s is given by Eq. (6):

Speed = (N / (N + 2)) × 200   (6)
In the parallel design, the accuracy of the switching probability is subject to
the actual voltage and duration of the pulses applied to the MTJs, and to a variety of
circuit parameters. To keep the probability precise, the pulses applied to the
MTJs should be well controlled and the variations of the transistors should be small
compared to those of the MTJs.
The quality of the random sequences needs to be evaluated in aspects other than
frequency to demonstrate the effectiveness of our approaches. Therefore, we applied
the widely used statistical test suite of the National Institute of Standards and
Technology (NIST) [20].
For the given value N , the proposed generation procedure was repeated 256/N
times, and each MTJ was used 256/N times to generate random bits, where N is
the number of MTJs in the array. After one sequence of 256 bits is generated, a new
set of N MTJs is used to generate the next sequence.
The four curves at the left side of Fig. 5 show the pass rate trends for different
categories of tests, and illustrate the quality improvement of the generators with
increasing number of MTJs used. The horizontal line is the threshold of 0.981 for
passing the tests. When using at least 16 MTJs, the pass rates for all tests are no less
than 0.981. Therefore, it was shown by the statistical test suite that high-quality 256-
bit random sequences can be generated by utilizing at least 16 MTJs in the proposed
TRNG.
This TRNG of parallel structure is well suited to stochastic computing. Firstly, a
TRNG has better randomness than a PRNG. Furthermore, the parallel structure can
output bitstreams with the required probabilities directly, and at higher speed, as
discussed above.

Fig. 5 Statistical quality pass rates of four MTJ-based TRNGs and two combined Tausworthe
generators
With an accurately tunable write voltage, continuous proportion values between 0 and
1 can be obtained by this circuit.
For each bias voltage ranging from 1.13 V to 1.36 V, 1000 Monte-Carlo (MC)
simulations are performed [11]. The simulated P-V relationship is illustrated in
Fig. 7 by the red line. The figure demonstrates that the switching probability
increases monotonically with the voltage, so voltages and probability values
correspond almost one-to-one. In order to evaluate the performance of the proposed
SNG circuit, bitstreams are generated with lengths of 64, 128 and 256. As shown in
Fig. 7, the results for all three bitstream lengths coincide well with the Monte-Carlo
simulation results: the average errors are only 1.6%, 1.3% and 1.1% for lengths of
64, 128 and 256, respectively. Clearly, the longer the bitstream, the smaller
the error.
Evaluation Framework
Based on the circuit simulation results, the SNG array and the stochastic computing
logic are abstracted as behavioral blocks by performing characterizations. Meanwhile, the
RTL implementation of stochastic to digital converter (SDC) is synthesized by
Synopsys Design Compiler with 45 nm FreePDK library. After performing the
characterization of SDC, an architectural level simulation is carried out according to
the specified application trace. Finally, the evaluation results of Bayesian inference
system are obtained in terms of inference accuracy, energy efficiency and inference
speed.
Data fusion is the process of integrating multiple data sources to produce more
consistent, accurate, and useful information than that provided by any individual
data source. In this section, a simple data fusion example and corresponding
Bayesian inference system are studied.
Sensor fusion aims to determine a target location using multiple sensors [4]. Assume
that three sensors lie on a 64 × 64 2D plane, located at (0, 0), (0, 32), and (32, 0),
respectively. Each sensor has two data channels: distance (d) and bearing (b). The
measured data (d1, b1, d2, b2, d3, b3) from the three sensors are used to infer
the target location (x, y). In this application, the probability that the target object is
located at each position of the plane is calculated from the sensor data. The
position with the largest probability is taken to be the position of the target.
Based on the observed data (d1, b1, d2, b2, d3, b3), the probability that the target
object is located at (x, y) is denoted p(x, y | d1, b1, d2, b2, d3, b3) and can be
calculated from Bayes’ theorem:

p(x, y | d1, b1, d2, b2, d3, b3) ∝ p(x, y) · Π_i p(d_i | x, y) p(b_i | x, y)   (7)
where p(x, y) is the prior probability, and p(d_i | x, y) and p(b_i | x, y) are known
as the evidence or likelihood information. Since the target may be located at any
position, the prior probability p(x, y) has the same value everywhere; hence it
is ignored in the following Bayesian inference system. p(d_i | x, y) is the
probability that sensor i returns the distance value d_i when the target object is located
at position (x, y); the meaning of p(b_i | x, y) is similar. The
values of p(d_i | x, y) and p(b_i | x, y) are calculated by Eqs. (8) and (9):
p(d_i | x, y) = (1 / (√(2π) σ_i^d)) · exp( −(d(x, y) − μ_i^d)² / (2 (σ_i^d)²) )   (8)

p(b_i | x, y) = (1 / (√(2π) σ_i^b)) · exp( −(b(x, y) − μ_i^b)² / (2 (σ_i^b)²) )   (9)
where d(x, y) is the Euclidean distance between position (x, y) and the i-th sensor,
μ_i^d is the distance reported by the i-th sensor, and σ_i^d = 5 + μ_i^d/10; b(x, y) is the
viewing angle from the i-th sensor to position (x, y), μ_i^b is the bearing reported
by the i-th sensor, and σ_i^b is set to 14.0626 degrees.
It can be seen from the Bayesian inference mechanism (Eq. 7) that the distribution of
the object location is calculated as a product of conditional probabilities; in
stochastic computing, this product can be realized by AND gates. In addition, the
probability calculation for each position is independent of all the others. Based on
this analysis, the Bayesian inference
architecture for the data fusion problem is organized as the matrix structure
illustrated in Fig. 9. For each position, six SNGs are deployed to produce stochastic
bitstreams and five AND gates are deployed to realize the multiplication. Thus, for a
64 × 64 grid, 24576 SNGs and 20480 AND gates are needed. In Fig. 9, the output of
each row is the posterior probability that the object is located at that position. In our
simulation, 64 × 64 counters are employed to decode the outputs from stochastic
bitstreams to binary numbers by counting the proportion of ‘1’s. Exploiting the
independence of the inference computations (Eq. 7), all rows of the system can
perform stochastic computing at the same time. The proposed architecture thus
makes the best use of the highly parallel nature of Bayesian inference and
stochastic computing.
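One row of this matrix reduces to the following Python sketch (the likelihood values fed to the SNGs are placeholders; in the system they come from Eqs. (8) and (9), and the SNG is the MTJ circuit rather than a software random draw):

    import random

    def sng(p, n, rng):
        # stand-in stochastic number generator: Bernoulli(p) bitstream
        return [rng.random() < p for _ in range(n)]

    def cell_posterior(likelihoods, n=4096, rng=random.Random(1)):
        # six SNGs feeding a chain of AND gates, as in one row of Fig. 9
        streams = [sng(p, n, rng) for p in likelihoods]
        out = [all(bits) for bits in zip(*streams)]  # AND multiplies the probabilities
        return sum(out) / n                          # the counter decodes the result

    print(cell_posterior([0.9, 0.8, 0.7, 0.9, 0.8, 0.7]))  # near 0.254, the exact product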
Simulation Results
Cadence Virtuoso is used to analyze the accuracy and efficiency of the proposed
Bayesian inference system. In the simulation, 64 × 64, 32 × 32 and 16 × 16
grids are utilized to evaluate our Bayesian inference system. The finer the grid, the
more accurate the target position. For every grid scale, stochastic bitstreams (BSs)
with length of 64, 128 and 256 are generated to perform stochastic computing. In
Fig. 10, the fusion results on the 64 × 64 grid are shown as heat maps. Figure 10a is
the exact inference result, computed with floating-point arithmetic.
Figure 10b, c and d are the inference results of the proposed Bayesian
inference system with stochastic bitstream lengths of 64, 128 and 256, respectively.
The simulation results indicate that the proposed system computes the Bayesian
inference results correctly. Compared with the exact inference results, the longer the
stochastic bitstream, the smaller the error. To quantify the precision of the infer-
ence system, the Kullback-Leibler divergence (KL divergence) between stochastic
inference distribution and the exact reference distribution is calculated. As shown in
Table 1, the first column shows the grid scale. The following 3 columns are the KL
divergence values for the different bitstream lengths.

Fig. 10 Data fusion result of the target location problem on a 64 × 64 grid. (a) Exact inference
results. (b)–(d) Stochastic computing results with lengths of 64, 128, 256

Taking the 32 × 32 grid as an example, a KL divergence of 10^{−3} requires a bitstream
length of 256. For the same precision, the work in [4] requires a length of 10^5.
These outstanding results benefit from the high accuracy and low correlation of the
bitstreams generated by the MTJ-based SNG. As reported in [4],
for an instance with a 32 × 32 grid, the software version on a typical laptop takes
919 mJ, and the FPGA-based Bayesian machine takes only 0.23 mJ with a stochastic
bitstream length of 1000. Benefiting from the low power consumption of the MTJs
and the high quality of the SNG, the proposed Bayesian inference system spends less
than 0.01 mJ to achieve the same accuracy on the 32 × 32 grid. The speed of the
proposed Bayesian inference system depends on the bitstream length.
Figure 11 is a BNN example for heart disease prediction. In this network, the parent
nodes of heart disease (H) are factors that cause heart disease, including exercise
(E) and diet (D). The child nodes are clinical manifestations of HD, including blood
pressure (B) and chest pain (C). In addition to the graph structure, conditional
probability tables (CPTs) are also given. For example, the second value, 0.45, in the
CPT of node HD means that if a person takes regular exercise but has an unhealthy
diet, the risk of HD is 0.45. In this problem, we are mainly interested in inference
based on given evidence. For convenience, X^1 indicates that the value of random
variable X is TRUE and X^0 that it is FALSE; if the value of X is not determined,
there is no superscript. The inference mechanism can be divided into two groups
based on the junction tree algorithm. The first case considers E, D and H as a group
and calculates p(H^1) as in Eq. (10):

p(H^1) = Σ_{E,D} p(H^1 | E, D) p(E) p(D)   (10)
Fig. 12 (a) Bayesian inference circuit for the BBN that realizes Eq. (10). (b) Bayesian inference
circuit for the BBN that realizes Eq. (11)
The second case computes the posterior probability of H given evidence on B and C:

p(H^1 | B, C) = p(B | H^1) p(C | H^1) p(H^1) / p(B, C)   (11)
The denominator of Eq. (11) can be calculated with the law of total probability,
as in Eq. (12):

p(B, C) = p(B | H^1) p(C | H^1) p(H^1) + p(B | H^0) p(C | H^0) p(H^0)   (12)
Here, p(H^1) is calculated by Eq. (10). In Eq. (11), the values of B and C are not
labeled explicitly; they are determined by the diagnostic results.
Based on the inference algorithm, the inference system can be constructed
easily. Equation (10) can be computed by three MUXes, as shown in Fig. 12a.
Equation (11) can be computed by three AND gates and five MUXes, as shown in
Fig. 12b. Depending on the evidence, the Bayesian inference is performed with
different combinations of the MUX control signals.
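A software sketch of the Fig. 12a circuit in Python (the priors p(E), p(D) and all CPT entries other than the quoted 0.45 are assumed values for illustration):

    import random

    def bitstream(p, n, rng):
        return [rng.random() < p for _ in range(n)]

    def mux(sel, a, b):
        # scaled addition: output probability = p_sel * p_a + (1 - p_sel) * p_b
        return [x if s else y for s, x, y in zip(sel, a, b)]

    rng, n = random.Random(3), 1 << 16
    pE, pD = 0.6, 0.7                # assumed priors p(E^1), p(D^1)
    cpt = {(1, 1): 0.25, (1, 0): 0.45, (0, 1): 0.55, (0, 0): 0.75}  # assumed p(H^1|E,D)

    e, d = bitstream(pE, n, rng), bitstream(pD, n, rng)
    h_e1 = mux(d, bitstream(cpt[1, 1], n, rng), bitstream(cpt[1, 0], n, rng))
    h_e0 = mux(d, bitstream(cpt[0, 1], n, rng), bitstream(cpt[0, 0], n, rng))
    h = mux(e, h_e1, h_e0)           # three MUXes realize Eq. (10)
    print(sum(h) / n)                # near 0.430, the exact marginal p(H^1)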
Simulation Results
The Bayesian inference system for the BBN is also simulated in Cadence
Virtuoso, and the simulation results are shown in Table 2. The first column of the
table lists some of the possible posterior probabilities. The second column gives
the corresponding settings of the control signal for each MUX. Column 3 shows the
exact results calculated by [1]. Column 4 gives the results calculated by the proposed
Bayesian inference system using stochastic computing. The comparison between
the exact and stochastic result columns indicates that the proposed Bayesian
inference system for the BBN achieves reasonable results.
References
10. Ikeda, S., Miura, K., Yamamoto, H., Mizunuma, K., Gan, H.D., Endo, M., Kanai, S.,
Hayakawa, J., Matsukura, F., Ohno, H.: A perpendicular-anisotropy CoFeB-MgO magnetic
tunnel junction. Nature Materials 9, 721–724 (2010)
11. Jia, X., Yang, J., Wang, Z., Chen, Y., Zhao, W.: Spintronics based stochastic computing for
efficient bayesian inference system. In: Asia and South Pacific Design Automation Conference,
pp. 580–585 (2018)
12. Julliere, M.: Tunneling between ferromagnetic films. Physics Letters A 54(3), 225–226 (1975)
13. Katz, J., Menezes, A.J., Van Oorschot, P.C., Vanstone, S.A.: Handbook of applied cryptogra-
phy. CRC press (1996)
14. Koch, R.H., Katine, J.A., Sun, J.Z.: Time-resolved reversal of spin-transfer switching in a nanomagnet. Phys. Rev. Lett. 92, 088302 (2004)
15. Liu, N., Pinckney, N., Hanson, S., Sylvester, D., Blaauw, D.: A true random number generator using time-dependent dielectric breakdown. In: Symposium on VLSI Circuits, pp. 216–217 (2011)
16. Matsunaga, S., Hayakawa, J., Ikeda, S., Miura, K., Endoh, T., Ohno, H., Hanyu, T.: MTJ-based
nonvolatile logic-in-memory circuit, future prospects and issues. In: Design, Automation &
Test in Europe Conference & Exhibition, pp. 433–435 (2009)
17. Oliver, N., Soriano, M.C., Sukow, D.W., Fischer, I.: Fast random bit generation using a chaotic
laser: approaching the information theoretic limit. IEEE Journal of Quantum Electronics
49(11), 910–918 (2013)
18. Peng, S., Zhao, W., Qiao, J., Su, L., Zhou, J., Yang, H., Zhang, Q., Zhang, Y., Grezes, C., Amiri, P.K., Wang, K.L.: Giant interfacial perpendicular magnetic anisotropy in MgO/CoFe/capping layer structures. Applied Physics Letters 110(7), 072403 (2017)
19. Qu, Y., Han, J., Cockburn, B.F., Pedrycz, W., Zhang, Y., Zhao, W.: A true random number
generator based on parallel STT-MTJs. In: Design, Automation & Test in Europe Conference
& Exhibition, pp. 606–609 (2017)
20. Soto, J.: The NIST statistical test suite. National Institute Of Standards and Technology (2010)
21. Sun, J.Z.: Spin-current interaction with a monodomain magnetic body: A model study. Phys.
Rev. B 62, 570–578 (2000)
22. Sun, J.Z., Robertazzi, R.P., Nowak, J., Trouilloud, P.L., Hu, G., Abraham, D.W., Gaidis, M.C., Brown, S.L., O’Sullivan, E.J., Gallagher, W.J., Worledge, D.C.: Effect of subvolume excitation and spin-torque efficiency on magnetic switching. Phys. Rev. B 84, 064413 (2011)
23. Tomita, H., Miwa, S., Nozaki, T., Yamashita, S., Nagase, T., Nishiyama, K., Kitagawa, E.,
Yoshikawa, M., Daibou, T., Nagamine, M., Kishi, T., Ikegawa, S., Shimomura, N., Yoda, H.,
Suzuki, Y.: Unified understanding of both thermally assisted and precessional spin-transfer
switching in perpendicularly magnetized giant magnetoresistive nanopillars. Applied Physics
Letters 102(4) (2013)
24. Tomita, H., Nozaki, T., Seki, T., Nagase, T., Nishiyama, K., Kitagawa, E., Yoshikawa, M.,
Daibou, T., Nagamine, M., Kishi, T., Ikegawa, S., Shimomura, N., Yoda, H., Suzuki, Y.:
High-speed spin-transfer switching in GMR nano-pillars with perpendicular anisotropy. IEEE
Transactions on Magnetics 47(6), 1599–1602 (2011)
25. Uchida, A., Amano, K., Inoue, M., Hirano, K., Naito, S., Someya, H., Oowada, I., Kurashige,
T., Shiki, M., Yoshimori, S., et al.: Fast physical random bit generation with chaotic
semiconductor lasers. Nature Photonics 2(12), 728 (2008)
26. Wang, M., Cai, W., Cao, K., Zhou, J., Wrona, J., Peng, S., Yang, H., Wei, J., Kang, W.,
Zhang, Y., Langer, J., Ocker, B., Fert, A., Zhao, W.: Current-induced magnetization switching
in atom-thick tungsten engineered perpendicular magnetic tunnel junctions with large tunnel
magnetoresistance. Nature Communications 9(671), 1–7 (2018)
27. Wang, Y., Cai, H., d. B. Naviner, L.A., Zhang, Y., Zhao, X., Deng, E., Klein, J.O., Zhao, W.:
Compact model of dielectric breakdown in spin-transfer torque magnetic tunnel junction. IEEE
Transactions on Electron Devices 63(4), 1762–1767 (2016)
28. Wang, Y., Cai, H., Naviner, L.A.B., Klein, J.O., Yang, J., Zhao, W.: A novel circuit design of
true random number generator using magnetic tunnel junction. In: IEEE/ACM International
Symposium on Nanoscale Architectures, pp. 123–128 (2016)
29. Worledge, D., Hu, G., Abraham, D.W., Sun, J., Trouilloud, P., Nowak, J., Brown, S., Gaidis, M., O’Sullivan, E., Robertazzi, R.: Spin torque switching of perpendicular Ta|CoFeB|MgO-based magnetic tunnel junctions. Applied Physics Letters 98(2), 022501 (2011)
30. Yang, K., Fick, D., Henry, M.B., Lee, Y., Blaauw, D., Sylvester, D.: A 23 Mb/s 23 pJ/b fully synthesized true-random-number generator in 28 nm and 65 nm CMOS. In: IEEE International Solid-State Circuits Conference Digest of Technical Papers (ISSCC), pp. 280–281 (2014)
31. Zhao, H., Zhang, Y., Amiri, P.K., Katine, J.A., Langer, J., Jiang, H., Krivorotov, I.N., Wang,
K.L., Wang, J.P.: Spin-torque driven switching probability density function asymmetry. IEEE
Transactions on Magnetics 48(11), 3818–3820 (2012)
32. Zhao, W., Moreau, M., Deng, E., Zhang, Y., Portal, J.M., Klein, J.O., Bocquet, M., Aziza, H., Deleruyelle, D., Muller, C., Querlioz, D., Ben Romdhane, N., Ravelosona, D., Chappert, C.: Synchronous non-volatile logic gate design based on resistive switching memories. IEEE Transactions on Circuits and Systems I: Regular Papers 61(2), 443–454 (2014)
Brain-Inspired Computing
Introduction
Recently, brain-inspired computing (e.g., spiking neural networks [1] and deep
learning [2, 3]) has been studied for highly accurate recognition and classification
capabilities, as found in human brains. Several hardware implementations of brain-
inspired computing have been presented in [4, 5], but the energy efficiency of the
current hardware approaches is significantly lower than that of human brains.
1 Since 2014, under the Brainware LSI (BLSI) project of the Ministry of Education, Culture, Sports, Science and Technology (MEXT) in Japan, we have implemented several BLSIs based on stochastic computing for brain-inspired physiological models and deep neural networks.
A multiplication in unipolar coding is designed using a two-input AND gate, and that in bipolar coding is designed using a two-input XNOR gate. An addition is realized as a scaled adder designed using a two-input multiplexer, where the selector signal is a random bit sequence.
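As a minimal software illustration of these gates, the sketch below encodes values as Bernoulli bitstreams and checks that AND multiplies unipolar values, XNOR multiplies bipolar values, and a randomly selected MUX computes the scaled sum. The stream length and the encoded values are arbitrary.

```python
import random

rng = random.Random(1)
N = 100000

def stream(p):
    """Unipolar stochastic stream encoding p in [0, 1]."""
    return [1 if rng.random() < p else 0 for _ in range(N)]

a, b = stream(0.8), stream(0.5)
sel = stream(0.5)  # random select sequence for the scaled adder

mul_uni = [x & y for x, y in zip(a, b)]              # unipolar multiply: AND
mul_bi = [1 - (x ^ y) for x, y in zip(a, b)]         # bipolar multiply: XNOR
add = [x if s else y for s, x, y in zip(sel, a, b)]  # scaled adder: MUX

print(sum(mul_uni) / N)          # ~0.40 = 0.8 * 0.5
print(2 * sum(mul_bi) / N - 1)   # ~0.00 = (2*0.8-1) * (2*0.5-1)
print(sum(add) / N)              # ~0.65 = (0.8 + 0.5) / 2
```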
In addition, hyperbolic tangent and exponential functions are designed using finite state machines (FSMs). At each cycle, the state transits to the right if the input stochastic bit, X(t), is “1”, and to the left otherwise. After the transition, the output stochastic bit, Y(t), is determined by the current state. By changing the output condition, different functions can be designed. The stochastic tanh function, Stanh, in bipolar coding is defined as follows:
$$\mathrm{Stanh}(N_T, x) \approx \tanh\!\left(\frac{N_T}{2}\, x\right), \tag{1}$$
where N_T is the total number of states. The stochastic exponential function, Sexp, is defined in unipolar coding as follows:
$$\mathrm{Sexp}(N_E, G, x) \approx \exp(-2 G x), \tag{2}$$
where N_E is the total number of states and G determines the number of states generating outputs of “1”. A detailed explanation is given in [9].
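The following sketch emulates one common form of the Stanh FSM (a saturating up/down counter whose output is “1” in the upper half of the state space) and compares the decoded bipolar output against the approximation Stanh(N_T, x) ≈ tanh(N_T x/2). The state encoding and the initial state are assumptions; [9] gives the precise construction.

```python
import random, math

def stanh(bits, n_states):
    """FSM-based stochastic tanh: a saturating counter that moves right on
    an input '1', left on '0'; the output is '1' in the upper half of the
    state space (initial state assumed to be the middle)."""
    state, out = n_states // 2, []
    for bit in bits:
        state = min(state + 1, n_states - 1) if bit else max(state - 1, 0)
        out.append(1 if state >= n_states // 2 else 0)
    return out

rng = random.Random(0)
N, NT = 200000, 8
x = 0.4                               # bipolar input value in [-1, 1]
bits = [1 if rng.random() < (x + 1) / 2 else 0 for _ in range(N)]
y = 2 * sum(stanh(bits, NT)) / N - 1  # decode the bipolar output
print(y, math.tanh(NT * x / 2))       # Stanh(NT, x) ~ tanh(NT * x / 2)
```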
Application to BLSI
Visual information captured by the retinas is sent to the primary visual cortex (V1) through the lateral geniculate nucleus (LGN), and the information is then extracted in V1. The extracted information is distributed to two pathways: the dorsal pathway to the middle temporal area (MT) and the ventral pathway to the inferior temporal area (IT).
Using stochastic computing, we have designed several BLSIs, such as analog-to-stochastic converters [22], 2D Gabor filters [6, 7], and a disparity energy model [23], for brainware visual information processing. The 2D Gabor filters show responses similar to those of simple cells in V1, and the disparity energy model performs relative depth estimation using two images. Stochastic deep neural networks have also been designed [8] that show responses similar to those from V1 to IT. In addition to visual information processing, stochastic gammatone filters have been designed for auditory signal processing [24]; gammatone filters closely model the human auditory peripheral mechanism. Among them, two examples of BLSIs are introduced in section “BLSI Design”.
For designing the BLSIs, we have proposed extended arithmetic functions,
such as circular functions. These arithmetic functions are summarized in sec-
tion “Extended Arithmetic Functions”.
Circular Functions
Sine and cosine functions, which are required for Gabor filters, had not previously been realized in stochastic computing. To realize stochastic Gabor filters, circular functions have been proposed using several Stanh functions [6]. The stochastic sin function, Ssin(ω, λ, x) (≈ sin(ωx)), in bipolar coding is defined as follows:
$$\mathrm{Ssin}(\omega, \lambda, x) = \sum_{k=-\omega/\pi}^{\omega/\pi} \frac{(-1)^k}{2}\, \mathrm{Stanh}\!\left(4\omega,\ \lambda x + \frac{\pi k}{\omega}\right), \tag{3}$$
Fig. 2 Graphical representation of Ssin function using five Stanh functions, where ω = 2π and
ω = π are used
$$\mathrm{Scos}(\omega, \lambda, x) = \sum_{k=-\omega/\pi-\frac{1}{2}}^{\omega/\pi-\frac{1}{2}} \frac{(-1)^k}{2}\, \mathrm{Stanh}\!\left(4\omega,\ \lambda x + \frac{\pi (k+\frac{1}{2})}{\omega}\right), \tag{4}$$
where ai+ are the positive coefficients and ai− are the negative coefficients. The scaling factor of the output in the proposed circuit is 2, which is independent of N, leading to a higher computation accuracy than that of the conventional circuit.
where Xi(t) is a stochastic bit stream and m is the number of bit streams. In unipolar coding, a real value, s, is defined as follows:
$$s = \sum_{i=1}^{m} x_i, \tag{8}$$
Fig. 4 Integral stochastic circuit components: (a) adder, (b) multiplier, and (c) simplified multiplier when one of the two inputs is a stochastic bit stream
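A small simulation of the components in Fig. 4 may help: bit-wise integer addition of two unipolar streams encoding 0.75 yields an integer stream with mean 1.5 (with no scaling), and the simplified multiplier gates each integer sample with a stochastic bit. The values match the example in the figure.

```python
import random

rng = random.Random(7)
N = 50000

def stream(p):
    return [1 if rng.random() < p else 0 for _ in range(N)]

a, b = stream(0.75), stream(0.75)

# Integral stochastic adder (Fig. 4a): bit-wise integer addition, so the
# integer stream encodes s = 0.75 + 0.75 = 1.5 with no scaling.
c = [x + y for x, y in zip(a, b)]
print(sum(c) / N)   # ~1.5

# Simplified multiplier (Fig. 4c): when one operand is an ordinary
# stochastic bit stream, each integer sample is simply gated by the bit.
g = stream(0.5)
d = [x * s for x, s in zip(c, g)]
print(sum(d) / N)   # ~0.75 = 1.5 * 0.5
```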
BLSI Design
2D Gabor filters exhibit responses similar to those of simple cells in the primary visual cortex (V1) of the brain, as shown in Fig. 5. Many simple cells activated by different spatial frequencies and orientations of images are arranged in the hypercolumn structure. Based on the hypercolumn structures, the brain can extract many different features, such as edges and lines of images, which are used for object recognition and classification in later stages of the brain.
Using stochastic computing, an energy-efficient configurable 2D Gabor-filter chip is implemented. The 2D Gabor function (odd phase) [27] is defined as follows:
$$g_{\omega,\sigma,\gamma,\theta}(x, y) = \exp\!\left(-\frac{x'^2 + \gamma^2 y'^2}{2\sigma^2}\right)\sin(2\omega x'), \tag{12}$$
where x′ = x cos θ + y sin θ and y′ = −x sin θ + y cos θ. ω represents the spatial angular frequency of the sinusoidal factor, θ represents the orientation of the normal to the parallel stripes of the Gabor function, σ is the sigma of the Gaussian envelope, and γ is the spatial aspect ratio of the Gabor function.
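For reference, Eq. (12) transcribes directly into a floating-point routine; the sketch below evaluates it on a 51 × 51 grid, loosely mirroring Fig. 6a. The grid normalization and the value of σ are assumptions made only for illustration.

```python
import math

def gabor_odd(x, y, omega, sigma, gamma, theta):
    """Floating-point reference of the odd-phase Gabor function, Eq. (12)."""
    xp = x * math.cos(theta) + y * math.sin(theta)
    yp = -x * math.sin(theta) + y * math.cos(theta)
    return math.exp(-(xp ** 2 + gamma ** 2 * yp ** 2) / (2 * sigma ** 2)) \
        * math.sin(2 * omega * xp)

# 51 x 51 kernel on a normalized [-1, 1] grid; sigma and the grid
# normalization are assumed values, not the chip's exact parameters.
K = 51
coeffs = [[gabor_odd((i - K // 2) / (K // 2), (j - K // 2) / (K // 2),
                     omega=2 * math.pi, sigma=0.5, gamma=1.0, theta=0.0)
           for i in range(K)] for j in range(K)]
print(coeffs[K // 2][K // 2 + 5])  # one off-center coefficient
```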
Using Eqs. (2), (3), and (4), the stochastic 2D Gabor function is defined as follows:
$$\mathrm{SGabor}(\omega, \gamma, \lambda, G, \theta, x, y) = \frac{\mathrm{Sexp}\!\left(N_E, G, \tfrac{1}{2}(x'^2 + \gamma^2 y'^2)\right) + 1}{2}\; \mathrm{Ssin}(\omega, \lambda, x'), \tag{13}$$
Fig. 5 Hypercolumn structure of primary visual cortex (V1) including many simple cells, where Gabor filters exhibit a similar response to a simple cell
Fig. 6 Stochastic Gabor filter: (a) 51 × 51 coefficients and (b) chip microphotograph
where α is a constant value for fitting SGabor to the original Gabor function. Using Eq. (13), coefficients with flexible kernel sizes are generated in hardware.
Figure 6a shows SGabor results (coefficients) for a kernel size of 51 × 51 with ω = 2π and θ = 0°. The number of stochastic bits (Nsto) is 2^18. In this simulation, N_E = 256, G = 8, λ = 1/4, λπ = 0.6614, and γ = 1 are selected. λ = 1/4 is selected because it supports the maximum angular frequency of 4π. γ = 1 is selected based on [28], which uses the same Gaussian envelope along x and y.
Figure 6b shows a photomicrograph of the proposed stochastic Gabor filter chip fabricated in TSMC 65-nm CMOS technology. The proposed chip includes 64 parallel Gabor-filtering blocks and a coefficient generator based on Eq. (13). The filtering block is designed using a stochastic convolution circuit in unipolar coding. As the coefficients are generated in hardware on demand, memory blocks are not required, leading to a power-gating capability. The supply voltage is 1.0 V and the area is 1.79 mm × 1.79 mm including I/Os. The proposed circuit is designed in Verilog HDL (Hardware Description Language), and the test chip is realized using Synopsys Design Compiler and Cadence SoC Encounter.
Table 1 shows performance comparisons with related works. It is hard to compare the performance directly because the designs have different functionalities and configurations. The memory-based methods [29, 30] use fixed coefficients with fixed kernel sizes, lacking flexibility. In the conventional configurable Gabor filter [28], a COordinate Rotation DIgital Computer (CORDIC) is exploited to dynamically generate the coefficients related to the sinusoidal function for flexible Gabor filtering. However, the other coefficients are stored in memory, losing the power-gating capability. In contrast, the proposed circuit achieves a higher throughput/area and more flexible filtering than the conventional configurable Gabor filter, with a power-gating capability leading to zero standby power.
Recently, deep neural networks based on stochastic computing have been reported for area-efficient hardware [31, 32]. However, their energy dissipation is significantly larger than that of a fixed-point design because a large number of bit streams is required. In order to reduce the energy dissipation, integral stochastic computing has been proposed [8]. Integral stochastic computing reduces the number of bit streams, and hence the energy dissipation, while maintaining the computation accuracy at the cost of some area overhead.
As a design example of deep neural networks based on stochastic computing,
the deep belief network (DBN) is selected as shown in Fig. 7a. The DBN contains
784 input nodes and 10 output nodes, with two different configurations of the hidden layers: the 1st hidden layer has 100 or 300 neurons, and the 2nd hidden layer has 200 or 600 neurons.

Fig. 7 Stochastic deep neural networks: (a) two-layer deep belief network (DBN) and (b) stochastic neuron

The function at each neuron is defined as follows:
$$z_j = \sum_{i=1}^{M} w_{ij} v_i + b_j, \tag{15}$$
$$h_j = \frac{1}{1 + \exp(-z_j)} = \sigma(z_j), \tag{16}$$
where M is the number of inputs, wij are the weights, vi are the inputs, bj is the bias, zj is the intermediate value, and hj is the output; j is the index of the neuron. The sigmoid function can be rewritten using the tanh function as follows:
$$\sigma(z_j) = \frac{1 + \tanh(z_j/2)}{2} \tag{17}$$
Based on Eqs. (15), (16), and (17), neurons are designed using integral stochastic computing, as shown in Fig. 7b. First, binary signals are converted to stochastic bit streams using binary-to-stochastic (B2S) converters, or to integral stochastic bit streams using binary-to-integral-stochastic (B2IS) converters. Second, the bit streams are multiplied using the integral stochastic multipliers and then added using the tree adder. Third, the output bit stream of the adder, corresponding to zj, is the input of the NStanh function based on Eq. (10) in order to determine hj.
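The following Python sketch mimics the dataflow of Fig. 7b for a single neuron: B2S conversion, XNOR multipliers in bipolar coding, a bit-wise tree adder producing an integral stochastic stream, and a saturating-counter stand-in for NStanh. The NStanh stand-in and its gain, as well as all weights and inputs, are assumptions for illustration; see [8] for the exact construction.

```python
import random, math

rng = random.Random(3)
N = 100000

def b2s(v):
    """Binary-to-stochastic: bipolar stream for v in [-1, 1]."""
    p = (v + 1) / 2
    return [1 if rng.random() < p else 0 for _ in range(N)]

def neuron(weights, inputs, bias, nt=4):
    m = len(weights) + 1  # number of summed product streams (+ bias)
    # XNOR multipliers in bipolar coding, one product stream per synapse
    prods = [[1 - (a ^ b) for a, b in zip(b2s(w), b2s(v))]
             for w, v in zip(weights, inputs)]
    prods.append(b2s(bias))
    # Tree adder: bit-wise integer addition -> integral stochastic stream
    z = [sum(col) for col in zip(*prods)]            # z[t] in {0, ..., m}
    # NStanh stand-in: saturating counter driven by the signed sum
    S, state, ones = m * nt, m * nt // 2, 0
    for zt in z:
        state = max(0, min(S - 1, state + (2 * zt - m)))
        ones += 1 if state >= S // 2 else 0
    return ones / N                                  # estimate of h_j

w, v, b = [0.5, -0.25, 0.125], [0.8, 0.4, -0.6], 0.1
zj = sum(wi * vi for wi, vi in zip(w, v)) + b
# Rough reference (1 + tanh(nt*zj/2))/2; the FSM gain is only approximate
print(neuron(w, v, b), (1 + math.tanh(4 * zj / 2)) / 2)
```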
Table 2 shows misclassification rates of floating-point simulations and stochastic computing on the Mixed National Institute of Standards and Technology (MNIST) data set [34]. For training, floating-point simulations are used to obtain wij in both cases. For inference, 10,000 handwritten digits are tested. In stochastic computing, wij are represented by 10-bit fixed-point numbers that are converted to stochastic bit streams
Table 3 Hardware evaluation of two-layer DBN using TSMC 65-nm CMOS process

                               Fixed-point (10-bit)   Stochastic       Stochastic       Stochastic
Network configuration          784-100-200-10         784-100-200-10   784-300-600-10   784-300-600-10
Supply voltage (V)             1.0                    1.0              1.0              0.8
Nsto                           –                      256              16               22
Misclassification error (%)    2.3                    2.33             2.27             2.30
Energy (nJ)                    0.380                  2.96             0.299            0.256
Gate count (M gates (NAND2))   23.6                   4.2              15.6             15.6
Latency (ns)                   30                     650              50               65
using B2S. As a result, the stochastic DBN with Nsto = 256 achieves misclassification rates similar to those of the floating-point simulations.
Table 3 shows the performance comparison between the 10-bit fixed-point and stochastic two-layer DBNs. Both DBNs are designed in Verilog HDL and synthesized using Cadence RC Compiler. The power dissipation is obtained using Synopsys PrimePower. The technology is TSMC 65-nm CMOS with a frequency of 400 MHz and a supply voltage of 1 V.
In the case of the same network size (784-100-200-10), the hardware area of the stochastic implementation is reduced by 82.3% in comparison with that of the fixed-point design. However, the energy dissipation is 7.6 times larger because a large Nsto = 256 is required to achieve a similar misclassification rate. By exploiting the area efficiency, the larger network (784-300-600-10) is designed using stochastic computing. In this case, the stochastic implementation reduces Nsto to 16 while achieving a similar misclassification rate. As a result, the proposed hardware reduces the energy dissipation and the area by 21% and 34%, respectively, in comparison with the fixed-point design.
In order to further reduce the energy dissipation, the supply voltage is dropped to 0.8 V in the stochastic implementation. Lowering the supply voltage generally induces soft errors because of timing violations; however, stochastic computing is robust against such errors. At a supply voltage of 0.8 V, the stochastic implementation achieves a similar misclassification rate by slightly increasing Nsto from 16 to 22. As a result, a 33% energy reduction is achieved in total.
Conclusion
Acknowledgements This work was supported by the Brainware LSI Project of MEXT and JSPS KAKENHI Grant Number JP16K12494. This work was also supported by the VLSI Design and Education Center (VDEC), The University of Tokyo, in collaboration with Synopsys Corporation and Cadence Corporation.
References
12. S. S. Tehrani, S. Mannor, and W. J. Gross. Fully parallel stochastic LDPC decoders. IEEE
Transactions on Signal Processing, 56(11):5692–5703, Nov. 2008.
13. S. S. Tehrani, A. Naderi, G. A. Kamendje, S. Hemati, S. Mannor, and W. J. Gross. Majority-
based tracking forecast memories for stochastic LDPC decoding. IEEE Transactions on Signal
Processing, 58(9):4883–4896, Sep. 2010.
14. P. Li and D. J. Lilja. Using stochastic computing to implement digital image processing algorithms. In 29th ICCD, pages 154–161, Oct 2011.
15. P. Li, D. J. Lilja, W. Qian, K. Bazargan, and M. D. Riedel. Computation on stochastic bit streams: digital image processing case studies. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 22(3):449–462, Mar. 2014.
16. A. Alaghi, C. Li, and J. P. Hayes. Stochastic circuits for real-time image-processing
applications. In 50th DAC, pages 1–6, May 2013.
17. K. K. Parhi and Y. Liu. Architectures for IIR digital filters using stochastic computing. In 2014
ISCAS, pages 373–376, June 2014.
18. N. Saraf, K. Bazargan, D. J. Lilja, and M. D. Riedel. IIR filters using stochastic arithmetic.
In 2014 Design, Automation Test in Europe Conference Exhibition (DATE), pages 1–6, March
2014.
19. Y. Liu and K. K. Parhi. Architectures for recursive digital filters using stochastic computing.
IEEE Transactions on Signal Processing, 64(14):3705–3718, July 2016.
20. J. Chen, J. Hu, and J. Zhou. Hardware and energy-efficient stochastic LU decomposition
scheme for MIMO receivers. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 24(4):1391–1401, April 2016.
21. S. Sato, K. Nemoto, S. Akimoto, M. Kinjo, and K. Nakajima. Implementation of a new
neurochip using stochastic logic. IEEE Transactions on Neural Networks, 14(5):1122–1127,
Sept 2003.
22. N. Onizawa, D. Katagiri, W. J. Gross, and T. Hanyu. Analog-to-stochastic converter using
magnetic tunnel junction devices for vision chips. IEEE Transactions on Nanotechnology,
15(5):705–714, 2016.
23. K. Boga, F. Leduc-Primeau, N. Onizawa, K. Matsumiya, T. Hanyu, and W. J. Gross. A
generalized stochastic implementation of the disparity energy model for depth perception.
Journal of Signal Processing Systems, 90(5):709–725, May 2018.
24. N. Onizawa, S. Koshita, S. Sakamoto, M. Abe, M. Kawamata, and T. Hanyu. Area/energy-
efficient gammatone filters based on stochastic computation. IEEE Transactions on Very Large
Scale Integration (VLSI) Systems, 25(10):2724–2735, Oct 2017.
25. B. Moons and M. Verhelst. Energy-efficiency and accuracy of stochastic computing circuits
in emerging technologies. IEEE Journal on Emerging and Selected Topics in Circuits and
Systems, 4(4):475–486, Dec 2014.
26. C. L. Janer, J. M. Quero, J. G. Ortega, and L. G. Franquelo. Fully parallel stochastic
computation architecture. IEEE Transactions on Signal Processing, 44(8):2110–2117, Aug
1996.
27. D. Gabor. Theory of communications. Journal of Inst. Elect. Eng. - Part III: Radio and
Communication Engineering, 93(26):429–441, Nov. 1946.
28. J.-B. Liu, S. Wang, Y. Li, J. Han, and X.-Y. Zeng. Configurable pipelined Gabor filter imple-
mentation for fingerprint image enhancement. In 2010 10th IEEE International Conference on
Solid-State and Integrated Circuit Technology (ICSICT), pages 584–586, Nov 2010.
29. T. Morie, J. Umezawa, and A. Iwata. A pixel-parallel image processor for Gabor filtering based
on merged analog/digital architecture. In Digest of Technical Papers in 2004 Symposium on
VLSI Circuits, pages 212–213, June 2004.
30. E. Cesur, N. Yildiz, and V. Tavsanoglu. On an improved FPGA implementation of CNN-based
Gabor-type filters. IEEE Transactions on Circuits and Systems II: Express Briefs, 59(11):815–
819, Nov. 2012.
31. B. Li, M. H. Najafi, and D. J. Lilja. An FPGA implementation of a restricted Boltzmann machine classifier using stochastic bit streams. In 2015 IEEE 26th International Conference on Application-specific Systems, Architectures and Processors (ASAP), pages 68–69, July 2015.
32. K. Kim, J. Kim, J. Yu, J. Seo, J. Lee, and K. Choi. Dynamic energy-accuracy trade-off
using stochastic computing in deep neural networks. In 2016 53rd ACM/EDAC/IEEE Design
Automation Conference (DAC), pages 1–6, June 2016.
33. M. Tanaka and M. Okutomi. A novel inference of a restricted Boltzmann machine. In 2014 22nd International Conference on Pattern Recognition, pages 1526–1531, Aug 2014.
34. Y. Lecun and C. Cortes. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
Stochastic Decoding of Error-Correcting
Codes
F. Leduc-Primeau ()
École Polytechnique de Montréal, Montréal, QC, Canada
e-mail: francois.leduc-primeau@polymtl.ca
S. Hemati
Department of Electrical and Computer Engineering, University
of Idaho, Moscow, ID, USA
e-mail: shemati@uidaho.edu
V. C. Gaudet
University of Waterloo, Waterloo, ON, Canada
e-mail: vcgaudet@uwaterloo.ca
W. J. Gross
McGill University, Montréal, QC, Canada
e-mail: warren.gross@mcgill.ca
Introduction
Error-correction codes (ECCs), or channel codes, are widely used to improve the efficiency of digital communication and storage systems. They make it possible to significantly reduce the signal power required to transmit information in a communication system, or to increase the amount of information stored in a storage system. Low-density parity-check (LDPC) codes are now established as one of the leading channel codes for approaching the channel capacity in data storage and communication systems. Notably, they have recently been selected as the channel code for the data channel in the fifth generation of cellular systems. Compared to other codes, they stand out by their ability to be decoded with a message-passing decoding algorithm that offers a large degree of parallelism, which makes it possible to simultaneously achieve a large channel coding gain and a high data throughput.
Exploiting all the available parallelism in message-passing decoding is difficult
because of the logic area required for replicating processing circuits, but also
because of the large number of wires required for exchanging messages. The use
of stochastic computing was thus proposed as a way of achieving highly parallel
LDPC decoders with a smaller logic and wiring complexity. Furthermore, because
the energy efficiency per operation in integrated circuits is now improving much
more slowly than in the past, many researchers are looking into approaches that
allow trading off the reliability of computations in return for tolerating an increase
in manufacturing variability and ultimately obtaining large improvements in energy
efficiency [1]. The stochastic nature of the value representation in stochastic
computing makes it an interesting paradigm in which to perform such reliability
versus energy optimizations. Interestingly, LDPC decoding algorithms are naturally
robust to hardware faults, and it was shown that the energy usage of a decoder can
be reduced with no performance degradation by operating the circuit in a regime
where timing violations can occur [2].
Current decoding algorithms based on stochastic computing do not outperform
standard algorithms on all fronts, but they generally offer a significant advantage
in average processing throughput normalized to circuit area. Potentially, they could
also offer further improvements in robustness to circuit faults. Finally, as discussed
in section “Asynchronous Decoders”, their simplicity combined with the robustness
of LDPC decoders makes it possible to envision asynchronous implementations
with no signaling overhead, which offers another avenue for tolerating propagation
delay variations occurring within the circuit.
We start this chapter by describing several LDPC decoding algorithms that
perform computations using the stochastic representation in section “Stochastic
Decoding of LDPC Codes”. One exciting aspect of stochastic computing is its
ability to enable new circuit implementation styles that can achieve improved
energy efficiency. Section “Asynchronous Decoders” reviews some work on digital
asynchronous implementations of LDPC decoders. Finally, the use of stochastic
computing is not limited to the decoding of binary LDPC codes. Section “Stochastic
Decoders for Non-Binary LDPC Codes” presents a stochastic approach for decoding
non-binary LDPC codes, and section “The Stochastic Turbo Decoder” presents a
decoder for Turbo codes based entirely on a stochastic number representation.
LDPC codes are part of the family of linear block codes, which are commonly
defined using a parity-check matrix H of size m × n. The codewords corresponding
to H are the column vectors x of length n for which H · x = 0, where 0 is the zero
vector. LDPC codes can be binary or non-binary. For a binary code, the elements of
H and x are from the Galois Field of order 2, or equivalently H ∈ {0, 1}^{m×n} and x ∈ {0, 1}^n. Non-binary LDPC codes are defined similarly, but the elements of H
and x are taken from higher order Galois Fields. The rate r of a code expresses the
number of information bits contained in the codeword divided by the code length.
Assuming H is full rank, we have r = 1 − m/n.
A block code can also be equivalently represented as a bipartite graph with two types of nodes. We call a node of the first type a variable node (VN), and a node of the second type a check node (CN). Every row i of H corresponds to a check node ci, and every column j of H corresponds to a variable node vj. An edge exists between ci and vj if H(i, j) is non-zero.
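As a toy illustration, the snippet below builds the neighbor sets of this bipartite graph (called Vi and Cj later in this chapter) from a small parity-check matrix; the matrix itself is made up and is not a practical LDPC code.

```python
# A toy parity-check matrix (not a practical LDPC code)
H = [
    [1, 1, 0, 1, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [1, 0, 0, 0, 1, 1],
]
m, n = len(H), len(H[0])
V = {i: [j for j in range(m) if H[j][i]] for i in range(n)}  # CN neighbors of VN i
C = {j: [i for i in range(n) if H[j][i]] for j in range(m)}  # VN neighbors of CN j
rate = 1 - m / n   # code rate, assuming H is full rank
print(V, C, rate)
```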
The key property that distinguishes LDPC codes from other linear block codes
is that their parity-check matrix is sparse (or “low density”), in the sense that each
row and each column contains a small number of non-zero elements. Furthermore
this number does not depend on n. In other words, increasing the code size n also
increases the sparsity of H . The number of non-zero elements in a row of H is equal
to the number of edges incident on the corresponding check node and is called the
check node degree, denoted dc . Similarly the number of non-zero elements in a
column is called the variable node degree and denoted dv .
LDPC codes can be decoded using a variety of message-passing algorithms that
operate by passing messages on the edges of the code graph. These algorithms are
interesting because they have a low complexity per codeword bit while also offering
a high level of parallelism. If the graph contains no cycles, there exists a message-
passing algorithm that yields the maximum-likelihood estimate of each transmitted
bit, called the Sum-Product algorithm (SPA) [3]. In practice, all good LDPC codes
contain cycles, and in that case the SPA is not guaranteed to generate the optimal
estimate of each symbol. Despite this fact, the SPA usually performs very well on
graphs with cycles, and experiments have shown that an LDPC code decoded with
the SPA can still be used to approach the channel capacity [4]. The SPA can be
defined in terms of various likelihood metrics, but when decoding binary codes, the
log likelihood ratio (LLR) is preferred because it is better suited to a fixed-point
representation and removes the need to perform multiplications. Suppose that p is
the probability that the transmitted bit is a 1 (and 1 − p the probability that it is a 0).
The LLR metric λi is defined as
$$\lambda_i = \ln\!\left(\frac{1-p}{p}\right).$$
Algorithm 1 describes the SPA for binary codes (the SPA for non-binary codes
is described in section “Stochastic Decoders for Non-Binary LDPC Codes”). The
algorithm takes LLR priors as inputs and outputs an estimate of each codeword
bit. If the modulated bits are represented as xi ∈ {−1, 1} and transmitted over the additive white Gaussian noise channel, the LLR priors λi corresponding to each codeword bit i ∈ {1, 2, . . . , n} are obtained from the channel output yi using
$$\lambda_i = \frac{-2 y_i}{\sigma^2},$$
where σ² is the noise variance. The algorithm operates by passing messages on the
code graph. We denote a message passed from a variable node i to a check node
j as ηi,j , and from a check node j to a variable node i as θj,i . Furthermore, for
each variable node vi we define a set Vi that contains all the check node neighbors
of vi , and similarly for each check node cj , we define a set Cj that contains the
variable node neighbors of cj . The computations can be described by two functions:
a variable node function VAR(S) and a check node function CHK(S), where S is a
set containing the function's inputs. If we let S = {λ1, λ2, . . . , λd}, the functions
are defined as follows:
$$\mathrm{VAR}(S) = \sum_{i=1}^{d} \lambda_i \tag{1}$$
input : {λ1, λ2, . . . , λn}
output: x̂ = [x̂1, x̂2, . . . , x̂n]
begin
    θj,i ← 0, ∀i, j
    for t ← 1 to L do
        for i ← 1 to n do                        // VN to CN messages
            foreach j ∈ Vi do
                ηi,j ← VAR({λi} ∪ {θa,i : a ∈ Vi} \ {θj,i})
        for j ← 1 to m do                        // CN to VN messages
            foreach i ∈ Cj do
                θj,i ← CHK({ηa,j : a ∈ Cj} \ {ηi,j})
        for i ← 1 to n do                        // Compute the decision vector
            Λi ← VAR({λi} ∪ {θa,i : a ∈ Vi})
            if Λi ≥ 0 then x̂i ← 0 else x̂i ← 1
        Terminate if x̂ is a valid codeword
    Declare a decoding failure

Algorithm 1: Sum-Product decoding of an LDPC code using LLR messages
$$\mathrm{CHK}(S) = 2\,\mathrm{arctanh}\!\left(\prod_{i=1}^{d} \tanh(\lambda_i/2)\right). \tag{2}$$
The algorithm performs up to L iterations, and stops as soon as the bit estimate
vector x̂ forms a valid codeword, that is H · x̂ = 0.
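A direct software transcription of the two node functions may be useful; the sketch below implements the VAR update of Eq. (1) and the CHK update in the form of Eq. (2), and checks numerically that CHK is consistent with the probability-domain form used later in the chapter, using the identity tanh(λ/2) = 1 − 2p for λ = ln((1 − p)/p).

```python
import math

def var(llrs):
    """Variable node update, Eq. (1): sum of the input LLRs."""
    return sum(llrs)

def chk(llrs):
    """Check node update in the form 2*atanh(prod(tanh(l/2)))."""
    prod = 1.0
    for l in llrs:
        prod *= math.tanh(l / 2.0)
    prod = max(min(prod, 1.0 - 1e-12), -1.0 + 1e-12)  # numerical guard
    return 2.0 * math.atanh(prod)

# Consistency check: with l = ln((1-p)/p) we have tanh(l/2) = 1 - 2p,
# so chk() must match the probability-domain CHK discussed below.
p1, p2 = 0.2, 0.3
l1, l2 = math.log((1 - p1) / p1), math.log((1 - p2) / p2)
p_out = (1 - (1 - 2 * p1) * (1 - 2 * p2)) / 2
print(chk([l1, l2]), math.log((1 - p_out) / p_out))  # equal
```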
LDPC decoders have the potential to achieve a high throughput because each of
the n codeword bits can be decoded in parallel. However, the length of the codes
used in practice is on the order of 10^3, going up to 10^5 or more. This makes it
difficult to make use of all the available parallelism while still respecting circuit
area constraints. One factor influencing area utilization is of course the complexity
of the VAR and CHK functions to be implemented, but because of the nature of the
message-passing algorithm, the wires that carry messages between processing nodes
also have a large influence on the area, as was identified early on in one of the first
circuit implementations of an SPA LDPC decoder [5].
The need to reduce both logic and wiring complexity suggests that stochastic
computing could be a good approach. The use of stochastic computation for
the message-passing decoding of block codes was first proposed by Gaudet and
Rapley [6]. The idea was prompted by the realization that the two SPA functions
VAR and CHK had very simple stochastic implementations when performed in the
probability domain. Let us first consider the CHK function. In the LLR domain, the
function is given by (2), which in the probability domain becomes
$$\mathrm{CHK}(p_1, p_2, \ldots, p_d) = \frac{1 - \prod_{i=1}^{d}(1 - 2p_i)}{2}. \tag{3}$$
The implementation of this function in the stochastic domain is simply an exclusive-OR (XOR) gate. That is, if we have independent binary random variables X1, X2, . . . , Xd, each distributed such that Pr(Xi = 1) = pi, then taking
$$Y = X_1 + X_2 + \cdots + X_d \bmod 2 \tag{4}$$
yields a stream with Pr(Y = 1) = CHK(p1, p2, . . . , pd).
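This property is easy to verify by Monte Carlo simulation; the short sketch below XORs independent Bernoulli bits and compares the empirical mean of Y against Eq. (3). The stream length and probabilities are arbitrary.

```python
import random

rng = random.Random(2)
ps, N = [0.1, 0.25, 0.4], 200000

ones = 0
for _ in range(N):
    y = 0
    for p in ps:
        y ^= 1 if rng.random() < p else 0   # XOR of the input bits, Eq. (4)
    ones += y

prod = 1.0
for p in ps:
    prod *= 1 - 2 * p
print(ones / N, (1 - prod) / 2)   # empirical mean vs. Eq. (3)
```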
In the probability domain, the two-input VAR function becomes
$$\mathrm{VAR}(p_1, p_2) = \frac{p_1 p_2}{p_1 p_2 + (1 - p_1)(1 - p_2)}. \tag{5}$$
In the stochastic domain, this function can be computed approximately using the
circuit shown in Fig. 1. The JK flip-flop becomes 1 if its J input is 1, and 0 if its K
input is 1. Otherwise, it retains its previous value. This implementation is different
in nature from the one used for the CHK function, since it contains a memory. The
behavior of the circuit can be analyzed by modeling the output Y as a Markov chain
with states Y = 0 and Y = 1. Suppose that the stochastic streams X1 [t] and X2 [t]
are generated according to the expectation sequences p1 [t] and p2 [t], respectively,
and let the initial state be Y[0] = s_o. Then, at time t = 1, we have
$$E[Y[1]] = \Pr(Y[1] = 1) = \begin{cases} p_1[1]\, p_2[1] & \text{if } s_o = 0, \\ p_1[1] + p_2[1] - p_1[1]\, p_2[1] & \text{if } s_o = 1. \end{cases}$$
Neither expression equals VAR(p1[1], p2[1]), and therefore the expected value of the first output of the circuit is incorrect, irrespective of the starting state. However, if we assume that the input streams are independent and identically distributed (i.i.d.) with p1[t] = p1 and p2[t] = p2, it is easy to show that the Markov chain converges to a steady state such that
$$\Pr(Y[t] = 1) \to \mathrm{VAR}(p_1, p_2). \tag{6}$$
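The steady-state behavior can be checked with a short simulation of the Fig. 1 circuit, assuming the gating implied by the description above, namely J = X1·X2 and K = X̄1·X̄2:

```python
import random

rng = random.Random(5)
p1, p2, N = 0.7, 0.4, 200000

y, ones = 0, 0
for _ in range(N):
    x1 = 1 if rng.random() < p1 else 0
    x2 = 1 if rng.random() < p2 else 0
    if x1 & x2:                 # J = 1: set the flip-flop
        y = 1
    elif (1 - x1) & (1 - x2):   # K = 1: reset the flip-flop
        y = 0                   # otherwise Y retains its previous value
    ones += y

var_exact = p1 * p2 / (p1 * p2 + (1 - p1) * (1 - p2))
print(ones / N, var_exact)      # steady-state mean approaches VAR(p1, p2)
```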
To build a circuit that will compute the VAR function for more than 2 inputs, we
can make use of the fact that the VAR function can be distributed arbitrarily, which
can easily be seen by considering the equivalent LLR-domain formulation in (1).
For example we have VAR(p1 , p2 , p3 ) = VAR(VAR(p1 , p2 ), p3 ).
Stochastic decoders built using these circuits were demonstrated for very small
codes, but they are unable to decode realistic LDPC codes. The reason is that (6)
is not sufficient to guarantee the accuracy of the variable node computation, since
we do not know that the input streams are stationary or close to stationary. In
graphs with cycles, low precision messages can create many fixed points in the
decoder’s iterative dynamics that would not be there otherwise. This was noted in
[7], and the authors proposed to resolve the precision issue by adding an element
called a supernode, which takes one stochastic stream as input and outputs another
stochastic stream. This approach interrupts the feedback path by using a constant
expectation parameter to generate the output stochastic stream. Simultaneously, it
estimates the mean of the incoming stochastic stream. The decoding is performed
Now that we have explained the basic concepts used to build stochastic decoders, we
are ready to present stochastic decoding algorithms that are able to decode practical
LDPC codes. Most such algorithms make use of a smoothing mechanism called
Successive Relaxation. Message-passing LDPC decoders are iterative algorithms.
We can express their iterative progress by defining a vector xo of length n containing
the information received from the channel, and a second vector x[t] of length ne
containing all the messages sent from variable nodes to check nodes at iteration
t, where ne is the number of edges in the graph. The standard SPA decoder for
an LDPC code is an iterative algorithm that is memoryless, by which we mean
that the messages sent on the graph edges at one iteration only depend on the
initial condition, and on the messages sent at the previous iteration. As a result,
the decoder’s progress can be represented as follows:
where h() is a function that performs the check node and variable node message
updates, as described in Algorithm 1.
In the past, analog circuit implementations of SPA decoders have been consid-
ered for essentially the same reasons that motivated the research into stochastic
decoders. Since these decoders operate in continuous time, a different approach
was needed to simulate their decoding performance. The authors of [8] proposed to
simulate continuous-time SPA by using a method called successive relaxation (SR).
Under SR, the iterative progress of the algorithm becomes
$$x[t] = (1 - \beta)\, x[t-1] + \beta\, h(x[t-1], x_o), \tag{7}$$
where 0 < β ≤ 1 is the relaxation factor. The most interesting aspect of this method is that it can be used not only as a
simulator, but also as a decoding algorithm in its own right, usually referred to
as Relaxed SPA. Under certain conditions, Relaxed SPA can provide significantly
better decoding performance than the standard SPA.
In stochastic decoders, SR cannot be applied directly because the vector of
messages x[t] is a binary vector, while x[t] obtained using (7) is not if β < 1.
However, if we want to add low-pass filters to a stochastic decoder, we must add
memories that can represent the expectation domain of the stochastic streams.
Suppose that we associate a state memory with each edge, and group these memories
in a vector s[t] of length ne . Since the expectation domain is the probability domain,
the elements of s[t] are in the interval [0, 1]. Stochastic messages can be generated
from the edge states by comparing each edge state to a random threshold. We can
then rewrite (7) as a mean tracking filter, where s[t] is the vector of estimated means after iteration t, and x[t], xo[t] are vectors of stochastic bits:
$$s[t] = (1 - \beta)\, s[t-1] + \beta\, h(x[t-1], x_o[t]). \tag{8}$$
The value of β controls the rate at which the decoder state can change, and since
E[x[t]] = s[t], it also controls the precision of the stochastic representation.
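In scalar form, the tracking filter is only a few lines; the sketch below tracks the mean of an i.i.d. stream and regenerates an output bit by threshold comparison. The value of β is an arbitrary choice for illustration.

```python
import random

rng = random.Random(9)
beta, s = 1.0 / 32, 0.5    # relaxation factor (arbitrary) and initial state

target = 0.8               # mean of the incoming stochastic stream
for t in range(20000):
    x = 1 if rng.random() < target else 0     # incoming stochastic bit
    s = (1 - beta) * s + beta * x             # Eq. (8)-style mean tracking
    out = 1 if rng.random() < s else 0        # regenerated message bit

print(s)   # hovers near 0.8, with fluctuations controlled by beta
```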
We will first consider stochastic variable node circuits with two inputs X1 [t] and
X2 [t]. As previously, we denote by p1 [t] and p2 [t] the expectation sequences
associated with each input stream. Let E denote the event X1[t] = X2[t]. We have that
$$\Pr(X_1[t] = 1 \mid E) = \mathrm{VAR}(p_1[t], p_2[t]),$$
where VAR() is defined in (5). Therefore, one way to implement the variable node
function for stochastic streams is to track the mean of the streams at the time instants
when they are equal. As long as Pr(E) > 0, a mean tracker can be as close as desired
to VAR(p1 [t], p2 [t]) if the rate of change of p1 [t], p2 [t] is appropriately limited. If
the mean tracker takes the form of (8), this corresponds to choosing a sufficiently
small β.
The first use of relaxation in the form of (8) was proposed in [9], where the
relaxation (or mean tracking) step is performed in the variable node, by using a
variable node circuit that is an extension of the original simple circuit shown in
Fig. 1. In the original VN circuit, each graph edge had a corresponding 1-bit flip-
flop. This flip-flop can be extended to an ℓ-bit shift register, in which a ‘1’ is shifted
in if both inputs X1 [t] and X2 [t] are equal to 1, and a ‘0’ is shifted in if both inputs
are equal to 0. When a new bit is shifted in, the oldest bit is discarded.
Let us denote the number of ‘1’ bits in the shift register by w[t], and define the current mean estimate as s[t] = w[t]/ℓ. If we make the simplifying assumptions
that the bits in the shift register are independent from X1 [t] and X2 [t], and that
when a bit is added to the shift register, the bit to be discarded is chosen at random,
then it is easy to show that the shift register implements the successive relaxation rule of (7) in distribution, with β = Pr(E)/ℓ, in the sense that
$$E[s[t]] = \left(1 - \frac{\Pr(E)}{\ell}\right) s[t-1] + \frac{\Pr(E)}{\ell}\, \mathrm{VAR}(p_1[t-1], p_2[t-1]).$$
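A quick simulation of this shift-register VN, with the output taken as a randomly addressed register bit (a simplification of the actual circuit), shows the register mean settling near VAR(p1, p2):

```python
import random
from collections import deque

rng = random.Random(11)
p1, p2, ell, N = 0.7, 0.4, 32, 400000

sr = deque([rng.randrange(2) for _ in range(ell)], maxlen=ell)
ones = 0
for _ in range(N):
    x1 = 1 if rng.random() < p1 else 0
    x2 = 1 if rng.random() < p2 else 0
    if x1 == x2:
        sr.append(x1)                  # shift in the agreed bit; oldest drops
    ones += sr[rng.randrange(ell)]     # output a randomly addressed bit

var_exact = p1 * p2 / (p1 * p2 + (1 - p1) * (1 - p2))
print(ones / N, var_exact)             # the register mean tracks VAR(p1, p2)
```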
When the variable node degree is large, it was suggested in [10] to implement the variable node function using a computation tree with two levels. Let us denote the computation performed by the first-level circuit as VARST1 and by the second-level circuit as VARST2. For example, the circuit for a degree-6 VN can be implemented as a tree in which first-level VARST1 circuits feed a second-level VARST2 circuit.
Asynchronous Decoders
The vast majority of modern digital circuit designs use a synchronous design approach. In a synchronous circuit, all inputs of logic circuits are only allowed to change at pre-determined moments dictated by a clock signal. This ensures that the inputs are always logically valid, and furthermore prevents the occurrence of race conditions between signals in the presence of feedback paths. The synchronous design approach provides a highly desirable design abstraction that makes it possible to manage the huge complexity of modern systems. However, it is not without costs. First, a clock signal must be distributed throughout the circuit with high accuracy. Second, the processing speed of the system is dictated by the longest possible propagation delay between any two connected clocked memory elements in the system. This worst-case delay, known as the critical path, can be significantly longer than the average processing delay, especially when possible process, voltage, and temperature variations are taken into account.
Asynchronous circuits have the potential of decreasing the average time required
for a computation by using local signals to indicate valid outputs instead of
relying on a global clock. According to measurements reported in [12], the delays
required to propagate messages between variable and check nodes in a fully parallel
stochastic LDPC decoder represent the majority of the delay required to complete a
decoding iteration, and this delay varies significantly from one wire to another. The
authors of [12] thus propose to use asynchronous signaling to conduct the exchange
of messages, which leads to significant speedup in total decoding time.
Because the basic circuits required to build a stochastic LDPC decoder are
very simple, it is also worth considering whether their simplicity might allow
constructing a decoder circuit that operates without any signaling. In this scheme,
it is up to the designer to examine the circuit at the gate and even transistor level
to ensure that it is free of harmful glitches or race conditions. This approach was
investigated in [13], in which the authors have implemented a clockless version
Fig. 3 A degree-six clockless stochastic check node [13]
Stochastic Decoders for Non-Binary LDPC Codes

Non-binary LDPC codes can outperform binary codes, but they are more complex to decode. Stochastic computation is one approach that has been explored to reduce the complexity of the decoding algorithm.
In a non-binary code, codeword symbols can take any value from the Galois Field
(GF) of order q. The field order is usually chosen as a power of 2, and in that case we
denote the power as p, that is, 2^p = q. The information received about a symbol can
be expressed as a probability mass function (PMF) that, for each of the q possible
values of this symbol, indicates the probability that it was transmitted, given the
channel output. For a PMF U , we denote by U [γ ] the probability corresponding to
symbol value γ ∈ GF(q). Decoding is achieved by passing messages representing
PMFs on the graph representation of the code, as in message-passing decoding of
binary codes. However, when describing the algorithm, it is convenient to add a third
node type called a permutation node (PN), which handles part of the computation
associated with the parity-check constraint. The permutation nodes are inserted on
every edge in the graph, such that any message sent from a VN to a CN or from a
CN to a VN passes through a permutation node, resulting in a tripartite graph.
At every decoding iteration, a variable node v receives dv PMF messages from neighboring permutation nodes. A PMF message sent from v to a permutation node p at iteration t, denoted by $U_{vp}^{(t)}$, is given by
$$U_{vp}^{(t)} = \mathrm{NORM}\!\left(L_v \times \prod_{p' \neq p} U_{p'v}^{(t-1)}\right), \tag{9}$$
where Lv is the channel PMF, and NORM() is a function that normalizes the PMF so that all its probabilities sum to 1. A PN p receives a message $U_{vp}^{(t)}$ from a VN and generates a message to a CN c by applying the permutation
$$U_{pc}^{(t)}[\gamma h_p] = U_{vp}^{(t)}[\gamma], \quad \forall \gamma \in \mathrm{GF}(q),$$
where $h_p$ is the non-zero element of H associated with the edge. The CN computes the message to a PN p as the convolution of the incoming PMFs,
$$U_{cp}^{(t)} = \mathop{\circledast}_{p' \neq p} U_{p'c}^{(t)}, \tag{10}$$
and the PN applies the inverse permutation $h_p^{-1}$ before forwarding the message back to the VN, where $h_p^{-1}$ is such that $h_p^{-1} \times h_p = 1$.
Among the computations described above, the multiplications required in (9)
are costly to implement, but (10) has the highest complexity, since the number of
operations required scales exponentially in the field order q and in the CN degree dc .
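For fields of characteristic 2, where symbol addition is bit-wise XOR of the symbol indices, the convolution in (10) can be written directly as follows; the PMF values are arbitrary examples over GF(4). Each pairwise convolution costs O(q²) operations, which hints at the complexity discussed above.

```python
def gf_convolve(u, v):
    """Convolution of two PMFs over GF(2^p), where field addition of the
    symbol indices is bit-wise XOR."""
    q = len(u)
    out = [0.0] * q
    for a in range(q):
        for b in range(q):
            out[a ^ b] += u[a] * v[b]
    return out

def cn_message(pmfs):
    """CN message per Eq. (10): convolve the PMFs from all other edges."""
    acc = pmfs[0]
    for u in pmfs[1:]:
        acc = gf_convolve(acc, u)
    return acc

u = [0.7, 0.1, 0.1, 0.1]   # example PMFs over GF(4)
v = [0.4, 0.3, 0.2, 0.1]
print(cn_message([u, v]))
```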
The Stochastic Turbo Decoder

Stochastic decoding was also extended to convolutional and Turbo codes in [18].
Compared to stochastic decoders for LDPC codes, the challenge in implementing
a Turbo code decoder using stochastic computing resides in the need to perform
additions of probability values, which occur in the a posteriori probability (APP)
operation performed by each soft-input soft-output (SISO) component decoder.
Addition cannot be implemented directly using the stochastic representation, since the stream's expected value must lie in [0, 1]. The sum of N streams normalized by a factor of 1/N can be implemented by feeding the streams into an N-input multiplexer that randomly selects one of its inputs with equal probability. However, many useful bits are discarded in the process, which translates into a high processing latency.
To improve the precision of the addition, the implementation of [18] uses
an addition technique introduced in [19], where the addition is replaced by an
exponential transformation, followed by a multiplication, followed by the inverse
transformation. According to the results presented, approximating the exp(−x) and
the −ln(x) functions using the first two terms of their Taylor series is sufficient to
reduce the number of decoding cycles by almost one order of magnitude.
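The principle is easy to check numerically: the sketch below compares an exact exponential-transform addition, its two-term Taylor version (which reduces both transforms to the complement 1 − x and the addition to a single multiplication, i.e., a stochastic AND), and the true sum. This illustrates the technique of [19] and is not the decoder's exact datapath.

```python
import math

def add_exact(a, b):
    """Addition through the exponential transform: exp(-a)exp(-b) = exp(-(a+b))."""
    return -math.log(math.exp(-a) * math.exp(-b))

def add_taylor(a, b):
    fa, fb = 1 - a, 1 - b   # two-term Taylor of exp(-x)
    prod = fa * fb          # multiplication (a stochastic AND of two streams)
    return 1 - prod         # two-term Taylor of -ln(x)

a, b = 0.12, 0.08
print(add_exact(a, b), add_taylor(a, b), a + b)
```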
Figure 5 shows a section of the stochastic tail-biting APP decoder of [18], which has multiple inputs and outputs that facilitate the exchange of information among sections. The number of sections is equal to the number of symbols to decode, and each section consists of four modules. A module receives the channel outputs ui and vi, which correspond to the i-th transmitted symbol di and its parity bit yi, respectively.
Fig. 5 A section of a stochastic tail-biting APP decoder [18]
The module converts the received channel outputs into a priori probabilities, which are represented by two stochastic sequences used to stochastically compute the branch metrics, the forward metrics in the A module, and the backward metrics in the B module. These modules are involved in an iterative process, since they use the forward and backward metrics αi and βi+1 from their neighbors and provide them with αi+1 and βi. A decision-making module “Dec” determines the final value of each binary symbol, d̂i, for the transmitted symbol di. In the turbo decoder, the “Extr” module computes the output extrinsic probability Pr ex_out, which is then used by a module of the second APP decoder as the input Pr ex_in. Simulation results showed that the performance of the stochastic turbo decoder was close to that of the floating-point Max-Log-MAP decoding algorithm for a few turbo codes [18].
References
1. Rahimi, A., Benini, L., Gupta, R.K.: Variability mitigation in nanometer CMOS integrated
systems: A survey of techniques from circuits to software. Proceedings of the IEEE 104(7),
1410–1448 (2016). https://doi.org/10.1109/JPROC.2016.2518864
2. Leduc-Primeau, F., Kschischang, F.R., Gross, W.J.: Modeling and energy optimization of
LDPC decoder circuits with timing violations. IEEE Transactions on Communications 66(3),
932–946 (2018). https://doi.org/10.1109/TCOMM.2017.2778247
3. Kschischang, F.R., Frey, B.J., Loeliger, H.A.: Factor graphs and the sum-product algorithm.
IEEE Trans. on Information Theory 47(2), 498–519 (2001)
4. Chung, S.Y., Forney, G.D., Jr., Richardson, T.J., Urbanke, R.: On the design of low-density parity-check codes within 0.0045 dB of the Shannon limit. IEEE Commun. Lett. 5(2), 58–60 (2001)
5. Blanksby, A.J., Howland, C.J.: A 690-mW 1-Gb/s 1024-b, rate-1/2 low-density parity-check
code decoder. IEEE Journal of Solid-State Circuits 37(3) (2002)
6. Gaudet, V., Rapley, A.: Iterative decoding using stochastic computation. Electronics Letters
39(3), 299–301 (2003). https://doi.org/10.1049/el:20030217
7. Winstead, C., Gaudet, V.C., Rapley, A., Schlegel, C.B.: Stochastic iterative decoders. In:
International Symposium on Information Theory, pp. 1116–1120 (2005)
8. Hemati, S., Banihashemi, A.: Dynamics and performance analysis of analog iterative decoding
for low-density parity-check (LDPC) codes. IEEE Trans. on Communications 54(1), 61–70
(Jan. 2006). https://doi.org/10.1109/TCOMM.2005.861668
9. Sharifi Tehrani, S., Gross, W., Mannor, S.: Stochastic decoding of LDPC codes. IEEE
Communications Letters 10(10), 716–718 (2006)
10. Sharifi Tehrani, S., Mannor, S., Gross, W.: Fully parallel stochastic LDPC decoders. IEEE
Trans. on Signal Processing 56(11), 5692–5703 (2008). https://doi.org/10.1109/TSP.2008.
929671
11. Sharifi Tehrani, S., Naderi, A., Kamendje, G.A., Mannor, S., Gross, W.J.: Tracking forecast
memories in stochastic decoders. In: Proc. IEEE International Conference on Acoustics,
Speech, and Signal Processing (ICASSP) (2009)
12. Onizawa, N., Gaudet, V.C., Hanyu, T., Gross, W.J.: Asynchronous stochastic decoding of low-
density parity-check codes. In: 2012 IEEE 42nd International Symposium on Multiple-Valued
Logic, pp. 92–97 (2012). https://doi.org/10.1109/ISMVL.2012.35
13. Onizawa, N., Gross, W.J., Hanyu, T., Gaudet, V.C.: Clockless stochastic decoding of low-
density parity-check codes. In: 2012 IEEE Workshop on Signal Processing Systems, pp. 143–
148 (2012). https://doi.org/10.1109/SiPS.2012.53
14. Leduc-Primeau, F., Hemati, S., Mannor, S., Gross, W.J.: Dithered belief propagation decoding.
IEEE Trans. on Communications 60(8), 2042–2047 (2012). https://doi.org/10.1109/TCOMM.
2012.050812.110115A
15. Davey, M., MacKay, D.: Low-density parity check codes over GF(q). IEEE Communications Letters 2(6), 165–167 (1998). https://doi.org/10.1109/4234.681360
16. Sarkis, G., Hemati, S., Mannor, S., Gross, W.: Stochastic decoding of LDPC codes over GF(q).
IEEE Trans. on Communications 61(3), 939–950 (2013). https://doi.org/10.1109/TCOMM.
2013.012913.110340
17. Leduc-Primeau, F., Hemati, S., Mannor, S., Gross, W.J.: Relaxed half-stochastic belief
propagation. IEEE Trans. on Communications 61(5), 1648–1659 (2013). https://doi.org/10.
1109/TCOMM.2013.021913.120149
18. Dong, Q.T., Arzel, M., Jego, C., Gross, W.J.: Stochastic decoding of turbo codes. IEEE
Transactions on Signal Processing 58(12), 6421–6425 (2010). https://doi.org/10.1109/TSP.
2010.2072924
19. Janer, C., Quero, J., Ortega, J., Franquelo, L.: Fully parallel stochastic computation architecture. IEEE Transactions on Signal Processing 44(8), 2110–2117 (1996). https://doi.org/10.1109/78.533736