Low_Latency_Library_for_HFT-Algo-Logic-4831a009 (1)
All content following this page was uploaded by John W. Lockwood on 13 April 2015.
Maxeler Technologies and Convey Computer (amongst others) offer such platforms. Examples of Hybrid Computers specifically targeted at HFT include Nomura’s NXT Direct and Deutsche Bank’s Autobahn Ultra. Both products were developed to accelerate pre-trade SEC regulatory compliance and risk management checks on orders.

Figure 2 Typical Smart NIC in a supervised financial trading application

FPGA-based “Smart NICs” offer another way to apply programmable logic to the acceleration of financial processing applications. Smart NICs typically bring together high-speed network interfaces, a PCIe host interface, memory and an FPGA. The FPGA implements the NIC controller, acting as the bridge between the host computer and the network, and allows user-designed custom processing logic to be integrated directly into the data path. As illustrated in Figure 2, this allows a smart NIC to function as a programmable trading platform under the supervision of a host CPU.

Many vendors offer FPGA-based “smart NICs”. A selection of products targeted specifically at financial trading appears in Table I. The NIC hardware is typically accompanied by programmable logic IP implementing networking-related functions as well as a PCIe host interface, with or without direct memory access (DMA). Some products include a very limited set of finance-specific IP blocks.

Within the research community, the original NetFPGA [2][3] was used as a base platform for building FPGA-accelerated network processing applications. It was developed by Stanford University in collaboration with Xilinx and other partners. In [4], a latency measurement circuit is built on the NetFPGA to monitor the distribution of the difference in arrival times between market data on a pair of redundant (“A/B”) feeds. An event processor for algorithmic trading on NetFPGA in [10] reduces latency by up to two orders of magnitude compared to software. Morris et al. present an FPGA-assisted HFT engine implemented on Celoxica’s AMDC card which improves message throughput by a factor of 12, while reducing latency [11]. An FPGA-based feed handler for the NASDAQ ITCH protocol achieves a low, predictable end-to-end latency of 2.7µs [26]. A comparable software implementation (with no kernel bypass) exhibits a typical latency of 38µs, varying by more than ±20µs.

Finally, some financial processing systems are completely implemented on FPGA-based platforms without any host CPU, yielding the best possible latencies. As incoming traffic no longer needs to make a round trip over the system’s PCIe interface for processing in the CPU, the inherent latency and jitter introduced by the host’s operating system are entirely avoided. The approach presented in this paper fits into this category.

III. ADVANTAGES AND DISADVANTAGES OF FPGAS FOR LOW-LATENCY FINANCE

A. Advantages

Modern FPGAs can implement most aspects of any HFT application. Incoming market data can typically be processed completely on the FPGA without needing to travel over and back to a host CPU. Figure 3 compares an FPGA-based platform to a traditional software trading platform.
on a network adapter. Relevant traffic is transferred to memory and the CPU is interrupted to handle the application processing. After this processing is complete, the data is transferred back to the network adapter and transmitted over the network. The interrupt-driven software stack, unpredictable PCIe transfers and cache misses make network latencies higher and less predictable in software implementations. By contrast, in the FPGA implementation, the incoming network data is fed directly into a custom-designed, highly-optimized and application-specific processing pipeline via hardware PHY and MAC blocks.

Furthermore, relevant information within a packet can in fact be extracted before the complete packet is received. The data is available within the same clock cycle in which it arrives, which is significantly different from any software implementation. (The network stack itself holds the complete packet at least once before it is available for processing.) Directly related to this, it is important to highlight that there is no jitter in extracting and processing incoming data. The availability and the processing time for any piece of information are completely predictable down to a clock cycle. FPGAs can therefore naturally achieve significantly lower latencies with minimal jitter, as shown in Figure 4. This is one of the key advantages that FPGAs offer, as the example in Section V demonstrates.

greatly simplifies protocol parsing which is a task that is tedious to describe with a standard programming language.

Finally, in contrast to ASICs, FPGAs offer the performance benefits of custom hardware while retaining programmability. Trading algorithms can be continually improved or circuit bugs fixed in the field.

B. Disadvantages

FPGAs also have disadvantages compared to traditional software-based approaches. At the root of this is the greater complexity of the FPGA development flow. Firstly, many developers of financial applications are unfamiliar with FPGA technology generally and lack the expertise to use the hardware-oriented FPGA development tools in particular. Secondly, building and verifying new hardware is more time-consuming than writing new software due to the significantly lower abstraction level in the design flow. However, in return for this time investment, the FPGA delivers better throughput and lower latency than a software implementation will ever be able to yield. This is conceptually illustrated in Figure 5.
IV. ALGO-LOGIC’S LOW LATENCY LIBRARY AND FPGA IMPLEMENTATION ON NETFPGA-10G

Algo-Logic’s Low-Latency Library is gateware that is compatible with standard FPGA hardware platforms. One supported platform is the NetFPGA-10G [8], which is a quad-port, 10-Gigabit Ethernet successor to the original NetFPGA card, based on a Xilinx Virtex-5 device.

Algo-Logic’s Low-Latency Library processes financial protocols used by trading venues in the US, Europe and other regions, extracting information from the packets as they flow through the FPGA. The library components are used to construct custom trading applications, reducing time-to-market.

Algo-Logic’s library of low-latency FPGA IP blocks (shown in Figure 6) can be divided into two main categories. A set of Infrastructure Components includes generic IP cores providing interfaces for the network, external memories and host software. Financial Processing Components parse and process standard and stock-exchange-specific protocols. The next sections describe these blocks in more detail.

Figure 6 Algo-Logic’s Low-Latency FPGA IP Blocks

A. Infrastructure Components

All components share a common, standardized interface with a 64-bit (8-byte) data word and a protocol conforming to the industry-standard AXI4-Stream specification [20].

An SRAM controller IP block drives the on-board Quad Data Rate (QDR) II Static RAM ICs [5]. SRAM offers low-latency, high-performance storage for small amounts of data.

A Register Interface module controls and monitors the status of registers written and read by host software. The Register Interface contains multiple register types, including write-only Configuration Registers, read-only Status Registers, general-purpose read/write registers and write-only registers. Configuration and status update words are transmitted and received via UDP.

The Register Interface can be accessed via a C++ API on a host or controlled interactively via a web-based Graphical User Interface (GUI). The UDP-based interface allows the FPGA to be controlled and monitored from any host on the network.

The TCP/IP Processing Layer separates the headers from the payloads, stores the headers and forwards the payloads to the application layer for further processing.

The Application Layer processes packet payloads and sends them back to the TCP/IP Layer. Packets are reassembled with their corresponding headers and sent out to the Ethernet MAC. A new TCP/IP checksum is computed over the resulting payload after application-level processing is complete.

Finally, a 10G Ethernet MAC IP block from Xilinx provides the interface to the NetFPGA-10G’s on-board discrete PHY ICs. A MicroBlaze soft processor core is used to initialize the on-board PHYs. Note that the soft core is not used in the datapath to process time-critical packets. However, if required, it could be used to run a simple software application which could communicate with the rest of the system through the Register Interface.

B. Financial Processing Components

Algo-Logic’s library includes IP blocks designed to process application layer messages for HFT. These messages consist largely of the orders and order execution reports sent between clients, brokers and exchanges. A set of pre-verified IP blocks understands and translates the exchange protocols, enabling end-users to focus on application development rather than interface components. Each component is discussed in more detail below.

1) Financial Protocol Parser

The Financial Protocol Parser receives a payload from the TCP/IP Processing Layer and identifies message boundaries for the data in the payload. The parser extracts individual fields and raises a flag when the value of each field becomes valid. This ensures that each field is extracted with the lowest possible latency. Algo-Logic’s library currently has parsers for ASCII protocols such as Financial Information eXchange (FIX) and binary protocols such as OUCH for NASDAQ [14], XPRS for DirectEdge [15], ArcaDirect for Arca [16], and Native Trading Gateway for London Stock Exchange (LSE) [17]. Support for BATS BOE [18] will be added shortly.

An example of how the Financial Protocol Parser in Algo-Logic’s IP library parses a message with minimal latency is shown in the waveform in Figure 7. An incoming OUCH message packet is presented to the Financial Protocol Parser at 8 bytes per cycle as it streams through the 64-bit AXI-Stream data path. As each field is extracted, the value is transferred to internal I/O pins and a valid flag is raised.

The block extracts the OUCH message type (“O”, or “Enter Order”) and the 14-byte Order Token. The Buy/Sell Indicator field is “T” (“Sell Short”). The number of shares is “1002”, with the stock symbol identified as “FFHL”. The remaining fields specify the asking price and indicate that the order is not eligible for intermarket sweep.
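The field-by-field extraction described above can be illustrated with a small software model. This is only a sketch, not Algo-Logic's gateware: it consumes a message in 8-byte words, one word per simulated clock cycle, and reports each field in the cycle in which its last byte arrives. The field offsets and lengths are an assumed layout for the first fields of an OUCH "Enter Order" message and should be checked against the specification [14]; the order token and the price value are hypothetical, while the other field values are the ones quoted in the text.

```python
import struct

# Assumed layout of the first fields of an OUCH "Enter Order" message:
# (name, byte offset, length). Verify against the specification [14].
FIELDS = [
    ("type", 0, 1),
    ("order_token", 1, 14),
    ("buy_sell", 15, 1),
    ("shares", 16, 4),
    ("stock", 20, 8),
    ("price", 28, 4),
]
WORD_BYTES = 8  # one 64-bit AXI-Stream data word per clock cycle


def decode(name, raw):
    """Convert raw field bytes to a Python value."""
    if name in ("shares", "price"):
        return struct.unpack(">I", raw)[0]  # 4-byte big-endian integer
    return raw.decode("ascii").rstrip()  # space-padded text field


def stream_parse(message):
    """Yield (cycle, name, value) as soon as a field's last byte has arrived."""
    received = bytearray()
    pending = list(FIELDS)  # fields not yet emitted, ordered by offset
    for cycle, start in enumerate(range(0, len(message), WORD_BYTES)):
        received += message[start:start + WORD_BYTES]
        # Emit every field that is now completely contained in the data seen.
        while pending and pending[0][1] + pending[0][2] <= len(received):
            name, offset, length = pending.pop(0)
            yield cycle, name, decode(name, bytes(received[offset:offset + length]))


# Hypothetical message: token and price are made up for illustration.
msg = (b"O" + b"TOKEN000000001" + b"T" + struct.pack(">I", 1002)
       + b"FFHL    " + struct.pack(">I", 125000))
events = list(stream_parse(msg))
for cycle, name, value in events:
    print(f"cycle {cycle}: {name} = {value!r}")
```

In this model the message type is available in the first cycle and the share count one cycle before the message ends, which mirrors the point that a streaming parser does not need to wait for the complete packet.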
Figure 7 OUCH Protocol Parsing

The parser adapts automatically to variable-length data fields in the packet. With a 156 MHz clock and a single set of delay flip-flops (DFFs), the circuit extracts all OUCH packet fields within exactly 6.4 nanoseconds (one clock period) of the packet’s arrival with no jitter. This is orders of magnitude faster than a similar extraction in software, which not only incurs system-level delays for transporting the packet to the CPU, but also needs to receive the full packet before it can start parsing.

2) Market data parsing and on-chip storage for price data

In financial processing, applications need to know the price of the instrument (security) that appears in the orders. To track prices, market data is fed into the card via UDP/IP datagrams. The FPGA extracts price updates and stores this data in memory for each symbol. Prices for all 8000 securities traded on U.S. exchanges fit within the FPGA’s on-chip memory. The table size is scalable to support more symbols as required.

Figure 8 Algo-Logic’s HFT Appliance
(Figure labels: Algo-Logic’s 1U NetFPGA 10G Appliance; QDR II SRAM; Xilinx V5TX240T; ATX Power Supply; 4x10G SFP+ i/f; Xilinx Platform Cable USB.)

positions. While it is normal that traders maintain both long and short positions in the market, it is critical that the net position remains within a bound so that there is a margin of safety in the holdings. The ability to monitor positions and exposures in real-time can help avert financial disaster.

This section describes how Algo-Logic’s platform was used to track the real-time exposure and positions of multiple trading clients by parsing FIX execution reports. Figure 9 illustrates the system set-up of this application. Algo-Logic’s system acts as a “bump in the wire” between broker and exchange, processing the FIX execution reports coming from the exchange while passing on orders coming from the involved brokers. A third 10GE interface connects the appliance with the control and logging host. This interface is used to read in normalized market data with price updates, to control the appliance, retrieve data, monitor status and log the actions performed by the card.

Figure 9 Exposure and Position-tracking Application built on Algo-Logic’s HFT Platform
(Figure labels: Host, Exchange and Client 10GigE links, each through a PHY chip and Ethernet MAC; TCP/IP processing and Protocol Parsing; Position, Long and Short Exposure calculation; Market data update/Price tables; Configuration/Debug registers.)

A. Exposure/Position Processing Circuit

(Block diagram labels: Orders; Market data; Control & Status.)
from the exchange and stored within on-chip memory. These prices are needed to calculate exposure. Traffic from brokers and FIX execution reports from the exchange are passed from the MACs to the TCP/IP processing module. Along with network functions, this module also handles the processing of the financial protocols. For this application, the Financial Information eXchange (FIX) protocol is used. The parser extracts all pertinent FIX fields (“tags”) from the execution report, including Security, Shares Filled, Price and Order Type. These correspond to FIX tags 55, 38, 44 and 54, respectively. The Exposure and Position calculation module receives the traffic from the TCP/IP module and updates the market data/price tables using the procedure described in the following subsection. Finally, the FIX execution reports are forwarded to the clients providing details of order execution.

B. Position and Exposure Calculations

The exposure and positions are calculated in the FPGA from the extracted data according to the following formulas:

position = total number of shares (Tag 38) filled by the exchange

If it is a Buy order (Tag 54), then the position is categorized as “long”. If it is a Sell/Short Sell order, then the position is “short”. The net position per security (FIX Tag 55) is:

net position = ( total long positions ) - ( total short positions )

Similarly, long and short exposure per security are defined as follows, where the current market price comes from the price table:

long exposure (security) = ( total long positions ) * current market price
short exposure (security) = ( total short positions ) * current market price

Exposure (security) across all sessions is defined as the product of net position and current market price. Net and gross market values can be calculated as follows:

Net market value = sum[position/security across all sessions * price]
Gross market value = sum[abs(positions) * price]

Once the tags are extracted from the execution reports and prices are read from the price tables, the exposure is calculated and the position is updated based on the type of order.

C. User Interface

Position and exposure data are periodically forwarded to the control and logging host in UDP packets for visualization on a Graphical User Interface (GUI). The speed at which the screen is updated is limited only by the host’s ability to capture logs and process them. The host updates its database and populates positions and exposures on a web-based GUI as shown in Figure 11. Exposure is plotted in the time domain. The table on the left shows positions per security and dollar values based on the current market price per security. Green indicates long positions and red indicates short positions. The table on the top-right visualizes long exposure per session, short exposure per session, net market value per session and gross market value per session. Dollar values are calculated based on the latest prices received from the market. Finally, the graph in the bottom-right corner illustrates all exposures. The green and red lines represent the sum of long and short exposures, respectively, across all sessions. Looking at this plot, a broker or regulator can instantly view total exposure.

Figure 11 Web-based GUI showing Exposure and Position

D. Results

The total delay from the wire through receiving the FIX payload, parsing FIX execution reports and calculating these values can be significantly reduced by performing all of these operations within the FPGA. Rather than reading data across a memory bus, data is stored in fast on-chip memory and QDR-II SRAM directly attached to the FPGA. This memory is accessed without using the PCIe bus when an order (from client to exchange) enters the FPGA, and the table lookups and computation are performed directly in logic.

The total wire-to-wire latency in this application is 1µs with no jitter. The breakdown of the 1µs total delay is 400ns (PHY+MAC, high-speed serial receiver) + 200ns (parsing/calculations) + 400ns (PHY+MAC, high-speed serial transmitter). With future generations of FPGAs these delays will further decrease. It is worth noting that the 1µs round-trip latency through this design is between one and two orders of magnitude lower than a recent software implementation [21].

As for throughput, all the processing on the FPGA is performed at the full 10Gbps line rate. When using FIX, the average size of a message is 150 bytes (the exact value depends on the values of the fields). OUCH messages are smaller, ranging from 12 to 82 bytes in length, and multiple messages can appear in a TCP/IP flow. When a single FIX message appears in a TCP packet, the total size of the packet is 204 bytes. At 10Gbps, packets of 204 bytes (1632 bits) arrive at up to 10^10 / 1632, or about 6.1 million, per second. The throughput is therefore roughly 6.1M FIX messages/second.

With regard to development, three points are worth noting. First, the application could be assembled entirely from existing library components. The Low-Latency Library components were simply connected together to form the infrastructure, including FIX parsing. Second, the integration of the hardware was completed within a month. Most importantly, the Low-Latency Library and hardware platform were pre-verified, greatly simplifying the task of verifying the final design. Overall, the IP library reduced development time considerably.

VI. CONCLUSIONS

In the race to minimize latency in high frequency trading, traders, brokers and exchanges are exploring a range of technologies to build platforms, monitor and manage risk and improve the efficiency of trading.
Software approaches that utilize CPUs with high performance network interface cards provide part of the solution. Through optimized bus logic and device driver software, they reduce the time required to transfer a message from the network to the CPU and back to the network to just a few microseconds. Smart NICs go one step further by offloading some of the processing to a Field Programmable Gate Array (FPGA) on the network adapter.

The advantage of developing financial applications in software is a typically short development time. A challenge with software is that it is hard to achieve low end-to-end latency and jitter. A custom-designed datapath in an FPGA circuit offers the benefits of minimal, deterministic processing times down to clock cycles. However, the benefits of FPGAs have traditionally come at the cost of a longer development time. The contribution of this paper is to provide a pre-built gateware library that allows financial applications to be built on standard FPGA cards (such as the NetFPGA-10G) without the burden of an extended development period.

Algo-Logic’s Low-Latency Library described in this paper includes infrastructure components and domain-specific gateware that extract data from market data streams, including FIX, OUCH, XPRS, ArcaDirect, LSE and BATS BOE. For order processing systems that also need to make decisions based on price, Algo-Logic’s gateware includes a module that receives normalized market data via UDP and stores current prices using on-chip memory so that logic can use these values during order processing.

To demonstrate the utility of this gateware library, a complete application has been developed which tracks the exposure and positions of multiple traders as orders are transferred over and back to the stock exchange. Incoming FIX messages are decoded, analyzed and forwarded onward in real time with a stable end-to-end latency of 1µs. This is up to two orders of magnitude lower than software approaches achieve.

REFERENCES

[1] P. Gomber, B. Arndt, M. Lutat and T. Uhle, “High Frequency Trading” (report commissioned by Deutsche Börse Group), Goethe Universität Frankfurt am Main, March 2011, http://www.frankfurt-main-finance.de/de/finanzplatz/daten-studien/studien/High-Frequency-Trading.pdf
[2] J.W. Lockwood, N. McKeown, G. Watson, G. Gibb, P. Hartke, J. Naous, R. Raghuraman and Jianying Luo, “NetFPGA - An Open Platform for Gigabit-rate Network Switching and Routing”, Microelectronic Systems Education, 2007. MSE '07. IEEE International Conference on, pp. 160-161, 3-4 June 2007
[3] G. Gibb, J.W. Lockwood, J. Naous, P. Hartke and N. McKeown, “NetFPGA - An Open Platform for Teaching How to Build Gigabit-Rate Network Switches and Routers”, Education, IEEE Transactions on, vol. 51, no. 3, pp. 364-369, Aug. 2008
[4] A. Gupte and J.W. Lockwood, “Precise Latency Comparison Module for the NetFPGA”, workshop at the 2010 North American NetFPGA Developers’ Workshop, http://netfpga.org/tutorials/WorkshopNorthAmerican2010/pdf/NetFPGA_Dev_2010_Precise_Latency_Comparison_Module.pdf
[5] Cypress Quad Data Rate (QDR-II) Static Random Access Memory (SRAM), http://www.cypress.com/?id=107
[6] Myricom DBL (Datagram Bypass Layer) Documentation, http://myri.org/dbl.html
[7] Datasheet for the QLogic 7300-series Network Adapters, http://www.qlogic.com/Resources/Documents/DataSheets/Adapters/DataSheet_QLE7300Series.pdf
[8] M. Blott, J. Ellithorpe, N. McKeown, K. Vissers and H. Zeng, “FPGA Research Design Platform Fuels Network Advances”, Xilinx Xcell Journal, Issue 73, September 2010
[9] N. Lobo, V. Malik, C. Donnally, S. Jahne and H. Jhaveri, “Evaluating the Latency Impact of IPv6 on a High Frequency Trading System”, University of Colorado, May 2012
[10] M. Sadoghi, M. Labrecque, H. Singh, W. Shum and H.A. Jacobsen, “Efficient Event Processing through Reconfigurable Hardware for Algorithmic Trading”, Journal Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 1525-1528, September 2010
[11] G.W. Morris, D.B. Thomas and W. Luk, “FPGA Accelerated Low-Latency Market Data Feed Processing”, High Performance Interconnects, 2009 (HOTI 2009). 17th IEEE Symposium on, pp. 83-89, 25-27 Aug. 2009
[12] Report by the Aite Group, “A New World Order: The High Frequency Trading Community and Its Impact On Market Structure”, 2009
[13] R. Martin, “Wall Street’s Quest to Process Data at the Speed of Light”, Information Week, 21st April 2007
[14] O*U*C*H 4.2 Protocol Specification, NASDAQ Inc., http://www.nasdaqtrader.com/content/technicalsupport/specifications/TradingProducts/OUCH4.2.pdf
[15] XPRS 1.24 Protocol Specification, Direct Edge, http://www.directedge.com/Portals/0/docs/Connect/Direct%20Edge%20XPRS%20API%20Manual%20V%201.24.pdf
[16] ARCADirect 4.0 Protocol Specification, NYSE Arca, http://www.nyse.com/pdfs/ArcaDirectSpecVersion4_0.pdf
[17] Native Trading Gateway 10.1 Protocol Specification, London Stock Exchange (LSE), http://www.londonstockexchange.com/products-and-services/millennium-exchange/millennium-exchange-migration/mit203v101.pdf
[18] BATS US Options BOE Specification 1.5.2, BATS Global Markets Inc., http://www.batsoptions.com/resources/membership/BATS_US_Options_BOE_Specification.pdf
[19] R. Iati, “The Real Story of Trading Software Espionage”, TABB Group Perspective, July 10th, 2009
[20] AMBA AXI4-Stream Protocol Specification, ARM Ltd., http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ihi0051a/index.html
[21] OnX Enterprise Solutions, “Report: High Performance Trading – FIX Messaging; Testing for Low Latency”, Jan 2012, http://fixglobal.com/content/high-performance-trading-fix-messaging-testing-low-latency
[22] R. Mueller, J. Teubner and G. Alonso, “Streams on wires: a query compiler for FPGAs”, Proc. VLDB Endowment, vol. 2, no. 1, pp. 229-240, Aug 2009
[23] M. Kearns, A. Kulesza and Y. Nevmyvaka, “Empirical Limitations on High Frequency Trading Profitability”, Working Paper Series, Social Science Research Network (SSRN), September 17, 2010
[24] H. Subramoni, F. Petrini, V. Agarwal and D. Pasetto, “Streaming, Low-latency Communication in On-line Trading Systems”, IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010, pp. 1-8, 19-23 April 2010
[25] C. Starke, V. Grossman, L. Wienbrandt and M. Schimmler, “An FPGA Implementation of an Investment Strategy Processor”, Procedia Computer Science, Volume 9, 2012, pp. 1880-1889
[26] R. Pottahuparambil, J. Coyne, J. Allred, W. Lynch and V. Natoli, “Low-latency FPGA Based Financial Data Feed Handler”, 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 93-96, 1-3 May 2011