The document discusses the development of a low-latency FPGA library designed for High-Frequency Trading (HFT), which aims to significantly reduce latency compared to traditional software-based platforms. It highlights the importance of low latency in trading environments and surveys existing HFT solutions, emphasizing the advantages of using FPGAs for rapid data processing and order execution. The paper presents an FPGA IP library that supports various networking and financial protocols, enabling faster and more efficient trading applications.


See discussions, stats, and author profiles for this publication at: https://www.researchgate.net/publication/262293285

A low-latency library in FPGA hardware for High-Frequency Trading (HFT)

Conference Paper · August 2012


DOI: 10.1109/HOTI.2012.15

All content following this page was uploaded by John W. Lockwood on 13 April 2015.

2012 IEEE 20th Annual Symposium on High-Performance Interconnects

A Low-Latency Library in FPGA Hardware for High-Frequency Trading (HFT)

John W. Lockwood, Adwait Gupte, Nishit Mehta
Algo-Logic Systems Inc.
Santa Clara, CA
{jwlockwd, adwait, nishit}@algo-logic.com

Michaela Blott, Tom English, Kees Vissers
Xilinx Inc.
Dublin, Ireland, San Jose, CA
{michaela.blott, tom.english, kees.vissers}@xilinx.com

Abstract- Current High-Frequency Trading (HFT) platforms are typically implemented in software on computers with high-performance network adapters. The high and unpredictable latency of these systems has led the trading world to explore alternative “hybrid” architectures with hardware acceleration. In this paper, we survey existing solutions and describe how FPGAs are being used in electronic trading to approach the goal of zero latency. We present an FPGA IP library which implements networking, I/O, memory interfaces and financial protocol parsers. The library provides pre-built infrastructure which accelerates the development and verification of new financial applications. We have developed an example financial application using the IP library on a custom 1U FPGA appliance. The application sustains 10Gb/s Ethernet line rate with a fixed end-to-end latency of 1µs – up to two orders of magnitude lower than comparable software implementations.

Keywords- Algorithmic, trading, latency, HFT, FPGA

I. INTRODUCTION

High-Frequency Trading (HFT) refers to rapid electronic trade in financial instruments. Trading venues make current pricing information available as continuous streams of electronic data. Traders monitor these streams, reconstructing pricing and demand for relevant stocks, options, futures and currencies to determine when and what to trade. Buy/Sell orders are then sent to the exchange as soon as possible. Algorithms take advantage of fleeting variations in stock price or demand over time or between different exchanges, accumulating profit by making tiny gains on large numbers of transactions. Traders avoid holding significant overnight positions, with individual stocks being held only for a few seconds before being re-sold [23]. HFT is growing rapidly and is estimated to have accounted for more than 70% of all trades made on US equity markets during 2010 [12].

Figure 1 Traders, brokers and trading venues

A. Traders

Traders receive multiple high-datarate UDP/IP market data streams over the network from trading venues. An initial line handling stage uses redundancy in the streams to replace any information temporarily missed due to packet losses in order to reconstruct the necessary parts of the order book of the exchange. Figure 1 illustrates how traders, brokers/dealers and trading venues (exchanges) interact in the HFT ecosystem.

Traders employ a range of algorithms and heuristics to decide when and what to trade. These approaches combine current market data with historical data and outputs from computational models. Trading decisions may be simple and quick, slow but smart, or (preferably) quick and smart. Quantitative Analysts (“Quants”) are employed to develop the most successful and repeatable approaches.

The final step is to enter these orders into the exchange via a broker. Further computation is used to ensure that the trader’s exposure is carefully managed.

B. Broker/Dealer

The function of a Broker in the trading infrastructure is to be a market participant who takes orders from clients (traders) and passes them to the exchange. Brokers manage risk according to regulatory requirements [1]. Providing the most direct and low-latency connection possible to an exchange inherently makes a broker more attractive to a client.

Many financial instruments are traded at more than one venue. Each venue charges some small matching fee per transaction. Accordingly, once an order is deemed compliant with regulations, the optimum execution venue must be chosen to ensure both the fastest execution and the lowest transaction fees. This step, known as order routing, is one way for brokers to increase their profits without compromising the level of service to clients. Another way by which brokers enhance profitability is by matching orders from different clients with each other internally, without sending them to the exchange. This internalization follows strict regulations.

Stock exchanges handle a large number of orders from different brokers. Incoming bids are matched with ask prices set by current stock owners, triggering trades. The exchange is one of the few links in the infrastructure where both throughput and latency are critically important. The faster an exchange is able to match orders, the more efficient and attractive it becomes to traders who care about latency. Also, the greater the number of transactions handled by the exchange, the more matching fees it can earn.
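To make the matching step concrete, here is a minimal Python sketch of how an incoming bid might be matched against resting asks. It is an illustration only: the paper does not describe any exchange's matching rules, so the price-time priority ordering, the function name `match_bid`, and the sample prices (in integer cents) are all assumptions.

```python
import heapq

def match_bid(asks, bid_price, bid_qty):
    """Match one incoming bid against resting asks (assumed price-time priority).

    asks is a heap of (price, seq, qty) tuples, so the lowest-priced and then
    earliest-entered ask is always on top. Returns (trades, unfilled_qty),
    where trades is a list of (price, qty) fills.
    """
    trades = []
    while bid_qty > 0 and asks and asks[0][0] <= bid_price:
        price, seq, qty = heapq.heappop(asks)
        fill = min(qty, bid_qty)
        trades.append((price, fill))
        bid_qty -= fill
        if qty > fill:
            # Put the unfilled remainder of the ask back in the book.
            heapq.heappush(asks, (price, seq, qty - fill))
    return trades, bid_qty

# Hypothetical book: three resting asks, prices in cents.
asks = []
for seq, (price, qty) in enumerate([(1002, 300), (1000, 100), (1001, 200)]):
    heapq.heappush(asks, (price, seq, qty))

trades, unfilled = match_bid(asks, bid_price=1001, bid_qty=250)
```

In this sketch the bid first takes the cheapest ask in full and then part of the next one; the remainder of that ask stays at the top of the book.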

978-0-7695-4831-9/12 $26.00 © 2012 IEEE
C. Why Low Latency is Important in HFT

HFT traders attempt to exploit fleeting inefficiencies in the market. By doing so, they not only make money but also make the markets more efficient. However, due to the increase in number of players in the HFT space which try to exploit these opportunities, the first few players to execute orders may be the only ones able to profit from a given opportunity. This is a first reason why latency matters.

Another reason is the phenomenon of slippage which refers to the price of a highly liquid security moving away from its expected price after a market order for that security has been entered. The longer the time between a trader making a decision to trade and the order reaching the execution venue, the greater is the possibility and magnitude of slippage which translates into cost for traders.

Finally, some HFT strategies such as latency arbitrage depend upon the ability to access market data and execute orders faster than other investors. Profiting from this activity requires traders to be able to react in real time to market events using both low-latency and low-jitter market access. Every millisecond reduction in latency can improve arbitrage profitability by more than $100M a year [13].

HFT strategies therefore require low-latency market access both to receive market data as well as to transmit new orders.

D. Other Important Factors

Throughput and flexibility are additional important considerations for HFT. Throughput is becoming more and more critical as HFT inherently involves large volumes of orders arriving at trading venues continuously for processing. In the trade of options and futures, for example, the volume of market data being propagated has increased exponentially. Flexibility is also vital to enable trading platforms to adapt to changing market conditions and trading strategies.

II. SURVEY OF CURRENT PLATFORMS FOR HFT

Most traders and brokers today implement their HFT platforms in software on commodity servers. This allows algorithms to be expressed in familiar high-level programming languages and be re-compiled quickly to make improvements. However, the industry-wide race to reduce latency is making the long and unpredictable response times of software systems increasingly uncompetitive. This section surveys software platforms for HFT, contrasting them with hardware-based approaches offering lower, more predictable latency.

A. Software-based HFT Platforms

Software providers such as Mantara, Ullink and QuantHouse offer customizable trading software packages designed for minimum latency. Much of the remaining latency in software-based trading platforms is due to the computer’s operating system networking stack. To alleviate this, end users can combine trading software with specialized, low-latency network interface cards (for example available from Solarflare and Myricom) that accelerate parts of the networking stack in hardware and bypass the operating system’s kernel. These techniques can significantly reduce overall latency compared to a naïve approach.

Several prior studies have been conducted into the latency of software-based approaches with low-latency Ethernet NICs. Measurements on a Linux server in [11] estimate a half round trip of 15-20µsec for packets through the operating system kernel and network stack. Lobo et al. [9] claim a roughly 40µsec round trip time for un-optimized systems, while transmit and receive latencies of 2.9µs and 6µs were measured when a TCP offload engine was used. It is important to remember that none of these figures takes into account the additional latency of the end-user application. Myricom’s low-latency 10G Ethernet NIC products offer commercial software called Datagram Bypass Layer (DBL), which allows high-priority user threads to send and receive IP frames directly to and from the NIC, bypassing the operating system’s kernel and networking stack. This reduces both latency and jitter, resulting in claimed application-to-application latencies of 3.5µs for UDP and 4.0µs for TCP [6]. A software decoder for the OPRA option pricing feed on an Intel Xeon CPU processed 3 million OPRA messages per second per CPU core with a round-trip latency of 4µs using Myricom’s 10G Ethernet adapters and DBL kernel bypass software [24].

Infiniband-based networking can reduce latency significantly compared to Ethernet. For example, Qlogic’s 7300-series Infiniband NICs claim HPC Message-Passing Interface (MPI) network latencies of as low as 1µs [7]. (This figure excludes latency at the application layer.) Despite Infiniband’s latency advantage, Ethernet is currently more prevalent in the HFT ecosystem. 40Gb/s and 100Gb/s Ethernet devices are already on the market, suggesting Ethernet will continue to dominate for the foreseeable future.

B. Custom Hardware-based HFT Platforms

The comparatively large and unpredictable latency of software-based HFT platforms has led the industry to explore alternative lower-latency approaches using custom hardware. ASICs are typically not considered for HFT as they lack the flexibility to be reconfigured to handle new protocols. GPUs are optimized for throughput and cannot offer sufficiently low latency due to their deep pipelines. Field-Programmable Gate Arrays (FPGAs) offer the performance of custom hardware without compromising on flexibility or latency.

FPGAs can be used in a number of ways to accelerate financial applications. One approach, referred to as Hybrid Computing, is used for example in risk management, option pricing and portfolio modelling and can accelerate performance by three orders of magnitude while reducing energy costs [25]. Hybrid computing blends traditional multi-core CPUs with FPGA-based co-processors for acceleration. CPUs integrate several powerful processing cores and large multi-level caches and are best suited to control-intensive parts of an application. FPGAs perform best at non-floating-point tasks such as integer, binary, character or fixed-point data processing. FPGAs also offer the possibility of tailoring the accelerator exactly to the application for optimal performance and efficiency.

In a typical Hybrid Computer, the CPU is connected using a high-bandwidth interconnect such as FrontSide Bus (FSB), PCI Express or QPI to one or more FPGA modules. High-level programming tools enable applications to be compiled seamlessly to both the CPU and FPGA. Vendors such as
Maxeler Technologies and Convey Computer (amongst others) offer such platforms. Examples of Hybrid Computers specifically targeted at HFT include Nomura’s NXT Direct and Deutsche Bank’s Autobahn Ultra. Both products were developed to accelerate pre-trade SEC regulatory compliance and risk management checks on orders.

Figure 2 Typical Smart NIC in a supervised financial trading application

FPGA-based “Smart NICs” offer another way to apply programmable logic to the acceleration of financial applications. Smart NICs typically bring together high-speed network interfaces, a PCIe host interface, memory and an FPGA. The FPGA implements the NIC controller, acting as the bridge between the host computer and the network and allows user-designed custom processing logic to be integrated directly into the data path. As illustrated in Figure 2, this allows a smart NIC to function as a programmable trading platform under the supervision of a host CPU.

Many vendors offer FPGA-based “smart NICs”. A selection of products targeted specifically at financial trading appears in Table I. The NIC hardware is typically accompanied by programmable logic IP implementing networking-related functions as well as a PCIe host interface, with or without direct memory access (DMA). Some products include a very limited set of finance-specific IP blocks.

Table I SMART NICS FOR LOW-LATENCY FINANCE

Product | Interfaces | Platform IP
DINI group DNPCIe_10G_HXT_LL | 1x10GE SFP+, 1xIB CX4, PCIe Gen2 x4 | Low-level PCIe driver, 10G MAC, TCP offload engine, DDR memory controller, PCIe DMA engine and FIX parser
Advanced IO V5021/V5022 | 4x10GE SFP+, PCIe Gen1 x8 | PCIe driver, expressXG SDK and basic network/host I/O IP blocks
Accelize XPS4S530LP-20G/ XPS4S1050GT-20G/ XPS4S1050GT-10G | 1x/2x10GE SFP/SFP+, PCIe Gen2 x4/x8 | PCIe driver, HCE SDK, 10G MAC, multi-session TCP offload engine, DDR memory controller

For these products, precise latency figures are application-dependent and difficult to obtain. However, Accelize claims their smart NICs can turn around financial trades on the FPGA in under 2µs. This is significantly lower than the minimum latency achievable with software, as discussed earlier.

Within the research community, the original NetFPGA [2][3] was used as base platform for building FPGA-accelerated network processing applications. It was developed by Stanford University in collaboration with Xilinx and other partners. In [4], a latency measurement circuit is built on the NetFPGA to monitor the distribution of the difference in arrival times between market data on a pair of redundant (“A/B”) feeds. An event processor for algorithmic trading on NetFPGA in [10] reduces latency up to two orders of magnitude compared to software. Morris et al. present an FPGA-assisted HFT engine implemented on Celoxica’s AMDC card which improves message throughput by 12 times, while reducing latency [11]. An FPGA-based feed handler for the NASDAQ ITCH protocol achieves a low, predictable end-to-end latency of 2.7µs [26]. A comparable software implementation (with no kernel bypass) exhibits a typical latency of 38µs, varying more than ±20µs.

Finally, some financial processing systems are completely implemented on FPGA-based platforms without any host CPU yielding best possible latencies. As oncoming traffic no longer needs to make a round trip over the system’s PCIe interface for processing in the CPU, the inherent latency and jitter introduced by the host’s operating system are entirely avoided. The approach presented in this paper fits into this category.

III. ADVANTAGES AND DISADVANTAGES OF FPGAS FOR LOW-LATENCY FINANCE

A. Advantages

Modern FPGAs can implement most aspects of any HFT application. Incoming market data can typically be processed completely on the FPGA without needing to travel over and back to a host CPU. Figure 3 compares an FPGA-based platform to a traditional software trading platform.

Figure 3 Software-based (top) and FPGA-based (bottom) trading platforms

In comparison to the FPGA-based trading platform, software-based platforms require network traffic to be received
on a network adapter. Relevant traffic is transferred to memory and the CPU is interrupted to handle the application processing. After this processing is complete, the data is transferred back to the network adapter and transmitted over the network. The interrupt-driven software stack, unpredictable PCIe transfers and cache misses make network latencies higher and less predictable in software implementations. By contrast, in the FPGA implementation, the incoming network data is fed directly into a custom-designed, highly-optimized and application-specific processing pipeline via hardware PHY and MAC blocks.

Furthermore, relevant information within a packet can in fact be extracted before the complete packet is received. The data is directly available within the same clock cycle of its arrival time which is significantly different from any software implementation. (The network stack itself holds the complete packet at least once before it is available for processing.) Directly related to this, it’s important to highlight that there is literally no jitter in extracting and processing incoming data. The availability and the processing time for any piece of information is completely predictable down to a clock cycle. FPGAs can therefore naturally achieve significantly lower latencies with minimal jitter, as shown in Figure 4. This is one of the key advantages that FPGAs offer, as the example in Section V proves.

Figure 4 Throughput vs Latency in Software (red) and Hardware (green)

Along with lower latency and minimal jitter, FPGAs can achieve significantly higher throughput using parallelism. For example, a processing pipeline that implements a financial application can easily operate at 200MHz. Given a 256bit data path, throughput would reach 51Gb/s, corresponding to a message rate of 76Mpps. In addition, multiple pipelines can be instantiated in parallel when data dependencies do not exist. By comparison, multi-core CPUs do not necessarily improve performance due to the effects of Amdahl’s Law. Inter-core and inter-thread signalling overheads are increased, often requiring complex management techniques in the application and in the OS. For example, Mantara advertises a maximum message rate of 10Mpps for their Expressway market data software, which is significantly lower than what can be achieved on an FPGA.

Although in general the low abstraction level in the design entry for FPGAs increases the design complexity, it also brings advantages. The available bit-level access to incoming data greatly simplifies protocol parsing which is a task that is tedious to describe with a standard programming language.

Finally, in contrast to ASICs, FPGAs offer the performance benefits of custom hardware while retaining programmability. Trading algorithms can be continually improved or circuit bugs fixed in the field.

B. Disadvantages

FPGAs also have disadvantages compared to traditional software-based approaches. At the root of this is the greater complexity of the FPGA development flow. Firstly, many developers of financial applications are unfamiliar with FPGA technology generally and lack the expertise to use the hardware-oriented FPGA development tools in particular. Secondly, building and verifying new hardware is more time-consuming than writing new software due to the significantly lower abstraction level in the design flow. However, the FPGA delivers better throughput and lower latency than a software implementation will ever be able to yield in return for this time investment. This is conceptually illustrated in Figure 5.

Figure 5 Latency vs Development Time for Software and FPGA

To overcome this problem, some FPGA-based HFT products include domain-specific high-level programming environments, avoiding the need for designs to be described in hardware description languages (HDLs). Within the research community, a more general “query compiler” has been introduced in which HFT application primitives can be efficiently expressed [22]. We address this problem by providing a library of pre-built FPGA IP blocks for networking and financial protocol parsing. This allows end users to focus on their application without first constructing basic infrastructure. This approach dramatically reduces development time, as is shown in Figure 5 and discussed in the example presented in Section V.

The last stages in the FPGA design process relate to synthesis, placement and routing, which can still take hours for a complex design whereas software designs compile almost instantly in comparison. This however is acceptable as trading strategies are relatively long-lived – a typical algorithm may have a shelf life of several days [19] while an FPGA implementation takes hours in comparison.
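The throughput figures quoted in the Advantages discussion (a 256-bit data path clocked at 200MHz, giving roughly 51Gb/s and 76Mpps) can be checked with a few lines of arithmetic. The sketch below is illustrative only: the paper does not state which packet size underlies the 76Mpps figure, so the minimum-size Ethernet frame (64 bytes plus 20 bytes of preamble and inter-frame gap) is an assumption that happens to reproduce it.

```python
# Figures taken from the paper.
CLOCK_HZ = 200e6        # pipeline clock frequency
DATAPATH_BITS = 256     # data path width in bits

# Assumption (not stated in the paper): minimum-size Ethernet frames.
FRAME_BYTES = 64        # minimum Ethernet frame
OVERHEAD_BYTES = 20     # 8-byte preamble + 12-byte inter-frame gap

# One data word is processed per clock cycle.
throughput_bps = CLOCK_HZ * DATAPATH_BITS
bits_per_packet = (FRAME_BYTES + OVERHEAD_BYTES) * 8
packets_per_second = throughput_bps / bits_per_packet

print(f"throughput   = {throughput_bps / 1e9:.1f} Gb/s")
print(f"message rate = {packets_per_second / 1e6:.1f} Mpps")
```

Under these assumptions the raw data-path bandwidth is 51.2Gb/s and the packet rate is about 76 million per second, which is consistent with the rounded values in the text.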
IV. ALGO-LOGIC’S LOW LATENCY LIBRARY AND FPGA IMPLEMENTATION ON NETFPGA-10G

Algo-Logic’s Low-Latency Library is gateware that is compatible with standard FPGA hardware platforms. One supported platform is the NetFPGA-10G [8] which is a quad-port, 10-Gigabit Ethernet successor to the original NetFPGA card, based on a Xilinx Virtex-5 device.

Algo-Logic’s Low-Latency Library processes financial protocols used by trading venues in the US, Europe and other regions, extracting information from the packets as they flow through the FPGA. The library components are used to construct custom trading applications, reducing time-to-market.

Algo-Logic’s library of low-latency FPGA IP blocks (shown in Figure 6) can be divided into two main categories. A set of Infrastructure Components includes generic IP cores providing interfaces for the network, external memories and host software. Financial Processing Components parse and process standard and stock-exchange-specific protocols. The next sections describe these blocks in more detail.

Figure 6 Algo-Logic’s Low-Latency FPGA IP Blocks

A. Infrastructure Components

All components share a common, standardized interface with a 64-bit (8-byte) data word and protocol conforming to the industry-standard AXI-4 Stream specifications [20].

An SRAM controller IP block drives the on-board Quad-Datarate (QDR) II Static RAM ICs [5]. SRAM offers low-latency, high-performance storage for small amounts of data.

A Register Interface module controls and monitors the status of registers written and read by host software. The Register Interface contains multiple register types, including write-only Configuration Registers, read-only Status Registers, general-purpose read/write registers and write-only registers. Configuration and status update words are transmitted and received via UDP.

The Register Interface can be accessed via a C++ API on a host or controlled interactively via a web-based Graphical User Interface (GUI). The UDP-based interface allows the FPGA to be controlled and monitored from any host on the network.

The TCP/IP Processing Layer separates the headers from the payloads, stores the headers and forwards the payloads to the application layer for further processing.

The Application Layer processes packet payloads and sends them back to the TCP/IP Layer. Packets are reassembled with their corresponding headers and sent out to the Ethernet MAC. A new TCP/IP checksum is computed over the resulting payload after application-level processing is complete.

Finally, a 10G Ethernet MAC IP block from Xilinx provides the interface to the NetFPGA-10G’s on-board discrete PHY ICs. A MicroBlaze soft processor core is used to initialize the on-board PHYs. Note that the soft core is not used in the datapath to process time-critical packets. However, if required, it could be used to run a simple software application which could communicate with the rest of the system through the Register Interface.

B. Financial Processing Components

Algo-Logic’s library includes IP blocks designed to process application layer messages for HFT. These messages consist largely of the orders and order execution reports sent between clients, brokers and exchanges. A set of pre-verified IP blocks understands and translates the exchange protocols, enabling end-users to focus on application development rather than interface components. Each component is discussed in more detail below.

1) Financial Protocol Parser

The Financial Protocol Parser receives a payload from the TCP/IP Processing Layer and identifies message boundaries for the data in the payload. The parser extracts individual fields and raises a flag when the value of each field becomes valid. This ensures that each field is extracted with the lowest possible latency. Algo-Logic’s library currently has parsers for ASCII protocols such as Financial Information eXchange (FIX) and binary protocols such as OUCH for NASDAQ [14], XPRS for DirectEdge [15], ArcaDirect for Arca [16], and Native Trading Gateway for London Stock Exchange (LSE) [17]. Support for BATS BOE [18] will be added shortly.

An example of how the Financial Protocol Parser in Algo-Logic’s IP library parses a message with minimal latency is shown in the waveform in Figure 7. An incoming OUCH message packet is presented to the Financial Protocol Parser 8 bytes per cycle as it streams through the 64-bit AXI-Stream data path. As each field is extracted, the value is transferred to internal I/O pins and a valid flag is raised.

The block extracts the OUCH message type (“O”, or “Enter Order”) and the 14-byte Order Token. The Buy/Sell Indicator field is “T” (“Sell Short”). The number of shares is “1002”, with the stock symbol identified as “FFHL”. The remaining fields specify the asking price and specify that the order is not eligible for intermarket sweep.
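As a software illustration of the streaming behaviour described above, the following Python sketch models a parser that consumes a message 8 bytes per cycle and reports each field on the cycle in which its last byte arrives. It is not the library's hardware implementation. The field offsets and widths (other than the 14-byte Order Token), the token contents and the price value are assumptions made for the example; the message type, Buy/Sell Indicator, share count and stock symbol are the values quoted in the text.

```python
import struct

WORD = 8  # bytes delivered per cycle on a 64-bit data path

# Assumed layout of the leading fields of an "Enter Order" message:
# (name, byte offset, length). Only the 14-byte token length is from the paper.
FIELDS = [
    ("type", 0, 1),
    ("order_token", 1, 14),
    ("buy_sell", 15, 1),
    ("shares", 16, 4),
    ("stock", 20, 8),
    ("price", 28, 4),
]

def decode(name, raw):
    """Decode numeric fields as big-endian integers, the rest as ASCII text."""
    if name in ("shares", "price"):
        return struct.unpack(">I", raw)[0]
    return raw.decode("ascii").rstrip()

def parse_stream(message):
    """Yield (cycle, name, value) as soon as a field's last byte has arrived."""
    buf = bytearray()
    pending = list(FIELDS)
    for cycle, start in enumerate(range(0, len(message), WORD)):
        buf += message[start:start + WORD]  # one 8-byte beat per cycle
        while pending and pending[0][1] + pending[0][2] <= len(buf):
            name, offset, length = pending.pop(0)
            yield cycle, name, decode(name, bytes(buf[offset:offset + length]))

# Hypothetical message: the token and price are invented for illustration.
msg = (b"O" + b"TOKEN000000001" + b"T" + struct.pack(">I", 1002)
       + b"FFHL    " + struct.pack(">I", 123400))

events = list(parse_stream(msg))
for cycle, name, value in events:
    print(cycle, name, value)
```

The point of the model is that fields become valid while the message is still arriving: the message type is available on the first beat, whereas a store-and-forward software parser would wait for the whole packet.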
Figure 7 OUCH Protocol Parsing

The parser adapts automatically to variable-length data fields in the packet. With a 156 MHz clock and a single set of Delay Flip/Flops (DFFs), the circuit extracts all OUCH packet fields within exactly 6.4 nanoseconds (one clock period) of the packet’s arrival with no jitter. This is orders of magnitude faster than a similar extraction in software which not only incurs system-level delays for transporting the packet to the CPU, but also needs to receive the full packet before it can start parsing.

2) Market data parsing and on-chip storage for price data

In financial processing, applications need to know the price of the instrument (security) that appears in the orders. To track prices, market data is fed into the card via UDP/IP datagrams. The FPGA extracts price updates and stores this data to memory for each symbol. Prices for all 8000 securities traded on U.S. Exchanges fit within the FPGA’s on-chip memory. The table size is scalable to support more symbols as required.

Figure 8 Algo-Logic’s HFT Appliance (labelled components: Algo-Logic’s 1U Appliance, NetFPGA 10G, Xilinx V5TX240T, QDR II SRAM, 4x10G SFP+ i/f, ATX Power Supply, Xilinx Platform Cable USB, USB programming interface; connections to Client, Exchange and Control & Logging Host)

C. FPGA implementation on NetFPGA-10G

Algo-Logic has created a 1U rack-mounted HFT Appliance using the Low-Latency Library gateware and the NetFPGA-10G card. The hardware components of the appliance are illustrated in Figure 8. The appliance supports up to four 10G Ethernet interfaces which can be used to connect to the Exchange, a trader/broker and an optional host computer.

The next section illustrates how a high-performance, low-latency financial application can be developed quickly on top of this platform.

V. EXPOSURE AND POSITION TRACKING APPLICATION

A major concern of financial regulators is that traders may put brokers at risk by exposing themselves to large, unbalanced positions. While it is normal that traders maintain both long and short positions in the market, it is critical that the net position remains within a bound so that there is a margin of safety in the holdings. The ability to monitor positions and exposures in real-time can help avert financial disaster.

This section describes how Algo-Logic’s platform was used to track the real-time exposure and positions of multiple trading clients by parsing FIX execution reports. Figure 9 illustrates the system set-up of this application. Algo-Logic’s system acts as a “bump in the wire” between broker and exchange, processing the FIX execution reports coming from the exchange while passing on orders coming from the involved brokers. A third 10GE interface connects the appliance with the control and logging host. This interface is used to read in normalized market data with price updates, to control the appliance, retrieve data, monitor status and log the actions performed by the card.

Figure 9 Exposure and Position-tracking Application built on Algo-Logic’s HFT Platform (Host, Exchange and Client 10GigE links, each through a PHY chip and Ethernet MAC into the FPGA; FPGA blocks: Configuration/Debug registers, Market data update/Price tables, TCP/IP processing and Protocol Parsing, Position, Long and Short Exposure calculation. Total wire-to-wire delay = 1µsec: 400ns (PHY+MAC) + 200ns (Logic processing) + 400ns (PHY+MAC))

A. Exposure/Position Processing Circuit

Figure 10 Block Diagram of Exposure and Position Tracking System (Client, Broker, FPGA and Exchange, with Orders and Execution Reports passing through the FPGA; Market data and Control & Status exchanged with the Control & Logging Host)

All processing for the exposure and position tracking system is implemented entirely on the FPGA. The various components of the processing system are illustrated in Figure 10. All components are part of the low latency library.

The design includes three instances of the 10GE MAC to interface to the host, the broker and the exchange, respectively. The host interacts via the configuration/debug register module and can access the market data update/price table module. Current market prices for each security are received via UDP from the exchange and stored within on-chip memory. These prices are needed to calculate exposure. Traffic from brokers and FIX execution reports from the exchange are passed from the MACs to the TCP/IP processing module. Along with network functions, this module also handles the processing of the financial protocols. For this application, the Financial Information eXchange (FIX) protocol is used. The parser extracts all pertinent FIX fields (“tags”) from the execution report, including Security, Shares Filled, Price and Order Type. These correspond to FIX tags 55, 38, 44 and 54, respectively. The Exposure and Position calculation module receives the traffic from the TCP/IP module and updates the market data/price tables using the procedure described in the following subsection. Finally, the FIX execution reports are forwarded to the clients providing details of order execution.

... exposures, respectively, across all sessions. Looking at this plot, a broker or regulator can instantly view total exposure.

Figure 11 Web-based GUI showing Exposure and Position
B. Position and Exposure Calculations D. Results
The exposure and positions are calculated in the FPGA The total delay from wire to receiving FIX payload from
from the extracted data according to the following formulas: socket, parsing FIX execution reports and calculating these
position = total number of shares (Tag 38) filled by the exchange
values can be significantly reduced by performing all these
operations above within the FPGA. Rather than reading data
If it is a Buy order (Tag 54), then the position is categorized across a memory bus, data is stored in fast on-chip memory and
as “long”. If it is a Sell/Short Sell order, then the position is QDR-II directly attached to the FPGA. This memory is
“short”. The net position per security (FIX Tag 55) is: accessed without using the PCIe bus when an order (from
net position = ( total long positions) – ( total short positions ) client to exchange) enters FPGA and the table lookups and
computation are performed directly in logic.
Similarly, long exposure per security is defined as follows,
where the current market price comes from the price table: The total wire-to-wire latency in this application is 1µsec
with liertally no jitter. The breakup of 1 µsec total delay is
long exposure (security) = ( total long positions ) * current market price 400ns (PHY+MAC, high-speed serial receiver) + 200ns
short exposure (security) = ( total short positions ) * current market price (parsing/calculations) + 400ns (PHY+MAC, high-speed serial
transmitter). With future generations of FPGAs these delays
Exposure (security) across all sessions is defined as the
will further decrease. It is worth noting that the 1µs round-trip
product of net position and current market price. Net and gross
latency through this design is between one and two orders of
market values can be calculated as follows:
magnitude lower than a recent software implementation [21].
Net market value = sum[position/security across all sessions * price]
As for throughput, all the processing on the FPGA is
Gross market value = sum[abs(positions) * price] performed at the full 10Gbps line rate. When using FIX, the
Once the Tags are extracted from the Execution reports and average size of a message is 150 bytes (the exact value depends
prices are read from the price tables, the exposure is calculated of the value of the fields). OUCH messages are smaller, and
and position is updated based on the type of order. range in size between 12 to 82 bytes in length, whereby
multiple messages can appear in a TCP/IP flow. When a single
C. User Interface FIX message appears in a TCP packet, the total size of the
Position and exposure data are periodically forwarded to packet is 204 bytes. The throughput is therefore roughly 6.1M
the control and logging host in UDP packets for visualization FIX messages/second.
on a Graphical User Interface (GUI). The speed at which the In regards to development, it is important to note that the
screen is updated is limited only by the host’s ability to capture application could be completely assembled out of existing
logs and process them. The host updates its database and library components. The Low-Latency Library components
populates positions and exposures on a web-based GUI as were simply connected together to form the infrastructure,
shown in Figure 11. Exposure is plotted in the time domain. including FIX parsing. Secondly, the integration of the
The table on the left shows positions per security and dollar hardware was completed within a month. Most importantly, the
values based on current market price per security. Green Low-Latency Library and hardware platform were pre-verified,
indicates long positions and red indicates short positions. The greatly simplifying the task of verifying the final design.
table on the top-right visualizes long exposure per session, Overall, the IP library reduced development time considerably.
short exposure per session, net market value per session and
gross market value per session. Dollar values are calculated VI. CONCLUSIONS
based on latest prices received from the market. Finally, the In the race to minimize latency in high frequency trading,
graph in the bottom-right corner illustrates all exposures. The traders, brokers and exchanges are exploring a range of
green and red lines represent the sum of long and short technologies to build platforms, monitor and manage risk and
improve the efficiency of trading.

Software approaches that utilize CPUs with high-performance network interface cards provide part of the solution. Through optimized bus logic and device driver software, they reduce the time required to transfer a message from the network to the CPU and back to the network to just a few microseconds. Smart NICs go one step further by offloading some of the processing to a Field Programmable Gate Array (FPGA) on the network adapter.

The advantage of developing financial applications in software is a typically short development time. A challenge with software is that it is hard to achieve low end-to-end latency and jitter. A custom-designed datapath in an FPGA circuit offers the benefits of minimal, deterministic processing times down to clock cycles. However, the benefits of FPGAs have traditionally come at the cost of a longer development time. The contribution of this paper is to provide a pre-built gateware library that allows financial applications to be built on standard FPGA cards (such as the NetFPGA-10G) without the burden of an extended development period.

Algo-Logic’s Low-Latency Library described in this paper includes infrastructure components and domain-specific gateware that extract data from market data streams, including FIX, OUCH, XPRS, ArcaDirect, LSE and BATS BOE. For order processing systems that also need to make decisions based on price, Algo-Logic’s gateware includes a module that receives normalized market data via UDP and stores current prices using on-chip memory so that logic can use these values during order processing.

To demonstrate the utility of this gateware library, a complete application has been developed which tracks the exposure and positions of multiple traders as orders are transferred to and from the stock exchange. Incoming FIX messages are decoded, analyzed and forwarded onward in real time with a stable end-to-end latency of 1µs. This is up to two orders of magnitude lower than software approaches achieve.

REFERENCES

[1] P. Gomber, B. Arndt, M. Lutat and T. Uhle, “High Frequency Trading” (report commissioned by Deutsche Börse Group), Goethe Universität Frankfurt am Main, March 2011, http://www.frankfurt-main-finance.de/de/finanzplatz/daten-studien/studien/High-Frequency-Trading.pdf
[2] J.W. Lockwood, N. McKeown, G. Watson, G. Gibb, P. Hartke, J. Naous, R. Raghuraman and Jianying Luo, “NetFPGA - An Open Platform for Gigabit-rate Network Switching and Routing”, Microelectronic Systems Education, 2007. MSE '07. IEEE International Conference on, pp. 160-161, 3-4 June 2007
[3] G. Gibb, J.W. Lockwood, J. Naous, P. Hartke and N. McKeown, “NetFPGA - An Open Platform for Teaching How to Build Gigabit-Rate Network Switches and Routers”, Education, IEEE Transactions on, vol. 51, no. 3, pp. 364-369, Aug. 2008
[4] A. Gupte and J.W. Lockwood, “Precise Latency Comparison Module for the NetFPGA”, workshop at the 2010 North American NetFPGA Developers’ Workshop, http://netfpga.org/tutorials/WorkshopNorthAmerican2010/pdf/NetFPGA_Dev_2010_Precise_Latency_Comparison_Module.pdf
[5] Cypress Quad Data Rate (QDR-II) Static Random Access Memory (SRAM), http://www.cypress.com/?id=107
[6] Myricom DBL (Datagram Bypass Layer) Documentation, http://myri.org/dbl.html
[7] Datasheet for the QLogic 7300-series Network Adapters, http://www.qlogic.com/Resources/Documents/DataSheets/Adapters/DataSheet_QLE7300Series.pdf
[8] M. Blott, J. Ellithorpe, N. McKeown, K. Vissers and H. Zeng, “FPGA Research Design Platform Fuels Network Advances”, Xilinx Xcell Journal, Issue 73, September 2010
[9] N. Lobo, V. Malik, C. Donnally, S. Jahne and H. Jhaveri, “Evaluating the Latency Impact of IPv6 on a High Frequency Trading System”, University of Colorado, May 2012
[10] M. Sadoghi, M. Labrecque, H. Singh, W. Shum and H.A. Jacobsen, “Efficient Event Processing through Reconfigurable Hardware for Algorithmic Trading”, Journal Proceedings of the VLDB Endowment, vol. 3, no. 1-2, pp. 1525-1528, September 2010
[11] G.W. Morris, D.B. Thomas and W. Luk, “FPGA Accelerated Low-Latency Market Data Feed Processing”, High Performance Interconnects, 2009 (HOTI 2009). 17th IEEE Symposium on, pp. 83-89, 25-27 Aug. 2009
[12] Report by the Aite Group, “A New World Order: The High Frequency Trading Community and Its Impact On Market Structure”, 2009
[13] R. Martin, “Wall Street’s Quest to Process Data at the Speed of Light”, Information Week, 21st April 2007
[14] O*U*C*H 4.2 Protocol Specification, NASDAQ Inc., http://www.nasdaqtrader.com/content/technicalsupport/specifications/TradingProducts/OUCH4.2.pdf
[15] XPRS 1.24 Protocol Specification, Direct Edge, http://www.directedge.com/Portals/0/docs/Connect/Direct%20Edge%20XPRS%20API%20Manual%20V%201.24.pdf
[16] ARCADirect 4.0 Protocol Specification, NYSE Arca, http://www.nyse.com/pdfs/ArcaDirectSpecVersion4_0.pdf
[17] Native Trading Gateway 10.1 Protocol Specification, London Stock Exchange (LSE), http://www.londonstockexchange.com/products-and-services/millennium-exchange/millennium-exchange-migration/mit203v101.pdf
[18] BATS US Options BOE Specification 1.5.2, BATS Global Markets Inc., http://www.batsoptions.com/resources/membership/BATS_US_Options_BOE_Specification.pdf
[19] R. Iati, “The Real Story of Trading Software Espionage”, TABB Group Perspective, July 10th, 2009
[20] AMBA AXI4-Stream Protocol Specification, ARM Ltd., http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ihi0051a/index.html
[21] OnX Enterprise Solutions, “Report: High Performance Trading – FIX Messaging; Testing for Low Latency”, Jan 2012, http://fixglobal.com/content/high-performance-trading-fix-messaging-testing-low-latency
[22] R. Mueller, J. Teubner and G. Alonso, “Streams on wires: a query compiler for FPGAs”, Proc. VLDB Endowment, vol. 2, no. 1, Aug 2009, pp. 229-240
[23] M. Kearns, A. Kulesza and Y. Nevmyvaka, “Empirical Limitations on High Frequency Trading Profitability”, Working Paper Series, Social Science Research Network (SSRN), September 17, 2010
[24] H. Subramoni, F. Petrini, V. Agarwal, D. Pasetto, “Streaming, Low-latency Communication in On-line Trading Systems”, IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum (IPDPSW), 2010, pp. 1-8, 19-23 April 2010
[25] C. Starke, V. Grossman, L. Wienbrandt, M. Schimmler, “An FPGA Implementation of an Investment Strategy Processor”, Procedia Computer Science, Volume 9, 2012, Pages 1880-1889
[26] R. Pottahuparambil, J. Coyne, J. Allred, W. Lynch and V. Natoli, “Low-latency FPGA Based Financial Data Feed Handler”, 2011 IEEE 19th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM), pp. 93-96, 1-3 May 2011
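The position and exposure formulas of Section V.B can be illustrated with a short software model. The paper performs these steps in FPGA logic, so the Python below is only a sketch under stated assumptions and not the authors' implementation. The names `parse_fix` and `PositionTracker` are hypothetical, the Side values "1" (Buy), "2" (Sell) and "5" (Sell short) are taken from the FIX specification, and gross market value is read here as the sum over securities of the absolute net position times price.

```python
from collections import defaultdict

SOH = "\x01"  # standard FIX field delimiter

# FIX Side (tag 54) values: "1" = Buy, "2" = Sell, "5" = Sell short
BUY_SIDES = {"1"}
SELL_SIDES = {"2", "5"}


def parse_fix(message):
    """Split a tag=value FIX message into a {tag number: value string} dict."""
    fields = {}
    for field in message.strip(SOH).split(SOH):
        tag, _, value = field.partition("=")
        fields[int(tag)] = value
    return fields


class PositionTracker:
    """Hypothetical software model of the exposure/position calculation."""

    def __init__(self):
        self.long_shares = defaultdict(int)   # total long position per security
        self.short_shares = defaultdict(int)  # total short position per security
        self.prices = {}                      # price table: security -> market price

    def update_price(self, security, price):
        """Store the current market price, as the price table module does."""
        self.prices[security] = price

    def on_execution_report(self, message):
        """Update positions from tags 55 (security), 38 (shares), 54 (side)."""
        fields = parse_fix(message)
        security = fields[55]
        shares = int(fields[38])
        side = fields[54]
        if side in BUY_SIDES:
            self.long_shares[security] += shares
        elif side in SELL_SIDES:
            self.short_shares[security] += shares
        else:
            raise ValueError(f"unsupported side: {side}")

    def net_position(self, security):
        return self.long_shares[security] - self.short_shares[security]

    def long_exposure(self, security):
        return self.long_shares[security] * self.prices[security]

    def short_exposure(self, security):
        return self.short_shares[security] * self.prices[security]

    def net_market_value(self):
        return sum(self.net_position(s) * p for s, p in self.prices.items())

    def gross_market_value(self):
        return sum(abs(self.net_position(s)) * p for s, p in self.prices.items())
```

The fill price (tag 44) is parsed but not used, because the formulas in Section V.B value positions at the current market price from the price table. A tracker would be fed one execution report at a time, for example `tracker.on_execution_report("35=8\x0155=ABC\x0154=1\x0138=100\x0144=10.0\x01")` after `tracker.update_price("ABC", 10.0)`.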

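The latency and throughput figures quoted in Section V.D can be checked with simple arithmetic. The sketch below uses only numbers stated in the paper (a 10Gbps line rate, a 204-byte packet carrying one FIX message, and the 400ns + 200ns + 400ns latency breakdown). The function and constant names are hypothetical, and the calculation assumes that the Ethernet preamble and inter-frame gap are not counted, which is what reproduces the quoted figure.

```python
LINE_RATE_BPS = 10_000_000_000  # 10Gbps line rate
FIX_PACKET_BYTES = 204          # packet carrying a single FIX message

# Latency breakdown in nanoseconds, as listed in Section V.D
LATENCY_BUDGET_NS = {
    "receive (PHY+MAC, high-speed serial receiver)": 400,
    "parsing/calculations": 200,
    "transmit (PHY+MAC, high-speed serial transmitter)": 400,
}


def total_latency_ns(budget):
    """Sum the per-stage delays to get the wire-to-wire latency."""
    return sum(budget.values())


def messages_per_second(line_rate_bps, packet_bytes):
    """Packets per second at line rate, ignoring preamble and inter-frame gap."""
    return line_rate_bps / (packet_bytes * 8)
```

With these inputs, `total_latency_ns(LATENCY_BUDGET_NS)` gives 1000 ns, and `messages_per_second(LINE_RATE_BPS, FIX_PACKET_BYTES)` gives about 6.13 million messages per second, in line with the roughly 6.1M reported.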
