
ISSUE 59, FOURTH QUARTER 2006

Xcell journal

THE AUTHORITATIVE JOURNAL FOR PROGRAMMABLE LOGIC USERS
XILINX, INC.

Virtex-5
Special
Edition

INSIDE

Achieve Higher Performance with Virtex-5 FPGAs

HDL Coding and Design Practices for Improving Virtex-5 Utilization, Performance, and Power

A Multi-Gigabit Transceiver for the Masses

Introducing the Virtex-5 PCI Express Endpoint Block

Meeting Memory Interface Design Challenges with Virtex-5 FPGAs

www.xilinx.com/xcell/

Support Across The Board.

Design Kits Fuel Feature-Rich Applications

Silica designs, manufactures, sells and supports a wide variety of hardware evaluation, development and reference design kits for developers looking to get a quick start on a new project.

With a focus on embedded processing, communications and networking applications, this growing set of modular hardware kits allows users to evaluate, experiment, benchmark, prototype, test and even deploy complete designs for field trial.

By providing a stable hardware platform that enhances system development, design kits from Silica help original equipment manufacturers (OEMs) bring differentiated products to market quickly and in the most cost-efficient way possible.

For a complete listing of available boards, visit www.silica.com

Build your own system by mixing and matching:
• Processors
• FPGAs
• Memory
• Networking
• Audio
• Video
• Mass storage
• Bus interface
• High-speed serial interface

Available add-ons:
• Software
• Firmware
• Drivers
• Third-party development tools

SILICA | The Engineers of Distribution

www.silica.com

© Avnet, Inc. 2006. All rights reserved. AVNET is a registered trademark of Avnet, Inc.
Ultimate Performance…

Achieve highest system speed and better design margin with the world’s first 65nm FPGAs.

Virtex™-5 FPGAs feature ExpressFabric™ technology on 65nm triple-oxide process. This new fabric offers the industry’s first LUT with six independent inputs for fewer logic levels, and advanced diagonal interconnect to enable the shortest, fastest routing. Now you can achieve 30% higher performance, while reducing dynamic power by 35% and area by 45% compared to previous generations.

Design systems faster than ever before

Shipping now, Virtex-5 LX is the first of four platforms optimized for logic, DSP, processing, and serial. The LX platform offers 330,000 logic cells and 1,200 user I/Os, plus hardened 550 MHz IP blocks. Build deeper FIFOs with 36 Kbit block RAMs. Achieve 1.25 Gbps on all I/Os without restrictions, and make reliable memory interfacing easier with enhanced ChipSync™ technology. Solve SI challenges and simplify PCB layout with our sparse chevron packaging. And enable greater DSP precision and dynamic range with 550 MHz, 25x18 MACs.

Visit www.xilinx.com/virtex5, view the TechOnline webcast, and give your next design the ultimate in performance.

[Chart: performance of Virtex-5 FPGAs, Virtex-4 FPGAs, and the nearest competitor (industry’s fastest 90nm FPGA benchmark) in five categories: logic fabric performance, on-chip RAM (550 MHz), DSP 32-tap filter (550 MHz), I/O LVDS bandwidth (750 Gbps), and I/O memory bandwidth (384 Gbps). Bars are labeled 1.3x, 1.4x, 1.6x, 2.4x, and 4.4x. Numbers show comparison with nearest competitor, based on competitor’s published datasheet numbers.]

The Programmable Logic Company SM

www.xilinx.com/virtex5

The Ultimate System Integration Platform

©2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.
LETTER FROM THE PUBLISHER

Welcome to this special edition of Xcell Journal, featuring a broad array of articles on Xilinx® Virtex™-5 FPGAs. In this issue you’ll find executive and industry viewpoints; articles on engineering solutions, design challenges, tools, customer successes, and vertical markets; and a technical reference section covering application notes, boards, and IP.

As exciting as this is, I’d also like to let you know about a couple of announcements from Xcell Publications.

Xcell Publications Honored with APEX 2006 Award of Excellence

Xcell Publications was recently awarded the APEX 2006 Award of Excellence in two categories – magazine and journal design and layout and custom-published magazines and journals – for two of its flagship Xcell Publications, Xcell Journal and I/O Magazine.

APEX 2006 – the 18th Annual Awards for Publication Excellence – is an international competition that recognizes outstanding publications, including newsletters, magazines, annual reports, brochures, and websites. According to APEX judges, this year’s competition was exceptionally intense, with nearly 5,000 entries. Awards were granted based on excellence in graphic design, quality of editorial content, and the success of the entry in conveying the message and achieving overall communications effectiveness.

“We’re honored that Xcell magazines have been selected for excellence in publishing among such a stellar list of companies by the APEX panel of judges,” said Sandeep Vij, vice president of worldwide marketing at Xilinx. “Over the past 18 years, our custom publications have served as a foundational tool, delivering ‘how-to’ information to a growing base of engineers using Xilinx programmable chips to design a wide variety of electronic systems, ranging from the Mars Rover to high-volume consumer handsets, flat-panel displays and automotive infotainment systems. Being ranked among the industry’s best underscores the value and quality of our company’s portfolio of custom magazines.”

Xilinx joins a prestigious list of award-winning companies from a variety of industries in the APEX competition for custom-published magazines and journals, including Blue Cross Blue Shield, CMP Media/Digital Connect, DaimlerChrysler, IBM Journal of Research and Development, Mac Publishing, National Football League, National Foundation for Advancement in the Arts, Penton Custom Media, and Time Inc. Strategic Communications.

New Digital Editions Available

We now offer digital editions of our magazines. Now you can subscribe for free to the new Xcell Journal Digital, requiring no software downloads and visible on any standard Internet browser. This updated publishing technology lets you browse, search, make notes, e-mail authors, and click through to advertisers’ websites.

To receive Xcell Journal Digital, you have to subscribe. In addition to Xcell Journal, we also now offer digital subscriptions of all of our magazines. Please visit our website at www.xilinx.com/xcell and click on “Subscriber Services.”

I hope you enjoy reading this issue.

Forrest Couch
Publisher

Xcell journal

PUBLISHER: Forrest Couch, forrest.couch@xilinx.com, 408-879-5270
EDITOR: Charmaine Cooper Hussain
ART DIRECTOR: Scott Blair
DESIGN/PRODUCTION: Teie, Gelwicks & Associates, 1-800-493-5551
ADVERTISING SALES: Dan Teie, 1-800-493-5551
TECHNICAL COORDINATOR: Greg Lara
INTERNATIONAL: Dickson Seow, Asia Pacific, dickson.seow@xilinx.com
Andrea Barnard, Europe/Middle East/Africa, andrea.barnard@xilinx.com
Yumi Homura, Japan, yumi.homura@xilinx.com
SUBSCRIPTIONS: All Inquiries, www.xcellpublications.com
REPRINT ORDERS: 1-800-493-5551

www.xilinx.com/xcell/

Xilinx, Inc.
2100 Logic Drive
San Jose, CA 95124-3400
Phone: 408-559-7778
FAX: 408-879-4780
www.xilinx.com/xcell/

© 2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. PowerPC is a trademark of IBM, Inc. All other trademarks are the property of their respective owners.

The articles, information, and other materials included in this issue are provided solely for the convenience of our readers. Xilinx makes no warranties, express, implied, statutory, or otherwise, and accepts no liability with respect to any such articles, information, or other materials or their use, and any use thereof is solely at the risk of the user. Any person or entity using such information in any way releases and waives any claim it might have against Xilinx for any loss, damage, or expense caused thereby.
ON THE COVER

VIEWPOINT
8 Introducing the Virtex-5 FPGA Family
The first 65-nm advanced FPGAs raise the bar in performance, power efficiency, capacity, and value.

PERFORMANCE
16 Achieve Higher Performance with Virtex-5 FPGAs
New architectural elements can help you attain higher system-level performance.

PERFORMANCE
19 HDL Coding and Design Practices for Improving Virtex-5 Utilization, Performance, and Power
These tips and techniques can lead to better Virtex-5 designs.

SERIAL CONNECTIVITY
42 A Multi-Gigabit Transceiver for the Masses
The Virtex-5 GTP transceiver brings versatility, ease of use, power efficiency, and cost-effectiveness to high-volume mainstream applications.

SERIAL CONNECTIVITY
45 Introducing the Virtex-5 PCI Express Endpoint Block
With PCI Express quickly becoming the standard high-bandwidth interconnect, the Virtex-5 LXT PCIe Endpoint block enables a configurable single-chip solution.

MEMORY INTERFACES
73 Meeting Memory Interface Design Challenges with Virtex-5 FPGAs
Virtex-5 devices support the latest generation of high-speed memory interfaces.
Xcell journal
FOURTH QUARTER 2006, ISSUE 59

VIEWPOINTS
Introducing the Virtex-5 FPGA Family ....................................................................................8
Serial Everywhere – The Triple-Play Challenge .....................................................................12
Virtex-5 Serial Connectivity Solutions .................................................................................13
FPGAs for Serial Interconnections .......................................................................................14

PERFORMANCE
Achieve Higher Performance with Virtex-5 FPGAs .................................................................16
HDL Coding and Design Practices for Improving Virtex-5 Utilization, Performance, and Power......19
Getting the Best Results from Virtex-5 FPGAs ......................................................................23
Maximizing Design Performance for Virtex-5 FPGAs ..............................................................28
Clock Management in Virtex-5 Devices ...............................................................................31

POWER
Reduce Power with Virtex-5 FPGAs ....................................................................................33
Applying Compact Thermal Models .....................................................................................38

SERIAL CONNECTIVITY
A Multi-Gigabit Transceiver for the Masses............................................................................42
Introducing the Virtex-5 PCI Express Endpoint Block ..............................................................45
PCI Express Markets, Trends, and Applications ......................................................................49
Designing with Virtex-5 Embedded Tri-Mode Ethernet MACs ....................................................54
Asynchronous Sample-Rate Conversion Between AES Audio Streams........................................57
Implementing Integrated Video Connectivity Solutions with Virtex-5 LXT Devices .......................61
Enhancing System Management and Diagnostics with the Virtex-5 System Monitor ...................64
Real-Time Debugging for Virtex-5 FPGAs ..............................................................................68

MEMORY INTERFACES
Memories are Made of This... ...........................................................................................70
Meeting Memory Interface Design Challenges with Virtex-5 FPGAs ..........................................73
Implementing Memory Controllers Using the Memory Interface Generator Tool..........................76
Micron Memory Interface ..................................................................................................79
Designing Virtex-5 DDR2 Memory Interfaces for Signal Integrity..............................................83

SOURCE-SYNCHRONOUS INTERFACES
Improve System Reliability with SPI-4.2 LogiCORE Solutions and Virtex-5 FPGAs .......................87

VERTICAL MARKET SOLUTIONS


Using Virtex-5 FPGAs in COTS Board-Level Products ...............................................................90
Tackling Serial Backplane Interface Design Challenges ...........................................................97
Enabling Multi-Port 1 Gbps and 10 Gbps TCP/iSCSI Protocol Offload Solutions........................101
Implementing Encryption Algorithms with the Virtex-5 LXT Platform.......................................103

GENERAL
Virtex-5 Configuration Options Offer Designers a Choice.......................................................107
Introducing Virtex-5 EasyPath FPGAs .................................................................................108

REFERENCE
Connectivity Solutions .....................................................................................................110
Intellectual Property Offerings ..........................................................................................111
Virtex-5 Boards and Kits..................................................................................................112
VIEWPOINT

Introducing the
Virtex-5 FPGA Family
The first 65-nm advanced FPGAs raise the bar in performance, power efficiency, capacity, and value.

by Steve Douglass
Vice President
Product Development,
Advanced Product Division
Xilinx, Inc.
stephen.douglass@xilinx.com

Welcome to the Virtex™-5 issue of Xcell Journal. The Xilinx® Virtex-5 family is not only the industry’s first 65-nm FPGA – it also offers some of the most advanced architecture and highest performance in the world. Continuing our history of developing groundbreaking technology, we listened to leading design engineers in various markets and built on key characteristics that made our Virtex-4 FPGA family a tremendous success:
• Higher performance
• Higher logic density
• Lower power consumption
• More advanced features

The fundamental value propositions of FPGAs include faster time to market, versatility, support for evolving standards, risk mitigation, field upgradability, and lower system costs. Our FPGAs accommodate your demands for continued improvements in performance, capacity, power consumption, and cost.
8 Xcell Journal Fourth Quarter 2006
The Virtex-5 family combines the inherent advantage of state-of-the-art 65-nm process technology with an innovative design that is based on a deeper understanding of the applications our products serve. In this article, I’ll provide an overview of the new features in Virtex-5 devices, explain the underlying technology, and offer a glimpse of the design decisions that led to our world-leading FPGA architecture.

Process Technology and Architectural Innovations

Virtex-5 FPGAs are built on 65-nm triple-oxide technology using our Advanced Silicon Modular Block (ASMBL™) architecture and providing additional levels of system integration. This new family offers an advanced platform that meets the growing need for programmable systems with higher performance, higher density, lower power consumption, and lower overall system cost.

It might be easy to deliver on one or two of these items, but our challenge was to deliver all of them at the same time.

We successfully met those challenges through a combination of advanced IC process development and innovative architecture and circuit design. Introduced in the Virtex-4 family, our proven ASMBL chip layout architecture allows us to provide the optimal mix of required device resources (logic, memory, arithmetic, I/O, and IP), thus creating the ideal combination for four new platforms:
• The LX platform, optimized for high-performance logic
• The LXT platform, optimized for high-performance logic with low-power serial I/O
• The SXT platform, optimized for high-performance arithmetic- and memory-intensive DSP with low-power serial I/O
• The FXT platform, optimized for embedded processing and very high-speed serial I/O

Compared to our Virtex-4 family, Virtex-5 devices offer 30 percent higher average speed and 65 percent higher capacity in the largest device. Dynamic power consumption is reduced by 35 percent and chip area is 45 percent smaller, resulting in a lower cost per function.

Higher Performance and Density

ExpressFabric™ technology implements logic and local interconnect routing. It incorporates look-up tables (LUTs) with six independent inputs, plus a new diagonal interconnect structure, as illustrated in Figure 1. ExpressFabric technology implements combinatorial logic in fewer LUT levels and uses fewer concatenated connections to neighboring building blocks, as compared to the Virtex-4 architecture. This reduces datapath delays and thus increases design performance.

Figure 1 – Virtex-5 ExpressFabric technology
[Figure: a Virtex-5 slice containing four 6-LUTs, and the Virtex-4 and Virtex-5 interconnect capability, showing the CLBs reached directly and in 1, 2, and 3 hops.]

Advanced 6-LUT Logic Structure

For many years, four-input LUTs were the industry standard. However, at 65 nm, the regular structure of the LUT can be shrunk even more than the remaining circuitry (notably, the interconnect). A six-input LUT (6-LUT) with four times more bits thus increases the CLB area by only 15% – but packs, on average, 40% more logic into each LUT. This higher logic density often reduces the number of cascaded LUTs and can improve the critical path delay, as shown in Figure 2.

Figure 2 – Optimal performance/area trade-off
[Plot: "Making the Right Trade-Off", critical path delay and used die area versus number of LUT inputs (2 to 8), from an architectural evaluation of typical designs.]
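The 6-LUT numbers quoted above can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes a k-input LUT stores one bit per input combination, and it treats the article's 15% area and 40% logic figures as applying to the same 4-LUT baseline. The resulting "logic per unit CLB area" ratio is a back-of-the-envelope derivation, not a figure stated in the article, and the function and constant names are hypothetical.

```python
def lut_config_bits(num_inputs: int) -> int:
    """A k-input LUT stores one output bit per input combination: 2**k bits."""
    return 2 ** num_inputs


bits_4lut = lut_config_bits(4)      # 16 bits
bits_6lut = lut_config_bits(6)      # 64 bits
bit_ratio = bits_6lut // bits_4lut  # 4, the "four times more bits"

# Figures quoted in the article, relative to a 4-input LUT architecture.
CLB_AREA_GROWTH = 0.15       # CLB area grows by 15%
LOGIC_PER_LUT_GROWTH = 0.40  # each LUT packs 40% more logic on average

# Derived (assumption: both percentages share the same baseline).
logic_per_area = (1 + LOGIC_PER_LUT_GROWTH) / (1 + CLB_AREA_GROWTH)

print(f"4-LUT bits: {bits_4lut}, 6-LUT bits: {bits_6lut}, ratio: {bit_ratio}x")
print(f"Logic per unit CLB area vs. 4-LUT baseline: {logic_per_area:.3f}x")
```

Under these assumptions, four times the storage costs only 15% more CLB area, which works out to roughly 22% more logic per unit of CLB area.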
We took a suite of customer designs and implemented them using ISE™ 8.1i software. For each design, we compared the number of LUTs used with Virtex-4 and Virtex-5 device implementations and correlated this information with the performance increase in megahertz. The scatterplot graph in Figure 3 shows the percentage of performance improvement on the X axis and percentage area savings in terms of LUT count reduction on the Y axis. The new 6-LUT ExpressFabric technology provides a win-win solution in both performance gain and resource savings.

Figure 3 – Virtex-5 FPGA versus Virtex-4 FPGA design suite benchmarks
[Scatterplot: LUT count reduction (%) versus performance increase (%). Most designs show more than 20% improvement in both size and performance.]

Unlike competing FPGAs, Virtex-5 FPGAs provide real 6-LUTs that you can use as logic or as distributed memories, where a LUT can be a 64-bit distributed RAM (even dual- or quad-ported) or a 32-bit programmable shift register. Each LUT can have two outputs, thus implementing two logic functions of five variables, storing 32 x 2 RAM bits, or acting as a 16 x 2-bit shift register.

New Diagonally Symmetric Interconnect

A new diagonally symmetric interconnect pattern enhances performance by reaching more places in fewer routing hops. A comparison between Virtex-5 and Virtex-4 FPGA interconnect patterns (with each box representing a CLB) is illustrated in Figure 1. The color codes show that with the Virtex-5 FPGA, the pattern is more symmetric, with more CLBs reached in fewer hops. The symmetry thus achieves better results from place and route software tools.

These features are transparent to Virtex-5 FPGA users and are automatically exercised by ISE software, resulting in easier routability and higher overall performance.

Lowest Power Advanced FPGA Solution

The Virtex-5 device family uses our advanced 65-nm, triple-oxide, 11-layer copper CMOS process technology. “Triple oxide” refers to the number of different transistor gate-oxide thicknesses used. The I/O transistors must be 3.3V tolerant and use relatively thick oxide, but the very fast transistors used for logic and other core functions use very thin oxide.

Unfortunately, very thin oxide and very low threshold voltage unavoidably cause high leakage current. There are, however, many transistors in an FPGA that need not be very fast (notably the configuration storage cells). Starting with the Virtex-4 family, Xilinx pioneered a third, intermediate gate thickness for those transistors. This triple-oxide approach allows us to fine-tune the performance and power in the device circuitry; it enables Virtex-5 devices to deliver industry-leading performance while dramatically lowering leakage current and thus static power consumption.

Additionally, the new 6-LUT logic structure combines more logic per LUT, uses fewer local interconnect nodes, and fewer high capacitance nodes between logic functions, reducing the levels of logic and thus the path delay. The new symmetric routing also uses more direct connects between adjacent logic, again lowering routing capacitance.

VCCINT, the core supply voltage, is now 1.0V. All of these factors contribute to a reduction in overall dynamic power consumption. With the success of the Virtex-4 family, we know that many engineers view performance and power consumption as two equally important constraints in their system designs; therefore, we need to offer both high performance and low power.

We completely reengineered the Virtex-5 logic fabric to fully take advantage of the 65-nm triple-oxide CMOS process, resulting in the highest performance fabric ever, with system clock rates in excess of 550 MHz. At the same time, static power is comparable to that of the 90-nm Virtex-4 devices, while dynamic power has been reduced by at least 35%. Just like its predecessor, the Virtex-5 family again provides the lowest power solution of any advanced FPGA family.

Advanced Features for System Integration

In the Virtex-5 family, we have added a phase-locked loop (PLL) to each clock management tile (CMT), which now contains two digital clock managers (DCMs) and one PLL. The CMT thus offers the best of both worlds: the robust versatility and precise incremental phase shift capability of a digital clock manager combined with the jitter reduction from the analog PLL. The largest device in the family has six CMTs capable of generating and manipulating 550-MHz clocks, supporting the performance of Virtex-5 logic and block functions.

Synchronous dual-ported block RAM is an important function. The size of each block RAM has been increased to 36 Kb, but you can also use it as two independent 18-Kb block RAMs. The data bus width is programmable from 1 bit to 36 bits. In simple dual port mode (one port write, one port read) the data bus width can be as high as 72 bits, effectively doubling the data bandwidth. You can turn off unused 18-Kb blocks to save power.

The block RAM has integrated FIFO control logic, simplifying the design of


asynchronous (or synchronous) FIFOs running as fast as 550 MHz without consuming any logic resources.

The 72-bit-wide block RAM now includes 64-bit error checking and correction (ECC) control logic. Like the integrated FIFO support, the integrated ECC improves memory performance and eliminates the cost associated with traditional fabric-based solutions. You can also use the dedicated ECC logic to augment external memory interfaces.

Interfacing to external devices and especially external memory such as DDR, DDR2, QDR II, and RLDRAM II is dramatically enhanced and simplified by our new ChipSync™ technology. A memory development system (ML561) based on our LX50T devices contains fully functional and hardware-proven reference designs for all of today’s most popular memory technologies.

In the DSP domain, we are now providing 25 x 18-bit multipliers, mainly for more efficient floating-point designs. These DSP48E slices can be directly cascaded for higher performance in digital filtering or video broadcast applications. Direct cascading also saves power – as much as 40% compared to competing solutions.

Virtex-5 SelectIO™ technology continues to lead the industry. Every pin supports virtually every I/O standard in use today and offers up to 1.25 Gbps LVDS and 800 Mbps single-ended I/O performance.

Beyond the IDELAY option, which offers programmable input delay in steps of 75 ps, the new ODELAY option now offers the same fine granularity at the FPGA output. Either of these functions is individually programmable on every device pin.

The IODELAY function is an important feature to enhance reliable transmit and receive of high-speed source-synchronous data and clocks. The intended application includes compensation for board-level skews, bit alignment in a bus, and alignment between data and clock signals. This enables LVDS I/Os to achieve speeds as fast as 1.25 Gbps per pin pair.

Virtex-5 LXT, SXT, and FXT devices also offer embedded serial transceivers – as many as 24 in the largest LXT device. In designing our fourth-generation RocketIO™ technology of high-speed serial transceivers, we invested significant engineering effort to lower power consumption. At the top speed of 3.2 Gbps, the LXT RocketIO transceiver consumes typically less than 100 mW, making it the lowest power transceiver in any FPGA product (see Figure 4).

Figure 4 – RocketIO GTP transceiver
[Block diagram: the TX-PMA, TX-PCS, RX-PMA, and RX-PCS blocks between the serial TX/RX pins and the FPGA fabric, including PMA PLL dividers, parallel-to-serial and serial-to-parallel conversion, pre-emphasis driver, equalizer and CDR, RX-OOB, oversampling, polarity, phase-adjust FIFO, 8B/10B, comma detect and align, elastic buffer, PRBS generator and checker, loss of sync, RX status control, and TX/RX PIPE control.]

Each Virtex-5 LXT RocketIO transceiver is programmable and can implement a myriad of speed and serial standards. Link-layer IP is available for such standards as Ethernet, HD/SDI, Serial RapidIO, FibreChannel, and Aurora. Finally, we anticipated the popularity of PCI Express (PCIe) endpoint applications and integrated the complete PCIe endpoint protocol in hard logic. The Virtex-5 LXT PCIe Endpoint block is fully compliant to PCIe standard specification version 1.1 and can support x1, x2, x4, and x8 lane implementations. The integrated hard IP saves logic resources and improves performance for increasingly popular PCIe applications. For an x4 PCIe lane implementation, the Virtex-5 PCIe subsystem block saves as many as 8,500 LUTs compared to implementation with soft IP.

Virtex-5 devices offer more and smaller I/O banks. The outer I/O banks (as many as eight banks in the largest device) also are arranged to provide a PCB routing advantage that in some cases might save board layers.

To ensure the best simultaneously switching output (SSO) performance and provide the best signal integrity (SI) solution in the FPGA industry, all Virtex-5 devices use Xilinx sparse chevron technology pinout assignments. This ensures that each I/O pin is closely surrounded by power and ground pins, thus minimizing current loop inductance and improving SI.

Conclusion

I hope that you have enjoyed reading about Virtex-5 devices and the factors that drove their design. At Xilinx, we have truly enjoyed the excitement in the system engineering community about this new architecture. We look forward to seeing your next-generation systems benefit from the Virtex-5 enhanced performance and functionality, taking your complex designs to the next level.
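Sidebar: as a companion to the 6-LUT discussion in this article, the toy model below illustrates how one 64-bit truth table can serve either as a single function of six inputs or as two functions of the same five inputs. It is a simplified sketch, not a description of the actual Virtex-5 primitive: the class name `Lut6`, the helper `truth_table`, and the choice to hold the first five-input function in the lower 32 bits and the second in the upper 32 bits are all assumptions made for illustration.

```python
class Lut6:
    """Toy model of a six-input LUT: a 64-entry truth table held in one integer."""

    def __init__(self, table: int):
        if not 0 <= table < 1 << 64:
            raise ValueError("a 6-input LUT holds exactly 64 bits")
        self.table = table

    @staticmethod
    def _address(bits):
        # bits[0] is the least significant address bit.
        return sum((bit & 1) << i for i, bit in enumerate(bits))

    def single(self, bits):
        """Evaluate as one logic function of six inputs."""
        if len(bits) != 6:
            raise ValueError("expected six inputs")
        return (self.table >> self._address(bits)) & 1

    def dual(self, bits):
        """Evaluate as two functions of the same five inputs.

        Assumed layout for this sketch: the lower 32 bits hold the first
        function and the upper 32 bits hold the second.
        """
        if len(bits) != 5:
            raise ValueError("expected five inputs")
        addr = self._address(bits)
        return (self.table >> addr) & 1, (self.table >> (32 + addr)) & 1


def truth_table(func, num_inputs):
    """Build the truth-table integer for a Boolean function of num_inputs bits."""
    table = 0
    for addr in range(1 << num_inputs):
        bits = [(addr >> i) & 1 for i in range(num_inputs)]
        if func(*bits):
            table |= 1 << addr
    return table


# Pack a five-input AND and a five-input XOR into a single 64-bit table.
and5 = truth_table(lambda a, b, c, d, e: a & b & c & d & e, 5)
xor5 = truth_table(lambda a, b, c, d, e: a ^ b ^ c ^ d ^ e, 5)
lut = Lut6(and5 | (xor5 << 32))

print(lut.dual((1, 1, 1, 1, 1)))  # (AND, XOR) of five ones
print(lut.dual((1, 0, 0, 0, 0)))  # (AND, XOR) of a single one
```

In this model the same storage read with a sixth address bit (`single`) simply selects between the two packed functions, which is one way to see why a 64-bit table can cover either use.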


View from the top

Serial Everywhere – The Triple-Play Challenge
Xilinx is helping to empower the next innovation in the triple-play race.

by Wim Roelandts
CEO and Chairman of the Board
Xilinx, Inc.

The electronics industry is pressed to its limits as it strives to develop solutions to feed the insatiable appetites of the consumer and enterprise markets for voice, video, and computer data communications on a single network. To the global broadcast and telecommunications industries, the triple-play opportunity is at once a potentially inexhaustible source of revenue and a constant source of frustration. Despite the immeasurable reward for successfully delivering triple-play services to the masses, substantial obstacles continue to impede access.

Perhaps the most central of these obstacles is the inadequacy of legacy infrastructure equipment to support the massive increases in bandwidth. Evolving from voice-only, the legacy infrastructure is a complex web of overlaid networks that represents both a financial and technological burden to service providers. In short, it is neither technologically feasible to deliver triple-play services with existing equipment nor economically practical to replace it with a completely new network. Moreover, legacy customers will not tolerate any interruption of existing services, nor will they pay extra for poor service quality.

Motivated by the promise of substantial rewards to those that enable this massive business food chain, the electronics industry is marshaling every possible resource to find solutions at all levels to the triple-play challenge. It is no surprise that the semiconductor industry endeavors to keep pace with system manufacturers.

Xilinx Serial I/O Solutions: Crossing the Chasm

The evolution of serial I/O solutions in Xilinx® FPGAs is the result of our high-speed serial initiative, which we announced in 2002. The aim of the initiative was (and is) to accelerate the industry’s move from parallel to high-speed serial I/O by delivering a new generation of connectivity solutions for system designs that meet bandwidth requirements from 3.125 Gbps to 10 Gbps and beyond.

We began by adding up to twenty-four 3.125 Gbps serial transceivers in our Virtex™-II Pro family, accompanied by IP soft cores for numerous serial connectivity standards, reference designs, hardware development platforms, design software, characterization data, and an in-depth design support program.

The Virtex-4 FX family followed suit in 2005 with a similar complement of broad-range transceivers, this time delivering 622 Mbps to 6.5 Gbps performance, as well as an equally robust set of IP and design support software, hardware, and services.

In each case, one of the key objectives in the introduction strategy of these products – with their attending high-speed serial I/O solution packages – was to reach the early adopters and innovators within the FPGA customer base with a viable alternative to custom ASIC and ASSP serial I/O solutions.

Having successfully proven the viability of FPGA-based serial I/O solutions with these previous product families, there remained a single yet extremely important evolutionary step. To cross the chasm into the mainstream FPGA customer base and truly create equivalency between Xilinx serial I/O solutions and custom solutions required the delivery of fully verified, fully integrated, hard IP-based, turnkey serial I/O solutions.

With our newest 65-nm Virtex-5 LXT platform, we believe that we have indeed crossed the chasm. By offering the industry’s first FPGA to deliver hard-coded PCI Express Endpoint and tri-mode Ethernet media access controller (MAC) blocks, Virtex-5 LXT devices are addressing the bandwidth, power, and cost challenges facing equipment vendors working to enable the emerging triple-play services market. The Virtex-5 LXT platform is optimized to enable FPGA designers across a wide range of applications to benefit from serial connectivity by delivering a comprehensive, fully compliant protocol solution with the greatest ease of use.


V I E W P O I N T

Virtex-5 Serial Connectivity Solutions


Enabling unconstrained product development for the triple-play market.

by Sandeep Vij
Vice President, Worldwide Marketing
Xilinx, Inc.
sandeep.vij@xilinx.com

Although “triple play” may be one of the hottest buzzwords and growth drivers in the semiconductor industry, it is insightful to understand the evolution of the technology that was required to realize triple play, the forces behind its explosive growth, challenges that will occur along the way, and the critical role of Xilinx® Virtex™-5 products in the development and deployment of triple-play products and services.

Central to the Virtex-5 platform’s value is the recent emergence of two serial I/O standards: Gigabit Ethernet (GbE) and PCI Express (PCIe). In the last three years, these two interfaces have become the de facto connectivity standards for network and computing applications; according to Electronic Trend Publications, GbE and PCIe will account for 80% of all port shipments in 2009.

Disruptive Technology
IP is clearly the preferred protocol in the network market as telecom vendors and service providers transition to an all-IP-based infrastructure supporting Voice over IP, Video over IP, and Data over IP (also known as triple play). Designing carrier-grade to end-user products that support triple play is very challenging, as these products must achieve high levels of performance, manage quality of service (QoS), and be power-efficient and flexible enough to adapt to the seemingly endless evolution of standards and protocols.

In the computing infrastructure market, PCIe has become the predominant host interface for networking, graphics, and backplane connectivity because of its quantum leap in performance, scalability, and pin-count efficiency over the legacy PCI bus. Designing products that span network and compute infrastructures like those in triple-play markets requires system architects and engineers to be well-versed in these new domains, introducing new risks. To this end, Xilinx embarked on a project two years ago to mitigate design risk by introducing a new generation of Platform FPGAs that substantially increase performance, functionality, and device density while reducing cost per gate.

Next-Generation FPGAs
Leveraging our core competence as the premier FPGA vendor and working with our world-class customers and partners, Xilinx developed the Virtex-5 FPGA architecture. With the introduction of the LXT family, Virtex-5 devices now feature integrated multi-GbE and PCIe connectivity technology ideally suited to designs for the triple-play market.

This LXT family is equipped to support high-speed serial connectivity, with features that include:

• Built-in GbE MAC – each Virtex-5 LXT device features four hard-core GbE MACs for multi-port Ethernet connectivity

• Built-in PCIe block – an integrated standards-compliant PCIe Endpoint block supporting one to eight lanes provides as much as 32 Gbps of full-duplex host I/O for extreme performance applications

These features reduce the engineering effort spent on resource utilization, troubleshooting connectivity issues, minimizing power consumption, and optimizing performance, thus giving our customers unconstrained Virtex-5 FPGA resources in designing infrastructure and end-user products for delivering voice, video, and data over IP.

As a programmable platform, the Virtex-5 family positions our customers and partners to enable value-added triple-play technologies such as:

• QoS – customer-specific traffic management solutions enabling tiered services that can change with market conditions

• Digital rights management – enabling hardware-based, adaptive, end-to-end data security for the wide diversity of standards inherent to these markets

Conclusion
In the very dynamic consumer industry where time to market with flexible services is the name of the game, companies are still trying to figure out the right mix of products and services to generate substantial revenue. The Virtex-5 LXT family integrates world-class programmable logic architecture with embedded serial connectivity, providing the performance, density, and connectivity required for delivering voice, video, and data in the emerging triple-play market.



V I E W P O I N T

FPGAs for Serial Interconnections


Research by Electronic Trend Publications points to a key role for FPGAs in serial interconnections.

by Steve Berry
President, Electronic Trend Publications
saberry@electronictrendpubs.com
www.electronictrendpubs.com

For most of the last 15 years, networking the world for voice, video, and data has been the key driver of the electronics industry. This worldwide network required that the communications industry connect and converge with the computer processing industry. That convergence has primarily settled on Ethernet for the communications side and PCI for the computer side.

Since its inception, Ethernet has been a serial interface. It has been repeatedly scaled up in bandwidth. Today, 1 Gbps connections are ubiquitous, 10 Gbps connections are becoming more common, and 100 Gbps connections have been proven in the laboratory. Ethernet has vanquished all challengers in the LAN market and is rapidly conquering the MAN and WAN markets.

PCI started out as a parallel interface, and as such ran out of bandwidth when connection requirements exceeded 1 Gbps. Industry groups such as the InfiniBand Trade Association and the RapidIO Trade Association introduced new connections to replace PCI. But PCI is much more than the physical connection between system elements. PCI represents an enormous global investment in software that is not readily replaceable. Only PCI Express has met the challenge of true compatibility with PCI. PCI Express bandwidth will be scaled up repeatedly over the coming years to support the industry’s needs.

As a result of the nearly 10-year effort to transition the industry from parallel to serial interconnections, serial interconnections will soon become dominant. Figure 1 illustrates the change from parallel to serial. In 2006, serial interconnections will move into the majority. By 2009, serial will represent more than 80 percent of all interconnections.

Serial vs. Parallel Ports   2004    2005    2006    2007    2008    2009
Parallel                    75.5%   56.3%   34.8%   25.5%   20.4%   15.9%
Serial                      24.5%   43.7%   65.2%   74.5%   79.6%   84.1%

Figure 1 – Serial interfaces are rapidly replacing parallel.

Although standard semiconductor products will supply the serial interconnection needs of high-volume markets, FPGAs are increasingly important for a wide variety of tasks. There are some key reasons. First, before low-cost standard products are available, FPGAs will provide a mechanism to get to market faster. Second, FPGAs enable system integration with customer algorithms and standards-based serial interfaces. Third, the ability to easily make multi-standard serial connections to FPGAs will dramatically simplify product design.

Thus, the new Xilinx® Virtex™-5 LXT platform – with its built-in PCI Express Endpoint blocks, tri-mode Ethernet MACs, and low-power RocketIO™ transceivers – precisely fits the requirements of today’s FPGA market by giving designers a solution that not only saves time, but also reduces power consumption and conserves FPGA logic resources.

RapidIO and Aurora
Although PCI Express and Ethernet will be the overwhelming leaders in the number of serial ports deployed by the industry, a host of other serial interfaces have carved niches for themselves. The Virtex-5 LXT platform also supports nearly all available serial interfaces. Two of these interfaces – RapidIO and Aurora – are emerging as most important to users of FPGAs.

RapidIO is becoming a favorite for high-end, low-volume DSP applications. A number of implementations in this arena use FPGAs (rather than merchant silicon) to implement DSP functions as well as RapidIO interface and switching functions. This should continue to be the case in the future.

Similarly, the Aurora protocol has quietly gained a substantial following in certain high-end embedded markets. Although Xilinx created Aurora, it is an open protocol, free of charge, that designers can implement in any silicon device. Aurora is a scalable, lightweight, link-layer protocol that is used to move data across point-to-point serial links. Aurora enables simple, high-speed connections between fixed points either on a single board or across multiple boards. As many applications in the board-level embedded market use fixed links between various points in the system, there is no need for a complex message-passing protocol.

Conclusion
With its hard-coded PCI Express Endpoint and Ethernet blocks, I anticipate that many will use the Virtex-5 LXT platform to bridge between PCI Express or Ethernet and numerous other interfaces. The Virtex-5 LXT platform is ideally suited for this task.



Featured Seminars -

High Power PC Embedded System Design

This seminar provides the embedded systems developers with
the necessary skills to develop a PPC System on a Programmable
Chip system utilizing the Virtex 4 FPGA. Utilizing the Embedded
Development Kit (EDK) the embedded systems developers will
create a full system based on the Nu Horizons XC4FX12 evaluation
board, labs provide hands on experience with the development,
verification, debugging, and simulation of an embedded system.

Prerequisites:
• Experience in C programming
• Some HDL modeling experience

• Basic microprocessor experience and understanding of
PowerPC™ processor systems
• A basic understanding of FPGA devices and the tools used to
program them

Implementing New Features of Virtex-5


This 3-hour seminar will introduce to you the first 65-nm family
of Platform FPGAs, the Virtex-5 LX from Xilinx. The Virtex-5
family of FPGAs is the 2nd generation of devices based on ASMBL
architecture. Learn how the new features in Virtex-5 can increase
logic performance by 30%, reduce area by 45%, and decrease
dynamic power by 35% when compared to the 90 nm Virtex-4
family.

Course Outline
• Virtex 4 versus Virtex 5 comparison
• V5’s new PLL and Use with DCMs
– Lab 1 – Introduction to the PLL/Architecture Wizard
• Improved Features in V5
– Lab 2 – Leveraging Improved Features

DSP Imaging Seminar

Course Outline
• Interpreting images as 2D signals
• Understanding the spectral content of images
• The concept of correlation between target and scene
• Image edge enhancement and its applicability to correlation
• Tradeoffs between correlation calculation methods
• Practical application of correlation to target tracking in video
• Overview of the Xilinx Video Starter Kit (VSK)
• Use of the VSK to perform video target tracking in real time

MicroBlaze Seminar

This 3-hour workshop will introduce you to MicroBlaze™: The Low-Cost and Configurable 32-Bit Soft Processor Solution from Xilinx. It will also introduce the Xilinx Embedded Development Kit (EDK). As part of this class you will learn how to build a complete customized MicroBlaze soft processor system including user-defined peripherals. The class also introduces the “Create and Import Peripheral Wizard” and guides you through the process of creating a custom peripheral in the EDK environment and using it in a processor system.

Course Outline
• Overview of MicroBlaze
• Overview of the Embedded Development Kit (EDK)
• Lab 1: Build and Optimize a MicroBlaze Soft Processor System in Minutes
• Lab 2: Custom Hardware Interface Utilizing the MicroBlaze IPIF Interface

Fundamentals of FPGAs

Course Outline
• Basic FPGA Architecture
• Xilinx Tool Flow
– Lab 1: Xilinx Tool Flow Demo
• Reading Reports
• Architecture Wizard and PACE
– Lab 2: Architecture Wizard and PACE Demo
• Global Timing Constraints
– Lab 3: Global Timing Constraints
• Implementation Options
– Lab 4: Implementation Options
• Synchronous Design Techniques
• Summary

Nu Horizons Electronics Corp. is proud to present our newest education and training program - XpressTrack - which offers engineers the opportunity to participate in technical seminars conducted around the country by experts focused on the latest technologies from Xilinx. This program provides higher velocity learning to help minimize start-up time to quickly begin your design process utilizing the latest development tools, software and products from both Nu Horizons and Xilinx.

Visit our website and let us know where you reside and what you are interested in learning about and we’ll develop a curriculum just for you.

For a complete list of course offerings, or to register for a seminar near you, please visit:

www.nuhorizons.com/xpresstrack
PERFORMANCE

Achieve Higher Performance with Virtex-5 FPGAs
New architectural elements can help you attain higher system-level performance.

by Adrian Cosoroaba
Marketing Manager
Xilinx, Inc.
adrian.cosoroaba@xilinx.com

In FPGA system design, maximizing performance requires a balanced mix of performance-efficient components – logic fabric, on-chip memory, DSP, and I/O bandwidth. In this article, I’ll explain how you can benefit from Xilinx® Virtex™-5 FPGA building blocks, particularly the new ExpressFabric™ technology, in your quest for higher system-level performance.

I will explore key features of the ExpressFabric architecture with examples that quantify the anticipated performance improvements for logic and arithmetic functions. Benchmarks based on actual customer designs will show that Virtex-5 ExpressFabric technology performs on average 30% better than previous-generation Virtex-4 FPGAs.

With the new logic fabric (in which you can implement functions such as counters, adders, and RAM/ROM storage) and available hard IP blocks, memory, and DSP (optimized to operate at clock rates as fast as 550 MHz), the Virtex-5 FPGA is clearly the platform of choice for high-performance designs.

ExpressFabric Performance
Since the first FPGA was introduced in the mid-1980s, the logic fabric for most FPGAs has been based on the same fundamental four-input look-up table (LUT) architecture. The Virtex-5 family is the first FPGA platform to offer a true six-input LUT (6-LUT) fabric with fully independent (not shared) inputs (Figure 1). Moving to a 6-LUT fabric architecture provides the 65-nm Virtex-5 FPGA family with the most effective trade-off between critical path delay – the determining factor for logic fabric performance – and die size.

With process technology advancements, interconnect timing delay can account for more than 50% of the critical path delay. Xilinx has developed a new interconnect pattern for Virtex-5 FPGAs to enhance performance by reaching more places in fewer hops. The new pattern increases the number of logic connections achievable within two and three hops. Moreover, a more regular routing pattern makes it easier for Xilinx ISE™ software to find the most optimal routes. All of the interconnect features are transparent to FPGA designers, but will translate to higher overall performance and easier design routability. Essentially, the Virtex-5 pattern provides fast, predictable routing based on distance.

The combination of the new 6-LUT structure and special functions like carry chains, dedicated multiplexers, and flip-flops (along with the unique methods by which these elements are connected) creates unsurpassed performance and efficiency for implementing logic and arithmetic functions.

One example that clearly shows the benefits of the ExpressFabric technology is a multiplexer (MUX). Implementing a 4:1 MUX requires two four-input LUTs and a MUXF block in the Virtex-4 architecture. The same 4:1 MUX can now be implemented in a Virtex-5 device with a single LUT. Similarly, an 8:1 MUX requires four LUTs and three MUXF blocks in a Virtex-4 FPGA, while the new Virtex-5 architecture requires only two 6-LUTs. The result is better performance and better logic utilization, as shown in Figure 2.

As in previous Xilinx FPGA families, the Virtex-5 Slice L (logic slice) can implement logic functions, registers, and arithmetic functions using the dedicated carry chain. The slightly more complex Slice M (memory slice) adds the capabilities of implementing distributed RAM and shift registers within the LUT (SRL).

Among the various improvements provided by the ExpressFabric architecture, the new carry chain structure delivers substantially higher performance when used to implement arithmetic operations. Its effect on critical path delay is readily seen for several examples listed in Table 1.

Distributed memory functions such as LUT RAM or ROM also benefit in several ways from the larger LUT structure. The new aspect ratio allows a much denser packing of small memory functions leading to significant performance benefits, as depicted in Table 2.

The performance increases provided by the improved logic fabric with its 6-LUT architecture and interconnect structure are substantial, but this is only the beginning.
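
The multiplexer comparison in this section comes down to counting inputs: a 4:1 MUX has four data inputs plus two select inputs, six in total, which is why it can fit in one six-input LUT. The Verilog below is a minimal sketch of how such multiplexers might be described for inference. The module and port names are hypothetical, and whether a given synthesis tool actually maps them to one or two 6-LUTs is an assumption to confirm in the synthesis report.

```verilog
// Illustrative sketch only: module and port names are hypothetical.

// 4:1 MUX: 4 data inputs + 2 select inputs = 6 inputs in total,
// so the whole function is a candidate for a single six-input LUT.
module mux4to1 (
  input  wire [3:0] d,    // four data inputs
  input  wire [1:0] sel,  // two select inputs
  output wire       y
);
  // Indexing a vector with a variable describes a multiplexer.
  assign y = d[sel];
endmodule

// 8:1 MUX: 8 data inputs + 3 select inputs = 11 inputs in total,
// which is more than one six-input LUT can accept, so more than
// one LUT is needed.
module mux8to1 (
  input  wire [7:0] d,    // eight data inputs
  input  wire [2:0] sel,  // three select inputs
  output wire       y
);
  assign y = d[sel];
endmodule
```

Describing the function behaviorally, rather than instantiating LUT primitives, leaves the mapping decision to the synthesis tool.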


Figure 1 – Virtex-5 configurable logic blocks (CLBs) comprise two slices. Each slice uses four independent 6-LUTs that provide the benefits of fewer logic levels.

8-to-1 MUX      Virtex-4    Virtex-5    Improvement
Logic Levels    2           1           100%
Path Delay      1.33 ns     1.08 ns     23%

Figure 2 – 8:1 multiplexer implemented with Virtex-5 FPGAs versus Virtex-4 FPGAs

Function                  Virtex-4 FPGA    Virtex-5 FPGA    Improvement
                          Path Delay       Path Delay
Adder 64-bit              3.5 ns           2.4 ns           46%
Ternary Adder 64-bit      4.3 ns           3.0 ns           40%
Barrel Shifter 32-bit     3.8 ns           2.8 ns           37%
Magnitude Comp. 48-bit    2.4 ns           1.8 ns           34%

Table 1 – Arithmetic functions implemented with Virtex-5 FPGAs versus Virtex-4 FPGAs

Function                            Virtex-4    Virtex-5    Improvement
LUT RAM 64 x 1      Logic Levels    2           1           100%
                    Path Delay      1.76 ns     1.26 ns     40%
LUT ROM 128 x 12    Logic Levels    3           1           200%
                    Path Delay      1.84 ns     1.20 ns     53%

Table 2 – LUT-based RAM/ROM implementations with Virtex-5 FPGAs versus Virtex-4 FPGAs

Most applications require more on-chip RAM than what LUT-based RAM can provide. With the enhanced Virtex-5 block RAM, you can achieve higher on-chip memory performance.

Block RAM Performance
With the move to 65 nm, the Virtex-5 block RAM inherited a 10% increase in clocking speed to 550 MHz. However, to achieve the desired performance for most applications today, block RAMs need to be more than just faster. They need to be larger.

The Virtex-5 block RAM has doubled in size to 36 Kb. This larger block size (comprising two 18-Kb memories) will support 72-bit data words in simple dual-port mode, thereby doubling block RAM bandwidth. Moreover, the Virtex-5 FPGA provides dedicated connections to enable you to cascade two adjacent 36-Kb block RAMs together in the block RAM column, thereby implementing a 72-Kb memory running at the maximum 550-MHz rate.

The availability of ever-larger FPGAs has accelerated the trend toward integrating more subsystems into a single device, making more common the necessity of interfacing multiple clock domains. Virtex-5 devices accommodate this by providing integrated logic to simplify the implementation of flexible and efficient FIFOs.

Through this combination of enhancements, the Virtex-5 block RAM delivers more on-chip memory, easier to build FIFOs, and higher bandwidth.

DSP Performance
The growing acceptance of FPGAs as a viable solution for high-performance DSP applications is well deserved. Whether as a co-processor or a stand-alone solution for


more demanding applications, FPGAs continue to provide the best combination of performance, power, and cost.

To keep pace with the seemingly insatiable demand for more DSP performance, Xilinx is leading with Virtex-5 DSP capabilities in terms of both clock rate and precision – the clock rate has increased to 550 MHz and the precision has improved from 18 x 18 bits to 25 x 18 bits.

Xilinx also optimized the Virtex-5 DSP48 slice for adder-chain implementations, a powerful capability that enables the creation of very efficient high-performance filters. Dedicated routing resources on the inputs and outputs of each DSP48 slice permit any number of slices to be chained together within a column. This dedicated routing ensures that every DSP48 slice in the chain will run at full speed without consuming any of the fabric routing or logic resources, as other FPGAs require. Taken together, these improvements reduce by half the number of resources needed to implement common high-precision functions. For example, for a 35 x 25-bit multiply, four DSP48 slices are needed with the Virtex-4 FPGA. With the wider DSP block available in the Virtex-5 FPGA, half as many slices are used to implement this multiply function.

I/O Bandwidth Performance
As performance benchmarks go, the speed with which an FPGA can process data is relevant only in context with the device’s I/O bandwidth, which is the speed with which large amounts of data can be moved on and off the device. When using external memory buffers, the interface must be at least two times faster than the data-processing rate because data must be both written out of and read back into the FPGA.

Virtex-5 FPGAs improve on Virtex-4 bandwidth by increasing both the data rate per pin and the number of available I/Os with larger packages. For example, for popular memory interfaces like DDR2 SDRAM, the bandwidth has increased per pin from 534 Mbps to 667 Mbps; the number of data I/Os, when considering SSO requirements, has increased from 432 to 576.

Customer Design Benchmarks
To further evaluate the performance improvements provided by the Virtex-5 FPGA logic fabric, we implemented a set of customer designs using Xilinx ISE software.

These designs were all written in VHDL or Verilog. We implemented some specific design units like memories and FIFOs using direct instantiation of library components or synthesis inference, but many were implemented using EDIF blocks generated by CORE Generator™ software (a part of ISE software).

For these benchmarks, we performed synthesis in a timing-driven fashion with Synplicity’s Synplify Pro, using tight, realistic constraints to effectively measure performance. This was done to ensure that all special optimizations and logic replications were employed.

Implementation in ISE software was accomplished with the place and route effort set to high. Clocks were tightened iteratively by 5% increments until the design failed to meet design constraints.

The result was an average performance gain of 30% over designs implemented in Virtex-4 FPGAs, as shown in Figure 3.

Those designs that improved the most have large cones of logic; the critical path implements a large, often complex logic equation. For example, ASIC prototyping designs will typically have very few registers for a large amount of logic in their critical path. These types of designs exhibit a significant improvement with Virtex-5 ExpressFabric technology.

Those designs exhibiting a more moderate improvement either have fewer levels of logic or provide little opportunity for the use of hard IP blocks or carry-chain structure to improve performance.

Figure 4 summarizes by category the performance improvements of Virtex-5 FPGAs over previous-generation Virtex-4 FPGAs.

Conclusion
With its new ExpressFabric technology and tight coupling to other high-performance hard-IP blocks and I/Os, the Virtex-5 FPGA family represents a significant performance boost compared to previous-generation architectures.

Figure 3 – Comparison based on a suite of 74 customer designs using ISE 8.2i software (chart of Virtex-5 vs. Virtex-4 FPGA performance advantage in percent, on a 0 to 60 scale, with a 30% average advantage line; annotations: designs with many levels of logic, designs with fewer levels of logic, use of hard IP blocks)

Figure 4 – Virtex-5 FPGA performance improvements (bar chart comparing Virtex-5 and Virtex-4 FPGAs by category: logic fabric performance; on-chip RAM, 550 MHz; DSP 32-tap filter, 550 MHz; I/O LVDS bandwidth, 750 Gbps; I/O memory bandwidth, 385 Gbps; improvement factors shown range from 1.1X to 1.7X)
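
As a rough cross-check of the DDR2 SDRAM numbers quoted under I/O Bandwidth Performance, multiplying the per-pin data rate by the number of data I/Os gives an aggregate memory-interface bandwidth. This is a back-of-the-envelope sketch that assumes every counted data I/O runs at the full per-pin rate at the same time; the symbols $B_{V5}$ and $B_{V4}$ are introduced here only for illustration.

```latex
\begin{align*}
B_{V5} &= 576 \times 667\ \text{Mbps} = 384{,}192\ \text{Mbps} \approx 384\ \text{Gbps} \\
B_{V4} &= 432 \times 534\ \text{Mbps} = 230{,}688\ \text{Mbps} \approx 231\ \text{Gbps} \\
\frac{B_{V5}}{B_{V4}} &= \frac{384{,}192}{230{,}688} \approx 1.67
\end{align*}
```

The Virtex-5 result appears consistent with the 385 Gbps memory bandwidth label in Figure 4.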



PERFORMANCE

HDL Coding and Design Practices for Improving Virtex-5 Utilization, Performance, and Power
These tips and techniques can lead to better Virtex-5 designs.

by Brian Philofsky
Staff Software Technical Marketing Manager
Xilinx, Inc.
brian.philofsky@xilinx.com

FPGAs have been very flexible in accommodating any HDL coding or design style for digital logic; Xilinx® Virtex™-5 devices are no exception. Although Virtex-5 FPGAs can accommodate many different types of designs written in many different methods, certain recommended constructs and manners can achieve improved optimization in terms of area, performance, and power.

Know Your Target Architecture and Synthesis Tool
Before beginning any project, you should understand the device architecture you are targeting. For Virtex-5 FPGAs, I recommend reading the Virtex-5 Users Guide (http://direct.xilinx.com/bvdocs/userguides/ug190.pdf) before starting your first line of code. Once you have a better understanding and vision as to how your code will ultimately result in the base hardware, you can make both large and small design and coding decisions confidently.

For example, if you know of and use Bitslip technology within the ISERDES, you could save time, effort, and resources by capturing input data rather than attempting to describe and build similar circuitry.

In another example, if you know the structure and capability of the DSP48E, you can make better choices as to when and where to place pipeline registers. Dedicated features like the wider multiplier or post adder can also help you achieve better area, performance, and power.

Similarly, knowing the capabilities and current limitations of your synthesis tool can not only help when choosing coding styles to properly infer primitives but can also give you greater insight as to when to instantiate a component or use inference. Review synthesis manuals, application notes, or other relevant materials before starting so that you know the recommended coding styles for the synthesis tool you are using.

You should also update and use the latest versions of synthesis and ISE™ tools before beginning a project. Although initial synthesis support for the Virtex-5 architecture is strong, many improvements in optimization and inference support are still to come with new releases. One easy way to ensure more optimal designs in terms of area, performance, and power is to install the latest version of the software.

Control Signal Polarity
The Virtex-5 architecture can support different control signal polarity (clock enables, resets, or sets). However, to have the most optimal design, I recommend consistent use of active high control signals in your design. The Virtex-5 slice control logic is active high, and when described in this same manner in the code should never require additional LUT resources for a simple signal inversion.

If the signal comes from an external pin and needs an active low polarity, I suggest inverting the signal in the top-level code and using a positive polarity in all processes and sub-modules requiring that signal. This is critical for designs that have several cores, use bottom-up synthesis techniques, have KEEP_HIERARCHY constraints, or employ the use of partitions (Figure 2).

Designs that fall into these categories are more susceptible to the use of additional LUTs per core/netlist/hierarchy/partition for the sole purpose of inverting these control signals, which not only consume extra LUT resources but may also have negative effects on performance and slice


packing. As a general rule, always code sets, register LUTs); distributed RAM (LUT- The Virtex-5 device departs from the
resets, and enables with an active high based RAM) memory; or block RAM for traditional four-input LUT in previous
(logic 1 activates) polarity. the implementation, which would not be FPGA families and has an enhanced six-
otherwise possible nor optimal. The synthe- input LUT (6-LUT), allowing for wider
Use of Resets sis tool has maximum flexibility to choose logic functions between pipeline registers
It is common practice to use a global asyn- the best resource for the described code. while maintaining top performance. You
chronous reset in the source HDL code to should keep this in mind, as logic functions
initialize the design; however, in many Pipelining coded into HDL as optimal code should
cases this consumes additional resources. As with previous FPGA generations, prop- include six inputs to the logic function
Instead, think synchronous and local. I erly pipelining your design is necessary to between registers to get the most optimal
suggest describing a synchronous set/reset achieve top performance and improved pipelining and LUT resource management.
logic to the portions of the design that do power characteristics. With the introduc- In cases where it is not practical or pos-
need periodical resets. For those portions of tion of the Virtex-5 architecture, a new sible to have exactly six inputs in a given
logic structure dictates slightly different rules regarding when and how to pipeline.

logic function, the wider input 6-LUT still allows for good performance by

the design that do not, you can initialize the signals defined to be registered in the HDL code at the time they are declared (for example, when defining a reg in Verilog or a signal in VHDL). This methodology allows for improved packing density, enhances timing analysis and performance, and can improve area resources.

In terms of FPGA behavior, without a global reset described in the code, a GSR (global set/reset) will occur upon completion of the configuration cycle, initializing all registers to known specified values. This same cycle is also simulated in the gate-level simulation netlist, giving the same known starting point as in the FPGA.

In terms of RTL simulation, having the registers initialized in the code allows for proper RTL or behavioral simulation; this same initialization will be picked up by the synthesis tool and applied to the implemented design. Therefore, for simulation at any stage, a global reset is redundant and unnecessary.

Using a synchronous reset instead of an asynchronous reset also allows for more predictable behavior upon the assertion and release of the reset, because the synchronous signals are automatically analyzed and their behavior is more deterministic when all timing constraints are met. It also allows for the possibility of greater logic optimization and performance because it is not global.

When using synchronous control signals, you can move portions of the logic function to the synchronous set or reset of the flip-flop; this is not possible with asynchronous signals. By only describing a reset where necessary, the synthesis tool can use alternative resource choices like SRLs (shift

Figure 2 – How clock enable polarity affects LUT utilization in a design
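The point about synchronous control signals can be illustrated with a minimal Verilog sketch. The module and signal names below are hypothetical and do not come from the article's figures. The registered function is the OR of `e` with a six-input AND, seven inputs in all, which is more than a single 6-LUT can hold. Describing `e` as a synchronous set gives a synthesis tool the option of mapping it to the flip-flop's set input so that the LUT only implements the AND; whether a given tool actually does so depends on the tool and its settings.

```verilog
// Hypothetical example: names are illustrative, not from the article.
module sync_set_example (
    input  wire       clk,
    input  wire [5:0] a,         // six inputs to the AND term (fits one 6-LUT)
    input  wire       e,         // term that can map to the synchronous set
    output reg        q = 1'b0   // initialized at declaration, no global reset
);
    // Functionally q <= e | (&a), written so that e acts as a synchronous set.
    always @(posedge clk)
        if (e)
            q <= 1'b1;
        else
            q <= &a;
endmodule
```

An asynchronous set could not absorb `e` in this way, because the set would then act on `q` outside the clocked data path.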

20 Xcell Journal Fourth Quarter 2006


PERFORMANCE

Verilog Coding Example

`timescale 1ns / 1ps
//////////////////////////////////////////////////////////////////////////////////
// Company: Xilinx
// Engineer: Brian Philofsky
//
// Create Date: 07:42:58 08/12/2006
// Design Name: good_design
// Module Name: good_code
// Project Name: HDL Coding and Design Practices for Improving Virtex 5
//               Utilization, Performance and Power
// Target Devices: Virtex 5
// Tool versions: ISE 8.2i
// Description: This is example code employing some good coding practices
//              when targeting a Virtex 5 device.
//
// Revision 0.01 - File Created
//
//////////////////////////////////////////////////////////////////////////////////
module good_code #(
   parameter data_width = 16,
             parity_width = 2)
  (input [data_width-1:0] DATA_IN,
   input DATA_STORE,
   input CLK, RST,
   input READ_DATA,

   output [data_width+parity_width-1:0] DATA_OUT,
   output reg RW_ERROR = 1'b0,
   output DATA_VALID, FULL
  );

   // Always initialize registers to known values
   reg [data_width-1:0] data_in_reg = {data_width{1'b0}};
   reg [data_width-1:0] data_in_reg2 = {data_width{1'b0}};
   reg [2:0] data_store_delay = 3'b000;
   reg [2:0] data_valid_delay = 3'b000;
   reg [parity_width-1:0] parity = {parity_width{1'b0}};

   wire read_error, write_error;

   // Use resets only where necessary and make them synchronous
   // Make resets and clock enables active high
   always @(posedge CLK)
      if (RST)
         data_in_reg <= {data_width{1'b0}};
      else if (DATA_STORE)
         data_in_reg <= DATA_IN;

   // Do not use resets where not necessary
   // In this case an SRL can be used due to the fact no reset is described.
   always @(posedge CLK) begin
      data_store_delay <= {data_store_delay[1:0], DATA_STORE};
      data_in_reg2 <= data_in_reg;
      data_valid_delay <= {data_valid_delay[1:0], READ_DATA};
      RW_ERROR <= read_error | write_error;
      parity[1] <= ^data_in_reg[15:8];
      parity[0] <= ^data_in_reg[7:0];
   end

   // In general, RAMs should be inferred however in this case, a FIFO is needed
   // and synthesis can not yet infer the dedicated Virtex 5 FIFO.

   // FIFO18: 16k+2k Parity Synchronous/Asynchronous BlockRAM FIFO
   // Virtex-5
   // Xilinx HDL Language Template, version 8.2.2i

   FIFO18 #(
      .ALMOST_FULL_OFFSET(12'h080),      // Sets almost full threshold
      .ALMOST_EMPTY_OFFSET(12'h080),     // Sets the almost empty threshold
      .DATA_WIDTH(18),                   // Sets data width to 4, 9 or 18
      .DO_REG(1),                        // Enable output register (0 or 1)
                                         // Must be 1 if EN_SYN = "FALSE"
      .EN_SYN("TRUE"),                   // Specifies FIFO as Asynchronous ("FALSE")
                                         // or Synchronous ("TRUE")
      .FIRST_WORD_FALL_THROUGH("FALSE")  // Sets the FIFO FWFT to "TRUE" or "FALSE"
   ) FIFO18_inst (
      .ALMOSTEMPTY(),             // 1-bit almost empty output flag
      .ALMOSTFULL(),              // 1-bit almost full output flag
      .DO(DATA_OUT[15:0]),        // 16-bit data output
      .DOP(DATA_OUT[17:16]),      // 2-bit parity data output
      .EMPTY(),                   // 1-bit empty output flag
      .FULL(FULL),                // 1-bit full output flag
      .RDCOUNT(),                 // 12-bit read count output
      .RDERR(read_error),         // 1-bit read error output
      .WRCOUNT(),                 // 12-bit write count output
      .WRERR(write_error),        // 1-bit write error
      .DI(data_in_reg2),          // 16-bit data input
      .DIP(parity[1:0]),          // 2-bit parity input
      .RDCLK(CLK),                // 1-bit read clock input
      .RDEN(READ_DATA),           // 1-bit read enable input
      .RST(RST),                  // 1-bit reset input
      .WRCLK(CLK),                // 1-bit write clock input
      .WREN(data_store_delay[2])  // 1-bit write enable input
   );

   // End of FIFO18_inst instantiation

endmodule

VHDL Coding Example

----------------------------------------------------------------------------------
-- Company: Xilinx
-- Engineer: Brian Philofsky
--
-- Create Date: 07:42:58 08/12/2006
-- Design Name: good_design
-- Module Name: good_code2
-- Project Name: HDL Coding Practices for Improving Virtex 5 Utilization,
--               Performance and Power
-- Target Devices: Virtex 5
-- Tool versions: ISE 8.2i
-- Description: This is example code employing some good coding practices
--              when targeting a Virtex 5 device.
--
-- Revision 0.01 - File Created
--
----------------------------------------------------------------------------------
library IEEE;
use IEEE.std_logic_1164.all;
use IEEE.std_logic_arith.all;
library UNISIM;
use UNISIM.Vcomponents.all;

entity good_code2 is
   generic (
      data_width : integer := 16;
      parity_width : integer := 2
   );
   port (
      DATA_IN : in std_logic_vector(data_width-1 downto 0);
      DATA_STORE: in std_logic;
      CLK, RST: in std_logic;
      READ_DATA: in std_logic;

      DATA_OUT : out std_logic_vector(data_width+parity_width-1 downto 0);
      RW_ERROR : out std_logic := '0';
      DATA_VALID, FULL : out std_logic
   );
end good_code2;

architecture XILINX of good_code2 is

   -- Always initialize registers to known values
   signal data_in_reg: std_logic_vector(data_width-1 downto 0) := (others => '0');
   signal data_in_reg2: std_logic_vector(data_width-1 downto 0) := (others => '0');
   signal data_store_delay: std_logic_vector(2 downto 0) := "000";
   signal data_valid_delay: std_logic_vector(2 downto 0) := "000";
   signal parity: std_logic_vector(parity_width-1 downto 0) := (others => '0');

   signal read_error, write_error: std_logic;

begin

   -- Use resets only where necessary and make them synchronous
   -- Make resets and clock enables active high
   process (CLK)
   begin
      if (CLK'event and CLK='1') then
         if RST='1' then
            data_in_reg <= (others => '0');
         elsif (DATA_STORE='1') then
            data_in_reg <= DATA_IN;
         end if;
      end if;
   end process;

   -- Do not use resets where not necessary
   -- In this case an SRL can be used due to the fact no reset is described.
   process (CLK)
   begin
      if (CLK'event and CLK='1') then
         data_store_delay <= (data_store_delay(1 downto 0) & DATA_STORE);
         data_in_reg2 <= data_in_reg;
         data_valid_delay <= (data_valid_delay(1 downto 0) & READ_DATA);
         RW_ERROR <= read_error OR write_error;
         parity(1) <= (data_in_reg(15) XOR data_in_reg(14) XOR data_in_reg(13) XOR
                       data_in_reg(12) XOR data_in_reg(11) XOR data_in_reg(10) XOR
                       data_in_reg(9) XOR data_in_reg(8));
         parity(0) <= (data_in_reg(7) XOR data_in_reg(6) XOR data_in_reg(5) XOR
                       data_in_reg(4) XOR data_in_reg(3) XOR data_in_reg(2) XOR
                       data_in_reg(1) XOR data_in_reg(0));
      end if;
   end process;

   -- In general, RAMs should be inferred however in this case, a FIFO is needed
   -- and synthesis can not yet infer the dedicated Virtex 5 FIFO.

   -- FIFO18: 16k+2k Parity Synchronous/Asynchronous BlockRAM FIFO BlockRAM Memory
   -- Virtex-5
   -- Xilinx HDL Language Template version 8.2.2i

   FIFO18_inst : FIFO18
   generic map (
      ALMOST_FULL_OFFSET => X"080",      -- Sets almost full threshold
      ALMOST_EMPTY_OFFSET => X"080",     -- Sets the almost empty threshold
      DATA_WIDTH => 18,                  -- Sets data width to 4, 9 or 18
      DO_REG => 1,                       -- Enable output register (0 or 1)
                                         -- Must be 1 if EN_SYN = FALSE
      EN_SYN => TRUE,                    -- Specifies FIFO as Asynchronous (FALSE) or
                                         -- Synchronous (TRUE)
      FIRST_WORD_FALL_THROUGH => FALSE)  -- Sets the FIFO FWFT to TRUE or FALSE
   port map (
      ALMOSTEMPTY => open,             -- 1-bit almost empty output flag
      ALMOSTFULL => open,              -- 1-bit almost full output flag
      DO => DATA_OUT(15 downto 0),     -- 16-bit data output
      DOP => DATA_OUT(17 downto 16),   -- 2-bit parity data output
      EMPTY => open,                   -- 1-bit empty output flag
      FULL => FULL,                    -- 1-bit full output flag
      RDCOUNT => open,                 -- 12-bit read count output
      RDERR => read_error,             -- 1-bit read error output
      WRCOUNT => open,                 -- 12-bit write count output
      WRERR => write_error,            -- 1-bit write error
      DI => data_in_reg2,              -- 16-bit data input
      DIP => parity,                   -- 2-bit parity input
      RDCLK => CLK,                    -- 1-bit read clock input
      RDEN => READ_DATA,               -- 1-bit read enable input
      RST => RST,                      -- 1-bit reset input
      WRCLK => CLK,                    -- 1-bit write clock input
      WREN => data_store_delay(2)      -- 1-bit write enable input
   );

   -- End of FIFO18_inst instantiation

end XILINX;

Figure 3 – Sound FPGA coding styles


reducing the number of logic levels, thus requiring fewer pipeline stages to achieve the same or better performance than previous FPGA architectures.

A good goal is to aim for fewer than 10 inputs to a given logic function between I/Os, registers, or synchronous blocks (like block RAM or DSP48Es), which generally would represent two logic levels. When you need a significantly higher number of inputs for the design path to meet latency or other requirements, you can attempt to reduce the fan-in to that logic function (when possible) if high performance or low power are your design objectives.

Coding Memories
Among other innovations within the Virtex-5 architecture, Xilinx has enhanced both block RAM and distributed RAM memories with greater capacity and capability. You must make different decisions early in the design process and while coding to get the most from these valuable resources.

General guidelines call for inferring RAMs when possible for easier code changes, faster simulation, and more portable code. However, even when behaviorally describing the RAM, you should keep some important things in mind. The first and most obvious consideration is RAM capacity. In terms of block RAMs, the base memory block increased in Virtex-5 devices to 36 Kb of memory storage space. You can configure this block to the wider but shallower 512 x 72 configuration, the deeper single-bit-wide 32K x 1, or several configurations in between. It is also possible to cascade two 36-Kb RAMs to form a 64K x 1 configuration or break up the 36-Kb RAMs into two separate 18-Kb RAMs capable of 512 x 36 to 16K x 1 configurations.

Distributed RAMs have benefited from the larger LUT structure and can now efficiently accommodate 64-bit depths without any area or performance penalties. This is the optimal size for this type of RAM in the Virtex-5 device, although other sizes can be accommodated. The base RAM sizes are important to remember during memory selection and coding to most efficiently use the limited RAM resources in the device and achieve the best performance.

Both block RAM and distributed RAM memories also have additional capabilities that require different coding and design considerations. For performance, perhaps the most important is the proper use of output registers. For block RAMs, this means enabling the output registers of the block RAM whenever possible. By enabling the output registers, a reduced clock-to-out is realized from the RAM, thus improving timing for the data leaving the RAM. However, an extra clock cycle of latency is added during reads, for which you must account.

Similarly, when using distributed RAM, the output of the RAM can be asynchronous; however, coding it synchronously will allow the use of the register within the slice, providing better timing characteristics and reducing the chance of the RAM being part of the timing bottleneck.

There are more advanced features of the block RAMs, such as FIFO and ECC (error correction circuitry) capabilities. The distributed RAM also has new capabilities such as a quad-port configuration. In some cases, these features cannot be realized by inference within synthesis and instantiation is necessary. If you need such functionality, I suggest instantiating the RAMs either by generating cores within Xilinx CORE Generator™ software or by instantiating the base primitive. Taking advantage of these advanced features can save RAM and logic resources as well as improve area, performance, and power.

Some General Guidelines
A few other general recommendations do not fall into any specific categories but can result in better coding and design choices. First, you should make wise choices in terms of your design hierarchy right from the start. Your choice of hierarchy can have effects on the synthesis and implementation tools’ ability to optimize the logic paths.

In general, do not allow timing paths to cross multiple boundaries of hierarchy. This not only limits the tool’s ability to optimize logic but may also limit your options for design implementation and design debugging. For instance, you may not be able to use partitions or KEEP_HIERARCHY on certain hierarchies with this practice.

For designs in which some or most of the code was created for an architecture other than Virtex-5 FPGAs, I suggest that you review the code to ensure that it is well suited for implementation into the new architecture. A few minutes of time spent here can save several hours later if you identify and correct suboptimal code.

If your design contains cores or pre-compiled netlists (EDIF or NGC files) from a previous architecture, you should regenerate those targeting Virtex-5 devices. Unless regenerated, netlists optimized for a previous architecture are likely to be far from optimal when targeting the Virtex-5 architecture.

One last suggestion is to use the HDL language templates within the ISE tools. They not only help with accelerating the generation of VHDL or Verilog code, but also provide assistance in creating more optimal code for FPGAs. They also cut down on the possibility of creating syntax or other simple but common mistakes that can hold up the testing and verifying of HDL code.

Figure 3 shows both Verilog and VHDL code following the guidelines discussed here.

Conclusion
Coding styles are very individual; however, following these suggestions makes it more likely that you will achieve a more optimal result. These guidelines do not represent absolutely everything you need to know to achieve the best Virtex-5 design possible, but I have provided some common strategies that can help in achieving more optimal designs.

Almost any set of valid HDL code likely will result in a functioning design, but following a few simple guidelines can help in terms of improved density, performance, and power, and many times may reduce the amount of time it takes to ultimately complete a design.

For more information, see the Synthesis and Simulation Design Guide at http://toolbox.xilinx.com/docsan/xilinx82/books/docs/sim/sim.pdf or White Paper 231, “HDL Coding Practices to Accelerate Design Performance,” at http://direct.xilinx.com/bvdocs/whitepapers/wp231.pdf.
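As a sketch of the output-register guideline from the Coding Memories section, the Verilog below describes a RAM behaviorally with a synchronous read followed by one more register stage. The module name, port names, and default sizes are assumptions chosen for illustration (1K x 36 is 36,864 bits, the size of one 36-Kb block). A synthesis tool may pack the second register into the block RAM's optional output register, but that behavior is tool- and version-dependent and is not guaranteed by this code.

```verilog
// Hypothetical module: names and default sizes are illustrative only.
module inferred_ram_sketch #(
    parameter ADDR_WIDTH = 10,   // 1K words deep
    parameter DATA_WIDTH = 36    // 1K x 36 = 36 Kb
) (
    input  wire                  clk,
    input  wire                  we,
    input  wire [ADDR_WIDTH-1:0] waddr,
    input  wire [ADDR_WIDTH-1:0] raddr,
    input  wire [DATA_WIDTH-1:0] din,
    output reg  [DATA_WIDTH-1:0] dout = {DATA_WIDTH{1'b0}}
);
    // Memory array; no reset is described, so it can map to a RAM resource.
    reg [DATA_WIDTH-1:0] mem [0:(1<<ADDR_WIDTH)-1];
    reg [DATA_WIDTH-1:0] ram_q = {DATA_WIDTH{1'b0}};

    always @(posedge clk) begin
        if (we)
            mem[waddr] <= din;
        ram_q <= mem[raddr];   // synchronous read: first cycle of latency
        dout  <= ram_q;        // output register: second cycle of latency
    end
endmodule
```

Read data appears on `dout` two clock edges after `raddr` is presented, which is the extra cycle of latency the article says you must account for. The same coding pattern with a small depth (for example `ADDR_WIDTH = 6`, 64 words) is a candidate for distributed RAM with its read registered in the slice.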



PERFORMANCE

Getting the Best Results from Virtex-5 FPGAs
Synplicity applies new algorithms and heuristics for optimal Virtex-5 support.

by John Gallagher
Sr. Director Outbound Marketing
Synplicity, Inc.
johng@synplicity.com

The revolutionary capabilities of Xilinx® Virtex™-5 devices can be fully realized only if the EDA technologies to unlock those capabilities are available when the device is. To achieve this, the FPGA architecture and corresponding EDA design tools must be developed simultaneously. The leap in EDA development over previous generations is comparable to the leap between Virtex-5 devices and their predecessors.

As the first FPGAs created at the 65-nm technology node, the Virtex-5 family of domain-optimized devices provides as much as 65% more logic cells and 25% more I/O pins than the Virtex-4 family. At the same time, devices in the Virtex-5 family provide 30% higher performance and 35% lower dynamic power dissipation, and consume 45% less silicon real estate, when compared to their Virtex-4 counterparts.


Clearly, the Virtex-5 architecture delivers revolutionary capabilities. To develop a synthesis flow that leverages these new features, Xilinx gave Synplicity early access to the Virtex-5 architecture. By the time the first Virtex-5 devices were introduced, we had been working side-by-side with Xilinx engineers for more than a year to enhance our Synplify Pro synthesis engine. This involved making substantial algorithmic changes to Synplify Pro software to maximize the performance and logic density of Virtex-5-based designs. Because of our partnership, we have the tools and methodologies in place to enable you to rapidly deploy these devices.

In this article, I will describe some of the ways in which we enhanced the Synplify Pro FPGA synthesis engine to take full advantage of the capabilities offered by the Virtex-5 family.

New Algorithms to Synthesize 6-LUTs
Increases in the complexity of the FPGA fabric demanded corresponding increases in the sophistication of the synthesis algorithms. If you applied the same algorithms for a fabric based on 4-input look-up tables (4-LUTs) to a fabric with 6-LUTs, for example, synthesis run times would dramatically increase. To take full advantage of the specialized architectural features in the Virtex-5 family, Synplicity had to either fine-tune or in some cases completely re-craft the underlying synthesis algorithms.

To reduce the levels of logic needed to map functions like wide data paths and DSP, the ExpressFabric™ technology in the Virtex-5 family features LUTs with six independent inputs (Figure 1). This significantly reduces the number of logic levels and LUT area required to implement wide functions. Within the synthesis process, you can use each of these logical elements as a true 6-LUT or as two 5-LUTs that share their five inputs.

Figure 1 – Virtex-5 six-input LUTs

However, the combinatorial explosion associated with mapping to 6-LUTs can cause memory utilization and run-time problems if not handled correctly. If you applied the algorithms for a fabric based on 4-LUTs to a fabric with 6-LUTs without significant modification, synthesis run times would be orders of magnitude longer (if they completed at all). In addition, when attempting to find an optimal mapping, traditional algorithms run the risk of becoming trapped in local minima.

Unlike timing-driven engines, most conventional synthesis engines simply attempt to reduce the number of logic levels. This is problematic for LUT architectures in which different input-to-output paths have asymmetric delays. Consider the five shared input pins versus the sixth unique pin in Figure 1; these pins will have very different delays.

The fact that you can use Virtex-5 LUTs in a 2 x 5-input configuration further increases the complexity of the mapping operations. Synthesis tool vendors must conduct a lot of research and development to use structures that share inputs but represent different functions.

Synplicity equipped Synplify Pro software (which features a unique direct-mapping capability) with a variety of sophisticated heuristic algorithms that are tailored to minimize the number of cuts, handle huge capacities, and address these complex mapping and timing scenarios.

Timing Estimation with Diagonally Symmetric Interconnect
Another ExpressFabric feature is a radically new form of diagonally symmetric interconnect that reaches more locations with fewer hops (Figure 2). This diagonally symmetric interconnect pattern was designed to improve speed and predictability. The combination of ExpressFabric 6-LUTs and a diagonally symmetric interconnect pattern results in an average increase of logic performance of 30% over Virtex-4 devices, which equals two speed grades.

The diagonally symmetric interconnect pattern also delivers significantly higher complexity in timing analysis; previous physical synthesis algorithms were based on architectures with “Manhattan routing” or 90-degree routes. To handle the
combination of both 90-degree and diagonal routes, Synplicity custom-designed new algorithms and delay models. As opposed to simple wire-load models, we engineered the Synplify Pro tool to employ sophisticated netlist-based routing estimation (coupled with known routing values where applicable). In the case of fast carry chains, for example, routing delays are well known and can be directly “plugged in.” Similarly, in the case where a cell, driver, load, and specific route are known, an accurate routing delay associated with this path can be plugged in to the routing and timing algorithms.

Figure 2 – Diagonally symmetric interconnect routing: (a) Virtex-4, traditional interconnect pattern; (b) Virtex-5, diagonally symmetric interconnect pattern (legend: direct, 1 hop, 2 hops, 3 hops)

Synthesizing Fast High-Capacity RAM Blocks
The new block RAM structures (with pipeline) in the Virtex-5 family have increased to 36 Kb in size – twice the size of those found in Virtex-4 components. In addition to offering a simple dual-port mode that can double the RAM’s bandwidth, these blocks also contain additional hard IP in the form of FIFO logic and new 64-bit error checking and correction (ECC) logic (Figure 3). Implementing this logic as hard IP frees up other resources and minimizes dynamic power consumption.

Figure 3 – The Virtex-5 family features as much as 10 Mb of 550-MHz block RAM. (Block diagram labels: interconnect logic, two 18-kb RAMs, ECC and FIFO logic)

As with all hard IP blocks in Virtex-5 devices, these block RAMs have been tuned for 550-MHz operation to provide higher on-chip memory bandwidth. The 18-Kb block RAMs are constructed from two physical 9-Kb memories, which are automatically controlled to save power by enabling only one of the 9-Kb sub-blocks for any given read or write operation in most configurations.

For our part, the Synplify Pro synthesis software can perform automatic memory inferencing, including single-port and dual-port implementations, single and multiple clocking schemes, and automatic retiming. Regarding the latter point, Virtex-5 block RAMs are inherently synchronous; however, the design’s RTL could describe the memory and registers in such a way as to be technically asynchronous. In such a case, the Synplify Pro tool has the ability to “push” the registers into the RAM.

Similarly, the software will recognize potential conflicts and automatically generate appropriate conflict-resolution logic. In cases such as dual-port RAMs, for example, in which the result of writing two words to the same address may be undefined, the Synplify Pro tool automatically inserts the appropriate logic to resolve the issue such that the memory works in exactly the same way as the RTL will simulate.

Furthermore, Synplify Pro software can automatically analyze the memory described in the design and recognize potential issues in mapping it to preferred memory resources. If you require block RAM, for example, but have used more than what is available on the physical device, the software will automatically move some of the memory into select RAM.

Optimal Use of Faster, Wider DSP Blocks
The Virtex-5 hard DSP slice – called the DSP48E – features a 25 x 18-bit multiplier (versus the 18 x 18-bit multiplier employed in Virtex-4 FPGAs). This increase can lead to fewer cascaded stages, thereby resulting in higher overall performance and utilization (Figure 4).

You can configure these high-precision, high-performance, highly flexible slices, which are tuned for 550-MHz operation, for DSP, arithmetic, and logic functions and cascade them for adder-chain architectures. The DSP48E slice has 40% lower power consumption compared to equivalent functions in Virtex-4 FPGAs (1.38 mW/100 MHz at a 38% toggle rate).

The sophistication of these DSP slices means that it is unlikely that a data path defined in RTL will exactly match the optimal DSP implementation structure. For example, rather than implementing a function such as “(a + b) + (c + d)” by adding “a” and “b,” adding “c” and “d,” and then adding the results generated by these operations, it may be more efficient to cascade the DSP slices along the lines of “(((a + b) + c) + d).”

We equipped Synplify Pro software with extremely sophisticated mapping algorithms that perform a lot of data path massaging, creating data path structures that
map efficiently onto – and take full advantage of – the DSP48E slices featured in Virtex-5 SXT devices.

Figure 4 – Virtex-5 DSP slices with 25 x 18 multipliers (Block diagram labels: 25-bit and 18-bit multiplier inputs, 43-bit product sign-extended to 48 bits by the MUXs, 3-input 48-bit unit supporting +, -, AND, OR, XOR, NOT, etc.)

Note that the DSP48E slice contains a large number of registers (not shown in Figure 4 for simplicity). The Synplify Pro tool can use advanced pipelining and retiming techniques to take full advantage of these embedded register elements. Another consideration is that these internal register elements support only synchronous resets. Thus, if you employ an asynchronous reset in your code, the Synplify Pro tool automatically inserts the appropriate glue logic to restore the required functionality.

Synthesis Considerations with High-Speed I/Os
The number and type of available I/O in a specific device play a critical role in design implementation, particularly when the application of the FPGA is chip- or system-level verification. With Synplicity tools like Certify ASIC RTL prototyping and the Synplify Pro synthesis solution, we optimized the powerful I/O capabilities in Virtex-5 devices to take into account both signal integration and automatic I/O assignments.

Virtex-5 FPGAs offer as many as 1,200 general-purpose input/output (GPIO) pins that you can use to implement industry-standard and custom protocols. The SelectIO™ technology behind these pins provides 1.25 Gbps differential I/O and 800 Mbps single-ended I/O.

Second-generation ChipSync™ source-synchronous technology allows programmable delays to be individually applied to each input and each output. Furthermore, the unique power and ground pin pattern of the Virtex-5 second-generation sparse chevron packaging technology simplifies circuit board layout while minimizing signal integrity and crosstalk effects. Combined, these I/O technologies ensure reliable operation for high-bandwidth interfaces such as DDR2 and QDR II.

We engineered Synplify Pro synthesis software to automatically handle differential signals (and bi-directional I/O signals). For example, if you apply an attribute that identifies the port as being a low voltage differential signal (LVDS) output port, the Synplify Pro tool will automatically insert the appropriate LVDS primitive with one input and two outputs.

Next Steps
As devices move deeper into the submicron domain, the symbiosis between FPGA vendors and EDA vendors increases. Working with engineers at Xilinx for almost a year to enhance our Synplify Pro synthesis engine has yielded tremendous benefits, including having the best synthesis technology available immediately when the device was brought to market, tested against real designs.

Future FPGA platforms will provide even greater densities and capabilities, further expanding the reach of advanced FPGA architectures across a wide range of application domains. These new platforms will require ever-more-sophisticated design flows and synthesis solutions. For this reason, Synplicity and Xilinx formed an Ultra-High-Capacity Timing Closure Task Force. As part of this endeavor, engineering teams from both of our companies will collaborate to define and implement new design flows that maximize design productivity and quality of results for ultra-high-density designs implemented using next-generation FPGAs.

The task force will initially focus on dramatically improving overall quality of results and run times, and on ensuring the stability of results when small changes are made to designs. Ultimately, the goal of the task force is for designers to realize the benefits of near push-button results for ultra-high-density designs, completing multiple design iterations per day.

Conclusion
The Virtex-5 family from Xilinx has an implementation flow engineered by Xilinx and Synplicity to achieve the best possible results. These new FPGAs boast a wide range of new architectural features, such as 6-LUTs and a diagonally symmetric interconnect fabric.

The matching software solution from Synplicity features new forms of timing estimation for diagonal interconnect, new synthesis algorithms to deal with combinatorial explosion, specialized RAM inferencing and I/O handling, and numerous improvements that enable stable results with minor design changes. Taken together, Virtex-5 devices and Synplify Pro software bring system designers new capabilities that you can design with today.
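The article's remark about reshaping “(a + b) + (c + d)” into a cascade of the form “(((a + b) + c) + d)” can be sketched in Verilog as a pipelined adder chain. This is only an illustration under stated assumptions: the module name, port names, and widths are hypothetical, the operands are treated as unsigned, and nothing here claims to reproduce what Synplify Pro generates internally. Whether each stage lands in a DSP48E slice depends on the synthesis tool and its settings.

```verilog
// Hypothetical module: names and widths are illustrative only.
module adder_chain_sketch #(
    parameter WIDTH = 16
) (
    input  wire             clk,
    input  wire [WIDTH-1:0] a, b, c, d,   // unsigned operands
    output reg  [WIDTH+1:0] sum = {(WIDTH+2){1'b0}}
);
    // Delay registers so that c and d meet the running sum at the right stage.
    // No reset is described, so these delays could also map to SRLs.
    reg [WIDTH-1:0] c_d1 = {WIDTH{1'b0}};
    reg [WIDTH-1:0] d_d1 = {WIDTH{1'b0}};
    reg [WIDTH-1:0] d_d2 = {WIDTH{1'b0}};
    reg [WIDTH:0]   s_ab  = {(WIDTH+1){1'b0}};
    reg [WIDTH+1:0] s_abc = {(WIDTH+2){1'b0}};

    always @(posedge clk) begin
        c_d1  <= c;
        d_d1  <= d;
        d_d2  <= d_d1;
        s_ab  <= a + b;          // stage 1: a + b
        s_abc <= s_ab + c_d1;    // stage 2: (a + b) + c
        sum   <= s_abc + d_d2;   // stage 3: ((a + b) + c) + d
    end
endmodule
```

Each stage adds one new operand to a running sum, which is the shape of a cascaded adder chain; the result for a given set of inputs appears on `sum` three clock edges later.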



Triple-Oxide Ultimate Power Optimization . . .

Reduce power without compromising performance.

Virtex™-5 FPGAs give you unbeatable power savings with the highest performance. The unique combination of 65nm process, second-generation Triple-Oxide technology, ExpressFabric™ architecture, and power-optimized hard IP extends the 1 to 5 Watt power advantage delivered by previous-generation Virtex FPGAs. Achieve higher reliability and a smaller form factor. Save cost on power supplies, heat sinks, and fans. All this, plus the industry’s highest performance. No other FPGA vendor comes close.

[Chart: Power vs Performance. Total power against performance for a competing FPGA, a Virtex-4 FPGA, and the available Virtex-5 FPGA, contrasting performance limited by the power budget with maximum performance within the power budget. Note: Under worst-case operating conditions (85°C)]

Meet performance targets within your power budget

Our Triple-Oxide technology optimizes multiple oxide thicknesses to control leakage and keep static power on par with 90nm Virtex FPGAs while maximizing performance. New 65nm ExpressFabric architecture with real 6-input LUTs and diagonally symmetric routing reduces dynamic power by at least 35%. With power-optimized hard IP and automated, block-based power control, you can save even more. With Virtex-5 FPGAs, you can meet your most aggressive performance and power targets. No compromises.

Visit www.xilinx.com/virtex5/power, view the Virtex-5 power webcast, download the XPower Estimator tool, and read the power analysis white paper.

The Programmable Logic CompanySM

www.xilinx.com/virtex5/power

Virtex-5 LX is the first of four platforms optimized for Logic, DSP, processing, and serial connectivity.

The Ultimate System Integration Platform
PERFORMANCE

Maximizing Design Performance for Virtex-5 FPGAs
ISE software gives you the tools to achieve the timing goals of a Virtex-5 design.

by Michelle Fernandez
Software Technical Marketing Engineer
Xilinx, Inc.
michelle.fernandez@xilinx.com

As FPGAs push the performance envelope, maximizing design performance requires knowledge of the device architecture and design software. The 65-nm Xilinx® Virtex™-5 FPGA family delivers the industry’s highest performance, with new ExpressFabric™ technology, diagonally symmetric routing, enhanced on-chip memory, DSP slices, and high-speed I/O.

To maximize system performance, you should use proper design techniques such as defining timing constraints and selecting options in synthesis and implementation that work best for your design. In this article, I’ll describe how to achieve faster timing in the fewest design iterations.

Understanding the Architecture
When evaluating a new FPGA architecture like the Virtex-5 family, it is important to study the user guide and data sheet to understand the hardware features.

The Virtex-5 FPGA family is based on a new ExpressFabric architecture that delivers higher speeds, a new 6-input LUT structure that reduces logic levels, and diagonally symmetric routing that minimizes delays. Each CLB contains two slices that have four 6-input LUTs and four registers configurable in many ways. For maximum slice packing, it is imperative that you understand the slice interconnectivity and any shared resources.

Virtex-5 FPGAs contain hard IP such as embedded memory (block RAM) and math functions (DSP48E slices) tuned to 550 MHz. Here are some design considerations if any of these hard-IP blocks show up as part of your critical paths:

• Check to see if your design is making the most of the block’s features and that the synthesis tool is inferring the features as expected from your RTL.

• When using the embedded block RAM memory or the DSP48E slices, it is important to use their dedicated pipeline registers when possible to reduce setup and clock-to-out timing.

• Another consideration is the mix of block RAMs or DSP48E slices in the design, and the trade-off between using dedicated blocks or implementing the same function in slices to allow for placement flexibility.

The choice of clocking resources can also affect a design’s performance. Virtex-5

28 Xcell Journal Fourth Quarter 2006



FPGAs have I/O, regional, and global clocking resources. These devices are divided into multiple clock regions, which at most can contain 4 regional clocks and 10 global clocks. During design planning, it is important to analyze how many clock regions you plan to use as well as specific clocks within a clock region. Placing your I/Os so that their interface logic does not require all of the clock resources in a clock region gives ISE™ software greater placement flexibility.

Define Timing Requirements
Synthesis and ISE implementation tools are driven by the performance goals that you specify with timing constraints for internal clock domains, I/O paths, multi-cycle paths, and false paths (see Figure 1). Defining realistic timing constraints will prevent excessive replication and longer run times.
In your synthesis report, check for any replicated registers and confirm that the timing constraints that apply to the original register also cover the replicated registers for implementation. When writing timing constraints, group the maximum number of paths with the same timing requirement before generating a specific constraint to minimize implementation run times and memory usage.

Driving Synthesis
Here are some design considerations for getting optimal results from synthesis tools:

• Use proper coding techniques to ensure that the inference of your RTL by synthesis takes advantage of the architectural features.
• Add any lower level netlists to your synthesis project to better optimize the HDL that interfaces to those netlists.
• If critical paths in your implementation are not seen as critical in synthesis, try Synplify Pro’s “-route” constraint to force synthesis to focus on that path.
• Explore the synthesis tool settings. (See Figure 2 for Synplicity and Figure 3 for Xilinx Synthesis Technology [XST] suggested tool settings.) There are also a variety of attributes that can affect synthesis optimizations. These attributes are an easy way to affect synthesis without having to re-code (see Table 1).

Certain tool settings, such as retiming in Synplify Pro and register balancing in XST, can impact area. If your design is affected by high fan-out nets and you want the synthesis tool to reduce that fan-out, use fan-out attributes specifically on that net versus globally reducing the fan-out limit. Avoid maintaining hierarchy if critical paths cross over the hierarchical boundaries. Before implementation, review the warnings in your synthesis report.

Figure 1 – Proper timing constraints
Figure 2 – Recommended Synplify Pro settings
Figure 3 – Recommended ISE synthesis (XST) settings


Choosing Implementation Options
Having obtained an acceptable timing estimate from the synthesis tool, you can use the implementation tools to determine the true performance of the design. The ISE default mode is the performance evaluation mode, which enables you to get high-performance results out of your implementation tools without having to specify timing goals.
The next step is to run timing-driven mapping (MAP) and place and route (PAR). Timing-driven MAP performs closed-loop packing and timing-driven placement, while PAR performs the routing of the design. Both MAP and PAR should run with their effort levels set to high to achieve optimal results.
Physical synthesis options in implementation can re-optimize and pack logic based on knowledge of the critical paths of a design, leading to better placement and routing. The physical synthesis options are implemented during the MAP process and include global netlist optimization, localized logic optimization, retiming, register duplication, and equivalent register removal. Details on each of these options can be found in the Xilinx White Paper, “Physical Synthesis and Optimization with ISE 8.1i,” available at www.xilinx.com/bvdocs/whitepapers/wp230.pdf.

Xplorer Utility
Xplorer is a tool that helps to determine the set of implementation options that result in the best performance for a design. Xplorer has two modes: timing closure and best performance. The timing closure mode evaluates your timing constraints and tries different sets of implementation options to achieve those goals. In best performance mode, you can give the tool a clock domain to focus on; the tool will try to achieve the best frequency for the clock. This is helpful when benchmarking a design’s maximum performance.

Evaluating Your Critical Paths
By understanding the characteristics of your critical path, you can make better decisions about what to do for your next design iteration. A datapath comprises both logic and interconnect delay. Individual component delays that make up logic delay are fixed. You can reduce logic delay by reducing the number of logic levels or by redefining the structure of the logic.
In comparison, interconnect delay is much more variable and is dependent on the placement of the logic. Before running your design through PAR, a timing analysis after MAP is recommended. Although this timing report will only have estimates for your routing delays, it can give you an idea of the critical paths the implementation tools are working on. If the critical paths have a high number of logic levels, you may want to work on improving the logic levels versus running it through PAR.
If your design has an excessive number of logic levels:

1. Try the physical synthesis options in MAP.
2. Go back to synthesis and verify that critical paths reported in implementation match what is reported in synthesis.
3. Review the synthesis inference of your HDL code.

If there are few logic levels but certain datapaths are not meeting timing:

1. Evaluate fan-out on routes with long delay.
2. If the critical path contains hard-IP blocks such as block RAMs or DSP48E slices, verify that the design takes full advantage of the embedded registers. Also understand when to make the trade-off between using these hard blocks or using slice logic.
3. Analyze clock skew.
4. If the logic appears to be placed far apart, floorplanning of critical blocks may be required. Only floorplan where necessary.
5. If area groups were created for a design with a previous version of software or before many design changes, consider removing those area groups.
6. Consider placing hard-IP blocks such as block RAMs or DSP48E slices.

Conclusion
Virtex-5 FPGAs are optimized for high-performance designs, while ISE software has the capabilities you need to quickly achieve design closure, improve productivity, and efficiently verify your designs. Xilinx provides a comprehensive suite of software tools (powered by ISE Fmax technology) that improves design performance.
However, the more that you can do up front with good coding styles, defining timing constraints, and resource planning, the easier it will be for downstream tools to achieve your timing requirements.

Table 1 – Helpful synthesis attributes*

Function | XST | Synplify Pro
Fan-out Control | max_fanout | syn_maxfan
Directs Inference of RAMs to Block RAMs or SelectRAM | ram_style | syn_ramstyle
Directs Usage of DSP48 Slice | use_dsp48 | syn_multstyle/syn_dspstyle
Directs Usage of SRL16 | shreg_extract | syn_srlstyle
Controls % of Block RAMs Utilized | n/a | syn_allowed_resources
Preservation of Register Instances During Optimizations | Keep | syn_preserve
Preservation of Wires | Keep | syn_keep
Preservation of Black Boxes with Unused Outputs | Keep | syn_noprune

* You can find XST documentation at http://toolbox.xilinx.com/docsan/xilinx82/books/docs/xst/xst.pdf. Synplify Pro documentation is located in the tool help documentation.
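
The critical-path triage described in this article (logic delay plus interconnect delay, then a different checklist depending on the number of logic levels) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name, the delay figures, and the `max_levels` threshold are hypothetical, and the model ignores setup time and clock skew. It is not how ISE or any timing analyzer computes slack.

```python
def triage_path(clock_mhz, logic_ns, route_ns, logic_levels, max_levels=6):
    """Classify one datapath against a clock target.

    All inputs are hypothetical numbers you would read from a timing
    report; max_levels is an arbitrary illustrative threshold.
    """
    period_ns = 1000.0 / clock_mhz      # clock period in nanoseconds
    path_ns = logic_ns + route_ns       # datapath = logic + interconnect delay
    slack_ns = period_ns - path_ns      # negative slack means a timing failure

    if slack_ns >= 0:
        advice = "meets timing"
    elif logic_levels > max_levels:
        # Too many logic levels: revisit synthesis and the HDL
        advice = "reduce logic levels"
    else:
        # Few logic levels: interconnect delay dominates
        advice = "review fan-out, placement, and clock skew"
    return period_ns, slack_ns, advice


if __name__ == "__main__":
    # Hypothetical paths against a 250 MHz (4 ns) clock
    print(triage_path(250.0, 1.5, 2.0, logic_levels=3))
    print(triage_path(250.0, 3.0, 2.0, logic_levels=9))
    print(triage_path(250.0, 1.0, 3.5, logic_levels=2))
```

Running this kind of check on the post-MAP estimates, before PAR, mirrors the article's advice to find out early whether a failing path is a logic-level problem or a placement and routing problem.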


Clock Management
in Virtex-5 Devices
Virtex-5 FPGAs give designers fresh choices.

by Ralf Krueger
Sr. Staff Applications Engineer
Xilinx, Inc.
ralf.krueger@xilinx.com

As FPGAs grow in size, quality on-chip clock distribution becomes increasingly important. Clock skew and clock delay impact device performance; managing clock skew and clock delay with conventional clock trees becomes more difficult in large devices.
Traditionally, you would deploy solutions such as a Xilinx® Virtex™-4 digital clock manager (DCM) or mixed-signal phase-locked loop (PLL) to achieve clock tree deskew and frequency synthesis, among other functions. Yet each solution has its advantages and disadvantages.
In Virtex-5 devices, for the first time in an FPGA, both digital DCMs and analog PLLs are implemented side by side in a clock management tile (CMT). You can now select the clock management solution best suited for your particular applications.
Each Virtex-5 device has as many as six CMTs. A CMT contains two DCMs and one PLL. You can use either of the two DCMs or the PLL as a stand-alone module, or they can interact with each other. If used as a stand-alone module, the application requirements typically dictate which clock management solution to use. The DCM, for example, supports a fine phase shift, a dynamic phase shift, and a multiply/divide feature that does not depend on any maximum VCO frequency. However, the PLL filters input clock jitter, supports a wide range of output frequencies, including higher frequencies, and consumes less power.
The DCM and PLL are also designed to interact with each other. The PLL can help clean up input or output clocks to the DCM. Dedicated resources within each CMT make the connections and still guarantee a proper deskew of the FPGA clocks. The CMTs are located in the center column of the Virtex-5 architecture. This enables well-matched clock routes to and from every DCM or PLL for enhanced symmetry (see Figure 1).

DCM
Virtex-5 DCMs provide a zero propagation delay buffer, clock division and multiplication capabilities, fixed and dynamic fine phase shift, and multiple phases of the input clock. Along with fully differential global clock trees and low skew between output signals, the application’s various clocks are distributed efficiently throughout the device. Each DCM can drive as many as 9 of the 32 global clock routing networks within the device.
The global clock distribution network minimizes skews caused by loading differences. By monitoring a sample of the DCM output clock, the DLL compensates for the delay on the routing network, effectively eliminating the delay from the external input port to the individual clock loads within the device.
In addition to providing zero delay with respect to a user source clock, the DCM provides multiple phases of the source clock. The DLL can also act as a clock doubler or divide the user source clock by as much as 16. The DCM can also act as a clock mirror. By driving the DCM output off-chip and then back in again, the DCM can deskew a board-level clock between multiple devices.
Another submodule provides the ability to phase shift the DCM’s output clock in small increments (1/256th of the period). The versatile digital phase shift (DPS) operates in four different modes for maximum flexibility: fixed, variable-positive, variable-center, and direct. The DCM’s digital frequency synthesis (DFS) module provides two outputs, CLKFX and CLKFX180, which are derived from the input clock by frequency multiplication and division. You provide valid multiply (M) and divide (D) values, which the DFS implements through a frequency calculator. For example, if you provide an M value of 19 and a D value of 8, they would yield a 2.375 source-clock multiplier.

PLL
The CMT’s PLL is a mixed signal block designed to support clock network deskew,


frequency synthesis, and jitter reduction. The PLL block diagram in Figure 2 provides a general overview of the various components.
Input multiplexers (MUXs) are used to select the reference and feedback clocks from the global clock pins, global clock trees, or one of the DCMs. Each clock input has a programmable counter. This pre-scales the reference clock and allows a wide range of frequency synthesis.
The phase frequency detector (PFD) compares both phase and frequency of the input clock and the feedback clock. A signal is generated that is proportional to the phase and frequency error between the two clocks, which is then used to drive the charge pump and loop filter to generate a reference voltage to the VCO. An up or down signal from the PFD determines if the VCO should operate at a higher or lower frequency.
After the PFD determines that the input and feedback clocks are phase- and frequency-aligned, a lock signal is raised, indicating that the PLL output clocks are valid. The VCO continues to compensate for any variations in voltage or temperature. The M counter in the feedback path controls the feedback clock and multiplies the VCO frequency to the desired target frequency. The VCO output clock drives six output counters. Each can be independently programmed to generate a variety of frequencies for the application design.
Additionally, clock switchover, phase shifting, various duty cycles, and bandwidth control are also supported. You can dynamically select one of two input clocks before or during operation. In many cases, alternate phases of a clock are required. The VCO provides eight phase-shifted clocks at 45 degrees each. The higher the VCO frequency, the smaller the phase-shift resolution of the clocks coming out of the 0 counter.
You can individually program each output counter to provide a separately phase-shifted clock. The PLL can also generate non-50/50 discrete duty cycles in each output counter. The resolution and possible output duty cycles depend on the divide value. The higher the output divide value, the higher the resolution setting of the output duty cycle.

Conclusion
Virtex-5 FPGAs give digital designers a choice of either digital or analog clock management. Depending on your particular application, either module, or a combination of both modules, provides you with choices that you never had before.
Together with an abundance of clock tree resources, Virtex-5 devices greatly simplify and improve system-level designs involving high fan-out and high-performance clocks. Virtex-5 devices have powerful frequency synthesis, phase-shifting, and clock deskew capabilities never offered before in an FPGA. Along with comprehensive software support, you can achieve larger, faster, and more complex designs than in any previous-generation FPGA.

Figure 1 – Virtex-5 CMT block diagram (DCM1, PLL, and DCM2; inputs from the global clock input pins and global clock buffers; the DCM outputs and clkout_pll<5:0> drive the global clock buffers)
Figure 2 – PLL block diagram (inputs CLKIN1, CLKIN2, and CLKFB; clock switch circuit; lock detect and lock monitor with LOCK output; M counter; output counters O0 through O5; general routing)
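
The clock arithmetic quoted in this article (the DFS multiply/divide ratio, the 1/256th-of-a-period DPS increment, and the eight 45-degree VCO phases) can be checked with a short Python sketch. The function names, the 100 MHz input clock, and the 1,000 MHz VCO frequency are assumptions chosen for illustration; they are not device limits, and the sketch takes "the period" in the DPS description to mean the period of the clock you pass in.

```python
from fractions import Fraction


def dfs_multiplier(m, d):
    """Source-clock multiplier for multiply (M) and divide (D) values."""
    return Fraction(m, d)


def dps_step_ps(clock_mhz):
    """DPS increment in picoseconds: 1/256th of the clock period."""
    period_ps = 1e6 / clock_mhz
    return period_ps / 256


def vco_phase_step_ps(vco_mhz):
    """Spacing in picoseconds between the eight 45-degree VCO phases."""
    period_ps = 1e6 / vco_mhz
    return period_ps / 8


if __name__ == "__main__":
    mult = dfs_multiplier(19, 8)
    print(float(mult))                # 2.375, the article's example
    print(float(mult * 100))          # 237.5 MHz from a hypothetical 100 MHz input
    print(dps_step_ps(100.0))         # 39.0625 ps per step at 100 MHz
    print(vco_phase_step_ps(1000.0))  # 125.0 ps per 45-degree phase at 1,000 MHz
```

The last function shows why a higher VCO frequency gives a finer phase-shift step: the 45-degree spacing is a fixed fraction of a shorter period.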



P O W E R

Reduce Power
with Virtex-5 FPGAs
The world’s first 65-nm FPGAs offer the
lowest power without compromising performance.
by Derek Curd
Senior Staff Applications Engineer,
Advanced Products Division
Xilinx, Inc.
derek.curd@xilinx.com

With the introduction of the Virtex™-5


family, Xilinx is once again leading the
charge to deliver new technologies and
capabilities to FPGA consumers. The move
to 65-nm FPGAs promises to deliver bene-
fits traditionally associated with smaller
process geometries: lower cost, higher per-
formance, and greater logic capacity. And
although these benefits present exciting
opportunities for advanced system design-
ers, the 65-nm process node brings with it
new challenges.
Power consumption, for instance,
becomes increasingly important when
selecting an FPGA for your application.
More than likely, your next-generation
design will require you to integrate more fea-
tures and higher performance within a simi-
lar (or perhaps even smaller) power budget.
In this article, I’ll explore the benefits of
reduced power consumption. I’ll also illus-
trate the many process and architectural
innovations implemented in Virtex-5 devices
to offer you the lowest possible power solu-
tion without compromising performance.


Benefits of Reducing Power
Implementing a lower power FPGA design offers advantages beyond simply adhering to the device’s thermal operating requirements. Although meeting component specifications is obviously critical for performance and reliability, how you achieve this has a significant impact on system cost and complexity.
First, lowering FPGA power consumption allows you to use less expensive power supplies, which have fewer components and consume less PCB area. The implementation cost for a high-performance power system is typically between $0.50 and $1 per Watt. Lower power FPGA operation, therefore, contributes directly to lowering overall system cost.
Second, because power consumption is directly related to heat dissipation, lower power operation allows you to use simpler, less expensive thermal management solutions. In many cases, designs will not need heat sinks, or they will need smaller, less expensive heat sinks.
Finally, because lower power operation means fewer components and lower device temperatures, overall system reliability improves. A decrease of 10° C in device operating temperature can translate to a 2x increase in component life, which clearly illustrates the importance of controlling power and temperature for systems with high reliability requirements.

Power: Challenges and Solutions
Total power in an FPGA (or any semiconductor device) is the sum of two components: static power and dynamic power. Static power results primarily from transistor leakage current, the small current that “leaks” from either source-to-drain or through the gate oxide of the transistor even when it is logically “off.” Dynamic power is the power consumed during switching events in the core or I/O of the device and is therefore frequency-dependent.

Static Power
As you shrink transistor size (for example, move from 90-nm to 65-nm devices), leakage current tends to increase. The shorter channel lengths and thinner gate oxides generally used at the new process node make it easier for current to leak, either across the channel region or through the gate oxide of the transistor.
In the 90-nm Virtex-4 family, Xilinx introduced “triple-oxide” process technology, which gave Xilinx® circuit designers a tremendous tool to fight leakage. In older FPGAs, two gate-oxide thicknesses were used: a thin one for the high-performance, lower operating voltage transistors in the FPGA core, and a thicker one for the larger, high-voltage-tolerant transistors in the I/O blocks. Simply put, “triple oxide” refers to the addition of a third, medium-thickness gate oxide (or “midox”) transistor that has much lower leakage than the thin-oxide core transistor.
The “midox” transistor is used extensively in the core of the device for non-performance-critical circuits (like configuration memory) or circuits that do not require fast switching times in response to a changing gate voltage (like routing pass gates). The thin-oxide, highest leakage transistors are reserved only for the portions of the speed path that require very fast switching times. The net result is that total device leakage is dramatically reduced, while still offering a substantial performance improvement over previous-generation FPGAs.
The triple-oxide process allowed Virtex-4 devices to reduce static power consumption by an average of more than 70% relative to competing 90-nm FPGAs. The results were so successful that the Virtex-5 family again makes extensive use of this technology to reduce leakage at the 65-nm process node.
Figure 1 illustrates how the triple-oxide process enables 65-nm Virtex-5 devices to achieve comparable static power to similarly sized 90-nm Virtex-4 devices under worst-case (high-temperature) operating conditions, despite industry predictions that 65-nm devices would see a dramatic rise in static power consumption. Thus, the Virtex-5 family retains a substantial static power advantage over competing high-performance FPGAs.

Figure 1 – Static power comparison at 85° C (power in mW versus device density for Virtex-4 LX devices, XC4VLX15 through XC4VLX200, and Virtex-5 LX devices, XC5VLX85 through XC5VLX330; inset: Triple Oxide = Three Gate Oxide Thicknesses)

Dynamic Power
Dynamic power consumption presents other challenges for 65-nm FPGAs. The


equation governing dynamic power is:

dynamic power = CV²f

where C is the capacitance of the node switching, V is the supply voltage, and f is the switching frequency. The 65-nm process node enables FPGAs that have significantly greater logic capacity and higher performance than older devices. In other words, more nodes are switching at higher frequencies. All else being equal, this tends to increase dynamic power.
However, there is good news with respect to dynamic power at 65 nm. The core FPGA supply voltage (V) and node capacitance (C) generally reduce with each new process node, providing substantial dynamic power savings over previous-generation FPGAs.
In Virtex-5 devices, the core supply voltage (VCCINT) decreases from the 1.2V used in Virtex-4 devices to 1.0V. Node capacitance tends to decrease because of smaller parasitic capacitances (associated with the smaller transistors) and shorter, less capacitive interconnects between logic. Additionally, Virtex-5 devices use a reduced-K dielectric material between metal interconnect layers to minimize routing capacitance.
The estimated reduction in average node capacitance for Virtex-5 devices is 15% compared to Virtex-4 devices. Taken together with the voltage reduction benefit, this translates to a reduction of at least 35-40% in core dynamic power for Virtex-5 devices.
Although the “process shrink” to 65 nm provides an inherent 35-40% dynamic power reduction, architectural innovations in Virtex-5 devices offer additional power savings for every design. Most of the node capacitance that contributes to dynamic power is attributed to the routing or interconnect between logic functions. The new Virtex-5 architecture fundamentally reduces routing capacitance in two ways:

• Virtex-5 configurable logic blocks (CLBs) are based on a six-input look-up table (6-LUT) logic architecture, as opposed to the 4-LUT architecture used in older devices. This means that more logic is implemented within each LUT, translating to fewer levels of logic and thus a reduced need for higher capacitance routing between logic functions.
• The Virtex-5 routing architecture now includes diagonally symmetric routes, meaning that every CLB now has a direct “one hop” connection to all of its neighbors, including diagonal neighbors. When a connection is required between logic functions, it is now more likely that this connection is a less-capacitive “one hop” connection, whereas previous routing architectures may have required two or more hops for the same connectivity.

Together, the 6-LUT architecture and improved routing pattern reduce core dynamic power by lowering average node capacitance beyond the level achieved purely from 65-nm process scaling. Figure 2 shows the core dynamic power measurements from a benchmark design in which a Virtex-5 device and a Virtex-4 device are each filled with 1,024 8-bit counters. These actual silicon measurements illustrate that the combined process and architectural benefits to dynamic power reduction can exceed 50%.

Figure 2 – Dynamic power comparison of counter benchmark design (power in mW versus frequency in MHz, 0 to 250 MHz, for Virtex-4 and Virtex-5 FPGAs; annotation: -55%)

Hard IP Blocks
Virtex-5 devices contain more hard IP blocks (circuitry dedicated to commonly used functions) than any other FPGA in the industry. FPGA designs that utilize these blocks see additional power savings in comparison to implementing these functions in general-purpose FPGA logic.
Unlike the FPGA fabric, these dedicated blocks contain only the transistors necessary to implement the required function. And there are no programmable interconnects, so routing capacitance is as small as possible. Fewer transistors and lower node capacitance benefit both static and dynamic power consumption. The net result is that these dedicated blocks can perform the same function in as little as one-tenth the power of an equivalent implementation using the general-purpose FPGA fabric.
In addition to adding new types of dedicated blocks, many blocks that existed in Virtex-4 devices have been redesigned in Virtex-5 devices to add features, improve performance, and reduce power. For example, the 18-Kb block RAM memories in the Virtex-4 family have been sized up to 36-Kb block RAMs in Virtex-5 devices; each of these block RAMs can be broken into two independent 18-Kb memories for backward compatibility to Virtex-4 designs.
Interestingly, from a power perspective, each of the 18-Kb sub-blocks is constructed from two 9-Kb physical memory arrays. For the majority of block RAM configurations, any given read or write request to the block RAM only needs to access one of the 9-Kb


physical memories at a time. The other 9-Kb memory can therefore be effectively “powered down” while it is not being accessed. This reduces power consumption by nearly an additional 50% beyond those reductions resulting from the 65-nm process migration. This “ping-pong” accessing of the 9-Kb blocks is inherent to the new block RAM architecture, meaning that no user or software control is required to take advantage of this capability. It occurs dynamically and automatically, providing dramatic power reductions for all designs that use block RAM without compromising block performance.
The dedicated DSP elements in Virtex-5 devices have also received a significant design overhaul to incorporate more functionality at higher performance and lower power consumption. In a slice-versus-slice comparison, the new Virtex-5 DSP slice has roughly 40% lower dynamic power consumption relative to the Virtex-4 DSP slice. This is mostly attributable to the voltage and capacitance scaling factors of the 65-nm process discussed earlier.
However, because the new Virtex-5 DSP slice has greater functionality and wider interfaces, many DSP operations experience even greater dynamic power reduction by taking advantage of these additional capabilities. In many cases, you can achieve dynamic power reductions as high as 75% when utilizing the full capability of the new DSP slice. If you are not designing a DSP application, keep in mind that you can use the DSP slices for many standard logic functions (counter, adder, barrel shifter) at a substantial power savings compared to implementing the same function in standard FPGA logic.
As a final example of redesigned dedicated blocks, the LXT platform of the Virtex-5 family includes integrated multi-gigabit serial transceivers, running at rates as fast as 3.125 Gbps. These “SERDES” blocks are implemented with an emphasis on reducing power consumption. Each full-duplex transceiver in a Virtex-5 LXT device consumes less than 100 milliwatts of total power at 3.125 Gbps, representing roughly a 75% reduction relative to Virtex-4 serial transceivers.

Conclusion
Xilinx has a long history of innovation dating back to the invention of the first FPGA more than 20 years ago. So it is no surprise that Xilinx was the first FPGA company to make reducing power a top priority in deep sub-micron technologies.
As with the Virtex-4 family, Virtex-5 devices employ a number of process and architectural innovations aimed at offering the lowest possible power consumption, while still enabling performance increases of 30% or more.
As Figure 3 illustrates, with static power levels comparable to Virtex-4 devices, the Virtex-5 family provides a clear advantage relative to competing FPGAs. As the only available 65-nm FPGA, Virtex-5 devices also offer a minimum of 35-40% core dynamic power reduction over other high-performance FPGAs on the market. Architectural innovations such as the new 6-LUT and diagonally symmetric routing are likely to enable actual core dynamic power savings up to 50% or more. And taking advantage of the unprecedented level of dedicated blocks lowers power consumption even further.
To find out more about how you can harness the low power of Virtex-5 devices, visit www.xilinx.com/power.

Figure 3 – Power comparison between available FPGAs for a typical design (power in watts at 85° C; static power and core dynamic power for competing FPGAs and Virtex-4 FPGAs at 90 nm and Virtex-5 FPGAs at 65 nm; annotations: -70%, -10%, -35 to -40% minimum)

Xilinx Power Estimator (XPE)
Introduced in January 2006, the Xilinx® Power Estimator (XPE) spreadsheet-based power tool supports the Virtex™-4 and now Virtex-5 and Spartan™-3E FPGA families. XPE was designed to replace the Web Power tool as the premier pre-design power estimation tool for all new Xilinx FPGA families. The key advantages of XPE over previous power estimation tools are an improved user interface, improved accuracy, and better presentation of important data.
XPE’s summary page displays a complete summary of power usage, first by resource type and then by voltage supply. You can use the navigation buttons on the summary page to access more detailed information. XPE automatically displays several graphs to complete the power usage picture.
Since the initial release, Xilinx has introduced newer versions of XPE that include many additional features and improvements in accuracy. These versions, plus those supporting Virtex-5 and Spartan-3E devices, are available at www.xilinx.com/power.

– Kevin Bixler
Power Tools Product Marketing Engineer
Xilinx, Inc.
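
As a rough check on the dynamic power figures in this article, the relation dynamic power = C × V² × f can be applied to the numbers quoted for the move from Virtex-4 to Virtex-5 devices: core voltage dropping from 1.2V to 1.0V and average node capacitance falling by an estimated 15%. The Python sketch below is a first-order estimate only. The function names are hypothetical, and it assumes the same switching frequency and the same number of switching nodes in both devices.

```python
def dynamic_power(c_farads, v_volts, f_hertz):
    """Dynamic power of a switching node: C * V^2 * f."""
    return c_farads * v_volts ** 2 * f_hertz


def dynamic_power_reduction(c_scale, v_old, v_new):
    """Fractional reduction in dynamic power at a fixed frequency.

    c_scale is the new capacitance relative to the old one
    (0.85 for a 15% reduction).
    """
    scale = c_scale * (v_new / v_old) ** 2
    return 1.0 - scale


if __name__ == "__main__":
    # Voltage scaling alone, 1.2V to 1.0V: roughly a 31% reduction
    print(round(dynamic_power_reduction(1.0, 1.2, 1.0), 3))
    # Voltage scaling plus 15% lower node capacitance: roughly 41%
    print(round(dynamic_power_reduction(0.85, 1.2, 1.0), 3))
```

The combined estimate of roughly 41% sits just above the article's figure of at least 35-40%, which suggests the published range is a conservative rounding of the same voltage and capacitance scaling.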


Applying Compact
Thermal Models
Xilinx gives you one more tool in the FPGA thermal management toolbox.


by Abu Eghan
Principal Engineer
Xilinx, Inc.
abu.eghan@xilinx.com

Although Xilinx has made substantial


progress at the silicon level to reduce static
and dynamic power in FPGAs, each suc-
cessive family takes advantage of reduced
feature sizes to increase transistor density
and performance – thus resulting in ther-
mal concerns for the top end of the family.
You should not underestimate the impor-
tance of power consumption mitigation
for these devices.
Designers cannot afford inaccurate tem-
perature predictions when power consump-
tion is high and thermal budget margins are
low. Flip-chip packages used in high-per-
formance FPGAs have multiple heat-flow
paths and are thermally efficient. Using the
basic “one-resistor” figure of merit thermal
resistance – Theta-ja (Θja) – in estimating
temperature does not do justice to the ther-
mal efficiency of the packages.
Thus, a need exists for an alternate and
more accurate approach to obtain Tj predic-
tions on these components in an end-user’s
environment. This is where the boundary
condition-independent compact thermal
model (BCI-CTM) becomes useful. You can
conveniently use these models to make faster
and more accurate Tj predictions.
In this article, I’ll discuss better ways to
predict temperature for these faster and
denser FPGA components in a system envi-
ronment. I’ll also introduce the availability
of and support for compact thermal models
for Virtex™-4 and Virtex-5 devices as one
way to help system designers and compo-
nent selectors estimate temperatures in the
pre-design and implementation phases.


Motivation for Better Predictive Models
In a specific system implementation, the actual component Tj may be different from the arithmetic predictions using the published Θja. The prediction depends on the environment and the prevailing conditions in the system. The following equation governs the relationship:

Θja = (Tj – Ta) / P

Or, stated in Tj prediction form:

Tj = Ta + P * Θja

where
Θja is the thermal resistance between the device junction and ambient
Tj = junction temperature of the device
Ta = ambient temperature
P = package power dissipation

Although you can easily determine Tj, Ta, and P, representing the thermal resistance in an application is not easy, particularly for packages with multiple thermal paths. The single parameter Θja is strongly influenced by the application environment and therefore does not represent a suitable thermal resistance.

Theta-ja – The Misunderstood Model
Theta-ja has become the base thermal parameter most engineers gravitate toward when estimating component Tj with known Ta. But for a more demanding, higher wattage component on a large multilayer system board – particularly with other components around it – this approach often leads to an erroneous prediction of Tj.
In a design with loose margins in the thermal budget, the simple prediction using published Θja data may not be an issue. Indeed, it will likely lead to a system running at a lower than predicted Tj, because most common board types are more efficient than the largest standardized thermal board. Increasingly, with higher wattage components where margins are tight, “conservative” data may be the difference between selection and rejection of the component in a specific program.
The key point here is that Θja was not meant to be used in these types of predictions. JEDEC, the semiconductor engineering standardization body of the Electronic Industries Alliance, explains in EIA/JESD51-2 that “the intent of Theta-ja measurements is solely for a thermal performance comparison of one package to another in a standardized environment. This methodology is not meant to and will not predict the performance of a package in an application-specific environment.”
A typical implementation of a one-cubic-foot Θja still-air standardized environment is depicted in Figure 1. This is clearly not a typical system environment. Ideally, you should use these numbers to compare package efficiency, reserving any serious Tj prediction for other tools using models that are more relevant.

Figure 1 – The Analysis Tech implementation of Theta-ja standardized environment

To illustrate the pitfalls and potential discrepancies in Tj predictions, let’s look at a Virtex-5 flip-chip component – XC5VLX50T-FF1136. The published Θja is 10.8° C per Watt. Although the Tj prediction expression will suggest a 43.2° C above ambient for 4W dissipation, actual detailed simulation shows a much lower number – and thus suggests a lower effective Θja – of close to 5° C per Watt.
Table 1 shows the corresponding Tj for the same component dissipating 4W on various FR4 board sizes and layer counts. This illustrates the power of the environment or boundary conditions on the effective Θja, and the type of Tj prediction discrepancy that can result.

Xilinx 35 x 35mm FF1136-5VLX50T*, Tj by board size

Layer Count of        4" x 4"      10" x 10"    20" x 20"
Mounted Board**       Board        Board        Board
4                     68.2° C      64.3° C      –
8                     63.0° C      50.9° C      48.3° C
12                    60.4° C      47.0° C      45.7° C
16                    59.1° C      46.6° C      44.9° C
24                    –            45.3° C      44.0° C

* Single component considered at 25° C ambient
** All layers have 1 oz Cu with 80% coverage except outer layers that have 2 oz with 20% coverage.

Table 1 – Tj matrix for FF1136-XC5VLX50T on various boards

Note that while in general the effective Θja tends to be lower on larger board environments, it can also trend higher and under-predict Tj on small cards in confined places like PDAs or cell phones. The same rationale is at play – Θja is not boundary condition-independent. A component with Θja = 22° C per Watt on a JEDEC board can easily exhibit a 30° C per Watt-effective Θja on a 30 mm x 30 mm card.
Some application engineers have suggested that because most high-performance devices use denser and larger PC boards, component suppliers should provide Θja using a larger “JEDEC/network board” – a board that may be closer to network application boards. This seems like a good argument and should be advocated at the next JEDEC forum. However, regardless of the board used for data gathering, the prediction will be wrong for some applications. Additional JEDEC boards and standardized enclosures will only lead to more flavors of Θja, further confusing the issue. There ought to be a better way.

What Should an Engineer Do?
Engineers should view Θja with caution when predicting Tj in specific environments. Xilinx will continue to publish Θja and other thermal resistance data because those are the prevailing standards. They have their uses and should be deployed with their limitations in mind.
To address these limitations and to make more accurate Tj predictions in a system environment, a more refined model of the package is needed. Recognizing this need, Xilinx now supports compact thermal model data for high-performance FPGA devices.

What is a Compact Thermal Model?
A compact thermal model is a behavioral model that seeks to accurately predict the temperature of the package at selected nodes: junction, case, top, bottom, and balls, for example. It cannot predict the temperature at any other part of the package that is not predefined. It can be viewed as a reduced node abstraction of the response of a component to various boundary conditions. It is also more computationally efficient than the corresponding detailed model. These models are supplied for use in compatible computational fluid dynamics (CFD) tools for thermal simulations in place of detail models.
Xilinx offers two model types for FPGA products:

1. Two-resistor (2-R) compact models comprising the familiar Theta-jc and Theta-jb for the package. There is no geometrical information. Although 2-R models are useful and give better predictions than traditional Θja estimations, they are not as accurate as Delphi models.

2. A Delphi compact model comprising several thermal resistors that connect a junction node (representing the die) to several surface nodes. Thermal links are also allowed between the surface nodes. Figure 2 shows the topology for a flip-chip BGA Delphi compact model. The matrix of resistors has been optimized through a Delphi optimization algorithm so that they can be used in various environments without compromising prediction accuracy.

[Figure 2 – CTM topologies: the Delphi BCI-CTM topology for FCBGA (junction node linked to the TI, TO, BI, BO, and side nodes) and the two-resistor model (junction node with RJC and RJB)]

Table 2 depicts a typical Delphi half-matrix model for flip chip. The resistance data is usually saved along with the node definitions and package extents to complete the model.
JEDEC has proposed a neutral file format in XML for CTM distribution. Xilinx plans to support the format when CFD tools adopt and support it. In the interim, Xilinx is offering the CTM files in two CFD tool formats, Flotherm and Icepak, selected from a pre-introductory survey of Xilinx customers. These tools cover the majority of those end-users who answered the survey. If you do not use one of these tools, you can request ASCII data for manual or script-based entry into your tool.

Application Examples
Figure 3 shows a typical flow for a CTM application. Normally, the component CTM data is stored in a library; as the user, you will bring in the CTM data as a library item. You then specify the board attributes and boundary conditions of your assembly, adding other items like component power and heat contributions from other components for the Tj prediction.

[Figure 3 – CTM application schematics: the CTM component library (one or more CTM components, 2-R also accepted), the board definition (extent, layer details, cut-outs), the environment (T-ambient, airflow, heatsinks, heat pipes, space), and the input power Pd for the components feed the CTM tool (thermal solver), which outputs the Tj prediction and other predefined component temperatures]
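To make the one-resistor arithmetic from the earlier sections concrete, here is a minimal Python sketch. It applies Tj = Ta + P * Θja to the published 10.8° C per Watt figure and then inverts the same equation to back out the effective Θja implied by a few of the Table 1 entries. The numbers are copied from the text and Table 1; the function and variable names are hypothetical and not part of any Xilinx tool.

```python
# Sketch: one-resistor (Theta-ja) arithmetic, checked against Table 1.
# Values are copied from the article; the helper names are hypothetical.

def predict_tj(t_ambient_c, power_w, theta_ja):
    """Tj = Ta + P * Theta-ja (the one-resistor prediction)."""
    return t_ambient_c + power_w * theta_ja


def effective_theta_ja(t_junction_c, t_ambient_c, power_w):
    """Theta-ja = (Tj - Ta) / P, solved for a simulated or measured Tj."""
    return (t_junction_c - t_ambient_c) / power_w


T_AMBIENT_C = 25.0         # Table 1 footnote: 25 C ambient
POWER_W = 4.0              # dissipation used for Table 1
PUBLISHED_THETA_JA = 10.8  # C/W, published for XC5VLX50T-FF1136

# A few Tj entries from Table 1, keyed by (layer count, board size)
TABLE_1_TJ_C = {
    (4, '4" x 4"'): 68.2,
    (8, '10" x 10"'): 50.9,
    (24, '20" x 20"'): 44.0,
}

one_resistor_tj = predict_tj(T_AMBIENT_C, POWER_W, PUBLISHED_THETA_JA)
print(f"One-resistor prediction: {one_resistor_tj:.1f} C")

for (layers, board), tj in TABLE_1_TJ_C.items():
    theta = effective_theta_ja(tj, T_AMBIENT_C, POWER_W)
    print(f"{layers:>2} layers, {board} board: Tj {tj:.1f} C, "
          f"effective Theta-ja {theta:.2f} C/W")
```

On the 24-layer 20" x 20" board the effective value works out to 4.75° C per Watt, which matches the "close to 5° C per Watt" quoted in the text, while the 4-layer 4" x 4" board reproduces the published 10.8° C per Watt.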


Delphi Compact       Top Inner   Bottom Inner   Top Outer   Bottom Outer
Model (C/Watt)       (TI)        (BI)           (TO)        (BO)           Side
Junction             0.22        1.25           1.05        14.97          --
Top Inner (TI)                   16.14          4.45        --             --
Bottom Inner (BI)                               --          9.47           11.1
Top Outer (TO)                                              14.55          3.18
Bottom Outer (BO)                                                          4.72

Table 2 – Delphi CTM resistors for FF676-XC5VLX50

Let’s examine the benefits of CTM prediction in the cases below.

Case Study #1:
Battleboard Temperature Estimations
The “battleboard” is a 24-layer 20 x 16-inch board that Xilinx uses to assess signal integrity issues on Virtex components. In this case, lab measurements showed that a Virtex-4 component case temperature was well below what you would expect from the Θja prediction. A discrepancy of about 20° C was apparent. The Icepak CFD model using CTM inputs with radiation “on” showed a more realistic Tj – very comparable to those measured in the lab. Table 3 summarizes the observations (note that reported temperature is T-case).

Method                                     Conditions                                      Temperature
Hand calc Tj (1-R model)                   JA = 10.6° C/watt, Ta = 25° C                   67.4° C
Reported battleboard Tc                    Ta = 25 - 27° C                                 46.4° C
Icepak CFD Tj, 2-resistor model            JB = 2.6 and JC = 0.19° C/watt, Ta = 25° C      48.8° C
Icepak CFD Tj, Delphi model as published   Ta = 25° C                                      43.9° C

Icepak CFD solution time: < 4 min

Table 3 – “Battleboard” temperature predictions

Case Study #2:
Small Board with Multiple Components
A 3.75 x 2-inch board can illustrate the small board size effect and the influence of adjacent components on the Tj prediction of the component of interest. I have used the BCI-CTM approach with the aggressor components in active and inactive states to show the impact on Tj prediction.
The high-density interconnect (HDI) board used in this case is smaller than the JEDEC standard 2S2P board. The Xilinx XC3S1000-FT256 component deployed on this board has a Θja of 19.7° C per Watt. With a 20° C ambient, and without any board input, you can predict a Tj of 39.7° C (20 + Pd * Θja [100LFM]).
The BCI-CTM model with 100 linear-feet-per-minute (LFM) airflow shows a Tj of 55° C with Ta = 20° C for the single component on the board. The predicted Tj is already worse than the JEDEC prediction. This case is in line with what you would see with smaller components that use smaller, thinner boards in consumer products – PDAs, MP3 players, or GPS systems.
Figure 4 shows the board temperature contour with four chip-size package (CSP) components doing 0.25W each. The 8-mm-square CSP components used 2-R data (published jb and jc) in the model. Figure 4 shows the component temperatures. The XC3S1000-FT256 component yielded a Tj of 65° C – a further 10° C rise over the single component case. Both single and multi-component runs took less than four minutes on a conduction-based tool using the Delphi CTM for the Xilinx component. These predictions are clearly different from what the basic Θja parameter predicted.

Figure 4 – Contours of static temperature

Conclusion
Relying on the basic Θja metric to predict junction temperature for high-performance devices in a system is inadequate and can cause errors that could lead you to preclude a perfectly good component in your system. To address this shortcoming, Xilinx provides compact thermal models to assist in predicting Tj – in stand-alone calculations as well as system deployments.
You can use these models in CFD tools to make Tj predictions that take your environment and board conditions into consideration. Although you can accomplish the same predictions with a detailed package model, note that these CTMs offer reduced node benefits of faster solutions that are also computationally efficient.
You can download Virtex-4 and Virtex-5 CTM data at www.xilinx.com/xlnx/xil_sw_updates_home.jsp. Future high-performance FPGA products will have the Delphi models available for download as part of thermal collateral. Xilinx will also support legacy products such as Virtex-II and Virtex-II Pro devices, older Spartan™ FPGAs, and CPLD products on a by-request basis.
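For readers who want to see what a thermal solver does with a compact model, the following Python sketch solves a small resistor network for its steady-state node temperatures. It is an illustration only, not the Flotherm or Icepak method: the internal resistances follow one reading of the Table 2 half-matrix, while the surface-to-ambient resistances, the 4 W load, and all function names are hypothetical stand-ins for the boundary conditions a CFD tool would compute from your board and airflow.

```python
# Sketch: steady-state solve of a Delphi-style thermal resistor network.
# Internal resistances follow one reading of Table 2 (C/W); the
# surface-to-ambient resistances and the 4 W load are hypothetical.

def solve_network(links, to_ambient, power, t_ambient,
                  tol=1e-10, max_iter=200_000):
    """Gauss-Seidel solve of node temperatures (C).

    links      -- {(node_a, node_b): resistance in C/W} inside the package
    to_ambient -- {node: resistance in C/W} from a surface node to ambient
    power      -- {node: watts} injected at a node (normally the junction)
    Every node is assumed to have at least one link.
    """
    nodes = sorted({n for pair in links for n in pair}
                   | set(to_ambient) | set(power))
    neighbours = {n: [] for n in nodes}
    for (a, b), r in links.items():
        neighbours[a].append((b, 1.0 / r))
        neighbours[b].append((a, 1.0 / r))

    temps = {n: t_ambient for n in nodes}
    for _ in range(max_iter):
        largest_change = 0.0
        for n in nodes:
            g_amb = 1.0 / to_ambient[n] if n in to_ambient else 0.0
            g_total = g_amb + sum(g for _, g in neighbours[n])
            # Heat balance at node n: injected power plus conductance-weighted
            # temperatures of ambient and of the neighbouring nodes
            heat = power.get(n, 0.0) + g_amb * t_ambient
            heat += sum(g * temps[m] for m, g in neighbours[n])
            new_temp = heat / g_total
            largest_change = max(largest_change, abs(new_temp - temps[n]))
            temps[n] = new_temp
        if largest_change < tol:
            break
    return temps


# Junction (J) and surface nodes, resistances in C/W as read from Table 2
TABLE_2_LINKS = {
    ("J", "TI"): 0.22, ("J", "BI"): 1.25, ("J", "TO"): 1.05, ("J", "BO"): 14.97,
    ("TI", "BI"): 16.14, ("TI", "TO"): 4.45,
    ("BI", "BO"): 9.47, ("BI", "SIDE"): 11.1,
    ("TO", "BO"): 14.55, ("TO", "SIDE"): 3.18,
    ("BO", "SIDE"): 4.72,
}

# Hypothetical boundary conditions: each surface node to ambient, in C/W
HYPOTHETICAL_BOUNDARY = {"TI": 8.0, "TO": 20.0, "BI": 10.0,
                         "BO": 25.0, "SIDE": 60.0}

temps = solve_network(TABLE_2_LINKS, HYPOTHETICAL_BOUNDARY, {"J": 4.0}, 25.0)
for node, temp in temps.items():
    print(f"{node:>4}: {temp:.1f} C")
```

Changing only the boundary dictionary changes the predicted Tj while the package resistances stay fixed, which is the sense in which the model is boundary condition-independent. A two-resistor model is the same calculation with just two links, junction-to-case and junction-to-board.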



SERIAL CONNECTIVITY

A Multi-Gigabit Transceiver
for the Masses
The Virtex-5 GTP transceiver brings versatility,
ease of use, power efficiency, and cost-effectiveness
to high-volume mainstream applications.

by Gang Sun
Senior Product Marketing Manager, High-Speed Serial I/O
Xilinx, Inc.
gang.sun@xilinx.com

The incessant demand for ever-increasing bandwidth has led designers away from parallel buses and low-speed transceivers toward serial transceiver-based interfaces. High-speed signals solve many design challenges; they offer new levels of bandwidth and lower overall system cost and power consumption.
These successes have led engineers to believe that the industry can continue to lower overall cost and power simply by increasing transceiver speed indefinitely. However, going beyond 3 Gbps can in some cases lead to fundamentally different engineering challenges that make it harder to lower overall system cost and power consumption. The explanation is simple; maintaining signal integrity becomes increasingly difficult at ultra-high speeds, and the extra overhead can sometimes outweigh the benefits associated with increased data rates.

Transceivers in Transition
Figure 1 shows the frequency loss and crosstalk associated with a legacy backplane channel. At 1.6 GHz, the loss is reasonably manageable, making transceiver implementation at or below 3.2 Gbps relatively cost-effective and power-efficient.

Figure 1 – Channel S-parameter and crosstalk

However, at 3 GHz, the loss becomes significant. Consequently, the implementation of a 6 Gbps backplane transceiver requires different feature sets. You will likely need advanced techniques such as decision feedback equalization (DFE) to maintain signal integrity, and these advanced capabilities require a different set of optimized features.
This explains why a 3 Gbps transceiver typically consumes less than 100 mW per channel, whereas a DFE-enabled 6 Gbps transceiver consumes at least twice as much power. For applications requiring these advanced features, this extra power consumption is a worthwhile trade-off. But it becomes advantageous to offer both a low-power 3.2 Gbps transceiver and a high-performance transceiver for cutting-edge applications – in essence offering the best tool for the job.
At 5 GHz, the signal-to-noise ratio (SNR) becomes negative. In that case, you would have to redesign the entire backplane with more expensive materials and more sophisticated manufacturing technologies to enable 10 Gbps transmission. Consequently, achieving a 10 Gbps serial transmission over a backplane incurs a higher cost in terms of die area and power consumption.
The preceding example clearly shows that transceivers running at or below 3.2 Gbps are at a sweet spot; they are more cost-effective and power-efficient than both parallel interfaces and ultra-high-speed transceivers (running at 6 Gbps and 10 Gbps) for a large majority of interconnect applications. This phenomenon has led to two diverging trends in the transceiver market:

1. Bandwidth-hungry applications (such as a backplane interconnect for terabit routers) have needs for 6 Gbps and 10 Gbps transceivers. These applications continue to push the performance envelope while trading off cost and power.

2. High-volume applications are well served by transceivers running at or below 3.2 Gbps.

The Virtex-5 RocketIO GTP Transceiver
Xilinx clearly recognizes the different requirements of the high-performance market segment, noting that the high-volume market segments have in some cases conflicting requirements. The vast majority of serial protocols run at or below 3.2 Gbps; examples include PCI Express Generation 1, Gigabit Ethernet, XAUI, SATA I and II, Serial RapidIO, CPRI, OBSAI, and HD-SDI. Many emerging protocols such as JEDEC’s data converter interface and VESA’s DisplayPort also run at these relatively slow data rates. In reality, these established and emerging protocols represent more than 90% of current transceiver applications. Therefore, transceivers running at or below 3.2 Gbps are “transceivers for the masses.”
Xilinx has taken a truly innovative step and developed two different transceivers for its Virtex™-5 FPGA family. The first transceiver, the Virtex-5 RocketIO™ GTP transceiver, is designed for high-volume applications and covers data rates from 100 Mbps to 3.2 Gbps. Targeting the majority of system designers, the GTP transceiver is versatile, easy to use, power-efficient, and cost-effective.
The GTP transceiver is versatile because it is designed to support not only 8B/10B-based protocols such as the PCI Express Wrapper but also scrambling-based protocols such as SONET. (Table 1 is a complete list of applications supported by GTP transceivers.) Consequently, the spectrum of applications that can be supported by the GTP transceiver is limitless. In addition, validation and characterization of the GTP transceiver occurs in application-specific settings to ensure standards compliance. The combination of these design and characterization approaches ensures the universal appeal of the GTP transceiver.
The GTP transceiver is easy to use because it enjoys the support of the best FPGA CAD tools. The Xilinx® Virtex-5 RocketIO GTP transceiver wizard offers an intuitive GUI interface that allows you to select the GTP, clocking option, FPGA fabric interface, protocol stack, and encoding/decoding mechanism. After you have completed your selections, the tool generates a GTP wrapper with the necessary features.

Figure 2 – ChipScope IBERT console
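The pairing of channel frequencies with data rates in the "Transceivers in Transition" discussion (1.6 GHz with 3.2 Gbps, 3 GHz with 6 Gbps, 5 GHz with 10 Gbps) follows from a general property of two-level (NRZ) serial signaling: the fastest pattern, alternating ones and zeros, has a fundamental frequency of half the bit rate. The short Python sketch below reproduces that mapping; the function name is hypothetical.

```python
# Sketch: fundamental (Nyquist) frequency of an NRZ serial stream.
# An alternating 1010... pattern completes one cycle every two bits,
# so the channel loss that matters most sits at half the bit rate.

def nyquist_ghz(bit_rate_gbps):
    """Return the fundamental frequency in GHz for a bit rate in Gbps."""
    return bit_rate_gbps / 2.0


for rate in (3.2, 6.0, 10.0):
    print(f"{rate:>4.1f} Gbps -> read channel loss at {nyquist_ghz(rate):.1f} GHz")
```

This is why the loss read from the channel plot at 1.6 GHz, 3 GHz, and 5 GHz is the figure of interest for 3.2, 6, and 10 Gbps links respectively.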


The Xilinx ChipScope™ Analyzer offers self-testing capabilities for the GTP transceiver by leveraging the integrated bit-error-rate tester (IBERT) feature built into the transceiver. The ChipScope IBERT console is shown in Figure 2.
Among the advanced features offered by the ChipScope Analyzer are channel performance measurement capabilities, automated eye scan capability for finding the best Tx and Rx settings, and transceiver and link status reporting. This comprehensive set of tool offerings greatly simplifies design and manufacturing efforts based on the GTP transceiver and is a key enabler for a large variety of applications.
As PCB boards become increasingly crowded, transceiver power consumption becomes a critical issue. Therefore, power efficiency was one of the top design objectives for the GTP transceiver. Average power consumption per GTP transceiver is substantially below 100 mW. In some cases, per-transceiver power consumption is as low as 60 mW. The universal appeal of low-power requirements further enhances the competitiveness of the GTP transceiver for power-sensitive applications.
As high-volume applications start to use embedded transceivers, cost has also become an important consideration. Consequently, Xilinx offers certain solutions in hard logic rather than in look-up tables (LUTs). For example, a hard-coded PCI Express protocol stack includes a physical layer based on the GTP transceiver, a link layer, and a transaction layer. This approach significantly lowers overall solution costs, and the increased cost-effectiveness makes GTP transceiver-based solutions even more attractive to high-volume/high-margin applications.

Market          Standard                          Speed (Bits per Second)   Key Features
Telecom         OC-3/SDH STM-1                    155 Mbps                  • FIFOs can be Bypassed for Synchronous Operation
                OC-12/SDH STM-4                   622 Mbps
                OC-48/SDH STM-16                  2.488 Gbps
                OBSAI (Issue 1.0)                 768 Mbps
                                                  1.536 Gbps
                                                  3.072 Gbps
                CPRI (Version 2.0)                614 Mbps
                                                  1.228 Gbps
                                                  2.457 Gbps
                SFI-5                             2.448 -3.125 Gbps         • Synchronous Clocking (Bypass FIFOs)
Datacom         1G Ethernet (802.3z D5.0)         1.25 Gbps
                XAUI (802.3ae D5.0)               3.125 Gbps                • Loss of Signal (LOS)
                10G Base CX-4                     3.125 Gbps (x4)
Computing /     PCI Express Specification         2.5 Gbps                  • Tx Receive Detect
Communication   (Rev 1.1)                                                   • Loss of Signal (LOS)/Idle state detect
                                                                            • Low Power States and OOB Beacon
                                                                            • Ground Referenced Termination
                Serial Rapid IO                   3.125 Gbps                • Supports All Data Rates from 1.25-3.125G
                InfiniBand                        2.5 Gbps
Storage         Fibre Channel (Rev4.0)            1.0625 Gbps               • Rate Negotiation, Allows Tx and Rx to Operate at Different Speeds
                                                  2.125 Gbps
                SATA (Rev1.0a)                    1.5 Gbps                  • Rate Negotiation for Gen1/Gen2
                                                  3.0 Gbps                  • LOS and Out-of-Band Signaling Beacon
                SAS (Rev5)                        1.5 Gbps
                                                  3.0 Gbps
Video           SDI                               143 Mbps                  • Internal AC Coupling Caps can be Bypassed for Video Standards
                                                  176 Mbps
                DVB-ASI                           270 Mbps                  • 2.97G is the New HD-SDI Standard in Development

Table 1 – Applications supported by GTP

Conclusion
Transceivers at 3.2 Gbps or below are at a sweet spot; the vast majority of transceiver-based applications fall into this data-rate range. With its versatility, ease of use, power efficiency, and cost-effectiveness, the Virtex-5 GTP transceiver from Xilinx is ideally positioned in this market, a true multi-gigabit transceiver for the masses.


Introducing the Virtex-5 PCI Express Endpoint Block
With PCI Express quickly becoming the standard high-bandwidth interconnect,
the Virtex-5 LXT PCIe Endpoint block enables a configurable single-chip solution.

by Doug Kern
Staff System Design Engineer
Xilinx, Inc.
doug.kern@xilinx.com

Currently dominating the desktop PC motherboard and graphics markets, the PCI Express (PCIe) interconnect is poised to supplant PCI and PCI-X as the dominant high-bandwidth interconnect for the server, enterprise, mobile, workstation, networking, communications, industrial control, and medical equipment markets.
With more than 58 form factors, including Express Card, Advanced TCA, Compact PCI Express, Com Express, and a cable spec, the PCIe protocol is becoming ubiquitous. The PCI Special Interest Group (PCI-SIG) maintains the PCIe specification (along with the PCI and PCI-X specifications) and holds compliance workshops.
The PCIe subsystem is a point-to-point interface that replaces and overcomes the limitations of bus-based PCI and PCI-X standards. PCIe Generation 1 (Gen1) offers 2.5 Gbps speed with low-voltage differential signaling (LVDS), embedded 8B/10B encoding, dual-simplex signaling, and message-based serial protocol.
With plans in place to increase bandwidth to 5 Gbps in Generation 2 and 10 Gbps in Generation 3, the PCIe bus is expected to be the dominant high-bandwidth interconnect for several years to come. (For more information on the PCIe specification or compliance information, visit www.pcisig.com.)
With scalable lane widths from x1 to x32 lanes and advanced features such as traffic classes, virtual channels, hot-plug, and power management, the Xilinx PCIe block provides support for a wide range of applications, from a simple upgrade from PCI to an x1 PCIe endpoint device to advanced high-bandwidth x8 PCIe communications endpoint devices.
Figure 1 shows the topology of a PCIe system. The CPU is connected to a root device and is responsible for configuring and enumerating all plug-and-play PCI Express endpoint devices in a system. Because the PCIe system is point-to-point, switch devices are necessary to grow the number of devices or endpoints in a system. A switch has one upward facing port and numerous downward facing ports. These downward facing ports connect to the working devices or endpoints of a system.
Although only one root exists in a system, there are one or more endpoint devices. For example, a standard PC motherboard provides three to seven expansion PCIe slots. With the integrated PCI Express Endpoint block, Xilinx® Virtex™-5 LXT FPGAs allow you to rapidly develop and deploy high value-added PCIe endpoint devices. The numerous value-added endpoint designs are the target applications for the FPGA-based configurable Virtex-5 LXT PCI Express Endpoint block.

The Virtex-5 LXT PCIe Endpoint Block
The Virtex-5 LXT PCIe Endpoint block (see Figure 2) implements the physical layer (PHY), data link layer (DLL), transaction layer (TL), and configuration layers of a PCIe endpoint device. The implementation of a small reset circuit and clock generation blocks requires you to use the FPGA fabric.
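As a rough guide to what the Gen1 figures mean for a design, the Python sketch below converts the 2.5 Gbps per-lane line rate into payload bandwidth for the x1 to x8 widths. It assumes only the 8B/10B encoding named in the text, which carries 8 data bits in every 10 line bits; packet headers, data link layer traffic, and flow control are not modeled, so real throughput will be lower. The function and constant names are hypothetical.

```python
# Sketch: PCIe Gen1 payload bandwidth per direction after 8B/10B encoding.
# Protocol overhead (TLP headers, DLLPs, flow control) is not modeled.

GEN1_LINE_RATE_GBPS = 2.5        # per lane, per direction
ENCODING_EFFICIENCY = 8 / 10     # 8B/10B: 8 data bits per 10 line bits


def payload_gbps(line_rate_gbps, lanes):
    """Data bandwidth in Gbps for one direction of a multi-lane link."""
    return line_rate_gbps * ENCODING_EFFICIENCY * lanes


for lanes in (1, 2, 4, 8):
    gbps = payload_gbps(GEN1_LINE_RATE_GBPS, lanes)
    print(f"x{lanes}: {gbps:.1f} Gbps ({gbps * 1000 / 8:.0f} MB/s) per direction")
```

Because the link is dual-simplex, the same figure applies independently in the transmit and receive directions.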


The PCI Express Endpoint block capabilities include:

• Compliance with the PCI Express base specification, revision 1.1
• Choice of PCI Express Endpoint block or legacy PCI Express Endpoint block implementation
• x8, x4, x2, or x1 lane width
• Easy-to-use user interface similar to the familiar Xilinx LocalLink interface
• Integration of RocketIO™ GTP transceivers
• Spread-spectrum clocking support
• Low power operation
• Power management support
• Ability to use on-chip block RAMs for buffering
• Fully buffered transmit and receive
• Management interface to access PCIe configuration space and internal configuration
• Support for full range of maximum payload size (128 to 4,096 bytes)
• Capable of as many as two virtual channels (VCs)
• VC arbitration: round robin, weighted round robin, or strict priority
• 6 x 32-bit or 3 x 64-bit base-address registers (BARs) or a combination of 32-bit and 64-bit BARs
• BARs configurable for memory or I/O
• Memory BAR checking/filtering
• Non-memory transaction layer packet (TLP) ID checking/filtering
• Implements one PCI Express function
• Signals to the programmable fabric for statistics and monitoring
• Full documentation and reference example design

Virtex-5 GTP transceivers interface to the serial differential electrical signals of the PCIe specification. The PCIe block completes the physical logic that provides lane deskew. The DLL is responsible for data integrity and implements a user-configurable-sized retry buffer to retransmit packets that are received incorrectly without re-requests from the applications software. The TL provides Tx and Rx buffers and orders the packets to be transmitted. With capability for eight traffic classes and two virtual channels, great flexibility for packet arbitration is available.

[Figure 1 – PCI Express topology: a CPU attached to the PCI Express root complex, which connects to memory, x16 graphics, and switches; the switches fan out to x8, x2, and x1 endpoints, a legacy endpoint, and a PCI bridge. The endpoints are the Virtex-5 PCIe Endpoint block applications, and the system can be open or closed.]

[Figure 2 – Xilinx Virtex-5 LXT PCI Express Endpoint block: the PCIe block contains the transaction layer, data link layer, physical layer (PL lanes), and the configuration and capabilities module. It connects to block RAMs through the block RAM interface, to the GTP transceiver(s) through the transceiver interface, and to the user application through the transaction layer interface. It also exposes management, hot plug and power management, configuration and status, and clock and reset interfaces, with optional miscellaneous logic and a clock and reset block in the fabric.]

High-Level Integration
The Virtex-5 PCI Express Endpoint block allows you to implement a single endpoint device with one FPGA while leaving almost all of the FPGA programmable fabric available for value-added endpoint application design functionality. The combination of the PCIe block, GTP transceivers, and the block RAM incorporates the majority of the logic required for a low-power, high-bandwidth, configurable PCIe endpoint port. The GTP transceivers support Gen1 2.5 Gbps serial rates while being electrically compliant to the PCIe specification. Some of the new transceiver features include power management support such as beacon and electrical idle detect and the spread-spectrum reference clock required in PC system motherboards.
The block RAM provides a scalable, user-configurable retry memory along with Tx and Rx FIFO for any packet size supporting one or two virtual channels. With complex configuration options of the PCIe block, GTP transceivers, and block RAMs, along with clock and reset logic, software automation provides quick and accurate configuration and interconnect of these functions.

Using the PCI Express Endpoint Block
The Xilinx PCIe LogiCORE™ solution delivers wrappers through the CORE Generator™ software GUI flow of the ISE™ tool, which makes it easy to use the PCIe block and still provides full flexibility of the configurable features and capabilities of the high-bandwidth PCI Express Endpoint block. The configuration capabilities of the PCIe block are abstracted into several self-checking and instant-feedback menus that walk you through the configuration of key design elements (Figure 3).

Figure 3 – Xilinx CORE Generator GUI

The LogiCORE wrappers connect the PCIe block with the GTP transceivers and block RAMs. They also create and connect the clock and reset logic blocks to the PCIe block. You can customize the clock and reset RTL blocks to address advanced system requirements. In addition to connecting the various hardware resources to the PCIe endpoint block, the advanced wrapper provides value-added features such as the user-friendly interface similar to Xilinx LocalLink, the memory BAR, and non-memory TLP ID checking and filtering.

Ready to Provide High Bandwidth
The launch of the Virtex-5 PCIe block in the Virtex-5 LXT device not only provides the block and PCIe wrapper support, but also includes extensive system design aids. Memory endpoint and programmed input/output (PIO) reference designs are included in the deliverables. These designs serve as a training aid, quickly bringing up a simple user application to test in hardware.
In addition, the successful PCI-SIG compliance achieved with this design greatly speeds up the path to compliance for the users of the PCIe block in Virtex-5 devices. Xilinx provides a suite of hardware reference boards such as the ML523 characterization board, ML505 embedded design reference board, and the ML555 x8 high-data-bandwidth PCIe board to enable you to build and test PCIe systems (Figure 4).

Figure 4 – ML523, ML555, and ML505 Virtex-5 LXT PCIe hardware reference boards

Compliance testing requires a complete design with a demonstrable function, hardware board, software device driver, and application software to demonstrate interoperability with PC motherboard systems. The memory endpoint and PIO reference designs provide all of the above. Sample device drivers for Windows XP, Windows Server 2003, Windows Vista, or Linux are available by request. As the reference design function emulates a memory aperture, a simple test is provided to demonstrate the design operation.
This Virtex-5 LXT device and Xilinx reference hardware boards are listed on the PCI-SIG integrators list. For PCI-SIG-compliant Xilinx solutions, visit www.pcisig.com/developers/compliance_program/integrators_list/pcie.

Conclusion
The Virtex-5 LXT PCI Express Endpoint block, combined with the GTP transceivers and block RAMs, provides an extremely high level of integration for you to efficiently and quickly build high-performance, fully compliant PCIe systems in a single device. Xilinx developed and delivered the Virtex-5 PCI Express solution with a focus on end-user requirements for ease of use, legacy design migration, powerful yet flexible features and capabilities, system-level compliance, and cost.
For more information, visit www.xilinx.com/virtex5.
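The capability list allows 6 x 32-bit BARs, 3 x 64-bit BARs, or a combination. The reason those numbers line up is a general PCI rule rather than anything specific to this block: an endpoint has six 32-bit BAR register slots, and a 64-bit BAR occupies two consecutive slots. The Python sketch below checks whether a proposed mix fits; it is an illustration of that rule with hypothetical function names, not the CORE Generator tool's own validation logic.

```python
# Sketch: does a mix of 32-bit and 64-bit BARs fit in six register slots?
# General PCI rule: a 64-bit BAR consumes two consecutive 32-bit slots.

BAR_SLOTS = 6


def bar_slots_used(bar_widths):
    """Count the 32-bit register slots consumed by a list of BAR widths."""
    slots = 0
    for width in bar_widths:
        if width == 32:
            slots += 1
        elif width == 64:
            slots += 2
        else:
            raise ValueError(f"unsupported BAR width: {width}")
    return slots


def bars_fit(bar_widths):
    """True when the requested BARs fit in the available slots."""
    return bar_slots_used(bar_widths) <= BAR_SLOTS


for request in ([32] * 6, [64] * 3, [64, 64, 32, 32], [64, 64, 64, 32]):
    verdict = "fits" if bars_fit(request) else "does not fit"
    print(f"{request}: {bar_slots_used(request)} slots, {verdict}")
```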

Fourth Quarter 2006 Xcell Journal 47


30% faster than last year’s model…

Here are 6 of the new, faster, bigger, Virtex-5 FPGAs on a 12 Million
ASIC Gate Board that offers unmatched performance to ASIC
Prototypers, IP Designers, and FPGA Developers. The V5 65nm process,
with 6 input LUT and advanced interconnect, enables 30% faster clock
speeds in your application. The Dini DN9000k10PCI captures this
performance on an easy to use board with these handy features:

• 33/66 MHz PCI bus or stand-alone operation
• 6 - DDR2 SODIMM Sockets
• 7 - Global Clock Networks
• 3 - 400pin FCI-MEG Connectors for daughter cards
• Easy configuration via CompactFlash, USB or PCI

All necessary operating software, including reference designs and
Synplicity Certify™ models to simplify partitioning, is supplied with
the board. If your need is speed, visit The Dini Group web site
www.dinigroup.com for complete details on the fastest FPGA board
ever.

1010 Pearl Street, Suite 6
La Jolla, CA 92037
(858) 454-3419
sales@dinigroup.com
SERIAL CONNECTIVITY

PCI Express Markets, Trends, and Applications
The Virtex-5 LXT device’s built-in PCI Express solution
enables significant power and area savings.

by Navneet Rao
Technical Marketing Manager, Horizontal Platform Solutions
Xilinx, Inc.
navneet.rao@xilinx.com

End users are adopting multimedia-enabled devices rapidly; you need
look no further than iPod video or YouTube-like blog sites. As users
consume this type of rich data, the need for efficient storage and
higher connectivity speeds becomes critical.
Today, the megahertz debate has been
replaced with the gigabit-per-second
debate, as the focus shifts from processing
speed to high-speed interconnect. A host of
serial standards have come into play. The
key market requirements governing these
standards are:
• Scalable performance
• An extensible feature set to adapt
to various use models (chip-to-chip,
backplanes, cable)
• Interconnects suitable for multiple
market segments and applications
• Implementation of cost-effective
solutions in mainstream high-volume
technology
One of the key serial standards to emerge
is PCI Express (PCIe), a third-generation
I/O interconnect introduced in 2002 to
provide a scalable path from PCI and PCI-
X (see Table 1). PCIe has become the stan-
dard interconnect of the PC industry and is
rapidly gaining momentum in other appli-
cations as well (Figure 1). It promises scala-
bility, an extensible feature set, multiple
market suitability, and cost-effectiveness.
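
The scalability claim can be checked with a little arithmetic. The sketch below is illustrative only and assumes 8b/10b line coding, so that 10 line bits carry 8 data bits. It converts the 2.5 Gbps per-lane line rate into usable bandwidth for each link width. The function name and constants are hypothetical, not part of any Xilinx tool. Under that assumption, the PCI Express 1.x results agree with Table 1 when its figures are read as megabytes per second with both directions counted together.

```python
# Sketch: usable PCIe 1.x bandwidth derived from the raw line rate.
# Assumption: 8b/10b line coding (10 line bits carry 8 data bits).

LINE_RATE_MBPS = 2500   # raw line rate per lane, per direction, in Mbit/s
DATA_BITS = 8           # data bits per 10-bit symbol
SYMBOL_BITS = 10

def link_bandwidth_mbytes(lanes: int, bidirectional: bool = True) -> int:
    """Return usable link bandwidth in megabytes per second."""
    data_mbps = LINE_RATE_MBPS * DATA_BITS // SYMBOL_BITS * lanes
    per_direction = data_mbps // 8          # bits to bytes
    return per_direction * (2 if bidirectional else 1)

if __name__ == "__main__":
    for lanes in (1, 2, 4, 8, 16, 32):
        print(f"x{lanes}: {link_bandwidth_mbytes(lanes)} MB/s, both directions")
```

Calling `link_bandwidth_mbytes(8, bidirectional=False)` gives the per-direction figure for an 8-lane link, which is the widest configuration the Endpoint block supports.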


Key highlights of PCIe include:
• A high-speed serial standard offering bidirectional communication at 2.5 Gbps line rates per lane
• Layered packet-based architecture, enabling modular design
• Bandwidth enhancement (as much as 80 Gbps) through easier scalability – 1, 2, 4, 8, 16, and 32 lanes
• Advanced features like reliability, power management, and hot plug
• Support for next-generation three-dimensional/multimedia traffic through virtual channels, traffic classes, and quality of service (QoS)
• Ease of use through new form factors and innovative designs, enabling applications targeted to multiple market segments
• Software preservation by supporting legacy PCI architecture and infrastructure

PCI Specification   Bus Width   Transfer Rate   Lane Width   Line Rate   Max Data Bandwidth
PCI 1.0             32 bits     33 MHz          -            -           133 MBps (half-duplex)
PCI 2.x             64 bits     33-66 MHz       -            -           266-533 MBps (half-duplex)
PCI-X 1.x           64 bits     133 MHz         -            -           Up to 1 GBps (half-duplex)
PCI-X 2.0           64 bits     266-533 MHz     -            -           Up to 4 GBps (half-duplex)
PCI Express 1.x     -           -               1 lane       2.5 Gbps    Up to 500 MBps
                    -           -               2 lane       2.5 Gbps    Up to 1 GBps
                    -           -               4 lane       2.5 Gbps    Up to 2 GBps
                    -           -               8 lane       2.5 Gbps    Up to 4 GBps
                    -           -               16 lane      2.5 Gbps    Up to 8 GBps
                    -           -               32 lane      2.5 Gbps    Up to 16 GBps
PCI Express 2.0*    -           -               1-32 lanes   5 Gbps      Up to 32 GBps
* The PCI Express 2.0 specification is “still under construction.”
Table 1 – PCI/PCI-X/PCIe specification and bandwidths

Figure 1 – PCI Express momentum (chart: systems, in millions, 2004 through 2009; Systems = PCs + Servers + Workstations + Embedded Systems)

Tremendous acceptance, design wins, and strong customer feedback have
propelled our understanding of the inherent benefits of PCI Express
in our customers’ applications. To create solutions for solving
tomorrow’s problems today and keep pace with rapidly changing times,
Xilinx has introduced a hard PCI Express Endpoint block in its
Virtex™-5 LXT devices (Figure 2).

Figure 2 – PCIe Endpoint block in the Virtex-5 LXT FPGA (diagram: hard core with GTP transceivers, Tx and Rx)

The salient features of the PCIe Endpoint block are:
• Full-featured and compliant to PCIe base specification v1.1
– Highly configurable PCIe endpoint solution
• Passed compliance/interoperability tests at PCI plug-fest (www.pcisig.com/developers/compliance_program/integrators_list/pcie)
• Supports 1-, 2-, 4-, or 8-lane implementations
• Meets all key requirements
– Electrical signaling
– Protocol (CRC, automatic retry)
– QoS
– Hot-pluggable


• Uses Xilinx® RocketIO™ GTP transceiver blocks
– PCI Express electrical support
– 100-MHz direct reference clock
• Saves resources
– Integrated in all Virtex-5 LXT devices
– Adjacent to GTP transceivers
• Ease of design
– Shortens design cycles
– Simplified, intuitive design flow
• Low cost and low power
• Packet buffering with configurable block RAM
– Rx buffer
– Tx buffer
– Retry buffer
• Simple transaction layer interface for easy integration
• Signals available to fabric for statistics and monitoring
– Credit status, max payload size, error signals
• As many as two virtual channels for QoS
– Round robin, weighted round robin, or strict priority

Figure 3 – Power and area savings of a high-performance Virtex-5 LXT PCIe solution
(Chart: power consumption and area required to implement a typical
design including an 8-lane PCIe endpoint, comparing Virtex-5 LXT
FPGAs (65 nm) with the nearest competitor (90 nm). Power axis in
watts, broken down into static power, PCIe, and user logic; bar
labels 3.09 and 6.22. Area axis in LUTs, broken down into PCIe and
user logic; bar labels 25,100 and 34,600. XC5VLX30T versus 2SGX60D.
Target frequency = 200 MHz. Worst-case process; 25K LUTs, 17K
flip-flops, 1 Mb on-chip RAM, 64 DSP blocks, and 128 2.5V I/Os. Based
on Xilinx tool v8.2 and competitor tool v6.01.)

Xilinx PCI History
Xilinx has been at the forefront of the PCI/PCI-X/PCIe technology.
Significant achievements include:
1996 – Industry’s first PCI core for FPGAs
1999 – Industry’s first 64-bit, 66-MHz PCI solution
2000 – Industry’s first 64-bit, 133-MHz PCI-X solution
2003 – Industry’s first PCIe solution
2005 – Industry’s first PCIe PIPE solution – Xilinx + NXP Semiconductors
2006 – Industry’s first FPGA Express Card solution – Xilinx + NXP Semiconductors
2006 – Industry’s first FPGA with built-in PCI Express Endpoint block

Designing with the Virtex-5 LXT PCIe Block
PCI Express has gained considerable momentum, with broad acceptance
in the PC industry. Engineers designing with Virtex-5 LXT FPGA-based
PCIe endpoints can also lead the proliferation of PCI Express in new
markets by leveraging these advantages:
• Faster time to market. Existing ASSPs do not support PCIe today;
FPGAs enable bridging between proprietary parallel interfaces and
PCIe. In addition, evolving add-ons to the PCI Express standard
discourage ASIC/ASSP starts until a broad market base has been
created. A case in point is the recent “Geneseo” architecture
announcement by Intel and IBM at the Intel Developers Forum in
September 2006. Xilinx also endorses this initiative, which extends
the PCIe architecture to enable emerging application accelerators.
• Low power and reduced area. Applications that need higher
performance but are designed to smaller form factors can use the
Virtex-5 LXT solution (Figure 3). The PCIe Endpoint block allows you
to choose a smaller device and still achieve significant power and
cost savings.
• Bridging legacy and disparate standards to PCIe. The movement of
legacy applications to new form factors optimized for PCI Express
requires bridging functions between legacy standards and PCI Express.
The new Virtex-5 LXT platform offers the customization and logic
resources to enable this transition as well as bridging to other
serial standards.
• Scalable solution. The PCI Express protocol is here to stay, but
the protocol itself and use models are in rapidly


evolving phases. Designing with Virtex-5 LXT PCIe Endpoint blocks
enables you to scale from 1- to 4- to 8-lane link widths in the same
Virtex-5 family. This allows you to future-proof the system and the
equipment. In addition, because PCIe is inherently compatible with
legacy PCI and PCI-X architectures, scaling and designing Virtex-5
LXT FPGA-based PCIe solutions will preserve software investments and
extend infrastructure life.
• Form factors supported. Virtex-5 LXT RocketIO GTP transceivers
offer significant power advantages over competing FPGA/ASSP
solutions. This enables designers to consider Virtex-5 FPGAs in new
markets. You can use the inherent advantages of the 65-nm FPGA to
support multiple form factors by using scalable logic density for
different solutions. For example, a desktop solution in an add-in
card form factor can be scaled to a lower power Express Card form
factor using similar/identical FPGA resources. Conversely, a desktop
add-in card form factor PCIe solution in a Virtex-5 LXT FPGA can be
easily scaled up to support the transition to high-performance form
factor solutions like ATCA, uTCA, and server I/O module.

Applications         Form Factors (Typical)      Link Width   Data Bandwidth   Prominent Feature Required
Enterprise           HBAs, Server I/O Module     x1           250 MBps         High Reliability
                                                 x4           1 GBps           Scalability
                                                 x8           2 GBps           Error Recovery
                                                                               Reduced Board Space
                                                                               Reduced Power Budgets
Desktop              Add-In Card (PCI)           x1           250 MBps         Legacy Support to Existing Software
                                                 x4           1 GBps           Reduced Board Space
                                                 x8           2 GBps           Reduced Power Budgets
                                                 x16          4 GBps           Ecosystem Existence
                                                                               High Availability
Mobile               Express Card, Mini-Card     x1           250 MBps         Reduced Power Budgets
                                                                               High Reliability
                                                                               Power Management Capability
Communications       HBAs, ATCA,                 x4           1 GBps           High Availability
                     Server I/O Module           x8           2 GBps           High Performance
                                                                               Interoperability
                                                                               Reliability
Embedded Platforms   Integrated Endpoints,       x1           250 MBps         Low Cost
                     Custom Cards, Mini-Card                                   Reliability
                                                                               High Availability
                                                                               Ease of Use and Integration
Table 2 – Virtex-5 LXT PCIe Endpoint applications

PCI ExpressFabric Topology
The PCI ExpressFabric™ topology, referred to as a hierarchy,
comprises a root complex (RC), multiple endpoints (I/O devices), a
switch, and a PCI Express/PCI bridge, all interconnected through PCI
Express links.
An RC denotes the root of an I/O hierarchy that connects the
CPU/memory subsystem to the I/O. A root complex may support one or
more PCI Express ports; Intel chipsets are one example.
A switch is defined as a logical assembly of multiple virtual
PCI-to-PCI bridge devices, which forward transactions using PCI
bridge mechanisms, namely address-based routing. The IDT PCI Express
switch is one example.
Endpoint refers to a type of device that can be the requester or
completer of a PCI Express transaction, either on its own behalf or
on behalf of a distinct non-PCI Express device, for example, a PCI
Express-attached graphics controller.

Figure 4 – PCI Express topology (diagram: CPU and memory attached to the root complex, which connects over PCI Express links to a PCI Express endpoint, a PCI Express-PCI bridge to PCI/PCI-X, and a switch serving two legacy endpoints and two PCI Express endpoints)
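
The hierarchy described in this sidebar can be modeled as a small tree in which the root complex and switches forward a memory transaction to whichever downstream port claims the target address. The sketch below is a simplified illustration of address-based routing, not a model of any real device. The class names, device names, and address windows are all hypothetical, and real switches route using base/limit registers in each virtual PCI-to-PCI bridge rather than by searching the tree.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class Endpoint:
    """A leaf device that completes transactions inside its address window."""
    name: str
    base: int   # first address of the window (hypothetical value)
    size: int   # window size in bytes

    def route(self, addr: int) -> Optional[List[str]]:
        inside = self.base <= addr < self.base + self.size
        return [self.name] if inside else None

@dataclass
class Switch:
    """A forwarding node; also used here to stand in for the root complex."""
    name: str
    downstream: List[Union["Switch", Endpoint]] = field(default_factory=list)

    def route(self, addr: int) -> Optional[List[str]]:
        # Forward to the first downstream port whose subtree claims the address.
        for port in self.downstream:
            path = port.route(addr)
            if path is not None:
                return [self.name] + path
        return None  # unclaimed address: no completer in this subtree

# Hypothetical hierarchy loosely following Figure 4.
hierarchy = Switch("root_complex", [
    Endpoint("pcie_endpoint", 0x1000_0000, 0x1000),
    Switch("switch", [
        Endpoint("legacy_endpoint", 0x2000_0000, 0x1000),
        Endpoint("fpga_endpoint", 0x2000_1000, 0x1000),
    ]),
])

if __name__ == "__main__":
    print(hierarchy.route(0x2000_1004))
    print(hierarchy.route(0x3000_0000))
```

The first call walks from the root complex through the switch to the hypothetical FPGA endpoint; the second returns `None` because no device claims that address.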


Figure 5 – PCIe in a communication system (diagram legend: Tri-Mode Ethernet MAC, PCIe Link, PCIe Endpoint)

Figure 6 – PCIe in a high-end desktop/server system (diagram legend: Xilinx Virtex-5 LXT FPGAs; PCI Express Link; PCI Express Endpoint Block + GTP Transceiver; Tri-Mode Ethernet MAC Block + GTP Transceiver)

Virtex-5 LXT FPGAs with built-in PCIe Endpoint blocks can easily be
designed in all form factor applications, as shown in Table 2.
Figures 5 and 6 outline applications using Virtex-5 LXT PCIe Endpoint
block capabilities to aggregate multiple-source traffic and bridge
protocols to PCI Express.

Conclusion
The Virtex-5 LXT platform, with built-in PCIe Endpoint blocks and
RocketIO GTP transceivers, offers great value by providing a
full-featured, fully compliant PCIe solution. Say goodbye to IP
licensing and hello to lower power and fewer utilized logic
resources. You can achieve significant cost savings by targeting
smaller FPGA devices with 50% of the power of soft-IP alternatives.
Built-in hard blocks deliver guaranteed functionality and ease of use
by reducing design time.
Thus, the Virtex-5 LXT platform offers unique built-in PCIe
capabilities in a fast, low-power, 65-nm FPGA, launching a new era of
efficient PCIe system development.



SERIAL CONNECTIVITY

Designing with Virtex-5 Embedded Tri-Mode Ethernet MACs
You can implement flexible Ethernet systems
using the Virtex-5 10/100/1000 Ethernet MAC.

by Nick McKay
Senior Design Engineer
Xilinx, Inc.
nicholas.mckay@xilinx.com

Soma Potluri
Senior Design Manager
Xilinx, Inc.
soma.potluri@xilinx.com

Stuart Nisbet
Senior Design Manager
Xilinx, Inc.
stuart.nisbet@xilinx.com

Ethernet is the dominant wired connectivity standard. The Xilinx®
Virtex™-5 Ethernet media access controller (Ethernet MAC) block
provides dedicated Ethernet functionality, which together with
Virtex-5 RocketIO™ GTP transceivers and SelectIO™ technology enables
you to connect to a wide variety of network devices. The Ethernet MAC
block is integrated into the FPGA as a hard block in Virtex-5
devices.
The Ethernet MAC is available in the Xilinx design environment as a
library primitive, named TEMAC. The primitive contains a pair of
10/100/1000 Mbps Ethernet MACs. Each Virtex-5 LXT device contains
four Ethernet MAC blocks; thus, a Virtex-5 LXT design can incorporate
two TEMAC primitives. Using standard Xilinx products, you can create
a range of customized packet processing and network end-point
products. Xilinx has also provided an overclocking mode to enable
backplane connectivity at speeds as fast as 2,000 Mbps.
Xilinx developed the Virtex-5 Ethernet MAC from the Virtex-4 FX
Ethernet MAC, making improvements in the areas of global clock usage,
serial interface flexibility, and software control complexity.
In this article, we’ll review the feature set of Ethernet MAC blocks
in Virtex-5 devices. We’ll also describe the differences between
Virtex-5 and Virtex-4 FX Ethernet MACs, illustrate some potential
applications, and describe how to use standard Xilinx tools to
integrate an Ethernet MAC into your design.

Supported Interfaces
The Virtex-5 Ethernet MAC is fully compliant to the IEEE802.3
specification. Figure 1 shows a block diagram of the Ethernet MAC.

Physical Interfaces
You can independently configure the physical interface of each
Ethernet MAC to operate as one of five different Ethernet interfaces.
The Media Independent Interface (MII), Gigabit Media Independent
Interface (GMII), and Reduced GMII (RGMII) are parallel interfaces.
These are typically connected to an external physical layer (PHY)
chip to provide BASE-T functionality at 10/100/1000 Mbps. Half-duplex
operation is supported at 10/100 Mbps; full-duplex operation is
supported at all speeds.
Serial GMII (SGMII) and 1000 BASE-X are serial interfaces that use
the physical coding sublayer (PCS) and physical medium attachment
(PMA) sections of the Ethernet MAC. These interface to the Virtex-5
RocketIO GTP serial transceivers. SGMII, as with the parallel
interfaces, provides 10/100/1000 Mbps full-duplex BASE-T
functionality. The serial interface significantly reduces the number
of pins required to connect to the external PHY chip.
When the Ethernet MAC is configured in 1000 BASE-X mode, the PCS/PMA
block, along with the RocketIO transceiver, provides all of the
functionality required to connect directly to a gigabit interface
converter (GBIC) or small form-factor pluggable (SFP) optical
transceiver. This removes the need for an external PHY chip for 1000
BASE-X network applications.

Control Interfaces
The host interface provides access to the configuration registers of
the Ethernet MAC block. Examples of configuration options include
jumbo frame enable, pause and unicast address settings, and frame
check sequence generation.
The host interface is accessible through either a generic host bus or
a device control register (DCR) bus (when connecting to a processor).
In addition, each Ethernet MAC has an optional management data


I/O (MDIO) interface. This allows access to the management registers
of an external PHY and to the physical interface management registers
within the PCS/PMA section of the Ethernet MAC.

Figure 1 – Block diagram of the Virtex-5 Ethernet MAC (diagram: EMAC0 and EMAC1, each with a client interface, PCS/PMA or SGMII to a RocketIO transceiver, GMII, MII, or RGMII to an external PHY via the SelectIO interface, MDIO to an external PHY, clocking, and Rx/Tx statistics multiplexers feeding statistics blocks in the FPGA fabric; a shared host interface reached through the generic host bus or the DCR bridge)

Client Interface
Frames are passed to the Ethernet MAC across the transmitter client
interface. The transmitter pads the incoming data when it is less
than the minimum Ethernet frame length and maintains the minimum
interframe gap between frames; however, you can increase the size of
the gap. You can also configure the transmitter to add a frame-check
sequence to the frame. A separate flow control interface allows you
to generate pause frames. In half-duplex mode, the transmitter
signals collisions and requests retransmissions for valid collisions.
The receiver interface verifies incoming frames and signals frame
errors. Good and bad frame signals are provided. You can also
configure the Ethernet MAC to pause and restart frame transmission
upon the detection of valid pause frames.
The data on the client interface is 8 or 16 bits wide. The 8-bit
interface is used for standard Ethernet applications, giving a 1,000
Mbps data rate with a 125 MHz clock. Using the 16-bit mode, you can
increase the data rate to 2,000 Mbps without any increase in clock
speed at the client interface.
Each Ethernet MAC outputs statistics vectors containing information
about the Ethernet frames seen on its transmit and receive datapaths.
An external statistics module is freely available in Xilinx CORE
Generator™ software. The statistics module accumulates all of the Tx
and Rx datapath statistics of each Ethernet MAC.

New Features in the Virtex-5 Ethernet MAC
In Virtex-4 FPGAs, implementing just the datapath consumes as many as
four global clock buffers: one each for the Tx and Rx client
interface logic, and one each for the Tx and Rx physical interface
logic. For Virtex-5 FPGAs, Xilinx added a clock-enable feature. You
can use the clocks derived for the physical interface for all of your
client logic. The internally generated clock enable provides a way to
maintain the correct data throughput on each of the interfaces. This
reduces the number of necessary clock buffers by 50%.

DCR Bus Addressing
The Virtex-5 DCR interface now features an individual base address
for each of the Ethernet MACs. This makes the shared DCR bus
interface transparent to software drivers. The software no longer
needs to know the bit locations for individual Ethernet MACs; the
hardware automatically multiplexes in the correct bits depending on
the base addresses.

Serial Interface Changes
Xilinx made several changes to the operation of the serial
interfaces. Auto-negotiation is now more flexible with the inclusion
of a programmable link timer. You can alter the timing of the
auto-negotiation process and reduce simulation time.
A newly added unidirectional mode performs the unidirectional enable
function from the IEEE802.3ah-2004 specification. When enabled, the
Ethernet MAC transmits regardless of whether valid input is present
at the receiver.
Finally, loopback can now take place in the Ethernet MAC as well as
in the transceiver. This enables the transmission of idles to the
link partner while in loopback, ensuring that the link remains
active.

Virtex-5 Ethernet MAC Use Models
The versatility of the Virtex-5 Ethernet MAC enables its use in a
wide variety of applications. For example, you can:
• Attach the Ethernet MAC to a processor running a protocol stack in
network processing or remote monitoring systems, as shown in Figure 2.
• Interface the Ethernet MAC to a packet processing system
implemented in the FPGA, such as a checksum offload engine or remote
direct memory access design.
• Connect multiple Ethernet MACs to dedicated packet FIFOs and
external memory for packet storage, bridging, or switching
applications.

Tools and IP Support
Xilinx provides support for the Ethernet MAC through CORE Generator
software, LogiCORE™ IP, and reference designs.
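
The transmitter behavior described under Client Interface (padding short frames and appending a frame-check sequence) and the client-interface data-rate arithmetic can be illustrated in software. The sketch below is not the Ethernet MAC's implementation. It assumes the standard 64-byte minimum frame and the IEEE 802.3 CRC-32, which Python's `zlib.crc32` computes with the same polynomial. All function names and the example addresses are hypothetical.

```python
import zlib

MIN_FRAME_NO_FCS = 60  # 64-byte minimum Ethernet frame minus the 4-byte FCS

def build_frame(dst: bytes, src: bytes, ethertype: int, payload: bytes) -> bytes:
    """Pad a frame to the minimum length and append the CRC-32 FCS."""
    frame = dst + src + ethertype.to_bytes(2, "big") + payload
    if len(frame) < MIN_FRAME_NO_FCS:
        frame += bytes(MIN_FRAME_NO_FCS - len(frame))   # zero padding
    # The FCS is appended least-significant byte first.
    fcs = zlib.crc32(frame).to_bytes(4, "little")
    return frame + fcs

def fcs_ok(frame: bytes) -> bool:
    """Receiver-side check: recompute the CRC over everything but the FCS."""
    return zlib.crc32(frame[:-4]).to_bytes(4, "little") == frame[-4:]

def client_rate_mbps(width_bits: int, clock_mhz: int = 125) -> int:
    """Client-interface data rate: bus width times clock frequency."""
    return width_bits * clock_mhz

if __name__ == "__main__":
    frame = build_frame(b"\xff" * 6, b"\x00\x0a\x35\x00\x00\x01", 0x0800, b"hello")
    print(len(frame), fcs_ok(frame))
    print(client_rate_mbps(8), client_rate_mbps(16))
```

A 5-byte payload is padded so that the frame reaches 64 bytes including the FCS, and the 8-bit and 16-bit client widths at 125 MHz reproduce the 1,000 and 2,000 Mbps figures quoted in the text.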


Virtex-5 Ethernet MAC Wrappers
Figure 3 shows a block diagram of the HDL wrappers available from the
Xilinx CORE Generator tool.
The Ethernet MAC is a complex component with 162 ports and 79
parameters. Wrapper files enable you to easily set the parameters and
interface only to those ports required for your application. They
also offer benefits in simplifying the use of clocking and physical
I/O resources.
The different levels of hierarchy enable you to extract the correct
wrapper for your application.
• Ethernet MAC Wrapper. In the lowest level, a single or dual
Ethernet MAC is instantiated and its attributes are set to your
preferred selection in the CORE Generator GUI. All of the unused
input ports are tied to ground and the output ports are left open.
• Block Level Wrapper. In the next level of hierarchy, the physical
interfaces and the required clock resources are instantiated. This
includes the RocketIO GTP transceivers for the serial interfaces.
Clocking is also optimized for your configuration, and you can clock
the output to your design.
• LocalLink Level Wrapper. In this level, FIFOs are added to the
client transmitter and receiver interfaces. The FIFOs handle the
dropping of bad frames on reception and retransmission of frames in
half-duplex mode. LocalLink is used as the back-end interface.
• Example Design Wrapper. The top level features a demonstration
design where the received data is looped back and sent to the
transmitter. You can download this design to a board and stimulate
the receiver from a network device to demonstrate the operation of
the Ethernet MAC in hardware. Testbenches that stimulate receiver
input and monitor the transmitter output of the design are also
included in the CORE Generator software.

Figure 2 – MAC connected to a processor on the Virtex-5 FPGA (diagram: processor with DMA master attachment and slave attachment; read and write packet FIFOs; register, SRAM, and interrupt interfaces; Virtex-5 Ethernet MAC with client receive, client transmit, and host interfaces; SelectIO interface or RocketIO transceiver to an external PHY)

Figure 3 – Block diagram of the Virtex-5 Ethernet MAC wrappers (diagram: example design containing the LocalLink level wrapper, block level wrapper, and dedicated Ethernet MAC wrapper; for each of EMAC0 and EMAC1, an address swap module, a 10 M/100 M/1 G Ethernet FIFO with Tx and Rx client FIFOs and a LocalLink interface, and a physical interface (GMII/MII, RGMII, or RocketIO transceiver); shared host interface and clock circuitry)

LogiCORE IP and Reference Designs
Most of the existing Virtex-4 Ethernet MAC documentation is reusable
with the Virtex-5 Ethernet MAC. For example, a version of the
“Ethernet Cores Hardware Demonstration Platform” (XAPP443,
www.xilinx.com/bvdocs/appnotes/xapp443.pdf) will be available for the
Virtex-5 Ethernet MAC. LogiCORE IP, such as Ethernet statistics,
already supports the new architecture.

Conclusion
The Virtex-5 Ethernet MAC provides a cost-effective solution for a
wide range of network interfaces, enabling you to connect to BASE-X
and BASE-T networks at 10/100/1000 Mbps. Xilinx software tools and IP
also allow you to take advantage of the improved feature set of the
Ethernet MAC.
For more information, visit the Virtex-5 links on the Xilinx website,
www.xilinx.com/virtex5/.



SERIAL CONNECTIVITY

Asynchronous Sample-Rate
Conversion Between
AES Audio Streams
Xilinx Virtex-5 FPGAs provide the perfect platform for
implementing AES digital audio sample-rate conversion.

by Gregg C. Hawkes
Principal Engineer, Advanced Products Division
Xilinx, Inc.
gregg.hawkes@xilinx.com

Reed Tidwell
Senior Staff Applications Engineer,
Advanced Products Division
Xilinx, Inc.
reed.tidwell@xilinx.com

John F. Snow
Senior Staff Applications Engineer,
Advanced Products Division
Xilinx, Inc.
john.snow@xilinx.com

The diversified uses and ever-changing innovations for digital video
and audio continue to drive the fast-paced proliferation of equipment
for audio, video, and broadcast (AVB). Today’s AVB equipment demands
better image quality, higher resolutions, higher bandwidths, more
audio/video channels, and the combining of previously separate but
related functions such as HD-SDI, audio multiplex, audio demultiplex,
and asynchronous sample-rate conversion (ASRC).


Xilinx® FPGAs have kept pace with cus-


tomer integration needs by incorporating sil-
icon features, facilitating the absorption of
less integrated, complex, and expensive
ASSP chips. One such ASSP chip function,
ASRC, can be integrated into Xilinx FPGAs
by leveraging diffused silicon features known
as DSP48E slices and block RAMs to build
sophisticated filter functions.
Free Xilinx application notes and refer-
ence designs have also kept pace with our
customers’ need to integrate sophisticated
algorithms. The ASRC reference design cor-
rectly handles synchronous sample-rate con-
version and the far more complex ASRC
called for in most audio/video applications.
Simpler “synchronous-only” methods,
offered by many ASSP chips and FPGA IP
suppliers, can be smaller in terms of utiliza-
tion per audio channel; however, when
applied incorrectly to asynchronous applica-
tions, these methods have one or both of the
following artifacts:

• The input-to-output latency changes Figure 1 – ML571 board and frame synchronization demonstration board with
because of accumulating delay an ASRC to match the output digital audio rate to the output digital video rate.

• Artifacts are produced in the audio,


such as skipping samples or repeating • Automatically and accurately moni- adjusts on the fly, maintaining high per-
samples toring the input-to-output ratio and formance with no special attention needed
sample-rate changes for input and output clocks. You can verify
Both of these cases represent undesirable
all of this with the IP running the Xilinx
distortions. • Adapting the filter function (filter
ML571 Serial Digital Video demonstration
coefficients) on the fly to provide the
board shown in Figure 1.
Understanding Sample-Rate Conversion highest performance
Best of all, the broad functionality and
Before diving into the theory of digital sam-
Supporting ASRC for digital audio with high-performance ASRC IP is free.
ple-rate conversion, you should look at the
an FPGA means that you can now signifi-
basic types of problems audio/video engi-
cantly save costs for every SDI interface in Sample-Rate Conversion Theory
neers are trying to solve. A few applications
your system – and in many systems, there Figure 2 illustrates conceptually the gen-
exist where you could use a fixed-rate syn-
are many channels. eral case of up or down conversion. The
chronous conversion, such as a 48-KHz
The Xilinx ASRC IP is very high per- conversion ratio can vary continuously by
input converted to a 44.1-KHz output using
formance, with a worst-case input-to-out- rational numbers with fractional values.
the same clock source, or an output clock
put signal-to-noise ratio of -125 dB. It also The diagram shows the up conversion
derived from the input clock. However, more
supports conversion for multiple audio- process (creating many more samples and
likely is the asynchronous case, where input
input frequencies to multiple audio-output time positions to choose from) followed
and output clocks are completely independ-
frequencies. The rate conversion algorithm by down conversion (judiciously choos-
ent, such as two boards communicating
audio between them. The different clock
oscillators can be the same nominal frequen-
cy but several parts-per-million different. Upsampler
Low Pass
Downsampler
Input Samples at Anti-Imaging / Output Samples at
The Xilinx ASRC reference design for fs in Anti-Aliasing Filter fs out = fs in * L/M
L g(n) M
the asynchronous case of independent
input and output clocks provides two
important and difficult design functions: Figure 2 – Classic conceptual data path for sample-rate conversion

58 Xcell Journal Fourth Quarter 2006


SERIAL CONNECTIVITY

The ASRC adjusts the de-embedded audio to match


the output video stream clock rate, where it can then
be re-embedded into the output SDI video stream.
ing the samples in the output datastream that most closely match the positions of the desired samples). The anti-imaging/anti-aliasing filter in the center of the data path ensures that the spectral content is limited to less than half of both the input and output sampling frequencies (the Nyquist limit).
Figures 3 and 4 show that for every output sample location or output phase, a different set of sub-filter coefficients is required because the inputs are in different locations relative to that output phase. The sub-filter, having a set of coefficients that align with the input sample positions, is formed by interpolating the prototype filter coefficients. When this sub-filter is convolved with the corresponding input samples, the output sample of interest is produced. This process repeats, with new sub-filter coefficients interpolated for each output sample.

[Figure 3 diagram: input samples, interpolated samples, and output samples plotted against time, with an inset around the output sample times]
Figure 3 – Output sample position relative to the original sample position dictates which interpolated samples to use.

[Figure 4 diagram: the prototype filter centered on output sample position X, the input sample positions, and the resulting sub-filter]
Figure 4 – Prototype filter centered at output sample position

ASRC Example Implemented on the ML571
The simple function known as frame synchronization of video provides a great demonstration of where you might use ASRC. Video can be stored in a frame buffer at some rate and removed at a fractionally different rate. This process can be useful if two pieces of video equipment are not “genlocked” together and operate at different pixel rates.
The result is the occasional need to add or drop a frame of video data. Your eye probably would not notice an added or dropped frame of video on your TV screen, but the human ear is very good at detecting discrepancies in added or dropped audio. The solution is to remove the audio from the starting video stream and reinsert it in the resulting video stream at a fractionally different rate, matching the output audio rate to the new output video rate. The Xilinx ASRC reference design is perfect for this task.
As an example, let’s connect two boards with SDI video running at slightly different frequencies because of the different clock oscillators on each board. The receiving board demultiplexes the embedded AES digital audio from the video stream and sends it to the ASRC. The difference in clock frequency between the two boards causes the frame buffer synchronization logic to add or drop video frames. The ASRC adjusts the de-embedded audio to match the output video stream clock rate, where it can then be re-embedded into the output SDI video stream.
For more information about frame buffer synchronization and asynchronous sample-rate conversion techniques, see XAPP514, “Audio/Video Connectivity Solutions for the Broadcast Industry,” at www.xilinx.com/bvdocs/appnotes/xapp514.pdf.
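The sub-filter interpolation shown in Figures 3 and 4 can be illustrated with a short software model. The Python sketch below is not the reference design's implementation: the function names, the Hann-windowed sinc prototype, the filter length, the table density, and the linear interpolation between stored coefficients are all assumptions chosen for readability.

```python
import math

HALF_WIDTH = 8   # prototype half-length in input samples (assumed)
PHASES = 64      # stored prototype points per input sample (assumed)

def build_prototype(cutoff=1.0):
    """Hann-windowed sinc stored at PHASES points per input sample.

    cutoff is relative to half the input sample rate (1.0 = fs_in / 2).
    """
    size = 2 * HALF_WIDTH * PHASES + 1
    table = []
    for i in range(size):
        t = i / PHASES - HALF_WIDTH            # offset in input samples
        x = cutoff * t
        sinc = 1.0 if x == 0 else math.sin(math.pi * x) / (math.pi * x)
        window = 0.5 * (1.0 + math.cos(math.pi * t / HALF_WIDTH))
        table.append(cutoff * sinc * window)
    return table

def coefficient(table, t):
    """Interpolate the prototype at offset t (in input samples)."""
    pos = (t + HALF_WIDTH) * PHASES
    if pos < 0 or pos > len(table) - 1:
        return 0.0                              # outside the prototype
    i = int(pos)
    if i == len(table) - 1:
        return table[i]
    frac = pos - i
    return table[i] + frac * (table[i + 1] - table[i])

def resample(samples, ratio):
    """Resample by ratio = fs_out / fs_in, one sub-filter per output."""
    table = build_prototype(min(1.0, ratio))    # band-limit to the lower rate
    out = []
    for m in range(int(len(samples) * ratio)):
        t_out = m / ratio                       # output time in input samples
        first = max(0, math.ceil(t_out - HALF_WIDTH))
        last = min(len(samples) - 1, math.floor(t_out + HALF_WIDTH))
        acc = 0.0
        for n in range(first, last + 1):
            # Coefficient aligned with input sample n for this output phase
            acc += samples[n] * coefficient(table, t_out - n)
        out.append(acc)
    return out
```

Calling `resample(x, 48000 / 44100)` on 200 input samples yields 217 output samples. Away from the ends of the buffer, a constant input should come out very close to the same constant, which is a quick sanity check that the interpolated sub-filters sum to roughly unity gain.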


Block Diagram and Specification Highlights
The simple diagram in Figure 5 illustrates two key design elements required in ASRC. The first element is determining the changes between input sample rate and required output sample rate, labeled “ratio control.” The second element within the “re-sampler” is a set of prototype filters that are modified depending on the statistics reported by the ratio control.

[Figure 5 diagram: the asynchronous sample-rate converter contains a ratio control block (input sample clock, output sample clock, ratio detection) and a re-sampler (coefficient memory, filter phase interpolation, input sample storage, FIR filter) between the input samples and the output samples]
Figure 5 – Xilinx ASRC reference design top-level block diagram

The ASRC reference design converts stereo audio from one sample frequency to another. The input and output sample frequencies can be an arbitrary fraction of one another or the same frequency, but based on different clocks. The output is a band-limited version of the input re-sampled to match the output sample timing. The reference design has the following features:
• Fully asynchronous operation
• Expandable to multiple channels
• A -125 dB THD+N worst case with -130 dB THD+N typical
• A 24-bit audio word width in and out, with 31-bit internal math precision and round away from zero
• Automatic input-to-output sample ratio monitoring with continuous filter modification
• Continuous rational/fractional ratio, up conversion, 8:1
• Continuous rational/fractional ratio, down conversion, 1:7.5
• Continuous input-to-output rate monitoring with adaptive filtering
• Input/output rates 8 kHz-192 kHz, continuous
• Low deterministic latency
The reference design has an interpolated coefficient FIR filter coded with Virtex™-5 DSP48E slices as the primary math element and block RAM for input sample buffers and prototype storage.

Conclusion
The need to maintain different input-to-output audio rates for varying numbers of digital audio channels and support new AVB functions is a tremendous challenge. Throw in varying protocols, memory management, different sized payloads, and a variety of different system interfaces, and it is easy to see how these designs require high-performance, cost-effective flexibility that ASSPs and ASICs cannot offer. These challenges open up opportunities for Virtex-5 devices because these devices can enable equipment vendors to provide solutions to the ever-evolving AVB equipment landscape.


Implementing Integrated
Video Connectivity Solutions
with Virtex-5 LXT Devices
Xilinx Virtex-5 FPGAs provide the perfect platform for
integrating broadcast video solutions inside a single chip.

by Gregg C. Hawkes
Principal Engineer, Advanced Products Division
Xilinx, Inc.
gregg.hawkes@xilinx.com

Reed Tidwell
Senior Staff Applications Engineer, Advanced Products Division
Xilinx, Inc.
reed.tidwell@xilinx.com

John F. Snow
Senior Staff Applications Engineer, Advanced Products Division
Xilinx, Inc.
john.snow@xilinx.com

At Xilinx, we understand the challenges that broadcast system designers are facing. The number of emerging new standards for video connectivity creates difficult design challenges and schedules for broadcast products.
With the ever-changing video connectivity landscape prevalent throughout the broadcast chain, our goal is to offer help in the form of free reference designs, forming drop-in building blocks that can solve many system-level video connectivity issues. By providing you with cost-effective and highly integrated solutions compared to ASSP chips, Xilinx hopes to get you to market faster, lower costs, and differentiate your product from the competition.
Our video connectivity IP and reference design book, “Audio/Video Connectivity Solutions for the Broadcast Industry” (www.xilinx.com/bvdocs/appnotes/xapp514.pdf), includes chapters about SDI, HD-SDI, DVB-ASI, AES embedded audio, and audio-asynchronous sample rate conversion. Each chapter describes a specific video connectivity topic and links to free reference designs in Verilog and VHDL, providing implementation examples.
Integrating the encoders and decoders for these standards into the FPGA is simple with the clear, concise reference material found within the chapters of XAPP514. The reference design code, offered in both Verilog and VHDL, is clearly documented and illustrated, as shown in Figure 1.
We also offer a suite of validation platforms that can quickly and easily test your video processing algorithms or verify connectivity performance. For example, you can use our new Xilinx® Virtex™-5 ML571 Serial Digital Video (SDV) board (www.cook-tech.com) to demonstrate or develop video connectivity with Virtex-5 FPGAs. Figure 2 shows a block diagram; Figure 3 is a photograph of the ML571 board. Many of the free reference designs linked to XAPP514’s chapters were verified on the ML571 platform using broadcast industry-standard test equipment.


“The ML571 board is yet another example of how Xilinx provides customers with detailed design assistance for real broadcast industry issues,” said Andy DeBaets, senior director, systems and application engineering at Xilinx. “This board demonstrates how engineers can easily implement advanced video networking protocols while greatly increasing system integration, reducing system costs, lowering power, and shortening design schedules.”
Talk to your Xilinx sales channel about seeing the demonstrations or obtaining one of these boards so that you can test your new algorithms long before your proprietary board is produced. We hope you find this article and the audio/video connectivity book valuable, but it represents just a small sample of the information available about designing with Xilinx programmable logic devices. To access the latest information on these subjects and more, visit www.xilinx.com/esp/broadcast.

[Figure 1 diagram; block labels include SDI receiver, SDI decoder, video standard detect and flywheel, ANC and EDH processor, video test pattern generator, SDI encoder, and SDI driver]
Figure 1 – Example block diagram of free modular Verilog and VHDL reference designs

Virtex-5 Features Support Broadcast Designs
The Virtex-5 feature set supports many aspects of broadcast solutions by providing high performance, flexibility, and scalability with unique, cost-optimized family members built on the following features:
• High-density, high-speed, reprogrammable ExpressFabric™ technology
• 550-MHz, 36-Kb, dual-port block RAM/FIFO
• 550-MHz, 25 x 18 DSP48E slice
• 550-MHz clock management tile (CMT)
• SelectIO™ technology
• Reduced power consumption
• Sparse chevron package
These features are described throughout the articles in this issue of Xcell Journal, with detailed descriptions of the features and performance at www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex5/index.htm.

Overview of the Xilinx ML571
The new serial digital video (SDV) board for demonstrating and testing high-bandwidth video communications channels based on Xilinx Virtex-5 platform FPGAs shows you how to easily implement high-speed serial interfaces to popular industry standards like HD-SDI.

[Figure 2 diagram; block labels include the Virtex-5 FPGA, GTP transceivers, multi-rate equalizers on the SD-SDI or ASI inputs, SelectIO LVDS SD-SDI or ASI outputs, HD/SD-SDI or ASI outputs, XO and VCXO clock sources, sync separator, AES/EBU audio in and out, 64-MB DDR, 10/100/1000 Ethernet, System ACE with CompactFlash, XGI daughtercard connector, ML410 personality module connectors, RS-232, JTAG header, switches, and LEDs]
Figure 2 – Xilinx ML571 SDV video connectivity board block diagram

Standards and Functionality Supported
The diffused silicon integration of high-performance and low-power multi-gigabit serial I/O, tri-mode Ethernet MACs, PowerPC™ processor, and PCI Express Endpoint block


into Virtex-5 platforms has enabled the support of many more networking standards than previously possible.
The ML571 board now supports:
• Virtex-5 XC5VLX50T-FF1136 FPGAs (LX110T offered in a pin-compatible package)
• Two RocketIO™ GTP HD/SD-SDI receivers and two RocketIO GTP transmitters. The transmitters have Gennum tri-mode, 3 Gbps-capable cable drivers and the receivers have Gennum tri-mode, 3 Gbps-capable receiver equalizers. The standards supported are:
  • 3 Gbps HD-SDI (SMPTE424M), 2.97 Gbps
  • HD-SDI dual link (SMPTE372M), 1.485 Gbps, 1.4835 Gbps
  • HD-SDI (SMPTE292M), 1.485 Gbps, 1.4835 Gbps
  • SD-SDI (SMPTE 259M), 270 Mbps
  • DVB-ASI (CENELEC EN 50083-9 Annex B), 270 Mbps
• A SelectIO video input and video output providing differential LVDS I/O. This demonstrates the ability of the Virtex-5 SelectIO interface to transmit and receive video bitstreams supporting the following video standards:
  • SD-SDI (SMPTE 259M), 270 Mbps
  • DVB-ASI, 270 Mbps
• SelectIO technology, LVDS, AES3 digital audio (AES3id) I/O. Two BNC input connectors provide two stereo pairs of AES3id digital audio in. Two BNC output connectors provide two stereo pairs of AES3id digital audio out. These inputs meet the SMPTE 276M 75-Ohm unbalanced AES3 audio input electrical specifications.
• SDI AES digital audio, embed and de-embed (SMPTE272M-2004)
• AES digital audio, high-performance, asynchronous sample-rate conversion (ASRC)
• DVB-ASI to/from Ethernet for video over IP
• Frame synchronization using external DDR DRAM
• Sync separator and genlock capability. A sync separator can accept a variety of video sync sources including bi-level and tri-level video sync (HD and SD). The separated sync signals from the sync separator go to the FPGA, where they can be used to build genlock PLLs using any of the VCXO clock sources available to the FPGA.
• An XGI-compatible expansion connector set is provided to allow video I/O daughtercards
• Two 10/100/1000 Ethernet interfaces
• Debug RS-232 serial port
• Configuration six-pin JTAG header for connection to a Xilinx download cable
• Xilinx System ACE™ configuration controller with a CompactFlash Type II socket

Figure 3 – Xilinx ML571 SDV video connectivity board

Conclusion
The need to support new AVB designs and assist you, our customers, with implementations in Xilinx FPGAs is a tremendous challenge. However, at Xilinx, we pride ourselves in striving to keep up with the demand for excellence. With varying protocols and a variety of different system interfaces, it is easy to see how these designs require high-performance, cost-effective flexibility that ASSPs and ASICs cannot offer. These challenges open up opportunities for Virtex-5 devices, for these devices can enable you to provide solutions to the ever-evolving AVB equipment landscape.
The ML571 board is designed and sold by Cook Technologies. The Cook Technologies part number for the ML571 is CTXIL406. There are many clock and connectivity option daughtercards that also plug into the ML571 SDV board. For more information, e-mail colin@cook-tech.com or visit their website at www.cook-tech.com.


Enhancing System Management and Diagnostics with the Virtex-5 System Monitor
You can use the Virtex-5 System Monitor to greatly increase environmental monitoring coverage of your FPGA design.

by Anthony Collins
Staff Product Marketing Engineer
Xilinx, Inc.
anthony.collins@xilinx.com

The telecommunications industry demands high availability; when you pick up the telephone, you expect to hear a dial tone. As
broadband providers start to compete for
voice and video (with the deployment of so-
called “triple-play” services), customers
expect the same high availability.
High availability is only possible by
building redundancy into the hardware
that makes up the system. However, to
effectively manage this redundancy, the
system must be able to monitor its own
operating conditions and switch to back-
up hardware in the event of a failure
before the customer notices any down-
time. Close monitoring of the physical
environment allows for preemptive action
in the event of a failing component. This
involves monitoring the physical environ-
ment inside the chassis, using various sen-
sors to record such variables as
temperature, supply voltages, humidity,
and cooling performance.
FPGAs are important building blocks in
high-availability infrastructure. Therefore,
the on-chip environment of the FPGA and
its immediate surroundings within the sys-
tem should be carefully monitored. The
Xilinx® Virtex™-5 System Monitor facili-
tates easier monitoring of the FPGA and its
external environment.


Virtex-5 System Monitor
The Virtex-5 System Monitor allows you to easily access information about the FPGA on-chip (die) temperature and power supply conditions. The System Monitor also provides access to external sensor information through external analog input channels (monitoring as many as 17 external sensors). Access to this information involves little or no design effort, depending on the required functionality. Common functions such as alarms, an automatic channel sequencer, and data averaging are available within the System Monitor block, enabling you to develop a solution easily.
Figure 1 shows a block diagram of the Virtex-5 System Monitor. The System Monitor is built around a 10-bit, 200 kilosample-per-second analog-to-digital converter (ADC). The analog input range of the ADC is 0V-1V. At a resolution of 10 bits, the ADC can resolve the input voltage in steps of approximately 1 mV.
As shown in Figure 1, both the on-chip sensors and external analog input channels are connected to the ADC input using analog multiplexers. Therefore, the output voltages from various sensors must be sequentially converted to a digital word by the ADC. These measurement results are written to status registers, where they are easily read using the FPGA fabric, or externally through the FPGA and PC board JTAG infrastructure. The System Monitor control registers can be written or read using the same interfaces. The control registers configure the System Monitor operation (for example, selecting sensor channels for measurement, programming alarm limits, and setting sensor averaging). The System Monitor is fully functional shortly after power-up and does not require the FPGA to be configured for correct operation. By default, only the on-chip sensors are monitored after power-up; however, you can also enable external analog inputs. Before configuration, measurement information can only be accessed through the JTAG test access port (TAP).

[Figure 1 diagram: external sensor inputs and the on-chip temperature and supply sensors (VCCINT, VCCAUX, VREFP, VREFN) feed analog multiplexers into a 10-bit, 200-kSPS ADC; control logic connects to a register file of status registers (00h-3Fh) and control registers (40h-7Fh), which is reached through the dynamic reconfiguration port (DRP) to the FPGA interconnect and through the JTAG TAP controller]
Figure 1 – Virtex-5 System Monitor

User Alarms
One of the useful built-in features of the System Monitor is its ability to generate alarm signals for the on-chip sensors. As a designer, you can specify the threshold limits for these alarm signals. The System Monitor can autonomously monitor the sensors and alert the system only when an alarm condition is detected.
The System Monitor also contains a factory-set alarm condition called over temperature (OT). If you enable this feature, the System Monitor can request a full chip power-down if a die temperature greater than 125°C is detected. Chip power-up is initiated after the die has cooled to a level that you specify. The System Monitor continues to operate and monitor the on-chip sensors during chip power-down.
The OT functionality is disabled by default and must be explicitly enabled.

Checking the Checker
Using the Virtex-5 System Monitor to provide accurate and reliable environmental information requires reliability checks on the measurement data and System Monitor operation. The System Monitor has a number of features that help to confirm reliable operation. Built-in auto-calibration of the ADC and sensors corrects any drift in the analog measurement system caused by the operating environment. Self-check features also allow the system host to monitor the operation of the System Monitor.

Leveraging System Monitor JTAG Access
A novel feature of the Virtex-5 System Monitor is the ability to access the full functionality of the block using the JTAG TAP. By enabling analog testing and access to analog information, you can obtain greater value and efficiencies using the existing JTAG infrastructure in the system. This access is available before configuration of the FPGA for use as part of a PC board test scheme in production, or during normal operation to facilitate a debugging effort.
To facilitate off-chip measurements such as supply voltages and currents on the PC board, you can use special JTAG commands to enable external analog inputs before FPGA configuration. Even after FPGA configuration, the System Monitor does not require an explicit instantiation in your design, thereby allowing full access to its features for debugging work through the JTAG TAP, even at a late stage in the design process.
To ensure the availability of the System Monitor, the only requirement is that the correct PC board support must be in place. This involves the connection of an external 2.5V reference IC as described in the System Monitor User Guide (www.xilinx.com/bvdocs/userguides/ug192.pdf).
Figure 2 illustrates a typical diagnostic application where the physical operating environment of the FPGA is monitored during normal operation. In the example illustrated in Figure 2, the System Monitor is used to look at the voltage (IR) drop in the power distribution system (PDS) during a period of heavy current demand starting at time t0. The temperature of the FPGA is also monitored during this period


of high activity. Potential issues with the power supply or PC board design can be quickly identified during development. The JTAG access also provides an easy way to confirm that adequate cooling is in place for a particular design. The ChipScope™ Pro Analyzer provides an easy way to access the System Monitor; however, access can easily be incorporated into other JTAG test and programming environments.

System Integration
In addition to convenient access through the JTAG TAP, full access to the System Monitor control and status registers is also provided through the FPGA fabric. These registers can be configured and read at any time from the fabric. Dual access to the System Monitor registers by the JTAG TAP controller and fabric interface is permitted, and an arbitration scheme is available to manage possible contention.
You can also define the contents of these registers when the System Monitor is instantiated in a design and initialized during FPGA configuration. Thus, the System Monitor can be configured to start up in a user-defined mode of operation post-configuration. The fabric interface is known as the dynamic reconfiguration port (DRP). The DRP is a parallel 16-bit synchronous data port (similar to block RAM).
[Figure 2 diagram: diagnostic software reads the FPGA physical environment through the JTAG TAP (TCLK, TMS, TDO, TDI); plots against time from t0 show VCCINT (0.99V to 1.01V scale), VCCAUX (2.45V to 2.55V scale), and temperature (40°C to 60°C scale); external sensors and the POL supplies on the intermediate power bus are also shown]
Figure 2 – You can access System Monitor measurements through the JTAG TAP.

For more advanced applications where greater control over the System Monitor is required, the DRP allows the System Monitor to be easily mapped into the peripheral address space of a hard or soft microprocessor. Figure 3 illustrates a typical system management application where the MicroBlaze™ processor is running a protocol like the intelligent platform management interface (IPMI) and communicating with the system host over management channels like Ethernet or even a simple UART/modem.
The System Monitor also provides an important microprocessor peripheral in the form of a general-purpose ADC. This is the first time analog peripherals like those commonly found in microcontrollers have been integrated into an FPGA. Full control over the ADC operation is supported. The ADC offers a number of sampling modes and can support unipolar, bipolar, and full-differential analog input schemes.
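As a rough illustration of how host software might treat the ADC as a peripheral, the Python sketch below converts a 10-bit unipolar code to volts using the 0V-1V input range given in this article, and models the over-temperature power-down and power-up behavior as simple hysteresis. The function and class names, the 70°C restart level, and the idea of passing temperatures in degrees Celsius are assumptions made for illustration; the sketch does not reproduce the actual register formats or sensor transfer functions, which are documented in the System Monitor User Guide.

```python
ADC_BITS = 10          # resolution stated in the article
FULL_SCALE_V = 1.0     # analog input range of 0 V to 1 V

# Size of one ADC step in millivolts: 1000 mV / 1024 codes
LSB_MV = 1000.0 * FULL_SCALE_V / (1 << ADC_BITS)

def code_to_volts(code):
    """Convert an unsigned 10-bit ADC code to volts (unipolar input)."""
    if not 0 <= code < (1 << ADC_BITS):
        raise ValueError("code out of range for a 10-bit ADC")
    return code * FULL_SCALE_V / (1 << ADC_BITS)

class OverTemperatureMonitor:
    """Hysteresis model of the over-temperature (OT) behavior.

    Power-down is requested above trip_c; power-up is allowed again
    once the die has cooled to reset_c (a user-chosen level).
    """

    def __init__(self, trip_c=125.0, reset_c=70.0):
        self.trip_c = trip_c
        self.reset_c = reset_c      # hypothetical user-specified level
        self.powered_down = False

    def update(self, die_temp_c):
        """Feed one temperature reading; return True while powered down."""
        if not self.powered_down and die_temp_c > self.trip_c:
            self.powered_down = True
        elif self.powered_down and die_temp_c <= self.reset_c:
            self.powered_down = False
        return self.powered_down

monitor = OverTemperatureMonitor(reset_c=70.0)
for reading in (100, 126, 110, 69, 100):
    print(reading, monitor.update(reading))
```

One ADC step works out to about 0.98 mV, which is where the approximately 1 mV figure comes from. The hysteresis keeps the chip powered down at 110°C in this sequence, because the die has not yet cooled to the restart level.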

[Figure 3 diagram: a microprocessor system on the On-Chip Peripheral Bus (OPB) with an EMAC connected to a PHY and LAN, a UART connected to a modem, and the 10-bit, 200-kSPS ADC with its analog input and JTAG TAP]
Figure 3 – System Monitor (or ADC) as a microprocessor peripheral

Conclusion
The Virtex-5 System Monitor delivers a greatly simplified solution for common on-chip and external environmental monitoring needs. Minimal development and design effort are required to access the functionality. By interfacing the System Monitor to the JTAG TAP controller, JTAG functionality has been extended into new application areas, thus enabling new test capabilities.
We would like to hear your comments and feedback regarding any topics touched on in this short article; in particular, how our development team can better support your system monitoring and test requirements.


Real-Time Debugging for Virtex-5 FPGAs
Version 8.2 of the ChipScope Pro Analyzer delivers verification performance for Xilinx FPGAs.

by Lee Hansen
Design Methodologies Sr. Marketing Manager – Horizontal Platform Solutions
Xilinx, Inc.
lee.hansen@xilinx.com

Xilinx® Virtex™-5 devices set a new benchmark in FPGA functionality, with as much as 12 times the logic capacity, 112 times more memory, 2 times the bandwidth, and 2.5 times the performance of the leading FPGA devices of just 8 years ago. Additional dedicated hardware functionality like DCM-based clock management tiles, embedded hard processors, high-speed MGTs, and DSP48E slices extend platform functionality to a broad spectrum of end applications. This extreme functionality places a huge demand on the design cycle and in particular the verification cycle, which tends to be the most time-consuming and time-critical phase of the design flow. The Xilinx ChipScope™ Pro software and analyzer deliver advanced real-time debugging functionality to complex Virtex-5-based designs, moving you through the verification cycle faster than ever before.

New Functionality
The functionality of the ChipScope Pro Analyzer version 8.2 has been enhanced with Virtex-5 performance in mind. All ChipScope Pro-optimized software debugging cores work with Virtex-5 devices when using ChipScope Pro 8.2 Service Pack 2 or later versions. The debugging cores deliver new enhanced performance, supporting higher clock speeds as fast as 500 MHz. You can analyze signals with greater speed and agility through advanced features like wider data capture of up to 1,024 bits, deeper data capture of up to 128K storage samples, and higher density slice packing of trigger match unit and capture control logic.
The resource estimator introduced with ChipScope Pro version 8.1 lets you see how much memory and device space the debugging cores will take up on the chip, useful for project planning.
Another breakthrough feature is remote debugging, first introduced in version 7.1 of ChipScope Pro software. Remote debugging lets you run the ChipScope Pro Analyzer and capture system through a server/client Internet connection. Your board can be running remotely in the lab while you debug from an office on the other side of the building or the other side of the world. You can share a single board or system in the lab with other engineers on your team or allow helpdesk personnel to debug a problem remotely at a customer site, helping to lower field debugging and repair costs.

Optimized Real-Time Debugging
The ChipScope Pro system is available as a separately purchased option to Xilinx ISE™ logic design software and allows you to debug Virtex-5 devices and other Xilinx FPGA-based projects in real time. You can quickly find and analyze design problems while the chip is running on the board, interacting with the rest of the system. Then, leveraging FPGA re-programmability, design changes can be quickly implemented and sent back to the device on board in a matter of minutes or hours through the programming cable. Such changes might take days or weeks using ASIC or competing FPGA offerings.
The ChipScope Pro system also links internal FPGA debugging to Agilent Technologies’ bench-top logic analyzers using the included ChipScope Pro ATC2 core. This core synchronizes the ChipScope Pro system to Agilent’s FPGA Dynamic Probe software, an optionally purchased plug-in to your Agilent 1680, 1690, or 16900 logic analyzer.
This unique partnership between Xilinx and Agilent delivers deeper trace memory, faster clock speeds, and more trigger options, all using even fewer pins on the FPGA. The advanced technology contained within the ATC2 core and FPGA Dynamic Probe is not available in other FPGA or ASIC real-time verification solutions.
For more information on the ChipScope Pro Analyzer, visit www.xilinx.com/chipscopepro or contact your local sales office for ordering information.

MEMORY INTERFACES

Memories are Made of This...
Virtex-5 FPGAs offer a wider range of memories and memory interfaces.

by Peter Alfke
Distinguished Engineer
Xilinx, Inc.
peter.alfke@xilinx.com

All FPGA applications use various amounts of memory for data, parameters, and instructions. To store from a few bits to multiple megabytes, Xilinx® Virtex™-5 devices offer a hierarchy of three different memory implementations:
• LUT-based distributed RAM has a granularity of 64 bits
• Block RAM has a granularity of 18 Kb
• External memory can store a practically unlimited number of megabytes with the help of on-chip memory interfaces

[Figure 1 diagram: a LUT used as a 64-bit or 2 x 32-bit RAM with DI, DO, ADDR, WE, and CLK signals]
Figure 1 – LUT RAM

[Figure 2 diagram: the LUT RAMs in a slice share a common write address (5 or 6 bits), while each has its own read address (5 or 6 bits) and a 2- or 1-bit data output]
Figure 2 – Slice as quad-port RAM
LUT RAM
Since the early days of the XC4000, Xilinx has made look-up tables (LUTs) available as user RAM. In Virtex-5 devices, the LUT has grown to 64 bits and can be used as either 64 x 1 or 32 x 2 RAM. LUT RAM (see Figure 1) offers very fast (sub-nanosecond) access time, tight integration with the logic fabric, and ultimate design flexibility (see sidebar, “Why 6-LUTs?”).

Sidebar: Why 6-LUTs?
Xilinx invented the use of four-input LUTs in FPGAs 20 years ago. Exhaustive academic and commercial studies had shown that four inputs (16 stored bits) were the optimal size for a LUT that implements random logic.
FPGA evolution has led to ever-smaller transistors as well as ever-more routing and other dedicated structures. As a result, the highly optimized LUT became a much smaller part of the circuitry. For Virtex-5 devices, we re-evaluated the optimal LUT size and found that a four-times-larger six-input LUT (6-LUT) increases the CLB size by only 15%. Extensive benchmarking then showed that, on average, a 6-LUT packs 40% more logic functionality compared to the traditional four-input LUT. The decision was easy: spend 15% more area to gain 40% more logic (or, expressed differently, save roughly 30% in logic area).
The four-times-larger memory capacity of each LUT is an extra, very welcome bonus, as is the ability to make the LUT and the RAM 2 bits wide.
Virtex-5 devices combine four LUTs in a slice. There are two different types of slices: Slice L and Slice M, roughly equal in number on any Virtex-5 device. The LUTs in a Slice L can perform logic and contain a carry chain. The LUTs in a Slice M have the same functionality but can also be used as distributed memory or shift register logic (SRL32) functionality.

Multiport Option
The four LUTs in a Slice M can share a common write address that does not interfere with the read addressing of the other three LUTs. Together, these four LUTs can thus implement quad-port memory, with one write port and three independent read ports all accessing the same data. The newest MicroBlaze™ processor uses this feature to reduce its register file from 384 to 44 LUTs. In this kind of application, the new six-input LUT is six times more efficient than a previous-generation four-input LUT (see Figure 2).

Shift Register
You can use any LUT in a Slice M as a serial shift register with addressable length. The LUT is configurable as either a single-bit shift register (a maximum of 32 bits long) or as a 2-bit-wide shift register (a maximum of 16 bits long). Different from earlier SRL16 structures, the Virtex-5 shift register uses a more traditional and scalable design with two latches per shift register bit – hence the maximum 32 bits (not 64 bits) per LUT (Figure 3).

Figure 3 – LUT as shift register

Block RAM
For larger RAM structures, Virtex-5 devices have tens or hundreds of block RAMs, each with a capacity of as much as 36 Kb.
You can structure each block RAM through configuration as:

• 72 bits wide, 512 deep
• 36 bits wide, 1K deep
• 18 bits wide, 2K deep
• 9 bits wide, 4K deep
• 4 bits wide, 8K deep
• 2 bits wide, 16K deep
• 1 bit wide, 32K deep

You can also use the two halves of the 36-Kb block RAM separately as two 18-Kb block RAMs, configured as:

• 36 bits wide, 512 deep
• 18 bits wide, 1K deep
• 9 bits wide, 2K deep
• 4 bits wide, 4K deep
• 2 bits wide, 8K deep
• 1 bit wide, 16K deep

Each block RAM always has two independent access ports and each port can be individually configured. This greatly simplifies data-width conversion.

Read During Write
Each port supports a data-in (DI) bus and a data-out (DO) bus. When writing the data on the DI bus into the memory, the DO bus presents either the previous data at the write address or the new data just being written. A third option keeps DO unchanged from its previous state. These three configuration options offer a design flexibility that is often overlooked.
All block RAM operations require a clock, even for reading data. This requirement is not always desirable, but it is absolute. Nothing happens without an enabled clock. Whenever the clock is enabled, data and address must meet the required setup and hold-time specification. Violating this requirement can contaminate the data content.

ECC
A 72-bit-wide block RAM can provide 64-bit-wide data with error detection and correction (ECC) using Hamming code. The controller is built into each block RAM. It detects single and double errors and corrects all single errors.
The ECC controller can also be used to operate with external memory. In this case, one complete block RAM is necessary for writing and another for reading. The built-in ECC circuit is a great simplification for memory designers who care about the ultimate data integrity (Figure 4).

Figure 4 – ECC for external memory

FIFO
FIFOs are usually implemented using dual-ported SRAMs, with one port used for writing and the other for reading. Many Virtex family block RAMs are traditionally used as FIFOs. That is why Xilinx chose to equip all Virtex-5 block RAMs with a built-in dedicated FIFO controller (Figure 5).

Figure 5 – Dual-ported RAM or FIFO

Virtex-5 devices have between 32 and 288 block RAMs, and each can be configured as a 36- or 18-Kb FIFO.
The controller can use the whole block RAM as FIFO with the following configuration options:

• 72 bits wide, 512 deep
• 36 bits wide, 1K deep
• 18 bits wide, 2K deep
• 9 bits wide, 4K deep
• 4 bits wide, 8K deep

But the controller can also use only half of the block RAM and leave the other half to be used as general-purpose block RAM. The FIFO options are then:

• 36 bits wide, 512 deep
• 18 bits wide, 1K deep
• 9 bits wide, 2K deep
• 4 bits wide, 4K deep
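
The width and depth pairs in the lists above follow one pattern: the data capacity is fixed, and the widths that are multiples of nine carry one extra parity bit per byte. The Python sketch below reproduces the block RAM lists from that rule. The function name `bram_depth` and the `VALID_WIDTHS` table are illustrative names chosen here, not part of any Xilinx tool.

```python
# Illustrative sketch: reproduce the block RAM width/depth options listed in the article.
# Assumption: widths that are multiples of 9 include one parity bit per 8 data bits,
# so only 8/9 of the nominal capacity (32 Kb of 36 Kb, 16 Kb of 18 Kb) holds data.

VALID_WIDTHS = {
    36: (1, 2, 4, 9, 18, 36, 72),   # full 36-Kb block RAM
    18: (1, 2, 4, 9, 18, 36),       # one 18-Kb half
}

def bram_depth(width: int, capacity_kb: int = 36) -> int:
    """Return the number of addressable words for a given port width."""
    if capacity_kb not in VALID_WIDTHS or width not in VALID_WIDTHS[capacity_kb]:
        raise ValueError(f"unsupported configuration: {width} bits in {capacity_kb} Kb")
    data_bits = capacity_kb * 1024 * 8 // 9          # capacity without parity bits
    data_width = width * 8 // 9 if width % 9 == 0 else width
    return data_bits // data_width

if __name__ == "__main__":
    for capacity in (36, 18):
        for width in reversed(VALID_WIDTHS[capacity]):
            print(f"{capacity}-Kb block: {width} bits wide, {bram_depth(width, capacity)} deep")
```

Because each of the two ports can pick its own entry from the same table, a design can, for example, write 72-bit words on one port and read 9-bit words on the other.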

Fourth Quarter 2006 Xcell Journal 71



In all cases, the FIFO write and read ports have identical width. Unequal width would complicate the interpretation of full/empty flags and is therefore not implemented in Virtex-5 devices.
Soft FIFO controller cores have been available for many years, but the dedicated FIFO controller offers three advantages:

• Higher performance, since dedicated logic is naturally faster than programmable logic
• Smaller size and lower power consumption, as it uses no fabric resources, CLBs, nor additional interconnects
• Guaranteed functionality and performance without any design effort

Write and read clocks can have arbitrary or undefined phase and frequency relationships. But for proper flag operation, both clocks should be free-running.
The challenging aspect of FIFO design is the reliable generation of status flags (FULL, EMPTY, and ALMOST_FULL or ALMOST_EMPTY) when write and read clock frequencies are unrelated. The trailing edges of these flags are inevitably generated by the “wrong” clock domain and must be re-synchronized to the proper clock domain. For a detailed explanation, visit www.sunburst-design.com/papers/CummingsSNUG2002SJ_FIFO2.pdf. (See sidebar, “Verifying the EMPTY Flag Synchronization.”)
The FIFO controller offers two new options: first-word fall through (FWFT) and synchronous operations.
After a first entry has been written into an empty FIFO, the EMPTY output goes low (inactive), indicating that the read port is allowed to enable its read clock and thus cause the data word to appear at the output. This might be described as a “pull” operation. Data appears at the output after the next enabled read clock.
In FWFT, the newly written data word appears automatically at the output simultaneously with EMPTY going inactive. This might be called a “push” operation.
You can configure the FIFO for either of these modes. The difference is visible only at the read data output of the very first word after the FIFO has been empty. In subsequent operations, no behavioral difference exists between the two modes.

Asynchronous vs. Synchronous Operation
The main purpose of most FIFOs is to bridge between independent clock domains; most FIFO applications thus use separate and uncorrelated clocks for writing and reading. Because the trailing edge of each flag must be re-synchronized to the opposite clock, there is an unavoidable one-clock-period ambiguity about the delay of the rising flag edge. This can increase the delay for flags going inactive and thus can cause a very small performance loss. The operation is less predictable, but only while the FIFO recovers from the empty or full condition.
In certain applications, there is only one clock domain, and write and read clocks are therefore identical. In that case, you can (optionally) set the mode to “synchronous.” This eliminates re-synchronization circuitry, reduces the trailing flag delay, and completely avoids the delay uncertainty. The performance improvement is very small.

External Memory
When a design needs multiple megabytes of memory, it is best implemented in external DRAM devices. High-performance SDRAM controllers can pose a design challenge, but Xilinx offers several application notes and well-documented cores and evaluation boards that implement several different memory controller designs.

Conclusion
Virtex-5 memories and memory interfaces have come a long way since the first LUT RAM appeared in the XC4000 family 16 years ago. Versatile dual-ported block RAMs with FIFO and ECC options simplify system design, while well-documented interface designs allow for unlimited expansion with external DRAMs.

Sidebar: Verifying the EMPTY Flag Synchronization
We tested the EMPTY synchronization logic exhaustively by writing data into the FIFO at 200 MHz and reading it out at 500 MHz, which makes it go EMPTY soon after each write cycle. This exercised the detection logic and re-synchronized the trailing edge of EMPTY 200 million times a second.
More specifically, we wrote an ascending data sequence at 200 MHz and read it out at 500 MHz. We wrote the output data directly into a second FIFO at the same 500 MHz. We then read the second FIFO out at the original 200-MHz rate.
The combined dual FIFO forms a synchronous system but with asynchronous data transfer between the two halves. When we synchronously subtracted the input data from the output data, the difference was constant, indicating flawless transfer at the 500-MHz read/write rate and no flag synchronization problem – even at this high rate.
When two clock frequencies are uncorrelated, each read clock cycle has a different phase relationship with respect to the write clock. During any second, the active read clock edge steps across the ~5 ns write clock period in ~200 million different phase orientations, thus creating a timing granularity of 0.025 femtoseconds. This resolution is millions of times better than any conventional deterministic test methodology can possibly achieve.
We ran this test for several weeks, with more than 10^14 operations, without any errors.
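
The 0.025-femtosecond granularity quoted in the EMPTY flag sidebar can be checked with a few lines of arithmetic. The Python sketch below uses the sidebar's own figures (a 200-MHz write clock and roughly 200 million distinct phase positions per second). The function names are illustrative, and the second calculation is only a plausibility check on the operation count, not a description of the actual test setup.

```python
# Illustrative arithmetic for the sidebar's timing-granularity claim.
# Inputs are the approximate figures quoted in the sidebar itself.

def timing_granularity_fs(write_clock_hz: float, phases_per_second: float) -> float:
    """Write clock period divided by the number of distinct phases, in femtoseconds."""
    write_period_s = 1.0 / write_clock_hz          # about 5 ns at 200 MHz
    return write_period_s / phases_per_second * 1e15

def days_to_reach(operations: float, rate_hz: float) -> float:
    """Days of continuous running needed to accumulate a number of operations."""
    return operations / rate_hz / 86400.0

if __name__ == "__main__":
    granularity = timing_granularity_fs(200e6, 200e6)
    print(f"timing granularity: {granularity:.3f} fs")
    # 10^14 flag re-synchronizations at 200 million per second:
    print(f"time to reach 1e14 operations: {days_to_reach(1e14, 200e6):.1f} days")
```

The result matches the sidebar's figure, and the operation count is reached within the first week, which is consistent with a test that ran for several weeks and logged more than 10^14 operations.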

Meeting Memory Interface Design Challenges with Virtex-5 FPGAs
Virtex-5 devices support the latest generation of high-speed memory interfaces.

by Richard Chiu
Staff Applications Engineer
Xilinx, Inc.
rich.chiu@xilinx.com

When not supporting new interface protocols, memory interface designers are constantly supporting faster and faster bus speeds for existing interfaces. Today’s source-synchronous double-data-rate (DDR) memory devices, such as DDR2 SDRAM, QDR II SRAM, and RLDRAM II, present designers with challenges at chip and PCB levels. Higher clock frequencies result in a rapidly shrinking data valid window. Signal integrity issues, clock jitter, memory uncertainties, varying silicon delays, PCB trace skew mismatch, and other factors now have a proportionally larger impact on meeting timing with a smaller data valid window.

Virtex-5 FPGAs Enhance Memory Interface Design
The Xilinx® Virtex™-4 FPGA family introduced a number of on-chip resources, in particular ChipSync™ technology, which adds to each I/O block an adjustable delay element (IDELAY) compensated over process, voltage, and temperature changes as well as enhanced DDR capture support. These features help meet the challenges of designing with source-synchronous memory interfaces. With Virtex-4 memory interface designs, you can employ calibration algorithms to factor out many of the skews and delays in the timing path and operate your design at higher frequencies.
The Virtex-5 architecture adds additional features that allow you to push the limits of operating frequency. Enhancements to the Virtex-5 device integral to memory interface design include:

• The addition of ExpressFabric™ technology. This architectural enhancement enables internal logic to run at higher clock frequencies. The basic slice look-up table (LUT) has increased from a four- to a six-input LUT (6-LUT), reducing the number of required logic levels. The technology also offers additional routing resources to provide more direct routes within a slice and between configurable logic blocks (CLBs).
• Reduction of the maximum bank size from 64 I/O (or 80 I/O in select Virtex-4 part/package combinations) to 40 I/O, and an increase in the number of banks. This leads to a more efficient implementation of the usual myriad of I/O voltage levels on the same FPGA. More I/O clocking resources have also been added to each bank.
• The availability of phase-locked loop (PLL) blocks as clocking resources in addition to digital clock manager (DCM) blocks. PLLs are useful for low-jitter clock generation and input clock jitter filtering.
• Enhanced block RAM/FIFOs that have doubled in size to 36 Kb and support a maximum width of 72 bits. Applications requiring error-correcting code (ECC) detection and correction can now take advantage of ECC encode/decode logic built into each block RAM, reducing logic usage and allowing much higher performance over implementing the same functionality in general logic.
• Support for digitally controlled impedance (DCI) on-chip split-Thevenin termination for bidirectional I/O only when the driver is 3-stated. Similar to the on-die termination (ODT) feature implemented in many memory device families, this support is provided for certain HSTL and SSTL I/O standards and can be used to save power when the FPGA is writing to memory.
• The incorporation of low-inductance bypass capacitors directly on the package substrate, simplifying PCB layout by reducing the amount of external bypassing required.

Virtex-5 Data Interface Techniques
Meeting read and write timing for a high-speed source-synchronous bus demands that you keep uncertainties to a minimum. Typically, the capture of read data is the most challenging part of the design.
Write timing for Virtex-5 FPGAs is supported in the same way as in the Virtex-4 device. The DCM (or PLL) generates quadrature phase outputs of the base (“system”) clock. The memory strobe is forwarded using an output DDR register clocked by an in-phase copy (CLK0) of the system clock. The write data is clocked by a DCM clock output that is 90 degrees ahead (CLK270) of the system clock. This ensures that the strobe is center-aligned to the data on a write at the outputs of the FPGA.
Both Virtex-4 and Virtex-5 memory interface designs support two kinds of read capture techniques:

• The “direct-clocking” technique delays the read data so that it can be directly registered using the system clock in the input DDR flop of an I/O block. The memory strobe is only used during calibration to determine the optimal time to delay the associated data. Figure 1 shows the direct-clocking read capture path.
• The “strobe-based” technique uses the memory strobe to capture corresponding read data and register it with a delayed version of the strobe distributed through a localized I/O clock buffer (BUFIO). This data is then synchronized to the system clock domain in a second stage of flops. The input serializer/deserializer (ISERDES) feature in the I/O block is used for read capture – the first two levels of flops in the ISERDES transfer the data from the delayed strobe to the system clock domain. Figure 2 shows the read capture path for a Virtex-5 memory interface design.

Figure 1 – Virtex-4 direct-clocking read data capture path

Figure 2 – Virtex-5 strobe-based read data capture path

Most Virtex-4 designs use the direct-clocking method for read data capture. Beginning with the Virtex-4 SERDES DDR2 design and continuing with the new generation of Virtex-5 memory interface designs, the strobe-based method is best to meet the tighter timing requirements at higher clock speeds.
Both techniques involve the use of IDELAY elements that are varied during a calibration routine. This routine is performed during system initialization, delaying both the strobe and data to determine and set the optimal phase between strobe/data and the system clock to maximize timing margins. Calibration removes any uncertainty caused by process-related delays, compensating for components of the path delay that are static to any one board. These components include PCB trace delays, package delays, and process-related components of propagation delays (both in the memory and FPGA), as well as setup/hold times of capture flops in the FPGA I/O blocks. Calibration accounts for variation in delays that are process-, voltage-, and temperature-dependent at the system initialization stage – you should also factor additional operating temperature and voltage variations separately into your interface timing budget.
During calibration, IDELAY for strobe and data are incremented to perform edge detection by continuously reading back from memory and by sampling either a prewritten training pattern or the memory strobe itself until either the leading edge or both edges of the data valid window are determined. The IDELAY for data or strobe is then set to provide the maximum timing margin. In the case of direct clocking, the optimal delay for the strobe is used to delay the associated data. For strobe-based capture, the strobe and data can have different delay values because there are essentially two stages of synchronization: one to first capture the data in the strobe domain and another to transfer this data to the system clock domain.
The direct-clocking capture method is simpler in design complexity, and compared to the strobe-based capture method, it has fewer pin-out restrictions. However, the strobe-based capture method becomes necessary at higher clock frequencies. Its two-stage approach offers better capture timing margins for two reasons:

• The DDR portion of the timing is restricted to the first rank of flops in the ISERDES. Because the strobe is used to register the data, timing is limited largely by the strobe-to-data variation; for example, in the case of DDR2, these are given by the tDQSQ and tQHS parameters of the part. For direct clocking, you must consider the data-to-clock variation (for DDR2, this is tAC) because the system clock is used to both drive the memory clock and capture read data. This is a larger uncertainty than the strobe-to-data variation.
• The strobe-to-clock variation is important for the second stage of capture, when the data is transferred from the delayed strobe to the system clock domain. However, by this time the data is split into two separate single-data-rate paths; therefore, aligning the delayed strobe to the system clock can take place over a much larger timing window.

The strobe-based capture method is more pinout-restrictive, as it requires the memory strobes to be placed on clock-capable I/O pins. This can limit the I/O utilization over a given bank. Virtex-5 devices have smaller banks and more I/O clocking resources per bank (for example, the number of BUFIO local clock buffers per bank has increased from two to four), easing this restriction and allowing more strobes and their accompanying I/O (data, mask) to be placed in each bank.
Other significant differences in Virtex-5 memory controllers include:

• Full-speed operation. Both the Virtex-4 SERDES design and Virtex-5 designs use the ISERDES for memory capture. However, Virtex-5 designs do not use the width expansion feature of the ISERDES, and the controller runs at the same speed as the memory clock. The Virtex-4 ISERDES design runs at half the memory clock speed but twice the bus width. Running at the same clock speed as the memory is made possible by the higher performance of the Virtex-5 fabric. This minimizes read-data latency through the ISERDES – as well as controller latency – and simplifies bank-management logic.
• Bank management. The Virtex-5 DDR2 controller employs a least-recently-used (LRU) bank-management algorithm that keeps banks open to reduce the overhead associated with opening and closing banks. In an LRU algorithm, banks are left open at the end of accesses. If a new bank needs to open, the controller closes the bank least recently used. At any time, as many as four banks can be left open.

Generating Virtex-5 Memory Designs
You can generate a custom memory controller by using the Memory Interface Generator (MIG) tool. The MIG tool is accessed through CORE Generator™ software and outputs HDL source (Verilog or VHDL) design files, along with accompanying constraint and build scripts.
The latest version of the MIG tool (1.6) supports DDR2 SDRAM-registered DIMM and QDR II SRAM component interfaces for Virtex-5 devices. The DDR2 controller supports operation of bus clock speeds as fast as 333 MHz (667 Mbps). The QDR II supports operation of bus clock speeds as fast as 300 MHz (600 Mbps).
Virtex-5 designs generated by the MIG tool also allow the physical layer interface portion of the design to be easily separated from the controller portion. You can then incorporate your own specific controller but retain the memory initialization and high-performance source-synchronous calibration logic.

Conclusion
The Virtex-5 device family builds on the Virtex-4 FPGA, with additional features to ease memory interface design and meet the challenges of supporting ever-increasing bus speeds.
To download the MIG tool and for more information about the implementation and design details of Virtex-5 memory controller reference designs, visit the Xilinx Memory Corner at www.xilinx.com/memory/.
Virtex-5 memory controllers are also available as reference designs for downloading from the Memory Corner:

• XAPP858 (DDR2 SDRAM)
• XAPP853 (QDR II SRAM)
• XAPP852 (RLDRAM II)
• XAPP851 (DDR SDRAM)
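
The calibration idea described in this article (step the IDELAY, detect the edges of the data valid window, then pick the setting with the most margin) can be sketched in software. The Python below is only a behavioral illustration, not the logic of the Xilinx reference designs: the 64-entry sweep, the boolean pass/fail samples, and the function names `find_valid_window` and `center_tap` are all assumptions made for the example.

```python
# Behavioral sketch of edge detection and centering during read calibration.
# Assumption: each delay tap has already been tested against a training pattern,
# giving one pass/fail result per tap; real designs do this in hardware.
from typing import Optional, Sequence, Tuple

def find_valid_window(passes: Sequence[bool]) -> Optional[Tuple[int, int]]:
    """Return (first_tap, last_tap) of the longest run of passing taps, or None."""
    best: Optional[Tuple[int, int]] = None
    start: Optional[int] = None
    for tap, ok in enumerate(passes):
        if ok and start is None:
            start = tap                        # leading edge of a candidate window
        if start is not None and (not ok or tap == len(passes) - 1):
            end = tap if ok else tap - 1       # trailing edge (or end of the sweep)
            if best is None or end - start > best[1] - best[0]:
                best = (start, end)
            start = None
    return best

def center_tap(passes: Sequence[bool]) -> Optional[int]:
    """Pick the tap in the middle of the widest window for maximum margin."""
    window = find_valid_window(passes)
    if window is None:
        return None
    return (window[0] + window[1]) // 2

if __name__ == "__main__":
    # Hypothetical sweep: taps 10..30 capture the pattern correctly.
    sweep = [False] * 10 + [True] * 21 + [False] * 33
    print("window:", find_valid_window(sweep))
    print("chosen tap:", center_tap(sweep))
```

Centering on the widest passing run leaves equal margin to both edges at initialization time, which is why the article asks you to budget later voltage and temperature drift separately.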

Implementing Memory Controllers Using the Memory Interface Generator Tool
The Memory Interface Generator tool simplifies designing memory controllers for Xilinx FPGAs.

by Nagesh Gupta
Founder & CEO
Taray Incorporated
nagesh@tarayinc.com

The Memory Interface Generator (MIG) tool is a comprehensive tool used to
simplify the design of memory controllers for
Xilinx® FPGAs. Memories are part of a
majority of Xilinx applications. The goal of
the MIG tool is to simplify memory inter-
faces, thus enabling FPGA users to focus
on the rest of the system design.
The MIG tool was first introduced in
2002 as a memory controller pin selection
utility for Virtex™-II and Virtex-II Pro
FPGAs. Since then, the MIG tool has
progressed significantly; it now supports
all Xilinx FPGA devices, including
Virtex-4, Virtex-5, Spartan™-3, and
Spartan-3E FPGAs.
The MIG tool dynamically generates
HDL in Verilog or VHDL formats based
on user inputs. Additionally, the MIG tool
generates .ucf pin constraints, any slice and
logic placement constraints, and any other
constraints required to create high-perform-
ance designs with minimal user changes.
MIG outputs are fully available in non-
encrypted formats. This enables you to
modify the designs.

The Time-to-Market Advantage
High speed memories are complex to design. Conservatively, you can save more than six months by using the targeted reference designs provided by the MIG tool. Fully verified MIG reference designs enable you to focus on other design activities, thus reducing overall time to market.

MIG Controller Architecture
The MIG tool produces everything required to fully implement a memory controller. MIG controllers are implemented in logical layers comprising:

1. The physical layer, or PHY, which captures the read data, transfers it to a convenient clock domain, and stores it. The PHY also transmits the write data and command/control signals.
2. The controller generates the required commands based on user requests. The controller also implements the state machine for reading, writing, and refreshing the memory.
3. The user interface enables exchange of data and commands to and from your application.

This layered approach allows you to modify the required portions of the design. In Virtex-5 devices, Xilinx has further simplified the layering compared to previous designs. For example, some designers want to use their own controllers, which is possible by replacing the controller that the MIG tool generates. This is easily achieved in Virtex-5 designs.

Hardware Verification
The designs generated by the MIG tool are thoroughly verified to ensure high quality. These quality checks have increased significantly over time as we at Taray learn more from the field.
For a given FPGA family, we verify at least one set of designs in hardware. Hardware verification is usually performed on a Xilinx memory reference board, such as the ML461 or ML561 boards.
Hardware verification starts with a point test, such as a read/write data match, at a particular frequency for a given memory part. We then perform frequency sweeps and ensure that the designs work ±10% in the required frequency range. We also verify all the possible parameters such as column address strobe (CAS) latencies, burst lengths, and data widths, as well as all supported synthesis tools.

Simulations
Taray simulates MIG designs using ModelSim from Mentor Graphics. We simulate a large number of combinations and ensure that every memory listed in the MIG tool is verified with at least one of the test cases. Table 1 is a summary of the different simulation test cases for Virtex-4 DDR2 SDRAM designs.

Table 1 – Hardware and simulation test summary for Virtex-4 DDR2 SDRAM designs

Hardware-Tested Configurations
HDL: Verilog and VHDL
Synthesis Tools: XST and Synplicity
Board and FPGAs: ML 461 -> XC4VLX25-FF668-10 and ML 462 -> XC4VLX25-FF668-11
Burst Lengths: 4 and 8
CAS Latencies: 3 and 4
Additive Latencies: 0, 1, and 2
ODT (in Ohms) Verified: 0, 75, and 150
Depth Verified for Components: 1
Depth Verified for DIMMs: 1, 2, 3, and 4
Component Verified: MT47H32M16BT-37E
DIMM Verified: MT18HTF6472G-53E (Registered DIMM)
Component Data Width Verified: 16
DIMM Data Width Verified: 72 and 144
ECC Verified: 72 and 144
Frequency Range: 100 MHz to 280 MHz for 16-bit component
                 100 MHz to 250 MHz for 72-bit DIMM (with and without ECC)
                 100 MHz to 250 MHz for 144-bit DIMM (with and without ECC)

Simulation-Tested Configurations
HDL: Verilog and VHDL
Burst Lengths: 4 and 8
CAS Latencies: 3 and 4
Additive Latencies: 0, 1, and 2
ODT (in Ohms) Verified: 0, 75, and 150
Depth Verified: 1, 2, 3, and 4 (for both components and DIMMs)
Components Verified: All supported by the MIG tool (X4, X8, and X16)
DIMMs Verified: All supported by the MIG tool (registered, unbuffered, and SODIMMs)
Component Data Width Verified: 8, 16, 24, 32, 40, 48, 56, 64, 72, 128, and 144
DIMM Data Width Verified: 64, 72, 128, and 144
ECC Verified: 40, 72, and 144
Frequencies Verified: 200 MHz and 267 MHz (for both components and DIMMs)
Initialization: As per both Micron and JEDEC specifications
Multicontroller: 1 to 8

Below are some parameters to generate the test cases:

• All possible data widths
• All of the supported memory components/DIMMs
• Different values for CAS latencies, burst lengths, and additive latencies, depending on the memory type
• Simulated Verilog and VHDL RTL files
• RTL with and without testbench
• RTL with and without DCM
• Use memory models with different frequencies

Key Features
The MIG tool is part of Xilinx ISE™ software and is invoked through the CORE Generator™ tool. Figure 1 is a screen shot of the MIG GUI. The key features of the MIG tool v1.6 are:

• Virtex-5 FPGAs:
  • DDR2 SDRAM, Verilog
  • QDR II SRAM, Verilog
• Support for Virtex-4 FPGAs (and the following designs):
  • DDR2 SDRAM, Verilog and VHDL, direct clocking
  • DDR SDRAM, Verilog and VHDL, direct clocking
  • QDR II SRAM, DDR II SRAM, Verilog and VHDL, direct clocking
  • RLDRAM II, Verilog and VHDL, direct clocking
  • DDR2 SDRAM, Verilog and VHDL, SERDES clocking
  • All Virtex-4 designs support both XST and Synplicity
• Spartan-3 FPGAs:
  • DDR SDRAM, Verilog and VHDL
  • DDR2 SDRAM, Verilog and VHDL
• Spartan-3E FPGAs:
  • DDR SDRAM, Verilog and VHDL
  • All Spartan-3 and Spartan-3E designs support XST, Synplicity, and Precision Synthesis
• Support for many different memory components and DIMMs
• Pins picked are based on the selected memory part and user inputs
• Generates RTL and bit files for Xilinx reference boards containing memories
• Basic I/O design rule check (DRC) engine ensures that signals are allocated correctly
• Verifying a modified MIG .ucf file ensures that MIG pin-out rules are valid

Figure 1 – The MIG tool 1.6 GUI

Using the Outputs of the MIG tool
The MIG tool generates everything required to create a memory interface:

• The RTL (Verilog or VHDL) design files
• Synthesis scripts
• ISE scripts for build, map, and place and route
• A .ucf file for pin locations, RLOCs, and any other constraints

After generating the design RTL, you can execute a batch file to synthesize, map, and place the design. The MIG tool generates two designs – one with a testbench and another without. The MIG scripts work on the version with the synthesizable testbench. However, you can integrate your applications to the version without the testbench.

Conclusion
The MIG tool significantly reduces design burden and improves time to market. It has been used successfully by many customers.
For a copy of the Memory Interface Generator or for additional information, visit www.xilinx.com/memory.
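
To give a feel for how quickly the simulation parameters in Table 1 multiply, the Python sketch below enumerates one possible test matrix from the simulation-tested values. It is an illustration only: the article does not say that every combination is simulated, and the dictionary keys and the `build_test_matrix` helper are names invented for this example.

```python
# Illustrative sketch: enumerate a simulation test matrix from Table 1 values.
# Assumption: a full cross product is used here purely to show the scale;
# the article does not state how the real test cases are selected.
from itertools import product
from typing import Dict, List, Sequence

SIMULATION_PARAMETERS: Dict[str, Sequence] = {
    "hdl": ("Verilog", "VHDL"),
    "burst_length": (4, 8),
    "cas_latency": (3, 4),
    "additive_latency": (0, 1, 2),
    "odt_ohms": (0, 75, 150),
    "frequency_mhz": (200, 267),
}

def build_test_matrix(parameters: Dict[str, Sequence]) -> List[dict]:
    """Return one dictionary per combination of parameter values."""
    names = list(parameters)
    return [dict(zip(names, values)) for values in product(*parameters.values())]

if __name__ == "__main__":
    matrix = build_test_matrix(SIMULATION_PARAMETERS)
    print(f"{len(matrix)} combinations before adding data widths and memory parts")
    print("first case:", matrix[0])
```

Adding the data widths and memory parts listed in the table multiplies this count further, which is one reason a generated, scripted flow is attractive for this kind of verification.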

Micron Memory Interface


RLDRAM offers a complete low-latency memory interface
for networking and communication solutions.

by Chris Johnson
Networking and Communications
Strategic Applications Engineer
Micron
csjohnson@micron.com

The increased bandwidth requirements of
high-speed networking systems are pushing
DRAM to perform at SRAM speeds and
latencies. Micron’s reduced-latency DRAM
(RLDRAM) II memory addresses this need
with additional densities not available in
SRAM memories. Combining this memo-
ry technology with the Xilinx® Virtex™-5
device provides an excellent high-density,
high-speed, low-latency solution for the
networking and packet buffer applications
of current and future platforms.

RLDRAM II Memory Features


RLDRAM II memory uses an eight-bank
architecture optimized for high-speed oper-
ation and a double-data-rate (DDR) I/O
for increased bandwidth. The eight-bank
architecture enables RLDRAM II memory
devices to achieve peak bandwidth by
decreasing the probability of random access
conflicts. Although bank management
remains important with RLDRAM II
memory architectures, one bank is always
available for use even in the worst case
(burst of two at 533-MHz operation).
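
The worst-case claim above can be checked with a little arithmetic. The sketch below is only an illustration: it assumes the 15 ns row cycle time (tRC) that this article quotes for RLDRAM II, a 533 MHz clock, and that a burst of two occupies the DDR data bus for exactly one clock, so a new access can start every clock. The function names are hypothetical and do not come from any Micron or Xilinx tool.

```python
import math

def trc_in_clocks(trc_ns, clock_mhz):
    """Row cycle time rounded up to whole clock cycles."""
    return math.ceil(trc_ns * clock_mhz / 1000.0)

def banks_needed(trc_ns, clock_mhz, clocks_per_access):
    """Banks required so round-robin accesses never return to a busy bank."""
    return math.ceil(trc_in_clocks(trc_ns, clock_mhz) / clocks_per_access)

# Assumed figures: tRC = 15 ns, 533 MHz clock, burst of two = 1 clock on a DDR bus
print("tRC in clocks:", trc_in_clocks(15, 533))
print("banks needed :", banks_needed(15, 533, 1))
```

Under these assumptions the row cycle time spans eight clocks, so cycling through eight banks returns to the first bank just as it becomes available again.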


One of the key features added in the RLDRAM II memory architecture is reduced row cycle latency time. Row cycle latency (tRC) is the amount of time that must elapse before a recently accessed bank can be accessed again. Table 1 shows a direct comparison between RLDRAM II memory, DDR2, and DDR at device densities of 576 Mb, 512 Mb, and 512 Mb, respectively.

Latency    RLDRAM II Memory    DDR2    DDR1    Units
tRC        15                  55      55      ns

Table 1 – Row cycle time DRAM comparison

I/O Options
RLDRAM II memory offers separate I/O (SIO) and common I/O (CIO) options. The SIO devices have separate read and write ports to eliminate bus turnaround cycles and contention. CIO devices have a shared read/write port that requires one additional cycle to turn the bus around. RLDRAM II memory CIO architecture is optimized for data streaming, where the near-term bus operation is either 100% read or 100% write, independent of the long-term balance.

Figure 1 illustrates the performance variations between the versions at different read-to-write ratios. The reduced latency and eight-bank architecture achievable with RLDRAM II memory allow faster random access from the memory array, increasing sustainable bus utilization.

[Figure 1 plot: bus utilization (0% to 100%) versus read/write ratio (0.01 to 100.00) for RLD SIO BL2; RLD SIO BL4, BL8; DDR2 CIO BL4; RLD CIO BL2; RLD CIO BL4; RLD CIO BL8; RLD CIO 16 Burst; RLD CIO 32 Burst]

Figure 1 – Peak bandwidth comparison with different DRAM technologies

SRAM-Style Interface
The command bus protocol used in RLDRAM memory is a bit simpler than other DRAM devices. RLDRAM II memory incorporates an SRAM-style interface with read and write commands, replacing the activate and precharge commands used in DDR2 and other similar computer-based memory technologies. This reduction of commands frees up the command bus, helping reduce dead cycles during short burst lengths.

Additional Features for RLDRAM II Memory
RLDRAM II memory also deviates from the refresh requirements of current DRAM technologies. Because ordinary DRAM devices refresh a row in all banks, they require dead clock cycles on the bus after a refresh command. This requires a period of inactivity on the DQ bus, typically 66 ns.

RLDRAM II memory devices have incorporated a bank-based refresh scheme to hide the refresh recovery periods required by other DRAM technologies. The refresh process for RLDRAM memory requires the bank address of the bank that needs to be refreshed, still allowing bus activity during the refresh process and eliminating the dead time seen in other DRAM technologies. Bank-dependent refresh helps increase the percentage of bus utilization, increasing the overall bandwidth of the system using the Virtex-5 device.

Computer-based memory technologies such as DDR2 have relied on modules to support error-correcting code (ECC) technologies to remedy soft errors in the transition of data from the processor to the DRAM. Multiple parts are placed on the DIMM, adding extra data lines for the error-correcting schemes used to eliminate soft errors in the memory channel. RLDRAM II memory is the first DRAM-based technology to add the ECC DQ pins to the devices. RLDRAM II memory is offered in x9, x18, and x36 configurations to provide a single-chip ECC solution without adding unwanted components, reducing board layout space.

Manufacturability in large-component-count systems is a major problem, but it has been ignored in the DRAM industry, largely because the applications that use commodity DRAM devices do not need continuity testing. The module-based business uses so few components that the extra pins are not desirable for JTAG support. Also, discarding manufacturing errors is less expensive than repairing them in small component-based systems.

The target market for RLDRAM memory is an entirely different scenario. Systems based on RLDRAM memory are typically point-to-point applications, where the DRAM is soldered directly to the main board alongside countless other components. Placing all of these components can be a manufacturing challenge. To address this issue, RLDRAM II memory

incorporates JTAG continuity technology, which helps with manufacturing issues in the large system-based boards used in the networking and communication industry.

The density requirements of today’s applications have presented a challenge for SRAM systems. The memory cell for SRAM memory devices is approximately five times larger than a DRAM memory cell. Figure 2 shows a comparison of the two memory technologies. The DRAM memory cell is substantially smaller, allowing for significantly higher density memory subsystems that still approach SRAM latency speeds.

[Figure 2 schematic: SRAM cell (VCC, wordline, two digitlines) beside DRAM cell (digitline, wordline, VCC/2)]

Figure 2 – SRAM cell compared to a DRAM cell

In the first quarter of 2007, Micron is introducing a 576-Mb RLDRAM II memory device compatible with the high-volume 288-Mb RLDRAM memory device. The 576-Mb device will be offered in multiple configurations that are pin-for-pin compatible, but will have additional address pins to accommodate the higher density. RLDRAM II memory offers one of the highest density, lowest latency DRAM-based solutions available on the market today.

Additional I/O Interface Options
The RLDRAM II memory I/O interface provides other features and options, including support for both 1.5V and 1.8V I/O levels and a programmable output impedance driver that enables compatibility with both HSTL and SSTL I/O schemes.

RLDRAM II memory requires an external one-percent precision resistor (RQ) tied to VSS in order to calibrate the driver to a known value and eliminate the process variation that can be introduced during manufacturing. The calibration process requires the external resistor to operate at five times the desired driver impedance. The programmable impedance control (PIC) circuit calibrates the output impedance to the desired value, eliminating variation introduced during the manufacturing process. Updates are made continuously during device operation without interrupting data transfer, ensuring consistent operation of the RLDRAM memory system independent of temperature and voltage.

Micron’s RLDRAM II memory is also equipped with on-die termination (ODT) to enable more stable operation at high speeds without the use of an external termination resistor. ODT provides simplicity and flexibility for high-speed designs by bringing termination resistors on-die, eliminating some of the on-board termination.

At high-frequency operation, however, it is important you analyze the signal driver, receiver, printed circuit board network, and terminations to obtain good signal integrity and the best possible voltage and timing margins. Without proper terminations, the system can suffer from excessive signal attenuation, leading to reduced voltage and timing margins. This, in turn, can lead to marginal designs and cause random soft errors that are difficult to debug. Micron’s RLDRAM II memory provides you with simple, effective, and flexible termination options for high-speed memory designs.

Conclusion
RLDRAM II memory combines several performance-critical features to provide flexibility and simplicity for a wide range of high-speed applications. The speed and latency requirements for high-speed applications continue to grow, demanding new and more innovative solutions to meet market requirements. As RLDRAM II memory helps address current market requirements, the demand continues for increased performance. Micron will address this demand with future low-latency devices such as RLDRAM III memory.

Micron continues to innovate and deliver solutions to meet the needs of today’s and tomorrow’s markets. The joint efforts of Micron and Xilinx help enable our customers to quickly deliver next-generation networking, video, and imaging systems.

For more information, please contact Ray Fontayne at rfontayne@micron.com.

MEMORY INTERFACES

Designing Virtex-5 DDR2 Memory Interfaces for Signal Integrity
Follow these guidelines to make your next
Virtex-5 DDR2 design experience a success.

by David Banas
Sr. Staff Applications Engineer
Xilinx, Inc.
david.banas@xilinx.com

Let’s say that you are about to design your first Xilinx® Virtex™-5 DDR2 memory interface. You need quick guidelines for preferred circuit topologies and a quick summary of the trade-offs involved when using digitally controlled impedance (DCI) on-die termination instead of external termination resistors. In this article, I’ll provide practical design guidelines taken from previous real-world design experience, as well as IBIS simulation results.

Circuit Topologies for Memory Interfaces


Figures 1 and 2 show several possible
topologies for DDR2 address/control and
data lines, respectively. On the bidirectional
data lines, I made the memory chip the
driver and the Virtex-5 device the receiver
to make use of the FPGA’s DCI. The top
schematic diagram in Figure 1 shows the
preferred and recommended use model,
while the other diagrams show variations
often tried in regular design practice.


Figures 3 and 4 show typical receiver eye diagrams corresponding to the topologies shown in Figures 1 and 2, respectively. The input switching thresholds of the receiver are shown as horizontal dashed blue lines for reference. The color of the “probe” arrows in Figures 1 and 2 correspond to the colors of the associated traces in Figures 3 and 4, respectively. I used Mentor Graphics’s HyperLynx software to generate these eye diagrams with the following parameter settings:

• Pseudo-random binary sequence (PRBS) with bit order 7 (a sequence length of 127)
• Bit interval = 1.5 ns (667 Mbps)
• One sequence repetition
• First 50 bits skipped
• Zero added jitter

When looking at the traces in Figure 3, it should be obvious that of the three topologies shown, the recommended use model gives by far the cleanest eye.

The middle schematic in Figure 1 shows a typical mistake made by novice DCI users, which is to assume that using SSTL18_I_DCI drivers eliminates the need for any external termination components. Some DCI users often incorrectly assume that the “_DCI” versions of the SSTL (stub series terminated logic) driver family adjust their output impedance to match the DCI calibration resistors and can therefore be used as matched impedance drivers of the transmission line.

But this is not true. The SSTL18_I_DCI output driver, for instance, has a fixed output impedance of approximately 20Ω, as per the SSTL18 specification. The disastrous results of this erroneous assumption are clearly visible in the yellow trace shown in Figure 3. Not only has the eye been drastically narrowed, but problematic overshoot/undershoot has also been introduced at the receiver input.
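
For readers who want to reproduce a stimulus like the one listed in the simulation settings, here is a minimal sketch of a PRBS generator with bit order 7. It assumes the commonly used polynomial x^7 + x^6 + 1; the article does not say which polynomial HyperLynx uses, and the function name and seed are illustrative only.

```python
from itertools import islice

def prbs7(seed=0x7F):
    """Yield bits of a PRBS-7 sequence (assumed polynomial x^7 + x^6 + 1)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero")
    while True:
        new_bit = ((state >> 6) ^ (state >> 5)) & 1   # XOR of the two tap stages
        state = ((state << 1) | new_bit) & 0x7F       # shift within 7 bits
        yield new_bit

bits = list(islice(prbs7(), 254))          # two full periods
print("repeats after 127 bits:", bits[:127] == bits[127:])

bit_interval_ns = 1.5                      # bit interval from the settings above
print("data rate (Mbps):", round(1000 / bit_interval_ns))
```

A 7-bit maximal-length sequence repeats every 2^7 - 1 = 127 bits, which matches the sequence length quoted in the settings, and a 1.5 ns bit interval corresponds to roughly 667 Mbps.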
[Figure 1 schematics: three address/control topologies, each a Virtex-5 FPGA SSTL18_I driver feeding a 50 Ω, 1.000 ns PCB trace into a Micron MT47H128M4CB address input (A0). Values shown: series resistors RS(A0) = 20 Ω, RS(A1) = 20 Ω, RS(A2) = 50 Ω; parallel resistor RP(B0) = 50 Ω to VpullUp = 0.9V. Annotations: “Can be either SSTL18_I + External Resistor, or SSTL18_I_DCI” and “Must be SSTL18_I + External Resistor”.]

Figure 1 – Typical address/control circuit topology

[Figure 2 schematics: three data topologies, each a Micron MT47H128M4CB data pin (DQ0) driving a 50 Ω, 1.000 ns PCB trace into a Virtex-5 FPGA SSTL18_II input. Values shown: top, 20 Ω series resistors RS(A0) and RS(C0) with 50 Ω parallel resistors RP(B0) and RP(C0) to VpullUp = 0.9V; middle, 50 Ω parallel resistors RP(B1) and RP(C1) to 0.9V only; bottom, 20 Ω series resistors RS(A2) and RS(C2) only.]

Figure 2 – Typical data circuit topology

Figure 3 – Typical eye patterns for address/control

Figure 4 – Typical eye patterns for data

Increasing the series termination to 50Ω, as shown in the bottom schematic in Figure 1, successfully eliminates overshoot/undershoot but does nothing to restore the eye to its original width. Therefore, you should always use parallel termination at the end of all address/control lines.

If an appropriate termination voltage source is not available, you can form a Thevenin-equivalent termination using two resistors connected in series between the VCCO supply and ground, where each resistor has a value twice that of the desired impedance. In this case, simply terminate the line by connecting its end point to the net that connects the two resistors. Note that your circuit consumes more power when terminated in this fashion because of the constant load on the VCCO supply formed by the two resistors.

The data-line eyes in Figure 4 also illustrate that removing the parallel termination from the ends of the line causes unacceptable overshoot/undershoot. However, in this case, removing the series terminations appears to have improved the eye, making it slightly wider and providing more “head room” against noise without introducing overshoot/undershoot. This serves as an excellent reminder that blindly following recommended use models, rules of thumb, or general guidelines is never a good substitute for simulating your design.

Keep in mind that before approving an engineering change order (ECO) for the removal of the series termination resistors, you should reverse the direction of the line with the Virtex-5 device driving data to the memory chip and check the eye for good signal integrity (SI).

Why Use DCI?
The benefits of using DCI, as opposed to equivalent external termination, are numerous, including:

• Better SI at receiver inputs
• Reduced PCB size
• Reduced bill of materials (BOM) parts count

SI at receiver inputs improves when using DCI because the termination lies closer to the inputs than when you use an external termination resistor.

PCB size and BOM parts count are both reduced when using DCI because of the elimination of external termination components.

Caveats when using DCI include:

• Impedance variation over process/voltage/temperature
• Greater power consumption

The termination impedance when using DCI is provided by CMOS transistors; therefore, the value of that impedance can vary along with variations in the fabrication process, supply voltage, and operating temperature (PVT) of the FPGA. You should always perform system-level SI simulations twice, using the high and low extremes for the value of the termination impedance to ensure correct system operation across all possible combinations of PVT.

Using DCI for parallel termination at the end of a transmission line results in higher power consumption than using an external resistor, assuming that an appropriate termination voltage source is available. In this case, the end of the line can be connected to the voltage source through a resistor with a value equal to the characteristic impedance of the line, and no load will be placed across the supply rails.

Conversely, when DCI is used to terminate the line, two pass transistors connect the receiver input to VCCO and ground, respectively. Each transistor is adjusted to have an effective resistance equal to twice the characteristic impedance of the line, thus producing a Thevenin-equivalent termination impedance of Z0 to VCCO/2.

A side effect of this termination scheme is that an additional load of 4Z0 appears across the supply rails, consequently increasing overall system power consumption. If the system architectural design specifications do not provide for a voltage supply at VCCO/2, then no power penalty is incurred for using DCI because the termination scheme has to be Thevenin-equivalent in either case.

Table 1 gives the worst-case output power dissipation of a single line for three termination styles: source series, external parallel (assuming availability of VCCO/2), and internal parallel (DCI).

Termination Type                        Power Dissipation
External Source Series Termination      41 mW
External Parallel Termination           49 mW
Internal (DCI) Parallel Termination     57 mW

Table 1 – Power dissipation versus termination type (for a discussion of output power calculations, see “High-Speed Digital Design: A Handbook of Black Magic” by Howard Johnson and Martin Graham)

Conclusion
In this article, I’ve shown how various digressions from the Xilinx recommended use model for circuit topology of DDR2 memory interfaces affect the eye at the receiver. I hope you are convinced that system-level simulation with an IBIS simulator such as HyperLynx is a necessity when designing DDR2 memory interfaces. And I’ve given you some pros and cons for using DCI as an alternative to external termination resistors in your next DDR2 design.

Going forward, as memory interface speeds continue to increase and I/O voltage levels continue to decrease, you will be able to apply the general principles learned here to the design of more complex memory interfaces as those standards emerge.

For more information, visit www.jedec.com.
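
The Thevenin-equivalent termination discussed in this article can be worked through numerically. The sketch below is an illustration only: it assumes Z0 = 50 Ω and VCCO = 1.8V (consistent with the 50 Ω traces and 0.9V termination voltage in the figures), treats the two termination elements as ideal resistors, and computes only the standing load across the rails with the line undriven. It does not attempt to reproduce the worst-case figures in Table 1, and the function name is hypothetical.

```python
def thevenin_split_termination(z0_ohms, vcco_volts):
    """Two equal resistors of 2*Z0, one to VCCO and one to ground."""
    r_each = 2 * z0_ohms
    r_thevenin = (r_each * r_each) / (r_each + r_each)    # parallel combination
    v_thevenin = vcco_volts * r_each / (r_each + r_each)  # divider midpoint
    rail_load = r_each + r_each                           # series path VCCO to ground
    static_power_w = vcco_volts ** 2 / rail_load          # standing dissipation
    return {
        "r_each": r_each,
        "r_thevenin": r_thevenin,
        "v_thevenin": v_thevenin,
        "rail_load": rail_load,
        "static_power_w": static_power_w,
    }

# Assumed values: Z0 = 50 ohms, VCCO = 1.8 V
for name, value in thevenin_split_termination(50, 1.8).items():
    print(f"{name}: {value}")
```

With these assumed values the pair of 100 Ω resistors looks like 50 Ω to 0.9V from the line, while presenting a 200 Ω (4Z0) load across the supply, which is the extra standing current the article attributes to this scheme.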

SOURCE-SYNCHRONOUS INTERFACES

Improve System Reliability with SPI-4.2 LogiCORE Solutions and Virtex-5 FPGAs
Virtex-5 devices provide an ideal platform for source-synchronous designs like the widely adopted SPI-4.2 interface.

by Dean Armintrout
Product Marketing Engineer, IP Solutions Division
Xilinx Inc.
dean.armintrout@xilinx.com

Chris Ebeling
Principal Engineer, IP Solutions Division
Xilinx, Inc.
chris.ebeling@xilinx.com

System Packet Interface Level 4 Phase 2 (SPI-4.2) is the Optical Internetworking Forum’s recommended interface for the interconnection of devices for aggregate bandwidths of OC-192 (ATM and POS) and 10 Gbps (Ethernet), as illustrated in Figure 1.

The SPI-4.2 interface has become the standard for interconnecting leading-edge 10 Gbps framers, traffic managers, network processors, and switch fabrics. SPI-4.2 is popular because of its efficient interface, which offers high bandwidth and low pin count, along with seamless handling of typical system requirements such as flow control, error detection, synchronization, and bus realignment.

The Xilinx® Virtex™-5 architecture provides an ideal platform for implementing SPI-4.2. The Xilinx SPI-4.2 LogiCORE™ IP targeting Virtex-5 devices provides a significantly smaller solution with dramatic power savings, 1.2 Gbps LVDS DDR I/O, and complete pin assignment flexibility.

SPI-4.2 LogiCORE IP
Continually improving on its SPI-4.2 solu-
tion, Xilinx has made the latest implemen-
tation 25% smaller than previous versions
by leveraging the 65-nm ExpressFabric™
technology and real six-input look-up tables
(LUTs) of Virtex-5 FPGAs.
Enhanced ChipSync™ technology is
supported on every pin of the Virtex-5
device family, allowing you to target the
SPI-4.2 LogiCORE solution to any device
pinout to meet your system and PCB
requirements. High-performance interfaces
are supported by 1.2 Gbps LVDS data rates.
For applications requiring multiple
SPI-4.2 interfaces, the Virtex-5 FPGA’s
logic density, high pin count, and exten-
sive clocking resources support four or
more full-duplex cores in a single device.


ChipSync Source-Synchronous Technology
Virtex-5 devices build on ChipSync technology to ensure reliable high-speed data transfer for source-synchronous applications like SPI-4.2 with these features:

• Built-in serializer/deserializer (SERDES) logic enables the fabric to interface to the I/O at a fraction of the source-synchronous clock rate. The included bitslip function allows shifting of the deserialized data to achieve word alignment when linking multiple pins (bus deskew).
• Input delay (IDELAY) components allow the dynamic phase alignment (DPA) logic to independently adjust the delay of each bit of a bus in 75-ps increments, providing a mechanism for tuning the interface timing to the system environment.
• DDR registers integrated into the I/O pins simplify the interface between the FPGA fabric and the I/O blocks by supporting data transfer on a single clock edge.

[Figure 1 block diagram: an SPI-4.2 PHY layer device or NPU connects over the SPI-4.2 interface to a Virtex-5 device containing an SPI-4.2 Sink Core (Rx datapath, Rx status path) and an SPI-4.2 Source Core (Tx datapath, Tx status path); each core presents a user sink or source interface to the user logic.]

Figure 1 – Typical SPI-4.2 application

SPI-4.2 and ChipSync Technology
The SPI-4.2 interface has a DDR source-synchronous data bus that comprises 18 LVDS pairs (16 data, 1 control, and 1 clock), operating at a minimum rate of 311 MHz.

The SPI-4.2 core uses ChipSync technology to serialize/deserialize bus data to a four-word SPI-4.2 datastream at a lower clock rate; thus, you can implement high-frequency SPI-4.2 interfaces in slower speed grade Virtex-5 devices.

The SERDES functions allow the core logic to transfer these four words to and from the I/O logic without using any CLB logic resources and operate at half the source-synchronous DDR clock rate. For example, a SPI-4.2 interface with a 500-MHz DDR reference clock only requires an FPGA fabric clock of 250 MHz – easily achievable in the Virtex-5 architecture.

As the frequency of the source-synchronous clock increases, data recovery at the receiving (sink) device becomes more challenging. The SPI-4.2 protocol provides a training pattern that permits a receiving device to adjust its data sampling to the system interface timing – a process referred to as dynamic phase alignment (DPA).

In Virtex-5 FPGAs, the IDELAY feature present in every I/O is ideally suited to adjust the clock-data phase relationship for maximum I/O timing margin. This has two primary benefits for the SPI-4.2 core:

• Integrating the IDELAY feature into the input pin (ILOGIC) reduces the FPGA resources required for DPA to less than 350 slices.
• The IDELAY function’s ability to adjust the data sampling point enables DPA to be implemented in the I/O – except for a small control state machine implemented in the fabric. The state machine portion is fully synchronous and does not require a complex macro. Thus, there are no restrictions on SPI-4.2 pin assignments.

Continuous DPA
The Xilinx SPI-4.2 LogiCORE solution enhances communication system reliability with continuous DPA, which monitors the clock-data alignment during operation and constantly adjusts the data sampling points to adapt to system timing changes.

[Figure 2 diagram: (a) Initial data eye alignment, showing system jitter and a fixed offset of 2 taps; (b) Data window drifts; (c) Adjust master IDELAY tap to the middle of data eye. Legend: S1 = Slave IDELAY sample point 1, S2 = Slave IDELAY sample point 2, M = Master IDELAY sample point.]

Figure 2 – Continuous DPA operation

Following the initial clock-data alignment phase, the sampling point of each data bit is aligned to the middle of the data valid window (Figure 2a). This window can shift with changes in operating conditions, such as voltage and temperature, as well as other variations (Figure 2b). Continuous DPA addresses this by constantly monitoring the ingress data and adjusting the sample point of each data bit to provide the maximum timing margin (Figure 2c).

Although the OIF SPI-4.2 Implementation Agreement calls for the insertion of periodic training patterns to maintain clock-data alignment over time, Xilinx continuous DPA does not depend on the presence of training patterns. By


reducing/eliminating the need for periodic training patterns, continuous DPA enables the maximum data bandwidth in your system while maintaining the optimal clock-data alignment at each pin.

DPA Diagnostics
If your hardware operation encounters alignment issues, the Xilinx SPI-4.2 core includes DPA diagnostic ports to aid with debugging. The DPA diagnostic data monitors the data eye and final sampling point of the initial alignment process, as well as a second sweep of the data valid window to determine if any changes have occurred.

You can connect the diagnostic ports to the ChipScope™ analyzer or other logic probes to analyze alignment conditions while the FPGA is on the board, interacting with the rest of the system.

Figure 3 – Illustration of four instances of SPI-4.2 LogiCORE IP implemented on a Virtex-5 XC5VLX110 device

Clocking Resources
Virtex-5 FPGAs provide an unprecedented number of clock resources for implementing multiple SPI-4.2 interfaces in a single device. The abundance and flexibility of clock distribution in the Virtex-5 family solves this challenge, supporting as many SPI-4.2 interfaces as the device logic and I/O will accommodate.

In the Virtex-5 family, all devices have 32 global clock resources, with any 10 of the 32 total global buffers available in each clock region. The global clock trees and associated buffers are implemented differentially for best duty-cycle fidelity and greater common-mode noise rejection.

In addition, each region in the device has four regional clock nets, which are ideal for source-synchronous interface clocking at rates above 1 Gbps. You can configure the SPI-4.2 LogiCORE IP to use either global or regional clock resources.

These high-performance clock resources support as many as four SPI-4.2 interfaces in a mid-range device (LX85/LX110) and more than four SPI-4.2 interfaces in the larger devices (Figure 3). The Virtex-5 clocking capability enables a whole new class of SPI-4.2 applications and provides an ideal platform for applications such as multiplexing and demultiplexing, bridges, and switches.

Higher Performance at Lower Power
Virtex-5 silicon is manufactured with a 65-nm triple-oxide process that reduces power consumption by as much as 35%. This has a positive impact for all designs, including the SPI-4.2 interface; the power savings are summarized in Table 1.

                                                      Virtex-4 FPGA     Virtex-5 FPGA
Power: Static Alignment @ 700 Mbps per LVDS Pair      1.55W             1.42W
Power: Dynamic Alignment Performance per LVDS Pair    2.0W @ 1 Gbps     1.66W @ 1 Gbps
Speed Grades Supporting 800 Mbps per LVDS Pair        -10, -11, -12     -1, -2, -3

Table 1 – SPI-4.2 power estimates for Virtex-4 and Virtex-5 FPGAs

With Virtex-5 devices, SPI-4.2 uses significantly less power than its predecessors, both because of the enhanced 65-nm process and because the LogiCORE solution uses 25% less fabric resources. At the same time, Virtex-5 FPGAs support 20% higher performance for SPI-4.2, with high-speed 1.2 Gbps LVDS data rates on every I/O of the device.

This means that not only can you place multiple SPI-4.2 interfaces anywhere on the device, but for each interface, you can realize an aggregate bandwidth as high as 19 Gbps. Designs not requiring this level of performance (such as more typical framer interfaces running at 10-12 Gbps) automatically get additional performance overhead, ensuring ease of design integration and timing closure.

Conclusion
Xilinx SPI-4.2 LogiCORE IP coupled with Virtex-5 features provides a highly efficient and reliable SPI-4.2 solution. We developed ChipSync technology and continuous DPA specifically for source-synchronous interfaces like SPI-4.2.

This technology allows you to design the most efficient and reliable SPI-4.2 solutions, which use significantly less resources (25% less), allow fully flexible device pin assignments (you choose the pinout), and support extremely high interface speeds (1.2 Gbps LVDS DDR I/O).

The higher performance is even more remarkable because Virtex-5 FPGAs achieve this while consuming significantly less power. The wealth of Virtex-5 clocking resources, combined with full pin assignment flexibility, enables a new class of applications with multiple SPI-4.2 interfaces.

For more information about the SPI-4.2 LogiCORE IP targeting Virtex-5 devices, visit the Xilinx IP Center at www.xilinx.com/systemio/spi-4.2. A hardware demonstration is also available; for more information, contact your Xilinx representative.
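
The clock and bandwidth figures quoted in this article follow from simple DDR arithmetic, sketched below. The sketch assumes that only the 16 data pairs count toward aggregate bandwidth and that the fabric runs at half the DDR clock, as the article's 500 MHz to 250 MHz example states; the function name and parameters are illustrative, not part of the LogiCORE IP.

```python
def spi42_rates(ddr_clock_mhz, data_pairs=16, fabric_divider=2):
    """Return (per-pair Mbps, fabric clock MHz, aggregate Gbps)."""
    per_pair_mbps = 2 * ddr_clock_mhz                 # DDR: two bits per clock
    fabric_clock_mhz = ddr_clock_mhz / fabric_divider
    aggregate_gbps = per_pair_mbps * data_pairs / 1000
    return per_pair_mbps, fabric_clock_mhz, aggregate_gbps

# 311 MHz is the minimum rate, 500 MHz is the article's example,
# 600 MHz corresponds to 1.2 Gbps per LVDS pair
for clock in (311, 500, 600):
    pair, fabric, total = spi42_rates(clock)
    print(f"{clock} MHz DDR: {pair} Mbps/pair, "
          f"{fabric} MHz fabric, {total} Gbps aggregate")
```

Under these assumptions, 600 MHz DDR clocking gives 1.2 Gbps per pair and 19.2 Gbps across 16 data pairs, in line with the "as high as 19 Gbps" figure, while the 311 MHz minimum works out to just under 10 Gbps.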



VERTICAL MARKET SOLUTIONS

Using Virtex-5 FPGAs in COTS Board-Level Products
Optimize your COTS designs with the many improvements in the Virtex-5 family.

by Craig Davies
Firmware Engineer
VMETRO Ltd. (High Wycombe, UK)
cdavies@vmetro.com

Jeff Bateman
Senior Systems Engineer
VMETRO Inc. (Ithaca, NY)
jbateman@vmetro.com

In the fast-paced world of FPGA development, Xilinx has struck again with its second-generation ASMBL™ architecture devices, the Virtex™-5 family. This device family has many upgrades from its predecessor, the Virtex-4 family, and likewise continues the evolution of the ASMBL architecture, with scalable FPGAs catering to the application-specific marketplace. For commercial off-the-shelf (COTS) developers, this means a platform that is low cost, light on power consumption, and optimized for high performance.

When creating the Virtex-4 family, Xilinx harnessed the flexibility of the ASMBL architecture to build the first multi-platform FPGA family. Xilinx continues this approach with the Virtex-5 family. The initial offering is the Virtex-5 LX platform, optimized for high-performance logic.

Seasoned FPGA users expect new FPGA generations to deliver more and the Virtex-5 family certainly delivers, all while consuming less power. Compared to Virtex-4 LX devices, Virtex-5 LX FPGAs offer:

• 65% higher logic capacity with as many as 330,000 logic cells
• 70% more block RAM
• 100% more DSP slices
• 25% more SelectIO™ pins

For COTS board vendors, these features enable powerful products capable of handling the very high data rates and processing complexity required of modern real-time DSP systems. In this article, we’ll examine those Virtex-5 architecture components that enable COTS designers to deliver more bang for the buck.

COTS FPGA Backgrounder
Freed from the need to design hardware and IP from scratch, COTS board-level users can focus their energies on implementing their specialist algorithms. COTS products incorporating user-programmable FPGAs target a variety of applications, from simple customizable digital I/O to RADAR, video, and signals intelligence (SigInt).

Typically, the FPGA requires hardware connections to a real-world data source or destination, plus a standardized interface to a host processor. COTS products must usually follow an industry-standard form factor (such as PMC, VME, VXS, and CompactPCI), enabling end users to integrate products from a range of vendors.


With its parallel architecture and high-speed I/O capabilities, the Virtex-5 FPGA is capable of streaming and processing data at the gigabyte-per-second rates typically required for today’s applications. It is well suited to algorithms where a core “inner loop” can be parallelized to speed up operation, employing the resources available in modern devices. Many DSP algorithms dovetail with this architecture. Conversely, even the fastest CPUs cannot easily process data at gigabyte-per-second rates; they are, however, well suited to decision making and user interaction tasks.

Given these trade-offs, FPGA-based DSP systems often employ a hybrid approach, as illustrated in Figure 1. Here, a wide-bandwidth RADAR or video source is digitized at gigasample-per-second rates and fed to an FPGA. The FPGA performs some heavy-duty number crunching to eliminate unwanted data, focusing in on the key area of interest. Pre-processed data is fed at a more manageable rate to a general-purpose CPU for post-processing control and display.

Figure 1 – Processing chain (capture sensor data with a gigasample/sec ADC; Gbps to the FPGA; pre-process data in a large, fast FPGA with high-speed parallel DSP and user-programmable IP: filtering and data reduction, find targets within noise; Mbps to the CPU; post-process data for interpretation and display: identify target as friend/enemy, vehicle/aircraft type, speed, position, altitude, threat level; display result)

Key COTS FPGA board requirements are:
• Large, reconfigurable FPGAs with ample room for customer-programmable application logic
• Regular air-cooled and rugged conduction-cooled options
• High-speed interface for efficient transfers to and from a host processor
• Flexible, fast I/O to and from a variety of real-world interfaces
• Local memory interfaced directly to the FPGA for I/O buffering as well as temporary storage during algorithm operation
• Wide range of I/O and signal-processing IP cores to speed end-user development cycle times
• Flexible FPGA development tools covering both budget-conscious and extreme-performance users
• Debugging interface capable of in-FPGA logic analysis
• Comprehensive board support firmware and software

With a proven track record in the high-end FPGA DSP arena and comprehensive tool and IP support from a variety of sources, the Virtex-5 FPGA family is a natural choice for COTS board-level vendors.

Optimization of Soft Components
A great addition to the Virtex-5 architecture is the replacement of traditional four-input look-up tables (LUTs) with new six-input LUTs (6-LUTs) for more efficient mapping of wider functions. Because 6-LUTs are also configurable as dual five-input LUTs, design software tools can achieve greater efficiency in logic mapping when six-input functions are not required.

Most FPGA devices these days base their soft fabric components – those components configured to implement logic equations – on LUTs. Previously, the common choice was the four-input LUT, as this was a nice binary base and was relatively easy to work with for optimizing a logic function. A given equation can be optimized to contain a sum of products of four inputs. For many of today’s applications, especially those in DSP, the benefit of this optimization diminishes significantly as system-level algorithms increase in complexity.

Configurable logic block storage density improvements increase the shift register LUT (SRL) length from 16 bits to 32 bits (SRL32), while retaining a dual SRL16 option. Distributed RAM now offers a 64-bit option, up from 16 bits. With improved reduced-hop routing and more logic per slice (four LUTs/four flip-flops versus two LUTs/two flip-flops), speed improvements of as much as 45% are possible.

Improved FIR Efficiency
Let’s consider a finite impulse response filter implemented in distributed logic. Distributed arithmetic filters are often selected because their operating frequency is not tied to the length of the tap vector. This characteristic is highly desirable because increasing the tap vector length is fundamental to improving the overall filter response. However, these types of filters are
serial in nature; most applications need valid output at a marginally decimated rate with respect to the sampling frequency. Thus, a fully parallel architecture is required.

Parallel Distributed Arithmetic FIR filters (DAFIRs) utilize significantly more logic than other FIR implementations to perform the many partial products on a clock-to-clock basis (even when decimating the sampling rate). In the distributed arithmetic architecture, a corresponding output product y(n) is produced by summing the products of a time-delayed series of inputs x(n) and coefficients a(m), where m is an integer between 1 and N and N is the filter length.

For the sake of simplicity, let’s say that each tap or filter coefficient is two bits wide and that the input vector is six bits. In total, our filter is 96 taps in length. If we calculate this product using the partial products method, we need 6 x 2 partial products using four-input LUTs. Each LUT is capable of a 2 x 2 multiplication, which means using three four-input LUTs per tap. Using 6-LUTs, we can reduce this to just two LUTs. For 96 taps, we have saved 96 LUTs of a possible total of 288. This is just the savings when producing the partial product.

LUTs and SRLs are also used for shift registers in the input delay pipe and for the scaling accumulator responsible for summation and normalization of the output. Expanding our example input and tap widths to a more applicable precision of 16 bits increases the depth of our partial product multiplication tree, requiring even more LUTs. Using 6-LUTs as opposed to four-input LUTs results in a LUT logic reduction of more than 33%.

Wider LUTs Improve Efficiency and Speed
Switched fabric developers will also benefit from the 6-LUTs, as these are often used to implement multiplexers. 6-LUTs mean a reduction in the overall depth of a logic equation. For implementing multiplexers, this means an effective increase in speed for an equivalent multiplexer implemented using 6-LUTs, as opposed to four-input LUTs.

Depending on the application, changing to 6-LUTs can make as much as a 1.6x improvement in logic utilization over previous generations of the Virtex device family.

Multiplying Computational Power
Not to be outdone by the soft-logic components, the hard-logic dedicated multipliers have also been optimized for the Virtex-5 FPGA. The 18 x 18-bit multipliers present in the Virtex-II and Virtex-4 families have been upgraded to 25 x 18-bit multipliers in the new family. Application developers who implement beam-forming arrays or other advanced computations will benefit from this enhancement.

Large multiplication arrays that require a high degree of precision traditionally required a large tree structure of multipliers. As output is carried between intermediary stages of a large multiplication, the maximum allowable output value increases with each subsequent stage. To handle this bit-width increase, typical solutions involve a precision reduction by truncation or some other intelligent scheme such as convergent rounding or (less often) by breaking down the multiplication into smaller stages and then rebuilding the final product by summation. With 25 x 18-bit multipliers, more precision is carried through intermediary stages of a multiplication, reducing the impact of intermediary truncation/rounding errors while improving overall speed and minimizing pipeline latency.

Suppose convergent rounding is employed to reduce the precision at each stage of multiplication within an FFT. If we implement an 8K FFT using a mixed-radix base of radix-4 and radix-2, that gives us six radix-4 butterfly stages and one radix-2 butterfly stage. In an FFT, at each subsequent stage we perform calculations that produce partial products. These outputs are fed into the multipliers of the next stage until the time-domain data is transformed completely to the frequency domain. However, each stage must employ a scheme to reduce the precision of the outputs so that subsequent stages can accept them as inputs. After each stage of multiplication, scaling is employed to reduce the precision. Each stage of scaling introduces quantization errors.

But what if more precision is carried into the input side of each butterfly stage? Using 25 x 18-bit multipliers, we can carry more precision from our partial products when multiplying them by new sample data and in turn introduce fewer rounding errors into our results.

Improved Source-Synchronous Memory Access
Much to the delight of many designers using Virtex-4 FPGAs, Xilinx introduced a primitive called IDELAY capable of synchronizing data and strobes to a source clock off the FPGA. This feature meant that high-speed DDR and DDR2 SDRAM and QDR and QDR II SRAM memories could be accessed through controllers inside the Virtex-4 device at high data rates.

COTS developers are increasingly finding applications that require fast and deep onboard memory. For example, data recording applications benefit greatly from fast onboard memory to implement the sizeable buffers needed to sustain high-speed data transfers over PCI/PCI-X buses. Video processing applications also require large, fast external memories to store the very-high-resolution, high-frame-rate images produced by today’s leading camera equipment.

The introduction of the IDELAY primitive also benefits ruggedized application developers, as the IDELAY taps can be constantly monitored by logic to perform run-time resynchronization to the source clock; this technique is known as dynamic clock-to-data centering.

Now, with the Virtex-5 family, Xilinx has expanded the primitive to add ODELAY, enabling delay control on both input and output signals. The key function of the IDELAY primitive is to delay the input data relative to the clock such that the internal FPGA version of the source clock edge is centered with the input data. The ODELAY enables variable delays per output data line to better match trace-length differences.

Improving High-Speed I/O Communication
As the COTS marketplace moves more and more into high-speed serial implementations, clock and data recovery techniques
become more in demand. When implementing general-purpose high-speed serial links, transmission errors and data loss become a reality, especially when targeting data rates beyond 1 Gbps.

For developers using previous generations of Virtex devices, the choices for clock and data recovery (CDR) implementation to de-serialize incoming streams without using multi-gigabit transceivers (MGTs) were limited. Although the delay-locked loops (DLLs) used for clock generation in previous families are very stable in nature because of their first-order loop architecture and digital implementation, they are not able to filter input jitter or handle phase alignment beyond their discrete range. With the phase-locked loop (PLL) blocks introduced with the Virtex-5 family, jitter reduction is an intrinsic feature, resulting in large improvements in higher data-rate sustainability. Filtering input jitter to produce stable internal versions of source clocks is critically important to correctly sample and store incoming data at the FPGA I/O boundary. Using these new blocks, implementing SERDES components using regular SelectIO pins becomes practical even at 1 Gbps and above.

Together with SelectIO performance of as much as 800 Mbps per pin single-ended and 1.25 Gbps differential, the Virtex-5 device is able to input, process, and output the high data rates generated by current real-world interfaces. For example, interfacing directly with ADCs and DACs running in the gigasample-per-second range is now perfectly feasible.

Looking to the future, new high-speed serial fabric interfaces will be a natural fit for interfacing between FPGAs, external devices, and host systems.

65-nm Copper CMOS
COTS developers will greatly benefit from the move into the 65-nm copper CMOS process. One of the consequences of process shrinks is that density and performance increase with the next generation. This is true in the case of the Virtex-5 LX platform, which has increased the number of CLBs by 65% over the Virtex-4 LX platform, block RAM by 70%, DSP slices by 100%, and SelectIO pins by 25%.

With increased logic density in the overall package, power consumption has been reduced significantly. While the Virtex-4 FPGA operates at 1.2V core voltage, the Virtex-5 FPGA improves power efficiency with a core voltage of 1.0V. You can achieve further power savings by optimizing the soft and hard components, such as the 6-LUTs and 25 x 18-bit multipliers.

Figure 2 – VMETRO PMC-FPGA05
Figure 3 – Standard PCI option

When implementing large designs – such as in software-defined radio (SDR) applications where multi-channel digital filters consume significant CLB space – the dynamic power dissipation is quite high because of the large amount of switching activity that occurs. This is in part caused by the extensive signal routing required to implement these designs. With the new components in the Virtex-5 family, existing designs implement in a smaller number of primitives, reducing the overall switching activity. In addition, the Virtex-5 architecture includes the enhancement of diagonally symmetric routing for more efficient design implementation.

COTS developers often complain about violating power specifications when developing mezzanine cards in existing designs. With the Virtex-5 device, their mezzanine cards will be less power-hungry and more desirable to end users. In other words: less power required, simplified cooling, and greater reliability.

COTS products are often employed in environments where power consumption is a significant challenge – for example, high ambient temperatures may limit a cooling system’s effectiveness, so reducing heat output is an important motivation. Other applications such as unmanned airborne vehicles (UAVs) have limited electrical power availability, so using every watt available effectively is of paramount importance.

The VMETRO PMC-FPGA05
A good example of a current COTS product implementing the advanced Virtex-5 65-nm technology is the VMETRO PMC-FPGA05, a general-purpose high-end FPGA PCI mezzanine card (PMC) pictured in Figures 2 and 3 and illustrated in the block diagram shown in Figure 4.
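
As a rough cross-check on the host interface shown in the block diagram, the sketch below computes the theoretical peak rate of a parallel bus from its width and clock. The 64-bit width and the 133 MHz clock are assumptions read from the Figure 4 labels, and the function name is hypothetical. The result ignores PCI-X protocol overhead, so sustained throughput on a real board will be lower.

```python
def peak_bandwidth_bytes_per_s(bus_width_bits: int, clock_hz: float) -> float:
    """Theoretical peak rate of a parallel bus moving one word per clock.

    Ignores arbitration, addressing phases, and any other protocol overhead.
    """
    return bus_width_bits / 8 * clock_hz


# Assumed values: 64-bit PCI-X at 133 MHz, as labeled in the block diagram.
pci_x_peak = peak_bandwidth_bytes_per_s(64, 133e6)
print(f"PCI-X 64-bit @ 133 MHz: {pci_x_peak / 1e6:.0f} MB/s peak")
```

Under these assumptions the peak works out to 1064 MB/s, which is an upper bound to plan buffering around rather than a measured figure.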


Figure 4 – PMC-FPGA05 block diagram (Virtex-5 FPGA, LX50 to LX110; three banks of 18-bit, 200 MHz QDR II SRAM; two banks of 16-bit, 200 MHz DDR SDRAM; 256 Mbit flash; 64-bit GP I/O with 138 signals; 64-bit PCI-X)

A Virtex-5 FPGA at the heart of a PMC-FPGA05 means that it is a highly integrated design. There is no need for external bridges and controllers because the Virtex-5 device is more than capable of handling these functions directly (see Figure 5) while using only a small amount of resources. This leaves plenty of space in the PMC-FPGA05 to include IP such as digital receivers, FFTs, or other DSP functions.

Figure 5 – PMC-FPGA05 controller IP components (SRAM, SDRAM, flash, PCI-X, and custom user I/O controllers inside the Virtex-5 FPGA)

Implementing Flash and PCI-X interfaces inside the FPGA introduces some design challenges. How is the FPGA configured and debugged without these interfaces already in place – especially if the IP is complex? This is where the ChipScope™ Pro Analyzer (version 8.2) comes in – it embeds a logic analyzer inside the FPGA and connects the user interface to the FPGA through JTAG. It provides a debugging portal into the FPGA that can be inserted through HDL entry.

Once built, a ChipScope Pro ILA (integrated logic analyzer) port assignment can be changed in FPGA Editor. Therefore, changes to the ILA do not require another full run through the ISE™ flow when debugging, as shown in Figure 6. Instead, you can make changes at the map stage of the design flow, reducing the time between generating bitstreams and improving efficiency during the debugging cycle. The ChipScope Pro Analyzer consumes a limited amount of the FPGA’s resources, though this is largely a function of the depth of the analyzer sample memory. Features such as the analyzer memory depth and triggering functions are parameterizable through the ChipScope Pro inserter tool, eliminating unnecessary resource waste.

Figure 6 – ChipScope debugging cycle (basic design flow: HDL design entry, NGD, mapping, placement and routing, NCD, BitGen, BIT, then JTAG download to the Xilinx FPGA; the ChipScope ILA can be inserted at HDL design entry or at the mapping stage before iterating)

The ChipScope Pro Analyzer is a valuable asset when debugging designs such as the VMETRO PMC-FPGA05. The PCI-X interface in particular represents a challenge. In simple terms, a debug cycle might involve powering up the PCI bus, configuring the IP onto the FPGA, and re-initializing the PCI bus to detect the FPGA’s PCI controller. This low-level development requires a tool like the ChipScope Pro Analyzer to reduce the number of bus initialization cycles. Without the ChipScope Pro tool, debugging would require a sophisticated bus analyzer – and even then the internal FPGA functionality would be inaccessible. Utilizing the ChipScope Pro Analyzer saves the design team considerable time by reducing the number of iterations required to debug the PCI-X interface.

The ChipScope Pro tool is not only useful for VMETRO board development, but also for customers integrating their IP and external interfaces. We designed the PMC-FPGA05 with a high-density parallel interface that you can use with a range of I/O modules including analog I/O, RS485, LVDS, FPDP, and Camera Link. Using the ChipScope Pro Analyzer makes these interesting projects manageable.

The PMC-FPGA05 also offers application developers a platform that can sustain high-bandwidth data transfers and implement sophisticated DSP and processing algorithms at a fraction of the power of previous generations.

FPGA Resources
More than any other FPGA family, the Virtex series leads the way in IP availability. Through the Xilinx® Alliance Program, IP is available from Xilinx directly as well as from third-party IP suppliers. With a strong worldwide network of IP vendors accessible directly from www.xilinx.com, COTS FPGA board users can choose IP cores from suppliers experienced with as many as five generations of the Virtex family. This is key to success when developing projects to aggressive timescales, and therefore is a high-priority reason for selecting the Virtex-5 family.

FPGA Firmware Development Process
With the Xilinx ISE tool chain offering an easy-to-use GUI development environment and an included VHDL/Verilog synthesis tool (XST), many end users need only purchase ISE software for a complete cost-effective development solution. Those implementing the most complex algorithms may benefit from high-end synthesis tools from third-party vendors. These integrate directly with ISE software to maintain a simple project management process.

Simulation may not reveal all errors in the design, particularly for complex projects. When an implementation does not function as expected, you must connect up to the actual hardware to see what is going on. But with dense packaging covering many thousands of pins, there’s no practical way to connect a traditional logic analyzer.

VMETRO engineers have many years of experience developing firmware for Xilinx FPGAs. Using ISE software as the primary synthesis/place and route tool, Model Technology’s ModelSim PE for simulation, and the ChipScope Pro tool for in-circuit debugging, VMETRO developed the IP necessary for interfacing with the board hardware. This includes interfaces for the SRAM and SDRAM memory devices, to which users simply connect their address and data signals, and a high-performance bus mastering PCI-X interface core supporting customizable registers and simple FIFO-based DMA transfers for streaming data.

Another good reason to choose the Virtex-5 family is a commitment to education and support: from introductory courses in VHDL to DSP logic design courses to development laboratories equipped with the latest gear for testing high-speed serial interfaces, Xilinx offers the resources necessary for successful FPGA deployment.

As the Virtex-5 device is brand new, the PMC-FPGA05 is still in development. Stay tuned for an update on the challenges, solutions, and lessons learned as VMETRO works hard to bring you the world’s first Virtex-5 COTS product.

Conclusion
To meet the growing demands being placed on the COTS marketplace, you must adapt and implement platforms with the right tools for the application. Today, this means integrating high-speed serial communication, fast access memory, and plenty of optimized logic space for advanced algorithm development. Equally important are efficient development and debug tools, IP resources, and a commitment to high-speed DSP development.

The highest logic density available, the lowest power consumption, and the best performance are what COTS developers need to meet the needs of their customers. Virtex-5 FPGAs, with their ASMBL architecture and 65-nm process, deliver on these demands.

Get Published
Would you like to be published in Xcell Publications? It's easier than you think! Submit an article draft for our Web-based or printed Xcell Publications and we will assign an editor and a graphic artist to work with you to make your work look as good as possible. For more information on this exciting and highly rewarding program, please contact:
Forrest Couch
Publisher, Xcell Publications
xcell@xilinx.com



Stay Ahead! with the PMC-FPGA05 Range of Virtex-5 PMCs

Xilinx Virtex-5 LX FPGA: A new generation of performance
Analog I/O, Camera Link, LVDS, FPDP-II & RS422/485 options: Fast, integrated I/O without bottlenecks
Multiple banks of fast memory: DSP & I/O optimized memory architecture
PCI-X Interface with multiple DMA controllers: More than 1GB/s bandwidth to host
Libraries and Example Code: Easy to use with head-start time-to-market
PCI based development system also available

Processing and FPGA - Input/Output - Data Recording - Bus Analyzers

For more information, please visit virtex5.vmetro.com or call (281) 584 0728

Xilinx Worldwide Events


Xilinx participates in numerous trade shows and events throughout the year.
This is a perfect opportunity to meet our silicon and software experts, ask questions,
see demonstrations of new products, and hear other customer success stories.
For more information and the current schedule, visit www.xilinx.com/events/.

January 06 - 10, 2007 Int’l Conference on VLSI Design & Int’l Conference on Embedded Systems Bangalore, India

January 08 - 11, 2007 International Consumer Electronics Show Las Vegas, NV

February 12 - 15, 2007 3GSM Spain

February 13 - 15, 2007 Embedded World Germany

February 21 - 23, 2007 IDGA Software Radio Summit Vienna, VA

April 03 - 05, 2007 Embedded Systems Conference - Silicon Valley San Jose, CA

April 16 - 19, 2007 NAB Las Vegas, NV


Tackling Serial Backplane Interface Design Challenges
The Virtex-5 LXT FPGA enables robust, high-performance, and high-integration serial backplane interface solutions.

by Delfin Rodillas
Senior Manager, Wired Communications
Xilinx, Inc.
delfin.rodillas@xilinx.com

The rate of adoption of serial technology in high-end system design has reached critical mass. As shown in Figure 1, 92% of respondents in a recent EE Times survey answered “yes” when asked if they were designing serial I/O systems in 2006, compared to 64% serial design activity in 2005.

A good portion of this dramatic adoption rate is caused by the penetration of serial technology in backplane applications. As system throughput requirements increase, the parallel backplane technologies of old will be displaced by SerDes-based backplane subsystems that provide higher bandwidth, better signal integrity, lower EMI and power, and simpler PCB designs.

Further promoting this growth is the emergence of standard serial protocols such as XAUI and Gigabit Ethernet (GbE), which allow reduced engineering efforts and interoperability. Standardization efforts for serial backplane form factors such as AdvancedTCA and MicroTCA in the PCI Industrial Computer Manufacturers Group (PICMG) have also contributed to the accelerated adoption. The benefits of serial backplanes are so compelling that they have been used as the backbone of not only communications, compute, and storage systems but also broadcast, medical, defense, and industrial/test systems.

Persistent Design Challenges
Regardless of the increased rate of adoption, many design challenges still exist. Because the backplane subsystem is the heart of the system, it must be able to pass signals from card to card reliably. Thus, designing backplanes with high signal integrity (SI) is of primary importance. Also significant is the use of proper silicon ICs with SerDes technology, capable of driving backplanes with very low bit-error rates. Silicon-based approaches to mitigating SI issues are particularly important in “legacy upgrade” scenarios, in which designers re-use older backplanes with legacy components and design rules.

There are also challenges in developing serial backplane protocols and fabric interfaces. The majority of backplane designs leverage legacy ASICs, which have proprietary protocols. Even some newer backplane designs require a proprietary backplane protocol. Silicon solutions must therefore be flexible and provide the necessary customizability. Although an ASIC allows this, it can often be costly and risky, with unproven product demand/volume and the possibility of design bugs and specification changes.

An approach that has recently gained traction is the use of off-the-shelf standards-based switch fabrics. This saves development time, but you must have silicon solutions that conform to the standard
protocol, as well as the flexibility to customize the end product and make it unique.

Figure 1 – Percentage of engineers designing Serial I/O systems (2005: 64%; 2006: 92%. Source: EE Times Survey, 2005)

And of course, there are the ever-present challenges of cost, power, and time to market. To meet the challenges of serial backplane design, Xilinx provides the Virtex™-5 LXT platform of FPGAs as well as IP solutions.

Xilinx Solutions for Serial Backplanes
The key technology that enables the application of Xilinx® Virtex-5 LXT FPGAs in serial backplane applications is the embedded RocketIO™ GTP low-power serial transceiver. There are as many as 24 serial transceivers in the largest Virtex-5 LXT FPGA; each serial transceiver is capable of running from 100 Mbps to 3.2 Gbps. Coupled with programmable fabric, the FPGA is capable of supporting virtually any serial protocol – proprietary or standard – up to 3.2 Gbps.

More important for serial backplane applications are built-in signal conditioning features, including transmit pre-emphasis and receive equalization. These features enable transmission of multi-gigabit signals over long distances, often reaching 40 inches or longer. Both equalization methods minimize the impact of inter-symbol interference (ISI) by boosting high-frequency signal components and attenuating low-frequency components. The difference is that pre-emphasis is performed on the transmitted signal as it goes out of the line driver, while equalization occurs on the received signal after it enters the IC package. Both pre-emphasis and equalization features are programmable to different states to allow for optimum signal compensation.

Besides signal conditioning features, the serial transceivers also provide additional features beneficial for backplanes, such as programmable output swings that allow interfacing to a variety of other current mode logic (CML)-based devices and built-in AC coupling capacitors that simplify transmission line design and reduce ISI.

IP Cores
Proprietary protocols still make up most serial backplane implementations. However, some newer designs have used standards-based protocols such as XAUI and GbE. This growing acceptance has been driven primarily by the maturity of these standards and the emergence of switch fabric ASSPs utilizing these protocols. Using ASSPs for switching applications saves tremendous development time, but designers realize that they need to differentiate their products by adding value-added capabilities, primarily on the line card.

FPGAs are the ideal platform for providing customizability, as the serial transceivers are designed to support a majority of standard serial backplane protocols. Together, the serial transceivers and fabric allow for standards-compliant designs with value-added functions – all in a single silicon device.

To reduce design time, Xilinx offers off-the-shelf IP cores for key serial I/O interface standards such as XAUI, GbE, SRIO, and PCIe. To ensure interoperability, these IP cores are tested through consortia plug-fests and independent third-party verification. To facilitate the creation of lightweight serial protocol designs, Xilinx also created the Aurora protocol, which is ideal for simpler designs requiring minimal overhead and optimized slice/resource utilization.

With increased usage of Ethernet and PCIe, Virtex-5 LXT FPGAs also include embedded tri-mode Ethernet MACs and PCIe Endpoint blocks. These allow significant savings of FPGA slice resources for customers needing interfaces in control plane applications, for example.

Because many chips with parallel interfaces are still used even in newer systems, Xilinx also offers IP cores for popular parallel interfaces such as SPI-4.2, SPI-3, and PCI. These allow you to rapidly create serial-to-parallel bridges, which are still required in many applications.

Besides serial and parallel interface IP, Xilinx offers more complete IP solutions that further reduce development time and time to market. These solutions include a Traffic Manager for prioritizing traffic flows across backplanes, as well as a Mesh Fabric Reference Design that allows “every-to-every” connectivity between cards.

Lastly, the ChipScope™ Pro Serial I/O Tool Kit enables rapid serial transceiver setup and debugging as well as BERT testing. Table 1 summarizes the serial backplane-related IP available from Xilinx.

IP Category                        Available IP
Serial Interfaces                  XAUI, GbE, PCI Express, Serial RapidIO, Aurora, CPRI, OBSAI
Parallel Interfaces                SPI-4.2, SPI-3, Utopia, PCI, CSIX
System-Level Solutions             10G Traffic Manager, Mesh Fabric Reference Design
Serial Backplane Test Solutions    ChipScope Pro Serial I/O Tool Kit

Table 1 – Xilinx IP for serial backplanes

Application Examples
Let’s look at how you could integrate all of the solution components to create a complete serial backplane fabric interface FPGA for both a star and mesh system.


Star Backplane Topology Application
Star fabric topologies prevail in high-end infrastructure equipment because of their cost-effectiveness, particularly in systems with a high number of cards. Figure 2 is an example of a 10 GbE line card that implements an FPGA-based star fabric interface.
This FPGA instantiates the XAUI LogiCORE™ IP and uses four serial transceivers to connect to the 16-channel XAUI switch fabric card. A LogiCORE SPI-4.2 core is also realized in the FPGA to interface to the 10 Gbps network processing unit.
Between the serial and parallel interface is the Traffic Manager IP, which performs QoS-related functions on ingress and egress traffic. A memory controller controls the external memory, which is used primarily as packet buffers. The benefits of this architecture include increased integration of SerDes and logic functions and quick time to market through the use of IP, while allowing an implementation that meets your exact system specifications. It also provides solid SI as well as low SerDes power consumption (~400 mW total). You can implement all of this in the lowest cost speed grade XC5VLX50T device.

Figure 2 – Star fabric I/F FPGA in a 10 GbE line card
(Block diagram: XFP, 10 GbE PHY, MAC, and network processor feed a Virtex-5 star fabric I/F FPGA containing the SPI-4.2 LogiCORE IP, 10G Traffic Manager, memory controller, and XAUI LogiCORE IP; four MGTs form a single XAUI channel @ 4 x 3.125 Gbps to a 16-channel XAUI-based switch fabric ASSP or ASIC.)

Mesh Fabric Architectures
Star topologies prevail in most cases, but in some smaller systems, a mesh topology is required. Take the case of the five-slot IP DSL access multiplexer shown in Figure 3, which requires full connectivity between four 24-port VDSL line cards and a 10 GbE backhaul card that connects to a metro Ethernet network. Each card uses a Virtex-5 LXT device and four embedded serial transceivers to realize the four independent channels of the mesh fabric physical layer. Implementing the four link layers is the Aurora protocol, which runs at approximately 3 Gbps to transport the 2.4 Gbps payload – plus additional overhead such as the encoding.
SPI-4.2 and SPI-3 LogiCORE IP cores are used on the trunk card and line cards, respectively, providing connectivity to the network processor. The Mesh Fabric Reference Design and Traffic Manager solution provide the distributed switching and QoS functions required on all of the line cards.
The line card fabric interface could easily fit in an XC5VLX30T device, while the trunk card fabric could fit in an XC5VLX50T device. Similar to the star example, you can realize significant benefits in integration, time-to-market reduction, feature optimization, and power and cost reduction by using the Virtex-5 LXT solution.

Figure 3 – Mesh fabric I/F FPGA in a VDSL line card
(Block diagram: on each 24-port VDSL line card, a DSL chipset and 2.5G NPU feed a Virtex-5 mesh fabric I/F FPGA containing the SPI-3 LogiCORE IP, mesh fabric and traffic manager logic, a memory controller, and four individual Aurora channels, Ch. 0 to Ch. 3, @ 1 x 3.125 Gbps; the 10 GbE trunk card pairs a Virtex-5 mesh fabric I/F with a 10G NPU, MAC, 10 GbE PHY, and XFP.)

Conclusion
Serial backplane technology is now mainstream; its adoption will only continue to increase with the rapidly growing demand for bandwidth. Evolution of backplane system requirements in terms of rates and protocols is inevitable and designers will face new challenges.
However, with Xilinx Virtex-5 LXT FPGAs and off-the-shelf IP for serial backplanes, system architects have an option that can accommodate legacy as well as newer backplane designs. Virtex-5 LXT FPGAs with embedded SerDes have the critical SI-improving features and integration required to provide high reliability and area- and cost-optimized designs.
Furthermore, Xilinx off-the-shelf IP reduces development time and time to market. Together, the powerful silicon and IP cores are what make the Virtex-5 solution the ideal vehicle for tackling even the toughest serial backplane design challenge.
For more information, visit www.xilinx.com/backplanes and www.xilinx.com/qos.
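The line rates quoted in this article (a XAUI channel at 4 x 3.125 Gbps, and Aurora running at about 3 Gbps to carry a 2.4 Gbps payload) can be related to payload bandwidth with a little arithmetic. The sketch below is illustrative only: it assumes 8b/10b line coding on both links, which the article does not state explicitly (it only mentions "overhead such as the encoding"), and the function names are hypothetical.

```python
# Sketch: payload bandwidth of serial links under an assumed 8b/10b line code.
# Assumption (not stated in the article): 10 line bits carry 8 payload bits.

def payload_gbps(line_rate_gbps, lanes=1):
    """Usable payload bandwidth in Gbps for `lanes` lanes at the given line rate."""
    return line_rate_gbps * lanes * 8 / 10

def required_line_rate_gbps(payload, lanes=1):
    """Per-lane line rate in Gbps needed to carry `payload` Gbps over `lanes` lanes."""
    return payload / lanes * 10 / 8

# Star fabric example: one XAUI channel built from four lanes at 3.125 Gbps.
xaui_payload = payload_gbps(3.125, lanes=4)

# Mesh fabric example: one Aurora channel carrying a 2.4 Gbps payload.
aurora_line_rate = required_line_rate_gbps(2.4)

print(f"XAUI payload bandwidth: {xaui_payload:.2f} Gbps")
print(f"Aurora line rate for a 2.4 Gbps payload: {aurora_line_rate:.2f} Gbps")
```

Under this assumption the four XAUI lanes deliver the 10 Gbps needed by the 10 GbE line card, and a 2.4 Gbps Aurora payload needs roughly the 3 Gbps line rate mentioned in the text.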

VERTICAL MARKET SOLUTIONS

Enabling Multi-Port 1 Gbps and 10 Gbps TCP/iSCSI Protocol Offload Solutions
The Virtex-5 LXT platform enables low-footprint, system-level, multi-port 1 Gbps and 10 Gbps TOE solutions.

by Sriram R. Chelluri
Senior Manager, Storage and Servers
Xilinx, Inc.
sriram.chelluri@xilinx.com

As the data center network infrastructure migrates to 10 Gbps, moving data traffic to an Ethernet-based solution becomes economically viable without sacrificing performance and latency. Hardware-based host interfaces like PCI Express and multi-Gigabit Ethernet (GbE) support open up design possibilities for low-cost, high-performance products in the computer and data-processing markets. The Xilinx® Virtex™-5 family of FPGAs sets the stage for designing system-on-chip (SoC) solutions with higher functionality and low power.
The Virtex-5 architecture brings to market critical features that make SoC designs easy to implement for TCP and iSCSI offload engines:
• Built-in PCI Express (PCIe) block – An integrated standards-compliant PCIe endpoint for supporting one to eight lanes, capable of providing as much as 32 Gbps full-duplex performance
• Built-in Gigabit Ethernet MAC (GEMAC) – four hardcore GEMACs enable multi-port gigabit solutions, reducing total real estate requirements for SoC designs
• Real six-input LUT (6-LUT) technology – improves slice utilization and reduces routing latency for high performance
• 36-Kb dual-port block RAM – higher memory density with error-correction circuitry enables support for reliable computational logic structures and increased on-chip TCP sessions for simultaneous transmit and receive operations
• DSP48E slices – enable massively parallel computations for image processing and multimedia applications

Because the Virtex family is a programmable platform, you can adapt your designs to changing standards and market requirements. Leveraging the resources available in the Virtex-5 family, you can design cost-effective TCP and iSCSI offload solutions for the server, storage, multi-protocol switch, and wireless base station markets with extended product life cycles.

TCP Offload Engine (TOE) Overview
Current TCP offload solutions rely on a complete software stack or on special network interface cards (NICs) based on ASICs for handling TCP/IP processing. An all-software solution is acceptable for low-bandwidth applications, but high-performance applications would consume all of the CPU resources, creating a system bottleneck for critical applications.
ASIC-based solutions are primarily from start-ups looking to capitalize on the high-performance 10 Gbps market. These solutions are still expensive and prone to vendor lock-in with an uncertain financial future.
Xilinx and its third-party IP partners provide fully standards-compliant TCP/iSCSI offload solutions that you can implement as is or customize for functionality, size, speed, or the target application.
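The "32 Gbps full-duplex" figure quoted for the one-to-eight-lane PCIe block can be reproduced with a short calculation. This is a sketch under stated assumptions rather than a statement of how the article derived the number: it assumes first-generation PCI Express signaling at 2.5 Gbps per lane per direction with 8b/10b coding, and it reads "full-duplex" as the sum of both directions. The constant and function names are hypothetical.

```python
# Sketch: PCI Express payload bandwidth as a function of lane count.
# Assumptions: 2.5 Gbps raw line rate per lane per direction, 8b/10b coding.

RAW_GBPS_PER_LANE = 2.5
ENCODING_EFFICIENCY = 8 / 10

def pcie_bandwidth_gbps(lanes):
    """Return (per-direction, both-directions) payload bandwidth in Gbps."""
    per_direction = lanes * RAW_GBPS_PER_LANE * ENCODING_EFFICIENCY
    return per_direction, 2 * per_direction

for lanes in (1, 4, 8):
    one_way, both_ways = pcie_bandwidth_gbps(lanes)
    print(f"x{lanes}: {one_way:.0f} Gbps per direction, {both_ways:.0f} Gbps both directions")
```

With eight lanes this reading gives 16 Gbps in each direction, or 32 Gbps counted across both, which matches the figure in the feature list.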


FPGA-Based TCP/iSCSI Engine
With standards-compliant built-in GEMACs, a PCIe core, and increased block RAM, the Virtex-5 LXT device is a programmable platform chip that system architects can exploit for TCP and iSCSI protocol processing without worrying about serial connectivity issues on the network or host interface side. Some of the protocol offload design challenges are:
• The number of TCP connections to support
• TCP reassembly/reorder
• IP fragmentation and reassembly
• Latency
• On-chip versus off-chip TCP session management

These issues can be mitigated with the unique features of Virtex-5 devices and available IP cores. With built-in GEMAC and PCIe interfaces, you can implement direct memory access solutions with minimal FPGA resources, reducing memory transfer latencies and enabling TCP reassembly without using temporary memory. Virtex-5 FPGAs also feature a 36-Kb dual-port block RAM, allowing you to support twice as many TCP connections as previous generations. With the Xilinx LogiCORE™ high-speed memory controller, you can use external DDR2 memory to scale TCP session management. Let’s look at the resources you could save in an FPGA-based network interface solution.

Figure 1 – Designing a TCP offload solution with traditional FPGAs
(Block diagram: four GbE PHYs connect to four GEMACs, an access controller, a TCP/iSCSI offload core, a memory controller for DDR2, RLDRAM 2, or QDRII SRAM, a DMA engine, a PCIe core, and a back-end I/O interface, all built in programmable logic.)

1 Gbps and 10 Gbps NIC Solution
An integrated multi-port 1 Gbps and 10 Gbps TCP offload NIC for IP storage and bladed servers enables companies to leverage network infrastructure for storage traffic. Figure 1 shows a typical FPGA-based NIC design.
Depending on the IP cores used, this design can take as many as 20,000 slices to implement. The Virtex-5 LXT platform can reduce resource utilization by 50%, enabling you to develop lower cost solutions without sacrificing performance. Besides hardware efficiency, system architects can also reduce NRE costs because they are not required to implement high-speed I/O interfaces for GbE and PCIe. Figure 2 shows a redesign of the TCP offload NIC leveraging the built-in resources of the Virtex-5 family.

Figure 2 – Designing a TCP offload solution with Virtex-5 LXT FPGAs
(Block diagram: the same blocks as Figure 1, with a legend distinguishing programmable logic from built-in hardcore logic.)

Conclusion
With standards-compliant TCP and iSCSI offload IP cores from third-party vendors implemented on Xilinx FPGAs, you can now design a drop-in or custom SoC at a much lower total cost of development. The Virtex-5 LXT platform – with hardened GEMACs and PCIe Endpoint blocks, larger block RAMs, and 6-LUTs – uses fewer FPGA resources to implement complex solutions for the server, storage, multi-protocol switch, and wireless base station markets.
To learn more about Virtex-5 LXT FPGAs, visit www.xilinx.com/virtex5. To learn more about protocol offload solutions, visit www.xilinx.com/esp/storage/. And to learn how Xilinx FPGAs can help you in other applications, visit www.xilinx.com/esp.
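The trade-off between on-chip and off-chip TCP session management discussed in this article can be illustrated with a rough sizing calculation. Everything numeric below is an assumption made for illustration: the 128-byte session record, the count of 64 block RAMs, the 256 MB DDR2 capacity, and the 18-Kb block RAM size used for the previous generation are not taken from the article, and the function names are hypothetical.

```python
# Sketch: how many TCP session records fit on chip versus in external memory.
# All sizes below are illustrative assumptions, not figures from the article.

BRAM_BITS_PREVIOUS = 18 * 1024   # assumed 18-Kb block RAM in the previous generation
BRAM_BITS_VIRTEX5 = 36 * 1024    # 36-Kb block RAM, as described in the article

def sessions_on_chip(num_brams, bram_bits, record_bytes):
    """Number of whole session records that fit in `num_brams` block RAMs."""
    return (num_brams * bram_bits) // (record_bytes * 8)

def sessions_off_chip(memory_bytes, record_bytes):
    """Number of whole session records that fit in external memory."""
    return memory_bytes // record_bytes

RECORD_BYTES = 128               # hypothetical per-connection state size
NUM_BRAMS = 64                   # hypothetical block RAM budget for session state

previous = sessions_on_chip(NUM_BRAMS, BRAM_BITS_PREVIOUS, RECORD_BYTES)
virtex5 = sessions_on_chip(NUM_BRAMS, BRAM_BITS_VIRTEX5, RECORD_BYTES)
ddr2 = sessions_off_chip(256 * 2**20, RECORD_BYTES)

print(f"On chip, 18-Kb blocks: {previous} sessions")
print(f"On chip, 36-Kb blocks: {virtex5} sessions")
print(f"External 256 MB DDR2:  {ddr2} sessions")
```

For the same number of block RAMs, doubling the block size doubles the on-chip session count, while external DDR2 scales the session table by several orders of magnitude at the cost of access latency.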



VERTICAL MARKET SOLUTIONS

Implementing Encryption Algorithms with the Virtex-5 LXT Platform
The Virtex-5 LXT platform makes encryption product development easy.

by Mike Nelson
Sr. Staff System Architect,
Storage and Servers, Vertical Markets
Xilinx, Inc.
mike.nelson@xilinx.com

Encryption is a computationally intensive function, which makes extremely high-performance implementations a serious system design challenge. The Xilinx® Virtex™-5 LXT platform meets this challenge with performance-optimized features ideal for 10 Gbps and faster implementations of leading-edge encryption algorithms.
A world-class programmable fabric provides superior logic performance. Integrated GTP serial transceivers, hard PCI Express (PCIe) Endpoint blocks, and highly flexible SelectIO™ technology enable tremendous I/O bandwidth. And 65-nm device densities provide a family of devices appropriate to almost any system design need.

The Virtex-5 architecture features several advances that enable the very high-performance logic necessary for high-bandwidth encryption applications:
• Real six-input LUT-based fabric means that you can map circuits into denser structures with fewer levels of logic, increasing device utilization and performance
• Improved routing architecture increases the reach of low-latency logic interconnection, providing more flexibility to synthesis tools and also increasing device utilization and performance
• 36-Kb dual-port block RAMs with integrated ECC allow extremely high-performance on-chip memory resources for creating FIFOs and computational logic structures

Combined, these resources enable very cost-effective 10 Gbps and faster implementations of IPsec AES-CBC/AES-XCBC-MAC-96, 802.1ae MACSec, LRW-AES, AES-GCM, SHA-256/384/512, and many other cryptographic algorithms. Furthermore, as the world of cryptography continuously evolves with additional modes and algorithmic refinements to these algorithms, your design can evolve with it – because the Virtex-5 family is a programmable logic platform.

I/O Bandwidth and Flexibility
Computationally intensive core logic requires high-bandwidth I/O. But the nature of that I/O will vary based on your system architecture. Figure 1 shows two common architectures for implementing encryption processing.

Figure 1 – Look-aside and in-line encryption processing
(Look-aside co-processor: offloads a computationally intense workload for another processor. In-line processor: performs encryption as a flow-through function in the datapath.)

Look-aside co-processing is an attractive option widely used in x86-based system appliances. This model leverages the excellent value of the commodity x86 platform to implement the application framework and selectively “looks aside” to an optimized accelerator to achieve high performance for the target application. FPGAs have always been well suited for this role, but scaling this approach to very high performance has been problematic.
PCI and PCI-X solutions require modest soft logic but have limited performance and must share what bandwidth they do have. PCI Express can implement a very high-performance non-blocking switched fabric, but traditionally requires extensive soft-logic resources to implement the controller, and possibly an external PHY for the electrical connection.
Virtex-5 LXT platform FPGAs address these limitations by combining embedded RocketIO™ GTP transceivers and a hardened PCI Express Endpoint block in every device. With the LXT platform, extremely high-performance co-processor I/O is easy and efficient, as shown in Figure 2.

Figure 2 – I/O bandwidth and soft logic progression for FPGA co-processor options
(Compares previous FPGA application co-processor options, from a shared parallel PCI/PCI-X bus to serial switched-fabric PCI Express with a soft controller and PHY, against the Virtex-5 LXT FPGA. Legend: user-programmable soft logic; hard PCI Express controller; RocketIO multi-gigabit transceivers.)

Virtex-5 LXT platform FPGAs are also ideal for in-line applications, as illustrated in Figure 3. A key requirement for in-line encryption applications is flexibility. They may require identical – or different – input and output ports, port aggregation, or local subsystem memory. The Virtex-5 LXT platform meets this challenge with a wide range of capabilities:
• Gigabit Ethernet (GbE) – Each device in the Virtex-5 LXT platform includes four independent hardened GbE MACs, making multi-port Ethernet a very efficient I/O option. You can add additional ports as necessary with 100% form-, fit-, and function-equivalent soft LogiCORE™ IP.
• 10 Gigabit Ethernet – A Xilinx soft LogiCORE function is available that can be connected to four RocketIO MGTs for a XAUI interface or to a SelectIO pinout for an XGMII interface.
• 10 Gbps Fibre Channel (FC) – A XAUI-like Fibre Channel standard uses four RocketIO MGTs operating at 3.1875 Gbps in parallel to create a 10.2 Gbps FC channel.
• PCI Express – Available to interface to a variety of industry-standard PCIe-based port controllers.
• SPI-4.2 – Soft LogiCORE IP supports this networking industry standard for chip-to-chip connectivity over high-performance SelectIO technology.
• Memory – In addition to port I/O standards, Virtex-5 SelectIO technology also supports a wide range of memory interface technologies including DDR2, RLDRAM II, and QDR II SRAM. These capabilities enable virtually any local memory subsystem that an in-line processing engine might require.

Figure 3 – Virtex-5 LXT device in-line encryption platform flexibility
(Block diagram: port interfaces on each side, such as n x GE, 10GE, 10G FC, PCIe, and SPI-4.2, feed packet readers and writers around the encryption and decryption engines; a multi-ported memory controller connects to DDR2, RLDRAM II, QDRII SRAM, and similar memories.)

These features allow you to create in-line solutions that will connect to the ports you need with the integrated encryption technology you want.

Conclusion
The Virtex-5 LXT platform expands the capabilities of the Virtex-5 FPGA architecture with the addition of RocketIO GTP transceivers, plus hard PCI Express Endpoint and tri-mode Ethernet MAC blocks. The result is a platform ideally suited to support very high-performance look-aside and in-line encryption functions.
Other applications where LXT platform devices will excel include high-performance packet handling and deep content inspection for networking; high-speed data mining for databases; time-critical computational processing for industrial, scientific, and medical applications; and real-time image processing for aerospace/defense and video graphic applications.
To learn more about Virtex-5 LXT platform FPGAs, visit www.xilinx.com/virtex5. To learn more about Xilinx in encryption, visit www.xilinx.com/esp/security/data_security/index.htm. And to learn how Xilinx FPGAs can help you in other applications, visit www.xilinx.com/esp.
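To give a feel for what the 10 Gbps AES target in this article implies for the core logic, the following sketch relates clock frequency to throughput for an AES datapath. It is a back-of-the-envelope model, not a description of any Xilinx core: it relies only on the 128-bit AES block size and the ten rounds of AES-128, and it assumes either a fully pipelined core that accepts one block per clock or an iterative core that spends one clock per round. The function names are hypothetical.

```python
# Sketch: clock rate needed for a given AES throughput.
# Model assumptions: 128-bit blocks; a pipelined core accepts one block per
# clock, an iterative AES-128 core takes one clock per round (10 rounds).

AES_BLOCK_BITS = 128
AES128_ROUNDS = 10

def throughput_gbps(clock_mhz, cycles_per_block, cores=1):
    """Aggregate throughput in Gbps for `cores` identical AES engines."""
    return AES_BLOCK_BITS * clock_mhz * 1e6 * cores / cycles_per_block / 1e9

def min_clock_mhz(target_gbps, cycles_per_block, cores=1):
    """Minimum clock in MHz for `cores` engines to reach `target_gbps`."""
    return target_gbps * 1e9 * cycles_per_block / (AES_BLOCK_BITS * cores) / 1e6

pipelined = min_clock_mhz(10, cycles_per_block=1)
iterative = min_clock_mhz(10, cycles_per_block=AES128_ROUNDS)
four_iterative = min_clock_mhz(10, cycles_per_block=AES128_ROUNDS, cores=4)

print(f"Pipelined core:       {pipelined:.3f} MHz")
print(f"One iterative core:   {iterative:.2f} MHz")
print(f"Four iterative cores: {four_iterative:.4f} MHz each")
```

The model suggests why feedback modes such as AES-CBC, where each block depends on the previous ciphertext block and a single stream cannot fill a pipeline, tend to push designs toward several parallel engines serving independent streams, while modes such as AES-GCM can exploit pipelining directly.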

Virtex-5 Configuration Options Offer Designers a Choice
Xilinx provides a host of flexible choices in configuration memory to help you make the best decision for your design.

by Frank Toth
EasyPath FPGAs and Configuration Solutions
Xilinx, Inc.
frank.toth@xilinx.com

System designers are always making trade-offs between alternative requirements. Considerations include time to market, ease of use, total ownership cost, and system speed.
Every alternative offers different total-cost-of-ownership (TCO) considerations that you should examine when designing a configuration system, including design time, prototyping, manufacturing and test costs, and the per-bit costs of the configuration device. The trade-offs of all these factors should enter into your decision of which configuration method to use.
Designers using Xilinx® Virtex™-5 devices have many additional choices for configuring the new FPGA family, including new configuration modes built right into the chips; support for 32-bit-wide high-performance parallel SelectMAP, which offers the ultimate in speed; and both BPI (byte parallel interface) and SPI (serial peripheral interface) using industry-standard SPI and parallel flash memory devices (see Figure 1). The easy-to-use, full-featured, configuration-engineered Platform Flash PROMs offer a pre-engineered way to flexibly configure Virtex-5 devices and manage multiple bitstreams.

Figure 1 – Virtex-5 configuration modes

Platform Flash PROMs
Dropping a Platform Flash PROM into a design provides a seamless solution with a low TCO, including minimum board space, high configuration speed, a guaranteed source of supply, and value-added features like bitstream decompression, design revision management, JTAG Boundary Scan for test and configuration, and additional storage for boot and scratch-pad memory.
Platform Flash features on-board decompression that can fit as much as 50% more configuration data into the same overall memory space. Design revisioning allows you to switch between memory blocks for various configurations: for instance, using the board and system in different geographical regions or loading a diagnostic followed by a mission load of configuration. In addition, unused Platform Flash memory can be allocated to boot code or scratch-pad memory. Both of these features work without any additional glue logic or special software. The Boundary Scan JTAG port enables configuration and includes the Platform Flash in overall board tests.

SPI Flash PROMs
Virtex-5 FPGAs support direct connection to SPI PROMs using the industry-standard four-wire SPI interface. Many systems currently use SPI PROMs; you can now easily take advantage of the on-board SPI PROM without any additional circuitry or software. Designers should think about design trade-offs, including different features offered by SPI PROM manufacturers and the slower configuration speed compared to parallel SelectMAP, Platform Flash, and BPI.

BPI Flash PROMs
Virtex-5 devices include on-board circuitry to directly connect – without any additional glue logic or software – to industry-standard parallel flash devices. The parallel flash interface can be directly connected to the FPGA and the memory shared by the system bus.

Conclusion
Virtex-5 FPGAs offer you the widest variety of configuration alternatives in the industry, including Platform Flash, 32-bit SelectMAP, and interfaces that directly connect to SPI and BPI PROM devices. You should understand these alternatives and make an informed decision on which one meets your needs based on the trade-offs between speed, complexity, and features.
For more information, please see Xilinx Application Note 483, “Multiple-Boot with Platform Flash PROMs,” at www.xilinx.com/bvdocs/appnotes/xapp483.pdf.
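The speed side of the configuration trade-off described in this article comes down to how many bits the interface moves per clock. The sketch below estimates load time for a serial, a byte-wide, and a 32-bit-wide interface. The 20 Mb bitstream size and 50 MHz configuration clock are hypothetical values chosen only for illustration, the model assumes one data word per clock with no protocol overhead, and the function name is not from any Xilinx tool.

```python
# Sketch: configuration time versus interface width.
# Hypothetical inputs: 20 Mb bitstream, 50 MHz configuration clock,
# one data word transferred per clock, no command or startup overhead.

def config_time_ms(bitstream_bits, bus_width_bits, clock_hz):
    """Time in milliseconds to shift in a bitstream over a bus of given width."""
    return bitstream_bits / (bus_width_bits * clock_hz) * 1e3

BITSTREAM_BITS = 20_000_000   # hypothetical bitstream size
CLOCK_HZ = 50_000_000         # hypothetical configuration clock

modes = {
    "SPI (1 bit per clock)": 1,
    "BPI (8 bits per clock)": 8,
    "SelectMAP (32 bits per clock)": 32,
}

for name, width in modes.items():
    print(f"{name}: {config_time_ms(BITSTREAM_BITS, width, CLOCK_HZ):.1f} ms")
```

Under these assumptions the serial interface takes 32 times longer than the 32-bit parallel one, which is the kind of gap to weigh against the lower pin count and board cost of SPI flash.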
Introducing Virtex-5 EasyPath FPGAs
The world’s first 65-nm FPGA cost-reduction solution.
by Gokul Krishnan, Ph.D
EasyPath Marketing
Xilinx, Inc.
gokul.krishnan@xilinx.com

Derek Johnson
APD Marketing
Xilinx, Inc.
derek.johnson@xilinx.com

With increasing competition in many different market segments, many companies
must drive down their product develop-
ment costs while at the same time adding
more and more complexity and features. In
addition, companies must react to fast-
changing market requirements and height-
ened time-to-market pressures.
Xilinx® FPGAs can help you face these
challenges by continually innovating to
provide increasingly complex functions at a
lower cost per logic cell. The recently intro-
duced Virtex™-5 FPGAs, in combination
with Virtex-5 EasyPath™ FPGAs, are the
latest generation of 65-nm devices that
provide higher performance, lower system
cost, and greater embedded functionality
than ever before.
Virtex-5 EasyPath FPGAs are the indus-
try’s only 65-nm customer-specific FPGA
cost-reduction solution, providing the lowest
total cost of ownership (TCO) when com-
pared to other solutions. EasyPath FPGAs are
identical to standard Xilinx FPGA offerings
but use patented testing techniques and cus-
tomer-specific test patterns to significantly
improve FPGA yields for designs that no
longer require the full programmability of a
standard FPGA. You can reap the benefits of
these improved yields in the form of lower
costs. EasyPath technology is available across
multiple platforms, different product fami-
lies, and 28 different devices over a range of
gate and memory counts.



Lowest TCO with Virtex-5 EasyPath FPGAs
Virtex-5 EasyPath FPGA devices are manufactured using a 65-nm process, which intrinsically offers a cost advantage (see Table 1). In addition to a low unit price, EasyPath FPGAs provide many other cost advantages, such as:
• Low NRE
• No re-qualification required
• No engineering resources required
• Shorter lead times (12-16 weeks)

With Virtex-5 EasyPath FPGAs, you can realize a 30%-75% price reduction when moving to high volume as compared to standard FPGAs. EasyPath FPGAs are identical to their standard FPGA counterparts, effectively eliminating any conversion work. This has two important implications. The first is that you pay less yet incur very little risk, because every single feature in a standard FPGA is supported and will work in an EasyPath FPGA. Second, you do not need to re-qualify your boards or systems when you move to EasyPath FPGAs. This saves valuable engineering time and resources and provides cost savings of $500K or more.

Total Cost Driver          Structured ASICs    Xilinx EasyPath
Time to Cost Reduction     20 to 24 Weeks      Yes
NRE Costs                  $100K to $400K      No
Cost of Requalification    $100K to $500K      Yes
Engineering Costs          $250K to $300K      Identical
Cost of Design Tools       $100K to $200K      Identical
Unit Costs                 Lowest              Low
Cost of Respin             High                High

EasyPath Total Cost of Ownership Advantage: Only Cost-Reduction Path Supporting Complex IP; 100% FPGA Feature Support; 100% Package Support

Table 1 – EasyPath total cost of ownership advantage

Unlike structured ASICs, where customers have to go through multiple reviews with the vendor and spend many months of valuable engineering resources, EasyPath FPGAs demand almost no resources from you. Once you have finalized the design and handed off the relevant files to Xilinx, you can get to full production directly in 8-12 weeks. No intermediate prototyping is required because the design has already been finalized (prototyped) in a standard FPGA. The lead time to get to production is at least three to four months less than with structured ASICs.
Alternatively, those of you in fast-moving markets can postpone the design freeze milestone by three to four months to better address dynamic market conditions. Getting to market faster can have a significant impact on the market share a product can capture. Studies have indicated that just a three-month delay in time to market can reduce market share by as much as 15%, according to research from International Business Strategies, Inc.

Difficulties with 65-nm ASICs
One of the industry trends in place for some time now is that ASIC design starts are decreasing every year. Part of the reason for this is that FPGAs have been able to provide lower unit costs for higher functionality. The other driver has been the rising cost of mask sets, design, and verification. By some recent estimates, the cost of developing a new 90-nm ASIC design is in excess of $10 million (International Business Strategies, Inc.). A significant portion of this cost occurs in the verification phase, which has become longer and longer with increases in chip complexity. At 65 nm, operating voltages and transistor sizes are so small that very small process variations can have a big effect on the functionality of a design.
What this means to you is that you must now factor in DFM (design for manufacturability) rules during the physical design phase – something previously assumed to have been embedded in the library itself. Furthermore, because the interconnects are very closely spaced, signal integrity becomes even more important; you must take care to reduce parasitic capacitance issues when signals are being transmitted simultaneously on adjoining metal lines.
Another major problem with 65-nm design is the issue of power consumption. Although Virtex-5 FPGAs take advantage of triple-oxide technology to reduce leakage power consumption significantly, each ASIC design must factor in power consumption and use techniques such as clock gating and selective transistors to mitigate leakage current. So although Virtex-5 EasyPath FPGAs retain the simplicity they had at 130 nm and 90 nm, 65-nm ASICs offer unique challenges that could significantly reduce the chance of first-time success with a design.

Conclusion
The Virtex-5 family is one of the most cost-effective, high-performance FPGA families in the industry. With advanced features such as a higher utilization logic fabric, more integrated block memory, higher precision DSP slices, as well as advanced connection and embedded processing blocks, you can reduce your overall system cost by fitting into a smaller FPGA or replacing external discrete devices on your boards. This complements the natural cost reduction that comes from fabricating devices on a 65-nm process.
Virtex-5 FPGAs are a faster time-to-market alternative to ASICs and other custom logic solutions, and enable a lower total system cost. In addition, Virtex-5 FPGAs are designed for the lowest overall power consumption, highest signal integrity, and highest performance. All of these attributes can lead to lower overall system cost: lower power consumption and high signal integrity can cut design and debugging costs, and high performance can save device costs by allowing the design to be done in a lower, less-expensive speed grade.



APPLICATION NOTES

Connectivity Solutions
Realize the full potential of the solutions in our silicon with Xilinx application notes.

Memory Interfaces

XAPP851 – DDR SDRAM Controller Using Virtex-5 FPGA Devices
By Toshihiko Moriyama and Rich Chiu
This application note describes a 200-MHz DDR SDRAM memory controller implemented in a Virtex™-5 device. This reference design uses the Virtex-5 ChipSync™ features to calibrate and adjust read data timing. A straightforward back-end user interface is provided to allow integration into a complete FPGA design.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp851.pdf

XAPP852 – Synthesizable CIO DDR RLDRAM II Controller for Virtex-5 FPGAs
By Benoit Payette and Rodrigo Angel
This application note describes how to use a Virtex-5 device to interface to common I/O (CIO) double data rate (DDR) reduced latency DRAM (RLDRAM II) devices. The reference design targets two CIO DDR RLDRAM II devices at a clock rate of 200/300 MHz, with data transfers at 400/600 Mbps per pin.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp852.pdf

XAPP853 – QDR II SRAM Interface for Virtex-5 Devices
By Lakshmi Gopalakrishnan
This application note describes the implementation and timing details of a four-word-burst quad data rate (QDR II) SRAM interface for Virtex-5 devices. The synthesizable reference design leverages the unique I/O and clocking capabilities of the Virtex-5 family to achieve performance levels of 300 MHz (600 Mbps), resulting in an aggregate throughput for each 36-bit memory interface of 43.2 Gbps. The design greatly simplifies the task of read data capture within the FPGA while minimizing the number of resources used. A straightforward user interface is provided to allow simple integration into a complete FPGA design utilizing one or more QDR II interfaces.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp853.pdf

XAPP858 – High-Performance DDR2 SDRAM Interface in Virtex-5 Devices
By Karthi Palanisamy and Maria George
This application note describes the controller and data capture technique for high-performance DDR2 SDRAM interfaces. This data capture technique uses the input serializer/deserializer (ISERDES) and output double data rate (ODDR) features available in every Virtex-5 I/O.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp858.pdf

Source-Synchronous Interfaces

XAPP855 – 16-Channel DDR LVDS Interface with Per-Channel Alignment
By Greg Burton
This application note describes a 16-channel source-synchronous DDR LVDS interface. The design takes advantage of the Virtex-5 I/O ChipSync feature’s ability to adjust the delay of the receiver datapaths, creating dynamic setup and hold timing for each device at initialization and compensating for skews associated with the manufacturing process. The receiver operates at 1:8 deserialization on each of the 16 data channels.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp855.pdf

XAPP860 – 16-Channel DDR LVDS Interface with Real-Time Window Monitoring
By Greg Burton
This application note describes a 16-channel source-synchronous DDR LVDS interface. The receiver operates at 1:6 deserialization on each of the 16 data channels. Similar to XAPP855, the design also includes a real-time window monitoring circuit for added performance. This reference design calibrates and compensates for skews associated with process, voltage, and temperature (PVT) at initialization and also dynamically during operation.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp860.pdf

Serial Connectivity

XAPP861 – Efficient 8x Oversampling Asynchronous Serial Data Recovery Using IDELAY
By John Snow
Virtex-5 devices have a high-precision programmable delay element (IDELAY) associated with every input pin. This application note shows how to implement 8x oversampling of many data streams using a single DCM, two global clock resources, and minimal FPGA logic resources. This solution provides better jitter tolerance than techniques using multiple DCMs. When paired with a suitable data recovery scheme, this oversampling technique can be used with many different data protocols up to 550 Mbps. A reference design is included that implements an SD-SDI (SMPTE 259M) receiver running at 270 Mbps.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp861.pdf
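As a quick sanity check on the rates quoted for XAPP852 and XAPP853, the arithmetic can be sketched in a few lines of Python. This is only an illustration, not taken from the application notes: the function names are hypothetical, and the 43.2 Gbps figure is reproduced here under the assumption that it counts both the read port and the write port of the QDR II device, each 36 bits wide and each transferring on both clock edges.

```python
def ddr_rate_mbps(clock_mhz):
    """Per-pin rate of a double-data-rate bus: one transfer on each clock edge."""
    return 2 * clock_mhz


def aggregate_gbps(clock_mhz, width_bits, ports=1):
    """Aggregate throughput in Gbps for `ports` independent DDR data ports."""
    return ddr_rate_mbps(clock_mhz) * width_bits * ports / 1000


if __name__ == "__main__":
    # XAPP852: 200/300 MHz clocks correspond to 400/600 Mbps per pin.
    print(ddr_rate_mbps(200), ddr_rate_mbps(300))
    # XAPP853: 36-bit QDR II interface at 300 MHz. ports=2 is an assumption:
    # separate read and write ports are both counted in the aggregate.
    print(aggregate_gbps(300, 36, ports=2))
```

With a single port the same call gives 21.6 Gbps, so the quoted aggregate is consistent with counting both directions of the interface.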



INTELLECTUAL PROPERTY

Intellectual Property Offerings

The Xilinx IP Center on the Web allows you to search for IP by function, type, vendor, or keywords.

The IP Center includes intellectual property from Xilinx and its third-party vendors. To access the IP solutions highlighted here, visit www.xilinx.com/ipcenter and type the keywords listed in the search box.

Source-Synchronous Interfaces

SPI-4 Phase 2 Interface Solutions (DO-DI-POSL4MC) Xilinx IP Core
The Xilinx® SPI-4 Phase 2 core provides a fully compliant packet-over-SONET/SDH (POS) solution, which can be quickly integrated into networking systems.
Through user-configurable options, the Xilinx SPI-4.2 core provides ultimate flexibility while seamlessly interoperating with industry-leading ASSPs to maximize the data transfer bandwidth. The Xilinx SPI-4.2 core is fully compliant with the OIF’s System Packet Interface Level 4 (SPI-4) Phase 2 standard, as well as the Saturn Development Group’s POS-PHY Level 4 (PL4) interface specification.
Type search keywords: SPI-4 Phase 2

Serial Connectivity

Virtex-5 RocketIO GTP Wizard
The Virtex™-5 RocketIO™ GTP Wizard automates the task of creating HDL wrappers to configure Virtex-5 RocketIO GTP transceivers. The wizard’s customization GUI allows you to configure one or more GTP transceivers using pre-defined templates to support popular industry standards, or from scratch to support a wide variety of custom protocols.
Type search keywords: GTP Wizard

Virtex-5 Embedded Tri-Mode Ethernet MAC Wrapper
The CORE Generator™ tool supports the Virtex-5 Tri-Mode Ethernet Media Access Controller (MAC) Wrapper to automate the generation of HDL wrapper files for the tri-mode Ethernet MAC in Virtex-5 LXT devices. Preconfigured HDL wrappers, testbenches, and implement and simulation scripts are generated automatically based on user-defined options.
Type search keywords: Virtex-5 Ethernet MAC

Virtex-5 PCI Express Endpoint Block Wrapper
The Xilinx PCI Express Endpoint block wrapper integrates and interfaces to the on-chip PCI Express Endpoint block, supporting 1-lane, 2-lane, 4-lane, and 8-lane complete endpoint core implementations. In addition, a PCI Express Endpoint block development kit is also available. This solution is used in communication, multimedia, server, storage, and mobile platforms and enables applications such as high-end medical imaging, graphics-intensive video games, DVD quality streaming video on the desktop, and 10 Gigabit Ethernet interface cards.
Type search keywords: PCI Express Block

Virtex-5 LXT PCI Express Block Plus LogiCORE Xilinx IP Core
The Xilinx PCI Express Plus LogiCORE IP integrates and interfaces to the PCI Express Endpoint block, supporting 1-lane, 4-lane, and 8-lane complete endpoint core implementations. In addition, a PCI Express development kit is also available. This solution is used in communication, multimedia, server, storage, and mobile platforms and enables applications such as high-end medical imaging, graphics-intensive video games, DVD quality streaming video on the desktop, and 10 Gigabit Ethernet interface cards.
Type search keywords: PCI Express Plus



THE BOARD ROOM

Virtex-5 Boards and Kits


Jumpstart your Virtex-5 designs with these development platforms and tool kits.

Nu Horizons Virtex-5 LXT Evaluation Kit


Nu Horizons creates a low-cost evaluation
kit for Virtex-5 LXT platform FPGAs.
Nu Horizons’s newest evaluation kit is
designed for customers interested in evaluating
Virtex-5 LXT FPGAs. This kit differs from
other Nu Horizons kits in that it has the added
ability for high-speed serial communication.
On the Web at www.nuhorizons.com

Xilinx Virtex-5 ML505 Evaluation and Development Platform
A low-cost embedded system and
RocketIO GTP transceiver development
platform.
The Xilinx Virtex-5 ML505, based on
RocketIO™ technology, is a feature-rich,
low-cost evaluation/development platform
that provides easy and practical access to the
resources available in the on-board Virtex-5
LXT FPGA.

Supported by industry-standard interfaces/connectors, generous memory
resources, and companion chipsets, the
ML505 evaluation platform is a versatile
development platform for multiple applications
including embedded systems.
On the Web at www.xilinx.com/XOB




Avnet Virtex-5 LX Development Kit
A complete development platform for designing and verifying applications based on the Xilinx Virtex-5 LX FPGA family.
Available with the Xilinx® Virtex™-5 XC5VLX50-1FF676 device, the Avnet Virtex-5 Development Kit allows you to prototype high-performance designs with ease, while providing expandability and customization through the EXP expansion slot.
The system board includes DDR2 SDRAM, flash memory, a 10/100/1000 Ethernet PHY, and a serial port, making it an ideal platform for MicroBlaze™ development. Other board features include a USB port, programmable LVDS clock, 10-bit Tx/Rx high-speed LVDS interface, user switches and LEDs, and a 2 x 16-character LCD panel.
The board also provides a full EXP expansion slot, providing a total of 168 high-speed, single-ended, and differential user I/O. You can easily add EXP modules to the board for additional application-specific functions.
On the Web at www.avnet.com

Xilinx Virtex-5 ML501 Evaluation and Development Platform
An ideal general-purpose, low-cost development platform.
The Xilinx Virtex-5 ML501 evaluation and development platform is a feature-rich, low-cost evaluation/development platform that provides easy and practical access to the resources available in the on-board Virtex-5 LX FPGA. Supported by industry-standard interfaces and connectors, the ML501 is a versatile development platform for multiple applications. Video, audio, and communication ports as well as generous memory resources extend the functionality and flexibility of the ML501 evaluation platform beyond a typical FPGA development platform.
On the Web at www.xilinx.com/ML501

HiTech Global Virtex-5 PCI Express Development Platform
Seamless serial interface connectivity enabled by the Virtex-5 LXT FPGA.
Powered by a Xilinx Virtex-5 LXT FPGA, supported by mainstream peripherals, and designed with excellent signal integrity performance, the HiTech Global HTG-V5PCIE is the ideal platform for serial interface/connectivity developments, including PCI Express subsystems, Serial ATA (SATA), Fibre Channel, RapidIO, and XAUI.
On the Web at www.hitechglobal.com

Xilinx Virtex-5 ML550 Networking Interfaces Tool Kit
Designing networking, telecom, servers, and computing systems with Virtex-5 FPGAs.
Many of today’s telecom and networking systems use high-bandwidth interfaces based on LVDS or other differential I/O standards. Differential I/O standards simplify system design by improving system performance and signal integrity.
Protocols based on source-synchronous I/Os such as SPI-4.2 and SFI-4 are central to leading-edge system design. To take advantage of these technologies, you have to work through multiple challenges to ensure device interoperability and standards compliance. Xilinx provides the Virtex-5 network interface board, as well as standards-compliant IP cores and free reference designs, to help you tackle these high-speed, source-synchronous interface challenges. This allows you to focus on user application design and not worry about interoperability and standards compliance.
On the Web at www.xilinx.com/XOB

Xilinx Virtex-5 ML555 PCI Express Development Tool Kit
A highly configurable pre-verified development solution.
The Xilinx ML555 RoHS-compliant PCIe/PCI-X/PCI development board provides a pre-verified solution to parallel and serial PCI interface design challenges. Using an established development environment can dramatically shorten the design cycle. By using proven Xilinx dedicated blocks, you can focus your efforts on specific application development and avoid time-consuming PCIe or PCI development.
On the Web at www.xilinx.com/XOB

Xilinx Virtex-5 ML561 Advanced Memory Development System
Achieve your performance targets in the shortest development time.
Building interfaces to high-performance memory devices presents challenges such as high-speed synchronous data capture, along with implementing complex physical-layer interfaces and control logic. The ML561 advanced memory development system offers an excellent platform to develop and verify high-performance memory interfaces using Virtex-5 FPGAs.
On the Web at www.xilinx.com/XOB



HAPS – HARDI ASIC Prototyping System
a high performance, high capacity FPGA platform for ASIC prototyping and emulation composed of multi-FPGA boards and standard or custom-made daughter boards

HapsTrak
a set of rules for pinout and mechanical characteristics, which guarantees compatibility with previous and future generation HAPS motherboards and daughter boards

HapsTrak compatible: HAPS-40 (2007), HAPS-30 (2005), HAPS-20 (2004), HAPS-10 (2003)

www.hardi.com, haps@hardi.com
HARDI Electronics Inc., 26831 Magdalena Lane, Mission Viejo, CA 92691, (949) 202-5572

Virtex-II, Virtex-II Pro, Virtex-4, and Virtex-5 are registered trademarks of Xilinx Inc.
Ultimate Connectivity . . .
Low-Power Transceivers

Reduce serial I/O power, cost and complexity with the world’s first 65nm FPGAs.

With a unique combination of up to 24 low-power transceivers, and built-in PCIe™ and Ethernet MAC blocks, Virtex-5 LXT FPGAs get your system running fast. Whether you are an expert or just starting out, only Xilinx delivers this complete solution to simplify high-speed serial design.

Lowest-power, most area-efficient serial I/O solution
RocketIO™ GTP transceivers deliver up to 3.2 Gbps connectivity at less than 100 mW to help you beat your power budget. The embedded PCI Express® endpoint block ensures easy implementation and reduced development time. Embedded Ethernet MAC blocks enable a single-chip UNH-verified implementation. And the Xilinx solution is fully supported by development tools, design kits, IP, characterization reports, and more.

[Chart: Power consumption and area required to implement a typical design including 8-lane PCIe endpoint. Virtex-5 LXT FPGAs (65nm) vs. Nearest Competitor (90nm). Power (Watts): 3.09 vs. 6.22. Area (LUTs): 25,100 vs. 34,600. Bar segments: User Logic, PCIe, Static Power.]
5VLX30T vs. 2SGX60D. Target Frequency = 200 MHz. Worst-case process.
25K LUTs, 17K Flip-Flops, 1 Mbit On-Chip RAM, 64 DSP Blocks, 128 2.5V I/Os.
Based on Xilinx tool v8.2 and competitor tool v6.0.1

Visit our website today, view the Webcast, and order your free eval CD to give your next design the ultimate in connectivity.

The Programmable Logic Company℠

www.xilinx.com/virtex5

The Ultimate System Integration Platform

©2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.

PN 0010999
