Xcell 59
Xcell journal
Issue 59
Fourth Quarter 2006
XCELL JOURNAL
THE AUTHORITATIVE JOURNAL FOR PROGRAMMABLE LOGIC USERS
XILINX, INC.
Virtex-5
Special
Edition
INSIDE
A Multi-Gigabit Transceiver
for the Masses
www.xilinx.com/xcell/
Silica designs, manufactures, sells and supports a wide variety of hardware evaluation, development and reference design kits for developers looking to get a quick start on a new project.
With a focus on embedded processing, communications and networking applications, this growing set of modular hardware kits allows users to evaluate, experiment, benchmark, prototype, test and even deploy complete designs for field trial.
By providing a stable hardware platform that enhances system development, design kits from Silica help original equipment manufacturers (OEMs) bring differentiated products to market quickly and in the most cost-efficient way possible.
Build your own system by mixing and matching:
• Processors
• FPGAs
• Memory
• Networking
• Audio
• Video
• Mass storage
• Bus interface
• High-speed serial interface
Available add-ons:
• Software
• Firmware
• Drivers
• Third-party development tools
For a complete listing of available boards, visit www.silica.com
© Avnet, Inc. 2006. All rights reserved. AVNET is a registered trademark of Avnet, Inc.
Ultimate Performance…
Shipping now, Virtex-5 LX is the first of four platforms optimized for logic, DSP, processing, and serial. The LX platform offers 330,000 logic cells and 1,200 user I/Os, plus hardened 550 MHz IP blocks. Build deeper FIFOs with 36 Kbit block RAMs. Achieve 1.25 Gbps on all I/Os without restrictions, and make reliable memory interfacing easier with enhanced ChipSync™ technology. Solve SI challenges and simplify PCB layout with our sparse chevron packaging. And enable greater DSP precision and dynamic range with 550 MHz, 25x18 MACs.
Visit www.xilinx.com/virtex5, view the TechOnline webcast, and give your next design the ultimate in performance.
[Chart: Virtex-5 FPGAs compared with Virtex-4 FPGAs and the nearest competitor in logic fabric performance, on-chip RAM (550 MHz), DSP 32-tap filter (550 MHz), I/O LVDS bandwidth (750 Gbps), and I/O memory bandwidth (384 Gbps). Chart callouts: 1.3x, 1.6x, "Industry’s fastest 90nm FPGA benchmark". Numbers show comparison with the nearest competitor, based on the competitor’s published datasheet numbers.]
www.xilinx.com/virtex5
©2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.
LETTER FROM THE PUBLISHER
Welcome to this special edition of Xcell Journal, featuring a broad array of articles on Xilinx®
Virtex™-5 FPGAs. In this issue you’ll find executive and industry viewpoints; articles on
engineering solutions, design challenges, tools, customer successes, and vertical markets; and
a technical reference section covering application notes, boards, and IP.
As exciting as this is, I’d also like to let you know about a couple of announcements from Xcell Publications.

Xcell Publications Honored with APEX 2006 Award of Excellence
Xcell Publications was recently awarded the APEX 2006 Award of Excellence in two categories – magazine and journal design and layout and custom-published magazines and journals – for two of its flagship Xcell Publications, Xcell Journal and I/O Magazine.
APEX 2006 – the 18th Annual Awards for Publication Excellence – is an international competition that recognizes outstanding publications, including newsletters, magazines, annual reports, brochures, and websites. According to APEX judges, this year’s competition was exceptionally intense, with nearly 5,000 entries. Awards were granted based on excellence in graphic design, quality of editorial content, and the success of the entry in conveying the message and achieving overall communications effectiveness.
“We’re honored that Xcell magazines have been selected for excellence in publishing among such a stellar list of companies by the APEX panel of judges,” said Sandeep Vij, vice president of worldwide marketing at Xilinx. “Over the past 18 years, our custom publications have served as a foundational tool, delivering ‘how-to’ information to a growing base of engineers using Xilinx programmable chips to design a wide variety of electronic systems, ranging from the Mars Rover to high-volume consumer handsets, flat-panel displays and automotive infotainment systems. Being ranked among the industry’s best underscores the value and quality of our company’s portfolio of custom magazines.”
Xilinx joins a prestigious list of award-winning companies from a variety of industries in the APEX competition for custom-published magazines and journals, including Blue Cross Blue Shield, CMP Media/Digital Connect, DaimlerChrysler, IBM Journal of Research and Development, Mac Publishing, National Football League, National Foundation for Advancement in the Arts, Penton Custom Media, and Time Inc. Strategic Communications.

New Digital Editions Available
We now offer digital editions of our magazines. You can subscribe for free to the new Xcell Journal Digital, which requires no software downloads and is viewable in any standard Internet browser. This updated publishing technology lets you browse, search, make notes, e-mail authors, and click through to advertisers’ websites.
To receive Xcell Journal Digital, you have to subscribe. In addition to Xcell Journal, we also now offer digital subscriptions of all of our magazines. Please visit our website at www.xilinx.com/xcell and click on “Subscriber Services.”
I hope you enjoy reading this issue.

Xcell journal
PUBLISHER Forrest Couch
forrest.couch@xilinx.com
408-879-5270
EDITOR Charmaine Cooper Hussain
ART DIRECTOR Scott Blair
DESIGN/PRODUCTION Teie, Gelwicks & Associates
1-800-493-5551
ADVERTISING SALES Dan Teie
1-800-493-5551
TECHNICAL COORDINATOR Greg Lara
INTERNATIONAL Dickson Seow, Asia Pacific
dickson.seow@xilinx.com
Andrea Barnard, Europe/Middle East/Africa
andrea.barnard@xilinx.com
Yumi Homura, Japan
yumi.homura@xilinx.com
SUBSCRIPTIONS All Inquiries
www.xcellpublications.com
REPRINT ORDERS 1-800-493-5551
www.xilinx.com/xcell/
Xilinx, Inc.
2100 Logic Drive
San Jose, CA 95124-3400
Phone: 408-559-7778
FAX: 408-879-4780
© 2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx Logo, and other designated brands included herein are trademarks of Xilinx, Inc. PowerPC is a trademark of IBM, Inc. All other trademarks are the property of their respective owners.
PERFORMANCE
Achieve Higher Performance with Virtex-5 FPGAs
New architectural elements can help you attain higher system-level performance.
HDL Coding and Design Practices for Improving Virtex-5 Utilization, Performance, and Power
These tips and techniques can lead to better Virtex-5 designs.
SERIAL CONNECTIVITY
A Multi-Gigabit Transceiver for the Masses
The Virtex-5 GTP transceiver brings versatility, ease of use, power efficiency, and cost-effectiveness to high-volume mainstream applications.
Introducing the Virtex-5 PCI Express Endpoint Block
With PCI Express quickly becoming the standard high-bandwidth interconnect, the Virtex-5 LXT PCIe Endpoint block enables a configurable single-chip solution.
Viewpoint
Introducing the Virtex-5 FPGA Family
The first 65-nm advanced FPGAs raise the bar in performance, power efficiency, capacity, and value.
Meeting Memory Interface Design Challenges with Virtex-5 FPGAs
Virtex-5 devices support the latest generation of high-speed memory interfaces.
FOURTH QUARTER 2006, ISSUE 59
VIEWPOINTS
Introducing the Virtex-5 FPGA Family
Serial Everywhere – The Triple-Play Challenge
Virtex-5 Serial Connectivity Solutions
FPGAs for Serial Interconnections
PERFORMANCE
Achieve Higher Performance with Virtex-5 FPGAs
HDL Coding and Design Practices for Improving Virtex-5 Utilization, Performance, and Power
Getting the Best Results from Virtex-5 FPGAs
Maximizing Design Performance for Virtex-5 FPGAs
Clock Management in Virtex-5 Devices
POWER
Reduce Power with Virtex-5 FPGAs
Applying Compact Thermal Models
SERIAL CONNECTIVITY
A Multi-Gigabit Transceiver for the Masses
Introducing the Virtex-5 PCI Express Endpoint Block
PCI Express Markets, Trends, and Applications
Designing with Virtex-5 Embedded Tri-Mode Ethernet MACs
Asynchronous Sample-Rate Conversion Between AES Audio Streams
Implementing Integrated Video Connectivity Solutions with Virtex-5 LXT Devices
Enhancing System Management and Diagnostics with the Virtex-5 System Monitor
Real-Time Debugging for Virtex-5 FPGAs
MEMORY INTERFACES
Memories are Made of This...
Meeting Memory Interface Design Challenges with Virtex-5 FPGAs
Implementing Memory Controllers Using the Memory Interface Generator Tool
Micron Memory Interface
Designing Virtex-5 DDR2 Memory Interfaces for Signal Integrity
SOURCE-SYNCHRONOUS INTERFACES
Improve System Reliability with SPI-4.2 LogiCORE Solutions and Virtex-5 FPGAs
GENERAL
Virtex-5 Configuration Options Offer Designers a Choice
Introducing Virtex-5 EasyPath FPGAs
REFERENCE
Connectivity Solutions
Intellectual Property Offerings
Virtex-5 Boards and Kits
VIEWPOINT
Introducing the Virtex-5 FPGA Family
The first 65-nm advanced FPGAs raise the bar in performance, power efficiency, capacity, and value.
by Steve Douglass
Vice President
Product Development,
Advanced Product Division
Xilinx, Inc.
stephen.douglass@xilinx.com
Welcome to the Virtex™-5 issue of Xcell Journal. The Xilinx® Virtex-5 family is not only the industry’s first 65-nm FPGA – it also offers some of the most advanced architecture and highest performance in the world. Continuing our history of developing groundbreaking technology, we listened to leading design engineers in various markets and built on key characteristics that made our Virtex-4 FPGA family a tremendous success:
• Higher performance
• Higher logic density
• Lower power consumption
• More advanced features
The fundamental value propositions of FPGAs include faster time to market, versatility, support for evolving standards, risk mitigation, field upgradability, and lower system costs. Our FPGAs accommodate your demands for continued improvements in performance, capacity, power consumption, and cost.
[Figure: trade-off plot; axis labels "Used Die Area" (smaller) and "LUT Count Reduction (%)".]
connects between adjacent logic, again lowering routing capacitance.
VCCINT, the core supply voltage, is now 1.0V. All of these factors contribute
asynchronous (or synchronous) FIFOs running as fast as 550 MHz without consuming any logic resources.
The 72-bit-wide block RAM now includes 64-bit error checking and correction (ECC) control logic. Like the integrated FIFO support, the integrated ECC improves memory performance and eliminates the cost associated with traditional fabric-based solutions. You can also use the dedicated ECC logic to augment external memory interfaces.
Interfacing to external devices and especially external memory such as DDR, DDR2, QDR II, and RLDRAM II is dramatically enhanced and simplified by our new ChipSync™ technology. A memory development system (ML561) based on our LX50T devices contains fully functional and hardware-proven reference designs for all of today’s most popular memory technologies.
In the DSP domain, we are now providing 25 x 18-bit multipliers, mainly for more efficient floating-point designs. These DSP48E slices can be directly cascaded for higher performance in digital filtering or video broadcast applications. Direct cascading also saves power – as much as 40% compared to competing solutions.
ues to lead the industry. Every pin supports virtually every I/O standard in use today

as many as 24 in the largest LXT device.
In designing our fourth-generation RocketIO™ technology of high-speed serial transceivers, we invested significant engineering effort to lower power consumption. At the top speed of 3.2 Gbps, the LXT RocketIO transceiver consumes typically less than 100 mW, making it the lowest power transceiver in any FPGA product (see Figure 4).
Each Virtex-5 LXT RocketIO transceiver is programmable and can implement a myriad of speed and serial standards. Link-layer IP is available for

8,500 LUTs compared to implementation with soft IP.
Virtex-5 devices offer more and smaller I/O banks. The outer I/O banks (as many as eight banks in the largest device) also are arranged to provide a PCB routing advantage that in some cases might save board layers.
To ensure the best simultaneously switching output (SSO) performance and provide the best signal integrity (SI) solution in the FPGA industry, all Virtex-5 devices use Xilinx sparse chevron technology pinout assignments. This ensures that

[Figure: transceiver block diagram. Transmit path (TX-PMA, TX-PCS): pre-emphasis driver, parallel-to-serial converter, polarity, phase-adjust FIFO and oversampling, 8B/10B, PMA PLL divider, TX PIPE control, PRBS generator. Receive path (RX-PMA, RX-PCS): equalizer and CDR, RX-OOB, serial-to-parallel converter, oversampler, polarity, comma detect and align, 8B/10B, elastic buffer, PRBS checker, PMA PLL divider, RX status control, RX PIPE control. Both paths connect to the FPGA fabric.]
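The pairing of a 72-bit-wide block RAM with 64-bit ECC mentioned above matches the shape of a classic single-error-correct, double-error-detect (SECDED) code: 64 data bits plus 8 check bits. The sketch below is a generic extended Hamming (72,64) code written for illustration only. The bit layout, the function names ecc_encode and ecc_decode, and the status strings are assumptions and are not taken from Xilinx documentation, so the hardened ECC logic may organize its check bits differently.

```python
# Illustrative (72,64) SECDED code: 64 data bits, 7 Hamming check bits,
# and 1 overall parity bit. Assumption: generic extended Hamming layout,
# not the documented Virtex-5 block RAM ECC encoding.

CHECK_POSITIONS = [1, 2, 4, 8, 16, 32, 64]
DATA_POSITIONS = [p for p in range(1, 72) if p not in CHECK_POSITIONS]  # 64 slots


def ecc_encode(data: int) -> list[int]:
    """Return a 72-bit codeword (list of 0/1 values) for a 64-bit data word."""
    word = [0] * 72
    for i, pos in enumerate(DATA_POSITIONS):
        word[pos] = (data >> i) & 1
    for check in CHECK_POSITIONS:
        # Each check bit makes the parity of the positions it covers even.
        word[check] = sum(word[p] for p in range(1, 72) if p & check) % 2
    word[0] = sum(word) % 2  # overall parity across the other 71 bits
    return word


def ecc_decode(word: list[int]) -> tuple[int, str]:
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    word = list(word)
    syndrome = 0
    for pos in range(1, 72):
        if word[pos]:
            syndrome ^= pos  # XOR of set positions is 0 for a valid codeword
    overall_odd = sum(word) % 2 == 1
    if syndrome == 0 and not overall_odd:
        status = "ok"
    elif overall_odd and syndrome < 72:
        # Odd overall parity means a single-bit error at index `syndrome`
        # (index 0 is the overall parity bit itself).
        word[syndrome] ^= 1
        status = "corrected"
    else:
        # Even overall parity with a nonzero syndrome indicates two flipped bits.
        status = "uncorrectable"
    data = sum(word[pos] << i for i, pos in enumerate(DATA_POSITIONS))
    return data, status


codeword = ecc_encode(0x0123456789ABCDEF)
codeword[37] ^= 1  # inject a single-bit error
print(ecc_decode(codeword))
```

A software model like this is mainly useful for reasoning about what single-bit correction and double-bit detection cost in storage: 8 of every 72 bits are spent on protection.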
Serial Everywhere – The Triple-Play Challenge
Xilinx is helping to empower the next innovation in the triple-play race.
by Wim Roelandts
CEO and Chairman of the Board
Xilinx, Inc.

The electronics industry is pressed to its limits as it strives to develop solutions to feed the insatiable appetites of the consumer and enterprise markets for voice, video, and computer data communications on a single network. To the global broadcast and telecommunications industries, the triple-play opportunity is at once a potentially inexhaustible source of revenue and a constant source of frustration. Despite the immeasurable reward for successfully delivering triple-play services to the masses, substantial obstacles continue to impede access.
Perhaps the most central of these obstacles is the inadequacy of legacy infrastructure equipment to support the massive increases in bandwidth. Evolving from voice-only, the legacy infrastructure is a complex web of overlaid networks that represents both a financial and technological burden to service providers. In short, it is neither technologically feasible to deliver triple-play services with existing equipment nor economically practical to replace it with a completely new network. Moreover, legacy customers will not tolerate any interruption of existing services, nor will they pay extra for poor service quality.
Motivated by the promise of substantial rewards to those that enable this massive business food chain, the electronics industry is marshaling every possible resource to find solutions at all levels to the triple-play challenge. It is no surprise that the semiconductor industry endeavors to keep pace with system manufacturers.

Xilinx Serial I/O Solutions: Crossing the Chasm
The evolution of serial I/O solutions in Xilinx® FPGAs is the result of our high-speed serial initiative, which we announced in 2002. The aim of the initiative was (and is) to accelerate the industry’s move from parallel to high-speed serial I/O by delivering a new generation of connectivity solutions for system designs that meet bandwidth requirements from 3.125 Gbps to 10 Gbps and beyond.
We began by adding up to twenty-four 3.125 Gbps serial transceivers in our Virtex™-II Pro family, accompanied by IP soft cores for numerous serial connectivity standards, reference designs, hardware development platforms, design software, characterization data, and an in-depth design support program.
The Virtex-4 FX family followed suit in 2005 with a similar complement of broad-range transceivers, this time delivering 622 Mbps to 6.5 Gbps performance, as well as an equally robust set of IP and design support software, hardware, and services.
In each case, one of the key objectives in the introduction strategy of these products – with their attending high-speed serial I/O solution packages – was to reach the early adopters and innovators within the FPGA customer base with a viable alternative to custom ASIC and ASSP serial I/O solutions.
Having successfully proven the viability of FPGA-based serial I/O solutions with these previous product families, there remained a single yet extremely important evolutionary step. To cross the chasm into the mainstream FPGA customer base and truly create equivalency between Xilinx serial I/O solutions and custom solutions required the delivery of fully verified, fully integrated, hard IP-based, turnkey serial I/O solutions.
With our newest 65-nm Virtex-5 LXT platform, we believe that we have indeed crossed the chasm. By offering the industry’s first FPGA to deliver hard-coded PCI Express Endpoint and tri-mode Ethernet media access controller (MAC) blocks, Virtex-5 LXT devices are addressing the bandwidth, power, and cost challenges facing equipment vendors working to enable the emerging triple-play services market. The Virtex-5 LXT platform is optimized to enable FPGA designers across a wide range of applications to benefit from serial connectivity by delivering a comprehensive, fully compliant protocol solution with the greatest ease of use.
Virtex-5 Serial Connectivity Solutions
by Sandeep Vij
Vice President, Worldwide Marketing
Xilinx, Inc.
sandeep.vij@xilinx.com

Although “triple play” may be one of the hottest buzzwords and growth drivers in the semiconductor industry, it is useful to understand the evolution of the technology that was required to realize triple play, the forces behind its explosive growth, challenges that will occur along the way, and the critical role of Xilinx® Virtex™-5 products in the development and deployment of triple-play products and services.
Central to the Virtex-5 platform’s value is the recent emergence of two serial I/O standards: Gigabit Ethernet (GbE) and PCI Express (PCIe). In the last three years, these two interfaces have become the de facto connectivity standards for network and computing applications; according to Electronic Trend Publications, GbE and PCIe will account for 80% of all port shipments in 2009.

Disruptive Technology
IP is clearly the preferred protocol in the network market as telecom vendors and service providers transition to an all-IP-based infrastructure supporting Voice over IP, Video over IP, and Data over IP (also known as triple play). Designing carrier-grade to end-user products that support triple play is very challenging, as these products must achieve high levels of performance, manage quality of service (QoS), and be power-efficient and flexible enough to adapt to the seemingly endless evolution of standards and protocols.
In the computing infrastructure market, PCIe has become the predominant host interface for networking, graphics, and backplane connectivity because of its quantum leap in performance, scalability, and pin-count efficiency over the legacy PCI bus. Designing products that span network and compute infrastructures like those in triple-play markets requires system architects and engineers to be well-versed in these new domains, introducing new risks. To this end, Xilinx embarked on a project two years ago to mitigate design risk by introducing a new generation of Platform FPGAs that substantially increase performance, functionality, and device density while reducing cost per gate.

Next-Generation FPGAs
Leveraging our core competence as the premier FPGA vendor and working with our world-class customers and partners, Xilinx developed the Virtex-5 FPGA architecture. With the introduction of the LXT family, Virtex-5 devices now feature integrated multi-GbE and PCIe connectivity technology ideally suited to designs for the triple-play market.
This LXT family is equipped to support high-speed serial connectivity, with features that include:
• Built-in GbE MAC – each Virtex-5 LXT device features four hard-core GbE MACs for multi-port Ethernet connectivity
• Built-in PCIe block – an integrated standards-compliant PCIe Endpoint block supporting one to eight lanes provides as much as 32 Gbps of full-duplex host I/O for extreme performance applications
These features reduce the engineering effort spent on resource utilization, troubleshooting connectivity issues, minimizing power consumption, and optimizing performance, thus giving our customers unconstrained Virtex-5 FPGA resources in designing infrastructure and end-user products for delivering voice, video, and data over IP.
As a programmable platform, the Virtex-5 family positions our customers and partners to enable value-added triple-play technologies such as:
• QoS – customer-specific traffic management solutions enabling tiered services that can change with market conditions
• Digital rights management – enabling hardware-based, adaptive, end-to-end data security for the wide diversity of standards inherent to these markets

Conclusion
In the very dynamic consumer industry where time to market with flexible services is the name of the game, companies are still trying to figure out the right mix of products and services to generate substantial revenue. The Virtex-5 LXT family integrates world-class programmable logic architecture with embedded serial connectivity, providing the performance, density, and connectivity required for delivering voice, video, and data in the emerging triple-play market.
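One way to arrive at the 32 Gbps full-duplex figure quoted above for an eight-lane PCIe Endpoint block is simple link arithmetic. The sketch below assumes first-generation PCI Express signaling (2.5 Gbps per lane per direction) with 8b/10b line encoding. Those two parameters come from the PCI Express specification rather than from this article, and the function name is hypothetical.

```python
# Back-of-envelope PCIe bandwidth check.
# Assumptions (not stated in the article): Gen1 signaling at 2.5 Gbps per
# lane per direction, and 8b/10b encoding (8 payload bits per 10 line bits).

LINE_RATE_MBPS = 2500        # raw line rate per lane, per direction
PAYLOAD_BITS, LINE_BITS = 8, 10


def pcie_gen1_bandwidth_mbps(lanes: int) -> tuple[int, int]:
    """Return (per-direction, both-directions) data bandwidth in Mbps."""
    per_lane = LINE_RATE_MBPS * PAYLOAD_BITS // LINE_BITS  # 2000 Mbps
    per_direction = lanes * per_lane
    return per_direction, 2 * per_direction


for lanes in (1, 4, 8):
    one_way, both_ways = pcie_gen1_bandwidth_mbps(lanes)
    print(f"x{lanes}: {one_way / 1000:g} Gbps each way, {both_ways / 1000:g} Gbps total")
```

Under these assumptions an x8 link carries 16 Gbps in each direction, which sums to the 32 Gbps quoted for full-duplex operation. Protocol overhead from packet headers and flow control would reduce the usable payload rate further.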
FPGAs for Serial Interconnections
by Steve Berry
President, Electronic Trend Publications
saberry@electronictrendpubs.com
www.electronictrendpubs.com

For most of the last 15 years, networking the world for voice, video, and data has been the key driver of the electronics industry. This worldwide network required that the communications industry connect and converge with the computer processing industry. That convergence has primarily settled on Ethernet for the communications side and PCI for the computer side.
Since its inception, Ethernet has been a serial interface. It has been repeatedly scaled up in bandwidth. Today, 1 Gbps connections are ubiquitous, 10 Gbps connections are becoming more common, and 100 Gbps connections have been proven in the laboratory. Ethernet has vanquished all challengers in the LAN market and is rapidly conquering the MAN and WAN markets.
PCI started out as a parallel interface, and as such ran out of bandwidth when connection requirements exceeded 1 Gbps. Industry groups such as the InfiniBand Trade Association and the RapidIO Trade Association introduced new connections to replace PCI. But PCI is much more than the physical connection between system elements. PCI represents an enormous global investment in software that is not readily replaceable. Only PCI Express has met the challenge of true compatibility with PCI. PCI Express bandwidth will be scaled up repeatedly over the coming years to support the industry’s needs.
As a result of the nearly 10-year effort to transition the industry from parallel to serial interconnections, serial interconnections will soon become dominant. Figure 1 illustrates the change from parallel to serial.

Serial vs. Parallel Ports   2004    2005    2006    2007    2008    2009
Parallel                    75.5%   56.3%   34.8%   25.5%   20.4%   15.9%
Serial                      24.5%   43.7%   65.2%   74.5%   79.6%   84.1%
Figure 1 – Serial interfaces are rapidly replacing parallel.

In 2006, serial interconnections will move into the majority. By 2009, serial will represent more than 80 percent of all interconnections.
Although standard semiconductor products will supply the serial interconnection needs of high-volume markets, FPGAs are increasingly important for a wide variety of tasks. There are some key reasons. First, before low-cost standard products are available, FPGAs will provide a mechanism to get to market faster. Second, FPGAs enable system integration with customer algorithms and standards-based serial interfaces. Third, the ability to easily make multi-standard serial connections to FPGAs will dramatically simplify product design.
Thus, the new Xilinx® Virtex™-5 LXT platform – with its built-in PCI Express Endpoint blocks, tri-mode Ethernet MACs, and low-power RocketIO™ transceivers – precisely fits the requirements of today’s FPGA market by giving designers a solution that not only saves time, but also reduces power consumption and conserves FPGA logic resources.

RapidIO and Aurora
Although PCI Express and Ethernet will be the overwhelming leaders in the number of serial ports deployed by the industry, a host of other serial interfaces have carved niches for themselves. The Virtex-5 LXT platform also supports nearly all available serial interfaces. Two of these interfaces – RapidIO and Aurora – are emerging as most important to users of FPGAs.
RapidIO is becoming a favorite for high-end, low-volume DSP applications. A number of implementations in this arena use FPGAs (rather than merchant silicon) to implement DSP functions as well as RapidIO interface and switching functions. This should continue to be the case in the future.
Similarly, the Aurora protocol has quietly gained a substantial following in certain high-end embedded markets. Although Xilinx created Aurora, it is an open protocol, free of charge, that designers can implement in any silicon device. Aurora is a scalable, lightweight, link-layer protocol that is used to move data across point-to-point serial links. Aurora enables simple, high-speed connections between fixed points either on a single board or across multiple boards. As many applications in the board-level embedded market use fixed links between various points in the system, there is no need for a complex message-passing protocol.

Conclusion
With its hard-coded PCI Express Endpoint and Ethernet blocks, I anticipate that many will use the Virtex-5 LXT platform to bridge between PCI Express or Ethernet and numerous other interfaces. The Virtex-5 LXT platform is ideally suited for this task.
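The two claims drawn from the serial versus parallel port-share table above (serial becomes the majority in 2006 and passes 80 percent by 2009) can be checked directly against the published percentages. The short sketch below does that; the dictionary and function names are illustrative only.

```python
# Port-share percentages copied from the serial vs. parallel table.
parallel_share = {2004: 75.5, 2005: 56.3, 2006: 34.8, 2007: 25.5, 2008: 20.4, 2009: 15.9}
serial_share = {2004: 24.5, 2005: 43.7, 2006: 65.2, 2007: 74.5, 2008: 79.6, 2009: 84.1}


def first_year_above(shares: dict[int, float], threshold: float):
    """Return the earliest year whose share exceeds the threshold, or None."""
    for year in sorted(shares):
        if shares[year] > threshold:
            return year
    return None


# Each year's parallel and serial shares should account for all ports.
for year in serial_share:
    assert abs(parallel_share[year] + serial_share[year] - 100.0) < 1e-9

print("Serial majority from:", first_year_above(serial_share, 50.0))
print("Serial above 80% from:", first_year_above(serial_share, 80.0))
```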
VELOCITY
This seminar provides embedded systems developers with the skills needed to develop a PPC system on a programmable chip using the Virtex-4 FPGA. Using the Embedded Development Kit (EDK), developers will create a full system based on the Nu Horizons XC4FX12 evaluation board. Labs provide hands-on experience with the development, verification, debugging, and simulation of an embedded system.
Prerequisites:
• Experience in C programming
• Some HDL modeling experience
Course Outline
• Virtex 4 versus Virtex 5 comparison
• V5’s new PLL and Use with DCMs
– Lab 1 – Introduction to the PLL/Architecture Wizard
• Improved Features in V5
– Lab 2 – Leveraging Improved Features
Course Outline
• Overview of MicroBlaze
• Overview of the Embedded Development Kit (EDK)
• Lab 1: Build and Optimize a MicroBlaze Soft Processor System in Minutes
• Lab 2: Custom Hardware Interface Utilizing the MicroBlaze IPIF Interface
For a complete list of course offerings, or to register for a seminar near you, please visit:
www.nuhorizons.com/xpresstrack
Fundamentals of FPGAs
Course Outline
• Basic FPGA Architecture
• Xilinx Tool Flow
– Lab 1: Xilinx Tool Flow Demo
• Reading Reports
• Architecture Wizard and PACE
– Lab 2: Architecture Wizard and PACE Demo
• Global Timing Constraints
– Lab 3: Global Timing Constraints
• Implementation Options
– Lab 4: Implementation Options
• Synchronous Design Techniques
• Summary
PERFORMANCE
by Adrian Cosoroaba architecture. The Virtex-5 family is the a multiplexer (MUX). Implementing a 4:1
Marketing Manager first FPGA platform to offer a true six- MUX requires two four-input LUTs and a
Xilinx, Inc. input LUT (6-LUT) fabric with fully inde- MUXF block in the Virtex-4 architecture.
adrian.cosoroaba@xilinx.com pendent (not shared) inputs (Figure 1). The same 4:1 MUX can now be imple-
Moving to a 6-LUT fabric architecture mented in a Virtex-5 device with a single
In FPGA system design, maximizing per- provides the 65-nm Virtex-5 FPGA family LUT. Similarly, an 8:1 MUX requires four
formance requires a balanced mix of performance-efficient components – logic fabric, on-chip memory, DSP, and I/O bandwidth. In this article, I’ll explain how you can benefit from Xilinx® Virtex™-5 FPGA building blocks, particularly the new ExpressFabric™ technology, in your quest for higher system-level performance.

I will explore key features of the ExpressFabric architecture with examples that quantify the anticipated performance improvements for logic and arithmetic functions. Benchmarks based on actual customer designs will show that Virtex-5 ExpressFabric technology performs on average 30% better than previous-generation Virtex-4 FPGAs.

With the new logic fabric (in which you can implement functions such as counters, adders, and RAM/ROM storage) and available hard IP blocks, memory, and DSP (optimized to operate at clock rates as fast as 550 MHz), the Virtex-5 FPGA is clearly the platform of choice for high-performance designs.

ExpressFabric Performance
Since the first FPGA was introduced in the mid 1980s, the logic fabric for most FPGAs has been based on the same fundamental four-input look-up table (LUT)

with the most effective trade-off between critical path delay – the determining factor for logic fabric performance – and die size.

With process technology advancements, interconnect timing delay can account for more than 50% of the critical path delay. Xilinx has developed a new interconnect pattern for Virtex-5 FPGAs to enhance performance by reaching more places in fewer hops. The new pattern increases the number of logic connections achievable within two and three hops. Moreover, a more regular routing pattern makes it easier for Xilinx ISE™ software to find the most optimal routes. All of the interconnect features are transparent to FPGA designers, but will translate to higher overall performance and easier design routability. Essentially, the Virtex-5 pattern provides fast, predictable routing based on distance.

The combination of the new 6-LUT structure and special functions like carry chains, dedicated multiplexers, and flip-flops (along with the unique methods by which these elements are connected) creates unsurpassed performance and efficiency for implementing logic and arithmetic functions.

One example that clearly shows the benefits of the ExpressFabric technology is

LUTs and three MUXF blocks in a Virtex-4 FPGA, while the new Virtex-5 architecture requires only two 6-LUTs. The result is better performance and better logic utilization, as shown in Figure 2.

As in previous Xilinx FPGA families, the Virtex-5 Slice L (logic slice) can implement logic functions, registers, and arithmetic functions using the dedicated carry chain. The slightly more complex Slice M (memory slice) adds the capabilities of implementing distributed RAM and shift registers within the LUT (SRL).

Among the various improvements provided by the ExpressFabric architecture, the new carry chain structure delivers substantially higher performance when used to implement arithmetic operations. Its effect on critical path delay is readily seen for several examples listed in Table 1.

Distributed memory functions such as LUT RAM or ROM also benefit in several ways from the larger LUT structure. The new aspect ratio allows a much denser packing of small memory functions leading to significant performance benefits, as depicted in Table 2.

The performance increases provided by the improved logic fabric with its 6-LUT architecture and interconnect structure are substantial, but this is only the beginning.
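
As a rough check on the 8:1 multiplexer comparison, the sketch below counts LUTs under a simple assumed model: a 2^s:1 multiplexer needs 2^s data inputs plus s select inputs, so it fits in one LUT only if that total does not exceed the LUT's input count, and the LUT outputs are then joined by a tree of dedicated 2:1 multiplexers. The function names are illustrative only, and the improvement formula (old/new - 1) is inferred from the figures quoted for Figure 2.

```python
def largest_mux_per_lut(lut_inputs):
    """Widest 2**s:1 multiplexer that fits in one LUT with `lut_inputs` pins.

    A 2**s:1 mux needs 2**s data inputs plus s select inputs.
    """
    s = 0
    while 2 ** (s + 1) + (s + 1) <= lut_inputs:
        s += 1
    return 2 ** s


def mux_resources(mux_width, lut_inputs):
    """Return (LUT count, dedicated 2:1 mux count) for a mux_width:1 mux."""
    per_lut = largest_mux_per_lut(lut_inputs)
    luts = -(-mux_width // per_lut)   # ceiling division
    muxf = luts - 1                   # binary tree of 2:1 muxes joins the LUT outputs
    return luts, muxf


def improvement_pct(old, new):
    """Improvement expressed as old/new - 1, in percent (assumed convention)."""
    return (old / new - 1.0) * 100.0


print(mux_resources(8, 4))                  # four-input LUT fabric
print(mux_resources(8, 6))                  # six-input LUT fabric
print(round(improvement_pct(2, 1)))         # logic levels: 2 versus 1
print(round(improvement_pct(1.33, 1.08)))   # path delay in ns: 1.33 versus 1.08
```

Under this model the four-input fabric needs four LUTs and three dedicated multiplexers, while the six-input fabric needs two LUTs, which agrees with the counts given in the text.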
bandwidth. Moreover, the Virtex-5 FPGA provides dedicated connections to enable you to cascade two adjacent 36-Kb block RAMs together in the block RAM column, thereby implementing a 72-Kb memory running at the maximum 550-MHz rate.

The availability of ever-larger FPGAs has accelerated the trend toward integrating more subsystems into a single device, making more common the necessity of interfacing multiple clock domains. Virtex-5 devices accommodate this by providing integrated logic to simplify the implementation of flexible and efficient FIFOs.

Through this combination of enhancements, the Virtex-5 block RAM delivers more on-chip memory, easier to build FIFOs, and higher bandwidth.

DSP Performance
The growing acceptance of FPGAs as a viable solution for high-performance DSP applications is well deserved. Whether as a co-processor or a stand-alone solution for more demanding applications, FPGAs continue to provide the best combination of performance, power, and cost.

To keep pace with the seemingly insatiable demand for more DSP performance, Xilinx is leading with Virtex-5 DSP capabilities in terms of both clock rate and precision – the clock rate has increased to 550 MHz and the precision has improved from 18 x 18 bits to 25 x 18 bits.

Xilinx also optimized the Virtex-5 DSP48 slice for adder-chain implementations, a powerful capability that enables the creation of very efficient high-performance filters. Dedicated routing resources on the inputs and outputs of each DSP48 slice permit any number of slices to be chained together within a column. This dedicated routing ensures that every DSP48 slice in the chain will run at full speed without consuming any of the fabric routing or logic resources, as other FPGAs require. Taken together, these improvements reduce by half the number of resources needed to imple-

Virtex-5 FPGAs improve on Virtex-4 bandwidth by increasing both the data rate per pin and the number of available I/Os with larger packages. For example, for popular memory interfaces like DDR2 SDRAM, the bandwidth has increased per pin from 534 Mbps to 667 Mbps; the number of data I/Os, when considering SSO requirements, has increased from 432 to 576.

blocks generated by CORE Generator™ software (a part of ISE software).

For these benchmarks, we performed synthesis in a timing-driven fashion with Synplicity’s Synplify Pro, using tight, realistic constraints to effectively measure performance. This was done to ensure that all special optimizations and logic replications were employed.

Implementation in ISE software was accomplished with the place and route effort set to high. Clocks were tightened iteratively by 5% increments until the design failed to meet design constraints.

The result was an average performance gain of 30% over designs implemented in Virtex-4 FPGAs, as shown in Figure 3.

Those designs that improved the most have large cones of logic; the critical path implements a large, often complex logic equation. For example, ASIC prototyping designs will typically have very few registers for a large amount of logic in their criti-

Figure 1 – Virtex-5 configurable logic blocks (CLBs) comprise two slices. Each slice uses four independent 6-LUTs that provide the benefits of fewer logic levels.

8-to-1 MUX      Virtex-4    Virtex-5    Improvement
Logic Levels    2           1           100%
Path Delay      1.33 ns     1.08 ns     23%

Figure 2 – 8:1 multiplexer implemented with Virtex-5 FPGAs versus Virtex-4 FPGAs

Table 1 – Arithmetic functions implemented with Virtex-5 FPGAs versus Virtex-4 FPGAs

Table 2 – LUT-based RAM/ROM implementations with Virtex-5 FPGAs versus Virtex-4 FPGAs

[Chart: Virtex-5 vs. Virtex-4 FPGA Performance Advantage (%), axis 0 to 60; annotations: 30% Average Advantage; Designs with many levels of logic; Designs with fewer levels of logic; Use of hard IP blocks]

Figure 3 – Comparison based on a suite of 74 customer designs using ISE 8.2i software
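
To put the DDR2 SDRAM figures quoted on this page side by side, the short calculation below multiplies the per-pin data rate by the number of data I/Os. It is only a sketch of a peak figure and assumes every counted data pin runs at the full per-pin rate at the same time; the function name is illustrative.

```python
def aggregate_gbps(mbps_per_pin, data_pins):
    """Peak aggregate bandwidth in Gbps, assuming all data pins run at the per-pin rate."""
    return mbps_per_pin * data_pins / 1000.0


virtex4 = aggregate_gbps(534, 432)   # per-pin rate and data I/O count quoted for Virtex-4
virtex5 = aggregate_gbps(667, 576)   # per-pin rate and data I/O count quoted for Virtex-5
ratio = virtex5 / virtex4

print(f"Virtex-4: {virtex4:.1f} Gbps")
print(f"Virtex-5: {virtex5:.1f} Gbps")
print(f"ratio:    {ratio:.2f}x")
```

On these numbers the combined effect of faster pins and more data I/Os is roughly a two-thirds increase in peak interface bandwidth.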
by Brian Philofsky
Staff Software Technical Marketing Manager
Xilinx, Inc.
brian.philofsky@xilinx.com

FPGAs have been very flexible in accommodating any HDL coding or design style for digital logic; Xilinx® Virtex™-5 devices are no exception. Although Virtex-5 FPGAs can accommodate many different types of designs written in many different methods, certain recommended constructs and manners can achieve improved optimization in terms of area, performance, and power.

Know Your Target Architecture and Synthesis Tool
Before beginning any project, you should understand the device architecture you are targeting. For Virtex-5 FPGAs, I recommend reading the Virtex-5 Users Guide (http://direct.xilinx.com/bvdocs/userguides/ug190.pdf) before starting your first line of code. Once you have a better understanding and vision as to how your code will ultimately result in the base hardware, you can make both large and small design and coding decisions confidently.

For example, if you know of and use Bitslip technology within the ISERDES, you could save time, effort, and resources by capturing input data rather than attempting to describe and build similar circuitry.

In another example, if you know the structure and capability of the DSP48E, you can make better choices as to when and where to place pipeline registers. Dedicated features like the wider multiplier or post adder can also help you achieve better area, performance, and power.

Similarly, knowing the capabilities and current limitations of your synthesis tool can not only help when choosing coding styles to properly infer primitives but can also give you greater insight as to when to instantiate a component or use inference. Review synthesis manuals, application notes, or other relevant materials before starting so that you know the recommended coding styles for the synthesis tool you are using.

You should also update and use the latest versions of synthesis and ISE™ tools before beginning a project. Although initial synthesis support for the Virtex-5 architecture is strong, many improvements in optimization and inference support are still to come with new releases. One easy way to ensure more optimal designs in terms of area, performance, and power is to install the latest version of the software.

Control Signal Polarity
The Virtex-5 architecture can support different control signal polarity (clock enables, resets, or sets). However, to have the most optimal design, I recommend consistent use of active high control signals in your design. The Virtex-5 slice control logic is active high, and when described in this same manner in the code should never require additional LUT resources for a simple signal inversion.

If the signal comes from an external pin and needs an active low polarity, I suggest inverting the signal in the top-level code and using a positive polarity in all processes and sub-modules requiring that signal. This is critical for designs that have several cores, use bottom-up synthesis techniques, have KEEP_HIERARCHY constraints, or employ the use of partitions (Figure 2).

Designs that fall into these categories are more susceptible to the use of additional LUTs per core/netlist/hierarchy/partition for the sole purpose of inverting these control signals, which not only consume extra LUT resources but may also have negative effects on performance and slice
packing. As a general rule, always code sets, resets, and enables with an active high (logic 1 activates) polarity.

Use of Resets
It is common practice to use a global asynchronous reset in the source HDL code to initialize the design; however, in many cases this consumes additional resources. Instead, think synchronous and local. I suggest describing a synchronous set/reset logic to the portions of the design that do need periodical resets. For those portions of the design that do not, you can initialize the signals defined to be registered in the HDL code at the time they are declared (for example, when defining a reg in Verilog or a signal in VHDL). This methodology allows for improved packing density, enhances timing analysis and performance, and can improve area resources.

In terms of FPGA behavior, without a global reset described in the code, a GSR (global set/reset) will occur upon comple-

[Figure 2 diagram labels: Top; Partition; Old Netlist; LUT6; Flip-Flop; Clock Enable; CE]

[Code listing residue: end XILINX;]

register LUTs); distributed RAM (LUT-based RAM) memory; or block RAM for the implementation, which would not be otherwise possible nor optimal. The synthesis tool has maximum flexibility to choose the best resource for the described code.

Pipelining
As with previous FPGA generations, properly pipelining your design is necessary to achieve top performance and improved power characteristics. With the introduction of the Virtex-5 architecture, a new logic structure dictates slightly different rules regarding when and how to pipeline.

The Virtex-5 device departs from the traditional four-input LUT in previous FPGA families and has an enhanced six-input LUT (6-LUT), allowing for wider logic functions between pipeline registers while maintaining top performance. You should keep this in mind, as logic functions coded into HDL as optimal code should include six inputs to the logic function between registers to get the most optimal pipelining and LUT resource management.

In cases where it is not practical or possible to have exactly six inputs in a given logic function, the wider input 6-LUT still allows for good performance by
reducing the number of logic levels, thus requiring fewer pipeline stages to achieve the same as or better performance than previous FPGA architectures.

A good goal is to aim for fewer than 10 inputs to a given logic function between I/Os, registers, or synchronous blocks (like block RAM or DSP48Es), which generally would represent two logic levels. When you need a significantly higher number of inputs for the design path to meet latency or other requirements, you can attempt to reduce the fan-in to that logic function (when possible) if high performance or low power are your design objectives.

Coding Memories
Among other innovations within the Virtex-5 architecture, Xilinx has enhanced both block RAM and distributed RAM memories with greater capacity and capability. You must make different decisions early in the design process and while coding to get the most from these valuable resources.

General guidelines call for inferring RAMs when possible for easier code changes, faster simulation, and more portable code. However, even when behaviorally describing the RAM, you should keep some important things in mind. The first and most obvious consideration is RAM capacity. In terms of block RAMs, the base memory block increased in Virtex-5 devices to 36 Kb of memory storage space. You can configure this block to the wider but shallower 512 x 72 configuration, the deeper single-bit-wide 32K x 1 configuration, or several configurations in between. It is also possible to cascade two 36-Kb RAMs to form a 64K x 1 configuration or break up the 36-Kb RAMs into two separate 18-Kb RAMs capable of 512 x 36 to 16K x 1 configurations.

Distributed RAM has benefited from the larger LUT structure and can now efficiently accommodate 64-bit depths without any area or performance penalties. This is the most optimal size for this type of RAM in the Virtex-5 device, although other sizes can be accommodated. The base RAM sizes are important to remember during memory selection and coding to most efficiently use the limited RAM resources in the device and achieve the best performance.

Both block RAM and distributed RAM memories also have additional capabilities that require different coding and design considerations. For performance, perhaps the most important is the proper use of output registers. For block RAMs, this means enabling the output registers to the block RAM whenever possible. By enabling the output registers, a reduced clock-to-out is realized from the RAM, thus improving timing for the data leaving the RAM. However, an extra clock cycle of latency is added during reads, for which you must account.

Similarly, when using distributed RAM, the output of the RAM can be asynchronous; however, coding it synchronously will allow the use of the register within the slice, providing better timing characteristics and reducing the chance of the RAM being part of the timing bottleneck.

There are more advanced features of the block RAMs, such as FIFO and ECC (error correction circuitry) capabilities. The distributed RAM also has new capabilities such as a quad-port configuration. In some cases, these features cannot be realized by inference within synthesis and instantiation is necessary. If you need such functionality, I suggest instantiating the RAMs either by generating cores within Xilinx CORE Generator™ software or by instantiating the base primitive. Taking advantage of these advanced features can save RAM and logic resources as well as improve area, performance, and power.

Some General Guidelines
A few other general recommendations do not fall into any specific categories but can result in better coding and design choices. First, you should make wise choices in terms of your design hierarchy right from the start. Your choice of hierarchy can have effects on the synthesis and implementation tools’ ability to optimize the logic paths.

In general, do not allow timing paths to cross multiple boundaries of hierarchy. This not only limits the tool’s ability to optimize logic but may also limit your options for design implementation and design debugging. For instance, you may not be able to use partitions or KEEP_HIERARCHY on certain hierarchies with this practice.

For designs in which some or most of the code was created for an architecture other than Virtex-5 FPGAs, I suggest that you review the code to ensure that it is well suited for implementation into the new architecture. A few minutes of time spent here can save several hours later if you identify and correct suboptimal code.

If your design contains cores or pre-compiled netlists (EDIF or NGC files) from a previous architecture, you should regenerate those targeting Virtex-5 devices. Unless regenerated, netlists optimized for a previous architecture are more likely than not far less optimal when targeting Virtex-5 architectures.

One last suggestion is to use the HDL language templates within the ISE tools. They not only help with accelerating the generation of VHDL or Verilog code, but also provide assistance in creating more optimal code for FPGAs. They also cut down on the possibility of creating syntax or other simple but common mistakes that can hold up the testing and verifying of HDL code.

Figure 3 shows both Verilog and VHDL code following the guidelines discussed here.

Conclusion
Coding styles are very individual; however, following these suggestions makes it more likely that you will achieve a more optimal result. These guidelines do not represent absolutely everything you need to know to achieve the best Virtex-5 design possible, but I have provided some common strategies that can help in achieving more optimal designs.

Almost any set of valid HDL code likely will result in a functioning design, but following a few simple guidelines can help in terms of improved density, performance, and power, and many times may reduce the amount of time it takes to ultimately complete a design.

For more information, see the Synthesis and Simulation Design Guide at http://toolbox.xilinx.com/docsan/xilinx82/books/docs/sim/sim.pdf or White Paper 231, “HDL Coding Practices to Accelerate Design Performance,” at http://direct.xilinx.com/bvdocs/whitepapers/wp231.pdf.
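
As a worked illustration of the block RAM aspect ratios described under Coding Memories, the sketch below picks the configuration that needs the fewest 36-Kb blocks for a requested memory. Only the two end points (512 x 72 and 32K x 1) are taken from the article; the intermediate entries in the table are an assumption, and the function names are hypothetical. It ignores cascading restrictions and other placement details.

```python
# Depth x width options for one 36-Kb block RAM.
# 512 x 72 and 32K x 1 are quoted in the article; the entries in between
# are assumed here for illustration.
CONFIGS_36K = [(32768, 1), (16384, 2), (8192, 4), (4096, 9),
               (2048, 18), (1024, 36), (512, 72)]


def ceil_div(a, b):
    """Integer ceiling of a / b."""
    return -(-a // b)


def pick_config(depth, width):
    """Return (block count, (block depth, block width)) using the fewest blocks."""
    best = None
    for block_depth, block_width in CONFIGS_36K:
        blocks = ceil_div(depth, block_depth) * ceil_div(width, block_width)
        if best is None or blocks < best[0]:
            best = (blocks, (block_depth, block_width))
    return best


print(512 * 72 == 36 * 1024)    # the widest configuration uses all 36 Kb
print(pick_config(1024, 32))    # a 1K x 32 memory
print(pick_config(65536, 1))    # a 64K x 1 memory needs two blocks
```

The 64K x 1 case comes out as two blocks in the 32K x 1 configuration, which corresponds to the two-block cascade mentioned in the text.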
by John Gallagher
Sr. Director Outbound Marketing
Synplicity, Inc.
johng@synplicity.com
new algorithms and delay models. As opposed to simple wire-load models, we engineered the Synplify Pro tool to employ sophisticated netlist-based routing estimation (coupled with known routing values where applicable). In the case of fast carry chains, for example, routing delays are well known and can be directly “plugged in.” Similarly, in the case where a cell, driver, load, and specific route are known, an accurate routing delay associated with this path can be plugged in to the routing and timing algorithms.

Synthesizing Fast High-Capacity RAM Blocks
The new block RAM structures (with pipeline) in the Virtex-5 family have increased to 32 Kb in size – twice the size of those found in Virtex-4 components. In addition to offering a simple dual-port mode that can double the RAM’s bandwidth, these blocks also contain additional hard IP in the form of FIFO logic and new 64-bit error checking and correction (ECC) logic (Figure 3). Implementing this logic as hard IP frees up other resources and minimizes dynamic power consumption.

As with all hard IP blocks in Virtex-5 devices, these block RAMs have been tuned for 550-MHz operation to provide higher on-chip memory bandwidth. The 18-Kb block RAMs are constructed from two physical 9-Kb memories, which are automatically controlled to save power by enabling only one of the 9-Kb sub-blocks for any given read or write operation in most configurations.

For our part, the Synplify Pro synthesis software can perform automatic memory inferencing, including single-port and dual-port implementations, single and multiple clocking schemes, and automatic retiming. Regarding the latter point, Virtex-5 block RAMs are inherently synchronous; however, the design’s RTL could describe the memory and registers in such a way as to be technically asynchronous. In such a case,

The Virtex-5 hard DSP slice – called the DSP48E – features a 25 x 18-bit multiplier (versus the 18 x 18-bit multiplier employed in Virtex-4 FPGAs). This increase can lead to fewer cascaded stages, thereby resulting in higher overall performance and utilization (Figure 4).

Tuned for 550-MHz operation, you can configure these high-precision, high-performance, highly flexible slices for DSP, arithmetic, and logic functions and cascade them for adder-chain architectures. The DSP48E slice has 40% lower power consumption compared to equivalent functions in Virtex-4 FPGAs (1.38 mW/100 MHz at a 38% toggle rate).

The sophistication of these DSP slices means that it is unlikely that a data path defined in RTL will exactly match the optimal DSP implementation structure. For example, rather than implementing a function such as “(a + b) + (c + d)” by adding “a” and “b,” adding “c” and “d,” and then adding the results generated by these operations, it may be more efficient to cascade the DSP slices along the lines of “(((a + b) + c) + d).”

We equipped Synplify Pro software with extremely sophisticated mapping algorithms that perform a lot of data path massaging, creating data path structures that

[Figure 3 diagram labels: 18-kb RAM; ECC and FIFO Logic; 18-kb RAM]

Figure 3 – The Virtex-5 family features as much as 10 Mb of 550-MHz block RAM.
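
The regrouping of “(a + b) + (c + d)” into “(((a + b) + c) + d)” is safe because fixed-width addition that wraps on overflow is associative. The sketch below checks this numerically; it is a behavioral model only, not a description of how Synplify Pro performs the mapping, and the 48-bit width is an assumption chosen for illustration.

```python
from functools import reduce

WIDTH = 48                     # assumed adder width, for illustration only
MASK = (1 << WIDTH) - 1


def add(x, y):
    """Fixed-width adder: the result wraps modulo 2**WIDTH."""
    return (x + y) & MASK


def tree_sum(a, b, c, d):
    """Balanced form as written in the RTL: (a + b) + (c + d)."""
    return add(add(a, b), add(c, d))


def chain_sum(operands):
    """Cascaded form (((a + b) + c) + d): each stage adds one operand to the running sum."""
    return reduce(add, operands)


# Operands chosen so that intermediate sums overflow the assumed width.
vals = (0xFFFF_FFFF_FFFF, 3, 12345, 0x8000_0000_0000)
print(tree_sum(*vals) == chain_sum(vals))
```

The tree form finishes in two adder levels while the chain takes three stages, but each chain stage has the shape of one slice feeding the next, which is the structure the article describes for cascaded DSP slices.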
Virtex™-5 FPGAs give you unbeatable power savings with the highest performance. The unique combination of 65nm process, second-generation Triple-Oxide technology, ExpressFabric™ architecture, and power-optimized hard IP extends the 1 to 5 Watt power advantage delivered by previous-generation Virtex FPGAs. Achieve higher reliability and a smaller form factor. Save cost on power supplies, heat sinks, and fans. All this, plus the industry’s highest performance.

No other FPGA vendor comes close.

[Chart: Power vs Performance; labels: Power Budget; Total Power; Competing FPGA; Virtex-4 FPGA; Virtex-5 FPGA; Available]

www.xilinx.com/virtex5/power
Maximizing Design Performance for Virtex-5 FPGAs
ISE software gives you the tools to achieve the timing goals of a Virtex-5 design.

by Michelle Fernandez
Software Technical Marketing Engineer
Xilinx, Inc.
michelle.fernandez@xilinx.com

As FPGAs push the performance envelope, maximizing design performance requires knowledge of the device architecture and design software. The 65-nm Xilinx® Virtex™-5 FPGA family delivers the industry’s highest performance, with new ExpressFabric™ technology, diagonally symmetric routing, enhanced on-chip memory, DSP slices, and high-speed I/O.

To maximize system performance, you should use proper design techniques such as defining timing constraints and selecting options in synthesis and implementation that work best for your design. In this article, I’ll describe how to achieve faster timing in the fewest design iterations.

Understanding the Architecture
When evaluating a new FPGA architecture like the Virtex-5 family, it is important to study the user guide and data sheet to understand the hardware features.

The Virtex-5 FPGA family is based on a new ExpressFabric architecture that delivers higher speeds, a new 6-input LUT structure that reduces logic levels, and diagonally symmetric routing that minimizes delays. Each CLB contains two slices that have four 6-input LUTs and four registers configurable in many ways. For maximum slice packing, it is imperative that you understand the slice interconnectivity and any shared resources.

Virtex-5 FPGAs contain hard IP such as embedded memory (block RAM) and math functions (DSP48E slices) tuned to 550 MHz. Here are some design considerations if any of these hard-IP blocks show up as part of your critical paths:

• Check to see if your design is making the most of the block’s features and that the synthesis tool is inferring the features as expected from your RTL.

• When using the embedded block RAM memory or the DSP48E slices, it is important to use their dedicated pipeline registers when possible to reduce setup and clock-to-out timing.

• Another consideration is the mix of block RAMs or DSP48E slices in the design, and the trade-off between using dedicated blocks or implementing the same function in slices to allow for placement flexibility.

The choice of clocking resources can also affect a design’s performance. Virtex-5
FPGAs have I/O, regional, and global clocking resources. These devices are divided into multiple clock regions, which at most can contain 4 regional clocks and 10 global clocks. During design planning, it is important to analyze how many clock regions you plan to use as well as specific clocks within a clock region. Placing your I/Os so that their interface logic does not require all of the clock resources in a clock region gives ISE™ software greater placement flexibility.

Define Timing Requirements
Synthesis and ISE implementation tools are driven by the performance goals that you specify with timing constraints for internal clock domains, I/O paths, multi-cycle paths, and false paths (see Figure 1). Defining realistic timing constraints will prevent excessive replication and longer run times.

In your synthesis report, check for any replicated registers and confirm that the timing constraints that apply to the original register also cover the replicated registers for implementation. When writing timing constraints, group the maximum number of paths with the same timing requirement before generating a specific constraint to minimize implementation run times and memory usage.

Driving Synthesis
Here are some design considerations for getting optimal results from synthesis tools:

• Use proper coding techniques to ensure that the inference of your RTL by synthesis takes advantage of the architectural features.

• Add any lower level netlists to your synthesis project to better optimize the HDL that interfaces to those netlists.

• If critical paths in your implementation are not seen as critical in synthesis, try Synplify Pro’s “-route” constraint to force synthesis to focus on that path.

• Explore the synthesis tool settings. (See Figure 2 for Synplicity and Figure 3 for Xilinx Synthesis Technology [XST] suggested tool settings.) There are also a variety of attributes that can affect synthesis optimizations. These attributes are an easy way to affect synthesis without having to re-code (see Table 1).

Certain tool settings, such as retiming in Synplify Pro and register balancing in XST, can impact area. If your design is affected by high fan-out nets and you want the synthesis tool to reduce that fan-out, use fan-out attributes specifically on that net versus globally reducing the fan-out limit. Avoid maintaining hierarchy if critical paths cross over the hierarchical boundaries. Before implementation, review the warnings in your synthesis report.

Figure 3 – Recommended ISE synthesis (XST) settings
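
The clock-region limits quoted on this page (at most 4 regional clocks and 10 global clocks per region) lend themselves to a quick planning check before any constraints are written. The sketch below is a hypothetical helper, not an ISE feature; the region names, clock names, and data layout are all invented for illustration.

```python
MAX_REGIONAL = 4    # regional clocks per clock region, as quoted in the article
MAX_GLOBAL = 10     # global clocks per clock region, as quoted in the article


def check_clock_plan(plan):
    """Return (region, kind, used, limit) for every over-budget clock region.

    `plan` maps a region name to {"regional": [...], "global": [...]},
    where each list holds the clock names intended for that region.
    """
    problems = []
    for region, clocks in plan.items():
        for kind, limit in (("regional", MAX_REGIONAL), ("global", MAX_GLOBAL)):
            used = len(set(clocks.get(kind, [])))   # count distinct clocks only
            if used > limit:
                problems.append((region, kind, used, limit))
    return problems


# Hypothetical plan: region and clock names are placeholders.
plan = {
    "region_a": {"regional": ["rx0", "rx1", "rx2", "rx3", "rx4"], "global": ["sys"]},
    "region_b": {"regional": ["tx0"], "global": ["sys", "mem"]},
}
print(check_clock_plan(plan))
```

A region that shows up in the result is already using more clocks than the limit allows, so its I/O interfaces are candidates for moving or sharing clocks; regions that sit exactly at the limit are worth a second look too, since the article recommends leaving some clock resources free for placement flexibility.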
Choosing Implementation Options
Having obtained an acceptable timing estimate from the synthesis tool, you can use the implementation tools to determine the true performance of the design. The ISE default mode is the performance evaluation mode, which enables you to get high-performance results out of your implementation tools without having to specify timing goals.

The next step is to run timing-driven mapping (MAP) and place and route (PAR). Timing-driven MAP performs closed-loop packing and timing-driven placement, while PAR performs the routing of the design. Both MAP and PAR should run with their effort levels set to high to achieve optimal results.

Physical synthesis options in implementation can re-optimize and pack logic based on knowledge of the critical paths of a design, leading to better placement and routing. The physical synthesis options are implemented during the MAP process and include global netlist optimization, localized logic optimization, retiming, register duplication, and equivalent register removal. Details on each of these options can be found in the Xilinx White Paper, “Physical Synthesis and Optimization with ISE 8.1i,” available at www.xilinx.com/bvdocs/whitepapers/wp230.pdf.

Xplorer Utility
Xplorer is a tool that helps to determine the set of implementation options that result in the best performance for a design. Xplorer has two modes: timing closure and best performance. The timing closure mode evaluates your timing constraints and tries different sets of implementation options to achieve those goals. In best performance mode, you can give the tool a clock domain to focus on; the tool will try to achieve the best frequency for the clock. This is helpful when benchmarking a design’s maximum performance.

Evaluating Your Critical Paths
By understanding the characteristics of your critical path, you can make better decisions about what to do for your next design iteration. A datapath comprises both logic and interconnect delay. Individual component delays that make up logic delay are fixed. You can reduce logic delay by reducing the number of logic levels or by redefining the structure of the logic.

In comparison, interconnect delay is much more variable and is dependent on the placement of the logic. Before running your design through PAR, a timing analysis after MAP is recommended. Although this timing report will only have estimates for your routing delays, it can give you an idea of the critical paths the implementation tools are working on. If the critical paths have a high number of logic levels, you may want to work on improving the logic levels versus running it through PAR.

If your design has an excessive amount of logic levels:

1. Try the physical synthesis options in MAP.

2. Go back to synthesis and verify that critical paths reported in implementation match what is reported in synthesis.

3. Review the synthesis inference of your HDL code.

If there are few logic levels but certain datapaths are not meeting timing:

1. Evaluate fan-out on routes with long delay.

2. If the critical path contains hard-IP blocks such as block RAMs or DSP48E slices, verify that the design takes full advantage of the embedded registers. Also understand when to make the trade-off between using these hard blocks or using slice logic.

3. Analyze clock skew.

4. If the logic appears to be placed far apart, floorplanning of critical blocks may be required. Only floorplan where necessary.

5. If area groups were created for a design with a previous version of software or before many design changes, consider removing those area groups.

6. Consider placing hard-IP blocks such as block RAMs or DSP48E slices.

Conclusion
Virtex-5 FPGAs are optimized for high-performance designs, while ISE software has the capabilities you need to quickly achieve design closure, improve productivity, and efficiently verify your designs. Xilinx provides a comprehensive suite of software tools (powered by ISE Fmax technology) that improves design performance. However, the more that you can do up-front with good coding styles, defining timing constraints, and resource planning, the easier it will be for downstream tools to achieve your timing requirements.

                                                          XST             Synplify Pro
Fan-out Control                                           max_fanout      syn_maxfan
Directs Inference of RAMs to Block RAMs or SelectRAM      ram_style       syn_ramstyle
Directs Usage of DSP48 Slice                              use_dsp48       syn_multstyle/syn_dspstyle
Directs Usage of SRL16                                    shreg_extract   syn_srlstyle
Controls % of Block RAMs Utilized                         n/a             syn_allowed_resources
Preservation of Register Instances During Optimizations   Keep            syn_preserve
Preservation of Wires                                     Keep            syn_keep
Preservation of Black Boxes with Unused Outputs           Keep            syn_noprune

* You can find XST documentation at http://toolbox.xilinx.com/docsan/xilinx82/books/docs/xst/xst.pdf. Synplify Pro documentation is located in the tool help documentation.

Table 1 – Helpful synthesis attributes*
Clock Management
in Virtex-5 Devices
Virtex-5 FPGAs give designers fresh choices.
by Ralf Krueger
Sr. Staff Applications Engineer
Xilinx, Inc.
ralf.krueger@xilinx.com

As FPGAs grow in size, quality on-chip clock distribution becomes increasingly important. Clock skew and clock delay impact device performance; managing clock skew and clock delay with conventional clock trees becomes more difficult in large devices.

Traditionally, you would deploy solutions such as a Xilinx® Virtex™-4 digital clock manager (DCM) or mixed-signal phase-locked loop (PLL) to achieve clock tree deskew and frequency synthesis, among other functions. Yet each solution has its advantages and disadvantages.

In Virtex-5 devices, for the first time in an FPGA, both digital DCMs and analog PLLs are implemented side by side in a clock management tile (CMT). You can now select the clock management solution best suited for your particular applications.

Each Virtex-5 device has as many as six CMTs. A CMT contains two DCMs and one PLL. You can use either of the two DCMs or the PLL as a stand-alone module, or they can interact with each other. If used as a stand-alone module, the application requirements typically dictate which clock management solution to use. The DCM, for example, supports a fine phase shift, a dynamic phase shift, and a multiply/divide feature that does not depend on any maximum voltage-controlled oscillator (VCO) frequency. However, the PLL filters input clock jitter, supports a wide range of output frequencies, including higher frequencies, and consumes less power.

The DCM and PLL are also designed to interact with each other. The PLL can help clean up input or output clocks to the DCM. Dedicated resources within each CMT make the connections and still guarantee a proper deskew of the FPGA clocks.

The CMTs are located in the center column of the Virtex-5 architecture. This enables well-matched clock routes to and from every DCM or PLL for enhanced symmetry (see Figure 1).

DCM
Virtex-5 DCMs provide a zero propagation delay buffer, clock division and multiplication capabilities, fixed and dynamic fine phase shift, and multiple phases of the input clock. Along with fully differential global clock trees and low skew between output signals, the application’s various clocks are distributed efficiently throughout the device. Each DCM can drive as many as 9 of the 32 global clock routing networks within the device.

The global clock distribution network minimizes skews caused by loading differences. By monitoring a sample of the DCM output clock, the DCM’s delay-locked loop (DLL) compensates for the delay on the routing network, effectively eliminating the delay from the external input port to the individual clock loads within the device.

In addition to providing zero delay with respect to a user source clock, the DCM provides multiple phases of the source clock. The DLL can also act as a clock doubler or divide the user source clock by as much as 16. The DCM can also act as a clock mirror. By driving the DCM output off-chip and then back in again, the DCM can deskew a board-level clock between multiple devices.

Another submodule provides the ability to phase shift the DCM’s output clock in small increments (1/256th of the period). The versatile digital phase shift (DPS) operates in four different modes for maximum flexibility: fixed, variable-positive, variable-center, and direct. The DCM’s digital frequency synthesis (DFS) module provides two outputs, CLKFX and CLKFX180, which are derived from the input clock by frequency multiplication and division. You provide valid multiply (M) and divide (D) values, which the DFS implements through a frequency calculator. For example, if you provide an M value of 19 and a D value of 8, they would yield a 2.375 source-clock multiplier.

PLL
The CMT’s PLL is a mixed signal block designed to support clock network deskew, frequency synthesis, and jitter reduction. The PLL block diagram in Figure 2 provides a general overview of the various components.

Input multiplexers (MUXs) are used to select the reference and feedback clocks from the global clock pins, global clock trees, or one of the DCMs. Each clock input has a programmable counter. This pre-scales the reference clock and allows a wide range of frequency synthesis.

The phase frequency detector (PFD) compares both phase and frequency of the input clock and the feedback clock. A signal is generated that is proportional to the phase and frequency error between the two clocks, which is then used to drive the charge pump and loop filter to generate a reference voltage to the VCO. An up or down signal from the PFD determines if the VCO should operate at a higher or lower frequency.

After the PFD determines that the input and feedback clocks are phase- and frequency-aligned, a lock signal is raised, indicating that the PLL output clocks are valid. The VCO continues to compensate for any variations in voltage or temperature. The M counter in the feedback path controls the

[Figure: block diagram labels: From Global Clock Input Pins; From Global Clock Buffers; DCM1, PLL (clkout_pll<5:0>), and DCM2, each To Global Clock Buffers]

Conclusion
Virtex-5 FPGAs give digital designers a choice of either digital or analog clock management. Depending on your particular application, either module – or a combination of both modules – provides you with choices that you never had before.

Together with an abundance of clock tree resources, Virtex-5 devices greatly simplify and improve system-level designs involving high fan-out and high-performance clocks. Virtex-5 devices have powerful frequency synthesis, phase-shifting, and clock deskew capabilities never offered before in an FPGA. Along with comprehensive software support, you can achieve larger, faster, and more complex designs than in any previous-generation FPGA.
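
As a rough illustration of the DCM arithmetic described in this article, the following Python sketch computes the DFS source-clock multiplier from M and D and the size of one 1/256-period phase-shift step. The function names and the 100 MHz input clock are assumptions made for the example; the sketch does not check M and D against the legal ranges for a given device, which the article does not list.

```python
def dfs_multiplier(m: int, d: int) -> float:
    """Source-clock multiplier M / D applied by the DFS (hypothetical helper)."""
    if m <= 0 or d <= 0:
        raise ValueError("M and D must be positive integers")
    return m / d


def clkfx_mhz(f_in_mhz: float, m: int, d: int) -> float:
    """Frequency of the CLKFX output for a given input clock, in MHz."""
    return f_in_mhz * dfs_multiplier(m, d)


def phase_step_ps(f_clk_mhz: float) -> float:
    """One phase-shift increment (1/256th of the clock period), in picoseconds."""
    period_ps = 1_000_000 / f_clk_mhz
    return period_ps / 256


if __name__ == "__main__":
    # The article's example: M = 19, D = 8 gives a 2.375 multiplier.
    print(dfs_multiplier(19, 8))
    # Assumed 100 MHz input clock, chosen only for illustration.
    print(clkfx_mhz(100.0, 19, 8))
    print(phase_step_ps(100.0))
```

Keeping the multiplier in its own function makes it easy to compare candidate M/D pairs before committing them to a design.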
Reduce Power
with Virtex-5 FPGAs
The world’s first 65-nm FPGAs offer the
lowest power without compromising performance.
by Derek Curd
Senior Staff Applications Engineer,
Advanced Products Division
Xilinx, Inc.
derek.curd@xilinx.com
Benefits of Reducing Power
Implementing a lower power FPGA design offers advantages beyond simply adhering to the device’s thermal operating requirements. Although meeting component specifications is obviously critical for performance and reliability, how you achieve this has a significant impact on system cost and complexity.

First, lowering FPGA power consumption allows you to use less expensive power supplies, which have fewer components and consume less PCB area. The imple-

illustrates the importance of controlling power and temperature for systems with high reliability requirements.

Power: Challenges and Solutions
Total power in an FPGA (or any semiconductor device) is the sum of two components: static power and dynamic power. Static power results primarily from transistor leakage current, the small current that “leaks” from either source-to-drain or through the gate oxide of the

tremendous tool to fight leakage. In older FPGAs, two gate-oxide thicknesses were used: a thin one for the high-performance, lower operating voltage transistors in the FPGA core, and a thicker one for the larger, high-voltage-tolerant transistors in the I/O blocks. Simply put, “triple oxide” refers to the addition of a third, medium-thickness gate oxide (or “midox”) transistor that has much lower leakage than the thin-oxide core transistor.

The “midox” transistor is used extensively in the core of the device for non-performance-critical circuits (like configuration memory) or circuits that do not require fast switching times in response to a changing gate voltage (like routing pass gates). The thin-oxide, highest leakage transistors are

[Figure: power (mW) for Virtex-4 LX devices and Virtex-5 LX devices: XC4VLX25, XC4VLX30, XC4VLX40, XC4VLX50, XC4VLX60, XC4VLX80, XC5VLX85, XC4VLX100, XC5VLX110, XC4VLX160, XC4VLX200, XC5VLX220, XC5VLX330]
equation governing dynamic power is:

dynamic power = CV²f

where C is the capacitance of the node switching, V is the supply voltage, and f is the switching frequency. The 65-nm process node enables FPGAs that have significantly greater logic capacity and higher performance than older devices. In other words, more nodes are switching at higher frequencies. All else being equal, this tends to increase dynamic power.

However, there is good news with respect to dynamic power at 65 nm. The core FPGA supply voltage (V) and node capacitance (C) generally reduce with each new process node, providing substantial dynamic power savings over previous-generation FPGAs.

• The Virtex-5 routing architecture now includes diagonally symmetric routes, meaning that every CLB now has a direct “one hop” connection to all of its neighbors, including diagonal neighbors. When a connection is required between logic functions, it is now more likely that this connection is a less-capacitive “one hop” connection, whereas previous routing architectures may have required two or more hops for the same connectivity.

comparison to implementing these functions in general-purpose FPGA logic.

Unlike the FPGA fabric, these dedicated blocks contain only the transistors necessary to implement the required function. And there are no programmable interconnects, so routing capacitance is as small as possible. Fewer transistors and lower node capacitance benefit both static and dynamic power consumption. The net result is that these dedicated blocks can perform the same function in as little
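
To make the dynamic power equation given in this article concrete, here is a minimal Python sketch that evaluates it and shows how supply voltage scaling alone changes the result. The function names are hypothetical, and the 1.2 V and 1.0 V core voltages are illustrative assumptions that do not come from the article.

```python
def dynamic_power_w(c_farads: float, v_volts: float, f_hz: float) -> float:
    """Dynamic power of one switching node: C * V^2 * f, in watts."""
    return c_farads * v_volts ** 2 * f_hz


def relative_dynamic_power(c_scale: float, v_old: float, v_new: float,
                           f_scale: float = 1.0) -> float:
    """Ratio of new to old dynamic power for scaled C, V, and f."""
    return c_scale * (v_new / v_old) ** 2 * f_scale


if __name__ == "__main__":
    # Assumed values for illustration only: core supply dropping from
    # 1.2 V to 1.0 V with capacitance and frequency held constant.
    ratio = relative_dynamic_power(c_scale=1.0, v_old=1.2, v_new=1.0)
    print(f"dynamic power ratio: {ratio:.3f}")
```

Because voltage enters as a square, a modest supply reduction has a larger effect than the same proportional reduction in capacitance or frequency.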
physical memories at a time. The other 9-Kb memory can therefore be effectively “powered down” while it is not being accessed. This reduces power consumption by nearly an additional 50% beyond those reductions resulting from the 65-nm process migration. This “ping-pong” accessing of the 9-Kb blocks is inherent to the new block RAM architecture, meaning that no user or software control is required to take advantage of this capability. It occurs dynamically and automatically, pro-

additional capabilities. In many cases, you can achieve dynamic power reductions as high as 75% when utilizing the full capability of the new DSP slice. If you are not designing a DSP application, keep in mind that you can use the DSP slices for many standard logic functions (counter, adder, barrel shifter) at a substantial power savings compared to implementing the same function in standard FPGA logic.

As a final example of redesigned dedicated blocks, the LXT platform of the Virtex-5

architectural innovations aimed at offering the lowest possible power consumption, while still enabling performance increases of 30% or more.

As Figure 3 illustrates, with static power levels comparable to Virtex-4 devices, the Virtex-5 family provides a clear advantage relative to competing FPGAs. As the only available 65-nm FPGA, Virtex-5 devices also offer a minimum of 35-40% core dynamic power reduction over other high-performance FPGAs on the market. Architectural innovations such as the new 6-LUT and diagonally symmetric routing are likely to enable actual core dynamic power savings up to 50% or more. And taking advantage of the unprecedented level of dedicated blocks lowers power con-

[Figure: power (watts) comparison; legend: 90-nm static power, 90-nm core dynamic power, 65-nm static power, 65-nm core dynamic power]
Motivation for Better Predictive Models
In a specific system implementation, the actual component Tj may be different from the arithmetic predictions using the published Θja. The prediction depends on the environment and the prevailing conditions in the system. The following equation governs the relationship:

Θja = (Tj – Ta) / P

Or, stated in Tj prediction form:

Tj = Ta + P * Θja

where
Θja = thermal resistance between the device junction and ambient
Tj = junction temperature of the device
Ta = ambient temperature
P = package power dissipation

Although you can easily determine Tj, Ta, and P, representing the thermal resistance in an application is not easy, particularly for packages with multiple thermal paths. The single parameter Θja is strongly influenced by the application environment and therefore does not represent a suitable thermal resistance.

Theta-ja – The Misunderstood Model
Theta-ja has become the base thermal parameter most engineers gravitate toward when estimating component Tj with known Ta. But for a more demanding, higher wattage component on a large multilayer system board – particularly with other components around it – this approach often leads to an erroneous prediction of Tj.

In a design with loose margins in the thermal budget, the simple prediction using published Θja data may not be an issue. Indeed, it will likely lead to a system running at a lower than predicted Tj, because most common board types are more efficient than the largest standardized thermal board. Increasingly, with higher wattage components where margins are tight, “conservative” data may be the difference between selection and rejection of the component in a specific program.

The key point here is that Θja was not meant to be used in these types of predictions. JEDEC, the semiconductor engineering standardization body of the Electronic Industries Alliance, explains in EIA/JESD51-2 that “the intent of Theta-ja measurements is solely for a thermal performance comparison of one package to another in a standardized environment. This methodology is not meant to and will not predict the performance of a package in an application-specific environment.”

A typical implementation of a one-cubic-foot Θja still-air standardized environment is depicted in Figure 1. This is clearly not a typical system environment. Ideally, you should use these numbers to compare package efficiency, reserving any serious Tj prediction for other tools using models that are more relevant.

Figure 1 – The Analysis Tech implementation of Theta-ja standardized environment

To illustrate the pitfalls and potential discrepancies in Tj predictions, let’s look at a Virtex-5 flip-chip component – XC5VLX50T-FF1136. The published Θja is 10.8° C per Watt. Although the Tj prediction expression will suggest a 43.2° C rise above ambient for 4W dissipation, actual detailed simulation shows a much lower number – and thus suggests a lower effective Θja – of close to 5° C per Watt.

Table 1 shows the corresponding Tj for the same component dissipating 4W on various FR4 board sizes and layer counts. This illustrates the power of the environment or boundary conditions on the effective Θja, and the type of Tj prediction discrepancy that can result.

Xilinx 35 x 35 mm FF1136-5VLX50T*

Layer Count of                 Board Size
Mounted Board**     4" x 4" Board    10" x 10" Board    20" x 20" Board
 4                  68.2° C          64.3° C            –
 8                  63.0° C          50.9° C            48.3° C
12                  60.4° C          47.0° C            45.7° C
16                  59.1° C          46.6° C            44.9° C
24                  –                45.3° C            44.0° C

* Single component considered at 25° C ambient
** All layers have 1 oz Cu with 80% coverage except outer layers that have 2 oz with 20% coverage.

Table 1 – Tj matrix for FF1136-XC5VLX50T on various boards

Note that while in general the effective Θja tends to be lower on larger board environments, it can also trend higher and under-predict Tj on small cards in confined places like PDAs or cell phones. The same rationale is at play – Θja is not boundary condition-independent. A component with Θja = 22° C per Watt on a JEDEC board can easily exhibit an effective Θja of 30° C per Watt on a 30 mm x 30 mm card.

Some application engineers have suggested that because most high-performance devices use denser and larger PC boards, component suppliers should provide Θja using a larger “JEDEC/network board” – a board that may be closer to network application boards. This seems like a good argument and should be advocated at the next JEDEC forum. However, regardless of the board used for data gathering, the prediction will be wrong for some applications. Additional JEDEC boards and standardized enclosures will only lead to more flavors of Θja, further confusing the issue. There ought to be a better way.

What Should an Engineer Do?
Engineers should view Θja with caution when predicting Tj in specific environments. Xilinx will continue to publish Θja and other thermal resistance data because those are the prevailing standards. They have their uses and should be deployed with their limitations in mind.

To address these limitations and to make more accurate Tj predictions in a system environment, a more refined model of the package is needed. Recognizing this need, Xilinx now supports compact thermal model data for high-performance FPGA devices.

What is a Compact Thermal Model?
A compact thermal model is a behavioral model that seeks to accurately predict the temperature of the package at selected nodes: junction, case, top, bottom, and balls, for example. It cannot predict the temperature at any other part of the package that is not predefined. It can be viewed as a reduced node abstraction of the response of a component to various boundary conditions. It is also more computationally efficient than the corresponding detailed model. These models are supplied for use in compatible computational fluid dynamics (CFD) tools for thermal simulations in place of detailed models.

Xilinx offers two model types for FPGA products:

1. Two-resistor (2-R) compact models comprising the familiar Theta-jc and Theta-jb for the package. There is no geometrical information. Although 2-R models are useful and give better predictions than traditional Θja estimations, they are not as accurate as Delphi models.

2. A Delphi compact model comprising several thermal resistors that connect a junction node (representing the die) to several surface nodes. Thermal links are also allowed between the surface nodes. Figure 2 shows the topology for a flip-chip BGA Delphi compact model. The matrix of resistors has been optimized through a Delphi optimization algorithm so that they can be used in various environments without compromising prediction accuracy.

[Figure 2 panel titles: DELPHI BCI-CTM Topology for FCBGA; Two Resistor Model. Node labels: Junction, TI, TO, SIDE, BI, BO, RJC, RJB]

Figure 2 – CTM topologies

Table 2 depicts a typical Delphi half-matrix model for flip chip. The resistance data is usually saved along with the node definitions and package extents to complete the model.

JEDEC has proposed a neutral file format in XML for CTM distribution. Xilinx plans to support the format when CFD tools adopt and support it. In the interim, Xilinx is offering the CTM files in two CFD tool formats, Flotherm and Icepak, selected from a pre-introductory survey of Xilinx customers. These tools cover the majority of those end-users who answered the survey. If you do not use one of these tools, you can request ASCII data for manual or script-based entry into your tool.

Application Examples
Figure 3 shows a typical flow for a CTM application. Normally, the component CTM data is stored in a library; as the user, you will bring in the CTM data as a library item. You then specify the board attributes and boundary conditions of your assembly, adding other items like component power and heat contributions from other components for the Tj prediction.

[Figure 3 labels: Schematic Overview, CTM Implementation Concept; CTM Component Library (2R – Ok); Board Definition (Extent, Layer Details, Cut-Outs, etc.); Environments (T-Ambient, AirFlow, Heatsinks, Heat Pipes, Space); Input Power – Pd for Components; More Than One Component; CTM Tool (Thermal Solver); Tj Prediction (Other Predefined Component Temps)]

Figure 3 – CTM application schematics
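
The Θja arithmetic discussed earlier in this article can be sketched in a few lines of Python. The first two functions apply the article's equations to its own numbers (10.8° C per Watt, 4 W, 25° C ambient, and a Table 1 entry). The third is a simplified two-resistor calculation that assumes the case and board temperatures are already known and fixed; a CFD tool would solve for those temperatures instead. All function names, and the Theta-jc and Theta-jb values in the usage lines, are hypothetical.

```python
def predict_tj(t_ambient_c: float, power_w: float, theta_ja: float) -> float:
    """Tj = Ta + P * theta_ja, the simple prediction form from the article."""
    return t_ambient_c + power_w * theta_ja


def effective_theta_ja(t_junction_c: float, t_ambient_c: float,
                       power_w: float) -> float:
    """theta_ja = (Tj - Ta) / P, back-computed from a simulated or measured Tj."""
    if power_w <= 0:
        raise ValueError("power must be positive")
    return (t_junction_c - t_ambient_c) / power_w


def two_resistor_tj(power_w: float, t_case_c: float, t_board_c: float,
                    theta_jc: float, theta_jb: float) -> float:
    """Junction temperature of a 2-R model with fixed case and board temperatures.

    Heat balance at the junction node:
        P = (Tj - Tc) / theta_jc + (Tj - Tb) / theta_jb
    """
    g_jc = 1.0 / theta_jc
    g_jb = 1.0 / theta_jb
    return (power_w + g_jc * t_case_c + g_jb * t_board_c) / (g_jc + g_jb)


if __name__ == "__main__":
    # Published theta_ja from the article: predicts a 43.2 C rise at 4 W.
    print(predict_tj(25.0, 4.0, 10.8))
    # Table 1, 8-layer 20" x 20" board: simulated Tj of 48.3 C at 25 C ambient.
    print(effective_theta_ja(48.3, 25.0, 4.0))
    # Hypothetical 2-R values, for illustration only.
    print(two_resistor_tj(4.0, 40.0, 50.0, theta_jc=0.2, theta_jb=3.0))
```

Back-computing the effective Θja from each Table 1 entry is a quick way to see how far a given board environment departs from the published value.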
A Multi-Gigabit Transceiver
for the Masses
The Virtex-5 GTP transceiver brings versatility,
ease of use, power efficiency, and cost-effectiveness
to high-volume mainstream applications.
by Gang Sun
Senior Product Marketing Manager, High-Speed Serial I/O
Xilinx, Inc.
gang.sun@xilinx.com

The incessant demand for ever-increasing bandwidth has led designers away from parallel buses and low-speed transceivers toward serial transceiver-based interfaces. High-speed signals solve many design challenges; they offer new levels of bandwidth and lower overall system cost and power consumption.

These successes have led engineers to believe that the industry can continue to lower overall cost and power simply by increasing transceiver speed indefinitely. However, going beyond 3 Gbps can in some cases lead to fundamentally different engineering challenges that make it harder to lower overall system cost and power consumption. The explanation is simple; maintaining signal integrity becomes increasingly difficult at ultra-high speeds, and the extra overhead can sometimes outweigh the benefits associated with increased data rates.

Transceivers in Transition
Figure 1 shows the frequency loss and crosstalk associated with a legacy backplane channel. At 1.6 GHz, the loss is reasonably manageable, making transceiver implementation at or below 3.2 Gbps relatively cost-effective and power-efficient.

However, at 3 GHz, the loss becomes significant. Consequently, the implementation of a 6 Gbps backplane transceiver requires different feature sets. You will likely need advanced techniques such as decision feedback equalization (DFE) to maintain signal integrity, and these advanced capabilities require a different set of optimized features.

This explains why a 3 Gbps transceiver typically consumes less than 100 mW per channel, whereas a DFE-enabled 6 Gbps transceiver consumes at least twice as much power. For applications requiring these advanced features, this extra power consumption is a worthwhile trade-off. But it becomes advantageous to offer both a low-power 3.2 Gbps transceiver and a high-performance transceiver for cutting-edge applications – in essence offering the best tool for the job.

At 5 GHz, the signal-to-noise ratio (SNR) becomes negative. In that case, you would have to redesign the entire backplane with more expensive materials and more sophisticated manufacturing technologies to enable 10 Gbps transmission. Consequently, achieving a 10 Gbps serial transmission over a backplane incurs a higher cost in terms of die area and power consumption.

The preceding example clearly shows that transceivers running at or below 3.2 Gbps are at a sweet spot; they are more cost-effective and power-efficient than both parallel interfaces and ultra-high-speed transceivers (running at 6 Gbps and 10 Gbps) for a large majority of interconnect applications. This phenomenon has led to two diverging trends in the transceiver market:

1. Bandwidth-hungry applications (such as a backplane interconnect for terabit routers) have needs for 6 Gbps and 10 Gbps transceivers. These applications continue to push the performance envelope while trading off cost and power.

2. High-volume applications are well served by transceivers running at or below 3.2 Gbps.

tion, validation and characterization of the GTP transceiver occurs in application-specific settings to ensure standards compliance. The combination of these design and characterization approaches ensures the universal appeal of the GTP transceiver.

The GTP transceiver is easy to use because it enjoys the support of the best FPGA CAD tools. The Xilinx® Virtex-5 RocketIO GTP transceiver wizard offers an intuitive GUI that allows you to select the GTP, clocking option, FPGA fabric interface, protocol stack, and encoding/decoding mechanism. After you have completed your selections, the tool generates a GTP wrapper with the necessary features.
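
The article pairs each line rate with the channel frequency at which loss is evaluated: 3.2 Gbps with 1.6 GHz, 6 Gbps with 3 GHz, and 10 Gbps with 5 GHz. The small Python sketch below reproduces that pairing under the assumption that the signaling is two-level NRZ, where the highest fundamental frequency is half the bit rate. The article does not state this relationship explicitly, and the function name is hypothetical.

```python
def nrz_fundamental_ghz(line_rate_gbps: float) -> float:
    """Highest fundamental frequency of an NRZ bit stream: half the bit rate."""
    if line_rate_gbps <= 0:
        raise ValueError("line rate must be positive")
    return line_rate_gbps / 2.0


if __name__ == "__main__":
    # Line rates discussed in the article.
    for rate in (3.2, 6.0, 10.0):
        print(f"{rate} Gbps -> {nrz_fundamental_ghz(rate)} GHz")
```

Reading the backplane loss curve at this frequency gives a first estimate of how much equalization a given line rate will need.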
by Doug Kern
Staff System Design Engineer
Xilinx, Inc.
doug.kern@xilinx.com

Currently dominating the desktop PC motherboard and graphics markets, the PCI Express (PCIe) interconnect is poised to supplant PCI and PCI-X as the dominant high-bandwidth interconnect for the server, enterprise, mobile, workstation, networking, communications, industrial control, and medical equipment markets.

With more than 58 form factors, including Express Card, Advanced TCA, Compact PCI Express, Com Express, and a cable spec, the PCIe protocol is becoming ubiquitous. The PCI Special Interest Group (PCI-SIG) maintains the PCIe specification (along with the PCI and PCI-X specifications) and holds compliance workshops.

The PCIe subsystem is a point-to-point interface that replaces and overcomes the limitations of bus-based PCI and PCI-X standards. PCIe Generation 1 (Gen1) offers 2.5 Gbps speed with low-voltage differential signaling (LVDS), embedded 8B/10B encoding, dual-simplex signaling, and a message-based serial protocol.

With plans in place to increase bandwidth to 5 Gbps in Generation 2 and 10 Gbps in Generation 3, the PCIe bus is expected to be the dominant high-bandwidth interconnect for several years to come. (For more information on the PCIe specification or compliance information, visit www.pcisig.com.)

With scalable lane widths from x1 to x32 lanes and advanced features such as traffic classes, virtual channels, hot-plug, and power management, the Xilinx PCIe block provides support for a wide range of applications, from a simple upgrade from PCI to an x1 PCIe endpoint device to advanced high-bandwidth x8 PCIe communications endpoint devices.

Figure 1 shows the topology of a PCIe system. The CPU is connected to a root device and is responsible for configuring and enumerating all plug-and-play PCI Express endpoint devices in a system. Because the PCIe system is point-to-point, switch devices are necessary to grow the number of devices or endpoints in a system. A switch has one upward facing port and numerous downward facing ports. These downward facing ports connect to the working devices or endpoints of a system.

[Figure 1 labels: CPU; Root Complex; Memory; PCI Express Graphics: 16x]

Although only one root exists in a system, there are one or more endpoint devices. For example, a standard PC motherboard provides three to seven expansion PCIe slots. With the integrated PCI Express Endpoint block, Xilinx® Virtex™-5 LXT FPGAs allow you to rapidly develop and deploy high value-added PCIe endpoint devices. The numerous value-added endpoint designs are the target applications for the FPGA-based configurable Virtex-5 LXT PCI Express Endpoint block.

The Virtex-5 LXT PCIe Endpoint Block
The Virtex-5 LXT PCIe Endpoint block (see Figure 2) implements the physical layer (PHY), data link layer (DLL), transaction layer (TL), and configuration layers of a PCIe endpoint device. The implementation of a small reset circuit and clock generation blocks requires you to use the FPGA fabric.

The PCI Express Endpoint block capabilities include:

• Compliance with the PCI Express base specification, revision 1.1
• Choice of PCI Express Endpoint block or legacy PCI Express Endpoint block implementation
• x8, x4, x2, or x1 lane width
• Easy-to-use user interface similar to the familiar Xilinx LocalLink interface
• Integration of RocketIO™ GTP transceivers
• Non-memory transaction layer packet (TLP) ID checking/filtering
• Implements one PCI Express function
• Signals to the programmable fabric for statistics and monitoring
• Full documentation and reference example design

Virtex-5 GTP transceivers interface to the serial differential electrical signals of the PCIe specification. The PCIe block completes the physical logic that provides lane deskew. The DLL is responsible for data integrity and implements a user-configurable-sized retry buffer to retransmit packets that are received incorrectly without re-requests from the applications software. The TL provides Tx and Rx buffers and orders the packets to be transmitted. With capability for eight traffic classes and two virtual channels, great flexibility for packet arbitration is available.

High-Level Integration
The Virtex-5 PCI Express Endpoint block allows you to implement a single endpoint device with one FPGA while leaving almost all of the FPGA programmable fabric avail-

[Figure 2 labels: PCIe Block; PL Lane; Configuration and Capabilities Module; Management Interface; Hot Plug and Power Management Interface; Configuration and Status Interface; Clock and Reset Interface; Clock and Reset Block; Miscellaneous Logic (Optional)]

Figure 2 – Xilinx Virtex-5 LXT PCI Express Endpoint block
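
As a back-of-the-envelope illustration of the Gen1 figures in this article, the Python sketch below estimates per-direction bandwidth for the lane widths the Endpoint block supports. It assumes that 8B/10B encoding carries 8 data bits in every 10 line bits and ignores packet and protocol overhead, so real throughput is lower. The function and constant names are hypothetical.

```python
GEN1_LINE_RATE_GBPS = 2.5      # per lane, per direction (from the article)
ENCODING_EFFICIENCY = 8 / 10   # 8B/10B: 8 data bits per 10 line bits
SUPPORTED_LANE_WIDTHS = (1, 2, 4, 8)


def pcie_data_rate_gbps(lanes: int,
                        line_rate_gbps: float = GEN1_LINE_RATE_GBPS) -> float:
    """Data rate per direction after 8B/10B encoding, before protocol overhead."""
    if lanes not in SUPPORTED_LANE_WIDTHS:
        raise ValueError(f"lane width must be one of {SUPPORTED_LANE_WIDTHS}")
    return lanes * line_rate_gbps * ENCODING_EFFICIENCY


if __name__ == "__main__":
    for lanes in SUPPORTED_LANE_WIDTHS:
        gbps = pcie_data_rate_gbps(lanes)
        print(f"x{lanes}: {gbps:.1f} Gbps ({gbps / 8 * 1000:.0f} MB/s) per direction")
```

Because the link is dual-simplex, the same figure applies independently to the transmit and receive directions.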
Here are 6 of the new, faster, bigger, Virtex-5 FPGAs on a 12 Million ASIC Gate Board
that offers unmatched performance to ASIC Prototypers, IP Designers, and FPGA
Developers. The V5 65nm process, with 6 input LUT and advanced interconnect,
enables 30% faster clock speeds in your application. The Dini DN9000k10PCI captures
this performance on an easy to use board with these handy features:
– QoS
– Hot-pluggable
– Adjacent to GTP transceivers
• Ease of design

Endpoint refers to a type of device that can be the requester or completer of a PCI Express transaction, either on its own behalf

Figure 2 – PCIe Endpoint block in the Virtex-5 LXT FPGA

[Figure: power (watts) and user logic area (LUTs) for Legacy Endpoint and PCI Express Endpoint implementations]

[Figure labels: CPU; Root Complex; PCIe Switch; System Card; Dual-Channel Memory (DDR2, QDR); x4 PCIe Backplane Links; Legacy Card; PCI EP; Memory; DSP]
by Nick McKay
Senior Design Engineer
Xilinx, Inc.
nicholas.mckay@xilinx.com

Soma Potluri
Senior Design Manager
Xilinx, Inc.
soma.potluri@xilinx.com

Stuart Nisbet
Senior Design Manager
Xilinx, Inc.
stuart.nisbet@xilinx.com

Ethernet is the dominant wired connectivity standard. The Xilinx® Virtex™-5 Ethernet media access controller (Ethernet MAC) block provides dedicated Ethernet functionality, which together with Virtex-5 RocketIO™ GTP transceivers and SelectIO™ technology enables you to connect to a wide variety of network devices. The Ethernet MAC block is integrated into the FPGA as a hard block in Virtex-5 devices.

The Ethernet MAC is available in the Xilinx design environment as a library primitive, named TEMAC. The primitive contains a pair of 10/100/1000 Mbps Ethernet MACs. Each Virtex-5 LXT device contains four Ethernet MAC blocks; thus, a Virtex-5 LXT design can incorporate two TEMAC primitives. Using standard Xilinx products, you can create a range of customized packet processing and network end-point products. Xilinx has also provided an overclocking mode to enable backplane connectivity at speeds as fast as 2,000 Mbps.

Xilinx developed the Virtex-5 Ethernet MAC from the Virtex-4 FX Ethernet MAC, making improvements in the areas of global clock usage, serial interface flexibility, and software control complexity.

In this article, we’ll review the feature set of Ethernet MAC blocks in Virtex-5 devices. We’ll also describe the differences between Virtex-5 and Virtex-4 FX Ethernet MACs, illustrate some potential applications, and describe how to use standard Xilinx tools to integrate an Ethernet MAC into your design.

Supported Interfaces
The Virtex-5 Ethernet MAC is fully compliant with the IEEE 802.3 specification. Figure 1 shows a block diagram of the Ethernet MAC.

Physical Interfaces
You can independently configure the physical interface of each Ethernet MAC to operate as one of five different Ethernet interfaces.

The Media Independent Interface (MII), Gigabit Media Independent Interface (GMII), and Reduced GMII (RGMII) are parallel interfaces. These are typically connected to an external physical layer (PHY) chip to provide BASE-T functionality at 10/100/1000 Mbps. Half-duplex operation is supported at 10/100 Mbps; full-duplex operation is supported at all speeds.

Serial GMII (SGMII) and 1000 BASE-X are serial interfaces that use the physical coding sublayer (PCS) and physical medium attachment (PMA) sections of the Ethernet MAC. These interface to the Virtex-5 RocketIO GTP serial transceivers. SGMII, as with the parallel interfaces, provides 10/100/1000 Mbps full-duplex BASE-T functionality. The serial interface significantly reduces the number of pins required to connect to the external PHY chip.

When the Ethernet MAC is configured in 1000 BASE-X mode, the PCS/PMA block, along with the RocketIO transceiver, provides all of the functionality required to connect directly to a gigabit interface converter (GBIC) or small form-factor pluggable (SFP) optical transceiver. This removes the need for an external PHY chip for 1000 BASE-X network applications.

Control Interfaces
The host interface provides access to the configuration registers of the Ethernet MAC block. Examples of configuration options include jumbo frame enable, pause and unicast address settings, and frame check sequence generation.

The host interface is accessible through either a generic host bus or a device control register (DCR) bus (when connecting to a processor). In addition, each Ethernet MAC has an optional management data
Virtex-5 Ethernet MAC Wrappers
Figure 3 shows a block diagram of the HDL wrappers available from the Xilinx CORE Generator tool.

The Ethernet MAC is a complex component with 162 ports and 79 parameters. Wrapper files enable you to easily set the parameters and interface only to those ports required for your application. They also offer benefits in simplifying the use of clocking and physical I/O resources.

The different levels of hierarchy enable you to extract the correct wrapper for your application.

• Ethernet MAC Wrapper. In the lowest level, a single or dual Ethernet MAC is instantiated and its attributes are set to your preferred selection in the CORE Generator GUI. All of the unused input ports are tied to ground and the output ports are left open.

• Block Level Wrapper. In the next level of hierarchy, the physical interfaces and the required clock resources are instantiated. This includes the RocketIO GTP transceivers for the serial interfaces. Clocking is also optimized for your configuration, and you can clock the output to your design.

• LocalLink Level Wrapper. In this level, FIFOs are added to the client transmitter and receiver interfaces. The FIFOs handle the dropping of bad frames on reception and retransmission of frames in half-duplex mode. LocalLink is used as the back-end interface.

level features a demonstration design where the received data is looped back and sent to the transmitter. You can download this design to a board and stimulate the receiver from a network device to demonstrate the operation of the Ethernet MAC in hardware. Testbenches that stimulate receiver input and monitor the transmitter output of the design are also included in the CORE Generator software.

[Figure 2 labels: Virtex-5 Ethernet MAC; Host Interface; Client Receive; Client Transmit; Rx; Tx; External PHY; FPGA Fabric; Master Attachment; Slave Attachment; DMA; Read Packet FIFO; Write Packet FIFO; Register, SRAM, and Interrupt Interfaces]

Figure 2 – MAC connected to a processor on the Virtex-5 FPGA
Example Design
LogiCORE IP and Reference Designs
LocalLink Level Wrapper Most of the existing Virtex-4 Ethernet MAC
Block Level Wrapper
documentation is reusable with the Virtex-5
Dedicated
Ethernet MAC. For example, a version of the
Ethernet MAC
Wrapper “Ethernet Cores Hardware Demonstration
10 M/100 M/1 G
Ethernet FIFO
Client Dedicated Physical
Interface
Platform” (XAPP443, www.xilinx.com/
Interface Ethernet MAC
bvdocs/appnotes/xapp443.pdf ) will be avail-
LocalLink Interface
Tx Client
FIFO Physical I/F
Address Rx Client
or
RocketIO
LogiCORE IP, such as Ethernet statistics,
Swap
Module
FIFO Transceiver)
already supports the new architecture.
FPGA
Fabric
Conclusion
Host
Interface
Clock The Virtex-5 Ethernet MAC provides
Circuitry
10 M/100 M/1 G
a cost-effective solution for a wide range
Ethernet FIFO
of network interfaces, enabling you to
Local Link Interface
Tx Client
FIFO Physical I/F
connect to BASE-X and BASE-T net-
EMAC1 (GMII/MII,
RGMII,
works at 10/100/1000 Mbps. Xilinx soft-
Address Rx Client
or
RocketIO ware tools and IP also allow you to take
Swap FIFO Transceiver)
Module advantage of the improved feature set of
the Ethernet MAC.
For more information, visit the
Virtex-5 links on the Xilinx website,
Figure 3 – Block diagram of the Virtex-5 Ethernet MAC wrappers www.xilinx.com/virtex5/.
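The LocalLink level wrapper described earlier uses FIFOs that drop bad frames on reception. As a rough behavioral illustration of that idea (not the Xilinx implementation), the Python sketch below buffers an incoming frame and releases it to the reader only if the frame is marked good at its end. The class name, method names, and the `good` flag are all hypothetical; in hardware the same effect is typically achieved by rewinding a write pointer rather than by copying data.

```python
from collections import deque


class RxFrameFifo:
    """Toy receive FIFO that releases a frame only if it ended as 'good'."""

    def __init__(self):
        self._pending = bytearray()  # bytes of the frame still being received
        self._frames = deque()       # complete good frames, oldest first

    def write(self, data: bytes) -> None:
        """Append received bytes to the frame currently in flight."""
        self._pending.extend(data)

    def end_of_frame(self, good: bool) -> None:
        """Commit the frame if it was good; otherwise discard it."""
        if good:
            self._frames.append(bytes(self._pending))
        # Start a fresh buffer either way (hardware would rewind a pointer).
        self._pending = bytearray()

    def read_frame(self):
        """Return the oldest committed frame, or None if the FIFO is empty."""
        return self._frames.popleft() if self._frames else None
```

A client reading from such a FIFO never sees a partial or corrupted frame, which is why the back-end interface can stay simple.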
Asynchronous Sample-Rate
Conversion Between
AES Audio Streams
Xilinx Virtex-5 FPGAs provide the perfect platform for
implementing AES digital audio sample-rate conversion.
by Gregg C. Hawkes
Principal Engineer, Advanced Products Division
Xilinx, Inc.
gregg.hawkes@xilinx.com
Reed Tidwell
Senior Staff Applications Engineer,
Advanced Products Division
Xilinx, Inc.
reed.tidwell@xilinx.com
John F. Snow
Senior Staff Applications Engineer,
Advanced Products Division
Xilinx, Inc.
john.snow@xilinx.com
• The input-to-output latency changes because of accumulating delay

Figure 1 – ML571 board and frame synchronization demonstration board with an ASRC to match the output digital audio rate to the output digital video rate.
Implementing Integrated
Video Connectivity Solutions
with Virtex-5 LXT Devices
Xilinx Virtex-5 FPGAs provide the perfect platform for
integrating broadcast video solutions inside a single chip.
by Gregg C. Hawkes
Principal Engineer, Advanced Products Division
Xilinx, Inc.
gregg.hawkes@xilinx.com
Reed Tidwell
Senior Staff Applications Engineer,
Advanced Products Division
Xilinx, Inc.
reed.tidwell@xilinx.com
John F. Snow
Senior Staff Applications Engineer,
Advanced Products Division
Xilinx, Inc.
john.snow@xilinx.com

At Xilinx, we understand the challenges that broadcast system designers are facing. The number of emerging new standards for video connectivity creates difficult design challenges and schedules for broadcast products.
With the ever-changing video connectivity landscape prevalent throughout the broadcast chain, our goal is to offer help in the form of free reference designs, forming drop-in building blocks that can solve many system-level video connectivity issues. By providing you with cost-effective and highly integrated solutions compared to ASSP chips, Xilinx hopes to get you to market faster, lower costs, and differentiate your product from the competition.
Our video connectivity IP and reference design book, “Audio/Video Connectivity Solutions for the Broadcast Industry” (www.xilinx.com/bvdocs/appnotes/xapp514.pdf), includes chapters about SDI, HD-SDI, DVB-ASI, AES embedded audio, and audio-asynchronous sample rate conversion. Each chapter describes a specific video connectivity topic and links to free reference designs in Verilog and VHDL, providing implementation examples.
Integrating the encoders and decoders for these standards into the FPGA is simple with the clear, concise reference material found within the chapters of XAPP514. The reference design code, offered in both Verilog and VHDL, is clearly documented and illustrated, as shown in Figure 1.
We also offer a suite of validation platforms that can quickly and easily test your video processing algorithms or verify connectivity performance. For example, you can use our new Xilinx® Virtex™-5 ML571 Serial Digital Video (SDV) board (www.cook-tech.com) to demonstrate or develop video connectivity with Virtex-5 FPGAs. Figure 2 shows a block diagram; Figure 3 is a photograph of the ML571 board. Many of the free reference designs linked to XAPP514’s chapters were verified on the ML571 platform using broadcast industry-standard test equipment.
“The ML571 board is yet another example of how Xilinx provides customers with detailed design assistance for real broadcast industry issues,” said Andy DeBaets, senior director, systems and application engineering at Xilinx. “This board demonstrates how engineers can easily implement advanced video networking protocols while greatly increasing system integration, reducing system costs, lowering power, and shortening design schedules.”
Talk to your Xilinx sales channel about seeing the demonstrations or obtaining one of these boards so that you can test your new algorithms long before your proprietary board is produced. We hope you find this article and the audio/video connectivity book valuable, but it represents just a small sample of the information available about designing with Xilinx programmable logic devices. To access the latest information on these subjects and more, visit www.xilinx.com/esp/broadcast.

Figure 1 – Example block diagram of free modular Verilog and VHDL reference designs

Virtex-5 Features Support Broadcast Designs
The Virtex-5 feature set supports many aspects of broadcast solutions by providing high performance, flexibility, and scalability with unique, cost-optimized family members built on the following features:
• High-density, high-speed, reprogrammable ExpressFabric™ technology
• 550-MHz, 36-Kb, dual-port block RAM/FIFO
[...] and performance at www.xilinx.com/products/silicon_solutions/fpgas/virtex/virtex5/index.htm.

Figure 2 – Xilinx ML571 SDV video connectivity board block diagram

Overview of the Xilinx ML571
The new serial digital video (SDV) board for demonstrating and testing high-bandwidth video communications channels based on Xilinx Virtex-5 platform FPGAs shows you how to easily implement high-speed serial interfaces to popular industry standards like HD-SDI.

Standards and Functionality Supported
The diffused silicon integration of high-performance and low-power multi-gigabit serial I/O, tri-mode Ethernet MACs, PowerPC™ processor, and PCI Express Endpoint block
of high activity. Potential issues with the power supply or PC board design can be quickly identified during development. The JTAG access also provides an easy way to confirm that adequate cooling is in place for a particular design. The ChipScope™ Pro Analyzer provides an easy way to access the System Monitor; however, access can easily be incorporated into other JTAG test and programming environments.

Figure 2 – You can access System Monitor measurements through the JTAG TAP.

System Integration
In addition to convenient access through the JTAG TAP, full access to the System Monitor control and status registers is also provided through the FPGA fabric. These registers can be configured and read at any time from the fabric. Dual access to the System Monitor registers by the JTAG TAP controller and fabric interface is permitted, and an arbitration scheme is available to manage possible contention.
You can also define the contents of these registers when the System Monitor is instantiated in a design and initialized during FPGA configuration. Thus, the System Monitor can be configured to start up in a user-defined mode of operation post-configuration. The fabric interface is known as the dynamic reconfiguration port (DRP). The DRP is a parallel 16-bit synchronous data port (similar to block RAM).
For more advanced applications where greater control over the System Monitor is required, the DRP allows the System Monitor to be easily mapped into the peripheral address space of a hard or soft microprocessor. Figure 3 illustrates a typical system management application where the MicroBlaze™ processor is running a protocol like the Intelligent Platform Management Interface (IPMI) and communicating with the system host over management channels like Ethernet or even a simple UART/modem.
The System Monitor also provides an important microprocessor peripheral in the form of a general-purpose ADC. This is the first time analog peripherals like those commonly found in microcontrollers have been integrated into an FPGA. Full control over the ADC operation is supported. The ADC offers a number of sampling modes and can support unipolar, bipolar, and full-differential analog input schemes.

Figure 3 – System Monitor (or ADC) as a microprocessor peripheral (the diagram labels the ADC as 10 bits, 200 kSPS)

Conclusion
The Virtex-5 System Monitor delivers a greatly simplified solution for common on-chip and external environmental monitoring needs. Minimal development and design effort are required to access the functionality. By interfacing the System Monitor to the JTAG TAP controller, JTAG functionality has been extended into new application areas, thus enabling new test capabilities.
We would like to hear your comments and feedback regarding any topics touched on in this short article; in particular, how our development team can better support your system monitoring and test requirements.
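Software that reads System Monitor results over the DRP or the JTAG TAP has to turn raw ADC codes into temperatures and voltages. The Python sketch below shows that step for a 10-bit code. The two transfer functions are an assumption recalled from the Virtex-5 System Monitor documentation, not something stated in this article, and should be checked against the user guide before use; the function names are hypothetical.

```python
ADC_BITS = 10
FULL_SCALE = 1 << ADC_BITS  # 1024 codes for a 10-bit ADC


def _check(code: int) -> int:
    """Reject values that cannot be a 10-bit ADC code."""
    if not 0 <= code < FULL_SCALE:
        raise ValueError(f"ADC code out of range: {code}")
    return code


def code_to_celsius(code: int) -> float:
    # Assumed transfer function: T(degC) = code * 503.975 / 1024 - 273.15
    return _check(code) * 503.975 / FULL_SCALE - 273.15


def code_to_supply_volts(code: int) -> float:
    # Assumed transfer function: V = code / 1024 * 3 V (on-chip supply sensors)
    return _check(code) / FULL_SCALE * 3.0
```

Under these assumed formulas, a code near 606 corresponds to roughly room temperature, and a code near 341 corresponds to a VCCINT reading of about 1.0 V.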
SERIAL CONNECTIVITY
by Lee Hansen
Design Methodologies Sr. Marketing Manager – Horizontal Platform Solutions
Xilinx, Inc.
lee.hansen@xilinx.com

Xilinx® Virtex™-5 devices set a new benchmark in FPGA functionality, with as much as 12 times the logic capacity, 112 times more memory, 2 times the bandwidth, and 2.5 times the performance of the leading FPGA devices of just 8 years ago. Additional dedicated hardware functionality like DCM-based clock management tiles, embedded hard processors, high-speed MGTs, and DSP48E slices extend platform functionality to a broad spectrum of end applications. This extreme functionality places a huge demand on the design cycle and in particular the verification cycle, which tends to be the most time-consuming and time-critical phase of the design flow. The Xilinx ChipScope™ Pro software and analyzer deliver advanced real-time debugging functionality to complex Virtex-5-based designs, moving you through the verification cycle faster than ever before.

New Functionality
The functionality of the ChipScope Pro Analyzer version 8.2 has been enhanced with Virtex-5 performance in mind. All ChipScope Pro-optimized software debugging cores work with Virtex-5 devices when using ChipScope Pro 8.2 Service Pack 2 or later versions. The debugging cores deliver new enhanced performance, supporting higher clock speeds as fast as 500 MHz. You can analyze signals with greater speed and agility through advanced features like wider data capture of up to 1,024 bits, deeper data capture of up to 128K storage samples, and higher density slice packing of trigger match unit and capture control logic.
The resource estimator introduced with ChipScope Pro version 8.1 lets you see how much memory and device space the debugging cores will take up on the chip, useful for project planning.
Another breakthrough feature is remote debugging, first introduced in version 7.1 of ChipScope Pro software. Remote debugging lets you run the ChipScope Pro Analyzer and capture system through a server/client Internet connection. Your board can be running remotely in the lab while you debug from an office on the other side of the building or the other side of the world. You can share a single board or system in the lab with other engineers on your team or allow helpdesk personnel to debug a problem remotely at a customer site, helping to lower field debugging and repair costs.

Optimized Real-Time Debugging
The ChipScope Pro system is available as a separately purchased option to Xilinx ISE™ logic design software and allows you to debug Virtex-5 devices and other Xilinx FPGA-based projects in real time. You can quickly find and analyze design problems while the chip is running on the board, interacting with the rest of the system. Then, leveraging FPGA re-programmability, design changes can be quickly implemented and sent back to the device on board in a matter of minutes or hours through the programming cable. Such changes might take days or weeks using ASIC or competing FPGA offerings.
The ChipScope Pro system also links internal FPGA debugging to Agilent Technologies’ bench-top logic analyzers using the included ChipScope Pro ATC2 core. This core synchronizes the ChipScope Pro system to Agilent’s FPGA Dynamic Probe software, an optionally purchased plug-in to your Agilent 1680, 1690, or 16900 logic analyzer.
This unique partnership between Xilinx and Agilent delivers deeper trace memory, faster clock speeds, and more trigger options, all using even fewer pins on the FPGA. The advanced technology contained within the ATC2 core and FPGA Dynamic Probe is not available in other FPGA or ASIC real-time verification solutions.
For more information on the ChipScope Pro Analyzer, visit www.xilinx.com/chipscopepro or contact your local sales office for ordering information.
Now you can see inside your FPGA designs in a way that
will save weeks of development time.
All FPGA applications use various amounts of memory for data, parameters, and instructions. To store from a few bits to multiple megabytes, Xilinx® Virtex™-5 devices offer a hierarchy of three different memory implementations:
• LUT-based distributed RAM has a granularity of 64 bits
[...]
ports all accessing the same data. The newest MicroBlaze™ processor uses this feature to reduce its register file from 384 to 44 LUTs. In this kind of application, the new six-input LUT is six times more efficient than a previous-generation four-input LUT (see Figure 2).

Shift Register
You can use any LUT in a Slice M as a serial shift register with addressable length. The LUT is configurable as either a single-bit shift register (a maximum of 32 bits long) or as a 2-bit-wide shift register (a maximum of 16 bits long). Different from earlier SRL16 structures, the Virtex-5 shift register uses a more traditional and scalable design with two latches per shift register bit, hence the maximum of 32 bits (not 64 bits) per LUT (Figure 3).

Figure 3 – LUT as shift register

Block RAM
For larger RAM structures, Virtex-5 devices have tens or hundreds of block RAMs, each with a capacity of as much as 36 Kb.
You can structure each block RAM through configuration as:
• 72 bits wide, 512 deep
• 36 bits wide, 1K deep
• 18 bits wide, 2K deep
• 9 bits wide, 4K deep
• 4 bits wide, 8K deep
• 2 bits wide, 16K deep
• 1 bit wide, 32K deep
You can also use the two halves of the 36-Kb block RAM separately as two 18-Kb block RAMs, configured as:
• 36 bits wide, 512 deep
• 18 bits wide, 1K deep
• 9 bits wide, 2K deep
• 4 bits wide, 4K deep
• 2 bits wide, 8K deep
• 1 bit wide, 16K deep
Each block RAM always has two independent access ports and each port can be individually configured. This greatly simplifies data-width conversion.

Read During Write
Each port supports a data-in (DI) bus and a data-out (DO) bus. When writing the data on the DI bus into the memory, the DO bus presents either the previous data at the write address or the new data just being written. A third option keeps DO unchanged from its previous state. These three configuration options offer a design flexibility that is often overlooked.
All block RAM operations require a clock, even for reading data. This requirement is not always desirable, but it is absolute. Nothing happens without an enabled clock. Whenever the clock is enabled, data and address must meet the required setup and hold-time specification. Violating this requirement can contaminate the data content.

ECC
A 72-bit-wide block RAM can provide 64-bit-wide data with error detection and correction (ECC) using Hamming code. The controller is built into each block RAM. It detects single and double errors and corrects all single errors.
The ECC controller can also be used to operate with external memory. In this case, one complete block RAM is necessary for writing and another for reading. The built-in ECC circuit is a great simplification for memory designers who care about the ultimate data integrity (Figure 4).

Figure 4 – ECC for external memory

FIFO
FIFOs are usually implemented using dual-ported SRAMs, with one port used for writing and the other for reading. Many Virtex family block RAMs are traditionally used as FIFOs. That is why Xilinx chose to equip all Virtex-5 block RAMs with a built-in dedicated FIFO controller (Figure 5).

Figure 5 – Dual-ported RAM or FIFO

Virtex-5 devices have between 32 and 288 block RAMs, and each can be configured as a 36- or 18-Kb FIFO.
The controller can use the whole block RAM as FIFO with the following configuration options:
• 72 bits wide, 512 deep
• 36 bits wide, 1K deep
• 18 bits wide, 2K deep
• 9 bits wide, 4K deep
• 4 bits wide, 8K deep
But the controller can also use only half of the block RAM and leave the other half to be used as general-purpose block RAM. The FIFO options are then:
• 36 bits wide, 512 deep
• 18 bits wide, 1K deep
• 9 bits wide, 2K deep
• 4 bits wide, 4K deep
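The three read-during-write behaviors described under "Read During Write" are easy to confuse, so a small behavioral model may help. The Python sketch below models one port of a full 36-Kb block RAM using the width and depth pairs listed in the article. The class and method names are hypothetical, and the mode names follow the Xilinx WRITE_MODE attribute values as an assumption; the second port and all timing are left out.

```python
from enum import Enum

# Width -> depth for a full 36-Kb block RAM, as listed in the article.
BRAM36_SHAPES = {72: 512, 36: 1024, 18: 2048, 9: 4096,
                 4: 8192, 2: 16384, 1: 32768}


class WriteMode(Enum):
    READ_FIRST = "read_first"    # DO shows the previous data at the address
    WRITE_FIRST = "write_first"  # DO shows the new data being written
    NO_CHANGE = "no_change"      # DO keeps its previous value


class BlockRamPort:
    """Behavioral model of one synchronous block RAM port (illustrative)."""

    def __init__(self, width, mode=WriteMode.WRITE_FIRST):
        if width not in BRAM36_SHAPES:
            raise ValueError(f"unsupported width: {width}")
        self.width = width
        self.depth = BRAM36_SHAPES[width]
        self.mode = mode
        self.mem = [0] * self.depth
        self.do = 0  # registered data-out bus

    def clock(self, addr, di=0, we=False, en=True):
        """Apply one clock edge and return the resulting DO value."""
        if not en:
            return self.do  # nothing happens without an enabled clock
        if not 0 <= addr < self.depth:
            raise IndexError(addr)
        if we:
            old = self.mem[addr]
            self.mem[addr] = di & ((1 << self.width) - 1)
            if self.mode is WriteMode.READ_FIRST:
                self.do = old
            elif self.mode is WriteMode.WRITE_FIRST:
                self.do = self.mem[addr]
            # NO_CHANGE: DO is left untouched during a write
        else:
            self.do = self.mem[addr]  # even a read needs a clock edge
        return self.do
```

Note that the widths that are multiples of 9 use all 36,864 bits, while the 4-, 2-, and 1-bit shapes use 32,768 bits, which follows directly from the width and depth pairs in the list.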
by Richard Chiu
Staff Applications Engineer
Xilinx, Inc.
rich.chiu@xilinx.com

When not supporting new interface protocols, memory interface designers are constantly supporting faster and faster bus speeds for existing interfaces. Today’s source-synchronous double-data-rate (DDR) memory devices, such as DDR2 SDRAM, QDR II SRAM, and RLDRAM II, present designers with challenges at chip and PCB levels. Higher clock frequencies result in a rapidly shrinking data valid window. Signal integrity issues, clock jitter, memory uncertainties, varying silicon delays, PCB trace skew mismatch, and other factors now have a proportionally larger impact on meeting timing with a smaller data valid window.

Virtex-5 FPGAs Enhance Memory Interface Design
The Xilinx® Virtex™-4 FPGA family introduced a number of on-chip resources, in particular ChipSync™ technology, which adds to each I/O block an adjustable delay element (IDELAY) compensated over process, voltage, and temperature changes as well as enhanced DDR capture support. These features help meet the challenges of designing with source-synchronous memory interfaces. With Virtex-4 memory interface designs, you can employ calibration algorithms to factor out many of the skews and delays in the timing path and operate your design at higher frequencies.
The Virtex-5 architecture adds additional features that allow you to push the limits of operating frequency. Enhancements to the Virtex-5 device integral to memory interface design include:
• The addition of ExpressFabric™ technology. This architectural enhancement enables internal logic to run at higher clock frequencies. The basic slice look-up table (LUT) has increased from a four- to a six-input LUT (6-LUT), reducing the number of required logic levels. The technology also offers additional routing resources to provide more direct routes within a slice and between configurable logic blocks (CLBs).
• Reduction of the maximum bank size from 64 I/O (or 80 I/O in select Virtex-4 part/package combinations) to 40 I/O, and an increase in the number of banks. This leads to a more efficient implementation of the usual myriad of I/O voltage levels on the same FPGA. More I/O clocking resources have also been added to each bank.
• The availability of phase-locked loop (PLL) blocks as clocking resources in addition to digital clock manager (DCM) blocks. PLLs are useful for low-jitter clock generation and input clock jitter filtering.
• Enhanced block RAM/FIFOs that have doubled in size to 36 Kb and support a maximum width of 72 bits. Applications requiring error-correcting code (ECC) detection and correction can now take advantage of ECC encode/decode logic built into each
block RAM, reducing logic usage and allowing much higher performance over implementing the same functionality in general logic.
• Support for digitally controlled impedance (DCI) on-chip split-Thevenin termination for bidirectional I/O only when the driver is 3-stated. Similar to the on-die termination (ODT) feature implemented in many memory device families, this support is provided for certain HSTL and SSTL I/O standards and can be used to save power when the FPGA is writing to memory.
• The incorporation of low-inductance bypass capacitors directly on the package substrate, simplifying PCB layout by reducing the amount of external bypassing required.

Virtex-5 Data Interface Techniques
Meeting read and write timing for a high-speed source-synchronous bus demands that you keep uncertainties to a minimum. Typically, the capture of read data is the most challenging part of the design.
Write timing for Virtex-5 FPGAs is supported in the same way as in the Virtex-4 device. The DCM (or PLL) generates quadrature phase outputs of the base (“system”) clock. The memory strobe is forwarded using an output DDR register clocked by an in-phase copy (CLK0) of the system clock. The write data is clocked by a DCM clock output that is 90 degrees ahead (CLK270) of the system clock. This ensures that the strobe is center-aligned to the data on a write at the outputs of the FPGA.
Both Virtex-4 and Virtex-5 memory interface designs support two kinds of read capture techniques:
• The “direct-clocking” technique delays the read data so that it can be directly registered using the system clock in the input DDR flop of an I/O block. The memory strobe is only used during calibration to determine the optimal time to delay the associated data. Figure 1 shows the direct-clocking read capture path.
• The “strobe-based” technique uses the memory strobe to capture corresponding read data and register it with a delayed version of the strobe distributed through a localized I/O clock buffer (BUFIO). This data is then synchronized to the system clock domain in a second stage of flops. The input serializer/deserializer (ISERDES) feature in the I/O block is used for read capture: the first two levels of flops in the ISERDES transfer the data from the delayed strobe to the system clock domain. Figure 2 shows the read capture path for a Virtex-5 memory interface design.

Figure 1 – Virtex-4 direct-clocking read data capture path
Figure 2 – Virtex-5 strobe-based read data capture path

Most Virtex-4 designs use the direct-clocking method for read data capture. Beginning with the Virtex-4 SERDES DDR2 design and continuing with the new generation of Virtex-5 memory interface designs, the strobe-based method is best to meet the tighter timing requirements at higher clock speeds.
Both techniques involve the use of IDELAY elements that are varied during a calibration routine. This routine is performed during system initialization, delaying both the strobe and data to determine and set the optimal phase between strobe/data and
the system clock to maximize timing mar- consider the data-to-clock variation ed with opening and closing banks.
gins. Calibration removes any uncertainty (for DDR2, this is tAC) because the In an LRU algorithm, banks are left
caused by process-related delays, compen- system clock is used to both drive the open at the end of accesses. If a new
sating for components of the path delay memory clock and capture read data. bank needs to open, the controller
that are static to any one board. These This is a larger uncertainty than the closes the bank least recently used. At
components include PCB trace delays, strobe-to-data variation. any time, as many as four banks can
package delays, and process-related com- be left open.
• The strobe-to-clock variation is
ponents of propagation delays (both in
important for the second stage of
the memory and FPGA), as well as Generating Virtex-5 Memory Designs
capture, when the data is transferred
setup/hold times of capture flops in the You can generate a custom memory con-
from the delayed strobe to the system
FPGA I/O blocks. Calibration accounts troller by using the Memory Interface
clock domain. However, by this time
for variation in delays that are process-, Generator (MIG) tool. The MIG tool is
the data is split into two separate sin-
voltage-, and temperature-dependent at accessed through CORE Generator™
gle-data-rate paths; therefore, aligning
the system initialization stage – you software and outputs HDL source (Verilog
the delayed strobe to the system clock
should also factor additional operating or VHDL) design files, along with accom-
can take place over a much larger
temperature and voltage variations sepa- panying constraint and build scripts.
timing window.
rately into your interface timing budget. The latest version of the MIG tool
During calibration, IDELAY for strobe The strobe-based capture method is (1.6) supports DDR2 SDRAM-registered
and data are incremented to perform edge more pinout-restrictive, as it requires the DIMM and QDR II SRAM component
detection by continuously reading back memory strobes to be placed on clock- interfaces for Virtex-5 devices. The DDR2
from memory and by sampling either a capable I/O pins. This can limit the I/O controller supports operation of bus clock
prewritten training pattern or the memo- utilization over a given bank. Virtex-5 speeds as fast as 333 MHz (667 Mbps). The
ry strobe itself until either the leading devices have smaller banks and more I/O QDR II supports operation of bus clock
edge or both edges of the data valid win- clocking resources per bank (for example, speeds as fast as 300 MHz (600 Mbps).
dow are determined. The IDELAY for the data or strobe is then set to provide the maximum timing margin. In the case of direct clocking, the optimal delay for the strobe is used to delay the associated data. For strobe-based capture, the strobe and data can have different delay values because there are essentially two stages of synchronization: one to first capture the data in the strobe domain and another to transfer this data to the system clock domain.
The direct-clocking capture method is simpler in design complexity, and compared to the strobe-based capture method, it has fewer pin-out restrictions. However, the strobe-based capture method becomes necessary at higher clock frequencies. Its two-stage approach offers better capture timing margins for two reasons:
• The DDR portion of the timing is restricted to the first rank of flops in the ISERDES. Because the strobe is used to register the data, timing is limited largely by the strobe-to-data variation; for example, in the case of DDR2, these are given by the tDQSQ and tQHS parameters of the part. For direct clocking, you must
number of BUFIO local clock buffers per bank has increased from two to four), easing this restriction and allowing more strobes and their accompanying I/O (data, mask) to be placed in each bank.
Other significant differences in Virtex-5 memory controllers include:
• Full-speed operation. Both the Virtex-4 SERDES design and Virtex-5 designs use the ISERDES for memory capture. However, Virtex-5 designs do not use the width expansion feature of the ISERDES, and the controller runs at the same speed as the memory clock. The Virtex-4 ISERDES design runs at half the memory clock speed but twice the bus width. Running at the same clock speed as the memory is made possible by the higher performance of the Virtex-5 fabric. This minimizes read-data latency through the ISERDES – as well as controller latency – and simplifies bank-management logic.
• Bank management. The Virtex-5 DDR2 controller employs a least-recently-used (LRU) bank-management algorithm that keeps banks open to reduce the overhead associat-
Virtex-5 designs generated by the MIG tool also allow the physical layer interface portion of the design to be easily separated from the controller portion. You can then incorporate your own specific controller but retain the memory initialization and high-performance source-synchronous calibration logic.
Conclusion
The Virtex-5 device family builds on the Virtex-4 FPGA, with additional features to ease memory interface design and meet the challenges of supporting ever-increasing bus speeds.
To download the MIG tool and for more information about the implementation and design details of Virtex-5 memory controller reference designs, visit the Xilinx Memory Corner at www.xilinx.com/memory/.
Virtex-5 memory controllers are also available as reference designs for downloading from the Memory Corner:
• XAPP858 (DDR2 SDRAM)
• XAPP853 (QDR II SRAM)
• XAPP852 (RLDRAM II)
• XAPP851 (DDR SDRAM)
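The calibration idea described in this article (find the edges of the valid data window, then set the IDELAY for the maximum timing margin) can be illustrated with a short sketch. This is not the MIG calibration logic: the function name, the pass/fail sweep format, and the 64-tap delay line length are assumptions made only for the example.

```python
def center_tap(sweep):
    """Return the tap index in the middle of the widest passing window.

    `sweep` holds one boolean per delay tap: True means the training
    pattern was captured correctly at that tap (hypothetical input format).
    """
    best_start, best_len = 0, 0
    run_start = None
    # A trailing False acts as a sentinel so a window touching the last tap is closed.
    for tap, ok in enumerate(list(sweep) + [False]):
        if ok and run_start is None:
            run_start = tap
        elif not ok and run_start is not None:
            if tap - run_start > best_len:
                best_start, best_len = run_start, tap - run_start
            run_start = None
    if best_len == 0:
        raise ValueError("no passing tap found in sweep")
    return best_start + (best_len - 1) // 2


# Assumed 64-tap delay line whose capture passes on taps 10 through 29.
example_sweep = [10 <= tap <= 29 for tap in range(64)]
chosen_tap = center_tap(example_sweep)
```

Choosing the middle of the widest window leaves roughly equal margin to both window edges, which is the "maximum timing margin" criterion in its simplest form.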
by Nagesh Gupta
Founder & CEO
Taray Incorporated
nagesh@tarayinc.com
The Time-to-Market Advantage
High-speed memories are complex to design. Conservatively, you can save more than six months by using the targeted reference designs provided by the MIG tool. Fully verified MIG reference designs enable you to focus on other design activities, thus reducing overall time to market.
Hardware verification starts with a point test, such as a read/write data match, at a particular frequency for a given memory part. We then perform frequency sweeps and ensure that the designs work ±10% in the required frequency range. We also verify all the possible parameters such as column address strobe (CAS) latencies, burst lengths, and data widths, as well as all supported synthesis tools.
Simulations
Taray simulates MIG designs using ModelSim from Mentor Graphics. We simulate a large number of combinations and ensure that every memory listed in the MIG tool is verified with at least one of the test cases. Table 1 is a summary of the different simulation test cases for Virtex-4 DDR2 SDRAM designs. Below are some parameters to generate the test cases:
• All possible data widths
• All of the supported memory components/DIMMs
• Different values for CAS latencies, burst lengths, and additive latencies, depending on the memory type
• Simulated Verilog and VHDL RTL files
• RTL with and without testbench
• RTL with and without DCM
• Use memory models with different frequencies
Key Features
The MIG tool is part of Xilinx ISE™ software and is invoked through the CORE Generator™ tool. Figure 1 is a screen shot of the MIG GUI. The key features of the MIG tool v1.6 are:
• Virtex-5 FPGAs:
  • DDR2 SDRAM, Verilog
  • QDR II SRAM, Verilog
• Support for Virtex-4 FPGAs (and the following designs):
  • DDR2 SDRAM, Verilog and VHDL, direct clocking
  • DDR SDRAM, Verilog and VHDL, direct clocking
  • QDR II SRAM, DDR II SRAM, Verilog and VHDL, direct clocking
  • RLDRAM II, Verilog and VHDL, direct clocking
  • DDR2 SDRAM, Verilog and VHDL, SERDES clocking
  • All Virtex-4 designs support both XST and Synplicity
• Spartan-3 FPGAs:
  • DDR SDRAM, Verilog and VHDL
  • DDR2 SDRAM, Verilog and VHDL
• Spartan-3E FPGAs:
  • DDR SDRAM, Verilog and VHDL
• All Spartan-3 and Spartan-3E designs support XST, Synplicity, and Precision Synthesis
• Support for many different memory components and DIMMs
• Pins picked are based on the selected memory part and user inputs
• Generates RTL and bit files for Xilinx reference boards containing memories
• Basic I/O design rule check (DRC) engine ensures that signals are allocated correctly
• Verifying a modified MIG .ucf file ensures that MIG pin-out rules are valid
Using the Outputs of the MIG tool
The MIG tool generates everything required to create a memory interface:
• The RTL (Verilog or VHDL)
design files
• Synthesis scripts
• ISE scripts for build, map, and place
and route
• A .ucf file for pin locations, RLOCs,
and any other constraints
After generating the design RTL, you can
execute a batch file to synthesize, map, and
place the design. The MIG tool generates
two designs – one with a testbench and
another without. The MIG scripts work on
the version with the synthesizable testbench.
However, you can integrate your application into the version without the testbench.
Conclusion
The MIG tool significantly reduces design
burden and improves time to market. It has
been used successfully by many customers.
For a copy of the Memory Interface Generator or for additional information, visit www.xilinx.com/memory.
Figure 1 – The MIG tool 1.6 GUI
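The verification approach described above (every combination of data width, CAS latency, burst length, and HDL language, plus a ±10% frequency sweep) amounts to enumerating a parameter matrix. The following is a minimal sketch of that bookkeeping, not Taray's actual test environment: the function names and all parameter values are placeholders chosen for illustration.

```python
from itertools import product


def sweep_frequencies(nominal_mhz, margin=0.10, steps=5):
    """Evenly spaced test frequencies from -margin to +margin around nominal."""
    if steps < 2:
        raise ValueError("need at least two sweep points")
    low = nominal_mhz * (1 - margin)
    high = nominal_mhz * (1 + margin)
    return [round(low + (high - low) * i / (steps - 1), 3) for i in range(steps)]


def build_test_matrix(data_widths, cas_latencies, burst_lengths, hdl_languages):
    """Enumerate every combination of the given parameter values."""
    return [
        {"data_width": w, "cas_latency": cl, "burst_length": bl, "hdl": hdl}
        for w, cl, bl, hdl in product(
            data_widths, cas_latencies, burst_lengths, hdl_languages
        )
    ]


# Placeholder values for illustration only; real values depend on the memory part.
cases = build_test_matrix([8, 16, 32, 64], [3, 4, 5], [4, 8], ["verilog", "vhdl"])
frequencies = sweep_frequencies(200.0)
```

With these placeholder lists the matrix holds 4 x 3 x 2 x 2 = 48 cases, which shows how quickly the number of combinations grows as parameters are added.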
by Chris Johnson
Networking and Communications
Strategic Applications Engineer
Micron
csjohnson@micron.com
One of the key features added in the RLDRAM II memory architecture is reduced row cycle latency time. Row cycle latency (tRC) is the amount of time that must elapse before a recently accessed bank can be accessed again. Table 1 shows a direct comparison between RLDRAM II memory, DDR2, and DDR at device densities of 576 Mb, 512 Mb, and 512 Mb, respectively.
Latency    RLDRAM II Memory    DDR2    DDR1    Units
tRC        15                  55      55      ns
Table 1 – Row cycle time DRAM comparison
Additional Features for RLDRAM II Memory
RLDRAM II memory also deviates from the refresh requirements of current DRAM technologies. Because ordinary DRAM devices refresh a row in all banks, they require dead clock cycles on the bus after a refresh command. This requires a period of inactivity on the DQ bus, typically 66 ns.
RLDRAM II memory devices have incorporated a bank-based refresh scheme to hide the refresh recovery periods required by other DRAM technologies. The refresh process for RLDRAM memory requires the bank address of the bank that needs to be refreshed, still allowing bus activity during
error-correcting schemes used to eliminate soft errors in the memory channel. RLDRAM II memory is the first DRAM-based technology to add the ECC DQ pins to the devices. RLDRAM II memory is offered in x9, x18, and x36 configurations to provide a single-chip ECC solution without adding unwanted components, reducing board layout space.
Manufacturability in large-component-count systems is a major problem, but it has been ignored in the DRAM industry, largely because the applications that use commodity DRAM devices do not need continuity testing. The module-based business uses so few components that the extra
offers one of the highest density, lowest latency DRAM-based solutions available on the market today.
Figure 2 – SRAM cell compared to a DRAM cell
Additional I/O Interface Options
The RLDRAM II memory I/O interface provides other features and options, including support for both 1.5V and 1.8V I/O levels and a programmable output impedance driver that enables compatibility with both HSTL and SSTL I/O schemes.
RLDRAM II memory requires an external one-percent precision resistor (RQ) tied to VSS in order to calibrate the driver to a known value and eliminate the process variation that can be introduced during manufacturing. The calibration process requires the external resistor to operate at five times the desired driver impedance. The programmable impedance control (PIC) circuit calibrates the output impedance to the desired value, eliminating vari-
mination resistor. ODT provides simplicity and flexibility for high-speed designs by bringing termination resistors on-die, eliminating some of the on-board termination.
At high-frequency operation, however, it is important you analyze the signal driver, receiver, printed circuit board network, and terminations to obtain good signal integrity and the best possible voltage and timing margins. Without proper terminations, the system can suffer from excessive signal attenuation, leading to reduced voltage and timing margins. This, in turn, can lead to marginal designs and cause random soft errors that are difficult to debug. Micron’s RLDRAM II memory
cations continue to grow, demanding new and more innovative solutions to meet market requirements. As RLDRAM II memory helps address current market requirements, the demand continues for increased performance. Micron will address this demand with future low-latency devices such as RLDRAM III memory.
Micron continues to innovate and deliver solutions to meet the needs of today’s and tomorrow’s markets. The joint efforts of Micron and Xilinx help enable our customers to quickly deliver next-generation networking, video, and imaging systems. For more information, please contact Ray Fontayne at rfontayne@micron.com.
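Two of the figures quoted in this article lend themselves to a quick worked calculation: the RQ resistor value (five times the desired driver impedance) and the effect of tRC on how often a single bank can be re-accessed. The sketch below is only arithmetic on the article's numbers; the function names are hypothetical, and the 50-ohm target impedance is an example value that does not come from the article.

```python
# Row cycle times from Table 1, in nanoseconds.
T_RC_NS = {"RLDRAM II": 15, "DDR2": 55, "DDR1": 55}


def rq_ohms(target_impedance_ohms):
    """External RQ resistor value: five times the desired driver impedance."""
    return 5 * target_impedance_ohms


def same_bank_access_rate_mhz(t_rc_ns):
    """Upper bound on back-to-back accesses to one bank, limited by tRC."""
    return 1e3 / t_rc_ns


# Example only: a 50-ohm target impedance is assumed here.
example_rq = rq_ohms(50)
rates = {name: same_bank_access_rate_mhz(t) for name, t in T_RC_NS.items()}
speedup_vs_ddr2 = T_RC_NS["DDR2"] / T_RC_NS["RLDRAM II"]
```

Under these numbers a single RLDRAM II bank can be re-accessed roughly 3.7 times as often as a DDR2 bank, which is the practical meaning of the shorter row cycle time.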
©2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc. All other trademarks are the property of their respective owners.
MEMORY INTERFACES
by David Banas
Sr. Staff Applications Engineer
Xilinx, Inc.
david.banas@xilinx.com
Figures 3 and 4 show typical receiver eye diagrams corresponding to the topologies shown in Figures 1 and 2, respectively. The input switching thresholds of the receiver are shown as horizontal dashed blue lines for reference. The colors of the “probe” arrows in Figures 1 and 2 correspond to the colors of the associated traces in Figures 3 and 4, respectively. I used Mentor Graphics’s HyperLynx software to generate these eye diagrams with the following parameter settings:
• Pseudo-random binary sequence (PRBS) with bit order 7 (a sequence length of 127)
• Bit interval = 1.5 ns (667 Mbps)
• One sequence repetition
• First 50 bits skipped
• Zero added jitter
When looking at the traces in Figure 3, it should be obvious that of the three topologies shown, the recommended use model gives by far the cleanest eye.
The middle schematic in Figure 1 shows a typical mistake made by novice DCI users, which is to assume that using SSTL18_I_DCI drivers eliminates the need for any external termination components. DCI users often incorrectly assume that the “_DCI” versions of the SSTL (stub series terminated logic) driver family adjust their output impedance to match the DCI calibration resistors and can therefore be used as matched impedance drivers of the transmission line.
But this is not true. The SSTL18_I_DCI output driver, for instance, has a fixed output impedance of approximately 20Ω, as per the SSTL18 specification. The disastrous results of this erroneous assumption are clearly visible in the yellow trace shown in Figure 3. Not only has the eye been drastically narrowed, but problematic overshoot/undershoot has also been introduced at the receiver input.
Figure 2 – Typical data circuit topology
Figure 4 – Typical eye patterns for data
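The stimulus settings listed for the eye diagrams (an order-7 PRBS of length 127, a 1.5 ns bit interval, first 50 bits skipped) can be reproduced with a small linear-feedback shift register. The sketch below uses feedback from stages 7 and 6, a common PRBS7 choice often written x^7 + x^6 + 1; the article does not state which polynomial or seed HyperLynx uses, so both are assumptions here.

```python
from itertools import islice


def prbs7(seed=0x7F):
    """Yield bits of a 7-bit maximal-length LFSR sequence (period 127)."""
    state = seed & 0x7F
    if state == 0:
        raise ValueError("seed must be non-zero, or the register stays stuck at zero")
    while True:
        # Feedback is the XOR of stages 7 and 6 (bits 6 and 5 of the register).
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        yield new_bit


BIT_INTERVAL_NS = 1.5
bit_rate_mbps = 1e3 / BIT_INTERVAL_NS  # about 667 Mbps

# One repetition of the sequence after skipping the first 50 bits.
stimulus = list(islice(prbs7(), 50, 50 + 127))
```

A register of n stages has 2^n - 1 non-zero states, which is where the sequence length of 127 for bit order 7 comes from.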
by Dean Armintrout
Chris Ebeling
Solutions and
Virtex-5 FPGAs
Virtex-5 devices provide an ideal platform for source-synchronous designs like the
The Xilinx® Virtex™-5 architecture provides an ideal platform for implementing SPI-4.2. The Xilinx SPI-4.2 LogiCORE™
Gbps (Ethernet), as illustrated in Figure 1.
The SPI-4.2 interface has become the standard for interconnecting leading-edge 10 Gbps framers, traffic managers, network processors, and switch fabrics. SPI-4.2 is popular because of its efficient interface, which offers high bandwidth and low pin count, along with seamless handling of typical system requirements such as flow control, error detection, synchronization, and bus realignment.
SPI-4.2 LogiCORE IP
Continually improving on its SPI-4.2 solu-
tion, Xilinx has made the latest implemen-
tation 25% smaller than previous versions
by leveraging the 65-nm ExpressFabric™
technology and real six-input look-up tables
(LUTs) of Virtex-5 FPGAs.
Enhanced ChipSync™ technology is
supported on every pin of the Virtex-5
device family, allowing you to target the
SPI-4.2 LogiCORE solution to any device
pinout to meet your system and PCB
requirements. High-performance interfaces
are supported by 1.2 Gbps LVDS data rates.
For applications requiring multiple
SPI-4.2 interfaces, the Virtex-5 FPGA’s
logic density, high pin count, and exten-
sive clocking resources support four or
more full-duplex cores in a single device.
• DDR registers integrated into the I/O pins simplify the interface between the FPGA fabric and the I/O blocks by supporting data transfer on a single
dynamic phase alignment (DPA). In Virtex-5 FPGAs, the IDELAY feature present in every I/O is ideally suited to adjust the clock-data phase relationship for
[Figure panel (a): Initial Data Eye Alignment, showing system jitter and a fixed offset of 2 taps]
reducing/eliminating the need for periodic training patterns, continuous DPA enables the maximum data bandwidth in your system while maintaining the optimal clock-data alignment at each pin.
DPA Diagnostics
If your hardware operation encounters alignment issues, the Xilinx SPI-4.2 core includes DPA diagnostic ports to aid with debugging. The DPA diagnostic data monitors the data eye and final sampling point of the initial alignment process, as well as a second sweep of the data valid window to determine if any changes have occurred.
You can connect the diagnostic ports to the ChipScope™ analyzer or other logic probes to analyze alignment conditions while the FPGA is on the board, interacting with the rest of the system.
Clocking Resources
Virtex-5 FPGAs provide an unprecedented number of clock resources for implementing multiple SPI-4.2 interfaces in a single device. The abundance and flexibility of clock distribution in the Virtex-5 family solves this challenge, supporting as many SPI-4.2 interfaces as the device logic and I/O will accommodate.
In the Virtex-5 family, all devices have 32 global clock resources, with any 10 of the 32 total global buffers available in each clock region. The global clock trees and associated buffers are implemented differentially for best duty-cycle fidelity and greater common-mode noise rejection.
In addition, each region in the device has four regional clock nets, which are ideal for source-synchronous interface clocking at rates above 1 Gbps. You can configure the SPI-4.2 LogiCORE IP to use either global or regional clock resources.
These high-performance clock resources support as many as four SPI-4.2 interfaces in a mid-range device (LX85/LX110) and more than four SPI-4.2 interfaces in the larger devices (Figure 3). The Virtex-5 clocking capability enables a whole new class of SPI-4.2 applications and provides an ideal platform for applications such as multiplexing and demultiplexing, bridges, and switches.
Figure 3 – Illustration of four instances of SPI-4.2 LogiCORE IP implemented on a Virtex-5 XC5VLX110 device
Higher Performance at Lower Power
Virtex-5 silicon is manufactured with a 65-nm triple-oxide process that reduces power consumption by as much as 35%. This has a positive impact for all designs, including the SPI-4.2 interface; the power savings are summarized in Table 1.
With Virtex-5 devices, SPI-4.2 uses significantly less power than its predecessors, both because of the enhanced 65-nm process and because the LogiCORE solution uses 25% fewer fabric resources. At the same time, Virtex-5 FPGAs support 20% higher performance for SPI-4.2, with high-speed 1.2 Gbps LVDS data rates on every I/O of the device.
This means that not only can you place multiple SPI-4.2 interfaces anywhere on the device, but for each interface, you can realize an aggregate bandwidth as high as 19 Gbps. Designs not requiring this level of performance (such as more typical framer interfaces running at 10-12 Gbps) automatically get additional performance overhead, ensuring ease of design integration and timing closure.
Parameter                                             Virtex-4 FPGA     Virtex-5 FPGA
Power: Static Alignment @ 700 Mbps per LVDS Pair      1.55W             1.42W
Power: Dynamic Alignment Performance per LVDS Pair    2.0W @ 1 Gbps     1.66W @ 1 Gbps
Speed Grades Supporting 800 Mbps per LVDS Pair        -10, -11, -12     -1, -2, -3
Table 1 – SPI-4.2 power estimates for Virtex-4 and Virtex-5 FPGAs
Conclusion
Xilinx SPI-4.2 LogiCORE IP coupled with Virtex-5 features provides a highly efficient and reliable SPI-4.2 solution. We developed ChipSync technology and continuous DPA specifically for source-synchronous interfaces like SPI-4.2.
This technology allows you to design the most efficient and reliable SPI-4.2 solutions, which use significantly fewer resources (25% fewer), allow fully flexible device pin assignments (you choose the pinout), and support extremely high interface speeds (1.2 Gbps LVDS DDR I/O).
The higher performance is even more remarkable because Virtex-5 FPGAs achieve this while consuming significantly less power. The wealth of Virtex-5 clocking resources, combined with full pin assignment flexibility, enables a new class of applications with multiple SPI-4.2 interfaces.
For more information about the SPI-4.2 LogiCORE IP targeting Virtex-5 devices, visit the Xilinx IP Center at www.xilinx.com/systemio/spi-4.2. A hardware demonstration is also available; for more information, contact your Xilinx representative.
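The bandwidth and power figures quoted in this article can be cross-checked with simple arithmetic. The sketch below assumes a 16-lane SPI-4.2 data bus, which comes from the interface specification rather than from the article, and the function names are hypothetical; the power numbers are taken from Table 1.

```python
SPI42_DATA_LANES = 16  # assumed data bus width, per the SPI-4.2 specification


def aggregate_gbps(per_pair_gbps, lanes=SPI42_DATA_LANES):
    """Raw aggregate bandwidth of one interface direction."""
    return per_pair_gbps * lanes


def percent_saving(old_watts, new_watts):
    """Relative power reduction, in percent."""
    return 100.0 * (old_watts - new_watts) / old_watts


peak_bandwidth = aggregate_gbps(1.2)          # 1.2 Gbps per LVDS pair
static_saving = percent_saving(1.55, 1.42)    # static alignment, Table 1
dynamic_saving = percent_saving(2.0, 1.66)    # dynamic alignment, Table 1
rate_gain = 100.0 * (1.2 - 1.0) / 1.0         # 1.2 Gbps versus 1 Gbps per pair
```

Sixteen lanes at 1.2 Gbps give about 19.2 Gbps, consistent with the "as high as 19 Gbps" figure, and the Table 1 entries work out to roughly 8% (static) and 17% (dynamic) lower power for this core.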
by Craig Davies
Firmware Engineer
VMETRO Ltd. (High Wycombe, UK)
cdavies@vmetro.com
Jeff Bateman
Senior Systems Engineer
VMETRO Inc. (Ithaca, NY)
jbateman@vmetro.com
In the fast-paced world of FPGA development, Xilinx has struck again with its second-generation ASMBL™ architecture devices, the Virtex™-5 family. This device family has many upgrades from its predecessor, the Virtex-4 family, and likewise continues the evolution of the ASMBL architecture, with scalable FPGAs catering to the application-specific marketplace. For commercial off-the-shelf (COTS) developers, this means a platform that is low cost, light on power consumption, and optimized for high performance.
When creating the Virtex-4 family, Xilinx harnessed the flexibility of the ASMBL architecture to build the first multi-platform FPGA family. Xilinx continues this approach with the Virtex-5 family. The initial offering is the Virtex-5 LX platform, optimized for high-performance logic.
Seasoned FPGA users expect new FPGA generations to deliver more and the Virtex-5 family certainly delivers, all while consuming less power. Compared to Virtex-4 LX devices, Virtex-5 LX FPGAs offer:
• 65% higher logic capacity with as many as 330,000 logic cells
• 70% more block RAM
• 100% more DSP slices
• 25% more SelectIO™ pins
For COTS board vendors, these features enable powerful products capable of handling the very high data rates and processing complexity required of modern real-time DSP systems. In this article, we’ll examine those Virtex-5 architecture components that enable COTS designers to deliver more bang for the buck.
COTS FPGA Backgrounder
Freed from the need to design hardware and IP from scratch, COTS board-level users can focus their energies on implementing their specialist algorithms. COTS products incorporating user-programmable FPGAs target a variety of applications, from simple customizable digital I/O to RADAR, video, and signals intelligence (SigInt).
Typically, the FPGA requires hardware connections to a real-world data source or destination, plus a standardized interface to a host processor. COTS products must usually follow an industry-standard form factor (such as PMC, VME, VXS, and CompactPCI), enabling end users to integrate products from a range of vendors.
With its parallel architecture and high-speed I/O capabilities, the Virtex-5 FPGA is capable of streaming and processing data at the gigabyte-per-second rates typically required for today’s applications. It is well suited to algorithms where a core “inner loop” can be parallelized to speed up operation, employing the resources available in modern devices. Many DSP algorithms dovetail with this architecture. Conversely, even the fastest CPUs cannot easily process data at gigabyte-per-second rates; they are, however, well suited to decision making and user interaction tasks.
Given these trade-offs, FPGA-based DSP systems often employ a hybrid approach, as illustrated in Figure 1. Here, a wide-bandwidth RADAR or video source is digitized at gigasample-per-second rates and fed to an FPGA. The FPGA performs some heavy-duty number crunching to eliminate unwanted data, focusing in on the key area of interest. Pre-processed data is fed at a more manageable rate to a general-purpose CPU for post-processing control and display.
Figure 1 – Processing chain (capture sensor data with a gigasample/sec ADC; pre-process data in a large, fast FPGA; post-process data in a CPU; display result)
Key COTS FPGA board requirements are:
• Large, reconfigurable FPGAs with ample room for customer-programmable application logic
• Regular air-cooled and rugged conduction-cooled options
• High-speed interface for efficient transfers to and from a host processor
• Flexible, fast I/O to and from a variety of real-world interfaces
• Local memory interfaced directly to the FPGA for I/O buffering as well as temporary storage during algorithm operation
• Wide range of I/O and signal-processing IP cores to speed end-user development cycle times
• Flexible FPGA development tools covering both budget-conscious and extreme-performance users
• Debugging interface capable of in-FPGA logic analysis
• Comprehensive board support firmware and software
With a proven track record in the high-end FPGA DSP arena and comprehensive tool and IP support from a variety of sources, the Virtex-5 FPGA family is a natural choice for COTS board-level vendors.
Optimization of Soft Components
A great addition to the Virtex-5 architecture is the replacement of traditional four-input look-up tables (LUTs) with new six-input LUTs (6-LUTs) for more efficient mapping of wider functions. Because 6-LUTs are also configurable as dual five-input LUTs, design software tools can achieve greater efficiency in logic mapping when six-input functions are not required.
Most FPGA devices these days base their soft fabric components – those components configured to implement logic equations – on LUTs. Previously, the common choice was the four-input LUT, as this was a nice binary base and was relatively easy to work with for optimizing a logic function. A given equation can be optimized to contain a sum of products of four inputs. For many of today’s applications, especially those in DSP, this optimization reduces significantly as system-level algorithms increase in complexity.
Configurable logic block storage density improvements increase the shift register LUT (SRL) length from 16 bits to 32 bits (SRL32), while retaining a dual SRL16 option. Distributed RAM now offers a 64-bit option, up from 16 bits. With improved reduced-hop routing and more logic per slice (four LUTs/four flip-flops versus two LUTs/two flip-flops), speed improvements of as much as 45% are possible.
Improved FIR Efficiency
Let’s consider a finite impulse response filter implemented in distributed logic. Distributed arithmetic filters are often selected because their operating frequency is not tied to the length of the tap vector. This characteristic is highly desirable because increasing the tap vector length is fundamental to improving the overall filter response. However, these types of filters are
serial in nature; most applications need valid output at a marginally decimated rate with respect to the sampling frequency. Thus, a fully parallel architecture is required.
Parallel Distributed Arithmetic FIR filters (DAFIRs) utilize significantly more logic versus other FIR implementations to perform the many partial products on a clock-to-clock basis (even when decimating the sampling rate). In the distributed arithmetic architecture, a corresponding output product y(n) is produced by summing the products of a time-delayed series of input x(n) and coefficients a(m), where m is an integer between 1 and N, and N is the filter length.
For the sake of simplicity, let’s say that each tap or filter coefficient is two bits wide and that the input vector is six bits. In total, our filter is 96 taps in length. If we calculate this product using the partial products method, we need 6 x 2 partial products using four-input LUTs. Each LUT is capable of a 2 x 2 multiplication, which means using three 4-input LUTs. Using 6-LUTs, we can reduce this to just two LUTs. For 96 taps, we have saved 96 LUTs of a possible total of 288. This is just the savings when producing the partial product.
LUTs and SRLs are also used for shift registers in the input delay pipe and for the scaling accumulator responsible for summation and normalization of the output. Expanding our example input and tap widths to a more applicable precision of 16 bits increases the depth of our partial product multiplication tree, requiring even more LUTs. Using 6-LUTs as opposed to four-input LUTs results in a LUT logic reduction of more than 33%.
Wider LUTs Improve Efficiency and Speed
Switched fabric developers will also benefit from the 6-LUTs, as these are often used to implement multiplexers. 6-LUTs mean a reduction in the overall depth of a logic equation. For implementing multiplexers, this means an effective increase in speed for an equivalent multiplexer implemented using 6-LUTs, as opposed to four-input LUTs.
Depending on the application, changing to 6-LUTs can make as much as a 1.6x improvement in logic utilization over previous generations of the Virtex device family.
Multiplying Computational Power
Not to be outdone by the soft-logic components, the hard-logic dedicated multipliers have also been optimized for the Virtex-5 FPGA. The 18 x 18-bit multipliers present in the Virtex-II and Virtex-4 families have been upgraded to 25 x 18-bit multipliers in the new family. Application developers who implement beam-forming arrays or other advanced computations will benefit from this enhancement.
Large multiplication arrays that require a high degree of precision traditionally required a large tree structure of multipliers. As output is carried between intermediary stages of a large multiplication, the maximum allowable output value increases with each subsequent stage. To handle this bit-width increase, typical solutions involve a precision reduction by truncation or some other intelligent scheme such as convergent rounding or (less often) by breaking down the multiplication into smaller stages and then rebuilding the final product by summation. Utilizing 25 x 18-bit multipliers, more precision is carried through intermediary stages of a multiplication and thus reduces the impact of intermediary truncation/rounding errors while improving on overall speed and minimizing pipeline latency.
Suppose convergent rounding is employed to reduce the precision at each stage of multiplication within an FFT. If we implement an 8K FFT using a mixed-radix base of radix-4 and radix-2, that gives us six radix-4 butterfly stages and one radix-2 butterfly stage. In an FFT, at each subsequent stage we perform calculations that produce partial products. These outputs are fed into the multipliers of the next stage until the time-domain data is transformed completely to the frequency domain. However, each stage must employ a scheme to reduce the precision of the output so that subsequent stages can accept them as inputs. After each stage of multiplication, scaling is employed to reduce the precision. Each stage of scaling introduces quantization errors.
But what if more precision is carried into the input side of each butterfly stage? Using 25 x 18-bit multipliers, we can carry more precision from our partial products when multiplying them by new sample data and in turn introduce fewer rounding errors into our results.
Improved Source-Synchronous Memory Access
Much to the delight of many designers using Virtex-4 FPGAs, Xilinx introduced a primitive called IDELAY capable of synchronizing data and strobes to a source clock off the FPGA. This feature meant that high-speed DDR and DDR2 SDRAM and QDR and QDR II SRAM memories could be accessed through controllers inside the Virtex-4 device at high data rates.
COTS developers are increasingly finding applications that require fast and deep onboard memory. For example, data recording applications benefit greatly from fast onboard memory to implement the sizeable buffers needed to sustain high-speed data transfers over PCI/PCI-X buses. Video processing applications also require large, fast external memories to store the very-high-resolution, high-frame-rate images produced by today’s leading camera equipment.
The introduction of the IDELAY primitive also benefits ruggedized application developers, as the IDELAY taps can be constantly monitored by logic to perform run-time resynchronization to the source clock; this technique is known as dynamic clock-to-data centering.
Now, with the Virtex-5 family, Xilinx has expanded the primitive to add ODELAY, enabling delay control on both input and output signals. The key component of the IDELAY primitive is to delay the input data relative to the clock such that the internal FPGA version of the source clock edge is centered with the input data. The ODELAY enables variable delays per output data line to better match trace-length differences.
Improving High-Speed I/O Communication
As the COTS marketplace moves more and more into high-speed serial implementations, clock and data recovery techniques
become more in demand. When imple- 65-nm Copper CMOS When implementing large designs –
menting general-purpose high-speed serial COTS developers will greatly benefit such as in software-defined radio (SDR)
links, transmission errors and data loss become a reality, especially when targeting data rates beyond 1 Gbps.

For developers using previous generations of Virtex devices, the choices for clock and data recovery (CDR) implementation to de-serialize incoming streams without using multi-gigabit transceivers (MGTs) were limited. Although the delay-locked loops (DLLs) used for clock generation in previous families are very stable in nature because of their first-order loop architecture and digital implementation, they are not able to filter input jitter or handle phase alignment beyond their discrete range. With the phase-locked loop (PLL) blocks introduced with Virtex-5 family, jitter reduction is an intrinsic feature, resulting in large improvements in higher data-rate sustainability. Filtering input jitter to produce stable internal versions of source clocks is critically important to correctly sample and store incoming data at the FPGA I/O boundary. Using these new blocks, implementing SERDES components using regular SelectIO pins becomes practical even at 1 Gbps and above.

Together with SelectIO performance of as much as 800 Mbps per pin single-ended and 1.25 Gbps differential, the Virtex-5 device is able to input, process, and output the high data rates generated by current real-world interfaces. For example, interfacing directly with ADCs and DACs running in the gigasample-per-second range is now perfectly feasible.

Looking to the future, new high-speed serial fabric interfaces will be a natural fit for interfacing between FPGAs, external devices, and host systems.

from the move into the 65-nm copper CMOS process. One of the consequences of process shrinks is that density and performance increase with the next generation. This is true in the case of the Virtex-5 LX platform, which has increased the amount of CLBs by 65% over the Virtex-4 LX platform, block RAM by 70%, DSP slices by 100%, and SelectIO pins by 25%.

[Figure 2 – VMETRO PMC-FPGA05]
[Figure 3 – Standard PCI option]

With increased logic density in the overall package, power consumption has been reduced significantly. While the Virtex-4 FPGA operates at 1.2V core voltage, the Virtex-5 FPGA improves power efficiency with a core voltage of 1.0V. You can achieve further power savings by optimizing the soft and hard components, such as the 6-LUTs and 25 x 18-bit multipliers.

applications where multi-channel digital filters consume significant CLB space – the dynamic power dissipation is quite high because of the large amount of switching activity that occurs. This is in part caused by the extensive signal routing required to implement these designs. With the new components in the Virtex-5 family, existing designs implement in a smaller number of primitives, reducing the overall switching activity. In addition, the Virtex-5 architecture includes the enhancement of diagonally symmetric routing for more efficient design implementation.

COTS developers often complain about violating power specifications when developing mezzanine cards in existing designs. With the Virtex-5 device, their mezzanine cards will be less power-hungry and more desirable to end users. In other words: less power required, simplified cooling, and greater reliability.

COTS products are often employed in environments where power consumption is a significant challenge – for example, high ambient temperatures may limit a cooling system’s effectiveness, so reducing heat output is an important motivation. Other applications such as unmanned airborne vehicles (UAVs) have limited electrical power availability, so using every Watt available effectively is of paramount importance.

The VMETRO PMC-FPGA05
A good example of a current COTS product implementing the advanced Virtex-5 65-nm technology is the VMETRO PMC-FPGA05, a general-purpose high-end FPGA PCI mezzanine card (PMC) pictured in Figures 2 and 3 and illustrated in the block diagram shown in Figure 4.
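
As a rough illustration of why the core-voltage change cited above matters (1.2V for Virtex-4, 1.0V for Virtex-5), the following Python sketch applies the standard first-order CMOS relation in which dynamic power scales with C·V²·f. It assumes switched capacitance and clock frequency are held constant, which is an idealization and not a figure from the article; the function name is hypothetical, and the result says nothing about static (leakage) power.

```python
def dynamic_power_ratio(v_new, v_old):
    """Ratio of dynamic power at v_new to that at v_old.

    First-order CMOS model: P_dyn ~ C * V^2 * f, with C and f assumed
    unchanged (an assumption for illustration only).
    """
    return (v_new / v_old) ** 2


ratio = dynamic_power_ratio(1.0, 1.2)
saving_pct = (1.0 - ratio) * 100.0
print(f"Dynamic power ratio: {ratio:.3f} (about {saving_pct:.1f}% lower)")
```

Under these assumptions, voltage scaling alone accounts for roughly a 30% drop in dynamic power; any additional savings from reduced capacitance or fewer primitives would come on top of that.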
FPGA Editor. Therefore, changes

[Block diagram labels: SDRAM, SDRAM Controller (x2)]
[Figure: Basic Design Flow. HDL Design Entry, NGD, Mapping/Placement and Routing, NCD, Bit Gen, BIT, JTAG, Xilinx FPGA; ChipScope ILA inserted into the flow; iterate]
effective development solution. Those implementing the most complex algorithms may benefit from high-end synthesis tools from third-party vendors. These integrate directly with ISE software to maintain a simple project management process.

Simulation may not reveal all errors in the design, particularly for complex projects. When an implementation does not function as expected, you must connect up to the actual hardware to see what is going on. But with dense packaging covering many thousands of pins, there’s no practical way to connect a traditional logic analyzer.

VMETRO engineers have many years of experience developing firmware for Xilinx FPGAs. Using ISE software as the primary synthesis/place and route tool, Model Technology’s ModelSim PE for simulation, and the ChipScope Pro tool for in-circuit debugging, VMETRO developed the IP necessary for interfacing with the board hardware. This includes interfaces for the SRAM and SDRAM memory devices, to which users simply connect their address and data signals, and a high-performance bus mastering PCI-X interface core supporting customizable registers and simple FIFO-based DMA transfers for streaming data.

Another good reason to choose the Virtex-5 family is a commitment to education and support: from introductory courses in VHDL to DSP logic design courses to development laboratories equipped with the latest gear for testing high-speed serial interfaces, Xilinx offers the resources necessary for successful FPGA deployment.

As the Virtex-5 device is brand new, the PMC-FPGA05 is still in development. Stay tuned for an update on the challenges, solutions, and lessons learned as VMETRO works hard to bring you the world’s first Virtex-5 COTS product.

Conclusion
To meet the growing demands being placed on the COTS marketplace, you must adapt and implement platforms with the right tools for the application. Today, this means integrating high-speed serial communication, fast access memory, and plenty of optimized logic space for advanced algorithm development. Equally important are efficient development and debug tools, IP resources, and a commitment to high-speed DSP development.

The highest logic density available, the lowest power consumption, and the best performance are what COTS developers need to meet the needs of their customers. Virtex-5 FPGAs, with their ASMBL architecture and 65-nm process, deliver on these demands.
by Delfin Rodillas
Senior Manager, Wired Communications
Xilinx, Inc.
delfin.rodillas@xilinx.com

The rate of adoption of serial technology in high-end system design has reached critical mass. As shown in Figure 1, 92% of respondents in a recent EE Times survey answered “yes” when asked if they were designing serial I/O systems in 2006, compared to 64% serial design activity in 2005.

A good portion of this dramatic adoption rate is caused by the penetration of serial technology in backplane applications. As system throughput requirements increase, the parallel backplane technologies of old will be displaced by SerDes-based backplane subsystems that provide higher bandwidth, better signal integrity, lower EMI and power, and simpler PCB designs.

Further promoting this growth is the emergence of standard serial protocols such as XAUI and Gigabit Ethernet (GbE), which allow reduced engineering efforts and interoperability. Standardization efforts for serial backplane form factors such as AdvancedTCA and MicroTCA in the PCI Industrial Computer Manufacturers Group (PICMG) have also contributed to the accelerated adoption. The benefits of serial backplanes are so compelling that they have been used as the backbone of not only communications, compute, and storage systems but also broadcast, medical, defense, and industrial/test systems.

Persistent Design Challenges
Regardless of the increased rate of adoption, many design challenges still exist. Because the backplane subsystem is the heart of the system, it must be able to pass signals from card to card reliably. Thus, designing backplanes with high signal integrity (SI) is of primary importance.

Also significant is the use of proper silicon ICs with SerDes technology, capable of driving backplanes with very low bit-error rates. Silicon-based approaches to mitigating SI issues are particularly important in “legacy upgrade” scenarios, in which designers re-use older backplanes with legacy components and design rules.

There are also challenges in developing serial backplane protocols and fabric interfaces. The majority of backplane designs leverage legacy ASICs, which have proprietary protocols. Even some newer backplane designs require a proprietary backplane protocol. Silicon solutions must therefore be flexible and provide the necessary customizability. Although an ASIC allows this, it can often be costly and risky, with unproven product demand/volume and the possibility of design bugs and specification changes.

An approach that has recently gained traction is the use of off-the-shelf standards-based switch fabrics. This saves development time, but you must have silicon solutions that conform to the standard
protocol, as well as the flexibility to customize the end product and make it unique.

And of course, there are the ever-present challenges of cost, power, and time to market. To meet the challenges of serial backplane design, Xilinx provides the Virtex™-5 LXT platform of FPGAs as well as IP solutions.

[Figure 1 – Percentage of engineers designing Serial I/O systems: 64% in 2005, 92% in 2006. Source: EE Times Survey, 2005]

Xilinx Solutions for Serial Backplanes
The key technology that enables the application of Xilinx® Virtex-5 LXT FPGAs in serial backplane applications is the embedded RocketIO™ GTP low-power serial transceiver. There are as many as 24 serial transceivers in the largest Virtex-5 LXT FPGA; each serial transceiver is capable of running from 100 Mbps to 3.2 Gbps. Coupled with programmable fabric, the FPGA is capable of supporting virtually any serial protocol – proprietary or standard – up to 3.2 Gbps.

More important for serial backplane applications are built-in signal conditioning features, including transmit pre-emphasis and receive equalization. These features enable transmission of multi-gigabit signals over long distances, often reaching 40 inches or longer. Both equalization methods minimize the impact of inter-symbol interference (ISI) by boosting high-frequency signal components and attenuating low-frequency components. The difference is that pre-emphasis is performed on the transmitted signal as it goes out of the line driver, while equalization occurs on the received signal after it enters the IC package. Both pre-emphasis and equalization features are programmable to different states to allow for optimum signal compensation.

Besides signal conditioning features, the serial transceivers also provide additional features beneficial for backplanes, such as programmable output swings that allow interfacing to a variety of other current mode logic (CML)-based devices and built-in AC coupling capacitors that simplify transmission line design and reduce ISI.

IP Cores
Proprietary protocols still make up most serial backplane implementations. However, some newer designs have used standards-based protocols such as XAUI and GbE. This growing acceptance has been driven primarily by the maturity of these standards and the emergence of switch fabric ASSPs utilizing these protocols. Using ASSPs for switching applications saves tremendous development time, but designers realize that they need to differentiate their products by adding value-added capabilities, primarily on the line card.

FPGAs are the ideal platform for providing customizability, as the serial transceivers are designed to support a majority of standard serial backplane protocols. Together, the serial transceivers and fabric allow for standards-compliant designs with value-added functions – all in a single silicon device.

To reduce design time, Xilinx offers off-the-shelf available IP cores for key serial I/O interface standards such as XAUI, GbE, SRIO, and PCIe. To ensure interoperability, these IP cores are tested through consortia plug-fests and independent third-party verification. To facilitate the creation of lightweight serial protocol designs, Xilinx also created the Aurora protocol, which is ideal for simpler designs requiring minimal overhead and optimized slice/resource utilization.

With increased usage of Ethernet and PCIe, Virtex-5 LXT FPGAs also include embedded tri-mode Ethernet MACs and PCIe Endpoint blocks. These allow significant savings of FPGA slice resources for customers needing interfaces in control plane applications, for example.

Because many chips with parallel interfaces are still used even in newer systems, Xilinx also offers IP cores for popular parallel interfaces such as SPI-4.2, SPI-3, and PCI. These allow you to rapidly create serial-to-parallel bridges, which are still required in many applications.

Besides serial and parallel interface IP, Xilinx offers more complete IP solutions that further reduce development time and time to market. These solutions include a Traffic Manager for prioritizing traffic flows across backplanes, as well as a Mesh Fabric Reference Design that allows “every-to-every” connectivity between cards.

Lastly, the ChipScope™ Pro Serial I/O Tool Kit enables rapid serial transceiver setup and debugging as well as BERT testing. Table 1 summarizes the serial backplane-related IP available from Xilinx.

Application Examples
Let’s look at how you could integrate all of the solution components to create a complete serial backplane fabric interface FPGA for both a star and mesh system.

IP Category                        Available IP
Serial Interfaces                  XAUI, GbE, PCI Express, Serial RapidIO, Aurora, CPRI, OBSAI
Parallel Interfaces                SPI-4.2, SPI-3, Utopia, PCI, CSIX
System-Level Solutions             10G Traffic Manager, Mesh Fabric Reference Design
Serial Backplane Test Solutions    ChipScope Pro Serial I/O Tool Kit

Table 1 – Xilinx IP for serial backplanes
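
To make the transmit pre-emphasis idea discussed in this article more concrete, here is a minimal Python sketch of a two-tap FIR pre-emphasis filter applied to NRZ symbols. This is a textbook model, not the RocketIO GTP implementation: the function name, the `alpha` tap weight, and the ±1 symbol levels are assumptions chosen for illustration.

```python
def pre_emphasize(bits, alpha=0.25):
    """Two-tap FIR pre-emphasis: y[n] = x[n] - alpha * x[n-1].

    bits  -- iterable of 0/1 values, mapped to NRZ levels -1.0 / +1.0
    alpha -- hypothetical tap weight (0 disables pre-emphasis)

    A symbol that follows a transition is driven harder (1 + alpha),
    while a repeated symbol is de-emphasized (1 - alpha), which boosts
    high-frequency content relative to low-frequency content.
    """
    symbols = [1.0 if b else -1.0 for b in bits]
    out = []
    # Assume the line idled at the first symbol's level before the burst.
    prev = symbols[0] if symbols else 0.0
    for s in symbols:
        out.append(s - alpha * prev)
        prev = s
    return out


print(pre_emphasize([0, 0, 1, 1, 1, 0], alpha=0.25))
# prints [-0.75, -0.75, 1.25, 0.75, 0.75, -1.25]
```

In the output, the symbols immediately after each transition have a larger magnitude (1.25) than the repeated symbols (0.75). A receive equalizer aims at a similar high-frequency boost, but applied to the received signal rather than at the line driver.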
©2006 Xilinx, Inc. All rights reserved. XILINX, the Xilinx logo, and other designated brands included herein are trademarks of Xilinx, Inc.
All other trademarks are the property of their respective owners.
VERTICAL MARKET SOLUTIONS
by Sriram R. Chelluri
Senior Manager, Storage and Servers
Xilinx, Inc.
sriram.chelluri@xilinx.com

As the data center network infrastructure migrates to 10 Gbps, moving data traffic to an Ethernet-based solution becomes economically viable without sacrificing performance and latency. Hardware-based host interfaces like PCI Express and multi-Gigabit Ethernet (GbE) support open up design possibilities for low-cost, high-performance products in the computer and data-processing markets. The Xilinx® Virtex™-5 family of FPGAs sets the stage for designing system-on-chip (SoC) solutions with higher functionality and low power.

The Virtex-5 architecture brings to market critical features that make SoC designs easy to implement for TCP and iSCSI offload engines:

• Built-in PCI Express (PCIe) block – An integrated standards-compliant PCIe endpoint for supporting one to eight lanes, capable of providing as much as 32 Gbps full-duplex performance

• Built-in Gigabit Ethernet MAC (GEMAC) – four hardcore GEMACs enable multi-port gigabit solutions, reducing total real estate requirements for SoC designs

• Real six-input LUT (6-LUT) technology – improves slice utilization and reduces routing latency for high performance

• 36-Kb dual-port block RAM – higher memory density with error-correction circuitry enables support for reliable computational logic structures and increased on-chip TCP sessions for simultaneous transmit and receive operations

• DSP48E slices – enable massively parallel computations for image processing and multimedia applications

Because the Virtex family is a programmable platform, you can adapt your designs to changing standards and market requirements. Leveraging the resources available in the Virtex-5 family, you can design cost-effective TCP and iSCSI offload solutions for the server, storage, multi-protocol switch, and wireless base station markets with extended product life cycles.

TCP Offload Engine (TOE) Overview
Current TCP offload solutions rely on a complete software stack or on special network interface cards (NICs) based on ASICs for handling TCP/IP processing. An all-software solution is acceptable for low-bandwidth applications, but high-performance applications would consume all of the CPU resources, creating a system bottleneck for critical applications.

ASIC-based solutions are primarily from start-ups looking to capitalize on the high-performance 10 Gbps market. These solutions are still expensive and prone to vendor lock-in with an uncertain financial future.

Xilinx and its third-party IP partners provide fully standards-compliant TCP/iSCSI offload solutions that you can implement as is or customize for functionality, size, speed, or the target application.
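
The 32 Gbps full-duplex figure quoted for the eight-lane PCIe block in the feature list above can be reproduced with a short calculation. The Python sketch below assumes first-generation PCI Express signaling (2.5 GT/s per lane per direction) with 8b/10b line coding, and assumes the article's number counts both directions together; the function name and parameters are hypothetical.

```python
def pcie_payload_gbps(lanes, line_rate_gtps=2.5, coding_efficiency=8 / 10):
    """Return (per-direction, both-directions) payload bandwidth in Gbps.

    Assumes PCIe 1.x signaling at 2.5 GT/s per lane and 8b/10b coding,
    where 8 payload bits are carried in every 10 line bits. Protocol
    overhead (packet headers, flow control) is ignored.
    """
    per_direction = lanes * line_rate_gtps * coding_efficiency
    return per_direction, 2 * per_direction


per_dir, full_duplex = pcie_payload_gbps(8)
print(f"x8: {per_dir:.1f} Gbps per direction, {full_duplex:.1f} Gbps full duplex")
```

Under these assumptions an x8 link carries 16 Gbps each way, or 32 Gbps counting both directions; real throughput is lower once packet and protocol overhead are included.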
to support
• TCP reassembly/reorder
• Latency
• On-chip versus off-chip TCP session management

[Figure 1 – Designing a TCP offload solution with traditional FPGAs]
[Diagram labels recovered from this page: Back-End I/O Interface; GbE PHY (x4); GEMAC (x4)]

These issues can be mitigated with the unique features of Virtex-5 devices and available IP cores. With built-in GEMAC and PCIe interfaces, you can implement direct memory access solutions with min-
by Mike Nelson
Sr. Staff System Architect,
Storage and Servers, Vertical Markets
Xilinx, Inc.
mike.nelson@xilinx.com
computational logic structures

Combined, these resources enable very

appliances. This model leverages the excellent value of the commodity x86 platform to implement the application framework and selectively “looks aside” to an optimized accelerator to achieve high perform-

[Figure 2 – I/O bandwidth and soft logic progression for FPGA co-processor options]
[Diagram labels recovered from this page: System, CPU, Chipset, I/O, Port I/F, Packet Writer, Encrypt & Decrypt, Packet Reader, User Programmable Soft Logic, Hard PCI Express Controller, RocketIO Multi-Gigabit Transceivers]
[Block diagram, caption not recovered. Labels: Port I/F options on each side (n x GE, 10GE, 10G FC, PCIe, SPI-4.2, etc.); Packet Reader and Packet Writer around Encryption; Packet Writer and Packet Reader around Decryption; Multi-Ported Memory Controller to DDR2, RLDRAM II, QDRII SRAM, etc.]
or local subsystem memory. The Virtex-5 LXT platform meets this challenge with a wide range of capabilities:

• Gigabit Ethernet (GbE) – Each device in the Virtex-5 LXT platform includes four independent hardened GbE MACs, making multi-port Ethernet a very efficient I/O option. You can add additional ports as necessary with 100% form-, fit-, and function-equivalent soft LogiCORE™ IP.

• 10 Gigabit Ethernet – A Xilinx soft LogiCORE function is available that can be connected to four RocketIO MGTs for a XAUI interface or to a SelectIO pinout for an XGMII interface.

• 10 Gbps Fibre Channel (FC) – A XAUI-like Fibre Channel standard uses four RocketIO MGTs operating at 3.1875 Gbps in parallel to create a 10.2 Gbps FC channel.

• PCI Express – Available to interface to a variety of industry-standard PCIe-based port controllers.

• SPI-4.2 – Soft LogiCORE IP supports this networking industry standard for chip-to-chip connectivity over high-performance SelectIO technology.

• Memory – In addition to port I/O standards, Virtex-5 SelectIO technology also supports a wide range of memory interface technologies including DDR2, RLDRAM II, and QDR II SRAM. These capabilities enable virtually any local memory subsystem that an in-line processing engine might require.

These features allow you to create in-line solutions that will connect to the ports you need with the integrated encryption technology you want.

Conclusion
The Virtex-5 LXT platform expands the capabilities of the Virtex-5 FPGA architecture with the addition of RocketIO GTP transceivers, plus hard PCI Express Endpoint and tri-mode Ethernet MAC blocks. The result is a platform ideally suited to support very high-performance look-aside and in-line encryption functions.

Other applications where LXT platform devices will excel include high-performance packet handling and deep content inspection for networking; high-speed data mining for databases; time-critical computational processing for industrial, scientific, and medical applications; and real-time image processing for aerospace/defense and video graphic applications.

To learn more about Virtex-5 LXT platform FPGAs, visit www.xilinx.com/virtex5. To learn more about Xilinx in encryption, visit www.xilinx.com/esp/security/data_security/index.htm. And to learn how Xilinx FPGAs can help you in other applications, visit www.xilinx.com/esp.
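
As a check on the 10 Gbps Fibre Channel entry in the capability list above, the relationship between the 3.1875 Gbps lane rate and the 10.2 Gbps channel rate can be worked out in a few lines of Python. The sketch assumes 8b/10b line coding on each lane, as used by XAUI-style four-lane links; the function name is hypothetical.

```python
def lane_aggregate_gbps(lanes, lane_rate_gbps, coding_efficiency=8 / 10):
    """Return (raw line rate, payload rate) in Gbps for parallel serial lanes.

    Assumes 8b/10b coding: every 10 line bits carry 8 payload bits.
    """
    raw = lanes * lane_rate_gbps
    return raw, raw * coding_efficiency


raw, payload = lane_aggregate_gbps(4, 3.1875)
print(f"raw {raw:.2f} Gbps, payload {payload:.1f} Gbps")
```

Four lanes at 3.1875 Gbps give 12.75 Gbps on the wire, which corresponds to the 10.2 Gbps channel rate once the assumed 8b/10b overhead is removed.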
Virtex-5 Configuration Options
Offer Designers a Choice
Xilinx provides a host of flexible choices in configuration memory to help you make the best decision for your design.
Derek Johnson
APD Marketing
Xilinx, Inc.
derek.johnson@xilinx.com
Connectivity Solutions
Realize the full potential of the solutions in our silicon with Xilinx application notes.
Memory Interfaces

XAPP851 – DDR SDRAM Controller Using Virtex-5 FPGA Devices
By Toshihiko Moriyama and Rich Chiu
This application note describes a 200-MHz DDR SDRAM memory controller implemented in a Virtex™-5 device. This reference design uses the Virtex-5 ChipSync™ features to calibrate and adjust read data timing. A straightforward back-end user interface is provided to allow integration into a complete FPGA design.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp851.pdf

XAPP852 – Synthesizable CIO DDR RLDRAM II Controller for Virtex-5 FPGAs
By Benoit Payette and Rodrigo Angel
This application note describes how to use a Virtex-5 device to interface to common I/O (CIO) double data rate (DDR) reduced latency DRAM (RLDRAM II) devices. The reference design targets two CIO DDR RLDRAM II devices at a clock rate of 200/300 MHz, with data transfers at 400/600 Mbps per pin.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp852.pdf

XAPP853 – QDR II SRAM Interface for Virtex-5 Devices
By Lakshmi Gopalakrishnan
This application note describes the implementation and timing details of a four-word-burst quad data rate (QDR II) SRAM interface for Virtex-5 devices. The synthesizable reference design leverages the unique I/O and clocking capabilities of the Virtex-5 family to achieve performance levels of 300 MHz (600 Mbps), resulting in an aggregate throughput for each 36-bit memory interface of 43.2 Gbps. The design greatly simplifies the task of read data capture within the FPGA while minimizing the number of resources used. A straightforward user interface is provided to allow simple integration into a complete FPGA design utilizing one or more QDR II interfaces.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp853.pdf

XAPP858 – High-Performance DDR2 SDRAM Interface in Virtex-5 Devices
by Karthi Palanisamy and Maria George
This application note describes the controller and data capture technique for high-performance DDR2 SDRAM interfaces. This data capture technique uses the input serializer/deserializer (ISERDES) and output double data rate (ODDR) features available in every Virtex-5 I/O.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp858.pdf

Source-Synchronous Interfaces

XAPP855 – 16-Channel DDR LVDS Interface with Per-Channel Alignment
by Greg Burton
This application note describes a 16-channel source-synchronous DDR LVDS interface. The design takes advantage of the Virtex-5 I/O ChipSync feature’s ability to adjust the delay of the receiver datapaths, creating dynamic setup and hold timing for each device at initialization and compensating for skews associated with the manufacturing process. The receiver operates at 1:8 deserialization on each of the 16 data channels.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp855.pdf

XAPP860 – 16-Channel DDR LVDS Interface with Real-Time Window Monitoring
by Greg Burton
This application note describes a 16-channel source-synchronous DDR LVDS interface. The receiver operates at 1:6 deserialization on each of the 16 data channels. Similar to XAPP855, the design also includes a real-time window monitoring circuit for added performance. This reference design calibrates and compensates for skews associated with process, voltage, and temperature (PVT) at initialization and also dynamically during operation.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp860.pdf

Serial Connectivity

XAPP861 – Efficient 8x Oversampling Asynchronous Serial Data Recovery Using IDELAY
by John Snow
Virtex-5 devices have a high-precision programmable delay element (IDELAY) associated with every input pin. This application note shows how to implement 8x oversampling of many data streams using a single DCM, two global clock resources, and minimal FPGA logic resources. This solution provides better jitter tolerance than techniques using multiple DCMs. When paired with a suitable data recovery scheme, this oversampling technique can be used with many different data protocols up to 550 Mbps. A reference design is included that implements an SD-SDI (SMPTE 259M) receiver running at 270 Mbps.
On the Web at www.xilinx.com/bvdocs/appnotes/xapp861.pdf
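
The memory data rates quoted in these summaries follow from simple double-data-rate arithmetic, sketched below in Python. The helper names are hypothetical, and the assumption that the 43.2 Gbps figure for XAPP853 counts the QDR II read and write ports together is an inference from the numbers, not a statement in the application note summary.

```python
def ddr_rate_mbps(clock_mhz):
    """Per-pin data rate for a double-data-rate interface (two bits per clock)."""
    return 2 * clock_mhz


def qdr2_aggregate_gbps(clock_mhz, width_bits, ports=2):
    """Aggregate throughput of a QDR II interface in Gbps.

    QDR II SRAM has separate read and write ports, each transferring
    data on both clock edges; ports=2 assumes both are counted.
    """
    return ddr_rate_mbps(clock_mhz) * width_bits * ports / 1000.0


print(ddr_rate_mbps(200), ddr_rate_mbps(300))       # RLDRAM II: 400 and 600 Mbps per pin
print(f"{qdr2_aggregate_gbps(300, 36):.1f} Gbps")   # 36-bit QDR II at 300 MHz
```

With a 300 MHz clock, each pin moves 600 Mbps; 36 pins on each of the two ports then yield 43.2 Gbps, matching the figure in the XAPP853 summary under the stated assumption.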
Source-Synchronous Interfaces

SPI-4 Phase 2 Interface Solutions (DO-DI-POSL4MC) Xilinx IP Core
The Xilinx® SPI-4 Phase 2 core provides a fully compliant packet-over-SONET/SDH (POS) solution, which can be quickly integrated into networking systems. Through user-configurable options, the Xilinx SPI-4.2 core provides ultimate flexibility while seamlessly interoperating with industry-leading ASSPs to maximize the data transfer bandwidth. The Xilinx SPI-4.2 core is fully compliant with the OIF’s System Packet Interface Level 4 (SPI-4) Phase 2 standard, as well as the Saturn Development Group’s POS-PHY Level 4 (PL4) interface specification.
Type search keywords: SPI-4 Phase 2

Serial Connectivity

Virtex-5 RocketIO GTP Wizard
The Virtex™-5 RocketIO™ GTP Wizard automates the task of creating HDL wrappers to configure Virtex-5 RocketIO GTP transceivers. The wizard’s customization GUI allows you to configure one or more GTP transceivers using pre-defined templates to support popular industry standards, or from scratch to support a wide variety of custom protocols.
Type search keywords: GTP Wizard

Virtex-5 Embedded Tri-Mode Ethernet MAC Wrapper
The CORE Generator™ tool supports the Virtex-5 Tri-Mode Ethernet Media Access Controller (MAC) Wrapper to automate the generation of HDL wrapper files for the tri-mode Ethernet MAC in Virtex-5 LXT devices. Preconfigured HDL wrappers, testbenches, and implement and simulation scripts are generated automatically based on user-defined options.
Type search keywords: Virtex-5 Ethernet MAC

Virtex-5 PCI Express Endpoint Block Wrapper
The Xilinx PCI Express Endpoint block wrapper integrates and interfaces to the on-chip PCI Express Endpoint block, supporting 1-lane, 2-lane, 4-lane, and 8-lane complete endpoint core implementations. In addition, a PCI Express Endpoint block development kit is also available. This solution is used in communication, multimedia, server, storage, and mobile platforms and enables applications such as high-end medical imaging, graphics-intensive video games, DVD quality streaming video on the desktop, and 10 Gigabit Ethernet interface cards.
Type search keywords: PCI Express Block

Virtex-5 LXT PCI Express Block Plus LogiCORE Xilinx IP Core
The Xilinx PCI Express Plus LogiCORE IP integrates and interfaces to the PCI Express Endpoint block, supporting 1-lane, 4-lane, and 8-lane complete endpoint core implementations. In addition, a PCI Express development kit is also available. This solution is used in communication, multimedia, server, storage, and mobile platforms and enables applications such as high-end medical imaging, graphics-intensive video games, DVD quality streaming video on the desktop, and 10 Gigabit Ethernet interface cards.
Type search keywords: PCI Express Plus
Avnet Virtex-5 LX Development Kit
A complete development platform for designing and verifying applications based on the Xilinx Virtex-5 LX FPGA family.
Available with the Xilinx® Virtex™-5 XC5VLX50-1FF676 device, the Avnet Virtex-5 Development Kit allows you to prototype high-performance designs with ease, while providing expandability and customization through the EXP expansion slot.
The system board includes DDR2 SDRAM, flash memory, a 10/100/1000 Ethernet PHY, and a serial port, making it an ideal platform for MicroBlaze™ development. Other board features include a USB port, programmable LVDS clock, 10-bit Tx/Rx high-speed LVDS interface, user switches and LEDs, and a 2 x 16-character LCD panel.
The board also provides a full EXP expansion slot, providing a total of 168 high-speed, single-ended, and differential user I/O. You can easily add EXP modules to the board for additional application-specific functions.
On the Web at www.avnet.com

Xilinx Virtex-5 ML501 Evaluation and Development Platform
An ideal general-purpose, low-cost development platform.
The Xilinx Virtex-5 ML501 evaluation and development platform is a feature-rich, low-cost evaluation/development platform that provides easy and practical access to the resources available in the on-board Virtex-5 LX FPGA. Supported by industry-standard interfaces and connectors, the ML501 is a versatile development platform for multiple applications. Video, audio, and communication ports as well as generous memory resources extend the functionality and flexibility of the ML501 evaluation platform beyond a typical FPGA development platform.
On the Web at www.xilinx.com/ML501

HiTech Global Virtex-5 PCI Express Development Platform
Seamless serial interface connectivity enabled by the Virtex-5 LXT FPGA.
Powered by a Xilinx Virtex-5 LXT FPGA, supported by mainstream peripherals, and designed with excellent signal integrity performance, the HiTech Global HTG-V5PCIE is the ideal platform for serial interface/connectivity developments, including PCI Express subsystems, Serial ATA (SATA), Fibre Channel, RapidIO, and XAUI.
On the Web at www.hitechglobal.com

Xilinx Virtex-5 ML550 Networking Interfaces Tool Kit
Designing networking, telecom, servers, and computing systems with Virtex-5 FPGAs.
Many of today’s telecom and networking systems use high-bandwidth interfaces based on LVDS or other differential I/O standards. Differential I/O standards simplify system design by improving system performance and signal integrity.
Protocols based on source-synchronous I/Os such as SPI-4.2 and SFI-4 are central to leading-edge system design. To take advantage of these technologies, you have to work through multiple challenges to ensure device interoperability and standards compliance. Xilinx provides the Virtex-5 network interface board, as well as standards-compliant IP cores and free reference designs, to help you tackle these high-speed, source-synchronous interface challenges. This allows you to focus on user application design and not worry about interoperability and standards compliance.
On the Web at www.xilinx.com/XOB

Xilinx Virtex-5 ML555 PCI Express Development Tool Kit
A highly configurable pre-verified development solution.
The Xilinx ML555 RoHS-compliant PCIe/PCI-X/PCI development board provides a pre-verified solution to parallel and serial PCI interface design challenges. Using an established development environment can dramatically shorten the design cycle. By using proven Xilinx dedicated blocks, you can focus your efforts on specific application development and avoid time-consuming PCIe or PCI development.
On the Web at www.xilinx.com/XOB

Xilinx Virtex-5 ML561 Advanced Memory Development System
Achieve your performance targets in the shortest development time.
Building interfaces to high-performance memory devices presents challenges such as high-speed synchronous data capture, along with implementing complex physical-layer interfaces and control logic. The ML561 advanced memory development system offers an excellent platform to develop and verify high-performance memory interfaces using Virtex-5 FPGAs.
On the Web at www.xilinx.com/XOB