Nadav Rotem, C to Verilog, here.
C-to-Verilog is a free and open-source online C to Verilog compiler. You can copy-and-paste your existing C code and our online compiler will synthesize it into optimized Verilog.
For additional information on how to use our website to create FPGA designs, watch the screencast below.
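To make the input concrete, here is the sort of small, self-contained, loop-heavy C kernel you would paste into a tool like this (a hypothetical illustration, not an example from the C-to-Verilog site; fixed trip counts and plain array indexing are the usual constraints for C-to-HDL flows):

```c
/* Hypothetical kernel suited to C-to-HDL synthesis: fixed trip count,
 * no pointer chasing, plain array indexing. Not from the C-to-Verilog site. */
#define N 256

void saxpy(int a, const int x[N], const int y[N], int out[N])
{
    for (int i = 0; i < N; i++) {
        out[i] = a * x[i] + y[i];   /* maps naturally to a multiply-accumulate datapath */
    }
}
```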
Andrew Putnam, Susan Eggers, et al., Performance and Power of Cache-Based Reconfigurable Computing, here. Did not see these slides at Susan Eggers' website. Note Prasanna Sundararajan's name is on the slides; he must know something about the 2008 CHiMPS work.
This paper describes CHiMPS, a C-based accelerator compiler for hybrid CPU-FPGA computing platforms. CHiMPS's goal is to facilitate FPGA programming for high performance computing developers. It inputs generic ANSI C code and automatically generates VHDL blocks for an FPGA. The accelerator architecture is customized with multiple caches that are tuned to the application. Speedups of 2.8x to 36.9x (geometric mean 6.7x) are achieved on a variety of HPC benchmarks with minimal source code changes.
Looking at the CHiMPS paper: they have a Black-Scholes benchmark but only quote relative performance between x86/gcc and CHiMPS/FPGA. It sure looks like the FPGA performance here is compared against naive x86 code. Recall Pink, I published Black-Scholes in 36 nanoseconds on 2007 commodity hardware using XLC. So unless CHiMPS was being misleading here, I would assume they had a 2008 FPGA that did Black-Scholes evaluations in about 2.05 (36/17.6) nanoseconds. I'm gonna go ahead and give a 2008 time-travel-shenanigans yellow card, because I don't believe there was enough non-ILP parallelism in the basic Black-Scholes formula evaluation given FPGA clock speeds circa 2008 (something like 550 MHz); see Xilinx References Apr 2012.
Speedup numbers do not fully convey the performance potential of the CHiMPS accelerators, because performance is often highly dependent on the input data size and the computation-communication ratio. To illustrate this, Figure 2 depicts Black-Scholes’s sensitivity to input size. For very small numbers of stocks, using CHiMPS is detrimental to performance, because the data transfer time to the FPGA is greater than executing on the CPU. Performance breaks even around 24 stocks, and continues to increase until it peaks at a speedup of 17.6 (700 stocks). Speedup then decreases since the data for more stocks no longer fits in the FPGA caches, and the CPU’s larger cache and memory prefetcher keep the CPU execution time steady.
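For reference, the per-evaluation work being argued over here is roughly the closed-form call price below (a minimal C sketch using an erfc-based normal CDF; the benchmarked implementations surely differ in detail). A log, an exp, a square root, and a couple of CDFs: not a lot of serial work to hide behind a few-hundred-MHz FPGA clock, so the win has to come from evaluating many options in parallel.

```c
/* Minimal Black-Scholes European call sketch -- the standard closed form,
 * not the exact code used by either benchmark. Compile with -lm. */
#include <math.h>

static double norm_cdf(double x)
{
    return 0.5 * erfc(-x / sqrt(2.0));
}

/* S: spot, K: strike, r: risk-free rate, sigma: volatility, T: time in years */
double bs_call(double S, double K, double r, double sigma, double T)
{
    double d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrt(T));
    double d2 = d1 - sigma * sqrt(T);
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2);
}
```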
Extreme Tech, Ivy Bridge: Intel's killing blow on AMD, here. Let's look at this again. For the FinQuant application space I'd estimate somewhere between 50% and 85% of what you care about in selecting a Linux server is the current, expected future, and realized future feature size of the fab producing the server's microprocessors and chips. There are lots of other important variables: system and microprocessor architecture, programming languages, network transmission lines, compilers, operating systems, file systems, databases, etc., and each alone can make or break a FinQuant app, but they are all tails. The microprocessor fab feature size is the dog: it effectively determines how well my FinQuant infrastructure scales with Moore's Law. The comparative technology priority has not always been this way. There used to be different instruction set architectures, networks were slower than DRAMs, and memories were small, all requiring evaluation in addition to the shrinking microprocessor fab feature size. In all likelihood the comparative technology priorities will change in the future as well.
Right now, Intel Sandy Bridge is 32 nm, Ivy Bridge is 22 nm, AMD Opteron is 28 nm, Xilinx is 28 nm, Achronix is 22 nm, and the microprocessor market-share main event is between AMD and Intel over design wins in mobile systems. Server-side, Intel holds 95% market share to AMD's 5%. Intel is trying to set expectations of 22 nm by 2013 and 14 nm by 2014, here for example, while showing off the Chandler, Ariz. 14 nm fab construction, here. Recall that things do not always move so smoothly for Intel; think about the relatively recent 8M Sandy Bridge support chip recall and Itanium. On the other hand, AMD's problems appear to be a shade worse than Intel's, witness: ars technica on server market share here, The Register here, Extreme Tech here. I don't know how much these websites are owned or comped by Intel, but if I am holding a bunch of Opteron server-side exposure it is probably safe to argue that it's time to think about a hedge.
All things being equal, if I am aggressively setting up an HPC FinQuant infrastructure play now, I kind of want to be production ready with 22 nm silicon by the end of 2012, looking to set up a smooth infrastructure transition to 14 nm in 2014.
Anand Tech, Intel's Ivy Bridge Architecture Exposed, here. Not sure how much I care about the integrated GPU for server-side FinQuant apps unless AVX2 is somehow tied to the GPU.
ars, Transactional memory going mainstream with Intel Haswell, here.
phoronix, Compilers Mature For Intel Sandy/Ivy Bridge, Prep For Haswell, here. Wow, treasure.
tom’s hardware, AMD Steals Market Share From Intel, here. The interesting fight is in mobile from the market’s perspective. Servers not so interesting.
Hieroglyph, web site, here. Neal Stephenson pumping big Big Science.
My life span encompasses the era when the United States of America was capable of launching human beings into space. Some of my earliest memories are of sitting on a braided rug before a hulking black-and-white television, watching the early Gemini missions. At the age of 51—not even old!—I watched on a flat-panel screen as the last Space Shuttle lifted off the pad. I have followed the dwindling of the space program with sadness, even bitterness. Where’s my orbiting, donut-shaped space station? Where’s my fleet of colossal Nova rockets? Where’s my ticket to Mars?
Business Insider, REVEALED: The Asteroid Mining Plan Backed By Google And Goldman Billionaires, here.
EE Times, Broadcom aims to spread 100-Gbit Ethernet with single-chip solution, here.
Broadcom Corp. Tuesday (April 24) announced its fourth-generation Ethernet network processor, which it claims is the industry’s first chip to use massive parallelism by virtue of its 64 packet-processing cores running at one gigahertz. Providing full-duplex 100Gbit per second performance, it can also be configured to provide a dozen 10-Gbit channels.
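Back-of-the-envelope on what 64 cores at 1 GHz buys you at 100 Gbit/s (my arithmetic, not Broadcom's; assumes 64-byte minimum frames plus the usual 20 bytes of preamble and inter-frame gap):

```c
/* Per-packet cycle budget at 100G line rate -- my arithmetic, not Broadcom's. */
#include <stdio.h>

int main(void)
{
    double line_rate  = 100e9;             /* 100 Gbit/s                          */
    double frame_bits = (64 + 20) * 8.0;   /* min frame + preamble/IFG, in bits   */
    double pps        = line_rate / frame_bits;   /* ~148.8 million packets/sec   */
    double cycles     = 64 * 1e9;          /* 64 cores x 1 GHz                    */

    printf("packets/sec:   %.1f M\n", pps / 1e6);
    printf("cycles/packet: %.0f across all cores\n", cycles / pps);   /* ~430 */
    return 0;
}
```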
The Vivado™ Design Suite is a new IP and system-centric design environment that accelerates design productivity for the next decade of All-Programmable devices.
All-Programmable devices go beyond programmable logic and I/O, integrating various combinations of 3D stacked silicon interconnect technology, software programmable ARM® processing systems, programmable Analog Mixed Signal (AMS), and a significant amount of intellectual-property (IP) cores. These next generation devices enable designers to go beyond programmable logic to programmable systems integration, to incorporate more system functions into fewer parts, increase system performance, reduce system power, and lower BOM cost.
NYT, Carr, Navigating a Tightrope With Amazon, here. Buzz Bissinger got his book into a Starbucks promotion through Apple that subsidized his book sales. The Amazon pricer covered by offering the book at $0.00. My guess is Amazon will fix the pricing code to buy all the subsidized Apple copies and resell them as used copies at a non-zero price to keep Buzz happy. Crowdsource it as a Kickstarter project: pay $0.01 for each subsidized Bissinger book, up to the limit you think is the number of copies authorized by Starbucks. It could work.
MITx, web site, here. Anant Agarwal's Circuits course is pretty good, from the few lectures I watched. The guy has some I/O bandwidth.
MITx will offer a portfolio of MIT courses for free to a virtual community of learners around the world. It will also enhance the educational experience of its on-campus students, offering them online tools that supplement and enrich their classroom and laboratory experiences.
The first MITx course, 6.002x (Circuits and Electronics), will be launched in an experimental prototype form. Watch this space for further upcoming courses, which will become available in Fall 2012.
Xilinx, High Performance Computing Using FPGAs, Sep 2010, here.
The shift to multicore CPUs forces application developers to adopt a parallel programming model to exploit CPU performance. Even using the newest multicore architectures, it is unclear whether the performance growth expected by the HPC end user can be delivered, especially when running the most data- and compute-intensive applications. CPU-based systems augmented with hardware accelerators as co-processors are emerging as an alternative to CPU-only systems. This has opened up opportunities for accelerators like Graphics Processing Units (GPUs), FPGAs, and other accelerator technologies to advance HPC to previously unattainable performance levels.
I buy the argument to a degree. As the number of cores per chip grows, the easy pipelining and parallelization opportunities will diminish, so the argument gets stronger the more cores per chip there are. At 8 cores or fewer per general-purpose chip it's sort of a futuristic, theoretical argument: more than a few programmers can figure out how to code up a 4- to 8-stage pipeline for their application without massive automated assistance. But the FPGA opportunity does exist.
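The "mortals can do this by hand" point is easy to back up for the embarrassingly parallel FinQuant case, where the work is just independent evaluations (a sketch reusing the hypothetical bs_call kernel above; compile with -fopenmp):

```c
/* Data-parallel sketch: spread independent option evaluations over 4-8 cores.
 * bs_call() is the hypothetical kernel sketched earlier; compile with -fopenmp. */
double bs_call(double S, double K, double r, double sigma, double T);

void price_book(int n, const double *S, const double *K,
                double r, double sigma, double T, double *out)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        out[i] = bs_call(S[i], K[i], r, sigma, T);   /* each eval is independent */
    }
}
```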
The convergence of storage and Ethernet networking is driving the adoption of 40G and 100G Ethernet in data centers. Traditionally, data is brought into the processor memory space via a PCIe network interface card. However, there is a mismatch of bandwidth between PCIe (x8, Gen3) versus the Ethernet 40G and 100G protocols; with this bandwidth mismatch, PCIe (x8, Gen3) NICs cannot support Ethernet 40G and 100G protocols. This mismatch creates the opportunity for the QPI protocol to be used in networking systems. This adoption of QPI in networking and storage is in addition to HPC.
I buy the FPGA application in the NIC space. I want my NIC to go directly to L3 pinned pages, yessir I do, 100G please.
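The mismatch the paper leans on is easy to put numbers on (my arithmetic, ignoring all overheads except the 128b/130b line coding):

```c
/* Rough PCIe Gen3 x8 vs. 100G Ethernet bandwidth check -- my arithmetic. */
#include <stdio.h>

int main(void)
{
    double lane_gtps = 8.0;              /* PCIe Gen3: 8 GT/s per lane          */
    double encoding  = 128.0 / 130.0;    /* 128b/130b line coding               */
    double lanes     = 8.0;
    double pcie_gbps = lane_gtps * encoding * lanes;   /* ~63 Gbit/s each way   */

    printf("PCIe Gen3 x8: ~%.0f Gbit/s per direction\n", pcie_gbps);
    printf("100G Ethernet wants 100 Gbit/s, so an x8 Gen3 NIC cannot keep up\n");
    return 0;
}
```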
Xilinx FPGAs double their device density from one generation to the next. Peak performance of FPGAs and processors can be estimated to show the impact of doubling the performance on FPGAs [Ref 6], [Ref 7]. This doubling of capacity directly results in increased FPGA compute capabilities.
The idea proposed here is that you want to be on the exponentially increasing density curve for the FPGAs, in lieu of the clock speed increases you are never going to see again. Sort of a complicated bet to make for mortals, maybe.
I like how they do the comparisons though. They say here is our Virtex-n basketball player and here is the best NBA Basketball player … and they show you crusty old Mike Bibby 2012. Then they say watch as the Virtex-n basketball player takes Mike Bibby down low in the post, and notice the Virtex-n basketball player is still growing exponentially. So you can imagine how much better he will do against Mike Bibby in the post next year. Finally they say that Mike Bibby was chosen as the best NBA player for this comparison by his father Henry, who was also a great NBA player.
FPGAs tend to consume power in tens of watts, compared to other multicores and GPUs that tend to consume power in hundreds of watts. One primary reason for lower power consumption in FPGAs is that the applications typically operate between 100–300 MHz on FPGAs compared to applications on high-performance processors executing between 2–3 GHz.
Silly making-lemonade-out-of-lemons argument: the minute I can have my FPGAs clocked at 3 GHz, I throw away the 300 MHz FPGAs, no?
Intel, An Introduction to the Intel QuickPath Interconnect, QPI, Jan 2009, here.
Xilinx Research Labs/NCSA, FPGA HPC – The road beyond processors, Jul 2007, here. Need more current references, but I keep hearing the same themes in arguments for FPGA HPC, so let's think about this for a bit:
FPGAs have an opening because you are not getting any more clocks from microprocessor fab shrinks: OK.
Power density: meh. Lots of FinQuant code can run on a handful of cores. The Low Latency HFT folks cannot really afford many L2 misses. The NSA boys are talking about supercomputers for crypto not binary protocol parsing.
Microprocessors have all their functions hardened in silicon, you pay for them whether you use them or not, and you can't use that silicon for something else: Meh, I don't really care if I use all the silicon on my 300 USD microprocessor as long as the code is running close to optimal on the parts of the silicon useful to my application. It would be nice if I got more runtime performance for my 300 USD, no doubt. This point is like saying Advil is bad because you don't always need to finish the bottle after you blow out your ankle. Yeah, I understand the silicon real estate is the most expensive in the world.
Benchmarks: Black-Scholes in 18 msec on an FPGA (Virtex-4 @ 110 MHz), 203x faster than an Opteron @ 2.2 GHz: You Cannot be Serious! That baseline works out to roughly 3.7 microseconds per Black-Scholes evaluation, which was competitive performance at the turn of the century. The relative-speedup slides and quotations make me nervous. Oh, Celoxica provided the data. Hey: Black Scholes in 36 Nanoseconds on a single core of a dual-core off-the-shelf general-purpose microprocessor from 2007. So the Virtex-4 doing 1M Black-Scholes evaluations in 18 milliseconds is roughly flat to competitive code on a dual-core general-purpose off-the-shelf microprocessor in 2007; the arithmetic is sketched after this list.
Make it easy for the users to use this hardware and get "enough of a performance" increase to be useful: meh, it's for applications that do not need to go fast, for now (2007)?
Do not try to be the fastest thing around when being as fast with less power is sufficient: meh, really do not care so much about the power thing
FPGA: different operations map to different silicon, which allows massive pipelining and lots of parallelism: OK. So why bother with the previous two points?
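Here is the arithmetic behind the yellow card on the benchmark point above (my numbers; assuming, as the slide seems to, that the 18 ms covers a 1M-evaluation batch):

```c
/* Sanity check on the Virtex-4 Black-Scholes benchmark claim -- my arithmetic,
 * assuming the 18 ms figure is for a batch of 1M evaluations. */
#include <stdio.h>

int main(void)
{
    double fpga_ns  = 18e-3 / 1e6 * 1e9;   /* 18 ms / 1M evals = 18 ns per eval   */
    double core_ns  = 36.0;                /* 2007: 36 ns per eval on one core    */
    double dual_ns  = core_ns / 2.0;       /* ~18 ns amortized over two cores     */
    double baseline = fpga_ns * 203.0;     /* implied Opteron baseline, ~3.7 us   */

    printf("FPGA:                     %.0f ns per eval\n", fpga_ns);
    printf("2007 dual core:           ~%.0f ns per eval\n", dual_ns);
    printf("Implied Opteron baseline: %.0f ns per eval\n", baseline);
    return 0;
}
```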
Eggers/ U. Washington, CHiMPS, here. Eggers is reasonable.
There have been (at least) two hindrances to the widespread adoption of FPGAs by scientific application developers: having to code in a hardware description language, such as Verilog (with its accompanying hardware-based programming model) and poor FPGA memory performance for random memory accesses. CHiMPS, our C-to-FPGA synthesis compiler, solves both problems with one memory architecture, the many-cache memory model.
Many-cache organizes the small, distributed memories on an FPGA into application-specific caches, each targeting a particular data structure or region of memory in an application and each customized for the particular memory operations that access it.
CHiMPS provides all the traditional benefits we expect from caching. To reduce cache latency, CHiMPS duplicates the caches, so that they’re physically located near the hardware logic blocks that access them. To increase memory bandwidth, CHiMPS banks the caches to match the memory parallelism in the code. To increase task-level parallelism, CHiMPS duplicates caches (and their computation blocks) through loop unrolling and tiling. Despite the lack of FPGA support for cache coherency, CHiMPS facilitates data sharing among FPGA caches and between the FPGA and its CPU through a simple flushing of cached values. And in addition, to harness the potential of the massively parallel computation offered by FPGAs, CHiMPS compiles to a spatial dataflow execution model, and then provides a mechanism to order dependent memory operations to retain C memory ordering semantics.
CHiMPS’s compiler analyses automatically generate the caches from C source. The solution allows scientific programmers to retain their familiar programming environment and memory model, and at the same time provides performance that is on average 7.8x greater and power that is one fourth that of a CPU executing the same source code. The CHiMPS work has been published in the International Symposium on Computer Architecture (ISCA, 2009), the International Conference on Field Programmable Logic and Applications (FPL, 2008), and High-Performance Reconfigurable Computing Technology and Applications (HPRCTA, 2008), where it received the Best Paper Award.
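To make the many-cache idea concrete, here is the kind of plain ANSI C loop such a compiler ingests, with comments on how I read the paper's mapping (illustrative only; this is not CHiMPS syntax or output):

```c
/* Illustrative only -- not CHiMPS syntax or output. Per the paper, each array
 * would get its own small on-chip cache, sized and banked for how this loop
 * touches it, and loop unrolling duplicates the caches and datapaths. */
void smooth(const float *a, const float *b, float *y, int n)
{
    for (int i = 1; i < n - 1; i++) {
        /* 'a' is read with a 3-point stencil -> a small, multi-banked cache;
         * 'b' is streamed once -> a minimal cache; 'y' is write-only. */
        y[i] = 0.25f * a[i - 1] + 0.5f * a[i] + 0.25f * a[i + 1] + b[i];
    }
}
```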
Algorithms, Sedgewick and Wayne, 4th Edition, here. They're joining the free online course trend, with a course starting in Aug 2012, here. They are going to sell some books, I think. First, they have the Princeton logo for the course; that's a big deal, I suspect. Second, isn't this a case where the dog and the tail are mixed up? We're leading with the Book (the tail) and then mention there is this course thing (the Dog) you are probably not interested in. So where do we mention that course, the one that Bezos took before he went to Amazon? Somewhere toward the end of the announcement where people won't see it. And let's make it two courses so they have to sign up for each one, in case they don't like the first one.
The Headline announcement should be:
Psst, Kid wanna do Algorithms with Sedgewick? Wouldja look at this? Here is a free internet course starting in August, direct from Pee – rince – ton Univer- si – tay. U in? …good!
Oh, and there’s a book that will help you solve the homework problems at Amazon, buy it if you need it. Gotta go kid, see ya.
Irving Wladawsky-Berger, Blog, here. Head of IBM Research back in the day. Probably worth tracking to see where it goes.
Technology Review, Moore's Law Lives Another Day, here. Confirms what was bugging me: the 3D lithography techniques have been around since the late 80s in Japan. Used to see them in DRAM manufacturing presentations. 22 years to volume production in a 22 nm process for x86. Wired, April 19, 1965: How Do You Like It? Moore, Moore, Moore, here. Puff piece, but it has a link to Moore's 1965 paper in Electronics magazine.
Next-generation 20 nm processes can support optimized versions for low power and high performance, according to an IBM expert. GlobalFoundries will decide in August whether or not it will offer such variations. Those were just two data points from wide ranging discussions at the GSA Silicon Summit here. Separately, executives said a variety of 3-D ICs will hit the market in 2014 despite numerous challenges, and CMOS scaling is slowing down but still viable through a 7 nm node. “Recently TSMC said at 20 nm there are no significant differences [in process optimizations], but I don’t believe that,” said Subramanian Iyer, an IBM fellow and chief technologist in its microelectronics division. “I believe at same node you can have two [different variations],” he said in a keynote here. Indeed, GlobalFoundries is debating whether it wants to offer high performance and low power variants of a 20 nm process it is putting in place today.
Fabless FPGA vendor Achronix Semiconductor Corp. (Santa Clara, Calif.) has announced details of its Speedster22i HD and HP product families, claimed to be the first FPGAs to be built on a 22-nm manufacturing process technology. The devices are the result of a foundry agreement with Intel Corp. announced in November 2010, and the first devices are due to sample in the third quarter of 2012. Both the HD (high density) and HP (high performance) families come loaded with a variety of high-speed data communications interfaces hardwired. These include 10/40/100G Ethernet MACs, 100Gbit Interlaken channels, PCI Express and DDR3 memory channels that run at up to 2133 Mbps. In the case of the HD1000 device these are two, two, two and six respectively. This optimizes the Speedster22i FPGAs for work in networking and telecommunications equipment, although the company stresses that the devices can find applications in servers, high-performance computing, military, industrial and scientific applications. The large number of high-speed memory channels provides the industry's highest bandwidth FPGAs, Achronix claimed. Achronix's existing product range is based on 65-nm process technology, and the move to Intel's 22-nm FinFET process allows the Speedster22i family to consume half the power at half the cost of high-end, 28-nm FPGAs.
Wired, How to Spot the Future, here.
This may sound like a paradox. Surely technology always promises something radically new, wholly unexpected, and unlike anything anybody has seen before. But in fact even when a product or service breaks new ground, it’s usually following a familiar trajectory. After all, the factors governing thermodynamics, economics, and human interaction don’t change that much. And they provide an intellectual platform that has allowed technology to succeed on a massive scale, to organize, to accelerate, to connect.
So how do we spot the future—and how might you? The seven rules that follow are not a bad place to start. They are the principles that underlie many of our contemporary innovations. Odds are that any story in our pages, any idea we deem potentially transformative, any trend we think has legs, draws on one or more of these core principles. They have played a major part in creating the world we see today. And they’ll be the forces behind the world we’ll be living in tomorrow.
Noahpinion, Thursday Roundup, here. I need to know more about how money market funds and commercial paper broke in the credit crisis; he points to Cochrane on money markets, which is a start. The idea is to sort out where bank runs can happen as confidence evaporates. What breaks first next time? Noah Smith could be the EcoFin summary guy after DeLong.
HPC Wire, Some Thoughts on Intel’s Acquisition of Cray’s Interconnect Technology, here. They lead with:
The reasons for this deal, in my opinion, are as follows:
The general trend to commodity components continues. For small companies like Cray (with about 800 employees), it is simply too expensive to innovate and develop sophisticated hardware such as an interconnect for exascale computing. And Intel is certainly able to take this on, especially now with all the expertise gained from the previous QLogic acquisition and now from the 74 interconnect experts moving from Cray to Intel.
While I was feverishly triaging my last-place fantasy basketball team this past weekend, f2bbooks eclipsed 500 hits for January with no help from us. Must be a record for Business Math blogs. Let's check in with our nemesis Business Math blog, Mrs. Hooker's Blog at edublogs – Just another Wicomico County Public Schools Edublog. Sure, f2bbooks has a clear posting lead, but anything can happen in this hypercompetitive Social Networking world. As long as Mrs. Hooker doesn't catch on to FPGAs and the credit derivative pricing angle, everything is going to be copacetic in the page views competition.
Tilera started sampling the latest Tile 64-bit processors with 16 and 36 cores at 1.2 GHz, here.
HPC Wire: The Year Ahead in HPC – they are going long GPUs (NVIDIA Kepler) and Intel Knights Corner.
Tao and Gowers re: Reed Elsevier journal pricing protest, here.
Bunch of JP Morgan Maxeler stuff follows:
Workshop on High Performance Computational Finance at SC10, here.
Rapid computation of value and risk for derivatives portfolios
Practical Quant: Maxeler JP Morgan, here.
Maxeler in Ptown at the Equad 31 Jan: Maximum Performance Computing for Exascale Applications here.
Money Science: here.
Xcell Journal: full text of Rapid computation of value and risk for derivatives portfolios, via Xilinx, here. Wow, this is published in the first quarter of 2011 and matches in many ways the contents of the JPM Stanford talk linked to earlier. I can think of a bunch of followup questions. Let’s pull them into a single post after we pick through these references.
Technology in Banking slides, here. Maxeler derivative pricing.