Nadav Rotem, C to Verilog, here.
C-to-Verilog is a free, open-source, on-line C to Verilog compiler. You can copy and paste your existing C code and our on-line compiler will synthesize it into optimized Verilog.
For additional information on how to use our website to create FPGA designs, watch the screencast below.
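For a sense of the kind of input such a tool expects, here is a minimal, hypothetical ANSI C kernel of the sort one might paste into the compiler; the function name and sizes are mine, but fixed loop bounds, static arrays, and simple indexing are the features high-level synthesis tools generally handle best.

```c
/* fir8.c - hypothetical example input for a C-to-Verilog style HLS tool.
   Fixed bounds and static arrays keep the hardware mapping simple. */
#define N 256
#define TAPS 8

void fir8(const int x[N], const int h[TAPS], int y[N])
{
    for (int i = TAPS - 1; i < N; i++) {      /* one output sample per iteration  */
        int acc = 0;
        for (int t = 0; t < TAPS; t++)        /* inner loop can be fully unrolled */
            acc += h[t] * x[i - t];           /* into parallel multiply-adds      */
        y[i] = acc;
    }
}
```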
Andrew Putnam, Susan Eggers, et al., Performance and Power of Cache-Based Reconfigurable Computing, here. Did not see these slides at Susan Eggers’ website. Note Prassanna Sundararajan’s name is on the slides – he must know something about 2008 CHiMPS.
This paper describes CHiMPS, a C-based accelerator compiler for hybrid CPU-FPGA computing platforms. CHiMPS’s goal is to facilitate FPGA programming for high performance computing developers. It inputs generic ANSI C code and automatically generates VHDL blocks for an FPGA. The accelerator architecture is customized with multiple caches that are tuned to the application. Speedups of 2.8x to 36.9x (geometric mean 6.7x) are achieved on a variety of HPC benchmarks with minimal source code changes.
Looking at the CHiMPS paper: they have a Black-Scholes benchmark but only quote relative performance between x86/gcc and CHiMPS/FPGA. It sure looks like the FPGA performance here is compared against naive x86 code. Recall Pink I published Black Scholes in 36 nanoseconds on 2007 commodity hardware using XLC. So unless CHiMPS was being misleading here, I would have to assume they had a 2008 FPGA that did Black-Scholes evals in about 2.05 (36/17.5) nanoseconds. I’m gonna go ahead and give a time-travel-to-2008 shenanigans yellow card, because I don’t believe there was enough non-ILP parallelism in the basic Black-Scholes formula evaluation given the FPGA clock speeds circa 2008 (something like 550 MHz); see Xilinx References Apr 2012.
Speedup numbers do not fully convey the performance potential of the CHiMPS accelerators, because performance is often highly dependent on the input data size and the computation-communication ratio. To illustrate this, Figure 2 depicts Black-Scholes’s sensitivity to input size. For very small numbers of stocks, using CHiMPS is detrimental to performance, because the data transfer time to the FPGA is greater than executing on the CPU. Performance breaks even around 24 stocks, and continues to increase until it peaks at a speedup of 17.6 (700 stocks). Speedup then decreases since the data for more stocks no longer fits in the FPGA caches, and the CPU’s larger cache and memory prefetcher keep the CPU execution time steady.
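For reference, and to make the ILP point concrete, here is a standard scalar Black-Scholes call-price evaluation in C (my own sketch, not the benchmark code from either paper). It is a short chain of exp/log/sqrt/division plus two normal-CDF evaluations; at roughly 2 ns per result on a circa-2008 FPGA clocked around 550 MHz, you would have to retire about one complete evaluation per clock, which is hard to see coming from instruction-level parallelism alone without replicating the whole pipeline.

```c
/* bs_call.c - scalar Black-Scholes call price, for operation-count context.
   A generic textbook implementation, not the CHiMPS or Celoxica code. */
#include <math.h>
#include <stdio.h>

static double norm_cdf(double x)              /* Phi(x) via the C99 erf() */
{
    return 0.5 * (1.0 + erf(x / sqrt(2.0)));
}

double bs_call(double S, double K, double r, double sigma, double T)
{
    double d1 = (log(S / K) + (r + 0.5 * sigma * sigma) * T) / (sigma * sqrt(T));
    double d2 = d1 - sigma * sqrt(T);
    return S * norm_cdf(d1) - K * exp(-r * T) * norm_cdf(d2);
}

int main(void)
{
    /* spot 100, strike 100, 5% rate, 20% vol, 1 year: roughly 10.45 */
    printf("%f\n", bs_call(100.0, 100.0, 0.05, 0.20, 1.0));
    return 0;
}
```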
Greg Pfister, The Perils of Parallel, Intel Xeon Phi Announcement (&me), here. Check out the Intel links at the end of the post.
Number one is their choice as to the first product. The one initially out of the blocks is, not a lower-performance version, but rather the high end of the current generation: The one that costs more ($2649) and has high performance on double precision floating point. Intel says it’s doing so because that’s what its customers want. This makes it extremely clear that “customers” means the big accounts – national labs, large enterprises – buying lots of them, as opposed to, say, Prof. Joe with his NSF grant or Sub-Department Kacklefoo wanting to try it out. Clearly, somebody wants to see significant revenue right now out of this project after so many years. They have had a reasonably-sized pre-product version out there for a while, now, so it has seen some trial use. At national labs and (maybe?) large enterprises.
I fear we have closed in on an incrementally-improving era of computing, at least on the hardware and processing side, requiring inhuman levels of effort to push the ball forward another picometer. Just as well I’m not hanging my living on it any more.
Got some time before I have to go tend the fantasy bball team mired in the middle of the dogpile. This is a generalized two-exchange trade crossing model. The generalization to more than two exchanges is straightforward.
Here, in the DynaPie Latency Model figure (below), we have represented two colo installations. The first colo contains Exchange EX1, Gateway GW1, and Client Server CL1. The second colo similarly contains an exchange, gateway, and client server (EX2, GW2, CL2). The client servers run the relevant exchange protocols to perform order execution via the respective broker-dealer gateways and solicit current market data from each exchange locally.

To simplify the analysis, we assume all latency is accounted for in the links; for example, the portion of the gateway latency not accounted for in the link latency is negligible. The market data and client-to-gateway link latency, represented by the variable L, is equal to a constant C. We assume the latency for any link within a colo is a constant. If you worry about this, you can take C to be the longest latency, measured on average over a sufficiently large sampling period, of all links within a single colo installation.

The client, via the colo gateways, has two order entry options. CL1 can send a new order (without loss of generality, any other message) to EX1 via GW1, or send a new order to EX2 via GW1. CL2 has the symmetric corresponding order options via GW2. The latency to submit a new order within the colo is C, as previously discussed. The latency cost to submit a new order from GW1 to EX2 is ax+b, where a and b are constants and x is proportional to the minimal transmission line latency connecting GW1 and EX2. Similarly, the latency between GW2 and EX1 is ax+b (we assume the same transmission lines, NICs, and switch connecting the gateways and exchanges). Finally, the latency between the two client servers is represented as dy+e, where d and e are constants and y is proportional to the minimal transmission line latency connecting CL1 and CL2. See the sketch below for the route latencies this implies.
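A minimal sketch of the model as I read it, with the three obvious CL1 routes written out; the variable names and the sample numbers are mine and purely illustrative:

```c
/* dynapie_latency.c - toy evaluation of the two-colo order routing model.
   local link = C, gateway-to-remote-exchange = a*x + b, client-to-client = d*y + e. */
#include <stdio.h>

int main(void)
{
    double C = 5e-6;              /* intra-colo link latency (seconds), illustrative */
    double a = 1.0, b = 2e-6;     /* gateway <-> remote exchange link parameters     */
    double d = 1.0, e = 2e-6;     /* client <-> client link parameters               */
    double x = 200e-6;            /* minimal GW1-EX2 transmission line latency       */
    double y = 200e-6;            /* minimal CL1-CL2 transmission line latency       */

    double local_order   = C;               /* CL1 -> GW1 -> EX1, all within the colo */
    double remote_order  = C + a * x + b;   /* CL1 -> GW1 -> EX2                      */
    double relayed_order = (d * y + e) + C; /* CL1 -> CL2 -> GW2 -> EX2               */

    printf("local order   : %.1f us\n", local_order   * 1e6);
    printf("remote order  : %.1f us\n", remote_order  * 1e6);
    printf("relayed order : %.1f us\n", relayed_order * 1e6);
    return 0;
}
```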
Salmon, Sell-side research isn’t inside information, here.
So let’s let brokerages’ clients trade what they like, so long as they’re not trading on genuinely inside information from the company in question. If we’re going to be serious about the Volcker Rule, and prevent the brokerages from trading for their own account, the least we can do is let them monetize their analysts’ research as best they can.
HPC Wire, NVIDIA’s Bill Dally Talks 3D Chips and More at GTC, here. Keep an eye out for Dally interviews and talks.
The conversation turned to Stanford and what Dally views as the University’s most promising research. He mentioned a program where researchers are looking to take supercomputing interconnect technology and deliver it to commercial datacenters. Stanford University has worked with Cray on the Dragonfly interconnect for the Cascade system and began pitching the technology to Google and Facebook. According to him, they loved the technology because of its low latency. The Stanford team plans to test the design on a small FPGA cluster and if everything goes as planned, they’ll start looking for a commercial adopter.
The Register, Inside Nvidia’s GK110 monster GPU, here.
At the tail end of the GPU Technology Conference in San Jose this week, graphics chip juggernaut and compute wannabe Nvidia divulged the salient characteristics of the high-end “Kepler2” GK110 GPU chips that are going to be the foundation of the two largest supercomputers in the world and that are no doubt going to make their way into plenty of workstations and clusters in the next several years.
If you just want awesome graphics, then the dual-chip GTX 690 graphics card, which is based on the smaller “Kepler1” GK104 GPU chip, which Nvidia previewed back in March, is what you want. And if you want to do single-precision floating point math like mad, then the Tesla K10 coprocessor, also sporting two GK104 chips, is what you need to do your image processing, signal processing, seismic processing, or chemical modeling inside of server clusters.
NYT, Discord at Key JPMorgan Unit Is Faulted in Loss, here. Juicy but ultimately not moving the discovery forward. Where is the P&L on JPM’s counterparties? The loss can get larger or smaller, internal squabbles can spill into view, and there can be televised Congressional hearings but it’s all sort of missing the big picture. Who said “Follow the money?” They were smart.
John Scalzi, Epic troll, Straight White Male: The Lowest Difficulty Setting There Is, here. Found it via Brad DeLong, here. I cannot determine if DeLong knows about internet trolls in the same way he knows about, oh, say, Eurozone austerity. But he is DeLong and we’re not, so it all sort of evens out in the limit.
Dudes. Imagine life here in the US — or indeed, pretty much anywhere in the Western world — is a massive role playing game, like World of Warcraft except appallingly mundane, where most quests involve the acquisition of money, cell phones and donuts, although not always at the same time. Let’s call it The Real World. You have installed The Real World on your computer and are about to start playing, but first you go to the settings tab to bind your keys, fiddle with your defaults, and choose the difficulty setting for the game. Got it?
Okay: In the role playing game known as The Real World, “Straight White Male” is the lowest difficulty setting there is.
This means that the default behaviors for almost all the non-player characters in the game are easier on you than they would be otherwise. The default barriers for completions of quests are lower. Your leveling-up thresholds come more quickly. You automatically gain entry to some parts of the map that others have to work for. The game is easier to play, automatically, and when you need help, by default it’s easier to get.
HPC Wire, Intel Rolls Out New Server CPUs, here.
Since the E5-4600 supports the Advanced Vector Extensions (AVX), courtesy of the Sandy Bridge microarchitecture, the new chip can do floating point operations at twice the clip of its pre-AVX predecessors. According to Intel, a four-socket server outfitted with E5-4650 CPUs can deliver 602 gigaflops on Linpack, which is nearly twice the flops that can be achieved with the top-of-the-line E7 technology. That makes this chip a fairly obvious replacement for the E7 when the application domain is scientific computing.
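A quick sanity check on that 602 gigaflops figure, assuming the E5-4650’s 8 cores per socket at a 2.7 GHz base clock and Sandy Bridge’s 8 double-precision flops per cycle per core (my assumptions, not numbers from the article):

```c
/* e5_4650_peak.c - back-of-envelope peak vs. reported Linpack for a 4-socket box. */
#include <stdio.h>

int main(void)
{
    double sockets = 4, cores = 8, ghz = 2.7;      /* assumed E5-4650 configuration   */
    double flops_per_cycle = 8.0;                  /* 4-wide AVX add + 4-wide AVX mul */
    double peak = sockets * cores * ghz * flops_per_cycle;   /* gigaflops             */
    double linpack = 602.0;                                   /* from the article     */

    printf("theoretical peak  : %.1f GF\n", peak);                   /* ~691 GF       */
    printf("Linpack efficiency: %.0f%%\n", 100.0 * linpack / peak);  /* ~87%          */
    return 0;
}
```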
In the Adaptive Markets Hypothesis (AMH) intelligent but fallible investors learn from and adapt to changing economic environments. This implies that markets are not always efficient, but are usually competitive and adaptive, varying in their degree of efficiency as the environment and investor population change over time. The AMH has several implications including the possibility of negative risk premia, alpha converging to beta, and the importance of macro factors and risk budgeting in asset-allocation policies.
NYT, Stock Trading Is Still Falling After ’08 Crisis, here.
Trading in the United States stock market has not only failed to recover since the 2008 financial crisis, it has continued to fall. In April, the average daily trades in American stocks on all exchanges stood at nearly half of its peak in 2008: 6.5 billion compared with 12.1 billion, according to Credit Suisse Trading Strategy.
The decline stands in marked contrast to past economic recoveries, when Americans regained their taste for stock trading within two years of economic shocks in 1987 and 2001.
IEEE Spectrum, John L. Hennessy: Risk Taker, here.
In the 1980s, John L. Hennessy, then a professor of electrical engineering at Stanford University, shook up the computer industry by taking the concepts of reduced instruction set computing (RISC) to the masses. Hennessy wrote papers, gave talks, designed chips, started companies, and even, literally, wrote the book (a textbook that’s still used today). The RISC architecture, which focused on simpler, lower-cost microprocessors, was then thought to be an academic exercise with little practical use; today it plays a major role in the industry.
Hennessy, now president of Stanford, is once again designing, testing, and advocating a new architecture, this time in the field of university education. He first began rethinking research at universities and recently began reimagining university education itself.
HPC Wire, AMD: The Integration Revolution?, here.
While it has been recognized for some time that GPUs can be used to do parallel processing, the programmer’s task has been difficult if not extraordinary. That’s where AMD’s Heterogeneous System Architecture (HSA) comes in. HSA enables a new way to program applications using the GPU that can make it easier for mainstream programmers. AMD’s HSA is a full solution approach, enabling mainstream programmers to write parallel processing code as easily for the GPU as the CPU. And in some cases, the code may be able to execute on either the CPU or GPU, based on the system’s resources.
One way HSA can help solve the problem is by providing a unified address space for the CPU and GPU. With HSA, GPUs support the same page tables x86 CPUs use for mapping program memory pages to physical memory. Now GPUs can use a much larger memory map and, more importantly, a pointer is usually the same for code running on the CPU and code running on the GPU. The latter allows one copy of data to exist in memory and both the CPU and GPU can act upon it. The programmer doesn’t have to manage two or more copies of the same data. This design also helps improve performance because it is no longer necessary to make copies and keep them synchronized.
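A conceptual C sketch of the difference, with a stand-in “device” function rather than any real HSA runtime call (the function name is hypothetical and the “device” work simply runs on the host here); the point is only that a unified address space lets one pointer and one copy of the data serve both sides.

```c
/* hsa_sketch.c - illustrates two-copy vs. shared-pointer data models.
   No real GPU runtime is used; device_scale() stands in for a GPU kernel launch. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static void device_scale(float *data, int n)   /* pretend this runs on the GPU */
{
    for (int i = 0; i < n; i++) data[i] *= 2.0f;
}

int main(void)
{
    enum { N = 4 };
    float host[N] = { 1, 2, 3, 4 };

    /* Separate address spaces: copy in, compute, copy back, keep buffers in sync. */
    float *dev = malloc(N * sizeof *dev);
    memcpy(dev, host, N * sizeof *dev);
    device_scale(dev, N);
    memcpy(host, dev, N * sizeof *dev);
    free(dev);

    /* Unified address space: the same pointer is valid for CPU and GPU code, so
       there is one copy of the data and nothing to synchronize by hand. */
    device_scale(host, N);

    for (int i = 0; i < N; i++) printf("%g ", host[i]);
    printf("\n");
    return 0;
}
```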
Xilinx, High Performance Computing Using FPGAs, Sep 2010, here.
The shift to multicore CPUs forces application developers to adopt a parallel programming model to exploit CPU performance. Even using the newest multicore architectures, it is unclear whether the performance growth expected by the HPC end user can be delivered, especially when running the most data- and compute-intensive applications. CPU-based systems augmented with hardware accelerators as co-processors are emerging as an alternative to CPU-only systems. This has opened up opportunities for accelerators like Graphics Processing Units (GPUs), FPGAs, and other accelerator technologies to advance HPC to previously unattainable performance levels.
I buy the argument to a degree. As the number of cores per chip grows, the easy pipelining and parallelization opportunities will diminish. The argument gets stronger the more cores there are per chip; at 8 cores or fewer per general-purpose chip it’s sort of a futuristic, theoretical argument. More than a few programmers can figure out how to code up a 4 to 8 stage pipeline for their application without massive automated assistance. But the FPGA opportunity does exist.
The convergence of storage and Ethernet networking is driving the adoption of 40G and 100G Ethernet in data centers. Traditionally, data is brought into the processor memory space via a PCIe network interface card. However, there is a mismatch of bandwidth between PCIe (x8, Gen3) versus the Ethernet 40G and 100G protocols; with this bandwidth mismatch, PCIe (x8, Gen3) NICs cannot support Ethernet 40G and 100G protocols. This mismatch creates the opportunity for the QPI protocol to be used in networking systems. This adoption of QPI in networking and storage is in addition to HPC.
I buy the FPGA application in the NIC space. I want my NIC to go directly to L3 pinned pages, yessir I do, 100G please.
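The bandwidth arithmetic behind the mismatch, by my own back-of-envelope numbers (8 GT/s per Gen3 lane with 128b/130b encoding, per direction):

```c
/* pcie_vs_ethernet.c - rough per-direction bandwidth comparison. */
#include <stdio.h>

int main(void)
{
    double lane_gbps   = 8.0 * 128.0 / 130.0;       /* PCIe Gen3: ~7.88 Gb/s per lane */
    double pcie_x8_GBs = 8.0 * lane_gbps / 8.0;     /* x8 link: ~7.9 GB/s             */
    double eth40_GBs   = 40.0 / 8.0;                /* 5.0 GB/s                       */
    double eth100_GBs  = 100.0 / 8.0;               /* 12.5 GB/s                      */

    printf("PCIe Gen3 x8 : %.1f GB/s\n", pcie_x8_GBs);
    printf("40G Ethernet : %.1f GB/s\n", eth40_GBs);
    printf("100G Ethernet: %.1f GB/s\n", eth100_GBs);   /* exceeds the x8 Gen3 link   */
    return 0;
}
```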
Xilinx FPGAs double their device density from one generation to the next. Peak performance of FPGAs and processors can be estimated to show the impact of doubling the performance on FPGAs [Ref 6], [Ref 7]. This doubling of capacity directly results in increased FPGA compute capabilities.
The idea proposed here is that you want to be on the exponentially increasing density curve for the FPGAs in lieu of clock speed increases you are never going to see again. Sort of a complicated bet to make for mortals, maybe.
I like how they do the comparisons though. They say here is our Virtex-n basketball player and here is the best NBA Basketball player … and they show you crusty old Mike Bibby 2012. Then they say watch as the Virtex-n basketball player takes Mike Bibby down low in the post, and notice the Virtex-n basketball player is still growing exponentially. So you can imagine how much better he will do against Mike Bibby in the post next year. Finally they say that Mike Bibby was chosen as the best NBA player for this comparison by his father Henry, who was also a great NBA player.
FPGAs tend to consume power in tens of watts, compared to other multicores and GPUs that tend to consume power in hundreds of watts. One primary reason for lower power consumption in FPGAs is that the applications typically operate between 100–300 MHz on FPGAs compared to applications on high-performance processors executing between 2–3 GHz.
Silly making-lemonade-out-of-lemons argument: the minute I can have my FPGAs clocked at 3 GHz, I throw away the 300 MHz FPGAs, no?
Intel, An Introduction to the Intel QuickPath Interconnect, QPI, Jan 2009, here.
Xilinx Research Labs/NCSA, FPGA HPC – The road beyond processors, Jul 2007, here. Need more current references, but I keep hearing the same themes in arguments for FPGA HPC, so let’s think about this for a bit:
FPGAs have an opening because you are not getting any more clocks from microprocessor fab shrinks: OK.
Power density: meh. Lots of FinQuant code can run on a handful of cores. The Low Latency HFT folks cannot really afford many L2 misses. The NSA boys are talking about supercomputers for crypto, not binary protocol parsing.
Microprocessors have all functions hardened in silicon, you pay for them whether you use them or not, and you can’t use that silicon for something else: Meh, I don’t really care whether I use all the silicon on my 300 USD microprocessor as long as the code is running close to optimal on the parts of the silicon useful to my application. It would be nice if I got more runtime performance for my 300 USD, no doubt. This point is like saying Advil is bad because you don’t always need to finish the bottle after you blow out your ankle. Yeah, I understand the silicon real estate is the most expensive in the world.
Benchmarks: Black Scholes in 18 msec on a 110 MHz Virtex-4, 203x faster than a 2.2 GHz Opteron: You Cannot be Serious! That baseline works out to roughly 3.7 microseconds per Black-Scholes evaluation (18 ms for 1M evaluations is 18 ns per eval on the FPGA; multiply by 203 for the Opteron figure), and 3.7 microseconds was competitive performance at the turn of the century. The relative speedup slides and quotations make me nervous. Oh, Celoxica provided the data – hey, Black Scholes in 36 nanoseconds on a single core of a dual-core off-the-shelf general-purpose microprocessor from 2007. So the Virtex-4 does 1M Black-Scholes evaluations in 18 milliseconds, flat to competitive code on a dual-core general-purpose off-the-shelf microprocessor in 2007.
Make it easy for the users to use this hardware and get “enough of a performance” increase to be useful: meh, it’s for applications that do not need to go fast, for now (2007)?
Do not try to be the fastest thing around when being as fast with less power is sufficient: meh, really do not care so much about the power thing
FPGA: Different operations map to different silicon allows massive pipelining; lots of parallelism: OK. So, why bother with the previous two points?
Eggers/ U. Washington, CHiMPS, here. Eggers is reasonable.
There have been (at least) two hindrances to the widespread adoption of FPGAs by scientific application developers: having to code in a hardware description language, such as Verilog (with its accompanying hardware-based programming model) and poor FPGA memory performance for random memory accesses. CHiMPS, our C-to-FPGA synthesis compiler, solves both problems with one memory architecture, the many-cache memory model.
Many-cache organizes the small, distributed memories on an FPGA into application-specific caches, each targeting a particular data structure or region of memory in an application and each customized for the particular memory operations that access it.
CHiMPS provides all the traditional benefits we expect from caching. To reduce cache latency, CHiMPS duplicates the caches, so that they’re physically located near the hardware logic blocks that access them. To increase memory bandwidth, CHiMPS banks the caches to match the memory parallelism in the code. To increase task-level parallelism, CHiMPS duplicates caches (and their computation blocks) through loop unrolling and tiling. Despite the lack of FPGA support for cache coherency, CHiMPS facilitates data sharing among FPGA caches and between the FPGA and its CPU through a simple flushing of cached values. And in addition, to harness the potential of the massively parallel computation offered by FPGAs, CHiMPS compiles to a spatial dataflow execution model, and then provides a mechanism to order dependent memory operations to retain C memory ordering semantics.
CHiMPS’s compiler analyses automatically generate the caches from C source. The solution allows scientific programmers to retain their familiar programming environment and memory model, and at the same time provides performance that is on average 7.8x greater and power that is one fourth that of a CPU executing the same source code. The CHiMPS work has been published in the International Symposium on Computer Architecture (ISCA, 2009), the International Conference on Field Programmable Logic and Applications (FPL, 2008), and High-Performance Reconfigurable Computing Technology and Applications (HPRCTA, 2008), where it received the Best Paper Award.
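To make the many-cache idea concrete, here is the sort of plain C loop such a C-to-FPGA flow takes as input, with comments sketching (in my words, not CHiMPS syntax or pragmas) how a many-cache compiler could map it: each array gets its own small on-chip cache sized and banked for its access pattern, the loop body becomes a pipelined dataflow block, and an explicit flush stands in for coherence when results go back to the CPU.

```c
/* stencil.c - generic C kernel of the kind a many-cache C-to-FPGA compiler
   might consume.  The cache-mapping notes are illustrative, not CHiMPS syntax. */
#define N 1024

void smooth(const float a[N], float w, float out[N])
{
    /* a[]  : read-only, streaming access  -> small read cache, banked for a[i-1..i+1] */
    /* out[]: write-only, sequential       -> separate write cache, flushed at the end */
    for (int i = 1; i < N - 1; i++)
        out[i] = w * a[i] + 0.5f * (1.0f - w) * (a[i - 1] + a[i + 1]);
}
```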
SD Times, Fog Around Intel Compilers, here.
Agner Fog is a computer science professor at the University of Copenhagen’s college of engineering. As he puts it, “I have done research on microprocessors and optimized code for more than 12 years. My motivation is to make code compatible, especially when it pretends to be.”
Fog has written a number of blog entries about Intel’s compilers and how they treat competing processors. In November, AMD and Intel settled, and Fog has written up a magnificent analysis of the agreement.
If you have any interest in compilers, and in Intel’s compilers, you should definitely read his paragraph-by-paragraph read-through.
Fog, Agner, Software Optimization Resources, here. I was reading Fog’s Optimizing Software in C++ (here) this morning. It’s a runtime optimization guide for Windows, Linux, and Mac. I have seen it before and perhaps been remiss in not commenting more fully. Without the benefit of trying out many of Fog’s code samples and directives against current versions of ICC and GCC I cannot be certain, but based on what I have optimized in the recent past, his body of work looks very legitimate and exhaustive. You ask, how exhaustive? Let’s start with the copyright; it’s got a succession plan:
This series of five manuals is copyrighted by Agner Fog. Public distribution and mirroring is not allowed. Non-public distribution to a limited audience for educational purposes is allowed. The code examples in these manuals can be used without restrictions. A GNU Free Documentation License shall automatically come into force when I die. See http://www.gnu.org/copyleft/fdl.html.
Professor Fog is laying out code optimization paths for 4 different compilers on 3 different operating systems. I will not and cannot check out/verify all the scenarios presented because I possess the attention span of a squirrel compared to Professor Fog. He also provides a page on random number generators, here, which seems legit to the extent that he points you to Matsumoto’s Mersenne Twister RNG page, here. The random number references do not appear to be as comprehensive as the C++ runtime optimization references. But this looks to be a case of:
in a most complimentary way to Professor Fog.
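For flavor, the kind of micro-optimization his manuals spend a lot of pages on: breaking a loop-carried dependency by using multiple accumulators so independent additions can overlap in the pipeline. This particular example is mine, not lifted from the manual.

```c
/* accumulators.c - classic dependency-chain optimization of a reduction. */
#include <stddef.h>

/* One accumulator: each add waits on the previous one (latency-bound). */
double sum_naive(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) s += x[i];
    return s;
}

/* Four accumulators: four independent chains keep the FP adders busier. */
double sum_unrolled(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];      s1 += x[i + 1];
        s2 += x[i + 2];  s3 += x[i + 3];
    }
    for (; i < n; i++) s0 += x[i];          /* remainder */
    return (s0 + s1) + (s2 + s3);
}
```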
Extreme Tech, China plans national, unified CPU architecture, here. “Life moves pretty fast. If you don’t stop and look around once in a while, you could miss it.”
According to reports from various industry sources, the Chinese government has begun the process of picking a national computer chip instruction set architecture (ISA). This ISA would have to be used for any projects backed with government money — which, in a communist country such as China, is a fairly long list of public and private enterprises and institutions, including China Mobile, the largest wireless carrier in the world. The primary reason for this move is to lessen China’s reliance on western intellectual property.
There are at least five existing ISAs on the table for consideration — MIPS, Alpha, ARM, Power, and the homegrown UPU — but the Chinese leadership has also mooted the idea of defining an entirely new architecture. The first meeting to decide on a nationwide ISA, attended by government officials and representatives from academic groups and companies such as Huawei and ZTE, was held in March. According to MIPS vice president Robert Bismuth, a final decision will be made in “a matter of months.”
No UltraSPARC, hmm.
Crooked Timber, Harvard Library pushes open access, here.
We write to communicate an untenable situation facing the Harvard Library. … The Faculty Advisory Council to the Library, representing university faculty in all schools and in consultation with the Harvard Library leadership, reached this conclusion: major periodical subscriptions, especially to electronic journals published by historically key providers, cannot be sustained: continuing these subscriptions on their current footing is financially untenable. … It is untenable for contracts with at least two major providers to continue on the basis identical with past agreements. Costs are now prohibitive. … since faculty and graduate students are chief users, please consider the following options open to faculty and students (F) and the Library (L), state other options you think viable, and communicate your views:
Make sure that all of your own papers are accessible by submitting them to DASH in accordance with the faculty-initiated open-access policies (F). Consider submitting articles to open-access journals, or to ones that have reasonable, sustainable subscription costs; move prestige to open access (F). If on the editorial board of a journal involved, determine if it can be published as open access material, or independently from publishers that practice pricing described above. If not, consider resigning (F).
Extreme Tech, Intel Core i7-3770K review: Ivy Bridge brings lower power, better performance, here.
Intel’s Ivy Bridge (IVB) has been one of the hottest tech topics of the past 12 months — we haven’t seen this much interest in a CPU since Intel launched Nehalem. Ivy Bridge is the first 22nm processor at a time when die shrinks have become increasingly difficult, the first CPU to use FinFETs (Intel calls its specific implementation Tri-Gate), and it’s a major component of Intel’s ultrabook initiative. If all goes well, Ivy Bridge will usher in a new series of 15W ultra-mobile parts, though these won’t reach the market for a little while yet.
Ivy Bridge is a “tick” in Intel’s tick-tock model, but the company is referring to its latest architecture as a “tick+.” The reason for the change is the disparity of improvement between Ivy Bridge’s CPU and GPU components. IVB’s CPU core is a die-shrunk Sandy Bridge (SNB) with a few ultra-low-level efficiency improvements. Performance improvements on the CPU side are in the 5-10% range. Unlike Westmere (Nehalem’s “tick”), which offered 50% more cores, Ivy Bridge keeps Sandy Bridge’s quad-core configuration.
The FPGA folks get an opening in HPC floating point if they can get more aggressive on clock speed and not worry so much about power efficiency, while Intel tries to shake out AMD mobile CPU market share with the Ivy Bridge integrated GPU. I see the Maxeler bet on adding parallelism through Dataflow architecture as interesting/plausible – but nowhere near a done deal at this point.
Liquid Nitrogen Overclocking, The Fastest Rack Mounted Servers in the World, here. Running Intel i7-2700K at 5 GHz.
Financial Sense: SPY versus SPX, here. SPX historical data doesn’t account for dividends while SPY does. Chris Whalen interview, The Fallacy of “Too Big To Fail”–Why the Big Banks Will Eventually Break Up, here.
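The dividend point matters more than it looks over long windows; a toy compounding example with made-up numbers (8% annual price return, 2% dividend yield reinvested) shows how far a price-only index drifts from total return:

```c
/* price_vs_total_return.c - illustrative only; the yields are hypothetical. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    double price_ret = 0.08, div_yield = 0.02;   /* hypothetical annual figures */
    int years = 20;

    double price_only   = pow(1.0 + price_ret, years);
    double total_return = pow(1.0 + price_ret + div_yield, years);

    printf("price-only growth  : %.2fx\n", price_only);    /* ~4.7x */
    printf("total-return growth: %.2fx\n", total_return);  /* ~6.7x */
    return 0;
}
```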
IEEE Transactions on Computers, Ferrer et al., Progressive Congestion Management Based on Packet Marking and Validation Techniques, here.
Business Insider, HSBC’s Incredible Video On The Rise Of Correlated Markets, here. The rolling time window is pretty good.