Category Archives: Architecture
Nicole Hemsoth, HPC Wire, Intel Sheds Light on the “Corner to Landing” Leap, here. This seems like a rough system architecture transition for some Wall Street floating point codes. Knights Landing will give you AVX-512 (doubling your SIMD execution throughput over 256-bit AVX) but then pull the clock speed down from 3+ GHz to 1+ GHz (reportedly) in order to give you all the cores to work with. The latency-sensitive algorithmic trading folks get hurt the most in this transition, since their applications typically need the highest clock frequency as a top priority and can then find ways to use the (essentially free) ILP resources on chip. The less latency-sensitive portfolio P&L and Risk codes will have an easier transition to Knights Landing because their load-balancing requirements are much simpler; some, like Monte Carlo simulation across the Firm’s derivative contract inventory, are close to embarrassingly parallel (a minimal sketch of that shape follows).
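To make that last point concrete, here is a minimal C++/OpenMP sketch of the embarrassingly parallel shape; Contract and value() are illustrative stand-ins, not anyone's real pricing library:

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-in: a real implementation would run a Monte Carlo
// valuation inside value().
struct Contract {
    double notional = 1.0;
    double value() const { return notional; }
};

// Each contract prices independently -- no coupling between iterations --
// so the loop parallelizes trivially across cores (compile with -fopenmp).
double portfolio_value(const std::vector<Contract>& book) {
    double total = 0.0;
    #pragma omp parallel for reduction(+:total)
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(book.size()); ++i)
        total += book[i].value();
    return total;
}
```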
In essence, as we have touched on already, one can look at Knight’s Landing as simply a new Xeon with higher core counts since at least some of the complexities of using it as a coprocessor will no longer be an issue. Unlike with the current Xeon Phi, transfers across PCIe are eliminated, memory is local and Landing acts as a true processor that can carry over the benefits of parallelism and efficiency of Phi in a processor form factor while still offering the option to use it as a coprocessor for specific highly parallel parts of a given workload. So this should make programming for one of these essentially the same as programming for Xeon—that is, in theory.
Despite the emphasis on extending programmability, make no mistake, it’s not as though parallel programming is suddenly going to become magically simple–and certainly that’s still not the case for using coprocessors, says James Reinders, Intel’s software director. However, there are some notable features that will make the transition more seamless.
xcelerit, Benchmarks: Ivy Bridge vs. Sandy Bridge for Financial Analytics, here. Folks are reporting that the performance jump from Sandy Bridge to Ivy Bridge on this MC code is mostly explained by the increased core count, 12/8 = 1.5. It is a little uncomfortable not knowing how the code is compiled, but the relative figures also make sense given that Sandy Bridge and Ivy Bridge share the same AVX architecture (8 DP FLOPS/cycle). This will change dramatically with Haswell’s AVX2 and FMA, which should double the flops per cycle on a suitable MC code while keeping the core count flat relative to Ivy Bridge (a minimal FMA sketch follows the excerpt).
The table below shows the speedups for different numbers of paths, comparing the Ivy-Bridge processor vs. the Sandy-Bridge processor:
Paths      Speedup, Ivy Bridge vs. Sandy Bridge
64K        1.15x
128K       1.25x
256K       1.34x
512K       1.4x
1,024K     1.48x
As can be seen, the Ivy-Bridge processor gains significantly compared to the Sandy-Bridge, reaching 1.5x speedup for high numbers of paths. This is in line with the increase in the number of cores from 8 to 12 per chip. The benefits of the new Ivy-Bridge for financial Monte-Carlo applications can be clearly seen here.
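The FMA point in a nutshell, as a hedged sketch (my reconstruction, not xcelerit's code): the inner update of a log-Euler GBM path is a multiply plus an add, exactly the shape Haswell fuses into one instruction:

```cpp
#include <cmath>

// One log-Euler step of geometric Brownian motion (illustrative).
// std::fma compiles to a vfmadd* instruction on FMA3 hardware: two flops
// per instruction, doubling peak throughput vs. separate mul + add on
// Sandy Bridge / Ivy Bridge.
inline double gbm_step(double logS, double drift_dt, double vol_sqrt_dt, double z) {
    return std::fma(vol_sqrt_dt, z, logS + drift_dt);
}
```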
Andy Patrizio, IT World, Intel changes the whole supercomputing game with Knight’s Landing, here.
Much more important is what else it takes away. Knight’s Landing will erase the memory buffer and PCI Express bus that sat between the CPU and main memory on one side and the coprocessor chip and frame buffer memory on the other in the Xeon Phi card. Now that applications run entirely natively instead of offloading the data sets to the coprocessor, all of that latency goes away.
Now you will have both scalar processor cores and vector processor cores on the same chip sharing access to unified memory. This is huge. A fair amount of time has to go into offloading data sets from main memory to the frame buffer memory of the co-processor and then back to the CPU and main memory. It’s why Nvidia had to come out with the CUDA language, because plain old C++ or Java wouldn’t work.
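Back-of-the-envelope on what goes away (my arithmetic, not the article’s): a Knights Corner card sits on PCIe 2.0 x16, roughly 8 GB/s per direction in theory, so staging a 4 GB data set on the coprocessor costs on the order of

$$t_{\text{transfer}} \approx \frac{4\ \text{GB}}{8\ \text{GB/s}} = 0.5\ \text{s}$$

each way before a single flop runs; Knights Landing’s local memory drops that term to zero.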
Matt Levine, Bloomberg, Welcome Back, Leveraged Super Senior Synthetic CDOs, here.
You can tell that leveraged super senior synthetic collateralized debt obligation tranches are fun because they are called leveraged super senior synthetic collateralized debt obligation tranches, and anything with that many words in its name is up to something. And in fact LSS CDOs were popular prior to the financial crisis, got various people in various kinds of trouble, and more or less vanished.
But now Euromoney is reporting that Citigroup is trying to market them again, with a slight modification that might get people into less or at least different kinds of trouble, though it is far from clear that anyone will be interested.
Joel Hruska, Hot Hardware, Intel’s 128MB L4 Cache May Be Coming to Desktops with 14nm Broadwell-K CPUs, here.
When Intel debuted Haswell this year, it launched its first mobile processor with a massive 128MB L4 cache. Dubbed “Crystal Well,” this on-package (not on-die) pool of memory wasn’t just a graphics frame buffer, but a giant pool of RAM for the entire core to utilize. The performance impact from doing so is significant, though the Haswell processors that utilize the L4 cache don’t appear to account for very much of Intel’s total CPU volume.
Right now, the L4 cache pool is only available on mobile parts, but that could change next year. According to CPU-World, Broadwell-K will change that. The 14nm desktop chips aren’t due until the tail end of next year — we should see a desktop refresh in the spring with a second-generation Haswell part. Still, it’s a sign that Intel intends to integrate the large L4 as standard on a wide range of parts.
Nicole Hemsoth, HPC Wire, Intel Brings Knights to the Roundtable at SC13, here. David Kanter seems to think the Knights Landing processors are going to be clocked around 1GHz, but you get more cores. That’s a problem for a bunch of FinQuant code that you can run on a 3.9GHz Haswell, for example. If you can organize your scalar code to instead use vectors between 8 and 20 doubles long, you can generally improve MFLOPS performance by a factor of 3 to 4 at the cost of more latency (look at Hager’s Blaze Lib performance, here). The MFLOPS-to-latency trade-off can be a somewhat subtle but rewarding analysis to perform. That seems to me a better place to be than losing a factor of 3+ in clock frequency at the get-go, no matter what your application needs are. That is also why you like that sharp upward slope in the MFLOPS-to-vector-length plot: it tells you, in a way, how worthwhile it is to alter your scalar code (vectorizing it) to get the most MFLOPS out of the microprocessor. It is a balancing act for some FinQuant codes. If your application need is hard real-time scalar, then you may not be able to afford climbing that MFLOPS curve. If your application is FinQuant portfolio evaluation, then you typically choose reasonably long vectors; 100 to 1K looks good on Hager’s plot. That’s where your code’s performance scales with Moore’s Law on contemporary commodity microprocessors in 2013-4. That trade-off analysis between scalar and vectorized execution is, in many cases, the difference between hitting or missing your benchmark performance given your application’s needs (a minimal sketch of the scalar-versus-vector shape follows).
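Here is a minimal sketch of that scalar-versus-vector shape (my example, not Hager’s code): the batched form trades per-element latency for throughput, which is the whole MFLOPS-to-vector-length curve in miniature:

```cpp
#include <cstddef>
#include <vector>

// Scalar: one result at a time -- lowest latency per call.
inline double discount_scalar(double cashflow, double df) { return cashflow * df; }

// Vectorized: amortize loop overhead over a batch. Compilers auto-vectorize
// this with -O3 -march=native, emitting vmulpd over 4 doubles (AVX) or
// 8 doubles (AVX-512) per iteration -- more MFLOPS, more latency per batch.
void discount_batch(const std::vector<double>& cf, const std::vector<double>& df,
                    std::vector<double>& out) {
    for (std::size_t i = 0; i < cf.size(); ++i)
        out[i] = cf[i] * df[i];
}
```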
This seems to be the grid argument coming up again. I’ve seen real smart folks write comparatively slow grid Monte Carlo simulations (missing competitive performance by a factor of 1000 or more) just because the structure of the code was agnostic to the underlying microprocessor. It is hard to overemphasize how much you do not want your code’s execution to miss in L2 (the access-pattern sketch below makes this concrete). For some LINPACK-style codes there is not much you can do; some big dense matrix computations are going to cause L2 misses. But a large chunk of FinQuant codes do not have LINPACK in the inner loop. Hence there could be a strong preference for Broadwell and Skylake FP execution over Knights Landing.
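A tiny illustration of the L2 point (my example): identical arithmetic, radically different cache behavior depending on the access pattern the code structure implies:

```cpp
#include <cstddef>
#include <vector>

// Contiguous pass: streams cache lines through L1/L2; hardware prefetchers keep up.
double sum_contiguous(const std::vector<double>& v) {
    double s = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i) s += v[i];
    return s;
}

// Large-stride pass: uses one double per cache line touched, so on a big
// array nearly every access misses in L2 -- the pattern that
// microprocessor-agnostic grid code tends to fall into.
double sum_strided(const std::vector<double>& v, std::size_t stride) {
    double s = 0.0;
    for (std::size_t start = 0; start < stride; ++start)
        for (std::size_t i = start; i < v.size(); i += stride) s += v[i];
    return s;
}
```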
This week during SC13, Intel hosted a roundtable session to discuss the future of its upcoming Knights Landing product, hitting on where the key benefits are expected for technical computing users and how Knights Landing might influence the shape of next-generation systems and applications.
Matt Levine, Bloomberg, Financial Innovation Is Depressing, here. Wow, Levine got Bloomberg to give him actual footnotes.
DealBook has a special section today on ideas and innovation on Wall Street and let’s just say it will not inspire too many idealistic Stanford undergrads to stop work on their iPhone apps and take that financial-engineering job at Wells Fargo. It’s all pretty sad stuff.
My favorite article in the section is of course the one on innovation in fraud, which it turns out is genuinely fertile ground for creativity, though tell that to the Stanford kids. To be fair, though, there is also innovation in catching fraud, specifically a Commodity Futures Trading Commission Rule, adopted in 2011, that outlaws market manipulation. So! Nice work there CFTC.
Favorite footnote so far -
3 Provably false in the aggregate, but always possible for you. You don’t need to settle for average returns, you are a special snowflake, buy our snowflake index fund, etc.
Jessica, appnexus tech blog, AppNexus Engineering@Scale: Building & Shipping a Scalable Product, here.
Few people are more familiar with website scalability problems than Theo Schlossnagle. Not only is Theo the founder and CEO of OmniTI, he is also the author of Scalable Internet Architectures, a book that draws on his 15 years of experience to provide developers with a blueprint for tackling the biggest obstacles to successful scaling. Theo shared his wisdom in a recent AppNexus Engineering@Scale talk.
This 20 Nov talk looks interesting as well:
TestOps: Continuous Integration when Infrastructure is the Product
An AppNexus Engineering @ Scale Conversation Series
Join us November 20th for “TestOps: Continuous Integration when Infrastructure is the Product” presented by Barry Jaspan, Senior Architect of Acquia. Continuous Integration and Deployment are powerful approaches for improving software development and release engineering. However, when your product is infrastructure running other people’s applications instead of just your own, different problems arise.
Releases involve reconfiguring daemons and servers, possibly restarting them. Updates must be carefully choreographed to maintain high availability. Upgrading 6000+ servers “at once” is impossible, so running multiple versions simultaneously is required. Rolling back is difficult, so automated testing is critical. With the rise of configuration management systems like Puppet and Chef, server configuration is now software. Like all software, server configuration needs constant automated testing in order to work, but testing infrastructure-as-software is a substantially different problem from testing normal application software.
Puppet Labs, What is Puppet? here.
Puppet is IT automation software that helps system administrators manage infrastructure throughout its lifecycle, from provisioning and configuration to orchestration and reporting. Using Puppet, you can easily automate repetitive tasks, quickly deploy critical applications, and proactively manage change, scaling from 10s of servers to 1000s, on-premise or in the cloud.
Opscode, Chef, How Chef Works, here.
Chef is based on a key insight: You can model your evolving IT infrastructure and applications as code. Chef makes no assumptions about your environment and the approach you use to configure and manage it. Instead, Chef gives you a way to describe and automate your infrastructure and processes. Your infrastructure becomes testable, versioned and repeatable. It becomes part of your Agile process.
Peter Wayner, InfoWorld, Puppet or Chef: The configuration management dilemma, here.
Thank goodness for automation. Over the years, smart sys admins looked at the ballooning task list and figured out a way to write scripts that would handle the repetitive tasks. They built their own junior robot sys admin to do the work for them.
The hard work has coalesced into two major factions called Puppet and Chef. There are a number of other notable projects with readable names like Ansible and unreadable names like Bcfg2, but Puppet and Chef seem to have gathered the most excitement for now.
Both are open source stacks of code designed to make it easy for you to reach out and touch the files in your vast empire of virtual machines. Both have open source marketplaces for you to swap plug-ins that extend the framework and handle your particular type of hardware or software. Both are pretty cool, and both are finding homes in the racks of data centers around the world. Both now have companies built around the open source core selling assistance.
Agner’s CPU Blog, Future instruction set: AVX-512, here.
The size of vector registers is extended from 256 bits (YMM registers) to 512 bits (ZMM registers). There is room for further extensions to at least 1024 bits (what will they be called?)
The number of vector registers is doubled to 32 registers in 64-bit mode. There will still be only 8 vector registers in 32-bit mode.
Eight new mask registers k0 – k7 allow masked and conditional operations. Most vector instructions can be masked so that they operate only on selected vector elements while the remaining elements are left unchanged or zeroed. This will replace the use of vector registers as masks (see the sketch after this list).
Most vector instructions with a memory operand have an option for broadcasting a scalar operand.
Floating point vector instructions have options for specifying the rounding mode and for suppressing exceptions.
There is a new addressing mode called compressed displacement. Where instructions have a memory operand with a pointer and an 8-bit sign-extended displacement, the displacement is multiplied by the size of the operand. This makes it possible to address a larger interval with just a single byte displacement as long as the memory operands are properly aligned. This makes the instructions smaller in some cases to compensate for the longer prefix.
More than 100 new instructions
The 512-bit registers can do vector operations on 32-bit and 64-bit signed and unsigned integers and single and double precision floats, but unfortunately not on 8-bit and 16-bit integers
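As a hedged sketch of what masking looks like in practice (intrinsic names per Intel’s published AVX-512F documentation; the kernel itself is illustrative):

```cpp
#include <immintrin.h>

// Adds b to a only in the lanes where a > 0; unmasked lanes pass 'a' through
// unchanged. Compile with -mavx512f on an AVX-512 target.
__m512d add_where_positive(__m512d a, __m512d b) {
    // Per-lane predicate lands in a mask register, not a vector register.
    __mmask8 k = _mm512_cmp_pd_mask(a, _mm512_set1_pd(0.0), _CMP_GT_OS);
    // Masked add: lanes with k=0 take their value from the first operand.
    return _mm512_mask_add_pd(a, k, a, b);
}
```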
Optimization manuals updated, 4 Sep, here. Heed the words of Professor Fog, “Note that these manuals are not for beginners.”
The optimization manuals at www.agner.org/optimize/#manuals have now been updated. The most important additions are:
- AMD Piledriver and Jaguar processors are now described in the microarchitecture manual and the instruction tables.
- Intel Ivy Bridge and Haswell processors are now described in the microarchitecture manual and the instruction tables.
- The micro-op cache of Intel processors is analyzed in more detail.
- The assembly manual has more information on the AVX2 instruction set.
- The C++ manual describes the use of my vector classes for writing parallel code.
Some interesting test results for the newly tested Haswell processor:
- Supports the new AVX2 instruction set which allows integer vectors of 256 bits and gather instructions
- Supports fused multiply-and-add instructions of the FMA3 type
- The cache bandwidth is doubled to 256 bits. It can do two reads and one write per clock cycle.
- Cache bank conflicts have been removed
- The number of read and write buffers, register files, reorder buffer and reservation station are all bigger than in previous processors
- There are more execution units and one more execution port than on previous processors. This makes a throughput of four instructions per clock cycle quite realistic in many cases.
- The throughput for not-taken branches is doubled to two not-taken branches per clock cycle, including fused branch instructions. The throughput for taken branches is largely unchanged.
- There are two execution units for floating point multiplication and for fused multiply-and-add, but only one execution unit for floating point addition. This design appears to be suboptimal since floating point code typically contains more additions than multiplications. But at least it enables Intel to boast a floating point performance of 32 FLOPS per clock cycle (two FMA units × 8 single-precision lanes × 2 operations per fused instruction; 16 FLOPS/cycle in double precision).
- The fused multiply-and-add operation is the first case in the history of Intel processors of micro-ops having more than two input dependencies. Other instructions with more than two input dependencies are still split into two micro-ops, though. AMD processors don’t have this limitation.
- The delays for moving data between different execution units are in many cases smaller than on previous Intel processors.
Ashlee Vance, Bloomberg, Inside the Arctic Circle, Where Your Facebook Data Live, here. DynaRack lives but probably not near the Arctic Circle. On the other hand I doubt it’s optimal to do Equatorial DynaRack, unless your DynaRack can fly.
Every year, computing giants including Hewlett-Packard (HPQ), Dell (DELL), and Cisco Systems (CSCO) sell north of $100 billion in hardware. That’s the total for the basic iron—servers, storage, and networking products. Add in specialized security, data analytics systems, and related software, and the figure gets much, much larger. So you can understand the concern these companies must feel as they watch Facebook (FB) publish more efficient equipment designs that directly threaten their business. For free.
Ian Cutress, Anand Tech, Memory Scaling on Haswell CPU, IGP and dGPU: DDR3-1333 to DDR3-3000 Tested with G.Skill, here. You should not really care about memory bandwidth all that much for FinQuant code that is not a PDE solver. The benchmark bears this out: single-threaded or multi-threaded, it’s a 1 to 2 percent difference in runtime. There are exceptions, but for the most part, on Wall Street, if your code needs to hit DDR3-3000 to perform competitively, you are doing it wrong. If your boss doesn’t know about DDR3-3000, you are also doing it wrong. It’s a fine line.
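Back-of-the-envelope on why (my arithmetic, not AnandTech’s): a quad-core Haswell at 3.5 GHz peaks near $4 \times 16 \times 3.5 \approx 224$ DP GFLOPS, while dual-channel DDR3-1600 supplies about 25.6 GB/s, so memory bandwidth only becomes the bottleneck below roughly

$$\frac{224\ \text{GFLOPS}}{25.6\ \text{GB/s}} \approx 9\ \text{flops per byte}$$

of arithmetic intensity. Monte Carlo and most non-PDE FinQuant kernels sit comfortably above that line, which is why faster DIMMs barely move the needle.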
One side I like to explore on CPUs is raw compute ability, and whether a variety of mathematical loads can stress the system in a way that real-world usage might not. For these benchmarks we use ones developed for testing MP servers and workstation systems back in early 2013, such as grid solvers and Brownian motion code. Please head over to the first of such reviews, where the mathematics and small snippets of code are available.
3D Movement Algorithm Test
The algorithms in 3DPM employ uniform or normally distributed random number generation, and vary in the mix of trigonometric operations, conditional statements, generation and rejection, fused operations, etc. The benchmark runs through six algorithms for a specified number of particles and steps, calculates the speed of each algorithm, then sums them all for a final score. This is an example of a real-world situation that a computational scientist may find themselves in, rather than a pure synthetic benchmark. The benchmark is parallel between simulated particles, and we test both single-threaded and multi-threaded performance. Results are expressed in millions of particles moved per second, and a higher number is better.
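For flavor, a hedged reconstruction of what one 3DPM-style step looks like (my sketch, not the benchmark’s source):

```cpp
#include <random>

struct Particle { double x = 0.0, y = 0.0, z = 0.0; };

// One Gaussian random-walk step; each particle is independent, which is why
// the benchmark parallelizes cleanly across threads. Trig-based variants pick
// a direction from uniform angles instead, stressing the FP units differently.
void move(Particle& p, std::mt19937_64& rng) {
    std::normal_distribution<double> step(0.0, 1.0);
    p.x += step(rng);
    p.y += step(rng);
    p.z += step(rng);
}
```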
Felienne, Felienne’s Blog, Excel Turing Machine, here.
This weekend, I went on a get away with some fellow developers of Devnology. We do this every year, drinking beer, talking about programming, and often also, programming.
This year I had even more fun than usual, as Daan van Berkel proposed we build a Turing machine in Excel. In this blog post I’ll talk about how we built it, but of course you can also skip all that and just download it and play with it. Please note that automatic calculations are turned off, so if you want to run the machine, you’ll have to do it manually.
Patrick Kennedy, tom’s Hardware, Haswell-Based Xeon E3-1200: Three Generations, Benchmarked, here. Ever wonder how much memory bandwidth your cash register needs in Haswell to breakeven in performance with a high end Sandy Bridge server chip? DynaRack – when you need to place an order … now.
Comparatively, Intel’s high-end Xeon E5 brand, intended for more compute-intensive workloads, supports up to quad-channel memory configurations and registered DIMMs. That gives those LGA 2011-based platforms the ability to address hundreds of gigabytes of memory. Back when Sandy Bridge first surfaced, 32 GB seemed like a lot of RAM for a small server or workstation. In 2013, we see high-end desktops sporting that much (particularly easy across eight memory slots).
Peter Woit, Not Even Wrong, Love and Math, here.
A large part of the book is basically a memoir, recounting Frenkel’s eventful career, which began in a small city in the former Soviet Union. He explains how he fell in love with mathematics, his struggles with the grotesque anti-Semitism of the Soviet system of that time (this chapter of the story was published earlier, available here), his experiences with Gelfand and others, and how he came to the US and ended up beginning a successful academic career in the West at Harvard. I remember fairly well the upheaval in the mathematics research community of that era, as the collapse of the Soviet system brought a flood of brilliant mathematicians from Russia to the West. It’s fascinating to read Frenkel’s account of what that all looked like from the other side.
Josiah Neeley, Not Quite Noahpinion, The Science of Hippie-Punching, here.
For those not in the know, “hippie-punching” refers to when someone (usually but not always on the center-left) attacks someone farther to their left as a means of gaining credibility and support with the general populace. The term appears to date from 2007, but the practice itself is far older. Bill Clinton, for example, was an expert hippie-puncher, and the term itself seems to be an oblique reference to the 1968 Democratic convention, when anti-war protesters battled Chicago police under the control of Democratic mayor Richard Daley (the nearest right-wing equivalent to the term “hippie-punching” is “that time when William F. Buckley kicked the Birchers out of the conservative movement”).