
Category Archives: Code

Vectorization

Looking at von Leitner’s example of vectorization, the reader could come away with the mistaken impression that the stakes for vectorization are small. For some codes that is true: vectorization is just one of many optimizations that compilers are getting increasingly good at, like inlining, strength reduction, or tail-call elimination. After all, the example shows a vectorized implementation of memset(). How often is memset() in the critical performance path of code that is important to you? I’ll guess, not all that often.

```c
/* von Leitner’s memset-like example: a fill loop the compiler
 * can vectorize on its own. */
void zero(char *array) {
    unsigned long i;
    for (i = 0; i < 1024; ++i)
        array[i] = 23;
}
```

For FinQuant codes, vectorization typically plays a large role in determining the competitive performance of the optimized code. Vectorization may be what it takes simply to get the full P&L and risk valuation for the desk’s trade inventory posted with the end-of-day marks before the traders leave for the day. Vectorized code may be the only way to run enough paths to get a particular Firm Risk Monte Carlo to converge. Vectorized code may be important for your algorithmic trading execution in a particularly competitive market.

Code optimization via vectorization does come at a cost. In many FinQuant applications the contemporary compiler simply is not able to find the available vectorization automatically; it has to be exposed by the programmer. It is not that von Leitner is wrong in the direction of his argument. But the argument, in the case of important FinQuant codes and applications, is a little too general.

Take a look at the first part of the Intel Black Scholes code. The big difference between a piece of scalar code and vectorized code is the vector length and loop-index parameter nopt. If nopt = 1 at the time BlackScholesFormula() is called, the code is going to execute as scalar code. This means the floating point execution pipeline can only schedule the operations in a single iteration of the for loop, and if the compiler cannot find enough instruction-level parallelism (ILP) then each floating point operation costs closer to 4 to 5 cycles rather than 0.5 or 0.25 cycles (on an average throughput basis).

```c
/* Excerpt of the first part of Intel's kernel; j and the temporary
 * arrays (tr, tss, tss05, mtr) are declared elsewhere in Intel's
 * source. */
void BlackScholesFormula( int nopt, tfloat r, tfloat sig,
                          tfloat s0[], tfloat x[], tfloat t[],
                          tfloat vcall[], tfloat vput[] )
{
    vmlSetMode( VML_EP );
    DIV(s0, x, Div); LOG(Div, Log);
    for ( j = 0; j < nopt; j++ ) {
        tr[j]    = t[j] * r;
        tss[j]   = t[j] * sig_2;
        tss05[j] = tss[j] * HALF;
        mtr[j]   = -tr[j];
    }
    /* ... remainder of the kernel elided ... */
}
```

If nopt > 1 then the compiler and the programmer start to have something to work with vis-à-vis optimization and vectorization. This is pretty much the whole game for the floating point performance improvements known as SSE, AVX, AVX2, and AVX-512. In other words, in a contemporary world where you cannot get more clock cycles out of commodity microprocessors, the only way the performance of a large collection of important FinQuant code is going to improve with Moore’s Law is to expose ILP to the compiler.

In some cases it is not so difficult to find the ILP. Monte Carlo simulation provides an enormous amount of it, and many portfolio valuation applications have lots of easy ILP because the individual trades in the portfolio require independent valuations. With streaming computations, finding the ILP is not something the compiler is generally going to figure out just because you give it the right switches. Moreover, if you don’t find the available ILP, for many applications that performance is simply lost; there is no other way to recover it. Parallelization across on-chip cores is slower, off-chip cores are typically a non-starter, coprocessors on the other side of the bus cannot recover the data transfer time, and the overhead of even multiple threads can leave the floating point execution units idle for relatively long periods. If the competitive bottleneck is the floating point execution pipeline, then exposing ILP becomes almost a forced move.

In the Intel code, and generally in any vectorized code, the key to exposing ILP to the compiler is to assemble all the inputs required for the floating point execution unit so that nopt is as large as possible (looking for the knee in the performance curve). To see why, look at Hager’s daxpy performance plot of MFlops versus vector length. Vector length is the known size of nopt at the time BlackScholesFormula() is called. The larger nopt is known to be (i.e., the more inputs are available), the more MFlops your code execution gets. The steep slope for the Blaze daxpy is very attractive, meaning it can be worthwhile to rearrange your code so that you get the execution performance that Blaze is advertising. How different is daxpy from the Intel loop example we started with? Not all that much; daxpy is a little more aggressive because it can schedule the Fused Multiply Add execution units to complete a floating point add and multiply on each clock cycle. Hager has many vectorized performance template plots you can use to refine performance estimates.

Von Leitner’s presentation is great but could be misleading for some FinQuant floating point codes. You as the programmer need to think through how you can make nopt hit the sweet spot for performance for your code’s execution. In general compilers are not just going to figure it out, today.

Source Code Optimization

Felix von Leitner, Code Blau GmbH, Source Code Optimization, here. Top shelf presentation. I would adjust this money shot, though, for the competitive FinQuant folks. Yes, optimizing memory hierarchy performance is a really important goal. But simply optimizing your code’s memory performance, without addressing the fact that you are running on a little superscalar, superpipelined parallel machine, may set you apart from the apathetic coders out there, but it doesn’t win you many competitive races. There are two primary optimization targets for FinQuant code that needs to exhibit competitive performance:

1. Instruction Level Parallelism (ILP) and

2. Multi level Cache Hierarchies

If you miss the ILP, you can lose close to an order of magnitude of execution performance. If your code is needlessly missing in L2 or L3, you could be dumping multiple orders of magnitude of execution performance. If you get both of these right in FinQuant analytics, and you have the right algorithms and top-of-the-line off-the-shelf microprocessors, then no competitor is running significantly faster than you (in any race you care about). Where does this fail to hold? Something like session-oriented protocol processing for exchange order routing: custom hardware with FPGAs can demonstrate much better latency (say an order of magnitude improvement, 10 µs down to 400 ns) than off-the-shelf servers running software, no matter how cleverly you optimize. Where does this hold? Standard FinQuant applications like Monte Carlo simulation, most Fixed Income P&L and Risk, even something as simple as Black Scholes.

Memory Hierarchy

• Only important optimization goal these days

• Use mul instead of shift: 5 cycles penalty.

• Conditional branch mispredicted: 10 cycles.

• Cache miss to main memory: 250 cycles.

Compilers and More: Accelerated Programming

Michael Wolfe, PGI, HPC Wire, Compilers and More: Accelerated Programming, here. John Barr link looks interesting as well.

Having just returned from SC13, one burning issue is the choice of a standard approach for programming the next generation HPC systems. While not guaranteed, these systems are likely to be large clusters of nodes with multicore CPUs and some sort of attached accelerators. A standard programming approach is necessary to convince developers, and particularly ISVs, to start adoption now in preparation for this coming generation of systems. John Barr raised the same question in a recent article at Scientific Computing World from a more philosophical point of view. Here I address this question from a deeper technical perspective.

Floating Point Benchmarks

xcelerit, Benchmarks: Ivy Bridge vs. Sandy Bridge for Financial Analytics, here. Folks are reporting that the performance jump from Sandy Bridge to Ivy Bridge on this MC code is mostly explained by the increased core count: 12/8 = 1.5. It is a little uncomfortable not knowing how the code was compiled, but the relative figures also make sense in light of the fact that Sandy Bridge and Ivy Bridge have the same AVX architecture, 8 DP FLOPS/cycle. This will change dramatically with Haswell’s AVX2 and FMA, which should double the flops per cycle on a suitable MC code while keeping the core count flat relative to Ivy Bridge.

The table below shows the speedups for different numbers of paths, comparing the Ivy-Bridge processor vs. the Sandy-Bridge processor:

| Paths | Speedup Ivy-Bridge vs. Sandy-Bridge |
|--------|------|
| 64K | 1.15x |
| 128K | 1.25x |
| 256K | 1.34x |
| 512K | 1.4x |
| 1,024K | 1.48x |

As can be seen, the Ivy-Bridge processor gains significantly compared to the Sandy-Bridge, reaching 1.5x speedup for high numbers of paths. This is in line with the increase in the number of cores from 8 to 12 per chip. The benefits of the new Ivy-Bridge for financial Monte-Carlo applications can be clearly seen here.

The Challenge of Cross-language Interoperability

David Chisnall, acmqueue, The Challenge of Cross-language Interoperability, here. It seems unlikely that there will be a change in fast code = C code for some time, on the order of years, for FinQuant floating point. Given that the Street’s main Risk and P&L problems are addressed precisely by the x86/AVX performance growth of the preceding 2-3 years and the expected x86 improvements through 2016, it is difficult to see how a significant processor architecture change could take root. That said, there does not seem to be a wealth of folks on the Street figuring out how to get their compiler to issue instructions rapidly enough to keep the FMA floating point execution units busy on each clock cycle. So if these applications are not actually tied to high performance execution, something else in the way of a new technology or architecture could come along.

The traditional division between high-level languages representing the class that is similar to a human’s understanding of the problem domain and low-level languages representing the class similar to the hardware no longer applies. No low-level language has semantics that are close to a programmable data-flow processor, an x86 CPU, a massively multithreaded GPU, and a VLIW (very long instruction word) DSP (digital signal processor). Programmers wanting to get the last bit of performance out of the available hardware no longer have a single language they can use for all probable targets.

The industry has spent the past 30 years building CPUs optimized for running languages such as C, because people who needed fast code used C (because people who designed processors optimized them for C, because…). Maybe the time has come to start exploring better built-in support for common operations in other languages. The RISC project was born from looking at the instructions that a primitive compiler generated from compiling C code. What would we end up with if we started by looking at what a native JavaScript or Haskell compiler would emit?

Ohloh, Compare Languages, here. Tracks the number of lines of code available in public open source repositories. 4 billion lines of C code and around 1.5 billion each of C++ and Java.

Language Comparison Page

The graph compares the languages picked by the user. The height of each point on the graph is the sum of all commits in that month that included at least one line of change for that language. A commit that changed two languages will be counted for each language.

Languages are always charted over 20 years, and do not include the most recent month. The most recent month is excluded because Ohloh does not yet have complete information for it.

Intel Digital Random Number Generator

Gael Hofemeier, Intel, Intel Digital Random Number Generator(DRNG) Software Implementation Guide, here.

RDRAND Response Time and Reseeding Frequency

• ~150 clocks per invocation
Note: Varies with CPU clock frequency since constraint is shared data path from DRNG to cores.
• Little contention until 8 threads

Walking Randomly

Walking Randomly, Vectorizing code to take advantage of modern CPUs (AVX and SSE), here.  A SIMD vectorized Mersenne Twister, nice. Walking Randomly says Agner Fog’s Website is the Bible for MC simulation.

Vendors of numerical libraries are steadily applying vectorisation techniques in order to maximise performance.  If the execution speed of your application depends upon these library functions, you may get a significant speed boost simply by updating to the latest version of the library and recompiling with the relevant compiler flags.

Matt Pharr

Matt Pharr, Intel, ispc: A SPMD Compiler for High-Performance CPU Programming, here. One of Pat Hanrahan’s students at Stanford did http://ispc.github.com.  Nice 2012 presentations, covers much of what we were discussing here at Pink I regarding Dr. Johnson’s Black Scholes code, but in a larger context.

I joined Google [x] in March 2013. As a side project, I had a great time teaching the 2013 installment of cs348b, the graduate-level rendering course at Stanford this spring.

I was previously a Principal Engineer at Intel, where I was responsible for the design and implementation of ispc, the Intel SPMD Program Compiler, now available in open-source form from github. Before attacking the problem of building compilers for better high-performance programming models for modern CPU architectures, I was the architecture lead of the Advanced Rendering Technology group, which is the group that grew out of Neoptica.

Intel SPMD Program Compiler, here.

ispc is a compiler for a variant of the C programming language, with extensions for “single program, multiple data” (SPMD) programming. Under the SPMD model, the programmer writes a program that generally appears to be a regular serial program, though the execution model is actually that a number of program instances execute in parallel on the hardware. (See the ispc documentation for more details and examples that illustrate this concept.)

ispc compiles a C-based SPMD programming language to run on the SIMD units of CPUs and the Intel Xeon Phi™ architecture; it frequently provides a 3x or more speedup on CPUs with 4-wide vector SSE units and 5x-6x on CPUs with 8-wide AVX vector units, without any of the difficulty of writing intrinsics code. Parallelization across multiple cores is also supported by ispc, making it possible to write programs that achieve performance improvement that scales by both number of cores and vector unit size.

Random Number Generators

Peter Hellekalek, U Salzburg, PLab, Random Number Generators, here. Nice site. Matsumoto’s Mersenne Twister still seems to be top dog?

We present results and links for this fundamental tool in stochastic simulation and in applied cryptography, some of them due to our own research in this field. Enjoy the data and allow for necessary incompleteness and subjectivity.

An unmoderated collection of network resources on random number generators is located on this server as part of the WWW Virtual Library.

A short tour of the information we offer:

• Generators: We discuss different types of random number generators and some of their properties.

• Links: We provide rather extensive links to people, software and related sites.

• Software: If you need code for random number generation, even for nonuniform distributions: This is your page!

• Tests: All random number generators have their weak points. We discuss the issue of empirical (statistical) testing of RNGs and provide results as well as links.

Pierre L’Ecuyer et al., An Object-Oriented Random-Number Package With Many Long Streams and Substreams, here. I was curious whether there was a literature on parallel random number generators. You are kind of drawn into thinking about it when you have to put a large MC computation on a grid. For performance reasons it is unpleasant to contemplate having to generate all your random deviates at one time on one processor, store them, and then send them to a particular grid element for use in the simulation, rather than simply generating them in parallel on the fly.

Multiple independent streams of random numbers are often required in simulation studies, for instance, to facilitate synchronization for variance-reduction purposes, and for making independent replications. A portable set of software utilities is described for uniform random number generation. It provides for multiple generators (streams) running simultaneously, and each generator (stream) has its sequence of numbers partitioned into many long disjoint contiguous substreams. The basic underlying generator for this implementation is a combined multiple-recursive generator with period length of approximately 2^191, proposed by L’Ecuyer (1999a). A C++ interface is described here. Portable implementations are available in C, C++, and Java via the online companion to this paper on the Operations Research Web site (http://or.pubs.informs.org/pages/collect.html).

Pierre L’Ecuyer’s publications, here.

Check in w. Agner Fog

Agner Fog, Intel Developer Forum, AVX-512 is a big step forward – but repeating past mistakes! here.

AVX-512 is arguably the biggest step yet in the evolution of the x86 instruction set in terms of new instructions, new registers and new features. The first try was the Knights Corner instruction set. It had some problems and AVX-512 is better, so I am quite happy that AVX-512 seems to be replacing the Knights Corner instruction set. But there are still some shortsighted issues that are likely to cause problems for later extensions.