You are currently browsing the tag archive for the ‘Black Scholes’ tag.
Xilinx, High Performance Computing Using FPGAs, Sep 2010, here.
The shift to multicore CPUs forces application developers to adopt a parallel programming model to exploit CPU performance. Even using the newest multicore architectures, it is unclear whether the performance growth expected by the HPC end user can be delivered, especially when running the most data- and compute- intensive applications. CPU-based systems augmented with hardware accelerators as co-processors are emerging as an alternative to CPU-only systems. This has opened up opportunities for accelerators like Graphics Processing Units (GPUs), FPGAs, and other accelerator technologies to advance HPC to previously unattainable performance levels.
I buy the argument to a degree. As the number of cores per chip grow, the easy pipelining and parallelization opportunities will diminish. The argument is stronger if there are more cores per chip. 8 cores or under per general purpose chip it’s sort of a futuristic theoretical argument. More than a few programmers can figure out how to code up a 4 to 8 stage pipeline for their application without massive automated assistance. But the FPGA opportunity does exist.
The convergence of storage and Ethernet networking is driving the adoption of 40G and 100G Ethernet in data centers. Traditionally, data is brought into the processor memory space via a PCIe network interface card. However, there is a mismatch of bandwidth between PCIe (x8, Gen3) versus the Ethernet 40G and 100G protocols; with this bandwidth mismatch, PCIe (x8, Gen3) NICs cannot support Ethernet 40G and 100G protocols. This mismatch creates the opportunity for the QPI protocol to be used in networking systems. This adoption of QPI in networking and storage is in addition to HPC.
I buy the FPGA application in the NIC space. I want my NIC to go directly to L3 pinned pages, yessir I do, 100G please.
Xilinx FPGAs double their device density from one generation to the next. Peak performance of FPGAs and processors can be estimated to show the impact of doubling the performance on FPGAs [Ref 6], [Ref 7]. This doubling of capacity directly results in increased FPGA compute capabilities.
The idea proposed here is that you want to be on the exponentially increasing density curve for the FPGAs in lieu of clock speed increases you are never going to see again. Sort of a complicated bet to make for mortals, maybe.
I like how they do the comparisons though. They say here is our Virtex-n basketball player and here is the best NBA Basketball player … and they show you crusty old Mike Bibby 2012. Then they say watch as the Virtex-n basketball player takes Mike Bibby down low in the post, and notice the Virtex-n basketball player is still growing exponentially. So you can imagine how much better he will do against Mike Bibby in the post next year. Finally they say that Mike Bibby was chosen as the best NBA player for this comparison by his father Henry, who was also a great NBA player.
FPGAs tend to consume power in tens of watts, compared to other multicores and GPUs that tend to consume power in hundreds of watts. One primary reason for lower power consumption in FPGAs is that the applications typically operate between 100–300 MHz on FPGAs compared to applications on high-performance processors executing between 2–3 GHz.
Silly making Lemonade out of Lemons argument, the minute I can have my FPGAs clocked at 3 GHz I throw away the 300MHz FPGAs, no?
Intel, An Introduction to the Intel QuickPath Interconnect, QPI, Jan 2009, here.
Xilinx Research Labs/NCSA, FPGA HPC – The road beyond processors, Jul 2007, here. Need more current references but I keep hearing the same themes in arguments for FGPA HPC, so let’s think about this for a bit:
FPGAs have an opening because you are not getting any more clocks from microprocessor fab shrinks: OK.
Power density: meh. Lots of FinQuant code can run on a handful of cores. The Low Latency HFT folks cannot really afford many L2 misses. The NSA boys are talking about supercomputers for crypto not binary protocol parsing.
Microprocessors have all functions that are hardened in silicon and you pay for them whether you use them or not and you can’t use that silicon for something else: Meh, don’t really care if I use all the silicon on my 300 USD microprocessor as long as the code is running close to optimal on the parts of the silicon useful to my application. It would be nice if I got more runtime performance for my 300USD, no doubt. This point is like Advil is bad because you don’t always need to finish the bottle after you blow out your ankle. Yeah, I understand the silicon real estate is the most expensive in the world.
Benchmarks: Black Scholes 18msec FPGA @ 110 Mhz Virtex-4 203x faster than Opeteron – 2.2 Ghz: You Cannot be Serious! 3.7 microseconds per Black Scholes evaluation was competitive performance at the turn of the century. The relative speedup slides and quotations make me nervous. Oh, Celoxica provided the data – hey Black Scholes in 36 Nanoseconds on a single core of a dual core off-the-shelf general microprocessor from 2007. So the Virtex-4 does 1M Black Scholes evaluations in 18 milliseconds flat to competitive code on a dual core general purpose off-the-shelf microprocessor in 2007.
Make it easy for the users to use this hardware and get „enough of a performance‟ increase to be useful: meh, it’s for applications that do not need to go fast, for now (2007)?
Do not try to be the fastest thing around when being as fast with less power is sufficient: meh, really do not care so much about the power thing
FPGA: Different operations map to different silicon allows massive pipelining; lots of parallelism: OK. So, why bother with the previous two points?
Eggers/ U. Washington, CHiMPS, here. Eggers is reasonable.
There have been (at least) two hindrances to the widespread adoption of FPGAs by scientific application developers: having to code in a hardware description language, such as Verilog (with its accompanying hardware-based programming model) and poor FPGA memory performance for random memory accesses. CHiMPS, our C-to-FPGA synthesis compiler, solves both problems with one memory architecture, the many-cache memory model.
Many-cache organizes the small, distributed memories on an FPGA into application-specific caches, each targeting a particular data structure or region of memory in an application and each customized for the particular memory operations that access it.
CHiMPS provides all the traditional benefits we expect from caching. To reduce cache latency, CHiMPS duplicates the caches, so that they’re physically located near the hardware logic blocks that access them. To increase memory bandwidth, CHiMPS banks the caches to match the memory parallelism in the code. To increase task-level parallelism, CHiMPS duplicates caches (and their computation blocks) through loop unrolling and tiling. Despite the lack of FPGA support for cache coherency, CHiMPS facilitates data sharing among FPGA caches and between the FPGA and its CPU through a simple flushing of cached values. And in addition, to harness the potential of the massively parallel computation offered by FPGAs, CHiMPS compiles to a spatial dataflow execution model, and then provides a mechanism to order dependent memory operations to retain C memory ordering semantics.
CHiMPS’s compiler analyses automatically generate the caches from C source. The solution allows scientific programmers to retain their familiar programming environment and memory model, and at the same time provides performance that is on average 7.8x greater and power that is one fourth that of a CPU executing the same source code. The CHiMPS work has been published in the International Symposium on Computer Architecture (ISCA, 2009), the International Conference on Field Programmable Logic and Applications (FPL, 2008), and High-Performance Reconfigurable Computing Technology and Applications (HPRCTA, 2008), where it received the Best Paper Award.
BBC News magazine, Black-Scholes: The maths formula linked to the financial crash, here.
It’s not every day that someone writes down an equation that ends up changing the world. But it does happen sometimes, and the world doesn’t always change for the better. It has been argued that one formula known as Black-Scholes, along with its descendants, helped to blow up the financial world.
It doesn’t say if Scotland Yard had Scholes in for questioning yet. Oh, this story is sourced from Ian Stewart the math guy from Warwick.
Stewart says the lessons from Long-Term Capital Management were obvious. “It showed the danger of this kind of algorithmically-based trading if you don’t keep an eye on some of the indicators that the more conventional people would use,” he says. “They [Long-Term Capital Management] were committed, pretty much, to just ploughing ahead with the system they had. And it went wrong.”
Scholes says that’s not what happened at all. “It had nothing to do with equations and nothing to do with models,” he says. “I was not running the firm, let me be very clear about that. There was not an ability to withstand the shock that occurred in the market in the summer and fall of late 1998. So it was just a matter of risk-taking. It wasn’t a matter of modelling.”
Would it be a bad thing if John Meriwether and Myron Scholes attend a remedial applied maths course taught by Professor Stewart? Perhaps not, it could be awesome if there is You Tube video of the class.
Asymco analysis, here, highlights recent trends in computing. It looks increasingly as if you need to adapt to what the commercial market is giving you, even in floating point. I suspect that Joe and Suzy Sixpack will decide how you get your fp cycles. GPUs start to look more attractive in this light,right?
Oh and Terry Tao on Black Scholes, here
“Sandy Bridge-EP” Xeon E5 processors and their related “Romley” server platforms, are now in volume shipment, here
Overclocking insurance for Sandy Bridge from Intel, here
Whoa more to think about now: Maxeler says Intel’s Knights Ferry simplicity might not suit HPC, here, at The Inquirer. The article by Lawrence Latif has a subtitle that reads “More effort yields better performance”! I think I like the Inquirer it looks like a fancy version of the old Microprocessor Reports.
Check out the Thalesian Seminar: Stein from Bloomberg talking about CVA at the NY public Library 31 Jan 6pm
FPGAs in HFT Thalesian Seminar Slides from eigen.systems
Intel Xeon E5-2690 Sandy Bridge-EP Performance Revealed, Tom’s Hardware, here
MJ Flynn Accelerating computation with FPGAs, slides, @Berkeley video nice talk audio starts to break up around 31:00 though
Maxeler Technologies, home page
JP Morgan FPGA Careers factsheet. So, that JP Morgan credit derivative batch @300K CDS and 30K credit curves is going to run in 2 milliseconds (1000x the cheapest MacPro at the Apple Store) but I might only get 200x in practice so it will be between 2-10 milliseconds on the 40-node hybrid FPGA machine. Do I get all Virtex6 chips or is that hybrid as well?
Xilinx Virtex-7 FPGA Family looks to be the fastest 2X and 50% lower power then there is Virtex-6.
Black Scholes Evaluation on a Contemporary Microprocessor
We present a simple analysis on the expected Black Scholes runtime performance via competitive code running on a contemporary microprocessor in 2009. The Black Scholes formula used here is a closed form solution for the Black Scholes PDE for a European Call option. This BS formula should run about 36 nanoseconds per valuation with a full set of risk outputs. A portfolio of 100,000 European Call options is estimated to run in about 4 milliseconds on a single core of a single contemporary microprocessor. To look at it another way, there is no more than 4 milliseconds of actual work for a single 4.7GHz core to perform in evaluating BS on a portfolio of 100,000 Calls. The analysis scales; for 50,000 simulated points in a Monte Carlo using the BS marks we expect runtime to complete in 200 seconds on a single contemporary microprocessor core.
Problem
Let’s look at the estimated time to run the Black Scholes formula for 100,000 European Calls. For simple mark to market the Black Scholes formula requires five doubles as input for each Call.
| Input Variable | Definition |
| t | Time to maturity |
| S | Spot |
| K | Strike |
| r | Risk free rate |
| sigma | Vol of log returns |
So for 100,000 Calls, each requiring 40 bytes of input (5 inputs * 8 bytes) we get 4MB of inputs.
The code implementing the Black Scholes formula evaluates a closed form expression depending on the five inputs and produces a series of six outputs.
| Output Variable | Definition |
| mtm | Mark to market |
| delta | Derivative wrt S |
| gamma | 2nd derivative wrt S |
| vega | Derivative wrt sigma |
| theta | Derivative wrt Time |
| rho | Derivative wrt risk free rate |
So for 100,000 Calls we are going to generate 48 bytes of output per Call or 4.8 MB. Just by virtue of the input and output size being small we know the BS evaluation can possibly run with the floating point unit of the target microprocessor highly utilized. The two additional pieces of information we need are the specifics about the target microprocessor and the computational form of the Black Scholes formula. In the next two sections we assemble that information. This is not an IO bound computation, agreed?
Microprocessor Background
IBM published a specification for the 64 bit Power 6 in May 2007 see [1]. The processor is clocked at 4.7 GHz with 128KB L1 and 4MB private Level 2 cache (8MB total). The MASS library provides scalar and vectorized math functions and even clock cycle performance estimates, see [2]. For evaluating Black Scholes formulas there are several vector functions we are interested in, including those in the table below.
| Vector Function | Clocks | Description |
| vexp | 11.74 | exponential |
| vdiv | 7.88 | divide |
| vlog | 12.34 | log |
| vsqrt | 11.14 | square root |
The takeaway here is that for input vectors of length 1000 these functions require 2 to 3 nanoseconds per valuation. So now you can probably guess where we are going with this analysis. We are going to look at the specific form of the Black Scholes equations and estimate:
1. The length of time to get the inputs on average
2. The length of time to perform the evaluation of a vectorized Black Scholes code and compute each of the outputs
3. The length of time needed to write the outputs on average
We have 4MBs of inputs and 4.8 MBs of outputs so we are going to make the handwaving assumption that all the inputs and outputs fit in the 8MB L2 cache. Recognize we could simply adjust the number of Calls so all the inputs and output strictly fit in the L2 Cache and run multiple batches without substantially changing the average performance given the rather crude assumptions we are employing.
Patterson gives us an estimate of 200 clocks to get to DRAM from a 2006 vintage microprocessor. Notice in Patterson’s presentation he observes it is frequently better to recompute values than read them from memory (or a file); sort of an important point if you are coding floating point these days. Intel lists the following for Pentium M
| To Where | Clocks |
| Register | 1 |
| L1 | 3 |
| L2 | 14 |
| Main Memory | 240 |
So, once we know the details of the computational form for Black Scholes we will have most of the data we need to make an estimate of the BS runtime on a portfolio of 100K Calls.
Black Scholes Formula Background
Black Scholes in computational form looks like this:
MTM(t, S, K, r, sigma) = S * N(d1) – K*exp(-r*t)* N(d2)
where
d1 = (ln(S/K) + (r + sigma*sigma/2)*t)/ sigma*sqrt(t) and
d2 = d1 – sigma*t.
Then see [6] from Abromowitz and Stegun to get the computational form of the cumulative normal distribution function
N(x) = 1 – n(x)*(b1*y + b2*y*y+ b3*y*y*y + b4*y*y*y*y + b5*y*y*y*y*y)
where
n(x) = (exp(x*x/2))/sqrt(2*pi)
and y is actually a function of x so
y(x) = 1/(1.2316419*x)
and that is the whole computational story. Very roughly this cannot take long to run: 26 multiplies, divides, and adds, a couple exponential function and square root valuations, and a natural log valuation and it only takes five input variables. Think in terms of the number of clocks in a microsecond. In one millionth of a second a contemporary microprocessor executes somewhere between 3 and 4.7 thousand operations (read adds or multiplies). In a scalar function each multiply may take 4 cycles but in a vector pipeline you can simply get a multiply result per clock. When you write this code you are simply trying to value a handful of expensive operations (exp, divide, ln, sqrt where expensive means 10-20 cycles) and a couple dozen arithmetic ops that retire once a cycle in the pipeline. If you are not familiar with math library code, just recall broadly, Chebyshev approximations will typically give you a reasonably low residual error fit for a modest degree polynomial. You already know you will not be needing even one millionth of a second to execute this BS valuation. Now you just want to know how many million BS evaluations your code could run in a second. Lets finish the rough analysis.
Runtime Estimates
All you need to see is that you can vectorize the entire BS formula for each valuation, you only have five doubles as inputs.. Now just count the operations (multiplies and additions) and look up the expensive ones (the exps, divides, lns, square roots) in the MASS library table above – all the intermediate results and constants fit in the L1 cache for the batch of 1000 Calls (yes, handwaving but very plausible). We are also going to assume the dependencies within the computation are insignificant. Loosely, we assume, at a batch size of 1000, there is always computation that can be submitted to the execution pipeline so there are not bubbles of idle processor time (less plausible, requires more detailed analysis for completely accurate runtime accounting). Finally, we are not accounting for the time to load from the L1 cache instead assuming we have register file time access, always (still less plausible but probably not a big deal).
Let’s assume a batch size of 1000 Calls so we match up with the MASS library performance data and let’s also assume all the inputs are in the L2 cache to start. The time to get 1000 input vectors or 40KB into the 128KB L1 cache cost the 14 cycle latency on each of the L1 cache misses lets estimate 1000 cycles of latency (the computation stops waiting for inputs from L2) over 1000 input vectors of 5 doubles (lets assume we do not know how to prefetch at all). Let’s be conservative and say that inputs require one clock per valuation since they are buffered. Let’s argue that the writes will be overlapped with computation (some kind of write back cache) and cost no more than another clock per valuation of latency. So we are left with the time for a single BS valuation is 2 clocks + the length of time to perform the evaluation of a vectorized Black Scholes code and compute each of the outputs.
From the bottom up then; the CDF coded up looks like this in scalar format. (This code will further abuse the notation for t – it is just a temp variable in the code and it is the time to maturity in the text, sry.)
double N(const double x)
{
const double b1 = 0.319381530;
const double b2 = -0.356563782;
const double b3 = 1.781477937;
const double b4 = -1.821255978;
const double b5 = 1.330274429;
const double p = 0.2316419;
const double c = 0.39894228;
if(x >= 0.0) {
double t = 1.0 / ( 1.0 + p * x );
return (1.0 – c * exp( -x * x / 2.0 ) * t *
( t *( t * ( t * ( t * b5 + b4 ) + b3 ) + b2 ) + b1 ));
}
else {
double t = 1.0 / ( 1.0 – p * x );
return ( c * exp( -x * x / 2.0 ) * t *
( t *( t * ( t * ( t * b5 + b4 ) + b3 ) + b2 ) + b1 ));
}
}
With no heroism whatsoever the cumulative normal distribution function will cost:
a cycle for the Boolean on the if statement and
double t = 1.0 / ( 1.0 + p * x );
8 cycles for this vector divide and an additional cycle for the fused multiply-add in the denominator – a total of 9 clocks.
return (1.0 – c * exp( -x * x / 2.0 ) * t *
( t *( t * ( t * ( t * b5 + b4 ) + b3 ) + b2 ) + b1 ));
12 cycles for the vector exponential plus 5 cycles for the Horner’s rule form of the polynomial with fused multiply-adds plus at most 5 cycles for the remaining operations – a total of 23 clocks. Lets round up and say the entire cumulative normal distribution function will cost 35 clocks. This particular scalar code will not get there, but a suitably vectorized and equivalent code would hit the performance mark of 35 clocks per element.
How much will d1 and d2 cost?
d1 = (ln(S/K) + (r + sigma*sigma/2)*t)/ sigma*sqrt(t)
d2 = d1 – sigma*t
Well, d2 will cost one additional cycle after computing d1. The vector log in d1 cost 13 cycles. S/K will cost 8 cycles using vdiv(). The sqrt(t) will cost 12 cycles. The reciprocal of sigma*sqrt(t) will cost another 8 cycles. All the remaining operations in d1 cost say 5 cycles. The total cost for computing d1 and d2 is less than 50 cycles.
So then computing the mark will cost 50 cycles to get d1 and d2.
MTM(t, S, K, r, sigma) = S * N(d1) – K*exp(-r*t)* N(d2)
The two evaluations of N() will cost 35 cycles a piece or 70 cycles. The vector exponential will cost 12 cycles. All the remaining ops take no more than 5 cycles. The total count for the mark is less than 140 cycles or perhaps 30 nanoseconds.
But wait, what about the delta, gamma, and the rest of the greeks. By differentiating the BS formula appropriately we get:
delta = N(d1),
gamma = n(d1)/(S*sigma*t),
vega = S*n(d1) * sqrt(t),
theta = -(S*n(d1)*sigma)/2*sqrt(t) – r* K*exp(-r*t)*N(d2),
and
rho = K*t*exp(-r*t)*N(d2)
These are all subexpressions that were computed in the evaluation of the MTM lets call it as 5 cycles per greek, 25 cycles for all the greeks (a little handwaving but not much new computation in the greeks right?).
So the whole evaluation of a vectorized Black Scholes code and computing each of the outputs takes 165 cycles plus the 2 cycles for getting the inputs and outputs to and from the L2 caches, call it 170 cycles or 36 nanoseconds. So for 100,000 Calls you expect the computation to finish in 3.6 milliseconds i.e., in practice the dominant part of a contemporary BS computation will be reading the trades and market data from the database. By the time the option trade portfolio SQL query returns the BS batch is done (assuming somewhat aggressive buffered input overlapping with BS computation) on a single core of a single processor purchased new in the last 24 months. In terms of Black Sholes valuations per second close to 30 million per second is competitive in 2009.
Conclusions
This BS formula should run about 170 clocks per valuation – call it 36 nanoseconds per valuation with a full set of risk outputs – or 3.6 milliseconds for the portfolio of 100, 000 Calls on a single core of a single processor. We anticipated vectorizing the computation a bit but assumed nothing executing in parallel, i.e., we only need one core to achieve these runtime performance targets. Note, the analysis doesn’t look much different if the portfolio is composed of both Puts and Calls. Also notice we have assumed nothing about the composition of the series of inputs such as finding sequences of evaluations where t is constant, etc. So the same analysis trick will basically work for as many inputs as you can throw at it. Say you have 50,000 Monte Carlo paths giving you variable t, r, S, and sigma with K constant. The evaluations will finish in a little under 200 seconds on a single core of a single microprocessor without any further optimization (I think it is straight forward to pull the 200 seconds MC BS performance estimate down with a little further analysis). Finally, you can always just run this MC BS code on a grid or multicore and parallelize until you meet your elapsed time goals.
The IBM Power is fairly nice architecture to work with for floating point computations but the methodology used here would be applicable to any contemporary microprocessor that is seriously supporting floating point computation. There is nothing special about the using IBM Power for the analysis other than it is easy to do with publically available references. Obviously the headline performance estimate is going to depend on the target microprocessor core details but a ballpark performance figure between 30-100 nanoseconds per BS valuation seems reasonable. That is between 150 and 500 operations per BS evaluation with large enough L1 cache so that there are not many expensive cache misses.
References
[1] http://www-03.ibm.com/press/us/en/presskit/21546.wss
[3] http://www.fz-juelich.de/jsc/datapool/jump/JUMP-Tuning-Guide-for-HPC-Applications-on-POWER6.pdf
[4] http://www.eecs.berkeley.edu/BEARS/presentations/06/Patterson.ppt
[5] http://lwn.net/Articles/252125/
[6] http://www.sitmo.com/doc/Calculating_the_Cumulative_Normal_Distribution
