You are currently browsing the category archive for the ‘Analytics’ category.
Xilinx, High Performance Computing Using FPGAs, Sep 2010, here.
The shift to multicore CPUs forces application developers to adopt a parallel programming model to exploit CPU performance. Even using the newest multicore architectures, it is unclear whether the performance growth expected by the HPC end user can be delivered, especially when running the most data- and compute- intensive applications. CPU-based systems augmented with hardware accelerators as co-processors are emerging as an alternative to CPU-only systems. This has opened up opportunities for accelerators like Graphics Processing Units (GPUs), FPGAs, and other accelerator technologies to advance HPC to previously unattainable performance levels.
I buy the argument to a degree. As the number of cores per chip grow, the easy pipelining and parallelization opportunities will diminish. The argument is stronger if there are more cores per chip. 8 cores or under per general purpose chip it’s sort of a futuristic theoretical argument. More than a few programmers can figure out how to code up a 4 to 8 stage pipeline for their application without massive automated assistance. But the FPGA opportunity does exist.
The convergence of storage and Ethernet networking is driving the adoption of 40G and 100G Ethernet in data centers. Traditionally, data is brought into the processor memory space via a PCIe network interface card. However, there is a mismatch of bandwidth between PCIe (x8, Gen3) versus the Ethernet 40G and 100G protocols; with this bandwidth mismatch, PCIe (x8, Gen3) NICs cannot support Ethernet 40G and 100G protocols. This mismatch creates the opportunity for the QPI protocol to be used in networking systems. This adoption of QPI in networking and storage is in addition to HPC.
I buy the FPGA application in the NIC space. I want my NIC to go directly to L3 pinned pages, yessir I do, 100G please.
Xilinx FPGAs double their device density from one generation to the next. Peak performance of FPGAs and processors can be estimated to show the impact of doubling the performance on FPGAs [Ref 6], [Ref 7]. This doubling of capacity directly results in increased FPGA compute capabilities.
The idea proposed here is that you want to be on the exponentially increasing density curve for the FPGAs in lieu of clock speed increases you are never going to see again. Sort of a complicated bet to make for mortals, maybe.
I like how they do the comparisons though. They say here is our Virtex-n basketball player and here is the best NBA Basketball player … and they show you crusty old Mike Bibby 2012. Then they say watch as the Virtex-n basketball player takes Mike Bibby down low in the post, and notice the Virtex-n basketball player is still growing exponentially. So you can imagine how much better he will do against Mike Bibby in the post next year. Finally they say that Mike Bibby was chosen as the best NBA player for this comparison by his father Henry, who was also a great NBA player.
FPGAs tend to consume power in tens of watts, compared to other multicores and GPUs that tend to consume power in hundreds of watts. One primary reason for lower power consumption in FPGAs is that the applications typically operate between 100–300 MHz on FPGAs compared to applications on high-performance processors executing between 2–3 GHz.
Silly making Lemonade out of Lemons argument, the minute I can have my FPGAs clocked at 3 GHz I throw away the 300MHz FPGAs, no?
Intel, An Introduction to the Intel QuickPath Interconnect, QPI, Jan 2009, here.
Xilinx Research Labs/NCSA, FPGA HPC – The road beyond processors, Jul 2007, here. Need more current references but I keep hearing the same themes in arguments for FGPA HPC, so let’s think about this for a bit:
FPGAs have an opening because you are not getting any more clocks from microprocessor fab shrinks: OK.
Power density: meh. Lots of FinQuant code can run on a handful of cores. The Low Latency HFT folks cannot really afford many L2 misses. The NSA boys are talking about supercomputers for crypto not binary protocol parsing.
Microprocessors have all functions that are hardened in silicon and you pay for them whether you use them or not and you can’t use that silicon for something else: Meh, don’t really care if I use all the silicon on my 300 USD microprocessor as long as the code is running close to optimal on the parts of the silicon useful to my application. It would be nice if I got more runtime performance for my 300USD, no doubt. This point is like Advil is bad because you don’t always need to finish the bottle after you blow out your ankle. Yeah, I understand the silicon real estate is the most expensive in the world.
Benchmarks: Black Scholes 18msec FPGA @ 110 Mhz Virtex-4 203x faster than Opeteron – 2.2 Ghz: You Cannot be Serious! 3.7 microseconds per Black Scholes evaluation was competitive performance at the turn of the century. The relative speedup slides and quotations make me nervous. Oh, Celoxica provided the data – hey Black Scholes in 36 Nanoseconds on a single core of a dual core off-the-shelf general microprocessor from 2007. So the Virtex-4 does 1M Black Scholes evaluations in 18 milliseconds flat to competitive code on a dual core general purpose off-the-shelf microprocessor in 2007.
Make it easy for the users to use this hardware and get „enough of a performance‟ increase to be useful: meh, it’s for applications that do not need to go fast, for now (2007)?
Do not try to be the fastest thing around when being as fast with less power is sufficient: meh, really do not care so much about the power thing
FPGA: Different operations map to different silicon allows massive pipelining; lots of parallelism: OK. So, why bother with the previous two points?
Eggers/ U. Washington, CHiMPS, here. Eggers is reasonable.
There have been (at least) two hindrances to the widespread adoption of FPGAs by scientific application developers: having to code in a hardware description language, such as Verilog (with its accompanying hardware-based programming model) and poor FPGA memory performance for random memory accesses. CHiMPS, our C-to-FPGA synthesis compiler, solves both problems with one memory architecture, the many-cache memory model.
Many-cache organizes the small, distributed memories on an FPGA into application-specific caches, each targeting a particular data structure or region of memory in an application and each customized for the particular memory operations that access it.
CHiMPS provides all the traditional benefits we expect from caching. To reduce cache latency, CHiMPS duplicates the caches, so that they’re physically located near the hardware logic blocks that access them. To increase memory bandwidth, CHiMPS banks the caches to match the memory parallelism in the code. To increase task-level parallelism, CHiMPS duplicates caches (and their computation blocks) through loop unrolling and tiling. Despite the lack of FPGA support for cache coherency, CHiMPS facilitates data sharing among FPGA caches and between the FPGA and its CPU through a simple flushing of cached values. And in addition, to harness the potential of the massively parallel computation offered by FPGAs, CHiMPS compiles to a spatial dataflow execution model, and then provides a mechanism to order dependent memory operations to retain C memory ordering semantics.
CHiMPS’s compiler analyses automatically generate the caches from C source. The solution allows scientific programmers to retain their familiar programming environment and memory model, and at the same time provides performance that is on average 7.8x greater and power that is one fourth that of a CPU executing the same source code. The CHiMPS work has been published in the International Symposium on Computer Architecture (ISCA, 2009), the International Conference on Field Programmable Logic and Applications (FPL, 2008), and High-Performance Reconfigurable Computing Technology and Applications (HPRCTA, 2008), where it received the Best Paper Award.
BBC News magazine, Black-Scholes: The maths formula linked to the financial crash, here.
It’s not every day that someone writes down an equation that ends up changing the world. But it does happen sometimes, and the world doesn’t always change for the better. It has been argued that one formula known as Black-Scholes, along with its descendants, helped to blow up the financial world.
It doesn’t say if Scotland Yard had Scholes in for questioning yet. Oh, this story is sourced from Ian Stewart the math guy from Warwick.
Stewart says the lessons from Long-Term Capital Management were obvious. “It showed the danger of this kind of algorithmically-based trading if you don’t keep an eye on some of the indicators that the more conventional people would use,” he says. “They [Long-Term Capital Management] were committed, pretty much, to just ploughing ahead with the system they had. And it went wrong.”
Scholes says that’s not what happened at all. “It had nothing to do with equations and nothing to do with models,” he says. “I was not running the firm, let me be very clear about that. There was not an ability to withstand the shock that occurred in the market in the summer and fall of late 1998. So it was just a matter of risk-taking. It wasn’t a matter of modelling.”
Would it be a bad thing if John Meriwether and Myron Scholes attend a remedial applied maths course taught by Professor Stewart? Perhaps not, it could be awesome if there is You Tube video of the class.
One Div Zero: tentative add to our heroes list – James Iry “If cars were built like software then…well, I don’t know squat about building cars so who knows. It might be kinda cool. But probably not.” A Brief and Incomplete and Mostly Wrong History of Programming Languages
Zerohedge on the CDS market trade volume distribution and a proposal for CDS indicies to be exchange traded, here.
Kamakura Corporation, Risk management tools and Jarrow is involved somehow look at the research and blogs, here.
Cloud Computing get reviewed by US DOE, Argonne, and Lawrence Berkeley and gets a grade of meh, here from Clusterstock.
Salmon on Udacity, here. Stanford AI professor starts up online University UDACITY. Agreed this looks like it could grow. An AI course at Stanford gets 100K worldwide enrollment? wow. I would like to hear why the notion that Stanford, Harvard, Ptown, or Oxford should brand this is an obviously bad idea.
HPCWire Russell Fish @ Venray Technologies has an embedded Microprocessor in DRAM play, here. Problem is apart from Mortgages I doubt much Street P&L/Risk analytics is intrinsically memory bandwidth starved as opposed to processing starved. Super good at creating pipeline bubbles though.
Asymco analysis, here, highlights recent trends in computing. It looks increasingly as if you need to adapt to what the commercial market is giving you, even in floating point. I suspect that Joe and Suzy Sixpack will decide how you get your fp cycles. GPUs start to look more attractive in this light,right?
Oh and Terry Tao on Black Scholes, here
“Sandy Bridge-EP” Xeon E5 processors and their related “Romley” server platforms, are now in volume shipment, here
Overclocking insurance for Sandy Bridge from Intel, here
Whoa more to think about now: Maxeler says Intel’s Knights Ferry simplicity might not suit HPC, here, at The Inquirer. The article by Lawrence Latif has a subtitle that reads “More effort yields better performance”! I think I like the Inquirer it looks like a fancy version of the old Microprocessor Reports.
Check out the Thalesian Seminar: Stein from Bloomberg talking about CVA at the NY public Library 31 Jan 6pm
FPGAs in HFT Thalesian Seminar Slides from eigen.systems
Intel Xeon E5-2690 Sandy Bridge-EP Performance Revealed, Tom’s Hardware, here
MJ Flynn Accelerating computation with FPGAs, slides, @Berkeley video nice talk audio starts to break up around 31:00 though
Maxeler Technologies, home page
JP Morgan FPGA Careers factsheet. So, that JP Morgan credit derivative batch @300K CDS and 30K credit curves is going to run in 2 milliseconds (1000x the cheapest MacPro at the Apple Store) but I might only get 200x in practice so it will be between 2-10 milliseconds on the 40-node hybrid FPGA machine. Do I get all Virtex6 chips or is that hybrid as well?
Xilinx Virtex-7 FPGA Family looks to be the fastest 2X and 50% lower power then there is Virtex-6.
Apart from saving over a million dollars infrastructure cost annually, why bother running your credit derivative P&L/Risk batch Mac Pro+2s rather than Supercomputer+FPGAs+4m?
A. Rates, Mortgage, FX, Commodities P&L/Risk batches
It’s only the credit derivative guys in the Computerworld UK article that had this P&L and Risk batch performance problem in 2008 and now it is fixed?
B. Moore’s Law
More of a prediction than a law really, but it has predicted accurately for 40 years and is expected to hold for another decade. In technology if you do nothing more than “ride the wave” of Moore’s Law you are doing good, however, this observation’s simplicity can be deceiving. If the plan is for the credit derivative batch to “ride the wave” with thousands of grid CPUs, FPGAs, and associated staff you may have a challenge when scaling to larger problems, like counterparty valuation adjustment (CVA), that Moore’s Law will not automatically address for you. For example, Moore’s Law doesn’t get you power and machine room space for 10,000 grid CPUs for the CVA simulation. That is closer to “not riding the wave.”
C. Simplicity
Supercomputers and FPGAs, in addition to GPUs, Cuda, and CPU grids even dataflow architecture are all reasonable technology bets that could very well be the best way to ride the Moore’s Law wave through 2020. Using some or all of these technologies together in 2011 to run such a modest computation so slowly should elicit some follow up questions.
So, a major broker-dealer gets a 2011 industry award for running their overnight Risk and P&L credit derivative batch though a 1000 node CPU grid and some FPGAs in 238 seconds (in 2008 the same computation and same broker-dealer but presumably different CDS inventory took 8 hours, a great success). Then some blog posting claims that this same credit derivative batch could be run with some optimized C++ code on a $2500 Mac Pro in under 2 seconds IEEE 754 double precision tied out to the precision of the inputs. What’s going on? Does the credit derivative batch require a $1,000,000 CPU grid, FPGAs, and 238 seconds or one $2,500 MacPro, some optimized code, and 2 seconds?
It’s the MacPro + 2 seconds very likely, let’s think how this could be wrong:
A. The disclosed data is materially wrong or we have misinterpreted it egregiously. The unusual event in all this is the public disclosure of the broker-dealer’s runtime experience and the code they were running. It is exceedingly rare to see such a public disclosure. That said, the 8 hour production credit derivative batch at a broker-dealer in 2008 is not the least-bit surprising. The disclosure itself tells you the production code was once bloated enough to be optimized from 8 hours to 4 minutes. You think they nailed the code optimization on an absolute basis when the 4 minutes result was enough to get a 2011 industry award, really? The part about the production infrastructure being a supercomputer with thousands of grid CPUs and FPGAs, while consistent with other production processes we have seen and heard of running on the Street, is the part you really have to hope is not true.
B. The several hundred thousand position credit derivative batch could be Synthetic CDO tranches and standard tranches and require slightly more complicated computation than the batch of default swaps assumed in the previous analysis. But all the correlation traders (folks that trade CDOs and tranches) we know in 2011 were back to trading vanilla credit derivatives and bonds. The credit derivative batch with active risk flowing through in 2011 is the default swap batch (you can include index protection like CDX and ITRX in this batch as well). Who is going to spend three years improving the overnight process on a correlation credit derivative book that is managed part time by a junior trader with instructions to take on zero risk? No one.
C. The ISDA code just analyzed is not likely to be the same as the 2011 award winning production credit derivative batch code. In fact, we know portions of the production credit derivative batch were translated into FPGA circuitry, so the code is real different, right? Well over the last decade of CDS trading most of the broker-dealers evolved to the same quantitative methodology for valuing default swaps. Standards for converting upfront fees to spreads (ISDA) and unwind fees (Bloomberg’s CDSW screen) have influenced broker-dealer CDS valuation methodology. We do not personally know exactly what quantitative analytics each one of the of the broker-dealers runs in 2011, but Jarrow-Turnbull default arrival and Brent’s method for credit curve cooking covers a non-trivial subset of the broker-dealers. The ISDA code is not likely to be vastly different from the production code the other broker-dealers use in terms of quantitative method. Of course, in any shop there could be some undisclosed quantitative tweaks included in production and the MacPro + 2 seconds analysis case would be exposed in that event.
D. The computational performance analysis just presented could be flawed. We have only thought about this since seeing the Computerworld UK article and spent a couple weekends working out the estimate details. We could have made a mistake or missed something. But even if we are off by a factor of 100 in our estimates (we are not) its still $2500 + MacPro + 200 seconds versus $1,000,000 + 1000 CPU+ FPGAs+238 seconds.
I am not gonna pay a lot for this muffler
What is left to get a runtime estimate for the credit derivative batch on an off-the-shelf computer? We need a cooked credit curve (bootstraped hazard rate term structure) and to estimate the computational cost of the IOUs from the previous valuation step. Let’s account for the clock cycles needed to cook a USD credit curve out to 10 years with par CDS quotes at 1Y, 3Y, 5Y, 7Y, 10Y and a previously cooked USD Libor. Then we will address the additional computational cost of the valuation IOUs. The final post in this series will cover the additional computational cost of perturbational risk (multiply single curve cooking estimates by 20-30x corresponding to the number of distinct perturbed curves required for risk, ditto valuation estimates) as well as the cost of valuation for non-standard inventory CDS trades (trades where the paydates are not standard and therefore introduce an interpolation computation charge).
Credit Curve Cooking:
Assuming we are given a cooked USDLibor curve we can see from the valuation code the credit curve cooking only needs to take the par CDS quotes for the value date and return a set of assignments through JpmcdsForwardZeroPrice() for just the 20 paydates in the expected case (5Y CDS paying quarterly) or both the 20 paydates and the 130 numerical integration grid points (biweekly grid). So, if we derive estimates on the convergence speed of the one dimensional root finder, Brent’s method, and then get estimates on the cost of the Valuation IOUs we will have a reasonably complete floating point performance picture. We will assume away the cost of USD libor curve cooking and USDLibor interpolation since the computation cost is small and can be amortized across all the credits cooked against USDLibor, in any case. So, Valuation IOUs 2, 6, and 9 (see below) are assumed to cost less than one clock cycle of floating point execution. They are just variable assignments with some cost in cache pressure.
Here is the ISDA credit curve cooking code. Notice they use Brent’s method and the convergence criteria looks like zero to ten decimal paces. Given that you are cooking a credit curve that was cooked the previous business day as well you can assume the root is bracketed to 10bps or three decimal places. Let’s not even bother to account for potential compute time efficiency the bootstrapping valuations could provide. Assume every Brent function valuation cost as much as a 5Y CDS valuation.
/* we work with a continuously compounded curve since that is faster -
but we will convert to annual compounded since that is traditional */
cdsCurve = JpmcdsMakeTCurve (today,
endDates,
couponRates,
nbDate,
JPMCDS_CONTINUOUS_BASIS,
JPMCDS_ACT_365F);
if (cdsCurve == NULL) goto done;
context.discountCurve = discountCurve;
context.cdsCurve = cdsCurve;
context.recoveryRate = recoveryRate;
context.stepinDate = stepinDate;
context.cashSettleDate = cashSettleDate;
for (i = 0; i < nbDate; ++i)
{
double guess;
double spread;
guess = couponRates[i] / (1.0 – recoveryRate);
cl = JpmcdsCdsContingentLegMake (MAX(today, startDate),
endDates[i],
1.0,
protectStart);
if (cl == NULL) goto done;
fl = JpmcdsCdsFeeLegMake(startDate,
endDates[i],
payAccOnDefault,
couponInterval,
stubType,
1.0,
couponRates[i],
paymentDCC,
badDayConv,
calendar,
protectStart);
if (fl == NULL) goto done;
context.i = i;
context.cl = cl;
context.fl = fl;
if (JpmcdsRootFindBrent ((TObjectFunc)cdsBootstrapPointFunction,
(void*) &context,
0.0, /* boundLo */
1e10, /* boundHi */
100, /* numIterations */
guess,
0.0005, /* initialXstep */
0, /* initialFDeriv */
1e-10, /* xacc */
1e-10, /* facc */
&spread) != SUCCESS)
{
JpmcdsErrMsg (“%s: Could not add CDS maturity %s spread %.2fbp\n”,
routine,
JpmcdsFormatDate(endDates[i]),
1e4 * couponRates[i]);
goto done;
}
cdsCurve->fArray[i].fRate = spread;
Assume Brent on average converges superlinearly like the secant method (the exponent is something like 1.6, see Convergence of Secant Method). Assume also that one iteration of Brent’s method costs one function evaluation (w. the IOUs computation time added) and some nominal extra clocks say +20%. Given the accuracy of the initial guess, the smooth objective function, and the expected superlinear convergence five iterations (1.6**5) should be good enough on average. With par CDS quotes at 1Y, 3Y, 5Y, 7Y, and 10Y to fit we expect to need 20 CDS valuations. Thinking ahead to perturbational risk (in the next post), we can obviously bundle many perturbations into a single massive vectorized Brent’s method call and drive the cooking cost massively lower. Let’s ignore this obvious optimization for now. What do we have so far?
CDS credit curve cooking estimates are then as follows:
Expected case = 1.2 * (20 * 150 + 20 * valuation IOU) clock cycles (quarterly paydates=gridpoints)
Worst case = 1.2 * (20 * 810 + 20 * valuation IOU) clock cycles (biweekly numerical integration gridpoints)
Expected case is running at 3600 cycles plus 24x the IOU cycle count. In other words, one microsecond plus some IOU accounting.
Valuation IOU Runtime Accounting:
With subtopic titles like “Valuation IOU Runtime Accounting” I am certain to cut my blog reading population in half – the price of a lonely optimization post. Recall our list of valuation IOUs. We assumed that some Rates process will exogenously produce some suitably cooked and interpolated USD Lidor (discCurve) at the cost of some L2 cache pressure to us. Note we assume on the code rewrite we vectorize and unroll loops anywhere possible in the code. So in our accounting we take the cycle counts from Intel’s vectorized math library (MKL) for double precision. Vector divides cost 5.4 cycles per element. Vector exponentials cost 7.9 cycles per element. Vector log costs 9.9 cycles and vector reciprocals cost 3.36 cycles per element. The vector lengths for these cycle counts are probably close to 1024 so perhaps the cycle counts are slightly optimistic.
Valuation IOU Accounting list
- survival [paydates] = JpmcdsForwardZeroPrice(SpreadCurve,…) 4*paydate clocks
- discount[paydates] = JpmcdsForwardZeroPrice(discCurve,…) 0 assumed away
- vector1[paydates] = notional*couponRate*accTime[paydates] paydate clocks
- vector2[paydates] = survival[paydates]*discount[paydates] paydate clocks
- s[gridpts] = JpmcdsForwardZeroPrice(SpreadCurve,…) 4*paydate clocks
- df[gridpts] = JpmcdsForwardZeroPrice(discCurve,…) 0 assumed away
- t[gridpts] grid time deltas 0 assume away
- lambda[gridpts] = log(s[gridpts]/s[gridpts])/t[gridpts] 15*gridpts clocks
- fwdRate[gridpts] = log(df[gridpts]/df[gridpts])/t[gridpts] 0 assumed away
- lambdafwdRates[gridpts] = lambda[gridpts] * fwdRates[gridpts] gridpts clocks
- vector3[gridpts0 = exp(-t[gridpts]*lambdafwdRates[gridpts]) 10 * gridpts clocks
- vector4[gridpts] = s[gridpts] * df[gridpts] gridpts clocks
- vector5[gridpts] = reciprocal (lambdafwdRates[gridpts]) 4 * gripts clocks
- t0[gridpts], t1[gridpts] coefficients for accrued interpolation 0 assume away
- thisPv[gridpts] 4 * gridpts clocks
Looks like the IOU cost in aggregate is 10 * paydate clocks or 200 clock cycles plus 35 * gridpts clocks (ugh!). So expected case assumes gridpoints = quarterly paydates so all in 45 * paydates clock cycles or 900 clocks. Worst case 200 clocks + 35 * 130 clocks or 4750 clocks.
All in then, CDS credit curve cooking estimates are as follows:
Expected case = 1.2 * (20 * 150 + 20 * 900) clock cycles (quarterly paydates=gridpoints)
Worst case = 1.2 * (20 * 810 + 20 * 4750) clock cycles (biweekly numerical integration gridpoints)
Expected case looks like 25,200 clocks or 7 microseconds. Worst case looks like 133,440 cycles or 37 microseconds. Ok, we tentatively thought 20 microseconds worst case but recall we left a chunk of optimizations unaccounted for in the interests of brevity. But the worst case is about two times more expensive than we naively expected.
From here it is easy, full risk will be 20x to 30x the single curve cooking estimate even if we are totally apathetic and clueless. Expected case is 7 * 20 microseconds call it 140 microseconds and worst case is 37 * 30 microseconds call it 1.11 milliseconds. Since we are going to argue that non-standard CDS interpolation can be stored with the trade description the incremental cycle count for non-standard deals is de minimus since the computational cost is amortized over the life of the position (5Y on average). Expected case then in our target 10K credit curves 100K CDS looks like 10 K * 140 microseconds + 100K * 20*42 nanoseconds or one and a half a seconds on a single core of a single microprocessor. If I have to do the Computerworld UK credit batch on the cheapest Mac Pro I can order from the Apple store today, the Quad core 2.8 GHz Nehalem for 2,499 USD (not gonna take the Add 1,200USD to get the 3.33 GHz Westmere). It’ll cost me 5.8 seconds on a single core but fortunately I purchased four cores (of raw power according to the Apple website) so the Computerworld UK credit derivative batch still runs in a second and change for 2,499 USD + shipping.
The first step to getting a better estimate on the off-the-shelf runtime for credit curve cooking is to get a sense of how long the valuation of a quoted SNAC CDS (standardized default swaps, no custom cashflows) requires. Getting a SNAC CDS valuation runtime estimate is the subject of this post. The outline of the argument is to take the SNAC valuation runtime estimate and cook the credit curve (bootstrap the hazard rate term structure) from the quoted SNAC default swaps with terms 1Y, 3Y, and 5Y (see O’Kane and Turnbull pgs 12-14). Once we have a cooked credit curve we can run valuation and risk for the default swap inventory dependent on the given cooked credit curve. For a given credit, the serial computation runtime for a single CDS position valuation is an order of magnitude smaller than the credit curve cooking computation. Take a look at O’Kane and Turnbull pgs5-10 to review the Jarrow-Turnbull reduced form model for valuing the contingent leg and the fee leg. See the FINCAD document describing the modeling assumptions behind the ISDA CDS Standard Model, here. Recognize also that the ISDA code doesn’t have all the calendars and currency convention data you would expect in a complete production Credit P&L but it’s OK as a proxy code for getting estimates.
Vanilla default swaps have two legs a Premium Leg and a Contingent Leg, the default swap PV (present value) is the difference between the Premium Leg PV and the Contingent Leg PV. If counterparty A is long protection in a default swap they are paying premium to get the protection of the contingent leg, as well as being short the credit. We will assume the average maturity of a default swap is 5Y. We will proceed purely with static analysis of the code with no peeking at the underlying mathematics for better optimizations – nothing but the code.
Valuing the Premium Leg
The basic computational task with the Premium Leg is to discount each of the scheduled premium cash flows off the risky curve accounting for both the time value of money and the probability that the premium may never be paid due to termination of the default swap. Typically the main computational part of valuing the Premium Leg is valuing the accrual on default. The actual discounting of the premium cashflows to be paid on the quarterly paydates is straightforward and computationally inexpensive. Accruing on default is a standard convention and refers to the portion of the premium owed by the protection buyer in the event that a default occurs between premium paydates ( as oppose to default arriving exactly on a paydate). The protection buyer owes the premium accrued up to the date that the relevant default is officially recognized. For computational efficiency we are going to push the accrued on default computation over to the Contingent Leg computation, because they are so similar. This leaves about 20-30 cycles of computation for a 5Y default swap plus an IOU to account for the more computationally expensive accrual on default valuation. Here is the relevant code from ISDA from inside a loop for each paydate:
amount = notional * couponRate * accTime;
survival = JpmcdsForwardZeroPrice(spreadCurve, today, accEndDate + obsOffset);
discount = JpmcdsForwardZeroPrice(discCurve, today, payDate);
myPv = amount * survival * discount;
Notice that the variables notional, couponRate, and n-1 of n assignments of accTime are known at compile time. The calls to the function JpmcdsForwardZeroPrice() for the spreadCurve and this discCurve are simply interpolations from the cooked curves where the interpolation parameters (at least n-1 out of n) are known at compile time. Price them as assignments and assume that the cache size will not be swamped with 20 or so doubles for the quarterly paying CDS. The product survival * discount per paydate is known at curve cooking time. The trade off is between consuming a cycle per paydate versus adding to the cache load. Let’s put it in the cache for now and assume the curve cooker will compute the product survival*discount. So there is slightly more than a cycle (call it 1.5 cycles) per quarterly paydate w. no cache pressure or wait state penalties. We will call it 30 cycles for a 5Y quarterly paying average default swap.
Valuing the Contingent Leg
We need to account for the clock cycles required to:
- PV the default swap payout given the expected default arrival over the term of the CDS contract.
- PV the accrued premium owed by the protection buyer to the protection seller in the event the default arrives between premium cash flow paydates (IOU from the payleg).
Again static code analysis only, no peeking at the mathematics. Here is the relevant ISDA code for 1. integrating the value of the terminal CDS contingent payoff over the term of the CDS contract.
s1 = JpmcdsForwardZeroPrice(spreadCurve, today, startDate);
df1 = JpmcdsForwardZeroPrice(discCurve, today, MAX(today, startDate));
loss = 1.0 – recoveryRate;
for (i = 1; i < tl->fNumItems; ++i)
{
double lambda;
double fwdRate;
double thisPv;
s0 = s1;
df0 = df1;
s1 = JpmcdsForwardZeroPrice(spreadCurve, today, tl->fArray[i]);
df1 = JpmcdsForwardZeroPrice(discCurve, today, tl->fArray[i]);
t = (double)(tl->fArray[i] – tl->fArray[i-1])/365.0;
lambda = log(s0/s1)/t;
fwdRate = log(df0/df1)/t;
thisPv = loss * lambda / (lambda + fwdRate) *
(1.0 – exp(-(lambda + fwdRate) * t)) * s0 * df0;
myPv += thisPv;
}
Again, as in the case of the pay leg, all the calls to JpmcdsForwardZeroPrice() are interpolations known at credit curve cooking time so we will account for them as variable assignments and assign the computational cost of interpolation to the curve cooker. The time consuming computation here and in the code below (for accrued on default) depends on the resolution of the time grid (fNumItems) to get the discrete summation to converge to the continuous integral value. If the integration time grid has M points and r is the 5Y swap rate (the risk free rate USD currently 114 bps 19Jan12) then O’Kane and Turnbull (pg10) show the percentage error in the discreet approximation is:
r/2*M
In production P&L batches I have seen M=26 (biweekly integration grid points) but based on current levels M=4 (quarterly cashflow paydates) would bring the error to within the bid ask spread. Let’s assume M=26 worst cast and M=4 expected case for performance approximations.
Notice that the expensive floating point operations in these loops are the math.h functions log() and exp() and divides. We will push the log and exp calls to the curve cooker since the grid points for the integration as well as all the interpolations are known at curve cooking time. The variable lambda is known at credit curve cooking time and the variable fwdRate is known at Libor curve cooking time. Similarly all the values of t are known at compile time so we are not even going to multiply by the reciprocal of 365 inside the loop. We will also book the computational cost of the reciprocal 1.0/(lambda + fwdRate) to the curve cooker. So, no divides and no math.h calls in the loop we cost it at 4 fused add multiply cycles after vectorizing, loop unrolling, and optimization. In the expected case, loop 1 cost us 4 cycles * (4*5) grid points or 80 cycles. In the worst case, loop 1 cost us 4 cycles * (26*5) grid points or 520 clock cycles.
Here is the relevant ISDA code for 2, a very similar loop compared to the PV of the contingent payoff loop, right?
for (i = 1; i < tl->fNumItems; ++i)
{
double lambda;
double fwdRate;
double thisPv;
double t0;
double t1;
double lambdafwdRate;
if(tl->fArray[i] <= stepinDate) continue;
s1 = JpmcdsForwardZeroPrice(spreadCurve, today, tl->fArray[i]);
df1 = JpmcdsForwardZeroPrice(discCurve, today, tl->fArray[i]);
t0 = (double)(subStartDate + 0.5 – startDate)/365.0;
t1 = (double)(tl->fArray[i] + 0.5- startDate)/365.0;
t = t1-t0;
lambda = log(s0/s1)/t;
fwdRate = log(df0/df1)/t;
lambdafwdRate = lambda + fwdRate + 1.0e-50;
thisPv = lambda * accRate * s0 * df0 * (
(t0 + 1.0/(lambdafwdRate))/(lambdafwdRate) -
(t1 + 1.0/(lambdafwdRate))/(lambdafwdRate) *
s1/s0 * df1/df0);
myPv += thisPv;
s0 = s1;
df0 = df1;
subStartDate = tl->fArray[i];
}
We will treat both the loops simultaneously assume that the loops will be fused. The same analysis applies to this loop: laqmbda, fwdRate, lambdafwdRate, and all the interpolations and ratios are known at curve cooking time so they will be accounted for in the curve cooker not in the CDS valuation. Net, 2 fused multiply cycles will get the accrued on default value per grid point. Expected case = 40 cycles, worst case an additional 260 cycles.
CDS valuation estimates are then as follows
Expected case = 30 + 80 + 40 = 150 clock cycles, 42 ns, 23MM valuations /second
Worst case = 30 + 520 + 260 = 810 clock cycles, 225 ns, 4.5 MM valuations/second
In another post we will go through the curve cooking accounting and deal with the cache traffic we are creating with allocating computation to the curve cooker. Informally, we think cooking is an order of magnitude more expensive than valuation so we are kind of thinking under 10 – 20 microseconds to cook a curve on a single off-the-shelf core in either expected or worst case. 50K curves cooked per second seems plausible – let’s see how it goes the cache penalties/fp pipeline bubbles could catch up with us.
Question:
Over/Under: 12 seconds of floating point computation; P&L and risk for several 100K default swaps; off the shelf servers in 2012 – running standard ISDA CDS model code; 5 to 6 decimal place tie out?
Even for non-SNAC positions I’m thinking under, perhaps epically under depending on the hardware budget. Ideally you run code to determine this and we do not have the optimized code in hand, but we can write it. Before writing the code you want to run the due diligence to see if writing the code is worth it. Let’s see.
History:
We recall from previous 2009 estimates ( see Totally Serial) for unoptimized scalar code cooking of par and perturbed curves certainly costs less 7 milliseconds and valuation of a default swap’s contingent and fee legs take about 40 microseconds . This is of course assuming that the default swaps have custom cashflows and are not SNAC default swaps (standard instruments with IMM date cashflow schedules) in which case the default swap valuation execution time massively collapses and the credit curve cooking time can be significantly reduced as well.
Methodology:
Let’s assume 100K OTC (non SNAC default swaps) and 10K credit curves and we can scale to fit the actual curve and deal counts with more accurate estimates. Lets further assume that the 10K credit curves are independent (there is no known geometric or arithmetic structure to the traders marks). So, you actually have to cook each of the 10K credit curves. For unoptimized code then we expect about 7 ms*10K + 0.04ms * 100K or ~80 seconds floating point computation elapsed time on a single contemporary microprocessor core. Naively, you would guess this benchmark optimization would be looking for about 20-30 seconds (performance improvement 3x-4x over unoptimized) on a single x86 core. So assuming we buy a $3000 6-8 core microprocessor in 2012 the 100K/10K batch optimized fp code executes in a few seconds, expected case. Variance in the estimate will depend on some assumptions like the average number of iterations for convergence with Brent, the average term of the default swaps, the number of perturbed credit curves needed for full risk, the scaling across multiple on-chip cores among other things.
Amdahl’s Law directs the due diligence to first focus on the curve cooking so that’s what we will do in addition to rechecking the previous estimates ( I think they were very conservative) in the outlined argument. Lets count the operations in the ISDA code to nail this estimate down. Then we will estimate what the L1 and L2 cache efficiency we can plausibly expect to get a target due diligence number. What do we look at in the ISDA src distribution? Bunch of src files we don’t care about (DK) from the perspective of getting an estimate of the floating point computation time. The other annotated file names (in bold) contain source we may need to study.
- badday.c DK
- bsearch.c DK
- buscache.c DK
- busday.c DK
- cashflow.c DK
- cds.c computes contingent leg, fee leg, vanilla CDS PV
- cdsbootstrap.c Brent root finder for credit curve cooking
- cdsone.c converts upfront to flat spread
- cerror.c DK
- cfileio.c DK
- cfinanci.cpp DK
- cmemory.c DK
- contingentleg.c computes contingent leg and one period integral
- convert.c DK
- cx.c DK
- cxbsearch.c DK
- cxdatelist.c DK
- cxzerocurve.c calc zero price for given start and maturity
- date_sup.c DK
- dateadj.c DK
- dateconv.c DK
- datelist.c DK
- dtlist.c DK
- feeleg.c calc PV of single fee, fee leg, accrued
- fltrate.c DK
- gtozc.c looks like curve cooking
- interpc.c interpolation of rates
- ldate.c DK
- linterpc.c curve interpolation
- lintrpl.c DK
- lprintf.c DK
- lscanf.c DK
- rtbrent.c root finding for curve cooking
- schedule.c possibly DK – cash flow schedule generation
- streamcf.c calculate cashflows
- strutil.c DK
- stub.c stub payments
- tcurve.c discounting
- timeline.c not sure maybe DK
- version.c DK
- yearfrac.c DK
- zcall.c zero curve
- zcswap.c add strip of swap to zcurve
- zcswdate.c DK
- zcswutil.c cashflow lists for swap instruments
- zerocurve.c build zero curve from mm and swaps
- zr2coup.c calculates the par swap rate
- zr2fwd.c calculates zero coupon forward rate
We should be able to pound this due diligence through with a little effort, let’s see how it goes. Maybe I am missing something? It’ll take a couple postings I expect, but it will be interesting.
ISDA and Markit offer the CDS JPM CDS code here for open source download since about 2009.
James Zucker ported the code to an iPhone and recounted the tale in My First iPhone App. You can get the App from iTunes for free its called …iCDS from iJAZ Software.
Dominic O’Kane has/had his own web based calculator based on his 2008 book Modeling Single-name and Multi-name Credit Derivatives which is in turn based on a very good Lehman research report O’Kane published with Stuart Turnbull in 2003, Valuation of Credit Default Swaps.
Hull and White 2003 on Valuation of a CDO and an nth to Default CDS Without Monte Carlo Simulation. If there is a broker dealer running a PDE solver on their Credit Derivative inventory for daily P&L, find out who the head of quantitative research is there and bow before that guy because he has achieved Steve Jobs-level marketing skills.
Matlab CDS pricer, here.
BionicTurtle has a YouTube video of how to run a CDS valuation on a spreadsheet, here. Appears to be the tip of iceberg of You Tube videos explaining Credit Derivatives
