Nadav Rotem, C to Verilog, here.
C-to-Verilog is a free, open-source, on-line C-to-Verilog compiler. You can copy and paste your existing C code and our on-line compiler will synthesize it into optimized Verilog.
For additional information on how to use our website to create FPGA designs, watch the screencast below.
Andrew Putnam, Susan Eggers, et al., Performance and Power of Cache-Based Reconfigurable Computing, here. Did not see these slides at Susan Eggers' website. Note Prasanna Sundararajan's name is on the slides – he must know something about 2008 CHiMPS.
This paper describes CHiMPS, a C-based accelerator compiler for hybrid CPU-FPGA computing platforms. CHiMPS's goal is to facilitate FPGA programming for high-performance computing developers. It inputs generic ANSI C code and automatically generates VHDL blocks for an FPGA. The accelerator architecture is customized with multiple caches that are tuned to the application. Speedups of 2.8x to 36.9x (geometric mean 6.7x) are achieved on a variety of HPC benchmarks with minimal source code changes.
Looking at the CHiMPS paper: they have a Black-Scholes benchmark but only quote relative performance between x86/gcc and CHiMPS/FPGA. It sure looks like the FPGA performance here is compared against naive x86 code. Recall Pink Iguana published Black-Scholes in 36 nanoseconds on 2007 commodity hardware using XLC. So unless CHiMPS was being misleading here, I would assume they had a 2008 FPGA that did Black-Scholes evals in 2.05 (36/17.6) nanoseconds. I'm gonna go ahead and give a time-travel 2008 shenanigans yellow card, because I don't believe there was enough non-ILP parallelism in the basic Black-Scholes formula evaluation given the FPGA clock speeds circa 2008 (something like 550 MHz); see Xilinx References, Apr 2012.
Speedup numbers do not fully convey the performance potential of the CHiMPS accelerators, because performance is often highly dependent on the input data size and the computation-communication ratio. To illustrate this, Figure 2 depicts Black-Scholes’s sensitivity to input size. For very small numbers of stocks, using CHiMPS is detrimental to performance, because the data transfer time to the FPGA is greater than executing on the CPU. Performance breaks even around 24 stocks, and continues to increase until it peaks at a speedup of 17.6 (700 stocks). Speedup then decreases since the data for more stocks no longer fits in the FPGA caches, and the CPU’s larger cache and memory prefetcher keep the CPU execution time steady.
Agner's CPU Blog, New C++ Vector Class Library, here. I'm interested in the AVX2 side of this.
Xilinx, High Performance Computing Using FPGAs, Sep 2010, here.
The shift to multicore CPUs forces application developers to adopt a parallel programming model to exploit CPU performance. Even using the newest multicore architectures, it is unclear whether the performance growth expected by the HPC end user can be delivered, especially when running the most data- and compute- intensive applications. CPU-based systems augmented with hardware accelerators as co-processors are emerging as an alternative to CPU-only systems. This has opened up opportunities for accelerators like Graphics Processing Units (GPUs), FPGAs, and other accelerator technologies to advance HPC to previously unattainable performance levels.
I buy the argument to a degree. As the number of cores per chip grows, the easy pipelining and parallelization opportunities will diminish. The argument is stronger the more cores there are per chip. At 8 cores or fewer per general-purpose chip it's sort of a futuristic theoretical argument. More than a few programmers can figure out how to code up a 4-to-8-stage pipeline for their application without massive automated assistance. But the FPGA opportunity does exist.
The convergence of storage and Ethernet networking is driving the adoption of 40G and 100G Ethernet in data centers. Traditionally, data is brought into the processor memory space via a PCIe network interface card. However, there is a mismatch of bandwidth between PCIe (x8, Gen3) versus the Ethernet 40G and 100G protocols; with this bandwidth mismatch, PCIe (x8, Gen3) NICs cannot support Ethernet 40G and 100G protocols. This mismatch creates the opportunity for the QPI protocol to be used in networking systems. This adoption of QPI in networking and storage is in addition to HPC.
I buy the FPGA application in the NIC space. I want my NIC to go directly to L3 pinned pages, yessir I do, 100G please.
Xilinx FPGAs double their device density from one generation to the next. Peak performance of FPGAs and processors can be estimated to show the impact of doubling the performance on FPGAs [Ref 6], [Ref 7]. This doubling of capacity directly results in increased FPGA compute capabilities.
The idea proposed here is that you want to be on the exponentially increasing density curve for FPGAs in lieu of the clock-speed increases you are never going to see again. Sort of a complicated bet to make for mortals, maybe.
I like how they do the comparisons though. They say here is our Virtex-n basketball player and here is the best NBA Basketball player … and they show you crusty old Mike Bibby 2012. Then they say watch as the Virtex-n basketball player takes Mike Bibby down low in the post, and notice the Virtex-n basketball player is still growing exponentially. So you can imagine how much better he will do against Mike Bibby in the post next year. Finally they say that Mike Bibby was chosen as the best NBA player for this comparison by his father Henry, who was also a great NBA player.
FPGAs tend to consume power in tens of watts, compared to other multicores and GPUs that tend to consume power in hundreds of watts. One primary reason for lower power consumption in FPGAs is that the applications typically operate between 100–300 MHz on FPGAs compared to applications on high-performance processors executing between 2–3 GHz.
Silly making-lemonade-out-of-lemons argument: the minute I can have my FPGAs clocked at 3 GHz, I throw away the 300 MHz FPGAs, no?
Intel, An Introduction to the Intel QuickPath Interconnect, QPI, Jan 2009, here.
Xilinx Research Labs/NCSA, FPGA HPC – The road beyond processors, Jul 2007, here. Need more current references, but I keep hearing the same themes in arguments for FPGA HPC, so let's think about this for a bit:
FPGAs have an opening because you are not getting any more clocks from microprocessor fab shrinks: OK.
Power density: meh. Lots of FinQuant code can run on a handful of cores. The Low Latency HFT folks cannot really afford many L2 misses. The NSA boys are talking about supercomputers for crypto not binary protocol parsing.
Microprocessors have all functions hardened in silicon; you pay for them whether you use them or not, and you can't use that silicon for something else: meh. I don't really care if I use all the silicon on my 300 USD microprocessor as long as the code is running close to optimal on the parts of the silicon useful to my application. It would be nice if I got more runtime performance for my 300 USD, no doubt. This point is like saying Advil is bad because you don't always need to finish the bottle after you blow out your ankle. Yeah, I understand the silicon real estate is the most expensive in the world.
Benchmarks: Black-Scholes in 18 msec on an FPGA (110 MHz Virtex-4), 203x faster than an Opteron at 2.2 GHz: You cannot be serious! 3.7 microseconds per Black-Scholes evaluation was competitive performance at the turn of the century. The relative speedup slides and quotations make me nervous. Oh, Celoxica provided the data – hey, Black-Scholes in 36 nanoseconds on a single core of a dual-core off-the-shelf general-purpose microprocessor from 2007. So the Virtex-4 does 1M Black-Scholes evaluations in 18 milliseconds, flat to competitive code on a dual-core general-purpose off-the-shelf microprocessor in 2007.
Make it easy for the users to use this hardware and get "enough of a performance" increase to be useful: meh, it's for applications that do not need to go fast, for now (2007)?
Do not try to be the fastest thing around when being as fast with less power is sufficient: meh, really do not care so much about the power thing
FPGA: Different operations map to different silicon allows massive pipelining; lots of parallelism: OK. So, why bother with the previous two points?
Eggers/ U. Washington, CHiMPS, here. Eggers is reasonable.
There have been (at least) two hindrances to the widespread adoption of FPGAs by scientific application developers: having to code in a hardware description language, such as Verilog (with its accompanying hardware-based programming model) and poor FPGA memory performance for random memory accesses. CHiMPS, our C-to-FPGA synthesis compiler, solves both problems with one memory architecture, the many-cache memory model.
Many-cache organizes the small, distributed memories on an FPGA into application-specific caches, each targeting a particular data structure or region of memory in an application and each customized for the particular memory operations that access it.
CHiMPS provides all the traditional benefits we expect from caching. To reduce cache latency, CHiMPS duplicates the caches, so that they’re physically located near the hardware logic blocks that access them. To increase memory bandwidth, CHiMPS banks the caches to match the memory parallelism in the code. To increase task-level parallelism, CHiMPS duplicates caches (and their computation blocks) through loop unrolling and tiling. Despite the lack of FPGA support for cache coherency, CHiMPS facilitates data sharing among FPGA caches and between the FPGA and its CPU through a simple flushing of cached values. And in addition, to harness the potential of the massively parallel computation offered by FPGAs, CHiMPS compiles to a spatial dataflow execution model, and then provides a mechanism to order dependent memory operations to retain C memory ordering semantics.
CHiMPS’s compiler analyses automatically generate the caches from C source. The solution allows scientific programmers to retain their familiar programming environment and memory model, and at the same time provides performance that is on average 7.8x greater and power that is one fourth that of a CPU executing the same source code. The CHiMPS work has been published in the International Symposium on Computer Architecture (ISCA, 2009), the International Conference on Field Programmable Logic and Applications (FPL, 2008), and High-Performance Reconfigurable Computing Technology and Applications (HPRCTA, 2008), where it received the Best Paper Award.
A small informal effort like Pink Iguana needs to lean heavily on curation for a specific audience. How narrow is the audience? Take the entire massive EcoFin community of DeLong, Wilmott, and Mankiw then subtract most of the folks who: don’t care if a 22nm semiconductor fab is competitive in 2012, haven’t compiled their code –O3 recently, or are sort of meh to the idea that there is a RDMA transport to L3. Those folks remaining might be the Pink Iguana audience if they also like: buying credit protection from AIG stories, P=NP speculation, and IEEE754. It’s the far side of the long tail.
So why curate for such a specific audience? Despite The End of Blogging, in 2012 there are remarkable and reasonably frequent publication streams from Gowers, Lipton, and Tao. The thing that is different in the last five years is the public availability of unfiltered, authoritative, and lucid commentary on specific topics. The keys are unfiltered, authoritative, and lucid. DeLong, Mankiw, Cowen, and Krugman run similarly authoritative and lucid publication streams that are more informed by their partisan backgrounds than Gowers, Lipton, and Tao. Intel, NVIDIA, and IBM have authoritative and lucid information as well, but they also have a day job to do. If folks like Gowers, Lipton, and Tao are regularly publishing there might be more, right? You just have to go look around, and maybe you figure out how something (e.g., ETP Arbitrage, Credit Derivatives, HFT, a specific floating point computation) actually works. So, Wisty curates on Pink Iguana.
Why are these folks in the Pink Iguana Hall of Heroes (listed below the Blogroll) and why should you read the Heroes?
A Credit Trader hasn't published since 2009, he went to do other stuff, but wow, what got published there was magnificent. Read Gretchen Morgenson at NYT, for example this, then read The AIG Fiasco or Bond-CDS Negative Basis or How to Lose a Billion Dollars on a Trade; it is like a teenage lucidity head rush.
Avellaneda – 2010 Quant of the Year posts regularly from his NYU faculty page and covers Research and market commentary, Stochastic Calculus, PDEs for Finance, Risk and Portfolio Management.
Bookstaber – Author of the book A Demon of Our Own Design, ran Firm Risk at Salomon back in the day, and now is Senior Policy Advisor at the SEC. See Physics Envy in Finance or Human Complexity: The Strategic game of ? and ?
DeLong – Even with the constant bitching about the press and Team Republican plus the liveblogging of World War 2, I have never seen a better EcoFin website, see DeLong and Summers: Fiscal Policy in a Depressed Economy or Econ 191: Spring 2012. DeLong’s blog really is the model for curation and commentary to a large audience.
Gowers – Rouse Ball chair, Cambridge U, Fields Medal 1998, see ICM 2010 or Finding Cantor's proof that there are transcendental numbers, and he was piqued to comment Re: Stieg Larsson, or perhaps the translator Reg Keeland, in Wiles meets his match. So, Salander's picture-perfect memory, capacity to defeat armed motorcycle gangs in hand-to-hand combat, and assorted other superpowers pass without comment, but when she thinks she has a proof of Fermat, you gotta call a mathematician to check yourself before you wreck yourself. Gowers is on the Heroes list forever, check.
Kahan – doesn't publish so much anymore, but he is the Edgar Allan Poe of floating-point-computations-gone-wrong horror stories, and they are all here. He did the IEEE 754 floating-point standard and won a Turing Award. When and if he has something to say, I will probably want to listen; see How Java's Floating-Point Hurts Everyone Everywhere and Desperately Needed Remedies for the Undebuggability of Large Floating-Point Computations in Science and Engineering.
Lipton has a gloriously unique perspective presented in Gödel's Lost Letter. He provides the descriptive narrative for algorithm complexity in a public conversation typically dominated by proofs and expositions of computational models. If algorithm complexity were professional sports, it's kind of like Lipton figured out there should be color commentators broadcasting live from the game. Top posts include: Interdisciplinary Research – Challenges, The Letterman Top Ten list of why P = NP is impossible, and The Singularity Is Here In Chess; it's John Madden, Dick Vitale, and Andres Cantor meet Kurt Gödel, John von Neumann, and Andrey Kolmogorov in the best possible way.
Tufte is "the guy" for the visual display of quantitative information. He has been the guy at least since the early 1980s and does not really publish the same way as Gowers, Lipton, or Tao. Tufte kind of figured out his publication flow before the internet, so you buy his books, and if you want to know what he is thinking about now, you go to his course. He has stuff online, lots of it, for example see his notebooks, or about ET. The Tufte course attendance is sort of mandatory, not sure but I think that's in Dodd-Frank Title VII, so just do it before they find out.
Desperately Needed Remedies for the Undebuggability of Large Floating-Point Computations in Science and Engineering
(for ICME, Stanford University, 13 Oct. 2011), here. Fabulous read. I was trying to remember all the nightmare horror stories of folks who decided double precision was too much bother for their floating-point computations. Kahan is the Edgar Allan Poe of botched-floating-point-computation storytelling. A bunch of those stories are here, and they are … horrifying. Get the children out of the room before opening the pdf file. I recall the first chapters of Higham's book recounting stories like this as well.
Check this out (from Kahan’s presentation) …
“Jean-Michel Muller’s (Counter)Example
His program implements a discrete dynamical system whose state at time N is the row [x_N, x_{N+1}]. Starting with x_0 := 2 and x_1 := –4, the sequence x_2, x_3, x_4, …, x_{N+1}, … is generated by the recurrence x_{N+1} := 111 – (1130 – 3000/x_{N–1})/x_N for N = 1, 2, 3, … in turn. An interesting challenge is the computation of, say, x_50 using the floating-point hardware's arithmetic in any commercial computer or calculator, new or old.
They all get x_50 ≈ 100. The correct value is x_50 ≈ 6.0001465345614."
Go ahead, put it in a spreadsheet; it takes no more than a few seconds to put the recurrence in Excel. We'll wait.
ARE YOU NOT ENTERTAINED!
Wall Street Journal – here Maxeler Makes Waves with Dataflow Design
HPCWire – here J.P. Morgan Deploys Maxeler Dataflow Supercomputer for Fixed Income Trading
Peter Cherasia, Head of Markets Strategies at J.P. Morgan, commented: “With the new Maxeler technology, J.P. Morgan’s trading businesses can now compute orders of magnitude more quickly, making it possible to improve our understanding and control of the profile of our complex trading risk.”
insideHPC – here JP Morgan Fires Up Maxeler FPGA Super
Compute Scotland – here Near realtime: JP Morgan & Maxeler
So, a major broker-dealer gets a 2011 industry award for running their overnight Risk and P&L credit derivative batch through a 1000-node CPU grid and some FPGAs in 238 seconds (in 2008 the same computation at the same broker-dealer, but presumably a different CDS inventory, took 8 hours – a great success). Then some blog posting claims that this same credit derivative batch could be run with some optimized C++ code on a $2,500 Mac Pro in under 2 seconds, IEEE 754 double precision, tied out to the precision of the inputs. What's going on? Does the credit derivative batch require a $1,000,000 CPU grid, FPGAs, and 238 seconds, or one $2,500 Mac Pro, some optimized code, and 2 seconds?
It's very likely the MacPro + 2 seconds; let's think about how this could be wrong:
A. The disclosed data is materially wrong or we have misinterpreted it egregiously. The unusual event in all this is the public disclosure of the broker-dealer's runtime experience and the code they were running. It is exceedingly rare to see such a public disclosure. That said, the 8-hour production credit derivative batch at a broker-dealer in 2008 is not the least bit surprising. The disclosure itself tells you the production code was once bloated enough to be optimized from 8 hours to 4 minutes. You think they nailed the code optimization on an absolute basis when the 4-minute result was enough to get a 2011 industry award, really? The part about the production infrastructure being a supercomputer with thousands of grid CPUs and FPGAs, while consistent with other production processes we have seen and heard of running on the Street, is the part you really have to hope is not true.
B. The several-hundred-thousand-position credit derivative batch could be synthetic CDO tranches and standard tranches, requiring slightly more complicated computation than the batch of default swaps assumed in the previous analysis. But all the correlation traders (folks who trade CDOs and tranches) we knew in 2011 were back to trading vanilla credit derivatives and bonds. The credit derivative batch with active risk flowing through in 2011 is the default swap batch (you can include index protection like CDX and ITRX in this batch as well). Who is going to spend three years improving the overnight process on a correlation credit derivative book that is managed part time by a junior trader with instructions to take on zero risk? No one.
C. The ISDA code just analyzed is not likely to be the same as the 2011 award-winning production credit derivative batch code. In fact, we know portions of the production credit derivative batch were translated into FPGA circuitry, so the code is real different, right? Well, over the last decade of CDS trading most of the broker-dealers evolved to the same quantitative methodology for valuing default swaps. Standards for converting upfront fees to spreads (ISDA) and unwind fees (Bloomberg's CDSW screen) have influenced broker-dealer CDS valuation methodology. We do not personally know exactly what quantitative analytics each one of the broker-dealers runs in 2011, but Jarrow-Turnbull default arrival and Brent's method for credit curve cooking covers a non-trivial subset of the broker-dealers. The ISDA code is not likely to be vastly different from the production code the other broker-dealers use in terms of quantitative method. Of course, in any shop there could be some undisclosed quantitative tweaks included in production, and the MacPro + 2 seconds analysis would be exposed in that event.
D. The computational performance analysis just presented could be flawed. We have only thought about this since seeing the Computerworld UK article and spent a couple weekends working out the estimate details. We could have made a mistake or missed something. But even if we are off by a factor of 100 in our estimates (we are not), it's still $2,500 Mac Pro + 200 seconds versus $1,000,000 + 1000 CPUs + FPGAs + 238 seconds.
What is left to get a runtime estimate for the credit derivative batch on an off-the-shelf computer? We need a cooked credit curve (bootstrapped hazard-rate term structure) and to estimate the computational cost of the IOUs from the previous valuation step. Let's account for the clock cycles needed to cook a USD credit curve out to 10 years with par CDS quotes at 1Y, 3Y, 5Y, 7Y, 10Y and a previously cooked USD Libor. Then we will address the additional computational cost of the valuation IOUs. The final post in this series will cover the additional computational cost of perturbational risk (multiply single-curve cooking estimates by 20-30x, corresponding to the number of distinct perturbed curves required for risk; ditto valuation estimates) as well as the cost of valuation for non-standard inventory CDS trades (trades where the paydates are not standard and therefore introduce an interpolation computation charge).
Credit Curve Cooking:
Assuming we are given a cooked USD Libor curve, we can see from the valuation code that the credit curve cooking only needs to take the par CDS quotes for the value date and return a set of assignments through JpmcdsForwardZeroPrice() for just the 20 paydates in the expected case (5Y CDS paying quarterly), or both the 20 paydates and the 130 numerical-integration grid points (biweekly grid). So, if we derive estimates of the convergence speed of the one-dimensional root finder, Brent's method, and then get estimates of the cost of the valuation IOUs, we will have a reasonably complete floating-point performance picture. We will assume away the cost of USD Libor curve cooking and USD Libor interpolation, since the computation cost is small and can be amortized across all the credits cooked against USD Libor in any case. So, valuation IOUs 2, 6, and 9 (see below) are assumed to cost less than one clock cycle of floating-point execution. They are just variable assignments with some cost in cache pressure.
Here is an excerpt of the ISDA credit curve cooking code (argument lists trimmed for space). Notice they use Brent's method and the convergence criterion looks like zero to ten decimal places. Given that you are cooking a credit curve that was also cooked the previous business day, you can assume the root is bracketed to 10bps, or three decimal places. Let's not even bother to account for the potential compute-time efficiency the bootstrapping valuations could provide. Assume every Brent function evaluation costs as much as a 5Y CDS valuation.
/* we work with a continuously compounded curve since that is faster -
   but we will convert to annual compounded since that is traditional */
cdsCurve = JpmcdsMakeTCurve (today, /* ... remaining args elided ... */);
if (cdsCurve == NULL) goto done;

context.discountCurve = discountCurve;
context.cdsCurve = cdsCurve;
context.recoveryRate = recoveryRate;
context.stepinDate = stepinDate;
context.cashSettleDate = cashSettleDate;

for (i = 0; i < nbDate; ++i)
{
    guess = couponRates[i] / (1.0 - recoveryRate);

    cl = JpmcdsCdsContingentLegMake (MAX(today, startDate), /* ... args elided ... */);
    if (cl == NULL) goto done;

    fl = JpmcdsCdsFeeLegMake (startDate, /* ... args elided ... */);
    if (fl == NULL) goto done;

    context.i = i;
    context.cl = cl;
    context.fl = fl;

    if (JpmcdsRootFindBrent ((TObjectFunc)cdsBootstrapPointFunction,
                             /* ... args elided ... */
                             0.0,    /* boundLo */
                             1e10,   /* boundHi */
                             100,    /* numIterations */
                             0.0005, /* initialXstep */
                             0,      /* initialFDeriv */
                             1e-10,  /* xacc */
                             1e-10,  /* facc */
                             &spread) != SUCCESS)
    {
        JpmcdsErrMsg ("%s: Could not add CDS maturity %s spread %.2fbp\n",
                      /* ... args elided ... */
                      1e4 * couponRates[i]);
    }

    cdsCurve->fArray[i].fRate = spread;
}
Assume Brent's method on average converges superlinearly like the secant method (the exponent is something like 1.6; see Convergence of Secant Method). Assume also that one iteration of Brent's method costs one function evaluation (with the IOU computation time added) plus some nominal extra clocks, say +20%. Given the accuracy of the initial guess, the smooth objective function, and the expected superlinear convergence, five iterations (1.6**5) should be good enough on average. With par CDS quotes at 1Y, 3Y, 5Y, 7Y, and 10Y to fit, we expect to need 20 CDS valuations. Thinking ahead to perturbational risk (in the next post), we can obviously bundle many perturbations into a single massive vectorized Brent's method call and drive the cooking cost massively lower. Let's ignore this obvious optimization for now. What do we have so far?
CDS credit curve cooking estimates are then as follows:
Expected case = 1.2 * (20 * 150 + 20 * valuation IOU) clock cycles (quarterly paydates=gridpoints)
Worst case = 1.2 * (20 * 810 + 20 * valuation IOU) clock cycles (biweekly numerical integration gridpoints)
Expected case is running at 3600 cycles plus 24x the IOU cycle count. In other words, one microsecond plus some IOU accounting.
Valuation IOU Runtime Accounting:
With subtopic titles like "Valuation IOU Runtime Accounting" I am certain to cut my blog-reading population in half – the price of a lonely optimization post. Recall our list of valuation IOUs. We assumed that some Rates process will exogenously produce a suitably cooked and interpolated USD Libor (discCurve) at the cost of some L2 cache pressure to us. Note we assume that on the code rewrite we vectorize and unroll loops anywhere possible. So in our accounting we take the cycle counts from Intel's vectorized math library (MKL) for double precision. Vector divides cost 5.4 cycles per element. Vector exponentials cost 7.9 cycles per element. Vector logs cost 9.9 cycles per element, and vector reciprocals cost 3.36 cycles per element. The vector lengths for these cycle counts are probably close to 1024, so perhaps the cycle counts are slightly optimistic.
Valuation IOU Accounting list
- survival[paydates] = JpmcdsForwardZeroPrice(SpreadCurve,…) 4*paydates clocks
- discount[paydates] = JpmcdsForwardZeroPrice(discCurve,…) 0, assumed away
- vector1[paydates] = notional*couponRate*accTime[paydates] paydates clocks
- vector2[paydates] = survival[paydates]*discount[paydates] paydates clocks
- s[gridpts] = JpmcdsForwardZeroPrice(SpreadCurve,…) 4*paydates clocks
- df[gridpts] = JpmcdsForwardZeroPrice(discCurve,…) 0, assumed away
- t[gridpts] grid time deltas 0, assumed away
- lambda[i] = log(s[i-1]/s[i])/t[i] elementwise 15*gridpts clocks
- fwdRate[i] = log(df[i-1]/df[i])/t[i] elementwise 0, assumed away
- lambdafwdRate[gridpts] = lambda[gridpts] * fwdRate[gridpts] gridpts clocks
- vector3[gridpts] = exp(-t[gridpts]*lambdafwdRate[gridpts]) 10 * gridpts clocks
- vector4[gridpts] = s[gridpts] * df[gridpts] gridpts clocks
- vector5[gridpts] = reciprocal(lambdafwdRate[gridpts]) 4 * gridpts clocks
- t0[gridpts], t1[gridpts] coefficients for accrued interpolation 0, assumed away
- thisPv[gridpts] 4 * gridpts clocks
Looks like the IOU cost in aggregate is 10 * paydates clocks (200 clock cycles) plus 35 * gridpts clocks (ugh!). The expected case assumes gridpoints = quarterly paydates, so all-in 45 * paydates clock cycles, or 900 clocks. Worst case: 200 clocks + 35 * 130 clocks, or 4,750 clocks.
All in then, CDS credit curve cooking estimates are as follows:
Expected case = 1.2 * (20 * 150 + 20 * 900) clock cycles (quarterly paydates=gridpoints)
Worst case = 1.2 * (20 * 810 + 20 * 4750) clock cycles (biweekly numerical integration gridpoints)
Expected case looks like 25,200 clocks or 7 microseconds. Worst case looks like 133,440 cycles or 37 microseconds. OK, we tentatively thought 20 microseconds worst case, but recall we left a chunk of optimizations unaccounted for in the interests of brevity. Still, the worst case is about twice as expensive as we naively expected.
From here it is easy: full risk will be 20x to 30x the single-curve cooking estimate even if we are totally apathetic and clueless. Expected case is 7 * 20 microseconds, call it 140 microseconds; worst case is 37 * 30 microseconds, call it 1.11 milliseconds. Since we are going to argue that non-standard CDS interpolation can be stored with the trade description, the incremental cycle count for non-standard deals is de minimis, since the computational cost is amortized over the life of the position (5Y on average). Expected case then, on our target 10K credit curves and 100K CDS, looks like 10K * 140 microseconds + 100K * 20 * 42 nanoseconds, or about one and a half seconds on a single core of a single microprocessor. If I have to run the Computerworld UK credit batch on the cheapest Mac Pro I can order from the Apple store today, the quad-core 2.8 GHz Nehalem for 2,499 USD (not gonna add 1,200 USD to get the 3.33 GHz Westmere), it'll cost me 5.8 seconds on a single core. But fortunately I purchased four cores (of raw power, according to the Apple website), so the Computerworld UK credit derivative batch still runs in a second and change for 2,499 USD + shipping.
Even for non-SNAC positions I’m thinking under, perhaps epically under depending on the hardware budget. Ideally you run code to determine this and we do not have the optimized code in hand, but we can write it. Before writing the code you want to run the due diligence to see if writing the code is worth it. Let’s see.
We recall from previous 2009 estimates (see Totally Serial) that, for unoptimized scalar code, cooking of par and perturbed curves certainly costs less than 7 milliseconds, and valuation of a default swap's contingent and fee legs takes about 40 microseconds. This of course assumes the default swaps have custom cashflows and are not SNAC default swaps (standard instruments with IMM-date cashflow schedules), in which case the default swap valuation execution time massively collapses and the credit curve cooking time can be significantly reduced as well.
Let's assume 100K OTC (non-SNAC) default swaps and 10K credit curves; we can scale to fit the actual curve and deal counts with more accurate estimates. Let's further assume the 10K credit curves are independent (there is no known geometric or arithmetic structure to the traders' marks). So you actually have to cook each of the 10K credit curves. For unoptimized code, then, we expect about 7 ms * 10K + 0.04 ms * 100K, or ~80 seconds of floating-point computation elapsed time on a single contemporary microprocessor core. Naively, you would guess this benchmark optimization would be looking for about 20-30 seconds (a 3x-4x performance improvement over unoptimized) on a single x86 core. So assuming we buy a $3,000 6-8 core microprocessor in 2012, the 100K/10K batch with optimized fp code executes in a few seconds, expected case. Variance in the estimate will depend on assumptions like the average number of iterations for Brent convergence, the average term of the default swaps, the number of perturbed credit curves needed for full risk, and the scaling across multiple on-chip cores, among other things.
Amdahl's Law directs the due diligence to focus first on the curve cooking, so that's what we will do, in addition to rechecking the previous estimates (I think they were very conservative) in the outlined argument. Let's count the operations in the ISDA code to nail this estimate down. Then we will estimate the L1 and L2 cache efficiency we can plausibly expect, to get a target due-diligence number. What do we look at in the ISDA src distribution? A bunch of src files we don't care about (DK) from the perspective of estimating the floating-point computation time. The other annotated file names (in bold) contain source we may need to study.
- badday.c DK
- bsearch.c DK
- buscache.c DK
- busday.c DK
- cashflow.c DK
- cds.c computes contingent leg, fee leg, vanilla CDS PV
- cdsbootstrap.c Brent root finder for credit curve cooking
- cdsone.c converts upfront to flat spread
- cerror.c DK
- cfileio.c DK
- cfinanci.cpp DK
- cmemory.c DK
- contingentleg.c computes contingent leg and one period integral
- convert.c DK
- cx.c DK
- cxbsearch.c DK
- cxdatelist.c DK
- cxzerocurve.c calc zero price for given start and maturity
- date_sup.c DK
- dateadj.c DK
- dateconv.c DK
- datelist.c DK
- dtlist.c DK
- feeleg.c calc PV of single fee, fee leg, accrued
- fltrate.c DK
- gtozc.c looks like curve cooking
- interpc.c interpolation of rates
- ldate.c DK
- linterpc.c curve interpolation
- lintrpl.c DK
- lprintf.c DK
- lscanf.c DK
- rtbrent.c root finding for curve cooking
- schedule.c possibly DK – cash flow schedule generation
- streamcf.c calculate cashflows
- strutil.c DK
- stub.c stub payments
- tcurve.c discounting
- timeline.c not sure maybe DK
- version.c DK
- yearfrac.c DK
- zcall.c zero curve
- zcswap.c add strip of swap to zcurve
- zcswdate.c DK
- zcswutil.c cashflow lists for swap instruments
- zerocurve.c build zero curve from mm and swaps
- zr2coup.c calculates the par swap rate
- zr2fwd.c calculates zero coupon forward rate
We should be able to pound this due diligence through with a little effort, let’s see how it goes. Maybe I am missing something? It’ll take a couple postings I expect, but it will be interesting.
Dominic O’Kane has/had his own web based calculator based on his 2008 book Modeling Single-name and Multi-name Credit Derivatives which is in turn based on a very good Lehman research report O’Kane published with Stuart Turnbull in 2003, Valuation of Credit Default Swaps.
Hull and White 2003 on Valuation of a CDO and an nth to Default CDS Without Monte Carlo Simulation. If there is a broker dealer running a PDE solver on their Credit Derivative inventory for daily P&L, find out who the head of quantitative research is there and bow before that guy because he has achieved Steve Jobs-level marketing skills.
Matlab CDS pricer, here.
BionicTurtle has a YouTube video of how to run a CDS valuation on a spreadsheet, here. It appears to be the tip of the iceberg of YouTube videos explaining credit derivatives.