Tag Archives: FinQuant
Bill Dally, EE Times, Q&A: Nvidia’s Dally on 3-D ICs, China, cloud computing, here. A more complete version of the interview that HPC Wire recently summarized. EE Times may make you register for this, but it is otherwise free. On China’s microprocessors:
Five years ago Godson was laughable. Now it’s competent but not state of the art. If they continue, I would expect them to be matching the West in three to five years and then pulling ahead. Quite frankly, this country is not investing as much in R&D in these strategic areas. It’s a question of government investment in research. In computing it’s slowed to a trickle.
If we want to have a pipeline of innovations that can fuel competitive products going forward, the government needs to invest in stuff beyond the horizon of what companies will reasonably invest in. The fundamental research lifts all the boats.
IB Times, ARM, Intel Battle Heats Up, here. Microprocessor server HPC is a side show, with Intel holding a 94+% share.
Low-power processor maker ARM Holdings PLC (Nasdaq: ARMH) is stepping up the rhetoric against chip rival Intel Corp. (Nasdaq: INTC), saying it expects to take more of Intel‘s share in the notebook personal-computer market than Intel can take from it in the smartphone market.
I notice that Emanuel Derman is about to release his new book. The tome seems to deal with how the failings of finance theory can impact the world. This sounds very close to what my Lecturing Birds attempted to do. There are big differences though.
For one, Derman knows much more than I do about the subject matter.
He is also a better writer.
But I suspect that there is an area where I may have a slight comparative advantage. I am an amateur, a dilettante, a stranger in a strange land. Derman is a pro in the field. While he is way more open and honest than most other pros in this debate, he may not want to be more open and honest than necessary. In other words, he probably can't or doesn't want to be a denunciator. He can't or doesn't want to be too critical or too cynical. I, on the other hand, was able to be stringently accusatorial because I had no allegiance but to the evidence I unearthed and what those findings led me to conclude. Derman can highlight VaR's weaknesses but he might not want to call for its banning. Derman can talk about BSM's flaws, but he might not want to embrace Taleb-Haug. Derman can denounce the unrealism of models but he might not want to lead a campaign against the (possibly impractical, probably lethal) modelling of finance.
My Little Pony Physics, YouTube, here. Going viral via Tosh.0.
Weisenthal, Business Insider, Everyone Agrees: The ECB Is About To Make The Biggest Decision In Its History, here.
All the politicians in Greece (even the mainstream ones) have said they want to renegotiate the bailout agreement.
If the rest of Europe doesn’t back down and agree to this, then the ECB will have to make a huge decision.
Unless Greece chooses to leave the Euro area, which we doubt will happen, a Greek exit will require the rest of the region to push the country out. The mechanism for this will be the ECB excluding the Greek central bank from TARGET2, the regional payments and settlement system. Although this might look like a technical decision about monetary plumbing, the ECB will elevate this to the Euro area heads of state. It will be the most important political decision since EMU’s launch.
Extreme Tech, Ivy Bridge: Intel’s killing blow on AMD, here. Let’s look at this again. For the FinQuant application space I’d estimate somewhere between 50% and 85% of what you care about in selecting a Linux server is the current, expected, and eventually realized feature size of the fab producing the server’s microprocessors and chips. There are lots of other important variables: system and microprocessor architecture, programming languages, network transmission lines, compilers, operating systems, file systems, databases, etc., and each alone can make or break a FinQuant app, but they are all tails. The microprocessor fab feature size is the dog; it effectively determines how well my FinQuant infrastructure scales with Moore’s Law. The comparative technology priority has not always been this way. There used to be different instruction set architectures, networks were slower than DRAMs, and memories were small, all requiring evaluation in addition to the shrinking microprocessor fab feature size. In all likelihood the comparative technology priorities will change in the future as well.
Right now, Intel Sandy Bridge is 32nm, Ivy Bridge is 22nm, AMD Opteron is 28nm, Xilinx is 28nm, Achronix is 22nm, and the microprocessor market share main event is between AMD and Intel over design wins in mobile systems. Server-side, Intel holds 95% market share to AMD’s 5%. Intel tries to set expectations of 22nm by 2013 and 14nm by 2014, here for example, while showing the Chandler, Ariz. 14 nm fab construction, here. Recall that things do not always move so smoothly for Intel; think of the relatively recent 8M Sandy Bridge support chip recall and Itanium. On the other hand, AMD’s problems appear to be a shade worse than Intel’s; witness ars technica on server market share here, The Register here, Extreme Tech here. I don’t know how much these websites are owned or comped by Intel, but if I am holding a bunch of Opteron server-side exposure it is probably safe to argue that it’s time to think about a hedge.
All things being equal, if I am aggressively setting up an HPC FinQuant infrastructure play now, I kind of want to be production ready with 22 nm silicon by the end of 2012, looking to set up a smooth infrastructure transition to 14nm in 2014.
Anand Tech, Intel’s Ivy Bridge Architecture Exposed, here. Not sure how much I care about the integrated GPU for server side FinQuant apps unless the AVX2 is somehow related to the GPU.
ars, Transactional memory going mainstream with Intel Haswell, here.
phoronix, Compilers Mature For Intel Sandy/Ivy Bridge, Prep For Haswell, here. Wow Treasure.
A small informal effort like Pink Iguana needs to lean heavily on curation for a specific audience. How narrow is the audience? Take the entire massive EcoFin community of DeLong, Wilmott, and Mankiw, then subtract most of the folks who: don’t care if a 22nm semiconductor fab is competitive in 2012, haven’t compiled their code -O3 recently, or are sort of meh to the idea that there is a RDMA transport to L3. Those folks remaining might be the Pink Iguana audience if they also like: buying credit protection from AIG stories, P=NP speculation, and IEEE 754. It’s the far side of the long tail.
So why curate for such a specific audience? Despite The End of Blogging, in 2012 there are remarkable and reasonably frequent publication streams from Gowers, Lipton, and Tao. The thing that is different in the last five years is the public availability of unfiltered, authoritative, and lucid commentary on specific topics. The keys are unfiltered, authoritative, and lucid. DeLong, Mankiw, Cowen, and Krugman run similarly authoritative and lucid publication streams that are more informed by their partisan backgrounds than Gowers, Lipton, and Tao. Intel, NVIDIA, and IBM have authoritative and lucid information as well, but they also have a day job to do. If folks like Gowers, Lipton, and Tao are regularly publishing there might be more, right? You just have to go look around, and maybe you figure out how something (e.g., ETP Arbitrage, Credit Derivatives, HFT, a specific floating point computation) actually works. So, Wisty curates on Pink Iguana.
Why are these folks in the Pink Iguana Hall of Heroes (listed below the Blogroll) and why should you read the Heroes?
A Credit Trader hasn’t published since 2009, he went to do other stuff, but wow, what got published there was magnificent. Read Gretchen Morgenson at NYT, for example this, then read The AIG Fiasco or Bond-CDS Negative Basis or How to Lose a Billion Dollars on a Trade; it is like a teenage lucidity head rush.
Avellaneda – 2010 Quant of the Year posts regularly from his NYU faculty page and covers Research and market commentary, Stochastic Calculus, PDEs for Finance, Risk and Portfolio Management.
Bookstaber – Author of the book A Demon of Our Own Design, ran Firm Risk at Salomon back in the day, and now is Senior Policy Advisor at the SEC. See Physics Envy in Finance or Human Complexity: The Strategic game of ? and ?
DeLong – Even with the constant bitching about the press and Team Republican plus the liveblogging of World War 2, I have never seen a better EcoFin website, see DeLong and Summers: Fiscal Policy in a Depressed Economy or Econ 191: Spring 2012. DeLong’s blog really is the model for curation and commentary to a large audience.
Gowers – Rouse Ball chair, Cambridge U, Fields Medal 1998, see ICM 2010 or Finding Cantor’s proof that there are transcendental numbers, and he was piqued to comment Re: Stieg Larsson, or perhaps the translator Reg Keeland, in Wiles meets his match. So, Salander’s picture-perfect memory, capacity to defeat armed motorcycle gangs in hand-to-hand combat, and assorted other superpowers pass without comment, but she thinks she has a proof of Fermat and you gotta call a mathematician to check yourself before you wreck yourself. Gowers is on the Heroes list forever, check.
Kahan – doesn’t publish so much anymore but he is the Edgar Allan Poe of floating-point-computations-gone-wrong horror stories, and they are all here. He did the IEEE 754 floating point standard and won a Turing Award. When and if he has something to say, I will probably want to listen; see How Java’s Floating-Point Hurts Everyone Everywhere and Desperately Needed Remedies for the Undebuggability of Large Floating-Point Computations in Science and Engineering.
Lipton has a gloriously unique perspective presented in Gödel’s Lost Letter. He provides the descriptive narrative for algorithm complexity in a public conversation typically dominated by proofs and expositions of computational models. If algorithm complexity were professional sports, it’s kind of like Lipton figured out there should be color commentators broadcasting live from the game. Top posts include: Interdisciplinary Research – Challenges, The Letterman Top Ten list of why P = NP is impossible, and The Singularity Is Here In Chess; it’s John Madden, Dick Vitale, and Andres Cantor meet Kurt Gödel, John von Neumann, and Andrey Kolmogorov in the best possible way.
Tufte is “the guy” for the visual display of quantitative information. He has been the guy at least since the early 1980s and does not really publish the same way as Gowers, Lipton, or Tao. Tufte kind of figured out his publication flow before the internet, so you buy his books, and if you want to know what he is thinking about now, you go to his course. He has stuff online, lots of it; for example see his notebooks, or about ET. The Tufte course attendance is sort of mandatory, not sure but I think that’s in Dodd-Frank Title VII, so just do it before they find out.
Apart from saving over a million dollars in infrastructure cost annually, why bother running your credit derivative P&L/Risk batch on Mac Pro + 2 seconds rather than Supercomputer + FPGAs + 4 minutes?
A. Rates, Mortgage, FX, Commodities P&L/Risk batches
So it was only the credit derivative guys in the Computerworld UK article who had this P&L and Risk batch performance problem in 2008, and now it is fixed?
B. Moore’s Law
More of a prediction than a law really, but it has predicted accurately for 40 years and is expected to hold for another decade. In technology, if you do nothing more than “ride the wave” of Moore’s Law you are doing well; however, this observation’s simplicity can be deceiving. If the plan is for the credit derivative batch to “ride the wave” with thousands of grid CPUs, FPGAs, and associated staff, you may have a challenge when scaling to larger problems, like counterparty valuation adjustment (CVA), that Moore’s Law will not automatically address for you. For example, Moore’s Law doesn’t get you power and machine room space for 10,000 grid CPUs for the CVA simulation. That is closer to “not riding the wave.”
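To put rough numbers on the power and machine room point, here is a back-of-envelope sketch in C++. Every figure is an illustrative assumption, not a measurement from any shop: roughly 300 W per grid node, 40 nodes per rack, a 1.8x facility overhead for cooling, and $0.10 per kWh.

```cpp
// Back-of-envelope only: per-node wattage, rack density, PUE, and power price
// are illustrative assumptions, not measured figures from any data center.
#include <cstdio>

int main() {
    const double nodes        = 10000;   // grid CPUs for the CVA simulation
    const double wattsPerNode = 300.0;   // assumed draw per server: CPU + memory + fans
    const double nodesPerRack = 40.0;    // assumed rack density
    const double pue          = 1.8;     // assumed facility overhead (cooling, power conversion)
    const double usdPerKwh    = 0.10;    // assumed power price

    const double megawatts = nodes * wattsPerNode * pue / 1.0e6;
    const double racks     = nodes / nodesPerRack;
    const double annualUsd = megawatts * 1000.0 * 24.0 * 365.0 * usdPerKwh;

    std::printf("power: %.1f MW, racks: %.0f, power bill: ~$%.1fM/yr\n",
                megawatts, racks, annualUsd / 1.0e6);
    // Moore's Law shrinks the transistor; it does not shrink the machine room or the utility bill.
    return 0;
}
```

Under these assumptions that is several megawatts, a couple hundred racks, and a multi-million dollar annual power bill before anyone writes a line of CVA code.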
Supercomputers and FPGAs, in addition to GPUs, Cuda, and CPU grids even dataflow architecture are all reasonable technology bets that could very well be the best way to ride the Moore’s Law wave through 2020. Using some or all of these technologies together in 2011 to run such a modest computation so slowly should elicit some follow up questions.
So, a major broker-dealer gets a 2011 industry award for running their overnight Risk and P&L credit derivative batch through a 1,000-node CPU grid and some FPGAs in 238 seconds (in 2008 the same computation at the same broker-dealer, but presumably with a different CDS inventory, took 8 hours, a great success). Then some blog posting claims that this same credit derivative batch could be run with some optimized C++ code on a $2,500 Mac Pro in under 2 seconds, IEEE 754 double precision, tied out to the precision of the inputs. What’s going on? Does the credit derivative batch require a $1,000,000 CPU grid, FPGAs, and 238 seconds, or one $2,500 MacPro, some optimized code, and 2 seconds?
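Before walking through the ways this could be wrong, here is the shape of the back-of-envelope estimate that makes the 2-second claim plausible. Every number is an assumption for illustration, not a measurement of the award-winning batch: roughly 300,000 positions, roughly 50,000 floating point operations per default swap valuation once the curves are cooked, and one well-optimized 8-core box sustaining about 10 GFLOPS.

```cpp
// Order-of-magnitude estimate only; position count, flops per valuation, and
// sustained flop rate are assumptions, not measurements of the award-winning batch.
#include <cstdio>

int main() {
    const double positions        = 3.0e5;   // assumed CDS positions in the overnight batch
    const double flopsPerPosition = 5.0e4;   // assumed: curve lookups, discounting, spread risk bumps
    const double sustainedFlops   = 1.0e10;  // assumed: ~10 GFLOPS from one optimized 8-core box

    const double totalFlops = positions * flopsPerPosition;   // ~1.5e10 flops
    const double seconds    = totalFlops / sustainedFlops;    // ~1.5 s
    std::printf("~%.1e flops -> ~%.1f s on one box\n", totalFlops, seconds);
    return 0;
}
```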
It’s very likely the MacPro + 2 seconds; let’s think about how this could be wrong:
A. The disclosed data is materially wrong or we have misinterpreted it egregiously. The unusual event in all this is the public disclosure of the broker-dealer’s runtime experience and the code they were running. It is exceedingly rare to see such a public disclosure. That said, the 8-hour production credit derivative batch at a broker-dealer in 2008 is not the least bit surprising. The disclosure itself tells you the production code was once bloated enough to be optimized from 8 hours to 4 minutes. You think they nailed the code optimization on an absolute basis when the 4-minute result was enough to get a 2011 industry award, really? The part about the production infrastructure being a supercomputer with thousands of grid CPUs and FPGAs, while consistent with other production processes we have seen and heard of running on the Street, is the part you really have to hope is not true.
B. The several hundred thousand position credit derivative batch could be Synthetic CDO tranches and standard tranches and require slightly more complicated computation than the batch of default swaps assumed in the previous analysis. But all the correlation traders (folks that trade CDOs and tranches) we know in 2011 were back to trading vanilla credit derivatives and bonds. The credit derivative batch with active risk flowing through in 2011 is the default swap batch (you can include index protection like CDX and ITRX in this batch as well). Who is going to spend three years improving the overnight process on a correlation credit derivative book that is managed part time by a junior trader with instructions to take on zero risk? No one.
C. The ISDA code just analyzed is not likely to be the same as the 2011 award-winning production credit derivative batch code. In fact, we know portions of the production credit derivative batch were translated into FPGA circuitry, so the code is real different, right? Well, over the last decade of CDS trading most of the broker-dealers evolved to the same quantitative methodology for valuing default swaps. Standards for converting upfront fees to spreads (ISDA) and unwind fees (Bloomberg’s CDSW screen) have influenced broker-dealer CDS valuation methodology. We do not personally know exactly what quantitative analytics each one of the broker-dealers runs in 2011, but Jarrow-Turnbull default arrival and Brent’s method for credit curve cooking covers a non-trivial subset of the broker-dealers (a minimal sketch of that kind of curve cooking appears after this list). The ISDA code is not likely to be vastly different from the production code the other broker-dealers use in terms of quantitative method. Of course, in any shop there could be some undisclosed quantitative tweaks included in production, and the MacPro + 2 seconds analysis case would be exposed in that event.
D. The computational performance analysis just presented could be flawed. We have only thought about this since seeing the Computerworld UK article and spent a couple of weekends working out the estimate details. We could have made a mistake or missed something. But even if we are off by a factor of 100 in our estimates (we are not), it’s still $2,500 MacPro + 200 seconds versus $1,000,000 + 1,000 CPUs + FPGAs + 238 seconds.
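For reference, here is the minimal curve cooking sketch promised in item C: a single flat hazard rate bootstrapped from one CDS quote, with quarterly premium payments, default paid mid-period, a flat risk-free rate, and a bisection stand-in for Brent’s method. It is deliberately not the ISDA standard model or any broker-dealer’s production code, just an illustration of the arithmetic each curve node requires.

```cpp
// Minimal sketch, not the ISDA standard model: bootstrap a single flat hazard
// rate lambda from one quoted CDS par spread by root-solving PV(lambda) = 0.
// Simplifying assumptions: flat risk-free rate, quarterly premium payments,
// default paid at period midpoints, accrued-on-default and day counts ignored.
// A production curve cooker would use Brent's method and a multi-node curve.
#include <cmath>
#include <cstdio>

struct CdsQuote {
    double spread;    // running premium, e.g. 0.01 = 100bp
    double recovery;  // assumed recovery rate
    double maturity;  // years
    double rate;      // flat risk-free rate
};

// PV of (protection leg - premium leg) per unit notional for a flat hazard rate.
double cdsPv(double lambda, const CdsQuote& q) {
    const int periods = static_cast<int>(q.maturity * 4 + 0.5);
    const double dt = q.maturity / periods;
    double premium = 0.0, protection = 0.0;
    for (int i = 1; i <= periods; ++i) {
        const double t0 = (i - 1) * dt, t1 = i * dt, tm = 0.5 * (t0 + t1);
        const double surv0 = std::exp(-lambda * t0);
        const double surv1 = std::exp(-lambda * t1);
        premium    += q.spread * dt * surv1 * std::exp(-q.rate * t1);
        protection += (1.0 - q.recovery) * (surv0 - surv1) * std::exp(-q.rate * tm);
    }
    return protection - premium;   // increasing in lambda
}

// Bisection stand-in for Brent's method: PV is monotone in lambda on [lo, hi].
double solveFlatHazard(const CdsQuote& q) {
    double lo = 1.0e-8, hi = 5.0;
    for (int it = 0; it < 80; ++it) {
        const double mid = 0.5 * (lo + hi);
        if (cdsPv(mid, q) > 0.0) hi = mid; else lo = mid;
    }
    return 0.5 * (lo + hi);
}

int main() {
    const CdsQuote q{0.0100, 0.40, 5.0, 0.02};   // 100bp 5y quote, 40% recovery, 2% flat rate
    const double lambda = solveFlatHazard(q);
    std::printf("flat hazard ~ %.5f (credit triangle S/(1-R) ~ %.5f)\n",
                lambda, q.spread / (1.0 - q.recovery));
    return 0;
}
```

With a full multi-node curve and the ISDA conventions the arithmetic per position grows, but it stays in the same ballpark as the per-position flop count assumed in the estimate above.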
Peter Cotton at A Quant’s Apology could be entertaining
XKCD sorts out orders of magnitude in Money.
Cars Connect with Apps – Wired. Gotta see the spec sheets but if the L1 and L2 caches are not too small I have a feeling we are looking at a full worldwide Credit Derivative valuation and risk run in the Kia (Uvo eServices infotainment and telematics system) on my evening Isle of Dogs to Dorking (or Leighton Buzzard worst case) commute home.
Salmon opines on the Citi Benchmark portfolio risk metric idea, here.
Paul Glasserman’s publications, here.
Grand Challenges in Economics, here, from Turing’s Invisible Hand, 2010.
Brooklyn Academy of Music Blog, here.
Should computers have their own websites? – Marginal Revolution
Bookstaber – The Day the Earth Stood Still – mistaking remixing for technology advancement.
Dominic O’Kane has/had his own web based calculator based on his 2008 book Modeling Single-name and Multi-name Credit Derivatives which is in turn based on a very good Lehman research report O’Kane published with Stuart Turnbull in 2003, Valuation of Credit Default Swaps.
Hull and White 2003 on Valuation of a CDO and an nth to Default CDS Without Monte Carlo Simulation. If there is a broker dealer running a PDE solver on their Credit Derivative inventory for daily P&L, find out who the head of quantitative research is there and bow before that guy because he has achieved Steve Jobs-level marketing skills.
Matlab CDS pricer, here.
BionicTurtle has a YouTube video of how to run a CDS valuation on a spreadsheet, here. Appears to be the tip of the iceberg of YouTube videos explaining Credit Derivatives.
Since 2006, the clock rates offered in new microprocessors have remained roughly constant due to fundamental power and heat dissipation constraints in silicon chip design and production (Olukotun); nevertheless, Moore’s Law remains in effect and these hot, power-hungry microprocessors are effectively the universal compute platform of choice. Cray and Convex are history and Burton Smith works for Microsoft; game over. The good news is everyone, regardless of capitalization, computes with the “same commodity microprocessor.”
A Demon of our Own Design author Rick Bookstaber argues in his blog in 2009 that:
1. High Frequency trading is a capacity constrained trading strategy very sensitive to the number of active High Frequency traders and
2. Competition in High Frequency trading is aptly characterized as an arms race and hence a negative-sum game; the collective infrastructure spend confers no solid expectation of long-term advantage, just more opportunity to spend on improved infrastructure (in The Arms Race in High Frequency Trading).
On the other hand, in 2007 Information Week writer Richard Martin quotes the assertion, “A millisecond advantage in trading application can be worth $100MM a year to a brokerage firm,” before concluding with the quote, “Once you’ve got a half dozen systems that can all handle that kind of throughput, then you have to distinguish yourself somewhere else.” Protecting IP is hard; Teza Technologies‘ recruiting of ex-Goldman High Frequency trading programmer Sergey Aleynikov and Tower’s recruitment of ex-Soc Gen High Frequency trading programmer Samarth Agrawal are presumably only representative known samples of the actual velocity of code migration. It stands to reason that in reality the IP must move slightly faster than the code, indicating yet another way for this trade to get crowded. The NYT quotes Andrew Lo in 2006: “Now it’s an arms race, everyone is building more sophisticated algorithms, and the more competition exists, the smaller the profits.”
Bookstaber, Lo, and Martin’s assessments were all made several years ago in a booming High Frequency trading market, and all implicitly agree the space is destined to get crowded, sooner rather than later. Bookstaber correctly points out the figurative exits are small when the space gets crowded; Martin speculates where the exit is located (the exit is to go long, or to buy, improved computational latency). Martin’s millisecond mark-to-market quote is pre Dodd-Frank Title VII implementation, so it stands to reason that post Title VII implementation a millisecond will be worth much more than 100MM USD. So, euphemistically, how does one go long milliseconds when running Low Latency over a WAN operating at or approaching known physical signal propagation limits? Moreover, given the expected crowding in the High Frequency/Low Latency business, it’s sort of important that the desk’s stack of milliseconds is larger than the competitor’s millisecond stack.
We have three variables to control in this game: Network Switch and Wire latency, Pre-trade Algorithms, and the Optimized Code/Compute Hardware. We have assumed away the wire latency at the outset of this survey; we have the lowest latency connection between NJ and Chicago by assumption. We do not know the details of the switch latency from Spread Networks, but the only reasonable question to ask is how much of a competitive advantage might accrue using this low latency WAN link?
Bandwidth does not appear to be a competition driver. The High Frequency trading folks seem fairly consistent in claiming that the pre-trade decision algorithms complete (come to a decision) in microseconds. A scalar core can only touch about 64 megabits in one millisecond. Under the reasonable assumption that the algorithms are more or less scalar with respect to a given Low Latency arbitrage pair of securities/contracts, and there are only 100 or so liquid pairs, the data bandwidth requirements in a market microstructure based trading system seem modest. This limited bandwidth assumption could weaken with index arbitrage, since each decision could require the time series of prices for 100s of underlying contracts, but let’s stick with the limited bandwidth assumption. The only important networking factor for Low Latency trading is …wait for it … low latency, doh.
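As a sanity check on that bandwidth claim, here is the arithmetic with assumed figures: 100 liquid pairs, a 1,000-tick trailing window per pair, 16 bytes per tick, and roughly 8 GB/s of per-core memory bandwidth (the same thing as about 64 megabits per millisecond). These are illustrative assumptions, not measurements of any trading system.

```cpp
// Illustrative arithmetic only; pair count, window length, tick size, and
// per-core bandwidth are assumptions, not measurements of any trading system.
#include <cstdio>

int main() {
    const double pairs           = 100.0;    // assumed liquid arbitrage pairs
    const double ticksPerPair    = 1000.0;   // assumed trailing window consulted per decision
    const double bytesPerTick    = 16.0;     // assumed: price + timestamp
    const double coreBytesPerSec = 8.0e9;    // assumed ~8 GB/s per scalar core (~64 Mbit per ms)

    const double bytesPerDecision = pairs * ticksPerPair * bytesPerTick;         // ~1.6 MB
    const double microseconds     = bytesPerDecision / coreBytesPerSec * 1.0e6;  // ~200 us
    std::printf("~%.1f MB touched per decision, ~%.0f us just to stream it\n",
                bytesPerDecision / 1.0e6, microseconds);
    // Comfortably inside a millisecond trading system clock; latency binds, not bandwidth.
    return 0;
}
```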
Latency benchmarks reported by Verizon show a US average latency of 42 ms to 44 ms, going down to 33 ms to 34 ms for Private IP. AT&T reports NY to Chicago latencies of 21 ms with a nationwide average of 34 ms. The Barksdale Forbes article claims existing private lines between NY and Chicago run 16.3 ms roundtrip latency.
Assume the best possible case: the desk is on Barksdale’s fiber and all competitors are running private connections with round-trip latencies between 16.3 ms and 21 ms. So the desk’s one-way latency is 7 clocks versus the competitor’s 9 to 11 clocks. Any advantage should manifest itself in remote event notification such as: trade notification, order book updates, and exogenous trading system shutdown notice.
In SPY/SPY1D arbitrage the desk will see anywhere from 1 to 2 more time series points corresponding to executed orders from the remote opening market than the competition. In terms of event notification a competitor could run event driven rather than synchronously and recapture a trading system clock cycle of latency. Longer term though I think you have to assume the synchronous trading system design prevails over the event driven trading system design for similar reasons that synchronous circuits beat out asynchronous circuits due to complexity in handling signal race conditions. To the degree that the pre-trade algorithm execution time allows a higher trading system clock rate, the capacity of the event driven trading system to catch up to a synchronous system in event notification is proportionally decreased.
The updated remote order book, similar to the executed order notification, will run a couple of milliseconds ahead of a disadvantaged WAN competitor. It’s not immediately clear that this advantage is as valuable or actionable as the trade execution notification. Perhaps there is some automated mechanism to determine if one market is leading another in price discovery for a given contract; then, if the updated remote order book is from the market leading price discovery, there is some concrete advantage.
Shut down notification latency is always important but even more so if the desk starts to run strategies that are directional or not risk-neutral. Thorp described the scene at Princeton-Newport when a surprise merger was announced and the Stat arb trading system needed to shut down immediately; Low Latency prop trading has the same problem. Event driven communication is better than the synchronous model in this case. Long-term it might make sense to keep an alternate WAN channel around simply for asynchronous notifications for system exceptions.
Assuming the pre-trade algorithm is fixed, we do sort of know that code optimization and hardware selection are the keys to getting long milliseconds, reducing computational latency, and beating the competition. The key to code optimization now is the observation that contemporary microprocessors are little superscalar and superpipelined parallel machines. The job of the floating point programmer is to deconstruct the pre-trade analytics into an executable form that keeps the floating point units busy, keeps the FP pipeline running with the fewest bubbles (idle pipeline stages) possible, and doesn’t miss in the cache too often. For typical Financial Engineering codes you can get good estimates of optimal core cycle execution times and drive the analytics performance reasonably close to the optimal core cycle count. On the hardware selection side, the job is to stay on the Moore’s Law technology wave (could be any technology using the latest silicon fabrication generation: on-chip MP, FPGAs, or GPUs) without losing too much software support from optimizing compilers, vectorized math libraries, and multiprocessing code generation.
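A minimal illustration of the pipeline bubble point, in generic C++ rather than anyone’s production pre-trade code: a dot product with a single accumulator serializes on the floating point add latency, while splitting the sum across independent accumulators (what an optimizing compiler does when it unrolls and vectorizes the loop) keeps several multiplies and adds in flight at once.

```cpp
// Sketch of exposing instruction-level parallelism; not production pre-trade code.
#include <cstddef>

// One accumulator: every add waits for the previous add to retire,
// so throughput is bounded by FP add latency, not by issue width.
double dotNaive(const double* a, const double* b, std::size_t n) {
    double s = 0.0;
    for (std::size_t i = 0; i < n; ++i) s += a[i] * b[i];
    return s;
}

// Four independent accumulators: the dependency chain is broken, so the
// core can keep several multiplies and adds in the pipeline at once.
double dotUnrolled(const double* a, const double* b, std::size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i]     * b[i];
        s1 += a[i + 1] * b[i + 1];
        s2 += a[i + 2] * b[i + 2];
        s3 += a[i + 3] * b[i + 3];
    }
    for (; i < n; ++i) s0 += a[i] * b[i];   // remainder loop
    return (s0 + s1) + (s2 + s3);
}
```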
The fewer core clocks required by the pre-trade algorithm execution the faster we can run the trading system clock in the synchronous model. As the trading system clock cycles get faster the ability to tolerate off chip communication for parallel computing will dissipate, so the main opportunity for parallelism will be on-chip.
If the pre-trade analytics are heavily dependent on IEEE 754 double precision floating point execution, then the desk needs to use native compilers and native libraries for the chip running the pre-trade algorithm (see the Intel or IBM math library maps of operations to clock cycles and precision). The shipping 4.25 GHz, 8-way MP, 45nm IBM POWER7, with 4 double precision floating point units per core, running XLC code on top of the MASS vector libraries, is probably slightly faster than even the upcoming 3.9 GHz, 4-way MP, 32nm Sandy Bridge Intel chips running ICC code on top of the MKL vector libraries for conventional double precision quantitative analytics commonly encountered on Wall Street. This is less of a vendor allegiance point (although it might sound otherwise) and more of a keep-track-of-the-smart-people-who-know-this-particular-stuff point. Long-term relative value between the Intel and IBM infrastructure floating point performance will hinge on the native compiler assist with on-chip parallel processing. Right now, IBM will throw you bigger caches, more independent floating point units, and a higher frequency clock than Intel. On the other hand, Intel’s microprocessor feature size runs almost one generation ahead of IBM, and the Intel compiler folks and math libraries are quite competitive. If you look at SPECfp, where vendors display the maximum floating point execution speed of their products, they will only quote native compiler executions (see SPEC CPU FP). There is no reasonable expectation I am aware of that a non-native compiler (or interpreter) is going to issue code that can run to speed after deconstructing the code and estimating the code cache footprint end-to-end. Moreover, the only optimized math libraries I have seen are for native compilers. On the other hand, if single precision is tolerable for pre-trade analytics there are several custom FPGA (see Wallach’s most recent startup Convey) and GPU computing options (NVIDIA Tesla) that could be potential competitive sources of compute power for pre-trade analytics, given sufficient compiler and math library support.
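As a concrete instance of the native math library point, here is a hedged sketch using Intel MKL’s vector math routine vdExp to exponentiate an array of discount factor exponents in one call instead of a scalar std::exp loop; the same pattern applies to IBM’s MASS vector routines with XLC. It assumes MKL is installed and linked (for example, icc -O3 exp_batch.cpp -mkl on the classic Intel compiler); check the header name and link flags against the installed toolchain.

```cpp
// Sketch only: assumes Intel MKL is installed and linked; check the local MKL
// documentation for the exact header and link line on your toolchain.
#include <mkl.h>     // declares vdExp and MKL_INT
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    const MKL_INT n = 1 << 20;
    std::vector<double> x(n), y(n), yRef(n);
    for (MKL_INT i = 0; i < n; ++i)
        x[i] = -0.05 * (i % 40) / 4.0;   // e.g. -r*t exponents for quarterly discount factors

    // Scalar baseline: one libm call per element.
    for (MKL_INT i = 0; i < n; ++i)
        yRef[i] = std::exp(x[i]);

    // Vectorized library call: MKL dispatches an SSE/AVX kernel over the whole array.
    vdExp(n, x.data(), y.data());

    std::printf("element 7: %.12f (vdExp) vs %.12f (std::exp)\n", y[7], yRef[7]);
    return 0;
}
```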
If the premium is on reducing computation latency then accounting precisely for core processor clocks in pre-trade algorithm code is important. It may be pragmatically reasonable to assume the computation execution is effectively scalar given that the parallel computation support even on the native compilers is still rather new and raw. The parallelization opportunities available currently are on-chip, with multicore execution (probably 2 to 4 way) moving to 8-way with newer silicon (see for example the IBM Power7).