David Kanter, Real World Technologies, Intel’s Haswell CPU Microarchitecture, Nov. 2012, here.
The physical register files hold the actual input and output operands for uops. The integer PRF added a modest 8 registers, bringing the total to 168. Given that AVX2 was a major change to Haswell, it should be no surprise that the number of 256-bit AVX registers grew substantially to accommodate the new integer SIMD instructions. Haswell features 24 extra physical registers for renaming YMM and XMM architectural registers. The branch order buffer, which is used to roll back to known good architectural state in the case of a misprediction, is still 48 entries, as with Sandy Bridge. The load and store buffers, which are necessary for any memory accesses, have grown by 8 and 6 entries respectively, bringing the totals to 72 loads and 42 stores in flight.
Unlike AMD’s Bulldozer, Haswell continues to use a unified scheduler that holds all different types of uops. The scheduler in Haswell is now 60 entries, up from 54 in Sandy Bridge; those entries are dynamically shared between the active threads. The scheduler holds uops that are waiting to execute due to resource or operand constraints. Once ready, uops are issued to the execution units through dispatch ports. While fused uops occupy a single entry in the ROB, execution ports can each handle only a single un-fused uop, so a fused load+ALU uop will occupy two ports to execute.
Haswell and Sandy Bridge both retire up to 4 fused uops/cycle, once all the constituent uops have been successfully executed. Retirement occurs in-order and clears out resources such as the ROB, physical register files, and branch order buffer.
Sam Biddle, Gizmodo, AMD Just Set a World Record for Overclocking, here.
It may not technically be the world’s fastest processor, but AMD squeezed an overclocked Guinness World Record number out of its impending 8-core FX CPU: 8.429 GHz, beating the previous global high of 8.308 GHz. Intel fanboys, begin your outrage!
Niels Broekhuijsen, Tom’s Hardware, Intel’s Core i7-4770K Overclocked to 8.0 GHz, here. AMD makes better geek video.
Earlier, we showed you a screenshot of Intel’s Haswell Core i7-4770K overclocked to a massive 7.0 GHz, but now we have a piece of video showing the CPU overclocked all the way to 8.0 GHz. Theoretically, the CPU shouldn’t be able to go above 8 GHz due to restrictions with the multiplier and base clock, even on the “K” version.
The video shows the CPU with only two cores enabled and HyperThreading disabled. The voltage of the CPU reaches 2.259 V as the CPU’s clock speed is gradually increased.
Report: Intel’s Ivy Bridge-E CPUs Will Launch Sept. 2013, here.
The flagship i7-4960X would pack six processing cores and, with HyperThreading, feature 12 threads. It would have a base clock of 3.6 GHz, boosting up to the 4.0 GHz mark. It has 15 MB of L3 cache and support for DDR3-1866 memory. All this processing power is made possible through a massive TDP of 130 Watts.
Wolfgang Gruener, Tom’s Hardware, Intel Has 5 nm Processors in Sight, here. Looks uncomfortable for the FPGA guys.
According to the company, future production processes down to 5 nm are on the horizon and will most likely be reached without significant problems. Following the current 22 nm process, Intel’s manufacturing cadence suggests that the first 14 nm products will arrive in late 2013, 10 nm in 2015, 7 nm in 2017, and 5 nm in 2019. A slight adjustment has been made to include different production processes for traditional processors and now SoCs. The company previously indicated that SoCs will be accelerated to catch up with the process applied to Intel’s main processor products.
Anand Lal Shimpi, AnandTech, Intel Details Haswell Overclocking at IDF Beijing, here. Haswell is out 2 June. Not going to buy any Ivy Bridge at this point; too late.
As we march towards the June 2nd release of Intel’s Haswell processors, the company is slowly but surely filling in the missing blanks. Most recently we saw a shot of the often discussed but rarely seen Haswell GT3e part with on-package DRAM, and today we get some confirmation on what overclocking Haswell will be like.
As a quick refresher, the max clock frequency of Haswell is governed by the following equation:
Clock Speed = BCLK * Ratio
In the old days, both of the variables on the right-hand side were unlocked (back then it wasn’t called BCLK). Around the time of the Pentium II, Intel locked the multiplier ratios (the rightmost variable), and then a few years ago we lost the ability to manipulate the un-multiplied input frequency.
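For concreteness, a minimal C++ sketch of that equation, assuming the nominal 100 MHz Haswell BCLK; the multiplier value is just an illustration, not a number from the article:

#include <cstdio>

int main() {
    // Core clock = BCLK * multiplier ratio.
    // 100 MHz is the nominal Haswell BCLK; the ratio is illustrative.
    const double bclk_mhz = 100.0;
    const int    ratio    = 35;    // unlocked on "K" parts

    const double core_mhz = bclk_mhz * ratio;
    printf("BCLK %.1f MHz x ratio %d = %.2f GHz\n",
           bclk_mhz, ratio, core_mhz / 1000.0);
    return 0;
}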
Computing Frontiers 2013 Program, here. The Maxeler folks are there giving the Dataflow talk. Boy, it sure helps to know a little history, doesn’t it? It can save you a bunch.
Alex Mansfield, BBC, NASA buys into ‘quantum’ computer, here. Since when did NASA get money to buy stuff? Alex had a draft of this article with “blink of an eye” instead of “fractions of a second,” right?
A $15m computer that uses “quantum physics” effects to boost its speed is to be installed at a Nasa facility.
It will be shared by Google, Nasa, and other scientists, providing access to a machine said to be up to 3,600 times faster than conventional computers.
Unlike standard machines, the D-Wave Two processor appears to make use of an effect called quantum tunnelling.
This allows it to reach solutions to certain types of mathematical problems in fractions of a second.
Quentin Hardy, NYT, Google Buys a Quantum Computer, here.
The Quantum Artificial Intelligence Lab, as the entity is called, will focus on machine learning, which is the way computers take note of patterns of information to improve their outputs. Personalized Internet search and predictions of traffic congestion based on GPS data are examples of machine learning. The field is particularly important for things like facial or voice recognition, biological behavior, or the management of very large and complex systems.
“If we want to create effective environmental policies, we need better models of what’s happening to our climate,” Google said in a blog post announcing the partnership. “Classical computers aren’t well suited to these types of creative problems.”
Google said it had already devised machine-learning algorithms that work inside the quantum computer, which is made by D-Wave Systems of Burnaby, British Columbia. One could quickly recognize information, saving power on mobile devices, while another was successful at sorting out bad or mislabeled data. The most effective methods for using quantum computation, Google said, involved combining the advanced machines with its clouds of traditional computers.
Google and NASA bought it in cooperation with the Universities Space Research Association, a nonprofit research corporation that works with NASA and others to advance space science and technology. Outside researchers will be invited to the lab as well.
This year D-Wave sold its first commercial quantum computer to Lockheed Martin. Lockheed officials said the computer would be used for the test and measurement of things like jet aircraft designs, or the reliability of satellite systems.
Scott Aaronson, Shtetl-Optimized, Ask Me Anything! Tenure Edition, here. You don’t really have to read the BBC and NYT coverage of Quantum when you can just ask Aaronson anything.
By popular request, for the next 36 hours—so, from now until ~11PM on Tuesday—I’ll have a long-overdue edition of “Ask Me Anything.” (For the previous editions, see here, here, here, and here.) Today’s edition is partly to celebrate my new, tenured “freedom to do whatever the hell I want” (as well as the publication after 7 years of Quantum Computing Since Democritus), but is mostly just to have an excuse to get out of changing diapers (“I’d love to, honey, but the world is demanding answers!”). Here are the ground rules:
One question per person, total.
Please check to see whether your question was already asked in one of the previous editions—if it was, then I’ll probably just refer you there.
No questions with complicated backstories, or that require me to watch a video, read a paper, etc. and comment on it.
No questions about D-Wave. (As it happens, Matthias Troyer will be giving a talk at MIT this Wednesday about his group’s experiments on the D-Wave machine, and I’m planning a blog post about it—so just hold your horses for a few more days!)
If your question is offensive, patronizing, nosy, or annoying, I reserve the right to give a flippant non-answer or even delete the question.
Keep in mind that, in past editions, the best questions have almost always been the most goofball ones (“What’s up with those painting elephants?”).
stackoverflow, Flops per cycle for sandy-bridge and haswell SSE2/AVX/AVX2, here. Wow, FMA gets Haswell to 16 DP FLOPs issued per clock (a sketch follows the list below).
Intel Core 2 and Nehalem:
- 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
- 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
Intel Sandy Bridge/Ivy Bridge:
- 8 DP FLOPs/cycle: 4-wide AVX addition + 4-wide AVX multiplication
- 16 SP FLOPs/cycle: 8-wide AVX addition + 8-wide AVX multiplication
Intel Haswell:
- 16 DP FLOPs/cycle: two 4-wide FMA (fused multiply-add) instructions
- 32 SP FLOPs/cycle: two 8-wide FMA (fused multiply-add) instructions
AMD K10:
- 4 DP FLOPs/cycle: 2-wide SSE2 addition + 2-wide SSE2 multiplication
- 8 SP FLOPs/cycle: 4-wide SSE addition + 4-wide SSE multiplication
AMD Bulldozer:
- 8 DP FLOPs/cycle: 4-wide FMA
- 16 SP FLOPs/cycle: 8-wide FMA
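To make the Haswell line concrete, here is a hedged C++ sketch of the loop shape those peak numbers assume: independent 4-wide double-precision FMA chains feeding both FMA ports. Names and loop shape are mine, not from the stackoverflow answer; two chains are shown for brevity, while fully hiding Haswell’s 5-cycle FMA latency takes about ten.

#include <immintrin.h>

// Each 4-wide DP FMA counts as 8 FLOPs (4 lanes x multiply+add); two FMA
// ports each retiring one per cycle gives the quoted 16 DP FLOPs/cycle.
// Build with -mfma (or -march=core-avx2).
void fma_chains(const double* a, const double* b, double* out, long n) {
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();   // independent accumulator chains
    for (long i = 0; i + 8 <= n; i += 8) {
        acc0 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i),
                               _mm256_loadu_pd(b + i), acc0);
        acc1 = _mm256_fmadd_pd(_mm256_loadu_pd(a + i + 4),
                               _mm256_loadu_pd(b + i + 4), acc1);
    }
    _mm256_storeu_pd(out, _mm256_add_pd(acc0, acc1));
}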
Cade Metz, Wired, Facebook Rattles Networking World With ‘Open Source’ Gear, here. Very cool; the Kickstarter play for my 2014 Open Source Haswell Server cannot be far behind. Adds numbers in less time than a blink of an eye, communicates over fiber thinner than a human hair, and does not need to move its lips when it reads. It’s like totally inhuman, or something.
Two years ago, Mark Zuckerberg and company turned the hardware world on its head when they launched the Open Compute Project, an effort to improve every aspect of the modern data center and share the results with the world at large. They began by “open sourcing” fresh designs for computer servers and power systems and cooling equipment. Then they did the same with hardware that stores massive amounts of digital data. Then they remade the racks that hold all these machines. And now it’s time for the networking gear.
The idea is to design a networking switch that anyone can load with their own operating system — just as you can load your own OS on a computer server. Typically, networking switches are sold by hardware giants such as Cisco and HP and Dell, and they ship with software specific to the company that designed them. But Facebook aims to separate the hardware from the software.
Peter Bright, ars technica, AMD’s “heterogeneous Uniform Memory Access” coming this year in Kaveri, here.
The central HSA concept is that systems will have multiple different kinds of processors, connected together and operating as peers. The two main kinds of processors are conventional, versatile CPUs and the more specialized GPUs.
Modern GPUs have enormous parallel arithmetic power, especially floating point arithmetic, but are poorly suited to single-threaded code with lots of branches. Modern CPUs are well suited to single-threaded code with lots of branches, but less well suited to massively parallel number crunching. Splitting workloads between a CPU and a GPU, using each for the workloads it’s good at, has driven the development of general-purpose GPU (GPGPU) computing.
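As a rough illustration of the two workload shapes (mine, not from the ars piece): the first function is the branchy, serially dependent code a CPU core digests well; the second is the uniform, independent arithmetic that maps naturally onto a GPU’s wide ALUs (and onto CPU vector units).

#include <cmath>
#include <cstddef>

// Branch-heavy with a loop-carried dependency: branch predictors and
// out-of-order CPU cores handle this well; GPUs do not.
double branchy(const double* x, std::size_t n) {
    double state = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        if (state > 1.0)     state = std::log(state) - x[i];
        else if (x[i] < 0.0) state -= x[i] * x[i];
        else                 state += x[i];
    }
    return state;
}

// Uniform, independent per-element arithmetic: the massively parallel
// number crunching GPUs are built for.
void data_parallel(const double* a, const double* b, double* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        out[i] = a[i] * b[i] + 1.0;   // no branches, no cross-iteration deps
}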
Jeff Dean, Google, 2009 presentation slides, Designs, Lessons, and Advice from Building Large Distributed Systems, here. How big does a distributed system need to be before the government insures some of its failures? Google, Amazon, and Facebook are too small, apparently. What about the funds transfer system, once we get rid of paper checks? The Herstatt Bank failure was not enough to get government insurance, although it did prompt the creation of the Continuous Linked Settlement program. Who builds the first distributed system where failures costing over 1 billion USD are guaranteed by the US taxpayer? When does this system go online, or is it already online? I’m thinking it is already online. I suppose any modern tank, aircraft carrier, or stealth fighter jet project with cost overruns already hits that. So, how do you get the CFTC and SEC classified as national defense organizations? Just move them to the Pentagon? Exercise left for the reader.
• Ghemawat, Gobioff, & Leung. Google File System, SOSP 2003.
• Barroso, Dean, & Hölzle. Web Search for a Planet: The Google Cluster Architecture, IEEE Micro, 2003.
• Dean & Ghemawat. MapReduce: Simplified Data Processing on Large Clusters, OSDI 2004.
• Chang, Dean, Ghemawat, Hsieh, Wallach, Burrows, Chandra, Fikes, & Gruber. Bigtable: A Distributed Storage System for Structured Data, OSDI 2006.
• Burrows. The Chubby Lock Service for Loosely-Coupled Distributed Systems. OSDI 2006.
• Pinheiro, Weber, & Barroso. Failure Trends in a Large Disk Drive Population. FAST 2007.
• Brants, Popat, Xu, Och, & Dean. Large Language Models in Machine Translation, EMNLP 2007.
• Barroso & Hölzle. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Morgan & Claypool Synthesis Series on Computer Architecture, 2009.
• Malewicz et al. Pregel: A System for Large-Scale Graph Processing. PODC 2009.
• Schroeder, Pinheiro, & Weber. DRAM Errors in the Wild: A Large-Scale Field Study. SIGMETRICS 2009.
• Protocol Buffers. http://code.google.com/p/protobuf/
Martin Thompson and Michael Barker, InfoQ, Lock-free Algorithms, here. The talk is pretty good – interesting crew of people there. InfoQ is new to me.
Martin Thompson and Michael Barker explain how Intel x86_64 processors and their memory model work, along with low-level techniques that help create lock-free software.
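The flavor of those techniques, in a minimal C++11 sketch of mine (not code from the talk): on x86_64 the atomic increment compiles to LOCK XADD and the retry loop to LOCK CMPXCHG, and the strong x86 memory model is why such light ordering suffices here.

#include <atomic>

std::atomic<long> counter{0};

// Lock-free read-modify-write: LOCK XADD on x86_64.
void increment() {
    counter.fetch_add(1, std::memory_order_relaxed);
}

// The basic lock-free building block: a compare-and-swap retry loop.
// Increments the counter only while it is below a ceiling, without locks.
bool increment_if_below(long ceiling) {
    long cur = counter.load(std::memory_order_relaxed);
    while (cur < ceiling) {
        if (counter.compare_exchange_weak(cur, cur + 1,
                                          std::memory_order_relaxed))
            return true;   // our CAS won the race; cur + 1 is published
        // on failure, cur was reloaded with the current value; retry
    }
    return false;          // another thread pushed it to the ceiling first
}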
Martin Thompson, Mechanical Sympathy, Memory Access Patterns Are Important, here. Thompson’s blog and a ref to Sally McKee’s paper with Wulf.
In high-performance computing it is often said that the cost of a cache miss is the largest performance penalty for an algorithm. For many years the increase in speed of our processors has greatly outstripped latency gains to main memory. Bandwidth to main memory has greatly increased via wider, multi-channel buses; however, latency has not significantly reduced. To hide this latency, our processors employ ever more complex cache sub-systems with many layers.
The 1994 paper “Hitting the memory wall: implications of the obvious” describes the problem and goes on to argue that caches do not ultimately help because of compulsory cache-misses. I aim to show that by using access patterns which display consideration for the cache hierarchy, this conclusion is not inevitable.
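In the same spirit, a small C++ sketch of mine (not Thompson’s code) that shows the effect he is describing: the two walks touch every element exactly once, but the strided walk defeats the sequential prefetcher by pulling a new 64-byte cache line on every access.

#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const std::size_t n = std::size_t(1) << 26;  // 64M ints (~256 MB), far bigger than LLC
    const std::size_t stride = 16;               // 16 ints = 64 bytes = one cache line
    std::vector<int> v(n, 1);
    long long sum = 0;

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) sum += v[i];        // sequential walk
    auto t1 = std::chrono::steady_clock::now();
    for (std::size_t s = 0; s < stride; ++s)                // strided walk,
        for (std::size_t i = s; i < n; i += stride)         // same total work
            sum += v[i];
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    printf("sequential: %lld ms, strided: %lld ms (sum=%lld)\n",
           (long long)ms(t0, t1), (long long)ms(t1, t2), sum);
    return 0;
}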
Michael Barker, Bad Concurrency, Disruptor v3 – Faster Hopefully, here. Barker’s blog.
I’ve been working sporadically on the next major revision of the Disruptor, but still making steady progress. I’ve merged my experimental branch into the main line, and I’m working on ensuring comprehensive test coverage and re-implementing the Disruptor DSL.
As a matter of course I’ve been running performance tests to ensure that we don’t regress performance. While I’ve not been focusing on performance, just some refactoring and simplification, I got a nice surprise: the new version is over twice as fast for the 1P1C simple test case, approximately 85M ops/sec versus 35M ops/sec. This is on my workstation, which has an Intel(R) Xeon(R) CPU E5620 @ 2.40GHz.
Greg Pfister, The Perils of Parallel, Intel Xeon Phi Announcement (& me), here. Check out the Intel links at the end of the post.
Number one is their choice as to the first product. The one initially out of the blocks is not a lower-performance version, but rather the high end of the current generation: the one that costs more ($2649) and has high performance on double-precision floating point. Intel says it’s doing so because that’s what its customers want. This makes it extremely clear that “customers” means the big accounts – national labs, large enterprises – buying lots of them, as opposed to, say, Prof. Joe with his NSF grant or Sub-Department Kacklefoo wanting to try it out. Clearly, somebody wants to see significant revenue right now out of this project after so many years. They have had a reasonably-sized pre-product version out there for a while now, so it has seen some trial use. At national labs and (maybe?) large enterprises.
I fear we have closed in on an incrementally-improving era of computing, at least on the hardware and processing side, requiring inhuman levels of effort to push the ball forward another picometer. Just as well I’m not hanging my living on it any more.
Robert X. Cringely, I, Cringely, Who’s your daddy? Intel swoons for Apple, here. If you write Wall Street-style analytics (e.g., expression evaluation with great locality), you have to root for AVX2 to get to market before the mobile wave shuts down further floating-point feature development. A smart architecture guy was asking how you can get reliable cycle count estimates on Wall Street analytics. Informally, I think the answer is: you know what the vectorized max performance is, you know the equations being evaluated, you know approximately how many adds and multiplies are required to retire the equations to be evaluated, and you know locality is not going to get so bad that you have to think deep thoughts about the memory hierarchy (i.e., in many cases the locality will give you something like 80% of the vectorized benchmark to retire ops). I’m sure there are exceptions – but the Gaussian copula for the London Whale is not one of them, for example. For that matter, almost nothing in vanilla credit, rates, or FX analytics is an exception to this informal rule. Even the exotic stuff is most often just the Monte Carlo version of the vanilla valuation analytics with finite difference approximations. The thing that matters for expression evaluation is Locality, Locality, Locality. That, by the way, is one way you can easily tell when execs are lost about what they are doing with specific analytics. It does not happen all that often, but it is there if you look for it. Like the London Whale guys and the FPGA supercomputer running something about as fast as an optimized piece of code on an old iPhone. Happy Festivus.
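That back-of-envelope estimate, written out as a tiny C++ sketch; every number below is an illustrative assumption of mine, not a measurement:

#include <cstdio>

int main() {
    // cycles ~= (adds + multiplies) / (peak FLOPs per cycle * locality factor)
    const double flops_needed    = 2.0e6;  // ops to retire the equations (assumed)
    const double peak_per_cycle  = 16.0;   // Haswell DP peak via two FMA ports
    const double locality_factor = 0.8;    // "something like 80%" of peak

    const double cycles = flops_needed / (peak_per_cycle * locality_factor);
    printf("estimated cycles: %.3g (%.1f us at 3.5 GHz)\n",
           cycles, cycles / 3.5e9 * 1e6);
    return 0;
}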
Just days after I wrote a column saying Apple will dump Intel and make Macintosh computers with its own ARM-based processors, along comes a Wall Street analyst saying no, Intel will be taking over from Samsung making the Apple-designed iPhone and iPod chips, and Apple will even switch to x86 silicon for future iPads. Well, who is correct?
Maybe both, maybe neither, but here’s what I think is happening.