Computing has always been about speed. Herman Hollerith’s first job was to work on the 1880 US census. It was done by hand: painful, slow, and prone to errors. So, Hollerith created the Hollerith 1890 Census Tabulator punch card system to count the 1890 census results. His electromechanical devices could do the job in a remarkably quick two years and saved the government $5 million. Using his profits, he established his own company, the Tabulating Machine Company. You know it better as IBM.
Of course, his machine wasn’t a true computer, but it set the pattern for all the computers to come. It wasn’t until 1964, however, when Seymour Cray designed the Control Data Corporation (CDC) 6600, that we started calling the fastest of the fast machines supercomputers.
CDC: Supercomputing begins
Cray believed there would always be a need for a machine “a hundred times more powerful than anything available today.” It was his dream, and his efforts to push the limits of technology, that led to supercomputers.
The journey was never easy. Cray, already known as a temperamental hardware genius, threatened to leave CDC. He finally agreed to stay after being allowed to form his own team to build the CDC 6600. With 400,000 transistors, over 100 miles of hand-wiring, and Freon cooling, it reached a top speed of 40MHz or 3 million floating-point operations per second (MegaFlops). Hollerith would have recognized its primary I/O method: Punch cards. It left the previous fastest computer, the IBM 7030 Stretch, eating its dust.
Today, 3 MegaFlops are painfully slow. The first Raspberry Pi with its 700 MHz ARM1176JZF-S processor runs at 42 MegaFlops. For its day, the CDC 6600 was the fastest of the fast and it would remain so until Cray and CDC followed it up in 1969 with the CDC 7600.
The other computer manufacturers of the 60s were caught flat-footed. In a famous memo, Thomas Watson, Jr, IBM’s CEO, said “Last week, Control Data … announced the 6600 system. I understand that in the laboratory developing the system there are only 34 people including the janitor. Of these, 14 are engineers and 4 are programmers. Contrasting this modest effort with our vast development activities, I fail to understand why we have lost our industry leadership position by letting someone else offer the world’s most powerful computer.” To this day, IBM remains a serious supercomputer competitor.
The birth of Cray
In the meantime, Cray and CDC were not getting along. His designs, while both technically powerful and commercially successful, were expensive. CDC saw him as a perfectionist. He saw CDC as clueless middle managers. You can see where this was going.
So, when the next-generation CDC 8600 was running over budget and behind schedule, CDC elected to support another high-performance computing machine: the Star-100. This was one of the first supercomputers to use separate vector processors in addition to its CPU. This set a trend that is still with us today.
Be that as it may, Cray, to no one’s surprise, left CDC to form his own company: Cray Research. There, freed of management constraints and fueled with ample Wall Street funds, he built the first of his eponymous supercomputers in 1976: The Cray-1.
The 80MHz Cray-1 used integrated circuits to achieve performance rates as high as 136 MegaFlops. Part of the Cray-1’s remarkable speed came from its unusual “C” shape. This shape was not chosen for its science-fiction look, but because it gave the most speed-dependent circuit boards shorter, hence faster, I/O.
This attention to every last detail of design from the CPU up is a distinguishing mark of Cray’s work. Every element of Cray’s design was built to be as fast as possible.
Cray also adopted vector processing for the Cray-1. On the Cray design, the vector processors operated on vectors, which are linear arrays of 64-bit floating-point numbers, to obtain results. Compared to scalar code, vector codes could reduce pipelining slowdowns by as much as 90%.
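To make the scalar-versus-vector distinction concrete, here is a minimal Python sketch. It is only an illustration, not Cray code: the function names are hypothetical, and the strip length of 64 elements is an assumption chosen to stand in for a vector register. The scalar loop issues one operation per element, while the vector version walks the data in register-sized strips and counts one "vector instruction" per strip.

```python
VECTOR_LENGTH = 64  # assumed vector register length, for illustration only


def scalar_axpy(a, x, y):
    """Compute a*x[i] + y[i] one element at a time.

    Returns the result list and the number of operations issued.
    """
    result = []
    instructions = 0
    for xi, yi in zip(x, y):
        result.append(a * xi + yi)
        instructions += 1  # one scalar operation per element
    return result, instructions


def vector_axpy(a, x, y, vlen=VECTOR_LENGTH):
    """Compute the same values, but in strips of up to `vlen` elements.

    Each strip models a single vector instruction applied to a whole
    register's worth of data.
    """
    result = []
    instructions = 0
    for start in range(0, len(x), vlen):
        xs = x[start:start + vlen]
        ys = y[start:start + vlen]
        result.extend(a * xi + yi for xi, yi in zip(xs, ys))
        instructions += 1  # one "vector instruction" per strip
    return result, instructions


x = [float(i) for i in range(1000)]
y = [1.0] * 1000
s_res, s_count = scalar_axpy(2.0, x, y)
v_res, v_count = vector_axpy(2.0, x, y)
print(s_res == v_res, s_count, v_count)
```

Both versions produce identical results. The difference is in how many instructions have to be issued, which is where a pipelined vector unit saves its time.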
The Cray-1 was also the first computer to use transistor memory, instead of high-latency magnetic core memory. With these new forms of memory and processing the Cray-1 and its descendants became the poster child of late 70s and early 80s supercomputing.
Seymour Cray wasn’t done leading the way in supercomputing. The Cray-1, like all the machines that had come before it, used a single main processor. With 1982’s Cray X-MP he added four processors to the Cray-1 signature C body. With the X-MP’s 105MHz processors and a 200% plus improvement in memory bandwidth, a maxed-out X-MP could deliver 800 MegaFlops of performance.
The next step forward was 1985’s Cray-2. This model came with eight processors, with a “foreground processor” managing storage, memory, and I/O to the “background processors,” which did the actual work. It also was the first liquid-cooled supercomputer. And, unlike its predecessors, you could work with it using a general-purpose operating system: UNICOS, a Cray-specific Unix System V with additional BSD features, instead of a customized, architecture-specific operating system.
Today, supercomputers are used to work on a variety of massive computational problems. These jobs include quantum mechanics, weather forecasting, climate research, and biomolecular analysis on COVID-19. But, in the 70s and 80s, Cold War research on nuclear explosion simulations and code-cracking was what governments and their businesses paid for. With the rise of glasnost and the disintegration of the Warsaw Pact and the Soviet Union, Cray’s military-industrial customers were no longer interested in spending millions on supercomputers.
While Cray still loved his vector architectures, their processors were very expensive. Companies explored using many processors in a single computer, an approach known as massively parallel processing (MPP). For example, the Connection Machine’s tens of thousands of simple single-bit processors working with global memory cost a fraction of Cray’s designs.
MPP machines made supercomputing more affordable, but Cray resisted them. As he said, “If you were plowing a field, which would you rather use: Two strong oxen or 1024 chickens?” Instead, he refocused on faster vector processors using the untried gallium arsenide semiconductors. This would prove a mistake.
Cray, the company, went bankrupt in 1995. Cray, the supercomputer architect, wasn’t done. He founded a new company, SRC Computers, to work on his take on a machine that combined the best features of his approach and the MPP. Unfortunately, he died in a car accident before he could put his new take on supercomputing to the test.
His ideas lived on in Japan. There, companies such as NEC, Fujitsu, and Hitachi built vector-based supercomputers. From 1993 to 1996, Fujitsu’s Numerical Wind Tunnel was the world’s fastest supercomputer, with speeds of up to 600 GigaFlops. A GigaFlop is one billion Flops.
These machines relied on vector processing, with dedicated chips working on one-dimensional arrays of data. They also used multi-buses to make the most of MPP’s I/O. This was the ancestor of the multiple instruction, multiple data (MIMD) approach that enables today’s CPUs to use multiple cores.
Intel, which had been watching from the supercomputer sidelines, thought MIMD might enable them to create more affordable supercomputers without specialized vector processors. In 1996, ASCI Red proved Intel right.
ASCI Red used over 6,000 200MHz Pentium Pros to break the 1 TeraFlop (one trillion Flops) barrier. For years, it would be both the fastest and most reliable supercomputer in the world.
Supercomputing for everyone: Beowulf
While Intel was spending millions developing ASCI Red, some underfunded contractors at NASA’s Goddard Space Flight Center (GSFC) built their own “supercomputer” using commercial off-the-shelf (COTS) hardware. Using 16 486DX processors with a 10Mbps Ethernet cable for the “bus,” in 1994 NASA contractors Don Becker and Thomas Sterling created Beowulf.
Little did they know that in creating the first Beowulf cluster, they were creating the ancestor of today’s most popular supercomputer design: Linux-powered, Beowulf-cluster supercomputers. Today, all 500 of the fastest machines, the TOP500, run Linux. In the November 2020 TOP500 supercomputing ranking, no fewer than 492 systems are using cluster designs based on Beowulf’s principles.
While the first Beowulf could only hit single-digit GigaFlops speeds, Beowulf showed that supercomputing was within almost anyone’s reach. I mean, you can even build a Beowulf “supercomputer” from Raspberry Pis!
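The core idea behind a Beowulf cluster, splitting one big job across many cheap nodes and then combining their partial answers, can be sketched in a few lines of Python. This is a toy under stated assumptions, not how Beowulf software works: real clusters pass messages between separate machines over the network, whereas here a thread pool on one machine stands in for the nodes. The function names and the sum-of-squares workload are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

NODE_COUNT = 16  # mirrors the 16 processors of the first Beowulf


def split_work(data, parts):
    """Scatter step: divide data into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def node_task(chunk):
    """Work done independently by one 'node': a partial sum of squares."""
    return sum(v * v for v in chunk)


def cluster_sum_of_squares(data, nodes=NODE_COUNT):
    """Scatter the data, let every worker compute, then gather and reduce."""
    chunks = split_work(data, nodes)
    # Threads stand in for networked nodes (an assumption of this sketch).
    with ThreadPoolExecutor(max_workers=nodes) as pool:
        partials = list(pool.map(node_task, chunks))
    return sum(partials)


data = list(range(1, 1001))
total = cluster_sum_of_squares(data)
print(total)
```

The design choice that matters is that each node's task needs nothing from the others until the final gather, which is why a slow Ethernet "bus" was good enough for this kind of workload.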
Another advance in supercomputing came when designers started using multiple processor types within their designs. For example, in 2014, Tianhe-2, or Milky Way-2, used both Intel Xeon Ivy Bridge processors and Xeon Phi processors to become the fastest supercomputer of its day. The Xeon Phi is a many-core coprocessor rather than a true GPU, but like “graphic” chips it excels at floating-point calculations.
This style of combining two types of COTS processors is becoming more common. In the November 2020 TOP500 supercomputer list, the vast majority of the fastest of the fast use accelerators such as the Xeon Phi, NVIDIA Tesla V100 GPUs, and PEZY-SCx chips. In all, 149 of the TOP500 systems rely on accelerator/coprocessor chips.
Why? Because the two chip styles complement each other. Today’s GPUs have a massively parallel architecture made up of thousands of cores handling multiple tasks simultaneously. For example, the NVIDIA V100 Tensor Core GPU, which is used in several supercomputers, has 5,120 CUDA cores and 640 Tensor Cores. Conventional CPUs have few cores, but they’re optimized for sequential processing. Yoked together, they’re the foundation for much faster supercomputers.
Over 90 percent of the TOP500 use Intel Xeon or Xeon Phi chips. While AMD processors, especially the AMD Ryzen 9 Zen family, have taken over the desktop and laptop speed records, only 21 systems use AMD CPUs. Even so, that’s twice as many as there were six months ago. AMD is followed by ten IBM Power-based systems and just five Arm-based systems.
However, the Arm results are misleading. The current world champion of supercomputers, Japan’s Fugaku, is powered by Arm A64FX CPUs with 7,630,848 cores. Its world record speed is 442 petaflops (a petaflop is one quadrillion floating-point operations per second) on the High-Performance Linpack (HPL) test. If you’re keeping score at home, that’s three times faster than its closest competitor. Intel’s processor lead will soon be challenged by both AMD and Arm.
Looking ahead, the next supercomputing goal is the ExaFlop. An exaFlop is one quintillion (10^18) floating-point operations per second, or 1,000 petaFlops. We’d hoped to be there by now, but it’s proving harder than expected.
Still, Intel hopes to be there first with Aurora in 2021. In the meantime, AMD and Cray think they’ll get there first with El Capitan. And, we can’t count out Arm, which NVIDIA has announced plans to acquire. The Fugaku supercomputer architects have their eye on cracking the ExaFlop barrier too.
After that, the next mountain to climb is a ZettaFlop, or 1,000 exaFlops. Is ZettaFlop computing even possible? Sterling, now Professor of Computing at Indiana University and co-inventor of Beowulf, used to think we couldn’t do it. “I think we will never reach ZettaFlops, at least not by doing discrete floating-point operations.”
More recently, Sterling changed his mind. Sterling said that by combining logic circuits with memory to reduce I/O delays, reaching ZettaFlops speeds, and beyond, is possible. Indeed, he thinks that by using non-von Neumann architectures and superconductivity we might get to YottaFlop supercomputer speeds, a thousand ZettaFlops, by 2030. Chinese researchers aren’t that optimistic, but they predict that we’ll see Zettascale systems in 2035.
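If the prefixes are starting to blur together, a short Python sketch can keep them straight. The table holds the standard SI multipliers. The helper name `to_flops` is hypothetical, and the two machine figures are simply the ones quoted in this article (3 MegaFlops for the CDC 6600, 442 petaflops for Fugaku).

```python
# Standard SI prefixes as powers of ten, applied to Flops.
FLOPS_PREFIXES = {
    "mega": 10**6,
    "giga": 10**9,
    "tera": 10**12,
    "peta": 10**15,
    "exa": 10**18,
    "zetta": 10**21,
    "yotta": 10**24,
}


def to_flops(value, prefix):
    """Convert a prefixed figure (e.g. 442, "peta") to plain Flops."""
    return value * FLOPS_PREFIXES[prefix]


cdc_6600 = to_flops(3, "mega")   # figure quoted in this article
fugaku = to_flops(442, "peta")   # figure quoted in this article
exaflop = to_flops(1, "exa")

# How many petaFlops make one exaFlop?
print(exaflop // to_flops(1, "peta"))
# Roughly how many CDC 6600s would it take to match Fugaku?
print(fugaku // cdc_6600)
# How far is Fugaku from the exaFlop mark?
print(exaflop / fugaku)
```

Each step up the ladder is a factor of 1,000, so going from Fugaku’s petaflops to a ZettaFlop is a far bigger leap than the one from petaflops to an exaFlop.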
The one thing we can say for certain is the race for more computing speed will never end. We may not be able to get our minds around what that speed will mean for us, but we will. Remember, Bill Gates was once rumored to have said, “640K is all the memory anybody would ever need.” Well, we certainly found something to do with all that memory, and we’ll certainly find useful things to do with all that speed.