A Tightening Exascale Race Reveals Underlying Forces Shaping Supercomputing

The competition between the US, China, and Japan to field the first exascale supercomputer looks a lot closer than it did a couple of years ago. But the real significance of the narrowing schedules lies in what they reflect: a shift in technology preferences and a trend toward domestic control of HPC hardware.

The three latest developments in the exascale sweepstakes that point to a tightening race are Fujitsu’s production of the prototype of the Arm processor that will power Japan’s Post-K supercomputer, the recapture of the number one spot on the TOP500 list by the US, and the report that China’s exascale effort could be delayed by up to a year.

The last of these developments, reported in MIT Technology Review, could move the deployment of China’s initial exascale supercomputer into 2021, which happens to be the same year that Japan and the US are planning to power up their first machines. The article quotes Depei Qian, a Beihang University professor who helps direct China’s exascale effort and who admitted: “I don’t know if we can still make it by the end of 2020. There may be a year or half a year’s delay.”

The problem apparently stems from the fact that the Chinese are having trouble deciding between the different types of systems under consideration. Three different exascale processor architectures are being pursued in parallel, all of which rely on domestic development: one based on some version of the ShenWei chip, another using a Chinese-designed Arm CPU, and the third using licensed x86 technology. According to Qian, the evaluation process for these approaches has dragged on and a call for proposals to build the exascale systems has been pushed back.

Of the three approaches, the only one the Chinese have any experience with in the HPC realm is their native ShenWei processor, which is currently being used to power the nation’s most powerful supercomputer, the Sunway TaihuLight. Because ShenWei is a non-standard processor architecture, it relies exclusively on custom software tools for application development. Any effort to establish the processor as a more widely accepted architecture for servers, HPC or otherwise, would probably take a decade or more.

The indigenous Arm and x86 efforts in China don’t have that problem, but as far as we know, these approaches have yet to produce a working prototype. As we reported earlier this month, Chinese chipmaker Hygon just recently began manufacturing Zen-based x86 CPUs under a licensing agreement with AMD. That chip would almost certainly need to be deployed in tandem with some sort of accelerator to deliver an exascale-capable machine based on a reasonable number of servers. The Chinese have such an accelerator in the Matrix general-purpose DSP, but the current Matrix-2000 implementation delivers only about three teraflops per chip. The Arm effort, which appears to be based on a Phytium Technology design, would likewise require an accelerator coprocessor to achieve a practical exascale system, unless a much more performant version that incorporated Arm’s Scalable Vector Extension (SVE) were developed.

The dilemma of having too many choices may indeed be delaying China’s exascale plans, but the more obvious explanation is that it’s time-consuming to develop new processors, not to mention systems based on those processors. That’s true even if the architects involved have some experience with the underlying technology. And when those systems have to operate at the exascale level, those challenges are magnified by the additional demands of energy efficiency, scalability, and reliability.

Japan ran up against such challenges early on. In 2016, Fujitsu and RIKEN committed to developing an Arm SVE-powered supercomputer for the country’s first exascale system, known as Post-K. The original schedule had RIKEN installing the system in 2020. But a few months after the plans were announced, Dr. Yutaka Ishikawa, who was the project lead at the time, admitted the Post-K deployment could be delayed by as much as two years. Last month, Fujitsu revealed it had built a prototype of the Arm chip, which is now being tested and benchmarked. Currently, Post-K appears to be on track for a 2021 deployment.

At this point, the US appears to be closest to reaching the exascale milestone, inasmuch as IBM’s newly deployed Summit system at Oak Ridge National Lab has, from a Linpack perspective, covered about 12 percent of the distance to that goal. That’s a bit of a mirage though. America’s first exascale supercomputer will be Aurora, an Intel-based system whose architecture and even processor design are still largely unknown. If this were just a matter of developing a manycore Xeon processor with enough computational horsepower (at least 20 teraflops per chip) to power Aurora’s 50,000 nodes and integrating the company’s second-generation Omni-Path fabric, silicon photonics, and Optane NVDIMMs, that would all seem pretty doable. But considering Intel’s latest problem in moving to its 10nm process node, the chipmaker’s curious abandonment of its Xeon Phi roadmap, and its general lack of specifics regarding its HPC plans, this is no slam dunk.
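
For readers who want to check the arithmetic behind the 20-teraflops-per-chip figure, here is a minimal back-of-the-envelope sketch. It assumes one processor per node and counts only aggregate peak throughput, ignoring Linpack efficiency, interconnect overhead, and power; the function names are ours and purely illustrative, not anything published by Intel or the DOE.

```python
import math

# 1 exaflop = 10^18 flops = 10^6 teraflops
EXAFLOP_IN_TERAFLOPS = 1_000_000


def chips_for_exaflop(teraflops_per_chip: float) -> int:
    """Minimum number of chips whose combined peak rate reaches one exaflop.

    Assumption: peak rates simply add up; real systems deliver less.
    """
    return math.ceil(EXAFLOP_IN_TERAFLOPS / teraflops_per_chip)


def aggregate_exaflops(nodes: int, teraflops_per_node: float) -> float:
    """Combined peak rate, in exaflops, of identical nodes (one chip per node assumed)."""
    return nodes * teraflops_per_node / EXAFLOP_IN_TERAFLOPS


if __name__ == "__main__":
    # 50,000 nodes at 20 teraflops each lands on one exaflop of peak
    print(aggregate_exaflops(50_000, 20))
    # Chips needed at 20 teraflops per chip
    print(chips_for_exaflop(20))
    # Chips needed at roughly 3 teraflops per chip (the Matrix-2000 figure cited earlier)
    print(chips_for_exaflop(3))
```

Under these simplified assumptions, 50,000 nodes at 20 teraflops apiece just reach an exaflop of peak performance, while a part in the three-teraflop class would need several hundred thousand chips, which is why the per-chip number matters so much to the node count.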

Even though Summit’s Power/GPU hybrid design will not be the architecture for the first exascale system in the US, it will almost certainly be the model for subsequent machines. In one respect, it is the most mature architecture for these future supercomputers, since it will be based on processors and other componentry with long-established product roadmaps, namely IBM Power processors, NVIDIA Tesla GPUs, and Mellanox InfiniBand.

The EU countries have conceded that their first exascale supercomputer will be a couple of years behind the initial systems deployed in the US, China, and Japan. Like their counterparts, the Europeans are also developing domestic processors for these next-generation machines, in this case based on Arm and RISC-V. The driving effort for this work, known as the European Processor Initiative (EPI), is part of a larger push to develop an indigenous HPC capability on the continent and free itself from its dependency on the North American chipmakers, specifically, Intel, AMD, NVIDIA, and IBM. EPI recently got underway and is supposed to produce pre-exascale versions of these chips by the end of the decade.

An additional complication in all these efforts is the realization that machine learning is emerging as a new application requirement for HPC work. This is forcing all the players to demand additional mixed precision support in the processor or coprocessor and to ensure memory capacity and performance will be adequate for these workloads. Large-scale machine learning also places extra requirements on the system interconnect. It’s doubtful that any exascale supercomputers will be built without these additional capabilities.

The move toward greater processor diversity and more specialization in high performance computing is a good thing, and probably necessary considering the slowdown in Moore’s Law. The emerging importance of machine learning is also a welcome development since it promises to advance the breadth and usefulness of HPC applications. But these developments come at a time when the top tier of supercomputing users seem overly focused on attaining the artificial milestone of exascale and national policies are favoring domestic designs. Would we really be building the most powerful systems in the world with such new technology were it not for the constraints imposed by exascale computing and the political winds of economic nationalism?

Consider that the Arm architecture features more prominently in current exascale plans than either x86 or Power, but has no representation on the TOP500 list (although it will soon). Even if these systems arrive on schedule, there are bound to be growing pains due to immature software tools, application porting challenges, and a general lack of support for the technology in the datacenter. A broader market for such systems is far from assured.

One plausible scenario is that these first supercomputers will be one-off machines – not stunt systems, per se, but not widely adopted for general use. Depending upon what Intel, AMD, and NVIDIA do, the more traditional combo of an x86 CPU married to a GPU accelerator could turn out to be the most practical architecture for a good chunk of exascale systems and most of the non-exascale systems.

Regardless of how this all shakes out, the next decade of supercomputing is bound to be an interesting time for everyone involved.