That puts a bit of pressure on Franza, a 22-year Intel veteran who joined the Aurora project in 2016 as a systems hardware architect, oversaw the transition to a GPU-based machine and was named chief architect in 2021.
“The chief architect is responsible for defining the overall system architecture of the supercomputer, according to the high-level requirements of the customer,” Franza explains. “There are basic requirements such as overall performance metrics and power requirements, but also inherent features such as RAS – reliability, availability, serviceability – that are essential to building a scalable system.”
His responsibilities also include the details of the system topology, from a node to a rack to a complete system, including the network structure and storage components.
A roadmap shift opens the door to shaping future products
When the original planning for Aurora, a system funded by the U.S. Department of Energy, began, the design consisted of a collection of Intel technologies. However, changes to Intel’s product roadmap, particularly the end of the Xeon Phi and Omni-Path product families, necessitated a reboot. As Intel made plans to build data center GPUs, Franza became involved in discussions about the Intel® Data Center GPU Max Series design (codenamed Ponte Vecchio).
In this way, Aurora is not just a one-off system. Rather, it has helped influence the Intel-wide strategy and product portfolio to achieve scale and performance at the highest levels.
“We have taken all the Aurora requirements at the system level down to the component level,” says Franza.
For example, the architecture and concept for the Intel® Xeon® CPU Max Series with high-bandwidth memory grew out of features of the Intel Xeon Phi platform, the first product to integrate an innovative high-bandwidth, high-capacity memory architecture in a single package.
In addition, the need for high performance drove further advances in all subsystems, from the compute blade’s thermomechanical solution to dense physical integration and memory.
“Intel eventually developed an entirely new storage concept, DAOS (Distributed Asynchronous Object Storage),” Franza says. This is an open-source software ecosystem that enables high-speed storage on traditional hardware. “Aurora will be one of the first systems to use it, and by far the largest.”
From designing components to assembling thousands of systems
The Aurora project required system-level thinking and extensive collaboration between different business units within Intel, as well as with scientists at Argonne and engineers at Hewlett Packard Enterprise, the other major partner in the project.
“Bringing the entire team together to deliver a machine like Aurora is a once-in-a-lifetime experience for many of us,” Franza says.
Although engineers installed the final blade in June, the project continues to keep Franza up at night as the system goes through the phases of large-scale testing, stabilization and validation.
He leads a large team working on system deployment, validation, stabilization, optimization and enablement of full-system performance workloads. Of particular note is the High Performance Linpack (HPL) benchmark, which is used to rank the world’s fastest systems on the twice-yearly Top500 list.
Each morning, Franza attends the daily standup meeting to review each node’s nightly runs and create a plan for the next day’s work and beyond. Each afternoon, a daily wrap-up meeting summarizes progress and hurdles. Work never stops; the machine is always running.
“We take a step-by-step approach to methodically validate and stabilize on a large scale,” he explains. “You start with the blade, then move to the rack, then to multiple racks, and scale from there.”
Aurora consists of 10,624 compute blades with 63,744 Intel Max Series GPUs (more GPUs than any other system in the world) and 21,248 Intel Xeon Max CPUs in 166 racks.
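The totals quoted above can be cross-checked with a little arithmetic. The following is a minimal Python sketch, not an official specification: the per-blade and per-rack figures it derives are not stated in the article and assume that GPUs, CPUs and blades are distributed evenly. The helper name `per_unit` is hypothetical.

```python
# Totals quoted in the article.
COMPUTE_BLADES = 10_624
GPUS = 63_744
CPUS = 21_248
RACKS = 166


def per_unit(total: int, units: int) -> int:
    """Return total / units, raising if the division is not exact.

    Assumption: components are spread evenly across units.
    """
    quotient, remainder = divmod(total, units)
    if remainder:
        raise ValueError(f"{total} is not evenly divisible by {units}")
    return quotient


# Derived values (assumptions, not figures from the article).
gpus_per_blade = per_unit(GPUS, COMPUTE_BLADES)
cpus_per_blade = per_unit(CPUS, COMPUTE_BLADES)
blades_per_rack = per_unit(COMPUTE_BLADES, RACKS)

print(f"GPUs per blade:  {gpus_per_blade}")
print(f"CPUs per blade:  {cpus_per_blade}")
print(f"Blades per rack: {blades_per_rack}")
```

All three divisions come out exact, so the quoted totals are consistent with a uniform layout of blades across the racks.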
“It’s the size of four tennis courts, which sounds like a lot, right?” he says. “But it’s not until you actually see it that you realize the sheer size of the project.”
Franza has to make sure the huge system is stable, functional and capable. It’s a daunting task, but the goal is within reach.
“Walking through the aisles when all the lights are on and feeling the machine running is impressive and, of course, very satisfying,” he says. “It’s a very tangible achievement that speaks for itself.”
A “once-in-a-lifetime” achievement, a science-shaping supercomputer
What drives him, despite the technical hurdles and unexpected obstacles, is the opportunity to build “an extraordinary machine” that will advance research. He cites Aurora’s enormous potential for cancer research as one area where the project will benefit all of us.
“I think this is something that will make us very proud,” he says.
Aurora will not only work to solve some of the world’s most complex scientific and engineering problems, but will also be an ideal platform for running generative AI and applying it to research. “It will enable one of the largest large language models planned to date, the 1-trillion-parameter Aurora GenAI project, which will improve, enable and simplify the lives of scientists,” Franza says.
But most of all, he enjoys the teamwork and camaraderie.
“It’s a long endeavor that requires a lot of perseverance,” he says. “The core team has maintained a marathon mentality where it’s not over until it’s over. We needed the kind of people who could focus on a big challenge over a long period of time. And in the end, we accomplished something that very few can say they did.”
Photo: Intel