In this article, you will find out how generative AI is reshaping data centers and the computing infrastructure behind them.
Part 1 – The models
Generative AI looks set to change the way we work, create and live. Governments, companies, and individuals are struggling with what this means for the economy and our species, but we struggle because we simply don’t know what AI will be capable of or the costs and benefits of applying it.
Behind this transformation is a deeper story of major changes in computing architectures, network topologies, and data center design. Deploying the enormous computing resources these systems require could change the cloud industry and put the traditional supercomputing industry at risk.
To understand what this moment means and what could come next, DCD spent four months talking to nearly two dozen AI researchers, semiconductor experts, networking experts, cloud operators, supercomputing visionaries, and data center leaders.
This story starts with the models, the algorithms that fundamentally determine how an AI system works. We look at how they are made and how they could grow. In operation, we see the twin requirements of training and inference and so-called “foundation models” that can be accessed by companies and users. We also asked what the future holds for open source AI development.
Supercomputers
From there, we move into the world of supercomputers, understanding their use today and why generative AI could upend the traditional high-performance computing (HPC) sector. Next, we talk to the three hyperscalers who have built gigantic AI supercomputers in the cloud.
Next we turn to chips, where Nvidia leads the way in GPU processors that power AI machines. We spoke to seven companies trying to disrupt Nvidia — and then heard from Nvidia’s head of data centers and AI to learn why unseating the leader will be so difficult.
But the history of computing is meaningless without understanding networking, so we talked to Google about a bold attempt to overhaul how racks are connected.
Finally, we learn about what this all means for the data center. From the CEO of Digital Realty to the CEO of DE-CIX, we hear from those who are ready to build the infrastructure of tomorrow.
Making a model
Our journey through this industry starts with the model. In 2017, Google published the paper ‘Attention Is All You Need,’ which introduced the transformer architecture, allowing significantly more parallelization and reducing the time needed to train AI systems.
This triggered a boom in development, with generative AI models all built from transformers. These systems, like OpenAI’s GPT-4 large language model (LLM), are known as foundation models, where a company develops a pre-trained model for others to use.
“The model is a combination of a lot of data and a lot of computation,” Rishi Bommasani, co-founder of the Stanford Center for Research on Foundation Models and lead author of a seminal paper defining these models, told DCD. “Once you have a foundation model, you can adapt it for a wide variety of different downstream applications,” he explained.
Each foundation model is different and the costs to train them can vary greatly. But two things are clear: the companies building the most advanced models are not transparent about how they train them, and no one knows how well these models will scale.
Scaling laws are an ongoing area of research that attempts to find the optimal balance between model size, amount of data, and available computational resources.
Raising a chinchilla
“The scaling relationships with model size and computation are especially mysterious,” noted a 2020 paper by Jared Kaplan of OpenAI, which described the power-law relationship between model size, dataset size, and the computing power used for training.
As each factor increases, so does the overall performance of the large language model.
This theory has led to ever larger models, with increasing counts of parameters (the values that a model can change as it learns) and more tokens (the units of text that the model processes, essentially data). Optimizing these parameters involves multiplying sets of numbers or matrices, which requires a lot of calculations and means larger computing clusters.
That work was superseded in 2022 by a new approach from Google subsidiary DeepMind, known as the ‘Chinchilla scaling laws,’ which again attempted to find the optimal parameter count and token count for training an LLM under a given computing budget. The researchers found that the models of the time had too many parameters relative to the number of tokens they were trained on.
While Kaplan’s paper said that a 5.5× increase in model size should be paired with a 1.8× increase in the number of tokens, Chinchilla found that parameter and token counts should be scaled in equal proportions.
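To make the trade-off concrete, here is a minimal sketch in Python. It uses the commonly cited approximation that training compute is roughly 6 × parameters × tokens, and the Chinchilla rule of thumb of about 20 tokens per parameter; both constants are simplifications rather than DeepMind’s exact fitted values.

```python
# Minimal sketch of the Chinchilla-style compute-optimal trade-off.
# Assumes training FLOPs C ~ 6 * N (parameters) * D (tokens), and the
# roughly 20-tokens-per-parameter rule of thumb; both are approximations.

def chinchilla_optimal(compute_flops: float, tokens_per_param: float = 20.0):
    """Return (parameters, tokens) that roughly exhaust a compute budget."""
    # C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * r))
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Chinchilla itself used roughly 5.8e23 FLOPs of training compute.
params, tokens = chinchilla_optimal(5.8e23)
print(f"~{params / 1e9:.0f}B parameters, ~{tokens / 1e12:.1f}T tokens")
# -> about 70B parameters and 1.4 trillion tokens, in line with the paper.
```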
Training
The Google subsidiary trained the 70 billion parameter Chinchilla model based on this compute-optimal approach, using the same compute budget as a previous model, the 280 billion parameter Gopher, but with four times as much data. Tests found that it outperformed Gopher as well as other comparable models, while using four times less compute for fine-tuning and inference.
Crucially, under the new paradigm, DeepMind found that Gopher, which already had a huge computing budget, would have been better served by spending that compute on 17.2× as much data.
Meanwhile, an ideal trillion-parameter model would need roughly 221.3 times Gopher’s compute budget, spent on a correspondingly larger dataset, pushing the limits of what is possible today. That’s not to say you can’t train a trillion-parameter model (in fact, Google itself has done it), it’s just that the same compute could have been used to train a smaller model with better results.
Based on Chinchilla’s findings, semiconductor research firm SemiAnalysis calculated that the approximate compute cost of training a trillion-parameter model on Nvidia A100s would be $308 million over three months, not including preprocessing, failure recovery, and other costs.
Taking things further, Chinchilla found that an ideal 10 trillion parameter model would use about 22,515.9 times more data and resulting computation than Gopher’s ideal model. Training such a system would cost $28.9 billion over two years, SemiAnalysis believes, although costs have improved with the launch of Nvidia’s more advanced H100 GPUs.
OpenAI
It is understood that OpenAI, Anthropic and others in this space have changed the way they optimize computing since the paper’s publication to be closer to this approach, although Chinchilla has its critics.
As these companies look to build the next generation of models and hope to show dramatic improvements in a competitive field, they will be forced to throw ever larger data center clusters at the challenge. Industry estimates put GPT-4 training costs at up to 100 times those of GPT-3.5.
OpenAI did not respond to requests for comment. Anthropic declined to comment, but suggested we talk to Epoch AI Research, which studies the advancement of such models, about the future of computational scalability.
“The most expensive model whose training cost we can reasonably estimate is Google’s [540 billion parameter] Minerva,” said Jaime Sevilla, director of Epoch. “This cost about $3 million to train in their internal data centers, we estimate. But you have to train it multiple times to find a promising model, so it costs more than $10 million.”
In use, this model may also need to be retrained frequently, to take advantage of data collected from this use or to maintain an understanding of recent events.
“We can reason about how quickly computing needs have increased so far and try to extrapolate that to think about how expensive it will be 10 years from now,” Sevilla said. “And it looks like the approximate trend is for costs to increase 10-fold every two years. For top models, the pace appears to be slowing, so it’s closer to 10 times every five years.”
Predictions
Trying to predict where this will lead is a difficult exercise. “It looks like in 10 years, if this current trend continues – which is a big if – it will cost somewhere between $3 billion and $3 trillion for all the training runs to develop a model,” Sevilla explained.
“It makes a huge difference as the first is something companies like Microsoft could do. And then they won’t be able to advance further unless they generate revenue to justify larger investments.”
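The gap between those two figures is just compounding. The sketch below assumes, purely for illustration, a starting cost of around $30 million for all the runs behind one frontier model today, and applies the two growth rates Sevilla mentions; the starting figure is an assumption, not Epoch’s estimate.

```python
# Back-of-the-envelope extrapolation of training costs over a decade.
# The $30m starting point is an assumption for illustration only.

def extrapolate(cost_now: float, tenfold_every_years: float, years: float) -> float:
    """Compound a cost that grows tenfold every `tenfold_every_years` years."""
    return cost_now * 10 ** (years / tenfold_every_years)

today = 30e6  # assumed cost of all runs behind one frontier model
print(f"10x every five years: ${extrapolate(today, 5, 10) / 1e9:.0f} billion")
print(f"10x every two years:  ${extrapolate(today, 2, 10) / 1e12:.0f} trillion")
# -> roughly $3 billion versus $3 trillion, spanning Sevilla's range.
```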
What to infer from inference
These models, big and small, will have to be actually used. This is the process of inference, which requires significantly fewer computing resources per use than training, but which will consume much more compute overall, as many instances of a trained AI are deployed to do the same work in many places.
Microsoft’s Bing AI chatbot (based on GPT-4) only needed to be trained a few times (and is retrained at an unknown cadence), but is used by millions daily.
“Chinchilla and Kaplan are really great papers, but they are focused on how to optimize training – the amount of money spent training these models,” explained Finbarr Timbers, a former DeepMind researcher.
Timbers, who joined AI imaging company Midjourney (which was used to illustrate this piece) after our interview, added: “As an engineer trying to optimize inference costs, making the model bigger is worse in every way, except in performance. It’s this necessary evil that you do.
“If you look at the GPT-4 paper, you can make the model deeper to make it better. But the problem is that it makes it much slower, requires much more memory, and makes every aspect more painful to deal with. But that’s the only thing you can do to improve the model.”
Inference Trace
It will be difficult to track how inference scales because the industry is becoming less transparent as the major players are subsumed within the tech giants. OpenAI started as a non-profit company and is now a for-profit company linked to Microsoft, which has invested billions in the company. Another important player, DeepMind, was acquired by Google in 2014.
Publicly, there are no Chinchilla scaling laws for inference that show ideal model designs or predict how they will develop.
Inference was not a priority of previous approaches, as models were primarily developed as prototype tools for internal research. Now, they are starting to be used by millions and are becoming a prime concern.
“As we consider inference costs, you will create new scaling laws that tell you that you should allocate much less to model size because that increases your inference costs,” Bommasani believes. “The difficult part is that you don’t fully control the inference, because you don’t know how much demand you will have.”
Is the scaling uniform?
Not all scaling will happen uniformly. Large language models are, as their name suggests, quite large. “In the text, we have models with 500 billion parameters or more,” said Bommasani. This doesn’t need to be the case for all types of generative AI, he explained.
“In vision, we just got a recent paper from Google with models with 20 billion parameters. Things like Stable Diffusion are in the billion parameter range, so it’s almost 100 times smaller than LLMs. I’m sure we’ll continue to scale things, but it’s more a question of where we scale and how we do it.”
This could lead to a diversification in the way models are made. “Right now, there’s a lot of homogeneity because it’s early days,” he said, with most companies and researchers simply following and copying the leader, but he expects that as we reach the limits of computing, new approaches and tricks will be found.
“Right now the strategies are pretty brutal, in the sense that it’s just ‘use more compute’ and there’s nothing deeply intellectually complicated about it,” he said. “You have a recipe that works and, more or less, you just run the same recipe with more computation and then it does better in a pretty predictable way.”
As economics catch up with models, they may end up shifting to focus on the needs of their use cases. Search engines are intended for heavy and frequent use, so inference costs will dominate and become the primary driver of how a model is developed.
Keeping this sparse
As part of the effort to reduce inference costs, it is also important to look at sparsity – the effort to remove as many unnecessary parameters as possible from a model without affecting its accuracy. Outside of LLMs, researchers have been able to remove up to 95% of the weights in a neural network without significantly affecting accuracy.
However, sparsity research is still in its infancy, and what works in one model does not always work in another. Equally important is pruning, which can drastically reduce a model’s memory consumption, again with a marginal impact on accuracy.
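To illustrate the basic idea, here is a minimal magnitude-pruning sketch in Python. It is a toy example rather than any particular paper’s method, and real systems still need sparse storage formats and fine-tuning to turn the zeroed weights into genuine savings.

```python
# Toy magnitude pruning: zero out the smallest-magnitude weights of a layer.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero the fraction `sparsity` of weights with the smallest magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    return np.where(np.abs(weights) < threshold, 0.0, weights)

w = np.random.randn(4096, 4096).astype(np.float32)   # a toy dense layer
w_sparse = magnitude_prune(w, sparsity=0.95)
print(f"weights remaining: {np.count_nonzero(w_sparse) / w.size:.1%}")
# The pruned model is usually fine-tuned afterwards to recover accuracy.
```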
Then there is mixture of experts (MoE), where the model does not reuse the same parameters for all inputs, as is typical in deep learning. Instead, MoE models select different parameters for each incoming example, choosing those best suited to the task at a roughly constant computational cost by incorporating small expert networks within the broader network.
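A toy routing sketch shows the mechanism. The shapes, the softmax gate, and the top-two expert choice below are illustrative assumptions, not any production design.

```python
# Toy mixture-of-experts layer: a gating network scores the experts and
# each token is processed only by its top-k experts, so most parameters
# sit idle for any given input.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 512, 8, 2

gate = rng.standard_normal((d_model, n_experts)) / d_model ** 0.5
experts = rng.standard_normal((n_experts, d_model, d_model)) / d_model ** 0.5

def moe_layer(x: np.ndarray) -> np.ndarray:
    """Route each token to its top-k experts and mix their outputs."""
    scores = x @ gate                                  # (tokens, n_experts)
    scores -= scores.max(axis=-1, keepdims=True)       # stabilize the softmax
    probs = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)
    out = np.zeros_like(x)
    for t, token in enumerate(x):
        for e in np.argsort(probs[t])[-top_k:]:        # best experts for this token
            out[t] += probs[t, e] * (token @ experts[e])
    return out

tokens = rng.standard_normal((4, d_model))
print(moe_layer(tokens).shape)  # (4, 512): only 2 of the 8 experts ran per token
```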
“However, despite several notable MoE successes, widespread adoption has been hindered by complexity, communication costs, and training instability,” Google researchers noted in a 2022 paper where they outlined a new approach that resolved some of these problems. But the company has not yet implemented it in its main models, and the ideal size and number of specialists to place in a model are still being studied.
GPT-4 is rumored to use MoEs, but no one outside the company knows for sure. Some of China’s technically larger models take advantage of them, but they aren’t very performant.
Year of the MoE?
SemiAnalysis chief analyst Dylan Patel believes 2023 “will be the year of the MoE” as current approaches put pressure on the capacity of current computing infrastructure. However, it will have its own impact, he told DCD: “MoEs actually lead to more memory growth versus compute growth” as the parameter count needs to increase for the additional experts.
But, he said, regardless of the approach these companies take to improving training and inference efficiency, “they would be foolish to say ‘hey, with all these efficiencies, we’re done scaling.’”
Instead, “big companies will continue to scale and scale and scale. If you get a 10x improvement in efficiency, given the value of that, why not 20x your computation?”
Where does it end?
As scale begets more scale, it is difficult to see a limit to the size of LLMs and multimodal models, which can handle multiple forms of data such as text, sound, and images.
At some point, we will run out of new data to provide them, which may lead us to feed them their own output. We may also run out of computing. Or we may hit fundamental walls imposed by scaling laws we have not yet conceived.
For humanity, the question of where the scale ends could be critical to the future of our species.
Expert opinion
“If scaling laws increase indefinitely, there will be some point at which these models become more capable than humans at basically every cognitive task,” said Shivanshu Purohit, head of engineering at EleutherAI and research engineer at Stability AI.
“So you have an entity that can think a trillion times faster than you and is smarter than you. If it can out-plan you and it doesn’t have the same goals as you…”
This is far from guaranteed. “People’s expectations have risen so much so fast that there may come a point where these models cannot meet those expectations,” Purohit said.
Purohit is an “alignment” researcher, studying how to direct AI systems toward their designers’ intended goals and interests, so he says a limit to scaling “would actually be a good outcome for me. But the cynic in me says maybe they can keep delivering, which is bad news.”
EleutherAI colleague Quentin Anthony is less immediately concerned. He says growth often has limits, drawing an analogy with human development: “If my son keeps growing at this rate, he’ll be in the NBA in five years!”
He said: “We are definitely in the infant phase with these models. I don’t think we should start planning for the NBA. Of course, we have to think ‘this could happen at some point’, but we’ll see when it stops growing.”
Purohit disagrees. “I think I’m on the opposite side of that. There is a saying that the guy who sleeps with a machete is wrong every night but one.”
Generative AI and the future of data centers: Part 2 – The players
Behind Generative AI and Its Impact on the Industry
Foundation and empire
It’s impossible to say how quickly the computing demands of training these models will grow, but it is almost universally accepted that the cost of training cutting-edge models will continue to rise rapidly for the foreseeable future.
The complexity and financial hurdles of creating a foundation model have already put it out of reach for all but a small number of tech giants and well-funded AI startups. Of the startups capable of building their own models, it is no coincidence that the majority were able to do so with funding and cloud credits from hyperscalers.
This prevents most companies from competing in a space that can be extremely disruptive, cementing control in the hands of a few companies that already dominate the existing Internet infrastructure market. Rather than representing a changing of the guard in the world of technology, it risks simply becoming a new front for the old soldiers of the cloud war.
“There are a number of problems with centralization,” said Dr. Alex Hanna, director of research at the Distributed AI Research Institute (DAIR). “This means that certain people control the number of resources that go to certain things.
“You are basically limited to obeying the whims of Amazon, Microsoft and Google.”
These three companies, along with Meta’s data centers, are where most foundation models are trained. The money that startups are raising is being funneled back to these cloud companies.
“If you take OpenAI, they’re building the basic models and a lot of different companies wouldn’t be incentivized to build them right now and would rather just hold off on using those models,” said Rishi Bommasani of Stanford.
Players’ perspective
“I think this business model will continue. However, if you really need to specialize things for your specific use cases, you are limited to the extent to which OpenAI allows you to specialize.”
That said, Bommasani doesn’t believe that “we’re really going to see one model dominate,” with new players like Amazon starting to enter the space. “We already have a collection of 10 to 15 foundation model developers, and I don’t expect it to drop below five to 10.”
Although the field is relatively nascent, we are already seeing different business models emerging. “DeepMind and Google barely give access to any of their best models,” he said. “OpenAI provides a commercial API, and then Meta and Hugging Face generally provide full access.”
These positions may change over time (in fact, after our interview, Google announced an API for its PaLM model), but they represent a multitude of approaches to sharing access to models.
The big players (and their supporters) argue that it doesn’t matter much if they are the only ones with the resources to build foundation models. After all, they make pre-trained models more widely available, with the heavy lifting already done, so that others can tweak specific AIs on top of them.
Forward the foundation
Among those offering access to foundation models is Nvidia, a hardware maker whose GPUs (graphics processing units) have become key to the supercomputers running AI.
In March 2023, the company launched the Nvidia AI Foundations platform, which allows companies to build proprietary, domain-specific, generative AI applications based on models trained by Nvidia on their own supercomputers.
“Obviously, the advantage for companies is that they don’t have to go through this entire process. Not just the expense, but you have to do a lot of engineering work to continually test the checkpoints, test the models,” explained Nvidia’s vice president of enterprise computing, Manuvir Das.
Based on what they need and the internal expertise they have, companies can adjust the models to their own needs. “There is computation [required] for tuning, but it is not as intensive as full training from the start,” Das said. “Instead of many months and millions of dollars, we are typically talking about one day of computing – but per customer.”
He also expects companies to use a mix of models in different sizes – with larger ones being more advanced and accurate, but with longer latency and a higher cost to train, tune and use.
Variants
While the big models that have captured the headlines are mostly built on public data, well-funded companies are likely to develop their own variants with their own proprietary data.
This may involve feeding data into models such as the GPT family. But who owns the resulting model? That’s a difficult question to answer – and it could mean that a company has just handed over its most valuable information to OpenAI.
“Now your data is encapsulated in a perpetual model and owned by someone else,” said Rodrigo Liang, CEO of AI hardware-as-a-service company SambaNova. “Instead, we give you a computing platform that trains on your data, produces a model you can own, and delivers the highest level of accuracy.”
Of course, OpenAI is also changing as a company and is starting to build relationships with companies, which gives customers more control over their data. Earlier this year, it was revealed that the company charges $156,000 per month to run its models on dedicated instances.
The open approach
While companies are concerned about their proprietary knowledge, there are others concerned about how closed the industry is becoming.
The lack of transparency around newer models makes it difficult to understand their power and importance.
“Transparency is important for science, in terms of replicability and identifying biases in datasets, identifying weights and trying to track why a given model is giving X results,” said DAIR’s Hanna.
“It is also important in terms of governance and understanding where there may be capacity for public intervention,” she explained. “We can learn where there may be a mechanism through which a regulator can intervene, or there may be legislation passed to expose it to assessment centers and open audits.”
The major technological advances that made generative AI possible came out of the open source community, but have now been driven further by private corporations that have combined this technology with a moat of expensive computing.
EleutherAI is one of those trying to keep open source advances competitive with corporate research labs, forming a Discord group in 2020 and formally incorporating as a nonprofit research institute in January.
Patchwork
To build its vision and large language models, it was forced to rely on a patchwork of available computing. It first used Google’s TPUs through the cloud company’s research program, but then switched to niche cloud companies CoreWeave and SpellML when funding dried up.
For-profit generative AI company Stability AI has also donated a portion of its AWS cluster’s compute to EleutherAI’s ongoing LLM research.
“We’re like a little fish in the pool, just trying to grab whatever computation we can,” said Quentin Anthony of EleutherAI. “We can then give it to everyone, so that amateurs can do something with it as they are being completely left behind.
“I think it’s good that there is something that isn’t just what some corporations want it to be.”
Open source players like EleutherAI may consider the resources they have to be scraps and leftovers, but they are using systems that were at the cutting edge of computing performance when they were built.
Generative AI and the future of data centers: Part 3 – Supercomputers
What’s left for HPC in the world of generative AI?
The role of state supercomputers
Most AI training activity is now focused on the enormous resources available to tech giants building virtual supercomputers in their clouds. But in the past, research was largely carried out on supercomputers in government research labs.
During the 2010s, the world’s advanced nations raced to build facilities with enough power to carry out AI research, along with other tasks like molecular modeling and weather forecasting. Now these machines have been left behind, but their capabilities are being used by smaller players in the AI field.
When the US government launched Summit in 2018 at Oak Ridge National Laboratory, the 13-megawatt machine was the most powerful supercomputer in the world. Now, it ranks fifth in the world on the traditional Linpack (FP64) benchmark, with a theoretical peak of around 200 petaflops from its older generation of Nvidia GPUs.
For the frontiers of AI, it’s very old and very slow, but the open source group EleutherAI is happy to pick up the scraps. “We hosted pretty much the entire Summit,” said Quentin Anthony of EleutherAI.
“A lot of what holds it back is that those old [Tesla] GPUs just don’t have the memory to fit the model. So the model is split across a ton of GPUs and you are killed by communication costs,” he said.
“If you don’t have the latest and greatest hardware, you simply can’t compete – even if you get the full Summit supercomputer.”
The fastest machine in the world!
In Japan, the Fugaku was the fastest machine in the world when it was launched in 2020.
“We have a team trying to do GPT-like training at Fugaku, we are trying to create the frameworks to build base models and scale up to a fairly large number of nodes,” said Professor Satoshi Matsuoka, director of Japan’s RIKEN Center for Computational Science.
“By global systems standards, Fugaku is still a very fast AI machine,” he said. “But when you compare it to what OpenAI has put together, it is less effective. It’s much faster in terms of HPC, but with AI codes it’s not as fast as 25,000 A100s [Nvidia GPUs].”
Morgan Stanley estimates that OpenAI’s upcoming GPT system is being trained on 25,000 Nvidia GPUs, worth about $225 million.
Fugaku is built with 158,976 Fujitsu A64FX Arm processors, designed for massively parallel computing, but it has no GPUs.
“Of course, Fugaku Next, our next-generation supercomputer, will have heavy optimization to run these basic models,” Matsuoka said.
Today’s supercomputer and the research team using it have helped drive the Arm ecosystem forward and solve problems operating massively parallel architectures at scale.
“It is our role as a national laboratory to pursue the latest and greatest in computing, including AI, but also other aspects of HPC far beyond the normal trajectory that vendors can imagine,” Matsuoka said.
“We need to go beyond the supplier roadmap, or encourage suppliers to accelerate the roadmap with some of our ideas and discoveries – that’s our role. We are doing this with chip suppliers for our next-generation machine. We’re doing this with system vendors and cloud providers. We collectively advance computing for the greater good.”
Morality and massive machines
Just as open source developers are offering much-needed transparency and insight into the development of this next stage of artificial intelligence, state-owned supercomputers provide a way for the rest of the world to keep up with the corporate giants.
“The dangers of these models should not be exaggerated; we should be very, very sincere and very objective about what is possible,” Matsuoka said, drawing a comparison with nuclear energy or nuclear technologies.
State supercomputers have long controlled who accesses them. “We vet users, we monitor what happens,” he said. “We make sure people don’t mine Bitcoin on these machines, for example.”
Proposals to use the machines are submitted and the results are verified by experts. “Many of these results are publicized or, if a company uses them, the results must be for the public good,” he continued.
Nuclear plants and weapons are highly controlled and protected by layers of security. “We will learn the risks and dangers of AI,” he said. “The use of these technologies can revolutionize society, but foundation models that may have illicit intentions should be avoided. Otherwise, it could fall into the wrong hands, it could wreak havoc on society. While it may or may not end the human race, it can still cause a lot of damage.”
This requires state-backed supercomputers, he argued. “These public resources allow some control, as with transparency and openness we can have some reliable guarantees. It’s a much more secure way than just leaving it to some private cloud.”
Building the world’s biggest supercomputers
“We are now in a domain where if we want to get very effective foundation models, we need to start training with basically multi-exascale level performance at low precision,” Matsuoka explained.
While traditional machine learning and simulation models use 32-bit “single-precision” floating-point numbers (and sometimes 64-bit “double-precision” ones), generative AI can get away with smaller, lower-precision formats.
Switching to the FP16 half-precision floating point format, and potentially even FP8, means you can put more numbers in memory and cache, as well as transmit more numbers per second. This move greatly improved the computational performance of these models and changed the design of the systems used to train them.
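A rough, illustrative calculation shows why this matters. The sketch below assumes a hypothetical 175-billion-parameter model and simply counts the bytes needed to hold its weights at each precision; optimizer state, activations, and caches come on top of that.

```python
# Bytes per parameter at each floating-point width, and the memory needed
# just to hold the weights of an assumed 175-billion-parameter model.

BYTES_PER_PARAM = {"FP64": 8, "FP32": 4, "FP16": 2, "FP8": 1}
params = 175e9  # assumed model size, roughly GPT-3 scale

for fmt, nbytes in BYTES_PER_PARAM.items():
    print(f"{fmt}: {params * nbytes / 1e9:,.0f} GB of weights")
# FP32 needs 700 GB, FP16 350 GB, FP8 175 GB - and every halving also
# doubles how many values fit in cache and cross the memory bus per second.
```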
Fugaku is capable of 442 petaflops on the FP64-based Linpack benchmark and achieved two exaflops (an exaflop is 10¹⁸ floating-point operations per second) on the mixed FP16/FP64 precision HPL-AI benchmark.
OpenAI is secretive about its training resources, but Matsuoka believes that “GPT-4 was trained on a resource equivalent to one of the best supercomputers the state may be installing,” estimating it could be a 10 exaflop (FP16) machine “with AI optimizations.”
Changes
“Can we build a 100 exaflop machine to support generative AI?” Matsuoka asked. “Of course we can. Can we build a zettascale machine in FP8 or FP16? Not now, but at some point in the near future. Can we scale training to this level? In fact, this is very likely.”
This will mean facing new challenges of scale. “Sustaining a 20,000- or 100,000-node machine is much more difficult,” he said. Going from a 1,000-node machine to 10,000 doesn’t simply require scaling by a factor of 10. “It’s really difficult to operate these machines,” he said, “it’s anything but easy.”
Again, it comes down to the question of when and where the models will begin to stabilize. “Can we go five orders of magnitude better? Perhaps. Can we go two orders of magnitude? Probably. We still don’t know how far we can go. And that’s something we’ll be working on.”
Some people even warn that HPC will be left behind by cloud investments because what governments can invest is outweighed by what hyperscalers can spend on their research budgets.
Weak scaling and the future of HPC
To understand what the future holds for HPC, we must first understand how today’s large parallel computing systems came to be.
Computing tasks, including AI, can be performed faster by splitting them and running parts of them in parallel on different machines or on different parts of the same machine.
In 1967, computer scientist and mainframe pioneer Gene Amdahl noted that parallelization had limits: no matter how many cores you run it on, a program can only go as fast as the parts that cannot be divided and parallelized allow.
But in 1988, John Gustafson of Sandia Labs essentially flipped the question on its head and shifted the focus from speed to the size of the problem.
“So the execution time doesn’t decrease as you add more parallel cores, but the size of the problem increases,” Matsuoka said. “So you’re solving a more complicated problem.”
This is known as weak scaling and has been used by the HPC community for research workloads ever since.
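The two laws are easy to compare numerically. In the sketch below, the 5 percent serial fraction is an assumed figure chosen purely for illustration.

```python
# Strong scaling (Amdahl) versus weak scaling (Gustafson), side by side.

def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Fixed problem size: speedup is capped by the serial part."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

def gustafson_speedup(serial_fraction: float, n_cores: int) -> float:
    """Problem grows with the machine: speedup keeps climbing."""
    return serial_fraction + (1.0 - serial_fraction) * n_cores

s = 0.05  # assumed 5 percent serial fraction
for n in (10, 1_000, 100_000):
    print(f"{n:>7} cores: Amdahl {amdahl_speedup(s, n):7.1f}x, "
          f"Gustafson {gustafson_speedup(s, n):9.1f}x")
# Amdahl saturates near 1/0.05 = 20x no matter how many cores are added;
# Gustafson keeps growing because the problem itself keeps growing.
```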
“Advanced technologies, advanced algorithms, advanced hardware, to the extent that we now have machines with this immense power and we can utilize this massive scale,” Matsuoka said. “But we’re still making progress with this weak scaling, even things like GPUs, it’s a weak scaling machine.”
That’s “the current status quo right now,” he said.
Moore’s Law
This may change as we approach the end of Moore’s Law, the observation that the power of a CPU (based on the number of transistors that can be placed on it) will double every two years. Moore’s Law has worked to provide an ever-increasing number of processor cores per dollar spent on a supercomputer, but as semiconductor manufacturing approaches fundamental physical limits, this will no longer be the case.
“We will no longer be able to achieve the desired speed with weak scaling alone, so it may start to diverge,” warned Matsuoka.
We are already starting to see signs of different approaches. With deep learning models like generative AI able to compute at lower precisions like FP16 and FP8, chip designers have added matrix multiplication units to their latest hardware to make them significantly faster at these lower precisions.
“It’s still weak scaling, but most HPC applications can’t use them because the accuracy is too low,” Matsuoka said. “So machine designers are coming up with all these ideas to keep performance scaling, but in some cases there are divergences that may not lead to a uniform design where most of the features can be leveraged across all fields. This would lead to an immense diversity of types of computing.”
This could change the supercomputer landscape. “Some people claim it’s going to be very diverse, which is bad, because then we have to build these specific machines for a specific purpose,” he said. “We believe there should be more uniformity and it is something we are actively working on.”
The cloudification of HPC
Riken, Matsuoka’s research institute, is looking at how to keep pace with hyperscalers, who spend billions of dollars every quarter on the latest technologies.
“It’s not easy for the cloud guys either – once you start these scale wars, you have to get into this game,” Matsuoka said.
State-supported HPC programs take about 5 to 10 years between each major system, working from the beginning on a step-change machine. During this period, cloud-based systems can switch between multiple generations of hardware.
“The only way we envision solving this problem is to be agile by combining multiple strategies,” Matsuoka said. He wants to continue releasing huge systems, based on fundamental R&D, once or twice a decade – but augment them with more regular updates of commercial systems.
He hopes a parallel program can deliver new machines more quickly but at a lower cost. “It won’t be a billion dollars [like Fugaku], but it could be a few $100 million. These foundational models and their implications are hitting us at a very rapid pace, and we have to act in a very reactive manner.”
Riken is also experimenting with the ‘Fugaku Cloud Platform’, to make its supercomputer more widely available in partnership with Fujitsu.
Generative AI and the future of data centers: Part 4 – The cloud
How hyperscalers plan to master generative AI
As Riken and others in the supercomputing field look to the cloud for ideas, hyperscalers have also turned to the HPC field to understand how to deploy massively interconnected systems.
But as we have seen, the giants discovered that their financial resources enabled them to outperform traditional supercomputers.
Sudden changes are always possible, but for now, that leaves hyperscalers like Microsoft and Google leading the way – and developing new architectures for their cloud in the process.
Microsoft: Hyperscale to Superscale
“My team is responsible for building the infrastructure that made ChatGPT possible,” said Nidhi Chappell, Microsoft GM for Azure AI. “So we work very closely with OpenAI, but we also work across our overall AI infrastructure.”
Chappell’s division was responsible for deploying some of the largest computing clusters in the world. “It’s a mindset of combining hyperscale and supercomputing in generating superscale,” she said.
This was a multi-year transition for the company, as it united the two worlds. Part of it involved several high-profile hires from the traditional HPC industry, including NERSC’s Glenn Lockwood, Cray CTO Steve Scott, and the head of Cray’s exascale efforts, Dr. Dan Ernst.
“All these people you’re talking about are on my team,” Chappell said. “When you go to a much larger scale, you are dealing with challenges that are on a completely different scale. Supercomputing is the next wave of hyperscale, in a way, and you need to completely rethink your processes, whether it’s how you acquire capacity, how you’re going to validate it, how you’re going to scale it, how you’re going to repair it.”
Microsoft doesn’t share exactly what that scale is. For their standard public instances, they run up to 6,000 GPUs in a single cluster, but “some customers go beyond public offerings,” Chappell said.
OpenAI
OpenAI is one such customer, working with Microsoft on much larger specialized deployments since the companies’ $1 billion deal. “But it’s the same fundamental building blocks that are available to any customer,” she said.
Size isn’t the only challenge her team faces. As we saw earlier, researchers are working with ever larger models, but they are also running them for much longer.
“When you run a single job non-stop for six months, reliability becomes central,” she said. “You really need to completely rethink the design.”
At the scale of thousands of GPUs, some will break. Traditionally, “hyperscalers will have a lot of independent jobs, and so you can retire some of the fleet and be fine with that,” she said.
“For AI training, we had to go back and rethink and redesign how we do reliability, because if you’re taking a percentage of your fleet to maintain it, that percentage is literally not available.
“We had to think about how we could bring capacity back quickly. This turnaround time had to be reduced to ensure the entire fleet is available, healthy and reliable at all times. That’s almost fighting physics at some point.”
Scale
This scale will only grow as models expand in scope and time required. But just as OpenAI is leveraging the flywheel of usage data to improve its next generation of models, Microsoft is also learning an important lesson from running the ChatGPT infrastructure: how to build the next generation of data centers.
“You don’t build ChatGPT infrastructure from scratch,” she said. “We have a history of building supercomputers that have allowed us to build the next generation. And there were so many learnings about the infrastructure that we use for ChatGPT, about how you go from a hyperscaler to a supercomputing hyperscaler.”
As models get bigger and require more time, this “will require us to continue at the pace of larger, more powerful infrastructure,” she said. “So I think the pivotal moment [of launching ChatGPT] is actually the beginning of a journey.”
Google: from search to AI
Google also sees this as the start of something new. “Once you actually have these things in people’s hands, you can start to specialize and optimize,” said the head of the search giant’s global systems and services infrastructure team, Amin Vahdat.
“I think you’ll see a lot of refinement on the software, compiler and hardware side,” he added. Vahdat compared the moment to the early days of web search, when it would have been unimaginable for anyone to be able to index Internet content on the scale we do today. But as search engines grew in popularity, the industry rose to the challenge.
“Over the next few years, you will see dramatic improvements – some in hardware, and many in software and optimizations. I think hardware specialization can and will continue, depending on what we learn about the algorithms. But we certainly won’t see 10× a year for many more years; there are some fundamental things that will break down quickly.”
This growth in cloud computing has occurred as the industry has learned and borrowed from the traditional supercomputing industry, enabling a rapid increase in how much hyperscalers can deliver as single clusters.
Advances
But now that they’ve caught up, fielding systems that would rank in the top 10 of the Top 500 list of fastest supercomputers, they’re having to forge their own path.
“The two sectors are converging, but what we and others are doing is quite different from [traditional] supercomputing in that it actually brings end-to-end data sources together in a much more dramatic way,” Vahdat said.
“And I would also say that the amount of expertise we are bringing to the problem is unprecedented,” he added, echoing Professor Matsuoka’s concerns about divergent types of HPC (see part III).
“In other words, a lot of what these models are doing is essentially preprocessing huge amounts of data. It’s not the entirety of human knowledge, but it’s a lot, and it’s becoming increasingly multimodal.” Just preparing the input properly requires “unprecedented” data processing pipelines.
Similarly, although HPC has coupled general-purpose processors with super-low latency networking, this workload allows for slightly higher latency envelopes tied to an accelerated specialized computing configuration.
“You don’t need that ultrafine, near-nanosecond latency with tremendous full-scale bandwidth,” Vahdat said.
“You still need it, but on a medium to large scale, not on an extra large scale. I see the parallels with supercomputing, but the second- and third-order differences are substantial. We are already in uncharted territory.”
The company differentiates itself from traditional HPC by calling it “supercomputing built specifically for machine learning,” he said.
How does the giant do it?
At Google, this could mean large clusters of its in-house TPU family of chips (it also uses GPUs). For this type of supercomputing, it can couple 4,096 TPUv4s. “It is determined by the topology. It turns out we have a 3D torus, and the radix of the chip,” Vahdat said, essentially meaning it’s a question of how many links come out of each chip and how much bandwidth is allocated along each dimension of the topology.
“So 4,096 is really a technology question and chip real estate question, how much do we allocate to SerDes and off-chip bandwidth? And then, given that number and the amount of bandwidth we need between the chips, how do we connect things?”
Vahdat noted that the company “could have gone to, say, double the number of chips, but then we would have been throttling bandwidth. So now you can have more scale but half the bisection bandwidth, which was a different balance point.”
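A toy geometric sketch illustrates that balance point. The link speed and torus shapes below are assumptions for illustration, not Google’s actual TPUv4 figures.

```python
# Chips and per-chip bisection bandwidth for a 3D torus of a given shape.
# Cutting a torus across its longest dimension severs two links for every
# chip in the cross-section (the wrap-around doubles the cut).

def torus_stats(dims: tuple, link_gbps: float = 100.0):  # link speed is assumed
    chips = dims[0] * dims[1] * dims[2]
    cross_section = chips // max(dims)        # chips in one slice of the cut
    bisection_gbps = 2 * cross_section * link_gbps
    return chips, bisection_gbps / chips      # bisection bandwidth per chip

for shape in [(16, 16, 16), (32, 16, 16)]:
    chips, per_chip = torus_stats(shape)
    print(f"{shape}: {chips:5d} chips, ~{per_chip:.1f} Gbps of bisection per chip")
# (16,16,16) gives 4,096 chips; stretching one dimension to (32,16,16)
# doubles the chip count but halves the bisection bandwidth per chip.
```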
The industry could specialize further and build clusters that are not just better at machine learning, but specifically better at LLMs – but for now, the industry is moving too fast to do that.
However, it is pushing Google to look beyond what a cluster means and bring them together as a single, larger system. This could mean combining multiple clusters into one data center.
Communion
But as these models get larger, it could even mean multiple data centers working together. “The latency requirements are lower than we would imagine,” he said. “So I don’t think it’s out of the question to be able to couple multiple data centers.”
All of these changes mean that the traditional lines of what constitutes a data center or a supercomputer are starting to blur. “We’re in a super exciting time,” he said. “The way we compute is changing, the definition of a supercomputer is changing, the definition of computing is changing.
“We’ve done a lot in this space over the last two decades, like with TPUv4. We will be announcing the next steps in our journey in the coming months. Therefore, the rate of hardware and software innovation will not slow down in the coming years.”
Generative AI and the future of data centers: Part 5 – The chips
An explosion of semiconductors to meet the demands of AI
Even with the huge investments made in building supercomputers in the cloud or in the laboratory, problems can arise.
“We recently saw that, due to some issue with the GPUs in our cluster, we had to throttle them down because they would go over 500 watts per GPU at full throttle, and that would basically burn out the GPU and your execution would die,” said Shivanshu Purohit of EleutherAI.
“Even the cloud provider didn’t consider it because they thought it shouldn’t happen, because it usually doesn’t happen. But then it happened.”
Likewise, high-energy particles “can break through all the redundancies and corrupt your GPU,” he said.
“There may be new issues as we scale beyond where we are now, there is a limit to how many GPUs you can store in a single data center. Currently, the limit is about 32,000, both due to power and the challenges of how to actually design the data center.”
Maybe the answer isn’t to build ever bigger data centers but rather to move away from GPUs.
The new wave of computing
Over the past half-decade, as Moore’s Law has slowed and other AI applications have proliferated, AI chip companies have sprouted like mushrooms in the rain.
Many failed or were acquired and stripped of assets, as a promised AI revolution was slow to occur. Now, with a new wave of computing again about to flood data centers, they are hopeful that the time has come.
Every company we spoke to believes its unique approach will be able to solve the challenge posed by ever-growing AI models.
Tenstorrent
“We believe our technology is exceptionally good for where we think the models will go,” said Matt Mattina, head of AI at chip startup Tenstorrent.
“If you accept this idea that you can’t natively get 10 trillion parameters, or however many trillion you want, our architecture has scaling built in.
“So, generative AI is fundamentally about matrix multiplication [a binary operation that produces a matrix from two matrices] and its large models,” he continued. “To do this, you need a machine that can do matrix multiplication with high throughput and low energy consumption, and it needs to be able to scale. You need to be able to connect many, many chips together.
“You need a fundamental building block that is efficient in terms of TOPS (tera operations per second) per watt and can scale efficiently, meaning you don’t need a rack of switches when you add another node of these things.”
Each of the company’s chips has built-in Ethernet, “so the way to scale is to simply connect the chips together over standard Ethernet, there’s not a maze of switching and stuff as you go to larger sizes,” and the company claims that its software makes scaling easy.
“It’s a very promising architecture,” said Dylan Patel of SemiAnalysis. “It’s very interesting from a sizing and memory point of view and from a software programmability point of view. But none of that is there yet.
The hardware exists in some capacity and the software is still being worked on. It’s a difficult problem for them to solve and be usable, and there’s still a lot that needs to be done.”
Cerebras
Rival Cerebras takes a different approach to scaling: just make the chip bigger.
The Wafer Scale Engine 2 (WSE-2) chip features 2.6 trillion transistors, 850,000 ‘AI-optimized’ cores, 40GB of on-chip SRAM, 20 petabytes per second of memory bandwidth, and 220 petabits per second of aggregate fabric bandwidth. It is packaged in the Cerebras CS-2, a 15U box that also includes an HPE SuperDome Flex server.
“When these big companies are thinking about training generative AI, they are often thinking about gigaflops of compute,” said Cerebras CEO and co-founder Andrew Feldman. “We’re more efficient [than the current GPU approach], for sure, but you’re still going to use a crazy amount of compute because we’re training with a kind of brute force.”
Feldman again believes there will be a limit to the current approach of giant models, “because we can’t get bigger and bigger forever, there is an upper limit.” He thinks sparse approaches will help reduce model sizes.
Still, he agrees that whatever the models, they will require huge computing clusters. “Large clusters of GPUs are incredibly difficult to use,” he said. “Distributed computing is very painful, and distributing the AI work – where you have to use the tensor model in parallel and then the pipeline model in parallel and so on – is an incredibly complicated process.”
The company hopes to solve part of this challenge by moving what would otherwise be handled by hundreds of GPUs to a multimillion-dollar megachip.
“There are two reasons why you stop work,” he said. “One is that you can’t store all the parameters in memory, the second reason is that you can’t do a necessary calculation, and that’s usually a big matrix multiplied into a big layer.”
Parameters
At GPT-3’s scale of 175 billion parameters, the largest matrix multiplication is about 12,000 by 12,000. “We can support hundreds of times larger, and because we store our parameters off-chip in our MemoryX technology, we have arbitrarily large parameter storage – 100-200 trillion is no problem,” he said. “And so we have the ability to store a large number of parameters and perform the largest multiplication step.”
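Some rough arithmetic gives a sense of those sizes, assuming a GPT-3-style hidden dimension of 12,288 (the “about 12,000 by 12,000” above), a conventional 4× wider feed-forward layer, and FP16 weights; the exact layer shapes are assumptions for illustration.

```python
# Illustrative sizes for GPT-3-scale matrices and weights in FP16 (2 bytes).

d_model = 12_288          # assumed hidden dimension, roughly GPT-3's
total_params = 175e9

attention_matrix = d_model * d_model * 2     # one 12k x 12k weight matrix
ffn_matrix = d_model * (4 * d_model) * 2     # the wider feed-forward projection
all_weights = total_params * 2

print(f"12k x 12k matrix: {attention_matrix / 1e6:.0f} MB")
print(f"FFN projection:   {ffn_matrix / 1e9:.1f} GB")
print(f"All 175B weights: {all_weights / 1e9:.0f} GB")
# ~0.3 GB and ~1.2 GB per matrix, and ~350 GB for the weights alone -
# far more than a single GPU holds, which is why the work gets split up.
```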
However, the single massive chip isn’t big enough for what the larger models require. “And so we built Andromeda, which has 13.5 million cores. It’s one and a half times larger than [Oak Ridge’s exascale system] Frontier in core count, and we were able to stand it up in three days. The first customer to use it was Argonne [another US national computing laboratory], and they were doing things they couldn’t do on a 2,000 GPU cluster.”
The cloud-based Andromeda supercomputer combines 16 of Cerebras’ CS-2 systems, but Cerebras has the potential ability to scale to 192 of these systems as a cluster. “The scaling limitation is about 160 million cores,” Feldman said.
Cerebras isn’t the only company to offer its specialized hardware as a cloud product.
Graphcore
“We decided to change our business model from selling hardware to running an AI cloud,” said Simon Knowles, CTO of British AI chip startup Graphcore.
“Is it realistic to set up and operate an AI cloud? Clearly, it’s sensible because of the huge margins Nvidia is able to reap. The real question is: is there a market for a specialized AI cloud that a generic cloud like AWS doesn’t offer? We believe that yes, it exists, and that is with IPUs.”
The company’s IPU (Intelligence Processing Unit) is another parallel processor designed from the ground up for AI workloads.
“IPUs were designed from day one with an obligation not to look like GPUs,” said Knowles. “I’m amazed at how many startups have tried to basically be an alternative GPU. The world doesn’t need another Nvidia; Nvidia are very good.”
He believes that “what the world needs is machines of different form factors, that perform well in things where Nvidia can clearly be beaten.” That’s part of the reason Graphcore is building its own cloud. While it will still sell some hardware, it has found that customers don’t commit to purchasing hardware because they want it to be as good or better than Nvidia’s GPUs in all workloads.
“They wanted insurance that met all their future needs that they didn’t know about,” he said. “Whereas, as a cloud service, it’s like ‘for this set of functions, we can do it for half the price of them.’”
Likewise, it doesn’t want to compete with AWS on every metric. “You would have to be pretty bold to believe that a cloud-based technology could do everything well,” he said.
SambaNova
Another startup offering specialized hardware in the cloud, on-premises or as a service is SambaNova. “As models grow, we believe Dataflow [SambaNova architecture] is what you will need,” said CEO Rodrigo Liang. “We believe that over time, as these models grow and expand, that the power required, the amount of cost, all of those things will be prohibitive on these legacy architectures.
“So we fundamentally believe that the new architecture will allow us to grow with the size of the models in a much more effective and efficient way than the legacy ways of doing so.”
But legacy chip designers have also fielded hardware intended to meet the training and inference needs of newer AI models.
Intel
“Habana Gaudi has already been proven to have 2x the performance of the A100 GPU in the MLPerf benchmark,” said Dr. Walter Riviera, AI technical lead at Intel EMEA, about the company’s deep learning training processor.
“As far as the GPU is concerned, we have the Flex series. And again, depending on the workload, it’s competitive. My advice to any client is to test and evaluate what will be best for them.”
AMD
In recent years, AMD has taken CPU market share from Intel. But in the world of GPUs it has the second best product on the market, believes Dylan Patel, from SemiAnalysis, and has yet to capture a significant share.
“If anyone is going to be able to compete, it’s the MI300 GPU,” he said. “But there are also some things missing, it’s not in the software and there are some aspects of the hardware that will be more expensive. It’s not a home run.”
AMD Data Center and Accelerated Processing CVP Brad McCredie pointed to the company’s leadership in HPC as a key advantage. “We are on the largest supercomputer on three continents,” he said. “Such a big piece of this explosion in AI is scale, and we have demonstrated our ability to scale.”
McCredie also believes that AMD’s success in packing lots of memory bandwidth into its chips will be particularly attractive for generative AI. “When you get into the inference of these LLMs, memory capacity and bandwidth come to the fore. We have eight high-bandwidth memory stacks in our MI250, which is a leadership position.”
Another important area he highlighted is energy efficiency. “When you start to get to that scale, energy efficiency is very important,” he said. “And it will continue to grow.”
Google TPU
Then there’s the tensor processing unit (TPU), a custom family of AI chips developed by Google – the same company that created the transformer model that forms the basis of today’s generative AI approaches.
“I think one of the main advantages of TPUs is interconnection,” said researcher Finbarr Timbers.
“They have really high networking between the chips, and that’s incredibly useful for machine learning. For transformers in general, memory bandwidth is the bottleneck. It’s about moving data from the machine’s RAM to on-chip memory, that’s the big bottleneck. TPUs are the best way to do this in the industry, because they have all this dedicated infrastructure for it.”
The other advantage of the chip is that it is used by Google to make its larger models, so development of the hardware and models can be done together.
“It really comes down to co-design,” said Google’s Amin Vahdat. “Understanding what the model needs from a computational perspective, figuring out how to better specify the model from a language perspective, figuring out how to write the compiler and then mapping it to the hardware.”
The company also points to the TPU’s energy efficiency as a big advantage as these models grow. In a research paper, it said its TPUv4s used ~2-6× less power and produced ~20× less CO2e than contemporary rival chips (not including the H100), though the main caveat is that it was comparing its own hyperscale data centers to an on-premises installation.
Amazon Trainium
Amazon also has its own family of Trainium chips. The chips haven’t made much of an impact yet, although Stability AI recently announced that it would look into training some of its models on the hardware (likely as part of its cloud deal with AWS).
“One feature I would like to highlight is hardware-accelerated stochastic rounding,” said AWS EC2 Director Chetan Kapoor.
“So stochastic rounding is a capability that we’ve built into the chip that intelligently says, okay, am I going to round a number down or up?” he said, with systems typically just rounding down. “This basically means that with stochastic rounding you can actually achieve the throughput of the FP16 data type and the accuracy of FP32.”
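Kapoor’s description can be illustrated with a small software mock-up. The NumPy sketch below is only a conceptual stand-in (Trainium’s actual hardware implementation is not public, and the 0.01 rounding grid is an invented example): it rounds up with probability equal to the fractional remainder, which preserves the expected value, so tiny updates that plain truncation would silently discard still accumulate over many steps.

import numpy as np

def stochastic_round(x, step=1.0, rng=np.random.default_rng(0)):
    """Round each value down or up to the nearest multiple of `step`,
    choosing 'up' with probability equal to the fractional remainder.
    The expected value of the result equals x, unlike truncation,
    so small updates are not silently lost."""
    scaled = np.asarray(x, dtype=np.float64) / step
    floor = np.floor(scaled)
    frac = scaled - floor
    round_up = rng.random(size=scaled.shape) < frac
    return (floor + round_up) * step

# A tiny update (0.001) repeatedly added to a weight stored on a coarse
# 0.01 grid: truncation loses it entirely, stochastic rounding keeps it
# on average.
w_trunc, w_stoch = 0.0, 0.0
for _ in range(10_000):
    w_trunc = np.floor((w_trunc + 0.001) / 0.01) * 0.01
    w_stoch = float(stochastic_round(w_stoch + 0.001, step=0.01))
print(w_trunc)   # stays at 0.0
print(w_stoch)   # ends near 10.0 (10,000 * 0.001)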
Nvidia: The king of generative AI
Nvidia isn’t snoozing, and chip rivals hoping to eat into its fat margins will find the task daunting, much like Microsoft’s Bing nibbling away at Google’s aura of search superiority.
Rather than seeing this as an end to its dominance and a ‘code red’ moment similar to what’s happening at Google, Nvidia says this is the culmination of decades of preparation for this moment.
“They’ve been talking about this for years,” said SemiAnalysis’s Patel. “Of course they were caught off guard by how quickly it took off in the last few months, but they always aimed for that. I think they are very well positioned.”
Outside of Google’s use of TPUs, virtually every major generative AI model available today has been developed on Nvidia’s A100 GPUs. Tomorrow’s models will be built primarily with its recently launched H100s.
Decades of leading the AI space have meant that an entire industry has been built around Nvidia’s products. “Even as an academic user, if I were given infinite computation on these other systems, I would have to do a year of software engineering work before I could make them useful, because the entire deep learning stack is on Nvidia and Nvidia Mellanox [the company’s networking platform],” said EleutherAI’s Anthony. “It’s all really a unified system.”
Colleague Purohit added: “It’s the entire ecosystem, not just Mellanox. They optimize it end to end so they have the best hardware. The generational gap between an A100 and an H100 from the preliminary tests we did is enough for Nvidia to be the king of computing in the near future.”
Pioneering improvement
In his opinion, Nvidia has perfected the hardware-improves-software-improves-hardware loop, “and the only one that competes is basically Google. Someone could build a better chip, but the software is optimized for Nvidia.”
A key example of Nvidia’s efforts to stay ahead was the launch of the tensor core in late 2017, designed for superior deep learning performance over regular cores based on Nvidia’s CUDA (Compute Unified Device Architecture) parallel platform.
“That changed the game,” Anthony said. “A regular user can just change their code to use mixed-precision tensor cores for computation and double their performance.”
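Anthony’s “just change their code” point shows up as only a few extra lines around a training step. The PyTorch sketch below is a minimal, hedged example (the model, data, and hyperparameters are dummies chosen for illustration): the forward pass runs inside an autocast region so that, on Nvidia GPUs with tensor cores, eligible matmuls execute in FP16, while a gradient scaler guards small FP16 gradients against underflow.

import torch
from torch import nn

# Minimal mixed-precision training step. On Nvidia GPUs with tensor cores,
# matmuls inside the autocast region run in half precision; the GradScaler
# keeps small FP16 gradients from underflowing. Model and data are dummies.

device = "cuda" if torch.cuda.is_available() else "cpu"
use_amp = device == "cuda"

model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 10)).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler(enabled=use_amp)   # no-op when AMP is disabled

x = torch.randn(32, 1024, device=device)
target = torch.randint(0, 10, (32,), device=device)

optimizer.zero_grad()
# FP16 on GPU (tensor cores); BF16 on CPU so the example still runs anywhere.
with torch.autocast(device_type=device, dtype=torch.float16 if use_amp else torch.bfloat16):
    loss = nn.functional.cross_entropy(model(x), target)
scaler.scale(loss).backward()   # backward pass on the scaled loss
scaler.step(optimizer)          # unscales gradients, skips the step on inf/nan
scaler.update()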
Now, Nvidia hopes to take things even further with a Transformer Engine in the H100, built for FP8. “It’s really a combination of hardware and software,” said Ian Buck, head of data centers and AI at Nvidia. “Basically, we added eight-bit floating point capability to our GPU and we did it in a smart way while maintaining accuracy.”
A software engine essentially monitors the accuracy of the training and inference work along the way and dynamically scales things down to FP8.
“Tensor cores have completely eliminated FP32 training. Before that, everything was in FP32,” said Anthony. “I don’t know if the change to FP8 will be the same, maybe it’s not enough precision. We’re yet to see if deep learning people can still converge their models on this hardware.”
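To make “dynamically scales things down to FP8” a little more concrete, here is a heavily simplified NumPy sketch. It is not Nvidia’s Transformer Engine: it mimics only the range-management half of the problem, picking a per-tensor scale from a short history of observed maxima so values fit under an FP8-like ceiling (the 448 maximum of the E4M3 format), and it ignores the loss of mantissa bits entirely.

import numpy as np

# Conceptual sketch of per-tensor dynamic scaling for a narrow-range format.
# FP8_MAX stands in for the largest representable value of an FP8 variant;
# this is an illustration, not Nvidia's Transformer Engine.

FP8_MAX = 448.0          # the E4M3 format tops out around 448
HISTORY = []             # running record of per-step maxima

def choose_scale(history, margin=1.0):
    """Pick a scale from recent maxima, leaving a safety margin."""
    amax = max(history[-16:])
    return (FP8_MAX / amax) / (2 ** margin)

def fake_quantize_fp8(tensor):
    """Scale, clip to the FP8 range, then unscale, mimicking the loss of
    range (though not the loss of mantissa bits) of an FP8 cast."""
    HISTORY.append(float(np.abs(tensor).max()) + 1e-12)
    scale = choose_scale(HISTORY)
    clipped = np.clip(tensor * scale, -FP8_MAX, FP8_MAX)
    return clipped / scale

activations = np.random.randn(4, 1024).astype(np.float32) * 30.0
out = fake_quantize_fp8(activations)
print(np.max(np.abs(activations - out)))   # near zero: values fit once scaled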
Adequacy
But just as the Nvidia Tesla GPUs in the Summit supercomputer are too old for today’s challenges, the H100s won’t be suitable for the models of the future.
“They’re evolving together,” Buck said, pointing out that Nvidia’s GTX 580 cards were used to build AlexNet, one of the most influential convolutional neural networks ever made, in 2012.
“These GPUs are completely impractical today, a data center couldn’t even be built to make them scale to today’s models, it would simply fall apart,” said Buck.
“Will current GPUs take us to 150 trillion parameters? No. But the evolution of our GPUs, the evolution of what goes into the chips, the architecture itself, the memory interconnect, NVLink and data center designs, yes. And all the software optimizations that are happening at the top are how we beat Moore’s Law.”
Nvidia’s to lose
For now, this market remains Nvidia’s to lose. “As everyone is trying to move forward with building these models, they will use [Nvidia] GPUs,” Patel said. “They are better and easier to use. They’re often actually cheaper too, when you don’t have to spend as much time and money optimizing them.”
This may change as models mature. Today, in a competitive space where performance and deployment speed are essential, Nvidia represents the safe and highly capable bet.
As time passes and pressure eases, companies can look for alternative architectures and optimize deployments on less expensive equipment.
Generative AI and the future of data centers: Part 6 – The network
DE-CIX CEO on how data centers need to adapt
Just as silicon is being pushed to its limits to handle massive AI models, data center networking and architecture are facing challenges.
“With these big systems, no matter what, you can’t fit them on a single chip, even if you’re Cerebras,” said Dylan Patel of SemiAnalysis. “Well, how do I connect all these separate chips? If it’s 100 that’s manageable, but if it’s thousands or tens of thousands, you’re starting to have real difficulties and Nvidia is implementing exactly that. Arguably, it is either them or Broadcom that has the best network in the world.”
But cloud companies are also becoming more involved. They have the resources to create their own networking equipment and topologies to support growing computing clusters.
Amazon
Amazon Web Services has deployed clusters of up to 20,000 GPUs, with AWS-specific Nitro networking cards. “And we will deploy multiple clusters,” said the company’s Chetan Kapoor. “That’s one of the things that I believe sets AWS apart in this particular space. We leverage our Nitro technology to have our own network adapters, which we call Elastic Fabric Adapters.”
The company is in the process of implementing its second generation of EFA. “And we are also in the process of increasing the bandwidth per node, about 8× between A100s and H100s,” he said. “We will go up to 3,200 Gbps, per node.”
At Google, an ambitious, multi-year effort to overhaul the networks of its massive fleet of data centers is starting to bear fruit.
The company has begun deploying its custom Mission Apollo optical switching technology at a scale never before seen in a data center.
Traditional data center networks use a spine-and-leaf configuration, where servers are connected to top-of-rack switches (the leaves), which are then connected to a spine of electronic packet switches. Apollo replaces the spine with all-optical interconnects that redirect beams of light with mirrors.
“The bandwidth needs of training, and at some scale inference, are enormous,” said Google’s Amin Vahdat.
Apollo
Apollo allowed the company to build “network topologies that more closely match the communication patterns of these training algorithms,” he said. “We set up specialized, dedicated networks to distribute parameters between chips, where enormous amounts of bandwidth are needed synchronously and in real time.”
This has several benefits, he said. At this scale, individual chips or racks fail regularly, and “an optical circuit switch is quite convenient to reconfigure in response because now my communication patterns are matching the logical topology of my fabric,” he said.
“I can tell my optical circuit switch, ‘Get some other chips from somewhere else, reconfigure the optical circuit switch to plug these chips into the missing hole, and then continue.’ There is no need to restart all computing or – worst case – start from scratch.”
Apollo also helps the company deploy capacity flexibly. The company’s TPUv4 scales to blocks of 4,096 chips. “If I schedule 256 here, 64 there, 128 here, another 512 there, all of a sudden I’m going to create some holes where I have a bunch of 64-chip blocks available.”
In a traditional network architecture, if a customer wanted 512 of these chips, they wouldn’t be able to use them. “If I didn’t have an optical circuit switch, I would be sunk, I would have to wait for some work to be completed,” Vahdat said. “They are already taking up parts of my mesh and I don’t have a contiguous 512, although I may have 1,024 chips available.”
But with the optical circuit switch, the company can “connect the right pieces together to create a beautiful 512-node mesh that is logically contiguous. So separating logical topology from physics is super powerful.”
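A toy scheduler helps show why that separation matters. The Python sketch below is purely illustrative: the 64-chip blocks and 4,096-chip pod come from the TPUv4 figures above, but the job mix is invented and the “optical circuit switch” is just a dictionary. It fragments a pod with small jobs, then shows that a fixed topology cannot place a 512-chip slice even though 1,024 chips sit idle, while an OCS-style allocator simply stitches the scattered blocks into one logical mesh.

# Toy model of the flexibility Vahdat describes. All names and numbers are
# illustrative; the "OCS" here is just bookkeeping in a dictionary.

BLOCK = 64                              # chips per schedulable block
POD_BLOCKS = 4096 // BLOCK              # a TPUv4-sized pod: 64 blocks

free = set(range(POD_BLOCKS))           # block IDs currently idle
jobs = {}                               # job name -> blocks assigned to it

def alloc_ocs(job, chips):
    """With an optical circuit switch, any free blocks, wherever they sit
    physically, can be wired into one logically contiguous slice."""
    need = -(-chips // BLOCK)           # ceiling division
    if need > len(free):
        return False
    picked = sorted(free)[:need]
    for b in picked:
        free.remove(b)
    jobs[job] = picked
    return True

def contiguous_run(need):
    """Stand-in for a fixed topology: blocks must be physically adjacent."""
    ids = sorted(free)
    for i in range(len(ids) - need + 1):
        window = ids[i:i + need]
        if window[-1] - window[0] == need - 1:
            return window
    return None

# Fill the pod with sixteen 256-chip jobs, then let four non-adjacent ones
# finish, leaving four scattered 4-block holes (1,024 chips free in total).
for i in range(16):
    alloc_ocs(f"job{i}", 256)
for i in (1, 5, 9, 13):
    free.update(jobs.pop(f"job{i}"))

# A 512-chip request needs 8 blocks. No 8 adjacent blocks exist, so a fixed
# topology is stuck, even though twice that many chips are idle...
print(contiguous_run(8))                # -> None
# ...while the OCS stitches the scattered blocks into one logical mesh.
print(alloc_ocs("big-job", 512))        # -> True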
Changes
If generative AI becomes a major workload, every data center in the world could find that it needs to rebuild its network, said Ivo Ivanov, CEO of internet exchange DE-CIX.
“There are three critical sets of services we see:
1) Cloud switching, i.e. direct connectivity to individual clouds,
2) Direct interconnection between the different clouds a company uses, and
3) Peering, for direct interconnection with other end-user and customer networks.”
He argued: “If these services are fundamental to creating the environment that generative AI needs in terms of infrastructure, then every data center operator today needs to have a solution for an interconnection platform.”
This future-proof network service must be seamless, he said: “If data center operators don’t offer this to their customers today and in the future, they will be reduced to just server closet operators.”
Generative AI and the future of data centers: Part 7 – Data centers
CEO of Digital Realty and more on what generative AI means for the data center industry
A potential change in the nature of workloads will filter down to the broader data center sector, impacting how they are built and where they are located.
Bigger data centers, hotter racks
Digital Realty CEO Andy Power believes generative AI will lead to “a monumental surge in demand.
“The way this happens in the data center sector is still new, but there will definitely be a large-scale demand. Just do the math on these spending quotes and A100 chips and think about the gigawatts of power needed for them.”
When he joined the company almost eight years ago, “we were going from one to three megawatts of IT suites and quickly went from six to eight, then ten,” he recalled. “I think the biggest building we built was 100MW over several years. And the biggest deals we signed were things like 50MW. Now you’re hearing some more deals in the hundreds of megawatts, and I’ve had preliminary conversations over the last few months where customers are saying ‘Talk to me about a gigawatt’.”
Cloud adaptation
For training AI models, Power believes we will see a shift away from the traditional cloud approach, which focuses on dividing workloads across multiple regions while keeping them close to the end user.
“Given the computing intensity, you can’t just chop them up and patch them together across many regions or cities,” he said. At the same time, “you’re not going to put this out in the middle of nowhere, because of the infrastructure and data exchange.”
These facilities will still need proximity to other data centers with more traditional data and workloads, but “the proximity and how close this AI workload needs to be to the cloud and data is still an unknown.”
He believes it will “still be very metro-focused,” which will be a challenge because “you’ll need large tracts of contiguous land and power, but it’s increasingly difficult to find a contiguous gigawatt of power,” he said, pointing to the transmission challenges in Virginia and elsewhere.
What about data centers?
As for the data centers themselves, “simple, it’s going to be a hotter environment, you’re just going to put in a lot more power-dense servers, and you’re going to need to innovate your existing footprints and your design for new footprints,” he said.
“We are innovating for our corporate customers in terms of liquid cooling. It’s been quite niche and experimental, to be honest with you,” he said. “We have also been co-designing with our hyperscale customers, but these have been the exceptions, not the norm. I think you’ll see more of that become the standard.”
Specialized buildings
Moving forward, he believes that “you will have two buildings next to each other and one will support the hybrid cloud. And then you have another one next to it that’s twice or triple the size, with a different design, a different cooling infrastructure, and a different power density.”
Amazon agrees that large AI models will need specialized facilities. “Training needs to be pooled and you need to have very, very large, deep pools of a specific capability,” said Chetan Kapoor of AWS.
“The strategy we have been executing in recent years, and which we will reinforce, is to choose some data centers linked to our main regions, such as northern Virginia (US-East-1) or Oregon (US-West-2) as an example, and build really large clusters with dedicated data centers. Not just with raw compute, but also with racks of storage to really support high-speed file systems.”
Specialized cluster
On the training side, the company will have specialized cluster deployments. “And you can imagine that we will rinse and repeat on GPUs and Trainium,” said Kapoor. “So there will be dedicated data centers for H100 GPUs. And there will be dedicated data centers for Trainium.”
Things will be different on the inference side, which will be closer to the traditional cloud model. “What we’re seeing is that customers need multiple availability zones, and they need support in multiple regions. This is where some of our key scaling and infrastructure capabilities for AWS really shine. Many of these applications tend to be real-time in nature, so having the computing as close to the user as possible becomes super, super important.”
However, the company doesn’t plan to follow the same dense server rack approach as its cloud competitors.
Scalable infrastructure
“Instead of packing a lot of computing into a single rack, what we are trying to do is build an infrastructure that is scalable and deployable across multiple regions and is as energy efficient as possible,” said Kapoor. “If you’re trying to pack a lot of these servers, the cost is going to go up because you’re going to have to find really expensive solutions to really cool it down.”
Google’s Vahdat agreed that we will see specific clusters for large-scale training, but noted that in the long term it may not be as targeted. “The interesting question here is: what happens in a world where you want to gradually refine your models? I think the line between training and serving will be a little blurrier than the way we do things now.”
Comparing it to the early days of the Internet, where search indexing was done by a few high-compute centers but is now spread across the world, he noted: “We have blurred the line between training and serving. You’ll see some of that moving forward with this.”
Where and how to build?
While this new wave of workloads risks leaving some businesses behind, the Digital Realty CEO sees this moment as a “rising tide to lift all ships, arriving as a third wave when the second and first have not yet reached shore.”
The first two waves were customers moving from on-premises to colocation and then to cloud services delivered from hyperscale wholesale deployments.
This is great news for the sector, but it comes after years of the industry struggling to keep up. “Demand continues to outpace supply, [the industry] is bent over coughing because it’s out of fuel,” Power said. “This third wave of demand is not coming at a fortuitous time for it to be an easy path to growth.”
For all its hopes of solving or transcending today’s challenges, the growth of generative AI will be hampered by the broader difficulties plaguing the data center market – problems of scale.
How can data center operators quickly increase capacity on a larger scale and faster, consuming more energy, land and possibly water – ideally, using renewable resources and not causing an increase in emissions?
“Energy constraints in Northern Virginia, environmental concerns, moratoriums, nimbyism, supply chain issues, worker talent shortages, and so on,” Power said, listing the external problems.
“And that ignores the stuff that goes into the data centers that the customer owns and operates. A lot of these things are time-consuming,” he added, with GPUs currently difficult to acquire even for hyperscalers, causing rationing.
Economy
“The economy has been hot for many years,” Power said, “and it’s going to take a while to replenish much of that infrastructure, bringing transmission lines to different areas. And it’s a massive government and local community effort.”
While AI researchers and chip designers face the scaling challenges of parameter counts and memory allocation, data center builders and operators will have to overcome their own scaling bottlenecks to meet the demands of generative AI.
“We will continue to see bigger milestones that will require computing to become less of an impediment to AI progress and more of an accelerator for it,” said Microsoft’s Nidhi Chappell. “Even looking at the script I’m working on now, it’s incredible, the scale is unprecedented. And it is completely necessary.”
Could this all just be hype?
As we plan for the future and try to extrapolate what AI means for the data center industry and humanity more broadly, it’s important to take a step back from the breathtaking coverage that potentially transformational technologies can generate.
Following the silicon boom, the birth of the Internet, the smartphone and app revolution, and the proliferation of the cloud, innovation has stagnated. Silicon became more powerful, but at increasingly slower rates. Internet businesses matured and consolidated around a few giant corporations. Apps coalesced around a few key destinations, rarely displaced by newcomers. Each new generation of smartphones is barely distinguishable from the previous one.
But those who benefited from previous booms remain paranoid about what might come next and displace them. Those who missed out are equally searching for the next opportunity. Both look to the past and the wealth generated by inflection points as proof that the next wave will follow suit. This has led to a culture of multiple false starts and over-promising.
Metaverse
The metaverse was supposed to be the next wave of the Internet. Instead, it just drove down Meta’s share price. Cryptocurrency promised to reshape financial systems. Instead, it burned the planet and concentrated wealth in the hands of a few. NFTs were meant to revolutionize art, but they quickly became a joke. After years of promotion, commercial quantum computers remain as intangible as Schrödinger’s cat.
Generative AI appears to be different. The pace of advancement and the end results are clear evidence that tangible use cases exist. But it is notable that cryptocurrency enthusiasts have rebranded themselves as AI proponents, and metaverse businesses have pivoted to generative AI. Many of the people who promoted the last big thing may simply be promoting the next big thing.
The speed at which technology advances is a combination of four factors: the intellectual power we use, the tools we can use, luck, and the willingness to finance and support it.
We have talked to some of the minds exploring and expanding this space, and discussed some of the technologies that will power what comes next, from the chip level to data centers and the cloud. But we have not touched on the other two variables.
Luck, by its nature, cannot be captured until it has passed. Business models, on the other hand, are often among the easiest subjects to interrogate. Not so in this case, as technology and hype trump attempts to build sustainable businesses.
Strategy
Again, we’ve seen this before with the dot-com bubble and every other technology boom. Much of this is embedded in the Silicon Valley mindset, betting huge sums on each new technology without a clear monetization strategy, hoping that the scale of the transformation will eventually lead to unfathomable wealth.
Higher interest rates, a series of high-profile bankruptcies, and the collapse of Silicon Valley Bank have put this mindset under pressure.
Right now, generative AI companies are raising huge sums based on crazy promises of future wealth. The pace of evolution will depend on how many can escape the gravity well of scale and operational costs, to build realistic and sustainable businesses before the purse strings inevitably tighten.
And these eventual winners will be the ones to define the final form of AI.
Costs
We don’t yet know how much it will cost to train larger models, nor whether we have enough data to support them. We don’t know how much they will cost to run and how many business models will be able to generate enough revenue to cover that cost.
We don’t know whether major language model hallucinations can be eliminated, or whether the uncanny valley of knowledge, where AIs produce convincing versions of realities that don’t exist, will continue to be a limiting factor.
We don’t know in which direction the models will grow. All we know is that the process of growth and exploration will be fueled by more and more data and more computation. And that will require a new wave of data centers, ready to rise to the challenge.