This week, Microsoft and Nvidia announced that they trained what they claim is one of the largest and most capable AI language models to date: Megatron-Turing Natural Language Generation (MT-NLP). MT-NLP contains 530 billion parameters — the parts of the model learned from historical data — and achieves leading accuracy in a broad set of tasks, including reading comprehension and natural language inferences.
But building it didn’t come cheap. Training took place across 560 Nvidia DGX A100 servers, each containing 8 Nvidia A100 80GB GPUs. Experts peg the cost in the millions of dollars.
Like other large AI systems, MT-NLP raises questions about the accessibility of cutting-edge research approaches in machine learning. AI training costs dropped 100-fold between 2017 and 2019, but the totals still exceed the compute budgets of most startups, governments, nonprofits, and colleges. The inequity favors corporations and world superpowers with extraordinary access to resources at the expense of smaller players, cementing incumbent advantages.
For example, in early October, researchers at Alibaba detailed M6-10T, a language model containing 10 trillion parameters (roughly 57 times the size of OpenAI’s GPT-3) trained across 512 Nvidia V100 GPUs for 10 days. The cheapest V100 plan available through Google Cloud Platform costs $2.28 per hour, which would equate to over $300,000 ($2.28 per hour multiplied by 24 hours over 10 days) — further than most research teams can stretch.
Google subsidiary DeepMind is estimated to have spent $35 million training a system to learn the Chinese board game Go. And when the company’s researchers designed a model to play StarCraft II, they purposefully didn’t try multiple ways of architecting a key component because the training cost would have been too high. Similarly, OpenAI didn’t fix a mistake when it implemented GPT-3 because the cost of training made retraining the model infeasible.
It’s important to keep in mind that training costs can be inflated by factors other than an algorithm’s technical aspects. As Yoav Shoham, Stanford University professor emeritus and cofounder of AI startup AI21 Labs, recently told Synced, personal and organizational considerations often contribute to a model’s final price tag.
“[A] researcher might be impatient to wait three weeks to do a thorough analysis and their organization may not be able or wish to pay for it,” he said. “So for the same task, one could spend $100,000 or $1 million.”
Still, the increasing cost of training — and storing — algorithms like Huawei’s PanGu-Alpha, Naver’s HyperCLOVA, and the Beijing Academy of Artificial Intelligence’s Wu Dao 2.0 is giving rise to a cottage industry of startups aiming to “optimize” models without degrading accuracy. This week, former Intel exec Naveen Rao launched a new company, Mosaic ML, to offer tools, services, and training methods that improve AI system accuracy while lowering costs and saving time. Mosaic ML — which has raised $37 million in venture capital — competes with Codeplay Software, OctoML, Neural Magic, Deci, CoCoPie, and NeuReality in a market that’s expected to grow exponentially in the coming years.
In a sliver of good news, the cost of basic machine learning operations has been falling over the past few years. A 2020 OpenAI survey found that since 2012, the amount of compute needed to train a model to the same performance on classifying images in a popular benchmark — ImageNet — has been decreasing by a factor of two every 16 months.
Approaches like network pruning prior to training could lead to further gains. Research has shown that parameters pruned after training, a process that decreases the model size, could have been pruned before training without any effect on the network’s ability to learn. Called the “lottery ticket hypothesis,” the idea is that the initial values parameters in a model receive are crucial for determining whether they’re important. Parameters kept after pruning receive “lucky” initial values; the network can train successfully with only those parameters present.
Network pruning is far from a solved science, however. New ways of pruning that work before or in early training will have to be developed, as most current methods apply only retroactively. And when parameters are pruned, the resulting structures aren’t always a fit for the training hardware (e.g., GPUs), meaning that pruning 90% of parameters won’t necessarily reduce the cost of training a model by 90%.
Whether through pruning, novel AI accelerator hardware, or techniques like meta-learning and neural architecture search, the need for alternatives to unattainably large models is quickly becoming clear. A University of Massachusetts Amherst study showed that using 2019-era approaches, training an image recognition model with a 5% error rate would cost $100 billion and produce as much carbon emissions as New York City does in a month. As IEEE Spectrum’s editorial team wrote in a recent piece, “we must either adapt how we do deep learning or face a future of much slower progress.”
For AI coverage, send news tips to Kyle Wiggers — and be sure to subscribe to the AI Weekly newsletter and bookmark our AI channel, The Machine.
Thanks for reading,
AI Staff Writer
VentureBeat’s mission is to be a digital town square for technical decision-makers to gain knowledge about transformative technology and transact.
Our site delivers essential information on data technologies and strategies to guide you as you lead your organizations. We invite you to become a member of our community, to access:
- up-to-date information on the subjects of interest to you
- our newsletters
- gated thought-leader content and discounted access to our prized events, such as Transform 2021: Learn More
- networking features, and more