What does it mean for a model to be large? The size of a model—a trained neural network—is measured by the number of parameters it has. These are the values in the network that get tweaked over and over again during training and are then used to make the model’s predictions. Roughly speaking, the more parameters a model has, the more information it can soak up from its training data, and the more accurate its predictions about fresh data will be.
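To make the idea of a parameter count concrete, here is a minimal sketch that tallies the parameters in a small fully connected network. The layer sizes are illustrative assumptions, not taken from any model named in this article, and real language models use more elaborate architectures than this.

```python
def count_parameters(layer_sizes):
    """Count the weights and biases in a plain fully connected network.

    layer_sizes is a hypothetical list such as [inputs, hidden, outputs].
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight per input-to-output connection
        total += n_out         # one bias per output unit
    return total


# A toy network: 784 inputs, 128 hidden units, 10 outputs (assumed sizes).
print(count_parameters([784, 128, 10]))  # 101770
```

Even this toy network has about a hundred thousand adjustable values; the models discussed here have hundreds of billions.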
GPT-3 has 175 billion parameters, more than 100 times as many as its predecessor, GPT-2, which has 1.5 billion. But GPT-3 is dwarfed by the class of 2021. Jurassic-1, a commercially available large language model launched by US startup AI21 Labs in September, edged out GPT-3 with 178 billion parameters. Gopher, a new model released by DeepMind in December, has 280 billion parameters. Megatron-Turing NLG has 530 billion. Google’s Switch-Transformer and GLaM models have 1 trillion and 1.2 trillion parameters, respectively.
The trend is not just in the US. This year the Chinese tech giant Huawei built a 200-billion-parameter language model called PanGu. Inspur, another Chinese firm, built Yuan 1.0, a 245-billion-parameter model. Baidu and Peng Cheng Laboratory, a research institute in Shenzhen, announced PCL-BAIDU Wenxin, a model with 280 billion parameters that Baidu is already using in a variety of applications, including internet search, news feeds, and smart speakers. And the Beijing Academy of AI announced Wu Dao 2.0, which has 1.75 trillion parameters.
Meanwhile, South Korean internet search firm Naver announced a model called HyperCLOVA, with 204 billion parameters.
Every one of these is a notable feat of engineering. For a start, training a model with more than 100 billion parameters is a complex plumbing problem: hundreds of individual GPUs (the hardware of choice for training deep neural networks) must be connected and synchronized, and the training data must be split into chunks and distributed between them in the right order at the right time.
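The plumbing described above can be sketched in miniature. The snippet below is a simplified illustration of data parallelism, not the method used by any of the labs mentioned: the function names, the batch sizes, and the choice to drop a trailing partial batch are all assumptions made for the example. Each training step hands every worker its own chunk of the data, and the workers' gradients are then averaged so they stay in sync.

```python
def shard_batches(num_examples, num_workers, per_worker_batch):
    """Yield, for each training step, one list of example indices per worker.

    Examples are handed out in order; a trailing partial batch is dropped
    (an assumption made to keep the sketch short).
    """
    global_batch = num_workers * per_worker_batch
    for start in range(0, num_examples - global_batch + 1, global_batch):
        step = []
        for worker in range(num_workers):
            lo = start + worker * per_worker_batch
            step.append(list(range(lo, lo + per_worker_batch)))
        yield step


def average_gradients(per_worker_grads):
    """Synchronize workers by averaging their gradients element by element."""
    num_workers = len(per_worker_grads)
    return [sum(values) / num_workers for values in zip(*per_worker_grads)]


steps = list(shard_batches(num_examples=16, num_workers=4, per_worker_batch=2))
print(steps[0])  # [[0, 1], [2, 3], [4, 5], [6, 7]]
print(steps[1])  # [[8, 9], [10, 11], [12, 13], [14, 15]]
print(average_gradients([[1.0, 2.0], [3.0, 4.0]]))  # [2.0, 3.0]
```

In a real system the chunks are large tensors, the workers are GPUs on separate machines, and the averaging happens over a network, which is where most of the engineering difficulty lies.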
Large language models have become prestige projects that showcase a company’s technical prowess. Yet few of these new models move the research forward beyond repeating the demonstration that scaling up gets good results.
There are a handful of innovations. Once trained, Google’s Switch-Transformer and GLaM use only a fraction of their parameters to make each prediction, so they save computing power. PCL-BAIDU Wenxin combines a GPT-3-style model with a knowledge graph, a technique used in old-school symbolic AI to store facts. And alongside Gopher, DeepMind released RETRO, a language model with only 7 billion parameters that competes with others 25 times its size by cross-referencing a database of documents when it generates text. This makes RETRO less costly to train than its giant rivals.
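How can a model use only a fraction of its parameters? One common approach is to split a layer into several "experts" and let a small router pick one of them for each input. The toy below illustrates that general idea under stated assumptions; it is not Google's implementation, and the class name, sizes, and random weights are all hypothetical.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (a list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseLayer:
    """Toy layer with several experts, only one of which runs per input."""

    def __init__(self, dim, num_experts, seed=0):
        rng = random.Random(seed)
        self.dim = dim
        # Router: one row of scoring weights per expert.
        self.router = [[rng.uniform(-1, 1) for _ in range(dim)]
                       for _ in range(num_experts)]
        # Experts: each is its own dim-by-dim weight matrix.
        self.experts = [[[rng.uniform(-1, 1) for _ in range(dim)]
                         for _ in range(dim)]
                        for _ in range(num_experts)]

    def forward(self, x):
        """Score every expert, then run only the highest-scoring one."""
        scores = matvec(self.router, x)
        chosen = max(range(len(scores)), key=scores.__getitem__)
        return chosen, matvec(self.experts[chosen], x)

    def total_parameters(self):
        return len(self.experts) * self.dim ** 2 + len(self.router) * self.dim

    def active_parameters(self):
        # The router plus the single expert that was selected.
        return self.dim ** 2 + len(self.router) * self.dim


layer = SparseLayer(dim=4, num_experts=8)
chosen, output = layer.forward([1.0, 0.5, -0.5, 2.0])
print(layer.total_parameters())   # 160
print(layer.active_parameters())  # 48
```

Here the layer stores 160 parameters but touches only 48 of them for any given input, which is the sense in which such models save computing power at prediction time.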