What does it mean for a model to be large? The size of a model (a trained neural network) is measured by the number of its parameters. These are the values in the network that are adjusted repeatedly during training and are then used to make the model's predictions. Generally speaking, the more parameters a model has, the more information it can absorb from its training data, and the more accurate its predictions about new data will be.
GPT-3 has 175 billion parameters, 10 times more than its predecessor, GPT-2. But GPT-3 has been dwarfed by the class of 2021. Jurassic-1, a commercially available large language model launched in September by US startup AI21 Labs, edged out GPT-3 with 178 billion parameters. Gopher, a new model released by DeepMind in December, has 280 billion parameters. Megatron-Turing NLG has 530 billion. Google's Switch-Transformer and GLaM models have one trillion and 1.2 trillion parameters, respectively.
This trend is not unique to the US. This year, Chinese tech giant Huawei built a 200-billion-parameter language model called PanGu. Inspur, another Chinese firm, built Yuan 1.0, a 245-billion-parameter model. Baidu and Peng Cheng Laboratory, a research institute in Shenzhen, announced PCL-BAIDU Wenxin, a model with 280 billion parameters that Baidu is already using in various applications, including internet search, news feeds, and smart speakers. And the Beijing Academy of AI announced Wu Dao 2.0, which has 1.75 trillion parameters.
Meanwhile, South Korean internet search firm Naver announced a model called HyperCLOVA, with 204 billion parameters.
Each of these is a remarkable feat of engineering. For starters, training a model with more than 100 billion parameters is a complex plumbing problem: hundreds of individual GPUs (the hardware of choice for training deep neural networks) must be connected and synchronized, and the training data must be split into chunks and distributed among them in the right order.
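To make the plumbing concrete, here is a minimal, purely illustrative sketch (hypothetical names, plain Python, no GPU code) of the first step: splitting a training set into per-worker shards, as data-parallel training does before distributing work across GPUs.

```python
def shard_dataset(examples, num_workers):
    """Assign each training example to one of num_workers shards, round-robin.

    In real data-parallel training each shard would feed a different GPU,
    and the gradients computed on each shard would be synchronized and
    averaged after every step.
    """
    shards = [[] for _ in range(num_workers)]
    for i, example in enumerate(examples):
        shards[i % num_workers].append(example)
    return shards

# Toy usage: ten "examples" split across four hypothetical workers.
examples = list(range(10))
shards = shard_dataset(examples, 4)
print(shards)  # [[0, 4, 8], [1, 5, 9], [2, 6], [3, 7]]
```

Real systems layer much more on top of this (pipeline and tensor parallelism, gradient synchronization, checkpointing), but the sharding idea is the same.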
Large language models have become prestige projects that showcase a company's technical expertise. Yet only some of these newer models do more than repeat the demonstration that scaling up leads to better results.
There are a handful of innovations. Once trained, Google's Switch-Transformer and GLaM use only a fraction of their parameters to make each prediction, which saves computing power. PCL-BAIDU Wenxin combines a GPT-3-style model with a knowledge graph, a technique used in old-school symbolic AI to store facts. And alongside Gopher, DeepMind released RETRO, a language model with just 7 billion parameters that competes with models 25 times its size by cross-referencing a database of documents when it generates text. This makes RETRO cheaper to train than its giant rivals.
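The idea behind using "a fraction of the parameters" per prediction can be sketched in a few lines. This is not Google's code, just a toy illustration of switch-style routing: a router picks a single expert sub-network for each input, so only that expert's parameters are touched, however many experts (and parameters) the model holds in total.

```python
def switch_layer(x, experts, router):
    """Route input x to the single highest-scoring expert and apply it.

    Only one expert's parameters are used per input, so compute cost stays
    roughly constant even as the number of experts (total parameters) grows.
    """
    scores = [router(x, i) for i in range(len(experts))]
    best = scores.index(max(scores))
    return experts[best](x)

# Toy experts: each just scales its input by a different constant.
experts = [lambda x, k=k: x * k for k in (1, 2, 3)]
# Toy router: prefer the expert whose index is closest to the input value.
router = lambda x, i: -abs(x - i)

print(switch_layer(2, experts, router))  # expert 2 is chosen: 2 * 3 = 6
```

In a real Switch-Transformer the experts are feed-forward blocks inside each transformer layer and the router is itself learned, but the sparse-activation principle is the same.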