The GPT-2 series (2019) featured models from 137 million to 1.61 billion parameters, trained on about 10 billion tokens of WebText.
GPT-3 (2020) introduced 175 billion parameters, trained on roughly 300 billion tokens drawn from a corpus of about 500 billion tokens mixing filtered CommonCrawl, WebText2, books, and Wikipedia.
No official architectural or training data details are available for GPT-3.5 and GPT-4.
Meta’s original LLaMA lineup scaled from 7 billion to 65 billion parameters; LLaMA 3.1 pushed to 405 billion parameters trained on roughly 15 trillion tokens, and LLaMA 4 plans a roughly 2 trillion parameter MoE model.
Mixture-of-Experts (MoE) models such as Mistral’s Mixtral and DeepSeek’s releases reach very large total parameter counts (hundreds of billions) while activating only a small subset of experts per token; a minimal routing sketch appears below.
The recent MoE wave includes models with up to 671 billion total parameters (DeepSeek-V3) trained on more than 14 trillion tokens.
Direct comparisons between dense and MoE architectures remain murky, and current benchmarks may not capture their trade-offs.
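To make the sparse-activation idea concrete, here is a minimal sketch of top-k expert routing in plain PyTorch. The class name, layer sizes, expert count, and k are illustrative assumptions, not the configuration of any model named above; production MoE layers add load-balancing losses, capacity limits, and expert parallelism.

```python
# Toy sketch of a top-k sparse Mixture-of-Experts feed-forward layer.
# All dimensions (d_model, d_ff, n_experts, k) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Route each token to k of n experts; only those experts run."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, n_experts)
        top_w, top_idx = torch.topk(scores, self.k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)                # weights over chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):                      # only selected experts execute
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[:, slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = SparseMoE()
    tokens = torch.randn(16, 64)                        # 16 tokens, d_model=64
    print(layer(tokens).shape)                          # torch.Size([16, 64])
```

The routing step is the whole point: only k experts run per token, so total parameter count (e.g., DeepSeek-V3’s 671 billion) can grow much faster than per-token compute, which is governed by the far smaller set of activated parameters.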
The prevailing trend is towards fine-tuned assistant/chatbot models rather than pure text continuation engines.