The GPT-2 series (2019) featured models from 137 million to 1.61 billion parameters, trained on about 10 billion tokens of WebText.
GPT-3 (2020) introduced 175 billion parameters, trained on roughly 300 billion tokens drawn from a corpus of about 500 billion tokens mixing filtered CommonCrawl, WebText2, books, and Wikipedia.
No official architectural or training data details are available for GPT-3.5 and GPT-4.
Meta’s original LLaMA lineup scaled from 7 billion to 65 billion parameters; LLaMA 3.1 pushed to 405 billion parameters trained on roughly 15 trillion tokens, and LLaMA 4 plans a roughly 2 trillion parameter MoE model.
Mixture-of-Experts (MoE) models such as Mistral’s Mixtral and DeepSeek’s releases reach very large total parameter counts (hundreds of billions) while activating only a small subset of experts per token; a minimal routing sketch appears below.
The recent MoE wave includes models with up to 671 billion total parameters (DeepSeek-V3) trained on more than 14 trillion tokens.
Direct comparisons between dense and MoE architectures remain murky, and current benchmarks may not capture their trade-offs.
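To make the sparse-activation idea concrete, here is a minimal sketch of top-k expert routing in plain PyTorch. The class name, layer sizes, expert count, and k are illustrative assumptions, not the configuration of any model named above; production MoE layers add load-balancing losses, capacity limits, and expert parallelism.

```python
# Toy sketch of a top-k sparse Mixture-of-Experts feed-forward layer.
# All dimensions (d_model, d_ff, n_experts, k) are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SparseMoE(nn.Module):
    """Route each token to k of n experts; only those experts run."""

    def __init__(self, d_model=64, d_ff=256, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)  # gating network
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.router(x)                         # (tokens, n_experts)
        top_w, top_idx = torch.topk(scores, self.k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)                # weights over chosen experts only
        out = torch.zeros_like(x)
        for slot in range(self.k):                      # only selected experts execute
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e            # tokens routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[:, slot][mask].unsqueeze(-1) * expert(x[mask])
        return out


if __name__ == "__main__":
    layer = SparseMoE()
    tokens = torch.randn(16, 64)                        # 16 tokens, d_model=64
    print(layer(tokens).shape)                          # torch.Size([16, 64])
```

The routing step is the whole point: only k experts run per token, so total parameter count (e.g., DeepSeek-V3’s 671 billion) can grow much faster than per-token compute, which is governed by the far smaller set of activated parameters.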
The prevailing trend is towards fine-tuned assistant/chatbot models rather than pure text continuation engines.