Scaling Laws in Neural Networks
Scaling laws in neural networks refer to empirical and theoretical relationships that describe how the performance of artificial neural networks improves predictably as key resources (model parameters, training compute, and dataset size) are increased. First formalized in the late 2010s, these laws have guided the development of large-scale models, enabling forecasts of capabilities and resource allocation for frontier AI systems. Rooted in power-law behaviors, scaling laws predict that performance metrics, such as cross-entropy loss, decrease smoothly with scale, often following forms like $L \propto N^{-\alpha}$, where $L$ is loss, $N$ is the number of parameters, and $\alpha$ is a domain-specific exponent.
Scaling laws have revolutionized AI research, underpinning the shift from small models to giants like GPT-4 and Gemini, but they also reveal limitations, such as diminishing returns and phase transitions leading to emergent capabilities. By 2026, as compute costs rise and data scarcity emerges, scaling laws inform debates on sustainable AI progress, with implications for infrastructure, ethics, and policy.
Empirical Observations of Power-Law Relationships
Scaling laws emerged from observations that neural network performance follows power-law decays as scale increases. Early studies showed that test loss scales as $L(N) \approx a N^{-\alpha}$, with $\alpha$ ranging from roughly 0.07 to 0.35 depending on the domain. These relationships hold across orders of magnitude, suggesting predictable improvements from scaling.
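In practice, the exponent can be estimated from a handful of (model size, loss) measurements by linear regression in log-log space. The following sketch illustrates the idea on made-up data; the parameter counts and losses are hypothetical, not taken from any cited study.

```python
import numpy as np

# Hypothetical (parameter count, test loss) measurements from a model family.
# These numbers are illustrative only; real values come from training runs.
params = np.array([1e7, 1e8, 1e9, 1e10])
loss = np.array([4.2, 3.5, 2.9, 2.4])

# Fit L(N) ≈ a * N^(-alpha) by least squares on log L = log a - alpha * log N.
slope, intercept = np.polyfit(np.log(params), np.log(loss), 1)
alpha, a = -slope, np.exp(intercept)

print(f"alpha ≈ {alpha:.3f}, a ≈ {a:.2f}")
# Extrapolate the fit to a larger (hypothetical) model.
print(f"predicted loss at N=1e11: {a * (1e11) ** -alpha:.2f}")
```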
Kaplan Scaling Laws
In the 2020 OpenAI paper "Scaling Laws for Neural Language Models," Jared Kaplan et al. formalized power laws for language models: loss decreases as $N^{-\alpha_N}$ in parameters, $C^{-\alpha_C}$ in compute, and $D^{-\alpha_D}$ in data, with $\alpha_N \approx 0.095$, $\alpha_C \approx 0.08$, and $\alpha_D \approx 0.103$. For a fixed compute budget, the optimal allocation prioritizes model size over data.
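To give a feel for what exponents of this size imply, the sketch below computes the relative loss reduction from a tenfold increase in each resource, using the exponents quoted above and assuming the pure power-law form holds with no irreducible-loss term.

```python
# Relative loss after scaling a single resource by a factor k,
# assuming L ∝ X^(-alpha) for that resource alone.
def loss_ratio(k: float, alpha: float) -> float:
    return k ** (-alpha)

for name, alpha in [("parameters N", 0.095), ("compute C", 0.08), ("data D", 0.103)]:
    print(f"10x more {name}: loss falls to {loss_ratio(10, alpha):.1%} of its previous value")
# With alpha around 0.1, each 10x of a resource buys roughly a 20% relative loss reduction.
```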
Chinchilla Optimal Scaling
DeepMind's 2022 "Chinchilla" paper revised Kaplan's prescription by showing that compute-optimal training requires scaling parameters and data in roughly equal proportion: $D \propto N$. Chinchilla (70B parameters, 1.4T tokens) outperformed larger models such as Gopher (280B parameters) trained with the same compute, suggesting prior models were undertrained on data.
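The "same compute" comparison can be sanity-checked with the standard approximation $C \approx 6ND$ for training FLOPs, discussed in the next section. The sketch below is an order-of-magnitude check; Gopher's token count of roughly 300B is as reported for that model.

```python
def train_flops(n_params: float, n_tokens: float) -> float:
    """Approximate training compute via C ≈ 6 * N * D FLOPs."""
    return 6 * n_params * n_tokens

chinchilla = train_flops(70e9, 1.4e12)   # 70B params, 1.4T tokens
gopher = train_flops(280e9, 300e9)       # 280B params, ~300B tokens
print(f"Chinchilla: {chinchilla:.2e} FLOPs")  # ~5.9e23
print(f"Gopher:     {gopher:.2e} FLOPs")      # ~5.0e23
# Both land near 5-6e23 FLOPs: comparable budgets, allocated very differently.
```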
Parameters vs Compute vs Data Trade-Offs
Trade-offs center on how to allocate a training compute budget, approximated as $C \approx 6ND$ FLOPs. Kaplan favored growing $N$ faster than $D$, while Chinchilla balances $N$ and $D$; more recent work emphasizes data quality over raw quantity.
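Under the Chinchilla-style balanced regime, a common rule of thumb is roughly 20 training tokens per parameter. The sketch below turns a FLOP budget into an (N, D) split under that assumption; the 20:1 ratio is a heuristic, not an exact law.

```python
import math

def chinchilla_split(flops_budget: float, tokens_per_param: float = 20.0):
    """Split a compute budget into (params, tokens) assuming C ≈ 6*N*D and D ≈ 20*N."""
    # C = 6 * N * (tokens_per_param * N)  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(flops_budget / (6 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = chinchilla_split(1e24)  # a hypothetical 1e24-FLOP budget
print(f"params ≈ {n:.2e}, tokens ≈ {d:.2e}")  # ≈ 9.1e10 params, 1.8e12 tokens
```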
Theoretical Explanations
Theories attribute scaling to variance-limited and resolution-limited regimes, linking to information theory and statistical mechanics.
Neural Scaling Laws Across Architectures
Scaling laws hold across architectures: transformers ($\alpha \approx 0.1$), CNNs in vision ($\alpha \approx 0.2$), and RNNs, which show similar scaling but are less efficient.
Domain-Specific Scaling
Language: Loss scales approximately as $N^{-0.095}$.
Vision: Similar power laws govern accuracy in image classification.
Multimodal: Mixed-modal scaling laws predict performance across modalities.
Downstream Task Performance Scaling
Improvements from pretraining scale transfer to downstream tasks, with related laws describing fine-tuning efficiency.
Emergent Capabilities and Phase Transitions
Emergent abilities appear abruptly beyond certain scales, resembling phase transitions and challenging smooth scaling predictions; for example, multi-step arithmetic accuracy can remain near chance until a threshold model size and then jump sharply.
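One way abrupt jumps can coexist with smooth loss curves is that many benchmark metrics are thresholded. The toy simulation below, using entirely synthetic numbers, shows per-token loss falling smoothly as a power law while exact-match accuracy on a multi-token task stays near zero and then rises steeply.

```python
import numpy as np

# Synthetic illustration: smooth power-law loss vs. a thresholded, multi-token metric.
params = np.logspace(6, 12, 6)          # hypothetical model sizes
loss = 30.0 * params ** -0.2            # smooth power-law per-token loss (made up)
p_token = np.exp(-loss)                 # per-token probability of a correct token
exact_match = p_token ** 10             # toy task: 10 consecutive correct tokens

for n, l, em in zip(params, loss, exact_match):
    print(f"N={n:.0e}  loss={l:.2f}  exact-match={em:.3f}")
# Loss declines gradually at every step, but exact-match stays near zero until
# p_token is high, then climbs steeply: a smooth law can still look like an
# abrupt capability jump under a thresholded metric.
```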
Compute-Optimal Training
Under a fixed compute budget $C$, compute-optimal training balances model size $N$ against dataset size $D$ rather than maximizing either alone.
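The Chinchilla analysis fits a parametric loss of the form $L(N, D) = E + A N^{-\alpha} + B D^{-\beta}$ and minimizes it subject to $C \approx 6ND$. The sketch below does this numerically with constants close to the published fit; treat the exact numbers as illustrative.

```python
import numpy as np

# Parametric loss of the Chinchilla form; constants are approximately the published
# fit (treat them as illustrative rather than authoritative).
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params, n_tokens):
    return E + A * n_params ** -alpha + B * n_tokens ** -beta

def optimal_split(flops_budget: float):
    """Grid-search N; D is pinned by the budget through C ≈ 6*N*D."""
    candidates = np.logspace(8, 13, 2000)        # candidate parameter counts
    tokens = flops_budget / (6 * candidates)     # tokens implied by the budget
    losses = loss(candidates, tokens)
    best = np.argmin(losses)
    return candidates[best], tokens[best], losses[best]

n, d, l = optimal_split(1e24)  # a hypothetical 1e24-FLOP budget
print(f"N ≈ {n:.2e} params, D ≈ {d:.2e} tokens, predicted loss ≈ {l:.3f}")
```

Note that the 20-tokens-per-parameter heuristic above and a fitted parametric loss can disagree noticeably on the exact split for the same budget.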
Scaling Law Failures and Limitations
Failure modes include performance saturation, data scarcity, and domain shift.
Small-Scale Predictability of Large-Scale Performance
Small-scale runs can predict large-scale performance via extrapolation, but emergent behaviors complicate such forecasts.
Implications for Model Development and Resource Allocation
Scaling laws guide efficient allocation of training resources, but the associated compute growth raises energy and cost concerns.
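As a back-of-the-envelope illustration, a FLOP budget can be translated into GPU-hours and an electricity estimate; every hardware constant below is an assumption, not a figure from the cited sources.

```python
# Back-of-the-envelope training cost estimate; every constant below is an assumption.
flops_budget = 1e24          # training compute, FLOPs
gpu_peak_flops = 1e15        # assumed peak throughput of one accelerator, FLOP/s
utilization = 0.4            # assumed fraction of peak actually achieved
gpu_power_kw = 0.7           # assumed average draw per accelerator, kW

gpu_seconds = flops_budget / (gpu_peak_flops * utilization)
gpu_hours = gpu_seconds / 3600
energy_mwh = gpu_hours * gpu_power_kw / 1000

print(f"≈ {gpu_hours:.2e} GPU-hours, ≈ {energy_mwh:.0f} MWh (excluding cooling and overhead)")
```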
Data Scaling Laws
Performance scales as $D^{-\alpha_D}$ with dataset size, but data quality matters as well as quantity.
Scaling of Different Capabilities
Different capabilities scale at different rates; for example, reasoning tends to lag behind fluency and coherence.
Transfer Learning Scaling
Larger pretraining scale improves transfer to new tasks.
Fine-Tuning Scaling
Fine-tuning benefits from scaled pretraining.
Inference Cost Scaling
Inference cost scales with model size $N$, prompting a growing focus on efficiency.
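A common approximation is that a forward pass costs about $2N$ FLOPs per generated token, versus roughly $6N$ per training token. The sketch below uses that rule to estimate serving compute; the traffic figure is purely hypothetical.

```python
def inference_flops(n_params: float, n_tokens: float) -> float:
    """Approximate serving compute via ~2*N FLOPs per token (forward pass only)."""
    return 2 * n_params * n_tokens

# Hypothetical deployment: a 70B-parameter model serving 1e12 tokens per day.
per_day = inference_flops(70e9, 1e12)
print(f"≈ {per_day:.2e} FLOPs/day")   # ≈ 1.4e23 FLOPs per day of serving
# At this traffic level, a few days of serving rivals a Chinchilla-scale training budget.
```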
Criticism and Alternative Perspectives
Critics argue that scaling laws overemphasize raw scale while underplaying architectural and algorithmic innovation.
Frontier Model Trajectories
Frontier models follow laws but face diminishing returns.
Forecasting Future Capabilities
Scaling laws enable forecasts of future capabilities, but substantial uncertainties persist.