Training Data
The data used to train an AI model. Its quality, quantity, and composition directly determine the model's capabilities and biases.
Training data is what shapes an AI model's knowledge, abilities, and biases. Large language models are trained on massive text corpora — typically including web crawls, books, academic papers, code repositories, and curated datasets. The model can only know what its training data contains.
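The mix of sources described above is typically controlled by explicit sampling weights. A minimal sketch of weighted source sampling, assuming an illustrative mixture (the source names and weights here are invented for this example, not any real model's data recipe):

```python
import random

# Hypothetical corpus mixture: source -> sampling weight.
# These numbers are purely illustrative.
mixture = {
    "web_crawl": 0.60,
    "books": 0.15,
    "papers": 0.10,
    "code": 0.10,
    "curated": 0.05,
}

def sample_source(rng: random.Random) -> str:
    """Pick a data source in proportion to its mixture weight."""
    return rng.choices(list(mixture), weights=list(mixture.values()), k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
# With these weights, web_crawl should account for roughly 60% of draws.
```

Real pipelines layer much more on top of this (deduplication, epoch caps per source, curriculum schedules), but the core idea of upweighting or downweighting sources is the same.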
Data quality matters enormously. 'Garbage in, garbage out' applies literally to AI: training on low-quality, biased, or toxic data produces a low-quality, biased, or toxic model. This is why data curation and filtering are a major part of model development.
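To make the filtering idea concrete, here is a toy heuristic filter in the spirit of common corpus-cleaning pipelines. The function name and every threshold are invented for this sketch; production filters use many more signals (language ID, perplexity, toxicity classifiers, deduplication):

```python
def passes_quality_filter(text: str) -> bool:
    """Toy document-quality heuristic (illustrative thresholds only)."""
    words = text.split()
    if len(words) < 5:                      # too short to be useful
        return False
    mean_len = sum(len(w) for w in words) / len(words)
    if not 3 <= mean_len <= 10:             # gibberish or run-on tokens
        return False
    alpha = sum(c.isalpha() for c in text) / len(text)
    if alpha < 0.7:                         # mostly symbols or markup
        return False
    return True

docs = [
    "Large language models learn patterns from the text they are trained on.",
    "$$$ CLICK HERE $$$ !!! >>> http://spam <<<",
    "ok",
]
clean = [d for d in docs if passes_quality_filter(d)]
# Only the first document survives: the spam fails the alphabetic-ratio
# check and "ok" fails the minimum-length check.
```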
Training data is also a legal battleground. Multiple lawsuits challenge whether training AI on copyrighted content (books, articles, code, images) constitutes fair use. The outcomes will significantly impact how future models are trained and what data they can access.
Real-World Example
When an AI model doesn't know about something recent, that's its training data cutoff: the data used to train it simply didn't include information past a certain date.
FAQ
What is Training Data?
The data used to train an AI model. Its quality, quantity, and composition directly determine the model's capabilities and biases.
How is Training Data used in practice?
When an AI model doesn't know about something recent, that's its training data cutoff: the data used to train it simply didn't include information past a certain date.
What concepts are related to Training Data?
Key related concepts include Training, Bias (AI Bias), and Pre-training. Understanding these together gives a more complete picture of how Training Data fits into the AI landscape.