Synthetic Data

Synthetic data is artificially generated data used in place of, or alongside, data collected from real-world events or human authors. In language-model development, it is commonly produced by another model, a simulator, a programmatic generator, or a combination of these methods.

Synthetic datasets can contain instructions and responses, reasoning traces, preference pairs, tool-use trajectories, code, adversarial examples, or domain-specific documents. They are used for pretraining, supervised fine-tuning (SFT), model distillation, evaluation, and safety testing.

A typical generation pipeline:

  1. defines a target distribution of tasks or skills;
  2. generates candidate examples from a teacher model or simulator;
  3. validates them with rules, execution, models, or humans;
  4. removes duplicates, contamination, and low-quality samples; and
  5. measures whether training on the dataset improves held-out performance.
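
The first four steps can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the "teacher" is a random arithmetic generator that is deliberately wrong some of the time, and every function name (generate_candidates, is_valid, deduplicate, remove_contaminated, build_dataset) is hypothetical. Step 5, measuring held-out performance, requires an actual training run and is left out.

```python
import random

def generate_candidates(n, seed=0):
    """Stand-in for a teacher model: emits addition problems, some with wrong answers."""
    rng = random.Random(seed)
    candidates = []
    for _ in range(n):
        a, b = rng.randint(0, 9), rng.randint(0, 9)
        answer = a + b
        if rng.random() < 0.2:
            answer += 1  # simulate an occasional teacher error
        candidates.append(
            {"prompt": f"What is {a} + {b}?", "response": str(answer), "a": a, "b": b}
        )
    return candidates

def is_valid(example):
    """Rule-based validation: recompute the answer and compare."""
    return int(example["response"]) == example["a"] + example["b"]

def deduplicate(examples):
    """Keep the first example for each normalized prompt."""
    seen, kept = set(), []
    for ex in examples:
        key = ex["prompt"].lower().strip()
        if key not in seen:
            seen.add(key)
            kept.append(ex)
    return kept

def remove_contaminated(examples, eval_prompts):
    """Drop examples whose prompt also appears in the evaluation set."""
    blocked = {p.lower().strip() for p in eval_prompts}
    return [ex for ex in examples if ex["prompt"].lower().strip() not in blocked]

def build_dataset(n, eval_prompts, seed=0):
    candidates = generate_candidates(n, seed)           # step 2
    valid = [ex for ex in candidates if is_valid(ex)]   # step 3
    unique = deduplicate(valid)                         # step 4a
    return remove_contaminated(unique, eval_prompts)    # step 4b

dataset = build_dataset(200, eval_prompts=["What is 2 + 2?"])
```

Exact-match deduplication and contamination checks are the simplest possible choice here; real pipelines usually add fuzzy or n-gram-based matching.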

Synthetic data can cover rare cases, reduce annotation cost, and create examples whose correctness is automatically verifiable. It also enables controlled variations that would be expensive or unsafe to collect in production.
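
As an illustration of automatically verifiable correctness, generated code can be checked by running it against known test cases. The sketch below assumes candidate solutions arrive as source strings; passes_tests and the sample candidates are hypothetical, and a real system would execute untrusted code only inside a sandbox with time and resource limits.

```python
def passes_tests(source, func_name, cases):
    """Return True if the function defined in `source` matches every expected output."""
    namespace = {}
    try:
        exec(source, namespace)  # NOTE: sandbox this when the code is untrusted
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, missing functions, and runtime errors all count as failures.
        return False

candidates = [
    "def absolute(x):\n    return x if x >= 0 else -x\n",
    "def absolute(x):\n    return x\n",        # wrong for negative inputs
    "def absolute(x)\n    return abs(x)\n",    # syntax error
]
cases = [((3,), 3), ((-4,), 4), ((0,), 0)]

# Only candidates that pass every test case are kept for the dataset.
verified = [src for src in candidates if passes_tests(src, "absolute", cases)]
```

Passing the tests shows only that a candidate agrees with the chosen cases, so the quality of the filter depends on how well those cases cover the task.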

The main limitation is that generated data reflects its generator. It can amplify biases, stylistic regularities, factual errors, and gaps in the teacher's knowledge. Repeatedly training models on unfiltered model output may reduce diversity or reinforce artifacts, an effect sometimes called model collapse. Synthetic examples should therefore be filtered, mixed with high-quality real data where it is available, and tested against evaluations that were created independently of the generator.
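
One inexpensive way to watch for reduced diversity is a distinct-n score: the fraction of n-grams in a dataset that are unique. The sketch below assumes whitespace tokenization, which is a simplification, and distinct_n is a name chosen for illustration.

```python
def distinct_n(texts, n=2):
    """Fraction of n-grams across all texts that are unique (1.0 means no repetition)."""
    total = 0
    unique = set()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            unique.add(tuple(tokens[i:i + n]))
            total += 1
    return len(unique) / total if total else 0.0

repetitive = ["sure here is the answer"] * 3
varied = ["the cat sat down", "a dog ran home fast"]

low = distinct_n(repetitive)   # the same four bigrams repeated three times
high = distinct_n(varied)      # every bigram occurs once
```

A score that falls from one generation round to the next suggests the generator is collapsing onto a few templates. The metric says nothing about correctness, so it complements validation rather than replacing it.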

Privacy is not guaranteed merely because data is generated. A model may reproduce memorized personal or copyrighted material, and prompts used to create the dataset may contain sensitive source data.
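
A basic safeguard is to scan prompts and generated text for obvious identifiers before they enter a dataset. The sketch below uses two simple regular expressions for email addresses and phone-like numbers; the patterns and the function names redact and contains_pii are assumptions made for illustration. Pattern matching of this kind catches only the most obvious cases, produces false positives on other long digit sequences, and is not a privacy guarantee.

```python
import re

# Deliberately simple patterns; real PII detection needs far broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def redact(text):
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return PHONE.sub("[PHONE]", text)

def contains_pii(text):
    """Flag text that matches either pattern so it can be reviewed or dropped."""
    return bool(EMAIL.search(text) or PHONE.search(text))

sample = "Contact jane.doe@example.com or +1 (555) 010-9999."
cleaned = redact(sample)
```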

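The Self-Instruct paper (Wang et al., 2022) demonstrated this kind of bootstrapping: starting from a small set of human-written seed tasks, a language model generates new instructions and responses, which are filtered for quality and similarity to existing tasks before being used for fine-tuning.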
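
The similarity filter in such a loop can be approximated by rejecting any new instruction that overlaps too heavily with one already in the pool. The sketch below scores overlap with a ROUGE-L style F-measure based on the longest common subsequence of whitespace tokens. The 0.7 threshold is an illustrative default that should be tuned, and the function names are hypothetical rather than taken from any released code.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l(a, b):
    """ROUGE-L style F-measure between two strings, using whitespace tokens."""
    ta, tb = a.lower().split(), b.lower().split()
    if not ta or not tb:
        return 0.0
    lcs = lcs_length(ta, tb)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(tb), lcs / len(ta)
    return 2 * precision * recall / (precision + recall)

def filter_novel(pool, candidates, threshold=0.7):
    """Return candidates that are sufficiently different from everything kept so far."""
    kept = list(pool)
    for cand in candidates:
        if all(rouge_l(cand, existing) < threshold for existing in kept):
            kept.append(cand)
    return kept[len(pool):]

pool = ["Translate the sentence into French."]
candidates = [
    "Translate the sentence into German.",   # near-duplicate of the pool entry
    "Write a haiku about autumn rain.",
    "Write a haiku about autumn rain.",      # exact repeat of the previous candidate
]
novel = filter_novel(pool, candidates)
```

Each accepted candidate joins the comparison pool immediately, so later candidates are checked against it as well.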

Promptmetheus © 2023-present