
Creating and Validating Synthetic Datasets for LLM Evaluation & Experimentation

Thanks to John Gilhuly for his contributions to this piece.


Synthetic datasets are artificially created datasets that are designed to mimic real-world information. Unlike naturally occurring data, which is gathered from actual events or interactions, synthetic datasets are generated using algorithms, rules, or other artificial means. These datasets are carefully created to represent specific patterns, distributions, or scenarios that developers and researchers want to study or use for testing.

In the context of large language models, synthetic datasets might include:

  1. Generated text conversations simulating customer support interactions
  2. Artificial question-answer pairs covering a wide range of topics
  3. Fabricated product reviews with varying sentiments and styles
  4. Simulated code snippets with intentional bugs or specific patterns
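
As a concrete illustration of the shape such data can take, a small synthetic question-answer dataset might look like the following. The topics, fields, and values here are hypothetical, chosen only to show the structure:

```python
# A hypothetical slice of a synthetic question-answer dataset.
# Each record pairs a generated question with a reference answer and
# metadata that is useful later for balancing and filtering.
synthetic_qa_examples = [
    {
        "question": "How do I reset my account password?",
        "answer": "Use the 'Forgot password' link on the login page and follow the emailed instructions.",
        "topic": "account_management",
        "difficulty": "easy",
    },
    {
        "question": "Why was my payment declined even though my card is valid?",
        "answer": "Declines often come from the card issuer; verify the billing address and available funds, then retry.",
        "topic": "billing",
        "difficulty": "hard",
    },
]
```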

These synthetic datasets are very important for development with LLMs. By using synthetic data, developers can create controlled environments for experimentation, ensure coverage of edge cases, and protect privacy by avoiding the use of real user data.

The applications of synthetic datasets are varied and valuable:

  1. They allow us to test and validate model performance, particularly on specific, well-defined tasks.
  2. Synthetic data helps generate initial traces of application behavior, facilitating debugging in tools like Phoenix.
  3. Perhaps most importantly, synthetic datasets serve as “golden data” for consistent experimental results. This is particularly useful when developing and experimenting with applications that haven’t yet launched.

While synthetic datasets can also be used for model training, this article will focus on their applications in evaluation and experimentation. Let’s dive in!

Creating and Validating Synthetic Datasets

Defining Objectives

The first step in creating a synthetic dataset is to clearly define your objectives – identify the specific tasks or capabilities you want to evaluate. For example, you might create datasets for:

  1. Evaluating a model’s performance in sentiment analysis, question answering, or entity recognition. These datasets typically require 1000+ examples for comprehensive evaluation.
  2. Creating a golden dataset for consistent testing during development. These can be smaller, provided they cover the breadth of expected inputs to your application, with 100+ examples often sufficing.
  3. Simulating application traffic. These can vary in size depending on your needs.

Choosing Data Sources

When selecting sources for generating synthetic data, you have several options. Publicly available datasets can serve as a starting point. However, exercise caution when using these for model benchmarking. Many publicly available datasets have been used to train existing models, which could lead to misleading results if you’re evaluating a model that was likely trained on the same data.

Data Generation Techniques

Two primary methods for generating synthetic data are automated and rule-based generation.

Automated generation using LLMs quickly produces large amounts of diverse data. LLMs can generate a wide range of content, from simple text responses to complex narratives. This approach usually requires some tweaking of generation prompts and parameters to create a usable dataset.
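
As a rough sketch of what automated generation can look like in practice, the snippet below uses the OpenAI Python client to produce synthetic support questions from a prompt template. The model name, prompt wording, and output format are illustrative assumptions rather than a prescribed recipe:

```python
import json

from openai import OpenAI  # assumes the openai package is installed and OPENAI_API_KEY is set

client = OpenAI()

GENERATION_PROMPT = """Generate {n} realistic customer support questions about {topic}.
Return only a JSON array of objects with "question" and "answer" fields."""


def generate_examples(topic: str, n: int = 10) -> list[dict]:
    """Ask an LLM for synthetic question/answer pairs on a given topic."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name; any capable generator works
        messages=[{"role": "user", "content": GENERATION_PROMPT.format(n=n, topic=topic)}],
        temperature=1.0,  # a higher temperature encourages more varied outputs
    )
    # In practice you may need to strip markdown fences or use a structured-output feature.
    return json.loads(response.choices[0].message.content)


examples = generate_examples("billing disputes", n=5)
```

Tweaking the prompt, temperature, and topic list is usually where most of the effort goes when shaping the output into a usable dataset.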

Rule-based generation is particularly useful for creating golden datasets or simulating specific types of application traffic. Clear rules and parameters allow developers to make sure that specific scenarios or edge cases are covered. This is perfect for situations where certain patterns or structures must be present in the data.
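
A minimal sketch of rule-based generation, assuming a simple template-and-slot scheme; the templates and slot values below are purely illustrative and would come from your own domain rules:

```python
import itertools
import random

# Hypothetical templates and slot values; real rules would encode the patterns
# and edge cases you need guaranteed coverage of.
TEMPLATES = [
    "I was charged {amount} twice for my {plan} plan. Can I get a refund?",
    "How do I downgrade from the {plan} plan before the next billing cycle?",
]
SLOTS = {
    "amount": ["$9.99", "$29.00", "$199.00"],
    "plan": ["basic", "pro", "enterprise"],
}


def generate_rule_based(n: int, seed: int = 0) -> list[str]:
    """Fill templates with slot combinations so specific patterns are guaranteed to appear."""
    rng = random.Random(seed)
    combos = list(itertools.product(TEMPLATES, SLOTS["amount"], SLOTS["plan"]))
    rng.shuffle(combos)
    return [template.format(amount=amount, plan=plan) for template, amount, plan in combos[:n]]


examples = generate_rule_based(n=10)
```

Because every template and slot value is enumerated explicitly, you can verify at a glance that the scenarios you care about are represented.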

Balancing Synthetic Data

Diversity and representativeness in your synthetic data are important. Typically, ensuring both involves a combination of manual review and programmatic analysis of how examples are distributed across multiple attributes.
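
The programmatic part can be as simple as counting how examples fall across the attributes you care about. A minimal sketch using pandas, assuming each example is tagged with topic and difficulty labels as in the earlier sketches:

```python
import pandas as pd

# Assumes each synthetic example carries "topic" and "difficulty" labels;
# replace these with whatever attributes matter for your application.
df = pd.DataFrame([
    {"topic": "billing", "difficulty": "easy"},
    {"topic": "billing", "difficulty": "hard"},
    {"topic": "account_management", "difficulty": "easy"},
])

# Marginal distribution over topics, and the joint topic/difficulty breakdown,
# to spot over- or under-represented slices.
print(df["topic"].value_counts(normalize=True))
print(df.groupby(["topic", "difficulty"]).size().unstack(fill_value=0))
```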

When creating a golden dataset, pay special attention to balancing different types of data points, including edge cases and typical scenarios. This often requires manual curation to make sure you have comprehensive coverage.

Validating Datasets

Synthetic datasets used for evaluation and experimentation become your “ground truth” – the standard against which you measure your application or model’s performance. Investing time in thorough validation is important.

To validate synthetic datasets, compare them to real-world data to verify accurate representation of patterns and distributions found in actual use cases. For evaluation datasets, benchmark them on top-performing models, being careful not to use the same model for both generation and validation. For instance, generate data with GPT-4 and validate it using Mistral Large 2. Be sure to also include manual review in your validation process; human oversight helps catch nuances or errors that automated methods might miss.
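
One way to handle the automated part of validation is to have a second model, different from the generator, score each synthetic example. The sketch below assumes an OpenAI-compatible client for the validating model and a simple yes/no rubric, which you would replace with your own criteria:

```python
from openai import OpenAI

validator = OpenAI()  # point this at a different provider/model than the one that generated the data

JUDGE_PROMPT = """You are reviewing a synthetic evaluation example.
Question: {question}
Reference answer: {answer}
Is the reference answer factually correct and does it fully address the question? Reply YES or NO."""


def validate_example(example: dict) -> bool:
    """Use a second model to flag synthetic examples that look wrong or incomplete."""
    response = validator.chat.completions.create(
        model="gpt-4o",  # placeholder; use a model that did NOT generate the data
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(**example)}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().upper().startswith("YES")


examples = [{"question": "How do I reset my password?",
             "answer": "Use the 'Forgot password' link on the login page."}]
flagged = [ex for ex in examples if not validate_example(ex)]  # route these to manual review
```

Examples the judge flags are good candidates for the manual review pass described above.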

Combining Synthetic Datasets with Human Evaluation

While synthetic datasets offer many advantages, they may sometimes miss key use cases or types of inputs that humans would naturally consider. Human-in-the-loop processes are valuable for dataset improvement.

Recent research has shown that including even a small number of human-annotated examples can significantly improve the overall quality and effectiveness of a synthetic dataset. This hybrid approach combines the scalability of synthetic data with the understanding that human evaluators provide.

Human annotators can add targeted examples to synthetic datasets to address gaps or underrepresented scenarios. This process of augmenting synthetic data with human-curated examples can be easily implemented using tools like Phoenix.
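
A minimal sketch of this augmentation step, assuming both the synthetic and human-annotated examples live in pandas DataFrames and that you are using a recent arize-phoenix client whose `upload_dataset` signature matches the one below (check the Phoenix docs for your version):

```python
import pandas as pd
import phoenix as px  # assumes a Phoenix instance is running and reachable

# Assumed inputs: a large synthetic set plus a small human-curated set that
# covers scenarios the generator missed. Both share the same columns.
synthetic_df = pd.DataFrame([
    {"question": "How do I reset my password?",
     "answer": "Use the 'Forgot password' link on the login page.",
     "source": "synthetic"},
])
human_df = pd.DataFrame([
    {"question": "Can I merge two accounts that use the same email address?",
     "answer": "Contact support; account merges are handled manually.",
     "source": "human"},
])

combined_df = pd.concat([synthetic_df, human_df], ignore_index=True)

# Upload the combined dataset so it can back Phoenix experiments.
# The exact keyword arguments may differ between Phoenix versions; check the docs.
client = px.Client()
dataset = client.upload_dataset(
    dataset_name="support-qa-golden-v1",
    dataframe=combined_df,
    input_keys=["question"],
    output_keys=["answer"],
)
```

Tagging each row with its source also makes it easy to track how much of the dataset is human-curated over time.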

The addition of these human-annotated examples can be particularly effective in improving the dataset’s performance in few-shot learning scenarios, where models need to generalize from a small number of examples.

Best Practices for Synthetic Dataset Use

Synthetic datasets are not static, one-time creations – they are dynamic tools that require ongoing attention. To maintain their usefulness, developers must do several things:

First, implement a regular refresh cycle. Revisit and update your datasets periodically to keep pace with model improvements and account for data drift in real-world applications. 

Second, transparency is key in synthetic data generation. Maintain detailed records of the entire process, including the prompts used, models employed, and any post-processing steps applied.
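
One lightweight way to keep such records is to store a small provenance block alongside the dataset itself. The fields below are suggestions rather than a fixed schema:

```python
import json
from datetime import datetime, timezone

# An illustrative provenance record kept next to the dataset file;
# adapt the fields to whatever your generation pipeline actually does.
provenance = {
    "dataset_name": "support-qa-golden-v1",
    "generated_at": datetime.now(timezone.utc).isoformat(),
    "generation_model": "gpt-4o-mini",        # model that produced the examples
    "validation_model": "a-different-model",  # model that scored them
    "prompt_template": "Generate {n} realistic customer support questions about {topic}...",
    "post_processing": ["deduplication", "manual review of flagged examples"],
    "num_examples": 500,
}

with open("support-qa-golden-v1.provenance.json", "w") as f:
    json.dump(provenance, f, indent=2)
```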

Third, regular evaluation is important. Assess the performance of your synthetic datasets against real-world data and newer models on an ongoing basis.

Finally, when augmenting synthetic datasets with human-curated examples, take a balanced approach. Add enough human input to enhance the dataset’s quality and coverage, but be careful not to overwhelm the synthetic component.

By adhering to these practices, you can maximize the long-term value and reliability of your synthetic datasets, making them powerful tools for ongoing model evaluation and experimentation.

Conclusion

Synthetic datasets are important in the evaluation and experimentation of large language models and AI applications. By carefully creating, validating, and managing these datasets, developers and researchers can more effectively assess model performance, simulate application behavior, and maintain consistent testing environments.

As the use of generative AI continues to advance, the role of synthetic datasets is likely to evolve. While current best practices include human involvement in dataset creation and validation, future developments may shift this balance. However, the research noted above, which shows that small amounts of human-added data can make a significant difference, suggests that we're not at immediate risk of model collapse or of dataset creation becoming fully automated.

By following these guidelines and staying up to date with emerging research and best practices, you can get the most out of synthetic datasets in your generative AI applications.
