GitPedia

Curator

Synthetic data curation for post-training and structured data extraction

From bespokelabsai · Updated May 31, 2026

Bespoke Curator: Bulk Inference and Scalable Data Curation for Post-Training. The project is written primarily in Python, distributed under the Apache License 2.0, and was first published in 2024. It has 1,682 stars and 141 forks on GitHub. Key topics include agents, deep-learning, fine-tuning, instruction-tuning, and llm.

Latest release: v0.1.27 (March 15, 2026)

Overview

Bespoke Curator makes it easy to create synthetic data pipelines. Whether you are training a model or extracting structured data, Curator will prepare high-quality data quickly and robustly.

  • Rich Python-based library for generating and curating synthetic data.
  • Viewer to monitor data while it is being generated.
  • First-class support for structured outputs.
  • Built-in performance optimizations for asynchronous operations, caching, and fault recovery at every scale.
  • Support for a wide range of inference options via LiteLLM, vLLM, and popular batch APIs.

Check out our full documentation for getting started, tutorials, guides and detailed reference.

🛠️ Installation

bash
pip install bespokelabs-curator

📕 Examples

Finetuning/Distillation

| Task | Link(s) | Goal |
| --- | --- | --- |
| Product feature extraction | [Open in Colab](https://colab.research.google.com/drive/1YoA23-cBcWpaSErULzBI2bo2LPGo37GQ) | Finetuning a model to identify features of a product |
| Sentiment analysis | [Open in Colab](https://colab.research.google.com/drive/1Zfl3g7POsqqYQqkzXdyhYRSAymLhZugn?usp=sharing) | Aspect-based sentiment analysis of restaurant reviews and finetuning using Together.ai |
| RAFT for domain-specific RAG | [Code](https://github.com/bespokelabsai/curator/tree/main/examples/blocks/raft) | Implement Retrieval Augmented Fine-Tuning (RAFT) that processes domain-specific documents, generates questions, and prepares data for fine-tuning LLMs |
| Poem generation & LoRA fine-tuning | [Code](https://github.com/bespokelabsai/curator/blob/main/examples/poem_finetuning_example.py) | End-to-end pipeline: curate poem data with Curator, then LoRA fine-tune with TinkerTrainer |

Data Generation

| Task | Link(s) | Goal |
| --- | --- | --- |
| Reasoning dataset generation (Bespoke Stratos) | [Code](https://github.com/bespokelabsai/curator/tree/main/examples/bespoke-stratos-data-generation) | Generate the Bespoke-Stratos-17k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets |
| Reasoning dataset generation (Open Thoughts) | [Code](https://github.com/open-thoughts/open-thoughts) | Generate the Open-Thoughts-114k dataset, focusing on reasoning traces from math, coding, and problem-solving datasets |
| Multimodal | [Code](https://github.com/bespokelabsai/curator/tree/main/examples/multimodal) | Demonstrates multimodal capabilities by generating recipes from food images |
| Ungrounded Question Answer generation | [Code](https://github.com/bespokelabsai/curator/tree/main/examples/ungrounded-qa) | Generate diverse question-answer pairs using techniques similar to the CAMEL paper |
| Code Execution | [Open in Colab](https://colab.research.google.com/drive/1YKj1-BC66-3LgNkf1m5AEPswIYtpOU-k) | Execute code generated with Curator |
| 3Blue1Brown video generation | [Code](https://github.com/bespokelabsai/curator/tree/main/examples/code-execution/math-animation) | Generate videos similar to 3Blue1Brown and render them using code execution |
| Synthetic charts | [Code](https://github.com/bespokelabsai/curator/blob/main/examples/code-execution/chart-generation/charts.py) | Generate charts synthetically |
| Function calling | [Code](https://github.com/bespokelabsai/curator/tree/main/examples/function-calling) | Generate data for finetuning for function calling |

🚀 Quickstart

Using curator.LLM for Bulk Inference

python
from typing import Dict, Literal

from bespokelabs import curator
from datasets import Dataset
from pydantic import BaseModel, Field


class Sentiment(BaseModel):
    sentiment: Literal["positive", "negative", "neutral"] = Field(
        description="Sentiment of the review")


class SentimentAnalyzer(curator.LLM):
    def prompt(self, product: Dict):
        return f"Determine the sentiment of the product from the review: {product['review']}"

    def parse(self, product: Dict, response: Sentiment):
        return [{"name": product["name"], "sentiment": response.sentiment}]


# You can easily have a million rows here.
# Curator takes care of parallelism, retries, and caches responses.
dataset = [
    {"name": "Curator", "review": "Already saved hours in one day of use."},
    {"name": "Bespoke MiniCheck", "review": "Hallucination rates dropped by 90%."},
]

# Set batch=True to use batch mode, which can save about 50% of the costs.
analyzer = SentimentAnalyzer(
    model_name="gpt-4o-mini",
    response_format=Sentiment,
    batch=False)

reviews = analyzer(dataset)
print(reviews.to_pandas())

Output:

                name sentiment
0            Curator  positive
1  Bespoke MiniCheck  positive

In the SentimentAnalyzer class:

  • prompt takes the input (product) and returns the prompt for the LLM.
  • parse takes the input (product) and the structured output (response) and converts it to a list of dictionaries. This is so that we can easily convert the output to a HuggingFace Dataset object.

Instead of a list, you can pass a HuggingFace Dataset object as well (see below for more details).
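To make the division of labor between prompt and parse concrete, here is a minimal sketch in plain Python that does not call Curator or any model. The names `fake_llm` and `run_pipeline` are hypothetical stand-ins invented for this illustration, and a dataclass replaces the pydantic model; Curator's real execution adds parallelism, retries, and caching on top of this basic flow.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Sentiment:
    # Stand-in for the structured response (the real example uses pydantic).
    sentiment: str


def prompt(product: Dict) -> str:
    # Turn one input row into the text sent to the model.
    return f"Determine the sentiment of the product from the review: {product['review']}"


def parse(product: Dict, response: Sentiment) -> List[Dict]:
    # Combine the input row and the structured response into output rows.
    return [{"name": product["name"], "sentiment": response.sentiment}]


def fake_llm(prompt_text: str) -> Sentiment:
    # Hypothetical stub: a keyword rule instead of a model call.
    return Sentiment("positive" if "saved" in prompt_text else "neutral")


def run_pipeline(rows: List[Dict], llm: Callable[[str], Sentiment]) -> List[Dict]:
    # For each row: build the prompt, get a response, collect the parsed rows.
    output: List[Dict] = []
    for row in rows:
        output.extend(parse(row, llm(prompt(row))))
    return output


rows = [
    {"name": "Curator", "review": "Already saved hours in one day of use."},
    {"name": "Widget", "review": "It arrived on Tuesday."},
]
results = run_pipeline(rows, fake_llm)
print(results)
```

Because parse returns a list of dictionaries, the collected rows share one flat schema, which is what makes conversion to a tabular dataset straightforward.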

Using curator.LLM for data generation

Here's an example of using structured outputs and chaining together two curator.LLM blocks to generate diverse poems.

python
from typing import Dict, List

from bespokelabs import curator
from pydantic import BaseModel, Field


class Topics(BaseModel):
    topics_list: List[str] = Field(description="A list of topics.")


class TopicGenerator(curator.LLM):
    response_format = Topics

    def prompt(self, subject):
        return f"Return 3 topics related to {subject}"

    def parse(self, input: str, response: Topics):
        # One output row per topic, so the next block sees each topic separately.
        return [{"topic": t} for t in response.topics_list]


class Poem(BaseModel):
    title: str = Field(description="The title of the poem.")
    poem: str = Field(description="The content of the poem.")


class Poet(curator.LLM):
    response_format = Poem

    def prompt(self, input: Dict) -> str:
        return f"Write a poem about {input['topic']}."

    def parse(self, input: Dict, response: Poem) -> List[Dict]:
        return [{"title": response.title, "poem": response.poem}]


topic_generator = TopicGenerator(model_name="gpt-4o-mini")
poet = Poet(model_name="gpt-4o-mini")

# Start generation
topics = topic_generator("Mathematics")
poems = poet(topics)

Output:

 	title                     poem
0	The Language of Algebra	  In symbols and signs, truths intertwine,..
1	The Geometry of Space	  In the world around us, shapes do collide,..
2	The Language of Logic	  In circuits and wires where silence speaks,..
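Chaining works because the rows produced by one block's parse become the inputs of the next block. The sketch below shows that fan-out with stubbed responses and no model calls; `run_block`, `fake_topics`, and `fake_poem` are hypothetical helpers invented for this illustration and are not part of Curator.

```python
from typing import Callable, Dict, List


def run_block(inputs: List, respond: Callable, parse: Callable) -> List[Dict]:
    """Apply a stubbed model to each input and flatten the parsed rows."""
    rows: List[Dict] = []
    for item in inputs:
        rows.extend(parse(item, respond(item)))
    return rows


# Stage 1: one subject fans out into several topic rows.
def fake_topics(subject: str) -> List[str]:
    return [f"{subject} topic {i}" for i in range(1, 4)]


def parse_topics(subject: str, topics: List[str]) -> List[Dict]:
    return [{"topic": t} for t in topics]


# Stage 2: each topic row yields exactly one poem row.
def fake_poem(row: Dict) -> Dict:
    return {"title": f"Ode to {row['topic']}", "poem": "(poem text)"}


def parse_poem(row: Dict, poem: Dict) -> List[Dict]:
    return [{"title": poem["title"], "poem": poem["poem"]}]


topics = run_block(["Mathematics"], fake_topics, parse_topics)
poems = run_block(topics, fake_poem, parse_poem)
print(len(topics), len(poems))
```

One subject becomes three topic rows, and each topic row becomes one poem row, which matches the three-row output shown above.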

You can see more examples in the examples directory.

See the docs for more details and for troubleshooting information.

[!TIP]
If you are generating large datasets, you may want to use batch mode to save costs. Currently batch APIs from OpenAI and Anthropic are supported. With curator this is as simple as setting batch=True in the LLM class.
[!NOTE]
Retries and caching are enabled by default to help you rapidly iterate your data pipelines.
So now if you run the same prompt again, you will get the same response, pretty much instantly.
You can delete the cache at ~/.cache/curator or disable it with export CURATOR_DISABLE_CACHE=true.
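The idea behind getting the same response "pretty much instantly" can be illustrated with a small on-disk cache keyed by a hash of the request. This is a sketch of the general technique only, not Curator's actual cache format or key scheme; `expensive_model_call` is a hypothetical stub, and the demo writes to a temporary directory rather than ~/.cache/curator.

```python
import hashlib
import json
import tempfile
from pathlib import Path

cache_dir = Path(tempfile.mkdtemp())  # throwaway directory for the demo
calls = {"count": 0}


def expensive_model_call(model: str, prompt: str) -> str:
    # Hypothetical stand-in for a real (slow, billed) request.
    calls["count"] += 1
    return f"[{model}] echo: {prompt}"


def cached_call(model: str, prompt: str) -> str:
    # The cache key is a hash of everything that determines the response.
    key = hashlib.sha256(json.dumps([model, prompt]).encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        # Cache hit: return the stored response without calling the model.
        return json.loads(path.read_text())["response"]
    response = expensive_model_call(model, prompt)
    path.write_text(json.dumps({"response": response}))
    return response


first = cached_call("gpt-4o-mini", "Hello")
second = cached_call("gpt-4o-mini", "Hello")
print(first == second, calls["count"])
```

The second call is served from disk, so the stubbed model is invoked only once.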

[!IMPORTANT]
Make sure to set your API keys as environment variables for the model you are calling. For example running export OPENAI_API_KEY=sk-... and export ANTHROPIC_API_KEY=ant-... will allow you to run the previous two examples. A full list of supported models and their associated environment variable names can be found in the litellm docs.

Anonymized Telemetry

We collect minimal, anonymized usage telemetry to help prioritize new features and improvements that benefit the Curator community. You can opt out by setting the TELEMETRY_ENABLED environment variable to False.

📖 Providers

Curator supports a wide range of providers, including OpenAI, Anthropic, and many more.

OpenAI backend

python
llm = curator.LLM(
    model_name="gpt-4o-mini",
)

For other models that support OpenAI-compatible APIs, you can use the openai backend:

python
llm = curator.LLM(
    model_name="gpt-4o-mini",
    backend="openai",
    backend_params={
        "base_url": "https://your-openai-compatible-api-url",
        # Replace the placeholder with your service's API key.
        "api_key": "<YOUR_OPENAI_COMPATIBLE_SERVICE_API_KEY>",
    },
)

LiteLLM (Anthropic, Gemini, together.ai, etc.)

Here is an example of using Gemini with litellm backend:

python
llm = curator.LLM(
    model_name="gemini/gemini-1.5-flash",
    backend="litellm",
    backend_params={
        "max_requests_per_minute": 2_000,
        "max_tokens_per_minute": 4_000_000,
    },
)

Ollama

python
llm = curator.LLM(
    model_name="ollama/llama3.1:8b",  # Ollama model identifier
    backend_params={"base_url": "http://localhost:11434"},
)

vLLM

python
llm = curator.LLM(
    model_name="Qwen/Qwen2.5-3B-Instruct",
    backend="vllm",
    backend_params={
        "tensor_parallel_size": 1,  # Adjust based on GPU count
        "gpu_memory_utilization": 0.7,
    },
)

DeepSeek

DeepSeek offers an OpenAI-compatible API that you can use with the openai backend.

[!IMPORTANT]
The DeepSeek API is experiencing intermittent issues and will return empty responses during times of high traffic. We recommend
calling the DeepSeek API through the openai backend with a high max_retries value, so that requests that come back empty are retried,
and with reasonable max_requests_per_minute and max_tokens_per_minute limits, so that retries are not aggressive enough to overwhelm the API.

python
llm = curator.LLM(
    model_name="deepseek-reasoner",
    generation_params={"temp": 0.0},
    backend_params={
        "max_requests_per_minute": 100,
        "max_tokens_per_minute": 10_000_000,
        "base_url": "https://api.deepseek.com/",
        # Replace the placeholder with your DeepSeek API key.
        "api_key": "<YOUR_DEEPSEEK_API_KEY>",
        "max_retries": 50,
    },
    backend="openai",
)
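The reasoning behind a high retry count can be sketched in plain Python: treat an empty response as a failure, retry, and back off between attempts so the retries do not add to the load. This is an illustration of the general pattern, not Curator's internal retry logic; `call_with_retries` and `flaky_request` are hypothetical names, and the delays are shortened for the demo.

```python
import time
from typing import Callable


def call_with_retries(request: Callable[[], str], max_retries: int,
                      base_delay: float = 0.01) -> str:
    """Retry while the response is empty, backing off exponentially."""
    for attempt in range(max_retries + 1):
        response = request()
        if response:  # a non-empty response counts as success
            return response
        if attempt < max_retries:
            # Exponential backoff, capped so the delay stays bounded.
            time.sleep(base_delay * 2 ** min(attempt, 5))
    raise RuntimeError(f"empty response after {max_retries} retries")


# Hypothetical flaky endpoint: two empty responses, then a real one.
responses = iter(["", "", "42"])
attempts = {"count": 0}


def flaky_request() -> str:
    attempts["count"] += 1
    return next(responses)


answer = call_with_retries(flaky_request, max_retries=50)
print(answer, attempts["count"])
```

The request succeeds on the third attempt, well inside the retry budget, and the backoff keeps the first two retries from firing immediately.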

kluster.ai

python
llm = curator.LLM(
    model_name="deepseek-ai/DeepSeek-R1",
    backend="klusterai",
)

📦 Batch Mode

Several providers offer a discount of about 50% on token usage in batch mode. Curator makes it easy to use batch mode with a wide range of providers.

Example with OpenAI:

python
llm = curator.LLM(model_name="gpt-4o-mini", batch=True)
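As a rough worked example of what a discount of about 50% means for a large job, the sketch below compares standard and batch cost for a given token volume. The token count and the per-million-token price are hypothetical numbers chosen for round arithmetic, not real provider pricing, and `estimate_cost` is a helper invented for this illustration.

```python
def estimate_cost(tokens: int, price_per_million: float, batch: bool,
                  batch_discount: float = 0.5) -> float:
    """Estimate spend in USD, applying the batch discount when batch=True."""
    cost = tokens / 1_000_000 * price_per_million
    return cost * (1 - batch_discount) if batch else cost


total_tokens = 200_000_000  # hypothetical job size
price = 0.50                # hypothetical USD per million tokens

standard = estimate_cost(total_tokens, price, batch=False)
batched = estimate_cost(total_tokens, price, batch=True)
print(f"standard: ${standard:.2f}, batch: ${batched:.2f}")
```

Under these assumed numbers the job costs 100 USD in standard mode and 50 USD in batch mode; the trade-off is that batch results are not returned immediately.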

🔧 Fine-Tuning with Tinker

Curator integrates with the Tinker SDK so you can go from curated data to a LoRA fine-tuned model in a few lines of Python.

bash
pip install bespokelabs-curator tinker
export TINKER_API_KEY="your-tinker-key"
python
from bespokelabs.curator import TinkerTrainer, TinkerTrainerConfig

# Configure training
config = TinkerTrainerConfig(
    base_model="Qwen/Qwen3-8B",
    epochs=3,
    batch_size=4,
    lora_config={"rank": 16, "alpha": 32, "dropout": 0.05},
    checkpoint_every_epoch=True,
)

# Training data is a list of chat-format dicts (or a HuggingFace Dataset)
training_data = [
    {"messages": [
        {"role": "user", "content": "What is Python?"},
        {"role": "assistant", "content": "Python is a programming language."},
    ]},
    # ...
]

# Train
trainer = TinkerTrainer(config)
result = trainer.train(training_data)
print(f"Final loss: {result.final_loss:.4f}")

# Sample from the fine-tuned model
response = trainer.sample("Explain recursion in Python")
print(response)

Checkpoint Resume

Training can be resumed from any saved checkpoint. The trainer restores both model weights and optimizer state, then continues from where it left off; no data is replayed.

python
# Resume from an earlier run's checkpoint
checkpoints = result.checkpoints  # list of CheckpointInfo

trainer = TinkerTrainer(config)
trainer.load_checkpoint(checkpoints[-1])
result = trainer.train(training_data)  # continues from the checkpoint
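What "no data is replayed" means can be shown with a toy trainer that stores its position in the data alongside its state. This is a conceptual sketch only: `ToyTrainer` and `ToyCheckpoint` are hypothetical classes invented for this illustration and say nothing about how TinkerTrainer stores or restores checkpoints internally.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyCheckpoint:
    next_index: int  # first example that has not been consumed yet
    weight: float    # stand-in for model weights and optimizer state


@dataclass
class ToyTrainer:
    weight: float = 0.0
    next_index: int = 0
    seen: List[float] = field(default_factory=list)

    def load_checkpoint(self, ckpt: ToyCheckpoint) -> None:
        # Restore both the state and the position in the data.
        self.weight = ckpt.weight
        self.next_index = ckpt.next_index

    def train(self, data: List[float], stop_after: Optional[int] = None) -> ToyCheckpoint:
        end = len(data) if stop_after is None else min(stop_after, len(data))
        for i in range(self.next_index, end):
            self.weight += data[i]  # pretend gradient step
            self.seen.append(data[i])
        self.next_index = max(self.next_index, end)
        return ToyCheckpoint(self.next_index, self.weight)


data = [1.0, 2.0, 3.0, 4.0, 5.0]

# First run is interrupted after two examples.
first_run = ToyTrainer()
ckpt = first_run.train(data, stop_after=2)

# A fresh trainer resumes from the checkpoint and skips consumed examples.
resumed = ToyTrainer()
resumed.load_checkpoint(ckpt)
final = resumed.train(data)
print(resumed.seen, final.weight)
```

The resumed run processes only the last three examples, yet ends in the same state as an uninterrupted run over all five.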

Custom Data Formats

Subclass TinkerTrainer to handle non-standard data layouts:

python
class MyTrainer(TinkerTrainer):
    def format_example(self, row):
        return TrainingExample.from_dict_messages([
            {"role": "user", "content": row["question"]},
            {"role": "assistant", "content": row["answer"]},
        ])


trainer = MyTrainer(config)
result = trainer.train([{"question": "What is 2+2?", "answer": "4"}, ...])
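The mapping that format_example performs can also be written as a plain function that turns question/answer rows into the chat-format dicts used in the training example earlier in this section. The sketch below uses only the standard library; `to_chat_example` is a hypothetical helper, and the `question` and `answer` column names are taken from the example above.

```python
from typing import Dict, List


def to_chat_example(row: Dict[str, str]) -> Dict[str, List[Dict[str, str]]]:
    """Map a question/answer row onto the chat-format layout."""
    return {"messages": [
        {"role": "user", "content": row["question"]},
        {"role": "assistant", "content": row["answer"]},
    ]}


rows = [
    {"question": "What is 2+2?", "answer": "4"},
    {"question": "What is Python?", "answer": "Python is a programming language."},
]
training_data = [to_chat_example(row) for row in rows]
print(training_data[0])
```

Converting rows up front like this keeps the trainer unmodified, while subclassing keeps the conversion next to the training code; which is preferable depends on where the data layout is likely to change.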

See the full poem fine-tuning example for an end-to-end pipeline that curates data with curator.LLM and then fine-tunes with TinkerTrainer.

Bespoke Curator Viewer

The hosted Curator Viewer is a rich interface for visualizing data, which makes inspecting it much easier.

You can enable it as follows:

Bash:

shell
export CURATOR_VIEWER=1

Python/colab:

python
import os

os.environ["CURATOR_VIEWER"] = "1"

With this enabled, as curator generates data, it gets uploaded and you can see the responses streaming in the viewer. The URL for the viewer is displayed right next to the rich progress.

Authenticate with a Bespoke Labs API key

By default, datasets are accessible to anyone with the link. To keep your datasets private, you can associate them with a Bespoke Labs account. Doing so also allows you to:

  1. Track all datasets associated with your account
  2. Share datasets with collaborators
  3. Analyze data generation costs over time

You can enable authentication as follows:

  1. Sign up for a Bespoke Labs account.
  2. Create an API key from the API Key page.
  3. Set the BESPOKE_API_KEY and CURATOR_VIEWER environment variables:
shell
export BESPOKE_API_KEY=<YOUR_API_KEY>
export CURATOR_VIEWER=1

With the environment variables set, all your datasets will be streamed to the hosted viewer and linked to your Bespoke Labs account. You can visit the Datasets page to see datasets generated with your API keys or shared with you by others, and the Cost Report page to see the data generation costs for a given period.

Environment Variables

We support a range of environment variables to customize the behavior of Curator.

Here is a complete table of environment variables:

| Variable | Description | Default |
| --- | --- | --- |
| CURATOR_VIEWER | Enables the Curator viewer for visualizing data curation when True. | False |
| CURATOR_DISABLE_CACHE | Disables caching for curator.LLM generations when True. Useful for fresh runs. | False |
| CURATOR_CACHE_DIR | Sets the cache directory used for curator.LLM generations. | ~/.cache/curator |
| CURATOR_DISABLE_RICH_DISPLAY | When True, disables Rich CLI output (and falls back to tqdm logging) for local data generation monitoring. This is useful when debugging with inline breakpoints or interactive debuggers like pdb, where Rich's dynamic output can interfere with terminal input. | False |
| TELEMETRY_ENABLED | Enables telemetry for Curator usage tracking when True. | True |
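In notebooks, where shell exports are awkward, the same variables can be set from Python. The sketch below assumes the variables are read when Curator runs, so it sets them before any Curator import; the cache path is a throwaway temporary directory chosen for the demo, and the values follow the forms used elsewhere in this README (`true`, `False`, `1`).

```python
import os
import tempfile

# Assumption: set these before importing or running Curator.
os.environ["CURATOR_VIEWER"] = "1"               # stream data to the hosted viewer
os.environ["CURATOR_DISABLE_CACHE"] = "true"     # force fresh generations
os.environ["CURATOR_CACHE_DIR"] = tempfile.mkdtemp()  # demo-only cache location
os.environ["TELEMETRY_ENABLED"] = "False"        # opt out of telemetry

for name in ("CURATOR_VIEWER", "CURATOR_DISABLE_CACHE",
             "CURATOR_CACHE_DIR", "TELEMETRY_ENABLED"):
    print(f"{name}={os.environ[name]}")
```

Settings made this way apply only to the current process and its children, which makes them convenient for one-off experiments.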

Contributing

Thank you to all the contributors for making this project possible!
Please follow these instructions on how to contribute.

Citation

If you find Curator useful, please consider citing us!

@software{Curator: A Tool for Synthetic Data Creation,
  author = {Marten, Ryan* and Vu, Trung* and Ji, Charlie Cheng-Jie and Sharma, Kartik and Pimpalgaonkar, Shreyas and Dimakis, Alex and Sathiamoorthy, Maheswaran},
  month = jan,
  title = {{Curator}},
  year = {2025},
  howpublished = {\url{https://github.com/bespokelabsai/curator}}
}

This article is auto-generated from bespokelabsai/curator via the GitHub API. Last fetched: 6/1/2026