Control layer
A production-grade control layer that sits between your application logic and any LLM — input validation, schema enforcement, circuit breaking, targeted retry, and audit logging in one composable pipeline.
A production-grade control layer that sits between your application logic and any LLM — input validation, schema enforcement, circuit breaking, targeted retry, and audit logging in one composable pipeline. The project is written primarily in Python, distributed under the MIT License license, first published in 2026. Key topics include: anthropic, circuit-breaker, generative-ai, input-validation, llm.
control-layer
A production-grade control layer that sits between your application logic and any LLM — input validation, schema enforcement, circuit breaking, targeted retry, and audit logging in one composable pipeline.
Most LLM integrations stop at: write a prompt, call the model, use the response. This
library handles what prompt engineering cannot — enforcing what the model actually returns,
blocking what should never reach it, and recovering cleanly when things break.
Read the full write-up on Towards Data Science →
Prompt Engineering Failed in Production — I Built the Control Layer That Actually Works
What It Does
User Input
|
[1] InputGuard -- injection detection (20 patterns), length check, sanitization
|
[2] CircuitBreaker -- stops hammering a failing LLM backend
|
[3] TokenBudget -- tiktoken-accurate slot allocation, priority order
[4] PromptBuilder -- assembles prompt within budget, injects constraints
|
[5] LLMCaller -- enforces hard timeout on every call
|
[6] ResponseValidator -- JSON schema, length bounds, forbidden phrases, quality score
| [failed?]
[7] RetryEngine -- targeted prompt mutation per failure mode, jittered backoff
| [exhausted?]
[8] FallbackRouter -- cached response, template, or escalation chain
|
AuditLogger -- every attempt written to JSONL, thread-safe, persistent
|
ControlPacket -- response, attempts, latency, score, audit_id
| Component | Job |
|---|---|
| InputGuard | Blocks injection attempts and oversized input before any LLM call |
| CircuitBreaker | Opens after N consecutive failures; rejects calls instantly during recovery |
| TokenBudget | tiktoken-accurate slot-based allocator; prevents silent overflow |
| PromptBuilder | Assembles prompt in priority order with hard constraints injected structurally |
| LLMCaller | Wraps any callable LLM with thread-based timeout enforcement |
| ResponseValidator | Validates JSON structure, required keys, length, forbidden phrases |
| RetryEngine | Maps each failure mode to a targeted mutation hint; jittered exponential backoff |
| FallbackRouter | Registered fallback chain; first non-empty response wins |
| AuditLogger | Thread-safe JSONL audit log; P50/P90/P99 latency stats; failure distribution |
Installation
bashgit clone https://github.com/Emmimal/control-layer.git cd control-layer pip install tiktoken tenacity pydantic structlog # required pip install pytest # optional — for running tests
No ML dependencies. No GPU required. All functionality runs on the Python standard library
plus the four packages above.
Quick Start
pythonfrom control_layer import ControlLayer, ControlLayerConfig, ResponseSchema # Define your output contract schema = ResponseSchema( must_be_json=True, required_keys=["summary", "confidence"], max_length=400, forbidden_phrases=["I cannot", "As an AI"], ) # Configure the layer config = ControlLayerConfig( total_tokens=800, max_attempts=3, timeout_seconds=30.0, cb_failure_threshold=5, cb_recovery_seconds=30.0, ) # Swap in any LLM callable — OpenAI, Anthropic, local model, mock def your_llm_call(prompt: str) -> str: ... layer = ControlLayer( llm_fn=your_llm_call, system_prompt="You are a structured research assistant.", schema=schema, config=config, ) # Register fallbacks — called in order when retries exhaust layer.register_fallback( "cache", lambda q: '{"summary": "Cached response.", "confidence": 0.5}', ) # Run packet = layer.run( user_input="How does token budget allocation work?", constraints=[ "Return only valid JSON.", "Include 'summary' and 'confidence' keys.", "No markdown fencing.", ], context=retrieved_documents, # optional RAG context ) print(packet.response) # final response print(packet.validation.passed) # True / False print(packet.attempts) # 1, 2, or 3 print(packet.total_latency_ms) # end-to-end latency print(packet.audit_id) # ties all log lines to this request
Running the Demos
Five runnable demos covering every failure mode and recovery path. No API key required.
The MockLLM simulates realistic failure behavior at a configurable rate.
bashpython demo.py
| Demo | What It Shows |
|---|---|
| 1 | Input guard blocking 7 of 8 inputs — injection, empty, oversized |
| 2 | Schema enforcement with retry — 75% first-attempt failure rate, mutation hints |
| 3 | Constraint violation recovery — length and forbidden phrase, 3 attempts |
| 4 | Fallback router — exhausted retries route to cached response |
| 5 | Benchmark — naive 0% pass rate vs control layer 100%, latency breakdown |
Running Demo 5 also generates control_layer_benchmark.png — a 6-panel benchmark figure
showing pass rate, failure mode distribution, retry distribution, latency percentiles,
token budget allocation, and quality score histogram.
Running the Tests
bashpytest tests/ -v
TestInputGuard 14 tests PASSED
TestTokenBudget 5 tests PASSED
TestPromptBuilder 6 tests PASSED
TestResponseValidator 10 tests PASSED
TestCircuitBreaker 5 tests PASSED
TestRetryEngine 6 tests PASSED
TestFallbackRouter 4 tests PASSED
TestLLMCaller 2 tests PASSED
TestAuditLogger 5 tests PASSED
TestControlLayerIntegration 8 tests PASSED
TestPydanticConfig 4 tests PASSED
69 passed in 1.19s
Every component is tested in isolation. Integration tests cover the full orchestration
path: first-attempt success, retry on schema violation, fallback after exhausted retries,
circuit breaker rejection after consecutive timeouts, and Pydantic config validation errors.
Configuration Reference
pythonControlLayerConfig( # Token budget total_tokens=800, # Total token budget for prompt assembly model_name="cl100k_base", # tiktoken encoding name # Input validation max_input_chars=2000, # Hard limit on user input length # LLM call timeout_seconds=30.0, # Hard timeout per LLM call # Retry max_attempts=3, # Maximum retry attempts per request base_delay_ms=50.0, # Base exponential backoff delay max_delay_ms=2000.0, # Maximum backoff delay jitter_ms=25.0, # Random jitter added to each delay # Circuit breaker cb_failure_threshold=5, # Consecutive failures before opening cb_recovery_seconds=30.0, # Seconds before attempting recovery # Audit audit_log_path="audit.jsonl", # JSONL audit log path )
pythonResponseSchema( must_be_json=False, # Require valid JSON response required_keys=[], # Keys that must appear in JSON output max_length=None, # Maximum response length in characters min_length=None, # Minimum response length in characters forbidden_phrases=[], # Phrases that must not appear in response must_contain=[], # Phrases that must appear (used for quality score) )
Swapping the LLM
The llm_fn parameter accepts any callable that takes a str and returns a str.
python# OpenAI import openai client = openai.OpenAI() def openai_call(prompt: str) -> str: response = client.chat.completions.create( model="gpt-4o", messages=[{"role": "user", "content": prompt}], ) return response.choices[0].message.content layer = ControlLayer(llm_fn=openai_call, ...) # Anthropic import anthropic client = anthropic.Anthropic() def claude_call(prompt: str) -> str: response = client.messages.create( model="claude-sonnet-4-5", max_tokens=1024, messages=[{"role": "user", "content": prompt}], ) return response.content[0].text layer = ControlLayer(llm_fn=claude_call, ...) # Any local model layer = ControlLayer(llm_fn=lambda prompt: your_local_model.generate(prompt), ...)
Project Structure
control-layer/
├── control_layer.py # All eight components + ControlLayer orchestrator
├── demo.py # Five runnable demos + benchmark charts
├── tests/
│ └── test_control_layer.py # 69 tests across all components
├── audit.jsonl # Generated on first run (append-only audit log)
├── control_layer_benchmark.png # Generated by demo.py
└── README.md
Benchmark
Measured on Python 3.12.6, Windows 11, CPU only, no GPU.
Ten structured output queries, 55% first-attempt failure rate.
| Metric | Naive | Control Layer |
|---|---|---|
| Pass rate | 0% | 100% |
| Min latency (ms) | 37.3 | 46.2 |
| Median latency (ms) | 43.3 | 143.5 |
| Mean latency (ms) | 42.9 | 139.8 |
| P90 latency (ms) | 45.6 | 168.0 |
| Max latency (ms) | 48.4 | 281.9 |
| Resolved on attempt 1 | N/A | 2 |
| Resolved on attempt 2 | N/A | 7 |
| Resolved on attempt 3+ | N/A | 1 |
Component overhead (excluding LLM call):
| Operation | Latency | Notes |
|---|---|---|
| InputGuard validation | ~0.2ms | 20 regex patterns |
| tiktoken count (100 tokens) | ~0.8ms | Encoding lookup |
| PromptBuilder.build() | ~1.1ms | Budget allocation + assembly |
| ResponseValidator.validate() | ~0.3ms | JSON parse + rule checks |
| CircuitBreaker.is_open() | ~0.05ms | Lock acquire + state check |
| AuditLogger.log() | ~0.4ms | Lock + file append |
| Total non-LLM overhead | ~2.9ms | Per request |
The LLM call dominates every other number. The control layer adds under 3ms of overhead
per request, which is within the variance of a single network round-trip.
When to Use This
Worth it when you have:
- LLM responses that drive downstream code — JSON parsed programmatically, data written
to a database, outputs shown to users without human review - User input passed to an LLM without a validation layer in between
- Structured output requirements the model violates intermittently
- Production systems where a LLM outage would block threads or hang requests
Skip it when you have:
- Single-turn, low-stakes use cases where a bad response is displayed and discarded
- Hard latency requirements under 50ms — retry delays alone can exceed this
- A chatbot where the user sees the raw model output and can judge it themselves
Known Limitations
Injection patterns are not exhaustive. Twenty patterns cover the OWASP LLM Top 10
attack taxonomy. Adversarial prompts crafted to avoid known patterns will pass. Combine
with embedding-based anomaly detection for high-risk deployments.
Circuit breaker state is in-process only. A restart resets the circuit to CLOSED
regardless of backend status. For multi-instance deployments, share circuit state via
Redis or a similar low-latency store.
No streaming support. The LLMCaller collects the full response before validation.
Streaming APIs require partial validation heuristics or full response buffering — neither
is implemented.
Quality score uses phrase matching, not semantic similarity. must_contain checks
exact string presence. A response that paraphrases a required concept without using the
exact phrase scores zero. Swap in an embedding-based scorer for higher precision.
AuditLogger grows unbounded. The JSONL file appends on every call. In production,
ship it to object storage on a rolling basis and rotate locally.
Related
Same series — production layers for LLM systems:
-
RAG Is Blind to Time — I Built a Temporal Layer to Fix It in Production
— temporal awareness layer for RAG systems that treats time as a first-class
retrieval signal. -
LLM Evals Are Based on Vibes — I Built the Missing Layer That Decides What Ships
— evaluation layer that replaces gut-feel shipping decisions with measurable
output quality gates. -
PyTorch NaNs Are Silent Killers — I Built a 3ms Hook to Catch Them at the Exact Layer
— lightweight hook that catches NaN propagation at the exact layer it
originates, in under 3ms overhead. -
context-engine — retrieval,
re-ranking, memory decay, and token budget control for RAG systems. The
control layer handles what the model returns. The context engine handles
what it receives. They compose.
License
MIT
Contributors
Showing top 1 contributor by commit count.
