GitPedia

Control layer

A production-grade control layer that sits between your application logic and any LLM — input validation, schema enforcement, circuit breaking, targeted retry, and audit logging in one composable pipeline.

From Emmimal·Updated June 19, 2026·View on GitHub·

A production-grade control layer that sits between your application logic and any LLM — input validation, schema enforcement, circuit breaking, targeted retry, and audit logging in one composable pipeline. The project is written primarily in Python, distributed under the MIT License license, first published in 2026. Key topics include: anthropic, circuit-breaker, generative-ai, input-validation, llm.

control-layer

A production-grade control layer that sits between your application logic and any LLM — input validation, schema enforcement, circuit breaking, targeted retry, and audit logging in one composable pipeline.

Python Version
Tests
License
Dependencies

Most LLM integrations stop at: write a prompt, call the model, use the response. This
library handles what prompt engineering cannot — enforcing what the model actually returns,
blocking what should never reach it, and recovering cleanly when things break.

Read the full write-up on Towards Data Science →
Prompt Engineering Failed in Production — I Built the Control Layer That Actually Works


What It Does

User Input
    |
[1] InputGuard          -- injection detection (20 patterns), length check, sanitization
    |
[2] CircuitBreaker      -- stops hammering a failing LLM backend
    |
[3] TokenBudget         -- tiktoken-accurate slot allocation, priority order
[4] PromptBuilder       -- assembles prompt within budget, injects constraints
    |
[5] LLMCaller           -- enforces hard timeout on every call
    |
[6] ResponseValidator   -- JSON schema, length bounds, forbidden phrases, quality score
    | [failed?]
[7] RetryEngine         -- targeted prompt mutation per failure mode, jittered backoff
    | [exhausted?]
[8] FallbackRouter      -- cached response, template, or escalation chain
    |
    AuditLogger         -- every attempt written to JSONL, thread-safe, persistent
    |
ControlPacket           -- response, attempts, latency, score, audit_id
ComponentJob
InputGuardBlocks injection attempts and oversized input before any LLM call
CircuitBreakerOpens after N consecutive failures; rejects calls instantly during recovery
TokenBudgettiktoken-accurate slot-based allocator; prevents silent overflow
PromptBuilderAssembles prompt in priority order with hard constraints injected structurally
LLMCallerWraps any callable LLM with thread-based timeout enforcement
ResponseValidatorValidates JSON structure, required keys, length, forbidden phrases
RetryEngineMaps each failure mode to a targeted mutation hint; jittered exponential backoff
FallbackRouterRegistered fallback chain; first non-empty response wins
AuditLoggerThread-safe JSONL audit log; P50/P90/P99 latency stats; failure distribution

Installation

bash
git clone https://github.com/Emmimal/control-layer.git cd control-layer pip install tiktoken tenacity pydantic structlog # required pip install pytest # optional — for running tests

No ML dependencies. No GPU required. All functionality runs on the Python standard library
plus the four packages above.


Quick Start

python
from control_layer import ControlLayer, ControlLayerConfig, ResponseSchema # Define your output contract schema = ResponseSchema( must_be_json=True, required_keys=["summary", "confidence"], max_length=400, forbidden_phrases=["I cannot", "As an AI"], ) # Configure the layer config = ControlLayerConfig( total_tokens=800, max_attempts=3, timeout_seconds=30.0, cb_failure_threshold=5, cb_recovery_seconds=30.0, ) # Swap in any LLM callable — OpenAI, Anthropic, local model, mock def your_llm_call(prompt: str) -> str: ... layer = ControlLayer( llm_fn=your_llm_call, system_prompt="You are a structured research assistant.", schema=schema, config=config, ) # Register fallbacks — called in order when retries exhaust layer.register_fallback( "cache", lambda q: '{"summary": "Cached response.", "confidence": 0.5}', ) # Run packet = layer.run( user_input="How does token budget allocation work?", constraints=[ "Return only valid JSON.", "Include 'summary' and 'confidence' keys.", "No markdown fencing.", ], context=retrieved_documents, # optional RAG context ) print(packet.response) # final response print(packet.validation.passed) # True / False print(packet.attempts) # 1, 2, or 3 print(packet.total_latency_ms) # end-to-end latency print(packet.audit_id) # ties all log lines to this request

Running the Demos

Five runnable demos covering every failure mode and recovery path. No API key required.
The MockLLM simulates realistic failure behavior at a configurable rate.

bash
python demo.py
DemoWhat It Shows
1Input guard blocking 7 of 8 inputs — injection, empty, oversized
2Schema enforcement with retry — 75% first-attempt failure rate, mutation hints
3Constraint violation recovery — length and forbidden phrase, 3 attempts
4Fallback router — exhausted retries route to cached response
5Benchmark — naive 0% pass rate vs control layer 100%, latency breakdown

Running Demo 5 also generates control_layer_benchmark.png — a 6-panel benchmark figure
showing pass rate, failure mode distribution, retry distribution, latency percentiles,
token budget allocation, and quality score histogram.


Running the Tests

bash
pytest tests/ -v
TestInputGuard               14 tests   PASSED
TestTokenBudget               5 tests   PASSED
TestPromptBuilder             6 tests   PASSED
TestResponseValidator        10 tests   PASSED
TestCircuitBreaker            5 tests   PASSED
TestRetryEngine               6 tests   PASSED
TestFallbackRouter            4 tests   PASSED
TestLLMCaller                 2 tests   PASSED
TestAuditLogger               5 tests   PASSED
TestControlLayerIntegration   8 tests   PASSED
TestPydanticConfig            4 tests   PASSED

69 passed in 1.19s

Every component is tested in isolation. Integration tests cover the full orchestration
path: first-attempt success, retry on schema violation, fallback after exhausted retries,
circuit breaker rejection after consecutive timeouts, and Pydantic config validation errors.


Configuration Reference

python
ControlLayerConfig( # Token budget total_tokens=800, # Total token budget for prompt assembly model_name="cl100k_base", # tiktoken encoding name # Input validation max_input_chars=2000, # Hard limit on user input length # LLM call timeout_seconds=30.0, # Hard timeout per LLM call # Retry max_attempts=3, # Maximum retry attempts per request base_delay_ms=50.0, # Base exponential backoff delay max_delay_ms=2000.0, # Maximum backoff delay jitter_ms=25.0, # Random jitter added to each delay # Circuit breaker cb_failure_threshold=5, # Consecutive failures before opening cb_recovery_seconds=30.0, # Seconds before attempting recovery # Audit audit_log_path="audit.jsonl", # JSONL audit log path )
python
ResponseSchema( must_be_json=False, # Require valid JSON response required_keys=[], # Keys that must appear in JSON output max_length=None, # Maximum response length in characters min_length=None, # Minimum response length in characters forbidden_phrases=[], # Phrases that must not appear in response must_contain=[], # Phrases that must appear (used for quality score) )

Swapping the LLM

The llm_fn parameter accepts any callable that takes a str and returns a str.

python
# OpenAI import openai client = openai.OpenAI() def openai_call(prompt: str) -> str: response = client.chat.completions.create( model="gpt-4o", messages=[{"role": "user", "content": prompt}], ) return response.choices[0].message.content layer = ControlLayer(llm_fn=openai_call, ...) # Anthropic import anthropic client = anthropic.Anthropic() def claude_call(prompt: str) -> str: response = client.messages.create( model="claude-sonnet-4-5", max_tokens=1024, messages=[{"role": "user", "content": prompt}], ) return response.content[0].text layer = ControlLayer(llm_fn=claude_call, ...) # Any local model layer = ControlLayer(llm_fn=lambda prompt: your_local_model.generate(prompt), ...)

Project Structure

control-layer/
├── control_layer.py          # All eight components + ControlLayer orchestrator
├── demo.py                   # Five runnable demos + benchmark charts
├── tests/
│   └── test_control_layer.py # 69 tests across all components
├── audit.jsonl               # Generated on first run (append-only audit log)
├── control_layer_benchmark.png  # Generated by demo.py
└── README.md

Benchmark

Measured on Python 3.12.6, Windows 11, CPU only, no GPU.
Ten structured output queries, 55% first-attempt failure rate.

MetricNaiveControl Layer
Pass rate0%100%
Min latency (ms)37.346.2
Median latency (ms)43.3143.5
Mean latency (ms)42.9139.8
P90 latency (ms)45.6168.0
Max latency (ms)48.4281.9
Resolved on attempt 1N/A2
Resolved on attempt 2N/A7
Resolved on attempt 3+N/A1

Component overhead (excluding LLM call):

OperationLatencyNotes
InputGuard validation~0.2ms20 regex patterns
tiktoken count (100 tokens)~0.8msEncoding lookup
PromptBuilder.build()~1.1msBudget allocation + assembly
ResponseValidator.validate()~0.3msJSON parse + rule checks
CircuitBreaker.is_open()~0.05msLock acquire + state check
AuditLogger.log()~0.4msLock + file append
Total non-LLM overhead~2.9msPer request

The LLM call dominates every other number. The control layer adds under 3ms of overhead
per request, which is within the variance of a single network round-trip.


When to Use This

Worth it when you have:

  • LLM responses that drive downstream code — JSON parsed programmatically, data written
    to a database, outputs shown to users without human review
  • User input passed to an LLM without a validation layer in between
  • Structured output requirements the model violates intermittently
  • Production systems where a LLM outage would block threads or hang requests

Skip it when you have:

  • Single-turn, low-stakes use cases where a bad response is displayed and discarded
  • Hard latency requirements under 50ms — retry delays alone can exceed this
  • A chatbot where the user sees the raw model output and can judge it themselves

Known Limitations

Injection patterns are not exhaustive. Twenty patterns cover the OWASP LLM Top 10
attack taxonomy. Adversarial prompts crafted to avoid known patterns will pass. Combine
with embedding-based anomaly detection for high-risk deployments.

Circuit breaker state is in-process only. A restart resets the circuit to CLOSED
regardless of backend status. For multi-instance deployments, share circuit state via
Redis or a similar low-latency store.

No streaming support. The LLMCaller collects the full response before validation.
Streaming APIs require partial validation heuristics or full response buffering — neither
is implemented.

Quality score uses phrase matching, not semantic similarity. must_contain checks
exact string presence. A response that paraphrases a required concept without using the
exact phrase scores zero. Swap in an embedding-based scorer for higher precision.

AuditLogger grows unbounded. The JSONL file appends on every call. In production,
ship it to object storage on a rolling basis and rotate locally.


Same series — production layers for LLM systems:


License

MIT

Contributors

Showing top 1 contributor by commit count.

View all contributors on GitHub →

This article is auto-generated from Emmimal/control-layer via the GitHub API.Last fetched: 6/26/2026