Cascadeflow

Cascading runtime for AI agents. Optimize cost, latency, quality, and policy decisions inside the agent loop.

From lemony-ai · Updated June 17, 2026

**Cost Savings:** 69% (MT-Bench), 93% (GSM8K), 52% (MMLU), and 80% (TruthfulQA), while retaining 96% of GPT-5 quality. The project is written primarily in Python, distributed under the MIT License, and was first published in 2025. It has 2,517 stars and 583 forks on GitHub. Key topics include agent, ai, anthropic, api, and budgets.

Latest release: v1.2.0 (Release 1.2.0)
<div align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="./.github/assets/CF_logo_bright.svg"> <source media="(prefers-color-scheme: light)" srcset="./.github/assets/CF_logo_dark.svg"> <img alt="cascadeflow Logo" src="./.github/assets/CF_logo_dark.svg" width="80%" style="margin: 20px auto;"> </picture>

Agent Runtime Intelligence Layer

<br>

Cost Savings: 69% (MT-Bench), 93% (GSM8K), 52% (MMLU), 80% (TruthfulQA) savings, retaining 96% GPT-5 quality.

<br>

<img src=".github/assets/CF_python_color.svg" width="22" height="22" alt="Python" style="vertical-align: middle;"/> Python • <img src=".github/assets/CF_ts_color.svg" width="22" height="22" alt="TypeScript" style="vertical-align: middle;"/> TypeScript • <picture><source media="(prefers-color-scheme: dark)" srcset="./.github/assets/LC-logo-bright.png"><source media="(prefers-color-scheme: light)" srcset="./.github/assets/LC-logo-dark.png"><img src=".github/assets/LC-logo-dark.png" height="22" alt="LangChain" style="vertical-align: middle;"></picture> LangChain • <img src=".github/assets/CF_openai_color.svg" width="22" height="22" alt="OpenAI" style="vertical-align: middle;"/> OpenAI Agents • <img src=".github/assets/CF_crewai_color.svg" width="22" height="22" alt="CrewAI" style="vertical-align: middle;"/> CrewAI • <img src=".github/assets/CF_pydantic_color.svg" width="22" height="22" alt="PydanticAI" style="vertical-align: middle;"/> PydanticAI • <img src=".github/assets/CF_google_adk_color.svg" width="22" height="22" alt="Google ADK" style="vertical-align: middle;"/> Google ADK • <img src=".github/assets/CF_n8n_color.svg" width="22" height="22" alt="n8n" style="vertical-align: middle;"/> n8n • <picture><source media="(prefers-color-scheme: dark)" srcset="./.github/assets/CF_vercel_bright.svg"><source media="(prefers-color-scheme: light)" srcset="./.github/assets/CF_vercel_dark.svg"><img src=".github/assets/CF_vercel_dark.svg" width="22" height="22" alt="Vercel AI" style="vertical-align: middle;"></picture> Vercel AI • <img src=".github/assets/CF_openclaw_color.svg" width="22" height="22" alt="OpenClaw" style="vertical-align: middle;"/> OpenClaw • Hermes Agent • 📖 Docs • 💡 Examples

</div>

The in-process intelligence layer for AI agents. Optimize cost, latency, quality, budget, compliance, and energy inside the execution loop, not at the HTTP boundary.

cascadeflow works where external proxies can't: per-step model decisions based on agent state, per-tool-call budget gating, runtime stop/continue/escalate actions, and business KPI injection during agent loops. It accumulates insight from every model call, tool result, and quality score, so the agent gets smarter the more it runs. Sub-5ms overhead. Works with LangChain, OpenAI Agents SDK, CrewAI, PydanticAI, Google ADK, n8n, Vercel AI SDK, and Hermes Agent.

Update

Hermes Agent delegation cascading

CascadeFlow now provides a Hermes Agent integration for per-skill model cascading, task-complexity cascading, topic-aware subagent cascading, observe-mode rollout, and auditable decisions without taking over provider credentials, base URLs, fallback chains, or API modes.

bash
pip install cascadeflow
bash
npm install @cascadeflow/core

Why cascadeflow?

Proxy vs In-Process Harness

| Dimension | External Proxy | cascadeflow Harness |
|---|---|---|
| Scope | HTTP request boundary | Inside agent execution loop |
| Dimensions | Cost only | Cost + quality + latency + budget + compliance + energy |
| Latency overhead | 10-50ms network RTT | <5ms in-process |
| Business logic | None | KPI weights and targets |
| Enforcement | None (observe only) | stop, deny_tool, switch_model |
| Auditability | Request logs | Per-step decision traces |

cascadeflow is a library and agent harness: an intelligent AI model cascading package that dynamically selects the optimal model for each query or tool call through speculative execution. It builds on research showing that 40-70% of queries don't require slow, expensive flagship models, and that domain-specific smaller models often outperform large general-purpose models on specialized tasks. The remaining queries, which need advanced reasoning, are automatically escalated to flagship models.

<details> <summary><b>Use Cases</b></summary>
  • Inside-the-Loop Control. Influence decisions at every agent step (model call, tool call, sub-agent handoff), where most cost, delay, and failure actually happen. External proxies only see request boundaries; cascadeflow sees decision boundaries.
  • Multi-Dimensional Optimization. Optimize across cost, latency, quality, budget, compliance/risk, and energy simultaneously, which is relevant to engineering, finance, security, operations, and sustainability stakeholders.
  • Business Logic Injection. Embed KPI weights and policy intent directly into agent behavior at runtime. Shift AI control from static prompt design to live business governance.
  • Runtime Enforcement. Directly steer outcomes with four actions (allow, switch_model, deny_tool, stop) based on current context and policy state. Closes the gap between analytics and execution.
  • Auditability & Transparency. Every runtime decision is traceable and attributable. Supports audit requirements, faster tuning cycles, and trust in regulated or high-stakes workflows.
  • Measurable Value. Prove impact with reproducible metrics on realistic agent workflows: better economics and latency while preserving quality thresholds.
  • Latency Advantage. Proxy-based optimization adds 40-60ms per call. In a 10-step agent loop, that is 400-600ms of avoidable overhead. cascadeflow runs in-process with sub-5ms overhead, which is critical for real-time UX, task throughput, and enterprise SLAs.
  • Framework & Provider Neutral. Works with LangChain, OpenAI Agents SDK, CrewAI, PydanticAI, Google ADK, Vercel AI SDK, n8n, Hermes Agent, and custom frameworks. Unified API across OpenAI, Anthropic, Groq, Ollama, vLLM, Together, and more.
  • Self-Improving Agent Intelligence. Because cascadeflow runs inside the agent loop, it accumulates deep insight into every model call, tool result, quality score, and routing decision over time. This enables cascadeflow to learn which models perform best for which tasks, adapt routing strategies, and continuously improve cost-quality tradeoffs without manual tuning. The agent gets smarter the more it runs.
  • Edge & Local-Hosted AI. Handle most queries with local models (vLLM, Ollama), and automatically escalate complex queries to cloud providers only when needed.
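The Latency Advantage figures above are plain multiplication. A minimal sketch of that arithmetic, using only the per-call numbers quoted in this list (the function name is illustrative, not part of cascadeflow):

```python
def loop_overhead_ms(steps: int, per_call_ms: float) -> float:
    """Total routing overhead accumulated across an agent loop."""
    return steps * per_call_ms


STEPS = 10  # the 10-step loop used in the text

proxy_low = loop_overhead_ms(STEPS, 40)       # proxy, lower bound per call
proxy_high = loop_overhead_ms(STEPS, 60)      # proxy, upper bound per call
in_process_max = loop_overhead_ms(STEPS, 5)   # in-process, "sub-5ms" ceiling

print(f"proxy: {proxy_low:.0f}-{proxy_high:.0f} ms, "
      f"in-process: under {in_process_max:.0f} ms")
```

The gap grows linearly with the number of steps, so longer agent loops benefit more.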

โ„น๏ธ Note: SLMs (under 10B parameters) are sufficiently powerful for 60-70% of agentic AI tasks. Research paper

</details>

How cascadeflow Works

cascadeflow uses speculative execution with quality validation:

  1. Speculatively executes small, fast models first - optimistic execution ($0.15-0.30/1M tokens)
  2. Validates quality of responses using configurable thresholds (completeness, confidence, correctness)
  3. Dynamically escalates to larger models only when quality validation fails ($1.25-3.00/1M tokens)
  4. Learns patterns to optimize future cascading decisions and domain specific routing
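The first three steps above can be sketched in a few lines. This is an illustration of the draft, validate, escalate pattern only, not cascadeflow's implementation: `Draft`, `validate`, `cascade`, the thresholds, and the stub models are all hypothetical, and the learning step (step 4) is left out.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Draft:
    text: str
    confidence: float  # 0.0-1.0, e.g. derived from token logprobs


def validate(draft: Draft, min_confidence: float = 0.7, min_words: int = 3) -> bool:
    """Hypothetical quality gate: a length check plus a confidence check."""
    return len(draft.text.split()) >= min_words and draft.confidence >= min_confidence


def cascade(query: str,
            drafter: Callable[[str], Draft],
            verifier: Callable[[str], Draft]) -> tuple[str, str]:
    """Return (answer, tier). The verifier is only called if the draft fails."""
    draft = drafter(query)                    # step 1: speculative cheap call
    if validate(draft):                       # step 2: quality validation
        return draft.text, "drafter"
    return verifier(query).text, "verifier"   # step 3: escalation


# Stub models standing in for real provider calls.
def small_model(q: str) -> Draft:
    confident = len(q.split()) < 8            # pretend short queries are easy
    return Draft(f"Short answer to: {q}", 0.9 if confident else 0.4)


def large_model(q: str) -> Draft:
    return Draft(f"Detailed answer to: {q}", 0.95)


easy = cascade("What is the capital of France?", small_model, large_model)
hard = cascade("Prove that the sum of two odd integers is always an even integer",
               small_model, large_model)
print(easy[1], hard[1])
```

The short query stays on the drafter; the longer one fails the confidence check and is escalated.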

Zero configuration. Works with YOUR existing models (>17 providers currently supported).

In practice, 60-70% of queries are handled by small, efficient models (an 8-20x cost difference) without requiring escalation.

Result: 40-85% cost reduction, 2-10x faster responses, zero quality loss.
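A back-of-envelope check of the cost figure, as a sketch under stated assumptions: every query pays for the draft, rejected drafts additionally pay for the flagship model, and both models see the same token counts. The prices are picked from the ranges quoted above; the function names are illustrative.

```python
def expected_cost(draft_price: float, flagship_price: float, accept_rate: float) -> float:
    """Expected price per 1M tokens when every query pays the drafter
    and only rejected drafts also pay the flagship model."""
    return draft_price + (1.0 - accept_rate) * flagship_price


def savings(draft_price: float, flagship_price: float, accept_rate: float) -> float:
    """Fractional saving versus sending every query to the flagship model."""
    return 1.0 - expected_cost(draft_price, flagship_price, accept_rate) / flagship_price


# Prices in $ per 1M tokens, taken from the ranges quoted above.
low = savings(draft_price=0.30, flagship_price=3.00, accept_rate=0.60)
high = savings(draft_price=0.15, flagship_price=3.00, accept_rate=0.70)
print(f"savings: {low:.0%} to {high:.0%}")
```

Under these assumptions the saving lands at roughly 50% to 65%, inside the quoted range. Real results depend on the actual acceptance rate and on how many tokens each model consumes.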

┌─────────────────────────────────────────────────────────────┐
│                      cascadeflow Stack                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascade Agent                                        │  │
│  │                                                       │  │
│  │  Orchestrates the entire cascade execution            │  │
│  │  • Query routing & model selection                    │  │
│  │  • Drafter -> Verifier coordination                   │  │
│  │  • Cost tracking & telemetry                          │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Domain Pipeline                                      │  │
│  │                                                       │  │
│  │  Automatic domain classification                      │  │
│  │  • Rule-based detection (CODE, MATH, DATA, etc.)      │  │
│  │  • Optional ML semantic classification                │  │
│  │  • Domain-optimized pipelines & model selection       │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Quality Validation Engine                            │  │
│  │                                                       │  │
│  │  Multi-dimensional quality checks                     │  │
│  │  • Length validation (too short/verbose)              │  │
│  │  • Confidence scoring (logprobs analysis)             │  │
│  │  • Format validation (JSON, structured output)        │  │
│  │  • Semantic alignment (intent matching)               │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Cascading Engine (<2ms overhead)                     │  │
│  │                                                       │  │
│  │  Smart model escalation strategy                      │  │
│  │  • Try cheap models first (speculative execution)     │  │
│  │  • Validate quality instantly                         │  │
│  │  • Escalate only when needed                          │  │
│  │  • Automatic retry & fallback                         │  │
│  └───────────────────────────────────────────────────────┘  │
│                          ↓                                  │
│  ┌───────────────────────────────────────────────────────┐  │
│  │  Provider Abstraction Layer                           │  │
│  │                                                       │  │
│  │  Unified interface for >17 providers                  │  │
│  │  • OpenAI • Anthropic • Groq • Ollama                 │  │
│  │  • Together • vLLM • HuggingFace • LiteLLM            │  │
│  │  • Vercel AI SDK (17+ additional providers)           │  │
│  └───────────────────────────────────────────────────────┘  │
│                                                             │
└─────────────────────────────────────────────────────────────┘
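To make the Domain Pipeline layer concrete, here is a minimal sketch of rule-based domain detection. The keyword rules, `DOMAIN_RULES`, and `detect_domain` are assumptions for illustration; cascadeflow's actual rules and domain list are not shown in this README.

```python
import re

# Hypothetical keyword rules; a real deployment would tune and extend these.
DOMAIN_RULES = {
    "CODE": re.compile(r"\b(python|function|bug|compile|stack trace|regex)\b", re.I),
    "MATH": re.compile(r"\b(solve|integral|equation|prove|derivative)\b"
                       r"|\d+\s*[-+*/^]\s*\d+", re.I),
    "DATA": re.compile(r"\b(csv|dataframe|sql|dataset|aggregate)\b", re.I),
}


def detect_domain(query: str) -> str:
    """Return the first domain whose rule matches, else GENERAL."""
    for domain, pattern in DOMAIN_RULES.items():
        if pattern.search(query):
            return domain
    return "GENERAL"


for q in ("Fix the bug in this Python function",
          "What is 12 * 7?",
          "Summarize this CSV dataset",
          "Tell me a joke"):
    print(f"{detect_domain(q):8} {q}")
```

A rule pass like this is cheap enough to run on every query; the optional ML classifier mentioned in the diagram would only be needed when no rule fires or when rules conflict.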

Harness API

Three tiers of integration, from zero-change observability to full policy control:

Tier 1: Zero-change observability

python
import cascadeflow

cascadeflow.init(mode="observe")
# All OpenAI/Anthropic SDK calls are now tracked. No code changes needed.

Tier 2: Scoped runs with budget

python
import cascadeflow

with cascadeflow.run(budget=0.50, max_tool_calls=10) as session:
    result = await agent.run("Analyze this dataset")
    print(session.summary())  # cost, latency, energy, steps, tool calls
    print(session.trace())    # full decision audit trail
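How a scoped run can gate steps against a budget is sketched below. This is a standalone illustration of the idea, not cascadeflow's code: `Session`, `charge`, `BudgetExceeded`, and this `run` helper are hypothetical names that only mirror the shape of the call above.

```python
from contextlib import contextmanager
from dataclasses import dataclass, field


class BudgetExceeded(RuntimeError):
    """Raised when a step would push the run past one of its limits."""


@dataclass
class Session:
    budget: float
    max_tool_calls: int
    spent: float = 0.0
    tool_calls: int = 0
    events: list = field(default_factory=list)

    def charge(self, cost: float, tool_call: bool = False) -> None:
        """Record one step, refusing it if a limit would be exceeded."""
        if self.spent + cost > self.budget:
            self.events.append(("stop", "budget"))
            raise BudgetExceeded(f"budget {self.budget:.2f} exhausted")
        if tool_call and self.tool_calls + 1 > self.max_tool_calls:
            self.events.append(("deny_tool", "max_tool_calls"))
            raise BudgetExceeded("tool-call limit reached")
        self.spent += cost
        self.tool_calls += int(tool_call)
        self.events.append(("allow", cost))


@contextmanager
def run(budget: float, max_tool_calls: int):
    """Yield a session that tracks spend for the enclosed block."""
    yield Session(budget, max_tool_calls)


with run(budget=0.50, max_tool_calls=10) as session:
    session.charge(0.20)                  # model call
    session.charge(0.25, tool_call=True)  # tool call
    try:
        session.charge(0.10)              # 0.45 + 0.10 would exceed 0.50
    except BudgetExceeded:
        pass

print(session.events)
```

The event list doubles as a tiny audit trail: each entry records whether a step was allowed, denied, or stopped, and why.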

Tier 3: Decorated agents with policy

python
import cascadeflow

@cascadeflow.agent(
    budget=0.20,
    compliance="gdpr",
    kpi_weights={"quality": 0.6, "cost": 0.3, "latency": 0.1},
)
async def my_agent(query: str):
    return await llm.complete(query)
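One plausible way to read `kpi_weights` is as a weighted sum over normalised per-model scores. The sketch below shows that interpretation only; the candidate names, their scores, and both helper functions are hypothetical, and cascadeflow may combine KPIs differently.

```python
KPI_WEIGHTS = {"quality": 0.6, "cost": 0.3, "latency": 0.1}

# Hypothetical per-model scores, each normalised to 0-1 where higher is better
# (so a cheap, fast model scores high on "cost" and "latency").
CANDIDATES = {
    "small-model": {"quality": 0.70, "cost": 0.95, "latency": 0.90},
    "flagship-model": {"quality": 0.95, "cost": 0.20, "latency": 0.40},
}


def weighted_score(metrics: dict, weights: dict) -> float:
    """Weighted sum of the metrics named in `weights`."""
    return sum(weights[k] * metrics[k] for k in weights)


def pick_model(candidates: dict, weights: dict) -> str:
    """Return the candidate with the highest weighted score."""
    return max(candidates, key=lambda name: weighted_score(candidates[name], weights))


balanced = pick_model(CANDIDATES, KPI_WEIGHTS)
quality_only = pick_model(CANDIDATES, {"quality": 1.0, "cost": 0.0, "latency": 0.0})
print(balanced, quality_only)
```

With the weights from the decorator above the small model wins on these example scores; shifting all weight to quality flips the choice to the flagship model.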

Quick Start

<img src=".github/assets/CF_python_color.svg" width="24" height="24" alt="Python"/> Python

bash
pip install cascadeflow[all]
python
from cascadeflow import CascadeAgent, ModelConfig

# Define your cascade - try cheap model first, escalate if needed
agent = CascadeAgent(models=[
    ModelConfig(name="nous/hermes-flash", provider="openai", cost=0.000375),  # Draft model (~$0.375/1M tokens)
    ModelConfig(name="gpt-5", provider="openai", cost=0.00562),  # Verifier model (~$5.62/1M tokens)
])

# Run query - automatically routes to optimal model
result = await agent.run("What's the capital of France?")

print(f"Answer: {result.content}")
print(f"Model used: {result.model_used}")
print(f"Cost: ${result.total_cost:.6f}")
<details> <summary><b>💡 Optional: Use ML-based Semantic Quality Validation</b></summary>

For advanced use cases, you can add ML-based semantic similarity checking to validate that responses align with queries.

Step 1: Install the optional ML package:

bash
pip install cascadeflow[semantic] # Adds semantic similarity via FastEmbed (~80MB model)

Step 2: Use semantic quality validation:

python
from cascadeflow.quality.semantic import SemanticQualityChecker

# Initialize semantic checker (downloads model on first use)
checker = SemanticQualityChecker(
    similarity_threshold=0.5,  # Minimum similarity score (0-1)
    toxicity_threshold=0.7,    # Maximum toxicity score (0-1)
)

# Validate query-response alignment
query = "Explain Python decorators"
response = "Decorators are a way to modify functions using @syntax..."
result = checker.validate(query, response, check_toxicity=True)

print(f"Similarity: {result.similarity:.2%}")
print(f"Passed: {result.passed}")
print(f"Toxic: {result.is_toxic}")

What you get:

  • 🎯 Semantic similarity scoring (query ↔ response alignment)
  • 🛡️ Optional toxicity detection
  • 🔄 Automatic model download and caching
  • 🚀 Fast inference (~100ms per check)
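The idea behind a similarity threshold can be shown without any ML dependency. The sketch below swaps the real embedding model for a toy bag-of-words vector (an explicit simplification) and applies cosine similarity with the same 0.5 cutoff; `embed`, `cosine_similarity`, and `passes` are illustrative names.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def passes(query: str, response: str, similarity_threshold: float = 0.5) -> bool:
    """True when the response is similar enough to the query."""
    return cosine_similarity(embed(query), embed(response)) >= similarity_threshold


on_topic = passes("python decorators", "python decorators wrap functions")
off_topic = passes("python decorators", "the weather is sunny")
print(on_topic, off_topic)
```

Dense embeddings capture paraphrases that word overlap misses, which is why the real checker downloads a model; the thresholding step itself is the same comparison.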

Full example: See semantic_quality_domain_detection.py

</details>

โš ๏ธ GPT-5 Note: GPT-5 streaming requires organization verification. Non-streaming works for all users. Verify here if needed (~15 min). Basic cascadeflow examples work without - GPT-5 is only called when needed (typically 20-30% of requests).

๐Ÿ“– Learn more: Python Documentation | Quickstart Guide | Providers Guide

<br>

<img src=".github/assets/CF_ts_color.svg" width="24" height="24" alt="TypeScript"/> TypeScript

bash
npm install @cascadeflow/core
tsx
import { CascadeAgent, ModelConfig } from '@cascadeflow/core';

// Same API as Python!
const agent = new CascadeAgent({
  models: [
    { name: 'nous/hermes-flash', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
});

const result = await agent.run('What is TypeScript?');
console.log(`Model: ${result.modelUsed}`);
console.log(`Cost: $${result.totalCost}`);
console.log(`Saved: ${result.savingsPercentage}%`);
<details> <summary><b>💡 Optional: ML-based Semantic Quality Validation</b></summary>

For advanced quality validation, enable ML-based semantic similarity checking to ensure responses align with queries.

Step 1: Install the optional ML packages:

bash
npm install @cascadeflow/ml @huggingface/transformers

Step 2: Enable semantic validation in your cascade:

tsx
import { CascadeAgent, SemanticQualityChecker } from '@cascadeflow/core';

const agent = new CascadeAgent({
  models: [
    { name: 'nous/hermes-flash', provider: 'openai', cost: 0.000375 },
    { name: 'gpt-4o', provider: 'openai', cost: 0.00625 },
  ],
  quality: {
    threshold: 0.40,              // Traditional confidence threshold
    requireMinimumTokens: 5,      // Minimum response length
    useSemanticValidation: true,  // Enable ML validation
    semanticThreshold: 0.5,       // 50% minimum similarity
  },
});

// Responses now validated for semantic alignment
const result = await agent.run('Explain TypeScript generics');

Step 3: Or use semantic validation directly:

tsx
import { SemanticQualityChecker } from '@cascadeflow/core';

const checker = new SemanticQualityChecker();

if (await checker.isAvailable()) {
  const result = await checker.checkSimilarity(
    'What is TypeScript?',
    'TypeScript is a typed superset of JavaScript.'
  );
  console.log(`Similarity: ${(result.similarity * 100).toFixed(1)}%`);
  console.log(`Passed: ${result.passed}`);
}

What you get:

  • 🎯 Query-response semantic alignment detection
  • 🚫 Off-topic response filtering
  • 📦 BGE-small-en-v1.5 embeddings (~40MB, auto-downloads)
  • ⚡ Fast CPU inference (~50-100ms with caching)
  • 🔄 Request-scoped caching (50% latency reduction)
  • 🌐 Works in Node.js, Browser, and Edge Functions

Example: semantic-quality.ts

</details>

📖 Learn more: TypeScript Documentation | Quickstart Guide | Node.js Examples

<br>

🔄 Migration Example

Migrate in about five minutes from a direct provider implementation to lower costs, with full cost control and transparency.

Before (Standard Approach)

Cost: $0.000113, Latency: 850ms

python
import openai

# Using expensive model for everything
result = openai.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's 2+2?"}]
)

After (With cascadeflow)

Cost: $0.000007, Latency: 234ms

python
from cascadeflow import CascadeAgent, ModelConfig

agent = CascadeAgent(models=[
    ModelConfig(name="nous/hermes-flash", provider="openai", cost=0.000375),
    ModelConfig(name="gpt-4o", provider="openai", cost=0.00625),
])

result = await agent.run("What's 2+2?")

🔥 Saved: $0.000106 (94% reduction), 3.6x faster
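The headline numbers follow directly from the before and after figures quoted in this example. A short check of the arithmetic:

```python
before_cost, after_cost = 0.000113, 0.000007   # dollars per request
before_latency, after_latency = 850, 234       # milliseconds

saved = before_cost - after_cost
reduction = saved / before_cost
speedup = before_latency / after_latency

print(f"saved ${saved:.6f} ({reduction:.0%} reduction), {speedup:.1f}x faster")
```

These are single-request figures for a trivial query; the percentage for a real workload depends on how often drafts are accepted.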

📊 Learn more: Cost Tracking Guide | Production Best Practices | Performance Optimization

<details> <summary><b>Drop-In Gateway (Existing Apps)</b></summary>

If you already have an app using the OpenAI or Anthropic APIs and want the fastest integration, run the gateway and point your existing client at it:

bash
python -m cascadeflow.server --mode auto --port 8084
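As a rough sketch of what "pointing a client at it" involves, the snippet below builds (but does not send) a chat-completions request using only the standard library. It assumes the gateway exposes an OpenAI-style `/v1/chat/completions` route on the port shown above; that route and `build_request` are assumptions, so check the gateway documentation for the actual paths.

```python
import json
import urllib.request

# Assumption: the gateway listens on localhost:8084 (the port used above) and
# exposes an OpenAI-style /v1/chat/completions route.
GATEWAY_URL = "http://localhost:8084/v1/chat/completions"


def build_request(prompt: str, model: str = "gpt-4o") -> urllib.request.Request:
    """Build (but do not send) a chat-completions request aimed at the gateway."""
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode("utf-8")
    return urllib.request.Request(
        GATEWAY_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_request("What's 2+2?")
print(req.get_method(), req.full_url)

# To actually send it once the gateway is running:
#     with urllib.request.urlopen(req) as resp:
#         print(json.load(resp))
```

With an SDK client the equivalent change is usually just the base URL, leaving the rest of the application code untouched.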
</details>
<details> <summary><b><img src=".github/assets/CF_n8n_color.svg" width="24" height="24" alt="n8n" style="vertical-align: middle;"/> n8n Integration</b></summary>

Use cascadeflow in n8n workflows for no-code AI automation with automatic cost optimization!

Installation

  1. Open n8n
  2. Go to Settings → Community Nodes
  3. Search for: @cascadeflow/n8n-nodes-cascadeflow
  4. Click Install

Two Nodes

| Node | Type | Use case |
|---|---|---|
| CascadeFlow (Model) | Language Model sub-node | Drop-in for any Chain/LLM node |
| CascadeFlow Agent | Standalone agent (main in/out) | Tool calling, memory, multi-step reasoning |

Quick Start (Model):

  1. Add two AI Chat Model nodes (cheap drafter + powerful verifier)
  2. Add CascadeFlow (Model) and connect both models
  3. Connect to Basic LLM Chain or Chain node
  4. Check Logs tab on the Chain node to see cascade decisions

Quick Start (Agent):

  1. Add a Chat Trigger node
  2. Add CascadeFlow Agent and connect it to the trigger
  3. Connect Drafter, Verifier, optional Memory and Tools
  4. Check the Agent Output tab for cascade metadata and trace

Result: 40-85% cost savings in your n8n workflows!

Features:

  • Works with any AI Chat Model node (OpenAI, Anthropic, Ollama, Azure, etc.)
  • Mix providers (e.g., Ollama drafter + GPT-4o verifier)
  • Agent node: tool calling, memory, per-tool routing, tool call validation
  • 16-domain cascading for specialized model routing
  • Real-time flow visualization in Logs/Output tabs

🔌 Learn more: n8n Integration Guide | n8n Package

</details>
<details open> <summary><b>Hermes Agent Integration</b></summary>

Use CascadeFlow as an optional Hermes Agent delegation router for subagents. Hermes keeps provider credentials, base URLs, fallback chains, and API modes; CascadeFlow returns a structured routing decision before Hermes spawns a child agent.

This works as a released CascadeFlow module even before a native Hermes PR is accepted. Users can call the router from a local wrapper, local Hermes fork, or small hook script and keep Hermes' current provider configuration as the final source of truth.

python
from cascadeflow.integrations.hermes import (
    HermesDelegationRequest,
    HermesDelegationRouter,
)

router = HermesDelegationRouter.from_dict({
    "enabled": True,
    "mode": "observe",
    "routes": {
        "code": {
            "provider": "nous",
            "model": "nous/hermes-4.1",
            "reasoning_effort": "high",
        },
        "simple": {
            "provider": "openai",
            "model": "gpt-4.1-mini",
            "reasoning_effort": "low",
        },
    },
})

decision = router.route_delegation(HermesDelegationRequest(
    goal="Debug the failing unit test and propose a patch",
    toolsets=("terminal", "git"),
    loaded_skills=("python", "debugging"),
))

print(decision.to_dict())

What Hermes gets:

  • Per-skill model routing: coding, research, legal/finance, and lightweight utility skills can receive different model and reasoning profiles instead of inheriting one global default.
  • Task-complexity routing: simple delegated tasks can use cheaper/faster models, while hard debugging, architecture, research, or code-generation tasks route to stronger models.
  • Topic-aware subagent routing: subagents can route differently for code, research, data, creative, ops, medical, legal, finance, and other domains.
  • Better subagent economics: Hermes avoids paying flagship-model prices for simple worker tasks.
  • Better quality for hard tasks: difficult subagent work no longer has to inherit a weak or cheap default model.
  • Dry-run/observe mode: Hermes users can see what CascadeFlow would route without changing runtime behavior.
  • Auditability: routing decisions include reason, confidence, domain, complexity, and selected model.
  • Safer rollout: missing CascadeFlow, disabled config, low confidence, high-stakes gaps, or bad routing inputs fall back to Hermes' current behavior.
  • No credential rewrite: Hermes still owns provider credentials, base URLs, fallback chains, and API modes.
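The observe-mode and fallback behaviour described in this list can be pictured as a small decision function. The sketch below is illustrative only: `Decision`, `route`, `ROUTES`, the confidence threshold, and the placeholder model names are assumptions, not the `HermesDelegationRouter` implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    domain: str
    model: Optional[str]   # None means "keep the host agent's default"
    confidence: float
    applied: bool          # False in observe mode or on fallback
    reason: str


ROUTES = {"code": "strong-model", "simple": "cheap-model"}  # placeholder names


def route(domain: str, confidence: float, mode: str = "observe",
          min_confidence: float = 0.6) -> Decision:
    """Return an auditable routing decision, falling back when unsure."""
    model = ROUTES.get(domain)
    if model is None:
        return Decision(domain, None, confidence, False, "no route for domain; fallback")
    if confidence < min_confidence:
        return Decision(domain, None, confidence, False, "low confidence; fallback")
    if mode == "observe":
        return Decision(domain, model, confidence, False, "observe mode; logged only")
    return Decision(domain, model, confidence, True, "route applied")


observed = route("code", 0.9)
enforced = route("code", 0.9, mode="enforce")
unsure = route("simple", 0.3, mode="enforce")
print(observed.reason, "|", enforced.reason, "|", unsure.reason)
```

Every branch returns a reason string, so the host agent can log why a route was or was not applied without changing its own provider configuration.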

Learn more: Hermes Agent Integration Guide

Standalone example: examples/integrations/hermes_delegation_router.py

</details>
<details> <summary><b><picture><source media="(prefers-color-scheme: dark)" srcset="./.github/assets/LC-logo-bright.png"><source media="(prefers-color-scheme: light)" srcset="./.github/assets/LC-logo-dark.png"><img src="./.github/assets/LC-logo-dark.png" width="42" alt="LangChain" style="vertical-align: middle;"></picture> LangChain Integration</b></summary>

Use cascadeflow with LangChain for intelligent model cascading with full LCEL, streaming, and tools support!

Installation

<img src=".github/assets/CF_ts_color.svg" width="18" height="18" alt="TypeScript" style="vertical-align: middle;"/> TypeScript

bash
npm install @cascadeflow/langchain @langchain/core @langchain/openai

<img src=".github/assets/CF_python_color.svg" width="18" height="18" alt="Python" style="vertical-align: middle;"/> Python

bash
pip install cascadeflow langchain-openai

Quick Start

<details open> <summary><b><img src=".github/assets/CF_ts_color.svg" width="18" height="18" alt="TypeScript" style="vertical-align: middle;"/> TypeScript - Drop-in replacement for any LangChain chat model</b></summary>
typescript
import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';
import { StringOutputParser } from '@langchain/core/output_parsers';
import { withCascade } from '@cascadeflow/langchain';

const cascade = withCascade({
  drafter: new ChatOpenAI({ model: 'nous/hermes-flash' }), // $0.15/$0.60 per 1M tokens
  verifier: new ChatAnthropic({ model: 'claude-sonnet-4-5' }), // $3/$15 per 1M tokens
  qualityThreshold: 0.8, // 80% queries use drafter
});

// Use like any LangChain chat model
const result = await cascade.invoke('Explain quantum computing');

// Optional: Enable LangSmith tracing (see https://smith.langchain.com)
// Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true

// Or with LCEL chains (`prompt` is any prompt template you have already defined)
const chain = prompt.pipe(cascade).pipe(new StringOutputParser());
</details> <details> <summary><b><img src=".github/assets/CF_python_color.svg" width="18" height="18" alt="Python" style="vertical-align: middle;"/> Python - Drop-in replacement for any LangChain chat model</b></summary>
python
from langchain_core.output_parsers import StrOutputParser
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic
from cascadeflow.integrations.langchain import CascadeFlow

cascade = CascadeFlow(
    drafter=ChatOpenAI(model="nous/hermes-flash"),       # $0.15/$0.60 per 1M tokens
    verifier=ChatAnthropic(model="claude-sonnet-4-5"),   # $3/$15 per 1M tokens
    quality_threshold=0.8,                               # 80% queries use drafter
)

# Use like any LangChain chat model
result = await cascade.ainvoke("Explain quantum computing")

# Optional: Enable LangSmith tracing (see https://smith.langchain.com)
# Set LANGSMITH_API_KEY, LANGSMITH_PROJECT, LANGSMITH_TRACING=true

# Or with LCEL chains (`prompt` is any prompt template you have already defined)
chain = prompt | cascade | StrOutputParser()
</details> <details> <summary><b>💡 Optional: Cost Tracking with Callbacks (Python)</b></summary>

Track costs, tokens, and cascade decisions with LangChain-compatible callbacks:

python
from cascadeflow.integrations.langchain.langchain_callbacks import get_cascade_callback

# Track costs similar to get_openai_callback()
with get_cascade_callback() as cb:
    response = await cascade.ainvoke("What is Python?")

    print(f"Total cost: ${cb.total_cost:.6f}")
    print(f"Drafter cost: ${cb.drafter_cost:.6f}")
    print(f"Verifier cost: ${cb.verifier_cost:.6f}")
    print(f"Total tokens: {cb.total_tokens}")
    print(f"Successful requests: {cb.successful_requests}")

Features:

  • 🎯 Compatible with get_openai_callback() pattern
  • 💰 Separate drafter/verifier cost tracking
  • 📊 Token usage (including streaming)
  • 🔄 Works with LangSmith tracing
  • ⚡ Near-zero overhead

Full example: See langchain_cost_tracking.py

</details> <details> <summary><b>💡 Optional: Model Discovery & Analysis Helpers (TypeScript)</b></summary>

For discovering optimal cascade pairs from your existing LangChain models, use the built-in discovery helpers:

typescript
import { ChatOpenAI } from '@langchain/openai';
import { ChatAnthropic } from '@langchain/anthropic';
import {
  discoverCascadePairs,
  findBestCascadePair,
  analyzeModel,
  validateCascadePair,
  withCascade,
} from '@cascadeflow/langchain';

// Your existing LangChain models (configured with YOUR API keys)
const myModels = [
  new ChatOpenAI({ model: 'gpt-3.5-turbo' }),
  new ChatOpenAI({ model: 'nous/hermes-flash' }),
  new ChatOpenAI({ model: 'gpt-4o' }),
  new ChatAnthropic({ model: 'claude-3-haiku' }),
  // ... any LangChain chat models
];

// Quick: Find best cascade pair
const best = findBestCascadePair(myModels);
console.log(`Best pair: ${best.analysis.drafterModel} → ${best.analysis.verifierModel}`);
console.log(`Estimated savings: ${best.estimatedSavings}%`);

// Use it immediately
const cascade = withCascade({
  drafter: best.drafter,
  verifier: best.verifier,
});

// Advanced: Discover all valid pairs
const pairs = discoverCascadePairs(myModels, {
  minSavings: 50,             // Only pairs with ≥50% savings
  requireSameProvider: false, // Allow cross-provider cascades
});

// Validate specific pair
const validation = validateCascadePair(best.drafter, best.verifier);
console.log(`Valid: ${validation.valid}`);
console.log(`Warnings: ${validation.warnings}`);

What you get:

  • ๐Ÿ” Automatic discovery of optimal cascade pairs from YOUR models
  • ๐Ÿ’ฐ Estimated cost savings calculations
  • โš ๏ธ Validation warnings for misconfigured pairs
  • ๐Ÿ“Š Model tier analysis (drafter vs verifier candidates)
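The discovery idea (enumerate drafter and verifier pairs, estimate savings, filter by a minimum) can be sketched independently of the library. The example is in Python for brevity; the prices, the 70% acceptance rate, and both function names are hypothetical, and the library's own estimate may use a different formula.

```python
from itertools import permutations

# Hypothetical blended prices in $ per 1M tokens.
PRICES = {"tiny": 0.15, "mid": 1.25, "flagship": 3.00}


def estimated_savings(drafter: str, verifier: str, accept_rate: float = 0.7) -> float:
    """Percent saved versus always calling the verifier, assuming rejected
    drafts pay for both models."""
    cascade_cost = PRICES[drafter] + (1 - accept_rate) * PRICES[verifier]
    return 100 * (1 - cascade_cost / PRICES[verifier])


def discover_pairs(min_savings: float = 50.0):
    """Return (drafter, verifier, savings%) tuples, best first."""
    pairs = [
        (d, v, estimated_savings(d, v))
        for d, v in permutations(PRICES, 2)
        if PRICES[d] < PRICES[v]          # the drafter must be the cheaper model
    ]
    return sorted((p for p in pairs if p[2] >= min_savings), key=lambda p: -p[2])


found = discover_pairs(min_savings=50)
for drafter, verifier, pct in found:
    print(f"{drafter} -> {verifier}: {pct:.0f}% estimated savings")
```

Pairs whose drafter is nearly as expensive as the verifier drop out, because the cost of failed drafts eats the saving.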

Full example: See model-discovery.ts

</details>

Features:

  • ✅ Full LCEL support (pipes, sequences, batch)
  • ✅ Streaming with pre-routing
  • ✅ Tool calling and structured output
  • ✅ LangSmith cost tracking metadata
  • ✅ Cost tracking callbacks (Python)
  • ✅ Works with all LangChain features

🦜 Learn more: LangChain Integration Guide | TypeScript Package | Python Examples

</details>

Resources

Examples

<img src=".github/assets/CF_python_color.svg" width="20" height="20" alt="Python" style="vertical-align: middle;"/> Python Examples:

<details open> <summary><b>Basic Examples</b> - Get started quickly</summary>

| Example | Description | Link |
|---|---|---|
| Basic Usage | Simple cascade setup with OpenAI models | View |
| Preset Usage | Use built-in presets for quick setup | View |
| Tool Execution | Function calling and tool usage | View |
| Streaming Text | Stream responses from cascade agents | View |
| Cost Tracking | Track and analyze costs across queries | View |
| Agentic Multi-Agent | Multi-turn tool loops & agent-as-a-tool delegation | View |
| Multi-Step Cascade | Multi-step agent loops with tool calls | View |

</details> <details> <summary><b>Harness & Enforcement</b> - Budget, compliance, and agent governance</summary>

| Example | Description | Link |
|---|---|---|
| Budget Enforcement | Budget caps with stop actions in enforce mode | View |
| User Budget Tracking | Per-user budget enforcement and tracking | View |
| Guardrails | Safety and content guardrails | View |
| Rate Limiting | Rate limiting for cascades | View |
| User Profile Usage | User-specific routing and configurations | View |
| Stripe Integration | Billing integration with budget enforcement | View |

</details> <details> <summary><b>Framework Integrations</b> - Harness with LangChain, OpenAI Agents, CrewAI, PydanticAI, Google ADK, Hermes Agent</summary>

| Example | Description | Link |
|---|---|---|
| LangChain Harness | cascadeflow harness with LangChain callback handler | View |
| OpenAI Agents Harness | cascadeflow harness with OpenAI Agents SDK | View |
| CrewAI Harness | cascadeflow harness with CrewAI hooks | View |
| PydanticAI Harness | cascadeflow cascade Model with PydanticAI agents | View |
| Google ADK Harness | cascadeflow harness with Google ADK plugin | View |
| LangChain Basic | Simple LangChain cascade setup | View |
| LangChain LCEL Pipeline | LCEL chains with cascade routing | View |
| LangGraph Multi-Agent | LangGraph multi-agent orchestration | View |

</details> <details> <summary><b>Advanced Examples</b> - Production, providers & customization</summary>

| Example | Description | Link |
|---|---|---|
| Production Patterns | Best practices for production deployments | View |
| Multi-Provider | Mix multiple AI providers in one cascade | View |
| Reasoning Models | Use reasoning models (o1/o3, Claude Sonnet 4, DeepSeek-R1) | View |
| Streaming Tools | Stream tool calls and responses | View |
| Batch Processing | Process multiple queries efficiently | View |
| FastAPI Integration | Integrate cascades with FastAPI | View |
| Edge Device | Run cascades on edge devices with local models | View |
| vLLM Example | Use vLLM for local model deployment | View |
| Multi-Instance Ollama | Run draft/verifier on separate Ollama instances | View |
| Custom Cascade | Build custom cascade strategies | View |
| Custom Validation | Implement custom quality validators | View |
| Semantic Quality Detection | ML-based domain and quality detection | View |
| Cost Forecasting | Forecast costs and detect anomalies | View |

</details>

<img src=".github/assets/CF_ts_color.svg" width="20" height="20" alt="TypeScript" style="vertical-align: middle;"/> TypeScript Examples:

<details open> <summary><b>Basic Examples</b> - Get started quickly</summary>
| Example | Description | Link |
|---|---|---|
| Basic Usage | Simple cascade setup (Node.js) | View |
| Tool Calling | Function calling with tools (Node.js) | View |
| Multi-Provider | Mix providers in TypeScript (Node.js) | View |
| Reasoning Models | Use reasoning models (o1/o3, Claude Sonnet 4, DeepSeek-R1) | View |
| Cost Tracking | Track and analyze costs across queries | View |
| Semantic Quality | ML-based semantic validation with embeddings | View |
| Streaming | Stream responses in TypeScript | View |
| Tool Execution | Tool execution engine and result handling | View |
| Streaming Tools | Stream tool calls with event detection | View |
| Agentic Multi-Agent | Multi-turn tool loops & multi-agent orchestration | View |
</details> <details> <summary><b>Advanced Examples</b> - Production, edge & LangChain</summary>
| Example | Description | Link |
|---|---|---|
| Production Patterns | Production best practices (Node.js) | View |
| Multi-Instance Ollama | Run draft/verifier on separate Ollama instances | View |
| Multi-Instance vLLM | Run draft/verifier on separate vLLM instances | View |
| Browser/Edge | Vercel Edge runtime example | View |
| LangChain Basic | Simple LangChain cascade setup | View |
| LangChain Cross-Provider | Haiku → GPT-5 with PreRouter | View |
| LangChain LangSmith | Cost tracking with LangSmith | View |
| LangChain Cost Tracking | Compare cascadeflow vs LangSmith cost tracking | View |
| LangGraph Multi-Agent | LangGraph multi-agent orchestration | View |
| LangChain Tool Risk Gating | Tool routing based on risk and complexity | View |
</details>

📂 View All Python Examples → | View All TypeScript Examples →

Documentation

<details open> <summary><b>Getting Started</b> - Core concepts and basics</summary>
| Guide | Description | Link |
|---|---|---|
| Quickstart | Get started with cascadeflow in 5 minutes | Read |
| Providers Guide | Configure and use different AI providers | Read |
| Presets Guide | Using and creating custom presets | Read |
| Streaming Guide | Stream responses from cascade agents | Read |
| Tools Guide | Function calling and tool usage | Read |
| Cost Tracking | Track and analyze API costs | Read |
| Agentic Patterns | Tool loops, multi-agent, agent-as-a-tool delegation | Read |
| Agent Harness | Budget, compliance, KPI, and energy controls | Read |
| Rollout Guide | Plan your production rollout | Read |
</details> <details> <summary><b>Advanced Topics</b> - Production, customization & integrations</summary>
| Guide | Description | Link |
|---|---|---|
| Production Guide | Best practices for production deployments | Read |
| Enterprise Networking | Proxy, TLS, and network configuration | Read |
| Customization | Custom cascade strategies and validators | Read |
| Observability | Telemetry, logging, and privacy controls | Read |
| LangChain Integration | Use cascadeflow with LangChain | Read |
| OpenAI Agents SDK | Use cascadeflow with OpenAI Agents | Read |
| CrewAI Integration | Use cascadeflow with CrewAI | Read |
| PydanticAI Integration | Cascade Model for PydanticAI agents | Read |
| Google ADK | Use cascadeflow with Google ADK | Read |
| Hermes Agent | Per-skill, complexity, and topic-aware subagent routing | Read |
| n8n Integration | Use cascadeflow in n8n workflows | Read |
| Vercel AI SDK | Middleware for Vercel AI SDK | Read |
</details>

📚 View All Documentation →


Features

| Feature | Benefit |
|---|---|
| 🎯 Speculative Cascading | Tries cheap models first, escalates intelligently |
| 💰 40-85% Cost Savings | Research-backed, proven in production |
| ⚡ 2-10x Faster | Small models respond in <50ms vs 500-2000ms |
| ⚡ Low Latency | Sub-2ms framework overhead, negligible performance impact |
| 🔄 Mix Any Providers | OpenAI, Anthropic, Groq, Ollama, vLLM, Together + LiteLLM (optional) + LangChain integration |
| 👤 User Profile System | Per-user budgets, tier-aware routing, enforcement callbacks |
| ✅ Quality Validation | Automatic checks + semantic similarity (optional ML, ~80MB, CPU) |
| 🎨 Cascading Policies | Domain-specific pipelines, multi-step validation strategies |
| 🧠 Domain Understanding | 15 domains auto-detected (code, medical, legal, finance, math, etc.), routes to specialists |
| 🤖 Drafter/Validator Pattern | 20-60% savings for agent/tool systems |
| 🔧 Tool Calling Support | Universal format, works across all providers |
| 📊 Cost Tracking | Built-in analytics + OpenTelemetry export (vendor-neutral) |
| 🚀 3-Line Integration | Zero architecture changes needed |
| 🔁 Agent Loops | Multi-turn tool execution with automatic tool call, result, re-prompt cycles |
| 🧭 Hermes Agent Routing | Per-skill, task-complexity, and topic-aware subagent routing with observe-mode rollout |
| 📋 Message & Tool Call Lists | Full conversation history with tool_calls and tool_call_id preservation across turns |
| 🪝 Hooks & Callbacks | Telemetry callbacks, cost events, and streaming hooks for observability |
| 🏭 Production Ready | Streaming, batch processing, tool handling, reasoning model support, caching, error recovery, anomaly detection |
| 💳 Budget Enforcement | Per-run and per-user budget caps with automatic stop actions when limits are exceeded |
| 🔒 Compliance Gating | GDPR, HIPAA, PCI, and strict model allowlists: block non-compliant models before execution |
| 📊 KPI-Weighted Routing | Inject business priorities (quality, cost, latency, energy) as weights into every model decision |
| 🌱 Energy Tracking | Deterministic compute-intensity coefficients for carbon-aware AI operations |
| 🔍 Decision Traces | Full per-step audit trail: action, reason, model, cost, budget state, enforcement status |
| ⚙️ Harness Modes | off / observe / enforce: roll out safely with observe, then switch to enforce when ready |
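
To make the "tries cheap models first, escalates intelligently" idea concrete, here is a minimal, library-independent sketch of a two-tier cascade. It is an assumption-laden illustration, not cascadeflow's implementation: `run_cascade`, `CascadeResult`, the toy models, and the 0.7 threshold are all hypothetical names and values invented for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CascadeResult:
    text: str
    model: str       # which tier produced the accepted answer
    escalated: bool  # True when the draft was rejected


def run_cascade(
    query: str,
    draft_model: Callable[[str], str],
    verifier_model: Callable[[str], str],
    quality_check: Callable[[str, str], float],
    threshold: float = 0.7,
) -> CascadeResult:
    """Try the cheap model first; escalate only if the draft scores too low."""
    draft = draft_model(query)
    if quality_check(query, draft) >= threshold:
        return CascadeResult(draft, "draft", False)
    return CascadeResult(verifier_model(query), "verifier", True)


# Toy stand-ins for real model calls and a real quality validator.
def cheap_model(query: str) -> str:
    return "4" if query == "2 + 2?" else "not sure"


def strong_model(query: str) -> str:
    return f"detailed answer to: {query}"


def toy_quality(query: str, answer: str) -> float:
    return 0.0 if "not sure" in answer else 1.0


print(run_cascade("2 + 2?", cheap_model, strong_model, toy_quality))
print(run_cascade("Explain TLS", cheap_model, strong_model, toy_quality))
```

In this sketch the expensive model is called only when the quality check rejects the draft, which is where the cost savings of a cascade would come from.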

License

MIT © see LICENSE file.

Free for commercial use. Attribution appreciated but not required.


Contributing

We ❤️ contributions!

📝 Contributing Guide - Python & TypeScript development setup


Recently Shipped

  • ✅ Agent Loops & Multi-Agent - Multi-turn tool execution, agent-as-a-tool delegation, LangGraph orchestration
  • ✅ Tool Execution Engine - Automatic tool call routing, parallel execution, risk gating
  • ✅ Hooks & Callbacks - Telemetry callbacks, cost events, streaming hooks for observability
  • ✅ Vercel AI SDK Integration - 17+ additional providers with automatic provider detection
  • ✅ OpenClaw Provider - Custom provider for OpenClaw deployments
  • ✅ Gateway Server - Drop-in OpenAI/Anthropic-compatible proxy endpoint
  • ✅ User Tier Management - Cost controls and limits per user tier with advanced routing
  • ✅ Semantic Quality Validators - Lightweight local quality scoring via FastEmbed
  • ✅ Code Complexity Detection - Dynamic cascading based on task complexity analysis
  • ✅ Domain Aware Cascading - ML-based semantic domain detection with per-domain routing

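
The semantic quality validators mentioned in the list above score a response by how close it is to the query in embedding space. The sketch below shows only the similarity-and-threshold step, using a bag-of-words counter as a stand-in for a real embedding model. It does not call FastEmbed or cascadeflow; `toy_embed`, `cosine`, `passes_semantic_check`, and the 0.3 threshold are hypothetical.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in embedding: lowercase word counts (a real setup would use a model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def passes_semantic_check(query: str, answer: str, threshold: float = 0.3) -> bool:
    """Accept the answer when it is similar enough to the query."""
    return cosine(toy_embed(query), toy_embed(answer)) >= threshold


print(passes_semantic_check("capital of France", "the capital of France is Paris"))
print(passes_semantic_check("capital of France", "I like turtles"))
```

With dense embeddings from a real model the same cosine-and-threshold logic applies; only `toy_embed` would need to change.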
Support


Citation

If you use cascadeflow in your research or project, please cite:

```bibtex
@software{cascadeflow2025,
  author = {Lemony Inc., Sascha Buehrle and Contributors},
  title = {cascadeflow: Agent runtime intelligence layer for AI agent workflows},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/lemony-ai/cascadeflow}
}
```

Ready to cut your AI costs by 40-85%?

```bash
pip install cascadeflow
```

```bash
npm install @cascadeflow/core
```

Read the Docs • View Python Examples • View TypeScript Examples • Join Discussions


About

Built with ❤️ by Lemony Inc. and the cascadeflow Community

One cascade. Hundreds of specialists.

New York | Zurich

⭐ Star us on GitHub if cascadeflow helps you save money!


This article is auto-generated from lemony-ai/cascadeflow via the GitHub API. Last fetched: 6/17/2026