Deploy a Qwen 3.6 Agentic RAG: Step-by-Step Walkthrough
Today we'll build and deploy an Agentic RAG powered by Alibaba's latest Qwen 3.6. The LLM and the vector database run on your machine; only the web-search fallback calls an external API (Firecrawl).
What you'll build
A private API where two AI agents collaborate:
- Researcher Agent — retrieves context from a vector database or the web
- Writer Agent — turns that research into a polished answer
Tool stack
| Tool | Role |
|---|---|
| Qwen 3.6 (via Ollama) | Local LLM — no cloud API needed |
| CrewAI | Multi-agent orchestration |
| Firecrawl | Web search when the vector DB doesn't have the answer |
| Qdrant | Local vector database for your knowledge base |
| LitServe | Production-style HTTP API deployment |
Architecture
flowchart LR
A[Client] -->|POST /predict| B[LitServe API]
B --> C[Researcher Agent]
C --> D[Vector DB Tool]
C --> E[Firecrawl Search]
C --> F[Writer Agent]
F --> G[JSON Response]
D --> H[(Qdrant)]
E --> I[Firecrawl API]
C --> J[(Ollama Qwen 3.6)]
F --> J
Flow:
- Client sends a query to LitServe
- Researcher Agent picks the right tool (vector DB or Firecrawl)
- Writer Agent synthesizes the final answer
- LitServe returns JSON to the client
Prerequisites
1. Remove old models (optional cleanup)
If other Ollama models are taking up disk space, run ollama list to see them and ollama rm <model-name> to delete the ones you no longer need.
2. Pull Qwen 3.6
On a 16GB Mac, use the 27B variant: run ollama pull qwen3.6:27b.
Verify that the model was downloaded by running ollama list.
3. Install Python dependencies
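4. Environment variables
Edit .env and set FIRECRAWL_API_KEY. OLLAMA_MODEL and OLLAMA_BASE_URL are optional; server.py falls back to ollama/qwen3.6:27b and http://localhost:11434 when they are unset.
Get a Firecrawl key at firecrawl.dev.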
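5. Start Qdrant
Start Qdrant with Docker so that it listens on port 6333, for example with docker run -p 6333:6333 qdrant/qdrant.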
6. Build the knowledge base
Run python setup_vectordb.py. This embeds 20 ML FAQ chunks into Qdrant using nomic-embed-text-v1.5.
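To make the retrieval step concrete, here is a minimal sketch of what a vector search does conceptually: rank stored chunks by their similarity to the query embedding. The three-dimensional vectors, the sample chunks, and the choice of cosine similarity are illustrative assumptions. The real embeddings come from nomic-embed-text-v1.5 and have far more dimensions, and the actual distance metric depends on how the Qdrant collection is configured.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


# Hypothetical toy "embeddings"; a real model produces these from the text.
chunks = {
    "Use dropout and early stopping to reduce overfitting.": [0.9, 0.1, 0.0],
    "Gradient descent updates weights along the negative gradient.": [0.1, 0.9, 0.1],
    "A confusion matrix summarizes classification errors.": [0.0, 0.2, 0.9],
}


def search(query_vector, top_k=1):
    """Return the top_k chunk texts, most similar first."""
    ranked = sorted(
        chunks,
        key=lambda text: cosine(query_vector, chunks[text]),
        reverse=True,
    )
    return ranked[:top_k]


# A query vector that points mostly along the first axis.
top = search([0.8, 0.2, 0.1])
print(top[0])
```

A vector database does the same ranking at scale, using an index so it does not have to compare the query against every stored vector.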
Step-by-step implementation
The entire server lives in server.py. LitServe calls setup() once at startup, then runs three methods in order for every request:
decode_request() → predict() → encode_response()
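The call order can be illustrated without LitServe itself. The sketch below is a plain-Python simulation, not LitServe's implementation: EchoAPI and handle are hypothetical names, and the only point is that setup() runs once while the other three hooks run for each request.

```python
class EchoAPI:
    """Stand-in with the same four hooks; not a real LitServe class."""

    def setup(self, device):
        # Runs once at startup: build expensive objects (models, crews) here.
        self.setup_calls = getattr(self, "setup_calls", 0) + 1
        self.prefix = f"[{device}] "

    def decode_request(self, request):
        # Pull the payload out of the incoming JSON body.
        return request["query"]

    def predict(self, query):
        # Placeholder for the real work (the crew run in this tutorial).
        return self.prefix + query.upper()

    def encode_response(self, output):
        # Shape the result for the HTTP response.
        return {"output": output}


def handle(api, request):
    """Per-request path: decode, then predict, then encode."""
    return api.encode_response(api.predict(api.decode_request(request)))


api = EchoAPI()
api.setup("cpu")  # once, before any request is served
responses = [handle(api, {"query": q}) for q in ("first", "second")]
print(responses)
```

Keeping heavy initialization in setup() is what lets every later request reuse the same agents and LLM client.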
Step 1: Set up the LLM
CrewAI integrates with Ollama through its LLM class. We point it at your local Qwen 3.6 model:
from crewai import LLM
import os
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "ollama/qwen3.6:27b")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
llm = LLM(
    model=OLLAMA_MODEL,
    base_url=OLLAMA_BASE_URL,
)
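Why qwen3.6:27b? Qwen 3.6 adds stronger agentic reasoning and tool use. Note that the quantized 27B model is about 17GB, slightly more than 16GB of RAM, so on a 16GB machine it runs with no headroom and responses are slow (see Troubleshooting).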
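As a rough sanity check on the ~17GB figure, the size of the weights can be estimated from the parameter count and the number of bits stored per weight. The value of 5 bits per weight below is an illustrative assumption for a 4-to-5-bit quantization, not a published figure for this model, and the estimate ignores the KV cache and runtime overhead.

```python
def estimate_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights alone, in GB (10^9 bytes)."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 1e9


RAM_GB = 16  # the machine size discussed in this walkthrough

# 27B parameters at an assumed ~5 bits per weight.
size_gb = estimate_weights_gb(27, 5.0)
fits_in_ram = size_gb < RAM_GB

print(f"Estimated weights: {size_gb:.1f} GB, fits in {RAM_GB} GB RAM: {fits_in_ram}")
```

Under these assumptions the weights alone are already at or above the available memory, which is consistent with the slow responses listed in the troubleshooting table.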
Step 2: Define the Researcher Agent and Task
The Researcher gets two tools:
- ml_faq_retrieval_tool: searches your Qdrant vector DB
- FirecrawlSearchTool: searches the web for fresh or out-of-scope topics
from crewai import Agent, Task
from crewai_tools import FirecrawlSearchTool
from tools import ml_faq_retrieval_tool
researcher_agent = Agent(
    role="Researcher",
    goal="Research the user's query using the vector database and web search tools",
    backstory=(
        "You are a research assistant. Prefer the ML FAQ retrieval tool for "
        "machine-learning questions. Use Firecrawl web search for recent or "
        "general topics not covered in the knowledge base."
    ),
    verbose=True,
    tools=[ml_faq_retrieval_tool, FirecrawlSearchTool()],
    llm=llm,
)
researcher_task = Task(
    description=(
        "Research the user's query and collect the most relevant context: {query}. "
        "Use the ML FAQ tool first for ML topics. Fall back to Firecrawl for everything else."
    ),
    expected_output="A bullet list of key findings with sources (vector DB or web).",
    agent=researcher_agent,
)
Vector DB tool (tools.py)
The custom tool wraps Qdrant retrieval:
from crewai.tools import tool
from rag_code import COLLECTION_NAME, EmbedData, QdrantVDB, Retriever
@tool("Machine Learning FAQ Retrieval Tool")
def ml_faq_retrieval_tool(query: str) -> str:
    """Retrieve relevant ML FAQ documents from the vector database."""
    retriever = Retriever(QdrantVDB(COLLECTION_NAME), EmbedData())
    return retriever.search(query)
The agent decides which tool to call — that's what makes this "agentic" RAG instead of a fixed retrieve-then-generate pipeline.
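The difference between the two styles can be sketched in a few lines. In this toy illustration, choose_tool is a hypothetical keyword check standing in for the LLM's decision; a real agent reasons over the tool descriptions rather than matching keywords, and both tool functions here are stubs that return placeholder strings.

```python
def faq_tool(query: str) -> str:
    """Stub for the vector DB retrieval tool."""
    return f"[vector DB] FAQ context for: {query}"


def web_tool(query: str) -> str:
    """Stub for the web search tool."""
    return f"[web] search results for: {query}"


TOOLS = {"faq": faq_tool, "web": web_tool}


def choose_tool(query: str) -> str:
    """Hypothetical stand-in for the agent's tool choice."""
    ml_terms = ("overfitting", "gradient", "regularization")
    return "faq" if any(term in query.lower() for term in ml_terms) else "web"


def classic_rag(query: str) -> str:
    # Fixed pipeline: always the same retrieval step.
    return faq_tool(query)


def agentic_rag(query: str) -> str:
    # The tool is selected per query before retrieval happens.
    return TOOLS[choose_tool(query)](query)


print(agentic_rag("How do I avoid overfitting?"))
print(agentic_rag("What is the latest news about Qwen 3.6?"))
```

Adding a third source means registering one more entry in TOOLS; the surrounding pipeline does not change.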
Step 3: Define the Writer Agent and Task
The Writer receives the Researcher's output via context=[researcher_task]:
writer_agent = Agent(
    role="Writer",
    goal="Write a clear, accurate answer using the researcher's findings",
    backstory=(
        "You synthesize research into concise, well-structured answers. "
        "Cite whether information came from the knowledge base or the web."
    ),
    verbose=True,
    llm=llm,
)
writer_task = Task(
    description=(
        "Using the research findings, write a final answer for: {query}. "
        "Keep it concise, factual, and easy to read."
    ),
    expected_output="A polished answer to the user's query.",
    agent=writer_agent,
    context=[researcher_task],
)
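Conceptually, context=[researcher_task] means the Writer's task receives the Researcher's output as extra input. The sketch below mimics that hand-off with plain dictionaries and functions. run_sequential and the task dictionaries are hypothetical and do not reflect CrewAI's internals; they only show the data flow.

```python
def run_sequential(tasks, inputs):
    """Run tasks in order; each one sees the outputs of the tasks named in its context."""
    outputs = {}
    for task in tasks:
        context = [outputs[name] for name in task["context"]]
        outputs[task["name"]] = task["run"](inputs["query"], context)
    # The last task's output is the final result.
    return outputs[tasks[-1]["name"]]


# Stub tasks: the "research" step returns one finding, the "write" step uses it.
researcher = {
    "name": "research",
    "context": [],
    "run": lambda query, context: f"- finding about {query!r} (vector DB)",
}
writer = {
    "name": "write",
    "context": ["research"],
    "run": lambda query, context: "Answer based on:\n" + "\n".join(context),
}

result = run_sequential([researcher, writer], {"query": "overfitting"})
print(result)
```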
Step 4: Set up the Crew
Orchestrate both agents inside LitServe's setup() method (runs once at startup):
from crewai import Crew
self.crew = Crew(
    agents=[researcher_agent, writer_agent],
    tasks=[researcher_task, writer_task],
    verbose=True,
)
Step 5: Decode request
Extract the user query from the incoming JSON body:
def decode_request(self, request):
    query = request.get("query", "").strip()
    if not query:
        raise ValueError("Missing required field: query")
    return query
Example request body:
{"query": "How do I avoid overfitting?"}
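To see how this validation behaves, the same logic can be exercised as a standalone function (written here without self so it runs outside the LitAPI class). The sample payloads are made up for illustration.

```python
def decode_request(request: dict) -> str:
    """Return the trimmed query, or raise if it is missing or blank."""
    query = request.get("query", "").strip()
    if not query:
        raise ValueError("Missing required field: query")
    return query


# A valid body: surrounding whitespace is stripped.
print(decode_request({"query": "  How do I avoid overfitting?  "}))

# Invalid bodies: no "query" key, or only whitespace.
for bad in ({}, {"query": "   "}):
    try:
        decode_request(bad)
    except ValueError as exc:
        print(f"rejected {bad!r}: {exc}")
```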
Step 6: Predict
Pass the query to the Crew. The {query} placeholder in task descriptions is filled from inputs:
def predict(self, query):
    result = self.crew.kickoff(inputs={"query": query})
    return result.raw if hasattr(result, "raw") else str(result)
Behind the scenes:
- Researcher runs and may call vector DB and/or Firecrawl
- Writer reads those findings and drafts the answer
- Qwen 3.6 powers both agents through Ollama
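The {query} substitution can be pictured as ordinary string templating. This sketch uses Python's str.format as a stand-in; CrewAI performs its own interpolation internally, which may differ in detail.

```python
# Task description copied from the Writer task in Step 3.
description = (
    "Using the research findings, write a final answer for: {query}. "
    "Keep it concise, factual, and easy to read."
)

# The same dictionary shape that is passed to kickoff(inputs=...).
inputs = {"query": "How do I avoid overfitting?"}

# str.format stands in for the framework's placeholder interpolation.
filled = description.format(**inputs)
print(filled)
```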
Step 7: Encode response
Wrap the final answer in a JSON object under the output key, which is the field client.py reads.
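The full listing later in this walkthrough implements this step as a one-line method. It is shown here as a standalone function (without self) so it can be run on its own; the sample answer string is made up.

```python
import json


def encode_response(output: str) -> dict:
    """Wrap the answer under the "output" key, as in the full server listing."""
    return {"output": output}


# Serialize the way an HTTP response body would carry it.
body = json.dumps(encode_response("Use regularization and more data."))
print(body)
```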
Step 8: Start the server
if __name__ == "__main__":
    api = AgenticRAGAPI()
    server = ls.LitServer(api, timeout=False)
    server.run(port=8000)
timeout=False disables LitServe's per-request timeout. This matters because agent crews with tool calls can take several minutes on local hardware.
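Client code
client.py sends a POST to /predict and reads the question from an optional --query flag:
import argparse
import requests
# Read the question from --query so the commands below work as written.
parser = argparse.ArgumentParser()
parser.add_argument("--query", default="What is Qwen 3.6?")
args = parser.parse_args()
response = requests.post("http://127.0.0.1:8000/predict", json={"query": args.query}, timeout=600)
print(response.json()["output"])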
Run it:
# Terminal 1
python server.py
# Terminal 2
python client.py --query "How do I avoid overfitting?"
python client.py --query "What is the latest news about Qwen 3.6?"
The second query should trigger Firecrawl because it's not in the ML FAQ knowledge base.
Full server code
For reference, here is the complete server.py:
import os
import litserve as ls
from crewai import Agent, Crew, LLM, Task
from crewai_tools import FirecrawlSearchTool
from dotenv import load_dotenv
from tools import ml_faq_retrieval_tool
load_dotenv()
OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "ollama/qwen3.6:27b")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")
class AgenticRAGAPI(ls.LitAPI):
    def setup(self, device):
        llm = LLM(model=OLLAMA_MODEL, base_url=OLLAMA_BASE_URL)
        researcher_agent = Agent(
            role="Researcher",
            goal="Research the user's query using the vector database and web search tools",
            backstory="...",
            verbose=True,
            tools=[ml_faq_retrieval_tool, FirecrawlSearchTool()],
            llm=llm,
        )
        writer_agent = Agent(
            role="Writer",
            goal="Write a clear, accurate answer using the researcher's findings",
            backstory="...",
            verbose=True,
            llm=llm,
        )
        researcher_task = Task(
            description="Research the user's query: {query}",
            expected_output="Key findings with sources.",
            agent=researcher_agent,
        )
        writer_task = Task(
            description="Write a final answer for: {query}",
            expected_output="A polished answer.",
            agent=writer_agent,
            context=[researcher_task],
        )
        self.crew = Crew(
            agents=[researcher_agent, writer_agent],
            tasks=[researcher_task, writer_task],
            verbose=True,
        )
    def decode_request(self, request):
        return request["query"]
    def predict(self, query):
        return self.crew.kickoff(inputs={"query": query}).raw
    def encode_response(self, output):
        return {"output": output}
if __name__ == "__main__":
    api = AgenticRAGAPI()
    server = ls.LitServer(api, timeout=False)
    server.run(port=8000)
Agentic RAG vs classic RAG
| Classic RAG | Agentic RAG (this tutorial) |
|---|---|
| Fixed: always retrieve → generate | Agent chooses tools dynamically |
| Single LLM call | Multi-agent pipeline |
| One data source | Vector DB + web fallback |
| Hard to extend | Add tools without rewriting the pipeline |
Troubleshooting
| Issue | Fix |
|---|---|
| connection refused on port 6333 | Start Qdrant with Docker |
| Ollama model not found | Run ollama pull qwen3.6:27b |
| Very slow responses | Normal on 16GB RAM; close other apps |
| Firecrawl errors | Check FIRECRAWL_API_KEY in .env |
| Empty vector results | Run python setup_vectordb.py first |
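Several of the issues above come down to a service not listening on its port. A small check like the following can narrow that down before digging into logs. It is a sketch using only the standard library; the port numbers are the ones used in this walkthrough, and port_open is a hypothetical helper name.

```python
import socket


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


# Ports taken from this walkthrough.
SERVICES = {"Qdrant": 6333, "Ollama": 11434, "LitServe": 8000}

for name, port in SERVICES.items():
    status = "reachable" if port_open("127.0.0.1", port) else "not reachable"
    print(f"{name} (port {port}): {status}")
```

A reachable port only shows that something is listening; it does not confirm that the model has been pulled or that the collection has been populated.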
What's next
- Replace the sample FAQ with your own documents in rag_code.py
- Add a Gradio UI in front of the LitServe API
- Swap Firecrawl for another search provider
- Deploy LitServe behind Docker or Lightning AI Cloud
Summary
You deployed a Qwen 3.6 Agentic RAG whose LLM and knowledge base stay on your machine:
- Qwen 3.6 runs locally via Ollama
- CrewAI orchestrates Researcher + Writer agents
- Qdrant stores your knowledge base
- Firecrawl fills gaps with live web data
- LitServe exposes everything as a clean REST API
Done!