Deploy a Qwen 3.6 Agentic RAG: Step-by-Step Walkthrough

Today we'll build and deploy an Agentic RAG powered by Alibaba's latest Qwen 3.6. The LLM and the vector database run on your machine; only the web-search fallback (Firecrawl) calls an external service.

What you'll build

A private API where two AI agents collaborate:

Researcher Agent — retrieves context from a vector database or the web
Writer Agent — turns that research into a polished answer

Tool stack

| Tool | Role |
| --- | --- |
| Qwen 3.6 (via Ollama) | Local LLM; no cloud API needed |
| CrewAI | Multi-agent orchestration |
| Firecrawl | Web search when the vector DB doesn't have the answer |
| Qdrant | Local vector database for your knowledge base |
| LitServe | Production-style HTTP API deployment |

Architecture

flowchart LR
    A[Client] -->|POST /predict| B[LitServe API]
    B --> C[Researcher Agent]
    C --> D[Vector DB Tool]
    C --> E[Firecrawl Search]
    C --> F[Writer Agent]
    F --> G[JSON Response]
    D --> H[(Qdrant)]
    E --> I[Firecrawl API]
    C --> J[(Ollama Qwen 3.6)]
    F --> J

Flow:

Client sends a query to LitServe
Researcher Agent picks the right tool (vector DB or Firecrawl)
Writer Agent synthesizes the final answer
LitServe returns JSON to the client
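
The flow above can be sketched in plain Python with stub functions. This is only an illustration of the data flow: `vector_db_tool`, `web_search_tool`, `choose_tool`, and the keyword list are hypothetical stand-ins, and in the actual tutorial the LLM (not a keyword check) decides which tool to call.

```python
import json
from typing import Callable, Dict


# Hypothetical stand-ins for the two tools; the real ones call Qdrant and Firecrawl.
def vector_db_tool(query: str) -> str:
    return f"[vector DB] notes on: {query}"


def web_search_tool(query: str) -> str:
    return f"[web] results for: {query}"


TOOLS: Dict[str, Callable[[str], str]] = {
    "vector_db": vector_db_tool,
    "web": web_search_tool,
}


def choose_tool(query: str) -> str:
    """Toy routing rule (assumption): the real agent lets the LLM decide."""
    ml_terms = ("overfitting", "cross-validation", "gradient", "regularization")
    return "vector_db" if any(t in query.lower() for t in ml_terms) else "web"


def researcher(query: str) -> str:
    # Step 2 of the flow: pick a tool and collect findings
    return TOOLS[choose_tool(query)](query)


def writer(query: str, findings: str) -> str:
    # Step 3 of the flow: turn findings into an answer
    return f"Answer to '{query}' based on {findings}"


def handle_request(body: str) -> str:
    """JSON in, JSON out: the four steps of the flow in order."""
    query = json.loads(body)["query"]
    findings = researcher(query)
    answer = writer(query, findings)
    return json.dumps({"output": answer})
```

Calling `handle_request('{"query": "How do I avoid overfitting?"}')` routes to the vector DB stub, while a question without any of the listed terms goes to the web stub.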

Prerequisites

1. Remove old models (optional cleanup)

If other Ollama models are taking up disk space, list them and remove the ones you no longer need:

ollama list
ollama rm gemma4:e2b   # example — use your model name

2. Pull Qwen 3.6

This walkthrough was written for a 16GB Mac and uses the 27B variant (see the memory note under Step 1):

ollama pull qwen3.6:27b

Verify:

ollama run qwen3.6:27b "Say hello in one sentence."

3. Install Python dependencies

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

4. Environment variables

cp .env.example .env

Edit .env:

FIRECRAWL_API_KEY=fc-...
OLLAMA_MODEL=ollama/qwen3.6:27b
OLLAMA_BASE_URL=http://localhost:11434

Get a Firecrawl key at firecrawl.dev.
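
It can help to validate these variables once at startup instead of discovering a missing key in the middle of an agent run. The following is a minimal sketch, assuming the three variable names shown above; the `Settings` class and `load_settings` function are hypothetical helpers, not part of the tutorial's repository.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    firecrawl_api_key: str
    ollama_model: str
    ollama_base_url: str


def load_settings(env: Optional[Mapping[str, str]] = None) -> Settings:
    """Read the three variables, failing early if the Firecrawl key is absent."""
    env = os.environ if env is None else env
    key = env.get("FIRECRAWL_API_KEY", "").strip()
    if not key:
        raise RuntimeError("FIRECRAWL_API_KEY is not set; add it to .env")
    return Settings(
        firecrawl_api_key=key,
        # Defaults mirror the values used elsewhere in this walkthrough
        ollama_model=env.get("OLLAMA_MODEL", "ollama/qwen3.6:27b"),
        ollama_base_url=env.get("OLLAMA_BASE_URL", "http://localhost:11434").rstrip("/"),
    )
```

Note that values in `.env` only appear in `os.environ` after something like `load_dotenv()` has run, as `server.py` does.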

5. Start Qdrant

docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

6. Build the knowledge base

python setup_vectordb.py

This embeds 20 ML FAQ chunks into Qdrant using nomic-embed-text-v1.5.
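
To make the idea of embedding and retrieval concrete, here is a toy, in-memory sketch of similarity search. It does not use Qdrant or the real embedding model: the three-dimensional vectors and the `cosine` and `top_k` helpers are made up for illustration, and real embeddings have hundreds of dimensions.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(
    query_vec: Sequence[float],
    index: Dict[str, Sequence[float]],
    k: int = 2,
) -> List[Tuple[str, float]]:
    """Score every stored chunk against the query and keep the k best."""
    scored = [(text, cosine(query_vec, vec)) for text, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy "embeddings" (assumption): hand-written vectors, not model output
index = {
    "Cross-validation estimates generalization error.": [0.9, 0.1, 0.0],
    "Dropout reduces overfitting in neural networks.": [0.1, 0.9, 0.1],
    "Learning rate controls the step size.": [0.0, 0.2, 0.9],
}

best_text, best_score = top_k([1.0, 0.0, 0.0], index, k=1)[0]
print(best_text, round(best_score, 3))
```

A vector database performs the same kind of nearest-neighbor lookup, but with indexing structures that avoid scoring every stored vector.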

Step-by-step implementation

The entire server lives in server.py. LitServe calls setup() once at startup, then runs the other three methods in order for each request:

setup() → decode_request() → predict() → encode_response()
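
The call order can be simulated without LitServe. In this sketch, `EchoAPI` and `serve` are hypothetical stand-ins that only borrow the four hook names; they are not LitServe classes and do not reflect its internals.

```python
from typing import Any, Dict, List


class EchoAPI:
    """Stand-in with the same four hook names; records the order of calls."""

    def __init__(self) -> None:
        self.calls: List[str] = []

    def setup(self, device: str) -> None:
        self.calls.append("setup")
        # Expensive objects (for example the Crew) would be built here, once
        self.prefix = "echo: "

    def decode_request(self, request: Dict[str, Any]) -> str:
        self.calls.append("decode_request")
        return request["query"]

    def predict(self, query: str) -> str:
        self.calls.append("predict")
        return self.prefix + query

    def encode_response(self, output: str) -> Dict[str, str]:
        self.calls.append("encode_response")
        return {"output": output}


def serve(api: EchoAPI, requests: List[Dict[str, Any]]) -> List[Dict[str, str]]:
    """Toy driver: setup once, then the three per-request hooks for each request."""
    api.setup(device="cpu")
    return [api.encode_response(api.predict(api.decode_request(r))) for r in requests]
```

Running `serve` with two requests records one `setup` call followed by two rounds of `decode_request`, `predict`, `encode_response`, which is why the Crew is built in `setup()` rather than in `predict()`.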

Step 1: Set up the LLM

CrewAI integrates with Ollama through its LLM class. We point it at your local Qwen 3.6 model:

from crewai import LLM
import os

OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "ollama/qwen3.6:27b")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")

llm = LLM(
    model=OLLAMA_MODEL,
    base_url=OLLAMA_BASE_URL,
)

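Why qwen3.6:27b? Qwen 3.6 adds stronger agentic reasoning and tool use, which both agents rely on. Be aware that the quantized 27B model is about 17GB, which is more than 16GB of RAM can hold, so expect slow responses on such a machine; if a smaller Qwen 3.6 tag is available in the Ollama library, it may be the more practical choice.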
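
The ~17GB figure can be sanity-checked with simple arithmetic: the size of the weights is roughly the parameter count times the bits stored per weight. The sketch below uses a few illustrative bit widths (assumptions, not values taken from the Ollama model card) and ignores the KV cache and runtime overhead, which add to the total.

```python
def weights_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    total_bits = params_billions * 1e9 * bits_per_weight
    return total_bits / 8 / 1e9


# Illustrative bit widths (assumption); real quantization formats mix several
for bits in (4.0, 5.0, 8.0, 16.0):
    print(f"27B at {bits} bits/weight: ~{weights_size_gb(27, bits):.1f} GB")
```

Under these assumptions, a figure near 17GB corresponds to roughly 5 bits per weight, and even a 4-bit version of a 27B model would leave little headroom on a 16GB machine.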

Step 2: Define the Researcher Agent and Task

The Researcher gets two tools:

ml_faq_retrieval_tool — searches your Qdrant vector DB
FirecrawlSearchTool — searches the web for fresh or out-of-scope topics

from crewai import Agent, Task
from crewai_tools import FirecrawlSearchTool
from tools import ml_faq_retrieval_tool

researcher_agent = Agent(
    role="Researcher",
    goal="Research the user's query using the vector database and web search tools",
    backstory=(
        "You are a research assistant. Prefer the ML FAQ retrieval tool for "
        "machine-learning questions. Use Firecrawl web search for recent or "
        "general topics not covered in the knowledge base."
    ),
    verbose=True,
    tools=[ml_faq_retrieval_tool, FirecrawlSearchTool()],
    llm=llm,
)

researcher_task = Task(
    description=(
        "Research the user's query and collect the most relevant context: {query}. "
        "Use the ML FAQ tool first for ML topics. Fall back to Firecrawl for everything else."
    ),
    expected_output="A bullet list of key findings with sources (vector DB or web).",
    agent=researcher_agent,
)

Vector DB tool (`tools.py`)

The custom tool wraps Qdrant retrieval:

from crewai.tools import tool
from rag_code import COLLECTION_NAME, EmbedData, QdrantVDB, Retriever

@tool("Machine Learning FAQ Retrieval Tool")
def ml_faq_retrieval_tool(query: str) -> str:
    """Retrieve relevant ML FAQ documents from the vector database."""
    retriever = Retriever(QdrantVDB(COLLECTION_NAME), EmbedData())
    return retriever.search(query)

The agent decides which tool to call — that's what makes this "agentic" RAG instead of a fixed retrieve-then-generate pipeline.

Step 3: Define the Writer Agent and Task

The Writer receives the Researcher's output via context=[researcher_task]:

writer_agent = Agent(
    role="Writer",
    goal="Write a clear, accurate answer using the researcher's findings",
    backstory=(
        "You synthesize research into concise, well-structured answers. "
        "Cite whether information came from the knowledge base or the web."
    ),
    verbose=True,
    llm=llm,
)

writer_task = Task(
    description=(
        "Using the research findings, write a final answer for: {query}. "
        "Keep it concise, factual, and easy to read."
    ),
    expected_output="A polished answer to the user's query.",
    agent=writer_agent,
    context=[researcher_task],
)

Step 4: Set up the Crew

Orchestrate both agents inside LitServe's setup() method (runs once at startup):

from crewai import Crew

self.crew = Crew(
    agents=[researcher_agent, writer_agent],
    tasks=[researcher_task, writer_task],
    verbose=True,
)

Step 5: Decode request

Extract the user query from the incoming JSON body:

def decode_request(self, request):
    query = request.get("query", "").strip()
    if not query:
        raise ValueError("Missing required field: query")
    return query

Example request:

{"query": "What is cross-validation and why is it important?"}
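
One edge case worth noting: if a client sends a non-string value such as `{"query": 42}`, calling `.strip()` on it raises `AttributeError` rather than a clear validation error. A slightly stricter variant might look like the sketch below; `extract_query` and the `max_chars` limit are hypothetical additions, not part of the tutorial's code.

```python
from typing import Any, Mapping


def extract_query(request: Mapping[str, Any], max_chars: int = 2000) -> str:
    """Return a cleaned query string or raise ValueError with a specific reason."""
    query = request.get("query")
    if not isinstance(query, str):
        # Covers both a missing field (None) and wrong types such as int or list
        raise ValueError("Field 'query' must be a string")
    query = query.strip()
    if not query:
        raise ValueError("Missing required field: query")
    if len(query) > max_chars:
        # Arbitrary illustrative cap to keep prompts bounded (assumption)
        raise ValueError(f"Query exceeds {max_chars} characters")
    return query
```

Inside `decode_request`, this would be used as `return extract_query(request)`.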

Step 6: Predict

Pass the query to the Crew. The {query} placeholder in task descriptions is filled from inputs:

def predict(self, query):
    result = self.crew.kickoff(inputs={"query": query})
    return result.raw if hasattr(result, "raw") else str(result)
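
As an analogy for the placeholder substitution, the same effect can be reproduced with Python's built-in `str.format`. This is only a sketch of the idea; `fill_placeholders` is a hypothetical helper, and CrewAI's own interpolation logic may differ in its details.

```python
from typing import Dict


def fill_placeholders(template: str, inputs: Dict[str, str]) -> str:
    """Substitute {name} placeholders; raises KeyError if an input is missing."""
    return template.format(**inputs)


# Shortened task descriptions, as in the full server code
researcher_description = "Research the user's query: {query}"
writer_description = "Write a final answer for: {query}"

inputs = {"query": "What is cross-validation?"}
for template in (researcher_description, writer_description):
    print(fill_placeholders(template, inputs))
```

Both task descriptions receive the same `query` value, which is why a single `inputs` dictionary is enough for the whole crew.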

Behind the scenes:

Researcher runs and may call vector DB and/or Firecrawl
Writer reads those findings and drafts the answer
Qwen 3.6 powers both agents through Ollama

Step 7: Encode response

Return the final answer as JSON:

def encode_response(self, output):
    return {"output": output}

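Step 8: Start the server

import litserve as ls

if __name__ == "__main__":
    api = AgenticRAGAPI()
    # timeout=False disables LitServe's per-request timeout
    server = ls.LitServer(api, timeout=False)
    server.run(port=8000)

timeout=False is important: agent crews with tool calls can take several minutes on local hardware, which is longer than LitServe's default request timeout allows.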

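Client code

client.py reads the question from --query and sends a POST to /predict:

import argparse

import requests

parser = argparse.ArgumentParser()
parser.add_argument("--query", default="What is Qwen 3.6?")
args = parser.parse_args()

payload = {"query": args.query}
response = requests.post("http://127.0.0.1:8000/predict", json=payload, timeout=600)
print(response.json()["output"])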

Run it:

# Terminal 1
python server.py

# Terminal 2
python client.py --query "How do I avoid overfitting?"
python client.py --query "What is the latest news about Qwen 3.6?"

The second query should trigger Firecrawl because it's not in the ML FAQ knowledge base.
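
If you prefer not to depend on `requests`, the same call can be made with the standard library. The sketch below assumes the server address and JSON shape used above; `build_predict_request` and `ask` are hypothetical helper names.

```python
import json
import urllib.request


def build_predict_request(
    query: str, base_url: str = "http://127.0.0.1:8000"
) -> urllib.request.Request:
    """Build (but do not send) the POST request for the /predict endpoint."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(query: str, timeout: float = 600.0) -> str:
    """Send the request and return the 'output' field; needs the server running."""
    request = build_predict_request(query)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["output"]
```

Separating request construction from sending makes the payload easy to inspect without a running server.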

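Full server code

For reference, here is server.py in one piece. The backstories and task descriptions are abbreviated here; use the full versions from Steps 2 and 3: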

import os

import litserve as ls
from crewai import Agent, Crew, LLM, Task
from crewai_tools import FirecrawlSearchTool
from dotenv import load_dotenv

from tools import ml_faq_retrieval_tool

load_dotenv()

OLLAMA_MODEL = os.getenv("OLLAMA_MODEL", "ollama/qwen3.6:27b")
OLLAMA_BASE_URL = os.getenv("OLLAMA_BASE_URL", "http://localhost:11434")


class AgenticRAGAPI(ls.LitAPI):
    def setup(self, device):
        llm = LLM(model=OLLAMA_MODEL, base_url=OLLAMA_BASE_URL)

        researcher_agent = Agent(
            role="Researcher",
            goal="Research the user's query using the vector database and web search tools",
            backstory="...",
            verbose=True,
            tools=[ml_faq_retrieval_tool, FirecrawlSearchTool()],
            llm=llm,
        )

        writer_agent = Agent(
            role="Writer",
            goal="Write a clear, accurate answer using the researcher's findings",
            backstory="...",
            verbose=True,
            llm=llm,
        )

        researcher_task = Task(
            description="Research the user's query: {query}",
            expected_output="Key findings with sources.",
            agent=researcher_agent,
        )

        writer_task = Task(
            description="Write a final answer for: {query}",
            expected_output="A polished answer.",
            agent=writer_agent,
            context=[researcher_task],
        )

        self.crew = Crew(
            agents=[researcher_agent, writer_agent],
            tasks=[researcher_task, writer_task],
            verbose=True,
        )

    def decode_request(self, request):
        query = request.get("query", "").strip()
        if not query:
            raise ValueError("Missing required field: query")
        return query

    def predict(self, query):
        return self.crew.kickoff(inputs={"query": query}).raw

    def encode_response(self, output):
        return {"output": output}


if __name__ == "__main__":
    api = AgenticRAGAPI()
    server = ls.LitServer(api, timeout=False)
    server.run(port=8000)

Agentic RAG vs classic RAG

| Classic RAG | Agentic RAG (this tutorial) |
| --- | --- |
| Fixed: always retrieve → generate | Agent chooses tools dynamically |
| Single LLM call | Multi-agent pipeline |
| One data source | Vector DB + web fallback |
| Hard to extend | Add tools without rewriting the pipeline |

Troubleshooting

| Issue | Fix |
| --- | --- |
| `connection refused` on port 6333 | Start Qdrant with Docker |
| Ollama model not found | Run `ollama pull qwen3.6:27b` |
| Very slow responses | Normal on 16GB RAM; close other apps |
| Firecrawl errors | Check `FIRECRAWL_API_KEY` in `.env` |
| Empty vector results | Run `python setup_vectordb.py` first |
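
Several of these issues come down to a service not listening on its port. A small preflight check can confirm this before you start debugging further. This is a sketch that assumes the default ports used in this walkthrough (6333 for Qdrant, 11434 for Ollama, 8000 for LitServe); `port_open` and `preflight` are hypothetical helpers.

```python
import socket
from typing import Dict, Tuple


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Default ports from this walkthrough (assumption: nothing was remapped)
SERVICES: Dict[str, Tuple[str, int]] = {
    "Qdrant": ("localhost", 6333),
    "Ollama": ("localhost", 11434),
    "LitServe": ("localhost", 8000),
}


def preflight() -> Dict[str, bool]:
    """Check every service and report which ones accept connections."""
    return {name: port_open(host, port) for name, (host, port) in SERVICES.items()}
```

Printing the result of `preflight()` shows at a glance which service needs to be started. An open port only proves that something is listening there, not that the right model or collection is loaded.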

What's next

Replace the sample FAQ with your own documents in rag_code.py
Add a Gradio UI in front of the LitServe API
Swap Firecrawl for another search provider
Deploy LitServe behind Docker or Lightning AI Cloud

Summary

You deployed a Qwen 3.6 Agentic RAG that keeps the LLM and the knowledge base on your machine:

Qwen 3.6 runs locally via Ollama
CrewAI orchestrates Researcher + Writer agents
Qdrant stores your knowledge base
Firecrawl fills gaps with live web data
LitServe exposes everything as a clean REST API

Done!