How we built a 3-level RAG with FastAPI and LlamaIndex

Hello everyone! Today we are leaving Themes and CSS aside for a bit to dive into the guts of our latest project: Normatia.

As you know, here at SiloCreativo we love wading into tricky technical territory. When we started developing Normatia, one problem was crystal clear: technical building regulations are strict. If you ask a generic LLM (like ChatGPT) about the Building Code, it sometimes hallucinates. And in architecture, a hallucination can mean a lawsuit.

So a basic API just wasn’t enough. We needed a system that understood context and cross-references, and did it fast. Today we explain how we built a 3-level RAG (Retrieval-Augmented Generation) system with hybrid Guardrails.

The Tech Stack: Speed and Memory

Before explaining the logic, let’s talk about the tools. For this project, we looked for a balance between performance and cost. This is the stack powering Normatia’s AI:

  • FastAPI: Our favorite backend framework. Asynchronous, fast, and perfect for Python.
  • LlamaIndex: The key piece to orchestrate the whole RAG pipeline and reranking.
  • Google Gemini: We use a combination of models. Gemini Embedding to vectorize, Gemini Flash for fast tasks, and Gemini Pro for the final response.
  • Supabase: We use its pgvector extension as a Vector Store. We separated the AI schema to keep everything tidy.
  • Redis: Essential for distributed Rate Limiting.

Let’s get to work! Let’s see how we connect all of this.
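Here is a minimal sketch of how these pieces could be wired together with LlamaIndex. Treat it as illustrative rather than Normatia’s actual code: the integration package names and model IDs vary between LlamaIndex versions, and the connection string and collection name are placeholders.

```python
# Minimal wiring sketch: Supabase (pgvector) as the vector store, Gemini for
# embeddings and generation. Package names and model IDs may differ per version.
from llama_index.core import VectorStoreIndex
from llama_index.embeddings.gemini import GeminiEmbedding
from llama_index.llms.gemini import Gemini
from llama_index.vector_stores.supabase import SupabaseVectorStore

# Supabase exposes a regular Postgres connection string; pgvector lives there.
vector_store = SupabaseVectorStore(
    postgres_connection_string="postgresql://user:pass@db.PROJECT.supabase.co:5432/postgres",
    collection_name="normativa_chunks",  # hypothetical collection name
)

# Index backed by pgvector; Gemini embeddings vectorize queries and chunks.
index = VectorStoreIndex.from_vector_store(
    vector_store,
    embed_model=GeminiEmbedding(model_name="models/text-embedding-004"),
)

# "Flash" for cheap intermediate steps, "Pro" for the final answer.
fast_llm = Gemini(model="models/gemini-1.5-flash")
answer_llm = Gemini(model="models/gemini-1.5-pro")
```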

[Figure: the three-level RAG architecture, orchestrated with LlamaIndex]

Why a 3-level RAG?

The core problem with any regulatory RAG is that a single article is rarely enough.

Imagine you are searching for “stair safety”. The retrieved article might say: “Stairs must comply with table 2.1”. If the AI only reads that paragraph, it cannot answer you because it is missing the table.

That is why we have designed a hierarchical retrieval architecture:

Level 1: Vector Search (The standard)

Here we do the usual. We search for the text blocks most similar to your query via vector similarity. We retrieve several candidates, apply a reranking step with LlamaIndex, and keep the best ones.
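In code, Level 1 could look like the sketch below. The retriever counts and the choice of LLMRerank are illustrative; LlamaIndex ships several rerankers, and this is not Normatia’s exact configuration:

```python
# Level 1 sketch: over-retrieve by similarity, then rerank and keep only the
# strongest candidates. The top_k values here are illustrative.
from llama_index.core.postprocessor import LLMRerank

query = "stair safety requirements"

retriever = index.as_retriever(similarity_top_k=20)  # wide candidate net
candidates = retriever.retrieve(query)

# Rerank the 20 candidates and keep the 5 most relevant blocks.
reranker = LLMRerank(top_n=5, llm=fast_llm)
level_1_nodes = reranker.postprocess_nodes(candidates, query_str=query)
```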

Level 2: Direct References

This is where the magic begins. If a Level 1 block mentions a technical term or refers to another article (e.g. “see definition in Art. 4”), our system queries the database and fetches that extra article. Since we have digitized and connected the regulations, we can perform these cross-queries easily.
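A simplified version of that lookup might look like this. The regex and the fetch_article() helper are hypothetical stand-ins for our digitized schema in Supabase:

```python
# Level 2 sketch: scan each retrieved block for explicit cross-references
# ("Art. 4", "Article 12", ...) and pull those articles from the database.
import re

REF_PATTERN = re.compile(r"[Aa]rt(?:icle|\.)\s*(\d+)")

def direct_references(text: str) -> set[str]:
    """Article numbers that a text block explicitly cites."""
    return set(REF_PATTERN.findall(text))

def expand_level_2(level_1_nodes, fetch_article) -> dict[str, str]:
    """For every Level 1 block, fetch the articles it points to."""
    extra = {}
    for node in level_1_nodes:
        for art_id in direct_references(node.text):
            extra[art_id] = fetch_article(art_id)  # one indexed DB lookup
    return extra
```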

Level 3: Indirect References

We go one step further. We repeat the process for the references of the references.
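In other words, Level 3 is the same expansion applied once more to the Level 2 results, with a cap so the context cannot grow without bound. Reusing the hypothetical helpers from the previous sketch:

```python
# Level 3 sketch: references of the references, capped to avoid blowing up
# the context window. max_articles is an illustrative limit.
def expand_level_3(level_2_articles: dict, fetch_article, max_articles: int = 10):
    extra = {}
    for text in level_2_articles.values():
        for art_id in direct_references(text):
            if art_id not in level_2_articles and len(extra) < max_articles:
                extra[art_id] = fetch_article(art_id)
    return extra
```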

In the end, we deliver a highly structured prompt to the LLM, with sections labeled LEVEL 1 / LEVEL 2 / LEVEL 3. This way, the model has all the necessary context to answer without making things up.
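The final prompt layout could be sketched like this. The section labels mirror the ones we use, but the instruction wording is simplified:

```python
# Prompt assembly sketch: three clearly labeled context sections, then the
# user's question, with an instruction to stay inside the provided articles.
def build_prompt(question: str, level_1_nodes, level_2: dict, level_3: dict) -> str:
    sections = [
        "LEVEL 1 - Directly relevant articles:",
        *(node.text for node in level_1_nodes),
        "LEVEL 2 - Articles cited by the above:",
        *level_2.values(),
        "LEVEL 3 - Second-degree references:",
        *level_3.values(),
        f"Answer using ONLY the articles above.\n\nQuestion: {question}",
    ]
    return "\n\n".join(sections)
```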

Hybrid Guardrails: Saving Tokens and Time

One of the biggest challenges in production is cost and latency. We cannot send every question to the most powerful model, because we would go bankrupt and the user would wait too long.

Because of this, we have implemented a 4-level validation system (Guardrails) that acts as a funnel:

  1. Level 0 (Cleaning): We use regex to strip greetings and empty phrases (“Hello, good morning…”). This is practically instantaneous (~0 ms).
  2. Level 1 (Fast Reject): If we detect obviously off-topic subjects (football, cooking, politics), we cut the request before spending a single token.
  3. Level 2 (Fast Pass): If the query contains technical keywords (like “CTE”, “transmittance”, “DB-HE”), we assume it is valid and skip straight to the search engine. This saves a massive amount of time.
  4. Level 3 (LLM Validation): Only if the question is ambiguous do we use a lightweight model (Gemini Flash) to decide whether it is relevant.
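Put together, the funnel fits in a handful of lines. The keyword lists and the validator call below are illustrative; the real lists are much longer and live in configuration, as we explain in the next section:

```python
# Guardrails funnel sketch: cheap checks first, the LLM only as a last resort.
import re

GREETING_RE = re.compile(r"^(hola|hello|good\s+(morning|afternoon))[,!.\s]*", re.I)
OFF_TOPIC = {"football", "cooking", "politics"}          # illustrative lists
TECHNICAL = {"cte", "transmittance", "db-he"}

def is_valid_query(query: str, llm_validator) -> bool:
    query = GREETING_RE.sub("", query).strip()           # Level 0: clean (~0 ms)
    words = set(query.lower().split())
    if words & OFF_TOPIC:                                # Level 1: fast reject
        return False
    if words & TECHNICAL:                                # Level 2: fast pass
        return True
    return llm_validator(query)                          # Level 3: Gemini Flash
```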

Thanks to this hybrid system, about 50% of the queries are resolved without calling the expensive validator. 😉

Bonus track: Externalized Configuration

To avoid touching code every time we want to adjust the AI’s “sensitivity”, we use Pydantic Settings.

Everything is in environment variables: the number of blocks to retrieve, the similarity threshold, the character limits for each reference level… This way we can “tune” Normatia’s brain directly from the Railway dashboard (where we host the deployment) without redeploying code.
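Assuming pydantic-settings v2, the configuration class could look like the sketch below. The field names and defaults are illustrative; each field maps to an environment variable (RAG_TOP_K, RAG_SIMILARITY_THRESHOLD, …) editable from the Railway UI:

```python
# Config sketch: every knob of the pipeline read from environment variables.
from pydantic_settings import BaseSettings, SettingsConfigDict

class RagSettings(BaseSettings):
    model_config = SettingsConfigDict(env_prefix="RAG_")

    top_k: int = 20                     # candidate blocks retrieved at Level 1
    rerank_top_n: int = 5               # blocks kept after reranking
    similarity_threshold: float = 0.75  # minimum similarity to keep a block
    level_2_char_limit: int = 4000      # max characters of direct references
    level_3_char_limit: int = 2000      # max characters of indirect references

settings = RagSettings()  # values come from the env, falling back to defaults
```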

Conclusion

Building our own product like Normatia has taught us that AI isn’t magic; it is data engineering. The biggest challenge wasn’t connecting the Google API, but designing the reference system so the AI would “think” like an architect, consulting related sources before speaking. You can see it in action here.

If you are interested in the world of Machine Learning applied to the web, or want to know more about how we integrate this with the Frontend in Astro, we will be writing about it soon. We hope you found this helpful!
