Meta-RAG Implementation for Arela's Context Router

Executive Summary

Meta-Retrieval-Augmented Generation (Meta-RAG) is an advanced approach to making RAG systems more intelligent, adaptive, and reliable. Unlike traditional RAG (which blindly retrieves a fixed set of documents for every query), a Meta-RAG system uses a "router" component (often powered by a small LLM or rules) to analyze the user's query and route it to the most relevant knowledge source. It also verifies the quality of retrieved information and can self-correct by refining the query or trying alternate strategies. In essence, Meta-RAG adds a meta-cognitive layer on top of RAG, enabling dynamic decision-making (when/where to retrieve) and self-reflection (checking relevance and factuality). This approach is closely related to emerging ideas like Agentic RAG (using autonomous agents for multi-step retrieval planning) and Self-RAG (LLMs that decide when to retrieve and critique their answers).

Why it matters: For Arela's AI co-founder tool, which has a tri-memory system (vector semantic memory, graph code memory, audit log memory), Meta-RAG can be a game-changer. It allows the system to intelligently choose the right memory (or combination) for each query, dramatically improving the relevance of answers while minimizing noise. By verifying retrieval quality (ensuring the retrieved code/docs actually answer the question), Meta-RAG reduces hallucinations and wrong answers. This is critical for developer trust and for handling complex queries that span code, design decisions, and historical context.

Is it right for Arela? Yes. Arela's use case involves large, heterogeneous codebases (up to 20k files in 15+ languages) and diverse query types (factual, conceptual, procedural, temporal, etc.). A one-size-fits-all retrieval often fails in such scenarios. Meta-RAG's intelligent routing is specifically designed to tackle multi-source challenges – for example, distinguishing a query about "function usage" (graph memory) vs "recent changes" (audit log) vs "how to implement X" (vector memory). By adopting Meta-RAG, Arela can ensure that 95%+ of queries are handled by the appropriate strategy, leading to far more accurate and context-aware responses. This would set Arela apart from simpler copilots, becoming a major competitive differentiator (the "intelligence layer" that others lack).

Recommended Approach: We recommend a custom-built Meta-RAG router integrated into Arela's Layer 1 (small local model layer). This gives full control and minimal bloat, aligning with Arela's philosophy of lightweight, elegant solutions. Key steps include:
• Implement a fast query classifier (using rules + a 1–3B local model) to detect the query type and the memory it needs.
• Develop a strategy router that triggers the right retrieval: vector search for semantic questions, graph DB queries for structural code questions, log search for temporal questions, or combinations when necessary.
• Add a verification module that checks retrieved results (via similarity scores or a tiny LLM "grader") to ensure relevance and catch potential hallucinations.
• Incorporate an iterative refinement loop: if the first retrieval attempt yields low-quality context, automatically reformulate the query or try a different memory, then re-run retrieval (with a strict cap on iterations to avoid loops).

This approach leverages Arela's existing stack (JSON index and SQLite graph) without requiring heavy external frameworks. It keeps all intelligence local (compatible with Ollama-run models) and adds only a small latency (~100–200ms per query for classification/verification). We estimate this could improve answer relevance by 30% or more and cut the hallucination rate by 50% or more, hitting Arela's success criteria (Section 5) while adding negligible cost (all Layer-1 reasoning is done with local models).

Build vs. Buy: Given Arela's preference for owning the stack and the straightforward nature of the router logic, we advise building in-house. Frameworks like LlamaIndex and LangChain do offer routing and query-planning features, but they would introduce unnecessary complexity and dependencies for this use case. A custom solution (likely <500 lines of code) can be achieved in about one week, tailored exactly to Arela's JSON index and graph DB. We can borrow ideas from these frameworks (e.g. LlamaIndex's router, LangChain's dynamic chains) without adopting their entire stack. The result will be easier to maintain, lightweight, and optimized for Arela's environment (running on a MacBook Pro with local models). Given the clear ROI (significant quality gains for minimal cost), we believe Meta-RAG is worth including in the upcoming v4.2.0 rather than waiting. It aligns with our validated 3-layer architecture (Programmatic → Small LLM → Big LLM) by greatly enhancing Layer 1 intelligence without expensive calls to GPT-4.

In summary, Meta-RAG will enable Arela's assistant to "think before it fetches," ensuring each query uses the right tool for the job and that the provided context truly helps answer the question. This will make responses more accurate, context-rich, and trustworthy – delivering a 10x improvement in context understanding that sets Arela apart.

Technical Deep Dive

Meta-RAG Fundamentals and Architectures

Definition: Meta-RAG can be seen as an extension of Retrieval-Augmented Generation in which an additional reasoning layer (the meta layer) controls how retrieval is done and how the results are used. Traditional RAG pipelines follow a static retrieve-then-read sequence: for any query, retrieve the top-K documents (often just by similarity) and feed them to the LLM. Meta-RAG, by contrast, introduces dynamic decision-making: the system first interprets the query's intent, then decides whether to retrieve, what to retrieve (which memory or database), and how many/which results to use, possibly in an iterative manner. In academic terms, this falls under Agentic RAG, which "embeds autonomous agents into the RAG pipeline" to manage retrieval strategies, planning, and reflection. It is also related to Self-Reflective RAG, where the model itself learns to trigger retrieval only when needed and to verify its answers against retrieved evidence.

Meta-RAG vs Traditional RAG: The key difference is adaptability. A traditional RAG system is like a librarian that always pulls a few books off the same shelf, no matter the question. Meta-RAG is like a smart research assistant that first figures out what type of question it is, then decides which library or database to search, and double-checks that the information is useful. This means:
• Dynamic routing: Instead of one vector search for everything, Meta-RAG might choose a keyword search for one query, a graph lookup for another, or even skip retrieval entirely if the answer is obvious or not found (e.g. respond "I don't know").
• Quality control: Traditional RAG trusts that the top-K retrieved chunks are relevant. Meta-RAG explicitly evaluates the retrieved evidence – discarding irrelevant chunks and flagging if not enough good context was found.
• Iterative refinement: A static RAG does one pass; Meta-RAG can do multiple. For tough queries, it can reformulate the query or gather additional info (akin to an agent that "thinks" in steps).

Key Components: A Meta-RAG system generally comprises the following (a minimal control-flow sketch follows the list):
1. Query Understanding: Analyze the user query to extract its intent, type, and requirements. This may involve classification (e.g. is it asking for a fact, a procedure, a comparison, or a timeline?) and identifying key entities or keywords. The output is a structured understanding used for routing.
2. Strategy Selection (Router): Based on the query analysis, the router decides which memory or retrieval method to use. It might choose a single source or multiple in parallel. In advanced setups, this router could be an LLM prompt that outputs a tool choice, or a rule-based system if simple.
3. Retrieval Execution: The chosen retriever(s) are executed: e.g. semantic vector search on documentation, a SQL query on the graph DB, full-text keyword search on logs, etc. In some architectures, multiple retrievers run and their results are combined (fusion). The system may also retrieve in multiple rounds (if an initial query was too broad/narrow).
4. Quality Verification: A post-retrieval check is done on the results. This can include scoring each document for relevance to the query, removing low-relevance hits, and even checking for completeness (did we cover all aspects of the question?). This step can be implemented with heuristics (e.g. overlap of query terms) or a small LLM judging "relevant or not".
5. Answer Generation: The LLM finally generates an answer using the curated context. But Meta-RAG doesn't stop here – it can include an answer evaluation (e.g. a "hallucination checker" that verifies the answer is fully supported by the retrieved docs). If the answer is ungrounded or incomplete, the system can trigger another iteration (reformulate the query or retrieve more info).
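To make the flow concrete, the sketch below wires these five components together the way Arela's Layer 1 could. It is a hedged sketch rather than the actual implementation: classify, the retrievers dict, grade_relevance, and generate are hypothetical placeholders for the wrappers and graders discussed later in this document.

```python
# Sketch of the Meta-RAG control flow (components 1-5 above).
# classify, retrievers, grade_relevance, and generate are placeholders for
# Arela's real Layer-1 helpers; they are injected so the skeleton stays testable.

def answer_query(query, classify, retrievers, grade_relevance, generate,
                 max_attempts=2):
    """retrievers: dict mapping a query type to a retrieve(query) callable.
    Assumes a "vector" entry exists as the default retriever."""
    for _ in range(max_attempts):
        query_type = classify(query)                                  # 1. query understanding
        retrieve = retrievers.get(query_type, retrievers["vector"])   # 2. strategy selection
        chunks = retrieve(query)                                      # 3. retrieval execution
        relevant = [(text, src) for text, src in chunks
                    if grade_relevance(query, text)]                  # 4. quality verification
        if relevant:
            return generate(query, relevant)                          # 5. answer generation
        # Naive reformulation placeholder; the refinement section later
        # replaces this with a proper rewrite step.
        query = query + " (in this codebase)"
    return "I could not find this in the project's knowledge base."
```

Keeping the helpers injectable like this is what lets the same skeleton run against rules-only components today and small-LLM components later without structural changes.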
Proven Architectures: Several designs for Meta-RAG have been explored:
• LlamaIndex Router: LlamaIndex supports a RouterRetriever/RouterQueryEngine, where an LLM-based selector (a BaseSelector) examines the query and chooses among different data sources or indices. For example, one could have separate indices per programming language or topic, and the router directs the query to the correct index. This is essentially Meta-RAG: a top-level LLM agent orchestrating lower-level retrieval.
• LangChain Agents: LangChain doesn't have a single "Meta-RAG" class, but it provides tools to build agents that use retrieval as one of their tools. For instance, one can create a custom agent that, given a query, decides: "Do I use the vector store tool, the SQL database tool, or both?" LangChain's dynamic routing (via its Expression Language or agent tool selection) can implement query classification and multi-hop retrieval. This was demonstrated in LangChain's docs using structured output to pick a data source (e.g. route to Python docs vs JS docs based on the query).
• Self-RAG Pipeline: The Self-Reflective RAG approach (Asai et al. 2023) presents a single LM that effectively contains a meta-controller via special tokens. It decides if retrieval is needed, inserts retrieved text with markers, and uses tokens like ISREL (Is Relevant) and ISSUP (Is Supported) to internally check relevance and factual support. This is a more end-to-end architecture where the LLM is fine-tuned to handle the meta reasoning internally.
• Hybrid Systems: Some advanced systems combine several of the above. For example, a research pipeline might first cluster the knowledge base and generate meta-knowledge summaries per cluster, then use an LLM at query time to pick the best cluster and retrieve from there. Another example is Meta-RAG for evidence re-ranking in medicine, which after initial retrieval applies meta-analysis criteria (reliability, consistency) to filter and re-rank evidence, yielding ~11% improved answer accuracy. These confirm that adding such meta layers produces tangible gains in precision.

Today, Meta-RAG and agentic retrieval are active research areas. A recent 2025 survey highlights how Agentic RAG enables "unparalleled flexibility" via reflection and multi-step adaptation, but also notes challenges in scaling and complexity. In practice, a few early adopters are emerging: IBM's watsonx has agent orchestration for RAG, Amazon has experimented with agentic RAG in their Q&A systems, and frameworks like LangChain, LlamaIndex, and LangGraph are rapidly adding support. However, this is not yet a plug-and-play solution – it requires careful design to avoid introducing latency or instability. Our approach will draw on these proven patterns while keeping the implementation lean and focused on Arela's specific needs.

Query Classification Techniques

Routing a query starts with understanding what kind of question it is. This is essentially a classification problem – we want to assign the query to one (or more) categories that inform the retrieval strategy. For Arela, we have concrete query types to distinguish:
• Factual – asking for a specific fact or definition (e.g. "What does function X return?"). Likely answerable by looking up documentation or code definitions (vector memory).
• Conceptual – asking for an explanation or insight (e.g. "Why do we use a queue here?"). May require retrieving related design docs or code comments, possibly from multiple sources for a comprehensive answer.
• Comparative – comparing two things (e.g. "Difference between function A and B"). This likely needs both items retrieved (code for A and B) and possibly any commentary on them.
• Procedural – asking for steps or "how to" (e.g. "How do I deploy the project?"). Might need code examples or deployment guides (vector memory, possibly a specific file search).
• Temporal – about timeline or history (e.g. "When was this module last updated and why?"). Clearly points to the audit log or commit history (governance memory).

Approach without large LLMs: We aim to classify queries in under 100ms, so a giant model like GPT-4 is off the table. Instead, we consider:
• Rule-based classification: We can craft simple keyword-based rules to catch certain categories. For example, if the query contains "when", "last updated", or "version" ⇒ likely temporal. If it contains "why" or "purpose" ⇒ conceptual. "How do I" or "steps to" ⇒ procedural. "Compare" or "difference between" ⇒ comparative. These rules are fast (constant time) and ensure obvious cases route correctly. However, rule-based logic can miss subtle cues or synonyms (e.g. "explain X" is conceptual even if it doesn't say "why").
• Embedding similarity: Another lightweight option is to embed the query using the same text embedding model we use for vector search (nomic-embed-text in our case) and compare it to prototypes. For instance, we could prepare a few example queries for each category and embed them, then compute the cosine similarity of the user query embedding to each category's examples. The highest similarity indicates the category. This performs semantic classification without a full LLM, using the existing vector model. It would be fast (a single embedding call + ~50 dot products) and entirely local.
• Small local model (1B–3B parameters): We can use a local LLM (such as a 3B-class model or the mentioned llama3.2:1b) to do the classification via prompting. For example: "The user question is: <query>. Categories: [factual, conceptual, comparative, procedural, temporal]. Output the best matching category." A 1–3B parameter model, especially one fine-tuned for instruction or classification tasks, can likely achieve decent accuracy (our target is >85% correct routing). These models (when quantized) can run in a few hundred milliseconds on an M1, which might be acceptable (~300ms). With further optimization, or with a smaller distilled model (like DistilBERT or an ALBERT fine-tuned on question intent), we might push classification well under 200ms.

Accuracy benchmarks: While exact benchmarks are scarce for our custom category set, similar tasks (like intent detection in chatbots) have seen small models reach ~80–90% of the accuracy of GPT-3.5/4 on classification. For example, a fine-tuned BERT or MiniLM can often classify intent with >85% accuracy given enough examples. Large LLMs (GPT-4) may hit ~95% but are slower and cost money. Given Arela's tolerance (>85% is acceptable), a local model or hybrid approach should suffice. We should also note that consistency and determinism are important – rule-based logic excels here (the same input always yields the same output), whereas LLMs can be "random" or require temperature=0 and careful prompting to be stable. One strategy is to combine approaches: use rules for high-confidence patterns and fall back to a small LLM for nuanced cases. This ensemble can boost overall accuracy and reliability.

Handling ambiguous queries: Sometimes a query spans multiple types or is unclear. For instance, "Explain differences and how to implement X vs Y" is both comparative and procedural. Our classifier can output multiple labels or a composite strategy. In such cases, the router might run multiple retrievals (e.g. get the code for X and Y plus any how-to guides). If the classifier is unsure (e.g. confidence scores tie), a simple approach is to default to a broad retrieval – perhaps run the vector search across all content (code and docs) as a fallback. Another approach is to refine the question with a clarifying prompt (feasible for an interactive agent, but in our case we likely cannot ask the user). For now, we plan to bias towards recall on ambiguity: include more content sources rather than risk missing the relevant one. We will log ambiguous cases to continuously improve our classification rules and few-shot examples.

In summary, query classification will likely be a hybrid system: cheap pattern matching plus a semantic check (sketched below). This ensures we meet the speed requirement (<100ms in simple cases, possibly slightly more with embedding/LLM but still under 500ms for the total pipeline). We will validate the classifier on a set of sample queries for each type to ensure it meets the ~85% accuracy bar, tuning rules and examples as needed. Once classification is reliable, it becomes the trigger for the next phase: strategy selection.
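A minimal sketch of such a hybrid classifier follows. The rule patterns mirror the bullets above; the embedding fallback assumes a local embed(text) helper (for example, a thin wrapper around nomic-embed-text served by Ollama), which is not shown and whose name is illustrative.

```python
# Hybrid query classifier: deterministic keyword rules first,
# embedding-prototype similarity as a semantic fallback.
import math
import re

RULES = [
    (r"\b(when|last updated|version|history|changed)\b", "temporal"),
    (r"\b(why|purpose|explain)\b",                        "conceptual"),
    (r"\b(how do i|how to|steps to)\b",                   "procedural"),
    (r"\b(compare|difference between|vs\.?)\b",           "comparative"),
]

# A few example queries per category; in practice these would be embedded
# once at startup and cached, not re-embedded on every call.
PROTOTYPES = {
    "factual":     ["What does function X return?"],
    "conceptual":  ["Why do we use a queue here?"],
    "comparative": ["What is the difference between A and B?"],
    "procedural":  ["How do I deploy the project?"],
    "temporal":    ["When was this module last updated and why?"],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / (norm + 1e-9)

def classify(query: str, embed) -> str:
    q = query.lower()
    for pattern, label in RULES:          # fast path: rule match, constant time
        if re.search(pattern, q):
            return label
    q_vec = embed(query)                  # fallback: semantic similarity to prototypes
    scores = {
        label: max(cosine(q_vec, embed(example)) for example in examples)
        for label, examples in PROTOTYPES.items()
    }
    return max(scores, key=scores.get)    # pick the highest-scoring category
```

The rules cover the high-confidence patterns deterministically, and the prototype comparison handles phrasings the rules miss, which matches the ensemble strategy described above.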
Strategy Selection and Routing

Once we know the query type and intent, the router component selects the optimal retrieval strategy. The goal is to use the right tool for the right question, avoiding the pitfalls of a one-size-fits-all search. Our routing logic will map query categories to one or more of:
• Dense Vector Retrieval (semantic search on code/comments/docs via embeddings)
• Sparse Keyword Retrieval (lexical search, e.g. BM25 or a simple keyword filter)
• Graph Traversal (SQL queries on the code graph database)
• Audit Log Lookup (search or SQL queries on the governance log)

Each has strengths:
• Dense retrieval excels at capturing semantic similarity and can find relevant info even if the wording differs (great for conceptual questions or when code is described in prose). We use this for general knowledge queries on the codebase or docs (the current RAG index).
• Sparse retrieval (exact keyword or BM25) is great when specific terms or identifiers are involved (e.g. "error E1234" or function names). It avoids missing results due to synonyms – if the user query uses the exact term present in the files, a lexical search will catch it directly. Sparse search is also faster and cheaper for short queries, and if a query is very precise, we might not need the complexity of embeddings.
• Graph DB queries are specialized but extremely precise for certain developer questions. If the query implies code structure (imports, function calls, class hierarchy), the graph can answer it directly. For example, "Where is function X called?" or "List all modules that import Y" are best handled by the dependency graph rather than by scanning text.
• The audit log is the go-to for anything temporal or about the history of changes. A query like "Who modified function X last and why?" cannot be answered by the current code state or docs – you need to look at commit messages or our governance memory. That likely means a SQL query on log entries filtered by function X, or a keyword search on commit messages containing "function X".

The router's job is to implement a mapping such as the following (see the dispatch sketch below):
• Temporal query → Audit log retrieval (perhaps with a date filter or sorting by recency).
• Structural code query → Graph DB (compose a SQL query to find the relations).
• General code question → Vector search (the default, especially for conceptual/how-to questions).
• Comparative query → Multiple vector searches (retrieve info for each item to compare).
• Ambiguous/mixed → Multi-retrieval (if in doubt, do a broader search or parallel searches).

For example, if a query is classified as temporal, the router might perform two actions: query the audit log for relevant entries and do a quick vector search on docs for any references (in case the user is referring to release notes or a documentation timeline). If a query is procedural ("how to do X"), the router might focus on documentation (vector search on the README, wiki, etc.) but also try a direct code search for function X to see usage examples (mixing dense + sparse). This dynamic selection is where Meta-RAG shines: it can run multiple retrievers and then merge results if needed. Research shows that combining dense and sparse retrieval often yields better coverage – dense avoids vocabulary mismatch, sparse ensures exact matches, and together they improve the chance of finding the answer. LlamaIndex documentation refers to such combinations as hybrid or fusion retrievers, and our design will incorporate similar logic for certain query types.
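The mapping above translates almost directly into code. The dispatch sketch below assumes the retriever wrappers described later in this document (retrieve_vector, retrieve_keyword, retrieve_graph, and retrieve_audit are illustrative names) and returns the retrieval calls to run for a classified query; it is one reasonable policy, not the final router.

```python
# Route a classified query to one or more retrieval strategies.
# The retrieve_* callables are assumed wrappers over Arela's memories.
def select_strategies(query_type: str, query: str,
                      retrieve_vector, retrieve_keyword,
                      retrieve_graph, retrieve_audit) -> list:
    if query_type == "temporal":
        # Audit log is primary; add a cheap vector pass in case the docs
        # also mention the timeline (release notes, changelog sections).
        return [lambda: retrieve_audit(query), lambda: retrieve_vector(query)]
    if query_type == "structural":
        return [lambda: retrieve_graph(query)]
    if query_type in ("comparative", "procedural"):
        # Mix dense and sparse: semantic coverage plus exact identifier matches.
        return [lambda: retrieve_vector(query), lambda: retrieve_keyword(query)]
    # factual / conceptual / ambiguous: default to semantic search.
    return [lambda: retrieve_vector(query)]
```

Returning zero-argument callables (rather than results) lets the caller decide whether to run them sequentially or in parallel, which the performance discussion later in this section relies on.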
Dense vs. Sparse – when to use which: Ideally, we pick based on query content. If the query contains rare or specific keywords (like an error code or a config parameter name), a sparse search (even just grepping the repo, or using SQLite FTS if we index code text) might yield the answer faster and more exactly. If the query is more conceptual or uses natural language, dense is better. There is even research on using a classifier to choose dense vs sparse per query for optimal effectiveness/efficiency. In our case, the query classifier can directly inform this: e.g. factual queries that mention code entities might trigger a sparse search first, whereas conceptual queries use dense. We can also do hybrid retrieval: run both and then fuse the results. Fusion can be as simple as taking the top N from each, or more sophisticated re-ranking. For instance, we might retrieve 5 results via vector similarity and 5 via keyword, then rank them together (by a weighted score or by interleaving). This ensures that if either method finds something highly relevant, it gets included. The LlamaIndex toolkit mentions methods like Reciprocal Rank Fusion and score merging for combining retrievers, which have been shown to improve recall.

Graph vs. Vector – how to decide: Graph queries are very precise but only apply to certain questions. We will define triggers: words like "calls", "references", "depends on", "all functions that…", or patterns like "which module does X" strongly indicate a graph traversal. The query classifier might output a flag for "structural" vs "informational". If structural, the router will formulate a SQL query (or use a predefined query template) to get the answer. For example, "Where is function foo() called?" → run SELECT callers FROM CallGraph WHERE function='foo'. The result might be a list of call sites. We then convert that into textual context (a snippet like: "Function foo() is called by: [list of functions/files]"). That text can be given to the LLM as context, or even returned directly if the question was straightforward. Important: graph retrieval bypasses the LLM for the retrieval step, but we likely still use the LLM to incorporate the info into a coherent answer (e.g. explaining the result if needed). This means the router can produce intermediate text (like a summary of graph findings) to feed into final answer generation.
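As a sketch of how a structural lookup can be turned into LLM-ready context, the snippet below queries an assumed SQLite table call_graph(caller, callee, file) and formats the rows into a text chunk. The real graph schema may differ from this; the table and column names are illustrative stand-ins for the CallGraph example above.

```python
# Turn a structural question into a SQL lookup and a text snippet for the LLM.
# Schema is illustrative: call_graph(caller TEXT, callee TEXT, file TEXT).
import sqlite3

def retrieve_graph_callers(db_path: str, function_name: str):
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT caller, file FROM call_graph WHERE callee = ?",
            (function_name,),
        ).fetchall()
    finally:
        conn.close()
    if not rows:
        return []
    callers = ", ".join(f"{caller} ({path})" for caller, path in rows)
    snippet = f"Function {function_name}() is called by: {callers}"
    # Return in the same (text, source) shape as the other retrievers,
    # so downstream verification and fusion do not need special cases.
    return [(snippet, "graph:call_graph")]
```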
Combining multiple memory sources: Some queries may benefit from multiple sources. For instance, "Explain how module X works and who last updated it" touches both the code's content and its history. A powerful Meta-RAG will not limit itself to one source when the query spans domains. Our router can retrieve from vector memory for "how it works" (perhaps pulling module X's documentation or code summary) and from the audit log for "who updated it". The results need to be fused into a single context for the LLM. We can concatenate them with clear markers (e.g. "Documentation: … Change Log: …") or even prompt the LLM to consider both sets of info. One risk here is context length – but since we would be selectively retrieving just the relevant chunks from each memory, it should be manageable (maybe a few paragraphs total, well within a big model's 4K–8K token limit). There are various fusion strategies for multi-source retrieval:
• Sequential: Use one source's result to inform another. For example, find a code snippet via vector search, then use an identifier from that snippet to query the graph for relationships. This is like a mini-agent doing a chain (first dense, then graph). If needed, our system could do this programmatically (Layer 0) for certain queries.
• Parallel: Retrieve from all relevant sources independently, then merge. This is simpler and, given the small number of sources (3 in Arela), the overhead is low. We just have to ensure the final answer can integrate them. We might slightly increase top_k if we split among sources (e.g. 3 results from vector + 2 from audit).
• Hierarchical: One source is primary and the others secondary. E.g. primarily use vector, but cross-check critical facts via the audit log if available. This could be part of verification rather than initial retrieval.

Performance trade-offs: Each additional retrieval method adds latency. However, many can be optimized:
• The JSON vector search could be done with an approximate nearest neighbor library (Faiss or similar) in-memory, keeping it fast (<100ms for 50k vectors).
• Keyword search on code might be done with SQLite FTS or just a simple index; on 20k files it is not too slow, especially if scoped by file name or module from the query.
• Graph DB queries are trivial (SQLite on a local DB for specific keys – should be <10ms for well-indexed fields).
• Audit log search depends on size, but if it is indexed by keyword or by date, that too is quick.

So even parallel retrievals could be done in ~200ms total, especially if run concurrently (see the concurrency sketch below). The bigger cost is potentially the router's decision-making if it uses an LLM – but as discussed, that will be a small model or rule-based. We also consider the case of "no relevant source": if the router misroutes (e.g. it thought the answer was in the code, but it was actually something only a human would know), or none of the memories contain the answer, what then? The router could have a fallback tool: e.g. query a web search, or output "I don't know" gracefully. In a closed environment like a local codebase assistant, we prefer honesty: if none of the retrieval strategies yield anything, we instruct the LLM to admit it cannot find the information. This is better than hallucinating an answer. In fact, some architectures have the router explicitly decide "I can't answer from our data" – which is exactly what we would do if all sources come up empty. Finally, we will log the performance of strategy selection: measure how often the chosen strategy actually led to a good answer. If patterns emerge (e.g. always doing both dense and sparse yields better answers), we can refine the router (even learning from feedback). The router is essentially implementing a policy that we can adjust over time.
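Picking up the point above about running retrievals concurrently: because each retriever is a short, independent call, Python's standard thread pool is enough. The sketch below consumes the zero-argument strategy callables produced by the router sketch earlier; the timeout value is an assumption to be tuned.

```python
# Run the selected retrieval strategies concurrently and merge their results.
from concurrent.futures import ThreadPoolExecutor

def run_parallel(strategies, timeout_s: float = 0.3):
    """strategies: zero-argument callables, each returning list[(text, source)]."""
    results = []
    with ThreadPoolExecutor(max_workers=len(strategies) or 1) as pool:
        futures = [pool.submit(strategy) for strategy in strategies]
        for future in futures:
            try:
                results.extend(future.result(timeout=timeout_s))
            except Exception:
                # A failed or slow retriever should not sink the whole query;
                # the verification layer will notice if too little came back.
                continue
    return results
```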
Quality Verification & Retrieval Confidence

A cornerstone of Meta-RAG is ensuring the retrieved context is actually relevant and sufficient for the query. In a naive RAG pipeline, irrelevant or tangential documents might be retrieved (especially by pure similarity), leading the LLM to produce incorrect or off-target answers. We introduce a verification layer to guard against this.

Relevance Check (Retrieval Evaluator): After retrieving candidate documents or code snippets, we assess each one's relevance to the query. One approach is a simple heuristic: if using embeddings, the vector similarity score provides a relevance metric; we can set a threshold (say cosine similarity > 0.3) and drop any chunk below it. Additionally, we can require some keyword overlap – e.g., if the query mentions "function X" and a retrieved chunk doesn't contain "X" or related terms, it might be a false positive from the embedding search. These simple filters can remove obviously unrelated text. For a more nuanced check, we can use a local LLM as a grader. This is akin to the retrieval evaluator in the Self-RAG implementation: a small model reads the question and a retrieved document and outputs "yes" if it is relevant or "no" if not. This could be done with a model like deepseek-1.5b, or even a distilled classifier fine-tuned on relevance (if we had the data). In the DataCamp Self-RAG example, GPT-4 (mini) was used to judge whether a doc "contains keywords or semantic meaning related to the question"; we can replicate this logic locally. Practically, we might not need an LLM call for every document – if our initial retrieval is tuned well, perhaps only the top few need verification. But having this check ensures only high-quality context goes into final answer generation. Documents graded "not relevant" will be discarded or deprioritized. If none or few documents are relevant, that is a red flag: our retrieval might have failed. In such cases, we trigger the iterative refinement (discussed next) – e.g. try a different strategy or rephrase the query. This is a key self-correction mechanism: the system must recognize when it did not actually retrieve useful info.

Coverage and completeness: Beyond individual docs, we want the retrieved set to collectively cover the question. For example, for a comparative query, we should have info on both items being compared. For a "how to implement X" query, we want steps covering all parts of the implementation. This is harder to automate, but we can approximate it by looking at the diversity of results. If our top K are all from the same file or section, we might be missing other angles. We could enforce that the top results come from different files or sources (encouraging breadth). Alternatively, after generating an answer, we can have an Answer Grader that checks whether the answer actually addresses the question fully. If not, that implies the context might have been insufficient.

Hallucination (Support) Check: Even with relevant docs, the LLM might state something not actually supported by them. To catch this, we use a hallucination checker. Essentially, after the LLM drafts an answer, we ask: "Is every claim in this answer backed by the retrieved content?" This can be implemented as another LLM prompt that looks at the answer and the source snippets and outputs yes/no. In Self-RAG, this is the ISSUP token or hallucination grader – if it outputs "no", the answer is not sufficiently grounded. We can do the same with a local model, or even with programmatic checks (e.g. looking for sentences in the answer that have no overlap with any source). However, LLM evaluation is more reliable for subtle factual consistency. If the hallucination checker says the answer isn't supported (or the answer grader says the question wasn't fully answered), we again have a chance to refine or at least warn. For instance, we could:
• Run another retrieval based on whichever part seemed unsupported (perhaps extract a keyword from the unsupported claim and search for it).
• Or append a disclaimer to the answer, such as "(I could not verify some information from the codebase)".
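A sketch of this verification layer is shown below: a heuristic relevance filter (similarity floor plus keyword overlap) and a grounding check that asks a small local model whether the draft answer is supported by the context. The Ollama REST call and the llama3.2:1b model name are assumptions about the local setup, and the 0.3 threshold is only the starting point mentioned above.

```python
# Heuristic relevance filter plus an optional local-LLM grounding check.
# Assumes Ollama's local REST API at /api/generate; the model name is illustrative.
import re
import requests

def filter_relevant(query: str, scored_chunks, min_similarity: float = 0.3):
    """scored_chunks: iterable of (text, source, cosine_similarity) from vector search."""
    query_terms = set(re.findall(r"\w+", query.lower()))
    kept = []
    for text, source, score in scored_chunks:
        if score < min_similarity:
            continue                      # below the similarity floor
        overlap = query_terms & set(re.findall(r"\w+", text.lower()))
        if not overlap:
            continue                      # no shared vocabulary at all: likely a false positive
        kept.append((text, source))
    return kept

def answer_is_grounded(answer: str, context: str, model: str = "llama3.2:1b") -> bool:
    """Ask a small local model whether every claim in the answer is supported by the context."""
    prompt = (
        "Context:\n" + context + "\n\nAnswer:\n" + answer +
        "\n\nIs every claim in the answer supported by the context? Reply yes or no."
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False},
        timeout=30,
    )
    return resp.json().get("response", "").strip().lower().startswith("yes")
```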
Automated Feedback Loop: We design the system so that verification steps feed back into the retrieval loop. Concretely:
1. Retrieve docs.
2. Grade each for relevance. If none are relevant – likely a retrieval miss – consider an alternate strategy (e.g., if we only did vector, try keyword, or broaden the query).
3. If some are relevant but not enough to answer the question (e.g., the question is "compare X and Y" but all retrieved docs are only about X), run a targeted search for Y as well.
4. After generating an answer, grade it for grounding. If it isn't grounded, treat that as a retrieval failure (maybe something needed wasn't retrieved). In an automated setup, we could attempt one more retrieval iteration, possibly using the answer as a clue (although if the answer has hallucinations, that is risky to use).
5. If after one refinement the answer still isn't grounded, the system should abstain or flag it to the user rather than loop indefinitely.

Non-LLM verification: It is worth noting that simpler verification methods can catch many issues at near-zero cost. For example, checking the similarity scores – if the top score is very low, we know the query didn't find a good match (maybe the knowledge isn't there). Also, monitoring whether the final answer includes out-of-vocabulary terms or things not present in the sources (like naming a function that was never retrieved) can signal hallucination. These checks can be done with string matching and might cover common cases. However, subtle logical inaccuracies likely need an LLM's judgment or human evaluation.

In summary, our Meta-RAG will not blindly trust retrieved data; it treats retrieval results as hypotheses to be validated. This verification layer directly addresses hallucinations and irrelevant context – aligning with our goal to reduce hallucinations by >50%. It adds a bit of overhead (a few small-model calls or calculations), but this is justified by the confidence it provides that the answer will be correct. The verification outputs (relevance scores, etc.) can also be logged to continuously measure how well our retrieval is doing and whether our indexing or embeddings need improvement.

Iterative Refinement (Self-Correction Loop)

No retrieval strategy is perfect on the first try, so Meta-RAG systems often include an iterative loop: if the initial answer or context is unsatisfactory, the system can try again in a smarter way. This is analogous to how a human researches: if the first search results aren't helpful, you reformulate the query or try a different resource.

When to trigger a second retrieval pass: Our system will consider a retry in several situations:
• No relevant info found: The relevance checker flags that all retrieved chunks were irrelevant or below the similarity threshold. Clearly, the query either wasn't handled by the right memory or was phrased in a way that confused the search.
• Answer not addressing the question: The answer grader determines that the LLM's answer didn't actually resolve the user's question. For example, the answer might be generic or say "I'm not sure". This indicates the context was incomplete or off the mark.
• Hallucination detected: The hallucination check fails (the answer contains unsupported claims). Possibly the LLM filled gaps with its own knowledge because the context was insufficient or too sparse.
• Ambiguity in the query: If classification or routing had low confidence or multiple possible interpretations, we might preemptively plan a refinement. For example, if a query could mean two things, the system might do a first pass assuming interpretation A; if the answer seems irrelevant, it tries interpretation B on a second pass.

Reformulating queries: The primary tool for refinement is query reformulation. We take the original query and attempt to make it more retrieval-friendly. This could involve adding context, adding synonyms, or focusing it. A classic example: the user asks "Why is the output incorrect?" – very vague. A reformulation might be "Why is the output of function X incorrect when input Y is given?" if we can deduce more specifics. We might use an LLM for this: as in Self-RAG, a question-rewriter role can generate an improved query by reasoning about the user's intent. In our implementation, we could prompt a small LLM: "Rewrite the user's question to be more specific for the knowledge base search" (a rewriter sketch appears below). Another angle is to use any context we do have: e.g., if the first retrieval gave one relevant doc, use a keyword from it in the next query (expanding on a clue).

Alternate strategies: Another refinement approach is switching retrieval methods. If the first attempt was dense, the second could be sparse (and vice versa). Or if we queried the code index and found nothing, maybe query the Q&A forum data (if available), or vice versa. Essentially: don't repeat the exact same approach if it failed – try a different one. This can be rules-based: e.g., "If vector search yields no result above similarity X, try a keyword search on the codebase", or "If the question is conceptual and vector search failed, try searching our design docs or even external knowledge if allowed."

Stopping criteria: We must avoid infinite loops or excessive calls. We will impose:
• Max iterations: likely 2 (initial + one refinement), possibly 3 in rare cases where each step made partial progress. Beyond 2–3, returns diminish and latency/cost increase. Arela's requirement is to self-correct 80%+ of bad retrievals – we don't need 100% perfection at the cost of spiraling queries.
• Time budget: ensure the total retrieve-and-refine cycle stays under our latency target (~500ms overhead). So if one iteration already took 300ms, maybe only one more is feasible. If using an LLM for rewriting, keep it concise (a small model can do this quickly).
• Quality threshold: if the second attempt still has no good context, the answer likely isn't in the knowledge base or is too complex. At that point, it is better to respond with our best effort, or say we cannot find it, rather than loop again.

Example loop: The user asks: "How does the scheduler work?" Suppose Arela's vector search returns some generic info on "scheduling" that isn't specific to our project's scheduler, and the answer comes out vague. The answer grader says it didn't really explain. On the second pass, the system realizes "scheduler" is a broad term and reformulates to "How does the TaskScheduler class in project X work?" (assuming it infers a class name from context). This time it finds the actual TaskScheduler code doc and returns a much better answer. If that fails, we stop and perhaps reply, "The project's scheduler mechanism is not documented clearly." This way, we gave it a second shot with more detail, which will often solve the query (80%+ success in self-correction is our aim).
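The rewriter mentioned under "Reformulating queries" can be a single short prompt to a small local model. A sketch follows, again assuming Ollama's local /api/generate endpoint; the model name and prompt wording are placeholders to be tuned.

```python
# One-shot query reformulation with a small local model (assumed Ollama setup).
import requests

def rewrite_query(query: str, hint: str = "", model: str = "llama3.2:1b") -> str:
    prompt = (
        "Rewrite the user's question to be more specific for a codebase "
        "knowledge-base search. Return only the rewritten question.\n"
        f"Question: {query}\n"
        + (f"Clue from a partially relevant document: {hint}\n" if hint else "")
    )
    resp = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": False,
              "options": {"temperature": 0}},   # deterministic rewrites
        timeout=30,
    )
    rewritten = resp.json().get("response", "").strip()
    return rewritten or query   # fall back to the original query if the model returns nothing
```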
Another scenario: no results at all for a query. This might mean the user asked something outside the project's scope – for example, "What is the best sorting algorithm?" is not in our codebase. The iterative logic could detect zero relevant docs and directly conclude this is out of scope. Instead of trying multiple futile retrievals, the system could break out and answer: "That topic might be outside our codebase." Essentially, the refinement in that case would be to broaden knowledge, which we might not have locally. If an internet connection or plugin is available, an advanced agent might then go to the web (as in Fig. 2 of the NVIDIA example, which uses a web search tool when the local index is not relevant). Arela is currently offline/local, so we will likely just respond with a polite inability to answer.

Avoiding loops: We will implement safeguards such as:
• Compare the new query to the old query – if they are the same (or circling around), break.
• If the same document or answer keeps coming back, there is no point in continuing.
• Ensure the LLM doesn't stubbornly insist on trying again unless we explicitly allow it.

The iterative refinement is essentially a lightweight agentic behavior, but constrained. It will significantly improve robustness: even if our first guess was wrong, the system can recover and still deliver a good answer on the second try, rather than giving a wrong answer or none at all. This feature differentiates Meta-RAG from regular RAG, which often fails silently on a bad retrieval. By v4.2.0, implementing at least one round of self-refinement will fulfill the goal of 80%+ self-correction of bad retrievals.
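Putting the triggers, stopping criteria, and loop guards together, a constrained refinement loop might look like the sketch below. retrieve, grade, and rewrite stand in for the helpers sketched earlier; the iteration cap of 2 reflects the stopping criteria above.

```python
# Constrained self-correction loop: at most one refinement, with loop guards.
def retrieve_with_refinement(query: str, retrieve, grade, rewrite,
                             max_iterations: int = 2):
    """retrieve(q) -> list[(text, source)]; grade(q, text) -> bool; rewrite(q) -> str."""
    seen_queries = {query.strip().lower()}
    current = query
    for _ in range(max_iterations):
        chunks = retrieve(current)
        relevant = [(text, src) for text, src in chunks if grade(current, text)]
        if relevant:
            return relevant, current
        new_query = rewrite(current)
        if new_query.strip().lower() in seen_queries:
            break                      # same query again: stop rather than circle
        seen_queries.add(new_query.strip().lower())
        current = new_query
    # Nothing relevant found: the caller should answer honestly instead of guessing.
    return [], current
```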
Multi-Memory Integration (Vector, Graph, Audit)

Arela's knowledge is split across three distinct "memories". One challenge is how to query and combine these smoothly. A naive approach would be to always search all three for every query, but as discussed, that is inefficient and can introduce noise. Instead, Meta-RAG provides a principled way to route and fuse information from these heterogeneous sources.

Routing across systems: The query classifier will tag which memory is most relevant:
• Vector Memory – the default for most natural-language queries about code behavior, design, or usage. This covers code content (docstrings, comments) and documentation. We rely on the JSON index of embeddings to retrieve relevant chunks.
• Graph Memory – for structural queries (dependencies, function calls, class structures). Instead of an embedding search, we formulate an appropriate graph query (or use a precomputed index, e.g. an adjacency list) to get results.
• Governance Memory (Audit Log) – for temporal queries (who/when changes). We can use SQL or text search on commit messages, PR titles, etc., possibly filtered by file or identifier if provided.

The router logic essentially chooses one of these or a combination. We might implement it as a series of if/elif in code (if query_type == temporal -> audit_query(); elif query_type == structural -> graph_query(); else -> vector_query(); etc.). This is straightforward and transparent. Alternatively, we could formalize it in a config or use an LLM to choose (by giving it descriptions of each memory and asking which to use), but that seems unnecessary given the clear mapping in our use case.

Unified interface: We will write wrapper functions so that, from the perspective of the answering logic, all retrieval results look similar (e.g. a list of text chunks with source info). For instance:
• retrieve_vector(query) -> list of (text, source) from the JSON embeddings.
• retrieve_graph(query) -> list of (text, source), converting graph output into a short explanation or list.
• retrieve_audit(query) -> list of (text, source), retrieving relevant log lines or commit messages.

All three can then be combined easily if needed. Because their content differs (code vs log message vs documentation), we might label them when presenting to the LLM. For example, we could prefix audit entries with "[Change Log]" and graph info with "[Code Structure]" so the LLM knows what it is reading. This helps the final answer attribute and integrate the info properly ("According to the change log, Alice updated X on 2023-10-10 to fix a bug" combined with "and the code shows function X calls Y").

Fusion of heterogeneous results: If a query triggers multiple systems, how do we merge the results? We have a few strategies:
• Concatenation: Simply append the texts from different sources together (in some order, say graph first, then vector, then audit). This is simplest and works if each chunk is self-contained. We should be mindful of token limits – but typically we would include at most 2–3 chunks from each source.
• Prioritization: If one source is clearly more important, put its results first or use them exclusively. E.g., for "when and why was X changed", the audit log is primary (the direct answer is there), but we might include one snippet of code to give context on what X is. In that case, we ensure the audit memory result is definitely included.
• Intermediate reasoning: In a more agentic approach, the system could first use one memory to get an answer, then use that answer to query another. For example, query the graph to get a list of functions, then automatically query vector memory for documentation on each function. This is complex to implement fully, but we can handle a simple case or two with hard-coded logic if needed. Given time constraints, however, parallel retrieval with later fusion by the LLM is likely sufficient.

Conflict resolution: What if two sources seem to give conflicting information? For instance, the documentation says "Module X is deprecated", but the audit log shows a recent update to X (so maybe it is not deprecated after all). The LLM would see both statements. Ideally, it will mention the most up-to-date info (from the audit log) but also note the discrepancy. We can't fully automate conflict resolution, but we can mitigate confusion by providing context (such as including timestamps from the audit logs and making sure documentation is labeled with version info if available). If needed, we could programmatically favor one source: for factual conflicts, trust the audit log (the ground truth of changes) over the docs. But since the final answer is generated by the LLM with everything as context, we rely on it to synthesize intelligently. Testing will reveal whether it needs guidance (like an additional system prompt: "When in doubt, prefer the latest information from the logs").

Hybrid queries example: Consider "Explain the function of the Scheduler module and list recent changes to it." Our classifier tags this as conceptual + temporal. The router then:
1. Runs a vector search: finds a chunk in the README about the Scheduler module, plus maybe code comments from scheduler.py explaining its function.
2. Runs an audit search: finds the last 2 commit messages involving scheduler.py.
3. Combines the results: one part describing what Scheduler does, followed by a part saying "Recent changes: - 2025-11-10: Refactored task queue… (commit by Bob)\n- 2025-10-01: Fixed timing bug…".
4. The final LLM sees all of that and produces an answer: "The Scheduler module is responsible for X (… explanation from docs …). In recent months, it underwent changes such as a refactoring of the task queue on Nov 10, 2025, and a bug fix on Oct 1, 2025, indicating ongoing improvements."
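A sketch of the fusion step is shown below: it takes per-memory results from the retrieve_* wrappers and concatenates them into one labeled context block, using the [Code Structure]/[Change Log] style prefixes and the graph-then-vector-then-audit ordering described above. The label strings and the per-memory chunk cap are illustrative defaults.

```python
# Merge results from the three memories into one labeled context block for the LLM.
SOURCE_LABELS = {
    "vector": "[Documentation]",
    "graph":  "[Code Structure]",
    "audit":  "[Change Log]",
}

def build_context(results_by_memory, max_chunks_per_memory: int = 3) -> str:
    """results_by_memory: e.g. {"vector": [(text, source), ...], "audit": [...]}."""
    sections = []
    # Keep a stable order: structure first, then docs, then history.
    for memory in ("graph", "vector", "audit"):
        for text, source in results_by_memory.get(memory, [])[:max_chunks_per_memory]:
            label = SOURCE_LABELS.get(memory, f"[{memory}]")
            sections.append(f"{label} ({source})\n{text}")
    return "\n\n".join(sections)
```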
This illustrates the multi-memory synergy we aim for. With Meta-RAG, the system can handle that multi-part question in one go, whereas a traditional RAG might only retrieve docs and miss the historical aspect.

Best practices for hybrid retrieval: According to the literature and industry practice:
• Use structured data sources (graph, logs) when you have structured queries – they give precise answers quickly.
• Use unstructured semantic search for broader questions – it covers what structured DBs cannot.
• Always include some verification when mixing sources, to ensure each piece truly contributes to the answer (e.g., don't include irrelevant log lines that just add confusion).
• Keep the final context coherent. Too much disjoint information can overwhelm the LLM. It is better to have fewer, highly relevant pieces than many fragments. Our verification step already filters per source; additionally, we might limit to e.g. the top 3 results overall to maintain focus.

By following these, we aim to harness all three memory systems effectively. Each will be used where it is strongest, and together they provide a 360° view that a single-vector-index approach would lack. The end result: queries of all types get answered accurately, whether they are about code behavior, structure, or history.

Performance and Scalability Considerations

Designing Meta-RAG for Arela means balancing intelligence with speed and resource usage, especially since it must run on a developer's laptop (no powerful cloud servers). We set targets: ideally <500ms overhead for classification + routing, and memory usage that fits comfortably on a modern MacBook Pro.

Latency Benchmarks: A traditional RAG (just vector search + answer) might take ~200ms for retrieval and maybe 1–2 seconds for the final LLM answer (depending on LLM size). Meta-RAG adds steps, but we can parallelize some and use lightweight models for others. Breaking down a typical case:
• Query classification (small model or embedding similarity): ~50–150ms (an embedding is ~20ms; a 1B model might be ~300ms on CPU – this could be reduced with quantization or a GPU if available).
• Routing decision (simple if/else logic): essentially 0ms once classification is done.
• Retrieval from the selected source(s):
  • Vector search: with in-memory Faiss over ~50k vectors, an ANN search can be <50ms.
  • Graph SQL: <10ms for a query result.
  • Audit log search: maybe 50ms with an index, or a bit more if scanning text, but likely small enough to index.
  • If doing two in parallel, add a small overhead for merging results.
• Verification (relevance grading): negligible if heuristic-based. If using an LLM for each of, say, 3 docs: a 1B model can handle a short prompt quickly, maybe 100ms each (this could also be batched or parallelized if needed).
• Final LLM answer generation: unchanged (we still rely on either GPT-4 or a larger local model here). That is the major time cost (potentially a couple of seconds for a complex answer), but it is Layer 2 – outside our <500ms budget for the routing layer.

Summing these, the additional overhead introduced by Meta-RAG (prior to final answer generation) is on the order of a few hundred milliseconds.
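As a rough sanity check of the in-memory vector search numbers, the snippet below times a brute-force cosine search over synthetic data at the scale assumed in this document (50k vectors, 768 dims). It is not a benchmark of Arela's actual index (real timings depend on hardware and on whether Faiss/ANN is used); it only illustrates that the search itself is a single matrix-vector product plus a top-K selection.

```python
# Rough sanity check of in-memory vector search cost (brute force, synthetic data).
import time
import numpy as np

N, DIM, TOP_K = 50_000, 768, 5
index = np.random.rand(N, DIM).astype(np.float32)
index /= np.linalg.norm(index, axis=1, keepdims=True)   # normalize once at startup

query = np.random.rand(DIM).astype(np.float32)
query /= np.linalg.norm(query)

start = time.perf_counter()
scores = index @ query                                   # cosine similarity via dot product
top = np.argpartition(-scores, TOP_K)[:TOP_K]
top = top[np.argsort(-scores[top])]                      # order the top-K by score
elapsed_ms = (time.perf_counter() - start) * 1000
print(f"top-{TOP_K} ids: {top.tolist()}  search time: {elapsed_ms:.1f} ms")
```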
Our goal of <200ms for classification + routing is achievable with careful optimization (e.g., using a fast embedding library and potentially writing the classifier in C++ if needed). Even with verification, we aim to keep the meta steps under 0.5s. In practice, this overhead will often be hidden behind LLM generation time (especially if GPT-4 or a 13B model composes a multi-sentence answer, which might take 2–5s). So the user shouldn't feel lag from the routing – it should feel just as responsive, but with improved answers.

Memory/Compute Footprint:
• The JSON RAG index (46MB) is easily handled in memory. 50k embedding vectors of 768 dims = ~150 MB as 32-bit floats, ~75 MB as FP16, or roughly 38 MB with 8-bit quantization. Possibly our 46MB file is already optimized or uses fewer vectors. We can load this into memory at startup. Searching it can be CPU-bound; using a library like FAISS with 1 CPU thread, 50k v