Enhancing RAG Systems for the CRAG Meta KDD Cup

CS245: Big Data Analytics, UCLA

Developed an enhanced Retrieval-Augmented Generation (RAG) system for the Comprehensive RAG (CRAG) benchmark, part of the 2024 Meta KDD Cup. The system addresses persistent RAG challenges (hallucination, limited recall, and context fragmentation) through a five-stage pipeline with progressive improvements in retrieval, reranking, and generation.

Pipeline Architecture

The system processes queries through five sequential stages:

  1. Query Classification: Categorizes queries by temporal dynamics (static, slow-changing, fast-changing, real-time) to inform downstream retrieval strategy. Real-time queries prioritize recent references; static queries target stable knowledge.
  2. Recursive Chunking: Replaces line-by-line chunking with RecursiveTextSplitter (1000-char chunks, 200-char overlap), preserving semantic coherence across chunk boundaries.
  3. Dense Retrieval: Uses multi-qa-distilbert-cos-v1 (trained on 215M QA pairs) for semantic matching. Time-aware filtering restricts retrieved pages based on query time and page modification timestamps for dynamic queries.
  4. Reranking: BAAI/bge-reranker-v2-m3 evaluates query-document relevance with fine-grained scores, critical for ambiguous and multi-hop queries.
  5. Generation: GPT-4o-mini (128K token context) with Chain-of-Thought prompting for structured reasoning, explicit false-premise detection, and time-aware answer generation.
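Stage 2 of the pipeline can be sketched in pure Python. The function below is an illustrative approximation of a recursive, boundary-aware splitter with 1000-character chunks and 200-character overlap; the function name, separator list, and merge logic are assumptions, not the project's actual code:

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Boundary-aware splitter: cut near chunk_size at the coarsest natural
    boundary available; consecutive chunks share `overlap` characters so
    sentences spanning a cut appear intact in at least one chunk."""
    chunks, start = [], 0
    while start < len(text):
        end = min(start + chunk_size, len(text))
        if end < len(text):
            # Prefer paragraph > line > sentence > word boundaries.
            for sep in ("\n\n", "\n", ". ", " "):
                cut = text.rfind(sep, start, end)
                if cut > start:
                    end = cut + len(sep)
                    break
        chunks.append(text[start:end])
        if end == len(text):
            break
        # Step back by `overlap`; the max() guard prevents a stall when a
        # boundary lands very close to the window start.
        start = max(end - overlap, start + 1)
    return chunks
```

Compared with line-by-line chunking, a sentence split across a chunk boundary is still retrievable as a whole, which is what preserves semantic coherence for the dense retriever in stage 3.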

Chain-of-Thought Prompting Strategy

The CoT prompt instructs the model to: classify the question type (8 categories), evaluate reference recency for fast-changing queries, apply multi-step reasoning, respond "I don't know" when references are insufficient, and flag "Invalid question" for false premises.
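A minimal sketch of such a prompt template is below; the exact wording and the example category names are assumptions based on the CRAG question types, not the project's verbatim prompt:

```python
# Hypothetical CoT prompt template; category names follow CRAG's eight
# question types but the wording is illustrative, not the project's prompt.
COT_PROMPT = """You are answering a question using only the references below.

Question time: {query_time}
Question: {question}

References:
{references}

Follow these steps:
1. Classify the question type (e.g. simple, simple-with-condition, set,
   comparison, aggregation, multi-hop, post-processing, false-premise).
2. If the topic is fast-changing, check whether the references are recent
   enough relative to the question time.
3. Reason step by step from the references to a candidate answer.
4. If the question rests on a false premise, answer exactly: Invalid question
5. If the references are insufficient, answer exactly: I don't know
Answer:"""

def build_prompt(question: str, references: list[str], query_time: str) -> str:
    # Join retrieved reference chunks with a visible divider.
    return COT_PROMPT.format(
        question=question,
        references="\n---\n".join(references),
        query_time=query_time,
    )
```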

Results

Progressive component additions showed consistent improvements. The final model (modified_rag_classification_gpt4o) achieved the best overall performance:

Model Variant              | Score  | Exact Acc. | Accuracy | Hallucination | Missing
---------------------------|--------|------------|----------|---------------|--------
RAG Baseline (Llama3)      | −0.139 | 3.22%      | 18.73%   | 32.58%        | 48.69%
RAG Baseline (GPT-4o)      | 0.060  | 5.62%      | 25.69%   | 19.70%        | 54.61%
+ Chunking + Reranking     | 0.057  | 7.12%      | 32.36%   | 26.67%        | 40.97%
+ CoT + Time-Aware (Final) | 0.079  | 15.36%     | 35.96%   | 28.01%        | 36.03%
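The Score column is consistent with the CRAG truthfulness metric, which awards +1 for a correct answer, −1 for a hallucinated one, and 0 for a missing ("I don't know") answer; averaged over queries, this reduces to accuracy minus hallucination rate. A quick check against the table (interpretation inferred from the numbers, not stated in the benchmark excerpt here):

```python
def crag_score(accuracy: float, hallucination: float) -> float:
    # Averaged truthfulness score (+1 correct, -1 hallucinated, 0 missing)
    # simplifies to accuracy minus hallucination rate.
    return accuracy - hallucination

# Each table row, with rates as fractions and the reported score:
rows = {
    "RAG Baseline (Llama3)":      (0.1873, 0.3258, -0.139),
    "RAG Baseline (GPT-4o)":      (0.2569, 0.1970,  0.060),
    "+ Chunking + Reranking":     (0.3236, 0.2667,  0.057),
    "+ CoT + Time-Aware (Final)": (0.3596, 0.2801,  0.079),
}
for name, (acc, hall, reported) in rows.items():
    assert abs(crag_score(acc, hall) - reported) < 1e-3, name
```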

Key Findings

  • LLM selection produced the largest single improvement: switching from Llama-3.2-3B to GPT-4o-mini dramatically boosted score and reduced hallucination.
  • False premise handling improved from 0% to 28.76% exact accuracy through explicit CoT instructions, a major improvement.
  • Domain challenges: Finance remained the hardest domain due to dynamic numerical data. Open-domain queries performed best (42.16% accuracy).
  • Temporal dynamics: Static queries (29.59% accuracy) significantly outperformed fast-changing (13.56%) and real-time (9.17%) queries, highlighting retrieval freshness as a bottleneck.
  • Trade-off: The final model achieves the highest accuracy but not the lowest hallucination; improved retrieval sometimes surfaces partially relevant documents that increase hallucination risk.

Code available at github.com/HARRY1708/CRAG_Meta_KDD_cup.

Figure: RAG model comparison. Progressive RAG improvements compared against the vanilla and Llama3 baselines across all evaluation metrics.