Enhancing RAG Systems for the CRAG Meta KDD Cup
CS245: Big Data Analytics, UCLA
Developed an enhanced Retrieval-Augmented Generation (RAG) system for the Comprehensive RAG (CRAG) benchmark, part of the Meta KDD Cup 2024. The system addresses persistent RAG challenges (hallucination, limited recall, and context fragmentation) through a five-stage pipeline with progressive improvements in retrieval, reranking, and generation.
Pipeline Architecture
The system processes queries through five sequential stages:
- Query Classification: Categorizes queries by temporal dynamics (static, slow-changing, fast-changing, real-time) to inform downstream retrieval strategy. Real-time queries prioritize recent references; static queries target stable knowledge.
- Recursive Chunking: Replaces line-by-line chunking with RecursiveTextSplitter (1,000-character chunks, 200-character overlap), preserving semantic coherence across chunk boundaries.
- Dense Retrieval: Uses multi-qa-distilbert-cos-v1 (trained on 215M QA pairs) for semantic matching. For dynamic queries, time-aware filtering restricts retrieved pages based on query time and page modification timestamps.
- Reranking: BAAI/bge-reranker-v2-m3 evaluates query-document relevance with fine-grained scores, critical for ambiguous and multi-hop queries.
- Generation: GPT-4o-mini (128K token context) with Chain-of-Thought prompting for structured reasoning, explicit false-premise detection, and time-aware answer generation.
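The chunking stage above can be sketched in plain Python. This is a simplified stand-in for a recursive character splitter with the stated parameters, not the project's actual code; the function name and separator list are illustrative.

```python
def recursive_chunk(text, chunk_size=1000, overlap=200,
                    separators=("\n\n", "\n", " ")):
    """Split on the coarsest separator present, then re-merge pieces into
    chunks of up to `chunk_size` characters, carrying `overlap` characters
    across consecutive chunks so context survives the boundary."""
    if len(text) <= chunk_size:
        return [text]
    # Pick the first separator that actually occurs in the text.
    sep = next((s for s in separators if s in text), "")
    pieces = text.split(sep) if sep else list(text)
    chunks, current = [], ""
    for piece in pieces:
        candidate = (current + sep + piece) if current else piece
        if len(candidate) <= chunk_size:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # Seed the next chunk with the tail of the previous one.
            current = (current[-overlap:] + sep + piece) if current else piece
    if current:
        chunks.append(current)
    return chunks
```

Compared with naive line-by-line splitting, the overlap means a sentence cut at a chunk boundary still appears intact in the following chunk.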
Chain-of-Thought Prompting Strategy
The CoT prompt instructs the model to: classify the question type (8 categories), evaluate reference recency for fast-changing queries, apply multi-step reasoning, respond "I don't know" when references are insufficient, and flag "Invalid question" for false premises.
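A sketch of how such a CoT system prompt could be assembled. The instruction wording and the `QUESTION_TYPES` list follow CRAG's eight question categories, but the exact phrasing is an assumption, not the project's verbatim prompt.

```python
# Illustrative prompt assembly; wording is an assumption, not the exact prompt.
QUESTION_TYPES = [
    "simple", "simple_with_condition", "set", "comparison",
    "aggregation", "multi-hop", "post-processing", "false_premise",
]

COT_TEMPLATE = """You are answering a question using the references below.
Reason step by step:
1. Classify the question into one of these types: {types}.
2. If the topic is fast-changing, check each reference's date against
   the query time ({query_time}) and prefer the most recent evidence.
3. Apply multi-step reasoning over the references.
4. If the references are insufficient to answer, reply "I don't know".
5. If the question rests on a false premise, reply "Invalid question".

References:
{references}

Question: {question}
Answer:"""

def build_cot_prompt(question, references, query_time):
    return COT_TEMPLATE.format(
        types=", ".join(QUESTION_TYPES),
        query_time=query_time,
        references="\n".join(f"- {r}" for r in references),
        question=question,
    )
```

Keeping the refusal phrases ("I don't know", "Invalid question") verbatim in the prompt matters because the CRAG scorer matches them to distinguish missing answers from hallucinations.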
Results
Progressive component additions showed consistent improvements. The final model (modified_rag_classification_gpt4o) achieved the best overall performance:
| Model Variant | Score | Exact Acc. | Accuracy | Hallucination | Missing |
|---|---|---|---|---|---|
| RAG Baseline (Llama3) | −0.139 | 3.22% | 18.73% | 32.58% | 48.69% |
| RAG Baseline (GPT-4o) | 0.060 | 5.62% | 25.69% | 19.70% | 54.61% |
| + Chunking + Reranking | 0.057 | 7.12% | 32.36% | 26.67% | 40.97% |
| + CoT + Time-Aware (Final) | 0.079 | 15.36% | 35.96% | 28.01% | 36.03% |
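The Score column is consistent with CRAG's truthfulness metric, score = accuracy − hallucination, with missing answers counting as zero. A quick arithmetic check against the table:

```python
# Verify score = accuracy - hallucination for every row of the table above
# (values copied from the table; missing answers contribute 0 to the score).
rows = {
    "RAG Baseline (Llama3)":  (0.1873, 0.3258, -0.139),
    "RAG Baseline (GPT-4o)":  (0.2569, 0.1970,  0.060),
    "+ Chunking + Reranking": (0.3236, 0.2667,  0.057),
    "+ CoT + Time-Aware":     (0.3596, 0.2801,  0.079),
}
for name, (acc, hall, reported) in rows.items():
    assert abs((acc - hall) - reported) < 0.005, name
```

This explains why the chunking-and-reranking variant scores slightly below the GPT-4o baseline despite much higher accuracy: its hallucination rate rose by roughly seven points, nearly offsetting the accuracy gain.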
Key Findings
- LLM selection produced the largest single improvement: switching from Llama-3.2-3B to GPT-4o-mini dramatically boosted score and reduced hallucination.
- False premise handling improved from 0% to 28.76% exact accuracy through explicit CoT instructions, a major breakthrough.
- Domain challenges: Finance remained the hardest domain due to dynamic numerical data. Open-domain queries performed best (42.16% accuracy).
- Temporal dynamics: Static queries (29.59% accuracy) significantly outperformed fast-changing (13.56%) and real-time (9.17%) queries, highlighting retrieval freshness as a bottleneck.
- Trade-off: The final model achieves the highest accuracy but not the lowest hallucination; improved retrieval sometimes surfaces partially relevant documents that increase hallucination risk.
Code available at github.com/HARRY1708/CRAG_Meta_KDD_cup.
Figure: Comparison of progressive RAG improvements against vanilla and Llama3 baselines across all evaluation metrics.