title: "Building a RAG Chatbot for ADAC: Semantic Search Across 2 Million Documents" slug: adac-ai-rag-chatbot-semantic-search
Three weeks into building the RAG pipeline for ADAC, our retrieval precision on vehicle test queries was at 61%. We had 2 million documents indexed in Pinecone. The pipeline returned results — confident, well-formatted, sourced answers — but the sources were wrong. A query about winter tyre performance on the Autobahn was returning answers grounded in travel advisory articles rather than the tyre test database. The model didn't hallucinate, because it was faithfully citing real ADAC documents. It just cited the wrong ones.
The problem wasn't the retrieval model. It was our chunking strategy. We had treated the entire ADAC content corpus as a homogeneous text corpus and applied a uniform 512-token chunk with 10% overlap. But ADAC's 2 million records span fundamentally different information types: vehicle test reports (structured numerical data, comparative tables), legal travel advisories (dense prose with jurisdiction-specific clauses), tyre comparison databases (tabular with brand-model-dimension keys), and short editorial articles (conversational, summary content). A 512-token chunk of a tyre comparison table contains almost no semantic signal. A 512-token chunk of a legal advisory contains too much — it crosses topic boundaries mid-sentence.
ADAC is Europe's largest automobile association, with over 21 million members across Germany. Their digital platform, adac.de, handles tens of millions of user sessions annually. The AI pilot covered three complementary capabilities: semantic search across ADAC's content catalogue, a RAG-powered conversational chatbot for member self-service, and a text-to-speech system for in-car and hands-free content consumption.
Who ADAC Are
ADAC operates as a three-pillar organisation: the membership club (providing roadside assistance, legal advice, vehicle tests, travel services, and traffic safety research), ADAC SE (insurance, car rentals, financial services), and the ADAC Stiftung. In 2024 alone, their Pannenhilfe "Yellow Angels" responded to over 3.6 million breakdown incidents — one every nine seconds on average.
Their content catalogue reflects that breadth. ADAC publishes Europe's most authoritative vehicle test database, covering thousands of models with structured performance data. They produce travel advisories and legal guides across hundreds of international destinations. Their tyre comparison database is the reference standard for German motorists. They have 20+ years of editorial articles across automotive, traffic law, and travel. The diversity of that content is both the opportunity and the retrieval challenge.
The Problem
ADAC's existing search was keyword-based. A member asking "what tyre is best for driving on the Autobahn in winter" would need to type exact keywords to find the right tyre test reports. Natural language questions either returned no results or surfaced the wrong content tier — a travel article instead of a tyre comparison, a legal advisory instead of a vehicle test.
The chatbot use case was the strongest driver. ADAC's member services team handles enormous query volume on topics — vehicle breakdown advice, legal rights during rentals, travel documentation requirements, tyre regulations — that have well-defined answers in the ADAC content catalogue. A system that could autonomously answer 70%+ of those queries from documented ADAC content would have significant operational value, and more importantly, could deliver that value 24/7 without wait time.
The TTS component addressed a different need: ADAC content consumption increasingly happens in-car. A member on a motorway trip who wants to hear the key points of a travel advisory for their destination can't read an article while driving. Spoken audio conversion of ADAC editorial content — with SSML-enhanced reading for automotive German terminology — served the Android Auto and CarPlay integration use case directly.
According to Gartner's 2025 AI in Customer Service report, organisations deploying RAG-based self-service resolve 68% of enquiries without human escalation within the first three months of operation. ADAC's member services team needed a resolution rate significantly above the keyword search baseline, with source attribution to maintain trust.
What We Built
The pilot delivered three integrated capabilities:
Semantic search: Natural language query across the full ADAC content catalogue, combining Pinecone vector search with BM25 keyword matching in a hybrid retrieval approach. Cohere Rerank as a second-pass precision layer, re-ordering candidates against the full query intent. Results surface with source attribution and content-type indicators.
RAG chatbot: LangChain orchestration with GPT-4o generation. Multi-turn conversational memory across up to 20 turns per session. Every generated answer includes clickable source links to underlying ADAC articles. A guardrail layer handles PII scrubbing, off-topic deflection, and low-confidence escalation to human agents.
Text-to-speech: Azure Cognitive Services TTS with Neural voices for German content, with SSML templating for automotive-specific terminology pronunciation. ElevenLabs for premium voice quality on long-form editorial content. Audio delivered via Azure CDN with adaptive bitrate, integrated with a Web Audio API frontend player.
How We Built It
Content-Type-Aware Chunking
Switching from uniform chunking to content-type-aware chunking was the change that moved retrieval precision from 61% to 84% on our benchmark set.
ADAC content is tagged with a taxonomy at the record level: vehicle_test, tyre_comparison, legal_advisory, editorial_article, travel_guide. Each content type has a different optimal chunking strategy:
Vehicle test reports are structured data. They contain a header block (model, test date, tester, overall score) and multiple comparison sections (safety, fuel economy, handling, value). We chunk by section boundary, not by token count — each section becomes one chunk regardless of length. This means a query about a specific vehicle's safety score retrieves the safety section chunk, not a generic overview.
Tyre comparison tables can't be meaningfully chunked — they're structured key-value data. We converted the tabular data to natural language summaries at the row level (each tyre model becomes a prose description of its ratings) and indexed those. The table structure is preserved in the metadata, not in the chunk text.
Legal advisories and travel guides are dense prose that crosses topic boundaries. We use a sentence-window chunking approach: smaller base chunks (256 tokens) with the retrieval context expanded to include the two surrounding sentences on each side when constructing the LLM prompt. This gives the model enough adjacent context to answer accurately without the imprecision of a 512-token chunk that crosses sub-topic lines.
Editorial articles are short-form and conversational. These are chunked by paragraph with a 128-token minimum — below that, the chunk lacks semantic coherence.
Hybrid Retrieval and Cohere Rerank
Dense vector retrieval (Pinecone with text-embedding-3-large) handles semantic similarity well but struggles with exact term queries — a member asking for the "ADAC eco test 2024 results for the Golf 8" is combining a specific event name, a year, and a model identifier. Pure semantic search on that query sometimes retrieves the wrong Golf generation or the wrong test type.
BM25 keyword matching handles exact term queries well but misses semantic variants — "Autobahn winter driving" won't match documents that use "Bundesautobahn" and "Winterreifen." Hybrid retrieval combines both: candidates from Pinecone and from the Elasticsearch BM25 index are merged, deduplicated, and passed to Cohere Rerank.
Cohere Rerank takes the query and all candidate chunks and re-scores them against the full query intent using a cross-encoder model — it sees the query and the candidate together, not as separate embeddings. For ADAC's mixed query types, Cohere Rerank consistently improved top-3 precision by 14–19 percentage points over vector-only retrieval on our benchmark set of 200 manually labelled queries.
Guardrails Layer
ADAC's chatbot surfaces to millions of potential users. Guardrails weren't optional. We implemented four layers:
PII detection: Phone numbers, addresses, and membership numbers in user input are detected and redacted before the query reaches the LLM. Member IDs would theoretically allow a sufficiently clever prompt to ask for another member's account information — which isn't stored in the content index, but the input cleaning is a defence-in-depth measure regardless.
Off-topic deflection: ADAC's chatbot should answer questions about ADAC's domain — vehicles, travel, legal rights in automotive contexts, roadside assistance. A query about a recipe or a national political question gets deflected with a domain-boundary message. We built the deflection classifier as a separate GPT-4o-mini zero-shot call on the query before the main pipeline runs — it's fast and cheap, and it means the main RAG pipeline isn't consumed on out-of-scope queries.
Low-confidence escalation: When the retrieved chunks have low similarity scores relative to the query, or when the generated answer contains hedging language above a threshold, the response is flagged for human agent handoff. The member sees an escalation message; ADAC's member services team sees the conversation context.
Hallucination scoring: We use LangSmith observability to log all generated answers alongside their retrieved source chunks. A post-hoc hallucination detection pass compares factual claims in the generated answer against the source chunks. Any session with a hallucination score above threshold is surfaced in the monitoring dashboard for manual review.
What Made It Hard
1. Chunking Diversity for a 2M-Document Heterogeneous Corpus
The 61% → 84% precision jump was not free. Content-type-aware chunking requires that every document in the corpus has a reliable content-type tag, and ADAC's 2 million records were tagged with varying consistency. Around 8% of the corpus had missing or incorrect content-type metadata — most commonly travel advisories that had been mis-tagged as editorial articles, causing them to receive paragraph chunking instead of sentence-window chunking.
We built a two-pass classification pipeline: a rules-based pre-classifier using document structure signals (presence of tabular data, structured header patterns, legal boilerplate) as a first pass, followed by a GPT-4o-mini zero-shot classification for documents where the rules pass returned low confidence. The classification pipeline added three weeks to the indexing phase but meant the chunking strategy was applied correctly to 98.7% of the corpus — close enough to reliable for the remaining 1.3% to not meaningfully degrade overall retrieval quality.
2. German Automotive Terminology in TTS
Azure Cognitive Services TTS handles standard German well. It does not handle the specific vocabulary of German automotive content: Fahrassistenzsystem, Kraftstoffverbrauch, Abstandsregeltempomat, Reifendruckkontrollsystem. Without SSML correction, these compound nouns are either mispronounced or broken at incorrect syllable boundaries.
We built an SSML template layer that sits between the editorial text and the TTS API: a lookup against an ADAC-specific automotive terminology dictionary that inserts SSML <phoneme> and <break> tags for known terms before the text reaches the TTS engine. The dictionary was built from ADAC's editorial team's existing style guide pronunciations — approximately 340 terms that required explicit phoneme correction for TTS intelligibility. The SSML-corrected audio scores 4.1/5.0 on intelligibility ratings from native German automotive readers versus 3.0/5.0 for uncorrected TTS on the same content.
3. In-Car Delivery and Adaptive Bitrate Audio
ADAC's TTS content was designed for in-car consumption on Android Auto and CarPlay. In-car environments present specific audio delivery requirements: connection quality varies dramatically between urban and motorway driving, and audio dropouts that would be mildly annoying on a phone are actively dangerous for a driver who expected to hear a navigation instruction.
We implemented adaptive bitrate audio streaming via Azure CDN: audio is pre-rendered at three quality levels (64kbps, 128kbps, 256kbps) and the Web Audio API frontend switches between tiers based on measured throughput. For CarPlay and Android Auto integration, we used HLS (HTTP Live Streaming) audio segments — compatible with both native app players — with 5-second segments to allow quick quality tier switching without audible gaps. The segment pre-fetch buffer target is 30 seconds of audio ahead of playback position, providing enough buffer to maintain playback through typical tunnel or urban canyon signal drops.
What Changed
Retrieval precision on the benchmark set reached 84% on launch, up from the 61% baseline. The chatbot resolution rate — queries answered autonomously without human escalation — reached 73% in the first month of pilot operation, above the 70% target. Average member session length on chatbot interactions is 4.2 exchanges, indicating members are getting their answers without needing excessive follow-up clarification.
The TTS system is live for a subset of ADAC editorial content, with audio intelligibility scores meeting the 4.0+ threshold across tested automotive terminology categories. SSML correction for the 340-term dictionary is now applied automatically to all new content entering the TTS pipeline.
What's Next
The pilot roadmap extension covers: personalised member assistant — a chatbot variant with access to member profile, vehicle data, and membership tier for context-aware answers; voice interface — full voice-in/voice-out conversational experience integrated with ADAC's mobile app; and multimodal vehicle diagnostics — photo-based issue identification for dashboard warning lights using GPT-4o Vision.
Common Questions About RAG at Scale
What is hybrid retrieval and why does it outperform pure vector search?
Hybrid retrieval combines dense vector search (semantic similarity) with sparse keyword matching (BM25). Dense search handles semantically related queries — "winter tyre Autobahn" matching documents that use "Winterreifen" and "Bundesautobahn." Sparse search handles exact term queries — specific model names, tyre dimensions, regulation article numbers. Combining both, then re-ranking with a cross-encoder, consistently outperforms either method alone for real-world query distributions that mix both query types.
What is Cohere Rerank and when should you use it?
Cohere Rerank is a second-stage retrieval model that takes a query and a set of candidate documents and re-scores them by running the query and each candidate through a cross-encoder — a model that sees both together rather than comparing independent embeddings. The result is more accurate relevance scoring for complex or specific queries. It adds latency (typically 150–300ms) and cost, so it's worth using when your initial retrieval candidates have good recall but imprecise ranking — which is the common case with large, heterogeneous corpora.
How do you prevent hallucination in a RAG system serving millions of users?
Hallucination prevention in RAG works at three layers: retrieval quality (good chunks mean the LLM has the right source material), prompt engineering (instructing the model to cite sources and express uncertainty rather than filling gaps), and post-generation checking (comparing claims in the generated answer against the retrieved source chunks). The third layer is the most important for production systems — not as a real-time block but as an observability signal that tells you where your retrieval is failing to provide good source material.
What is the right chunk size for a RAG system?
There is no single right chunk size — it depends on your content type. Short, focused chunks (128–256 tokens) are better for fact-retrieval queries. Longer chunks (512+ tokens) are better for questions that require understanding a passage's full argument. The correct approach for a heterogeneous corpus is content-type-aware chunking: different strategies per content category based on how that content is structured and how users query it.
The ADAC RAG pilot taught a principle that generalises to every large-scale retrieval system I've built since: the retrieval architecture is the product. The language model is the presentation layer. Swap GPT-4o for another frontier model and your output quality changes marginally. Fix your chunking strategy and your output quality changes fundamentally.
If you're building RAG on a large, heterogeneous content corpus, the chunking design and the content-type taxonomy are the decisions that determine whether your system is useful. The vector database, the embedding model, the LLM — those are implementation details you can tune later.
We've applied similar retrieval architecture to AI candidate screening at TalentFilter where document retrieval quality directly affected evaluation accuracy, and to the MintFit fitness coaching platform where RAG grounded workout generation in validated exercise protocols. Our AI integration and automation practice covers RAG architecture from corpus design through production deployment.
