Exploring Agentic Search Over SEC Filings — What Worked and What Didn't
12 Apr 2026
I’ve been exploring how agentic AI changes the way we search financial documents. I built a system that searches 4,575 SEC 10-K filing chunks using three different approaches: naive vector search, hybrid search, and agentic search. Some of it worked better than expected. Some of it was confidently wrong. Here’s what I found.
Why Data Engineers Should Care About RAG
If you work in data, the ground is shifting under you. The traditional pipeline — extract, transform, load, query with SQL — is being augmented by a new layer: retrieval-augmented generation. Instead of writing a dashboard for every question, users ask in natural language and an AI retrieves the relevant data, synthesizes it, and answers.
A vector database (like Pinecone or ChromaDB) makes this possible. You convert your documents into numerical representations (embeddings), store them, and search by meaning instead of keywords. “Companies with rising production costs” finds relevant paragraphs even if they never use that exact phrase.
This isn’t replacing SQL and dbt. It’s a new layer on top, and the engineers who understand both worlds (structured data pipelines AND semantic retrieval) are the ones companies are hiring right now.
What I Built
A finance research agent that combines three data sources: real-time market data (FMP API), semantic search over 4,575 SEC 10-K filing chunks (Pinecone), and web research via a subagent. Built with Anthropic’s Claude Agent SDK and MCP.
I gave the agent three search modes to choose from — and this is where it got interesting.
Three Levels of Search (and What I Learned About Each)
Naive vector search is what every tutorial teaches. Embed the query, find the closest vectors, return top-k. It works for simple questions but falls apart on anything with precise financial terminology. When I searched for “fuel cost inflation,” it returned chunks about currency exposure and gold supply — semantically adjacent but not what I needed.
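To make the embed, compare, return top-k loop concrete, here is a minimal sketch. It is not the code behind my system: the three-dimensional vectors and chunk ids are made-up toy values, and `cosine` and `top_k` are hypothetical helpers. A real pipeline would get embeddings from a model such as all-MiniLM-L6-v2 and leave the nearest-neighbor search to the vector database.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the k (score, chunk_id) pairs most similar to query_vec."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    return sorted(scored, reverse=True)[:k]

# Toy index: chunk id -> made-up embedding (assumption, not real model output).
toy_index = {
    "fuel_costs": [0.9, 0.1, 0.0],
    "currency_exposure": [0.7, 0.6, 0.1],
    "gold_supply": [0.1, 0.2, 0.9],
}

results = top_k([1.0, 0.2, 0.0], toy_index, k=2)
for score, chunk_id in results:
    print(f"{chunk_id}: {score:.3f}")
```

Notice that the second-ranked chunk is simply the next-closest vector. Nothing in this loop knows whether “closest” means “relevant,” which is exactly how semantically adjacent results slip in.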
Hybrid search adds keyword matching on top of vector similarity. Financial terms like “AISC,” “ROIC,” and specific ticker symbols need literal matching, not just semantic approximation. When I added a keyword boost, the relevance score of the top result jumped 28%. I should note — my hybrid implementation is an approximation (post-hoc keyword boosting), not true sparse-dense retrieval. It works, but it’s not production-grade.
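A post-hoc keyword boost of the kind described above can be sketched roughly like this. The function name `keyword_boost`, the `boost=0.1` weight, the scores, and the sample texts are all illustrative assumptions, not the values from my implementation. It re-ranks hits that a dense search has already returned, which is why it only approximates true sparse-dense retrieval.

```python
import re

def keyword_boost(hits, keywords, boost=0.1):
    """Re-rank vector hits by adding a bonus per keyword found literally in the text.

    hits: list of (vector_score, text) pairs from a dense search.
    keywords: single-token terms such as "AISC" (multi-word phrases are not handled).
    boost: bonus per matched keyword (an arbitrary illustrative weight).
    """
    rescored = []
    for score, text in hits:
        tokens = set(re.findall(r"[A-Za-z0-9]+", text.upper()))
        matches = sum(1 for kw in keywords if kw.upper() in tokens)
        rescored.append((score + boost * matches, text))
    return sorted(rescored, key=lambda pair: pair[0], reverse=True)

# Made-up scores and texts for illustration only.
hits = [
    (0.62, "Currency exposure affected reported operating costs."),
    (0.58, "AISC rose on higher diesel and labor costs."),
]
reranked = keyword_boost(hits, ["AISC"])
```

The limitation is visible in the signature: a chunk that the dense search never returned cannot be boosted, no matter how well it matches the keyword.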
Agentic search is where the AI decides how to search. Instead of a single query, the agent decomposes “compare cost pressures across gold miners” into sub-queries — one per company, one per cost dimension — and runs targeted searches for each. This is genuinely useful for cross-document analysis. But for a focused question about one company, it was overkill — more latency, no better results than hybrid. Knowing when NOT to use it is as important as knowing how.
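The decomposition step can be sketched as follows. In the real system the LLM chooses the sub-queries. This stand-in hard-codes the “one per company, one per cost dimension” pattern so the control flow is visible. `decompose`, `agentic_search`, and `fake_search` are hypothetical names, and the “fuel” and “labor” dimensions are examples I picked for the sketch.

```python
def decompose(topic, companies, dimensions):
    """Build sub-queries: one per company, then one per cost dimension."""
    per_company = [f"{company} {topic}" for company in companies]
    per_dimension = [f"{dimension} {topic}" for dimension in dimensions]
    return per_company + per_dimension

def agentic_search(topic, companies, dimensions, search_fn):
    """Run every sub-query through search_fn and collect results per sub-query."""
    return {q: search_fn(q) for q in decompose(topic, companies, dimensions)}

def fake_search(query):
    """Stand-in for a real retrieval call; echoes the query it was given."""
    return [f"chunk for: {query}"]

companies = ["Newmont", "Gold Fields"]
dimensions = ["fuel", "labor"]  # illustrative cost dimensions (assumption)
plan = decompose("cost pressures", companies, dimensions)
results = agentic_search("cost pressures", companies, dimensions, fake_search)
```

Each sub-query is a separate retrieval call, so the cost grows with the size of the plan. For a single-company question the plan collapses to roughly one query, and the extra orchestration buys nothing.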
Where It Went Wrong
This is the part nobody writes about. When I verified the agent’s gold miner analysis against the actual filing text in Pinecone, I found three errors in a single output:
It confused two metrics. The agent reported Gold Fields’ AISC (all-in sustaining cost) guidance as “$1,500/oz.” The actual filing gives AIC (all-in cost), a different metric, at “$1,732/oz.” Both appear in the same MD&A section. The LLM conflated them.
It cited the wrong year. For AngloGold, the agent quoted a 2022→2023 AISC increase. But my index holds the FY2024 filing, which shows $1,672/oz. The agent grabbed a historical reference from the same document instead of the current-year number.
It hallucinated a number. The agent stated Newmont’s FX tailwinds “reduced costs by $190M.” That figure doesn’t appear in any retrieved chunk. The LLM fabricated it during synthesis.
I caught these by querying Pinecone directly — bypassing the agent — and comparing chunk text to the agent’s claims. The retrieval was actually solid. The failures were all in synthesis: the LLM misread its own context.
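I did this comparison by hand, but part of it can be automated. Here is a minimal sketch that flags any number in the agent’s output that appears in none of the retrieved chunks. The helper names are hypothetical, and the chunk text is a paraphrased placeholder built from the figures above, not a quote from the filing.

```python
import re

# Matches figures like "$1,732", "190", or "2024" (optional "$", commas, decimals).
NUMBER = re.compile(r"\$?\d[\d,]*(?:\.\d+)?")

def numbers_in(text):
    """Extract numeric tokens, normalized by dropping '$' and thousands separators."""
    return {m.replace("$", "").replace(",", "") for m in NUMBER.findall(text)}

def unsupported_numbers(claim, chunks):
    """Return numbers in the claim that appear in none of the retrieved chunks."""
    supported = set()
    for chunk in chunks:
        supported |= numbers_in(chunk)
    return sorted(numbers_in(claim) - supported)

# Placeholder chunk text (assumption), using the figure cited in this post.
chunks = ["All-in cost (AIC) guidance is $1,732/oz for the year."]
flagged = unsupported_numbers("AISC guidance is $1,500/oz.", chunks)
clean = unsupported_numbers("AIC guidance is $1,732/oz.", chunks)
```

A check like this would catch the fabricated figure, since that number appears nowhere in the context. It would not catch the wrong-year error, where the number is present in the chunk but attached to the wrong period, so it is a first filter rather than a full faithfulness test.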
The uncomfortable takeaway: RAG grounds the LLM in real documents, but it doesn’t prevent the LLM from misinterpreting what it retrieved. The retrieval layer and the synthesis layer fail in different ways, and most people only evaluate the first.
What I Think Matters Going Forward
I’m still figuring this out — there are very few established best practices for agentic RAG evaluation. But here’s what I’d prioritize:
- Faithfulness checking. Automatically compare each claim in the output to the retrieved context. Did the agent say something the chunks don’t support? That’s a flag.
- Confidence scoring. When retrieval scores are low, the system should say “I’m not sure” instead of guessing confidently.
- Regression tests. A curated set of queries with known answers, run on every pipeline change. The errors above would’ve been caught.
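As a rough sketch of the last two ideas, here is a confidence gate and a tiny regression harness. Everything in it is an assumption for illustration: the function names, the `min_score=0.5` threshold, and the canned pipeline. The one expected value is the AngloGold FY2024 figure mentioned earlier.

```python
def answer_or_abstain(hits, synthesize, min_score=0.5):
    """Abstain when the best retrieval score is below min_score (an arbitrary threshold)."""
    if not hits or max(score for score, _ in hits) < min_score:
        return "I'm not sure: retrieval confidence is low."
    return synthesize([text for _, text in hits])

def run_regression(cases, pipeline):
    """Return the queries whose answer is missing the expected substring."""
    return [query for query, expected in cases if expected not in pipeline(query)]

def fake_pipeline(query):
    """Stand-in for the full agent; returns canned text for the demo."""
    canned = {"AngloGold FY2024 AISC": "AISC was $1,672/oz."}
    return canned.get(query, "I'm not sure: retrieval confidence is low.")

cases = [("AngloGold FY2024 AISC", "$1,672/oz")]
failures = run_regression(cases, fake_pipeline)

low = answer_or_abstain([(0.21, "unrelated chunk")], lambda texts: " ".join(texts))
high = answer_or_abstain([(0.80, "relevant chunk")], lambda texts: " ".join(texts))
```

Substring matching is crude, but it is cheap enough to run on every pipeline change, and a non-empty `failures` list is an unambiguous signal to stop and look.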
The gap between “impressive demo” and “reliable system” is mostly evaluation. Building the agent took a week. Building trustworthy evaluation will take longer — and matter more.
The Stack
Pinecone (serverless vector DB) · Claude Agent SDK (agent orchestration) · MCP (tool protocol) · Local embeddings (all-MiniLM-L6-v2)