Exploring Agentic Search Over SEC Filings — What Worked and What Didn't

12 Apr 2026

I’ve been exploring how agentic AI changes the way we search financial documents. I built a system that searches 4,575 SEC 10-K filing chunks using three different approaches — naive vector search, hybrid search, and agentic search. Some of it worked better than expected. Some of it was confidently wrong. Here’s what I found.

Why Data Engineers Should Care About RAG

If you work in data, the ground is shifting under you. The traditional pipeline — extract, transform, load, query with SQL — is being augmented by a new layer: retrieval-augmented generation. Instead of writing a dashboard for every question, users ask in natural language and an AI retrieves the relevant data, synthesizes it, and answers.

A vector database (like Pinecone or ChromaDB) makes this possible. You convert your documents into numerical representations (embeddings), store them, and search by meaning instead of keywords. “Companies with rising production costs” finds relevant paragraphs even if they never use that exact phrase.

This isn’t replacing SQL and dbt. It’s a new layer on top — and the engineers who understand both worlds (structured data pipelines AND semantic retrieval) are the ones companies are hiring for right now.

What I Built

A finance research agent that combines three data sources: real-time market data (FMP API), semantic search over 4,575 SEC 10-K filing chunks (Pinecone), and web research via a subagent. Built with Anthropic’s Claude Agent SDK and MCP.

I gave the agent three search modes to choose from — and this is where it got interesting.

Three Levels of Search (and What I Learned About Each)

Naive vector search is what every tutorial teaches. Embed the query, find the closest vectors, return top-k. It works for simple questions but falls apart on anything with precise financial terminology. When I searched for “fuel cost inflation,” it returned chunks about currency exposure and gold supply — semantically adjacent but not what I needed.
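
A minimal sketch of the embed-and-rank step described above, in plain Python. The three-dimensional vectors and chunk names are made up for illustration; a real system would get its vectors from an embedding model and store them in a vector database rather than a dict.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def naive_search(query_vec, index, top_k=2):
    """Score every stored chunk against the query and keep the top k."""
    scored = [(cosine(query_vec, vec), chunk_id) for chunk_id, vec in index.items()]
    scored.sort(reverse=True)
    return [(chunk_id, round(score, 3)) for score, chunk_id in scored[:top_k]]

# Toy "embeddings" (hypothetical values, not output from a real model).
index = {
    "fuel_costs": [0.9, 0.1, 0.2],
    "currency_exposure": [0.7, 0.6, 0.1],
    "board_composition": [0.0, 0.2, 0.9],
}
query = [1.0, 0.2, 0.1]
results = naive_search(query, index)
```

In this toy index the off-topic "currency_exposure" chunk still lands in the top two, because nothing in the ranking looks at the literal terms in the query.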

Hybrid search adds keyword matching on top of vector similarity. Financial terms like “AISC,” “ROIC,” and specific ticker symbols need literal matching, not just semantic approximation. When I added a keyword boost, the relevance score of the top result jumped 28%. I should note — my hybrid implementation is an approximation (post-hoc keyword boosting), not true sparse-dense retrieval. It works, but it’s not production-grade.
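
One way post-hoc keyword boosting could look, as a rough sketch. The `keyword_boost` function, the `boost` weight, and the sample chunk text are all hypothetical; the post does not say how the actual boost was computed.

```python
import re

def keyword_boost(query, hits, boost=0.1):
    """Re-rank vector hits by adding a bonus per query term found verbatim.

    hits: list of (chunk_text, vector_score) pairs from a prior vector search.
    boost: bonus per matched term (an assumed weight, not the post's value).
    """
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    reranked = []
    for text, score in hits:
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        matched = len(terms & words)
        reranked.append((text, score + boost * matched))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)

# Invented chunks and scores: the second chunk starts lower but matches both terms.
hits = [
    ("Currency exposure raised operating costs in the period.", 0.62),
    ("AISC rose on higher fuel and labour costs.", 0.58),
]
reranked = keyword_boost("AISC fuel", hits)
```

Because the boost is applied only to chunks the vector search already returned, a chunk that never made the initial top-k cannot be rescued. That is the main difference from true sparse-dense retrieval.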

Agentic search is where the AI decides how to search. Instead of a single query, the agent decomposes “compare cost pressures across gold miners” into sub-queries — one per company, one per cost dimension — and runs targeted searches for each. This is genuinely useful for cross-document analysis. But for a focused question about one company, it was overkill — more latency, no better results than hybrid. Knowing when NOT to use it is as important as knowing how.
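
The fan-out can be sketched deterministically. In the real system the LLM writes the plan; here a hypothetical `plan_searches` function stands in for it, and the company-by-dimension grid is only one possible reading of how the sub-queries were split. The cost dimensions are invented examples.

```python
from itertools import product

def plan_searches(companies, dimensions):
    """Return the sub-queries a planner might issue.

    A single-company question gets one focused query; a cross-company
    comparison fans out into one sub-query per (company, dimension) pair.
    """
    if len(companies) <= 1:
        company = companies[0] if companies else ""
        return [f"{company} {' '.join(dimensions)}".strip()]
    return [f"{company} {dimension}" for company, dimension in product(companies, dimensions)]

def run_plan(sub_queries, search_fn):
    """Run each sub-query through a search function, keyed by sub-query."""
    return {query: search_fn(query) for query in sub_queries}

plan = plan_searches(["Newmont", "AngloGold", "Gold Fields"], ["fuel costs", "labour costs"])
single = plan_searches(["Newmont"], ["fuel costs", "labour costs"])
```

The single-company branch reflects the point above: when there is nothing to compare, one focused query avoids the extra latency of a fan-out.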

Where It Went Wrong

This is the part nobody writes about. When I verified the agent’s gold miner analysis against the actual filing text in Pinecone, I found three errors in a single output:

It confused two metrics. The agent reported Gold Fields’ AISC (all-in sustaining cost) guidance as “$1,500/oz.” The actual filing gives AIC (all-in cost), a different metric, at “$1,732/oz.” Both appear in the same MD&A section, and the LLM conflated them.

It cited the wrong year. For AngloGold, the agent quoted a 2022→2023 AISC increase. But my index has the FY2024 filing, which shows $1,672/oz. The agent grabbed a historical reference from the same document instead of the current-year number.

It hallucinated a number. The agent stated Newmont’s FX tailwinds “reduced costs by $190M.” That figure doesn’t appear in any retrieved chunk. The LLM fabricated it during synthesis.

I caught these by querying Pinecone directly, bypassing the agent, and comparing chunk text to the agent’s claims. The retrieval was actually solid. The failures were all in synthesis: the LLM either misread its own context or, in the Newmont case, went beyond it.
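
Part of that manual comparison can be automated. The sketch below flags dollar figures in an answer that appear in none of the retrieved chunks. The regex, the function name, and the sample chunk text are assumptions made for illustration, built from figures mentioned in this post.

```python
import re

# Matches figures such as $1,672/oz, $190M or $2.5B (a deliberately narrow pattern).
NUMBER = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:/oz|[MB])?")

def unsupported_figures(answer, chunks):
    """Return dollar figures in the answer that appear in none of the chunks."""
    supported = set()
    for chunk in chunks:
        supported.update(NUMBER.findall(chunk))
    return [fig for fig in NUMBER.findall(answer) if fig not in supported]

# Invented stand-ins for an agent answer and a retrieved chunk.
answer = "AngloGold reported AISC of $1,672/oz, and FX tailwinds reduced Newmont costs by $190M."
chunks = ["AISC for the year was $1,672/oz."]
flags = unsupported_figures(answer, chunks)
```

A check like this would catch only the fabricated number. The metric mix-up and the wrong-year citation both used figures that do appear in the filing text, so they need a check that also looks at which metric and which year a figure is attached to.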

The uncomfortable takeaway: RAG grounds the LLM in real documents, but it doesn’t prevent the LLM from misinterpreting what it retrieved. The retrieval layer and the synthesis layer fail in different ways, and most people only evaluate the first.

What I Think Matters Going Forward

I’m still figuring this out — there are very few established best practices for agentic RAG evaluation. But here’s what I’d prioritize:

Faithfulness checking. Automatically compare each claim in the output to the retrieved context. Did the agent say something the chunks don’t support? That’s a flag.
Confidence scoring. When retrieval scores are low, the system should say “I’m not sure” instead of guessing confidently.
Regression tests. A curated set of queries with known answers, run on every pipeline change. The errors above would’ve been caught.
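
The confidence and regression items might look roughly like this. The `min_score` threshold, the function names, and the toy pipeline are hypothetical; the one expected value is the FY2024 AngloGold figure quoted earlier in this post.

```python
def answer_or_abstain(hits, min_score=0.5):
    """Return the best chunk, or None when no hit clears the threshold.

    hits: list of (chunk_text, score) pairs. min_score is an assumed cutoff
    that would need tuning against real retrieval scores.
    """
    if not hits:
        return None
    best_text, best_score = max(hits, key=lambda pair: pair[1])
    return best_text if best_score >= min_score else None

def run_regression(cases, answer_fn):
    """Run each (query, expected_substring) case; return the queries that failed."""
    failures = []
    for query, expected in cases:
        answer = answer_fn(query) or ""
        if expected not in answer:
            failures.append(query)
    return failures

def toy_pipeline(query):
    """Stand-in for the real retrieval and synthesis chain."""
    canned = {"AngloGold FY2024 AISC": [("AISC was $1,672/oz in FY2024.", 0.81)]}
    return answer_or_abstain(canned.get(query, [("unrelated text", 0.2)]))

cases = [("AngloGold FY2024 AISC", "$1,672/oz")]
failures = run_regression(cases, toy_pipeline)
```

Substring matching is crude, but even a small set of such cases, run on every pipeline change, turns a wrong-year or wrong-metric answer into a failing test instead of a surprise.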

The gap between “impressive demo” and “reliable system” is mostly evaluation. Building the agent took a week. Building trustworthy evaluation will take longer — and matter more.

The Stack

Pinecone (serverless vector DB) · Claude Agent SDK (agent orchestration) · MCP (tool protocol) · Local embeddings (all-MiniLM-L6-v2)

View the code on GitHub

Tim Roller, CFA
