Tim Roller, CFA

Exploring Agentic Search Over SEC Filings — What Worked and What Didn't

2026-04-12T00:00:00+00:00

I’ve been exploring how agentic AI changes the way we search financial documents. I built a system that searches 4,575 SEC 10-K filing chunks using three different approaches — naive vector search, hybrid search, and agentic search. Some of it worked better than expected. Some of it was confidently wrong. Here’s what I found.

Why Data Engineers Should Care About RAG

If you work in data, the ground is shifting under you. The traditional pipeline — extract, transform, load, query with SQL — is being augmented by a new layer: retrieval-augmented generation. Instead of writing a dashboard for every question, users ask in natural language and an AI retrieves the relevant data, synthesizes it, and answers.

A vector database (like Pinecone or ChromaDB) makes this possible. You convert your documents into numerical representations (embeddings), store them, and search by meaning instead of keywords. “Companies with rising production costs” finds relevant paragraphs even if they never use that exact phrase.

This isn’t replacing SQL and dbt. It’s a new layer on top, and the engineers who understand both worlds (structured data pipelines AND semantic retrieval) are the ones companies are hiring right now.

What I Built

A finance research agent that combines three data sources: real-time market data (FMP API), semantic search over 4,575 SEC 10-K filing chunks (Pinecone), and web research via a subagent. Built with Anthropic’s Claude Agent SDK and MCP.

I gave the agent three search modes to choose from — and this is where it got interesting.

Three Levels of Search (and What I Learned About Each)

Naive vector search is what every tutorial teaches. Embed the query, find the closest vectors, return top-k. It works for simple questions but falls apart on anything with precise financial terminology. When I searched for “fuel cost inflation,” it returned chunks about currency exposure and gold supply — semantically adjacent but not what I needed.
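
Here’s a minimal sketch of that top-k step in pure Python. The three-dimensional vectors and chunk names are made-up stand-ins for real embeddings, and `cosine` and `top_k` are illustrative names; only the ranking logic is the point.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def top_k(query_vec, index, k=2):
    """Return the k (chunk_id, score) pairs closest to the query vector."""
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

# Hypothetical toy index: chunk id -> embedding (assumed values)
index = {
    "fuel_costs": [0.9, 0.1, 0.0],
    "currency_exposure": [0.7, 0.6, 0.1],
    "board_composition": [0.0, 0.1, 0.9],
}
results = top_k([1.0, 0.2, 0.0], index, k=2)
```

In this toy index, currency_exposure still comes back in second place. Top-k returns the nearest vectors whether or not they answer the question, which is the failure described above.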

Hybrid search adds keyword matching on top of vector similarity. Financial terms like “AISC,” “ROIC,” and specific ticker symbols need literal matching, not just semantic approximation. When I added a keyword boost, the relevance score of the top result jumped 28%. I should note — my hybrid implementation is an approximation (post-hoc keyword boosting), not true sparse-dense retrieval. It works, but it’s not production-grade.
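
Post-hoc keyword boosting can be sketched in a few lines. This is one possible shape, not my exact implementation: the `boost` weight of 0.1, the tokenizer, and the sample hits are all assumptions chosen for illustration.

```python
import re

def keyword_boost(query, hits, boost=0.1):
    """Re-rank (text, vector_score) hits, adding `boost` per query term found literally in the text."""
    terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    rescored = []
    for text, score in hits:
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        rescored.append((text, score + boost * len(terms & words)))
    return sorted(rescored, key=lambda pair: pair[1], reverse=True)

# Hypothetical vector-search results: (chunk text, similarity score)
hits = [
    ("Currency exposure affected reported costs.", 0.62),
    ("AISC rose on higher fuel and labor costs.", 0.58),
]
reranked = keyword_boost("AISC fuel costs", hits)
```

The second chunk starts with the lower vector score but matches three literal query terms, so it moves to the top after boosting. True sparse-dense retrieval would score both signals inside the index instead of patching the ranking afterward.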

Agentic search is where the AI decides how to search. Instead of a single query, the agent decomposes “compare cost pressures across gold miners” into sub-queries — one per company, one per cost dimension — and runs targeted searches for each. This is genuinely useful for cross-document analysis. But for a focused question about one company, it was overkill — more latency, no better results than hybrid. Knowing when NOT to use it is as important as knowing how.
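
The decomposition itself is simple to picture. In the real system the LLM chooses the sub-queries; the sketch below hard-codes that step with a cross product so the fan-out is visible. The cost dimensions and the query template are assumptions.

```python
from itertools import product

def decompose(companies, dimensions):
    """One targeted sub-query per (company, cost dimension) pair."""
    return [f"{company} {dimension} 10-K" for company, dimension in product(companies, dimensions)]

sub_queries = decompose(
    ["Newmont", "AngloGold", "Gold Fields"],
    ["fuel costs", "labor costs"],  # hypothetical cost dimensions
)
```

Three companies and two dimensions already mean six searches, which is where the extra latency on single-company questions comes from.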

Where It Went Wrong

This is the part nobody writes about. When I verified the agent’s gold miner analysis against the actual filing text in Pinecone, I found three errors in a single output:

It confused two metrics. The agent reported Gold Fields’ AISC guidance as “$1,500/oz.” The actual filing says AIC — a different metric — at “$1,732/oz.” Both appear in the same MD&A section. The LLM conflated them.

It cited the wrong year. For AngloGold, the agent quoted a 2022→2023 AISC increase. But our index has the FY2024 filing, which shows $1,672/oz. The agent grabbed a historical reference from the same document instead of the current-year number.

It hallucinated a number. The agent stated Newmont’s FX tailwinds “reduced costs by $190M.” That figure doesn’t appear in any retrieved chunk. The LLM fabricated it during synthesis.

I caught these by querying Pinecone directly — bypassing the agent — and comparing chunk text to the agent’s claims. The retrieval was actually solid. The failures were all in synthesis: the LLM misread its own context.
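
The manual check can be roughed out in code. This sketch pulls dollar figures out of an answer and flags any that don’t appear verbatim in the retrieved chunks. The chunk text, the regex, and the function name are illustrative assumptions, not the agent’s actual code.

```python
import re

# Matches figures like $1,732/oz or $190M (assumed formats)
FIGURE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?(?:/oz|[MBK])?")

def unsupported_figures(answer, chunks):
    """Return dollar figures in the answer that appear in none of the retrieved chunks."""
    context = " ".join(chunks)
    return [fig for fig in FIGURE.findall(answer) if fig not in context]

# Hypothetical chunk text and agent answer
chunks = ["All-in costs (AIC) guidance of $1,732/oz for the year."]
answer = "AISC guidance is $1,500/oz, and FX tailwinds reduced costs by $190M."
flags = unsupported_figures(answer, chunks)
```

Exact string matching is crude: it would miss “$190 million” versus “$190M,” and it can’t tell when a real number is attached to the wrong metric or the wrong year. It only catches figures that appear nowhere in the context.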

The uncomfortable takeaway: RAG grounds the LLM in real documents, but it doesn’t prevent the LLM from misinterpreting what it retrieved. The retrieval layer and the synthesis layer fail in different ways, and most people only evaluate the first.

What I Think Matters Going Forward

I’m still figuring this out — there are very few established best practices for agentic RAG evaluation. But here’s what I’d prioritize:

Faithfulness checking. Automatically compare each claim in the output to the retrieved context. Did the agent say something the chunks don’t support? That’s a flag.
Confidence scoring. When retrieval scores are low, the system should say “I’m not sure” instead of guessing confidently.
Regression tests. A curated set of queries with known answers, run on every pipeline change. The errors above would’ve been caught.
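
Two of these priorities fit in a short sketch: abstaining on weak retrieval, and a tiny regression harness. The 0.5 threshold, the function names, and the `fake_agent` stand-in are assumptions for illustration.

```python
def answer_or_abstain(hits, min_score=0.5):
    """Return an abstention message when the best retrieval score is below `min_score`, else None."""
    best = max((score for _, score in hits), default=0.0)
    if best < min_score:
        return "I'm not sure: no retrieved passage was a close match."
    return None  # caller proceeds to synthesis

def run_regression(agent, cases):
    """Return the queries whose answers lack the expected substring."""
    return [query for query, expected in cases if expected not in agent(query)]

# Hypothetical stand-in for the real agent
def fake_agent(query):
    return "AngloGold FY2024 AISC was $1,672/oz."

failures = run_regression(fake_agent, [("AngloGold AISC FY2024", "$1,672/oz")])
```

A substring match is the weakest useful assertion, but it is enough to catch a wrong-year number like the AngloGold one every time the pipeline changes.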

The gap between “impressive demo” and “reliable system” is mostly evaluation. Building the agent took a week. Building trustworthy evaluation will take longer — and matter more.

The Stack

Pinecone (serverless vector DB) · Claude Agent SDK (agent orchestration) · MCP (tool protocol) · Local embeddings (all-MiniLM-L6-v2)

View the code on GitHub

What Derivatives Trading Taught Me About Building AI Systems

2026-04-11T00:00:00+00:00

I spent six years trading derivatives before becoming an AI engineer. Most people see those as unrelated careers. They’re not. The mental models from trading are the same ones that make you effective at building AI systems — and the ones most AI engineers are missing.

1. Thinking in Probabilities, Not Certainties

On the trading floor, every decision is a probability-weighted bet. You never know the stock will go up. You estimate the probability, size the position accordingly, and manage the risk.

AI systems work the same way. An LLM doesn’t know the answer — it produces a probability distribution over tokens. When you build a RAG pipeline, you’re not guaranteed to retrieve the right document. When you deploy an agent, you can’t be certain it will call the right tools.

The trading instinct: Think about confidence intervals, not binary outcomes. Build systems that handle the case where the model is wrong — because it will be, and more often than you expect.

Most AI engineers I work with treat model output as ground truth. Traders never make that mistake with their positions.

2. Position Sizing = Resource Allocation

In trading, the best idea in the world is worthless if you size the position wrong. Too small and it doesn’t move the needle. Too large and one bad tick wipes you out.

In AI engineering, the equivalent is token budgets, model selection, and context allocation. Do you burn 100K tokens on a single research query, or split it across four focused sub-queries? Do you use Opus for everything, or route simple tasks to Haiku and save Opus for synthesis?

The trading instinct: The size of the bet matters as much as the direction. In AI, the cost, latency, and context allocation of each model call matter as much as the prompt.

I built my finance agent with this in mind: the main agent uses Sonnet for orchestration, delegates simple web searches to a Haiku-powered subagent, and reserves context for the final synthesis. That’s position sizing applied to LLM architecture.
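
As a rough sketch, that kind of sizing reduces to a routing table plus a token budget. The tier names here are labels rather than API model identifiers, and the 25% synthesis reserve is an assumed number, not a measured one.

```python
# Task type -> model tier (labels only, not API model IDs)
ROUTES = {"orchestration": "sonnet", "web_search": "haiku"}
DEFAULT_TIER = "sonnet"

def pick_tier(task_type):
    """Route a task to a model tier, falling back to the default."""
    return ROUTES.get(task_type, DEFAULT_TIER)

def split_budget(total_tokens, n_subqueries, reserve_for_synthesis=0.25):
    """Hold back a share of the budget for synthesis and divide the rest evenly."""
    reserved = int(total_tokens * reserve_for_synthesis)
    per_query = (total_tokens - reserved) // n_subqueries
    return reserved, per_query

reserved, per_query = split_budget(100_000, 4)
```

With a 100K budget and four sub-queries, each search gets 18,750 tokens and 25,000 stay in reserve for the final write-up.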

3. Risk Management > Return Optimization

New traders obsess over finding the perfect entry. Experienced traders obsess over managing the downside. The entry is maybe 20% of the outcome. The exit rules, stop losses, and hedges are the other 80%.

In AI systems, the equivalent is guardrails, error handling, and fallback behavior. The prompt engineering is maybe 20%. The other 80% is: What happens when the API times out? What happens when the model hallucinates a function name? What happens when the retrieval returns irrelevant chunks?

The trading instinct: Plan for failure modes, not just success paths. Every tool in my agent returns error messages as text content blocks — so the agent can reason about the failure and adapt, rather than crashing. That’s a stop loss for AI.
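
A minimal sketch of that pattern, assuming a synchronous tool that returns a string. The result shape follows the MCP convention of a list of text content blocks. The `is_error` key, the decorator name, and the `get_quote` tool are assumptions; real SDK tools are typically async.

```python
import functools

def errors_as_text(tool_fn):
    """Wrap a tool so exceptions become a text result instead of propagating."""
    @functools.wraps(tool_fn)
    def wrapper(args):
        try:
            return {"content": [{"type": "text", "text": tool_fn(args)}]}
        except Exception as exc:
            return {
                "content": [{"type": "text", "text": f"Error in {tool_fn.__name__}: {exc}"}],
                "is_error": True,  # assumed flag name
            }
    return wrapper

@errors_as_text
def get_quote(args):
    # Hypothetical tool that fails the way a timed-out API call might
    raise TimeoutError("quote API timed out")

result = get_quote({"symbol": "NEM"})
```

Because the failure arrives as ordinary text, the agent can read it and decide to retry, switch tools, or report the gap.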

4. Paper Trading vs. Live Execution

Every trader knows the gap between backtesting and live execution. Your model works perfectly on historical data, then falls apart in production because of slippage, latency, and market impact that didn’t exist in the backtest.

In AI, this is the gap between notebooks and production. Your RAG pipeline works great on 10 test queries in Jupyter, then fails in production because of edge cases in document formatting, embedding drift, or retrieval under load.

The trading instinct: Don’t trust the backtest. Deploy it, measure it in production, and iterate. I evaluate my agent’s research briefs by reading them — not by running automated metrics on synthetic test cases. The real test is: would I trust this brief enough to act on it?

5. The Information Edge Is Temporary

In trading, an edge (a piece of information or a strategy that gives you an advantage) degrades the moment others discover it. The alpha in statistical arbitrage strategies has a half-life measured in months.

In AI engineering, the same is true. The techniques that are novel today (RAG, agents, tool use) will be table stakes in 12 months. The edge isn’t knowing how to build a RAG pipeline — it’s knowing how to build the next thing while everyone else is still learning RAG.

The trading instinct: Build at the frontier, not the median. When I chose the Claude Agent SDK (released days ago) over LangChain for my finance agent, I wasn’t choosing the safer option — I was choosing the one that positions me 6 months ahead.

6. Cutting Losers Early

The hardest thing in trading is admitting you’re wrong and closing a losing position. The instinct is to hold on, add to it, wait for it to come back. Professional traders develop the discipline to cut losers fast and let winners run.

In AI engineering, the equivalent is knowing when an approach isn’t working and pivoting — not spending three more days debugging a prompt chain that fundamentally can’t solve the problem. After two failed corrections with an LLM, clear the context and rewrite the prompt from scratch. After a day of fighting a framework, switch to a simpler one.

The trading instinct: Sunk cost is irrelevant. The only question is: what’s the best action from here?

The Meta-Lesson

Trading taught me that the world is uncertain, that models are approximations, and that the system around the model matters more than the model itself. Position sizing, risk management, execution quality, and the discipline to cut losers — these aren’t finance concepts. They’re engineering principles.

The best AI engineers I know think like traders: they size their bets (model selection, context allocation), manage their risk (guardrails, fallbacks, evaluation), execute with discipline (production engineering, not notebook prototyping), and stay at the frontier (building with new tools, not just reading about them).

If you have a non-traditional background — trading, medicine, law, operations — don’t see it as a gap. It’s a mental model that most AI engineers don’t have. And mental models are the hardest thing to teach.

Building a Finance Research Agent with Claude Agent SDK

2026-04-11T00:00:00+00:00

I recently built a finance research agent that combines three data sources into autonomous research briefs: real-time market data, semantic search over SEC 10-K filings, and web research.

The system uses Anthropic’s Claude Agent SDK — the same runtime that powers Claude Code, packaged as a Python library. What makes it interesting isn’t any single component, but how they compose:

The agent decides the workflow. You say “research Micron’s HBM thesis” and it autonomously fetches the current stock quote, searches 10-K filings for relevant disclosures about memory production and costs, delegates web research to a subagent for earnings results and analyst targets, and synthesizes everything into a structured bull/bear analysis.

RAG over real financial documents. The retrieval pipeline indexes SEC 10-K filings with section-aware chunking — preserving the boundary between Risk Factors (Item 1A) and MD&A (Item 7) so retrieval returns contextually coherent results, not fragments that span unrelated sections. 4,575 chunks from 28 companies, embedded locally with no API dependency.
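
Section-aware chunking can be sketched as two passes: split on Item headings, then chunk inside each section so no chunk crosses a boundary. The heading regex, the fixed character window, and the sample text are simplifying assumptions; real filings format their headings less consistently.

```python
import re

# Matches headings such as "Item 1A." or "Item 7." at the start of a line
ITEM_HEADING = re.compile(r"^Item\s+(\d+[A-Z]?)\.", re.MULTILINE)

def chunk_by_section(filing_text, max_chars=1000):
    """Split on Item headings first, then chunk within each section only."""
    matches = list(ITEM_HEADING.finditer(filing_text))
    chunks = []
    for i, match in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(filing_text)
        body = filing_text[match.start():end].strip()
        for offset in range(0, len(body), max_chars):
            chunks.append({"section": match.group(1), "text": body[offset:offset + max_chars]})
    return chunks  # text before the first Item heading is dropped in this sketch

sample = "Item 1A. Risk Factors\nFuel prices may rise.\nItem 7. MD&A\nCosts increased this year.\n"
chunks = chunk_by_section(sample, max_chars=30)
```

Each chunk carries its section label, which can be stored as metadata and used to filter retrieval to Risk Factors or MD&A.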

Custom tools via MCP. Seven tools exposed as an in-process MCP server — no subprocess overhead, standard interface. The agent calls them autonomously based on the query, and each tool handles its own error cases so the agent can adapt rather than crash.

The whole thing is ~700 lines of Python with three dependencies. The architecture pattern — agent orchestration + domain-specific tools + RAG + subagent delegation — is generalizable to any domain where you need to combine structured data, document retrieval, and live research.

View the project on GitHub

15 Claude Code Anti-Patterns — and the Fix for Each

2026-04-09T00:00:00+00:00

I use Claude Code daily — for building trading systems, research agents, and a 49-guide learning knowledge base. After hundreds of hours, I’ve cataloged the 15 mistakes that cause the worst results. Most stem from treating Claude Code like Cursor or Copilot. It’s not. It’s an autonomous agent with file access, and the patterns are different.

Here are all 15, organized by category.

Context Management

1. The Kitchen Sink Session

What happens: You start refactoring a component, then ask “hey quick question — how do I set up a cron job?”, then go back to the refactor. Claude starts conflating the two tasks, referencing wrong files, losing precision.

Why: Every message stays in the context window. The cron job digression is now noise that Claude reasons over during your refactor.

Fix: One purpose per session. Use /clear between unrelated tasks. For quick side questions, type ~ before your message to open a background thread that never enters the main conversation.

2. The Correction Spiral

What happens: Claude gets something wrong. You correct it. It gets it wrong again, differently. By correction #4, the output is worse than #1.

Why: Context is now polluted with failed approaches. Claude is reasoning about its own failures rather than the original problem.

Fix: Two-strike rule. After two failed corrections, run /clear and write a better initial prompt incorporating what you learned. A clean session with a better prompt always outperforms a long session with accumulated corrections.

3. Context Window Blindness

What happens: Claude starts referencing functions with slightly wrong names, contradicts earlier decisions, hallucinates file paths. Then the session dies.

Why: There’s no visible progress bar for context usage. At ~70% context, precision drops. At ~85%, hallucinations increase.

Fix: Use /compact proactively at ~60% capacity. Maintain critical state in external files (HANDOFF.md, plan.md) that survive compaction.

4. The Infinite Exploration

What happens: You ask Claude to “investigate” or “look into” a bug. It reads 30+ files, filling the context window with source code. By the time it reports findings, there’s no room left for the actual fix.

Why: Unbounded exploration in the main conversation is expensive — every file read stays in context forever.

Fix: Either scope narrowly (“Check the auth flow in src/auth/, especially token refresh”) or use a subagent. The subagent explores in a separate context and reports a summary. Your main conversation stays clean for implementation.

Workflow

5. Skipping Plan Mode

What happens: You describe a feature and Claude starts coding immediately. It solves the wrong problem, modifies API contracts you didn’t discuss, creates abstractions nobody asked for.

Fix: For anything touching 3+ files: use Plan Mode (Shift+Tab). Have Claude explore first, propose a plan, get your approval, then implement. One practitioner put it well: “What seemed like speed early on turned into refactors, unclear PRs, and brittle architecture.”

6. Premature Completion

What happens: Claude says “Done! The implementation handles all edge cases.” You look at the code — it handles the happy path only. Error handling is absent. Two of the four requirements are unimplemented.

Why: Claude optimizes for helpfulness and confidence. One team found that work taking 10 minutes per piece when broken into subtasks took two days as a single large task, because Claude kept declaring completion prematurely.

Fix: Break large tasks into small, explicit subtasks with completion criteria. Always run the code yourself. Adopt the standard: “Would a staff engineer approve this PR?”

7. Micromanaging Implementation Steps

What happens: You write: “First open src/auth.py, find the refresh_token function on line 142, change the timeout from 3600 to 7200…” Claude follows your instructions exactly — and misses the three other places the timeout is referenced.

Why: Habit from Cursor/Copilot where you guide the tool through specific edits.

Fix: Describe the outcome, not the method. “Users report that login fails after session timeout. The token refresh window is too short. Investigate the auth flow, write a failing test, then fix it.” Let Claude determine the approach.

8. Unscoped Git Autonomy

What happens: Claude creates branches with wrong names, commits build artifacts, runs git add . which stages your .env file, or does a rebase that messes up the merge base.

Fix: Let Claude modify files. Handle git yourself. Review git diff before every commit. Run git status to check for untracked files.
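
A small check along these lines can back up the manual review. It reads the output of `git status --porcelain` and flags staged paths that look risky. The pattern list and function name are assumptions, and renames (which porcelain prints as `old -> new`) are not handled.

```python
import fnmatch

# Hypothetical patterns for files that should rarely be committed
RISKY_PATTERNS = [".env", "*.pem", "dist/*", "build/*", "node_modules/*"]

def risky_staged(porcelain_output):
    """Return staged paths from `git status --porcelain` output that match a risky pattern."""
    flagged = []
    for line in porcelain_output.splitlines():
        # Column 1 is the index (staged) status; " " means unstaged, "?" untracked
        if len(line) < 4 or line[0] in " ?":
            continue
        path = line[3:]
        if any(fnmatch.fnmatch(path, pattern) for pattern in RISKY_PATTERNS):
            flagged.append(path)
    return flagged

status = "A  .env\nM  src/auth.py\n?? notes.txt\nA  dist/app.js\n M README.md\n"
flagged = risky_staged(status)
```

Here the staged .env and the build artifact are flagged, while the untracked and unstaged files are ignored.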

Configuration

9. CLAUDE.md Bloat

What happens: Your CLAUDE.md is 500+ lines. Claude ignores half of it.

Why: Boris Cherny’s team (Claude Code creators) keeps theirs at ~100 lines. If Claude already does something correctly without the instruction, the instruction is noise.

Fix: For each line, ask: “Would removing this cause Claude to make a mistake?” If not, cut it.

10. Negative-Only Rules

What happens: Your CLAUDE.md says “Never use the --force flag” or “Don’t use class-based components.” Claude gets stuck when it thinks it needs that approach and has no alternative.

Fix: Always pair prohibitions with alternatives: “Use --baz instead of --foo-bar for X.”

11. Advisory When You Need Deterministic

What happens: Your CLAUDE.md says “Always run linting after editing files.” Claude follows this 80% of the time. The other 20%, it skips it.

Why: CLAUDE.md instructions are advisory — Claude can and does ignore them, especially as context fills. If a rule must execute 100% of the time, advisory isn’t enough.

Fix: Use hooks for anything that must happen every time. Hooks run scripts automatically at specific points in Claude’s workflow and cannot be skipped. Critical gotcha: in hooks, exit 1 is a non-blocking warning. Only exit 2 actually blocks.
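
Here’s a sketch of the decision logic for a blocking hook. It assumes the hook receives a JSON payload with the shell command under `tool_input.command`, and that the script exits with the returned code. The rule itself is a hypothetical example that pairs the prohibition with an alternative.

```python
def decide(payload):
    """Return (exit_code, message) for a hook payload: 2 blocks, 0 allows."""
    command = payload.get("tool_input", {}).get("command", "")
    if "git push" in command and "--force" in command.split():
        return 2, "Blocked: use --force-with-lease instead of --force."
    return 0, ""

# In a real hook script the payload would come from stdin, the message
# would go to stderr, and the code would be passed to sys.exit(), e.g.:
#   code, message = decide(json.load(sys.stdin))
#   print(message, file=sys.stderr); sys.exit(code)
blocked = decide({"tool_name": "Bash", "tool_input": {"command": "git push --force origin main"}})
allowed = decide({"tool_name": "Bash", "tool_input": {"command": "git push --force-with-lease origin main"}})
```

Returning 1 instead of 2 here would only produce a warning, and the push would still run.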

12. MCP Token Bloat

What happens: You install 4-5 MCP servers because “more tools = more capable.” Your session starts with 67,000 tokens already consumed before you type a single prompt.

Why: Each MCP server injects system prompts, tool definitions, and JSON schemas. 50+ tool definitions can consume 30-40K tokens at session start.
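
To get a feel for the overhead, you can estimate it from the tool definitions themselves. This sketch uses the rough rule of thumb of about four characters per token, which is an approximation (JSON may tokenize differently), and the tool definition shown is hypothetical.

```python
import json

def estimate_tokens(tool_definitions, chars_per_token=4):
    """Very rough token estimate for serialized tool definitions."""
    return len(json.dumps(tool_definitions)) // chars_per_token

# Hypothetical tool definition in the general shape MCP servers advertise
tools = [{
    "name": "search_filings",
    "description": "Semantic search over SEC 10-K filing chunks.",
    "inputSchema": {
        "type": "object",
        "properties": {"query": {"type": "string"}, "top_k": {"type": "integer"}},
        "required": ["query"],
    },
}]
one_tool = estimate_tokens(tools)
fifty_tools = estimate_tokens(tools * 50)
```

Real tool descriptions are usually much longer than this one, so treat the result as a lower bound and compare it against what the session actually reports.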

Fix: Only connect MCP servers you actively need for the current task. Disable unused ones via /mcp.

Trust & Verification

13. Approving Without Reading

What happens: Claude proposes a diff. It compiles. The explanation sounds right. You approve without reading. Two days later, you discover it changed an API contract.

Why: Code written by Claude contains 1.75x more logic errors than human-written code (ACM 2025). Worse: Claude sometimes modifies tests to match its incorrect implementation rather than fixing the code.

Fix: Read every diff. If it’s too large to read, the change is too large — break it into smaller tasks. Flag test file changes for manual review.
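
Flagging test changes is easy to automate. This sketch scans a unified diff for the files it touches and returns the ones that look like tests. The naming conventions in `is_test_path` are assumptions; adjust them to your repo.

```python
import re

def is_test_path(path):
    """Heuristic: test_ prefix, _test. or .test. in the file name, or a test(s) directory."""
    *dirs, name = path.split("/")
    return (
        name.startswith("test_")
        or "_test." in name
        or ".test." in name
        or any(part in ("test", "tests") for part in dirs)
    )

def changed_test_files(diff_text):
    """Paths of test files touched by a unified diff, read from its '+++ b/...' headers."""
    paths = re.findall(r"^\+\+\+ b/(.+)$", diff_text, flags=re.MULTILINE)
    return [path for path in paths if is_test_path(path)]

diff = (
    "--- a/src/auth.py\n+++ b/src/auth.py\n@@ -1 +1 @@\n-timeout = 3600\n+timeout = 7200\n"
    "--- a/tests/test_auth.py\n+++ b/tests/test_auth.py\n@@ -1 +1 @@\n"
    "-assert timeout == 3600\n+assert timeout == 7200\n"
)
touched_tests = changed_test_files(diff)
```

Any non-empty result is a cue to read those hunks first and ask whether the test changed because the requirement changed or because the implementation was wrong.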

14. Permission Fatigue

What happens: Two failure modes: (a) You click “approve” 30+ times without reading because default permissions ask on every file write. (b) You use --dangerously-skip-permissions to skip everything.

Fix: Use auto mode, which uses a classifier to block risky operations while letting routine work proceed. Or use /permissions to allowlist specific safe commands.

15. Subagent Misuse

What happens: Three failure modes: (1) Vague delegation — no scope, no success criteria. (2) Over-delegation — spawning agents for 30-second tasks. (3) Context gatekeeping — hiding all testing context from the main agent.

Fix: A task is subagent-worthy only if you can state it in one paragraph, define what “done” looks like, and describe the output format. Use subagents for research and exploration, not for implementation you need to review.

The 5 Rules

1. One purpose per session. /clear between tasks.
2. Plan before you build. Shift+Tab on anything touching 3+ files.
3. Compact before you degrade. /compact at 60%, not 90%.
4. Read every diff. If it’s too big to read, break the task down.
5. Hooks for rules, CLAUDE.md for guidance. If it must happen every time, it’s a hook.

The AI Engineer Skill Tree — What to Learn and What to Skip in 2026

2026-04-08T00:00:00+00:00

AI Engineer is LinkedIn’s #1 fastest-growing job (+74% YoY). But “AI Engineer” means different things in different job postings, and most learning roadmaps tell you to learn everything. After analyzing 30+ job postings and 20+ industry reports, here’s what actually matters.

AI Engineer ≠ ML Engineer

This distinction is critical. AI Engineers build on top of foundation models. ML Engineers build the models themselves. Different skills, different math, different frameworks.

	AI Engineer	ML Engineer
Focus	Integrating foundation models into products	Training/optimizing custom models
Core work	RAG, agents, prompt chains, tool use	Data pipelines, model training, MLOps
Key frameworks	Claude Agent SDK, LangChain, LlamaIndex	PyTorch, TensorFlow, Kubeflow

If you’re targeting AI Engineer roles, you can skip PyTorch, skip CNNs, skip RLHF. Focus on the application layer.

The Four Tiers

I grouped skills by how frequently they appear in job postings:

Must Have (70%+ of postings): Python, SQL, RAG architecture, prompt engineering, at least one orchestration framework (LangChain or Claude Agent SDK), vector databases, Docker, FastAPI.

Strong Plus (40-70%): LangGraph, fine-tuning (LoRA), cloud AI services (Bedrock/Vertex/Azure), evaluation frameworks, guardrails.

Nice to Have (20-40%): CrewAI, AutoGen, Graph RAG, Terraform, ONNX.

Emerging (<20% but growing fast): Claude Agent SDK, MCP, Managed Agents, context engineering. Fewer than 5% of developers have worked with MCP directly, but enterprise demand already exceeds supply.

Where to Start

The highest-leverage move: build something real with the Claude Agent SDK. It gives you built-in file, web, and shell tools out of the box — no boilerplate. Add RAG with ChromaDB (local, free embeddings). Wrap it in FastAPI. Dockerize it. That’s four portfolio artifacts in four weeks, each demonstrating a skill tier employers are looking for.

The biggest gap employers report isn’t technical — it’s the inability to answer “How do you know it works?” Build evaluation into everything you ship. If you can explain your eval framework in an interview, you’re ahead of 90% of candidates.