<?xml version="1.0" encoding="utf-8"?>
<feed xmlns="http://www.w3.org/2005/Atom">

 <title>Tim Roller, CFA</title>
 <link href="https://timroller.github.io/atom.xml" rel="self"/>
 <link href="https://timroller.github.io/"/>
 <updated>2026-04-12T23:12:26+00:00</updated>
 <id>https://timroller.github.io</id>
 <author>
   <name>Tim Roller</name>
 </author>

 
 <entry>
   <title>Exploring Agentic Search Over SEC Filings — What Worked and What Didn't</title>
   <link href="https://timroller.github.io/2026/04/12/agentic-search-vs-vector-search.html"/>
   <updated>2026-04-12T00:00:00+00:00</updated>
   <id>https://timroller.github.io/2026/04/12/agentic-search-vs-vector-search</id>
   <content type="html">&lt;p&gt;I’ve been exploring how agentic AI changes the way we search financial documents. I built a system that searches 4,575 SEC 10-K filing chunks using three different approaches — naive vector search, hybrid search, and agentic search. Some of it worked better than expected. Some of it was confidently wrong. Here’s what I found.&lt;/p&gt;

&lt;h2 id=&quot;why-data-engineers-should-care-about-rag&quot;&gt;Why Data Engineers Should Care About RAG&lt;/h2&gt;

&lt;p&gt;If you work in data, the ground is shifting under you. The traditional pipeline — extract, transform, load, query with SQL — is being augmented by a new layer: &lt;strong&gt;retrieval-augmented generation.&lt;/strong&gt; Instead of writing a dashboard for every question, users ask in natural language and an AI retrieves the relevant data, synthesizes it, and answers.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;vector database&lt;/strong&gt; (like Pinecone or ChromaDB) makes this possible. You convert your documents into numerical representations (embeddings), store them, and search by meaning instead of keywords. “Companies with rising production costs” finds relevant paragraphs even if they never use that exact phrase.&lt;/p&gt;

&lt;p&gt;This isn’t replacing SQL and dbt. It’s a new layer on top, and the engineers who understand both worlds (structured data pipelines AND semantic retrieval) are the ones companies are hiring right now.&lt;/p&gt;

&lt;h2 id=&quot;what-i-built&quot;&gt;What I Built&lt;/h2&gt;

&lt;p&gt;A finance research agent that combines three data sources: real-time market data (FMP API), semantic search over 4,575 SEC 10-K filing chunks (Pinecone), and web research via a subagent. Built with Anthropic’s Claude Agent SDK and MCP.&lt;/p&gt;

&lt;p&gt;I gave the agent three search modes to choose from — and this is where it got interesting.&lt;/p&gt;

&lt;h2 id=&quot;three-levels-of-search-and-what-i-learned-about-each&quot;&gt;Three Levels of Search (and What I Learned About Each)&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Naive vector search&lt;/strong&gt; is what every tutorial teaches. Embed the query, find the closest vectors, return top-k. It works for simple questions but falls apart on anything with precise financial terminology. When I searched for “fuel cost inflation,” it returned chunks about currency exposure and gold supply — semantically adjacent but not what I needed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hybrid search&lt;/strong&gt; adds keyword matching on top of vector similarity. Financial terms like “AISC,” “ROIC,” and specific ticker symbols need literal matching, not just semantic approximation. When I added a keyword boost, the relevance score of the top result jumped 28%. I should note — my hybrid implementation is an approximation (post-hoc keyword boosting), not true sparse-dense retrieval. It works, but it’s not production-grade.&lt;/p&gt;
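
&lt;p&gt;A minimal sketch of post-hoc keyword boosting, assuming the vector store has already returned candidates as (text, score) pairs. The function name, the boost weight of 0.1, and the sample data are illustrative assumptions, not the values used in the project.&lt;/p&gt;

```python
import re

def keyword_boost(candidates, query_terms, boost=0.1):
    """Re-rank (text, vector_score) pairs by adding a bonus per literal term hit.

    Hypothetical helper: `boost` is an arbitrary illustrative weight.
    """
    reranked = []
    for text, score in candidates:
        # Whole-word, case-insensitive match so "AISC" does not match inside another word.
        hits = sum(
            1 for term in query_terms
            if re.search(r"\b" + re.escape(term) + r"\b", text, flags=re.IGNORECASE)
        )
        reranked.append((text, score + boost * hits))
    # Highest combined score first.
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)

# Made-up candidates standing in for vector search results.
candidates = [
    ("Currency exposure increased hedging costs.", 0.62),
    ("AISC rose on higher fuel and labor costs.", 0.58),
]
ranked = keyword_boost(candidates, ["AISC", "fuel"])
top_text = ranked[0][0]
```

&lt;p&gt;In this toy case the chunk that literally mentions both terms overtakes the one that was only semantically close. Because the boost is applied after retrieval, it can only reorder what the vector search already returned.&lt;/p&gt;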

&lt;p&gt;&lt;strong&gt;Agentic search&lt;/strong&gt; is where the AI decides &lt;em&gt;how&lt;/em&gt; to search. Instead of a single query, the agent decomposes “compare cost pressures across gold miners” into sub-queries — one per company, one per cost dimension — and runs targeted searches for each. This is genuinely useful for cross-document analysis. But for a focused question about one company, it was overkill — more latency, no better results than hybrid. Knowing when NOT to use it is as important as knowing how.&lt;/p&gt;

&lt;h2 id=&quot;where-it-went-wrong&quot;&gt;Where It Went Wrong&lt;/h2&gt;

&lt;p&gt;This is the part nobody writes about. When I verified the agent’s gold miner analysis against the actual filing text in Pinecone, I found three errors in a single output:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It confused two metrics.&lt;/strong&gt; The agent reported Gold Fields’ AISC guidance as “$1,500/oz.” The actual filing says AIC — a different metric — at “$1,732/oz.” Both appear in the same MD&amp;amp;A section. The LLM conflated them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It cited the wrong year.&lt;/strong&gt; For AngloGold, the agent quoted a 2022→2023 AISC increase. But my index has the FY2024 filing, which shows $1,672/oz. The agent grabbed a historical reference from the same document instead of the current-year number.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It hallucinated a number.&lt;/strong&gt; The agent stated Newmont’s FX tailwinds “reduced costs by $190M.” That figure doesn’t appear in any retrieved chunk. The LLM fabricated it during synthesis.&lt;/p&gt;

&lt;p&gt;I caught these by querying Pinecone directly — bypassing the agent — and comparing chunk text to the agent’s claims. The retrieval was actually solid. The failures were all in synthesis: the LLM misread its own context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The uncomfortable takeaway:&lt;/strong&gt; RAG grounds the LLM in real documents, but it doesn’t prevent the LLM from misinterpreting what it retrieved. The retrieval layer and the synthesis layer fail in different ways, and most people only evaluate the first.&lt;/p&gt;

&lt;h2 id=&quot;what-i-think-matters-going-forward&quot;&gt;What I Think Matters Going Forward&lt;/h2&gt;

&lt;p&gt;I’m still figuring this out — there are very few established best practices for agentic RAG evaluation. But here’s what I’d prioritize:&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;strong&gt;Faithfulness checking.&lt;/strong&gt; Automatically compare each claim in the output to the retrieved context. Did the agent say something the chunks don’t support? That’s a flag.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Confidence scoring.&lt;/strong&gt; When retrieval scores are low, the system should say “I’m not sure” instead of guessing confidently.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Regression tests.&lt;/strong&gt; A curated set of queries with known answers, run on every pipeline change. The errors above would’ve been caught.&lt;/li&gt;
&lt;/ul&gt;
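
&lt;p&gt;One narrow slice of faithfulness checking can be sketched in a few lines: pull every dollar figure out of the agent’s answer and flag any that never appears in the retrieved chunks. This is a hedged illustration with made-up sample text and a hypothetical function name. It only catches unsupported numbers, not metric mix-ups such as AISC versus AIC.&lt;/p&gt;

```python
import re

# Matches figures such as $1,732 or $190M (thousands separators, optional suffix).
DOLLAR_FIGURE = re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?[MBK]?")

def unsupported_figures(answer, chunks):
    """Return dollar figures in `answer` that appear in none of the `chunks`."""
    context = " ".join(chunks)
    supported = set(DOLLAR_FIGURE.findall(context))
    return [fig for fig in DOLLAR_FIGURE.findall(answer) if fig not in supported]

# Sample strings written for this example, not quoted from any filing.
chunks = ["All-in costs (AIC) guidance is $1,732/oz for the year."]
answer = "AIC guidance is $1,732/oz and FX tailwinds reduced costs by $190M."
flags = unsupported_figures(answer, chunks)
```

&lt;p&gt;Here &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;flags&lt;/code&gt; contains only the figure that has no support in the context. The same function could double as a regression test assertion on a fixed set of queries.&lt;/p&gt;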

&lt;p&gt;The gap between “impressive demo” and “reliable system” is mostly evaluation. Building the agent took a week. Building trustworthy evaluation will take longer — and matter more.&lt;/p&gt;

&lt;h2 id=&quot;the-stack&quot;&gt;The Stack&lt;/h2&gt;

&lt;p&gt;&lt;a href=&quot;https://www.pinecone.io/&quot;&gt;Pinecone&lt;/a&gt; (serverless vector DB) · &lt;a href=&quot;https://pypi.org/project/claude-agent-sdk/&quot;&gt;Claude Agent SDK&lt;/a&gt; (agent orchestration) · &lt;a href=&quot;https://modelcontextprotocol.io/&quot;&gt;MCP&lt;/a&gt; (tool protocol) · Local embeddings (all-MiniLM-L6-v2)&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/TimRoller/finance-research-agent&quot;&gt;View the code on GitHub&lt;/a&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>What Derivatives Trading Taught Me About Building AI Systems</title>
   <link href="https://timroller.github.io/2026/04/11/what-trading-taught-me-about-ai.html"/>
   <updated>2026-04-11T00:00:00+00:00</updated>
   <id>https://timroller.github.io/2026/04/11/what-trading-taught-me-about-ai</id>
   <content type="html">&lt;p&gt;&lt;img src=&quot;/public/hero-trading-ai.png&quot; alt=&quot;Trading to AI&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I spent six years trading derivatives before becoming an AI engineer. Most people see those as unrelated careers. They’re not. The mental models from trading are the same ones that make you effective at building AI systems — and the ones most AI engineers are missing.&lt;/p&gt;

&lt;h2 id=&quot;1-thinking-in-probabilities-not-certainties&quot;&gt;1. Thinking in Probabilities, Not Certainties&lt;/h2&gt;

&lt;p&gt;On the trading floor, every decision is a probability-weighted bet. You never &lt;em&gt;know&lt;/em&gt; the stock will go up. You estimate the probability, size the position accordingly, and manage the risk.&lt;/p&gt;

&lt;p&gt;AI systems work the same way. An LLM doesn’t &lt;em&gt;know&lt;/em&gt; the answer — it produces a probability distribution over tokens. When you build a RAG pipeline, you’re not guaranteed to retrieve the right document. When you deploy an agent, you can’t be certain it will call the right tools.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trading instinct:&lt;/strong&gt; Think about confidence intervals, not binary outcomes. Build systems that handle the case where the model is wrong — because it will be, and more often than you expect.&lt;/p&gt;

&lt;p&gt;Most AI engineers I work with treat model output as ground truth. Traders never make that mistake with their positions.&lt;/p&gt;

&lt;h2 id=&quot;2-position-sizing--resource-allocation&quot;&gt;2. Position Sizing = Resource Allocation&lt;/h2&gt;

&lt;p&gt;In trading, the best idea in the world is worthless if you size the position wrong. Too small and it doesn’t move the needle. Too large and one bad tick wipes you out.&lt;/p&gt;

&lt;p&gt;In AI engineering, the equivalent is token budgets, model selection, and context allocation. Do you burn 100K tokens on a single research query, or split it across four focused sub-queries? Do you use Opus for everything, or route simple tasks to Haiku and save Opus for synthesis?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trading instinct:&lt;/strong&gt; The &lt;em&gt;size&lt;/em&gt; of the bet matters as much as the &lt;em&gt;direction&lt;/em&gt;. In AI, the cost, latency, and context allocation of each model call matter as much as the prompt.&lt;/p&gt;

&lt;p&gt;I built my finance agent with this in mind: the main agent uses Sonnet for orchestration, delegates simple web searches to a Haiku-powered subagent, and reserves context for the final synthesis. That’s position sizing applied to LLM architecture.&lt;/p&gt;
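
&lt;p&gt;The routing idea can be sketched as a lookup from task type to a model tier and a token budget. The task labels, tier names, and budget numbers below are placeholders chosen for illustration, not identifiers or limits from any SDK.&lt;/p&gt;

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    model: str        # illustrative tier label, not a real model ID
    max_tokens: int   # illustrative per-call budget

# Hypothetical routing table: cheap tier for lookups, larger budget for synthesis.
ROUTES = {
    "web_search": Route("haiku", 2_000),
    "orchestration": Route("sonnet", 8_000),
    "synthesis": Route("sonnet", 32_000),
}

def route(task_type):
    """Pick a tier and budget; unknown task types get the orchestration route."""
    return ROUTES.get(task_type, ROUTES["orchestration"])

plan = [route(task) for task in ("web_search", "synthesis", "unknown_task")]
```

&lt;p&gt;Keeping the table in one place makes the sizing decision explicit and easy to change, instead of scattering it across individual calls.&lt;/p&gt;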

&lt;h2 id=&quot;3-risk-management--return-optimization&quot;&gt;3. Risk Management &amp;gt; Return Optimization&lt;/h2&gt;

&lt;p&gt;New traders obsess over finding the perfect entry. Experienced traders obsess over managing the downside. The entry is maybe 20% of the outcome. The exit rules, stop losses, and hedges are the other 80%.&lt;/p&gt;

&lt;p&gt;In AI systems, the equivalent is guardrails, error handling, and fallback behavior. The prompt engineering is maybe 20%. The other 80% is: What happens when the API times out? What happens when the model hallucinates a function name? What happens when the retrieval returns irrelevant chunks?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trading instinct:&lt;/strong&gt; Plan for failure modes, not just success paths. Every tool in my agent returns error messages as text content blocks — so the agent can reason about the failure and adapt, rather than crashing. That’s a stop loss for AI.&lt;/p&gt;
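
&lt;p&gt;A sketch of that pattern, assuming a tool handler returns a dict holding a list of text content blocks in the style of MCP tool results. The &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;fetch_quote&lt;/code&gt; function, its stubbed failure, and the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;is_error&lt;/code&gt; key are assumptions for illustration, not code from the project.&lt;/p&gt;

```python
def text_result(message, is_error=False):
    """Wrap a message as a single text content block (assumed result shape)."""
    return {"content": [{"type": "text", "text": message}], "is_error": is_error}

def fetch_quote(symbol, client):
    """Hypothetical tool handler: never raises, always returns a text block."""
    try:
        price = client(symbol)
    except Exception as exc:
        # Report the failure as text so the agent can read it and adapt.
        return text_result(f"Quote lookup failed for {symbol}: {exc}", is_error=True)
    return text_result(f"{symbol} last price: {price}")

def failing_client(symbol):
    # Stub that simulates an API timeout.
    raise TimeoutError("upstream timed out")

result = fetch_quote("MU", failing_client)
```

&lt;p&gt;The failing call still produces a normal result object, so the agent loop keeps running and the model sees the error text alongside its other tool outputs.&lt;/p&gt;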

&lt;h2 id=&quot;4-paper-trading-vs-live-execution&quot;&gt;4. Paper Trading vs. Live Execution&lt;/h2&gt;

&lt;p&gt;Every trader knows the gap between backtesting and live execution. Your model works perfectly on historical data, then falls apart in production because of slippage, latency, and market impact that didn’t exist in the backtest.&lt;/p&gt;

&lt;p&gt;In AI, this is the gap between notebooks and production. Your RAG pipeline works great on 10 test queries in Jupyter, then fails in production because of edge cases in document formatting, embedding drift, or retrieval under load.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trading instinct:&lt;/strong&gt; Don’t trust the backtest. Deploy it, measure it in production, and iterate. I evaluate my agent’s research briefs by reading them — not by running automated metrics on synthetic test cases. The real test is: would I trust this brief enough to act on it?&lt;/p&gt;

&lt;h2 id=&quot;5-the-information-edge-is-temporary&quot;&gt;5. The Information Edge Is Temporary&lt;/h2&gt;

&lt;p&gt;In trading, an edge (a piece of information or a strategy that gives you an advantage) degrades the moment others discover it. The alpha in statistical arbitrage strategies has a half-life measured in months.&lt;/p&gt;

&lt;p&gt;In AI engineering, the same is true. The techniques that are novel today (RAG, agents, tool use) will be table stakes in 12 months. The edge isn’t knowing how to build a RAG pipeline — it’s knowing how to build the &lt;em&gt;next&lt;/em&gt; thing while everyone else is still learning RAG.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trading instinct:&lt;/strong&gt; Build at the frontier, not the median. When I chose the Claude Agent SDK (released days ago) over LangChain for my finance agent, I wasn’t choosing the safer option — I was choosing the one that positions me 6 months ahead.&lt;/p&gt;

&lt;h2 id=&quot;6-cutting-losers-early&quot;&gt;6. Cutting Losers Early&lt;/h2&gt;

&lt;p&gt;The hardest thing in trading is admitting you’re wrong and closing a losing position. The instinct is to hold on, add to it, wait for it to come back. Professional traders develop the discipline to cut losers fast and let winners run.&lt;/p&gt;

&lt;p&gt;In AI engineering, the equivalent is knowing when an approach isn’t working and pivoting — not spending three more days debugging a prompt chain that fundamentally can’t solve the problem. After two failed corrections with an LLM, clear the context and rewrite the prompt from scratch. After a day of fighting a framework, switch to a simpler one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The trading instinct:&lt;/strong&gt; Sunk cost is irrelevant. The only question is: what’s the best action &lt;em&gt;from here&lt;/em&gt;?&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-meta-lesson&quot;&gt;The Meta-Lesson&lt;/h2&gt;

&lt;p&gt;Trading taught me that the world is uncertain, that models are approximations, and that the system around the model matters more than the model itself. Position sizing, risk management, execution quality, and the discipline to cut losers — these aren’t finance concepts. They’re engineering principles.&lt;/p&gt;

&lt;p&gt;The best AI engineers I know think like traders: they size their bets (model selection, context allocation), manage their risk (guardrails, fallbacks, evaluation), execute with discipline (production engineering, not notebook prototyping), and stay at the frontier (building with new tools, not just reading about them).&lt;/p&gt;

&lt;p&gt;If you have a non-traditional background — trading, medicine, law, operations — don’t see it as a gap. It’s a mental model that most AI engineers don’t have. And mental models are the hardest thing to teach.&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>Building a Finance Research Agent with Claude Agent SDK</title>
   <link href="https://timroller.github.io/2026/04/11/building-a-finance-research-agent.html"/>
   <updated>2026-04-11T00:00:00+00:00</updated>
   <id>https://timroller.github.io/2026/04/11/building-a-finance-research-agent</id>
   <content type="html">&lt;p&gt;&lt;img src=&quot;/public/hero-finance-agent.png&quot; alt=&quot;Finance Research Agent&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I recently built a finance research agent that combines three data sources into autonomous research briefs: real-time market data, semantic search over SEC 10-K filings, and web research.&lt;/p&gt;

&lt;p&gt;The system uses Anthropic’s Claude Agent SDK — the same runtime that powers Claude Code, packaged as a Python library. What makes it interesting isn’t any single component, but how they compose:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The agent decides the workflow.&lt;/strong&gt; You say “research Micron’s HBM thesis” and it autonomously fetches the current stock quote, searches 10-K filings for relevant disclosures about memory production and costs, delegates web research to a subagent for earnings results and analyst targets, and synthesizes everything into a structured bull/bear analysis.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG over real financial documents.&lt;/strong&gt; The retrieval pipeline indexes SEC 10-K filings with section-aware chunking — preserving the boundary between Risk Factors (Item 1A) and MD&amp;amp;A (Item 7) so retrieval returns contextually coherent results, not fragments that span unrelated sections. 4,575 chunks from 28 companies, embedded locally with no API dependency.&lt;/p&gt;
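
&lt;p&gt;Section-aware chunking can be sketched as two passes: split the filing at item headings, then window each section separately so no chunk crosses a boundary. The heading regex, chunk size, and sample text are simplifying assumptions. Real 10-K text needs more robust heading detection.&lt;/p&gt;

```python
import re

# Assumed heading pattern: a line starting with "Item 1A." or "Item 7."
ITEM_HEADING = re.compile(r"^(Item\s+\d+[A-Z]?)\.", flags=re.MULTILINE)

def section_chunks(filing_text, max_chars=200):
    """Yield (section_label, chunk) pairs; chunks never span two sections."""
    matches = list(ITEM_HEADING.finditer(filing_text))
    for i, match in enumerate(matches):
        # A section runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 != len(matches) else len(filing_text)
        body = filing_text[match.start():end].strip()
        # Fixed-size character windows inside one section only.
        for offset in range(0, len(body), max_chars):
            yield match.group(1), body[offset:offset + max_chars]

# Toy filing text written for this example.
sample = (
    "Item 1A. Risk Factors\nFuel prices may rise.\n"
    "Item 7. Management Discussion\nCosts increased."
)
chunks = list(section_chunks(sample))
labels = [label for label, _ in chunks]
```

&lt;p&gt;Carrying the section label with each chunk also makes it available as metadata at query time, for example to restrict a search to Risk Factors.&lt;/p&gt;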

&lt;p&gt;&lt;strong&gt;Custom tools via MCP.&lt;/strong&gt; Seven tools exposed as an in-process MCP server — no subprocess overhead, standard interface. The agent calls them autonomously based on the query, and each tool handles its own error cases so the agent can adapt rather than crash.&lt;/p&gt;

&lt;p&gt;The whole thing is ~700 lines of Python with three dependencies. The architecture pattern — agent orchestration + domain-specific tools + RAG + subagent delegation — is generalizable to any domain where you need to combine structured data, document retrieval, and live research.&lt;/p&gt;

&lt;p&gt;&lt;a href=&quot;https://github.com/TimRoller/finance-research-agent&quot;&gt;View the project on GitHub&lt;/a&gt;&lt;/p&gt;
</content>
 </entry>
 
 <entry>
   <title>15 Claude Code Anti-Patterns — and the Fix for Each</title>
   <link href="https://timroller.github.io/2026/04/09/claude-code-anti-patterns.html"/>
   <updated>2026-04-09T00:00:00+00:00</updated>
   <id>https://timroller.github.io/2026/04/09/claude-code-anti-patterns</id>
   <content type="html">&lt;p&gt;&lt;img src=&quot;/public/hero-anti-patterns.png&quot; alt=&quot;Claude Code Anti-Patterns&quot; /&gt;&lt;/p&gt;

&lt;p&gt;I use Claude Code daily — for building trading systems, research agents, and a 49-guide learning knowledge base. After hundreds of hours, I’ve cataloged the 15 mistakes that cause the worst results. Most stem from treating Claude Code like Cursor or Copilot. It’s not. It’s an autonomous agent with file access, and the patterns are different.&lt;/p&gt;

&lt;p&gt;Here are all 15, organized by category.&lt;/p&gt;

&lt;h2 id=&quot;context-management&quot;&gt;Context Management&lt;/h2&gt;

&lt;h3 id=&quot;1-the-kitchen-sink-session&quot;&gt;1. The Kitchen Sink Session&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; You start refactoring a component, then ask “hey quick question — how do I set up a cron job?”, then go back to the refactor. Claude starts conflating the two tasks, referencing wrong files, losing precision.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Every message stays in the context window. The cron job digression is now noise that Claude reasons over during your refactor.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; One purpose per session. Use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/clear&lt;/code&gt; between unrelated tasks. For quick side questions, type &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;~&lt;/code&gt; before your message to open a background thread that never enters the main conversation.&lt;/p&gt;

&lt;h3 id=&quot;2-the-correction-spiral&quot;&gt;2. The Correction Spiral&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; Claude gets something wrong. You correct it. It gets it wrong again, differently. By correction #4, the output is worse than #1.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Context is now polluted with failed approaches. Claude is reasoning about its own failures rather than the original problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Two-strike rule. After two failed corrections, run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/clear&lt;/code&gt; and write a better initial prompt incorporating what you learned. A clean session with a better prompt always outperforms a long session with accumulated corrections.&lt;/p&gt;

&lt;h3 id=&quot;3-context-window-blindness&quot;&gt;3. Context Window Blindness&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; Claude starts referencing functions with slightly wrong names, contradicts earlier decisions, hallucinates file paths. Then the session dies.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; No visible progress bar. At ~70% context, precision drops. At ~85%, hallucinations increase.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/compact&lt;/code&gt; proactively at ~60% capacity. Maintain critical state in external files (HANDOFF.md, plan.md) that survive compaction.&lt;/p&gt;

&lt;h3 id=&quot;4-the-infinite-exploration&quot;&gt;4. The Infinite Exploration&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; You ask Claude to “investigate” or “look into” a bug. It reads 30+ files, filling the context window with source code. By the time it reports findings, there’s no room left for the actual fix.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Unbounded exploration in the main conversation is expensive — every file read stays in context forever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Either scope narrowly (“Check the auth flow in &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/auth/&lt;/code&gt;, especially token refresh”) or use a subagent. The subagent explores in a separate context and reports a summary. Your main conversation stays clean for implementation.&lt;/p&gt;

&lt;h2 id=&quot;workflow&quot;&gt;Workflow&lt;/h2&gt;

&lt;h3 id=&quot;5-skipping-plan-mode&quot;&gt;5. Skipping Plan Mode&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; You describe a feature and Claude starts coding immediately. It solves the wrong problem, modifies API contracts you didn’t discuss, creates abstractions nobody asked for.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; For anything touching 3+ files: use Plan Mode (&lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Shift+Tab&lt;/code&gt;). Have Claude explore first, propose a plan, get your approval, then implement. One practitioner put it well: “What seemed like speed early on turned into refactors, unclear PRs, and brittle architecture.”&lt;/p&gt;

&lt;h3 id=&quot;6-premature-completion&quot;&gt;6. Premature Completion&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; Claude says “Done! The implementation handles all edge cases.” You look at the code — it handles the happy path only. Error handling is absent. Two of the four requirements are unimplemented.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Claude optimizes for helpfulness and confidence. One team found that tasks taking 10 minutes each when broken into subtasks took &lt;em&gt;two days&lt;/em&gt; as a single large task — because Claude kept declaring premature completion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Break large tasks into small, explicit subtasks with completion criteria. Always run the code yourself. Adopt the standard: “Would a staff engineer approve this PR?”&lt;/p&gt;

&lt;h3 id=&quot;7-micromanaging-implementation-steps&quot;&gt;7. Micromanaging Implementation Steps&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; You write: “First open &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;src/auth.py&lt;/code&gt;, find the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;refresh_token&lt;/code&gt; function on line 142, change the timeout from 3600 to 7200…” Claude follows your instructions exactly — and misses the three other places the timeout is referenced.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Habit from Cursor/Copilot where you guide the tool through specific edits.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Describe the outcome, not the method. “Users report that login fails after session timeout. The token refresh window is too short. Investigate the auth flow, write a failing test, then fix it.” Let Claude determine the approach.&lt;/p&gt;

&lt;h3 id=&quot;8-unscoped-git-autonomy&quot;&gt;8. Unscoped Git Autonomy&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; Claude creates branches with wrong names, commits build artifacts, runs &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git add .&lt;/code&gt; which stages your &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;.env&lt;/code&gt; file, or does a rebase that messes up the merge base.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Let Claude modify files. Handle git yourself. Review &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git diff&lt;/code&gt; before every commit. Run &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;git status&lt;/code&gt; to check for untracked files.&lt;/p&gt;

&lt;h2 id=&quot;configuration&quot;&gt;Configuration&lt;/h2&gt;

&lt;h3 id=&quot;9-claudemd-bloat&quot;&gt;9. CLAUDE.md Bloat&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; Your CLAUDE.md is 500+ lines. Claude ignores half of it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Boris Cherny’s team (Claude Code creators) keeps theirs at ~100 lines. If Claude already does something correctly without the instruction, the instruction is noise.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; For each line, ask: “Would removing this cause Claude to make a mistake?” If not, cut it.&lt;/p&gt;

&lt;h3 id=&quot;10-negative-only-rules&quot;&gt;10. Negative-Only Rules&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; Your CLAUDE.md says “Never use the &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--force&lt;/code&gt; flag” or “Don’t use class-based components.” Claude gets stuck when it thinks it needs that approach and has no alternative.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Always pair prohibitions with alternatives: “Use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--baz&lt;/code&gt; instead of &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--foo-bar&lt;/code&gt; for X.”&lt;/p&gt;

&lt;h3 id=&quot;11-advisory-when-you-need-deterministic&quot;&gt;11. Advisory When You Need Deterministic&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; Your CLAUDE.md says “Always run linting after editing files.” Claude follows this 80% of the time. The other 20%, it skips it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; CLAUDE.md instructions are advisory — Claude can and does ignore them, especially as context fills. If a rule must execute 100% of the time, advisory isn’t enough.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use hooks for anything that must happen every time. Hooks run scripts automatically at specific points in Claude’s workflow and cannot be skipped. Critical gotcha: in hooks, &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exit 1&lt;/code&gt; is a non-blocking warning. Only &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;exit 2&lt;/code&gt; actually blocks.&lt;/p&gt;
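
&lt;p&gt;A hedged sketch of a blocking hook, assuming it is registered to run before shell tool calls and receives the call as JSON on stdin with &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tool_name&lt;/code&gt; and &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;tool_input.command&lt;/code&gt; fields. The blocked pattern and the function names are illustrative. Check the current hooks documentation before relying on the payload shape.&lt;/p&gt;

```python
import json
import sys

BLOCK = 2   # the exit code that actually blocks the tool call
ALLOW = 0

def decide(payload):
    """Return the exit code for one hook payload (dict parsed from stdin JSON)."""
    command = payload.get("tool_input", {}).get("command", "")
    tokens = command.split()
    # Illustrative rule: refuse blanket staging so a .env file is never swept in.
    blanket_add = any(
        tokens[i:i + 3] == ["git", "add", "."] for i in range(len(tokens))
    )
    if payload.get("tool_name") == "Bash" and blanket_add:
        return BLOCK
    return ALLOW

def main():
    """Entry point when the file is run as the hook script."""
    code = decide(json.load(sys.stdin))
    if code == BLOCK:
        # The stderr message explains why the call was refused.
        print("Blocked: stage files explicitly instead of 'git add .'", file=sys.stderr)
    sys.exit(code)
```

&lt;p&gt;To use this as a hook script, call &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;main()&lt;/code&gt; at the bottom of the file. Keeping the decision in a pure function makes the rule testable without piping JSON through stdin.&lt;/p&gt;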

&lt;h3 id=&quot;12-mcp-token-bloat&quot;&gt;12. MCP Token Bloat&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; You install 4-5 MCP servers because “more tools = more capable.” Your session starts with 67,000 tokens already consumed before you type a single prompt.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Each MCP server injects system prompts, tool definitions, and JSON schemas. 50+ tool definitions can consume 30-40K tokens at session start.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Only connect MCP servers you actively need for the current task. Disable unused ones via &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/mcp&lt;/code&gt;.&lt;/p&gt;

&lt;h2 id=&quot;trust--verification&quot;&gt;Trust &amp;amp; Verification&lt;/h2&gt;

&lt;h3 id=&quot;13-the-blind-acceptor&quot;&gt;13. The Blind Acceptor&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; Claude proposes a diff. It compiles. The explanation sounds right. You approve without reading. Two days later, you discover it changed an API contract.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why:&lt;/strong&gt; Code written by Claude contains 1.75x more logic errors than human-written code (ACM 2025). Worse: Claude sometimes modifies tests to match its incorrect implementation rather than fixing the code.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Read every diff. If it’s too large to read, the change is too large — break it into smaller tasks. Flag test file changes for manual review.&lt;/p&gt;

&lt;h3 id=&quot;14-permission-fatigue&quot;&gt;14. Permission Fatigue&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; Two failure modes: (a) You click “approve” 30+ times without reading because default permissions ask on every file write. (b) You use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;--dangerously-skip-permissions&lt;/code&gt; to skip everything.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; Use auto mode, which uses a classifier to block risky operations while letting routine work proceed. Or use &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/permissions&lt;/code&gt; to allowlist specific safe commands.&lt;/p&gt;

&lt;h3 id=&quot;15-subagent-misuse&quot;&gt;15. Subagent Misuse&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;What happens:&lt;/strong&gt; Three failure modes: (1) Vague delegation — no scope, no success criteria. (2) Over-delegation — spawning agents for 30-second tasks. (3) Context gatekeeping — hiding all testing context from the main agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fix:&lt;/strong&gt; A task is subagent-worthy only if you can state it in one paragraph, define what “done” looks like, and describe the output format. Use subagents for research and exploration, not for implementation you need to review.&lt;/p&gt;

&lt;hr /&gt;

&lt;h2 id=&quot;the-5-rules&quot;&gt;The 5 Rules&lt;/h2&gt;

&lt;ol&gt;
  &lt;li&gt;&lt;strong&gt;One purpose per session.&lt;/strong&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/clear&lt;/code&gt; between tasks.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Plan before you build.&lt;/strong&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;Shift+Tab&lt;/code&gt; on anything touching 3+ files.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Compact before you degrade.&lt;/strong&gt; &lt;code class=&quot;language-plaintext highlighter-rouge&quot;&gt;/compact&lt;/code&gt; at 60%, not 90%.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Read every diff.&lt;/strong&gt; If it’s too big to read, break the task down.&lt;/li&gt;
  &lt;li&gt;&lt;strong&gt;Hooks for rules, CLAUDE.md for guidance.&lt;/strong&gt; If it must happen every time, it’s a hook.&lt;/li&gt;
&lt;/ol&gt;
</content>
 </entry>
 
 <entry>
   <title>The AI Engineer Skill Tree — What to Learn and What to Skip in 2026</title>
   <link href="https://timroller.github.io/2026/04/08/ai-engineer-skill-tree.html"/>
   <updated>2026-04-08T00:00:00+00:00</updated>
   <id>https://timroller.github.io/2026/04/08/ai-engineer-skill-tree</id>
   <content type="html">&lt;p&gt;&lt;img src=&quot;/public/hero-skill-tree.png&quot; alt=&quot;AI Engineer Skill Tree&quot; /&gt;&lt;/p&gt;

&lt;p&gt;AI Engineer is LinkedIn’s #1 fastest-growing job (+74% YoY). But “AI Engineer” means different things in different job postings, and most learning roadmaps tell you to learn everything. After analyzing 30+ job postings and 20+ industry reports, here’s what actually matters.&lt;/p&gt;

&lt;h2 id=&quot;ai-engineer--ml-engineer&quot;&gt;AI Engineer ≠ ML Engineer&lt;/h2&gt;

&lt;p&gt;This distinction is critical. AI Engineers build &lt;em&gt;on top of&lt;/em&gt; foundation models. ML Engineers build &lt;em&gt;the models themselves&lt;/em&gt;. Different skills, different math, different frameworks.&lt;/p&gt;

&lt;table&gt;
  &lt;thead&gt;
    &lt;tr&gt;
      &lt;th&gt; &lt;/th&gt;
      &lt;th&gt;AI Engineer&lt;/th&gt;
      &lt;th&gt;ML Engineer&lt;/th&gt;
    &lt;/tr&gt;
  &lt;/thead&gt;
  &lt;tbody&gt;
    &lt;tr&gt;
      &lt;td&gt;Focus&lt;/td&gt;
      &lt;td&gt;Integrating foundation models into products&lt;/td&gt;
      &lt;td&gt;Training/optimizing custom models&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Core work&lt;/td&gt;
      &lt;td&gt;RAG, agents, prompt chains, tool use&lt;/td&gt;
      &lt;td&gt;Data pipelines, model training, MLOps&lt;/td&gt;
    &lt;/tr&gt;
    &lt;tr&gt;
      &lt;td&gt;Key frameworks&lt;/td&gt;
      &lt;td&gt;Claude Agent SDK, LangChain, LlamaIndex&lt;/td&gt;
      &lt;td&gt;PyTorch, TensorFlow, Kubeflow&lt;/td&gt;
    &lt;/tr&gt;
  &lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;If you’re targeting AI Engineer roles, you can skip PyTorch, skip CNNs, skip RLHF. Focus on the application layer.&lt;/p&gt;

&lt;h2 id=&quot;the-four-tiers&quot;&gt;The Four Tiers&lt;/h2&gt;

&lt;p&gt;After categorizing skills by how frequently they appear in job postings:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Must Have (70%+ of postings):&lt;/strong&gt; Python, SQL, RAG architecture, prompt engineering, at least one orchestration framework (LangChain or Claude Agent SDK), vector databases, Docker, FastAPI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Strong Plus (40-70%):&lt;/strong&gt; LangGraph, fine-tuning (LoRA), cloud AI services (Bedrock/Vertex/Azure), evaluation frameworks, guardrails.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Nice to Have (20-40%):&lt;/strong&gt; CrewAI, AutoGen, Graph RAG, Terraform, ONNX.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Emerging (&amp;lt;20% but growing fast):&lt;/strong&gt; Claude Agent SDK, MCP, Managed Agents, context engineering. Less than 5% of developers have worked with MCP directly, but enterprise demand is already exceeding supply.&lt;/p&gt;

&lt;h2 id=&quot;where-to-start&quot;&gt;Where to Start&lt;/h2&gt;

&lt;p&gt;The highest-leverage move: build something real with the Claude Agent SDK. It gives you built-in file, web, and shell tools out of the box — no boilerplate. Add RAG with ChromaDB (local, free embeddings). Wrap it in FastAPI. Dockerize it. That’s four portfolio artifacts in four weeks, each demonstrating a skill tier employers are looking for.&lt;/p&gt;

&lt;p&gt;The biggest gap employers report isn’t technical — it’s the inability to answer “How do you know it works?” Build evaluation into everything you ship. If you can explain your eval framework in an interview, you’re ahead of 90% of candidates.&lt;/p&gt;
</content>
 </entry>
 

</feed>
