Agentic Financial Analysis · LangGraph · Supervisor-Worker Architecture · LLM Reliability · Distributed Systems

Building VYNN AI: 50,000 Lines of Code, One Engineer, and Everything I Learned

How I built and scaled an agentic financial analysis pipeline from scratch, handling complex graph routing, deterministic validations, and real users.

Two days ago I open-sourced VYNN AI, the agentic financial analysis platform I built solo over six months. 50,000+ lines of production code. ~500 pilot users. A system that compresses 6-12 hours of institutional equity research into under 7 minutes.
This post is not a feature walkthrough. The GitHub README already covers that in detail. This is the story behind the architecture: why I made the decisions I made, what broke, what I would do differently, and what I actually learned about building agentic AI systems that survive contact with real users.
Here is the system in action:

Why I Built It

Equity research at institutional firms is a manual, time-intensive process: pull financials, build a DCF model in Excel, read through dozens of news articles, write up a report, formulate a recommendation. A single ticker takes an experienced analyst 6-12 hours. Retail investors and small firms cannot afford this.
I wanted to build a system that does the entire workflow autonomously. Not a chatbot that answers questions about stocks. Not a dashboard with charts. A full pipeline: you give it a ticker, it gives you a professional analyst report with a 10-tab DCF model (9 visible + 1 hidden LLM assumptions tab), sector-specific valuation, news-driven catalyst and risk analysis, and a validated recommendation with multi-horizon price targets.
I was also trying to prove something to myself. I had been thinking about agentic AI architectures for a while, independently arriving at design patterns like file-system-as-graph structures for tool loading, persistent memory across sessions, and cyclical state graphs for multi-step reasoning. These ideas felt right but I had never stress-tested them in a production system with real users. VYNN was the test.

The Architecture That Shipped

VYNN AI is a three-layer stack: an agent backend (the reasoning engine), an API orchestration layer (the coordination service), and a React frontend (the user interface). All three layers were designed, built, and deployed by me alone.
One architectural principle runs through every layer: strict semantic-symbolic separation. All LLM usage is restricted to semantic tasks: intent recognition, relevance scoring, event extraction, and narrative synthesis. All numerical computation is deterministic Python. The LLM never touches a number. This separation is what makes the system reproducible, auditable, and debuggable in a domain where getting a number wrong has real consequences.
System Architecture

The Agent Backend: LangGraph and the Supervisor Pattern

The core of VYNN is a LangGraph-based computation graph with a Supervisor-Worker architecture. A supervisor agent receives natural language input, extracts tickers, classifies intent into one of four modes (comprehensive analysis, model-only, quick news, or custom), and dynamically routes between specialized worker agents with dependency resolution.
The routing policy is hybrid: structural dependencies are enforced by deterministic rules (valuation cannot precede data retrieval), while semantic decisions (intent classification, event relevance) are delegated to the LLM. This means the supervisor never makes a structurally invalid routing decision, even if the LLM suggests one. A deterministic fallback path ensures 100% request completion regardless of LLM behavior.
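The gating logic is simple to sketch. The following is a stdlib sketch with illustrative agent names and dependencies, not VYNN's actual identifiers: the LLM's routing suggestion is accepted only when its structural prerequisites are satisfied, and a deterministic schedule takes over otherwise.

```python
# Illustrative agent names and dependencies -- not VYNN's actual identifiers.
DEPENDENCIES = {
    "financial_data": set(),
    "news_screening": set(),
    "dcf_valuation": {"financial_data"},
    "reporting": {"financial_data", "news_screening", "dcf_valuation"},
}

def next_agent(llm_suggestion: str, completed: set) -> str:
    """Accept the LLM's routing suggestion only if it is structurally valid;
    otherwise fall back to a deterministic schedule."""
    runnable = [agent for agent, deps in DEPENDENCIES.items()
                if agent not in completed and deps <= completed]
    if llm_suggestion in runnable:
        return llm_suggestion
    # Deterministic fallback: first runnable agent in a fixed order.
    return runnable[0]
```

Because the fallback never consults the LLM, a malformed or unavailable model response degrades routing quality but never blocks request completion.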
The worker agents fall into two categories that mirror the semantic-symbolic separation:
Semantic agents (LLM-mediated): The News Screening agent converts raw news into structured events with directional impact, confidence scores, and evidence spans grounded to source text. The Reporting agent synthesizes narratives from structured state, with every claim required to reference a citation ID. Unsupported statements are rejected by a symbolic validator.
Symbolic agents (deterministic): The Financial Data agent retrieves and normalizes fundamentals from Yahoo Finance. The DCF Valuation agent executes a fully symbolic discounted cash flow model across six sector-specific strategies (Generic, SaaS/Rule of 40, REIT/FFO, Bank/Excess Returns, Utility, Energy NAV). The Price Adjustment agent perturbs valuation parameters based on text-derived events, with bounded magnitude and Lipschitz continuity constraints to ensure stable propagation of textual uncertainty.
Why LangGraph and not a simple chain? Because the execution flow is not linear. Financial Data and News Intelligence can run concurrently. DCF depends on Financial Data. Report Generation depends on all upstream agents. A simple sequential chain wastes time. A graph lets me express these dependencies naturally and run agents in parallel where possible, which cut end-to-end latency by 72%.
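LangGraph expresses this wiring declaratively; the underlying scheduling idea is just dependency-aware fan-out, which a stdlib sketch with stand-in agents makes concrete (the real workers call LLMs and data APIs):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in agents; the real workers call LLMs and data APIs.
def financial_data():
    return "fundamentals"

def news_screening():
    return "events"

def dcf_valuation(fundamentals):
    return f"valuation({fundamentals})"

def run_pipeline():
    with ThreadPoolExecutor() as pool:
        fin = pool.submit(financial_data)    # independent: fan out
        news = pool.submit(news_screening)   # independent: fan out
        # DCF blocks only on its structural dependency, not on news.
        valuation = pool.submit(dcf_valuation, fin.result())
        return valuation.result(), news.result()
```

End-to-end latency becomes the longest dependency chain rather than the sum of all agents, which is where the parallelization savings come from.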

The Blackboard Pattern

All agents read from and write to a shared FinancialState frozen dataclass. This is the blackboard pattern from classical AI, and choosing it was one of my best architectural decisions.
The alternative would have been passing data directly between agents. Agent A produces output, passes it to Agent B, which passes its output to Agent C. This creates tight coupling. If I want to add a new agent that needs data from both Agent A and Agent C, I have to rewire the data flow. With a blackboard, any agent can access any upstream result without knowing which agent produced it. Adding a new agent means: write to the blackboard, read from the blackboard, register it with the supervisor. That is it.
The frozen dataclass constraint is important too. Agents cannot mutate shared state directly. They produce new data and the orchestrator writes it to the blackboard. This eliminates an entire class of concurrency bugs and makes the data flow auditable. Under identical inputs, every symbolic component produces identical outputs, which is how we achieve a 0.985 reproducibility score across repeated runs.

The Recommendation Engine: Never Trust an LLM With Numbers

This is the subsystem I am most proud of, because it solves a problem that most AI financial products quietly ignore: LLMs make up numbers.
If you ask GPT-4 to write a stock recommendation with price targets, it will produce plausible-sounding figures that have no mathematical basis. It might say the stock is worth $150 when a proper DCF yields $120. It might cite a P/E ratio it calculated incorrectly. Users would never know.
VYNN's Recommendation Engine has three layers specifically designed to prevent this:
Recommendation Engine
Layer 1 is pure Python. A deterministic RecommendationCalculator computes expected returns, price targets, and rating bands from the DCF output. No LLM involvement. The numbers are mathematically derived and stored as an immutable FixedNumbers object. This layer also applies sector-aware premiums, volatility caps, and a time decay framework for multi-horizon targets (3-month, 6-month, 12-month).
Layer 2 is the narrative layer. An EvidenceExtractor builds an evidence pack with unique citation IDs (like [FIN-001] or [NEWS-003]), scored by source quality (primary > tier-1 > syndication). The LLM takes the Layer 1 numbers and the evidence pack and writes prose around them. The LLM's job is writing, not math. It can only reference numbers that Layer 1 produced; it cannot compute new ones.
Layer 3 is regex-based verification. A RecommendationValidator cross-checks every number that appears in the narrative against its Layer 1 source. If the LLM writes "$150 price target" but Layer 1 computed $120, the validator catches it. The system requires at least 95% citation coverage and triggers an auto-correction loop back to Layer 2 if validation fails.
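A stripped-down sketch of the Layer 3 cross-check (the real validator also enforces citation coverage and drives the auto-correction loop):

```python
import re

def unverified_numbers(narrative: str, fixed_numbers: set) -> list:
    """Return every dollar figure in the narrative that Layer 1 did not
    produce. Simplified sketch: dollar amounts only, exact matching."""
    found = [float(m.replace(",", ""))
             for m in re.findall(r"\$([\d,]+(?:\.\d+)?)", narrative)]
    return [x for x in found if x not in fixed_numbers]

fixed = {120.0, 31.5}  # immutable Layer 1 outputs
text = "We set a $150 price target based on a $120.00 base case."
violations = unverified_numbers(text, fixed)  # a non-empty list fails validation
```

Any non-empty result sends the narrative back to Layer 2 for regeneration, so a hallucinated figure cannot survive into the final report.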
This is the pattern I think matters most in production agentic systems: use the LLM for what it is good at (language, reasoning, synthesis) and deterministic code for what it is bad at (arithmetic, data consistency, validation). The LLM never invents a number. Every figure in the final output has a deterministic source.

The Decisions I Got Wrong

Ephemeral Docker Containers Per Request

This is the biggest architectural mistake in VYNN, and I knew it was wrong even as I built it.
When a user submits an analysis request, the API layer spawns a fresh Docker container (~975 MB) via the Docker SDK. The container runs the full agent pipeline in isolation, streams logs back via SSE, persists results to MongoDB, and is cleaned up on completion. Every request gets its own container.
Why I did it: complete process isolation. If one analysis crashes or leaks memory, it cannot affect other users. No shared state between requests. Because workers execute deterministically against fixed inputs, failures are confined to individual jobs and reproducibility is guaranteed. It felt elegant, and for a system where correctness matters more than throughput, it was a defensible choice.
Why it is wrong at scale: cold start latency. Pulling and initializing a 975 MB container takes seconds before any actual work begins. At ~500 users this is tolerable. At 5,000 users it is a bottleneck. At 50,000 it is untenable. I am paying container orchestration overhead that far exceeds the actual compute cost.
What I would do instead: Stateless, long-running pods behind a load balancer. Conversation state lives in Redis and MongoDB, not in container memory. The agent backend runs as an always-on service. Horizontal scaling means deploying more replicas, not spawning more containers. The key architectural change is making the backend stateless, which means ripping session state out of the process and into the database layer. Once you do that, scaling is just "run more copies."
I did not make this change because, at ~500 users, the container approach works; fixing it would have been premature optimization, and shipping features mattered more. But if I were starting VYNN today with scale ambitions, I would never use this pattern.

Externalizing 33 Prompt Templates Was the Right Call

One decision I questioned at the time but now feel strongly about: externalizing all 33 LLM prompts as versioned markdown files in a prompts/ directory.
The conventional approach is to hardcode prompts as Python strings inside agent code. It is faster to write and keeps everything co-located. But it makes prompt iteration painful. Changing a single word in a prompt means modifying Python code, running tests, and redeploying. With externalized templates, I can edit a markdown file, and the change takes effect on the next request. No code change. No redeployment. Version-controlled and auditable in git with full diff visibility.
This matters more than it sounds. In a production agentic system, prompts are the most frequently iterated component. Model behavior drifts with API updates. User feedback reveals edge cases. New features require new prompt patterns. Making prompt iteration cheap is a genuine engineering advantage. The 33 templates span supervisor routing, news analysis, financial modeling, report generation, recommendations, and sector analysis, each with anti-hallucination safeguards embedded directly in the prompt text.
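The loading side is trivial, which is part of the appeal. A minimal sketch, assuming a `prompts/<name>.md` layout and `$variable` placeholders; VYNN's actual template format may differ:

```python
from pathlib import Path
from string import Template

def load_prompt(name: str, prompts_dir: str = "prompts", **values) -> str:
    """Load prompts/<name>.md and fill $placeholders. safe_substitute lets
    a missing key pass through unchanged, so a template typo degrades
    gracefully instead of crashing the request."""
    text = Path(prompts_dir, f"{name}.md").read_text(encoding="utf-8")
    return Template(text).safe_substitute(**values)
```

Editing the markdown file changes behavior on the next request; no Python changes, no redeploy, and every prompt revision shows up as a plain git diff.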

The Formula Evaluator: 1,293 Lines of Accidental Complexity

The DCF Model agent generates Excel workbooks with live formulas. But downstream agents (like the Report Generator) need to read computed values from those formulas without opening Excel. So I built a custom formula evaluator: 1,293 lines of Python that parse and evaluate Excel formula syntax programmatically, resolving cell references, cross-tab references, arithmetic, and functions like SUMIFS.
This works. It is also one of the most complex and brittle components in the codebase. Excel formula syntax is enormous: nested functions, range operations, conditional logic, string manipulation. My evaluator handles the subset that DCF models actually use, but every new formula pattern requires extension.
In retrospect, I should have used openpyxl to write the workbook, saved it, reopened it with a formula calculation engine (like formulas or xlcalculator), and read the computed values. Instead of building a custom Excel interpreter, lean on the ecosystem. The lesson: if you are writing 1,000+ lines of code to replicate functionality that libraries already provide, stop and check if there is a better way.
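To make the accidental complexity concrete, here is a toy evaluator for the easy subset, single-cell references plus arithmetic. Everything beyond this (nested functions, ranges, cross-tab references, SUMIFS) is where the line count explodes:

```python
import re

def evaluate(formula: str, cells: dict) -> float:
    """Toy evaluator for '=A1+B2*2'-style formulas: substitute cell
    references, then evaluate the remaining pure arithmetic. This is the
    trivial core; real Excel syntax is what drove the production
    evaluator to 1,293 lines."""
    expr = formula.lstrip("=")
    expr = re.sub(r"[A-Z]+\d+", lambda m: repr(cells[m.group()]), expr)
    # After substitution the expression is arithmetic only -- no names,
    # no calls -- so eval with empty builtins is acceptable for a sketch.
    return eval(expr, {"__builtins__": {}}, {})
```

Ten lines cover the happy path; each additional Excel feature adds parsing, precedence, and reference-resolution cases, which is exactly how 1,293 lines of accidental complexity accumulate.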

What I Learned About Building Agentic Systems

The Agent is the Easy Part

The LLM reasoning loop is the conceptually interesting piece, but it is maybe 20% of the engineering effort. The other 80% is everything around it: orchestrating multiple agents with dependency resolution, streaming real-time logs to a frontend over SSE, handling partial failures gracefully (what happens when the News agent fails but Financial Data succeeds?), managing LLM provider failover (OpenAI is down, fall back to Anthropic), cost tracking per request, and building evaluation pipelines to measure whether the system is actually getting better over time.
I have started calling this the "agent harness": the infrastructure layer that makes agents work in production. It is not glamorous. It does not make for good demos. But it is the difference between a prototype that works in a Jupyter notebook and a system that serves 500 real users reliably.

Multi-Agent Coordination is a Distributed Systems Problem

When I started VYNN, I thought of it as an AI project. By the time I shipped it, I realized it was primarily a distributed systems project that happens to use LLMs as compute nodes.
The challenges are the same ones you find in any distributed system: shared state management (the blackboard pattern), concurrency control (which agents can run in parallel), failure isolation (one agent crashing should not take down the pipeline), timeout handling (LLM API calls can hang indefinitely), and observability (when something goes wrong at minute 5 of a 7-minute pipeline, how do you trace it?).
If you are building multi-agent systems, the most useful background is not ML research. It is distributed systems engineering. Read about saga patterns, circuit breakers, and distributed tracing. Those patterns transfer directly.
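As one example, a minimal circuit breaker, the pattern that keeps a failing LLM provider from dragging every request through repeated timeouts (thresholds here are illustrative):

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, short-circuit calls for
    `cooldown` seconds instead of hammering a downed provider."""
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: provider unavailable")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

An open circuit fails fast, which lets the orchestrator fall back to another provider immediately instead of waiting out a timeout per request.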

LLMs Are the Slowest Component, Always

In VYNN's 6.4-minute average analysis time, LLM-intensive operations account for roughly 93% of total latency. Financial data collection and DCF computation combined take under 10 seconds. The rest is waiting for LLM API calls to return. The News Screening agent alone consumes 189.4 seconds (49.4% of total) and the Reporting agent takes 167.6 seconds (43.8%). Supervisor coordination overhead is a mere 4.2%.
This has a direct architectural implication: optimize for LLM call count, not code execution speed. Rewriting a function from O(n^2) to O(n) saves microseconds. Eliminating one unnecessary LLM call saves seconds. The performance engineering of agentic systems is fundamentally about reducing the number of LLM round-trips, batching where possible, and parallelizing where dependencies allow.
The 72% latency reduction I achieved was almost entirely from parallelizing agent execution (Financial Data and News Intelligence run concurrently instead of sequentially) and caching results across runs. Zero of it came from code-level optimization.
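The arithmetic behind "optimize call count" is blunt. Under a simplified cost model where every round-trip costs roughly the same latency regardless of payload size, batching dominates any code-level micro-optimization:

```python
def batch_articles(articles: list, batch_size: int = 10) -> list:
    """Group articles so one LLM call screens a whole batch:
    40 articles become 4 round-trips instead of 40."""
    return [articles[i:i + batch_size]
            for i in range(0, len(articles), batch_size)]

def estimated_latency(n_articles: int, batch_size: int,
                      per_call_s: float = 2.0) -> float:
    """Hypothetical cost model: latency scales with call count,
    not payload size. per_call_s is an illustrative constant."""
    calls = -(-n_articles // batch_size)  # ceiling division
    return calls * per_call_s
```

Under this model, batching 40 articles ten at a time cuts screening latency by 10x; no amount of O(n²)-to-O(n) rewriting comes close.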

Cache-Aware Orchestration Matters

Because semantic agents dominate latency but produce deterministic-enough outputs for short time windows, VYNN employs 10-minute semantic caching and daily fan-out for company-level reports. When a second user queries the same ticker within the cache window, the system reuses the News Screening artifacts instead of re-running the full 189-second pipeline. For the full META workflow, this reduces effective latency from 383 seconds to 193.6 seconds, a 49% reduction from a single cache hit.
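A minimal sketch of the TTL cache (the production version also handles daily fan-out and invalidation):

```python
import time

class SemanticCache:
    """TTL cache for semantic artifacts, keyed by ticker.
    Default TTL mirrors the 10-minute window described above."""
    def __init__(self, ttl_s: float = 600.0):
        self.ttl_s, self._store = ttl_s, {}

    def get(self, ticker: str):
        entry = self._store.get(ticker)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl_s:
            del self._store[ticker]  # expired: force a fresh run
            return None
        return value

    def put(self, ticker: str, artifacts) -> None:
        self._store[ticker] = (artifacts, time.monotonic())
```

On a hit, the orchestrator skips the News Screening agent entirely and proceeds straight to the downstream symbolic stages, which is where the 49% reduction comes from.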
This is not a novel technique, but it is one that many agentic systems ignore because they treat each request as independent. If you are building agents that multiple users query with overlapping inputs, caching semantic artifacts is the single highest-leverage performance optimization available to you.

Provider Agnosticism is Not Optional

VYNN supports OpenAI (GPT-4o, GPT-4o-mini) and Anthropic (Claude 3.5 Sonnet, Haiku, Opus) through a provider-agnostic abstraction layer. I built this because I wanted to compare model quality, but it turned out to be essential for reliability.
LLM APIs go down. OpenAI had multiple outages during VYNN's development and deployment. If your entire system is hardwired to one provider, an API outage means your product is down. With a provider-agnostic layer, failover is a configuration change, not a code change.
The abstraction also enables per-task model selection. The supervisor agent (which just classifies intent) can use a cheaper, faster model. The Report Generator (which needs nuanced writing) can use a more capable one. The News Intelligence agent (which processes many articles in parallel) can use a cost-optimized model. Matching model capability to task complexity is a meaningful cost optimization.
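A sketch of the abstraction with provider backends injected as plain callables. The model names mirror the ones above, but the registry shape and call signature are illustrative, not VYNN's actual interface:

```python
# Per-task model tiers; names illustrative of the setup described above.
PROVIDERS = {
    "openai": {"fast": "gpt-4o-mini", "capable": "gpt-4o"},
    "anthropic": {"fast": "claude-3-5-haiku", "capable": "claude-3-5-sonnet"},
}

def complete(task_tier: str, prompt: str, backends: dict,
             order: tuple = ("openai", "anthropic")) -> str:
    """Try providers in configured order; failover is a config change
    (reorder `order`), not a code change."""
    last_err = None
    for provider in order:
        model = PROVIDERS[provider][task_tier]
        try:
            return backends[provider](model, prompt)  # provider SDK call
        except Exception as err:
            last_err = err  # outage: fall through to the next provider
    raise RuntimeError("all providers failed") from last_err
```

Per-task tiers fall out for free: the supervisor requests `"fast"`, the Report Generator requests `"capable"`, and either request transparently survives a provider outage.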

The Numbers

I ran formal reproducibility and performance experiments on production workloads across multiple tickers and paraphrased prompts. These are real measurements from deployed runs, not synthetic benchmarks.
Latency:
  • 383 seconds (6.4 minutes) average end-to-end for comprehensive analysis
  • 93% of that time is LLM-mediated semantic processing
  • Financial data + DCF model build combined: under 10 seconds
  • Supervisor routing overhead: 4.2%
Reproducibility (9 runs across 3 tickers):
  • 100% exact reproduction of symbolic valuation outputs under identical inputs
  • 0.985 reproducibility score (CV 0.016) for NVDA
  • 0.969 for AAPL, 0.965 for MSFT
  • Within-ticker time variance reflects semantic workload, not nondeterminism
Stability under paraphrasing (3 prompts for NVDA):
  • 100% intent recognition consistency
  • 100% identical agent invocation sequences
  • Time coefficient of variation: 0.017
  • All prompts completed within a 13-second window (378.9s to 391.7s)
Performance optimization:
  • 72% end-to-end latency reduction vs. the original sequential pipeline (via parallel agent execution + result caching)
  • 49% further reduction on cache hit by reusing News Screening artifacts (383s to 193.6s for META)
  • Daily report fan-out eliminates redundant per-user computation
Other:
  • ~500 pilot users in production on Hetzner Cloud
  • 6 sector-specific DCF strategies auto-selected by company classification
  • 33 externalized prompt templates, version-controlled
  • $0 external data vendor costs (yfinance, SerpAPI, newspaper3k)
The fact that I ran these experiments formally (multiple runs per ticker, measured coefficient of variation, tested paraphrased prompts) matters as much as the numbers themselves. Most people building agentic systems never measure reproducibility. They assume it works because it worked once. VYNN's symbolic components are provably deterministic. The semantic components have measured variance. Knowing the difference is how you build trust in a system that makes financial recommendations.

What the Output Looks Like

The system produces four types of artifacts. Here are real examples from production runs.

META Financial Model (xlsx)

A 10-tab Excel workbook with live formulas: raw financials, historical metrics, LLM-inferred assumptions, 5-year projections, dual DCF valuation (perpetual growth + exit multiple), sensitivity matrices, and a summary dashboard. Every number is formula-driven, not static. This is what the Recommendation Engine's Layer 1 reads from.

AAPL Financial Model (xlsx)

Similar 10-tab Excel structure generated for Apple, automatically adapting mapping logic to different GAAP reporting line items.

NVDA Professional Analysis Report (pdf)

The full analyst report generated by the Report Generator agent. Executive summary, investment thesis, financial analysis, dual DCF valuation, news-driven catalyst and risk analysis, and a validated recommendation with 3/6/12-month price targets. Every claim references a citation ID that traces back to either the DCF model or a sourced news article.

ORCL Professional Analysis Report (pdf)

Another full analyst report generated for Oracle, showcasing the system's ability to adapt to different financial structures and news cycles.

What Comes Next

VYNN AI is open source now. The codebase is split across three repositories under the Agentic-Analyst organization: stock-analyst (agent backend, ~15,000 LOC), api-runner (API layer, ~10,600 LOC), and vynnai-web (frontend, ~23,000 LOC).
I am not actively developing new features. This summer I am joining Robinhood as a Machine Learning Engineer on their Agentic AI team, where I will be building agent infrastructure at a very different scale. VYNN taught me how to build an agentic system from scratch. Robinhood will teach me how to make one production-grade at scale.
But the lessons from VYNN carry forward. The semantic-symbolic separation as an architectural principle. The blackboard pattern for decoupled agent state. The three-layer recommendation engine for numerical integrity. Prompt externalization for iteration speed. Cache-aware orchestration for operational efficiency. Provider-agnostic LLM abstractions for reliability. These are not financial analysis patterns. They are general infrastructure patterns for any system where LLMs need to operate reliably, observably, and at scale.
If you are building agentic AI systems and any of this resonates, I would love to hear from you. Email me or find me on LinkedIn.

Zanwen (Ryan) Fu is a Software Engineer and MS Computer Science student at Duke University, focused on building production-grade agentic AI systems. He joins Robinhood's Agentic AI team as an MLE intern in May 2026. More at zanwenfu.com.