Agentic Financial Analysis · LangGraph · Supervisor-Worker Architecture · LLM Reliability · Distributed Systems

Building VYNN AI: 50,000 Lines of Code, One Engineer, and Everything I Learned

How I built and scaled an agentic financial analysis pipeline from scratch, handling complex graph routing, deterministic validations, and real users.

Two days ago I open-sourced VYNN AI, the agentic financial analysis platform I built solo over six months. 50,000+ lines of production code. ~500 pilot users. A system that compresses 6-12 hours of institutional equity research into under 7 minutes.
This post is not a feature walkthrough. The GitHub README already covers that in detail. This is the story behind the architecture: why I made the decisions I made, what broke, what I would do differently, and what I actually learned about building agentic AI systems that survive contact with real users.
Here is the system in action:

Why I Built It

Equity research at institutional firms is a manual, time-intensive process: pull financials, build a DCF model in Excel, read through dozens of news articles, write up a report, formulate a recommendation. A single ticker takes an experienced analyst 6-12 hours. Retail investors and small firms cannot afford this.
I wanted to build a system that does the entire workflow autonomously. Not a chatbot that answers questions about stocks. Not a dashboard with charts. A full pipeline: you give it a ticker, it gives you a professional analyst report with a 10-tab DCF model (9 visible + 1 hidden LLM assumptions tab), sector-specific valuation, news-driven catalyst and risk analysis, and a validated recommendation with multi-horizon price targets.
I was also trying to prove something to myself. I had been thinking about agentic AI architectures for a while, independently arriving at design patterns like file-system-as-graph structures for tool loading, persistent memory across sessions, and cyclical state graphs for multi-step reasoning. These ideas felt right but I had never stress-tested them in a production system with real users. VYNN was the test.

The Architecture That Shipped

VYNN AI is a three-layer stack: an agent backend (the reasoning engine), an API orchestration layer (the coordination service), and a React frontend (the user interface). All three layers were designed, built, and deployed by me alone.
One architectural principle runs through every layer: strict semantic-symbolic separation. All LLM usage is restricted to semantic tasks: intent recognition, relevance scoring, event extraction, and narrative synthesis. All numerical computation is deterministic Python. The LLM never touches a number. This separation is what makes the system reproducible, auditable, and debuggable in a domain where getting a number wrong has real consequences.
System Architecture

The Agent Backend: LangGraph and the Supervisor Pattern

The core of VYNN is a LangGraph-based computation graph with a Supervisor-Worker architecture. A supervisor agent receives natural language input, extracts tickers, classifies intent into one of four modes (comprehensive analysis, model-only, quick news, or custom), and dynamically routes between specialized worker agents with dependency resolution.
The routing policy is hybrid: structural dependencies are enforced by deterministic rules (valuation cannot precede data retrieval), while semantic decisions (intent classification, event relevance) are delegated to the LLM. This means the supervisor never makes a structurally invalid routing decision, even if the LLM suggests one. A deterministic fallback path ensures 100% request completion regardless of LLM behavior.
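The gating logic is simple to sketch. The following is a stdlib sketch with illustrative agent names and dependencies, not VYNN's actual identifiers: the LLM's routing suggestion is accepted only when its structural prerequisites are satisfied, and a deterministic schedule takes over otherwise.

```python
# Illustrative agent names and dependencies -- not VYNN's actual identifiers.
DEPENDENCIES = {
    "financial_data": set(),
    "news_screening": set(),
    "dcf_valuation": {"financial_data"},
    "reporting": {"financial_data", "news_screening", "dcf_valuation"},
}

def next_agent(llm_suggestion: str, completed: set) -> str:
    """Accept the LLM's routing suggestion only if it is structurally valid;
    otherwise fall back to a deterministic schedule."""
    runnable = [agent for agent, deps in DEPENDENCIES.items()
                if agent not in completed and deps <= completed]
    if llm_suggestion in runnable:
        return llm_suggestion
    # Deterministic fallback: first runnable agent in a fixed order.
    return runnable[0]
```

Because the fallback never consults the LLM, a malformed or unavailable model response degrades routing quality but never blocks request completion.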
The worker agents fall into two categories that mirror the semantic-symbolic separation:
Semantic agents (LLM-mediated): The News Screening agent converts raw news into structured events with directional impact, confidence scores, and evidence spans grounded to source text. The Reporting agent synthesizes narratives from structured state, with every claim required to reference a citation ID. Unsupported statements are rejected by a symbolic validator.
Symbolic agents (deterministic): The Financial Data agent retrieves and normalizes fundamentals from Yahoo Finance. The DCF Valuation agent executes a fully symbolic discounted cash flow model across six sector-specific strategies (Generic, SaaS/Rule of 40, REIT/FFO, Bank/Excess Returns, Utility, Energy NAV). The Price Adjustment agent perturbs valuation parameters based on text-derived events, with bounded magnitude and Lipschitz continuity constraints to ensure stable propagation of textual uncertainty.
Why LangGraph and not a simple chain? Because the execution flow is not linear. Financial Data and News Intelligence can run concurrently. DCF depends on Financial Data. Report Generation depends on all upstream agents. A simple sequential chain wastes time. A graph lets me express these dependencies naturally and run agents in parallel where possible, which cut end-to-end latency by 72%.
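LangGraph expresses this wiring declaratively; the underlying scheduling idea is just dependency-aware fan-out, which a stdlib sketch with stand-in agents makes concrete (the real workers call LLMs and data APIs):

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in agents; the real workers call LLMs and data APIs.
def financial_data():
    return "fundamentals"

def news_screening():
    return "events"

def dcf_valuation(fundamentals):
    return f"valuation({fundamentals})"

def run_pipeline():
    with ThreadPoolExecutor() as pool:
        fin = pool.submit(financial_data)    # independent: fan out
        news = pool.submit(news_screening)   # independent: fan out
        # DCF blocks only on its structural dependency, not on news.
        valuation = pool.submit(dcf_valuation, fin.result())
        return valuation.result(), news.result()
```

End-to-end latency becomes the longest dependency chain rather than the sum of all agents, which is where the parallelization savings come from.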

The Blackboard Pattern

All agents read from and write to a shared FinancialState frozen dataclass. This is the blackboard pattern from classical AI, and choosing it was one of my best architectural decisions.
The alternative would have been passing data directly between agents. Agent A produces output, passes it to Agent B, which passes its output to Agent C. This creates tight coupling. If I want to add a new agent that needs data from both Agent A and Agent C, I have to rewire the data flow. With a blackboard, any agent can access any upstream result without knowing which agent produced it. Adding a new agent means: write to the blackboard, read from the blackboard, register it with the supervisor. That is it.
The frozen dataclass constraint is important too. Agents cannot mutate shared state directly. They produce new data and the orchestrator writes it to the blackboard. This eliminates an entire class of concurrency bugs and makes the data flow auditable. Under identical inputs, every symbolic component produces identical outputs, which is how we achieve a 0.985 reproducibility score across repeated runs.

The Recommendation Engine: Never Trust an LLM With Numbers

This is the subsystem I am most proud of, because it solves a problem that most AI financial products quietly ignore: LLMs make up numbers.
If you ask GPT-4 to write a stock recommendation with price targets, it will produce plausible-sounding figures that have no mathematical basis. It might say the stock is worth $150 when a proper DCF yields $120. It might cite a P/E ratio it calculated incorrectly. Users would never know.
VYNN's Recommendation Engine has three layers specifically designed to prevent this:
Recommendation Engine
Layer 1 is pure Python. A deterministic RecommendationCalculator computes expected returns, price targets, and rating bands from the DCF output. No LLM involvement. The numbers are mathematically derived and stored as an immutable FixedNumbers object. This layer also applies sector-aware premiums, volatility caps, and a time decay framework for multi-horizon targets (3-month, 6-month, 12-month).
Layer 2 is the narrative layer. An EvidenceExtractor builds an evidence pack with unique citation IDs (like [FIN-001] or [NEWS-003]), scored by source quality (primary > tier-1 > syndication). The LLM takes the Layer 1 numbers and the evidence pack and writes prose around them. The LLM's job is writing, not math. It can only reference numbers that Layer 1 produced; it cannot compute new ones.
Layer 3 is regex-based verification. A RecommendationValidator cross-checks every number that appears in the narrative against its Layer 1 source. If the LLM writes "$150 price target" but Layer 1 computed $120, the validator catches it. The system requires at least 95% citation coverage and triggers an auto-correction loop back to Layer 2 if validation fails.
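A stripped-down sketch of the Layer 3 cross-check (the real validator also enforces citation coverage and drives the auto-correction loop):

```python
import re

def unverified_numbers(narrative: str, fixed_numbers: set) -> list:
    """Return every dollar figure in the narrative that Layer 1 did not
    produce. Simplified sketch: dollar amounts only, exact matching."""
    found = [float(m.replace(",", ""))
             for m in re.findall(r"\$([\d,]+(?:\.\d+)?)", narrative)]
    return [x for x in found if x not in fixed_numbers]

fixed = {120.0, 31.5}  # immutable Layer 1 outputs
text = "We set a $150 price target based on a $120.00 base case."
violations = unverified_numbers(text, fixed)  # a non-empty list fails validation
```

Any non-empty result sends the narrative back to Layer 2 for regeneration, so a hallucinated figure cannot survive into the final report.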
This is the pattern I think matters most in production agentic systems: use the LLM for what it is good at (language, reasoning, synthesis) and deterministic code for what it is bad at (arithmetic, data consistency, validation). The LLM never invents a number. Every figure in the final output has a deterministic source.

The Decisions I Got Wrong

Ephemeral Docker Containers Per Request

This is the biggest architectural mistake in VYNN, and I knew it was wrong even as I built it.
When a user submits an analysis request, the API layer spawns a fresh Docker container (~975 MB) via the Docker SDK. The container runs the full agent pipeline in isolation, streams logs back via SSE, persists results to MongoDB, and is cleaned up on completion. Every request gets its own container.
Why I did it: complete process isolation. If one analysis crashes or leaks memory, it cannot affect other users. No shared state between requests. Because workers execute deterministically against fixed inputs, failures are confined to individual jobs and reproducibility is guaranteed. It felt elegant, and for a system where correctness matters more than throughput, it was a defensible choice.
Why it is wrong at scale: cold start latency. Pulling and initializing a 975 MB container takes seconds before any actual work begins. At ~500 users this is tolerable. At 5,000 users it is a bottleneck. At 50,000 it is untenable. I am paying container orchestration overhead that far exceeds the actual compute cost.
What I would do instead: Stateless, long-running pods behind a load balancer. Conversation state lives in Redis and MongoDB, not in container memory. The agent backend runs as an always-on service. Horizontal scaling means deploying more replicas, not spawning more containers. The key architectural change is making the backend stateless, which means ripping session state out of the process and into the database layer. Once you do that, scaling is just "run more copies."
I did not make this change because, at ~500 users, the container approach works; fixing it would have been premature optimization, and shipping features mattered more. But if I were starting VYNN today with scale ambitions, I would never use this pattern.

Externalizing 33 Prompt Templates Was the Right Call

One decision I questioned at the time but now feel strongly about: externalizing all 33 LLM prompts as versioned markdown files in a prompts/ directory.
The conventional approach is to hardcode prompts as Python strings inside agent code. It is faster to write and keeps everything co-located. But it makes prompt iteration painful. Changing a single word in a prompt means modifying Python code, running tests, and redeploying. With externalized templates, I can edit a markdown file, and the change takes effect on the next request. No code change. No redeployment. Version-controlled and auditable in git with full diff visibility.
This matters more than it sounds. In a production agentic system, prompts are the most frequently iterated component. Model behavior drifts with API updates. User feedback reveals edge cases. New features require new prompt patterns. Making prompt iteration cheap is a genuine engineering advantage. The 33 templates span supervisor routing, news analysis, financial modeling, report generation, recommendations, and sector analysis, each with anti-hallucination safeguards embedded directly in the prompt text.
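The loading side is trivial, which is part of the appeal. A minimal sketch, assuming a `prompts/<name>.md` layout and `$variable` placeholders; VYNN's actual template format may differ:

```python
from pathlib import Path
from string import Template

def load_prompt(name: str, prompts_dir: str = "prompts", **values) -> str:
    """Load prompts/<name>.md and fill $placeholders. safe_substitute lets
    a missing key pass through unchanged, so a template typo degrades
    gracefully instead of crashing the request."""
    text = Path(prompts_dir, f"{name}.md").read_text(encoding="utf-8")
    return Template(text).safe_substitute(**values)
```

Editing the markdown file changes behavior on the next request; no Python changes, no redeploy, and every prompt revision shows up as a plain git diff.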

The Formula Evaluator: 1,293 Lines of Accidental Complexity

The DCF Model agent generates Excel workbooks with live formulas. But downstream agents (like the Report Generator) need to read computed values from those formulas without opening Excel. So I built a custom formula evaluator: 1,293 lines of Python that parse and evaluate Excel formula syntax programmatically, resolving cell references, cross-tab references, arithmetic, and functions like SUMIFS.
This works. It is also one of the most complex and brittle components in the codebase. Excel formula syntax is enormous: nested functions, range operations, conditional logic, string manipulation. My evaluator handles the subset that DCF models actually use, but every new formula pattern requires extension.
In retrospect, I should have used openpyxl to write the workbook, saved it, reopened it with a formula calculation engine (like formulas or xlcalculator), and read the computed values. Instead of building a custom Excel interpreter, lean on the ecosystem. The lesson: if you are writing 1,000+ lines of code to replicate functionality that libraries already provide, stop and check if there is a better way.
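To make the accidental complexity concrete, here is a toy evaluator for the easy subset, single-cell references plus arithmetic. Everything beyond this (nested functions, ranges, cross-tab references, SUMIFS) is where the line count explodes:

```python
import re

def evaluate(formula: str, cells: dict) -> float:
    """Toy evaluator for '=A1+B2*2'-style formulas: substitute cell
    references, then evaluate the remaining pure arithmetic. This is the
    trivial core; real Excel syntax is what drove the production
    evaluator to 1,293 lines."""
    expr = formula.lstrip("=")
    expr = re.sub(r"[A-Z]+\d+", lambda m: repr(cells[m.group()]), expr)
    # After substitution the expression is arithmetic only -- no names,
    # no calls -- so eval with empty builtins is acceptable for a sketch.
    return eval(expr, {"__builtins__": {}}, {})
```

Ten lines cover the happy path; each additional Excel feature adds parsing, precedence, and reference-resolution cases, which is exactly how 1,293 lines of accidental complexity accumulate.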

What I Learned About Building Agentic Systems

The Agent is the Easy Part

The LLM reasoning loop is the conceptually interesting piece, but it is maybe 20% of the engineering effort. The other 80% is everything around it: orchestrating multiple agents with dependency resolution, streaming real-time logs to a frontend over SSE, handling partial failures gracefully (what happens when the News agent fails but Financial Data succeeds?), managing LLM provider failover (OpenAI is down, fall back to Anthropic), cost tracking per request, and building evaluation pipelines to measure whether the system is actually getting better over time.
I have started calling this the "agent harness": the infrastructure layer that makes agents work in production. It is not glamorous. It does not make for good demos. But it is the difference between a prototype that works in a Jupyter notebook and a system that serves 500 real users reliably.

Multi-Agent Coordination is a Distributed Systems Problem

When I started VYNN, I thought of it as an AI project. By the time I shipped it, I realized it was primarily a distributed systems project that happens to use LLMs as compute nodes.
The challenges are the same ones you find in any distributed system: shared state management (the blackboard pattern), concurrency control (which agents can run in parallel), failure isolation (one agent crashing should not take down the pipeline), timeout handling (LLM API calls can hang indefinitely), and observability (when something goes wrong at minute 5 of a 7-minute pipeline, how do you trace it?).
If you are building multi-agent systems, the most useful background is not ML research. It is distributed systems engineering. Read about saga patterns, circuit breakers, and distributed tracing. Those patterns transfer directly.
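As one example, a minimal circuit breaker, the pattern that keeps a failing LLM provider from dragging every request through repeated timeouts (thresholds here are illustrative):

```python
import time

class CircuitBreaker:
    """After `threshold` consecutive failures, short-circuit calls for
    `cooldown` seconds instead of hammering a downed provider."""
    def __init__(self, threshold: int = 3, cooldown: float = 30.0):
        self.threshold, self.cooldown = threshold, cooldown
        self.failures, self.opened_at = 0, None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: provider unavailable")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0  # success resets the failure count
        return result
```

An open circuit fails fast, which lets the orchestrator fall back to another provider immediately instead of waiting out a timeout per request.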

LLMs Are the Slowest Component, Always

In VYNN's 6.4-minute average analysis time, LLM-intensive operations account for roughly 93% of total latency. Financial data collection and DCF computation combined take under 10 seconds. The rest is waiting for LLM API calls to return. The News Screening agent alone consumes 189.4 seconds (49.4% of total) and the Reporting agent takes 167.6 seconds (43.8%). Supervisor coordination overhead is a mere 4.2%.
This has a direct architectural implication: optimize for LLM call count, not code execution speed. Rewriting a function from O(n^2) to O(n) saves microseconds. Eliminating one unnecessary LLM call saves seconds. The performance engineering of agentic systems is fundamentally about reducing the number of LLM round-trips, batching where possible, and parallelizing where dependencies allow.
The 72% latency reduction I achieved was almost entirely from parallelizing agent execution (Financial Data and News Intelligence run concurrently instead of sequentially) and caching results across runs. Zero of it came from code-level optimization.
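The arithmetic behind "optimize call count" is blunt. Under a simplified cost model where every round-trip costs roughly the same latency regardless of payload size, batching dominates any code-level micro-optimization:

```python
def batch_articles(articles: list, batch_size: int = 10) -> list:
    """Group articles so one LLM call screens a whole batch:
    40 articles become 4 round-trips instead of 40."""
    return [articles[i:i + batch_size]
            for i in range(0, len(articles), batch_size)]

def estimated_latency(n_articles: int, batch_size: int,
                      per_call_s: float = 2.0) -> float:
    """Hypothetical cost model: latency scales with call count,
    not payload size. per_call_s is an illustrative constant."""
    calls = -(-n_articles // batch_size)  # ceiling division
    return calls * per_call_s
```

Under this model, batching 40 articles ten at a time cuts screening latency by 10x; no amount of O(n²)-to-O(n) rewriting comes close.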

Cache-Aware Orchestration Matters

Because semantic agents dominate latency but produce deterministic-enough outputs for short time windows, VYNN employs 10-minute semantic caching and daily fan-out for company-level reports. When a second user queries the same ticker within the cache window, the system reuses the News Screening artifacts instead of re-running the full 189-second pipeline. For the full META workflow, this reduces effective latency from 383 seconds to 193.6 seconds, a 49% reduction from a single cache hit.
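A minimal sketch of the TTL cache (the production version also handles daily fan-out and invalidation):

```python
import time

class SemanticCache:
    """TTL cache for semantic artifacts, keyed by ticker.
    Default TTL mirrors the 10-minute window described above."""
    def __init__(self, ttl_s: float = 600.0):
        self.ttl_s, self._store = ttl_s, {}

    def get(self, ticker: str):
        entry = self._store.get(ticker)
        if entry is None:
            return None
        value, stored_at = entry
        if time.monotonic() - stored_at > self.ttl_s:
            del self._store[ticker]  # expired: force a fresh run
            return None
        return value

    def put(self, ticker: str, artifacts) -> None:
        self._store[ticker] = (artifacts, time.monotonic())
```

On a hit, the orchestrator skips the News Screening agent entirely and proceeds straight to the downstream symbolic stages, which is where the 49% reduction comes from.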
This is not a novel technique, but it is one that many agentic systems ignore because they treat each request as independent. If you are building agents that multiple users query with overlapping inputs, caching semantic artifacts is the single highest-leverage performance optimization available to you.

Provider Agnosticism is Not Optional

VYNN supports OpenAI (GPT-4o, GPT-4o-mini) and Anthropic (Claude 3.5 Sonnet, Haiku, Opus) through a provider-agnostic abstraction layer. I built this because I wanted to compare model quality, but it turned out to be essential for reliability.
LLM APIs go down. OpenAI had multiple outages during VYNN's development and deployment. If your entire system is hardwired to one provider, an API outage means your product is down. With a provider-agnostic layer, failover is a configuration change, not a code change.
The abstraction also enables per-task model selection. The supervisor agent (which just classifies intent) can use a cheaper, faster model. The Report Generator (which needs nuanced writing) can use a more capable one. The News Intelligence agent (which processes many articles in parallel) can use a cost-optimized model. Matching model capability to task complexity is a meaningful cost optimization.
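A sketch of the abstraction with provider backends injected as plain callables. The model names mirror the ones above, but the registry shape and call signature are illustrative, not VYNN's actual interface:

```python
# Per-task model tiers; names illustrative of the setup described above.
PROVIDERS = {
    "openai": {"fast": "gpt-4o-mini", "capable": "gpt-4o"},
    "anthropic": {"fast": "claude-3-5-haiku", "capable": "claude-3-5-sonnet"},
}

def complete(task_tier: str, prompt: str, backends: dict,
             order: tuple = ("openai", "anthropic")) -> str:
    """Try providers in configured order; failover is a config change
    (reorder `order`), not a code change."""
    last_err = None
    for provider in order:
        model = PROVIDERS[provider][task_tier]
        try:
            return backends[provider](model, prompt)  # provider SDK call
        except Exception as err:
            last_err = err  # outage: fall through to the next provider
    raise RuntimeError("all providers failed") from last_err
```

Per-task tiers fall out for free: the supervisor requests `"fast"`, the Report Generator requests `"capable"`, and either request transparently survives a provider outage.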

The Numbers

I ran formal reproducibility and performance experiments on production workloads across multiple tickers and paraphrased prompts. These are real measurements from deployed runs, not synthetic benchmarks.
Latency:
  • 383 seconds (6.4 minutes) average end-to-end for comprehensive analysis
  • 93% of that time is LLM-mediated semantic processing
  • Financial data + DCF model build combined: under 10 seconds
  • Supervisor routing overhead: 4.2%
Reproducibility (9 runs across 3 tickers):
  • 100% exact reproduction of symbolic valuation outputs under identical inputs
  • 0.985 reproducibility score (CV 0.016) for NVDA
  • 0.969 for AAPL, 0.965 for MSFT
  • Within-ticker time variance reflects semantic workload, not nondeterminism
Stability under paraphrasing (3 prompts for NVDA):
  • 100% intent recognition consistency
  • 100% identical agent invocation sequences
  • Time coefficient of variation: 0.017
  • All prompts completed within a 13-second window (378.9s to 391.7s)
Performance optimization:
  • 72% end-to-end latency reduction vs. the original sequential pipeline (via parallel agent execution + result caching)
  • 49% further reduction on cache hit by reusing News Screening artifacts (383s to 193.6s for META)
  • Daily report fan-out eliminates redundant per-user computation
Other:
  • ~500 pilot users in production on Hetzner Cloud
  • 6 sector-specific DCF strategies auto-selected by company classification
  • 33 externalized prompt templates, version-controlled
  • $0 external data vendor costs (yfinance, SerpAPI, newspaper3k)
The fact that I ran these experiments formally (multiple runs per ticker, measured coefficient of variation, tested paraphrased prompts) matters as much as the numbers themselves. Most people building agentic systems never measure reproducibility. They assume it works because it worked once. VYNN's symbolic components are provably deterministic. The semantic components have measured variance. Knowing the difference is how you build trust in a system that makes financial recommendations.

What the Output Looks Like

The system produces four types of artifacts. Here are real examples from production runs.

META Financial Model (xlsx)

A 10-tab Excel workbook with live formulas: raw financials, historical metrics, LLM-inferred assumptions, 5-year projections, dual DCF valuation (perpetual growth + exit multiple), sensitivity matrices, and a summary dashboard. Every number is formula-driven, not static. This is what the Recommendation Engine's Layer 1 reads from.

AAPL Financial Model (xlsx)

Similar 10-tab Excel structure generated for Apple, automatically adapting mapping logic to different GAAP reporting line items.

NVDA Professional Analysis Report (pdf)

The full analyst report generated by the Report Generator agent. Executive summary, investment thesis, financial analysis, dual DCF valuation, news-driven catalyst and risk analysis, and a validated recommendation with 3/6/12-month price targets. Every claim references a citation ID that traces back to either the DCF model or a sourced news article.

ORCL Professional Analysis Report (pdf)

Another full analyst report generated for Oracle, showcasing the system's ability to adapt to different financial structures and news cycles.

What Comes Next

VYNN AI is open source now. The codebase is split across three repositories under the Agentic-Analyst organization: stock-analyst (agent backend, ~15,000 LOC), api-runner (API layer, ~10,600 LOC), and vynnai-web (frontend, ~23,000 LOC).
I am not actively developing new features. This summer I am joining Robinhood as a Machine Learning Engineer on their Agentic AI team, where I will be building agent infrastructure at a very different scale. VYNN taught me how to build an agentic system from scratch. Robinhood will teach me how to make one production-grade at scale.
But the lessons from VYNN carry forward. The semantic-symbolic separation as an architectural principle. The blackboard pattern for decoupled agent state. The three-layer recommendation engine for numerical integrity. Prompt externalization for iteration speed. Cache-aware orchestration for operational efficiency. Provider-agnostic LLM abstractions for reliability. These are not financial analysis patterns. They are general infrastructure patterns for any system where LLMs need to operate reliably, observably, and at scale.
If you are building agentic AI systems and any of this resonates, I would love to hear from you. Email me or find me on LinkedIn.

Zanwen (Ryan) Fu is a Software Engineer and MS Computer Science student at Duke University, focused on building production-grade agentic AI systems. He joins Robinhood's Agentic AI team as an MLE intern in May 2026. More at zanwenfu.com.