
All Projects

The full catalog.

Flagship products and research — each flagship is broken out by repository below so the architecture tells the story, not a pitch.

Flagship Products · Production agent systems, shipped
FOUNDER · SOLE ENGINEER · ~500 PILOT USERS · PRODUCTION

VYNN AI — Production Agentic Financial Analyst

Bloomberg-grade equity research, built for retail. A LangGraph supervisor orchestrates seven specialized agents — fundamentals, news intelligence, valuation, and a validated recommendation — and returns a 35-page PDF + 10-tab DCF + traceable rating in under seven minutes per ticker. Built end-to-end as a three-layer stack, deployed to real users, with reproducibility guarantees most research prototypes don't attempt. Live at vynnai.com.

Three architectural commitments

Explicit state semantics

Intermediate artifacts are typed, inspectable state — not implicit prompt context. Every agent produces a structured slice of the supervisor state.
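As a minimal sketch of what typed, inspectable state means here — the field names and `TypedDict` shape are hypothetical, not the production schema:

```python
from typing import List, TypedDict

# Hypothetical illustration: each agent writes a structured, typed slice of
# supervisor state instead of appending free text to a shared prompt.
class NewsEvent(TypedDict):
    headline: str
    relevance: float  # semantic-layer output, 0..1

class SupervisorState(TypedDict, total=False):
    ticker: str
    news_events: List[NewsEvent]   # written by the news agent
    dcf_fair_value: float          # written by the deterministic valuation layer
    recommendation: str            # final, validated slice

def merge_agent_output(state: SupervisorState, slice_: SupervisorState) -> SupervisorState:
    # Agents return partial state; the supervisor merges inspectable slices.
    merged = dict(state)
    merged.update(slice_)
    return merged  # type: ignore[return-value]

state: SupervisorState = {"ticker": "NVDA"}
state = merge_agent_output(state, {"dcf_fair_value": 142.5})
```

Because every slice is a plain typed value, any intermediate artifact can be logged, diffed, or replayed without re-running the agent that produced it.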

Strict symbolic–semantic separation

LLMs are restricted to intent recognition, relevance assessment, and event extraction. Valuation, recommendation logic, and numerical propagation are fully deterministic.

Cache-aware orchestration

Repeated or overlapping queries reuse validated artifacts. A 10-min semantic cache + daily report fan-out bounds redundant LLM work.
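A toy version of the TTL-bounded cache idea — a normalized content hash stands in for real semantic matching, and the names are illustrative:

```python
import hashlib
import time

TTL_SECONDS = 600  # the 10-minute window described above
_cache: dict = {}  # key -> (timestamp, validated artifact)

def _key(query: str) -> str:
    # A production semantic cache would embed the query; a hash of the
    # normalized text is a crude stand-in for illustration.
    return hashlib.sha256(" ".join(query.lower().split()).encode()).hexdigest()

def get_or_compute(query: str, compute):
    k = _key(query)
    hit = _cache.get(k)
    if hit and time.time() - hit[0] < TTL_SECONDS:
        return hit[1]          # reuse the validated artifact
    value = compute()          # only pay for LLM work on a miss
    _cache[k] = (time.time(), value)
    return value
```

Overlapping queries inside the window resolve to the same artifact, which is what bounds redundant LLM work.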

Production outcomes

  • ~15,000 LOC across 40+ modules, 33 externalized prompt templates, 3-layer recommendation engine (deterministic calculator → LLM narrative → regex validator enforcing ≥95% citation coverage).
  • Custom 1,293-line Formula Evaluator interprets Excel formulas (cell references, cross-tab references and propagation, SUMIFS) programmatically — so downstream agents consume computed values without an Excel installation, and Excel-JSON consistency is guaranteed by construction.
  • Reproducibility validated empirically: coefficient of variation 0.016–0.035 across 9 production runs (NVDA/AAPL/MSFT); paraphrase stability 0.983; 72% latency reduction via parallel agent execution; median 98s end-to-end, full META workflow 383s.
  • Deployed to ~500 pilot users on Hetzner Cloud with zero-downtime multi-arch Docker.
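The formula-evaluation idea above, reduced to a toy: resolve cell references recursively, then compute the expression. The production evaluator's SUMIFS and cross-tab handling are out of scope here, and the `eval`-based core is illustrative only:

```python
import re

def evaluate(cell: str, sheet: dict) -> float:
    """Resolve a cell: literals pass through, '=' formulas are computed."""
    value = sheet[cell]
    if not (isinstance(value, str) and value.startswith("=")):
        return float(value)
    expr = value[1:]
    # Replace each cell reference with its (recursively) computed value.
    expr = re.sub(r"[A-Z]+\d+", lambda m: repr(evaluate(m.group(0), sheet)), expr)
    # Toy arithmetic core -- trusted input only; a real evaluator parses an AST.
    return float(eval(expr, {"__builtins__": {}}))

sheet = {"A1": 10, "B2": "=A1*3", "C3": "=A1+B2"}
```

Here `evaluate("C3", sheet)` walks the dependency chain (`C3 → B2 → A1`) and returns `40.0` without any Excel runtime in the loop.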

3 repos, one system

stock-analyst · AGENT BACKEND · LangGraph supervisor coordinating 7 specialized agents with symbolic DCF valuation.
api-runner · CONTROL PLANE · FastAPI + Docker-in-Docker ephemeral workers, SSE + dual WebSocket streaming.
vynnai-web · FRONTEND · React 18 dashboard + AI chat with SSE streaming and subscriber-based WS.
stock-analyst · AGENT BACKEND · PYTHON

The brain of VYNN AI. ~15K LOC across 40+ modules. LangGraph supervisor orchestrates 7 specialized agents with a strict symbolic–semantic split: LLMs handle intent recognition, relevance, and event extraction only; valuation, recommendation logic, and numerical propagation are fully deterministic. Includes a custom 1,293-line Formula Evaluator that interprets Excel formulas programmatically, so downstream agents don't depend on Excel at runtime and Excel-JSON consistency is guaranteed.

Python · LangGraph · GPT-4o · Claude 3.5 Sonnet · MongoDB · Redis
api-runner · CONTROL PLANE · FASTAPI

10,598 LOC FastAPI control plane. Docker-in-Docker orchestration dispatches ephemeral worker containers per job; dual persistent WebSockets (news + real-time prices) with exponential backoff reconnection; SSE job streaming with log batching; pre-market scheduler at 8:30 AM ET with NYSE/NASDAQ holiday awareness (including algorithmic Good Friday via Anonymous Gregorian Easter). Graceful SIGTERM shutdown with 10s timeout and state preservation.

Python · FastAPI · Docker SDK · Motor · Redis · SSE · WebSocket · OAuth 2.0
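The holiday math behind the pre-market scheduler is standard: the Anonymous Gregorian computus yields Easter Sunday, and Good Friday is two days earlier. A self-contained sketch:

```python
from datetime import date, timedelta

def easter(year: int) -> date:
    # Anonymous Gregorian computus (Meeus/Jones/Butcher form).
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)

def good_friday(year: int) -> date:
    # NYSE/NASDAQ are closed on Good Friday, two days before Easter Sunday.
    return easter(year) - timedelta(days=2)
```

Because the holiday is computed, not tabulated, the scheduler never needs a hand-maintained list of floating market closures.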
vynnai-web · FRONTEND · REACT + TYPESCRIPT

23K LOC React/TypeScript SPA. Subscriber-based WebSocket context with debounced subscription updates (300ms) and delayed unsubscribe (1s) to handle React StrictMode double-mounts. Module-scoped singleton SSE refs survive component unmounts. User-scoped localStorage (`user_{email}_{key}`) prevents cross-account data leakage on shared devices. Holiday-aware market status hook computes NYSE state client-side with second-level precision, including 9 NYSE holidays with algorithmic floating-holiday computation.

React 18 · TypeScript · Vite · Tailwind · shadcn/ui · TanStack Query · Recharts
ACQUIRED BY SONAR · 51.6% SWE-BENCH VERIFIED · ISSTA 2024

AutoCodeRover — Autonomous Code Repair in the IDE

Autonomous coding agent that resolves real GitHub issues. My contribution is two-part: I built the JetBrains IDE plugin end-to-end in Kotlin, and enhanced the AutoCodeRover Agent with a Self-Fix Agent and interactive replay infrastructure that lifted SWE-bench Verified from 38.4% to 51.6%.

Contribution boundaries

JetBrains plugin — entirely mine

End-to-end Kotlin plugin: conversational UI, SSE streaming, PSI-based context enrichment, embedded SonarLint, build/test capture, GumTree 3-way AST merge for conflict-free patch application when local code has diverged from the agent's baseline.

Self-Fix Agent — mine

4-step autonomous repair loop: collect failure reasons → diagnose which upstream agent produced the defective output → generate corrective feedback → selectively replay from that stage. Not a parallel peer to Write/Review — a recovery loop triggered on failure that routes corrective feedback back to the responsible upstream stage.

Interactive replay infrastructure — mine

Structured agent state with UUID-tagged LLM responses. Feedback at any stage replays only downstream agents; preserves upstream state; no full restart. The replay mechanism is the same substrate the Self-Fix Agent uses.
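A schematic of that substrate — stage names, payloads, and the feedback format below are hypothetical; the point is that feedback targets one UUID and re-runs only that stage and its downstream:

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical pipeline stages standing in for the real agent sequence.
STAGES = ["context_retrieval", "patch_generation", "review"]

@dataclass
class AgentState:
    outputs: dict = field(default_factory=dict)  # stage -> (uuid, payload)

def run_stage(stage: str, state: AgentState, feedback: Optional[str] = None) -> None:
    # A real stage would call an LLM; the payload here just records the call.
    payload = f"{stage} output" + (f" (revised: {feedback})" if feedback else "")
    state.outputs[stage] = (str(uuid.uuid4()), payload)

def run_all(state: AgentState) -> None:
    for s in STAGES:
        run_stage(s, state)

def replay_from(state: AgentState, target_uuid: str, feedback: str) -> None:
    # Locate the stage that produced the tagged response, keep all upstream
    # state untouched, and re-run from that stage onward -- no full restart.
    idx = next(i for i, s in enumerate(STAGES) if state.outputs[s][0] == target_uuid)
    run_stage(STAGES[idx], state, feedback)
    for s in STAGES[idx + 1:]:
        run_stage(s, state)
```

The Self-Fix loop is then just an automated caller of `replay_from`: diagnose which stage failed, synthesize corrective feedback, and replay from that UUID.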

Production outcomes

  • AutoCodeRover moved from 38.4% (Jun 2024) to 51.6% (Jan 2025) on SWE-bench Verified during my contribution period. The Self-Fix Agent and interactive replay infrastructure are the mechanisms; the baseline agent architecture is group-authored.
  • 1.8× patch precision over the next-best open-source agent.
  • Published at ISSTA 2024 + arXiv (AutoCodeRover and SpecRover papers).

2 repos, one system

auto-code-rover · BACKEND · Python repair pipeline — Self-Fix Agent, interactive replay, UUID feedback.
jetbrains-ide-plugin · IDE INTEGRATION · Kotlin plugin — GumTree 3-way AST merge, SonarLint, conversational agent UI.
auto-code-rover · REPAIR BACKEND · PYTHON

Python agentic repair pipeline: Context Retrieval → Patch Generation → Reviewer Agent loop across 7 languages via tree-sitter. The core agent architecture is group-authored; the Self-Fix Agent, interactive replay, and UUID-targeted feedback are mine.

Python · LangChain · tree-sitter · Claude 3.5 Sonnet · GPT-4o
jetbrains-ide-plugin · IDE PLUGIN · KOTLIN · ENTIRELY MINE

Brings autonomous code repair into the developer's IDE. A single main orchestrator coordinates six subsystems across conversational UI, IDE events, static analysis, and a novel AST-level patch merge — so the agent can read the developer's working state, stream reasoning in real time, and land fixes without context-switching out of the editor.

Kotlin · IntelliJ Platform SDK · PSI · GumTree · SonarLint Core 10.3 · OkHttp · JGit
v0 SHIPPED · ACTIVELY EXTENDING · AGENT INFRASTRUCTURE · OPEN SOURCE · MIT

taste — An Operating System for Agents

An Agent OS kernel: three-core CPU split (Opus 4.7 planner / Sonnet 4.6 workers / Haiku 4.5 monitor) on a git memory substrate where branches are execution contexts, commits are checkpoints, `git worktree` gives every parallel worker filesystem-level isolation, and `git reset --hard` is rollback. Three demos shipped with committed transcripts, cost telemetry, and self-contained HTML dashboards. Thesis: build to delete — every subsystem is opt-in, so when next-gen models can self-evaluate, disable the Monitor and the kernel's git-based abstractions survive untouched.

Three architectural commitments

Three-core CPU separation

Planner / Worker / Monitor run on distinct Claude tiers — the reasoning sandwich. No agent grades its own exam; the Monitor is the only component that can gate a commit.

Git as memory, not sidecar

Branches = execution contexts. Commits = checkpoints. `git show` = demand paging. `git reset --hard` = rollback. Every kernel artifact is committed — nothing lives in a progress file.
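Branch-as-context plus worktree isolation can be demonstrated in a throwaway repo — this sketch assumes a `git` binary on PATH, and the worker names are illustrative:

```python
import pathlib
import subprocess
import tempfile

def git(*args: str, cwd: str) -> None:
    subprocess.run(["git", *args], cwd=cwd, check=True, capture_output=True)

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "kernel@example.com", cwd=repo)
git("config", "user.name", "kernel", cwd=repo)
pathlib.Path(repo, "base.txt").write_text("shared baseline\n")
git("add", ".", cwd=repo)
git("commit", "-q", "-m", "baseline", cwd=repo)

# Each worker gets its own branch *and* its own directory: filesystem-level
# isolation, so parallel edits cannot clobber each other before merge-back.
worktrees = {}
for name in ("worker-a", "worker-b"):
    wt = str(pathlib.Path(tempfile.mkdtemp(), name))
    git("worktree", "add", "-b", name, wt, "HEAD", cwd=repo)
    pathlib.Path(wt, f"{name}.txt").write_text(f"{name} output\n")
    worktrees[name] = wt
```

Each worktree sees the shared baseline but writes to its own directory and branch; merge-back is then an ordinary git merge, conflicts included.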

Build to delete

Every subsystem is opt-in, not load-bearing. When next-gen models self-evaluate reliably, disable the Monitor — the kernel's git-based abstractions survive untouched.

Production outcomes

  • Real-Claude run committed with full telemetry (todo_api): $0.0964, 43s, 15/15 tests green, zero rollbacks — 7 LLM calls, 16.5K input / 3.1K output tokens on Sonnet 4.6.
  • Parallel worktree execution shipped (parallel_demo): 3 concurrent workers on real `git worktree` branches cut wall-clock from ~32s serial to 21.5s; atomic merge-back only if every worker in the wave passes its Monitor. MergeConflict raised as a typed exception — “merge conflicts as coordination signals” made literal.
  • Hermetic rollback proven without API key (refactor_demo): step-2 regresses → Monitor catches via pytest → kernel runs `git reset --hard` → retry lands clean. Final branch has no trace of the failed attempt. CI asserts on the outcome.
  • Event stream survives rollback — `.git/taste/events.jsonl` lives outside the tracked tree so rollback doesn't erase the audit trail. Self-contained HTML dashboard (`taste dashboard`) renders timeline, per-step outcomes, and git topology — htop for agents.
  • 40 tests across 5 load-bearing suites, zero cyclic imports by design, pip-installable CLI (`taste run` / `taste log` / `taste dashboard`).
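The events-outside-the-tree trick from the list above is easy to verify in a scratch repo — this sketch assumes a `git` binary on PATH, and the file names are illustrative:

```python
import json
import pathlib
import subprocess
import tempfile

def git(*args: str, cwd: str) -> None:
    subprocess.run(["git", *args], cwd=cwd, check=True, capture_output=True)

repo = tempfile.mkdtemp()
git("init", "-q", cwd=repo)
git("config", "user.email", "kernel@example.com", cwd=repo)
git("config", "user.name", "kernel", cwd=repo)

# The event log lives under .git/, so it is never part of the tracked tree.
log_dir = pathlib.Path(repo, ".git", "taste")
log_dir.mkdir(parents=True)
log = log_dir / "events.jsonl"

work = pathlib.Path(repo, "step.py")
work.write_text("ok = True\n")
git("add", ".", cwd=repo)
git("commit", "-q", "-m", "step 1", cwd=repo)
log.write_text(json.dumps({"event": "checkpoint", "step": 1}) + "\n")

work.write_text("ok = False\n")  # a worker regresses the tracked file
with log.open("a") as f:
    f.write(json.dumps({"event": "rollback", "step": 2}) + "\n")
git("reset", "--hard", "-q", "HEAD", cwd=repo)  # rewinds the work tree only
```

After the reset the tracked file is back at its checkpoint, while the audit trail still records both the checkpoint and the rollback.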
taste-is-all-you-need · KERNEL · RUNTIME · PYTHON · CLAUDE OPUS 4.7 / SONNET 4.6 / HAIKU 4.5

The full runtime implementing the Agent OS thesis — a kernel orchestration loop, git-based memory substrate, and end-user dashboard that render every decision, checkpoint, and rollback navigable. 7 core modules, no cyclic imports. v0 shipped with three demos. Extensions in progress on long-horizon real-model rollback, autonomous parallelism selection, and LLM-judge monitoring in production.

Python · Git worktree · Claude Opus 4.7 · Claude Sonnet 4.6 · Claude Haiku 4.5 · pytest
Research · Medical agents, security, ML systems, parameter-efficient fine-tuning, and applied evaluation

Five studies across medical multi-agent systems, security of deployed agents, ML-systems measurement, parameter-efficient fine-tuning, and applied evaluation. The through-line: falsifiable claims, matched baselines, and reviewer-verifiable artifacts — the findings are intended to survive a second reader with a skeptical eye.

agentic-reviewers-for-SRMA · MEDICAL NLP · SOLO FIRST AUTHOR · 15 SRMAs · ~150K CITATIONS

Mean sensitivity 0.982 / FNR 0.018 across 15 published SRMAs (~150K citations); perfect 1.000 sensitivity with 20–40pp specificity improvements over Tran et al. 2024's GPT-3.5 PICOS baseline on 4 held-out benchmark SRMAs (Ann Intern Med). Four small agents — Classifier, PICOS Detailed Screener, Reviewer (LLM-as-a-judge), Improver — cooperate through a bounded review/improve loop, producing a full written audit trail for every inclusion decision. ~150 lines of orchestration plus six prompt files; the design thesis is that small agents are enough when composed carefully. End-to-end cost: ~$0.07 per 10 candidates. Sole first author; code, manuscript, and per-SRMA results tables released under CC BY-NC 4.0.

Python · GPT-4o-mini · GPT-o3-mini
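The bounded review/improve loop can be sketched in a few lines — the stub agents below are stand-ins for the real prompt-backed ones, and the round cap is illustrative:

```python
MAX_ROUNDS = 2  # illustrative bound; prevents an unbounded review/improve cycle

def screen(record, screener, reviewer, improver):
    """Screen one candidate study, returning the decision and an audit trail."""
    decision = screener(record)
    trail = [("screener", decision)]
    for _ in range(MAX_ROUNDS):
        verdict = reviewer(record, decision)      # LLM-as-a-judge
        trail.append(("reviewer", verdict))
        if verdict == "accept":
            break
        decision = improver(record, decision, verdict)
        trail.append(("improver", decision))
    return decision, trail
```

Every turn is appended to the trail, which is how each inclusion decision ends up with a full written audit trail rather than a bare label.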
architectural-damping · SECURITY RESEARCH · ρ-METRIC · ex-ante PREDICTIVE · 80-CASE BENCHMARK

A deterministic downstream calculator absorbs 83% of LLM-layer prompt-injection successes (ρ = 1 − ASR_end / ASR_screening = 0.83 on the 12-case held-out pilot; ρ = 1.00 on direct-override attacks) before they reach users — and the exact figure is predictable ex ante from the calculator's source code. Derives a closed-form 7.30pp single-document perturbation budget from the production calculator's source; freezes three-way attackability predictions before running the pilot; 6/6 predictions hold. Identifies attack-surface rotation as a failure mode distinct from Nasr et al.'s ASR recovery — aggregate ASR invariant, attack-family distribution changes. System under study: VYNN AI (my own production deployment, ~500 pilot users; all attacks run against offline replica). 52 tests, frozen manifests with commit pinning, deterministic LLM cache, CI-green on bare clone.

Python · LangGraph · Claude · GPT-4 · MongoDB · pytest
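The ρ-metric itself is one line; the inputs below are illustrative values chosen to show the formula, not the pilot's raw counts:

```python
def damping_ratio(asr_screening: float, asr_end: float) -> float:
    # rho = 1 - ASR_end / ASR_screening: the share of LLM-layer injection
    # successes absorbed by the deterministic layer before reaching users.
    if asr_screening == 0:
        raise ValueError("rho is undefined when no attacks succeed at screening")
    return 1.0 - asr_end / asr_screening
```

With illustrative rates, `damping_ratio(0.6, 0.102)` gives ρ = 0.83, and any attack family fully absorbed downstream (ASR_end = 0) gives ρ = 1.0, as with the direct-override attacks.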
speculative-decoding-t4 · SYSTEMS RESEARCH · 3× GAP DECOMPOSED · PYTORCH · T4

Sequoia predicts 1.68× speedup on T4; I measured 0.56×. A four-term decomposition reconciles the 3× gap to within 1.1% of measurement noise — the algorithm is sound, but three specific cost-model assumptions break on bandwidth-bound hardware. Attempting the natural fix (cross-iteration KV persistence, a clear A100 win) measurably worsens T4 performance (0.56× → 0.46×) because each cache-extension forward still pays a ~20ms weight-loading floor — the paper's sharpest finding and the fourth hidden assumption. A single controlled probe unifies every finding: per-call cost flat at 20.0 ± 0.2ms across an 8.5× range of cache lengths. PLD is the only family that wins on T4 (1.28–1.39×) because Cr ≈ 0 via CPU n-gram matching is the only structural bypass of the bandwidth floor.

Python · PyTorch 2.1 · Transformers 4.45 · CUDA 12.8 · vLLM · Jupyter
svd-lora · PARAMETER-EFFICIENT FINE-TUNING · CONTROLLED STUDY

~30% of LoRA's parameters, +2.1 F1 over standard LoRA on IMDB — average effective rank collapses from 8 to 2.42 after SVD-guided truncation + brief post-compression fine-tuning, with no accuracy lost on SST-2 and a measurable gain on IMDB. A controlled study of a simple question: after training a LoRA adapter, how much of its rank is actually task-useful? And if you truncate down to that effective rank and keep training, what happens? Compression pass is ~30 lines of code, exact to the Eckart–Young–Mirsky bound — the provably optimal low-rank approximation under Frobenius norm, not a heuristic.

Python · PyTorch · PEFT · Hugging Face · DistilBERT · SVD
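The truncation step is plain linear algebra. A sketch with a random matrix standing in for the trained adapter — the 90% energy threshold here is illustrative, not the study's rank-selection criterion:

```python
import numpy as np

# Stand-in for a trained LoRA update: delta_W = B @ A with nominal rank 8.
rng = np.random.default_rng(0)
d, r = 64, 8
B, A = rng.normal(size=(d, r)), rng.normal(size=(r, d))
delta_w = B @ A

# Truncated SVD is the optimal rank-k approximation in Frobenius norm
# (Eckart-Young-Mirsky), so nothing recoverable at rank k is wasted.
U, S, Vt = np.linalg.svd(delta_w, full_matrices=False)
energy = np.cumsum(S ** 2) / np.sum(S ** 2)
k = int(np.searchsorted(energy, 0.90)) + 1  # smallest rank keeping 90% energy

# New, smaller adapter factors of rank k, ready for post-compression tuning.
B_k = U[:, :k] * S[:k]
A_k = Vt[:k]
approx = B_k @ A_k
```

The residual error is exactly the energy in the discarded singular values, which is what makes the compression a bound-exact operation rather than a heuristic.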
football-llm · APPLIED ML EVALUATION · TWO-REPO STUDY

On the 2022 FIFA World Cup held-out set (n=128), QLoRA-fine-tuned Llama-3.1 8B hits 79.7% O/U 2.5 directional accuracy (84.4% on named-only, Wilson CI [0.736, 0.913]) — and under a coherence-required metric that credits a prediction only when text label, score line, and ground truth all agree, the headline 61.7% score_acc for QLoRA collapses to 42.2%, tying 5-shot ICL. The magnitude/direction decomposition is the contribution: LLM ties feature-matched XGBoost on 1X2 direction but beats it by 19pp pregame / 16pp halftime on O/U 2.5 magnitude — driven by pretrained scoreline priors tabular features can't replicate. Paired McNemar on pregame → halftime+events O/U 2.5: p = 0.006. The broader claim — benchmarks over structured multi-field generative outputs should report coherence-required accuracy as a cheap diagnostic, because parser rescue inflates headline numbers.

Llama 3.1 8B · QLoRA · PyTorch · vLLM · XGBoost · FastAPI · Gradio
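A minimal version of the coherence-required check — the field names and score-line format below are assumptions for illustration, not the repo's actual schema:

```python
import re

def coherent_correct(text_label: str, score_line: str, truth_over: bool,
                     line: float = 2.5) -> bool:
    """Credit a prediction only when text label, predicted score line, and
    ground truth all agree on the same side of the O/U line."""
    home, away = map(int, re.match(r"(\d+)\s*-\s*(\d+)", score_line).groups())
    score_implies_over = home + away > line
    label_over = text_label.strip().lower() == "over"
    # Parser rescue would credit label_over == truth_over alone; coherence
    # additionally demands the generated score line imply the same side.
    return label_over == truth_over == score_implies_over
```

An output like ("Over", "1-0") fails this check even when "Over" happens to match ground truth, which is exactly the kind of internally inconsistent generation that inflates headline accuracy.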

More on GitHub.

Tooling, prototypes, course artifacts, and in-progress work live at one address — the curated story is above, the full archive is a click away.

github.com/zanwenfu