All Projects
The full catalog.
Flagship products and research — each flagship broken out by repository below so the architecture tells the story instead of a pitch.
VYNN AI — Production Agentic Financial Analyst
Bloomberg-grade equity research, built for retail. A LangGraph supervisor orchestrates seven specialized agents — spanning fundamentals, news intelligence, valuation, and a validated recommendation — and returns a 35-page PDF + 10-tab DCF + traceable rating in under seven minutes per ticker. Built end-to-end as a three-layer stack, deployed to real users, with reproducibility guarantees most research prototypes don't attempt. Live at vynnai.com.
Three architectural commitments
Explicit state semantics
Intermediate artifacts are typed, inspectable state — not implicit prompt context. Every agent produces a structured slice of the supervisor state (see the sketch below).
Strict symbolic–semantic separation
LLMs are restricted to intent recognition, relevance assessment, and event extraction. Valuation, recommendation logic, and numerical propagation are fully deterministic.
Cache-aware orchestration
Repeated or overlapping queries reuse validated artifacts. A 10-min semantic cache + daily report fan-out bounds redundant LLM work.
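To make the first two commitments concrete, here is a minimal sketch with hypothetical field names rather than the production schema: agents write typed slices into a shared supervisor state, and valuation is plain arithmetic over those slices, never an LLM call.

```python
from dataclasses import dataclass, field

@dataclass
class FundamentalsSlice:
    # Structured output of the fundamentals agent: numbers, not prose.
    revenue: float
    fcf_margin: float
    shares_outstanding: float

@dataclass
class SupervisorState:
    # Every agent writes a typed, inspectable slice; nothing lives only in prompt context.
    ticker: str
    fundamentals: FundamentalsSlice | None = None
    news_events: list[dict] = field(default_factory=list)  # LLM layer: event extraction only
    fair_value: float | None = None                         # deterministic layer output

def dcf_fair_value(state: SupervisorState, discount_rate: float = 0.09,
                   terminal_growth: float = 0.025, years: int = 5) -> float:
    """Deterministic valuation over the typed state; illustrative flat-FCF DCF, no LLM involved."""
    f = state.fundamentals
    fcf = f.revenue * f.fcf_margin
    pv = sum(fcf / (1 + discount_rate) ** t for t in range(1, years + 1))
    terminal = fcf * (1 + terminal_growth) / (discount_rate - terminal_growth)
    pv += terminal / (1 + discount_rate) ** years
    state.fair_value = pv / f.shares_outstanding
    return state.fair_value
```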
Production outcomes
- ~15,000 LOC across 40+ modules, 33 externalized prompt templates, 3-layer recommendation engine (deterministic calculator → LLM narrative → regex validator enforcing ≥95% citation coverage).
- Custom 1,293-line Formula Evaluator interprets Excel formulas (cell references, cross-tab references, SUMIFS, cross-tab propagation) programmatically — so downstream agents consume computed values without an Excel installation, and Excel-JSON consistency is guaranteed by construction (see the sketch after this list).
- Reproducibility validated empirically: coefficient of variation 0.016–0.035 across 9 production runs (NVDA/AAPL/MSFT); paraphrase stability 0.983; 72% latency reduction via parallel agent execution; median 98s end-to-end, full META workflow 383s.
- Deployed to ~500 pilot users on Hetzner Cloud with zero-downtime multi-arch Docker.
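The evaluator itself is 1,293 lines; the toy below only illustrates the core idea of computing Excel-style formulas against a JSON workbook with no Excel install. Tab and cell names are hypothetical, and the real evaluator covers much more (SUMIFS, cross-tab propagation).

```python
import ast
import re

# Toy workbook: {tab: {cell: value-or-formula}}; formulas start with "=".
workbook = {
    "Assumptions": {"B2": 0.09, "B3": 0.025},
    "DCF": {"B4": 1250.0, "B5": "=DCF!B4 * (1 + Assumptions!B3)",
            "B6": "=DCF!B5 / (Assumptions!B2 - Assumptions!B3)"},
}

REF = re.compile(r"([A-Za-z_]\w*)!([A-Z]+\d+)")

def resolve(tab: str, cell: str) -> float:
    """Return the computed value of a cell, evaluating its formula if needed."""
    raw = workbook[tab][cell]
    if isinstance(raw, (int, float)):
        return float(raw)
    # Replace every cross-tab reference with its (recursively) computed value.
    expr = REF.sub(lambda m: repr(resolve(m.group(1), m.group(2))), raw.lstrip("="))
    return _eval_arithmetic(ast.parse(expr, mode="eval").body)

def _eval_arithmetic(node: ast.AST) -> float:
    """Evaluate only numeric literals and + - * /; nothing else is allowed."""
    if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
        return float(node.value)
    if isinstance(node, ast.BinOp):
        ops = {ast.Add: lambda a, b: a + b, ast.Sub: lambda a, b: a - b,
               ast.Mul: lambda a, b: a * b, ast.Div: lambda a, b: a / b}
        return ops[type(node.op)](_eval_arithmetic(node.left), _eval_arithmetic(node.right))
    if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
        return -_eval_arithmetic(node.operand)
    raise ValueError(f"unsupported expression node: {ast.dump(node)}")

print(round(resolve("DCF", "B6"), 1))  # terminal value from the toy workbook
```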
3 repos, one system
The brain of VYNN AI. ~15K LOC across 40+ modules. LangGraph supervisor orchestrates 7 specialized agents with a strict symbolic–semantic split: LLMs handle intent recognition, relevance, and event extraction only; valuation, recommendation logic, and numerical propagation are fully deterministic. Includes a custom 1,293-line Formula Evaluator that interprets Excel formulas programmatically, so downstream agents don't depend on Excel at runtime and Excel-JSON consistency is guaranteed.
10,598 LOC FastAPI control plane. Docker-in-Docker orchestration dispatches ephemeral worker containers per job; dual persistent WebSockets (news + real-time prices) with exponential backoff reconnection; SSE job streaming with log batching; pre-market scheduler at 8:30 AM ET with NYSE/NASDAQ holiday awareness (including algorithmic Good Friday via the Anonymous Gregorian Easter computation, sketched below). Graceful SIGTERM shutdown with a 10s timeout and state preservation.
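Good Friday is the one NYSE holiday that floats with Easter, so the scheduler has to compute it. Here is the standard Anonymous Gregorian (Meeus/Jones/Butcher) computus as a sketch, assuming that is the variant the scheduler implements:

```python
from datetime import date, timedelta

def easter_sunday(year: int) -> date:
    """Anonymous Gregorian algorithm (Meeus/Jones/Butcher) for Easter Sunday."""
    a = year % 19
    b, c = divmod(year, 100)
    d, e = divmod(b, 4)
    f = (b + 8) // 25
    g = (b - f + 1) // 3
    h = (19 * a + b - d - g + 15) % 30
    i, k = divmod(c, 4)
    l = (32 + 2 * e + 2 * i - h - k) % 7
    m = (a + 11 * h + 22 * l) // 451
    month, day = divmod(h + l - 7 * m + 114, 31)
    return date(year, month, day + 1)

def good_friday(year: int) -> date:
    """NYSE observes Good Friday, two days before Easter Sunday."""
    return easter_sunday(year) - timedelta(days=2)

assert good_friday(2024) == date(2024, 3, 29)
```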
23K LOC React/TypeScript SPA. Subscriber-based WebSocket context with debounced subscription updates (300ms) and delayed unsubscribe (1s) to handle React StrictMode double-mounts. Module-scoped singleton SSE refs survive component unmounts. User-scoped localStorage (`user_{email}_{key}`) prevents cross-account data leakage on shared devices. Holiday-aware market status hook computes NYSE state client-side with second-level precision, including 9 NYSE holidays with algorithmic floating-holiday computation.
AutoCodeRover — Autonomous Code Repair in the IDE
Autonomous coding agent that resolves real GitHub issues. My contribution is two-part: I built the JetBrains IDE plugin end-to-end in Kotlin, and enhanced the AutoCodeRover Agent with a Self-Fix Agent and interactive replay infrastructure that lifted SWE-bench Verified from 38.4% to 51.6%.
Contribution boundaries
JetBrains plugin — entirely mine
End-to-end Kotlin plugin: conversational UI, SSE streaming, PSI-based context enrichment, embedded SonarLint, build/test capture, GumTree 3-way AST merge for conflict-free patch application when local code has diverged from the agent's baseline.
Self-Fix Agent — mine
4-step autonomous repair loop: collect failure reasons → diagnose which upstream agent produced the defective output → generate corrective feedback → selectively replay from that stage. Not a parallel peer to Write/Review — a recovery loop triggered on failure that routes corrective feedback back to the responsible upstream stage.
Interactive replay infrastructure — mine
Structured agent state with UUID-tagged LLM responses. Feedback at any stage replays only downstream agents; preserves upstream state; no full restart. The replay mechanism is the same substrate the Self-Fix Agent uses.
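A simplified sketch of how UUID-tagged stage outputs make selective replay (and therefore the Self-Fix loop) possible. Stage names, the diagnosis heuristic, and all function signatures are illustrative, not the agent's actual code:

```python
import uuid
from dataclasses import dataclass, field

STAGES = ["context_retrieval", "patch_generation", "review"]

@dataclass
class StageOutput:
    stage: str
    content: str
    response_id: str = field(default_factory=lambda: str(uuid.uuid4()))

@dataclass
class AgentState:
    # Every LLM response is UUID-tagged so feedback can target the stage that produced it.
    outputs: dict[str, StageOutput] = field(default_factory=dict)

def run_stage(stage: str, state: AgentState, feedback: str | None = None) -> StageOutput:
    # Placeholder for the real agent call; feedback is folded into the stage prompt.
    out = StageOutput(stage, f"<{stage} output{' + feedback' if feedback else ''}>")
    state.outputs[stage] = out
    return out

def run_pipeline(state: AgentState, start: str = STAGES[0], feedback: str | None = None) -> None:
    """Replay from `start` only; upstream outputs already in state are preserved untouched."""
    for stage in STAGES[STAGES.index(start):]:
        run_stage(stage, state, feedback if stage == start else None)

def diagnose(failure_reason: str) -> str:
    # Illustrative heuristic; the real diagnosis step is itself agent-driven.
    return "context_retrieval" if "missing symbol" in failure_reason else "patch_generation"

def self_fix(state: AgentState, failure_reason: str) -> None:
    """Recovery loop: collect failure -> diagnose responsible stage -> feedback -> selective replay."""
    responsible = diagnose(failure_reason)
    feedback = f"Tests failed: {failure_reason}. Revise your {responsible} output."
    run_pipeline(state, start=responsible, feedback=feedback)
```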
Production outcomes
- AutoCodeRover moved from 38.4% (Jun 2024) to 51.6% (Jan 2025) on SWE-bench Verified during my contribution period. The Self-Fix Agent and interactive replay infrastructure are the mechanisms; the baseline agent architecture is group-authored.
- 1.8× patch precision over the next-best open-source agent.
- Published at ISSTA 2024 + arXiv (AutoCodeRover and SpecRover papers).
In the News
- 2024: Sonar acquires AutoCodeRover — core autonomous repair technology joins Sonar's platform.
- 2025: Sonar Foundation Agent launches — built on the AutoCodeRover core.
- Feb 2026: Sonar Foundation Agent reaches 79.2% on SWE-bench — #1 on the leaderboard.
- Mar 2026: SonarQube Remediation Agent enters open beta — the Foundation Agent lineage enters commercial preview.
2 repos, one system
Python agentic repair pipeline: Context Retrieval → Patch Generation → Reviewer Agent loop across 7 languages via tree-sitter. The core agent architecture is group-authored; the Self-Fix Agent, interactive replay, and UUID-targeted feedback are mine.
Brings autonomous code repair into the developer's IDE. A single main orchestrator coordinates six subsystems across conversational UI, IDE events, static analysis, and a novel AST-level patch merge — so the agent can read the developer's working state, stream reasoning in real time, and land fixes without context-switching out of the editor.
taste — An Operating System for Agents
An Agent OS kernel: three-core CPU split (Opus 4.7 planner / Sonnet 4.6 workers / Haiku 4.5 monitor) on a git memory substrate where branches are execution contexts, commits are checkpoints, `git worktree` gives every parallel worker filesystem-level isolation, and `git reset --hard` is rollback. Three demos shipped with committed transcripts, cost telemetry, and self-contained HTML dashboards. Thesis: build to delete — every subsystem is opt-in, so when next-gen models can self-evaluate, disable the Monitor and the kernel's git-based abstractions survive untouched.
Three architectural commitments
Three-core CPU separation
Planner / Worker / Monitor run on distinct Claude tiers — the reasoning sandwich. No agent grades its own exam; the Monitor is the only component that can gate a commit.
Git as memory, not sidecar
Branches = execution contexts. Commits = checkpoints. `git show` = demand paging. `git reset --hard` = rollback. Every kernel artifact is committed — nothing lives in a progress file (see the sketch below).
Build to delete
Every subsystem is opt-in, not load-bearing. When next-gen models self-evaluate reliably, disable the Monitor — the kernel's git-based abstractions survive untouched.
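A stripped-down sketch of the git substrate, using plain subprocess git rather than the kernel's actual module layout: each worker gets a branch checked out in its own worktree, every step is a checkpoint commit, and rollback is literally `git reset --hard`.

```python
import os
import subprocess

def git(*args: str, cwd: str) -> str:
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout.strip()

def spawn_worker(repo: str, task_id: str) -> str:
    """Branch = execution context; worktree = filesystem-level isolation for the worker."""
    path = os.path.abspath(f"{repo}-worktrees/{task_id}")
    git("worktree", "add", "-b", f"task/{task_id}", path, cwd=repo)
    return path

def checkpoint(worktree: str, message: str) -> str:
    """Commit = checkpoint; returns the SHA the kernel records for this step."""
    git("add", "-A", cwd=worktree)
    git("commit", "-m", message, cwd=worktree)
    return git("rev-parse", "HEAD", cwd=worktree)

def rollback(worktree: str, checkpoint_sha: str) -> None:
    """Monitor rejected the step: discard everything after the last good checkpoint."""
    git("reset", "--hard", checkpoint_sha, cwd=worktree)
```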
Production outcomes
- Real-Claude run committed with full telemetry (todo_api): $0.0964, 43s, 15/15 tests green, zero rollbacks — 7 LLM calls, 16.5K input / 3.1K output tokens on Sonnet 4.6.
- Parallel worktree execution shipped (parallel_demo): 3 concurrent workers on real `git worktree` branches cut wall-clock from ~32s serial to 21.5s; atomic merge-back only if every worker in the wave passes its Monitor. MergeConflict raised as a typed exception — “merge conflicts as coordination signals” made literal (sketched after this list).
- Hermetic rollback proven without API key (refactor_demo): step-2 regresses → Monitor catches via pytest → kernel runs `git reset --hard` → retry lands clean. Final branch has no trace of the failed attempt. CI asserts on the outcome.
- Event stream survives rollback — `.git/taste/events.jsonl` lives outside the tracked tree so rollback doesn't erase the audit trail. Self-contained HTML dashboard (`taste dashboard`) renders timeline, per-step outcomes, and git topology — htop for agents.
- 40 tests across 5 load-bearing suites, zero cyclic imports by design, pip-installable CLI (`taste run` / `taste log` / `taste dashboard`).
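The wave semantics from the parallel_demo bullet, as an illustrative sketch (hypothetical function names, real git plumbing): the wave merges back only if every worker passed its Monitor, and a conflicting merge surfaces as a typed exception rather than a silent failure.

```python
import subprocess

class MergeConflict(Exception):
    """Merge conflicts are coordination signals, not errors to swallow silently."""

def git(*args: str, cwd: str) -> str:
    return subprocess.run(["git", *args], cwd=cwd, check=True,
                          capture_output=True, text=True).stdout.strip()

def merge_wave(repo: str, branches: list[str], monitor_passed: dict[str, bool]) -> None:
    """Atomic merge-back: the wave lands only if every worker passed its Monitor."""
    failed = [b for b in branches if not monitor_passed[b]]
    if failed:
        raise RuntimeError(f"wave rejected, Monitor failed for: {failed}")
    start = git("rev-parse", "HEAD", cwd=repo)
    for branch in branches:
        try:
            git("merge", "--no-ff", "-m", f"merge {branch}", branch, cwd=repo)
        except subprocess.CalledProcessError as err:
            git("merge", "--abort", cwd=repo)        # clear the conflicted index
            git("reset", "--hard", start, cwd=repo)  # undo the partial wave atomically
            raise MergeConflict(f"{branch} conflicts with the wave") from err
```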
The full runtime implementing the Agent OS thesis — a kernel orchestration loop, git-based memory substrate, and end-user dashboard that render every decision, checkpoint, and rollback navigable. 7 core modules, no cyclic imports. v0 shipped with three demos. Extensions in progress on long-horizon real-model rollback, autonomous parallelism selection, and LLM-judge monitoring in production.
Five studies across medical multi-agent systems, security of deployed agents, ML-systems measurement, parameter-efficient fine-tuning, and applied evaluation. The through-line: falsifiable claims, matched baselines, and reviewer-verifiable artifacts — the findings are intended to survive a second reader with a skeptical eye.
Mean sensitivity 0.982 / FNR 0.018 across 15 published SRMAs (~150K citations); perfect 1.000 sensitivity with 20–40pp specificity improvements over Tran et al. 2024's GPT-3.5 PICOS baseline on 4 held-out benchmark SRMAs (Ann Intern Med). Four small agents — Classifier, PICOS Detailed Screener, Reviewer (LLM-as-a-judge), Improver — cooperate through a bounded review/improve loop, producing a full written audit trail for every inclusion decision. ~150 lines of orchestration plus six prompt files; the design thesis is that small agents suffice when composed carefully. End-to-end cost: ~$0.07 per 10 candidates. Sole first author; code, manuscript, and per-SRMA results tables released under CC BY-NC 4.0.
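An illustrative skeleton of the bounded review/improve loop (agent roles from the study; prompts, call signatures, and decision heuristics hypothetical): the Reviewer can bounce a decision to the Improver at most a fixed number of times, and every pass is appended to a written audit trail.

```python
from dataclasses import dataclass, field

MAX_REVIEW_ROUNDS = 2   # bounded loop: the Reviewer cannot ping-pong forever

@dataclass
class Decision:
    include: bool
    rationale: str
    audit_trail: list[str] = field(default_factory=list)

def screen_citation(citation: str, picos: str, llm) -> Decision:
    """Classifier -> PICOS Detailed Screener -> Reviewer (judge) -> optional Improver."""
    if not llm(f"Is this likely a study design of interest? {citation}").startswith("yes"):
        return Decision(False, "Classifier: not a study design of interest", ["classifier: exclude"])

    decision = Decision(True, llm(f"Screen against PICOS.\nPICOS: {picos}\nCitation: {citation}"))
    decision.audit_trail.append(f"screener: {decision.rationale}")

    for round_ in range(MAX_REVIEW_ROUNDS):
        verdict = llm(f"As a judge, is this screening decision sound?\n{decision.rationale}")
        decision.audit_trail.append(f"reviewer[{round_}]: {verdict}")
        if verdict.startswith("sound"):
            break
        decision.rationale = llm(f"Improve the decision given this critique:\n{verdict}")
        decision.audit_trail.append(f"improver[{round_}]: {decision.rationale}")

    decision.include = "include" in decision.rationale.lower()
    return decision
```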
A deterministic downstream calculator absorbs 83% of LLM-layer prompt-injection successes (ρ = 1 − ASR_end / ASR_screening = 0.83 on the 12-case held-out pilot; ρ = 1.00 on direct-override attacks) before they reach users — and the exact figure is predictable ex ante from the calculator's source code. Derives a closed-form 7.30pp single-document perturbation budget from the production calculator's source; freezes three-way attackability predictions before running the pilot; 6/6 predictions hold. Identifies attack-surface rotation as a failure mode distinct from Nasr et al.'s ASR recovery — the aggregate ASR stays invariant while the attack-family distribution changes. System under study: VYNN AI (my own production deployment, ~500 pilot users; all attacks run against an offline replica). 52 tests, frozen manifests with commit pinning, deterministic LLM cache, CI-green on bare clone.
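The absorption metric itself is a one-liner. A sketch assuming per-attack booleans for whether the injection succeeded at the screening (LLM) layer and whether it still altered the end output after the deterministic calculator; the numbers below are toy values, not the study's data:

```python
def absorption_ratio(screening_success: list[bool], end_success: list[bool]) -> float:
    """rho = 1 - ASR_end / ASR_screening: the fraction of LLM-layer injection
    successes the deterministic calculator absorbs before they reach the user."""
    asr_screening = sum(screening_success) / len(screening_success)
    asr_end = sum(end_success) / len(end_success)
    return 1.0 - asr_end / asr_screening

# Toy numbers only: 6/12 attacks succeed at the LLM layer,
# 1/12 still changes the final recommendation.
print(absorption_ratio([True] * 6 + [False] * 6, [True] + [False] * 11))  # ~0.83
```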
Sequoia predicts 1.68× speedup on T4; I measured 0.56×. A four-term decomposition reconciles the 3× gap to within 1.1% of measurement noise — the algorithm is sound, but three specific cost-model assumptions break on bandwidth-bound hardware. Attempting the natural fix (cross-iteration KV persistence, a clear A100 win) measurably worsens T4 performance (0.56× → 0.46×) because each cache-extension forward still pays a ~20ms weight-loading floor — the paper's sharpest finding and the fourth hidden assumption. A single controlled probe unifies every finding: per-call cost flat at 20.0 ± 0.2ms across an 8.5× range of cache lengths. PLD (prompt lookup decoding) is the only family that wins on T4 (1.28–1.39×) because its Cr ≈ 0, achieved via CPU n-gram matching, is the only structural bypass of the bandwidth floor.
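The controlled probe is conceptually simple: time one decode step at several cache lengths and check whether per-call cost moves. A hedged sketch with Hugging Face transformers (model name and lengths are placeholders, and the real probe controls far more carefully):

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"   # placeholder; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16).cuda().eval()

@torch.inference_mode()
def per_call_ms(cache_len: int, reps: int = 20) -> float:
    """Median wall-clock for one single-token forward against a cache of `cache_len` tokens."""
    prefix = torch.randint(0, tok.vocab_size, (1, cache_len), device="cuda")
    next_tok = torch.randint(0, tok.vocab_size, (1, 1), device="cuda")
    times = []
    for _ in range(reps):
        past = model(prefix, use_cache=True).past_key_values   # fresh cache each rep
        torch.cuda.synchronize()
        t0 = time.perf_counter()
        model(next_tok, past_key_values=past, use_cache=True)
        torch.cuda.synchronize()
        times.append((time.perf_counter() - t0) * 1e3)
    return sorted(times)[len(times) // 2]

for n in (128, 256, 512, 1024):   # roughly an 8x range of cache lengths
    print(n, round(per_call_ms(n), 1), "ms")
```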
~30% of LoRA's parameters, +2.1 F1 over standard LoRA on IMDB — average effective rank collapses from 8 to 2.42 after SVD-guided truncation + brief post-compression fine-tuning, with no accuracy lost on SST-2 and a measurable gain on IMDB. A controlled study of a simple question: after training a LoRA adapter, how much of its rank is actually task-useful? And if you truncate down to that effective rank and keep training, what happens? Compression pass is ~30 lines of code, exact to the Eckart–Young–Mirsky bound — the provably optimal low-rank approximation under Frobenius norm, not a heuristic.
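The compression pass is small enough to show in full shape, though the version below is a generic sketch rather than the study's exact code: take the trained update ΔW = BA, SVD it, keep the top-r components (optimal in Frobenius norm by Eckart–Young–Mirsky), and refactor them into a smaller adapter before the brief post-compression fine-tune.

```python
import torch

def truncate_lora(B: torch.Tensor, A: torch.Tensor, r_eff: int):
    """Compress a trained LoRA adapter (delta_W = B @ A, rank r) down to rank r_eff.

    The truncated SVD is the provably optimal rank-r_eff approximation of delta_W
    under the Frobenius norm (Eckart-Young-Mirsky), so no better rank-r_eff update exists.
    """
    delta_w = B @ A                              # (out_dim, in_dim), rank <= original r
    U, S, Vh = torch.linalg.svd(delta_w, full_matrices=False)
    U_r, S_r, Vh_r = U[:, :r_eff], S[:r_eff], Vh[:r_eff, :]
    # Split the singular values between the two factors so the new adapter is balanced.
    B_new = U_r * S_r.sqrt()                     # (out_dim, r_eff)
    A_new = S_r.sqrt().unsqueeze(1) * Vh_r       # (r_eff, in_dim)
    return B_new, A_new

# Example: a rank-8 adapter for a 768x768 projection, truncated to effective rank 2.
B = torch.randn(768, 8) * 0.02
A = torch.randn(8, 768) * 0.02
B2, A2 = truncate_lora(B, A, r_eff=2)
err = torch.linalg.norm(B @ A - B2 @ A2) / torch.linalg.norm(B @ A)
print(B2.shape, A2.shape, f"relative Frobenius error {err:.3f}")
```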
On the 2022 FIFA World Cup held-out set (n=128), QLoRA-fine-tuned Llama-3.1 8B hits 79.7% O/U 2.5 directional accuracy (84.4% on named-only, Wilson CI [0.736, 0.913]) — and under a coherence-required metric that credits a prediction only when text label, score line, and ground truth all agree, the headline 61.7% score_acc for QLoRA collapses to 42.2%, tying 5-shot ICL. The magnitude/direction decomposition is the contribution: LLM ties feature-matched XGBoost on 1X2 direction but beats it by 19pp pregame / 16pp halftime on O/U 2.5 magnitude — driven by pretrained scoreline priors tabular features can't replicate. Paired McNemar on pregame → halftime+events O/U 2.5: p = 0.006. The broader claim — benchmarks over structured multi-field generative outputs should report coherence-required accuracy as a cheap diagnostic, because parser rescue inflates headline numbers.
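The coherence-required metric is easy to state precisely. A sketch with hypothetical field names: a prediction earns credit only when the generated text label, the label implied by the generated scoreline, and the ground-truth label all agree, so a parser-rescued label that contradicts its own scoreline counts as wrong.

```python
def outcome_1x2(home_goals: int, away_goals: int) -> str:
    return "1" if home_goals > away_goals else "2" if away_goals > home_goals else "X"

def coherence_required_accuracy(predictions: list[dict], truths: list[dict]) -> float:
    """Credit a prediction only if its text label, its own scoreline, and the
    ground truth all agree; parser rescue cannot inflate this number."""
    correct = 0
    for pred, truth in zip(predictions, truths):
        label_from_score = outcome_1x2(pred["home_goals"], pred["away_goals"])
        truth_label = outcome_1x2(truth["home_goals"], truth["away_goals"])
        if pred["text_label"] == label_from_score == truth_label:
            correct += 1
    return correct / len(predictions)

# Toy example: the model says home win ("1") but predicts 1-1, so it earns no credit
# even though the parsed label happens to match the ground truth.
preds = [{"text_label": "1", "home_goals": 1, "away_goals": 1}]
truths = [{"home_goals": 2, "away_goals": 0}]
print(coherence_required_accuracy(preds, truths))  # 0.0
```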
More on GitHub.
Tooling, prototypes, course artifacts, and in-progress work live at one address — the curated story is above, the full archive is a click away.