Available for contract, fractional & remote work.
Need an AI system you can actually trust in production? That's the kind of work I'm after.
- Taking an AI/LLM prototype to production reliability: evals, observability, schema-validated output, guardrails
- Hardening agentic and MCP systems that work in a demo but break unattended
- Platform reliability and cost engineering for AI-heavy products
Not a fit:
- On-site or relocation roles
- Pure ML research or model training
- Maintenance-only work with no ownership
What kind of engineer are you, and at what level?
A staff-level backend and platform engineer specialising in production-AI reliability. A decade of building and scaling Rails platforms (1M+ users, €160k/yr infrastructure savings, teams grown from 15 to 40), now focused on making LLM pipelines, agents, and eval systems dependable in production.
Are you available for contract, fractional, or full-time work?
Available now for B2B contract or fractional engagements, and open to full-time for the right role. Typical fits are a fixed-scope reliability project or an ongoing fractional staff-engineer seat on an AI-heavy product.
Where are you based and which time zones do you cover?
Europe-based and fully remote, working async-first. I keep live overlap across European, US, and APAC business hours through the year, so a distributed team gets real-time hours regardless of where it sits.
What does taking an LLM prototype to production reliability involve?
Observability tells you what your agent did; evals tell you whether it was good, and most teams have the first without the second. The work is closing that gap: trace every model call, build a graded eval suite (schema and format assertions first, then an LLM-as-judge validated against human labels), validate output against a schema instead of trusting prompts, add retries and guardrails, and only then add an automatic improvement loop. The order matters: optimise against an unvalidated judge and you build a system confidently tuned toward being wrong. The goal is an agent or pipeline that runs unattended without quietly breaking.
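As a rough illustration of that ordering (a sketch, not the code from any real pipeline): cheap schema assertions run first, and a judge is trusted only once its verdicts agree with human labels. The field names in `REQUIRED_FIELDS`, the function names, and the 0.9 agreement threshold are all assumptions made up for this example.

```python
import json

# Hypothetical output schema for this example: field name -> expected type.
REQUIRED_FIELDS = {"headline": str, "summary": str, "source_url": str}


def schema_errors(raw: str) -> list:
    """Return schema violations for one model output (empty list means valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be an object"]
    errors = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in data:
            errors.append(f"missing field: {field}")
        elif not isinstance(data[field], expected):
            errors.append(f"wrong type for {field}")
    return errors


def judge_agreement(judge_verdicts: list, human_labels: list) -> float:
    """Fraction of examples where the judge's pass/fail matches the human label."""
    if not human_labels or len(judge_verdicts) != len(human_labels):
        raise ValueError("need equal-length, non-empty verdicts and labels")
    matches = sum(j == h for j, h in zip(judge_verdicts, human_labels))
    return matches / len(human_labels)


def judge_is_trusted(judge_verdicts: list, human_labels: list,
                     min_agreement: float = 0.9) -> bool:
    """Gate: only use the judge for optimisation once it clears the threshold."""
    return judge_agreement(judge_verdicts, human_labels) >= min_agreement
```

Raw agreement is the simplest possible measure; on imbalanced labels a chance-corrected statistic such as Cohen's kappa would be a more careful choice.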
What is the strongest proof of this work?
A daily LLM news pipeline that has run unattended every morning since November 2025: 500+ articles a day from 30+ sources distilled into the day's top stories, schema-validated, deployed with Terraform and Docker. A validated LLM-as-judge gate drops hallucinated headlines, and an offline eval suite plus a regression floor keep quality from drifting. When it has failed, each failure became a guardrail: a duplicate send after a host reboot became idempotent, recoverable delivery; a silent morning outage became a dead-man's switch and fail-closed alerting. Both are written up in public post-mortems. There are also contributions across the Ruby web stack (Rails, Sinatra, async) and essays on what actually breaks in production AI.
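The two guardrails named above can be sketched in a few lines of Python. This is a generic illustration of the techniques, not the pipeline's actual code: `send_once`, `should_alert`, the marker-file layout, and the run key format are all hypothetical.

```python
import os
from pathlib import Path
from typing import Callable, Optional


def send_once(state_dir: Path, run_key: str, send: Callable[[], None]) -> bool:
    """Call send() at most once per run_key, even if the host restarts.

    Returns True if this call performed the send, False if it was already done.
    """
    marker = state_dir / f"{run_key}.sent"
    try:
        # O_EXCL makes creation atomic: a second run with the same key fails here.
        fd = os.open(marker, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    os.close(fd)
    try:
        send()
    except Exception:
        # Release the claim so a later retry can recover the delivery.
        marker.unlink()
        raise
    return True


def should_alert(last_heartbeat: Optional[float], now: float, max_age_s: float) -> bool:
    """Dead-man's switch check that fails closed.

    A missing heartbeat is treated the same as a stale one: both raise an alert.
    """
    if last_heartbeat is None:
        return True
    return now - last_heartbeat > max_age_s
```

Claiming the key before sending trades a possible missed send (a hard crash mid-send leaves the marker in place) for a guarantee against duplicates; a real system would pair it with a recovery path for that case.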
Email me, or book a call. More context in the CV and background, or point your LLM at /llms.txt and ask it about my work.