Sean Floyd.
Availability · open now

Available for contract, fractional & remote work.

Need an AI system you can actually trust in production? That's the kind of work I'm after.

What I take on
Best fit
  • Taking an AI/LLM prototype to production reliability: evals, observability, schema-validated output, guardrails
  • Hardening agentic and MCP systems that work in a demo but break unattended
  • Platform reliability and cost engineering for AI-heavy products
Not a fit
  • On-site or relocation roles
  • Pure ML research or model training
  • Maintenance-only work with no ownership
Shape
Engagement
B2B contract or fractional (open to full-time for the right role)
Location
Remote, Europe-based, async-first
Timezone
Live overlap across European, US, and APAC hours through the year
Questions

What kind of engineer are you, and at what level?

A staff-level backend and platform engineer specialising in production-AI reliability. A decade of building and scaling Rails platforms (1M+ users, €160k/yr infrastructure savings, teams grown 15 to 40), now focused on making LLM pipelines, agents, and eval systems dependable in production.

Are you available for contract, fractional, or full-time work?

Available now for B2B contract or fractional engagements, and open to full-time for the right role. Typical fits are a fixed-scope reliability project or an ongoing fractional staff-engineer seat on an AI-heavy product.

Where are you based and which time zones do you cover?

Europe-based and fully remote, working async-first. I keep live overlap across European, US, and APAC business hours through the year, so a distributed team gets real-time hours regardless of where it sits.

What does taking an LLM prototype to production reliability involve?

Observability tells you what your agent did; evals tell you whether it was good, and most teams have the first without the second. The work is closing that gap: tracing every model call, a graded eval suite (schema and format assertions first, then a validated LLM-as-judge measured against human labels), schema-validated output instead of trusting prompts, retries and guardrails, and only then an automatic improvement loop. The order matters: optimise against an unvalidated judge and you build a system confidently tuned toward being wrong. The goal is an agent or pipeline that runs unattended without quietly breaking.
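To make the ordering concrete, here is a minimal sketch of one way to check an LLM judge against human labels before optimising against it. It is an illustration, not the code behind any system described on this page: the function names, the 0.6 kappa threshold, and the pass/fail label format are all assumptions.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(human: Sequence[Hashable], judge: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between human labels and judge labels."""
    if len(human) != len(judge) or not human:
        raise ValueError("label lists must be non-empty and the same length")

    n = len(human)
    # Fraction of items where the judge matches the human label.
    observed = sum(h == j for h, j in zip(human, judge)) / n

    # Agreement expected by chance, from each rater's label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(h_counts[c] * j_counts[c] for c in labels) / (n * n)

    if expected == 1.0:
        # Both raters used a single identical label everywhere.
        return 1.0
    return (observed - expected) / (1.0 - expected)


def judge_is_validated(
    human: Sequence[Hashable],
    judge: Sequence[Hashable],
    min_kappa: float = 0.6,  # hypothetical threshold; pick one per project
) -> bool:
    """Gate: only trust the judge if it tracks human labels well enough."""
    return cohen_kappa(human, judge) >= min_kappa


if __name__ == "__main__":
    human_labels = ["pass", "pass", "fail", "fail"]
    judge_labels = ["pass", "fail", "pass", "fail"]
    print(cohen_kappa(human_labels, judge_labels))
    print(judge_is_validated(human_labels, judge_labels))
```

Under these assumptions, a judge that matches humans only at chance level scores a kappa near 0 and fails the gate, even when raw agreement looks respectable. That is the check that has to pass before an improvement loop is allowed to optimise against the judge.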

What is the strongest proof of this work?

A daily LLM news pipeline that has run unattended every morning since November 2025: 500+ articles a day from 30+ sources distilled into the day's top stories, schema-validated, deployed with Terraform and Docker. A validated LLM-as-judge gate drops hallucinated headlines, and an offline eval suite plus a regression floor keep quality from drifting. When it has failed, each failure became a guardrail: a duplicate send after a host reboot became idempotent, recoverable delivery; a silent morning outage became a dead-man's switch and fail-closed alerting. Both are written up in public post-mortems. I have also contributed across the Ruby web stack (Rails, Sinatra, async) and written essays on what actually breaks in production AI.
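The two guardrails named above can be sketched roughly as follows. This is a generic illustration of the techniques, not the pipeline's actual code: `DeliveryLog`, `send_once`, `is_overdue`, the JSON file format, and the per-day key are all hypothetical names and choices.

```python
import json
import time
from pathlib import Path
from typing import Callable, Optional


class DeliveryLog:
    """Records which delivery keys were already sent, on disk, so a
    restart or host reboot does not trigger a second send."""

    def __init__(self, path: str) -> None:
        self.path = Path(path)

    def _load(self) -> dict:
        if self.path.exists():
            return json.loads(self.path.read_text())
        return {}

    def send_once(self, key: str, send: Callable[[], None]) -> bool:
        """Call send() unless `key` is already recorded. Returns True if sent."""
        sent = self._load()
        if key in sent:
            return False  # already delivered; re-running is a no-op
        send()
        sent[key] = time.time()
        # Write to a temp file, then rename, so a crash mid-write
        # cannot leave a half-written log behind.
        tmp = self.path.with_suffix(".tmp")
        tmp.write_text(json.dumps(sent))
        tmp.replace(self.path)
        # Caveat: a crash between send() and the write can still cause
        # one repeat; closing that gap needs an idempotent receiver.
        return True


def is_overdue(last_success: Optional[float], now: float, max_silence_s: float) -> bool:
    """Dead-man's switch check. Fails closed: no recorded success at all
    counts as overdue, so silence raises an alert instead of hiding one."""
    return last_success is None or (now - last_success) > max_silence_s
```

In a setup like this, the daily job would call `send_once` with a per-day key, and a separate monitor would evaluate `is_overdue` on its own schedule, so that a job that never starts is still noticed.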

Get in touch

Email me, or book a call. More context in the CV and background, or point your LLM at /llms.txt and ask it about my work.