Sean Floyd - Work with me

What I take on

Best fit

Taking an AI/LLM prototype to production reliability: evals, observability, schema-validated output, guardrails
Hardening agentic and MCP systems that work in a demo but break unattended
Platform reliability and cost engineering for AI-heavy products

Not a fit

On-site or relocation roles
Pure ML research or model training
Maintenance-only work with no ownership

Shape

Engagement

B2B contract or fractional (open to full-time for the right role)

Location

Remote, Europe-based, async-first

Timezone

Live overlap across European, US, and APAC hours through the year

Questions

What kind of engineer are you, and at what level?

A staff-level backend and platform engineer specialising in production-AI reliability. A decade of building and scaling Rails platforms (1M+ users, €160k/yr infrastructure savings, teams grown from 15 to 40), now focused on making LLM pipelines, agents, and eval systems dependable in production.

Are you available for contract, fractional, or full-time work?

Available now for B2B contract or fractional engagements, and open to full-time for the right role. Typical fits are a fixed-scope reliability project or an ongoing fractional staff-engineer seat on an AI-heavy product.

Where are you based and which time zones do you cover?

Europe-based and fully remote, working async-first. I keep live overlap across European, US, and APAC business hours through the year, so a distributed team gets real-time hours regardless of where it sits.

What does taking an LLM prototype to production reliability involve?

Observability tells you what your agent did; evals tell you whether it was good, and most teams have the first without the second. The work is closing that gap, in order: tracing every model call, building a graded eval suite (schema and format assertions first, then a validated LLM-as-judge measured against human labels), validating output against a schema instead of trusting prompts, adding retries and guardrails, and only then introducing an automatic improvement loop. The order matters: optimise against an unvalidated judge and you build a system confidently tuned toward being wrong. The goal is an agent or pipeline that runs unattended without quietly breaking.
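A minimal sketch of what "schema-validated output instead of trusting prompts" can look like, with bounded retries. This is an illustration under stated assumptions, not the code of any specific pipeline: the field names in `REQUIRED_FIELDS`, the `model` callable, and the helper names are all hypothetical.

```python
import json
from typing import Callable

# Hypothetical schema: field name -> expected Python type.
REQUIRED_FIELDS = {"headline": str, "summary": str, "source_url": str}


class SchemaError(ValueError):
    """Raised when model output does not match the expected schema."""


def validate_story(raw: str) -> dict:
    """Parse raw model output and check it against REQUIRED_FIELDS."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise SchemaError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise SchemaError("top-level value must be a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise SchemaError(f"field {field!r} missing or not {expected.__name__}")
    return data


def call_with_validation(
    model: Callable[[str], str], prompt: str, max_attempts: int = 3
) -> dict:
    """Call the model until its output validates, up to max_attempts times.

    `model` is an assumed stand-in for whatever client makes the LLM call.
    """
    last_error = None
    for _ in range(max_attempts):
        raw = model(prompt)
        try:
            return validate_story(raw)
        except SchemaError as exc:
            last_error = exc  # keep the reason so the final failure is debuggable
    raise SchemaError(f"no valid output after {max_attempts} attempts: {last_error}")
```

The design choice here is to fail loudly: when no attempt validates, the caller gets an exception rather than a best-effort parse, so a malformed response cannot slip downstream unnoticed.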

What is the strongest proof of this work?

A daily LLM news pipeline that has run unattended every morning since November 2025: 500+ articles a day from 30+ sources distilled into the day's top stories, schema-validated, deployed with Terraform and Docker. A validated LLM-as-judge gate drops hallucinated headlines, and an offline eval suite plus a regression floor keep quality from drifting. When it has failed, each failure became a guardrail: a duplicate send after a host reboot became idempotent, recoverable delivery; a silent morning outage became a dead-man's switch and fail-closed alerting. Both are written up in public post-mortems. Beyond the pipeline, there are contributions across the Ruby web stack (Rails, Sinatra, async) and essays on what actually breaks in production AI.
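To make "idempotent, recoverable delivery" concrete, here is a minimal sketch of one common approach: record a per-delivery key in a durable ledger and skip any key already recorded, so a restart does not resend. It is an assumption-laden illustration, not the pipeline's actual code; `deliver_once`, the JSON ledger file, and the date-style key are hypothetical.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Callable


def deliver_once(key: str, send: Callable[[], None], ledger_path: Path) -> bool:
    """Run `send` only if `key` is not already in the on-disk ledger.

    Returns True if the send happened, False if it was skipped as a duplicate.
    """
    sent = set(json.loads(ledger_path.read_text())) if ledger_path.exists() else set()
    if key in sent:
        return False  # already delivered, e.g. before a host reboot

    send()
    sent.add(key)

    # Write to a temp file and rename, so a crash mid-write cannot
    # leave a truncated ledger behind.
    fd, tmp_name = tempfile.mkstemp(dir=ledger_path.parent)
    with os.fdopen(fd, "w") as tmp:
        json.dump(sorted(sent), tmp)
    os.replace(tmp_name, ledger_path)
    return True
```

Calling `deliver_once("2025-11-03", send_digest, Path("sent.json"))` twice runs the hypothetical `send_digest` once. The sketch still leaves a narrow window: a crash after `send()` but before the ledger write would resend on restart, which is why real systems often pair this with an idempotency key honoured by the receiving side.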

Get in touch

Email me, or book a call. More context in the CV and background, or point your LLM at /llms.txt and ask it about my work.

Available for contract, fractional & remote work.
