What does proactive reliability actually look like in practice — and how do you explain it to leadership? Kolton Andrus joined Techstrong TV to discuss exactly that, including why the cost of an outage goes far beyond the revenue floor. There's the engineering cost: the days spent triaging, fixing, communicating, doing post-mortems. And there's the trust cost — especially for organizations where uptime is core to the product. Watch the full interview: https://hubs.la/Q043CL-10
Gremlin
Software Development
San Jose, California · 12,291 followers
The Reliability Management Platform for high-velocity engineering teams
About us
Gremlin’s Reliability Management Platform enables high-velocity engineering teams to standardize and automate reliability across their organizations without slowing down software delivery. Gremlin's Reliability Score sets the standard for reliability so there's no guesswork, and an automated suite of Reliability Management tools makes it easy to integrate reliability throughout the software lifecycle so there's no slowdown.
- Website: http://www.gremlin.com
- Industry: Software Development
- Company size: 51-200 employees
- Headquarters: San Jose, California
- Type: Privately Held
- Founded: 2016
- Specialties: Distributed Systems, Resilience, Failures as a Service, DevOps, and Chaos Engineering
Locations
- Primary: 55 S Market St, Ste 1205, San Jose, California 95113, US
- 555 Montgomery St, Ste 811, San Francisco, California 94111, US
Updates
-
More microservices mean more failure modes — and more places for issues to hide. 🫣 With over 120 microservices powering payments worldwide, Visa Cross-Border Solutions needed to standardize reliability testing without slowing teams down. 📊 They used Gremlin to create custom test suites for each service, then automated them across environments. The result was consistent, scalable resilience testing — and deployments that engineers can trust. 🤝 Learn how Visa Cross-Border Solutions standardized reliability across teams: https://hubs.la/Q03WYpmH0
-
See why Gremlin is the top choice for major retailers at https://hubs.la/Q046Yn6y0
-
Every major outage reminds us just how interconnected modern architectures are. This story isn’t new. It’s a continued risk as architectures grow in an ever-increasing web of dependencies and services. Fortunately, there is something you can do about it. Check out these testing best practices teams should follow to minimize the impact of large-scale outages so they don’t catch you by surprise. ⬇️ https://hubs.ly/Q046XSlm0
-
Resilience is increasingly a governance question, not just an engineering one. Investors, auditors, and regulators are asking what resilience looks like in practice, not just in policy. For scaling companies preparing for IPO, that means S-1 filings that can speak credibly to digital resilience. For public companies, it means 10-K disclosures that reflect real operational risk management. Gremlin's Disaster Recovery Testing produces detailed reports on service performance that are designed to support exactly this kind of accountability, making it easier to demonstrate proactive reliability efforts to the audiences who need to see them. EM360Tech covers the full picture, from how DRT works to how organizations are using it to close the gap between planning and proof. Read the full analysis: https://hubs.la/Q043CKHv0
-
Great to see Gremlin on the list! Thanks for the shoutout, CloudZero! https://lnkd.in/etMr7Tp6
-
2025 saw 15,000+ outages across the internet. Is your system prepared for the next major incident? Get started at https://hubs.la/Q046YfSf0
-
Most organizations know they should be running large-scale disaster recovery tests. They also know it's not practical to run them the way they've traditionally been done. Here's what changes with Gremlin's Disaster Recovery Testing:
➡ Select the services you want to test across your entire organization, not just one team's slice of the stack
➡ Choose from pre-built Scenarios for zone redundancy, region evacuation, DNS redundancy, and more, or bring your own
➡ Disaster Recovery Health Checks automatically halt and revert the test if key metrics go outside your SLA
➡ After the test, get a full report: which services passed, which failed, which teams own them, and what to fix first
That's what repeatable DR readiness looks like. See how it works: https://hubs.la/Q043CrZJ0
-
Gremlin reposted this
I've been knee-deep in Reliability and Chaos Engineering for the past 17 years. How has it evolved during that time? How has it evolved as the software industry has been turned on its head over the past 18 months? I wrote up this piece to share my thoughts, and I'd love to hear your opinions (tell me where I'm wrong!). https://lnkd.in/gXTb4aPQ
-
With teams shipping AI-assisted code faster than ever, 2026 is shaping up to be the year reliability becomes harder to ignore. More code velocity means more surface area for failure. More AI-driven products built on cloud infrastructure means more dependency on uptime that teams haven't fully pressure-tested. And when a major cloud region goes down (and it will), the question isn't whether your systems will be affected. It's whether you've already proven they can recover. TechBullion covered Gremlin's new Disaster Recovery Testing with this in mind: in an AI-shaped world, availability isn't just a technical goal. It's what trust is built on. Read more: https://hubs.la/Q043CG070