"Should we keep this data or stay under budget?" is not a question observability teams should have to ask. But here we are. We wrote about how observability costs spiraled out of control - and more importantly, how columnar storage + OpenTelemetry might actually fix it. Spoiler: ClickHouse wasn't built specifically for observability, but turns out it's kind of perfect for it. Fast aggregations, high compression, handles high-cardinality data like a champ, and doesn't force you to scatter your logs, metrics, and traces across three different systems. https://lnkd.in/g87k5sYm
Columnar Storage Solves Observability Costs
-
Observability pricing has become a tax on your success. The industry has spent years being sold the myth of the three pillars of observability. In practice, keeping logs, metrics, and traces in separate proprietary silos is not a technical requirement. It is a business model designed to keep your data fragmented and your bill growing. Most major platforms are essentially expensive wrappers around search-index technology that was never meant for this scale. You are paying a premium for an architecture that forces you to choose which data you can actually afford to keep. ClickHouse just published a technical breakdown of why this model is broken. The shift toward unified columnar storage and OpenTelemetry is making the old vendor lock-in model look like a massive strategic error. If your observability bill is rivaling your actual compute costs, it is time to admit the current strategy is a failure. Read the full breakdown here: https://lnkd.in/gWb7Txzd #Observability #OpenTelemetry #DevOps #FinOps #ClickHouse
Breaking free from rising observability costs with open, cost-efficient architectures clickhouse.com
-
The observability industry has been forcing a false choice: pay unsustainable costs for unified platforms, or accept fragmented tools that make correlation nearly impossible. This post digs into how we got stuck with this choice—from Splunk's ingest pricing to Elasticsearch's scaling limits to the "three pillars" model that codified data silos. Then it lays out what a better path looks like: columnar storage that handles high-cardinality data efficiently, OpenTelemetry for portability, and an open UI that doesn't require SQL mastery for every query. Companies like Tesla, Anthropic, and OpenAI are already running petabyte-scale observability on ClickHouse. ClickStack makes this approach accessible without requiring you to build your own platform. https://lnkd.in/g2n5_jxw
-
Understanding the difference between persistent data and transient working structures makes reasoning about data structures and designing algorithms much easier. Keeping this mental model in mind will serve you well across systems, helping clarify ownership, lifecycles, and boundaries. Naturally, it also informs things like API design, caching, and message-passing pipelines.
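The distinction can be made concrete with a toy sketch (Python here; the event log and the helper name are invented purely for illustration). The log is the persistent data the system owns durably; the index is a transient working structure built for one computation and safely discarded, because it can always be rebuilt from the log.

```python
# Persistent data: the durable record of what happened. It is owned by
# storage, survives across requests and restarts, and is the source of truth.
event_log = [
    {"user": "alice", "action": "view", "item": 1},
    {"user": "bob", "action": "view", "item": 1},
    {"user": "alice", "action": "like", "item": 2},
]

def views_per_item(events):
    """Transient working structure: an index built for one computation
    and thrown away afterwards. Losing it costs nothing; it can always
    be rebuilt from the persistent log."""
    counts = {}
    for e in events:
        if e["action"] == "view":
            counts[e["item"]] = counts.get(e["item"], 0) + 1
    return counts
```

Asking "which of these two roles does this structure play?" is exactly the ownership/lifecycle question the post describes.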
-
When our content platform crossed the 500k concurrent user mark, everything looked fine from the outside. Pages loaded, APIs responded, and the logs seemed quiet. But under the hood, we started seeing strange latency spikes on random requests. Database queries that should have taken 10 milliseconds were taking 2 seconds. Then the deadlocks started.

We had built our search and ranking system with a simple in-memory queue to coordinate updates to ranked lists. Each update would lock the list, apply the change, recalculate positions, and release the lock. It worked perfectly during development and early growth. But as more users interacted with content simultaneously, the queue became a bottleneck. Eventually, threads began to block and wait, exhausting request handlers. The system slowed down and requests piled up.

The moment of clarity came during a deep investigation into lock contention and stack traces. We realized our real problem wasn't CPU or database I/O. It was how we structured ranking updates. Our naive assumption had been that ranking recalculations were fast enough to be handled serially in memory. But the O(n log n) sorting of large lists, combined with synchronous locks, was choking throughput.

We rebuilt the ranking system around a batched, event-driven model. Instead of locking and updating the rankings on each user action, we emitted events to a distributed queue. Dedicated background workers consumed these events, applied changes in batches, and used a skip-list-based data structure to maintain rankings efficiently. Updates became asynchronous and non-blocking. The real-time experience remained intact because users still saw updates within milliseconds. We also adopted optimistic locking for consistency without blocking incoming writes.

The key lesson here is that scalability issues often hide in seemingly small assumptions. Our logic was sound functionally, but not behaviorally at scale.
The right data structure and execution model matter more than ever when concurrency crosses into high contention territory. Designing for scale means asking not just if something works but how it responds under pressure and what guarantees it gives when timing and ordering go nonlinear.
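As a rough sketch of the batched, event-driven model described above (Python, with a plain sorted list standing in for the skip list; `BatchedRanker` and its method names are illustrative, not the author's actual code): the request path only enqueues an event, and a background worker drains the queue and pays one sort per batch instead of one per action.

```python
import queue
from collections import defaultdict

class BatchedRanker:
    """Consumes score-update events in batches instead of locking the
    ranked list on every user action. A sorted list stands in for the
    skip list mentioned in the post."""

    def __init__(self):
        self.events = queue.Queue()      # stand-in for the distributed queue
        self.scores = defaultdict(int)
        self.ranking = []                # rebuilt per batch, read lock-free

    def record(self, item_id, delta):
        # Request path: non-blocking, just enqueue the event and return.
        self.events.put((item_id, delta))

    def apply_batch(self):
        # Background worker: drain everything queued and apply it in one pass.
        applied = 0
        while True:
            try:
                item_id, delta = self.events.get_nowait()
            except queue.Empty:
                break
            self.scores[item_id] += delta
            applied += 1
        if applied:
            # One O(n log n) sort per batch instead of per event.
            self.ranking = sorted(self.scores, key=self.scores.get, reverse=True)
        return applied
```

In a real deployment the in-process `queue.Queue` would be a durable broker and `apply_batch` would run on dedicated workers, but the shape of the fix is the same: move the expensive recalculation off the request path and amortize it over a batch.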
-
Great article, and I enjoyed the posts before it. I've been thinking a lot lately about how we can redo ‘spatial data infrastructure’ with the principles James Fee lays out:
- Assumptions are explicit
- Workflows are visible
- Metadata is mandatory
- Humans are no longer the glue
I have been looking for a good name, since ‘cloud native SDI’ is a bit too geeky and narrow. Maybe ‘Honest SDI’ is a good name.
For a long time, we told ourselves a comforting lie: “It’s just a file.” Filenames carried meaning. Folders implied structure. And if something didn’t make sense, you could always ask the person who created it. Scale broke that illusion. COG + STAC didn’t win because they’re clever formats. They won because they made assumptions explicit — about access patterns, metadata, and discovery. I wrote something about this: 🔗 https://lnkd.in/gzGVhJmS If your system only works because someone “knows how it works,” that’s not architecture. That’s tribal knowledge — and it doesn’t scale.
-
Grafana has announced the removal of drilldown investigations, a change that will impact how users analyze data. I found it interesting that this decision reflects a growing emphasis on streamlining features for enhanced user experience. It raises questions about how organizations will adapt their monitoring strategies moving forward. What are your thoughts on this change and its potential implications for data analysis?
-
Best of both worlds? Ollama 🤝 Claude Code

You can now point Claude Code at local, open-source models.

Important clarification: you're not running Claude models locally. You're running open-source models (qwen3-coder, llama3.1, GLM-4.7) through Claude Code's interface. Same workflow, different brain.

Why this matters: for teams in air-gapped environments, regulated industries, or under strict data residency requirements, you can now go fully dark while using Claude Code 🤓

Something to keep in mind, though: make sure you disable telemetry collection for a true air-gapped deployment:

DISABLE_TELEMETRY=1
DISABLE_ERROR_REPORTING=1
CLAUDE_CODE_DISABLE_NONESSENTIAL_TRAFFIC=1

With Ollama + these flags, zero data leaves your network.

The tradeoffs are real:
→ Context window matters: local models offer less than SOTA LLMs.
→ Reasoning gaps show. You get the workflow, not Claude's intelligence. Complex refactoring will feel the difference.
→ Tool calling varies. Not all open models handle multi-step tool use reliably.

The opportunity: use local models for routine/draft work. Escalate to the Claude API for complex tasks. Keep sensitive data local. Maintain quality where it matters.

The future isn't "cloud vs. local." It's "right tool for each constraint."
-
Have you ever wondered how time is actually handled inside large systems? Whose perspective does it represent: the user who scheduled it, the system that stores it, or the person viewing it from a different timezone? While working on a seemingly simple scheduling feature, I ran into the kind of problem that quietly separates working code from correct systems. Timezones, DST shifts, and "what the user really meant" turned out to be less about date conversion and more about data modeling, intent preservation, and architectural discipline. What started as a feature quickly became a deep dive into how distributed systems reason about time at scale. I ended up writing a technical piece to capture those lessons, not as a tutorial, but as a reflection on why time is one of the hardest problems in software engineering, and how getting the model right early can save months of downstream pain. If you've ever debugged a "simple" timezone bug that turned into a multi-day architecture discussion, or if you're designing systems that coordinate time across continents, this might resonate. 🔗 Read the full article: https://lnkd.in/dBG78xNT #SoftwareEngineering #DistributedSystems #SystemDesign #TechnicalArchitecture #EngineeringInsights
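One way to make "intent preservation" concrete is the common pattern of storing the user's local wall time plus an IANA zone, and resolving to UTC only when the event fires, so DST shifts are honored. A minimal Python sketch under that assumption (the helper names here are hypothetical, not from the linked article):

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def schedule(wall_time_iso, zone_name):
    """Preserve intent: store what the user said (local wall time)
    and where they said it (IANA zone), not a precomputed instant."""
    return {"wall_time": wall_time_iso, "zone": zone_name}

def resolve_to_utc(event):
    """Resolve at fire time, so the zone's current DST rules apply."""
    local = datetime.fromisoformat(event["wall_time"]).replace(
        tzinfo=ZoneInfo(event["zone"]))
    return local.astimezone(timezone.utc)

# "9am in New York" maps to a different UTC instant in summer vs winter:
summer = resolve_to_utc(schedule("2025-07-01T09:00:00", "America/New_York"))
winter = resolve_to_utc(schedule("2025-12-01T09:00:00", "America/New_York"))
# summer is 13:00 UTC (EDT, UTC-4); winter is 14:00 UTC (EST, UTC-5)
```

Storing only a UTC timestamp at creation time would silently pick one of those two instants forever, which is exactly the "working code vs correct systems" gap the post describes.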
-
Observability is costly. But how costly exactly? We analyzed the latest reports to uncover the true price of observability. Here is what we found: https://lnkd.in/eSqQzgP6
-
𝐃𝐚𝐲 81 𝐨𝐟 100 𝐃𝐚𝐲𝐬 𝐨𝐟 𝐒𝐲𝐬𝐭𝐞𝐦 𝐃𝐞𝐬𝐢𝐠𝐧 🗄️ 𝐃𝐞𝐬𝐢𝐠𝐧𝐢𝐧𝐠 𝐚 𝐔𝐑𝐋 𝐒𝐡𝐨𝐫𝐭𝐞𝐧𝐞𝐫 – 𝐋𝐞𝐭’𝐬 𝐓𝐚𝐥𝐤 𝐒𝐭𝐨𝐫𝐚𝐠𝐞

One of the requirements usually sounds very innocent 👼
👉 “The short URL should live forever.”
🚨 Red flag. Nothing lives forever. Not URLs, not databases, not even your favorite startup 🥲
So as system designers, we don’t take this literally. Instead, we translate “lifetime” into something practical. Let’s assume: ⏳ 100 years (which is already very generous)

Step 1️⃣: How Many URLs Will We Store?
From our earlier traffic assumptions: 📌 100 million new URLs per month
Now let’s do the math 🧮
100 years × 12 months × 100 million URLs ➡️ 120 billion URLs
That’s… a lot of links 😅

Step 2️⃣: How Much Space Does One URL Take?
Each record usually stores:
- Short URL
- Original URL
- Metadata (timestamps, flags, etc.)
Let’s assume (very reasonably): 📦 500 bytes per URL

Step 3️⃣: Total Storage Needed
Now the final calculation ⬇️
120 billion URLs × 500 bytes ➡️ ~60 TB of data

And this is where things get interesting 🤔 60 TB is:
❌ Not something you want on a single machine
❌ Not cheap if designed poorly
❌ Not easy to manage without planning

This single number already tells us a lot 🧠
➡️ We cannot rely on vertical scaling
➡️ We’ll eventually need sharding / partitioning
➡️ Storage design is no longer optional — it’s core to the system

And remember, this is just raw data. Indexes, replication, and backups will push this number even higher 📈

In the next post, we’ll talk about:
➡️ How to store this data efficiently
➡️ Which databases make sense
➡️ And how read-heavy traffic changes our storage choices

System design is just math + assumptions + good judgment ✨ Stay tuned 🚀

#SystemDesign #URLShortener #StorageEstimation #CapacityPlanning #BackendEngineering #DistributedSystems #Scalability #DatabaseDesign #DataEngineering #SoftwareArchitecture #LearningInPublic
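The back-of-envelope math in the steps above is easy to check in a few lines of Python (the function name is just for illustration; the inputs are the post's own assumptions):

```python
def storage_estimate(urls_per_month, years, bytes_per_url):
    """Capacity estimate from the post's assumptions:
    total URLs over the lifetime, then raw bytes before
    indexes, replication, and backups."""
    total_urls = urls_per_month * 12 * years
    total_bytes = total_urls * bytes_per_url
    return total_urls, total_bytes

# 100M new URLs/month, 100-year lifetime, 500 bytes per record
urls, raw = storage_estimate(100_000_000, 100, 500)
# urls is 120 billion; raw is 6e13 bytes, i.e. ~60 TB (decimal TB)
```

Changing any single assumption (say, 1 KB per record, or a 10-year retention window) scales the answer linearly, which is why these estimates are worth parameterizing rather than computing once by hand.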