“I love to configure Datadog to drop metrics and logs so we can stay within budget.” – no engineer ever
No engineer is ever excited about deciding which metrics or logs are important and guessing which ones they might or might not need in the future. Engineers want all the data all the time, because when there's trouble, they dread not having the one log message or the one metric that would have told them what's happening. I started Grepr to tackle this problem and eliminate the need for engineers to set up rules to ensure they have the right data in the right place at the right time. With AI, we can now do this with less effort, and more effectively than manual rule-writing, without losing the critical information that might be needed for troubleshooting later on.
Back in 2016, when I was at AppDynamics, microservices were new and our largest customers' metrics fit into a single MySQL database (!). Today, the rise of microservices and the exponential growth of software deployments have caused a massive explosion in complexity. Companies dedicate entire teams to individual components, each with a full software and hardware stack, deployed on hundreds or thousands of containers. To manage this complexity and understand what's happening, teams have turned to observability tools deployed ubiquitously across their systems.
With this rise in complexity, observability costs have skyrocketed, often becoming the second-largest expense after cloud infrastructure. Beyond my own experience, I've had dozens of conversations with engineering leaders, and they consistently report that observability now accounts for 10–15% of their infrastructure spend.
Despite this massive spending, many organizations still struggle: either they lack the data needed to resolve incidents or they’re overwhelmed with the sheer volume of data and the complexity of the applications, making it hard to pinpoint root causes. The issue is that current observability tools require engineers to piece together fragmented metrics, logs, and traces—like solving a puzzle with no picture to guide them. Engineers must also correlate inconsistent tags and labels to connect observations to their mental model of the system. This complexity often leads to confusion during troubleshooting, as teams fail to link observations or miss critical issues.
And we’re at the cusp of the next wave of complexity growth: AI. As AI inference workloads demand finer-grained data, tracking not only the performance of each individual inference but also its impact on the behavior of users and downstream systems, we’ll see another exponential jump in complexity.
What was already a less-than-ideal situation is becoming untenable. The architecture and user experience of today’s observability tools will not be able to deliver the reliability improvements that companies are paying all that money for.
At Grepr, we want to elevate observability beyond individual metrics and logs. Instead, we focus on the behaviors of objects like hosts, containers, and processes, making these behaviors and relationships first-class concepts in troubleshooting. Our AI will leverage this rich representation to guide engineers quickly to root causes, thus reducing complexity and downtime. This ensures significant productivity gains, giving engineers clarity without requiring deep knowledge of every system component.
Most observability tools today use a static representation of applications and infrastructure: Kubernetes pods run on Kubernetes nodes, Kubernetes nodes run on hosts, hosts belong to VPCs, and so on. In today’s massively complex, dynamic world, users want to monitor all sorts of objects, like users, sessions, AI models, inferences, Spark jobs, Kafka consumers, and user conversions, in addition to traditional infrastructure elements like hosts and containers, and these static representations are insufficient. Unlike other observability tools that layer AI on top of a static model, we’re building Grepr’s foundation on the right representation first: a flexible, customizable one that lets users reason about the health of these objects and the relationships between them.
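To make this more concrete, here is a minimal, purely illustrative sketch of what such a flexible representation could look like: a small graph of user-defined object types and the relationships between them, which a troubleshooting engine could walk from a symptom toward a cause. Every name, field, and method in this snippet is an assumption made for illustration, not Grepr's actual data model or API.

```python
from dataclasses import dataclass, field

# Illustrative only: a toy model of a dynamic, user-definable object graph.
# The idea is to treat arbitrary objects (hosts, containers, Kafka consumers,
# AI models, user sessions, ...) and their relationships as first-class.

@dataclass
class Entity:
    kind: str                      # user-defined type, e.g. "host", "kafka_consumer", "ai_model"
    id: str                        # stable identifier, e.g. "host:ip-10-0-1-7"
    tags: dict = field(default_factory=dict)
    healthy: bool = True           # rolled-up health, derived from telemetry

@dataclass
class Relationship:
    source: str                    # entity id
    target: str                    # entity id
    kind: str                      # e.g. "runs_on", "consumes_from"

class Topology:
    """A tiny in-memory graph of entities and relationships."""
    def __init__(self):
        self.entities: dict[str, Entity] = {}
        self.relationships: list[Relationship] = []

    def upsert(self, entity: Entity) -> None:
        self.entities[entity.id] = entity

    def relate(self, source: str, target: str, kind: str) -> None:
        self.relationships.append(Relationship(source, target, kind))

    def neighbors(self, entity_id: str) -> list[Entity]:
        """Entities directly related to the given one, useful for walking
        from a symptom (an unhealthy consumer) toward a cause (its container or host)."""
        ids = {r.target for r in self.relationships if r.source == entity_id}
        ids |= {r.source for r in self.relationships if r.target == entity_id}
        return [self.entities[i] for i in ids if i in self.entities]

# Example: a Kafka consumer running in a container on a host.
topo = Topology()
topo.upsert(Entity("host", "host:ip-10-0-1-7", {"vpc": "prod"}))
topo.upsert(Entity("container", "container:checkout-7f9c", {"service": "checkout"}))
topo.upsert(Entity("kafka_consumer", "consumer:orders-group", healthy=False))
topo.relate("container:checkout-7f9c", "host:ip-10-0-1-7", "runs_on")
topo.relate("consumer:orders-group", "container:checkout-7f9c", "runs_in")

for e in topo.neighbors("consumer:orders-group"):
    print(e.kind, e.id, "healthy" if e.healthy else "unhealthy")
```

The point of a representation like this is that new object types can be added by users without changing the engine that reasons over them.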
That said, we know that switching tools is hard, so we designed Grepr to integrate seamlessly into existing environments—and to pay for itself. Our approach focuses on cutting observability costs while maintaining the data that engineers need. Here's how we do it:
- Real-Time Noise Reduction: Using machine learning to distinguish noise from signal, we cut noisy data by 95%, forwarding only relevant information to observability tools.
- Health-Based Aggregation: Grepr can aggregate the data for healthy objects (like hosts, containers, users, etc.) and send unhealthy objects’ data unaggregated, so users can focus on what’s important at the lowest cost (a simplified sketch of this idea follows the list).
- Low-Cost Data Storage: A highly optimized data lakehouse stores all raw data, making it easy to troubleshoot, analyze, or backfill into tools when needed.
- Incident-Based Reactivity: During incidents, Grepr stops aggregating relevant data and backfills critical details into existing tools, ensuring engineers have the information they need, when they need it.
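To illustrate the first two points above, here is a minimal, hedged sketch of what "aggregate when healthy, pass through when unhealthy" processing could look like. The pattern masking and data structures are simplifying assumptions chosen for brevity; Grepr's actual pipeline relies on machine learning and is considerably more sophisticated.

```python
import re
from collections import Counter

# Illustrative only: a deliberately simplified take on noise reduction plus
# health-based aggregation, not Grepr's actual algorithm. Messages tied to
# unhealthy objects pass through untouched; messages for healthy objects are
# collapsed into patterns with a repeat count, so the downstream tool sees
# far fewer events.

NUMBER = re.compile(r"\d+")

def to_pattern(message: str) -> str:
    """Crude log templating: mask numbers so repeated messages collapse into
    one pattern (real systems use much smarter clustering)."""
    return NUMBER.sub("<num>", message)

def reduce_batch(logs, unhealthy_ids):
    """logs: iterable of (entity_id, message); unhealthy_ids: set of entity ids."""
    passthrough = []                 # raw messages for unhealthy objects
    aggregated = Counter()           # (entity_id, pattern) -> count for healthy ones
    for entity_id, message in logs:
        if entity_id in unhealthy_ids:
            passthrough.append((entity_id, message))
        else:
            aggregated[(entity_id, to_pattern(message))] += 1
    summaries = [
        (entity_id, f"{pattern} (x{count})")
        for (entity_id, pattern), count in aggregated.items()
    ]
    return passthrough + summaries

# Example: the healthy host's chatter collapses; the unhealthy one stays verbose.
batch = [
    ("host:a", "GET /health 200 in 3ms"),
    ("host:a", "GET /health 200 in 5ms"),
    ("host:a", "GET /health 200 in 4ms"),
    ("host:b", "OOMKilled: container checkout restarted (attempt 2)"),
]
for entity_id, line in reduce_batch(batch, unhealthy_ids={"host:b"}):
    print(entity_id, line)
```

The split matters because quiet, repetitive data is still represented, just cheaply, while anything tied to an unhealthy object arrives in full detail.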
This is just the start. Systems are growing more complex, especially as they incorporate AI workflows, and traditional observability architectures are becoming unsustainable. Grepr aims to transform how observability data is used, evolving along with modern systems to help engineers keep systems efficient and reliable.
As part of our launch, we’re excited to announce our $9M seed round, led by Martin Casado of Andreessen Horowitz and Ed Sim of boldstart ventures, to help us drive the next wave of reliability.
It takes just 20 minutes to get started with Grepr and reduce your observability spending by 90%. Sign up for a free trial to find out for yourself!