Three Advanced Techniques to Reduce Logging Costs - Part II

Jad Naous
February 11, 2025

In the first blog post in the series, I went through four basic ways to reduce logging volumes: increasing the severity threshold, converting logs to metrics, uniform sampling, and drop rules. These techniques work well for smaller, simpler environments, but they can discard data that turns out to be important when troubleshooting, and some require significant effort to scale to the enterprise. In this blog post, I'll go through three advanced techniques to reduce log volumes: automatic sampling by pattern, logarithmic sampling by pattern, and sampling with automatic backfilling.

Technique 1: Automatic sampling by pattern

If we can automatically identify, in real time, the patterns in log messages and track how many messages we're seeing for each pattern, we can then automatically decide how much data to send for each pattern. Here’s how it would work:

  1. As messages come in, build a database of log patterns and pass the messages through.
  2. Keep track of the incoming rate for each pattern.
  3. When a particular pattern crosses some threshold (meaning we've already sent some number of messages for that pattern), start randomly sampling messages for some period of time (say 2 minutes). This sampling passes through a fraction (which could be zero) of additional messages for that pattern.
  4. At the end of that time period, send a summary message for each aggregated pattern with a count of the messages that were skipped for that pattern.
  5. Repeat.
Two samples from each of the two log patterns (GET and POST) are passed through unaggregated, and the rest are aggregated into summary messages.
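
To make the mechanics concrete, here's a minimal sketch in Python of the loop described above. This is an illustration only, not any particular product's implementation: the pattern extraction (masking numbers and hex tokens), the threshold, the sample rate, and the window length are all placeholder choices.

```python
# Minimal sketch of automatic sampling by pattern (illustrative only).
import random
import re
import time
from collections import defaultdict

THRESHOLD = 100        # messages per pattern per window before sampling kicks in
SAMPLE_RATE = 0.01     # fraction passed through once the threshold is crossed
WINDOW_SECONDS = 120   # length of each sampling window


def pattern_of(message: str) -> str:
    """Very rough pattern extraction: mask hex-like tokens and numbers."""
    masked = re.sub(r"\b0x[0-9a-fA-F]+\b", "<HEX>", message)
    return re.sub(r"\d+", "<NUM>", masked)


class PatternSampler:
    def __init__(self):
        self.window_start = time.monotonic()
        self.sent = defaultdict(int)     # messages passed through per pattern
        self.skipped = defaultdict(int)  # messages dropped per pattern

    def process(self, message: str, emit) -> None:
        self._maybe_roll_window(emit)
        pat = pattern_of(message)
        if self.sent[pat] < THRESHOLD or random.random() < SAMPLE_RATE:
            self.sent[pat] += 1
            emit(message)                # pass the raw message through
        else:
            self.skipped[pat] += 1       # count it toward the summary

    def _maybe_roll_window(self, emit) -> None:
        if time.monotonic() - self.window_start < WINDOW_SECONDS:
            return
        # End of window: emit one summary per aggregated pattern, then reset.
        for pat, count in self.skipped.items():
            if count:
                emit(f"SUMMARY pattern={pat!r} skipped={count}")
        self.window_start = time.monotonic()
        self.sent.clear()
        self.skipped.clear()
```

Feeding each incoming line to `sampler.process(line, emit)` passes raw lines through until a pattern crosses the threshold, then samples that pattern and emits one summary per aggregated pattern at the end of each window.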
Pros
  1. Minimal configuration: with very little effort, this technique can significantly reduce log volumes. Unlike drop rules, no manual pattern configuration or maintenance is required, and it works for all patterns, not just the highest-volume ones.
  2. Catches low-volume, important messages: infrequent messages, such as errors or rarely executed code paths, are passed through unsampled, so they're available for troubleshooting.
  3. Prevents spurious spikes: new high-volume patterns are automatically identified and sampled.
  4. Maintains relative volumes of patterns: by uniformly sampling data beyond the initial threshold, heavy patterns remain heavier than light patterns at the sink. Tools that group by pattern, or anomaly detection that relies on statistical profiles of message volumes, would still work.
  5. Absolute counts for patterns are still available: Dashboards and alerts that rely on absolute values can be updated to include counts from summary messages.
  6. Avoid modifying existing dashboards and alerts: by configuring exceptions for what shouldn't be aggregated, you can avoid making changes to your existing setup and minimize impacts to workflows.
  7. Immediate: since pattern detection is real-time, this starts working immediately.
Cons
  1. Loses some data: since high-volume patterns are sampled, those patterns will not have all their data available at the sink.
  2. Reconfiguration: Some dashboards and alerts may need reconfiguration to take summaries into account if exceptions are not configured.
  3. Complexity: this capability doesn’t exist in standard log aggregation tools, so an additional tool may be needed.

Technique 2: Logarithmic sampling by pattern

Logarithmic sampling increases volume reduction exponentially over uniform sampling

The previous technique does two things: 1) it guarantees a basic minimum of log messages for every pattern, and 2) it reduces the rest of the data by a set fraction. However, that's not optimal. Your heaviest patterns, which may be 10,000x noisier than your lightest patterns, are sampled at the same rate, so you only get a linear decrease in log volume. Ideally, you want to sample the heavier patterns more aggressively than the lighter ones. For example, if you're seeing 100,000 messages per second for pattern A and 100 messages per second for pattern B, you probably want to pass through only 1% of pattern A and 10% of pattern B. This is what logarithmic sampling does.
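"Logarithmic" can be realized in more than one way. One simple illustrative scheme (an assumption for this sketch, not a specific vendor's algorithm) is to pass through, per pattern within each window, only the 1st, 10th, 100th, 1,000th, ... message, so the volume kept grows with the logarithm of the volume received rather than linearly with it.

```python
# Illustrative logarithmic sampler: per pattern, pass message number 1, 10,
# 100, 1000, ... within a window, so the number of messages kept grows with
# log10 of the number received. Counters would be reset at each window
# boundary and skipped counts folded into summary messages, as in Technique 1.
import math
from collections import defaultdict

BASE = 10  # keep roughly one message per power of BASE


class LogarithmicSampler:
    def __init__(self):
        self.seen = defaultdict(int)     # messages observed per pattern
        self.skipped = defaultdict(int)  # messages dropped per pattern

    def should_pass(self, pattern: str) -> bool:
        self.seen[pattern] += 1
        n = self.seen[pattern]
        # Pass only when n is an exact power of BASE (1, 10, 100, ...).
        if BASE ** round(math.log(n, BASE)) == n:
            return True
        self.skipped[pattern] += 1
        return False
```

With this scheme, a pattern producing a million messages in a window keeps about seven of them, while a pattern producing ten keeps two.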

This technique is an extension of sampling by pattern, so it has the same pros and cons. The only difference is that it's exponentially (pun intended) more effective. It's also exponentially more effective than the basic uniform sampling technique.

Technique 3: Sampling with automatic backfilling

This last technique reduces the downsides of sampling by storing all the original logs in low-cost, queryable storage and automatically reloading data into the log aggregator when there’s an anomaly. The anomaly detection could be built into the processing pipeline, or it could be external (such as a callback from the observability tool itself). This way, when an engineer goes to troubleshoot the anomaly, the data is already in the log aggregator.

When there's an anomaly, pass through additional data and backfill history for anomalous objects.
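
A rough sketch of the backfill trigger might look like the following. Everything named here is a hypothetical placeholder: `query_raw_store` and `send_to_aggregator` stand in for whatever lakehouse query API and log-aggregator ingest API a real pipeline would use, and the 30-minute lookback is an arbitrary choice.

```python
# Sketch of anomaly-triggered backfill. query_raw_store and send_to_aggregator
# are hypothetical placeholders for a real data-lake query API and a real
# log-aggregator ingest API.
from datetime import datetime, timedelta, timezone
from typing import Iterable, List


def query_raw_store(service: str, start: datetime, end: datetime) -> List[str]:
    """Placeholder: fetch raw log lines for `service` between start and end
    from the low-cost store (e.g., a lakehouse table partitioned by time)."""
    return []


def send_to_aggregator(lines: Iterable[str]) -> None:
    """Placeholder: forward log lines to the log aggregator's ingest endpoint."""
    for _ in lines:
        pass


def handle_anomaly_alert(service: str, detected_at: datetime,
                         lookback_minutes: int = 30) -> None:
    """Callback invoked by anomaly detection, either built into the pipeline
    or an external webhook from the observability tool. It reloads recent
    history for the anomalous service so the data is already in the aggregator
    when an engineer starts troubleshooting."""
    start = detected_at - timedelta(minutes=lookback_minutes)
    send_to_aggregator(query_raw_store(service, start, detected_at))


# Example: an alert fired for the checkout service just now.
handle_anomaly_alert("checkout", datetime.now(timezone.utc))
```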
Pros
  1. Reduces data loss: by automatically backfilling missing data, it’s less likely that an engineer won't find what they need. Assuming manual backfilling is also possible and quick, data loss is effectively eliminated. At the same time, if the data store is queryable by the developer, all the raw data is still available, albeit with reduced query performance.
  2. Simplifies deployment: because of the lower risk of data loss, deployment can be faster and more aggressive.
  3. Iterative improvement: can iterate and improve the rules for automated backfilling over time.
Cons
  1. Requires additional queryable storage: requires something like a data lakehouse, where storage is cheap but the data is still queryable with fast-enough performance.
  2. Complexity: handling backfills requires batch processing in addition to streaming ingestion, and may require adding another component.
  3. Not perfect: at some point an engineer might need to manually search through the raw data store or execute a manual backfill, which would require them to go to another tool.

Availability

I went through the available public documentation for various tools to check what capabilities are present in each. Where it was clear a technique was possible, I marked the cell with ✔️. Where it was missing, I marked it with ❌.

Grepr is the only solution that implements all three advanced techniques. Further, Grepr automatically parses configured dashboards and alerts for patterns to exclude from sampling, mitigating impacts to existing workflows. In customer deployments, we’ve seen log volume reductions of more than 90%! Sign up here for free and see the impact Grepr can make in 20 minutes, or reach out to us for a demo here.
