Adventures In Go

Aug 14, 2025

By the third time our overnight ETL job set fire to the heap monitor, we suspected something deeper than the usual Go leakage. The numbers made no sense. Our struct-of-slices pipeline had every “optimization” the blog posts could muster, but every morning, the resident Grafana could only offer a memory chart shaped like the North Face of the Eiger.

We gathered, as teams do, around the build artifact of record, a three-year veteran of every breaking change in our data models, and ran pprof for the thousandth time.

It’s always pprof, at the beginning.

The first thing we did, like all properly traumatized Go engineers, was to grab a heap snapshot from the running process. The flame graph made for impressive social-media fodder: a forest of allocations stacked above the same three functions, none of them unique to our business logic. It was always the same—encoding/json, io.Reader, and an inscrutable mix of bytes.Buffer internals.

What caught our eye that time was not the height of the stack, but the width. There was a wall of tiny allocations. A million here, a million there. Each one a handful of bytes, but in aggregate, enough to drown a VM. The only thing more consistent than our allocations was our confusion.

We started with the low-hanging fruit. json.Encoder, as always, was public enemy number one.

Our data model looked simple:

type Event struct {
    Timestamp time.Time
    UserID    string
    Payload   map[string]interface{}
}

We were generating hundreds of millions of these per day, shuttling them through a pipeline of goroutines, buffering, marshaling, then dumping to S3 via a home-grown multipart uploader.
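For orientation, here is a minimal sketch of that pipeline's shape; runPipeline, flush, and the 5 MB batch threshold are illustrative stand-ins, not our actual uploader code.

import (
    "bytes"
    "encoding/json"
)

// runPipeline drains events, marshals them into a buffer, and hands full
// batches to flush, which stands in for the multipart upload to S3.
func runPipeline(events <-chan Event, flush func([]byte) error) error {
    var buf bytes.Buffer
    enc := json.NewEncoder(&buf)
    for e := range events {
        if err := enc.Encode(e); err != nil {
            return err
        }
        if buf.Len() >= 5<<20 { // flush in roughly 5 MB batches
            if err := flush(buf.Bytes()); err != nil {
                return err
            }
            buf.Reset()
        }
    }
    if buf.Len() > 0 {
        return flush(buf.Bytes())
    }
    return nil
}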

At first, we assumed the allocations came from Payload, as every Go engineer is trained to do. We reached for the profiler:

go tool pprof -alloc_objects ./mybinary mem.pprof

There it was, in black and white: the bulk of objects, a few bytes each, attributed to map growth and string copying inside json.Marshal. Obvious, right?

But when we wrote a minimal benchmark—just marshaling a bare Event to discard—the allocations were an order of magnitude lower than our real pipeline. So much for intuition.
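That benchmark was roughly this shape (a sketch; the name and the one-key payload are illustrative), run with go test -bench=. -benchmem:

import (
    "encoding/json"
    "io"
    "testing"
    "time"
)

func BenchmarkMarshalEvent(b *testing.B) {
    // A bare Event with a one-key Payload; the real events are much larger.
    e := Event{
        Timestamp: time.Now(),
        UserID:    "user-123",
        Payload:   map[string]interface{}{"k": "v"},
    }
    enc := json.NewEncoder(io.Discard)
    b.ReportAllocs()
    for i := 0; i < b.N; i++ {
        if err := enc.Encode(e); err != nil {
            b.Fatal(err)
        }
    }
}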

We ran with GODEBUG=gctrace=1 and watched the GC struggle to keep up. The clues pointed to a much more subtle villain: not our Payload, but the persistent, silent churn of context values.

Like every other Go project since 2017, we used context.WithValue everywhere: for logging, for tracing, for every scrap of metadata that might be needed downstream. The allocation cost of a single WithValue is supposed to be negligible, and it is, provided you reuse parent contexts where you can.

We did not.

Instead, every single event in the stream carried its own unique context tree, built from scratch, then discarded minutes later. The result was a garbage midden of short-lived *context.valueCtx nodes, each dragging along a copy of the key and whatever anonymous struct we’d thoughtlessly attached as a value. On paper, it should have been fine; in practice, the cumulative cost dwarfed our data payloads.

For science, we instrumented the pipeline to show the top types by allocation site. The new king: a chain of context.valueCtx nodes, each holding on to a stray []byte, a span ID, or a logger instance.

We experimented with pooling. sync.Pool, that seductive tool, had saved us before in the HTTP server days, so we spun up a trivial pool for Event instances. The memory chart responded with polite applause—a 10% improvement, nothing to write home about.

Next, we tried to eliminate context churn. It took three afternoons to carefully refactor the event processors to share parent contexts whenever possible, passing around a single tree instead of building new ones at every stage. We nearly broke our trace propagation in the process, but the results were instant. Allocation rate dropped by half. The heap chart, for the first time in months, did not slope into infinity.

But there was more.

We ran the job in a staging environment with -race, just to check for the kind of low-level hazards that only appear at load. Immediately, the output spewed warnings about concurrent map writes. We traced it back to a forgotten cache: a global map of known UserIDs, used for deduplication. The map was protected by a naive RWMutex, but in a fit of “optimization” months earlier, someone had wrapped certain writes in goroutines, assuming the lock would protect them. The problem: the writes were batched, but the lock was not held across the entire batch.

It was a minor miracle that the program ran at all. The race detector’s wrath was justified.

We debated the fix. The “Go” thing to do would be to replace the map+lock with a concurrent map. But our keys were never deleted, and only rarely read once inserted. We tried a sync.Map, just to say we’d done it, but performance went down, not up.
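For the record, the sync.Map attempt amounted to something like this (a sketch, not the code we kept):

import "sync"

var seen sync.Map // UserID -> struct{}

// dedup reports whether id was already present, inserting it if not.
func dedup(id string) bool {
    _, loaded := seen.LoadOrStore(id, struct{}{})
    return loaded
}

The sync.Map docs pitch it at keys written once and read many times, or at goroutines working over disjoint key sets; a write-heavy dedup set sits outside both, which is consistent with the slowdown we measured.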

So we returned to first principles: instead of deduplicating in memory, why not persist the known UserIDs in a disk-backed index? BoltDB had served us well in other projects. The interface was clunky, but at our scale, the on-disk lookup was orders of magnitude faster than fighting over locks on a global map. We wrote a thin wrapper, ran benchmarks, and watched as the time spent in UserID deduplication dropped from 40% of total CPU to under 3%.

There was an irony to all this: after months of micro-optimizations, what saved us was adding disk I/O.

Memory safe, CPU happy. For a day, at least.

We turned back to the flame graph for one final sweep. A few suspicious peaks remained, mostly in third-party libraries. One, a metrics collector, called time.Now with every metric, leading to millions of syscalls per hour. We forked it, replaced the calls with a single time.Ticker, and the peaks vanished.

All that remained was the comforting buzz of GC cycles, each pass harvesting fewer and fewer orphans. The job ran overnight, barely cresting its initial heap.

The best part: for the first time, the morning chart was flat.

We’d like to say we learned something profound about performance tuning. But really, it just confirmed what we’d always suspected: most of the work is in undoing your own cleverness.

What follows is a categorized, annotated list of the tools and tricks we used, plus what we wish we’d known before.

Tools

pprof

If you haven’t already, start with Go’s built-in pprof. Heap profiles are easiest:

go test -bench=. -benchmem -memprofile=mem.pprof
go tool pprof -web mem.pprof

Focus on the “flat” and “cum” columns: flat counts allocations made directly in a function, while cum includes everything its callees allocate as well. In a goroutine-heavy service, look for repeated small allocations.

In production, pprof can run as a web server:

import _ "net/http/pprof" // registers the /debug/pprof/ handlers on http.DefaultServeMux
go func() {
    log.Println(http.ListenAndServe("localhost:6060", nil))
}()
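
With that in place, a heap profile can be pulled straight from the live process:

go tool pprof http://localhost:6060/debug/pprof/heap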

GODEBUG

GC churn often tells the real story. Run with:

GODEBUG=gctrace=1 ./mybinary

This prints one line per GC cycle: time spent in GC, heap size before and after collection, and the heap goal. Look for the cycle rate and the heap goal climbing under heavy load.

Race Detector

Do not trust your locks; trust -race.

go test -race

Or, when running a program directly:

go run -race main.go

Escape Analysis

Most mysterious heap allocations come from values that “escape” their stack frame. Build with:

go build -gcflags "-m" ./...

And parse the output for “escapes to heap” warnings.
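As a toy illustration (newEvent is a made-up helper), returning the address of a local is the classic way a value ends up on the heap:

import "time"

// The compiler cannot keep e on the stack once its address outlives the call;
// `go build -gcflags "-m"` typically reports it as moved/escaping to the heap.
func newEvent(uid string) *Event {
    e := Event{Timestamp: time.Now(), UserID: uid}
    return &e
}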

sync.Pool

Pooling helps for short-lived objects you allocate in bulk, but only if every caller treats a pooled object as scratch space. Watch for reuse bugs: returning an object to the pool while something still holds a pointer to it, or forgetting to reset its fields, can lead to all manner of subtle corruption.
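
A minimal sketch of what that looks like for our Event type (getEvent and putEvent are illustrative; the important part is zeroing the fields before reuse):

import "sync"

var eventPool = sync.Pool{
    New: func() interface{} { return new(Event) },
}

func getEvent() *Event {
    return eventPool.Get().(*Event)
}

func putEvent(e *Event) {
    // Clear every field before handing the object back; a stale Payload
    // surfacing in someone else's event is exactly the corruption to avoid.
    *e = Event{}
    eventPool.Put(e)
}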

BoltDB / Badger

For high-concurrency, low-churn maps, disk-backed indexes often beat in-memory solutions. Both Bolt and Badger have mature Go APIs, but beware that every write is a transaction.

Code Examples

Before: Excessive context.WithValue

func process(e Event) {
    ctx := context.WithValue(context.Background(), "uid", e.UserID)
    downstream(ctx, e)
}

After: Context reuse

var rootCtx = context.Background()

func process(e Event) {
    // Build the per-event context once, up front, and pass this same ctx to
    // every downstream stage rather than rebuilding a tree at each one.
    ctx := context.WithValue(rootCtx, "uid", e.UserID)
    downstream(ctx, e)
}

Or, better still, only add values at the callsite that needs them.
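
When a value genuinely has to ride along in the context, an unexported key type keeps it from colliding with keys set by other packages. A minimal sketch (withUID and uidFrom are illustrative helpers, not from our codebase):

import "context"

type uidKey struct{}

// withUID attaches the user ID only at the point that needs it.
func withUID(ctx context.Context, uid string) context.Context {
    return context.WithValue(ctx, uidKey{}, uid)
}

// uidFrom retrieves it, reporting whether it was set.
func uidFrom(ctx context.Context) (string, bool) {
    uid, ok := ctx.Value(uidKey{}).(string)
    return uid, ok
}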

Before: Global map with RWMutex

var mu sync.RWMutex
var userSet = make(map[string]struct{})

func dedup(id string) bool {
    mu.RLock()
    _, ok := userSet[id]
    mu.RUnlock()
    if ok {
        return true
    }
    // Another goroutine can insert id in the window between RUnlock and Lock,
    // so two callers may both report "new".
    mu.Lock()
    userSet[id] = struct{}{}
    mu.Unlock()
    return false
}

After: Disk-backed deduplication

var db *bolt.DB // opened once at startup with bolt.Open("ids.db", 0600, nil)

func dedup(id string) (bool, error) {
    seen := false
    // A single read-modify-write transaction; a separate View followed by an
    // Update would leave a window in which two callers both report "new".
    err := db.Update(func(tx *bolt.Tx) error {
        b, err := tx.CreateBucketIfNotExists([]byte("ids"))
        if err != nil {
            return err
        }
        if b.Get([]byte(id)) != nil {
            seen = true
            return nil
        }
        return b.Put([]byte(id), []byte{1})
    })
    return seen, err
}

Before: Metrics with time.Now per call

func recordLatency(start time.Time) {
    elapsed := time.Since(start)
    metrics.Observe(elapsed)
}

After: Ticker-based time sampling

var coarseNow atomic.Value // holds a time.Time, refreshed by a single writer goroutine

func startClock() {
    coarseNow.Store(time.Now())
    go func() {
        ticker := time.NewTicker(10 * time.Millisecond)
        for t := range ticker.C {
            coarseNow.Store(t)
        }
    }()
}

// now replaces time.Now in the hot path, trading ~10ms of precision for one clock read per tick.
func now() time.Time { return coarseNow.Load().(time.Time) }

Lessons Learned

  1. Trust your profiler, not your gut. The hottest allocation site is never where you think it is.
  2. Context is for cancellation, not for data plumbing. Pass values explicitly, unless you want to relive the Java thread-local nightmare.
  3. Bench your “optimizations.” sync.Pool isn’t always faster, and sometimes a map+lock is as good as it gets.
  4. Prefer correctness over cleverness. Every micro-optimization introduces a place for bugs to hide.
  5. Disk is cheap. Locks are expensive. When in doubt, offload to a transactional KV.

We still keep the heap flame graph from that night, not as a trophy, but as a reminder that every production problem is just an unexpected loop away.

Profiling, like all debugging, is a little bit science and a lot of confession. If you squint, you can see the outlines of your own bad habits, flickering in the pprof web UI, silently waiting for the next deployment.

And the next, and the next.