Performance · Jan 26, 2026 · 8 min read

Why `malloc` is Killing Your Latency

In high-frequency trading and embedded systems, "fast" isn't about average speed—it's about predictability. Here is why dynamic memory allocation is the silent killer of your latency, and how `std::pmr` fixes it.

The hidden cost of `new`

Every C++ developer knows that malloc (and by extension new, or a std::vector using the default allocator) is "slow." But few understand why it is unpredictable. When you allocate from the heap, up to three expensive things happen:

  • The Lock: The allocator must acquire a mutex to prevent race conditions in a multi-threaded app.
  • The Search: It must walk a free-list or tree to find a block of the correct size.
  • The Syscall: If the allocator's pool is exhausted, it calls sbrk or mmap to request more pages from the kernel. That user/kernel transition can cost microseconds on its own.

In a latency-sensitive application (like a trading matching engine or a radar control loop), this nondeterminism is unacceptable. You might get memory in 100ns one time, and 10ms the next.

The "Legacy" Way (Heap Vector)

We recently audited a FinTech codebase where small vectors were being created inside the hot loop. Even though vectors are contiguous, the allocation itself hits the heap every single time.

#include <vector>

struct Trade { int id; double price; double quantity; };

// BAD: Hitting the HEAP inside the hot loop
void Legacy_Malloc(int batch_size) {
    // 1. std::vector allocates on the heap immediately
    std::vector<Trade> trades;
    trades.reserve(batch_size); // MALLOC TRIGGERED HERE
    
    for (int i = 0; i < batch_size; ++i) {
        trades.push_back({i, 100.0, 50.0});
    }
    
    // ... process trades ...
    // 2. std::vector frees memory (free/delete overhead)
}

The Fix: Stack Arenas & `std::pmr`

The solution is to stop asking the OS for memory during the hot path. Instead, we pre-allocate a large chunk of memory at startup (an "Arena") and simply bump a pointer to hand out memory.

In C++17/20, we don't need to write raw pointer math to do this. We can use Polymorphic Memory Resources (std::pmr).
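To make "bump a pointer" concrete, here is roughly what a monotonic arena does under the hood — a sketch with names of our own choosing, not a standard API. Allocation is an alignment round-up, a bounds check, and a pointer increment; "freeing" everything is a single reset.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// Minimal bump-pointer arena (illustrative; std::pmr does this for you).
class BumpArena {
public:
    BumpArena(std::byte* buf, std::size_t size) : cur_(buf), end_(buf + size) {}

    void* allocate(std::size_t bytes, std::size_t align) {
        auto p = reinterpret_cast<std::uintptr_t>(cur_);
        auto aligned = (p + align - 1) & ~(align - 1); // round up to alignment
        if (aligned + bytes > reinterpret_cast<std::uintptr_t>(end_))
            throw std::bad_alloc{};                    // arena exhausted
        cur_ = reinterpret_cast<std::byte*>(aligned + bytes);
        return reinterpret_cast<void*>(aligned);
    }

    void reset(std::byte* buf) { cur_ = buf; } // "free" everything at once
private:
    std::byte* cur_;
    std::byte* end_;
};
```

Note there is no per-allocation deallocate — that is exactly why the fast path is branch-light and lock-free.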

The Modern Way (Stack Arena)

By using a monotonic_buffer_resource backed by a stack buffer, allocation becomes little more than a pointer bump plus an alignment adjustment. No locks. No syscalls. And because the buffer lives on the stack, it is likely already hot in cache.

#include <array>
#include <cstddef>
#include <memory_resource>
#include <vector>

void Modern_PMR(int batch_size) {
    // 1. Pre-allocate buffer on STACK (Hot cache)
    std::array<std::byte, 4096> buffer; 
    
    // 2. Create monotonic resource (No locks, just pointer bump)
    std::pmr::monotonic_buffer_resource pool{
        buffer.data(), buffer.size()
    };

    // 3. Vector allocates from the stack pool
    std::pmr::vector<Trade> trades{&pool};
    trades.reserve(batch_size); // NO MALLOC. Just bumps the buffer pointer.
    // (If the 4 KiB buffer ran out, pmr would fall back to the
    //  upstream heap resource rather than fail.)

    for (int i = 0; i < batch_size; ++i) {
        trades.push_back({i, 100.0, 50.0});
    }
} 
// 4. No "free" needed. Stack unwinds instantly.
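One caveat worth knowing: a monotonic_buffer_resource never reuses memory that a container hands back — it only rewinds on release() or destruction. For a long-running loop, the pattern below (a sketch, with Trade repeated so it compiles standalone) scopes the vector so it dies first, then calls release() to rewind the arena for the next batch.

```cpp
#include <array>
#include <cstddef>
#include <memory_resource>
#include <vector>

struct Trade { int id; double price; double quantity; };

// Reuse one stack arena across many batches.
std::size_t ProcessBatches(int batches) {
    std::array<std::byte, 4096> buffer;
    std::pmr::monotonic_buffer_resource pool{buffer.data(), buffer.size()};

    std::size_t total = 0;
    for (int b = 0; b < batches; ++b) {
        {
            std::pmr::vector<Trade> trades{&pool};
            trades.reserve(100);
            for (int i = 0; i < 100; ++i) trades.push_back({i, 100.0, 50.0});
            total += trades.size();
        }               // vector destroyed BEFORE the arena is rewound
        pool.release(); // resets the resource to the start of `buffer`
    }
    return total;
}
```

Calling release() while a container still holds memory from the pool would be undefined behavior, hence the inner scope.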

The Result: 1.4x Faster

We benchmarked this change on a typical HFT workload (Batch Size: 100 trades). By eliminating the heap locks, we achieved a 30% reduction in CPU time per iteration.

Legacy (Heap Allocation): 451 ns
Modern (Stack Arena):     314 ns

*Benchmark: GCC 12 (O3), Quick-Bench.com (Batch Size: 100).

Conclusion

While veteran developers have used stack arenas for decades (often via complex custom allocators or risky pointer arithmetic), C++17 finally standardizes this pattern.

Modern C++ allows us to write safe, high-level code that performs like hand-tuned Assembly—without the maintenance nightmare of the past. By using std::pmr, we can eliminate the entire class of latency spikes caused by the heap.

If your C++ codebase is still calling new (or creating heap vectors) in the hot path, you are leaving performance on the table.

Don't believe the chart? Run the code yourself.

View Live Benchmark on Quick-Bench.com →

Is your Legacy Codebase Slow?

We specialize in identifying hidden latency killers like heap fragmentation. Book a Code Audit to see how we can modernize your infrastructure.
