Premature Optimization is Fun Sometimes

by invlpg on 2025-06-19


A colleague of mine was recently discussing with me a connectivity monitoring system he is working on. It’s nothing fancy: it sends ICMP Echo Requests to a couple of different servers and tracks latency and dropped-packet averages over 1-minute, 5-minute, and 15-minute windows. The question of how this data should be stored came up, and the natural thought was a 512-entry ring buffer, containing entries like the following:

#include <netinet/in.h>  // in_addr_t
#include <stdbool.h>
#include <stdint.h>

struct ping_timestamp {
    uint64_t sent_ns;       // when the packet was sent (unit: ns)
    uint64_t received_ns;   // when the packet was received (unit: ns)
    in_addr_t source_addr;  // source address
    uint16_t seq_no;        // echo request sequence number
    bool received;          // has the request been received?
};

And backed by the following array:

struct ping_timestamp pings_rb[512];

// ...

printf("%zu\n", sizeof(pings_rb));

// 12288

12 KiB. Pretty wasteful, right? We can certainly do better. Do we need to keep fields for both sent and received? What we’re really interested in is the latency. We only need to know when a packet was sent up until we know when it was received; at that point, the data we want to keep is received - sent. So why don’t we make it a tagged union?

struct ping_timestamp_2 {
    union {
        uint64_t sent_ts;       // valid while the request is in flight (unit: ns)
        uint64_t elapsed_ts;    // valid once the reply has arrived (unit: ns)
    };
    in_addr_t source_addr;
    uint16_t seq_no;
    bool received;              // tag: selects the live union member
};

// ...

printf("%zu\n", sizeof(pings_rb));

// 8192

Not bad, we’ve shaved off an entire page. We can still do better.
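To make the tagged-union semantics concrete, here is a minimal sketch of the receive path (on_echo_reply and now_ns are illustrative names, not from the original code):

void on_echo_reply(struct ping_timestamp_2 *e, uint64_t now_ns)
{
    if (!e->received) {
        e->elapsed_ts = now_ns - e->sent_ts;  // reuse the slot for the latency
        e->received = true;                   // elapsed_ts is now the live member
    }
}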

Nanosecond precision? In our case, ping times are measured in tens, hundreds, or even thousands of milliseconds. We don’t need to keep all of those extra bits around. If we change the unit from nanoseconds to 100-microsecond increments (0.1 ms), then 43 bits is sufficient for us to keep track of pings for up to 20 years [1]. 20 years is still a bit excessive, but it doesn’t hurt to be at least a little bit future-proof.
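Converting a raw timestamp into those ticks is cheap. A sketch, assuming (per footnote 1) the timestamps come from clock_gettime(CLOCK_MONOTONIC); now_ticks_100us is an illustrative name:

#include <time.h>

// Current monotonic time in 100 μs ticks (1 tick = 100,000 ns).
static uint64_t now_ticks_100us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 10000u       // 10,000 ticks per second
         + (uint64_t)ts.tv_nsec / 100000u;    // 100,000 ns per tick
}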

And received? 8 bits for a true/false value seems altogether too much. The answer: bitfields.

struct ping_timestamp_3 {
    uint64_t sent_or_elapsed_ts: 43;
    uint64_t received: 1;
    uint64_t seq_no: 16;
    in_addr_t source_addr;
};

// ...

printf("%zu\n", sizeof(pings_rb));

// 8192

Wait, what? Why haven’t we saved any space?

The answer is struct padding. The layout of ping_timestamp_2 looks like this:

     0       8      16      24      32      40      48      56      64...
     +-------+-------+-------+-------+-------+-------+-------+-------+
     |                         sent/elapsed                          |
     +-------+-------+-------+-------+-------+-------+-------+-------+
 ...64      72      80      88      96      104     112     120     128
     +-------+-------+-------+-------+-------+-------+-------+-------+
     |         source_address        |    seq_no     | recv? |  pad  |
     +---------------------------------------------------------------+

Where the padding byte at the end ensures the struct meets its alignment requirements. ping_timestamp_3, on the other hand, looks like this:

     0       8      16      24      32      40  R?  48      56      64...
     +-------+-------+-------+-------+-------+-------+-------+-------+
     |                 sent/elapsed            ||     seq_no     | P |
     +-------+-------+-------+-------+-------+-------+-------+-------+
 ...64      72      80      88      96      104     112     120     128
     +-------+-------+-------+-------+-------+-------+-------+-------+
     |         source_address        |            padding            |
     +---------------------------------------------------------------+

So our optimization there didn’t actually save any space; we’re wasting 36 bits of padding. Is there any way we can do better?
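The sizes above are easy to sanity-check at compile time. A sketch using C11 _Static_assert; the exact bit-field layout is implementation-defined, but these totals hold on the usual x86-64/AArch64 Linux ABIs:

#include <assert.h>

// 8-byte union + 4-byte addr + 2-byte seq + 1-byte bool, padded to 16
_Static_assert(sizeof(struct ping_timestamp_2) == 16, "unexpected layout");

// 60 bits of bit-fields occupy one 8-byte unit; + 4-byte addr, then
// padding up to the 8-byte alignment of the uint64_t storage unit
_Static_assert(sizeof(struct ping_timestamp_3) == 16, "unexpected layout");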

We keep track of the source address because it changes frequently while our product is in operation (it runs on a mobile data network). When the address changes, we also reset the sequence number, for reasons that aren’t relevant to the current topic. We have, in the past, seen packets with differing source addresses but identical sequence numbers being processed by our application at the same time (the joys of asynchronous programming), so the source address serves to disambiguate them.

But there’s another way to disambiguate.

An ICMP echo request has a 16-bit identifier field that allows applications to identify which echo request packets were sent by them. Its value is completely arbitrary: on Linux, iputils ping sets it to getpid() & 0xFFFF; on OpenBSD, a random number is used instead.
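On Linux the field lives in struct icmphdr from <netinet/ip_icmp.h>. A sketch of filling in the header iputils-style (fill_echo_header is an illustrative name; checksumming is omitted):

#include <netinet/ip_icmp.h>  // struct icmphdr, ICMP_ECHO
#include <arpa/inet.h>        // htons
#include <string.h>
#include <unistd.h>           // getpid

static void fill_echo_header(struct icmphdr *hdr, uint16_t seq)
{
    memset(hdr, 0, sizeof *hdr);
    hdr->type = ICMP_ECHO;
    hdr->un.echo.id = htons((uint16_t)(getpid() & 0xFFFF));  // iputils-style
    hdr->un.echo.sequence = htons(seq);
    // the checksum is computed over the finished packet before sending
}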

Although it’s 16 bits long, we don’t actually need to use the full 16 bits. There are 4 free bits left in the first 8 bytes of our ping_timestamp_3. Our thought was to use a rolling 4-bit counter that is incremented whenever our source address changes (this is monitored elsewhere in the application), allowing us to uniquely identify [2] which source address the packet came from.
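A minimal sketch of such a counter; note_source_addr and the statics are illustrative, since the post doesn’t show this part of the application:

static uint8_t   addr_counter;      // rolling 4-bit counter
static in_addr_t last_source_addr;

// Called whenever the application observes a source address change.
static void note_source_addr(in_addr_t addr)
{
    if (addr != last_source_addr) {
        last_source_addr = addr;
        addr_counter = (addr_counter + 1) & 0xF;  // wraps at 16
    }
}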

Our final struct looks like this:

struct ping_timestamp {
    uint64_t elapsed_or_sent_ts : 43;
    uint64_t received : 1;
    uint64_t counter: 4;
    uint64_t seq_no: 16;
};

// ...

printf("%zu\n", sizeof(pings_rb));

// 4096

Much better: a whole 8 KiB of savings, and down to a single page of data. You may have noticed that I changed the order of the fields slightly. This lines seq_no up on a 16-bit boundary, so that loading it is a single ldrh instruction (the AArch64 halfword load) rather than requiring a shift. Similarly, reading from elapsed_or_sent_ts only requires a mask.
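For concreteness, here is where each field lands, assuming the usual LSB-first bit-field allocation on little-endian Linux (the allocation order is implementation-defined, so treat this as an assumption):

/* Assumed allocation within the 64-bit storage unit:
 *   elapsed_or_sent_ts : bits  0-42  -> one AND mask, no shift
 *   received           : bit  43
 *   counter            : bits 44-47
 *   seq_no             : bits 48-63  -> halfword load at byte offset 6,
 *                                       a single ldrh on AArch64
 */
static inline uint16_t ping_seq_no(const struct ping_timestamp *p)
{
    return (uint16_t)p->seq_no;  // no shift, no mask
}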

In the end, this was a completely pointless exercise. Our application isn’t remotely memory constrained.

But it was fun.

Addendum 2025-06-21

I realized there’s a way to “optimize” this slightly further. By switching the order of the received/counter fields, received becomes the most significant bit of its byte, so accessing it only requires a byte load and a shift instruction rather than a shift and a mask:

struct ping_timestamp {
    uint64_t elapsed_or_sent_ts : 43;
    uint64_t counter: 4;
    uint64_t received : 1;
    uint64_t seq_no: 16;
};
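Under that assumed layout, an accessor like this (ping_received is an illustrative name) can compile down to a byte load and one shift:

static inline bool ping_received(const struct ping_timestamp *p)
{
    // received is the top bit of byte 5: ldrb then lsr #7, no mask needed
    return p->received;
}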

Addendum 2025-06-22

There’s a slight “issue” [3] with the above code: received is now much cheaper to access, at the expense of counter needing to mask out the received bit.

But we can fix this! We only ever read counter when received is true, i.e. 1. If received were zero, and we could tell the compiler to assume that it were zero, then no mask would be necessary.

The solution? Flip the meaning of the received bit.

struct ping_timestamp {
    uint64_t elapsed_or_sent_ts : 43;
    uint64_t counter: 4;
    uint64_t not_received : 1;
    uint64_t seq_no: 16;
};

Now, if the read of counter only ever happens inside a conditional that checks that not_received is zero, the compiler is able to completely elide the mask.
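A sketch of such a guarded read (ping_counter_if_received is an illustrative name):

// Returns the counter, or -1 if no reply has arrived yet.
static inline int ping_counter_if_received(const struct ping_timestamp *p)
{
    if (p->not_received)
        return -1;
    // Here the compiler knows bit 47 is zero, so after the shift there
    // is nothing above the 4 counter bits left to mask off.
    return (int)p->counter;
}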


  1. The timestamps are taken from the Linux kernel’s monotonic clock, which measures time elapsed since boot. If we were measuring from the Unix epoch, then we would need 51 bits.

  2. As long as the IP address doesn’t change more than 16 times in the period we’re monitoring, which is not something we have ever seen.

  3. I’m using that term loosely; we haven’t even benchmarked anything.