A colleague of mine was recently discussing with me a connectivity monitoring system he is working on. It’s nothing fancy, just sending ICMP Echo Requests to a couple of different servers, and monitoring latency and dropped-packet averages over 1-minute, 5-minute, and 15-minute periods. Up came the topic of how this data should be stored; the natural thought was a 512-entry ring buffer, containing entries like the following:
struct ping_timestamp {
    uint64_t sent_ns;       // when the packet was sent (unit: ns)
    uint64_t received_ns;   // when the packet was received (unit: ns)
    in_addr_t source_addr;  // source address
    uint16_t seq_no;        // echo request sequence number
    bool received;          // has the request been received?
};
And backed by the following array
struct ping_timestamp pings_rb[512];
// ...
("%zu\n", sizeof(pings_rb));
printf
// 12288
12 KiB. Pretty wasteful, right? We can certainly do better. Do we need to keep fields for both sent and received? What we’re really interested in is the latency. We only need to know when a packet was sent up until we know when it was received; at that point, the data we want to keep is received - sent, so why don’t we make it a tagged union?
struct ping_timestamp_2 {
    union {
        uint64_t sent_ts;    // unit: 100μs
        uint64_t elapsed_ts; // unit: 100μs
    };
    in_addr_t source_addr;
    uint16_t seq_no;
    bool received;
};
// ...
("%zu\n", sizeof(pings_rb));
printf
// 8192
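As an aside, here is a rough sketch of how the tagged union gets used (the helper name and its now_ts argument are illustrative, not from the actual application): while a request is outstanding the entry holds sent_ts, and once the reply arrives the same field is overwritten with the elapsed time.

static void mark_received(struct ping_timestamp_2 *p, uint64_t now_ts)
{
    if (!p->received) {
        p->elapsed_ts = now_ts - p->sent_ts; // reuse the same 8 bytes
        p->received = true;
    }
}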
Not bad, we’ve shaved off an entire page. We can still do better.
Nanosecond precision? In our case, ping times are measured in the tens, hundreds, or even thousands of milliseconds. We don’t need to keep all of those extra bits around. If we change the unit from nanoseconds to 100-microsecond increments (0.1 ms), then 43 bits is sufficient for us to keep track of pings for up to 20 years[1]. 20 years is still a bit excessive, but it doesn’t hurt to be at least a little bit future-proof.
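As a rough sanity check of that claim (a sketch of my own; the helper name is made up), converting the monotonic clock into 0.1 ms ticks and working out the range looks like this:

#include <stdint.h>
#include <time.h>

// Sketch: read the monotonic clock and convert it into 100 µs (0.1 ms) ticks.
static uint64_t now_100us(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    uint64_t ns = (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
    return ns / 100000; // 100 µs = 100,000 ns per tick
}

// 2^43 ticks * 100 µs/tick ≈ 8.8e8 seconds ≈ 27.9 years of uptime,
// so 43 bits comfortably covers the 20-year target.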
And received? 8 bits for a true/false value seems altogether too much. The answer: bitfields.
struct ping_timestamp_3 {
    uint64_t sent_or_elapsed_ts: 43;
    uint64_t received: 1;
    uint64_t seq_no: 16;
    in_addr_t source_addr;
};
// ...
("%zu\n", sizeof(pings_rb));
printf
// 8192
Wait, what? Why haven’t we saved any space?
The answer is struct padding. The layout of ping_timestamp_2 looks like this:
0       8       16      24      32      40      48      56      64...
+-------+-------+-------+-------+-------+-------+-------+-------+
|                         sent/elapsed                          |
+-------+-------+-------+-------+-------+-------+-------+-------+
...64   72      80      88      96      104     112     120     128
+-------+-------+-------+-------+-------+-------+-------+-------+
|         source_address        |     seq_no    | recv? |  pad  |
+---------------------------------------------------------------+
Where the padding byte at the end is there to satisfy alignment requirements. ping_timestamp_3, on the other hand, looks like this:
0       8       16      24      32      40 R?   48      56      64...
+-------+-------+-------+-------+-------+-------+-------+-------+
|               sent/elapsed               ||    seq_no     | P |
+-------+-------+-------+-------+-------+-------+-------+-------+
...64   72      80      88      96      104     112     120     128
+-------+-------+-------+-------+-------+-------+-------+-------+
|         source_address        |            padding            |
+---------------------------------------------------------------+
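These sizes and offsets are easy to double-check with a couple of compile-time assertions (my own addition, assuming C11 and a typical 64-bit Linux ABI):

#include <assert.h>
#include <stddef.h>

static_assert(sizeof(struct ping_timestamp_2) == 16,
              "8B union + 4B addr + 2B seq_no + 1B bool + 1B padding");
static_assert(sizeof(struct ping_timestamp_3) == 16,
              "8B bitfield word (4 spare bits) + 4B addr + 4B tail padding");
static_assert(offsetof(struct ping_timestamp_3, source_addr) == 8,
              "source_addr starts after the full 64-bit bitfield unit");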
So our optimization there didn’t actually save any space; we’re wasting 36 bits on padding. Is there any way we can do better?
We keep track of the source address because it changes frequently while our product is in operation (on a mobile data network). When the address changes, we also reset the sequence number, for reasons that aren’t relevant to the current topic. We have, in the past, seen packets with differing source addresses but identical sequence numbers being processed by our application at the same time (the joys of asynchronous programming), so the source address serves to disambiguate them.
But there’s another way to disambiguate.
An ICMP echo request has a 16-bit identifier field to allow applications to identify which echo request packets were sent by them. Its value is completely arbitrary. On Linux, iputils ping sets it to getpid() & 0xFFFF; on OpenBSD, a random number is used instead.
Although it’s 16 bits long, we don’t actually need to use the full 16 bits. There are 4 free bits left in the first 8 bytes of our ping_timestamp_3. Our thought was to use a rolling 4-bit counter that is incremented whenever our source address changes (this is monitored elsewhere in the application), allowing us to uniquely identify[2] which source address the packet came from.
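Sketched out, the idea is roughly this (the names and the change hook are illustrative, not the actual application code):

// Rolling 4-bit counter: bumped each time the source address changes,
// and stamped into every outgoing echo request's ring buffer entry.
static uint8_t addr_counter;

static void on_source_addr_change(void)
{
    addr_counter = (addr_counter + 1) & 0xF; // wraps after 16 changes
}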
Our final struct looks like this:
struct ping_timestamp {
    uint64_t elapsed_or_sent_ts: 43;
    uint64_t received: 1;
    uint64_t counter: 4;
    uint64_t seq_no: 16;
};
// ...
("%zu\n", sizeof(pings_rb));
printf
// 4096
Much better. A whole 8 KiB of savings, and down to a single page of data. You may have noticed that I have changed the order of the fields slightly. This is to line seq_no up on a 16-bit boundary, so that loading it is a single ldrh instruction rather than requiring a shift. Similarly, reading from elapsed_or_sent_ts only requires a mask.
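To illustrate that claim (a sketch only; nothing here has been benchmarked), these are the kinds of reads being talked about: on this layout seq_no sits at byte offset 6, and the timestamp occupies the low 43 bits.

// Sketch: seq_no is 16-bit aligned, so reading it is a plain half-word
// load; the timestamp is the low 43 bits, so reading it is a load plus
// a mask, with no shift needed.
static uint16_t get_seq_no(const struct ping_timestamp *p)
{
    return p->seq_no;
}

static uint64_t get_ts(const struct ping_timestamp *p)
{
    return p->elapsed_or_sent_ts;
}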
In the end, this was a completely pointless exercise. Our application isn’t remotely memory constrained.
But it was fun.
I realized there’s a way to “optimize” this slightly further. By switching the order of the received/counter fields, accessing the received bit only requires a shift instruction rather than a shift and a mask:
struct ping_timestamp {
    uint64_t elapsed_or_sent_ts: 43;
    uint64_t counter: 4;
    uint64_t received: 1;
    uint64_t seq_no: 16;
};
There’s a slight “issue”[3] with the above code: received is now much cheaper to access, at the expense of counter needing to mask out the received bit. But we can fix this! We only ever read counter when received is true, i.e. 1. If received were zero, and we could tell the compiler to assume that it were zero, then no mask would be necessary.
The solution? Flip the meaning of the received bit.
struct ping_timestamp {
    uint64_t elapsed_or_sent_ts: 43;
    uint64_t counter: 4;
    uint64_t not_received: 1;
    uint64_t seq_no: 16;
};
Now, if the read of counter only ever happens inside of a conditional that checks whether not_received is zero, then the compiler is able to completely elide the mask.
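As a sketch (the helper and its sentinel value are mine, not from the application), the access pattern would look something like this:

// counter is only ever read after checking not_received == 0, so the
// compiler can assume the bit above counter is zero and skip masking it.
static int counter_of(const struct ping_timestamp *p)
{
    if (p->not_received)
        return -1; // hypothetical "no reply yet" sentinel
    return p->counter; // not_received is known to be 0 here
}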
[1] The timestamps are taken from the Linux kernel’s monotonic clock, which measures time elapsed since boot. If we were measuring from the Unix epoch, then we would need 51 bits.
[2] As long as the IP address doesn’t change more than 16 times in the period we’re monitoring, which is not something we have ever seen.
[3] I’m using that term loosely; we haven’t even benchmarked anything.