<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Charlie Gallagher on Programming</title>
    <link>https://charlie-gallagher.github.io</link>
    <description>Personal blog and website</description>
    <language>en-us</language>

    
    <item>
      <title>Multiprocessing TSV repair</title>
      <link>https://charlie-gallagher.github.io/2026/03/17/tsv-repair-multiprocessing.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/03/17/tsv-repair-multiprocessing.html</guid>
      <pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>A few days ago I wrote about
<a href="/2026/03/12/tsv-repair.html">optimizing a TSV repair script</a> that took
large TSV files with unquoted newline characters like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a
multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>And turned them into valid TSV files like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>I initially dismissed multiprocessing as pesky, complicated, and unlikely to
yield good results. But after returning to it, I found that not only is the
problem tractable, it can deliver real performance benefits.</p>

<p>I’ve written a compatible multiprocessing TSV repair script that has taken the
best end-to-end time from 5.78s down to 3.04s (though the average is somewhat
worse).</p>

<h1 id="what-i-missed">What I missed</h1>
<p>In the last post, I discounted the multiprocessing approach for the following
reason:</p>

<blockquote>
  <p>We can’t align on a row boundary without knowing how many tab characters have
come before the current <code class="language-plaintext highlighter-rouge">seek</code> position, and we can’t do that without indexing
all of the tabs in the file. That’s what we’re already doing in the basic
implementation of the TSV repair script, so I don’t see a way for
multiprocessing to be faster than sequential processing.</p>
</blockquote>

<p>I didn’t consider that you could <em>also parallelize the tab indexing</em>.</p>

<p>This is the basis of the <em>two-pass</em> approach to distributed parsing of
delimited files. The first pass uses workers to gather statistics about the
file. Then, the master uses those statistics to re-evaluate the naive ranges it
assigned to the workers. Finally, the master tells the workers what
modifications they need to make to their assigned ranges to work with only
complete records, and the workers go to work. For a complete description, see
<a href="https://badrish.net/papers/dp-sigmod19.pdf">https://badrish.net/papers/dp-sigmod19.pdf</a>.</p>

<p>For valid delimited files, where newline characters are allowed only inside
quoted fields, the real question for each worker is whether its initial
position falls inside a quoted field. In the paper linked above, the authors
answer that question speculatively. It’s a nifty technique, and if you’re
interested I recommend reading through the paper. The short version is that
each worker “sniffs” the first megabyte or so of data in its chunk and makes an
educated guess about whether it started in a quoted field. The workers don’t
communicate back with the master; they just proceed with their guess. If one of
them encounters an error, the whole thing falls back to the two-pass
approach.</p>

<p>In my case, there are no quotes, so I had to stay with the two-pass approach and
adapt it to the malformed TSVs I’m dealing with.</p>

<p>But before I get to the new script, I wanted to mention one more interesting
feature of these TSV files that I didn’t notice before.</p>

<h1 id="inherent-ambiguity">Inherent ambiguity</h1>
<p>While working on the multiprocessor script, I realized that I had missed an
ambiguity lurking in these malformed TSV files. Given a file like the following,
there’s no “correct” interpretation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>one,two,three
four
five,six,seven
</code></pre></div></div>

<p>This could be interpreted in one of two ways:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Version 1
one,two,three\nfour
five,six,seven

# Version 2
one,two,three
four\nfive,six,seven
</code></pre></div></div>

<p>The problem is that when the final field contains a newline character, we can’t
say whether it’s a record delimiter or an unquoted newline.</p>

<p>It’s computationally easier to use a non-greedy approach, which produces
interpretation Version 2. You read lines and stop reading as soon as you have
accumulated the correct number of field delimiters (tabs, commas). During
processing of the above ambiguous snippet, the processor first reads the line
<code class="language-plaintext highlighter-rouge">one,two,three</code> and finds it complete. Then, it starts building the next record
with <code class="language-plaintext highlighter-rouge">four</code>, joins it with the next line, and finds that the record is now
complete.</p>
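<p>As a sketch, the non-greedy rule looks something like this. This is a
hypothetical helper for illustration, not the actual repair script, and it
assumes no record ever accumulates too many delimiters:</p>

```python
def non_greedy_records(lines, n_fields, sep=","):
    """Join lines until a record has n_fields - 1 delimiters.

    Illustrative sketch of the non-greedy rule, not the real repair
    script; assumes no line ever has too many delimiters.
    """
    records = []
    buf = ""
    for line in lines:
        # an unquoted newline joins the buffered partial record
        buf = buf + "\n" + line if buf else line
        # stop as soon as enough field delimiters have accumulated
        if buf.count(sep) >= n_fields - 1:
            records.append(buf)
            buf = ""
    return records

# The ambiguous snippet resolves to Version 2:
print(non_greedy_records(["one,two,three", "four", "five,six,seven"], 3))
# ['one,two,three', 'four\nfive,six,seven']
```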

<p>Fortunately this rule is as easy to follow for a sequential processor as for a
parallel processor.</p>

<h1 id="adapting-the-two-phase-parallel-parser">Adapting the two-phase parallel parser</h1>
<p>To make this work, I had to make the definition of a row a little more strict.
The parallel processor can no longer tolerate lines that have too many tabs in
them – it’s assumed that every record is composed of the same number of fields.</p>

<p>With that assumption, it becomes a given that the total number of tabs in the
file is a multiple of the number of tabs in the header, i.e. <code class="language-plaintext highlighter-rouge">n_fields - 1</code>. So,
I split the file evenly into chunks and assign each worker a chunk. The workers
count the number of tabs in their chunks and report back to the master process.
The master process then figures out how many tabs each worker needs to <em>skip</em> in
order to land on a record boundary. The workers then treat the rest of their
range as a normal TSV file, stopping once they’ve read their last full record
that started before the end of their chunk. So workers often read past the end
of their assigned byte range in the file, and the calculations performed by the
master process ensure that the next worker down the line knows how far the
previous worker had to over-read.</p>
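<p>The master’s alignment step can be sketched as follows. This is a simplified
illustration, not the actual implementation: it assumes the header has already
been accounted for and glosses over the edge case where a chunk happens to
start exactly on a record boundary.</p>

```python
def alignment_skips(chunk_tab_counts, tabs_per_record):
    """For each worker, the number of tabs to skip before it lands
    on a record boundary.

    Simplified sketch of the master's calculation: chunk_tab_counts
    holds the tab count each worker reported for its naive chunk,
    and tabs_per_record is n_fields - 1.
    """
    skips = []
    tabs_before = 0  # total tabs in all chunks before this worker's
    for count in chunk_tab_counts:
        partial = tabs_before % tabs_per_record
        # tabs still needed to finish the record this chunk starts inside
        skips.append((tabs_per_record - partial) % tabs_per_record)
        tabs_before += count
    return skips

# Three fields per record -> two tabs per record
print(alignment_skips([5, 4, 7], 2))
# [0, 1, 1]
```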

<p>The implementation is a minefield of boundary issues and off-by-one errors, but
in the end I got it working correctly and reasonably efficiently. The full
implementation is available <a href="https://github.com/charlie-gallagher/tsv-repair/blob/main/repair_bytes_buffered_read_write_multiprocessing_two_pass.py">on GitHub</a>.</p>

<h1 id="performance">Performance</h1>
<p>As I said in the last post, for best performance, you should leave the files in
separate pieces, and that’s exactly what I did for the performance benchmarks. I
did write an optional extension that recombines the files, and I used that to
confirm that the “repaired” version of various files matched a known-good
processor.</p>

<p>The performance is unstable, but generally very good. Performance seems to
depend on how busy the system is with other, more “important” work. All of the
workers are also accessing the same file, which creates the possibility for
contention. They’re all reads, but as the system flips from one process to
another, file accesses become more random.</p>

<p>Still, it’s sometimes exceptional. The best recorded time so far was 3.04
seconds, almost a 2x improvement on the previous best time. Here’s a smattering
of results, with a normal sequential run thrown in the middle.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-17 15:35:40	repair_bytes_buffered_read_write_multiprocessing_two_pass	3.394725
2026-03-17 15:35:57	repair_bytes_buffered_read_write_multiprocessing_two_pass	3.986272
2026-03-17 15:36:08	repair_bytes_buffered_read_write	6.599493
2026-03-17 15:36:18	repair_bytes_buffered_read_write_multiprocessing_two_pass	9.798484
2026-03-17 15:36:35	repair_bytes_buffered_read_write_multiprocessing_two_pass	7.341621
2026-03-17 15:37:02	repair_bytes_buffered_read_write_multiprocessing_two_pass	3.043876
</code></pre></div></div>

<p>The trimmed mean is 4.9 seconds.</p>

<p>When I turn on the feature that re-combines the files at the end, performance
dips to more like 12 seconds per run, more or less as expected.</p>

<h2 id="update-2026-03-18">Update: 2026-03-18</h2>
<p>I ran more tests on this to see where the bottlenecks might be. I ran a very
large file (10 GB) and the performance of the multiprocessing version of the
code was about the same as the sequential version. A few things I noted:</p>

<ul>
  <li>I did optimize the alignment code, since this large file has 1200 columns in
it. But the alignment code is not the bottleneck, usually taking somewhere in
the range of 1-2 milliseconds.</li>
  <li>With only one worker, I found that each chunk of the file was processed in
only 2 seconds. If you kept that speed up, you would process the whole file in
around 5-7 seconds.</li>
  <li>With 4 workers, each chunk was processed in 6 seconds, and with 8 workers
(one per vCPU) each took 12 seconds.</li>
</ul>

<p>So I/O contention is the most likely cause of the limited performance on large
files. And of course multiprocessing is best when you can put multiple
processors to work at once, doing calculations and whatnot. I found that when I
increased the newline density to 0.2 again, the multiprocessing code was
significantly faster than the sequential code (14s compared to 26s). So even
though the task is I/O bound, multiprocessing seems to perform at least as well
as, and often better than, sequential processing.</p>

<h1 id="comments">Comments</h1>
<p>This was a serious bump in complexity and peskiness, but I’m thrilled with the
performance (most of the time). I think there might be a bit more performance to
be squeezed out of it. The I/O could most likely be faster when I’m scanning
forward to skip tabs, but I’m plenty happy with this implementation, and I’m
satisfied that the cost of scanning forward is bounded by the number of fields,
not the number of rows.</p>

<p>Profiling becomes tricky with multiprocessing, so I skipped it for these runs.
If you’re a profiler junkie, I’ll gladly accept any PRs.</p>

<p>Ultimately I wouldn’t recommend this for production, because it depends so
heavily on there being a correct number of tabs. Any misalignment and the
quality goes out the window, or you’d have to write a guard that makes the
master process fall over if the number of tabs is wrong. But fun to get working
anyway!</p>


        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>A few days ago I wrote about
<a href="/2026/03/12/tsv-repair.html">optimizing a TSV repair script</a> that took
large TSV files with unquoted newline characters like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a
multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>And turned them into valid TSV files like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>I initially dismissed multiprocessing as pesky, complicated, and unlikely to
yield good results. But after returning to it, I found that not only is the
problem tractable, it can deliver real performance benefits.</p>

<p>I’ve written a compatible multiprocessing TSV repair script that has taken the
best end-to-end time from 5.78s down to 3.04s (though the average is somewhat
worse).</p>

<h1 id="what-i-missed">What I missed</h1>
<p>In the last post, I discounted the multiprocessing approach for the following
reason:</p>

<blockquote>
  <p>We can’t align on a row boundary without knowing how many tab characters have
come before the current <code class="language-plaintext highlighter-rouge">seek</code> position, and we can’t do that without indexing
all of the tabs in the file. That’s what we’re already doing in the basic
implementation of the TSV repair script, so I don’t see a way for
multiprocessing to be faster than sequential processing.</p>
</blockquote>

<p>I didn’t consider that you could <em>also parallelize the tab indexing</em>.</p>

<p>This is the basis of the <em>two-pass</em> approach to distributed parsing of
delimited files. The first pass uses workers to gather statistics about the
file. Then, the master uses those statistics to re-evaluate the naive ranges it
assigned to the workers. Finally, the master tells the workers what
modifications they need to make to their assigned ranges to work with only
complete records, and the workers go to work. For a complete description, see
<a href="https://badrish.net/papers/dp-sigmod19.pdf">https://badrish.net/papers/dp-sigmod19.pdf</a>.</p>

<p>For valid delimited files, where newline characters are allowed only inside
quoted fields, the real question for each worker is whether its initial
position falls inside a quoted field. In the paper linked above, the authors
answer that question speculatively. It’s a nifty technique, and if you’re
interested I recommend reading through the paper. The short version is that
each worker “sniffs” the first megabyte or so of data in its chunk and makes an
educated guess about whether it started in a quoted field. The workers don’t
communicate back with the master; they just proceed with their guess. If one of
them encounters an error, the whole thing falls back to the two-pass
approach.</p>

<p>In my case, there are no quotes, so I had to stay with the two-pass approach and
adapt it to the malformed TSVs I’m dealing with.</p>

<p>But before I get to the new script, I wanted to mention one more interesting
feature of these TSV files that I didn’t notice before.</p>

<h1 id="inherent-ambiguity">Inherent ambiguity</h1>
<p>While working on the multiprocessor script, I realized that I had missed an
ambiguity lurking in these malformed TSV files. Given a file like the following,
there’s no “correct” interpretation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>one,two,three
four
five,six,seven
</code></pre></div></div>

<p>This could be interpreted in one of two ways:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Version 1
one,two,three\nfour
five,six,seven

# Version 2
one,two,three
four\nfive,six,seven
</code></pre></div></div>

<p>The problem is that when the final field contains a newline character, we can’t
say whether it’s a record delimiter or an unquoted newline.</p>

<p>It’s computationally easier to use a non-greedy approach, which produces
interpretation Version 2. You read lines and stop reading as soon as you have
accumulated the correct number of field delimiters (tabs, commas). During
processing of the above ambiguous snippet, the processor first reads the line
<code class="language-plaintext highlighter-rouge">one,two,three</code> and finds it complete. Then, it starts building the next record
with <code class="language-plaintext highlighter-rouge">four</code>, joins it with the next line, and finds that the record is now
complete.</p>
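<p>As a sketch, the non-greedy rule looks something like this. This is a
hypothetical helper for illustration, not the actual repair script, and it
assumes no record ever accumulates too many delimiters:</p>

```python
def non_greedy_records(lines, n_fields, sep=","):
    """Join lines until a record has n_fields - 1 delimiters.

    Illustrative sketch of the non-greedy rule, not the real repair
    script; assumes no line ever has too many delimiters.
    """
    records = []
    buf = ""
    for line in lines:
        # an unquoted newline joins the buffered partial record
        buf = buf + "\n" + line if buf else line
        # stop as soon as enough field delimiters have accumulated
        if buf.count(sep) >= n_fields - 1:
            records.append(buf)
            buf = ""
    return records

# The ambiguous snippet resolves to Version 2:
print(non_greedy_records(["one,two,three", "four", "five,six,seven"], 3))
# ['one,two,three', 'four\nfive,six,seven']
```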

<p>Fortunately this rule is as easy to follow for a sequential processor as for a
parallel processor.</p>

<h1 id="adapting-the-two-phase-parallel-parser">Adapting the two-phase parallel parser</h1>
<p>To make this work, I had to make the definition of a row a little more strict.
The parallel processor can no longer tolerate lines that have too many tabs in
them – it’s assumed that every record is composed of the same number of fields.</p>

<p>With that assumption, it becomes a given that the total number of tabs in the
file is a multiple of the number of tabs in the header, i.e. <code class="language-plaintext highlighter-rouge">n_fields - 1</code>. So,
I split the file evenly into chunks and assign each worker a chunk. The workers
count the number of tabs in their chunks and report back to the master process.
The master process then figures out how many tabs each worker needs to <em>skip</em> in
order to land on a record boundary. The workers then treat the rest of their
range as a normal TSV file, stopping once they’ve read their last full record
that started before the end of their chunk. So workers often read past the end
of their assigned byte range in the file, and the calculations performed by the
master process ensure that the next worker down the line knows how far the
previous worker had to over-read.</p>
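<p>The master’s alignment step can be sketched as follows. This is a simplified
illustration, not the actual implementation: it assumes the header has already
been accounted for and glosses over the edge case where a chunk happens to
start exactly on a record boundary.</p>

```python
def alignment_skips(chunk_tab_counts, tabs_per_record):
    """For each worker, the number of tabs to skip before it lands
    on a record boundary.

    Simplified sketch of the master's calculation: chunk_tab_counts
    holds the tab count each worker reported for its naive chunk,
    and tabs_per_record is n_fields - 1.
    """
    skips = []
    tabs_before = 0  # total tabs in all chunks before this worker's
    for count in chunk_tab_counts:
        partial = tabs_before % tabs_per_record
        # tabs still needed to finish the record this chunk starts inside
        skips.append((tabs_per_record - partial) % tabs_per_record)
        tabs_before += count
    return skips

# Three fields per record -> two tabs per record
print(alignment_skips([5, 4, 7], 2))
# [0, 1, 1]
```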

<p>The implementation is a minefield of boundary issues and off-by-one errors, but
in the end I got it working correctly and reasonably efficiently. The full
implementation is available <a href="https://github.com/charlie-gallagher/tsv-repair/blob/main/repair_bytes_buffered_read_write_multiprocessing_two_pass.py">on GitHub</a>.</p>

<h1 id="performance">Performance</h1>
<p>As I said in the last post, for best performance, you should leave the files in
separate pieces, and that’s exactly what I did for the performance benchmarks. I
did write an optional extension that recombines the files, and I used that to
confirm that the “repaired” version of various files matched a known-good
processor.</p>

<p>The performance is unstable, but generally very good. Performance seems to
depend on how busy the system is with other, more “important” work. All of the
workers are also accessing the same file, which creates the possibility for
contention. They’re all reads, but as the system flips from one process to
another, file accesses become more random.</p>

<p>Still, it’s sometimes exceptional. The best recorded time so far was 3.04
seconds, almost a 2x improvement on the previous best time. Here’s a smattering
of results, with a normal sequential run thrown in the middle.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-17 15:35:40	repair_bytes_buffered_read_write_multiprocessing_two_pass	3.394725
2026-03-17 15:35:57	repair_bytes_buffered_read_write_multiprocessing_two_pass	3.986272
2026-03-17 15:36:08	repair_bytes_buffered_read_write	6.599493
2026-03-17 15:36:18	repair_bytes_buffered_read_write_multiprocessing_two_pass	9.798484
2026-03-17 15:36:35	repair_bytes_buffered_read_write_multiprocessing_two_pass	7.341621
2026-03-17 15:37:02	repair_bytes_buffered_read_write_multiprocessing_two_pass	3.043876
</code></pre></div></div>

<p>The trimmed mean is 4.9 seconds.</p>

<p>When I turn on the feature that re-combines the files at the end, performance
dips to more like 12 seconds per run, more or less as expected.</p>

<h2 id="update-2026-03-18">Update: 2026-03-18</h2>
<p>I ran more tests on this to see where the bottlenecks might be. I ran a very
large file (10 GB) and the performance of the multiprocessing version of the
code was about the same as the sequential version. A few things I noted:</p>

<ul>
  <li>I did optimize the alignment code, since this large file has 1200 columns in
it. But the alignment code is not the bottleneck, usually taking somewhere in
the range of 1-2 milliseconds.</li>
  <li>With only one worker, I found that each chunk of the file was processed in
only 2 seconds. If you kept that speed up, you would process the whole file in
around 5-7 seconds.</li>
  <li>With 4 workers, each chunk was processed in 6 seconds, and with 8 workers
(one per vCPU) each took 12 seconds.</li>
</ul>

<p>So I/O contention is the most likely cause of the limited performance on large
files. And of course multiprocessing is best when you can put multiple
processors to work at once, doing calculations and whatnot. I found that when I
increased the newline density to 0.2 again, the multiprocessing code was
significantly faster than the sequential code (14s compared to 26s). So even
though the task is I/O bound, multiprocessing seems to perform at least as well
as, and often better than, sequential processing.</p>

<h1 id="comments">Comments</h1>
<p>This was a serious bump in complexity and peskiness, but I’m thrilled with the
performance (most of the time). I think there might be a bit more performance to
be squeezed out of it. The I/O could most likely be faster when I’m scanning
forward to skip tabs, but I’m plenty happy with this implementation, and I’m
satisfied that the cost of scanning forward is bounded by the number of fields,
not the number of rows.</p>

<p>Profiling becomes tricky with multiprocessing, so I skipped it for these runs.
If you’re a profiler junkie, I’ll gladly accept any PRs.</p>

<p>Ultimately I wouldn’t recommend this for production, because it depends so
heavily on there being a correct number of tabs. Any misalignment and the
quality goes out the window, or you’d have to write a guard that makes the
master process fall over if the number of tabs is wrong. But fun to get working
anyway!</p>


        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>Optimizing TSV Repair in Python</title>
      <link>https://charlie-gallagher.github.io/2026/03/12/tsv-repair.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/03/12/tsv-repair.html</guid>
      <pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>This TSV file has a problem:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a
multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>There’s an unquoted multi-line field on the second line. I’m not aware of a TSV
parser that can correctly parse this – most consider it an invalid file, some
NULL-fill the remaining fields on any line that has incomplete data.</p>

<p>I regularly ingest data from a particular data source that has this bad feature
in it, and the problem has gotten worse recently. I’ve decided to fix it. The
question is, “What’s the best way to fix unquoted newline characters?”</p>

<p>I created this repository with my results and some tooling for benchmarking and profiling: <a href="https://github.com/charlie-gallagher/tsv-repair">https://github.com/charlie-gallagher/tsv-repair</a></p>

<h1 id="problem-statement">Problem statement</h1>
<p>Like the <a href="https://www.morling.dev/blog/one-billion-row-challenge/">Billion Row Challenge</a>, the goal is
to process large files as quickly as possible using some base programming
language.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> In my case, I worked in base Python 3.13 on macOS.</p>

<p>Using pure Python (stdlib), repair a large (10GB), utf-8 encoded TSV file with a
knowable number of fields by (a) identifying incomplete lines, (b) combining
successive incomplete lines until they form a complete line, and (c) not
combining lines if the result is a row with too many fields. A single row may
contain one or more embedded newlines, i.e. a row might be spread across
multiple lines of the file. The lines that form a row are always ordered
correctly, successive and contiguous. Newlines are LF only, not CRLF. To find
the number of fields, you can read the first line, which is always the header.</p>

<p>To make things simpler, even if a field is quoted, you can still join successive
split lines.</p>

<p>In the end, the file should look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>i.e. replace the newline character with a space. If there are multiple newline
characters in a row (<code class="language-plaintext highlighter-rouge">hello\n\nworld</code>), replace each one with a space. If a
field starts or ends with a newline, you can still replace newlines with a
space.</p>

<h1 id="basic-solution">Basic solution</h1>
<p>Here’s a straightforward solution I came up with that passes all the tests.
There are one or two inelegancies, like the <code class="language-plaintext highlighter-rouge">_need_to_write</code> variable that keeps
track of some state information, but it does the thing.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
        <span class="c1"># Start by copying header
</span>        <span class="n">header</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
        <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">header</span><span class="p">)</span>

        <span class="n">expected_tabs</span> <span class="o">=</span> <span class="n">header</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>

        <span class="c1"># Then, iterate over the lines, repairing as you go
</span>        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="n">line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">line</span><span class="p">:</span>
                <span class="k">break</span>
            <span class="n">line_tabs</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">==</span> <span class="n">expected_tabs</span><span class="p">:</span>
                <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                <span class="k">continue</span>

            <span class="c1"># Line repair
</span>            <span class="c1"># Grab next line and see if it complements
</span>            <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">True</span>
            <span class="k">while</span> <span class="n">line_tabs</span> <span class="o">&lt;</span> <span class="n">expected_tabs</span><span class="p">:</span>
                <span class="n">continuation_line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
                <span class="k">if</span> <span class="ow">not</span> <span class="n">continuation_line</span><span class="p">:</span>
                    <span class="k">break</span>
                <span class="n">cline_tabs</span> <span class="o">=</span> <span class="n">continuation_line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">+</span> <span class="n">cline_tabs</span> <span class="o">&lt;=</span> <span class="n">expected_tabs</span><span class="p">:</span>
                    <span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span> <span class="o">+</span> <span class="s">" "</span> <span class="o">+</span> <span class="n">continuation_line</span>
                    <span class="n">line_tabs</span> <span class="o">+=</span> <span class="n">cline_tabs</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="c1"># Adding these lines would create a row with
</span>                    <span class="c1"># too many fields
</span>                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">continuation_line</span><span class="p">)</span>
                    <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">False</span>
                    <span class="k">break</span>
            <span class="k">if</span> <span class="n">_need_to_write</span><span class="p">:</span>
                <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</code></pre></div></div>

<p>And huzzah, the tests are passing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 -c "from test_repair import main; from repair_basic import repair; main(repair)"
Testing repair function against golden files in /Users/charlie/tsv-repair/test_files

  PASS  already_good.tsv
  PASS  basic.tsv
  PASS  basic_incomplete_first_line.tsv
  PASS  basic_incomplete_last_line.tsv
  PASS  multi_newline.tsv
  PASS  newline_as_first_char_in_field.tsv
  PASS  newline_as_last_char_in_field.tsv
  PASS  newline_as_only_char_in_field.tsv
  PASS  partial_solution.tsv
  PASS  partial_solution_2x.tsv
  PASS  quoted.tsv
  PASS  too_many_tabs.tsv
  PASS  two_bad_lines_in_a_row.tsv

13/13 tests passed.
</code></pre></div></div>

<p>Benchmark on large file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 09:59:03	repair_basic	14.604409
</code></pre></div></div>

<p>The large file configuration for this run was: 1M rows, 120 columns, and 0.002
likelihood that a cell contains a newline character. The file was 3.0 GB. You
can generate a similar file using:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python generate_large_file.py -r 1000000 -c 120 --newline-likelihood 0.002 
</code></pre></div></div>

<p>There’s a cProfile script as well. Here’s the output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_basic.py 
Profiling repair_basic on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         4882830 function calls in 20.516 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.674    1.674   20.516   20.516 /Users/charlie/tsv-repair/repair_basic.py:2(repair)
  1000001    8.711    0.000    8.711    0.000 {method 'write' of '_io.TextIOWrapper' objects}
  1237862    5.031    0.000    6.102    0.000 {method 'readline' of '_io.TextIOWrapper' objects}
  1237861    3.800    0.000    3.800    0.000 {method 'count' of 'str' objects}
   389745    0.358    0.000    0.979    0.000 &lt;frozen codecs&gt;:322(decode)
   389745    0.621    0.000    0.621    0.000 {built-in method _codecs.utf_8_decode}
   237860    0.136    0.000    0.136    0.000 {method 'rstrip' of 'str' objects}
        2    0.093    0.047    0.093    0.047 {built-in method _io.open}
   389745    0.093    0.000    0.093    0.000 &lt;frozen codecs&gt;:334(getstate)
        2    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 &lt;frozen codecs&gt;:312(__init__)
        1    0.000    0.000    0.000    0.000 &lt;frozen codecs&gt;:189(__init__)
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
        1    0.000    0.000    0.000    0.000 &lt;frozen codecs&gt;:263(__init__)
</code></pre></div></div>

<p>Most time was spent reading and writing, followed by counting tab characters.
This is pretty much as you expect. This is an I/O bound program with some amount
of calculation. Out of 20 seconds, 14s were spent on I/O. The other top
hotspots were:</p>

<ul>
  <li>Decoding utf-8 (0.98s)</li>
  <li>Removing newline characters with rstrip (0.14s)</li>
</ul>

<h1 id="optimizations">Optimizations</h1>
<p>A good optimization would be to not write this in Python at all, but let’s
ignore that and assume we have to work in pure cPython. There are plenty of
optimizations we can reach for – some make more sense for an I/O-bound program
and others are more appropriate for a CPU-bound program. Scanning the files is
I/O-bound and fixing tabs involves an amount of CPU work, so it’s worth checking
both.</p>

<ol>
  <li>We can count tabs without decoding utf-8</li>
  <li>Buffer writes</li>
  <li>Buffer reads</li>
  <li>Avoid copying and doing extra work</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<h2 id="avoiding-utf-8-decoding">Avoiding utf-8 decoding</h2>
<p>The file is encoded in utf-8, which has the nice property that any ASCII byte
uniquely identifies that ASCII character. If you’re interested in finding tabs
<code class="language-plaintext highlighter-rouge">\t</code> (<code class="language-plaintext highlighter-rouge">0x09</code>), utf-8 ensures that no matter how many multi-byte characters you
have, none of them will contain this byte. In utf-8 multi-byte sequences, the
most significant bit of every byte is always set, so no multi-byte character can
contain <code class="language-plaintext highlighter-rouge">0x09</code>, whose most significant bit is unset.</p>

<p><img src="/assets/images/2026-tsv-repair/utf8.png" alt="utf-8 encoding diagram" /></p>

<p>Source: <a href="https://badrish.net/papers/dp-sigmod19.pdf">https://badrish.net/papers/dp-sigmod19.pdf</a></p>

<p>That means we can do away with decoding the bytes and just search for <code class="language-plaintext highlighter-rouge">0x09</code>, or
in Python <code class="language-plaintext highlighter-rouge">b"\t"</code>.</p>
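<p>As a quick sanity check (my own example, not from the original script), counting
the tab byte on the raw utf-8 bytes gives the same answer as counting the tab
character on the decoded string, even when multi-byte characters are present:</p>

```python
# Multi-byte utf-8 characters never contain the byte 0x09, so counting
# tabs on the raw bytes matches counting on the decoded string.
line = "0\tcafé\t日本語 comment\t20\n"
raw = line.encode("utf-8")

assert line.count("\t") == raw.count(b"\t") == 3
print(raw.count(b"\t"))  # 3
```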

<p>Here’s the diff with <code class="language-plaintext highlighter-rouge">repair_basic.py</code>.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">❯</span> diff -u repair_basic.py repair_bytes.py
<span class="gd">--- repair_basic.py     2026-03-12 16:42:34
</span><span class="gi">+++ repair_bytes.py     2026-03-13 10:27:08
</span><span class="p">@@ -1,18 +1,18 @@</span>
 
 def repair(input_file: str, output_file: str) -&gt; None:
<span class="gd">-    with open(input_file, "r") as fin, open(output_file, "w") as fout:
</span><span class="gi">+    with open(input_file, "rb") as fin, open(output_file, "wb") as fout:
</span>         # Start by copying header
         header = fin.readline()
         fout.write(header)
 
<span class="gd">-        expected_tabs = header.count("\t")
</span><span class="gi">+        expected_tabs = header.count(b"\t")
</span> 
         # Then, iterate over the lines, repairing as you go
         while True:
             line = fin.readline()
             if not line:
                 break
<span class="gd">-            line_tabs = line.count("\t")
</span><span class="gi">+            line_tabs = line.count(b"\t")
</span>             if line_tabs == expected_tabs:
                 fout.write(line)
                 continue
<span class="p">@@ -24,9 +24,9 @@</span>
                 continuation_line = fin.readline()
                 if not continuation_line:
                     break
<span class="gd">-                cline_tabs = continuation_line.count("\t")
</span><span class="gi">+                cline_tabs = continuation_line.count(b"\t")
</span>                 if line_tabs + cline_tabs &lt;= expected_tabs:
<span class="gd">-                    line = line.rstrip("\n") + " " + continuation_line
</span><span class="gi">+                    line = line.rstrip(b"\n") + b" " + continuation_line
</span>                     line_tabs += cline_tabs
                 else:
                     # Adding these lines would create a row with
</code></pre></div></div>

<p>In practice, this was horrific for performance. The code is basically identical
except now we’re searching for the tab byte instead of the tab character, and
we’re writing bytes. But the benchmarks are kind of startling.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date_time	module	elapsed_seconds
2026-03-13 10:37:49	repair_basic	13.902526
2026-03-13 10:44:06	repair_basic	14.782423
2026-03-13 10:36:58	repair_bytes	41.753419
2026-03-13 10:38:07	repair_bytes	48.101314
2026-03-13 10:44:25	repair_bytes	41.985785
</code></pre></div></div>

<p>The basic version takes only 14s or so, while the bytes version takes closer to
45s. What gives? The profile points out the issue:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes.py    
Profiling repair_bytes on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713592 function calls in 47.199 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.112    2.112   47.199   47.199 /Users/charlie/tsv-repair/repair_bytes.py:2(repair)
  1000001   35.681    0.000   35.681    0.000 {method 'write' of '_io.BufferedWriter' objects}
  1237862    5.865    0.000    5.865    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1237861    3.272    0.000    3.272    0.000 {method 'count' of 'bytes' objects}
   237860    0.138    0.000    0.138    0.000 {method 'rstrip' of 'bytes' objects}
        2    0.130    0.065    0.130    0.065 {built-in method _io.open}
        2    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Suddenly, we’re spending 35 seconds writing bytes. Most likely, the character
string writer automatically does some amount of buffering to optimize file
writes, and the bytes are not buffered at all. The other functions took around
the same amount of time as before. We did successfully drop the utf-8 decoding
logic, and that should save us a second or so all else equal. I’ll try buffering
the write output so there are fewer calls to write bytes.</p>

<h2 id="buffered-writes">Buffered writes</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li>Buffer writes*</li>
  <li>Buffer reads</li>
  <li>Avoid copying and doing extra work</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>The <code class="language-plaintext highlighter-rouge">io</code> stdlib module has a <code class="language-plaintext highlighter-rouge">BufferedWriter</code> we can use to buffer our byte
writes. Here’s the only change we need to make:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>

<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span><span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
            <span class="p">...</span>
</code></pre></div></div>

<p>And the results are pretty outstanding – a 2x speedup on the basic script, and
a 5x speedup on the unbuffered version of byte repair.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date_time	module	elapsed_seconds
2026-03-13 10:37:49	repair_basic	13.902526
2026-03-13 10:44:06	repair_basic	14.782423
2026-03-13 10:36:58	repair_bytes	41.753419
2026-03-13 10:38:07	repair_bytes	48.101314
2026-03-13 10:44:25	repair_bytes	41.985785
2026-03-13 11:15:02	repair_bytes_buffered_write	8.602299
2026-03-13 11:15:13	repair_bytes_buffered_write	7.982473
</code></pre></div></div>

<p>The profile shows that we now spend only about 2s writing data.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes.py
Profiling repair_bytes on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713593 function calls in 11.621 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.332    1.332   11.621   11.621 /Users/charlie/tsv-repair/repair_bytes.py:5(repair)
  1237862    5.051    0.000    5.051    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1237861    2.945    0.000    2.945    0.000 {method 'count' of 'bytes' objects}
  1000001    2.096    0.000    2.096    0.000 {method 'write' of '_io.BufferedWriter' objects}
   237860    0.101    0.000    0.101    0.000 {method 'rstrip' of 'bytes' objects}
        2    0.094    0.047    0.094    0.047 {built-in method _io.open}
        3    0.001    0.000    0.001    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>The 256K buffer size here was found experimentally. I tried up to 5 MB but
didn’t see a performance improvement.</p>
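<p>Here’s a sketch of how you might run that buffer-size experiment yourself (my
own illustration, not the post’s benchmark harness; the payload, write count,
and candidate sizes are made up, and real results depend heavily on the OS and
disk):</p>

```python
import io
import os
import tempfile
import time

def time_buffered_write(buffer_size: int, payload: bytes, n_writes: int) -> float:
    """Write payload n_writes times through a BufferedWriter and time it."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name
    try:
        start = time.perf_counter()
        with open(path, "wb") as raw, io.BufferedWriter(
            raw, buffer_size=buffer_size
        ) as fout:
            for _ in range(n_writes):
                fout.write(payload)
        return time.perf_counter() - start
    finally:
        os.remove(path)

# Try a few power-of-two buffer sizes on a synthetic 200-byte "line"
line = b"x" * 199 + b"\n"
for size in (8 * 1024, 64 * 1024, 256 * 1024, 1 << 20):
    elapsed = time_buffered_write(size, line, 100_000)
    print(f"{size:>8} bytes: {elapsed:.3f}s")
```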

<h2 id="buffered-reads">Buffered reads</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li>Buffer reads*</li>
  <li>Avoid copying and doing extra work</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>Batching our writes worked well, and reads are now the majority of the
processing time. Let’s see if I can just batch reads and get free performance.</p>

<p>There is an <code class="language-plaintext highlighter-rouge">io.BufferedReader</code>, and using it looks like this:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>

<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin_raw</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedReader</span><span class="p">(</span><span class="n">fin_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span>
            <span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span>
        <span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
            <span class="p">...</span>
</code></pre></div></div>

<p>The formatting gets dense, but the results are another significant speedup.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date_time	module	elapsed_seconds
2026-03-13 10:37:49	repair_basic	13.902526
2026-03-13 10:44:06	repair_basic	14.782423
2026-03-13 10:36:58	repair_bytes	41.753419
2026-03-13 10:38:07	repair_bytes	48.101314
2026-03-13 10:44:25	repair_bytes	41.985785
2026-03-13 11:15:02	repair_bytes_buffered_write	8.602299
2026-03-13 11:15:13	repair_bytes_buffered_write	7.982473
2026-03-13 11:48:41	repair_bytes_buffered_read_write	6.561013
2026-03-13 11:48:50	repair_bytes_buffered_read_write	5.967071
2026-03-13 11:49:25	repair_bytes_buffered_read_write	5.910506
2026-03-13 11:49:33	repair_bytes_buffered_read_write	5.782498
</code></pre></div></div>

<h2 id="avoiding-extra-work">Avoiding extra work</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li>Avoid copying and doing extra work*</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>I/O is now optimized, and I wanted to take a look at the algorithm to see if I
could improve it at all.</p>

<p>In the test file I generated, there are 1 million rows and 1,237,860 lines
(excluding the header), so we will spend a decent amount of time fixing lines.
The expected number of newline characters (configured with
<code class="language-plaintext highlighter-rouge">--newline-likelihood</code>) will have an impact on the final algorithm you choose.
I’ve set it so that every cell has a 0.2% chance of including a newline in it,
which is pretty high compared to what I see in the actual data.</p>
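<p>For intuition, here’s a hypothetical sketch (my own, not the actual
<code class="language-plaintext highlighter-rouge">generate_large_file.py</code>) of how a row with a per-cell newline probability
might be built. Whatever the cells contain, the tab count still defines the row:</p>

```python
import random

def make_row(n_cols: int, newline_likelihood: float, rng: random.Random) -> str:
    """Hypothetical generator sketch: each cell independently gets an
    unquoted embedded newline with probability newline_likelihood."""
    cells = []
    for _ in range(n_cols):
        cell = "value"
        if rng.random() < newline_likelihood:
            cell = "multi\nline value"
        cells.append(cell)
    return "\t".join(cells) + "\n"

row = make_row(120, 0.002, random.Random(0))
# The tab count is fixed by the column count, embedded newlines or not
assert row.count("\t") == 119
```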

<p>I have two ideas for optimizations.</p>

<ul>
  <li>Never count the same tab twice</li>
  <li>Reduce allocations by using mutable data structures</li>
</ul>

<p>Here’s the complete code at the moment:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1	# repair_bytes_buffered_read_write.py
 2	import io
    
 3	def repair(input_file: str, output_file: str) -&gt; None:
 4	    with open(input_file, "rb") as fin_raw, open(output_file, "wb") as fout_raw:
 5	        with io.BufferedReader(fin_raw, buffer_size=1 &lt;&lt; 20) as fin, io.BufferedWriter(
 6	            fout_raw, buffer_size=256 * 1024
 7	        ) as fout:
    
 8	            # Start by copying header
 9	            header = fin.readline()
10	            fout.write(header)
    
11	            expected_tabs = header.count(b"\t")
    
12	            # Then, iterate over the lines, repairing as you go
13	            while True:
14	                line = fin.readline()
15	                if not line:
16	                    break
17	                line_tabs = line.count(b"\t")
18	                if line_tabs == expected_tabs:
19	                    fout.write(line)
20	                    continue
    
21	                # Line repair
22	                # Grab next line and see if it complements
23	                _need_to_write = True
24	                while line_tabs &lt; expected_tabs:
25	                    continuation_line = fin.readline()
26	                    if not continuation_line:
27	                        break
28	                    cline_tabs = continuation_line.count(b"\t")
29	                    if line_tabs + cline_tabs &lt;= expected_tabs:
30	                        line = line.rstrip(b"\n") + b" " + continuation_line
31	                        line_tabs += cline_tabs
32	                    else:
33	                        # Adding these lines would create a row with
34	                        # too many fields
35	                        fout.write(line)
36	                        fout.write(continuation_line)
37	                        _need_to_write = False
38	                        break
39	                if _need_to_write:
40	                    fout.write(line)
</code></pre></div></div>

<p>This line does a few things at once; there might be a better way:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>30	                        line = line.rstrip(b"\n") + b" " + continuation_line
</code></pre></div></div>

<p>We know that the line ends in a newline character, so it might be faster to use
<code class="language-plaintext highlighter-rouge">line[:-1]</code>. And instead of creating a new byte string, I’ll extend an existing
byte buffer, which is mutable.</p>

<p>The new version of this line is:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">buffer</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># remove the newline
</span><span class="nb">buffer</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="sa">b</span><span class="s">" "</span> <span class="o">+</span> <span class="n">continuation_line</span><span class="p">)</span>
</code></pre></div></div>
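<p>A quick check (my own example) that the mutable version produces byte-for-byte
the same result as the immutable rstrip-and-concatenate version, given a line
ending in a single newline:</p>

```python
line = b"1\talice\tthis is a\n"
continuation_line = b"multiline comment\t10\n"

# Immutable version from the script
joined = line.rstrip(b"\n") + b" " + continuation_line

# Mutable bytearray version: drop the trailing newline, then append in place
buffer = bytearray(line)
buffer.pop(-1)
buffer.extend(b" " + continuation_line)

assert bytes(buffer) == joined
print(bytes(buffer))  # b'1\talice\tthis is a multiline comment\t10\n'
```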

<p>Performance is about the same, though, at least at this density of newlines.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 12:26:21	repair_bytes_buffered_read_write_bytearray	6.067266
2026-03-13 12:26:31	repair_bytes_buffered_read_write_bytearray	5.932649
2026-03-13 12:26:40	repair_bytes_buffered_read_write_bytearray	5.986914
</code></pre></div></div>

<p>And here’s the profile:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python profile_repair.py repair_bytes_buffered_read_write_bytearray.py
Profiling repair_bytes_buffered_read_write_bytearray on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         5951455 function calls in 9.120 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.716    1.716    9.120    9.120 /Users/charlie/tsv-repair/repair_bytes_buffered_read_write_bytearray.py:4(repair)
  1000000    2.518    0.000    2.518    0.000 {method 'count' of 'bytearray' objects}
  1000001    1.885    0.000    1.885    0.000 {method 'write' of '_io.BufferedWriter' objects}
  1237862    1.475    0.000    1.475    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1237861    0.608    0.000    0.608    0.000 {method 'extend' of 'bytearray' objects}
  1000000    0.399    0.000    0.399    0.000 {method 'clear' of 'bytearray' objects}
   237861    0.343    0.000    0.343    0.000 {method 'count' of 'bytes' objects}
        2    0.121    0.061    0.121    0.061 {built-in method _io.open}
   237860    0.056    0.000    0.056    0.000 {method 'pop' of 'bytearray' objects}
        4    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>I’m spending 0.4s clearing the byte array and 0.6s updating the byte array. I
don’t have much insight into the comparable numbers for allocation times, but
I’d guess this is about the same.</p>

<p>Avoiding extra work: generally no performance improvement, and the code was
harder to read, so I’m not going to keep these optimizations.</p>

<h3 id="update-2026-03-16">Update: 2026-03-16</h3>
<p>I bumped the number of newlines up to 0.2 (1 in 5 cells has a newline in it) and
found that, yes, the bytearray version of the code is a decent bit faster at
this newline density. The basic version takes 26 seconds, while the bytebuffer
only takes 19 seconds.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-16 09:58:58	repair_bytes_buffered_read_write	26.129120
2026-03-16 09:59:46	repair_bytes_buffered_read_write_bytearray	19.367004
2026-03-16 10:01:38	repair_bytes_buffered_read_write	25.511255
2026-03-16 10:02:09	repair_bytes_buffered_read_write_bytearray	19.135691
</code></pre></div></div>

<p>This is a 1 million row file with 120 columns, which means that at a 0.2
newline density, each row will have about <code class="language-plaintext highlighter-rouge">120 / 5 = 24</code> embedded newlines in
it. Put another way, each row of data will be spread across approximately 25
lines of the file.</p>

<h2 id="multiprocessing">Multiprocessing</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li><del>Avoid copying and doing extra work</del></li>
  <li>Multiprocessing*</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>Multiprocessing can be difficult to get right, and there’s no guarantee it’s a
speedup in all cases.</p>

<p>The idea is to chunk the file, start up a few processes, and assign each chunk
to a process. Assuming the processes successfully use all available cores on
your computer, you should see a multiplicative speed up.</p>

<p>But this TSV repair isn’t a map/reduce problem, so let’s take a step back to
consider whether multiprocessing will help at all. I could let each core write
its part of the TSV file to its own output file, rather than having everyone
write to the same file. This is essentially how tools like AWS Athena copy
large amounts of data. The Hive table format is mostly friendly to splintered
files, and some teams run compaction afterward to reduce the cost of having
more files to read later.</p>

<p>Multiprocessing and leaving the output in pieces would be the fastest way to
write it. But that’s no good for the tests, so while it’s an attractive idea, I
won’t be able to leave the files in pieces. Recombining them would require
another full read/write of the data, and most likely that would cost more time
than multiprocessing saves.</p>

<h3 id="update-2026-03-16-1">Update: 2026-03-16</h3>
<p>I returned to this today to see if I could implement the multiprocessing version
of the TSV repair. Interestingly enough, I found that this problem cannot be
efficiently multiprocessed in the way I thought it could.</p>

<p>A row of data is only defined by the number of tabs. To split the file, we have
to align each processor’s chunk of the input file on a row boundary so that each
partial file has only complete rows in it. If you have a table with 25 rows and
5 columns, then the available boundaries are after 0 tabs, 5 tabs, 10, 15, and
so on.</p>

<p>You cannot start in the middle of the file and find a row boundary in all cases.
If you happen to find a newline character followed by <code class="language-plaintext highlighter-rouge">n_column</code> tabs, that
works: you’ve found a valid boundary. But in a file with a high density of
newline characters, where each row is basically guaranteed to have one or more
improperly quoted newlines, it becomes almost impossible to figure out where a
row begins and ends.</p>

<p>Consider the case where every cell has a newline character in it. If you track
from the beginning of the file, you can correctly reassemble this dataset by
counting tabs. But, if you start anywhere in the middle, it becomes impossible
to decide where a row should begin and end. It’s a mass of alternating newlines
and tabs.</p>

<p>We can’t align on a row boundary without knowing how many tab characters have
come before the current <code class="language-plaintext highlighter-rouge">seek</code> position, and we can’t do that without indexing
all of the tabs in the file. That’s what we’re already doing in the basic
implementation of the TSV repair script, so I don’t see a way for
multiprocessing to be faster than sequential processing.</p>
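<p>The sequential dependency can be made concrete: a newline ends a row only when
the tabs seen since the previous row boundary equal the header’s tab count, and
that counter can only be maintained by scanning from the start. A small
illustration of mine, using 3 columns (2 tabs per row):</p>

```python
# Two rows of 3 columns each; the first row has an embedded newline.
expected_tabs = 2
data = b"a\tx\ny\tb\nc\td\te\n"

tabs_since_row_start = 0
row_ends = []
for i, byte in enumerate(data):  # iterating bytes yields ints
    if byte == 0x09:  # tab
        tabs_since_row_start += 1
    elif byte == 0x0A and tabs_since_row_start == expected_tabs:  # newline
        row_ends.append(i)
        tabs_since_row_start = 0

print(row_ends)  # [7, 13]: the newline at offset 3 is inside a field
```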

<p>That does get me thinking, though: how do other tools handle this for valid
files? Imagine a TSV file that correctly quotes newline characters but has
alternating newlines and tabs as before. In fact, since we’re quoting, it’s
technically allowed to include <em>tabs</em> in the quoted fields as well. If this were
a CSV, you could substitute commas.</p>

<p>I’m going to use CSV format for clarity. Your worker might get assigned a <code class="language-plaintext highlighter-rouge">seek</code>
position that looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>,"
"Hello, world!","Nice to meet you"
</code></pre></div></div>

<p>Spicing it up a little with some newlines:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>,"
"Hello,
world!","
Nice to meet you"
</code></pre></div></div>

<p>Can you align on a boundary? The first characters <code class="language-plaintext highlighter-rouge">,"\n</code> are ambiguous. The
comma could be a quoted comma or a field delimiter. But we get more information
with the following <code class="language-plaintext highlighter-rouge">"Hello</code>. If the first double quote started a string, the
second would have to end one and be followed by a delimiter. Since the second
quote is not followed by a delimiter, it must be preceded by one – and it is (a
newline, which delimits rows). So now that we’ve identified that the first quote
is a <em>closing</em> quote, we can be sure that the newline that followed it was the
end of a row of data. This gives us enough information to say that <code class="language-plaintext highlighter-rouge">"Hello</code> is
the beginning of a row, and we can align from there.</p>
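<p>One way to sketch the speculation in code (my own simplified illustration, not
the paper’s algorithm: it assumes strict quoting and no escaped quotes) is to
scan the chunk under both possible starting states and reject whichever one
hits a quoting violation:</p>

```python
def scan_chunk(chunk: str, start_in_quotes: bool):
    """Scan a CSV chunk under an assumed starting quote state.

    Returns offsets of unquoted newlines (candidate row boundaries), or
    None if the assumption leads to a quoting violation. Simplified:
    assumes strict quoting and no escaped quotes.
    """
    in_quotes = start_in_quotes
    boundaries = []
    n = len(chunk)
    for i, c in enumerate(chunk):
        if in_quotes:
            if c == '"':
                # A closing quote must be followed by a delimiter,
                # a newline, or the end of the chunk.
                if i + 1 < n and chunk[i + 1] not in ',\n':
                    return None
                in_quotes = False
        else:
            if c == '"':
                # An opening quote is only legal at the start of a field.
                if i > 0 and chunk[i - 1] not in ',\n':
                    return None
                in_quotes = True
            elif c == '\n':
                boundaries.append(i)
    return boundaries

chunk = ',"\n"Hello,\nworld!","\nNice to meet you"\n'
print(scan_chunk(chunk, start_in_quotes=False))  # None (quoting violation)
print(scan_chunk(chunk, start_in_quotes=True))   # [2, 38]
```

<p>Only the “started inside quotes” assumption survives, which matches the
reasoning above: the first quote must be a closing quote.</p>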

<p>Does this generally hold? Is it economical? I found a paper that discusses
distributed CSV parsing, with examples that look a lot like my own examples
above: <a href="https://badrish.net/papers/dp-sigmod19.pdf">“Speculative Distributed CSV Data Parsing for Big Data
Analytics”</a>. The authors bring up a
good point, which is that a production parser has to recognize invalid CSV files
as well as valid ones. Here’s the abstract:</p>

<blockquote>
  <p>There has been a recent flurry of interest in providing query capability on
raw data in today’s big data systems. These raw data must be parsed before
processing or use in analytics. Thus, a fundamental challenge in distributed
big data systems is that of efficient parallel parsing of raw data. The
difficulties come from the inherent ambiguity while independently parsing
chunks of raw data without knowing the context of these chunks. Specifically,
it can be difficult to find the beginnings and ends of fields and records in
these chunks of raw data. To parallelize parsing, this paper proposes a
speculation-based approach for the CSV format, arguably the most commonly used
raw data format. Due to the syntactic and statistical properties of the
format, speculative parsing rarely fails and therefore parsing is efficiently
parallelized in a distributed setting. Our speculative approach is also
robust, meaning that it can reliably detect syntax errors in CSV data. We
experimentally evaluate the speculative, distributed parsing approach in
Apache Spark using more than 11,000 real-world datasets, and show that our
parser produces significant performance benefits over existing methods.</p>
</blockquote>

<h2 id="memory-mapping">Memory mapping</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li><del>Avoid copying and doing extra work</del></li>
  <li><del>Multiprocessing</del></li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>Memory mapping can be helpful in some cases, but it’s not always a clear
performance win. Before I get to that, though: what is memory mapping?</p>

<p><code class="language-plaintext highlighter-rouge">mmap</code> is a syscall analogous to <code class="language-plaintext highlighter-rouge">read</code>, but with some architectural
differences. Normally when you <code class="language-plaintext highlighter-rouge">read</code> a set of bytes from a file, the kernel
copies those bytes into a kernel buffer, then copies them into your user buffer.
When you request data, the OS schedules a request with the disk driver to get
those bytes, and everything happens “buffer-to-buffer”.</p>

<p><code class="language-plaintext highlighter-rouge">mmap</code> on the other hand maps the file’s bytes into your virtual
memory address space. When you read a set of bytes from an <code class="language-plaintext highlighter-rouge">mmap</code>‘d file, you get
the exact same data, but it travels through very different channels.
Instead of copying buffer-to-buffer, touching an unmapped region causes a page
fault. The kernel loads the whole page (or several, depending on prefetch
configuration) into the page cache and maps it straight into your address
space, with no extra copy into a user buffer.</p>
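<p>As a minimal sketch of my own (not one of the benchmarked scripts), here’s what that looks like from Python: once mapped, the file behaves like a read-only bytes-like object, and touching its contents is what triggers the page faults.</p>

```python
import mmap
import os
import tempfile

# Write a small stand-in file to map.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello\tworld\n")

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first = mm[0:5]            # slicing faults the page in on demand
        tabs = mm[:].count(b"\t")  # mm[:] copies the whole mapping to bytes

os.remove(path)
```

<p>Slices return ordinary <code class="language-plaintext highlighter-rouge">bytes</code> objects, so downstream code doesn’t have to care whether its input came from <code class="language-plaintext highlighter-rouge">read</code> or <code class="language-plaintext highlighter-rouge">mmap</code>.</p>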

<p>To oversimplify a bit, <code class="language-plaintext highlighter-rouge">read</code>ing causes a double copy, while <code class="language-plaintext highlighter-rouge">mmap</code>ing needs only a
single copy. Great, you might think.</p>

<p>The tradeoff is that while <code class="language-plaintext highlighter-rouge">read</code> involves more copying, <code class="language-plaintext highlighter-rouge">mmap</code> involves more
syscalls and page faults. Here’s how one Stack answerer put it:</p>

<blockquote>
  <p>So basically you have the following comparison to determine which is faster
for a single read of a large file: Is the extra per-page work implied by the
mmap approach more costly than the per-byte work of copying file contents from
kernel to user space implied by using read()?</p>

  <p><a href="https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks">https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks</a></p>
</blockquote>

<p><code class="language-plaintext highlighter-rouge">mmap</code> can have benefits when you’re reading a file multiple times or doing
non-sequential reads, but in my case the page fault overhead is probably too
high on a cold run. Let’s try it anyway.</p>

<p>The basic implementation is:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>
<span class="kn">import</span> <span class="nn">mmap</span>

<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin_raw</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">mmap</span><span class="p">.</span><span class="n">mmap</span><span class="p">(</span><span class="n">fin_raw</span><span class="p">.</span><span class="n">fileno</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="n">access</span><span class="o">=</span><span class="n">mmap</span><span class="p">.</span><span class="n">ACCESS_READ</span><span class="p">)</span> <span class="k">as</span> <span class="n">mm_in</span><span class="p">:</span>
            <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span><span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
                <span class="p">...</span>
</code></pre></div></div>

<p>I had to ditch the buffered reader, at least the one available from <code class="language-plaintext highlighter-rouge">io</code>,
because a memory mapped file isn’t compatible with it. In any case, buffering
wouldn’t reduce the number of page faults, which is where mmap spends its time.
The performance isn’t great.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 14:23:06	repair_bytes_buffered_read_write_mmap	27.898272
</code></pre></div></div>

<p>Looking at the profile is interesting:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes_buffered_read_write_mmap.py  
Profiling repair_bytes_buffered_read_write_mmap on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713595 function calls in 15.487 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.383    1.383   15.487   15.487 /Users/charlie/tsv-repair/repair_bytes_buffered_read_write_mmap.py:5(repair)
  1237862    8.700    0.000    8.700    0.000 {method 'readline' of 'mmap.mmap' objects}
  1237861    2.974    0.000    2.974    0.000 {method 'count' of 'bytes' objects}
  1000001    2.158    0.000    2.158    0.000 {method 'write' of '_io.BufferedWriter' objects}
        2    0.115    0.057    0.115    0.057 {built-in method _io.open}
   237860    0.104    0.000    0.104    0.000 {method 'rstrip' of 'bytes' objects}
        1    0.053    0.053    0.053    0.053 {method '__exit__' of 'mmap.mmap' objects}
        3    0.001    0.000    0.001    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of '_io.BufferedReader' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Time spent in <code class="language-plaintext highlighter-rouge">count</code> matches previous runs, but, as expected, we’re spending far
more time reading. You’ll also notice that the runtime halved when I ran the
profile — I believe that’s because the pages of the large file were still in
the page cache. There’s a nifty trick you can use to clear the page cache,
though. On MacOS:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sync &amp;&amp; sudo purge
</code></pre></div></div>

<p>After doing this, the profile is again comparable to the first one.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes_buffered_read_write_mmap.py
Profiling repair_bytes_buffered_read_write_mmap on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713595 function calls in 28.173 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.542    1.542   28.173   28.173 /Users/charlie/projects/atlas/tsv-repair/repair_bytes_buffered_read_write_mmap.py:5(repair)
  1237862   20.947    0.000   20.947    0.000 {method 'readline' of 'mmap.mmap' objects}
  1237861    3.075    0.000    3.075    0.000 {method 'count' of 'bytes' objects}
  1000001    2.317    0.000    2.317    0.000 {method 'write' of '_io.BufferedWriter' objects}
        2    0.123    0.061    0.123    0.061 {built-in method _io.open}
   237860    0.108    0.000    0.108    0.000 {method 'rstrip' of 'bytes' objects}
        1    0.061    0.061    0.061    0.061 {method '__exit__' of 'mmap.mmap' objects}
        3    0.001    0.000    0.001    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of '_io.BufferedReader' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Now it’s clear we’re really taking a performance hit with <code class="language-plaintext highlighter-rouge">mmap.readline()</code>.</p>

<p>We’re not getting any performance benefit from mapping our file into memory –
page faults are killing performance when we only read through the file once.</p>
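<p>One mitigation I didn’t benchmark: <code class="language-plaintext highlighter-rouge">mmap.madvise</code> can hint the kernel that the scan is sequential, letting it prefetch longer runs of pages. A sketch (the <code class="language-plaintext highlighter-rouge">MADV_SEQUENTIAL</code> constant is platform-dependent, hence the guard; the helper name is mine):</p>

```python
import mmap
import os
import tempfile


def read_lines_mapped(path):
    """Read every line of a file through an mmap, hinting sequential access."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Tell the kernel we will scan front to back so it can prefetch
            # larger runs of pages (guarded: the constant is platform-specific).
            if hasattr(mmap, "MADV_SEQUENTIAL"):
                mm.madvise(mmap.MADV_SEQUENTIAL)
            return [line for line in iter(mm.readline, b"")]


# Tiny demo file standing in for the real TSV.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"id\tname\n0\tcharlie\n")
lines = read_lines_mapped(demo_path)
os.remove(demo_path)
```

<p>Whether this actually closes the gap with buffered <code class="language-plaintext highlighter-rouge">read</code>s on a cold cache is an open question — I’m only noting it as a possible knob.</p>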

<h2 id="pypy">PyPy</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li><del>Avoid copying and doing extra work</del></li>
  <li><del>Multiprocessing</del></li>
  <li><del>Memory map the file</del></li>
  <li>PyPy</li>
</ol>

<p>This is a bit far afield, but I thought I’d try PyPy, a JIT Python interpreter
that can give a good speedup in some cases. Through Homebrew I was able to get
PyPy3.10, which is a little old, and in any case PyPy shines on
calculation-intensive work, not I/O-bound work. Indeed, the result was a 3–4x
slowdown, perhaps partly due to I/O improvements in more recent versions of
CPython.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 15:58:39	repair_bytes_buffered_read_write	22.078622
2026-03-13 15:59:08	repair_bytes_buffered_read_write	20.995271
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>
<p>The winning script involved only simple tweaks on my original attempt.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>


<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin_raw</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedReader</span><span class="p">(</span><span class="n">fin_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span>
            <span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span>
        <span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>

            <span class="c1"># Start by copying header
</span>            <span class="n">header</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
            <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">header</span><span class="p">)</span>

            <span class="n">expected_tabs</span> <span class="o">=</span> <span class="n">header</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>

            <span class="c1"># Then, iterate over the lines, repairing as you go
</span>            <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
                <span class="n">line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
                <span class="k">if</span> <span class="ow">not</span> <span class="n">line</span><span class="p">:</span>
                    <span class="k">break</span>
                <span class="n">line_tabs</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">==</span> <span class="n">expected_tabs</span><span class="p">:</span>
                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                    <span class="k">continue</span>

                <span class="c1"># Line repair
</span>                <span class="c1"># Grab next line and see if it complements
</span>                <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">True</span>
                <span class="k">while</span> <span class="n">line_tabs</span> <span class="o">&lt;</span> <span class="n">expected_tabs</span><span class="p">:</span>
                    <span class="n">continuation_line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
                    <span class="k">if</span> <span class="ow">not</span> <span class="n">continuation_line</span><span class="p">:</span>
                        <span class="k">break</span>
                    <span class="n">cline_tabs</span> <span class="o">=</span> <span class="n">continuation_line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
                    <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">+</span> <span class="n">cline_tabs</span> <span class="o">&lt;=</span> <span class="n">expected_tabs</span><span class="p">:</span>
                        <span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span> <span class="o">+</span> <span class="sa">b</span><span class="s">" "</span> <span class="o">+</span> <span class="n">continuation_line</span>
                        <span class="n">line_tabs</span> <span class="o">+=</span> <span class="n">cline_tabs</span>
                    <span class="k">else</span><span class="p">:</span>
                        <span class="c1"># Adding these lines would create a row with
</span>                        <span class="c1"># too many fields
</span>                        <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                        <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">continuation_line</span><span class="p">)</span>
                        <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">False</span>
                        <span class="k">break</span>
                <span class="k">if</span> <span class="n">_need_to_write</span><span class="p">:</span>
                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</code></pre></div></div>

<p>Benchmark (after purging):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 15:28:01	repair_bytes_buffered_read_write	6.520668
</code></pre></div></div>

<p>Profile (also after purging):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes_buffered_read_write.py
Profiling repair_bytes_buffered_read_write on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713594 function calls in 7.851 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.228    1.228    7.850    7.850 /Users/charlie/projects/atlas/tsv-repair/repair_bytes_buffered_read_write.py:4(repair)
  1237861    2.875    0.000    2.875    0.000 {method 'count' of 'bytes' objects}
  1237862    1.813    0.000    1.813    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1000001    1.770    0.000    1.770    0.000 {method 'write' of '_io.BufferedWriter' objects}
   237860    0.094    0.000    0.094    0.000 {method 'rstrip' of 'bytes' objects}
        2    0.070    0.035    0.070    0.035 {built-in method _io.open}
        4    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Best optimization techniques:</p>

<ul>
  <li><strong>Buffered reads and writes</strong> accounted for the majority of the speedup</li>
  <li><strong>Working in bytes</strong> saved ~1 sec on utf-8 decoding. There might be more savings
if your data has a higher density of multi-byte characters.</li>
</ul>

<p>Failed optimizations (no effect or negative effect):</p>

<ul>
  <li><strong>Memory mapping.</strong> The page fault overhead was too significant when we are
doing a sequential read through the file.</li>
  <li><strong>Multiprocessing.</strong> The savings we gain from splitting up the file are eaten
up by the time it takes to reassemble the individual outputs. This isn’t a
map/reduce job, and we can’t leave the files disassembled, so no dice for this
version of the problem.</li>
  <li><strong>Tweaking data structures.</strong> At this level of newline density, the code
doesn’t spend enough time in the newline-repair loop to see any effect of
optimizing data structures to reduce copying.</li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>This problem space isn’t as rich as the 1BR challenge, so there won’t be as many exotic tricks, but it was still fun to work through it. For inspiration I looked back through <a href="https://www.youtube.com/watch?v=utTaPW32gKY">Doug Mercer’s video</a> on 1BR in Python. It’s a great watch and has a number of nifty tricks. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>This TSV file has a problem:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a
multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>There’s an unquoted multi-line field on the second line. I’m not aware of a TSV
parser that can correctly parse this — most consider it an invalid file, and some
NULL-fill the remaining fields on any line that has incomplete data.</p>

<p>I regularly ingest data from a particular data source that has this bad feature
in it, and the problem has gotten worse recently. I’ve decided to fix it. The
question is, “What’s the best way to fix unquoted newline characters?”</p>

<p>I created this repository with my results and some tooling for benchmarking and profiling: <a href="https://github.com/charlie-gallagher/tsv-repair">https://github.com/charlie-gallagher/tsv-repair</a></p>

<h1 id="problem-statement">Problem statement</h1>
<p>Like the <a href="https://www.morling.dev/blog/one-billion-row-challenge/">Billion Row Challenge</a>, the goal is
to process large files as quickly as possible using some base programming
language.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> In my case, I worked in base Python 3.13 on MacOS.</p>

<p>Using pure Python (stdlib), repair a large (10GB), utf-8 encoded TSV file with a
knowable number of fields by (a) identifying incomplete lines, (b) combining
successive incomplete lines until they form a complete line, and (c) not
combining lines if the result is a row with too many fields. A single row may
contain one or more embedded newlines, i.e. a row might be spread across two or
more lines of the file. The lines that form a row are always ordered correctly,
successive and contiguous. Newlines are LF only, not CRLF. To find the number of
fields, you can read the first line, which is always the header.</p>

<p>To make things simpler, even if a field is quoted, you can still join successive
split lines.</p>

<p>In the end, the file should look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>i.e. replace the newline character with a space. If there are multiple newline
characters in a row (<code class="language-plaintext highlighter-rouge">hello\n\nworld</code>), replace each one with a space. If a
field starts or ends with a newline, you can still replace newlines with a
space.</p>

<h1 id="basic-solution">Basic solution</h1>
<p>Here’s a straightforward solution I came up with that passes all the tests.
There are one or two inelegancies, like the <code class="language-plaintext highlighter-rouge">_need_to_write</code> variable that keeps
track of some state information, but it does the thing.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
        <span class="c1"># Start by copying header
</span>        <span class="n">header</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
        <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">header</span><span class="p">)</span>

        <span class="n">expected_tabs</span> <span class="o">=</span> <span class="n">header</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>

        <span class="c1"># Then, iterate over the lines, repairing as you go
</span>        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="n">line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">line</span><span class="p">:</span>
                <span class="k">break</span>
            <span class="n">line_tabs</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">==</span> <span class="n">expected_tabs</span><span class="p">:</span>
                <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                <span class="k">continue</span>

            <span class="c1"># Line repair
</span>            <span class="c1"># Grab next line and see if it complements
</span>            <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">True</span>
            <span class="k">while</span> <span class="n">line_tabs</span> <span class="o">&lt;</span> <span class="n">expected_tabs</span><span class="p">:</span>
                <span class="n">continuation_line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
                <span class="k">if</span> <span class="ow">not</span> <span class="n">continuation_line</span><span class="p">:</span>
                    <span class="k">break</span>
                <span class="n">cline_tabs</span> <span class="o">=</span> <span class="n">continuation_line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">+</span> <span class="n">cline_tabs</span> <span class="o">&lt;=</span> <span class="n">expected_tabs</span><span class="p">:</span>
                    <span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span> <span class="o">+</span> <span class="s">" "</span> <span class="o">+</span> <span class="n">continuation_line</span>
                    <span class="n">line_tabs</span> <span class="o">+=</span> <span class="n">cline_tabs</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="c1"># Adding these lines would create a row with
</span>                    <span class="c1"># too many fields
</span>                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">continuation_line</span><span class="p">)</span>
                    <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">False</span>
                    <span class="k">break</span>
            <span class="k">if</span> <span class="n">_need_to_write</span><span class="p">:</span>
                <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</code></pre></div></div>

<p>And huzzah, the tests are passing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 -c "from test_repair import main; from repair_basic import repair; main(repair)"
Testing repair function against golden files in /Users/charlie/tsv-repair/test_files

  PASS  already_good.tsv
  PASS  basic.tsv
  PASS  basic_incomplete_first_line.tsv
  PASS  basic_incomplete_last_line.tsv
  PASS  multi_newline.tsv
  PASS  newline_as_first_char_in_field.tsv
  PASS  newline_as_last_char_in_field.tsv
  PASS  newline_as_only_char_in_field.tsv
  PASS  partial_solution.tsv
  PASS  partial_solution_2x.tsv
  PASS  quoted.tsv
  PASS  too_many_tabs.tsv
  PASS  two_bad_lines_in_a_row.tsv

13/13 tests passed.
</code></pre></div></div>

<p>Benchmark on large file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 09:59:03	repair_basic	14.604409
</code></pre></div></div>

<p>The large file configuration for this run was: 1M rows, 120 columns, and 0.002
likelihood that a cell contains a newline character. The file was 3.0 GB. You
can generate a similar file using:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python generate_large_file.py -r 1000000 -c 120 --newline-likelihood 0.002 
</code></pre></div></div>

<p>There’s a cProfile script as well. Here’s the output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_basic.py 
Profiling repair_basic on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         4882830 function calls in 20.516 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.674    1.674   20.516   20.516 /Users/charlie/tsv-repair/repair_basic.py:2(repair)
  1000001    8.711    0.000    8.711    0.000 {method 'write' of '_io.TextIOWrapper' objects}
  1237862    5.031    0.000    6.102    0.000 {method 'readline' of '_io.TextIOWrapper' objects}
  1237861    3.800    0.000    3.800    0.000 {method 'count' of 'str' objects}
   389745    0.358    0.000    0.979    0.000 &lt;frozen codecs&gt;:322(decode)
   389745    0.621    0.000    0.621    0.000 {built-in method _codecs.utf_8_decode}
   237860    0.136    0.000    0.136    0.000 {method 'rstrip' of 'str' objects}
        2    0.093    0.047    0.093    0.047 {built-in method _io.open}
   389745    0.093    0.000    0.093    0.000 &lt;frozen codecs&gt;:334(getstate)
        2    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 &lt;frozen codecs&gt;:312(__init__)
        1    0.000    0.000    0.000    0.000 &lt;frozen codecs&gt;:189(__init__)
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
        1    0.000    0.000    0.000    0.000 &lt;frozen codecs&gt;:263(__init__)
</code></pre></div></div>

<p>Most time was spent reading and writing, followed by counting tab characters.
This is pretty much what you’d expect: an I/O-bound program with some amount
of calculation. Out of 20 seconds, 14s were spent on I/O. The other top
hotspots were:</p>

<ul>
  <li>Decoding utf-8 (0.98s)</li>
  <li>Removing newline characters with rstrip (0.14s)</li>
</ul>

<h1 id="optimizations">Optimizations</h1>
<p>A good optimization would be to not write this in Python at all, but let’s
ignore that and assume we have to work in pure CPython. There are plenty of
optimizations we can reach for — some make more sense for an I/O-bound program
and others are more appropriate for a CPU-bound program. Scanning the files is
I/O-bound, while counting tabs and joining lines involves some CPU work, so
it’s worth checking both.</p>

<ol>
  <li>We can count tabs without decoding utf-8</li>
  <li>Buffer writes</li>
  <li>Buffer reads</li>
  <li>Avoid copying and doing extra work</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<h2 id="avoiding-utf-8-decoding">Avoiding utf-8 decoding</h2>
<p>The file is encoded in utf-8, which has the nice property that any ASCII byte
uniquely identifies that ASCII character. If you’re interested in finding tabs
<code class="language-plaintext highlighter-rouge">\t</code> (<code class="language-plaintext highlighter-rouge">0x09</code>), utf-8 guarantees that no matter how many multi-byte characters you
have, none of them will contain this byte. In utf-8 multi-byte characters, the
high bit of every byte is always set. So no multi-byte character can contain <code class="language-plaintext highlighter-rouge">0x09</code>, where
the high bit is unset.</p>

<p><img src="/assets/images/2026-tsv-repair/utf8.png" alt="utf-8 encoding diagram" /></p>

<p>Source: <a href="https://badrish.net/papers/dp-sigmod19.pdf">https://badrish.net/papers/dp-sigmod19.pdf</a></p>

<p>That means we can do away with decoding the bytes and just search for <code class="language-plaintext highlighter-rouge">0x09</code>, or
in Python <code class="language-plaintext highlighter-rouge">b"\t"</code>.</p>
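<p>A quick standalone sanity check of this property (not from the repair script itself): counting tab bytes in the raw utf-8 gives the same answer as counting tab characters after decoding, even with multi-byte characters in the data.</p>

```python
# No byte of a multi-byte UTF-8 sequence can equal 0x09 (the high bit of
# every such byte is set), so counting tab bytes equals counting tab chars.
s = "0\tcafé\tsnowman ☃\t20\n"   # contains multi-byte characters
raw = s.encode("utf-8")

assert raw.count(b"\t") == s.count("\t") == 3
```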

<p>Here’s the diff with <code class="language-plaintext highlighter-rouge">repair_basic.py</code>.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">❯</span> diff -u repair_basic.py repair_bytes.py
<span class="gd">--- repair_basic.py     2026-03-12 16:42:34
</span><span class="gi">+++ repair_bytes.py     2026-03-13 10:27:08
</span><span class="p">@@ -1,18 +1,18 @@</span>
 
 def repair(input_file: str, output_file: str) -&gt; None:
<span class="gd">-    with open(input_file, "r") as fin, open(output_file, "w") as fout:
</span><span class="gi">+    with open(input_file, "rb") as fin, open(output_file, "wb") as fout:
</span>         # Start by copying header
         header = fin.readline()
         fout.write(header)
 
<span class="gd">-        expected_tabs = header.count("\t")
</span><span class="gi">+        expected_tabs = header.count(b"\t")
</span> 
         # Then, iterate over the lines, repairing as you go
         while True:
             line = fin.readline()
             if not line:
                 break
<span class="gd">-            line_tabs = line.count("\t")
</span><span class="gi">+            line_tabs = line.count(b"\t")
</span>             if line_tabs == expected_tabs:
                 fout.write(line)
                 continue
<span class="p">@@ -24,9 +24,9 @@</span>
                 continuation_line = fin.readline()
                 if not continuation_line:
                     break
<span class="gd">-                cline_tabs = continuation_line.count("\t")
</span><span class="gi">+                cline_tabs = continuation_line.count(b"\t")
</span>                 if line_tabs + cline_tabs &lt;= expected_tabs:
<span class="gd">-                    line = line.rstrip("\n") + " " + continuation_line
</span><span class="gi">+                    line = line.rstrip(b"\n") + b" " + continuation_line
</span>                     line_tabs += cline_tabs
                 else:
                     # Adding these lines would create a row with
</code></pre></div></div>

<p>In practice, this was horrific for performance. The code is basically identical
except now we’re searching for the tab byte instead of the tab character, and
we’re writing bytes. But the benchmarks are kind of startling.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date_time	module	elapsed_seconds
2026-03-13 10:37:49	repair_basic	13.902526
2026-03-13 10:44:06	repair_basic	14.782423
2026-03-13 10:36:58	repair_bytes	41.753419
2026-03-13 10:38:07	repair_bytes	48.101314
2026-03-13 10:44:25	repair_bytes	41.985785
</code></pre></div></div>

<p>The basic version takes only 14s or so, while the bytes version takes closer to
45s. What gives? The profile points out the issue:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes.py    
Profiling repair_bytes on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713592 function calls in 47.199 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.112    2.112   47.199   47.199 /Users/charlie/tsv-repair/repair_bytes.py:2(repair)
  1000001   35.681    0.000   35.681    0.000 {method 'write' of '_io.BufferedWriter' objects}
  1237862    5.865    0.000    5.865    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1237861    3.272    0.000    3.272    0.000 {method 'count' of 'bytes' objects}
   237860    0.138    0.000    0.138    0.000 {method 'rstrip' of 'bytes' objects}
        2    0.130    0.065    0.130    0.065 {built-in method _io.open}
        2    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Suddenly, we’re spending 35 seconds writing bytes. Most likely, the character
string writer automatically does some amount of buffering to optimize file
writes, and the bytes are not buffered at all. The other functions took around
the same amount of time as before. We did successfully drop the utf-8 decoding
logic, and that should save us a second or so all else equal. I’ll try buffering
the write output so there are fewer calls to write bytes.</p>

<h2 id="buffered-writes">Buffered writes</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li>Buffer writes*</li>
  <li>Buffer reads</li>
  <li>Avoid copying and doing extra work</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>The <code class="language-plaintext highlighter-rouge">io</code> stdlib module has a <code class="language-plaintext highlighter-rouge">BufferedWriter</code> we can use to buffer our byte
writes. Here’s the only change we need to make:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>

<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span><span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
            <span class="p">...</span>
</code></pre></div></div>

<p>And the results are pretty outstanding – a 2x speedup on the basic script, and
a 5x speedup on the unbuffered version of byte repair.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date_time	module	elapsed_seconds
2026-03-13 10:37:49	repair_basic	13.902526
2026-03-13 10:44:06	repair_basic	14.782423
2026-03-13 10:36:58	repair_bytes	41.753419
2026-03-13 10:38:07	repair_bytes	48.101314
2026-03-13 10:44:25	repair_bytes	41.985785
2026-03-13 11:15:02	repair_bytes_buffered_write	8.602299
2026-03-13 11:15:13	repair_bytes_buffered_write	7.982473
</code></pre></div></div>

<p>The profile shows that we now spend only 2s writing data.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes.py
Profiling repair_bytes on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713593 function calls in 11.621 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.332    1.332   11.621   11.621 /Users/charlie/tsv-repair/repair_bytes.py:5(repair)
  1237862    5.051    0.000    5.051    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1237861    2.945    0.000    2.945    0.000 {method 'count' of 'bytes' objects}
  1000001    2.096    0.000    2.096    0.000 {method 'write' of '_io.BufferedWriter' objects}
   237860    0.101    0.000    0.101    0.000 {method 'rstrip' of 'bytes' objects}
        2    0.094    0.047    0.094    0.047 {built-in method _io.open}
        3    0.001    0.000    0.001    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>The 256K buffer size here was found experimentally. I tried up to 5 MB but
didn’t see a performance improvement.</p>
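<p>Finding a buffer size experimentally can be as simple as timing the same write workload at a few sizes. Here’s a rough sketch of such a micro-benchmark (the helper name, chunk shape, and sizes are all made up; real numbers will vary by machine and filesystem):</p>

```python
import io
import os
import tempfile
import time

def time_buffered_writes(buffer_size, n_writes=100_000, chunk=b"x" * 120 + b"\n"):
    """Time n_writes small writes through a BufferedWriter of the given size."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    start = time.perf_counter()
    with open(path, "wb") as raw, io.BufferedWriter(raw, buffer_size=buffer_size) as f:
        for _ in range(n_writes):
            f.write(chunk)
    elapsed = time.perf_counter() - start
    os.remove(path)
    return elapsed

for size in (8 * 1024, 256 * 1024, 1 << 20):
    print(f"{size:>8} bytes: {time_buffered_writes(size):.3f}s")
```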

<h2 id="buffered-reads">Buffered reads</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li>Buffer reads*</li>
  <li>Avoid copying and doing extra work</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>Batching our writes worked well, and reads are now the majority of the
processing time. Let’s see if I can just batch reads and get free performance.</p>

<p>There is an <code class="language-plaintext highlighter-rouge">io.BufferedReader</code>, and using it looks like this:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>

<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin_raw</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedReader</span><span class="p">(</span><span class="n">fin_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span>
            <span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span>
        <span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
            <span class="p">...</span>
</code></pre></div></div>

<p>The formatting gets dense, but the results are another significant speedup.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date_time	module	elapsed_seconds
2026-03-13 10:37:49	repair_basic	13.902526
2026-03-13 10:44:06	repair_basic	14.782423
2026-03-13 10:36:58	repair_bytes	41.753419
2026-03-13 10:38:07	repair_bytes	48.101314
2026-03-13 10:44:25	repair_bytes	41.985785
2026-03-13 11:15:02	repair_bytes_buffered_write	8.602299
2026-03-13 11:15:13	repair_bytes_buffered_write	7.982473
2026-03-13 11:48:41	repair_bytes_buffered_read_write	6.561013
2026-03-13 11:48:50	repair_bytes_buffered_read_write	5.967071
2026-03-13 11:49:25	repair_bytes_buffered_read_write	5.910506
2026-03-13 11:49:33	repair_bytes_buffered_read_write	5.782498
</code></pre></div></div>

<h2 id="avoiding-extra-work">Avoiding extra work</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li>Avoid copying and doing extra work*</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>I/O is now optimized, and I wanted to take a look at the algorithm to see if I
could improve it at all.</p>

<p>In the test file I generated, there are 1 million rows and 1,237,860 lines
(excluding the header), so we will spend a decent amount of time fixing lines.
The expected number of newline characters (configured with
<code class="language-plaintext highlighter-rouge">--newline-likelihood</code>) will have an impact on the final algorithm you choose.
I’ve set it so that every cell has a 0.2% chance of including a newline in it,
which is pretty high compared to what I see in the actual data.</p>
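<p>The generator script itself isn’t shown in this post, but the idea behind a <code class="language-plaintext highlighter-rouge">--newline-likelihood</code> knob is simple enough to sketch (the function and cell contents here are made up, not the real generator):</p>

```python
import random

def make_row(n_cols: int, newline_likelihood: float, rng: random.Random) -> str:
    """Build one tab-separated row; each cell may get an unquoted newline."""
    cells = []
    for i in range(n_cols):
        cell = f"value{i}"
        if rng.random() < newline_likelihood:
            cell = cell[:3] + "\n" + cell[3:]   # inject a raw newline mid-cell
        cells.append(cell)
    return "\t".join(cells) + "\n"

rng = random.Random(42)
row = make_row(120, 0.002, rng)
assert row.count("\t") == 119   # tab count is unaffected by injected newlines
```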

<p>I have two ideas for optimizations.</p>

<ul>
  <li>Never count the same tab twice</li>
  <li>Reduce allocations by using mutable data structures</li>
</ul>

<p>Here’s the complete code at the moment:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1	# repair_bytes_buffered_read_write.py
 2	import io
    
 3	def repair(input_file: str, output_file: str) -&gt; None:
 4	    with open(input_file, "rb") as fin_raw, open(output_file, "wb") as fout_raw:
 5	        with io.BufferedReader(fin_raw, buffer_size=1 &lt;&lt; 20) as fin, io.BufferedWriter(
 6	            fout_raw, buffer_size=256 * 1024
 7	        ) as fout:
    
 8	            # Start by copying header
 9	            header = fin.readline()
10	            fout.write(header)
    
11	            expected_tabs = header.count(b"\t")
    
12	            # Then, iterate over the lines, repairing as you go
13	            while True:
14	                line = fin.readline()
15	                if not line:
16	                    break
17	                line_tabs = line.count(b"\t")
18	                if line_tabs == expected_tabs:
19	                    fout.write(line)
20	                    continue
    
21	                # Line repair
22	                # Grab next line and see if it complements
23	                _need_to_write = True
24	                while line_tabs &lt; expected_tabs:
25	                    continuation_line = fin.readline()
26	                    if not continuation_line:
27	                        break
28	                    cline_tabs = continuation_line.count(b"\t")
29	                    if line_tabs + cline_tabs &lt;= expected_tabs:
30	                        line = line.rstrip(b"\n") + b" " + continuation_line
31	                        line_tabs += cline_tabs
32	                    else:
33	                        # Adding these lines would create a row with
34	                        # too many fields
35	                        fout.write(line)
36	                        fout.write(continuation_line)
37	                        _need_to_write = False
38	                        break
39	                if _need_to_write:
40	                    fout.write(line)
</code></pre></div></div>

<p>This line does a few things at once; there might be a better way:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>30	                        line = line.rstrip(b"\n") + b" " + continuation_line
</code></pre></div></div>

<p>We know that the line ends in a newline character, so it might be faster to use
<code class="language-plaintext highlighter-rouge">line[:-1]</code>. And instead of creating a new byte string, I’ll extend an existing
byte buffer, which is mutable.</p>

<p>The new version of this line is:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">buffer</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># remove the newline
</span><span class="nb">buffer</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="sa">b</span><span class="s">" "</span> <span class="o">+</span> <span class="n">continuation_line</span><span class="p">)</span>
</code></pre></div></div>
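<p>As a sanity check (with made-up line data), the bytearray edit produces exactly the same bytes as the original expression:</p>

```python
line = b"1\talice\tthis is a\n"
continuation_line = b"multiline comment\t10\n"

# Original: allocate a brand-new bytes object
joined = line.rstrip(b"\n") + b" " + continuation_line

# New: edit a mutable bytearray in place
buffer = bytearray(line)
buffer.pop(-1)                       # drop the trailing newline
buffer.extend(b" " + continuation_line)

assert bytes(buffer) == joined
```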

<p>Performance is about the same, though, at least at this density of newlines.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 12:26:21	repair_bytes_buffered_read_write_bytearray	6.067266
2026-03-13 12:26:31	repair_bytes_buffered_read_write_bytearray	5.932649
2026-03-13 12:26:40	repair_bytes_buffered_read_write_bytearray	5.986914
</code></pre></div></div>

<p>And here’s the profile:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python profile_repair.py repair_bytes_buffered_read_write_bytearray.py
Profiling repair_bytes_buffered_read_write_bytearray on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         5951455 function calls in 9.120 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.716    1.716    9.120    9.120 /Users/charlie/tsv-repair/repair_bytes_buffered_read_write_bytearray.py:4(repair)
  1000000    2.518    0.000    2.518    0.000 {method 'count' of 'bytearray' objects}
  1000001    1.885    0.000    1.885    0.000 {method 'write' of '_io.BufferedWriter' objects}
  1237862    1.475    0.000    1.475    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1237861    0.608    0.000    0.608    0.000 {method 'extend' of 'bytearray' objects}
  1000000    0.399    0.000    0.399    0.000 {method 'clear' of 'bytearray' objects}
   237861    0.343    0.000    0.343    0.000 {method 'count' of 'bytes' objects}
        2    0.121    0.061    0.121    0.061 {built-in method _io.open}
   237860    0.056    0.000    0.056    0.000 {method 'pop' of 'bytearray' objects}
        4    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>I’m spending 0.4s clearing the byte array and 0.6s updating the byte array. I
don’t have much insight into the comparable numbers for allocation times, but
I’d guess this is about the same.</p>

<p>Avoiding extra work: generally no performance improvement, and the code was
harder to read, so I’m not going to keep these optimizations.</p>

<h3 id="update-2026-03-16">Update: 2026-03-16</h3>
<p>I bumped the number of newlines up to 0.2 (1 in 5 cells has a newline in it) and
found that, yes, the bytearray version of the code is a decent bit faster at
this newline density. The basic version takes 26 seconds, while the bytebuffer
only takes 19 seconds.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-16 09:58:58	repair_bytes_buffered_read_write	26.129120
2026-03-16 09:59:46	repair_bytes_buffered_read_write_bytearray	19.367004
2026-03-16 10:01:38	repair_bytes_buffered_read_write	25.511255
2026-03-16 10:02:09	repair_bytes_buffered_read_write_bytearray	19.135691
</code></pre></div></div>

<p>This is a 1 million row file with 120 columns, which means that for a 0.2
newline density, each row will have about <code class="language-plaintext highlighter-rouge">120 / 5 = 24</code> embedded newlines in it. Put
another way, each row of data will be spread across approximately 25 lines of
the file.</p>

<h2 id="multiprocessing">Multiprocessing</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li><del>Avoid copying and doing extra work</del></li>
  <li>Multiprocessing*</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>Multiprocessing can be difficult to get right, and there’s no guarantee it’s a
speedup in all cases.</p>

<p>The idea is to chunk the file, start up a few processes, and assign each chunk
to a process. Assuming the processes successfully use all available cores on
your computer, you should see a multiplicative speedup.</p>
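<p>Mechanically, the chunking looks something like this sketch, for a file where any unquoted newline is a safe split point (which, as discussed below, our broken TSVs are not). The workers here just count lines, and the input file name is hypothetical:</p>

```python
import os
from multiprocessing import Pool

def chunk_ranges(path: str, n_chunks: int) -> list:
    """Split a file into byte ranges whose boundaries fall on newlines."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(i * size // n_chunks)
            f.readline()                     # skip forward to the next newline
            pos = min(f.tell(), size)
            cuts.append(max(pos, cuts[-1]))  # keep boundaries non-decreasing
    cuts.append(size)
    # drop empty ranges (possible for tiny files or long lines)
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if a < b]

def count_newlines(args) -> int:
    """Worker: count newline bytes in one byte range of the file."""
    path, start, end = args
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).count(b"\n")

if __name__ == "__main__" and os.path.exists("large_file.tsv"):
    path = "large_file.tsv"                  # hypothetical input
    ranges = chunk_ranges(path, n_chunks=4)
    with Pool(len(ranges)) as pool:
        total = sum(pool.map(count_newlines, [(path, a, b) for a, b in ranges]))
    print(total)
```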

<p>But this TSV repair isn’t a map/reduce problem, so let’s take a step back to
consider whether multiprocessing will help at all. I could let each core write
its part of the TSV file to that core’s own file, rather than everyone writing
to the same file. This is essentially what tools like AWS Athena use to copy
large amounts of data. The Hive table format is friendly to splintered files for
the most part, and some teams use compaction after the fact to reduce the impact
of having more files to read later.</p>

<p>Multiprocessing and leaving the output in pieces would be the fastest way to
write it. But that’s no good for the tests, so while it’s an attractive idea, I
won’t be able to leave the files in pieces. Recombining the files would require
another full read/write of the data, and most likely that would cost me too much
time.</p>

<h3 id="update-2026-03-16-1">Update: 2026-03-16</h3>
<p>I returned to this today to see if I could implement the multiprocessing version
of the TSV repair. Interestingly enough, I found that this problem cannot be
efficiently multiprocessed in the way I thought it could.</p>

<p>A row of data is only defined by the number of tabs. To split the file, we have
to align each processor’s chunk of the input file on a row boundary so that each
partial file has only complete rows in it. If you have a table with 25 rows and
5 columns, then the available boundaries are after 0 tabs, 5 tabs, 10, 15, and
so on.</p>

<p>You cannot start in the middle of the file and find a row boundary in all cases.
If you happen to find a newline character followed by a line with <code class="language-plaintext highlighter-rouge">n_column</code> tabs,
you’ve found a valid boundary. But in a file with a high density of
newline characters, where each row is basically guaranteed to have one or more
improperly quoted newlines, it becomes almost impossible to figure out where a
row begins and ends.</p>

<p>Consider the case where every cell has a newline character in it. If you track
from the beginning of the file, you can correctly reassemble this dataset by
counting tabs. But, if you start anywhere in the middle, it becomes impossible
to decide where a row should begin and end. It’s a mass of alternating newlines
and tabs.</p>

<p>We can’t align on a row boundary without knowing how many tab characters have
come before the current <code class="language-plaintext highlighter-rouge">seek</code> position, and we can’t do that without indexing
all of the tabs in the file. That’s what we’re already doing in the basic
implementation of the TSV repair script, so I don’t see a way for
multiprocessing to be faster than sequential processing.</p>

<p>That does get me thinking, though: how do other tools handle this for valid
files? Imagine a TSV file that correctly quotes newline characters but has
alternating newlines and tabs as before. In fact, since we’re quoting, it’s
technically allowed to include <em>tabs</em> in the quoted fields as well. If this were
a CSV, you could substitute commas.</p>

<p>I’m going to use CSV format for clarity. Your worker might get assigned a <code class="language-plaintext highlighter-rouge">seek</code>
position that looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>,"
"Hello, world!","Nice to meet you"
</code></pre></div></div>

<p>Spicing it up a little with some newlines:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>,"
"Hello,
world!","
Nice to meet you"
</code></pre></div></div>

<p>Can you align on a boundary? The first characters <code class="language-plaintext highlighter-rouge">,"\n</code> are ambiguous. The
comma could be a quoted comma or a field delimiter. But we get more information
with the following <code class="language-plaintext highlighter-rouge">"Hello</code>. If the first double quote started a string, the
second would have to end one and be followed by a delimiter. Since the second
quote is not followed by a delimiter, it must be preceded by one – and it is (a
newline, which delimits rows). So now that we’ve identified that the first quote
is a <em>closing</em> quote, we can be sure that the newline that followed it was the
end of a row of data. This gives us enough information to say that <code class="language-plaintext highlighter-rouge">"Hello</code> is
the beginning of a row, and we can align from there.</p>

<p>Does this generally hold? Is it economical? I found a paper that discusses
distributed CSV parsing, with examples that look a lot like my own examples
above: <a href="https://badrish.net/papers/dp-sigmod19.pdf">“Speculative Distributed CSV Data Parsing for Big Data
Analytics”</a>. The authors bring up a
good point, which is that a production parser has to recognize invalid CSV files
as well as valid ones. Here’s the abstract:</p>

<blockquote>
  <p>There has been a recent flurry of interest in providing query capability on
raw data in today’s big data systems. These raw data must be parsed before
processing or use in analytics. Thus, a fundamental challenge in distributed
big data systems is that of efficient parallel parsing of raw data. The
difficulties come from the inherent ambiguity while independently parsing
chunks of raw data without knowing the context of these chunks. Specifically,
it can be difficult to find the beginnings and ends of fields and records in
these chunks of raw data. To parallelize parsing, this paper proposes a
speculation-based approach for the CSV format, arguably the most commonly used
raw data format. Due to the syntactic and statistical properties of the
format, speculative parsing rarely fails and therefore parsing is efficiently
parallelized in a distributed setting. Our speculative approach is also
robust, meaning that it can reliably detect syntax errors in CSV data. We
experimentally evaluate the speculative, distributed parsing approach in
Apache Spark using more than 11,000 real-world datasets, and show that our
parser produces significant performance benefits over existing methods.</p>
</blockquote>

<h2 id="memory-mapping">Memory mapping</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li><del>Avoid copying and doing extra work</del></li>
  <li><del>Multiprocessing</del></li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>Memory mapping can be helpful in some cases, but it’s not always a clear
performance win. But before I get to that, what is memory mapping?</p>

<p><code class="language-plaintext highlighter-rouge">mmap</code> is a syscall analogous to <code class="language-plaintext highlighter-rouge">read</code>, but with some architectural
differences. Normally when you <code class="language-plaintext highlighter-rouge">read</code> a set of bytes from a file, the kernel
copies those bytes into a kernel buffer, then copies them into your user buffer.
When you request data, the OS schedules a request with the disk driver to get
those bytes, and everything happens “buffer-to-buffer”.</p>

<p><code class="language-plaintext highlighter-rouge">mmap</code>, on the other hand, maps the file’s bytes into your virtual
memory address space. When you read a set of bytes from an <code class="language-plaintext highlighter-rouge">mmap</code>’d file, you get
the exact same data, but the channels it goes through are very different.
Instead of copying buffer-to-buffer, the first access to a page triggers a page
fault, and the kernel maps the whole page (or several pages, depending on
prefetch configuration) into your address space, with no extra copy into a
user-space buffer in between.</p>

<p>To oversimplify a bit, <code class="language-plaintext highlighter-rouge">read</code>ing causes a double copy, while <code class="language-plaintext highlighter-rouge">mmap</code>ing uses a
single copy. Great! you think.</p>

<p>The tradeoff is that while <code class="language-plaintext highlighter-rouge">read</code> involves more copying, <code class="language-plaintext highlighter-rouge">mmap</code> involves more
syscalls and page faults. Here’s how one Stack answerer put it:</p>

<blockquote>
  <p>So basically you have the following comparison to determine which is faster
for a single read of a large file: Is the extra per-page work implied by the
mmap approach more costly than the per-byte work of copying file contents from
kernel to user space implied by using read()?</p>

  <p><a href="https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks">https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks</a></p>
</blockquote>

<p><code class="language-plaintext highlighter-rouge">mmap</code> can have benefits when you’re reading a file multiple times or doing
non-sequential reads, but in my case the page fault overhead is probably too
high on a cold run. Let’s try it anyway.</p>

<p>The basic implementation is:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>
<span class="kn">import</span> <span class="nn">mmap</span>

<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin_raw</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">mmap</span><span class="p">.</span><span class="n">mmap</span><span class="p">(</span><span class="n">fin_raw</span><span class="p">.</span><span class="n">fileno</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="n">access</span><span class="o">=</span><span class="n">mmap</span><span class="p">.</span><span class="n">ACCESS_READ</span><span class="p">)</span> <span class="k">as</span> <span class="n">mm_in</span><span class="p">:</span>
            <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span><span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
                <span class="p">...</span>
</code></pre></div></div>

<p>I had to ditch the buffered reader, at least the one available from <code class="language-plaintext highlighter-rouge">io</code>,
because a memory-mapped file isn’t a raw stream, so <code class="language-plaintext highlighter-rouge">io.BufferedReader</code> can’t
wrap it. In any case, buffering wouldn’t affect the number of page faults, which
is where mmap spends its time. The performance isn’t great.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 14:23:06	repair_bytes_buffered_read_write_mmap	27.898272
</code></pre></div></div>

<p>Looking at the profile is interesting:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes_buffered_read_write_mmap.py  
Profiling repair_bytes_buffered_read_write_mmap on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713595 function calls in 15.487 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.383    1.383   15.487   15.487 /Users/charlie/tsv-repair/repair_bytes_buffered_read_write_mmap.py:5(repair)
  1237862    8.700    0.000    8.700    0.000 {method 'readline' of 'mmap.mmap' objects}
  1237861    2.974    0.000    2.974    0.000 {method 'count' of 'bytes' objects}
  1000001    2.158    0.000    2.158    0.000 {method 'write' of '_io.BufferedWriter' objects}
        2    0.115    0.057    0.115    0.057 {built-in method _io.open}
   237860    0.104    0.000    0.104    0.000 {method 'rstrip' of 'bytes' objects}
        1    0.053    0.053    0.053    0.053 {method '__exit__' of 'mmap.mmap' objects}
        3    0.001    0.000    0.001    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of '_io.BufferedReader' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Time spent in <code class="language-plaintext highlighter-rouge">count</code> matches previous runs, but as expected we’re spending far
more time reading. You’ll also notice that the runtime was roughly halved when I
ran the profile – I believe the reason is that the pages of the large file were
still in the page cache. There’s a nifty trick you can use to clear the page
cache, though. On macOS:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sync &amp;&amp; sudo purge
</code></pre></div></div>

<p>After doing this, the profile is again comparable to the first one.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes_buffered_read_write_mmap.py
Profiling repair_bytes_buffered_read_write_mmap on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713595 function calls in 28.173 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.542    1.542   28.173   28.173 /Users/charlie/projects/atlas/tsv-repair/repair_bytes_buffered_read_write_mmap.py:5(repair)
  1237862   20.947    0.000   20.947    0.000 {method 'readline' of 'mmap.mmap' objects}
  1237861    3.075    0.000    3.075    0.000 {method 'count' of 'bytes' objects}
  1000001    2.317    0.000    2.317    0.000 {method 'write' of '_io.BufferedWriter' objects}
        2    0.123    0.061    0.123    0.061 {built-in method _io.open}
   237860    0.108    0.000    0.108    0.000 {method 'rstrip' of 'bytes' objects}
        1    0.061    0.061    0.061    0.061 {method '__exit__' of 'mmap.mmap' objects}
        3    0.001    0.000    0.001    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of '_io.BufferedReader' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Now it’s clear we’re really taking a performance hit with <code class="language-plaintext highlighter-rouge">mmap.readline()</code>.</p>

<p>We’re not getting any performance benefit from mapping our file into memory –
page faults are killing performance when we only read through the file once.</p>

<h2 id="pypy">PyPy</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li><del>Avoid copying and doing extra work</del></li>
  <li><del>Multiprocessing</del></li>
  <li><del>Memory map the file</del></li>
  <li>PyPy</li>
</ol>

<p>This is a bit far afield, but I thought I’d try PyPy, a JIT Python interpreter
that can give a good speedup in some cases. Through Homebrew I was able to get
PyPy3.10, which is a little old, and in any case PyPy is best suited to
calculation-intensive work, not IO-bound work. Indeed, the result was a 4x
slowdown, maybe due to I/O improvements in more recent versions of Python.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 15:58:39	repair_bytes_buffered_read_write	22.078622
2026-03-13 15:59:08	repair_bytes_buffered_read_write	20.995271
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>
<p>The winning script involved only simple tweaks on my original attempt.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>


<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin_raw</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedReader</span><span class="p">(</span><span class="n">fin_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span>
            <span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span>
        <span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>

            <span class="c1"># Start by copying header
</span>            <span class="n">header</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
            <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">header</span><span class="p">)</span>

            <span class="n">expected_tabs</span> <span class="o">=</span> <span class="n">header</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>

            <span class="c1"># Then, iterate over the lines, repairing as you go
</span>            <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
                <span class="n">line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
                <span class="k">if</span> <span class="ow">not</span> <span class="n">line</span><span class="p">:</span>
                    <span class="k">break</span>
                <span class="n">line_tabs</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">==</span> <span class="n">expected_tabs</span><span class="p">:</span>
                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                    <span class="k">continue</span>

                <span class="c1"># Line repair
</span>                <span class="c1"># Grab next line and see if it complements
</span>                <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">True</span>
                <span class="k">while</span> <span class="n">line_tabs</span> <span class="o">&lt;</span> <span class="n">expected_tabs</span><span class="p">:</span>
                    <span class="n">continuation_line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
                    <span class="k">if</span> <span class="ow">not</span> <span class="n">continuation_line</span><span class="p">:</span>
                        <span class="k">break</span>
                    <span class="n">cline_tabs</span> <span class="o">=</span> <span class="n">continuation_line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
                    <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">+</span> <span class="n">cline_tabs</span> <span class="o">&lt;=</span> <span class="n">expected_tabs</span><span class="p">:</span>
                        <span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span> <span class="o">+</span> <span class="sa">b</span><span class="s">" "</span> <span class="o">+</span> <span class="n">continuation_line</span>
                        <span class="n">line_tabs</span> <span class="o">+=</span> <span class="n">cline_tabs</span>
                    <span class="k">else</span><span class="p">:</span>
                        <span class="c1"># Adding these lines would create a row with
</span>                        <span class="c1"># too many fields
</span>                        <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                        <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">continuation_line</span><span class="p">)</span>
                        <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">False</span>
                        <span class="k">break</span>
                <span class="k">if</span> <span class="n">_need_to_write</span><span class="p">:</span>
                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</code></pre></div></div>

<p>Benchmark (after purging):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 15:28:01	repair_bytes_buffered_read_write	6.520668
</code></pre></div></div>

<p>Profile (also after purging):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes_buffered_read_write.py
Profiling repair_bytes_buffered_read_write on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713594 function calls in 7.851 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.228    1.228    7.850    7.850 /Users/charlie/projects/atlas/tsv-repair/repair_bytes_buffered_read_write.py:4(repair)
  1237861    2.875    0.000    2.875    0.000 {method 'count' of 'bytes' objects}
  1237862    1.813    0.000    1.813    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1000001    1.770    0.000    1.770    0.000 {method 'write' of '_io.BufferedWriter' objects}
   237860    0.094    0.000    0.094    0.000 {method 'rstrip' of 'bytes' objects}
        2    0.070    0.035    0.070    0.035 {built-in method _io.open}
        4    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Best optimization techniques:</p>

<ul>
  <li><strong>Buffered reads and writes</strong> accounted for the majority of the speedup</li>
  <li><strong>Working in bytes</strong> saved ~1 sec on utf-8 decoding. There might be more savings
if your data has a higher density of multi-byte characters.</li>
</ul>
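<p>As a quick illustration of the bytes point, here’s a minimal sketch (the row
is made up) showing that counting the delimiter on raw bytes gives the same
answer as decoding first, while skipping the per-line <code class="language-plaintext highlighter-rouge">str</code> allocation and
utf-8 decode entirely:</p>

```python
# Hedged micro-example: counting tabs on raw bytes vs decoded text.
# Both give the same answer; the bytes version skips the decode step.
line = "0\tcharlie\tnormal\t20\n".encode("utf-8")

tabs_bytes = line.count(b"\t")               # no decode needed
tabs_str = line.decode("utf-8").count("\t")  # pays for a decode first

assert tabs_bytes == tabs_str == 3
```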

<p>Failed optimizations (no effect or negative effect):</p>

<ul>
  <li><strong>Memory mapping.</strong> The page fault overhead was too significant when we are
doing a sequential read through the file.</li>
  <li><strong>Multiprocessing.</strong> The savings we gain from splitting up the file are eaten
up by the time it takes to reassemble the individual outputs. This isn’t a
map/reduce job, and we can’t leave the files disassembled, so no dice for this
version of the problem.</li>
  <li><strong>Tweaking data structures.</strong> At this level of newline density, the code
doesn’t spend enough time in the newline-repair loop to see any effect of
optimizing data structures to reduce copying.</li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>This problem space isn’t as rich as the 1BR challenge, so there won’t be as many exotic tricks, but it was still fun to work through it. For inspiration I looked back through <a href="https://www.youtube.com/watch?v=utTaPW32gKY">Doug Mercer’s video</a> on 1BR in Python. It’s a great watch and has a number of nifty tricks. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>&quot;Don&apos;t automate complexity&quot;</title>
      <link>https://charlie-gallagher.github.io/2026/03/02/dont-automate-complexity.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/03/02/dont-automate-complexity.html</guid>
      <pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>You ever hear someone say “We can just automate it,” and feel almost certain
it’s a bad idea? I’m always fighting with this. And I can’t always explain why I
feel like it’s a bad idea to automate certain processes.</p>

<p>And then on Saturday I was paging through <em>Implementing Lean Software
Development</em> by Mary and Tom Poppendieck in a thrift store and found this short
section:</p>

<blockquote>
  <p>Don’t automate complexity.</p>

  <p>We are not helping our customers if we simply automate a complex or messy
process; we would simply be encasing a process filled with waste in a straight
jacket of software complexity. Any process that is a candidate for automation
should first be clarified and simplified, possibly even removing existing
automation. Only then can the process be clearly understood and the leverage
points for effective automation identified.</p>
</blockquote>


        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>You ever hear someone say “We can just automate it,” and feel almost certain
it’s a bad idea? I’m always fighting with this. And I can’t always explain why I
feel like it’s a bad idea to automate certain processes.</p>

<p>And then on Saturday I was paging through <em>Implementing Lean Software
Development</em> by Mary and Tom Poppendieck in a thrift store and found this short
section:</p>

<blockquote>
  <p>Don’t automate complexity.</p>

  <p>We are not helping our customers if we simply automate a complex or messy
process; we would simply be encasing a process filled with waste in a straight
jacket of software complexity. Any process that is a candidate for automation
should first be clarified and simplified, possibly even removing existing
automation. Only then can the process be clearly understood and the leverage
points for effective automation identified.</p>
</blockquote>


        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>More consistent hashing, or learning math for the advanced in age</title>
      <link>https://charlie-gallagher.github.io/2026/02/25/math-and-comp-sci.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/25/math-and-comp-sci.html</guid>
      <pubDate>Wed, 25 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>I wrote my last post about consistent hashing, which has really stuck with me.
The problem statement is so simple but the solution seems at a glance to be
counter-intuitive. Sure, hashing both the file key and the node name is easy to
say and implement. But that’s not enough for me to believe that it’s sufficient
to solve the problem of data partitioning, nor that it’s required. Isn’t there
any simpler version? An alternative?</p>

<p>Consistent hashing solves a generic problem – place keys on nodes, retrieve
keys from the correct node. At least, that’s the apparent problem. But after
reading the paper that suggested consistent hashing, I’ve started to appreciate
the subtlety of the solution.</p>

<p>The paper describes the following desirable properties:</p>

<ul>
  <li>Balance. Distribute items evenly across buckets.</li>
  <li>Monotonicity. If a new bucket is added, items may move into the new bucket,
but no item moves between two buckets that were already there.</li>
  <li>Spread. The degree to which clients disagree about where a data item belongs
when they don’t agree about which buckets are available.</li>
  <li>Load. Given some set of clients with different knowledge about what buckets
are available, the max load at any particular bucket.</li>
</ul>

<p>Three out of the four properties concern the behavior of the system when storage
nodes are unstable. Monotonicity minimizes disruptions when nodes are added or
removed from the cluster. Spread and load define how storage nodes are treated
when clients have incomplete information about them. The other property,
balance, is the one I originally thought was interesting about consistent
hashing.</p>

<p>Lots of functions could have balance; any even partitioning of the key space
will work. The other criteria are harder to achieve.</p>

<p>Some modeling will help explore the problem space. Data items could be modeled
as a set or as a sequence. If we model them as a
sequence, we can assign items to buckets using, for example, round-robin
algorithms, but those are unavailable if we consider data items an unordered
set. In any case, an algorithm that depends on a specific sequence of data items
is more difficult to use in a distributed, imperfect-information system. A good
solution will minimize state, and so if we can, we should use an algorithm that
works on a set.</p>

<p>We could say that there is some set <em>K</em> of all possible data keys, and that our
system must handle some reasonably-sized subset of <em>K</em>. So, given any subset of
<em>K</em>, our solution must satisfy the balance, monotonicity, spread, and load
properties. If you know the domain well, you could restrict this more and claim,
for example, that keys are expected to be alphabetically evenly distributed. If
you can’t make assumptions about the distribution of keys, you have to assume
keys can be any subset of <em>K</em>. Prefixes stop working because you can always find
a subset of <em>K</em> such that all elements have the same n-length prefix for any n.</p>

<p>Once you lay out your criteria, you try to satisfy them. A regular hash scheme
where you hash keys modulo the number of storage nodes has excellent balance but
poor monotonicity, spread, and load. It is a bad scheme under imperfect
conditions. On the other hand, partitioning on prefix (with some adjustments)
has excellent monotonicity, spread, and load, but the keys will most likely not
be evenly distributed.</p>
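<p>To put a rough number on the mod-N scheme’s poor monotonicity, here’s a
hedged sketch (synthetic keys, md5 standing in for whatever hash a real system
would use) counting how many keys get remapped when a fifth bucket is added:</p>

```python
# Hedged sketch: measure key churn under the naive hash-mod-N scheme.
# md5 is just a convenient stable hash; the keys are synthetic.
import hashlib

def bucket(key: str, n: int) -> int:
    """Assign a key to one of n buckets by hashing modulo n."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % n

keys = [f"key-{i}" for i in range(10_000)]

# Grow the cluster from 4 buckets to 5 and count remapped keys.
moved = sum(bucket(k, 4) != bucket(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 80%
```

<p>A key survives the resize only when its hash lands on the same value modulo
both 4 and 5, so most keys move – exactly the disruption monotonicity is meant
to rule out.</p>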

<p>The traditional consistent hash scheme (hash the keys, hash the storage node
names, assign keys to successors) actually doesn’t have great spread or load.
It’s too likely that some node will land close to another one, causing one node
to shoulder the burden. You have to use <em>virtual nodes</em>, which add some
complexity to the implementation but have nicer properties.</p>
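<p>Here’s a minimal sketch of that ring with virtual nodes; the node names and
replica count are illustrative, not from the paper. Each node is hashed onto
the circle many times, and a key belongs to its clockwise successor, so adding
a node only claims keys from its new neighbors:</p>

```python
# Hedged sketch of the traditional consistent-hash ring with virtual
# nodes; node names and the replica count are made up for illustration.
import bisect
import hashlib

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=64):
        # Each node appears `replicas` times on the circle, which
        # smooths out uneven gaps (better balance, spread, and load).
        self.points = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self.hashes = [h for h, _ in self.points]

    def lookup(self, key: str) -> str:
        # A key belongs to its clockwise successor; wrap past the end.
        i = bisect.bisect(self.hashes, ring_hash(key)) % len(self.points)
        return self.points[i][1]

old = Ring(["node-a", "node-b", "node-c"])
new = Ring(["node-a", "node-b", "node-c", "node-d"])
keys = [f"key-{i}" for i in range(1000)]

# Monotonicity: any key that moves can only move to the new node.
for k in keys:
    if old.lookup(k) != new.lookup(k):
        assert new.lookup(k) == "node-d"
```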

<p>The paper uses the abstraction of a view, <em>V</em>, to describe what clients know
about storage nodes. The abstraction makes sense if clients themselves route
requests based on the data key, but does it work when clients just send
everything to a load balancer that forwards requests to a non-busy storage node
for further routing? I think it works. A view could also represent what
different storage nodes know about each other, and since we have some control
over what the different storage nodes know (via gossip) this changes the game a
bit. We can, like Chord, structure the views intentionally to guarantee certain
routing properties. You could still call each node’s perspective a <em>view</em> and
evaluate it on the criteria from the original paper. You could also reframe it
as a routing table with certain properties, though. Is that more or less useful?
It’s certainly more specific, and we need to be specific to gain a performance
edge. But it’s less generic, and we might be missing out on a “greater truth.”</p>

<p>The paper leaves out a few practical criteria. Clients need to be able to
deterministically place and search for data items on particular nodes. The
function needs to be efficient. If routing is necessary<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, nodes in the system
need to be able to route a request to the correct bucket efficiently.</p>

<hr />

<p>I let this discussion get a bit mathematical to match the tone of the paper,
but also because math has been on my mind lately. My math skills have atrophied
since I left college. I haven’t even done a good proof in years.</p>

<p>But I’ve been reading comp-sci papers lately, and so I’ve started to turn that
part of my brain back on. After reflecting on it the past few weeks, I’ve come
to think that programmers and mathematicians are in a very similar business. The
history of math, just like programming, has been a history of discovering which
abstractions, notions, and notations have useful properties for solving classes
of problems.</p>

<p>I like theory ok, but perfect mathematical reasoning is a burden in programming.
If you hold yourself to the standard of mathematical proof, then you’ll spend
forever modeling your system only to find that your proof is stretching for
pages and pages, and it still doesn’t account for important things like
differences in computer architecture and maintenance cost. In practical systems,
you approximate and experiment. Software systems are <a href="https://oxide-and-friends.transistor.fm/episodes/grown-up-zfs-data-corruption-bug">filled with magic
numbers</a>
found by experimentation.</p>

<p>Even still, I regret not being able to take more math classes in college,
because the mathematical thought process is so similar to the programmer’s. You
find the abstraction with the right level of power for solving the problem
you’re interested in. You wouldn’t use ffmpeg to edit videos for the same reason
you wouldn’t teach children basic algebra using group theory. The programmer
eventually learns that ffmpeg underlies everything, and I guess one way of
looking at work in mathematics is the process of discovering the lowest level
components that explain how the higher level components work. It’s reverse
engineering Nature.</p>

<p>I’ve thought about starting to learn math again, at least to struggle with
problems every now and then. But how do you learn math at this point? What math
do you learn? I spent some time thinking about what you get out of a college
math degree.</p>

<p>There are a couple parts. There are the capabilities to reframe problems,
abstract things, unify, generalize, and simplify. Then the ability to write good
proofs and communicate with other mathematicians. And then there are the
subjects, the tradition, and the important results. The universal processes of
all math fields – abstracting, reframing, and so on – are my favorite parts. I
use those skills every day in programming and systems design. Proofs aren’t too
bad. But I’m lacking majorly in the tradition.</p>

<p>I have that in common with another group of math enthusiasts, child prodigies.
How do they cope? In <em>The Man Who Only Loved Numbers</em> about Paul Erdos, Paul
Hoffman says,</p>

<blockquote>
  <p>Perfect numbers and friendly numbers are among the areas of mathematics in
which child prodigies tend to show their stuff. Like chess and music, such
areas do not require much technical expertise. No child prodigies exist among
historians or legal scholars, because years are needed to master those
disciplines. A child can learn the rules of chess in a few minutes, and native
ability takes over from there. So it is with areas of mathematics like these,
which are aspects of elementary number theory (the study of the integers),
graph theory, and combinatorics (problems involving the counting and
classifying of objects). You can easily explain prime numbers, perfect
numbers, and friendly numbers to a child, and he or she can start playing
around with them and exploring their properties. Many areas of mathematics,
however, require technical expertise which is acquired over years of
assimilating definitions and previous results.</p>

  <p>p. 48</p>
</blockquote>

<p>And just as I was feeling better about the possibility that I could consider
myself “kinda good at math” without knowing the integral of 1/x, I saw on HN
yesterday a paper about <a href="https://news.ycombinator.com/item?id=47123689">Terence Tao, age 7</a><sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.
Sure enough, he was good at everything. “He has a prodigious long-term memory
for mathematical definitions, proofs and ideas with which has become
acquainted,” writes the author of the paper M. A. Ken Clements. He mentions in
another part of the paper, “Not only did he have an astounding grasp
of algebraic definitions, for someone who was still seven years old, but I was
amazed at how he used sophisticated mathematical language freely.” He was the
whole package at 7 years old. Widely read, good at math communication, and a
natural problem solver. The paper is a great read.</p>

<p>As an aside, I also learned how to evaluate a continued fraction from Terence in
this paper, and I’m wigging out over how programmerly the answer is. Here’s the
whole section:</p>

<p><img src="/assets/images/2026-math-and-comp-sci/ttao-continued-fraction.png" alt="Terence Tao solves a continued fraction at age 7" /></p>

<p>It’s a recursive problem, so you introduce x as a recursive definition of the
continued fraction. Then it looks just like a quadratic as Terence’s mom
suggested. Multiply both sides by x, rearrange, and solve for the positive
root. Of course it wouldn’t work in a program since it recurses infinitely, but
it has the same feel as any recursive function.</p>
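<p>Since the image doesn’t come through in text, here’s a sketch assuming, for
illustration, the classic all-ones fraction (the one Terence solved may
differ). The same recursive trick applies: name the whole fraction x, notice
that x reappears inside itself, and solve the resulting quadratic.</p>

```python
# Hedged sketch: assume the all-ones fraction x = 1 + 1/(1 + 1/(1 + ...)).
# Writing x = 1 + 1/x and rearranging gives x**2 - x - 1 = 0; the
# positive root is the golden ratio. A program can't recurse forever,
# but a truncated, iterative version converges on the same value:
x = 1.0
for _ in range(50):
    x = 1 + 1 / x

golden = (1 + 5 ** 0.5) / 2
assert abs(x - golden) < 1e-9
```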

<p>Child prodigies get the problems and the intuition for mathematical constructs.
But I think I have a leg up in terms of appreciating the <a href="https://en.wikipedia.org/wiki/The_Mathematical_Experience">experience of
mathematics</a>. The
experience of math is the mind-altering effort to grasp the importance of a
pattern and what’s required to state the problem in “simplest” terms. The
tradition of math on the other hand is the set of constructs, patterns, results,
and abstractions that have been found to be substantial to the history of
mathematics, both theoretical and applied. One isn’t too useful without the
other. Sure I can think really hard about some problem, but without the
tradition of math, I can’t build on any existing work. Like a programmer who
doesn’t know what libraries are out there, and so has to build everything from
scratch.</p>

<p>I wish the author of the paper on Terence Tao had asked him what he thought
mathematicians did. He knew quite a few of them. Did he already appreciate that
math isn’t just given in books, but thought, rethought, argued about, and
revised? (He probably did.)</p>

<hr />

<p>Like in any tradition, it’s hard to jump into mathematics. The most important
results of the last few decades are built on subtleties, arguments, tradeoffs,
and alternatives that the casual student will never see. In my econometrics
class, we used matrices a bit for linear regressions. After a short introduction
about what matrices represent and how they work, we started on some definitions
like how matrix multiplication is defined. It was elusive to me at the time why
you would (a) have a non-commutative definition, and (b) do multiplication by
lining up row and column values pairwise. Who decided this, and why? But sure
enough, doing this led to the results we needed, so I put the questions in the
back pocket for later.</p>

<p>We trade off power for understanding, just as you don’t need to really know what
a socket is if you take the usage examples for granted. But I think it’s a
mistake to leave it there, to never encourage a student to figure out something
for themselves. Trying for a few days to derive some “obvious” idea is the only
way to appreciate the difficulty of coming up with good math abstractions. I
wasn’t gifted in math, but I also wasn’t encouraged to try to understand <em>why</em>
math is being taught one way or another. And I think that was a scholastic miss
on the curriculum’s part.</p>

<p>One of my favorite essays from Jon Bentley’s <em>Programming Pearls</em> (technically
this one is from <em>More Programming Pearls</em>) is about writing your own profiler,
and in the end he writes one for Awk. It seemed utterly impossible to me to
write a profiler, but I realized I had never even considered the problem space.
<em>Why not write a profiler?</em> And why not try to find out for yourself what are
the properties of a connected vs partitioned graph of storage nodes? Why not try
to think up a new API for a threading library to rival pthreads? A bad idea can
be discussed and expanded on, but you can’t do anything with no idea at all.</p>

<p>These days I’m pulled towards low-level programming, myself. I like the history
and the forgotten alternatives, the influential ideas that faded and yet
contributed to modern programming. But I also never forget that my job is to
build effective software, even if I can’t take the time to appreciate some tech
that I find a bit mysterious and interesting. I just have to file it away until
I have some spare time.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Routing was not necessary in the original paper, because the paper was concerned with cache nodes. Cache misses are a bit wasteful but not catastrophic. It’s enough to minimize them, not make them impossible. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I’d skip the comment section. It’s mostly people flexing their own child prodigiousness and one thread about eugenics (“this proves to me that biological intelligence hasn’t nearly reached its peak. If we select for pure intelligence, biological brains can get much smarter”). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>I wrote my last post about consistent hashing, which has really stuck with me.
The problem statement is so simple but the solution seems at a glance to be
counter-intuitive. Sure, hashing both the file key and the node name is easy to
say and implement. But that’s not enough for me to believe that it’s sufficient
to solve the problem of data partitioning, nor that it’s required. Isn’t there
any simpler version? An alternative?</p>

<p>Consistent hashing solves a generic problem – place keys on nodes, retrieve
keys from the correct node. At least, that’s the apparent problem. But after
reading the paper that suggested consistent hashing, I’ve started to appreciate
the subtlety of the solution.</p>

<p>The paper describes the following desirable properties:</p>

<ul>
  <li>Balance. Distribute items evenly across buckets.</li>
  <li>Monotonicity. If a new bucket is added, items may move into the new bucket,
but no item moves between two buckets that were already there.</li>
  <li>Spread. The degree to which clients disagree about where a data item belongs
when they don’t agree about which buckets are available.</li>
  <li>Load. Given some set of clients with different knowledge about what buckets
are available, the max load at any particular bucket.</li>
</ul>

<p>Three out of the four properties concern the behavior of the system when storage
nodes are unstable. Monotonicity minimizes disruptions when nodes are added or
removed from the cluster. Spread and load define how storage nodes are treated
when clients have incomplete information about them. The other property,
balance, is the one I originally thought was interesting about consistent
hashing.</p>

<p>Lots of functions could have balance; any even partitioning of the key space
will work. The other criteria are harder to achieve.</p>

<p>Some modeling will help explore the problem space. The data could be described
as a set of items or as a sequence of items. If we model it as a sequence, we
can assign items to buckets using, for example, a round-robin algorithm, but
that option disappears if we consider the data an unordered set. In any case, an
algorithm that depends on a specific sequence of data items is more difficult to
use in a distributed, imperfect-information system. A good solution will
minimize state, so if we can, we should use an algorithm that works on a
set.</p>

<p>We could say that there is some set <em>K</em> of all possible data keys, and that our
system must handle some reasonably-sized subset of <em>K</em>. So, given any subset of
<em>K</em>, our solution must satisfy the balance, monotonicity, spread, and load
properties. If you know the domain well, you could restrict this more and claim,
for example, that keys are expected to be alphabetically evenly distributed. If
you can’t make assumptions about the distribution of keys, you have to assume
keys can be any subset of <em>K</em>. Prefixes stop working because you can always find
a subset of <em>K</em> such that all elements have the same n-length prefix for any n.</p>

<p>Once you lay out your criteria, you try to satisfy them. A regular hash scheme
where you hash keys modulo the number of storage nodes has excellent balance but
poor monotonicity, spread, and load. It is a bad scheme under imperfect
conditions. On the other hand, partitioning on prefix (with some adjustments)
has excellent monotonicity, spread, and load, but the keys will most likely not
be evenly distributed.</p>
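
<p>The modulo scheme’s poor monotonicity is easy to demonstrate. Here’s a small
sketch (the hash function and key set are arbitrary choices of mine) that counts
how many keys change buckets when an eleventh bucket is added:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

def bucket(key, n_buckets):
    # Plain hash-modulo placement.
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % n_buckets

keys = [f"key-{i}" for i in range(10_000)]
before = {k: bucket(k, 10) for k in keys}
after = {k: bucket(k, 11) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 9 in 10 keys move
</code></pre></div></div>

<p>Only about one key in eleven keeps its bucket (a key stays put only when its
hash gives the same remainder mod 10 and mod 11), so nearly the whole data set
migrates for a single membership change.</p>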

<p>The traditional consistent hash scheme (hash the keys, hash the storage node
names, assign keys to successors) actually doesn’t have great spread or load.
It’s too likely that some node will land close to another one, causing one node
to shoulder the burden. You have to use <em>virtual nodes</em>, which add some
complexity to the implementation but have nicer properties.</p>

<p>The paper uses the abstraction of a view, <em>V</em>, to describe what clients know
about storage nodes. The abstraction makes sense if clients themselves route
requests based on the data key, but does it work when clients just send
everything to a load balancer that forwards requests to a non-busy storage node
for further routing? I think it works. A view could also represent what
different storage nodes know about each other, and since we have some control
over what the different storage nodes know (via gossip) this changes the game a
bit. We can, like Chord, structure the views intentionally to guarantee certain
routing properties. You could still call each node’s perspective a <em>view</em> and
evaluate it on the criteria from the original paper. You could also reframe it
as a routing table with certain properties, though. Is that more or less useful?
It’s certainly more specific, and we need to be specific to gain a performance
edge. But it’s less generic, and we might be missing out on a “greater truth.”</p>

<p>The paper leaves out a few practical criteria. Clients need to be able to
deterministically place and search for data items on particular nodes. The
function needs to be efficient. If routing is necessary<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, nodes in the system
need to be able to route a request to the correct bucket efficiently.</p>

<hr />

<p>I let this discussion get a bit mathematical to match the tone of the paper,
but also because math has been on my mind lately. My math skills have atrophied
since I left college. I haven’t even done a good proof in years.</p>

<p>But I’ve been reading comp-sci papers lately, and so I’ve started to turn that
part of my brain back on. After reflecting on it the past few weeks, I’ve come
to think that programmers and mathematicians are in a very similar business. The
history of math, just like programming, has been a history of discovering which
abstractions, notions, and notations have useful properties for solving classes
of problems.</p>

<p>I like theory ok, but perfect mathematical reasoning is a burden in programming.
If you hold yourself to the standard of mathematical proof, then you’ll spend
forever modeling your system only to find that your proof is stretching for
pages and pages, and it still doesn’t account for important things like
differences in computer architecture and maintenance cost. In practical systems,
you approximate and experiment. Software systems are <a href="https://oxide-and-friends.transistor.fm/episodes/grown-up-zfs-data-corruption-bug">filled with magic
numbers</a>
found by experimentation.</p>

<p>Even still, I regret not being able to take more math classes in college,
because the mathematical thought process is so similar to the programmer’s. You
find the abstraction with the right level of power for solving the problem
you’re interested in. You wouldn’t use ffmpeg to edit videos for the same reason
you wouldn’t teach children basic algebra using group theory. The programmer
eventually learns that ffmpeg underlies everything, and I guess one way of
looking at work in mathematics is the process of discovering the lowest level
components that explain how the higher level components work. It’s reverse
engineering Nature.</p>

<p>I’ve thought about starting to learn math again, at least to struggle with
problems every now and then. But how do you learn math at this point? What math
do you learn? I spent some time thinking about what you get out of a college
math degree.</p>

<p>There are a couple parts. There are the capabilities to reframe problems,
abstract things, unify, generalize, and simplify. Then the ability to write good
proofs and communicate with other mathematicians. And then there are the
subjects, the tradition, and the important results. The universal processes of
all math fields – abstracting, reframing, and so on – are my favorite parts. I
use those skills every day in programming and systems design. Proofs aren’t too
bad. But I’m lacking majorly in the tradition.</p>

<p>I have that in common with another group of math enthusiasts, child prodigies.
How do they cope? In <em>The Man Who Loved Only Numbers</em> about Paul Erdos, Paul
Hoffman says,</p>

<blockquote>
  <p>Perfect numbers and friendly numbers are among the areas of mathematics in
which child prodigies tend to show their stuff. Like chess and music, such
areas do not require much technical expertise. No child prodigies exist among
historians or legal scholars, because years are needed to master those
disciplines. A child can learn the rules of chess in a few minutes, and native
ability takes over from there. So it is with areas of mathematics like these,
which are aspects of elementary number theory (the study of the integers),
graph theory, and combinatorics (problems involving the counting and
classifying of objects). You can easily explain prime numbers, perfect
numbers, and friendly numbers to a child, and he or she can start playing
around with them and exploring their properties. Many areas of mathematics,
however, require technical expertise which is acquired over years of
assimilating definitions and previous results.</p>

  <p>p. 48</p>
</blockquote>

<p>And just as I was feeling better about the possibility that I could consider
myself “kinda good at math” without knowing the integral of 1/x, I saw on HN
yesterday a paper about <a href="https://news.ycombinator.com/item?id=47123689">Terence Tao, age 7</a><sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.
Sure enough, he was good at everything. “He has a prodigious long-term memory
for mathematical definitions, proofs and ideas with which he has become
acquainted,” writes the author of the paper, M. A. Ken Clements. He mentions in
another part of the paper, “Not only did he have an astounding grasp
of algebraic definitions, for someone who was still seven years old, but I was
amazed at how he used sophisticated mathematical language freely.” He was the
whole package at 7 years old. Widely read, good at math communication, and a
natural problem solver. The paper is a great read.</p>

<p>As an aside, I also learned how to evaluate a continued fraction from Terence in
this paper, and I’m wigging out over how programmerly the answer is. Here’s the
whole section:</p>

<p><img src="/assets/images/2026-math-and-comp-sci/ttao-continued-fraction.png" alt="Terence Tao solves a continued fraction at age 7" /></p>

<p>It’s a recursive problem, so you introduce x as a recursive definition of the
continued fraction. Then it looks just like a quadratic as Terence’s mom
suggested. Multiply both sides by x, rearrange, and solve for the positive
root. Of course it wouldn’t work in a program since it recurses infinitely, but
it has the same feel as any recursive function.</p>
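
<p>The actual fraction is in the image above, so as a stand-in take the simplest
continued fraction, x = 1 + 1/(1 + 1/(1 + …)). Setting x = 1 + 1/x and
multiplying both sides by x gives x<sup>2</sup> − x − 1 = 0, whose positive root
is the golden ratio. Truncating the “infinite recursion” converges to the same
value:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Evaluate x = 1 + 1/(1 + 1/(1 + ...)) by truncating the recursion.
def truncated(depth):
    x = 1.0
    for _ in range(depth):
        x = 1.0 + 1.0 / x
    return x

# The positive root of x**2 - x - 1 = 0, via the quadratic formula.
quadratic_root = (1 + 5 ** 0.5) / 2  # about 1.618

print(truncated(40), quadratic_root)  # the two values agree
</code></pre></div></div>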

<p>Child prodigies get the problems and the intuition for mathematical constructs.
But I think I have a leg up in terms of appreciating the <a href="https://en.wikipedia.org/wiki/The_Mathematical_Experience">experience of
mathematics</a>. The
experience of math is the mind-altering effort to grasp the importance of a
pattern and what’s required to state the problem in “simplest” terms. The
tradition of math on the other hand is the set of constructs, patterns, results,
and abstractions that have been found to be substantial to the history of
mathematics, both theoretical and applied. One isn’t too useful without the
other. Sure I can think really hard about some problem, but without the
tradition of math, I can’t build on any existing work. Like a programmer who
doesn’t know what libraries are out there, and so has to build everything from
scratch.</p>

<p>I wish the author of the paper on Terence Tao had asked him what he thought
mathematicians did. He knew quite a few of them. Did he already appreciate that
math isn’t just given in books, but thought, rethought, argued about, and
revised? (He probably did.)</p>

<hr />

<p>Like in any tradition, it’s hard to jump into mathematics. The most important
results of the last few decades are built on subtleties, arguments, tradeoffs,
and alternatives that the casual student will never see. In my econometrics
class, we used matrices a bit for linear regressions. After a short introduction
about what matrices represent and how they work, we started on some definitions
like how matrix multiplication is defined. It was elusive to me at the time why
you would (a) have a non-commutative definition, and (b) do multiplication by
lining up row and column values pairwise. Who decided this, and why? But sure
enough, doing this led to the results we needed, so I put the questions in the
back pocket for later.</p>

<p>We trade off power for understanding, just as you don’t need to really know what
a socket is if you take the usage examples for granted. But I think it’s a
mistake to leave it there, to never encourage a student to figure out something
for themselves. Trying for a few days to derive some “obvious” idea is the only
way to appreciate the difficulty of coming up with good math abstractions. I
wasn’t gifted in math, but I also wasn’t encouraged to try to understand <em>why</em>
math is being taught one way or another. And I think that was a scholastic miss
on the curriculum’s part.</p>

<p>One of my favorite essays from Jon Bentley’s <em>Programming Pearls</em> (technically
this one is from <em>More Programming Pearls</em>) is about writing your own profiler,
and in the end he writes one for Awk. It seemed utterly impossible to me to
write a profiler, but I realized I had never even considered the problem space.
<em>Why not write a profiler?</em> And why not try to find out for yourself what are
the properties of a connected vs partitioned graph of storage nodes? Why not try
to think up a new API for a threading library to rival pthreads? A bad idea can
be discussed and expanded on, but you can’t do anything with no idea at all.</p>

<p>These days I’m pulled towards low-level programming, myself. I like the history
and the forgotten alternatives, the influential ideas that faded and yet
contributed to modern programming. But I also never forget that my job is to
build effective software, even if I can’t take the time to appreciate some tech
that I find a bit mysterious and interesting. I just have to file it away until
I have some spare time.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Routing was not necessary in the original paper, because the paper was concerned with cache nodes. Cache misses are a bit wasteful but not catastrophic. It’s enough to minimize them, not make them impossible. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I’d skip the comment section. It’s mostly people flexing their own child prodigiousness and one thread about eugenics (“this proves to me that biological intelligence hasn’t nearly reached its peak. If we select for pure intelligence, biological brains can get much smarter”). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>The niftiness and diversity of consistent hashing</title>
      <link>https://charlie-gallagher.github.io/2026/02/14/consistent-hashing.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/14/consistent-hashing.html</guid>
      <pubDate>Sat, 14 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>I’ve been interested in decentralized data in distributed systems lately.
There’s too much data for any one storage node to handle, so you have lots of
storage nodes and each gets some of the data. But how do you decide which node
is responsible for a certain data item?</p>

<p>The problem space is diverse. In peer-to-peer systems, nodes just sort of <em>have</em>
data that they register with the network. They have to not only find the node
that has data, but also find out what data exists in the first place. Then there
are distributed key/value systems like the one described in Amazon’s Dynamo
paper. Data is placed in storage to balance the load and replicated across nodes
to increase resiliency. And then there are CDNs and cache systems. In
decentralized caches like the one described in the original consistent hashing
paper, clients make an educated guess about the cache node that holds the data
they’re looking for, and the design goal is to give each client a way to guess
right most of the time with limited information.</p>

<p>I think the last problem is interesting: How can we make an educated guess about
what storage node might have the data when we only have partial information
about the network? It requires a bit of an intuitive leap.</p>

<p>Each data item is represented by a key, so there’s some concept of a <em>key
space</em>, which represents all possible data keys. We can use a <em>hash space</em> as a
pretty effective proxy for the key space, and then partition that hash space
into ranges. Each storage node in our system becomes responsible for one of the
ranges in the hash space.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>A hash space is the set of possible outputs of a hash function – for example a
SHA1 hash is 160 bits, so the hash space is all integers from 0 to
2<sup>160</sup> − 1. We know that if we hash different keys, we’re very likely to
get different hash values, so this satisfies the “hash space as proxy for all
possible keys” criterion. And hash functions are reasonably unbiased, so a
hashed key has essentially equal probability of landing anywhere between 0 and
2<sup><em>n_bits</em></sup> − 1. The hash space is essentially a number line, and we can
make it continuous by connecting the tail back to the head, so that the
successor of the largest hash value is the smallest hash value.</p>

<p>You place a data item on the ring by hashing its key.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pos_x</span> <span class="o">=</span> <span class="n">sha1</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">key</span><span class="p">)</span>
</code></pre></div></div>

<p>Our storage nodes are each responsible for a range, so we need a way to define
these ranges. To that end, we place <em>nodes</em> on the ring as well, by hashing
their IP addresses.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pos_server</span> <span class="o">=</span> <span class="n">sha1</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">ipaddr</span><span class="p">)</span>
</code></pre></div></div>

<p>We decide that each node is responsible for the hash values between itself and
its predecessor, which is to say the range of hash values between itself and the
nearest counter-clockwise neighbor on the hash ring.</p>
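
<p>A minimal sketch of that successor rule, using a sorted list of node positions
and a bisect lookup (the IP addresses and the shrunken hash space are stand-ins
of mine):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect
import hashlib

def h(s, bits=16):
    # A 16-bit hash space keeps the positions readable.
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** bits)

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
ring = sorted((h(ip), ip) for ip in servers)

def owner(key):
    # The owner is the first node at or clockwise of the key's position,
    # wrapping past the largest hash back to the smallest.
    i = bisect.bisect_left(ring, (h(key), ""))
    return ring[i % len(ring)][1]

print(owner("some-key"))  # one of the three addresses
</code></pre></div></div>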

<p>With a 64-bit hash space and 10 nodes, you might end up with a configuration
like this.</p>

<p><img src="/assets/images/2026-consistent-hashing/hash_ring_s10_n64_01.png" alt="A hash ring with 10 nodes in a 64-bit hash space" /></p>

<p>Now if a client knows about any set of storage nodes, it can make its own
partition of the hash ring and be mostly correct! If the client knows about 9
out of 10 nodes, when it makes its partition of the hash ring, it will correctly
identify 8 out of the 10 actual ranges, and it won’t do too badly guessing about
the 9th and 10th ranges, either.</p>
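
<p>That claim is easy to check with a quick simulation: build the full ring, drop
one node from the client’s view, and count how often the client’s guess still
matches (the addresses and key set below are made up for the sketch):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect
import hashlib

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

def owner(ring, key):
    # Successor lookup, wrapping around the end of the ring.
    i = bisect.bisect_left(ring, (h(key), ""))
    return ring[i % len(ring)][1]

servers = [f"10.0.0.{i}" for i in range(10)]
full_ring = sorted((h(s), s) for s in servers)
partial_ring = sorted((h(s), s) for s in servers[:9])  # one node unknown

keys = [f"key-{i}" for i in range(10_000)]
agree = sum(1 for k in keys if owner(full_ring, k) == owner(partial_ring, k))
print(f"{agree / len(keys):.0%} of guesses still correct")
</code></pre></div></div>

<p>The only keys the client misroutes are the ones owned by the node it doesn’t
know about, so the agreement rate stays high.</p>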

<p>Here’s how consistent hashing scores on a few more criteria:</p>

<ul>
  <li><strong>Precise.</strong> About as accurate as you can be with limited information about
storage nodes.</li>
  <li><strong>Deterministic.</strong> Yep.</li>
  <li><strong>Equal distribution of work.</strong> The hash ring is reasonably well distributed,
but it’s not perfect. The size of each range is random.</li>
  <li><strong>Adaptable to changes in membership.</strong> This is one of the main motivations
for consistent hashing. If you add a node, it divides some existing range, and
the only redistribution that happens is with the new node’s successor. If you
remove a node, the successor node becomes responsible for the range of the node
that left, but no other nodes are affected.</li>
</ul>

<p>In cache systems, getting close to the right node is good enough. Even if a node
isn’t technically responsible for a key, it can still request the key from the
server and cache the key itself. Any other clients that have the same incomplete
information about the network will hit this (non-responsible) node and find the
cached data there. Responsibility in this system is weak, but the data is stored
with a lot of <em>locality</em> relative to the hash ring.</p>

<p>Sometimes you need to find exactly the node responsible for a key. The Chord
protocol provides this stricter routing. When a server joins the network, it
notifies its neighbors and then learns about them. The protocol ensures that
the new node knows a lot about its immediate location on the ring, but less
about the nodes that are farther away from it. When a client makes a data
request to find a key, it reaches out to <em>any</em> node on the network. This node
probably doesn’t have the data, but it knows someone who is close. The storage
node forwards the request to the closest node it knows of that precedes the
key’s position on the ring, without passing it. This node
could be the owner of the data item, but if it’s not, it knows enough about its
local area on the ring to find a node that’s closer to it. The routing tables
are built in such a way that if a node doesn’t know of any closer node to the
data item, it is the owner of the data item.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<p>The number of hops in a Chord data retrieval is bounded. Latency isn’t optimal,
because there may be some number of hops to find the data, and in fact the
nearness on the ring says nothing about how close two nodes are in the world.
This is one of the hard problems of overlay networks, but there are
optimizations for it.</p>

<p>So, different systems treat the routing requirement differently. A cache system
is able to relax the routing requirement, because it tries to reduce cache
misses but not prevent them entirely. Chord meanwhile guarantees that a key is
found. It’s a scalable protocol that works in very unstable networks, at the
cost of extra latency.</p>

<p>There are still a few issues. First, a node leaving is likely to increase the
workload for another node, and a node entering the ring is likely to divide
another node’s workload while leaving the remaining nodes with the same
workload. Second, random positions on the ring are not very likely to be evenly
spaced. (There is after all only one evenly spaced configuration.) Third, the
scheme ignores heterogeneity – some servers can handle more work than others.</p>

<p>To illustrate the load balancing problem with hashing servers to the hash ring,
I generated 10 different random configurations to show that they are not
generally evenly spaced.</p>

<p><img src="/assets/images/2026-consistent-hashing/s10_n64.gif" alt="Ten random configurations of 10 nodes on a 64-bit hash ring" /></p>

<p>Next, I simulated 1,000 requests for data, and I ran the simulation 10 times,
plotting the results. In this plot, the size of the node represents the number
of requests serviced at the node.</p>

<p><img src="/assets/images/2026-consistent-hashing/s10_n64_hits_random.gif" alt="Simulated request load per node over 10 runs, with node size showing requests serviced" /></p>

<p>If we want to smooth out the load, the solution is to assign each server a
number of random <em>virtual nodes</em> on the ring. Virtual nodes decrease the chances
that a node gets a particularly big range of responsibility on the ring –
instead of getting unlucky once, it would have to get unlucky N times in a row
for N virtual nodes per storage node.</p>
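
<p>Extending the earlier placement scheme to virtual nodes is a small change: each
server is hashed onto the ring under several derived names, and the lookup maps
a virtual position back to its server (the server names and vnode count are my
choices for the sketch):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect
import hashlib
from collections import Counter

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

VNODES = 32  # virtual nodes per server

def build_ring(servers):
    # Each server appears VNODES times under derived names like "a#7".
    ring = []
    for s in servers:
        for i in range(VNODES):
            ring.append((h(f"{s}#{i}"), s))
    ring.sort()
    return ring

def owner(ring, key):
    i = bisect.bisect_left(ring, (h(key), ""))
    return ring[i % len(ring)][1]

ring = build_ring(["a", "b", "c", "d"])
load = Counter(owner(ring, f"key-{i}") for i in range(10_000))
print(load)  # loads are far more even than with one point per server
</code></pre></div></div>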

<p>Here’s another 1,000 requests made to a ring with 32 nodes instead of 10. I ran
the simulation 10 times and plotted each outcome.</p>

<p><img src="/assets/images/2026-consistent-hashing/s32_n64_hits_random.gif" alt="Simulated request load on a ring with 32 nodes, over 10 runs" /></p>

<p>If you take any N nodes (say 4) and average their size, you get a reasonable
number of requests. We can also solve the heterogeneity problem by allocating
each server a number of virtual nodes according to its abilities. Finally,
membership changes more evenly rebalance the nodes. When a node is removed, its
virtual nodes are removed from the circle, and the extra work is split among all
of the owners of the virtual nodes’ successors. It’s very likely that the
successors will be owned by several nodes, not just one.</p>

<p>Routing is a little more complicated. The system needs to store some amount of
extra metadata about how to map virtual nodes to actual nodes. For this reason,
virtual nodes are most effective in smaller systems where each storage node
might be responsible for a large chunk of the hash ring.</p>

<p>The design space becomes expansive at this point as particular systems balance
metadata and latency with other guarantees they need to make. But this is the
essence of the thing. We create a proxy for the key space that tends to evenly
distribute nodes and then divvy up that domain across the storage nodes.
Elegant, if you ask me.</p>

<h1 id="sources">Sources</h1>

<ul>
  <li><a href="https://www.akamai.com/site/en/documents/research-paper/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf">Consistent Hashing and Random Trees: distributed caching protocols for relieving hot spots on the world wide web</a></li>
  <li><a href="https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf">Chord: a scalable peer-to-peer lookup service for internet applications</a></li>
  <li><a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">Dynamo: Amazon’s highly available key-value store</a></li>
  <li><em>Distributed Systems</em>, 4e, by Maarten Van Steen and Andrew S. Tanenbaum
(<a href="https://www.distributed-systems.net/index.php/books/ds4/">link</a>)</li>
  <li><em>Computer networks: a systems approach</em>, 3e, by Peterson and Davie</li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>There are lots of ways to transform and then partition a key space, but most don’t work well. It’s vital that all keys are distributed <em>evenly</em> across storage nodes. So you might try to partition the key space by prefix and then place the key on the server whose name has the same prefix. This doesn’t work well because (a) there’s no guarantee that servers will have distinct prefixes, especially when they’re in the same subnet, (b) if they do have distinct prefixes, they’re probably not evenly spaced, and (c) you become significantly limited in the number of buckets. The longest prefix for an IPv4 address is 32 bits. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>See also Pastry, which optimizes the routing tables differently. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>I’ve been interested in decentralized data in distributed systems lately.
There’s too much data for any one storage node to handle, so you have lots of
storage nodes and each gets some of the data. But how do you decide which node
is responsible for a certain data item?</p>

<p>The problem space is diverse. In peer-to-peer systems, nodes just sort of <em>have</em>
data that they register with the network. They have to not only find the node
that has data, but also find out what data exists in the first place. Then there
are distributed key/value systems like the one described in Amazon’s Dynamo
paper. Data is placed in storage to balance the load and replicated across nodes
to increase resiliency. And then there are CDNs and cache systems. In
decentralized caches like the one described in the original consistent hashing
paper, clients make an educated guess about the cache node that holds the data
they’re looking for, and the design goal is to give each client a way to guess
right most of the time with limited information.</p>

<p>I think the last problem is interesting: How can we make an educated guess about
what storage node might have the data when we only have partial information
about the network? It requires a bit of an intuitive leap.</p>

<p>Each data item is represented by a key, so there’s some concept of a <em>key
space</em>, which represents all possible data keys. We can use a <em>hash space</em> as a
pretty effective proxy for the key space, and then partition that hash space
into ranges. Each storage node in our system becomes responsible for one of the
ranges in the hash space.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>A hash space is the set of possible outputs of a hash function – for example a
SHA-1 hash is 160 bits, so the hash space is all integers from 0 to
2<sup>160</sup> − 1. We know that if we hash different keys, we’re very likely to
get different hash values, so this satisfies the “hash space as proxy for all
possible keys” criterion. And hash functions are reasonably unbiased, so a
hashed key lands with essentially equal probability anywhere between 0 and
2<sup><em>n_bits</em></sup> − 1. The hash space is in effect a number line, and we can
make it continuous by connecting the tail back to the head, so that the
successor of the largest hash value is the smallest hash value.</p>

<p>You place a data item on the ring by hashing its key.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pos_x</span> <span class="o">=</span> <span class="n">sha1</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">key</span><span class="p">)</span>
</code></pre></div></div>

<p>Our storage nodes are each responsible for a range, so we need a way to define
these ranges. To that end, we place <em>nodes</em> on the ring as well, by hashing
their IP addresses.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pos_server</span> <span class="o">=</span> <span class="n">sha1</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">ipaddr</span><span class="p">)</span>
</code></pre></div></div>

<p>We decide that each node is responsible for the hash values between itself and
its predecessor, which is to say the range of hash values between itself and the
nearest counter-clockwise neighbor on the hash ring.</p>
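
<p>As a concrete sketch, here’s a minimal hash ring lookup in Python. This is my
own toy code, not from any particular library: nodes are sorted by position, and
a key belongs to the first node at or clockwise of its hash.</p>

```python
import hashlib
from bisect import bisect_left

def ring_pos(value: str) -> int:
    # Position on the ring: the SHA-1 digest read as an integer.
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, node_addrs):
        # Sort nodes by their position on the ring.
        self.ring = sorted((ring_pos(addr), addr) for addr in node_addrs)

    def owner(self, key: str) -> str:
        # The first node at or clockwise of the key's position owns it;
        # the modulo wraps the tail of the ring back to the head.
        i = bisect_left(self.ring, (ring_pos(key),))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
node = ring.owner("user:42")  # deterministic for a given set of nodes
```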

<p>With a 64-bit hash space and 10 nodes, you might end up with a configuration
like this.</p>

<p><img src="/assets/images/2026-consistent-hashing/hash_ring_s10_n64_01.png" alt="A hash ring with 10 nodes placed in a 64-bit hash space" /></p>

<p>Now if a client knows about any set of storage nodes, it can make its own
partition of the hash ring and be mostly correct! If the client knows about 9
out of 10 nodes, when it makes its partition of the hash ring, it will correctly
identify 8 out of the 10 actual ranges, and it won’t do too badly guessing about
the 9th and 10th ranges, either.</p>

<p>Here’s how consistent hashing scores on a few more criteria:</p>

<ul>
  <li><strong>Precise.</strong> About as accurate as you can be with limited information about
storage nodes.</li>
  <li><strong>Deterministic.</strong> Yep.</li>
  <li><strong>Equal distribution of work.</strong> The hash ring is reasonably well distributed,
but it’s not perfect. The size of each range is random.</li>
  <li><strong>Adaptable to changes in membership.</strong> This is one of the main motivations
for consistent hashing. If you add a node, it divides some existing range, and
the only redistribution that happens is with the new node’s successor. If you
remove a node, the successor node becomes responsible for the range of the node
that left, but no other nodes are affected.</li>
</ul>
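
<p>That last property is easy to check with a toy ring. In this sketch I use
small integer positions in a hypothetical 6-bit hash space instead of real
hashes: adding a node at position 30 only takes keys from its successor.</p>

```python
from bisect import bisect_left

def owner(ring, key_pos):
    # ring: sorted node positions; the first node clockwise from the
    # key owns it, wrapping past the largest position back to the start.
    i = bisect_left(ring, key_pos)
    return ring[i % len(ring)]

size = 64                      # a toy 6-bit hash space
ring = [5, 20, 40, 55]         # node positions
before = {k: owner(ring, k) for k in range(size)}

# Add a node at position 30: the only keys that change hands are ones
# its successor (the node at 40) used to own.
ring2 = sorted(ring + [30])
after = {k: owner(ring2, k) for k in range(size)}
moved = {k for k in range(size) if before[k] != after[k]}
```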

<p>In cache systems, getting close to the right node is good enough. Even if a node
isn’t technically responsible for a key, it can still request the key from the
server and cache the key itself. Any other clients that have the same incomplete
information about the network will hit this (non-responsible) node and find the
cached data there. Responsibility in this system is weak, but the data is stored
with a lot of <em>locality</em> relative to the hash ring.</p>

<p>Sometimes you need to find exactly the node responsible for a key. The Chord
protocol provides this stricter kind of routing. When a server joins the network, it
notifies its neighbors and then learns about them. The protocol ensures that
the new node knows a lot about its immediate location on the ring, but less
about the nodes that are farther away from it. When a client makes a data
request to find a key, it reaches out to <em>any</em> node on the network. This node
probably doesn’t have the data, but it knows someone who is close. The storage
node forwards the request clockwise (forward) around the ring to the closest
node it knows of to the key’s position, without passing it. This node
could be the owner of the data item, but if it’s not, it knows enough about its
local area on the ring to find a node that’s closer to it. The routing tables
are built in such a way that if a node doesn’t know of any closer node to the
data item, it is the owner of the data item.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
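
<p>Here’s a toy version of that greedy forwarding step, again with small integer
positions instead of real hashes. (Chord’s actual finger tables are more
structured than a flat list of known nodes; this only shows the hop rule.)</p>

```python
def closest_preceding(known_nodes, current, key_pos, ring_size):
    # Clockwise distance from position a to position b on the ring.
    def dist(a, b):
        return (b - a) % ring_size

    # Among the nodes this hop knows about, pick the one closest to the
    # key without passing it -- a node past the key wraps around and so
    # has a large clockwise distance. If no known node is closer than
    # the current one, the search stops here.
    best = current
    for n in known_nodes:
        if dist(n, key_pos) < dist(best, key_pos):
            best = n
    return best
```

<p>Repeating this hop after hop, a request converges clockwise on the key
without ever overshooting it.</p>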

<p>The number of hops in a Chord data retrieval is bounded – logarithmic in the
number of nodes, with high probability. Latency still isn’t optimal: a lookup
may take several hops, and nearness on the ring says nothing about how close two
nodes are in the world.
This is one of the hard problems of overlay networks, but there are
optimizations for it.</p>

<p>So, different systems treat the routing requirement differently. A cache system
is able to relax the routing requirement, because it tries to reduce cache
misses but not prevent them entirely. Chord meanwhile guarantees that a key is
found. It’s a scalable protocol that works in very unstable networks, at the
cost of extra latency.</p>

<p>There are still a few issues. First, a node leaving is likely to increase the
workload for another node, and a node entering the ring is likely to divide
another node’s workload while leaving the remaining nodes with the same
workload. Second, random positions on the ring are not very likely to be evenly
spaced. (There is after all only one evenly spaced configuration.) Third, the
scheme ignores heterogeneity – some servers can handle more work than others.</p>

<p>To illustrate the load balancing problem with hashing servers to the hash ring,
I generated 10 different random configurations to show that they are not
generally evenly spaced.</p>

<p><img src="/assets/images/2026-consistent-hashing/s10_n64.gif" alt="Ten random placements of 10 nodes on a 64-bit hash ring, showing uneven spacing" /></p>

<p>Next, I simulated 1,000 requests for data, and I ran the simulation 10 times,
plotting the results. In this plot, the size of the node represents the number
of requests serviced at the node.</p>

<p><img src="/assets/images/2026-consistent-hashing/s10_n64_hits_random.gif" alt="1,000 simulated requests across 10 nodes; node size shows the number of requests serviced" /></p>

<p>If we want to smooth out the load, the solution is to assign each server a
number of random <em>virtual nodes</em> on the ring. Virtual nodes decrease the chances
that a node gets a particularly big range of responsibility on the ring –
instead of getting unlucky once, it would have to get unlucky N times in a row
for N virtual nodes per storage node.</p>
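
<p>A sketch of how virtual nodes might be placed. The <code>addr#i</code>
suffix is just a hypothetical naming convention for spreading one server across
several positions, not from any particular system.</p>

```python
import hashlib

def ring_pos(value: str) -> int:
    # Position on the ring: the SHA-1 digest read as an integer.
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

def build_ring(node_addrs, vnodes_per_node=8):
    # Each physical node lands at several effectively random positions;
    # each ring entry maps a position back to its physical owner.
    ring = []
    for addr in node_addrs:
        for i in range(vnodes_per_node):
            ring.append((ring_pos(f"{addr}#{i}"), addr))
    ring.sort()
    return ring

ring = build_ring(["10.0.0.1", "10.0.0.2"], vnodes_per_node=4)
```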

<p>Here’s another 1,000 requests made to a ring with 32 nodes instead of 10. I ran
the simulation 10 times and plotted each outcome.</p>

<p><img src="/assets/images/2026-consistent-hashing/s32_n64_hits_random.gif" alt="1,000 simulated requests across a ring with 32 nodes; the load is more even" /></p>

<p>If you take any N nodes (say 4) and average their size, you get a reasonable
number of requests. We can also solve the heterogeneity problem by allocating
each server a number of virtual nodes according to its capacity. Finally,
membership changes more evenly rebalance the nodes. When a node is removed, its
virtual nodes are removed from the circle, and the extra work is split among all
of the owners of the virtual nodes’ successors. It’s very likely that the
successors will be owned by several nodes, not just one.</p>

<p>Routing is a little more complicated. The system needs to store some amount of
extra metadata about how to map virtual nodes to actual nodes. For this reason,
virtual nodes are most effective in smaller systems where each storage node
might be responsible for a large chunk of the hash ring.</p>

<p>The design space becomes expansive at this point as particular systems balance
metadata and latency with other guarantees they need to make. But this is the
essence of the thing. We create a proxy for the key space that tends to evenly
distribute nodes and then divvy up that domain across the storage nodes.
Elegant, if you ask me.</p>

<h1 id="sources">Sources</h1>

<ul>
  <li><a href="https://www.akamai.com/site/en/documents/research-paper/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf">Consistent Hashing and Random Trees: distributed caching protocols for relieving hot spots on the world wide web</a></li>
  <li><a href="https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf">Chord: a scalable peer-to-peer lookup service for internet applications</a></li>
  <li><a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">Dynamo: Amazon’s highly available key-value store</a></li>
  <li><em>Distributed Systems</em>, 4e, by Maarten van Steen and Andrew S. Tanenbaum
(<a href="https://www.distributed-systems.net/index.php/books/ds4/">link</a>)</li>
  <li><em>Computer networks: a systems approach</em>, 3e, by Peterson and Davie</li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>There are lots of ways to transform and then partition a key space, but most don’t work well. It’s vital that all keys are distributed <em>evenly</em> across storage nodes. So you might try to partition the key space by prefix and then place the key on the server whose name has the same prefix. This doesn’t work well because (a) there’s no guarantee that servers will have distinct prefixes, especially when they’re in the same subnet, (b) if they do have distinct prefixes, they’re probably not evenly spaced, and (c) you become significantly limited in the number of buckets. The longest prefix for an IPv4 address is 32 bits. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>See also Pastry, which optimizes the routing tables differently. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>Abuse-tolerant interfaces</title>
      <link>https://charlie-gallagher.github.io/2026/02/11/abuse-tolerant-interfaces.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/11/abuse-tolerant-interfaces.html</guid>
      <pubDate>Wed, 11 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <blockquote>
  <p>A common approach in the industry for forming a performance oriented SLA is to
describe it using average, median and expected variance. At Amazon we have
found that these metrics are not good enough if the goal is to build a system
where <strong>all</strong> customers have a good experience, rather than just the majority.</p>

  <p>“Dynamo: Amazon’s highly available key-value store,” DeCandia, et al.</p>
</blockquote>

<p>The Dynamo paper has me thinking about kinds of customer and service. Amazon is
a company on the offense, by which I mean that there is no sort of traffic they
want to turn away. They succeed when customers gleefully fill their carts and
hammer as many orders as possible through the checkout during the holidays.
Their only concern is keeping up.</p>

<p>The Dynamo paper goes on:</p>

<blockquote>
  <p>For example if extensive personalization techniques are used then customers
with longer histories require more processing which impacts performance at the
high-end of the distribution. An SLA stated in terms of mean or median response
times will not address the performance of this important customer segment.</p>
</blockquote>

<p>Those customers with enthusiastic shopping patterns are exactly the type of
customer that Amazon wants to court, and that drives their metrics away from
averages and towards extremes at the 99.9th percentile. At my own job at IXIS,
we’ve created a BI and data socialization platform, and the Dynamo paper has
me thinking that we are on the opposite side of some customer relationship
spectrum. Our platform works best when people use it <em>reasonably</em>. If a power
user only ever queries for multiple years of data at a time, that strains our
resources and has no incremental benefit to our bottom line.</p>

<p>Every company gets pricing stress, but some companies like IXIS, Snowflake, and
OpenAI have to worry about whether their pricing is secure against unusually
power-hungry power users. And that sucks, for everyone. I want people to
power-use our product without feeling like we’re against them. At the least the
fiddly money problems we have should be transparent to the user. Just imagine if
YouTube made you pay a small fee if you watched too many videos today to cover
their compute costs serving you those videos.</p>

<p>Here are a few pricing models in this space, with companies I think represent
the model well:</p>

<ul>
  <li><strong>Customer pays for compute.</strong> AI tokens, most AWS services, Snowflake.</li>
  <li><strong>Ads pay for customer.</strong> YouTube, Spotify.</li>
  <li><strong>Special services pay for freeloaders.</strong> The freemium model, usually mixed
with ads.</li>
  <li><strong>Good behavior by default.</strong> BitTorrent’s bartering system.</li>
  <li><strong>Good behavior rewarded.</strong> Reddit, Stack Overflow.</li>
  <li><strong>Hard resource limits.</strong> Google Drive, Gmail.</li>
  <li><strong>Throttling.</strong> AT&amp;T<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, Tinder.</li>
  <li><strong>Abuse-tolerant interface.</strong> Adobe Analytics Workspace, coupons.</li>
</ul>

<p>As an alternative to “customer pays for compute,” I’ve been interested in
abuse-tolerant interfaces, which you could describe as, “It’s not impossible for
users to cost us a lot of money, but they’ll find it’s inconvenient to do so.”
Coupons represent this par excellence. I mean physical, cut-em-out coupons.
They’re abuse-tolerant because while it’s possible to cut big stacks of coupons,
most people don’t. There was that whole Extreme Couponing TV show about lengths
to which people went to clip the coupons. We’re talking <em>days</em> of labor between
collecting books, snipping, and organizing. But the savings were huge.</p>

<p>Coupons work in spite of their inconvenience. When you take the trouble, it
feels like a <em>steal</em>. They’re a great time.</p>

<p>A few digital companies have figured out how to work this model into their
products. Adobe Analytics Workspace is one of those products in my opinion –
and if you aren’t familiar, it’s a BI tool for analyzing Adobe Analytics data.
The interface is composed of drag-and-drop components like metrics, dimensions, and
segments. You build visualizations ranging from simple and customizable (tables)
to prefab (various flow charts and funnels), and they never limit you. You can
theoretically ask for millions of rows of data and you won’t get rate limited.
Instead:</p>

<ul>
  <li>The data is paged in on-demand.</li>
  <li>The interface is much friendlier to simple visualizations than monstrously
large tables.</li>
  <li>Complex, nested breakdown tables must be created incrementally, which limits
how much data you could sanely fit into a single table.</li>
</ul>

<p>Now don’t confuse me for Adobe Analytics’ biggest fan or anything, but I’ve
never felt limited by the interface, although we’ve certainly pushed it to some
limits.</p>

<p>I think these sorts of abuse-tolerant interfaces are subtle and difficult to
execute well. But when you get it right, it’s great for both the business and
the customer. Food for thought for those product owners out there who are
considering a “compute per request” pricing model.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Seller beware, AT&amp;T was sued by the FTC in 2014 for throttling data speeds for “unlimited” plan users after they used a certain amount of data. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <blockquote>
  <p>A common approach in the industry for forming a performance oriented SLA is to
describe it using average, median and expected variance. At Amazon we have
found that these metrics are not good enough if the goal is to build a system
where <strong>all</strong> customers have a good experience, rather than just the majority.</p>

  <p>“Dynamo: Amazon’s highly available key-value store,” DeCandia, et al.</p>
</blockquote>

<p>The Dynamo paper has me thinking about kinds of customer and service. Amazon is
a company on the offense, by which I mean that there is no sort of traffic they
want to turn away. They succeed when customers gleefully fill their carts and
hammer as many orders as possible through the checkout during the holidays.
Their only concern is keeping up.</p>

<p>The Dynamo paper goes on:</p>

<blockquote>
  <p>For example if extensive personalization techniques are used then customers
with longer histories require more processing which impacts performance at the
high-end of the distribution. An SLA stated in terms of mean or median response
times will not address the performance of this important customer segment.</p>
</blockquote>

<p>Those customers with enthusiastic shopping patterns are exactly the type of
customer that Amazon wants to court, and that drives their metrics away from
averages and towards extremes at the 99.9th percentile. At my own job at IXIS,
we’ve created a BI and data socialization platform, and the Dynamo paper has
me thinking that we are on the opposite side of some customer relationship
spectrum. Our platform works best when people use it <em>reasonably</em>. If a power
user only ever queries for multiple years of data at a time, that strains our
resources and has no incremental benefit to our bottom line.</p>

<p>Every company gets pricing stress, but some companies like IXIS, Snowflake, and
OpenAI have to worry about whether their pricing is secure against unusually
power-hungry power users. And that sucks, for everyone. I want people to
power-use our product without feeling like we’re against them. At the least the
fiddly money problems we have should be transparent to the user. Just imagine if
YouTube made you pay a small fee if you watched too many videos today to cover
their compute costs serving you those videos.</p>

<p>Here are a few pricing models in this space, with companies I think represent
the model well:</p>

<ul>
  <li><strong>Customer pays for compute.</strong> AI tokens, most AWS services, Snowflake.</li>
  <li><strong>Ads pay for customer.</strong> YouTube, Spotify.</li>
  <li><strong>Special services pay for freeloaders.</strong> The freemium model, usually mixed
with ads.</li>
  <li><strong>Good behavior by default.</strong> BitTorrent’s bartering system.</li>
  <li><strong>Good behavior rewarded.</strong> Reddit, Stack Overflow.</li>
  <li><strong>Hard resource limits.</strong> Google Drive, Gmail.</li>
  <li><strong>Throttling.</strong> AT&amp;T<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, Tinder.</li>
  <li><strong>Abuse-tolerant interface.</strong> Adobe Analytics Workspace, coupons.</li>
</ul>

<p>As an alternative to “customer pays for compute,” I’ve been interested in
abuse-tolerant interfaces, which you could describe as, “It’s not impossible for
users to cost us a lot of money, but they’ll find it’s inconvenient to do so.”
Coupons represent this par excellence. I mean physical, cut-em-out coupons.
They’re abuse-tolerant because while it’s possible to cut big stacks of coupons,
most people don’t. There was that whole Extreme Couponing TV show about lengths
to which people went to clip the coupons. We’re talking <em>days</em> of labor between
collecting books, snipping, and organizing. But the savings were huge.</p>

<p>Coupons work in spite of their inconvenience. When you take the trouble, it
feels like a <em>steal</em>. They’re a great time.</p>

<p>A few digital companies have figured out how to work this model into their
products. Adobe Analytics Workspace is one of those products in my opinion –
and if you aren’t familiar, it’s a BI tool for analyzing Adobe Analytics data.
The interface is composed of drag-and-drop components like metrics, dimensions, and
segments. You build visualizations ranging from simple and customizable (tables)
to prefab (various flow charts and funnels), and they never limit you. You can
theoretically ask for millions of rows of data and you won’t get rate limited.
Instead:</p>

<ul>
  <li>The data is paged in on-demand.</li>
  <li>The interface is much friendlier to simple visualizations than monstrously
large tables.</li>
  <li>Complex, nested breakdown tables must be created incrementally, which limits
how much data you could sanely fit into a single table.</li>
</ul>

<p>Now don’t confuse me for Adobe Analytics’ biggest fan or anything, but I’ve
never felt limited by the interface, although we’ve certainly pushed it to some
limits.</p>

<p>I think these sorts of abuse-tolerant interfaces are subtle and difficult to
execute well. But when you get it right, it’s great for both the business and
the customer. Food for thought for those product owners out there who are
considering a “compute per request” pricing model.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Seller beware, AT&amp;T was sued by the FTC in 2014 for throttling data speeds for “unlimited” plan users after they used a certain amount of data. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>Swatch time</title>
      <link>https://charlie-gallagher.github.io/2026/02/10/swatch-time.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/10/swatch-time.html</guid>
      <pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>I saw this on HN this morning. Nearly 30 years ago Swatch created <a href="https://beats.wiki">Swatch
Internet time</a>. The units are called .beats, and they’re a
decimal timekeeping system (1000 .beats per day) based apparently on the solar
day in Biel, Switzerland. The base unit, one .beat, is defined as 1/1000 of a
day (86.4 seconds), but since the scale seems to be defined in terms of UTC,
it’s probably a translation of SI seconds..?</p>

<p>It seems like UTC with a different hat.</p>

<p>All the finery aside, Swatch time abolishes timezones and daylight savings time
so that referring to time can be more natural. If it’s 584 .beats where I am,
it’s the same time all over the world. This is supposed to be refreshing for
people who spend a lot of time trying to coordinate people in different
timezones. Someone on HN mentioned this post, <a href="https://qntm.org/abolish">So You Want to Abolish
Timezones</a>. If you read that and still think .beats
are a good idea, I’ll be stunned.</p>

<p>I’ve talked <a href="/2026/02/06/ntpsec.html">before</a> about time coordination
and how the definition of “the time” is sociotechnical, with an emphasis on
<em>socio</em>. If there’s any problem with timezones it’s that there aren’t enough of
them. The original timezone scheme in the US had <a href="https://www.bts.gov/explore-topics-and-geography/geography/geospatial-portal/history-time-zones-and-daylight-saving">144
timezones</a>.</p>

<p>Sure, it’s hard to agree on when to have a meeting, but that’s only because
globalization is hard. You need to have either social empathy for how people
organize their lives in other places or a dictum that everyone change their idea
of the day from something natural to something arbitrary, like the natural time
in some place I’ve never been to.</p>

<p>So be happy that the world has such diversity! And refer back to your timezone
tables. Because timekeeping is as complicated as the places and people that keep
the time.</p>


        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>I saw this on HN this morning. Nearly 30 years ago Swatch created <a href="https://beats.wiki">Swatch
Internet time</a>. The units are called .beats, and they’re a
decimal timekeeping system (1000 .beats per day) based apparently on the solar
day in Biel, Switzerland. The base unit, one .beat, is defined as 1/1000 of a
day (86.4 seconds), but since the scale seems to be defined in terms of UTC,
it’s probably a translation of SI seconds..?</p>

<p>It seems like UTC with a different hat.</p>

<p>All the finery aside, Swatch time abolishes timezones and daylight savings time
so that referring to time can be more natural. If it’s 584 .beats where I am,
it’s the same time all over the world. This is supposed to be refreshing for
people who spend a lot of time trying to coordinate people in different
timezones. Someone on HN mentioned this post, <a href="https://qntm.org/abolish">So You Want to Abolish
Timezones</a>. If you read that and still think .beats
are a good idea, I’ll be stunned.</p>

<p>I’ve talked <a href="/2026/02/06/ntpsec.html">before</a> about time coordination
and how the definition of “the time” is sociotechnical, with an emphasis on
<em>socio</em>. If there’s any problem with timezones it’s that there aren’t enough of
them. The original timezone scheme in the US had <a href="https://www.bts.gov/explore-topics-and-geography/geography/geospatial-portal/history-time-zones-and-daylight-saving">144
timezones</a>.</p>

<p>Sure, it’s hard to agree on when to have a meeting, but that’s only because
globalization is hard. You need to have either social empathy for how people
organize their lives in other places or a dictum that everyone change their idea
of the day from something natural to something arbitrary, like the natural time
in some place I’ve never been to.</p>

<p>So be happy that the world has such diversity! And refer back to your timezone
tables. Because timekeeping is as complicated as the places and people that keep
the time.</p>


        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>My kind of place</title>
      <link>https://charlie-gallagher.github.io/2026/02/07/my-kind-of-place.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/07/my-kind-of-place.html</guid>
      <pubDate>Sat, 07 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>What is it about big social media that makes me feel like I’m part of the
conversation? Conversations are being had, and there are people here, so this
must be the place..?</p>

<p>Maybe we need a new name for it. There’s just not much social about TikTok,
Instagram, or YouTube. It’s TV in your pocket with infinite channels. In 2005 if
you wanted conversation, you’d watch a talk show or The View. Now the channels
have a much finer grain, but it’s the same thing. And at times I’ve been glued
to it.</p>

<p>If you remember old YouTube much, you might remember the video response section
(phased out in
<a href="https://m.hexus.net/business/news/internet/59389-youtube-video-response-facility-will-dropped-next-month/">2013</a>).
It wasn’t used a lot, but the intention was good. For some video, you could make
a response and it would appear underneath the video. You as a viewer could start
a conversation with the creator. I thought about it recently because it feels
alien now. Remixes and stitches are something like it but not quite right.</p>

<p>Conversation doesn’t scale well. After some tipping point the signal gets
drowned out in the noise, and the thoughts from interested strangers turn into
regular negative comments. Funny scales well, drama scales well, but not so much
people.</p>

<p>For a while I was a part of a small community of data visualization enthusiasts
called <a href="https://github.com/rfordatascience/tidytuesday">#TidyTuesday</a>. Every
week our benevolent leader would publish a dataset, and we would make a
visualization with it. You can see my visualizations on my
<a href="https://github.com/charlie-gallagher/tidy-tuesday">GitHub</a>, but as a small
example:</p>

<p><img src="/assets/images/2026-my-kind-of-place/ikea.png" alt="Ikea furniture names, with words distributed based on the ratio of vowels to consonants" /></p>

<p>It was the most fun I’ve had on the internet. The anticipation, the first
visualizations, the ones that blew me away. For 41 weeks, I got the assignment,
worked on something I thought would be cool, and then watched the other
visualizations flow in. It was small, and mutually encouraging. We all gave
credit when we stole from each other.</p>

<p>But I don’t think it works if there are 10,000 contributors instead of 100.</p>

<hr />

<p>I’ve been thinking about that internet and the internet I’ve tended to be on
lately. There are sites that focus on creators, and there are sites that focus
on people. Instagram, YouTube, and TikTok are platforms for creators.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
Reddit, Twitter, Meetup, personal websites, and (I mean this without irony,
mostly) Facebook groups<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> are for people. LinkedIn is technically about
people, but it’s driven by self-advertising, so I don’t know where it belongs.</p>

<p>I didn’t distinguish between these as much before. I knew I liked twitter better
than instagram, but I thought it was my fault that I didn’t find a community on
the ‘sta.</p>

<p>It’s market forces for the creators. You can’t monetize a subreddit<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>, but you
can monetize a YouTube channel. RSS feeds can’t be monetized, and so the tech
was nearly killed. But it’s still widely available. And as old as Reddit is,
it’s still popular. It’s a people site.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>

<p>I’ve fallen into the tech blogosphere lately, thanks to <a href="https://clickhouse.com/blog/tech-blogs">this post from
ClickHouse</a> and especially thanks to
Thorsten Ball’s weekly “Joy &amp; Curiosity” on <a href="https://registerspill.thorstenball.com">Register
Spill</a>. This part of the blogosphere is
mostly passion projects on simple HTML websites (please excuse the appearance of
my own site, I haven’t had time to make it plain), and the people are interested
in their fields. I have an RSS reader and I add new sites when I find them. Some
post every day (jwz), others every week (Thorsten), and some might never post
again.</p>

<p>And you know the nice thing? When I open it up, it’s my feed, and there are no
suggestions. I read it every morning with breakfast, and on Sundays I read the
long-form articles I’ve saved up from the week.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Marcel the Shell said it best: “It’s still a group of people, but it’s an audience, it’s not a community.” And dammit if that’s not the whole thing. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Say what you will about Facebook (and I’d agree), but Facebook groups are unique for being about geographically spread out people with niche interests who are encouraged to get together in real life occasionally. cf. Meetup and Reddit. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>You can earn money through Reddit by being what they consider to be a high quality contributor. The money’s not the same as channel-focused sites, though. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>I recently learned that three of my favorite things about the Internet (RSS, Reddit, at least in principle, and the creative commons license) were all developed in part by Aaron Swartz. He had the right idea for the Internet. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>What is it about big social media that makes me feel like I’m part of the
conversation? Conversations are being had, and there are people here, so this
must be the place…?</p>

<p>Maybe we need a new name for it. There’s just not much social about TikTok,
Instagram, or YouTube. It’s TV in your pocket with infinite channels. In 2005 if
you wanted conversation, you’d watch a talk show or The View. Now the channels
have a much finer grain, but it’s the same thing. And at times I’ve been glued
to it.</p>

<p>If you remember old YouTube much, you might remember the video response section
(phased out in
<a href="https://m.hexus.net/business/news/internet/59389-youtube-video-response-facility-will-dropped-next-month/">2013</a>).
It wasn’t used a lot, but the intention was good. For some videos, you could make
a response and it would appear underneath the video. You as a viewer could start
a conversation with the creator. I thought about it recently because it feels
alien now. Remixes and stitches are something like it but not quite right.</p>

<p>Conversation doesn’t scale well. After some tipping point the signal gets
drowned out in the noise, the thoughts from interested strangers turn into
regular negative comments. Funny scales well, drama scales well, but not so much
people.</p>

<p>For a while I was a part of a small community of data visualization enthusiasts
called <a href="https://github.com/rfordatascience/tidytuesday">#TidyTuesday</a>. Every
week our benevolent leader would publish a dataset, and we would make a
visualization with it. You can see my visualizations on my
<a href="https://github.com/charlie-gallagher/tidy-tuesday">GitHub</a>, but as a small
example:</p>

<p><img src="/assets/images/2026-my-kind-of-place/ikea.png" alt="Ikea furniture names, with words distributed based on the ratio of vowels to consonants" /></p>

<p>It was the most fun I’ve had on the internet. The anticipation, the first
visualizations, the ones that blew me away. For 41 weeks, I got the assignment,
worked on something I thought would be cool, and then watched the other
visualizations flow in. It was small, and mutually encouraging. We all gave
credit when we stole from each other.</p>

<p>But I don’t think it works if there are 10,000 contributors instead of 100.</p>

<hr />

<p>I’ve been thinking about that internet and the internet I’ve tended to be on
lately. There are sites that focus on creators, and there are sites that focus
on people. Instagram, YouTube, and TikTok are platforms for creators.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
Reddit, Twitter, Meetup, personal websites, and (I mean this without irony,
mostly) Facebook groups<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> are for people. LinkedIn is technically about
people, but it’s driven by self-advertising, so I don’t know where it belongs.</p>

<p>I didn’t distinguish between these as much before. I knew I liked twitter better
than instagram, but I thought it was my fault that I didn’t find a community on
the ‘sta.</p>

<p>It’s market forces for the creators. You can’t monetize a subreddit<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>, but you
can monetize a YouTube channel. RSS feeds can’t be monetized, and so the tech
was nearly killed. But it’s still widely available. And as old as Reddit is,
it’s still popular. It’s a people site.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>

<p>I’ve fallen into the tech blogosphere lately, thanks to <a href="https://clickhouse.com/blog/tech-blogs">this post from
ClickHouse</a> and especially thanks to
Thorsten Ball’s weekly “Joy &amp; Curiosity” on <a href="https://registerspill.thorstenball.com">Register
Spill</a>. This part of the blogosphere is
mostly passion projects on simple HTML websites (please excuse the appearance of
my own site, I haven’t had time to make it plain), and the people are interested
in their fields. I have an RSS reader and I add new sites when I find them. Some
post every day (jwz), others every week (Thorsten), and some might never post
again.</p>

<p>And you know the nice thing? When I open it up, it’s my feed, and there are no
suggestions. I read it every morning with breakfast, and on Sundays I read the
long-form articles I’ve saved up from the week.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Marcel the Shell said it best: “It’s still a group of people, but it’s an audience, it’s not a community.” And dammit if that’s not the whole thing. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Say what you will about Facebook (and I’d agree), but Facebook groups are unique for being about geographically spread out people with niche interests who are encouraged to get together in real life occasionally. cf. Meetup and Reddit. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>You can earn money through Reddit by being what they consider to be a high quality contributor. The money’s not the same as channel-focused sites, though. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>I recently learned that three of my favorite things about the Internet (RSS, Reddit, at least in principle, and the creative commons license) were all developed in part by Aaron Swartz. He had the right idea for the Internet. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>Replicating lazy replication in python</title>
      <link>https://charlie-gallagher.github.io/2026/02/07/lazy-replication.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/07/lazy-replication.html</guid>
      <pubDate>Sat, 07 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>I’ve been reading about distributed systems lately. I have a lot to catch up on.
When I started making a reading list a couple months ago, I had heard about the
CAP theorem, that it was about tradeoffs in consistency, availability, and
P-something, so I started there.</p>

<p>The CAP acronym stands for “Consistency of reads<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, Availability of writes,
and presence of a network Partition.” The theorem (poorly stated) says you can
pick any two, but not all three.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> If your service allows consistent reads (all
replicated nodes return the same value) and is always available for write
operations, then the system cannot function during a network partition. A
network partition is when not all nodes can communicate with each other. And if
you want to allow writes to proceed even when there’s a network partition, then
you cannot guarantee that every node will return the same value on read.</p>

<p>The CAP theorem as a <em>theorem</em> is correct, but the impossibility result has been
used as a teaching tool and a framework for thinking about tradeoffs in a
distributed, replicated service. And as a framework it’s not very useful. It
neglects the fact that users can tolerate some types of inconsistent reads, but
not others; that writes can proceed during a network partition under certain
conditions, but not others; and that there are many kinds of failure besides
just a network partition (slow responses, Byzantine nodes, node failures, and an
actual break in the network connection between parts of the network that
otherwise function well).</p>

<p>In other words, the CAP framework is simplistic. I shopped around for better
frameworks and found an excellent paper, <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigtt611-bernstein.pdf">“Rethinking Eventual
Consistency”</a>
by Philip Bernstein and Sudipto Das from Microsoft. They give a more complete
classification of the tradeoffs made in modern distributed systems and how to
think about them.</p>

<p>First, they acknowledge that availability and network partition tolerance are
essential for most services, so it’s <em>read consistency</em> that has to be
adjusted. (For what it’s worth, the original CAP proof paper also acknowledges
this.) Then, they disambiguate types of consistency and their uses. For example,
in an email system, it’s usually enough to offer <em>causal</em> consistency where a
single user always observes their own updates and the updates they’ve observed
before. They don’t observe the whole system at once, so the replicas don’t have
to be consistent.</p>

<p>The whole paper is worth reading. I myself focused on just one part that I found
interesting: eventual consistency through partial ordering, implemented with
vector clocks. The main reference for this in the eventual consistency paper is
<a href="https://www.cs.princeton.edu/courses/archive/spr24/cos418/papers/lazy.pdf">“Providing high availability using lazy
replication”</a>, by Ladin et al.</p>

<p>In this paper, a service is replicated across a network of symmetric nodes that
each serve both reads and writes to clients. The client uses a front end
service, and this front end service has duties like routing to a particular
preferred node and coordinating with nodes about which updates it expects to
see. This makes the replication transparent to the user while storing some
important program state on the client’s side.</p>

<p>A system is causally consistent if each user has a consistent view of the
system: they see the things they’ve seen before, and the system reflects the
updates they’ve made (maybe in response to things they’ve seen) in the right
order. From the perspective of an email client, the exact order of unrelated
emails being sent in the system is unimportant. But if I read an email and then
refresh my messages, I should still see that email (my view should never go
“back in time”), and the messages shouldn’t be reordered. If I reply to an
email, and someone else replies to my reply, the thread should appear in the
same order for everyone, regardless of which replica they talk to (replies are
causally linked).</p>

<p>This results in chains of causality flying back and forth from client to replica
and between replicas (as they share updates with each other). The <em>model</em>
described in the paper uses actual chains of causality. The <em>implementation</em> of
course is limited by memory and bandwidth, where full chains of causality are
inconvenient.</p>

<p>Instead, the system is implemented using <em>vector clocks</em>, which are a nifty
trick for tracking dependencies.</p>

<p>To understand vector clocks, consider that the system is changed through updates,
and it’s observed through reads. A read observes some data item, which is the
product of all of the updates that have affected that data item, performed in
some order.</p>

<p>The state at a particular replica is then defined by the log of updates it has
processed. As long as the replica itself keeps track of its log of updates,
everyone else can refer to a particular state of that replica by the <em>length of
its log</em>, or in other words the number of updates that replica has processed.</p>

<p>This is the underlying idea of a vector clock. For N replicas, the state of the
system is identified by a vector <em>v</em> of length N, where each element
<em>v<sub>i</sub></em> is the number of updates processed at replica <em>i</em>.</p>
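<p>Concretely, the two operations you need on such a vector are an element-wise
max for folding in another replica’s knowledge, and a comparison for asking
“does this clock reflect at least everything that clock reflects?” Here’s a
minimal sketch in python (the helper names <code class="language-plaintext highlighter-rouge">merge</code> and <code class="language-plaintext highlighter-rouge">dominates</code> are mine, not the
paper’s):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def merge(a, b):
    """Element-wise max: the combined knowledge of two clocks."""
    return [max(x, y) for x, y in zip(a, b)]

def dominates(a, b):
    """True if clock a reflects every update clock b reflects."""
    return all(x &gt;= y for x, y in zip(a, b))

# Replica 1 processes an update: it increments its own slot.
clock = [0, 0, 0]
clock[1] += 1

# A front end whose last-observed clock is [0, 1, 0] can be served
# by any replica whose clock dominates it.
assert dominates(clock, [0, 1, 0])

# Gossip: a replica folds an incoming clock into its own.
clock = merge(clock, [2, 0, 1])
assert clock == [2, 1, 1]
</code></pre></div></div>

<p>When neither clock dominates the other, the two states are concurrent, which
is where the partial ordering comes from.</p>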

<p>This compact representation of the system is useful but not sufficient. The
Ladin et al. paper goes into the details of how their system makes use of
vector clocks to enforce several types of operations (causal, forced, and
immediate, in increasing order of strictness).</p>

<h1 id="simulating-the-system">Simulating the system</h1>
<p>I was having a hard time visualizing and playing with this system on paper, so I
wrote a python simulation of the key parts. You can find it here:
<a href="https://github.com/charlie-gallagher/simulation-lazy-replication-ladin-1992">https://github.com/charlie-gallagher/simulation-lazy-replication-ladin-1992</a></p>

<p>It logs in some detail what’s going on at each node (node = replica) and
client, and at the end prints out a summary of what went on. Here’s an example
summary:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 node.py

======================
Summary
  Nodes: 4
  Front ends: 10
  Current time: 99
  Stats: {'updates': 316}
--------------------------
FrontEnd (0)
  Preferred node: 0
  Prev: [78, 96, 50, 87]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 177
  Seen vals: [12, 7, 9, 20, 20, 29, 40, 61, 86, 89, 105, 119, 110, 108, 119, 135, 145, 147, 153, 177]
  Stats: {'updates': 30, 'update_completes': 30, 'query_starts': 20, 'query_completes': 20, 'failed_polls': 0}
FrontEnd (1)
  Preferred node: 3
  Prev: [78, 98, 52, 88]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 170
  Seen vals: [6, 9, 21, 23, 33, 62, 115, 120, 108, 117, 112, 131, 133, 147, 153, 170]
  Stats: {'updates': 34, 'update_completes': 34, 'query_starts': 16, 'query_completes': 16, 'failed_polls': 0}
FrontEnd (2)
  Preferred node: 1
  Prev: [78, 98, 50, 87]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 173
  Seen vals: [6, 18, 11, 18, 23, 61, 115, 108, 127, 133, 141, 173]
  Stats: {'updates': 38, 'update_completes': 38, 'query_starts': 12, 'query_completes': 12, 'failed_polls': 0}
FrontEnd (3)
  Preferred node: 1
  Prev: [76, 97, 42, 84]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 155
  Seen vals: [5, 2, -2, 4, 12, 8, 9, 19, 27, 33, 37, 62, 86, 96, 100, 114, 110, 112, 115, 110, 115, 127, 127, 128, 148, 141, 155]
  Stats: {'updates': 23, 'update_completes': 23, 'query_starts': 27, 'query_completes': 27, 'failed_polls': 0}
FrontEnd (4)
  Preferred node: 1
  Prev: [78, 98, 50, 87]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 173
  Seen vals: [-2, 6, 9, 18, 21, 29, 72, 96, 117, 129, 134, 137, 141, 148, 143, 153, 173]
  Stats: {'updates': 33, 'update_completes': 33, 'query_starts': 17, 'query_completes': 17, 'failed_polls': 0}
FrontEnd (5)
  Preferred node: 2
  Prev: [64, 81, 51, 70]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 131
  Seen vals: [5, 9, 8, 12, 11, 20, 38, 62, 105, 119, 114, 111, 115, 118, 131]
  Stats: {'updates': 35, 'update_completes': 35, 'query_starts': 15, 'query_completes': 15, 'failed_polls': 0}
FrontEnd (6)
  Preferred node: 1
  Prev: [74, 98, 35, 79]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 148
  Seen vals: [4, 5, 11, 9, 32, 62, 66, 76, 93, 100, 120, 111, 108, 118, 123, 128, 133, 141, 148]
  Stats: {'updates': 31, 'update_completes': 31, 'query_starts': 19, 'query_completes': 19, 'failed_polls': 0}
FrontEnd (7)
  Preferred node: 0
  Prev: [78, 93, 46, 86]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 168
  Seen vals: [-5, 9, 5, 5, 5, 18, 21, 20, 32, 41, 62, 66, 120, 112, 115, 116, 117, 135, 131, 136, 144, 168]
  Stats: {'updates': 28, 'update_completes': 28, 'query_starts': 22, 'query_completes': 22, 'failed_polls': 0}
FrontEnd (8)
  Preferred node: 3
  Prev: [74, 85, 35, 88]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 147
  Seen vals: [4, 17, 11, 9, 18, 23, 30, 37, 62, 96, 89, 105, 115, 107, 118, 131, 135, 133, 147]
  Stats: {'updates': 31, 'update_completes': 31, 'query_starts': 19, 'query_completes': 19, 'failed_polls': 0}
FrontEnd (9)
  Preferred node: 2
  Prev: [65, 82, 52, 71]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 138
  Seen vals: [5, 5, 18, 15, 8, 20, 23, 22, 38, 40, 41, 62, 68, 76, 120, 110, 138]
  Stats: {'updates': 33, 'update_completes': 33, 'query_starts': 17, 'query_completes': 17, 'failed_polls': 0}
--------------------------
Node (0)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 78, 'gossip_messages_processed': 243, 'gossip_updates_processed': 238, 'queries': 49}
Node (1)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 98, 'gossip_messages_processed': 237, 'gossip_updates_processed': 218, 'queries': 47}
Node (2)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 52, 'gossip_messages_processed': 238, 'gossip_updates_processed': 264, 'queries': 38}
Node (3)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 88, 'gossip_messages_processed': 284, 'gossip_updates_processed': 228, 'queries': 50}
======================
</code></pre></div></div>

<p>It successfully replicates updates<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> and produces the same value at each
replica, in this case <code class="language-plaintext highlighter-rouge">Value: 170</code>.</p>

<p>The system isn’t perfect, and it probably still has some bugs, which have been
difficult to track down. But the value was in the exercise, and if you’re interested
in the Ladin paper, this is a functioning example that’s simple enough to be
considered basically pseudocode.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>The “C” is sometimes phrased “Consistency of replicas” rather than consistency of reads, which is a fine distinction. Systems are only observed through read operations, so inconsistency can only be noticed through reads; however, if a node fails and its work hasn’t successfully been replicated anywhere, then a future consistent read becomes impossible. So consistent replicas and consistent reads are closely tied but have different implications for how you might design the system to handle failure. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
<p>The original proof gives a precise definition of the CAP conjecture originally posed by Brewer: <a href="https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf">https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf</a>. Besides being more precise, it also describes weaker forms of consistency that are useful in real systems. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
<p>A running total. This actually isn’t sensitive to the order in which operations are run, making it a CRDT (see the <em>Rethinking Eventual Consistency</em> paper). This was a good starting point, I thought, because I didn’t have to work out the partial ordering just yet. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>I’ve been reading about distributed systems lately. I have a lot to catch up on.
When I started making a reading list a couple months ago, I had heard about the
CAP theorem, that it was about tradeoffs in consistency, availability, and
P-something, so I started there.</p>

<p>The CAP acronym stands for “Consistency of reads<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, Availability of writes,
and presence of a network Partition.” The theorem (poorly stated) says you can
pick any two, but not all three.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> If your service allows consistent reads (all
replicated nodes return the same value) and is always available for write
operations, then the system cannot function during a network partition. A
network partition is when not all nodes can communicate with each other. And if
you want to allow writes to proceed even when there’s a network partition, then
you cannot guarantee that every node will return the same value on read.</p>

<p>The CAP theorem as a <em>theorem</em> is correct, but the impossibility result has been
used as a teaching tool and a framework for thinking about tradeoffs in a
distributed, replicated service. And as a framework it’s not very useful. It
neglects the fact that users can tolerate some types of inconsistent reads, but
not others; that writes can proceed during a network partition under certain
conditions, but not others; and that there are many kinds of failure besides
just a network partition (slow responses, Byzantine nodes, node failures, and an
actual break in the network connection between parts of the network that
otherwise function well).</p>

<p>In other words, the CAP framework is simplistic. I shopped around for better
frameworks and found an excellent paper, <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigtt611-bernstein.pdf">“Rethinking Eventual
Consistency”</a>
by Philip Bernstein and Sudipto Das from Microsoft. They give a more complete
classification of the tradeoffs made in modern distributed systems and how to
think about them.</p>

<p>First, they acknowledge that availability and network partition tolerance are
essential for most services, so it’s <em>read consistency</em> that has to be
adjusted. (For what it’s worth, the original CAP proof paper also acknowledges
this.) Then, they disambiguate types of consistency and their uses. For example,
in an email system, it’s usually enough to offer <em>causal</em> consistency where a
single user always observes their own updates and the updates they’ve observed
before. They don’t observe the whole system at once, so the replicas don’t have
to be consistent.</p>

<p>The whole paper is worth reading. I myself focused on just one part that I found
interesting: eventual consistency through partial ordering, implemented with
vector clocks. The main reference for this in the eventual consistency paper is
<a href="https://www.cs.princeton.edu/courses/archive/spr24/cos418/papers/lazy.pdf">“Providing high availability using lazy
replication”</a>, by Ladin et al.</p>

<p>In this paper, a service is replicated across a network of symmetric nodes that
each serve both reads and writes to clients. The client uses a front end
service, and this front end service has duties like routing to a particular
preferred node and coordinating with nodes about which updates it expects to
see. This makes the replication transparent to the user while storing some
important program state on the client’s side.</p>

<p>A system is causally consistent if each user has a consistent view of the
system: they see the things they’ve seen before, and the system reflects the
updates they’ve made (maybe in response to things they’ve seen) in the right
order. From the perspective of an email client, the exact order of unrelated
emails being sent in the system is unimportant. But if I read an email and then
refresh my messages, I should still see that email (my view should never go
“back in time”), and the messages shouldn’t be reordered. If I reply to an
email, and someone else replies to my reply, the thread should appear in the
same order for everyone, regardless of which replica they talk to (replies are
causally linked).</p>

<p>This results in chains of causality flying back and forth from client to replica
and between replicas (as they share updates with each other). The <em>model</em>
described in the paper uses actual chains of causality. The <em>implementation</em> of
course is limited by memory and bandwidth, where full chains of causality are
inconvenient.</p>

<p>Instead, the system is implemented using <em>vector clocks</em>, which are a nifty
trick for tracking dependencies.</p>

<p>To understand vector clocks, consider that the system is changed through updates,
and it’s observed through reads. A read observes some data item, which is the
product of all of the updates that have affected that data item, performed in
some order.</p>

<p>The state at a particular replica is then defined by the log of updates it has
processed. As long as the replica itself keeps track of its log of updates,
everyone else can refer to a particular state of that replica by the <em>length of
its log</em>, or in other words the number of updates that replica has processed.</p>

<p>This is the underlying idea of a vector clock. For N replicas, the state of the
system is identified by a vector <em>v</em> of length N, where each element
<em>v<sub>i</sub></em> is the number of updates processed at replica <em>i</em>.</p>
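<p>Concretely, the two operations you need on such a vector are an element-wise
max for folding in another replica’s knowledge, and a comparison for asking
“does this clock reflect at least everything that clock reflects?” Here’s a
minimal sketch in python (the helper names <code class="language-plaintext highlighter-rouge">merge</code> and <code class="language-plaintext highlighter-rouge">dominates</code> are mine, not the
paper’s):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def merge(a, b):
    """Element-wise max: the combined knowledge of two clocks."""
    return [max(x, y) for x, y in zip(a, b)]

def dominates(a, b):
    """True if clock a reflects every update clock b reflects."""
    return all(x &gt;= y for x, y in zip(a, b))

# Replica 1 processes an update: it increments its own slot.
clock = [0, 0, 0]
clock[1] += 1

# A front end whose last-observed clock is [0, 1, 0] can be served
# by any replica whose clock dominates it.
assert dominates(clock, [0, 1, 0])

# Gossip: a replica folds an incoming clock into its own.
clock = merge(clock, [2, 0, 1])
assert clock == [2, 1, 1]
</code></pre></div></div>

<p>When neither clock dominates the other, the two states are concurrent, which
is where the partial ordering comes from.</p>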

<p>This compact representation of the system is useful but not sufficient. The
Ladin et al. paper goes into the details of how their system makes use of
vector clocks to enforce several types of operations (causal, forced, and
immediate, in increasing order of strictness).</p>

<h1 id="simulating-the-system">Simulating the system</h1>
<p>I was having a hard time visualizing and playing with this system on paper, so I
wrote a python simulation of the key parts. You can find it here:
<a href="https://github.com/charlie-gallagher/simulation-lazy-replication-ladin-1992">https://github.com/charlie-gallagher/simulation-lazy-replication-ladin-1992</a></p>

<p>It logs in some detail what’s going on at each node (node = replica) and
client, and at the end prints out a summary of what went on. Here’s an example
summary:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 node.py

======================
Summary
  Nodes: 4
  Front ends: 10
  Current time: 99
  Stats: {'updates': 316}
--------------------------
FrontEnd (0)
  Preferred node: 0
  Prev: [78, 96, 50, 87]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 177
  Seen vals: [12, 7, 9, 20, 20, 29, 40, 61, 86, 89, 105, 119, 110, 108, 119, 135, 145, 147, 153, 177]
  Stats: {'updates': 30, 'update_completes': 30, 'query_starts': 20, 'query_completes': 20, 'failed_polls': 0}
FrontEnd (1)
  Preferred node: 3
  Prev: [78, 98, 52, 88]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 170
  Seen vals: [6, 9, 21, 23, 33, 62, 115, 120, 108, 117, 112, 131, 133, 147, 153, 170]
  Stats: {'updates': 34, 'update_completes': 34, 'query_starts': 16, 'query_completes': 16, 'failed_polls': 0}
FrontEnd (2)
  Preferred node: 1
  Prev: [78, 98, 50, 87]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 173
  Seen vals: [6, 18, 11, 18, 23, 61, 115, 108, 127, 133, 141, 173]
  Stats: {'updates': 38, 'update_completes': 38, 'query_starts': 12, 'query_completes': 12, 'failed_polls': 0}
FrontEnd (3)
  Preferred node: 1
  Prev: [76, 97, 42, 84]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 155
  Seen vals: [5, 2, -2, 4, 12, 8, 9, 19, 27, 33, 37, 62, 86, 96, 100, 114, 110, 112, 115, 110, 115, 127, 127, 128, 148, 141, 155]
  Stats: {'updates': 23, 'update_completes': 23, 'query_starts': 27, 'query_completes': 27, 'failed_polls': 0}
FrontEnd (4)
  Preferred node: 1
  Prev: [78, 98, 50, 87]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 173
  Seen vals: [-2, 6, 9, 18, 21, 29, 72, 96, 117, 129, 134, 137, 141, 148, 143, 153, 173]
  Stats: {'updates': 33, 'update_completes': 33, 'query_starts': 17, 'query_completes': 17, 'failed_polls': 0}
FrontEnd (5)
  Preferred node: 2
  Prev: [64, 81, 51, 70]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 131
  Seen vals: [5, 9, 8, 12, 11, 20, 38, 62, 105, 119, 114, 111, 115, 118, 131]
  Stats: {'updates': 35, 'update_completes': 35, 'query_starts': 15, 'query_completes': 15, 'failed_polls': 0}
FrontEnd (6)
  Preferred node: 1
  Prev: [74, 98, 35, 79]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 148
  Seen vals: [4, 5, 11, 9, 32, 62, 66, 76, 93, 100, 120, 111, 108, 118, 123, 128, 133, 141, 148]
  Stats: {'updates': 31, 'update_completes': 31, 'query_starts': 19, 'query_completes': 19, 'failed_polls': 0}
FrontEnd (7)
  Preferred node: 0
  Prev: [78, 93, 46, 86]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 168
  Seen vals: [-5, 9, 5, 5, 5, 18, 21, 20, 32, 41, 62, 66, 120, 112, 115, 116, 117, 135, 131, 136, 144, 168]
  Stats: {'updates': 28, 'update_completes': 28, 'query_starts': 22, 'query_completes': 22, 'failed_polls': 0}
FrontEnd (8)
  Preferred node: 3
  Prev: [74, 85, 35, 88]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 147
  Seen vals: [4, 17, 11, 9, 18, 23, 30, 37, 62, 96, 89, 105, 115, 107, 118, 131, 135, 133, 147]
  Stats: {'updates': 31, 'update_completes': 31, 'query_starts': 19, 'query_completes': 19, 'failed_polls': 0}
FrontEnd (9)
  Preferred node: 2
  Prev: [65, 82, 52, 71]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 138
  Seen vals: [5, 5, 18, 15, 8, 20, 23, 22, 38, 40, 41, 62, 68, 76, 120, 110, 138]
  Stats: {'updates': 33, 'update_completes': 33, 'query_starts': 17, 'query_completes': 17, 'failed_polls': 0}
--------------------------
Node (0)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 78, 'gossip_messages_processed': 243, 'gossip_updates_processed': 238, 'queries': 49}
Node (1)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 98, 'gossip_messages_processed': 237, 'gossip_updates_processed': 218, 'queries': 47}
Node (2)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 52, 'gossip_messages_processed': 238, 'gossip_updates_processed': 264, 'queries': 38}
Node (3)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 88, 'gossip_messages_processed': 284, 'gossip_updates_processed': 228, 'queries': 50}
======================
</code></pre></div></div>

<p>It successfully replicates updates<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> and produces the same value at each
replica, in this case <code class="language-plaintext highlighter-rouge">Value: 170</code>.</p>

<p>The system isn’t perfect, and it still probably has some bugs, which have been
difficult to track down. But the value was in the exercise, and if you’re interested
in the Ladin paper, this is a functioning example that’s simple enough to be
considered basically pseudocode.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>The “C” is sometimes phrased “Consistency of replicas” rather than consistency of reads, which is a fine distinction. Systems are only observed through read operations, so inconsistency can only be noticed through reads; however, if a node fails and its work hasn’t successfully been replicated anywhere, then a future consistent read becomes impossible. So consistent replicas and consistent reads are closely tied but have different implications for how you might design the system to handle failure. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>The original proof gives a precise definition of the CAP conjecture originally posed by Brewer: <a href="https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf">https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf</a>. Besides being more precise, it also describes weaker forms of consistency that are useful in real systems. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>A running total, which actually isn’t sensitive to the order in which operations are run: a so-called CRDT (see the <em>Rethinking Eventual Consistency</em> paper). This was a good starting point, I thought, because I didn’t have to work out the partial ordering just yet. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>A week on NTPSec</title>
      <link>https://charlie-gallagher.github.io/2026/02/06/ntpsec.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/06/ntpsec.html</guid>
      <pubDate>Fri, 06 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>In which I find out with some certainty what time it is.</p>

<blockquote>
  <p>The ntpd utility can synchronize time to a theoretical precision of about 232
picoseconds. In practice, this limit is unattainable due to quantum limits on
the clock speed of ballistic-electron logic.</p>

  <p>(https://docs.ntpsec.org/latest/ntpd.html)</p>
</blockquote>

<p>I always assumed I was using ntpd to keep time on my Linux computer. But I was
only sort of right.</p>

<p>According to the Debian Wiki, since Debian 12, the default NTP client is
<code class="language-plaintext highlighter-rouge">systemd-timesyncd</code>. It uses SNTP (Simple Network Time Protocol), which
implements only the client side, with no option to host a time server, and it
sets the time roughly by communicating with a single time server. There’s no recourse if you
get a bad server, or “falseticker” in NTP parlance.</p>

<p>There are a few implementations of NTP to choose from. The systemd-timesyncd
daemon is a basic client suitable for keeping time. The original NTP reference
implementation is <code class="language-plaintext highlighter-rouge">ntpd</code>, which is still around, but is deprecated on Debian in
favor of the more security-focused <a href="https://ntpsec.org">NTPSec</a>. And then
<a href="https://chrony-project.org">Chrony</a> is a newer implementation that is more
practical than <code class="language-plaintext highlighter-rouge">ntpd</code>. It looks like a darn fine timekeeper <a href="https://chrony-project.org/faq.html#_how_does_chrony_compare_to_ntpd">by
comparison</a>.</p>

<p>There are interesting things to say about each NTP tool (and their apparent
<a href="https://www.linux-magazine.com/Online/Blogs/Off-the-Beat-Bruce-Byfield-s-Blog/NTPsec-The-Wrong-Fork-for-the-Wrong-Reasons">controversies</a>),
but if you’re interested in NTP you can pick pretty equally among <code class="language-plaintext highlighter-rouge">ntpd</code>,
Chrony, and NTPSec. I’ve been playing with NTPSec for its debugging utilities
like out-of-the-box data visualizations using <code class="language-plaintext highlighter-rouge">ntpviz</code>.</p>

<hr />

<p>Most computers have a real time clock in hardware and a system clock in
software. On powerup or reboot, the system clock is set using the RTC. You can
use a command like <code class="language-plaintext highlighter-rouge">date</code> to set the date/time, but this only updates the system
clock; strictly speaking, to update the hardware clock immediately you’ll need</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hwclock --systohc
</code></pre></div></div>

<p>The hardware clock is battery-driven, which is how its time reading persists
across boots. But some parts of it are curiously system-dependent, for example
whether the hardware clock is set in UTC or local time.</p>

<blockquote>
  <p>If your machine dual boots Windows and Linux, then you could have problems
because Windows uses localtime for the hardware clock; while Linux and Debian
use UTC for the hardware clock. In this case you have two choices. The first
is to use localtime for the hardware clock, and set Debian to use localtime.
The second is to use UTC for the hardware clock, and set Windows to use UTC.</p>

  <p>https://wiki.debian.org/DateTime</p>
</blockquote>

<p>NTP implementations like Chrony and NTPSec don’t directly interact with the RTC;
instead, they modify the system clock. They <em>tend</em> to make use of a kernel
feature called “11-minute mode”, where the kernel writes the system clock’s time
to the hardware clock every 11 minutes, but documentation on this is a bit
scant. There are some comments in the <a href="https://chrony-project.org/faq.html#_real_time_clock_issues">Chrony
docs</a>.</p>

<p>Real time clocks are usually crystal oscillators with a frequency of 32.768 kHz,
and since NTP doesn’t directly interact with them, I’m not going to talk much
more about them.</p>

<p>Software clocks on the other hand are crucial to the system. Every system that
NTP runs on must provide a time correction service. The <code class="language-plaintext highlighter-rouge">adjtime</code> syscall is
intended to be portable; it originated in BSD and is widely available, though as
far as I can tell it isn’t actually part of the POSIX standard. You might
also see <code class="language-plaintext highlighter-rouge">adjtimex</code>, which is a Linux-specific variant.</p>

<p>To explain how the system call works, <em>The Design and Implementation of the
FreeBSD Operating System</em>:</p>

<blockquote>
  <p>The <code class="language-plaintext highlighter-rouge">settimeofday</code> system call will result in time running backward on
machines whose clocks were fast. Time running backward can confuse user
programs (such as <code class="language-plaintext highlighter-rouge">make</code>) that expect time to invariably increase. To avoid
this problem, the system provides the <code class="language-plaintext highlighter-rouge">adjtime</code> system call [Mills, 1992].
The <code class="language-plaintext highlighter-rouge">adjtime</code> system call takes a time delta (either positive or negative) and
changes the rate at which time advances by 10 percent, faster or slower, until
the time has been corrected. The operating system does the speedup by
incrementing the global time by 1100 microseconds for each tick and does the
slowdown by incrementing the global time by 900 microseconds for each tick.
Regardless, time increases monotonically, and user processes depending on the
ordering of file-modification times are not affected. However, time changes
that take tens of seconds to adjust will affect programs that are measuring
time intervals by using repeated calls to gettimeofday</p>
</blockquote>

<p>The Mills reference is to <a href="https://datatracker.ietf.org/doc/html/rfc1305">RFC 1305</a>.</p>
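<p>As a toy model of that slewing scheme (my sketch in Python, not the actual kernel code), here’s how the amortization plays out, assuming 1000 ticks per second and the 100-microsecond-per-tick adjustment the book describes:</p>

```python
# Toy model of adjtime-style slewing: a requested correction is
# absorbed at up to 100 microseconds per tick, so time never runs
# backward while the clock is being corrected.

TICK_US = 1000  # nominal microseconds per tick at 1000 ticks/second

def slew(global_time_us, delta_us):
    """Advance the clock tick by tick until delta_us is absorbed.

    Returns (new_global_time_us, ticks_taken).
    """
    ticks = 0
    while delta_us != 0:
        adj = min(100, abs(delta_us))  # at most 10% of a tick per tick
        if delta_us > 0:               # clock is behind: run fast (1100 us/tick)
            global_time_us += TICK_US + adj
            delta_us -= adj
        else:                          # clock is ahead: run slow (900 us/tick)
            global_time_us += TICK_US - adj
            delta_us += adj
        ticks += 1
    return global_time_us, ticks

# Absorbing a +0.5 s correction takes 5,000 ticks: 5 seconds of wall time.
_, ticks = slew(0, 500_000)
```

<p>This also shows why the book warns about interval measurements: a correction of N seconds takes roughly ten times N seconds of wall time to apply, during which repeated <code class="language-plaintext highlighter-rouge">gettimeofday</code> calls see a clock running 10 percent fast or slow.</p>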

<p>Since I have TDAIOTFBSDOS open already, I can mention a few other things about
how a typical POSIX software clock works. The system software clock is created
through an interrupt timer, and the system “increments its global time variable
by an amount equal to the number of microseconds per tick. For the PC, running
at 1000 ticks per second, each tick represents 1000 microseconds,” (p. 73). And
if you think 1000 interrupts per second is a lot of interruption, you’re right.
“To reduce the interrupt load, the kernel computes the number of ticks in the
future at which an action may need to be taken. It then schedules the next clock
interrupt to occur at that time. Thus, clock interrupts typically occur much
less frequently than the 1000 ticks-per-second rate implies,” (pp. 65-66).</p>

<p>I’d guess (and a brief conversation with ChatGPT seems to confirm) that modern
operating systems have heavily optimized this part of their timekeeping. After
all, who cares what time it is if no process is trying to observe it?</p>

<hr />

<blockquote>
  <p>There is not one way of measuring time more true than another; that which is
generally adopted is only more <em>convenient</em>.</p>

  <p>Henri Poincaré</p>
</blockquote>

<p>What would it take for me to serve time to others? NTP servers listen on port
123 and usually work only over UDP, so I suppose the simple way to serve time is
to start ntpd in server mode, start listening, and configure someone to ask you
the time. But if you really want to be seen, you have to join the
<a href="https://www.ntppool.org/en/">pool</a>.</p>

<blockquote>
  <p>The pool.ntp.org project is a big virtual cluster of timeservers providing
reliable, easy to use NTP service for millions of clients.</p>

  <p>The pool is being used by hundreds of millions of systems around the world.
It’s the default “time server” for most of the major Linux distributions and
many networked appliances (see information for vendors).</p>

  <p>https://www.ntppool.org/en/</p>
</blockquote>

<p>There’s very clear documentation on <a href="https://www.ntppool.org/en/join.html">how to join the
pool</a>, too. The basic steps are:</p>

<ul>
  <li>Get your own time from a known good source (<em>not</em> the pool).</li>
  <li>Configure a stable IP address (trickier than you might think – even if you
set up port forwarding to get around DHCP issues, your ISP tends to rotate
your public IP address as it wants).</li>
  <li>Be willing to make a long-term commitment to the project.</li>
</ul>

<p>I’ll put “create a home time server” on my list of things to try, but joining
the pool would probably create too big a wave.</p>

<hr />

<p>My computer may not be the best to ask for the precise time. Where does
authority in timekeeping come from? Who has the time? I’m only an hour’s drive
away from the <a href="https://museum.nawcc.org">National Clock and Watch Museum</a>, and
after a visit and a few dozen hours of follow-up research, I have something
approximating an answer.</p>

<p>We get our sense of time from the periodic movements of the starry firmament –
the sun, the moon, the stars. And our bodies, along with most organisms on
Earth, have built-in timers that encourage us to do those activities that keep
us alive. This is when you usually sleep, this is when you usually eat. As
different cultures sought to understand the heavens and their perfection,
timekeeping began to occupy its modern central role for coordination in our
lives.</p>

<p>And so at the moment my interest in timekeeping isn’t how we developed precision
clocks, but how we managed to coordinate ourselves using those clocks. <em>Why</em> we
coordinate ourselves using those clocks. What even is a clock? If all of the
atomic clocks in the world stopped ticking for 10 minutes, would we be able to
recover “the time”?</p>

<p>We’ve had a sense for calendars for a long time. “By the 14th century BC the
Shang Chinese had established the solar year as 365.25 days and the lunar month
as 29.5 days,” (RFC 1305). By 432 BC, the Greek astronomer Meton had estimated
the lunar month – the time it takes for the moon to circle the earth – to
within about 2 minutes of the currently understood value.</p>

<p>Time-curious cultures became duly obsessed with the frequency and stability of
our cosmic oscillators. The Earth’s rotation and its orbit around the sun, the
moon’s orbit around the Earth. And each culture had a calendar that tried to
match the motions of the cosmos with a predictable and convenient “civilian”
calendar.</p>

<p>Not all cultures had a calendar, and the ones that did used different systems,
so the process of dating events is knotty. Suffice to say that it involves some
guesswork. The best case for understanding the orders of events in the old days
is having what Mills calls (in RFC 1305) “an accurate count of the days relative
to some globally alarming event, such as a comet passage or supernova
explosion.”</p>

<p>And so calendars are social. The civil calendar had to be convenient and fit
into the activities of daily life, and ordering of events depends on some
collective consciousness around global events. I’ve been surprised by how often
we make clocks and calendars fit into daily life and not the other way around.
Several of the most precise modern timescales today are based on what feels
right and looks right, made a bit more precise.</p>

<p>Calendars order our years; clocks order our days. Early religion temporalized
daily life by requiring certain religious acts to be done multiple times a day.
Some of the earliest interesting clock-like devices we have are from monasteries
that rang bells at specific times. (And the word <em>clock</em> is derived from the
French word for bell.) This went on for a few hundred years.</p>

<p>The next advance was periodic timekeepers. Time used to be more organic than it
is today. Hours were not equally sized, and the day was not split equally into
24 parts. But somewhere, at some time, Europeans made an intuitive leap from
continuous time devices like the clepsydra or the procession of different stars
and planets to discrete time – time as ticks<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. In <em>Revolution in Time</em>,
David Landes considers this one of the great methodological leaps in western
civilization. It took other cultures another 500 years to begin using
oscillating, periodic timekeepers.</p>

<p>Clocks have their social uses. Nearly as soon as clocks became convenient and
domestic, <em>punctuality</em> became an important social cue. And as life became more
connected with trade, trains, and radio, the pragmatic importance of clocks only
increased.</p>

<hr />

<p>I installed NTPSec on my Debian machine and left the configuration mostly as-is.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt-get update &amp;&amp; sudo apt-get install ntpsec ntpsec-doc ntpsec-ntpviz
</code></pre></div></div>

<p>I made sure to enable statistics, because I’m really after visualizations. I
want to see the thing do stuff. Visualizations are generated using <code class="language-plaintext highlighter-rouge">ntpviz</code>,
which is scantily documented (this was helpful but ancient:
<a href="https://blog.ntpsec.org/2016/12/19/ntpviz-intro.html">ntpviz-intro</a>), but I
found enough to get me going. Unfortunately, I just set up my daemon, and
there’s no data to visualize. I took the opportunity to do some background work
on the metrics.</p>

<p>Clocks are never perfectly in sync, and the most important contributor to
incorrect timekeeping is a difference in oscillator frequencies. This is called
frequency <em>skew</em>. If the “correct” time is an oscillator at 1000 Hz, my local
computer clock might be more like 1001 Hz or 999 Hz. So even if I set my clock
to the right time, I would gain or lose some seconds every day.</p>

<p>Frequency skew is measured in parts per million, which is to say the number of
periods fast or slow per million oscillations. In the 1000 Hz example, 1001 Hz
would have a skew of 1 part in every thousand, or 1000 parts per million (ppm).
999 Hz has a skew of -1000 ppm.</p>

<p>Skew is also described in other ways. A human-friendly way to describe it is
“seconds gained or lost per day”, or week or year. This gives you the number in
practical terms. It’s a bit tricky to translate between them, though,
considering the gap between oscillation frequency and length of a day.</p>
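<p>The translation is just a ratio, though. A quick sketch in Python (nothing NTP-specific here):</p>

```python
SECONDS_PER_DAY = 86_400

def ppm_to_seconds_per_day(ppm):
    # 1 ppm of skew = 1 second gained or lost per million true seconds
    return SECONDS_PER_DAY * ppm / 1_000_000

# The 1000 ppm clock from the example above gains 86.4 seconds per day.
```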

<p>Your skew might also vary over time, and this is called <em>drift</em>.</p>

<p>NTP corrects for skew as part of the protocol by nudging the time and doing its
best to predict changes. Skew is affected by the quality of the hardware and the
environment around the oscillator, especially temperature. For ideal
timekeeping, you’ll want to keep your computer in a nice climate-controlled
vault with excellent heat sinking.</p>

<p>The clock offset is the estimated difference between my clock and the reference
clock, measured in milliseconds. I show roughly how this is calculated in <a href="/2026/01/27/ntp-in-30-seconds.html">NTP
in 30 Seconds</a>. In short, it’s
calculated by estimating the latency between you and the server and using that
to guess what time the server received your request. Then you compare your guess
(based on local time + latency) to what the server reported was the “actual”
time it received the request, and use the difference to work out how wrong your
clock is.</p>
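<p>In the conventional notation, the four timestamps are called t1 through t4. This is the textbook formula, consistent with the description above, sketched in Python:</p>

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Estimate clock offset and round-trip delay from the four NTP
    timestamps: t1 = client send, t2 = server receive, t3 = server send,
    t4 = client receive (t1/t4 on the client clock, t2/t3 on the server's).
    """
    delay = (t4 - t1) - (t3 - t2)         # time actually spent in transit
    offset = ((t2 - t1) + (t3 - t4)) / 2  # how far my clock is off the server's
    return offset, delay

# Client clock 5 ms slow, 10 ms latency each way:
offset, delay = ntp_offset_delay(0.000, 0.015, 0.016, 0.021)
```

<p>The offset estimate is exact only when the outbound and return latencies are equal; asymmetric routes show up as offset error, which is one reason NTP polls several servers.</p>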

<p>Practically speaking, for general monitoring, you can use <code class="language-plaintext highlighter-rouge">ntpmon</code>. This is a
<code class="language-plaintext highlighter-rouge">top</code>-like tool for watching your NTP daemon interact with peers. The output
looks something like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     remote           refid      st t when poll reach   delay   offset   jitter
 0.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
 1.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
 2.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
 3.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
-ip74-208-14-149 192.58.120.8     2 u  598 1024  377  41.7645   0.9319   1.5714
-144.202.66.214. 162.159.200.1    4 u  834 1024  377  45.3224   1.1844   1.2529
*nyc2.us.ntp.li  17.253.2.37      2 u  564 1024  377  10.2534  -0.8732   0.8516
+ntp-62b.lbl.gov 128.3.133.141    2 u  748 1024  377  73.6187  -0.3344   1.0547
+time.cloudflare 10.102.8.4       3 u  212 1024  377   8.4884   0.0523   0.9331
 192-184-140-112 .PHC0.           1 u  66h 1024    0  85.8202   5.3690   0.0000
+ntp.nyc.icanbwe 69.180.17.124    2 u  639 1024  377  11.8765  -0.1314   1.1134

ntpd ntpsec-1.2.2                             Updated: 2026-02-04T08:17:40 (32)

 lstint avgint rstr r m v  count    score   drop rport remote address
      0   1284    0 . 6 2    321    1.217      0 51529 localhost
    212   1054   c0 . 4 4    127    0.050      0   123 time.cloudflare.com
    564   1079   c0 . 4 4    123    0.050      0   123 nyc2.us.ntp.li
    598   1058   c0 . 4 4    126    0.050      0   123 ip74-208-14-149.pbiaas.com
    639   1066   c0 . 4 4    125    0.050      0   123 ntp.nyc.icanbwell.com
    748   1055   c0 . 4 4    126    0.050      0   123 ntp-62b.lbl.gov
    834   1066   c0 . 4 4    125    0.050      0   123 144.202.66.214 (144.202.66.214.vultruser
</code></pre></div></div>

<p>I’ll describe peer metrics in a second. For now, the second table, starting with
<code class="language-plaintext highlighter-rouge">lstint</code>, is the MRU list (MRU=most recently used). Here are the stats it
reports.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">lstint</code> Interval (s) between receipt of most recent packet from this address
and completion of the retrieval of the MRU list by ntpq.</li>
  <li><code class="language-plaintext highlighter-rouge">avgint</code> Average interval (s) between packets from this address.</li>
  <li><code class="language-plaintext highlighter-rouge">rstr</code> Restriction flags.</li>
  <li><code class="language-plaintext highlighter-rouge">r</code> Rate control indicator.</li>
  <li><code class="language-plaintext highlighter-rouge">m</code> Packet mode</li>
  <li><code class="language-plaintext highlighter-rouge">v</code> Packet version number.</li>
  <li><code class="language-plaintext highlighter-rouge">count</code> Packets received</li>
  <li><code class="language-plaintext highlighter-rouge">score</code> Packets per second (averaged with exponential decay)</li>
  <li><code class="language-plaintext highlighter-rouge">drop</code> Packets dropped</li>
  <li><code class="language-plaintext highlighter-rouge">rport</code> Source port of last packet received</li>
  <li><code class="language-plaintext highlighter-rouge">remote address</code> The remote host name</li>
</ul>

<p>There are commands you can use to change the output, like <code class="language-plaintext highlighter-rouge">d</code> for detailed mode.</p>

<p>For a snapshot, you can use <code class="language-plaintext highlighter-rouge">ntpq</code>, a helpful tool for inspecting the daemon. It
has an interactive mode and a one-shot mode. This queries peers in the one-shot
mode.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ntpq --peers --units
     remote                                   refid      st t when poll reach   delay   offset   jitter
=======================================================================================================
 0.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
 1.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
 2.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
 3.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
-ip74-208-14-149.pbiaas.com              192.58.120.8     2 u  671 1024  377 41.765ms 931.92us 1.5714ms
-144.202.66.214.vultrusercontent.com     162.159.200.1    4 u  907 1024  377 45.322ms 1.1844ms 1.2529ms
*nyc2.us.ntp.li                          17.253.2.37      2 u  637 1024  377 10.253ms -873.2us 851.60us
+ntp-62b.lbl.gov                         128.3.133.141    2 u  821 1024  377 73.619ms -334.4us 1.0547ms
+time.cloudflare.com                     10.102.8.4       3 u  285 1024  377 8.4884ms 52.298us 933.07us
 192-184-140-112.fiber.dynamic.sonic.net .PHC0.           1 u  66h 1024    0 85.820ms 5.3690ms      0ns
+ntp.nyc.icanbwell.com                   69.180.17.124    2 u  712 1024  377 11.877ms -131.4us 1.1134ms
</code></pre></div></div>

<p>Here’s how this table is interpreted according to the <code class="language-plaintext highlighter-rouge">ntpmon</code> man page:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">tally</code> (symbol next to remote) One of: space (not valid); x, ., or -
(discarded for various reasons); + (included by the combine algorithm); #
(backup); * (system peer); o (PPS peer). Basically, look for the <code class="language-plaintext highlighter-rouge">*</code> and any <code class="language-plaintext highlighter-rouge">+</code> signs to see
who you’re listening to right now.</li>
  <li><code class="language-plaintext highlighter-rouge">remote</code> The host name of the time server.</li>
  <li><code class="language-plaintext highlighter-rouge">refid</code> The RefID identifies the specific upstream time source a server is
using. In other words, it names the reference clock (stratum 0 or 1), even if
this server is just repeating what that reference clock says.</li>
  <li><code class="language-plaintext highlighter-rouge">st</code> NTP stratum</li>
  <li><code class="language-plaintext highlighter-rouge">t</code> Type. u: unicast or manycast, l: local, s: symmetric (peer), B:
broadcast server.</li>
  <li><code class="language-plaintext highlighter-rouge">when</code> sec/min/hr since last received packet.</li>
  <li><code class="language-plaintext highlighter-rouge">poll</code> Poll interval in log2 seconds</li>
  <li><code class="language-plaintext highlighter-rouge">reach</code> Octal triplet. Represents the last 8 attempts to reach the server.
<code class="language-plaintext highlighter-rouge">377</code> is binary <code class="language-plaintext highlighter-rouge">11111111</code>, which means all 8 attempts reached the server. A
value like <code class="language-plaintext highlighter-rouge">326</code> is binary <code class="language-plaintext highlighter-rouge">11010110</code>, meaning out of the last 8 attempts, the
3rd, 5th, and 8th attempts failed.</li>
  <li><code class="language-plaintext highlighter-rouge">delay</code> Roundtrip delay</li>
  <li><code class="language-plaintext highlighter-rouge">offset</code> Offset of server relative to this host.</li>
  <li><code class="language-plaintext highlighter-rouge">jitter</code> Jitter is random noise relative to the standard timescale.</li>
</ul>
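<p>To make the <code class="language-plaintext highlighter-rouge">reach</code> arithmetic concrete, here’s a small helper of my own (not part of ntpq) that decodes the octal value into which of the last 8 polls failed, oldest first:</p>

```python
def failed_polls(reach_octal):
    """Decode ntpq's 'reach' shift register (given as an octal string)
    into the 1-based positions of failed polls, where position 1 is the
    oldest of the last 8 attempts and position 8 the most recent."""
    bits = format(int(reach_octal, 8), "08b")
    return [i for i, bit in enumerate(bits, start=1) if bit == "0"]

# "377" means all 8 polls succeeded; "326" means polls 3, 5, and 8 failed.
```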

<p>For more complete definitions, see <code class="language-plaintext highlighter-rouge">man ntpmon</code>.</p>

<p>Many of these are technical and mostly of interest to those already experienced
with NTP. I’m not, so I’ve focused on a few of the more interesting metrics:
tally, reach, delay, offset, and jitter. These are the same metrics that
<code class="language-plaintext highlighter-rouge">ntpviz</code> reports on.</p>

<hr />

<blockquote>
  <p>There is a law of error that may be stated as follows: small errors do not
matter until large errors are removed. So with the history of time
measurement: each improvement in the performance of clocks and watches posed a
new challenge by bringing to the fore problems that had previously been
relatively small enough to be neglected.</p>

  <p><em>Revolution in Time</em> p. 114</p>
</blockquote>

<p>For a long time, agreement between clocks didn’t matter. Citizens of the US in
the 19th century had timekeepers, but they set them using “apparent solar time”,
or time estimated by when the sun is highest in the sky. This varies across any
distance east or west, so my clock’s noon in Pennsylvania was noticeably
different from my cousin’s clock in Pittsburgh. Apparent solar time is set by
sundial, and astronomers could keep time better still by looking at the
movements of the planets and stars. (But who had an astronomer in those days?)
Besides the sun, you had church bells and tower clocks. Ye old tower clock in
most cases was set by sundial, and “none too accurately” in the words of the
clock museum. Religious clocks were more of a suggestion of the time.</p>

<p>Coordination wasn’t a moral imperative in the US until the railroads. When
you’re coordinating a few hundred trains in and out of stations, timekeeping
becomes quite important. For most of the 19th century, each railroad company had
its own timekeeping system and standards for accuracy. This created competing
definitions of time, and confusion and accidents followed. In the middle of the
century, there were 144 official time zones in North America alone<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>

<p>The accidents and fatalities motivated the US to move to a new definition of
<em>standard time</em> based on only four main timezones, the same basic ones we use
today.</p>

<p>If you’re like me, you get nervous thinking about the logistics of suddenly
changing the time, but while the topic of changing from “God’s time” to an
official time was controversial, the actual change seems to have gone well.
There was a <a href="https://guides.loc.gov/this-month-in-business-history/november/day-of-two-noons">day of two
noons</a>
on November 18, 1883, and official clocks and watches were set to the correct
time via telegraph. And that was it.</p>

<p><img src="/assets/images/2026-timekeeping/sunday_morning_herald.png" alt="Sunday Herald article from November 18, 1883" /></p>

<p>Source: <a href="https://www.nyshistoricnewspapers.org">https://www.nyshistoricnewspapers.org</a></p>

<hr />

<p>After a day or two, I checked back in on my NTP stats to see what I’d collected.
For my distribution, the data collects in <code class="language-plaintext highlighter-rouge">/var/log/ntpsec/</code>. Running <code class="language-plaintext highlighter-rouge">ntpviz</code>
on this folder will generate an HTML report with all of the default data
visualizations.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nptviz -d /var/log/ntpsec/
open ntpgraphs/index.html
</code></pre></div></div>

<p>The interesting graph for me is the first one, which plots clock offset (ms,
left axis) and frequency skew (ppm, right axis). My clock is slow, pretty
consistently, by about 7ppm. That is, over 1 million oscillations, my clock will
read 7 periods less than the authority. As long as this is consistent, that’s
ok.</p>

<p><img src="/assets/images/2026-timekeeping/local-offset.png" alt="Clock offset and frequency skew" /></p>

<p>At some point on Jan 31, I suddenly found myself 4ms ahead of the reference
clock, and the ensuing correction was a bit too big. But the last day or two has
been very stable.</p>

<p>The next graph shows “RMS time jitter” (RMS=root mean square), or in other words
“how fast the local clock offset is changing.” The tip under the graph says that
0 is ideal, but it doesn’t give me a sense of whether my clock with a 90% range
of 0.528 is any good. It seems spiky.</p>

<p><img src="/assets/images/2026-timekeeping/local-jitter.png" alt="RMS time jitter" /></p>

<p>And a third graph shows RMS <em>frequency</em> jitter, similar metric but for my
oscillator’s consistency.</p>

<p><img src="/assets/images/2026-timekeeping/local-stability.png" alt="RMS frequency jitter" /></p>

<p>Skipping down a bit, there’s a fun correlation graph between local temperature
and the frequency offset. My computer apparently measures temperature in two
different places (one consistently warmer than the other). You can see how
sudden changes in temperature correlate closely with changes in the frequency
offset. The spikes are caused by the space heater in my office.</p>

<p><img src="/assets/images/2026-timekeeping/local-freq-temps.png" alt="Correlation between local temperature and frequency offset" /></p>

<p>All of this is still abstract to me. I’ll have to collect more data and try it
on a few different machines until I get a better sense for what’s good and
what’s not.</p>

<hr />

<p>Ok, it’s time time got defined.</p>

<p>To define the time, you need a few things:</p>

<ul>
  <li>An oscillator</li>
  <li>A count of oscillations (the “epoch”)</li>
  <li>An origin</li>
</ul>

<p>An oscillator with a counter is called a <em>clock</em>, and the origin is called the
“frame of reference.” If you consider Earth’s rotations as an oscillator, then
the “day” is the counter, where “day” is a complete rotation of the Earth. The
origin can be anything convenient, maybe the oscillation when Halley’s comet
last passed overhead, or a particular spring equinox. A particular clock is
called a <em>timescale</em>.</p>

<p>Before 1958, the heavenly bodies defined the common timescale. The second was
defined as 1/86,400 of a solar day, which is the average time between apparent
noon at some standard location, like the Royal Observatory in Greenwich. There
are all kinds of quirks with this. First, days are getting longer, because the
Earth’s rotation is slowing down. It’s estimated that several hundred million
years ago, there were only 20 hours in the day. This is caused by the friction
of tides.</p>

<p>Second, there are variations in the rotation for other reasons. It’s not a
stable oscillator, and the Earth’s tilt varies over time, which causes other
inconsistencies. It turns out that this timescale is still useful in modern
timekeeping (it’s a component of Greenwich Mean Time), but it’s not an
effective standard for the SI second.</p>

<p>“In 1958, the standard second was redefined as 1/31,556,925.9747 of the tropical
year that began this century,” (RFC 1305). The tropical year is the time the Sun
takes to return to the same position in the sky from some perspective on Earth.
This only lasted until 1967, because it was still not precise enough for modern
needs. The tropical year has an accuracy of only 50 ms and increases by 5ms per
year.</p>

<p>In 1967, the second was redefined using ground state transitions of the
cesium-133 atom, in particular 1 second = 9,192,631,770 periods. Since 1972,
“time” has had a foundation of International Atomic Time (TAI), which is defined
using the cesium state transition timescale alone. This is a very important time
standard – it underlies UTC, for example.</p>

<p>TAI is a continuous average count of standard atomic seconds since 1958-01-01
00:00:00 TAI. You might say, “Hey, that’s defined in terms of TAI,” and yeah, I
was wondering about that myself. To understand the origin of TAI, you have to
understand the standard Modified Julian Date (MJD). There’s no space here for
that, but in essence, it’s a more precise version of our intuitive understanding
of a calendar of recent events. Historical dates are vague, but modern dates are
well tracked. In other words, the origin is determined from well-known
astronomical observations.</p>

<p>There are a lot of standards (UT, UT0, UT1, UT2, TAI, GMT, UTC) and I don’t have
the space or knowledge to disambiguate them all. But I want to answer one of the
questions that started this history hunt. What’s the difference between UTC and
GMT?</p>

<p>Coordinated Universal Time and Greenwich Mean Time. The former is a variant of
TAI that occasionally inserts leap seconds in order to stay in step with GMT.
GMT is mean solar time (also known as local mean time) at the Royal Observatory
in Greenwich, London.</p>

<p>UTC stays in step with GMT through leap seconds, which are inserted/deleted when
the difference between GMT and UTC approaches 0.7 seconds. These leap seconds
make UTC a non-continuous timescale. TAI on the other hand is continuous –
there are no leap seconds (UTC = TAI - leap seconds). TAI will continue to drift
out of sync with our intuition for “the time” based on the orbital oscillations,
but UTC, like so much of timekeeping, is social. Great pains have been taken to
make it precise but intuitive.</p>
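<p>A minimal sketch of the TAI/UTC relationship. The 37-second offset is the accumulated leap-second total as of the leap second added at the end of 2016 (a published value, not from this post), and it changes whenever a new leap second is announced:</p>

```python
# TAI/UTC relationship sketch. LEAP_SECONDS is the accumulated TAI - UTC
# offset: 37 s since the start of 2017.
LEAP_SECONDS = 37

def tai_to_utc(tai_seconds: float) -> float:
    """UTC = TAI - accumulated leap seconds; TAI itself never jumps."""
    return tai_seconds - LEAP_SECONDS

assert tai_to_utc(1_000_000_037) == 1_000_000_000
```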

<p>And I can also answer the question I mentioned above, <em>What would happen if the
atomic clocks on Earth stopped for 10 minutes?</em><sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> When I posed the question, I
imagined a server with a red segmented display keeping <em>the time</em> somewhere in a
vault. But now I know that the cesium atom transitions define the second, not
the time. The agreed-upon time is just that, <em>agreed upon</em>. If the frequency
reference stops working, the time servers of the world would no longer receive
a signal telling them if they’re on the standard or not. They might drift by
picoseconds in 10 minutes, but not enough to cause catastrophe. And the
frequency standard can be recovered by restarting the cesium atom state
transitions – the clocks of the world would come once again to agree.</p>

<p>The point of all of this is that “the time” is not much more complicated than
“whatever we say it is.” As Poincaré said in the quote above, the true time is
the one that’s most convenient.</p>

<hr />

<p>My NTP server has been keeping time for me for a week now while I researched
this piece. All week I’ve been hunting for this idea of “the actual time”
separate from how I intuitively understood it.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>

<p>But even though we took one step away from natural time by embracing averages
over daily observations, we’ve taken a step back towards nature with UTC, which
has been jumping through hoops (or at least seconds) to keep civil time
convenient.</p>

<p>In the end, we all have the time. “The time” is a social construct, an event of
the collective conscious, the one thing we can agree on because it was our
agreement that defined it in the first place.</p>

<h1 id="sources">Sources</h1>

<ul>
  <li>https://linux.die.net/sag/hw-sw-clocks.html</li>
  <li>https://wiki.debian.org/DateTime</li>
  <li>https://ntpsec.org</li>
  <li>https://wiki.archlinux.org/title/Systemd-timesyncd</li>
  <li>The National Watch and Clock Museum in Columbia, PA</li>
  <li><em>Revolution in Time</em>, David S. Landes, 1e</li>
  <li><em>What is Time?</em>, G. J. Whitrow</li>
  <li><em>The Design and Implementation of the FreeBSD Operating System</em>, 2e</li>
  <li>RFC 1305, especially Appendix E: The NTP Timescale and its Chronometry.</li>
  <li>https://www.nist.gov/pml/time-and-frequency-division/popular-links/walk-through-time/walk-through-time-atomic-age-time</li>
  <li>https://www.nist.gov/pml/time-and-frequency-division/popular-links/walk-through-time/walk-through-time-world-time-scales</li>
  <li>https://guides.loc.gov/this-month-in-business-history/november/day-of-two-noons</li>
  <li>https://www.nyshistoricnewspapers.org</li>
</ul>

<h1 id="other-notes">Other notes</h1>
<p>There’s much more to say about this topic, much more research I couldn’t include
here. A sampling of other interesting topics.</p>

<ul>
  <li>How time is kept in distributed systems and Lamport’s article on clocks</li>
  <li>Special relativity and the meaning and relativity of the simultaneity of
events</li>
  <li>Why your garden sundial doesn’t work (and how to fix it)</li>
  <li>Scams and scandals of US timekeeping authorities, who made a killing off of
giving preferential treatment to some watchmakers and not others</li>
  <li>Daylight savings time and the madness of crowds</li>
  <li>This whole <a href="https://www.youtube.com/watch?v=-5wpm-gesOY">Tom Scott video</a> and
how computers deal with calendars</li>
</ul>

<p>Maybe some other day.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Escapements convert potential energy like a falling weight suspended by a rope into periodic motion, such as the ticking of a hand. There’s no better visualization of the development of the clock than Bartosz Ciechanowski’s <a href="https://ciechanow.ski/mechanical-watch/">mechanical watch</a>. And while these escapements were crude in the beginning, it took only a few breakthroughs until they were able to tell time within a few seconds per day. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>How do you get 144 time zones? Any move along a line of latitude (i.e. east or west) causes the sun’s apparent apex to move. When the sun is highest in the sky in Pennsylvania, it’s still rising in Colorado. Your sundial would in that case create infinite time zones for each variation in longitude. The railroads “solved” this by using a standard time for each major city they stopped in. You got a sort of “average solar time” for this stretch of railroad. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>In fact, an <a href="https://www.npr.org/2025/12/21/nx-s1-5651317/colorado-us-official-time-microseconds-nist-clocks">NIST atomic clock did recently stop</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>I haven’t struggled alone. Sundials used to come with an <em>equation of time</em> guide that translated the apparent solar time to the current mechanical (“mean”), so the purchaser could know with confidence what the “actual” time is. In the words of David Landes, “instead of setting by the sun, people corrected the sun.” Though I should mention that there were also <em>equation clocks</em>, which used complicated mechanisms to convert from mean time to apparent solar time. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>In which I find out with some certainty what time it is.</p>

<blockquote>
  <p>The ntpd utility can synchronize time to a theoretical precision of about 232
picoseconds. In practice, this limit is unattainable due to quantum limits on
the clock speed of ballistic-electron logic.</p>

  <p>(https://docs.ntpsec.org/latest/ntpd.html)</p>
</blockquote>

<p>I always assumed I was using ntpd to keep time on my linux computer. But I was
only sort of right.</p>

<p>According to the Debian Wiki, since Debian 12, the default NTP client is
<code class="language-plaintext highlighter-rouge">systemd-timesyncd</code>. It uses SNTP (Simple Network Time Protocol), which
implements the client with no option to host a time server, and it sets the time
roughly by communicating with a single time server. There’s no recourse if you
get a bad server, or “falseticker” in NTP parlance.</p>

<p>There are a few implementations of NTP to choose from. The systemd-timesyncd
daemon is a basic client suitable for keeping time. The original NTP reference
implementation is <code class="language-plaintext highlighter-rouge">ntpd</code>, which is still around, but is deprecated on Debian in
favor of the more security-focused <a href="https://ntpsec.org">NTPSec</a>. And then
<a href="https://chrony-project.org">Chrony</a> is a newer implementation that is more
practical than <code class="language-plaintext highlighter-rouge">ntpd</code>. It looks like a darn fine timekeeper <a href="https://chrony-project.org/faq.html#_how_does_chrony_compare_to_ntpd">by
comparison</a>.</p>

<p>There are interesting things to say about each NTP tool (and their apparent
<a href="https://www.linux-magazine.com/Online/Blogs/Off-the-Beat-Bruce-Byfield-s-Blog/NTPsec-The-Wrong-Fork-for-the-Wrong-Reasons">controversies</a>),
but if you’re interested in NTP you can pick pretty equally among <code class="language-plaintext highlighter-rouge">ntpd</code>,
Chrony, and NTPSec. I’ve been playing with NTPSec for its debugging utilities
like out-of-the-box data visualizations using <code class="language-plaintext highlighter-rouge">ntpviz</code>.</p>

<hr />

<p>Most computers have a real time clock in hardware and a system clock in
software. On powerup or reboot, the system clock is set using the RTC. You can
use a command like <code class="language-plaintext highlighter-rouge">date</code> to set the date/time, but this only updates the system
clock; strictly speaking, to update the hardware clock immediately you’ll need</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hwclock --systohc
</code></pre></div></div>

<p>The hardware clock is battery-driven, which is how its time reading persists
across boots. But some parts of it are curiously system-dependent, for example
whether the hardware clock is set in UTC or local time.</p>

<blockquote>
  <p>If your machine dual boots Windows and Linux, then you could have problems
because Windows uses localtime for the hardware clock; while Linux and Debian
use UTC for the hardware clock. In this case you have two choices. The first
is to use localtime for the hardware clock, and set Debian to use localtime.
The second is to use UTC for the hardware clock, and set Windows to use UTC.</p>

  <p>https://wiki.debian.org/DateTime</p>
</blockquote>

<p>NTP implementations like Chrony and NTPSec don’t directly interact with the RTC;
instead, they modify the system clock. They <em>tend</em> to make use of a kernel
feature called “11-minute mode”, where the system clock syncs to the hardware
clock every 11 minutes, but documentation on this is a bit scant. Some comments
in the <a href="https://chrony-project.org/faq.html#_real_time_clock_issues">Chrony
docs</a>.</p>

<p>Real time clocks are usually crystal oscillators with a frequency of 32.768 kHz,
and since NTP doesn’t directly interact with them, I’m not going to talk much
more about them.</p>

<p>Software clocks on the other hand are crucial to the system. Every system that
NTP runs on must provide a time correction service. The <code class="language-plaintext highlighter-rouge">adjtime</code> syscall is
intended to be portable. As far as I’ve seen, it’s POSIX standard. You might
also see <code class="language-plaintext highlighter-rouge">adjtimex</code>, which is a Linux-specific variant.</p>

<p>To explain how the system call works, <em>The Design and Implementation of the
FreeBSD Operating System</em>:</p>

<blockquote>
  <p>The <code class="language-plaintext highlighter-rouge">settimeofday</code> system call will result in time running backward on
machines whose clocks were fast. Time running backward can confuse user
programs (such as <code class="language-plaintext highlighter-rouge">make</code>) that expect time to invariably increase. To avoid
this problem, the system provides the <code class="language-plaintext highlighter-rouge">adjtime</code> system call [Mills, 1992].
The <code class="language-plaintext highlighter-rouge">adjtime</code> system call takes a time delta (either positive or negative) and
changes the rate at which time advances by 10 percent, faster or slower, until
the time has been corrected. The operating system does the speedup by
incrementing the global time by 1100 microseconds for each tick and does the
slowdown by incrementing the global time by 900 microseconds for each tick.
Regardless, time increases monotonically, and user processes depending on the
ordering of file-modification times are not affected. However, time changes
that take tens of seconds to adjust will affect programs that are measuring
time intervals by using repeated calls to gettimeofday</p>
</blockquote>

<p>The Mills reference is to <a href="https://datatracker.ietf.org/doc/html/rfc1305">RFC 1305</a>.</p>
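<p>A toy model of that slew behavior, using the 1000-microsecond ticks and 10 percent rate from the quote (illustrative numbers, not a real kernel interface):</p>

```python
# Toy model of the adjtime slew: each real tick is 1000 us, but the clock is
# credited 1100 us (speedup) or 900 us (slowdown) until the delta is used up.
TICK_US = 1000   # real time per tick, in microseconds
SLEW_US = 100    # 10 percent adjustment applied each tick

def ticks_to_correct(delta_us: int) -> int:
    """Ticks needed to slew away a delta (assumed a multiple of SLEW_US)."""
    return abs(delta_us) // SLEW_US

# A clock that is 1 second fast takes 10 seconds of real time to correct:
seconds = ticks_to_correct(1_000_000) * TICK_US / 1_000_000
print(seconds)  # prints 10.0
```

The upshot: corrections always take ten times the delta in wall-clock time, which is why the quote warns about programs measuring intervals during a long adjustment.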

<p>Since I have TDAIOTFBSDOS open already, I can mention a few other things about
how a typical POSIX software clock works. The system software clock is created
through an interrupt timer, and the system “increments its global time variable
by an amount equal to the number of microseconds per tick. For the PC, running
at 1000 ticks per second, each tick represents 1000 microseconds,” (p. 73). And
if you think 1000 interrupts per second is a lot of interruption, you’re right.
“To reduce the interrupt load, the kernel computes the number of ticks in the
future at which an action may need to be taken. It then schedules the next clock
interrupt to occur at that time. Thus, clock interrupts typically occur much
less frequently than the 1000 ticks-per-second rate implies,” (pp. 65-66).</p>

<p>I’d guess (and a brief conversation with ChatGPT seems to confirm) that modern
operating systems have heavily optimized this part of their timekeeping. After
all, who cares what time it is if no process is trying to observe it?</p>

<hr />

<blockquote>
  <p>There is not one way of measuring time more true than another; that which is
generally adopted is only more <em>convenient</em>.</p>

  <p>Henri Poincaré</p>
</blockquote>

<p>What would it take for me to serve time to others? NTP servers listen on port
123 and usually work only over UDP, so I suppose the simple way to serve time is
to start ntpd in server mode, start listening, and configure someone to ask you
the time. But if you really want to be seen, you have to join the
<a href="https://www.ntppool.org/en/">pool</a>.</p>

<blockquote>
  <p>The pool.ntp.org project is a big virtual cluster of timeservers providing
reliable, easy to use NTP service for millions of clients.</p>

  <p>The pool is being used by hundreds of millions of systems around the world.
It’s the default “time server” for most of the major Linux distributions and
many networked appliances (see information for vendors).</p>

  <p>https://www.ntppool.org/en/</p>
</blockquote>

<p>There’s very clear documentation on <a href="https://www.ntppool.org/en/join.html">how to join the
pool</a>, too. The basic steps are:</p>

<ul>
  <li>Get your own time from a known good source (<em>not</em> the pool).</li>
  <li>Configure a stable IP address (trickier than you might think – even if you
set up port forwarding to get around DHCP issues, your ISP tends to rotate
your public IP address as it wants).</li>
  <li>Be willing to make a long-term commitment to the project.</li>
</ul>

<p>I’ll put “create a home time server” on my list of things to try, but joining
the pool would probably create too big a wave.</p>

<hr />

<p>My computer may not be the best to ask for the precise time. Where does
authority in timekeeping come from? Who has the time? I’m only an hour’s drive
away from the <a href="https://museum.nawcc.org">National Clock and Watch Museum</a>, and
after a visit and a few dozen hours of followup research, I have something
approximating an answer.</p>

<p>We get our sense of time from the periodic movements of the starry firmament –
the sun, the moon, the stars. And our bodies, along with most organisms on
Earth, have built-in timers that encourage us to do those activities that keep
us alive. This is when you usually sleep, this is when you usually eat. As
different cultures sought to understand the heavens and their perfection,
timekeeping began to occupy its modern central role for coordination in our
lives.</p>

<p>And so at the moment my interest in timekeeping isn’t how we developed precision
clocks, but how we managed to coordinate ourselves using those clocks. <em>Why</em> we
coordinate ourselves using those clocks. What even is a clock? If all of the
atomic clocks in the world stopped ticking for 10 minutes, would we be able to
recover “the time”?</p>

<p>We’ve had a sense for calendars for a long time. “By the 14th century BC the
Shang Chinese had established the solar year as 365.25 days and the lunar month
as 29.5 days,” (RFC 1305). By 432 BC, the Greek astronomer Meton had estimated
the lunar month – the time it takes for the moon to circle the earth – to
within about 2 minutes of the currently understood value.</p>

<p>Time-curious cultures became duly obsessed with the frequency and stability of
our cosmic oscillators. The Earth’s rotation and its orbit around the sun, the
moon’s orbit around the Earth. And each culture had a calendar that tried to
match the motions of the cosmos with a predictable and convenient “civilian”
calendar.</p>

<p>Not all cultures had a calendar, and the ones that did used different systems,
so the process of dating events is knotty. Suffice to say that it involves some
guesswork. The best case for understanding the orders of events in the old days
is having what Mills calls (in RFC 1305) “an accurate count of the days relative
to some globally alarming event, such as a comet passage or supernova
explosion.”</p>

<p>And so calendars are social. The civil calendar had to be convenient and fit
into the activities of daily life, and ordering of events depends on some
collective consciousness around global events. I’ve been surprised by how often
we make clocks and calendars fit into daily life and not the other way around.
Several of the most precise modern timescales today are based on what feels
right and looks right, made a bit more precise.</p>

<p>Calendars order our years; clocks order our days. Early religion temporalized
daily life by requiring certain religious acts to be done multiple times a day.
Some of the earliest interesting clock-like devices we have are from monasteries
that rang bells at specific times. (And the word <em>clock</em> is derived from the
French word for bell.) This went on for a few hundred years.</p>

<p>The next advance was periodic timekeepers. Time used to be more organic than it
is today. Hours were not equally sized, and the day was not split equally into
24 parts. But somewhere, at some time, Europeans made an intuitive leap from
continuous time devices like the clepsydra or the procession of different stars
and planets to discrete time – time as ticks<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. In <em>Revolution in Time</em>,
David Landes considers this one of the great methodological leaps in western
civilization. It took other cultures another 500 years to begin using
oscillating, periodic timekeepers.</p>

<p>Clocks have their social uses. Nearly as soon as clocks became convenient and
domestic, <em>punctuality</em> became an important social cue. And as life became more
connected with trade, trains, and radio, the pragmatic importance of clocks only
increased.</p>

<hr />

<p>I installed NTPSec on my Debian machine and left the configuration mostly as-is.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt-get update &amp;&amp; sudo apt-get install ntpsec ntpsec-doc ntpsec-ntpviz
</code></pre></div></div>

<p>I made sure to enable statistics, because I’m really after visualizations. I
want to see the thing do stuff. Visualizations are generated using <code class="language-plaintext highlighter-rouge">ntpviz</code>,
which is scantily documented (this was helpful but ancient:
<a href="https://blog.ntpsec.org/2016/12/19/ntpviz-intro.html">ntpviz-intro</a>), but I
found enough to get me going. Unfortunately, I just set up my daemon, and
there’s no data to visualize. I took the opportunity to do some background work
on the metrics.</p>

<p>Clocks are never perfectly in sync, and the most important contributor to
incorrect timekeeping is a difference in oscillator frequencies. This is called
frequency <em>skew</em>. If the “correct” time is an oscillator at 1000 Hz, my local
computer clock might be more like 1001 Hz or 999 Hz. So even if I set my clock
to the right time, I would gain or lose some seconds every day.</p>

<p>Frequency skew is measured in parts per million, which is to say the number of
periods fast or slow per million oscillations. In the 1000 Hz example, 1001 Hz
would have a skew of 1 part in every thousand, or 1000 parts per million (ppm).
999 Hz has a skew of -1000 ppm.</p>

<p>Skew is also described in other ways. A human-friendly way to describe it is
“seconds gained or lost per day”, or week or year. This gives you the number in
practical terms. It’s a bit tricky to translate between them, though,
considering the gap between oscillation frequency and length of a day.</p>
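<p>A minimal sketch of the conversion, assuming nothing beyond the definitions above:</p>

```python
# Converting between the two descriptions of skew. The ppm figure is a
# dimensionless error rate, so "seconds per day" is just ppm scaled by the
# number of seconds in a day.
SECONDS_PER_DAY = 86_400

def ppm_to_seconds_per_day(ppm: float) -> float:
    return ppm * 1e-6 * SECONDS_PER_DAY

def seconds_per_day_to_ppm(seconds: float) -> float:
    return seconds / SECONDS_PER_DAY * 1e6

# The 1001 Hz example above: 1000 ppm fast is almost a minute and a half a day
print(ppm_to_seconds_per_day(1000))   # 86.4 seconds per day
print(seconds_per_day_to_ppm(1))      # gaining 1 s/day is about 11.6 ppm
```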

<p>Your skew might also vary over time, and this is called <em>drift</em>.</p>

<p>NTP corrects for skew as part of the protocol by nudging the time and doing its
best to predict changes. Skew is affected by the quality of the hardware and the
environment around the oscillator, especially temperature. For ideal
timekeeping, you’ll want to keep your computer in a nice climate-controlled
vault with excellent heat sinking.</p>

<p>The clock offset is the estimated difference between my clock and the reference
clock, measured in milliseconds. I show roughly how this is calculated in <a href="/2026/01/27/ntp-in-30-seconds.html">NTP
in 30 Seconds</a>. In short, it’s
calculated by estimating the latency between you and the server and using that
to guess what time the server received your request. Then you compare your guess
(based on local time + latency) to what the server reported was the “actual”
time it received the request, and use the difference to work out how wrong your
clock is.</p>
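<p>Sketched in code, this is the textbook four-timestamp calculation; the timestamps below are made up for illustration, not from a real exchange:</p>

```python
# The four-timestamp NTP calculation described above:
#   t1 = client send, t2 = server receive, t3 = server send, t4 = client receive
# The offset formula assumes the outbound and return paths take equal time.
def ntp_offset_delay(t1, t2, t3, t4):
    delay = (t4 - t1) - (t3 - t2)          # round trip minus server processing
    offset = ((t2 - t1) + (t3 - t4)) / 2   # how far the client clock is behind
    return offset, delay

# Illustrative numbers: client 5 ms slow, 40 ms symmetric round trip,
# 1 ms of server processing time.
offset, delay = ntp_offset_delay(t1=0.000, t2=0.025, t3=0.026, t4=0.041)
# offset comes out to ~0.005 s (client behind), delay ~0.040 s
```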

<p>Practically speaking, for general monitoring, you can use <code class="language-plaintext highlighter-rouge">ntpmon</code>. This is a
<code class="language-plaintext highlighter-rouge">top</code>-like tool for watching your NTP daemon interact with peers. The output
looks something like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     remote           refid      st t when poll reach   delay   offset   jitter
 0.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
 1.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
 2.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
 3.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
-ip74-208-14-149 192.58.120.8     2 u  598 1024  377  41.7645   0.9319   1.5714
-144.202.66.214. 162.159.200.1    4 u  834 1024  377  45.3224   1.1844   1.2529
*nyc2.us.ntp.li  17.253.2.37      2 u  564 1024  377  10.2534  -0.8732   0.8516
+ntp-62b.lbl.gov 128.3.133.141    2 u  748 1024  377  73.6187  -0.3344   1.0547
+time.cloudflare 10.102.8.4       3 u  212 1024  377   8.4884   0.0523   0.9331
 192-184-140-112 .PHC0.           1 u  66h 1024    0  85.8202   5.3690   0.0000
+ntp.nyc.icanbwe 69.180.17.124    2 u  639 1024  377  11.8765  -0.1314   1.1134

ntpd ntpsec-1.2.2                             Updated: 2026-02-04T08:17:40 (32)

 lstint avgint rstr r m v  count    score   drop rport remote address
      0   1284    0 . 6 2    321    1.217      0 51529 localhost
    212   1054   c0 . 4 4    127    0.050      0   123 time.cloudflare.com
    564   1079   c0 . 4 4    123    0.050      0   123 nyc2.us.ntp.li
    598   1058   c0 . 4 4    126    0.050      0   123 ip74-208-14-149.pbiaas.com
    639   1066   c0 . 4 4    125    0.050      0   123 ntp.nyc.icanbwell.com
    748   1055   c0 . 4 4    126    0.050      0   123 ntp-62b.lbl.gov
    834   1066   c0 . 4 4    125    0.050      0   123 144.202.66.214 (144.202.66.214.vultruser
</code></pre></div></div>

<p>I’ll describe peer metrics in a second. For now, the second table, starting with
<code class="language-plaintext highlighter-rouge">lstint</code>, is the MRU list (MRU=most recently used). Here are the stats it
reports.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">lstint</code> Interval (s) between receipt of most recent packet from this address
and completion of the retrieval of the MRU list by ntpq.</li>
  <li><code class="language-plaintext highlighter-rouge">avgint</code> Average interval (s) between packets from this address.</li>
  <li><code class="language-plaintext highlighter-rouge">rstr</code> Restriction flags.</li>
  <li><code class="language-plaintext highlighter-rouge">r</code> Rate control indicator.</li>
  <li><code class="language-plaintext highlighter-rouge">m</code> Packet mode</li>
  <li><code class="language-plaintext highlighter-rouge">v</code> Packet version number.</li>
  <li><code class="language-plaintext highlighter-rouge">count</code> Packets received</li>
  <li><code class="language-plaintext highlighter-rouge">score</code> Packets per second (averaged with exponential decay)</li>
  <li><code class="language-plaintext highlighter-rouge">drop</code> Packets dropped</li>
  <li><code class="language-plaintext highlighter-rouge">rport</code> Source port of last packet received</li>
  <li><code class="language-plaintext highlighter-rouge">remote address</code> The remote host name</li>
</ul>
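<p>The “score” column is a smoothed packets-per-second rate. The sketch below is a generic exponentially-decayed average, <em>not</em> NTPSec’s actual code; the decay constant is made up for illustration:</p>

```python
# Generic exponentially-decayed packet rate, standing in for the "score"
# column. Alpha here is an invented smoothing constant.
def update_score(score: float, interval_s: float, alpha: float = 0.125) -> float:
    """Fold one new inter-packet interval into the smoothed rate (pkts/sec)."""
    return (1 - alpha) * score + alpha * (1.0 / interval_s)

score = 0.0
for interval in [64, 64, 64, 64]:   # a packet arriving every 64 seconds
    score = update_score(score, interval)
# score converges toward 1/64 = 0.015625 packets per second
```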

<p>There are commands you can use to change the output, like <code class="language-plaintext highlighter-rouge">d</code> for detailed mode.</p>

<p>For a snapshot, you can use <code class="language-plaintext highlighter-rouge">ntpq</code>, a helpful tool for inspecting the daemon. It
has an interactive mode and a one-shot mode. This queries peers in the one-shot
mode.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ntpq --peers --units
     remote                                   refid      st t when poll reach   delay   offset   jitter
=======================================================================================================
 0.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
 1.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
 2.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
 3.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
-ip74-208-14-149.pbiaas.com              192.58.120.8     2 u  671 1024  377 41.765ms 931.92us 1.5714ms
-144.202.66.214.vultrusercontent.com     162.159.200.1    4 u  907 1024  377 45.322ms 1.1844ms 1.2529ms
*nyc2.us.ntp.li                          17.253.2.37      2 u  637 1024  377 10.253ms -873.2us 851.60us
+ntp-62b.lbl.gov                         128.3.133.141    2 u  821 1024  377 73.619ms -334.4us 1.0547ms
+time.cloudflare.com                     10.102.8.4       3 u  285 1024  377 8.4884ms 52.298us 933.07us
 192-184-140-112.fiber.dynamic.sonic.net .PHC0.           1 u  66h 1024    0 85.820ms 5.3690ms      0ns
+ntp.nyc.icanbwell.com                   69.180.17.124    2 u  712 1024  377 11.877ms -131.4us 1.1134ms
</code></pre></div></div>

<p>Here’s how this table is interpreted according to the <code class="language-plaintext highlighter-rouge">ntpmon</code> man page:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">tally</code> (symbol next to remote) One of: <code class="language-plaintext highlighter-rouge">space</code> (not valid); x, ., or - (discarded for various reasons); + (included by the combine algorithm); # (backup); * (system peer); o (PPS peer). Basically, look for the <code class="language-plaintext highlighter-rouge">*</code> and any <code class="language-plaintext highlighter-rouge">+</code> signs to see
who you’re listening to right now.</li>
  <li><code class="language-plaintext highlighter-rouge">remote</code> The host name of the time server.</li>
  <li><code class="language-plaintext highlighter-rouge">refid</code> The RefID identifies the specific upstream time source a server is
using. In other words, it names the reference clock (stratum 0 or 1), even if
this server is just repeating what that reference clock says.</li>
  <li><code class="language-plaintext highlighter-rouge">st</code> NTP stratum</li>
  <li><code class="language-plaintext highlighter-rouge">t</code> Type. u: unicast or manycast client, l: local, s: symmetric (peer), B:
broadcast server.</li>
  <li><code class="language-plaintext highlighter-rouge">when</code> sec/min/hr since last received packet.</li>
  <li><code class="language-plaintext highlighter-rouge">poll</code> Poll interval in log2 seconds</li>
  <li><code class="language-plaintext highlighter-rouge">reach</code> Octal triplet. Represents the last 8 attempts to reach the server.
<code class="language-plaintext highlighter-rouge">377</code> is binary <code class="language-plaintext highlighter-rouge">11111111</code>, which means all 8 attempts reached the server. A
value like <code class="language-plaintext highlighter-rouge">326</code> is binary <code class="language-plaintext highlighter-rouge">11010110</code>, meaning out of the last 8 attempts, the
3rd, 5th, and 8th attempts failed.</li>
  <li><code class="language-plaintext highlighter-rouge">delay</code> Roundtrip delay</li>
  <li><code class="language-plaintext highlighter-rouge">offset</code> Offset of server relative to this host.</li>
  <li><code class="language-plaintext highlighter-rouge">jitter</code> Jitter is random noise relative to the standard timescale.</li>
</ul>
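<p>To make the reach arithmetic concrete, here’s a small shell sketch (mine, not part of the ntpmon tooling) that expands an octal reach value into its 8 poll bits, oldest attempt first:</p>

```shell
#!/bin/sh
# Expand an ntpmon "reach" value (octal) into its 8 bits,
# oldest poll first: 1 = server reached, 0 = poll missed.
reach_bits() {
    n=$((0$1))    # a leading 0 makes the shell read $1 as octal
    bits=""
    for w in 128 64 32 16 8 4 2 1; do
        bits="${bits}$(( n / w % 2 ))"
    done
    echo "$bits"
}

reach_bits 377    # prints 11111111: all 8 polls reached the server
reach_bits 326    # prints 11010110: the 3rd, 5th, and 8th polls missed
```

<p>Reading the bits left to right matches the failed-attempt positions described above.</p>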

<p>For more complete definitions, see <code class="language-plaintext highlighter-rouge">man ntpmon</code>.</p>

<p>Many of these are technical and mostly of interest to those already experienced
with NTP. I’m not, so I’ve focused on a few of the more interesting metrics:
tally, reach, delay, offset, and jitter. These are the same metrics that
<code class="language-plaintext highlighter-rouge">ntpviz</code> reports on.</p>

<hr />

<blockquote>
  <p>There is a law of error that may be stated as follows: small errors do not
matter until large errors are removed. So with the history of time
measurement: each improvement in the performance of clocks and watches posed a
new challenge by bringing to the fore problems that had previously been
relatively small enough to be neglected.</p>

  <p><em>Revolution in Time</em> p. 114</p>
</blockquote>

<p>For a long time, agreement between clocks didn’t matter. Citizens of the US in
the 19th century had timekeepers, but they set them using “apparent solar time”,
or time estimated by when the sun is highest in the sky. This varies across any
distance east or west, so my clock’s noon in eastern Pennsylvania would be noticeably
different from my cousin’s clock in Pittsburgh. Apparent solar time is set by
sundial, and astronomers could keep time better still by looking at the
movements of the planets and stars. (But who had an astronomer in those days?)
Besides the sun, you had church bells and tower clocks. Ye old tower clock in
most cases was set by sundial, and “none too accurately” in the words of the
clock museum. Religious clocks were more of a suggestion of the time.</p>

<p>Coordination wasn’t a moral imperative in the US until the railroads. When
you’re coordinating a few hundred trains in and out of stations, timekeeping
becomes quite important. For most of the 19th century, each railroad company had
its own timekeeping system and standards for accuracy. This created competing
definitions of time, and confusion and accidents followed. In the middle of the
century, there were 144 official time zones in North America alone<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>

<p>The accidents and fatalities motivated the US to move to a new definition of
<em>standard time</em> based on only four main timezones, the same basic ones we use
today.</p>

<p>If you’re like me, you get nervous thinking about the logistics of suddenly
changing the time, but while the topic of changing from “God’s time” to an
official time was controversial, the actual change seems to have gone well.
There was a <a href="https://guides.loc.gov/this-month-in-business-history/november/day-of-two-noons">day of two
noons</a>
on November 18, 1883, and official clocks and watches were set to the correct
time via telegraph. And that was it.</p>

<p><img src="/assets/images/2026-timekeeping/sunday_morning_herald.png" alt="Sunday Herald article from November 18, 1883" /></p>

<p>Source: <a href="https://www.nyshistoricnewspapers.org">https://www.nyshistoricnewspapers.org</a></p>

<hr />

<p>After a day or two, I checked back in on my NTP stats to see what I’d collected.
For my distribution, the data collects in <code class="language-plaintext highlighter-rouge">/var/log/ntpsec/</code>. Running <code class="language-plaintext highlighter-rouge">ntpviz</code>
on this folder will generate an HTML report with all of the default data
visualizations.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ntpviz -d /var/log/ntpsec/
open ntpgraphs/index.html
</code></pre></div></div>

<p>The interesting graph for me is the first one, which plots clock offset (ms,
left axis) and frequency skew (ppm, right axis). My clock is slow, pretty
consistently, by about 7ppm. That is, over 1 million oscillations, my clock will
read 7 periods less than the authority. As long as this is consistent, that’s
ok.</p>
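<p>To put that skew in more familiar units: 1 ppm is 1 microsecond of error per elapsed second, so the drift per day is a one-liner (my arithmetic, not ntpviz output):</p>

```shell
#!/bin/sh
# Microseconds of drift accumulated per day at a given skew in ppm.
# 1 ppm = 1 microsecond per second, and a day has 86,400 seconds.
ppm_drift_per_day() {
    echo "$(( $1 * 86400 ))"
}

ppm_drift_per_day 7    # prints 604800, i.e. about 0.6 s lost per day
```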

<p><img src="/assets/images/2026-timekeeping/local-offset.png" alt="Clock offset and frequency skew" /></p>

<p>At some point on Jan 31, I suddenly found myself 4ms ahead of the reference
clock, and the ensuing correction was a bit too big. But the last day or two has
been very stable.</p>

<p>The next graph shows “RMS time jitter” (RMS=root mean square), or in other words
“how fast the local clock offset is changing.” The tip under the graph says that
0 is ideal, but it doesn’t give me a sense of whether my clock with a 90% range
of 0.528 is any good. It seems spiky.</p>
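<p>For intuition, RMS is just the square root of the mean of the squared samples. A sketch with made-up offsets (these are not my measurements):</p>

```shell
#!/bin/sh
# Root mean square of offset samples, one per line on stdin.
rms() {
    awk '{ s += $1 * $1; n += 1 } END { printf "%.3f\n", sqrt(s / n) }'
}

printf '%s\n' 0.1 -0.3 0.2 | rms    # prints 0.216
```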

<p><img src="/assets/images/2026-timekeeping/local-jitter.png" alt="RMS time jitter" /></p>

<p>And a third graph shows RMS <em>frequency</em> jitter, a similar metric but for my
oscillator’s consistency.</p>

<p><img src="/assets/images/2026-timekeeping/local-stability.png" alt="RMS frequency jitter" /></p>

<p>Skipping down a bit, there’s a fun correlation graph between local temperature
and the frequency offset. My computer apparently measures temperature in two
different places (one consistently warmer than the other). You can see how
sudden changes in temperature correlate closely with changes in the frequency
offset. The spikes are caused by the space heater in my office.</p>

<p><img src="/assets/images/2026-timekeeping/local-freq-temps.png" alt="Correlation between local temperature and frequency offset" /></p>

<p>All of this is still abstract to me. I’ll have to collect more data and try it
on a few different machines until I get a better sense for what’s good and
what’s not.</p>

<hr />

<p>Ok, it’s time time got defined.</p>

<p>To define the time, you need a few things:</p>

<ul>
  <li>An oscillator</li>
  <li>A count of oscillations (the “epoch”)</li>
  <li>An origin</li>
</ul>

<p>An oscillator with a counter is called a <em>clock</em>, and the origin is called the
“frame of reference.” If you consider the Earth’s rotation as an oscillator, then
the count of days is the counter, where a “day” is one complete rotation of the
Earth. The origin can be anything convenient, maybe the oscillation when Halley’s
comet last passed overhead, or a particular spring equinox. A particular clock,
read against its origin, defines a <em>timescale</em>.</p>

<p>Before 1958, the heavenly bodies defined the common timescale. The second was
defined as 1/86,400 of the mean solar day, the average interval between
successive apparent noons at some standard location, like the Royal Observatory
in Greenwich. There are all kinds of quirks with this. First, days are getting
longer, because the Earth’s rotation is slowing down. It’s estimated that
several hundred million years ago, there were only 20 hours in the day. This
slowing is caused by tidal friction.</p>

<p>Second, there are variations in the rotation for other reasons. It’s not a
stable oscillator, and the Earth’s tilt varies over time, which causes other
inconsistencies. It turns out that this timescale is still useful in modern
timekeeping (it’s a component of Greenwich Mean Time), but it’s not an
effective standard for the SI second.</p>

<p>“In 1958, the standard second was redefined as 1/31,556,925.9747 of the tropical
year that began this century,” (RFC 1305). The tropical year is the time the Sun
takes to return to the same position in the sky from some perspective on Earth.
This only lasted until 1967, because it was still not precise enough for modern
needs. The tropical year has an accuracy of only 50 ms and increases by 5ms per
year.</p>

<p>In 1967, the second was redefined using the hyperfine transition of the ground
state of the cesium-133 atom, in particular 1 second = 9,192,631,770 periods of
the corresponding radiation. Since 1972,
“time” has had a foundation of International Atomic Time (TAI), which is defined
using the cesium state transition timescale alone. This is a very important time
standard – it underlies UTC, for example.</p>

<p>TAI is a continuous average count of standard atomic seconds since 1958-01-01
00:00:00 TAI. You might say, “Hey, that’s defined in terms of TAI,” and yeah, I
was wondering about that myself. To understand the origin of TAI, you have to
understand the standard Modified Julian Date (MJD). There’s no space here for
that, but in essence, it’s a more precise version of our intuitive understanding
of a calendar of recent events. Historical dates are vague, but modern dates are
well tracked. In other words, the origin is determined from well-known
astronomical observations.</p>

<p>There are a lot of standards (UT, UT0, UT1, UT2, TAI, GMT, UTC) and I don’t have
the space or knowledge to disambiguate them all. But I want to answer one of the
questions that started this history hunt. What’s the difference between UTC and
GMT?</p>

<p>Coordinated Universal Time and Greenwich Mean Time. The former is a variant of
TAI that occasionally inserts leap seconds in order to stay in step with GMT.
GMT is mean solar time (also known as local mean time) at the Royal Observatory
in Greenwich, London.</p>

<p>UTC stays in step with GMT through leap seconds, which are inserted/deleted when
the difference between GMT and UTC approaches 0.9 seconds. These leap seconds
make UTC a non-continuous timescale. TAI on the other hand is continuous –
there are no leap seconds (UTC = TAI - leap seconds). TAI will continue to drift
out of sync with our intuition for “the time” based on the orbital oscillations,
but UTC, like so much of timekeeping, is social. Great pains have been taken to
make it precise but intuitive.</p>
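<p>The conversion itself is trivial; the only state you need is the cumulative leap-second count, which is 37 as of this writing (it changes whenever a new leap second is announced):</p>

```shell
#!/bin/sh
# UTC = TAI minus the cumulative leap-second offset.
# 37 is the offset in effect since the end of 2016.
LEAP_SECONDS=37
utc_from_tai() {
    echo "$(( $1 - LEAP_SECONDS ))"
}

utc_from_tai 1000000037    # prints 1000000000
```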

<p>And I can also answer the question I mentioned above, <em>What would happen if the
atomic clocks on Earth stopped for 10 minutes?</em><sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> When I posed the question, I
imagined a server with a red segmented display keeping <em>the time</em> somewhere in a
vault. But now I know that the cesium atom transitions define the second, not
the time. The agreed-upon time is just that, <em>agreed upon</em>. If the frequency
reference stops working, the time servers of the world would no longer receive
a signal telling them if they’re on the standard or not. They might drift by
picoseconds in 10 minutes, but not enough to cause catastrophe. And the
frequency standard can be recovered by restarting the cesium atom state
transitions – the clocks of the world would come once again to agree.</p>

<p>The point of all of this is that “the time” is not much more complicated than
“whatever we say it is.” As Poincaré said in the quote above, the true time is
the one that’s most convenient.</p>

<hr />

<p>My NTP server has been keeping time for me for a week now while I researched
this piece. All week I’ve been hunting for this idea of “the actual time”
separate from how I intuitively understood it.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>

<p>But even though we took one step away from natural time by embracing averages
over daily observations, we’ve taken a step back towards nature with UTC, which
has been jumping through hoops (or at least seconds) to keep civil time
convenient.</p>

<p>In the end, we all have the time. “The time” is a social construct, an event of
the collective conscious, the one thing we can agree on because it was our
agreement that defined it in the first place.</p>

<h1 id="sources">Sources</h1>

<ul>
  <li>https://linux.die.net/sag/hw-sw-clocks.html</li>
  <li>https://wiki.debian.org/DateTime</li>
  <li>https://ntpsec.org</li>
  <li>https://wiki.archlinux.org/title/Systemd-timesyncd</li>
  <li>The National Watch and Clock Museum in Columbia, PA</li>
  <li><em>Revolution in Time</em>, David S. Landes, 1e</li>
  <li><em>What is Time?</em>, G. J. Whitrow</li>
  <li><em>The Design and Implementation of the FreeBSD Operating System</em>, 2e</li>
  <li>RFC 1305, especially Appendix E: The NTP Timescale and its Chronometry.</li>
  <li>https://www.nist.gov/pml/time-and-frequency-division/popular-links/walk-through-time/walk-through-time-atomic-age-time</li>
  <li>https://www.nist.gov/pml/time-and-frequency-division/popular-links/walk-through-time/walk-through-time-world-time-scales</li>
  <li>https://guides.loc.gov/this-month-in-business-history/november/day-of-two-noons</li>
  <li>https://www.nyshistoricnewspapers.org</li>
</ul>

<h1 id="other-notes">Other notes</h1>
<p>There’s much more to say about this topic, much more research I couldn’t include
here. A sampling of other interesting topics:</p>

<ul>
  <li>How time is kept in distributed systems and Lamport’s article on clocks</li>
  <li>Special relativity and the meaning and relativity of the simultaneity of
events</li>
  <li>Why your garden sundial doesn’t work (and how to fix it)</li>
  <li>Scams and scandals of US timekeeping authorities, who made a killing off of
giving preferential treatment to some watchmakers and not others</li>
  <li>Daylight savings time and the madness of crowds</li>
  <li>This whole <a href="https://www.youtube.com/watch?v=-5wpm-gesOY">Tom Scott video</a> and
how computers deal with calendars</li>
</ul>

<p>Maybe some other day.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Escapements convert potential energy like a falling weight suspended by a rope into periodic motion, such as the ticking of a hand. There’s no better visualization of the development of the clock than Bartosz Ciechanowski’s <a href="https://ciechanow.ski/mechanical-watch/">mechanical watch</a>. And while these escapements were crude in the beginning, it took only a few breakthroughs until they were able to tell time within a few seconds per day. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>How do you get 144 time zones? Any move along a line of latitude (i.e. east or west) causes the sun’s apparent apex to move. When the sun is highest in the sky in Pennsylvania, it’s still rising in Colorado. Your sundial would in that case create infinite time zones for each variation in longitude. The railroads “solved” this by using a standard time for each major city they stopped in. You got a sort of “average solar time” for this stretch of railroad. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>In fact, an <a href="https://www.npr.org/2025/12/21/nx-s1-5651317/colorado-us-official-time-microseconds-nist-clocks">NIST atomic clock did recently stop</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>I haven’t struggled alone. Sundials used to come with an <em>equation of time</em> guide that translated the apparent solar time to the current mechanical (“mean”) time, so the purchaser could know with confidence what the “actual” time is. In the words of David Landes, “instead of setting by the sun, people corrected the sun.” Though I should mention that there were also <em>equation clocks</em>, which used complicated mechanisms to convert from mean time to apparent solar time. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>An automated checklist for computer setup</title>
      <link>https://charlie-gallagher.github.io/2026/01/29/computer-setup-scripts.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/01/29/computer-setup-scripts.html</guid>
      <pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>A little while ago my laptop died once, and then twice, and in between each
failure I had to use a spare laptop. I ended up setting up my computer fully for
work 3 times in the space of a couple weeks. I decided to automate the boring
stuff and make a checklist program and a set of companion scripts for doing
things like pulling all of my repos and installing my most commonly used
libraries and programs through Homebrew.</p>

<p>I already use <code class="language-plaintext highlighter-rouge">stow</code> to <a href="https://brandon.invergo.net/news/2012-05-26-using-gnu-stow-to-manage-your-dotfiles.html?round=two">manage my dotfiles</a>
so a lot of my configuration is taken care of with a few quick stows. But what I
really needed was an automated checklist that told me how close my configuration
was to the “target” laptop. It’s a cheap declarative setup, like a simpler
terraform or nix.</p>

<p>Here’s an example checklist (with no color):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ ./laptop_checklist.sh

Setting up git ──────────────────────────────────────────────────────────
• Setup git config (git-config.sh)
✓ Pull this code from GitHub (git clone https://github.com/my-user/dotfiles.git)
✓ (file) /Users/user/.gitignore
✓ Copy global gitignore to home directory
× (directory) /Users/user/projects/project_1/
× (directory) /Users/user/projects/project_2/
× (directory) /Users/user/projects/project_3/
× Clone important repositories (clone-repos.sh)
✓ (environment variable) GITLAB_PAT
✓ (environment variable) GITHUB_PAT
✓ Export GitHub and GitLab tokens


Setting up terminal ─────────────────────────────────────────────────────
✓ (application) iTerm.app
✓ Install iTerm2 (https://iterm2.com/downloads.html)
• Import Iterm profile into Iterm
• Install frequently used applications (install-common-libraries.sh)
✓ Install OhMyZsh (install-omzsh.sh)
✓ Install Starship prompt (install-starship.sh)
✓ (file) /Users/user/.zshrc
✓ Copy .zshrc into home directory
✓ (file) /Users/user/.vimrc
✓ Install vim (install-vim.sh)


Install desktop applications ────────────────────────────────────────────
• Install VS Code (https://code.visualstudio.com/download)
✓ (file) /Users/user/Library/Application Support/Code/User/settings.json
✓ Run install-code.sh
✓ (application) Google Chrome.app
✓ (application) DBeaver.app
✓ (application) AWS VPN Client
✓ (application) Docker.app
✓ (application) logioptionsplus.app
✓ (application) Postman.app
✓ (application) Utilities/XQuartz.app
✓ (application) Visual Studio Code.app
• Install Google Chrome (https://www.google.com/chrome)
• Install docker and sign in to GitLab registry (https://www.docker.com/products/docker-desktop/)
• Install postman and sign in using lastpass (https://www.postman.com/downloads/)
• Install AWS VPN and set up with ixis-vpn-client-config.ovpn (https://ixisdigital.atlassian.net/l/cp/ctZeA341)
• Install DBeaver (https://dbeaver.io/download/)
• Install omnibug (https://chrome.google.com/webstore/detail/omnibug/bknpehncffejahipecakbfkomebjmokl)
• Install Logitech Options+ (https://www.logitech.com/en-us/software/logi-options-plus.html)
• Install MS Teams
• Install XQuartz (https://www.xquartz.org/)
• Install Talon Voice (https://talonvoice.com/)
• Install Visual Studio Code (https://code.visualstudio.com/download)
✓ Install common applications


Set up AWS credentials ──────────────────────────────────────────────────
✓ (directory) /Users/user/.aws
✓ Initialize .aws folder (install-aws.sh)
✓ (exec) ssocred
✓ (exec) aws
✓ Install AWS tools (install-aws.sh)
✓ (file) /Users/user/.aws_functions.zsh
✓ AWS authentication functions exist
</code></pre></div></div>

<p>The checklist tells me how close I am to the target, but it won’t take any
action on its own. I have other scripts for that. But the thing about the other
scripts is that they:</p>

<ul>
  <li>Have dependencies. There’s a correct order, and I can’t automate that as well
because some steps require human action (like generating a new GitHub PAT).</li>
  <li>Are finicky. Shell scripting is not always a smooth experience, especially
when you might have to jump between shells (if zsh isn’t installed). Not to
mention that entropy affects setup scripts the same as it affects roads and
buildings. Links die, bits rot.</li>
  <li>Don’t give you a high-level view of the current system state.</li>
</ul>

<p>A real declarative system would compare the current state to the desired state
and then take steps to bring the computer into the desired state. Declarative
systems are hard to get right – you have to handle all possible current states
and define how to get to the target state. When I’m running these scripts, I’m
just trying to remember the steps for getting from zero back to a working setup. This is
just a bunch of setup scripts, and I’m fine with a little “meat in the loop.”</p>

<p>All this is to say, I have the following setup:</p>

<ul>
  <li>A bunch of separate setup scripts</li>
  <li>An idea of what I want the final system to look like</li>
</ul>

<h2 id="the-easy-stuff">The easy stuff</h2>
<p>CLI programs and libraries are easy. First, you can usually get everything you
need through Homebrew or your package manager of choice. Second, it’s easy to
test whether they’re installed.</p>

<p>The shell environment is similarly easy to set up and check for. Environment
variables, dotfiles, these are all well-defined environment features that you
can check for.</p>

<h2 id="getting-trickier">Getting trickier</h2>
<p>GitHub PATs are essentially environment variables, but you can’t
programmatically generate them. They’re in the category of “easy to check, manual
to fix”. The other main entrants in this category are applications like VS Code,
Chrome, XQuartz, and so on. These can be checked for in the few places that
MacOS stores applications.</p>

<h2 id="reminders">Reminders</h2>
<p>Some things can neither be tested for nor installed automatically (without
significant effort). For these, I have the idea of a “reminder” in the
checklist that basically says “do this or else”; the script can’t know whether
it’s actually been done.</p>

<p>Examples are usually within applications, like signing into Chrome, importing
settings into VS Code and DBeaver, and configuring my MX Ergo mouse in Logitech
Options.</p>

<h2 id="putting-it-together">Putting it together</h2>
<h3 id="design">Design</h3>

<ul>
  <li>A single script</li>
  <li>No config file, everything done in the script</li>
  <li>Easy to add/drop</li>
  <li>Easy to define sections</li>
  <li>Non-blocking. I want to see the whole status at once.</li>
</ul>

<h3 id="implementation">Implementation</h3>
<p>The program is composed of checkers and checklist items. The checkers are
functions that take some standard input (like the name of an environment
variable) and check whether it exists, returning 0 (success) or 1 (fail).</p>

<p>Sections have a main status for the larger abstract concept (“Set up git”) and
sub-statuses for each checklist item (“install git”, “GH PAT”, etc.). This is a
section that checks for personal access tokens.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Setting up git ──────────────────────────────────────────────────────────"</span>
<span class="nv">git_pat_status</span><span class="o">=</span>PASS
<span class="k">if</span> <span class="o">!</span> check_env GITLAB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
if</span> <span class="o">!</span> check_env GITHUB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
</span>status <span class="nv">$git_pat_status</span> <span class="s2">"Export GitHub and GitLab tokens"</span>
</code></pre></div></div>

<p>The section passes unless any of its children fail, in which case the whole
section fails. Here, I forgot to add a check for the <code class="language-plaintext highlighter-rouge">git</code> binary, so let’s add
it.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Setting up git ──────────────────────────────────────────────────────────"</span>
<span class="nv">git_pat_status</span><span class="o">=</span>PASS
<span class="k">if</span> <span class="o">!</span> check_exec git<span class="p">;</span> <span class="k">then
    </span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
if</span> <span class="o">!</span> check_env GITLAB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
if</span> <span class="o">!</span> check_env GITHUB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
</span>status <span class="nv">$git_pat_status</span> <span class="s2">"Export GitHub and GitLab tokens"</span>
</code></pre></div></div>

<p>To break it down:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">check_exec</code> looks for a program using <code class="language-plaintext highlighter-rouge">which</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">check_env</code> looks for an environment variable that is defined and not an empty
string. All of the <code class="language-plaintext highlighter-rouge">check_*</code> functions print their result to the console in
addition to calculating the success/failure.</li>
  <li><code class="language-plaintext highlighter-rouge">status</code> prints a summary message, indicating pass or fail.</li>
</ul>

<p>It’s straightforward to write a new checker function. The checker should
evaluate the status of the thing to be checked, write a message to the console,
and then return a status to the user. Here’s the definition of <code class="language-plaintext highlighter-rouge">check_env</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>check_env<span class="o">()</span> <span class="o">{</span>
	<span class="nv">prefix</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">dim</span><span class="k">}</span><span class="s2">(environment variable) </span><span class="k">${</span><span class="nv">normal</span><span class="k">}</span><span class="s2">"</span>
	<span class="nv">env_var_value</span><span class="o">=</span><span class="k">${</span><span class="p">!1</span><span class="k">}</span>
	<span class="k">if</span> <span class="o">[[</span> <span class="o">!</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$env_var_value</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
		</span>status PASS <span class="s2">"</span><span class="k">${</span><span class="nv">prefix</span><span class="k">}</span><span class="nv">$1</span><span class="s2">"</span>
		<span class="k">return </span>0
	<span class="k">else
		</span>status FAIL <span class="s2">"</span><span class="k">${</span><span class="nv">prefix</span><span class="k">}</span><span class="nv">$1</span><span class="s2">"</span>
		<span class="k">return </span>1
	<span class="k">fi</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The checkers also use the <code class="language-plaintext highlighter-rouge">status</code> function to print messages to the user.</p>
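<p><code class="language-plaintext highlighter-rouge">check_exec</code> itself isn’t shown above, but from the description it’s easy to reconstruct. A minimal sketch (my reconstruction with a stand-in <code class="language-plaintext highlighter-rouge">status</code>, not the script’s actual code):</p>

```shell
#!/bin/sh
# Reconstruction of a check_exec-style checker, not the original script.

# Stand-in for the script's status printer (the real one also handles color).
status() {
    printf '%s %s\n' "$1" "$2"
}

# Pass if the named program is on PATH, fail otherwise.
check_exec() {
    if which "$1" > /dev/null 2>&1; then
        status PASS "(exec) $1"
    else
        status FAIL "(exec) $1"
        return 1
    fi
}

check_exec sh    # prints: PASS (exec) sh
```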

<p><strong>Reminders</strong> are not done by <code class="language-plaintext highlighter-rouge">check_*</code> functions. Instead, they send straight
to <code class="language-plaintext highlighter-rouge">status</code> with the value <code class="language-plaintext highlighter-rouge">TODO</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>status TODO <span class="s2">"Import Iterm profile into Iterm"</span>
</code></pre></div></div>

<p>And that’s pretty much it. I wish I saw more general purpose frameworks for this
kind of thing, since I find it really helpful for remembering the pesky details
of setting up a new computer. If you know of any, let me know!</p>

<h2 id="the-gist">The Gist</h2>

<script src="https://gist.github.com/charlie-gallagher/d1544d336bc11fb1d64d5b9a227fbf34.js"></script>


        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>A little while ago my laptop died once, and then twice, and in between each
failure I had to use a spare laptop. I ended up setting up my computer fully for
work 3 times in the space of a couple weeks. I decided to automate the boring
stuff and make a checklist program and a set of companion scripts for doing
things like pulling all of my repos and installing my most commonly used
libraries and programs through Homebrew.</p>

<p>I already use <code class="language-plaintext highlighter-rouge">stow</code> to <a href="https://brandon.invergo.net/news/2012-05-26-using-gnu-stow-to-manage-your-dotfiles.html?round=two">manage my dotfiles</a>
so a lot of my configuration is taken care of with a few quick stows. But what I
really needed was an automated checklist that told me how close my configuration
was to the “target” laptop. It’s a cheap declarative setup, like a simpler
terraform or nix.</p>

<p>Here’s an example checklist (with no color):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ ./laptop_checklist.sh

Setting up git ──────────────────────────────────────────────────────────
• Setup git config (git-config.sh)
✓ Pull this code from GitHub (git clone https://github.com/my-user/dotfiles.git)
✓ (file) /Users/user/.gitignore
✓ Copy global gitignore to home directory
× (directory) /Users/user/projects/project_1/
× (directory) /Users/user/projects/project_2/
× (directory) /Users/user/projects/project_3/
× Clone important repositories (clone-repos.sh)
✓ (environment variable) GITLAB_PAT
✓ (environment variable) GITHUB_PAT
✓ Export GitHub and GitLab tokens


Setting up terminal ─────────────────────────────────────────────────────
✓ (application) iTerm.app
✓ Install iTerm2 (https://iterm2.com/downloads.html)
• Import Iterm profile into Iterm
• Install frequently used applications (install-common-libraries.sh)
✓ Install OhMyZsh (install-omzsh.sh)
✓ Install Starship prompt (install-starship.sh)
✓ (file) /Users/user/.zshrc
✓ Copy .zshrc into home directory
✓ (file) /Users/user/.vimrc
✓ Install vim (install-vim.sh)


Install desktop applications ────────────────────────────────────────────
• Install VS Code (https://code.visualstudio.com/download)
✓ (file) /Users/user/Library/Application Support/Code/User/settings.json
✓ Run install-code.sh
✓ (application) Google Chrome.app
✓ (application) DBeaver.app
✓ (application) AWS VPN Client
✓ (application) Docker.app
✓ (application) logioptionsplus.app
✓ (application) Postman.app
✓ (application) Utilities/XQuartz.app
✓ (application) Visual Studio Code.app
• Install Google Chrome (https://www.google.com/chrome)
• Install docker and sign in to GitLab registry (https://www.docker.com/products/docker-desktop/)
• Install postman and sign in using lastpass (https://www.postman.com/downloads/)
• Install AWS VPN and set up with ixis-vpn-client-config.ovpn (https://ixisdigital.atlassian.net/l/cp/ctZeA341)
• Install DBeaver (https://dbeaver.io/download/)
• Install omnibug (https://chrome.google.com/webstore/detail/omnibug/bknpehncffejahipecakbfkomebjmokl)
• Install Logitech Options+ (https://www.logitech.com/en-us/software/logi-options-plus.html)
• Install MS Teams
• Install XQuartz (https://www.xquartz.org/)
• Install Talon Voice (https://talonvoice.com/)
• Install Visual Studio Code (https://code.visualstudio.com/download)
✓ Install common applications


Set up AWS credentials ──────────────────────────────────────────────────
✓ (directory) /Users/user/.aws
✓ Initialize .aws folder (install-aws.sh)
✓ (exec) ssocred
✓ (exec) aws
✓ Install AWS tools (install-aws.sh)
✓ (file) /Users/user/.aws_functions.zsh
✓ AWS authentication functions exist
</code></pre></div></div>

<p>The checklist tells me how close I am to the target, but it won’t take any
action on its own. I have other scripts for that. But the thing about the other
scripts is that they:</p>

<ul>
  <li>Have dependencies. There’s a correct order, and I can’t fully automate it
because some steps require human action (like generating a new GitHub PAT).</li>
  <li>Are finicky. Shell scripting is not always a smooth experience, especially
when you might have to jump between shells (if zsh isn’t installed). Not to
mention that entropy affects setup scripts the same as it affects roads and
buildings. Links die, bits rot.</li>
  <li>Don’t give you a high-level view of the current system state.</li>
</ul>

<p>A real declarative system would compare the current state to the desired state
and then take steps to bring the computer into the desired state. Declarative
systems are hard to get right – you have to handle all possible current states
and define how to get to the target state. When I’m running these scripts, I’m
just trying to remember the steps for getting from zero back to a working
machine. This is just a bunch of setup scripts, and I’m fine with a little
“meat in the loop.”</p>

<p>All this is to say, I have the following setup:</p>

<ul>
  <li>A bunch of separate setup scripts</li>
  <li>An idea of what I want the final system to look like</li>
</ul>

<h2 id="the-easy-stuff">The easy stuff</h2>
<p>CLI programs and libraries are easy. First, you can usually get everything you
need through Homebrew or your package manager of choice. Second, it’s easy to
test whether they’re installed.</p>

<p>The shell environment is similarly easy to set up and check for. Environment
variables, dotfiles, these are all well-defined environment features that you
can check for.</p>

<h2 id="getting-trickier">Getting trickier</h2>
<p>GitHub PATs are essentially environment variables, but you can’t
programmatically generate them. They’re in the category of “easy to check,
manual to fix”. The other main entrants in this category are applications like
VS Code, Chrome, XQuartz, and so on. These can be checked for in the few places
that macOS stores applications.</p>

<h2 id="reminders">Reminders</h2>
<p>Some things can neither be tested for nor installed automatically (without
significant effort). For these, I have the idea of a “reminder” in the
checklist that basically says “do this or else”. The script can’t verify a
reminder; it just shows it every time.</p>

<p>Examples are usually within applications: signing into Chrome, importing
settings into VS Code and DBeaver, and configuring my MX Ergo mouse in Logitech
Options+.</p>

<h2 id="putting-it-together">Putting it together</h2>
<h3 id="design">Design</h3>

<ul>
  <li>A single script</li>
  <li>No config file, everything done in the script</li>
  <li>Easy to add/drop</li>
  <li>Easy to define sections</li>
  <li>Non-blocking. I want to see the whole status at once.</li>
</ul>

<h3 id="implementation">Implementation</h3>
<p>The program is composed of checkers and checklist items. The checkers are
functions that take some standard input (like the name of an environment
variable) and check whether it exists, returning 0 (success) or 1 (fail).</p>

<p>Sections have a main status for the larger abstract concept (“Set up git”) and
sub-statuses for each checklist item (“install git”, “GH PAT”, etc.). This is a
section that checks for personal access tokens.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Setting up git ──────────────────────────────────────────────────────────"</span>
<span class="nv">git_pat_status</span><span class="o">=</span>PASS
<span class="k">if</span> <span class="o">!</span> check_env GITLAB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
if</span> <span class="o">!</span> check_env GITHUB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
</span>status <span class="nv">$git_pat_status</span> <span class="s2">"Export GitHub and GitLab tokens"</span>
</code></pre></div></div>

<p>A section passes only if every one of its checks passes; if any child fails,
the whole section fails. Here, I forgot to add a check for the <code class="language-plaintext highlighter-rouge">git</code> binary, so let’s add
it.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Setting up git ──────────────────────────────────────────────────────────"</span>
<span class="nv">git_pat_status</span><span class="o">=</span>PASS
<span class="k">if</span> <span class="o">!</span> check_exec git<span class="p">;</span> <span class="k">then
    </span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
if</span> <span class="o">!</span> check_env GITLAB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
if</span> <span class="o">!</span> check_env GITHUB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
</span>status <span class="nv">$git_pat_status</span> <span class="s2">"Export GitHub and GitLab tokens"</span>
</code></pre></div></div>

<p>To break it down:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">check_exec</code> looks for a program using <code class="language-plaintext highlighter-rouge">which</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">check_env</code> looks for an environment variable that is defined and not an empty
string. All of the <code class="language-plaintext highlighter-rouge">check_*</code> functions print their result to the console in
addition to calculating the success/failure.</li>
  <li><code class="language-plaintext highlighter-rouge">status</code> prints a summary message, indicating pass or fail.</li>
</ul>

<p>It’s straightforward to write a new checker function. A checker should
evaluate the status of the thing being checked, write a message to the console,
and then return an exit status. Here’s the definition of <code class="language-plaintext highlighter-rouge">check_env</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>check_env<span class="o">()</span> <span class="o">{</span>
	<span class="nv">prefix</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">dim</span><span class="k">}</span><span class="s2">(environment variable) </span><span class="k">${</span><span class="nv">normal</span><span class="k">}</span><span class="s2">"</span>
	<span class="nv">env_var_value</span><span class="o">=</span><span class="k">${</span><span class="p">!1</span><span class="k">}</span>
	<span class="k">if</span> <span class="o">[[</span> <span class="o">!</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$env_var_value</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
		</span>status PASS <span class="s2">"</span><span class="k">${</span><span class="nv">prefix</span><span class="k">}</span><span class="nv">$1</span><span class="s2">"</span>
		<span class="k">return </span>0
	<span class="k">else
		</span>status FAIL <span class="s2">"</span><span class="k">${</span><span class="nv">prefix</span><span class="k">}</span><span class="nv">$1</span><span class="s2">"</span>
		<span class="k">return </span>1
	<span class="k">fi</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The checkers also use the <code class="language-plaintext highlighter-rouge">status</code> function to print messages to the user.</p>
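<p>Neither <code class="language-plaintext highlighter-rouge">check_exec</code> nor <code class="language-plaintext highlighter-rouge">status</code> appears above, so here’s a minimal sketch of what
they might look like, following the same conventions as <code class="language-plaintext highlighter-rouge">check_env</code> (colors omitted;
these bodies are my guess at the shape, not the originals):</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch only: print a one-line result for a checklist item
status() {
	case "$1" in
		PASS) printf '✓ %s\n' "$2" ;;
		FAIL) printf '× %s\n' "$2" ;;
		TODO) printf '• %s\n' "$2" ;;
	esac
}

# Sketch only: look for a program on the PATH using `which`
check_exec() {
	prefix="(exec) "
	if which "$1" >/dev/null; then
		status PASS "${prefix}$1"
		return 0
	else
		status FAIL "${prefix}$1"
		return 1
	fi
}
</code></pre></div></div>

<p>With these in place, <code class="language-plaintext highlighter-rouge">check_exec git</code> prints a ✓ or × line and returns 0 or 1,
which is exactly what the section-level <code class="language-plaintext highlighter-rouge">if ! check_exec git</code> test relies on.</p>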

<p><strong>Reminders</strong> aren’t handled by <code class="language-plaintext highlighter-rouge">check_*</code> functions. Instead, they go straight
to <code class="language-plaintext highlighter-rouge">status</code> with the value <code class="language-plaintext highlighter-rouge">TODO</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>status TODO <span class="s2">"Import Iterm profile into Iterm"</span>
</code></pre></div></div>

<p>And that’s pretty much it. I wish I saw more general-purpose frameworks for this
kind of thing, since I find it really helpful for remembering the pesky details
of setting up a new computer. If you know of any, let me know!</p>

<h2 id="the-gist">The Gist</h2>

<script src="https://gist.github.com/charlie-gallagher/d1544d336bc11fb1d64d5b9a227fbf34.js"></script>


        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>A plain text to-do list</title>
      <link>https://charlie-gallagher.github.io/2026/01/28/to-do-list.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/01/28/to-do-list.html</guid>
      <pubDate>Wed, 28 Jan 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>I’ve tried lots of to-do lists, from ToDoist to fine-grained Jira tickets and
all things in between. I still felt disorganized. But I found<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Jeff Huang’s
<a href="https://jeffhuang.com/productivity_text_file/">My productivity app is a never-ending .txt file</a>
and decided to give it a try, with some tweaks.</p>

<p>I spend a lot of time in the terminal, so now I keep an iTerm tab open with my
daily to-do list. It’s a dumping ground of to-dos (both work and personal),
meeting notes, random text I’m copying from one place to another. It’s faster
than handwriting and easier to copy tasks from one day to another. It’s easy to
stack rank things. It’s an unreasonably effective to-do list tool.</p>

<p>Here’s an example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2025-12-16
----------
- [x] Announce next week's release delay
- [x] Review X's MR hotfix for dsmodels
    - https://gitlab.com/etc/123
- [x] Review hit segments table MR
    - https://gitlab.com/etc/697
- [x] Help X with evars 216, 217, and 221 (all null, "Merchandising Evars")
    - https://atlassian.net/wiki/x/EYAT4g123
- [x] Plan next sprint
    - [x] What PTO do folks have? How much actual capacity?
    - [x] What priorities for Client A? Client B? Internal eng efforts?
- [x] X's comment under TICK-14222
- [ ] Update Utils Logger Documentation with new lambda information
- [ ] Heads down on Snowflake models
- [ ] Send snippet on how to filter with `(year, month, day)` tuple.
- [ ] 
- [ ] 
- [ ] Lunch with Bre
- [ ] 
- [ ] Start new project?
- [ ] Write blog post about to-do lists
- [ ] What's going on with management?

Coworker
-------

- window, not supported
- evar64 as alternative to mcvisid for join, `external_leadid`

Demo tenant
-----------

- What are the core integrations that make Service Pro worthwhile
- Add CRM and DMS, completely spoofed
- What do we demo?
- We need these 4 data systems
</code></pre></div></div>

<p>I keep to-dos roughly separated into work, personal, and “for consideration”,
which are tasks that I expect to carry over a few days in a row until I decide
what to do with them. At the bottom, I throw in everything and anything related
to the to-dos of the day. I especially like this section for dumping in notes.
If I’m in the middle of something complicated, at the end of the day I’ll write
down as much as I can think of in shorthand as context for what state things are
in.  What’s tested, what still needs to be connected to something else, why I
stopped working on feature X temporarily.</p>

<p>To use Jeff Huang’s phrase, this is both a to-do list and a got-done list<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>,
and I like that I can see everything I’ve completed with a quick grep:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Top-level completed tasks
cat *.md | sort | uniq | grep '^- \[x\]' | wc -l
161

# Including subtasks
cat *.md | sort | uniq | grep '- \[x\]' | wc -l
278
</code></pre></div></div>

<p>Occasionally I’ll plan days in the future – why not? Make a text file for a
week from today and add a couple to-dos. Later, when I create tomorrow’s file,
I’ll find it already exists and has some to-dos. Excellent.</p>
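<p>Creating (or reopening) a dated file like that is a one-liner. A sketch,
assuming GNU <code class="language-plaintext highlighter-rouge">date</code> (on macOS, the equivalent flag is <code class="language-plaintext highlighter-rouge">-v+1d</code>):</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Tomorrow's date in YYYY-MM-DD form (GNU date; use -v+1d on macOS)
tomorrow=$(date -d tomorrow +%Y-%m-%d)
touch "${tomorrow}.md"   # no-op if the file already exists
</code></pre></div></div>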

<p>A few things make this work for me:</p>

<ul>
  <li><strong>Vim.</strong> Makes it very fast to rearrange, reorder, and re-indent.</li>
  <li><strong>Terminal tabs.</strong> My to-do list is always in the first tab of iTerm, so it’s
just a <code class="language-plaintext highlighter-rouge">⌘1</code> away any time I’m in the terminal.</li>
  <li><strong>Minimum viable discipline.</strong> 5 minutes in the morning and evening to
organize the list.</li>
</ul>

<p>It’s hard to overstate just what a good effect this has had on my work. It’s a
2¢ piece of tech that just works.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I found this through the really excellent <a href="https://registerspill.thorstenball.com">Joy &amp; Curiosity</a> series in Register Spill by Thorsten Ball. I can’t recommend Thorsten’s work enough – I read it every Sunday. His utter appreciation for good writing is infectious (and one of the reasons I started this blog). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>There’s some value in these things as historical record, but I don’t plan on preserving them. I keep my files on my local computer and am waiting for the day my computer dies. But Jeff Huang used his .txt file as research for his <a href="https://jeffhuang.com/struggle_for_each_paper/">Behind the Scenes: the struggle for each paper</a> post, which I have to say is cool. It reminds me of Stephen Wolfram’s unbelievable <a href="https://writings.stephenwolfram.com/2019/02/seeking-the-productive-life-some-details-of-my-personal-infrastructure/">Seeking the productive life</a> where, among many other things, he visualizes the emails he’s sent and received over time. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>I’ve tried lots of to-do lists, from ToDoist to fine-grained Jira tickets and
all things in between. I still felt disorganized. But I found<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Jeff Huang’s
<a href="https://jeffhuang.com/productivity_text_file/">My productivity app is a never-ending .txt file</a>
and decided to give it a try, with some tweaks.</p>

<p>I spend a lot of time in the terminal, so now I keep an iTerm tab open with my
daily to-do list. It’s a dumping ground of to-dos (both work and personal),
meeting notes, random text I’m copying from one place to another. It’s faster
than handwriting and easier to copy tasks from one day to another. It’s easy to
stack rank things. It’s an unreasonably effective to-do list tool.</p>

<p>Here’s an example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2025-12-16
----------
- [x] Announce next week's release delay
- [x] Review X's MR hotfix for dsmodels
    - https://gitlab.com/etc/123
- [x] Review hit segments table MR
    - https://gitlab.com/etc/697
- [x] Help X with evars 216, 217, and 221 (all null, "Merchandising Evars")
    - https://atlassian.net/wiki/x/EYAT4g123
- [x] Plan next sprint
    - [x] What PTO do folks have? How much actual capacity?
    - [x] What priorities for Client A? Client B? Internal eng efforts?
- [x] X's comment under TICK-14222
- [ ] Update Utils Logger Documentation with new lambda information
- [ ] Heads down on Snowflake models
- [ ] Send snippet on how to filter with `(year, month, day)` tuple.
- [ ] 
- [ ] 
- [ ] Lunch with Bre
- [ ] 
- [ ] Start new project?
- [ ] Write blog post about to-do lists
- [ ] What's going on with management?

Coworker
-------

- window, not supported
- evar64 as alternative to mcvisid for join, `external_leadid`

Demo tenant
-----------

- What are the core integrations that make Service Pro worthwhile
- Add CRM and DMS, completely spoofed
- What do we demo?
- We need these 4 data systems
</code></pre></div></div>

<p>I keep to-dos roughly separated into work, personal, and “for consideration”,
which are tasks that I expect to carry over a few days in a row until I decide
what to do with them. At the bottom, I throw in everything and anything related
to the to-dos of the day. I especially like this section for dumping in notes.
If I’m in the middle of something complicated, at the end of the day I’ll write
down as much as I can think of in shorthand as context for what state things are
in.  What’s tested, what still needs to be connected to something else, why I
stopped working on feature X temporarily.</p>

<p>To use Jeff Huang’s phrase, this is both a to-do list and a got-done list<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>,
and I like that I can see everything I’ve completed with a quick grep:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Top-level completed tasks
cat *.md | sort | uniq | grep '^- \[x\]' | wc -l
161

# Including subtasks
cat *.md | sort | uniq | grep '- \[x\]' | wc -l
278
</code></pre></div></div>

<p>Occasionally I’ll plan days in the future – why not? Make a text file for a
week from today and add a couple to-dos. Later, when I create tomorrow’s file,
I’ll find it already exists and has some to-dos. Excellent.</p>
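<p>Creating (or reopening) a dated file like that is a one-liner. A sketch,
assuming GNU <code class="language-plaintext highlighter-rouge">date</code> (on macOS, the equivalent flag is <code class="language-plaintext highlighter-rouge">-v+1d</code>):</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Tomorrow's date in YYYY-MM-DD form (GNU date; use -v+1d on macOS)
tomorrow=$(date -d tomorrow +%Y-%m-%d)
touch "${tomorrow}.md"   # no-op if the file already exists
</code></pre></div></div>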

<p>A few things make this work for me:</p>

<ul>
  <li><strong>Vim.</strong> Makes it very fast to rearrange, reorder, and re-indent.</li>
  <li><strong>Terminal tabs.</strong> My to-do list is always in the first tab of iTerm, so it’s
just a <code class="language-plaintext highlighter-rouge">⌘1</code> away any time I’m in the terminal.</li>
  <li><strong>Minimum viable discipline.</strong> 5 minutes in the morning and evening to
organize the list.</li>
</ul>

<p>It’s hard to overstate just what a good effect this has had on my work. It’s a
2¢ piece of tech that just works.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I found this through the really excellent <a href="https://registerspill.thorstenball.com">Joy &amp; Curiosity</a> series in Register Spill by Thorsten Ball. I can’t recommend Thorsten’s work enough – I read it every Sunday. His utter appreciation for good writing is infectious (and one of the reasons I started this blog). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>There’s some value in these things as historical record, but I don’t plan on preserving them. I keep my files on my local computer and am waiting for the day my computer dies. But Jeff Huang used his .txt file as research for his <a href="https://jeffhuang.com/struggle_for_each_paper/">Behind the Scenes: the struggle for each paper</a> post, which I have to say is cool. It reminds me of Stephen Wolfram’s unbelievable <a href="https://writings.stephenwolfram.com/2019/02/seeking-the-productive-life-some-details-of-my-personal-infrastructure/">Seeking the productive life</a> where, among many other things, he visualizes the emails he’s sent and received over time. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>NTP in 30 seconds</title>
      <link>https://charlie-gallagher.github.io/2026/01/27/ntp-in-30-seconds.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/01/27/ntp-in-30-seconds.html</guid>
      <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <h1 id="what-time-is-it">What time is it?</h1>

<blockquote>
  <p>There is a metrological fable that has been retold many times. It concerns
 an eccentric retired sea captain who lived in the hills overlooking Zanzibar
 City and fired a ceremonial cannon and raised the ensign at exactly noon each
 day. He knew it was noon from his chronometer which he took pains to
 accurately set whenever he passed the watchmaker’s window in town. The
 watchmaker knew his clocks were accurate because he checked them daily when
 that punctilious captain on the hill fired his cannon at noon exactly.</p>

  <p>Account of the <em>Zanzibar Effect</em> in “Failures of the global measurement
system. Part 2: institutions, instruments and strategy.” Gary Price.</p>
</blockquote>

<blockquote>
  <p>Ballyhough railway station has two clocks which disagree by some six minutes.
When one helpful Englishman pointed the fact out to a porter, his reply was
“Faith, sir, if they was to tell the same time, why would we be having two of
them?”</p>

  <p>The Five Clocks, Martin Joos</p>
</blockquote>

<p>Two servers have clocks with the same frequency, but they don’t agree on what
time it is. Server A says it’s 10:00, Server B says it’s 10:05. Server B has a
direct line to an atomic clock receiver, so it’s more accurate.</p>

<p>Once Server A knows that Server B has the right time, it needs to synchronize
with it; Server A needs to discover that it’s actually 10:05, not 10:00. It
could ask, but by the time the question made a round trip through the network,
the answer would already be out of date. So, Server A needs to estimate the
network latency between itself and Server B.</p>

<p>We can estimate latency without having synchronized clocks. Since latency is the
amount of time spent in the network, we only need to estimate the amount of time
<em>not</em> spent in the network. Server A knows the round-trip time of its “What time
is it?” request, so it asks Server B to communicate how long it held the
message. <code class="language-plaintext highlighter-rouge">RTT - B_time = time_on_network</code>, which means the link latency is
<code class="language-plaintext highlighter-rouge">(RTT - B_time) / 2</code>.</p>

<p>Now we know the link latency, and Server A just needs to know the timestamp when
Server B received its “What time is it?” request and it can work out the offset.
And that’s it.</p>

<h2 id="example">Example</h2>
<p>Suppose Server A says it’s 10:00:00 and Server B says it’s 10:05:00, like above.
As a shortcut, instead of telling Server A both the duration and the timestamp
when it received the request, Server B includes the receipt timestamp and the
exit timestamp in its message back to A.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              ┌────────────What time is it?───────────┐       
              │                                       │       
              │                                       │       
              │10:00:00                               │10:05:01
       ┌──────┼───────┐                       ┌───────▼──────┐
       │              │                       │              │
       │              │                       │              │
       │ Server A     │                       │ Server B     │
       │              │                       │              │
       └──────────────┘                       └───────┬──────┘
              ▲ 10:00:03                              │10:05:02
              │                                       │       
              │         ┌──────────────────┐          │       
              │         │                  │          │       
              │         │  Recv: 10:05:01  │──────────┘       
              └─────────┼  Ret:  10:05:02  │                  
                        │                  │                  
                        └──────────────────┘                  
</code></pre></div></div>

<p>Server B held the packet for <code class="language-plaintext highlighter-rouge">10:05:02 - 10:05:01 = 1s</code> and the RTT was
<code class="language-plaintext highlighter-rouge">10:00:03 - 10:00:00 = 3s</code>, which gives a link latency of <code class="language-plaintext highlighter-rouge">(3 - 1) / 2 = 1s</code>.
And Server A now knows that Server B received the message at 10:05:01, and since
it sent the request at 10:00:00 and there’s 1 second of network delay, that
means Server B’s 10:05:01 is the same as Server A’s <code class="language-plaintext highlighter-rouge">10:00:00 + 1s = 10:00:01</code>,
and Server A is exactly <code class="language-plaintext highlighter-rouge">10:05:01 - 10:00:01 = 5m</code> too slow.</p>

<p>In practice, a little algebra takes this from several steps to just one.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>offset
  = ((B1 - A1) + (B2 - A2)) / 2
  = ((10:05:01 - 10:00:00) + (10:05:02 - 10:00:03)) / 2
  = (5:01 + 4:59) / 2
  = 5 minutes
</code></pre></div></div>
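<p>The same arithmetic as a runnable sketch, with the four timestamps converted
to seconds past 10:00:00:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The example's four NTP timestamps, as seconds past 10:00:00
A1=0     # 10:00:00  request leaves Server A
B1=301   # 10:05:01  request arrives at Server B
B2=302   # 10:05:02  reply leaves Server B
A2=3     # 10:00:03  reply arrives at Server A

offset=$(( ((B1 - A1) + (B2 - A2)) / 2 ))
echo "${offset}s"   # 300s = 5 minutes
</code></pre></div></div>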

<h2 id="notes">Notes</h2>
<p>NTP involves a lot more than this bit of algebra, but I think it’s a very
neat trick. In addition to this formula, NTP solves for unreliable networks –
latencies are estimated by repeated sampling; a single round trip is not
enough. Also, you can’t just apply an offset to a clock. Many applications and
protocols assume that time only goes forwards, so if your clock is ahead of the
reference, care must be taken: instead of hopping back in time, the system
clock is slowed down slightly until the entire offset has been absorbed. When
you’re behind the reference, the standard approach is to modestly speed up the
clock until you’ve caught up.</p>

<p>In extreme cases (like being off by 30 minutes), the clock is simply stepped
(jumped directly to the new time) so that slewing doesn’t take hours.</p>

<p>None of this so far addresses how servers agree on who has the authority and who
will sync to whose clock. That’s determined by consulting a hierarchy of
authority based on how many hops you are from an atomic clock. Time sources are
stratified. The atomic clocks themselves are Stratum 0; a server attached
directly to one is Stratum 1, a server that syncs to a Stratum 1 server is
Stratum 2, and so on. Part of the NTP exchange is comparing stratum numbers and
selecting the authority based on whose number is lower. The “loser” sets their
stratum to <code class="language-plaintext highlighter-rouge">s_authority + 1</code>.</p>
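<p>The selection rule itself is tiny. A toy sketch (the real exchange happens
in NTP packet fields, but the logic reduces to this comparison):</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy stratum comparison: the lower stratum wins the authority
my_stratum=4
peer_stratum=2
if [ "$peer_stratum" -lt "$my_stratum" ]; then
	my_stratum=$((peer_stratum + 1))   # sync to the peer, sit one below it
fi
echo "$my_stratum"   # 3
</code></pre></div></div>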

<p>And you might then ask, are all atomic clocks created equal? Where does that
authority come from? How is UTC defined? And for that, I don’t have space on
this blog. Let’s just be grateful that thanks to NTP, we don’t have to worry
about it.</p>

<h2 id="postscript">Postscript</h2>
<p>If you run Linux and want to see exactly what your <code class="language-plaintext highlighter-rouge">ntpd</code> daemon is up to, try
out <code class="language-plaintext highlighter-rouge">ntpviz</code> from the NTPsec project. Details in the <a href="https://blog.ntpsec.org/2016/12/19/ntpviz-intro.html">ntpviz intro post</a>.</p>


        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <h1 id="what-time-is-it">What time is it?</h1>

<blockquote>
  <p>There is a metrological fable that has been retold many times. It concerns
 an eccentric retired sea captain who lived in the hills overlooking Zanzibar
 City and fired a ceremonial cannon and raised the ensign at exactly noon each
 day. He knew it was noon from his chronometer which he took pains to
 accurately set whenever he passed the watchmaker’s window in town. The
 watchmaker knew his clocks were accurate because he checked them daily when
 that punctilious captain on the hill fired his cannon at noon exactly.</p>

  <p>Account of the <em>Zanzibar Effect</em> in “Failures of the global measurement
system. Part 2: institutions, instruments and strategy.” Gary Price.</p>
</blockquote>

<blockquote>
  <p>Ballyhough railway station has two clocks which disagree by some six minutes.
When one helpful Englishman pointed the fact out to a porter, his reply was
“Faith, sir, if they was to tell the same time, why would we be having two of
them?”</p>

  <p>The Five Clocks, Martin Joos</p>
</blockquote>

<p>Two servers have clocks with the same frequency, but they don’t agree on what
time it is. Server A says it’s 10:00, Server B says it’s 10:05. Server B has a
direct line to an atomic clock receiver, so it’s more accurate.</p>

<p>Once Server A knows that Server B has the right time, it needs to synchronize
with it; Server A needs to discover that it’s actually 10:05, not 10:00. It
could ask, but by the time the question made a round trip through the network,
the answer would already be out of date. So, Server A needs to estimate the
network latency between itself and Server B.</p>

<p>We can estimate latency without having synchronized clocks. Since latency is the
amount of time spent in the network, we only need to estimate the amount of time
<em>not</em> spent in the network. Server A knows the round-trip time of its “What time
is it?” request, so it asks Server B to communicate how long it held the
message. <code class="language-plaintext highlighter-rouge">RTT - B_time = time_on_network</code>, which means the link latency is
<code class="language-plaintext highlighter-rouge">(RTT - B_time) / 2</code>.</p>

<p>Now we know the link latency, and Server A just needs to know the timestamp when
Server B received its “What time is it?” request and it can work out the offset.
And that’s it.</p>

<h2 id="example">Example</h2>
<p>Suppose Server A says it’s 10:00:00 and Server B says it’s 10:05:00, like above.
As a shortcut, instead of telling Server A both the duration and the timestamp
when it received the request, Server B includes the receipt timestamp and the
exit timestamp in its message back to A.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              ┌────────────What time is it?───────────┐       
              │                                       │       
              │                                       │       
              │10:00:00                               │10:05:01
       ┌──────┼───────┐                       ┌───────▼──────┐
       │              │                       │              │
       │              │                       │              │
       │ Server A     │                       │ Server B     │
       │              │                       │              │
       └──────────────┘                       └───────┬──────┘
              ▲ 10:00:03                              │10:05:02
              │                                       │       
              │         ┌──────────────────┐          │       
              │         │                  │          │       
              │         │  Recv: 10:05:01  │──────────┘       
              └─────────┼  Ret:  10:05:02  │                  
                        │                  │                  
                        └──────────────────┘                  
</code></pre></div></div>

<p>Server B held the packet for <code class="language-plaintext highlighter-rouge">10:05:02 - 10:05:01 = 1s</code> and the RTT was
<code class="language-plaintext highlighter-rouge">10:00:03 - 10:00:00 = 3s</code>, which gives a link latency of <code class="language-plaintext highlighter-rouge">(3 - 1) / 2 = 1s</code>.
And Server A now knows that Server B received the message at 10:05:01. Since it
sent the request at 10:00:00 and there’s 1 second of network delay, Server B’s
10:05:01 corresponds to Server A’s <code class="language-plaintext highlighter-rouge">10:00:00 + 1s = 10:00:01</code>,
which means Server A is exactly <code class="language-plaintext highlighter-rouge">10:05:01 - 10:00:01 = 5m</code> slow.</p>

<p>In practice, a little algebra takes this from several steps to just one.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>offset
  = ((B1 - A1) + (B2 - A2)) / 2
  = ((10:05:01 - 10:00:00) + (10:05:02 - 10:00:03)) / 2
  = (5:01 + 4:59) / 2
  = 5 minutes
</code></pre></div></div>
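<p>The same one-liner in Python, using the conventional NTP names T1 through T4
for the four timestamps (timestamps here are just seconds since midnight; this
is an illustration, not NTP’s actual implementation):</p>

```python
def ntp_offset(t1, t2, t3, t4):
    """Offset of the remote clock relative to ours.

    t1: request sent (our clock)     t2: request received (their clock)
    t3: reply sent (their clock)     t4: reply received (our clock)
    """
    return ((t2 - t1) + (t3 - t4)) / 2

def hms(h, m, s):
    # Convert a wall-clock time to seconds since midnight.
    return h * 3600 + m * 60 + s

# The example above: sent 10:00:00, received 10:05:01,
# replied 10:05:02, reply arrived 10:00:03.
offset = ntp_offset(hms(10, 0, 0), hms(10, 5, 1), hms(10, 5, 2), hms(10, 0, 3))
print(offset / 60)  # 5.0 (minutes)
```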

<h2 id="notes">Notes</h2>
<p>NTP involves a lot more than this algebra trick, but I think it’s a very neat
trick. Beyond this formula, NTP also copes with unreliable networks: latencies
are estimated by repeatedly sampling round trips, because a single sample is not
enough. And you can’t simply apply the offset to a clock. Many applications and
protocols assume that time only moves forwards, so if your clock is ahead of the
reference, care is taken: instead of hopping backwards in time, the system clock
is slowed down slightly until the entire offset has been absorbed. When you’re
behind the reference, the clock is likewise modestly sped up until you’ve caught
up.</p>

<p>In extreme cases (like being off by 30 minutes), the clock is stepped
directly instead, since slewing away an offset that large would take far too
long.</p>
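<p>To make the slew-versus-step decision concrete, here is a toy sketch. The
128 ms threshold is ntpd’s default step threshold; the function itself is
illustrative, not ntpd’s real logic:</p>

```python
STEP_THRESHOLD = 0.128  # seconds; ntpd's default step threshold

def plan_adjustment(offset):
    # Small offsets are slewed: the clock rate is nudged until the
    # offset is absorbed, so time never appears to jump.
    if abs(offset) <= STEP_THRESHOLD:
        return "slew"
    # Large offsets (like being 30 minutes off) are stepped directly,
    # since slewing them away would take far too long.
    return "step"

print(plan_adjustment(0.05))    # slew
print(plan_adjustment(1800.0))  # step
```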

<p>None of this so far addresses how servers agree on who has the authority and who
will sync to whose clock. That’s determined by consulting a hierarchy of
authority based on how many hops you are from an atomic clock. Time sources are
stratified. The reference hardware itself (atomic clocks, GPS receivers) sits
at Stratum 0, a server that gets its time directly from a Stratum 0 device is
in Stratum 1, and so on. Part of the NTP exchange is comparing strata and
selecting the authority based on whose number is lower. The “loser” sets their
stratum to <code class="language-plaintext highlighter-rouge">s_authority + 1</code>.</p>
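<p>The stratum rule can be sketched like this (a toy model; real NTP peer
selection weighs more than the stratum number):</p>

```python
def negotiate_stratum(mine, theirs):
    # If the peer is closer to a reference clock, sync to it and
    # advertise ourselves as one hop further away than it is.
    if theirs < mine:
        return "sync to peer", theirs + 1
    # Otherwise keep our own stratum; the peer may sync to us instead.
    return "keep own clock", mine

print(negotiate_stratum(3, 2))  # ('sync to peer', 3)
```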

<p>And you might then ask, are all atomic clocks created equal? Where does that
authority come from? How is UTC defined? And for that, I don’t have space on
this blog. Let’s just be grateful that thanks to NTP, we don’t have to worry
about it.</p>

<h2 id="postscript">Postscript</h2>
<p>If you run Linux and want to see exactly what your <code class="language-plaintext highlighter-rouge">ntpd</code> daemon is up to, try
out <code class="language-plaintext highlighter-rouge">ntpviz</code> from the NTPsec project. Details here:
<a href="https://blog.ntpsec.org/2016/12/19/ntpviz-intro.html">https://blog.ntpsec.org/2016/12/19/ntpviz-intro.html</a></p>


        ]]>
      </content:encoded>
    </item>
    
  </channel>
</rss>
