<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <title>Charlie Gallagher on Programming</title>
    <link>https://charlie-gallagher.github.io</link>
    <description>Personal blog and website</description>
    <language>en-us</language>

    
    <item>
      <title>Multiprocessing TSV repair</title>
      <link>https://charlie-gallagher.github.io/2026/03/17/tsv-repair-multiprocessing.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/03/17/tsv-repair-multiprocessing.html</guid>
      <pubDate>Tue, 17 Mar 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>A few days ago I wrote about
<a href="/2026/03/12/tsv-repair.html">optimizing a TSV repair script</a> that took
large TSV files with unquoted newline characters like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a
multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>And turned them into valid TSV files like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>I initially dismissed multiprocessing as pesky, complicated, and unlikely to
yield good results. But after returning to it, I found that not only is the
problem tractable, it can deliver real performance benefits.</p>

<p>I’ve written a compatible multiprocessing TSV repair script that has taken the
best end-to-end time from 5.78s down to 3.04s (though the average is somewhat
worse).</p>

<h1 id="what-i-missed">What I missed</h1>
<p>In the last post, I discounted the multiprocessing approach for the following
reason:</p>

<blockquote>
  <p>We can’t align on a row boundary without knowing how many tab characters have
come before the current <code class="language-plaintext highlighter-rouge">seek</code> position, and we can’t do that without indexing
all of the tabs in the file. That’s what we’re already doing in the basic
implementation of the TSV repair script, so I don’t see a way for
multiprocessing to be faster than sequential processing.</p>
</blockquote>

<p>I didn’t consider that you could <em>also parallelize the tab indexing</em>.</p>

<p>This is the basis of the <em>two-pass</em> approach to distributed parsing of
delimited files. The first pass uses workers to gather statistics about the
file. Then, the master uses those statistics to re-evaluate the naive ranges it
assigned to the workers. Finally, the master tells the workers what
modifications they need to make to their assigned ranges to work with only
complete records, and the workers go to work. For a complete description, see
<a href="https://badrish.net/papers/dp-sigmod19.pdf">https://badrish.net/papers/dp-sigmod19.pdf</a>.</p>

<p>For valid delimited files, where newline characters are allowed only inside
quoted fields, the real question for each worker is whether its initial
position falls inside a quoted field. In the paper linked above, the authors
answer that question speculatively. It’s a nifty technique, and if you’re
interested I recommend reading through the paper. The short version is that
each worker “sniffs” the first megabyte or so of data in its chunk and makes an
educated guess about whether it started in a quoted field. The workers don’t
communicate back with the master; they just proceed with their guess. If one of
them encounters an error, the whole thing falls back to the two-pass
approach.</p>

<p>In my case, there are no quotes, so I had to stay with the two-pass approach and
adapt it to the malformed TSVs I’m dealing with.</p>

<p>But before I get to the new script, I wanted to mention one more interesting
feature of these TSV files that I didn’t notice before.</p>

<h1 id="inherent-ambiguity">Inherent ambiguity</h1>
<p>While working on the multiprocessor script, I realized that I had missed an
ambiguity lurking in these malformed TSV files. Given a file like the following,
there’s no “correct” interpretation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>one,two,three
four
five,six,seven
</code></pre></div></div>

<p>This could be interpreted in one of two ways:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Version 1
one,two,three\nfour
five,six,seven

# Version 2
one,two,three
four\nfive,six,seven
</code></pre></div></div>

<p>The problem is that when the final field contains a newline character, we can’t
say whether it’s a record delimiter or an unquoted newline.</p>

<p>It’s computationally easier to use a non-greedy approach, which produces
interpretation Version 2. You read lines and stop reading as soon as you have
accumulated the correct number of field delimiters (tabs, commas). During
processing of the above ambiguous snippet, the processor first reads the line
<code class="language-plaintext highlighter-rouge">one,two,three</code> and finds it complete. Then, it starts building the next record
with <code class="language-plaintext highlighter-rouge">four</code>, joins it with the next line, and finds that the record is now
complete.</p>
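<p>As a sketch, the non-greedy rule looks something like this. This is a
hypothetical helper for illustration, not the actual repair script, and it
assumes no record ever accumulates too many delimiters:</p>

```python
def non_greedy_records(lines, n_fields, sep=","):
    """Join lines until a record has n_fields - 1 delimiters.

    Illustrative sketch of the non-greedy rule, not the real repair
    script; assumes no line ever has too many delimiters.
    """
    records = []
    buf = ""
    for line in lines:
        # an unquoted newline joins the buffered partial record
        buf = buf + "\n" + line if buf else line
        # stop as soon as enough field delimiters have accumulated
        if buf.count(sep) >= n_fields - 1:
            records.append(buf)
            buf = ""
    return records

# The ambiguous snippet resolves to Version 2:
print(non_greedy_records(["one,two,three", "four", "five,six,seven"], 3))
# ['one,two,three', 'four\nfive,six,seven']
```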

<p>Fortunately this rule is as easy to follow for a sequential processor as for a
parallel processor.</p>

<h1 id="adapting-the-two-phase-parallel-parser">Adapting the two-phase parallel parser</h1>
<p>To make this work, I had to make the definition of a row a little more strict.
The parallel processor can no longer tolerate lines that have too many tabs in
them – it’s assumed that every record is composed of the same number of fields.</p>

<p>With that assumption, it becomes a given that the total number of tabs in the
file is a multiple of the number of tabs in the header, i.e. <code class="language-plaintext highlighter-rouge">n_fields - 1</code>. So,
I split the file evenly into chunks and assign each worker a chunk. The workers
count the number of tabs in their chunks and report back to the master process.
The master process then figures out how many tabs each worker needs to <em>skip</em> in
order to land on a record boundary. The workers then treat the rest of their
range as a normal TSV file, stopping once they’ve read their last full record
that started before the end of their chunk. So workers often read past the end
of their assigned byte range in the file, and the calculations performed by the
master process ensure that the next worker down the line knows how far the
previous worker had to over-read.</p>
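<p>The master’s alignment step can be sketched as follows. This is a simplified
illustration, not the actual implementation: it assumes the header has already
been accounted for and glosses over the edge case where a chunk happens to
start exactly on a record boundary.</p>

```python
def alignment_skips(chunk_tab_counts, tabs_per_record):
    """For each worker, the number of tabs to skip before it lands
    on a record boundary.

    Simplified sketch of the master's calculation: chunk_tab_counts
    holds the tab count each worker reported for its naive chunk,
    and tabs_per_record is n_fields - 1.
    """
    skips = []
    tabs_before = 0  # total tabs in all chunks before this worker's
    for count in chunk_tab_counts:
        partial = tabs_before % tabs_per_record
        # tabs still needed to finish the record this chunk starts inside
        skips.append((tabs_per_record - partial) % tabs_per_record)
        tabs_before += count
    return skips

# Three fields per record -> two tabs per record
print(alignment_skips([5, 4, 7], 2))
# [0, 1, 1]
```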

<p>The implementation is a minefield of boundary issues and off-by-one errors, but
in the end I got it working correctly and reasonably efficiently. The full
implementation is available <a href="https://github.com/charlie-gallagher/tsv-repair/blob/main/repair_bytes_buffered_read_write_multiprocessing_two_pass.py">on GitHub</a>.</p>

<h1 id="performance">Performance</h1>
<p>As I said in the last post, for best performance, you should leave the files in
separate pieces, and that’s exactly what I did for the performance benchmarks. I
did write an optional extension that recombines the files, and I used that to
confirm that the “repaired” version of various files matched a known-good
processor.</p>

<p>The performance is unstable, but generally very good. Performance seems to
depend on how busy the system is with other, more “important” work. All of the
workers are also accessing the same file, which creates the possibility for
contention. They’re all reads, but as the system flips from one process to
another, file accesses become more random.</p>

<p>Still, it’s sometimes exceptional. The best recorded time so far was 3.04
seconds, almost a 2x improvement on the previous best time. Here’s a smattering
of results, with a normal sequential run thrown in the middle.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-17 15:35:40	repair_bytes_buffered_read_write_multiprocessing_two_pass	3.394725
2026-03-17 15:35:57	repair_bytes_buffered_read_write_multiprocessing_two_pass	3.986272
2026-03-17 15:36:08	repair_bytes_buffered_read_write	6.599493
2026-03-17 15:36:18	repair_bytes_buffered_read_write_multiprocessing_two_pass	9.798484
2026-03-17 15:36:35	repair_bytes_buffered_read_write_multiprocessing_two_pass	7.341621
2026-03-17 15:37:02	repair_bytes_buffered_read_write_multiprocessing_two_pass	3.043876
</code></pre></div></div>

<p>The trimmed mean is 4.9 seconds.</p>

<p>When I turn on the feature that re-combines the files at the end, performance
dips to more like 12 seconds per run, more or less as expected.</p>

<h2 id="update-2026-03-18">Update: 2026-03-18</h2>
<p>I ran more tests on this to see where the bottlenecks might be. I ran a very
large file (10 GB) and the performance of the multiprocessing version of the
code was about the same as the sequential version. A few things I noted:</p>

<ul>
  <li>I did optimize the alignment code, since this large file has 1200 columns in
it. But the alignment code is not the bottleneck, usually taking somewhere in
the range of 1-2 milliseconds.</li>
  <li>With only one worker, I found that each chunk of the file was processed in
only 2 seconds. If you kept that speed up, you would process the whole file in
around 5-7 seconds.</li>
  <li>With 4 workers, each chunk was processed in 6 seconds, and with 8 workers
(one per vCPU) each took 12 seconds.</li>
</ul>

<p>So I/O contention is the most likely cause of the limited performance on large
files. And of course multiprocessing is best when you can put multiple
processors to work at once, doing calculations and whatnot. I found that when I
increased the newline density to 0.2 again, the multiprocessing code was
significantly faster than the sequential code (14s compared to 26s). So even
though the task is I/O bound, multiprocessing seems to perform at least as well
as, and often better than, sequential processing.</p>

<h1 id="comments">Comments</h1>
<p>This was a serious bump in complexity and peskiness, but I’m thrilled with the
performance (most of the time). I think there might be a bit more performance to
be squeezed out of it. The I/O could most likely be faster when I’m scanning
forward to skip tabs, but I’m plenty happy with this implementation, and I’m
satisfied that the cost of scanning forward is bounded by the number of fields,
not the number of rows.</p>

<p>Profiling becomes tricky with multiprocessing, so I skipped it for these runs.
If you’re a profiler junkie, I’ll gladly accept any PRs.</p>

<p>Ultimately I wouldn’t recommend this for production, because it depends so
heavily on there being a correct number of tabs. Any misalignment and the
quality goes out the window, or you’d have to write a guard that makes the
master process fall over if the number of tabs is wrong. But fun to get working
anyway!</p>


        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>A few days ago I wrote about
<a href="/2026/03/12/tsv-repair.html">optimizing a TSV repair script</a> that took
large TSV files with unquoted newline characters like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a
multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>And turned them into valid TSV files like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>I initially dismissed multiprocessing as pesky, complicated, and unlikely to
yield good results. But after returning to it, I found that not only is the
problem tractable, it can deliver real performance benefits.</p>

<p>I’ve written a compatible multiprocessing TSV repair script that has taken the
best end-to-end time from 5.78s down to 3.04s (though the average is somewhat
worse).</p>

<h1 id="what-i-missed">What I missed</h1>
<p>In the last post, I discounted the multiprocessing approach for the following
reason:</p>

<blockquote>
  <p>We can’t align on a row boundary without knowing how many tab characters have
come before the current <code class="language-plaintext highlighter-rouge">seek</code> position, and we can’t do that without indexing
all of the tabs in the file. That’s what we’re already doing in the basic
implementation of the TSV repair script, so I don’t see a way for
multiprocessing to be faster than sequential processing.</p>
</blockquote>

<p>I didn’t consider that you could <em>also parallelize the tab indexing</em>.</p>

<p>This is the basis of the <em>two-pass</em> approach to distributed parsing of
delimited files. The first pass uses workers to gather statistics about the
file. Then, the master uses those statistics to re-evaluate the naive ranges it
assigned to the workers. Finally, the master tells the workers what
modifications they need to make to their assigned ranges to work with only
complete records, and the workers go to work. For a complete description, see
<a href="https://badrish.net/papers/dp-sigmod19.pdf">https://badrish.net/papers/dp-sigmod19.pdf</a>.</p>

<p>For valid delimited files, where newline characters are allowed only inside
quoted fields, the real question for each worker is whether its initial
position falls inside a quoted field. In the paper linked above, the authors
answer that question speculatively. It’s a nifty technique, and if you’re
interested I recommend reading through the paper. The short version is that
each worker “sniffs” the first megabyte or so of data in its chunk and makes an
educated guess about whether it started in a quoted field. The workers don’t
communicate back with the master; they just proceed with their guess. If one of
them encounters an error, the whole thing falls back to the two-pass
approach.</p>

<p>In my case, there are no quotes, so I had to stay with the two-pass approach and
adapt it to the malformed TSVs I’m dealing with.</p>

<p>But before I get to the new script, I wanted to mention one more interesting
feature of these TSV files that I didn’t notice before.</p>

<h1 id="inherent-ambiguity">Inherent ambiguity</h1>
<p>While working on the multiprocessor script, I realized that I had missed an
ambiguity lurking in these malformed TSV files. Given a file like the following,
there’s no “correct” interpretation:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>one,two,three
four
five,six,seven
</code></pre></div></div>

<p>This could be interpreted in one of two ways:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Version 1
one,two,three\nfour
five,six,seven

# Version 2
one,two,three
four\nfive,six,seven
</code></pre></div></div>

<p>The problem is that when the final field contains a newline character, we can’t
say whether it’s a record delimiter or an unquoted newline.</p>

<p>It’s computationally easier to use a non-greedy approach, which produces
interpretation Version 2. You read lines and stop reading as soon as you have
accumulated the correct number of field delimiters (tabs, commas). During
processing of the above ambiguous snippet, the processor first reads the line
<code class="language-plaintext highlighter-rouge">one,two,three</code> and finds it complete. Then, it starts building the next record
with <code class="language-plaintext highlighter-rouge">four</code>, joins it with the next line, and finds that the record is now
complete.</p>
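<p>As a sketch, the non-greedy rule looks something like this. This is a
hypothetical helper for illustration, not the actual repair script, and it
assumes no record ever accumulates too many delimiters:</p>

```python
def non_greedy_records(lines, n_fields, sep=","):
    """Join lines until a record has n_fields - 1 delimiters.

    Illustrative sketch of the non-greedy rule, not the real repair
    script; assumes no line ever has too many delimiters.
    """
    records = []
    buf = ""
    for line in lines:
        # an unquoted newline joins the buffered partial record
        buf = buf + "\n" + line if buf else line
        # stop as soon as enough field delimiters have accumulated
        if buf.count(sep) >= n_fields - 1:
            records.append(buf)
            buf = ""
    return records

# The ambiguous snippet resolves to Version 2:
print(non_greedy_records(["one,two,three", "four", "five,six,seven"], 3))
# ['one,two,three', 'four\nfive,six,seven']
```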

<p>Fortunately this rule is as easy to follow for a sequential processor as for a
parallel processor.</p>

<h1 id="adapting-the-two-phase-parallel-parser">Adapting the two-phase parallel parser</h1>
<p>To make this work, I had to make the definition of a row a little more strict.
The parallel processor can no longer tolerate lines that have too many tabs in
them – it’s assumed that every record is composed of the same number of fields.</p>

<p>With that assumption, it becomes a given that the total number of tabs in the
file is a multiple of the number of tabs in the header, i.e. <code class="language-plaintext highlighter-rouge">n_fields - 1</code>. So,
I split the file evenly into chunks and assign each worker a chunk. The workers
count the number of tabs in their chunks and report back to the master process.
The master process then figures out how many tabs each worker needs to <em>skip</em> in
order to land on a record boundary. The workers then treat the rest of their
range as a normal TSV file, stopping once they’ve read their last full record
that started before the end of their chunk. So workers often read past the end
of their assigned byte range in the file, and the calculations performed by the
master process ensure that the next worker down the line knows how far the
previous worker had to over-read.</p>
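<p>The master’s alignment step can be sketched as follows. This is a simplified
illustration, not the actual implementation: it assumes the header has already
been accounted for and glosses over the edge case where a chunk happens to
start exactly on a record boundary.</p>

```python
def alignment_skips(chunk_tab_counts, tabs_per_record):
    """For each worker, the number of tabs to skip before it lands
    on a record boundary.

    Simplified sketch of the master's calculation: chunk_tab_counts
    holds the tab count each worker reported for its naive chunk,
    and tabs_per_record is n_fields - 1.
    """
    skips = []
    tabs_before = 0  # total tabs in all chunks before this worker's
    for count in chunk_tab_counts:
        partial = tabs_before % tabs_per_record
        # tabs still needed to finish the record this chunk starts inside
        skips.append((tabs_per_record - partial) % tabs_per_record)
        tabs_before += count
    return skips

# Three fields per record -> two tabs per record
print(alignment_skips([5, 4, 7], 2))
# [0, 1, 1]
```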

<p>The implementation is a minefield of boundary issues and off-by-one errors, but
in the end I got it working correctly and reasonably efficiently. The full
implementation is available <a href="https://github.com/charlie-gallagher/tsv-repair/blob/main/repair_bytes_buffered_read_write_multiprocessing_two_pass.py">on GitHub</a>.</p>

<h1 id="performance">Performance</h1>
<p>As I said in the last post, for best performance, you should leave the files in
separate pieces, and that’s exactly what I did for the performance benchmarks. I
did write an optional extension that recombines the files, and I used that to
confirm that the “repaired” version of various files matched a known-good
processor.</p>

<p>The performance is unstable, but generally very good. Performance seems to
depend on how busy the system is with other, more “important” work. All of the
workers are also accessing the same file, which creates the possibility for
contention. They’re all reads, but as the system flips from one process to
another, file accesses become more random.</p>

<p>Still, it’s sometimes exceptional. The best recorded time so far was 3.04
seconds, almost a 2x improvement on the previous best time. Here’s a smattering
of results, with a normal sequential run thrown in the middle.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-17 15:35:40	repair_bytes_buffered_read_write_multiprocessing_two_pass	3.394725
2026-03-17 15:35:57	repair_bytes_buffered_read_write_multiprocessing_two_pass	3.986272
2026-03-17 15:36:08	repair_bytes_buffered_read_write	6.599493
2026-03-17 15:36:18	repair_bytes_buffered_read_write_multiprocessing_two_pass	9.798484
2026-03-17 15:36:35	repair_bytes_buffered_read_write_multiprocessing_two_pass	7.341621
2026-03-17 15:37:02	repair_bytes_buffered_read_write_multiprocessing_two_pass	3.043876
</code></pre></div></div>

<p>The trimmed mean is 4.9 seconds.</p>

<p>When I turn on the feature that re-combines the files at the end, performance
dips to more like 12 seconds per run, more or less as expected.</p>

<h2 id="update-2026-03-18">Update: 2026-03-18</h2>
<p>I ran more tests on this to see where the bottlenecks might be. I ran a very
large file (10 GB) and the performance of the multiprocessing version of the
code was about the same as the sequential version. A few things I noted:</p>

<ul>
  <li>I did optimize the alignment code, since this large file has 1200 columns in
it. But the alignment code is not the bottleneck, usually taking somewhere in
the range of 1-2 milliseconds.</li>
  <li>With only one worker, I found that each chunk of the file was processed in
only 2 seconds. If you kept that speed up, you would process the whole file in
around 5-7 seconds.</li>
  <li>With 4 workers, each chunk was processed in 6 seconds, and with 8 workers
(one per vCPU) each took 12 seconds.</li>
</ul>

<p>So I/O contention is the most likely cause of the limited performance on large
files. And of course multiprocessing is best when you can put multiple
processors to work at once, doing calculations and whatnot. I found that when I
increased the newline density to 0.2 again, the multiprocessing code was
significantly faster than the sequential code (14s compared to 26s). So even
though the task is I/O bound, multiprocessing seems to perform at least as well
as, and often better than, sequential processing.</p>

<h1 id="comments">Comments</h1>
<p>This was a serious bump in complexity and peskiness, but I’m thrilled with the
performance (most of the time). I think there might be a bit more performance to
be squeezed out of it. The I/O could most likely be faster when I’m scanning
forward to skip tabs, but I’m plenty happy with this implementation, and I’m
satisfied that the cost of scanning forward is bounded by the number of fields,
not the number of rows.</p>

<p>Profiling becomes tricky with multiprocessing, so I skipped it for these runs.
If you’re a profiler junkie, I’ll gladly accept any PRs.</p>

<p>Ultimately I wouldn’t recommend this for production, because it depends so
heavily on there being a correct number of tabs. Any misalignment and the
quality goes out the window, or you’d have to write a guard that makes the
master process fall over if the number of tabs is wrong. But fun to get working
anyway!</p>


        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>Optimizing TSV Repair in Python</title>
      <link>https://charlie-gallagher.github.io/2026/03/12/tsv-repair.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/03/12/tsv-repair.html</guid>
      <pubDate>Thu, 12 Mar 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>This TSV file has a problem:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a
multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>There’s an unquoted multi-line field on the second line. I’m not aware of a TSV
parser that can correctly parse this – most consider it an invalid file, some
NULL-fill the remaining fields on any line that has incomplete data.</p>

<p>I regularly ingest data from a particular data source that has this bad feature
in it, and the problem has gotten worse recently. I’ve decided to fix it. The
question is, “What’s the best way to fix unquoted newline characters?”</p>

<p>I created this repository with my results and some tooling for benchmarking and profiling: <a href="https://github.com/charlie-gallagher/tsv-repair">https://github.com/charlie-gallagher/tsv-repair</a></p>

<h1 id="problem-statement">Problem statement</h1>
<p>Like the <a href="https://www.morling.dev/blog/one-billion-row-challenge/">Billion Row Challenge</a>, the goal is
to process large files as quickly as possible using some base programming
language.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> In my case, I worked in base Python 3.13 on macOS.</p>

<p>Using pure Python (stdlib), repair a large (10GB), utf-8 encoded TSV file with a
knowable number of fields by (a) identifying incomplete lines, (b) combining
successive incomplete lines until they form a complete line, and (c) not
combining lines if the result is a row with too many fields. A single row may
contain one or more embedded newlines, i.e. a row might be spread across
multiple lines of the file. The lines that form a row are always ordered
correctly, successive and contiguous. Newlines are LF only, not CRLF. To find
the number of fields, you can read the first line, which is always the header.</p>

<p>To make things simpler, even if a field is quoted, you can still join successive
split lines.</p>

<p>In the end, the file should look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>i.e. replace the newline character with a space. If there are multiple newline
characters in a row (<code class="language-plaintext highlighter-rouge">hello\n\nworld</code>), replace each one with a space. If a
field starts or ends with a newline, you can still replace newlines with a
space.</p>

<h1 id="basic-solution">Basic solution</h1>
<p>Here’s a straightforward solution I came up with that passes all the tests.
There are one or two inelegancies, like the <code class="language-plaintext highlighter-rouge">_need_to_write</code> variable that keeps
track of some state information, but it does the thing.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
        <span class="c1"># Start by copying header
</span>        <span class="n">header</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
        <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">header</span><span class="p">)</span>

        <span class="n">expected_tabs</span> <span class="o">=</span> <span class="n">header</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>

        <span class="c1"># Then, iterate over the lines, repairing as you go
</span>        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="n">line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">line</span><span class="p">:</span>
                <span class="k">break</span>
            <span class="n">line_tabs</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">==</span> <span class="n">expected_tabs</span><span class="p">:</span>
                <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                <span class="k">continue</span>

            <span class="c1"># Line repair
</span>            <span class="c1"># Grab next line and see if it complements
</span>            <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">True</span>
            <span class="k">while</span> <span class="n">line_tabs</span> <span class="o">&lt;</span> <span class="n">expected_tabs</span><span class="p">:</span>
                <span class="n">continuation_line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
                <span class="k">if</span> <span class="ow">not</span> <span class="n">continuation_line</span><span class="p">:</span>
                    <span class="k">break</span>
                <span class="n">cline_tabs</span> <span class="o">=</span> <span class="n">continuation_line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">+</span> <span class="n">cline_tabs</span> <span class="o">&lt;=</span> <span class="n">expected_tabs</span><span class="p">:</span>
                    <span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span> <span class="o">+</span> <span class="s">" "</span> <span class="o">+</span> <span class="n">continuation_line</span>
                    <span class="n">line_tabs</span> <span class="o">+=</span> <span class="n">cline_tabs</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="c1"># Adding these lines would create a row with
</span>                    <span class="c1"># too many fields
</span>                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">continuation_line</span><span class="p">)</span>
                    <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">False</span>
                    <span class="k">break</span>
            <span class="k">if</span> <span class="n">_need_to_write</span><span class="p">:</span>
                <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</code></pre></div></div>

<p>And huzzah, the tests are passing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 -c "from test_repair import main; from repair_basic import repair; main(repair)"
Testing repair function against golden files in /Users/charlie/tsv-repair/test_files

  PASS  already_good.tsv
  PASS  basic.tsv
  PASS  basic_incomplete_first_line.tsv
  PASS  basic_incomplete_last_line.tsv
  PASS  multi_newline.tsv
  PASS  newline_as_first_char_in_field.tsv
  PASS  newline_as_last_char_in_field.tsv
  PASS  newline_as_only_char_in_field.tsv
  PASS  partial_solution.tsv
  PASS  partial_solution_2x.tsv
  PASS  quoted.tsv
  PASS  too_many_tabs.tsv
  PASS  two_bad_lines_in_a_row.tsv

13/13 tests passed.
</code></pre></div></div>

<p>Benchmark on large file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 09:59:03	repair_basic	14.604409
</code></pre></div></div>

<p>The large file configuration for this run was: 1M rows, 120 columns, and 0.002
likelihood that a cell contains a newline character. The file was 3.0 GB. You
can generate a similar file using:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python generate_large_file.py -r 1000000 -c 120 --newline-likelihood 0.002 
</code></pre></div></div>

<p>There’s a cProfile script as well. Here’s the output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_basic.py 
Profiling repair_basic on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         4882830 function calls in 20.516 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.674    1.674   20.516   20.516 /Users/charlie/tsv-repair/repair_basic.py:2(repair)
  1000001    8.711    0.000    8.711    0.000 {method 'write' of '_io.TextIOWrapper' objects}
  1237862    5.031    0.000    6.102    0.000 {method 'readline' of '_io.TextIOWrapper' objects}
  1237861    3.800    0.000    3.800    0.000 {method 'count' of 'str' objects}
   389745    0.358    0.000    0.979    0.000 &lt;frozen codecs&gt;:322(decode)
   389745    0.621    0.000    0.621    0.000 {built-in method _codecs.utf_8_decode}
   237860    0.136    0.000    0.136    0.000 {method 'rstrip' of 'str' objects}
        2    0.093    0.047    0.093    0.047 {built-in method _io.open}
   389745    0.093    0.000    0.093    0.000 &lt;frozen codecs&gt;:334(getstate)
        2    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 &lt;frozen codecs&gt;:312(__init__)
        1    0.000    0.000    0.000    0.000 &lt;frozen codecs&gt;:189(__init__)
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
        1    0.000    0.000    0.000    0.000 &lt;frozen codecs&gt;:263(__init__)
</code></pre></div></div>

<p>Most time was spent reading and writing, followed by counting tab characters.
This is pretty much as you expect. This is an I/O bound program with some amount
of calculation. Out of 20 seconds, 14s were spent on I/O. The other top
hotspots were:</p>

<ul>
  <li>Decoding utf-8 (0.98s)</li>
  <li>Removing newline characters with rstrip (0.14s)</li>
</ul>

<h1 id="optimizations">Optimizations</h1>
<p>A good optimization would be to not write this in Python at all, but let’s
ignore that and assume we have to work in pure cPython. There are plenty of
optimizations we can reach for – some make more sense for an I/O-bound program
and others are more appropriate for a CPU-bound program. Scanning the files is
I/O-bound and fixing tabs involves an amount of CPU work, so it’s worth checking
both.</p>

<ol>
  <li>We can count tabs without decoding utf-8</li>
  <li>Buffer writes</li>
  <li>Buffer reads</li>
  <li>Avoid copying and doing extra work</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<h2 id="avoiding-utf-8-decoding">Avoiding utf-8 decoding</h2>
<p>The file is encoded in utf-8, which has the nice property that any ASCII byte
uniquely identifies that ASCII character. If you’re interested in finding tabs
<code class="language-plaintext highlighter-rouge">\t</code> (<code class="language-plaintext highlighter-rouge">0x09</code>), utf-8 ensures that no matter how many multi-byte characters you
have, none of them will contain this byte. In utf-8 multi-byte sequences, the
most significant bit of every byte is always set, so no multi-byte character can
contain <code class="language-plaintext highlighter-rouge">0x09</code>, whose most significant bit is unset.</p>

<p><img src="/assets/images/2026-tsv-repair/utf8.png" alt="utf-8 encoding diagram" /></p>

<p>Source: <a href="https://badrish.net/papers/dp-sigmod19.pdf">https://badrish.net/papers/dp-sigmod19.pdf</a></p>

<p>That means we can do away with decoding the bytes and just search for <code class="language-plaintext highlighter-rouge">0x09</code>, or
in Python <code class="language-plaintext highlighter-rouge">b"\t"</code>.</p>
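<p>As a quick sanity check (my own example, not from the original script), counting
the tab byte on the raw utf-8 bytes gives the same answer as counting the tab
character on the decoded string, even when multi-byte characters are present:</p>

```python
# Multi-byte utf-8 characters never contain the byte 0x09, so counting
# tabs on the raw bytes matches counting on the decoded string.
line = "0\tcafé\t日本語 comment\t20\n"
raw = line.encode("utf-8")

assert line.count("\t") == raw.count(b"\t") == 3
print(raw.count(b"\t"))  # 3
```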

<p>Here’s the diff with <code class="language-plaintext highlighter-rouge">repair_basic.py</code>.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">❯</span> diff -u repair_basic.py repair_bytes.py
<span class="gd">--- repair_basic.py     2026-03-12 16:42:34
</span><span class="gi">+++ repair_bytes.py     2026-03-13 10:27:08
</span><span class="p">@@ -1,18 +1,18 @@</span>
 
 def repair(input_file: str, output_file: str) -&gt; None:
<span class="gd">-    with open(input_file, "r") as fin, open(output_file, "w") as fout:
</span><span class="gi">+    with open(input_file, "rb") as fin, open(output_file, "wb") as fout:
</span>         # Start by copying header
         header = fin.readline()
         fout.write(header)
 
<span class="gd">-        expected_tabs = header.count("\t")
</span><span class="gi">+        expected_tabs = header.count(b"\t")
</span> 
         # Then, iterate over the lines, repairing as you go
         while True:
             line = fin.readline()
             if not line:
                 break
<span class="gd">-            line_tabs = line.count("\t")
</span><span class="gi">+            line_tabs = line.count(b"\t")
</span>             if line_tabs == expected_tabs:
                 fout.write(line)
                 continue
<span class="p">@@ -24,9 +24,9 @@</span>
                 continuation_line = fin.readline()
                 if not continuation_line:
                     break
<span class="gd">-                cline_tabs = continuation_line.count("\t")
</span><span class="gi">+                cline_tabs = continuation_line.count(b"\t")
</span>                 if line_tabs + cline_tabs &lt;= expected_tabs:
<span class="gd">-                    line = line.rstrip("\n") + " " + continuation_line
</span><span class="gi">+                    line = line.rstrip(b"\n") + b" " + continuation_line
</span>                     line_tabs += cline_tabs
                 else:
                     # Adding these lines would create a row with
</code></pre></div></div>

<p>In practice, this was horrific for performance. The code is basically identical
except now we’re searching for the tab byte instead of the tab character, and
we’re writing bytes. But the benchmarks are kind of startling.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date_time	module	elapsed_seconds
2026-03-13 10:37:49	repair_basic	13.902526
2026-03-13 10:44:06	repair_basic	14.782423
2026-03-13 10:36:58	repair_bytes	41.753419
2026-03-13 10:38:07	repair_bytes	48.101314
2026-03-13 10:44:25	repair_bytes	41.985785
</code></pre></div></div>

<p>The basic version takes only 14s or so, while the bytes version takes closer to
45s. What gives? The profile points out the issue:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes.py    
Profiling repair_bytes on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713592 function calls in 47.199 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.112    2.112   47.199   47.199 /Users/charlie/tsv-repair/repair_bytes.py:2(repair)
  1000001   35.681    0.000   35.681    0.000 {method 'write' of '_io.BufferedWriter' objects}
  1237862    5.865    0.000    5.865    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1237861    3.272    0.000    3.272    0.000 {method 'count' of 'bytes' objects}
   237860    0.138    0.000    0.138    0.000 {method 'rstrip' of 'bytes' objects}
        2    0.130    0.065    0.130    0.065 {built-in method _io.open}
        2    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Suddenly, we’re spending 35 seconds writing bytes. Most likely, the character
string writer automatically does some amount of buffering to optimize file
writes, and the bytes are not buffered at all. The other functions took around
the same amount of time as before. We did successfully drop the utf-8 decoding
logic, and that should save us a second or so all else equal. I’ll try buffering
the write output so there are fewer calls to write bytes.</p>

<h2 id="buffered-writes">Buffered writes</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li>Buffer writes*</li>
  <li>Buffer reads</li>
  <li>Avoid copying and doing extra work</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>The <code class="language-plaintext highlighter-rouge">io</code> stdlib module has a <code class="language-plaintext highlighter-rouge">BufferedWriter</code> we can use to buffer our byte
writes. Here’s the only change we need to make:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>

<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span><span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
            <span class="p">...</span>
</code></pre></div></div>

<p>And the results are pretty outstanding – a 2x speedup on the basic script, and
a 5x speedup on the unbuffered version of byte repair.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date_time	module	elapsed_seconds
2026-03-13 10:37:49	repair_basic	13.902526
2026-03-13 10:44:06	repair_basic	14.782423
2026-03-13 10:36:58	repair_bytes	41.753419
2026-03-13 10:38:07	repair_bytes	48.101314
2026-03-13 10:44:25	repair_bytes	41.985785
2026-03-13 11:15:02	repair_bytes_buffered_write	8.602299
2026-03-13 11:15:13	repair_bytes_buffered_write	7.982473
</code></pre></div></div>

<p>The profile shows that we now spend only about 2s writing data.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes.py
Profiling repair_bytes on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713593 function calls in 11.621 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.332    1.332   11.621   11.621 /Users/charlie/tsv-repair/repair_bytes.py:5(repair)
  1237862    5.051    0.000    5.051    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1237861    2.945    0.000    2.945    0.000 {method 'count' of 'bytes' objects}
  1000001    2.096    0.000    2.096    0.000 {method 'write' of '_io.BufferedWriter' objects}
   237860    0.101    0.000    0.101    0.000 {method 'rstrip' of 'bytes' objects}
        2    0.094    0.047    0.094    0.047 {built-in method _io.open}
        3    0.001    0.000    0.001    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>The 256K buffer size here was found experimentally. I tried up to 5 MB but
didn’t see a performance improvement.</p>
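<p>Here’s a sketch of how you might run that buffer-size experiment yourself (my
own illustration, not the post’s benchmark harness; the payload, write count,
and candidate sizes are made up, and real results depend heavily on the OS and
disk):</p>

```python
import io
import os
import tempfile
import time

def time_buffered_write(buffer_size: int, payload: bytes, n_writes: int) -> float:
    """Write payload n_writes times through a BufferedWriter and time it."""
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        path = tmp.name
    try:
        start = time.perf_counter()
        with open(path, "wb") as raw, io.BufferedWriter(
            raw, buffer_size=buffer_size
        ) as fout:
            for _ in range(n_writes):
                fout.write(payload)
        return time.perf_counter() - start
    finally:
        os.remove(path)

# Try a few power-of-two buffer sizes on a synthetic 200-byte "line"
line = b"x" * 199 + b"\n"
for size in (8 * 1024, 64 * 1024, 256 * 1024, 1 << 20):
    elapsed = time_buffered_write(size, line, 100_000)
    print(f"{size:>8} bytes: {elapsed:.3f}s")
```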

<h2 id="buffered-reads">Buffered reads</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li>Buffer reads*</li>
  <li>Avoid copying and doing extra work</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>Batching our writes worked well, and reads are now the majority of the
processing time. Let’s see if I can just batch reads and get free performance.</p>

<p>There is an <code class="language-plaintext highlighter-rouge">io.BufferedReader</code>, and using it looks like this:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>

<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin_raw</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedReader</span><span class="p">(</span><span class="n">fin_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span>
            <span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span>
        <span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
            <span class="p">...</span>
</code></pre></div></div>

<p>The formatting gets dense, but the results are another significant speedup.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date_time	module	elapsed_seconds
2026-03-13 10:37:49	repair_basic	13.902526
2026-03-13 10:44:06	repair_basic	14.782423
2026-03-13 10:36:58	repair_bytes	41.753419
2026-03-13 10:38:07	repair_bytes	48.101314
2026-03-13 10:44:25	repair_bytes	41.985785
2026-03-13 11:15:02	repair_bytes_buffered_write	8.602299
2026-03-13 11:15:13	repair_bytes_buffered_write	7.982473
2026-03-13 11:48:41	repair_bytes_buffered_read_write	6.561013
2026-03-13 11:48:50	repair_bytes_buffered_read_write	5.967071
2026-03-13 11:49:25	repair_bytes_buffered_read_write	5.910506
2026-03-13 11:49:33	repair_bytes_buffered_read_write	5.782498
</code></pre></div></div>

<h2 id="avoiding-extra-work">Avoiding extra work</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li>Avoid copying and doing extra work*</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>I/O is now optimized, and I wanted to take a look at the algorithm to see if I
could improve it at all.</p>

<p>In the test file I generated, there are 1 million rows and 1,237,860 lines
(excluding the header), so we will spend a decent amount of time fixing lines.
The expected number of newline characters (configured with
<code class="language-plaintext highlighter-rouge">--newline-likelihood</code>) will have an impact on the final algorithm you choose.
I’ve set it so that every cell has a 0.2% chance of including a newline in it,
which is pretty high compared to what I see in the actual data.</p>
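<p>For intuition, here’s a hypothetical sketch (my own, not the actual
<code class="language-plaintext highlighter-rouge">generate_large_file.py</code>) of how a row with a per-cell newline probability
might be built. Whatever the cells contain, the tab count still defines the row:</p>

```python
import random

def make_row(n_cols: int, newline_likelihood: float, rng: random.Random) -> str:
    """Hypothetical generator sketch: each cell independently gets an
    unquoted embedded newline with probability newline_likelihood."""
    cells = []
    for _ in range(n_cols):
        cell = "value"
        if rng.random() < newline_likelihood:
            cell = "multi\nline value"
        cells.append(cell)
    return "\t".join(cells) + "\n"

row = make_row(120, 0.002, random.Random(0))
# The tab count is fixed by the column count, embedded newlines or not
assert row.count("\t") == 119
```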

<p>I have two ideas for optimizations.</p>

<ul>
  <li>Never count the same tab twice</li>
  <li>Reduce allocations by using mutable data structures</li>
</ul>

<p>Here’s the complete code at the moment:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1	# repair_bytes_buffered_read_write.py
 2	import io
    
 3	def repair(input_file: str, output_file: str) -&gt; None:
 4	    with open(input_file, "rb") as fin_raw, open(output_file, "wb") as fout_raw:
 5	        with io.BufferedReader(fin_raw, buffer_size=1 &lt;&lt; 20) as fin, io.BufferedWriter(
 6	            fout_raw, buffer_size=256 * 1024
 7	        ) as fout:
    
 8	            # Start by copying header
 9	            header = fin.readline()
10	            fout.write(header)
    
11	            expected_tabs = header.count(b"\t")
    
12	            # Then, iterate over the lines, repairing as you go
13	            while True:
14	                line = fin.readline()
15	                if not line:
16	                    break
17	                line_tabs = line.count(b"\t")
18	                if line_tabs == expected_tabs:
19	                    fout.write(line)
20	                    continue
    
21	                # Line repair
22	                # Grab next line and see if it complements
23	                _need_to_write = True
24	                while line_tabs &lt; expected_tabs:
25	                    continuation_line = fin.readline()
26	                    if not continuation_line:
27	                        break
28	                    cline_tabs = continuation_line.count(b"\t")
29	                    if line_tabs + cline_tabs &lt;= expected_tabs:
30	                        line = line.rstrip(b"\n") + b" " + continuation_line
31	                        line_tabs += cline_tabs
32	                    else:
33	                        # Adding these lines would create a row with
34	                        # too many fields
35	                        fout.write(line)
36	                        fout.write(continuation_line)
37	                        _need_to_write = False
38	                        break
39	                if _need_to_write:
40	                    fout.write(line)
</code></pre></div></div>

<p>This line does a few things at once; there might be a better way:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>30	                        line = line.rstrip(b"\n") + b" " + continuation_line
</code></pre></div></div>

<p>We know that the line ends in a newline character, so it might be faster to use
<code class="language-plaintext highlighter-rouge">line[:-1]</code>. And instead of creating a new byte string, I’ll extend an existing
byte buffer, which is mutable.</p>

<p>The new version of this line is:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">buffer</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># remove the newline
</span><span class="nb">buffer</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="sa">b</span><span class="s">" "</span> <span class="o">+</span> <span class="n">continuation_line</span><span class="p">)</span>
</code></pre></div></div>
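<p>A quick check (my own example) that the mutable version produces byte-for-byte
the same result as the immutable rstrip-and-concatenate version, given a line
ending in a single newline:</p>

```python
line = b"1\talice\tthis is a\n"
continuation_line = b"multiline comment\t10\n"

# Immutable version from the script
joined = line.rstrip(b"\n") + b" " + continuation_line

# Mutable bytearray version: drop the trailing newline, then append in place
buffer = bytearray(line)
buffer.pop(-1)
buffer.extend(b" " + continuation_line)

assert bytes(buffer) == joined
print(bytes(buffer))  # b'1\talice\tthis is a multiline comment\t10\n'
```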

<p>Performance is about the same, though, at least at this density of newlines.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 12:26:21	repair_bytes_buffered_read_write_bytearray	6.067266
2026-03-13 12:26:31	repair_bytes_buffered_read_write_bytearray	5.932649
2026-03-13 12:26:40	repair_bytes_buffered_read_write_bytearray	5.986914
</code></pre></div></div>

<p>And here’s the profile:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python profile_repair.py repair_bytes_buffered_read_write_bytearray.py
Profiling repair_bytes_buffered_read_write_bytearray on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         5951455 function calls in 9.120 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.716    1.716    9.120    9.120 /Users/charlie/tsv-repair/repair_bytes_buffered_read_write_bytearray.py:4(repair)
  1000000    2.518    0.000    2.518    0.000 {method 'count' of 'bytearray' objects}
  1000001    1.885    0.000    1.885    0.000 {method 'write' of '_io.BufferedWriter' objects}
  1237862    1.475    0.000    1.475    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1237861    0.608    0.000    0.608    0.000 {method 'extend' of 'bytearray' objects}
  1000000    0.399    0.000    0.399    0.000 {method 'clear' of 'bytearray' objects}
   237861    0.343    0.000    0.343    0.000 {method 'count' of 'bytes' objects}
        2    0.121    0.061    0.121    0.061 {built-in method _io.open}
   237860    0.056    0.000    0.056    0.000 {method 'pop' of 'bytearray' objects}
        4    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>I’m spending 0.4s clearing the byte array and 0.6s updating the byte array. I
don’t have much insight into the comparable numbers for allocation times, but
I’d guess this is about the same.</p>

<p>Avoiding extra work: generally no performance improvement, and the code was
harder to read, so I’m not going to keep these optimizations.</p>

<h3 id="update-2026-03-16">Update: 2026-03-16</h3>
<p>I bumped the number of newlines up to 0.2 (1 in 5 cells has a newline in it) and
found that, yes, the bytearray version of the code is a decent bit faster at
this newline density. The basic version takes 26 seconds, while the bytebuffer
only takes 19 seconds.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-16 09:58:58	repair_bytes_buffered_read_write	26.129120
2026-03-16 09:59:46	repair_bytes_buffered_read_write_bytearray	19.367004
2026-03-16 10:01:38	repair_bytes_buffered_read_write	25.511255
2026-03-16 10:02:09	repair_bytes_buffered_read_write_bytearray	19.135691
</code></pre></div></div>

<p>This is a 1 million row file with 120 columns, which means that at a 0.2
newline density, each row will have about <code class="language-plaintext highlighter-rouge">120 / 5 = 24</code> embedded newlines in
it. Put another way, each row of data will be spread across approximately 25
lines of the file.</p>

<h2 id="multiprocessing">Multiprocessing</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li><del>Avoid copying and doing extra work</del></li>
  <li>Multiprocessing*</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>Multiprocessing can be difficult to get right, and there’s no guarantee it’s a
speedup in all cases.</p>

<p>The idea is to chunk the file, start up a few processes, and assign each chunk
to a process. Assuming the processes successfully use all available cores on
your computer, you should see a multiplicative speed up.</p>

<p>But this TSV repair isn’t a map/reduce problem, so let’s take a step back to
consider whether multiprocessing will help at all. I could let each core write
its part of the TSV file to its own output file, rather than having everyone
write to the same file. This is essentially how tools like AWS Athena copy
large amounts of data. The Hive table format is mostly friendly to splintered
files, and some teams run compaction afterward to reduce the cost of having
more files to read later.</p>

<p>Multiprocessing and leaving the output in pieces would be the fastest way to
write it. But that’s no good for the tests, so while it’s an attractive idea, I
won’t be able to leave the files in pieces. Recombining them would require
another full read/write of the data, and most likely that would cost more time
than multiprocessing saves.</p>

<h3 id="update-2026-03-16-1">Update: 2026-03-16</h3>
<p>I returned to this today to see if I could implement the multiprocessing version
of the TSV repair. Interestingly enough, I found that this problem cannot be
efficiently multiprocessed in the way I thought it could.</p>

<p>A row of data is only defined by the number of tabs. To split the file, we have
to align each processor’s chunk of the input file on a row boundary so that each
partial file has only complete rows in it. If you have a table with 25 rows and
5 columns, then the available boundaries are after 0 tabs, 5 tabs, 10, 15, and
so on.</p>

<p>You cannot start in the middle of the file and find a row boundary in all cases.
If you happen to find a newline character followed by <code class="language-plaintext highlighter-rouge">n_column</code> tabs, that
works: you’ve found a valid boundary. But in a file with a high density of
newline characters, where each row is basically guaranteed to have one or more
improperly quoted newlines, it becomes almost impossible to figure out where a
row begins and ends.</p>

<p>Consider the case where every cell has a newline character in it. If you track
from the beginning of the file, you can correctly reassemble this dataset by
counting tabs. But, if you start anywhere in the middle, it becomes impossible
to decide where a row should begin and end. It’s a mass of alternating newlines
and tabs.</p>

<p>We can’t align on a row boundary without knowing how many tab characters have
come before the current <code class="language-plaintext highlighter-rouge">seek</code> position, and we can’t do that without indexing
all of the tabs in the file. That’s what we’re already doing in the basic
implementation of the TSV repair script, so I don’t see a way for
multiprocessing to be faster than sequential processing.</p>
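<p>The sequential dependency can be made concrete: a newline ends a row only when
the tabs seen since the previous row boundary equal the header’s tab count, and
that counter can only be maintained by scanning from the start. A small
illustration of mine, using 3 columns (2 tabs per row):</p>

```python
# Two rows of 3 columns each; the first row has an embedded newline.
expected_tabs = 2
data = b"a\tx\ny\tb\nc\td\te\n"

tabs_since_row_start = 0
row_ends = []
for i, byte in enumerate(data):  # iterating bytes yields ints
    if byte == 0x09:  # tab
        tabs_since_row_start += 1
    elif byte == 0x0A and tabs_since_row_start == expected_tabs:  # newline
        row_ends.append(i)
        tabs_since_row_start = 0

print(row_ends)  # [7, 13]: the newline at offset 3 is inside a field
```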

<p>That does get me thinking, though: how do other tools handle this for valid
files? Imagine a TSV file that correctly quotes newline characters but has
alternating newlines and tabs as before. In fact, since we’re quoting, it’s
technically allowed to include <em>tabs</em> in the quoted fields as well. If this were
a CSV, you could substitute commas.</p>

<p>I’m going to use CSV format for clarity. Your worker might get assigned a <code class="language-plaintext highlighter-rouge">seek</code>
position that looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>,"
"Hello, world!","Nice to meet you"
</code></pre></div></div>

<p>Spicing it up a little with some newlines:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>,"
"Hello,
world!","
Nice to meet you"
</code></pre></div></div>

<p>Can you align on a boundary? The first characters <code class="language-plaintext highlighter-rouge">,"\n</code> are ambiguous. The
comma could be a quoted comma or a field delimiter. But we get more information
with the following <code class="language-plaintext highlighter-rouge">"Hello</code>. If the first double quote started a string, the
second would have to end one and be followed by a delimiter. Since the second
quote is not followed by a delimiter, it must be preceded by one – and it is (a
newline, which delimits rows). So now that we’ve identified that the first quote
is a <em>closing</em> quote, we can be sure that the newline that followed it was the
end of a row of data. This gives us enough information to say that <code class="language-plaintext highlighter-rouge">"Hello</code> is
the beginning of a row, and we can align from there.</p>
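<p>One way to sketch the speculation in code (my own simplified illustration, not
the paper’s algorithm: it assumes strict quoting and no escaped quotes) is to
scan the chunk under both possible starting states and reject whichever one
hits a quoting violation:</p>

```python
def scan_chunk(chunk: str, start_in_quotes: bool):
    """Scan a CSV chunk under an assumed starting quote state.

    Returns offsets of unquoted newlines (candidate row boundaries), or
    None if the assumption leads to a quoting violation. Simplified:
    assumes strict quoting and no escaped quotes.
    """
    in_quotes = start_in_quotes
    boundaries = []
    n = len(chunk)
    for i, c in enumerate(chunk):
        if in_quotes:
            if c == '"':
                # A closing quote must be followed by a delimiter,
                # a newline, or the end of the chunk.
                if i + 1 < n and chunk[i + 1] not in ',\n':
                    return None
                in_quotes = False
        else:
            if c == '"':
                # An opening quote is only legal at the start of a field.
                if i > 0 and chunk[i - 1] not in ',\n':
                    return None
                in_quotes = True
            elif c == '\n':
                boundaries.append(i)
    return boundaries

chunk = ',"\n"Hello,\nworld!","\nNice to meet you"\n'
print(scan_chunk(chunk, start_in_quotes=False))  # None (quoting violation)
print(scan_chunk(chunk, start_in_quotes=True))   # [2, 38]
```

<p>Only the “started inside quotes” assumption survives, which matches the
reasoning above: the first quote must be a closing quote.</p>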

<p>Does this generally hold? Is it economical? I found a paper that discusses
distributed CSV parsing, with examples that look a lot like my own examples
above: <a href="https://badrish.net/papers/dp-sigmod19.pdf">“Speculative Distributed CSV Data Parsing for Big Data
Analytics”</a>. The authors bring up a
good point, which is that a production parser has to recognize invalid CSV files
as well as valid ones. Here’s the abstract:</p>

<blockquote>
  <p>There has been a recent flurry of interest in providing query capability on
raw data in today’s big data systems. These raw data must be parsed before
processing or use in analytics. Thus, a fundamental challenge in distributed
big data systems is that of efficient parallel parsing of raw data. The
difficulties come from the inherent ambiguity while independently parsing
chunks of raw data without knowing the context of these chunks. Specifically,
it can be difficult to find the beginnings and ends of fields and records in
these chunks of raw data. To parallelize parsing, this paper proposes a
speculation-based approach for the CSV format, arguably the most commonly used
raw data format. Due to the syntactic and statistical properties of the
format, speculative parsing rarely fails and therefore parsing is efficiently
parallelized in a distributed setting. Our speculative approach is also
robust, meaning that it can reliably detect syntax errors in CSV data. We
experimentally evaluate the speculative, distributed parsing approach in
Apache Spark using more than 11,000 real-world datasets, and show that our
parser produces significant performance benefits over existing methods.</p>
</blockquote>

<h2 id="memory-mapping">Memory mapping</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li><del>Avoid copying and doing extra work</del></li>
  <li><del>Multiprocessing</del></li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>Memory mapping can be helpful in some cases, but it’s not always a clear
performance win. Before I get to that, though: what is memory mapping?</p>

<p><code class="language-plaintext highlighter-rouge">mmap</code> is a syscall analogous to <code class="language-plaintext highlighter-rouge">read</code>, but with some architectural
differences. Normally when you <code class="language-plaintext highlighter-rouge">read</code> a set of bytes from a file, the kernel
copies those bytes into a kernel buffer, then copies them into your user buffer.
When you request data, the OS schedules a request with the disk driver to get
those bytes, and everything happens “buffer-to-buffer”.</p>

<p><code class="language-plaintext highlighter-rouge">mmap</code> on the other hand maps the file’s bytes into your virtual
memory address space. When you read a set of bytes from an <code class="language-plaintext highlighter-rouge">mmap</code>‘d file, you get
the exact same data, but it travels through very different channels.
Instead of copying buffer-to-buffer, touching an unmapped region causes a page
fault. The kernel loads the whole page (or several, depending on prefetch
configuration) into the page cache and maps it straight into your address
space, with no extra copy into a user buffer.</p>
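<p>As a minimal sketch of my own (not one of the benchmarked scripts), here’s what that looks like from Python: once mapped, the file behaves like a read-only bytes-like object, and touching its contents is what triggers the page faults.</p>

```python
import mmap
import os
import tempfile

# Write a small stand-in file to map.
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"hello\tworld\n")

with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        first = mm[0:5]            # slicing faults the page in on demand
        tabs = mm[:].count(b"\t")  # mm[:] copies the whole mapping to bytes

os.remove(path)
```

<p>Slices return ordinary <code class="language-plaintext highlighter-rouge">bytes</code> objects, so downstream code doesn’t have to care whether its input came from <code class="language-plaintext highlighter-rouge">read</code> or <code class="language-plaintext highlighter-rouge">mmap</code>.</p>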

<p>To oversimplify a bit, <code class="language-plaintext highlighter-rouge">read</code>ing causes a double copy, while <code class="language-plaintext highlighter-rouge">mmap</code>ing needs only a
single copy. Great, you might think.</p>

<p>The tradeoff is that while <code class="language-plaintext highlighter-rouge">read</code> involves more copying, <code class="language-plaintext highlighter-rouge">mmap</code> involves more
syscalls and page faults. Here’s how one Stack answerer put it:</p>

<blockquote>
  <p>So basically you have the following comparison to determine which is faster
for a single read of a large file: Is the extra per-page work implied by the
mmap approach more costly than the per-byte work of copying file contents from
kernel to user space implied by using read()?</p>

  <p><a href="https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks">https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks</a></p>
</blockquote>

<p><code class="language-plaintext highlighter-rouge">mmap</code> can have benefits when you’re reading a file multiple times or doing
non-sequential reads, but in my case the page fault overhead is probably too
high on a cold run. Let’s try it anyway.</p>

<p>The basic implementation is:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>
<span class="kn">import</span> <span class="nn">mmap</span>

<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin_raw</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">mmap</span><span class="p">.</span><span class="n">mmap</span><span class="p">(</span><span class="n">fin_raw</span><span class="p">.</span><span class="n">fileno</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="n">access</span><span class="o">=</span><span class="n">mmap</span><span class="p">.</span><span class="n">ACCESS_READ</span><span class="p">)</span> <span class="k">as</span> <span class="n">mm_in</span><span class="p">:</span>
            <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span><span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
                <span class="p">...</span>
</code></pre></div></div>

<p>I had to ditch the buffered reader, at least the one available from <code class="language-plaintext highlighter-rouge">io</code>,
because a memory mapped file isn’t compatible with it. In any case, buffering
wouldn’t reduce the number of page faults, which is where mmap spends its time.
The performance isn’t great.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 14:23:06	repair_bytes_buffered_read_write_mmap	27.898272
</code></pre></div></div>

<p>Looking at the profile is interesting:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes_buffered_read_write_mmap.py  
Profiling repair_bytes_buffered_read_write_mmap on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713595 function calls in 15.487 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.383    1.383   15.487   15.487 /Users/charlie/tsv-repair/repair_bytes_buffered_read_write_mmap.py:5(repair)
  1237862    8.700    0.000    8.700    0.000 {method 'readline' of 'mmap.mmap' objects}
  1237861    2.974    0.000    2.974    0.000 {method 'count' of 'bytes' objects}
  1000001    2.158    0.000    2.158    0.000 {method 'write' of '_io.BufferedWriter' objects}
        2    0.115    0.057    0.115    0.057 {built-in method _io.open}
   237860    0.104    0.000    0.104    0.000 {method 'rstrip' of 'bytes' objects}
        1    0.053    0.053    0.053    0.053 {method '__exit__' of 'mmap.mmap' objects}
        3    0.001    0.000    0.001    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of '_io.BufferedReader' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Time spent in <code class="language-plaintext highlighter-rouge">count</code> matches previous runs, but, as expected, we’re spending far
more time reading. You’ll also notice that the runtime halved when I ran the
profile — I believe that’s because the pages of the large file were still in
the page cache. There’s a nifty trick you can use to clear the page cache,
though. On MacOS:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sync &amp;&amp; sudo purge
</code></pre></div></div>

<p>After doing this, the profile is again comparable to the first one.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes_buffered_read_write_mmap.py
Profiling repair_bytes_buffered_read_write_mmap on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713595 function calls in 28.173 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.542    1.542   28.173   28.173 /Users/charlie/projects/atlas/tsv-repair/repair_bytes_buffered_read_write_mmap.py:5(repair)
  1237862   20.947    0.000   20.947    0.000 {method 'readline' of 'mmap.mmap' objects}
  1237861    3.075    0.000    3.075    0.000 {method 'count' of 'bytes' objects}
  1000001    2.317    0.000    2.317    0.000 {method 'write' of '_io.BufferedWriter' objects}
        2    0.123    0.061    0.123    0.061 {built-in method _io.open}
   237860    0.108    0.000    0.108    0.000 {method 'rstrip' of 'bytes' objects}
        1    0.061    0.061    0.061    0.061 {method '__exit__' of 'mmap.mmap' objects}
        3    0.001    0.000    0.001    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of '_io.BufferedReader' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Now it’s clear we’re really taking a performance hit with <code class="language-plaintext highlighter-rouge">mmap.readline()</code>.</p>

<p>We’re not getting any performance benefit from mapping our file into memory –
page faults are killing performance when we only read through the file once.</p>
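<p>One mitigation I didn’t benchmark: <code class="language-plaintext highlighter-rouge">mmap.madvise</code> can hint the kernel that the scan is sequential, letting it prefetch longer runs of pages. A sketch (the <code class="language-plaintext highlighter-rouge">MADV_SEQUENTIAL</code> constant is platform-dependent, hence the guard; the helper name is mine):</p>

```python
import mmap
import os
import tempfile


def read_lines_mapped(path):
    """Read every line of a file through an mmap, hinting sequential access."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # Tell the kernel we will scan front to back so it can prefetch
            # larger runs of pages (guarded: the constant is platform-specific).
            if hasattr(mmap, "MADV_SEQUENTIAL"):
                mm.madvise(mmap.MADV_SEQUENTIAL)
            return [line for line in iter(mm.readline, b"")]


# Tiny demo file standing in for the real TSV.
fd, demo_path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"id\tname\n0\tcharlie\n")
lines = read_lines_mapped(demo_path)
os.remove(demo_path)
```

<p>Whether this actually closes the gap with buffered <code class="language-plaintext highlighter-rouge">read</code>s on a cold cache is an open question — I’m only noting it as a possible knob.</p>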

<h2 id="pypy">PyPy</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li><del>Avoid copying and doing extra work</del></li>
  <li><del>Multiprocessing</del></li>
  <li><del>Memory map the file</del></li>
  <li>PyPy</li>
</ol>

<p>This is a bit far afield, but I thought I’d try PyPy, a JIT Python interpreter
that can give a good speedup in some cases. Through Homebrew I was able to get
PyPy3.10, which is a little old, and in any case PyPy shines on
calculation-intensive work, not I/O-bound work. Indeed, the result was a 3–4x
slowdown, perhaps partly due to I/O improvements in more recent versions of
CPython.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 15:58:39	repair_bytes_buffered_read_write	22.078622
2026-03-13 15:59:08	repair_bytes_buffered_read_write	20.995271
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>
<p>The winning script involved only simple tweaks on my original attempt.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>


<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin_raw</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedReader</span><span class="p">(</span><span class="n">fin_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span>
            <span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span>
        <span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>

            <span class="c1"># Start by copying header
</span>            <span class="n">header</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
            <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">header</span><span class="p">)</span>

            <span class="n">expected_tabs</span> <span class="o">=</span> <span class="n">header</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>

            <span class="c1"># Then, iterate over the lines, repairing as you go
</span>            <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
                <span class="n">line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
                <span class="k">if</span> <span class="ow">not</span> <span class="n">line</span><span class="p">:</span>
                    <span class="k">break</span>
                <span class="n">line_tabs</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">==</span> <span class="n">expected_tabs</span><span class="p">:</span>
                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                    <span class="k">continue</span>

                <span class="c1"># Line repair
</span>                <span class="c1"># Grab next line and see if it complements
</span>                <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">True</span>
                <span class="k">while</span> <span class="n">line_tabs</span> <span class="o">&lt;</span> <span class="n">expected_tabs</span><span class="p">:</span>
                    <span class="n">continuation_line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
                    <span class="k">if</span> <span class="ow">not</span> <span class="n">continuation_line</span><span class="p">:</span>
                        <span class="k">break</span>
                    <span class="n">cline_tabs</span> <span class="o">=</span> <span class="n">continuation_line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
                    <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">+</span> <span class="n">cline_tabs</span> <span class="o">&lt;=</span> <span class="n">expected_tabs</span><span class="p">:</span>
                        <span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span> <span class="o">+</span> <span class="sa">b</span><span class="s">" "</span> <span class="o">+</span> <span class="n">continuation_line</span>
                        <span class="n">line_tabs</span> <span class="o">+=</span> <span class="n">cline_tabs</span>
                    <span class="k">else</span><span class="p">:</span>
                        <span class="c1"># Adding these lines would create a row with
</span>                        <span class="c1"># too many fields
</span>                        <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                        <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">continuation_line</span><span class="p">)</span>
                        <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">False</span>
                        <span class="k">break</span>
                <span class="k">if</span> <span class="n">_need_to_write</span><span class="p">:</span>
                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</code></pre></div></div>

<p>Benchmark (after purging):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 15:28:01	repair_bytes_buffered_read_write	6.520668
</code></pre></div></div>

<p>Profile (also after purging):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes_buffered_read_write.py
Profiling repair_bytes_buffered_read_write on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713594 function calls in 7.851 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.228    1.228    7.850    7.850 /Users/charlie/projects/atlas/tsv-repair/repair_bytes_buffered_read_write.py:4(repair)
  1237861    2.875    0.000    2.875    0.000 {method 'count' of 'bytes' objects}
  1237862    1.813    0.000    1.813    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1000001    1.770    0.000    1.770    0.000 {method 'write' of '_io.BufferedWriter' objects}
   237860    0.094    0.000    0.094    0.000 {method 'rstrip' of 'bytes' objects}
        2    0.070    0.035    0.070    0.035 {built-in method _io.open}
        4    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Best optimization techniques:</p>

<ul>
  <li><strong>Buffered reads and writes</strong> accounted for the majority of the speedup</li>
  <li><strong>Working in bytes</strong> saved ~1 sec on utf-8 decoding. There might be more savings
if your data has a higher density of multi-byte characters.</li>
</ul>

<p>Failed optimizations (no effect or negative effect):</p>

<ul>
  <li><strong>Memory mapping.</strong> The page fault overhead was too significant when we are
doing a sequential read through the file.</li>
  <li><strong>Multiprocessing.</strong> The savings we gain from splitting up the file are eaten
up by the time it takes to reassemble the individual outputs. This isn’t a
map/reduce job, and we can’t leave the files disassembled, so no dice for this
version of the problem.</li>
  <li><strong>Tweaking data structures.</strong> At this level of newline density, the code
doesn’t spend enough time in the newline-repair loop to see any effect of
optimizing data structures to reduce copying.</li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>This problem space isn’t as rich as the 1BR challenge, so there won’t be as many exotic tricks, but it was still fun to work through it. For inspiration I looked back through <a href="https://www.youtube.com/watch?v=utTaPW32gKY">Doug Mercer’s video</a> on 1BR in Python. It’s a great watch and has a number of nifty tricks. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>This TSV file has a problem:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a
multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>There’s an unquoted multi-line field on the second line. I’m not aware of a TSV
parser that can correctly parse this — most consider it an invalid file, and some
NULL-fill the remaining fields on any line that has incomplete data.</p>

<p>I regularly ingest data from a particular data source that has this bad feature
in it, and the problem has gotten worse recently. I’ve decided to fix it. The
question is, “What’s the best way to fix unquoted newline characters?”</p>

<p>I created this repository with my results and some tooling for benchmarking and profiling: <a href="https://github.com/charlie-gallagher/tsv-repair">https://github.com/charlie-gallagher/tsv-repair</a></p>

<h1 id="problem-statement">Problem statement</h1>
<p>Like the <a href="https://www.morling.dev/blog/one-billion-row-challenge/">Billion Row Challenge</a>, the goal is
to process large files as quickly as possible using some base programming
language.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> In my case, I worked in base Python 3.13 on MacOS.</p>

<p>Using pure Python (stdlib), repair a large (10GB), utf-8 encoded TSV file with a
knowable number of fields by (a) identifying incomplete lines, (b) combining
successive incomplete lines until they form a complete line, and (c) not
combining lines if the result is a row with too many fields. A single row may
contain one or more embedded newlines, i.e. a row might be spread across two or
more lines of the file. The lines that form a row are always ordered correctly,
successive and contiguous. Newlines are LF only, not CRLF. To find the number of
fields, you can read the first line, which is always the header.</p>

<p>To make things simpler, even if a field is quoted, you can still join successive
split lines.</p>

<p>In the end, the file should look like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>id	name	comment	score
0	charlie	normal	20
1	alice	this is a multiline comment	10
2	bob	normal	8
</code></pre></div></div>

<p>i.e. replace the newline character with a space. If there are multiple newline
characters in a row (<code class="language-plaintext highlighter-rouge">hello\n\nworld</code>), replace each one with a space. If a
field starts or ends with a newline, you can still replace newlines with a
space.</p>

<h1 id="basic-solution">Basic solution</h1>
<p>Here’s a straightforward solution I came up with that passes all the tests.
There are one or two inelegancies, like the <code class="language-plaintext highlighter-rouge">_need_to_write</code> variable that keeps
track of some state information, but it does the thing.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"r"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"w"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
        <span class="c1"># Start by copying header
</span>        <span class="n">header</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
        <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">header</span><span class="p">)</span>

        <span class="n">expected_tabs</span> <span class="o">=</span> <span class="n">header</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>

        <span class="c1"># Then, iterate over the lines, repairing as you go
</span>        <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
            <span class="n">line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
            <span class="k">if</span> <span class="ow">not</span> <span class="n">line</span><span class="p">:</span>
                <span class="k">break</span>
            <span class="n">line_tabs</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
            <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">==</span> <span class="n">expected_tabs</span><span class="p">:</span>
                <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                <span class="k">continue</span>

            <span class="c1"># Line repair
</span>            <span class="c1"># Grab next line and see if it complements
</span>            <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">True</span>
            <span class="k">while</span> <span class="n">line_tabs</span> <span class="o">&lt;</span> <span class="n">expected_tabs</span><span class="p">:</span>
                <span class="n">continuation_line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
                <span class="k">if</span> <span class="ow">not</span> <span class="n">continuation_line</span><span class="p">:</span>
                    <span class="k">break</span>
                <span class="n">cline_tabs</span> <span class="o">=</span> <span class="n">continuation_line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">+</span> <span class="n">cline_tabs</span> <span class="o">&lt;=</span> <span class="n">expected_tabs</span><span class="p">:</span>
                    <span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">(</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span> <span class="o">+</span> <span class="s">" "</span> <span class="o">+</span> <span class="n">continuation_line</span>
                    <span class="n">line_tabs</span> <span class="o">+=</span> <span class="n">cline_tabs</span>
                <span class="k">else</span><span class="p">:</span>
                    <span class="c1"># Adding these lines would create a row with
</span>                    <span class="c1"># too many fields
</span>                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">continuation_line</span><span class="p">)</span>
                    <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">False</span>
                    <span class="k">break</span>
            <span class="k">if</span> <span class="n">_need_to_write</span><span class="p">:</span>
                <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</code></pre></div></div>

<p>And huzzah, the tests are passing.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 -c "from test_repair import main; from repair_basic import repair; main(repair)"
Testing repair function against golden files in /Users/charlie/tsv-repair/test_files

  PASS  already_good.tsv
  PASS  basic.tsv
  PASS  basic_incomplete_first_line.tsv
  PASS  basic_incomplete_last_line.tsv
  PASS  multi_newline.tsv
  PASS  newline_as_first_char_in_field.tsv
  PASS  newline_as_last_char_in_field.tsv
  PASS  newline_as_only_char_in_field.tsv
  PASS  partial_solution.tsv
  PASS  partial_solution_2x.tsv
  PASS  quoted.tsv
  PASS  too_many_tabs.tsv
  PASS  two_bad_lines_in_a_row.tsv

13/13 tests passed.
</code></pre></div></div>

<p>Benchmark on large file:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 09:59:03	repair_basic	14.604409
</code></pre></div></div>

<p>The large file configuration for this run was: 1M rows, 120 columns, and 0.002
likelihood that a cell contains a newline character. The file was 3.0 GB. You
can generate a similar file using:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>python generate_large_file.py -r 1000000 -c 120 --newline-likelihood 0.002 
</code></pre></div></div>

<p>There’s a cProfile script as well. Here’s the output:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_basic.py 
Profiling repair_basic on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         4882830 function calls in 20.516 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.674    1.674   20.516   20.516 /Users/charlie/tsv-repair/repair_basic.py:2(repair)
  1000001    8.711    0.000    8.711    0.000 {method 'write' of '_io.TextIOWrapper' objects}
  1237862    5.031    0.000    6.102    0.000 {method 'readline' of '_io.TextIOWrapper' objects}
  1237861    3.800    0.000    3.800    0.000 {method 'count' of 'str' objects}
   389745    0.358    0.000    0.979    0.000 &lt;frozen codecs&gt;:322(decode)
   389745    0.621    0.000    0.621    0.000 {built-in method _codecs.utf_8_decode}
   237860    0.136    0.000    0.136    0.000 {method 'rstrip' of 'str' objects}
        2    0.093    0.047    0.093    0.047 {built-in method _io.open}
   389745    0.093    0.000    0.093    0.000 &lt;frozen codecs&gt;:334(getstate)
        2    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 &lt;frozen codecs&gt;:312(__init__)
        1    0.000    0.000    0.000    0.000 &lt;frozen codecs&gt;:189(__init__)
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
        1    0.000    0.000    0.000    0.000 &lt;frozen codecs&gt;:263(__init__)
</code></pre></div></div>

<p>Most time was spent reading and writing, followed by counting tab characters.
This is pretty much what you’d expect: an I/O-bound program with some amount
of calculation. Out of 20 seconds, 14s were spent on I/O. The other top
hotspots were:</p>

<ul>
  <li>Decoding utf-8 (0.98s)</li>
  <li>Removing newline characters with rstrip (0.14s)</li>
</ul>

<h1 id="optimizations">Optimizations</h1>
<p>A good optimization would be to not write this in Python at all, but let’s
ignore that and assume we have to work in pure CPython. There are plenty of
optimizations we can reach for — some make more sense for an I/O-bound program
and others are more appropriate for a CPU-bound program. Scanning the files is
I/O-bound, while counting tabs and joining lines involves some CPU work, so
it’s worth checking both.</p>

<ol>
  <li>We can count tabs without decoding utf-8</li>
  <li>Buffer writes</li>
  <li>Buffer reads</li>
  <li>Avoid copying and doing extra work</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<h2 id="avoiding-utf-8-decoding">Avoiding utf-8 decoding</h2>
<p>The file is encoded in utf-8, which has the nice property that any ASCII byte
uniquely identifies that ASCII character. If you’re interested in finding tabs
<code class="language-plaintext highlighter-rouge">\t</code> (<code class="language-plaintext highlighter-rouge">0x09</code>), utf-8 guarantees that no matter how many multi-byte characters you
have, none of them will contain this byte. In utf-8 multi-byte characters, the
high bit of every byte is always set. So no multi-byte character can contain <code class="language-plaintext highlighter-rouge">0x09</code>, where
the high bit is unset.</p>

<p><img src="/assets/images/2026-tsv-repair/utf8.png" alt="utf-8 encoding diagram" /></p>

<p>Source: <a href="https://badrish.net/papers/dp-sigmod19.pdf">https://badrish.net/papers/dp-sigmod19.pdf</a></p>

<p>That means we can do away with decoding the bytes and just search for <code class="language-plaintext highlighter-rouge">0x09</code>, or
in Python <code class="language-plaintext highlighter-rouge">b"\t"</code>.</p>
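<p>A quick standalone sanity check of this property (not from the repair script itself): counting tab bytes in the raw utf-8 gives the same answer as counting tab characters after decoding, even with multi-byte characters in the data.</p>

```python
# No byte of a multi-byte UTF-8 sequence can equal 0x09 (the high bit of
# every such byte is set), so counting tab bytes equals counting tab chars.
s = "0\tcafé\tsnowman ☃\t20\n"   # contains multi-byte characters
raw = s.encode("utf-8")

assert raw.count(b"\t") == s.count("\t") == 3
```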

<p>Here’s the diff with <code class="language-plaintext highlighter-rouge">repair_basic.py</code>.</p>

<div class="language-diff highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="err">❯</span> diff -u repair_basic.py repair_bytes.py
<span class="gd">--- repair_basic.py     2026-03-12 16:42:34
</span><span class="gi">+++ repair_bytes.py     2026-03-13 10:27:08
</span><span class="p">@@ -1,18 +1,18 @@</span>
 
 def repair(input_file: str, output_file: str) -&gt; None:
<span class="gd">-    with open(input_file, "r") as fin, open(output_file, "w") as fout:
</span><span class="gi">+    with open(input_file, "rb") as fin, open(output_file, "wb") as fout:
</span>         # Start by copying header
         header = fin.readline()
         fout.write(header)
 
<span class="gd">-        expected_tabs = header.count("\t")
</span><span class="gi">+        expected_tabs = header.count(b"\t")
</span> 
         # Then, iterate over the lines, repairing as you go
         while True:
             line = fin.readline()
             if not line:
                 break
<span class="gd">-            line_tabs = line.count("\t")
</span><span class="gi">+            line_tabs = line.count(b"\t")
</span>             if line_tabs == expected_tabs:
                 fout.write(line)
                 continue
<span class="p">@@ -24,9 +24,9 @@</span>
                 continuation_line = fin.readline()
                 if not continuation_line:
                     break
<span class="gd">-                cline_tabs = continuation_line.count("\t")
</span><span class="gi">+                cline_tabs = continuation_line.count(b"\t")
</span>                 if line_tabs + cline_tabs &lt;= expected_tabs:
<span class="gd">-                    line = line.rstrip("\n") + " " + continuation_line
</span><span class="gi">+                    line = line.rstrip(b"\n") + b" " + continuation_line
</span>                     line_tabs += cline_tabs
                 else:
                     # Adding these lines would create a row with
</code></pre></div></div>

<p>In practice, this was horrific for performance. The code is basically identical
except now we’re searching for the tab byte instead of the tab character, and
we’re writing bytes. But the benchmarks are kind of startling.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date_time	module	elapsed_seconds
2026-03-13 10:37:49	repair_basic	13.902526
2026-03-13 10:44:06	repair_basic	14.782423
2026-03-13 10:36:58	repair_bytes	41.753419
2026-03-13 10:38:07	repair_bytes	48.101314
2026-03-13 10:44:25	repair_bytes	41.985785
</code></pre></div></div>

<p>The basic version takes only 14s or so, while the bytes version takes closer to
45s. What gives? The profile points out the issue:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes.py    
Profiling repair_bytes on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713592 function calls in 47.199 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    2.112    2.112   47.199   47.199 /Users/charlie/tsv-repair/repair_bytes.py:2(repair)
  1000001   35.681    0.000   35.681    0.000 {method 'write' of '_io.BufferedWriter' objects}
  1237862    5.865    0.000    5.865    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1237861    3.272    0.000    3.272    0.000 {method 'count' of 'bytes' objects}
   237860    0.138    0.000    0.138    0.000 {method 'rstrip' of 'bytes' objects}
        2    0.130    0.065    0.130    0.065 {built-in method _io.open}
        2    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Suddenly, we’re spending 35 seconds writing bytes. Most likely, the character
string writer automatically does some amount of buffering to optimize file
writes, and the bytes are not buffered at all. The other functions took around
the same amount of time as before. We did successfully drop the utf-8 decoding
logic, and that should save us a second or so all else equal. I’ll try buffering
the write output so there are fewer calls to write bytes.</p>

<h2 id="buffered-writes">Buffered writes</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li>Buffer writes*</li>
  <li>Buffer reads</li>
  <li>Avoid copying and doing extra work</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>The <code class="language-plaintext highlighter-rouge">io</code> stdlib module has a <code class="language-plaintext highlighter-rouge">BufferedWriter</code> we can use to buffer our byte
writes. Here’s the only change we need to make:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>

<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span><span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
            <span class="p">...</span>
</code></pre></div></div>

<p>And the results are pretty outstanding – a 2x speedup on the basic script, and
a 5x speedup on the unbuffered version of byte repair.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date_time	module	elapsed_seconds
2026-03-13 10:37:49	repair_basic	13.902526
2026-03-13 10:44:06	repair_basic	14.782423
2026-03-13 10:36:58	repair_bytes	41.753419
2026-03-13 10:38:07	repair_bytes	48.101314
2026-03-13 10:44:25	repair_bytes	41.985785
2026-03-13 11:15:02	repair_bytes_buffered_write	8.602299
2026-03-13 11:15:13	repair_bytes_buffered_write	7.982473
</code></pre></div></div>

<p>The profile shows that we now spend only 2s writing data.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes.py
Profiling repair_bytes on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713593 function calls in 11.621 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.332    1.332   11.621   11.621 /Users/charlie/tsv-repair/repair_bytes.py:5(repair)
  1237862    5.051    0.000    5.051    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1237861    2.945    0.000    2.945    0.000 {method 'count' of 'bytes' objects}
  1000001    2.096    0.000    2.096    0.000 {method 'write' of '_io.BufferedWriter' objects}
   237860    0.101    0.000    0.101    0.000 {method 'rstrip' of 'bytes' objects}
        2    0.094    0.047    0.094    0.047 {built-in method _io.open}
        3    0.001    0.000    0.001    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>The 256K buffer size here was found experimentally. I tried up to 5 MB but
didn’t see a performance improvement.</p>
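<p>Finding a buffer size experimentally can be as simple as timing the same write workload at a few sizes. Here’s a rough sketch of such a micro-benchmark (the helper name, chunk shape, and sizes are all made up; real numbers will vary by machine and filesystem):</p>

```python
import io
import os
import tempfile
import time

def time_buffered_writes(buffer_size, n_writes=100_000, chunk=b"x" * 120 + b"\n"):
    """Time n_writes small writes through a BufferedWriter of the given size."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    start = time.perf_counter()
    with open(path, "wb") as raw, io.BufferedWriter(raw, buffer_size=buffer_size) as f:
        for _ in range(n_writes):
            f.write(chunk)
    elapsed = time.perf_counter() - start
    os.remove(path)
    return elapsed

for size in (8 * 1024, 256 * 1024, 1 << 20):
    print(f"{size:>8} bytes: {time_buffered_writes(size):.3f}s")
```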

<h2 id="buffered-reads">Buffered reads</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li>Buffer reads*</li>
  <li>Avoid copying and doing extra work</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>Batching our writes worked well, and reads are now the majority of the
processing time. Let’s see if I can just batch reads and get free performance.</p>

<p>There is an <code class="language-plaintext highlighter-rouge">io.BufferedReader</code>, and using it looks like this:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>

<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin_raw</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedReader</span><span class="p">(</span><span class="n">fin_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span>
            <span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span>
        <span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
            <span class="p">...</span>
</code></pre></div></div>

<p>The formatting gets dense, but the results are another significant speedup.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>date_time	module	elapsed_seconds
2026-03-13 10:37:49	repair_basic	13.902526
2026-03-13 10:44:06	repair_basic	14.782423
2026-03-13 10:36:58	repair_bytes	41.753419
2026-03-13 10:38:07	repair_bytes	48.101314
2026-03-13 10:44:25	repair_bytes	41.985785
2026-03-13 11:15:02	repair_bytes_buffered_write	8.602299
2026-03-13 11:15:13	repair_bytes_buffered_write	7.982473
2026-03-13 11:48:41	repair_bytes_buffered_read_write	6.561013
2026-03-13 11:48:50	repair_bytes_buffered_read_write	5.967071
2026-03-13 11:49:25	repair_bytes_buffered_read_write	5.910506
2026-03-13 11:49:33	repair_bytes_buffered_read_write	5.782498
</code></pre></div></div>

<h2 id="avoiding-extra-work">Avoiding extra work</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li>Avoid copying and doing extra work*</li>
  <li>Multiprocessing</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>I/O is now optimized, and I wanted to take a look at the algorithm to see if I
could improve it at all.</p>

<p>In the test file I generated, there are 1 million rows and 1,237,860 lines
(excluding the header), so we will spend a decent amount of time fixing lines.
The expected number of newline characters (configured with
<code class="language-plaintext highlighter-rouge">--newline-likelihood</code>) will have an impact on the final algorithm you choose.
I’ve set it so that every cell has a 0.2% chance of including a newline in it,
which is pretty high compared to what I see in the actual data.</p>
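<p>The generator script itself isn’t shown in this post, but the idea behind a <code class="language-plaintext highlighter-rouge">--newline-likelihood</code> knob is simple enough to sketch (the function and cell contents here are made up, not the real generator):</p>

```python
import random

def make_row(n_cols: int, newline_likelihood: float, rng: random.Random) -> str:
    """Build one tab-separated row; each cell may get an unquoted newline."""
    cells = []
    for i in range(n_cols):
        cell = f"value{i}"
        if rng.random() < newline_likelihood:
            cell = cell[:3] + "\n" + cell[3:]   # inject a raw newline mid-cell
        cells.append(cell)
    return "\t".join(cells) + "\n"

rng = random.Random(42)
row = make_row(120, 0.002, rng)
assert row.count("\t") == 119   # tab count is unaffected by injected newlines
```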

<p>I have two ideas for optimizations.</p>

<ul>
  <li>Never count the same tab twice</li>
  <li>Reduce allocations by using mutable data structures</li>
</ul>

<p>Here’s the complete code at the moment:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1	# repair_bytes_buffered_read_write.py
 2	import io
    
 3	def repair(input_file: str, output_file: str) -&gt; None:
 4	    with open(input_file, "rb") as fin_raw, open(output_file, "wb") as fout_raw:
 5	        with io.BufferedReader(fin_raw, buffer_size=1 &lt;&lt; 20) as fin, io.BufferedWriter(
 6	            fout_raw, buffer_size=256 * 1024
 7	        ) as fout:
    
 8	            # Start by copying header
 9	            header = fin.readline()
10	            fout.write(header)
    
11	            expected_tabs = header.count(b"\t")
    
12	            # Then, iterate over the lines, repairing as you go
13	            while True:
14	                line = fin.readline()
15	                if not line:
16	                    break
17	                line_tabs = line.count(b"\t")
18	                if line_tabs == expected_tabs:
19	                    fout.write(line)
20	                    continue
    
21	                # Line repair
22	                # Grab next line and see if it complements
23	                _need_to_write = True
24	                while line_tabs &lt; expected_tabs:
25	                    continuation_line = fin.readline()
26	                    if not continuation_line:
27	                        break
28	                    cline_tabs = continuation_line.count(b"\t")
29	                    if line_tabs + cline_tabs &lt;= expected_tabs:
30	                        line = line.rstrip(b"\n") + b" " + continuation_line
31	                        line_tabs += cline_tabs
32	                    else:
33	                        # Adding these lines would create a row with
34	                        # too many fields
35	                        fout.write(line)
36	                        fout.write(continuation_line)
37	                        _need_to_write = False
38	                        break
39	                if _need_to_write:
40	                    fout.write(line)
</code></pre></div></div>

<p>This line does a few things at once; there might be a better way:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>30	                        line = line.rstrip(b"\n") + b" " + continuation_line
</code></pre></div></div>

<p>We know that the line ends in a newline character, so it might be faster to use
<code class="language-plaintext highlighter-rouge">line[:-1]</code>. And instead of creating a new byte string, I’ll extend an existing
byte buffer, which is mutable.</p>

<p>The new version of this line is:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">buffer</span><span class="p">.</span><span class="n">pop</span><span class="p">(</span><span class="o">-</span><span class="mi">1</span><span class="p">)</span> <span class="c1"># remove the newline
</span><span class="nb">buffer</span><span class="p">.</span><span class="n">extend</span><span class="p">(</span><span class="sa">b</span><span class="s">" "</span> <span class="o">+</span> <span class="n">continuation_line</span><span class="p">)</span>
</code></pre></div></div>
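<p>As a sanity check (with made-up line data), the bytearray edit produces exactly the same bytes as the original expression:</p>

```python
line = b"1\talice\tthis is a\n"
continuation_line = b"multiline comment\t10\n"

# Original: allocate a brand-new bytes object
joined = line.rstrip(b"\n") + b" " + continuation_line

# New: edit a mutable bytearray in place
buffer = bytearray(line)
buffer.pop(-1)                       # drop the trailing newline
buffer.extend(b" " + continuation_line)

assert bytes(buffer) == joined
```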

<p>Performance is about the same, though, at least at this density of newlines.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 12:26:21	repair_bytes_buffered_read_write_bytearray	6.067266
2026-03-13 12:26:31	repair_bytes_buffered_read_write_bytearray	5.932649
2026-03-13 12:26:40	repair_bytes_buffered_read_write_bytearray	5.986914
</code></pre></div></div>

<p>And here’s the profile:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python profile_repair.py repair_bytes_buffered_read_write_bytearray.py
Profiling repair_bytes_buffered_read_write_bytearray on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         5951455 function calls in 9.120 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.716    1.716    9.120    9.120 /Users/charlie/tsv-repair/repair_bytes_buffered_read_write_bytearray.py:4(repair)
  1000000    2.518    0.000    2.518    0.000 {method 'count' of 'bytearray' objects}
  1000001    1.885    0.000    1.885    0.000 {method 'write' of '_io.BufferedWriter' objects}
  1237862    1.475    0.000    1.475    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1237861    0.608    0.000    0.608    0.000 {method 'extend' of 'bytearray' objects}
  1000000    0.399    0.000    0.399    0.000 {method 'clear' of 'bytearray' objects}
   237861    0.343    0.000    0.343    0.000 {method 'count' of 'bytes' objects}
        2    0.121    0.061    0.121    0.061 {built-in method _io.open}
   237860    0.056    0.000    0.056    0.000 {method 'pop' of 'bytearray' objects}
        4    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>I’m spending 0.4s clearing the byte array and 0.6s updating the byte array. I
don’t have much insight into the comparable numbers for allocation times, but
I’d guess this is about the same.</p>

<p>Avoiding extra work: generally no performance improvement, and the code was
harder to read, so I’m not going to keep these optimizations.</p>

<h3 id="update-2026-03-16">Update: 2026-03-16</h3>
<p>I bumped the number of newlines up to 0.2 (1 in 5 cells has a newline in it) and
found that, yes, the bytearray version of the code is a decent bit faster at
this newline density. The basic version takes 26 seconds, while the bytebuffer
only takes 19 seconds.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-16 09:58:58	repair_bytes_buffered_read_write	26.129120
2026-03-16 09:59:46	repair_bytes_buffered_read_write_bytearray	19.367004
2026-03-16 10:01:38	repair_bytes_buffered_read_write	25.511255
2026-03-16 10:02:09	repair_bytes_buffered_read_write_bytearray	19.135691
</code></pre></div></div>

<p>This is a 1 million row file with 120 columns, which means that for a 0.2
newline density, each row will have about <code class="language-plaintext highlighter-rouge">120 / 5 = 24</code> embedded newlines in it. Put
another way, each row of data will be spread across approximately 25 lines of
the file.</p>

<h2 id="multiprocessing">Multiprocessing</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li><del>Avoid copying and doing extra work</del></li>
  <li>Multiprocessing*</li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>Multiprocessing can be difficult to get right, and there’s no guarantee it’s a
speedup in all cases.</p>

<p>The idea is to chunk the file, start up a few processes, and assign each chunk
to a process. Assuming the processes successfully use all available cores on
your computer, you should see a multiplicative speedup.</p>
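<p>Mechanically, the chunking looks something like this sketch, for a file where any unquoted newline is a safe split point (which, as discussed below, our broken TSVs are not). The workers here just count lines, and the input file name is hypothetical:</p>

```python
import os
from multiprocessing import Pool

def chunk_ranges(path: str, n_chunks: int) -> list:
    """Split a file into byte ranges whose boundaries fall on newlines."""
    size = os.path.getsize(path)
    cuts = [0]
    with open(path, "rb") as f:
        for i in range(1, n_chunks):
            f.seek(i * size // n_chunks)
            f.readline()                     # skip forward to the next newline
            pos = min(f.tell(), size)
            cuts.append(max(pos, cuts[-1]))  # keep boundaries non-decreasing
    cuts.append(size)
    # drop empty ranges (possible for tiny files or long lines)
    return [(a, b) for a, b in zip(cuts, cuts[1:]) if a < b]

def count_newlines(args) -> int:
    """Worker: count newline bytes in one byte range of the file."""
    path, start, end = args
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(end - start).count(b"\n")

if __name__ == "__main__" and os.path.exists("large_file.tsv"):
    path = "large_file.tsv"                  # hypothetical input
    ranges = chunk_ranges(path, n_chunks=4)
    with Pool(len(ranges)) as pool:
        total = sum(pool.map(count_newlines, [(path, a, b) for a, b in ranges]))
    print(total)
```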

<p>But this TSV repair isn’t a map/reduce problem, so let’s take a step back to
consider whether multiprocessing will help at all. I could let each core write
its part of the TSV file to that core’s own file, rather than everyone writing
to the same file. This is essentially what tools like AWS Athena use to copy
large amounts of data. The Hive table format is friendly to splintered files for
the most part, and some teams use compaction after the fact to reduce the impact
of having more files to read later.</p>

<p>Multiprocessing and leaving the output in pieces would be the fastest way to
write it. But that’s no good for the tests, so while it’s an attractive idea, I
won’t be able to leave the files in pieces. Recombining the files would require
another full read/write of the data, and most likely that would cost me too much
time.</p>

<h3 id="update-2026-03-16-1">Update: 2026-03-16</h3>
<p>I returned to this today to see if I could implement the multiprocessing version
of the TSV repair. Interestingly enough, I found that this problem cannot be
efficiently multiprocessed in the way I thought it could.</p>

<p>A row of data is only defined by the number of tabs. To split the file, we have
to align each processor’s chunk of the input file on a row boundary so that each
partial file has only complete rows in it. If you have a table with 25 rows and
5 columns, then the available boundaries are after 0 tabs, 5 tabs, 10, 15, and
so on.</p>

<p>You cannot start in the middle of the file and find a row boundary in all cases.
If you happen to find a newline character followed by a line with <code class="language-plaintext highlighter-rouge">n_column</code> tabs,
you’ve found a valid boundary. But in a file with a high density of
newline characters, where each row is basically guaranteed to have one or more
improperly quoted newlines, it becomes almost impossible to figure out where a
row begins and ends.</p>

<p>Consider the case where every cell has a newline character in it. If you track
from the beginning of the file, you can correctly reassemble this dataset by
counting tabs. But, if you start anywhere in the middle, it becomes impossible
to decide where a row should begin and end. It’s a mass of alternating newlines
and tabs.</p>

<p>We can’t align on a row boundary without knowing how many tab characters have
come before the current <code class="language-plaintext highlighter-rouge">seek</code> position, and we can’t do that without indexing
all of the tabs in the file. That’s what we’re already doing in the basic
implementation of the TSV repair script, so I don’t see a way for
multiprocessing to be faster than sequential processing.</p>

<p>That does get me thinking, though: how do other tools handle this for valid
files? Imagine a TSV file that correctly quotes newline characters but has
alternating newlines and tabs as before. In fact, since we’re quoting, it’s
technically allowed to include <em>tabs</em> in the quoted fields as well. If this were
a CSV, you could substitute commas.</p>

<p>I’m going to use CSV format for clarity. Your worker might get assigned a <code class="language-plaintext highlighter-rouge">seek</code>
position that looks like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>,"
"Hello, world!","Nice to meet you"
</code></pre></div></div>

<p>Spicing it up a little with some newlines:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>,"
"Hello,
world!","
Nice to meet you"
</code></pre></div></div>

<p>Can you align on a boundary? The first characters <code class="language-plaintext highlighter-rouge">,"\n</code> are ambiguous. The
comma could be a quoted comma or a field delimiter. But we get more information
with the following <code class="language-plaintext highlighter-rouge">"Hello</code>. If the first double quote started a string, the
second would have to end one and be followed by a delimiter. Since the second
quote is not followed by a delimiter, it must be preceded by one – and it is (a
newline, which delimits rows). So now that we’ve identified that the first quote
is a <em>closing</em> quote, we can be sure that the newline that followed it was the
end of a row of data. This gives us enough information to say that <code class="language-plaintext highlighter-rouge">"Hello</code> is
the beginning of a row, and we can align from there.</p>

<p>Does this generally hold? Is it economical? I found a paper that discusses
distributed CSV parsing, with examples that look a lot like my own examples
above: <a href="https://badrish.net/papers/dp-sigmod19.pdf">“Speculative Distributed CSV Data Parsing for Big Data
Analytics”</a>. The authors bring up a
good point, which is that a production parser has to recognize invalid CSV files
as well as valid ones. Here’s the abstract:</p>

<blockquote>
  <p>There has been a recent flurry of interest in providing query capability on
raw data in today’s big data systems. These raw data must be parsed before
processing or use in analytics. Thus, a fundamental challenge in distributed
big data systems is that of efficient parallel parsing of raw data. The
difficulties come from the inherent ambiguity while independently parsing
chunks of raw data without knowing the context of these chunks. Specifically,
it can be difficult to find the beginnings and ends of fields and records in
these chunks of raw data. To parallelize parsing, this paper proposes a
speculation-based approach for the CSV format, arguably the most commonly used
raw data format. Due to the syntactic and statistical properties of the
format, speculative parsing rarely fails and therefore parsing is efficiently
parallelized in a distributed setting. Our speculative approach is also
robust, meaning that it can reliably detect syntax errors in CSV data. We
experimentally evaluate the speculative, distributed parsing approach in
Apache Spark using more than 11,000 real-world datasets, and show that our
parser produces significant performance benefits over existing methods.</p>
</blockquote>

<h2 id="memory-mapping">Memory mapping</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li><del>Avoid copying and doing extra work</del></li>
  <li><del>Multiprocessing</del></li>
  <li>Memory map the file</li>
  <li>PyPy</li>
</ol>

<p>Memory mapping can be helpful in some cases, but it’s not always a clear
performance win. But before I get to that, what is memory mapping?</p>

<p><code class="language-plaintext highlighter-rouge">mmap</code> is a syscall analogous to <code class="language-plaintext highlighter-rouge">read</code>, but with some architectural
differences. Normally when you <code class="language-plaintext highlighter-rouge">read</code> a set of bytes from a file, the kernel
copies those bytes into a kernel buffer, then copies them into your user buffer.
When you request data, the OS schedules a request with the disk driver to get
those bytes, and everything happens “buffer-to-buffer”.</p>

<p><code class="language-plaintext highlighter-rouge">mmap</code>, on the other hand, maps the file’s bytes into your virtual
memory address space. When you read a set of bytes from an <code class="language-plaintext highlighter-rouge">mmap</code>’d file, you get
the exact same data, but the channels it goes through are very different.
Instead of copying buffer-to-buffer, the first access to a page triggers a page
fault, and the kernel maps the whole page (or several pages, depending on
prefetch configuration) into your address space, with no extra copy into a
user-space buffer in between.</p>

<p>To oversimplify a bit, <code class="language-plaintext highlighter-rouge">read</code>ing causes a double copy, while <code class="language-plaintext highlighter-rouge">mmap</code>ing uses a
single copy. Great! you think.</p>

<p>The tradeoff is that while <code class="language-plaintext highlighter-rouge">read</code> involves more copying, <code class="language-plaintext highlighter-rouge">mmap</code> involves more
syscalls and page faults. Here’s how one Stack answerer put it:</p>

<blockquote>
  <p>So basically you have the following comparison to determine which is faster
for a single read of a large file: Is the extra per-page work implied by the
mmap approach more costly than the per-byte work of copying file contents from
kernel to user space implied by using read()?</p>

  <p><a href="https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks">https://stackoverflow.com/questions/45972/mmap-vs-reading-blocks</a></p>
</blockquote>

<p><code class="language-plaintext highlighter-rouge">mmap</code> can have benefits when you’re reading a file multiple times or doing
non-sequential reads, but in my case the page fault overhead is probably too
high on a cold run. Let’s try it anyway.</p>

<p>The basic implementation is:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>
<span class="kn">import</span> <span class="nn">mmap</span>

<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin_raw</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">mmap</span><span class="p">.</span><span class="n">mmap</span><span class="p">(</span><span class="n">fin_raw</span><span class="p">.</span><span class="n">fileno</span><span class="p">(),</span> <span class="mi">0</span><span class="p">,</span> <span class="n">access</span><span class="o">=</span><span class="n">mmap</span><span class="p">.</span><span class="n">ACCESS_READ</span><span class="p">)</span> <span class="k">as</span> <span class="n">mm_in</span><span class="p">:</span>
            <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span><span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>
                <span class="p">...</span>
</code></pre></div></div>

<p>I had to ditch the buffered reader, at least the one available from <code class="language-plaintext highlighter-rouge">io</code>,
because a memory-mapped file isn’t a raw stream, so <code class="language-plaintext highlighter-rouge">io.BufferedReader</code> can’t
wrap it. In any case, buffering wouldn’t affect the number of page faults, which
is where mmap spends its time. The performance isn’t great.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 14:23:06	repair_bytes_buffered_read_write_mmap	27.898272
</code></pre></div></div>

<p>Looking at the profile is interesting:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes_buffered_read_write_mmap.py  
Profiling repair_bytes_buffered_read_write_mmap on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713595 function calls in 15.487 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.383    1.383   15.487   15.487 /Users/charlie/tsv-repair/repair_bytes_buffered_read_write_mmap.py:5(repair)
  1237862    8.700    0.000    8.700    0.000 {method 'readline' of 'mmap.mmap' objects}
  1237861    2.974    0.000    2.974    0.000 {method 'count' of 'bytes' objects}
  1000001    2.158    0.000    2.158    0.000 {method 'write' of '_io.BufferedWriter' objects}
        2    0.115    0.057    0.115    0.057 {built-in method _io.open}
   237860    0.104    0.000    0.104    0.000 {method 'rstrip' of 'bytes' objects}
        1    0.053    0.053    0.053    0.053 {method '__exit__' of 'mmap.mmap' objects}
        3    0.001    0.000    0.001    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of '_io.BufferedReader' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Time spent in <code class="language-plaintext highlighter-rouge">count</code> matches previous runs, but as expected we’re spending far
more time reading. You’ll also notice that the runtime was roughly halved when I
ran the profile – I believe the reason is that the pages of the large file were
still in the page cache. There’s a nifty trick you can use to clear the page
cache, though. On macOS:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sync &amp;&amp; sudo purge
</code></pre></div></div>

<p>After doing this, the profile is again comparable to the first one.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes_buffered_read_write_mmap.py
Profiling repair_bytes_buffered_read_write_mmap on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713595 function calls in 28.173 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.542    1.542   28.173   28.173 /Users/charlie/projects/atlas/tsv-repair/repair_bytes_buffered_read_write_mmap.py:5(repair)
  1237862   20.947    0.000   20.947    0.000 {method 'readline' of 'mmap.mmap' objects}
  1237861    3.075    0.000    3.075    0.000 {method 'count' of 'bytes' objects}
  1000001    2.317    0.000    2.317    0.000 {method 'write' of '_io.BufferedWriter' objects}
        2    0.123    0.061    0.123    0.061 {built-in method _io.open}
   237860    0.108    0.000    0.108    0.000 {method 'rstrip' of 'bytes' objects}
        1    0.061    0.061    0.061    0.061 {method '__exit__' of 'mmap.mmap' objects}
        3    0.001    0.000    0.001    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        1    0.000    0.000    0.000    0.000 {method 'fileno' of '_io.BufferedReader' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Now it’s clear we’re really taking a performance hit with <code class="language-plaintext highlighter-rouge">mmap.readline()</code>.</p>

<p>We’re not getting any performance benefit from mapping our file into memory –
page faults are killing performance when we only read through the file once.</p>

<h2 id="pypy">PyPy</h2>
<ol>
  <li><del>We don’t need to decode utf-8 to count tabs</del></li>
  <li><del>Buffer writes</del></li>
  <li><del>Buffer reads</del></li>
  <li><del>Avoid copying and doing extra work</del></li>
  <li><del>Multiprocessing</del></li>
  <li><del>Memory map the file</del></li>
  <li>PyPy</li>
</ol>

<p>This is a bit far afield, but I thought I’d try PyPy, a JIT Python interpreter
that can give a good speedup in some cases. Through Homebrew I was able to get
PyPy3.10, which is a little old, and in any case PyPy is best suited to
calculation-intensive work, not IO-bound work. Indeed, the result was a 4x
slowdown, maybe due to I/O improvements in more recent versions of Python.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 15:58:39	repair_bytes_buffered_read_write	22.078622
2026-03-13 15:59:08	repair_bytes_buffered_read_write	20.995271
</code></pre></div></div>

<h2 id="conclusion">Conclusion</h2>
<p>The winning script involved only simple tweaks on my original attempt.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kn">import</span> <span class="nn">io</span>


<span class="k">def</span> <span class="nf">repair</span><span class="p">(</span><span class="n">input_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">,</span> <span class="n">output_file</span><span class="p">:</span> <span class="nb">str</span><span class="p">)</span> <span class="o">-&gt;</span> <span class="bp">None</span><span class="p">:</span>
    <span class="k">with</span> <span class="nb">open</span><span class="p">(</span><span class="n">input_file</span><span class="p">,</span> <span class="s">"rb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin_raw</span><span class="p">,</span> <span class="nb">open</span><span class="p">(</span><span class="n">output_file</span><span class="p">,</span> <span class="s">"wb"</span><span class="p">)</span> <span class="k">as</span> <span class="n">fout_raw</span><span class="p">:</span>
        <span class="k">with</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedReader</span><span class="p">(</span><span class="n">fin_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">1</span> <span class="o">&lt;&lt;</span> <span class="mi">20</span><span class="p">)</span> <span class="k">as</span> <span class="n">fin</span><span class="p">,</span> <span class="n">io</span><span class="p">.</span><span class="n">BufferedWriter</span><span class="p">(</span>
            <span class="n">fout_raw</span><span class="p">,</span> <span class="n">buffer_size</span><span class="o">=</span><span class="mi">256</span> <span class="o">*</span> <span class="mi">1024</span>
        <span class="p">)</span> <span class="k">as</span> <span class="n">fout</span><span class="p">:</span>

            <span class="c1"># Start by copying header
</span>            <span class="n">header</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
            <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">header</span><span class="p">)</span>

            <span class="n">expected_tabs</span> <span class="o">=</span> <span class="n">header</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>

            <span class="c1"># Then, iterate over the lines, repairing as you go
</span>            <span class="k">while</span> <span class="bp">True</span><span class="p">:</span>
                <span class="n">line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
                <span class="k">if</span> <span class="ow">not</span> <span class="n">line</span><span class="p">:</span>
                    <span class="k">break</span>
                <span class="n">line_tabs</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
                <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">==</span> <span class="n">expected_tabs</span><span class="p">:</span>
                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                    <span class="k">continue</span>

                <span class="c1"># Line repair
</span>                <span class="c1"># Grab next line and see if it complements
</span>                <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">True</span>
                <span class="k">while</span> <span class="n">line_tabs</span> <span class="o">&lt;</span> <span class="n">expected_tabs</span><span class="p">:</span>
                    <span class="n">continuation_line</span> <span class="o">=</span> <span class="n">fin</span><span class="p">.</span><span class="n">readline</span><span class="p">()</span>
                    <span class="k">if</span> <span class="ow">not</span> <span class="n">continuation_line</span><span class="p">:</span>
                        <span class="k">break</span>
                    <span class="n">cline_tabs</span> <span class="o">=</span> <span class="n">continuation_line</span><span class="p">.</span><span class="n">count</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\t</span><span class="s">"</span><span class="p">)</span>
                    <span class="k">if</span> <span class="n">line_tabs</span> <span class="o">+</span> <span class="n">cline_tabs</span> <span class="o">&lt;=</span> <span class="n">expected_tabs</span><span class="p">:</span>
                        <span class="n">line</span> <span class="o">=</span> <span class="n">line</span><span class="p">.</span><span class="n">rstrip</span><span class="p">(</span><span class="sa">b</span><span class="s">"</span><span class="se">\n</span><span class="s">"</span><span class="p">)</span> <span class="o">+</span> <span class="sa">b</span><span class="s">" "</span> <span class="o">+</span> <span class="n">continuation_line</span>
                        <span class="n">line_tabs</span> <span class="o">+=</span> <span class="n">cline_tabs</span>
                    <span class="k">else</span><span class="p">:</span>
                        <span class="c1"># Adding these lines would create a row with
</span>                        <span class="c1"># too many fields
</span>                        <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
                        <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">continuation_line</span><span class="p">)</span>
                        <span class="n">_need_to_write</span> <span class="o">=</span> <span class="bp">False</span>
                        <span class="k">break</span>
                <span class="k">if</span> <span class="n">_need_to_write</span><span class="p">:</span>
                    <span class="n">fout</span><span class="p">.</span><span class="n">write</span><span class="p">(</span><span class="n">line</span><span class="p">)</span>
</code></pre></div></div>

<p>Benchmark (after purging):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2026-03-13 15:28:01	repair_bytes_buffered_read_write	6.520668
</code></pre></div></div>

<p>Profile (also after purging):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 profile_repair.py repair_bytes_buffered_read_write.py
Profiling repair_bytes_buffered_read_write on large_file.tsv...

--- Top 30 hotspots (sort: cumulative) ---

         3713594 function calls in 7.851 seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    1.228    1.228    7.850    7.850 /Users/charlie/projects/atlas/tsv-repair/repair_bytes_buffered_read_write.py:4(repair)
  1237861    2.875    0.000    2.875    0.000 {method 'count' of 'bytes' objects}
  1237862    1.813    0.000    1.813    0.000 {method 'readline' of '_io.BufferedReader' objects}
  1000001    1.770    0.000    1.770    0.000 {method 'write' of '_io.BufferedWriter' objects}
   237860    0.094    0.000    0.094    0.000 {method 'rstrip' of 'bytes' objects}
        2    0.070    0.035    0.070    0.035 {built-in method _io.open}
        4    0.000    0.000    0.000    0.000 {method '__exit__' of '_io._IOBase' objects}
        1    0.000    0.000    0.000    0.000 {method 'disable' of '_lsprof.Profiler' objects}
        2    0.000    0.000    0.000    0.000 /Users/charlie/.pyenv/versions/3.13.8/lib/python3.13/pathlib/_local.py:227(__str__)
</code></pre></div></div>

<p>Best optimization techniques:</p>

<ul>
  <li><strong>Buffered reads and writes</strong> accounted for the majority of the speedup</li>
  <li><strong>Working in bytes</strong> saved ~1 sec on utf-8 decoding. There might be more savings
if your data has a higher density of multi-byte characters.</li>
</ul>
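<p>As a quick illustration of the bytes point, here’s a minimal sketch (the row
is made up) showing that counting the delimiter on raw bytes gives the same
answer as decoding first, while skipping the per-line <code class="language-plaintext highlighter-rouge">str</code> allocation and
utf-8 decode entirely:</p>

```python
# Hedged micro-example: counting tabs on raw bytes vs decoded text.
# Both give the same answer; the bytes version skips the decode step.
line = "0\tcharlie\tnormal\t20\n".encode("utf-8")

tabs_bytes = line.count(b"\t")               # no decode needed
tabs_str = line.decode("utf-8").count("\t")  # pays for a decode first

assert tabs_bytes == tabs_str == 3
```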

<p>Failed optimizations (no effect or negative effect):</p>

<ul>
  <li><strong>Memory mapping.</strong> The page fault overhead was too significant when we are
doing a sequential read through the file.</li>
  <li><strong>Multiprocessing.</strong> The savings we gain from splitting up the file are eaten
up by the time it takes to reassemble the individual outputs. This isn’t a
map/reduce job, and we can’t leave the files disassembled, so no dice for this
version of the problem.</li>
  <li><strong>Tweaking data structures.</strong> At this level of newline density, the code
doesn’t spend enough time in the newline-repair loop to see any effect of
optimizing data structures to reduce copying.</li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>This problem space isn’t as rich as the 1BR challenge, so there won’t be as many exotic tricks, but it was still fun to work through it. For inspiration I looked back through <a href="https://www.youtube.com/watch?v=utTaPW32gKY">Doug Mercer’s video</a> on 1BR in Python. It’s a great watch and has a number of nifty tricks. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>&quot;Don&apos;t automate complexity&quot;</title>
      <link>https://charlie-gallagher.github.io/2026/03/02/dont-automate-complexity.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/03/02/dont-automate-complexity.html</guid>
      <pubDate>Mon, 02 Mar 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>You ever hear someone say “We can just automate it,” and feel almost certain
it’s a bad idea? I’m always fighting with this. And I can’t always explain why I
feel like it’s a bad idea to automate certain processes.</p>

<p>And then on Saturday I was paging through <em>Implementing Lean Software
Development</em> by Mary and Tom Poppendieck in a thrift store and found this short
section:</p>

<blockquote>
  <p>Don’t automate complexity.</p>

  <p>We are not helping our customers if we simply automate a complex or messy
process; we would simply be encasing a process filled with waste in a straight
jacket of software complexity. Any process that is a candidate for automation
should first be clarified and simplified, possibly even removing existing
automation. Only then can the process be clearly understood and the leverage
points for effective automation identified.</p>
</blockquote>


        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>You ever hear someone say “We can just automate it,” and feel almost certain
it’s a bad idea? I’m always fighting with this. And I can’t always explain why I
feel like it’s a bad idea to automate certain processes.</p>

<p>And then on Saturday I was paging through <em>Implementing Lean Software
Development</em> by Mary and Tom Poppendieck in a thrift store and found this short
section:</p>

<blockquote>
  <p>Don’t automate complexity.</p>

  <p>We are not helping our customers if we simply automate a complex or messy
process; we would simply be encasing a process filled with waste in a straight
jacket of software complexity. Any process that is a candidate for automation
should first be clarified and simplified, possibly even removing existing
automation. Only then can the process be clearly understood and the leverage
points for effective automation identified.</p>
</blockquote>


        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>More consistent hashing, or learning math for the advanced in age</title>
      <link>https://charlie-gallagher.github.io/2026/02/25/math-and-comp-sci.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/25/math-and-comp-sci.html</guid>
      <pubDate>Wed, 25 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>I wrote my last post about consistent hashing, which has really stuck with me.
The problem statement is so simple but the solution seems at a glance to be
counter-intuitive. Sure, hashing both the file key and the node name is easy to
say and implement. But that’s not enough for me to believe that it’s sufficient
to solve the problem of data partitioning, nor that it’s required. Isn’t there
any simpler version? An alternative?</p>

<p>Consistent hashing solves a generic problem – place keys on nodes, retrieve
keys from the correct node. At least, that’s the apparent problem. But after
reading the paper that suggested consistent hashing, I’ve started to appreciate
the subtlety of the solution.</p>

<p>The paper describes the following desirable properties:</p>

<ul>
  <li>Balance. Distribute items evenly across buckets.</li>
  <li>Monotonicity. If a new bucket is added, items may move into the new bucket,
but no item moves between two buckets that were already there.</li>
  <li>Spread. The degree to which clients disagree about where a data item belongs
when they don’t agree about which buckets are available.</li>
  <li>Load. Given some set of clients with different knowledge about what buckets
are available, the max load at any particular bucket.</li>
</ul>

<p>Three out of the four properties concern the behavior of the system when storage
nodes are unstable. Monotonicity minimizes disruptions when nodes are added or
removed from the cluster. Spread and load define how storage nodes are treated
when clients have incomplete information about them. The other property,
balance, is the one I originally thought was interesting about consistent
hashing.</p>

<p>Lots of functions could have balance; any even partitioning of the key space
will work. The other criteria are harder to achieve.</p>

<p>Some modeling will help explore the problem space. Data items could be modeled
as a set or as a sequence. If we model them as a
sequence, we can assign items to buckets using, for example, round-robin
algorithms, but those are unavailable if we consider data items an unordered
set. In any case, an algorithm that depends on a specific sequence of data items
is more difficult to use in a distributed, imperfect-information system. A good
solution will minimize state, and so if we can, we should use an algorithm that
works on a set.</p>

<p>We could say that there is some set <em>K</em> of all possible data keys, and that our
system must handle some reasonably-sized subset of <em>K</em>. So, given any subset of
<em>K</em>, our solution must satisfy the balance, monotonicity, spread, and load
properties. If you know the domain well, you could restrict this more and claim,
for example, that keys are expected to be alphabetically evenly distributed. If
you can’t make assumptions about the distribution of keys, you have to assume
keys can be any subset of <em>K</em>. Prefixes stop working because you can always find
a subset of <em>K</em> such that all elements have the same n-length prefix for any n.</p>

<p>Once you lay out your criteria, you try to satisfy them. A regular hash scheme
where you hash keys modulo the number of storage nodes has excellent balance but
poor monotonicity, spread, and load. It is a bad scheme under imperfect
conditions. On the other hand, partitioning on prefix (with some adjustments)
has excellent monotonicity, spread, and load, but the keys will most likely not
be evenly distributed.</p>
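<p>To put a rough number on the mod-N scheme’s poor monotonicity, here’s a
hedged sketch (synthetic keys, md5 standing in for whatever hash a real system
would use) counting how many keys get remapped when a fifth bucket is added:</p>

```python
# Hedged sketch: measure key churn under the naive hash-mod-N scheme.
# md5 is just a convenient stable hash; the keys are synthetic.
import hashlib

def bucket(key: str, n: int) -> int:
    """Assign a key to one of n buckets by hashing modulo n."""
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % n

keys = [f"key-{i}" for i in range(10_000)]

# Grow the cluster from 4 buckets to 5 and count remapped keys.
moved = sum(bucket(k, 4) != bucket(k, 5) for k in keys)
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 80%
```

<p>A key survives the resize only when its hash lands on the same value modulo
both 4 and 5, so most keys move – exactly the disruption monotonicity is meant
to rule out.</p>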

<p>The traditional consistent hash scheme (hash the keys, hash the storage node
names, assign keys to successors) actually doesn’t have great spread or load.
It’s too likely that some node will land close to another one, causing one node
to shoulder the burden. You have to use <em>virtual nodes</em>, which add some
complexity to the implementation but have nicer properties.</p>
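<p>Here’s a minimal sketch of that ring with virtual nodes; the node names and
replica count are illustrative, not from the paper. Each node is hashed onto
the circle many times, and a key belongs to its clockwise successor, so adding
a node only claims keys from its new neighbors:</p>

```python
# Hedged sketch of the traditional consistent-hash ring with virtual
# nodes; node names and the replica count are made up for illustration.
import bisect
import hashlib

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, replicas=64):
        # Each node appears `replicas` times on the circle, which
        # smooths out uneven gaps (better balance, spread, and load).
        self.points = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self.hashes = [h for h, _ in self.points]

    def lookup(self, key: str) -> str:
        # A key belongs to its clockwise successor; wrap past the end.
        i = bisect.bisect(self.hashes, ring_hash(key)) % len(self.points)
        return self.points[i][1]

old = Ring(["node-a", "node-b", "node-c"])
new = Ring(["node-a", "node-b", "node-c", "node-d"])
keys = [f"key-{i}" for i in range(1000)]

# Monotonicity: any key that moves can only move to the new node.
for k in keys:
    if old.lookup(k) != new.lookup(k):
        assert new.lookup(k) == "node-d"
```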

<p>The paper uses the abstraction of a view, <em>V</em>, to describe what clients know
about storage nodes. The abstraction makes sense if clients themselves route
requests based on the data key, but does it work when clients just send
everything to a load balancer that forwards requests to a non-busy storage node
for further routing? I think it works. A view could also represent what
different storage nodes know about each other, and since we have some control
over what the different storage nodes know (via gossip) this changes the game a
bit. We can, like Chord, structure the views intentionally to guarantee certain
routing properties. You could still call each node’s perspective a <em>view</em> and
evaluate it on the criteria from the original paper. You could also reframe it
as a routing table with certain properties, though. Is that more or less useful?
It’s certainly more specific, and we need to be specific to gain a performance
edge. But it’s less generic, and we might be missing out on a “greater truth.”</p>

<p>The paper leaves out a few practical criteria. Clients need to be able to
deterministically place and search for data items on particular nodes. The
function needs to be efficient. If routing is necessary<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, nodes in the system
need to be able to route a request to the correct bucket efficiently.</p>

<hr />

<p>I let this discussion get a bit mathematical to match the tone of the paper,
but also because math has been on my mind lately. My math skills have atrophied
since I left college. I haven’t even done a good proof in years.</p>

<p>But I’ve been reading comp-sci papers lately, and so I’ve started to turn that
part of my brain back on. After reflecting on it the past few weeks, I’ve come
to think that programmers and mathematicians are in a very similar business. The
history of math, just like programming, has been a history of discovering which
abstractions, notions, and notations have useful properties for solving classes
of problems.</p>

<p>I like theory ok, but perfect mathematical reasoning is a burden in programming.
If you hold yourself to the standard of mathematical proof, then you’ll spend
forever modeling your system only to find that your proof is stretching for
pages and pages, and it still doesn’t account for important things like
differences in computer architecture and maintenance cost. In practical systems,
you approximate and experiment. Software systems are <a href="https://oxide-and-friends.transistor.fm/episodes/grown-up-zfs-data-corruption-bug">filled with magic
numbers</a>
found by experimentation.</p>

<p>Even still, I regret not being able to take more math classes in college,
because the mathematical thought process is so similar to the programmer’s. You
find the abstraction with the right level of power for solving the problem
you’re interested in. You wouldn’t use ffmpeg to edit videos for the same reason
you wouldn’t teach children basic algebra using group theory. The programmer
eventually learns that ffmpeg underlies everything, and I guess one way of
looking at work in mathematics is the process of discovering the lowest level
components that explain how the higher level components work. It’s reverse
engineering Nature.</p>

<p>I’ve thought about starting to learn math again, at least to struggle with
problems every now and then. But how do you learn math at this point? What math
do you learn? I spent some time thinking about what you get out of a college
math degree.</p>

<p>There are a couple parts. There are the capabilities to reframe problems,
abstract things, unify, generalize, and simplify. Then the ability to write good
proofs and communicate with other mathematicians. And then there are the
subjects, the tradition, and the important results. The universal processes of
all math fields – abstracting, reframing, and so on – are my favorite parts. I
use those skills every day in programming and systems design. Proofs aren’t too
bad. But I’m lacking majorly in the tradition.</p>

<p>I have that in common with another group of math enthusiasts, child prodigies.
How do they cope? In <em>The Man Who Only Loved Numbers</em> about Paul Erdos, Paul
Hoffman says,</p>

<blockquote>
  <p>Perfect numbers and friendly numbers are among the areas of mathematics in
which child prodigies tend to show their stuff. Like chess and music, such
areas do not require much technical expertise. No child prodigies exist among
historians or legal scholars, because years are needed to master those
disciplines. A child can learn the rules of chess in a few minutes, and native
ability takes over from there. So it is with areas of mathematics like these,
which are aspects of elementary number theory (the study of the integers),
graph theory, and combinatorics (problems involving the counting and
classifying of objects). You can easily explain prime numbers, perfect
numbers, and friendly numbers to a child, and he or she can start playing
around with them and exploring their properties. Many areas of mathematics,
however, require technical expertise which is acquired over years of
assimilating definitions and previous results.</p>

  <p>p. 48</p>
</blockquote>

<p>And just as I was feeling better about the possibility that I could consider
myself “kinda good at math” without knowing the integral of 1/x, I saw on HN
yesterday a paper about <a href="https://news.ycombinator.com/item?id=47123689">Terence Tao, age 7</a><sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.
Sure enough, he was good at everything. “He has a prodigious long-term memory
for mathematical definitions, proofs and ideas with which has become
acquainted,” writes the author of the paper M. A. Ken Clements. He mentions in
another part of the paper, “Not only did he have an astounding grasp
of algebraic definitions, for someone who was still seven years old, but I was
amazed at how he used sophisticated mathematical language freely.” He was the
whole package at 7 years old. Widely read, good at math communication, and a
natural problem solver. The paper is a great read.</p>

<p>As an aside, I also learned how to evaluate a continued fraction from Terence in
this paper, and I’m wigging out over how programmerly the answer is. Here’s the
whole section:</p>

<p><img src="/assets/images/2026-math-and-comp-sci/ttao-continued-fraction.png" alt="Terence Tao solves a continued fraction at age 7" /></p>

<p>It’s a recursive problem, so you introduce x as a recursive definition of the
continued fraction. Then it looks just like a quadratic as Terence’s mom
suggested. Multiply both sides by x, rearrange, and solve for the positive
root. Of course it wouldn’t work in a program since it recurses infinitely, but
it has the same feel as any recursive function.</p>
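<p>Since the image doesn’t come through in text, here’s a sketch assuming, for
illustration, the classic all-ones fraction (the one Terence solved may
differ). The same recursive trick applies: name the whole fraction x, notice
that x reappears inside itself, and solve the resulting quadratic.</p>

```python
# Hedged sketch: assume the all-ones fraction x = 1 + 1/(1 + 1/(1 + ...)).
# Writing x = 1 + 1/x and rearranging gives x**2 - x - 1 = 0; the
# positive root is the golden ratio. A program can't recurse forever,
# but a truncated, iterative version converges on the same value:
x = 1.0
for _ in range(50):
    x = 1 + 1 / x

golden = (1 + 5 ** 0.5) / 2
assert abs(x - golden) < 1e-9
```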

<p>Child prodigies get the problems and the intuition for mathematical constructs.
But I think I have a leg up in terms of appreciating the <a href="https://en.wikipedia.org/wiki/The_Mathematical_Experience">experience of
mathematics</a>. The
experience of math is the mind-altering effort to grasp the importance of a
pattern and what’s required to state the problem in “simplest” terms. The
tradition of math on the other hand is the set of constructs, patterns, results,
and abstractions that have been found to be substantial to the history of
mathematics, both theoretical and applied. One isn’t too useful without the
other. Sure I can think really hard about some problem, but without the
tradition of math, I can’t build on any existing work. Like a programmer who
doesn’t know what libraries are out there, and so has to build everything from
scratch.</p>

<p>I wish the author of the paper on Terence Tao had asked him what he thought
mathematicians did. He knew quite a few of them. Did he already appreciate that
math isn’t just given in books, but thought, rethought, argued about, and
revised? (He probably did.)</p>

<hr />

<p>Like in any tradition, it’s hard to jump into mathematics. The most important
results of the last few decades are built on subtleties, arguments, tradeoffs,
and alternatives that the casual student will never see. In my econometrics
class, we used matrices a bit for linear regressions. After a short introduction
about what matrices represent and how they work, we started on some definitions
like how matrix multiplication is defined. It was elusive to me at the time why
you would (a) have a non-commutative definition, and (b) do multiplication by
lining up row and column values pairwise. Who decided this, and why? But sure
enough, doing this led to the results we needed, so I put the questions in the
back pocket for later.</p>

<p>We trade off power for understanding, just as you don’t need to really know what
a socket is if you take the usage examples for granted. But I think it’s a
mistake to leave it there, to never encourage a student to figure out something
for themselves. Trying for a few days to derive some “obvious” idea is the only
way to appreciate the difficulty of coming up with good math abstractions. I
wasn’t gifted in math, but I also wasn’t encouraged to try to understand <em>why</em>
math is being taught one way or another. And I think that was a scholastic miss
on the curriculum’s part.</p>

<p>One of my favorite essays from Jon Bentley’s <em>Programming Pearls</em> (technically
this one is from <em>More Programming Pearls</em>) is about writing your own profiler,
and in the end he writes one for Awk. It seemed utterly impossible to me to
write a profiler, but I realized I had never even considered the problem space.
<em>Why not write a profiler?</em> And why not try to find out for yourself what are
the properties of a connected vs partitioned graph of storage nodes? Why not try
to think up a new API for a threading library to rival pthreads? A bad idea can
be discussed and expanded on, but you can’t do anything with no idea at all.</p>

<p>These days I’m pulled towards low-level programming, myself. I like the history
and the forgotten alternatives, the influential ideas that faded and yet
contributed to modern programming. But I also never forget that my job is to
build effective software, even if I can’t take the time to appreciate some tech
that I find a bit mysterious and interesting. I just have to file it away until
I have some spare time.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Routing was not necessary in the original paper, because the paper was concerned with cache nodes. Cache misses are a bit wasteful but not catastrophic. It’s enough to minimize them, not make them impossible. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I’d skip the comment section. It’s mostly people flexing their own child prodigiousness and one thread about eugenics (“this proves to me that biological intelligence hasn’t nearly reached its peak. If we select for pure intelligence, biological brains can get much smarter”). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>I wrote my last post about consistent hashing, which has really stuck with me.
The problem statement is so simple but the solution seems at a glance to be
counter-intuitive. Sure, hashing both the file key and the node name is easy to
say and implement. But that’s not enough for me to believe that it’s sufficient
to solve the problem of data partitioning, nor that it’s required. Isn’t there
any simpler version? An alternative?</p>

<p>Consistent hashing solves a generic problem – place keys on nodes, retrieve
keys from the correct node. At least, that’s the apparent problem. But after
reading the paper that suggested consistent hashing, I’ve started to appreciate
the subtlety of the solution.</p>

<p>The paper describes the following desirable properties:</p>

<ul>
  <li>Balance. Distribute items evenly across buckets.</li>
  <li>Monotonicity. If a new bucket is added, items may move into the new bucket,
but no item moves between two buckets that were already there.</li>
  <li>Spread. The degree to which clients disagree about where a data item belongs
when they don’t agree about which buckets are available.</li>
  <li>Load. Given some set of clients with different knowledge about what buckets
are available, the max load at any particular bucket.</li>
</ul>

<p>Three out of the four properties concern the behavior of the system when storage
nodes are unstable. Monotonicity minimizes disruptions when nodes are added or
removed from the cluster. Spread and load define how storage nodes are treated
when clients have incomplete information about them. The other property,
balance, is the one I originally thought was interesting about consistent
hashing.</p>

<p>Lots of functions could have balance; any even partitioning of the key space
will work. The other criteria are harder to achieve.</p>

<p>Some modeling will help explore the problem space. The data could be described
as a set of items or as a sequence of items. If we model it as a sequence, we
can assign items to buckets using, for example, a round-robin algorithm, but
that option disappears if we consider the data an unordered set. In any case, an
algorithm that depends on a specific sequence of data items is more difficult to
use in a distributed, imperfect-information system. A good solution will
minimize state, so if we can, we should use an algorithm that works on a
set.</p>

<p>We could say that there is some set <em>K</em> of all possible data keys, and that our
system must handle some reasonably-sized subset of <em>K</em>. So, given any subset of
<em>K</em>, our solution must satisfy the balance, monotonicity, spread, and load
properties. If you know the domain well, you could restrict this more and claim,
for example, that keys are expected to be alphabetically evenly distributed. If
you can’t make assumptions about the distribution of keys, you have to assume
keys can be any subset of <em>K</em>. Prefixes stop working because you can always find
a subset of <em>K</em> such that all elements have the same n-length prefix for any n.</p>

<p>Once you lay out your criteria, you try to satisfy them. A regular hash scheme
where you hash keys modulo the number of storage nodes has excellent balance but
poor monotonicity, spread, and load. It is a bad scheme under imperfect
conditions. On the other hand, partitioning on prefix (with some adjustments)
has excellent monotonicity, spread, and load, but the keys will most likely not
be evenly distributed.</p>
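
<p>The modulo scheme’s poor monotonicity is easy to demonstrate. Here’s a small
sketch (the hash function and key set are arbitrary choices of mine) that counts
how many keys change buckets when an eleventh bucket is added:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import hashlib

def bucket(key, n_buckets):
    # Plain hash-modulo placement.
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return h % n_buckets

keys = [f"key-{i}" for i in range(10_000)]
before = {k: bucket(k, 10) for k in keys}
after = {k: bucket(k, 11) for k in keys}

moved = sum(1 for k in keys if before[k] != after[k])
print(f"{moved / len(keys):.0%} of keys moved")  # roughly 9 in 10 keys move
</code></pre></div></div>

<p>Only about one key in eleven keeps its bucket (a key stays put only when its
hash gives the same remainder mod 10 and mod 11), so nearly the whole data set
migrates for a single membership change.</p>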

<p>The traditional consistent hash scheme (hash the keys, hash the storage node
names, assign keys to successors) actually doesn’t have great spread or load.
It’s too likely that some node will land close to another one, causing one node
to shoulder the burden. You have to use <em>virtual nodes</em>, which add some
complexity to the implementation but have nicer properties.</p>

<p>The paper uses the abstraction of a view, <em>V</em>, to describe what clients know
about storage nodes. The abstraction makes sense if clients themselves route
requests based on the data key, but does it work when clients just send
everything to a load balancer that forwards requests to a non-busy storage node
for further routing? I think it works. A view could also represent what
different storage nodes know about each other, and since we have some control
over what the different storage nodes know (via gossip) this changes the game a
bit. We can, like Chord, structure the views intentionally to guarantee certain
routing properties. You could still call each node’s perspective a <em>view</em> and
evaluate it on the criteria from the original paper. You could also reframe it
as a routing table with certain properties, though. Is that more or less useful?
It’s certainly more specific, and we need to be specific to gain a performance
edge. But it’s less generic, and we might be missing out on a “greater truth.”</p>

<p>The paper leaves out a few practical criteria. Clients need to be able to
deterministically place and search for data items on particular nodes. The
function needs to be efficient. If routing is necessary<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, nodes in the system
need to be able to route a request to the correct bucket efficiently.</p>

<hr />

<p>I let this discussion get a bit mathematical to match the tone of the paper,
but also because math has been on my mind lately. My math skills have atrophied
since I left college. I haven’t even done a good proof in years.</p>

<p>But I’ve been reading comp-sci papers lately, and so I’ve started to turn that
part of my brain back on. After reflecting on it the past few weeks, I’ve come
to think that programmers and mathematicians are in a very similar business. The
history of math, just like programming, has been a history of discovering which
abstractions, notions, and notations have useful properties for solving classes
of problems.</p>

<p>I like theory ok, but perfect mathematical reasoning is a burden in programming.
If you hold yourself to the standard of mathematical proof, then you’ll spend
forever modeling your system only to find that your proof is stretching for
pages and pages, and it still doesn’t account for important things like
differences in computer architecture and maintenance cost. In practical systems,
you approximate and experiment. Software systems are <a href="https://oxide-and-friends.transistor.fm/episodes/grown-up-zfs-data-corruption-bug">filled with magic
numbers</a>
found by experimentation.</p>

<p>Even still, I regret not being able to take more math classes in college,
because the mathematical thought process is so similar to the programmer’s. You
find the abstraction with the right level of power for solving the problem
you’re interested in. You wouldn’t use ffmpeg to edit videos for the same reason
you wouldn’t teach children basic algebra using group theory. The programmer
eventually learns that ffmpeg underlies everything, and I guess one way of
looking at work in mathematics is the process of discovering the lowest level
components that explain how the higher level components work. It’s reverse
engineering Nature.</p>

<p>I’ve thought about starting to learn math again, at least to struggle with
problems every now and then. But how do you learn math at this point? What math
do you learn? I spent some time thinking about what you get out of a college
math degree.</p>

<p>There are a couple parts. There are the capabilities to reframe problems,
abstract things, unify, generalize, and simplify. Then the ability to write good
proofs and communicate with other mathematicians. And then there are the
subjects, the tradition, and the important results. The universal processes of
all math fields – abstracting, reframing, and so on – are my favorite parts. I
use those skills every day in programming and systems design. Proofs aren’t too
bad. But I’m lacking majorly in the tradition.</p>

<p>I have that in common with another group of math enthusiasts, child prodigies.
How do they cope? In <em>The Man Who Loved Only Numbers</em> about Paul Erdos, Paul
Hoffman says,</p>

<blockquote>
  <p>Perfect numbers and friendly numbers are among the areas of mathematics in
which child prodigies tend to show their stuff. Like chess and music, such
areas do not require much technical expertise. No child prodigies exist among
historians or legal scholars, because years are needed to master those
disciplines. A child can learn the rules of chess in a few minutes, and native
ability takes over from there. So it is with areas of mathematics like these,
which are aspects of elementary number theory (the study of the integers),
graph theory, and combinatorics (problems involving the counting and
classifying of objects). You can easily explain prime numbers, perfect
numbers, and friendly numbers to a child, and he or she can start playing
around with them and exploring their properties. Many areas of mathematics,
however, require technical expertise which is acquired over years of
assimilating definitions and previous results.</p>

  <p>p. 48</p>
</blockquote>

<p>And just as I was feeling better about the possibility that I could consider
myself “kinda good at math” without knowing the integral of 1/x, I saw on HN
yesterday a paper about <a href="https://news.ycombinator.com/item?id=47123689">Terence Tao, age 7</a><sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.
Sure enough, he was good at everything. “He has a prodigious long-term memory
for mathematical definitions, proofs and ideas with which he has become
acquainted,” writes the author of the paper, M. A. Ken Clements. He mentions in
another part of the paper, “Not only did he have an astounding grasp
of algebraic definitions, for someone who was still seven years old, but I was
amazed at how he used sophisticated mathematical language freely.” He was the
whole package at 7 years old. Widely read, good at math communication, and a
natural problem solver. The paper is a great read.</p>

<p>As an aside, I also learned how to evaluate a continued fraction from Terence in
this paper, and I’m wigging out over how programmerly the answer is. Here’s the
whole section:</p>

<p><img src="/assets/images/2026-math-and-comp-sci/ttao-continued-fraction.png" alt="Terence Tao solves a continued fraction at age 7" /></p>

<p>It’s a recursive problem, so you introduce x as a recursive definition of the
continued fraction. Then it looks just like a quadratic as Terence’s mom
suggested. Multiply both sides by x, rearrange, and solve for the positive
root. Of course it wouldn’t work in a program since it recurses infinitely, but
it has the same feel as any recursive function.</p>
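
<p>The actual fraction is in the image above, so as a stand-in take the simplest
continued fraction, x = 1 + 1/(1 + 1/(1 + …)). Setting x = 1 + 1/x and
multiplying both sides by x gives x<sup>2</sup> − x − 1 = 0, whose positive root
is the golden ratio. Truncating the “infinite recursion” converges to the same
value:</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Evaluate x = 1 + 1/(1 + 1/(1 + ...)) by truncating the recursion.
def truncated(depth):
    x = 1.0
    for _ in range(depth):
        x = 1.0 + 1.0 / x
    return x

# The positive root of x**2 - x - 1 = 0, via the quadratic formula.
quadratic_root = (1 + 5 ** 0.5) / 2  # about 1.618

print(truncated(40), quadratic_root)  # the two values agree
</code></pre></div></div>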

<p>Child prodigies get the problems and the intuition for mathematical constructs.
But I think I have a leg up in terms of appreciating the <a href="https://en.wikipedia.org/wiki/The_Mathematical_Experience">experience of
mathematics</a>. The
experience of math is the mind-altering effort to grasp the importance of a
pattern and what’s required to state the problem in “simplest” terms. The
tradition of math on the other hand is the set of constructs, patterns, results,
and abstractions that have been found to be substantial to the history of
mathematics, both theoretical and applied. One isn’t too useful without the
other. Sure I can think really hard about some problem, but without the
tradition of math, I can’t build on any existing work. Like a programmer who
doesn’t know what libraries are out there, and so has to build everything from
scratch.</p>

<p>I wish the author of the paper on Terence Tao had asked him what he thought
mathematicians did. He knew quite a few of them. Did he already appreciate that
math isn’t just given in books, but thought, rethought, argued about, and
revised? (He probably did.)</p>

<hr />

<p>Like in any tradition, it’s hard to jump into mathematics. The most important
results of the last few decades are built on subtleties, arguments, tradeoffs,
and alternatives that the casual student will never see. In my econometrics
class, we used matrices a bit for linear regressions. After a short introduction
about what matrices represent and how they work, we started on some definitions
like how matrix multiplication is defined. It was elusive to me at the time why
you would (a) have a non-commutative definition, and (b) do multiplication by
lining up row and column values pairwise. Who decided this, and why? But sure
enough, doing this led to the results we needed, so I put the questions in the
back pocket for later.</p>

<p>We trade off power for understanding, just as you don’t need to really know what
a socket is if you take the usage examples for granted. But I think it’s a
mistake to leave it there, to never encourage a student to figure out something
for themselves. Trying for a few days to derive some “obvious” idea is the only
way to appreciate the difficulty of coming up with good math abstractions. I
wasn’t gifted in math, but I also wasn’t encouraged to try to understand <em>why</em>
math is being taught one way or another. And I think that was a scholastic miss
on the curriculum’s part.</p>

<p>One of my favorite essays from Jon Bentley’s <em>Programming Pearls</em> (technically
this one is from <em>More Programming Pearls</em>) is about writing your own profiler,
and in the end he writes one for Awk. It seemed utterly impossible to me to
write a profiler, but I realized I had never even considered the problem space.
<em>Why not write a profiler?</em> And why not try to find out for yourself what are
the properties of a connected vs partitioned graph of storage nodes? Why not try
to think up a new API for a threading library to rival pthreads? A bad idea can
be discussed and expanded on, but you can’t do anything with no idea at all.</p>

<p>These days I’m pulled towards low-level programming, myself. I like the history
and the forgotten alternatives, the influential ideas that faded and yet
contributed to modern programming. But I also never forget that my job is to
build effective software, even if I can’t take the time to appreciate some tech
that I find a bit mysterious and interesting. I just have to file it away until
I have some spare time.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Routing was not necessary in the original paper, because the paper was concerned with cache nodes. Cache misses are a bit wasteful but not catastrophic. It’s enough to minimize them, not make them impossible. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>I’d skip the comment section. It’s mostly people flexing their own child prodigiousness and one thread about eugenics (“this proves to me that biological intelligence hasn’t nearly reached its peak. If we select for pure intelligence, biological brains can get much smarter”). <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>The niftiness and diversity of consistent hashing</title>
      <link>https://charlie-gallagher.github.io/2026/02/14/consistent-hashing.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/14/consistent-hashing.html</guid>
      <pubDate>Sat, 14 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>I’ve been interested in decentralized data in distributed systems lately.
There’s too much data for any one storage node to handle, so you have lots of
storage nodes and each gets some of the data. But how do you decide which node
is responsible for a certain data item?</p>

<p>The problem space is diverse. In peer-to-peer systems, nodes just sort of <em>have</em>
data that they register with the network. They have to not only find the node
that has data, but also find out what data exists in the first place. Then there
are distributed key/value systems like the one described in Amazon’s Dynamo
paper. Data is placed in storage to balance the load and replicated across nodes
to increase resiliency. And then there are CDNs and cache systems. In
decentralized caches like the one described in the original consistent hashing
paper, clients make an educated guess about the cache node that holds the data
they’re looking for, and the design goal is to give each client a way to guess
right most of the time with limited information.</p>

<p>I think the last problem is interesting: How can we make an educated guess about
what storage node might have the data when we only have partial information
about the network? It requires a bit of an intuitive leap.</p>

<p>Each data item is represented by a key, so there’s some concept of a <em>key
space</em>, which represents all possible data keys. We can use a <em>hash space</em> as a
pretty effective proxy for the key space, and then partition that hash space
into ranges. Each storage node in our system becomes responsible for one of the
ranges in the hash space.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>A hash space is the set of possible outputs of a hash function – for example a
SHA1 hash is 160 bits, so the hash space is all integers from 0 to
2<sup>160</sup> − 1. We know that if we hash different keys, we’re very likely to
get different hash values, so this satisfies the “hash space as proxy for all
possible keys” criterion. And hash functions are reasonably unbiased, so a
hashed key has essentially equal probability of landing anywhere between 0 and
2<sup><em>n_bits</em></sup> − 1. The hash space is essentially a number line, and we can
make it continuous by connecting the tail back to the head, so that the
successor of the largest hash value is the smallest hash value.</p>

<p>You place a data item on the ring by hashing its key.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pos_x</span> <span class="o">=</span> <span class="n">sha1</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">key</span><span class="p">)</span>
</code></pre></div></div>

<p>Our storage nodes are each responsible for a range, so we need a way to define
these ranges. To that end, we place <em>nodes</em> on the ring as well, by hashing
their IP addresses.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pos_server</span> <span class="o">=</span> <span class="n">sha1</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">ipaddr</span><span class="p">)</span>
</code></pre></div></div>

<p>We decide that each node is responsible for the hash values between itself and
its predecessor, which is to say the range of hash values between itself and the
nearest counter-clockwise neighbor on the hash ring.</p>
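
<p>A minimal sketch of that successor rule, using a sorted list of node positions
and a bisect lookup (the IP addresses and the shrunken hash space are stand-ins
of mine):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect
import hashlib

def h(s, bits=16):
    # A 16-bit hash space keeps the positions readable.
    return int(hashlib.sha1(s.encode()).hexdigest(), 16) % (2 ** bits)

servers = ["10.0.0.1", "10.0.0.2", "10.0.0.3"]
ring = sorted((h(ip), ip) for ip in servers)

def owner(key):
    # The owner is the first node at or clockwise of the key's position,
    # wrapping past the largest hash back to the smallest.
    i = bisect.bisect_left(ring, (h(key), ""))
    return ring[i % len(ring)][1]

print(owner("some-key"))  # one of the three addresses
</code></pre></div></div>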

<p>With a 64-bit hash space and 10 nodes, you might end up with a configuration
like this.</p>

<p><img src="/assets/images/2026-consistent-hashing/hash_ring_s10_n64_01.png" alt="A hash ring with 10 nodes in a 64-bit hash space" /></p>

<p>Now if a client knows about any set of storage nodes, it can make its own
partition of the hash ring and be mostly correct! If the client knows about 9
out of 10 nodes, when it makes its partition of the hash ring, it will correctly
identify 8 out of the 10 actual ranges, and it won’t do too badly guessing about
the 9th and 10th ranges, either.</p>
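
<p>That claim is easy to check with a quick simulation: build the full ring, drop
one node from the client’s view, and count how often the client’s guess still
matches (the addresses and key set below are made up for the sketch):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect
import hashlib

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

def owner(ring, key):
    # Successor lookup, wrapping around the end of the ring.
    i = bisect.bisect_left(ring, (h(key), ""))
    return ring[i % len(ring)][1]

servers = [f"10.0.0.{i}" for i in range(10)]
full_ring = sorted((h(s), s) for s in servers)
partial_ring = sorted((h(s), s) for s in servers[:9])  # one node unknown

keys = [f"key-{i}" for i in range(10_000)]
agree = sum(1 for k in keys if owner(full_ring, k) == owner(partial_ring, k))
print(f"{agree / len(keys):.0%} of guesses still correct")
</code></pre></div></div>

<p>The only keys the client misroutes are the ones owned by the node it doesn’t
know about, so the agreement rate stays high.</p>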

<p>Here’s how consistent hashing scores on a few more criteria:</p>

<ul>
  <li><strong>Precise.</strong> About as accurate as you can be with limited information about
storage nodes.</li>
  <li><strong>Deterministic.</strong> Yep.</li>
  <li><strong>Equal distribution of work.</strong> The hash ring is reasonably well distributed,
but it’s not perfect. The size of each range is random.</li>
  <li><strong>Adaptable to changes in membership.</strong> This is one of the main motivations
for consistent hashing. If you add a node, it divides some existing range, and
the only redistribution that happens is with the new node’s successor. If you
remove a node, the successor node becomes responsible for the range of the node
that left, but no other nodes are affected.</li>
</ul>

<p>In cache systems, getting close to the right node is good enough. Even if a node
isn’t technically responsible for a key, it can still request the key from the
server and cache the key itself. Any other clients that have the same incomplete
information about the network will hit this (non-responsible) node and find the
cached data there. Responsibility in this system is weak, but the data is stored
with a lot of <em>locality</em> relative to the hash ring.</p>

<p>Sometimes you need to find exactly the node responsible for a key. The Chord
protocol provides this stricter routing. When a server joins the network, it
notifies its neighbors and then learns about them. The protocol ensures that
the new node knows a lot about its immediate location on the ring, but less
about the nodes that are farther away from it. When a client makes a data
request to find a key, it reaches out to <em>any</em> node on the network. This node
probably doesn’t have the data, but it knows someone who is close. The storage
node forwards the request to the closest node it knows of that precedes the
key’s position on the ring, without passing it. This node
could be the owner of the data item, but if it’s not, it knows enough about its
local area on the ring to find a node that’s closer to it. The routing tables
are built in such a way that if a node doesn’t know of any closer node to the
data item, it is the owner of the data item.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>

<p>The number of hops in a Chord data retrieval is bounded. Latency isn’t optimal,
because there may be some number of hops to find the data, and in fact the
nearness on the ring says nothing about how close two nodes are in the world.
This is one of the hard problems of overlay networks, but there are
optimizations for it.</p>

<p>So, different systems treat the routing requirement differently. A cache system
is able to relax the routing requirement, because it tries to reduce cache
misses but not prevent them entirely. Chord meanwhile guarantees that a key is
found. It’s a scalable protocol that works in very unstable networks, at the
cost of extra latency.</p>

<p>There are still a few issues. First, a node leaving is likely to increase the
workload for another node, and a node entering the ring is likely to divide
another node’s workload while leaving the remaining nodes with the same
workload. Second, random positions on the ring are not very likely to be evenly
spaced. (There is after all only one evenly spaced configuration.) Third, the
scheme ignores heterogeneity – some servers can handle more work than others.</p>

<p>To illustrate the load balancing problem with hashing servers to the hash ring,
I generated 10 different random configurations to show that they are not
generally evenly spaced.</p>

<p><img src="/assets/images/2026-consistent-hashing/s10_n64.gif" alt="Ten random configurations of 10 nodes on a 64-bit hash ring" /></p>

<p>Next, I simulated 1,000 requests for data, and I ran the simulation 10 times,
plotting the results. In this plot, the size of the node represents the number
of requests serviced at the node.</p>

<p><img src="/assets/images/2026-consistent-hashing/s10_n64_hits_random.gif" alt="Simulated request load per node over 10 runs, with node size showing requests serviced" /></p>

<p>If we want to smooth out the load, the solution is to assign each server a
number of random <em>virtual nodes</em> on the ring. Virtual nodes decrease the chances
that a node gets a particularly big range of responsibility on the ring –
instead of getting unlucky once, it would have to get unlucky N times in a row
for N virtual nodes per storage node.</p>
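
<p>Extending the earlier placement scheme to virtual nodes is a small change: each
server is hashed onto the ring under several derived names, and the lookup maps
a virtual position back to its server (the server names and vnode count are my
choices for the sketch):</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import bisect
import hashlib
from collections import Counter

def h(s):
    return int(hashlib.sha1(s.encode()).hexdigest(), 16)

VNODES = 32  # virtual nodes per server

def build_ring(servers):
    # Each server appears VNODES times under derived names like "a#7".
    ring = []
    for s in servers:
        for i in range(VNODES):
            ring.append((h(f"{s}#{i}"), s))
    ring.sort()
    return ring

def owner(ring, key):
    i = bisect.bisect_left(ring, (h(key), ""))
    return ring[i % len(ring)][1]

ring = build_ring(["a", "b", "c", "d"])
load = Counter(owner(ring, f"key-{i}") for i in range(10_000))
print(load)  # loads are far more even than with one point per server
</code></pre></div></div>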

<p>Here’s another 1,000 requests made to a ring with 32 nodes instead of 10. I ran
the simulation 10 times and plotted each outcome.</p>

<p><img src="/assets/images/2026-consistent-hashing/s32_n64_hits_random.gif" alt="Simulated request load on a ring with 32 nodes, over 10 runs" /></p>

<p>If you take any N nodes (say 4) and average their size, you get a reasonable
number of requests. We can also solve the heterogeneity problem by allocating
each server a number of virtual nodes according to its abilities. Finally,
membership changes more evenly rebalance the nodes. When a node is removed, its
virtual nodes are removed from the circle, and the extra work is split among all
of the owners of the virtual nodes’ successors. It’s very likely that the
successors will be owned by several nodes, not just one.</p>

<p>Routing is a little more complicated. The system needs to store some amount of
extra metadata about how to map virtual nodes to actual nodes. For this reason,
virtual nodes are most effective in smaller systems where each storage node
might be responsible for a large chunk of the hash ring.</p>

<p>The design space becomes expansive at this point as particular systems balance
metadata and latency with other guarantees they need to make. But this is the
essence of the thing. We create a proxy for the key space that tends to evenly
distribute nodes and then divvy up that domain across the storage nodes.
Elegant, if you ask me.</p>

<h1 id="sources">Sources</h1>

<ul>
  <li><a href="https://www.akamai.com/site/en/documents/research-paper/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf">Consistent Hashing and Random Trees: distributed caching protocols for relieving hot spots on the world wide web</a></li>
  <li><a href="https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf">Chord: a scalable peer-to-peer lookup service for internet applications</a></li>
  <li><a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">Dynamo: Amazon’s highly available key-value store</a></li>
  <li><em>Distributed Systems</em>, 4e, by Maarten Van Steen and Andrew S. Tanenbaum
(<a href="https://www.distributed-systems.net/index.php/books/ds4/">link</a>)</li>
  <li><em>Computer networks: a systems approach</em>, 3e, by Peterson and Davie</li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>There are lots of ways to transform and then partition a key space, but most don’t work well. It’s vital that all keys are distributed <em>evenly</em> across storage nodes. So you might try to partition the key space by prefix and then place the key on the server whose name has the same prefix. This doesn’t work well because (a) there’s no guarantee that servers will have distinct prefixes, especially when they’re in the same subnet, (b) if they do have distinct prefixes, they’re probably not evenly spaced, and (c) you become significantly limited in the number of buckets. The longest prefix for an IPv4 address is 32 bits. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>See also Pastry, which optimizes the routing tables differently. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>I’ve been interested in decentralized data in distributed systems lately.
There’s too much data for any one storage node to handle, so you have lots of
storage nodes and each gets some of the data. But how do you decide which node
is responsible for a certain data item?</p>

<p>The problem space is diverse. In peer-to-peer systems, nodes just sort of <em>have</em>
data that they register with the network. They have to not only find the node
that has data, but also find out what data exists in the first place. Then there
are distributed key/value systems like the one described in Amazon’s Dynamo
paper. Data is placed in storage to balance the load and replicated across nodes
to increase resiliency. And then there are CDNs and cache systems. In
decentralized caches like the one described in the original consistent hashing
paper, clients make an educated guess about the cache node that holds the data
they’re looking for, and the design goal is to give each client a way to guess
right most of the time with limited information.</p>

<p>I think the last problem is interesting: How can we make an educated guess about
what storage node might have the data when we only have partial information
about the network? It requires a bit of an intuitive leap.</p>

<p>Each data item is represented by a key, so there’s some concept of a <em>key
space</em>, which represents all possible data keys. We can use a <em>hash space</em> as a
pretty effective proxy for the key space, and then partition that hash space
into ranges. Each storage node in our system becomes responsible for one of the
ranges in the hash space.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup></p>

<p>A hash space is the set of possible outputs of a hash function – for example a
SHA-1 hash is 160 bits, so the hash space is all integers from 0 to
2<sup>160</sup> − 1. We know that if we hash different keys, we’re very likely to
get different hash values, so this satisfies the “hash space as proxy for all
possible keys” criterion. And hash functions are reasonably unbiased, so a
hashed key lands with essentially equal probability anywhere between 0 and
2<sup><em>n_bits</em></sup> − 1. The hash space is in effect a number line, and we can
make it continuous by connecting the tail back to the head, so that the
successor of the largest hash value is the smallest hash value.</p>

<p>You place a data item on the ring by hashing its key.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pos_x</span> <span class="o">=</span> <span class="n">sha1</span><span class="p">(</span><span class="n">x</span><span class="p">.</span><span class="n">key</span><span class="p">)</span>
</code></pre></div></div>

<p>Our storage nodes are each responsible for a range, so we need a way to define
these ranges. To that end, we place <em>nodes</em> on the ring as well, by hashing
their IP addresses.</p>

<div class="language-py highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">pos_server</span> <span class="o">=</span> <span class="n">sha1</span><span class="p">(</span><span class="n">s</span><span class="p">.</span><span class="n">ipaddr</span><span class="p">)</span>
</code></pre></div></div>

<p>We decide that each node is responsible for the hash values between itself and
its predecessor, which is to say the range of hash values between itself and the
nearest counter-clockwise neighbor on the hash ring.</p>
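
<p>As a concrete sketch, here’s a minimal hash ring lookup in Python. This is my
own toy code, not from any particular library: nodes are sorted by position, and
a key belongs to the first node at or clockwise of its hash.</p>

```python
import hashlib
from bisect import bisect_left

def ring_pos(value: str) -> int:
    # Position on the ring: the SHA-1 digest read as an integer.
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, node_addrs):
        # Sort nodes by their position on the ring.
        self.ring = sorted((ring_pos(addr), addr) for addr in node_addrs)

    def owner(self, key: str) -> str:
        # The first node at or clockwise of the key's position owns it;
        # the modulo wraps the tail of the ring back to the head.
        i = bisect_left(self.ring, (ring_pos(key),))
        return self.ring[i % len(self.ring)][1]

ring = HashRing(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
node = ring.owner("user:42")  # deterministic for a given set of nodes
```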

<p>With a 64-bit hash space and 10 nodes, you might end up with a configuration
like this.</p>

<p><img src="/assets/images/2026-consistent-hashing/hash_ring_s10_n64_01.png" alt="A hash ring with 10 nodes placed in a 64-bit hash space" /></p>

<p>Now if a client knows about any set of storage nodes, it can make its own
partition of the hash ring and be mostly correct! If the client knows about 9
out of 10 nodes, when it makes its partition of the hash ring, it will correctly
identify 8 out of the 10 actual ranges, and it won’t do too badly guessing about
the 9th and 10th ranges, either.</p>

<p>Here’s how consistent hashing scores on a few more criteria:</p>

<ul>
  <li><strong>Precise.</strong> About as accurate as you can be with limited information about
storage nodes.</li>
  <li><strong>Deterministic.</strong> Yep.</li>
  <li><strong>Equal distribution of work.</strong> The hash ring is reasonably well distributed,
but it’s not perfect. The size of each range is random.</li>
  <li><strong>Adaptable to changes in membership.</strong> This is one of the main motivations
for consistent hashing. If you add a node, it divides some existing range, and
the only redistribution that happens is with the new node’s successor. If you
remove a node, the successor node becomes responsible for the range of the node
that left, but no other nodes are affected.</li>
</ul>
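
<p>That last property is easy to check with a toy ring. In this sketch I use
small integer positions in a hypothetical 6-bit hash space instead of real
hashes: adding a node at position 30 only takes keys from its successor.</p>

```python
from bisect import bisect_left

def owner(ring, key_pos):
    # ring: sorted node positions; the first node clockwise from the
    # key owns it, wrapping past the largest position back to the start.
    i = bisect_left(ring, key_pos)
    return ring[i % len(ring)]

size = 64                      # a toy 6-bit hash space
ring = [5, 20, 40, 55]         # node positions
before = {k: owner(ring, k) for k in range(size)}

# Add a node at position 30: the only keys that change hands are ones
# its successor (the node at 40) used to own.
ring2 = sorted(ring + [30])
after = {k: owner(ring2, k) for k in range(size)}
moved = {k for k in range(size) if before[k] != after[k]}
```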

<p>In cache systems, getting close to the right node is good enough. Even if a node
isn’t technically responsible for a key, it can still request the key from the
server and cache the key itself. Any other clients that have the same incomplete
information about the network will hit this (non-responsible) node and find the
cached data there. Responsibility in this system is weak, but the data is stored
with a lot of <em>locality</em> relative to the hash ring.</p>

<p>Sometimes you need to find exactly the node responsible for a key. The Chord
protocol provides this stricter kind of routing. When a server joins the network, it
notifies its neighbors and then learns about them. The protocol ensures that
the new node knows a lot about its immediate location on the ring, but less
about the nodes that are farther away from it. When a client makes a data
request to find a key, it reaches out to <em>any</em> node on the network. This node
probably doesn’t have the data, but it knows someone who is close. The storage
node forwards the request clockwise (forward) around the ring to the closest
node it knows of to the key’s position, without passing it. This node
could be the owner of the data item, but if it’s not, it knows enough about its
local area on the ring to find a node that’s closer to it. The routing tables
are built in such a way that if a node doesn’t know of any closer node to the
data item, it is the owner of the data item.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup></p>
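
<p>Here’s a toy version of that greedy forwarding step, again with small integer
positions instead of real hashes. (Chord’s actual finger tables are more
structured than a flat list of known nodes; this only shows the hop rule.)</p>

```python
def closest_preceding(known_nodes, current, key_pos, ring_size):
    # Clockwise distance from position a to position b on the ring.
    def dist(a, b):
        return (b - a) % ring_size

    # Among the nodes this hop knows about, pick the one closest to the
    # key without passing it -- a node past the key wraps around and so
    # has a large clockwise distance. If no known node is closer than
    # the current one, the search stops here.
    best = current
    for n in known_nodes:
        if dist(n, key_pos) < dist(best, key_pos):
            best = n
    return best
```

<p>Repeating this hop after hop, a request converges clockwise on the key
without ever overshooting it.</p>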

<p>The number of hops in a Chord data retrieval is bounded – logarithmic in the
number of nodes, with high probability. Latency still isn’t optimal: a lookup
may take several hops, and nearness on the ring says nothing about how close two
nodes are in the world.
This is one of the hard problems of overlay networks, but there are
optimizations for it.</p>

<p>So, different systems treat the routing requirement differently. A cache system
is able to relax the routing requirement, because it tries to reduce cache
misses but not prevent them entirely. Chord meanwhile guarantees that a key is
found. It’s a scalable protocol that works in very unstable networks, at the
cost of extra latency.</p>

<p>There are still a few issues. First, a node leaving is likely to increase the
workload for another node, and a node entering the ring is likely to divide
another node’s workload while leaving the remaining nodes with the same
workload. Second, random positions on the ring are not very likely to be evenly
spaced. (There is after all only one evenly spaced configuration.) Third, the
scheme ignores heterogeneity – some servers can handle more work than others.</p>

<p>To illustrate the load balancing problem with hashing servers to the hash ring,
I generated 10 different random configurations to show that they are not
generally evenly spaced.</p>

<p><img src="/assets/images/2026-consistent-hashing/s10_n64.gif" alt="Ten random placements of 10 nodes on a 64-bit hash ring, showing uneven spacing" /></p>

<p>Next, I simulated 1,000 requests for data, and I ran the simulation 10 times,
plotting the results. In this plot, the size of the node represents the number
of requests serviced at the node.</p>

<p><img src="/assets/images/2026-consistent-hashing/s10_n64_hits_random.gif" alt="1,000 simulated requests across 10 nodes; node size shows the number of requests serviced" /></p>

<p>If we want to smooth out the load, the solution is to assign each server a
number of random <em>virtual nodes</em> on the ring. Virtual nodes decrease the chances
that a node gets a particularly big range of responsibility on the ring –
instead of getting unlucky once, it would have to get unlucky N times in a row
for N virtual nodes per storage node.</p>
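
<p>A sketch of how virtual nodes might be placed. The <code>addr#i</code>
suffix is just a hypothetical naming convention for spreading one server across
several positions, not from any particular system.</p>

```python
import hashlib

def ring_pos(value: str) -> int:
    # Position on the ring: the SHA-1 digest read as an integer.
    return int(hashlib.sha1(value.encode()).hexdigest(), 16)

def build_ring(node_addrs, vnodes_per_node=8):
    # Each physical node lands at several effectively random positions;
    # each ring entry maps a position back to its physical owner.
    ring = []
    for addr in node_addrs:
        for i in range(vnodes_per_node):
            ring.append((ring_pos(f"{addr}#{i}"), addr))
    ring.sort()
    return ring

ring = build_ring(["10.0.0.1", "10.0.0.2"], vnodes_per_node=4)
```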

<p>Here’s another 1,000 requests made to a ring with 32 nodes instead of 10. I ran
the simulation 10 times and plotted each outcome.</p>

<p><img src="/assets/images/2026-consistent-hashing/s32_n64_hits_random.gif" alt="1,000 simulated requests across a ring with 32 nodes; the load is more even" /></p>

<p>If you take any N nodes (say 4) and average their size, you get a reasonable
number of requests. We can also solve the heterogeneity problem by allocating
each server a number of virtual nodes according to its capacity. Finally,
membership changes more evenly rebalance the nodes. When a node is removed, its
virtual nodes are removed from the circle, and the extra work is split among all
of the owners of the virtual nodes’ successors. It’s very likely that the
successors will be owned by several nodes, not just one.</p>

<p>Routing is a little more complicated. The system needs to store some amount of
extra metadata about how to map virtual nodes to actual nodes. For this reason,
virtual nodes are most effective in smaller systems where each storage node
might be responsible for a large chunk of the hash ring.</p>

<p>The design space becomes expansive at this point as particular systems balance
metadata and latency with other guarantees they need to make. But this is the
essence of the thing. We create a proxy for the key space that tends to evenly
distribute nodes and then divvy up that domain across the storage nodes.
Elegant, if you ask me.</p>

<h1 id="sources">Sources</h1>

<ul>
  <li><a href="https://www.akamai.com/site/en/documents/research-paper/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf">Consistent Hashing and Random Trees: distributed caching protocols for relieving hot spots on the world wide web</a></li>
  <li><a href="https://pdos.csail.mit.edu/papers/chord:sigcomm01/chord_sigcomm.pdf">Chord: a scalable peer-to-peer lookup service for internet applications</a></li>
  <li><a href="https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf">Dynamo: Amazon’s highly available key-value store</a></li>
  <li><em>Distributed Systems</em>, 4e, by Maarten van Steen and Andrew S. Tanenbaum
(<a href="https://www.distributed-systems.net/index.php/books/ds4/">link</a>)</li>
  <li><em>Computer networks: a systems approach</em>, 3e, by Peterson and Davie</li>
</ul>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>There are lots of ways to transform and then partition a key space, but most don’t work well. It’s vital that all keys are distributed <em>evenly</em> across storage nodes. So you might try to partition the key space by prefix and then place the key on the server whose name has the same prefix. This doesn’t work well because (a) there’s no guarantee that servers will have distinct prefixes, especially when they’re in the same subnet, (b) if they do have distinct prefixes, they’re probably not evenly spaced, and (c) you become significantly limited in the number of buckets. The longest prefix for an IPv4 address is 32 bits. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>See also Pastry, which optimizes the routing tables differently. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>Abuse-tolerant interfaces</title>
      <link>https://charlie-gallagher.github.io/2026/02/11/abuse-tolerant-interfaces.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/11/abuse-tolerant-interfaces.html</guid>
      <pubDate>Wed, 11 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <blockquote>
  <p>A common approach in the industry for forming a performance oriented SLA is to
describe it using average, median and expected variance. At Amazon we have
found that these metrics are not good enough if the goal is to build a system
where <strong>all</strong> customers have a good experience, rather than just the majority.</p>

  <p>“Dynamo: Amazon’s highly available key-value store,” DeCandia, et al.</p>
</blockquote>

<p>The Dynamo paper has me thinking about kinds of customer and service. Amazon is
a company on the offense, by which I mean that there is no sort of traffic they
want to turn away. They succeed when customers gleefully fill their carts and
hammer as many orders as possible through the checkout during the holidays.
Their only concern is keeping up.</p>

<p>The Dynamo paper goes on:</p>

<blockquote>
  <p>For example if extensive personalization techniques are used then customers
with longer histories require more processing which impacts performance at the
high-end of the distribution. An SLA stated in terms of mean or median response
times will not address the performance of this important customer segment.</p>
</blockquote>

<p>Those customers with enthusiastic shopping patterns are exactly the type of
customer that Amazon wants to court, and that drives their metrics away from
averages and towards extremes at the 99.9th percentile. At my own job at IXIS,
we’ve created a BI and data socialization platform, and the Dynamo paper has
me thinking that we are on the opposite side of some customer relationship
spectrum. Our platform works best when people use it <em>reasonably</em>. If a power
user only ever queries for multiple years of data at a time, that strains our
resources and has no incremental benefit to our bottom line.</p>

<p>Every company gets pricing stress, but some companies like IXIS, Snowflake, and
OpenAI have to worry about whether their pricing is secure against unusually
power-hungry power users. And that sucks, for everyone. I want people to
power-use our product without feeling like we’re against them. At the least the
fiddly money problems we have should be transparent to the user. Just imagine if
YouTube made you pay a small fee if you watched too many videos today to cover
their compute costs serving you those videos.</p>

<p>Here are a few pricing models in this space, with companies I think represent
the model well:</p>

<ul>
  <li><strong>Customer pays for compute.</strong> AI tokens, most AWS services, Snowflake.</li>
  <li><strong>Ads pay for customer.</strong> YouTube, Spotify.</li>
  <li><strong>Special services pay for freeloaders.</strong> The freemium model, usually mixed
with ads.</li>
  <li><strong>Good behavior by default.</strong> BitTorrent’s bartering system.</li>
  <li><strong>Good behavior rewarded.</strong> Reddit, Stack Overflow.</li>
  <li><strong>Hard resource limits.</strong> Google Drive, Gmail.</li>
  <li><strong>Throttling.</strong> AT&amp;T<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, Tinder.</li>
  <li><strong>Abuse-tolerant interface.</strong> Adobe Analytics Workspace, coupons.</li>
</ul>

<p>As an alternative to “customer pays for compute,” I’ve been interested in
abuse-tolerant interfaces, which you could describe as, “It’s not impossible for
users to cost us a lot of money, but they’ll find it’s inconvenient to do so.”
Coupons represent this par excellence. I mean physical, cut-em-out coupons.
They’re abuse-tolerant because while it’s possible to cut big stacks of coupons,
most people don’t. There was that whole Extreme Couponing TV show about lengths
to which people went to clip the coupons. We’re talking <em>days</em> of labor between
collecting books, snipping, and organizing. But the savings were huge.</p>

<p>Coupons work in spite of their inconvenience. When you take the trouble, it
feels like a <em>steal</em>. They’re a great time.</p>

<p>A few digital companies have figured out how to work this model into their
products. Adobe Analytics Workspace is one of those products in my opinion –
and if you aren’t familiar, it’s a BI tool for analyzing Adobe Analytics data.
The interface is composed of drag-and-drop components like metrics, dimensions, and
segments. You build visualizations ranging from simple and customizable (tables)
to prefab (various flow charts and funnels), and they never limit you. You can
theoretically ask for millions of rows of data and you won’t get rate limited.
Instead:</p>

<ul>
  <li>The data is paged in on-demand.</li>
  <li>The interface is much friendlier to simple visualizations than monstrously
large tables.</li>
  <li>Complex, nested breakdown tables must be created incrementally, which limits
how much data you could sanely fit into a single table.</li>
</ul>

<p>Now don’t confuse me for Adobe Analytics’ biggest fan or anything, but I’ve
never felt limited by the interface, although we’ve certainly pushed it to some
limits.</p>

<p>I think these sorts of abuse-tolerant interfaces are subtle and difficult to
execute well. But when you get it right, it’s great for both the business and
the customer. Food for thought for those product owners out there who are
considering a “compute per request” pricing model.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Seller beware, AT&amp;T was sued by the FTC in 2014 for throttling data speeds for “unlimited” plan users after they used a certain amount of data. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <blockquote>
  <p>A common approach in the industry for forming a performance oriented SLA is to
describe it using average, median and expected variance. At Amazon we have
found that these metrics are not good enough if the goal is to build a system
where <strong>all</strong> customers have a good experience, rather than just the majority.</p>

  <p>“Dynamo: Amazon’s highly available key-value store,” DeCandia, et al.</p>
</blockquote>

<p>The Dynamo paper has me thinking about kinds of customer and service. Amazon is
a company on the offense, by which I mean that there is no sort of traffic they
want to turn away. They succeed when customers gleefully fill their carts and
hammer as many orders as possible through the checkout during the holidays.
Their only concern is keeping up.</p>

<p>The Dynamo paper goes on:</p>

<blockquote>
  <p>For example if extensive personalization techniques are used then customers
with longer histories require more processing which impacts performance at the
high-end of the distribution. An SLA stated in terms of mean or median response
times will not address the performance of this important customer segment.</p>
</blockquote>

<p>Those customers with enthusiastic shopping patterns are exactly the type of
customer that Amazon wants to court, and that drives their metrics away from
averages and towards extremes at the 99.9th percentile. At my own job at IXIS,
we’ve created a BI and data socialization platform, and the Dynamo paper has
me thinking that we are on the opposite side of some customer relationship
spectrum. Our platform works best when people use it <em>reasonably</em>. If a power
user only ever queries for multiple years of data at a time, that strains our
resources and has no incremental benefit to our bottom line.</p>

<p>Every company gets pricing stress, but some companies like IXIS, Snowflake, and
OpenAI have to worry about whether their pricing is secure against unusually
power-hungry power users. And that sucks, for everyone. I want people to
power-use our product without feeling like we’re against them. At the least the
fiddly money problems we have should be transparent to the user. Just imagine if
YouTube made you pay a small fee if you watched too many videos today to cover
their compute costs serving you those videos.</p>

<p>Here are a few pricing models in this space, with companies I think represent
the model well:</p>

<ul>
  <li><strong>Customer pays for compute.</strong> AI tokens, most AWS services, Snowflake.</li>
  <li><strong>Ads pay for customer.</strong> YouTube, Spotify.</li>
  <li><strong>Special services pay for freeloaders.</strong> The freemium model, usually mixed
with ads.</li>
  <li><strong>Good behavior by default.</strong> BitTorrent’s bartering system.</li>
  <li><strong>Good behavior rewarded.</strong> Reddit, Stack Overflow.</li>
  <li><strong>Hard resource limits.</strong> Google Drive, Gmail.</li>
  <li><strong>Throttling.</strong> AT&amp;T<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, Tinder.</li>
  <li><strong>Abuse-tolerant interface.</strong> Adobe Analytics Workspace, coupons.</li>
</ul>

<p>As an alternative to “customer pays for compute,” I’ve been interested in
abuse-tolerant interfaces, which you could describe as, “It’s not impossible for
users to cost us a lot of money, but they’ll find it’s inconvenient to do so.”
Coupons represent this par excellence. I mean physical, cut-em-out coupons.
They’re abuse-tolerant because while it’s possible to cut big stacks of coupons,
most people don’t. There was that whole Extreme Couponing TV show about lengths
to which people went to clip the coupons. We’re talking <em>days</em> of labor between
collecting books, snipping, and organizing. But the savings were huge.</p>

<p>Coupons work in spite of their inconvenience. When you take the trouble, it
feels like a <em>steal</em>. They’re a great time.</p>

<p>A few digital companies have figured out how to work this model into their
products. Adobe Analytics Workspace is one of those products in my opinion –
and if you aren’t familiar, it’s a BI tool for analyzing Adobe Analytics data.
The interface is composed of drag-and-drop components like metrics, dimensions, and
segments. You build visualizations ranging from simple and customizable (tables)
to prefab (various flow charts and funnels), and they never limit you. You can
theoretically ask for millions of rows of data and you won’t get rate limited.
Instead:</p>

<ul>
  <li>The data is paged in on-demand.</li>
  <li>The interface is much friendlier to simple visualizations than monstrously
large tables.</li>
  <li>Complex, nested breakdown tables must be created incrementally, which limits
how much data you could sanely fit into a single table.</li>
</ul>

<p>Now don’t confuse me for Adobe Analytics’ biggest fan or anything, but I’ve
never felt limited by the interface, although we’ve certainly pushed it to some
limits.</p>

<p>I think these sorts of abuse-tolerant interfaces are subtle and difficult to
execute well. But when you get it right, it’s great for both the business and
the customer. Food for thought for those product owners out there who are
considering a “compute per request” pricing model.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Seller beware, AT&amp;T was sued by the FTC in 2014 for throttling data speeds for “unlimited” plan users after they used a certain amount of data. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>Swatch time</title>
      <link>https://charlie-gallagher.github.io/2026/02/10/swatch-time.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/10/swatch-time.html</guid>
      <pubDate>Tue, 10 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>I saw this on HN this morning. Nearly 30 years ago Swatch created <a href="https://beats.wiki">Swatch
Internet time</a>. The units are called .beats, and they’re a
decimal timekeeping system (1000 .beats per day) based apparently on the solar
day in Biel, Switzerland. The base unit, one .beat, is defined as 1/1000 of a
day (86.4 seconds), but since the scale seems to be defined in terms of UTC,
it’s probably a translation of SI seconds..?</p>

<p>It seems like UTC with a different hat.</p>

<p>All the finery aside, Swatch time abolishes timezones and daylight savings time
so that referring to time can be more natural. If it’s 584 .beats where I am,
it’s the same time all over the world. This is supposed to be refreshing for
people who spend a lot of time trying to coordinate people in different
timezones. Someone on HN mentioned this post, <a href="https://qntm.org/abolish">So You Want to Abolish
Timezones</a>. If you read that and still think .beats
are a good idea, I’ll be stunned.</p>

<p>I’ve talked <a href="/2026/02/06/ntpsec.html">before</a> about time coordination
and how the definition of “the time” is sociotechnical, with an emphasis on
<em>socio</em>. If there’s any problem with timezones it’s that there aren’t enough of
them. The original timezone scheme in the US had <a href="https://www.bts.gov/explore-topics-and-geography/geography/geospatial-portal/history-time-zones-and-daylight-saving">144
timezones</a>.</p>

<p>Sure, it’s hard to agree on when to have a meeting, but that’s only because
globalization is hard. You need to have either social empathy for how people
organize their lives in other places or a dictum that everyone change their idea
of the day from something natural to something arbitrary, like the natural time
in some place I’ve never been to.</p>

<p>So be happy that the world has such diversity! And refer back to your timezone
tables. Because timekeeping is as complicated as the places and people that keep
the time.</p>


        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>I saw this on HN this morning. Nearly 30 years ago Swatch created <a href="https://beats.wiki">Swatch
Internet time</a>. The units are called .beats, and they’re a
decimal timekeeping system (1000 .beats per day) based apparently on the solar
day in Biel, Switzerland. The base unit, one .beat, is defined as 1/1000 of a
day (86.4 seconds), but since the scale seems to be defined in terms of UTC,
it’s probably a translation of SI seconds..?</p>

<p>It seems like UTC with a different hat.</p>

<p>All the finery aside, Swatch time abolishes timezones and daylight savings time
so that referring to time can be more natural. If it’s 584 .beats where I am,
it’s the same time all over the world. This is supposed to be refreshing for
people who spend a lot of time trying to coordinate people in different
timezones. Someone on HN mentioned this post, <a href="https://qntm.org/abolish">So You Want to Abolish
Timezones</a>. If you read that and still think .beats
are a good idea, I’ll be stunned.</p>

<p>I’ve talked <a href="/2026/02/06/ntpsec.html">before</a> about time coordination
and how the definition of “the time” is sociotechnical, with an emphasis on
<em>socio</em>. If there’s any problem with timezones it’s that there aren’t enough of
them. The original timezone scheme in the US had <a href="https://www.bts.gov/explore-topics-and-geography/geography/geospatial-portal/history-time-zones-and-daylight-saving">144
timezones</a>.</p>

<p>Sure, it’s hard to agree on when to have a meeting, but that’s only because
globalization is hard. You need to have either social empathy for how people
organize their lives in other places or a dictum that everyone change their idea
of the day from something natural to something arbitrary, like the natural time
in some place I’ve never been to.</p>

<p>So be happy that the world has such diversity! And refer back to your timezone
tables. Because timekeeping is as complicated as the places and people that keep
the time.</p>


        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>My kind of place</title>
      <link>https://charlie-gallagher.github.io/2026/02/07/my-kind-of-place.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/07/my-kind-of-place.html</guid>
      <pubDate>Sat, 07 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>What is it about big social media that makes me feel like I’m part of the
conversation? Conversations are being had, and there are people here, so this
must be the place..?</p>

<p>Maybe we need a new name for it. There’s just not much social about TikTok,
Instagram, or YouTube. It’s TV in your pocket with infinite channels. In 2005 if
you wanted conversation, you’d watch a talk show or The View. Now the channels
have a much finer grain, but it’s the same thing. And at times I’ve been glued
to it.</p>

<p>If you remember old YouTube much, you might remember the video response section
(phased out in
<a href="https://m.hexus.net/business/news/internet/59389-youtube-video-response-facility-will-dropped-next-month/">2013</a>).
It wasn’t used a lot, but the intention was good. For some video, you could make
a response and it would appear underneath the video. You as a viewer could start
a conversation with the creator. I thought about it recently because it feels
alien now. Remixes and stitches are something like it but not quite right.</p>

<p>Conversation doesn’t scale well. After some tipping point the signal gets
drowned out in the noise, and the thoughts from interested strangers turn into
regular negative comments. Funny scales well, drama scales well, but not so much
people.</p>

<p>For a while I was a part of a small community of data visualization enthusiasts
called <a href="https://github.com/rfordatascience/tidytuesday">#TidyTuesday</a>. Every
week our benevolent leader would publish a dataset, and we would make a
visualization with it. You can see my visualizations on my
<a href="https://github.com/charlie-gallagher/tidy-tuesday">GitHub</a>, but as a small
example:</p>

<p><img src="/assets/images/2026-my-kind-of-place/ikea.png" alt="Ikea furniture names, with words distributed based on the ratio of vowels to consonants" /></p>

<p>It was the most fun I’ve had on the internet. The anticipation, the first
visualizations, the ones that blew me away. For 41 weeks, I got the assignment,
worked on something I thought would be cool, and then watched the other
visualizations flow in. It was small, and mutually encouraging. We all gave
credit when we stole from each other.</p>

<p>But I don’t think it works if there are 10,000 contributors instead of 100.</p>

<hr />

<p>I’ve been thinking about that internet and the internet I’ve tended to be on
lately. There are sites that focus on creators, and there are sites that focus
on people. Instagram, YouTube, and TikTok are platforms for creators.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
Reddit, Twitter, Meetup, personal websites, and (I mean this without irony,
mostly) Facebook groups<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> are for people. LinkedIn is technically about
people, but it’s driven by self-advertising, so I don’t know where it belongs.</p>

<p>I didn’t distinguish between these as much before. I knew I liked twitter better
than instagram, but I thought it was my fault that I didn’t find a community on
the ‘sta.</p>

<p>It’s market forces for the creators. You can’t monetize a subreddit<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>, but you
can monetize a YouTube channel. RSS feeds can’t be monetized, and so the tech
was nearly killed. But it’s still widely available. And as old as Reddit is,
it’s still popular. It’s a people site.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>

<p>I’ve fallen into the tech blogosphere lately, thanks to <a href="https://clickhouse.com/blog/tech-blogs">this post from
ClickHouse</a> and especially thanks to
Thorsten Ball’s weekly “Joy &amp; Curiosity” on <a href="https://registerspill.thorstenball.com">Register
Spill</a>. This part of the blogosphere is
mostly passion projects on simple HTML websites (please excuse the appearance of
my own site, I haven’t had time to make it plain), and the people are interested
in their fields. I have an RSS reader and I add new sites when I find them. Some
post every day (jwz), others every week (Thorsten), and some might never post
again.</p>

<p>And you know the nice thing? When I open it up, it’s my feed, and there are no
suggestions. I read it every morning with breakfast, and on Sundays I read the
long-form articles I’ve saved up from the week.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Marcel the Shell said it best: “It’s still a group of people, but it’s an audience, it’s not a community.” And dammit if that’s not the whole thing. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Say what you will about Facebook (and I’d agree), but Facebook groups are unique for being about geographically spread out people with niche interests who are encouraged to get together in real life occasionally. cf. Meetup and Reddit. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>You can earn money through Reddit by being what they consider to be a high quality contributor. The money’s not the same as channel-focused sites, though. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>I recently learned that three of my favorite things about the Internet (RSS, Reddit, at least in principle, and the creative commons license) were all developed in part by Aaron Swartz. He had the right idea for the Internet. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>What is it about big social media that makes me feel like I’m part of the
conversation? Conversations are being had, and there are people here, so this
must be the place…?</p>

<p>Maybe we need a new name for it. There’s just not much social about TikTok,
Instagram, or YouTube. It’s TV in your pocket with infinite channels. In 2005 if
you wanted conversation, you’d watch a talk show or The View. Now the channels
have a much finer grain, but it’s the same thing. And at times I’ve been glued
to it.</p>

<p>If you remember old YouTube much, you might remember the video response section
(phased out in
<a href="https://m.hexus.net/business/news/internet/59389-youtube-video-response-facility-will-dropped-next-month/">2013</a>).
It wasn’t used a lot, but the intention was good. For some videos, you could make
a response and it would appear underneath the video. You as a viewer could start
a conversation with the creator. I thought about it recently because it feels
alien now. Remixes and stitches are something like it but not quite right.</p>

<p>Conversation doesn’t scale well. After some tipping point the signal gets
drowned out in the noise, the thoughts from interested strangers turn into
regular negative comments. Funny scales well, drama scales well, but not so much
people.</p>

<p>For a while I was a part of a small community of data visualization enthusiasts
called <a href="https://github.com/rfordatascience/tidytuesday">#TidyTuesday</a>. Every
week our benevolent leader would publish a dataset, and we would make a
visualization with it. You can see my visualizations on my
<a href="https://github.com/charlie-gallagher/tidy-tuesday">GitHub</a>, but as a small
example:</p>

<p><img src="/assets/images/2026-my-kind-of-place/ikea.png" alt="Ikea furniture names, with words distributed based on the ratio of vowels to consonants" /></p>

<p>It was the most fun I’ve had on the internet. The anticipation, the first
visualizations, the ones that blew me away. For 41 weeks, I got the assignment,
worked on something I thought would be cool, and then watched the other
visualizations flow in. It was small, and mutually encouraging. We all gave
credit when we stole from each other.</p>

<p>But I don’t think it works if there are 10,000 contributors instead of 100.</p>

<hr />

<p>I’ve been thinking about that internet and the internet I’ve tended to be on
lately. There are sites that focus on creators, and there are sites that focus
on people. Instagram, YouTube, and TikTok are platforms for creators.<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>
Reddit, Twitter, Meetup, personal websites, and (I mean this without irony,
mostly) Facebook groups<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> are for people. LinkedIn is technically about
people, but it’s driven by self-advertising, so I don’t know where it belongs.</p>

<p>I didn’t distinguish between these as much before. I knew I liked twitter better
than instagram, but I thought it was my fault that I didn’t find a community on
the ‘sta.</p>

<p>It’s market forces for the creators. You can’t monetize a subreddit<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup>, but you
can monetize a YouTube channel. RSS feeds can’t be monetized, and so the tech
was nearly killed. But it’s still widely available. And as old as Reddit is,
it’s still popular. It’s a people site.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>

<p>I’ve fallen into the tech blogosphere lately, thanks to <a href="https://clickhouse.com/blog/tech-blogs">this post from
ClickHouse</a> and especially thanks to
Thorsten Ball’s weekly “Joy &amp; Curiosity” on <a href="https://registerspill.thorstenball.com">Register
Spill</a>. This part of the blogosphere is
mostly passion projects on simple HTML websites (please excuse the appearance of
my own site, I haven’t had time to make it plain), and the people are interested
in their fields. I have an RSS reader and I add new sites when I find them. Some
post every day (jwz), others every week (Thorsten), and some might never post
again.</p>

<p>And you know the nice thing? When I open it up, it’s my feed, and there are no
suggestions. I read it every morning with breakfast, and on Sundays I read the
long-form articles I’ve saved up from the week.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Marcel the Shell said it best: “It’s still a group of people, but it’s an audience, it’s not a community.” And dammit if that’s not the whole thing. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>Say what you will about Facebook (and I’d agree), but Facebook groups are unique for being about geographically spread out people with niche interests who are encouraged to get together in real life occasionally. cf. Meetup and Reddit. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>You can earn money through Reddit by being what they consider to be a high quality contributor. The money’s not the same as channel-focused sites, though. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>I recently learned that three of my favorite things about the Internet (RSS, Reddit, at least in principle, and the creative commons license) were all developed in part by Aaron Swartz. He had the right idea for the Internet. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>Replicating lazy replication in python</title>
      <link>https://charlie-gallagher.github.io/2026/02/07/lazy-replication.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/07/lazy-replication.html</guid>
      <pubDate>Sat, 07 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>I’ve been reading about distributed systems lately. I have a lot to catch up on.
When I started making a reading list a couple months ago, I had heard about the
CAP theorem, that it was about tradeoffs in consistency, availability, and
P-something, so I started there.</p>

<p>The CAP acronym stands for “Consistency of reads<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, Availability of writes,
and presence of a network Partition.” The theorem (poorly stated) says you can
pick any two, but not all three.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> If your service allows consistent reads (all
replicated nodes return the same value) and is always available for write
operations, then the system cannot function during a network partition. A
network partition is when not all nodes can communicate with each other. And if
you want to allow writes to proceed even when there’s a network partition, then
you cannot guarantee that every node will return the same value on read.</p>

<p>The CAP theorem as a <em>theorem</em> is correct, but the impossibility result has been
used as a teaching tool and a framework for thinking about tradeoffs in a
distributed, replicated service. And as a framework it’s not very useful. It
neglects the fact that users can tolerate some types of inconsistent reads, but
not others; that writes can proceed during a network partition under certain
conditions, but not others; and that there are many kinds of failure besides
just a network partition (slow responses, Byzantine nodes, node failures, and an
actual break in the network connection between parts of the network that
otherwise function well).</p>

<p>In other words, the CAP framework is simplistic. I shopped around for better
frameworks and found an excellent paper, <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigtt611-bernstein.pdf">“Rethinking Eventual
Consistency”</a>
by Philip Bernstein and Sudipto Das from Microsoft. They give a more complete
classification of the tradeoffs made in modern distributed systems and how to
think about them.</p>

<p>First, they acknowledge that availability and network partition tolerance are
essential for most services, so it’s <em>read consistency</em> that has to be
adjusted. (For what it’s worth, the original CAP proof paper also acknowledges
this.) Then, they disambiguate types of consistency and their uses. For example,
in an email system, it’s usually enough to offer <em>causal</em> consistency where a
single user always observes their own updates and the updates they’ve observed
before. They don’t observe the whole system at once, so the replicas don’t have
to be consistent.</p>

<p>The whole paper is worth reading. I myself focused on just one part that I found
interesting: eventual consistency through partial ordering, implemented with
vector clocks. The main reference for this in the eventual consistency paper is
<a href="https://www.cs.princeton.edu/courses/archive/spr24/cos418/papers/lazy.pdf">“Providing high availability using lazy
replication”</a>, by Ladin et al.</p>

<p>In this paper, a service is replicated across a network of symmetric nodes that
each serve both reads and writes to clients. The client uses a front end
service, and this front end service has duties like routing to a particular
preferred node and coordinating with nodes about which updates it expects to
see. This makes the replication transparent to the user while storing some
important program state on the client’s side.</p>

<p>A system is causally consistent if each user has a consistent view of the
system: they see the things they’ve seen before, and the system reflects the
updates they’ve made (maybe in response to things they’ve seen) in the right
order. From the perspective of an email client, the exact order of unrelated
emails being sent in the system is unimportant. But if I read an email and then
refresh my messages, I should still see that email (my view should never go
“back in time”), and the messages shouldn’t be reordered. If I reply to an
email, and someone else replies to my reply, the thread should appear in the
same order for everyone, regardless of which replica they talk to (replies are
causally linked).</p>

<p>This results in chains of causality flying back and forth from client to replica
and between replicas (as they share updates with each other). The <em>model</em>
described in the paper uses actual chains of causality. The <em>implementation</em> of
course is limited by memory and bandwidth, where full chains of causality are
inconvenient.</p>

<p>Instead, the system is implemented using <em>vector clocks</em>, which are a nifty
trick for tracking dependencies.</p>

<p>To understand vector clocks, consider that the system is changed through updates,
and it’s observed through reads. A read observes some data item, which is the
product of all of the updates that have affected that data item, performed in
some order.</p>

<p>The state at a particular replica is then defined by the log of updates it has
processed. As long as the replica itself keeps track of its log of updates,
everyone else can refer to a particular state of that replica by the <em>length of
its log</em>, or in other words the number of updates that replica has processed.</p>

<p>This is the underlying idea of a vector clock. For N replicas, the state of the
system is identified by a vector <em>v</em> of length N, where each element
<em>v<sub>i</sub></em> is the number of updates processed at replica <em>i</em>.</p>
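<p>Concretely, the two operations you need on such a vector are an element-wise
max for folding in another replica’s knowledge, and a comparison for asking
“does this clock reflect at least everything that clock reflects?” Here’s a
minimal sketch in python (the helper names <code class="language-plaintext highlighter-rouge">merge</code> and <code class="language-plaintext highlighter-rouge">dominates</code> are mine, not the
paper’s):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def merge(a, b):
    """Element-wise max: the combined knowledge of two clocks."""
    return [max(x, y) for x, y in zip(a, b)]

def dominates(a, b):
    """True if clock a reflects every update clock b reflects."""
    return all(x &gt;= y for x, y in zip(a, b))

# Replica 1 processes an update: it increments its own slot.
clock = [0, 0, 0]
clock[1] += 1

# A front end whose last-observed clock is [0, 1, 0] can be served
# by any replica whose clock dominates it.
assert dominates(clock, [0, 1, 0])

# Gossip: a replica folds an incoming clock into its own.
clock = merge(clock, [2, 0, 1])
assert clock == [2, 1, 1]
</code></pre></div></div>

<p>When neither clock dominates the other, the two states are concurrent, which
is where the partial ordering comes from.</p>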

<p>This compact representation of the system is useful but not sufficient. The
Ladin et al. paper goes into the details of how their system makes use of
vector clocks to enforce several types of operations (causal, forced, and
immediate, in increasing order of strictness).</p>

<h1 id="simulating-the-system">Simulating the system</h1>
<p>I was having a hard time visualizing and playing with this system on paper, so I
wrote a python simulation of the key parts. You can find it here:
<a href="https://github.com/charlie-gallagher/simulation-lazy-replication-ladin-1992">https://github.com/charlie-gallagher/simulation-lazy-replication-ladin-1992</a></p>

<p>It logs in some detail what’s going on at each node (node = replica) and
client, and at the end prints out a summary of what went on. Here’s an example
summary:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 node.py

======================
Summary
  Nodes: 4
  Front ends: 10
  Current time: 99
  Stats: {'updates': 316}
--------------------------
FrontEnd (0)
  Preferred node: 0
  Prev: [78, 96, 50, 87]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 177
  Seen vals: [12, 7, 9, 20, 20, 29, 40, 61, 86, 89, 105, 119, 110, 108, 119, 135, 145, 147, 153, 177]
  Stats: {'updates': 30, 'update_completes': 30, 'query_starts': 20, 'query_completes': 20, 'failed_polls': 0}
FrontEnd (1)
  Preferred node: 3
  Prev: [78, 98, 52, 88]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 170
  Seen vals: [6, 9, 21, 23, 33, 62, 115, 120, 108, 117, 112, 131, 133, 147, 153, 170]
  Stats: {'updates': 34, 'update_completes': 34, 'query_starts': 16, 'query_completes': 16, 'failed_polls': 0}
FrontEnd (2)
  Preferred node: 1
  Prev: [78, 98, 50, 87]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 173
  Seen vals: [6, 18, 11, 18, 23, 61, 115, 108, 127, 133, 141, 173]
  Stats: {'updates': 38, 'update_completes': 38, 'query_starts': 12, 'query_completes': 12, 'failed_polls': 0}
FrontEnd (3)
  Preferred node: 1
  Prev: [76, 97, 42, 84]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 155
  Seen vals: [5, 2, -2, 4, 12, 8, 9, 19, 27, 33, 37, 62, 86, 96, 100, 114, 110, 112, 115, 110, 115, 127, 127, 128, 148, 141, 155]
  Stats: {'updates': 23, 'update_completes': 23, 'query_starts': 27, 'query_completes': 27, 'failed_polls': 0}
FrontEnd (4)
  Preferred node: 1
  Prev: [78, 98, 50, 87]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 173
  Seen vals: [-2, 6, 9, 18, 21, 29, 72, 96, 117, 129, 134, 137, 141, 148, 143, 153, 173]
  Stats: {'updates': 33, 'update_completes': 33, 'query_starts': 17, 'query_completes': 17, 'failed_polls': 0}
FrontEnd (5)
  Preferred node: 2
  Prev: [64, 81, 51, 70]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 131
  Seen vals: [5, 9, 8, 12, 11, 20, 38, 62, 105, 119, 114, 111, 115, 118, 131]
  Stats: {'updates': 35, 'update_completes': 35, 'query_starts': 15, 'query_completes': 15, 'failed_polls': 0}
FrontEnd (6)
  Preferred node: 1
  Prev: [74, 98, 35, 79]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 148
  Seen vals: [4, 5, 11, 9, 32, 62, 66, 76, 93, 100, 120, 111, 108, 118, 123, 128, 133, 141, 148]
  Stats: {'updates': 31, 'update_completes': 31, 'query_starts': 19, 'query_completes': 19, 'failed_polls': 0}
FrontEnd (7)
  Preferred node: 0
  Prev: [78, 93, 46, 86]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 168
  Seen vals: [-5, 9, 5, 5, 5, 18, 21, 20, 32, 41, 62, 66, 120, 112, 115, 116, 117, 135, 131, 136, 144, 168]
  Stats: {'updates': 28, 'update_completes': 28, 'query_starts': 22, 'query_completes': 22, 'failed_polls': 0}
FrontEnd (8)
  Preferred node: 3
  Prev: [74, 85, 35, 88]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 147
  Seen vals: [4, 17, 11, 9, 18, 23, 30, 37, 62, 96, 89, 105, 115, 107, 118, 131, 135, 133, 147]
  Stats: {'updates': 31, 'update_completes': 31, 'query_starts': 19, 'query_completes': 19, 'failed_polls': 0}
FrontEnd (9)
  Preferred node: 2
  Prev: [65, 82, 52, 71]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 138
  Seen vals: [5, 5, 18, 15, 8, 20, 23, 22, 38, 40, 41, 62, 68, 76, 120, 110, 138]
  Stats: {'updates': 33, 'update_completes': 33, 'query_starts': 17, 'query_completes': 17, 'failed_polls': 0}
--------------------------
Node (0)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 78, 'gossip_messages_processed': 243, 'gossip_updates_processed': 238, 'queries': 49}
Node (1)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 98, 'gossip_messages_processed': 237, 'gossip_updates_processed': 218, 'queries': 47}
Node (2)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 52, 'gossip_messages_processed': 238, 'gossip_updates_processed': 264, 'queries': 38}
Node (3)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 88, 'gossip_messages_processed': 284, 'gossip_updates_processed': 228, 'queries': 50}
======================
</code></pre></div></div>

<p>It successfully replicates updates<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> and produces the same value at each
replica, in this case <code class="language-plaintext highlighter-rouge">Value: 170</code>.</p>

<p>The system isn’t perfect, and it probably still has some bugs, which have been
difficult to track down. But the value was in the exercise, and if you’re interested
in the Ladin paper, this is a functioning example that’s simple enough to be
considered basically pseudocode.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>The “C” is sometimes phrased “Consistency of replicas” rather than consistency of reads, which is a fine distinction. Systems are only observed through read operations, so inconsistency can only be noticed through reads; however, if a node fails and its work hasn’t successfully been replicated anywhere, then a future consistent read becomes impossible. So consistent replicas and consistent reads are closely tied but have different implications for how you might design the system to handle failure. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
<p>The original proof gives a precise definition of the CAP conjecture originally posed by Brewer: <a href="https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf">https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf</a>. Besides being more precise, it also describes weaker forms of consistency that are useful in real systems. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
<p>A running total. This actually isn’t sensitive to the order in which operations are run, making it a CRDT (see the <em>Rethinking Eventual Consistency</em> paper). This was a good starting point, I thought, because I didn’t have to work out the partial ordering just yet. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>I’ve been reading about distributed systems lately. I have a lot to catch up on.
When I started making a reading list a couple months ago, I had heard about the
CAP theorem, that it was about tradeoffs in consistency, availability, and
P-something, so I started there.</p>

<p>The CAP acronym stands for “Consistency of reads<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>, Availability of writes,
and presence of a network Partition.” The theorem (poorly stated) says you can
pick any two, but not all three.<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup> If your service allows consistent reads (all
replicated nodes return the same value) and is always available for write
operations, then the system cannot function during a network partition. A
network partition is when not all nodes can communicate with each other. And if
you want to allow writes to proceed even when there’s a network partition, then
you cannot guarantee that every node will return the same value on read.</p>

<p>The CAP theorem as a <em>theorem</em> is correct, but the impossibility result has been
used as a teaching tool and a framework for thinking about tradeoffs in a
distributed, replicated service. And as a framework it’s not very useful. It
neglects the fact that users can tolerate some types of inconsistent reads, but
not others; that writes can proceed during a network partition under certain
conditions, but not others; and that there are many kinds of failure besides
just a network partition (slow responses, Byzantine nodes, node failures, and an
actual break in the network connection between parts of the network that
otherwise function well).</p>

<p>In other words, the CAP framework is simplistic. I shopped around for better
frameworks and found an excellent paper, <a href="https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigtt611-bernstein.pdf">“Rethinking Eventual
Consistency”</a>
by Philip Bernstein and Sudipto Das from Microsoft. They give a more complete
classification of the tradeoffs made in modern distributed systems and how to
think about them.</p>

<p>First, they acknowledge that availability and network partition tolerance are
essential for most services, so it’s <em>read consistency</em> that has to be
adjusted. (For what it’s worth, the original CAP proof paper also acknowledges
this.) Then, they disambiguate types of consistency and their uses. For example,
in an email system, it’s usually enough to offer <em>causal</em> consistency where a
single user always observes their own updates and the updates they’ve observed
before. They don’t observe the whole system at once, so the replicas don’t have
to be consistent.</p>

<p>The whole paper is worth reading. I myself focused on just one part that I found
interesting: eventual consistency through partial ordering, implemented with
vector clocks. The main reference for this in the eventual consistency paper is
<a href="https://www.cs.princeton.edu/courses/archive/spr24/cos418/papers/lazy.pdf">“Providing high availability using lazy
replication”</a>, by Ladin et al.</p>

<p>In this paper, a service is replicated across a network of symmetric nodes that
each serve both reads and writes to clients. The client uses a front end
service, and this front end service has duties like routing to a particular
preferred node and coordinating with nodes about which updates it expects to
see. This makes the replication transparent to the user while storing some
important program state on the client’s side.</p>

<p>A system is causally consistent if each user has a consistent view of the
system: they see the things they’ve seen before, and the system reflects the
updates they’ve made (maybe in response to things they’ve seen) in the right
order. From the perspective of an email client, the exact order of unrelated
emails being sent in the system is unimportant. But if I read an email and then
refresh my messages, I should still see that email (my view should never go
“back in time”), and the messages shouldn’t be reordered. If I reply to an
email, and someone else replies to my reply, the thread should appear in the
same order for everyone, regardless of which replica they talk to (replies are
causally linked).</p>

<p>This results in chains of causality flying back and forth from client to replica
and between replicas (as they share updates with each other). The <em>model</em>
described in the paper uses actual chains of causality. The <em>implementation</em> of
course is limited by memory and bandwidth, where full chains of causality are
inconvenient.</p>

<p>Instead, the system is implemented using <em>vector clocks</em>, which are a nifty
trick for tracking dependencies.</p>

<p>To understand vector clocks, consider that the system is changed through updates,
and it’s observed through reads. A read observes some data item, which is the
product of all of the updates that have affected that data item, performed in
some order.</p>

<p>The state at a particular replica is then defined by the log of updates it has
processed. As long as the replica itself keeps track of its log of updates,
everyone else can refer to a particular state of that replica by the <em>length of
its log</em>, or in other words the number of updates that replica has processed.</p>

<p>This is the underlying idea of a vector clock. For N replicas, the state of the
system is identified by a vector <em>v</em> of length N, where each element
<em>v<sub>i</sub></em> is the number of updates processed at replica <em>i</em>.</p>
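<p>Concretely, the two operations you need on such a vector are an element-wise
max for folding in another replica’s knowledge, and a comparison for asking
“does this clock reflect at least everything that clock reflects?” Here’s a
minimal sketch in python (the helper names <code class="language-plaintext highlighter-rouge">merge</code> and <code class="language-plaintext highlighter-rouge">dominates</code> are mine, not the
paper’s):</p>

<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code>def merge(a, b):
    """Element-wise max: the combined knowledge of two clocks."""
    return [max(x, y) for x, y in zip(a, b)]

def dominates(a, b):
    """True if clock a reflects every update clock b reflects."""
    return all(x &gt;= y for x, y in zip(a, b))

# Replica 1 processes an update: it increments its own slot.
clock = [0, 0, 0]
clock[1] += 1

# A front end whose last-observed clock is [0, 1, 0] can be served
# by any replica whose clock dominates it.
assert dominates(clock, [0, 1, 0])

# Gossip: a replica folds an incoming clock into its own.
clock = merge(clock, [2, 0, 1])
assert clock == [2, 1, 1]
</code></pre></div></div>

<p>When neither clock dominates the other, the two states are concurrent, which
is where the partial ordering comes from.</p>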

<p>This compact representation of the system is useful but not sufficient. The
Ladin et al. paper goes into the details of how their system makes use of
vector clocks to enforce several types of operations (causal, forced, and
immediate, in increasing order of strictness).</p>

<h1 id="simulating-the-system">Simulating the system</h1>
<p>I was having a hard time visualizing and playing with this system on paper, so I
wrote a python simulation of the key parts. You can find it here:
<a href="https://github.com/charlie-gallagher/simulation-lazy-replication-ladin-1992">https://github.com/charlie-gallagher/simulation-lazy-replication-ladin-1992</a></p>

<p>It logs in some detail what’s going on at each node (node = replica) and
client, and at the end prints out a summary of what went on. Here’s an example
summary:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ python3 node.py

======================
Summary
  Nodes: 4
  Front ends: 10
  Current time: 99
  Stats: {'updates': 316}
--------------------------
FrontEnd (0)
  Preferred node: 0
  Prev: [78, 96, 50, 87]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 177
  Seen vals: [12, 7, 9, 20, 20, 29, 40, 61, 86, 89, 105, 119, 110, 108, 119, 135, 145, 147, 153, 177]
  Stats: {'updates': 30, 'update_completes': 30, 'query_starts': 20, 'query_completes': 20, 'failed_polls': 0}
FrontEnd (1)
  Preferred node: 3
  Prev: [78, 98, 52, 88]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 170
  Seen vals: [6, 9, 21, 23, 33, 62, 115, 120, 108, 117, 112, 131, 133, 147, 153, 170]
  Stats: {'updates': 34, 'update_completes': 34, 'query_starts': 16, 'query_completes': 16, 'failed_polls': 0}
FrontEnd (2)
  Preferred node: 1
  Prev: [78, 98, 50, 87]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 173
  Seen vals: [6, 18, 11, 18, 23, 61, 115, 108, 127, 133, 141, 173]
  Stats: {'updates': 38, 'update_completes': 38, 'query_starts': 12, 'query_completes': 12, 'failed_polls': 0}
FrontEnd (3)
  Preferred node: 1
  Prev: [76, 97, 42, 84]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 155
  Seen vals: [5, 2, -2, 4, 12, 8, 9, 19, 27, 33, 37, 62, 86, 96, 100, 114, 110, 112, 115, 110, 115, 127, 127, 128, 148, 141, 155]
  Stats: {'updates': 23, 'update_completes': 23, 'query_starts': 27, 'query_completes': 27, 'failed_polls': 0}
FrontEnd (4)
  Preferred node: 1
  Prev: [78, 98, 50, 87]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 173
  Seen vals: [-2, 6, 9, 18, 21, 29, 72, 96, 117, 129, 134, 137, 141, 148, 143, 153, 173]
  Stats: {'updates': 33, 'update_completes': 33, 'query_starts': 17, 'query_completes': 17, 'failed_polls': 0}
FrontEnd (5)
  Preferred node: 2
  Prev: [64, 81, 51, 70]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 131
  Seen vals: [5, 9, 8, 12, 11, 20, 38, 62, 105, 119, 114, 111, 115, 118, 131]
  Stats: {'updates': 35, 'update_completes': 35, 'query_starts': 15, 'query_completes': 15, 'failed_polls': 0}
FrontEnd (6)
  Preferred node: 1
  Prev: [74, 98, 35, 79]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 148
  Seen vals: [4, 5, 11, 9, 32, 62, 66, 76, 93, 100, 120, 111, 108, 118, 123, 128, 133, 141, 148]
  Stats: {'updates': 31, 'update_completes': 31, 'query_starts': 19, 'query_completes': 19, 'failed_polls': 0}
FrontEnd (7)
  Preferred node: 0
  Prev: [78, 93, 46, 86]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 168
  Seen vals: [-5, 9, 5, 5, 5, 18, 21, 20, 32, 41, 62, 66, 120, 112, 115, 116, 117, 135, 131, 136, 144, 168]
  Stats: {'updates': 28, 'update_completes': 28, 'query_starts': 22, 'query_completes': 22, 'failed_polls': 0}
FrontEnd (8)
  Preferred node: 3
  Prev: [74, 85, 35, 88]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 147
  Seen vals: [4, 17, 11, 9, 18, 23, 30, 37, 62, 96, 89, 105, 115, 107, 118, 131, 135, 133, 147]
  Stats: {'updates': 31, 'update_completes': 31, 'query_starts': 19, 'query_completes': 19, 'failed_polls': 0}
FrontEnd (9)
  Preferred node: 2
  Prev: [65, 82, 52, 71]
  Is blocked on query: False
  Is blocked on update: False
  Last received value: 138
  Seen vals: [5, 5, 18, 15, 8, 20, 23, 22, 38, 40, 41, 62, 68, 76, 120, 110, 138]
  Stats: {'updates': 33, 'update_completes': 33, 'query_starts': 17, 'query_completes': 17, 'failed_polls': 0}
--------------------------
Node (0)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 78, 'gossip_messages_processed': 243, 'gossip_updates_processed': 238, 'queries': 49}
Node (1)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 98, 'gossip_messages_processed': 237, 'gossip_updates_processed': 218, 'queries': 47}
Node (2)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 52, 'gossip_messages_processed': 238, 'gossip_updates_processed': 264, 'queries': 38}
Node (3)
  Log records: 0
  First 5 log records:
  Replica TS: [78, 98, 52, 88]
  Value: 170
  Value TS: [78, 98, 52, 88]
  TS Table: [[78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88], [78, 98, 52, 88]]
  Gossip queue length: 0
  Update queue length: 0
  Query queue length: 0
  Query results length: 0
  Update results length: 0
  Stats: {'updates': 88, 'gossip_messages_processed': 284, 'gossip_updates_processed': 228, 'queries': 50}
======================
</code></pre></div></div>

<p>It successfully replicates updates<sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> and produces the same value at each
replica, in this case <code class="language-plaintext highlighter-rouge">Value: 170</code>.</p>

<p>The system isn’t perfect, and it still probably has some bugs, which have been
difficult to track down. But the value was in the exercise, and if you’re interested
in the Ladin paper, this is a functioning example that’s simple enough to be
considered basically pseudocode.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>The “C” is sometimes phrased “Consistency of replicas” rather than consistency of reads, which is a fine distinction. Systems are only observed through read operations, so inconsistency can only be noticed through reads; however, if a node fails and its work hasn’t successfully been replicated anywhere, then a future consistent read becomes impossible. So consistent replicas and consistent reads are closely tied but have different implications for how you might design the system to handle failure. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>The original proof gives a precise definition of the CAP conjecture originally posed by Brewer: <a href="https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf">https://users.ece.cmu.edu/~adrian/731-sp04/readings/GL-cap.pdf</a>. Besides being more precise, it also describes weaker forms of consistency that are useful in real systems. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>A running total, which actually isn’t sensitive to the order in which operations are run: a so-called CRDT (see the <em>Rethinking Eventual Consistency</em> paper). This was a good starting point, I thought, because I didn’t have to work out the partial ordering just yet. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>A week on NTPSec</title>
      <link>https://charlie-gallagher.github.io/2026/02/06/ntpsec.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/02/06/ntpsec.html</guid>
      <pubDate>Fri, 06 Feb 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>In which I find out with some certainty what time it is.</p>

<blockquote>
  <p>The ntpd utility can synchronize time to a theoretical precision of about 232
picoseconds. In practice, this limit is unattainable due to quantum limits on
the clock speed of ballistic-electron logic.</p>

  <p>(https://docs.ntpsec.org/latest/ntpd.html)</p>
</blockquote>

<p>I always assumed I was using ntpd to keep time on my Linux computer. But I was
only sort of right.</p>

<p>According to the Debian Wiki, since Debian 12, the default NTP client is
<code class="language-plaintext highlighter-rouge">systemd-timesyncd</code>. It uses SNTP (Simple Network Time Protocol), which
implements only the client side, with no option to host a time server, and it
sets the time roughly by communicating with a single time server. There’s no recourse if you
get a bad server, or “falseticker” in NTP parlance.</p>

<p>There are a few implementations of NTP to choose from. The systemd-timesyncd
daemon is a basic client suitable for keeping time. The original NTP reference
implementation is <code class="language-plaintext highlighter-rouge">ntpd</code>, which is still around, but is deprecated on Debian in
favor of the more security-focused <a href="https://ntpsec.org">NTPSec</a>. And then
<a href="https://chrony-project.org">Chrony</a> is a newer implementation that is more
practical than <code class="language-plaintext highlighter-rouge">ntpd</code>. It looks like a darn fine timekeeper <a href="https://chrony-project.org/faq.html#_how_does_chrony_compare_to_ntpd">by
comparison</a>.</p>

<p>There are interesting things to say about each NTP tool (and their apparent
<a href="https://www.linux-magazine.com/Online/Blogs/Off-the-Beat-Bruce-Byfield-s-Blog/NTPsec-The-Wrong-Fork-for-the-Wrong-Reasons">controversies</a>),
but if you’re interested in NTP you can pick pretty equally among <code class="language-plaintext highlighter-rouge">ntpd</code>,
Chrony, and NTPSec. I’ve been playing with NTPSec for its debugging utilities
like out-of-the-box data visualizations using <code class="language-plaintext highlighter-rouge">ntpviz</code>.</p>

<hr />

<p>Most computers have a real time clock in hardware and a system clock in
software. On powerup or reboot, the system clock is set using the RTC. You can
use a command like <code class="language-plaintext highlighter-rouge">date</code> to set the date/time, but this only updates the system
clock; strictly speaking, to update the hardware clock immediately you’ll need</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hwclock --systohc
</code></pre></div></div>

<p>The hardware clock is battery-driven, which is how its time reading persists
across boots. But some parts of it are curiously system-dependent, for example
whether the hardware clock is set in UTC or local time.</p>

<blockquote>
  <p>If your machine dual boots Windows and Linux, then you could have problems
because Windows uses localtime for the hardware clock; while Linux and Debian
use UTC for the hardware clock. In this case you have two choices. The first
is to use localtime for the hardware clock, and set Debian to use localtime.
The second is to use UTC for the hardware clock, and set Windows to use UTC.</p>

  <p>https://wiki.debian.org/DateTime</p>
</blockquote>

<p>NTP implementations like Chrony and NTPSec don’t directly interact with the RTC;
instead, they modify the system clock. They <em>tend</em> to make use of a kernel
feature called “11-minute mode”, where the kernel writes the system clock’s time
to the hardware clock every 11 minutes, but documentation on this is a bit
scant. There are some comments in the <a href="https://chrony-project.org/faq.html#_real_time_clock_issues">Chrony
docs</a>.</p>

<p>Real time clocks are usually crystal oscillators with a frequency of 32.768 kHz,
and since NTP doesn’t directly interact with them, I’m not going to talk much
more about them.</p>

<p>Software clocks on the other hand are crucial to the system. Every system that
NTP runs on must provide a time correction service. The <code class="language-plaintext highlighter-rouge">adjtime</code> syscall is
intended to be portable; it originated in BSD and is widely available, though as
far as I can tell it isn’t actually part of the POSIX standard. You might
also see <code class="language-plaintext highlighter-rouge">adjtimex</code>, which is a Linux-specific variant.</p>

<p>To explain how the system call works, <em>The Design and Implementation of the
FreeBSD Operating System</em>:</p>

<blockquote>
  <p>The <code class="language-plaintext highlighter-rouge">settimeofday</code> system call will result in time running backward on
machines whose clocks were fast. Time running backward can confuse user
programs (such as <code class="language-plaintext highlighter-rouge">make</code>) that expect time to invariably increase. To avoid
this problem, the system provides the <code class="language-plaintext highlighter-rouge">adjtime</code> system call [Mills, 1992].
The <code class="language-plaintext highlighter-rouge">adjtime</code> system call takes a time delta (either positive or negative) and
changes the rate at which time advances by 10 percent, faster or slower, until
the time has been corrected. The operating system does the speedup by
incrementing the global time by 1100 microseconds for each tick and does the
slowdown by incrementing the global time by 900 microseconds for each tick.
Regardless, time increases monotonically, and user processes depending on the
ordering of file-modification times are not affected. However, time changes
that take tens of seconds to adjust will affect programs that are measuring
time intervals by using repeated calls to gettimeofday</p>
</blockquote>

<p>The Mills reference is to <a href="https://datatracker.ietf.org/doc/html/rfc1305">RFC 1305</a>.</p>
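<p>As a toy model of that slewing scheme (my sketch in Python, not the actual kernel code), here’s how the amortization plays out, assuming 1000 ticks per second and the 100-microsecond-per-tick adjustment the book describes:</p>

```python
# Toy model of adjtime-style slewing: a requested correction is
# absorbed at up to 100 microseconds per tick, so time never runs
# backward while the clock is being corrected.

TICK_US = 1000  # nominal microseconds per tick at 1000 ticks/second

def slew(global_time_us, delta_us):
    """Advance the clock tick by tick until delta_us is absorbed.

    Returns (new_global_time_us, ticks_taken).
    """
    ticks = 0
    while delta_us != 0:
        adj = min(100, abs(delta_us))  # at most 10% of a tick per tick
        if delta_us > 0:               # clock is behind: run fast (1100 us/tick)
            global_time_us += TICK_US + adj
            delta_us -= adj
        else:                          # clock is ahead: run slow (900 us/tick)
            global_time_us += TICK_US - adj
            delta_us += adj
        ticks += 1
    return global_time_us, ticks

# Absorbing a +0.5 s correction takes 5,000 ticks: 5 seconds of wall time.
_, ticks = slew(0, 500_000)
```

<p>This also shows why the book warns about interval measurements: a correction of N seconds takes roughly ten times N seconds of wall time to apply, during which repeated <code class="language-plaintext highlighter-rouge">gettimeofday</code> calls see a clock running 10 percent fast or slow.</p>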

<p>Since I have TDAIOTFBSDOS open already, I can mention a few other things about
how a typical POSIX software clock works. The system software clock is created
through an interrupt timer, and the system “increments its global time variable
by an amount equal to the number of microseconds per tick. For the PC, running
at 1000 ticks per second, each tick represents 1000 microseconds,” (p. 73). And
if you think 1000 interrupts per second is a lot of interruption, you’re right.
“To reduce the interrupt load, the kernel computes the number of ticks in the
future at which an action may need to be taken. It then schedules the next clock
interrupt to occur at that time. Thus, clock interrupts typically occur much
less frequently than the 1000 ticks-per-second rate implies,” (pp. 65-66).</p>

<p>I’d guess (and a brief conversation with ChatGPT seems to confirm) that modern
operating systems have heavily optimized this part of their timekeeping. After
all, who cares what time it is if no process is trying to observe it?</p>

<hr />

<blockquote>
  <p>There is not one way of measuring time more true than another; that which is
generally adopted is only more <em>convenient</em>.</p>

  <p>Henri Poincaré</p>
</blockquote>

<p>What would it take for me to serve time to others? NTP servers listen on port
123 and usually work only over UDP, so I suppose the simple way to serve time is
to start ntpd in server mode, start listening, and configure someone to ask you
the time. But if you really want to be seen, you have to join the
<a href="https://www.ntppool.org/en/">pool</a>.</p>

<blockquote>
  <p>The pool.ntp.org project is a big virtual cluster of timeservers providing
reliable, easy to use NTP service for millions of clients.</p>

  <p>The pool is being used by hundreds of millions of systems around the world.
It’s the default “time server” for most of the major Linux distributions and
many networked appliances (see information for vendors).</p>

  <p>https://www.ntppool.org/en/</p>
</blockquote>

<p>There’s very clear documentation on <a href="https://www.ntppool.org/en/join.html">how to join the
pool</a>, too. The basic steps are:</p>

<ul>
  <li>Get your own time from a known good source (<em>not</em> the pool).</li>
  <li>Configure a stable IP address (trickier than you might think – even if you
set up port forwarding to get around DHCP issues, your ISP tends to rotate
your public IP address as it wants).</li>
  <li>Be willing to make a long-term commitment to the project.</li>
</ul>

<p>I’ll put “create a home time server” on my list of things to try, but joining
the pool would probably create too big a wave.</p>

<hr />

<p>My computer may not be the best to ask for the precise time. Where does
authority in timekeeping come from? Who has the time? I’m only an hour’s drive
away from the <a href="https://museum.nawcc.org">National Clock and Watch Museum</a>, and
after a visit and a few dozen hours of follow-up research, I have something
approximating an answer.</p>

<p>We get our sense of time from the periodic movements of the starry firmament –
the sun, the moon, the stars. And our bodies, along with most organisms on
Earth, have built-in timers that encourage us to do those activities that keep
us alive. This is when you usually sleep, this is when you usually eat. As
different cultures sought to understand the heavens and their perfection,
timekeeping began to occupy its modern central role for coordination in our
lives.</p>

<p>And so at the moment my interest in timekeeping isn’t how we developed precision
clocks, but how we managed to coordinate ourselves using those clocks. <em>Why</em> we
coordinate ourselves using those clocks. What even is a clock? If all of the
atomic clocks in the world stopped ticking for 10 minutes, would we be able to
recover “the time”?</p>

<p>We’ve had a sense for calendars for a long time. “By the 14th century BC the
Shang Chinese had established the solar year as 365.25 days and the lunar month
as 29.5 days,” (RFC 1305). By 432 BC, the Greek astronomer Meton had estimated
the lunar month – the time it takes for the moon to circle the earth – to
within about 2 minutes of the currently understood value.</p>

<p>Time-curious cultures became duly obsessed with the frequency and stability of
our cosmic oscillators. The Earth’s rotation and its orbit around the sun, the
moon’s orbit around the Earth. And each culture had a calendar that tried to
match the motions of the cosmos with a predictable and convenient “civilian”
calendar.</p>

<p>Not all cultures had a calendar, and the ones that did used different systems,
so the process of dating events is knotty. Suffice to say that it involves some
guesswork. The best case for understanding the orders of events in the old days
is having what Mills calls (in RFC 1305) “an accurate count of the days relative
to some globally alarming event, such as a comet passage or supernova
explosion.”</p>

<p>And so calendars are social. The civil calendar had to be convenient and fit
into the activities of daily life, and ordering of events depends on some
collective consciousness around global events. I’ve been surprised by how often
we make clocks and calendars fit into daily life and not the other way around.
Several of the most precise modern timescales today are based on what feels
right and looks right, made a bit more precise.</p>

<p>Calendars order our years; clocks order our days. Early religion temporalized
daily life by requiring certain religious acts to be done multiple times a day.
Some of the earliest interesting clock-like devices we have are from monasteries
that rang bells at specific times. (And the word <em>clock</em> is derived from the
French word for bell.) This went on for a few hundred years.</p>

<p>The next advance was periodic timekeepers. Time used to be more organic than it
is today. Hours were not equally sized, and the day was not split equally into
24 parts. But somewhere, at some time, Europeans made an intuitive leap from
continuous time devices like the clepsydra or the procession of different stars
and planets to discrete time – time as ticks<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. In <em>Revolution in Time</em>,
David Landes considers this one of the great methodological leaps in western
civilization. It took other cultures another 500 years to begin using
oscillating, periodic timekeepers.</p>

<p>Clocks have their social uses. Nearly as soon as clocks became convenient and
domestic, <em>punctuality</em> became an important social cue. And as life became more
connected with trade, trains, and radio, the pragmatic importance of clocks only
increased.</p>

<hr />

<p>I installed NTPSec on my Debian machine and left the configuration mostly as-is.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt-get update &amp;&amp; sudo apt-get install ntpsec ntpsec-doc ntpsec-ntpviz
</code></pre></div></div>

<p>I made sure to enable statistics, because I’m really after visualizations. I
want to see the thing do stuff. Visualizations are generated using <code class="language-plaintext highlighter-rouge">ntpviz</code>,
which is scantily documented (this was helpful but ancient:
<a href="https://blog.ntpsec.org/2016/12/19/ntpviz-intro.html">ntpviz-intro</a>), but I
found enough to get me going. Unfortunately, I just set up my daemon, and
there’s no data to visualize. I took the opportunity to do some background work
on the metrics.</p>

<p>Clocks are never perfectly in sync, and the most important contributor to
incorrect timekeeping is a difference in oscillator frequencies. This is called
frequency <em>skew</em>. If the “correct” time is an oscillator at 1000 Hz, my local
computer clock might be more like 1001 Hz or 999 Hz. So even if I set my clock
to the right time, I would gain or lose some seconds every day.</p>

<p>Frequency skew is measured in parts per million, which is to say the number of
periods fast or slow per million oscillations. In the 1000 Hz example, 1001 Hz
would have a skew of 1 part in every thousand, or 1000 parts per million (ppm).
999 Hz has a skew of -1000 ppm.</p>

<p>Skew is also described in other ways. A human-friendly way to describe it is
“seconds gained or lost per day”, or week or year. This gives you the number in
practical terms. It’s a bit tricky to translate between them, though,
considering the gap between oscillation frequency and length of a day.</p>
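<p>The translation is just a ratio, though. A quick sketch in Python (nothing NTP-specific here):</p>

```python
SECONDS_PER_DAY = 86_400

def ppm_to_seconds_per_day(ppm):
    # 1 ppm of skew = 1 second gained or lost per million true seconds
    return SECONDS_PER_DAY * ppm / 1_000_000

# The 1000 ppm clock from the example above gains 86.4 seconds per day.
```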

<p>Your skew might also vary over time, and this is called <em>drift</em>.</p>

<p>NTP corrects for skew as part of the protocol by nudging the time and doing its
best to predict changes. Skew is affected by the quality of the hardware and the
environment around the oscillator, especially temperature. For ideal
timekeeping, you’ll want to keep your computer in a nice climate-controlled
vault with excellent heat sinking.</p>

<p>The clock offset is the estimated difference between my clock and the reference
clock, measured in milliseconds. I show roughly how this is calculated in <a href="/2026/01/27/ntp-in-30-seconds.html">NTP
in 30 Seconds</a>. In short, it’s
calculated by estimating the latency between you and the server and using that
to guess what time the server received your request. Then you compare your guess
(based on local time + latency) to what the server reported was the “actual”
time it received the request, and use the difference to work out how wrong your
clock is.</p>
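<p>In the conventional notation, the four timestamps are called t1 through t4. This is the textbook formula, consistent with the description above, sketched in Python:</p>

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Estimate clock offset and round-trip delay from the four NTP
    timestamps: t1 = client send, t2 = server receive, t3 = server send,
    t4 = client receive (t1/t4 on the client clock, t2/t3 on the server's).
    """
    delay = (t4 - t1) - (t3 - t2)         # time actually spent in transit
    offset = ((t2 - t1) + (t3 - t4)) / 2  # how far my clock is off the server's
    return offset, delay

# Client clock 5 ms slow, 10 ms latency each way:
offset, delay = ntp_offset_delay(0.000, 0.015, 0.016, 0.021)
```

<p>The offset estimate is exact only when the outbound and return latencies are equal; asymmetric routes show up as offset error, which is one reason NTP polls several servers.</p>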

<p>Practically speaking, for general monitoring, you can use <code class="language-plaintext highlighter-rouge">ntpmon</code>. This is a
<code class="language-plaintext highlighter-rouge">top</code>-like tool for watching your NTP daemon interact with peers. The output
looks something like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     remote           refid      st t when poll reach   delay   offset   jitter
 0.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
 1.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
 2.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
 3.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
-ip74-208-14-149 192.58.120.8     2 u  598 1024  377  41.7645   0.9319   1.5714
-144.202.66.214. 162.159.200.1    4 u  834 1024  377  45.3224   1.1844   1.2529
*nyc2.us.ntp.li  17.253.2.37      2 u  564 1024  377  10.2534  -0.8732   0.8516
+ntp-62b.lbl.gov 128.3.133.141    2 u  748 1024  377  73.6187  -0.3344   1.0547
+time.cloudflare 10.102.8.4       3 u  212 1024  377   8.4884   0.0523   0.9331
 192-184-140-112 .PHC0.           1 u  66h 1024    0  85.8202   5.3690   0.0000
+ntp.nyc.icanbwe 69.180.17.124    2 u  639 1024  377  11.8765  -0.1314   1.1134

ntpd ntpsec-1.2.2                             Updated: 2026-02-04T08:17:40 (32)

 lstint avgint rstr r m v  count    score   drop rport remote address
      0   1284    0 . 6 2    321    1.217      0 51529 localhost
    212   1054   c0 . 4 4    127    0.050      0   123 time.cloudflare.com
    564   1079   c0 . 4 4    123    0.050      0   123 nyc2.us.ntp.li
    598   1058   c0 . 4 4    126    0.050      0   123 ip74-208-14-149.pbiaas.com
    639   1066   c0 . 4 4    125    0.050      0   123 ntp.nyc.icanbwell.com
    748   1055   c0 . 4 4    126    0.050      0   123 ntp-62b.lbl.gov
    834   1066   c0 . 4 4    125    0.050      0   123 144.202.66.214 (144.202.66.214.vultruser
</code></pre></div></div>

<p>I’ll describe peer metrics in a second. For now, the second table, starting with
<code class="language-plaintext highlighter-rouge">lstint</code>, is the MRU list (MRU=most recently used). Here are the stats it
reports.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">lstint</code> Interval (s) between receipt of most recent packet from this address
and completion of the retrieval of the MRU list by ntpq.</li>
  <li><code class="language-plaintext highlighter-rouge">avgint</code> Average interval (s) between packets from this address.</li>
  <li><code class="language-plaintext highlighter-rouge">rstr</code> Restriction flags.</li>
  <li><code class="language-plaintext highlighter-rouge">r</code> Rate control indicator.</li>
  <li><code class="language-plaintext highlighter-rouge">m</code> Packet mode</li>
  <li><code class="language-plaintext highlighter-rouge">v</code> Packet version number.</li>
  <li><code class="language-plaintext highlighter-rouge">count</code> Packets received</li>
  <li><code class="language-plaintext highlighter-rouge">score</code> Packets per second (averaged with exponential decay)</li>
  <li><code class="language-plaintext highlighter-rouge">drop</code> Packets dropped</li>
  <li><code class="language-plaintext highlighter-rouge">rport</code> Source port of last packet received</li>
  <li><code class="language-plaintext highlighter-rouge">remote address</code> The remote host name</li>
</ul>

<p>There are commands you can use to change the output, like <code class="language-plaintext highlighter-rouge">d</code> for detailed mode.</p>

<p>For a snapshot, you can use <code class="language-plaintext highlighter-rouge">ntpq</code>, a helpful tool for inspecting the daemon. It
has an interactive mode and a one-shot mode. This queries peers in the one-shot
mode.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ntpq --peers --units
     remote                                   refid      st t when poll reach   delay   offset   jitter
=======================================================================================================
 0.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
 1.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
 2.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
 3.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
-ip74-208-14-149.pbiaas.com              192.58.120.8     2 u  671 1024  377 41.765ms 931.92us 1.5714ms
-144.202.66.214.vultrusercontent.com     162.159.200.1    4 u  907 1024  377 45.322ms 1.1844ms 1.2529ms
*nyc2.us.ntp.li                          17.253.2.37      2 u  637 1024  377 10.253ms -873.2us 851.60us
+ntp-62b.lbl.gov                         128.3.133.141    2 u  821 1024  377 73.619ms -334.4us 1.0547ms
+time.cloudflare.com                     10.102.8.4       3 u  285 1024  377 8.4884ms 52.298us 933.07us
 192-184-140-112.fiber.dynamic.sonic.net .PHC0.           1 u  66h 1024    0 85.820ms 5.3690ms      0ns
+ntp.nyc.icanbwell.com                   69.180.17.124    2 u  712 1024  377 11.877ms -131.4us 1.1134ms
</code></pre></div></div>

<p>Here’s how this table is interpreted according to the <code class="language-plaintext highlighter-rouge">ntpmon</code> man page:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">tally</code> (symbol next to remote) One of: space (not valid); x, ., or -
(discarded for various reasons); + (included by the combine algorithm); #
(backup); * (system peer); o (PPS peer). Basically, look for the <code class="language-plaintext highlighter-rouge">*</code> and any <code class="language-plaintext highlighter-rouge">+</code> signs to see
who you’re listening to right now.</li>
  <li><code class="language-plaintext highlighter-rouge">remote</code> The host name of the time server.</li>
  <li><code class="language-plaintext highlighter-rouge">refid</code> The RefID identifies the specific upstream time source a server is
using. In other words, it names the reference clock (stratum 0 or 1), even if
this server is just repeating what that reference clock says.</li>
  <li><code class="language-plaintext highlighter-rouge">st</code> NTP stratum</li>
  <li><code class="language-plaintext highlighter-rouge">t</code> Type. u: unicast or manycast, l: local, s: symmetric (peer), B:
broadcast server.</li>
  <li><code class="language-plaintext highlighter-rouge">when</code> sec/min/hr since last received packet.</li>
  <li><code class="language-plaintext highlighter-rouge">poll</code> Poll interval in log2 seconds</li>
  <li><code class="language-plaintext highlighter-rouge">reach</code> Octal triplet. Represents the last 8 attempts to reach the server.
<code class="language-plaintext highlighter-rouge">377</code> is binary <code class="language-plaintext highlighter-rouge">11111111</code>, which means all 8 attempts reached the server. A
value like <code class="language-plaintext highlighter-rouge">326</code> is binary <code class="language-plaintext highlighter-rouge">11010110</code>, meaning out of the last 8 attempts, the
3rd, 5th, and 8th attempts failed.</li>
  <li><code class="language-plaintext highlighter-rouge">delay</code> Roundtrip delay</li>
  <li><code class="language-plaintext highlighter-rouge">offset</code> Offset of server relative to this host.</li>
  <li><code class="language-plaintext highlighter-rouge">jitter</code> Jitter is random noise relative to the standard timescale.</li>
</ul>
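<p>To make the <code class="language-plaintext highlighter-rouge">reach</code> arithmetic concrete, here’s a small helper of my own (not part of ntpq) that decodes the octal value into which of the last 8 polls failed, oldest first:</p>

```python
def failed_polls(reach_octal):
    """Decode ntpq's 'reach' shift register (given as an octal string)
    into the 1-based positions of failed polls, where position 1 is the
    oldest of the last 8 attempts and position 8 the most recent."""
    bits = format(int(reach_octal, 8), "08b")
    return [i for i, bit in enumerate(bits, start=1) if bit == "0"]

# "377" means all 8 polls succeeded; "326" means polls 3, 5, and 8 failed.
```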

<p>For more complete definitions, see <code class="language-plaintext highlighter-rouge">man ntpmon</code>.</p>

<p>Many of these are technical and mostly of interest to those already experienced
with NTP. I’m not, so I’ve focused on a few of the more interesting metrics:
tally, reach, delay, offset, and jitter. These are the same metrics that
<code class="language-plaintext highlighter-rouge">ntpviz</code> reports on.</p>

<hr />

<blockquote>
  <p>There is a law of error that may be stated as follows: small errors do not
matter until large errors are removed. So with the history of time
measurement: each improvement in the performance of clocks and watches posed a
new challenge by bringing to the fore problems that had previously been
relatively small enough to be neglected.</p>

  <p><em>Revolution in Time</em> p. 114</p>
</blockquote>

<p>For a long time, agreement between clocks didn’t matter. Citizens of the US in
the 19th century had timekeepers, but they set them using “apparent solar time”,
or time estimated by when the sun is highest in the sky. This varies across any
distance east or west, so my clock’s noon in Pennsylvania was noticeably
different from my cousin’s clock in Pittsburgh. Apparent solar time is set by
sundial, and astronomers could keep time better still by looking at the
movements of the planets and stars. (But who had an astronomer in those days?)
Besides the sun, you had church bells and tower clocks. Ye old tower clock in
most cases was set by sundial, and “none too accurately” in the words of the
clock museum. Religious clocks were more of a suggestion of the time.</p>

<p>Coordination wasn’t a moral imperative in the US until the railroads. When
you’re coordinating a few hundred trains in and out of stations, timekeeping
becomes quite important. For most of the 19th century, each railroad company had
its own timekeeping system and standards for accuracy. This created competing
definitions of time, and confusion and accidents followed. In the middle of the
century, there were 144 official time zones in North America alone<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>

<p>The accidents and fatalities motivated the US to move to a new definition of
<em>standard time</em> based on only four main timezones, the same basic ones we use
today.</p>

<p>If you’re like me, you get nervous thinking about the logistics of suddenly
changing the time, but while the topic of changing from “God’s time” to an
official time was controversial, the actual change seems to have gone well.
There was a <a href="https://guides.loc.gov/this-month-in-business-history/november/day-of-two-noons">day of two
noons</a>
on November 18, 1883, and official clocks and watches were set to the correct
time via telegraph. And that was it.</p>

<p><img src="/assets/images/2026-timekeeping/sunday_morning_herald.png" alt="Sunday Herald article from November 18, 1883" /></p>

<p>Source: <a href="https://www.nyshistoricnewspapers.org">https://www.nyshistoricnewspapers.org</a></p>

<hr />

<p>After a day or two, I checked back in on my NTP stats to see what I’d collected.
For my distribution, the data collects in <code class="language-plaintext highlighter-rouge">/var/log/ntpsec/</code>. Running <code class="language-plaintext highlighter-rouge">ntpviz</code>
on this folder will generate an HTML report with all of the default data
visualizations.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nptviz -d /var/log/ntpsec/
open ntpgraphs/index.html
</code></pre></div></div>

<p>The interesting graph for me is the first one, which plots clock offset (ms,
left axis) and frequency skew (ppm, right axis). My clock is slow, pretty
consistently, by about 7ppm. That is, over 1 million oscillations, my clock will
read 7 periods less than the authority. As long as this is consistent, that’s
ok.</p>

<p><img src="/assets/images/2026-timekeeping/local-offset.png" alt="Clock offset and frequency skew" /></p>

<p>At some point on Jan 31, I suddenly found myself 4ms ahead of the reference
clock, and the ensuing correction was a bit too big. But the last day or two has
been very stable.</p>

<p>The next graph shows “RMS time jitter” (RMS=root mean square), or in other words
“how fast the local clock offset is changing.” The tip under the graph says that
0 is ideal, but it doesn’t give me a sense of whether my clock with a 90% range
of 0.528 is any good. It seems spiky.</p>

<p><img src="/assets/images/2026-timekeeping/local-jitter.png" alt="RMS time jitter" /></p>

<p>And a third graph shows RMS <em>frequency</em> jitter, similar metric but for my
oscillator’s consistency.</p>

<p><img src="/assets/images/2026-timekeeping/local-stability.png" alt="RMS frequency jitter" /></p>

<p>Skipping down a bit, there’s a fun correlation graph between local temperature
and the frequency offset. My computer apparently measures temperature in two
different places (one consistently warmer than the other). You can see how
sudden changes in temperature correlate closely with changes in the frequency
offset. The spikes are caused by the space heater in my office.</p>

<p><img src="/assets/images/2026-timekeeping/local-freq-temps.png" alt="Correlation between local temperature and frequency offset" /></p>

<p>All of this is still abstract to me. I’ll have to collect more data and try it
on a few different machines until I get a better sense for what’s good and
what’s not.</p>

<hr />

<p>Ok, it’s time time got defined.</p>

<p>To define the time, you need a few things:</p>

<ul>
  <li>An oscillator</li>
  <li>A count of oscillations (the “epoch”)</li>
  <li>An origin</li>
</ul>

<p>An oscillator with a counter is called a <em>clock</em>, and the origin is called the
“frame of reference.” If you consider Earth’s rotations as an oscillator, then
the “day” is the counter, where “day” is a complete rotation of the Earth. The
origin can be anything convenient, maybe the oscillation when Halley’s comet
last passed overhead, or a particular spring equinox. A particular clock is
called a <em>timescale</em>.</p>

<p>Before 1958, the heavenly bodies defined the common timescale. The second was
defined as 1/86,400 of a solar day, which is the average time between apparent
noon at some standard location, like the Royal Observatory in Greenwich. There
are all kinds of quirks with this. First, days are getting longer, because the
Earth’s rotation is slowing down. It’s estimated that several hundred million
years ago, there were only 20 hours in the day. This is caused by the friction
of tides.</p>

<p>Second, there are variations in the rotation for other reasons. It’s not a
stable oscillator, and the Earth’s tilt varies over time, which causes other
inconsistencies. It turns out that this timescale is still useful in modern
timekeeping (it’s a component of Greenwich Mean Time), but it’s not an
effective standard for the SI second.</p>

<p>“In 1958, the standard second was redefined as 1/31,556,925.9747 of the tropical
year that began this century,” (RFC 1305). The tropical year is the time the Sun
takes to return to the same position in the sky from some perspective on Earth.
This only lasted until 1967, because it was still not precise enough for modern
needs. The tropical year has an accuracy of only 50 ms and increases by 5ms per
year.</p>

<p>In 1967, the second was redefined using ground state transitions of the
cesium-133 atom, in particular 1 second = 9,192,631,770 periods. Since 1972,
“time” has had a foundation of International Atomic Time (TAI), which is defined
using the cesium state transition timescale alone. This is a very important time
standard – it underlies UTC, for example.</p>

<p>TAI is a continuous average count of standard atomic seconds since 1958-01-01
00:00:00 TAI. You might say, “Hey, that’s defined in terms of TAI,” and yeah, I
was wondering about that myself. To understand the origin of TAI, you have to
understand the standard Modified Julian Date (MJD). There’s no space here for
that, but in essence, it’s a more precise version of our intuitive understanding
of a calendar of recent events. Historical dates are vague, but modern dates are
well tracked. In other words, the origin is determined from well-known
astronomical observations.</p>

<p>There are a lot of standards (UT, UT0, UT1, UT2, TAI, GMT, UTC) and I don’t have
the space or knowledge to disambiguate them all. But I want to answer one of the
questions that started this history hunt. What’s the difference between UTC and
GMT?</p>

<p>Coordinated Universal Time and Greenwich Mean Time. The former is a variant of
TAI that occasionally inserts leap seconds in order to stay in step with GMT.
GMT is mean solar time (also known as local mean time) at the Royal Observatory
in Greenwich, London.</p>

<p>UTC stays in step with GMT through leap seconds, which are inserted/deleted when
the difference between GMT and UTC approaches 0.7 seconds. These leap seconds
make UTC a non-continuous timescale. TAI on the other hand is continuous –
there are no leap seconds (UTC = TAI - leap seconds). TAI will continue to drift
out of sync with our intuition for “the time” based on the orbital oscillations,
but UTC, like so much of timekeeping, is social. Great pains have been taken to
make it precise but intuitive.</p>
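<p>A minimal sketch of the TAI/UTC relationship. The 37-second offset is the accumulated leap-second total as of the leap second added at the end of 2016 (a published value, not from this post), and it changes whenever a new leap second is announced:</p>

```python
# TAI/UTC relationship sketch. LEAP_SECONDS is the accumulated TAI - UTC
# offset: 37 s since the start of 2017.
LEAP_SECONDS = 37

def tai_to_utc(tai_seconds: float) -> float:
    """UTC = TAI - accumulated leap seconds; TAI itself never jumps."""
    return tai_seconds - LEAP_SECONDS

assert tai_to_utc(1_000_000_037) == 1_000_000_000
```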

<p>And I can also answer the question I mentioned above, <em>What would happen if the
atomic clocks on Earth stopped for 10 minutes?</em><sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> When I posed the question, I
imagined a server with a red segmented display keeping <em>the time</em> somewhere in a
vault. But now I know that the cesium atom transitions define the second, not
the time. The agreed-upon time is just that, <em>agreed upon</em>. If the frequency
reference stops working, the time servers of the world would no longer receive
a signal telling them if they’re on the standard or not. They might drift by
picoseconds in 10 minutes, but not enough to cause catastrophe. And the
frequency standard can be recovered by restarting the cesium atom state
transitions – the clocks of the world would come once again to agree.</p>

<p>The point of all of this is that “the time” is not much more complicated than
“whatever we say it is.” As Poincaré said in the quote above, the true time is
the one that’s most convenient.</p>

<hr />

<p>My NTP server has been keeping time for me for a week now while I researched
this piece. All week I’ve been hunting for this idea of “the actual time”
separate from how I intuitively understood it.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>

<p>But even though we took one step away from natural time by embracing averages
over daily observations, we’ve taken a step back towards nature with UTC, which
has been jumping through hoops (or at least seconds) to keep civil time
convenient.</p>

<p>In the end, we all have the time. “The time” is a social construct, an event of
the collective conscious, the one thing we can agree on because it was our
agreement that defined it in the first place.</p>

<h1 id="sources">Sources</h1>

<ul>
  <li>https://linux.die.net/sag/hw-sw-clocks.html</li>
  <li>https://wiki.debian.org/DateTime</li>
  <li>https://ntpsec.org</li>
  <li>https://wiki.archlinux.org/title/Systemd-timesyncd</li>
  <li>The National Watch and Clock Museum in Columbia, PA</li>
  <li><em>Revolution in Time</em>, David S. Landes, 1e</li>
  <li><em>What is Time?</em>, G. J. Whitrow</li>
  <li><em>The Design and Implementation of the FreeBSD Operating System</em>, 2e</li>
  <li>RFC 1305, especially Appendix E: The NTP Timescale and its Chronometry.</li>
  <li>https://www.nist.gov/pml/time-and-frequency-division/popular-links/walk-through-time/walk-through-time-atomic-age-time</li>
  <li>https://www.nist.gov/pml/time-and-frequency-division/popular-links/walk-through-time/walk-through-time-world-time-scales</li>
  <li>https://guides.loc.gov/this-month-in-business-history/november/day-of-two-noons</li>
  <li>https://www.nyshistoricnewspapers.org</li>
</ul>

<h1 id="other-notes">Other notes</h1>
<p>There’s much more to say about this topic, much more research I couldn’t include
here. A sampling of other interesting topics.</p>

<ul>
  <li>How time is kept in distributed systems and Lamport’s article on clocks</li>
  <li>Special relativity and the meaning and relativity of the simultaneity of
events</li>
  <li>Why your garden sundial doesn’t work (and how to fix it)</li>
  <li>Scams and scandals of US timekeeping authorities, who made a killing off of
giving preferential treatment to some watchmakers and not others</li>
  <li>Daylight savings time and the madness of crowds</li>
  <li>This whole <a href="https://www.youtube.com/watch?v=-5wpm-gesOY">Tom Scott video</a> and
how computers deal with calendars</li>
</ul>

<p>Maybe some other day.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Escapements convert potential energy like a falling weight suspended by a rope into periodic motion, such as the ticking of a hand. There’s no better visualization of the development of the clock than Bartosz Ciechanowski’s <a href="https://ciechanow.ski/mechanical-watch/">mechanical watch</a>. And while these escapements were crude in the beginning, it took only a few breakthroughs until they were able to tell time within a few seconds per day. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>How do you get 144 time zones? Any move along a line of latitude (i.e. east or west) causes the sun’s apparent apex to move. When the sun is highest in the sky in Pennsylvania, it’s still rising in Colorado. Your sundial would in that case create infinite time zones for each variation in longitude. The railroads “solved” this by using a standard time for each major city they stopped in. You got a sort of “average solar time” for this stretch of railroad. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>In fact, an <a href="https://www.npr.org/2025/12/21/nx-s1-5651317/colorado-us-official-time-microseconds-nist-clocks">NIST atomic clock did recently stop</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>I haven’t struggled alone. Sundials used to come with an <em>equation of time</em> guide that translated the apparent solar time to the current mechanical (“mean”), so the purchaser could know with confidence what the “actual” time is. In the words of David Landes, “instead of setting by the sun, people corrected the sun.” Though I should mention that there were also <em>equation clocks</em>, which used complicated mechanisms to convert from mean time to apparent solar time. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>In which I find out with some certainty what time it is.</p>

<blockquote>
  <p>The ntpd utility can synchronize time to a theoretical precision of about 232
picoseconds. In practice, this limit is unattainable due to quantum limits on
the clock speed of ballistic-electron logic.</p>

  <p>(https://docs.ntpsec.org/latest/ntpd.html)</p>
</blockquote>

<p>I always assumed I was using ntpd to keep time on my linux computer. But I was
only sort of right.</p>

<p>According to the Debian Wiki, since Debian 12, the default NTP client is
<code class="language-plaintext highlighter-rouge">systemd-timesyncd</code>. It uses SNTP (Simple Network Time Protocol), which
implements the client with no option to host a time server, and it sets the time
roughly by communicating with a single time server. There’s no recourse if you
get a bad server, or “falseticker” in NTP parlance.</p>

<p>There are a few implementations of NTP to choose from. The systemd-timesyncd
daemon is a basic client suitable for keeping time. The original NTP reference
implementation is <code class="language-plaintext highlighter-rouge">ntpd</code>, which is still around, but is deprecated on Debian in
favor of the more security-focused <a href="https://ntpsec.org">NTPSec</a>. And then
<a href="https://chrony-project.org">Chrony</a> is a newer implementation that is more
practical than <code class="language-plaintext highlighter-rouge">ntpd</code>. It looks like a darn fine timekeeper <a href="https://chrony-project.org/faq.html#_how_does_chrony_compare_to_ntpd">by
comparison</a>.</p>

<p>There are interesting things to say about each NTP tool (and their apparent
<a href="https://www.linux-magazine.com/Online/Blogs/Off-the-Beat-Bruce-Byfield-s-Blog/NTPsec-The-Wrong-Fork-for-the-Wrong-Reasons">controversies</a>),
but if you’re interested in NTP you can pick pretty equally among <code class="language-plaintext highlighter-rouge">ntpd</code>,
Chrony, and NTPSec. I’ve been playing with NTPSec for its debugging utilities
like out-of-the-box data visualizations using <code class="language-plaintext highlighter-rouge">ntpviz</code>.</p>

<hr />

<p>Most computers have a real time clock in hardware and a system clock in
software. On powerup or reboot, the system clock is set using the RTC. You can
use a command like <code class="language-plaintext highlighter-rouge">date</code> to set the date/time, but this only updates the system
clock; strictly speaking, to update the hardware clock immediately you’ll need</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>hwclock --systohc
</code></pre></div></div>

<p>The hardware clock is battery-driven, which is how its time reading persists
across boots. But some parts of it are curiously system-dependent, for example
whether the hardware clock is set in UTC or local time.</p>

<blockquote>
  <p>If your machine dual boots Windows and Linux, then you could have problems
because Windows uses localtime for the hardware clock; while Linux and Debian
use UTC for the hardware clock. In this case you have two choices. The first
is to use localtime for the hardware clock, and set Debian to use localtime.
The second is to use UTC for the hardware clock, and set Windows to use UTC.</p>

  <p>https://wiki.debian.org/DateTime</p>
</blockquote>

<p>NTP implementations like Chrony and NTPSec don’t directly interact with the RTC;
instead, they modify the system clock. They <em>tend</em> to make use of a kernel
feature called “11-minute mode”, where the system clock syncs to the hardware
clock every 11 minutes, but documentation on this is a bit scant. Some comments
in the <a href="https://chrony-project.org/faq.html#_real_time_clock_issues">Chrony
docs</a>.</p>

<p>Real time clocks are usually crystal oscillators with a frequency of 32.768 kHz,
and since NTP doesn’t directly interact with them, I’m not going to talk much
more about them.</p>

<p>Software clocks on the other hand are crucial to the system. Every system that
NTP runs on must provide a time correction service. The <code class="language-plaintext highlighter-rouge">adjtime</code> syscall is
intended to be portable. As far as I’ve seen, it’s POSIX standard. You might
also see <code class="language-plaintext highlighter-rouge">adjtimex</code>, which is a Linux-specific variant.</p>

<p>To explain how the system call works, <em>The Design and Implementation of the
FreeBSD Operating System</em>:</p>

<blockquote>
  <p>The <code class="language-plaintext highlighter-rouge">settimeofday</code> system call will result in time running backward on
machines whose clocks were fast. Time running backward can confuse user
programs (such as <code class="language-plaintext highlighter-rouge">make</code>) that expect time to invariably increase. To avoid
this problem, the system provides the <code class="language-plaintext highlighter-rouge">adjtime</code> system call [Mills, 1992].
The <code class="language-plaintext highlighter-rouge">adjtime</code> system call takes a time delta (either positive or negative) and
changes the rate at which time advances by 10 percent, faster or slower, until
the time has been corrected. The operating system does the speedup by
incrementing the global time by 1100 microseconds for each tick and does the
slowdown by incrementing the global time by 900 microseconds for each tick.
Regardless, time increases monotonically, and user processes depending on the
ordering of file-modification times are not affected. However, time changes
that take tens of seconds to adjust will affect programs that are measuring
time intervals by using repeated calls to gettimeofday</p>
</blockquote>

<p>The Mills reference is to <a href="https://datatracker.ietf.org/doc/html/rfc1305">RFC 1305</a>.</p>
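<p>A toy model of that slew behavior, using the 1000-microsecond ticks and 10 percent rate from the quote (illustrative numbers, not a real kernel interface):</p>

```python
# Toy model of the adjtime slew: each real tick is 1000 us, but the clock is
# credited 1100 us (speedup) or 900 us (slowdown) until the delta is used up.
TICK_US = 1000   # real time per tick, in microseconds
SLEW_US = 100    # 10 percent adjustment applied each tick

def ticks_to_correct(delta_us: int) -> int:
    """Ticks needed to slew away a delta (assumed a multiple of SLEW_US)."""
    return abs(delta_us) // SLEW_US

# A clock that is 1 second fast takes 10 seconds of real time to correct:
seconds = ticks_to_correct(1_000_000) * TICK_US / 1_000_000
print(seconds)  # prints 10.0
```

The upshot: corrections always take ten times the delta in wall-clock time, which is why the quote warns about programs measuring intervals during a long adjustment.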

<p>Since I have TDAIOTFBSDOS open already, I can mention a few other things about
how a typical POSIX software clock works. The system software clock is created
through an interrupt timer, and the system “increments its global time variable
by an amount equal to the number of microseconds per tick. For the PC, running
at 1000 ticks per second, each tick represents 1000 microseconds,” (p. 73). And
if you think 1000 interrupts per second is a lot of interruption, you’re right.
“To reduce the interrupt load, the kernel computes the number of ticks in the
future at which an action may need to be taken. It then schedules the next clock
interrupt to occur at that time. Thus, clock interrupts typically occur much
less frequently than the 1000 ticks-per-second rate implies,” (pp. 65-66).</p>

<p>I’d guess (and a brief conversation with ChatGPT seems to confirm) that modern
operating systems have heavily optimized this part of their timekeeping. After
all, who cares what time it is if no process is trying to observe it?</p>

<hr />

<blockquote>
  <p>There is not one way of measuring time more true than another; that which is
generally adopted is only more <em>convenient</em>.</p>

  <p>Henri Poincaré</p>
</blockquote>

<p>What would it take for me to serve time to others? NTP servers listen on port
123 and usually work only over UDP, so I suppose the simple way to serve time is
to start ntpd in server mode, start listening, and configure someone to ask you
the time. But if you really want to be seen, you have to join the
<a href="https://www.ntppool.org/en/">pool</a>.</p>

<blockquote>
  <p>The pool.ntp.org project is a big virtual cluster of timeservers providing
reliable, easy to use NTP service for millions of clients.</p>

  <p>The pool is being used by hundreds of millions of systems around the world.
It’s the default “time server” for most of the major Linux distributions and
many networked appliances (see information for vendors).</p>

  <p>https://www.ntppool.org/en/</p>
</blockquote>

<p>There’s very clear documentation on <a href="https://www.ntppool.org/en/join.html">how to join the
pool</a>, too. The basic steps are:</p>

<ul>
  <li>Get your own time from a known good source (<em>not</em> the pool).</li>
  <li>Configure a stable IP address (trickier than you might think – even if you
set up port forwarding to get around DHCP issues, your ISP tends to rotate
your public IP address as it wants).</li>
  <li>Be willing to make a long-term commitment to the project.</li>
</ul>

<p>I’ll put “create a home time server” on my list of things to try, but joining
the pool would probably create too big a wave.</p>

<hr />

<p>My computer may not be the best to ask for the precise time. Where does
authority in timekeeping come from? Who has the time? I’m only an hour’s drive
away from the <a href="https://museum.nawcc.org">National Clock and Watch Museum</a>, and
after a visit and a few dozen hours of followup research, I have something
approximating an answer.</p>

<p>We get our sense of time from the periodic movements of the starry firmament –
the sun, the moon, the stars. And our bodies, along with most organisms on
Earth, have built-in timers that encourage us to do those activities that keep
us alive. This is when you usually sleep, this is when you usually eat. As
different cultures sought to understand the heavens and their perfection,
timekeeping began to occupy its modern central role for coordination in our
lives.</p>

<p>And so at the moment my interest in timekeeping isn’t how we developed precision
clocks, but how we managed to coordinate ourselves using those clocks. <em>Why</em> we
coordinate ourselves using those clocks. What even is a clock? If all of the
atomic clocks in the world stopped ticking for 10 minutes, would we be able to
recover “the time”?</p>

<p>We’ve had a sense for calendars for a long time. “By the 14th century BC the
Shang Chinese had established the solar year as 365.25 days and the lunar month
as 29.5 days,” (RFC 1305). By 432 BC, the Greek astronomer Meton had estimated
the lunar month – the time it takes for the moon to circle the earth – to
within about 2 minutes of the currently understood value.</p>

<p>Time-curious cultures became duly obsessed with the frequency and stability of
our cosmic oscillators. The Earth’s rotation and its orbit around the sun, the
moon’s orbit around the Earth. And each culture had a calendar that tried to
match the motions of the cosmos with a predictable and convenient “civilian”
calendar.</p>

<p>Not all cultures had a calendar, and the ones that did used different systems,
so the process of dating events is knotty. Suffice to say that it involves some
guesswork. The best case for understanding the orders of events in the old days
is having what Mills calls (in RFC 1305) “an accurate count of the days relative
to some globally alarming event, such as a comet passage or supernova
explosion.”</p>

<p>And so calendars are social. The civil calendar had to be convenient and fit
into the activities of daily life, and ordering of events depends on some
collective consciousness around global events. I’ve been surprised by how often
we make clocks and calendars fit into daily life and not the other way around.
Several of the most precise modern timescales today are based on what feels
right and looks right, made a bit more precise.</p>

<p>Calendars order our years; clocks order our days. Early religion temporalized
daily life by requiring certain religious acts to be done multiple times a day.
Some of the earliest interesting clock-like devices we have are from monasteries
that rang bells at specific times. (And the word <em>clock</em> is derived from the
French word for bell.) This went on for a few hundred years.</p>

<p>The next advance was periodic timekeepers. Time used to be more organic than it
is today. Hours were not equally sized, and the day was not split equally into
24 parts. But somewhere, at some time, Europeans made an intuitive leap from
continuous time devices like the clepsydra or the procession of different stars
and planets to discrete time – time as ticks<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup>. In <em>Revolution in Time</em>,
David Landes considers this one of the great methodological leaps in western
civilization. It took other cultures another 500 years to begin using
oscillating, periodic timekeepers.</p>

<p>Clocks have their social uses. Nearly as soon as clocks became convenient and
domestic, <em>punctuality</em> became an important social cue. And as life became more
connected with trade, trains, and radio, the pragmatic importance of clocks only
increased.</p>

<hr />

<p>I installed NTPSec on my Debian machine and left the configuration mostly as-is.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>sudo apt-get update &amp;&amp; sudo apt-get install ntpsec ntpsec-doc ntpsec-ntpviz
</code></pre></div></div>

<p>I made sure to enable statistics, because I’m really after visualizations. I
want to see the thing do stuff. Visualizations are generated using <code class="language-plaintext highlighter-rouge">ntpviz</code>,
which is scantily documented (this was helpful but ancient:
<a href="https://blog.ntpsec.org/2016/12/19/ntpviz-intro.html">ntpviz-intro</a>), but I
found enough to get me going. Unfortunately, I just set up my daemon, and
there’s no data to visualize. I took the opportunity to do some background work
on the metrics.</p>

<p>Clocks are never perfectly in sync, and the most important contributor to
incorrect timekeeping is a difference in oscillator frequencies. This is called
frequency <em>skew</em>. If the “correct” time is an oscillator at 1000 Hz, my local
computer clock might be more like 1001 Hz or 999 Hz. So even if I set my clock
to the right time, I would gain or lose some seconds every day.</p>

<p>Frequency skew is measured in parts per million, which is to say the number of
periods fast or slow per million oscillations. In the 1000 Hz example, 1001 Hz
would have a skew of 1 part in every thousand, or 1000 parts per million (ppm).
999 Hz has a skew of -1000 ppm.</p>

<p>Skew is also described in other ways. A human-friendly way to describe it is
“seconds gained or lost per day”, or week or year. This gives you the number in
practical terms. It’s a bit tricky to translate between them, though,
considering the gap between oscillation frequency and length of a day.</p>
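<p>A minimal sketch of the conversion, assuming nothing beyond the definitions above:</p>

```python
# Converting between the two descriptions of skew. The ppm figure is a
# dimensionless error rate, so "seconds per day" is just ppm scaled by the
# number of seconds in a day.
SECONDS_PER_DAY = 86_400

def ppm_to_seconds_per_day(ppm: float) -> float:
    return ppm * 1e-6 * SECONDS_PER_DAY

def seconds_per_day_to_ppm(seconds: float) -> float:
    return seconds / SECONDS_PER_DAY * 1e6

# The 1001 Hz example above: 1000 ppm fast is almost a minute and a half a day
print(ppm_to_seconds_per_day(1000))   # 86.4 seconds per day
print(seconds_per_day_to_ppm(1))      # gaining 1 s/day is about 11.6 ppm
```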

<p>Your skew might also vary over time, and this is called <em>drift</em>.</p>

<p>NTP corrects for skew as part of the protocol by nudging the time and doing its
best to predict changes. Skew is affected by the quality of the hardware and the
environment around the oscillator, especially temperature. For ideal
timekeeping, you’ll want to keep your computer in a nice climate-controlled
vault with excellent heat sinking.</p>

<p>The clock offset is the estimated difference between my clock and the reference
clock, measured in milliseconds. I show roughly how this is calculated in <a href="/2026/01/27/ntp-in-30-seconds.html">NTP
in 30 Seconds</a>. In short, it’s
calculated by estimating the latency between you and the server and using that
to guess what time the server received your request. Then you compare your guess
(based on local time + latency) to what the server reported was the “actual”
time it received the request, and use the difference to work out how wrong your
clock is.</p>
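<p>Sketched in code, this is the textbook four-timestamp calculation; the timestamps below are made up for illustration, not from a real exchange:</p>

```python
# The four-timestamp NTP calculation described above:
#   t1 = client send, t2 = server receive, t3 = server send, t4 = client receive
# The offset formula assumes the outbound and return paths take equal time.
def ntp_offset_delay(t1, t2, t3, t4):
    delay = (t4 - t1) - (t3 - t2)          # round trip minus server processing
    offset = ((t2 - t1) + (t3 - t4)) / 2   # how far the client clock is behind
    return offset, delay

# Illustrative numbers: client 5 ms slow, 40 ms symmetric round trip,
# 1 ms of server processing time.
offset, delay = ntp_offset_delay(t1=0.000, t2=0.025, t3=0.026, t4=0.041)
# offset comes out to ~0.005 s (client behind), delay ~0.040 s
```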

<p>Practically speaking, for general monitoring, you can use <code class="language-plaintext highlighter-rouge">ntpmon</code>. This is a
<code class="language-plaintext highlighter-rouge">top</code>-like tool for watching your NTP daemon interact with peers. The output
looks something like this:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>     remote           refid      st t when poll reach   delay   offset   jitter
 0.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
 1.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
 2.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
 3.debian.pool.n .POOL.          16 p    -   64    0   0.0000   0.0000   0.0001
-ip74-208-14-149 192.58.120.8     2 u  598 1024  377  41.7645   0.9319   1.5714
-144.202.66.214. 162.159.200.1    4 u  834 1024  377  45.3224   1.1844   1.2529
*nyc2.us.ntp.li  17.253.2.37      2 u  564 1024  377  10.2534  -0.8732   0.8516
+ntp-62b.lbl.gov 128.3.133.141    2 u  748 1024  377  73.6187  -0.3344   1.0547
+time.cloudflare 10.102.8.4       3 u  212 1024  377   8.4884   0.0523   0.9331
 192-184-140-112 .PHC0.           1 u  66h 1024    0  85.8202   5.3690   0.0000
+ntp.nyc.icanbwe 69.180.17.124    2 u  639 1024  377  11.8765  -0.1314   1.1134

ntpd ntpsec-1.2.2                             Updated: 2026-02-04T08:17:40 (32)

 lstint avgint rstr r m v  count    score   drop rport remote address
      0   1284    0 . 6 2    321    1.217      0 51529 localhost
    212   1054   c0 . 4 4    127    0.050      0   123 time.cloudflare.com
    564   1079   c0 . 4 4    123    0.050      0   123 nyc2.us.ntp.li
    598   1058   c0 . 4 4    126    0.050      0   123 ip74-208-14-149.pbiaas.com
    639   1066   c0 . 4 4    125    0.050      0   123 ntp.nyc.icanbwell.com
    748   1055   c0 . 4 4    126    0.050      0   123 ntp-62b.lbl.gov
    834   1066   c0 . 4 4    125    0.050      0   123 144.202.66.214 (144.202.66.214.vultruser
</code></pre></div></div>

<p>I’ll describe peer metrics in a second. For now, the second table, starting with
<code class="language-plaintext highlighter-rouge">lstint</code>, is the MRU list (MRU=most recently used). Here are the stats it
reports.</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">lstint</code> Interval (s) between receipt of most recent packet from this address
and completion of the retrieval of the MRU list by ntpq.</li>
  <li><code class="language-plaintext highlighter-rouge">avgint</code> Average interval (s) between packets from this address.</li>
  <li><code class="language-plaintext highlighter-rouge">rstr</code> Restriction flags.</li>
  <li><code class="language-plaintext highlighter-rouge">r</code> Rate control indicator.</li>
  <li><code class="language-plaintext highlighter-rouge">m</code> Packet mode</li>
  <li><code class="language-plaintext highlighter-rouge">v</code> Packet version number.</li>
  <li><code class="language-plaintext highlighter-rouge">count</code> Packets received</li>
  <li><code class="language-plaintext highlighter-rouge">score</code> Packets per second (averaged with exponential decay)</li>
  <li><code class="language-plaintext highlighter-rouge">drop</code> Packets dropped</li>
  <li><code class="language-plaintext highlighter-rouge">rport</code> Source port of last packet received</li>
  <li><code class="language-plaintext highlighter-rouge">remote address</code> The remote host name</li>
</ul>
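<p>The “score” column is a smoothed packets-per-second rate. The sketch below is a generic exponentially-decayed average, <em>not</em> NTPSec’s actual code; the decay constant is made up for illustration:</p>

```python
# Generic exponentially-decayed packet rate, standing in for the "score"
# column. Alpha here is an invented smoothing constant.
def update_score(score: float, interval_s: float, alpha: float = 0.125) -> float:
    """Fold one new inter-packet interval into the smoothed rate (pkts/sec)."""
    return (1 - alpha) * score + alpha * (1.0 / interval_s)

score = 0.0
for interval in [64, 64, 64, 64]:   # a packet arriving every 64 seconds
    score = update_score(score, interval)
# score converges toward 1/64 = 0.015625 packets per second
```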

<p>There are commands you can use to change the output, like <code class="language-plaintext highlighter-rouge">d</code> for detailed mode.</p>

<p>For a snapshot, you can use <code class="language-plaintext highlighter-rouge">ntpq</code>, a helpful tool for inspecting the daemon. It
has an interactive mode and a one-shot mode. This queries peers in the one-shot
mode.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ntpq --peers --units
     remote                                   refid      st t when poll reach   delay   offset   jitter
=======================================================================================================
 0.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
 1.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
 2.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
 3.debian.pool.ntp.org                   .POOL.          16 p    -   64    0      0ns      0ns    119ns
-ip74-208-14-149.pbiaas.com              192.58.120.8     2 u  671 1024  377 41.765ms 931.92us 1.5714ms
-144.202.66.214.vultrusercontent.com     162.159.200.1    4 u  907 1024  377 45.322ms 1.1844ms 1.2529ms
*nyc2.us.ntp.li                          17.253.2.37      2 u  637 1024  377 10.253ms -873.2us 851.60us
+ntp-62b.lbl.gov                         128.3.133.141    2 u  821 1024  377 73.619ms -334.4us 1.0547ms
+time.cloudflare.com                     10.102.8.4       3 u  285 1024  377 8.4884ms 52.298us 933.07us
 192-184-140-112.fiber.dynamic.sonic.net .PHC0.           1 u  66h 1024    0 85.820ms 5.3690ms      0ns
+ntp.nyc.icanbwell.com                   69.180.17.124    2 u  712 1024  377 11.877ms -131.4us 1.1134ms
</code></pre></div></div>

<p>Here’s how this table is interpreted according to the <code class="language-plaintext highlighter-rouge">ntpmon</code> man page:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">tally</code> (symbol next to remote) One of: <code class="language-plaintext highlighter-rouge">space</code> (not valid); x, ., or - (discarded for various reasons); + (included by the combine algorithm); # (backup); * (system peer); o (PPS peer). Basically, look for the <code class="language-plaintext highlighter-rouge">*</code> and any <code class="language-plaintext highlighter-rouge">+</code> signs to see
who you’re listening to right now.</li>
  <li><code class="language-plaintext highlighter-rouge">remote</code> The host name of the time server.</li>
  <li><code class="language-plaintext highlighter-rouge">refid</code> The RefID identifies the specific upstream time source a server is
using. In other words, it names the reference clock (stratum 0 or 1), even if
this server is just repeating what that reference clock says.</li>
  <li><code class="language-plaintext highlighter-rouge">st</code> NTP stratum</li>
  <li><code class="language-plaintext highlighter-rouge">t</code> Type. u: unicast or manycast client, l: local, s: symmetric (peer), B:
broadcast server.</li>
  <li><code class="language-plaintext highlighter-rouge">when</code> sec/min/hr since last received packet.</li>
  <li><code class="language-plaintext highlighter-rouge">poll</code> Poll interval in log2 seconds</li>
  <li><code class="language-plaintext highlighter-rouge">reach</code> Octal triplet. Represents the last 8 attempts to reach the server.
<code class="language-plaintext highlighter-rouge">377</code> is binary <code class="language-plaintext highlighter-rouge">11111111</code>, which means all 8 attempts reached the server. A
value like <code class="language-plaintext highlighter-rouge">326</code> is binary <code class="language-plaintext highlighter-rouge">11010110</code>, meaning out of the last 8 attempts, the
3rd, 5th, and 8th attempts failed.</li>
  <li><code class="language-plaintext highlighter-rouge">delay</code> Roundtrip delay</li>
  <li><code class="language-plaintext highlighter-rouge">offset</code> Offset of server relative to this host.</li>
  <li><code class="language-plaintext highlighter-rouge">jitter</code> Jitter is random noise relative to the standard timescale.</li>
</ul>
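<p>To make the reach arithmetic concrete, here’s a small shell sketch (mine, not part of the ntpmon tooling) that expands an octal reach value into its 8 poll bits, oldest attempt first:</p>

```shell
#!/bin/sh
# Expand an ntpmon "reach" value (octal) into its 8 bits,
# oldest poll first: 1 = server reached, 0 = poll missed.
reach_bits() {
    n=$((0$1))    # a leading 0 makes the shell read $1 as octal
    bits=""
    for w in 128 64 32 16 8 4 2 1; do
        bits="${bits}$(( n / w % 2 ))"
    done
    echo "$bits"
}

reach_bits 377    # prints 11111111: all 8 polls reached the server
reach_bits 326    # prints 11010110: the 3rd, 5th, and 8th polls missed
```

<p>Reading the bits left to right matches the failed-attempt positions described above.</p>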

<p>For more complete definitions, see <code class="language-plaintext highlighter-rouge">man ntpmon</code>.</p>

<p>Many of these are technical and mostly of interest to those already experienced
with NTP. I’m not, so I’ve focused on a few of the more interesting metrics:
tally, reach, delay, offset, and jitter. These are the same metrics that
<code class="language-plaintext highlighter-rouge">ntpviz</code> reports on.</p>

<hr />

<blockquote>
  <p>There is a law of error that may be stated as follows: small errors do not
matter until large errors are removed. So with the history of time
measurement: each improvement in the performance of clocks and watches posed a
new challenge by bringing to the fore problems that had previously been
relatively small enough to be neglected.</p>

  <p><em>Revolution in Time</em> p. 114</p>
</blockquote>

<p>For a long time, agreement between clocks didn’t matter. Citizens of the US in
the 19th century had timekeepers, but they set them using “apparent solar time”,
or time estimated by when the sun is highest in the sky. This varies across any
distance east or west, so my clock’s noon in eastern Pennsylvania would be noticeably
different from my cousin’s clock in Pittsburgh. Apparent solar time is set by
sundial, and astronomers could keep time better still by looking at the
movements of the planets and stars. (But who had an astronomer in those days?)
Besides the sun, you had church bells and tower clocks. Ye old tower clock in
most cases was set by sundial, and “none too accurately” in the words of the
clock museum. Religious clocks were more of a suggestion of the time.</p>

<p>Coordination wasn’t a moral imperative in the US until the railroads. When
you’re coordinating a few hundred trains in and out of stations, timekeeping
becomes quite important. For most of the 19th century, each railroad company had
its own timekeeping system and standards for accuracy. This created competing
definitions of time, and confusion and accidents followed. In the middle of the
century, there were 144 official time zones in North America alone<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>.</p>

<p>The accidents and fatalities motivated the US to move to a new definition of
<em>standard time</em> based on only four main timezones, the same basic ones we use
today.</p>

<p>If you’re like me, you get nervous thinking about the logistics of suddenly
changing the time, but while the topic of changing from “God’s time” to an
official time was controversial, the actual change seems to have gone well.
There was a <a href="https://guides.loc.gov/this-month-in-business-history/november/day-of-two-noons">day of two
noons</a>
on November 18, 1883, and official clocks and watches were set to the correct
time via telegraph. And that was it.</p>

<p><img src="/assets/images/2026-timekeeping/sunday_morning_herald.png" alt="Sunday Herald article from November 18, 1883" /></p>

<p>Source: <a href="https://www.nyshistoricnewspapers.org">https://www.nyshistoricnewspapers.org</a></p>

<hr />

<p>After a day or two, I checked back in on my NTP stats to see what I’d collected.
For my distribution, the data collects in <code class="language-plaintext highlighter-rouge">/var/log/ntpsec/</code>. Running <code class="language-plaintext highlighter-rouge">ntpviz</code>
on this folder will generate an HTML report with all of the default data
visualizations.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ntpviz -d /var/log/ntpsec/
open ntpgraphs/index.html
</code></pre></div></div>

<p>The interesting graph for me is the first one, which plots clock offset (ms,
left axis) and frequency skew (ppm, right axis). My clock is slow, pretty
consistently, by about 7ppm. That is, over 1 million oscillations, my clock will
read 7 periods less than the authority. As long as this is consistent, that’s
ok.</p>
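<p>To put that skew in more familiar units: 1 ppm is 1 microsecond of error per elapsed second, so the drift per day is a one-liner (my arithmetic, not ntpviz output):</p>

```shell
#!/bin/sh
# Microseconds of drift accumulated per day at a given skew in ppm.
# 1 ppm = 1 microsecond per second, and a day has 86,400 seconds.
ppm_drift_per_day() {
    echo "$(( $1 * 86400 ))"
}

ppm_drift_per_day 7    # prints 604800, i.e. about 0.6 s lost per day
```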

<p><img src="/assets/images/2026-timekeeping/local-offset.png" alt="Clock offset and frequency skew" /></p>

<p>At some point on Jan 31, I suddenly found myself 4ms ahead of the reference
clock, and the ensuing correction was a bit too big. But the last day or two has
been very stable.</p>

<p>The next graph shows “RMS time jitter” (RMS=root mean square), or in other words
“how fast the local clock offset is changing.” The tip under the graph says that
0 is ideal, but it doesn’t give me a sense of whether my clock with a 90% range
of 0.528 is any good. It seems spiky.</p>
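<p>For intuition, RMS is just the square root of the mean of the squared samples. A sketch with made-up offsets (these are not my measurements):</p>

```shell
#!/bin/sh
# Root mean square of offset samples, one per line on stdin.
rms() {
    awk '{ s += $1 * $1; n += 1 } END { printf "%.3f\n", sqrt(s / n) }'
}

printf '%s\n' 0.1 -0.3 0.2 | rms    # prints 0.216
```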

<p><img src="/assets/images/2026-timekeeping/local-jitter.png" alt="RMS time jitter" /></p>

<p>And a third graph shows RMS <em>frequency</em> jitter, a similar metric but for my
oscillator’s consistency.</p>

<p><img src="/assets/images/2026-timekeeping/local-stability.png" alt="RMS frequency jitter" /></p>

<p>Skipping down a bit, there’s a fun correlation graph between local temperature
and the frequency offset. My computer apparently measures temperature in two
different places (one consistently warmer than the other). You can see how
sudden changes in temperature correlate closely with changes in the frequency
offset. The spikes are caused by the space heater in my office.</p>

<p><img src="/assets/images/2026-timekeeping/local-freq-temps.png" alt="Correlation between local temperature and frequency offset" /></p>

<p>All of this is still abstract to me. I’ll have to collect more data and try it
on a few different machines until I get a better sense for what’s good and
what’s not.</p>

<hr />

<p>Ok, it’s time time got defined.</p>

<p>To define the time, you need a few things:</p>

<ul>
  <li>An oscillator</li>
  <li>A count of oscillations (the “epoch”)</li>
  <li>An origin</li>
</ul>

<p>An oscillator with a counter is called a <em>clock</em>, and the origin is called the
“frame of reference.” If you consider the Earth’s rotation as an oscillator, then
the count of days is the counter, where a “day” is one complete rotation of the
Earth. The origin can be anything convenient, maybe the oscillation when Halley’s
comet last passed overhead, or a particular spring equinox. A particular clock,
read against its origin, defines a <em>timescale</em>.</p>

<p>Before 1958, the heavenly bodies defined the common timescale. The second was
defined as 1/86,400 of the mean solar day, the average interval between
successive apparent noons at some standard location, like the Royal Observatory
in Greenwich. There are all kinds of quirks with this. First, days are getting
longer, because the Earth’s rotation is slowing down. It’s estimated that
several hundred million years ago, there were only 20 hours in the day. This
slowing is caused by tidal friction.</p>

<p>Second, there are variations in the rotation for other reasons. It’s not a
stable oscillator, and the Earth’s tilt varies over time, which causes other
inconsistencies. It turns out that this timescale is still useful in modern
timekeeping (it’s a component of Greenwich Mean Time), but it’s not an
effective standard for the SI second.</p>

<p>“In 1958, the standard second was redefined as 1/31,556,925.9747 of the tropical
year that began this century,” (RFC 1305). The tropical year is the time the Sun
takes to return to the same position in the sky from some perspective on Earth.
This only lasted until 1967, because it was still not precise enough for modern
needs. The tropical year has an accuracy of only 50 ms and increases by 5ms per
year.</p>

<p>In 1967, the second was redefined using the hyperfine transition of the ground
state of the cesium-133 atom, in particular 1 second = 9,192,631,770 periods of
the corresponding radiation. Since 1972,
“time” has had a foundation of International Atomic Time (TAI), which is defined
using the cesium state transition timescale alone. This is a very important time
standard – it underlies UTC, for example.</p>

<p>TAI is a continuous average count of standard atomic seconds since 1958-01-01
00:00:00 TAI. You might say, “Hey, that’s defined in terms of TAI,” and yeah, I
was wondering about that myself. To understand the origin of TAI, you have to
understand the standard Modified Julian Date (MJD). There’s no space here for
that, but in essence, it’s a more precise version of our intuitive understanding
of a calendar of recent events. Historical dates are vague, but modern dates are
well tracked. In other words, the origin is determined from well-known
astronomical observations.</p>

<p>There are a lot of standards (UT, UT0, UT1, UT2, TAI, GMT, UTC) and I don’t have
the space or knowledge to disambiguate them all. But I want to answer one of the
questions that started this history hunt. What’s the difference between UTC and
GMT?</p>

<p>Coordinated Universal Time and Greenwich Mean Time. The former is a variant of
TAI that occasionally inserts leap seconds in order to stay in step with GMT.
GMT is mean solar time (also known as local mean time) at the Royal Observatory
in Greenwich, London.</p>

<p>UTC stays in step with GMT through leap seconds, which are inserted/deleted when
the difference between GMT and UTC approaches 0.9 seconds. These leap seconds
make UTC a non-continuous timescale. TAI on the other hand is continuous –
there are no leap seconds (UTC = TAI - leap seconds). TAI will continue to drift
out of sync with our intuition for “the time” based on the orbital oscillations,
but UTC, like so much of timekeeping, is social. Great pains have been taken to
make it precise but intuitive.</p>
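<p>The conversion itself is trivial; the only state you need is the cumulative leap-second count, which is 37 as of this writing (it changes whenever a new leap second is announced):</p>

```shell
#!/bin/sh
# UTC = TAI minus the cumulative leap-second offset.
# 37 is the offset in effect since the end of 2016.
LEAP_SECONDS=37
utc_from_tai() {
    echo "$(( $1 - LEAP_SECONDS ))"
}

utc_from_tai 1000000037    # prints 1000000000
```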

<p>And I can also answer the question I mentioned above, <em>What would happen if the
atomic clocks on Earth stopped for 10 minutes?</em><sup id="fnref:3" role="doc-noteref"><a href="#fn:3" class="footnote" rel="footnote">3</a></sup> When I posed the question, I
imagined a server with a red segmented display keeping <em>the time</em> somewhere in a
vault. But now I know that the cesium atom transitions define the second, not
the time. The agreed-upon time is just that, <em>agreed upon</em>. If the frequency
reference stops working, the time servers of the world would no longer receive
a signal telling them if they’re on the standard or not. They might drift by
picoseconds in 10 minutes, but not enough to cause catastrophe. And the
frequency standard can be recovered by restarting the cesium atom state
transitions – the clocks of the world would come once again to agree.</p>

<p>The point of all of this is that “the time” is not much more complicated than
“whatever we say it is.” As Poincaré said in the quote above, the true time is
the one that’s most convenient.</p>

<hr />

<p>My NTP server has been keeping time for me for a week now while I researched
this piece. All week I’ve been hunting for this idea of “the actual time”
separate from how I intuitively understood it.<sup id="fnref:4" role="doc-noteref"><a href="#fn:4" class="footnote" rel="footnote">4</a></sup></p>

<p>But even though we took one step away from natural time by embracing averages
over daily observations, we’ve taken a step back towards nature with UTC, which
has been jumping through hoops (or at least seconds) to keep civil time
convenient.</p>

<p>In the end, we all have the time. “The time” is a social construct, an event of
the collective conscious, the one thing we can agree on because it was our
agreement that defined it in the first place.</p>

<h1 id="sources">Sources</h1>

<ul>
  <li>https://linux.die.net/sag/hw-sw-clocks.html</li>
  <li>https://wiki.debian.org/DateTime</li>
  <li>https://ntpsec.org</li>
  <li>https://wiki.archlinux.org/title/Systemd-timesyncd</li>
  <li>The National Watch and Clock Museum in Columbia, PA</li>
  <li><em>Revolution in Time</em>, David S. Landes, 1e</li>
  <li><em>What is Time?</em>, G. J. Whitrow</li>
  <li><em>The Design and Implementation of the FreeBSD Operating System</em>, 2e</li>
  <li>RFC 1305, especially Appendix E: The NTP Timescale and its Chronometry.</li>
  <li>https://www.nist.gov/pml/time-and-frequency-division/popular-links/walk-through-time/walk-through-time-atomic-age-time</li>
  <li>https://www.nist.gov/pml/time-and-frequency-division/popular-links/walk-through-time/walk-through-time-world-time-scales</li>
  <li>https://guides.loc.gov/this-month-in-business-history/november/day-of-two-noons</li>
  <li>https://www.nyshistoricnewspapers.org</li>
</ul>

<h1 id="other-notes">Other notes</h1>
<p>There’s much more to say about this topic, much more research I couldn’t include
here. A sampling of other interesting topics:</p>

<ul>
  <li>How time is kept in distributed systems and Lamport’s article on clocks</li>
  <li>Special relativity and the meaning and relativity of the simultaneity of
events</li>
  <li>Why your garden sundial doesn’t work (and how to fix it)</li>
  <li>Scams and scandals of US timekeeping authorities, who made a killing off of
giving preferential treatment to some watchmakers and not others</li>
  <li>Daylight savings time and the madness of crowds</li>
  <li>This whole <a href="https://www.youtube.com/watch?v=-5wpm-gesOY">Tom Scott video</a> and
how computers deal with calendars</li>
</ul>

<p>Maybe some other day.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>Escapements convert potential energy like a falling weight suspended by a rope into periodic motion, such as the ticking of a hand. There’s no better visualization of the development of the clock than Bartosz Ciechanowski’s <a href="https://ciechanow.ski/mechanical-watch/">mechanical watch</a>. And while these escapements were crude in the beginning, it took only a few breakthroughs until they were able to tell time within a few seconds per day. <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>How do you get 144 time zones? Any move along a line of latitude (i.e. east or west) causes the sun’s apparent apex to move. When the sun is highest in the sky in Pennsylvania, it’s still rising in Colorado. Your sundial would in that case create infinite time zones for each variation in longitude. The railroads “solved” this by using a standard time for each major city they stopped in. You got a sort of “average solar time” for this stretch of railroad. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:3" role="doc-endnote">
      <p>In fact, an <a href="https://www.npr.org/2025/12/21/nx-s1-5651317/colorado-us-official-time-microseconds-nist-clocks">NIST atomic clock did recently stop</a>. <a href="#fnref:3" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:4" role="doc-endnote">
      <p>I haven’t struggled alone. Sundials used to come with an <em>equation of time</em> guide that translated the apparent solar time to the current mechanical (“mean”) time, so the purchaser could know with confidence what the “actual” time is. In the words of David Landes, “instead of setting by the sun, people corrected the sun.” Though I should mention that there were also <em>equation clocks</em>, which used complicated mechanisms to convert from mean time to apparent solar time. <a href="#fnref:4" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>An automated checklist for computer setup</title>
      <link>https://charlie-gallagher.github.io/2026/01/29/computer-setup-scripts.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/01/29/computer-setup-scripts.html</guid>
      <pubDate>Thu, 29 Jan 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>A little while ago my laptop died once, and then twice, and in between each
failure I had to use a spare laptop. I ended up setting up my computer fully for
work 3 times in the space of a couple weeks. I decided to automate the boring
stuff and make a checklist program and a set of companion scripts for doing
things like pulling all of my repos and installing my most commonly used
libraries and programs through Homebrew.</p>

<p>I already use <code class="language-plaintext highlighter-rouge">stow</code> to <a href="https://brandon.invergo.net/news/2012-05-26-using-gnu-stow-to-manage-your-dotfiles.html?round=two">manage my dotfiles</a>
so a lot of my configuration is taken care of with a few quick stows. But what I
really needed was an automated checklist that told me how close my configuration
was to the “target” laptop. It’s a cheap declarative setup, like a simpler
terraform or nix.</p>

<p>Here’s an example checklist (with no color):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ ./laptop_checklist.sh

Setting up git ──────────────────────────────────────────────────────────
• Setup git config (git-config.sh)
✓ Pull this code from GitHub (git clone https://github.com/my-user/dotfiles.git)
✓ (file) /Users/user/.gitignore
✓ Copy global gitignore to home directory
× (directory) /Users/user/projects/project_1/
× (directory) /Users/user/projects/project_2/
× (directory) /Users/user/projects/project_3/
× Clone important repositories (clone-repos.sh)
✓ (environment variable) GITLAB_PAT
✓ (environment variable) GITHUB_PAT
✓ Export GitHub and GitLab tokens


Setting up terminal ─────────────────────────────────────────────────────
✓ (application) iTerm.app
✓ Install iTerm2 (https://iterm2.com/downloads.html)
• Import Iterm profile into Iterm
• Install frequently used applications (install-common-libraries.sh)
✓ Install OhMyZsh (install-omzsh.sh)
✓ Install Starship prompt (install-starship.sh)
✓ (file) /Users/user/.zshrc
✓ Copy .zshrc into home directory
✓ (file) /Users/user/.vimrc
✓ Install vim (install-vim.sh)


Install desktop applications ────────────────────────────────────────────
• Install VS Code (https://code.visualstudio.com/download)
✓ (file) /Users/user/Library/Application Support/Code/User/settings.json
✓ Run install-code.sh
✓ (application) Google Chrome.app
✓ (application) DBeaver.app
✓ (application) AWS VPN Client
✓ (application) Docker.app
✓ (application) logioptionsplus.app
✓ (application) Postman.app
✓ (application) Utilities/XQuartz.app
✓ (application) Visual Studio Code.app
• Install Google Chrome (https://www.google.com/chrome)
• Install docker and sign in to GitLab registry (https://www.docker.com/products/docker-desktop/)
• Install postman and sign in using lastpass (https://www.postman.com/downloads/)
• Install AWS VPN and set up with ixis-vpn-client-config.ovpn (https://ixisdigital.atlassian.net/l/cp/ctZeA341)
• Install DBeaver (https://dbeaver.io/download/)
• Install omnibug (https://chrome.google.com/webstore/detail/omnibug/bknpehncffejahipecakbfkomebjmokl)
• Install Logitech Options+ (https://www.logitech.com/en-us/software/logi-options-plus.html)
• Install MS Teams
• Install XQuartz (https://www.xquartz.org/)
• Install Talon Voice (https://talonvoice.com/)
• Install Visual Studio Code (https://code.visualstudio.com/download)
✓ Install common applications


Set up AWS credentials ──────────────────────────────────────────────────
✓ (directory) /Users/user/.aws
✓ Initialize .aws folder (install-aws.sh)
✓ (exec) ssocred
✓ (exec) aws
✓ Install AWS tools (install-aws.sh)
✓ (file) /Users/user/.aws_functions.zsh
✓ AWS authentication functions exist
</code></pre></div></div>

<p>The checklist tells me how close I am to the target, but it won’t take any
action on its own. I have other scripts for that. But the thing about the other
scripts is that they:</p>

<ul>
  <li>Have dependencies. There’s a correct order, and I can’t automate that as well
because some steps require human action (like generating a new GitHub PAT).</li>
  <li>Are finicky. Shell scripting is not always a smooth experience, especially
when you might have to jump between shells (if zsh isn’t installed). Not to
mention that entropy affects setup scripts the same as it affects roads and
buildings. Links die, bits rot.</li>
  <li>Don’t give you a high-level view of the current system state.</li>
</ul>

<p>A real declarative system would compare the current state to the desired state
and then take steps to bring the computer into the desired state. Declarative
systems are hard to get right – you have to handle all possible current states
and define how to get to the target state. When I’m running these scripts, I’m
just trying to remember the steps for getting from zero back to a working setup. This is
just a bunch of setup scripts, and I’m fine with a little “meat in the loop.”</p>

<p>All this is to say, I have the following setup:</p>

<ul>
  <li>A bunch of separate setup scripts</li>
  <li>An idea of what I want the final system to look like</li>
</ul>

<h2 id="the-easy-stuff">The easy stuff</h2>
<p>CLI programs and libraries are easy. First, you can usually get everything you
need through Homebrew or your package manager of choice. Second, it’s easy to
test whether they’re installed.</p>

<p>The shell environment is similarly easy to set up and check for. Environment
variables, dotfiles, these are all well-defined environment features that you
can check for.</p>

<h2 id="getting-trickier">Getting trickier</h2>
<p>GitHub PATs are essentially environment variables, but you can’t
programmatically generate them. They’re in the category of “easy to check, manual
to fix”. The other main entrants in this category are applications like VS Code,
Chrome, XQuartz, and so on. These can be checked for in the few places that
MacOS stores applications.</p>

<h2 id="reminders">Reminders</h2>
<p>Some things can neither be tested for nor installed automatically (without
significant effort). For these, I have the idea of a “reminder” in the
checklist that basically says “do this or else”; the script can’t know whether
it’s actually been done.</p>

<p>Examples are usually within applications, like signing into Chrome, importing
settings into VS Code and DBeaver, and configuring my MX Ergo mouse in Logitech
Options.</p>

<h2 id="putting-it-together">Putting it together</h2>
<h3 id="design">Design</h3>

<ul>
  <li>A single script</li>
  <li>No config file, everything done in the script</li>
  <li>Easy to add/drop</li>
  <li>Easy to define sections</li>
  <li>Non-blocking. I want to see the whole status at once.</li>
</ul>

<h3 id="implementation">Implementation</h3>
<p>The program is composed of checkers and checklist items. The checkers are
functions that take some standard input (like the name of an environment
variable) and check whether it exists, returning 0 (success) or 1 (fail).</p>

<p>Sections have a main status for the larger abstract concept (“Set up git”) and
sub-statuses for each checklist item (“install git”, “GH PAT”, etc.). This is a
section that checks for personal access tokens.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Setting up git ──────────────────────────────────────────────────────────"</span>
<span class="nv">git_pat_status</span><span class="o">=</span>PASS
<span class="k">if</span> <span class="o">!</span> check_env GITLAB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
if</span> <span class="o">!</span> check_env GITHUB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
</span>status <span class="nv">$git_pat_status</span> <span class="s2">"Export GitHub and GitLab tokens"</span>
</code></pre></div></div>

<p>The section passes unless any of its children fail, in which case the whole
section fails. Here, I forgot to add a check for the <code class="language-plaintext highlighter-rouge">git</code> binary, so let’s add
it.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Setting up git ──────────────────────────────────────────────────────────"</span>
<span class="nv">git_pat_status</span><span class="o">=</span>PASS
<span class="k">if</span> <span class="o">!</span> check_exec git<span class="p">;</span> <span class="k">then
    </span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
if</span> <span class="o">!</span> check_env GITLAB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
if</span> <span class="o">!</span> check_env GITHUB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
</span>status <span class="nv">$git_pat_status</span> <span class="s2">"Export GitHub and GitLab tokens"</span>
</code></pre></div></div>

<p>To break it down:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">check_exec</code> looks for a program using <code class="language-plaintext highlighter-rouge">which</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">check_env</code> looks for an environment variable that is defined and not an empty
string. All of the <code class="language-plaintext highlighter-rouge">check_*</code> functions print their result to the console in
addition to calculating the success/failure.</li>
  <li><code class="language-plaintext highlighter-rouge">status</code> prints a summary message, indicating pass or fail.</li>
</ul>

<p>It’s straightforward to write a new checker function. The checker should
evaluate the status of the thing to be checked, write a message to the console,
and then return a status to the user. Here’s the definition of <code class="language-plaintext highlighter-rouge">check_env</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>check_env<span class="o">()</span> <span class="o">{</span>
	<span class="nv">prefix</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">dim</span><span class="k">}</span><span class="s2">(environment variable) </span><span class="k">${</span><span class="nv">normal</span><span class="k">}</span><span class="s2">"</span>
	<span class="nv">env_var_value</span><span class="o">=</span><span class="k">${</span><span class="p">!1</span><span class="k">}</span>
	<span class="k">if</span> <span class="o">[[</span> <span class="o">!</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$env_var_value</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
		</span>status PASS <span class="s2">"</span><span class="k">${</span><span class="nv">prefix</span><span class="k">}</span><span class="nv">$1</span><span class="s2">"</span>
		<span class="k">return </span>0
	<span class="k">else
		</span>status FAIL <span class="s2">"</span><span class="k">${</span><span class="nv">prefix</span><span class="k">}</span><span class="nv">$1</span><span class="s2">"</span>
		<span class="k">return </span>1
	<span class="k">fi</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The checkers also use the <code class="language-plaintext highlighter-rouge">status</code> function to print messages to the user.</p>
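<p><code class="language-plaintext highlighter-rouge">check_exec</code> itself isn’t shown above, but from the description it’s easy to reconstruct. A minimal sketch (my reconstruction with a stand-in <code class="language-plaintext highlighter-rouge">status</code>, not the script’s actual code):</p>

```shell
#!/bin/sh
# Reconstruction of a check_exec-style checker, not the original script.

# Stand-in for the script's status printer (the real one also handles color).
status() {
    printf '%s %s\n' "$1" "$2"
}

# Pass if the named program is on PATH, fail otherwise.
check_exec() {
    if which "$1" > /dev/null 2>&1; then
        status PASS "(exec) $1"
    else
        status FAIL "(exec) $1"
        return 1
    fi
}

check_exec sh    # prints: PASS (exec) sh
```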

<p><strong>Reminders</strong> are not done by <code class="language-plaintext highlighter-rouge">check_*</code> functions. Instead, they send straight
to <code class="language-plaintext highlighter-rouge">status</code> with the value <code class="language-plaintext highlighter-rouge">TODO</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>status TODO <span class="s2">"Import Iterm profile into Iterm"</span>
</code></pre></div></div>

<p>And that’s pretty much it. I wish I saw more general purpose frameworks for this
kind of thing, since I find it really helpful for remembering the pesky details
of setting up a new computer. If you know of any, let me know!</p>

<h2 id="the-gist">The Gist</h2>

<script src="https://gist.github.com/charlie-gallagher/d1544d336bc11fb1d64d5b9a227fbf34.js"></script>


        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>A little while ago my laptop died once, and then twice, and in between each
failure I had to use a spare laptop. I ended up setting up my computer fully for
work 3 times in the space of a couple weeks. I decided to automate the boring
stuff and make a checklist program and a set of companion scripts for doing
things like pulling all of my repos and installing my most commonly used
libraries and programs through Homebrew.</p>

<p>I already use <code class="language-plaintext highlighter-rouge">stow</code> to <a href="https://brandon.invergo.net/news/2012-05-26-using-gnu-stow-to-manage-your-dotfiles.html?round=two">manage my dotfiles</a>
so a lot of my configuration is taken care of with a few quick stows. But what I
really needed was an automated checklist that told me how close my configuration
was to the “target” laptop. It’s a cheap declarative setup, like a simpler
terraform or nix.</p>

<p>Here’s an example checklist (with no color):</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>❯ ./laptop_checklist.sh

Setting up git ──────────────────────────────────────────────────────────
• Setup git config (git-config.sh)
✓ Pull this code from GitHub (git clone https://github.com/my-user/dotfiles.git)
✓ (file) /Users/user/.gitignore
✓ Copy global gitignore to home directory
× (directory) /Users/user/projects/project_1/
× (directory) /Users/user/projects/project_2/
× (directory) /Users/user/projects/project_3/
× Clone important repositories (clone-repos.sh)
✓ (environment variable) GITLAB_PAT
✓ (environment variable) GITHUB_PAT
✓ Export GitHub and GitLab tokens


Setting up terminal ─────────────────────────────────────────────────────
✓ (application) iTerm.app
✓ Install iTerm2 (https://iterm2.com/downloads.html)
• Import Iterm profile into Iterm
• Install frequently used applications (install-common-libraries.sh)
✓ Install OhMyZsh (install-omzsh.sh)
✓ Install Starship prompt (install-starship.sh)
✓ (file) /Users/user/.zshrc
✓ Copy .zshrc into home directory
✓ (file) /Users/user/.vimrc
✓ Install vim (install-vim.sh)


Install desktop applications ────────────────────────────────────────────
• Install VS Code (https://code.visualstudio.com/download)
✓ (file) /Users/user/Library/Application Support/Code/User/settings.json
✓ Run install-code.sh
✓ (application) Google Chrome.app
✓ (application) DBeaver.app
✓ (application) AWS VPN Client
✓ (application) Docker.app
✓ (application) logioptionsplus.app
✓ (application) Postman.app
✓ (application) Utilities/XQuartz.app
✓ (application) Visual Studio Code.app
• Install Google Chrome (https://www.google.com/chrome)
• Install docker and sign in to GitLab registry (https://www.docker.com/products/docker-desktop/)
• Install postman and sign in using lastpass (https://www.postman.com/downloads/)
• Install AWS VPN and set up with ixis-vpn-client-config.ovpn (https://ixisdigital.atlassian.net/l/cp/ctZeA341)
• Install DBeaver (https://dbeaver.io/download/)
• Install omnibug (https://chrome.google.com/webstore/detail/omnibug/bknpehncffejahipecakbfkomebjmokl)
• Install Logitech Options+ (https://www.logitech.com/en-us/software/logi-options-plus.html)
• Install MS Teams
• Install XQuartz (https://www.xquartz.org/)
• Install Talon Voice (https://talonvoice.com/)
• Install Visual Studio Code (https://code.visualstudio.com/download)
✓ Install common applications


Set up AWS credentials ──────────────────────────────────────────────────
✓ (directory) /Users/user/.aws
✓ Initialize .aws folder (install-aws.sh)
✓ (exec) ssocred
✓ (exec) aws
✓ Install AWS tools (install-aws.sh)
✓ (file) /Users/user/.aws_functions.zsh
✓ AWS authentication functions exist
</code></pre></div></div>

<p>The checklist tells me how close I am to the target, but it won’t take any
action on its own. I have other scripts for that. But the thing about the other
scripts is that they:</p>

<ul>
  <li>Have dependencies. There’s a correct order, and I can’t fully automate it
because some steps require human action (like generating a new GitHub PAT).</li>
  <li>Are finicky. Shell scripting is not always a smooth experience, especially
when you might have to jump between shells (if zsh isn’t installed). Not to
mention that entropy affects setup scripts the same as it affects roads and
buildings. Links die, bits rot.</li>
  <li>Don’t give you a high-level view of the current system state.</li>
</ul>

<p>A real declarative system would compare the current state to the desired state
and then take steps to bring the computer into the desired state. Declarative
systems are hard to get right – you have to handle all possible current states
and define how to get to the target state. When I’m running these scripts, I’m
just trying to remember the steps for getting from zero back to a working
machine. This is just a bunch of setup scripts, and I’m fine with a little
“meat in the loop.”</p>

<p>All this is to say, I have the following setup:</p>

<ul>
  <li>A bunch of separate setup scripts</li>
  <li>An idea of what I want the final system to look like</li>
</ul>

<h2 id="the-easy-stuff">The easy stuff</h2>
<p>CLI programs and libraries are easy. First, you can usually get everything you
need through Homebrew or your package manager of choice. Second, it’s easy to
test whether they’re installed.</p>

<p>The shell environment is similarly easy to set up and check for. Environment
variables, dotfiles, these are all well-defined environment features that you
can check for.</p>

<h2 id="getting-trickier">Getting trickier</h2>
<p>GitHub PATs are essentially environment variables, but you can’t
programmatically generate them. They’re in the category of “easy to check,
manual to fix”. The other main entrants in this category are applications like
VS Code, Chrome, XQuartz, and so on. These can be checked for in the few places
that macOS stores applications.</p>

<h2 id="reminders">Reminders</h2>
<p>Some things can neither be tested for nor installed automatically (without
significant effort). For these, I have the idea of a “reminder” in the
checklist that basically says “do this or else”. The script can’t verify a
reminder; it just shows it every time.</p>

<p>Examples are usually within applications: signing into Chrome, importing
settings into VS Code and DBeaver, and configuring my MX Ergo mouse in Logitech
Options+.</p>

<h2 id="putting-it-together">Putting it together</h2>
<h3 id="design">Design</h3>

<ul>
  <li>A single script</li>
  <li>No config file, everything done in the script</li>
  <li>Easy to add/drop</li>
  <li>Easy to define sections</li>
  <li>Non-blocking. I want to see the whole status at once.</li>
</ul>

<h3 id="implementation">Implementation</h3>
<p>The program is composed of checkers and checklist items. The checkers are
functions that take some standard input (like the name of an environment
variable) and check whether it exists, returning 0 (success) or 1 (fail).</p>

<p>Sections have a main status for the larger abstract concept (“Set up git”) and
sub-statuses for each checklist item (“install git”, “GH PAT”, etc.). This is a
section that checks for personal access tokens.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Setting up git ──────────────────────────────────────────────────────────"</span>
<span class="nv">git_pat_status</span><span class="o">=</span>PASS
<span class="k">if</span> <span class="o">!</span> check_env GITLAB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
if</span> <span class="o">!</span> check_env GITHUB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
</span>status <span class="nv">$git_pat_status</span> <span class="s2">"Export GitHub and GitLab tokens"</span>
</code></pre></div></div>

<p>A section passes only if every one of its checks passes; if any child fails,
the whole section fails. Here, I forgot to add a check for the <code class="language-plaintext highlighter-rouge">git</code> binary, so let’s add
it.</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="nb">echo</span> <span class="s2">"Setting up git ──────────────────────────────────────────────────────────"</span>
<span class="nv">git_pat_status</span><span class="o">=</span>PASS
<span class="k">if</span> <span class="o">!</span> check_exec git<span class="p">;</span> <span class="k">then
    </span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
if</span> <span class="o">!</span> check_env GITLAB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
if</span> <span class="o">!</span> check_env GITHUB_PAT<span class="p">;</span> <span class="k">then
	</span><span class="nv">git_pat_status</span><span class="o">=</span>FAIL
<span class="k">fi
</span>status <span class="nv">$git_pat_status</span> <span class="s2">"Export GitHub and GitLab tokens"</span>
</code></pre></div></div>

<p>To break it down:</p>

<ul>
  <li><code class="language-plaintext highlighter-rouge">check_exec</code> looks for a program using <code class="language-plaintext highlighter-rouge">which</code>.</li>
  <li><code class="language-plaintext highlighter-rouge">check_env</code> looks for an environment variable that is defined and not an empty
string. All of the <code class="language-plaintext highlighter-rouge">check_*</code> functions print their result to the console in
addition to calculating the success/failure.</li>
  <li><code class="language-plaintext highlighter-rouge">status</code> prints a summary message, indicating pass or fail.</li>
</ul>

<p>It’s straightforward to write a new checker function. A checker should
evaluate the status of the thing being checked, write a message to the console,
and then return an exit status. Here’s the definition of <code class="language-plaintext highlighter-rouge">check_env</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>check_env<span class="o">()</span> <span class="o">{</span>
	<span class="nv">prefix</span><span class="o">=</span><span class="s2">"</span><span class="k">${</span><span class="nv">dim</span><span class="k">}</span><span class="s2">(environment variable) </span><span class="k">${</span><span class="nv">normal</span><span class="k">}</span><span class="s2">"</span>
	<span class="nv">env_var_value</span><span class="o">=</span><span class="k">${</span><span class="p">!1</span><span class="k">}</span>
	<span class="k">if</span> <span class="o">[[</span> <span class="o">!</span> <span class="nt">-z</span> <span class="s2">"</span><span class="nv">$env_var_value</span><span class="s2">"</span> <span class="o">]]</span><span class="p">;</span> <span class="k">then
		</span>status PASS <span class="s2">"</span><span class="k">${</span><span class="nv">prefix</span><span class="k">}</span><span class="nv">$1</span><span class="s2">"</span>
		<span class="k">return </span>0
	<span class="k">else
		</span>status FAIL <span class="s2">"</span><span class="k">${</span><span class="nv">prefix</span><span class="k">}</span><span class="nv">$1</span><span class="s2">"</span>
		<span class="k">return </span>1
	<span class="k">fi</span>
<span class="o">}</span>
</code></pre></div></div>

<p>The checkers also use the <code class="language-plaintext highlighter-rouge">status</code> function to print messages to the user.</p>
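<p>Neither <code class="language-plaintext highlighter-rouge">check_exec</code> nor <code class="language-plaintext highlighter-rouge">status</code> appears above, so here’s a minimal sketch of what
they might look like, following the same conventions as <code class="language-plaintext highlighter-rouge">check_env</code> (colors omitted;
these bodies are my guess at the shape, not the originals):</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Sketch only: print a one-line result for a checklist item
status() {
	case "$1" in
		PASS) printf '✓ %s\n' "$2" ;;
		FAIL) printf '× %s\n' "$2" ;;
		TODO) printf '• %s\n' "$2" ;;
	esac
}

# Sketch only: look for a program on the PATH using `which`
check_exec() {
	prefix="(exec) "
	if which "$1" >/dev/null; then
		status PASS "${prefix}$1"
		return 0
	else
		status FAIL "${prefix}$1"
		return 1
	fi
}
</code></pre></div></div>

<p>With these in place, <code class="language-plaintext highlighter-rouge">check_exec git</code> prints a ✓ or × line and returns 0 or 1,
which is exactly what the section-level <code class="language-plaintext highlighter-rouge">if ! check_exec git</code> test relies on.</p>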

<p><strong>Reminders</strong> aren’t handled by <code class="language-plaintext highlighter-rouge">check_*</code> functions. Instead, they go straight
to <code class="language-plaintext highlighter-rouge">status</code> with the value <code class="language-plaintext highlighter-rouge">TODO</code>:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code>status TODO <span class="s2">"Import Iterm profile into Iterm"</span>
</code></pre></div></div>

<p>And that’s pretty much it. I wish I saw more general-purpose frameworks for this
kind of thing, since I find it really helpful for remembering the pesky details
of setting up a new computer. If you know of any, let me know!</p>

<h2 id="the-gist">The Gist</h2>

<script src="https://gist.github.com/charlie-gallagher/d1544d336bc11fb1d64d5b9a227fbf34.js"></script>


        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>A plain text to-do list</title>
      <link>https://charlie-gallagher.github.io/2026/01/28/to-do-list.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/01/28/to-do-list.html</guid>
      <pubDate>Wed, 28 Jan 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <p>I’ve tried lots of to-do lists, from ToDoist to fine-grained Jira tickets and
all things in between. I still felt disorganized. But I found<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Jeff Huang’s
<a href="https://jeffhuang.com/productivity_text_file/">My productivity app is a never-ending .txt file</a>
and decided to give it a try, with some tweaks.</p>

<p>I spend a lot of time in the terminal, so now I keep an iTerm tab open with my
daily to-do list. It’s a dumping ground of to-dos (both work and personal),
meeting notes, random text I’m copying from one place to another. It’s faster
than handwriting and easier to copy tasks from one day to another. It’s easy to
stack rank things. It’s an unreasonably effective to-do list tool.</p>

<p>Here’s an example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2025-12-16
----------
- [x] Announce next week's release delay
- [x] Review X's MR hotfix for dsmodels
    - https://gitlab.com/etc/123
- [x] Review hit segments table MR
    - https://gitlab.com/etc/697
- [x] Help X with evars 216, 217, and 221 (all null, "Merchandising Evars")
    - https://atlassian.net/wiki/x/EYAT4g123
- [x] Plan next sprint
    - [x] What PTO do folks have? How much actual capacity?
    - [x] What priorities for Client A? Client B? Internal eng efforts?
- [x] X's comment under TICK-14222
- [ ] Update Utils Logger Documentation with new lambda information
- [ ] Heads down on Snowflake models
- [ ] Send snippet on how to filter with `(year, month, day)` tuple.
- [ ] 
- [ ] 
- [ ] Lunch with Bre
- [ ] 
- [ ] Start new project?
- [ ] Write blog post about to-do lists
- [ ] What's going on with management?

Coworker
-------

- window, not supported
- evar64 as alternative to mcvisid for join, `external_leadid`

Demo tenant
-----------

- What are the core integrations that make Service Pro worthwhile
- Add CRM and DMS, completely spoofed
- What do we demo?
- We need these 4 data systems
</code></pre></div></div>

<p>I keep to-dos roughly separated into work, personal, and “for consideration”,
which are tasks that I expect to carry over a few days in a row until I decide
what to do with them. At the bottom, I throw in everything and anything related
to the to-dos of the day. I especially like this section for dumping in notes.
If I’m in the middle of something complicated, at the end of the day I’ll write
down as much as I can think of in shorthand as context for what state things are
in.  What’s tested, what still needs to be connected to something else, why I
stopped working on feature X temporarily.</p>

<p>To use Jeff Huang’s phrase, this is both a to-do list and a got-done list<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>,
and I like that I can see everything I’ve completed with a quick grep:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Top-level completed tasks
cat *.md | sort | uniq | grep '^- \[x\]' | wc -l
161

# Including subtasks
cat *.md | sort | uniq | grep '- \[x\]' | wc -l
278
</code></pre></div></div>

<p>Occasionally I’ll plan days in the future – why not? Make a text file for a
week from today and add a couple to-dos. Later, when I create tomorrow’s file,
I’ll find it already exists and has some to-dos. Excellent.</p>
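<p>Creating (or reopening) a dated file like that is a one-liner. A sketch,
assuming GNU <code class="language-plaintext highlighter-rouge">date</code> (on macOS, the equivalent flag is <code class="language-plaintext highlighter-rouge">-v+1d</code>):</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Tomorrow's date in YYYY-MM-DD form (GNU date; use -v+1d on macOS)
tomorrow=$(date -d tomorrow +%Y-%m-%d)
touch "${tomorrow}.md"   # no-op if the file already exists
</code></pre></div></div>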

<p>A few things make this work for me:</p>

<ul>
  <li><strong>Vim.</strong> Makes it very fast to rearrange, reorder, and re-indent.</li>
  <li><strong>Terminal tabs.</strong> My to-do list is always in the first tab of iTerm, so it’s
just a <code class="language-plaintext highlighter-rouge">⌘1</code> away any time I’m in the terminal.</li>
  <li><strong>Minimum viable discipline.</strong> 5 minutes in the morning and evening to
organize the list.</li>
</ul>

<p>It’s hard to overstate just what a good effect this has had on my work. It’s a
2¢ piece of tech that just works.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I found this through the really excellent <a href="https://registerspill.thorstenball.com">Joy &amp; Curiosity</a> series in Register Spill by Thorsten Ball. I can’t recommend Thorsten’s work enough – I read it every Sunday. His utter appreciation for good writing is infectious (and one of the reasons I started this blog). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>There’s some value in these things as historical record, but I don’t plan on preserving them. I keep my files on my local computer and am waiting for the day my computer dies. But Jeff Huang used his .txt file as research for his <a href="https://jeffhuang.com/struggle_for_each_paper/">Behind the Scenes: the struggle for each paper</a> post, which I have to say is cool. It reminds me of Stephen Wolfram’s unbelievable <a href="https://writings.stephenwolfram.com/2019/02/seeking-the-productive-life-some-details-of-my-personal-infrastructure/">Seeking the productive life</a> where, among many other things, he visualizes the emails he’s sent and received over time. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <p>I’ve tried lots of to-do lists, from ToDoist to fine-grained Jira tickets and
all things in between. I still felt disorganized. But I found<sup id="fnref:1" role="doc-noteref"><a href="#fn:1" class="footnote" rel="footnote">1</a></sup> Jeff Huang’s
<a href="https://jeffhuang.com/productivity_text_file/">My productivity app is a never-ending .txt file</a>
and decided to give it a try, with some tweaks.</p>

<p>I spend a lot of time in the terminal, so now I keep an iTerm tab open with my
daily to-do list. It’s a dumping ground of to-dos (both work and personal),
meeting notes, random text I’m copying from one place to another. It’s faster
than handwriting and easier to copy tasks from one day to another. It’s easy to
stack rank things. It’s an unreasonably effective to-do list tool.</p>

<p>Here’s an example:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>2025-12-16
----------
- [x] Announce next week's release delay
- [x] Review X's MR hotfix for dsmodels
    - https://gitlab.com/etc/123
- [x] Review hit segments table MR
    - https://gitlab.com/etc/697
- [x] Help X with evars 216, 217, and 221 (all null, "Merchandising Evars")
    - https://atlassian.net/wiki/x/EYAT4g123
- [x] Plan next sprint
    - [x] What PTO do folks have? How much actual capacity?
    - [x] What priorities for Client A? Client B? Internal eng efforts?
- [x] X's comment under TICK-14222
- [ ] Update Utils Logger Documentation with new lambda information
- [ ] Heads down on Snowflake models
- [ ] Send snippet on how to filter with `(year, month, day)` tuple.
- [ ] 
- [ ] 
- [ ] Lunch with Bre
- [ ] 
- [ ] Start new project?
- [ ] Write blog post about to-do lists
- [ ] What's going on with management?

Coworker
-------

- window, not supported
- evar64 as alternative to mcvisid for join, `external_leadid`

Demo tenant
-----------

- What are the core integrations that make Service Pro worthwhile
- Add CRM and DMS, completely spoofed
- What do we demo?
- We need these 4 data systems
</code></pre></div></div>

<p>I keep to-dos roughly separated into work, personal, and “for consideration”,
which are tasks that I expect to carry over a few days in a row until I decide
what to do with them. At the bottom, I throw in everything and anything related
to the to-dos of the day. I especially like this section for dumping in notes.
If I’m in the middle of something complicated, at the end of the day I’ll write
down as much as I can think of in shorthand as context for what state things are
in.  What’s tested, what still needs to be connected to something else, why I
stopped working on feature X temporarily.</p>

<p>To use Jeff Huang’s phrase, this is both a to-do list and a got-done list<sup id="fnref:2" role="doc-noteref"><a href="#fn:2" class="footnote" rel="footnote">2</a></sup>,
and I like that I can see everything I’ve completed with a quick grep:</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Top-level completed tasks
cat *.md | sort | uniq | grep '^- \[x\]' | wc -l
161

# Including subtasks
cat *.md | sort | uniq | grep '- \[x\]' | wc -l
278
</code></pre></div></div>

<p>Occasionally I’ll plan days in the future – why not? Make a text file for a
week from today and add a couple to-dos. Later, when I create tomorrow’s file,
I’ll find it already exists and has some to-dos. Excellent.</p>
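<p>Creating (or reopening) a dated file like that is a one-liner. A sketch,
assuming GNU <code class="language-plaintext highlighter-rouge">date</code> (on macOS, the equivalent flag is <code class="language-plaintext highlighter-rouge">-v+1d</code>):</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Tomorrow's date in YYYY-MM-DD form (GNU date; use -v+1d on macOS)
tomorrow=$(date -d tomorrow +%Y-%m-%d)
touch "${tomorrow}.md"   # no-op if the file already exists
</code></pre></div></div>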

<p>A few things make this work for me:</p>

<ul>
  <li><strong>Vim.</strong> Makes it very fast to rearrange, reorder, and re-indent.</li>
  <li><strong>Terminal tabs.</strong> My to-do list is always in the first tab of iTerm, so it’s
just a <code class="language-plaintext highlighter-rouge">⌘1</code> away any time I’m in the terminal.</li>
  <li><strong>Minimum viable discipline.</strong> 5 minutes in the morning and evening to
organize the list.</li>
</ul>

<p>It’s hard to overstate just what a good effect this has had on my work. It’s a
2¢ piece of tech that just works.</p>

<div class="footnotes" role="doc-endnotes">
  <ol>
    <li id="fn:1" role="doc-endnote">
      <p>I found this through the really excellent <a href="https://registerspill.thorstenball.com">Joy &amp; Curiosity</a> series in Register Spill by Thorsten Ball. I can’t recommend Thorsten’s work enough – I read it every Sunday. His utter appreciation for good writing is infectious (and one of the reasons I started this blog). <a href="#fnref:1" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
    <li id="fn:2" role="doc-endnote">
      <p>There’s some value in these things as historical record, but I don’t plan on preserving them. I keep my files on my local computer and am waiting for the day my computer dies. But Jeff Huang used his .txt file as research for his <a href="https://jeffhuang.com/struggle_for_each_paper/">Behind the Scenes: the struggle for each paper</a> post, which I have to say is cool. It reminds me of Stephen Wolfram’s unbelievable <a href="https://writings.stephenwolfram.com/2019/02/seeking-the-productive-life-some-details-of-my-personal-infrastructure/">Seeking the productive life</a> where, among many other things, he visualizes the emails he’s sent and received over time. <a href="#fnref:2" class="reversefootnote" role="doc-backlink">&#8617;</a></p>
    </li>
  </ol>
</div>

        ]]>
      </content:encoded>
    </item>
    
    <item>
      <title>NTP in 30 seconds</title>
      <link>https://charlie-gallagher.github.io/2026/01/27/ntp-in-30-seconds.html</link>
      <guid isPermaLink="true">https://charlie-gallagher.github.io/2026/01/27/ntp-in-30-seconds.html</guid>
      <pubDate>Tue, 27 Jan 2026 00:00:00 +0000</pubDate>

      <description>
        <![CDATA[
          <h1 id="what-time-is-it">What time is it?</h1>

<blockquote>
  <p>There is a metrological fable that has been retold many times. It concerns
 an eccentric retired sea captain who lived in the hills overlooking Zanzibar
 City and fired a ceremonial cannon and raised the ensign at exactly noon each
 day. He knew it was noon from his chronometer which he took pains to
 accurately set whenever he passed the watchmaker’s window in town. The
 watchmaker knew his clocks were accurate because he checked them daily when
 that punctilious captain on the hill fired his cannon at noon exactly.</p>

  <p>Account of the <em>Zanzibar Effect</em> in “Failures of the global measurement
system. Part 2: institutions, instruments and strategy.” Gary Price.</p>
</blockquote>

<blockquote>
  <p>Ballyhough railway station has two clocks which disagree by some six minutes.
When one helpful Englishman pointed the fact out to a porter, his reply was
“Faith, sir, if they was to tell the same time, why would we be having two of
them?”</p>

  <p>The Five Clocks, Martin Joos</p>
</blockquote>

<p>Two servers have clocks with the same frequency, but they don’t agree on what
time it is. Server A says it’s 10:00, Server B says it’s 10:05. Server B has a
direct line to an atomic clock receiver, so it’s more accurate.</p>

<p>Once Server A knows that Server B has the right time, it needs to synchronize
with it; Server A needs to discover that it’s actually 10:05, not 10:00. It
could ask, but by the time the question made a round trip through the network,
the answer would already be out of date. So, Server A needs to estimate the
network latency between itself and Server B.</p>

<p>We can estimate latency without having synchronized clocks. Since latency is the
amount of time spent in the network, we only need to estimate the amount of time
<em>not</em> spent in the network. Server A knows the round-trip time of its “What time
is it?” request, so it asks Server B to communicate how long it held the
message. <code class="language-plaintext highlighter-rouge">RTT - B_time = time_on_network</code>, which means the link latency is
<code class="language-plaintext highlighter-rouge">(RTT - B_time) / 2</code>.</p>

<p>Now we know the link latency, and Server A just needs to know the timestamp when
Server B received its “What time is it?” request and it can work out the offset.
And that’s it.</p>

<h2 id="example">Example</h2>
<p>Suppose Server A says it’s 10:00:00 and Server B says it’s 10:05:00, like above.
As a shortcut, instead of telling Server A both the duration and the timestamp
when it received the request, Server B includes the receipt timestamp and the
exit timestamp in its message back to A.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              ┌────────────What time is it?───────────┐       
              │                                       │       
              │                                       │       
              │10:00:00                               │10:05:01
       ┌──────┼───────┐                       ┌───────▼──────┐
       │              │                       │              │
       │              │                       │              │
       │ Server A     │                       │ Server B     │
       │              │                       │              │
       └──────────────┘                       └───────┬──────┘
              ▲ 10:00:03                              │10:05:02
              │                                       │       
              │         ┌──────────────────┐          │       
              │         │                  │          │       
              │         │  Recv: 10:05:01  │──────────┘       
              └─────────┼  Ret:  10:05:02  │                  
                        │                  │                  
                        └──────────────────┘                  
</code></pre></div></div>

<p>Server B held the packet for <code class="language-plaintext highlighter-rouge">10:05:02 - 10:05:01 = 1s</code> and the RTT was
<code class="language-plaintext highlighter-rouge">10:00:03 - 10:00:00 = 3s</code>, which gives a link latency of <code class="language-plaintext highlighter-rouge">(3 - 1) / 2 = 1s</code>.
And Server A now knows that Server B received the message at 10:05:01, and since
it sent the request at 10:00:00 and there’s 1 second of network delay, that
means Server B’s 10:05:01 is the same as Server A’s <code class="language-plaintext highlighter-rouge">10:00:00 + 1s = 10:00:01</code>,
and Server A is exactly <code class="language-plaintext highlighter-rouge">10:05:01 - 10:00:01 = 5m</code> too slow.</p>

<p>In practice, a little algebra takes this from several steps to just one.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>offset
  = ((B1 - A1) + (B2 - A2)) / 2
  = ((10:05:01 - 10:00:00) + (10:05:02 - 10:00:03)) / 2
  = (5:01 + 4:59) / 2
  = 5 minutes
</code></pre></div></div>
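<p>The same arithmetic as a runnable sketch, with the four timestamps converted
to seconds past 10:00:00:</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># The example's four NTP timestamps, as seconds past 10:00:00
A1=0     # 10:00:00  request leaves Server A
B1=301   # 10:05:01  request arrives at Server B
B2=302   # 10:05:02  reply leaves Server B
A2=3     # 10:00:03  reply arrives at Server A

offset=$(( ((B1 - A1) + (B2 - A2)) / 2 ))
echo "${offset}s"   # 300s = 5 minutes
</code></pre></div></div>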

<h2 id="notes">Notes</h2>
<p>NTP involves a lot more than this bit of algebra, but I think it’s a very
neat trick. In addition to this formula, NTP solves for unreliable networks –
latencies are estimated by repeated sampling; a single round trip is not
enough. Also, you can’t just apply an offset to a clock. Many applications and
protocols assume that time only goes forwards, so if your clock is ahead of the
reference, care must be taken: instead of hopping back in time, the system
clock is slowed down slightly until the entire offset has been absorbed. When
you’re behind the reference, the standard approach is to modestly speed up the
clock until you’ve caught up.</p>

<p>In extreme cases (like being off by 30 minutes), the clock is simply stepped
(jumped directly to the new time) so that slewing doesn’t take hours.</p>

<p>None of this so far addresses how servers agree on who has the authority and who
will sync to whose clock. That’s determined by consulting a hierarchy of
authority based on how many hops you are from an atomic clock. Time sources are
stratified. The atomic clocks themselves are Stratum 0; a server attached
directly to one is Stratum 1, a server that syncs to a Stratum 1 server is
Stratum 2, and so on. Part of the NTP exchange is comparing stratum numbers and
selecting the authority based on whose number is lower. The “loser” sets their
stratum to <code class="language-plaintext highlighter-rouge">s_authority + 1</code>.</p>
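<p>The selection rule itself is tiny. A toy sketch (the real exchange happens
in NTP packet fields, but the logic reduces to this comparison):</p>

<div class="language-sh highlighter-rouge"><div class="highlight"><pre class="highlight"><code># Toy stratum comparison: the lower stratum wins the authority
my_stratum=4
peer_stratum=2
if [ "$peer_stratum" -lt "$my_stratum" ]; then
	my_stratum=$((peer_stratum + 1))   # sync to the peer, sit one below it
fi
echo "$my_stratum"   # 3
</code></pre></div></div>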

<p>And you might then ask, are all atomic clocks created equal? Where does that
authority come from? How is UTC defined? And for that, I don’t have space on
this blog. Let’s just be grateful that thanks to NTP, we don’t have to worry
about it.</p>

<h2 id="postscript">Postscript</h2>
<p>If you run Linux and want to see exactly what your <code class="language-plaintext highlighter-rouge">ntpd</code> daemon is up to, try
out <code class="language-plaintext highlighter-rouge">ntpviz</code> from the NTPsec project. Details in the <a href="https://blog.ntpsec.org/2016/12/19/ntpviz-intro.html">ntpviz intro post</a>.</p>


        ]]>
      </description>

      <content:encoded>
        <![CDATA[
          <h1 id="what-time-is-it">What time is it?</h1>

<blockquote>
  <p>There is a metrological fable that has been retold many times. It concerns
 an eccentric retired sea captain who lived in the hills overlooking Zanzibar
 City and fired a ceremonial cannon and raised the ensign at exactly noon each
 day. He knew it was noon from his chronometer which he took pains to
 accurately set whenever he passed the watchmaker’s window in town. The
 watchmaker knew his clocks were accurate because he checked them daily when
 that punctilious captain on the hill fired his cannon at noon exactly.</p>

  <p>Account of the <em>Zanzibar Effect</em> in “Failures of the global measurement
system. Part 2: institutions, instruments and strategy.” Gary Price.</p>
</blockquote>

<blockquote>
  <p>Ballyhough railway station has two clocks which disagree by some six minutes.
When one helpful Englishman pointed the fact out to a porter, his reply was
“Faith, sir, if they was to tell the same time, why would we be having two of
them?”</p>

  <p>The Five Clocks, Martin Joos</p>
</blockquote>

<p>Two servers have clocks with the same frequency, but they don’t agree on what
time it is. Server A says it’s 10:00, Server B says it’s 10:05. Server B has a
direct line to an atomic clock receiver, so it’s more accurate.</p>

<p>Once Server A knows that Server B has the right time, it needs to synchronize
with it; Server A needs to discover that it’s actually 10:05, not 10:00. It
could ask, but by the time the question made a round trip through the network,
the answer would already be out of date. So, Server A needs to estimate the
network latency between itself and Server B.</p>

<p>We can estimate latency without having synchronized clocks. Since latency is the
amount of time spent in the network, we only need to estimate the amount of time
<em>not</em> spent in the network. Server A knows the round-trip time of its “What time
is it?” request, so it asks Server B to communicate how long it held the
message. <code class="language-plaintext highlighter-rouge">RTT - B_time = time_on_network</code>, which means the link latency is
<code class="language-plaintext highlighter-rouge">(RTT - B_time) / 2</code>.</p>

<p>Now we know the link latency, and Server A just needs to know the timestamp when
Server B received its “What time is it?” request and it can work out the offset.
And that’s it.</p>

<h2 id="example">Example</h2>
<p>Suppose Server A says it’s 10:00:00 and Server B says it’s 10:05:00, like above.
As a shortcut, instead of telling Server A both the duration and the timestamp
when it received the request, Server B includes the receipt timestamp and the
exit timestamp in its message back to A.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>              ┌────────────What time is it?───────────┐       
              │                                       │       
              │                                       │       
              │10:00:00                               │10:05:01
       ┌──────┼───────┐                       ┌───────▼──────┐
       │              │                       │              │
       │              │                       │              │
       │ Server A     │                       │ Server B     │
       │              │                       │              │
       └──────────────┘                       └───────┬──────┘
              ▲ 10:00:03                              │10:05:02
              │                                       │       
              │         ┌──────────────────┐          │       
              │         │                  │          │       
              │         │  Recv: 10:05:01  │──────────┘       
              └─────────┼  Ret:  10:05:02  │                  
                        │                  │                  
                        └──────────────────┘                  
</code></pre></div></div>

<p>Server B held the packet for <code class="language-plaintext highlighter-rouge">10:05:02 - 10:05:01 = 1s</code> and the RTT was
<code class="language-plaintext highlighter-rouge">10:00:03 - 10:00:00 = 3s</code>, which gives a link latency of <code class="language-plaintext highlighter-rouge">(3 - 1) / 2 = 1s</code>.
And Server A now knows that Server B received the message at 10:05:01. Since it
sent the request at 10:00:00 and there’s 1 second of network delay, Server B’s
10:05:01 corresponds to Server A’s <code class="language-plaintext highlighter-rouge">10:00:00 + 1s = 10:00:01</code>,
which means Server A is exactly <code class="language-plaintext highlighter-rouge">10:05:01 - 10:00:01 = 5m</code> slow.</p>

<p>In practice, a little algebra takes this from several steps to just one.</p>

<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>offset
  = ((B1 - A1) + (B2 - A2)) / 2
  = ((10:05:01 - 10:00:00) + (10:05:02 - 10:00:03)) / 2
  = (5:01 + 4:59) / 2
  = 5 minutes
</code></pre></div></div>
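<p>The same one-liner in Python, using the conventional NTP names T1 through T4
for the four timestamps (timestamps here are just seconds since midnight; this
is an illustration, not NTP’s actual implementation):</p>

```python
def ntp_offset(t1, t2, t3, t4):
    """Offset of the remote clock relative to ours.

    t1: request sent (our clock)     t2: request received (their clock)
    t3: reply sent (their clock)     t4: reply received (our clock)
    """
    return ((t2 - t1) + (t3 - t4)) / 2

def hms(h, m, s):
    # Convert a wall-clock time to seconds since midnight.
    return h * 3600 + m * 60 + s

# The example above: sent 10:00:00, received 10:05:01,
# replied 10:05:02, reply arrived 10:00:03.
offset = ntp_offset(hms(10, 0, 0), hms(10, 5, 1), hms(10, 5, 2), hms(10, 0, 3))
print(offset / 60)  # 5.0 (minutes)
```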

<h2 id="notes">Notes</h2>
<p>NTP involves a lot more than this algebra trick, but I think it’s a very neat
trick. Beyond this formula, NTP also copes with unreliable networks: latencies
are estimated by repeatedly sampling round trips, because a single sample is not
enough. And you can’t simply apply the offset to a clock. Many applications and
protocols assume that time only moves forwards, so if your clock is ahead of the
reference, care is taken: instead of hopping backwards in time, the system clock
is slowed down slightly until the entire offset has been absorbed. When you’re
behind the reference, the clock is likewise modestly sped up until you’ve caught
up.</p>

<p>In extreme cases (like being off by 30 minutes), the clock is stepped
directly instead, since slewing away an offset that large would take far too
long.</p>
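<p>To make the slew-versus-step decision concrete, here is a toy sketch. The
128 ms threshold is ntpd’s default step threshold; the function itself is
illustrative, not ntpd’s real logic:</p>

```python
STEP_THRESHOLD = 0.128  # seconds; ntpd's default step threshold

def plan_adjustment(offset):
    # Small offsets are slewed: the clock rate is nudged until the
    # offset is absorbed, so time never appears to jump.
    if abs(offset) <= STEP_THRESHOLD:
        return "slew"
    # Large offsets (like being 30 minutes off) are stepped directly,
    # since slewing them away would take far too long.
    return "step"

print(plan_adjustment(0.05))    # slew
print(plan_adjustment(1800.0))  # step
```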

<p>None of this so far addresses how servers agree on who has the authority and who
will sync to whose clock. That’s determined by consulting a hierarchy of
authority based on how many hops you are from an atomic clock. Time sources are
stratified. The reference hardware itself (atomic clocks, GPS receivers) sits
at Stratum 0, a server that gets its time directly from a Stratum 0 device is
in Stratum 1, and so on. Part of the NTP exchange is comparing strata and
selecting the authority based on whose number is lower. The “loser” sets their
stratum to <code class="language-plaintext highlighter-rouge">s_authority + 1</code>.</p>
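<p>The stratum rule can be sketched like this (a toy model; real NTP peer
selection weighs more than the stratum number):</p>

```python
def negotiate_stratum(mine, theirs):
    # If the peer is closer to a reference clock, sync to it and
    # advertise ourselves as one hop further away than it is.
    if theirs < mine:
        return "sync to peer", theirs + 1
    # Otherwise keep our own stratum; the peer may sync to us instead.
    return "keep own clock", mine

print(negotiate_stratum(3, 2))  # ('sync to peer', 3)
```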

<p>And you might then ask, are all atomic clocks created equal? Where does that
authority come from? How is UTC defined? And for that, I don’t have space on
this blog. Let’s just be grateful that thanks to NTP, we don’t have to worry
about it.</p>

<h2 id="postscript">Postscript</h2>
<p>If you run Linux and want to see exactly what your <code class="language-plaintext highlighter-rouge">ntpd</code> daemon is up to, try
out <code class="language-plaintext highlighter-rouge">ntpviz</code> from the NTPsec project. Details here:
<a href="https://blog.ntpsec.org/2016/12/19/ntpviz-intro.html">https://blog.ntpsec.org/2016/12/19/ntpviz-intro.html</a></p>


        ]]>
      </content:encoded>
    </item>
    
  </channel>
</rss>
