Notes on Sorted Data

(amit.prasad.me)

41 points | by surprisetalk 6 days ago

2 comments

amitprasad 15 minutes ago
Unexpected seeing this posted here.
I wrote this post mostly out of interest for a personal project and thus it's not actually a very holistic exploration of the topic. May revisit and update it in the future :)
Rakshath_1 3 hours ago
This is a really solid deep-dive. I like how you move from this seems obviouscases (ints, strings) into the subtle edge cases where ordering quietly breaks and then show practical encodings that actually work in byte-lex order. The examples make the pitfalls very concrete, especially the varint and tuple sections. Nice balance between theory and systems-level pragmatism
[-]
- Sesse__ 1 hour ago
  In contrast, I found it rather lacking. No discussion of the most common way to sort floats as bytes (shift the sign bit down and XOR the other 31 bits with the resulting masks), nor NaNs and +/-0 for that matter. Varint sorting introduces its own homegrown serialization but doesn't discuss the issue of overlong encodings. Nothing about string collation or Unicode issues in general. Composite data suggests adding NULs, but what if there are NULs in the actual data? (It is briefly mentioned, but only as in “you can't”.)
  [-]
  - amitprasad 9 minutes ago
    author here -- agreed on all fronts. Mentioned this in the other comment but I approached the topic from a relatively narrow perspective (I was working on a specific project at the time)
    I think it's worth including these things in a future update to the post, but I didn't have the time / need to explore it back then.
    In the meantime, I'd point to the following post on Unicode that remains very nice to read >20 years later: https://www.joelonsoftware.com/2003/10/08/the-absolute-minim...
    [-]
    - Sesse__ 6 minutes ago
      Sure, not all posts want to be the end-all of the topic. :-)
      And yes, the Joel post is a good introduction, if a bit old by now. Notably, of course, it doesn't say anything about _processing_ Unicode text. (E.g., don't sort by code point, don't break in the middle of a grapheme cluster, etc. etc.) But I believe that this is outside the scope of his intention.