Show HN: 22 GB of Hacker News in SQLite

(hackerbook.dosaygo.com)

106 points | by keepamovin 3 hours ago

8 comments

simonw 1 hour ago
Don't miss how this works. It's not a server-side application - this code runs entirely in your browser using SQLite compiled to WASM, but rather than fetching a full 22GB database it instead uses a clever hack that retrieves just "shards" of the SQLite database needed for the page you are viewing.
I watched it in the browser network panel and saw it fetch:
```
  https://hackerbook.dosaygo.com/static-shards/shard_1636.sqlite.gz
  https://hackerbook.dosaygo.com/static-shards/shard_1635.sqlite.gz
  https://hackerbook.dosaygo.com/static-shards/shard_1634.sqlite.gz
```
As I paginated to previous days.
It's reminiscent of that brilliant SQLite.js VFS trick from a few years ago: https://github.com/phiresky/sql.js-httpvfs - only that one used HTTP range headers, this one uses sharded files instead.
The interactive SQL query interface at https://hackerbook.dosaygo.com/?view=query asks you to select which shards to run the query against, there are 1636 total.
[-]
- tehlike 47 minutes ago
  Vfs support is amazing.
keepamovin 3 hours ago
Community, All the HN belong to you. This is an archive of hacker news that fits in your browser. When I made HN Made of Primes I realized I could probably do this offline sqlite/wasm thing with the whole GBs of archive. The whole dataset. So I tried it, and this is it. Have Hacker News on your device.
Go to this repo (https://github.com/DOSAYGO-STUDIO/HackerBook): you can download it. Big Query -> ETL -> npx serve docs - that's it. 20 years of HN arguments and beauty, can be yours forever. So they'll never die. Ever. It's the unkillable static archive of HN and it's your hands. That's my Year End gift to you all. Thank you for a wonderful year, have happy and wonderful 2026. make something of it.
[-]
- carbocation 1 hour ago
  That repo is throwing up a 404 for me.
  Question - did you consider tradeoffs between duckdb (or other columnar stores) and SQLite?
  [-]
  - keepamovin 1 hour ago
    No, I just went straight to sqlite. What is duckdb?
    [-]
    - simonw 1 hour ago
      One interesting feature of DuckDB is that it can run queries against HTTP ranges of a static file hosted via HTTPS, and there's an official WebAssembly build of it that can do that same trick.
      So you can dump e.g. all of Hacker News in a single multi-GB Parquet file somewhere and build a client-side JavaScript application that can run queries against that without having to fetch the whole thing.
      You can run searches on https://lil.law.harvard.edu/data-gov-archive/ and watch the network panel to see DuckDB in action.
    - fsiefken 1 hour ago
      DuckDB is an open-source column-oriented Relational Database Management System (RDBMS). It's designed to provide high performance on complex queries against large databases in embedded configuration.
      It has transparent compression built-in and has support for natural language queries. https://buckenhofer.com/2025/11/agentic-ai-with-duckdb-and-s...
      "DICT FSST (Dictionary FSST) represents a hybrid compression technique that combines the benefits of Dictionary Encoding with the string-level compression capabilities of FSST. This approach was implemented and integrated into DuckDB as part of ongoing efforts to optimize string storage and processing performance." https://homepages.cwi.nl/~boncz/msc/2025-YanLannaAlexandre.p...
    - cess11 1 hour ago
      It is very similar to SQLite in that it can run in-process and store its data as a file.
      It's different in that it is tailored to analytics, among other things storage is columnar, and it can run off some common data analytics file formats.
  - 3eb7988a1663 1 hour ago
    While I suspect DuckDB would compress better, given the ubiquity of SQLite, it seems a fine standard choice.
  - linhns 1 hour ago
    Not the author here. I’m not sure about DuckDB, but SQLite allows you to simply use a file as a database and for archiving, it’s really helpful. One file, that’s it.
    [-]
    - cobolcomesback 1 hour ago
      DuckDB does as well. A super simplified explanation of duckdb is that it’s sqlite but columnar, and so is better for analytics of large datasets.
      [-]
      - formerly_proven 1 hour ago
        The schema is this: items(id INTEGER PRIMARY KEY, type TEXT, time INTEGER, by TEXT, title TEXT, text TEXT, url TEXT
        Doesn't scream columnar database to me.
        [-]
        embedding-shape 1 hour ago
        At a glance, that is missing (at least) a `parent` or `parent_id` attribute which items in HN can have (and you kind of need if you want to render comments), see http://hn.algolia.com/api/v1/items/46436741
        [-]
        agolliver 1 hour ago
        Edges are a separate table
- abixb 1 hour ago
  Wonder if you could turn this into a .zim file for offline browsing with an offline browser like Kiwix, etc. [0]
  I've been taking frequent "offline-only-day" breaks to consolidate whatever I've been learning, and Kiwix has been a great tool for reference (offline Wikipedia, StackOverflow and whatnot).
  [0] https://kiwix.org/en/the-new-kiwix-library-is-available/
- tevon 1 hour ago
  The link seems to be down, was it taken down?
  [-]
  - scsh 44 minutes ago
    Probably just forgot to make it public.
- yupyupyups 1 hour ago
  1 hour passed and it's already nuked?
  Thank you btw
- wslh 1 hour ago
  Is this updated regularly? 404 on GitHub as the other comment.
  With all due respect it would be great if there is an official HN public dump available (and not requiring stuff such as BigQuery which is expensive).
- fao_ 1 hour ago
  > Community, All the HN belong to you. This is an archive of hacker news that fits in your browser.
  > 20 years of HN arguments and beauty, can be yours forever. So they'll never die. Ever. It's the unkillable static archive of HN and it's your hands
  I'm really sorry to have to ask this, but this really feels like you had an LLM write it?
  [-]
  - jesprenj 14 minutes ago
    I doubt it. "hacker news" spelled lowercase? comma after "beauty"? missing "in" after "it's"? i doubt an LLM would make such syntax mistakes. it's just good writing, that's also possible these days.
  - walthamstow 55 minutes ago
    There's a thing in soccer at the moment where a tackle looks fine in realtime but when the video referee shows it to the onpitch referee, they show the impact in slo-mo over and over again and it always looks worse.
    I wonder if there's something like this going on here. I never thought it was LLM on first read, and I still don't, but when you take snippets and point at them it makes me think maybe they are
  - rantingdemon 1 hour ago
    Why do you say that?
    [-]
    - sundarurfriend 1 hour ago
      Because anything that even slightly differs from the standard American phrasing of something must be "LLM generated" these days.
      [-]
      - JavGull 55 minutes ago
        With the em dashes I see you. But at this point idrc so long as it reads well. Everyone uses spell check…
        [-]
        naikrovek 4 minutes ago
        I add em dashes to everything I write now, solely to throw people who look for them off. Lots of editors add them automatically when you have two sequential dashes between words — a common occurrence, like that one. And this is is Chrome on iOS doing it automatically.
        Ooh, I used “sequential”, ooh, I used an em dash. ZOMG AI IS COMING FOR US ALL
      - deadbabe 43 minutes ago
        Sometimes I want to write more creatively, but then worry I’ll be accused of being an LLM. So I dumb it down. Remove the colorful language. Conform.
  - naikrovek 19 minutes ago
    > I'm really sorry to have to ask this, but this really feels like you had an LLM write it?
    Ending a sentence with a question mark doesn’t automatically make your sentence a question. You didn’t ask anything. You stated an opinion and followed it with a question mark.
    If you intended to ask if the text was written by AI, no, you don’t have to ask that.
    I am so damn tired of the “that didn’t happen” and the “AI did that” people when there is zero evidence of either being true.
    These people are the most exhausting people I have ever encountered in my entire life.
zkmon 53 minutes ago
Similar to Single-page applications (SPA), single-table application (STA) might become a thing. Just a shard a table on multiple keys and serve the shards as static files, provided that the data is Ok to share, similar to sharing static html content.
[-]
- jesprenj 12 minutes ago
  do you mean single database? it'd be quite hard if not impossible to make applications using a single table (no relations). reddit did it though, they have a huge table of "things" iirc.
Paul-E 1 hour ago
That's pretty neat!
I did something similar. I build a tool[1] to import the Project Arctic Shift dumps[2] of reddit into sqlite. It was mostly an exercise to experiment with Rust and SQLite (HN's two favorite topics). If you don't build a FTS5 index and import without WAL (--unsafe-mode), import of every reddit comment and submission takes a bit over 24 hours and produces a ~10TB DB.
SQLite offers a lot of cool json features that would let you store the raw json and operate on that, but I eschewed them in favor of parsing only once at load time. THat also lets me normalize the data a bit.
I find that building the DB is pretty "fast", but queries run much faster if I immediately vacuum the DB after building it. The vacuum operation is actually slower than the original import, taking a few days to finish.
[1] https://github.com/Paul-E/Pushshift-Importer
[2] https://github.com/ArthurHeitmann/arctic_shift/blob/master/d...
[-]
- s_ting765 38 minutes ago
  You could check out SQLite's auto_vacuum which reclaims space without rebuilding the entire db https://sqlite.org/pragma.html#pragma_auto_vacuum
sieep 1 hour ago
What a reminder on how text is so much more efficient than video, its crazy! Could you imagine the same amount of knowledge (or dribble) but in video form? I wonder how large that would be.
[-]
- fsiefken 1 minute ago
  one could use a video llm to generate the video, diagrams or the stills automatically based on the text. except when it's boardgames playthroughs or programming i just transcribe to text, summarise and read youtube video's.
- ivanjermakov 37 minutes ago
  Average high quality 1080p60 video has bitrate of 5Mbps, which is equivalent to 120k English words per second. With average English speech being 150wpm, we end up with text being 50 thousand times more space efficient.
  Converting 22GB of uncompressed text into video essay lands us at ~1PB or 1000TB.
zX41ZdbW 1 hour ago
The query tab looks quite complex with all these content shards: https://hackerbook.dosaygo.com/?view=query
I have a much simpler database: https://play.clickhouse.com/play?user=play#U0VMRUNUIHRpbWUsI...
[-]
- embedding-shape 1 hour ago
  Does your database also runs offline/locally in the browser? Seems to be the reason for the large number of shards.
sirjaz 1 hour ago
This would be awesome as a cross platform app.
asdefghyk 2 hours ago
How much space is needed? ...for the data .... Im wondering if it would work on a tablet? ....
[-]
- keepamovin 2 hours ago
  ~9GB gzipped.