Digitizing Old Notebooks

July 7, 2025 · 761 words · 4 min read

Building a pipeline to rescue old notebooks from slow decomposition.

Note: There are no images in this post. The scans are personal; this is just about the process.

Every time I visit my parents I end up in the attic, like a squirrel after a long winter. It's a habit at this point, opening dusty boxes and hoping I get something interesting, and this time I did! Nine old notebooks from around 2001 to 2003, buried under a pile of toys I do not remember ever playing with. Back then I spent most of my free time writing down game concepts, sketching fantasy maps, and just generally trying out whatever idea happened to be in my head. After so long, I didn't remember most of what was in them, but I recognised it all instantly as mine.

They're a veritable treasure trove. I found really stupid RPG systems with more rules than any actual game could survive, fantasy maps covered in symbols I used to invent on the spot, and little diagrams for machines that ignored logic. It's the kind of stuff a kid makes when they have more imagination than skill. It's a little embarrassing.

The handwriting is also part of the story. I'm very left-handed, but the Eastern European school system I grew up in insisted I write with my right. That pretty much never took: the muscle memory never formed, so my handwriting is cramped and uneven and every letter looks slightly off. I stopped writing by hand as soon as I could and never went back, so flipping through page after page of that handwriting felt like stumbling into a part of my childhood I usually skip.

The notebooks themselves are in rough shape, with yellowed pages, cracked edges, weird stains, and the general wear of over twenty years in an attic. Seven-year-old me cared deeply about drawing worlds, not so much about preserving them. If I left them alone, they wouldn't survive much longer, so I decided to digitise everything and build a small pipeline to power it: something reliable, simple, and repeatable.

I started with my phone, but that lasted three pages. There were shadows everywhere, warped angles, and terrible lighting, and doing nine notebooks like that would have driven me mad. I ended up borrowing a document camera from my library, mounting it above my desk, setting up two cheap LED panels at 45 degrees, and putting a dark desk mat underneath for contrast. Much better!

For each page I saved two versions: one archival JPEG at full resolution and one PNG copy for processing. The capture script handles the boring parts.

bash

#!/bin/bash
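set -euo pipefail

# Both arguments are required; stop with a usage message if either is missing
NOTEBOOK_ID="${1:?usage: $0 NOTEBOOK_ID PAGE}"
PAGE="${2:?usage: $0 NOTEBOOK_ID PAGE}"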
RAW_DIR="./raw/${NOTEBOOK_ID}"
mkdir -p "$RAW_DIR"
 
# Capture from the document camera (outputs JPEG)
gphoto2 --quiet --capture-image-and-download \
  --filename="${RAW_DIR}/page_${PAGE}_archival.jpg"
 
# Convert a copy to PNG for processing
convert "${RAW_DIR}/page_${PAGE}_archival.jpg" \
  "${RAW_DIR}/page_${PAGE}_process.png"
 
# Clean up the processing version
convert "${RAW_DIR}/page_${PAGE}_process.png" \
  -fuzz 12% -trim +repage \
  -deskew 85% \
  "${RAW_DIR}/page_${PAGE}_clean.png"

Once captured, every page goes through a tagging step. I threw together a small Electron viewer because it was faster than searching for a better tool. It lets me assign categories, estimated dates, keywords, and a quick quality note. The metadata ends up looking like this.

json

{
  "notebook_id": "nb_03",
  "page": 47,
  "year_written": "2003",
  "categories": ["game_design", "rpg", "combat"],
  "keywords": ["turns", "mana", "classes"],
  "condition": "water_damage",
  "notes": "Illegible handwriting with several unclear symbols"
}
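Since the later steps read this metadata back, it can help to load it through one small function that checks the required fields and fills in defaults. This is a minimal sketch, not the viewer's actual code: the function name `load_meta`, the choice of required fields, and the default values are all assumptions made for illustration.

```python
import json
from pathlib import Path

# Assumption: only these two fields are treated as mandatory.
REQUIRED = ("notebook_id", "page")


def _defaults():
    """Fresh default values for optional fields (new lists on every call)."""
    return {
        "year_written": "unknown",
        "categories": [],
        "keywords": [],
        "condition": "",
        "notes": "",
    }


def load_meta(path):
    """Read one page's metadata JSON and return it as a plain dict."""
    raw = json.loads(Path(path).read_text(encoding="utf-8"))
    missing = [key for key in REQUIRED if key not in raw]
    if missing:
        raise ValueError(f"{path}: missing required field(s): {', '.join(missing)}")
    # Values from the file override the defaults.
    return {**_defaults(), **raw}
```

Returning a plain dict keeps it compatible with code that calls `meta.get(...)` on the result.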

For the transcription, plain OCR failed immediately. Tesseract handled the printed labels just fine, but the handwriting might as well have been ancient runes. So I split the task. Google Vision does the printed text (it's excellent at that) and Claude handles the handwriting, which it's weirdly good at. The transcription script sends Claude the page image along with its metadata and asks for both a transcription and descriptions of any diagrams.

python

import anthropic
import base64
from pathlib import Path
 
def encode_image(path):
    data = Path(path).read_bytes()
    return base64.b64encode(data).decode("utf-8")
 
def transcribe(image_path, meta):
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    encoded = encode_image(image_path)
 
    prompt = (
        "This is a scanned notebook page from around "
        f"{meta.get('year_written', 'unknown')}. "
        "The handwriting is extremely difficult to read. "
        "Transcribe everything visible and mark unclear text with [?]. "
        "Also describe drawings and diagrams in detail."
    )
 
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "image",
                     "source": {"type": "base64", "media_type": "image/png", "data": encoded}},
                    {"type": "text", "text": prompt}
                ]
            }
        ]
    )
 
    return msg.content[0].text

The results are hilarious. Claude recognises my messy scribbles way better than I can!
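Running this over a whole notebook only needs a small driver loop. The sketch below is one way it could look, not the script actually used: the name `transcribe_notebook`, the `page_N_meta.json` file next to each image, and the one-text-file-per-page output are assumptions. The transcription function is passed in as an argument, so the `transcribe()` function above can be plugged in directly, and pages that already have output are skipped so a rerun never pays for the same page twice.

```python
import json
from pathlib import Path


def transcribe_notebook(raw_dir, transcribe_fn, out_dir):
    """Transcribe every cleaned page in raw_dir that has no output yet.

    transcribe_fn(image_path, meta) must return the transcription as a string.
    Returns the list of text files written during this run.
    """
    raw_dir, out_dir = Path(raw_dir), Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    # Note: sorted() is alphabetical, so page_10 comes before page_2.
    for image in sorted(raw_dir.glob("page_*_clean.png")):
        out_file = out_dir / f"{image.stem}.txt"
        if out_file.exists():
            continue  # already transcribed in an earlier run
        # Assumed layout: metadata stored beside the image as page_N_meta.json
        meta_file = image.with_name(image.stem.replace("_clean", "_meta") + ".json")
        meta = {}
        if meta_file.exists():
            meta = json.loads(meta_file.read_text(encoding="utf-8"))
        out_file.write_text(transcribe_fn(str(image), meta), encoding="utf-8")
        written.append(out_file)
    return written
```

A call would then look like `transcribe_notebook("./raw/nb_03", transcribe, "./text/nb_03")`, where the output directory is again just an example path.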

For storage and search, everything ends up in SQLite with full text search.

sql

CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    notebook_id TEXT,
    page INTEGER,
    date_written TEXT,
    transcription TEXT,
    diagram_description TEXT,
    image_path TEXT,
    metadata JSON
);
 
CREATE VIRTUAL TABLE pages_fts USING fts5(
    transcription,
    diagram_description,
    content='pages',
    content_rowid='id'
);

Now I can search for things like all pages about elemental magic or all maps with mountain ranges. It's instant and strangely satisfying to use.
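One detail worth knowing about this kind of schema: an FTS5 table declared with `content='pages'` does not index new rows on its own, so something has to copy them in. The sketch below shows one common way to do that with an insert trigger, plus a small search helper, using Python's built-in `sqlite3` module. It assumes an SQLite build with FTS5 enabled; the trigger name `pages_ai`, the `search` function, the trimmed-down column list, and the two sample rows are made up for illustration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    notebook_id TEXT,
    page INTEGER,
    transcription TEXT,
    diagram_description TEXT
);
CREATE VIRTUAL TABLE pages_fts USING fts5(
    transcription,
    diagram_description,
    content='pages',
    content_rowid='id'
);
-- Copy every newly inserted page into the full text index.
CREATE TRIGGER pages_ai AFTER INSERT ON pages BEGIN
    INSERT INTO pages_fts(rowid, transcription, diagram_description)
    VALUES (new.id, new.transcription, new.diagram_description);
END;
"""


def search(conn, query):
    """Return (notebook_id, page) for pages matching an FTS5 query, best match first."""
    rows = conn.execute(
        "SELECT p.notebook_id, p.page FROM pages_fts "
        "JOIN pages AS p ON p.id = pages_fts.rowid "
        "WHERE pages_fts MATCH ? ORDER BY rank",
        (query,),
    )
    return rows.fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
# Made-up sample rows, only here to exercise the index.
conn.executemany(
    "INSERT INTO pages (notebook_id, page, transcription, diagram_description) "
    "VALUES (?, ?, ?, ?)",
    [
        ("nb_03", 47, "Fire mana costs two turns", "Map with a mountain range"),
        ("nb_01", 3, "Dungeon inside a giant crab", ""),
    ],
)
print(search(conn, "mountain"))
```

Rows that are later updated or deleted need matching triggers as well, otherwise the index drifts out of sync with the `pages` table.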

These notebooks aren't valuable to anyone else. They're messy and half formed. But they're a clear record of who I was before I knew what I was doing. A kid trying to build things he didn't have the skills for yet, someone who thought putting a dungeon inside a giant crab made perfect sense. Digitising them gives that version of me a bit of dignity, and even if the handwriting is still hard to look at, at least now the pages live somewhere safe instead of falling apart in a box.

Six notebooks done, three left at the time of writing. The pipeline works well for something I threw together over a few evenings, and I'm curious to see what else shows up in the remaining pages. The total cost so far is only a few dollars in API calls and some late nights. Cheaper than a fancy coffee. The memories were worth it!

Update July 15: All nine notebooks finished. About ten EUR and maybe twenty hours of work total. Absolutely worth it!