Digitizing Old Notebooks

July 7, 2025 · 740 words · 4 min read

Building a pipeline to rescue old notebooks from slow decomposition.

Note: There are no images in this post. The scans are personal; this is just about the process.

Every time I visit my parents I end up in the attic, like a squirrel after a long winter. It's a habit at this point, opening dusty boxes and hoping something worthwhile shows up, and this time it did! Nine old notebooks from around 2001 to 2003, buried under a pile of things I do not remember keeping. Back then I spent most of my free time writing down game concepts, sketching fantasy maps, and trying out whatever idea happened to be in my head. Honestly, I didn't remember most of what was in them, but I recognised it all instantly as mine.

It's a veritable treasure trove. I found really stupid RPG systems with more rules than any actual game could survive, fantasy maps covered in symbols I invented on the spot, and little diagrams for machines that ignored logic. It's the kind of stuff a kid makes when they have more imagination than skill and no idea that's a problem.

The handwriting is also part of the story. I'm very left-handed, but the Eastern European school system I grew up in insisted I write with my right. That pretty much never took: the muscle memory never formed, so my handwriting is cramped and uneven, and every letter looks slightly off. I stopped writing by hand as soon as I could and never went back, so flipping through page after page of that handwriting felt like stumbling into a part of my childhood I usually skip.

The notebooks themselves are in rough shape. Yellowed pages, cracked edges, weird stains, the general wear of twenty years in an attic. Seven-year-old me cared about drawing worlds, not preserving them. If I left them alone, they wouldn't survive much longer. So I decided to digitise everything and build a small pipeline for the job, something reliable, simple, and repeatable.

I started with my phone, but that lasted three pages. There were shadows everywhere, warped angles, and terrible lighting. Doing nine notebooks like that would have driven me mad, so I borrowed a document camera from work, mounted it above my desk, set up two cheap LED panels at 45 degrees, and put a dark desk mat underneath for contrast.

For each page I saved two versions: one archival JPEG at full resolution and one PNG copy for processing, which then gets a cleaned-up variant. The capture script handles the boring parts.

bash
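#!/bin/bash
# Capture one notebook page and derive the processing copies.
set -euo pipefail

# Both arguments are required; fail early with a usage hint.
if [ "$#" -ne 2 ]; then
  echo "usage: $0 NOTEBOOK_ID PAGE" >&2
  exit 1
fi

NOTEBOOK_ID="$1"
PAGE="$2"
RAW_DIR="./raw/${NOTEBOOK_ID}"
mkdir -p "$RAW_DIR"

# Capture from the document camera (outputs JPEG)
gphoto2 --quiet --capture-image-and-download \
  --filename="${RAW_DIR}/page_${PAGE}_archival.jpg"

# Convert a copy to PNG for processing; the archival JPEG stays untouched
convert "${RAW_DIR}/page_${PAGE}_archival.jpg" \
  "${RAW_DIR}/page_${PAGE}_process.png"

# Clean up the processing version: trim the dark mat border, then straighten
convert "${RAW_DIR}/page_${PAGE}_process.png" \
  -fuzz 12% -trim +repage \
  -deskew 85% \
  "${RAW_DIR}/page_${PAGE}_clean.png"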

Once captured, every page goes through a tagging step. I threw together a small Electron viewer because it was faster than searching for a better tool. It lets me assign categories, estimated dates, keywords, and a quick quality note. The metadata ends up looking like this.

json
{
  "notebook_id": "nb_03",
  "page": 47,
  "year_written": "2003",
  "categories": ["game_design", "rpg", "combat"],
  "keywords": ["turns", "mana", "classes"],
  "condition": "water_damage",
  "notes": "Illegible handwriting with several unclear symbols"
}
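
Since these records are typed in by hand, it can help to check them before they reach the later steps. The following is only a minimal sketch of how that might look, not the viewer's actual code: the `PageMeta` class, the `load_meta` helper, and the specific checks (non-empty notebook ID, page number of at least 1) are assumptions made for illustration, with field names taken from the example record above.

```python
import json
from dataclasses import dataclass, field


@dataclass
class PageMeta:
    """One metadata record, mirroring the fields of the example above."""
    notebook_id: str
    page: int
    year_written: str = "unknown"
    categories: list[str] = field(default_factory=list)
    keywords: list[str] = field(default_factory=list)
    condition: str = ""
    notes: str = ""


def load_meta(raw: str) -> PageMeta:
    """Parse one JSON record and reject obviously broken entries.

    The checks here are hypothetical; adjust them to whatever the
    tagging tool actually guarantees.
    """
    data = json.loads(raw)
    try:
        meta = PageMeta(**data)
    except TypeError as exc:
        # Raised for unknown keys or missing required fields.
        raise ValueError(f"unexpected metadata shape: {exc}") from exc
    if not isinstance(meta.notebook_id, str) or not meta.notebook_id:
        raise ValueError("notebook_id must be a non-empty string")
    if not isinstance(meta.page, int) or meta.page < 1:
        raise ValueError("page must be an integer >= 1")
    return meta
```

Code that expects a plain dictionary, such as a function calling `meta.get(...)`, can still be fed with `dataclasses.asdict(meta)`.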

For the transcription, classic OCR failed immediately. Tesseract handled the printed labels just fine, but the handwriting might as well have been ancient runes. So I split the task. Google Vision does the printed text (it's excellent at that) and Claude handles the handwriting, which it's weirdly good at. The transcription script sends the page image along with its metadata and asks for both the text and descriptions of any diagrams.

python
import anthropic
import base64
from pathlib import Path
 
def encode_image(path):
    data = Path(path).read_bytes()
    return base64.b64encode(data).decode("utf-8")
 
def transcribe(image_path, meta):
    client = anthropic.Anthropic()
    encoded = encode_image(image_path)
 
    prompt = (
        "This is a scanned notebook page from around "
        f"{meta.get('year_written', 'unknown')}. "
        "The handwriting is extremely difficult to read. "
        "Transcribe everything visible and mark unclear text with [?]. "
        "Also describe drawings and diagrams in detail."
    )
 
    msg = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=4096,
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "image",
                     "source": {"type": "base64", "media_type": "image/png", "data": encoded}},
                    {"type": "text", "text": prompt}
                ]
            }
        ]
    )
 
    return msg.content[0].text

The results are hilarious. Claude recognises my messy scribbles way better than I can!
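
Because the prompt asks for unclear text to be marked with [?], those markers can double as a rough signal for which pages deserve a second look by hand. The sketch below illustrates the idea only: the function names and the 0.15 threshold are made up for the example and are not part of the original pipeline.

```python
import re

# Matches the literal "[?]" marker requested in the transcription prompt.
UNCLEAR = re.compile(r"\[\?\]")


def unclear_stats(transcription: str) -> tuple[int, float]:
    """Return (number of [?] markers, markers per readable word)."""
    markers = len(UNCLEAR.findall(transcription))
    # Count words after removing the markers themselves.
    words = len(UNCLEAR.sub(" ", transcription).split())
    if words == 0:
        # Nothing readable at all: treat any marker as fully unclear.
        return markers, (1.0 if markers else 0.0)
    return markers, markers / words


def needs_review(transcription: str, threshold: float = 0.15) -> bool:
    """Flag a page whose marker ratio exceeds a hypothetical threshold."""
    _, ratio = unclear_stats(transcription)
    return ratio > threshold
```

A page that comes back as mostly [?] would then be queued for manual transcription, while clean pages go straight into storage.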

For storage and search, everything ends up in SQLite with full text search.

sql
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    notebook_id TEXT,
    page INTEGER,
    date_written TEXT,
    transcription TEXT,
    diagram_description TEXT,
    image_path TEXT,
    metadata JSON
);
 
CREATE VIRTUAL TABLE pages_fts USING fts5(
    transcription,
    diagram_description,
    content='pages',
    content_rowid='id'
);

Now I can search for things like all pages about elemental magic or all maps with mountain ranges. It's instant and strangely satisfying to use.
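
One detail worth knowing about this kind of schema: an FTS5 table declared with `content='pages'` is an external-content index, and SQLite does not fill it automatically, so something has to mirror rows into it. Below is a minimal sketch of one way to do that with a trigger, plus a search helper. It assumes a Python build whose SQLite includes FTS5, uses a trimmed-down version of the `pages` table, and the trigger name, helper names, and sample rows are all invented for the demonstration.

```python
import sqlite3

SCHEMA = """
CREATE TABLE pages (
    id INTEGER PRIMARY KEY,
    notebook_id TEXT,
    page INTEGER,
    transcription TEXT,
    diagram_description TEXT
);

CREATE VIRTUAL TABLE pages_fts USING fts5(
    transcription,
    diagram_description,
    content='pages',
    content_rowid='id'
);

-- External-content FTS5 tables are not updated automatically,
-- so this trigger mirrors each new row into the index.
CREATE TRIGGER pages_ai AFTER INSERT ON pages BEGIN
    INSERT INTO pages_fts(rowid, transcription, diagram_description)
    VALUES (new.id, new.transcription, new.diagram_description);
END;
"""


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with two made-up sample pages."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO pages (notebook_id, page, transcription, diagram_description) "
        "VALUES (?, ?, ?, ?)",
        [
            ("nb_03", 47, "Fire and ice elemental magic costs mana", ""),
            ("nb_05", 12, "", "Map with a mountain range and a river"),
        ],
    )
    return conn


def search(conn: sqlite3.Connection, query: str) -> list[tuple[str, int]]:
    """Return (notebook_id, page) for every page matching an FTS5 query."""
    rows = conn.execute(
        "SELECT notebook_id, page FROM pages "
        "WHERE id IN (SELECT rowid FROM pages_fts WHERE pages_fts MATCH ?) "
        "ORDER BY notebook_id, page",
        (query,),
    )
    return rows.fetchall()
```

With the sample rows, `search(conn, "elemental magic")` finds the nb_03 page and `search(conn, "mountain")` finds the nb_05 map. Updates and deletes would need matching triggers of their own, which are left out here to keep the sketch short.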

These notebooks aren't valuable to anyone else. They're messy and half-formed. But they're a clear record of who I was before I knew what I was doing: a kid trying to build things he didn't have the skills for yet, someone who thought putting a dungeon inside a giant crab made perfect sense. Digitising them gives that version of me a bit of dignity. The handwriting is still hard to look at, but now the pages live somewhere safe instead of falling apart in a box.

Six notebooks done, three left at the time of writing. The pipeline works well for something I threw together over a few evenings, and I'm curious to see what else shows up in the remaining pages. The total cost so far is only a few dollars in API calls and some late nights. Cheaper than a fancy coffee. The memories were worth it!

Update July 15: All nine notebooks finished. About ten EUR and maybe twenty hours of work total. Absolutely worth it!