Arabic OCR with an API: Make Scanned Arabic PDFs Searchable (Python)

If you've ever tried to extract text from a scanned Arabic document, you already know the pain. Most OCR tooling is built English-first. Arabic adds three problems on top:

  1. Right-to-left (RTL) text that breaks naive layout assumptions.
  2. Connected letters (ligatures) — the same letter changes shape depending on its position in the word.
  3. Diacritics and a different numeral…
Read more →
Bookmarks - data, design, vis, book

These are some things I’ve wandered across on the web this week.

🔖

When Bits Rot - with C McKean, L Talboom, A Page-Mitchell

English Edition: floppy disks, hard drives, CDs, DVDs, SSD drives - no matter what you choose to store your data on - ultimately they all decay. With my guests Callum McKean, Leontien Talboom and Adrian Page-Mitchell, we’re going to talk about what kinds of data we…

Read more →
Pluralistic: Georgia's voting technology blunder (18 Apr 2026)

Today's links Georgia's voting technology blunder: It's possible for Dominion machines to suck, but not in the way that Tucker Carlson says they do. Hey look at this: Delights to delectate. Object permanence: GWB's illegal iPod; McDonald's breakfast sandwich fanfic; Technofeudal debt; "The Everything Box"; $100m deli. Upcoming appearances: Los Angeles, San Francisco, London, Berlin, NYC,…

Read more →