I am simultaneously excited for the implementation techniques that #python free threading is going to open up for framework authors and deeply concerned about the way I see discussions going around it, as it seems application authors will be speedrunning the “you can fit so many untestable data races in here” to “you can fit so many intractable mutex deadlocks in here” pipeline that Java previewed for us all in the early aughts.
Improving PixelMelt's Kindle Web Deobfuscator
https://shkspr.mobi/blog/2025/10/improving-pixelmelts-kindle-web-deobfuscator/
A few days ago, someone called PixelMelt published a way for Amazon's customers to download their purchased books without DRM. Well… sort of.
In their post "How I Reversed Amazon's Kindle Web Obfuscation Because Their App Sucked" they describe the process of spoofing a web browser, downloading a bunch of JSON files, reconstructing the obfuscated SVGs used to draw individual letters, and running OCR on them to extract text.
There were a few problems with this approach.
Firstly, the downloader was hard-coded to only work with the .com site. That fix was simple - do a search and replace on amazon.com with amazon.co.uk. Easy!
But the harder problem was with the OCR. The code was designed to visually centre each extracted glyph. That gives a nice amount of whitespace around the character which makes it easier for OCR to run. The only problem is that some characters are ambiguous when centred:
When I ran the code, lots of full-stops became midpoints, commas became apostrophes, and various other characters went a bit wonky.
That made the output rather hard to read. This was compounded by the way line-breaks were treated. Modern eBooks are designed to be reflowable - no matter the size of your screen, lines should only break on a new paragraph. This had forced linebreaks at the end of every displayed line - rather than at the end of a paragraph.
So I decided to fix it.
A New Approach
I decided that OCRing an entire page would yield better results than single characters. I was (mostly) right. Here's what a typical page looks like after de-obfuscation and reconstruction:
As you can see - the typesetting is good for the body text, but skew-whiff for the title. Bold and italics are preserved. There are no links or images.
Here's how I did it.
Extract the characters
As in the original code, I took the SVG path of the character and rendered it as a monochrome PNG. Rather than centring the glyph, I used the height and width provided in the glyphs.json file. That gave me a directory full of individual letters, numbers, punctuation marks, and ligatures. These were named by fontKey (bold, italic, normal, etc).
Create a blank page
The page_data_0_4.json file gives the width and height of the page. I created a white PNG with the same dimensions. The individual characters could then be placed on that.
Resize the characters
In the page_data_0_4.json each run of text has a fontKey - which allows the correct glyph to be selected. There's also a fontSize parameter. Most text seems to be (the ludicrously precise) 19.800001. If a font had a different size, I temporarily scaled the glyph in proportion to 19.8.
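A minimal sketch of that scaling step, assuming glyph dimensions are measured in pixels and that 19.8 is the baseline size. The function name and the rounding rule are illustrative choices, not taken from the original code.

```python
BASE_FONT_SIZE = 19.8  # the size most text runs use, per the post


def scaled_glyph_size(width, height, font_size, base=BASE_FONT_SIZE):
    """Return (width, height) in whole pixels for a glyph drawn at font_size."""
    factor = font_size / base
    # Assumption: round to the nearest pixel, and never let a glyph
    # collapse to zero pixels after rounding.
    return max(1, round(width * factor)), max(1, round(height * factor))


print(scaled_glyph_size(20, 30, 19.800001))  # body text: effectively unchanged
print(scaled_glyph_size(20, 30, 39.6))       # a run at twice the base size
```

The resulting pair is the kind of size tuple an imaging library's resize call would take.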
Each glyph has an associated xPosition, along with a transform which gives X and Y offsets. That allows for indenting and other text layouts.
The characters were then pasted on to the blank page.
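The blank page and the pasting step can be sketched without any imaging library. This is a stand-in, not the method used here: the page is a plain grid of pixels rather than a PNG, and the rule for combining xPosition with the transform offsets (simple addition) is an assumption, since the exact layout of the JSON isn't described. All function names are hypothetical.

```python
def blank_page(width, height):
    """Return a white page as rows of pixels (0 = white, 1 = black ink)."""
    return [[0] * width for _ in range(height)]


def paste_position(x_position, offset_x, offset_y):
    """Assumed rule: the transform offsets are added to the glyph's xPosition."""
    return round(x_position + offset_x), round(offset_y)


def paste_glyph(page, glyph, left, top):
    """Copy a glyph's ink pixels onto the page, clipping at the page edges."""
    for dy, row in enumerate(glyph):
        for dx, ink in enumerate(row):
            x, y = left + dx, top + dy
            if ink and 0 <= y < len(page) and 0 <= x < len(page[0]):
                page[y][x] = 1


page = blank_page(6, 4)
glyph = [[1, 1],
         [1, 0]]  # a made-up 2x2 glyph for illustration
paste_glyph(page, glyph, *paste_position(1.4, 2.0, 1.0))
for row in page:
    print("".join("#" if px else "." for px in row))
```

With a real imaging library the same three steps become: create a white image at the page dimensions, resize each glyph image, and paste it at the computed coordinates.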
Once every character from that page had been extracted, resized, and placed - the page was saved as a monochrome PNG.
OCR the page
Tesseract 5 is a fast, modern, and reasonably accurate OCR engine for Linux.
Running tesseract page_0022.png output -l eng produced a .txt file with all the text extracted.
For a more useful HTML style layout, the hOCR output can be used: tesseract page_0022.png output -l eng hocr
Or, a PDF with embedded text: tesseract page_0022.png output -l eng pdf
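To run those commands over every page, something like the following sketch might do. It assumes the tesseract binary is on the PATH and that pages are saved as page_NNNN.png in one directory. The helper names and directory layout are hypothetical, and only the command-line forms shown above are used.

```python
import subprocess
from pathlib import Path


def tesseract_command(image, output_base, lang="eng", config=None):
    """Build the argument list for one Tesseract run."""
    cmd = ["tesseract", str(image), str(output_base), "-l", lang]
    if config:
        cmd.append(config)  # "hocr" or "pdf", as in the commands above
    return cmd


def ocr_pages(page_dir, out_dir, config=None):
    """OCR every page_*.png in page_dir; needs tesseract installed."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for image in sorted(Path(page_dir).glob("page_*.png")):
        cmd = tesseract_command(image, out_dir / image.stem, config=config)
        subprocess.run(cmd, check=True)  # raise if Tesseract reports an error


print(tesseract_command("page_0022.png", "output", config="hocr"))
```

Calling ocr_pages("pages", "text") would then write one .txt file per page, and passing config="hocr" or config="pdf" switches the output format.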
Mistakes
OCR isn't infallible. Even with a high resolution image and a clear font, there were some errors.
- Superscript numerals for footnotes were often missing from the OCR.
- Words can run together even if they are well spaced.
- Tesseract can recognise bold and italic characters - but it outputs everything as plain text.
What's missing?
Images aren't downloaded. I took a brief look and, while there are links to them in the metadata, they're downloaded as encrypted blobs. I'm not clever enough to do anything with them.
The OCR can't pick out semantic meaning. Chapter headings and footnotes are rendered the same way as text.
Layout is flat. The image of the page might have an indent, but the outputted text won't.
What's next?
This is very far from perfect. It can give you a visually similar layout to a book you have purchased from Amazon. But it won't be reflowable.
The text will be reasonably accurate. But there will be plenty of mistakes.
You can get an HTML layout with hOCR. But it will be missing formatting and links.
Processing all the JSON files and OCRing all the images is relatively quick. But tweaking and assembling is still fairly manual.
There's nothing particularly clever about what I've done. The original code didn't come with an open source software licence, so I am unable to share my changes - but any moderately competent programmer could recreate this.
Personally, I've just stopped buying books from Amazon. I find that Kobo is often cheaper and their DRM is easy to bypass. But if you have many books trapped in Amazon - or a book is only published there - this is a barely adequate way to liberate it for your personal use.

Now that #Python 3.14 is out and Python 3.9 is finally EOL, I'm really looking forward to using pattern matching, string enums, and keyword-only dataclasses in more codebases.

Reminder to #Python:
If you're still using PyInstaller, py2exe, py2app, etc
Please try Beeware's Briefcase

The Python docstring alignment chart.

🎉 I hadn't kept up with what Django's Steering Council had been up to outside of random board updates, and I'm happy to share how impressed I am with how transparent they are.
They are meeting several times a month and sharing their minutes as they go. https://github.com/django/steering-council
They are even sharing on the Django Forum to bring more visibility to what they are doing https://forum.djangoproject.com/t/django-steering-council-meetings-2025/38306
These are very welcome changes. 👏👏
I am doing a new #Python project, where I will try out #HTMX, and I intend to reach out to #Durus and #Quixote again – it's impressive how well old tools continue to be useful and powerful, and how their power increases with more usage.
I also keep learning new things. For example yesterday I discovered that the objects returned by dict.keys(), dict.values() and dict.items() are dynamic view objects!
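A quick illustration of what "dynamic" means here, using a made-up dictionary: the view objects are created once and still reflect later changes to the dict.

```python
inventory = {"apples": 3}
keys = inventory.keys()
items = inventory.items()

inventory["pears"] = 5  # mutate the dict after the views were created

print(list(keys))        # the existing view now includes "pears"
print(len(items))        # the items view has grown as well
print("pears" in keys)   # membership tests consult the live dict

# Key views also support set operations.
print(keys & {"pears", "plums"})
```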
Fwiw: https://github.com/ology/Music/blob/master/hours.py => https://www.youtube.com/watch?v=VUgy90f2ysI <- 11.5 hrs ambient sleep music :)

Just released! 🚀🚀🚀🚀🚀
Pillow 12.0.0
https://fosstodon.org/@pillow/115379893139846791
norwegianblue 0.23.0
https://github.com/hugovk/norwegianblue/releases/tag/0.23.0
pypinfo 23.0.0
https://github.com/ofek/pypinfo/blob/master/CHANGELOG.rst#2300
Humanize 4.14.0
https://github.com/python-humanize/humanize/releases/tag/4.14.0
Tablib 3.9.0
https://github.com/jazzband/tablib/releases/tag/v3.9.0
#Python #release #Pillow #pypinfo #Humanize #norwegianblue #Tablib
And here I thought I was going to get work done today lol. Fuck #python and the cultural "fuck you" it gives to any semblance of stability.
Let me just figure out which funky set of symlinks and different versions of different modules I have to bodge together to get this one thing working, and will it break other stuff I need?
You're goddamn fucking right it will. Fuck Python.

Forget* about Python 3.14, all the cool kids are trying out Python 3.15.0 alpha 1 (but not on production)! 🚀
🔬 PEP 799: A dedicated profiling package for Python profiling tools
💬 PEP 686: Python now uses UTF-8 as the default encoding
🌊 PEP 782: A new PyBytesWriter C API to create a Python bytes object
⚠️ Better error messages
https://discuss.python.org/t/python-3-15-alpha-1/104358?u=hugovk
* Please don't forget about 3.14...