Terence Eden’s Blog
@blog@shkspr.mobi · 3 days ago

Improving PixelMelt's Kindle Web Deobfuscator

https://shkspr.mobi/blog/2025/10/improving-pixelmelts-kindle-web-deobfuscator/

A few days ago, someone called PixelMelt published a way for Amazon's customers to download their purchased books without DRM. Well… sort of.

In their post "How I Reversed Amazon's Kindle Web Obfuscation Because Their App Sucked" they describe the process of spoofing a web browser, downloading a bunch of JSON files, reconstructing the obfuscated SVGs used to draw individual letters, and running OCR on them to extract text.

There were a few problems with this approach.

Firstly, the downloader was hard-coded to only work with the .com site. That fix was simple - do a search and replace on amazon.com with amazon.co.uk. Easy!
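That locale fix amounts to a one-line substitution over the downloader's source. A minimal sketch (the function name and URL are illustrative, not the original script's):

```python
def localise(source: str, domain: str = "amazon.co.uk") -> str:
    """Swap the hard-coded .com storefront for another Amazon domain."""
    return source.replace("amazon.com", domain)

# e.g. rewrite a hard-coded URL for the UK store:
print(localise('BASE_URL = "https://www.amazon.com/kindle-library"'))
# → BASE_URL = "https://www.amazon.co.uk/kindle-library"
```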

But the harder problem was with the OCR. The code was designed to visually centre each extracted glyph. That gives a nice amount of whitespace around the character which makes it easier for OCR to run. The only problem is that some characters are ambiguous when centred:

Several letters drawn with vertical centering.

When I ran the code, lots of full-stops became midpoints, commas became apostrophes, and various other characters went a bit wonky.

That made the output rather hard to read. The problem was compounded by the way line-breaks were treated. Modern eBooks are designed to be reflowable - no matter the size of your screen, lines should only break at the end of a paragraph. The extracted text had a forced line-break at the end of every displayed line - rather than at the end of each paragraph.

So I decided to fix it.

A New Approach

I decided that OCRing an entire page would yield better results than single characters. I was (mostly) right. Here's what a typical page looks like after de-obfuscation and reconstruction:

A page of text.

As you can see - the typesetting is good for the body text, but skew-whiff for the title. Bold and italics are preserved. There are no links or images.

Here's how I did it.

Extract the characters

As in the original code, I took the SVG path of the character and rendered it as a monochrome PNG. Rather than centring the glyph, I used the height and width provided in the glyphs.json file. That gave me a directory full of individual letters, numbers, punctuation marks, and ligatures. These were named by fontKey (bold, italic, normal, etc).
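A minimal sketch of that step: wrap each glyph's path data in an SVG document sized from the width and height that glyphs.json provides, ready for rasterising to a monochrome PNG (e.g. with a tool such as cairosvg). The field handling here is illustrative, not the real schema:

```python
def glyph_to_svg(path_data: str, width: float, height: float) -> str:
    """Wrap a bare SVG path in a document sized from glyphs.json,
    without centring the glyph."""
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}" '
        f'viewBox="0 0 {width} {height}">'
        f'<path d="{path_data}" fill="black"/></svg>'
    )

# A filename like "normal_0041.svg" keeps glyphs grouped by fontKey.
svg = glyph_to_svg("M0 0 L10 0 L10 14 L0 14 Z", 10, 14)
```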

Create a blank page

The page_data_0_4.json file contains the width and height of the page. I created a white PNG with the same dimensions. The individual characters could then be placed on that.

Resize the characters

In the page_data_0_4.json each run of text has a fontKey - which allows the correct glyph to be selected. There's also a fontSize parameter. Most text seems to be (the ludicrously precise) 19.800001. If a font had a different size, I temporarily scaled the glyph in proportion to 19.8.
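The proportional scaling is just a ratio against that 19.8 baseline. A small sketch (glyph dimensions here are in pixels and purely illustrative):

```python
BASE_FONT_SIZE = 19.8  # the near-universal fontSize in page_data_0_4.json

def scaled_glyph_size(glyph_w: int, glyph_h: int, font_size: float) -> tuple:
    """Scale a rendered glyph in proportion to the 19.8 baseline size."""
    factor = font_size / BASE_FONT_SIZE
    return (round(glyph_w * factor), round(glyph_h * factor))

# A run at 29.7 is 1.5x the baseline, so a 20x28 glyph becomes 30x42.
```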

Each glyph has an associated xPosition, along with a transform which gives X and Y offsets. That allows for indenting and other text layouts.

The characters were then pasted on to the blank page.

Once every character from that page had been extracted, resized, and placed - the page was saved as a monochrome PNG.
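The assemble-and-save loop above can be sketched with Pillow. This is a hedged outline, not the author's actual code: the placement coordinates are assumed to have already been computed from xPosition plus the transform offsets, and glyphs are assumed to be black-on-white so a direct paste suffices.

```python
from PIL import Image

def assemble_page(page_w: int, page_h: int, placements) -> Image.Image:
    """Paste pre-rendered glyph images onto a blank white page.

    placements: iterable of (glyph_image, x, y) tuples, with x and y
    already derived from each glyph's xPosition and transform offsets.
    """
    page = Image.new("L", (page_w, page_h), 255)  # white 8-bit greyscale
    for glyph_img, x, y in placements:
        # Simplification: a plain paste, since glyph boxes rarely overlap.
        page.paste(glyph_img, (int(x), int(y)))
    return page

# page = assemble_page(1200, 1800, placements)
# page.save("page_0022.png")
```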

OCR the page

Tesseract 5 is a fast, modern, and reasonably accurate OCR engine for Linux.

Running tesseract page_0022.png output -l eng produced a .txt file with all the text extracted.

For a more useful HTML style layout, the hOCR output can be used: tesseract page_0022.png output -l eng hocr

Or, a PDF with embedded text: tesseract page_0022.png output -l eng pdf
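The hOCR output is just HTML: each recognised word sits in a span with class ocrx_word, whose title attribute carries its bounding box. A stdlib parser is enough to recover words plus positions from the .hocr file:

```python
from html.parser import HTMLParser

class HocrWords(HTMLParser):
    """Collect (word, bbox) pairs from Tesseract's hOCR output."""

    def __init__(self):
        super().__init__()
        self.words = []     # list of (text, (x0, y0, x1, y1))
        self._bbox = None   # bbox of the span we are currently inside

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if "ocrx_word" in a.get("class", ""):
            # title looks like: "bbox 393 49 531 91; x_wconf 96"
            nums = a["title"].split(";")[0].split()[1:5]
            self._bbox = tuple(int(n) for n in nums)

    def handle_data(self, data):
        if self._bbox and data.strip():
            self.words.append((data.strip(), self._bbox))
            self._bbox = None

parser = HocrWords()
parser.feed('<span class="ocrx_word" title="bbox 393 49 531 91; x_wconf 96">Hello</span>')
```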

Mistakes

OCR isn't infallible. Even with a high resolution image and a clear font, there were some errors.

  • Superscript numerals for footnotes were often missing from the OCR.
  • Words can run together even if they are well spaced.
  • Tesseract can recognise bold and italic characters - but it outputs everything as plain text.

What's missing?

Images aren't downloaded. I took a brief look and, while there are links to them in the metadata, they're downloaded as encrypted blobs. I'm not clever enough to do anything with them.

The OCR can't pick out semantic meaning. Chapter headings and footnotes are rendered the same way as text.

Layout is flat. The image of the page might have an indent, but the output text won't.

What's next?

This is very far from perfect. It can give you a visually similar layout to a book you have purchased from Amazon. But it won't be reflowable.

The text will be reasonably accurate. But there will be plenty of mistakes.

You can get an HTML layout with hOCR. But it will be missing formatting and links.

Processing all the JSON files and OCRing all the images is relatively quick. But tweaking and assembling is still fairly manual.

There's nothing particularly clever about what I've done. The original code didn't come with an open source software licence, so I am unable to share my changes - but any moderately competent programmer could recreate this.

Personally, I've just stopped buying books from Amazon. I find that Kobo is often cheaper and their DRM is easy to bypass. But if you have many books trapped in Amazon - or a book is only published there - this is a barely adequate way to liberate it for your personal use.

#Amazon #drm #ebooks #kindle #python

Third spruce tree on the left boosted
TinDrum
@oscarjiminy@aus.social · last week

"a new digital platform enabling your favorite independently owned bookstores to sell digital books, granting them a foothold in a marketplace long dominated by Amazon...100% of the profits generated by sales through those brick-and-mortar stores funnel directly back to them. Customers who don’t specify a store to support when purchasing e-books on Bookshop.org support indie shops anyway, since a third of the profit from those sales goes into a profit-sharing pool"

#reading #ebooks #books #literature #technology #bookStore #fuckAmazon

https://www.salon.com/2025/01/28/bookshoporg-enters-the-e-book-arena-giving-indie-stores-a-new-way-to-compete-with/
