Archive for June 2010
After deciding some months ago that none of the currently available e-readers could meet all my needs, I didn’t think about them for a while. The recent “price war” between Amazon and Barnes and Noble has me reconsidering: at less than $200 for a Nook or a Kindle, I could probably live without some of the hardware features I want, especially if there is software that can help bridge the gap.
I’m still undecided about buying a device, but I wanted to catalog some of the programs and hacks that I’ve (re-)discovered as I looked into the issue a second time.
- Calibre looks like a great piece of desktop software for managing e-books. It knows how to talk to multiple e-readers and can convert between popular formats, so that (e.g.) Kindle owners can read ePub books by first converting them to Mobipocket format.
- Savory reuses some of Calibre's conversion code to let Kindle users download and convert ePub books directly on their reader, without having to go through desktop software.
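For what it's worth, the ePub-to-Mobipocket step can also be scripted, since Calibre ships a command-line converter called ebook-convert. Below is a minimal sketch in Python that only wraps that command. The helper names (build_convert_command, convert) and the choice to write the output next to the input are my own assumptions, not anything Calibre prescribes.

```python
import shutil
import subprocess
from pathlib import Path


def build_convert_command(epub_path, out_dir=None):
    """Return the ebook-convert argument list for an ePub -> Mobipocket conversion.

    Hypothetical helper: the output goes next to the input unless out_dir is given.
    """
    src = Path(epub_path)
    dest_dir = Path(out_dir) if out_dir is not None else src.parent
    dest = dest_dir / (src.stem + ".mobi")
    # ebook-convert infers the input and output formats from the file extensions.
    return ["ebook-convert", str(src), str(dest)]


def convert(epub_path, out_dir=None):
    """Run the conversion, assuming Calibre's command-line tools are on PATH."""
    if shutil.which("ebook-convert") is None:
        raise RuntimeError("ebook-convert not found on PATH; is Calibre installed?")
    cmd = build_convert_command(epub_path, out_dir)
    subprocess.run(cmd, check=True)
    return cmd[-1]  # path of the .mobi file that was written
```

Calling convert("books/novel.epub") should leave a novel.mobi file in the same folder, ready to copy to the Kindle.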
The main issue I have, though, is in dealing with PDFs. It doesn’t look like I’m going to be able to justify the expense of a large-format e-ink reader in the near future (the Kindle DX is the cheapest, I think, at $489!). I’m still looking for a comprehensive set of tools for manipulating PDFs so that I can read them easily on a smaller screen. Specifically, I need tools for:
- extracting text from text-based PDFs, or at least being able to reflow them and trim their margins
- converting scanned images of book pages in PDFs into text via OCR software
I haven’t found a complete solution for either task, but I have come across various programs that do some of these things:
- PDFMunge is a Python program that can help with the task of trimming margins and reflowing text in text-based PDFs.
- pdftk is a comprehensive command-line tool for manipulating PDF files (it is built on top of the iText library).
- Google Docs now has the option to use OCR to convert PDFs to text. It doesn’t work perfectly, especially for more technical material, but it’s easy to use. I believe it is based on Ocropus and/or tesseract-ocr, both of which are Free software and can be built and run locally (if you can figure out how to do so; the dependencies are pretty significant).
- Briss looks like a nice way to crop scanned PDFs using a GUI.
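To give a rough idea of how the command-line pieces could fit together for the OCR task, here is a sketch in Python that just builds the commands. It assumes pdftk and tesseract are installed, and it also assumes pdftoppm (from Poppler, which isn't one of the tools listed above) for turning PDF pages into images, since tesseract reads images rather than PDFs. The function names are mine and purely illustrative.

```python
from pathlib import Path


def pdftk_extract(pdf, first, last, out_pdf):
    """pdftk: copy pages first..last of pdf into out_pdf."""
    return ["pdftk", str(pdf), "cat", f"{first}-{last}", "output", str(out_pdf)]


def rasterize(pdf, prefix, dpi=300):
    """pdftoppm (assumed helper, from Poppler): one PNG per page, named from prefix."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf), str(prefix)]


def ocr(image):
    """tesseract: write the recognized text to a .txt file next to the image."""
    image = Path(image)
    # tesseract takes an output base name and appends ".txt" itself.
    return ["tesseract", str(image), str(image.with_suffix(""))]
```

Each list can be handed to subprocess.run(cmd, check=True) in turn: extract a chapter with pdftk, rasterize it, then run tesseract over every PNG that pdftoppm produced. The 300 dpi default is a guess at a resolution that works reasonably for scanned book pages, not a tested recommendation.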