OCR software for PureOS

Phil974 · March 22, 2022, 12:49pm

I am looking for OCR software for PureOS and am a little lost in my research… At a minimum, I need to extract the text and the images separately, and if it more or less keeps the layout, that would be great!
Among other things, I came across this:
“Best OCR Apps for Linux”: the first on the list, Tesseract OCR, seems fine, but the “ggle” sponsor does not advocate in its favour…
and “Debian Accessibility Optical character recognition (ocr) packages”: a list of application packages (including Tesseract, but on the command line, and I’m a little lost as to how to make use of it…)
Does anyone have experience with OCR, or at least an opinion on any OCR software (free, of course)? And if so, can it work from image files (jpg, png)? We don’t have a scanner, so we would photograph each page. (I would like to “save” old books, some a little yellowed…)
Thank you!

irvinewade · March 23, 2022, 12:27am

I’ve used tesseract with good success but that is only to extract unformatted text, no images and no layout.

Depending on how many pages we are talking about, you can probably use GIMP to cut the images out yourself.

Yes, it can work from image files. That’s how I’ve used it (only from JPEG). The man page says:

The name of the input file. This can [...] be an image file [...].

Most image file formats (anything readable by Leptonica) are supported.

I had to do some proofreading and correction on the result, i.e. it’s not perfect.
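
For a book photographed page by page, the command-line use described above can be scripted. The following is only a minimal sketch, assuming the `tesseract` binary is installed and on the PATH and that the photos sit in one directory; the function names `build_tesseract_command` and `ocr_pages` and the directory layout are made up for illustration.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_command(image_path, output_base, lang="eng"):
    """Return the argument list for: tesseract IMAGE OUTPUTBASE -l LANG."""
    return ["tesseract", str(image_path), str(output_base), "-l", lang]


def ocr_pages(image_dir, out_dir, lang="eng"):
    """Run tesseract on every .jpg/.jpeg/.png in image_dir, one text file per page."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract is not installed or not on PATH")
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for image in sorted(Path(image_dir).iterdir()):
        if image.suffix.lower() not in {".jpg", ".jpeg", ".png"}:
            continue
        # tesseract appends ".txt" to the output base by itself.
        output_base = out_dir / image.stem
        subprocess.run(build_tesseract_command(image, output_base, lang), check=True)
        written.append(Path(str(output_base) + ".txt"))
    return written
```

Calling something like `ocr_pages("scans", "text", lang="fra")` would then leave one plain-text file per photo in `text/`, still to be proofread by hand. A language other than English needs the matching Tesseract language data installed.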

amosbatto · March 23, 2022, 2:38am

I’ve used the Lios frontend with the tesseract-ocr backend. I installed cuneiform to use as a backend for Lios, but Lios kept crashing when I tried to open the Preferences menus, so I couldn’t configure Lios to use cuneiform. The Lios interface is very basic and it has very limited ability to adjust images for better OCR. If my text wasn’t perfectly straight, I had to rotate parts of the image in GIMP before opening it in Lios.

The output is plain text, which is a pain if you have tables or any formatting. Lios+tesseract works OK in a pinch, but when I was digitizing whole dictionaries using OCR, I gave up and ran pirated FineReader in a Windows virtual machine.

Phil974 · March 25, 2022, 6:49pm

For those interested in OCR: while continuing my research, I discovered OCRopy, which can be “trained” in order to increase the recognition rate… one more lead!

Privacy2 · March 25, 2022, 7:13pm

… but the “ggle” sponsor does not advocate in its favour…

It’s Open Source.

I’ve got to say that I’m disappointed by the amount of what I see as fear and paranoia surrounding Google. Did you seriously write “ggle” ???

Privacy2 · March 25, 2022, 7:21pm

tesseract can output what they call an “hocr” file (https://en.wikipedia.org/wiki/HOCR). For one project, I created a Python wrapper that used the hocr file to add back in estimated tabs and whitespace (which was easy).
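
To illustrate the idea (this is not the wrapper mentioned above, just a rough sketch of one way it could work): each word in an hOCR file carries a `bbox x0 y0 x1 y1` entry in its `title` attribute, so the horizontal position can be turned into a text column. The sketch assumes words are `ocrx_word` spans nested inside `ocr_line` spans and ignores other line classes; `HocrWords` and `layout_text` are hypothetical names.

```python
import re
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect lines of (x0, x1, text) tuples from ocr_line / ocrx_word spans."""

    def __init__(self):
        super().__init__()
        self.lines = []    # one list of (x0, x1, text) per ocr_line
        self._stack = []   # class attribute of every currently open <span>
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag != "span":
            return
        attrs = dict(attrs)
        cls = attrs.get("class") or ""
        self._stack.append(cls)
        if cls == "ocr_line":
            self.lines.append([])
        elif cls == "ocrx_word":
            match = BBOX_RE.search(attrs.get("title") or "")
            self._bbox = tuple(map(int, match.groups())) if match else None
            self._text = []

    def handle_data(self, data):
        if "ocrx_word" in self._stack:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag != "span" or not self._stack:
            return
        cls = self._stack.pop()
        if cls == "ocrx_word" and self._bbox and self.lines:
            text = "".join(self._text).strip()
            if text:
                self.lines[-1].append((self._bbox[0], self._bbox[2], text))
            self._bbox = None


def layout_text(hocr, char_width=None):
    """Rebuild plain text, padding each word out to the column its bbox suggests."""
    parser = HocrWords()
    parser.feed(hocr)
    words = [w for line in parser.lines for w in line]
    if not words:
        return ""
    if char_width is None:
        # Estimate the pixel width of one character from the words themselves.
        char_width = sum(x1 - x0 for x0, x1, _ in words) / sum(len(t) for _, _, t in words)
    char_width = max(char_width, 1)
    rows = []
    for line in parser.lines:
        row = ""
        for x0, _, text in sorted(line):
            column = round(x0 / char_width)
            # Pad to the estimated column, keeping at least one space between words.
            pad = max(column - len(row), 1 if row else 0)
            row += " " * pad + text
        rows.append(row)
    return "\n".join(rows)
```

An hOCR file can be produced with `tesseract page.jpg page hocr`, which writes `page.hocr`; feeding its contents to `layout_text` gives text whose columns roughly line up, which is usually enough for simple tables.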

Aside: The hocr file also has “OCR word confidence” information. I’ve combined that with “probability of OCR error” information (e.g. there are more mistakes between an “o” and an “e” than there are between a “z” and an “a”) as well as a dictionary to get an improved output.
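
A toy sketch of how those three ingredients could fit together, not the actual implementation described above: words whose confidence (the `x_wconf` value in hOCR, 0 to 100) is low and which are not in the dictionary get single-character swaps from a confusion table, and the most probable swap that yields a dictionary word wins. The table values, the threshold of 80, and the names `CONFUSIONS`, `candidates` and `correct_word` are all invented for illustration.

```python
# Hypothetical confusion table: (character OCR produced, character intended)
# mapped to a made-up probability. Real values would have to be measured.
CONFUSIONS = {
    ("o", "e"): 0.05,
    ("e", "o"): 0.05,
    ("1", "l"): 0.08,
    ("l", "1"): 0.08,
    ("0", "o"): 0.06,
}


def candidates(word):
    """Yield (candidate, probability) for every single-character swap in CONFUSIONS."""
    for i, ch in enumerate(word):
        for (seen, intended), prob in CONFUSIONS.items():
            if ch == seen:
                yield word[:i] + intended + word[i + 1:], prob


def correct_word(word, confidence, dictionary, min_confidence=80):
    """Return a dictionary word for a doubtful OCR word, or the word unchanged."""
    # Trust the OCR when it is confident or the word is already known.
    if confidence >= min_confidence or word.lower() in dictionary:
        return word
    matches = [(prob, cand) for cand, prob in candidates(word) if cand.lower() in dictionary]
    if not matches:
        return word
    # Pick the swap with the highest confusion probability.
    return max(matches)[1]
```

For example, with a dictionary containing "book", `correct_word("b0ok", 40, dictionary)` tries the swap "0" to "o" and returns "book", while a word read with high confidence is left alone. A fuller version would also handle multi-character confusions and several errors in one word.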