Quick and free OCR of PDFs using the CLI

Posted on Nov 26, 2021

This is a short guide to explain how you can quickly OCR a scanned PDF on the command line.

Preparation

I assume you know how to use a terminal and the package manager for your system. On macOS, this would be Homebrew.

We will combine tesseract and imagemagick to do the magick for us: tesseract performs the OCR, but since it currently can’t open PDFs directly, we first use imagemagick to convert the PDF to a TIFF file.

On macOS, install both tools like this:

brew install tesseract tesseract-lang imagemagick

Note: tesseract-lang is only required if you want to analyze a language other than eng (English). Also, the package names might be different on other operating systems.
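Before moving on, it can help to confirm that both tools are actually on your PATH and that the language data you need is installed. The following is only a minimal sketch: the helper names missing_tools and check_langs are made up for this example, and eng+deu is just a sample language combination. It relies on tesseract --list-langs, which prints one installed language code per line.

```shell
#!/bin/sh
# Print the name of every tool in the argument list that is not on the PATH.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Return 0 if every code in a "+"-separated list (e.g. eng+deu) appears
# in the output of `tesseract --list-langs`, and 1 otherwise.
check_langs() {
  available=$(tesseract --list-langs 2>&1)
  rc=0
  old_ifs=$IFS
  IFS=+                       # split the argument on "+"
  for code in $1; do
    if ! printf '%s\n' "$available" | grep -qxF "$code"; then
      echo "Language not installed: $code" >&2
      rc=1
    fi
  done
  IFS=$old_ifs
  return "$rc"
}

# Example run: eng+deu is only a sample; use the codes you actually need.
missing=$(missing_tools tesseract convert)
if [ -n "$missing" ]; then
  echo "Not installed:" $missing >&2
elif check_langs eng+deu; then
  echo "tesseract, convert, eng and deu are all available"
fi
```

If a language is reported as missing on macOS, installing tesseract-lang as shown above should provide it.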

One-file OCR

To analyze just one file, use the following commands. I assume that your PDF file is called original.pdf and that it was scanned at a resolution of 300 dpi. If this is not the case, change the value 300 below to the actual resolution (e.g. 150 dpi is another common value).

convert -density 300 original.pdf original.tiff

Now that the PDF is converted to TIFF, tesseract can read it. In the command below, you may again need to adjust the DPI. If you want to analyze a file in a different language, change eng to the code for your language. (You can also combine codes, e.g. eng+deu indicates a document that is mainly written in English but also contains German text.)

tesseract original.tiff analyzed --dpi 300 -l eng pdf

When tesseract is done, it will have written the result to analyzed.pdf, and you should be able to copy text out of that file. If that worked well, I suggest deleting the TIFF file, as it is usually quite large: rm original.tiff
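If you run these steps often, the three commands above can be wrapped in a small shell function. This is only a sketch under a few assumptions: the function name ocr_pdf, the argument order, and the -ocr suffix for the output file are all invented for this example. The convert and tesseract calls are the same ones used above.

```shell
#!/bin/sh
# Usage: ocr_pdf FILE.pdf [LANG] [DPI]
# Hypothetical helper: converts FILE.pdf to TIFF, runs tesseract on it,
# writes FILE-ocr.pdf and removes the intermediate TIFF again.
ocr_pdf() {
  pdf=$1
  lang=${2:-eng}        # default language, as in the example above
  dpi=${3:-300}         # default resolution, as in the example above
  base=${pdf%.pdf}      # strip the .pdf extension
  tiff="$base.tiff"

  # Step 1: PDF -> TIFF at the scan resolution
  convert -density "$dpi" "$pdf" "$tiff" || return 1

  # Step 2: OCR the TIFF and write a searchable PDF
  tesseract "$tiff" "$base-ocr" --dpi "$dpi" -l "$lang" pdf
  rc=$?

  # Step 3: the TIFF is usually large, so remove it in any case
  rm -f "$tiff"
  return "$rc"
}
```

With this function loaded into your shell, ocr_pdf original.pdf eng+deu 300 would write original-ocr.pdf next to the input file.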

Mass OCR

If you want to analyze many PDFs, you’ll find a script below to help you with that. It assumes that the PDFs to analyze are in a folder called ./input and writes the analyzed PDFs to a folder called ./output. Both paths are relative to the location of the script file.

In the script change the following values:

  • deu to the language you need
  • 300 to the DPI you need
  • 4 to the number of parallel tesseract processes you would like to run