Quick and free OCR of PDFs using the CLI
This short guide explains how to quickly OCR a scanned PDF on the command line.
Preparation
I assume you know how to use a terminal and, therefore, the package manager for your system. On macOS, this would be Homebrew.
We will combine tesseract and imagemagick to do the magick for us: tesseract is the software that does the OCR for us. Since tesseract currently can’t open PDFs directly, we use imagemagick to convert our PDF to a TIFF file.
On macOS, install both tools like this:
brew install tesseract tesseract-lang imagemagick
Note: tesseract-lang is required if you want to analyze any language other than eng.
Also, this might be different for other operating systems.
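Since package names differ between systems, it can help to check that both tools actually ended up on your PATH before continuing. The following is only a minimal sketch: check_tools is a hypothetical helper name of mine, and I assume that your ImageMagick installation provides the convert command used later in this guide.

```sh
# Hypothetical helper: report which of the given tools are missing from PATH.
# Returns 0 if all tools were found, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Usage (commented out so that nothing runs when this file is sourced):
# check_tools tesseract convert && tesseract --list-langs
```

If both tools are found, tesseract --list-langs prints the languages that are installed, which tells you whether the language you need is available.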
One-file OCR
To analyze just one file, use the following commands.
I assume that your PDF file is called original.pdf.
Further, I assume that you’ve scanned your PDF with a resolution of 300 dpi.
If this is not the case, change the value 300 below to the actual resolution (e.g. 150 dpi is another common value).
convert -density 300 original.pdf original.tiff
Now that the PDF is converted to TIFF, tesseract can read it.
In the command below, again, you may need to adjust the DPI.
Also, if you want to analyze a file in a different language, you need to change the eng value to the value for your language.
(You can even combine those, e.g. eng+deu would indicate a document which is mainly written in English but also contains German text.)
tesseract original.tiff analyzed --dpi 300 -l eng pdf
When tesseract is done, it will have written the result to analyzed.pdf, and you should be able to copy text out of that PDF.
If that worked well, I suggest deleting the TIFF file, as it is usually quite large:
rm original.tiff
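If you run these three steps often, you could wrap them in a small shell function. This is just a sketch under my own assumptions: the name ocr_pdf, the argument order and the -ocr suffix for the output file are made up for illustration. The DPI defaults to 300 and the language to eng, and the TIFF is only removed if both earlier steps succeeded.

```sh
# Hypothetical helper: ocr_pdf INPUT.pdf [DPI] [LANG]
# Writes a searchable PDF called INPUT-ocr.pdf next to the input file.
# Note: plain POSIX sh has no local variables, so these names are global.
ocr_pdf() {
    input=$1
    dpi=${2:-300}      # scan resolution, defaults to 300 dpi
    lang=${3:-eng}     # tesseract language code(s), e.g. eng+deu
    base=${input%.pdf} # file name without the .pdf extension
    tiff="$base.tiff"

    # Convert to TIFF, run the OCR, and delete the TIFF only on success.
    convert -density "$dpi" "$input" "$tiff" &&
        tesseract "$tiff" "$base-ocr" --dpi "$dpi" -l "$lang" pdf &&
        rm "$tiff"
}

# Usage (commented out so that nothing runs when this file is sourced):
# ocr_pdf original.pdf 300 eng
```

With the usage line above, the result would end up in original-ocr.pdf. If one of the steps fails, the TIFF is left in place so you can inspect it.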
Mass OCR
If you want to analyze many PDFs, you’ll find a script below to help you with that.
It assumes that the PDFs to analyze are in a folder called ./input.
It will write the PDFs to a folder called ./output.
These folders should be relative to the script file.
In the script, change the following values:
deu to the language you need
300 to the DPI you need
4 to the number of parallel tesseract processes you would like to run