This short guide explains how to quickly OCR a scanned PDF on the command line.
I assume you know how to use a terminal and, therefore, the package manager for your system. On macOS, this would be Homebrew.
We will combine tesseract and imagemagick to do the magick for us: tesseract is the software that does the actual OCR. Since tesseract currently can’t open PDFs directly, we first use imagemagick to convert our PDF to a TIFF file.
On macOS, install both tools like this:
brew install tesseract tesseract-lang imagemagick
tesseract-lang is required if you want to analyze any language other than English.
Also, the package names might be different on other operating systems.
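After installing, it can be worth checking that both tools actually ended up on your PATH. The helper below is only a small sketch; check_tools is a name I made up for illustration and is not part of either tool.

```shell
# Report which of the given command-line tools are missing from PATH.
# Returns 0 if all of them are found, 1 otherwise.
# (check_tools is a hypothetical helper name, not part of tesseract or imagemagick.)
check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example: check the tools, then list the languages tesseract knows about.
#   check_tools tesseract convert && tesseract --list-langs
```

If a tool is reported as missing, the install step above probably did not complete, or the package has a different name on your system.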
To analyze just one file, use the following commands.
I assume that your PDF file is called original.pdf.
Further, I assume that you’ve scanned your PDF at a resolution of 300 dpi.
If this is not the case, change the value 300 below to the actual resolution (150 dpi is another common value).
convert -density 300 original.pdf original.tiff
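One caveat: in ImageMagick 7 the main command is called magick, and convert is only kept as a legacy name. The following sketch picks whichever of the two is available; imagemagick_cmd is a hypothetical helper name I introduce here, not something ImageMagick ships.

```shell
# Print the name of the ImageMagick command available on this system:
# "magick" (ImageMagick 7) if present, otherwise the legacy "convert".
# (imagemagick_cmd is a hypothetical helper name used for illustration.)
imagemagick_cmd() {
  if command -v magick >/dev/null 2>&1; then
    echo magick
  elif command -v convert >/dev/null 2>&1; then
    echo convert
  else
    echo "ImageMagick not found" >&2
    return 1
  fi
}

# Example: same conversion as above, with the detected command.
#   "$(imagemagick_cmd)" -density 300 original.pdf original.tiff
```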
Now that the PDF is converted to TIFF, tesseract can read it.
In the command below, again, you may need to adjust the DPI.
Also, if you want to analyze a file in a different language, you need to change the eng value to the value for your language.
(You can even combine those, e.g. eng+deu would indicate a document which is mainly written in English but also contains German text.)
tesseract original.tiff analyzed --dpi 300 -l eng pdf
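If tesseract complains about a language, the language data is probably not installed. As a rough sketch, the helper below checks every part of a combined value such as eng+deu against the output of tesseract --list-langs; langs_available is a hypothetical name, and I assume the list prints one language code per line.

```shell
# Check that every language in a "+"-joined list (e.g. "eng+deu") appears
# in the output of `tesseract --list-langs`.
# (langs_available is a hypothetical helper name used for illustration.)
langs_available() {
  local installed lang
  # Some tesseract versions print the list to stderr, so capture both streams.
  installed=$(tesseract --list-langs 2>&1) || return 1
  for lang in $(printf '%s' "$1" | tr '+' ' '); do
    # -x matches whole lines only, -F treats the code as a fixed string.
    if ! printf '%s\n' "$installed" | grep -qxF -- "$lang"; then
      echo "not installed: $lang" >&2
      return 1
    fi
  done
}

# Example:
#   langs_available eng+deu && echo "all languages installed"
```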
When tesseract is done, you should be able to copy text out of the resulting analyzed.pdf.
If that worked well, I suggest deleting the TIFF file, as it is usually quite large:
rm original.tiff
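The three steps for a single file (convert, OCR, clean up) can also be wrapped in one small shell function. This is a sketch under my own assumptions: ocr_pdf and its argument order are made up for illustration, and the TIFF is only deleted once a non-empty result PDF exists.

```shell
# OCR one scanned PDF: convert it to TIFF, run tesseract, then delete the TIFF.
# Usage: ocr_pdf INPUT.pdf OUTPUT_BASE [DPI] [LANG]
# (ocr_pdf is a hypothetical helper name; DPI defaults to 300, LANG to eng.)
ocr_pdf() {
  local input="$1" output_base="$2"
  local dpi="${3:-300}" lang="${4:-eng}"
  local tiff="${input%.pdf}.tiff"   # original.pdf -> original.tiff

  convert -density "$dpi" "$input" "$tiff" || return 1
  tesseract "$tiff" "$output_base" --dpi "$dpi" -l "$lang" pdf || return 1

  # Only remove the large intermediate file once the result exists.
  if [ -s "${output_base}.pdf" ]; then
    rm -- "$tiff"
  fi
}

# Example: equivalent to the commands above.
#   ocr_pdf original.pdf analyzed 300 eng
```

If either step fails, the function stops early and leaves the TIFF in place so you can inspect it.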
If you want to analyze many PDFs, you’ll find a script below to help you with that.
It assumes that the PDFs to analyze are in a folder called
It will write the PDFs to a folder called
These folders should be relative to the script file.
In the script, change the following values:
deu to the language you need
300 to the DPI you need
4 to the number of parallel tesseract processes you would like to run
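For illustration, here is a minimal sketch of what such a batch script could look like. It is not necessarily the script this guide refers to: ocr_folder, the variable names OCR_LANG, DPI and JOBS, and the folder names in the usage example are all placeholders of mine. It relies on find -print0 and xargs -0 -P, which both the GNU and the BSD/macOS versions support.

```shell
#!/usr/bin/env bash
# Batch OCR sketch. All names below are hypothetical placeholders.
OCR_LANG=deu   # tesseract language code(s), e.g. eng or eng+deu
DPI=300        # resolution the PDFs were scanned with
JOBS=4         # number of parallel tesseract processes

# Usage: ocr_folder INPUT_DIR OUTPUT_DIR
# Converts every *.pdf directly inside INPUT_DIR and writes the searchable
# PDFs (same file names) to OUTPUT_DIR.
ocr_folder() {
  local in_dir="$1" out_dir="$2"
  mkdir -p "$out_dir" || return 1
  find "$in_dir" -maxdepth 1 -type f -name '*.pdf' -print0 |
    xargs -0 -n 1 -P "$JOBS" sh -c '
      # xargs appends the PDF path as the last argument.
      out_dir=$1 dpi=$2 lang=$3 pdf=$4
      [ -n "$pdf" ] || exit 0   # nothing to do on empty input
      name=$(basename "$pdf" .pdf)
      tiff="$out_dir/$name.tiff"
      convert -density "$dpi" "$pdf" "$tiff" &&
        tesseract "$tiff" "$out_dir/$name" --dpi "$dpi" -l "$lang" pdf &&
        rm -- "$tiff"
    ' sh "$out_dir" "$DPI" "$OCR_LANG"
}

# Example, with placeholder folder names relative to the script file:
#   cd "$(dirname "$0")" && ocr_folder input output
```

Each PDF gets its own intermediate TIFF in the output folder, so the parallel processes do not overwrite each other's files; the TIFF is removed only if both steps succeed.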