OCR a PDF file with Tesseract and ImageMagick

convert -density 300 INPUTFILENAME.pdf tmp.tif && tesseract -psm 1 -l "eng" tmp.tif OUTPUTFILENAME pdf && rm tmp.tif

By: ilya
2016-07-17 21:25:54
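
The one-liner above uses the -psm spelling; newer Tesseract releases (4.x and later) document this option as --psm. Below is a minimal sketch of the same two-step pipeline as a reusable function, assuming Tesseract 4 or later and ImageMagick's convert are installed. The names ocr_pdf and DRY_RUN are made up for this illustration and are not part of either tool.

```shell
#!/bin/sh
# Sketch only: rasterize a PDF, OCR it, and write a searchable PDF.
# Assumes Tesseract 4+ (flag spelled --psm) and ImageMagick's convert.
# ocr_pdf and DRY_RUN are hypothetical names used for illustration.
ocr_pdf() {
    src=$1                   # input PDF
    base=$2                  # output base name; tesseract appends .pdf
    tmp="${base}.tmp.tif"    # intermediate multi-page TIFF
    run=${DRY_RUN:+echo}     # if DRY_RUN is set, print commands instead of running them

    $run convert -density 300 "$src" "$tmp" &&
    $run tesseract "$tmp" "$base" -l eng --psm 1 pdf &&
    $run rm -f "$tmp"
}

# Preview the commands without touching any files.
DRY_RUN=1
ocr_pdf INPUTFILENAME.pdf OUTPUTFILENAME
```

With DRY_RUN unset, the same call runs the three steps for real. Options are placed after the image and output base name, which is the order the tesseract usage text gives.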

These Might Interest You

  • Runs the identify command (from ImageMagick) on each jpg file in the current directory and returns image details according to the format parameter. The example here returns: Filename, FileSize, Compression, Width, Height. More information about the available format options can be found here: http://www.imagemagick.org/script/escape.php I usually redirect the output to a text file by adding "> listofdetails.txt" at the end. Spreadsheet magic can then be applied.


    for file in *.jpg; do identify -format '%f %b %Q %w %h\n' "$file"; done
    phattmatt · 2012-11-16 10:06:35 1
  • These are command-line invocations of ImageMagick functions. The first (listed below) resizes an image file to 40% of its original size and saves it under a different name. The second, "convert -resize 750x500 -quality 80% *.jpg", is intended to resize all jpg files in a directory to 750x500 pixels. It is such a pleasure not to need to point and click to make a bunch of thumbnails, for example.


    convert panorama_rainbow_2005.jpg -resize 40% panorama_rainbow_compress.jpg
    pcardout · 2009-02-15 08:24:50 0
  • Of course it requires the import command from the ImageMagick tools, but it's simpler to type, and ImageMagick is useful anyway.


    DISPLAY=:0.0 import -window root /tmp/shot.png
    depesz · 2010-10-28 12:00:00 0
  • In general, this is actually not better than the "scrot -d4" command I'm listing it as an alternative to, so please don't vote it down for that. I'm adding this command because xwd (X window dumper) comes with X11, so it is already installed on your machine, whereas scrot probably is not. I've found xwd handy on boxen that I don't want to (or am not allowed to) install packages on.
    NOTE: The dd junk for renaming the file is completely optional. I just did that for fun and because it's interesting that xwd embeds the window title in its metadata. I probably should have just parsed the output from file(1) instead of cutting it out with dd(1), but this was more fun and less error prone.
    NOTE2: Many programs don't know what to do with an xwd format image file. You can convert it to something normal using NetPBM's xwdtopnm(1) or ImageMagick's convert(1). For example, this would work: "xwd | convert fd:0 foo.jpg". Of course, if you have ImageMagick already installed, you'd probably use import(1) instead of xwd.
    NOTE3: Xwd files can be viewed using the X Window UnDumper: "xwud <foo.xwd". ImageMagick and The GIMP can also read .xwd files. Strangely, eog(1) cannot.
    NOTE4: The sleep is not strictly necessary; I put it in there so that one has time to raise the window above any others before clicking on it.


    sleep 4; xwd >foo.xwd; mv foo.xwd "$(dd skip=100 if=foo.xwd bs=1 count=256 2>/dev/null | egrep -ao '^[[:print:]]+' | tr / :).xwd"
    hackerb9 · 2010-09-19 08:03:02 0
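
The renaming step in the xwd command above can be tried without an X server. The sketch below assumes, as the dd call implies, that the window title is a NUL-terminated string starting at byte offset 100 of the dump. It swaps the egrep filter for a plain tr/head pair, so it is a variation on the original rather than the same pipeline. The helper name xwd_title and the stand-in file are hypothetical.

```shell
#!/bin/sh
# Sketch: pull the window title out of an xwd dump and make it file-name safe.
# Assumption: the title is a NUL-terminated string starting at byte offset 100.
# xwd_title is a hypothetical helper, not part of X11 or ImageMagick.
xwd_title() {
    dd if="$1" bs=1 skip=100 count=256 2>/dev/null |
        tr '\000' '\n' |   # turn the NUL terminator into a line break
        head -n 1 |        # keep only the first string, i.e. the title
        tr / :             # "/" cannot appear in a file name
}

# Build a stand-in file (not a real xwd image) so the helper can be tried anywhere.
demo=$(mktemp)
{
    dd if=/dev/zero bs=1 count=100 2>/dev/null   # fake 100-byte header
    printf 'xterm: ~/src\000'                    # fake title plus terminator
    dd if=/dev/zero bs=1 count=50 2>/dev/null    # fake trailing data
} > "$demo"

xwd_title "$demo"
rm -f "$demo"
```

On the stand-in file this prints "xterm: ~:src". With a real dump, the rename would be: mv foo.xwd "$(xwd_title foo.xwd).xwd".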
