PDF files are simultaneously wonderful and heinous. They are wonderful in being ubiquitous and mostly being cross platform. They are heinous in being very difficult to work with from the command line, search, grep, use only the text inside the PDF, or use outside of proprietary products. xpdf is a wonderful set of PDF tools. It is on many linux distros and can be installed on OS X. While primarily an open PDF viewer for X, xpdf has the tool "pdftotext" that can extract formated or unformatted text from inside a PDF that has text. This text stream can then be further processed by grep or other tool. The '-' after the file name directs output to stdout rather than to a text file the same name as the PDF. Make sure you use version 3.02 of pdftotext or later; earlier versions clipped lines. The lines extracted from a PDF without the "-layout" option are very long. More paragraphs. Use just to test that a pattern exists in the file. With "-layout" the output resembles the lines, but it is not perfect. xpdf is available open source at http://www.foolabs.com/xpdf/
grep pdf files easily
Any thoughts on this command? Does it work on your machine? Can you do the same thing with only 14 characters?
You must be signed in to comment.
commandlinefu.com is the place to record those command-line gems that you return to again and again. That way others can gain from your CLI wisdom and you from theirs too. All commands can be commented on, discussed and voted up or down.
Every new command is wrapped in a tweet and posted to Twitter. Following the stream is a great way of staying abreast of the latest commands. For the more discerning, there are Twitter accounts for commands that get a minimum of 3 and 10 votes - that way only the great commands get tweeted.
» http://twitter.com/commandlinefu
» http://twitter.com/commandlinefu3
» http://twitter.com/commandlinefu10
Use your favourite RSS aggregator to stay in touch with the latest commands. There are feeds mirroring the 3 Twitter streams as well as for virtually every other subset (users, tags, functions,…):
Subscribe to the feed for: