Extract raw URLs from a file

egrep -ie "<*HREF=(.*?)>" index.html | awk -F\" '{print $2}' | grep ://

By: wwest4
2009-02-05 17:51:14


What Others Think

This fails if the HREF attribute of the anchor tag uses a single quote instead of a double quote. It also fails if HREF isn't the first attribute of the tag, or if there are two spaces (or a newline) between the tag name and the attribute, and so on. It can also be done much more efficiently with a single command, e.g. perl -ne 'print "$1\n" if m,+]href=["'\''](\w+://[^"'\'']+),i' index.html
Rhomboid · 599 weeks and 5 days ago
sigh, and of course this crappy site mangled the command because it contained angle brackets.
Rhomboid · 599 weeks and 5 days ago
ok, and i guess it would be easy enough to fix the regex to match the RFC, if necessary... aside from saying that efficiency doesn't really matter here, i can't really argue that perl is better suited to the task... you should post your soln if you can figure out how.
wwest4 · 599 weeks and 5 days ago
er... can't really argue AGAINST perl being better suited.
wwest4 · 599 weeks and 5 days ago
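
A minimal shell sketch of how some of the failure cases raised in the comments above might be handled. This is not Rhomboid's perl command (which the site mangled) and not the original poster's method. The function name extract_urls is hypothetical, and the sketch assumes a grep that supports -o (GNU and BSD grep both do). It accepts single or double quotes and does not care where href sits inside the tag. It still misses unquoted attribute values and href attributes whose value is split across lines.

```shell
#!/bin/sh
# extract_urls FILE  (hypothetical helper, not part of the original post)
# Prints every absolute URL found in a quoted href attribute, one per line.
extract_urls() {
    # Stage 1: print each quoted href attribute on its own line (-o),
    # case-insensitively (-i), allowing either quote style and optional
    # whitespace around the equals sign.
    grep -oiE "href[[:space:]]*=[[:space:]]*[\"'][^\"']+[\"']" "$1" |
    # Stage 2: drop everything up to and including the opening quote,
    # then drop the closing quote, leaving only the attribute value.
    sed "s/^[^\"']*[\"']//; s/[\"']\$//" |
    # Stage 3: keep only values that look like absolute URLs.
    grep '://'
}
```

Usage would be along the lines of extract_urls index.html. Because the match targets the href attribute itself rather than the start of the tag, attribute order and extra spaces after the tag name no longer matter. For anything beyond quick one-off use, a real HTML parser is the safer choice.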
