extract links from a Google results page saved as a file

gsed -e 's/\(<\/[^>]*>\)/\1\n/g;s/\(<br>\)/\1\n/g' page2.txt | sed -n 's/.*<cite>\(.*\)<\/cite>.*/\1/p' >> output
From a saved page of Google search results, split out the links for each result. Useful as a starting point for Apache rewrite rules; a rough sketch of that follow-up step is shown below the command details.

-1 · 2009-02-06 12:45:20 · sed
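
As a follow-up, here is a minimal sketch of turning that output into Apache rewrite rules. It assumes "output" holds one absolute URL per line, and the target path (/new-location) is a placeholder you would fill in for each rule; paths containing regex metacharacters would also need escaping:

# Read each extracted URL, strip the scheme and host, and emit a RewriteRule stub.
while read -r url; do
  path=${url#*://*/}    # e.g. http://example.com/old/page.html -> old/page.html
  printf 'RewriteRule ^%s$ /new-location [R=301,L]\n' "$path"
done < output > rewrites.conf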

These Might Interest You

  • I have a remote php file that I want to run once an hour, so I set up cron to run this wget. I don't really care about what's in the response and I don't want to save it, so I use -O /dev/null to discard the output. (A sketch of the matching crontab entry appears after this list.)


    3
    wget -O /dev/null http://www.google.com
    topher1kenobe · 2009-09-10 14:43:57 2
  • Access a random news web page on the internet. The Links browser can of course be replaced by Firefox or any modern graphical web browser.


    2
    links $( a=( $( lynx -dump -listonly "http://news.google.com" | grep -Eo "(http|https)://[a-zA-Z0-9./?=_-]*" | grep -v "google.com" | sort | uniq ) ) ; amax=${#a[@]} ; n=$(( `date '+%s'` % $amax )) ; echo ${a[n]} )
    pascalv · 2016-07-26 11:52:12 3
  • Get the first 10 Google results for a query, showing only the URLs of the results. Use + to join search terms, e.g. commandlinefu+google.


    1
    gg(){ lynx -dump http://www.google.com/search?q=$@ | sed '/[0-9]*\..http:\/\/www.google.com\/search?q=related:/!d;s/...[0-9]*\..http:\/\/www.google.com\/search?q=related://;s/&hl=//';}
    chon8a · 2012-04-21 03:31:26 3
  • This will download all files of the type specified after "-A" from a website. Here is a breakdown of the options:
    -r turns on recursion and downloads all links on the page
    -l1 follows links only one level deep (this is really important when using -r)
    -H spans hosts, so it will also download links that point to other domains
    -nd puts all the downloads in the current directory instead of recreating the directory structure
    -A mp3 filters to only download links that are mp3s (this can be a comma-separated list of file formats to grab multiple types)
    -e robots=off ignores the robots.txt file, which stops programs like wget from crashing the site... sorry http://example/url lol..


    3
    wget -r -l1 -H -nd -A mp3 -e robots=off http://example/url
    trizko · 2013-07-13 02:00:23 0
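
For the first command in the list above, here is a minimal sketch of the hourly crontab entry it describes. The URL is a placeholder for the remote PHP script; -q just keeps cron from mailing wget's progress output:

# m h dom mon dow  command
0 * * * * wget -q -O /dev/null http://example.com/script.php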
