lynx --dump "http://www.google.com.br" | egrep -o "http:.*"

Get all URLs from a webpage via Regular Expression

You must have lynx installed on your computer to run this command. Usage: lynx --dump "<URL>" | egrep -o "<REGEX>". Substitute <URL> with the address of the website you want to extract the URLs from, and <REGEX> with the regular expression you want to filter them with.
Sample Output
http://www.google.com.br/imghp?hl=pt-BR&tab=wi
http://video.google.com.br/?hl=pt-BR&tab=wv
http://maps.google.com.br/maps?hl=pt-BR&tab=wl
http://news.google.com.br/nwshp?hl=pt-BR&tab=wn
http://www.orkut.com/Main?hl=pt-BR&tab=w0#Home
http://www.google.com.br/intl/pt-BR/options/
http://www.google.com.br/url?sa=p&pref=ig&pval=3&q=http://www.google.com.br/ig%3Fhl%3Dpt-BR%26source%3Diglk&usg=AFQjCNEufhwNAC9POZqcS5r7r07CUPbvAA
http://www.google.com.br/preferences?hl=pt-BR
http://www.google.com.br/
http://www.google.com.br/advanced_search?hl=pt-BR
http://www.google.com.br/language_tools?hl=pt-BR
http://www.google.com.br/intl/pt-BR/ads/
http://www.google.com.br/services/
http://www.google.com.br/intl/pt-BR/about.html
http://www.google.com/ncr
http://www.google.com.br/intl/pt-BR/privacy.html

0
2011-09-05 01:12:15
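
A minimal variant of the same command, assuming lynx is installed: the -listonly flag restricts the dump to the page's link list, and widening the pattern to https?:// catches secure URLs as well.

lynx -dump -listonly "http://www.google.com.br" | egrep -o "https?://[^ ]*"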

These Might Interest You

  • This is a slight variation of an existing submission, but uses a regular expression to look for files instead. This makes it vastly more versatile, and one can easily verify the files to be kept by running ls | egrep "[REGULAR EXPRESSION]". A usage sketch follows this entry.


    -1
    ls | egrep -v "[REGULAR EXPRESSION]" | xargs rm -v
    Saxphile · 2010-04-01 02:40:40 1
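
    A concrete sketch with a hypothetical pattern that keeps only .txt files: preview the survivors first, then delete everything else. Like most ls-to-xargs pipelines, this mishandles file names containing spaces or quotes.

    ls | egrep "\.txt$"                    # preview: files that will be KEPT
    ls | egrep -v "\.txt$" | xargs rm -v   # remove everything else, verbosely
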
  • This is useful when we don't know the exact name of the process, but do know the application name. A limitation is that the regular expression only tries to match the last part of the full command (i.e. the binary file name itself). But this is way shorter than the following one: ps axww | grep SomeCommand | awk '{ print $1 }' | xargs kill. A usage sketch follows this entry.


    0
    killall -r 'a regular expression'
    dexterhu · 2011-03-07 07:29:42 2
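
    A usage sketch with a hypothetical pattern: preview the matching processes first with pgrep -l, which lists PIDs and process names. Note that the -r flag belongs to the psmisc killall found on Linux.

    pgrep -l '^java'      # preview: PIDs and names matching the pattern
    killall -r '^java'    # kill every process whose name matches it
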
  • Place the regular expression you want to validate between the forward slashes in the eval block. A usage sketch follows this entry.


    4
    perl -we 'my $regex = eval {qr/.*/}; die "$@" if $@;'
    tlacuache · 2009-10-13 21:50:47 1
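
    A usage sketch: an invalid pattern (here, an unbalanced parenthesis) aborts with perl's regex error and a non-zero exit status, while a valid pattern exits silently with status 0.

    perl -we 'my $regex = eval {qr/(oops/}; die "$@" if $@;'     # prints the regex error
    perl -we 'my $regex = eval {qr/^[a-z]+$/}; die "$@" if $@;'  # exits quietly
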
  • in "a.html", find all images referred as relative URI in an HTML file by "src" attribute of "img" element, replace them with "data:" URI. This useful to create single HTML file holding all images in it, as a replacement of the IE-created .mht file format. The generated HTML works fine on every other browser except IE, as well as many HTML editors like kompozer, while the .mht format only works for IE, but not for every other browser. Compare to the KDE's own single-file-web-page format "war" format, which only opens correctly on KDE, the HTML file with "data:" URI is more universally supported. The above command have many bugs. My commandline-fu is too limited to fix them: 1. it assume all URLs are relative URIs, thus works in this case: <img src="images/logo.png"/> but does not work in this case: <img src="http://www.my_web_site.com/images/logo.png" /> This may not be a bug, as full URIs perhaps should be ignored in many use cases. 2. it only work for images whoes file name suffix is one of .jpg, .gif, .png, albeit images with .jpeg suffix and those without extension names at all are legal to HTML. 3. image file name is not allowed to contain "(" even though frequently used, as in "(copy of) my car.jpg". Besides, neither single nor double quotes are allowed. 4. There is infact a big flaw in this, file names are actually used as regular expression to be replaced with base64 encoded content. This cause the script to fail in many other cases. Example: 'D:\images\logo.png', where backward slash have different meaning in regular expression. I don't know how to fix this. I don't know any command that can do full text (no regular expression) replacement the way basic editors like gedit does. 5. The original a.html are not preserved, so a user should make a copy first in case things go wrong.


    4
    grep -ioE "(url\(|src=)['\"]?[^)'\"]*" a.html | grep -ioE "[^\"'(]*.(jpg|png|gif)" | while read l ; do sed -i "s>$l>data:image/${l/[^.]*./};base64,`openssl enc -base64 -in $l| tr -d '\n'`>" a.html ; done;
    zhangweiwu · 2010-05-05 14:07:51 2
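
    For reference, the loop rewrites each src value into a URI of the form data:image/png;base64,iVBORw0KGgo... (iVBORw0KGgo being the base64 form of the PNG signature bytes). A single image can be encoded by hand the same way, assuming images/logo.png exists:

    openssl enc -base64 -in images/logo.png | tr -d '\n'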

  • 0
    vim $(grep [REGULAR_EXPRESSION] -R * | cut -d":" -f1 | uniq)
    eduardostalinho · 2012-11-07 19:30:24 0
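
    A sketch of the same idea with a hypothetical pattern: grep -l prints each matching file name only once, so the cut/uniq post-processing can be dropped.

    vim $(grep -Rl "TODO" *)   # open every file under the tree that contains TODO
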
  • The -p parameter tells netstat to display the PID and name of the program to which each socket belongs, or in digestible terms, to list the programs using the net. Hope you know what the pipe symbol means! Since we presently wish to monitor only TCP connections, we ask grep to scan for the string "tcp"; from the output of grep tcp we scan further for the regular expression /[a-z]*. Wonder what that means? If we look at the output of netstat -p we can see that the name of the application is preceded by a / (try netstat -p), so, assuming the application name contains only characters a to z (usually the case), the regular expression /[a-z]* matches a string that starts with a / and continues with zero or more characters from the range a-z. A close equivalent using watch is sketched after this entry.


    -4
    while true; do netstat -p |grep "tcp"|grep --color=always "/[a-z]*";sleep 1;done
    buffer · 2009-07-16 04:52:49 4
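
    A close equivalent, assuming the watch utility is available: -n 1 re-runs the pipeline every second, and netstat's own -t flag selects TCP sockets without the first grep.

    watch -n 1 'netstat -tp'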

