Re: how to extract URLs from files?

From: laura fairhead (run_signature_script_for_my_email_at_INVALID.com)
Date: 09/14/03


Date: Sun, 14 Sep 2003 16:33:43 GMT

On Sat, 6 Sep 2003 19:38:34 +0200, "ViperDK (Daniel K.)" <ViperDK@gmx.net> wrote:

>Is there an easy way to extract URLs from files with some simple shell
>commands? I want to get all URLs from an HTML file listed. Ideally it would
>print not only the first URL found on a line but also each further
>URL on a new line.
>
>

Hi,

Here is an 'awk' script that will extract ftp and http URLs
from a text file. I'm not sure that URLs embedded in plain text
follow the same syntax as those embedded in HTML tags (it may
differ significantly because of single-quoted strings in HTML).
The script is basically a straight conversion of the BNF rules
in RFC 1738. It could easily be extended to handle most
of the other URL formats (file, gopher, mailto, news, nntp,
telnet, wais and prospero).

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BEGIN{
alpha="[a-zA-Z]"
digit="[0-9]"
alphadigit="[a-zA-Z0-9]"
hex="[0-9A-Fa-f]"
safe="[-$_.+]"
extra="[*'(),!]"
digits="[0-9][0-9]*"
escape="%"hex hex
uchar=escape "|" alphadigit "|" safe "|" extra

user="(" uchar "|[;?&=])*"
password="(" uchar "|[;?&=])*"
port=digits
# RFC 1738: a multi-character toplabel must end in an alphadigit, not "-"
toplabel="(" alpha "|" alpha "(" alphadigit "|-)*" alphadigit ")"
domainlabel="(" alphadigit "|" alphadigit "(" alphadigit "|-)*" alphadigit ")"
hostnumber=digits "\\." digits "\\." digits "\\." digits

hostname="(" "(" domainlabel "\\." ")*" toplabel ")"
host="(" hostname "|" hostnumber ")"
hostport=host "(:" port")?"
login="(" user "(:" password ")?" "@" ")?" hostport

fsegment="("uchar"|[?:@&=])*"
fpath=fsegment "(/" fsegment ")*"
ftptype="[AIDaid]"
ftpurl="ftp://" login "(/" fpath "(;type=" ftptype ")?" ")?"

search="(" uchar "|[;:@&=])*"
hsegment="(" uchar "|[;:@&=])*"
hpath=hsegment "(/" hsegment ")*"
httpurl="http://" hostport "(" "/" hpath "([?]" search ")?" ")?"

}

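{
# Search for both schemes in a single pass so that URLs are printed in the
# order they appear on the line; testing httpurl first would skip any
# ftp URL that precedes an http URL on the same line.
for(line=$0;match(line,"(" httpurl "|" ftpurl ")")>0;line=substr(line,RSTART+RLENGTH))
  {
  print substr(line,RSTART,RLENGTH)
  }
}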

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
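
As a rough illustration of how the same approach might be extended to
another scheme, here is a minimal sketch for mailto, assuming the RFC 1738
rule mailtourl = "mailto:" 1*xchar with xchar = unreserved | reserved | escape.
The function name extract_mailto is made up for this example and is not
part of the script above. The awk program is wrapped in a shell function,
so the single quote from "extra" is written as the octal escape \047.

```shell
# Hypothetical helper (not from the original post): print every mailto URL
# found on standard input or in the named files, one per line.
extract_mailto() {
  awk '
  BEGIN {
    hex        = "[0-9A-Fa-f]"
    escape     = "%" hex hex
    # unreserved = alpha | digit | safe | extra
    # \047 is the single quote; "-" comes last so it is taken literally
    unreserved = "[a-zA-Z0-9$_.+*\047(),!-]"
    reserved   = "[;/?:@&=]"
    xchar      = "(" unreserved "|" reserved "|" escape ")"
    # encoded822addr = 1*xchar, i.e. one or more xchar
    mailtourl  = "mailto:" xchar "+"
  }
  {
    # Same loop shape as the main script: take the leftmost match,
    # print it, then continue with the rest of the line.
    for (line = $0; match(line, mailtourl) > 0; line = substr(line, RSTART + RLENGTH))
      print substr(line, RSTART, RLENGTH)
  }' "$@"
}
```

It could be called as "extract_mailto page.html" or at the end of a pipe.
Because the grammar allows characters such as "." and ",", punctuation
that directly follows an address in running text will be included in
the match.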

byefornow
l

-- 
echo alru_aafriehdab@ittnreen.tocm |sed 's/\(.\)\(.\)/\2\1/g'

