Re: how to extract URLs from files?
From: laura fairhead (run_signature_script_for_my_email_at_INVALID.com)
Date: 09/14/03
- Next message: dan: "bash working with filenames that need escaping."
- Previous message: Alan Connor: "Re: Bash script to see if PPP link is up..."
- In reply to: ViperDK \(Daniel K.\): "how to extract URLs from files?"
- Next in thread: Stephane CHAZELAS: "Re: how to extract URLs from files?"
- Reply: Stephane CHAZELAS: "Re: how to extract URLs from files?"
- Messages sorted by: [ date ] [ thread ] [ subject ] [ author ]
Date: Sun, 14 Sep 2003 16:33:43 GMT
On Sat, 6 Sep 2003 19:38:34 +0200, "ViperDK \(Daniel K.\)" <ViperDK@gmx.net> wrote:
>is there a easy way to extract urls from files with some simple shell
>commands? i want to get all urls from a html file listet. best would be if
>it not only prints out the first url found on the line but print any further
>url on a new line.
>
>
Hi,
Here is an 'awk' script that will extract ftp and http URLs
from a text file. I'm not sure that URLs embedded in plain text
follow the same syntax as those embedded in HTML tags (it may
differ significantly, because attribute values in HTML can be
single-quoted strings, and "'" is itself a legal URL character
in the rules below).
This is basically a straight conversion of the BNF rules
in RFC1738. It could easily be extended to deal with most
of the other URL formats (file, gopher, mailto, news, nntp,
telnet, wais & prospero).
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
BEGIN{
alpha="[a-zA-Z]"
digit="[0-9]"
alphadigit="[a-zA-Z0-9]"
hex="[0-9A-Fa-f]"
safe="[-$_.+]"
extra="[*'(),!]"
digits="[0-9][0-9]*"
escape="%"hex hex
uchar=escape "|" alphadigit "|" safe "|" extra
user="(" uchar "|[;?&=])*"
password="(" uchar "|[;?&=])*"
port=digits
# RFC1738: toplabel = alpha | alpha *[ alphadigit | "-" ] alphadigit
toplabel="(" alpha "|" alpha "(" alphadigit "|-)*" alphadigit ")"
domainlabel="(" alphadigit "|" alphadigit "(" alphadigit "|-)*" alphadigit ")"
hostnumber=digits "\\." digits "\\." digits "\\." digits
hostname="(" "(" domainlabel "\\." ")*" toplabel ")"
host="(" hostname "|" hostnumber ")"
hostport=host "(:" port")?"
login="(" user "(:" password ")?" "@" ")?" hostport
fsegment="("uchar"|[?:@&=])*"
fpath=fsegment "(/" fsegment ")*"
ftptype="[AIDaid]"
ftpurl="ftp://" login "(/" fpath "(;type=" ftptype ")?" ")?"
search="(" uchar "|[;:@&=])*"
hsegment="(" uchar "|[;:@&=])*"
hpath=hsegment "(/" hsegment ")*"
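httpurl="http://" hostport "(" "/" hpath "([?]" search ")?" ")?"
# Either scheme in one pattern, so the leftmost URL on a line is always
# found first. Trying http before ftp would skip any ftp URL that
# precedes an http URL on the same line.
url="(" httpurl "|" ftpurl ")"
}
{
# Work on a copy of the line; print each URL, then continue after it
for(s=$0;match(s,url)>0;s=substr(s,RSTART+RLENGTH))
{
print substr(s,RSTART,RLENGTH)
}
}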
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
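To see the match()/substr() loop on its own, here is a minimal sketch wrapped in a shell function. The function name extract_urls and the example hosts are made up for illustration, and the pattern is a deliberately loose stand-in, not the RFC1738 grammar above: it simply stops at whitespace, quotes and angle brackets, which is one possible way to cope with URLs inside HTML attributes.

```shell
# Hypothetical helper: print every http/ftp URL found in the input,
# one per line. The pattern is an assumed simplification, NOT RFC1738.
extract_urls() {
    awk '
    # \047 is the octal escape for a single quote
    BEGIN { url = "(http|ftp)://[^ \t\"\047<>]+" }
    {
        s = $0                                  # work on a copy of the line
        while (match(s, url) > 0) {
            print substr(s, RSTART, RLENGTH)    # the URL just found
            s = substr(s, RSTART + RLENGTH)     # continue after it
        }
    }' "$@"
}

printf '%s\n' 'see ftp://a.example/x and <a href="http://b.example/y">y</a>' |
    extract_urls
```

Because the pattern stops at a quote, a URL that legitimately contains "'" would be cut short; that is the trade-off against the stricter rules in the script above.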
byefornow
l
-- echo alru_aafriehdab@ittnreen.tocm |sed 's/\(.\)\(.\)/\2\1/g'