Re: Extracting html links from text



On Tuesday 24 February 2009 21:55, admin@xxxxxxxxxx wrote:

<span class="Apple-style-span"  style="font-size:small;">: </span></
span></span></span><span style="font-style: itali
c;"><span style="font-weight: bold;"><a href="http://www.4shared.com/
file/66558677/7c74f732
/N.html" target="_self"><span class="Apple-style-span"  style="font-
family:verdana;"><span class="Apple-style-span"  style="font-
size:small;">DOWNLOAD</span></span></a><span class="Apple-style-span"
style="font-family:verdana;"><span class="Apple-style-span"
style="font-size:
small;"><br /><br />4)A </span></span></span></span><span style="font-
style: italic;"><span style="font-weight: bold;"><span class="Apple-
style-span"  style="font-family:verdana;"><span class="Apple-style-
span"  style="font-size:small;">English<br />Movie</span></span></
span></span><span class="Apple-style-span"  style="font-
family:verdana;"><span class="Apple-style-span"  style="font-size:
small;">: </span></span><span style="font-style: italic;"><span
style="font-weight: bold;"><span class="Apple-style-span"  style="font-
family:verdana;"><span class="Apple-style-span"  style="font-
size:small;">B<br />English: </span></span><a href="http://www.
4shared.com/file/28821701
/7c575b14/B.html" target="_self"><span class="Apple-style-span"
style="font-family:verdana;"><span class="Apple-style-span"
style="font-size:small;">DOWNLOAD</span></span></a><span class="Apple-
style-span"  style="font-family:verdana;"><span class="Apple-style-
span"  style="font-size:small;"><br /><br />5)E<br />Movie : P<br
/>English : </span></span><a href="http://www.4shared.com/

file/67065459/e90e54d3/E.html" target="_self"><span class="Apple-style-
span"  style="font-family:verdana;"><span class="Apple-style-span"
style="font-size:small;">DOWNLOAD</span></span>

I need to extract all html links that start with http and end with
html. A link can span multiple lines. The expected output is:

http://www.4shared.com/file/66558677/7c74f732/N.htmlhttp://www.4share...

Thanx

-Ad

This GNU awk one-liner will work on your sample input:

gawk -F\" -v RS="<a href=\"" 'NR>1{gsub(/[[:space:]]/,""); print $1}' file

but if the input is anything more complicated, take a look at XMLawk.

Ed.

We don't have gawk installed. Anything that works with the regular
awk, perl, sed, etc. would be fine.

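You can exploit Perl's non-greedy quantifiers together with its ability to
"pull out" matching subexpressions. Because the URLs in the sample wrap across
lines, strip the whitespace from each match before printing:

perl -n0e '@m = m|<a href="(http://.*?\.html)"|gs; s/\s+//g, print "$_\n" for @m' file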
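
If neither gawk nor Perl is available, a pipeline of standard tools may be
enough for input like the sample. The following is only a sketch, not a
general HTML parser: the function name extract_links is made up for
illustration, and it assumes that every link sits inside double quotes and
that deleting all whitespace from the input is harmless. It removes the
whitespace (which rejoins the wrapped URLs), puts each double-quoted string
on its own line, and keeps the lines that start with http:// and end with
.html.

```shell
# extract_links: hypothetical helper, reads HTML on stdin and prints one URL per line.
# Assumption: URLs are double-quoted and contain no meaningful whitespace.
extract_links() {
    tr -d '[:space:]' |           # drop spaces and newlines, rejoining wrapped URLs
    tr '"' '\n' |                 # put every double-quoted string on its own line
    grep '^http://.*\.html$'      # keep only strings that look like http://....html
}
```

Run it as "extract_links < file". Unlike the awk and Perl versions it does not
check that the string follows <a href=, so any quoted string of the form
http://....html would be printed as well.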
