Re: Sed: removing XML headers



On 28 Mar, 19:08, Janis Papanagnou <Janis_Papanag...@xxxxxxxxxxx> wrote:
bruce_phi...@xxxxxxxxxxx wrote:
I am trying to concatenate several XML files (test01.xml, test02.xml,
test03.xml) into a single XML file.

cat test*.xml > out.xml

concatenates the files into one big file.

But the resulting XML file is invalid because it contains several XML
header and DOCTYPE declarations within the document:
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Frameset//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-frameset.dtd">

If it is always the first two lines that you have to skip, you may use

awk 'FNR==NR||FNR>2' test*.xml

If you want to match against the specific patterns (assuming that
the xml header patterns don't span multiple lines), use

awk 'FNR==NR||!(/<!DOCTYPE/||/<\?xml version/)' test*.xml

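The FNR==NR part ensures that one header (that of the first file) remains
included: FNR restarts at 1 for each input file while NR keeps counting,
so the two are equal only while the first file is being read.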
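
To see the FNR==NR behaviour on something small, here is a minimal sketch. The two sample files, their contents and the temporary directory are made up for illustration, and it assumes the header really does occupy exactly the first two lines of each file.

```shell
# Build two tiny sample files in a scratch directory (hypothetical content).
dir=$(mktemp -d)
printf '%s\n' '<?xml version="1.0"?>' '<!DOCTYPE html>' '<a/>' > "$dir/test01.xml"
printf '%s\n' '<?xml version="1.0"?>' '<!DOCTYPE html>' '<b/>' > "$dir/test02.xml"

# FNR==NR is true for every line of the first file, so it is printed whole.
# For later files only lines after the second one (FNR>2) are printed.
merged=$(awk 'FNR==NR||FNR>2' "$dir"/test*.xml)
printf '%s\n' "$merged"

rm -rf "$dir"
```

The merged text should contain the header of test01.xml once, followed by the bodies of both files.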

Janis



How can I use sed to remove the XML headers within the output file?

Why sed?



Thanks for all the replies.
The problem seems to be that the XML is not line-based. It all wraps
into one continuous stream.
I think sed is line-based.
So maybe I should consider other alternatives...
Bruce
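
For what it is worth, sed being line-based need not rule it out here: a substitution removes only the matched text, not the whole line, so it can still work when a file is one long line. Below is a rough sketch, not a tested solution for real data. The function name merge_xml is made up, and it assumes that each declaration sits on a single line and that the DOCTYPE contains no ">" before its closing one (so no internal subset).

```shell
# merge_xml FILE...: print the first file unchanged, then the remaining
# files with their XML declaration and DOCTYPE stripped out.
# (merge_xml is a hypothetical helper name, not an existing tool.)
merge_xml() {
    cat "$1"
    shift
    for f in "$@"; do
        # In a basic regular expression "?" is a literal question mark.
        # [^>]* stops at the first ">", so only the declaration is removed.
        sed -e 's/<?xml[^>]*?>//' -e 's/<!DOCTYPE[^>]*>//' "$f"
    done
}

# Possible usage with the file names from the original post:
#   merge_xml test*.xml > out.xml
```

If the declarations can wrap across lines, this will miss them, and a tool that reads the whole file at once would be a better fit.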

