Hi Peter,
<snip>
Thanks all,
I think I agree.
Concatenating 2 regexes should suffice.
But it would be nice to tuck away a more arcane single expression for
future reference.
</snip>
With the greatest of respect - I think you are missing the point made by Scott.
I am sure you know, but I'll say it anyway (for the sake of the archive) that XML has some reserved characters because they form part of the language. These are: <, >, ", and ' (less than, greater than, quote, and apostrophe). To ensure a parser knows that one of these characters is part of the data and not part of the structure of the XML, you have to escape it.
The escaped entities are written as follows:
less than: &lt;
greater than: &gt;
quote: &quot;
apostrophe: &apos;
These are pre-defined entity references and are part of the language. Every entity reference is wrapped in a leading & and a trailing ;
This leads inevitably to another issue - the ampersand is now reserved and part of the language because it denotes the start of an escaped entity reference. This means we have to escape the ampersand!
ampersand: &amp;
So there are 5 pre-defined entity references in XML.
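For the archive, the five replacements above can be sketched in a few lines. This is only a minimal illustration in Python (my choice of language, nothing to do with the original thread), using the standard library's xml.sax.saxutils module. The helper names escape_xml_text and unescape_xml_text are made up for the example.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > on its own; the quote and apostrophe
# have to be supplied as extra mappings.
QUOTE_ENTITIES = {'"': "&quot;", "'": "&apos;"}
QUOTE_CHARACTERS = {"&quot;": '"', "&apos;": "'"}


def escape_xml_text(raw):
    """Replace the five reserved characters with their pre-defined entity references."""
    return escape(raw, QUOTE_ENTITIES)


def unescape_xml_text(escaped):
    """Reverse escape_xml_text for the five pre-defined entity references."""
    return unescape(escaped, QUOTE_CHARACTERS)


sample = 'Tom & "Jerry" <cat>'
escaped = escape_xml_text(sample)
print(escaped)
print(unescape_xml_text(escaped) == sample)
```

Note that the ampersand has to be handled first when escaping (and last when unescaping), otherwise the & introduced by the other replacements would itself get escaped. The library functions take care of that ordering.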
The point Scott was clearly making is that well-formed XML can easily contain all of these escaped entity references. In fact, well-formed XML should NEVER contain one of the five reserved characters as data unless it is escaped in the usual way. Further, XML doesn't restrict itself to the 5 pre-defined entity references. You can also write a Unicode numeric character reference of the form &#nnnn; (nnnn is the code point in decimal form) or &#xhhhh; (hhhh is the code point in hexadecimal form). This allows you to place any valid Unicode character in your text-based XML document and not worry about it breaking the parser.
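To make the numeric forms concrete, here is a small sketch, again in Python and again only an illustration (the sample document is invented). It feeds the decimal and hexadecimal references to the standard library's ElementTree parser and shows that both resolve to the same character as the literal one.

```python
import xml.etree.ElementTree as ET

# The copyright sign (U+00A9) written three ways: a decimal reference,
# a hexadecimal reference, and the literal character. The document also
# uses three of the pre-defined entity references.
doc = "<note>&#169; &#xA9; \u00a9 and &lt;b&gt; &amp; co</note>"

# The parser resolves every reference before handing the text back.
text = ET.fromstring(doc).text
print(text)
```

The parser does the work here: no hand-written pattern matching is needed to tell a decimal reference from a hexadecimal one or from a pre-defined entity.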
This is all standard, and as an example I believe this is all catered for when using %XML-SAX. The pre-defined entity references cause a *XML_PREDEF_REF event to fire and any other character reference will cause a *XML_UNKNOWN_REF event to fire.
We haven't even got into DTD entity declarations (internal and external). Construction of entity replacement text alone is extremely complicated because what you do can be governed by a DTD and its hierarchical handling of declarations! Believe me - you really don't want to go down the road of custom parsing. The XML 1.0 spec is pretty broad and you find yourself adding one fix after another. Get the XML fixed and use a standard W3C-compliant parser. :-)
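As a rough illustration of "get the XML fixed", the sketch below (Python, with a made-up helper name is_well_formed and invented sample data) shows a standard parser rejecting a document with a bare ampersand and accepting the same document once the ampersand is escaped at the source. It is a check, not a repair tool.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if the standard parser accepts the document, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


broken = "<name>Tom & Jerry</name>"      # bare ampersand: not well formed
fixed = "<name>Tom &amp; Jerry</name>"   # escaped by whoever produced the XML

print(is_well_formed(broken))
print(is_well_formed(fixed))
```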
Please see:
http://www.w3.org/TR/REC-xml/ section 4.1 to see the details of the XML 1.0 specification and how it relates to character references.
Cheers
Larry Ducie