



Hi Peter,

<snip>
Thanks all,
I think I agree.
Concatenating 2 regexes should suffice.
But it would be nice to tuck away a more arcane single expression for
future reference.
</snip>

With the greatest of respect - I think you are missing the point made by Scott.

I am sure you know, but I'll say it anyway (for the sake of the archive): XML has some reserved characters because they form part of the language. These are <, >, ", and ' (that is, less than, greater than, quote, and apostrophe). To ensure a parser knows that one of these characters is part of the data and not part of the structure of the XML, you have to escape it.

The escaped entities are written as follows:

less than: &lt;
greater than: &gt;
quote: &quot;
apostrophe: &apos;

These are pre-defined entity references and are part of the language. Every entity reference, pre-defined or not, starts with a leading & and ends with a trailing ;

This leads inevitably to another issue - the ampersand is now reserved and part of the language because it denotes the start of an escaped entity reference. This means we have to escape the ampersand!

ampersand: &amp;

So there are 5 pre-defined entity references in XML.
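
To make the five escapes concrete, here is a minimal sketch in Python (my choice of language for illustration, not something from the original thread). It uses the standard library's xml.sax.saxutils, which escapes &, < and > by default; the quote and apostrophe mappings are passed in explicitly. The helper names escape_all and unescape_all are hypothetical.

```python
from xml.sax.saxutils import escape, unescape

# escape() handles &, < and > on its own; quote and apostrophe are opt-in.
EXTRA_ESCAPES = {'"': "&quot;", "'": "&apos;"}
EXTRA_UNESCAPES = {"&quot;": '"', "&apos;": "'"}


def escape_all(text):
    """Escape all five reserved characters (hypothetical helper)."""
    return escape(text, EXTRA_ESCAPES)


def unescape_all(text):
    """Reverse escape_all (hypothetical helper)."""
    return unescape(text, EXTRA_UNESCAPES)


raw = 'Tom & "Jerry" <cat> it\'s'
encoded = escape_all(raw)
print(encoded)
print(unescape_all(encoded) == raw)
```

Note that the ampersand is escaped first and unescaped last. Doing it in any other order would corrupt the other four references, which is exactly the kind of trap a hand-rolled regex tends to fall into.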

The point Scott was clearly making is that well-formed XML can easily contain all of these escaped entity references. In fact, well-formed XML should NEVER contain any of the five reserved characters as data unless it is escaped in the usual way. Further, XML doesn't restrict itself to the 5 pre-defined entity references. You can also write any character as a Unicode numeric character reference of the form &#nnnn; (nnnn is the code point in decimal) or &#xhhhh; (hhhh is the code point in hexadecimal). This allows you to place any valid Unicode character in your text-based XML document and not worry about it breaking the parser.
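
As a rough illustration of numeric character references, the sketch below (Python, using the standard xml.etree.ElementTree parser) shows decimal and hexadecimal references resolving to the same character. The helper names to_decimal_ref and to_hex_ref are hypothetical and exist only for this example.

```python
import xml.etree.ElementTree as ET


def to_decimal_ref(ch):
    """Build a decimal reference, e.g. &#233; (hypothetical helper)."""
    return f"&#{ord(ch)};"


def to_hex_ref(ch):
    """Build a hexadecimal reference, e.g. &#xE9; (hypothetical helper)."""
    return f"&#x{ord(ch):X};"


# Decimal, hex, and zero-padded hex forms of the same code point (U+00E9),
# mixed with two pre-defined entity references.
doc = "<name>Caf&#233; &#xE9;t&#x00E9; &lt;ok&gt;</name>"
text = ET.fromstring(doc).text

print(text)
print(to_decimal_ref("é"), to_hex_ref("é"))
```

The parser hands back plain text with every reference already resolved, so the calling code never needs to know which form the document author used.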

This is all standard, and as an example I believe this is all catered for when using %XML-SAX. The pre-defined entity references cause a *XML_PREDEF_REF event to fire and any other character reference will cause a *XML_UNKNOWN_REF event to fire.

We haven't even got into DTD entity declarations (internal and external). Construction of entity replacement text alone is extremely complicated, because what you do can be governed by a DTD and its hierarchical handling of declarations! Believe me - you really don't want to go down the road of custom parsing. The XML 1.0 spec is pretty broad, and you will find yourself adding one fix after another. Get the XML fixed and use a standard W3C-compliant parser. :-)
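
To give a feel for what a standard parser does for you, here is a small sketch (Python again, standard library only) with an entity declared in an internal DTD subset. The entity name "site" and the sample document are made up for illustration, and is_well_formed is a hypothetical helper. This only covers a simple internal entity; external entities and parameter entities are a much bigger subject.

```python
import xml.etree.ElementTree as ET

# A document that declares its own entity in the internal DTD subset.
DOC = """<!DOCTYPE note [
  <!ENTITY site "midrange.com">
]>
<note>Posted on &site; &amp; archived</note>"""


def is_well_formed(xml_source):
    """Return True if the parser accepts the document (hypothetical helper)."""
    try:
        ET.fromstring(xml_source)
        return True
    except ET.ParseError:
        return False


# The parser expands both the declared entity and the pre-defined one.
resolved = ET.fromstring(DOC).text
print(resolved)

# Broken input is rejected rather than silently patched up.
print(is_well_formed("<note>a & b</note>"))
print(is_well_formed("<note>&undeclared;</note>"))
print(is_well_formed("<note>a &amp; b</note>"))
```

A regex that only knows about the five pre-defined references would have no idea what to do with &site;, whereas the parser handles it and also tells you when the document is simply broken.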

Please see: http://www.w3.org/TR/REC-xml/ section 4.1 to see the details of the XML 1.0 specification and how it relates to character references.

Cheers
Larry Ducie




This mailing list archive is Copyright 1997-2025 by midrange.com and David Gibbs as a compilation work. Use of the archive is restricted to research of a business or technical nature. Any other uses are prohibited. Full details are available on our policy page. If you have questions about this, please contact [javascript protected email address].
