Monday, January 22, 2007

A Typical Anti-pattern in Using SAX

I recently ran into a small problem while parsing XML documents with SAX. As is well known, the SAX handler interface has three main callbacks for elements: startElement, endElement, and characters. The first two are called when the parser encounters a start tag and an end tag, while the third handles the character data between tags. The problem lies in the contract of characters(char[] ch, int start, int length). The SAX 2.0 specification states that a parser is not required to deliver all the text between two corresponding tags in a single call to this method; it may split contiguous character data into several chunks.
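To make the pitfall concrete, here is a minimal sketch of the anti-pattern. The class names NaiveTitleHandler and NaiveDemo are made up for illustration, and the two characters() calls are issued by hand to simulate a parser that splits "Hello World" into two chunks; a real parser decides for itself where, and whether, to split.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler showing the anti-pattern: it assumes that
// characters() is called exactly once per element.
class NaiveTitleHandler extends DefaultHandler {
    String title; // overwritten on every characters() call

    @Override
    public void characters(char[] ch, int start, int length) {
        title = new String(ch, start, length);
    }
}

class NaiveDemo {
    // Simulates a parser delivering "Hello World" in two chunks.
    static String run() {
        NaiveTitleHandler handler = new NaiveTitleHandler();
        char[] first = "Hello ".toCharArray();
        char[] second = "World".toCharArray();
        handler.characters(first, 0, first.length);
        handler.characters(second, 0, second.length);
        return handler.title; // only the last chunk survives: "World"
    }
}
```

With this handler the first chunk is silently lost, which is why the bug tends to show up only on some inputs or with some parsers.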

So to get the text between tags, buffer every chunk across what may be several calls to characters(), and read the buffer only in endElement. In addition, use a boolean guard flag so that you collect only the characters belonging to the element you care about. Otherwise the whitespace between tags, such as "\n", is collected as well.
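The buffering and the guard flag can be sketched as follows. This is only one possible shape, not a definitive implementation: the class name TitleHandler, the element name "title", and the parse helper are assumptions made for the example, and it uses the default (non-namespace-aware) JAXP parser, so the element name arrives in qName.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects the text of every <title> element.
class TitleHandler extends DefaultHandler {
    final List<String> titles = new ArrayList<>();
    private final StringBuilder buffer = new StringBuilder();
    private boolean inTitle = false; // guard: true only between <title> and </title>

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if ("title".equals(qName)) {
            inTitle = true;
            buffer.setLength(0); // start a fresh buffer for this element
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (inTitle) {
            buffer.append(ch, start, length); // may run several times per element
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if ("title".equals(qName)) {
            titles.add(buffer.toString()); // the text is complete only here
            inTitle = false;
        }
    }

    // Convenience helper (an assumption for this sketch): parses an XML string.
    static List<String> parse(String xml) {
        try {
            TitleHandler handler = new TitleHandler();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.titles;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Calling TitleHandler.parse on a document such as "<books>\n<book><title>SAX &amp; DOM</title></book>\n</books>" yields the single string "SAX & DOM", however the parser chooses to chunk the text around the entity reference, and the newlines between the other tags are ignored because the guard is false there. The sketch does not handle child elements nested inside title; that would need a depth counter or a stack instead of a single boolean.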