Using Xerces, Java, and JDK internal classes for catalog validation

If you’re validating your XML (and if you work a lot with XML, you probably are or should be), catalog files can be indispensable. Among other things, they let you redirect the resolution of URIs and public/system identifiers to a location other than the one specified in the document. This is very useful when you need to cache resources locally, substitute your own schema, or just work behind a firewall.

I’ll skip the details of what catalogs can do and the best way to use them; there are lots of other resources for that on and off the net. However, if you’re trying to implement catalog validation in your own application, and you decide to use either Xerces or the copy of Xerces bundled with Java, there are some things you might find useful to know:

1) Xerces does not seem to like catalog files that have a doctype declaration pointing to the OASIS DTD. I haven’t looked deeply enough into the Catalog class to be sure, but I think the cause is this: although the SAXParser used internally there is set to non-validating, it still tries to resolve the doctype declaration by going out to the OASIS site to fetch the DTD rather than ignoring it entirely.

The nasty part, when I hit this while implementing catalog validation in my app, was that the catalog parsing failed silently. No error message was provided, and I couldn’t see anything on the internal debug object; the catalog just wasn’t showing up and URIs weren’t resolving. I only figured it out through some step-through tracing, trying to resolve URIs, and experimenting with using the Catalog class directly instead of going through the XMLCatalogResolver class.

Below is the doctype declaration that is recommended for catalog files, but which you *should not* add to a catalog file that you’re using with Xerces.

<!DOCTYPE catalog
PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"
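
One way to check the suspicion above without touching the network is a small probe like the following. It is only a sketch using the standard JAXP/SAX API, not the Catalog class itself: it parses a document with a doctype declaration using a non-validating parser and records every external identifier the parser asks for. The class name DtdFetchProbe and the example.invalid identifiers are made-up placeholders, not the real OASIS ones.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

class DtdFetchProbe {
    // Placeholder public/system identifiers (assumptions for illustration only).
    static final String SAMPLE =
        "<!DOCTYPE catalog PUBLIC \"-//EXAMPLE//DTD Placeholder//EN\" "
        + "\"http://example.invalid/catalog.dtd\">"
        + "<catalog/>";

    // Parses the XML with a non-validating parser and returns the system IDs
    // the parser tried to resolve along the way.
    static List<String> requestedSystemIds(String xml) {
        List<String> requested = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setEntityResolver((publicId, systemId) -> {
                requested.add(systemId);
                // Hand back an empty DTD so nothing is fetched remotely.
                return new InputSource(new StringReader(""));
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Probe parse failed", e);
        }
        return requested;
    }

    public static void main(String[] args) {
        System.out.println("External DTD requests: " + requestedSystemIds(SAMPLE));
    }
}
```

If the returned list contains the DTD’s system ID, the parser is asking for the external DTD even though validation is off, which would be consistent with the behaviour described above.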

2) You can still do catalog validation in a library-restricted environment (such as Ant, when you can’t set the Ant classpath or call out to a separate Java process whose classpath you control) by using the copy of Xerces that Sun bundles in the Java internal classes, but you need to use some Java internal classes to do so. This is a risk tradeoff, but given the general stability of the Java internal copy of Xerces, it is probably acceptable in some environments and situations.

For the purposes of this post, let’s assume a very simple parser and catalog example:

        XMLReader reader = XMLReaderFactory.createXMLReader("");
        reader.setFeature("", true);

        String[] catalogs = {catalogFile.toURI().toString()};

        // Create catalog resolver and set a catalog list.
        XMLCatalogResolver resolver = new XMLCatalogResolver();
        resolver.setCatalogList(catalogs);

        // Register the resolver with the parser, then check a lookup by hand.
        reader.setProperty("", resolver);
        System.out.println("Resolved URI: " + resolver.resolveURI(""));

        reader.parse(new InputSource(new FileInputStream(xmlFile)));

There are two specific things happening here that are different from the canonical Xerces example:

– Make sure that you explicitly ask for the Java internal class when you create your XML parser. Unless you do this, you can’t be sure that the parser you get is the one you need, and other parsers either won’t accept the entity-resolver property or won’t take the Java internal version of XMLCatalogResolver. For a SAX parser (which is the appropriate parser for validation most of the time), you’ll want to specify this:
XMLReader parser = XMLReaderFactory.createXMLReader("");

– Instead of using the Xerces class org.apache.xerces.util.XMLCatalogResolver, you need to use the Java internal class

My understanding is that it should be possible to call parser.setEntityResolver(resolver) instead of setting the property on the parser. When I tried this in unit tests, though, I couldn’t make it work, and since setting it as a property did work, I didn’t drill down to figure out why (at least not on the client’s time!).
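
If the internal-class risk is too much for your situation, there is another route worth knowing about. It is not the approach described above: Java 9 and later ship a public javax.xml.catalog package, and the sketch below uses it to look up a system identifier in a catalog file. The class name CatalogLookupSketch, the example.invalid identifier, and the local.dtd target are all made up for illustration, and the catalog is written without a doctype declaration.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.xml.catalog.Catalog;
import javax.xml.catalog.CatalogFeatures;
import javax.xml.catalog.CatalogManager;

class CatalogLookupSketch {
    // Placeholder remote identifier (an assumption for illustration only).
    static final String REMOTE_ID = "http://example.invalid/dtds/sample.dtd";

    // Writes a one-entry catalog to a temp directory and returns the URI that
    // the catalog maps the given system ID to, or null if there is no match.
    static String lookup(String systemId) {
        try {
            Path dir = Files.createTempDirectory("catalog-sketch");
            Path catalogFile = dir.resolve("catalog.xml");
            String xml =
                "<?xml version=\"1.0\"?>\n"
                + "<catalog xmlns=\"urn:oasis:names:tc:entity:xmlns:xml:catalog\">\n"
                + "  <system systemId=\"" + REMOTE_ID + "\" uri=\"local.dtd\"/>\n"
                + "</catalog>\n";
            Files.write(catalogFile, xml.getBytes(StandardCharsets.UTF_8));

            // The relative uri attribute is resolved against the catalog file;
            // the lookup itself does not open the target file.
            Catalog catalog =
                CatalogManager.catalog(CatalogFeatures.defaults(), catalogFile.toUri());
            return catalog.matchSystem(systemId);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Mapped to: " + lookup(REMOTE_ID));
    }
}
```

The same package also has CatalogManager.catalogResolver(...), which returns a CatalogResolver that implements org.xml.sax.EntityResolver and so can be passed to setEntityResolver on an XMLReader. I have not compared its behaviour against the internal XMLCatalogResolver, so treat it as something to try rather than a drop-in replacement.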
