XSLT gotchas: generate-id() collisions

Mixing ID generation schemes always calls for care, but XSLT has a particular gotcha that you might run into if you’ve got both externally-accessible cross-references that you need to propagate and internally-generated cross-references that you use to make things like indexes and TOCs work.

More likely than not, you’re using the generate-id() function to get unique IDs for those index and TOC links. You’re also probably using author-maintained (or at least authoring-environment-maintained) ID attributes for the internal cross-references and external link destinations. However, an ID is an ID, and even if you’re using separate strategies to deal with them, if they’re in the same XML document you should make sure that your code checks for collisions between those author-managed IDs and the IDs that the generate-id() function is producing.

The XSLT specification does actually call this out: “There is no guarantee that a generated unique identifier will be distinct from any unique IDs specified in the source document.” In my experience it’s rare that you have to deal with this, but it does happen occasionally. In my particular case, it came up when working with a source document that had been generated by the same XSLT processor I was using for the transform. The processor in question (Saxon 6.5.5) appears to derive its IDs exclusively from the structure of the document being transformed; it doesn’t appear to salt them in any way. The result was collisions between IDs already present in the source document and IDs that the generate-id() function produced for different elements.

The solution for me was to make sure to use a pseudorandom salt for any IDs generated with generate-id(); this was pretty easy, since I was already in the process of replacing all of the generate-id() calls in the system with a custom function anyway (which was necessitated by requirements that aren’t relevant here).
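
For illustration, here is a rough Java sketch of the salting idea. It is not the custom function I actually used; the class name, the ID format, and the idea of passing in the set of source IDs up front are all assumptions made for the example. It picks one pseudorandom salt per run and refuses to hand out any ID that is already taken.

```java
import java.security.SecureRandom;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical helper: issues salted IDs that avoid the IDs already in the source. */
final class SaltedIdGenerator {
    private final Set<String> usedIds;
    private final String salt;
    private int counter = 0;

    SaltedIdGenerator(Set<String> sourceIds) {
        // Copy the author-managed IDs so generated IDs can be checked against them.
        this.usedIds = new HashSet<>(sourceIds);
        // One pseudorandom salt per transform run (assumed granularity).
        this.salt = Long.toHexString(new SecureRandom().nextLong());
    }

    String nextId() {
        String id;
        do {
            // The "id" prefix keeps the value a legal XML name.
            id = "id" + salt + "-" + (counter++);
        } while (!usedIds.add(id)); // add() returns false if the ID is already taken
        return id;
    }
}
```

The salt makes an accidental clash unlikely, and the set lookup catches the clash if it happens anyway.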

A lighter note…

…but not a lighter solution:

FizzBuzz Enterprise Edition

The package structure alone is enough to give anybody who’s worked with over-architected “enterprise grade solutions” a giggle, or at least a grim chuckle.

Anticipation is the enemy of appropriate solutions. I think that it’s often more harmful to do premature abstraction and optimization than it is to do them after they’re needed. It is – as with so many software design decisions – a judgement call based on everything that you know about the project. A brilliant developer I used to work with (John Heintz) expressed it as a rule of thumb:

  1. Make it work.
  2. Make it right.
  3. Make it fast.

Agile development means cycling through each of these phases, doing each of them as they’re needed and not one iteration before if you can help it.

Using Xerces, Java, and JDK internal classes for catalog validation

If you’re validating your XML, and if you work a lot with XML you probably are or should be, catalog files can be indispensable. Among other things, they allow you to locally redirect URI and public/system URN resolution to somewhere other than a path that’s specified in the document. When you need to locally cache resources, substitute your own schema, or just work behind a firewall, this is a very useful feature.
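
To make the redirection idea concrete, here is a minimal sketch that uses a plain SAX EntityResolver with a hard-coded map instead of a real catalog. It is only a stand-in for what a catalog does; the class name and the URLs in the usage note are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

/** Hypothetical stand-in for a catalog: maps system IDs to local copies. */
final class MapEntityResolver implements EntityResolver {
    private final Map<String, String> systemIdToLocalUri = new HashMap<>();

    void redirect(String systemId, String localUri) {
        systemIdToLocalUri.put(systemId, localUri);
    }

    @Override
    public InputSource resolveEntity(String publicId, String systemId) {
        String local = systemIdToLocalUri.get(systemId);
        // Returning null tells the parser to fall back to its default behaviour,
        // which normally means fetching the system ID exactly as written.
        return local == null ? null : new InputSource(local);
    }
}
```

You would register a mapping such as redirect("http://www.example.org/dtd/test.dtd", "file:///tmp/test.dtd") and hand the resolver to the parser with setEntityResolver(). A real catalog adds public ID matching, URI rewriting, and delegation on top of this.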

I’ll skip the details of what catalogs can do and the best way to use them here; there are lots of other resources for that on and off the net. However, if you’re trying to implement catalog validation in your own application, and you decide to use either Xerces or the Java default Xerces copy, there are some things that you might find it useful to know:

1) Xerces does not seem to like dealing with catalog files that have a doctype declaration pointing to the OASIS DTD. I haven’t looked deeply enough into the Catalog class to be sure of this, but I think the cause is that the SAXParser used internally there, although set to non-validating, still tries to resolve the doctype declaration by going out to the OASIS site to fetch the DTD rather than ignoring it entirely.

The nasty thing about this problem that I saw when implementing catalog validation in my app was that the catalog parsing was failing silently. No error message was provided, and I couldn’t see anything on the internal debug object; the catalog just wasn’t showing up and URIs weren’t resolving. It was only through some step-through tracing, trying to resolve URIs, and experimenting with directly using the Catalog class instead of going through the XMLCatalogResolver class that I was able to figure this out.
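
The suspected mechanism is easy to probe outside of the catalog code. The sketch below is a hedged illustration, not a look inside the Catalog class: it parses a document with a standard JAXP parser that has validation switched off and records which system IDs the parser asks to have resolved. The class name and the DTD URL in the usage note are hypothetical, and other parser implementations may behave differently.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;

final class DtdFetchProbe {
    /** Parses xml with validation off and returns the system IDs the parser requested. */
    static List<String> requestedSystemIds(String xml) {
        List<String> requested = new ArrayList<>();
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setValidating(false);
            XMLReader reader = factory.newSAXParser().getXMLReader();
            reader.setEntityResolver((publicId, systemId) -> {
                requested.add(systemId);
                // Hand back an empty DTD so nothing is fetched over the network.
                return new InputSource(new StringReader(""));
            });
            reader.parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalStateException("Parse failed", e);
        }
        return requested;
    }
}
```

Feeding it a document whose doctype points at something like http://example.invalid/catalog.dtd should show that URL in the returned list on the JDK’s bundled parser, even though validation is off. That is the same kind of request that goes out to the network when no resolver is in the way.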

Below is the doctype declaration that is recommended for catalog files, but which you *should not* add to a catalog file that you’re using with Xerces.

<!DOCTYPE catalog
PUBLIC "-//OASIS//DTD Entity Resolution XML Catalog V1.0//EN"

2) You can still do catalog validation in a library-restricted environment (such as Ant, when you’re prevented from setting the Ant classpath or calling out to a different Java process where you can set the classpath independently) using the copy of Xerces that Sun puts in the Java internal class set, but you need to use some Java internal classes to do so.  This presents a risk tradeoff, but given the general stability of the Java internal copy of Xerces, this is probably acceptable in some environments and situations.

For the purposes of this post, let’s assume a very simple parser and catalog example:

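import java.io.FileInputStream;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLReaderFactory;
import com.sun.org.apache.xerces.internal.util.XMLCatalogResolver;

        // catalogFile and xmlFile are assumed to be java.io.File objects defined elsewhere.
        // Explicitly request the Java internal copy of the Xerces SAX parser.
        XMLReader reader = XMLReaderFactory.createXMLReader("com.sun.org.apache.xerces.internal.parsers.SAXParser");
        reader.setFeature("http://apache.org/xml/features/validation/schema", true);

        String[] catalogs = {catalogFile.toURI().toString()};

        // Create catalog resolver and set a catalog list.
        XMLCatalogResolver resolver = new XMLCatalogResolver();
        resolver.setCatalogList(catalogs);

        // Register the resolver as a parser property.
        reader.setProperty("http://apache.org/xml/properties/internal/entity-resolver", resolver);
        System.out.println("Resolved URI: " + resolver.resolveURI("http://www.example.org/xsd/test"));

        reader.parse(new InputSource(new FileInputStream(xmlFile)));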

There are two specific things happening here that are different from the canonical Xerces example:

– Make sure that you explicitly call for the Java internal class when you’re creating your XML parser.  Unless you do this, you can’t be sure that the parser you’re getting will be the one you need, and other parsers either won’t accept the entity-resolver property or won’t take the Java internal version of XMLCatalogResolver.  For a SAX parser (which is the appropriate parser for validation most of the time), you’ll want to specify this:
XMLReader parser = XMLReaderFactory.createXMLReader("com.sun.org.apache.xerces.internal.parsers.SAXParser");

– Instead of using the Xerces class org.apache.xerces.util.XMLCatalogResolver, you need to use the Java internal class com.sun.org.apache.xerces.internal.util.XMLCatalogResolver.

My understanding is that it should be possible to use parser.setEntityResolver(resolver) instead of setting the property on the parser, but when I tried to do this via unit testing, I wasn’t able to make it work. Since setting it as a property did work, I didn’t try to drill down to figure out why (at least not on the client’s time!).

WSDL and the little things

While helping somebody work out some XML/XSD/WSDL problems, I ran across this:


Among other problems, there are 30+ occurrences of an attribute named ninOccurs (presumably a typo for minOccurs) in that WSDL file, and those date back over a year across multiple bug fixes. And that’s why you should validate your XML and use a tool that helps you find these kinds of problems before checking them in. My friend eventually got the WSDL working with his stub generator after hacking it up enough, but this is not the kind of thing that inspires confidence in a tool or organization.
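
As a sketch of how cheap that kind of check can be, the snippet below compiles a schema with the standard javax.xml.validation API and reports the first error. It assumes the schema has already been pulled out of the WSDL into standalone XSD text; the class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.SchemaFactory;
import org.xml.sax.SAXException;

final class SchemaSanityCheck {
    /** Returns null if the XSD compiles, otherwise the first error message. */
    static String firstSchemaError(String xsd) {
        SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
        try {
            // Compiling the schema checks it against the rules for schema documents,
            // which is where a misspelled attribute on xs:element gets rejected.
            factory.newSchema(new StreamSource(new StringReader(xsd)));
            return null;
        } catch (SAXException e) {
            return e.getMessage();
        }
    }
}
```

Run against a schema whose xs:element carries ninOccurs instead of minOccurs, this should come back with an error rather than null. Wiring something like it into the build would have flagged the problem long before a year of bug fixes piled up on top of it.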

Tips for dealing with Lync

One of my clients requires me to use Lync for communicating within the company, but won’t provide me with a copy of it unless I install their entire operating image. This is a problem for various reasons which I don’t want to get into, but the result is that everything works fine except for not being able to get a copy of Lync to use with their system. As a result, it’s necessary to use the evaluation version of the Lync client (http://technet.microsoft.com/en-us/lync/gg236589.aspx). This works great until the evaluation expires in 180 days. How to work around that? Well, if I were a little less pressed for time I could check the registry entries added during installation and see which is responsible for setting the evaluation period. However, at the moment I’ve got better things to do, and a little bit of searching the lazyweb turned up this thread:


The last comment says to uninstall Lync 2010, re-install Office Communicator 2007 (also the trial version), uninstall that, then reinstall Lync 2010. Voila, your trial period is reset!

Eternal questions: spaces vs. tabs

Here are some thoughts on an aspect of coding that can be frustrating and annoying all out of proportion to its importance: tabs vs. spaces.

It’s one of those issues that coders often don’t think about much until they’re trying to work with code from a project that either didn’t share their philosophy or has already had a lot of code in conflicting styles added. IDEs can aggravate this: auto-formatting is so powerful and convenient that formatting often isn’t thought about until late in the project, when mixed styles have already started creeping in.

The solution, of course, is project-wide code style conventions from the very beginning.  But which convention to use?  In an ideal world, the solution would be tabs for indentation, spaces for alignment.  Tabs for indentation allows for every developer to set their own indent level – no more arguments about how two spaces per level looks cramped, or four spaces per level wastes space.  Spaces for alignment allows code to be attractively formatted independently of tab size.  It’s the best solution by far, and yet it rarely works out in the long term.  There are some reasons for this:

  1. Too few developers think about code formatting at all, much less the details of code beautification.
  2. The tabs for indentation and spaces for alignment convention isn’t perfectly supported by all automatic formatting tools.  Because not nearly enough attention is paid to formatting in the first place, an automatic formatting convention is the best hope for many projects to keep code uniformly readable.
  3. On long-term projects, developer churn should be assumed.  Sooner or later (hopefully later, but you never know), there will be coders who don’t meet your exacting standards for software development skills.  If a project is really a survivor, the odds are good that there will be people with commit access who have very little development experience.  When the XML config file or that one piece of scripting is being edited in Notepad, you can pretty much forget consistency.

So, back to the root dichotomy: tabs or spaces?  To which I say: spaces only and forever.  The reasons why are all the reasons listed above; spaces require no thought or planning on the part of the developer, “convert everything to spaces” is an easy rule to implement in all IDE formatters (and Ctl-Shift-O before check-in is as easy a rule as it gets), and spaces are hard for even a monkey with Notepad and commit access to seriously screw up.  If it were me, writing code that would never be seen by anybody else, I’d choose tabs for indentation and spaces for alignment, but in a business environment that’s not something that should ever happen – you write the code, and format it, for the client and the developers who are to come after you, not for yourself.
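
For the files that never pass through an IDE formatter, the “convert everything to spaces” rule is small enough to script. This is a minimal sketch with a made-up class name; it only touches leading whitespace and assumes fixed tab stops every width columns.

```java
final class IndentNormalizer {
    /** Expands tabs in the leading whitespace of a line to the next tab stop. */
    static String expandLeadingTabs(String line, int width) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < line.length() && (line.charAt(i) == '\t' || line.charAt(i) == ' ')) {
            if (line.charAt(i) == '\t') {
                // Pad with at least one space, up to the next multiple of width.
                do {
                    out.append(' ');
                } while (out.length() % width != 0);
            } else {
                out.append(' ');
            }
            i++;
        }
        // Everything after the indentation is copied unchanged, tabs included.
        return out.append(line, i, line.length()).toString();
    }
}
```

Tabs after the first non-whitespace character are left alone on purpose, since they might be meaningful inside string literals or data.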

XML parsing vulnerability

This is a little bit late to the game, but remotely exploitable vulnerabilities aren’t the kind of security I’ve had to worry about much in the past six months. That said, back in August an advisory for a lot of common XML parsers was published by CERT:

CERT-FI Advisory on XML libraries
CVE database entry

Freenet has claimed that this is a potential remote execution vulnerability in Java with sploits in the wild, but the CERT advisory and the various bugs submitted to individual projects seem to think it’s only a DoS bug, and that seems a lot more likely in my judgement.

Line Breaking

[originally posted 4/14/2009]

Some useful links from yesterday about line breaking:

XSL-FO is good for a lot of things, but there’s a lot that’s outside the scope of pure FO. Line breaks, for example. Deliberate line breaks are easy enough, using the element, but what about line breaks that occur in the natural course of flowing a page? FO doesn’t handle that; it’s left as an exercise for whoever implements the page flow portion of the FO renderer. Unicode provides us with some help here:

The Unicode line breaking specification: http://www.unicode.org/reports/tr14/tr14-19.html

A list of line break characters: http://www.unicode.org/Public/5.0.0/ucd/LineBreak.txt

Even applications that ‘follow the spec’ have a significant amount of latitude in implementation though, which makes line breaking a little bit different with each page flow engine you use. If you’ve got specific concerns (say, about breaking after a solidus in URLs vs. in dates), it’s wise to test them before making implementation choices and see if they can be configured to your needs. Some formatting tools will allow you to configure the line break rules, but some won’t.
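
One cheap way to see that latitude in action is to ask a line breaker where it thinks the break opportunities are. The sketch below uses Java’s java.text.BreakIterator, which is just one implementation and not necessarily what any given FO renderer does; the class and method names are made up for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class LineBreakProbe {
    /** Returns the offsets at which this JDK would allow a line break in text. */
    static List<Integer> breakOpportunities(String text) {
        BreakIterator breaks = BreakIterator.getLineInstance(Locale.ROOT);
        breaks.setText(text);
        List<Integer> offsets = new ArrayList<>();
        breaks.first(); // skip offset 0, which is never an interesting break
        for (int pos = breaks.next(); pos != BreakIterator.DONE; pos = breaks.next()) {
            offsets.add(pos);
        }
        return offsets;
    }
}
```

Running it on a string that contains both a URL and a date, and comparing the offsets against what your formatter actually produces, is a quick way to find out whether the solidus cases are going to need configuration.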

To configure line breaking in XSL Formatter, use the axf:line-break attribute, along with the various include/append dependent attributes (information here: http://www.antennahouse.com/xslfo/axf5-extension.htm#axf.line-break).

More notes on XSL Formatter

[originally posted 4/11/2009]

More notes on XSL Formatter:

A common way of producing PostScript output via XSL Formatter is to output via the “Adobe PDF” printer (if you have Acrobat installed, of course), which in most contexts also means the Windows PostScript printer driver. So if you know a little bit but not a lot about how that works, you might wonder how they get bookmarks to work when using that output method, since printers shouldn’t understand anything about purely screen features like bookmarks. Unknown to me (until Friday, that is) was that it’s possible to pass PostScript directly through to the output using Pscript_Win_PassThrough. XSL Formatter uses this to output pdfmarks in the right places, which then results in printed PostScript that distills to a PDF with bookmarks in it. I haven’t dug into the APIs yet because I haven’t needed to, so Pscript_Win_PassThrough might not be the correct function name; that’s just how it shows up in the PS output.

Despite having and using this mechanism, neither the “Adobe PDF” printer nor the internal PostScript creator (designated by using “@PS” as the name of the printer to use) will output tagged figures. Only the internal PDF creator (“@PDF” as the name of the printer) is able to do that. I’m not sure if this is a design issue in the product itself, or just a feature that hasn’t been implemented yet. Time will tell.

XSL Formatter - more on graphics

[originally posted 4/8/2009]

Nope, EPS cannot be inlined – or at least, it cannot be inlined with any degree of consistency.

SVG, on the other hand, is handled quite well, as long as fonts are converted to outlines rather than trying to embed them. That’s fine for our purposes; now the problem is finding an EPS to SVG converter that can be used for batch conversion, and that will handle all of the EPS format variations that Distiller can…