Extracting a catalogue of element names from a collection of XML documents using XSLT 2.0
We are trying to build a single stylesheet that works with the documents of two independent journals. To get a sense of the work involved, we wanted to create a catalogue of all the elements used in the published articles. This means loading whole directories of files as input, then extracting and sorting the element names across all of them.
Here’s the stylesheet that did it for us. It is probably not maximally optimised, but it currently does what we need. Any suggestions for improvements would be gratefully received.
Some notes:
- Our goal was to pre-build some templates for use in a stylesheet, so we formatted the element names as XSL templates.
- Although you need to run this stylesheet against an input document, that document is not actually transformed (the files we extract the element names from are loaded using the collection() function). So it doesn’t matter what the input document is, as long as it is well-formed XML (we used the stylesheet itself).
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <!-- Text output, because we construct the ready-made templates for each element as literal text -->
  <xsl:output method="text"/>

  <!-- For pretty printing -->
  <xsl:variable name="newline">
    <xsl:text>&#10;</xsl:text>
  </xsl:variable>

  <!-- Load the files in the relevant directories. The ?select=...;recurse=yes query string is processor-specific (this is the form Saxon accepts) -->
  <xsl:variable name="allFiles" select="collection(iri-to-uri('file:///some/path/to/the/directories?select=*.xml;recurse=yes'))"/>

  <!-- Dump their content into a single big pile -->
  <xsl:variable name="everything">
    <xsl:copy-of select="$allFiles"/>
  </xsl:variable>

  <!-- Build a key for all elements using their name -->
  <xsl:key name="elements" match="*" use="name()"/>

  <!-- Match the root node of the input document (since the files we are actually working on have been loaded using the collection() function, nothing is actually going to happen to this node) -->
  <xsl:template match="/">

    <!-- This is the information required to turn the output into an XSL stylesheet itself -->
    <xsl:text>&lt;xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0"&gt;</xsl:text>
    <xsl:value-of select="$newline"/>
    <xsl:text>&lt;!--Summary of Elements --&gt;</xsl:text>
    <xsl:value-of select="$newline"/>
    <xsl:value-of select="$newline"/>

    <!-- This switches the context to the combined content of all the files for further processing -->
    <xsl:for-each select="$everything">

      <!-- Select only the first element carrying each name (the first node returned by the key) -->
      <xsl:for-each select="//*[generate-id(.)=generate-id(key('elements',name())[1])]">

        <!-- Sort them -->
        <xsl:sort select="name()"/>

        <xsl:for-each select="key('elements', name())">

          <!-- This makes sure that only the first instance of each element name is output -->
          <xsl:if test="position()=1">
            <xsl:text>&lt;xsl:template match="</xsl:text>
            <xsl:value-of select="name()"/>
            <xsl:text>"&gt;</xsl:text>
            <xsl:value-of select="$newline"/>
            <xsl:text>&lt;!--</xsl:text>
            <!-- This counts all occurrences of the element name -->
            <xsl:value-of select="count(//*[name()=name(current())])"/>
            <xsl:text> occurrences</xsl:text>
            <xsl:text>--&gt;</xsl:text>
            <xsl:value-of select="$newline"/>
            <xsl:text>&lt;/xsl:template&gt;</xsl:text>
            <xsl:value-of select="$newline"/>
            <xsl:value-of select="$newline"/>
          </xsl:if>
        </xsl:for-each>
      </xsl:for-each>
    </xsl:for-each>

    <xsl:value-of select="$newline"/>
    <xsl:text>&lt;/xsl:stylesheet&gt;</xsl:text>
  </xsl:template>
</xsl:stylesheet>
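One possible simplification, offered as a minimal sketch and not as the stylesheet we actually ran: XSLT 2.0's xsl:for-each-group can do the grouping and the counting in a single pass, without the key or the intermediate copy of all the files. The sketch below assumes the same placeholder path as above and a processor that accepts the Saxon-style ?select=...;recurse=yes query on collection(). It prints only each element name and its count, separated by a tab; the template wrapper text could be added in the same way as in the stylesheet above.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0">

  <xsl:output method="text"/>

  <!-- Same placeholder path as above; the query-string syntax is an
       assumption about the processor (it is the form Saxon accepts) -->
  <xsl:variable name="allFiles"
    select="collection(iri-to-uri('file:///some/path/to/the/directories?select=*.xml;recurse=yes'))"/>

  <!-- As before, the input document itself is not transformed -->
  <xsl:template match="/">

    <!-- Group every element in every loaded document by its name -->
    <xsl:for-each-group select="$allFiles//*" group-by="name()">

      <!-- Sort the groups alphabetically by element name -->
      <xsl:sort select="current-grouping-key()"/>

      <!-- One line per distinct name: the name, a tab, the number of occurrences -->
      <xsl:value-of select="current-grouping-key()"/>
      <xsl:text>&#9;</xsl:text>
      <xsl:value-of select="count(current-group())"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>
```

Because current-group() already holds every element sharing the current name, the count no longer needs a second scan of the whole tree for each name, which should matter on large collections.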