Chapter 8. Using XPath

General Comments about XPath in PDFUnit

Using XPath to evaluate parts of a PDF document opens a wider range of testing capabilities than an API alone can provide.

Several chapters in this manual describe XPath tests. This chapter gives an overview and refers to the chapters that cover each topic in detail.

<!-- Overview of XPath-related test facilities: -->

<hasBookmarks><matchingXPath />...     3.4: “Bookmarks and Named Destinations” 
<hasFields><matchingXPath />...        3.10: “Form Fields” 
<hasFonts><matchingXPath />...         3.9: “Fonts” 
<hasSignatures><matchingXPath />...    3.23: “Signatures and Certificates” 
<hasXFAData><matchingXPath />...       3.30: “XFA Data” 
<hasXMPData><matchingXPath />...       3.31: “XMP Data” 

<!-- Comparing two documents using XPath: --> 
<haveSameXFAData><matchingXPath />...  4.18: “Comparing XFA Data” 
<haveSameXMPData><matchingXPath />...  4.19: “Comparing XMP Data” 

Using XMLUnit for Comparison

PDFUnit uses XMLUnit internally to compare XML structures (http://xmlunit.sourceforge.net). Comparisons therefore follow the rules of XML rather than plain text equality, for example:

  • The order of attributes doesn't matter.

  • Whitespace between element nodes is ignored.

Further rules of Canonical XML are described on Wikipedia (http://de.wikipedia.org/wiki/Canonical_XML).

The general configuration of XMLUnit is documented on the project site http://xmlunit.sourceforge.net/userguide/html/index.html#Configuring%20XMLUnit. PDFUnit uses the following:

XMLUnit.setXSLTVersion("2.0");
XMLUnit.setNormalizeWhitespace(true);
XMLUnit.setIgnoreWhitespace(true);
XMLUnit.setIgnoreAttributeOrder(true);
XMLUnit.setIgnoreComments(true);
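
To make the effect of these settings more concrete, the following sketch reproduces three of the rules (attribute order, whitespace between elements, comments) with JDK classes only. It is not XMLUnit and not PDFUnit's implementation; the class name LenientXmlCompareSketch and the sample font elements are assumptions made for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: a JDK-only approximation of a lenient XML comparison.
class LenientXmlCompareSketch {

    // Parses the XML, drops comments and removes whitespace-only text nodes.
    static Document parse(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            factory.setIgnoringComments(true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(xml)));
            stripBlankText(doc.getDocumentElement());
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException("Cannot parse XML", e);
        }
    }

    // Recursively removes text nodes that contain nothing but whitespace.
    static void stripBlankText(Node node) {
        Node child = node.getFirstChild();
        while (child != null) {
            Node next = child.getNextSibling();
            boolean blankText = child.getNodeType() == Node.TEXT_NODE
                    && child.getNodeValue().trim().isEmpty();
            if (blankText) {
                node.removeChild(child);
            } else {
                stripBlankText(child);
            }
            child = next;
        }
    }

    // DOM node equality compares attributes by name, so their order is irrelevant.
    static boolean sameContent(String left, String right) {
        return parse(left).getDocumentElement()
                          .isEqualNode(parse(right).getDocumentElement());
    }

    public static void main(String[] args) {
        String compact = "<font name=\"Arial\" embedded=\"true\"><size>12</size></font>";
        String pretty  = "<font embedded=\"true\" name=\"Arial\">\n"
                       + "  <!-- a comment -->\n"
                       + "  <size>12</size>\n"
                       + "</font>";
        System.out.println(sameContent(compact, pretty));
    }
}
```

The two sample documents differ only in attribute order, indentation and a comment, so the sketch treats them as equal. A changed attribute value or element content still counts as a difference.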

Extract Data as XML

PDFUnit provides utility programs for all parts of a PDF document which can be tested using XML/XPath. They extract the information into XML files:

// Utilities to extract XML from PDF:

com.pdfunit.tools.ExtractBookmarks
com.pdfunit.tools.ExtractFieldsInfo
com.pdfunit.tools.ExtractFontsInfo
com.pdfunit.tools.ExtractSignaturesInfo
com.pdfunit.tools.ExtractXFAData
com.pdfunit.tools.ExtractXMPData

The utilities are described in chapter 9.1: “Common Remarks for all Utilities”.

Namespaces with Prefix

A namespace with an existing prefix will be detected automatically by PDFUnit. This applies to both XML files and PDF-internal XML data.

Default Namespace

The default namespace is not detected automatically because the XML standard allows namespaces to be defined multiple times in an XML document. A default namespace therefore has to be declared explicitly in the test, and the XPath expression has to address it through a prefix:

<!--
  The default namespace has to be declared, 
  but any alias can be used for it.
-->
<testcase name="hasXFAData_UsingDefaultNamespace">
  <assertThat testDocument="xfa/xfa-enabled.pdf">
    <hasXFAData>
      <withNode tag="foo:log/foo:to" 
                value="memory" 
                defaultNamespace="http://www.xfa.org/schema/xci/2.6/" 
      />
    </hasXFAData>
  </assertThat>
</testcase>

Note that the prefix foo in this example is an arbitrary alias for the default namespace. In real projects, use one meaningful prefix consistently instead of a placeholder such as foo or bar.
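
The need for a prefix comes from XPath itself and can be reproduced with the XPath engine of the JDK. The following sketch shows that an unprefixed path does not match elements in a default namespace, while a prefixed path bound to the same URI does. The class name DefaultNamespaceSketch and the sample XML are assumptions; only the namespace URI is taken from the test above.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of PDFUnit.
class DefaultNamespaceSketch {

    static final String NS = "http://www.xfa.org/schema/xci/2.6/";

    // Assumed sample data: both elements live in the default namespace.
    static final String XML = "<log xmlns=\"" + NS + "\"><to>memory</to></log>";

    // Evaluates the expression as a string; binds 'prefix' to NS if a prefix is given.
    static String evaluate(String expression, String prefix) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            Document doc = factory.newDocumentBuilder()
                                  .parse(new InputSource(new StringReader(XML)));
            XPath xpath = XPathFactory.newInstance().newXPath();
            if (prefix != null) {
                xpath.setNamespaceContext(new NamespaceContext() {
                    public String getNamespaceURI(String p) {
                        return prefix.equals(p) ? NS : XMLConstants.NULL_NS_URI;
                    }
                    public String getPrefix(String uri) {
                        return NS.equals(uri) ? prefix : null;
                    }
                    public Iterator<String> getPrefixes(String uri) {
                        return NS.equals(uri)
                                ? Collections.singletonList(prefix).iterator()
                                : Collections.<String>emptyList().iterator();
                    }
                });
            }
            return xpath.evaluate(expression, doc);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        // Unprefixed names refer to "no namespace" and find nothing.
        System.out.println("[" + evaluate("/log/to", null) + "]");
        // Any alias works as long as it is bound to the namespace URI.
        System.out.println("[" + evaluate("/foo:log/foo:to", "foo") + "]");
    }
}
```

The alias is chosen by the test, not by the document, which is why any prefix can be used as long as it is bound to the correct namespace URI.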

XPath Result Types

Depending on the expression, the evaluation of an XPath expression yields a result of a different type. The expected result type has to be declared when comparing XFA or XMP data from two PDF documents. The available result types are defined as constants for the attribute withResultType.

<!-- Result types for XPath-processing:  -->

withResultType="BOOLEAN"
withResultType="NUMBER"
withResultType="NODE"
withResultType="NODESET"
withResultType="STRING"

Tests with the expected result type BOOLEAN are problematic because XPath cannot distinguish between "not found" and "false". If possible, use a different XPath expression with a different result type.
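
The ambiguity can be reproduced with the XPath engine of the JDK. In the following sketch, the same boolean expression gives the same answer for a document that contains the value false and for a document in which the node is missing, whereas a STRING or NUMBER result separates the two cases. The class name BooleanResultSketch and the element names doc and signed are assumptions for illustration; the sketch does not show how PDFUnit evaluates expressions internally.

```java
import java.io.StringReader;
import javax.xml.namespace.QName;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of PDFUnit.
class BooleanResultSketch {

    // Assumed sample data: one document with the value "false", one without the node.
    static final String WITH_FALSE = "<doc><signed>false</signed></doc>";
    static final String MISSING    = "<doc/>";

    // Evaluates an expression with the requested JDK result type.
    static Object evaluate(String xml, String expression, QName resultType) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, resultType);
        } catch (Exception e) {
            throw new IllegalStateException("XPath evaluation failed", e);
        }
    }

    public static void main(String[] args) {
        String test = "/doc/signed = 'true'";
        // BOOLEAN: both documents produce the same answer.
        System.out.println(evaluate(WITH_FALSE, test, XPathConstants.BOOLEAN));
        System.out.println(evaluate(MISSING, test, XPathConstants.BOOLEAN));
        // STRING: the missing node shows up as an empty string.
        System.out.println("[" + evaluate(WITH_FALSE, "/doc/signed", XPathConstants.STRING) + "]");
        System.out.println("[" + evaluate(MISSING, "/doc/signed", XPathConstants.STRING) + "]");
        // NUMBER: counting the nodes reveals whether the node exists at all.
        System.out.println(evaluate(WITH_FALSE, "count(/doc/signed)", XPathConstants.NUMBER));
        System.out.println(evaluate(MISSING, "count(/doc/signed)", XPathConstants.NUMBER));
    }
}
```

Comparing the string value or the node count is one way to rephrase a boolean check so that a missing node no longer looks like the value false.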

XPath Compatibility

XPath expressions can use all of XPath’s syntax elements and functions. However, the set of available features depends on the version of the XPath engine. PDFUnit uses the XPath engine of your JDK, so your JDK version determines the level of compatibility with the XPath standard.
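
Whether a particular function is available can be checked directly against the JDK before it is used in a test. The following sketch tries to evaluate an expression with the default XPath engine and reports a failure as null. The class name XPathVersionSketch and the helper tryEvaluate are assumptions. translate() belongs to XPath 1.0, lower-case() was introduced with XPath 2.0, and a JDK whose built-in engine implements only XPath 1.0 is expected to reject the latter.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;

// Hypothetical probe class, not part of PDFUnit.
class XPathVersionSketch {

    // Returns the string result, or null if the engine rejects the expression.
    static String tryEvaluate(String expression) {
        try {
            // An empty document serves as context for context-free expressions.
            Document empty = DocumentBuilderFactory.newInstance()
                                                   .newDocumentBuilder()
                                                   .newDocument();
            XPath xpath = XPathFactory.newInstance().newXPath();
            return xpath.evaluate(expression, empty);
        } catch (Exception e) {
            // Unknown functions and syntax errors end up here.
            return null;
        }
    }

    public static void main(String[] args) {
        // XPath 1.0 function, available in every JDK XPath engine.
        System.out.println("translate:  " + tryEvaluate("translate('PDF', 'PDF', 'pdf')"));
        // XPath 2.0 function; the outcome depends on the engine of the running JDK.
        System.out.println("lower-case: " + tryEvaluate("lower-case('PDF')"));
    }
}
```

If the second line prints null on your JDK, restrict the expressions in your tests to XPath 1.0 functions.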