Chapter 7. Unicode

PDF Documents Containing Unicode

Would the tests described so far also run with content that is not ISO-8859-1, for example with Russian, Greek or Chinese text?

A difficult question. A lot of internal tests are done with Greek, Russian and Chinese documents, but tests are missing for Hebrew and Japanese documents. All in all it is not 100% clear that every available test will work with every language, but it should.

When you need to process Unicode data, it is good practice to configure all your tools to UTF-8.

The following hints may solve problems not only when working with UTF-8 files under PDFUnit. They may also be helpful in other situations.

Single Unicode Characters

Metadata and keywords can contain Unicode characters. If your operating system does not support fonts for foreign languages, you can use Unicode escape sequences in the format \uXXXX within strings. For example the copyright character © has the Unicode sequence \u00A9:

<testcase name="hasProducer_CopyrightAsUnicode">
  <assertThat testDocument="unicode/unicode_producer.pdf">
    <hasProducer>
                              <!-- 'copyright' -->
      <matchingComplete>txt2pdf v7.3 \u00A9 SANFACE Software 2004</matchingComplete>
    </hasProducer>
  </assertThat>
</testcase>

Longer Unicode Text

It would be too difficult to figure out the hex code for all characters of a longer text. Therefore PDFUnit provides the small utility ConvertUnicodeToHex. Pass the foreign text as a string to the tool, run the program and place the generated hex code into your test. Detailed information can be found in chapter 9.2: “Convert Unicode Text into Hex Code”. A test with a longer sequence may look like this:

<testcase name="hasSubject_Greek">
  <assertThat testDocument="unicode/unicode_subject.pdf">
    <hasSubject>
      <matchingComplete>
        Εργαστήριο Μηχανικής ΙΙ ΤΕΙ ΠΕΙΡΑΙΑ / Μηχανολόγοι       1
      </matchingComplete>
    </hasSubject>
  </assertThat>
</testcase>

1

If you don't see Greek text here, then your presentation system (PDF, eBook or HTML) does not support the required Unicode font.

Unicode Content Compared with XML Files

XML and XPath based tests use XML files, which might contain Unicode data, e.g. the bookmarks extracted from the following document:

<!-- 
  This test needs the following setting before starting ANT:
  set JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8                 1
-->

<testcase name="hasBookmarks_MatchingXML">
  <assertThat testDocument="unicode/unicode_bookmarks.pdf">
    <hasBookmarks>
      <matchingXML file="unicode/unicode_bookmarks.xml" />   2
    </hasBookmarks>
  </assertThat>
</testcase>

1

The codepage can be set using the environment variable file.encoding.

2

The bookmarks were exported to XML by the utility ExtractBookmarks.

Using Unicode within XPath Expressions

The chapter 8: “Using XPath” describes how to use XPath in PDFUnit tests. You can also use Unicode sequences in XPath expressions:

<testcase name="hasBookmarks_MatchingXPath">
  <assertThat testDocument="unicode/unicode_bookmarks.pdf">
    <hasBookmarks>
      <!-- The line is wrapped for printing: -->
      <matchingXPath expr="//Title[@Action][.='\u00D1\u00EE\u00E4
                                               \u00E5p\u00E6\u00E0
                                               \u00ED\u00E8\u00E5']" />
    </hasBookmarks>
  </assertThat>
</testcase>

File Encoding UTF-8 for Shell Scripts

Just like any Java program that processes files, PDFUnit depends on the environment variable file.encoding which can be set in the following ways:

set _JAVA_OPTIONS=-Dfile.encoding=UTF8
set _JAVA_OPTIONS=-Dfile.encoding=UTF-8

java -Dfile.encoding=UTF8
java -Dfile.encoding=UTF-8

File Encoding UTF-8 for ANT

During the development of PDFUnit there were two tests which ran successfully under Eclipse, but failed with ANT due to the current encoding.

The following command did not solve the encoding problem:

// does not work for ANT:

ant -Dfile.encoding=UTF-8

Instead, the property had to be set using JAVA_TOOLS_OPTIONS:

// Used when developing PDFUnit:

set JAVA_TOOL_OPTIONS=-Dfile.encoding=UTF-8

Configure Eclipse to UTF-8

When working with XML files in Eclipse, you do not need to configure Eclipse for UTF-8, because that is the default for XML files. But the default encoding for other file types is the encoding of the file system. So it is recommended to set the encoding for the entire workspace to UTF-8:

This default can be changed for each file.

Unicode in Error Messages

If tests of Unicode content fail, the error message may be presented incorrectly in Eclipse or in a browser. Again the file encoding is responsible for this behaviour. Configuring ANT to UTF-8 should solve all your problems. Only characters from the encoding UTF-16 may corrupt the presentation of the error message.

The PDF document in the next example includes a layer name containing UTF-16BE characters. To show the impact of Unicode characters in error messages, the expected layer name in the test is intentionally incorrect to produce an error message:

<!--
  The name of the layers consists of UTF-16BE and contains the 
  byte order mark (BOM). The error message is not complete. 
  It was corrupted by the internal Null-bytes.

  Adobe Reader® shows: "Ebene 1(4)"
  The used String is_: "Ebene _XXX"
-->

<testcase name="hasLayer_NameContainingUnicode_UTF16_ErrorIntended"
          errorExpected="YES"
>
  <assertThat testDocument="unicode/unicode_layerName.pdf">
    <hasLayer>
      <withName>
        <matchingComplete>
          \u00fe\u00ff\u0000E\u0000b\u0000e\u0000n\u0000e\u0000 \u0000_XXX
        </matchingComplete>
      </withName>
    </hasLayer>
  </assertThat>
</testcase>

When the tests were executed with ANT, a browser shows the complete error message including the trailing string þÿEbene _XXX:

Unicode for invisible Characters - &nbsp;

A problem can occur due to a non-breaking space. Because at first it looks like a normal space, the comparison with a space fails. But when using the Unicode sequence of the non-breaking space (\u00A0) the test runs successfully. Here's the test:

<!--
  The content of the node value terminates with the 
  Unicode value 'non-breaking space'.
-->

<testcase name="nodeValueWithUnicodeValue">
  <assertThat testDocument="xfa/xfaBasicToggle.pdf">
    <hasXFAData>
      <!-- The line is wrapped for printing: -->
      <withNode tag="default:p[7]" 
                value="The code for creating the toggle behavior involves  
                       switching the border between raised and lowered, 
                       and maintaining the button's\u00A0"
                defaultNamespace="http://www.w3.org/1999/xhtml"
      />
    </hasXFAData>
  </assertThat>
</testcase>