Chapter 6. Unicode

PDF Documents Containing Unicode

Would the tests described so far also run with content that is not ISO-8859-1, for example with Russian, Greek or Chinese text?

That is a difficult question. Many internal tests run against Greek, Russian and Chinese documents, but none yet cover Hebrew or Japanese documents. It is therefore not certain that every available test works with every language, although it should.

When you need to process Unicode data, it is good practice to configure all your tools to UTF-8.

The following hints address problems with UTF-8 files under PDFUnit, and they may also help in other situations.
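On the Perl side, one common way to follow this advice is to switch the standard streams to UTF-8 so that test output containing non-ASCII text is written correctly. The following is only a minimal sketch using Perl's core `utf8` and `open` pragmas. It is not part of PDFUnit, and the sample message is an assumption made for illustration.

```perl
use strict;
use warnings;
use utf8;                             # this source file is saved as UTF-8
use open qw(:std :encoding(UTF-8));   # STDIN, STDOUT and STDERR use UTF-8

# Hypothetical sample text, chosen only to contain non-ASCII characters.
my $message = "© Εργαστήριο";
print "$message\n";
```

Without the `open` pragma, Perl warns about a "Wide character" when it prints the Greek text.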

Single Unicode Characters

Metadata and keywords can contain Unicode characters. If your operating system does not provide fonts for foreign languages, you can use Unicode escape sequences in the format \x{nnnn} within double-quoted strings. For example, the copyright character © is written as \x{00A9}:

#
# The document info 'producer' contains the copyright sign as a Unicode character.
#
lives_ok {
  my $pdfUnderTest = "$resources_dir/unicode/unicode_producer.pdf";
  AssertThat->document($pdfUnderTest)
            ->hasProducer()
            ->equalsTo("txt2pdf v7.3 \x{00A9} SANFACE Software 2004")  # 'copyright'
  ;
} "value of producer contains Unicode";
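The escape sequence only works in double-quoted strings, which is easy to overlook. The sketch below uses plain Perl without PDFUnit to show what `\x{00A9}` produces. The variable names are chosen for illustration only.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Inside double quotes, "\x{00A9}" is one character: the copyright sign.
my $producer = "txt2pdf v7.3 \x{00A9} SANFACE Software 2004";

# The escape is equivalent to chr() with the same code point.
my $same = "txt2pdf v7.3 " . chr(0x00A9) . " SANFACE Software 2004";

# Single quotes do not interpret the escape; the text stays as typed.
my $literal = '\x{00A9}';

# Encoded as UTF-8, the copyright sign occupies two bytes (0xC2 0xA9).
my $utf8_bytes = encode('UTF-8', "\x{00A9}");

print "identical strings\n" if $producer eq $same;
print "literal length: ", length($literal), "\n";
print "UTF-8 byte count: ", length($utf8_bytes), "\n";
```

If an expected value is written in single quotes, the comparison is made against the eight literal characters of the escape sequence, not against the copyright sign.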

Longer Unicode Text

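You can also write Unicode text directly in Perl source code. The file must then be saved as UTF-8 and contain the statement use utf8;, otherwise Perl treats the string literals as sequences of bytes. A test with a longer text may look like this: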

#  
#  The document info 'subject' contains Unicode.
#  
lives_ok {
  my $pdfUnderTest = "$resources_dir/unicode/unicode_subject.pdf";
  my $expectedSubject = "Εργαστήριο Μηχανικής ΙΙ ΤΕΙ ΠΕΙΡΑΙΑ / Μηχανολόγοι";
  AssertThat->document($pdfUnderTest)
            ->hasSubject()
            ->equalsTo($expectedSubject)
  ;
} "test subject with Greek characters";
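The effect of `use utf8;` can be checked independently of PDFUnit. The following sketch assumes the source file is saved as UTF-8. It compares the number of characters in the first word of the subject with the number of bytes in its UTF-8 encoding.

```perl
use strict;
use warnings;
use utf8;                  # string literals in this file are UTF-8 text
use Encode qw(encode);

my $word = "Εργαστήριο";

my $characters = length($word);                    # counts characters
my $bytes      = length(encode('UTF-8', $word));   # counts encoded bytes

print "characters: $characters, bytes: $bytes\n";
```

With `use utf8;` the word has ten characters, and its UTF-8 encoding has twenty bytes. Without the pragma, Perl would see the twenty bytes as twenty separate characters, and a comparison against text extracted from a PDF would fail.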

Configure Eclipse to UTF-8

When you are working with XML files in Eclipse, you do not need to configure Eclipse for UTF-8, because UTF-8 is the default for XML files. For other file types, however, the default is the encoding of the platform. It is therefore recommended to set the encoding for the entire workspace to UTF-8 under Window > Preferences > General > Workspace, option "Text file encoding".

This workspace default can be overridden for each file.

Unicode for Invisible Characters - NBSP

A non-breaking space can cause problems. It looks like a normal space, but it is a different character, so a comparison with a normal space fails. When the expected value contains the Unicode sequence of the non-breaking space (\x{00A0}), the test runs successfully. Here's the test:

#
# String ends with NBSP.
#
lives_ok {
  my $pdfUnderTest = "$resources_dir/unicode/xfaBasicToggle.pdf";
  my $defaultNS = DefaultNamespace->new("http://www.w3.org/1999/xhtml");
  my $nodeValue = "The code for creating the toggle behavior involves switching "
                . "the border between raised and lowered, and maintaining the button's";
  my $nodeValueWithNBSP = $nodeValue . "\x{00A0}"; # The content terminates with a NBSP.
  my $nodeP7 = XMLNode->new("default:p[7]", $nodeValueWithNBSP, $defaultNS);
  AssertThat->document($pdfUnderTest)
            ->hasXFAData()
            ->withNode($nodeP7) 
  ;
} "check for invisible blank (nbsp)";
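When such a comparison fails, it helps to make the invisible character visible. The following sketch is plain Perl and not part of PDFUnit. The helper `show_nbsp` is a hypothetical name introduced here for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every NBSP with a visible marker.
sub show_nbsp {
    my ($text) = @_;
    (my $visible = $text) =~ s/\x{00A0}/<NBSP>/g;
    return $visible;
}

my $expected = "maintaining the button's ";          # normal space, U+0020
my $actual   = "maintaining the button's\x{00A0}";   # NBSP, U+00A0

# Both strings look the same when printed, but they are not equal.
print "strings differ\n" if $expected ne $actual;

# Print the code point of the last character and the marked-up text.
printf "last character: U+%04X\n", ord(substr($actual, -1));
print show_nbsp($actual), "\n";
```

Printing the code point of a suspicious character in this way shows whether the document contains U+0020 or U+00A0.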