13.5.  Whitespace Processing

Almost all tests compare strings. Many comparisons would fail if whitespace were left exactly as it appears in the PDF, so you can control how whitespace is handled with one of the following constants. NORMALIZE_WHITESPACES is the default if nothing is declared:

// Constants for whitespace processing:

com.pdfunit.Constants.IGNORE_WHITESPACES
com.pdfunit.Constants.IGNORE

    All whitespace is deleted before two strings are compared.

com.pdfunit.Constants.KEEP_WHITESPACES
com.pdfunit.Constants.KEEP

    Existing whitespace is not changed.

com.pdfunit.Constants.NORMALIZE_WHITESPACES
com.pdfunit.Constants.NORMALIZE

    Whitespace at the beginning and at the end of a string is deleted. Any sequence of whitespace characters within the text is reduced to one space.

Each set of two constants has the same meaning. The redundancy is provided to support different linguistic preferences.
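The effect of the three modes can be sketched with plain string operations. This is only an illustration of the behaviour described above, not PDFUnit's implementation; the class WhitespaceModes and its methods are hypothetical names, and treating "whitespace" as Java's \s character class is an assumption.

```java
public class WhitespaceModes {

    // Hypothetical helper mimicking IGNORE_WHITESPACES: remove all whitespace.
    static String ignore(String s) {
        return s.replaceAll("\\s+", "");
    }

    // Hypothetical helper mimicking KEEP_WHITESPACES: leave the string untouched.
    static String keep(String s) {
        return s;
    }

    // Hypothetical helper mimicking NORMALIZE_WHITESPACES: collapse every run
    // of whitespace to one space, then drop leading and trailing spaces.
    static String normalize(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) {
        String pageText = "  Content on\n first   page. ";
        System.out.println("[" + ignore(pageText) + "]");    // [Contentonfirstpage.]
        System.out.println("[" + normalize(pageText) + "]"); // [Content on first page.]
        System.out.println(keep(pageText).equals(pageText)); // true
    }
}
```

Under this model, two strings compare as equal in a given mode when the corresponding helper yields the same result for both of them.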

An example:

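public void hasText_WithLineBreaks_UsingIGNORE() throws Exception {
  String filename = "documentUnderTest.pdf";
  // Expected text, written without any line breaks.
  String expected = "PDFUnit - Automated PDF Tests http://pdfunit.com/" +
                    "This is a document that is used for unit tests of PDFUnit itself." +
                    "Content on first page." +
                    "odd pagenumber" +
                    "Page # 1 of 4";
  // NOTE: the first three lines of this call chain were missing from the
  // listing and are an assumed reconstruction.
  AssertThat.document(filename)
            .restrictedTo(FIRST_PAGE)
            .hasText()
            .equalsTo(expected, IGNORE_WHITESPACES)
  ;
}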

The expected string in this example is written entirely without line breaks, although the PDF page contains many of them. Because IGNORE_WHITESPACES is used, the test nevertheless passes.

NORMALIZE_WHITESPACES is the default when nothing else is set explicitly. Test methods for which flexible whitespace handling makes no sense do not have a second parameter.

As an exception to this rule, methods that take regular expressions never change whitespace automatically. It is up to you to build the whitespace handling into the regular expression, for example like this:


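The term (?ms) switches on multiline and dotall mode: ^ and $ match at every line boundary, and . also matches line breaks. The search therefore extends over multiple lines, and line breaks are treated as ordinary characters.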
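A minimal sketch of this idea, using java.util.regex directly rather than a PDFUnit method: \s+ is placed wherever the page text may contain spaces or line breaks, and (?ms) lets .* run across lines. The class name RegexWhitespace, the sample page text, and the pattern are hypothetical, and whether PDFUnit requires a full match or a partial match is not stated here, so the sketch assumes a full match.

```java
import java.util.regex.Pattern;

public class RegexWhitespace {

    // Hypothetical pattern: \s+ accepts any run of whitespace, including
    // line breaks; (?ms) allows .* to span several lines.
    static final String REGEX =
        "(?ms).*odd\\s+pagenumber.*Page\\s+#\\s+1\\s+of\\s+4.*";

    // Returns true if the whole text matches the given regular expression.
    static boolean matches(String regex, String text) {
        return Pattern.matches(regex, text);
    }

    public static void main(String[] args) {
        // Hypothetical page text containing line breaks and double spaces.
        String pageText = "Content on first page.\nodd\n pagenumber\nPage # 1  of 4\n";

        // With (?ms) and \s+ the line breaks and extra spaces are tolerated.
        System.out.println(matches(REGEX, pageText));

        // Without the flags, '.' stops at line breaks and the match fails.
        System.out.println(matches(".*odd\\s+pagenumber.*", pageText));
    }
}
```

Writing a literal space in the pattern instead of \s+ would make the match depend on the exact whitespace in the PDF, which is the situation the whitespace constants avoid for plain string comparisons.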