3.31.  Text - in Images (OCR)

Overview

PDFUnit can extract text from images and validate it in the same way as normal text. The syntax of these OCR tests follows natural language as closely as possible. Each OCR test starts with the method chain hasImage().withText() or hasImage().withTextInRegion(). The following methods can be used to validate text in images:

// Tests for text in images:
.hasImage().withText().containing(..)
.hasImage().withText().endingWith(..)
.hasImage().withText().equalsTo(..)
.hasImage().withText().matchesRegex(..)
.hasImage().withText().startingWith(..)

// Tests for text in parts of an image:
.hasImage().withTextInRegion(imageRegion).containing(..)
.hasImage().withTextInRegion(imageRegion).endingWith(..)
.hasImage().withTextInRegion(imageRegion).equalsTo(..)
.hasImage().withTextInRegion(imageRegion).matchesRegex(..)
.hasImage().withTextInRegion(imageRegion).startingWith(..)

The text recognition function uses the OCR-processor Tesseract.

Example - Validate Text from Images

The following example uses a PDF file which contains an image showing the text of the novel 'Little Lord Fauntleroy'. The image has a slightly coloured background.

@Test
public void hasImageWithText() throws Exception {
  String filename = "ocr_little-lord-fauntleroy.pdf";
  int leftX  =  10; // page region, in millimeters
  int upperY =  35;
  int width  = 160;
  int height = 135;
  PageRegion pageRegion = new PageRegion(leftX, upperY, width, height);
  
  String expectedText = "Cedric himself knew nothing whatever about it.";
  AssertThat.document(filename)
            .restrictedTo(FIRST_PAGE)
            .restrictedTo(pageRegion)
            .hasImage()
            .withText()
            .containing(expectedText)
  ;
}

Normalization of OCR Text

In the image, there is a line break after the word 'nothing'. Despite this line break, the test succeeds, because PDFUnit removes all whitespace before comparing the OCR text with the expected text.

Steps in the normalization of OCR Text:

  • Characters are converted to lower case

  • All whitespace is deleted

  • 12 different hyphen/dash characters are deleted

  • 10 different underscore characters are deleted

  • Punctuation characters are deleted
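
The normalization steps above can be approximated in a few lines of Java. This is only a sketch: the class OcrTextNormalizer and its method normalize are hypothetical and not part of PDFUnit. The Unicode categories used here (\p{Pd} for dashes, \p{Pc} for underscore-like connectors, \p{P} for all other punctuation) are an assumption about which characters are removed; the exact character sets that PDFUnit deletes may differ.

```java
import java.util.Locale;

// Hypothetical helper that mimics the normalization steps listed above.
// It is NOT part of PDFUnit; the character classes are assumptions.
final class OcrTextNormalizer {

    static String normalize(String text) {
        return text
            .toLowerCase(Locale.ROOT)    // 1. convert to lower case
            .replaceAll("\\s+", "")      // 2. delete whitespace, including line breaks
            .replaceAll("\\p{Pd}", "")   // 3. delete hyphen and dash characters
            .replaceAll("\\p{Pc}", "")   // 4. delete underscore-like characters
            .replaceAll("\\p{P}", "");   // 5. delete remaining punctuation
    }

    public static void main(String[] args) {
        // OCR result with a line break after 'nothing', as in the example image.
        String ocrText = "Cedric himself knew nothing\nwhatever about it.";
        String expectedText = "Cedric himself knew nothing whatever about it.";

        // Both strings are normalized before they are compared.
        System.out.println(normalize(ocrText).equals(normalize(expectedText)));
    }
}
```

Applying the same function to the recognized text and to the expected text shows why the line break after 'nothing' does not matter: both strings collapse to the same sequence of lower-case letters.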

The result of text recognition can be improved by "training" the OCR-processor. Language-specific training data can be downloaded from https://github.com/tesseract-ocr/tessdata.

Example - Text in Image Regions

Sometimes the expected text must be located in a particular region of an image. You can define an image region to handle such a requirement:

@Test
public void hasImageWithTextInRegion() throws Exception {
  String filename = "ocr_little-lord-fauntleroy.pdf";
  
  int leftX  =  10; // page region, in millimeters
  int upperY =  35;
  int width  = 160;
  int height = 135;
  PageRegion pageRegion = new PageRegion(leftX, upperY, width, height);
  
  int imgLeftX  = 250; // image region, in pixels
  int imgUpperY =  90;
  int imgWidth  = 130;
  int imgHeight =  30;
  ImageRegion imageRegion = new ImageRegion(imgLeftX, imgUpperY, imgWidth, imgHeight);
  
  String expectedText = "Englishman";
  AssertThat.document(filename)
            .restrictedTo(FIRST_PAGE)
            .restrictedTo(pageRegion)
            .hasImage()
            .withTextInRegion(imageRegion)
            .containing(expectedText)
  ;
}

Image regions are always specified in pixels. Images in a PDF may be scaled, so values given in millimeters could describe the wrong area. To find the right values for an image region, extract all images from the PDF and use a simple image processing tool to read off the values for the desired region. PDFUnit provides the tool ExtractImages to extract images. Chapter 9.7: “Extract Images from PDF” explains how to use it.
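
Before writing the test, it can help to check a candidate region against an extracted image file. The following sketch assumes that the image has already been extracted to a file; it crops the candidate region with the standard Java class BufferedImage and writes the result to a new file for visual inspection. The class ImageRegionPreview, its methods, and the output file name region-preview.png are hypothetical and not part of PDFUnit.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Hypothetical helper for checking pixel values of an image region.
// It is NOT part of PDFUnit and uses only the Java standard library.
final class ImageRegionPreview {

    // Returns the sub-image described by the pixel values.
    // getSubimage throws a RasterFormatException if the region
    // does not lie completely inside the image.
    static BufferedImage crop(BufferedImage image,
                              int leftX, int upperY, int width, int height) {
        return image.getSubimage(leftX, upperY, width, height);
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ImageRegionPreview <extracted-image>");
            return;
        }
        BufferedImage image = ImageIO.read(new File(args[0]));
        if (image == null) {
            System.err.println("Not a readable image: " + args[0]);
            return;
        }
        // Same pixel values as in the ImageRegion of the example above.
        BufferedImage region = crop(image, 250, 90, 130, 30);
        ImageIO.write(region, "png", new File("region-preview.png"));
    }
}
```

If the written file shows exactly the expected word, the same four values can be passed to the ImageRegion constructor in the test.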

Example - Rotated and Flipped Text in Images

Watermarks and some other text in images may be intentionally rotated or flipped. Such text can be validated using the following methods:

// Methods to rotate and flip images before OCR processing:

.hasImage().flipped(FlipDirection).withText()...
.hasImage().rotatedBy(Rotation).withText()...

The horribly mangled text in the next image can be validated.

The text in this image is rotated by 270 degrees and flipped vertically. If you know these values, you can check the text:

@Test
public void testFlippedAndRotated() throws Exception {
  String filename = "image-with-rotated-and-flipped-text.pdf";
  int leftX  = 80; // page region, in millimeters
  int upperY = 65;
  int width  = 50;
  int height = 75;
  PageRegion pageRegion = new PageRegion(leftX, upperY, width, height);

  String expectedText = "text rotated 270 and flipped vertically";
  
  AssertThat.document(filename)
            .restrictedTo(FIRST_PAGE)
            .restrictedTo(pageRegion)
            .hasImage()
            .rotatedBy(Rotation.DEGREES_270)
            .flipped(FlipDirection.VERTICAL)
            .withText()
            .equalsTo(expectedText)
  ;
}

Allowed values for rotation and flipping are:

Rotation.DEGREES_0
Rotation.DEGREES_90
Rotation.DEGREES_180
Rotation.DEGREES_270

FlipDirection.NONE
FlipDirection.HORIZONTAL
FlipDirection.VERTICAL
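
To make the effect of the two operations concrete, the following sketch rotates and flips a BufferedImage with plain pixel copies. It is an illustration only, not PDFUnit's implementation: the class ImageOrientation is hypothetical, and it assumes that rotation is clockwise and that a vertical flip mirrors the image top to bottom. PDFUnit's own conventions may differ.

```java
import java.awt.image.BufferedImage;

// Hypothetical illustration of rotating and flipping an image.
// NOT part of PDFUnit. Assumes clockwise rotation and a vertical
// flip that swaps the top and bottom rows.
final class ImageOrientation {

    // Rotates the image clockwise by the given number of quarter turns
    // (1 = 90 degrees, 2 = 180 degrees, 3 = 270 degrees).
    // With 0 quarter turns the source image itself is returned.
    static BufferedImage rotateClockwise(BufferedImage src, int quarterTurns) {
        int turns = Math.floorMod(quarterTurns, 4);
        BufferedImage result = src;
        for (int i = 0; i < turns; i++) {
            result = rotate90Clockwise(result);
        }
        return result;
    }

    private static BufferedImage rotate90Clockwise(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        // Width and height are swapped in the rotated image.
        BufferedImage dst = new BufferedImage(h, w, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                // The top-left pixel moves to the top-right corner.
                dst.setRGB(h - 1 - y, x, src.getRGB(x, y));
            }
        }
        return dst;
    }

    // Mirrors the image top to bottom.
    static BufferedImage flipVertical(BufferedImage src) {
        int w = src.getWidth();
        int h = src.getHeight();
        BufferedImage dst = new BufferedImage(w, h, BufferedImage.TYPE_INT_ARGB);
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                dst.setRGB(x, h - 1 - y, src.getRGB(x, y));
            }
        }
        return dst;
    }
}
```

Under these assumptions, ImageOrientation.flipVertical(ImageOrientation.rotateClockwise(image, 3)) corresponds to combining Rotation.DEGREES_270 with FlipDirection.VERTICAL.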