Layout Analysis for Scanned PDF and Transformation to the Structured PDF Suitable for Vocalization and Navigation



Introduction
For vision-impaired users, access to electronic documents, including scanned PDF files, has been extremely limited. Their main mode of access is a screen reader. Even after a scanned document has been processed with OCR software, what is left is plain text without tags or mark-up specifying primitive components. Because the resulting document lacks tags, vision-impaired users cannot navigate it.
For readers to navigate a scanned PDF, its components must be identified through PDF layout analysis and tagged accordingly.
Extracting Hidden Structures from Electronic Documents (XED) is a reverse engineering tool for PDF documents (Rigamonti et al., 2005). XED discovers and extracts the original document layout structure and generates the XCDF hierarchical standard form, which is independent of the document type.
First, XED cleans the primitives in the original document, taking into account all types of embedded resources such as raw images and fonts. It then recovers the physical structures and represents them in XCDF format. XCDF represents the reorganized document in a structured and unique manner, so the document content can be accessed easily for further work; however, XED is a closed application that works only under the Windows operating system.
The final target of this research project is to design a portable, stand-alone, affordable, modifiable and open-source Complete Reading System (CRS) for vision-impaired people (Nazemi & Murray, 2012). CRS provides access to several electronic document formats such as Digital Accessible Information System (DAISY), DOC, DOCX, ODT and PDF, and to all non-textual PDF components such as mathematical expressions and charts. This paper demonstrates part of the development of this open-source application, which extracts primitive components from an image document, generates tags, and reconstructs a new document that is accessible and navigable by assistive technologies. The open-source packages used for this purpose include OCRopus for OCR (Shafait, 2009) and ImageMagick for image processing (Still, 2006). The final output is in HTML Optical Character Recognition (hOCR) format.

Layout Analysis
Physical layout analysis represents a document page as a set of distinct areas such as columns, paragraphs and text lines (O'Brein, 2012). It is responsible for identifying page components such as text columns, text blocks, text lines and the reading order (Breuel, 2008).
A layout is a collection of segments:

$L = \{S_1, \ldots, S_n\}$, where $L$ and $S$ represent the layout and a segment respectively (1)

A segment is a collection of pixels enclosed in a bounding box defined by its lower-left and upper-right corner pixels:

$S = (P_{ll}, P_{ur})$, where $S$ and $P$ represent a segment and a pixel respectively (2)

Each pixel is defined by a coordinate pair:

$P = (x, y)$

The layout information is divided into two categories: the geometric layout and the logical layout (Haralick, 1994). The geometric layout is determined by the positioning information of the segments. This geometric information allows the segments to be categorized into different logical layouts. Each segment is represented with a tag collection. Based on this structure, primary layout analysis is obtained through the following steps:
• computing the bounding box for the connected components of the scanned input page image;
• identifying the whitespace;
• finding the constrained text lines.
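The first step above, computing bounding boxes for connected components, can be sketched as follows. This is a minimal illustration and not the OCRopus implementation: the function name `component_bboxes`, the use of 4-connectivity, and the representation of the page as a list of rows of 0/1 values are all assumptions made for the example. Image coordinates are assumed, with y growing downward, so each box is reported as (x_min, y_min, x_max, y_max).

```python
from collections import deque
from typing import List, Tuple

# Hypothetical box type: (x_min, y_min, x_max, y_max), inclusive pixel coordinates.
BBox = Tuple[int, int, int, int]


def component_bboxes(page: List[List[int]]) -> List[BBox]:
    """Return one bounding box per 4-connected component of foreground (1) pixels."""
    height = len(page)
    width = len(page[0]) if page else 0
    seen = [[False] * width for _ in range(height)]
    boxes: List[BBox] = []
    for y0 in range(height):
        for x0 in range(width):
            if page[y0][x0] != 1 or seen[y0][x0]:
                continue
            # Breadth-first flood fill starting from an unvisited foreground pixel.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            x_min = x_max = x0
            y_min = y_max = y0
            while queue:
                x, y = queue.popleft()
                x_min, x_max = min(x_min, x), max(x_max, x)
                y_min, y_max = min(y_min, y), max(y_max, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and page[ny][nx] == 1 and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((x_min, y_min, x_max, y_max))
    return boxes
```

Each returned box corresponds to one segment in the sense of the layout definition above, described by two corner pixels.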

Html Optical Character Recognition (hOCR)
hOCR is a logical format for representing the output of OCR systems.
It is an open standard used to embed layout, recognition confidence, style and other information into recognized text. Standard HTML is used to embed this data into the text. The logical mark-up available in hOCR is designed for the logical hierarchy of the document, independent of where or how it is rendered on the page. This kind of mark-up is usable for individual documents such as memos and articles, and for compound documents such as newspapers, magazines and collections (Breuel et al., 2007). The hOCR tags that can be used include:
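As a hedged illustration of how hOCR mark-up can be consumed, the sketch below reads bounding boxes from a small hand-written sample. The class names used here (`ocr_page`, `ocr_par`, `ocr_line`) and the `title="bbox x0 y0 x1 y1"` convention come from the public hOCR specification and are not necessarily the tags the authors list. The names `HocrBBoxParser`, `parse_hocr` and `SAMPLE` are hypothetical, introduced only for this example.

```python
from html.parser import HTMLParser


class HocrBBoxParser(HTMLParser):
    """Collect (class_name, bbox) for every element whose class starts with 'ocr'."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        cls = attr.get("class") or ""
        title = attr.get("title") or ""
        if not cls.startswith("ocr"):
            return
        # The title attribute holds ';'-separated properties, e.g. "bbox 0 0 800 1000".
        for prop in title.split(";"):
            parts = prop.split()
            if len(parts) == 5 and parts[0] == "bbox":
                self.elements.append((cls, tuple(int(p) for p in parts[1:])))


# Hand-written sample for illustration only; not output from a real OCR run.
SAMPLE = """
<div class="ocr_page" title="bbox 0 0 800 1000">
  <p class="ocr_par" title="bbox 50 60 750 120">
    <span class="ocr_line" title="bbox 50 60 750 90">First line</span>
  </p>
</div>
"""


def parse_hocr(markup):
    """Return the list of (class_name, (x0, y0, x1, y1)) found in the markup."""
    parser = HocrBBoxParser()
    parser.feed(markup)
    parser.close()
    return parser.elements
```

Because hOCR is ordinary HTML, the standard-library parser is enough for this sketch; a production reader would also need to handle non-integer or malformed property values.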

OCRopus Segmentation Methods
Layout analysis of a scanned PDF depends strongly on the page segmentation method. Recognition by Adaptive Subdivision of Transformation Space (RAST) and Voronoi (named after Georgy Voronoy, who created the Voronoi diagram) are the two methods used for page segmentation in OCRopus.
RAST extracts connected components and then determines the largest possible whitespace rectangles based on the divider's priority. The RAST algorithm is capable of processing multiple-column documents. In the RAST result image, the column dividers are yellow and different colors are assigned to different segments (Winder, 2010).
The Voronoi method identifies the connected components and then extracts sample points along their boundaries to construct a Voronoi-point diagram (Kise et al., 1998). A large number of edges are created, most of which are not required. The unnecessary edges are deleted in ascending order of length, regardless of their connection to other lines. As a result, the Voronoi-point diagram is converted to an area Voronoi diagram, whose areas represent the page regions.
Moreover, OCRopus provides several tools, including the following.
The ocr-text-image-seg tool separates the image from the text completely by removing the masked and rectangular regions from an input image.
The ocropus-gpageseg tool identifies the tops and bottoms of text lines by computing gradients and performing adaptive thresholding. These components are then used as seeds for the text lines. The tool attempts to find column separators as either extended vertical black lines or extended vertical whitespace (OCRopus, 2013).
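The idea of finding column separators as extended vertical whitespace can be illustrated with a much simpler stand-in: a vertical projection that reports runs of pixel columns containing no foreground at all. This is not the ocropus-gpageseg algorithm, which uses gradients and adaptive thresholding. The function name `vertical_whitespace_gaps`, the `min_width` parameter and the 0/1 page representation are assumptions made for the sketch.

```python
def vertical_whitespace_gaps(page, min_width=2):
    """Return (x_start, x_end) ranges of fully empty pixel columns.

    `page` is a list of rows of 0 (background) / 1 (foreground) values.
    Only runs at least `min_width` columns wide are reported. Page margins
    are reported too, so a caller would normally discard gaps that touch
    the left or right edge.
    """
    if not page:
        return []
    width = len(page[0])
    # A pixel column is empty when every row has background at that x position.
    empty = [all(row[x] == 0 for row in page) for x in range(width)]
    gaps = []
    start = None
    for x, is_empty in enumerate(empty):
        if is_empty and start is None:
            start = x
        elif not is_empty and start is not None:
            if x - start >= min_width:
                gaps.append((start, x - 1))
            start = None
    # Close a run that extends to the right edge of the page.
    if start is not None and width - start >= min_width:
        gaps.append((start, width - 1))
    return gaps
```

A real page would need tolerance for noise and skew; this sketch only accepts columns that are entirely empty.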

Methodology
This research comprises the following steps for retrieving the layout of a scanned PDF and making it navigable:
• Pre-processing, which includes conversion of the image to binary and resizing for segmentation;
• Extraction of non-textual components such as figures and images;
• Block segmentation to divide the page into logical blocks and preserve the reading order. Tables, where applicable, are also detected and extracted at this stage;
• Line segmentation for each text block and computation of the line bounding boxes in order to recreate the physical layout;
• Geometric data analysis and derivation of the logical layout to generate a tagged document;
• Mathematical expression detection and extraction;
• Sending the detected tables, figures and/or mathematical expressions to specific applications to extract the valuable, hidden implicit information and represent it in audio format using Text To Speech (TTS);
• Merging all output components according to the reading order;
• Generating the fully marked-up hOCR document.

Implementation Results
This section describes the implementation results of the present research.

Non-Text Components Extraction Such as Figures and Images
The ocr-text-image-seg tool performs document zone classification using features based on run-lengths and connected components, together with a logistic regression classifier. Since CRS is intended to extract implicit information from non-textual PDF components such as figures, another approach is used for figure detection and extraction from image documents. It comprises these steps:
• Use RAST segmentation;
• Find all yellow pixels in the RAST result;
• Check that the pixel pairs $(X_1, Y_1)$ and $(X_2, Y_2)$ are located in the yellow area; this means that the yellow area is a rectangle;
• Crop the main image from $(X_1, Y_1)$ to $(X_2, Y_2)$.
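The figure-extraction steps above can be sketched as follows, under several assumptions that are not stated in the source: the RAST result and the main page image have the same dimensions, both are lists of rows, yellow is exactly the RGB value (255, 255, 0), there is a single yellow region, and the rectangle check tests the four corners of the yellow bounding box. All function names are hypothetical.

```python
YELLOW = (255, 255, 0)  # assumed exact colour of the yellow area in the RAST result


def yellow_bbox(rast):
    """Return (x1, y1, x2, y2) enclosing all yellow pixels, or None if there are none."""
    xs, ys = [], []
    for y, row in enumerate(rast):
        for x, pixel in enumerate(row):
            if pixel == YELLOW:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return min(xs), min(ys), max(xs), max(ys)


def corners_are_yellow(rast, box):
    """Cheap rectangle test: all four corners of the box lie in the yellow area."""
    x1, y1, x2, y2 = box
    corners = ((x1, y1), (x2, y1), (x1, y2), (x2, y2))
    return all(rast[y][x] == YELLOW for x, y in corners)


def crop(image, box):
    """Cut the inclusive box (x1, y1, x2, y2) out of an image stored as rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2 + 1] for row in image[y1:y2 + 1]]


def extract_figure(rast, page):
    """Crop the page at the yellow rectangle of the RAST result, if one is found."""
    box = yellow_bbox(rast)
    if box is None or not corners_are_yellow(rast, box):
        return None
    return crop(page, box)
```

The corner test is only a necessary condition for a rectangle; a stricter variant would verify every pixel inside the box.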
Applying RAST to image-text separation enables the extraction of the original figure by accessing its bounding box and sending it to GRAPHREADER (Nazemi & Murray, 2013), an application used to extract possible implicit text information from graphical components. After the figure has been extracted from the page, the figure block is tagged as <ocr-image> </ocr-image> and its bounding box is appended to hOCR.html.
Figure 1 shows a multi-column scanned PDF containing an image, the RAST result and the extracted figure block.
Based on the features extracted from the bounding box of each line, the following can be concluded:
• A position is reserved as a heading level if the hOCR file indicates the position as a new paragraph <P>; (lm);
• A caption tag is assigned to a line segment if w = […], where w is the width of the line segment, and the previous segment is an image or the next segment is a table;
• A running header tag is attached to a line segment if it is the first line segment of the first block and the vertical space between this line and the next line > mean value (vertical spaces in page); h
• A running footer tag is attached to a line segment if it is the last line segment of the last block and the vertical space between this line and the previous line > mean value (vertical spaces in page); h
• A side bar tag is assigned to a line segment if it is located in the last block of a page and the block-aspect-ratio < 1;
• A foot-bar tag is assigned to a line segment if it is located in the last block of a page and the block-aspect-ratio > 1.
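The running header, running footer, side bar and foot-bar rules can be sketched as below. This covers only the conditions that are fully stated above and rests on assumptions: line boxes are (x0, y0, x1, y1) with y growing downward, the lines of a page are given in a single top-to-bottom sequence (single-column case), the block aspect ratio is taken as width divided by height, and the tag strings `"side-bar"` and `"foot-bar"` are placeholders. No precedence between the rules is implied.

```python
from statistics import mean


def vertical_gaps(lines):
    """Vertical space between each line box and the next one on the page."""
    return [nxt[1] - cur[3] for cur, nxt in zip(lines, lines[1:])]


def is_running_header(lines):
    """First line of the page, separated from the next line by more than the mean gap."""
    gaps = vertical_gaps(lines)
    return bool(gaps) and gaps[0] > mean(gaps)


def is_running_footer(lines):
    """Last line of the page, separated from the previous line by more than the mean gap."""
    gaps = vertical_gaps(lines)
    return bool(gaps) and gaps[-1] > mean(gaps)


def last_block_tag(block_bbox):
    """Classify the last block of a page by its aspect ratio (assumed width / height)."""
    x0, y0, x1, y1 = block_bbox
    height = y1 - y0
    if height <= 0:
        return None
    ratio = (x1 - x0) / height
    if ratio < 1:
        return "side-bar"   # tall, narrow block
    if ratio > 1:
        return "foot-bar"   # wide, flat block
    return None
```

For a page with line boxes at y = 0-10, 40-50, 55-65, 70-80 and 120-130, the gaps are 30, 5, 5 and 40 with a mean of 20, so both the first and the last line satisfy their respective rules.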

Table Recognition
When the RAST segmentation method is applied to a document image that contains a table, the horizontal and vertical separator lines of the table cells are marked in yellow. The coordinates of the line intersections, which are in fact the bounding boxes of the cells, are obtained by counting these lines and finding their geometric properties. Using these bounding box values, all cells can be separated from the table image as individual segments and marked with tags <…> </…> indexed by i and j, where i and j represent the row and column respectively. Figure 7 illustrates a sample table and its RAST output.
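Once the positions of the separator lines are known, deriving the cell bounding boxes is straightforward. The sketch below assumes the separators have already been reduced to a list of x positions (vertical lines) and y positions (horizontal lines), that the table is a regular grid without merged cells, and that rows and columns are numbered from 1. The function name `cell_bboxes` is hypothetical.

```python
def cell_bboxes(vertical_xs, horizontal_ys):
    """Map (i, j) = (row, column) to the cell box (left, top, right, bottom).

    Each cell is bounded by two consecutive horizontal separators and two
    consecutive vertical separators, so its corners are line intersections.
    """
    xs = sorted(vertical_xs)
    ys = sorted(horizontal_ys)
    cells = {}
    for i, (top, bottom) in enumerate(zip(ys, ys[1:]), start=1):
        for j, (left, right) in enumerate(zip(xs, xs[1:]), start=1):
            cells[(i, j)] = (left, top, right, bottom)
    return cells
```

With three vertical and three horizontal separators, for example, this yields a 2 x 2 grid of cells, each of which can then be cropped and tagged individually.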

Figure 1. Top left: Multi-colu