Knowledge Yielding Ontologies
for Transition-based Organization

Pdf2Text

Capturing text from PDFs

The Pdf2Text module is a shell around two other open-source packages: pdftk and pdftotext. The module first splits the PDF into one PDF file per page. Next, it converts each PDF page to text. Finally, the pages are cleaned up and glued together into a single HTML document for the whole document. The module introduces additional elements in the HTML to mark the page boundaries, e.g. <page number="1">, <page number="2">.
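The three steps can be sketched in Python as follows. This is only an illustration of the flow described above, not the module's actual Java code: the function names, the page file pattern, and the choice to wrap each page in a closing </page> tag are assumptions. The command lines use the pdftk burst operation and the pdftotext -enc option.

```python
from __future__ import annotations

import html
import subprocess
from pathlib import Path


def split_into_pages(pdf_path: Path, work_dir: Path) -> list[Path]:
    """Step 1: split the PDF into one file per page with `pdftk ... burst`."""
    work_dir.mkdir(parents=True, exist_ok=True)
    pattern = work_dir / "page_%04d.pdf"  # hypothetical naming scheme
    subprocess.run(
        ["pdftk", str(pdf_path), "burst", "output", str(pattern)],
        check=True,
    )
    # Zero-padded names keep the pages in order when sorted.
    return sorted(work_dir.glob("page_*.pdf"))


def page_to_text(page_pdf: Path) -> str:
    """Step 2: convert one page to text; "-" sends the text to stdout."""
    result = subprocess.run(
        ["pdftotext", "-enc", "UTF-8", str(page_pdf), "-"],
        check=True,
        capture_output=True,
    )
    return result.stdout.decode("utf-8", errors="replace")


def assemble_html(page_texts: list[str]) -> str:
    """Step 3: glue the pages into one HTML document with page markers."""
    parts = ["<html>", "<body>"]
    for number, text in enumerate(page_texts, start=1):
        parts.append(f'<page number="{number}">')
        parts.append(html.escape(text.strip(), quote=False))
        parts.append("</page>")  # assumption: markers wrap the page content
    parts += ["</body>", "</html>"]
    return "\n".join(parts)


def pdf_to_html(pdf_path: Path, work_dir: Path) -> str:
    """Run all three steps for a single PDF file."""
    pages = split_into_pages(pdf_path, work_dir)
    return assemble_html([page_to_text(page) for page in pages])
```

Keeping the gluing step separate from the two external calls makes it possible to check the page markers without pdftk or pdftotext installed.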

The module also tries to fix a number of errors that may be introduced by the conversion software:

  • It tries to reconstruct enumerations by introducing list structures (<ul>).

  • It repairs words represented as space-separated characters: E N V I R O N M E N T will become ENVIRONMENT.

  • It de-hyphenates words.

  • It introduces paragraph boundaries to mark coherent text areas.
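The repairs listed above could be approximated with a few regular expressions. The sketch below is a guess at the kind of heuristics involved, not the rules Pdf2Text actually applies; all function names and patterns are hypothetical, and each heuristic can misfire (for example, joining a genuine compound that happens to break at a line end).

```python
import re

# Three or more single capital letters separated by single spaces.
SPACED_WORD = re.compile(r"\b(?:[A-Z] ){2,}[A-Z]\b")
# A hyphen at the end of a line followed by a lowercase continuation.
LINE_END_HYPHEN = re.compile(r"(\w)-\n\s*([a-z])")
# A line that starts with a bullet-like character.
BULLET = re.compile(r"^\s*[•\-*]\s+(.*)$")


def join_spaced_letters(text: str) -> str:
    """Turn 'E N V I R O N M E N T' into 'ENVIRONMENT'."""
    return SPACED_WORD.sub(lambda m: m.group(0).replace(" ", ""), text)


def dehyphenate(text: str) -> str:
    """Rejoin words that were split across a line break."""
    return LINE_END_HYPHEN.sub(r"\1\2", text)


def mark_lists(text: str) -> str:
    """Wrap runs of bullet lines in <ul>/<li> elements."""
    out, in_list = [], False
    for line in text.splitlines():
        match = BULLET.match(line)
        if match:
            if not in_list:
                out.append("<ul>")
                in_list = True
            out.append(f"<li>{match.group(1)}</li>")
        else:
            if in_list:
                out.append("</ul>")
                in_list = False
            out.append(line)
    if in_list:
        out.append("</ul>")
    return "\n".join(out)


def mark_paragraphs(text: str) -> str:
    """Treat blank-line-separated blocks as paragraphs."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", text) if b.strip()]
    return "\n".join(f"<p>{block}</p>" for block in blocks)
```

For instance, join_spaced_letters("E N V I R O N M E N T will become") returns "ENVIRONMENT will become".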

In principle, the output is converted to UTF-8, but in many cases the PDFs contain elements that cannot be converted to UTF-8, e.g. many types of quotes. Whether these need to be removed or converted separately depends on the sensitivity of the linguistic processor.
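If a downstream processor cannot cope with such characters, a separate clean-up pass is one option. The sketch below assumes two things that the module itself does not state: that problematic bytes are Windows-1252 encoded, and that mapping typographic quotes to plain ASCII quotes is acceptable. Both function names are hypothetical.

```python
# Typographic quotes mapped to their plain ASCII counterparts.
QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'", "\u201a": "'",
    "\u201c": '"', "\u201d": '"', "\u201e": '"',
    "\u00ab": '"', "\u00bb": '"',
})


def decode_lenient(raw: bytes) -> str:
    """Decode as UTF-8, falling back to Windows-1252 (an assumption)."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        # Bytes undefined in Windows-1252 become U+FFFD instead of failing.
        return raw.decode("cp1252", errors="replace")


def normalize_quotes(text: str) -> str:
    """Replace typographic quotes with ASCII quotes."""
    return text.translate(QUOTE_MAP)
```

A processor that handles Unicode quotes correctly would only need the decoding step, not the normalization.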

Download and installation

The module can be downloaded from: https://kyoto.let.vu.nl/~kyoto/files/pdf2text/Pdf2Text-1.0-jar-with-dependencies.jar

Installation instructions

The module was developed in Java 1.6 and compiled and tested on Linux (Ubuntu 8.04, Hardy Heron), but it should run on any platform that meets the installation requirements.

Installation requirements

To run the Pdf2Text module, the following software must be present on the system and available in the path:
  • pdftk: software to split a PDF document into pages (www.accesspdf.com)
  • pdftotext: software to convert PDF pages to text (www.foolabs.com)
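Before running the module, it can be useful to confirm that both tools are actually found in the path. A minimal check, written here in Python for illustration (the function name is hypothetical and this check is not part of Pdf2Text):

```python
import shutil

REQUIRED_TOOLS = ("pdftk", "pdftotext")


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the names of required tools that are not found in the path."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found in path: " + ", ".join(missing))
    else:
        print("All required tools are available.")
```

The lookup function is passed in as a parameter only so that the check can be exercised without the tools installed.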

Integration in the KYOTO pipeline

The capture module can be integrated into the KYOTO PipeT architecture. Whenever PDF files are uploaded with the option --mime-type='application/pdf', the job dispatcher calls the module to convert the PDF file to HTML.
 

ICT-211423 - 2008 © Kyoto Consortium