Knowledge Yielding Ontologies
for Transition-based Organization

Document Base & Term Base

The database system is the core of the KYOTO integrated system. It consists of:

  • a document database manager;
  • a term database manager;
  • a job dispatcher; and
  • a web-based configuration tool.

The document base manager is used to build up a corpus. It keeps track of all documents in the corpus, and of all representations of each document (e.g., HTML, KAF, etc.). The document base provides an interface for other application components. Supported operations include adding a document, deleting a document, adding another representation of a document, and fetching a specific representation of a document.

The term database contains a collection of terms and term relations, including pointers to locations in the source documents. The term database manager provides an interface to view, add, delete and modify terms and term relations.

The job dispatcher reads from the configuration file which PipeT modules (or pipelines) apply to which documents. It continuously monitors the document base and, as soon as there is a document to which a module applies, launches the module with the document as input. The result (a processed version of the document) is stored in the document base and may then be processed by another module. A record of completed jobs is kept to avoid duplicate processing.

The web-based configuration tool is used to configure the job dispatcher and the databases.

The database back-end of the document base manager and the term base manager is a MySQL server. The document base uses the filesystem to store documents and a relational database to store metadata such as the file type, source URL, etc.

Prerequisites

The document base runs only on Linux. Make sure you have installed Java 6, Maven 2 and MySQL:

sudo apt-get install sun-java6-jdk maven2 mysql-server

Installation

1. Get the source files.

svn co https://kyoto.let.vu.nl/svn/kyoto/trunk/tools/document-base2/tags/VERSION document-base2

Replace VERSION by the name of the latest version. A development snapshot can be found in the trunk/ subdirectory, but it is almost always preferable to use a tagged version.

2. Compile the Java sources:

cd document-base2
mvn package

3. Create a MySQL database and grant the privileges:

echo "CREATE DATABASE kyoto;" \
     "GRANT ALL PRIVILEGES ON kyoto.* TO kyoto@localhost;" \
     "FLUSH PRIVILEGES;" \
     | mysql -u root -p
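If the MySQL server should require a password for the kyoto account, the same statements can carry one. This is a hedged variant, not part of the official setup: 'SECRET' is a placeholder, and the IDENTIFIED BY clause follows the MySQL 5.x syntax of the time.

```shell
# Same setup as above, but granting with an explicit password
# ('SECRET' is a placeholder, not a project default):
mysql -u root -p <<'SQL'
CREATE DATABASE kyoto;
GRANT ALL PRIVILEGES ON kyoto.* TO kyoto@localhost IDENTIFIED BY 'SECRET';
FLUSH PRIVILEGES;
SQL
```

If a password is set here, the document base configuration must be given the same credentials.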

4. Initialize the database and install the configuration files:

./scripts/shell -c 'install database'

Usage

A document base command can be issued with:

./scripts/shell -c COMMAND

where COMMAND is replaced by the actual command. If the command contains a
space character, it must be quoted.
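The effect of the quotes can be demonstrated with a small stand-in: the stub function below takes the place of ./scripts/shell (purely an assumption, for illustration only) and reports how many words arrive after the -c switch.

```shell
# Stub standing in for ./scripts/shell (an assumption for
# illustration); it counts the words that follow -c.
shell() { echo "got $(($# - 1)) argument(s) after -c"; }

shell -c list resources     # unquoted: the command splits into two words
shell -c 'list resources'   # quoted: the command arrives as one word
```

With the real tool, the unquoted form would hand the shell only the first word of the command.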

The interactive shell can be started by:

./scripts/shell

A command issued in the interactive shell has the same effect as a command
issued with the -c switch.

A list of available commands is requested with:

help

in the interactive document base shell, or with:

./scripts/shell -c help

on the command line. More detailed help can be requested with:

help COMMAND

where COMMAND is the command of interest. For example, use:

help put

to get help using the 'put' command.

Example of use

This example shows how a corpus can be created and processed. It assumes a
set of KAF files is available; the goal is to create a term database.

First, the KAF files are added to the database:

find /path/to/kaf -name '*.kaf' -exec bash -c \
'./scripts/shell -c put {} --mime-type=kaf/wsd --uri={}' \;
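The same batch ingest can also be written as a plain loop. The sketch below substitutes a stub put_document function for the './scripts/shell -c put ...' call (an assumption, so the example stays self-contained) and demonstrates only the looping and quoting.

```shell
# Stub standing in for "./scripts/shell -c 'put ...'" (an assumption
# for illustration); it reports the command it would issue.
put_document() { echo "put $1 --mime-type=kaf/wsd --uri=$1"; }

# Create two sample KAF files in a scratch directory.
kafdir=$(mktemp -d)
printf '<KAF/>' > "$kafdir/a.kaf"
printf '<KAF/>' > "$kafdir/b.kaf"

# Quoting "$f" keeps file names with spaces intact.
for f in "$kafdir"/*.kaf; do
    put_document "$f"
done
```

Unlike the find variant, the loop does not spawn a new bash process per file.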

When done, the database contains a 'resource' for each document, and a KAF
file associated with each resource. See also the list of resources:

list resources

and the list of files associated with one of them:

resource 1 list files

The file contents can be requested once you know the file identifier:

cat 1

Note that the argument is a file identifier, not a resource identifier. A
file can have the same identifier as a resource.

For processing, we need a preconfigured pipeline. Let's see which pipelines
are available:

list pipelines

The pipeline 'pipeline:en:termex' transforms English KAF into an XML file
with terms and relations. We can assign this one to our database by creating
a processor which uses this pipeline:

create processor pipeline:en:termex
list processors
processor 1 info

The second command should show a list of processors, in our case only one
processor. When activated, this processor scans the database for files of
type 'kaf/wsd', and produces files of type 'application/x-termbase-xml'.
This information can be read from the output of the third command. Note that
the input type of this processor (kaf/wsd) matches the type of the files we
added to the database. This means that the processor will start working on
each of those files, as soon as it is activated:

processor 1 run

The processor processes the files one by one. When done, it waits for more
files to arrive; if none arrive, it waits indefinitely until it is closed.
Use Ctrl-C to cancel processing at any time. If processing is aborted before
it finishes, restarting the processor will complete the job.

While it ran, the processor added a term file for each resource. The
command:

resource 1 list files

now shows two files instead of one.

We need another processor to add the terms (which are now extracted for each
KAF file) to the term database:

create processor pipeline:termloading
processor 2 run

The second command starts the processor, which reads the terms from the XML
files and populates the term database. The resulting term database can be
inspected with:

list terms
list term relations

Both commands return tab-separated tables with the information in the term
database. A confidence value is assigned to each term, but at this point it
is always zero, because the pipeline which calculates the confidence values
has not yet run. Assign the confidence values with:

create processor pipeline:termconf
processor 3 run

and issue the term database queries once more. You will notice that the
confidence values have changed.
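Because both listings are tab-separated, they combine naturally with standard Unix tools. The sketch below fakes the 'list terms' output with a stub; the column layout (id, term, confidence) is an assumption for illustration, not the documented schema.

```shell
# Stub emitting tab-separated rows the way 'list terms' does; the
# column layout is an assumption for illustration only.
list_terms() {
    printf 'id\tterm\tconfidence\n'
    printf '1\tenvironment\t0.83\n'
    printf '2\tforest\t0.67\n'
}

# Extract just the term column, as one might from the real output:
list_terms | cut -f2
```

With the real tool, the equivalent would be piping './scripts/shell -c "list terms"' into cut, or redirecting it to a .tsv file.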


ICT-211423 - 2008 © Kyoto Consortium