Knowledge Yielding Ontologies
for Transition-based Organization

Tybot

While we acknowledge that some words are more relevant to a domain than others, we consider any syntactic unit a potential term. Rather than focusing on extracting the most relevant terms, we try to establish a view of the domain's terminology that is as complete as possible. Since an essential part of the meaning of a term is defined by its relations to other terms, discovering relations is as important to our goal as ranking terms by relevance. Once we have extensive knowledge of how the terms relate to each other, we are also better able to judge the domain relevance of a term.

After a domain-relevance score is assigned, the list of terms can be reduced as desired by setting a threshold to filter out the least relevant terms.
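
The thresholding step can be illustrated with a minimal sketch. The function name `filter_terms`, the score values, and the threshold of 0.5 are assumptions made for illustration; the text above does not specify how Tybot computes or scales its scores.

```python
from typing import Dict


def filter_terms(scored_terms: Dict[str, float], threshold: float) -> Dict[str, float]:
    """Keep only the terms whose domain-relevance score reaches the threshold."""
    return {term: score for term, score in scored_terms.items() if score >= threshold}


# Hypothetical scores, invented for this example.
scores = {"invasive species": 0.91, "species": 0.74, "page": 0.08}
kept = filter_terms(scores, threshold=0.5)
```

Raising the threshold shortens the list; lowering it keeps the more complete view of the terminology described above.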

Before term extraction, we perform tokenization, part-of-speech tagging, lemmatization, dependency parsing and word-sense disambiguation. This produces all the morpho-syntactic information required, which is stored in KAF. As a result, the input to the term extraction process is a set of KAF files that contain the following levels of annotation:

  • Tokenization. Tokens are grouped by page, paragraph and sentence.
  • Lemmatization. A lemma and a part of speech are assigned to each single-word or multi-word unit. References to tokens are inserted as well. Wordnet senses are assigned to lemmas where possible.
  • Constituents. Phrases such as noun phrases and prepositional phrases are identified, with pointers to the lemmas which constitute them. Also, the head of the phrase is marked.
  • Dependencies. Lemmas have dependency relations to other lemmas. The relation type (subject, object, etc.) is also identified.
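
To make the four annotation levels concrete, the sketch below models them as plain Python data classes. This is not the KAF schema: all class names, field names, identifiers and the sense label are hypothetical and chosen only to mirror the description in the list above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Token:
    id: str
    text: str
    page: int
    paragraph: int
    sentence: int


@dataclass
class Term:
    id: str
    lemma: str
    pos: str
    token_ids: List[str]                              # references back to tokens
    senses: List[str] = field(default_factory=list)   # wordnet senses, if any


@dataclass
class Chunk:
    id: str
    phrase_type: str       # e.g. noun phrase or prepositional phrase
    term_ids: List[str]    # pointers to the lemmas that make up the phrase
    head_id: str           # the marked head of the phrase


@dataclass
class Dependency:
    relation: str          # e.g. subject, object, modifier
    from_id: str
    to_id: str


def phrase_lemmas(chunk: Chunk, terms: Dict[str, Term]) -> str:
    """Join the lemmas a chunk points to, in order."""
    return " ".join(terms[term_id].lemma for term_id in chunk.term_ids)


tokens = [Token("w1", "invasive", 1, 1, 1), Token("w2", "species", 1, 1, 1)]
terms = {
    "t1": Term("t1", "invasive", "adjective", ["w1"]),
    "t2": Term("t2", "species", "noun", ["w2"], senses=["species-sense-1"]),
}
chunk = Chunk("c1", "NP", ["t1", "t2"], head_id="t2")
dependency = Dependency("modifier", from_id="t2", to_id="t1")
```

Here `phrase_lemmas(chunk, terms)` recovers the candidate term from the chunk, and `terms[chunk.head_id]` gives its head.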

The language-neutral nature of KAF allows us to keep any processing from this point on language-neutral.

Because all words in the source documents are linked to the wordnet of the corresponding language, the extracted candidate terms are also linked to wordnet, either directly or through hypernym relations. Since the wordnets are mapped to the English wordnet, the majority of extracted candidate terms also have a hypernym that is linked to its equivalents in other languages. For instance, the term invasive species is linked to species on the basis of its morpho-syntactic structure. The term species is in wordnet and is linked to foreign equivalents of the term (e.g. soort in Dutch).
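
The invasive species example can be traced with a small sketch. It assumes a toy lookup table in place of a real wordnet: the names `WORDNET_EN`, `EQUIVALENTS` and `link_term`, and the sense label `species-sense-1`, are invented for illustration and do not reflect Tybot's actual interface.

```python
from typing import Dict, List, Optional, Tuple

# Toy stand-ins for a wordnet and its cross-lingual mapping (hypothetical data).
WORDNET_EN: Dict[str, str] = {"species": "species-sense-1"}
EQUIVALENTS: Dict[str, Dict[str, str]] = {"species-sense-1": {"nl": "soort"}}


def link_term(
    lemmas: List[str], head_index: int, wordnet: Dict[str, str]
) -> Optional[Tuple[str, str]]:
    """Link a term to wordnet directly, or through its head as a hypernym."""
    full_term = " ".join(lemmas)
    if full_term in wordnet:
        return wordnet[full_term], "direct"
    head = lemmas[head_index]
    if head in wordnet:
        # The head of the phrase is treated as the hypernym of the whole term.
        return wordnet[head], "hypernym"
    return None


link = link_term(["invasive", "species"], head_index=1, wordnet=WORDNET_EN)
```

With this toy data, the multi-word term is not found directly, so it is linked through its head species, whose entry in `EQUIVALENTS` then yields the Dutch soort.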

 

Download and installation

Tybot is packaged with the document base.

 

Example of usage

See the README file shipped with the document base.

 

ICT-211423 - 2008 © Kyoto Consortium