KYOTO Logo
Knowledge Yielding Ontologies
for Transition-based Organization

Terminology extraction: TYBOT

A wealth of knowledge is hidden in language. This is reflected by the terminology used by people to convey information. If we can understand the terminology, we can begin understanding the language.

Considerable effort goes into building a complete and reliable picture of the terminology of different languages, as well as of specific domains within languages. Examples of manual efforts are "generic" thesauruses, which aim to capture a language as a whole (e.g., wordnets; Fellbaum [1998], Vossen [2004]), and thesauruses and taxonomies that focus on a specific domain (such as the General Multi-lingual Environmental Thesaurus, GEMET; Felluga and Plini [2000]). An obvious problem of generic thesauruses is their lack of coverage: no thesaurus can cover all conceivable domains adequately.

On the other hand, building a thesaurus for each individual domain is laborious. Several lines of research try to alleviate this problem by capturing the terminology of a domain automatically, each tackling a different challenge.
 
Domain term identification: Methods such as pointwise mutual information have been developed to distinguish terms specific to the domain from "generic" terms.
Relation identification: Perhaps more important than identifying the terms in a domain is finding the relations between those terms, such as hyponymy, hypernymy, equivalence, and meronymy. Methods for identifying relations include those of Hearst [1992] and van der Plas [2008].
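
As an illustration of relation identification, the following is a minimal sketch of a single lexico-syntactic pattern in the spirit of Hearst [1992] ("X such as A, B and C"). It is not the method used by the KYOTO software. The function name hyponym_pairs and the example sentence are made up for this illustration, and noun phrases are approximated by single words.

```python
import re

# One Hearst-style pattern: "HYPERNYM such as A, B and C".
# Assumption: each noun phrase is a single word. A real system would
# match over chunked noun phrases instead of raw text.
SUCH_AS = re.compile(
    r"(?P<hyper>\w+),? such as "
    r"(?P<hypos>\w+(?:, (?!and |or )\w+)*(?:,? (?:and|or) \w+)?)"
)


def hyponym_pairs(sentence):
    """Return (hyponym, hypernym) pairs found by the 'such as' pattern."""
    pairs = []
    for match in SUCH_AS.finditer(sentence):
        hypernym = match.group("hyper")
        # Split the enumeration on commas and on "and" / "or".
        hyponyms = re.split(r",? (?:and|or) |, ", match.group("hypos"))
        pairs.extend((hyponym, hypernym) for hyponym in hyponyms if hyponym)
    return pairs


if __name__ == "__main__":
    example = "Emissions of pollutants such as ozone, methane and soot are rising."
    for hyponym, hypernym in hyponym_pairs(example):
        print(hyponym, "is a kind of", hypernym)
```

A single pattern like this has high precision but low recall, so it is normally combined with further patterns and with linguistic annotation.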

We argue that automatic and manual efforts need not be alternatives. Generic thesauruses provide the basic terminology of a language, while automatic methods can provide a domain extension. Since the generic terminology of a language is presupposed (and the generic terms are therefore known in advance), distinguishing domain terms from generic terms is not important. Distinguishing (domain-relevant) terms from non-terms is.
 
Note that there is a subtle but important difference between domain-specific terms and domain-relevant terms: while each domain-specific term is also domain-relevant, a term that is relevant to the domain may not be specific to it. For instance, the word traffic is relevant to the environmental domain, but it is also used in general language. Note also that the line between domain-specific and non-domain-specific terms is hard to draw. Usually, automatic methods consider a term domain-specific if it is used substantially more frequently in a domain corpus than in a reference corpus (which should be domain-neutral).
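
The frequency criterion described above can be sketched as a ratio of relative frequencies. This is only an illustration under stated assumptions: the function names, the threshold min_ratio, the smoothing constant, and the toy corpora are all hypothetical and are not taken from the KYOTO software.

```python
from collections import Counter


def relative_frequencies(tokens):
    """Map each term to its share of all tokens in the corpus."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def domain_specific_terms(domain_tokens, reference_tokens,
                          min_ratio=2.0, smoothing=1e-6):
    """Return terms whose relative frequency in the domain corpus is at
    least min_ratio times their relative frequency in the reference corpus.

    min_ratio and smoothing are illustrative values (assumptions); the
    smoothing term avoids division by zero for terms that never occur
    in the reference corpus.
    """
    domain = relative_frequencies(domain_tokens)
    reference = relative_frequencies(reference_tokens)
    scores = {
        term: freq / (reference.get(term, 0.0) + smoothing)
        for term, freq in domain.items()
    }
    return {term: score for term, score in scores.items() if score >= min_ratio}


if __name__ == "__main__":
    # Toy corpora, invented for this example.
    domain_corpus = "traffic emission emission wetland the the".split()
    reference_corpus = "traffic the the the football the".split()
    print(sorted(domain_specific_terms(domain_corpus, reference_corpus)))
```

In this toy setting, traffic is equally frequent in both corpora and is therefore not selected, even though it is relevant to the domain. This is exactly the gap between domain-specific and domain-relevant terms.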

Connections with other KYOTO modules

Documents annotated in KAF (Kyoto Annotation Format) are used as input for terminology extraction. KAF is a multi-layered annotation format. Terminology extraction uses part-of-speech tags, chunks, and, optionally, wordnet references. The part-of-speech tags and chunks are added by the language-specific processors. Wordnet references are added by the language-neutral word sense disambiguation module.
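
To make the layered input more concrete, here is a minimal sketch that reads term candidates from a small KAF-like document. The element and attribute names in the sample (wf, term, chunk, target, lemma, phrase and so on) are assumptions made for this illustration, loosely modeled on the layers described above. They should not be read as the KAF specification, and the function noun_phrase_candidates is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, heavily simplified KAF-like fragment (an assumption for
# illustration only): a word layer, a term layer with lemma and
# part-of-speech, and a chunk layer that groups terms into phrases.
SAMPLE = """<KAF>
  <text>
    <wf wid="w1">air</wf>
    <wf wid="w2">pollution</wf>
  </text>
  <terms>
    <term tid="t1" lemma="air" pos="N"><span><target id="w1"/></span></term>
    <term tid="t2" lemma="pollution" pos="N"><span><target id="w2"/></span></term>
  </terms>
  <chunks>
    <chunk cid="c1" head="t2" phrase="NP">
      <span><target id="t1"/><target id="t2"/></span>
    </chunk>
  </chunks>
</KAF>"""


def noun_phrase_candidates(kaf_xml):
    """Return the lemmatized text of every noun-phrase chunk."""
    root = ET.fromstring(kaf_xml)
    # Index the term layer by term identifier.
    lemmas = {term.get("tid"): term.get("lemma") for term in root.iter("term")}
    candidates = []
    for chunk in root.iter("chunk"):
        if chunk.get("phrase") != "NP":
            continue
        term_ids = [target.get("id") for target in chunk.iter("target")]
        candidates.append(" ".join(lemmas[term_id] for term_id in term_ids))
    return candidates


if __name__ == "__main__":
    print(noun_phrase_candidates(SAMPLE))
```

The point of the sketch is the separation of layers: chunks refer to terms, and terms refer to word forms, so each processor can add its own layer without rewriting the others.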

The extracted terminology is used by the term editor for further processing.

Online demos

The terminology extraction software comes with a terminology explorer which allows you to browse and explore terms automatically extracted from a set of documents.

An online demo is available to browse the automatically extracted terminology of a selected set of documents (English only).


References

C. Fellbaum, editor. WordNet. An Electronic Lexical Database. The MIT Press, 1998.

B. Felluga and P. Plini. Presentation of GEMET thesaurus. In Open Forum on Metadata Registries, 2000.

M. Hearst. Automatic acquisition of hyponyms from large text corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics, 1992.

L. van der Plas. Automatic lexico-semantic acquisition for question answering. PhD thesis, Rijksuniversiteit Groningen, 2008.

P. Vossen. EuroWordNet: a multilingual database of autonomous and language-specific wordnets connected via an Inter-Lingual-Index. International Journal of Lexicography, 17(2), 2004.

 

ICT-211423 - 2008 © Kyoto Consortium