Knowledge Yielding Ontologies
for Transition-based Organization

Multiword Recognition

Introduction

MultiWordTagger (MWT) is a module that reads KAF files and detects multiword terms listed in a generic wordnet and, optionally, a domain-specific wordnet, both in WN-LMF format. Words are represented as separate term elements in KAF, and the word-sense-disambiguation (WSD) module operates on one term at a time. When a sequence such as "human" and "right" is represented as two separate terms, the WSD module disambiguates each word individually. The MWT groups the two words into a single term, "human right", so that the WSD module can assign one specific, more informative meaning. The WSD module uses the same wordnets for assigning synsets.

Note that the current version of the program can only detect multiwords of adjacent elements. Disjoint multiwords are not detected.
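The restriction to adjacent elements can be illustrated with a minimal sketch. It assumes that lexicon entries are stored with underscores (as in the lemma land_development used in the KAF examples) and that the longest match at each position wins; the function name, the max_len limit and the longest-match strategy are illustrative assumptions, not a description of the actual MWT code.

```python
def find_multiwords(lemmas, lexicon, max_len=4):
    """Return (start, end, entry) spans for adjacent lemmas found in the lexicon.

    Assumption: entries use underscores (e.g. "human_right") and the longest
    match at each position wins; the real MWT may behave differently.
    """
    matches = []
    i = 0
    while i < len(lemmas):
        found = None
        # Try the longest candidate first, down to two adjacent lemmas.
        for n in range(min(max_len, len(lemmas) - i), 1, -1):
            candidate = "_".join(lemmas[i:i + n])
            if candidate in lexicon:
                found = (i, i + n, candidate)
                break
        if found:
            matches.append(found)
            i = found[1]  # continue after the matched span
        else:
            i += 1
    return matches


lexicon = {"human_right", "land_development"}
sentence = ["the", "human", "right", "of", "land", "development"]
print(find_multiwords(sentence, lexicon))
```

Because only contiguous slices of the lemma list are tested, a sequence such as "human", "civil", "right" yields no match, which mirrors the limitation on disjoint multiwords.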

What will the MultiWordTagger do to KAF?

Below is an example of a sequence of terms in KAF:

<term tid="t74" type="open" lemma="land" pos="N">
  <span><target id="w83"/></span>
</term>
<term tid="t75" type="open" lemma="development" pos="N">
  <span><target id="w84"/></span>
</term>
If the wordnet data file contains the multiword land development, these two terms are replaced by a single term:
<term tid="t75mw" type="open" lemma="land_development" pos="N">
  <span><target id="w83"/><target id="w84"/></span>
  <component id="t74" lemma="land" pos="N"/>
  <component id="t75" lemma="development" pos="N"/>
</term>
We create a new term identifier for the multiword and let its span point to the word forms of the elements. The original elements are included as term components. This output is compatible with the way compounds are represented in KAF. The next example shows the Dutch compound term natuurbeschermingsovereenkomst, which the linguistic processor (LP) splits into three components using general rules and a general lexicon:
<term head="t6.35.2" lemma="natuurbeschermingsovereenkomst" pos="N.noun" tid="t6.35" type="open">
  <span><target id="w6.35"/></span>
  <component id="t6.35.0" lemma="natuur"/>
  <component id="t6.35.1" lemma="bescherming"/>
  <component id="t6.35.2" lemma="overeenkomst"/>
</term>
As a result, the WSD module can treat compounds and multiwords in the same manner.
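The transformation of two adjacent terms into one multiword term can be sketched as follows. This is a minimal illustration using Python's standard xml.etree.ElementTree, not the Java implementation of MWT. The wrapping terms element, the function name merge_terms, and the rule that the new identifier is the head identifier plus "mw" (inferred from the t75mw example only) are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumption: the terms are wrapped in a <terms> element for this sketch.
KAF_TERMS = """<terms>
<term tid="t74" type="open" lemma="land" pos="N"><span><target id="w83"/></span></term>
<term tid="t75" type="open" lemma="development" pos="N"><span><target id="w84"/></span></term>
</terms>"""


def merge_terms(terms_elem, tids, head_tid, pos):
    """Replace the terms listed in tids by one multiword term (in place)."""
    children = list(terms_elem)
    parts = [t for t in children if t.get("tid") in tids]
    position = children.index(parts[0])

    mw = ET.Element("term", {
        "tid": head_tid + "mw",  # assumed scheme, copied from the example
        "type": parts[0].get("type"),
        "lemma": "_".join(t.get("lemma") for t in parts),
        "pos": pos,
    })
    span = ET.SubElement(mw, "span")
    for t in parts:
        # The multiword span points to the word forms of all elements.
        for target in t.iter("target"):
            ET.SubElement(span, "target", {"id": target.get("id")})
    for t in parts:
        # The original terms survive only as components of the new term.
        ET.SubElement(mw, "component", {
            "id": t.get("tid"), "lemma": t.get("lemma"), "pos": t.get("pos")})
        terms_elem.remove(t)
    terms_elem.insert(position, mw)
    return mw


terms = ET.fromstring(KAF_TERMS)
mw = merge_terms(terms, ["t74", "t75"], head_tid="t75", pos="N")
print(ET.tostring(mw, encoding="unicode"))
```

The printed element should carry the same attributes, span targets and components as the land_development example, although the serializer may differ in whitespace.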

In addition to detecting the phrase, the MWT needs to know the head of the multiword phrase, given the part of speech (POS) of the multiword as stored in wordnet: noun, verb or adjective. To detect the head of a phrase, the MWT uses patterns that are specific to each language. Each pattern starts with the POS of the multiword unit, followed by a colon, followed by another POS or the string first or last. Here are two examples of these patterns, for English and for Spanish:

pattern for English
lang=en
N:P
N:last
V:last
G:last
pattern for Spanish
lang=es
N:first
V:first
G:first
For English (en), the first pattern states that for noun multiwords that include a preposition (POS=P), the head is the last term with POS N preceding the preposition. The second pattern states that for other multiwords with POS=N the last N is the head; it applies only if the first pattern does not. For verbs (V) and adjectives (G), the last term is taken as the head. For Spanish (es), the first element is the head in all cases.
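The way these patterns select a head can be sketched as below. The function names are hypothetical, and one point is an assumption rather than documented behaviour: for first and last, the sketch prefers components that share the POS of the multiword and only falls back to the plain first or last component when none does.

```python
def parse_patterns(lines):
    """Turn lines such as "N:P" into (multiword_pos, rule) pairs, in order."""
    return [tuple(line.split(":", 1)) for line in lines]


def find_head(mw_pos, term_pos, patterns):
    """Return the index of the head term, or None if no pattern applies.

    mw_pos   -- POS of the multiword as stored in wordnet, e.g. "N"
    term_pos -- POS tags of the component terms, in text order
    """
    for pos, rule in patterns:
        if pos != mw_pos:
            continue
        if rule in ("first", "last"):
            # Assumption: prefer components that share the multiword POS.
            same = [i for i, p in enumerate(term_pos) if p == mw_pos]
            candidates = same or list(range(len(term_pos)))
            return candidates[0] if rule == "first" else candidates[-1]
        if rule in term_pos:
            # The rule is a POS tag that ends the search, e.g. a preposition.
            stop = term_pos.index(rule)
            before = [i for i in range(stop) if term_pos[i] == mw_pos]
            if before:
                return before[-1]
        # This pattern did not match; try the next one in the listed order.
    return None


english = parse_patterns(["N:P", "N:last", "V:last", "G:last"])
spanish = parse_patterns(["N:first", "V:first", "G:first"])

print(find_head("N", ["N", "P", "N"], english))  # noun before the preposition
print(find_head("N", ["N", "N"], english))       # falls through to N:last
print(find_head("N", ["N", "G"], spanish))       # first element
```

Checking the patterns in their listed order and stopping at the first match keeps the more specific N:P rule ahead of the general N:last rule.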

The POS tags must match the standard POS tags defined in the KAF XSD, as described in https://kyoto.let.vu.nl/svn/kyoto/trunk/doc/user/KAF/kaf.pdf

Once the head of the multiword is detected, the program creates a new term for the multiword with a new unique identifier and represents the elements as components. Because other layers in KAF refer to the original terms, the MWT must also fix those references. It removes the original terms as term elements in KAF and updates the references to the original term identifiers as follows:

  1. All references to the original term that is now the head of the multiword term are replaced by references to the new multiword identifier. This applies to all chunks and dependencies in KAF.
  2. All references to the other elements in the multiword term are removed:
    • Chunks: adapt the span so that it refers only to the multiword and not to its elements, and replace the head reference with a reference to the multiword
    • Dependencies: remove dependencies in which the non-head elements are involved
Note that only the chunks and dependencies are adapted by MWT. Any other layers added to KAF by other modules are not adapted. It is therefore wise to apply MWT directly after the creation of KAF by the morpho-syntactic module, before any other layer of KAF is added.
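The reference fixing described above can be sketched on simplified data. The dictionaries below are stand-ins for KAF chunk and dependency elements; the keys (cid, head, span, from, to, rfunc) and the function name are assumptions made for illustration, and the sketch is not the MWT implementation.

```python
def fix_references(chunks, deps, head_tid, other_tids, mw_tid):
    """Rewrite chunk and dependency references after a multiword merge.

    chunks -- list of {"head": tid, "span": [tid, ...], ...}
    deps   -- list of {"from": tid, "to": tid, ...}
    """
    others = set(other_tids)
    members = others | {head_tid}

    new_chunks = []
    for chunk in chunks:
        span = []
        for tid in chunk["span"]:
            if tid in members:
                # All elements collapse into one reference to the multiword.
                if mw_tid not in span:
                    span.append(mw_tid)
            else:
                span.append(tid)
        head = mw_tid if chunk["head"] in members else chunk["head"]
        new_chunks.append({**chunk, "head": head, "span": span})

    def rename(tid):
        return mw_tid if tid == head_tid else tid

    new_deps = [
        {**dep, "from": rename(dep["from"]), "to": rename(dep["to"])}
        for dep in deps
        # Dependencies that involve a non-head element are removed.
        if dep["from"] not in others and dep["to"] not in others
    ]
    return new_chunks, new_deps


chunks = [{"cid": "c1", "head": "t75", "span": ["t73", "t74", "t75"]}]
deps = [
    {"from": "t75", "to": "t74", "rfunc": "mod"},
    {"from": "t76", "to": "t75", "rfunc": "obj"},
]
new_chunks, new_deps = fix_references(chunks, deps, "t75", ["t74"], "t75mw")
print(new_chunks)
print(new_deps)
```

In this example the dependency between the two elements of the multiword disappears, while the dependency from t76 is redirected to the new multiword identifier.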

Download and installation

MWT version 01 (02-April-2010) can be downloaded from: mwtagger.v.01.zip

 

Installation instructions

Installation requirements

MultiWordTagger was developed in Java 1.6 and compiled on Mac OS X. It runs on any platform that has Java 1.6 installed. No installation steps are required beyond copying the directory structure as is. You may need to edit the configuration file so that it points to the proper WN-LMF file and uses the correct language patterns.

 

Installation structure

- conf
  - mwtagger.basque.cfg
  - mwtagger.spanish.cfg
  - mwtagger.english.cfg
  - mwtagger.dutch.cfg
- doc
  - kyoto_wn.dtd
- lib
  - mwtagger.jar
- resources

Integration in KYOTO pipeline

The eu.kyotoproject.multiwordtagger.MultiwordTaggerModule class should be used to run MWT as a module within the KYOTO PipeT architecture. Within the standard KYOTO pipeline, the MultiWordTagger operates on the KAF generated by the Linguistic Processors (LPs), before word-sense disambiguation takes place. As a pipeline module in KYOTO, MWT takes kaf/lp as an input stream and generates kaf/mw as an output stream for any KAF document in the document base to which the MWT is added as a processor. The MWT module takes the path to a configuration file as a configuration value in the constructor. This path is specified through the pipeline configuration option (see the documentation on PipeT). The configuration file contains the patterns for a specified language (see above) and the paths to the wordnet lexicons in WN-LMF format containing the multiwords, for example:
# last or first
# any pos tag that marks post head position, e.g. for English a preposition P terminates the search for the head so that the last N before P becomes the head
# patterns are checked in the listed order
# first matching pattern applies
lang=en
generic_wn_lmf=/Projects/Kyoto/Data/mwtagger/resources/wnen3.xml.lmf
domain_wn_lmf=/Projects/Kyoto/Data/mwtagger/resources/wneng_domain_LMF_v3.xml
N:P
N:last
V:last
G:last
Up to two wordnet files in WN-LMF format containing the multiwords can be specified. If no multiword lexicons are found, the program aborts and does not generate output. If no patterns are specified, the MWT takes the last word with the same POS as the head. Through the configuration file, MWT can be set to run on different languages and with different WN-LMF files. Specify the correct absolute (!) paths to the WN-LMF files for running MWT. You may also need to check the patterns against the POS codes used in KAF.
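To make the layout of the configuration file concrete, the following sketch reads such a file into settings and an ordered pattern list. It only illustrates the format shown above (comment lines starting with #, key=value settings, and POS:rule patterns); the function names are hypothetical and the parsing rules are assumptions, since the actual Java parser is not documented here.

```python
def parse_config(text):
    """Split an MWT-style configuration into settings and ordered patterns."""
    settings, patterns = {}, []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        if "=" in line:
            key, value = line.split("=", 1)
            settings[key.strip()] = value.strip()
        elif ":" in line:
            pos, rule = line.split(":", 1)
            patterns.append((pos.strip(), rule.strip()))
    return settings, patterns


def wordnet_paths(settings):
    """Return the configured WN-LMF paths, generic wordnet first."""
    keys = ("generic_wn_lmf", "domain_wn_lmf")
    return [settings[k] for k in keys if settings.get(k)]


CONFIG = """\
# patterns are checked in the listed order
lang=en
generic_wn_lmf=/Projects/Kyoto/Data/mwtagger/resources/wnen3.xml.lmf
domain_wn_lmf=/Projects/Kyoto/Data/mwtagger/resources/wneng_domain_LMF_v3.xml
N:P
N:last
V:last
G:last
"""

settings, patterns = parse_config(CONFIG)
print(settings["lang"], wordnet_paths(settings), patterns)
```

Keeping the patterns in a list rather than a dictionary preserves the listed order, which matters because the first matching pattern applies.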

 

To run MWT as a standalone program on KAF files on disk

The eu.kyotoproject.multiwordtagger.MultiTaggerTest class can be used to run the tagger as a standalone application on any set of KAF files on disk. MultiTaggerTest takes two arguments:
  1. the full path to the configuration file
  2. the full path to a folder that contains the KAF files
Below is an example of how to call the MWT test class on a folder containing KAF documents:
java -Xmx512m -cp ./lib/kaf.jar:./lib/mwtagger.jar eu.kyotoproject.multiwordtagger.MultiTaggerTest "/Projects/Kyoto/mwtagger.v01/conf/mwtagger.english.cfg" "/Projects/Kyoto/Data/Estuaries/English" 
This call uses the configuration file for English and processes all files with the extension .kaf in the English folder. It creates a new folder, English_lp+mw, to store the KAF files with multiword annotation.

License

MWT is copyright of Piek Vossen. It is released as an open source module under the GNU GPL.

Contact

Send questions and bug reports to Piek Vossen (p.vossen at let.vu.nl), VU University Amsterdam, The Netherlands.
 

ICT-211423 - 2008 © Kyoto Consortium