Knowledge Yielding Ontologies
for Transition-based Organization

The KYOTO Annotation Format

KAF (also known as Knowledge Annotation Format) is a language-neutral annotation format that represents both the morpho-syntactic and the semantic annotation of documents through a stand-off, multilayered structure (Bosma, 2009).

KAF has been defined and developed as an open XML reference format for the representation of knowledge and facts, encapsulating the following annotations:

  • Tokenization and word form segmentation;
  • POS tagging;
  • Lemmatization and Term Extraction;
  • MultiWord Extraction;
  • Constituency and Dependency Tagging;
  • Named Entity Recognition (NER);
  • Word Sense Disambiguation (WSD);
  • Ontology mapping;
  • Semantic Role Labeling (SRL).

KAF is a multilayered representation of the text as it occurs as a sequence of words. The following layers are distinguished:

1. Sequences of words, sentences and pages;
2. Sequence of terms and multiwords;
3. Sequence of constituent chunks;
4. Sequence of syntactic roles;
5. Sequence of semantic roles.

These layers are interconnected through identifiers, so that each level of analysis can be related to the next, both at the syntactic levels (tokens, words, lemmas, terms) and at the semantic levels (synsets, roles, quantifier detection, temporal relations, etc.).

KAF adopts a stand-off strategy for annotating the source text:

  • <span> elements are used for grouping linguistic elements.
  • Linguistic annotations of a particular level always span elements of previous levels.
  • Linguistic annotations of different levels are not mixed.

Thus KAF is compatible with LAF (the Linguistic Annotation Framework), even though it imposes a more specific standardization of the annotation format itself.
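The stand-off principle can be pictured with plain Python dictionaries. This is only a sketch: the identifiers (w1, t1, c1) and the three dictionaries are illustrative assumptions, not part of the KAF specification. It shows that each layer refers only to identifiers of the layer below it, and never to the raw text directly.

```python
# Hypothetical in-memory view of three stand-off layers; all ids are illustrative.
word_forms = {"w1": "Tropical", "w2": "terrestrial", "w3": "species", "w4": "populations"}

# Each term points at word form ids, never at the raw text.
terms = {"t1": ["w1"], "t2": ["w2"], "t3": ["w3"], "t4": ["w4"]}

# Each chunk points at term ids only.
chunks = {"c1": ["t1", "t2", "t3", "t4"]}


def chunk_text(chunk_id: str) -> str:
    """Resolve a chunk to surface text by following ids down one layer at a time."""
    words = []
    for term_id in chunks[chunk_id]:
        for word_id in terms[term_id]:
            words.append(word_forms[word_id])
    return " ".join(words)


print(chunk_text("c1"))
```

Because no layer copies text from another, an annotation at one level can be replaced or extended without touching the levels beneath it.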

 

We can distinguish three macro-layers of annotation in KAF (see Fig. 1):

          

  • the morpho-syntactic layer: groups all the language-specific textual annotations. Tokens, sentences and paragraphs are identified in a specific document. Terms, made of words or multiwords, are marked along with their POS. This layer also represents functional dependencies, as well as chunks (constituents and phrases);
  • the level-1 semantic layer: includes the linear annotation of expressions of time, events, quantities and locations;
  • the level-2 semantic layer: mainly devoted to representing facts in a non-linear annotation context, possibly aggregating evidence from the lower layers of multiple textual sources.


 

We will describe the different KAF annotation levels using the sentence

“Tropical terrestrial species populations declined by 55 per cent on average from 1970 to 2003.”

as a running example.

 Word forms

After the tokenization step, all word forms are annotated within the <text> element, and each form is enclosed in a <wf> element.

<wf> elements have the following attributes:

 

  • wid: the unique id of the word form
  • sent: sentence id of the token (optional)
  • para: paragraph id (optional)
  • offset: the offset of the word form (optional)
  • length: the length of the original word form (optional)
  • page: page id (optional)
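As a rough illustration, the text layer for the running example could be produced as follows. This is a sketch under stated assumptions: whitespace splitting stands in for a real tokenizer (so the final period stays attached to "2003."), the id patterns w1 and s1 are invented for the example, and offset is taken to be a character offset into the sentence.

```python
import xml.etree.ElementTree as ET

SENTENCE = ("Tropical terrestrial species populations declined by 55 per cent "
            "on average from 1970 to 2003.")


def build_text_layer(sentence: str) -> ET.Element:
    """Split on whitespace (a naive stand-in for a tokenizer) and emit <wf> elements."""
    text = ET.Element("text")
    cursor = 0
    for index, token in enumerate(sentence.split(), start=1):
        # Search from the cursor so repeated tokens get their own offsets.
        offset = sentence.index(token, cursor)
        wf = ET.SubElement(text, "wf", {
            "wid": f"w{index}",          # assumed id pattern
            "sent": "s1",                # assumed sentence id
            "offset": str(offset),
            "length": str(len(token)),
        })
        wf.text = token
        cursor = offset + len(token)
    return text


text_layer = build_text_layer(SENTENCE)
print(ET.tostring(text_layer[0], encoding="unicode"))
```

Only wid is mandatory in the attribute list above; the sketch fills in sent, offset and length to show how the optional attributes tie each word form back to its position in the source.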

 

Terms

Terms refer to previous word forms (grouping the word forms of a multiword where needed) and attach lemma, part of speech, synset and named entity information.

<term> elements have the following attributes:

  • tid: unique identifier
  • type: type of the term. Currently, 3 values are possible:
      o open: open-category term
      o close: closed-category term
      o entity: the term is a named entity
  • lemma: lemma of the term
  • pos: part of speech

The first letter of the pos attribute must be one of the following:

N  common noun
R  proper noun
G  adjective
V  verb
P  preposition
A  adverb
C  conjunction
D  determiner
O  other

More complex pos attributes may be formed. For more detailed information please refer to [2].

  • netype: the type of the entity, used only if type="entity". The value must be one of "date", "number", "person", "location", "company" or "time".
  • <externalReferences> elements are used to associate terms with external resources, such as elements of a knowledge base or an ontology. Each consists of several <externalRef> elements, one per association.

    <externalRef> elements have the following attributes:

      o resource (required): the identifier of the resource referred to.
      o reference (required): the code of the referred element. If the element is a synset of some version of WordNet, the code is a string of four fields separated by dashes:
          - language code (three characters, lowercase);
          - WordNet version (two digits);
          - synset identifier, composed of digits;
          - POS character (n noun, v verb, a adjective, r adverb).
      o reftype (optional): the kind of relation the externalRef expresses. Within Kyoto, the reftype attribute has values like 'sc DomainOf', 'sc SubclassOf', etc. An empty reftype indicates a direct relationship.
      o status (optional): the status of the relationship.
      o source (optional): the name of the process that created the external reference.
      o confidence (optional): the confidence weight of the association.

  • <component> elements may be used to annotate compound terms. They have the following attributes:
      - id: unique identifier
      - lemma: lemma of the term
      - pos: part of speech
      - case: declension case
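The four-field WordNet reference code described for external references lends itself to a small parser. The following is a sketch, not an official validator: the SynsetRef type and the function name are hypothetical, the "three characters" of the language code are assumed to be letters, and the synset id in the usage line is made up.

```python
import re
from typing import NamedTuple


class SynsetRef(NamedTuple):
    language: str     # three lowercase characters, e.g. "eng"
    wn_version: str   # two digits
    synset_id: str    # digits only
    pos: str          # n, v, a or r


# language - version - synset digits - POS character
_REFERENCE = re.compile(r"([a-z]{3})-(\d{2})-(\d+)-([nvar])")


def parse_synset_reference(reference: str) -> SynsetRef:
    """Split a reference code into its four dash-separated fields, or raise ValueError."""
    match = _REFERENCE.fullmatch(reference)
    if match is None:
        raise ValueError(f"not a four-field synset reference: {reference!r}")
    return SynsetRef(*match.groups())


ref = parse_synset_reference("eng-30-00000001-n")  # made-up synset id
print(ref.language, ref.wn_version, ref.synset_id, ref.pos)
```

Keeping the fields as strings preserves leading zeros in the synset identifier, which would be lost if it were converted to an integer.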

Dependencies

Dependencies represent dependency relations among terms. Each dependency is represented by an empty <dep> element and spans previous terms.

<dep> elements have the following attributes:

  • from: term id of the source element
  • to: term id of the target element
  • rfunc: relational function. One of:
      o mod: indicates the word introducing the dependent in a head-modifier relation.
      o subj: indicates the subject in the Subject-Predicate grammatical relation.
      o csubj, xsubj, ncsubj: the grammatical relations (GRs) csubj and xsubj may be used for clausal subjects, controlled from within or from without, respectively. ncsubj is a non-clausal subject.
      o dobj: indicates the object in the grammatical relation between a predicate and its direct object.
      o iobj: the relation between a predicate and a non-clausal complement introduced by a preposition; type indicates the preposition introducing the dependent.
      o obj2: the relation between a predicate and the second non-clausal complement in ditransitive constructions.
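Since every <dep> element points at term ids, a consumer may want to check that those ids exist and that rfunc is one of the listed values. The sketch below makes several assumptions: the <deps> wrapper element, the term ids t1 to t5 and the particular relations are invented for illustration, and the third dependency is deliberately broken.

```python
import xml.etree.ElementTree as ET

RFUNC_VALUES = {"mod", "subj", "csubj", "xsubj", "ncsubj", "dobj", "iobj", "obj2"}

# Illustrative fragment; the <deps> wrapper and all ids are assumptions.
KAF_FRAGMENT = """
<deps>
  <dep from="t5" to="t4" rfunc="subj"/>
  <dep from="t4" to="t3" rfunc="mod"/>
  <dep from="t5" to="t99" rfunc="dobj"/>
</deps>
"""


def check_deps(deps: ET.Element, term_ids: set) -> list:
    """Return a list of human-readable problems found in the <dep> elements."""
    problems = []
    for dep in deps.iter("dep"):
        for attr in ("from", "to"):
            if dep.get(attr) not in term_ids:
                problems.append(f"{attr}={dep.get(attr)!r} is not a known term id")
        if dep.get("rfunc") not in RFUNC_VALUES:
            problems.append(f"unknown rfunc {dep.get('rfunc')!r}")
    return problems


known_terms = {"t1", "t2", "t3", "t4", "t5"}
problems = check_deps(ET.fromstring(KAF_FRAGMENT), known_terms)
print(problems)
```

Collecting problems in a list, rather than raising on the first one, lets a single pass report every dangling reference in a document.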

Chunks

Chunks are phrases, such as noun or prepositional phrases, that span terms.

<chunk> elements have the following attributes:

  • cid: unique identifier
  • head: the term id of the chunk's head
  • phrase: type of the phrase. Valid values for the phrase attribute are:

    NP  noun phrase
    VP  verbal phrase
    PP  prepositional phrase
    S   sentence
    O   other

  • case (optional): declension case
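The chunk attributes map naturally onto a small record type. This is a minimal sketch: the Chunk class is hypothetical, the ids c1 and t4 are illustrative, and the terms that the chunk spans are left out because the attribute list above does not cover them.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import Optional

PHRASE_TYPES = {"NP", "VP", "PP", "S", "O"}


@dataclass
class Chunk:
    cid: str                      # unique identifier
    head: str                     # term id of the chunk's head
    phrase: str                   # one of PHRASE_TYPES
    case: Optional[str] = None    # optional declension case

    def __post_init__(self) -> None:
        if self.phrase not in PHRASE_TYPES:
            raise ValueError(f"invalid phrase type: {self.phrase!r}")

    def to_element(self) -> ET.Element:
        """Build a <chunk> element, omitting the optional case when it is unset."""
        attrs = {"cid": self.cid, "head": self.head, "phrase": self.phrase}
        if self.case is not None:
            attrs["case"] = self.case
        return ET.Element("chunk", attrs)


chunk = Chunk(cid="c1", head="t4", phrase="NP")
print(ET.tostring(chunk.to_element(), encoding="unicode"))
```

Validating the phrase type at construction time means an invalid value is rejected before any XML is written.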

 

 

References

Bosma, 2009
Wauter Bosma, Piek Vossen, Aitor Soroa, German Rigau, Maurizio Tesconi, Andrea Marchetti, Monica Monachini and Carlo Aliprandi: "KAF: a generic semantic annotation format", in Proceedings of the 5th International Conference on Generative Approaches to the Lexicon (GL 2009), Pisa, Italy, September 17-19, 2009.

SemAF, 2008
Language Resource Management - Semantic Annotation Framework: time and events (SemAF-Time), rev-12. http://lirics.loria.fr/doc_pub/SemAFCD24617-1Rev12.pdf, in: Proceedings of LREC 2008, Marrakech, Morocco, May 28-30, 2008.

Kyoto, 2009
Database Models and Data Formats, Kyoto consortium, 2009.


ICT-211423 - 2008 © Kyoto Consortium