This file contains a set of heuristic rules to perform dependency parsing.
The file consists of four sections: <GRPAR>, <GRLAB>, <SEMDB>, and
<VCLASS>, respectively closed by the tags </GRPAR>, </GRLAB>,
</SEMDB>, and </VCLASS>.
<GRPAR> contains rules to complete the partial parsing provided by
the chart parser. The tree is completed by combining chunk pairs as
stated by the rules. Rules are applied from highest priority (lowest
values) to lowest priority (highest values), and from left to right.
That is, the pair of adjacent chunks matching the highest-priority
rule is found and the rule is applied, joining both chunks into one.
The process is repeated until only one chunk is left.
Each line contains a rule, with the format:
ancestor-label descendant-label label operation priority-spec
where:
ancestor-label and descendant-label are the syntactic labels (either
assigned by the chunk parser, or a new label created by some other
completion rule) of two consecutive nodes in the tree.
label has two meanings, depending on the operation field value.
For top_left and top_right operations, it states the label with
which the root node of the resulting tree must be relabelled
(``-'' means no relabelling). For last_left and last_right
operations, it states the label that the node to be considered
``last'' must have to get the subtree as a new child. If no node with
this label is found, the subtree is attached as a new child to the
root node.
operation is the way in which the ancestor-label and
descendant-label nodes are to be combined.
priority-spec is a specification of possible priority values for
this rule, as detailed below.
For instance, the rule:
np pp - top_left 20
states that if two subtrees labelled np and pp are found contiguous
in the partial tree, the latter is added as a new child of the
former.
The supported tree-building operations are the following:
top_left: The right subtree is added as a daughter of the left
subtree. The root of the new tree is the root of the left subtree.
If a label value other than ``-'' is specified, the root is
relabelled with that string.
last_left: The right subtree is added as a daughter of the last node
inside the left subtree matching the label value (or to the root if
none is found). The root of the new tree is the root of the left
subtree.
top_right: The left subtree is added as a new daughter of the right
subtree. The root of the new tree is the root of the right subtree.
If a label value other than ``-'' is specified, the root is
relabelled with that string.
last_right: The left subtree is added as a daughter of the last node
inside the right subtree matching the label value (or to the root if
none is found). The root of the new tree is the root of the right
subtree.
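The four operations and the completion loop can be sketched roughly as follows. This is only an illustration, not the parser's actual code: the Node and Rule classes and the function names are hypothetical, the priority is reduced to a single integer, ``last'' is assumed to mean the last matching node in a left-to-right depth-first walk, and the first rule label is assumed to match the left chunk of the pair. The source does not say what happens when no rule matches any pair, so the sketch simply stops.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class Rule:
    ancestor: str    # assumed to match the left chunk of the pair
    descendant: str  # assumed to match the right chunk of the pair
    label: str       # relabel target or "last" node label; "-" means none
    operation: str   # top_left, last_left, top_right or last_right
    priority: int    # simplification: a single integer priority


def find_last(root: Node, label: str) -> Optional[Node]:
    """Last node labelled `label` in a pre-order walk (assumed meaning of 'last')."""
    found = None
    stack = [root]
    while stack:
        node = stack.pop()
        if node.label == label:
            found = node
        stack.extend(reversed(node.children))  # visit children left to right
    return found


def combine(left: Node, right: Node, rule: Rule) -> Node:
    """Apply one tree-building operation and return the root of the new tree."""
    if rule.operation in ("top_left", "last_left"):
        head, sub, at_end = left, right, True
    elif rule.operation in ("top_right", "last_right"):
        head, sub, at_end = right, left, False
    else:
        raise ValueError(f"unknown operation: {rule.operation}")

    if rule.operation.startswith("top"):
        target = head
        if rule.label != "-":
            head.label = rule.label  # relabel the root of the result
    else:
        # Fall back to the root when no node carries the requested label.
        target = find_last(head, rule.label) or head

    if at_end:
        target.children.append(sub)
    else:
        target.children.insert(0, sub)  # keep surface order for right-headed joins
    return head


def complete(chunks: List[Node], rules: List[Rule]) -> List[Node]:
    """Repeatedly join the adjacent pair matched by the lowest-valued rule."""
    chunks = list(chunks)
    while len(chunks) > 1:
        best = None  # (priority, position, rule)
        for i in range(len(chunks) - 1):
            pair = (chunks[i].label, chunks[i + 1].label)
            for rule in rules:
                # Strict '<' keeps the leftmost pair when priorities tie.
                if (rule.ancestor, rule.descendant) == pair and (
                    best is None or rule.priority < best[0]
                ):
                    best = (rule.priority, i, rule)
        if best is None:
            break  # no applicable rule; behaviour here is not specified
        _, i, rule = best
        chunks[i:i + 2] = [combine(chunks[i], chunks[i + 1], rule)]
    return chunks
```

With the rule np pp - top_left 20 and a second, made-up rule np vp - top_right 30, the chunk sequence np pp vp would first be reduced to np vp and then to a single vp tree.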
The priority-spec part of a rule defines the priority that will rank
the applicable rules. Rules with low priority values will be applied
earlier. The priority-spec consists of a list of zero or more pairs
context-condition value, separated by semicolons. The last item in
the list is a single integer value, and is required (i.e. the
simplest possible priority-spec is a single integer value). Each
context condition in the list is checked in order, and the priority
value for the first matching condition is used for the rule. If no
condition in the list matches, the last single value is used.
A context condition is a sequence of labels separated by
underscores; each label must match the label of one chunk in the
partial tree. The condition must include a $$ label, which matches
the pair of chunks that activated the rule. A * label matches any
chunk.
For instance, the rule:
np pp - top_left vp_$$_adjp 20; $$_*_vp 10; 5
will be activated when an adjacent pair np pp is found, and will be
ranked with priority 20 provided there is a vp chunk to the left of
the focus pair and an adjp chunk to its right. If not, it will get a
priority of 10 if there is a vp chunk at the second position to its
right, with any chunk in the first. If none of those patterns is
matched, the rule will be assigned a priority of 5.
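The way a priority-spec resolves to a single value can be illustrated with a short sketch. The function names are hypothetical, and the code assumes that chunk labels contain no underscores and that context labels must match the chunks immediately adjacent to the focus pair; neither point is stated explicitly above.

```python
from typing import List, Tuple

Condition = Tuple[List[str], int]  # (context labels, priority value)


def parse_priority_spec(spec: str) -> Tuple[List[Condition], int]:
    """Split 'ctx value; ctx value; default' into conditions and the default."""
    items = [item.strip() for item in spec.split(";")]
    default = int(items[-1])  # the final single integer is required
    conditions = []
    for item in items[:-1]:
        context, value = item.split()
        labels = context.split("_")  # assumes labels contain no underscores
        if labels.count("$$") != 1:
            raise ValueError(f"context must contain exactly one $$: {context}")
        conditions.append((labels, int(value)))
    return conditions, default


def context_matches(labels: List[str], chunk_labels: List[str], i: int) -> bool:
    """Check a context against the chunks around the focus pair at i, i+1."""
    focus = labels.index("$$")
    left, right = labels[:focus], labels[focus + 1:]
    start, end = i - len(left), i + 2 + len(right)
    if start < 0 or end > len(chunk_labels):
        return False  # not enough chunks on one side of the pair
    window = chunk_labels[start:i] + chunk_labels[i + 2:end]
    return all(pat == "*" or pat == lab for pat, lab in zip(left + right, window))


def rule_priority(spec: str, chunk_labels: List[str], i: int) -> int:
    """Priority of a rule whose pair sits at positions i and i+1."""
    conditions, default = parse_priority_spec(spec)
    for labels, value in conditions:
        if context_matches(labels, chunk_labels, i):
            return value  # first matching condition wins
    return default
```

For the example spec, a pair np pp at position 1 of the chunk sequence vp np pp adjp resolves to 20, the same pair at position 0 of np pp adv vp resolves to 10, and with no surrounding chunks it falls back to 5.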
<GRLAB> contains rules to label the dependences extracted from the
full parse tree built with the rules in the previous section.
Each line contains a rule, with the format:
ancestor-label dependence-label condition1 condition2 ...
where:
ancestor-label is the label of the node which is head of the
dependence.
dependence-label is the label to be assigned to the dependence.
condition is a list of conditions that the dependence has to match
to satisfy the rule.
Each condition has one of the forms:
node.attribute = value
node.attribute != value
where node may be p (for parent) or d (for descendant), and attribute is one of the following:
label: chunk label (or PoS tag) of the node.
side: (left or right) position of the specified node with respect to the other.
lemma: lemma of the node head word.
class: word class (see below) of the lemma of the node head word.
tonto: EWN Top Ontology properties of the node head word.
semfile: WN semantic file of the node head word.
synon: synonym lemmas of the node head word (according to WN).
asynon: synonym lemmas of the node head word ancestors (according to WN).
Note that since no disambiguation is required, the attributes dealing with semantic properties will be satisfied if any of the word senses matches the condition.
For instance, the rule:
verb-phr subj d.label=np* d.side=left
states that if a verb-phr node has a daughter to its left, with a
label starting with np, this dependence is to be labeled as subj.
Similarly, the rule:
verb-phr obj d.label=np* d.tonto=Edible p.lemma=eat
states that if a verb-phr node has eat as lemma, and a descendant
with a label starting with np and with an Edible property in the EWN
Top Ontology, this dependence is to be labeled as obj.
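A labelling rule of this kind could be checked against a parent/descendant pair along the following lines. This is a sketch under several assumptions: nodes are represented as plain dictionaries (a hypothetical structure), conditions are written without spaces as in the examples, a trailing * in a value is read as a prefix wildcard (generalising from the np* example), and attributes that may hold several values, such as tonto, are stored as sets so that any matching sense satisfies the condition.

```python
import re
from typing import Dict, List, Set, Tuple, Union

Attrs = Dict[str, Union[str, Set[str]]]
Cond = Tuple[str, str, str, str]  # (node, attribute, operator, value)
LabelRule = Tuple[str, str, List[Cond]]

# node.attribute=value or node.attribute!=value, with node being p or d
COND_RE = re.compile(r"^([pd])\.(\w+)(!=|=)(.+)$")


def parse_label_rule(line: str) -> LabelRule:
    """Parse 'ancestor-label dependence-label cond1 cond2 ...'."""
    ancestor, dep_label, *raw_conds = line.split()
    conds = []
    for raw in raw_conds:
        match = COND_RE.match(raw)
        if match is None:
            raise ValueError(f"malformed condition: {raw}")
        conds.append(match.groups())
    return ancestor, dep_label, conds


def value_matches(pattern: str, actual: Union[str, Set[str]]) -> bool:
    """True if any value of the attribute matches the pattern."""
    values = {actual} if isinstance(actual, str) else actual
    if pattern.endswith("*"):  # assumed prefix wildcard, as in np*
        return any(v.startswith(pattern[:-1]) for v in values)
    return pattern in values


def rule_applies(rule: LabelRule, parent: Attrs, child: Attrs) -> bool:
    """Check the ancestor label and every condition of the rule."""
    ancestor, _, conds = rule
    if parent.get("label") != ancestor:
        return False
    for node, attr, op, value in conds:
        attrs = parent if node == "p" else child
        hit = value_matches(value, attrs.get(attr, set()))
        if hit != (op == "="):  # '=' needs a match, '!=' needs none
            return False
    return True
```

A labeller built on this would presumably scan the rules for each dependence and assign the dependence-label of a rule that applies; how ties between several applicable rules are resolved is not described here.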
<SEMDB> is only necessary if the dependency labeling rules in section
<GRLAB> use conditions on semantic values (that is, any of tonto,
semfile, synon, or asynon).
The section must contain two lines specifying two semantic information files, a SenseFile and a WNFile. The filenames may be absolute or relative to the location of the dependency rules file.
<SEMDB>
SenseFile ../senses16.db
WNFile ../../common/wn16.db
</SEMDB>
The SenseFile must be a BerkeleyDB indexed file as described in section 4.5. The WNFile must be a BerkeleyDB indexed file, obtained with the same procedure from a source plain text file. This source file must contain one sense per line, with the following format:
synset:PoS hypern:hypern:...:hypern semfile TopOnto:TopOnto:...:TopOnto
That is: the first field is the synset code plus its PoS, separated by a colon. The second field is a colon-separated list of its hypernym synsets. The third field is the WN semantic file the synset belongs to, and the last field is a colon-separated list of EuroWN TopOntology codes valid for the synset.
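As an illustration of this line format, the following sketch splits one line of the plain text source file into its four fields. The SenseEntry type and the function name are hypothetical, and the sketch assumes that fields are separated by whitespace and that every field is present; how an empty hypernym or TopOntology list is written is not specified above, so that case is not handled.

```python
from typing import List, NamedTuple


class SenseEntry(NamedTuple):
    synset: str           # synset code
    pos: str              # part of speech of the synset
    hypernyms: List[str]  # hypernym synsets
    semfile: str          # WN semantic file
    top_onto: List[str]   # EuroWN TopOntology codes


def parse_wn_line(line: str) -> SenseEntry:
    """Parse 'synset:PoS hypern:...:hypern semfile TopOnto:...:TopOnto'."""
    fields = line.split()
    if len(fields) != 4:
        raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
    synset, pos = fields[0].split(":")
    return SenseEntry(
        synset=synset,
        pos=pos,
        hypernyms=fields[1].split(":"),
        semfile=fields[2],
        top_onto=fields[3].split(":"),
    )
```

For example, a made-up line such as `00000001:n 00000002:00000003 noun.food Edible:Natural` would yield the synset 00000001 with PoS n, two hypernyms, the semantic file noun.food, and two TopOntology codes. The synset codes in this example are invented for illustration only.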
<CLASS> contains class definitions which may be used as attributes in
the dependency labelling rules.
Each line contains a class assignment for a lemma, with two possible formats:
class-name lemma comments
class-name "filename" comments
For instance, the following lines assign the four listed verbs to the
class mov, and all lemmas found in the animals.dat file to the class
animal. In the latter case, if the file name is not an absolute path,
it is interpreted as a path relative to the location of the heuristic
rules file.
Anything to the right of the second field is considered a comment and ignored.
mov go prep= to,towards
mov come prep= from
mov walk prep= through
mov run prep= to,towards D.O.
animal "animals.dat"
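Reading such class definitions could look roughly like the sketch below. The function name is hypothetical, and two assumptions go beyond what is stated above: the referenced file is taken to list one lemma per line, and file names are taken to contain no spaces.

```python
import os
from typing import Dict, Iterable, Set


def load_classes(lines: Iterable[str], rules_path: str) -> Dict[str, Set[str]]:
    """Map each class name to its lemmas; `rules_path` is the rules file path."""
    base = os.path.dirname(os.path.abspath(rules_path))
    classes: Dict[str, Set[str]] = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or incomplete lines
        name, second = fields[0], fields[1]  # anything after this is a comment
        members = classes.setdefault(name, set())
        if len(second) >= 2 and second.startswith('"') and second.endswith('"'):
            path = second[1:-1]
            if not os.path.isabs(path):
                # Relative names are resolved against the rules file location.
                path = os.path.join(base, path)
            with open(path, encoding="utf-8") as handle:
                for entry in handle:
                    tokens = entry.split()
                    if tokens:
                        members.add(tokens[0])  # assumed: one lemma per line
        else:
            members.add(second)
    return classes
```

Applied to the example lines, this would give the class mov the lemmas go, come, walk and run, and fill the class animal from animals.dat next to the rules file.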
2008-01-24