This file controls the behaviour of the simple NE recognizer. It consists of the following sections:
<FunctionWords>
lists the function words that can be
embedded inside a proper noun (e.g. prepositions and articles such
as those in "Banco de España" or "Foundation for the Eradication
of Poverty"). For instance:
<FunctionWords> el la los las de del para </FunctionWords>
<SpecialPunct>
lists the PoS tags (according to the
punctuation tags definition file, section 2.13) after
which a capitalized word may simply mark the beginning of a sentence
or clause, and not necessarily a named entity. Typical cases are
colon, open parenthesis, dot, hyphen, etc.
<SpecialPunct> Fpa Fp Fd Fg </SpecialPunct>
<NE_Tag>
contains only one line with the PoS tag that
will be assigned to the recognized entities. If the NE classifier is
going to be used later, it will have to be informed of this tag at
creation time.
<NE_Tag> NP00000 </NE_Tag>
<Ignore>
contains a list of lemmas that are not considered named entities even when they appear capitalized in the middle of a sentence. For instance, the word "Spanish" in the sentence "He started studying Spanish two years ago" is not a named entity. If the words in the list appear together with other capitalized words, they are considered to form a named entity (e.g. "An announcement of the Spanish Bank of Commerce was issued yesterday"). The same distinction applies to the word "I" in the sentences "Whatever you say, I don't believe" and "That was the death of Henry I".
<Ignore> i english dutch spanish </Ignore>
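The rule described above can be illustrated with a minimal Python sketch. This is not the recognizer's actual code: the function name `is_named_entity` and the idea of receiving a pre-extracted run of words are hypothetical, and lowercasing a word is used here as a crude stand-in for looking up its lemma.

```python
# Hypothetical sketch of the <Ignore> rule; not the recognizer's implementation.
IGNORE = {"i", "english", "dutch", "spanish"}  # contents of the <Ignore> example

def is_named_entity(run, ignore=IGNORE):
    """Decide whether a mid-sentence run of words starting with a capital
    letter (possibly with embedded function words) is a named entity.

    Assumption: the lowercased word form approximates the lemma.
    """
    if not run:
        return False
    # A lone ignored word ("Spanish", "I") is not a named entity.
    if len(run) == 1 and run[0].lower() in ignore:
        return False
    # Combined with other capitalized words, it forms part of an entity.
    return True

print(is_named_entity(["Spanish"]))                            # lone ignored word
print(is_named_entity(["Spanish", "Bank", "of", "Commerce"]))  # part of a longer name
print(is_named_entity(["Henry", "I"]))                         # "I" next to a capitalized word
```

The sketch treats the ignore list as affecting only isolated words, which mirrors the "Spanish" versus "Spanish Bank of Commerce" distinction in the text.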
<RE_NounAdj>,
<RE_Closed>,
and <RE_DateNumPunct>
allow you to modify the default regular expressions, which are written for PAROLE Part-of-Speech tags. For instance, if Penn-Treebank-like tags are used for English, we should define:
<RE_NounAdj> ^(NN$|NNS|JJ) </RE_NounAdj>
<RE_Closed> ^(D|IN|C) </RE_Closed>
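To see which Penn Treebank tags the two example expressions select, they can be tried out with Python's `re` module. This is only an illustrative check: it assumes the recognizer applies the patterns with ordinary search semantics, and the helper `tag_class` is a hypothetical name, not part of the recognizer.

```python
import re

# The two patterns from the example above, copied verbatim.
RE_NOUN_ADJ = re.compile(r"^(NN$|NNS|JJ)")
RE_CLOSED = re.compile(r"^(D|IN|C)")

def tag_class(tag):
    """Hypothetical helper: report which pattern (if any) a PoS tag matches."""
    if RE_NOUN_ADJ.search(tag):
        return "noun_adj"
    if RE_CLOSED.search(tag):
        return "closed"
    return "other"

for tag in ["NN", "NNS", "JJ", "JJR", "NNP", "DT", "IN", "CC", "VBD"]:
    print(tag, tag_class(tag))
```

Note that `NN$` matches only the exact tag `NN`, so proper-noun tags such as `NNP` are not captured by the noun/adjective pattern.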
<TitleLimit>
contains only one line with an integer
value stating the length (in words) beyond which a sentence written entirely in uppercase will be considered a title and not a proper
noun. Example:
<TitleLimit> 3 </TitleLimit>
If TitleLimit=0
(the default), title detection is
deactivated (i.e., all-uppercase sentences are always marked as
named entities).
The idea of this heuristic is that newspaper titles are usually written in uppercase, and tend to have at least two or three words, while named entities written in this way tend to be acronyms (e.g. IBM, DARPA, ...) and usually have at most one or two words.
For instance, if TitleLimit=3
the sentence
FREELING ENTERS NASDAQ UNDER CLOSE INTEREST OF MARKET ANALYSTS
will not be recognized as a named entity, and will have its words analyzed
independently. On the other hand, the sentence IBM INC., having fewer than
3 words, will be considered a proper noun.
Obviously this heuristic is not 100% accurate, but in some cases (e.g. if you are analyzing newspapers) it may be preferable to the default behaviour (which is not 100% accurate, either).
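The TitleLimit heuristic can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the recognizer's code: the function name `is_title` is hypothetical, words are obtained by whitespace splitting, and a sentence of exactly TitleLimit words is treated as a title (the text only says that fewer words yield a proper noun, so the boundary case is a guess).

```python
def is_title(words, title_limit=0):
    """Hypothetical sketch: should an all-uppercase sentence be treated as a
    title (words analyzed independently) rather than as one named entity?

    Assumption: a sentence with exactly `title_limit` words counts as a title.
    """
    if title_limit <= 0:
        return False  # TitleLimit=0 deactivates title detection
    # Only tokens containing letters are checked for case.
    lettered = [w for w in words if any(c.isalpha() for c in w)]
    all_upper = bool(lettered) and all(w.isupper() for w in lettered)
    return all_upper and len(words) >= title_limit

headline = "FREELING ENTERS NASDAQ UNDER CLOSE INTEREST OF MARKET ANALYSTS".split()
print(is_title(headline, title_limit=3))            # long uppercase sentence
print(is_title("IBM INC.".split(), title_limit=3))  # short uppercase sentence
print(is_title(headline, title_limit=0))            # detection switched off
```

With TitleLimit=3 the headline is flagged as a title while the two-word sentence is not, matching the behaviour described above.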
2008-01-24