The file contains four sections: <General>, <Markers>, <SentenceEnd>, and <SentenceStart>.
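To make the layout concrete, the following is a minimal Python sketch of how such a file could be cut into its sections. It is not the splitter's own loader: the function name read_sections and the regular-expression approach are assumptions made for illustration only.

```python
import re

def read_sections(text):
    """Return a dict mapping each section name to its non-empty lines.

    Hypothetical helper: it only assumes the <Name> ... </Name> layout
    described in the text, nothing about the real loader.
    """
    sections = {}
    # The backreference \1 requires the closing tag to match the opening one.
    for match in re.finditer(r"<(\w+)>(.*?)</\1>", text, flags=re.DOTALL):
        name, body = match.group(1), match.group(2)
        sections[name] = [ln.strip() for ln in body.splitlines() if ln.strip()]
    return sections

sample = """<General>
AllowBetweenMarkers 0
MaxLines 0
</General>
<SentenceStart>
¿
¡
</SentenceStart>
"""

sections = read_sections(sample)
```

Each section body is kept as a list of raw lines, so that every section can be interpreted according to its own rules.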
The <General> section contains general options for the splitter, namely AllowBetweenMarkers and MaxLines. The former takes the value 1 or 0 (on/off). The latter may be any integer. An example of the <General> section is:
<General>
AllowBetweenMarkers 0
MaxLines 0
</General>
If AllowBetweenMarkers is off, a sentence split will never be introduced inside a pair of parenthesis-like markers. This is useful to prevent splitting in sentences such as ``I hate'' (Mary said. Angrily.) ``apple pie''. If this option is on, a sentence end may be introduced inside such a pair.
MaxLines states how many text lines are read before a sentence split is forced inside parenthesis-like markers (this option is intended to avoid infinite loops when the markers are not properly closed in the text). A value of zero means ``Never split, I'll risk infinite loops''. This option is only effective if AllowBetweenMarkers is on.
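The two options can be read and combined roughly as in the Python sketch below. The names parse_general and must_force_split are hypothetical, the default values are assumptions, and the comparison used for MaxLines is only one possible reading of the description above, not the splitter's actual code.

```python
def parse_general(lines):
    """Turn the lines of a <General> section into an options dict."""
    # Defaults are an assumption; the text does not state them.
    options = {"AllowBetweenMarkers": False, "MaxLines": 0}
    for line in lines:
        key, value = line.split()
        if key == "AllowBetweenMarkers":
            if value not in ("0", "1"):
                raise ValueError("AllowBetweenMarkers must be 0 or 1")
            options[key] = (value == "1")
        elif key == "MaxLines":
            options[key] = int(value)
        else:
            raise ValueError("unknown option: " + key)
    return options

def must_force_split(options, lines_read_inside_markers):
    """One reading of MaxLines: force a split once that many lines were read.

    Following the text, zero means "never", and the option has no effect
    unless AllowBetweenMarkers is on.
    """
    if not options["AllowBetweenMarkers"] or options["MaxLines"] == 0:
        return False
    return lines_read_inside_markers >= options["MaxLines"]

example = parse_general(["AllowBetweenMarkers 0", "MaxLines 0"])
```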
The <Markers> section lists the pairs of characters (or character groups) that have to be considered open-close markers. For instance:
<Markers>
" "
( )
{ }
/* */
</Markers>
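As an illustration of how marker pairs might be used, the Python sketch below tracks whether a token sequence is currently inside an unclosed pair. The names parse_markers and inside_markers are hypothetical, and the stack-based approach is an assumption rather than the splitter's documented behaviour.

```python
def parse_markers(lines):
    """Turn lines such as '( )' into a list of (opener, closer) pairs."""
    pairs = []
    for line in lines:
        opener, closer = line.split()
        pairs.append((opener, closer))
    return pairs

def inside_markers(tokens, pairs):
    """Return True if the token sequence ends inside an unclosed marker pair."""
    closer_for = dict(pairs)
    pending = []  # closers still awaited, innermost last
    for tok in tokens:
        # Check for a closer first, so pairs with identical open and
        # close symbols (such as double quotes) are handled correctly.
        if pending and tok == pending[-1]:
            pending.pop()
        elif tok in closer_for:
            pending.append(closer_for[tok])
    return bool(pending)

pairs = parse_markers(['" "', "( )", "{ }", "/* */"])
```

With AllowBetweenMarkers off, a splitter built this way would refuse to split wherever inside_markers returns True for the tokens read so far.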
The <SentenceEnd> section lists which characters are considered possible sentence endings. Each character is followed by a binary value stating whether the character is an unambiguous sentence ending or not. For instance, in the following example, ``?'' is an unambiguous sentence marker, so a sentence split will be introduced unconditionally after each ``?''. The other two characters are not unambiguous, so a sentence split will only be introduced if they are followed by a capitalized word or a sentence start character.
<SentenceEnd>
. 0
? 1
! 0
</SentenceEnd>
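The decision rule just described can be sketched in Python as follows. The function names, the token-based interface, and the choice not to split at the end of the input are assumptions made for illustration; the set of sentence start characters is passed in explicitly.

```python
def parse_sentence_end(lines):
    """Map each sentence-end character to True if it is unambiguous."""
    table = {}
    for line in lines:
        char, flag = line.split()
        table[char] = (flag == "1")
    return table

def is_split_point(token, next_token, sentence_end, sentence_start):
    """Decide whether a sentence split follows `token`."""
    if token not in sentence_end:
        return False
    if sentence_end[token]:
        return True  # unambiguous ending: always split
    if next_token is None:
        return False  # assumption: the text does not cover end of input
    first = next_token[:1]
    # Ambiguous ending: split only before a capitalized word
    # or a sentence start character.
    return first.isupper() or first in sentence_start

ends = parse_sentence_end([". 0", "? 1", "! 0"])
starts = {"¿", "¡"}
```

For example, is_split_point(".", "The", ends, starts) accepts the split, while is_split_point(".", "com", ends, starts) rejects it.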
The <SentenceStart> section lists characters known to appear only at the beginning of a sentence. For instance, the opening question and exclamation marks in Spanish:
<SentenceStart>
¿
¡
</SentenceStart>
2008-01-24