Splitter options file

The file contains four sections: <General>, <Markers>, <SentenceEnd>, and <SentenceStart>.

The <General> section contains general options for the splitter, namely AllowBetweenMarkers and MaxLines. The former takes the value 1 or 0 (on/off). The latter may be any integer. An example of the <General> section is:

<General>
AllowBetweenMarkers 0
MaxLines 0
</General>

If AllowBetweenMarkers is off, a sentence split will never be introduced inside a pair of parenthesis-like markers. This is useful to prevent splitting in sentences such as ``I hate'' (Mary said. Angrily.) ``apple pie''. If this option is on, a sentence split may be introduced inside such a pair.

MaxLines states how many text lines are read before a sentence split is forced inside parenthesis-like markers (this option is intended to avoid infinite loops in case the markers are not properly closed in the text). A value of zero means ``Never split, I'll risk infinite loops''. Obviously, this option is only effective if AllowBetweenMarkers is on.

The <Markers> section lists the pairs of characters (or character groups) that have to be considered open-close markers. For instance:

<Markers>
" "
( )
{ }
/* */
</Markers>
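
To illustrate what being inside a pair of markers means, here is a minimal Python sketch. It is not the splitter's actual implementation: the function name inside_markers and the stack-based matching are assumptions made for illustration, and the marker list simply mirrors the example section above.

```python
# Marker pairs copied from the example <Markers> section (open, close).
MARKERS = [('"', '"'), ("(", ")"), ("{", "}"), ("/*", "*/")]


def inside_markers(text, markers, pos):
    """Return True if index `pos` of `text` lies inside an unclosed marker pair.

    Hypothetical helper: scans the text up to `pos`, keeping a stack of the
    closing markers that are still expected.
    """
    expected = []  # closing markers we are waiting for, innermost last
    i = 0
    while i < pos:
        # Closing the innermost open pair takes priority, so that symmetric
        # markers such as `" "` close instead of nesting forever.
        if expected and text.startswith(expected[-1], i):
            i += len(expected[-1])
            expected.pop()
            continue
        for opener, closer in markers:
            if text.startswith(opener, i):
                expected.append(closer)
                i += len(opener)
                break
        else:
            i += 1  # ordinary character, no marker starts here
    return bool(expected)


text = "He said (wait. Really.) and left."
first_dot = text.index(".")   # the dot after "wait", inside the parentheses
last_dot = len(text) - 1      # the final dot, outside any marker pair
```

With AllowBetweenMarkers off, a splitter following this sketch would ignore the dot at first_dot and only consider the one at last_dot as a possible split point.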

The <SentenceEnd> section lists the characters that are considered possible sentence endings. Each character is followed by a binary value stating whether or not the character is an unambiguous sentence ending. For instance, in the following example, ``?'' is an unambiguous sentence marker, so a sentence split will be introduced unconditionally after each ``?''. The other two characters are not unambiguous, so a sentence split will only be introduced if they are followed by a capitalized word or a sentence start character.

<SentenceEnd>
. 0
? 1
! 0
</SentenceEnd>
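
The decision rule described above can be sketched in Python as follows. This is an illustrative reading of the rule, not the splitter's code: the function name splits_after is hypothetical, the set of sentence start characters is an example (Spanish opening marks), and the behaviour when nothing follows the character is an assumption, since the text does not specify it.

```python
# Values from the example <SentenceEnd> section: True means unambiguous.
SENTENCE_END = {".": False, "?": True, "!": False}
# Example sentence start characters (assumed here for illustration).
SENTENCE_START = {"¿", "¡"}


def splits_after(char, following,
                 sentence_end=SENTENCE_END, sentence_start=SENTENCE_START):
    """Decide whether a sentence split is introduced after `char`.

    `following` is the text that comes right after `char`.
    """
    if char not in sentence_end:
        return False              # not a possible sentence ending at all
    if sentence_end[char]:
        return True               # unambiguous ending: always split
    following = following.lstrip()
    if not following:
        return False              # assumption: nothing follows, no split
    first = following[0]
    # Ambiguous ending: split only before a capitalized word
    # or a sentence start character.
    return first in sentence_start or first.isupper()
```

For example, splits_after(".", " And then") is true, while splits_after(".", " and then") is false, because the word after the period is not capitalized.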

The <SentenceStart> section lists characters known to appear only at the beginning of a sentence. For instance, the opening question and exclamation marks in Spanish:

<SentenceStart>
¿
¡
</SentenceStart>
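
Putting the four sections together, the following Python sketch reads a file in this format into plain data structures. It is a minimal reader written for illustration only: the function name parse_splitter_options and the returned structures are assumptions, and it handles only the syntax shown in this document (one entry per line, whitespace-separated fields).

```python
import re


def parse_splitter_options(text):
    """Parse the four sections into (general, markers, sentence_end, sentence_start)."""
    sections = {}
    current = None  # name of the section being read, if any
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        opening = re.fullmatch(r"<(\w+)>", line)
        closing = re.fullmatch(r"</(\w+)>", line)
        if opening and current is None:
            current = opening.group(1)
            sections[current] = []
        elif closing and closing.group(1) == current:
            current = None
        elif current is not None:
            sections[current].append(line.split())

    # <General>: option name followed by an integer value.
    general = {name: int(value) for name, value in sections.get("General", [])}
    # <Markers>: an opening marker followed by its closing marker.
    markers = [tuple(pair) for pair in sections.get("Markers", [])]
    # <SentenceEnd>: a character followed by 1 (unambiguous) or 0.
    sentence_end = {char: flag == "1"
                    for char, flag in sections.get("SentenceEnd", [])}
    # <SentenceStart>: one character per line.
    sentence_start = {fields[0] for fields in sections.get("SentenceStart", [])}
    return general, markers, sentence_end, sentence_start


SAMPLE = '''
<General>
AllowBetweenMarkers 0
MaxLines 0
</General>
<Markers>
" "
( )
{ }
/* */
</Markers>
<SentenceEnd>
. 0
? 1
! 0
</SentenceEnd>
<SentenceStart>
¿
¡
</SentenceStart>
'''

general, markers, sentence_end, sentence_start = parse_splitter_options(SAMPLE)
```

Here the sample text is the concatenation of the example sections shown in this document, so general holds both options with value 0 and sentence_end marks only ``?'' as unambiguous.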

2008-01-24