

Extending Universal Converter

Thessalonica’s conversion tables are described in standard OpenOffice.org XML registry files (with the .xcu extension). Information from these files is merged into the OpenOffice.org registry each time you install the package with the unopkg utility (or the OpenOffice.org graphical extension manager), so it can be retrieved with the standard registry access methods defined in the OpenOffice.org API. To extend Thessalonica with additional conversion tables, prepare your own .xcu file similar to those already available (they are stored in the ‘ConvTables’ subdirectory), package it together with Thessalonica, and reinstall the extension. If you are not familiar with the .xcu format, refer to the OpenOffice.org developer’s guide.

Each conversion table should follow the rules described in the ConvTables.xcs schema definition file. Some of these rules are discussed below.

Thessalonica’s conversion tables format

Since Thessalonica’s conversion tables (as well as input method descriptions) are standard .xcu files, they should have the oor:component-data root node with the following attributes:

xmlns:oor="http://openoffice.org/2001/registry" 
xmlns:xs="http://www.w3.org/2001/XMLSchema" 
oor:name="ConvTables" 
oor:package="org.openoffice.comp.thessalonica"

The oor:name and oor:package attributes mean that information from this file is merged into the OpenOffice.org registry file called ConvTables.xcu and becomes accessible under the path org.openoffice.comp.thessalonica.ConvTables/.

The oor:component-data node has only one child, called Root, which may have several children, each describing a particular conversion table. It is recommended to put each conversion table into its own file. Since a conversion table normally contains ANSI strings that it maps to specific Unicode characters, this file should normally be written in the ISO-8859-1 encoding; OpenOffice.org will merge all such files together and re-encode them to UTF-8.

Each node describing a conversion table should have one property, called Title (a displayable name for that conversion table, used in the GUI), and one child node, called Rules, which contains a set of conversion rules.
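The structure described so far can be sketched as a file skeleton. This is only an illustration: the table name "MyEncoding" and its title are hypothetical placeholders, and the oor:op="replace" attribute on the table node and the xs:string type of Title are assumptions modelled on the rule examples in this section rather than taken from the ConvTables.xcs schema, which remains the authoritative reference.

```xml
<?xml version="1.0" encoding="ISO-8859-1"?>
<!-- Hypothetical skeleton: "MyEncoding" and its title are placeholders -->
<oor:component-data
   xmlns:oor="http://openoffice.org/2001/registry"
   xmlns:xs="http://www.w3.org/2001/XMLSchema"
   oor:name="ConvTables"
   oor:package="org.openoffice.comp.thessalonica">
   <node oor:name="Root">
      <!-- One child node per conversion table (oor:op is an assumption) -->
      <node oor:name="MyEncoding" oor:op="replace">
         <!-- Displayable name shown in the GUI (type is an assumption) -->
         <prop oor:name="Title" oor:type="xs:string">
            <value>My Legacy Encoding</value>
         </prop>
         <node oor:name="Rules">
            <!-- Conversion rule nodes go here -->
         </node>
      </node>
   </node>
</oor:component-data>
```

Comparing such a skeleton with one of the files shipped in the ‘ConvTables’ subdirectory is the safest way to confirm the exact attributes expected by the schema.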

Each node representing a conversion rule maps a sequence of Unicode characters to one or more sequences of ANSI characters. Since each node should have a unique name, these rules are named according to the same convention, based on AGL, as the rules applied to keyboard input in input method descriptions. However, node names do not really matter here, since the converter does not use them.

The following set of properties is used to describe a conversion rule:

string-list ANSI

Specifies one or more 8-bit strings in the encoding the given conversion table describes. Note that OpenOffice.org uses the space character as the default separator for all registry keys containing a list of values. If this is not what you want (e. g. some of your 8-bit strings already contain spaces), you should explicitly specify the oor:separator property for each list of 8-bit character sequences (see the examples below).

int-list Unicode

Specifies one or more Unicode codepoints corresponding to the 8-bit string(s) this rule is applicable to. Using hexadecimal notation is strongly recommended, although OpenOffice.org will automatically convert all numbers to decimal form while merging the file into its registry. Note that this list has a different meaning from the one in the ANSI property: it does not contain several alternate strings, but rather one single string represented as a sequence of Unicode codepoints.

boolean ANSIToUni

Specifies if this rule is applicable for conversion from ANSI to Unicode.

boolean UniToANSI

Specifies if this rule is applicable for conversion from Unicode to ANSI.

string-list Comment

A comment, which normally should contain a canonical name (or a sequence of canonical names) of the Unicode character(s) this rule is applicable to.
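As a minimal sketch of the separator rule for the ANSI property, the following rule lists two alternate 8-bit strings that each contain a space, so a non-default separator is declared. The 8-bit strings and the choice of "|" as the separator are hypothetical and do not come from any conversion table shipped with Thessalonica.

```xml
<!-- Hypothetical rule: both 8-bit strings contain a space,
     so "|" replaces the default space separator -->
<node oor:name="uni1F70" oor:op="replace">
   <prop oor:name="ANSI" oor:type="oor:string-list">
      <value oor:separator="|">a `|` a</value>
   </prop>
   <!-- A single codepoint, written in hexadecimal as recommended -->
   <prop oor:name="Unicode" oor:type="oor:int-list">
      <value>0x1F70</value>
   </prop>
   <prop oor:name="ANSIToUni" oor:type="xs:boolean">
      <value>true</value>
   </prop>
   <prop oor:name="UniToANSI" oor:type="xs:boolean">
      <value>true</value>
   </prop>
   <prop oor:name="Comment" oor:type="oor:string-list">
      <value>GREEK SMALL LETTER ALPHA WITH VARIA</value>
   </prop>
</node>
```

Without the oor:separator attribute, the ANSI value would be split at each space into single-character strings, which is not what such a rule intends.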

The properties listed above require a few explanations. In earlier versions of Thessalonica you could map just one 8-bit string to a single Unicode codepoint, represented by an integer value. The main reason for mapping a string to a number, rather than an 8-bit string to a Unicode string, was that ISO-8859-1 was considered the preferred encoding for 8-bit strings, and it could not be used in Unicode-encoded XML files. However, this approach implied one serious limitation: it assumed that any valid accented character, corresponding to one or more 8-bit characters in a specific encoding, could be represented with just one Unicode codepoint. This scheme worked almost fine for polytonic Greek, but it would be obviously wrong for many other scripts and languages, as the current Unicode policy is to encode only combining marks and base characters, rather than precomposed accented combinations.

Even for Greek this scheme caused some problems. For example, the converter could not properly handle the precomposed characters for epsilon and omicron with the circumflex accent (Greek perispomeni), available in the WinGreek encoding, as the only valid Unicode representation for these characters is a combination of the base letter and the combining Greek perispomeni. Changing the value type of the Unicode property from int to int-list in Thessalonica 3.0 resolves this problem, as it allows a sequence of Unicode codepoints to be specified in the same rule. The following example from the conversion table for the WinGreek encoding demonstrates this:

<node oor:name="epsilon_uni0342" oor:op="replace">
   <prop oor:name="ANSI" oor:type="oor:string-list">
      <value>ü</value>
   </prop>
   <prop oor:name="Unicode" oor:type="oor:int-list">
      <value oor:separator=";">0x03B5;0x0342</value>
   </prop>
   <prop oor:name="ANSIToUni" oor:type="xs:boolean">
      <value>true</value>
   </prop>
   <prop oor:name="UniToANSI" oor:type="xs:boolean">
      <value>true</value>
   </prop>
   <prop oor:name="Comment" oor:type="oor:string-list">
      <value>GREEK SMALL LETTER EPSILON;COMBINING GREEK PERISPOMENI</value>
   </prop>
</node>

Another improvement introduced in Thessalonica 3.0 is that it is now possible to map several 8-bit strings to the same Unicode character (or sequence of characters) inside one single rule. All such strings are searched during the conversion from an 8-bit encoding to Unicode, but only the first of them is used when the reverse conversion is performed. This possibility was introduced mainly to simplify handling of those cases where legacy 8-bit encodings allowed different ordering of overstriking diacritics, thus making several representations of the same accented character possible.

Let’s take for example a few rules from the conversion table for the Linguist’s Software Greek encoding. In Linguist’s Software Greek fonts the semicolon corresponds to the Greek varia (grave accent) and the slash to the Greek ypogegrammeni (iota subscript). Since combining diacritics may be typed in any sequence, both ‘a;/’ and ‘a/;’ correspond to the same Unicode character with the code 0x1FB2, i. e. “GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI”. Before Thessalonica 3.0 the only way to handle this situation was to use two separate conversion rules:

<node oor:name="uni1FB2" oor:op="replace">
   <prop oor:name="ANSI" oor:type="xs:string">
      <value>a/;</value>
   </prop>
   <prop oor:name="Unicode" oor:type="xs:int">
      <value>0x1FB2</value>
   </prop>
   <prop oor:name="ANSIToUni" oor:type="xs:boolean">
      <value>true</value>
   </prop>
   <prop oor:name="UniToANSI" oor:type="xs:boolean">
      <value>true</value>
   </prop>
   <prop oor:name="Comment" oor:type="xs:string">
      <value>GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI</value>
   </prop>
</node>
<node oor:name="uni1FB2.alt" oor:op="replace">
   <prop oor:name="ANSI" oor:type="xs:string">
      <value>a;/</value>
   </prop>
   <prop oor:name="Unicode" oor:type="xs:int">
      <value>0x1FB2</value>
   </prop>
   <prop oor:name="ANSIToUni" oor:type="xs:boolean">
      <value>true</value>
   </prop>
   <prop oor:name="UniToANSI" oor:type="xs:boolean">
      <value>false</value>
   </prop>
   <prop oor:name="Comment" oor:type="xs:string">
      <value>GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI</value>
   </prop>
</node>

Note that both representations of the accented character may be used in manually typed text, but only one of them is needed for automatic conversion from Unicode to the Linguist’s Software encoding. So for the first rule both ANSIToUni and UniToANSI are true, while for the second rule UniToANSI is false. However, it is now possible to reduce the number of conversion rules, thus making the whole conversion table more legible. The following syntax can be used to describe the same situation with just one rule (note the oor:separator property, used to inform OpenOffice.org that the 8-bit strings are separated with commas):

<node oor:name="uni1FB2" oor:op="replace">
   <prop oor:name="ANSI" oor:type="oor:string-list">
      <value oor:separator=",">a/;,a;/</value>
   </prop>
   <prop oor:name="Unicode" oor:type="oor:int-list">
      <value>0x1FB2</value>
   </prop>
   <prop oor:name="ANSIToUni" oor:type="xs:boolean">
      <value>true</value>
   </prop>
   <prop oor:name="UniToANSI" oor:type="xs:boolean">
      <value>true</value>
   </prop>
   <prop oor:name="Comment" oor:type="oor:string-list">
      <value>GREEK SMALL LETTER ALPHA WITH VARIA AND YPOGEGRAMMENI</value>
   </prop>
</node>

One may ask why the UniToANSI flag has been preserved in Thessalonica 3.0, although setting it to false is no longer needed with the new syntax. Indeed, there is no need to use this option to specify which of the equivalent 8-bit strings should be used for conversion from Unicode to an 8-bit encoding. It may still be useful, however, when converting a particular Unicode character to an 8-bit encoding is simply undesired, although the character can be correctly represented in that encoding. This is probably the case e. g. for common punctuation characters and digits.
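A one-directional rule of this kind might look like the sketch below. It is a hypothetical example rather than a rule copied from an existing conversion table: it assumes an encoding in which the digit “1” should be recognized when converting to Unicode but left untouched in the opposite direction.

```xml
<!-- Hypothetical rule: read "1" from 8-bit text, but never
     convert U+0031 back to the 8-bit encoding -->
<node oor:name="one" oor:op="replace">
   <prop oor:name="ANSI" oor:type="oor:string-list">
      <value>1</value>
   </prop>
   <prop oor:name="Unicode" oor:type="oor:int-list">
      <value>0x0031</value>
   </prop>
   <prop oor:name="ANSIToUni" oor:type="xs:boolean">
      <value>true</value>
   </prop>
   <!-- Disables the Unicode-to-ANSI direction for this rule only -->
   <prop oor:name="UniToANSI" oor:type="xs:boolean">
      <value>false</value>
   </prop>
   <prop oor:name="Comment" oor:type="oor:string-list">
      <value>DIGIT ONE</value>
   </prop>
</node>
```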

