Oracle® Text Reference 10g Release 2 (10.2) Part Number B14218-01
This chapter describes various ways that Oracle Text handles alternative spelling of words. It also documents the alternative spelling conventions that Oracle Text uses in the German, Danish, and Swedish languages.
Some languages have alternative spelling forms for certain words. For example, the German word Schoen can also be spelled as Schön.
The form of a word is either original or normalized. The original form of the word is how it appears in the source document. The normalized form is how it is transformed, if it is transformed at all. Depending on the word being indexed and which system preferences are in effect (these are discussed in this chapter), the normalized form of a word may be the same as the original form. Also, the normalized form may comprise more than one spelling. For example, the normalized form of Schoen is both Schoen and Schön.
Oracle Text handles indexing of alternative word forms in the following ways:
Alternate Spelling—indexing of alternative forms is enabled
Base-Letter Conversion—accented letters are transformed into non-accented representations
New German Spelling—reformed German spelling is accepted
You enable these features by setting the appropriate attribute of the BASIC_LEXER. For instance, you enable Alternate Spelling by setting the ALTERNATE_SPELLING attribute to GERMAN, DANISH, or SWEDISH. As an example, here is how to enable Alternate Spelling in German:
begin
  ctx_ddl.create_preference('GERMAN_LEX', 'BASIC_LEXER');
  ctx_ddl.set_attribute('GERMAN_LEX', 'ALTERNATE_SPELLING', 'GERMAN');
end;
To disable Alternate Spelling, use the CTX_DDL.UNSET_ATTRIBUTE procedure as follows:
begin
  ctx_ddl.unset_attribute('GERMAN_LEX', 'ALTERNATE_SPELLING');
end;
Oracle Text converts query terms to their normalized forms before lookup. As a result, users can query words with either spelling. If Schoen has been indexed as both Schoen and Schön, a query with Schön returns documents containing either form.
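The query behavior described above can be sketched end to end as follows. This is a minimal illustration, not a definitive setup: it assumes the GERMAN_LEX preference shown earlier already exists, and the table name docs, the column name body, and the index name docs_body_idx are hypothetical.

```sql
-- Hypothetical table, column, and index names; GERMAN_LEX is the lexer
-- preference created earlier with ALTERNATE_SPELLING set to GERMAN.
create table docs (id number primary key, body varchar2(4000));

insert into docs values (1, 'Das Wetter ist schoen');
insert into docs values (2, 'Das Wetter ist schön');
commit;

-- Build a CONTEXT index that uses the lexer preference.
create index docs_body_idx on docs (body)
  indextype is ctxsys.context
  parameters ('lexer GERMAN_LEX');

-- With Alternate Spelling enabled, either spelling of the query term
-- is expected to match both rows.
select id from docs where contains(body, 'schön') > 0;
select id from docs where contains(body, 'schoen') > 0;
```

Because the lexer is applied when the index is built, changing the preference afterward would normally require recreating or rebuilding the index before queries reflect the change.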
When Swedish, German, or Danish has more than one way of spelling a word, Oracle Text normally indexes the word in its original form; that is, as it appears in the source document.
When Alternate Spelling is enabled, Oracle Text indexes words in their normalized form. So, for example, Schoen is indexed both as Schoen and as Schön, and a query on Schoen will return documents containing either spelling. (The same is true of a query on Schön.)
To enable Alternate Spelling, set the BASIC_LEXER attribute ALTERNATE_SPELLING to GERMAN, DANISH, or SWEDISH. See BASIC_LEXER for more information.
Besides alternative spelling, Oracle Text also handles base-letter conversions. With base-letter conversions enabled, letters with umlauts, acute accents, cedillas, and the like are converted to their basic forms for indexing, so fiancé is indexed both as fiancé and as fiance, and a query of fiancé returns documents containing either form.
To enable base-letter conversions, set the BASIC_LEXER attribute BASE_LETTER to YES. See BASIC_LEXER for more information.
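As a brief sketch, base-letter conversion could be enabled on a lexer preference like this. The preference name ACCENT_LEX is hypothetical; the attribute name and value are those given above.

```sql
begin
  -- 'ACCENT_LEX' is a hypothetical preference name chosen for illustration.
  ctx_ddl.create_preference('ACCENT_LEX', 'BASIC_LEXER');
  -- Convert accented letters to their base forms during indexing.
  ctx_ddl.set_attribute('ACCENT_LEX', 'BASE_LETTER', 'YES');
end;
/
```

The trailing slash runs the block in SQL*Plus; other clients may not need it.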
When Alternate Spelling is also enabled, Base-Letter Conversion may need to be overridden to prevent unexpected results. See Overriding Base-Letter Transformations with Alternate Spelling for more information.
The BASE_LETTER_TYPE attribute affects the way base-letter conversions take place. It has two possible values: GENERIC or SPECIFIC.
The GENERIC value is the default and specifies that base-letter transformation uses one transformation table that applies to all languages.
The SPECIFIC value means that a base-letter transformation that has been specifically defined for your language will be used. This enables you to use accent-sensitive searches for words in your own language, while ignoring accents that are from other languages.
For example, both the GENERIC and the Spanish SPECIFIC tables will transform é into e. However, they treat the letter ñ differently. The GENERIC table treats ñ as an n with an accent (actually, a tilde), and so transforms ñ to n. The Spanish SPECIFIC table treats ñ as a separate letter of the alphabet, and thus does not transform it.
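A minimal sketch of selecting the language-specific table follows. The preference name SPANISH_LEX is hypothetical, and this section does not describe how Oracle Text decides which language's table applies, so consult BASIC_LEXER for that detail.

```sql
begin
  -- 'SPANISH_LEX' is a hypothetical preference name.
  ctx_ddl.create_preference('SPANISH_LEX', 'BASIC_LEXER');
  -- Base-letter conversion must be on for the type setting to matter.
  ctx_ddl.set_attribute('SPANISH_LEX', 'BASE_LETTER', 'YES');
  -- Use the language-specific table instead of the default GENERIC one.
  ctx_ddl.set_attribute('SPANISH_LEX', 'BASE_LETTER_TYPE', 'SPECIFIC');
end;
/
```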
In 1996, new spelling rules for German were approved by representatives from all German-speaking countries. For example, under the spelling reforms, Potential becomes Potenzial, Schiffahrt becomes Schifffahrt, and schneuzen becomes schnäuzen.
When the BASIC_LEXER attribute NEW_GERMAN_SPELLING is set to YES, a CONTAINS query on a German word that has both new and traditional forms returns documents matching either form. For example, a query on Potential returns documents containing either Potential or Potenzial. The default setting is NO.
Note: Under reformed German spelling, many words traditionally spelled as one word, such as soviel, are now spelled as two (so viel). Currently, Oracle Text does not make these conversions, nor conversions from two words to one (for example, weh tun to wehtun).
The case of the transformed word is determined from the first two characters of the word in the source document; that is, schiffahrt becomes schifffahrt, Schiffahrt becomes Schifffahrt, and SCHIFFAHRT becomes SCHIFFFAHRT.
Because many new German spellings include hyphens, it is recommended that users choosing NEW_GERMAN_SPELLING define hyphens as printjoins.
See BASIC_LEXER for more information on setting this attribute.
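The recommendation above might be applied as in the following sketch, which enables the new spelling rules and declares the hyphen as a printjoin character. The preference name NEW_GERMAN_LEX is hypothetical.

```sql
begin
  -- 'NEW_GERMAN_LEX' is a hypothetical preference name.
  ctx_ddl.create_preference('NEW_GERMAN_LEX', 'BASIC_LEXER');
  -- Match both traditional and reformed spellings at query time.
  ctx_ddl.set_attribute('NEW_GERMAN_LEX', 'NEW_GERMAN_SPELLING', 'YES');
  -- Treat the hyphen as part of the token so hyphenated words stay whole.
  ctx_ddl.set_attribute('NEW_GERMAN_LEX', 'PRINTJOINS', '-');
end;
/
```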
Even when alternative spelling features have been specified by lexer preference, it is possible to override them. Overriding takes the following form:
Overriding of base-letter conversion when Alternate Spelling is used, to prevent characters with alternate spelling forms, such as ü, ö, and ä, from also being transformed to the base letter forms.
Transformations caused by turning on alternate_spelling are performed before those of base_letter, which can sometimes cause unexpected results when both are enabled.
When Alternate Spelling is enabled, Oracle Text converts two-letter forms to single-letter forms (for example, ue to ü), so that words can be searched in both their base and alternate forms. Therefore, with Alternate Spelling enabled, a search for Schoen will return documents with both Schoen and Schön.
However, when Base-Letter Conversion is also enabled, the ö in Schön is transformed into an o, producing the unintended form Schon (in German, schon is a different word meaning "already"), and the word is indexed in all three forms.
To prevent this secondary conversion, set the OVERRIDE_BASE_LETTER attribute to TRUE.
OVERRIDE_BASE_LETTER only affects letters with umlauts; accented letters, for example, are still transformed into their base forms.
For more on BASE_LETTER, see Base-Letter Conversion.
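Combining the three attributes discussed in this section could look like the sketch below. The preference name GERMAN_OVERRIDE_LEX is hypothetical; the attribute names and values come from the text above.

```sql
begin
  -- 'GERMAN_OVERRIDE_LEX' is a hypothetical preference name.
  ctx_ddl.create_preference('GERMAN_OVERRIDE_LEX', 'BASIC_LEXER');
  -- Index both Schoen and Schön for either spelling.
  ctx_ddl.set_attribute('GERMAN_OVERRIDE_LEX', 'ALTERNATE_SPELLING', 'GERMAN');
  -- Still strip accents such as é to e.
  ctx_ddl.set_attribute('GERMAN_OVERRIDE_LEX', 'BASE_LETTER', 'YES');
  -- Keep umlauted letters from also being reduced to base letters.
  ctx_ddl.set_attribute('GERMAN_OVERRIDE_LEX', 'OVERRIDE_BASE_LETTER', 'TRUE');
end;
/
```

With this combination, Schön is expected to be indexed as Schoen and Schön but not as Schon, while a word such as fiancé is still reduced to fiance.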
The following sections show the alternative spelling substitutions used by Oracle Text.
The German alphabet is the English alphabet plus the additional characters: ä ö ü ß. Table 15-1 lists the alternate spelling conventions Oracle Text uses for these characters.
The Danish alphabet is the Latin alphabet without the w, plus the special characters: ø æ å. Table 15-2 lists the alternate spelling conventions Oracle Text uses for these characters.
The Swedish alphabet is the English alphabet without the w, plus the additional characters: å ä ö. Table 15-3 lists the alternate spelling conventions Oracle Text uses for these characters.