Oracle® Text Reference
11g Release 1 (11.1)

Part Number B28304-01

B Oracle Text Supported Document Formats

This appendix lists the document formats supported by the automatic (AUTO_FILTER) filtering technology. It covers two topics: About Document Filtering Technology (B.1) and Supported Document Formats (B.2).

B.1 About Document Filtering Technology

The automatic filtering technology in Oracle Text, which is licensed from Autonomy, Inc., enables you to index most document formats. This technology also enables you to convert documents to HTML for document presentation with the CTX_DOC package.

To use automatic filtering for indexing and DML processing, you must specify the AUTO_FILTER object in your filter preference.

To use automatic filtering technology for converting documents to HTML with the CTX_DOC package, you need not use the AUTO_FILTER indexing preference, but you must still set up your environment to use this filtering technology, as described in this appendix.

B.1.1 Latest Updates for Patch Releases

The supported platforms and formats listed in this appendix apply for this release. These supported formats are updated for patch releases. To view the latest formats, refer to the Oracle Technology Network:

http://www.oracle.com/technology/products/text

B.1.2 Restrictions on Format Support

Password-protected documents and documents with password-protected content are not supported by the AUTO_FILTER filter.

For other limitations, refer to the sections in this appendix concerning specific document types.

B.1.3 Supported Platforms for AUTO_FILTER Document Filtering Technology

Several platforms can take advantage of AUTO_FILTER filter technology.

B.1.3.1 Supported Platforms

AUTO_FILTER filter technology is supported on the following platforms:

  • Microsoft Windows

    • Server 2003 (x86 and IA-64)

    • XP (Service Packs 1 and 2)

    • 2000 x86 (Service Pack 2)

  • Sun Solaris 8.0, 9.0, and 10 SPARC (built on Solaris 8.0)

  • Sun Solaris on x86

  • HP-UX 11.0 and 11i, PA-RISC

  • HP-UX 11i v2, IA-64

  • IBM AIX L5.1, L5.2, and L5.3 (Power)

  • Red Hat Enterprise Linux AS 3.0 and 4.0 x86 and IA-64 (built on AS 3.0)

  • SuSE Linux Enterprise Server 8 and 9 x86 (built on Red Hat). For version 8.0 support, applications must use a GCC 3.2.3 runtime library.

  • Linux on Power

B.1.4 Environment Variables

No environment variables need to be set by the user.

B.1.5 General Limitations

AUTO_FILTER filter technology has the following limitations:

  • Any ASCII characters less than 0x20 (decimal 32) are converted to hexadecimal numbers.

  • Files larger than 2GB are not handled.
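
The two limitations above can be checked before a document is submitted for filtering. The following Python sketch is illustrative only and is not part of Oracle Text. The names SIZE_LIMIT_BYTES, is_too_large, and control_char_offsets are hypothetical, and reading "2GB" as 2 x 1024^3 bytes is an assumption.

```python
# Hypothetical pre-check helpers; not an Oracle Text API.
# Assumption: "2GB" is read as 2 * 1024**3 bytes.
SIZE_LIMIT_BYTES = 2 * 1024**3


def is_too_large(size_bytes: int) -> bool:
    """Return True when a file of this size exceeds the assumed 2GB limit."""
    return size_bytes > SIZE_LIMIT_BYTES


def control_char_offsets(data: bytes, max_hits: int = 10) -> list:
    """Return offsets of bytes below 0x20, stopping after max_hits matches.

    These are the characters that the filter converts to hexadecimal numbers.
    Note that tab, line feed, and carriage return also fall below 0x20.
    """
    offsets = []
    for index, byte in enumerate(data):
        if byte < 0x20:
            offsets.append(index)
            if len(offsets) >= max_hits:
                break
    return offsets
```

A caller might pass os.path.getsize(path) to is_too_large and the first few kilobytes of the file to control_char_offsets, then decide whether to clean or skip the document.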

B.2 Supported Document Formats

The tables in this section list the document formats that Oracle Text supports for filtering. Oracle Text licenses its filtering technology from Autonomy, Inc.

Document filtering is used for indexing, DML, and for converting documents to HTML with the CTX_DOC package.

Note:

These lists do not represent the complete list of formats that Oracle Text is able to process. The USER_FILTER and PROCEDURE_FILTER enable Oracle Text to process any document format, provided an external filter exists that can filter to some textual format like plain-text, HTML, XML, and so forth.

B.2.1 Text and Markup

Plain-text, HTML, XHTML, XML, and SGML formats pass through the filter without any conversion.

Format Version Single-byte Asian (and Most Multi-byte) Bi-directional?
ANSI (TXT) All versions Y Y Y
ASCII (TXT) All versions Y Y Y
HTML 2.0, 3.2, 4.0 Y Y Y
IBM DCA/RFT (Revisable Form Text) (DC) SC23-0758-1 Character sets 500 and 1026 only N N
Rich Text Format (RTF) 1 through 1.7 Y Y Y
Unicode Text 3, 4 Y Y Y
XHTML 1.0 Y Y Y
Generic XML 1.0 Y Y Y

B.2.2 Word Processing Formats

Format Version Single-byte Asian (and Most Multi-byte) Bi-directional?
Adobe Maker Interchange Format (MIF) 5, 5.5, 6, 7 Character set 1252 only N N
Applix Words (AW) 3.11, 4.0, 4.1, 4.2, 4.3, 4.4 Character set 1252 only N N
DisplayWrite (IP) 4 Character sets 500 and 1026 only N N
Folio Flat File (FFF) 3.1 Character set 1252 only N N
Fujitsu Oasys (OA2) 7 Y Y N
JustSystems Ichitaro (JTD) 8 through 2005 Y Y N
Lotus AMI Pro (SAM) 2, 3 Y Simplified Chinese, Traditional Chinese, Japanese, and Thai only Y
Lotus Word Pro (LWP) 96, 97, Millennium Edition R9, 9.8 (supported on Windows 32-bit platform only) Y Y Y
Lotus Master (MWP) 96, 97 (supported on Windows 32-bit platform only) Y Y N
Lotus AMI Professional Write Plus (AMI) 2.1 Y Simplified Chinese, Traditional Chinese, Japanese, and Thai only N
Microsoft Word for PC (DOC) 4, 5, 5.5, 6 character set 1252 only N N
Microsoft Word for Windows (DOC) 1 through 2003 Y (N: versions 1-2; Y: versions 6, 7, 8, 95, 97, 2000, XP, 2002, 2003) (N: versions 1-2; Hebrew only: versions 6, 7, 8, 95; Y: versions 97, 2000, XP, 2002, 2003)
Microsoft Word for Windows XML format 2003 (No formatting extracted) Y Y Y
Microsoft Word for Macintosh (DOC) 4, 5, 6, 98 Y N Y
Microsoft Works (WPS) 1 through 2000 Y Japanese only N
Microsoft Windows Write (WRI) 1, 2, 3 Y Japanese only N
OpenOffice (SXW) 1, 1.1 (No formatting extracted) Y Y Y
StarOffice (SXW) 6, 7 (No formatting extracted) Y Y Y
WordPad (RTF) Through 2003 Y Y Y
WordPerfect for Windows (WO) 5, 5.1 Y N Y
WordPerfect for Windows (WPD) 6, 7, 8, 9, 10, 2000, 2002, 11 Y N N
WordPerfect for Macintosh (WPS) 1.02, 2, 2.1, 2.2, 3, 3.1 Y N N
WordPerfect for Linux 6.0, 8.1 Y N N
XYWrite (XY4) 4.12 Character set 1252 only N N

B.2.2.1 Word Processing Filtering Limitations

The following limitations apply to filtering of word processing documents:

  • If a graphic or table appears in a word processing text box, the filter cannot position it correctly in the HTML output.

  • Nested tables (a table inside another table) in word processing documents are not supported.

  • Line numbers in Microsoft Word documents are not supported.

  • Columns in word processing documents are not supported. Text and graphics in multiple columns in the source document appear in a single flow in the HTML output.

  • WordArt is converted to text. Display enhancements such as curves, angles, 3-D effects, and shadows are not shown.

  • Because the concept of a "page" does not exist in HTML, page borders are not supported.

  • Because the concept of a "page" does not exist in HTML, all page headers appear at the beginning of the HTML output file, and all page footers appear at the end of the HTML output file.

  • Because the concept of a "page" does not exist in HTML, page orientation (landscape and portrait) is not supported.

  • For XML-based formats, only the following character sets are supported:

    Table B-1 Character Sets Supported in XML-based Formats

    Character Set Description

    AL32UTF8

    Unicode 4.0 UTF-8 Universal character set

    AL16UTF16

    Unicode 4.0 UTF-16 Universal 32-bit characters character set

    JA16EUC

    EUC 24-bit Japanese

    KO16MSWIN949

    MS Windows Code Page 949 Korean

    WE8ISO8859P1

    ISO 8859-1 West European

    US7ASCII

    ASCII 7-bit American

    ZHS16GBK

    GBK 16-bit Simplified Chinese

    ZHT16HKSCS

    MS Windows Code Page 950 with Hong Kong Supplementary Character Set HKSCS-2001 (character set conversion to and from Unicode is based on Unicode 3.0)

    JA16SJISTILDE

    Shift-JIS 16-bit Japanese, except the wave dash and tilde are mapped differently to and from Unicode.


Note:

A Unicode text document is not recognized if it lacks a byte-order mark. An exception is when the first 1024 bytes are Basic Latin Unicode. In this case a byte order mark is not required.
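
The rule in this note can be approximated with a short check. The Python sketch below is illustrative only and does not reproduce the filter's actual detection logic. The function names are hypothetical, and it assumes that "Basic Latin Unicode" means UTF-16 text in which every code unit in the first 1024 bytes falls in the range U+0000 to U+007F.

```python
import codecs

# Byte-order marks for the common Unicode encodings.
# UTF-32 marks come first only for readability; startswith() tests them all.
_BOMS = (
    codecs.BOM_UTF32_LE,
    codecs.BOM_UTF32_BE,
    codecs.BOM_UTF8,
    codecs.BOM_UTF16_LE,
    codecs.BOM_UTF16_BE,
)


def has_bom(data: bytes) -> bool:
    """Return True when the data starts with a Unicode byte-order mark."""
    return data.startswith(_BOMS)


def head_is_basic_latin_utf16(data: bytes, window: int = 1024) -> bool:
    """Return True when the first `window` bytes look like Basic Latin UTF-16.

    Assumption: every 16-bit code unit must be below 0x80, in either
    little-endian or big-endian byte order.
    """
    head = data[:window]
    head = head[: len(head) - (len(head) % 2)]  # keep whole code units only
    if not head:
        return False
    evens, odds = head[0::2], head[1::2]
    little = all(b == 0 for b in odds) and all(b < 0x80 for b in evens)
    big = all(b == 0 for b in evens) and all(b < 0x80 for b in odds)
    return little or big


def likely_recognized(data: bytes) -> bool:
    """Approximate the note: a BOM is present, or the head is Basic Latin."""
    return has_bom(data) or head_is_basic_latin_utf16(data)
```

A document for which likely_recognized returns False is a candidate for adding a byte-order mark before indexing.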

B.2.3 Spreadsheet Formats

Format Version Single-byte Asian (and Most Multi-byte) Bi-directional?
Applix Spreadsheets (AS) 4.2, 4.3, 4.4 Character set 1252 only N N
Corel Quattro Pro (QPW, WB3) 5, 6, 7, 8 (Later versions not supported) Y N N
Lotus 1-2-3 (123) 96, 97, Millennium Edition R9, 9.8 Y Y Y
Lotus 1-2-3 (WK4) 2, 3, 4, 5 Y Y N
Lotus 1-2-3 Charts (123) 2, 3, 4, 5 Y Y N
Microsoft Excel for Windows (XLS) 2.2 through 2003 Y Y Y
Microsoft Excel for Windows XML format 2003 (No formatting extracted) Y Y Y
Microsoft Excel for Macintosh (XLS) 98 Y N N
Microsoft Excel Charts (XLS) 2, 3, 4, 5, 6, 7 Y Y N
Microsoft Works Spreadsheet (S30,S40) 1, 2, 3, 4 Y N N
OpenOffice (SXC) 1, 1.1 (No formatting extracted) Y Y Y
StarOffice (SXC) 6, 7 (No formatting extracted) Y Y Y
Comma Separated Values (CSV) Character set 1252 only N N

B.2.3.1 Spreadsheet Format Limitations

The following limitation applies to the filtering of spreadsheets:

  • Comments in Microsoft Excel spreadsheets are not filtered.

B.2.4 Presentation Formats

Format Version Single-byte Asian (and Most Multi-byte) Bi-directional?
Applix Presents (AG) 4.0, 4.2, 4.3, 4.4 character set 1252 only N N
Corel Presentations (SHW) 6, 7, 8, 9, 10, 11, 2000, 2002 character set 1252 only N N
Lotus Freelance Graphics (PRE) 2, 96, 97, 98, Millennium Edition R9, 9.8 character set 850 only N N
Lotus Freelance Graphics 2 (PRZ) 2 Y Japanese, Simplified Chinese, Traditional Chinese, and Thai only N
Microsoft PowerPoint for Windows (PPT) 95 through 2003 Y Japanese, Simplified Chinese, Traditional Chinese, and Korean only N
Microsoft PowerPoint for PC (PPT) 4 character set 1252 only Traditional Chinese only N
Microsoft PowerPoint for Macintosh (PPT) 98 Y N Y
Microsoft Project (MPP) 98, 2000, 2002 (XP) character set 1252 only N N
Microsoft Visio (VSD) 5, 6, 2000, 2002, 2003 Y Y Y
Microsoft Visio XML format 2003 (No formatting extracted) Y Y Y
OpenOffice (SXI, SXP) 1, 1.1 (No formatting extracted) Y Y Y
StarOffice (SXI, SXP) 6, 7 (No formatting extracted) Y Y Y

B.2.4.1 Presentation Format Limitations

The following limitations apply to the filtering of presentations:

  • Hyperlinks are not supported; hyperlinks within a document are not preserved in the output.

  • Right-aligned and center-aligned tabs are displayed as left-aligned.

  • WordArt is converted to text. Display enhancements such as curves, angles, 3-D effects, and shadows are not shown.

B.2.5 Display Formats

Format Version Single-byte Asian (and Most Multi-byte) Bi-directional?
Adobe Portable Document Format (PDF) 1.1 (Acrobat 2.0) to 1.6 (Acrobat 7.0) Y Y Y

B.2.5.1 Filtering of PDF Format Documents

Multi-byte PDFs are supported, provided the PDF document is created using Character ID-keyed (CID) fonts, predefined CJK CMap files, or ToUnicode font encodings, and the document does not contain embedded fonts. See the Adobe website and the Adobe Acrobat documentation for more information.

To determine the type of font encodings that are used in a PDF, open the PDF document in Adobe Acrobat, and select File->Document Info->Fonts. If the Encodings column lists Custom or Embedded encodings, then you may encounter problems filtering the PDF document.

B.2.5.2 PDF Filtering Limitations

Limitations apply to PDF documents as described in this section.

  • Embedded fonts in a PDF document are not filtered correctly. They are usually displayed using the question mark (?) replacement character.

  • The following color spaces are supported:

    • DeviceRGB

    • DeviceGray

    • DeviceCMYK

    • CalGray

    • CalRGB

    Index color spaces are supported as long as they are used with a supported basic color space.

  • Hyperlinks in a PDF are not active when displayed in a browser or a viewing window.

  • All pre-defined CMaps in PDF 1.3 specification are supported. CMaps added in PDF 1.4 and PDF 1.5 specifications are not supported. A CMap specifies a mapping from a character code to the Adobe Character Identifier Number (CID). Characters with unsupported CMaps are not translated correctly. They are usually displayed using the question mark (?) replacement character.

  • Annotations, such as notes, sound, or movie, are not supported.

  • The following features of PDF 1.5 for Acrobat 6.0 are not supported:

    • Tagged PDFs. When processing a "tagged" PDF, the structure defined by the PDF tags is ignored and is not used to determine the paragraph flow of the output.

    • Images compressed in JPEG2000

    • Hidden content in a PDF document, such as Optional Content and OCG-State Actions

    • Interactive forms

    • Embedded multimedia presentations

    • Digital signatures and signature fields

    • Interactive presentations, that is, navigation between pages and transition actions.

  • Vector images are not supported. Because background colors are defined in PDF as vector images, background colors are also not supported. Raster images are supported.

  • Typeface styles that are rendered in a PDF by printing the character multiple times in the same space (such as shadow fonts) are not consolidated into a single character. For example, the shadow character "B" in a PDF is extracted as "BBBB."

B.2.6 Graphic Formats

Table B-2 lists the graphic formats that the AUTO_FILTER filter recognizes. This means that indexing a text column that contains any of these formats produces no error. As such, it is safe for the column to contain any of these formats.

Formats are categorized as either embedded graphics or standalone graphics. Embedded graphics are inserted or referenced within a document.

Note:

The AUTO_FILTER filter cannot extract textual information from graphics.

Table B-2 Supported Graphics Formats for AUTO_FILTER Filter

Graphics Format Version

AutoCAD Drawing format (DWG)

R13, R14, 2000, and 2004 (standalone only)

AutoCAD Drawing format (DXF)

R13, R14, 2000, and 2004 (standalone only)

Encapsulated PostScript (EPS) (raster only)

TIFF header only

Enhanced Metafile (EMF)

no specific version

Graphics Interchange Format (GIF)

87, 89

JPEG File Interchange Format

no specific version

Lotus AMIDraw Graphics (SDW)

no specific version

Lotus Pic (PIC)

no specific version

Macintosh Raster (PICT/PCT)

2

MacPaint (PNTG)

no specific version

Microsoft Windows Bitmap (BMP)

no specific version

PC Paintbrush (PCX)

3

Portable Network Graphics (PNG)

no specific version

SGI RGB Image (RGB)

no specific version

Sun Raster Image (RS)

no specific version

Tagged Image File (TIFF)

through 6.0 (see Footnote 1)

Truevision TARGA (TGA)

2

Windows Animated Cursor (ANI)

no specific version

Windows Metafile (WMF)

3

WordPerfect Graphics 1 (WPG)

1

WordPerfect Graphics 2 (WPG)

2, 7

Computer Graphics Metafile (CGM)

no specific version

Corel DRAW (CDR)

through 9.0

DCX Fax System (DCX)

no specific version

Microsoft Office Drawing (MSO)

no specific version

Windows Icon Cursor (ICO)

no specific version


Footnote 1 For Tagged Image File (TIFF), the following compression types are supported: no compression, CCITT Group 3 1-Dimensional Modified Huffman, CCITT Group 3 T4 1-Dimensional, CCITT Group 4 T6, LZW, JPEG (only Gray, RGB and CMYK color space are supported), and PackBits.
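
To see which compression type a TIFF file uses, the Compression tag (tag 259) in the file's first image file directory can be read directly. The Python sketch below is illustrative only and is not part of Oracle Text. The function name and the SUPPORTED table are hypothetical, and the mapping from the compression names in this footnote to numeric tag values follows the TIFF specification as an assumption, including the choice to list both JPEG values (6 and 7).

```python
import struct

# Assumed mapping of the footnote's compression names to TIFF tag 259 values.
SUPPORTED = {
    1: "no compression",
    2: "CCITT Group 3 1-Dimensional Modified Huffman",
    3: "CCITT Group 3 T4",
    4: "CCITT Group 4 T6",
    5: "LZW",
    6: "JPEG (old-style)",
    7: "JPEG",
    32773: "PackBits",
}


def tiff_compression(data: bytes) -> int:
    """Return the Compression tag value from the first IFD of a TIFF.

    Returns 1 (no compression, the TIFF default) when the tag is absent.
    """
    if data[:2] == b"II":
        endian = "<"  # little-endian file
    elif data[:2] == b"MM":
        endian = ">"  # big-endian file
    else:
        raise ValueError("missing TIFF byte-order header")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("missing TIFF magic number")
    (entry_count,) = struct.unpack(endian + "H", data[ifd_offset:ifd_offset + 2])
    for i in range(entry_count):
        start = ifd_offset + 2 + 12 * i  # each IFD entry is 12 bytes
        entry = data[start:start + 12]
        tag, _field_type, _count = struct.unpack(endian + "HHI", entry[:8])
        if tag == 259:
            # A SHORT value is stored left-justified in the 4-byte value field.
            (value,) = struct.unpack(endian + "H", entry[8:10])
            return value
    return 1
```

A caller could test tiff_compression(data) in SUPPORTED before indexing a column of TIFF images.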

B.2.6.1 Graphics Formats Limitations

AutoCAD drawing files are not supported on IBM AIX.