7 Understanding the Ultra Search Administration Tool

The Ultra Search administration tool lets you manage Ultra Search instances. This chapter guides you through the screens of the Ultra Search administration tool. It contains the following topics:

Ultra Search Administration Tool
Logging On to Ultra Search
Logging On and Managing Instances as SSO Users
Instances Page
Crawler Page
Web Access Page
Attributes Page
Sources Page
Schedules Page
Queries Page
Users Page
Globalization Page

Ultra Search Administration Tool

The Ultra Search administration tool is a J2EE-compliant Web application. You can use it to manage Ultra Search instances. To use the administration tool, log on through any browser as a database user, an Enterprise Manager super-user, a Portal user, or an SSO user.

Note:

The Ultra Search administration tool and the Ultra Search query applications are part of the Ultra Search middle tier. However, the Ultra Search administration tool is independent from the Ultra Search query application. Therefore, they can be hosted on different computers to enhance security or scalability.

With the administration tool, you can do the following:

Log on to Ultra Search
Create Ultra Search instances
Manage administrative users
Define data sources and assign them to data groups
Configure and schedule the Ultra Search crawler
Set query options
Translate search attributes and LOV and data group display names to different languages

Setting Crawler Parameters

To configure the Ultra Search crawler, you must do the following:

Set crawler parameters, such as the crawler log file directory. To do so, use the Crawler Page.
Set Web access parameters, such as authentication and the proxy server. To do so, use the Web Access Page.
Define data sources. Data sources can be Web pages, database tables, files, email mailing lists, Oracle Sources (for example, Oracle Application Server Portals or federated sources), or user-defined data sources. You can assign one or more data sources to a crawler schedule. To define data sources, use the Sources Page. You can also set parameters for the source, such as domain inclusions or exclusions for Web sources or the display URL template or column for table sources.
Define synchronization schedules. The crawler uses the synchronization schedule to reconcile the Ultra Search index with current data source content. To define crawling schedules, use the Schedules Page.

Setting Query Options

Use query options to let users limit their searches. Searches can be limited to document attributes and data groups.

Attributes

Search attributes can be mapped to HTML metatags, table columns, document attributes, and email headers. Some attributes, such as author and description, are predefined and need no configuration. However, you can customize your own attributes. To set custom search attributes to expose to the query user, use the Attributes Page.

Data Groups

Data source groups are logical entities exposed to the search engine user. When entering a query, the search engine user is asked to select one or more data groups to search from. A data group consists of one or more data sources. To define data groups, use the Queries Page.

Online Help in Different Languages

Ultra Search provides context-sensitive online help, which can be viewed in different languages. You can change the language preferences in the Users Page.

Logging On to Ultra Search

The following users can log on to the Ultra Search administration tool:

Single Sign-on (SSO) users: These users are managed by the Oracle Internet Directory and are authenticated by the SSO server. The Ultra Search administration tool identifies all Ultra Search instances to which the SSO user has access. This is available only if you have the Oracle Identity Management infrastructure installed.
Database users (non-SSO): These users exist in the database on which Ultra Search runs.
Enterprise Manager users
Portal SSO users

To log on to the administration tool, point your Web browser to one of the following URLs:

For non-SSO mode: http://hostname:port/ultrasearch/admin/index.jsp
For SSO mode: http://hostname:port/ultrasearch/admin_sso/index.jsp
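As an illustration, the two URL patterns above can be assembled with a small helper. This is a minimal Python sketch, not part of Ultra Search; the function name `admin_url` and the host and port values are hypothetical.

```python
def admin_url(hostname: str, port: int, sso: bool = False) -> str:
    """Return the administration tool URL for SSO or non-SSO mode."""
    # The two paths are taken from the URL patterns listed above.
    path = "ultrasearch/admin_sso/index.jsp" if sso else "ultrasearch/admin/index.jsp"
    return f"http://{hostname}:{port}/{path}"


# Hypothetical host and port, for illustration only.
print(admin_url("search.example.com", 7777))
print(admin_url("search.example.com", 7777, sso=True))
```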

Immediately after installation, the only users able to create and manage instances are the following:

The WKSYS database user
The Enterprise Manager user
The PORTAL SSO user belonging to the default company [not supported in the Oracle database release]
The ORCLADMIN SSO user belonging to the default company [this is available only if the Oracle Identity Management infrastructure is installed]

After you are logged on as one of these special users, you can grant permission to other users, enabling them to create and manage Ultra Search instances. Using the Ultra Search administration tool, you can only grant and revoke Ultra Search related permissions to and from existing users. To add or delete users, use the Oracle Internet Directory for single-sign-on users or Oracle Enterprise Manager for local database users.

Note:

The Ultra Search product database dictionary is installed in the WKSYS schema.

See Also:

Chapter 3, "Installing and Configuring Ultra Search"
"Changing Ultra Search Schema Passwords" for information about changing the WKSYS password
"Instances Page" for more information about creating Ultra Search instances
"Users Page" for more information about granting permission to other users
"Logging On and Managing Instances as SSO Users" for more information about how Ultra Search handles SSO users

Logging On and Managing Instances as SSO Users

Note:

Single Sign-On (SSO) is available only if the Oracle Identity Management infrastructure is installed.

Logging On to Ultra Search

When a single sign-on (SSO) user logs on to the SSO-protected Ultra Search administration tool, the user is first prompted with the SSO login screen. Enter the SSO user name and password. After the SSO server authenticates the user, the user sees a list of Ultra Search instances that they have the privilege to manage.

There are different URLs for different users. For example:

SSO users: http://host:http_port/ultrasearch_admin_sso/index.jsp
Portal users: http://host:http_port/pls/portal
Enterprise Manager users: http://host:em_port/

Granting Privileges to SSO Users

You might need to grant super-user privileges, or privileges for managing an Ultra Search instance, to an SSO user. This process is slightly different, depending on whether Oracle Application Server Portal is running in hosted mode or non-hosted mode, as described in the following list:

Note:

An SSO user is uniquely identified by Ultra Search with an SSO-nickname/subscriber-nickname combination.

In non-hosted mode, the subscriber-nickname is not required when granting privileges to an SSO user. This is because there is exactly one subscriber in Oracle Application Server Portal in non-hosted mode.

In hosted mode, the subscriber-nickname is required when granting privileges to an SSO user. This is because there can be more than one subscriber in Oracle Application Server Portal, and two or more users with the same SSO-nickname (for example, PORTAL) could be distinct SSO users distinguished by their subscriber-nickname. When running in hosted mode, also note the following:

When granting permissions for the default subscriber user, always specify "DEFAULT COMPANY" for the subscriber-nickname, even though the actual nickname could be different; for example, "ORACLE". The actual nickname is not recognized by Ultra Search.

When logging in to SSO as the default subscriber user, leave the subscriber nickname blank. Alternatively, enter "DEFAULT COMPANY" instead of the actual subscriber nickname (for example, "ORACLE") so that it is recognized by Ultra Search.

Note:

At any point after installation, you can run an Oracle Application Server Portal script to alter the running mode from non-hosted to hosted. Whenever this is done, the Oracle Application Server Portal script invokes an Ultra Search script to inform Ultra Search of the change from non-hosted to hosted modes.

See Also:

Hosting Developer's Guide at http://otn.oracle.com/

Instances Page

After successfully logging on to the Ultra Search administration tool, you find yourself on the Instances Page. This page manages all Ultra Search instances in the local database. In the top left corner of the page, there are tabs for creating, selecting, editing, and deleting instances.

Before you can use the administration tool to configure crawling and indexing, you must create an Ultra Search instance. An Ultra Search instance is identified with a name and has its own crawling schedules and index. Only users granted super-user privileges can create Ultra Search instances.

Creating an Instance

To create an instance, click Create. You can create a regular instance or a read-only snapshot instance. Only users with super-user privileges can create new instances.

Note:

If you define the same data source within different Ultra Search instances, then there could be crawling conflicts for table data sources with logging enabled, email data sources, and some user-defined data sources.

Creating a Regular Instance

To create an instance, do the following:

Prepare the database user.

Every Ultra Search instance is based on a database user/schema with the WKUSER role.

The database user you create to house the Ultra Search instance should be assigned a dedicated self-contained tablespace. This is important if you plan to ever create snapshot instances of this instance. To do this, create a new tablespace. Then, create a new database user whose default tablespace is the one you just created.
See Also:

"Configuring the Oracle Server for Ultra Search" for information and instructions on configuring database users for Ultra Search

"Creating a Snapshot Instance"
Follow instance creation in the Ultra Search administration tool.

From the main instance creation page, click Create Instance, and provide the following information:
- Instance name
- Database schema: this is the user name from step 1.
- Schema password
You can also enter the following optional index preferences:
- Lexer
  
  Specify the name of the lexer you want to use for indexing. The lexer breaks text into tokens according to your language. These tokens are usually words. The default lexer is wksys.wk_lexer, as defined in the wk0pref.sql file. After the instance is created, the lexer can no longer be changed.
- Stoplist
  
  Specify the name of a stoplist you want to use during indexing. The default stoplist is wksys.wk_stoplist, as defined in the wk0pref.sql file. Try to avoid modifying the stoplist after the instance has been created.
- Storage
  
  Specify the name of the storage preference for the index of your instance. The default storage preference is wksys.wk_storage, as defined in the wk0pref.sql file. After the instance is created, the storage preference cannot be changed.
  See Also:
  
  Oracle Text Reference for more information on creating and modifying lexers, stoplists, and storage
  
  "Managing Stoplists"

Creating a Snapshot Instance

A snapshot instance is a copy of another instance. Unlike a regular instance, a snapshot instance is read only; it does not synchronize its index to the search domain. After the master instance re-synchronizes to the search domain, the snapshot instance becomes out of date. At that point, you should delete the snapshot and create a new one.

Note:

The snapshot and its master instance cannot reside on the same database.

A snapshot instance is useful for the following purposes:

Query Processing

Two Ultra Search instances can answer queries about the same search domain. Therefore, in a set amount of time, two instances can answer more queries about that domain than one instance. Because snapshot instances do not involve crawling and indexing, snapshot instance creation is fast and inexpensive. Thus, snapshot instances can improve scalability.
Backups

If the master instance becomes corrupted, its snapshot can be transformed into a regular instance by editing the instance mode to updatable. Because the snapshot and its master instance cannot reside on the same database, a snapshot instance should be made updatable only to replace a corrupted master instance.

A snapshot instance does not inherit authentication from the master instance. Therefore, if you make a snapshot instance updatable, you must re-enter any authentication information needed to crawl the search domain.

To create a snapshot instance, do the following:

Prepare the database user.

As with regular instances, snapshot instances require a database user. This user must have been granted the WKUSER role.
Copy the data from the master instance.

This is done with the transportable tablespace mechanism, which does not allow renaming of tablespaces. Therefore, a snapshot instance cannot be created on the same database as its master.

Identify the tablespace or the set of tablespaces that contain all the master instance data. Then, copy it, and plug it into the database user from step 1.
Follow snapshot instance creation in the Ultra Search administration tool.

From the main instance creation page, click Create Read-Only Snapshot Instance, and provide the following information:
- Snapshot instance name
- Snapshot schema name: this is the database user from step 1.
- Snapshot schema password
- Database link: this is the name of the database link to the database where the master instance lives.
- Master instance name
Enable the snapshot for secure searches.

If the master instance for the snapshot is secure-search enabled and if the destination database in which you are making the snapshot supports secure-search enabled instances, then you must also run a PL/SQL procedure in the destination database where you are creating the snapshot.

Running this procedure translates the IDs of the access control lists (ACLs) in the destination database, rendering them usable. Log on to the database as the WKSYS user. Invoke the procedure as follows:
```
exec WK_ADM.USE_INSTANCE('instance_name'); 
exec WK_ADM.TRANSLATE_ACL_IDS();
```

where instance_name is the name of the snapshot instance.

Make sure that both statements complete successfully without error.

See Also:

Chapter 4, "Post-Installation Information" for information on changing the WKSYS password and for instructions on configuring database users for Ultra Search
Oracle Database Administrator's Guide for details on using transportable tablespaces

Selecting an Instance

You can have multiple Ultra Search instances. For example, an organization could have separate Ultra Search instances for its marketing, human resources, and development portals. The administration tool requires you to specify an instance before it lets you make any instance-specific changes.

To select an instance, do the following:

Click Select on the Instances Page.
Select an instance from the pull-down menu.
Click Apply.

Note:
Instances do not share data. Data sources, schedules, and indexes are specific to each instance.

Deleting an Instance

To delete an instance, do the following:

Click Delete on the Instances Page.
Select an instance from the pull-down menu.
Click Apply.

Note:
To delete an Ultra Search instance, the user must be granted the super-user privileges.

Editing an Instance

To edit an instance, click Edit on the Instances Page.

You can change the instance mode (make the instance updatable) or change the instance password.

Instance Mode

You can change the instance mode to updatable or read only. Updatable instances synchronize themselves to the search domain on a set schedule, whereas read-only instances (snapshot instances) do not do any synchronization. To set the instance mode, select the box corresponding to the mode you want, and click Apply.

Schema Password

An Ultra Search instance must know the password of the database user in which it resides. The instance cannot get this information directly from the database. During instance creation, you provide the database user password, and the instance caches this information.

If this database user password changes, then the password that the instance has cached must be updated. To do this, enter the new password and click Apply. After the new password is verified against the database, it replaces the cached password.

Crawler Page

The Ultra Search crawler is a Java application that spawns threads to crawl defined data sources, such as Web sites, database tables, or email archives. Crawling occurs at regularly scheduled intervals, as defined in the Schedules Page.

With this page, you can do the following:

Configure the Settings

Crawler Threads: Specify the number of crawler threads to be spawned at run time.

Number of Processors: Specify the number of central processing units (CPUs) that exist on the server where the Ultra Search crawler will run. This setting determines the optimal number of document conversion threads used by the system. A document conversion thread converts multiformat documents into HTML documents for proper indexing.

Automatic Language Detection: Not all documents retrieved by the Ultra Search crawler specify the language. For documents with no language specification, the Ultra Search crawler attempts to automatically detect language. Click Yes to turn on this feature.

The language recognizer is trained statistically using trigram data from documents in various languages (Danish, Dutch, English, French, German, Italian, Portuguese, and Spanish). It starts with the hypothesis that the given document does not belong to any language and ultimately refutes this hypothesis for a particular language where possible. It operates on the Latin-1 alphabet and on any language with a deterministic Unicode range of characters (Chinese, Japanese, Korean, and so on).

The crawler determines the language code by checking the HTTP header content-language or the LANGUAGE column, if it is a table data source. If it cannot determine the language, then it takes the following steps:

If the language recognizer is not available or if it is unable to determine a language code, then the default language code is used
If the language recognizer is available, then the output from the recognizer is used.
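The decision order described above can be summarized in a short sketch. It is written in Python for illustration only; the function `resolve_language` and the `recognizer` callable are hypothetical stand-ins, not Ultra Search APIs.

```python
from typing import Callable, Optional


def resolve_language(
    declared: Optional[str],
    recognizer: Optional[Callable[[str], Optional[str]]],
    text: str,
    default: str,
) -> str:
    """Pick a language code using the fallback order described above."""
    # 1. A language declared by the source (HTTP content-language header
    #    or LANGUAGE column) is used as-is.
    if declared:
        return declared
    # 2. Otherwise, use the recognizer output when a recognizer is
    #    available and it reaches a decision.
    if recognizer is not None:
        detected = recognizer(text)
        if detected:
            return detected
    # 3. Otherwise, fall back to the default language code.
    return default


# Hypothetical recognizer that never reaches a decision.
print(resolve_language(None, lambda text: None, "some text", "en"))
```

Passing `None` as the recognizer models the case where automatic language detection is turned off or unavailable.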

This language code is stored in the LANG column of the wk$url and wk$doc tables. Multilexer is the only lexer used for Ultra Search. All document URLs are stored in wk$doc for indexing and wk$url for crawling.

Default Language: If automatic language detection is disabled, or if a Web document does not have a specified language, then the crawler assumes that the Web page is written in this default language. This setting is important, because language directly determines how a document is indexed.

Note:

This default language is used only if the crawler cannot determine the document language during crawling. Set language preference in the Users Page.

You can select a default language for the crawler or for data sources. Default language support for indexing and querying is available for the following languages:

Polish
Chinese
Hungarian
Norwegian
Romanian
Finnish
Japanese
Spanish
Slovak
English
Turkish
Danish
Swedish
Russian
German
Korean
Dutch
Italian
Greek
Portuguese
Czech
Hebrew
French
Arabic

Crawling Depth: A Web document could contain links to other Web documents, which could contain more links. This setting lets you specify the maximum number of nested links the crawler will follow.
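To make the effect of a depth limit concrete, here is a minimal Python sketch of a breadth-first traversal that stops following links at a maximum depth. It is not the Ultra Search crawler: the `links` dictionary is a hypothetical in-memory stand-in for fetched pages, and counting the starting address as depth 0 is an assumption made for this example.

```python
from collections import deque
from typing import Dict, List


def crawl(seed: str, links: Dict[str, List[str]], max_depth: int) -> List[str]:
    """Return the URLs visited when following links at most max_depth deep."""
    seen = {seed}
    visited = []
    queue = deque([(seed, 0)])  # assumption: the seed is at depth 0
    while queue:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # do not follow links found at the maximum depth
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical link graph: a -> b -> c -> d
graph = {"a": ["b"], "b": ["c"], "c": ["d"]}
print(crawl("a", graph, 2))
```

With a limit of 2, page `d` is never reached, because it is three links away from the starting address.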

See Also:

"Tuning the Web Crawling Process" for more information on the importance of the crawling depth

Crawler Timeout Threshold: Specify in seconds a crawler timeout. The crawler timeout threshold is used to force a timeout when the crawler cannot access a Web page.

Default Character Set: Specify the default character set. The crawler uses this setting when an HTML document does not have its character set specified.

Cache Directory: Specify the absolute path of the cache directory. During crawling, documents are stored in the cache directory. Every time the preset size is reached, crawling stops and indexing starts.

If you are crawling sensitive information, then make sure that you set the appropriate file system read permission to the cache directory.

You can choose whether or not to have the cache cleared after indexing.

Crawler Logging: Specify the following:

Level of detail: everything or only a summary
Crawler logfile directory
Crawler logfile language

The log file directory stores the crawler log files. The log file records all crawler activity, warnings, and error messages for a particular schedule. It includes messages logged at startup, runtime, and shutdown. Logging everything can create very large log files when crawling a large number of documents. However, in certain situations, it can be beneficial to configure the crawler to print detailed activity to each schedule log file. The crawler logfile language is the language the crawler uses to generate the log file.

The crawler maintains multiple versions of its log file. The format of the log file name is:

i<instance_id>ds<data_source_id>.MMDDhhmm.log

where MM is the month, DD is the date, hh is the launching hour in 24-hour format, and mm is the minutes. For example, if a schedule for data source 23 of instance 3 is launched at 10 pm, July 8th, then the log file name is i3ds23.07082200.log. Each successive schedule launching will have a unique log file name. If the total number of log files for a data source reaches the system-specified limit, then the oldest log file will be deleted. The number of log files is a scheduler property and applies to all of the data sources assigned to the scheduler.
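The naming pattern can be reproduced with a few lines of Python, shown here only as an illustration. The function name `crawler_log_name` is hypothetical, and the year in the example is arbitrary because the file name does not include it.

```python
from datetime import datetime


def crawler_log_name(instance_id: int, data_source_id: int, launched: datetime) -> str:
    """Build a log file name of the form i<instance>ds<source>.MMDDhhmm.log."""
    # %m = month, %d = day, %H = hour (24-hour clock), %M = minutes
    return f"i{instance_id}ds{data_source_id}.{launched:%m%d%H%M}.log"


# Data source 23 of instance 3, launched at 10 pm on July 8th.
print(crawler_log_name(3, 23, datetime(2024, 7, 8, 22, 0)))
```

For the launch time in the example above, the helper returns i3ds23.07082200.log.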

Database Connect String: The database connect string is a standard JDBC connect string used by the remote crawler when it connects to the database. The connect string can be provided in the form of [hostname]:[port]:[sid] or in the form of a TNS keyword-value syntax; for example:

"(DESCRIPTION=(ADDRESS=(PROTOCOL=tcp)(HOST=...)(PORT=1521)...))"

You can update the JDBC connect string to a different format; for example, an LDAP format. However, you cannot change the JDBC connect string to point to a different database. The JDBC connect string must be set to the database where the middle tier points; that is, the middle tier and the JDBC should point to the same database.

In a Real Application Clusters environment, the TNS keyword-value syntax should be used, because it allows connection to any node of the system. For example,

"(DESCRIPTION=(LOAD_BALANCE=yes)(ADDRESS=(PROTOCOL=TCP)(HOST=cls02a)(PORT=3001))
(ADDRESS=(PROTOCOL=TCP)(HOST=cls02b)(PORT=3001))(CONNECT_DATA=(SERVICE_NAME=sales.us.acme.com)))"
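Because a keyword-value connect string is easy to mistype, it can help to assemble it programmatically and check that its parentheses balance before entering it. The following Python sketch does that for the two forms described above; the helper names are hypothetical and the code is not part of Ultra Search.

```python
from typing import List, Tuple


def simple_connect_string(host: str, port: int, sid: str) -> str:
    """Build the [hostname]:[port]:[sid] form."""
    return f"{host}:{port}:{sid}"


def tns_connect_string(addresses: List[Tuple[str, int]], service_name: str) -> str:
    """Build a load-balanced keyword-value descriptor with one ADDRESS per node."""
    parts = ["(DESCRIPTION=(LOAD_BALANCE=yes)"]
    for host, port in addresses:
        parts.append(f"(ADDRESS=(PROTOCOL=TCP)(HOST={host})(PORT={port}))")
    parts.append(f"(CONNECT_DATA=(SERVICE_NAME={service_name})))")
    return "".join(parts)


def parens_balanced(text: str) -> bool:
    """Return True if every '(' has a matching ')' in the right order."""
    depth = 0
    for ch in text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


descriptor = tns_connect_string([("cls02a", 3001), ("cls02b", 3001)], "sales.us.acme.com")
print(descriptor)
print(parens_balanced(descriptor))
```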

Remote Crawler Profiles

Use this page to view and edit remote crawler profiles.

A remote crawler profile consists of all parameters needed to run the Ultra Search crawler on a remote computer other than the Ultra Search database. To register a remote crawler, you need to use the PL/SQL API wk_crw.register_remote_crawler. You can choose either RMI-based or JDBC-based remote crawling.

To configure the remote crawler, click Edit. Here is a list of configuration parameters that you can change for the remote crawler:

Cache file access mode. You have two options for the remote crawler to handle cache files:
- Through a JDBC connection.
  
  In this case, the remote crawler will send cache files over the crawler's JDBC connection to the server's cache directory.
- Through a mounted file system.
  
  If you choose this option, the cache file will be saved in the remote crawler cache directory. The remote crawler cache directory must be mounted to the server-side crawler cache directory (specified on the Settings tab of the Crawler Page); otherwise, the documents cannot be indexed.
See Also:
For more on crawling with JDBC connections, see "Using the Remote Crawler"
Cache directory location (absolute path)
Crawler log file directory
Mail archive path
Number of crawler threads
Number of processors
Initial Java heap size (in megabytes)
Maximum Java heap size (in megabytes)
Java classpath

Crawler Statistics

Use this page to view the following crawler statistics:

Summary of Crawler Activity

This provides a general summary of crawler activity:

Aggregate crawler statistics
Total number of documents indexed
Crawler statistics by data source type

Detailed Crawler Statistics

This includes the following:

List of hosts crawled and indexed
Document distribution by depth
Document distribution by document type
Document distribution by data source type

Crawler Progress

This displays crawler progress for the past week. It shows the total number of documents indexed during the one-week period ending at the current time. The Time column rounds the current time to the nearest hour.

Problematic URLs

This lists errors encountered during the crawling process. It also lists the number of URLs that cause each error.

Web Access Page

Use this page to set up authentication and proxies.

Proxies

Specify a proxy server if the search space includes Web pages that reside outside your organization's firewall. Specifying a proxy server is optional. Currently, only the HTTP protocol is supported.

Note:

The crawler cannot use a proxy server that requires proxy authentication.

You can also set domain exceptions.

Authentication

Use this page to enter authentication information global to all data sources.

Note:

Data source-specific authentication takes precedence over this global authentication.

HTTP Authentication

Specify the user name and password for the host and realm for which HTTP authentication is required. Ultra Search supports both basic (realm-based) and digest authentication. The realm is a name associated with the protected area of a Web site. It is a string that you provide to log on to such a protected page.
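For reference, basic authentication as defined by the HTTP standard sends the user name and password, joined by a colon and Base64-encoded, in an Authorization request header. The Python sketch below shows only that encoding step. It is not Ultra Search code, the function name is hypothetical, and the credentials are placeholder values.

```python
import base64


def basic_auth_header(user: str, password: str) -> str:
    """Return the value of an HTTP Authorization header for basic authentication."""
    credentials = f"{user}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


# Placeholder credentials, for illustration only.
print(basic_auth_header("Aladdin", "open sesame"))
```

Base64 is an encoding, not encryption, so basic authentication over plain HTTP exposes the credentials to anyone who can observe the traffic.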

HTML Forms

Register HTML forms that you want the Ultra Search crawler to automatically fill out during Web crawling. HTML form support requires that HTTP cookie functionality is enabled. You can register HTML forms manually or with the form registration wizard. If the HTML form contains JavaScript, then the wizard might fail and you will need to use manual registration.

Note:

The Ultra Search crawler will choose the form to use based on the form's URL and the form name. URL parameters are not included during matching; thus, they are truncated during form registration.
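The matching rule in this note can be pictured as a lookup key made of the form URL without its parameters plus the form name. The following Python sketch illustrates that idea under the assumption that "URL parameters" means the query string; the function name `form_key` and the sample URL are hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit
from typing import Tuple


def form_key(form_url: str, form_name: str) -> Tuple[str, str]:
    """Return (URL without query string or fragment, form name)."""
    parts = urlsplit(form_url)
    # Keep scheme, host, and path; drop the query string and fragment.
    bare_url = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return (bare_url, form_name)


print(form_key("http://www.example.com/login.jsp?next=/home", "loginForm"))
```

Two form URLs that differ only in their parameters therefore produce the same key.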

Attributes Page

When your indexed documents contain metadata, such as author and date information, you can let users refine their searches based on this information. For example, users can search for all documents where the author attribute has a certain value.

The list of values (LOV) for a document attribute can help specify a search query. An attribute value can have a display name for it. For example, the attribute country might use country code as the attribute value, but show the name of the country to the user. There could be multiple translations of the attribute display name.

To define a search attribute, use the Search Attributes subtab. Ultra Search provides the following default search attributes: Title, Author, Description, Subject, Mimetype, Language, Host, and LastModifiedDate. They can be incorporated in search applications for a more detailed search and richer presentation. You can also define your own.

After defining search attributes, you must map between document attributes and global search attributes for data sources. To do so, use the Mappings subtab.

Note:

Ultra Search provides a command-line tool to load metadata, such as search attribute LOVs and display names into an Ultra Search database. If you have a large amount of data, this is probably faster than using the HTML-based administration tool. For more information, see Appendix A, "Loading Metadata into Ultra Search".

Search Attributes

Search attributes are attributes exposed to the query user. Ultra Search provides system-defined attributes, such as author and description. Ultra Search maintains a global list of search attributes. You can add, edit, or delete search attributes. You can also click Manage LOV to change the list of values (LOV) for the search attribute. There are two categories of attribute LOVs: one is global across all data sources, the other is data source-specific.

To define your own attribute, enter the name of the attribute in the text box; select string, date, or number; and click Add.

You can add or delete LOV entries and their display names for search attributes. The display name is optional. If the display name is absent, then the LOV entry is used in the query screen.

Note:

LOV is only represented as string type. If LOV is in date format, then you must use "DD-MM-YYYY" to enter the LOV.
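When LOV entries for a date attribute are generated from existing data, the values need to be written as strings in the "DD-MM-YYYY" form. A minimal Python sketch of that conversion follows; the helper names are hypothetical and the sample date is arbitrary.

```python
from datetime import date, datetime


def lov_date_string(value: date) -> str:
    """Format a date as a DD-MM-YYYY string for use as an LOV entry."""
    return value.strftime("%d-%m-%Y")


def parse_lov_date(text: str) -> date:
    """Parse a DD-MM-YYYY string; raises ValueError if the format is wrong."""
    return datetime.strptime(text, "%d-%m-%Y").date()


print(lov_date_string(date(2004, 7, 8)))
```

Parsing each string before loading it is a cheap way to catch entries written in another order, such as month first.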

To change the LOV update policy, click Manage LOV for any attribute.

A data source-specific LOV can be updated in three ways:

Update the LOV manually.
The crawler agent can automatically update the LOV during the crawling process.
New LOV entries can be automatically added by inspecting attribute values of incoming documents.

Caution:
If the update policy is agent-controlled, then the LOV and all translated values are erased during the next crawl.

Mappings

This section displays mapping information for all data sources. For user-defined data sources, mapping is done at the agent level, and document attributes are automatically mapped to search attributes with the same name initially. Document attributes and search attributes are mapped one-to-one. For each user-defined data source, you can edit the global search attribute to which the document attribute is mapped.

For Web, file, or table data sources, mappings are created manually when you create the data source. For user-defined data sources, mappings are automatically created on subsequent crawls.

Click Edit Mappings to change this mapping.

Editing the existing mapping is costly, because the crawler must recrawl all documents for this data source. Avoid this step, unless necessary.

Note:

There are no user-managed mappings for email sources. The mappings for emails are predefined. The "From" field of an email is intrinsically mapped to the Ultra Search author attribute. Likewise, the "Subject" field of an email is mapped to the Ultra Search subject attribute. In addition, the abstract of the email message is mapped to the description attribute.

Sources Page

A collection of documents is called a source. The data source is characterized by the properties of its location, such as a Web site or an email inbox. The Ultra Search crawler retrieves data from one or more data sources.

The different types of sources are:

Web Sources
Table Sources
Email Sources
File Sources
Oracle Sources
User-Defined Sources (requires a crawler agent)
See Also:

"Schedules Page" to assign one or more data sources to a synchronization schedule

"Queries Page" to assign data sources to data groups to enable restrictive querying

You can create as many data sources as you want. The following section explains how to create and edit data sources.

Web Sources

A Web source represents the content on a specific Web site. Web sources facilitate maintenance crawling of specific Web sites.

Creating Web Sources

To create a new Web source, do the following:

Specify a name for the Web source and a starting address. This is the URL for the crawler to begin crawling. The starting address can be HTTP or HTTPS.
Set URL boundary rules to refine the crawling space. You can include or exclude hosts or domains beginning with, ending with, or equal to a specific name.

For example, an inclusion domain ending with oracle.com limits the Ultra Search crawler to hosts belonging to Oracle worldwide. Anything ending with oracle.com is crawled, but http://www.oracle.com.tw is not crawled. If you change the inclusion domain to yahoo.com with a new seed "http://www.yahoo.com", then all oracle.com URLs are dropped by the crawler.

An exclusion domain uk.oracle.com prevents the crawler from crawling Oracle hosts in the United Kingdom. You can also include or exclude Web sites with a specific port. (By default, all ports are crawled.) You can have port inclusion or port exclusion rules for a specific host, but not both. Exclusion rules always override inclusion rules.
Specify the types of documents the Ultra Search crawler should process for this source. HTML and plain text are default document types that the crawler always processes.
Specify the authentication settings. This step is optional. Under HTTP Authentication, specify the user name and password for host and realm for which authentication is required. The realm is a name associated with the protected area of a Web site. Under HTML Forms, you can register HTML forms that you want the Ultra Search crawler to automatically fill out during Web crawling. HTML form support requires that HTTP cookie functionality is enabled. Cookies remember context between HTTP requests. For example, the server can send a cookie such that it knows if a user has already logged on and does not need to log on again. Cookie support is enabled by default. Click Register HTML Form to register authentication forms protecting the data source. Note: For the form URL to be crawled, you must verify that the URL is not excluded in the robots.txt file. If so, then you must disable robot exclusion for this data source. (By default, Ultra Search enables robot exclusion.)
Choose either No ACL or Ultra Search ACL for the data source. When a user performs a search, the ACL (access control list) controls which documents the user can access. The default is no ACL, with all documents considered searchable and visible. You can add more than one group and user to the ACL for the data source. The option to choose is only available if the instance is security-enabled.
Define, edit, or delete metatag mappings for your Web source. Metatags are descriptive tags in the HTML document header. One metatag can map to only one search attribute.
Override the default crawler settings for each Web source. This step is optional. The parameters you can override are the crawling depth, the number of crawler threads, the language, the crawler timeout threshold, the character set, the maximum cookie size, the maximum number of cookies, and the maximum number of cookies for each host. You can also enable or disable robots exclusion, language detection, the UrlRewriter, indexing dynamic pages, HTTP cookies, and whether content of the cookie log file is shown. (You can also edit those in Edit Web Sources.)
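
The URL boundary rules in the steps above can be illustrated with a short Python sketch. It is not Ultra Search code: the function names and the rule representation (a rule type of "begins", "ends", or "equals" paired with a value) are assumptions made for illustration, and the sketch assumes at least one inclusion rule is defined.

```python
from urllib.parse import urlsplit


def host_matches(host, rule_type, value):
    """Return True if host satisfies a 'begins', 'ends', or 'equals' rule."""
    host, value = host.lower(), value.lower()
    if rule_type == "begins":
        return host.startswith(value)
    if rule_type == "ends":
        return host.endswith(value)
    if rule_type == "equals":
        return host == value
    raise ValueError(f"unknown rule type: {rule_type}")


def is_in_crawl_space(url, inclusions, exclusions):
    """Apply hypothetical boundary rules; exclusion overrides inclusion."""
    host = urlsplit(url).hostname or ""
    if any(host_matches(host, t, v) for t, v in exclusions):
        return False
    # Assumption: a URL must match at least one inclusion rule.
    return any(host_matches(host, t, v) for t, v in inclusions)


# Rules taken from the example in the text.
inclusions = [("ends", "oracle.com")]
exclusions = [("ends", "uk.oracle.com")]
```

With these rules, http://www.oracle.com/ falls inside the crawling space, while http://www.oracle.com.tw/ (host name does not end with oracle.com) and any host under uk.oracle.com (excluded) fall outside it.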

Robots exclusion lets you control which parts of your sites can be visited by robots. If robots exclusion is enabled (default), then the Web crawler traverses the pages based on the access policy specified in the Web server robots.txt file. For example, when a robot visits http://www.foobar.com/, it checks for http://www.foobar.com/robots.txt. If it finds it, the crawler analyzes its contents to see if it is allowed to retrieve the document. If you own the Web sites, then you can disable robots exclusions. However, when crawling other Web sites, you should always comply with robots.txt by enabling robots exclusion.
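
To see how a robots.txt policy affects what can be fetched, the following Python sketch uses the standard library robots.txt parser. It only illustrates the protocol and is not how the Ultra Search crawler is implemented. The robots.txt content and the user agent name "ExampleCrawler" are assumptions.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for http://www.foobar.com/robots.txt.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def allowed_by_robots(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Under this sample policy, a URL under /private/ is refused and other paths on the same host are allowed. Disabling robots exclusion for a data source corresponds to skipping a check of this kind.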

The URL Rewriter is a user-supplied Java module that implements the Ultra Search UrlRewriter interface. It is used by the crawler to filter or rewrite extracted URL links before they are put into the URL queue. URL filtering removes unwanted links, and URL rewriting transforms the URL link. This transformation is necessary when access URLs are used.

The UrlRewriter provides the following possible outcomes for links:
- There is no change to the link. The crawler inserts it as it is.
- Discard the link. There is no insertion.
- A new display URL is returned, replacing the URL link for insertion.
- A display URL and an access URL are returned. The display URL may or may not be identical to the URL link.
The generated new URL link is subject to all existing host, path, and mimetype inclusion and exclusion rules.

You must put the implemented rewriter class in a jar file and provide the class name and jar file name here.
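
The four outcomes listed above can be modeled in a few lines. The real rewriter is a Java class that implements the UrlRewriter interface described in "Ultra Search URL Rewriter API"; the Python sketch below is only a conceptual model. The RewriteResult type, the rewrite_link function, and every host name in it are hypothetical.

```python
from typing import NamedTuple, Optional
from urllib.parse import urlsplit


class RewriteResult(NamedTuple):
    display_url: str
    access_url: Optional[str] = None  # None: crawl the display URL itself


def rewrite_link(link: str) -> Optional[RewriteResult]:
    """Model the four possible outcomes for an extracted link."""
    parts = urlsplit(link)
    # Outcome: discard the link (nothing is inserted into the queue).
    if parts.path.endswith("/logout"):
        return None
    # Outcome: a new display URL replaces the extracted link.
    if parts.hostname == "old.example.com":
        return RewriteResult(link.replace("old.example.com", "www.example.com", 1))
    # Outcome: a display URL plus a separate access URL for the crawler.
    if parts.hostname == "docs.example.com":
        access = link.replace("docs.example.com", "docs-internal.example.com", 1)
        return RewriteResult(display_url=link, access_url=access)
    # Outcome: no change; the link is inserted as it is.
    return RewriteResult(link)
```

Whatever URL comes out of the rewriter is still checked against the inclusion and exclusion rules, so a rewritten link can still be dropped.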

If Index Dynamic Page is set to Yes, then dynamic URLs are crawled and indexed. For data sources already crawled with this option, setting Index Dynamic Page to No and recrawling the data source removes all dynamic URLs from the index.

Some dynamic pages appear as multiple search hits for the same page, and you may not want them all indexed. Other dynamic pages are each different and need to be indexed. You must distinguish between these two kinds of dynamic pages. In general, dynamic pages that only change in menu expansion without affecting their contents should not be indexed. Consider the following three URLs:
```
http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html

http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14z1

http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html?nsdnv=14
```
The question mark ('?') in the URL indicates that the rest of the string consists of input parameters. The duplicate hits are essentially the same page with different side menu expansion. Ideally, the same query should yield only one hit:
```
http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html
```
Dynamic page index control applies to the whole data source. So, if a Web site has both kinds of dynamic pages, you need to define them separately as two data sources in order to control the indexing of those dynamic pages.
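
One way to check whether a set of dynamic URLs is really the same page is to compare them with the query string removed. The Python sketch below does this for the three example URLs. It is a diagnostic aid for an administrator, not a description of how the crawler treats dynamic pages, and the function names are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def is_dynamic(url):
    """A URL is treated as dynamic here if it carries a query string."""
    return bool(urlsplit(url).query)


def strip_query(url):
    """Drop the query string and fragment, keeping scheme, host, and path."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


BASE = "http://itweb.oraclecorp.com/aboutit/network/npe/standards/naming_convention.html"
urls = [BASE, BASE + "?nsdnv=14z1", BASE + "?nsdnv=14"]

# All three URLs collapse to the same static page.
unique_pages = {strip_query(u) for u in urls}
```

If many dynamic URLs on a site collapse to a few static pages in this way, that site is a candidate for a data source with Index Dynamic Page set to No.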

See Also:

"Ultra Search URL Rewriter API"
"Using Crawler Agents"
"Crawler Page" for information on default languages

Table Sources

A table source represents content in a database table or view. The database table or view can reside in the Ultra Search database instance or in a remote database. Ultra Search accesses remote databases using database links.

See Also:

"Limitations With Database Links"

Creating Table Sources

To create a table source, click Create Table Source, and follow these steps:

Specify a table source name, and the name of the database link, schema, and table. (Database links are configured manually using SQL CREATE DATABASE LINK against the Ultra Search instance in question. After you create the database link, it shows up in the drop down list.) Click Locate Table.
Specify settings for your table source, such as the default language and the primary key column. You can also specify the column where final content should be delivered, and the type of data stored in that column; for example, HTML, plain text, or binary. For information on default languages, see "Crawler Page".
Verify the information about your table source.
Decide whether or not to use the Ultra Search logging mechanism to optimize the crawling of table data sources. When logging is enabled, only newly updated documents are revisited during the crawling process. You can enable logging for Oracle tables, enable logging for non-Oracle tables, or disable the logging mechanism. If you enable logging, then you are prompted to create a log table and log triggers. Oracle SQL statements are provided for Oracle tables. If you are using non-Oracle tables, then you must manually create a log table and log triggers. Follow the examples provided to create the log table and log triggers. After you have created the table, enter the table name in Log Table Name.
Map table columns to search attributes. Each table column can be mapped to exactly one search attribute. This lets the search engine seamlessly search data from the table source.
Specify the display URL template or column for the table source. This step is optional. Ultra Search uses a default text viewer for table data sources. If you specify display URL, then Ultra Search uses the Web URL defined to display the table data retrieved. If display URL column is available, then Ultra Search uses the column to get the URL to display the table data source content. You can also specify display URL templates in the following format: http://hostname:port/path?parameter_name=$(key1) where key1 is the corresponding table's primary key column. For example, assume that you can use the following URL to query the bug number 1234567, and the bug number is the primary key of the table: http://bug:7777/pls/bug?rptno=1234567. You can set the table source display URL template to http://bug:7777/pls/bug?rptno=$(key1).

The Table Column to Key Mappings section provides mapping information. Ultra Search supports table keys in STRING, NUMBER, or DATE type. If key1 is of NUMBER or DATE type, then you must specify the format model used by the Web site so that Oracle knows how to interpret the string. For example, the date format model for the string '11-Nov-1999' is 'DD-Mon-YYYY'. You can also map other table columns to Ultra Search attributes. Do not map the text column.
Specify the ACL (access control list) policy for the data source. When a user performs a search, the ACL controls which documents the user can access. The default is no ACL, with all documents considered public and visible. Alternatively, you can specify to use Ultra Search ACL. You can add more than one group and user to the ACL for the data source. The option to choose is only available if the instance is security-enabled.
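
The display URL template described in the steps above amounts to a placeholder substitution. The following Python sketch shows the idea using the bug database example. The function name and the dictionary of key values are assumptions, and Ultra Search performs this expansion internally.

```python
import re
from urllib.parse import quote


def expand_display_url(template, keys):
    """Replace each $(keyN) placeholder with the matching key value."""
    def substitute(match):
        # URL-encode the value so that it is safe inside a query string.
        return quote(str(keys[match.group(1)]), safe="")
    return re.sub(r"\$\((key\d+)\)", substitute, template)


template = "http://bug:7777/pls/bug?rptno=$(key1)"
display_url = expand_display_url(template, {"key1": 1234567})
# display_url is now "http://bug:7777/pls/bug?rptno=1234567"
```

For NUMBER or DATE keys, the value placed in the URL must be formatted the way the target Web site expects, which is why the format model has to be specified.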

See Also:
Oracle Database SQL Reference for more on format models

Editing Table Sources

On the main Table Sources page, click Edit to change the name of the table source. You can change, add, or delete table column and search attribute mappings; change the display URL template or column; and view values of the table source settings.

Table Sources Comprised of More Than One Table

If a table source has more than one table, then a view joining the relevant tables must be created. Ultra Search then uses this view as the table source. For example, two tables with a master-detail relationship can be joined through a SELECT statement on the master table and a user-implemented PL/SQL function that concatenates the detail table rows.
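
The master-detail view can be prototyped outside Oracle to check the shape of the result. The Python sketch below uses an in-memory SQLite database, with SQLite's built-in group_concat function standing in for the user-implemented PL/SQL function. The table names, column names, and data are all hypothetical, and the SQL is SQLite syntax rather than Oracle syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE orders (order_id INTEGER PRIMARY KEY, customer TEXT);
CREATE TABLE order_lines (order_id INTEGER, line_text TEXT);
INSERT INTO orders VALUES (1, 'Acme');
INSERT INTO order_lines VALUES (1, 'widgets'), (1, 'gaskets');

-- One row per master record; detail rows are concatenated into one column.
CREATE VIEW order_docs AS
SELECT o.order_id,
       o.customer,
       (SELECT group_concat(l.line_text, ' ')
          FROM order_lines l
         WHERE l.order_id = o.order_id) AS content
  FROM orders o;
""")

rows = conn.execute(
    "SELECT order_id, customer, content FROM order_docs"
).fetchall()
```

Each row of the view then plays the role of one document: order_id is the key column and content is the text column to be indexed.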

Limitations With Database Links

The following restrictions apply to base tables or views on a remote database that are accessed over a database link by the crawler.

If the text column of the base table or view is of type BLOB or CLOB, then the table must have a ROWID column. A table or view might not have a ROWID column for various reasons, including the following:
- A view is comprised of a join of one or more tables.
- A view is based on a single table using a GROUP BY clause.
The best way to know if a remote table or view can be safely crawled by Ultra Search is to check for the existence of the ROWID column. To do so, run the following SQL statement against that table or view using SQL*Plus:
```
SELECT MIN(ROWID) FROM table_name/view_name;
```

The base table or view cannot have text columns of type BFILE or RAW.

Email Sources

An email source derives its content from emails sent to a specific email address. When the Ultra Search crawler searches an email source, it collects all emails that have the specific email address in any of the "To:" or "Cc:" email header fields.

The most popular application of an email source is where an email source represents all emails sent to a mailing list. In such a scenario, multiple email sources are defined where each email source represents an email list.

To crawl email sources, you need an IMAP account. At present, the Ultra Search crawler can only crawl one IMAP account. Therefore, all emails to be crawled must be found in the inbox of that IMAP account. For example, in the case of mailing lists, the IMAP account should be subscribed to all desired mailing lists. All new postings to the mailing lists are sent to the IMAP email account and subsequently crawled. The Ultra Search crawler is IMAP4 compliant.

When the Ultra Search crawler retrieves an email message, it deletes the email message from the IMAP server. Then, it converts the email message content to HTML and temporarily stores that HTML in the cache directory for indexing. Next, the Ultra Search crawler stores all retrieved messages in a directory known as the archive directory. The email files stored in this directory are displayed to the search end-user when referenced by a query hit.

To crawl email sources, you must specify the user name and password of the email account on the IMAP server. Also specify the IMAP server host name and the archive directory.

Creating Email Sources

To create email sources, you must enter an email address and a description. Optionally, you can specify email aliases and ACL policy. The description can be viewed by all search end-users, so you should specify a short but meaningful name. When you create (register) an email source, the name you use is the email address of the mailing list. If the emails are not sent to one of the registered mailing lists, then those emails are not crawled.

You can specify email address aliases for an email source. Specifying an alias for an email source causes all emails sent to the main email address, as well as the alias address, to be gathered by the crawler. An alias is useful when two or more email addresses are logically the same. For example, an email source representing the distribution list list@company.com can have the alternate address list@my.company.com. If list@my.company.com is added to the alias list, then emails sent to that address are treated as if they were sent to list@company.com.
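
The matching rule for email sources (the source address or one of its aliases appears in a "To:" or "Cc:" header) can be sketched in Python with the standard email package. This is an illustration only; the function name and the sample message are assumptions.

```python
from email import message_from_string
from email.utils import getaddresses


def belongs_to_source(raw_message, source_address, aliases=()):
    """Return True if the message is addressed to the source or an alias."""
    msg = message_from_string(raw_message)
    accepted = {source_address.lower(), *(a.lower() for a in aliases)}
    # Collect every address found in the To: and Cc: headers.
    headers = msg.get_all("To", []) + msg.get_all("Cc", [])
    recipients = {addr.lower() for _name, addr in getaddresses(headers)}
    return not accepted.isdisjoint(recipients)


# Hypothetical message sent to the alias address.
RAW = (
    "From: alice@example.com\n"
    "To: Team <list@my.company.com>\n"
    "Subject: Status\n"
    "\n"
    "Body text\n"
)
```

With list@my.company.com registered as an alias of list@company.com, the sample message belongs to the source. Without the alias, it does not.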

Specify the ACL (access control list) policy for the data source. When a user performs a search, the ACL controls which documents the user can access. The default is no ACL, with all documents considered searchable and visible. You can add more than one group and user to the ACL for the data source.

File Sources

A file source is the set of documents that can be accessed through the file protocol on the local machine.

To edit the name of a file source, click Edit.

Creating File Sources

To create a new file source, do the following:

Specify a name for the file source and the default language.
Designate files or directories to be crawled. If a URL represents a single file, then the Ultra Search crawler searches only that file. If a URL represents a directory, then the crawler recursively crawls all files and subdirectories in that directory.
Specify inclusion and exclusion paths to modify the crawling space associated with this file source. This step is optional. An inclusion path limits the crawling space. An exclusion path lets you further define the crawling space. If neither path is specified, then crawling is limited to the underlying file system access privileges. Path rules are host-specific, but you can specify more than one path rule for each host. For example, on the same host, you can include path files://host/doc and exclude path files://host/doc/unwanted.
Specify the types of documents the Ultra Search crawler should process for this file source. HTML and plain text are default document types that the crawler always processes.
Ultra Search displays file data sources in text format. However, if you specify display URL for the file data source, then Ultra Search uses the URL to display the file data source.

With display URL for file data sources, the URL uses network protocols, such as HTTP or HTTPS, to access the file data source. To generate display URL for the file data source, specify the prefix of the original file URL and the prefix of the display URL. Ultra Search replaces the prefix of the file URL with the prefix of the display URL.

For example, if your file URL is file:///home/operation/doc/file.doc and the display URL is https://webhost/client/doc/file.doc, then you can set the file URL prefix to file:///home/operation and the display URL prefix to https://webhost/client.
Specify the ACL (access control list) policy for the data source. When a user performs a search, the ACL controls which documents the user can access. The default is no ACL, with all documents considered searchable and visible. Alternatively, you can specify using the Ultra Search ACL. You can add more than one group and user to the ACL for the data source. The option to choose is only available if the instance is security-enabled.
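
The prefix replacement used to build a display URL for a file source can be sketched as follows. The function name is hypothetical, and leaving non-matching URLs unchanged is an assumption of this sketch.

```python
def to_display_url(file_url, file_prefix, display_prefix):
    """Swap the file URL prefix for the display URL prefix."""
    if not file_url.startswith(file_prefix):
        # Assumption: URLs outside the prefix are left as they are.
        return file_url
    return display_prefix + file_url[len(file_prefix):]


display_url = to_display_url(
    "file:///home/operation/doc/file.doc",
    "file:///home/operation",
    "https://webhost/client",
)
# display_url is now "https://webhost/client/doc/file.doc"
```

The part of the path after the prefix is carried over unchanged, so the Web server must expose the same directory layout under the display URL prefix.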

Oracle Sources

You can create, edit, or delete Oracle sources. You can choose federated or Oracle Application Server Portal (crawlable) data sources. A federated source is a repository that maintains its own index. Ultra Search can issue a query, and the repository can return query results. Ultra Search also supports the crawling and indexing of Oracle Application Server Portal installations. This enables searching across multiple portal installations.

Note:

When Ultra Search crawls content from Oracle Portal, it gathers all metadata (that is, attribute values) with the actual indexable content. This is then text-indexed. When you search for string "xxx", if that string occurs in either the attributes or in the content, then the document is returned. This is different from how Oracle Portal behaves. With Oracle Portal, when you search for string "xxx", only the content of a document (page/item in Portal terminology) is searched. Attributes are treated separately.

Oracle Portal Sources

Ultra Search can only crawl public Oracle AS Portal sources. See the Oracle Application Server Portal Configuration Guide for how to set up public pages.

To create Portal sources, you must first register your portal with Ultra Search. To register your portal:

Provide a name and portal URL base. The portal name is used to identify this portal entry in the Oracle Portal List page. The URL base is the beginning portion of the portal homepage. This includes the host name, port number, and DAD. After it is created, the portal URL base is not updatable. Click Register Portal. Ultra Search attempts to contact the Oracle Application Server Portal instance and retrieve information about it.
Choose one or more page groups for indexing. A portal data source is created for each page group. Click Delete to remove existing portal data sources.

You can edit the types of documents the Ultra Search crawler should process for a portal source. HTML and plain text are default document types that the crawler always processes. To edit document types, click Edit for the portal source after it has been created.

See Also:

The Oracle Application Server Portal documentation.

Federated Sources

To create federated sources, specify the name and JNDI for the new data source. By default, no resource adapter is available.

To create a federated source, you must manually deploy the Ultra Search resource adapter, or searchlet. A searchlet is a Java module deployed in the middle tier (inside OC4J) that searches the data in an enterprise information system on behalf of a user. When a user's query is delegated to the searchlet, the searchlet runs the query on behalf of the user. Every searchlet is a JCA 1.0 compliant resource adapter.

Deploying and Binding the Ultra Search Searchlet

The Ultra Search searchlet enables queries against one Ultra Search instance. The Ultra Search searchlet is packaged as ultrasearch.rar and is shipped under the $ORACLE_HOME/ultrasearch/adapter/ directory.

To deploy the Ultra Search searchlet in OC4J standalone, use admin.jar as follows:

java -jar admin.jar ormi://<hostname> <admin> <welcome> -deployconnector -file
ultrasearch.rar -name UltraSearchSearchlet

At this point, ultrasearch.rar has been deployed in OC4J. However, it has not been instantiated to connect to any Ultra Search instance. The Ultra Search searchlet can be instantiated multiple times, to connect to several Ultra Search instances, by repeating the following steps. To instantiate the searchlet, you must specify values for its configuration parameters and the JNDI location to which the searchlet instance is bound. To do this, manually edit oc4j-ra.xml. This file is typically located under the $J2EE_HOME/application-deployments/default/UltraSearchSearchlet/ directory. The Ultra Search searchlet requires four configuration properties: connectionURL, userName, password, and instanceName. For example, to bind a searchlet under "eis/UltraSearch" that connects to the default instance 'wk_test' on machine 'dbhost', you can use the following entry:

<connector-factory location="eis/UltraSearch" connector-name="Ultra Search Adapter">
 <config-property name="connectionURL" value="jdbc:oracle:thin:@dbhost:1521:sid"/>
 <config-property name="userName" value="wk_test"/>
 <config-property name="password" value="wk_test"/>
 <config-property name="instanceName" value="wk_test"/>
</connector-factory>

After editing oc4j-ra.xml, restart the OC4J instance. If you do not see errors upon restart, then the searchlet has been successfully instantiated and bound to JNDI.
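
Because a misspelled property name or a misplaced quotation mark in oc4j-ra.xml only surfaces when OC4J restarts, it can help to check the entry first. The following Python sketch parses a connector-factory entry and reports which of the four required properties are missing. It is a hypothetical helper, not an Oracle tool, and it assumes that property names are compared case-sensitively.

```python
import xml.etree.ElementTree as ET

# The four properties the searchlet requires, as listed in this section.
REQUIRED = {"connectionURL", "userName", "password", "instanceName"}


def missing_properties(xml_text):
    """Return the required property names absent from the entry.

    Raises xml.etree.ElementTree.ParseError if the XML is not well-formed.
    """
    root = ET.fromstring(xml_text)
    found = {prop.get("name") for prop in root.iter("config-property")}
    return sorted(REQUIRED - found)


ENTRY = """\
<connector-factory location="eis/UltraSearch" connector-name="Ultra Search Adapter">
  <config-property name="connectionURL" value="jdbc:oracle:thin:@dbhost:1521:sid"/>
  <config-property name="userName" value="wk_test"/>
  <config-property name="password" value="wk_test"/>
  <config-property name="instanceName" value="wk_test"/>
</connector-factory>
"""

problems = missing_properties(ENTRY)  # empty list when all four are present
```

An entry whose password property is misspelled would be reported as missing "password", and an entry with unbalanced quotation marks would fail to parse.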

Deploying and Binding the Federator Searchlet

The Federator searchlet interacts with other searchlets to provide a single point of search against multiple repositories. For example, the Federator searchlet can invoke multiple Ultra Search searchlets to simultaneously query against multiple Ultra Search instances. In the same manner, the Federator searchlet can invoke searchlets for Oracle Files, Email, and so on. The Federator searchlet is configured and managed with the Ultra Search administration tool, under the Federated Sources tab. The Federator searchlet is packaged as federator.rar and is shipped under the $ORACLE_HOME/ultrasearch/adapter/ directory. The deployment procedure for federator.rar is similar to the deployment of the Ultra Search searchlet. To deploy the Federator searchlet in OC4J standalone, use admin.jar as follows:

java -jar admin.jar ormi://<hostname> <admin> <welcome> -deployconnector -file
federator.rar -name FederatorSearchlet

To instantiate the searchlet, the Federator searchlet requires four configuration properties: connectionURL, userName, password, and instanceName in the oc4j-ra.xml file. This file is typically located under the $J2EE_HOME/application-deployments/default/FederatorSearchlet/ directory. For example:

<connector-factory location="eis/Federator" connector-name="Federator Adapter">
 <config-property name="connectionURL" value="jdbc:oracle:thin:@dbhost:1521:sid"/>
 <config-property name="userName" value="wk_test"/>
 <config-property name="password" value="wk_test"/>
 <config-property name="instanceName" value="wk_test"/>
</connector-factory>

After editing oc4j-ra.xml, restart the OC4J instance. If you do not see errors upon restart, then the searchlet has been successfully instantiated and bound to JNDI.

User-Defined Sources

Ultra Search lets you define, edit, or delete your own data sources and types in addition to the ones provided. You might implement your own crawler agent to crawl and index a proprietary document repository or management system, such as Lotus Notes or Documentum, which contain their own databases and interfaces.

For each new data source type, you must implement a crawler agent as a Java class. The agent collects document URLs and associated metadata from the proprietary document source and returns the information to the Ultra Search crawler, which enqueues it for later crawling.

See Also:

"Ultra Search Crawler Agent API"

To define a new data source, you first define a data source type to represent it.

Creating User-Defined Data Source Types

To create, edit, or delete data source types, click Manage Source Types. To create a new type, click Create New Type.

Specify data source type name, description, and crawler agent Java class file or jar file name. The crawler agent Java classpath is predefined at installation time. The agent collects the list of document URLs and associated metadata from the proprietary document source and returns it to the Ultra Search crawler, which enqueues the information for later crawling. The agent class file or jar file must be located under $ORACLE_HOME/ultrasearch/lib/agent/.
Specify parameters for this data source type. If you add parameters, you must enter the parameter name and a description. Also, you must decide whether to encrypt the parameter value.

Edit data source type information by changing the data source type name, description, crawler agent Java class/jar file name, or parameters.

Creating User-Defined Sources

To create a user-defined data source, select the type and click Go.

Specify a name, default language, and parameter values for the data source. For information on default languages, see the Crawler Page.
Specify the authentication settings. This step is optional. Under HTTP Authentication, specify the user name and password for host and realm for which authentication is required. The realm is a name associated with the protected area of a Web site. Under HTML Forms, you can register HTML forms that you want the Ultra Search crawler to automatically fill out during Web crawling. HTML form support requires that HTTP cookie functionality is enabled. Cookies remember context between HTTP requests. For example, the server can send a cookie such that it knows if a user has already logged on and does not need to log on again. Cookie support is enabled by default. Click Register HTML Form to register authentication forms protecting the data source.
Specify the ACL (access control list) policy for the data source: no ACL, repository-generated ACL, or Ultra Search ACL. When a user performs a search, the ACL controls which documents the user can access. The default is no ACL, with all documents considered searchable and visible. For the Ultra Search ACL, you can add more than one group and user to the ACL for the data source.
Specify mappings. This step is optional. Document attributes are automatically mapped directly to the search attribute with the same name during crawling. If you want document attributes to map to another search attribute, then you can specify it here. The crawler picks up attributes that have been returned by the crawler agent or specified here.
Edit crawling parameters.
Specify the document types that the crawler should process for this data source. By default, HTML and plain text are always processed.
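
The mapping step above (same-name mapping by default, with optional overrides) can be sketched in Python. The function name and the sample attributes are hypothetical. Dropping document attributes that match no search attribute is an assumption of this sketch, not documented behavior.

```python
def resolve_attributes(document_attrs, search_attrs, overrides=None):
    """Map document attributes to search attributes.

    By default an attribute maps to the search attribute of the same
    name; an entry in overrides redirects it to another search attribute.
    """
    overrides = overrides or {}
    resolved = {}
    for name, value in document_attrs.items():
        target = overrides.get(name, name)
        # Assumption: attributes with no matching search attribute are ignored.
        if target in search_attrs:
            resolved[target] = value
    return resolved


# Attributes as a crawler agent might return them (hypothetical names).
doc = {"author": "Lee", "doc_title": "Plan", "internal_id": "42"}
mapped = resolve_attributes(doc, {"author", "title"}, {"doc_title": "title"})
# mapped is now {"author": "Lee", "title": "Plan"}
```

Here author maps automatically because the names match, while doc_title needs an explicit mapping to reach the title search attribute.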

You can edit user-defined data sources by changing the name, type, default language, or starting address.

Schedules Page

Use this page to schedule data synchronization and index optimization. Data synchronization means keeping the Ultra Search index up to date with all data sources. Index optimization means keeping the updated index optimized for best query performance.

See Also:

"Synchronizing Data Sources"

Data Synchronization

The tables on this page display information about synchronization schedules. A synchronization schedule has one or more data sources assigned to it. The synchronization schedule frequency specifies when the assigned data sources should be synchronized. Schedules are sorted first by name. Within a synchronization schedule, individual data sources are listed and can be sorted by source name or source type.

Creating Synchronization Schedules

To create a new schedule, click Create New Schedule and follow these steps:

Name the schedule.
Pick a schedule frequency and determine whether the schedule should automatically accept all URLs for indexing or examine URLs before indexing. For initial planning purposes, you might want the crawler to collect URLs without indexing. After crawling is done, you can examine document URLs and status, remove unwanted documents, and start indexing. You can also associate the schedule with a remote crawler profile.

You can set the frequency to Manual Launch. In this case, the interval remains in SCHEDULED status until you explicitly invoke data synchronization with the Execute Immediately button of the admin tool (see "Launching Synchronization Schedules").
Assign data sources to the schedule. After a data source has been assigned to a group, it cannot be assigned to other groups.

Updating Schedules

Update the indexing option in the Update Schedule page.

Editing Synchronization Schedules

After a synchronization schedule has been defined, you can do the following in the Synchronization Schedules List:

To assign the schedule to either a crawler that runs on the database host or a remote crawler that runs on a separate host, click Hostname.
To change its frequency, click the schedule interval text.
To alter its status, click Status.
To delete it, click Delete.
To edit its name, data source assignments, recrawl policy, or crawling mode, click Edit. When the crawler retrieves a document, it checks to see if it has changed. By default, if the document has not changed, the crawler does not process it. In certain situations, you might want to force the crawler to reprocess all documents. Click Edit to edit schedules in the following ways:
- Update schedule name. This step is optional. To change the schedule name, specify a name for the schedule, and click Update Schedule Name.
- Assign data sources to schedule. To assign a data source, select one or more available sources and click >>. After a data source has been assigned to a group, it cannot be assigned to any other group. To undo assignments of a data source, select one or more scheduled sources and click <<.
- Update crawler recrawl policy. You can update the recrawl policy to the following:
  - Process Documents That Have Changed: This is maintenance crawling. Only documents that have changed are recrawled and indexed. For Web data sources, if there are new links in the updated document, then they are followed. For file data sources, new files are collected if their parent directory has changed.
  - Process All Documents: The crawler recrawls the data source. For example, suppose you want to crawl only text and HTML on a Web site. Later, you also want to crawl Microsoft Word and Adobe PDF documents. You must modify the document types for the source, edit the schedule to select Process All Documents, then rerun the schedule so that the crawler picks up PDF and doc document types for this data source. The crawler treats every document as if it has been changed, which means each document is fetched and processed again.
  Upon relaunching the schedule, the following rules determine which URLs will be recrawled:
  - If the previous crawl did not finish (for example, you stopped the crawling or the database tablespace was full), then the crawler only crawls URLs left in the URL queue. URLs already crawled are not touched on recrawl.
  - If the URL queue is empty but there is a new seed added since the last crawl, then the crawler only crawls the new seed.
  - If the URL queue is empty and there is no new seed URL, then the crawler recrawls all crawled URLs.
  Therefore, if you stop the crawler and set Index Dynamic Pages to No, this only affects the URLs in the queue yet to be crawled. The already crawled dynamic pages are removed from the index on the third recrawl when the queue is empty.
  
  Note:
  All crawled URLs are subject to crawler setting enforcement, not just newly crawled URLs.
- Update crawling mode. You can update the crawling mode to the following:
  - Automatically accept all URLs for indexing: This mode crawls and indexes.
  - Examine URLs before indexing: This mode crawls only. For initial planning purposes, you might want the crawler to collect URLs without indexing. After crawling is done, you can examine document URLs and status, remove unwanted documents, and start indexing.
  - Index only: This mode indexes only.
  The crawler behaves differently for the documents collected.
Crawling mode and recrawl policy can be combined for six different combinations. For example, Process All Documents and Index Only forces reindexing existing documents in this data source, while Process Documents That Have Changed and Index Only re-indexes only changed documents.
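
The three relaunch rules listed above form a simple decision order, which the following Python sketch makes explicit. It is a model of the documented rules, not crawler code, and the function and parameter names are assumptions.

```python
def urls_to_crawl(url_queue, new_seeds, crawled_urls):
    """Pick the URLs a relaunched schedule works on, per the three rules."""
    if url_queue:
        # The previous crawl did not finish: only the leftover queue is crawled.
        return list(url_queue)
    if new_seeds:
        # The queue is empty but a seed was added: only the new seed is crawled.
        return list(new_seeds)
    # The queue is empty and there is no new seed: everything is recrawled.
    return list(crawled_urls)
```

The order of the checks explains why a settings change made after stopping the crawler only reaches already crawled URLs on a later launch, once the queue has been emptied.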

Launching Synchronization Schedules

A schedule's synchronization frequency can be identical to another schedule's synchronization frequency. This gives you maximum flexibility in managing data source synchronization.

You can launch a synchronization schedule in the following ways:

Set a schedule frequency and wait for the predetermined launch time.
Run it immediately. To do so, click Status, then Execute Immediately.

Manually start the schedule.

Note:

Launching a synchronization schedule can take a long time. If a schedule has been launched before, then on the next launch all URLs that belong to the data source crawled by the schedule are updated and put into a queue. Depending on the number of URLs associated with that data source, the enqueue operation can take a long time. The administration tool displays the schedule state as 'Launching' the entire time.

The launch of a schedule does not perform any enqueue if the URL queue is not empty or if a new seed has been added since the last crawl. For example, if the user stopped the crawler earlier or if the crawler terminated because of insufficient Oracle tablespace, then the URL queue is not empty. On the next launch, the crawler does not try to enqueue; instead, it works on the existing URL queue until it is empty. In other words, enqueue is performed only when the queue is empty and no new seed has been added at launch time.
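The launch rules can be summarized as a small decision function. This is a minimal sketch of the rules as described in this section, not the crawler's actual implementation; the function name and its list-based arguments are assumptions made for illustration.

```python
def plan_launch(url_queue, new_seeds, crawled_urls):
    """Decide what a schedule launch works on.

    Returns (enqueue_performed, urls_to_crawl).
    All parameter names are hypothetical.
    """
    if url_queue:
        # Previous crawl did not finish: continue with the leftover queue.
        return False, list(url_queue)
    if new_seeds:
        # Queue is empty but a seed was added: crawl only the new seed.
        return False, list(new_seeds)
    # Queue is empty and no new seed: enqueue and recrawl everything.
    return True, list(crawled_urls)


# A crawl that was stopped early resumes from its queue without enqueueing.
print(plan_launch(["http://a/2"], [], ["http://a/1"]))
```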

Synchronization Status and Crawler Progress

Click the link in the status column to see the synchronization schedule status. To see the crawling progress for any data source associated with this schedule, click Statistics.

If you decide to examine URLs before indexing for the schedule, then after you run the schedule, the schedule status is shown as "Indexing Pending".

In data harvesting mode, you should begin crawling first. After crawling is done, click Examine URL to examine document URLs and status, remove unwanted documents, and start indexing. After you click Begin Index, the schedule status changes through the usual states, such as launching, executing, and scheduled.

The crawling progress contains the following information:

Data source type
Data source name
Start time
Finish time
Elapsed time
Total indexing time
Total size of document data collected
Average document size
Average fetch throughput

It also contains the following statistics:

Documents to fetch
Documents fetched: This is the sum of Documents non-indexable, Document conversion failures, and Documents indexed.
Document fetch failures: This could be an Oracle HTTP Server timeout or another HTTP server error.
Documents rejected: The document is not within the URL boundary rule.
Documents discovered: This is the sum of Documents to fetch, Documents fetched, Document fetch failures, and Documents rejected.
Documents indexed
Documents non-indexable: This could be a file directory, a portal page that is a discovery node, or a robot metatag that specifies no index.
Document conversion failures: The binary file filter failed.
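The two derived statistics follow directly from the other counters. The following Python sketch restates those sums; the class and field names are hypothetical, and the sample numbers are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class CrawlStats:
    # Field names are hypothetical; each maps to one statistic above.
    to_fetch: int
    fetch_failures: int
    rejected: int
    indexed: int
    non_indexable: int
    conversion_failures: int

    @property
    def fetched(self) -> int:
        # Documents fetched = indexed + non-indexable + conversion failures
        return self.indexed + self.non_indexable + self.conversion_failures

    @property
    def discovered(self) -> int:
        # Documents discovered = to fetch + fetched + fetch failures + rejected
        return self.to_fetch + self.fetched + self.fetch_failures + self.rejected


stats = CrawlStats(to_fetch=10, fetch_failures=2, rejected=5,
                   indexed=70, non_indexable=8, conversion_failures=5)
print(stats.fetched)     # 83
print(stats.discovered)  # 100
```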

Index Optimization

To ensure fast query results, the Ultra Search crawler maintains an active index of all documents crawled over all data sources. This page lets you schedule when the index is optimized. Optimize the index during hours of low usage.

Note:

Increasing the crawler cache directory size can reduce index fragmentation.

Index Optimization Schedule

You can specify the index optimization schedule frequency. Be sure to specify all required data for the option that you select. You can optimize the index immediately, or you can enable the schedule.

Optimization Process Duration

Specify a maximum duration for the index optimization process. The actual time taken for optimization does not exceed this limit, but it could be shorter. Specifying a longer optimization time results in a more optimized index. Alternatively, you can specify that the optimization continue until it is finished.

If your Ultra Search instance is secure-search enabled, then the index optimization process also triggers garbage collection of unused access control lists (ACLs).

Queries Page

This section lets you specify query-related settings, such as data source groups, URL submission, relevancy boosting, and query statistics.

Data Groups

Data groups are logical entities exposed to the search engine user. When entering a query, the user is asked to select one or more data groups from which to search.

A data group consists of one or more data sources. A data source can be assigned to multiple data groups. Data groups are sorted first by name. Within each data group, individual data sources are listed and can be sorted by source name or source type.

To create a new data source group, do the following:

Specify a name for the group.
Assign data sources to the group. To assign a Web or table data source to this data group, select one or more available Web sources or table sources and click >>. After a data source has been assigned to a group, it cannot be assigned to any other group. To unassign a Web or table data source, select one or more scheduled sources and click <<.
Click Finish.

URL Submission

URL Submission Methods

URL submission lets query users submit URLs. These URLs are added to the seed URL list and included in the Ultra Search crawler search space. You can allow or disallow query users to submit URLs.

URL Boundary Rules Checking

URLs are submitted to a specific Web data source. URL boundary rules checking ensures that submitted URLs comply with the URL boundary rules of the Web data source. You can allow or disallow URL boundary rules checking.
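As a rough illustration of boundary checking, the following Python sketch accepts or rejects a submitted URL based on domain inclusion and exclusion lists. It assumes that boundary rules are expressed as domain suffixes; the actual rule format in Ultra Search may differ, and all function and parameter names here are hypothetical.

```python
from urllib.parse import urlsplit


def host_matches(host, domain):
    """True if host equals domain or is a subdomain of it."""
    host = host.lower()
    domain = domain.lower().lstrip(".")
    return host == domain or host.endswith("." + domain)


def within_boundary(url, include_domains, exclude_domains=()):
    """Check a URL against hypothetical inclusion/exclusion rules."""
    host = urlsplit(url).hostname
    if host is None:
        return False  # not an absolute URL with a host
    if any(host_matches(host, d) for d in exclude_domains):
        return False
    return any(host_matches(host, d) for d in include_domains)


def accept_submission(url, include_domains, exclude_domains=(),
                      check_boundary=True):
    """With checking disallowed, every submitted URL is accepted."""
    if not check_boundary:
        return True
    return within_boundary(url, include_domains, exclude_domains)


print(accept_submission("http://docs.example.com/a.html", ["example.com"]))
```

With checking enabled, a URL outside the included domains is rejected; with checking disabled, the same URL is added to the seed list regardless.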

Relevancy Boosting

Relevancy boosting lets administrators override the search results and influence the order in which documents are ranked in the query result list. Use it to promote important documents to higher scores so that they are easier to find.

See Also:

"Document Relevancy Boosting"

There are two methods for locating URLs for relevancy boosting: locate by search or manual URL entry.

Locate by Search

To boost a URL, first locate a URL by performing a search. You can specify a host name to narrow the search. After you have located the URL, click Information to edit the query string and score for the document.

Manual URL Entry

If a document has not been crawled or indexed, then it cannot be found in a search. However, you can provide a URL and enter the relevancy boosting information with it. To do so, click Create, and enter the following:

Specify the document URL. You must assign the URL to a data source. This document is indexed the next time it is crawled.
Enter scores in the range of 1 to 100 for one or more query strings. When a user performs a search using the exact query string, the score applies for this URL.

The document is searchable after the document is loaded for the term. The document is also indexed the next time the schedule is run.

With manual URL entry, you can only assign URLs for Web data sources. Users get an error message on this page if no Web data source is defined.
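Conceptually, a relevancy boosting entry ties a score to a URL and an exact query string. The following Python sketch models that relationship under stated assumptions: the class and method names are hypothetical, the exact-match comparison is treated as case-sensitive, and the way Ultra Search combines a boosted score with the computed score is not shown here.

```python
class BoostTable:
    """Hypothetical in-memory model of relevancy boosting entries."""

    def __init__(self):
        self._scores = {}

    def add(self, url, query_string, score):
        # Scores must be in the range 1 to 100.
        if not 1 <= score <= 100:
            raise ValueError("score must be in the range 1 to 100")
        self._scores[(query_string, url)] = score

    def score_for(self, url, query_string, computed_score):
        # The boosted score applies only when the query string matches
        # exactly; otherwise the computed score is returned unchanged.
        return self._scores.get((query_string, url), computed_score)


boosts = BoostTable()
boosts.add("http://www.example.com/benefits.html", "employee benefits", 100)
print(boosts.score_for("http://www.example.com/benefits.html",
                       "employee benefits", 42))
```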

Note:

Ultra Search provides a command-line tool to load metadata, such as document relevance boosting, into an Ultra Search database. If you have a large amount of data, this is probably faster than using the HTML-based administration tool. For more information, see Appendix A, "Loading Metadata into Ultra Search".

Query Statistics

Enabling Query Statistics

This section lets you enable or disable the collection of query statistics. The logging of query statistics reduces query performance. Therefore, Oracle recommends that you disable the collection of query statistics during regular operation.

Note:

After you enable query statistics, the table that stores statistics data is truncated every Sunday at 1:00 A.M.

Viewing Statistics

If query statistics is enabled, you can click one of the following categories:

Daily Summary of Query Statistics
Top 50 Queries
Top 50 Ineffective Queries
Top 50 Failed Queries

Daily Summary of Query Statistics

This summarizes all query activity on a daily basis. The statistics gathered are:

Average query time: the average time taken over all queries
Number of queries: the total number of queries made in the day
Number of hits: the average number of results returned by each query

Top 50 Queries

This summarizes the 50 most frequent queries in the past 24 hours.

Query string: the query string
Average query time: the average time to return a result
Number of queries: the total number of queries in the past 24 hours
Number of hits: the average number of results returned by each query
Frequency: the number of queries divided by total number of queries over all query strings
Percentage of ineffective queries: the number of ineffective queries divided by total number of queries over all query strings

Top 50 Ineffective Queries

This summarizes the 50 most frequent ineffective queries in the past 24 hours. Each row in the table describes statistics for a particular query string.

Query string: the query string
Number of queries: the total number of queries made in the past 24 hours
Percentage of ineffective queries: the number of ineffective queries divided by total number of queries for that string

Top 50 Failed Queries

This summarizes the top 50 queries that failed over the past 24 hours. A failed query is one where the search engine end-user did not locate any query results.

The columns are:

Query string: the query string
Number of queries: the total number of queries made in the past 24 hours
Frequency: the percentage occurrence of a failed query
Cumulative frequency: the cumulative percentage occurrence of all failed queries
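The Frequency and Cumulative frequency columns can be understood as a running percentage over the failed queries. The following Python sketch shows one way to compute them. It assumes that percentages are taken over the total number of failed queries, which the text above does not state explicitly; the function name and row layout are hypothetical.

```python
from collections import Counter


def failed_query_report(failed_queries, top_n=50):
    """Return rows of (query string, count, frequency %, cumulative %)."""
    counts = Counter(failed_queries)
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    # most_common() orders by descending count.
    for query, n in counts.most_common(top_n):
        frequency = 100.0 * n / total
        cumulative += frequency
        rows.append((query, n, frequency, cumulative))
    return rows


for row in failed_query_report(["vpn", "vpn", "payroll", "badge"]):
    print(row)
```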

See Also:
"Tuning Query Performance"

Configuration

You can configure the query application and the federation engine with several parameters, including the maximum number of hits and enabling relevancy boosting.

Note:

The Table Display URL, the File Display URL, and the Email Display URL are relative URLs. For Oracle Portal to work, you must replace these URLs with full URLs here, including hostname, port, and path.
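To see what replacing a relative URL with a full URL amounts to, the following Python sketch resolves a relative display URL against a base that includes host name, port, and path. The host, port, and paths shown are placeholders, not actual Ultra Search values.

```python
from urllib.parse import urljoin


def to_full_url(base_url, relative_display_url):
    """Resolve a relative display URL against a base URL."""
    return urljoin(base_url, relative_display_url)


# Placeholder base and relative path, for illustration only.
print(to_full_url("http://host.example.com:7777/ultrasearch/",
                  "query/display.jsp"))
# http://host.example.com:7777/ultrasearch/query/display.jsp
```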

Users Page

Use this page to manage Ultra Search administrative users. You can assign a user to manage an Ultra Search instance. You can also select a language preference.

Preferences

This section lets you set preference options for the Ultra Search administrator.

You can specify the date and time format. The pull-down menu lists the following languages:

English
Brazilian Portuguese
French
German
Italian
Japanese
Korean
Simplified Chinese
Spanish
Traditional Chinese

You can also select the number of rows to display on each page.

Super-Users

A user with super-user privileges can perform all administrative functions on all instances, including creating instances, dropping instances, and granting privileges. Only super-users can access this page.

Single sign-on (SSO) users can use a delegated administrative service (DAS) list of values to add another SSO user as a super-user. The SSO server authenticates these users before allowing access. Database users can add another database user as a super-user.

To grant super-user administrative privileges to another user, enter the user name of the user. Specify also whether the user should be allowed to grant super-user privileges to other users. Then click Add.

Privileges

Only instance owners, users that have been granted general administrative privileges on this instance, or super-users are allowed to access this page. Instance owners must have been granted the WKUSER role.

Single sign-on (SSO) users can use a delegated administrative service (DAS) list of values to add privileges to another SSO user. The SSO server authenticates these users before allowing access. Database users can add privileges to another database user.

Note:

Database users cannot grant privileges to SSO users, and SSO users cannot grant privileges to database users. The DAS list of values only shows SSO users.

Granting general administrative privileges to a user allows that user to modify general settings for this instance. To do this, enter the user name and specify whether the user should be allowed to grant administrative privileges to other users. Then click Add.

To remove one or more users from the list of administrators for this instance, select one or more user names from the list of current administrators and click Remove.

Note:

General administrative privileges do not include the ability to create or delete an instance. These privileges belong to super-users.

Globalization Page

Ultra Search lets you translate names to different languages. This page lets you enter multiple values for search attributes, list of values (LOV) display names, and data groups.

Search Attribute Name

This section lets you translate attribute display names to different languages. The pull-down menu lists the following languages:

English
Arabic
Brazilian Portuguese
Canadian French
Czech
Danish
Dutch
Finnish
French
German
Greek
Hebrew
Hungarian
Italian
Japanese
Korean
Latin American Spanish
Norwegian
Polish
Portuguese
Romanian
Russian
Simplified Chinese
Slovak
Spanish
Swedish
Thai
Traditional Chinese
Turkish

LOV Display Name

This section lets you translate LOV display names to different languages. Select a search attribute from the pull-down menu: author, description, mimetype, subject, or title. Select the LOV type, and then select the language from the pull-down menu.

Data Group Name

This section lets you translate data group display names to different languages. The pull-down menu lists the language options.