6 Using Partitioning in a Data Warehouse Environment

Data warehouses often contain large tables and require techniques both for managing these large tables and for providing good query performance across these large tables. This chapter describes the partitioning features that significantly enhance data access and improve overall application performance. This is especially true for applications that access tables and indexes with millions of rows and many gigabytes of data.

This chapter contains the following topics:

What Is a Data Warehouse?
Scalability
Performance
Manageability

What Is a Data Warehouse?

A data warehouse is a relational database that is designed for query and analysis rather than for transaction processing. It usually contains historical data derived from transaction data, but can include data from other sources. Data warehouses separate analysis workload from transaction workload and enable an organization to consolidate data from several sources.

In addition to a relational database, a data warehouse environment can include an extraction, transportation, transformation, and loading (ETL) solution, analytical processing and data mining capabilities, client analysis tools, and other applications that manage the process of gathering data and delivering it to business users.

See Also:

Oracle Database Data Warehousing Guide

Scalability

Partitioning helps scaling a data warehouse by dividing database objects into smaller pieces, enabling access to smaller, more manageable objects. Having direct access to smaller objects addresses the scalability requirements of data warehouses:

Bigger Databases
Bigger Individual tables: More Rows in Tables
More Users Querying the System
More Complex Queries

Bigger Databases

The ability to split a large database object into smaller pieces transparently provides benefits to manage a larger total database size. You can identify and manipulate individual partitions and subpartitions in order to cope with large database objects. Consider the following advantages of partitioned objects:

Backup and recovery can be performed on a low level of granularity to cope with the size of the database.
Part of a database object can be stored compressed while other parts can remain uncompressed.
Partitioning can be used to store data transparently on different storage tiers to lower the cost of storing vast amounts of data. Refer to Chapter 5, "Using Partitioning for Information Lifecycle Management".

Bigger Individual tables: More Rows in Tables

It takes longer to scan a big table than it takes to scan a small table. Queries against partitioned tables may access one or more partitions that are small compared with the total size of the table. Similarly, queries may take advantage of partition elimination on indexes. It takes less time to read a smaller portion of an index from disk than to read the entire index. Index structures that share the partitioning strategy with the table, local partitioned indexes, can be accessed and maintained on a partition-by-partition basis.

The database can take advantage of the distinct data sets in separate partitions if you use parallel execution to speed up queries, DML, and DDL statements. Individual parallel execution servers can work on their own data set, identified by the partition boundaries.

More Users Querying the System

With partitioning, users are more likely to hit isolated and smaller data sets. As a result, the database will be able to return results faster than if all users hit the same and much larger data sets. Data contention is less likely.

More Complex Queries

Smaller data sets help perform complex queries faster. If smaller data sets are being accessed, then complex calculations are more likely to be processed in memory which is beneficial from a performance perspective and which reduces the application's I/O requirements. A larger set may have to be written to the temporary tablespace in order to complete, in which case additional I/O against the database storage occurs.

Performance

Good performance is a key to success for a data warehouse. Analyses run against the database should return within a reasonable amount of time, even if the queries access large amounts of data in tables that are terabytes in size. Partitioning provides fundamental functionality to enable successful data warehouses that are not prohibitively expensive in terms of hardware cost.

Partition Pruning

Partition pruning is an essential performance feature for data warehouses. In partition pruning, the optimizer analyzes FROM and WHERE clauses in SQL statements to eliminate unneeded partitions when building the partition access list. This enables Oracle Database to perform operations only on those partitions that are relevant to the SQL statement.

Partition pruning dramatically reduces the amount of data retrieved from disk and shortens processing time, thus improving query performance and optimizing resource utilization.

See Also:

Chapter 4, "Partitioning for Availability, Manageability, and Performance" for more information about partition pruning and the difference between static and dynamic partition pruning

Basic Partition Pruning Techniques

The optimizer uses a wide variety of predicates for pruning. The three predicate types, equality, range, and IN-list, are the most commonly used cases of partition pruning. As an example, consider the following query:

SELECT SUM(amount_sold) day_sales

FROM sales

WHERE time_id = TO_DATE('02-JAN-1998', 'DD-MON-YYYY');

Because there is an equality predicate on the partitioning column of sales, this query will prune down to a single predicate and this will be reflected in the explain plan, as shown:

-----------------------------------------------------------------------------------------------

|  Id | Operation                | Name  | Rows| Bytes | Cost (%CPU)| Time     |Pstart| Pstop |

-----------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT         |       |     |       | 21 (100)   |          |      |       |

|   1 |  SORT AGGREGATE          |       | 1   | 13    |            |          |      |       |

|   2 |   PARTITION RANGE SINGLE |       | 485 | 6305  | 21 (10)    | 00:00:01 | 5    | 5     |

| * 3 |    TABLE ACCESS FULL     | SALES | 485 | 6305  | 21 (10)    | 00:00:01 | 5    | 5     |

-----------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

--------------------------------------------------- 

  3 - filter("TIME_ID"=TO_DATE('1998-01-02 00:00:00', 'yyyy-mm-dd hh24:mi:ss'))

Similarly, a range or an IN-list predicate on the time_id column and the optimizer would be used to prune to a set of partitions. The partitioning type plays a role in which predicates can be used. Range predicates cannot be used for pruning on hash partitioned tables while they can be used for all other partitioning strategies. However, on list-partitioned tables, range predicates may not map to a contiguous set of partitions. Equality and IN-list predicates can be used to prune with all the partitioning methods.

Advanced Partition Pruning Techniques

Oracle also prunes in the presence of more complex predicates or SQL statements involving partitioned tables. A common situation is when a partitioned table is joined to the subset of another table, limited by a WHERE condition. For example, consider the following query:

SELECT t.day_number_in_month, SUM(s.amount_sold)

FROM sales s, times t

WHERE s.time_id = t.time_id

  AND t.calendar_month_desc='2000-12'

GROUP BY t.day_number_in_month;

If the database performed a nested loop join with times on the right hand side, then the query would only access the partition corresponding to this row from the times table, so pruning would implicitly take place. But, if the database performed a hash or sort merge join, this would not be possible. If the table with the WHERE predicate is relatively small compared to the partitioned table, and the expected reduction of records or partitions for the partitioned table is significant, then the database will perform dynamic partition pruning using a recursive subquery. The decision whether or not to invoke subquery pruning is an internal cost-based decision of the optimizer.

A sample plan using a hash join operation would look like the following:

--------------------------------------------------------------------------------------------------

| Id| Operation                    |  Name |  Rows | Bytes| Cost (%CPU)|  Time  | Pstart | Pstop |

--------------------------------------------------------------------------------------------------

|  0| SELECT STATEMENT             |       |       |      | 761 (100)  |        |        |       |

|  1|  HASH GROUP BY               |       |    20 | 640  | 761 (41)   |00:00:10|        |       |

|* 2|   HASH JOIN                  |       | 19153 | 598K | 749 (40)   |00:00:09|        |       |

|* 3|    TABLE ACCESS FULL         | TIMES |    30 |  570 |  17 (6)    |00:00:01|        |       |

|  4|     PARTITION RANGE SUBQUERY |       |  918K | 11M  |   655 (33) |00:00:08| KEY(SQ)|KEY(SQ)|

|  5|      TABLE ACCESS FULL       | SALES |   918 | 11M  |   655 (33) |00:00:08| KEY(SQ)|KEY(SQ)|

--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

---------------------------------------------------

PLAN_TABLE_OUTPUT

--------------------------------------------------------------------------------------------------

  2 - access("S"."TIME_ID"="T"."TIME_ID") 

  3 - filter("T"."CALENDAR_MONTH_DESC"='2000-12')

This plan shows that dynamic partition pruning occurred on the sales table using a subquery, as can be seen from the KEY(SQ) value in the PSTART and PSTOP columns.

Another example using advanced pruning is the following, which uses an OR predicate:

SELECT p.promo_name promo_name, (s.profit - p.promo_cost) profit

FROM

   promotions p,

   (SELECT

      promo_id,

      SUM(sales.QUANTITY_SOLD * (costs.UNIT_PRICE - costs.UNIT_COST)) profit

   FROM

      sales, costs

   WHERE

      ((sales.time_id BETWEEN TO_DATE('01-JAN-1998','DD-MON-YYYY',

                  'NLS_DATE_LANGUAGE = American') AND

      TO_DATE('01-JAN-1999','DD-MON-YYYY', 'NLS_DATE_LANGUAGE = American')

OR

      (sales.time_id BETWEEN TO_DATE('01-JAN-2001','DD-MON-YYYY',

                  'NLS_DATE_LANGUAGE = American') AND

      TO_DATE('01-JAN-2002','DD-MON-YYYY', 'NLS_DATE_LANGUAGE = American')))

      AND sales.time_id = costs.time_id

      AND sales.prod_id = costs.prod_id

   GROUP BY

      promo_id) s

WHERE s.promo_id = p.promo_id

ORDER BY profit

DESC;

This query joins the sales and costs tables in the sh sample schema. The sales table is partitioned by range on the column time_id. One of the conditions in the query are two predicates on time_id, which are combined with an OR operator. This OR predicate is used to prune the partitions in sales table and a single join between the sales and costs table is performed. The plan is as follows:

--------------------------------------------------------------------------------------------------

| Id| Operation               |  Name    |Rows |Bytes |TmpSp|Cost(%CPU)| Time    | Pstart| Pstop |

--------------------------------------------------------------------------------------------------

|  0| SELECT STATEMENT        |          | 4   |  200 |     | 3556 (14)| 00:00:43|       |       |

|  1|  SORT ORDER BY          |          | 4   |  200 |     | 3556 (14)| 00:00:43|       |       |

|* 2|   HASH JOIN             |          | 4   |  200 |     | 3555 (14)| 00:00:43|       |       |

|  3|    TABLE ACCESS FULL    |PROMOTIONS| 503 | 16599|     |    16 (0)| 00:00:01|       |       |

|  4|    VIEW                 |          |   4 |   68 |     | 3538 (14)| 00:00:43|       |       |

|  5|     HASH GROUP BY       |          |   4 |  164 |     | 3538 (14)| 00:00:43|       |       |

|  6|      PARTITION RANGE OR |          | 314K|   12M|     |  3321 (9)| 00:00:40|KEY(OR)|KEY(OR)|

|* 7|       HASH JOIN         |          | 314K|   12M| 440K|  3321 (9)| 00:00:40|       |       |

|* 8|        TABLE ACCESS FULL| SALES    | 402K| 7467K|     |  400 (39)| 00:00:05|KEY(OR)|KEY(OR)|

|  9| TABLE ACCESS FULL       | COSTS    |82112| 1764K|     |   77 (24)| 00:00:01|KEY(OR)|KEY(OR)|

--------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):

--------------------------------------------------- 

  2 - access("S"."PROMO_ID"="P"."PROMO_ID") 

  7 - access("SALES"."TIME_ID"="COSTS"."TIME_ID" AND "SALES"."PROD_ID"="COSTS"."PROD_ID") 

  8 - filter("SALES"."TIME_ID"<=TO_DATE('1999-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND 

      "SALES"."TIME_ID">=TO_DATE('1998-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') OR 

      "SALES"."TIME_ID">=TO_DATE('2001-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss') AND 

      "SALES"."TIME_ID"<=TO_DATE('2002-01-01 00:00:00', 'yyyy-mm-dd hh24:mi:ss'))

The database also does additional pruning when a column is range partitioned on multiple columns. As long as the database can guarantee that a particular predicate cannot be satisfied in a particular partition, the partition will be skipped. This allows the database to optimize cases where there are range predicates on more than one column or in the case where there are no predicates on a prefix of the partitioning columns.

See Also:

"Partition Pruning Tips" for tips on partition pruning

Partition-Wise Joins

Partition-wise joins reduce query response time by minimizing the amount of data exchanged among parallel execution servers when joins execute in parallel. This significantly reduces response time and improves the use of both CPU and memory resources.

Partition-wise joins can be full or partial. Oracle decides which type of join to use.

Full Partition-Wise Joins

Full partition-wise joins can occur if two tables that are co-partitioned on the same key are joined in a query. The tables can be co-partitioned at the partition level, or at the subpartition level, or at a combination of partition and subpartition levels. Reference partitioning is an easy way to guarantee co-partitioning. Full partition-wise joins can be executed in serial and in parallel.

See Also:

Chapter 4, "Partitioning for Availability, Manageability, and Performance" for more information on partition-wise joins

The following example shows a full partition-wise join on orders and order_items, in which the order_items table is reference partitioned.

CREATE TABLE orders

( order_id     NUMBER(12) NOT NULL

, order_date   DATE NOT NULL

, order_mode   VARCHAR2(8)

, order_status VARCHAR2(1)

, CONSTRAINT orders_pk PRIMARY KEY (order_id)

)

PARTITION BY RANGE (order_date)

( PARTITION p_before_jan_2006 VALUES LESS THAN (TO_DATE('01-JAN-2006','dd-MON-yyyy'))

, PARTITION p_2006_jan VALUES LESS THAN (TO_DATE('01-FEB-2006','dd-MON-yyyy'))

, PARTITION p_2006_feb VALUES LESS THAN (TO_DATE('01-MAR-2006','dd-MON-yyyy'))

, PARTITION p_2006_mar VALUES LESS THAN (TO_DATE('01-APR-2006','dd-MON-yyyy'))

, PARTITION p_2006_apr VALUES LESS THAN (TO_DATE('01-MAY-2006','dd-MON-yyyy'))

, PARTITION p_2006_may VALUES LESS THAN (TO_DATE('01-JUN-2006','dd-MON-yyyy'))

, PARTITION p_2006_jun VALUES LESS THAN (TO_DATE('01-JUL-2006','dd-MON-yyyy'))

, PARTITION p_2006_jul VALUES LESS THAN (TO_DATE('01-AUG-2006','dd-MON-yyyy'))

, PARTITION p_2006_aug VALUES LESS THAN (TO_DATE('01-SEP-2006','dd-MON-yyyy'))

, PARTITION p_2006_sep VALUES LESS THAN (TO_DATE('01-OCT-2006','dd-MON-yyyy'))

, PARTITION p_2006_oct VALUES LESS THAN (TO_DATE('01-NOV-2006','dd-MON-yyyy'))

, PARTITION p_2006_nov VALUES LESS THAN (TO_DATE('01-DEC-2006','dd-MON-yyyy'))

, PARTITION p_2006_dec VALUES LESS THAN (TO_DATE('01-JAN-2007','dd-MON-yyyy'))

)

PARALLEL;


CREATE TABLE order_items

( order_id NUMBER(12) NOT NULL

, product_id NUMBER NOT NULL

, quantity NUMBER NOT NULL

, sales_amount NUMBER NOT NULL

, CONSTRAINT order_items_orders_fk FOREIGN KEY (order_id) REFERENCES 

orders(order_id)

)

PARTITION BY REFERENCE (order_items_orders_fk)

PARALLEL;

A typical data warehouse query would scan a large amount of data. Note that in the underlying plan, the columns Rows, Bytes, Cost (%CPU), Time, and TQ have been removed.

EXPLAIN PLAN FOR

SELECT o.order_date

, sum(oi.sales_amount) sum_sales

FROM orders o

, order_items oi

WHERE o.order_id = oi.order_id

AND o.order_date BETWEEN TO_DATE('01-FEB-2006','DD-MON-YYYY')

                     AND TO_DATE('31-MAY-2006','DD-MON-YYYY')

GROUP BY o.order_id

, o.order_date

ORDER BY o.order_date;


---------------------------------------------------------------------------------------------

| Id  | Operation                         | Name        | Pstart| Pstop |IN-OUT| PQ Distrib |

---------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT                  |             |       |       |      |            |

|   1 |  PX COORDINATOR                   |             |       |       |      |            |

|   2 |   PX SEND QC (ORDER)              | :TQ10001    |       |       | P->S | QC (ORDER) |

|   3 |    SORT GROUP BY                  |             |       |       | PCWP |            |

|   4 |     PX RECEIVE                    |             |       |       | PCWP |            |

|   5 |      PX SEND RANGE                | :TQ10000    |       |       | P->P | RANGE      |

|   6 |       SORT GROUP BY               |             |       |       | PCWP |            |

|   7 |        PX PARTITION RANGE ITERATOR|             |     3 |     6 | PCWC |            |

|*  8 |         HASH JOIN                 |             |       |       | PCWP |            |

|*  9 |          TABLE ACCESS FULL        | ORDERS      |     3 |     6 | PCWP |            |

|  10 |          TABLE ACCESS FULL        | ORDER_ITEMS |     3 |     6 | PCWP |            |

---------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):

---------------------------------------------------


   8 - access("O"."ORDER_ID"="OI"."ORDER_ID")

   9 - filter("O"."ORDER_DATE"<=TO_DATE(' 2006-05-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Partial Partition-Wise Joins

Oracle Database can perform partial partition-wise joins only in parallel. Unlike full partition-wise joins, partial partition-wise joins require you to partition only one table on the join key, not both tables. The partitioned table is referred to as the reference table. The other table may or may not be partitioned. Partial partition-wise joins are more common than full partition-wise joins.

To execute a partial partition-wise join, the database dynamically repartitions the other table based on the partitioning of the reference table. Once the other table is repartitioned, the execution is similar to a full partition-wise join.

The following example shows a call detail records table, cdrs, in a typical data warehouse scenario. The table is interval-hash partitioned.

CREATE TABLE cdrs

( id                 NUMBER

, cust_id            NUMBER

, from_number        VARCHAR2(20)

, to_number          VARCHAR2(20)

, date_of_call       DATE

, distance           VARCHAR2(1)

, call_duration_in_s NUMBER(4)

) PARTITION BY RANGE(date_of_call)

INTERVAL (NUMTODSINTERVAL(1,'DAY'))

SUBPARTITION BY HASH(cust_id)

SUBPARTITIONS 16

(PARTITION p0 VALUES LESS THAN (TO_DATE('01-JAN-2005','dd-MON-yyyy')))

PARALLEL;

The cdrs table is joined with the non-partitioned callers table on cust_id to rank the customers who spent most time calling.

EXPLAIN PLAN FOR

SELECT c.cust_id

,      c.cust_last_name

,      c.cust_first_name

,      AVG(call_duration_in_s)

,      COUNT(1)

,      DENSE_RANK() OVER

       (ORDER BY (AVG(call_duration_in_s) * COUNT(1)) DESC) ranking

FROM   callers c

,      cdrs    cdr

WHERE cdr.cust_id = c.cust_id

AND cdr.date_of_call BETWEEN TO_DATE('01-JAN-2006','dd-MON-yyyy')

                         AND TO_DATE('31-DEC-2006','dd-MON-yyyy')  

GROUP BY c.cust_id

, c.cust_last_name

, c.cust_first_name

ORDER BY ranking;

The execution shows a partial partition-wise join. Note that the columns Rows, Bytes, Cost (%CPU), Time, and TQ have been removed.

--------------------------------------------------------------------------------------------

| Id  | Operation                           | Name     | Pstart| Pstop |IN-OUT| PQ Distrib |

--------------------------------------------------------------------------------------------

|   0 | SELECT STATEMENT                    |          |       |       |      |            |

|   1 |  WINDOW NOSORT                      |          |       |       |      |            |

|   2 |   PX COORDINATOR                    |          |       |       |      |            |

|   3 |    PX SEND QC (ORDER)               | :TQ10002 |       |       | P->S | QC (ORDER) |

|   4 |     SORT ORDER BY                   |          |       |       | PCWP |            |

|   5 |      PX RECEIVE                     |          |       |       | PCWP |            |

|   6 |       PX SEND RANGE                 | :TQ10001 |       |       | P->P | RANGE      |

|   7 |        HASH GROUP BY                |          |       |       | PCWP |            |

|*  8 |         HASH JOIN                   |          |       |       | PCWP |            |

|   9 |          PART JOIN FILTER CREATE    | :BF0000  |       |       | PCWP |            |

|  10 |           BUFFER SORT               |          |       |       | PCWC |            |

|  11 |            PX RECEIVE               |          |       |       | PCWP |            |

|  12 |             PX SEND PARTITION (KEY) | :TQ10000 |       |       | S->P | PART (KEY) |

|  13 |              TABLE ACCESS FULL      | CALLERS  |       |       |      |            |

|  14 |          PX PARTITION RANGE ITERATOR|          |   367 |   731 | PCWC |            |

|  15 |           PX PARTITION HASH ALL     |          |     1 |    16 | PCWC |            |

|* 16 |            TABLE ACCESS FULL        | CDRS     |  5857 | 11696 | PCWP |            |

--------------------------------------------------------------------------------------------


Predicate Information (identified by operation id):

---------------------------------------------------


   8 - access("CDR"."CUST_ID"="C"."CUST_ID")

  16 - filter("CDR"."DATE_OF_CALL">=TO_DATE(' 2006-01-01 00:00:00', 'syyyy-mm-dd 

hh24:mi:ss') AND "CDR"."DATE_OF_CALL"<=TO_DATE('

              2006-12-31 00:00:00', 'syyyy-mm-dd hh24:mi:ss'))

Benefits of Partition-Wise Joins

Partition-wise joins offer benefits described in the following sections:

Reduction of Communications Overhead
Reduction of Memory Requirements

Reduction of Communications Overhead

When executed in parallel, partition-wise joins reduce communications overhead. This is because, in the default case, parallel execution of a join operation by a set of parallel execution servers requires the redistribution of each table on the join column into disjoint subsets of rows. These disjoint subsets of rows are then joined pair-wise by a single parallel execution server.

The database can avoid redistributing the partitions because the two tables are already partitioned on the join column. This enables each parallel execution server to join a pair of matching partitions. This improved performance from using parallel execution is even more noticeable in Oracle Real Application Clusters configurations with internode parallel execution.

Partition-wise joins dramatically reduce interconnect traffic. Using this feature is key for large DSS configurations that use Oracle Real Application Clusters. Currently, most Oracle Real Application Clusters platforms, such as MPP and SMP clusters, provide limited interconnect bandwidths compared with their processing powers. Ideally, interconnect bandwidth should be comparable to disk bandwidth, but this is seldom the case. As a result, most join operations in Oracle Real Application Clusters experience high interconnect latencies without parallel execution of partition-wise joins.

Reduction of Memory Requirements

Partition-wise joins require less memory than the equivalent join operation of the complete data set of the tables being joined. In the case of serial joins, the join is performed at the same time on a pair of matching partitions. If data is evenly distributed across partitions, then the memory requirement is divided by the number of partitions. There is no skew.

In the parallel case, memory requirements depend on the number of partition pairs that are joined in parallel. For example, if the degree of parallelism is 20 and the number of partitions is 100, then 5 times less memory is required because only 20 joins of two partitions are performed at the same time. The fact that partition-wise joins require less memory has a direct effect on performance. For example, the join probably does not need to write blocks to disk during the build phase of a hash join.

Performance Considerations for Parallel Partition-Wise Joins

The optimizer weighs the advantages and disadvantages when deciding whether or not to use partition-wise joins.

In range partitioning where partition sizes differ, data skew increases response time; some parallel execution servers take longer than others to finish their joins. Oracle recommends the use of hash partitioning and subpartitioning to enable partition-wise joins because hash partitioning, if the number of partitions is a power of two, limits the risk of skew. Ideally the hash partitioning key is unique or almost unique to minimize the risk of skew.
The number of partitions used for partition-wise joins should, if possible, be a multiple of the number of query servers. With a degree of parallelism of 16, for example, you can have 16, 32, or even 64 partitions. If there is an odd number of partitions, then some parallel execution servers are used less than others. For example, if there are 17 evenly distributed partition pairs, only one pair will work on the last join, while the other pairs will have to wait. This is because, in the beginning of the execution, each parallel execution server works on a different partition pair. At the end of this first phase, only one pair is left. Thus, a single parallel execution server joins this remaining pair while all other parallel execution servers are idle.

Sometimes, parallel joins can cause remote I/Os. For example, on Oracle Real Application Clusters environments running on MPP configurations, if a pair of matching partitions is not collocated on the same node, a partition-wise join requires extra internode communication due to remote I/O. This is because Oracle must transfer at least one partition to the node where the join is performed. In this case, it is better to explicitly redistribute the data than to use a partition-wise join.

Indexes and Partitioned Indexes

Indexes are optional structures associated with tables that allow SQL statements to execute more quickly against a table. Even though table scans are very common in many data warehouses, indexes can often speed up queries. The most commonly used indexes in a data warehouse are B-tree and bitmap indexes.

Both B-tree and bitmap indexes can be created as LOCAL indexes on a partitioned table, in which case they inherit the table's partitioning strategy. B-tree indexes can be created as global partitioned indexes on partitioned and on non-partitioned tables.

See Also:

Chapter 4, "Partitioning for Availability, Manageability, and Performance" for more information about partitioned indexes

Local Partitioned Indexes

In a local index, all keys in a particular index partition refer only to rows stored in a single underlying table partition. A local index is equipartitioned with the underlying table. Oracle partitions the index on the same columns as the underlying table, creates the same number of partitions or subpartitions, and gives them the same partition bounds as corresponding partitions of the underlying table.

Oracle also maintains the index partitioning automatically when partitions in the underlying table are added, dropped, merged, or split, or when hash partitions or subpartitions are added or coalesced. This ensures that the index remains equipartitioned with the table.

For data warehouse applications, local nonprefixed indexes can improve performance because many index partitions can be scanned in parallel by range queries on the index key. The following example creates a local B-tree index on a partitioned customers table.

ALTER SESSION enable parallel ddl;


CREATE INDEX cust_last_name_ix

ON customers(last_name) LOCAL

PARALLEL NOLOGGING ;

Bitmap indexes use a very efficient storage mechanism for low cardinality columns. Bitmap indexes are commonly used in data warehouses, especially in data warehouses that implement so-called star schemas. A single star schema consists of a central large fact table and multiple smaller dimension tables that describe the data in the fact table.

For example, the sales table in the sample sh schema in the Oracle Database is a fact table, that is described by dimension tables customers, products, promotions, times, and channels. Bitmap indexes enable the so-called star transformation, an optimization for fast query retrieval against star or star look-a-like schemas.

Fact table foreign key columns are ideal candidates for bitmap indexes, because generally there are relatively few distinct values relative to the total number of rows. Fact tables are often range or range-* partitioned, in which case you have to create local bitmap indexes. Global bitmap indexes on partitioned tables are not supported.

The following example creates a local partitioned bitmap index on the sales table.

ALTER SESSION enable parallel ddl;


CREATE BITMAP INDEX prod_id_ix

ON sales(prod_id) LOCAL

PARALLEL NOLOGGING;

See Also:

Oracle Database Data Warehousing Guide for more information about the star transformation

Non-Partitioned Indexes

You can create non-partitioned indexes on non-partitioned tables and on partitioned tables. Non-partitioned indexes are primarily used on non-partitioned tables in data warehouse environments. You can use a non-partitioned global index on a partitioned table to enforce a primary or unique key. A non-partitioned (global) index can be useful for queries that commonly retrieve very few rows based on equality predicates or in-list on a column or set of columns that is not included in the partitioning key. In those cases, it can be faster to scan a single index than to scan many index partitions to find all matching rows.

Unique indexes on columns other than the partitioning columns must be global because unique local nonprefixed indexes whose key does not contain the partitioning key are not supported. Unique keys are not always enforced in data warehouses due to the controlled data load processes and the performance cost of enforcing the unique constraint. Global indexes can grow very large on tables with billions of rows.

The following example creates a global unique index on the sales table. Note that very few queries will benefit from this index. In systems with a very limited data load window, you should consider not to create and maintain it.

ALTER SESSION enable parallel ddl;


CREATE UNIQUE INDEX sales_unique_ix

ON sales(cust_id, prod_id, promo_id, channel_id, time_id)

PARALLEL NOLOGGING;

Note:

Most partition maintenance operations invalidate non-partitioned indexes, forcing an index rebuild.

Global Partitioned Indexes

You can create global partitioned indexes on non-partitioned tables and on partitioned tables. In a global partitioned index, the keys in a particular index partition may refer to rows stored in more than one underlying table partition or subpartition. A global index can be range or hash partitioned, though it can be defined on any type of partitioned table.

A global index is created by specifying the GLOBAL attribute. The database administrator is responsible for defining the initial partitioning of a global index at creation and for maintaining the partitioning over time. Index partitions can be merged or split as necessary.

Global indexes can be useful if there is a class of queries that uses an access path to the table to retrieve a few rows via an index, and by partitioning the index you can eliminate large portions of the index for the majority of queries that use the index. On a partitioned table you would consider a global partitioned index if the column or columns you should include to achieve partition pruning do not include the table partitioning key.

The following example creates a global hash-partitioned index on the sales table.

CREATE INDEX cust_id_prod_id_global_ix

ON sales(cust_id,prod_id)

GLOBAL PARTITION BY HASH (cust_id)

( PARTITION p1 TABLESPACE tbs1

, PARTITION p2 TABLESPACE tbs2

, PARTITION p3 TABLESPACE tbs3

, PARTITION p4 TABLESPACE tbs4

)

PARALLEL NOLOGGING;

Note:

Most partition maintenance operations invalidate global partitioned indexes, forcing an index rebuild.

Partitioning and Data Compression

Data in a partitioned table can be compressed on a partition-by-partition basis. Using compressed data is most efficient for data that does not change frequently. Although Oracle Database 11g supports compression for all DML operations, it is still more efficient to modify data in a non-compressed table.

Common data warehouse scenarios often see few data changes as data ages and other scenarios only insert data. Using the partition management features, you can compress data on a partition-by-partition basis. Note that altering a partition to enable compression only applies to future data to be inserted into the partition. If you want to compress the existing data in the partition, then you have to move the partition. Enabling compression and moving a partition can be done in a single operation.

If you want to use table compression on partitioned tables with bitmap indexes, then you need to do the following before you introduce the compression attribute for the first time:

Mark bitmap indexes unusable.
Set the compression attribute.
Rebuild the indexes.

The first time you make a compressed partition part of an already existing, fully uncompressed partitioned table, you must either drop all existing bitmap indexes or mark them UNUSABLE prior to adding a compressed partition. This must be done regardless of whether any partition contains data. It is also independent of the operation that causes one or more compressed partitions to become part of the table. This does not apply to a partitioned table having B-tree indexes only.

The following example shows how to compress the SALES_1995 partition in the sales table.

ALTER SESSION enable parallel ddl;


ALTER TABLE sales

MOVE PARTITION sales_1995

COMPRESS FOR ALL OPERATIONS

PARALLEL NOLOGGING;

If a table or a partition takes less space on disk, then the performance of large table scans in an I/O-constraint environment may improve.

Materialized Views and Partitioning

One technique employed in data warehouses to improve performance is the creation of summaries. Summaries are special types of aggregate views that improve query execution times by precalculating expensive joins and aggregation operations prior to execution and storing the results in a table in the database. For example, you can create a summary table to contain the sums of sales by region and by product.

The summaries or aggregates that are referred to in this book and in literature on data warehousing are created in Oracle Database using a schema object called a materialized view. Materialized views in a data warehouse are meant to speed up query performance.

The database supports transparent rewrites against materialized views, so that you do not need to modify the original queries to take advantage of precalculated results in materialized views. Instead of executing the query, the database will retrieve precalculated results from one or more materialized views, perform any necessary additional operations on the data, and return the query results. The database guarantees correct results in line with your setting of the QUERY_REWRITE_INTEGRITY initialization parameter.

See Also:

Oracle Database Data Warehousing Guide

Partitioned Materialized Views

The underlying storage for a materialized view is a table structure. You can partition materialized views like you can partition tables. When the database rewrites a query to run against materialized views, the query can take advantage of the same performance features that queries running against tables directly benefit from. The rewritten query may eliminate materialized view partitions. If joins back to tables or with other materialized views are necessary to retrieve the query result, then the rewritten query can take advantage of partition-wise joins.

The following example shows how to create a compressed partitioned materialized view that aggregates sales results to country level. This materialized view benefits from queries that summarize sales numbers by country level or higher to subregion or region level.

ALTER SESSION ENABLE PARALLEL DDL;


CREATE MATERIALIZED VIEW country_sales

PARTITION BY HASH (country_id)

PARTITIONS 16

COMPRESS FOR ALL OPERATIONS

PARALLEL NOLOGGING

ENABLE QUERY REWRITE

AS SELECT co.country_id

, co.country_name

, co.country_subregion

, co.country_region

, sum(sa.quantity_sold) country_quantity_sold

, sum(sa.amount_sold) country_amount_sold

FROM sales sa

, customers cu

, countries co

WHERE sa.cust_id = cu.cust_id

AND cu.country_id = co.country_id

GROUP BY co.country_id

, co.country_name

, co.country_subregion

, co.country_region;

See Also:

Oracle Database Data Warehousing Guide

Manageability

Data Warehouses store historical data. An important part of a data warehouse is the data load and purge. Partitioning is powerful technology that can help data management for data warehousing.

Partition Exchange Load

Partitions can be added using Partition Exchange Load (PEL). When you use PEL you create a separate table that looks exactly like a single partition, including the same indexes and constraints, if any. If you use a composite partitioned table, then your separate table must use a partitioning strategy that matches the subpartitioning strategy of your composite partitioned table. You can then swap out an existing table partition with this separate table. In a data load scenario, data can be loaded into the separate table. Build indexes and implement constraints on the separate table, without impacting the table users query. Then perform the PEL, which is a very low-impact transaction compared to the data load. Daily loads, in conjunction with a range partition strategy by day, are common in data warehouse environments.

The following example shows a partition exchange load for the sales table.

ALTER TABLE sales ADD PARTITION p_sales_jun_2007

VALUES LESS THAN (TO_DATE('01-FEB-2007','dd-MON-yyyy'));


CREATE TABLE sales_jun_2007 COMPRESS FOR ALL OPERATIONS

AS SELECT * FROM sales WHERE 1=0;

Next, populate table sales_jun_2007 with sales numbers for June 2007, and create the equivalent bitmap indexes and constraints that have been implemented on the sales table.

CREATE BITMAP INDEX time_id_jun_2007_bix ON sales_jun_2007(time_id) 

NOLOGGING;

CREATE BITMAP INDEX cust_id_jun_2007_bix ON sales_jun_2007(cust_id) 

NOLOGGING;

CREATE BITMAP INDEX prod_id_jun_2007_bix ON sales_jun_2007(prod_id) 

NOLOGGING;

CREATE BITMAP INDEX promo_id_jun_2007_bix ON sales_jun_2007(promo_id) 

NOLOGGING;

CREATE BITMAP INDEX channel_id_jun_2007_bix ON sales_jun_2007(channel_id) 

NOLOGGING;


ALTER TABLE sales_jun_2007 ADD CONSTRAINT prod_id_fk FOREIGN KEY (prod_id) 

REFERENCES products(prod_id);

ALTER TABLE sales_jun_2007 ADD CONSTRAINT cust_id_fk FOREIGN KEY (cust_id) 

REFERENCES customers(cust_id);

ALTER TABLE sales_jun_2007 ADD CONSTRAINT promo_id_fk FOREIGN KEY (promo_id) 

REFERENCES promotions(promo_id);

ALTER TABLE sales_jun_2007 ADD CONSTRAINT time_id_fk FOREIGN KEY (time_id) 

REFERENCES times(time_id);

ALTER TABLE sales_jun_2007 ADD CONSTRAINT channel_id_fk FOREIGN KEY 

(channel_id) REFERENCES channels(channel_id);

Next, exchange the partition.

ALTER TABLE sales

EXCHANGE PARTITION p_sales_jun_2007

WITH TABLE sales_jun_2007

INCLUDING INDEXES;

See Also:

Chapter 3, "Partition Administration" for more information about partition exchange load

Partitioning and Indexes

Local indexes are easiest when performing partition maintenance operations. Local indexes do not invalidate a global index when partition management takes place. Use INCLUDING INDEXES in the PEL statement in order to exchange the local indexes with the equivalent indexes on the separate table so that no index partitions get invalidated. In the case of PEL, you can update global indexes as part of the load. Use the UPDATE GLOBAL INDEXES extension to the PEL command. If an index requires updating, then the PEL takes much longer.

Partitioning and Materialized View Refresh Strategies

There are different ways to keep materialized views updated:

Full refresh
Fast (incremental) refresh based on materialized view logs against the base tables
Manually using DML, followed by ALTER MATERIALIZED VIEW CONSIDER FRESH

In order to enable query rewrites, set the QUERY_REWRITE_INTEGRITY initialization parameter. If you manually keep materialized views up to date, then you must set QUERY_REWRITE_INTEGRITY to either TRUSTED or STALE_TOLERATED.

See Also:

Oracle Database Data Warehousing Guide

If your materialized views and base tables use comparable partitioning strategies, then PEL can be an extremely powerful way to keep materialized views up-to-date manually. For example, if both your base table and your materialized view use range partitioning, then you can consider PEL to keep your base table and materialized view up-to-date. The total data refresh scenario would work as follows:

Create tables to enable PEL against the tables and materialized views
Load data into the tables, build the indexes, and implement any constraints
Update the base tables using PEL
Update the materialized views using PEL
Execute ALTER MATERIALIZED VIEW CONSIDER FRESH for every materialized view you updated using this strategy

Note that this strategy implies a short period of time, in between PEL against the base table and PEL against the materialized view, in which the materialized view does not reflect the current data in the underlying tables. Take into account the QUERY_REWRITE_INTEGRITY setting and the activity on your system to identify whether you can cope with this situation.

See Also:

Oracle Database 2 Day + Data Warehousing Guide for an example of this refresh scenario

Removing Data from Tables

Data Warehouses commonly keep a time window of data. For example, 3 years worth of historical data is stored.

Partitioning makes it very easy to purge data from a table. You can use the DROP PARTITION or TRUNCATE PARTITION statements in order to purge data. Common strategies also include using a partition exchange load to unload the data from the table and replacing the partition with an empty table before dropping the partition. Archive the separate table you exchanged before emptying or dropping it.

Note that a drop or truncate operation would invalidate a global index or a global partitioned index. Local indexes remain valid. The local index partition is dropped when you drop the table partition.

The following example shows how to drop partition sales_1995 from the sales table.

ALTER TABLE sales

DROP PARTITION sales_1995

UPDATE GLOBAL INDEXES PARALLEL;

Partitioning and Data Compression

Data in a partitioned table can be compressed on a partition-by-partition basis. Using compressed data is most efficient for data that does not change frequently. Common data warehouse scenarios often see few data changes as data ages and other scenarios only insert data. Using the partition management features, you can compress data on a partition-by-partition basis.

If a table takes less space on disk, then performance of large table scans in an I/O-constraint environment may improve.

Gathering Statistics on Large Partitioned Tables

In order to get good SQL execution plans, it is important to have reliable table statistics. Oracle automatically gathers statistics using the statistics job that is activated upon database installation, or you can manually gather statistics using the DBMS_STATS package. Managing statistics on large tables is more challenging than managing statistics on smaller tables.

If a query accesses only a single table partition, then it is best to have partition-level statistics. If queries perform some partition elimination, but not down to a single partition, then you should gather both partition-level statistics and global statistics. Oracle Database 11g can maintain global statistics for a partitioned table incrementally. Only partitions that have changed are scanned and not the entire table.

A typical scenario for statistics management on a partitioned table is the use of Partition Exchange Load (PEL). If you add data using PEL and you do not plan to update the global-level statistics as part of the data load, then you should gather statistics on the table the data was initially loaded into, before you exchange it with the partition. Your global-level statistics will become stale after the partition exchange. When you re-gather the global-level statistics, or when the automatic statistics gather job regathers the global-level statistics, only the new partition, and not the entire table, will be scanned.