The most important point is made on cardinality on page 10 when explaining "Data Skew".
The term "Data Skew" is used to refer to the imbalance of values with respect to a normal distribution (http://en.wikipedia.org/wiki/Skewness).
The imbalance is shown as the Mode (http://en.wikipedia.org/wiki/Mode_(statistics)) of the values in a column of a table.
SELECT * FROM A_TABLE;
COL1 | COL2
1 | A <== Mode = most frequent.
2 | A
3 | A <== Median = middle one
4 | B
5 | B
Histogram for COL2:
Data Skew means that more of (usually it takes a lot more, like +60%) the values in COL2 have a value of "A" than of "B".
As the Oracle whitepaper points out "One of the simplest formulas is used when there is a single equality predicate in a single table query (with no histogram). In this case the Optimizer assumes a uniform distribution and calculates the cardinality for the query by dividing the total number of rows in the table by the number of distinct values in the column used in the where clause predicate.".
In example "
SELECT * FROM A_TABLE WHERE COL2='A'" this would equate to: Cardinality = 5 / 2 A cardinality value of 2.5 would be rounded to 3.
The problem with this is that the larger the number of rows the more skewed the cardinality will become.
Imagine the table with values like so:
Now in example "
SELECT * FROM A_TABLE WHERE COL2='A'" this would equate to: Cardinality = 17 / 2
A cardinality value of 8.5 would be rounded to 9.
Definitely not 15.
To solve this problem, Oracle has the capability to collect statistics for individual columns.
You can use the DBMS_STATS.GATHER_TABLE_STATS command to collect stats on the column (http://www.dba-oracle.com/t_histograms.htm).
The below statement will collect stats on the COL2 column of A_TABLE in the OE schema with a bucket size of 20 buckets. These column stats are histograms of the distribution of values.
DBMS_STATS.GATHER_table_STATS (OWNNAME => 'OE', TABNAME => 'A_TABLE',
METHOD_OPT => 'FOR COLUMNS SIZE 20 COL2');
Two types of HISTOGRAM are available in Oracle, FREQUENCY and HEIGHT.
Oracle uses HEIGHT based histograms *only* when the number of distinct values of a column is greater than the number of buckets specified.
"In a height-balanced histogram, the column values are divided into bands so that each band contains approximately the same number of rows. The useful information that the histogram provides is where in the range of values the endpoints fall.".
Oracle will use FREQUENCY based histograms by default when the number of distinct values of a column is less than or equal to the number of buckets specified.
"In a frequency histogram, each value of the column corresponds to a single bucket of the histogram.".
It is possible to view the collected histogram for a column:
SELECT column_name, num_distinct, num_buckets, histogram
WHERE table_name = 'A_TABLE' AND column_name = 'COL2';