This video covers the collection and use of database statistics in the PureData® System
for Analytics appliance, referred to as PDA.

DatabaseStats.ppt Page 1 of 16
This presentation provides an overview of the benefits and usage of database statistics
within the PDA system. You learn the types of statistics collected and when they are
collected. You also learn how to review and update these statistics.

The IBM PDA system relies on statistics to determine the most efficient way to execute
a query. The optimizer analyzes the statistics of each table used in a query and
determines the best order of operations for performance.
If statistics have not been collected on a table or are very stale, then performance might
suffer. A common problem in this case is the order in which tables are applied to join
operations. Ideally, tables are joined in an order that limits row counts during each join
operation. Without current statistics, however, the optimizer might first join two tables,
producing an explosion of rows, before finally joining a third table that then reduces
the row count. The performance of this query might suffer significantly as a result of the
row explosion and might also impact other queries. Outdated statistics might also
impact workload management actions in cases where optimizer cost estimates are
used in scheduler rules.
In order to prevent these problems, statistics should be maintained regularly. This is the
most important ongoing administration task.

The statistics for each table include the row count and various column-level statistics.
The column-level statistics include the minimum and maximum values for the column,
the count of null values, and the count of distinct values, which is maintained as a
dispersion value.
All of this information is used by the optimizer to estimate the cost of SQL operations
that are needed to satisfy the query. The estimated costs of the various orders of
operation are compared. A final execution plan is chosen based on the order of
operations resulting in the lowest overall cost.
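The comparison of join orders can be illustrated with a toy sketch in Python. The table names, row counts, and join selectivities below are invented for illustration, and the cost measure (the sum of intermediate row counts) is a deliberate simplification, not the PDA optimizer's actual cost model.

```python
from itertools import permutations

# Hypothetical row counts, standing in for what table statistics would supply.
ROW_COUNTS = {"orders": 1_000_000, "customers": 50_000, "regions": 10}

# Hypothetical join selectivities: the fraction of the cross product that survives.
SELECTIVITY = {
    frozenset(["orders", "customers"]): 1 / 50_000,
    frozenset(["customers", "regions"]): 1 / 10,
    frozenset(["orders", "regions"]): 1.0,  # no join predicate: a cross product
}


def plan_cost(order):
    """Return the sum of intermediate row counts for joining tables in this order."""
    joined = [order[0]]
    rows = ROW_COUNTS[order[0]]
    cost = 0.0
    for table in order[1:]:
        # Apply every predicate that links the new table to a table already joined.
        selectivity = 1.0
        for other in joined:
            selectivity *= SELECTIVITY[frozenset([other, table])]
        rows = rows * ROW_COUNTS[table] * selectivity
        cost += rows
        joined.append(table)
    return cost


def best_plan():
    """Return the join order with the lowest estimated cost."""
    return min(permutations(ROW_COUNTS), key=plan_cost)
```

In this toy model, joining orders to regions first produces a large intermediate result, while joining customers to regions first keeps the row count small. With stale row counts or selectivities, the same comparison could pick the expensive order.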

Table statistics are gathered during various SQL command operations. However, each
of these operations can collect a different level of statistics. The operations are
covered briefly here and in more detail later for each command.
The most important way to generate statistics is by running the GENERATE
STATISTICS command. This command generates and stores all of the statistics for the
table. It should be run regularly to maintain up-to-date statistics.
The CREATE TABLE AS or CTAS command automatically generates all statistics on all
but very small tables. There is no need to run a GENERATE STATISTICS command
immediately after producing a table with the CTAS command.
The INSERT and UPDATE commands automatically update the row count and the
minimum and maximum values for some of the columns. Other statistics are not
updated so this does not replace the need to run the GENERATE STATISTICS
command.
The PDA system may also collect Just in Time statistics, commonly referred to as “JIT
stats”. These statistics are collected on the fly during query execution, based on the
specific query content, and are not stored for use by subsequent queries.
Finally, the TRUNCATE command removes statistics for a table since it also removes
all table content. The dispersion values do remain but will be updated after you run the
GENERATE STATISTICS command, assuming new rows have been inserted.

Covered first is the impact of INSERT and UPDATE commands on statistics. As
previously mentioned, these commands automatically update some of the statistics,
including the row count of the table and the minimum and maximum values for some of
the columns.
The columns that are updated are only those stored in a non-character format. Look at
the statistics that are displayed here from the Performance Portal for table STAT_TAB.
The columns have been given names that are associated with their data types. As you
can see, only the columns with non-character data types had their minimum and
maximum values updated after the INSERT command. These data types include any of
the exact integer and numeric types, approximate numeric types, logical types, and
temporal types. These statistics are updated, and others are not, because it is
inexpensive for the system to do so. These updated statistics can be beneficial but do
not replace the need for you to run the GENERATE STATISTICS command.
Note that the DELETE command does not alter any of the statistics. You should run
both the GROOM and GENERATE STATISTICS commands after deleting many rows
from a table.
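The behavior described for INSERT can be pictured with a small Python model. This is only a toy illustration of the bookkeeping, not how the PDA system is implemented. The column names are hypothetical, and the set of character types is an assumption.

```python
from dataclasses import dataclass

# Assumed set of character data types for this illustration.
CHARACTER_TYPES = {"CHAR", "VARCHAR", "NCHAR", "NVARCHAR"}


@dataclass
class ColumnStats:
    data_type: str
    minimum: object = None
    maximum: object = None


@dataclass
class TableStats:
    columns: dict
    row_count: int = 0

    def on_insert(self, row):
        """Mimic the inexpensive updates made for one inserted row."""
        self.row_count += 1
        for name, value in row.items():
            column = self.columns[name]
            if column.data_type in CHARACTER_TYPES or value is None:
                continue  # left for GENERATE STATISTICS to collect
            if column.minimum is None or value < column.minimum:
                column.minimum = value
            if column.maximum is None or value > column.maximum:
                column.maximum = value


# Hypothetical columns named after their data types.
stats = TableStats(columns={
    "col_integer": ColumnStats("INTEGER"),
    "col_varchar": ColumnStats("VARCHAR"),
})
stats.on_insert({"col_integer": 7, "col_varchar": "abc"})
stats.on_insert({"col_integer": 3, "col_varchar": "xyz"})
```

After the two inserts, the row count and the integer column's minimum and maximum are set, while the character column's minimum and maximum remain empty. The model has no counterpart for null counts or dispersion, which are not updated by INSERT.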

After running the GENERATE STATISTICS command, you see that all statistics have
now been collected. This includes the columns with character data types. The count of
null values and dispersion values are also now collected for all columns.
Look at the dispersion values. These are the inverse of the cardinality. For example, a
dispersion value of 0.5 or one-half means the number of distinct values for that column
is 2.
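The relationship between dispersion and cardinality can be written as a short Python function. This sketch computes an exact value from a list of column values. Ignoring nulls is an assumption made here for simplicity, and the function does not attempt to reproduce the sampling that the system uses.

```python
def dispersion(values):
    """Return 1 / (number of distinct non-null values), or None if there are none."""
    distinct = {value for value in values if value is not None}
    if not distinct:
        return None
    return 1 / len(distinct)
```

For example, a column holding only the values "Y" and "N" has two distinct values, so `dispersion(["Y", "N", "Y", "N"])` gives 0.5.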
Note that the dispersion status shows that the dispersion values are current but
estimated. The system uses sampling to determine the dispersion on all but very small
tables. This allows the GENERATE STATISTICS command to run relatively quickly but
still provide good statistics.

As previously noted, there is no need to run the GENERATE STATISTICS command
after running the CTAS command. This is because it has already been done for you.
In this example, the table STAT_TAB2 was created from the table STAT_TAB with a
CTAS command. Immediately afterward, all statistics are up-to-date. The system
automatically injects a GENERATE STATISTICS command into the CTAS transaction.
This is done in all cases where the target table receives more than 10,000 rows by
default. You can change this default threshold within an individual SQL session.

Just In Time statistics, commonly known as JIT stats, are triggered by a specific query
and use sampling to obtain more precise statistics that are based on the specific
content of the query. These statistics are collected at the beginning of the query and
used to develop the main execution plan. For example, if the query contains the
restriction WHERE ORDER_DATE = '2013-12-23', the row count might be significantly
different from the average cardinality across all dates. The collection of JIT stats also
helps the system to determine the row estimates resulting from a Join operation and
also whether the data being requested from the system is significantly skewed across
the data slices.
JIT stats are only used for the specific query from which they are triggered. They are
used to produce an optimal execution plan for the query and are then discarded. They
might be triggered depending on the size of the table that is referenced in the query,
whether the table is joined to another table, and whether the table is restricted. They
might also be triggered if the table has an associated materialized view.
The current status of permanent statistics that are collected by the GENERATE
STATISTICS command is not a factor in whether JIT stats are collected. However, JIT
stats do not cover all scenarios and do not replace the permanent table statistics that
are collected by the GENERATE STATISTICS command.
JIT stats are designed to be collected very quickly, with little impact to the overall query.
In the unlikely event that you suspect JIT stats are causing undesired overhead, you
should contact IBM’s PureData System for Analytics Support for assistance. There are
options to change trigger thresholds and to disable part or all of the JIT stats
capabilities.
For more details about Just in Time statistics, refer to the System Administrator’s Guide.

As you have seen on previous slides, the statistics for a table can be reviewed from the
Performance Portal. This is done by right-clicking the table name and selecting the View
Statistics option. The resulting display is shown here. Statistics can be reviewed in the
same manner from the NzAdmin client.
If you are an administrator and prefer to use the command line, you can review statistics
with the nz_get command. This command is part of the Software Support Tools kit that
is installed on the PDA host. Use the “-help” option with the command for more details.

The GENERATE STATISTICS command is a SQL command and can be run from any
SQL-based application. In both the Performance Portal and NzAdmin client, it is run by
clicking the “Generate Statistics” button in the “View Statistics” window.
On production systems, you typically run the GENERATE STATISTICS command from
a scheduler or at the end of an ETL stream.
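As a rough sketch of what a scheduled job or the final step of an ETL stream might do, the Python function below builds one statement per table. The statement form `GENERATE STATISTICS ON <table>;` is assumed here and should be checked against the Database User’s Guide. The function name and the simple name check are illustrative only. How the statements are submitted (for example, through a SQL client or a database driver) is left open.

```python
def generate_statistics_sql(tables):
    """Build one GENERATE STATISTICS statement per table name."""
    statements = []
    for table in tables:
        # Accept only simple identifiers so the names can be embedded safely.
        if not table.replace("_", "").isalnum():
            raise ValueError(f"unexpected table name: {table!r}")
        statements.append(f"GENERATE STATISTICS ON {table};")
    return statements
```

Called with the table names used in this presentation, `generate_statistics_sql(["STAT_TAB", "STAT_TAB2"])` returns two statements that can be handed to whatever SQL-based application the job already uses.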
The privileges that are required to run the GENERATE STATISTICS command are
discussed in the Database User’s Guide.

The best practice for generating statistics is very simple: Run the GENERATE
STATISTICS command regularly. As a rule of thumb, if your table changes less than 10%
per week, then generate statistics weekly. If your table changes significantly daily,
then run GENERATE STATISTICS daily.
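The rule of thumb can be expressed as a small Python helper. The presentation names only two cases, so treating any weekly change of 10% or more as the daily case is an assumption made for this sketch. The function name and its input are hypothetical.

```python
def stats_cadence(weekly_change_fraction):
    """Suggest how often to run GENERATE STATISTICS for one table.

    weekly_change_fraction is the share of the table that changes per week,
    expressed as a number between 0 and 1.
    """
    if weekly_change_fraction < 0:
        raise ValueError("weekly_change_fraction must not be negative")
    if weekly_change_fraction < 0.10:
        return "weekly"
    # Assumption: anything at or above 10% per week counts as significant change.
    return "daily"
```

A table that changes by about 2% per week would be scheduled weekly under this sketch, while a table that is largely reloaded each day would be scheduled daily.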
The PDA system does not track when statistics were last generated and also does not
track the degree to which statistics are outdated. You can track that yourself from query
history if you want but it is much simpler to generate the statistics regularly. The time
that is required to generate the statistics is not long and generating statistics regularly is
a safe and easy process to manage. If your tables change significantly during each ETL
or ELT cycle, then you may also choose to generate statistics as part of that process.
Although the process to generate statistics is not overly expensive, it should be run as a
non-ADMIN user. Using a non-ADMIN user allows you to easily manage the process as
part of workload management. You can then dynamically change workload priorities as
needed. If you choose or need to generate statistics as part of an ETL or ELT process
then ensure the database user for that process has the appropriate privileges to
execute the command.
Consider using the nz_genstats command if you have exceptions or want additional
control over the type of statistics you collect. This command allows you to override
default settings and it can also bypass the statistics collection if the statistics are
already up-to-date. For example, you might choose to read every row when collecting dispersion
values instead of using sampling. This command will allow you to do that. It is part of
the Software Support Tools kit.

This concludes the presentation part of the video. Pause the video now and test your
knowledge of database statistics. Resume the video when you are ready to see the
answers.

Answer to Question 1: Both A and B are correct. The Minimum and Maximum values
are maintained during an INSERT statement for columns with an INTEGER data type.
The Null Count is only updated by the GENERATE STATISTICS command.
Answer to Question 2: FALSE. No statistics are updated during a DELETE command.
Answer to Question 3: Both B and C are correct. The presence of a materialized view
and the size of a table are both factors that can trigger JIT stats. JIT stats are only
collected for a specific query and are not stored for use by subsequent queries.
Answer to Question 4: TRUE. By default a CTAS operation will automatically generate
statistics on the new table when it receives more than 10,000 rows.
Answer to Question 5: Both A and C are correct. The minimum value is automatically
maintained for columns with non-character data types during an UPDATE command.
This includes columns with INTEGER and NUMERIC data types.
Answer to Question 6: FALSE. The status of permanent statistics produced by the
GENERATE STATISTICS command has no bearing on whether JIT stats are collected
during a query. The purpose of JIT stats is to help generate a better execution plan
based on the content of a specific query.

For more information, visit the links displayed on this slide or contact IBM’s PureData
System for Analytics support.

