Information Management
Table of Contents
Connecting to the Host and Database
Database Administration
Data Distribution
NzAdmin
Loading and Unloading Data
Backup & Restore
Query Optimization
Optimization Objects
Groom
Stored Procedures
IBM Software
Table of Contents
1 Introduction
1.1 VMware Basics
1.3 Tips and Tricks on Using the PureData System Virtual Machines
1 Introduction
1.1 VMware Basics
VMware Player and VMware Workstation are synonymous with test beds and developer environments across the IT industry. While they have many other functions, for this specific purpose they allow the easy distribution of an up-and-running PureData System appliance to anybody's computer, be it a notebook, desktop, or server. The VMware image can be deployed for simple demos and educational purposes, or it can be the base of your own development and experiments on top of the given environment.
1.2
For the hands-on lab portion of the bootcamp, we will be using two virtual machines (VMs) to demonstrate the usability of PureData System systems. Because of the nature of the virtualized environment and the host hardware, we will be limited in terms of performance. Please use these exercises only as a guide to familiarize yourself with PureData System systems.
The virtual images are adaptations of an appliance, chosen for their portability and convenience. We will be running one virtual image to act as the host machine and the other image as a SPU, which typically resides in a PureData System appliance. The Host image will be the main gateway where the Netezza Performance Server (NPS) code resides and will be accessed. The second image is the SPU, which contains 5 virtual hard drives of 20 GB each as well as a virtual FPGA. The hard disks here are not partitioned into primary, mirror, and temp partitions as you would observe on a PureData System appliance. Instead, four of the disks contain only primary data partitions and the fifth disk is used for temporary data.
[Figure: Host image running the NPS code, accessed via PuTTY, connected to the SPU image with its virtual hard disks, temp space, and virtual FPGA]
1.3 Tips and Tricks on Using the PureData System Virtual Machines
The PureData System appliance is designed and fine-tuned for a specific set of hardware. In order to demonstrate the system in a virtualized environment, some adaptations were made to the virtual machines. To ensure the labs run smoothly, we have listed some pointers for using the VMs:
Always boot up the Host image first before the SPU image
When booting up the VMs, start the Host image first. Once it is fully booted, the SPU image can be started, at which time the Host image will be listening for connections from the SPU machine. The connection should then be made automatically.
After pausing the virtual machines, the nz services need to be restarted
If the VMs were paused (for example, the host operating system went into sleep or hibernation mode, or the images were paused in VMware Workstation), run the following commands at the prompt of the Host image to continue using the VMs.
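The commands themselves did not survive in this copy of the document. On a Netezza host the NPS services are typically cycled with the nz CLI, roughly as follows (a sketch based on standard Netezza tooling, not the lab's original listing):

```
nzstop    -- stop the NPS services (run as the nz user)
nzstart   -- start the NPS services again
nzstate   -- verify the resulting system state
```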
2.1
2.2
2.2.1
To start using the virtual machines, first boot up the Host machine. Click on the HOST tab, then press the Power On button in the upper left corner (marked with a red circle above). You should see the Red Hat operating system boot screen; allow it to boot for a couple of minutes until it reaches the PureData System login prompt.
At the login prompt, login with the following credentials:
Username: nz
Password: nz
Once logged in, we can check the state of the machine by issuing the following command:
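The command itself was lost from this extract; on Netezza hosts the system state is normally checked with nzstate (an assumption based on standard Netezza tooling, not the original screenshot):

```
nzstate
```

nzstate reports the current state of the NPS system, for example Online once the system is fully up.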
Choose the "I moved it" radio button, and click OK. This ensures that the previously configured MAC address in the SPU image remains the same, which is crucial for the communication between the Host and SPU virtual machines.
After the SPU is fully booted, you should see a screen similar to the following. Note the bottom right corner, where it displays that there are 5 virtual hard disks in a healthy state.
We can now go back to the Host image to check the status of the connection. Click on the HOST tab, and enter the following
command in the prompt:
3.1 Using PuTTY
Since we will not be using any graphical interface tools from the Host virtual machine, there is an alternative to using the PureData System prompts directly in VMware: we can connect to the Host via SSH using tools such as PuTTY. We will be using the PuTTY console for the rest of the labs, since this better simulates the real-life scenario of connecting to a remote PureData System system.
First, locate the PuTTY executable in the folder where the VMs were extracted. Under the folder Tools you should find the file putty.exe. Launch it with a double-click. In the PuTTY interface, enter the IP of the Host image as 192.168.239.2 and select SSH as the connection type. Finally, click Open to start the session.
Once the prompt window is open, log in with the following credentials:
Username: nz
Password: nz
We are now ready to connect to the system database and execute commands in the PuTTY command prompt.
3.2
Since we have not created any users or databases yet, we will connect to the default database as the default user, with the following credentials:
Database: system
Username: admin
Password: password
When issuing the nzsql command, the user supplies the user account, password, and the database to connect to. Below is an example of how this would be done. Do not try to execute this command; it is just demonstrating the syntax:
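The example itself is missing from this copy; the general shape of an nzsql invocation, using its standard options, is:

```
nzsql -d <database> -u <username> -pw <password>
```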
Since the current values correspond to our desired values, no modification is required.
IBM PureData System for Analytics
Copyright IBM Corp. 2012. All rights reserved
Next, let's take a look at what options are available to start nzsql. Type in the following command:
3.3
There are commonly used commands that start with a backslash (\) which we will demonstrate in this section. First, we will run the two help commands to familiarize ourselves with these handy commands. The \h command lists the available SQL commands, while the \? command lists the internal slash commands. Examine the output of both commands:
SYSTEM(ADMIN)=> \h
SYSTEM(ADMIN)=> \?
From the output of the \? command, we found the \l internal command, which we can use to list all the databases. Let's list them by entering:
SYSTEM(ADMIN)=> \l
List of databases
 DATABASE  | OWNER
-----------+-------
 MASTER_DB | ADMIN
 SYSTEM    | ADMIN
(2 rows)
Second, we will use \dSt to list the system tables within the system database.
SYSTEM(ADMIN)=> \dSt
List of relations
      Name      |     Type     | Owner
----------------+--------------+-------
 _T_ACCESS_TIME | SYSTEM TABLE | ADMIN
 _T_ACL         | SYSTEM TABLE | ADMIN
 _T_ACTIONFRAG  | SYSTEM TABLE | ADMIN
 _T_AGGREGATE   | SYSTEM TABLE | ADMIN
 _T_ALTBASE     | SYSTEM TABLE | ADMIN
 _T_AM          | SYSTEM TABLE | ADMIN
 _T_AMOP        | SYSTEM TABLE | ADMIN
 _T_AMPROC      | SYSTEM TABLE | ADMIN
 _T_ATTRDEF     | SYSTEM TABLE | ADMIN
 ...
Note: press the space bar to scroll down the result set when you see --More-- on the screen.
From the previous command, we can see that there is a user table called _T_USER. To find out what is stored in that table, we will use the describe command \d:
SYSTEM(ADMIN)=> \d _T_USER
This will return all the columns of the _T_USER system table. Next, we want to know the existing users stored in the table. In case too many rows are returned at once, we will first calculate the number of rows it contains by entering the following query:
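The query was lost in this extract; a straightforward row count over the table just described would be (standard SQL, an assumption rather than the lab's literal text):

```sql
-- count the rows in the _T_USER system table before selecting from it
SELECT COUNT(*) FROM _T_USER;
```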
3.4 Exit nzsql
To exit nzsql, use the command \q to return to the PureData System prompt.
Database Administration
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
1 Introduction
A factory-configured and installed IBM PureData System will include some of the following components:
An IBM PureData System warehouse appliance with pre-installed IBM PureData System software
A preconfigured Linux operating system (with PureData System modifications)
Several preconfigured Linux users and groups:
o The nz user is the default PureData System administration account
o The nz group is the default group
An IBM PureData System database user named ADMIN. The ADMIN user is the database super-user, and has full
access to all system functions and objects
A preconfigured database group named PUBLIC. All database users are automatically placed in the group PUBLIC and
therefore inherit all of its privileges
The IBM PureData System warehouse appliance includes a highly optimized SQL dialect called PureData System Structured
Query Language. You can use SQL commands to create and manage your PureData System databases, user access, and
permissions for the databases, as well as to query and modify the contents of the databases.
On a new IBM PureData System system, there is typically one main database, SYSTEM, and a database template, MASTER_DB.
IBM PureData System uses the MASTER_DB as a template for all other user databases that are created on the system.
Initially, only the ADMIN user can create new databases, but the ADMIN user can grant other users permission to create
databases as well. The ADMIN user can also make another user the owner of a database, which gives that user ADMIN-like
control over that database and its contents. The database creator becomes the default owner of the database. The owner can
remove the database and all its objects, even if other users own objects within the database. Within a database, permitted users can create tables, populate them with data, and query their contents.
1.1 Objectives
This lab will guide you through the typical steps to create and manage new IBM PureData System users and groups after an
IBM PureData System has been delivered and configured. This will include creating a new database and assigning the
appropriate privileges. The users and the database that you create in this lab will be used as a basis for the remaining labs in
this bootcamp. After this lab you will have a basic understanding on how to plan and create an IBM PureData System database
environment.
The first part of this lab will examine creating IBM PureData System users and groups.
The second part of this lab will explore creating and using a database and tables. The table schema to be used within this bootcamp will be explained in the Data Distribution lab.
The security access model for this bootcamp environment will use three PureData System database users:
o LABADMIN
o LABUSER
o DBUSER
and two database user groups:
o LAGRP
o LUGRP
1. Connect to the Netezza image using PuTTY. Log in to 192.168.239.2 as user nz with password nz. 192.168.239.2 is the default IP address for a local VM, which is used for most bootcamp environments. In some cases where the images are hosted remotely, the instructors will provide the host IPs, which will vary between machines.
2. Connect to the system database as the PureData System database super-user, ADMIN, using the nzsql interface. Useful internal commands available at the prompt include \h (SQL help), \? (slash command help), \g (execute the query buffer), and \q (quit). Once connected you will see the prompt:
SYSTEM(ADMIN)=>
2.1
The three new PureData System database users will initially be created using the ADMIN user. The LABADMIN user will be the full owner of the bootcamp database. The LABUSER user will be allowed to perform data manipulation language (DML) operations (INSERT, UPDATE, DELETE) against all of the tables in the database, but will not be allowed to create new objects like tables in the database. Lastly, the DBUSER user will only be allowed to read tables in the database, that is, it will only have LIST and SELECT privileges against tables in the database.
The basic syntax to create a user is:
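The syntax listing is missing from this copy; in Netezza SQL, creating a user has this basic shape (a sketch of the standard command, not the original listing):

```sql
CREATE USER <username> WITH PASSWORD '<password>';
```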
1. As the PureData System database super-user, ADMIN, you can now create the first user, LABADMIN, which will be the administrator of the database. (Note: user and group names are not case-sensitive.)
2. Now you will create two additional PureData System database users that will have restricted access to the database. The first user, LABUSER, will have full DML access to the data in the tables, but will not be able to create or alter tables. For now you will just create the user; we will set the privileges after the database is created.
3. Finally, we create the user DBUSER. This user will have even more limited access to the database, since it will only be allowed to select data from the tables within the database. Again, you will set the privileges after the database is created.
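The concrete statements for steps 1-3 were lost from this copy. Following the syntax above, and given that this lab sets the same password, password, for all users, they would look roughly like this:

```sql
CREATE USER labadmin WITH PASSWORD 'password';
CREATE USER labuser  WITH PASSWORD 'password';
CREATE USER dbuser   WITH PASSWORD 'password';
```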
4. To list the existing PureData System database users in the environment use the \du internal slash option:
SYSTEM(ADMIN)=> \du
This will return a list of all database users:
List of Users
 USERNAME | VALIDUNTIL | ROWLIMIT | SESSIONTIMEOUT | QUERYTIMEOUT | DEF_PRIORITY | MAX_PRIORITY | USERESOURCEGRPID | USERESOURCEGRPNAME | CROSS_JOINS_ALLOWED
----------+------------+----------+----------------+--------------+--------------+--------------+------------------+--------------------+---------------------
 ADMIN    |            |          |              0 |            0 | NONE         | NONE         |                  | _ADMIN_            | NULL
 DBUSER   |            |        0 |              0 |            0 | NONE         | NONE         |                  | PUBLIC             | NULL
 LABADMIN |            |        0 |              0 |            0 | NONE         | NONE         |                  | PUBLIC             | NULL
 LABUSER  |            |        0 |              0 |            0 | NONE         | NONE         |                  | PUBLIC             | NULL
(4 rows)
The additional information, such as USERESOURCEGRPNAME, is intended for resource management, which is covered later in the WLM presentation.
2.2
PureData System database user groups are useful for organizing and managing PureData System database users. By default PureData System contains one group, named PUBLIC. All users become members of the PUBLIC group when they are created, but users can be members of other groups as well. In this section we will create two new PureData System database user groups. They will initially be created by the ADMIN user.
We will create an administrative group LAGRP which is short for Lab Admin Group. This group will contain the LABADMIN user.
The second group we create will be the LUGRP or Lab User Group. This group will contain the users LABUSER and DBUSER.
Two different methods will be used to add the existing users to the newly created groups. Alternatively, the groups could have been created first and then the users. The basic command to create a group is:
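The listing is absent here; the Netezza CREATE GROUP command has this basic shape (a sketch):

```sql
CREATE GROUP <groupname>;
```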
1. As the PureData System database super-user, ADMIN, you will now create the first group, LAGRP, which will be the administrative group for the LABADMIN user.
2. After the LAGRP group is created you will add the LABADMIN user to this group. This is accomplished by using the ALTER statement. You can either ALTER the user or the group; for this task you will ALTER the group to add the LABADMIN user to the LAGRP group:
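The statement itself was lost; altering the group to add a user would look approximately like this (the exact option spelling may vary between NPS releases, so treat this as an assumption):

```sql
ALTER GROUP lagrp WITH ADD USER labadmin;
```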
3. Now you will create the second group, LUGRP, which will be the user group for both the LABUSER and DBUSER users. You can specify the users to be included in the group when creating it:
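A sketch of such a create-with-members statement, mirroring the WITH ... GROUP syntax this lab shows for CREATE USER:

```sql
CREATE GROUP lugrp WITH USER labuser, dbuser;
```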
If you had created the group before creating the user, you could add the user to the group when creating the user. To create
the LABUSER user and add it to an existing group LUGRP, you would use the following command:
create user LABUSER with in group LUGRP;
4. To list the existing PureData System groups in the environment use the \dg internal slash option:
SYSTEM(ADMIN)=> \dg
This will return a list of all groups in the system. In our test system this is the default group PUBLIC and the two groups you
have just created:
List of Groups
 GROUPNAME | ROWLIMIT | SESSIONTIMEOUT | QUERYTIMEOUT | DEF_PRIORITY | MAX_PRIORITY | GRORSGPERCENT | RSGMAXPERCENT | JOBMAX | CROSS_JOINS_ALLOWED
-----------+----------+----------------+--------------+--------------+--------------+---------------+---------------+--------+---------------------
 LAGRP     |        0 |              0 |            0 | NONE         | NONE         |             0 |           100 |      0 | NULL
 LUGRP     |        0 |              0 |            0 | NONE         | NONE         |             0 |           100 |      0 | NULL
 PUBLIC    |        0 |              0 |            0 | NONE         | NONE         |            20 |           100 |      0 | NULL
(3 rows)
5. To list the users in a group you can use one of two internal slash options, \dG or \dU. The internal slash option \dG will list the groups with the associated users:
SYSTEM(ADMIN)=> \dG
This returns a list of all groups and the users they contain:
SYSTEM(ADMIN)=> \dU
In this case the output is ordered by the users:
3.1
The lab database will be named LABDB. It will initially be created by the ADMIN user, and ownership will then be transferred to the LABADMIN user. The LABADMIN user will have full administrative privileges on the LABDB database. The basic syntax to create a database is:
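The syntax line itself is missing from this extract; it has this basic shape (a sketch of the standard command):

```sql
CREATE DATABASE <database_name>;
```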
1. As the PureData System database super-user, ADMIN, you will create the first database, LABDB, using the CREATE DATABASE command.
2.
SYSTEM(ADMIN)=> \l
This will return the following list:
List of databases
 DATABASE  | OWNER
-----------+-------
 LABDB     | ADMIN
 MASTER_DB | ADMIN
 SYSTEM    | ADMIN
(3 rows)
The owner of the newly created LABDB database is the ADMIN user. The other databases are the default database SYSTEM
and the template database MASTER_DB.
3. At this point you could continue by creating new tables as the ADMIN user. However, the ADMIN user should only be used to create users, groups, and databases, and to assign authorities and privileges. Therefore we will transfer ownership of the LABDB database from the ADMIN user to the LABADMIN user we created previously. The ALTER DATABASE command is used to transfer ownership of an existing database:
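The statement was lost from this copy; the ownership transfer would read approximately:

```sql
ALTER DATABASE labdb OWNER TO labadmin;
```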
4. Check that the owner of the LABDB database is now the LABADMIN user:
SYSTEM(ADMIN)=> \l
List of databases
 DATABASE  | OWNER
-----------+----------
 LABDB     | LABADMIN
 MASTER_DB | ADMIN
 SYSTEM    | ADMIN
(3 rows)
The owner of the LABDB database is now the LABADMIN user.
The LABDB database is now created and the LABADMIN user has full privileges on the LABDB database. The user can create
and alter objects within the database. You could now continue and start creating tables as the LABADMIN user. However, we will
first finish assigning privileges to the two remaining database users that were created in the previous section.
3.2
One last task for the ADMIN user is to assign privileges to the two users we created earlier, LABUSER and DBUSER. The LABUSER user will have full DML rights against all tables in the LABDB database, but will not be allowed to create or alter tables within the database. The DBUSER user will have more restricted access and will only be allowed to read data from the tables in the database. The privileges will be controlled by a combination of settings at the group and user level.
The LUGRP user group will be granted LIST and SELECT privileges against the database and tables within the database. So any
member of the LUGRP will have these privileges. The full data manipulation privileges will be granted individually to the LABUSER
user.
The GRANT command that is used to assign object privileges has the following syntax:
GRANT <object_privilege> ON <object> TO { PUBLIC | GROUP <group> | <username> }
1. As the PureData System database super-user, ADMIN, connect to the LABDB database using the internal slash option \c:
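The command was omitted from this extract; using nzsql's \c option it would simply be:

```
\c labdb
```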
2. First you will grant LIST privilege on the LABDB database to the group LUGRP. This will allow members of the LUGRP to view and connect to the LABDB database:
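The statement is missing here; following the GRANT syntax above, it would be approximately:

```sql
GRANT LIST ON labdb TO GROUP lugrp;
```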
3. To list the object permissions for a group, use the internal slash option \dpg:
The X in the L column of the list denotes that the LUGRP group has LIST object privileges on the LABDB global object.
4. With the current privileges set for LABUSER and DBUSER, they can now view and connect to the LABDB database as members of the LUGRP group. But these two users have no privileges to access any of the objects within the database. So you will grant LIST and SELECT privileges on the tables within the LABDB database to the members of the LUGRP:
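The grant itself was lost; following the pattern of the user-level grant shown later in this lab (grant ... on table to labuser), the group-level version would be approximately:

```sql
GRANT LIST, SELECT ON TABLE TO GROUP lugrp;
```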
5. The X in the L and S columns denotes that the LUGRP group has both LIST and SELECT privileges on all of the tables in the LABDB database. (The LIST privilege allows users to view the tables using the internal slash option \d.)
6. The current privileges satisfy the DBUSER user requirements, which are to allow access to the LABDB database and SELECT access to all the tables in the database. But these privileges do not satisfy the requirements for the LABUSER user, which is to have full DML access to all the tables in the database. So you will grant SELECT, INSERT, UPDATE, DELETE, LIST, and TRUNCATE privileges on tables in the LABDB database to the LABUSER user:
LABDB(ADMIN)=> grant select, insert, update, delete, list, truncate on table to labuser;
7. To list the object permissions for a user, use the \dpu <user name> internal slash option:
The X under the L, S, I, U, D, T columns indicates that the LABUSER user has LIST, SELECT, INSERT, UPDATE, DELETE, and TRUNCATE privileges on all of the tables in the LABDB database.
Now that all of the privileges have been set by the ADMIN user, the LABDB database can be handed over to the end users. The end users can use the LABADMIN user to create objects, including tables, in the database and also maintain the database.
3.3
The LABADMIN user will be used to create tables in the LABDB database instead of the ADMIN user. Two tables will be created in this lab; the remaining tables for the LABDB database schema will be created in the Data Distribution lab. Data distribution is an important aspect that should be considered when creating tables, but it is not covered here since it is discussed separately in the Data Distribution presentation. The two tables that will be created are the REGION and NATION tables. These two tables will be populated with data in the next section using the LABUSER user. Two methods will be utilized to create these tables. The basic syntax to create a table is:
CREATE TABLE table_name
(
column_name type [ [ constraint_name ] column_constraint
[ constraint_characteristics ] ] [, ... ]
[ [ constraint_name ] table_constraint [ constraint_characteristics ] ] [, ... ]
)
[ DISTRIBUTE ON ( column [, ...] ) ]
1. Connect to the LABDB database as the LABADMIN user using the internal slash option \c:
To use the nzsql interface to connect to the LABDB database as the LABADMIN user you could use the following options:
nzsql -d labdb -u labadmin -pw password
or the short form, omitting the option flags:
nzsql labdb labadmin password
or you could set the environment variables to the following values and issue nzsql without options.
NZ_DATABASE=LABDB
NZ_USER=LABADMIN
NZ_PASSWORD=password
In further labs we will often leave out the password parameter, since it has been set to the same value, password, for all users.
2. Now you can create the first table in the LABDB database. The first table you will create is the REGION table, with the following columns and data types:
Column Name | Data Type
------------+-------------
R_REGIONKEY | INTEGER
R_NAME      | CHAR(25)
R_COMMENT   | VARCHAR(152)
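The CREATE TABLE statement was lost in extraction; from the column list above it would read roughly as follows (the distribution clause is omitted, so the system default applies):

```sql
CREATE TABLE region
(
  r_regionkey INTEGER,
  r_name      CHAR(25),
  r_comment   VARCHAR(152)
);
```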
3. To list the tables in the LABDB database use the \dt internal slash option:
LABDB(LABADMIN)=> \dt
This will show the table you just created:
List of relations
  Name  | Type  |  Owner
--------+-------+----------
 REGION | TABLE | LABADMIN
(1 row)
4. To describe a table you can use the internal slash option \d <table name>:
LABDB(LABADMIN)=> \d region
This shows a description of the created table:
Table "REGION"
  Attribute  |          Type          | Modifier | Default Value
-------------+------------------------+----------+---------------
 R_REGIONKEY | INTEGER                |          |
 R_NAME      | CHARACTER(25)          |          |
 R_COMMENT   | CHARACTER VARYING(152) |          |
Distributed on hash: "R_REGIONKEY"
The distributed on hash clause shows the distribution method used by the table. If you do not explicitly specify a distribution method, a default distribution is used; in our system this is a hash distribution on the first column, R_REGIONKEY. This concept is discussed in the Data Distribution presentation and lab.
5. Instead of typing out the entire CREATE TABLE statement at the nzsql command line, you can read and execute commands from a file. You'll use this method to create the NATION table in the LABDB database with the following columns and data types:
Column Name | Data Type    | Constraint
------------+--------------+-----------
N_NATIONKEY | INTEGER      | NOT NULL
N_NAME      | CHAR(25)     | NOT NULL
N_REGIONKEY | INTEGER      | NOT NULL
N_COMMENT   | VARCHAR(152) |
6. The statement can be found in the nation.ddl file under the /labs/databaseAdministration directory. To read and execute commands from a file use the \i <file> internal slash option:
LABDB(LABADMIN)=> \i /labs/databaseAdministration/nation.ddl
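For reference, nation.ddl presumably contains a statement along these lines, reconstructed from the column table above and from the random distribution that \d nation reports later (the actual file may differ in detail):

```sql
CREATE TABLE nation
(
  n_nationkey INTEGER      NOT NULL,
  n_name      CHAR(25)     NOT NULL,
  n_regionkey INTEGER      NOT NULL,
  n_comment   VARCHAR(152)
) DISTRIBUTE ON RANDOM;
```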
7. LABDB(LABADMIN)=> \dt
You will now see a list containing the two tables you created:
List of relations
  Name  | Type  |  Owner
--------+-------+----------
 NATION | TABLE | LABADMIN
 REGION | TABLE | LABADMIN
(2 rows)
8. LABDB(LABADMIN)=> \d nation
This will show the following results:
Table "NATION"
  Attribute  |          Type          | Modifier | Default Value
-------------+------------------------+----------+---------------
 N_NATIONKEY | INTEGER                | NOT NULL |
 N_NAME      | CHARACTER(25)          | NOT NULL |
 N_REGIONKEY | INTEGER                | NOT NULL |
 N_COMMENT   | CHARACTER VARYING(152) |          |
Distributed on random: (round-robin)
Distributed on random is the distribution method used; in this case the rows of the NATION table are distributed in round-robin fashion. This concept will be discussed separately in the Data Distribution presentation and lab.
It is possible to continue using the LABADMIN user to perform DML queries, since it is the owner of the database and holds all privileges on all of the objects in the database. However, the LABUSER and DBUSER users will be used to perform DML queries against the tables in the database.
3.4
We will now use the LABUSER user to populate both the REGION and NATION tables. This user has full data manipulation language (DML) privileges in the database, but no data definition language (DDL) privileges; only the LABADMIN user has full DDL privileges in the database. Later in this course more efficient methods to populate tables with data are discussed. The DBUSER user will also be used to read data from the tables, but it cannot insert data into the tables since it has limited DML privileges in the database.
1. Connect to the LABDB database as the LABUSER user using the internal slash option \c:
2. First check which tables exist in the LABDB database using the \dt internal slash option:
LABDB(LABUSER)=> \dt
You should see the following list:
List of relations
  Name  | Type  |  Owner
--------+-------+----------
 NATION | TABLE | LABADMIN
 REGION | TABLE | LABADMIN
(2 rows)
Remember that the LABUSER user is a member of the LUGRP group which was granted LIST privileges on the tables in the
LABDB database. This is the reason why it can list and view the tables in the LABDB database. If it did not have this privilege
it would not be able to see any of the tables in the LABDB database.
3. The LABUSER user was created to perform DML operations against the tables in the LABDB database, but it was restricted from performing DDL operations against the database. Let's see what happens when you try to create a new table, T1, with one column, C1, using the INTEGER data type:
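The attempted statement did not survive in this copy; it would be the following, and since LABUSER lacks CREATE TABLE privilege it should fail with a permission error:

```sql
CREATE TABLE t1 (c1 INTEGER);
```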
4. Let's continue by performing DML operations that the LABUSER user is allowed to perform against the tables in the LABDB database. Insert a new row into the REGION table:
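The insert itself is missing from this copy; a row matching the REGION columns would look like this (the key and values here are illustrative placeholders, not necessarily the lab's exact data):

```sql
INSERT INTO region VALUES (1, 'NA', 'North America');
```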
5. Issue a SELECT statement against the REGION table to check the new row you just added:
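The statement was lost here; checking the table contents is simply:

```sql
SELECT * FROM region;
```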
6. Instead of typing DML statements at the nzsql command line, you can read and execute statements from a file. You will use this method to add the following three rows to the REGION table:
R_REGIONKEY | R_NAME | R_COMMENT
------------+--------+---------------
            | SA     | South America
            | EMEA   |
            | AP     | Asia Pacific
LABDB(LABUSER)=> \i /labs/databaseAdministration/region.dml
You will see the following result. You can see from the output that the SQL script contained three INSERT statements.
LABDB(LABUSER)=> \i /labs/databaseAdministration/region.dml
INSERT 0 1
INSERT 0 1
INSERT 0 1
7. You will load data into the NATION table using an external table with the following command:
8. Now you will switch over to the DBUSER user, which only has SELECT privilege on the tables in the LABDB database. This privilege is granted because it is a member of the LUGRP group. Use the internal slash option \c <database name> <user> <password> to connect to the LABDB database as the DBUSER user:
You will notice that the user name in the command prompt changes from LABUSER to DBUSER.
9. Before trying to view rows from tables in the LABDB database, try to add a new row to the REGION table:
11. Finally, let's run a slightly more complex query. We want to return all nation names in Asia Pacific, together with their region name. To do this you need to execute a simple join of the NATION and REGION tables. The join key will be the region key, and to restrict results to the AP region you need to add a WHERE condition:
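The query text was lost in this extract; a join matching the description would be roughly as follows (assuming the AP region is identified by its R_NAME value):

```sql
-- nation names in the AP region, joined to the region name via the region key
SELECT n.n_name, r.r_name
FROM nation n
JOIN region r ON n.n_regionkey = r.r_regionkey
WHERE r.r_name = 'AP';
```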
Congratulations, you have completed the lab. You have successfully created the lab database, two tables, and database users and user groups with various privileges. You also ran a couple of simple queries. In further labs you will continue to use this database by creating the full schema.
Data Distribution
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
2 Skew
2.1 Data Skew
3 Co-Location
3.1 Investigation
3.2 Co-Located Joins
4 Schema Creation
4.1 Investigation
4.2 Solution
1 Introduction
IBM PureData System is a family of data-warehousing appliances that combine high performance with low administrative effort. Due to the unique data-warehousing-centric architecture of PureData System, most performance tuning tasks are either unnecessary or automated. Unlike typical data warehousing solutions, no tablespaces need to be created or tuned, and there are no indexes, buffer pools, or partitions.
Since PureData System is built on a massively parallel architecture that distributes data and workloads over a large number of
processing and data nodes, the single most important tuning factor is picking the right distribution key. The distribution key
governs which data rows of a table are distributed to which data slice and it is very important to pick an optimal distribution key
to avoid data skew, processing skew and to make joins co-located whenever possible.
1.1
Objectives
In this lab we will cover a typical scenario in a POC or customer engagement which involves an existing data warehouse for
customer transactions.
2 Skew
Tables in PureData System are distributed across data slices based on the distribution method and key. If a bad distribution
method has been picked, it will result in skewed tables or processing skew. Data skew occurs when the distribution method puts
significantly more records of a table on one data slice than on the others. Apart from bad performance, this also means that
the PureData System can hold significantly less data than expected.
Processing skew occurs when query processing mainly takes place on a few data slices, for example because queries only
apply to data on those data slices. Both types of skew result in suboptimal performance, since in a parallel system the slowest
node defines the total execution time.
2.1
Data Skew
The first table we will create is LINEITEM, the main fact table of the schema. It contains roughly 6 million rows.
1.
Connect to the Netezza image using PuTTY. Log in to 192.168.239.2 as user nz with password nz. 192.168.239.2 is
the default IP address for a local VM and is used for most bootcamp environments. In cases where the images
are hosted remotely, the instructors will provide the host IPs, which vary between machines.
2.
If you are continuing from the previous lab and are already connected to NZSQL quit the NZSQL console with the \q
command.
3.
To create the LINEITEM table, switch to the lab directory /labs/dataDistribution. To do this use the following command
(notice that you can use bash auto-completion via the Tab key to complete folder and file names):
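The directory change the step refers to:

```
cd /labs/dataDistribution
```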
4.
Create the LINEITEM table by using the following script. Since the fact table is quite large, this can take a couple of
minutes.
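The load script itself is not reproduced in this extract; based on the later step that reuses it, it is presumably invoked as:

```
./create_lineitem_1.sh
```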
You should see a result similar to the following. The error message at the beginning is expected, since the script tries to
clean up existing LINEITEM tables:
5.
Now let's have a look at the created table. Open the nzsql console by entering the command: nzsql
The welcome banner lists the basic console commands and leaves you at the system prompt:
    \h for help with SQL commands
    \? for help on internal slash commands
    \g or terminate with semicolon to execute query
    \q to quit
SYSTEM(ADMIN)=>
6.
Connect to the database LABDB as user LABADMIN by typing the following command:
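Inside nzsql the connect command looks like this (the same \c command is quoted again later in this lab):

```
\c LABDB LABADMIN
```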
7.
Let's have a look at the table we just created. First we want to see a description of its columns and distribution key. Use
the nzsql describe command \d LINEITEM to get a description of the table. This should produce the following result:
LABDB(LABADMIN)=> \d LINEITEM
                    Table "LINEITEM"
    Attribute    |         Type          | Modifier | Default Value
-----------------+-----------------------+----------+---------------
 L_ORDERKEY      | INTEGER               | NOT NULL |
 L_PARTKEY       | INTEGER               | NOT NULL |
 L_SUPPKEY       | INTEGER               | NOT NULL |
 L_LINENUMBER    | INTEGER               | NOT NULL |
 L_QUANTITY      | NUMERIC(15,2)         | NOT NULL |
 L_EXTENDEDPRICE | NUMERIC(15,2)         | NOT NULL |
 L_DISCOUNT      | NUMERIC(15,2)         | NOT NULL |
 L_TAX           | NUMERIC(15,2)         | NOT NULL |
 L_RETURNFLAG    | CHARACTER(1)          | NOT NULL |
 L_LINESTATUS    | CHARACTER(1)          | NOT NULL |
 L_SHIPDATE      | DATE                  | NOT NULL |
 L_COMMITDATE    | DATE                  | NOT NULL |
 L_RECEIPTDATE   | DATE                  | NOT NULL |
 L_SHIPINSTRUCT  | CHARACTER(25)         | NOT NULL |
 L_SHIPMODE      | CHARACTER(10)         | NOT NULL |
 L_COMMENT       | CHARACTER VARYING(44) | NOT NULL |
Distributed on hash: "L_LINESTATUS"
We can see that the LINEITEM table has 16 columns with different data types. Some of the columns have a KEY suffix and
contain substrings with the names of other tables; they are most likely foreign keys referencing dimension tables. The distribution key is
L_LINESTATUS, which has the CHAR(1) data type.
8.
Now let's have a look at the data in the table. To return a limited number of rows you can use the LIMIT keyword in your
SELECT queries. Execute the following SELECT command to return 10 rows of the LINEITEM table. For readability we only
select a couple of columns, including the order key, the ship date and the L_LINESTATUS distribution key:
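A sketch of such a query, using the columns named in the step:

```sql
-- Return only 10 rows for a quick look at the data
SELECT L_ORDERKEY, L_SHIPDATE, L_LINESTATUS
FROM LINEITEM
LIMIT 10;
```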
We will now verify the number of distinct values in the L_LINESTATUS column with a SELECT DISTINCT call. To
return a list of all values that are in the L_LINESTATUS column execute the following SQL command:
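The distinct-value check might look like the following:

```sql
-- List every value of the current distribution key column
SELECT DISTINCT L_LINESTATUS FROM LINEITEM;
```

With only a handful of distinct values, hash distribution can use at most that many data slices, which explains the skew.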
2.2
Processing Skew
Even in tables that are distributed evenly across data slices, data processing for queries can be concentrated, or skewed, on a
limited number of data slices. This can happen because PureData System is able to ignore data extents (sets of data pages) that
do not match a given WHERE condition. We will cover the mechanism behind this in the zone map chapter.
1.
First we will pick a new distribution key. As we have seen, it should have a large number of distinct values. One of the
columns that fits this description is the L_SHIPDATE column. Check the number of distinct values in the ship date
column with a COUNT(DISTINCT) statement:
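A sketch of the check:

```sql
-- A high distinct count suggests a good hash distribution key
SELECT COUNT(DISTINCT L_SHIPDATE) FROM LINEITEM;
```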
2.
Now let's reload the LINEITEM table with the new distribution key. For this we need to change the SQL of the loading
script we executed at the beginning of the lab. Exit the nzsql console by entering: \q
3.
You should now be in the lab directory /labs/dataDistribution. The table creation statement is situated in the lineitem.sql
file. We will need to make changes to the file with a text editor. Open the file with the default Linux text editor vi. To do
this enter the following command:
vi lineitem.sql
4.
The vi editor has two modes: a command mode, used to save files, quit the editor and so on, and an insert mode. Initially you
will be in the command mode. To change the file you need to switch into the insert mode by pressing i. The editor will
show an INSERT marker at the bottom of the screen.
5.
You can now use the cursor keys to navigate to the DISTRIBUTE ON clause at the bottom of the create command.
Change the distribution key to l_shipdate. The editor should now look like the following:
6.
We will now save our changes. Press Esc to switch back into command mode. You should see that the INSERT
marker at the bottom of the screen vanishes. Enter :wq! and press Enter to write the file and quit the editor without
any questions. If you made a mistake while editing and would like to undo it, press Esc, enter :q! and go back to step
3.
7.
Recreate and load the LINEITEM table with your new distribution key by executing
the ./create_lineitem_1.sh command.
8.
Use the nzsql command to enter the command console and switch to the LABDB database by using the
\c LABDB LABADMIN command. Now we verify that the new distribution key results in a good data distribution. For this
we will repeat the query which returns the number of rows for each data slice of the LINEITEM table. Execute the
following command:
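The per-data-slice row count can be obtained with the DATASLICEID pseudo-column; a sketch of the query the step refers to:

```sql
-- Rows per data slice; roughly equal counts mean no data skew
SELECT DATASLICEID, COUNT(*) AS NUM_ROWS
FROM LINEITEM
GROUP BY DATASLICEID
ORDER BY DATASLICEID;
```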
We can see that the data distribution is much better now. All four data slices have a roughly equal amount of rows.
9.
Now that we have a database table with a good data distribution, let's look at a couple of queries we have received from
the customer. The following query is executed regularly by the customer. It returns the average quantity shipped on a
given day, grouped by the shipping mode. Execute the following query:
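Based on the description that follows, the customer query presumably resembles:

```sql
-- Average shipped quantity per ship mode for a single day
SELECT L_SHIPMODE, AVG(L_QUANTITY) AS AVG_QTY
FROM LINEITEM
WHERE L_SHIPDATE = '1996-03-29'
GROUP BY L_SHIPMODE;
```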
This query takes all rows from the 29th of March 1996 and computes the average value of the L_QUANTITY column for
each L_SHIPMODE value. It is a typical warehousing query insofar as a date column is used to restrict the row set that is
taken as input for the computation.
In this example most rows of the LINEITEM table will be filtered away, only rows that have the specified date will be used as
input for computation of the AVG aggregation.
10. Execute the following SQL statement to see on which data slices we can find the rows from the 29th of March 1996:
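A sketch of that check, reusing the DATASLICEID pseudo-column:

```sql
-- Where do the rows for this ship date live?
SELECT DATASLICEID, COUNT(*) AS NUM_ROWS
FROM LINEITEM
WHERE L_SHIPDATE = '1996-03-29'
GROUP BY DATASLICEID;
```

With L_SHIPDATE as the distribution key, all rows for a single date hash to the same data slice, so only one slice does the work for this query: processing skew.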
3 Co-Location
The most basic warehouse schema consists of a fact table containing a list of all business transactions, and a set of dimension
tables that contain the different actors, objects, locations and time points that have taken part in these transactions. This means
that most queries will not only access one database table but will require joins between tables.
In a PureData System database, tables are distributed over a potentially large number of data slices on different SPUs. This
means that during a join of two tables there are two possibilities:
- Rows of the two tables that belong together are situated on the same data slice, which means that they are co-located
and can be joined locally.
- Rows that belong together are situated on different data slices, which means that tables need to be redistributed.
3.1
Investigation
Obviously co-location has big performance advantages. In the following section we will demonstrate this by introducing a
second table, ORDERS.
1.
Switch to the Linux command line, if you are in the NZSQL console. Do this with the \q command.
2.
Switch to the data distribution lab directory with the command cd /labs/dataDistribution
3.
Create and load the ORDERS table by executing the following command: ./create_orders_1.sh
4.
Enter the NZSQL console with the nzsql labdb labadmin command
5.
Let's take a look at the ORDERS table with the \d orders command. You should see the following results.
LABDB(LABADMIN)=> \d orders
                    Table "ORDERS"
    Attribute    |         Type          | Modifier | Default Value
-----------------+-----------------------+----------+---------------
 O_ORDERKEY      | INTEGER               | NOT NULL |
 O_CUSTKEY       | INTEGER               | NOT NULL |
 O_ORDERSTATUS   | CHARACTER(1)          | NOT NULL |
 O_TOTALPRICE    | NUMERIC(15,2)         | NOT NULL |
 O_ORDERDATE     | DATE                  | NOT NULL |
 O_ORDERPRIORITY | CHARACTER(15)         | NOT NULL |
 O_CLERK         | CHARACTER(15)         | NOT NULL |
 O_SHIPPRIORITY  | INTEGER               | NOT NULL |
 O_COMMENT       | CHARACTER VARYING(79) | NOT NULL |
Distributed on random: (round-robin)
The ORDERS table has a key column O_ORDERKEY that is most likely the primary key of the table. It contains information on
the order value, priority and date, and has been distributed on random. This means that PureData System doesn't use a
hash-based algorithm to distribute the data; instead, rows are distributed randomly across the available data slices.
You can check the data distribution of the table using the methods we used before for the LINEITEM table. The data
distribution will be perfect. There will also not be any processing skew for queries on this single table, since with a random
distribution there can be no correlation between any WHERE condition and the distribution key.
6.
We have received another typical query from our customer. It returns the average total price and item quantity of all
orders grouped by the shipping priority. This query has to join together the LINEITEM and ORDERS tables to get the
total order cost from the orders table and the quantity for each shipped item from the LINEITEM table. The tables are
joined with an inner join on the L_ORDERKEY column. Execute the following query and note the approximate execution
time:
Notice that the query takes about a minute to complete on our machine. The actual execution times on your machine will be
different.
7.
Remember that the ORDERS table was distributed randomly and the LINEITEM table is still distributed by the
L_SHIPDATE column. The join, on the other hand, is taking place on the L_ORDERKEY and O_ORDERKEY columns.
We will now have a quick look at what is happening inside PureData System in this scenario. To do this we use the
PureData System EXPLAIN function. It will be covered more thoroughly in the Query Optimization lab.
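To see the plan, the same query can be prefixed with the EXPLAIN keyword; a hedged sketch (EXPLAIN VERBOSE is one of the Netezza explain variants, and the query text is reconstructed from the description above):

```sql
EXPLAIN VERBOSE
SELECT O_SHIPPRIORITY, AVG(O_TOTALPRICE), AVG(L_QUANTITY)
FROM ORDERS
JOIN LINEITEM ON O_ORDERKEY = L_ORDERKEY
GROUP BY O_SHIPPRIORITY;
```

In the plan output, look for redistribute or broadcast steps: with ORDERS distributed randomly and LINEITEM on L_SHIPDATE, both tables must be redistributed on the order key before the join.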
3.2
Co-Located Joins
In the last section we saw that a query using joins can result in costly data redistribution during join execution when the
joined tables are not distributed on the join key. In this section we will reload the tables distributed on the mutual join key to
enhance performance during joins.
Edit the lineitem.sql script to change the distribution key once more:
a.
Open the file with the vi editor by executing the command: vi lineitem.sql
b.
Switch to INSERT mode by pressing i
c.
Navigate with the cursor keys to the DISTRIBUTE ON clause and change it to DISTRIBUTE ON
(L_ORDERKEY)
d.
Exit the INSERT mode by pressing ESC
e.
Enter :wq! in the command line of the vi editor and press Enter. Before pressing Enter your screen should
look like the following:
5.
Recreate and load the LINEITEM table with the distribution key L_ORDERKEY by executing the
command: ./create_lineitem_1.sh
6.
Recreate and load the ORDERS table with the distribution key O_ORDERKEY by executing the
command: ./create_orders_1.sh
7.
Enter the NZSQL console by executing the following command: nzsql labdb labadmin
8.
Repeat executing the explain of our join query from the previous section by executing the following command:
The query should return the same results as in the previous section but run much faster, even in the VMware environment.
In a real PureData System appliance with 6, 12 or more SPUs the difference would be much more significant.
You have now loaded the LINEITEM and ORDERS tables into your PureData System appliance using the distribution
keys that are optimal for these tables in most situations.
a. Both tables are distributed evenly across data slices, so there is no data skew.
IBM PureData System for Analytics
Copyright IBM Corp. 2012. All rights reserved
b.
The distribution key is highly unlikely to result in processing skew, since most WHERE conditions will restrict a
key column evenly.
c.
Since ORDERS is a parent table of LINEITEM, with a foreign key relationship between them, most queries
joining them together will use the join key. These queries will be co-located.
Finally, we will pick the distribution keys for the full schema.
4 Schema Creation
Now that we have created the ORDERS and LINEITEM tables we need to pick the distribution keys for the remaining tables as
well.
4.1
Investigation
Table    | Number of Rows | Primary Key
---------+----------------+-------------
REGION   |              5 | R_REGIONKEY
NATION   |             25 | N_NATIONKEY
CUSTOMER |         150000 | C_CUSTKEY
ORDERS   |        1500000 | O_ORDERKEY
SUPPLIER |          10000 | S_SUPPKEY
PART     |         200000 | P_PARTKEY
PARTSUPP |         800000 | --
LINEITEM |        6000000 | --
Parent Table | Child Table | Parent Key  | Child Foreign Key
-------------+-------------+-------------+------------------
REGION       | NATION      | R_REGIONKEY | N_REGIONKEY
NATION       | CUSTOMER    | N_NATIONKEY | C_NATIONKEY
NATION       | SUPPLIER    | N_NATIONKEY | S_NATIONKEY
CUSTOMER     | ORDERS      | C_CUSTKEY   | O_CUSTKEY
ORDERS       | LINEITEM    | O_ORDERKEY  | L_ORDERKEY
SUPPLIER     | LINEITEM    | S_SUPPKEY   | L_SUPPKEY
SUPPLIER     | PARTSUPP    | S_SUPPKEY   | PS_SUPPKEY
PART         | LINEITEM    | P_PARTKEY   | L_PARTKEY
PART         | PARTSUPP    | P_PARTKEY   | PS_PARTKEY
Given all that you have heard in the presentation and lab, try to fill in the distribution keys in the chart below. Let's assume that we will
not change the distribution keys for LINEITEM and ORDERS anymore.
Table    | Distribution Key
---------+-----------------
REGION   |
NATION   |
CUSTOMER |
SUPPLIER |
PART     |
PARTSUPP |
ORDERS   | O_ORDERKEY
LINEITEM | L_ORDERKEY
4.2
Solution
It is important to note that there is no universally optimal way to pick distribution keys. It always depends on the queries that run against the
database. Without these queries it is only possible to follow some general rules:
- Co-location between big tables (especially if a fact table is involved) is more important than between small tables.
- Very small tables can be broadcast by the system with little performance penalty. If one table of a join is broadcast, the
other will not need to be redistributed.
- If you suspect that there will be many queries joining two big tables but you cannot distribute both of them on the
expected join key, distributing one table on the join key is better than nothing, since it will lead to a single redistribute
instead of a double redistribute.
If we break down the problem, we can see that PART and PARTSUPP are the biggest two of the remaining tables, and we have
already distributed the LINEITEM table on the order key based on the available customer queries, so it makes sense to
distribute PART and PARTSUPP on their mutual join keys.
CUSTOMER is big as well and has two relationships. The first relationship is with the very small NATION table, which is easily
broadcast by the system. The second relationship is with the ORDERS table, which is big as well but already distributed by the
order key. As mentioned above, a single redistribute is better than a double redistribute. Therefore it makes sense to
distribute the CUSTOMER table on the customer key, which is also the join key of this relationship.
The situation is very similar for the SUPPLIER table. It has two very large child tables, PARTSUPP and LINEITEM, which are
both related to it through the supplier key, so it should be distributed on this key.
NATION and REGION are both very small and will most likely be broadcast by the optimizer. You could distribute those
tables randomly, on their primary keys, or on their join keys. In this case we have decided to distribute both on their primary keys,
but there is no definitively right or wrong approach. One possible solution for the distribution keys is the following.
Table    | Distribution Key
---------+-----------------
REGION   | R_REGIONKEY
NATION   | N_NATIONKEY
CUSTOMER | C_CUSTKEY
SUPPLIER | S_SUPPKEY
PART     | P_PARTKEY
PARTSUPP | PS_PARTKEY
ORDERS   | O_ORDERKEY
LINEITEM | L_ORDERKEY
You should still be connected to the LABDB database. We now need to recreate the NATION and REGION tables with
a new distribution key. To drop the old versions execute the following command:
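The drop statements are not shown in this extract; they presumably look like:

```sql
-- Remove the old versions before recreating them with new keys
DROP TABLE NATION;
DROP TABLE REGION;
```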
3.
Verify the SQL script creating the remaining six tables with the command: more remaining_tables.sql
4.
You will see the SQL script used for creating the remaining tables with the distribution keys mentioned above. Press the
Enter key to scroll down until you reach the end of the file.
5.
Create the remaining tables and load the data into them with the following command: ./create_remaining.sh
You should see the following results. The error message at the top is expected, since the script tries to clean up any old
tables of the same name in case a reload is necessary.
Congratulations! You have just defined data distribution keys for a customer data schema in PureData System. You can
have a look at the created tables and their definitions with the commands you used in the previous chapters. We will
continue to use the tables we created in the following labs.
NzAdmin
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology
Table of Contents
1 Introduction
2 Installing NzAdmin
3 The System Tab
4 The Database Tab
5 Tools
5.1 Workload Management
5.2 Table Storage
5.3 Table Skew
1 Introduction
In this lab we will explore the features of the IBM PureData System administration GUI tool, NzAdmin. NzAdmin is a Windows-based
application that allows users to manage the system, obtain hardware information and status, and manage various aspects
of user databases, tables, and objects. NzAdmin consists of two distinct environments: the System tab and the Database tab.
We will look at both. When you click either tab, the system displays the tree view on the left side and the data view on the right
side.
The VMware image we are using in the labs differs significantly from a normal PureData System appliance. There is
only one virtualized SPA and SPU, only 4 data slices and no data slice mirroring. In addition, some NzAdmin
functions do not work with the VM; for example, the SPU and SPA sections are blank and the data distribution of a
table cannot be displayed. Nevertheless most functionality works and should provide a good overview.
2 Installing NzAdmin
NzAdmin is part of the PureData System client tools for Windows. It can be installed with a standard Windows installer and
doesn't require the JDBC or ODBC drivers to be installed, since it contains its own connection libraries.
1.
2.
Install the NzAdmin client by double-clicking on the Installer and accepting all standard settings.
3.
You can start NzAdmin from the Windows Start Menu: Programs -> IBM PureData System -> IBM PureData System
Administrator.
4.
Connect to your PureData System host with the IP address taken from the VM where the PureData System host is running
(you can use ifconfig eth0 in the Linux terminal window). In our lab the IP address is 192.168.239.2, the username is admin,
and the password is password.
5.
The Admin client has a navigation menu on the left with two views: System and Database.
The System view contains information about the general status of the PureData System hardware and the PureData
System Performance Server software. It displays system information and provides information about possible system
problems like a hard disk failure. It also contains statistics like the user space usage.
The Database view contains information about the user databases in the system. It displays information about all database
objects like tables, views, sequences, synonyms, user-defined functions, procedures, etc. It also provides the user with the
tools necessary to manage groups and access privileges. You can also view the currently active database sessions and their
queries, and a recent history of all queries that have been run on the system. Finally you can see the backup history for
each database.
The menu bar contains common actions like refresh or connect. In addition, it provides access to some
administration tools, like Workload Management and a tool for the identification of table skew.
3 The System Tab
1.
The default view is the main hardware view, which shows a dashboard representing the general health of the system.
Unfortunately the hardware information cannot be gathered for our VM, but we can see the disk usage at the bottom. Note
that the most important measure is actually the maximum storage utilization: if one disk runs full, no new data can be
added to the system.
2.
Unfortunately the SPA and SPU sections are empty for our VM system; normally we could see health information for all
snippet processing arrays, snippet processing units and their data slices and hard disks. The next available section is
data slices. When you select it, you can see that our VM has 4 data slices 1-4 on four hard disks 1002-1005. Normally
we would also see which disk contains the mirrors of these disks, but our VM system doesn't mirror its data slices.
3.
Under the data slice section you can see the currently active event rules. Event rules monitor the system for noteworthy
occurrences and act accordingly, for example by sending an email to an administrator or raising an alert in case of a
hardware failure. Unlike on a real PureData System appliance, only a very small set of event rules
is enabled. You could use the New Event Rule wizard to add new events or generate test events.
4 The Database Tab
1.
Switch tabs to the Database tab. This is the area where database tables, users, groups and sessions can be viewed
and managed. You may not have some of the database objects displayed in the image; this shouldn't change the lab in any way.
2.
In the Database tab, expand Databases and click on LABDB. NzAdmin can view all the objects of the following types:
tables, views, sequences, synonyms, functions, aggregates, procedures, and libraries. You can also create objects of
the following types: tables, views, sequences, and synonyms. Furthermore, many of these object types can be
managed in some way using NzAdmin - for example we have control over tables in NzAdmin. Finally we can see the
currently running Groom processes in the Groom Status section.
3.
Click on Tables in the tree or data view. For each table in the LABDB database you can view information such as the
owner, creation date, number of columns, size of table on disk, data skew, row count, and percentage organized if
enabled.
4.
If you right-click on a table you can select ways in which to manage the table, including changing the name, owner,
columns, organizing keys, or default value; generating or viewing statistics; viewing record distribution; and truncating and/or
dropping the table.
Unfortunately one of the most important menu entries, Record Distribution, which gives you a graphical view of
the data distribution of the table, doesn't work in our VMware environment.
5.
To look at information about the columns, distribution key, organizing keys, DDL, and users/groups with privileges for a
table, double-click on the table entry to bring up the details:
This view shows the columns of the table and their constraints. It also shows whether the columns are zone-map enabled;
zone maps are an important performance feature in PureData System and will be discussed later in this course.
You can set access privileges on the table with the Privileges button. The DDL button returns the command to create
the table and is a convenient feature for administrators.
6.
Close the Table Properties view again and click on the Users field of the left navigation tree. Here you can create and
manage users.
Users can either be created from a context menu on the Users folder or from the Options menu at the top of the screen.
To manage users, use the context menu that is displayed when you right-click on the user you want to manage.
NzAdmin allows you to rename or drop users, change their privileges and workload management settings, etc.
7.
You can do the same management for groups in the Groups section of the Database tab.
8.
Click on Sessions in the Database tab. Here you can see who is currently logged in to the PureData System and the
commands they have issued. You can also abort sessions or change their priority in a workload management
environment (this has to be set up before you can change the priority).
9.
To see and manage active queries you can expand Sessions in the Database tab and click on Active Queries, however,
there are no queries running at this time.
10. Comprehensive query history information can be seen by clicking on the Query History section in the Database tab.
PureData System keeps a query history of the last 2000 queries. For a full audit log you would need to use the query
history database. Select the View Query Entry menu item from the context menu to get a more structured view of the
values of a specific query:
11. Another window is opened showing the fields of the query history table in a more structured way:
PureData System saves a significant amount of information for each recent query, which can help you identify
queries that behave badly. Important values are the estimated and actual elapsed seconds, the result rows and of course the
actual SQL statement.
12. It is also possible to get information about the actual execution plan of the query. We will discuss this in more detail in
future modules. To see a graphical representation of an Explain output right-click on a query and select Explain->Graph:
13. You should see a similar window, the actual graph may differ depending on the query you pick:
This graph shows how the PureData System appliance plans to execute the query in question. It is an important aid in
troubleshooting the occasional misbehaving query. We will discuss this in more detail in the Query Optimization
module. You can also get a textual view by selecting Explain -> Verbose.
14. Close the graph and display the plan file in the context menu with the View Plan File entry. You should see a similar
window to the following. Please scroll down to the bottom:
Plan files look similar to Verbose Explain output, but there is a significant difference: Explain output tells you
how the system plans to execute a query, while plan files add information on how the query was actually executed,
including actual execution times, and are an invaluable help for debugging queries that failed or took longer than
expected.
15. Finally select the Backup -> History field in the navigator.
PureData System logs all backup and restore sessions in the system. You will see an empty list, but if you return to this
view after the Backup and Restore lab you will see the backup and restore processes you started. The backup history
allows PureData System to provide easy incremental or cumulative backups and to synchronize backups with the
groom process; we will discuss more about that in a later section.
5 Tools
In this section we will learn how to set workload management system settings with NzAdmin, search for data skew,
and view disk usage by database or user.
5.1
Workload Management
1.
From the menu bar at the top of NzAdmin click on Tools -> Workload Management -> System Settings. Using this tool
we can limit the maximum number of rows allowed in a table, enable query timeouts, set session idle timeouts, and set the
default session priority.
2.
From the Workload Management menu option, click on Performance Summary.
3.
From the Summary pane we can look at activities that happened in the last hour in an aggregate view. Keep this in
mind for the Workload Management module.
5.2
Table Storage
1.
From the menu bar at the top of NzAdmin click on Tools -> Table Storage. This tool tells us the total size
in MB of each database, or the total size of all the databases a user owns.
5.3
Table Skew
1.
From the menu bar at the top of NzAdmin click on Tools -> Table Skew. This tool displays tables that meet or exceed a
specified data skew threshold between data slices.
Once an administrator has seen in the main overview that the maximum storage utilization differs significantly from the
average, he can use this tool to find the skewed tables. He can then fix them, for example by redistributing them with a
CTAS (CREATE TABLE AS SELECT). Skewed tables not only limit the available storage but also significantly lower performance.
Loading and Unloading Data
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
2 External Tables
2.1
2.2
2.3
3.2
3.3
1 Introduction
In every data warehouse environment there is a need to load new data into the database. Loading data into the
database is not a one-time operation but rather a continuous one that can occur hourly, daily, weekly, or even monthly.
The loading of data into a database is a vital operation that needs to be supported by the data warehouse system. IBM
PureData System provides a framework to support not only the loading of data into the PureData System database environment
but also the unloading of data from it. This framework contains more than one component; some of
these components are:
External Tables: tables stored as flat files on the host or client systems and registered like tables in the
PureData System catalog. They can be used to load data into the PureData System appliance or unload data to the file
system.
nzload: a command-line wrapper tool around external tables that provides an easy method of loading data into
the PureData System appliance.
Format Options: options for formatting the data load to and from external tables.
1.1
Objectives
This lab will help you explore the IBM PureData System framework components for loading and unloading data from the
database. You will use various commands to create external tables to unload and load data. Then you will get a basic
understanding of the nzload utility. In this lab the REGION and NATION tables in the LABDB database are used to illustrate the
use of external tables and the nzload utility. After this lab you will have a good understanding of how to load and unload data
from a PureData System database environment.
The first part of this lab will explore using External Tables to unload and load data.
The second part of this lab will discuss using the nzload utility to load records into tables.
2 External Tables
An external table allows PureData System to treat an external file as a database table. An external table has a definition (a table
schema) in the PureData System system catalog but the actual data exists outside of the PureData System appliance database.
This external file is referred to as a datasource file. External tables can be used to access files which are stored on the file system. After you
have created the external table definition, you can use INSERT INTO statements to load data from the external file into a
database table, or SELECT FROM statements to query the external table. Different methods are described to create and use
external tables using the nzsql interface. Along with this the external datasource files for the external tables are examined, so a
second session will be used to help view these files.
I. Connect to your PureData System image using PuTTY. Login to 192.168.239.2 as user nz with password nz.
(192.168.239.2 is the default IP address for a local VM, the IP may be different for your Bootcamp)
II. Change to the lab working directory /labs/movingData with the following command
cd /labs/movingData
III. Connect to the LABDB database as the database owner, LABADMIN, using the nzsql interface. The welcome banner lists the most important internal commands:
Type:  \h for help with SQL commands
       \? for help on internal slash commands
       \g or terminate with semicolon to execute query
       \q to quit
LABDB(LABADMIN)=>
IV. In this lab we will need to alternate between executing SQL commands and operating system commands. To make it easier
for you, we will open a second PuTTY session for executing operating system commands like nzload, viewing generated
external files, etc. It will be referred to as session 2 throughout the lab.
The picture above shows the two PuTTY windows that you will need. Session 1 will be used for SQL commands and
session 2 for operating system prompt commands.
V. Open another session using PuTTY. Login to 192.168.239.2 as user nz with password nz. (192.168.239.2 is the
default IP address for a local VM, the IP may be different for your Bootcamp)
Also make sure that you change to the correct directory, /labs/movingData:
[nz@netezza ~] cd /labs/movingData
2.1
Unloading Data with External Tables
External tables will be used to unload rows from the LABDB database as records into an external datasource file. Various
methods to create and use external tables will be explored, unloading rows from either the REGION or NATION tables. Five
different use cases are presented for you to follow so that you can gain a better understanding of how to use external tables to
unload data from a database.
2.1.1
Unloading data with an External Table created with the SAMEAS clause
The first external table will be used to unload data from the REGION table into an ASCII delimited text file. This external table will
be named ET1_REGION using the same column definition as the REGION table. After the ET1_REGION external table is created
you will then use it to unload all the rows from the REGION table. The records for the ET1_REGION external table will be in the
external datasource file, et1_region_flat_file. The basic syntax to create this type of external table is:
The SAMEAS clause allows the external table to be created with the same column definition as the referenced table. This is
referred to as an implicit schema definition.
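The exact command was shown as a screenshot in the original; based on the description, a sketch of it (file name and path taken from this lab) is:

```sql
-- Implicit schema definition: copy the column layout of REGION.
-- DATAOBJECT names the external datasource file on the host.
CREATE EXTERNAL TABLE et1_region SAMEAS region
USING (DATAOBJECT ('/labs/movingData/et1_region_flat_file'));
```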
1.
As the LABDB database owner, LABADMIN, you will create the first basic external table using the same column
definitions as the REGION table:
2.
To list the external tables in the LABDB database you use the internal slash option, \dx:
LABDB(LABADMIN)=> \dx
Which will list the external table you just created:
        List of relations
    Name    |      Type      |  Owner
------------+----------------+----------
 ET1_REGION | EXTERNAL TABLE | LABADMIN
(1 row)
3.
You can also list the properties of the external table using the following internal slash option to describe the table, \d
<external table name> :
LABDB(LABADMIN)=> \d et1_region
Which will list the properties of the ET1_REGION external table:
[Output of \d et1_region: the REGION column definitions followed by the external table options, including the DataObject location (/labs/movingData/et1_region_flat_file), Delimiter '|', Format TEXT, Compress FALSE, LogDir /tmp, and MaxErrors 1.]
This output includes the columns and associated data types in the external table. You will notice that this is similar to the
REGION table since the external table was created using the SAMEAS clause in the CREATE EXTERNAL TABLE command.
The output also includes the properties of the external table. The most notable property is the DataObject property that
shows the location and the name of the external datasource file used for the external table. We will examine some of the
other properties in this lab.
4.
Now that the external table is created you can use it to unload data from the REGION table using an INSERT statement:
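The unload itself is an ordinary INSERT that selects from the database table into the external table; a sketch:

```sql
-- Writes the rows of REGION as records into et1_region_flat_file.
INSERT INTO et1_region SELECT * FROM region;
```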
5.
You can use this external table like a regular table by issuing SQL statements. Try issuing a simple SELECT FROM
statement against ET1_REGION external table:
 R_REGIONKEY | R_NAME |          R_COMMENT
-------------+--------+-----------------------------
           2 | sa     | south america
           1 | na     | north america
           4 | ap     | asia pacific
           3 | emea   | europe, middle east, africa
(4 rows)
You will notice that this is the same data that is in the REGION table. But the data retrieved for this SELECT statement was
from the datasource of this external table and not from the data within the database.
6.
The main reason for creating an external table is to unload data from a table to a file. Using the second putty session
review the file that was created, et1_region_flat_file, in the /labs/movingData directory:
2|sa|south america
1|na|north america
4|ap|asia pacific
3|emea|europe, middle east, africa
This is an ASCII delimited flat file containing the data from the REGION table. The column delimiter used in this file was the
default character |.
2.1.2 Unloading data with an External Table created with the AS clause
The second external table will also be used to unload data from the REGION table into an ASCII delimited text file using a
different method. The external table will be created and the data will be unloaded in the same create statement. So a separate
step is not required to unload the data. The external table will be named ET2_REGION and the external datasource file will be
named et2_region_flat_file. The basic syntax to create this type of external table is:
The AS clause allows the external table to be created with the same columns returned in the SELECT FROM statement, which is
referred to as implicit table schema definition. This also unloads the rows at the same time the external table is created.
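A sketch of the one-step form (quoting the datasource file directly after the table name is one documented variant; the exact option spelling may differ):

```sql
-- Implicit schema from the SELECT list; the rows are unloaded immediately.
CREATE EXTERNAL TABLE et2_region '/labs/movingData/et2_region_flat_file'
AS SELECT * FROM region;
```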
1.
The first method used to create an external table required the data to be unloaded in a second step using an INSERT
statement. Now you will create an external table and unload the data in a single step:
2. List the external tables again:
LABDB(LABADMIN)=> \dx
Which will list all the external tables in the LABDB database:
        List of relations
    Name    |      Type      |  Owner
------------+----------------+----------
 ET1_REGION | EXTERNAL TABLE | LABADMIN
 ET2_REGION | EXTERNAL TABLE | LABADMIN
(2 rows)
You will notice that there are now two external tables. You can also list the properties of the external table, but the output
will be similar to the output in the last section, except for the filename.
3.
Using the second session review the file that was created, et2_region_flat_file, in the /labs/movingData directory:
2|sa|south america
1|na|north america
4|ap|asia pacific
3|emea|europe, middle east, africa
This file is exactly the same as the file you reviewed in the last section. The difference this time is that we didn't need to
unload it explicitly.
2.1.3 Unloading data with an External Table using selected columns
1. You will create a new external table to include only the R_NAME and R_COMMENT columns, and exclude the
R_REGIONKEY column from the REGION table. Along with this you will change the delimiter string from the default | to
=:
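A sketch of the command (the column data types are assumptions in the TPC-H style; the lab's actual definitions may differ):

```sql
-- Explicit schema with only two columns and a non-default delimiter.
CREATE EXTERNAL TABLE et3_region (r_name CHAR(25), r_comment VARCHAR(152))
USING (DATAOBJECT ('/labs/movingData/et3_region_flat_file') DELIMITER '=');
```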
2. Describe the new external table:
LABDB(LABADMIN)=> \d et3_region
Which will list the properties of the ET3_REGION external table:
[Output of \d et3_region: only the R_NAME and R_COMMENT column definitions, followed by the external table options; the Delimiter is now '=' and the DataObject points to et3_region_flat_file.]
You will notice that there are only two columns for this external table since you only specified two columns when creating
the external table. The rest of the output is very similar to the properties of the other two external tables that you created,
with two main exceptions. The first difference is obviously the Dataobjects field, since the filename is different. The other
difference is the string used for the delimiter, since it is now = instead of the default, |.
3.
Now you will unload the data from the REGION table but only the data from columns R_NAME and R_COMMENT:
4.
Using the second session review the file that was created, et3_region_flat_file, in the /labs/movingData directory:
sa=south america
na=north america
ap=asia pacific
emea=europe, middle east, africa
You will notice that only two columns are present in the flat file using the = string as a delimiter.
2.1.4 (Optional) Unloading data with an External Table from two tables
The first three external tables unloaded data from one table. The next external table you will create will be based on using a
table join between the REGION and NATION table. The two tables will be joined on the REGIONKEY and only the N_NAME and
R_NAME columns will be defined for the external table. This exercise will illustrate how data can be unloaded using SQL
statements other than a simple SELECT FROM statement. The external table will be named ET_NATION_REGION using another
ASCII delimited text file named et_nation_region_flat_file.
1.
For the next external table you will unload data from both the REGION and NATION table joined on the REGIONKEY
column to list all of the countries and their associated regions. Instead of specifying the columns in the create external
table statement you will use the AS SELECT option:
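A sketch of the statement (the join column names n_regionkey and r_regionkey are assumed from the TPC-H-style schema used in this lab):

```sql
-- The external table's schema is implied by the SELECT list of the join.
CREATE EXTERNAL TABLE et_nation_region '/labs/movingData/et_nation_region_flat_file'
AS SELECT n_name, r_name
   FROM nation, region
   WHERE n_regionkey = r_regionkey;
```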
2. Describe the new external table:
LABDB(LABADMIN)=> \d et_nation_region
Which will show the properties of the ET_NATION_REGION table:
[Output of \d et_nation_region: the N_NAME and R_NAME column definitions, followed by the same external table options as the previous tables.]
You will notice that the external table was created using the two columns specified in the SELECT clause: N_NAME and
R_NAME.
3. Issue a SELECT statement against the ET_NATION_REGION external table, which returns:
        N_NAME         | R_NAME
-----------------------+--------
 brazil                | sa
 guyana                | sa
 venezuela             | sa
 portugal              | emea
 australia             | ap
 united kingdom        | emea
 united arab emirates  | emea
 south africa          | emea
 hong kong             | ap
 new zealand           | ap
 japan                 | ap
 macau                 | ap
 canada                | na
 united states         | na
(14 rows)
This is the result of joining the NATION and REGION tables on the REGIONKEY column to return just the N_NAME and
R_NAME columns.
4.
And now using the second session review the file that was created, et_nation_region_flat_file, in the /labs/movingData
directory:
brazil|sa
guyana|sa
venezuela|sa
portugal|emea
australia|ap
united kingdom|emea
united arab emirates|emea
south africa|emea
hong kong|ap
new zealand|ap
japan|ap
macau|ap
canada|na
united states|na
You can see that we created a delimited flat file from a complex SQL statement. External tables are a very flexible and
powerful way to load, unload, and transfer data.
2.1.5 (Optional) Unloading data with an External Table using the compress format
The previous external tables that you created used the default ASCII delimited text format. This last external table will be similar
to the second external table that you created, but instead of using an ASCII delimited text format you will use the
compressed binary format. The name of the external table will be ET4_REGION and the datasource file name will be
et4_region_compress. The basic syntax to create this type of external table is:
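A sketch of the statement, assuming the same one-step AS form as in section 2.1.2:

```sql
-- COMPRESS TRUE requires the internal (binary) format.
CREATE EXTERNAL TABLE et4_region '/labs/movingData/et4_region_compress'
USING (COMPRESS TRUE FORMAT 'internal')
AS SELECT * FROM region;
```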
1. You will now create one last external table using a method similar to the one you used to create the second external
table in section 2.1.2. But instead of using an ASCII delimited-text format, the datasource will be compressed. This is
achieved by using the COMPRESS and FORMAT external table options:
2. Describe the new external table:
LABDB(LABADMIN)=> \d et4_region
Which will list the properties of the ET4_REGION table:
[Output of \d et4_region: as before, but with Compress TRUE, Format INTERNAL, and the DataObject pointing to et4_region_compress.]
You will notice that the COMPRESS option has changed from FALSE to TRUE, indicating that the datasource file is
compressed, and the FORMAT has changed from TEXT to INTERNAL, which is required for compressed files.
2.2
Dropping External Tables
Dropping external tables is similar to dropping a regular PureData System table. The column definition for the external table is
removed from the PureData System catalog. Keep in mind that dropping the table doesn't delete the external datasource file, so
this also has to be maintained. So the external datasource file can still be used for loading data into a different table. In this
chapter you will drop the ET1_REGION table, but you will not delete the associated external datasource file, et1_region_flat_file.
This datasource file will be used later in this lab to load data into the REGION table.
1.
Drop the first external table that you created, ET1_REGION, using the DROP TABLE command
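The command is the same as for a regular table:

```sql
DROP TABLE et1_region;
```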
2.
Verify that the external table has been dropped using the internal slash option, \dx:
LABDB(LABADMIN)=> \dx
Which will list all the external tables in the LABDB database:
           List of relations
       Name       |      Type      |  Owner
------------------+----------------+----------
 ET2_REGION       | EXTERNAL TABLE | LABADMIN
 ET3_REGION       | EXTERNAL TABLE | LABADMIN
 ET4_REGION       | EXTERNAL TABLE | LABADMIN
 ET_NATION_REGION | EXTERNAL TABLE | LABADMIN
(4 rows)
In this list the four remaining external tables that you created still exist.
3.
Even though the external table definition no longer exists within the LABDB database, the flat file named
et1_region_flat_file still exists in the /labs/movingData directory. Verify this by using the second PuTTY session:
[nz@netezza movingData]$ ls
Which will list all of the files in the /labs/movingData directory:
et1_region_flat_file
et2_region_flat_file
et3_region_flat_file
et4_region_compress
et_nation_region_flat_file
You can see that the file et1_region_flat_file still exists. This file can still be used to load data into another similar table.
2.3
Loading Data with External Tables
External tables can also be used to load data into tables in the database. In this chapter data will be loaded into the REGION
table, so you will first have to remove the existing rows from the REGION table. The method to load data from external tables into
a table is similar to using the DML INSERT INTO and SELECT FROM statements. You will use two different methods to load
data into the REGION table, one using an external table and the other using the external datasource file directly. Loading data
into a table from any external table will have an associated log file with a default name of <table_name>.<database_name>.nzlog
1.
Before loading the data into the REGION table, delete the rows from the data using the TRUNCATE TABLE command:
2. Check to ensure that the table is empty using the SELECT * statement.
3. You will load data into the REGION table from the ET2_REGION external table using an INSERT statement:
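A sketch of the load, the reverse of the unload INSERT used earlier:

```sql
-- Reads records from et2_region_flat_file and inserts them into REGION.
INSERT INTO region SELECT * FROM et2_region;
```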
4.
Check to ensure that the table contains the four rows using the SELECT * statement.
5. Remove the rows from the REGION table again using the TRUNCATE TABLE command.
6. Check to ensure that the table is empty using the SELECT * statement.
7.
You will load data into the REGION table using the ASCII delimited file that was created for external table ET1_REGION.
Remember that the definition of the external table was removed from that database, but the external data source file,
et1_region_flat_file, still exists:
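One way to load directly from a datasource file, without a catalogued external table, is a transient external table that quotes the file in the FROM clause; a sketch (the exact option spelling may differ):

```sql
-- The file is treated as an external table for the duration of the statement.
INSERT INTO region SELECT * FROM EXTERNAL '/labs/movingData/et1_region_flat_file';
```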
8.
Check to ensure that the table contains the four rows using the SELECT * statement.
9.
Since this is a load operation there is always an associated log file, <table>.<database>.nzlog created for each load
performed. By default this log file is created in the /tmp directory. In the second session review this file:
Database:  LABDB
Tablename: REGION
Datafile:  /labs/movingData/et1_region_flat_file
Host:      netezza

Load Options
    Field delimiter:         '|'
    File Buffer Size (MB):   8
    Encoding:                INTERNAL
    Skip records:            0
    FillRecord:              No
    Escape Char:             None
    Allow CR in string:      No
    Quoted data:             NO
    NULL value:              NULL
    Load Replay Region (MB): 0
    Max errors:              1
    Max rows:                0
    Truncate String:         No
    Accept Control Chars:    No
    Ignore Zero:             No
    Require Quotes:          No
    BoolStyle:               1_0
    Decimal Delimiter:       '.'
    Date Style:              YMD
    Date Delim:              '-'
    Time Style:              24HOUR
    Time Delim:              ':'
    Time extra zeros:        No

Statistics
    number of records read:   4
    number of bad records:    0
    -------------------------------------------------
    number of records loaded: 4
    Elapsed Time (sec): 3.0
-----------------------------------------------------------------------------
Load completed at: 01-Jan-11 12:34:59 EDT
=============================================================================
You will notice that the log file contains the Load Options and the Statistics of the load, along with environment information
to identify the table.
3 The nzload Utility
The nzload utility can take its options from three places:
Command line
Control file
NZ environment variables
Without a control file, you can only do one load at a time. Using a control file allows multiple loads. The nzload command
connects to a database with a user name and password, just like any other PureData System appliance client application. The
user name specifies an account with a particular set of privileges, and the system uses this account to verify access.
For this section of the lab you will continue to use the LABADMIN user to load data into the LABDB database. The nzload utility
will be used to load records from an external datasource file into the REGION table. Along with this the nzload log files will be
reviewed to examine the nzload options. Since you will be loading data into a populated REGION table, you will use the
TRUNCATE TABLE command to remove the rows from the table.
We will continue to use the two PuTTY sessions from the external table lab:
Session One, which is connected to the nzsql console to execute SQL commands, for example to review tables after
load operations
Session Two, which will be used for operating system commands, to execute nzload commands, view data files, etc.
3.1
Loading data using command-line options
The first method for using the nzload utility to load data into the REGION table will specify options at the command line. You will
only need to specify the datasource file; we will use default options for the rest. The datasource file will be the
et1_region_flat_file that you created in the External Tables section. The basic syntax for this type of command is:
nzload -db <database> -u <username> -pw <password> -t <table> -df <datafile>
1.
As the LABDB database owner, LABADMIN first remove the rows in the REGION table:
2.
Check to ensure that the rows have been removed from the table using the SELECT * statement:
3.
Using the second session at the OS command line you will use the nzload utility to load data from the et1_region_flat_file
into the REGION table using the following command line options: -db <database name>, -u <user>, -pw
<password>, -t <table name>, -df <data file>, and -delimiter <string>:
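A sketch of the command (the LABADMIN password is a placeholder assumption; -delimiter '|' is the default and shown only for completeness):

```shell
nzload -db labdb -u labadmin -pw password -t region \
       -df et1_region_flat_file -delimiter '|'
```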
4.
Check to ensure that the rows have been load into the table using the SELECT * statement:
 R_REGIONKEY | R_NAME |          R_COMMENT
-------------+--------+-----------------------------
           1 | na     | north america
           4 | ap     | asia pacific
           2 | sa     | south america
           3 | emea   | europe, middle east, africa
(4 rows)
These rows were loaded from the records in the et1_region_flat_file file.
5.
For every load task performed there is always an associated log file, <table>.<db>.nzlog created. By default this log file
is created in the current working directory, which is the /labs/movingData directory. In the second session review this file:
[nzlog output: the header and Load Options are identical to the log shown in section 2.3, and the Statistics section again shows 4 records read, 0 bad records, 4 records loaded, Elapsed Time (sec): 3.0, Load completed at: 01-Jan-11 12:34:59 EDT.]
You will notice that the log file contains the Load Options and the Statistics of the load, along with environment information
to identify the database and table.
The -db, -u, and -pw options specify the database name, the user, and the password, respectively. Alternatively, you could
omit these options if the NZ environment variables are set to the appropriate database, username and password values. Since
the NZ environment variables NZ_DATABASE, NZ_USER, and NZ_PASSWORD are set to system, admin, and password, you
need to use these options so the load will run against the LABDB database using the LABADMIN user.
The other options:
-t specifies the target table name in the database
-df specifies the datasource file to be loaded
-delimiter specifies the string to use as the delimiter in an ASCII delimited text file.
There are other options that you can use with the nzload utility. These options were not specified here since the default values
were sufficient for this load task.
The following command is equivalent to the nzload command we used above. It is intended to demonstrate some of the options
that can be used with the nzload command but can be omitted when default values are used. It is shown for demonstration
purposes only:
3.2
Loading data using a control file
As demonstrated in section 3.1 you can run the nzload command by specifying the command line options, or you can use
another method: specifying the options in a file, which is referred to as a control file. This is useful since the file can be
updated and modified over time, as loading data into a database in a data warehouse environment is a continuous operation.
An nzload control file has the following basic structure:
DATAFILE <filename>
{
[<option name> <option value>]
}
The -cf option is used at the nzload command line to specify a control file:
1. As the LABDB database owner, LABADMIN, first remove the rows in the REGION table:
2.
Using the second session at the OS command line you will create the control file to be used with the nzload utility to
load data into the REGION table using the region.del data file. The control file will include the following options:
Parameter   Value
Database    Database name
Tablename   Table name
Delimiter   Delimiter string
LogDir      Log directory
LogFile     Log file name
BadFile     Bad records file name
And the data file will be the region.del file instead of the et1_region_flat_file that you used in section 3.1.
We already created the control file in the lab directory. Review it in the second putty session with the following command:
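The control file might look like the following sketch (option values inferred from the surrounding text; the file shipped with the lab may differ):

```
DATAFILE /labs/movingData/region.del
{
  Database  labdb
  Tablename region
  Delimiter '|'
  LogDir    /labs/movingData
  LogFile   region.log
  BadFile   region.bad
}
```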
3.
Still in the second session you will load the data using the nzload utility with the control file, using the
following command line options: -u <user>, -pw <password>, -cf <control file>
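A sketch of the invocation (the control file name region.cf is a hypothetical placeholder, as is the password):

```shell
# -cf replaces the per-load command line options with the control file settings.
nzload -u labadmin -pw password -cf region.cf
```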
4. Check the nzload log, which was renamed from the default to region.log and is located in the /labs/movingData
directory.
5. Check to ensure that the rows are in the REGION table in the first PuTTY session with the nzsql console:
3.3
Handling bad records
The first two methods illustrated how to use the nzload utility to load data into an empty table using command line options or a
control file. In a data warehousing environment you will, most of the time, incrementally add data to a table that already contains
some rows.
There will be instances where records from a datasource do not match the data types in the table. When this occurs the load
will abort when the first bad record is encountered. This is the default behaviour and is controlled by the -maxErrors option,
which has a default value of 1.
For this exercise you will add additional rows to the NATION table. Since you will be adding rows to the NATION table there will
be no need to truncate the table. The datasource file you will be using is the nation.del file, which unfortunately has a bad record.
1.
First check the NATION table by listing all of the rows in the table using the SELECT * statement in the first putty
session:
2.
Using the second session at the OS command line you will use the nzload utility to load data from the nation.del file
into the NATION table using the following command line options: -db <database name>, -u <user>, -pw
<password>, -t <table name>, -df <data file>, and -delimiter <string>
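A sketch of the load command (placeholder password; this run aborts because nation.del contains a bad record):

```shell
nzload -db labdb -u labadmin -pw password -t nation \
       -df nation.del -delimiter '|'
```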
Error: External Table : count of bad input rows reached maxerrors limit
See /labs/movingData/NATION.LABDB.nzlog file
Error: Load Failed, records not inserted.
This is an indication that the load has failed due to a bad record in the datasource file.
3.
Since the load has failed no rows were loaded into the NATION table, which you can confirm by using the SELECT *
statement (in the first session):
4.
In the second session you can check the log file, NATION.LABDB.nzlog, to determine the problem:
[nzlog output: the header shows database LABDB, table NATION, and datafile /home/nz/movingData/nation.del on host netezza; the Load Options are identical to the previous loads, including Max errors: 1, and are followed by the Statistics section.]
The Statistics section indicates that 10 records were read before the bad record was encountered during the load process.
As expected no rows were inserted into the table since the default is to abort the load when one bad record is encountered.
The log file also provides information about the bad record:
10(1) [1, INT4]
10(1) indicates the input record number (10) within the file and the offset (1) within the row where a problem was
encountered. [1, INT4] indicates the column number (1) within the row and the data type (INT4) of the column. The log
also records the value that caused the problem, 2t. Putting this all together: the value 2t appeared in a field for an INT4
column, which is N_NATIONKEY in the NATION table. 2t is not a valid integer, so the load marked this as a bad record.
5.
You can confirm that this observation is correct by examining the nation.del datasource file that was used for the load.
In the second session execute the following command:
15|andorra|2|andorra
16|ascension islan|3|ascension
17|austria|3|osterreich
18|bahamas|2|bahamas
19|barbados|2|barbados
20|belgium|3|belqique
21|chile|2|chile
22|cuba|2|cuba
23|cook islands|4|cook islands
2t|denmark|3|denmark
25|ecuador|2|ecuador
26|falkland islands|3|islas malinas
27|fiji|4|fiji
28|finland|3|suomen tasavalta
29|greenland|1|kalaallit nunaat
30|great britain|3|great britian
31|gibraltar|3|gibraltar
32|hungary|3|magyarorszag
33|iceland|3|lyoveldio island
34|ireland|3|eire
35|isle of man|3|isle of man
36|jamaica|2|jamaica
37|korea|4|han-guk
38|luxembourg|3|luxembourg
39|monaco|3|Monaco
You will notice on the 10th line the following record:
2t|denmark|3|denmark
So we made the correct assumption that the 2t is causing the problem. From this list you can assume that the correct
value should be 24.
6.
Alternatively you could instead examine the nzload bad log file NATION.LABDB.nzbad, which will contain all bad
records that are processed during a load. In the second session execute the following command:
2t|denmark|3|denmark
This is the same row you identified in the nation.del file, located here by using the log file. Since the default is to
stop the load after the first bad record is processed, there is only one row. If you were to change the default behaviour to
allow more bad records to be processed, this file could potentially contain more records. It provides a convenient overview
of all the records that caused exceptions during the load.
7.
We have the option of changing the nation.del file to change 2t to 24, and then rerunning the same nzload command
as in step 2. Instead you will rerun a similar load but allow 10 bad records to be encountered during the load
process. To change the default behaviour you need to use the -maxErrors command option. You will also change the
name of the nzbad file using the -bf command option and the log filename using the -lf command option:
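A sketch of the command with the three extra options (the password is a placeholder):

```shell
# -maxErrors raises the abort threshold; -bf and -lf rename the bad/log files.
nzload -db labdb -u labadmin -pw password -t nation -df nation.del \
       -delimiter '|' -maxErrors 10 -bf nation.bad -lf nation.log
```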
8.
Check to ensure that the new loaded rows are in the NATION table:
So now all of the new records were loaded except for the one bad row with nation key 24.
9.
Even though the nzload command reported success, it is good practice to review the nzload log file for
any problems, for example bad rows that are under the -maxErrors threshold. In the second PuTTY session execute the
following command:
[nzlog output: the header again shows database LABDB, table NATION, and datafile nation.del; the Load Options are the same as before except that Max errors is now 10, followed by the Statistics section.]
The main difference from before is that all 25 data records in the data source file were processed; 24 records were
loaded because there was one bad record in the data source file.
10. Now you will correct the bad row and load it into the NATION table. There are a couple of options you could use. One
option is to extract the bad row from the original data source file and create a new data source file with the corrected
record. However, this task could be tedious when dealing with large data source files and potentially many bad records.
The other option, which is more appropriate, is to use the bad log file. All bad records that cannot be loaded into the
table are placed in the bad log file. So in the second session use vi to open and edit the nation.bad file and change the
2t to 24 in the first field.
11. The vi editor has two modes: a command mode, used to save files, quit the editor, and so on, and an insert mode. Initially you
will be in the command mode. To change the file you need to switch into the insert mode by pressing i. The editor will
show -- INSERT -- at the bottom of the screen.
12. You can now use the cursor keys to navigate. Change the first two characters of the bad row from 2t to 24. Your screen
should look like the following:
24|denmark|3|denmark
~
~
~
~
~
~
~
~
~
~
-- INSERT --
13. We will now save our changes. Press Esc to switch back into command mode. You should see that the INSERT
indicator at the bottom of the screen vanishes. Enter :wq! and press Enter to write the file and quit the editor without any
questions.
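If you prefer a non-interactive fix over vi, a sed one-liner achieves the same edit. This is a local sketch using a stand-in file path, assuming the bad file contains the single record shown in this lab:

```shell
# The bad record as it appears in nation.bad (stand-in path for the lab file).
printf '2t|denmark|3|denmark\n' > /tmp/nation_fix.bad
# Replace the leading field 2t with 24, editing the file in place.
sed -i 's/^2t|/24|/' /tmp/nation_fix.bad
cat /tmp/nation_fix.bad    # prints 24|denmark|3|denmark
```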
14. After the nation.bad file has been modified to correct the record, issue an nzload to load the modified nation.bad file:
15. And now check that the new row has been loaded into the table in session one:
 N_NATIONKEY | N_NAME               | N_REGIONKEY | N_COMMENT
-------------+----------------------+-------------+----------------------------------
           1 | canada               |           1 | canada
           8 | united arab emirates |           3 | al imarat al arabiyah multahidah
          16 | ascension islan      |           3 | ascension
          17 | austria              |           3 | osterreich
          21 | chile                |           2 | chile
          22 | cuba                 |           2 | cuba
          23 | cook islands         |           4 | cook islands
          35 | isle of man          |           3 | isle of man
          24 | denmark              |           3 | denmark
           2 | united states        |           1 | united states of america
          11 | japan                |           4 | nippon
          18 | bahamas              |           2 | bahamas
          19 | barbados             |           2 | barbados
          20 | belgium              |           3 | belqique
          25 | ecuador              |           2 | ecuador
          33 | iceland              |           3 | lyoveldio island
          34 | ireland              |           3 | eire
          39 | monaco               |           3 | monaco
           3 | brazil               |           2 | brasil
           4 | guyana               |           2 | guyana
           5 | venezuela            |           2 | venezuela
           9 | south africa         |           3 | south africa
          13 | hong kong            |           4 | xianggang
          15 | andorra              |           2 | andorra
          27 | fiji                 |           4 | fiji
          28 | finland              |           3 | suomen tasavalta
          30 | great britain        |           3 | great britian
          36 | jamaica              |           2 | jamaica
          37 | korea                |           4 | han-guk
          38 | luxembourg           |           3 | luxembourg
           6 | united kingdom       |           3 | united kingdom
           7 | portugal             |           3 | portugal
          10 | australia            |           4 | australia
          12 | macau                |           4 | aomen
          14 | new zealand          |           4 | new zealand
          26 | falkland islands     |           3 | islas malinas
          29 | greenland            |           1 | kalaallit nunaat
          31 | gibraltar            |           3 | gibraltar
          32 | hungary              |           3 | magyarorszag
(39 rows)
The row with nation key 24 (denmark) is the new row that was added to the table, which was the bad record you corrected.
Backup Restore
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology
Table of Contents
1 Introduction .....................................................................3
1.1 Objectives........................................................................3
1 Introduction
PureData System appliances are designed for 99.99% availability and all internal components are redundant. Nevertheless, regular
backups should be part of any data warehouse operation. The first reason for this is disaster recovery, for example in case of a fire
in the data center. The second reason is to undo changes like accidental deletes.
For disaster recovery, backups should be stored in a different location than the data center that hosts the data warehouse.
For most big companies this will be a backup server running Veritas NetBackup, Tivoli Storage
Manager, or similar software; backing up to a file server is also possible.
1.1
Objectives
In the last labs we have created our LABDB database, and loaded the data into it. In this lab we will first set up a QA
database that contains a subset of the tables and data of the full database. To create the tables we will use cross
database access from our QA database to the LABDB production database.
We will then use the schema-only function of nzbackup to create a test database that contains the same tables and data
objects as the QA database but no data. Test data will later be added specifically for testing needs. After that we will do a
multistep backup of our QA database and test the restore functionality. Testing backups by restoring them is generally a
good idea and should be done during the development phase and also at regular intervals. After all, you are not fully sure
what a backup contains until you restore it.
Finally we will back up the system user data and the host data. While a database backup saves all users and groups that
are involved in that database, a full user backup may be needed to get the full picture, for example to archive users and
groups that are not used in any database. Host data should also be backed up regularly. In case of a host failure that
leaves the user data on the S-Blades intact, having a recent host backup will make the recovery of the appliance much
faster and more straightforward.
2 Creating a QA Database
In this chapter we will create a QA database called LABDBQA, which contains a subset of the tables. It will contain the full
NATION and REGION tables and the CUSTOMER table with a subset of the data. We will first create our QA database then
we will connect to it and use CTAS tables to create the table copies. We will use cross-database access to create our
CTAS tables from the foreign LABDB database. This is possible since PureData System allows read-only cross database
access if fully qualified names are used.
In this lab we will switch regularly between the operating system prompt and the nzsql console. The operating system
prompt will be used to execute the backup and restore commands and review the created files. The nzsql console will be
used to create the tables and further review the changes made to the user data using the restore commands.
To make this easier you should open two PuTTY sessions. The first one will be used to execute the operating system
commands and will be referred to as session 1 or the OS session; in the second session we will start the nzsql console.
It will be referred to as session 2 or the nzsql session. You can also see which session to use from the command prompt
in the screenshots.
Figure 2: The two PuTTY sessions for this lab, OS session 1 on the left, nzsql session 2 on the right
1. Open the first PuTTY session. Login to 192.168.239.2 as user nz with password nz. (192.168.239.2 is the default
IP address for a local VM; the IP may be different for your Bootcamp.)
2. Access the lab directory for this lab with the following command:
[nz@netezza ~]$ cd /labs/backupRestore/
3. Open the second PuTTY session. Login to 192.168.239.2 as user nz with password nz. (192.168.239.2 is the
default IP address for a local VM; the IP may be different for your Bootcamp.)
4. Access the lab directory for this lab with the same command as before
[nz@netezza ~]$ cd /labs/backupRestore/
5. Start the NZSQL console with the following command: nzsql
This will connect you to the SYSTEM database as the ADMIN user. These are the default settings stored in
the environment variables of the nz user.
6. Create our empty QA database with the following command:
SYSTEM(ADMIN)=> create database LABDBQA;
7. Connect to the newly created database:
SYSTEM(ADMIN)=> \c labdbqa
8. Create a full copy of the REGION table from the LABDB database:
LABDBQA(ADMIN)=> create table region as select * from labdb..region;
With this statement we create a local REGION table in the currently connected QA database that has the same
definition and content as the REGION table from the LABDB database. The CREATE TABLE AS statement is one of
the most flexible administrative tools for a PureData System administrator.
We can easily access tables of databases we are currently not connected to, but only for read operations. We cannot
insert data into a database we are not connected to.
9. Let's verify that the content has been copied over correctly. First let's look at the original data in the LABDB
database:
LABDBQA(ADMIN)=> select * from labdb..region;
You should see four rows in the result set:
LABDBQA(ADMIN)=> select * from labdb..region;
 R_REGIONKEY | R_NAME | R_COMMENT
-------------+--------+-----------------------------
           2 | sa     | south america
           3 | emea   | europe, middle east, africa
           1 | na     | north america
           4 | ap     | asia pacific
(4 rows)
To access a table from a foreign database we need to use its fully qualified name. Notice that we leave out the
schema name between the two dots. Schemas are not fully supported in PureData System, and since each table
name needs to be unique in a given database, the schema can be omitted.
10. Now let's compare that to our local REGION table:
LABDBQA(ADMIN)=> select * from region;
You should see the same rows as before although they can be in a different order:
LABDBQA(ADMIN)=> select * from region;
 R_REGIONKEY | R_NAME | R_COMMENT
-------------+--------+-----------------------------
           1 | na     | north america
           3 | emea   | europe, middle east, africa
           4 | ap     | asia pacific
           2 | sa     | south america
(4 rows)
11. Now we copy over the NATION table as well:
LABDBQA(ADMIN)=> create table nation as select * from labdb..nation;
12. And finally we will copy over a subset of our CUSTOMER table; we will only use the rows from the automobile
market segment for the QA database:
LABDBQA(ADMIN)=> create table customer as select * from labdb..customer where
c_mktsegment = 'AUTOMOBILE';
You will see that this inserts almost 30,000 rows into the QA customer table; this is roughly a fifth of the original table:
LABDBQA(ADMIN)=> \d
        List of relations
 Name             | Type  | Owner
------------------+-------+-------
 CUSTOMER         | TABLE | ADMIN
 NATION           | TABLE | ADMIN
 NATIONSBYREGIONS | VIEW  | ADMIN
 REGION           | TABLE | ADMIN
(4 rows)
16. Finally we will create a QA user and make this user the owner of the database. Create the user with:
LABDBQA(ADMIN)=> create user qauser;
17. Make the new user the owner of the QA database:
LABDBQA(ADMIN)=> alter database labdbqa owner to qauser;
We have successfully created our QA database using cross-database CTAS statements. Our QA database contains
three tables and a view, and we have a user that is the owner of this database. In the next chapter we will use backup and
restore to create an empty copy of the QA database as the test database.
3. In the nzsql session we will verify that we successfully created an empty copy of our database. See all available
databases with the following command: \l
LABDBTEST(ADMIN)=> \l
    List of databases
 DATABASE  | OWNER
-----------+----------
 INZA      | ADMIN
 LABDB     | LABADMIN
 LABDBQA   | QAUSER
 LABDBTEST | QAUSER
 MASTER_DB | ADMIN
 NZA       | ADMIN
 NZM       | ADMIN
 NZR       | ADMIN
 SYSTEM    | ADMIN
(9 rows)
You can see that the LABDBTEST database was successfully created and that its privilege information has been
copied as well; the owner is QAUSER as in the LABDBQA database.
4. We do not want the QA user to be the owner of the test database, so change the owner to ADMIN for now:
LABDBTEST(ADMIN)=> alter database labdbtest owner to admin;
5. Now let's check the contents of our test database. Connect to it with: \c labdbtest
6. Check if our test database contains all the objects of the QA database: \d
You will see the three tables and the view we created:
LABDBTEST(ADMIN)=> \d
        List of relations
 Name             | Type  | Owner
------------------+-------+-------
 CUSTOMER         | TABLE | ADMIN
 NATION           | TABLE | ADMIN
 NATIONSBYREGIONS | VIEW  | ADMIN
 REGION           | TABLE | ADMIN
(4 rows)
A PureData System backup saves all database objects, including views, stored procedures, and so on. All users,
groups, and privileges that refer to the backed-up database are saved as well.
7. Since we used the schema-only option we have not copied any data. Verify this for the NATION table with the
following command:
LABDBTEST(ADMIN)=> select * from nation;
You will see an empty result set as expected. The schema-only backup option is a convenient way to save your
database schema and to create empty copies of your database. Apart from the missing user data it will create a full
1:1 copy of the original database. You could also restore the database to a different PureData System Appliance. This
would only require that the backup server location is accessible from both PureData System Appliances. It could even
be a differently sized appliance, and the target appliance can have a higher version number of the NPS software than
the source. It cannot be lower, though.
IBM PureData System for Analytics
Copyright IBM Corp. 2012. All rights reserved
4.1
PureData System backups are organized in so-called backup sets. Every new full backup creates a new backup set. Differential
and cumulative backups are by default added to the last backup set, but they can be added to a different backup set as well. In
this section we will switch between the two PuTTY sessions.
1. In the OS session execute the following command to create a full backup of the QA database:
[nz@netezza backupRestore]$ nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2
You should get the following result:
This command will create a full user data backup of the LABDBQA database.
Each backup set has a unique id that can later be used to access it. By default the last active backup set is used for
restore and differential backups.
In this lab we split up the backup between two file system locations. You can specify up to 16 file system
locations after the -dir parameter. Alternatively you could use a directory list file with the -dirfile
option. Splitting up the backup between different file servers will result in higher backup performance.
2. In the NZSQL session we will now add a new row to the REGION table. First connect to the QA database:
LABDBTEST(ADMIN)=> \c labdbqa
3. Now add a new entry for the north pole to the REGION table:
LABDBQA(ADMIN)=> insert into region values (5, 'np', 'north pole');
4. In the OS session create a differential backup:
[nz@netezza backupRestore]$ nzbackup -db labdbqa -dir /tmp/bk1 /tmp/bk2
-differential
We now create a differential backup with the -differential option. This will add a new entry to the backup set
we created previously, containing only the differences since the full backup. You can see that the backup set id hasn't
changed.
5. In the NZSQL session add the south pole to the REGION table:
LABDBQA(ADMIN)=> insert into region values (6, 'sp', 'south pole');
You now have one full backup with the original 4 rows in the REGION table, a differential backup that
additionally contains the north pole entry, and a current state that on top of that includes the south pole region.
4.2
In this subchapter we will have a closer look at the files and logs that are created during the PureData System Backup
process.
1. In the OS session display the backup history of your Appliance:
[nz@netezza backupRestore]$ nzbackup -history
You should get the following result:
Date                  Log File
-------------------   --------------------
2011-12-13 17:50:29
2011-12-14 12:35:51
2011-12-14 12:44:53
PureData System keeps track of all backups and saves them in the system catalog. This is used for differential
backups and it is also integrated with the Groom process. Since PureData System doesn't use transaction logs, it
needs the logically deleted rows for differential backups. By default Groom doesn't remove a logically deleted row that
has not been backed up yet. Therefore the Groom process is integrated with the backup history. We will explain this in
more detail in the Transactions and Groom modules.
On our machine we have done three backups: one backup set containing the schema-only backup, and two backups for
the second backup set, one full and one differential. Let's have a closer look at the log that was generated for the
last differential backup.
2. In the OS session, switch to the log directory of the backupsvr process, which is the process responsible for
backing up data:
[nz@netezza backupRestore]$ cd /nz/kit/log/backupsvr/
The /nz/kit/log directory contains the log directories for all PureData System processes.
3. Display the end of the log for the last differential backup process. You will need to replace the XXX values with
the actual values of your log. You can cut and paste the log name from the history output above. We are
interested in the last differential backup process:
[nz@netezza backupsvr]$ tail backupsvr.xxxxx.xxxx-xx-xx.log
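Since the log file name embeds a process id and date, picking the newest log by modification time avoids typing the name out. This is a local sketch using a stand-in directory; on the appliance you would point it at /nz/kit/log/backupsvr:

```shell
# Stand-in directory; on the appliance this would be /nz/kit/log/backupsvr.
logdir=/tmp/backupsvr_demo
mkdir -p "$logdir"
printf 'backup started\nbackup done\n' > "$logdir/backupsvr.12345.2011-12-14.log"
# Pick the most recently modified log and show its last line.
latest=$(ls -t "$logdir"/backupsvr.*.log | head -n 1)
tail -n 1 "$latest"
```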
You can see that the process backed up the three tables REGION, NATION and CUSTOMER and wrote the result to
two different locations. You also see the amount of data written to these locations. Since we only added a single row
the amount of data is tiny. If you look at the log of the full backup you will see a lot more data being written.
4. Now let's have a look at the files that are created during the backup process. Enter the first backup location:
[nz@netezza backupsvr]$ cd /tmp/bk1
5. And display the contents with ls
You will see the following result:
[nz@netezza bk1]$ ls
Netezza
The Netezza folder contains all backup sets for all PureData System appliances that use this backup
location. If you need to move the backup you always have to move the complete folder.
6. Enter the Netezza folder with cd Netezza and display the contents with ls :
You will see the following result:
[nz@netezza Netezza]$ ls
netezza
Under the main Netezza folder you will find subfolders for each Netezza host that is backed up to this location. In our
case we only have one Netezza host, called netezza, but if your company had multiple Netezza hosts you would
find them here.
7. Enter the host folder with cd netezza and display the contents with ls :
[nz@netezza netezza]$ ls
LABDBQA
Below the host you will find all the databases of the host that have been backed up to this location, in our case the QA
database.
8. Enter the LABDBQA folder with cd LABDBQA and display the contents with ls :
[nz@netezza LABDBQA]$ ls
20111214173551
In this folder you can see all the backup sets that have been saved for this database. Each backup set corresponds to
one full backup plus an optional set of differential and cumulative backups. Note that we backed up the schema to a
different location so we only have one backup set in here.
9. Enter the backup set folder with cd <your backupset id> and display the contents with ls :
[nz@netezza 20111214173551]$ ls
1 2
Under the backup set are folders for each backup that has been added to that backup set. 1 is always the full
backup followed by additional differential or cumulative backups. We will later use these numbers to restore our
database to a specific backup of the backup set.
10. Enter the full backup with cd 1 and display the contents with ls :
[nz@netezza 1]$ ls
FULL
11. Enter the FULL folder with cd FULL and display the contents with ls :
[nz@netezza FULL]$ ls
data md
The data folder contains the user data, the md folder contains metadata including the schema definition of the
database.
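To summarize the walk-through so far, the backup location follows the layout <location>/Netezza/<host>/<database>/<backup set id>/<increment>/FULL/{data,md}. A local sketch recreating it with the lab's example values:

```shell
# Illustration only: recreate the backup folder hierarchy described above,
# using the lab's host name, database name, and backup-set id.
base=/tmp/bkdemo/Netezza/netezza/LABDBQA/20111214173551
mkdir -p "$base/1/FULL/data" "$base/1/FULL/md"
find /tmp/bkdemo -type d | sort
```

The find command prints the directory tree just created, mirroring the folders visited in steps 4 through 11.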
12. Enter the data folder with cd data and display detailed information with ll :
[nz@netezza data]$ ll
total 1120
-rw------- 1 nz nz
338 Dec 14 12:36 206208.full.2.1
-rw------- 1 nz nz
451 Dec 14 12:36 206222.full.2.1
-rw------- 1 nz nz 1132410 Dec 14 12:36 206238.full.1.1
You can see that there are three data files: two small files for the REGION and NATION tables and a big file for the
CUSTOMER table.
13. Now switch to the md folder with cd ../md and display the contents with ls :
[nz@netezza md]$ ls
contents.txt loc1 schema.xml
stream.0.1
stream.1.1.1.1
This folder contains information about the files that contribute to the backup, and the schema definition of the database
in schema.xml.
14. Let's have a quick look inside the schema.xml file:
[nz@netezza md]$ more schema.xml
You should see the following result:
<ARCHIVE archive_major="4" archive_minor="0" product_ver="Release 6.1, Dev 2 [Build 16340]"
catalog_ver="3.976" hostname="netezza" dataslices="4" createtime="2011-12-14 17:35:57"
lowercase="f" hpfrel="4.10" model="WMware" family="vmware" platform="xs">
<OPERATION backupset="20111214173551" increment="1" predecessor="0" optype="0" dbname="LABDBQA"/>
<DATABASE name="LABDBQA" owner="QAUSER" oid="206144" delimited="f" odelim="f" charset="LATIN9"
collation="BINARY" collecthist="t">
<STATISTICS column_count="15"/>
<TABLE ver="2" name="REGION" owner="ADMIN" oid="206208" delimited="f" odelim="f" rowsecurity="f"
origoid="206208">
<COLUMN name="R_REGIONKEY" owner="" oid="206209" delimited="f" odelim="t" seq="1" type="INTEGER"
typeno="23" typemod="-1" notnull="t"/>
...
As you see this file contains a full XML description of your database, including table definition, views, users etc.
15. Switch back to the lab folder with:
[nz@netezza md]$ cd /labs/backupRestore/
You should now have a pretty good understanding of the PureData System backup process. In the next subchapter
we will demonstrate the restore functionality.
4.3
In this subchapter we will restore our database, first to the first increment, and then advance it to the
next increment.
PureData System allows you to return a database to a specific increment in your backup set. If you want to do an
incremental restore the database must be locked. Tables can be queried but not changed until the database is in the
desired state and unlocked.
1. In the NZSQL session we will now drop the QA database and the QA user, first connect to the SYSTEM database:
LABDBQA(ADMIN)=> \c SYSTEM
2. Now drop the QA database:
SYSTEM(ADMIN)=> DROP DATABASE LABDBQA;
3. Now drop the QA User:
SYSTEM(ADMIN)=> DROP USER QAUSER;
4. Let's verify that the QA database really has been deleted with \l
You will see that the LABDBQA database has been removed:
SYSTEM(ADMIN)=> \l
    List of databases
 DATABASE  | OWNER
-----------+----------
 INZA      | ADMIN
 LABDB     | LABADMIN
 LABDBTEST | ADMIN
 MASTER_DB | ADMIN
 NZA       | ADMIN
 NZM       | ADMIN
 NZR       | ADMIN
 SYSTEM    | ADMIN
(8 rows)
5. In the OS session we will now restore the database to the first increment:
[nz@netezza md]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -increment 1 -lockdb
true
Notice that we have specified the increment with the -increment option. In our case this is the first full backup in our
backup set.
We didn't need to specify a backup set; by default the most recent one is used. Since we are not sure to which
increment we want to restore the database, we lock the database with the -lockdb option. This allows only
read-only access until the desired increment has been restored.
6. In the NZSQL session verify that the database has been recreated with \l
You will see the LABDBQA database and you can also see that the owner QAUSER has been recreated and is again
the database owner:
SYSTEM(ADMIN)=> \l
    List of databases
 DATABASE  | OWNER
-----------+----------
 INZA      | ADMIN
 LABDB     | LABADMIN
 LABDBQA   | QAUSER
 LABDBTEST | ADMIN
 MASTER_DB | ADMIN
 NZA       | ADMIN
 NZM       | ADMIN
 NZR       | ADMIN
 SYSTEM    | ADMIN
(9 rows)
7. Connect to the LABDBQA database with
SYSTEM(ADMIN)=> \c labdbqa
You will see that LABDBQA database is currently in read-only mode.
SYSTEM(ADMIN)=> \c labdbqa
NOTICE: Database 'LABDBQA' is available for read-only
You are now connected to database labdbqa.
8. Verify the contents of the REGION table from the LABDBQA database:
SYSTEM(ADMIN)=> select * from region;
You can see that we have returned the database to the state captured by the first full backup. There is no north or
south pole entry:
LABDBQA(ADMIN)=> select * from region;
 R_REGIONKEY | R_NAME | R_COMMENT
-------------+--------+-----------------------------
           2 | sa     | south america
           1 | na     | north america
           3 | emea   | europe, middle east, africa
           4 | ap     | asia pacific
(4 rows)
10. In the OS session we will now apply the next increment to the database
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -increment
next -lockdb true
You will see that we now apply the second increment to the database:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -increment
next -lockdb true
Restore of increment 2 from backupset 20111214173551 to database 'labdbqa'
committed.
11. Since we do not need to apply any more increments, we can now unlock the database:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -unlockdb
After unlocking the database we cannot apply any further increments to it. To jump to a different increment
we would need to start from the beginning.
12. In the NZSQL session we have a look at the REGION table again:
LABDBQA(ADMIN)=> select * from region;
You can see that the north pole region, which was inserted before the first differential backup, has been restored:
LABDBQA(ADMIN)=> select * from region;
 R_REGIONKEY | R_NAME | R_COMMENT
-------------+--------+-----------------------------
           2 | sa     | south america
           3 | emea   | europe, middle east, africa
           4 | ap     | asia pacific
           1 | na     | north america
           5 | np     | north pole
(5 rows)
13. Verify that the database is unlocked and ready for use again by adding a new set of customers to the
CUSTOMER table. In addition to the AUTOMOBILE customers we want to add the MACHINERY customers from the main
database:
LABDBQA(ADMIN)=> insert into customer select * from labdb..customer where
c_mktsegment = 'MACHINERY';
You will see that we can now use the database in a normal fashion again.
14. We had around 30,000 customers before; verify that the new customer set has been added successfully:
LABDBQA(ADMIN)=> select count(*) from customer;
You will see that we now have around 60,000 rows in the CUSTOMER table.
You have now completed a full restore cycle for the database, applying both the full and the differential backup.
In the next chapter we will demonstrate single table restore and the ability to restore from any backup set.
4.4
In this chapter we will demonstrate the targeted restore of a subset of tables from a backup set. We will also demonstrate
how to restore from a specific older backup set.
1. First we will create a second backup set with the new customer data. In the OS session execute the following
command:
Date                  Log File
-------------------   --------------------
2011-12-13 17:50:29
2011-12-14 12:35:51
2011-12-14 12:44:53
2011-12-14 15:55:36
3. To return only the CUSTOMER table to its state in the second backup set we can do a table-level restore with
the following command:
[nz@netezza backupRestore]$ nzrestore -db labdbqa -dir /tmp/bk1 /tmp/bk2 -backupset
<your_backup_set_id> -tables CUSTOMER
This command will restore only the tables named with the -tables option. If you want to restore multiple tables you can simply
write them in a list after the option.
We use the -backupset option to specify a specific backup set. Remember to replace the id with the value you
retrieved with the -history command.
Notice that the table name is case sensitive. This is in contrast to the database name.
PureData System cannot restore a table that already exists in the target database. You can either drop the table before
restoring it, or use the -droptables option.
In this chapter you have executed a single table restore and performed a restore from a specific backup set.
In this chapter you will do a backup of these components, so that you would be able to revert your appliance to the exact
condition it was in at the time of the backup.
5.1
Users, groups, and privileges that are not used in any database will not be saved by a user data backup. To be able to
revert a PureData System Appliance completely to its original condition you need a backup of the global user
information as well, to capture for example administrative users that are not part of any database.
This is done with the -users option of the nzbackup command:
1. In the OS session execute the following command:
[nz@netezza backupRestore]$ nzbackup -dir /tmp/bk1 /tmp/bk2 -users
You will see the following results:
[nz@netezza backupRestore]$ nzbackup -dir /tmp/bk1 /tmp/bk2 -users
Backup of users, groups, and global permissions completed successfully.
This will create a backup of all users, groups, and privileges. Restoring it will not delete any users; instead it will only add
missing users, groups, and privileges, so it doesn't need to be fully synchronized with the user data backup. You can
even restore an older user backup without fear of destroying information.
5.2
Until now we have always backed up database content. This is essentially catalog and user data that can be applied to a
new PureData System appliance. PureData System also provides the functionality to back up and restore host data,
which is essentially the data in the /nz/data and /export/nz directories of the host server.
There are two reasons for regularly backing up host data. The first is a host crash. If the S-Blades of your appliance are
intact but the host file system has been destroyed you could recreate all databases from the user backup. But in very
large systems this might take a long time. It is much easier to only restore the host information and reconnect to the
undamaged user tables on the S-Blades.
The second reason is that the host data contains configuration information, log and plan files, and so on that are not saved by
the user backup. If, for example, you changed the system configuration, that information would be lost.
Therefore it is advisable to back up host data regularly.
1. To backup the host data execute the following command in the OS session:
[nz@netezza backupRestore]$ nzhostbackup /tmp/hostbackup
This will pause your system and copy the host files into the specified backup file:
14 12:35  bk1
14 12:35  bk2
13 17:50  bkschema
14 16:37  hostbackup
12 14:55  inza1.1.2
20  2011  lost+found
12 15:04  nzaeus__nzmpirun___
12 15:04  nzaeus__nzmpirun_____Process
12 15:05  nzcm.lock
12 14:46  nzcm-temp_18uEeq
12 12:55  nzcm-temp_rvAZXR
nzcm-temp_rvAZXR
You can see that a backup file has been created. It's a compressed file containing the system catalog and PureData System host information. If possible, host backups should be done regularly. If an old host backup is restored, there may be so-called orphaned tables: tables that were created after the host backup and exist on the S-Blades but are no longer registered in the system catalog. During host restore PureData System will create a script to clean up these orphaned tables so that they do not take up any disk space.
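The counterpart to nzhostbackup is the nzhostrestore command. A minimal sketch of restoring the host backup created above (exact options may vary by NPS release):

```shell
# Restore host data (system catalog, configuration files) from the
# backup file created with nzhostbackup. The command pauses the system,
# restores the files, and generates the orphaned-table cleanup script
# mentioned above.
nzhostrestore /tmp/hostbackup
```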
Congratulations, you have finished the Backup & Restore lab and had a chance to see all components of a successful PureData System backup strategy. The one missing component is a dedicated backup server: we only used file system backup. In a real environment you would more likely use a Veritas or TSM backup server. For further information regarding the setup steps please refer to the excellent system administration guide.
Query Optimization
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
2 Generate Statistics
3 Identifying Join Problems
4 HTML Explain
1 Introduction
PureData System uses a cost-based optimizer to determine the best method for scan and join operations, join order, and data movement between SPUs (redistribute or broadcast operations if necessary). For example, the planner tries to avoid redistributing large tables because of the performance impact. The optimizer can also dynamically rewrite queries to improve query performance.
The optimizer takes a SQL query as input and creates a detailed execution or query plan for the database system. For the
optimizer to create the best execution plan that results in the best performance, it must have the most up-to-date statistics.
You can use EXPLAIN, HTML (also known as bubble), and text plans to analyze how the PureData System system
executes a query.
EXPLAIN is a very useful tool to spot and identify performance problems: bad distribution keys, badly written SQL queries, and out-of-date statistics.
1.1
Objectives
During our POC we have identified a couple of very long running customer queries that have significantly worse performance than the number of rows involved would suggest. In this lab we will use the EXPLAIN functionality to identify the concrete bottlenecks and, if possible, fix them to improve query performance.
2 Generate Statistics
Our first long running customer query returns the average order price by customer segment for a given year and order priority. It joins the CUSTOMER table for the market segment and the ORDERS table for the total price of the order. Due to restrictive join conditions it shouldn't require much processing time, but on our test system it runs for a very long time. In this chapter we will use the PureData System EXPLAIN functionality to find out why this is the case.
The customer query in question:
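Based on the corrected version shown later in this lab, the original query (still missing the customer-key join condition) would look like this:

```sql
-- Average order price by market segment for 1996, priority 1-URGENT.
-- Note that there is no join condition between ORDERS and CUSTOMER yet.
SELECT c.c_mktsegment, AVG(o.o_totalprice)
FROM orders AS o, customer AS c
WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996
  AND o.o_orderpriority = '1-URGENT'
GROUP BY c.c_mktsegment;
```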
1.
Connect to your PureData System image using putty. Login to 192.168.239.2 as user nz with password nz.
(192.168.239.2 is the default IP address for a local VM, the IP may be different for your Bootcamp)
2.
First we will make sure that the system isn't running a different workload that could influence our tests. Use the following nzsession command to verify that the system is free:
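A sketch of the invocation (`show` is the default subcommand for listing sessions):

```shell
# List all current database sessions; on an idle system only the
# session opened by the nzsession command itself should appear.
nzsession show
```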
This result shows that there is currently only one session connected to the database, which is the nzsession command itself. By default the database user in your VMware image is ADMIN. Executing this command before doing any performance measurements ensures that other workloads are not influencing the performance of the system. You can also use the nzsession command to abort bad or locked sessions.
3.
After we verified that the system is free we can start analyzing the query. Connect to the lab database with the following
command:
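As noted later in this bootcamp, the lab database is LABDB and the lab user is LABADMIN, so the connection command is:

```shell
# Open the nzsql console against the LABDB database as user LABADMIN;
# this produces the LABDB(LABADMIN)=> prompt used in the following steps.
nzsql labdb labadmin
```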
4.
Let's first have a look at the two tables and the WHERE conditions to get an idea of the row numbers involved. Our query joins the CUSTOMER table, without any WHERE condition applied to it, and the ORDERS table, which has two WHERE conditions restricting it on the date and order priority. From the data distribution lab we know that the CUSTOMER table has 150000 rows. To get the number of rows involved from the ORDERS table, execute the following COUNT(*) command:
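Matching the EXPLAIN statement used in the next step, the count query has this shape; in this lab it returns 46014:

```sql
-- Count the ORDERS rows matching the two WHERE conditions
-- of the customer query.
SELECT COUNT(*)
FROM orders AS o
WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996
  AND o.o_orderpriority = '1-URGENT';
```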
5.
The PureData System optimizer uses statistics about the data in the system to estimate the number of rows that result from WHERE conditions, joins, etc. Wrong estimates can lead to bad execution plans. For example, a huge result set could be broadcast for a join instead of doing a double redistribution. To see the optimizer's estimated rows for the WHERE conditions in our query, run the following EXPLAIN command:
explain verbose select count(*) from orders as o where EXTRACT(YEAR FROM o.o_orderdate) =
1996 and o.o_orderpriority = '1-URGENT';
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 150, Width = 0, Cost = 0.0 .. 578.6, Conf = 64.0
Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR",
O.O_ORDERDATE) = 1996))
Projections:
Node 2.
[SPU Aggregate]
...
The execution plan of this query consists of two nodes or snippets. First the table is scanned and the WHERE conditions
are applied, which can be seen in the Restrictions sub node. Since we use a COUNT(*) the Projections node is empty.
Then an Aggregation node is applied to count the rows that are returned by node 1.
When we look at the estimated number of rows we can see that it is way off the mark. The PureData System optimizer estimates from its available statistics that only 150 rows are returned by the WHERE conditions. We have seen before that in reality it's 46014, or roughly 300 times as many.
6.
One way to help the optimizer in its estimates is the collection of detailed statistics about the involved tables. Execute
the following command to generate detailed statistics about the ORDERS table:
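In Netezza SQL, full statistics for a table are collected with the GENERATE STATISTICS command:

```sql
-- Collect full statistics (min/max values, distinct counts, null
-- counts) for all columns of the ORDERS table.
GENERATE STATISTICS ON orders;
```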
7.
We will now check if generating statistics has improved the estimates. Execute the EXPLAIN command again:
As we can see, the estimated rows of the SELECT query have improved drastically. The optimizer now assumes this WHERE condition will apply to 3000 rows of the ORDERS table. That is still significantly off the true number of roughly 46000, but a factor of 20 better than the original estimate of 150.
IBM PureData System for Analytics
Copyright IBM Corp. 2012. All rights reserved
Estimations are very difficult to make. Obviously the optimizer cannot do the actual computation during planning. It relies on
current statistics about the involved columns. Statistics include min/max values, distinct values, numbers of null values etc.
Some of these statistics are collected on the fly but the most detailed statistics can be generated manually with the Generate
Statistics command. Generating full statistics after loading a table or changing its content significantly is one of the most
important administration tasks in PureData System. The PureData System appliance will automatically generate express
statistics after many tasks like load operations and just-in-time statistics during planning. Nevertheless full statistics should be
generated on a regular basis.
Let's analyze the execution plan for this query using the EXPLAIN VERBOSE command:
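A sketch of the statement, based on the corrected query shown later in this lab (here still without the customer-key join condition):

```sql
EXPLAIN VERBOSE
SELECT c.c_mktsegment, AVG(o.o_totalprice)
FROM orders AS o, customer AS c
WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996
  AND o.o_orderpriority = '1-URGENT'
GROUP BY c.c_mktsegment;
```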
2.
First try to answer the following questions through the execution plan yourself. Take your time. We will walk through the
answers after that.
Projections:
1:C.C_MKTSEGMENT
[SPU Broadcast]
a.
The first node in the execution plan does a sequential scan of the CUSTOMER table on the SPUs. It estimates that 150000
rows are returned which we know is the number of rows in the CUSTOMER table.
The statement that tells us which columns are used in further computations is the Projections: clause. We can see that
only the C_MKTSEGMENT column is carried on from the CUSTOMER table. All other columns are thrown away. Since
C_MKTSEGMENT is a CHAR(10) column the returned resultset has a width of 10.
b.
During the scan the table is broadcast to the other SPUs. This means that the complete CUSTOMER table is assembled on the host and broadcast to each SPU for further computation of the query. This may seem surprising at first since we have a substantial number of rows, but since the width of the result set is only 10, we are talking about 150000 rows * 10 bytes = 1.5 MB. This is almost nothing for a warehousing system.
Node 2.
[SPU Sequential Scan table "ORDERS" as "O" {(O.O_ORDERKEY)}]
-- Estimated Rows = 3000, Width = 8, Cost = 0.0 .. 578.6, Conf = 64.0
Restrictions:
((O.O_ORDERPRIORITY = '1-URGENT'::BPCHAR) AND (DATE_PART('YEAR'::"VARCHAR",
O.O_ORDERDATE) = 1996))
Projections:
1:O.O_TOTALPRICE
c.
The second node of the execution plan does a scan of the ORDERS table. One column, O_TOTALPRICE, is projected and used in further computations. We cannot see any distribution or broadcast clauses, so this table can be joined locally. This works because the CUSTOMER table is broadcast to all SPUs: if one table of a join is broadcast, the other table doesn't need any redistribution.
d.
In which node are the WHERE conditions applied and how many rows does PureData System expect to fulfill the where
condition?
We can see in the Restrictions clause that the WHERE conditions of our query are also applied in the second node. This makes sense, since both WHERE conditions refer to the ORDERS table and can be evaluated during its scan. As we can see in the Estimated Rows clause, the optimizer estimates a returned set of 3000 rows, which we know is too low, since in reality 46014 rows are returned from this table.
Node 3.
[SPU Nested Loop Stream "Node 2" with Temp "Node 1" {(O.O_ORDERKEY)}]
-- Estimated Rows = 450000007, Width = 18, Cost = 1048040.0 .. 7676127.0, Conf = 64.0
Restrictions:
't'::BOOL
Projections:
1:C.C_MKTSEGMENT
2:O.O_TOTALPRICE

e.
The third node of our execution plan contains the join between the two tables. It is a Nested Loop Join, which means that every row of the first join set is compared to each row of the second join set; if the join condition holds true, the joined row is added to the result set. This can be a very efficient join for small tables, but for large tables its complexity is quadratic and therefore in general slower than, for example, a Hash Join. A Hash Join, however, cannot be used in cases of inequality join conditions, floating point join keys, etc.
f. What is the number of estimated rows for the join?
We can see in the Estimated Rows clause that the optimizer estimates this join node to return roughly 450 million rows, which is the number of rows from the first node times the number of rows from the second node.
g. What is the most expensive node and why?
As we can see from the Cost clause, the optimizer estimates that the join has a cost in the range 1048040 .. 7676127.0. This is roughly 2000 to 14000 times the cost expected for Node 1 and Node 2. Nodes 4 and 5, which group and aggregate the result set, do not add much cost either. So our performance problems clearly originate in the join node 3.
So what is happening here? If we take a look at the query we can assume that it is intended to compute the average order price per market segment. This means we should join all customers to their corresponding order rows. But for this to happen we would need a join condition that joins the customer table and the orders table on the customer key. Instead the query performs a Cartesian join, joining each customer row to each orders row. This is a very work intensive operation that results in the behavior we have seen: the joined result set becomes huge, and the query even returns results that cannot have been intended.
4.
So how do we fix this? By adding a join condition to the query that makes sure that customers are only joined to their
orders. This additional join condition is O.O_CUSTKEY=C.C_CUSTKEY. Execute the following EXPLAIN command for
the modified query.
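This is the same statement that is shown as an EXPLAIN PLANGRAPH later in this lab; with the added join condition it reads:

```sql
EXPLAIN VERBOSE
SELECT c.c_mktsegment, AVG(o.o_totalprice)
FROM orders AS o, customer AS c
WHERE EXTRACT(YEAR FROM o.o_orderdate) = 1996
  AND o.o_orderpriority = '1-URGENT'
  AND o.o_custkey = c.c_custkey   -- the added join condition
GROUP BY c.c_mktsegment;
```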
As you can see there have been some changes to the execution plan. The ORDERS table is now scanned first and distributed on the customer key. The CUSTOMER table is already distributed on the customer key, so no redistribution needs to happen there. Both tables are then joined in node 3 through a Hash Join on the customer key.
The estimated number of rows is now 150000, the same as the number of customers. Since we have a 1:n relationship between customers and orders this is as we would expect. The estimated cost of node 3 has also come down significantly, to 578.6 ... 746.7.
5.
Let's make sure that the query performance has indeed improved. Switch on the display of elapsed query time with the following command:
LABDB(LABADMIN)=> \time
If you want you can later switch off the elapsed time display by executing the same command again. It is a toggle.
6.
Before we made our changes the query took so long that we couldn't wait for it to finish. After our changes the execution time has improved to slightly more than a second. In this relatively simple case we might have been able to pinpoint the problem by analyzing the SQL on its own, but this can be almost impossible for complicated multi-join queries that are often used in warehousing. Reporting and BI tools tend to create very complicated portable SQL as well. In these cases EXPLAIN can be a valuable tool to pinpoint the problem.
4 HTML Explain
In this section we will look at the HTML plangraph for the customer query that we just fixed. Besides the text descriptions of the execution plan we used in the previous chapter, PureData System can also generate a graphical query tree. This is done with the help of HTML, so plangraph files can be created and viewed in your internet browser. PureData System can be configured to save an HTML plangraph or plantext file for every executed SQL query, but in this chapter we will use the basic EXPLAIN PLANGRAPH command and use cut and paste to export the file to your host computer.
1.
Enter the query with the keyword explain plangraph to generate the HTML plangraph:
LABDB(LABADMIN)=> EXPLAIN PLANGRAPH SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, CUSTOMER as c WHERE
EXTRACT(YEAR FROM o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY
c.c_mktsegment;
NOTICE: QUERY PLAN:
<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns="http://www.w3.org/TR/REC-html40">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<meta http-equiv="Generator" content="Netezza Performance Server">
<meta http-equiv="Author" content="Babu Tammisetti <btammisetti@netezza.com>">
<style>
v\:* {behavior:url(#default#VML);}
</style>
</head>
<body lang="en-US">
<pre style="font:normal 68% verdana,arial,helvetica;background:#EEEEEE;margin-top:1em;margin-bottom:1em;margin-left:0px;padding:5pt;">
EXPLAIN PLANGRAPH SELECT c.c_mktsegment, AVG(o.o_totalprice) FROM orders AS o, CUSTOMER as c WHERE EXTRACT(YEAR FROM
o.o_orderdate) = 1996 AND o.o_orderpriority = '1-URGENT' AND o.o_custkey = c.c_custkey GROUP BY c.c_mktsegment;
</pre>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:19pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">AGG<br/>r=100 w=26 s=2.5KB</p></v:textbox>
<v:oval style="position:absolute;margin-left:231pt;margin-top:15pt;width:78pt;height:25pt;z-index:9;"></v:oval>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:0pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">snd,ret</p></v:textbox>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:54pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">GROUP<br/>r=100 w=18 s=1.8KB</p></v:textbox>
<v:oval style="position:absolute;margin-left:231pt;margin-top:50pt;width:78pt;height:25pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,27pt" to="270pt,62pt"/>
<v:textbox style="position:absolute;margin-left:233pt;margin-top:42pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">dst,m-grp</p></v:textbox>
<v:textbox style="position:absolute;margin-left:230pt;margin-top:89pt;width:80pt;height:31pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">HASHJOIN<br/>r=150.0K w=18 s=2.6MB<br/>(C_CUSTKEY =
O_CUSTKEY)</p></v:textbox>
<v:oval style="position:absolute;margin-left:231pt;margin-top:85pt;width:78pt;height:31pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,62pt" to="270pt,100pt"/>
<v:textbox style="position:absolute;margin-left:190pt;margin-top:124pt;width:80pt;height:31pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">SEQSCAN<br/>r=150.0K w=14 s=2.0MB<br/>C</p></v:textbox>
<v:oval style="position:absolute;margin-left:191pt;margin-top:120pt;width:78pt;height:31pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,97pt" to="230pt,135pt"/>
<v:textbox style="position:absolute;margin-left:270pt;margin-top:124pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">HASH<br/>r=3.0K w=12 s=35.2KB</p></v:textbox>
<v:oval style="position:absolute;margin-left:271pt;margin-top:120pt;width:78pt;height:25pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="270pt,97pt" to="310pt,132pt"/>
<v:textbox style="position:absolute;margin-left:253pt;margin-top:112pt;width:80pt;height:25pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">dst{(O_CUSTKEY)}</p></v:textbox>
<v:textbox style="position:absolute;margin-left:270pt;margin-top:159pt;width:80pt;height:31pt;z-index:10;">
<p style="text-align:center;font-size:6pt;">SEQSCAN<br/>r=3.0K w=12 s=35.2KB<br/>O</p></v:textbox>
<v:oval style="position:absolute;margin-left:271pt;margin-top:155pt;width:78pt;height:31pt;z-index:9;"></v:oval>
<v:line style="position:absolute;z-index:8;" from="310pt,132pt" to="310pt,170pt"/>
</body>
</html>
EXPLAIN
Next open your host computer's text editor: if your workstation is Windows open Notepad; if you use a Linux desktop use the default text editor such as KEdit or gedit. Copy the output of the explain plangraph from your putty window into the editor. Make sure that you only copy the HTML, from the <html start tag to the </html> end tag.
2.
Save the file as explain.html on your desktop.
3.
Now on your desktop double click on explain.html. On Windows make sure to open it with Internet Explorer, since this will give the best output.
You can see a graphical representation of the query we analyzed before. The left leg of the tree is the scan node of the CUSTOMER table (C); the right leg contains a scan of the ORDERS table (O) and a node hashing the result set from ORDERS in preparation for the HASHJOIN node, which joins the result sets of the two table scans on the customer key. After the join, the result is fed into a GROUP node and an aggregation node that computes the average total price before it is returned to the caller.
A graphical representation of the execution plan can be valuable for complicated multi-join queries to get an overview of the joins. Congratulations, in this lab you have used the PureData System EXPLAIN functionality to analyze a query.
Optimization Objects
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
2 Materialized Views
2.1 Wide Tables
2.2
3.2
1 Introduction
A PureData System appliance is designed to provide excellent performance in most cases without any specific tuning or index creation. One of the key technologies used to achieve this is zone maps: automatically computed and maintained records of the data that is inside the extents of a database table.
In general data is loaded into data warehouses ordered by the time dimension; therefore zone maps have the biggest
performance impact on queries that restrict the time dimension as well.
This approach works well for most situations, but PureData System provides additional functionality to enhance specific
workloads, which we will use in this chapter.
We will first use materialized views to enhance the performance of queries against wide tables and of queries that only access small subsets of columns.
Then we will use Cluster Based Tables to enhance the performance of queries that use multiple lookup dimensions.
1.1
Objectives
In the last couple of labs we have recreated a customer database in our PureData System system. We have picked distribution
keys, loaded the data and made some first performance investigations. In this lab we will take a deeper look at some customer
queries and try to enhance their performance by tuning the system.
2 Materialized Views
A materialized view is a view of a database table that projects a subset of the base table's columns and can be sorted on a specific set of the projected columns. When a materialized view is created, the sorted projection of the base table's data is stored in a materialized table on disk.
Materialized views reduce the width of data being scanned in a base table. They are beneficial for wide tables that contain many columns (i.e. 50-500 columns) where typical queries only reference a small subset of the columns.
Materialized views also provide fast, single or few record lookup operations. The thin materialized view is automatically
substituted by the optimizer for the base table, allowing faster response, particularly for shorter tactical queries that examine only
a small segment of the overall database table.
2.1
Wide Tables
In our customer scenario we have a couple of queries that do some basic computations on the LINEITEM table but only touch a
small number of columns of the table.
1. Connect to your Netezza image using putty. Login to 192.168.239.2 as user nz with password nz. (192.168.239.2 is
the default IP address for a local VM, the IP may be different for your Bootcamp)
2.
The first thing we need to do is to make sure table statistics have been generated so that more accurate estimated query costs can be reported by the EXPLAIN commands we will be looking at. Generate statistics for the ORDERS and LINEITEM tables using the following commands.
3.
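Using the same GENERATE STATISTICS syntax as in the previous lab, the two commands are:

```sql
-- Collect full statistics for both tables used in this chapter.
GENERATE STATISTICS ON orders;
GENERATE STATISTICS ON lineitem;
```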
The following query computes the total quantity of items shipped and their average tax rate for a given month, in this case the fourth month, April. Execute the following query:
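Matching the EXPLAIN statement shown below, the query is:

```sql
-- Total quantity shipped and average tax rate for April shipments.
SELECT SUM(L_QUANTITY), AVG(L_TAX)
FROM LINEITEM
WHERE EXTRACT(MONTH FROM L_SHIPDATE) = 4;
```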
Now lets have a look at the cost of this query. To get the projected cost from the Optimizer we use the following
EXPLAIN VERBOSE command:
QUERY SQL:
EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH
FROM L_SHIPDATE) = 4;
Since this query is run very frequently we want to enhance the scanning performance. And since it only uses 3 of the 16
LINEITEM columns we have decided to create a materialized view covering these three columns. This should
significantly increase scan speed since only a small subset of the data needs to be scanned. To create the materialized
view THINLINEITEM execute the following command:
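A sketch of such a view, covering the three columns the query touches (the lab's actual script may differ in column order):

```sql
-- A thin projection of LINEITEM containing only the columns used by
-- the frequent query, so scans read a fraction of the data.
CREATE MATERIALIZED VIEW THINLINEITEM AS
SELECT L_QUANTITY, L_TAX, L_SHIPDATE
FROM LINEITEM;
```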
Repeat the explain call from step 2. Execute the following command:
QUERY SQL:
EXPLAIN VERBOSE SELECT SUM(L_QUANTITY), AVG(L_TAX) FROM LINEITEM WHERE EXTRACT(MONTH
FROM L_SHIPDATE) = 4;
[MV:
Notice that the PureData System optimizer has automatically replaced the LINEITEM table with the view THINLINEITEM. We didn't need to make any changes to the query. Also notice that the expected cost has been reduced to 174, which is less than 10% of the original.
As you have seen, where you have wide database tables and queries that only touch a subset of their columns, a materialized view of the hot columns can significantly increase performance for these queries, without any changes to the executed queries.
2.2
Materialized views not only reduce the width of tables, they can also be used in a similar way to indexes to increase the speed of
queries that only access a very limited set of rows.
1.
First we drop the view we used in the last chapter with the following command:
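In Netezza a materialized view is dropped like a regular view; assuming the view name from the previous chapter:

```sql
-- Remove the materialized view created in section 2.1.
DROP VIEW THINLINEITEM;
```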
The following command returns the number of returned shipments vs. total shipments for a specific shipping day.
Execute the following command:
LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
You should have a similar result to the following:
LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
 RET | TOTAL
-----+-------
 176 |  2550
(1 row)
You can see that on June 15, 1995 there were 176 returned shipments out of a total of 2550. Notice the use of the CASE statement to turn the L_RETURNFLAG column into a 0/1 value, which is easily countable.
3.
We will now take a look at the underlying data distribution of the LINEITEM table and its zone map values. To do this
exit the NZSQL console by executing the \q command.
4.
In our demo image we have installed the PureData System support tools. You can normally find them as an installation package in /nz on your PureData System appliance, or you can retrieve them from IBM support. One of these tools is the nz_zonemap tool, which returns detailed information about the zone map values associated with a given database table. First let's have a look at the zone mappable columns of the LINEITEM table. Execute the following command:
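A sketch of a typical invocation, passing the database and table name (the exact argument form depends on the support-tools version installed):

```shell
# Show the zone-mappable columns of the LINEITEM table in LABDB.
nz_zonemap labdb lineitem
```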
LABDB  LINEITEM  TABLE  243252
Now we will have a look at the zone map values for the L_SHIPDATE column. Execute the following command:
LABDB  LINEITEM  TABLE  243252  1  L_SHIPDATE (DATE)
Enter the NZSQL console again by entering the nzsql labdb labadmin command.
7.
We will now create a materialized view that is ordered on the L_SHIPDATE column. Execute the following command:
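A sketch of the view definition, assuming the view contains only the L_SHIPDATE column and is sorted on it:

```sql
-- A thin view sorted on the ship date, so that zone maps on the view's
-- extents are maximally selective for L_SHIPDATE lookups.
CREATE MATERIALIZED VIEW SHIPLINEITEM AS
SELECT L_SHIPDATE
FROM LINEITEM
ORDER BY L_SHIPDATE;
```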
The materialized view retains the information about the location of each parent row in the base table and can therefore be used for lookups even if columns of the parent table are accessed in the SELECT clause.
You can specify more than one order column. In that case rows are ordered first by the first column; where this column has equal values, the next column is used to order rows within the group, and so on. In general only the first order column provides a significant impact on performance.
8.
Let's have a look at the zone map of the newly created view. Leave the NZSQL console again with the \q command.
9.
Display the zone map values of the materialized view SHIPLINEITEM with the following command:
LABDB  SHIPLINEITEM  MATERIALIZED VIEW  252077  1  L_SHIPDATE (DATE)
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE
0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
You will see a long text output; scroll up until you find the command you just executed. Your result should look like the following:
EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan mview index "_MSHIPLINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 2193, Width = 1, Cost = 0.0 .. 61.7, Conf = 90.0 [MV:
MaxPages=24 TotalPages=24] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
Restrictions:
(LINEITEM.L_SHIPDATE = '1995-06-15'::DATE)
Projections:
1:LINEITEM.L_RETURNFLAG
Node 2.
[SPU Aggregate]
-- Estimated Rows = 1, Width = 24, Cost = 62.2 .. 62.2, Conf = 0.0
Projections:
1:SUM(CASE WHEN (LINEITEM.L_RETURNFLAG <> 'N'::BPCHAR) THEN 1 ELSE 0 END)
2:COUNT(*)
[SPU Return]
[HOST Merge Aggs]
[Host Return]
...
Notice that the Optimizer has automatically changed the table scan to a scan of the view SHIPLINEITEM we just created.
This is possible even though the projection is taking place on column L_RETURNFLAG of the base table.
12. In some cases you might want to disable or suspend an associated materialized view, for example for troubleshooting or administrative tasks on the base table. In these cases use the following command to suspend the view:
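Assuming the standard Netezza syntax for suspending a materialized view:

```sql
-- Suspend the view: the optimizer stops substituting it for the base
-- table until it is refreshed.
ALTER VIEW SHIPLINEITEM MATERIALIZE SUSPEND;
```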
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE
0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
Scroll up until you see your explain query. With the view suspended we can see that the optimizer again scans the original table LINEITEM.
EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
14. Note that we have only suspended our view, not dropped it. We will now reactivate it with the following refresh command:
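Assuming the corresponding refresh syntax:

```sql
-- Rebuild the suspended view and make it available to the optimizer again.
ALTER VIEW SHIPLINEITEM MATERIALIZE REFRESH;
```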
LABDB(LABADMIN)=> EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE
0 END) AS RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
Make sure that the Optimizer again uses the materialized view for its first scan operation. The output should again look like
before you suspended the view.
EXPLAIN VERBOSE SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
QUERY VERBOSE PLAN:
Node 1.
[SPU Sequential Scan mview index "_MSHIPLINEITEM" {(LINEITEM.L_ORDERKEY)}]
-- Estimated Rows = 2193, Width = 1, Cost = 0.0 .. 61.7, Conf = 90.0 [MV:
MaxPages=24 TotalPages=24] [BT: MaxPages=549 TotalPages=2193] (JIT-Stats)
Restrictions:
(LINEITEM.L_SHIPDATE = '1995-06-15'::DATE)
Projections:
1:LINEITEM.L_RETURNFLAG
...
16. If you execute the query again you should get the same results as you got before creating the materialized view.
LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS RET,
COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
You should see the following output:
There is a defect in our VMware image which in some cases only returns the rows from one data slice instead of all four when a materialized view is used. This means that instead of seeing a TOTAL of 2550 you may see a total of 623 (or a similar number, depending on your data distribution and which data slice is returned). You can solve this problem by restarting your PureData System database. The problem does not occur on a real PureData System appliance.
LABDB(LABADMIN)=> SELECT SUM(CASE WHEN L_RETURNFLAG <> 'N' THEN 1 ELSE 0 END) AS
RET, COUNT(*) AS TOTAL FROM LINEITEM WHERE L_SHIPDATE='1995-06-15';
 RET | TOTAL
-----+-------
 176 |  2550
(1 row)
IBM PureData System for Analytics
Copyright IBM Corp. 2011, 2012 All rights reserved
You have just created a materialized view to speed up queries that look up small numbers of rows. A materialized view can
provide a significant performance improvement and is transparent to end users and applications accessing the database.
But it also creates additional overhead during INSERTs, UPDATEs and DELETEs, requires additional hard disk space, and
may require regular maintenance.
Therefore materialized views should be used sparingly. In the next chapter we will discuss an alternative approach to speed
up scans of a database table.
3.1 Creating a Cluster Based Table
Cluster based tables are created like normal PureData System database tables, but they are flagged as a CBT during table
creation by specifying up to four organization columns. An existing PureData System table can also be altered at any time to
become a cluster based table.
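The statement the following steps build up has this general shape (a sketch; the column list is abbreviated, and the DISTRIBUTE ON clause is an assumption carried over from the original ORDERS table):

```sql
CREATE TABLE orders_cbt
(
    o_orderkey     INTEGER       NOT NULL,
    o_orderdate    DATE          NOT NULL,
    o_totalprice   DECIMAL(15,2) NOT NULL
    -- ... remaining ORDERS columns ...
)
DISTRIBUTE ON (o_orderkey)
ORGANIZE ON (o_orderdate, o_totalprice);
```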
1.
We are going to change the create table command for ORDERS to create a cluster based table. We will create a new
cluster based table called ORDERS_CBT. Exit the NZSQL console by executing the \q command.
2.
Switch to the optimization lab directory by executing the following command: cd /labs/optimizationObjects
3.
We have supplied a script for the creation of the ORDERS_CBT table, but we need to add the ORGANIZE
ON (O_ORDERDATE, O_TOTALPRICE) clause so that the table is created as a cluster based table organized on the
O_ORDERDATE and O_TOTALPRICE columns. To change the CREATE statement, open the orders_cbt.sql script in
the vi editor with the following command:
vi orders_cbt.sql
4.
Enter the insert mode by pressing i; the editor should now show an -- INSERT -- marker in the bottom line.
5.
Navigate the cursor onto the semicolon ending the statement. Press Enter to move it to a new line, then enter the line
organize on (o_orderdate, o_totalprice) before it. Your screen should now look like the following.
6.
Press ESC to exit the insert mode.
7.
Enter :wq! in the command line and press Enter to save and exit without questions.
8.
Create and load the orders_cbt table by executing the following script: ./create_orders_test.sh
9.
This may take a couple of minutes because of our virtualized environment. You may see an error message that the table
orders_cbt does not exist. This is expected, since the script first tries to clean up an existing orders_cbt table.
10. We will now have a look at how Netezza has organized the data in this table. For this we use the nz_zonemap utility
again. Execute the following command:
Database:    LABDB
Object Name: ORDERS_CBT
Object Type: TABLE
Object ID :  264428
Data Slice:  1
Column 1:    O_ORDERDATE (DATE)
The groom command will be covered in detail in the following presentation and lab, but we will use it in the next chapter to
reorganize the table.
3.2 Reorganizing a Cluster Based Table
When a table is created as a cluster based table in Netezza, the data isn't actually organized at load time. Also, similar to
ordered materialized views, a cluster based table can become partially unordered due to INSERTs, UPDATEs and DELETEs. A
threshold is defined for reorganization, and the groom command can be used at any time to reorganize a cluster based table
based on its organization keys.
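The reorganization itself is triggered with GROOM; a hedged sketch of the command used in this section (RECORDS ALL forces a full pass over the table rather than waiting for the reorganization threshold):

```sql
GROOM TABLE orders_cbt RECORDS ALL;
```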
1.
To organize the table you created in the last chapter you need to switch to the NZSQL console again. Execute the
following command: nzsql labdb labadmin
2.
Let's have a look at the data organization in the table. To do this, quit the NZSQL console with the \q command.
4.
Review the zone maps of the two organization columns by executing the following command.
Your results should look like the following (we removed the ORDER columns from the results to improve readability):
[nz@netezza optimizationObjects]$ nz_zonemap labdb orders_cbt o_orderdate o_totalprice
Database:    LABDB
Object Name: ORDERS_CBT
Object Type: TABLE
Object ID :  264428
Data Slice:  1
Column 1:    O_ORDERDATE (DATE)
Column 2:    O_TOTALPRICE (NUMERIC(15,2))
(7 rows)
You can see that both columns now have some form of order. Our query restricts rows in two ranges:
Condition 1: O_ORDERDATE falls in 1996
AND
Condition 2: 150000 < O_TOTALPRICE <= 180000
Below we list the minimum and maximum values of the extents in the table and add columns that mark (with an X) whether the
contained values of an extent overlap with the above conditions.
Min(Date)    Max(Date)    Min(Price)   Max(Price)   Cond 1   Cond 2   Both Cond
1992-01-01   1994-06-22       912.10    144450.63
1993-08-27   1996-12-08       875.52    144451.22      X
1996-02-13   1998-08-02       884.52    144446.76      X
1995-04-18   1998-08-02     78002.23    215555.39      X        X         X
1993-08-27   1998-08-02    196595.73    530604.44      X
1992-01-01   1995-04-18    144451.94    296228.30               X
1992-01-01   1993-08-27    196591.22    555285.16
As you can see, there are 4 extents that contain rows from 1996 and 2 extents that contain rows in the price range
from 150000 to 180000. But only one extent contains rows that satisfy both conditions and needs to be scanned
during query execution.
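The test query implied by these two conditions would look roughly like the following (a sketch; the exact predicate form used in the lab may differ):

```sql
SELECT COUNT(*)
FROM orders_cbt
WHERE o_orderdate BETWEEN '1996-01-01' AND '1996-12-31'   -- Condition 1
  AND o_totalprice > 150000 AND o_totalprice <= 180000;   -- Condition 2
```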
In this scenario we probably would have been able to get similar results with one organization column or a materialized view,
but with bigger tables and more extents, cluster based tables gain a performance advantage.
Congratulations, you have finished the Optimization Objects lab. In this lab you created materialized views to speed up
scans of wide tables and queries that look up only small numbers of rows. Finally, you created a cluster based table and
used the groom command to organize it. Throughout the lab you used the nz_zonemap tool to view zone maps and get
a better idea of how data is stored in the Netezza appliance.
Groom
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
2 Transactions
2.1 Insert Transaction
2.2 Delete and Update Transactions
2.3 Aborting Transactions
2.4 Cleaning up
1 Introduction
As part of your routine database maintenance activities, you should plan to recover the disk space occupied by outdated or deleted
rows. In normal PureData System operation, an UPDATE or DELETE of a table row does not remove the physical row on the
hard disk. Instead, the old row is marked as deleted together with the transaction id of the deleting transaction, and in the case of an
update a new row is created. This approach is called multiversioning. Rows that could potentially be visible to other transactions
with an older transaction id remain accessible. Over time, however, the outdated or deleted rows become of no interest to any
transaction and need to be removed to free up hard disk space and improve performance. After the rows have been
captured in a backup, you can reclaim the space they occupy using the SQL GROOM TABLE command. The GROOM TABLE
command does not lock a table while it is running; you can continue to SELECT, UPDATE, and INSERT into the table while the
table is being groomed.
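In its simplest form the command names only the table (groom runs online, so concurrent reads and writes keep working):

```sql
GROOM TABLE orders;   -- reclaim space from outdated and deleted rows
```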
1.1
Objectives
In this lab we will use the GROOM command to prepare our tables for the customer. During the course of the POC we have
deleted and updated a number of rows. At the end of a POC it is sensible to clean up the system: run GROOM on the created
tables, generate statistics, and perform other cleanup tasks.
2 Transactions
In this section we will show how transactions can leave logically deleted rows in a table, which later need to be removed
with the groom command as an administrative task. We will go through the different transaction types and show you what happens
under the covers in a PureData System appliance.
2.1
Insert Transaction
In this chapter we will add a new row to the REGION table and review the hidden fields that are saved in the database. As you
remember from the Transactions presentation, PureData System uses a concept called multi-versioning for transactions. Each
transaction has its own image of the table and doesn't influence other transactions. This is achieved by adding a number of hidden
fields to each PureData System table, the most important ones being CREATEXID and DELETEXID. Each PureData System
transaction has a unique transaction id that increases with each new transaction.
In this subsection we will add a new row to the REGION table.
1.
Connect to your Netezza image using putty. Login to 192.168.239.2 as user nz with password nz. (192.168.239.2 is
the default IP address for a local VM, the IP may be different for your Bootcamp)
2.
3.
Connect to the database LABDB as user LABADMIN by typing the following command:
Insert a new row into the REGION table for the region Australia with the following SQL command.
Now we will again do a SELECT on the REGION table, but this time we will also query the hidden fields CREATEXID,
DELETEXID and ROWID:
As you can see, we now have five rows in the REGION table. The new row for Australia has the id of the last transaction as
CREATEXID and 0 as DELETEXID, since it has not yet been deleted. Other transactions with a lower transaction id that
might still be running will not be able to see this new row. Note also that each row has a unique rowid. Rowids do not need
to be consecutive, but they are unique across all data slices for one table.
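The two statements referenced in this step follow this general pattern (a sketch; the key value 5 and the name 'AUSTRALIA' are placeholders, and REGION is assumed to have the TPC-H columns R_REGIONKEY, R_NAME, R_COMMENT):

```sql
INSERT INTO region VALUES (5, 'AUSTRALIA', 'australia');
SELECT createxid, deletexid, rowid, * FROM region;
```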
2.2 Delete and Update Transactions
Delete transactions in PureData System do not physically remove rows but update the DELETEXID field of a row to mark it as
logically deleted. These logically deleted rows need to be removed regularly with the administrative groom command.
Update transactions in PureData System consist of a logical delete of the old row and an insert of a new row with the updated
fields. To show this effectively, we need to change a system parameter that allows us to switch off the
invisibility lists in PureData System. Note that the parameter we will be using is dangerous and shouldn't be used in a real
PureData System environment. There is also a safer environment variable, but it has some restrictions.
1.
First we will change the system variable that allows us to see deleted rows in the system.
To do this, exit the console with \q.
2.
Stop the system with the nzstop command.
3.
Open the system.cfg file that contains the PureData System system configuration with vi
5.
Enter the insert mode by pressing i; the editor should now show an -- INSERT -- marker in the bottom line.
6.
Navigate the cursor to the end of the last line. Press Enter to create a new line and enter the line
host.fpgaAllowXIDOverride=yes. Your screen should now look like the following.
system.enableCompressedTables=false
system.realFpga=no
system.useFpgaPrep=yes
system.enableCompressedTables=yes
system.enclosurePollInterval=0
system.envPollInterval=0
system.esmPollInterval=0
system.hbaPollInterval=0
system.diskPollInterval=0
system.enableCTA2=1
system.enableCTAColumns=1
sysmgr.coreCountWarning=1
sysmgr.coreCountFailover=1
system.emulatorMode=64
system.emulatorThreads=4
host.fpgaAllowXIDOverride=yes
~
~
-- INSERT --
7.
Press ESC to exit the insert mode.
8.
Enter :wq! in the command line and press Enter to save and exit without questions.
9.
Start the system again with the nzstart command. Note that in a real PureData System environment, changing system
configuration parameters can be very dangerous and is normally not advisable without PureData System service
support.
10. Enter the NZSQL console again with the following command:
11. Now we will update the row we inserted in the last chapter to the REGION table:
Normally you would now see 5 rows with the updated value. But since we disabled the invisibility lists, you now see 6 rows in
the REGION table. Our transaction that updated the row had the transaction id 369666. You can see that the original row
with the lowercase australia in the comment column is still there and now has a DELETEXID field that contains the
transaction id of the transaction that deleted it. Transactions with a higher transaction id will not see a row whose DELETEXID
indicates that it was logically deleted before the transaction ran.
We also see a newly inserted row with the new comment value Australia. It has the same rowid as the deleted row, and its
CREATEXID is the id of the transaction that did the insert.
13. Finally lets clean up the table again by deleting the Australia row:
We can now see that we have logically deleted our updated row as well. It now has a DELETEXID field with the value of the
new transaction. New transactions will see the original table from the start of this lab again. Normally the logically deleted
rows are filtered out automatically by the FPGA.
If you do a SELECT, the FPGA will remove all rows that:
have a CREATEXID bigger than the current transaction id;
have a CREATEXID of an uncommitted transaction;
have a DELETEXID smaller than the current transaction id, but only if the transaction of the DELETEXID
field is committed;
have a DELETEXID of 1, which means that the insert has been aborted.
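These rules can be summarized as a single visibility predicate. The sketch below is purely conceptual: is_committed() is a hypothetical helper standing in for the FPGA's check of the transaction commit state, and :current_xid stands for the id of the reading transaction.

```sql
SELECT * FROM region r
WHERE r.createxid <= :current_xid             -- not created "in the future"
  AND is_committed(r.createxid)               -- hypothetical: creating transaction committed
  AND r.deletexid <> 1                        -- insert was not aborted
  AND ( r.deletexid = 0                       -- row never deleted, or
        OR r.deletexid > :current_xid         -- deleted by a later transaction, or
        OR NOT is_committed(r.deletexid) );   -- deleting transaction not committed
```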
2.3
Aborting Transactions
PureData System never physically deletes a row during transactions, even if a transaction is rolled back. In this section we will show what
happens when a transaction is rolled back. Since an update transaction consists of a delete and an insert, we will
demonstrate the behavior for all three transaction types with it.
1.
To start a transaction that we can later roll back, we need to use the BEGIN keyword.
LABDB(LABADMIN)=> BEGIN;
By default, all SQL statements entered into the NZSQL console are auto-committed. To start a multi-command transaction,
the BEGIN keyword needs to be used. All SQL statements executed after it will belong to a single transaction. To
end the transaction, two keywords can be used: COMMIT to commit the transaction, or ROLLBACK to roll back the transaction
and all changes since the BEGIN statement was executed.
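A minimal example of such a multi-command transaction (illustrative only; the UPDATE mirrors the one used in the next steps, and the AP region name is an assumption):

```sql
BEGIN;
UPDATE region SET r_comment = 'updated comment' WHERE r_name = 'AP';
ROLLBACK;   -- discard the change; COMMIT; would make it permanent
```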
2.
3.
Note that we have the same results as in the last chapter: the original row for the AP region was logically deleted by
updating its DELETEXID field, and a new row with the updated comment and a new rowid has been added. Note that its
CREATEXID is the same as the DELETEXID of the old row, since both were set by the same transaction.
4.
LABDB(LABADMIN)=> ROLLBACK;
5.
We can see that the transaction has been rolled back. The DELETEXID of the old version of the row has been reset to 0,
which means that it is a valid row that can be seen by other transactions, and the DELETEXID of the new row has been set
to 1, which marks it as aborted.
2.4
Cleaning up
In this section we will use the groom command to remove the logically deleted rows we have created, and we will remove the
system parameter from the configuration file. The groom command will be covered in more detail in the next chapter. It is the main
maintenance command in PureData System, and we have already used it in the Cluster Based Table lab to reorder a CBT. It
also removes all logically deleted rows from a table and frees up the space on the machine again.
1.
2.
You can see that the groom command has removed all logically deleted rows from the table. Remember that we still have
the parameter switched on that allows us to see any logically deleted rows. Especially on tables that are heavily changed,
with lots of updates and deletes, running the groom command will free up hard drive space and increase performance.
3.
Finally we will remove the system parameter again. Quit the nzsql console with the \q command.
4.
Stop the system with the nzstop command.
5.
Open the system.cfg file that contains the PureData System system configuration with vi
7.
Navigate the cursor to the last line and delete it by pressing d twice. Your screen should look like the following:
system.enableCompressedTables=false
system.realFpga=no
system.useFpgaPrep=yes
system.enableCompressedTables=yes
system.enclosurePollInterval=0
system.envPollInterval=0
system.esmPollInterval=0
system.hbaPollInterval=0
system.diskPollInterval=0
system.enableCTA2=1
system.enableCTAColumns=1
sysmgr.coreCountWarning=1
sysmgr.coreCountFailover=1
system.emulatorMode=64
system.emulatorThreads=4
~~
"system.cfg" 16L, 421C
8.
Enter :wq! in the command line and press Enter to save and exit without questions.
9.
Start the system again with the nzstart command. We have now returned the system to its original status. Logically
deleted rows will again be hidden by the database.
First, determine the physical size on disk of the table ORDERS using the following command:
Now we are going to delete some rows from the ORDERS table. Delete all rows where the order status is marked as F for
finished, using the following command:
DELETE 729413
3.
Now check the physical table size for ORDERS and see if the size decreased, using the same command as before. You
must first exit NZSQL to the shell using \q.
LABDB(LABADMIN)=> \q
[nz@netezza ~]$ nz_db_size LABDB
The output should be the same as above, showing that the ORDERS table did not change in size and is still 75MB. This is
because the deleted rows were logically deleted but are still left on disk. The rows are still accessible to transactions that started
before the DELETE statement we just executed (i.e. that have a lower transaction id).
4.
Next, let's physically delete what we just logically deleted, using the GROOM TABLE command and specifying table
ORDERS. When you run the GROOM TABLE command, it removes outdated and deleted records from tables.
Check if the ORDERS table size on disk has shrunk, using the nz_db_size command. You must first exit NZSQL to
the shell using \q.
LABDB(LABADMIN)=> \q
[nz@netezza ~]$ nz_db_size LABDB
The output is shown below. Note the reduced size of the ORDERS table:
[nz@netezza ~]$ nz_db_size labdb
  Object   |    Name     |    Bytes     |     KB     |     MB     |    GB     |  TB
-----------+-------------+--------------+------------+------------+-----------+------
 Appliance | netezza     |  430,833,664 |    420,736 |        411 |        .4 |   .0
 Database  | LABDB       |  422,969,344 |    413,056 |        403 |        .4 |   .0
 Table     | CUSTOMER    |   13,631,488 |     13,312 |         13 |        .0 |   .0
 Table     | LINEITEM    |  294,256,640 |    287,360 |        281 |        .3 |   .0
 Table     | NATION      |      524,288 |        512 |          1 |        .0 |   .0
 Table     | ORDERS      |   40,370,176 |     39,424 |         39 |        .0 |   .0
 Table     | PART        |    5,242,880 |      5,120 |          5 |        .0 |   .0
 Table     | PARTSUPP    |   67,502,080 |     65,920 |         64 |        .1 |   .0
 Table     | REGION      |      393,216 |        384 |          0 |        .0 |   .0
 Table     | SUPPLIER    |    1,048,576 |      1,024 |          1 |        .0 |   .0
We can see that GROOM purged the deleted rows from disk. GROOM reported that the table size was reduced by 12 extents, and
we can confirm this: the size of the table was reduced by 36MB, which is the correct size for 12 extents (one extent's
size is 3MB).
Update the ORDERS table so that the price of everything is increased by $1. Do this using the following command:
UPDATE 770587
All rows will be affected by the update, resulting in a doubled number of physical rows in the table. This is because the update
operation leaves a copy of the rows from before the update occurred, in case a transaction is still operating on them. New rows
are created and the results of the UPDATE are put in these rows. The old rows that are left on disk are marked as logically
deleted.
2.
To measure the performance of our test query, we can configure the NZSQL console to show the elapsed execution
time using the following command:
LABDB(LABADMIN)=> \time
Output:
Please rerun the query once or twice more to see roughly what a consistent query time is on your machine.
Now run the GROOM TABLE command on the ORDERS table again:
Now run our chosen test query again and you should see a difference in performance:
For our example, we find that we have a new region we want to add to our REGION table whose name exceeds
the limits of the CHAR(25) field R_NAME: Australia, New Zealand, and Tasmania. We decide to increase the R_NAME
field to a CHAR(40) field.
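The overall sequence this section walks through can be sketched as follows (hedged; the intermediate verification steps are omitted, and the exact ALTER variants may differ by NPS release):

```sql
ALTER TABLE region ADD COLUMN r_name_temp CHAR(40);       -- new, wider column
UPDATE region SET r_name_temp = r_name;                   -- copy the values over
ALTER TABLE region DROP COLUMN r_name RESTRICT;           -- drop the old column
ALTER TABLE region RENAME COLUMN r_name_temp TO r_name;   -- restore the name
GROOM TABLE region VERSIONS;                              -- merge the table versions
```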
1.
Add a new column to the REGION table with the name R_NAME_TEMP and the data type CHAR(40).
Let's insert a row into the table using the new name column.
You can see that the results are exactly as you would expect them to be, but how does the system actually achieve this?
Remember that inside the PureData System appliance we now have two versions of the table: one containing the old columns and
rows, and one containing the new row with the new column.
4.
Normally the query would result in a single table scan node, but now we see a more complicated query plan. The optimizer
automatically translates the simple SELECT into a UNION of two tables. The two tables are internal:
_TV_315893_1 is the old version of the table before the ALTER statement, and _TV_315893_2 is the new
version of the table after the ALTER statement, containing the new column R_NAME_TEMP.
Notice that in the old table a 4th column of CHAR(40) with default value NULL is added. This is necessary for the UNION to
succeed. The merger of those tables is done in Node 5, which takes both result sets and appends them.
But let's proceed with our data type change operation.
5.
Now we will move all values of the R_NAME column to the R_NAME_TEMP column by updating them:
8.
We have now changed the data type of the R_NAME column. The column order has changed, but our R_NAME
column has the same values as before and now supports longer region names.
But we have one last step to do. Under the covers the system now has three different versions of the table, which are merged
for each query against the REGION table. This not only uses up space, it is also bad for query performance. So we have to
materialize these table changes with the groom command.
11. Groom the REGION table with the VERSIONS keyword to merge table versions:
Now this is much nicer. As we would expect, we have only a single table scan snippet in the query plan and a single version
of the REGION table.
13. Finally, we will return the REGION table to the old column ordering so that it does not interfere with future labs. To do this we
will use a CTAS statement:
Stored Procedures
Hands-On Lab
IBM PureData System for Analytics Powered by Netezza Technology
Table of Contents
1 Introduction
1.1 Objectives
1 Introduction
Stored procedures are subroutines that are saved in PureData System. They are executed inside the database server and are
available only by accessing the NPS system. They combine the capabilities of SQL to query and manipulate database
information with the capabilities of procedural programming languages, like branching and iteration. This makes them an ideal
solution for tasks like data validation, writing event logs, or encrypting data. They are especially suited for repetitive tasks that
can be easily encapsulated in a subroutine.
1.1
Objectives
In the previous labs we created our database, loaded the data, and performed some optimization and administration tasks.
In this lab we will enhance the database with a couple of stored procedures. As we mentioned in a previous chapter, PureData
System doesn't check referential or unique constraints. This is normally not critical, since data loading in a data warehousing
environment is a controlled task. In our PureData System implementation we have the requirement to allow some non-administrative
database users to add new customers to the CUSTOMER table. This happens rarely, so there are no performance
requirements, and we have decided to implement this with a stored procedure that is accessible to these users and checks the
input values and referential constraints.
In a second part we will implement a business logic function as a stored procedure returning a result set.
In this chapter we will create the stored procedure that inserts data into the CUSTOMER table. The information added for a new
customer will be the customer key, name, phone number and nation; the rest of the information is updated through other
processes.
2.1 Defining the Procedure Interface
First we will review the customer table and define the interface of the insert stored procedure.
1.
Connect to your Netezza image using putty. Login to 192.168.239.2 as user nz with password nz. (192.168.239.2 is
the default IP address for a local VM, the IP may be different for your Bootcamp)
2.
Access the lab directory for this lab with the following command; this folder already contains empty files for the stored
procedure scripts we will create later. If you want, review them with the ls command:
3.
4.
To create a stored procedure we will use the internal vi editor of the nzsql console. Open the already existing empty file
addCustomer.sql with the following command (note that you can tab-complete the filename):
LABDB(ADMIN)=> \e addCustomer.sql
6.
You are now in the familiar vi interface and can edit the file. Switch to insert mode by pressing i.
7.
We will now create the interface of the stored procedure so we can test creating it. We need the 4 input fields mentioned
above and will return an integer return code. Enter the text as seen in the following, then exit the insert mode by
pressing ESC, enter :wq! and press Enter to save the file and quit vi.
The minimal stored procedure we create here doesn't do anything yet, since it has an empty body. We simply create the
signature with the input and output variables. We use the CREATE OR REPLACE command so that we can later execute the
same command multiple times to update the stored procedure with more code.
The input variables cannot be given names, so we only add the data types for our input parameters: key, name, nation and
phone. We also return an integer return code.
Note that we have to specify the procedure language even though NZPLSQL is the only available option in PureData
System.
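The signature-only script entered above would look roughly like this (a sketch; the parameter types are assumptions based on the TPC-H CUSTOMER columns):

```sql
CREATE OR REPLACE PROCEDURE addCustomer(INTEGER, VARCHAR(25), INTEGER, CHAR(15))
RETURNS INTEGER
LANGUAGE NZPLSQL
AS
BEGIN_PROC
BEGIN
END;
END_PROC;
```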
8.
Back in the nzsql command line execute the script we just created with \i addCustomer.sql
You should see, that the procedure has been created successfully
LABDB(ADMIN)=> \i addCustomer.sql
CREATE PROCEDURE
LABDB(ADMIN)=>
9.
Display all stored procedures in the LABDB database with the following command:
You can see the procedure ADDCUSTOMER with the arguments we specified.
10. Execute the stored procedure with the following dummy input parameters:
LABDB(LABADMIN)=> \e addCustomer.sql
12. Switch to insert mode by pressing i.
13. We will now create a simple stored procedure that inserts the new entry into the customer table. But first we will add
some variables that alias the input variables $1, $2, etc. After the BEGIN_PROC statement, enter the following lines:
DECLARE
C_KEY ALIAS FOR $1;
C_NAME ALIAS FOR $2;
N_KEY ALIAS FOR $3;
PHONE ALIAS FOR $4;
Each BEGIN..END block in the stored procedure can have its own DECLARE section. Variables are valid in the block they
belong to. It is good practice to map the input parameters to readable variable names to keep the stored
procedure code maintainable. We will later add some additional variables to our procedure as well.
Be careful not to use variable names that are reserved by PureData System, for example NAME.
14. Next we will add the BEGIN..END block with the INSERT statement.
BEGIN
INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
END;
This statement will add a new row to the customer table using the input variables. It fills the remaining fields, like
account balance, with default values that can be completed later. It is also possible to execute dynamic SQL queries, which we will
do in a later chapter.
Your complete stored procedure should now look like the following:
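Reconstructed from steps 13 and 14, the complete procedure at this point would look roughly like this (the parameter types are assumptions based on the TPC-H CUSTOMER columns):

```sql
CREATE OR REPLACE PROCEDURE addCustomer(INTEGER, VARCHAR(25), INTEGER, CHAR(15))
RETURNS INTEGER
LANGUAGE NZPLSQL
AS
BEGIN_PROC
  DECLARE
    C_KEY  ALIAS FOR $1;
    C_NAME ALIAS FOR $2;
    N_KEY  ALIAS FOR $3;
    PHONE  ALIAS FOR $4;
  BEGIN
    INSERT INTO CUSTOMER VALUES (C_KEY, C_NAME, '', N_KEY, PHONE, 0, '', '');
  END;
END_PROC;
```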
2.2 Adding Integrity Checks
In this chapter we will add integrity checks to the stored procedure we just created. We will make sure that no duplicate
customer is entered into the CUSTOMER table by querying it before the insert. We will then check with an IF condition whether the
key has already been inserted into the CUSTOMER table and abort the insert in that case. We will also check the foreign key
relationship to the NATION table and make sure that no customer is inserted for a nation that doesn't exist. If any of these
conditions isn't met, the procedure will abort and display an error message.
1.
Switch back to the vi view of the procedure with the following command. In case of a message warning about duplicate
files, press Enter.
LABDB(LABADMIN)=> \e addCustomer.sql
2.
3.
Add a new variable REC with the type RECORD in the DECLARE section:
REC RECORD;
A RECORD is a row set with dynamic fields. It can refer to any row that is selected in a SELECT INTO statement. You can
later refer to its fields, for example REC.C_PHONE.
4.
This statement fills the REC variable with the results of the query. If there are already one or more customers
with the specified key, it will contain the first; otherwise the variable will be null.
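The SELECT INTO referenced in this step would have roughly this shape (a sketch; C_CUSTKEY is the TPC-H customer key column, and the LIMIT guards against multiple matches):

```sql
SELECT * INTO REC FROM CUSTOMER WHERE C_CUSTKEY = C_KEY LIMIT 1;
```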
5.
Now we add the IF condition to abort the stored procedure in case a record already exists. After the newly added
SELECT statement, add the following lines:
IF FOUND REC THEN
RAISE EXCEPTION 'Customer with key % already exists', C_KEY;
END IF;
In this case we use an IF condition to check whether a customer record with the key already exists and has been selected by the
previous SELECT statement. We could do an implicit check on the record or any of its fields and see if it compares to the null
value, but PureData System provides a number of special variables that make this more convenient:
FOUND specifies whether the last SELECT INTO statement returned any records.
ROW_COUNT contains the number of rows found by the last SELECT INTO statement.
LAST_OID is the object id of the last inserted row; this variable is not very useful unless used on catalog tables.
Finally, we use a RAISE EXCEPTION statement to throw an error and abort the stored procedure. To add variable values to
the message string, use the % symbol anywhere in the string. This is similar to the approach used, for example, by the C printf
function.
6.
We will also check the foreign key relationship to NATION; add the following lines after the last ones:
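The NATION check mirrors the duplicate check, but aborts when no matching row is found; a hedged sketch (N_NATIONKEY is the TPC-H nation key column):

```sql
SELECT * INTO REC FROM NATION WHERE N_NATIONKEY = N_KEY;
IF NOT FOUND REC THEN
    RAISE EXCEPTION 'Nation with key % does not exist', N_KEY;
END IF;
```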
Save the stored procedure by pressing ESC, then entering :wq! and pressing Enter.
8.
In NZSQL, create the stored procedure from the script by executing the following command (remember that you can
cycle through previous commands by pressing the UP key):
LABDB(LABADMIN)=> \i addCustomer.sql
9.
Now let's test the check for duplicate customer ids by repeating our last CALL statement; we already know that a
customer record with the id 999999 exists:
10. Now let's check the foreign key integrity by executing the following command with a customer id that does not yet exist
and a nation key that doesn't exist in the NATION table either. You can double check this using SELECT statements if
you want:
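A hypothetical invocation along these lines (the argument list and the chosen keys are illustrative assumptions, not from the original lab):

```sql
-- Customer key 1000000 is assumed new; nation key 99 is assumed absent
-- from NATION, so the foreign key check should raise an exception.
CALL addCustomer(1000000, 'new customer', '123 Main Street', 99,
                 '555-1234', 0.00, 'BUILDING', 'foreign key test');
```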
2.3
In the last chapters we have created a stored procedure that inserts values into the CUSTOMER table and checks constraints.
We will now grant a user the right to execute this procedure, and we will use the management functions to make changes to the
stored procedure and verify them.
1.
First we will create a user custadmin who will be responsible for adding customers. To do this we will need to switch to
the admin user, since users are global objects:
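A sketch of the two commands (the password is a placeholder, not the one used in the lab):

```sql
-- Reconnect as the admin user, then create the global user object.
\c labdb admin
CREATE USER custadmin WITH PASSWORD 'password';
```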
2.
Now we will grant him access to the labdb database, since otherwise he could not log on.
Finally we will grant him the right to select from the CUSTOMER table; he will need this to verify any changes he
has made:
Now let's test this out. First switch to the user custadmin:
Now try to select something from the NATION table to verify that the user only has access to the CUSTOMER table:
Finally let's verify that the user doesn't have INSERT rights on the table:
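A sketch of these checks, run as custadmin; both data statements should fail with permission errors:

```sql
\c labdb custadmin
-- Should fail: custadmin has no SELECT right on NATION.
SELECT * FROM NATION LIMIT 1;
-- Should fail: custadmin has SELECT but no INSERT right on CUSTOMER.
INSERT INTO CUSTOMER SELECT * FROM CUSTOMER LIMIT 1;
```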
We now need to switch back to the admin user to give custadmin the rights to execute the stored procedure:
10. To grant the right to execute a specific stored procedure we need to specify the full name including all input parameters.
The easiest way to get these in the correct syntax is to first list them with the SHOW PROCEDURE command:
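The commands here can be sketched as follows. The parameter list in the GRANT is hypothetical; it must match exactly what SHOW PROCEDURE reports for addCustomer, and \dpu is the nzsql meta-command for listing a user's privileges.

```sql
-- Show the procedure's full signature first.
SHOW PROCEDURE addCustomer;

-- Grant execute on that exact signature (types shown are assumptions).
GRANT EXECUTE ON addCustomer(INTEGER, VARCHAR(25), VARCHAR(40), INTEGER,
    VARCHAR(15), NUMERIC(15,2), VARCHAR(10), VARCHAR(117)) TO custadmin;

-- List the permissions granted to custadmin.
\dpu custadmin
```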
You can see that the user has only the rights we have given him. He can select data from the customer table and execute
our stored procedure but he is not allowed to change the customer table directly or execute anything but the stored
procedure.
13. Let's test this. Switch to the custadmin user with the following command: \c labdb custadmin
14. Add another customer to the customer table:
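A hypothetical call (the argument types must match the procedure's signature); with EXECUTE granted, this now succeeds even though custadmin cannot INSERT into the table directly:

```sql
-- All argument values here are illustrative assumptions.
CALL addCustomer(1000001, 'another customer', '456 Side Street', 1,
                 '555-5678', 0.00, 'BUILDING', 'added by custadmin');
```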
You can see the input and output arguments, the procedure name, the owner, whether it is executed as owner or caller, and
other details. Verbose also shows you the source code of the stored procedure. We see that the description field is still
empty, so let's add a comment to the stored procedure. This is important if you have a large number of stored procedures in
your system.
Note: nzadmin is a very convenient way to manage your stored procedures; it provides most of the management functionality
used in this lab in a graphical UI.
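The listing described above can be produced with the VERBOSE option of the SHOW PROCEDURE command:

```sql
-- Shows arguments, owner, execution mode, description and source code.
SHOW PROCEDURE addCustomer VERBOSE;
```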
17. Add a description to the stored procedure:
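A sketch of the comment command; like GRANT EXECUTE, COMMENT ON PROCEDURE needs the full signature, and the parameter list shown is a hypothetical assumption:

```sql
COMMENT ON PROCEDURE addCustomer(INTEGER, VARCHAR(25), VARCHAR(40), INTEGER,
    VARCHAR(15), NUMERIC(15,2), VARCHAR(10), VARCHAR(117))
IS 'Adds a customer after checking for duplicate keys and a valid nation key';
```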
19. We will now alter the stored procedure to be executed as the caller instead of the owner. This means that whoever
executes the stored procedure needs to have access rights to all the objects that are touched in the stored procedure;
otherwise it will fail. This should be the default for stored procedures that encapsulate business logic and do not do
extensive data checking:
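A sketch of the ALTER command (the parameter list is a hypothetical assumption and must match the procedure's signature):

```sql
ALTER PROCEDURE addCustomer(INTEGER, VARCHAR(25), VARCHAR(40), INTEGER,
    VARCHAR(15), NUMERIC(15,2), VARCHAR(10), VARCHAR(117))
EXECUTE AS CALLER;
```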
As expected, the stored procedure now fails. The user custadmin has read access to the CUSTOMER table but no read
access to the NATION table, therefore the foreign key check results in an exception. While EXECUTE AS CALLER is more
secure in some circumstances, it doesn't fit our use case, where we specifically want to expose some data modification
ability to a user who shouldn't be able to modify the table otherwise. Therefore we will change the stored procedure back:
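The revert mirrors the previous ALTER (same hypothetical parameter list):

```sql
ALTER PROCEDURE addCustomer(INTEGER, VARCHAR(25), VARCHAR(40), INTEGER,
    VARCHAR(15), NUMERIC(15,2), VARCHAR(10), VARCHAR(117))
EXECUTE AS OWNER;
```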
In this chapter you have set up the permissions for the addCustomer stored procedure and the user custadmin who is
supposed to use it. You also added comments to the stored procedure.
The procedure will return each row of the REGION table together with additional columns that describe whether the above
constraints are broken. It will also return a notice with the number of faulty rows.
This chapter will teach you to use loops in a stored procedure and to return table results. You will also use dynamic query
execution to create queries on the fly.
You should be familiar with the use of VI for the development of stored procedures from the last chapter. Alternatives to using a
standard text editor for the creation of your stored procedure would be the use of a graphical development environment like
Aginity or the PureData System Eclipse plugins that can be downloaded from the PureData System Developer Network.
1.
Open the already existing empty file checkRegion.sql with the following command (note that you can tab-complete the filename):
LABDB(ADMIN)=> \e checkRegion.sql
2.
You are now in the familiar VI interface and you can edit the file. Switch to INSERT mode by pressing i
3.
First we will define the stored procedure header similar to the last procedure. It will be very simple since we will not use
any input arguments. Enter the following code to the editor:
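A sketch of the header, using the REFTABLE return type (the referenced table TB1 must already exist when the procedure is created; the body is filled in by the following steps):

```sql
CREATE OR REPLACE PROCEDURE checkRegions()
RETURNS REFTABLE(TB1)   -- result rows use TB1's definition, not its data
LANGUAGE NZPLSQL AS
BEGIN_PROC
-- body is added in the following steps
END_PROC;
```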
Let's have a detailed look at the RETURNS section. We want to return a result set but do not have to describe the column
names or datatypes of the table object that is returned. Instead we reference an existing table, which needs to exist at the
time the stored procedure is created. This means we will need to create the table TB1 before executing the CREATE
PROCEDURE command.
When the stored procedure is executed it will create, under the covers, an empty temporary table that has the same
definition as the referenced table. So the results will not actually be saved in the referenced table, which is only
used for the definition. This means that multiple stored procedures can be executed at the same time without influencing
each other. Since the created table is temporary it will be cleaned up once the connection to the database is closed.
Note: If the referenced table contains rows they will neither be changed nor copied over to the temporary table; the table is
strictly used for reference.
4.
For our stored procedure we need four variables; add the following lines after the BEGIN_PROC statement:
DECLARE
rec RECORD;
errorRows INTEGER;
fieldEmpty BOOLEAN;
descUpper BOOLEAN;
The four variables are needed for our stored procedure:
rec is a RECORD structure; while we loop through the rows of the table we will use it to save and access the
values of each row and check them against our constraints
errorRows will be used to contain the total number of rows that violate our constraints
fieldEmpty will be used to store whether the row violates either the constraint that the name is empty or that the region
code is smaller than 1; this is appropriate since values of -1 or 0 in the region code are used to denote that it is empty
descUpper will be true if a record violates the constraint that the description needs to be lowercase
5.
We will now add the main BEGIN..END clause and initialize the errorRows variable. Add the following rows after the
DECLARE section:
BEGIN
RAISE NOTICE 'Start check of Region';
errorRows := 0;
END;
Each stored procedure must contain at least one BEGIN .. END clause, which encapsulates the executed commands. We
also initialize the number of error rows to 0 and display a short message.
6.
We will now add the main loop. It will iterate through all rows of the REGION table and store each row in the rec
variable. Add the following lines before the END statement:
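A sketch of the loop added in this step, together with the per-row initialization and the final notice this section describes (the constraint checks come in the next step):

```sql
FOR rec IN SELECT * FROM REGION LOOP
    -- reset the per-row flags at the start of each iteration
    fieldEmpty := false;
    descUpper  := false;
    -- constraint checks are added in the next step
END LOOP;
RAISE NOTICE '% rows had an error see result set', errorRows;
```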
The FOR rec IN expression LOOP .. END LOOP command is used to iterate through a result set, in our case a
SELECT * on the REGION table. The loop body is executed once for every row in the expression and the current row is saved
in the rec variable. The loop needs to be ended with the END LOOP keyword.
There are many other types of loops in NZPLSQL; for a complete set refer to the stored procedure guide.
For each iteration of the loop we initially set the values of fieldEmpty and descUpper to false. Variables can be
assigned with the := operator. Finally we display a notice that shows the number of rows that either had an empty field
or an upper case expression. This number is saved in the errorRows variable.
7.
Now it's time to check the rows against our constraints and set our variables accordingly. Enter the following rows after the
variable initialization and before the END LOOP keyword:
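A sketch of the checks, assuming the REGION columns R_REGIONKEY, R_NAME and R_COMMENT as shown later in this lab:

```sql
-- empty name or region code below 1 counts as an empty field
IF trim(rec.R_NAME) = '' OR rec.R_REGIONKEY < 1 THEN
    fieldEmpty := true;
END IF;
-- any upper case character makes the comment differ from its lowercased form
IF rec.R_COMMENT <> lower(rec.R_COMMENT) THEN
    descUpper := true;
END IF;
IF fieldEmpty OR descUpper THEN
    errorRows := errorRows + 1;
END IF;
```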
Finally add the following lines after the lines you just added and before the END LOOP statement:
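A sketch of the dynamic insert this section explains; REFTABLENAME is the special variable holding the name of the temporary result table:

```sql
-- Triple quotes produce a single escaped quote inside the string; trim
-- removes the blank padding of the CHAR columns.
EXECUTE IMMEDIATE 'INSERT INTO ' || REFTABLENAME || ' VALUES (' ||
    rec.R_REGIONKEY || ', ''' || trim(rec.R_NAME) || ''', ''' ||
    trim(rec.R_COMMENT) || ''', ' || fieldEmpty || ', ' || descUpper || ')';
```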
Since the name of the table is dynamic we need to execute the INSERT operations as a dynamic statement. This means
that the EXECUTE IMMEDIATE statement is used with a string that contains the query to be executed.
To add variable values to the string the pipe operator || is used. Note that the values for R_NAME and R_COMMENT are
inserted as strings, which means they need to be surrounded by quotes. To add quotes to a string they need to be escaped
with a second quote character. This is the reason that R_NAME and R_COMMENT are surrounded by triple quotes. Apart from
that we trim them, so the inserted VARCHAR values are not padded with blank characters.
It can be tricky to construct a string like that and you will see the error only once it is executed. For debugging it can be
useful to construct the string and display it with a RAISE NOTICE statement.
9.
Your VI should now look like this, containing the complete stored procedure:
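The table-creation command described in the next paragraph can be sketched as a CTAS with a LIMIT 0 clause (the exact lab command is not shown here; the CAST is an assumption to give the new columns a BOOLEAN type):

```sql
-- Copy REGION's column definitions, add two BOOLEAN flag columns,
-- and keep the table empty via LIMIT 0.
CREATE TABLE TB1 AS
SELECT R.*,
       CAST(false AS BOOLEAN) AS FIELDEMPTY,
       CAST(false AS BOOLEAN) AS DESCUPPER
FROM REGION R LIMIT 0;
```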
This command creates a table TB1 that has all the columns of the REGION table plus two additional BOOLEAN fields,
FIELDEMPTY and DESCUPPER. It will also be empty because we used the LIMIT 0 clause.
12. Describe the reference table with \d TB1
You should see the following result:
LABDB(ADMIN)=> \d TB1
                     Table "TB1"
  Attribute  |          Type          | Modifier | Default Value
-------------+------------------------+----------+---------------
 R_REGIONKEY | INTEGER                | NOT NULL |
 R_NAME      | CHARACTER(25)          | NOT NULL |
 R_COMMENT   | CHARACTER VARYING(152) |          |
 FIELDEMPTY  | BOOLEAN                |          |
 DESCUPPER   | BOOLEAN                |          |
Distributed on hash: "R_REGIONKEY"
You can see the three columns of the REGION table and the two additional BOOLEAN fields that will indicate for each
row whether it violates the specified constraints.
Note that this table needs to exist before the procedure can be created.
13. Now create the stored procedure. Execute the script you just created with the following command:
LABDB(ADMIN)=> \i checkRegion.sql
You should successfully create your stored procedure.
14. Now let's have a look at our REGION table and select all rows:
LABDB(ADMIN)=> SELECT * FROM REGION;
You will get the following results:
LABDB(ADMIN)=> SELECT * FROM REGION;
 R_REGIONKEY |          R_NAME           |          R_COMMENT
-------------+---------------------------+-----------------------------
           2 | sa                        | south america
           1 | na                        | north america
           4 | ap                        | asia pacific
           3 | emea                      | europe, middle east, africa
(4 rows)
We can see that none of the rows violates the constraints we defined, which would be pretty boring. So let's test
our stored procedure by adding two rows that do violate them.
15. Add the two violating rows with the following commands:
LABDB(ADMIN)=> INSERT INTO REGION VALUES (0, 'as', 'Australia');
This row violates the lowercase constraint for the comment field and the empty field constraint for the region key.
LABDB(ADMIN)=> INSERT INTO REGION VALUES (6, '', 'mongolia');
This row violates the empty field constraint for the region name.
16. Now let's finally try our checkRegions stored procedure:
LABDB(ADMIN)=> call checkRegions();
You should see the following output:
LABDB(ADMIN)=> call checkRegions();
NOTICE: Start check of Region
NOTICE: 2 rows had an error see result set
 R_REGIONKEY |          R_NAME           |          R_COMMENT          | FIELDEMPTY | DESCUPPER
-------------+---------------------------+-----------------------------+------------+-----------
           1 | na                        | north america               | f          | f
           3 | emea                      | europe, middle east, africa | f          | f
           0 | as                        | Australia                   | t          | t
           4 | ap                        | asia pacific                | f          | f
           2 | sa                        | south america               | f          | f
           6 |                           | mongolia                    | t          | f
(6 rows)
You can see the expected results. Our stored procedure has found the two rows that violate the constraints we check for.
In the FIELDEMPTY and DESCUPPER columns we can easily see that the row with key 0 has both an empty field
and an uppercase comment. We can also see that the row with key 6 only violates the empty field constraint.
Note that the TB1 table we created doesn't contain any rows; it is only used as a template.
17. Finally let's clean up our REGION table again:
LABDB(ADMIN)=> DELETE FROM REGION WHERE R_REGIONKEY = 0 OR R_REGIONKEY = 6;
18. And let's run our checkRegions procedure again:
LABDB(ADMIN)=> call checkRegions();
You will see the following results:
LABDB(ADMIN)=> call checkRegions();
NOTICE: Start check of Region
NOTICE: 0 rows had an error see result set
 R_REGIONKEY |          R_NAME           |          R_COMMENT          | FIELDEMPTY | DESCUPPER
-------------+---------------------------+-----------------------------+------------+-----------
           3 | emea                      | europe, middle east, africa | f          | f
           4 | ap                        | asia pacific                | f          | f
           1 | na                        | north america               | f          | f
           2 | sa                        | south america               | f          | f
(4 rows)
You can see that the table is now error free and all constraint violation fields are false.
Congratulations, you have finished the stored procedure lab and created two stored procedures that help you
manage your database.