
Tuning & Tracing Parallel Execution

(An Introduction)

Doug Burns
(dougburns@yahoo.com)
Introduction
• Introduction
• Parallel Architecture
• Configuration
• Dictionary Views
• Tracing and Wait Events
• Conclusion
Introduction
• Parallel Query Option introduced in 7.1
– Now called Parallel Execution

• Parallel Execution splits a single large task into
multiple smaller tasks which are handled by
separate processes running concurrently.
– Full Table Scans
– Partition Scans
– Sorts
– Index Creation
– And others …
Introduction
• A little history

• So why did so few sites implement PQO?
– Lack of understanding
– Leads to horrible early experiences
– Community's resistance to change
– Not useful in all environments
– Needs time and effort applied to the initial design!

• Isn’t Oracle’s Instance architecture parallel anyway?
Introduction
• Non-Parallel Architecture?
Parallel Architecture
• Introduction
• Parallel Architecture
• Configuration
• Dictionary Views
• Tracing and Wait Events
• Conclusion
Parallel Architecture
[Diagram: the EMP table accessed two ways.
Non-parallel: User Process → Server Process, running "select * from emp;".
Parallel: User Process → Query Coordinator (QC), Degree 2, running
"select /*+ parallel(emp,2) */ * from emp;" — Slave 0 reading the 1st half
of the table, Slave 1 reading the 2nd half.]
Parallel Architecture
• The Degree of Parallelism (DOP) refers to the
number of discrete threads of work

• The default DOP for an Instance is calculated as
– cpu_count * parallel_threads_per_cpu
– Used if I don’t specify a DOP in a hint or table
definition

• The maximum number of PX slaves is:
– DOP * 2
– Plus the Query Coordinator
– But this is per Data Flow Operation
– And the slaves will be re-used
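As a quick check of those defaults, both inputs to the calculation can be read from v$parameter (a sketch — the parameter names are standard, but the values will obviously differ per system):

```sql
-- Inputs to the default DOP: default DOP = cpu_count * parallel_threads_per_cpu
SELECT name, value
FROM   v$parameter
WHERE  name IN ('cpu_count', 'parallel_threads_per_cpu');
```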
Parallel Architecture
[Diagram: "select * from emp order by name;" with DOP 2.
The QC acts as Ranger. Slave 0 reads the 1st half of EMP and Slave 1 the
2nd half (producers); Slave 2 sorts A–P and Slave 3 sorts Q–Z (consumers),
returning the ordered rows to the QC.]

• Inter-process communication is through message buffers (also
known as table queues)
• These can be stored in the shared pool or the large pool
Parallel Architecture

This slide intentionally left blank
Parallel Architecture
• Methods of invoking Parallel Execution
– Table / Index Level
ALTER TABLE emp PARALLEL(DEGREE 2);
– Optimizer Hints
SELECT /*+ PARALLEL(emp) */ *
FROM emp;
• Note: Using Parallel Execution implies that you will
be using the Cost-Based Optimizer
• As usual, appropriate statistics are vital
– Statement Level
ALTER INDEX emp_idx_1 REBUILD
PARALLEL 8;
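To confirm what degree actually ended up recorded against the objects, the DEGREE column of the dictionary views can be checked (a sketch — EMP and EMP_IDX_1 are just the example names from this slide):

```sql
-- DEGREE shows the object-level parallel setting ('1', a number, or 'DEFAULT')
SELECT table_name, degree FROM user_tables  WHERE table_name = 'EMP';
SELECT index_name, degree FROM user_indexes WHERE index_name = 'EMP_IDX_1';
```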
Configuration
• Introduction
• Parallel Architecture
• Configuration
• Dictionary Views
• Tracing and Wait Events
• Conclusion
Configuration
• parallel_automatic_tuning
– First introduced in Oracle 8i
– This is the first parameter you should set - to TRUE
• An alternative point of view – don’t use it!
• Deprecated in 10g and the default is FALSE, but much of
the same functionality is implemented
– Ensures that message queues are stored in the
Large Pool rather than the Shared Pool
– It modifies the values of other parameters
– As well as the 10g default values, the following
sections show the values when
parallel_automatic_tuning is set to TRUE on previous
versions
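A simple way to see what parallel_automatic_tuning has done to the other parameters is to list them before and after setting it (a sketch using the standard v$parameter view):

```sql
-- Every parallel-related parameter, its value and whether it is still defaulted
SELECT name, value, isdefault
FROM   v$parameter
WHERE  name LIKE 'parallel%'
ORDER  BY name;
```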
Configuration
• parallel_adaptive_multi_user
– First introduced in Oracle 8
– Default Value – FALSE (TRUE in 10g)
– Automatic Tuning Default – TRUE
– Designed for sites using PX for online usage
– As workload increases, new statements will have
their degree of parallelism down-graded.
– Effective Oracle by Design – Tom Kyte
‘This provides the best of both worlds and what users
expect from a system. They know that when it is busy,
it will run slower.’
Configuration
• parallel_max_servers
– Default - cpu_count * parallel_threads_per_cpu * 2 (if
using automatic PGA management) * 5
• e.g. 1 CPU * 2 * 2 * 5 = 20 on my laptop
– The maximum number of parallel execution slaves
available for all sessions in this instance.
– Watch out for the processes trap!

• parallel_min_servers
– Default - 0
– May choose to increase this if PX usage is constant to
reduce overhead of starting and stopping slave
processes.
More on this subject in tomorrow’s presentation
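For illustration only — the numbers below are placeholders, not recommendations; size them from your own workload, and SCOPE = BOTH assumes you are running with an spfile:

```sql
ALTER SYSTEM SET parallel_max_servers = 40 SCOPE = BOTH;
ALTER SYSTEM SET parallel_min_servers = 8  SCOPE = BOTH;
```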
Configuration
• parallel_execution_message_size
– Default Value – 2148 bytes
– Automatic Tuning Default – 4Kb
– Maximum size of a message buffer
– May be worth increasing to 8Kb, depending on wait
event analysis.
– However, small increases in message size could lead
to large increases in large pool memory
requirements
– Remember that DOP * 2 relationship and multiple
sessions
Configuration
• Metalink Note 201799.1 contains full details and guidance for setting all parameters
• Ensure that standard parameters are also set appropriately
– large_pool_size
• Modified by parallel_automatic_tuning
• Calculation in Data Warehousing Guide
• Can be monitored using v$sgastat
– processes
• Modified by parallel_automatic_tuning
– sort_area_size
• For best results use automatic PGA management
• Be aware of _smm_px_max_size
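The v$sgastat monitoring mentioned above can be as simple as watching large pool consumption while PX queries run (a sketch):

```sql
-- Large pool usage; 'PX msg pool' and 'free memory' rows show PX buffer consumption
SELECT pool, name, bytes
FROM   v$sgastat
WHERE  pool = 'large pool';
```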

Dictionary Views
• Introduction
• Parallel Architecture
• Configuration
• Dictionary Views
• Tracing and Wait Events
• Conclusion
Dictionary Views
• Parallel-specific Dictionary Views
SELECT table_name
FROM dict
WHERE table_name LIKE 'V%PQ%' OR table_name LIKE 'V%PX%';

TABLE_NAME
------------------------------
V$PQ_SESSTAT
V$PQ_SYSSTAT
V$PQ_SLAVE
V$PQ_TQSTAT
V$PX_BUFFER_ADVICE
V$PX_SESSION
V$PX_SESSTAT
V$PX_PROCESS
V$PX_PROCESS_SYSSTAT

– Also GV$PQ_SESSTAT and GV$PQ_TQSTAT with INST_ID


Dictionary Views
• v$pq_sesstat
– Provides statistics relating to the current session
– Useful for verifying that a specific query is using parallel
execution as expected
SELECT * FROM v$pq_sesstat;

STATISTIC                      LAST_QUERY SESSION_TOTAL
------------------------------ ---------- -------------
Queries Parallelized 1 1
DML Parallelized 0 0
DDL Parallelized 0 0
DFO Trees 1 1
Server Threads 3 0
Allocation Height 3 0
Allocation Width 1 0
Local Msgs Sent 217 217
Distr Msgs Sent 0 0
Local Msgs Recv'd 217 217
Distr Msgs Recv'd 0 0
Dictionary Views
• v$pq_sysstat
– The instance-level overview
– Various values, including information to help set
parallel_min_servers and parallel_max_servers
– v$px_process_sysstat contains similar information
SELECT * FROM v$pq_sysstat WHERE statistic LIKE 'Servers%';

STATISTIC VALUE
------------------------------ ----------
Servers Busy 0
Servers Idle 0
Servers Highwater 3
Server Sessions 3
Servers Started 3
Servers Shutdown 3
Servers Cleaned Up 0
Dictionary Views
• v$pq_slave
– Gives information on the activity of individual PX slaves
– v$px_process contains similar information

SELECT slave_name, status, sessions, msgs_sent_total, msgs_rcvd_total
FROM v$pq_slave;

SLAV STAT   SESSIONS MSGS_SENT_TOTAL MSGS_RCVD_TOTAL
---- ---- ---------- --------------- ---------------
P000 BUSY 3 465 508
P001 BUSY 3 356 290
P002 BUSY 3 153 78
P003 BUSY 3 108 63
P004 IDLE 2 249 97
P005 IDLE 2 246 97
P006 IDLE 2 239 95
P007 IDLE 2 249 96
Dictionary Views
• v$pq_tqstat
– Shows communication relationship between slaves
– Must be executed from a session that’s been using parallel
operations – refers to this session
– Example 1 – Attendance Table (25,481 rows)
break on dfo_number on tq_id

SELECT /*+ PARALLEL (attendance, 4) */ *
FROM attendance;

SELECT dfo_number, tq_id, server_type, process, num_rows, bytes
FROM v$pq_tqstat
ORDER BY dfo_number DESC, tq_id, server_type DESC, process;

DFO_NUMBER      TQ_ID SERVER_TYP PROCESS      NUM_ROWS      BYTES
---------- ---------- ---------- ---------- ---------- ----------
1 0 Producer P000 6605 114616
Producer P001 6102 105653
Producer P002 6251 110311
Producer P003 6523 113032
Consumer QC 25481 443612
Dictionary Views
• Example 2 - with a sort operation

SELECT /*+ PARALLEL (attendance, 4) */ *
FROM attendance
ORDER BY amount_paid;

DFO_NUMBER      TQ_ID SERVER_TYP PROCESS      NUM_ROWS      BYTES
---------- ---------- ---------- ---------- ---------- ----------
1 0 Ranger QC 372 13322
Producer P004 5744 100069
Producer P005 6304 110167
Producer P006 6303 109696
Producer P007 7130 124060
Consumer P000 15351 261380
Consumer P001 10129 182281
Consumer P002 0 103
Consumer P003 1 120
1 Producer P000 15351 261317
Producer P001 10129 182238
Producer P002 0 20
Producer P003 1 37
Consumer QC 25481 443612
Dictionary Views
• So why the unbalanced slaves?
– Check the list of distinct values in amount_paid

SELECT amount_paid, COUNT(*)
FROM attendance
GROUP BY amount_paid
ORDER BY amount_paid
/
AMOUNT_PAID COUNT(*)
----------- ----------
200 1
850 1
900 1
1000 7
1150 1
1200 15340
1995 10129
4000 1
Dictionary Views
• v$px_session and v$px_sesstat
– Query to show slaves and physical reads
break on qcsid on server_set

SELECT stat.qcsid, stat.server_set, stat.server#, nam.name, stat.value
FROM v$px_sesstat stat, v$statname nam
WHERE stat.statistic# = nam.statistic#
AND nam.name = 'physical reads'
ORDER BY 1,2,3;

     QCSID SERVER_SET    SERVER# NAME                      VALUE
---------- ---------- ---------- -------------------- ----------
145 1 1 physical reads 0
2 physical reads 0
3 physical reads 0
2 1 physical reads 63
2 physical reads 56
3 physical reads 61
physical reads 4792
Dictionary Views
• v$px_process
– Shows parallel execution slave processes, status and
session information

SELECT * FROM v$px_process;

SERV STATUS           PID SPID                SID    SERIAL#
---- --------- ---------- ------------ ---------- ----------
P001 IN USE 18 7680 144 17
P004 IN USE 20 7972 146 11
P005 IN USE 21 8040 148 25
P000 IN USE 16 7628 150 16
P006 IN USE 24 8100 151 66
P003 IN USE 19 7896 152 30
P007 AVAILABLE 25 5804
P002 AVAILABLE 12 6772
Dictionary Views
• Monitoring the SQL being executed by slaves
set pages 0
column sql_text format a60
 
select p.server_name,
sql.sql_text
from v$px_process p, v$sql sql, v$session s
WHERE p.sid = s.sid AND p.serial# = s.serial#
AND s.sql_address = sql.address AND s.sql_hash_value = sql.hash_value
/

– 9i Results
P001 SELECT A1.C0 C0,A1.C1 C1,A1.C2 C2,A1.C3 C3,A1.C4 C4,A1.C5 C5,
A1.C6 C6,A1.C7 C7 FROM :Q3000 A1 ORDER BY A1.C0

– 10g Results
P001 SELECT /*+ PARALLEL (attendance, 2) */ * FROM attendance
ORDER BY amount_paid
Dictionary Views
• Additional information in standard Dictionary
Views
– e.g. v$sysstat

SELECT name, value FROM v$sysstat WHERE name LIKE 'PX%';

NAME VALUE
---------------------------------------------- ----------
PX local messages sent 4895
PX local messages recv'd 4892
PX remote messages sent 0
PX remote messages recv'd 0
Dictionary Views
• Monitoring the adaptive multi-user algorithm
– We need to be able to check whether operations are
being downgraded and by how much
– Downgraded to serial could be a particular problem!

SELECT name, value FROM v$sysstat WHERE name LIKE 'Parallel%';

NAME                                                                  VALUE
---------------------------------------------------------------- ----------
Parallel operations not downgraded                                   546353
Parallel operations downgraded to serial                                432
Parallel operations downgraded 75 to 99 pct                             790
Parallel operations downgraded 50 to 75 pct                            1454
Parallel operations downgraded 25 to 50 pct                            7654
Parallel operations downgraded 1 to 25 pct                            11873

– P*ssed-off users: 432
Dictionary Views
• Statspack
– Example Report (Excerpt)
– During overnight batch operation
– Mainly Bitmap Index creation
– Slightly difficult to read
Parallel operations downgraded 1 0
Parallel operations downgraded 25 0
Parallel operations downgraded 50 7
Parallel operations downgraded 75 38
Parallel operations downgraded to 1
Parallel operations not downgrade 22

– With one stream downgraded to serial, the rest of the schedule may depend on this one job.
Tracing and Wait Events
• Introduction
• Parallel Architecture
• Configuration
• Dictionary Views
• Tracing and Wait Events
• Conclusion
Tracing and Wait Events
• Tracing Parallel Execution operations is more complicated than standard
tracing
– One trace file per slave (as well as the query coordinator)
– Potentially 5 trace files even with a DOP of 2
– May be in background_dump_dest or user_dump_dest (usually background_dump_dest)

• Optimizing Oracle Performance


– Millsap and Holt
‘The remaining task is to identify and analyze all of the
relevant trace files. This task is usually simple …’

Tracing and Wait Events
• Much simpler in 10g
– Use trcsess to generate a consolidated trace file for QC and all slaves
exec dbms_session.set_identifier('PX_TEST');

REM tracefile_identifier is optional, but might make things easier for you
alter session set tracefile_identifier='PX_TEST';
exec dbms_monitor.client_id_trace_enable('PX_TEST');

REM DO WORK

exec dbms_monitor.client_id_trace_disable('PX_TEST');

REM GENERATE THE CONSOLIDATED TRACE FILE AND THEN RUN IT THROUGH TKPROF

trcsess output=/ora/admin/TEST1020/udump/PX_TEST.trc clientid=PX_TEST /ora/admin/TEST1020/udump/*px_test*.trc /ora/admin/TEST1020/bdump/*.trc

tkprof /ora/admin/TEST1020/udump/PX_TEST.trc /ora/admin/TEST1020/udump/PX_TEST.out


Tracing and Wait Events
• This is what one of the slaves looks like
C:\oracle\product\10.2.0\admin\ORCL\udump>cd ../bdump
C:\oracle\product\10.2.0\admin\ORCL\bdump>more orcl_p000_2748.trc

<SNIPPED>

*** SERVICE NAME:(SYS$USERS) 2006-03-07 10:57:29.812


*** CLIENT ID:(PX_TEST) 2006-03-07 10:57:29.812
*** SESSION ID:(151.24) 2006-03-07 10:57:29.812
WAIT #0: nam='PX Deq: Msg Fragment' ela= 13547 sleeptime/senderid=268566527 passes=1 p3=0 obj#=-1 tim=3408202924
=====================
PARSING IN CURSOR #1 len=60 dep=1 uid=70 oct=3 lid=70 tim=3408244715 hv=1220056081 ad='6cc64000'
select /*+ parallel(test_tab3, 2) */ count(*)
from test_tab3
END OF STMT
Tracing and Wait Events
• Many more wait events and more time spent
waiting
– The various processes need to communicate with
each other
– Metalink Note 191103.1 lists the wait events related
to Parallel Execution
– But be careful of what ‘Idle’ means
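Before diving into trace files, a quick point-in-time look at who is currently waiting on PX events is one line of SQL (a sketch against the standard v$session_wait view):

```sql
-- Sessions currently waiting on Parallel Execution events
SELECT sid, event, state, seconds_in_wait
FROM   v$session_wait
WHERE  event LIKE 'PX%'
ORDER  BY seconds_in_wait DESC;
```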
Tracing and Wait Events
• Events indicating consumers or QC are waiting
for data from producers
– PX Deq: Execute Reply
– PX Deq: Table Q Normal

• Although considered idle events, if these waits are
excessive, it could indicate a problem in the
performance of the slaves

• Investigate the slave trace files
Tracing and Wait Events
• Events indicating producers are quicker than
consumers (or QC)
– PX qref latch

• Try increasing parallel_execution_message_size as
this might reduce the communications overhead

• Although it could make things worse if the consumer is
just taking time to process the incoming data.
Tracing and Wait Events
• Messaging Events
– PX Deq Credit: need buffer
– PX Deq Credit: send blkd

• Although there may be many waits, the time spent
should not be a problem.

• If it is, perhaps you have an extremely busy server that
is struggling to cope
– Reduce DOP?
– Increase parallel_execution_message_size?
– Don’t use PX?
Tracing and Wait Events
• Query Coordinator waiting for the slaves to
parse their SQL statements
– PX Deq: Parse Reply

• If there are any significant waits for this event, this may
indicate you have shared pool resource issues.

• Or you’ve encountered a bug!
Tracing and Wait Events
• Partial Message Event
– PX Deq: Msg Fragment

• May be eliminated or improved by increasing
parallel_execution_message_size

• Not an issue on recent tests
Tracing and Wait Events
• Example
– Excerpt from an overnight Statspack Report
                                                        
Event                        Waits  Timeouts  Time (s)  Avg(ms)  Waits/txn
direct path read         2,249,666         0   115,813       51       25.5
PX Deq: Execute Reply      553,797    22,006    75,910      137        6.3
PX qref latch               77,461    39,676    42,257      546        0.9
library cache pin           27,877    10,404    31,422     1127        0.3
db file scattered read   1,048,135         0    25,144       24       11.9

– Direct Path Reads
• Sort I/O
• Read-ahead
• PX Slave I/O
• The average wait time – SAN!
Tracing and Wait Events
                                                     
Event                        Waits  Timeouts  Time (s)  Avg(ms)  Waits/txn
direct path read         2,249,666         0   115,813       51       25.5
PX Deq: Execute Reply      553,797    22,006    75,910      137        6.3
PX qref latch               77,461    39,676    42,257      546        0.9
library cache pin           27,877    10,404    31,422     1127        0.3
db file scattered read   1,048,135         0    25,144       24       11.9

– PX Deq: Execute Reply
• Idle event – QC waiting for a response from slaves
• Some waiting is inevitable
– PX qref latch
• Largely down to the extreme use of Parallel Execution
• Practically unavoidable but perhaps we could increase
parallel_execution_message_size?
– Library cache pin?
• Need to look at the trace files
Conclusion
• Introduction
• Parallel Architecture
• Configuration
• Dictionary Views
• Tracing and Wait Events
• Conclusion
Conclusion
• Plan / Test / Implement
– Asking for trouble if you don’t!
• Hardware
– It’s designed to suck the server dry
– Trying to squeeze a quart into a pint pot will make
things slow down due to contention
• Tune the SQL first
– All the old rules apply
– The biggest improvements come from doing less
unnecessary work in the first place
– Even if PX does make things go quickly enough, it’s
going to use a lot more resources doing so
Conclusion
• Don’t use it for small, fast tasks
– They won’t go much quicker
– They might go slower
– They will use more resources

• Don’t use it for online
– Not unless it’s a handful of users
– With a predictable maximum number of concurrent
activities
– Who understand the implications and won’t go crazy
when something takes four times as long as normal!
– It gives a false initial perception of high performance and
isn’t scalable
– Okay, Tom, set parallel_adaptive_multi_user to TRUE
Conclusion
• The slower your I/O sub-system, the more
benefit you are likely to see from PX
– But shouldn’t you fix the underlying problem?
– More on this in the next presentation

• Consider whether PX is the correct parallel
solution for overnight batch operations
– A single stream of parallel jobs?
– Parallel streams of single-threaded jobs?
– Unfortunately you’ll probably have to do some work
to prove your ideas!
Tuning & Tracing Parallel Execution
(An Introduction)

Doug Burns
(dougburns@yahoo.com)
(oracledoug.blogspot.com)
(doug.burns.tripod.com)
