Troubleshooting Replication Propagation Issues

Troubleshooting Guide: Replication Propagation [ID 1035874.
6]
Modified 10-MAY-2010 Type TROUBLESHOOTING Status

PUBLISHED
In this Document
Purpose
Scope & Application
Last Review Date
Instructions for the Reader
Troubleshooting Details
1. Overview of how data moves between Replicated Site
2. Checking the replication propagation environment is configured correctly
2.1 Replication group status
2.2 Replicated object triggers, packages and status
2.3 Status of existing admin requests
3. Checking the automatic propagation mechanism is working
3.1 Checking for errors
3.2 Check job_queue_processes is set
3.3 Has a push / purge job been scheduled
3.4 Check the status of the propagation jobs
3.5 Check the propagator and their private database links
3.6 Check if the push job is currently running
3.7 Terminating a deferred queue push job that is currently running
4. Diagnosing the progress and status of replicated transactions
4.1 Verify transactions can be manually pushed
4.2 Verifying transactions are being propagated
4.3 Deferred transaction propagated but not applied to receiving site
4.3.1 Checking for deferred errors in the queue
4.3.2 Stop on Error
4.4 Verify the purge transaction operation
4.4.1 Verify transactions can be manually purged
4.4.2 Verifying transactions are being purged
4.5 Large deferred transactions and slow propagation
5. Diagnosing hanging propagation
5.1 Check for locks
5.2 Check the wait events
5.3 Check if a large error is being queued
5.4 Advanced Analysis
6. How to clear down large deferred queues
7. Example of how deferred transactions are propagated
7.1 Data is inserted into the table:
7.2 Interrogating the deferred queue:
7.3 Identify the next transaction to be pushed:
7.4 Manually push transactions to REP901 and interrogate the deferred queue
7.5 Identify unpurged transactions
Applies to:
Oracle Server - Enterprise Edition - Version: 7.2.2.0 to 10.2.0.1 - Release: 7.2.2 to 10.2
Oracle Server - Enterprise Edition - Version: 7.2.2.0 to 10.2.0.1 [Release: 7.2.2 to 10.2]
Information in this document applies to any platform.
Checked for relevance on 27-Nov-2007
Checked for relevance on 16-Mar-2010.
Purpose
The purpose of this article is to provide basic steps for troubleshooting advanced replication
propagation and the underlying mechanism it uses; the deferred queue. Additional notes are
referenced through out this article that address specific issues or provide additional
information on a particular component used by Advanced Replication.
Scope & Application

To be used by Oracle support analysts and replication DBA's to understand and employ basic
troubleshooting techniques for advanced replication propagation.
Last Review Date

March 16, 2010
Instructions for the Reader
A Troubleshooting Guide is provided to assist in debugging a specific issue. When possible,
diagnostic tools are included in the document to assist in troubleshooting.
Troubleshooting Details
1. Overview of how data moves between Replicated Site
Oracle uses internalised triggers to capture and store any data changes made to tables that
have been defined as replicated. It stores the data changes as remote procedure calls (RPC's)
in a deferred transaction queue table (defcall) for propagation at a predefined time and
interval. Propagation to remote master sites is automated using the database job queue
mechanism. The job either applies the transaction to the remote master or if a transaction fails
it is placed in the remote deferred transaction error queue for analysis at a later
date.
2. Checking the replication propagation environment is configured correctly
A large majority of the problems encountered by replication users are due to incorrect setup
and configuration of the replicated environment. Refer to section 1 of note:122039.1 and
note:117434.1 before checking the following :
2.1 Replication group status
For propagation to be successful the status of the replication groups must be normal:
select gname, status from dba_repgroup;
GNAME STATUS
------------------------------ ---------
GROUP1 NORMAL
2.2 Replicated object triggers, packages and status
For the propagation and queuing of data changes to be successful to remote sites, replicated
tables (objects) must display as valid at all replication sites. They must also have the
associated internalised triggers and packages defined. Run the following query to check the
replicated tables (objects), in releases prior to Oracle 8.1.x:
select oname, status, generation_status

from dba_repobject
where status != 'VALID';
Run the following query to check the replicated tables (objects), in Oracle 8.1.x and above:
select oname, status, generation_status,

replication_trigger_exists, internal_package_exists
from dba_repobject
where status != 'VALID'
or replication_trigger_exists != 'Y'
or internal_package_exists != 'Y';
If the above query returns replication objects that do not have the associated internalised
triggers or packages, it may be necessary to re-generate replication support. If after re-
generation the replication objects still show as invalid, run the following statement to ensure
that all dependant SYS and SYSTEM objects are valid:
column owner format a25

column object_name format a30
column object_type format a15
select owner, object_name, object_type

from dba_objects
where status != 'VALID'
and owner in ('SYS', 'SYSTEM');
If problems persist, contact Oracle Support Services.

2.3 Status of existing admin requests
Data cannot move between replicated sites if there are outstanding admin requests for the
associated master group and there should not be any such requests if the group displays with
status normal. Use the following query to check for admin requests:
column gname format a18
select gname, request, status, errnum

from dba_repcatlog
order by id, gname;
For additional analysis of the replication administrative requests queue

please refer to note:180014.1 and note:122039.1.
2.4 Check database links
The correct configuration of database link is essential for replication propagation to operate
correctly. The following articles cover the links required by replication:
note:50593.1 Initial steps required to create Multi Master and Snapshot Replication v8.0
note:117434.1 Initial steps required to a create Multi Master Replication environment v8.1 /
v9.x/ v10.x
Check the existence of the public and private links with the following query at each site
involved to propagation:

column db_link format a45
column username format a15
select owner, db_link, username from dba_db_links;
OWNER DB_LINK USERNAME

--------------- ----------------------------------------- ---------------
PUBLIC DB2
REPADMIN DB2 REPADMIN
Test each of the links and ensure that the global name matches the link name with the
following query:
connect as database link owner

select * from global_name@;
It is important that the correct links exist for the user who owns the job that performs the
replication push job.
3. Checking the automatic propagation mechanism is working
Oracle Replication uses the job queue to automate the propagation and purging of deferred
transactions. If the jobs are not configured properly, they will not run as expected and their
associated tasks will not be completed.
Administrators usually become aware that there may be a problem with the job queue
mechanism when they discover that the replication deferred transaction queue is building up.
To check if the queue is growing run :
select count(*) from deftrandest ;
If the queue appears to be growing to an unusual size, use the following 3.x sections to ensure
that the automated jobs are not the cause. Do not attempt to quiesce the replication system
with SUSPEND_MASTER_ACTIVITY, as that will just try to push the queue first.
3.1 Checking for errors
If a job fails while attempting to push or purge replication data changes, errors will be written
to the alertSID.log. Additional and more detailed information will go into the following files
referred to by the alert.log:
- Pre V9 : SID_snpx_nnnnn.TRC
- V9 and above : SID_cjq0_nnnnn.TRC and SID_jnnn_nnnn.TRC
The format of these files may vary between operating systems, their location can be
determined by running the following from SQLPLUS:
SQL> show parameter dump_dest

3.2 Check job_queue_processes is set
The init.ora parameters that control the Oracle job queues must be set, execute the following
statement to check that they are correctly set at each replication site:
SQL> show parameter job
NAME TYPE VALUE

-------------------------------- ------- --------
job_queue_interval integer 30
job_queue_processes integer 4 <= must be > 0
job_queue_processes : A job queue process executes a single job at a time and this parameter
determines the maximum concurrent number of these. In most replication environments
configure this to be:
= number of replicated destinations (or connection qualifier destinations)

+ 1 (for administrative requests)
+ enough to service other concurrent non replication jobs if they exist
job_queue_interval : This parameter only applies to versions prior to Oracle9i, it specifies in

seconds how frequently each SNPn background process of the instance wakes up and checks
if there is a job to run. Its default value is 60 and if jobs need to execute more frequently
reduce this value.
3.3 Has a push / purge job been scheduled
For transactions to be propagated to remote master sites automatically, there needs to be a job
in the job queue to push them. There should be one push job per remote master site, check the
job exists with:
column what format a64

select job, what from dba_jobs where upper(what) like '%DBMS_DEFER_SYS%';
JOB WHAT
-------- ----------------------------------------------------------------
43 declare rc binary_integer; begin rc := sys.dbms_defer_sys.purge(
delay_seconds=>0); end;
44 declare rc binary_integer; begin rc := sys.dbms_defer_sys.push(d

estination=>'DB2.WORLD', stop_on_error=>FALSE, delay_seconds=>0, parallelism=>2);
end;

estination=>'DB3.WORLD', stop_on_error=>FALSE, delay_seconds=>0, parallelism=>2);
end;
If the push job exists it is important to know when it last ran and when it is next scheduled to
run, check the jobs schedule with:
column dblink format a24

column last_date format a9
column next_date format a9
column interval format a22
select job push_job, dblink,

SUBSTR(TO_CHAR(LAST_DATE,'MM/DD/YY HH24:MI:SS'),1,20) LAST_DATE,
SUBSTR(TO_CHAR(NEXT_DATE,'MM/DD/YY HH24:MI:SS'),1,20) NEXT_DATE,
interval
from defschedule
where job in (select job from dba_jobs
where upper(what) like '%DBMS_DEFER_SYS.PUSH%');
PUSH_JOB DBLINK LAST_DATE NEXT_DATE INTERVAL

---------- --------------------- --------- --------- ---------------------
44 DB2.WORLD 05/15/02 05/15/02 sysdate + 10/(60*24)
14:27:22 14:37:22
45 DB3.WORLD 05/15/02 05/15/02 sysdate + 1/1440

14:30:33 14:31:33
There may be two reasons that the next_date has been passed. The first is the job has failed, to
check this see section 3.4. The second is that the job is still running, to check this see section
3.6. It is important to note that the next_date is set when the job completes, so if the interval is
10 minutes and the job took 9 minutes to run the next_date will be 19 minutes from the time
the job started. Please refer to DBMS_JOB.CHANGE in note:61730.1 if the next date of the
job needs to be altered.
Oracle uses a lazy algorithm to purge deferred transactions from the local queue. It is
important that the purge job runs regularly to clear down this queue because the same
underlying table is used for transactions waiting to go to remote sites as for those which have
been pushed but not purged. If there is a large number of transactions to be purged it can
affect the performance of propagation, check the scheduled purge job exists with:
column last_date format a9

column next_date format a9
column interval format a40
select job purge_job,

SUBSTR(TO_CHAR(LAST_DATE,'MM/DD/YY HH24:MI:SS'),1,20) LAST_DATE,
SUBSTR(TO_CHAR(NEXT_DATE,'MM/DD/YY HH24:MI:SS'),1,20) NEXT_DATE,
interval
from defschedule
where upper(what) like '%DBMS_DEFER_SYS.PURGE%');
PURGE_JOB LAST_DATE NEXT_DATE INTERVAL

---------- --------- --------- ----------------------------------------
43 05/15/02 05/15/02 /*1:Hr*/ sysdate + 1/24
14:03:16 15:03:16
3.4 Check the status of the propagation jobs
If a job returns an error during execution, Oracle will try it again. The first attempt is made
after one minute, the second attempt after two minutes, the third after four minutes, etc. This
happens up to the original interval, so if the interval is two minutes, its 1 minute, 2 minutes, 2
minutes, ... If the job fails 16 times it is marked as broken, use the following to check how
many failures there have been and if the job has been marked as broken:
column broken format a6

select job, broken, failures from dba_jobs
where upper(what) like '%DBMS_DEFER_SYS%';
JOB BROKEN FAILURES

---------- ------ ----------
26 N 0
25 N 3
If the job is showing failures follow the advice in section 3.1, once the underlying problem
has been resolved unbreak the job with DBMS_JOB.BROKEN (see note:1018453.102).
Please note that the job may show failures = 0 and broken = Y is the job was manually broken
with DBMS_JOB.BROKEN.
3.5 Check the propagator and their private database links
The owner of the job that performs the push must be the replication propagator and that user
must have a private database link to the site where the job is pushing data to. See section 2.4
for details of the required database links and how to check they are working correctly.
Use the following SQL to check the propagator, links and push jobs all match up:
select * from defpropagator;
USERNAME USERID STATUS CREATED

------------------------------ ---------- ------- ---------
REPADMIN 31 VALID 18-APR-02
select db_link from dba_db_links

where owner = (select username from defpropagator);
DB_LINK
----------------------------------
DB2.WORLD
DB3.WORLD
column pushed_site_by_propagator format a25

select job, dblink pushed_site_by_propagator
from defschedule
where log_user in (select username from defpropagator)
and upper(what) like '%DBMS_DEFER_SYS.PUSH%');
JOB PUSHED_SITE_BY_PROPAGATOR
---------- -------------------------
43 DB2.WORLD
44 DB3.WORLD
3.6 Check if the push job is currently running
Oracle replication only allows a single push operation to run to a master site at a time,
although multiple push operations can occur concurrently but they have to be to different
remote master sites. If connection qualifiers are being used then there can be multiple push
jobs to the same master site, but with different database links (see note:1024982.6).
It may also be difficult to know if data is moving between replicated sites by using the
deftrandest table because new transactions will be added all the time and for transactions with
many calls they may take some time to process. The following query identifies push jobs that
are currently running:
select /*+ ORDERED */ j.job, j.sid, d.dblink,
SUBSTR(TO_CHAR(J.THIS_DATE,'MM/DD/RRRR HH24:MI:SS'),1,20) START_DATE
from defschedule d, dba_jobs_running j
where j.job in (select job from dba_jobs
where upper(what) like '%DBMS_DEFER_SYS.PUSH%')
and j.job = d.job;
JOB SID DBLINK START_DATE

---------- ---------- ------------------------------ -------------------
44 9 DB2.WORLD 05/16/2002 12:14:47
When a job runs it create a Job Queue Lock to protect it from being run more than once (i.e.
run manually from a users session), there have been conditions observed where the lock is
still held by the Job Queue Process due to network failures. Use the following query to
identify Job Queue (JQ) lock:
select s.sid, s.serial#, s.username

from v$lock l, v$session s
where l.type = 'JQ'
and l.sid = s.sid;
SID SERIAL# USERNAME

---------- ---------- ------------------------------
9 164 REPADMIN
It is important to note that the above queries will not identify manual push operations that
have been initiated from a users session, see section 4.1 for more details of identifying these.
After evaluating the above and section 4 it may be necessary to terminate the push process,
please follow the steps defined in section 3.7 to do this.
3.7 Terminating a deferred queue push job that is currently running
There will be situations where the running push job needs to be terminated and prevented
from running again, until the current problem that is being encountered is resolved. Perform
the following steps:
- Break the job with :
execute dbms_job.broken(, true);
- Kill the Job Queue Process from the Operating System. To do this use the sid from section
3.6 to identify the process in v$session, v$process and v$bgprocess. The process will
generally be named SNPx or Jxxx.
- After killing the process, wait approximately 1 minute, to ensure the job is removed from
dba_jobs_running.
Killing the background process from the operating system will release the Job Queue lock and
the User Lock used to protect the push operation. Once the underlying problem has been
resolved by working through sections 4, 5 and 6, restart the jobs with:
execute dbms_job.broken(, false);
4. Diagnosing the progress and status of replicated transactions

The following section provides database administrators with a step by step guide to
diagnosing the progress of deferred transactions. Prior to following these steps ensure the job
queue is correctly configured, see section 3 above.
4.1 Verify transactions can be manually pushed

If it has been established that the job queue is configured correctly, but still there are problems
with propagation, the first step should be to analyse the problem by trying a manual push of
the deferred transaction queue.
Remember to make sure that the job that would normally perform the push is not currently
running and has been prevented from running when the manual push is being tested (see
sections 3.6 and 3.7). If this is not the case the manual push will normally return immediately
without pushing any rows.
Log on as the propagation user and run the following:
declare x integer;
begin
x := sys.dbms_defer_sys.push('',...);
end;
/
If you get errors during the manual push, generally rows will not be pushed, use My Oracle
Support to search for known problems that relate to these errors.
If the manual push completes without errors but the entries in deftrandest remain unchanged,
the following could be the cause:
- Another users session is performing the same push operation, when a manual push starts it
allocates a User Lock to ensure there is only one push at a time, use note:1059290.6 to
identify the blocking session.
- BUG:734902 (fixed in 8.0.6), which may manifest itself as a hang.
If neither of the above are the cause, check through the following 4.x sections and if the cause
cannot be identified, raise a call with Oracle Support Services.
If the manual push hangs check through the following 4.x sections and if they do not identify
the cause use section 5 to diagnose the problem.
4.2 Verifying transactions are being propagated

In Oracle release 7.x, the replication routines deleted all transactions from the deferred
transaction queue once they had been pushed. Where as from Oracle 8.x onwards transactions
(and calls) are no longer immediately deleted after application at the remote site, they are
instead purged from the local queue on a regular basis (interval).
There are two different views of the deferred transaction queue:

- DEFTRAN : contains all unpurged transactions
- DEFTRANDEST : contains all transaction that have not yet been pushed to a remote master
site. Transactions appear once per master site they have to be pushed to.
To view all unpurged transactions:
select * from deftran order by delivery_order;
DEFERRED_TRAN_ID DELIVERY_ORDER D START_TIM

------------------------------ -------------- - ---------
9.13.904 280185 R 16-MAY-02
1.5.559 280654 R 16-MAY-02
3.11.549 281112 R 16-MAY-02
** Note the deftran view also includes transactions from the deferred error queue, see sections
4.3.1 and 4.4 for additional information.
To view all transactions yet to be propagated:
column deferred_tran_id format a16

select * from deftrandest order by delivery_order;
DEFERRED_TRAN_ID DELIVERY_ORDER DBLINK

---------------- -------------- ------------------------------
1.5.559 280654 DB2.WORLD
3.11.549 281112 DB2.WORLD
The transaction with the lowest delivery order will be the next transaction to be pushed to the
remote replication site. Prior to Oracle9 the DEFTRANDEST view was the only way to
identify how propagation was progressing. From Oracle8 onwards use the following to
investigate the current push:
- Transactions that are currently being pushed appear in the target sites DEF$_ORIGIN table.
- A transaction has been pushed if system.def$_destination.last_delivered is greater than
system.def$_aqcall.cscn.
Oracle9 includes a mechanism for identifying how far through the current transaction the
current push is. In Oracle9 and above, use the following query to identify if a transaction is
being pushed and how many rows have been pushed:
column xid format a12

select /*+ ORDERED */ p.xid, p.dblink,

(MAX(c.callno) + 1) "Calls in Tran",
(p.sequence/(MAX(c.callno) + 1)) * 100 "% Processed Calls"
from v$replprop p, defcall c
where p.name like '%Slave%'
and c.deferred_tran_id = p.xid
group by p.xid, p.dblink, p.sequence;
XID DBLINK Calls in Tran % Processed Calls
------------ ------------------------------ ------------- -----------------
10.20.556 DB2.WORLD 9999 32.446489
6.6.581 DB2.WORLD 7000 15.153031
The current implementation of v$replprop only applies to transactions that are pushed using
parallel propagation, however Oracle recommends all customers use parallel propagation. On
systems that are CPU bound pushing the deferred queue, it may be better to run the following
query:

select p.xid, p.dblink,

p.sequence "Processed Calls"
from v$replprop p
where p.name like '%Slave%';
In Oracle9 and above running the following query should assist database administrators is
monitoring the overall activity in the deferred transaction queue:
select * from v$replqueue;
TXNS_ENQUEUED CALLS_ENQUEUED TXNS_PURGED LAST_ENQU LAST_PURG

------------- -------------- ----------- --------- ---------
6543 21299 300 17-MAY-02 17-MAY-02
See Section 7 for an example of a transaction being propagated to a remote site. If the current
transaction being pushed appears to be hung or running very slowly, check the following 4.x
sections of this document, if they do not help section 5 contains a more detailed analysis
method.
For further information on replication parallel propagation, refer to:note:76447.1 Replication

Parallel Propagation
4.3 Deferred transaction propagated but not applied to receiving site

If deferred transactions are leaving the local master site (being removed from the
DEFTRANDEST view) and the data changes do not appear to be applied at the remote master
site. The most likely cause is they are failing at the remote site; the most common reasons for
this are:
- Database space management or rollback problems.

- Data conflicts, where another user has updated a row that is being pushed to this replication
site.
When the replication propagation mechanism encounters an error in applying a transaction to

a remote master site, it rolls back the transaction and writes the whole transaction (all
rows/calls) to the deferred error queue at the site to which the transactions was being pushed.
The whole transaction is copied
even if only one row causes the problem.
4.3.1 Checking for deferred errors in the queue

To obtain information of a transaction in error query deferror with:
column deferred_tran_id heading 'Deferred|Transaction|ID' format a11

column callno heading 'Call|Number' format 99999999
column origin_tran_db heading 'Origin |Database' format a16
column destination heading 'Destination|Database' format a16
column time_of_error heading 'Date Of|Error' format a9
column error_number heading 'Oracle|Error|Number' format 999999
select deferred_tran_id, callno, origin_tran_db, destination,

TO_CHAR(start_time, 'MM/DD/YY HH24:MI:SS') TIME_OF_ERROR,
error_number
from deferror
order by start_time;
Deferred Oracle
Transaction Call Origin Destination Date Of Error
ID Number Database Database Error Number
----------- --------- ---------------- ---------------- --------- -------
5.11.152 0 DB3.WORLD DB4.WORLD 05/16/02 1403
20:03:07
8.10.153 2 DB3.WORLD DB4.WORLD 05/16/02 1403
20:08:01
In general to resolve the errors that the above query returns, search MetaLink for ORA-. If the
error is an ORA-1403 as in the above example data between the replication sites tables has
diverged, this data will have to be manually resynchronised and Oracle would recommend
customers implement a conflict resolution mechanism.
Use note:2065172.102 to discover details of the call that is in error.
4.3.2 Stop on Error

The majority of customers configure their push jobs with stop_on_error = false, the
consequence of this is:
- If one transaction fails and is written to the error queue, then the following transactions
succeed, there is a possibility that data at the remote master site could be logically
inconsistent, because the earlier
transaction has been overtaken.
If a customer has configured stop_on_error = true, then the first transaction to fail will be
written to the error queue and subsequent transactions will not be pushed. This makes it much
easier to resolve divergent data.
Use the following query to identify what stop_on_error has been set to:
column what format a64

select job, what from dba_jobs where upper(what) like '%STOP_ON_ERROR%';
JOB WHAT
---------- ----------------------------------------------------------------
estination=>'DB2.WORLD', stop_on_error=>FALSE, delay_seconds=>0,
parallelism=>2); end;
4.4 Verify the purge transaction operation

As stated in section 4.2 deferred transactions are not immediately purged from the local
replication site after they are propagated. It is very important that the purge happens at a
regular interval, such that the queue to purge does not grow too large. If the queue to purge
gets too large it will start to impact the performance of propagation.
4.4.1 Verify transactions can be manually purged

Use section 3.3 to check that the purge job is correctly scheduled and section 3.4 to identify if
the job is failing. If the purge job is failing or the number of transactions to be purged does
not appear to be decreasing follow the steps in this section.
The purge job should be owned (log_user) by a replication administrator, so log on as a

replication administrator user (normally repadmin) and run the following:
declare x integer;
begin
x := sys.dbms_defer_sys.purge(delay_seconds=>0);
end;
/
If the manual purge raises errors, check MetaLink for likely causes, address the errors and run
the purge again. If the manual purge returns without error follow the steps described in
section 4.4.2 to check that transactions are being correctly purged.
4.4.2 Verifying transactions are being purged

The easiest way to check if the queue of deferred transactions to be purged is decreasing is to
use the following query, which eliminates deferror entries in the local deferred queue:
select min(delivery_order) "Oldest Unpurged"

from deftran
where destination_list != 'D';
Oldest Unpurged
---------------
781633
In Oracle9 and above it may be better to run the following query particularly if the deferred
queue is very large, it shows the number of transactions that have been purged since the
database was last started:
select txns_purged from v$replqueue;

TXNS_PURGED
-----------
356
There are two types of purge that Oracle can perform:
- Lazy purge (the default).

- Precise purge.
The lazy purge will purge transactions with a system.def$_aqcall.cscn lower than the local
low water mark for propagated transactions (this is calculated based on the minimum
last_delivered in the local system.def$_destination). This low water mark can be lower than
some cscn numbers of some previously pushed transactions, so they will not be purged
immediately. This can happen if not all the push jobs have run and still have active
transactions. In this case, the transactions will remain in def$_aqcall until the low water mark
rises above the cscn for the transaction.
A precise purge will purge transactions with a cscn lower than the low water mark for
propagated transactions to it's specific destination. This means that the purge will query the
last_delivered for each dblink destination. All transactions that have been pushed from the
local site to that destination will usually fall below the low water mark and be purged. To
perform the precise purge execute:
declare x integer;
begin
x := sys.dbms_defer_sys.purge(purge_method=>0);
end;
/
4.5 Large deferred transactions and slow propagation

The two main reasons for slow propagation of the deferred queue are:
- Limited systems resources, usually caused by CPU or network bottlenecks.

- Large deferred transaction or error queues.
For the replication propagation mechanism to achieve maximum throughput the deferred
queue needs to be kept as small as possible and transactions need to be propagated at regular
intervals. There are two types of large queue:
- Queues with one or two transactions with tens or hundreds of thousands of calls, usually
cause by bulk update operations (DML) or SQL loads.
- Queues with tens or hundreds of thousands of transactions, usually caused by a failure in the
propagation job due to network outage or space management issues at the remote site.
In the majority of cases the queue will have already become large before the database
administrator becomes aware of the problem and running queries against the DEF.... views
will prove difficult because with enormous queues the views are slow.
The following queries should help the database administrator to make a decision about what
to do with the large deferred queue, please note on some systems it may not be practical to run
these queries.
Check how many rows are in the current or next transaction to be pushed:
select d.deferred_tran_id, count(*) calls

from defcall c, deftrandest d
where d.deferred_tran_id = c.deferred_tran_id
and d.delivery_order = (select min(delivery_order) from deftrandest)
group by d.deferred_tran_id;
DEFERRED_TRAN_ID CALLS
------------------------------ ----------
1.0.704 4999
Check how many rows exist in transactions to come (ordered by rows):
select deferred_tran_id, count(callno) calls

from defcall
group by deferred_tran_id
order by calls;
DEFERRED_TRAN_ID CALLS
------------------------------ ----------
1.0.704 4999
1.21.691 3430
1.7.725 112
10.0.669 102
Oracle 9.x and above, check how many rows have been propagated from the current
transaction so far:

select p.xid, p.dblink,

p.sequence "Processed Calls"
from v$replprop p
where p.name like '%Slave%';
XID DBLINK Processed Calls

------------ ------------------------------ ---------------
9.13.1016 DB2.WORLD 4999
4.10.628 DB2.WORLD 636
Oracle 9.x and above, check the overall number of transactions and rows that have been
queued since the instance was last started:
select * from v$replqueue;
TXNS_ENQUEUED CALLS_ENQUEUED TXNS_PURGED LAST_ENQU LAST_PURG

------------- -------------- ----------- --------- ---------
72438 2179964 2400 22-MAY-02
After making the above assessment it may be the case that the only real option is to clear
down the deferred queue and manually resynchronise the data. See Section 6 for details of
how to perform this operation.
To avoid large deferred queues building up in the future:
- Ensure conflict resolution handlers are defined for tables that receive large updates, by
handling the conflicts we avoid the overhead of rolling back the transaction and re-pushing it
into the remote error queue. This operation can take considerably longer than the original
transaction.
- Monitor the push and purge jobs with Enterprise Manager events, as soon as a failure occurs
the database administrator will then be alerted and can address the problem before the queues
build up.
- There is no easy way to avoid large transactions that are generated by mistake or adhoc user
access, but for planned batch operations consider using procedural replication.
5. Diagnosing hanging propagation
If after completing the analysis described in section 4, rows do not appear to be moving
between replication sites, perform the diagnostic steps described in the section.
5.1 Check for locks

If a row being pushed to a remote master is locked by a users session for an extended period
of time, propagation will appear to be hung. Use the following select statement to identify
sessions blocking propagation at the site where propagation is pushing data too:
column holder format 99999

column holder_name format a16
column rowno format 9999999999
column object_name format a25
select w.holding_session holder,

h.username holder_name,
s.row_wait_row# rowno,
o.object_name,
o.owner
from dba_waiters w, dba_objects o, v$session s, v$session h
where w.waiting_session = s.sid
and s.username = 'REPADMIN'
and w.holding_session = h.sid
and s.row_wait_obj# = o.object_id;
HOLDER HOLDER_NAME ROWNO OBJECT_NAME OWNER

------ ---------------- ----------- ------------------------- --------------
13 SCOTT 32 T2 REPDBA
For this query to be executed successfully; replace 'REPADMIN' with the user that the
pushing site's replication propagator pushes to (receiver user) and make sure
CATBLOCK.SQL has been run.
If the ROWNO / OBJECT_NAME column does not change the blocking users session will
have to be killed to allow propagation to continue. Under some circumstances more than one
row may be returned because each query slave used by parallel propagation opens a separate
session at the remote site.
note:76447.1 contains additional information about Parallel Propagation
note:62354.1 and UTLLOCKT.SQL contain additional information about

identifying locks.
5.2 Check the wait events

If propagation is not waiting for a lock then the next step in diagnosing the hang is to find out
what the pushing and receiving sessions are waiting for. The easiest way to do this is to use
the V$SESSION_WAIT view; customers should note that Job Queue processes
(DBMS_JOB) are not always recorded in
this view.
Run the following query at the pushing site if propagation is being executed from
DBMS_JOB:
column event format a44
select sw.sid, sw.event, sw.p1, sw.p2

from v$session_wait sw
where sw.sid in ( select qs.sid
from dba_jobs_running jr, dba_jobs j,
v$px_session qs
where jr.job = j.job
and jr.sid = qs.qcsid
and upper(j.what) like '%DBMS_DEFER_SYS.PUSH%')
and sw.wait_time = 0;
Run the following query at the pushing site if propagation is being executed by a users
session, replace REPADMIN with the propagation user:

from v$session_wait sw, v$session s
where sw.sid = s.sid
and sw.sid in (select qs.sid from v$px_session qs)
and sw.wait_time = 0
and s.username = 'REPADMIN';
Run the following query at the site where data is being pushed to and replace 'REPADMIN'
with the user that the propagation user pushes to (receiver user), note each parallel
propagation slave will appear as a separate session:

from v$session_wait sw, v$session s
where sw.sid = s.sid
and s.username = 'REPADMIN'
and sw.wait_time = 0;
If the sessions appear to be stuck on the same event for a long period of time, consult
note:61998.1 or raise a call with Oracle Support Services.
5.3 Check if a large error is being queued

Writing failed replicated transactions to the remote deferred transaction error queue is one of
the longest operations replication can perform, failed transactions often take four or more
times as long to write to the error queue as they did to create.
Section 4.3 can be used to identify when small transactions are being written to the error
queue, however it will not assist in identifying if there is a single transaction with thousands
of rows being written to the error queue. The following query can be used to identify if a large
error is being queued at the remote master site:
select count(*)
from v$sqlarea a, v$session s
where s.sql_address = a.address
and s.sql_hash_value = a.hash_value
and a.sql_text like '%DEF$_AQERROR%'
and s.username = 'REPADMIN';
Replace 'REPADMIN' with the user that the pushing sites replication propagator pushes to
(receiver user).
5.4 Advanced Analysis

If after following all other sections in this article the push still seems to be hung or stuck,
collect the information described in this section and supply it to Oracle Support Services. It
may be necessary to raise a call with Oracle Support Services to assist in collecting the
required information.
The first step is to collect an errorstack from the pushing session, this must include all query
slaves used by the push. Use the query defined in section 5.2 to identify these.
The second step is to collect an errorstack from all sessions that are applying changes at the
remote site. Use the query defined in section 5.2 to identify these.
Collect the errorstack by running the following from SQLPLUS for each session:
connect .... as sysdba

oradebug setospid oradebug unlimit oradebug dump errorstack 3
As hung or stuck push operations may in fact be spinning or looping operations, it may also
be necessary to collect in depth sql_trace of all sessions listed above. Collect the required
trace by running the following from SQLPLUS for each session:
connect .... as sysdba

oradebug setospid
oradebug unlimit
oradebug dump 10046 12
If after analysis the sessions need to be killed refer to section 3.7.

6. How to clear down large deferred queues
Database administrators frequently have to make the decision to terminate propagation and
manually resynchronise their replicated environments when the deferred queue has become
large and or slow. See section 4.5 for additional information. Once the decision has been
made to clear down the queue and resynchronise the data, follow : note:190885.1 How to
Clear Down the Deferred Queue and
DBMS_DEFER_SYS.DELETE_TRAN
Oracle normally recommends customers use dbms_defer_sys.delete_tran with all

arguments set to null, to remove transactions from the deferred queue. However
when the queue is very large this may not be the most efficient mechanism for
clearing the queue.
The basic steps to clear down the queue are listed below, if the queue is
large, slow or hanging refer to note:190885.1:
- Terminate the current push operation and prevent it from re-running,

if the push is running from dbms_job see section 3.7.
- Stop remote sites from replicating to the local site.
- DO NOT attempt to suspend or quiesce the replicated environment, that
will try to push the queue again and introduce admin requests that also
need to be cleaned out.
- If the system schema has optimiser statistics defined, make sure they are up to date, by
analyzing the schema with compute statistics. In general Oracle does not recommend adding
statistics to the system schema tables.
- From SQLPLUS :
execute dbms_defer_sys.delete_tran(null, null);

7. Example of how deferred transactions are propagated
The following is an example of a transaction being replicated from ORA9I.WORLD to
REP9I.WORLD.
7.1 Data is inserted into the table:

insert into scott.dept values (80,'MARKETING','ORLANDO');
commit;
insert into scott.dept values (90,'MARKETING','LONDON');
commit;
7.2 Interrogating the deferred queue:

After applying DML to replicated tables, modifications are logged in the system.def$_aqcall
table. These changes can be seen through the deftran and deftrandest views:
select * from deftran;

------------------------------ -------------- - ---------
4.49.495 479001 R 22-JAN-02
3.0.494 478972 R 22-JAN-02
column deferred_tran_id format a10

select * from deftrandest order by delivery_order;
--OR--
select enq_tid, cscn

from system.def$_aqcall
where cscn is not null
order by cscn;
DEFERRED_T DELIVERY_ORDER DBLINK

---------- -------------- ------------------------------
4.49.495 479001 REP9I.WORLD
7.3 Identify the next transaction to be pushed:

Find out what the current LAST_DELIVERED value is for
SYSTEM.DEF$_DESTINATION:
select dblink, last_delivered from system.def$_destination;
DBLINK LAST_DELIVERED
------------------------------ --------------
REP9I.WORLD 478992
ORA9I.WORLD 478878
At this point, DEFERRED_TRAN_ID OR ENQ_TID 4.49.495 has not yet been pushed
because its DELIVERY_ORDER OR CSCN of 479001 is greater than the current
LAST_DELIVERED value for all transactions going to REP9I.WORLD which is 478992.
DEF$_AQCALL.CSCN > DEF$_DESTINATION.LAST_DELIVERED means the

transaction has NOT been pushed.
7.4 Manually push transactions to REP901 and interrogate the deferred queue
Push the queue:
..... dbms_defer_sys.push(destination=>'REP9I.WORLD');
Check the views:
select * from deftrandest;
no rows selected
select * from deftran;

------------------------------ -------------- - ---------
4.49.495 479001 R 22-JAN-02
3.0.494 478972 R 22-JAN-02
select enq_tid, cscn from system.def$_aqcall;
ENQ_TID CSCN
------------------------------ ----------
4.49.495 479001
3.0.494 478972
As expected, the unpurged queue view deftran shows all transactions and the deftrandest view
is emply because all transactions have been pushed.
select dblink, last_delivered from system.def$_destination;
DBLINK LAST_DELIVERED
---------------------------- --------------
REP9I.WORLD 479017
ORA9I.WORLD 478878
The LAST_DELIVERED column has increased and all transactions with

DEF$_AQCALL.CSCN < DEF$_DESTINATION.LAST_DELIVERED have been pushed,
which is all of the transactions currently in deftran.
7.5 Identify unpurged transactions

The following query identifies the number of transactions that have been pushed but not yet
purged:
select count(*)
from system.def$_aqcall
where cscn < (select last_delivered from
system.def$_destination where dblink ='[dblink]');
COUNT(*)
----------
2
Related
Products
• Oracle Database Products > Oracle Database > Oracle Database > Oracle Server -
Enterprise Edition
• Oracle Database Products > Oracle Database > Oracle Database > Oracle Server -
Enterprise Edition
Keywords

Troubleshooting Replication Propagation Issues

Enviado por

Dados do documento

Descrição original:

Título original

Direitos autorais

Formatos disponíveis

Compartilhar este documento

Compartilhar ou incorporar documento

Opções de compartilhamento

Você considera este documento útil?

Este conteúdo é inapropriado?

Direitos autorais:

Formatos disponíveis

Troubleshooting Replication Propagation Issues

Enviado por

Direitos autorais:

Formatos disponíveis

Troubleshooting Guide: Replication Propagation [ID 1035874.

Modified 10-MAY-2010 Type TROUBLESHOOTING Status

7.5 Identify unpurged transactions

Scope & Application

Last Review Date

select gname, status from dba_repgroup;

select oname, status, generation_status

select oname, status, generation_status,

column owner format a25

select owner, object_name, object_type

If problems persist, contact Oracle Support Services.

column gname format a18

select gname, request, status, errnum

For additional analysis of the replication administrative requests queue

column owner format a15

OWNER DB_LINK USERNAME

connect as database link owner

select count(*) from deftrandest ;

SQL> show parameter dump_dest

SQL> show parameter job

NAME TYPE VALUE

= number of replicated destinations (or connection qualifier destinations)

job_queue_interval : This parameter only applies to versions prior to Oracle9i, it specifies in

column what format a64

44 declare rc binary_integer; begin rc := sys.dbms_defer_sys.push(d

45 declare rc binary_integer; begin rc := sys.dbms_defer_sys.push(d

column dblink format a24

select job push_job, dblink,

PUSH_JOB DBLINK LAST_DATE NEXT_DATE INTERVAL

45 DB3.WORLD 05/15/02 05/15/02 sysdate + 1/1440

column last_date format a9

select job purge_job,

PURGE_JOB LAST_DATE NEXT_DATE INTERVAL

column broken format a6

JOB BROKEN FAILURES

select * from defpropagator;

USERNAME USERID STATUS CREATED

select db_link from dba_db_links

column pushed_site_by_propagator format a25

JOB SID DBLINK START_DATE

select s.sid, s.serial#, s.username

SID SERIAL# USERNAME

- Break the job with :

execute dbms_job.broken(, true);

4. Diagnosing the progress and status of replicated transactions

4.1 Verify transactions can be manually pushed

Log on as the propagation user and run the following:

4.2 Verifying transactions are being propagated

There are two different views of the deferred transaction queue:

To view all unpurged transactions:

select * from deftran order by delivery_order;

DEFERRED_TRAN_ID DELIVERY_ORDER D START_TIM

To view all transactions yet to be propagated:

column deferred_tran_id format a16

DEFERRED_TRAN_ID DELIVERY_ORDER DBLINK

column xid format a12

select /*+ ORDERED */ p.xid, p.dblink,

column xid format a12

select p.xid, p.dblink,

select * from v$replqueue;

TXNS_ENQUEUED CALLS_ENQUEUED TXNS_PURGED LAST_ENQU LAST_PURG

For further information on replication parallel propagation, refer to:note:76447.1 Replication

select /+ ORDERED / p.xid, p.dblink,