Você está na página 1de 30

Understanding LGWR, Log File Sync Waits and Commit Performance

Tanel Pder http://blog.tanelpoder.com http://tech.e2sn.com

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

Intro: About me
Tanel Pder
http://tech.e2sn.com - My company and technical Oracle stuff http://blog.tanelpoder.com - Personal Blog (more tech stuff) tanel@tanelpoder.com - Questions & enquiries

Consulting, Training, Seminars tanel@tanelpoder.com http://tech.e2sn.com/oracle-training-seminars


Online seminars coming soon!

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

Topics
How does commit & log file sync work
Overview

Reasons for too long log file sync waits

How to measure wheres the problem?

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

Reasons for log file sync waits


Commits wait for log file sync by default
User commits
Theres an user commits statistic in v$sesstat

DDL
Resulting recursive transactions commit

Recursive data dictionary DML

Rollbacks wait too!


User rollbacks
User/application issued a rollback command

Transaction rollbacks
We had an internal rollback (because of some failure) Space allocation/ASSM problems, cancelled queries, killed sessions

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

Commit & log file sync flow idealistic overview


Time

1) User issues a COMMIT 2) Foreground proc posts LGWR 3) LGWR issues the physical write syscall 5) LGWR posts the FG proc.

6) Commit complete

FG proc.

LGWR IO

4) The physcical write syscall completes

log file parallel write log file sync


Enterprise 2.0 Service Networks 2009-2010 www.e2sn.com

WWW.E2SN.COM

Log file sync performance -> disk IO speed


Time

FG proc.

LGWR IO

The physical write IO (log file parallel write) takes most of the time

Most of the log file sync time was spent waiting on log file parallel write The other components, scheduling latency, IPC were small
WWW.E2SN.COM

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

Log file sync performance -> scheduling latency


Time

1) User issues a COMMIT


FG proc.

2) LGWR waits in CPU runqueue

6) Foreground proc gets posted, gets onto CPU runqueue

LGWR IO

3) LGWR submits the IO, goes to sleep

4) IO completes, OS puts LGWR to CPU runqueue


WWW.E2SN.COM

5) LGWR gets onto CPU, posts foreground proc.

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

Log file sync flow


1. Foreground process (FG) posts LGWR and goes to sleep
The log file sync wait starts Posting is done via a semaphore operation on Unix/Linux

2. LGWR wakes up, gets onto CPU


Issues the IO request(s) LGWR goes to sleep, waiting for log file parallel write wait

3. Hardware completes the IO and OS wakes up LGWR


LGWR gets onto CPU Marks log file parallel write event complete and posts the FG

4. Foreground process is woken up by LGWR post


Foreground process gets onto CPU and completes the log file sync wait
Enterprise 2.0 Service Networks 2009-2010 www.e2sn.com

WWW.E2SN.COM

Measuring LGWR "speed"


SQL> @snapper out 1 10 1096 -- Session Snapper v2.01 by Tanel Poder ( http://www.tanelpoder.com ) --------------------------------------------------------------------------------SID, USERNAME , TYPE, STATISTIC , DELTA, HDELTA/SEC, %TIME, GRAPH --------------------------------------------------------------------------------1096, (LGWR) , STAT, messages sent , 12 , 12, 1096, (LGWR) , STAT, messages received , 10 , 10, 1096, (LGWR) , STAT, background timeouts , 1 , 1, 1096, (LGWR) , STAT, physical write total IO requests , 40, 40, 1096, (LGWR) , STAT, physical write total multi block request, 38, 38, 1096, (LGWR) , STAT, physical write total bytes, 2884608 , 2.88M, 1096, (LGWR) , STAT, calls to kcmgcs , 20 , 20, 1096, (LGWR) , STAT, redo wastage , 4548 , 4.55k, 1096, (LGWR) , STAT, redo writes , 10 , 10, 1096, (LGWR) , STAT, redo blocks written , 2817 , 2.82k, 1096, (LGWR) , STAT, redo write time , 25 , 25, 1096, (LGWR) , WAIT, LGWR wait on LNS , 1040575 , 1.04s, 104.1%, |@@@@@@@@@@| 1096, (LGWR) , WAIT, log file parallel write , 273837 , 273.84ms, 27.4%,|@@@ | 1096, (LGWR) , WAIT, events in waitclass Other , 1035172 , 1.04s , 103.5%,|@@@@@@@@@@| -- End of snap 1, end=2010-03-23 12:46:04, seconds=1

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

LGWR and Asynch IO


oracle@linux01:~$ strace -cp `pgrep -f lgwr` Process 12457 attached - interrupt to quit ^CProcess 12457 detached % time seconds usecs/call calls errors This is what the log ------ ----------- ----------- --------- --------file263 parallel write wait 100.00 0.010000 38 3 0.00 0.000000 213 event0measures AIO 0.00 0.000000 0 8 reaping duration 0.00 0.000000 0 701 0.00 0.000000 0 41 0.00 0.000000 0 41 This system call is not 0.00 0.000000 0 2 0.00 0.000000 0 37 instrumented by wait ------ ----------- ----------- --------- --------Interface! 100.00 0.010000 1081 3

syscall -------------semtimedop times getrusage gettimeofday io_getevents io_submit semop semctl -------------total

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

10

Warning
The (background processes) IO instrumentation has quite a few bugs
Different IO modes, sync, asynd, direct, buffered etc

On some versions, the log file parallel write (and db file parallel write) arent properly instrumented
Version dependent, for example 11.2.0.1 on Linux When filesystemio_options = NONE, IO syscall waits are instrumented ok (but you dont wan to use this option) When filesystem_io_options = ASYNC, IO reaping waits are all very short
However theres unaccounted time in LGWRs wait profile

When filesystem_io_options = SETALL, IO reaping waits are instrumented properly


Enterprise 2.0 Service Networks 2009-2010 www.e2sn.com

WWW.E2SN.COM

11

Redo, commit related latches and tuning


Redo related latches
redo allocation latches
Protect allocating space in log buffer / RBA ranges in redolog stream

redo copy latches


Used only for keeping track of whether anyones copying data into redo log buffer so that LGWR would know to wait for these memory copies to complete before it tries to write buffers to disk LGWR will wait for LGWR wait for redo copy wait event in such cases Used to be tuned by _log_simultaneous_copies

Should we tune any of these?


No, we should fix only problems which exist In other words, if wait interface doesnt show anyone waiting for them, then dont bother tuning them!
WWW.E2SN.COM

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

12

Instrumentation
Wait Events:
log file sync log file parallel write log file single write

Performance Counters (V$SESSTAT, V$SYSSTAT)


redo size redo writing time user commits user rollbacks transaction rollbacks

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

13

Wait event: log buffer space


(Not a commit problem) LGWR is too slow flushing redo log buffer contents to disk
Either because too slow IO subsystem Or LGWR not getting enough (quality) CPU time Sometimes pops up due large (unplanned) transactions

Of course, it can also be because of a too small log buffer


Which is not the case anymore in modern days Log buffer is usually multiple MB due how it is allocated from SGA You shouldnt even set the log_buffer parameter in 10g+

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

14

Wait event: log file single write


Single block redo IO is used mostly for logfile header block reading/writing
Log switch is the main cause Archiving as well as it updates log header Who wait: LGWR & ARCH

Example of what LGWR does during a log switch:


WAIT #0: nam='log file sequential read' ela= 12607 log#=0 block#=1 WAIT #0: nam='log file sequential read' ela= 21225 log#=1 block#=1 WAIT #0: nam='control file sequential read' ela= 358 file#=0 WAIT #0: nam='log file single write' ela= 470 log#=0 block#=1 WAIT #0: nam='log file single write' ela= 227 log#=1 block#=1

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

15

LGWR trace warnings


Starting from 10.2.0.3 (or was it 10.2.0.4) LGWR is trying to be helpful and dump warnings when the actual log write IO takes too long:
New parameter: _side_channel_batch_timeout_ms timeout before shipping out the batched side channelmessages in milliseconds

LGWR trace file:


*** 2010-03-10 11:36:06.759 Warning: log write time 690ms, size 19KB *** 2010-03-10 11:37:23.778 Warning: log write time 52710ms, size 0KB *** 2010-03-10 11:37:27.302 Warning: log write time 3520ms, size 144KB
Enterprise 2.0 Service Networks 2009-2010 www.e2sn.com

WWW.E2SN.COM

16

Log file sync in Statspack / AWR


How much of the end-to-end response time goes to log file sync?
How big it is compared to the full response time of the end user? Log file sync may take 20% of your DB Time but the DB Time itself may take only 10% of the total end user response time!

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

17

Log file sync in Statspack / AWR


If log file sync waits take a significant part of the response time, look into the Avg wait (ms) column:
Top 5 Timed Events Avg %Total ~~~~~~~~~~~~~~~~~~ wait Call Event Waits Time (s) (ms) Time ------------------------------------- ------------ ----------- ------ -----PL/SQL lock timer 57,159 8,754 153 68.8 db file sequential read 61,258 1,562 25 12.3 log file sync 5,873 1,522 259 12.0 CPU time 463 3.6 direct path write 235,606 155 1 1.2 ............ %Tim Total Wait out Time (s) ---- ---------0 108 0 39 0 4 Avg %Total wait Waits Call (ms) /txn Time ------ -------- -----8 0.9 .8 11 0.2 .3 16 0.0 .0

Event Waits ------------------------ -----------log file parallel write 13,292 db file parallel write 3,470 db file sequential read 257

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

18

Better breakdown of wait times V$EVENT_HISTOGRAM


Instead of a single wait time average, breaks wait times into buckets
SQL> select event, wait_time_milli,wait_count 2 from v$event_histogram 3 where event = 'log file parallel write'; EVENT WAIT_TIME_MILLI WAIT_COUNT ------------------------- --------------- ---------log file parallel write 1 22677 log file parallel write 2 424 log file parallel write 4 141 log file parallel write 8 340 log file parallel write 16 1401 log file parallel write 32 812 log file parallel write 64 391 log file parallel write 128 21 log file parallel write 256 6
WWW.E2SN.COM

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

19

Identifying scheduling latency


Easy on Solaris prstat m
Works since Solaris 8

Timed_os_statistics
OS Wait-cpu (latency) time in v$sesstat

On other OSes, it cant be directly measured with standard tools Indirectly, you can look into system wide average CPU runqueue length and assume that LGWR was also queueing
Not a too systematic approach, huh?

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

20

Measure scheduling latency (Solaris) Reading thread-level microstate accounting data with prstat:
USR SYS TRP TFL/DFL LCK SLP LAT % % % % % % % Time spent on CPU in user mode Time spent on CPU in kernel mode Time spent processing system traps (CPU traps) Time spent processing text/data page faults Time spent waiting for user locks Time spent sleeping (other than user locks) CPU scheduling latency

# prstat -mLp 5124 PID USERNAME USR 5124 oracle 4.2 5124 oracle 0.0 5124 oracle 0.0 5124 oracle 0.0 5124 oracle 0.0 5124 oracle 0.0 5124 oracle 0.0 5124 oracle 0.0 5124 oracle 0.0

SYS 1.6 0.6 0.6 0.6 0.6 0.0 0.0 0.0 0.0

TRP 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

TFL 0.6 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

DFL 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

LCK 0.0 99 99 99 99 0.0 100 100 100

SLP 94 0.0 0.0 0.7 0.0 100 0.0 0.0 0.0

LAT VCX ICX SCL SIG PROCESS/LWPID 0.1 412 20 3K 0 oracle/1 0.0 44 3 90 0 oracle/11 0.0 44 0 90 0 oracle/7 0.0 46 0 90 0 oracle/9 0.0 44 1 90 0 oracle/5 0.0 24 0 7 0 oracle/2 0.0 1 0 5 0 oracle/3 0.0 1 0 5 0 oracle/10 0.0 1 0 5 0 oracle/8

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

21

Bugs, problems
No instrumentation
On some version/platform/IO configuration the wait interface doesnt record log file parallel write waits at all
The same goes for db file parallel writes (Ive noticed it on 9.2-10.2.0.x on Solaris for example)

For LGWR you can use V$SESSTAT redo write time statistic instead
Its in centiseconds

1-second log file sync bug


Most log file syncs took ~1 second to complete The posts sent back by LGWR were missed by foreground process Thus the FG always waited until the 1 second log file sync wait timeout happened

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

22

Tuning
No need for tuning!
Log buffer is quite large by default
All memory remaining in a granule after the allocation for fixed SGA is given to log buffer

Oracle used to have a single redo log buffer until v9.0


Redo allocation latch could become the ultimate contention point

Since 9.2, Oracle can have the log buffer split into multiple buffers
Each protected by a separate redo allocation latch

From 10g, Oracle can keep lots of small private redo strands in shared pool
Each protected by a separate redo allocation latch @rs.sql Show redo strands available
WWW.E2SN.COM

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

23

Evil tuning
If you dont care about the D in ACID (and want to occasionally lose data for fun), then:
10gR1:
commit_logging transaction commit log write behavior

10gR2:
commit_write commit_wait transaction commit log write behavior transaction commit log wait behavior

Old undocumented stuff


_wait_for_sync wait for sync on commit MUST BE ALWAYS TRUE

Old: Put redologs to /tmp (on Solaris) or in-memory disks (/dev/shm) for duration of a migration/upgrade
If your OS / server crashes, youll need to restore from a backup!
Enterprise 2.0 Service Networks 2009-2010 www.e2sn.com

WWW.E2SN.COM

24

Optimizations for working around bad applications


Commit optimization
In PL/SQL since Oracle 9i The log file sync is deferred until the end of the PL/SQL call!
SQL> exec while true loop update t set a=a+1 ; commit ; end loop;

No log file sync waits log buffer space / log file switch completion waits more likely!

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

25

LGWR configuration CPU


Prevent priority decay
Put LGWR into fixed priority scheduling class (FX60 on Solaris) LGWR should get onto CPU faster when waking up LGWR isnt thrown off CPU as likely

If LGWR is still experiencing significant scheduling latency


You can put LGWR into a higher priority class

You should not put LGWR into the highest real-time class
Real time is tricky your process can monopolize a CPU for itself You dont want to make LGWR pre-empt the OS kernel! Note that Oracle sets some processes into higher priority by default:
_high_priority_processes
Enterprise 2.0 Service Networks 2009-2010 www.e2sn.com

LMS*|VKTM 26

WWW.E2SN.COM

LGWR configuration - IO
Reduce the amount of work and waiting a log file parallel write has to do
Unbuffered concurrent IO
Verify with truss/strace whether proper flags are used (O_DIRECT, O_DIO, O_CIO etc)

Or use raw devices


ASM is essentially a raw device

Or ODM for some cases

And optimize the whole IO hardware stack, of course!


Note shat mid-large size storage arrays do have write cache built in So, moving redo log files to SSD may not give any advantage! Verify whats your current log file parallel write latency using v$event_histogram
Enterprise 2.0 Service Networks 2009-2010 www.e2sn.com

WWW.E2SN.COM

27

log file sync magic tuning super-secret!!!

COMMIT LESS!!!
Commit when your business transaction ends, not after every single update!
Enterprise 2.0 Service Networks 2009-2010 www.e2sn.com

WWW.E2SN.COM

28

Summary
Application: Commit less!
Ideally only when your logical business transaction ends

Troubleshooting: Measure log file sync at session level detail CPU: If waits for log file sync are significant - see whether LGWR gets:
Enough (quality) CPU time Onto the CPU fast enough

IO: See how much LGWR waits for log file parallel write event
Whats the log file parallel write completion time V$EVENT_HISTOGRAM for better detail
Enterprise 2.0 Service Networks 2009-2010 www.e2sn.com

WWW.E2SN.COM

29

Thanks!
Download slides from:
http://tech.e2sn.com/oracle-slides-and-whitepapers

Download Snapper from:


http://tech.e2sn.com/oracle-scripts-and-tools/session-snapper

Blog:
http://blog.tanelpoder.com

Email:
tanel@tanelpoder.com

Enterprise 2.0 Service Networks

2009-2010 www.e2sn.com

WWW.E2SN.COM

30

Você também pode gostar