
Diagnostics For Database Hang

Oracle DBAs often face situations where the database appears hung and does not respond. In some cases the state is such that you cannot even connect through a SQL*Plus session. Most people simply restart the database (sometimes I wonder if this is because most of us started working on computers running Microsoft Windows) and then log a ticket with Oracle Support, who in turn happily inform us that they do not have any diagnostic information to diagnose and resolve the issue, and that we need to wait for the next occurrence to collect some.
Based on my experience, I am writing this article to help fellow Oracle DBAs diagnose the problem and collect the required information. So let's start.
1) First of all, we need to establish that this is really a database hang and not just a slow database. This can be done by asking the users some questions:
a) Is a particular user complaining of a database hang, or is it the condition for all users? If only one or a few users are affected, are those users running a batch job?
b) Are you able to make new connections to the database?
c) Has any initialisation parameter been changed recently?
d) Is any Resource Manager plan in effect?
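Checks (c) and (d) can be done quickly from SQL*Plus; a short sketch using standard dynamic performance views:
SQL> show parameter resource_manager_plan
SQL> select name, value from v$parameter where isdefault = 'FALSE' order by name;
The first command shows whether a Resource Manager plan is active (an empty value means none), and the second lists all non-default initialisation parameters, which helps spot recent changes.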
One more way to establish whether the database is hung is to query the v$session_wait view to find the events being waited on:
select sid, event, seq#, p1, p2, p3 from v$session_wait where wait_time=0 and event not like '%message%';
This will list the wait events for all waiting sessions. If you see something like 'log file switch (archiving required)', the problem is caused by an archiving issue; check whether there is free space in the archiving destination.
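If the archive destination is the flash recovery area, a quick way to check the space (10g and above) is:
SQL> select name, space_limit/1024/1024 limit_mb, space_used/1024/1024 used_mb from v$recovery_file_dest;
If used_mb is close to limit_mb, the archiver is stuck for lack of space; free up space or raise db_recovery_file_dest_size.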
If it reports events such as row cache enqueues or latches, we need to gather hanganalyze and systemstate dumps for Oracle Support.
Otherwise, you may simply be experiencing a slow database. In that case, use AWR or Statspack to diagnose the issue and look at the top timed events. If you see the library cache latch or shared pool latch consuming a lot of time, look at the hard parses per second figure in the Load Profile section.
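To see which wait dominates, you can summarise the waiting sessions by event; a simple sketch:
SQL> select event, count(*) waiters from v$session_wait where wait_time=0 and event not like '%message%' group by event order by waiters desc;
The event with the largest number of waiters usually points to the root contention.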
2) Look at the database alert log and see whether any messages are present. If you are facing latching or enqueue issues, you might see errors like the following:
PMON failed to acquire latch, see PMON dump
Errors in file /u01/BDUMP/test10_pmon_12864.trc:
In this case you need to upload the trace file (reported in the alert log) to Oracle Support.
Note: Make sure that max_dump_file_size is set to unlimited so that the trace file contains the complete data.
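For example:
SQL> alter system set max_dump_file_size='UNLIMITED';
You can also set it only for the session taking the dump with alter session set max_dump_file_size='UNLIMITED';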
Coming back to hanganalyze and systemstate, find the details for each below.
A) Hanganalyze
HANGANALYZE is used to determine whether a session is waiting for a resource, and it reports the relationships between blockers and waiters.
Use the following syntax and take hanganalyze dumps from two sessions at an interval of 1 minute:
$ sqlplus "/ as sysdba"
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug hanganalyze 3
SQL> oradebug tracefile_name
The last command reports the trace file name, which has to be uploaded to Oracle Support.
Alternatively you can use:
SQL> ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME HANGANALYZE LEVEL 3';
If you wish to understand how to interpret the hanganalyze trace file, see Metalink Note 215858.1: Interpreting HANGANALYZE trace files to diagnose hanging and performance problems.
B) Systemstate
A systemstate dump records process information that helps Oracle Support diagnose why the sessions are waiting.
For 9.2.0.6 and above, gather the systemstate as below:
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 266
SQL> oradebug tracefile_name
The last command reports the trace file name, which has to be uploaded to Oracle Support. Perform this 2-3 times at an interval of 1 minute.
Again, you can use:
ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 266';
For Oracle 9.2.0.5 and earlier, use level 10 instead of 266:
ALTER SESSION SET EVENTS 'IMMEDIATE TRACE NAME SYSTEMSTATE LEVEL 10';
Level 266 includes short stacks (Oracle function call stacks), which are useful for Oracle developers to determine which Oracle functions are causing the problem. This also helps in matching against existing bugs.
If you are unable to connect to the database, capture the systemstate using the approach in the note below:
Note 121779.1: Taking a SYSTEMSTATE dump when you cannot CONNECT to Oracle.
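From Oracle 10g onwards, a preliminary connection can be used for this; it attaches to the SGA without creating a session, so it works even when normal logins hang:
$ sqlplus -prelim "/ as sysdba"
SQL> oradebug setmypid
SQL> oradebug unlimit
SQL> oradebug dump systemstate 266
SQL> oradebug tracefile_name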
Apart from this, the following information can also be captured:
a) Database alert log
b) AWR report/Statspack report covering 30-60 minutes around the database hang
c) Output of OS tools, to ensure that everything is fine at the OS level, e.g.
$ vmstat 2 20
This captures 20 snapshots at 2-second intervals. Look for CPU contention or swapping issues.
In addition to the above, you can use a utility called LTOM, which has predefined rules based on which it determines that the database is in a hung state and takes systemstate and hanganalyze dumps automatically.
Please refer to the following Metalink note for more details:
Note 352363.1: LTOM - The On-Board Monitor User Guide
If you are able to narrow down to a blocking session manually, you can also take an errorstack dump of the blocking process, as below (assuming the OS pid of the blocking session is 1234):
connect / as sysdba
oradebug setospid 1234
oradebug unlimit
oradebug dump errorstack 3
(wait 1 minute)
oradebug dump errorstack 3
(wait 1 minute)
oradebug dump errorstack 3
oradebug tracefile_name
The last command reports the trace file name, which has to be uploaded to Oracle Support.
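If you have not yet identified the blocker, v$session can point you to it (the blocking_session column is available from 10g onwards); a quick sketch to get the blocker's OS pid for oradebug setospid:
SQL> select distinct s.blocking_session, p.spid from v$session s, v$session b, v$process p where s.blocking_session = b.sid and b.paddr = p.addr;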
If you are able to capture the above information, you stand a 99% chance of getting a solution. I have kept 1% for the cases when Oracle Support asks you to set some events and wait for the next hang occurrence to gather more information.

RAC Hang Analysis and/or System State Dumps

First, log in using the -prelim option of SQL*Plus as "/ as sysdba".

Then run the following:

alter session set tracefile_identifier='sysstatedump';
oradebug setmypid
oradebug unlimit
prompt Issuing command: oradebug -g all dump systemstate 10
oradebug -g all dump systemstate 10
prompt Sleeping for 1 minute
!sleep 60
oradebug -g all dump systemstate 10
prompt We may not be able to do a dump of locks
prompt Issuing command: oradebug -g all dump locks 8
prompt If the sqlplus session hangs here for more than several minutes
prompt kill it by hitting control-c
oradebug -g all dump locks 8
prompt Sleeping for 1 minute
!sleep 60
oradebug -g all dump locks 8

BACKGROUND PROCESSES
Getting a list of all the background processes is handy for any DBA, and as a RAC DBA you will come across some additional processes that are present only in a RAC environment.

SQL> select name, description from v$bgprocess where PADDR <> '00';
PMON process cleanup
DIAG diagnosibility process
LMON global enqueue service monitor
LMD0 global enqueue service daemon 0
LMS0 global cache service process 0
LMS1 global cache service process 1
MMAN Memory Manager
DBW0 db writer process 0
LGWR Redo etc.
LCK0 Lock Process 0
CKPT checkpoint
SMON System Monitor Process
RECO distributed recovery
...
The additional RAC-centric processes are the DIAG, LCK, LMON, LMDn, and LMSn processes. A brief description of each, and of how they interact in a RAC environment, follows.

DIAG: This is a diagnostic daemon. It constantly monitors the health of the instance and watches for possible failures across the RAC. There is one DIAG process per instance.

LCK: This lock process manages requests that are not cache-fusion requests, such as row cache and library cache requests. Only a single LCK process is allowed per instance.

LMD: The Lock Manager Daemon. This is also sometimes referred to as the GES (Global Enqueue Service) daemon, since its job is to manage global enqueues and global resource access. It also detects deadlocks and monitors lock conversion timeouts.

LMON: The Lock Monitor process. It is the GES monitor and reconfigures lock resources when nodes are added or removed, generating a trace file every time a node reconfiguration takes place. It also monitors the cluster as a whole, detects a node's demise, and triggers a quick reconfiguration.

LMS: This is the Lock Manager Server process, sometimes also called the GCS (Global Cache Services) process. Its primary job is to ship blocks between nodes for cache-fusion requests. For a consistent-read request, the LMS process rolls back the block, makes a consistent-read image of it, and then ships this image across the HSI (High Speed Interconnect) to the requesting process on the remote node. LMS also checks constantly with the LMD background process (our GES process) to pick up the lock requests placed by it. Up to 10 such processes can be spawned dynamically.
