HADR monitoring interfaces

DB2 V10.1 has enhanced HADR monitoring. See DB2 V10.1 Info Center
For HADR monitoring in V97, see DB2 V97 Info Center

SQL based monitoring such as table functions and administrative views work on standby only when Reads on Standby is enabled. db2pd and the "db2 get snapshot for database" command always work on the standby.

db2pd -hadr

"db2pd -hadr" option returns HADR info for a database. It is light weight, requiring no SQL connection. But it can only be issued on the database host machine (it attaches to DB2 server shared memory).

In DB2 V10.1 and later , the output is like

        Database Member 0 -- Database HADRDB -- Active -- Up 0 days 00:23:17 -- Date 06/08/2011 13:57:23

                                      HADR_ROLE = PRIMARY
                                    REPLAY_TYPE = PHYSICAL
                                  HADR_SYNCMODE = SYNC
                                     STANDBY_ID = 1
                                  LOG_STREAM_ID = 0
                                     HADR_STATE = PEER
                            PRIMARY_MEMBER_HOST = hostP.ibm.com
                               PRIMARY_INSTANCE = db2inst
                                 PRIMARY_MEMBER = 0
                            STANDBY_MEMBER_HOST = hostS1.ibm.com
                               STANDBY_INSTANCE = db2inst
                                 STANDBY_MEMBER = 0
                            HADR_CONNECT_STATUS = CONNECTED
                       HADR_CONNECT_STATUS_TIME = 06/08/2011 13:38:10.199479 (1307565490)
                    HEARTBEAT_INTERVAL(seconds) = 25
                          HADR_TIMEOUT(seconds) = 100
                  TIME_SINCE_LAST_RECV(seconds) = 3
                       PEER_WAIT_LIMIT(seconds) = 0
                     LOG_HADR_WAIT_CUR(seconds) = 0.000
              LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.006298
             LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.516
                            LOG_HADR_WAIT_COUNT = 82
          SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 50772
          SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 87616
                      PRIMARY_LOG_FILE,PAGE,POS = S0000009.LOG, 1, 49262315
                      STANDBY_LOG_FILE,PAGE,POS = S0000009.LOG, 1, 49262315
                            HADR_LOG_GAP(bytes) = 0
               STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000009.LOG, 1, 49262315
                 STANDBY_RECV_REPLAY_GAP(bytes) = 0
                               PRIMARY_LOG_TIME = 06/08/2011 13:49:19.000000 (1307566159)
                               STANDBY_LOG_TIME = 06/08/2011 13:49:19.000000 (1307566159)
                        STANDBY_REPLAY_LOG_TIME = 06/08/2011 13:49:19.000000 (1307566159)
                   STANDBY_RECV_BUF_SIZE(pages) = 16
                       STANDBY_RECV_BUF_PERCENT = 0
                     STANDBY_SPOOL_LIMIT(pages) = 0
                           PEER_WINDOW(seconds) = 0
                       READS_ON_STANDBY_ENABLED = Y
              STANDBY_REPLAY_ONLY_WINDOW_ACTIVE = N

The V10.1 format is similar to "db2 get database snapshot" output. V10.1 also added a number of fields for better monitoring. The new fields include LOG_HADR_WAIT_CUR, LOG_HADR_WAIT_ACCUMULATED, and LOG_HADR_WAIT_COUNT.

Fields not relevant to current status may be omitted in db2pd output. For example, standby replay only window fields are not shown when reads on standby is disabled; Takeover app remaining fields are included only when takeover is in progress.

Timestamps are printed in local time zone. The integer in parenthesis is the internal representation, typically the number of seconds since Jan.1.1970. This number is included for easy time arithmetic. Subsecond info is only shown in the formated timestamp.

In DB2 V97 and earlier releases, the output is like:

        Database Partition 0 -- Database HADRDB -- Active -- Up 0 days 00:19:25

        HADR Information:
        Role    State                SyncMode HeartBeatsMissed   LogGapRunAvg (bytes)
        Primary Peer                 Sync     0                  1412

        ConnectStatus ConnectTime                           Timeout
        Connected     Mon Sep 05 14:05:09 2006 (1158008709) 8

        LocalHost                                LocalService
        server1.ibm.com                          hdr_svc1

        RemoteHost                               RemoteService      RemoteInstance
        server2.ibm.com                          hdr_svc2           accounting

        PrimaryFile  PrimaryPg  PrimaryLSN
        S0000833.LOG 0          0x0000000001CA4000

        StandByFile  StandByPg  StandByLSN
        S0000832.LOG 3          0x0000000001CA3A77

The old format is harder to parse.

MON_GET_HADR table function

MON_GET_HADR table function is available starting DB2 V10.1. It provides the same information as "db2pd -hadr". It is heaviler weight because it requires SQL connection. But it can be accessed from any client and it is more friendly to database admin tools. Monitor queries can be issued to the standby database only when reads on standby is enabled.

Database Snapshot

HADR info can also be accessed through database snapshot interface. But monitoring HADR through this interface is deprecated as of DB2 V10.1. HADR info may be removed from this interface in a future release. This interface includes:

The GET SNAPSHOT FOR DATABASE command
The db2GetSnapshot API
The SNAPHADR administrative view
The SNAP_GET_HADR table function
Other snapshot administrative views and table functions

For multiple standby, the deprecated snapshot interface will only return information about the principle standby when issued to the primary, while db2pd and MON_GET_HADR table function returns info about all standbys.

Fields to watch out in HADR monitoring

For description of all fields, see MON_GET_HADR in DB2 Info Center.

For more on performance topics, such as congestion, log write time and HADR overhead, see HADR performance tuning and diagnostics

Role and State

HADR_ROLE : Is this database primary or standby? If the database is offline, its role can still be shown through the "HADR database role" field from "db2 get db cfg for <dbname>" output.

HADR_STATE : PEER is good. You are fully protected. If state does not enter peer for extended time, investigate the cause (most likely remote catch up speed is slower than primary log generation rate). Note that superAsync mode never enters peer, so you monitor log gap instead.

HADR_CONNECT_STATUS : CONNECTED is good. DISCONNECTED is bad. Investigate why primary and standby cannot connect to each other. This field can also show CONGESTED, meaning a database cannot send out data to its partner. CONGESTED is also bad. It is most likely to be seen on the primary side. It's less likely to be seen on the standby side because there isn't much data to send to the primary (usually just ack and heartbeat).

HADR_CONNECT_STATUS_TIME shows the start time of the current HADR_CONNECT_STATUS.

If HADR_CONNECT_STATUS is CONNECTED, HADR_CONNECT_STATUS_TIME shows connection establishment time.
If HADR_CONNECT_STATUS is DISCONNECTED, HADR_CONNECT_STATUS_TIME shows connection loss time.
If HADR_CONNECT_STATUS is CONGESTED, HADR_CONNECT_STATUS_TIME shows congestion start time.

Impact to Primary database logging

The LOG_HADR_WAIT_* fields are available starting DB2 V10.1. In older releases you will have to monitor from application side. For example, you can monitor application commit time. While the LOG_HADR_WAIT_* fields isolates the HADR cost in log writing, application side commit time is the overall cost of commit from database.

LOG_HADR_WAIT_CUR : How long the primary logger has been blocked waiting on log replication?
A small number (less than a few millisecond) is acceptable. Longer wait shows performance problem. You can sample this field over a period of time (such as take a reading every minute over one day) to find out the distribution of logger wait. If this number keep growing in real time (for example, it is 10 seconds longer on a reading taken 10 seconds later), then the logger is completely blocked (really bad).

LOG_HADR_WAIT_RECENT_AVG : Recent average (over last a few seconds) of logger HADR wait time. This field is not available in MON_GET_HADR table function because the table function is intended to be more an API than user interface. Tools could compute their own average using LOG_HADR_WAIT_ACCUMULATED and LOG_HADR_WAIT_COUNT for arbitrary time interval.

LOG_HADR_WAIT_ACCUMULATED and LOG_HADR_WAIT_COUNT : Accumulated logger HADR wait time and number of waits (each log flush (ie. log write) in peer state triggers a wait). Monitoring tools can compute average wait time for arbitrary time interval using these fields.

Log Gaps

HADR_LOG_GAP : This is the running average of (PRIMARY_LOG_POS - STANDBY_LOG_POS). It tells you how far the standby is behind the primary. This number is especially important in superAsync mode, which never enters peer state. HA/DR protection decreases as the gap grows. In local and remote catchup states, you should see decreasing gap and the pair will eventually enter peer state (assuming non superAsync mode). If the gap is increasing or stays high, it means that standby or network cannot keep up with the primary. In peer state, the gap should stay within a small range, because standby is basically one log flush behind primary.

STANDBY_RECV_REPLAY_GAP : This is the running average of (STANDBY_LOG_POS - STANDBY_REPLAY_POS). It shows the amount of received, but not yet replayed log data on the standby. With large STANDBY_SPOOL_LIMIT or STANDBY_RECV_BUF_SIZE, the gap can be large. During a takeover, standby needs to complete replay of logs in the gap. So larger gap will result in slower takeover (forced and non forced).

Note: In DB2 V97 and earlier releases, only a single STANDBY_LOG_POS is reported. It is derived from replay position. Because log spooling is not supported in these releases, log receive and replay positions are usually not too far from each other. In V10.1 and later, log receive position and replay position are reported separately. A large gap is possible when spooling is enabled.

Standby receive blocked, Standby receive buffer and spool usage

STANDBY_RECV_BUF_PERCENT (in V97 and earlier releases, this field is only available from "db2pd -hadr" command issued on the standby) reports standby log receive buffer usage. The buffer size can be configured by registry variable DB2_HADR_BUF_SIZE . If spooling is disabled (or not supported, as is the case on DB2 V97 and earlier), receive buffer full (100% used) will cause standby log receive to be blocked and primary log writing and transactions will eventually be blocked. As replay progresses, part of the buffer will be released and log receive will resume. If spooling is enabled, 100% buffer use does not indicate a bad condition. HADR will release the buffer for new incoming data if it needs to, even if log data in the buffer has not been replayed.

STANDBY_SPOOL_PERCENT (available on V10.5 and later) reports spool space used, as a percentage of configured limit hadr_spool_limit. NULL will be returned if the limit is -1 (unlimited). On earlier releases (or for the unlimited spooling case), you can compute spool space used using other monitor fields (result is in unit of pages):
(STANDBY_LOG_POS - STANDBY_REPLAY_LOG_POS) / 4096 - (STANDBY_RECV_BUF_SIZE * STANDBY_RECV_BUF_PERCENT)

STANDBY_RECV_BLOCKED flag (available on V10.5 and later) from the HADR_FLAGS field directly indicates that the standby log receiving is blocked. Primary log send and transactions will eventually be blocked if the condition persists. There are multiple scenarios of this condition:

When log spooling is disabled (or not supported), standby receive buffer is full (STANDBY_RECV_BUF_PERCENT is 100%).
When log spooling is enabled, spooling has reached configured spool limit (STANDBY_SPOOL_PERCENT is 100%).
The standby logging device is full (STANDBY_LOG_DEVICE_FULL flag from HADR_FLAGS field is set), regardless of whether spooling is enabled or not.

STANDBY_LOG_DEVICE_FULL flag (available on V10.5 and later) from HADR_FLAGS reports that standby log device is full. The STANDBY_RECV_BLOCKED flag will also be turned on when log device is full. The device full flag allows you to identify the cause of the blocked receiving.

In earlier releases, where the STANDBY_RECV_BLOCKED flag is not available, you will have to derive this condition using other fields:

If you are on V97 or earlier, or on V10.1 with spooling disabled (database config parameter hadr_spool_limit is 0), STANDBY_RECV_BUF_PERCENT reaching 100% indicates that standby log receive is blocked.
If you are on V10.1 with spooling enabled, first calculate spool space used (see STANDBY_SPOOL_PERCENT section above).
If spool space used is reaching hadr_spool_limit or standby logging device capacity (when spool limit is -1 (unlimited)), spooling is close to limit. Once it reaches the limit, standby log receive will be blocked.
On V10.1 and earlier, there is no monitoring fields to report standby logging device full. You will need to use OS commands to monitor the device. When device is full, log receive is blocked even if log receive buffer is not full, because DB2 need to write received log data to disks for crash recovery purpose.

Monitoring and identifying erroneous table spaces on the standby

Starting in v11.1.1.1 and v10.5 Fix Pack 9, when one or more tablespaces is in an invalid or error state on the Standby database, the HADR_FLAGS field will display the value 'STANDBY_TABLESPACE_ERROR'. For details, see: Monitoring and identifying tablespaces in invalid or error state, or tables in Inoperative state on the HADR Standby database User can recover bad tablespace on standby, following the instructions in: Recovering table space errors on an HADR standby database

If the failed tablespaces are important, reinitialize the standby to make them "good" again.

HADR monitoring tips

Monitoring the standby database

Reporting multiple standbys

Each standby only knows about the primary. It has no knowledge of other standbys. Monitoring command and query issued to the standby will report only the standby itself and the primary, as a usual HADR pair. No info about other standbys will be reported.

The primary is aware of all standbys. When issued on the primary ,"db2pd -hadr" and MON_GET_HADR table function will report all standbys. If a configured standby (listed in primary's hadr_target_list) is not connected, it is still displayed, as a reminder that multiple standbys are configured. The snapshot interface (deprecated) will only report the principal standby.

HADR pair view

Certain monitor fields are applicable to primary or standby only. For example, PEER_WAIT_LIMIT is applicable only to primary, STANDBY_RECV_BUF_SIZE, STANDBY_SPOOL_LIMIT, READS_ON_STANDBY_ENABLED are applicable only to standby. When this kind of information is reported, it is data from the database currently in this role (which may be the remote database), rather than the local database. For example, PEER_WAIT_LIMIT reported by a standby is the value configured on the primary, not the standby's local configuration. The standby's configuration will be used only when the standby turns into a primary.

Information about remote database

Primary and standby exchange monitoring information via heartbeat messages, which is sent periodically. Information about the remote database can be slightly out of date, up to a heartbeat interval. Heartbeat interval itself is reported via the HEARTBEAT_INTERVAL field. If a database has never connected to its partner database since activation, info about remote database will be returned as SQL NULL to indicate "unknown".

Unit of time duration

Per monitor table function convention, all MON_GET_HADR time duration fields use millisecond as unit. For those fields reflecting a config parameter (such as HADR_TIMEOUT, HADR_PEER_WINDOW) whose unit in config is second, the number returned by MON_GET_HADR table function will be different from the number used in "db2 get/update db cfg" command, the number returned by SYSIBMADM.DBCFG admin view or SYSPROC.DB_GET_CFG() table function. For example, for a 60 second HADR_TIMEOUT value, MON_GET_HADR will return 60000, while the config oriented interfaces will return 60. To convert the millisecond number to second, use column_name/1000 in your query.

"db2pd -hadr" returns seconds for parameters configured using seconds. Furthermore, the unit itself is printed in output to avoid confusion.

Monitoring workload

It's important be aware of the open transactions and avoid long running transactions in the workload to prevent long takeover time. The “db2pd -transaction” command displays the transactions in progress. The following is an example output of “db2pd -transaction” taken on a HADR standby database:

        Transactions:

        Address             AppHandl   [nod-index]    TranHdl    Locks    State    Tflag
        0x00007F409D103280  14         [000-00014]    4          0        READ     0x00000000
        0x00007F409D109380  0          [000-00000]    6          0        WRITE    0x00000000

        Tflag2        Firstlsn              Lastlsn               Firstlso    Lastlso
        0x00000000    0x0000000000000000    0x0000000000000000    0           0
        0x00000000    0x00000000000772A8    0x0000000000077437    61762921    61796038



        LogSpace    SpaceReserved    TID               AxRegCnt    GXID
        0           0                0x000000000100    1           0

        0           0                0x00000000022E    0           0

        The example shows that there is one write transaction that is going on. The transaction starts at LSN 0x0772A8 and is currently at 0x077437. This transaction is still relatively small and would not take too long to undo. However, if the difference between LastLsn and FirstLsn becomes too big, then it may take considerable time to undo the transaction when takeover happens. It’s recommended to optimize the workload to avoid long-running transactions, so that it would take reasonable time for HADR standby to complete takeover in case of fatal failure on primary.

Monitoring HADR takeover

"db2pd -hadr" is the recommended monitoring method during takeover. During a takeover, there may be a time window when clients cannot connect to either primary or standby, so SQL based methods such as table function won't work.

For more info, see "How to monitor takeover" section in HADR takeover

HADR Info in db2diag.log

Db2diag.log is an important DB2 diagnostic tool. It records history of events in DB2. See db2diag.log for details.