Monitoring HADR
Table of Contents
DB2 V10.1 has enhanced HADR monitoring. See
DB2 V10.1 Info Center
For HADR monitoring in V97, see
DB2 V97 Info Center
SQL based monitoring such as table functions and administrative views work on standby only when Reads on Standby is enabled. db2pd and the "db2 get snapshot for database" command always work on the standby.
"db2pd -hadr" option returns HADR info for a database. It is light weight, requiring no SQL connection. But it can only be issued on the database host machine (it attaches to DB2 server shared memory).
In
DB2 V10.1 and later
, the output is like
Database Member 0 -- Database HADRDB -- Active -- Up 0 days 00:23:17 -- Date 06/08/2011 13:57:23 HADR_ROLE = PRIMARY REPLAY_TYPE = PHYSICAL HADR_SYNCMODE = SYNC STANDBY_ID = 1 LOG_STREAM_ID = 0 HADR_STATE = PEER PRIMARY_MEMBER_HOST = hostP.ibm.com PRIMARY_INSTANCE = db2inst PRIMARY_MEMBER = 0 STANDBY_MEMBER_HOST = hostS1.ibm.com STANDBY_INSTANCE = db2inst STANDBY_MEMBER = 0 HADR_CONNECT_STATUS = CONNECTED HADR_CONNECT_STATUS_TIME = 06/08/2011 13:38:10.199479 (1307565490) HEARTBEAT_INTERVAL(seconds) = 25 HADR_TIMEOUT(seconds) = 100 TIME_SINCE_LAST_RECV(seconds) = 3 PEER_WAIT_LIMIT(seconds) = 0 LOG_HADR_WAIT_CUR(seconds) = 0.000 LOG_HADR_WAIT_RECENT_AVG(seconds) = 0.006298 LOG_HADR_WAIT_ACCUMULATED(seconds) = 0.516 LOG_HADR_WAIT_COUNT = 82 SOCK_SEND_BUF_REQUESTED,ACTUAL(bytes) = 0, 50772 SOCK_RECV_BUF_REQUESTED,ACTUAL(bytes) = 0, 87616 PRIMARY_LOG_FILE,PAGE,POS = S0000009.LOG, 1, 49262315 STANDBY_LOG_FILE,PAGE,POS = S0000009.LOG, 1, 49262315 HADR_LOG_GAP(bytes) = 0 STANDBY_REPLAY_LOG_FILE,PAGE,POS = S0000009.LOG, 1, 49262315 STANDBY_RECV_REPLAY_GAP(bytes) = 0 PRIMARY_LOG_TIME = 06/08/2011 13:49:19.000000 (1307566159) STANDBY_LOG_TIME = 06/08/2011 13:49:19.000000 (1307566159) STANDBY_REPLAY_LOG_TIME = 06/08/2011 13:49:19.000000 (1307566159) STANDBY_RECV_BUF_SIZE(pages) = 16 STANDBY_RECV_BUF_PERCENT = 0 STANDBY_SPOOL_LIMIT(pages) = 0 PEER_WINDOW(seconds) = 0 READS_ON_STANDBY_ENABLED = Y STANDBY_REPLAY_ONLY_WINDOW_ACTIVE = N
The V10.1 format is similar to "db2 get database snapshot" output. V10.1 also added a number of fields for better monitoring. The new fields include LOG_HADR_WAIT_CUR, LOG_HADR_WAIT_ACCUMULATED, and LOG_HADR_WAIT_COUNT.
Fields not relevant to current status may be omitted in db2pd output. For example, standby replay only window fields are not shown when reads on standby is disabled; Takeover app remaining fields are included only when takeover is in progress.
Timestamps are printed in local time zone. The integer in parenthesis is the internal representation, typically the number of seconds since Jan.1.1970. This number is included for easy time arithmetic. Subsecond info is only shown in the formated timestamp.
In DB2 V97 and earlier releases, the output is like:
Database Partition 0 -- Database HADRDB -- Active -- Up 0 days 00:19:25 HADR Information: Role State SyncMode HeartBeatsMissed LogGapRunAvg (bytes) Primary Peer Sync 0 1412 ConnectStatus ConnectTime Timeout Connected Mon Sep 05 14:05:09 2006 (1158008709) 8 LocalHost LocalService server1.ibm.com hdr_svc1 RemoteHost RemoteService RemoteInstance server2.ibm.com hdr_svc2 accounting PrimaryFile PrimaryPg PrimaryLSN S0000833.LOG 0 0x0000000001CA4000 StandByFile StandByPg StandByLSN S0000832.LOG 3 0x0000000001CA3A77The old format is harder to parse.
MON_GET_HADR table function is available starting DB2 V10.1. It provides the same information as "db2pd -hadr". It is heaviler weight because it requires SQL connection. But it can be accessed from any client and it is more friendly to database admin tools. Monitor queries can be issued to the standby database only when reads on standby is enabled.
HADR info can also be accessed through database snapshot interface. But monitoring HADR through this interface is deprecated as of DB2 V10.1. HADR info may be removed from this interface in a future release. This interface includes:
For multiple standby, the deprecated snapshot interface will only return information about the principle standby when issued to the primary, while db2pd and MON_GET_HADR table function returns info about all standbys.
For description of all fields, see MON_GET_HADR in DB2 Info Center.
For more on performance topics, such as congestion, log write time and HADR overhead, see HADR performance tuning and diagnostics
HADR_ROLE
: Is this database primary or standby? If the database is offline, its role can still be shown through the "HADR database role" field from "db2 get db cfg for <dbname>" output.
HADR_STATE
: PEER is good. You are fully protected. If state does not enter peer for extended time, investigate the cause (most likely remote catch up speed is slower than primary log generation rate). Note that superAsync mode never enters peer, so you monitor log gap instead.
HADR_CONNECT_STATUS
: CONNECTED is good. DISCONNECTED is bad. Investigate why primary and standby cannot connect to each other. This field can also show CONGESTED, meaning a database cannot send out data to its partner. CONGESTED is also bad. It is most likely to be seen on the primary side. It's less likely to be seen on the standby side because there isn't much data to send to the primary (usually just ack and heartbeat).
HADR_CONNECT_STATUS_TIME shows the start time of the current HADR_CONNECT_STATUS.
The LOG_HADR_WAIT_* fields are available starting DB2 V10.1. In older releases you will have to monitor from application side. For example, you can monitor application commit time. While the LOG_HADR_WAIT_* fields isolates the HADR cost in log writing, application side commit time is the overall cost of commit from database.
LOG_HADR_WAIT_CUR
: How long the primary logger has been blocked waiting on log replication?
A small number (less than a few millisecond) is acceptable. Longer wait shows performance problem. You can sample this field over a period of time (such as take a reading every minute over one day) to find out the distribution of logger wait. If this number keep growing in real time (for example, it is 10 seconds longer on a reading taken 10 seconds later), then the logger is completely blocked (really bad).
LOG_HADR_WAIT_RECENT_AVG
: Recent average (over last a few seconds) of logger HADR wait time. This field is not available in MON_GET_HADR table function because the table function is intended to be more an API than user interface. Tools could compute their own average using LOG_HADR_WAIT_ACCUMULATED and LOG_HADR_WAIT_COUNT for arbitrary time interval.
LOG_HADR_WAIT_ACCUMULATED
and
LOG_HADR_WAIT_COUNT
: Accumulated logger HADR wait time and number of waits (each log flush (ie. log write) in peer state triggers a wait). Monitoring tools can compute average wait time for arbitrary time interval using these fields.
HADR_LOG_GAP
: This is the running average of (PRIMARY_LOG_POS - STANDBY_LOG_POS). It tells you how far the standby is behind the primary. This number is especially important in superAsync mode, which never enters peer state. HA/DR protection decreases as the gap grows. In local and remote catchup states, you should see decreasing gap and the pair will eventually enter peer state (assuming non superAsync mode). If the gap is increasing or stays high, it means that standby or network cannot keep up with the primary. In peer state, the gap should stay within a small range, because standby is basically one log flush behind primary.
STANDBY_RECV_REPLAY_GAP
: This is the running average of (STANDBY_LOG_POS - STANDBY_REPLAY_POS). It shows the amount of received, but not yet replayed log data on the standby. With large STANDBY_SPOOL_LIMIT or STANDBY_RECV_BUF_SIZE, the gap can be large. During a takeover, standby needs to complete replay of logs in the gap. So larger gap will result in slower takeover (forced and non forced).
Note: In DB2 V97 and earlier releases, only a single STANDBY_LOG_POS is reported. It is derived from replay position. Because log spooling is not supported in these releases, log receive and replay positions are usually not too far from each other. In V10.1 and later, log receive position and replay position are reported separately. A large gap is possible when spooling is enabled.
STANDBY_RECV_BUF_PERCENT (in V97 and earlier releases, this field is only available from "db2pd -hadr" command issued on the standby) reports standby log receive buffer usage. The buffer size can be configured by registry variable DB2_HADR_BUF_SIZE . If spooling is disabled (or not supported, as is the case on DB2 V97 and earlier), receive buffer full (100% used) will cause standby log receive to be blocked and primary log writing and transactions will eventually be blocked. As replay progresses, part of the buffer will be released and log receive will resume. If spooling is enabled, 100% buffer use does not indicate a bad condition. HADR will release the buffer for new incoming data if it needs to, even if log data in the buffer has not been replayed.
STANDBY_SPOOL_PERCENT
(available on V10.5 and later) reports spool space used, as a percentage of configured limit hadr_spool_limit. NULL will be returned if the limit is -1 (unlimited). On earlier releases (or for the unlimited spooling case), you can compute spool space used using other monitor fields (result is in unit of pages):
(STANDBY_LOG_POS - STANDBY_REPLAY_LOG_POS) / 4096 - (STANDBY_RECV_BUF_SIZE * STANDBY_RECV_BUF_PERCENT)
STANDBY_RECV_BLOCKED flag (available on V10.5 and later) from the HADR_FLAGS field directly indicates that the standby log receiving is blocked. Primary log send and transactions will eventually be blocked if the condition persists. There are multiple scenarios of this condition:
STANDBY_LOG_DEVICE_FULL flag (available on V10.5 and later) from HADR_FLAGS reports that standby log device is full. The STANDBY_RECV_BLOCKED flag will also be turned on when log device is full. The device full flag allows you to identify the cause of the blocked receiving.
In earlier releases, where the STANDBY_RECV_BLOCKED flag is not available, you will have to derive this condition using other fields:
Starting in v11.1.1.1 and v10.5 Fix Pack 9, when one or more tablespaces is in an invalid or error state on the Standby database, the HADR_FLAGS field will display the value 'STANDBY_TABLESPACE_ERROR'. For details, see:
Monitoring and identifying tablespaces in invalid or error state, or tables in Inoperative state on the HADR Standby database
User can recover bad tablespace on standby, following the instructions in:
Recovering table space errors on an HADR standby database
If the failed tablespaces are important, reinitialize the standby to make them "good" again.
SQL based monitoring such as table functions and administrative views work on standby only when Reads on Standby is enabled. db2pd and the "db2 get snapshot for database" command always work on the standby.
Each standby only knows about the primary. It has no knowledge of other standbys. Monitoring command and query issued to the standby will report only the standby itself and the primary, as a usual HADR pair. No info about other standbys will be reported.
The primary is aware of all standbys. When issued on the primary ,"db2pd -hadr" and MON_GET_HADR table function will report all standbys. If a configured standby (listed in primary's hadr_target_list) is not connected, it is still displayed, as a reminder that multiple standbys are configured. The snapshot interface (deprecated) will only report the principal standby.
Certain monitor fields are applicable to primary or standby only. For example, PEER_WAIT_LIMIT is applicable only to primary, STANDBY_RECV_BUF_SIZE, STANDBY_SPOOL_LIMIT, READS_ON_STANDBY_ENABLED are applicable only to standby. When this kind of information is reported, it is data from the database currently in this role (which may be the remote database), rather than the local database. For example, PEER_WAIT_LIMIT reported by a standby is the value configured on the primary, not the standby's local configuration. The standby's configuration will be used only when the standby turns into a primary.
Primary and standby exchange monitoring information via heartbeat messages, which is sent periodically. Information about the remote database can be slightly out of date, up to a heartbeat interval. Heartbeat interval itself is reported via the HEARTBEAT_INTERVAL field. If a database has never connected to its partner database since activation, info about remote database will be returned as SQL NULL to indicate "unknown".
Per monitor table function convention, all MON_GET_HADR time duration fields use millisecond as unit. For those fields reflecting a config parameter (such as HADR_TIMEOUT, HADR_PEER_WINDOW) whose unit in config is second, the number returned by MON_GET_HADR table function will be different from the number used in "db2 get/update db cfg" command, the number returned by SYSIBMADM.DBCFG admin view or SYSPROC.DB_GET_CFG() table function. For example, for a 60 second HADR_TIMEOUT value, MON_GET_HADR will return 60000, while the config oriented interfaces will return 60. To convert the millisecond number to second, use column_name/1000 in your query.
"db2pd -hadr" returns seconds for parameters configured using seconds. Furthermore, the unit itself is printed in output to avoid confusion.
Transactions: Address AppHandl [nod-index] TranHdl Locks State Tflag 0x00007F409D103280 14 [000-00014] 4 0 READ 0x00000000 0x00007F409D109380 0 [000-00000] 6 0 WRITE 0x00000000 Tflag2 Firstlsn Lastlsn Firstlso Lastlso 0x00000000 0x0000000000000000 0x0000000000000000 0 0 0x00000000 0x00000000000772A8 0x0000000000077437 61762921 61796038 LogSpace SpaceReserved TID AxRegCnt GXID 0 0 0x000000000100 1 0 0 0 0x00000000022E 0 0 The example shows that there is one write transaction that is going on. The transaction starts at LSN 0x0772A8 and is currently at 0x077437. This transaction is still relatively small and would not take too long to undo. However, if the difference between LastLsn and FirstLsn becomes too big, then it may take considerable time to undo the transaction when takeover happens. It’s recommended to optimize the workload to avoid long-running transactions, so that it would take reasonable time for HADR standby to complete takeover in case of fatal failure on primary.
"db2pd -hadr" is the recommended monitoring method during takeover. During a takeover, there may be a time window when clients cannot connect to either primary or standby, so SQL based methods such as table function won't work.
For more info, see "How to monitor takeover" section in
HADR takeover
Db2diag.log is an important DB2 diagnostic tool. It records history of events in DB2. See db2diag.log for details.