HADR Performance Guide

Presentation

Slides for a companion presentation are available at HADR_perf.pdf. The presentation is intended as an overview or quick start for HADR performance, while the article below is best for in-depth study or as a handy reference.

Article

When the DB2 HADR feature is enabled, the primary database's transaction logs are replicated to one or more standby databases in real time. The replication provides HA and DR protection, but at a cost: it adds overhead to the primary's log writing, and applications on the primary may experience slowdowns. This article focuses specifically on HADR performance.

This article consists of three main sections:
- Configuring HADR for Optimal Performance: Preventive Care
- Monitoring HADR Performance: Get early warning of problems
- Diagnosing and fixing HADR performance problems: When a problem does occur, how to fix it.

Configuring HADR for Optimal Performance

This section discusses how to tune HADR for optimal performance. The tuning steps should be done as part of HADR setup. If you have already hit problems, better late than never.

TCP Tuning

HADR uses TCP to send logs. So TCP performance is critical to HADR. Registry variables DB2_HADR_SOSNDBUF and DB2_HADR_SORCVBUF (socket send and receive buffer size) are important to TCP performance. See TCP Tuning

HADR Synchronization Mode

HADR overhead on primary database logging varies, depending on the HADR sync mode chosen. Although a stronger sync mode provides more HA and DR protection, hardware and network speed may limit your choice.

The rule of thumb is to use SYNC or NEARSYNC mode on a LAN, and ASYNC or SUPERASYNC on a WAN. With the new HADR tool kit, you can do quantitative analysis of your workload, disk, and network to make an informed choice. The step by step TCP and sync mode configuration procedure is recommended during HADR planning, after HADR deployment (if it was not done during planning), or any time a performance problem is suspected (workload characteristics can change over time).

Standby receive buffer and spool size

The standby receive buffer (in memory) and spool (on disk) serve the same purpose: to hold received, but not yet replayed, log data on the standby. Log receive produces log data into the buffer and spool; log replay consumes it. The buffer and spool allow the standby to continue to receive log data when the replay position is behind the receive position, until the buffer or spool fills up. This buffering can absorb load peaks in the primary workload. However, if the sustained standby replay speed is slower than the primary log generation speed, the buffer or spool will eventually fill up and primary log writing and sending will be blocked.

Log spooling was added in DB2 V10.1. Only in-memory buffering is available in older releases. Where both features are available, spooling is preferred, because the spooling device can be much larger and the performance difference between buffering in memory and spooling on disk is minimal.

There is a side effect of a large buffer or spool size. During a takeover (forced or non-forced), the old standby needs to finish replaying all buffered/spooled log data. Thus a large buffer or spool can result in a longer takeover time. You should consider takeover time when setting the buffer or spool size.

See also buffer size and spool limit in the configuration page.

Note: hadr_spool_limit defaults to 0 (spooling disabled) in V10.1. In V10.5, the default is "automatic". DB2 will use (logprimary + logsecond) * logFileSize as spool size for "automatic". The exact value used by "automatic" can be seen in the STANDBY_SPOOL_LIMIT monitor field.

You can experiment with the buffer size (spooling is preferred on V10.1 and later). The default is "2 * primary_database_logbufsz". Keep doubling the size with the DB2_HADR_BUF_SIZE registry variable until there is no further positive effect. On large systems, a 2 to 4 GB size is not unusual; 4 GB is a soft limit. See buffer size

On V10.1 and later, you can set the spool size. Start from 512 MB and double the size until there is no further positive effect. The "automatic" value on V10.5 is a moderate size. It is chosen as a compromise between spooling and takeover time, and also because the log device is supposed to have space for at least logprimary + logsecond log files. A larger size often gives a better spooling effect, at the cost of longer takeover time.
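As a minimal sketch of these settings (HADRDB is a placeholder alias; both values are in units of 4 KB log pages, so verify the units against your release; registry variable changes typically require an instance restart, and a new spool limit is typically picked up when the standby database is reactivated):

# On the standby instance: receive buffer of 524288 pages (roughly 2 GB).
db2set DB2_HADR_BUF_SIZE=524288

# On the standby database: spool limit of 131072 pages (roughly 512 MB).
db2 update db cfg for HADRDB using HADR_SPOOL_LIMIT 131072

# Confirm the spool limit actually in effect on the standby.
db2pd -db HADRDB -hadr | grep -i SPOOL_LIMIT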

HADR peer wait limit

DB2_HADR_PEER_WAIT_LIMIT is a registry variable that limits primary logging wait time in peer state. Normally it is not set (the default 0 takes effect, meaning no limit). It limits the HADR impact on primary logging, at the cost of HA and DR protection. When the limit is reached, the primary database disconnects from the standby. The standby immediately reconnects. The pair will then start in remote catchup state, where primary logging is not slowed down by log shipping, but HA and DR protection is weaker than in peer state. It can take some time for the pair to reach peer state again. If the primary logging rate is higher than the log shipping or replay speed, the pair may not reach peer state again until the primary logging rate goes down.

See peer wait limit for more info

Log Path on Primary and Standby

A dedicated device for the log path is strongly recommended. Do not share devices between tablespaces and the log path. A high performance device is recommended as the log device because data changes from all transactions on a database member go to a single log stream. Log stream writing can easily become a bottleneck, even when HADR is not enabled. This is particularly important for OLTP systems. The DB2 log scanner can be used to analyze logger behavior. See db2 log scanner , DB2 Logging Performance , and the " Tuning and Monitoring Database System Performance " white paper.

Self Tuning Memory Manager (STMM) on the Standby

Self Tuning Memory Manager (STMM) only runs when a database is in the primary role. It does not run on the standby, even if configured as enabled. See also Self Tuning Memory Manager (STMM) on the Standby . You may manually configure the standby using the values set by STMM on the primary, to make replay or reads on standby work better, or to make the standby perform well in the primary role right after a takeover.

Monitoring HADR Performance

General Notes

  • The snapshot interface is being deprecated. The MON_* table functions and db2pd are the preferred monitoring methods. See Database Snapshot
  • Monitor table functions work on the standby only when Reads on Standby is enabled. db2pd and the "db2 get snapshot for database" command always work on the standby.
  • Information about a remote database (for example, the STANDBY_LOG_POS field returned from a query against the primary database) can be slightly out of date. For an up to date reading, query the source database directly. See also HADR monitoring tips

Primary Database Logging Rate

The LOG_POS (PRIMARY_LOG_POS, STANDBY_LOG_POS, or STANDBY_REPLAY_LOG_POS) field in HADR monitoring is a byte offset in the log stream. So the difference between two log positions is the amount of log data (in bytes) between the two positions. To compute the primary logging rate, just take two readings of PRIMARY_LOG_POS, say at t1 and t2, then do
(PRIMARY_LOG_POS2 - PRIMARY_LOG_POS1) / (t2 - t1)
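For example, a minimal sketch of this computation from the shell (HADRDB is a placeholder alias; MON_GET_HADR requires V10.1 or later):

#!/bin/sh
# Sample PRIMARY_LOG_POS twice, 60 seconds apart, and print bytes per second.
db2 connect to HADRDB > /dev/null
pos1=$(db2 -x "select primary_log_pos from table(mon_get_hadr(0))")
sleep 60
pos2=$(db2 -x "select primary_log_pos from table(mon_get_hadr(0))")
echo "Primary logging rate: $(( (pos2 - pos1) / 60 )) bytes/second"
db2 terminate > /dev/null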

If HADR is not enabled, you can get logging rate from the logger component as follows:

  • In V10.1 and later, column CURRENT_LSO from table function MON_GET_TRANSACTION_LOG is equivalent to PRIMARY_LOG_POS (they may differ by 1, because one is defined as the "next byte to write" and the other as the "last byte written").
  • In V9.7 and earlier, use the LOG_WRITES (number of log pages written) field from table function SNAP_GET_DB, or the "Log pages written" field from "db2 get snapshot for database" command. Compute logging rate as:
    (LOG_WRITES2 - LOG_WRITES1) * 4096 / (t2 - t1)

You can also scan log files to get historical logging rate and a wealth of logging metrics. See db2logscan

Obviously, the logging rate cannot exceed the speed of the slowest device in the processing chain. It is limited by the speed of the primary log disk, the standby log disk, and the primary-standby network. And it is sensitive to the HADR sync mode. If the logging rate is close to the speed of any device (or exceeds the speed of a device, when HADR is not enabled), then that device is or will be the bottleneck. See HADR sync mode for the interaction of these factors.

Log Shipping Rate

Compute HADR log shipping rate using STANDBY_LOG_POS. Take two readings at t1 and t2, then compute as:
(STANDBY_LOG_POS2 - STANDBY_LOG_POS1) / (t2 - t1)
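As a sketch, both positions can be pulled in one query on the primary (HADRDB is a placeholder alias); take the sample at t1 and again at t2, then divide each delta by (t2 - t1) to get the primary logging rate and the log shipping rate respectively:

db2 connect to HADRDB > /dev/null
db2 -x "select primary_log_pos, standby_log_pos from table(mon_get_hadr(0))"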

In remote catchup state, primary log writing and HADR log shipping are independent. So primary logging rate and log shipping rate are independent.

In peer state, the primary and standby log positions cannot be far apart (the gap is one flush in SYNC and NEARSYNC modes, and the primary-standby network pipeline size in ASYNC mode). The primary logging rate and the HADR log shipping rate will be close (identical over time) in peer state.

Log write size and time

Due to the access time overhead for each write (on traditional hard drives), a larger write size will give you higher throughput on the same disk. See DB2 Logging Performance . Log write size is an important logging metric.

Primary vs. Standby log write

Both the primary and standby databases write log files. The primary writes log data submitted by SQL agents. The standby writes received log pages to its local log path, for crash recovery purposes. See also: Log file management on standby . You can monitor primary and standby log writes individually; each database keeps its own log write monitoring fields, which reflect the log write activity of the local database only.

Log write on primary is also known as "log flush", meaning "flush log pages from log buffer onto log disk". The term "log flush" is not used for log write on the standby.

Scanning log files always reports log metrics of the primary, even if a file was shipped to the standby and rewritten in the standby's log path, because the standby preserves the primary's log metrics when it writes received log pages.

Log write size

To compute the log write size, get the LOG_WRITES (number of log pages written since database activation) and NUM_LOG_WRITE_IO (number of write() calls to the OS) fields from table function MON_GET_TRANSACTION_LOG (V10.1 and later), or table function SNAP_GET_DB (V9.7 and earlier). Take two readings and calculate the average flush size (in pages) between the two readings as:
(LOG_WRITES2 - LOG_WRITES1) / (NUM_LOG_WRITE_IO2 - NUM_LOG_WRITE_IO1)

Alternatively, you can use the "db2 get snapshot for database" command. Compute the log write size using the "Log pages written" and "Number write log IOs" fields.

You can also scan log files to get the primary's historical flush size and flush duration. See db2logscan . Pay attention to very large (more than a few hundred pages) or very long (longer than a few seconds) flushes. See Statistical Distribution of Logging Metrics .
Note: Scanning log files always reports log metrics of the primary , even if a file was shipped to the standby, and then re-written in the standby's log path.

Typical log write size on the primary is from a few pages to a few hundred pages, depending on the workload.

Log write time

To compute the average log write time per write() call, get the LOG_WRITE_TIME (accumulated log write time since database activation, in milliseconds) and NUM_LOG_WRITE_IO (number of write() calls to the OS) fields from table function MON_GET_TRANSACTION_LOG (V10.1 and later). Take two readings and calculate the average log write time between the two readings as:
(LOG_WRITE_TIME2 - LOG_WRITE_TIME1) / (NUM_LOG_WRITE_IO2 - NUM_LOG_WRITE_IO1)

For table function SNAP_GET_DB (V9.7 and earlier), log write time is reported by two fields: LOG_WRITE_TIME_S (the whole "second" part) and LOG_WRITE_TIME_NS (the nanosecond part in addition to the whole second part). Use those fields with the NUM_LOG_WRITE_IO field to compute the average log write time.

Alternatively, you can use the "db2 get snapshot for database" command. Compute the average log write time using the "Log write time (sec.ns)" and "Number write log IOs" fields. Log write time is reported as a floating point number in seconds.

Typical log write time is a few milliseconds for a small write (a few pages).

On the primary, log write time from the monitoring fields does NOT include HADR overhead. It only includes the net time of writing to the primary log disk. This applies to MON_GET_TRANSACTION_LOG, SNAP_GET_DB, and "db2 get snapshot for database". For example, in SYNC mode, suppose the disk write takes 4 ms, and the log write thread then waits 6 ms for the data to be replicated to the standby (the primary sends the data to the standby; the ack message from the standby arrives 6 ms after the disk write on the primary finishes). The total flush time is 10 ms, the net disk write time is 4 ms, and the HADR overhead is 6 ms.

HADR overhead is reported via the LOG_HADR_WAIT_* fields (available starting in DB2 V10.1). See "Impact to Primary database logging" in the monitoring topic. Compare the relative sizes of the net disk write time and the HADR overhead to see if the HADR overhead is excessive.
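For example, a hedged sketch that puts the two numbers side by side (HADRDB is a placeholder alias; V10.1 and later):

# Net disk write time: LOG_WRITE_TIME (milliseconds) divided by NUM_LOG_WRITE_IO.
db2 "select log_write_time, num_log_write_io from table(mon_get_transaction_log(-1))"

# HADR overhead per flush, as reported on the primary by db2pd.
db2pd -db HADRDB -hadr | grep LOG_HADR_WAIT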

You can also scan log files to get the primary's historical flush size and flush duration. See db2logscan. Pay attention to very large (more than a few hundred pages) or very long (longer than a few seconds) flushes. See Statistical Distribution of Logging Metrics . The "flush duration" reported by the scanner is the total time the primary spent on a flush, including the local disk write and replication to the standby (HADR overhead).
Note: Scanning log files always reports log metrics of the primary , even if a file was shipped to the standby, and then re-written in the standby's log path.

Alert Conditions: Impact to primary database logging

See also "Impact to Primary database logging" and "Standby receive blocked, Standby receive buffer and spool usage" from monitoring topic.

You may set up scripts to generate alerts based on monitor fields (the LOG_HADR_WAIT_* fields are available starting in DB2 V10.1). The scripts can use the following logic:

  • Trigger an alert if LOG_HADR_WAIT_CUR is more than 5 seconds. It is simply bad for a log write to be stuck for 5 seconds or more: primary database transactions would be blocked for that long.
  • Trigger an alert if LOG_HADR_WAIT_RECENT_AVG (available from db2pd only) is more than 20 ms.

A typical log write takes 5 to 10 ms of disk write time. If an additional 20 ms of HADR overhead is added, log writing becomes much slower. An HADR system should have an overhead of less than 10 ms; a well tuned system can have 5 ms or less.

  • Compute the average logger HADR wait time over a custom interval using LOG_HADR_WAIT_ACCUMULATED and LOG_HADR_WAIT_COUNT (LOG_HADR_WAIT_RECENT_AVG is readily available from db2pd, but you have no control over the interval over which that average is computed). Use this logic to compute your own average (see also the sketch after this list):

At time t1, you get LOG_HADR_WAIT_ACCUMULATED1 and LOG_HADR_WAIT_COUNT1
At time t2, you get LOG_HADR_WAIT_ACCUMULATED2 and LOG_HADR_WAIT_COUNT2
The average HADR wait time per log write between t1 and t2 is
(LOG_HADR_WAIT_ACCUMULATED2 - LOG_HADR_WAIT_ACCUMULATED1) / (LOG_HADR_WAIT_COUNT2 - LOG_HADR_WAIT_COUNT1)
You can trigger an alert if the computed average is more than 20 ms.

  • Alert on the STANDBY_RECV_BLOCKED flag (available on V10.5 and later) from the HADR_FLAGS field.
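A minimal shell sketch of these alert conditions (HADRDB and the 20 ms threshold are placeholders; LOG_HADR_WAIT_* requires V10.1 and STANDBY_RECV_BLOCKED requires V10.5; db2pd reports the recent average in seconds in recent releases, so check the unit in your output):

#!/bin/sh
# On the primary: alert if the recent average HADR wait exceeds 20 ms (0.020 s).
avg=$(db2pd -db HADRDB -hadr | awk '/LOG_HADR_WAIT_RECENT_AVG/ {print $NF}')
echo "$avg" | awk '$1 > 0.020 {print "ALERT: average HADR wait " $1 " seconds per log write"}'

# On the standby: alert if log receive is blocked.
db2pd -db HADRDB -hadr | grep -q STANDBY_RECV_BLOCKED &&
    echo "ALERT: standby log receive is blocked"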

In older releases where the LOG_HADR_WAIT_* fields are not available, you will have to monitor logging overhead from the application side. For example, you can watch application commit time (how long it takes to execute a "commit" request). Application-side commit time is the overall cost of a commit as seen by the application. When commit time is long, HADR is a suspect, but the problem could be caused by other database components. In contrast, the LOG_HADR_WAIT_* fields report the net HADR cost in log writing, which is direct evidence of HADR overhead on logging.

In older releases where the STANDBY_RECV_BLOCKED flag is not available, you will have to derive the condition using other monitor fields; see Standby receive blocked on the monitoring page.

Note: The primary and standby exchange monitoring information via heartbeat messages, which are sent periodically. Information about the remote database can be slightly out of date, by up to a heartbeat interval. The heartbeat interval itself is reported via the HEARTBEAT_INTERVAL field. Thus it is best to retrieve the LOG_HADR_WAIT_* fields from the primary and the STANDBY_RECV_BLOCKED flag from the standby.

IBM Data Studio Web Console Alerts

IBM Data Studio Web Console version 3.1 has enhanced HADR setup and management. It also provides built-in alerts for HADR. Some built-in alerts are identical to the ones suggested in the "Alert Conditions" section above. See Data Studio web console

You may also create custom alerts. See Configuring Data Studio web console alerts

Sender Congestion

If HADR cannot send out data to its partner database because the network is congested (think of a traffic jam), the HADR_CONNECT_STATUS monitor field will report "CONGESTED", and HADR_CONNECT_STATUS_TIME will show the congestion start time. The congestion duration can be computed as
time_when_monitor_data_is_taken - HADR_CONNECT_STATUS_TIME

For db2pd, the time the monitor data was taken is reported at the beginning of the output. Example:
Database Member 0 -- Database HADRDB -- Active -- Up 0 days 00:00:36 -- Date 2013-09-13-11.19.42.213563

For MON_GET_HADR table function, you can include current_timestamp in your query. Example:

select HADR_CONNECT_STATUS, HADR_CONNECT_STATUS_TIME, current_timestamp as current_time, current_timestamp - HADR_CONNECT_STATUS_TIME as duration from table (mon_get_hadr(0))

HADR_CONNECT_STATUS HADR_CONNECT_STATUS_TIME   CURRENT_TIME               DURATION
------------------- -------------------------- -------------------------- ----------------------
CONNECTED           2013-09-13-11.19.09.974274 2013-09-13-11.30.55.811394            1145.837120

1 record(s) selected.


Note: Due to a race condition, HADR_CONNECT_STATUS_TIME might be slightly later than current_timestamp. Just ignore such cases (negative duration).

A congestion longer than 1 second should be of concern. Shorter congestions are normal; they are part of normal network flow control (think of a traffic light). See also Non blocking IO . Of course, the line between a normal red light and a traffic jam is always fuzzy: a long wait at the red light can turn into a traffic jam. What really matters is the network throughput and the impact on applications. If you are experiencing a slower logging rate or slower application response when HADR is enabled, look for congestion. You may sample the HADR_CONNECT_STATUS and HADR_CONNECT_STATUS_TIME fields periodically (for example, every minute) and look for congestions longer than a few seconds.

CONGESTED is most likely to be seen on the primary side. It's less likely to be seen on the standby side because there isn't much data to send to the primary (usually just ack and heartbeat messages).

There are two common causes of congestion:

  • Network is slow. It cannot accept data from the sender fast enough.
  • Standby is not receiving. Network pipeline from primary to standby becomes full.


We will study congestion in more detail in the diagnostic section below. A few notes here:

  • In ASYNC mode, a primary logging slowdown or stop is typically caused by congestion.
  • In SYNC and NEARSYNC modes, a primary logging slowdown or stop often does not involve congestion. In a typical scenario, the standby stops receiving because its receive buffer is full. The primary can still send out one more flush, which is buffered in the network pipeline between the primary and standby. The standby cannot fully receive it because its buffer is full, so it cannot send the ack message back, and the primary is stuck waiting for the ack.
  • Congestion, even extended congestion (more than a few seconds), is normal in remote catchup state, where the primary reads log pages from log files and sends the pages to the standby. If disk read speed is higher than network send speed or standby replay speed, primary send can hit congestion and be blocked. This is a typical case of a fast producer waiting for a slow consumer. As long as the overall log shipping rate (computed using STANDBY_LOG_POS) is reasonable, there is no cause for worry.

Monitoring Hardware Resources

You should monitor hardware resources on both the primary and the standby. A common tool is vmstat. Check CPU, memory, and IO utilization. Look for bottlenecks (100% utilization). See the " Tuning and Monitoring Database System Performance " white paper for more info.
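For example (interval and count are arbitrary choices):

# Sample CPU, memory, and I/O activity every 5 seconds, 24 times.
vmstat 5 24
# On Linux, iostat -x (from the sysstat package) adds per-device utilization.
iostat -x 5 24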

Diagnosing and Fixing HADR Performance Problems

In this section, we will first discuss how to identify bottlenecks in the system, then discuss each bottleneck scenario in detail.

Identifying Bottleneck

HADR data flow is as follows (see HADR log shipping for a nice picture and details):

  • Primary produces log pages
  • Primary sends log pages to the standby
  • Standby receives log pages
  • Standby writes received log pages to disk
  • Standby replays written log pages

The operations on the critical path are (each operation depends on the previous one):
send - receive - write - replay

There are two common bottlenecks along the data flow:

  • Slow network: Log data is not shipped to standby host fast enough. Standby database often waits for more data.
  • Slow standby: Standby log processing (usually log replay) is slow or blocked. Standby log receiving is blocked from time to time.

The STANDBY_RECV_BLOCKED flag (available on V10.5 and later) from the HADR_FLAGS monitor field directly indicates that standby log receiving is blocked. If the system repeatedly hits the STANDBY_RECV_BLOCKED condition, then it is a slow standby case; otherwise, it is likely a slow network case.

Note: In older releases, you will have to derive the STANDBY_RECV_BLOCKED condition using other monitor fields; see "Standby receive blocked, Standby receive buffer and spool usage" in the monitoring topic. For the STANDBY_LOG_DEVICE_FULL flag, use an OS command such as "df" on Unix to monitor standby log device capacity.

The logic below lists causes of HADR performance problems. Use it to identify the bottleneck.

if (STANDBY_RECV_BLOCKED)
{
    This is a slow standby case.

    if (STANDBY_LOG_DEVICE_FULL)
        Standby log device is too small. Enlarge it.
    else
        Standby replay is too slow. Tune replay or upgrade hardware.
}
else
{
    Most likely a slow network case.
    Measure network speed to confirm.
    Tune or upgrade the network if confirmed.
    Or use a less demanding HADR sync mode.

    In a rare case, it is slow standby log write.
    Measure standby disk speed and log write size to confirm.
    Tune or upgrade the disk if confirmed.
}
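A minimal shell sketch of the first check (V10.5 and later; HADRDB is a placeholder alias):

#!/bin/sh
# Read the HADR_FLAGS line from db2pd and branch the same way as the logic above.
flags=$(db2pd -db HADRDB -hadr | awk -F= '/HADR_FLAGS/ {print $2}')
case "$flags" in
    *STANDBY_LOG_DEVICE_FULL*) echo "Slow standby: log device full; enlarge it" ;;
    *STANDBY_RECV_BLOCKED*)    echo "Slow standby: replay too slow; tune replay or upgrade hardware" ;;
    *)                         echo "Receive not blocked: suspect slow network (or, rarely, slow standby log write)" ;;
esac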

There is no simple flag at the DB2 level to differentiate the slow network and slow standby log write cases. Since slow network is the most common case, you should start there with the following steps:

  • Find out the time when applications are experiencing slowdown, using client-side metrics; DB2 metrics such as primary logging rate, log shipping rate, congestion status, and the LOG_HADR_WAIT_* fields; or db2logscan (look at flush duration, transaction duration, commit interval, etc.)
  • Measure flush (log write) size on the primary for the slow period, using a method from "Log write size and time" section above.
  • Run HADR simulator using
    • Primary's flush size from the slow period
    • The actual socket buffer size used by primary and standby databases. If DB2_HADR_SOSNDBUF (or DB2_HADR_SORCVBUF) is set, use the same number for simulator -sockSndBuf (or -sockRcvBuf) option. Otherwise, do not specify the option (simulator will use system default, just like real HADR).
    • The actual HADR sync mode of the database

to measure network throughput for the given sync mode. If the resulting throughput is close to the actual throughput of the HADR system, it means the network is the bottleneck.

To check if it's a case of slow standby log write, use these steps:

  • Find out the time when applications are experiencing slow down.
  • Measure log write size on the standby for the slow period, using a method from "Log write size and time" section above.
  • Measure standby log device write throughput using HADR simulator , with the following command line:
    simhadr -write <logPath>/testFile -flushSize <standbyWriteSize> -t 60

    Make sure you have enough space on <logPath> for a 60 second run (estimate the data amount by "estimated_throughput * 60 seconds"). Alternatively, you can use a shorter run, as long as the result is stable (not sensitive to the length of the run), or you can use the -n option to specify the exact number of flushes (total data amount is "flushSize * number_of_flushes").
    If the resulting throughput is close to the actual throughput of the HADR system, it means standby log write is the bottleneck.

In even rarer cases, both the network and standby log write are bottlenecks.

Gathering Diagnostic Information

To gather information for diagnostics, monitor HADR at a regular interval (for example, every minute). A shell script sketch (with placeholder names) is shown below:

#!/bin/sh
# Runnable version of the pseudo code. HADRDB and standby_host are placeholders;
# the remote command assumes the DB2 instance environment is available over ssh/rsh.
while :
do
    date                                       >> hadr_primary.out
    db2pd -db HADRDB -hadr                     >> hadr_primary.out

    date                                       >> hadr_standby.out
    ssh standby_host "db2pd -db HADRDB -hadr"  >> hadr_standby.out

    sleep 60
done


db2pd is preferred over MON_GET_HADR because it is lightweight and can run on the standby without reads on standby enabled. Note that even if reads on standby is enabled, while the standby is replaying a DDL statement (the replay-only window), it cannot process queries, so MON_GET_HADR cannot run.

You may need tools like rsh or ssh to run db2pd on both the primary and standby. The HADR primary and standby do exchange monitoring information, but only on each heartbeat. So for up-to-date information, run db2pd directly on the primary and standby hosts, rather than relying on the built-in information exchange.

Save the full db2pd output, even if you are focusing on a subset of fields like STANDBY_RECV_BLOCKED and CONGESTED. The other fields will be used to get the full picture of system status for a more reliable diagnosis.

Slow Standby

Slow Replay on Standby

There are three scenarios in which standby log receive is blocked:

  • When log spooling is disabled, standby receive buffer is full (STANDBY_RECV_BUF_PERCENT is 100%).
  • When log spooling is enabled, spooling has reached configured spool limit (STANDBY_SPOOL_PERCENT is 100%).
  • The standby logging device is full (STANDBY_LOG_DEVICE_FULL flag from HADR_FLAGS field is set), regardless of buffer and spool percentage.

Of the three scenarios, a full standby receive buffer or spool is the result of slow replay on the standby. When replay is slower than receiving, more and more log data builds up in the buffer and spool. Eventually, the buffer or spool fills up and cannot receive more data. See also spooling

In addition to the STANDBY_RECV_BLOCKED flag, slow replay has the following symptoms:

  • If spooling is disabled
    • STANDBY_RECV_BUF_PERCENT is close to 100%
    • (STANDBY_LOG_POS - STANDBY_REPLAY_LOG_POS) / 4096 is close to STANDBY_RECV_BUF_SIZE
  • If spooling is enabled
    • STANDBY_SPOOL_PERCENT is close to 100%
    • (STANDBY_LOG_POS - STANDBY_REPLAY_LOG_POS) / 4096 is close to STANDBY_SPOOL_LIMIT

We compute the instantaneous standby recv-replay log gap as
(STANDBY_LOG_POS - STANDBY_REPLAY_LOG_POS) / 4096
"/ 4096" is for unit conversion from byte to page.
The STANDBY_RECV_REPLAY_GAP monitor field is a running average. It should not be used for data correlation within a monitor snapshot.
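For example, a hedged query against the primary (HADRDB is a placeholder alias; the standby fields arrive via heartbeat, so they may lag slightly):

db2 connect to HADRDB > /dev/null
db2 "select (standby_log_pos - standby_replay_log_pos) / 4096 as gap_pages,
            standby_recv_buf_percent, standby_spool_percent
     from table(mon_get_hadr(0))"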

The "Tuning Standby Replay" section below will discuss replay tuning in details.

Standby Log Device Full

Slow replay is the most common cause of a slow standby. STANDBY_LOG_DEVICE_FULL is less common. The STANDBY_LOG_DEVICE_FULL condition (the flag is available on V10.5 and later) can be confirmed simply by the flag itself, or by an OS command (such as "df" on Unix) on earlier releases.

The obvious fix for this scenario is to increase the log device capacity. When spooling is enabled, the log device capacity should at least match the spool limit. Regardless of whether spooling is enabled, the standby log device should at least hold the primary's transaction window (logprimary + logsecond log files). A larger size is recommended because the replay transaction window can be larger than the primary's transaction window. A log file on the standby is reclaimable only when all transactions in it have been replayed. See also Log file management on standby

Slow Log Write on Standby

A rare cause of a slow standby is slow log write on the standby. On the standby, all received log pages must be written to disk before being replayed, for crash recovery purposes.

With slow log write on the standby, you won't see the receive buffer or spool fill up. Primary log writing and shipping are not totally blocked, just slow. In ASYNC mode peer state, or in remote catchup state of any sync mode, you are likely to see congestion on the primary end. In peer state of the other modes, there is likely no congestion on the primary.

To get to the root cause of slow write on the standby, test the primary and standby log devices using this procedure: Know Your Disks (this should be done in the planning phase anyway). If the standby disk speed is much slower than that of the primary, consider an upgrade. Otherwise, it could be that the write size is too small on the standby.

In ASYNC and NEARSYNC peer state, the standby writes received log pages on the same flush boundary as the primary. The primary sends log pages to the standby and writes the pages to local disk in parallel. The standby writes a flush to disk only when it knows that the received flush has been written on the primary, using data from a heartbeat or upon receiving the next flush (the next flush implies that the primary is done with the previous one). So the standby has exactly the same write size as the primary. Assuming the same hardware, standby log write should not slow down the primary.

In remote catchup state (any sync mode) and SYNC mode peer state, the standby writes log pages as they are received. The write size is determined by the amount of data returned from each TCP recv() call. Similar to log writing on the primary , auto tuning is in effect: if a log write is slow, more data builds up in the OS TCP receive buffer and the next receive and write will be larger, assuming the TCP receive buffer is large enough. The TCP receive buffer size is controlled by the socket receive buffer size and should be at least the size of a large primary log flush. See Buffering for HADR Log Shipping in socket buffer size tuning. Reconfigure the TCP socket receive buffer size if needed.

The actual write size on the standby can be measured using a method from the "Log write size and time" section above in "Monitoring HADR Performance". Compare the primary and standby log write sizes and times to see if low standby write throughput is caused by small writes.

You can measure standby write throughput at the actual write size using HADR simulator:
simhadr -write <logPath>/testFile -flushSize <standbyWriteSize>
See also "Identifying Bottleneck" section above

You can also compute theoretical standby write throughput using measured write size, perWriteOverhead, and transfer rate:

timePerWrite = perWriteOverhead + writeSize / transferRate
throughput   = writeSize / timePerWrite
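For example, with assumed numbers of perWriteOverhead = 1 ms, transferRate = 100 MB/sec, and writeSize = 64 KB (16 pages):

timePerWrite = 1 ms + 64 KB / (100 MB/sec) ≈ 1 ms + 0.6 ms = 1.6 ms
throughput   = 64 KB / 1.6 ms ≈ 40 MB/sec

Small writes, where the per-write overhead dominates, cap throughput well below the device's raw transfer rate; this is why comparing primary and standby write sizes matters.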


Compare the theoretical throughput, throughput from HADR simulator, and actual throughput from the database. They should be close if log write is the bottleneck.

If standby log write is the bottleneck, consider tuning/upgrading standby log device, or tuning TCP interface to increase recv() size.

Tuning Standby Replay

It is recommended that the primary and standby have the same hardware. Normally, with the same hardware, replay performance is not a problem because replay just updates data pages based on information from log records; replay need not parse SQL statements. And a read-only workload on the primary further reduces the amount of log records the primary generates (less write workload). But if the standby is underpowered or misconfigured, or the workload is unusual (for example, it has many serialization points, preventing parallel replay from reaching its full potential), then replay can be slower than log generation.

Number of Replay Threads

The standby uses parallel replay. A replay master thread dispatches log records to multiple worker threads for parallel replay. The total number of threads (workers + master) defaults to min(number of physical CPUs on the host, 17).

To check the number of threads used, look for a message like this in db2diag.log:
Using parallel recovery with 8 agents 3 QSets 21 queues and 0 chunks

The above line shows 8 threads (workers + master) on an 8 CPU machine.
Users can optimize HADR standby replay performance by following the "optimizing recovery performance" topic in the Knowledge Center.

Physical vs. Virtual CPU

Some machines show 2 virtual processors for each physical processor. These systems include IBM SMT (simultaneous multithreading) and Intel Hyper-Threading machines. Some early (before V9.7) DB2 versions use the virtual processor count for the default number of replay threads. This is not optimal. A fix was put in to use physical CPUs. The fix is first available in V8.1 fp17 (aka V8.2 fp10), V9.1 fp6, V9.5 fp3, and V9.7 GA. Related APARs are IZ03422 (V8), IZ03423 (V9.1), and IZ32886 (V9.5).

Bufferpool Performance

The replay threads apply data page changes to the buffer pool, which writes to tablespace disks periodically. Multiple updates to a page can be combined into one write to disk. On both primary and standby, the buffer pool plays an important role in database performance.

Buffer pool definitions are treated as DDL and are replayed on the standby, so the standby has buffer pools identical to the primary's. Normally, this is desirable because, for each buffer pool, the amount of data change is the same on the primary and standby. On the primary, the changes are applied by SQL agents; on the standby, they are applied by the replay threads. However, any large buffer pool created on the primary side for the benefit of reading data is not useful on the standby unless reads on standby is enabled.

Database configuration parameters are not replicated by HADR. Configuration on the standby needs to be maintained manually. It is recommended that buffer pool related parameters be set to the same values on the primary and standby, assuming the same hardware on both. These include num_iocleaners, page_age_trgt_mcr (starting in V10.5), page_age_trgt_gcr (starting in V10.5, applicable to pureScale systems only), and softmax (deprecated as of V10.5).

See the " Tuning and Monitoring Database System Performance " white paper for details of monitoring and tuning buffer pools.

Reads on Standby

When reads on standby is enabled, read queries compete with the replay threads for resources. If a performance problem is observed, you may temporarily disable reads on standby to see if there is any improvement. If reads on standby is identified as the cause of slow replay, consider reducing the read workload, or disable reads on standby.

Hardware Utilization

When replay is slow, check for hardware bottlenecks on the standby. A common tool is vmstat. Check CPU, memory, and IO utilization. Look for bottlenecks (100% utilization). See the " Tuning and Monitoring Database System Performance " white paper for details of monitoring system resources.

If standby hardware is less powerful than the primary, consider upgrading to match the primary.

Slow Network

Network Tuning

If the network is suspect, first tune the socket buffer sizes using this procedure . Check with your network administrator to see whether the network throughput and round trip time are as expected.

If possible, disable HADR temporarily to get a baseline for the workload. Then run the step by step performance tuning procedure to see if the current sync mode is feasible for the system. If not, consider a hardware/network upgrade, or use a less demanding mode. See also HADR sync mode

If your business requires HADR to stay enabled, you can still use the step by step performance tuning procedure. The calculated HADR logging rate for the mode you are using should be close to your actual logging rate. The calculated rate is usually a little higher because it only considers network and disk in its model. CPU costs such as the application interface to disk and network, and inter-process/thread communication, are ignored (such costs are small compared to the disk and network costs). You can look at the results for other sync modes to evaluate alternatives.

Beyond the TCP socket buffers, there are many things that can improve network performance. For AIX, see TCP streaming workload tuning

You may need to consult a network expert to further tune your network.

Unstable Network

See Diagnosing Intermittent Network Problems

Network Bonding

You may bond multiple NICs (Network Interface Controllers) into a virtual one and use it for HADR (set hadr_local_host to the bonded address). For both multiple standby and pureScale HADR, the balanced-xor bonding mode is preferred. In this mode, all packets for a given TCP connection are routed through one NIC. This effectively gives each log shipping stream a dedicated NIC, assuming that the number of NICs is equal to or greater than the number of log streams.

The balanced-rr (round robin) mode spreads packets for one TCP connection over multiple NICs. Because TCP needs to sort packets on the receiving end, this mode may give worse throughput for each TCP connection. For example, when there is one NIC and two log streams, each stream gets 50% of the capacity of a NIC. When using two bonded NICs, with balanced-xor each stream gets 100% of the capacity of a NIC; with balanced-rr, each stream may get only 80% of the capacity of a NIC.

Besides performance, bonding can also be used for HA purposes. Multiple NICs can be bonded with some configured as active and some as passive backups. If an active NIC fails, a backup can take over.

Performance Workaround

Using a Large Spool on the Standby

See also Standby receive buffer and spool size above in section "Configuring HADR for Optimal Performance".

A large spool (or receive buffer) can absorb a load peak on the primary. The standby just receives the peak and stores it in the spool or buffer while replay catches up. The primary is not slowed down as long as the standby log receive-replay gap does not reach the spool or buffer limit. For example, if a load peak generates 30 GB of log data, a 64 GB spool is likely able to absorb it.

The main drawback of a large spool is that during a takeover (forced or non-forced), the standby has to complete replay of the spooled data before the takeover can complete. So a large spool can slow down takeover.

HADR peer wait limit

See also " HADR peer wait limit " above in section "Configuring HADR for Optimal Performance".

As a workaround, you may set DB2_HADR_PEER_WAIT_LIMIT to limit the HADR impact on primary database logging. Because the HADR clock resolution is one second, and OS scheduling may delay a process from running (for example, it can take a little time for a swapped-out process to run again), it is recommended that the peer wait limit be set to at least 5 seconds. A smaller number is likely to cause false alarms.

Peer wait limit works best when primary logging is completely blocked. If each log write is merely slow (for example, taking 100 ms instead of 10 ms), the limit won't be triggered.

Another drawback is that the peer wait limit is not triggered until logging has been blocked for the specified number of seconds. So you have already taken a performance hit by the time it is triggered. But having a limit guarantees that applications won't be stuck forever. It can act as a last line of defense against performance problems.

Temporarily Changing HADR Synchronization Mode

As a workaround, you can temporarily change the HADR sync mode to a less demanding mode during a heavy load period on the primary. Because SUPERASYNC mode has no impact on primary logging at all (it never enters peer state), it is often used as the temporary sync mode. For example, if a reorg on a large table is known to cause a performance problem, you can change the sync mode to SUPERASYNC before the reorg job and change it back to your normal sync mode when the job is done and the standby catches up to the primary (both the primary-standby log shipping gap (HADR_LOG_GAP in monitoring) and the standby receive-replay gap (STANDBY_RECV_REPLAY_GAP in monitoring) are small). The small gaps indicate that logs from the load peak have been shipped to the standby and have been replayed. It can take some time after the load peak for the standby to catch up. If it takes a very long time to catch up, or it never catches up, then you have a general performance problem and this workaround won't help much.

Beginning in DB2 V10.1, you can do "semi dynamic" updates of HADR configuration parameters, so changing the sync mode requires no downtime on the primary. Simply do "stop hadr", "update db cfg ...", and "start hadr" on the primary database. If you are using hadr_target_list on the primary and standby to specify the remote databases, instead of the hadr_remote_host/hadr_remote_svc parameters, then hadr_syncmode specifies the sync mode used by the primary-standby pair (or the primary-principal standby pair in the case of multiple standbys) when this database is a primary, and hadr_syncmode need not be the same on the primary and standby databases. Thus changing hadr_syncmode on the primary alone is enough to change the operational sync mode. The hadr_target_list parameter supports multiple standbys and is recommended even when you are defining just one standby. If hadr_target_list is not used, you will need to update hadr_syncmode on the standby too, and deactivate and reactivate the standby to pick up the new value.
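A minimal sketch of the change on the primary (HADRDB is a placeholder alias; this assumes V10.1 or later with hadr_target_list in use):

db2 stop hadr on db HADRDB
db2 update db cfg for HADRDB using HADR_SYNCMODE SUPERASYNC
db2 start hadr on db HADRDB as primary
# Reverse the change the same way once the standby has caught up.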

Changing the sync mode can be done before an expected load surge or when a performance problem has been encountered. The "stop hadr" command breaks the pair out of peer state and therefore unblocks log writing. When HADR is restarted, it will run in the new sync mode.

In DB2 V9.7 and earlier releases, neither hadr_target_list nor semi-dynamic parameters are supported. Changing the sync mode would require deactivation and reactivation of the primary database, so it is not recommended. On these older releases, if primary log writing is slow in peer state, you can do "stop hadr" followed by "start hadr" on the primary to break the connection and take HADR out of peer state. After "start hadr", the pair will reconnect and enter remote catchup state, where primary log writing is not slowed down by the standby. Depending on the workload, it may take a short or a long time for the standby to reenter peer state. Once in peer state, the primary can be slowed down again. In contrast, changing the sync mode to SUPERASYNC can hold the pair in remote catchup state until you explicitly change the mode back.

Multiple Standby

For performance tuning, a multiple standby system can be treated as multiple individual primary-standby pairs, with some special considerations for the network and the logging device on the primary.

For general information on multiple standby, see the HADR multiple standby white paper.

Network Requirement

In a multiple standby system, the primary database sends log data to multiple standbys concurrently. The network interface on the primary host can be stressed. It is recommended that the network be tested with multiple log shipping streams using HADR simulator before deploying multiple standbys. See simulating multiple standby with HADR simulator.

Log Device and Archive Requirement

In a multiple standby system, the primary database's logging and archiving devices need to have enough bandwidth to support multiple read streams and one write stream concurrently. Each read stream is used for remote catchup of a standby. The write stream is the active logging workload of the primary. The primary creates a dedicated thread for each standby. In remote catchup, each thread reads from the logging device and sends data to its standby. DB2 does have an optimization to read from the primary's log write buffer when the remote catchup position is close to the primary's current log write position (so the data needed for remote catchup is likely to still be in the log write buffer). But when the standbys' log positions are far away from the log write position, log shipping will need to read from the logging device or archive.

When the standbys catch up to the current position of the primary, the principal standby enters peer state and the thread serving it reads from the log write buffer, if its sync mode is not SUPERASYNC. The auxiliary standbys all use SUPERASYNC mode, so they stay in remote catchup state even though their log positions are close to the primary's current log position. The threads serving these standbys will likely be able to read from the primary's log buffer for log shipping too. This greatly reduces the load on the primary's logging device.

Remote Catchup Log Read Optimization

Although an HADR thread in remote catchup state can read from the primary's log write buffer, its behavior is different from that of a peer state thread. In peer state, log writing waits for log shipping, therefore log shipping can slow down log writing. In remote catchup state, log writing does not wait for log shipping. The log shipping thread just peeks at the log buffer to opportunistically read from it. When the log pages it needs are not in the buffer, it goes to the logging device or archive to get them. This optimization reduces the load on the logging device and speeds up remote catchup log reading. The remote catchup optimization is enabled on both multiple and single standby systems.

HADR on pureScale

In pureScale HADR setup, each primary member has its own log stream. Each primary member connects to the standby replay member via TCP to ship logs. For performance tuning, each log stream can be modeled as an individual primary-standby pair.

For general information on HADR on pureScale, see tutorial of HADR on pureScale .

Network Requirement

Since all streams go to the standby replay member, the network interface on this member can be stressed. It is recommended that the network be tested with multiple log shipping streams using HADR simulator. See simulating pureScale HADR with HADR simulator.

Replay Speed

The replay member on the standby merges log records from all primary members into a single logical log stream, then replays it. Replay uses SMP parallelism on the replay member, but replay resources are still limited to one member, so replay can become a bottleneck. If the primary members have a higher write-to-read workload ratio, the problem can be more serious. In a good case where there are 4 primary members, each with a 25% write workload, the total write load is 100% of a member. Since replaying a transaction generally costs fewer resources than executing it, the standby member should be able to handle the replay workload. But with a 50% write load on each primary member, the standby replay member is more likely to become a bottleneck. Typical "slow replay" symptoms (standby receive buffer/spool full, standby stops receiving, primary logging blocked) would appear on the primary.

Storage System Requirement

The standby replay member also puts a higher write load on its storage devices because it is writing all log streams and replaying the workload from all members (during replay, the replay member writes directly to storage, bypassing the global buffer pool on the CF). The interface between a standby member and the SAN (mounted as GPFS) needs higher bandwidth than that of a primary member. Because the replay member can automatically migrate among the standby members, for best results the higher bandwidth should be supported on all standby members, not just the "preferred" replay member.