DB2 HADR Tutorial
Table of Contents
HADR (High Availability Disaster Recovery) is a DB2 LUW feature for high availability and disaster recovery. It is the preferred solution for HA and DR on DB2. It replicates data in real time from a primary database to one or more standby databases. In case the primary database fails, a standby database takes over as the new primary. For all benefits of HADR, see HADR benefits
HADR is based on log replay. You initialize the standby with a backup or split mirror image of the primary. You then configure and start HADR on the primary and standby databases. The primary ships its transaction log data to the standby via a TCP connection. The standby continuously replays the log records to keep itself in sync with the primary.
Initially, the standby's log position is at the time when the backup was taken. The standby starts in a catchup state that ships logs from this position to the current position of the primary. Eventually, the standby catches up to the current position and enters a "peer" state. HADR automatically handles state transition, including chasing the "moving target" of primary position. All you need to do is to issue the "start HADR" command on primary and standby.
If the primary fails, you issue a "takeover HADR by force" command on the standby to make the standby the new primary. A single command does it all. For planned maintenance, you issue a "takeover HADR" (without "by force" option) command to switch primary and standby's roles. HADR management is so simple. There are only 3 commands: Start HADR, stop HADR, and takeover HADR.
Logged operations are replicated. These include, but are not limited to:
Not logged operations are not replicated. These include, but are not limited to:
Note: You can use database configuration parameter BLOCKNONLOGGED to block not logged operations on the primary.
Standby is maintained as a mirror image of the primary. Many aspects of the primary and standby databases need to be the same.
For more info, see Info Center
First, how many standbys are needed? A standby can serve HA (High Availability), or DR (Disaster Recovery) purpose. If you want both HA and DR protection, you need 2 standbys, one for each purpose. HADR supports up to 3 standbys (starting V10.1). A third one can be used for a 2nd DR site, or for protection against user error (configured with
delayed replay
). Any standby can have reads on standby enabled to support reporting workload.
This tutorial will focus on single standby setup. For specifics of multiple standby setup, see
HADR multiple standby
white paper. It is recommended that you start from this single standby tutorial to get familiar with HADR basics, then go to multiple standby scenarios.
If you are using a DB2 version earlier than V10.1, HADR is limited to one standby. The common ways to support both HA and DR are:
Due to the complexity and reduced HA/DR protection when using single HADR standby with other HA/DR methods, HADR with multiple standby is recommended. This is a good reason to upgrade to V10.1 and later.
If you are using pureScale, there are some limitations of HADR on pureScale.
pureScale cluster itself provides excellent HA. HADR on pureScale is mainly intended for DR protection. Single standby and sync mode restrictions generally do not impact DR ability.
This tutorial will focus on non pureScale setup. For specifics of pureScale setup, see
Tutorial for HADR on pureScale
. It is recommended that you start from the non pureScale tutorial to get familiar with HADR basics, then go to the pureScale scenario.
The most important HADR parameter is the HADR synchronization mode. See
HADR Sync Modes
for description of the various modes and how to choose a mode.
An HA standby is usually close to the primary, often in the same building, connected to the primary by LAN. HADR synchronization mode is usually Sync or NearSync. HADR provides quick failover, with little or no data loss. The HA standby guards against common failures like hardware/software failure. You can also schedule planned downtime without service interruption using
rolling update
.
An DR standby is usually far from the primary, often in a different city. When a disaster like fire, storm, or earthquake destroys the primary site, you can failover to the standby site. The DR standby is usually connected to the primary via WAN. Network round trip time is often 100ms or more. HADR synchronization mode is usually Async or SuperAsync. While HADR failover is still fast, there can be some data loss during a failover.
Network is usually not an issue in HA systems. For example, a gigabit ethernet provides 120MB/sec throughput, with round trip time at 0.1ms. The throughput is more than enough for most database workload. The round trip time is also very small compared to disk write latency, which is often at least a few ms. So ack message travel in sync or nearSync mode does not add much overhead.
Network can become the bottleneck in DR systems. For example, if your peak logging rate is 20MB/sec, while the network throughput is 15MB/sec, your logging rate will be capped at 15MB/sec, unless superAsync mode is used. SuperAsync mode allows the primary to get ahead of the standby, exposing higher data loss risk. If this is not acceptable, you have to upgrade your network.
A dedicated private network (separate NIC, router, etc.) between primary and standby is recommended, for performance and security reasons. HADR does not encrypt the traffic between primary and standby. Log data is sent as is.
While maximizing TCP throughput on network, socket buffer size is an important factor, especially for WAN. A small buffer size can limit the throughput. See
TCP Tuning
.
To figure out if your network can support the planned sync mode, see
step by step TCP and sync mode configuration
HADR does not encrypt or compress log data during log shipping. If needed, you can use network layer encryption or compression between primary and standby.
You are likely to get lower compression ratio when many tables in your database are already compressed, with either classic row compression or adaptive compression, because row data in log records are also compressed (just another benefit of table compression). With non-compressed tables, the compression ratio is usually between 3 and 5. Of course, it costs time to compress/decompress data. Compression is more likely to be beneficial on WAN.
Same hardware (CPU, memory, disk, etc.) is recommended on the primary and standby, so that standby has enough power for replay. Sufficient planning and testing should be done when deploying a less powerful standby.
If you plan to use Reads on Standby, additional power on the standby machine may be needed.
You may choose to have shared or independent archive devices on primary and standby. See Log Archive Device for details.
It is recommended that you decide on the HADR configuration parameters as much as you can before initializing HADR, to minimize updating configuration after initialization. Changing config after setup may require db2 instance or database shutdown and restart. All HADR registry variables require db2 instance shutdown and restart to take effect. Prior to DB2 V10.1, HADR database config parameters require database shutdown and restart to take effect. See also HADR Database Configuration Parameters
See HADR config for description of relevant parameters and the optimal setting.
DB2 Info Center has detailed info on how to initialize HADR. The basic steps are: create the standby database, configure primary and standby databases, and start HADR on primary and standby.
Additonal info: HADR commands
HADR can also be set up and managed using IBM Data Studio or Optim™ Database Administrator. See Prevent database outages with High AvailabilityDisaster Recovery in DB2 for more info.
Once you have HADR set up, you need to monitor it. See HADR monitoring for details.
HADR supports moving primary role from one database to another. For planned role change, primary and standby switch roles. It is a simple one command operation and guaranteed to have no data loss. For unplanned outage, you will need a failover that changes the standby to a primary, but leaves the old primary as primary (it may not even be accessible). Both kinds role change are done by the "takeover HADR" command, with different options. See
HADR takeover
for details.
Additonal info:
HADR commands
In case you run into performance problems, like slow database response to SQL clients, you will need to find the cause and fix it. First, you need to check if HADR is the cause. See Impact to primary database logging . If it is an HADR problem, then proceed to diagnostics. For details, see HADR performance tuning and diagnostics .
HADR supports automated failover and client reroute to the new primary. See cluster managers and Client Reroute
HADR standby database supports read-only queries. This feature is disabled by default. It can be enabled by the DB2_HADR_ROS registry variable. Only the Uncommitted Read isolation level is supported. This feature is best for read-only reporting workload. It offloads the primary. In a multiple standby system, reads on standby can be enabled on any standby. See the followings for more info:
For DB2 fixpack update and OS/hardware upgrade, HADR rolling update allows continued availability during the update.
See the following Info Center topics for more info:
HADR rolling update does not support upgrade across DB2 major version. For example, from DB2 V10.1 to V10.5. Major version upgrade procedure is as follows:
The main reason that HADR rolling update cannot cross major version boundary is that DB2 transaction logs from the new release may not be compatible with the old release, yet HADR requires log compatibility on primary and standby.