DB2 HADR Tutorial

Table of Contents

Introduction

What is HADR?

HADR (High Availability Disaster Recovery) is a DB2 LUW feature for high availability and disaster recovery. It is the preferred solution for HA and DR on DB2. It replicates data in real time from a primary database to one or more standby databases. In case the primary database fails, a standby database takes over as the new primary. For all benefits of HADR, see HADR benefits

How does HADR work?

HADR is based on log replay. You initialize the standby with a backup or split mirror image of the primary. You then configure and start HADR on the primary and standby databases. The primary ships its transaction log data to the standby via a TCP connection. The standby continuously replays the log records to keep itself in sync with the primary.

Initially, the standby's log position is at the time when the backup was taken. The standby starts in a catchup state that ships logs from this position to the current position of the primary. Eventually, the standby catches up to the current position and enters a "peer" state. HADR automatically handles state transition, including chasing the "moving target" of primary position. All you need to do is to issue the "start HADR" command on primary and standby.

If the primary fails, you issue a "takeover HADR by force" command on the standby to make the standby the new primary. A single command does it all. For planned maintenance, you issue a "takeover HADR" (without "by force" option) command to switch primary and standby's roles. HADR management is so simple. There are only 3 commands: Start HADR, stop HADR, and takeover HADR.

What's replicated, what's not?

Logged operations are replicated. These include, but are not limited to:

  • DDL
  • DML
  • Create/alter table space
  • Create/alter storage group
  • Create/alter buffer pool
  • XML data.
  • Logged LOBs

Not logged operations are not replicated. These include, but are not limited to:

  • Database configuration parameters (this allows primary and standby databases to be configured differently).
  • "Not logged initially" tables
  • Not logged LOBs
  • UDF (User Defined Function) libraries. UDF DDL is replicated. But the libraries used by UDF (such as C or Java libraries)  are not replicated, because they are not stored in the database. Users must manually copy the libraries to the standby.

Note: You can use database configuration parameter BLOCKNONLOGGED to block not logged operations on the primary.

System Requirements and Recommendations

Standby is maintained as a mirror image of the primary. Many aspects of the primary and standby databases need to be the same.

  • Primary and standby must have the same DB2 major version . For example, both on V10.1.
    • Standby database fix pack level must be same or higher than that of the primary (otherwise, primary could generate log records the standby cannot replay).
    • Same fix pack level is recommended on primary and standby, to minimize compatibility risk. Different primary and standby fix pack levels usually only occur during rolling update
  • Primary and standby must have the same platform .
    • "Platform" here is defined as the combination of OS type (software) and machine architecture (hardware). For example, the followings are considered distinct platforms: Windows-x86, AIX-power, HP-IA, Solaris-Sparc, Solaris-x86, Linux-PPC, Linux-Z, Linux-390, Linux-x86,
    • Primary and standby must have the same endian (both big endian, or both small endian). This requirement is usually satisfied by the platform requirement already.
  • Same OS version (major and minor) is recommended on primary and standby. Different versions usually only occur during rolling update . DB2 does not enforce any check on OS version. But you should keep the different-version window as short as possible, to minimize compatibility risk.
  • The DB2 software on primary and standby must have the same bit size (both 64 bit, or both 32 bit).
  • Same bit size on the host platform is recommended, to minimize compatibility risk.
    • Host platform bit size could be different. For example, DB2 is 32 bit on both machines. Primary host is 64 bit, which can run both 64 bit and 32 bit applications. Standby host is 32 bit.
  • Primary and standby must have the same paths for tablespace containers, to support tablespace replication.
    • The standby devices should have same or larger capacity.
    • Redirected restore is not supported when creating the standby. However, database directory (for database metadata files) and transaction log directory changes are supported during the restore. Table space containers created by relative paths will be restored to paths relative to the new database directory.
  • Same hardware (CPU, memory, disk, etc.) is recommended on the primary and standby, so that standby has enough power for replay. Sufficient planning and testing should be done when deploying a less powerful standby.
  • Same amount of memory is recommended on the primary and standby, so that buffer pool replication is less likely to fail.

For more info, see Info Center

Planning HADR

Number of standbys

First, how many standbys are needed? A standby can serve HA (High Availability), or DR (Disaster Recovery) purpose. If you want both HA and DR protection, you need 2 standbys, one for each purpose. HADR supports up to 3 standbys (starting V10.1). A third one can be used for a 2nd DR site, or for protection against user error (configured with delayed replay ). Any standby can have reads on standby enabled to support reporting workload.

This tutorial will focus on single standby setup. For specifics of multiple standby setup, see HADR multiple standby white paper. It is recommended that you start from this single standby tutorial to get familiar with HADR basics, then go to multiple standby scenarios.

If you are using a DB2 version earlier than V10.1, HADR is limited to one standby. The common ways to support both HA and DR are:

  • Set up the HADR standby locally for HA purpose. Use Q-Rep for a remote DR site.
  • Set up disk mirroring and/or spare host machine locally for HA. Use HADR for remote DR site.


Due to the complexity and reduced HA/DR protection when using single HADR standby with other HA/DR methods, HADR with multiple standby is recommended. This is a good reason to upgrade to V10.1 and later.

pureScale Considerations

If you are using pureScale, there are some limitations of HADR on pureScale.

  • HADR support on pureScale is starts from DB2 V10.5. If you are on earlier releases, you may use Q-rep for DR solution, or consider upgrade.
  • Only one standby is supported on pureScale,
  • Only async and superAsync HADR modes are supported on pureScale.

pureScale cluster itself provides excellent HA. HADR on pureScale is mainly intended for DR protection. Single standby and sync mode restrictions generally do not impact DR ability.

This tutorial will focus on non pureScale setup. For specifics of pureScale setup, see Tutorial for HADR on pureScale . It is recommended that you start from the non pureScale tutorial to get familiar with HADR basics, then go to the pureScale scenario.

HADR Synchronization Mode and Network Speed

The most important HADR parameter is the HADR synchronization mode. See HADR Sync Modes for description of the various modes and how to choose a mode.

An HA standby is usually close to the primary, often in the same building, connected to the primary by LAN. HADR synchronization mode is usually Sync or NearSync. HADR provides quick failover, with little or no data loss. The HA standby guards against common failures like hardware/software failure. You can also schedule planned downtime without service interruption using rolling update .

An DR standby is usually far from the primary, often in a different city. When a disaster like fire, storm, or earthquake destroys the primary site, you can failover to the standby site. The DR standby is usually connected to the primary via WAN. Network round trip time is often 100ms or more. HADR synchronization mode is usually Async or SuperAsync. While HADR failover is still fast, there can be some data loss during a failover.

Network is usually not an issue in HA systems. For example, a gigabit ethernet provides 120MB/sec throughput, with round trip time at 0.1ms. The throughput is more than enough for most database workload. The round trip time is also very small compared to disk write latency, which is often at least a few ms. So ack message travel in sync or nearSync mode does not add much overhead.

Network can become the bottleneck in DR systems. For example, if your peak logging rate is 20MB/sec, while the network throughput is 15MB/sec, your logging rate will be capped at 15MB/sec, unless superAsync mode is used. SuperAsync mode allows the primary to get ahead of the standby, exposing higher data loss risk. If this is not acceptable, you have to upgrade your network.

A dedicated private network (separate NIC, router, etc.) between primary and standby is recommended, for performance and security reasons. HADR does not encrypt the traffic between primary and standby. Log data is sent as is.

While maximizing TCP throughput on network, socket buffer size is an important factor, especially for WAN. A small buffer size can limit the throughput. See TCP Tuning .

To figure out if your network can support the planned sync mode, see step by step TCP and sync mode configuration

Network Encryption and Compression

HADR does not encrypt or compress log data during log shipping. If needed, you can use network layer encryption or compression between primary and standby.

You are likely to get lower compression ratio when many tables in your database are already compressed, with either classic row compression or adaptive compression, because row data in log records are also compressed (just another benefit of table compression). With non-compressed tables, the compression ratio is usually between 3 and 5. Of course, it costs time to compress/decompress data. Compression is more likely to be beneficial on WAN.

Primary/Standby Hardware Recommendations

Same hardware (CPU, memory, disk, etc.) is recommended on the primary and standby, so that standby has enough power for replay. Sufficient planning and testing should be done when deploying a less powerful standby.

If you plan to use Reads on Standby, additional power on the standby machine may be needed.

Log Archive Considerations

You may choose to have shared or independent archive devices on primary and standby. See Log Archive Device for details.

Deploying HADR

Initializing HADR

It is recommended that you decide on the HADR configuration parameters as much as you can before initializing HADR, to minimize updating configuration after initialization. Changing config after setup may require db2 instance or database shutdown and restart. All HADR registry variables require db2 instance shutdown and restart to take effect. Prior to DB2 V10.1, HADR database config parameters require database shutdown and restart to take effect. See also HADR Database Configuration Parameters

See HADR config for description of relevant parameters and the optimal setting.

DB2 Info Center has detailed info on how to initialize HADR. The basic steps are: create the standby database, configure primary and standby databases, and start HADR on primary and standby.

Additonal info: HADR commands

HADR can also be set up and managed using IBM Data Studio or Optim™ Database Administrator. See Prevent database outages with High AvailabilityDisaster Recovery in DB2 for more info.

Monitoring HADR

Once you have HADR set up, you need to monitor it. See HADR monitoring for details.

Role switch and Failover

HADR supports moving primary role from one database to another. For planned role change, primary and standby switch roles. It is a simple one command operation and guaranteed to have no data loss. For unplanned outage, you will need a failover that changes the standby to a primary, but leaves the old primary as primary (it may not even be accessible). Both kinds role change are done by the "takeover HADR" command, with different options. See HADR takeover for details.

Additonal info: HADR commands

Performance Tuning

In case you run into performance problems, like slow database response to SQL clients, you will need to find the cause and fix it. First, you need to check if HADR is the cause. See Impact to primary database logging . If it is an HADR problem, then proceed to diagnostics. For details, see HADR performance tuning and diagnostics .

Automation

HADR supports automated failover and client reroute to the new primary. See cluster managers and Client Reroute

Reads on Standby

HADR standby database supports read-only queries. This feature is disabled by default. It can be enabled by the DB2_HADR_ROS registry variable. Only the Uncommitted Read isolation level is supported. This feature is best for read-only reporting workload. It offloads the primary. In a multiple standby system, reads on standby can be enabled on any standby. See the followings for more info:

Rolling Update

For DB2 fixpack update and OS/hardware upgrade, HADR rolling update allows continued availability during the update.

See the following Info Center topics for more info:

HADR rolling update does not support upgrade across DB2 major version. For example, from DB2 V10.1 to V10.5. Major version upgrade procedure is as follows:

  • Backup primary database (in case upgrade fails, you can go back to the old release). This step is recommended, but not required.
  • Run "stop HADR" command on the primary database
  • Upgrade the primary instance and database as usual
  • Take a full backup (or split mirror image) of the primary database on the new release.
  • Concurrently, drop standby database and upgrade the standby instance
  • Reinitialize the standby database using primary's backup or split mirror image from the new release

The main reason that HADR rolling update cannot cross major version boundary is that DB2 transaction logs from the new release may not be compatible with the old release, yet HADR requires log compatibility on primary and standby.