This page discusses takeover-related topics: role switch, failover, and primary reintegration.
The "takeover hadr" command performs a role switch or a failover with a single command. The command can be issued only on the standby (the standby takes over to become the primary). There are two flavors: non forced takeover (role switch) and forced takeover (failover).
With multiple standbys, takeover (both flavors) is allowed on any standby. After takeover, the other standbys are automatically reconfigured to connect to the new primary.
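As a minimal sketch, the two flavors can be wrapped in one shell function. The function name hadr_takeover and its "force" argument are hypothetical; only the "takeover hadr on db ..." commands themselves are DB2 CLP commands. It must be run on the standby as a user authorized to issue takeover.

```shell
# Hypothetical wrapper around the DB2 CLP; run on the standby.
# Usage: hadr_takeover <dbname> [force]
hadr_takeover() {
    db_name=$1
    mode=${2:-graceful}
    if [ "$mode" = "force" ]; then
        # Forced takeover (failover).
        db2 "takeover hadr on db $db_name by force"
    else
        # Non forced takeover (role switch).
        db2 "takeover hadr on db $db_name"
    fi
}
```

For example, "hadr_takeover SAMPLE" requests a role switch and "hadr_takeover SAMPLE force" requests a failover, where SAMPLE is a placeholder database name.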
After a failover, the old primary can rejoin the new primary as a standby. This is known as primary reintegration. Primary reintegration succeeds only if no data was lost during the failover (i.e., the log streams of the two databases have not diverged). If there is no data loss, the new primary has all the log data of the old primary and continues logging on the common log stream. If there is data loss, the old primary has some log data that was not replicated to the new primary. Because the new primary starts its logging at the takeover point in the log stream, it writes different data than the old primary did, effectively forking off a new log stream. The following diagram shows diverging log streams:
     common log data     takeover_point   old primary's not replicated data
-------------------------+--------------------------------------------
                         |
                         ---------------------------------------------
                                  new primary's log data
Run the "start HADR ... as standby" command on the old primary to convert it to a new standby.
The command changes the database role to standby, starts the database in the standby role, then returns. A successful return does NOT mean that primary reintegration has succeeded. The new standby has yet to connect to the new primary to check whether its log stream is compatible with the new primary's. The "start HADR ... as standby" command does not wait for this check because it can take a long time. For example, the new standby may not be able to connect to the primary for an extended time. Continue to monitor the new standby until it connects to the primary and enters remote catchup or peer state. If reintegration fails (the two databases have diverging log streams), the new standby shuts itself down. Check the db2diag.log file for details.
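The monitoring step can be sketched as a polling loop. This is an illustration only: the function names hadr_state and reintegrate_and_wait, the poll count, and the POLL_INTERVAL variable are hypothetical, and the parsing assumes the "FIELD = VALUE" layout that "db2pd -hadr" prints in DB2 10.1 and later (the 9.7 layout differs).

```shell
# Print the HADR_STATE value from "db2pd -hadr" output read on stdin.
# Assumes the "FIELD = VALUE" layout of DB2 10.1 and later.
hadr_state() {
    awk -F' = ' '/HADR_STATE/ { gsub(/[ \t]+/, "", $2); print $2; exit }'
}

# Start the old primary as a standby, then poll until it reaches
# REMOTE_CATCHUP or PEER, or give up after max_polls attempts.
reintegrate_and_wait() {
    db_name=$1
    max_polls=${2:-60}
    db2 "start hadr on db $db_name as standby"
    # Assumption: CLP exit codes of 4 and above indicate errors.
    [ "$?" -lt 4 ] || return 1
    polls=0
    while [ "$polls" -lt "$max_polls" ]; do
        state=$(db2pd -hadr -db "$db_name" | hadr_state)
        case "$state" in
            REMOTE_CATCHUP|PEER)
                echo "reintegrated: $state"
                return 0 ;;
        esac
        polls=$((polls + 1))
        sleep "${POLL_INTERVAL:-10}"
    done
    echo "not reintegrated after $max_polls polls; check db2diag.log"
    return 1
}
```

A timeout here does not by itself mean the log streams have diverged; the standby may simply be unable to reach the primary, so db2diag.log remains the place to confirm the cause.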
The following steps are performed during a non forced takeover:
To minimize takeover time, forcing applications on primary, forcing applications on standby, log shipping, and log replay are all performed in parallel.
The following steps are performed during a forced takeover:
To minimize takeover time, forcing applications on standby, log shipping, and log replay are all performed in parallel.
To avoid split brain, it is best practice to take the old primary offline before failover and keep it offline after failover. The standby's attempt to stop log writing on the old primary is best effort only, and it does not prevent read-only access.
The main factors affecting takeover time are:
In superAsync mode, it is recommended to issue a non forced takeover only when the primary-standby log gap is relatively small. Throttle or stop the primary workload before takeover if needed.
It is best practice to check the following before issuing a non forced takeover:
It is best practice to check the following before issuing a forced takeover:
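One way to check the log gap before a non forced takeover is to read the HADR_LOG_GAP(bytes) field from "db2pd -hadr". The sketch below assumes the DB2 10.1 and later output layout; the function names and the threshold are hypothetical, and what counts as "relatively small" depends on the workload.

```shell
# Print the HADR_LOG_GAP(bytes) value from "db2pd -hadr" output on stdin.
hadr_log_gap() {
    awk -F' = ' 'index($1, "HADR_LOG_GAP") > 0 { gsub(/[^0-9]/, "", $2); print $2; exit }'
}

# Succeed only if the gap is known and at or below max_gap (bytes).
log_gap_ok() {
    gap=$1
    max_gap=$2
    [ -n "$gap" ] && [ "$gap" -le "$max_gap" ]
}

# Example use (SAMPLE and the 1 MB threshold are placeholders):
#   gap=$(db2pd -hadr -db SAMPLE | hadr_log_gap)
#   log_gap_ok "$gap" 1048576 && db2 "takeover hadr on db SAMPLE"
```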
Use the "db2pd -hadr -db <dbname>" command to monitor the overall progress of takeover. During a takeover, there may be a time window when clients cannot connect to either the primary or the standby, so SQL-based methods such as table functions won't work.
Use the "db2pd -applications -db <dbname>" command to view details of remaining applications. Note that this command lists both user applications and DB2 internal connections. For example, the replay agent on the standby is also listed. It can be identified by an "Appid" like "*LOCAL.DB2.121029205843" and a "SystemAuthID" of "n/a". In the Appid, the second field is the user name for user applications ("DB2" is used for internal applications), and the ending number is a time stamp, here "2012-10-29 20:58:43", which is the start time of the application in GMT.
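The Appid fields described above can be decoded with a few lines of shell. This sketch handles only the "*LOCAL.<user>.<YYMMDDhhmmss>" form shown in the example and assumes a year in the range 2000 to 2099; the function names are hypothetical.

```shell
# Print the second field of an Appid ("DB2" for internal applications).
appid_user() {
    printf '%s\n' "$1" | cut -d. -f2
}

# Convert the trailing 12-digit field (YYMMDDhhmmss, GMT) into
# "20YY-MM-DD hh:mm:ss". Prints nothing if the field is not 12 digits.
appid_start_time() {
    ts=${1##*.}
    printf '%s\n' "$ts" | sed -n \
        's/^\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)\([0-9][0-9]\)$/20\1-\2-\3 \4:\5:\6/p'
}

appid_user '*LOCAL.DB2.121029205843'        # prints: DB2
appid_start_time '*LOCAL.DB2.121029205843'  # prints: 2012-10-29 20:58:43
```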
Use the "db2pd -transactions -db <dbname>" command to get additional information on the applications. Pay attention to the State, Firstlsn, Lastlsn, SpaceReserved, and LogSpace fields. During rollback, State will be "ABORT". Unless a transaction hangs, Lastlsn should be increasing. LogSpace shows how big the transaction is (it is the sum of space already used and SpaceReserved). SpaceReserved shows an estimate of the log data yet to be written for this transaction.
Use the "db2pd -recovery -db <dbname>" command to get details of the current status of recovery, which is part of the takeover process. The output shows the LSN and LSO values that are currently being replayed or compensated. There is also a “progress” section at the bottom showing the progress of the recovery, including the description, starting time, and total work done for each phase. If these values change over time, recovery is progressing and the takeover is not hung.
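The four db2pd commands above can be captured together so that successive snapshots can be compared (for example, to see whether Lastlsn or the recovery progress values are moving). This is a sketch; the function name takeover_snapshot and the output file names are hypothetical.

```shell
# Run the db2pd reports discussed above for one database,
# with a header line before each report.
takeover_snapshot() {
    db_name=$1
    for opt in -hadr -applications -transactions -recovery; do
        echo "=== db2pd $opt -db $db_name ==="
        db2pd "$opt" -db "$db_name"
    done
}

# Example use (SAMPLE is a placeholder): take two snapshots a minute
# apart and compare them.
#   takeover_snapshot SAMPLE > snap1.txt; sleep 60
#   takeover_snapshot SAMPLE > snap2.txt; diff snap1.txt snap2.txt
```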
For more information on db2pd, see the DB2 V9.7 Info Center and the DB2 V10.1 Info Center.
For non forced takeover, first examine the role of both databases after takeover failure. You may end up with two standby databases, but you will never end up with two primary databases. If a database is offline, use "HADR database role" field from "db2 get db cfg for <dbname>" command to determine the role. Do not attempt to restart a database before examining its role and working out a recovery plan.
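The role check can be scripted by filtering the configuration output. The sketch below assumes the usual "description = value" layout of "db2 get db cfg" and is meant to be run on each host in turn; the function name hadr_cfg_role is hypothetical.

```shell
# Print the value of the "HADR database role" field from
# "db2 get db cfg" output read on stdin.
hadr_cfg_role() {
    awk -F' = ' '/HADR database role/ { gsub(/[ \t]+/, "", $2); print $2; exit }'
}

# Example use (SAMPLE is a placeholder database name):
#   db2 get db cfg for SAMPLE | hadr_cfg_role
```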
For forced takeover, examine the role of the old standby after the failure (role of old primary is never changed). Assuming you still want to use the old standby as the new primary (chances are that this is a failover and old primary is not available):
If a takeover appears to hang, first monitor it to find out what is hanging (see "How to monitor takeover" above). If possible, resolve the issue and let the takeover end by itself. As a last resort, you can forcefully end the takeover. This may be necessary because service is unavailable while the takeover hangs (neither database can serve clients). Use the following procedure to stop takeover:
When a takeover is initiated, the other side must receive a connection from the takeover for it to proceed. This takeover connection can end up competing with connection requests from client applications. A large number of connection requests coinciding with takeover can delay the takeover. It is recommended to limit incoming connections as much as possible before initiating takeover.
There are a few methods to help alleviate the competition of the takeover with other connections:
Beginning in 11.1.2.2, the database can be configured to allow connectivity during the backward phase of crash recovery or takeover. The backward phase is divided into two parts. The first part is synchronous and undoes transactions involving DDL, catalog changes, column-organized tables, etc. This part still does not allow connections. The second part is asynchronous and undoes transactions that touch only regular objects. During this part, the database can accept connections.
If the registry variable DB2_ONLINERECOVERY=YES is set before takeover, the takeover command returns at the start of the asynchronous backward phase, when the database becomes connectable. At this moment, the tables, indexes, and other objects NOT associated with uncommitted transactions are fully accessible.
Another registry variable, DB2_ONLINERECOVERY_WITH_UR_ACCESS, controls whether the tables, indexes, and other objects associated with these uncommitted transactions are accessible using the UR isolation level. As these uncommitted transactions are compensated, the locks on the associated objects are released and the objects become fully accessible.
For details on this new feature, please see this Knowledge Center article.