Diagnosing HADR Primary-Standby Connection

Table of Contents

 

Symptom of Primary-Standby Connection Problem

When primary and standby can not connect to each other, the symptom is that primary database can not start (unless "start hadr as primary by force" is used). Primary startup will hang for hadr_timeout seconds, then return

SQL1768N Unable to start HADR. Reason code = 7
7 The primary database failed to establish a connection to its standby database within the HADR timeout interval.


Note that because standby startup does not require a connection, standby database can be started normally, but it will remain disconnected.

How does primary and standby connect to each other?

Primary acts as TCP server. Primary HADR thread listens on its hadr_local_host: hadr_local_svc address. Standby acts as TCP client. If the standby does not already have a connection to the primary, it will try connecting to its hadr_remote_host: hadr_remote_svc address every a few seconds. Once the primary receives the connection request, it will accept the connection. The connection is then established. HADR only uses one long lived TCP connection between a primary-standby pair. Both log data and control messages go through this connection.

The primary keeps listening on its hadr_local_host: hadr_local_svc address after a primary-standby connection has been established. When a new connection request comes in, the existing connection is closed. This behavior allows stale connections to be cleared quickly. Starting in DB2 V10.1, the standby also listens on its hadr_local_host: hadr_local_svc address, so that the primary can initiate contact to the standby. The primary initiated connections are transient. They are used only for short admin messages. The main log shipping channel is still initiated by the standby. In earlier releases, the standby does not listen on any address.

Diagnosing Connection Problem

When primary/standby connection does not work, check the following :

  • Check HADR related configuration parameters, especially TCP address parameters: hadr_local_host: hadr_local_svc, hadr_remote_host: hadr_remote_svc, and hadr_target_list (new in DB2 V10.1, for multiple standby support). Make sure that the parameters on primary and standby match up. The hadr_remote_host: hadr_remote_svc address on one database should match the other's hadr_local_host: hadr_local_svc (unless NAT (Network Address Translation) is used, where registry variable DB2_HADR_NO_IP_CHECK is set). For multiple standbys, make sure that hadr_target_list on one database includes the other's hadr_local_host: hadr_local_svc. For address parameters, check the following:
    • Check if full host name (such as "foo.ibm.com") is used instead of short name (such as "foo"). Full name is recommended. Short name can expand to different full names if the same name exists in different domains.
    • Check that a host name does resolve to the same IP address on primary and standby machines. Misconfigured network or machines can have the same name resolved to different IP address on different machines. On DB2 V9 or later versions, you can find messages like "hadr_local_host foo mapped to 1.2.3.40" and "hadr_remote_host foo2 mapped to 1.2.3.50" in db2diag.log. Check that the mapping is what you expect.
    • Check that the same service name resolves to the same port number on primary and standby machines. Misconfigured machines can have a service name resolved to different port numbers on different machines.
    • Do not use "localhost" or its IP equivalent (127.0.0.1 for IP v4) for either hadr_local_host or hadr_remote_host. Such config is error prone. "Localhost" is treated as a virtual NIC, separate from the physical NIC. For example, if primary listens on "localhost", and standby tries to connect to primary's actual hostname, they cannot make the connection.
       
  • Try using numeric host address (such as 1.2.3.4) and/or numeric service port parameters.
     
  • Changes to HADR database config parameters are not picked up until the database is deactivated and reactivated. On exception is that on V10.1 and later, HADR primary database will pick up new values after "stop hadr" and "start hadr". See Refreshing HADR configuration
     
  • Check db2diag.log on both primary and standby for hints. A connection will be rejected if the two databases are not compatible. Examples of incompatibilities include different OS type, bit size (32 vs. 64 bit), DB2 version, or HADR config (hadr_syncmode, hadr_timeout), etc.
     
  • If HADR primary and standby are in NAT (Network Address Translation) environment, you need to set registry variable DB2_HADR_NO_IP_CHECK to "ON" to bypass remote/local host IP address cross check. The registry variable should be set on BOTH primary and standby even if only one is under NAT control.
     
  • Make sure that your firewall allows primary and standby to make the TCP connection. Firewall should be configured to allow connection with either database as the primary.
  • Make sure that primary and standby database processes have sufficient permission to access the ports. Generally, ports below 1024 require root privilege to listen on. If DB2 is not installed with root set uid, you will have to use a higher port number.
  • Start primary, before the command returns (while it's still waiting in hadr_timeout), run "netstat -all" to check if it is really listening on the port ("-all" option is needed to list listening ports). Note, when the command returns, database start has failed and all ports have been released. If you would like more time to check the port, start the primary with "start hadr as primary by force". As long as the primary database is online, it should be listening on the port. Expect a netstat entry like the following:

Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 host.ibm.com:hadr_svc *:* LISTEN

Testing HADR configuration using HADR simulator

You can try HADR simulator on the HADR addresses (note that simulator only accepts port number. It does not accept service name). Simulator gives detailed information on host name resolution. It can also reproduce and isolate problems like firewall, port privilege.

Testing TCP connection via telnet

We can use telnet as a generic TCP client to test connection to the primary database. First, start primary with "start hadr as primary by force" and verify that it is listening on the port ("netstat -all"). Then run "telnet <primaryHost> <primaryLocalSvcPort>" on the standby machine using the standby instance user id. Telnet will try connecting to the primary address as a TCP client. If telnet fails to connect to the port, it will give an error like

telnet: connect to address <primaryHostIP>: Connection refused

"Connection refused" usually indicates that no process is listening on the port.

If telnet fails to connect to the primary from the standby machine, you can try telnet from the primary machine. This will verify if a local TCP client can connect. If it can not connect locally, problem is likely to be on the primary machine. If it can, problem is likely to be primary-standby network (such as a firewall between primary and standby).

If telnet can connect to the port, you can then type "whatever" and expect the following:

Trying <primaryHostIP>...
Connected to <primaryHost>.
Escape character is '^]'.
whatever
Connection closed by foreign host.


On the primary side, you will get the following in db2diag.log:

2006-11-12-12.21.49.035929-480 I19209E328 LEVEL: Error
PID : 6178 TID : 183039466048PROC : db2hadrp (HADRDB)
INSTANCE: myInst NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrHandleHsAck, probe: 30330
MESSAGE : remote connection is not following correct msg format


This is because "whatever" is not a well formed HADR message. Primary will then close the connection. Telnet will see "Connection closed by foreign host.".

If the problem still can not be resolved, contact IBM tech support.

"Failed to connect to primary" error message in db2diag.log

The following error messages will be printed to db2diag.log when standby can not connect to primary:

2006-11-13-14.12.25.760666-480 I35290E408 LEVEL: Severe
PID : 10148 TID : 183039466048PROC : db2hadrs (HADRDB)
INSTANCE: myInst2 NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe: 20280
MESSAGE : Failed to connect to primary. rc:
DATA #1 : Hexdump, 4 bytes
0x0000007FBFFFADD4 : 1900 0F81 ....

2006-11-13-14.12.25.765467-480 I35699E344 LEVEL: Severe
PID : 10148 TID : 183039466048PROC : db2hadrs (HADRDB)
INSTANCE: myInst2 NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe: 20280
RETCODE : ZRC 0x810F0019 -2129723367 SQLO_CONN_REFUSED "Connection refused"


Note that the actual ZRC (return code) depends on the specific scenario. SQLO_CONN_REFUSED is shown here only as an example.

To avoid filling up db2diag.log file system, but still keep reminding the user of the abnormal condition, the standby database prints these messages every 10 minutes while the condition persists. The frequency of the messages has no relation to the interval at which the standby tries to connect to the primary. Standby always tries to connect to the primary every a few seconds as long as it has no connection to the primary.