Table of Contents
When primary and standby can not connect to each other, the symptom is that primary database can not start (unless "start hadr as primary by force" is used). Primary startup will hang for hadr_timeout seconds, then return
SQL1768N Unable to start HADR. Reason code = 7
7 The primary database failed to establish a connection to its standby database within the HADR timeout interval.
Note that because standby startup does not require a connection, standby database can be started normally, but it will remain disconnected.
Primary acts as TCP server. Primary HADR thread listens on its hadr_local_host: hadr_local_svc address. Standby acts as TCP client. If the standby does not already have a connection to the primary, it will try connecting to its hadr_remote_host: hadr_remote_svc address every a few seconds. Once the primary receives the connection request, it will accept the connection. The connection is then established. HADR only uses one long lived TCP connection between a primary-standby pair. Both log data and control messages go through this connection.
The primary keeps listening on its hadr_local_host: hadr_local_svc address after a primary-standby connection has been established. When a new connection request comes in, the existing connection is closed. This behavior allows stale connections to be cleared quickly. Starting in DB2 V10.1, the standby also listens on its hadr_local_host: hadr_local_svc address, so that the primary can initiate contact to the standby. The primary initiated connections are transient. They are used only for short admin messages. The main log shipping channel is still initiated by the standby. In earlier releases, the standby does not listen on any address.
When primary/standby connection does not work, check the following :
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 0 host.ibm.com:hadr_svc *:* LISTEN
You can try HADR simulator on the HADR addresses (note that simulator only accepts port number. It does not accept service name). Simulator gives detailed information on host name resolution. It can also reproduce and isolate problems like firewall, port privilege.
We can use telnet as a generic TCP client to test connection to the primary database. First, start primary with "start hadr as primary by force" and verify that it is listening on the port ("netstat -all"). Then run "telnet <primaryHost> <primaryLocalSvcPort>" on the standby machine using the standby instance user id. Telnet will try connecting to the primary address as a TCP client. If telnet fails to connect to the port, it will give an error like
telnet: connect to address <primaryHostIP>: Connection refused
"Connection refused" usually indicates that no process is listening on the port.
If telnet fails to connect to the primary from the standby machine, you can try telnet from the primary machine. This will verify if a local TCP client can connect. If it can not connect locally, problem is likely to be on the primary machine. If it can, problem is likely to be primary-standby network (such as a firewall between primary and standby).
If telnet can connect to the port, you can then type "whatever" and expect the following:
Trying <primaryHostIP>...
Connected to <primaryHost>.
Escape character is '^]'.
whatever
Connection closed by foreign host.
On the primary side, you will get the following in db2diag.log:
2006-11-12-12.21.49.035929-480 I19209E328 LEVEL: Error
PID : 6178 TID : 183039466048PROC : db2hadrp (HADRDB)
INSTANCE: myInst NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrHandleHsAck, probe: 30330
MESSAGE : remote connection is not following correct msg format
This is because "whatever" is not a well formed HADR message. Primary will then close the connection. Telnet will see "Connection closed by foreign host.".
If the problem still can not be resolved, contact IBM tech support.
The following error messages will be printed to db2diag.log when standby can not connect to primary:
2006-11-13-14.12.25.760666-480 I35290E408 LEVEL: Severe
PID : 10148 TID : 183039466048PROC : db2hadrs (HADRDB)
INSTANCE: myInst2 NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe: 20280
MESSAGE : Failed to connect to primary. rc:
DATA #1 : Hexdump, 4 bytes
0x0000007FBFFFADD4 : 1900 0F81 ....
2006-11-13-14.12.25.765467-480 I35699E344 LEVEL: Severe
PID : 10148 TID : 183039466048PROC : db2hadrs (HADRDB)
INSTANCE: myInst2 NODE : 000
FUNCTION: DB2 UDB, High Availability Disaster Recovery, hdrEduAcceptEvent, probe: 20280
RETCODE : ZRC 0x810F0019 -2129723367 SQLO_CONN_REFUSED "Connection refused"
Note that the actual ZRC (return code) depends on the specific scenario. SQLO_CONN_REFUSED is shown here only as an example.
To avoid filling up db2diag.log file system, but still keep reminding the user of the abnormal condition, the standby database prints these messages every 10 minutes while the condition persists. The frequency of the messages has no relation to the interval at which the standby tries to connect to the primary. Standby always tries to connect to the primary every a few seconds as long as it has no connection to the primary.