HADR Socket Buffer Size Tuning
DB2 registry variables DB2_HADR_SOSNDBUF and DB2_HADR_SORCVBUF (first available in V8fp17, V91fp5 and V95fp2) specify the socket send and receive buffer sizes for the HADR TCP connection. They allow tuning of the TCP window size (controlled via socket buffer size on most systems) for the HADR connection without impacting other TCP connections on the system. These parameters apply on both the primary and the standby.
The host machine may round the requested size up to certain sizes (such as a power of two or a multiple of the network packet size), or cap it at the system limit, without returning an error (observed on Linux). DB2 does not fail HADR startup if the actual size is smaller than the requested size, so it is important that you verify the actual size against the requested size. See Monitoring socket buffer size below.
There are two major aspects to setting socket buffer size on HADR systems:
- Maximizing TCP throughput
- Buffering for HADR log shipping
This article discusses both aspects.
Because HADR is network dependent, a properly tuned network is very important. TCP window size is an important parameter for optimal TCP performance. The general rule is:
TCP window size = send_rate * round_trip_time
The reason is that TCP is a reliable protocol: it guarantees delivery of data submitted by applications (via the send() call). The submitted data is copied to the OS TCP buffer (the socket buffer on most systems). The sender's TCP layer cannot release a portion of the TCP send buffer until it has received an ack message from the receiver confirming that the data in that portion of the buffer has been received. If it does not receive the ack message within a timeout, it resends the data. Consider the following sequence:
- Sender sends data onto the network.
- The first part of the data arrives at the receiver. (one way network travel time)
- Receiver sends back an ack message.
- Ack message arrives on sender. (one way network travel time)
- Sender releases buffer holding the first part.
This sequence takes one network round trip time (excluding sender and receiver processing time, which is usually very short compared to network delay). While the sender is waiting for the ack message (the wait time is the round trip time), it needs to keep sending more data, at the maximal send rate, to fully utilize the network bandwidth. The amount of data it can send (and must buffer) during the wait is send_rate * round_trip_time. This is the reasoning behind the TCP window size formula.
The formula is a theoretical one. In reality, the window usually needs to be bigger, because the receiver will not ack until it receives the "first part", whose size is system dependent, and it takes time for the "first part" to be fully shipped to the receiver. To maximize throughput, "send rate" in the formula should be the maximal rate (this matters more on systems capable of burst sends) and round trip time should be the worst case (on a WAN, the time is likely to vary more than on a LAN, due to all the relay systems on the network path).
Note that we are talking about the OS socket buffer and TCP ack messages, which are different from the HADR buffer DB2 maintains and the HADR ack messages DB2 sends. The HADR buffer and messages are at a higher layer in the network stack.
On a LAN, the system default is usually large enough because the round trip time is short. Example: 1000 Mbit/second * 0.1 ms = 12500 bytes. Most systems have a default larger than 12500 bytes.
On a WAN, the system default is often not large enough because of the longer round trip time. Example: 10 Mbit/second * 100 ms = 125 KB. Many systems have a default smaller than 125 KB; such systems require setting the TCP window size. Setting a large size at the system level would consume a large amount of memory if there are many connections on the system, as is the case with many client/server connections. Thus setting the window size for HADR only is desirable.
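The arithmetic behind these two examples can be sketched as a small helper function (a hypothetical name for illustration, not part of DB2 or the simulator):

```python
def tcp_window_bytes(bandwidth_bits_per_sec, round_trip_sec):
    """Minimum TCP window (socket buffer) size in bytes:
    window = send_rate * round_trip_time."""
    return int(bandwidth_bits_per_sec / 8 * round_trip_sec)

# LAN example: 1000 Mbit/s link, 0.1 ms round trip
print(tcp_window_bytes(1000e6, 0.0001))   # 12500 bytes

# WAN example: 10 Mbit/s link, 100 ms round trip
print(tcp_window_bytes(10e6, 0.1))        # 125000 bytes (125 KB)
```

Note the division by 8: link speeds are quoted in bits per second, while buffer sizes are in bytes.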
When the window size is too small, the network cannot fully utilize its bandwidth, and applications like HADR will see throughput lower than the nominal bandwidth. A larger than necessary size usually causes no harm other than consuming more memory.
It is recommended that the send and receive sizes be set to the same value and applied to both the primary and standby databases. It has been observed on Windows systems that setting the sender side buffer size alone is not enough to increase throughput. The recommendation is to use a single size for all four parameters:
- Primary side send buf size
- Primary side receive buf size
- Standby side send buf size
- Standby side receive buf size
TCP windows larger than 64 KB require TCP window scaling (also known as RFC 1323). Some systems automatically enable window scaling when the TCP buffer size is larger than 64 KB; some require explicit enablement. When your window size is greater than 64 KB, check whether you need to explicitly enable window scaling.
In the TCP window size formula
TCP window size = send_rate * round_trip_time
the send rate is the nominal bandwidth. For example, if you have Gigabit Ethernet, the nominal rate is 1 gigabit/second. If you lease a WAN line from a telecom vendor, the vendor should be able to tell you the speed of the line. Round trip time can be easily measured via the ping command: just ping the standby host from the primary host and vice versa. The results should be close; if not, investigate.
In reality, some DBAs do not actually know the nominal rate of the primary-standby network. It is common in large corporations that another department manages the network and the DBAs are not familiar with its details. This is not a problem when using the simulator: we can easily find the rate with an incremental search. Start from a 64KB buffer and double the buffer size until the throughput no longer improves. The highest throughput you get is the bandwidth.
Even when the send rate and round trip time are known, the incremental search procedure is still recommended. The window size formula is only a general guide; an actual test via the simulator should be used to determine the window size to deploy.
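The incremental search loop can be sketched as follows. The measure_throughput callback is a stand-in for one timed simulator run at a given buffer size (in practice you run simhadr and read the MBytes/sec line); the function name and tolerance are assumptions for illustration:

```python
def find_bandwidth(measure_throughput, start=64 * 1024, tolerance=0.05):
    """Double the socket buffer size until throughput stops improving.

    measure_throughput(buf_size) returns throughput in MBytes/sec for
    one test run with that buffer size. Returns (buffer_size, peak_rate):
    the smallest buffer that reached the peak, and the peak throughput,
    which approximates the network bandwidth.
    """
    buf = start
    best = measure_throughput(buf)
    while True:
        buf *= 2
        rate = measure_throughput(buf)
        if rate <= best * (1 + tolerance):   # no meaningful improvement
            return buf // 2, best
        best = rate

# Mock run shaped like the Portland/San Francisco results below:
# throughput grows with buffer size, then saturates at 4.8 MB/sec.
size, peak = find_bandwidth(lambda b: min(b / 65536 * 1.5, 4.8))
print(size, peak)   # 262144 4.8
```

In a real search each measure_throughput call is a 60-second simulator run, so the whole procedure only takes a few minutes.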
See the HADR simulator page for full documentation of the simulator.
Run the test when the network has no other workload, or at least no other heavy load, because the test will stress the network. The recommended time for each run is 60 seconds. Use the default flush size (this test is not sensitive to flush size). You must use ASYNC mode for this test. Sample commands (for a 64KB buffer):
On primary host portland.ibm.com (Note the "-syncmode ASYNC" option):
simhadr_linux -lhost portland.ibm.com -lport 4000 -rhost sanfrancisco.ibm.com -rport 4000 -sockSndBuf 65536 -sockRcvBuf 65536 -role primary -syncmode ASYNC -t 60
On standby host sanfrancisco.ibm.com:
simhadr_linux -lhost sanfrancisco.ibm.com -lport 4000 -rhost portland.ibm.com -rport 4000 -sockSndBuf 65536 -sockRcvBuf 65536 -role standby
In the output, look for the "MBytes/sec" field as shown below. For your convenience, this line is produced on both the primary and standby sides. It starts with the sync mode (this test requires ASYNC mode). It reports the HADR log shipping rate (primary sends logs to the standby). The line does not say "send" or "receive" because it is the send rate on the primary and the receive rate on the standby. HADR traffic is heavily asymmetric: standby to primary traffic is much lighter, and for this test we only care about the primary to standby log shipping traffic.
ASYNC: Total 2097152 bytes in 12.460158 seconds, 0.168309 MBytes/sec
Note: When the buffer is large, you may notice that the throughput reported by the primary is higher than that reported by the standby. This is because the primary considers a send done when the send() call returns (data is copied to the TCP send buffer, but may not have reached the standby), while the standby counts data as received only when it actually gets the data. So the standby side number is more accurate. "Large" here is relative to the total amount of data shipped. In the 1MB buffer example below, 289 MB of data was sent in 60 seconds (at 4.8 MB/sec), so 1MB is relatively very small and should not cause much discrepancy between the primary and standby numbers. When the buffer is large, use longer test runs to increase the total amount of data shipped and make the buffer relatively small.
Below are actual results between IBM labs in Portland, Oregon, and San Francisco, California. The physical distance is about 600 miles. This is a WAN example; the impact of socket buffer size on network throughput is significant.
Buffer   Throughput
64k      1.554048 MBytes/sec
128k     3.347476 MBytes/sec
256k     4.734673 MBytes/sec
512k     4.834047 MBytes/sec
1M       4.821998 MBytes/sec
As seen above, throughput peaks at 4.8 MB/sec, which indicates that the send rate is 4.8 MB/sec. The round trip time (measured by ping) is 0.032 seconds (normal for this distance). The computed TCP window size is
4.8 MB/sec * 0.032 second = 153 KB
In reality, we see throughput leveling off from 256 KB. To be safe (a few MB of memory is really nothing nowadays), 512KB is recommended as the minimal socket buffer size. DB2 log flush size may require a larger buffer; use the larger of the TCP requirement and the flush size requirement.
The search procedure starts from 64KB. Smaller sizes are not recommended in any case, regardless of network requirements, because of the other aspect of buffer size: buffering DB2 log shipping.
The other function of socket buffer in HADR systems is to buffer HADR log shipping.
On the sender side, a TCP send buffer size greater than the send() size minimizes the number of send calls. On Unix systems, when the TCP send buffer size is smaller than the send() size, the TCP layer cannot copy all the data into the buffer; it will only send (copy) part of the requested data and ask the application to call send() again for the remaining data. On Windows, data is first copied into an intermediate buffer (the application has no control over the intermediate buffer's size), then to the socket buffer. But it is still good to configure the socket buffer size to be greater than the send size on Windows.
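The Unix partial-send behavior described above is why applications loop on send() until all data is accepted. A minimal sketch with plain Python sockets (not DB2 code; the 16KB buffer size and helper name are assumptions for the demonstration):

```python
import socket
import threading

def send_full(sock, data):
    """Keep calling send() until the whole buffer is accepted.
    On Unix, send() may copy only part of the data when the TCP
    send buffer is smaller than the requested send size."""
    view = memoryview(data)
    while len(view) > 0:
        sent = sock.send(view)     # may be a partial send
        view = view[sent:]

# Demonstration on a local socket pair with a small send buffer.
a, b = socket.socketpair()
a.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, 16 * 1024)
payload = b"x" * (1024 * 1024)     # 1 MB, far larger than the buffer

received = bytearray()
def reader():
    # Drain the other end so the sender's buffer keeps emptying.
    while len(received) < len(payload):
        received.extend(b.recv(64 * 1024))

t = threading.Thread(target=reader)
t.start()
send_full(a, payload)
t.join()
print(len(received))               # 1048576
```

The loop is exactly what a "one flush, one send" design avoids: when the socket buffer is at least as large as the flush, a single send() call normally suffices.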
In HADR peer state, the log writer flush size is the HADR send() size: the same piece of log buffer is passed to write() for disk and to send() for the network. In HADR remote catchup state, HADR reads from disk and sends to the network in 64KB units. Thus we want a minimum send buffer size of 64KB.
DB2 log flush size is dynamic. To measure the maximal flush size, first scan the log files using the DB2 log scanner, then run the scanner output through the HADR calculator. See the HADR performance tuning procedure for details.
On the receiver side, the HADR thread in DB2 receives log data from the TCP layer, then writes the data to disk, in the same thread. While the thread is writing, the OS can continue to copy more incoming data into the TCP buffer (the socket buffer). The typical write size is one flush. Thus a socket receive buffer at least the size of a flush maximizes the overlap of receiving and writing log data.
For ASYNC mode, when there is an intermittent network hiccup such that the primary is unable to send out log data, a larger TCP send buffer (larger than one flush) may help. You may set the size to buffer the primary workload for a certain amount of time; for example, 30MB to buffer 10 seconds of workload at 3MB/sec. A hiccup shorter than this duration will not block primary logging. This setup does not help SYNC and NEARSYNC modes because even if the log data can be buffered in the TCP send buffer, primary logging is blocked until the standby receives the data and acks back.
However, a larger send buffer in ASYNC mode increases the potential data loss in a failure: should the primary host crash, any data in its send buffer not yet sent out is lost.
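The sizing rule above is just the logging rate multiplied by the hiccup duration you want to ride out (the helper name is hypothetical, not a DB2 parameter):

```python
def async_send_buf_bytes(log_rate_bytes_per_sec, hiccup_sec):
    """Send buffer needed to absorb a network hiccup of hiccup_sec
    seconds without blocking primary logging (ASYNC mode only)."""
    return int(log_rate_bytes_per_sec * hiccup_sec)

# 10 seconds of workload at 3 MB/sec -> 30 MB buffer
print(async_send_buf_bytes(3_000_000, 10))   # 30000000
```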
On DB2 V10.1 and later, the following fields in db2pd -hadr or MON_GET_HADR table function can be used to monitor socket buffer size.
The host machine may round the requested size up to certain sizes (such as a power of two or a multiple of the network packet size), or cap it at the system limit, without returning an error (observed on Linux). Therefore the actual size does not always match the requested size. DB2 does not fail HADR startup if the actual size is smaller than the requested size, so it is important that you verify the actual size against the requested size.
Some operating systems may assign a buffer size at socket creation, then adjust the size upon connection to optimize it for the particular connection once remote end information becomes available. The SOCK_SEND_BUF_ACTUAL and SOCK_RECV_BUF_ACTUAL fields returned before connection reflect the initial size at socket creation; the values returned after connection reflect the adjusted size.
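The request-then-verify pattern that DB2 performs at socket creation can be illustrated with a plain socket (a generic sketch, not DB2 code; the 512KB request is an arbitrary example):

```python
import socket

# Requested size; the kernel may round it (for example, Linux doubles
# the value to leave room for bookkeeping) or cap it at the system
# limit without returning an error.
requested = 512 * 1024

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF, requested)
s.setsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF, requested)

# Read back what the kernel actually granted and compare.
actual_snd = s.getsockopt(socket.SOL_SOCKET, socket.SO_SNDBUF)
actual_rcv = s.getsockopt(socket.SOL_SOCKET, socket.SO_RCVBUF)
print(actual_snd, actual_rcv)   # compare against the requested size
s.close()
```

If the actual size comes back well below the requested size, raise the system-wide limits (for example net.core.wmem_max and net.core.rmem_max on Linux) before retrying.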
In older releases, look for these messages in db2diag.log:
If DB2_HADR_SOSNDBUF or DB2_HADR_SORCVBUF is not set (system default is used):
HADR Socket send buffer size, SO_SNDBUF: X bytes
HADR Socket receive buffer size, SO_RCVBUF: X bytes
If DB2_HADR_SOSNDBUF or DB2_HADR_SORCVBUF is set:
HADR Socket send buffer size, SO_SNDBUF: Requested X bytes, Actual Y bytes
HADR Socket receive buffer size, SO_RCVBUF: Requested X bytes, Actual Y bytes
The above messages show the buffer size at socket creation. Some operating systems may adjust the buffer size upon connection; the final size used is printed as follows (this message is printed whether or not the registry variables are set):
HADR Socket send buffer size adjusted to, SO_SNDBUF: X bytes
HADR Socket receive buffer size adjusted to, SO_RCVBUF: X bytes