Lab: High CPU
This lab covers how to investigate high CPU for a sample Liberty application in OpenShift.
Theory
There are many ways to review high CPU for Liberty in an OpenShift environment:
- Review verbose garbage collection during the issue to check for a high proportion of time in garbage collection
- Manually gather thread dumps during the issue by opening a terminal in the container and executing
kill -3 $PID
- If the application is installed using the WebSphere Liberty Operator, the WebSphereLibertyDump custom resource may be used to gather a Liberty server dump with a thread dump. This only works if the
WebSphereLibertyApplication
custom resource has the properopenliberty.io/day2operations
annotation, and if the container has a/serviceability
directory. - If you have
cluster-admin
permissions, use the MustGather: Performance, hang, or high CPU issues with WebSphere Application Server on Linux on Containers during the issue and search for repeating patterns in thread dumps - If running IBM Java 8, gather a headless mode HealthCenter Java sampling profiler collection
- If you have
cluster-admin
permissions, gather a Linuxperf
native sampling profiler collection - Run
top -H
to review per-thread CPU utilization, if the container image includes thetop
utility - Use third-party monitoring products such as Instana
This lab will only cover some of the above methods.
Labs
Choose one or more labs:
- Lab: Manually gather thread dumps during the issue
- Lab: Use the performance/hang/high CPU MustGather on Linux on Containers
- You could also gather a
server dump
as done in the High response times lab
Lab: Manually gather thread dumps during the issue
This lab will gather thread dumps showing accumulated CPU by thread and associated stacks. Any statistical patterns in thread stacks associated with high CPU threads may be used to infer potential causes of high CPU.
This method is generally used when there is persistently high CPU. If CPU spikes are infrequent, then this technique is more difficult unless you capture a lot of thread dumps.
This lab will take approximately 10 minutes.
Step 1: Install example application
If you haven't already, install the sample application. If you installed it in a previous lab, you may continue using the previous installation.
Step 2: Exercise high CPU
You will simulate 5 concurrent users sending HTTP requests that perform CPU intensive operations such as compiling regular expressions and performing mathematical operations.
Using the command line
-
Request the following web page from your terminal to exercise high CPU:
-
macOS, Linux, or Windows with Cygwin:
curl -k -s "https://$(oc get route libertydiag "--output=jsonpath={.spec.host}")/servlet/LoadRunner?url=http%3A%2F%2Flocalhost%3A9080%2Fservlet%2FDoComplicatedStuff%3Fmoreiterations%3Dtrue&method=get&entity=&concurrentusers=5&totalrequests=0&user=&password=&infinite=on"
-
Windows with Command Prompt:
- Ensure you have
curl
for Windows installed -
List the application's URL:
oc get route libertydiag "--output=jsonpath={.spec.host}{'\n'}"
-
Execute the following command, replacing
$HOST
with the output of the previous command:curl -k -s "https://$HOST/servlet/LoadRunner?url=http%3A%2F%2Flocalhost%3A9080%2Fservlet%2FDoComplicatedStuff%3Fmoreiterations%3Dtrue&method=get&entity=&concurrentusers=5&totalrequests=0&user=&password=&infinite=on"
- Ensure you have
-
-
This load will run indefinitely. Continue to the next step.
Using the browser
- Click on the
Load Runner
link from the libertydiag application homepage:
-
In the
Target URL
, copy and paste the following:http://localhost:9080/servlet/DoComplicatedStuff?moreiterations=true
-
Scroll down and check
Infinite (until manually stopped)
- Scroll to the bottom and click
Start
- This load will run indefinitely. Continue to the next step.
Step 3: Manually gather thread dumps
Now you will gather a number of thread dumps.
Using the command line
-
List the pods for the example application deployment; for example:
oc get pods
Example output:
NAME READY STATUS RESTARTS AGE libertydiag-b98748954-mgj64 1/1 Running 0 97s
-
Open a shell into the pod by replacing
$POD
with a pod name from the previous command:oc rsh -t $POD
For example:
oc rsh -t libertydiag-b98748954-mgj64
-
Note that the remote shell might timeout after a short period of inactivity, so be aware that you might have been logged out and you'll need to
oc rsh
back in to continue where you left off. -
First, we need to find the process ID (PID) of Liberty. Most Liberty images do not have tools like
ps
ortop
pre-installed. However, most Liberty images only have a single process in the container which is the Java process running Liberty, and this has the PID of 1. Double check that this is the Liberty process by doing a full listing on PID 1:ls -l /proc/1/
-
If you see a Liberty current working directory (
cwd
) such as/opt/ol/wlp
or/opt/ibm/wlp
then you can assume that is the Liberty process. Otherwise, runls -l /proc/[0-9]*
and then explore each PID to find the Liberty process.ls -l /proc/1
Example output:
[...] lrwxrwxrwx. 1 1000830000 root 0 Dec 6 17:45 cwd -> /opt/ibm/wlp/output/defaultServer -r--------. 1 1000830000 root 0 Dec 6 17:45 environ lrwxrwxrwx. 1 1000830000 root 0 Dec 6 14:57 exe -> /opt/ibm/java/jre/bin/java [...]
-
Create a thread dump by sending the
SIGQUIT
signal to the process using thekill -QUIT $PID
command. Replace$PID
with the process ID of Liberty that you found above. Note that, by default, Java handles theSIGQUIT
signal, creates the thread dump, and the process continues, so the process is not killed, despite the names of the command and signal.kill -QUIT 1
-
It's common to wait about 30 seconds between thread dumps. Take another 1 or more thread dumps, waiting some time in between.
kill -QUIT 1
-
Normally, thread dumps for IBM Java and Semeru will be produced as
javacore*txt
files in thecwd
directory that you found above:ls -l /opt/ibm/wlp/output/defaultServer/javacore*txt
However, in the case of this sample application, this directory is overridden with an
-Xdump
configuration. You can check JVM configurations by printing the processcmdline
andenviron
files and find the relevant configuration. For example:cat /proc/1/cmdline /proc/1/environ
Example output:
[...] -Xdump:directory=logs/diagnostics/
Therefore, for this application, javacores will go into
logs/diagnostics/
relative tocwd
:ls /opt/ibm/wlp/output/defaultServer/logs/diagnostics/javacore*txt
Example output:
/opt/ibm/wlp/output/defaultServer/logs/diagnostics/javacore.20221206.175535.1.0001.txt /opt/ibm/wlp/output/defaultServer/logs/diagnostics/javacore.20221206.175626.1.0002.txt
Note that overridding the
-Xdump
directory is common in container deployments so that a directory may be used that's mounted on a permanent disk so that diagnostics are still available if a pod is killed.
Using the browser
- In the
Topology
view of theDeveloper
perspective, click on thelibertydiag
circle, then click theResources
tab in the drawer on the right, and then click on the one pod that's running:
- Click on the
Terminal
tab to open a remote shell into the running container in the pod:
- Follow the
Using the command line
steps above starting at step 4.
Step 4: Download thread dumps
Using the command line
- Download the thread dumps by replacing
$POD
with a pod name from above and$DIR
with the directory of the javacores. Note thatoc cp
does not support wildcards so the whole directory (or a single file) must be downloaded.oc cp $POD:$DIR .
For example:
oc cp libertydiag-ddf5f95b6-wj6dm:/opt/ibm/wlp/output/defaultServer/logs/diagnostics .
Using the browser
Files other than native logs (equivalent to Liberty's console.log
) cannot be downloaded through the browser. You must use the command line steps above. Alternatively, you may use the Terminal
tab of the pod and cat
the file in the browser.
Step 5: Analyze the thread dumps
If you are familiar with analyzing thread dumps for high CPU, you may skip this step.
Review thread dumps
-
Open each thread dump and search for
Thread Details
:1XMTHDINFO Thread Details
-
Review each thread (starting with
3XMTHREADINFO
) and, in particular, the3XMCPUTIME
line:3XMTHREADINFO "Default Executor-thread-189" J9VMThread:0x00000000002BA700, omrthread_t:0x00007F7FD0014240, java/lang/Thread:0x00000000E43D62C0, state:R, prio=5 [...] 3XMCPUTIME CPU usage total: 6.839021278 secs, current category="Application"
-
The
CPU usage total
is accumulated CPU usage since the beginning of that thread. Therefore, finding high users of CPU is most commonly done by comparing these values across thread dumps for particular threads. Note that if a thread is short-lived, then it may not be captured in the thread dumps, or may only exist in a subset of thread dumps. - With a sufficient number of thread dumps and a persistent pattern of stack tops for high CPU threads, this may be used to hypothesize likely causes of high CPU. In this case, there is a peristent pattern of application stacks in the
DoComplicatedStuff.doWork
method driving code such as mathematical operations, regular expression compilation, etc. For example:3XMTHREADINFO3 Java callstack: 4XESTACKTRACE at java/math/BigInteger.cutOffLeadingZeroes(BigInteger.java:1505(Compiled Code)) 4XESTACKTRACE at java/math/Division.divideAndRemainderByInteger(Division.java:344(Compiled Code)) 4XESTACKTRACE at java/math/BigInteger.divideAndRemainder(BigInteger.java:1202(Compiled Code)) 4XESTACKTRACE at java/math/BigDecimal.slScaledDivide(BigDecimal.java:3489(Compiled Code)) 4XESTACKTRACE at java/math/BigDecimal.divide(BigDecimal.java:2446(Compiled Code)) 4XESTACKTRACE at java/math/BigDecimal.divide(BigDecimal.java:2424(Compiled Code)) 4XESTACKTRACE at java/math/BigDecimal.divide(BigDecimal.java:2389(Compiled Code)) 4XESTACKTRACE at com/example/servlet/DoComplicatedStuff.arctan(DoComplicatedStuff.java:141(Compiled Code)) 4XESTACKTRACE at com/example/servlet/DoComplicatedStuff.computePi(DoComplicatedStuff.java:119(Compiled Code)) 4XESTACKTRACE at com/example/servlet/DoComplicatedStuff.doWork(DoComplicatedStuff.java:60(Compiled Code))
Step 6: Clean-up
The pod will be using a lot of CPU, so you will kill the pod. Kubernetes will automatically create a new replacement pod in preparation for the next lab.
Using the command line
-
List the pods; for example:
oc get pods
Example output:
NAME READY STATUS RESTARTS AGE libertydiag-b98748954-mgj64 1/1 Running 0 97s
-
Kill the running pod by replacing
$POD
with a pod name from the previous command:oc delete pod $POD
For example:
oc delete pod libertydiag-b98748954-mgj64
-
Wait for the replacement pod to come up:
oc wait deployment libertydiag --for condition=available --timeout=5m
Using the browser
- In the
Topology
view of theDeveloper
perspective, click on thelibertydiag
circle, then click theResources
tab in the drawer on the right, and then click on the one pod that's running:
- In the top right, click Actions } Delete Pod
- While the new pod is initializing, there will be a light blue circle around the deployment. Wait until the circle turns into a dark blue, signifying the application is ready. This may take up to 2 minutes or more depending on available cluster resources and namespace limits.
Summary
In summary, this lab demonstrated how to manually gather thread dumps for Liberty running in a container and review CPU utilization by thread over time. With a sufficient number of thread dumps and a persistent pattern of stack tops for high CPU threads, this may be used to hypothesize likely causes of high CPU.
Lab: Use the performance/hang/high CPU MustGather on Linux on Containers
This lab will use IBM Support's MustGather: Performance, hang, or high CPU issues with WebSphere Application Server on Linux on Containers to gather thread dumps showing any requests being processed. Any statistical patterns in thread stacks may be used to infer potential causes of high CPU. This MustGather is publicly available and nearly the same as the standalone Linux performance/hang/high-CPU MustGather in that it gathers CPU, memory, disk, network information, thread dumps, etc., and customers should be encouraged to use it if they can accept that it requires cluster-admin
permissions to execute.
This lab will demonstrate how to execute the MustGather, download the diagnostics, and review them for potential causes of high CPU.
Note: This lab requires that the user has cluster-admin
permissions. A future version of the MustGather will not require administrator permissions.
This lab will take approximately 10 minutes.
Step 0: Check if you have cluster-admin permissions
These steps will show if you have cluster-admin
permissions. If you do not, you must skip this lab.
Using the command line
-
Check if you have authority for all verbs on all resources:
oc auth can-i '*' '*'
Example output:
yes
-
If the answer is
no
, then you do not havecluster-admin
permissions.
Using the browser
- Access your OpenShift web console at
https://console-openshift-console.$CLUSTER/
. Replace$CLUSTER
with your OpenShift cluster domain. - Ensure the perspective is set to
Administrator
in the top left:
- Expand
User Management
. If you don't see aUsers
option, then you do not havecluster-admin
permissions. If you do see it, click on it, and then click on your user name:
- Click on
RoleBindings
and check if any binding has aRole ref
ofcluster-admin
. If there are none, then you do not havecluster-admin
permissions.
Step 1: Install example application
If you haven't already, install the sample application. If you installed it in a previous lab, you may continue using the previous installation.
Step 2: Exercise high CPU
You will simulate 5 concurrent users sending HTTP requests that perform CPU intensive operations such as compiling regular expressions and performing mathematical operations.
Using the command line
-
Request the following web page from your terminal to exercise high CPU:
-
macOS, Linux, or Windows with Cygwin:
curl -k -s "https://$(oc get route libertydiag "--output=jsonpath={.spec.host}")/servlet/LoadRunner?url=http%3A%2F%2Flocalhost%3A9080%2Fservlet%2FDoComplicatedStuff%3Fmoreiterations%3Dtrue&method=get&entity=&concurrentusers=5&totalrequests=0&user=&password=&infinite=on"
-
Windows with Command Prompt:
- Ensure you have
curl
for Windows installed -
List the application's URL:
oc get route libertydiag "--output=jsonpath={.spec.host}{'\n'}"
-
Execute the following command, replacing
$HOST
with the output of the previous command:curl -k -s "https://$HOST/servlet/LoadRunner?url=http%3A%2F%2Flocalhost%3A9080%2Fservlet%2FDoComplicatedStuff%3Fmoreiterations%3Dtrue&method=get&entity=&concurrentusers=5&totalrequests=0&user=&password=&infinite=on"
- Ensure you have
-
-
This load will run indefinitely. Continue to the next step.
Using the browser
- Click on the
Load Runner
link from the libertydiag application homepage:
-
In the
Target URL
, copy and paste the following:http://localhost:9080/servlet/DoComplicatedStuff?moreiterations=true
-
Scroll down and check
Infinite (until manually stopped)
- Scroll to the bottom and click
Start
- This load will run indefinitely. Continue to the next step.
Step 3: Execute the MustGather
Now you will execute the MustGather. This takes approximately 6 minutes to run.
Using the command line
- Download a helper script:
- macOS or Linux: containerdiag.sh
- Windows: containerdiag.bat
- Open a
Terminal
orCommand Prompt
and change directory to where you downloaded the script -
On macOS or Linux, make the script executable:
chmod +x containerdiag.sh
-
On macOS, remove the download quarantine:
xattr -d com.apple.quarantine containerdiag.sh
-
List the current deployments:
oc get deployments
Example output:
NAME READY UP-TO-DATE AVAILABLE AGE libertydiag 1/1 1 1 13m
-
Execute the MustGather. Normally, the
-c
option specifying the directory of the javacores is not needed; however, this sample application overrides the default javacore directory using-Xdump
. This is common in container deployments so that a directory may be used that's mounted on a permanent disk so that diagnostics are still available if a pod is killed.-
macOS or Linux:
./containerdiag.sh -d libertydiag libertyperf.sh -c "/opt/ibm/wlp/output/defaultServer/logs/diagnostics/javacore*"
-
Windows:
containerdiag.bat -d libertydiag libertyperf.sh -c "/opt/ibm/wlp/output/defaultServer/logs/diagnostics/javacore*"
-
-
When the MustGather is complete, you will see a repeating message of the form:
run.sh: Files are ready for download. Download with the following command in another window: oc cp [...]
-
Open another
Terminal
orCommand Prompt
and copy & paste theoc cp
line that you saw in the previous step. For example (your command will be different):$ oc cp worker4-debug:/tmp/containerdiag.SN9RbwVmfC.tar.gz containerdiag.SN9RbwVmfC.tar.gz --namespace=openshift-debug-node-g8dqbdfx5d tar: Removing leading `/' from member names
-
Go back to the previous
Terminal
orCommand Prompt
, typeok
, and pressEnter
to complete the MustGather:After the download is complete, type OK and press ENTER: ok [2022-11-08 19:01:03.670923238 UTC] run.sh: Processing finished. Deleting /tmp/containerdiag.SN9RbwVmfC.tar.gz [2022-11-08 19:01:03.674236286 UTC] run.sh: finished.
-
Expand the
containerdiag.*.tar.gz
file that you downloaded.
Using the browser
The MustGather cannot be executed from the browser. You must use the command line steps above.
Step 4: Analyze the thread dumps
If you are familiar with analyzing thread dumps for high CPU, you may skip this step.
Review the MustGather data
- First, review the
top -H
output to see CPU utilization by thread. Expandlinperf_RESULTS.tar.gz
in the root directory and then reviewtopdashH*.out
. -
Open each thread dump under
pods/libertydiag*/containers/libertydiag/
and search forThread Details
:1XMTHDINFO Thread Details
-
Review each thread (starting with
3XMTHREADINFO
) and, in particular, the3XMCPUTIME
line:3XMTHREADINFO "Default Executor-thread-189" J9VMThread:0x00000000002BA700, omrthread_t:0x00007F7FD0014240, java/lang/Thread:0x00000000E43D62C0, state:R, prio=5 [...] 3XMCPUTIME CPU usage total: 6.839021278 secs, current category="Application"
-
The
CPU usage total
is accumulated CPU usage since the beginning of that thread. Therefore, finding high users of CPU is most commonly done by comparing these values across thread dumps for particular threads. Note that if a thread is short-lived, then it may not be captured in the thread dumps, or may only exist in a subset of thread dumps. - With a sufficient number of thread dumps and a persistent pattern of stack tops for high CPU threads, this may be used to hypothesize likely causes of high CPU. In this case, there is a peristent pattern of application stacks in the
DoComplicatedStuff.doWork
method driving code such as mathematical operations, regular expression compilation, etc. For example:3XMTHREADINFO3 Java callstack: 4XESTACKTRACE at java/math/BigInteger.cutOffLeadingZeroes(BigInteger.java:1505(Compiled Code)) 4XESTACKTRACE at java/math/Division.divideAndRemainderByInteger(Division.java:344(Compiled Code)) 4XESTACKTRACE at java/math/BigInteger.divideAndRemainder(BigInteger.java:1202(Compiled Code)) 4XESTACKTRACE at java/math/BigDecimal.slScaledDivide(BigDecimal.java:3489(Compiled Code)) 4XESTACKTRACE at java/math/BigDecimal.divide(BigDecimal.java:2446(Compiled Code)) 4XESTACKTRACE at java/math/BigDecimal.divide(BigDecimal.java:2424(Compiled Code)) 4XESTACKTRACE at java/math/BigDecimal.divide(BigDecimal.java:2389(Compiled Code)) 4XESTACKTRACE at com/example/servlet/DoComplicatedStuff.arctan(DoComplicatedStuff.java:141(Compiled Code)) 4XESTACKTRACE at com/example/servlet/DoComplicatedStuff.computePi(DoComplicatedStuff.java:119(Compiled Code)) 4XESTACKTRACE at com/example/servlet/DoComplicatedStuff.doWork(DoComplicatedStuff.java:60(Compiled Code))
Step 5: Clean-up
The pod will be using a lot of CPU, so you will kill the pod. Kubernetes will automatically create a new replacement pod in preparation for the next lab.
Using the command line
-
List the pods; for example:
oc get pods
Example output:
NAME READY STATUS RESTARTS AGE libertydiag-b98748954-mgj64 1/1 Running 0 97s
-
Kill the running pod by replacing
$POD
with a pod name from the previous command:oc delete pod $POD
For example:
oc delete pod libertydiag-b98748954-mgj64
-
Wait for the replacement pod to come up:
oc wait deployment libertydiag --for condition=available --timeout=5m
Using the browser
- In the
Topology
view of theDeveloper
perspective, click on thelibertydiag
circle, then click theResources
tab in the drawer on the right, and then click on the one pod that's running:
- In the top right, click Actions } Delete Pod
- While the new pod is initializing, there will be a light blue circle around the deployment. Wait until the circle turns into a dark blue, signifying the application is ready. This may take up to 2 minutes or more depending on available cluster resources and namespace limits.
Summary
In summary, this lab demonstrated how to execute the performance/hang/high CPU MustGather on Linux on Containers, download the diagnostics, and review them for high CPU usage. With a sufficient number of thread dumps and a persistent pattern of stack tops for high CPU threads, this may be used to hypothesize likely causes of high CPU.