Symptoms
When installed on a non-OpenShift Kubernetes environment, the Flink (FlinkDeployment) pod goes into the CrashLoopBackOff state when trying to access the JobResultStore.
The logs in the Flink instance (FlinkDeployment) pod display the following error:
ERROR org.apache.flink.runtime.entrypoint.ClusterEntrypoint [] - Fatal error occurred in the cluster entrypoint.
java.util.concurrent.CompletionException: java.lang.IllegalStateException: The base directory of the JobResultStore isn't accessible. No dirty JobResults can be restored.
    at java.util.concurrent.CompletableFuture.encodeThrowable(Unknown Source) ~[?:?]
    at java.util.concurrent.CompletableFuture.completeThrowable(Unknown Source) [?:?]
    at java.util.concurrent.CompletableFuture$AsyncSupply.run(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: java.lang.IllegalStateException: The base directory of the JobResultStore isn't accessible. No dirty JobResults can be restored.
    at org.apache.flink.util.Preconditions.checkState(Preconditions.java:193) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.highavailability.FileSystemJobResultStore.getDirtyResultsInternal(FileSystemJobResultStore.java:199) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.withReadLock(AbstractThreadsafeJobResultStore.java:118) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.highavailability.AbstractThreadsafeJobResultStore.getDirtyResults(AbstractThreadsafeJobResultStore.java:100) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResults(SessionDispatcherLeaderProcess.java:194) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.AbstractDispatcherLeaderProcess.supplyUnsynchronizedIfRunning(AbstractDispatcherLeaderProcess.java:198) ~[flink-dist-1.17.1.jar:1.17.1]
    at org.apache.flink.runtime.dispatcher.runner.SessionDispatcherLeaderProcess.getDirtyJobResultsIfRunning(SessionDispatcherLeaderProcess.java:188) ~[flink-dist-1.17.1.jar:1.17.1]
    ... 4 more
Causes
The Flink instance (FlinkDeployment) is trying to access the file system within the container, but it does not have permission to do so.
Resolving the problem
To resolve the problem, set spec.securityContext.fsGroup: 1001 in the podTemplate section of the FlinkDeployment custom resource.
For example:
# excerpt from a FlinkDeployment custom resource
spec:
  podTemplate:
    spec:
      securityContext:
        fsGroup: 1001
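
For orientation, the following is a minimal sketch of where the excerpt sits in a complete FlinkDeployment custom resource. The apiVersion shown is the one used by the Apache Flink Kubernetes Operator and is assumed to apply to your installation. The metadata.name value is hypothetical, and all other fields that your deployment already defines (image, resources, Flink configuration, and so on) are omitted and must be left unchanged.

```yaml
# Sketch only: merge the podTemplate section into your existing resource.
apiVersion: flink.apache.org/v1beta1   # assumption: Apache Flink Kubernetes Operator API group
kind: FlinkDeployment
metadata:
  name: my-flink                       # hypothetical name, use your own
spec:
  # ... existing fields of your deployment stay as they are ...
  podTemplate:
    spec:
      securityContext:
        # Kubernetes adds this group ID to the container processes and applies it
        # to mounted volumes that support ownership management.
        fsGroup: 1001
```

With fsGroup set, Kubernetes adds the group ID as a supplementary group to the processes in the pod and, for volume types that support it, makes mounted volumes group-owned by that ID, which is what is expected to make the JobResultStore directory accessible. The pod must be re-created for the change to take effect.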