Planning your installation

Consider the following when planning your installation of Event Processing and Flink.

Decide the purpose of your deployment, for example, whether you want a starter deployment for testing purposes, or a production deployment.

Use the Event Processing sample deployments and the Flink sample deployments as a starting point to base your deployment on.
For production deployments, and whenever you want your data to be saved in the event of a restart, set up persistent storage.
Consider the options for securing your deployment.
Customize the configuration to suit your needs.

Event Processing sample deployments

A number of sample configuration files are available in GitHub, where you can select the GitHub tag for your Event Processing version, and then go to /cr-examples/eventprocessing/openshift or /cr-examples/eventprocessing/kubernetes to access the samples. These range from smaller deployments for non-production development or general experimentation to deployments that can handle a production workload.

The following Event Processing sample configurations are available to deploy:

Quick Start: A development instance with reduced resources, using ephemeral storage and Local authentication.
Production: A production instance with placeholders for persistence and OpenID Connect (OIDC) authentication.

By default, both samples require the following resources:

CPU request (cores)	CPU limit (cores)	Memory request (GiB)	Memory limit (GiB)
1.0	2.0	1.0	2.0

If you are installing on the OpenShift Container Platform, you can view and apply the sample configurations in the web console.

If you are installing on other Kubernetes platforms, the following samples are available in the Helm chart package:

Quick start
Production

The sample configurations for both the OpenShift Container Platform and other Kubernetes platforms are also available in GitHub where you can select the GitHub tag for your Event Processing version, and then go to /cr-examples/eventprocessing/openshift or /cr-examples/eventprocessing/kubernetes to access the samples.

Important: For a production setup, the sample configuration values are for guidance only, and you might need to change them.

Flink sample deployments

A number of sample configuration files are available in GitHub, where you can select the GitHub tag for your IBM Operator for Apache Flink version, and then go to /cr-examples/flinkdeployment/openshift or /cr-examples/flinkdeployment/kubernetes to access the samples. These range from smaller deployments for non-production development or general experimentation to deployments that can handle a production workload.

The following table provides an overview of the Flink sample configurations and their resource requirements:

Sample	CPU limit per Task Manager	Memory per Task Manager	Slots per Task Manager	Parallelism	Max. number of flows per Task Manager	Job Manager High Availability	Chargeable cores (see licensing)
Quick Start	0.5	2GB	4	1	2	No	1
Minimal Production	1	2GB	10	1	5	Yes (but with 1 replica)	2
Production	2	4GB	10	1	5	Yes	3
Production - Flink Application Cluster	2	4GB	2	Typically > 1 but set to 1 in the sample	1	Yes	3

Important:

Quick Start, Minimal Production, and Production are session cluster samples. They are suitable when deploying Flink for use with the Event Processing flow authoring UI, and for deploying your flows in a Flink cluster for development environments.
The Production - Flink Application Cluster sample is suitable for deploying your flows in an application mode Flink cluster.

It is not suitable when deploying Flink for use with the Event Processing UI. For the deployment procedure, see deploying jobs customized for production or test environments.
To secure your communication with Flink deployments, all samples except Quick Start require specifying a secret containing a JKS keystore and truststore and the password for that keystore and truststore.

The sample configurations for both the OpenShift Container Platform and other Kubernetes platforms are available in GitHub where you can select the GitHub tag for your IBM Operator for Apache Flink version, and then go to /cr-examples/flinkdeployment/openshift or /cr-examples/flinkdeployment/kubernetes to access the samples.

Points to consider for resource requirements:

Both the Minimal Production and Production session cluster samples are preconfigured with 10 Task Manager slots, allowing for a maximum of 5 Event Processing-authored flows to run at the same time on one Task Manager (one Java Virtual Machine).

Note: Each flow generates 2 Flink jobs, and each job is allocated its own Task Manager slot. If you have 6 or more flows, a new Task Manager will be automatically provisioned. With all session cluster configurations, horizontal scaling can happen, constrained only by resource availability. A Task Manager is terminated when all its flows are stopped in the Event Processing flow authoring UI, preventing unnecessary allocation of resources.
All jobs in the same Task Manager compete for the CPU capacity of the Task Manager as defined in the sample. The amount of memory defined for a Task Manager is equally pre-allocated to each Task Manager slot.

For example, if only 4 flows are running, 1/5 (20%) of the memory is not available to the running jobs. This is because the job parallelism is set to 1, and cannot be modified in the flow authoring UI.

Note: spec.job.parallelism is a Flink configuration parameter that enables a single job to be allocated up to as many Task Manager slots as the value of the parallelism parameter. For resource-intensive large jobs, the parallelism typically needs to be greater than 1. For more information about running large jobs, see the Flink production application cluster sample.
To determine the right sample to use that meets your requirements, establish if your typical Event Processing flows can be handled by the Minimal Production or the Production sample. For this purpose, find out if the CPU requirements of a typical job, along with the CPU requirements of all the other competing jobs that may be running on the same Task Manager, do not exceed the allocated amount of CPU. In addition, the memory required by a job should not exceed 80% of the limit defined in the sample.
What happens if the resource limits are exceeded?
- If the CPU limit is exceeded, Kubernetes CPU throttling is activated. When sustained, this can have a significant negative impact on performance. If the memory limit is exceeded, the Kubernetes out-of-memory condition can terminate the Task Manager container.
- A good practice is to avoid resource usage consistently exceeding 70% of CPU or 80% of memory capacity. The spare capacity is often enough to take care of short-lived CPU peaks, Flink checkpoint processing, and minor fluctuations in memory usage. If either of the thresholds is being exceeded after choosing the Minimal Production sample despite performance optimizations on a flow, consider the larger Production sample.

High-level criteria for choosing between Minimal Production and Production samples:

Minimal Production sample (per Task Manager: 1 CPU core, 2GB memory)
- Jobs have smaller CPU and memory requirements.
- Message sizes are typically less than 1KB.
- Flows consist predominantly of filters and transformations.
- Workloads are stable and predictable.
- If the Flink Job Manager fails, some downtime is expected.
Production sample (per Task Manager: 2 CPU cores, 4GB memory)
- Jobs may have higher CPU and memory requirements.
- Message sizes are typically greater than 1KB.
- Flows consist predominantly of interval joins and aggregations.
- Workloads may contain spikes, but not sustained very high event rates.
- You have stricter high availability requirements.

To create a configuration optimized for jobs that have high throughput, low latency requirements, or both, estimate your resource requirements, including network capacity.

Flink Quick Start sample

The Quick Start sample is a Flink session cluster suitable only for very small workloads that have no persistence or reliability requirements. It is capable of running in a single Flink Task Manager a maximum of 2 parallel flows submitted from the Event Processing flow authoring tool. If you need to run more than 2 flows, a new Task Manager will be automatically created.

This sample does not configure Flink with High Availability for the Flink Job Manager, thus Flink jobs are not automatically restarted if the Flink cluster restarts.

Flink Minimal Production sample

The Minimal Production sample is a Flink session cluster suitable for small production workloads.

This sample configures Flink with minimal High Availability for the Flink Job Manager. This means that Flink jobs are restarted automatically if the Flink cluster restarts. However, some downtime is expected as there is only a single Job Manager replica.

Flink Production sample

The Production sample is a Flink session cluster suitable for large production workloads. This sample configures Flink with High Availability for the Flink Job Manager, thus Flink jobs are automatically restarted if the Flink cluster restarts.

Flink Production Application Cluster sample

The Production – Flink Application Cluster sample is suitable for running a single large Flink job in application cluster mode. You can configure the job parallelism (spec.job.parallelism) and the number of slots (spec.flinkConfiguration["taskmanager.numberOfTaskSlots"]) to suit the characteristics of your flow and your workload. As this sample configures 2 slots, assuming your job is consuming events from a Kafka topic with 10 partitions, to run a large job you can select a parallelism of 10, which requires 5 Task Managers.

This sample configures Flink with High Availability for the Flink Job Manager. Being a Flink application cluster, the Flink jobs are automatically restarted if the Flink cluster restarts.

This sample supports deploying jobs customized for production or test environments.

Deploying the Flink PVC

All Flink samples except the Quick Start sample configure Flink to use persistent storage. Before installing a Flink instance (FlinkDeployment custom resource), the following PersistentVolumeClaim must be deployed as follows.

Log in to your Kubernetes cluster as a cluster administrator by setting your kubectl context.
If the namespace where the Flink instance will be deployed does not exist yet, create it:
```
kubectl create namespace <your-namespace>
```
Set the following environment variable to hold the namespace of the Flink instance:
```
export FLINK_NAMESPACE=<your-namespace>
```
Set the following environment variables to hold the name of the storage class and the storage capacity:
```
export STORAGE_CLASS=<your-storage-class>
export STORAGE_CAPACITY=<your-storage-capacity>
```
For example:
```
export STORAGE_CLASS=rook-cephfs
export STORAGE_CAPACITY=20Gi
```
Important: The storage class must comply with the storage requirements for Flink.

Important: The storage capacity needs to be large enough for Flink to store data, which includes the Flink checkpoints and savepoints, and the stateful event processing data.

Run the following command to deploy the PersistentVolumeClame:

kubectl apply -n ${FLINK_NAMESPACE} -f - << EOF
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: ibm-flink-pvc
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: ${STORAGE_CAPACITY}
  storageClassName: ${STORAGE_CLASS}
EOF

Note: The name of the PVC, ibm-flink-pvc, must match the name of the PVC configured in the FlinkDeployment custom resource. The samples which use persistent storage configure this PVC name.

Planning for persistent storage

If you plan to have persistent volumes, consider the disk space required for storage.

You either need to create a persistent volume, persistent volume claim, or specify a storage class that supports dynamic provisioning. Each component can use a different storage class to control how physical volumes are allocated.

For information about creating persistent volumes and creating a storage class that supports dynamic provisioning:

For OpenShift Container Platform, see the OpenShift Container Platform documentation.
For other Kubernetes platforms, see the Kubernetes documentation.

You must have the Cluster Administrator role for creating persistent volumes or a storage class.

If these persistent volumes are to be created manually, this must be done by the system administrator before installing Event Processing. These will then be claimed from a central pool when the Event Processing instance is deployed. The installation will then claim a volume from this pool.
If these persistent volumes are to be created automatically, ensure a dynamic provisioner is configured for the storage class you want to use. See data storage requirements for information about storage systems supported by Event Processing.

Important:

For Flink, when creating persistent volumes, ensure the Access mode is set to ReadWriteMany. For more information, see storage requirements for Flink.
For Event Processing, when creating persistent volumes, ensure the Access mode is set to ReadWriteOnce. To use persistent storage, configure the storage properties in your EventProcessing custom resource.

Planning for security

There are two areas of security to consider when installing Event Processing:

The type of authentication the Event Processing UI uses. Event Processing UI supports locally defined authentication for testing purposes and OpenID Connect (OIDC) authentication for production purposes.
- If you are configuring with locally defined authentication, the Event Processing UI uses a secret that has a list of username and passwords.
- If you are configuring with OIDC authentication, you must provide the required information to connect to your OIDC provider.
Certificates for the encryption of data in flight, that you must provide to the Event Processing deployment when you create an instance.
- Provide a secret containing a CA certificate, we will use this to generate other certificates.
- Provide a secret that contains a CA certificate, server certificate, and key that has the required DNS names for accessing the deployment.
- The operator creates a CA certificate, which is used to generate all the other certificates.

To configure authentication, see managing access.

To configure the certificates, see configuring TLS.

Securing communication with Flink deployments

When you install IBM Operator for Apache Flink in a production environment, enable TLS in your FlinkDeployment instance, so that all communication between Flink pods, such as Flink Job Manager (JM) and Task Manager (TM) pods, use mutual TLS, and the REST endpoint is encrypted. To secure the communication between Event Processing and Flink pods:

Create a secret that contains a JKS keystore, and a truststore that contains the correct CA certificate.
Create a secret that contains the password for those keystore and truststore.
Provide access for IBM Operator for Apache Flink to a truststore that contains the CA certificate, so that the operator can communicate with the FlinkDeployment instance.

For more information, see configuring TLS for Flink.

Licensing

Licensing is typically based on Virtual Processing Cores (VPC).

For more information about available licenses, chargeable components, and tracking license usage, see the licensing reference.