Kubernetes has become the go-to platform for running not just long-lived services, but also batch workloads like data processing, ETL pipelines, machine learning training, CI/CD pipelines, and scientific simulations. These workloads typically rely on the Job API, which ensures that a specified number of Pods run to completion.
Until now, Kubernetes offered limited control over what happens when a Job’s Pod fails or is evicted. Pod replacement behavior was often unpredictable: would the replacement Pod be created immediately, while the old Pod was still terminating, or only after the old Pod was completely gone?
With Kubernetes v1.34, the Pod Replacement Policy for Jobs feature, driven by KEP-3939, lets users explicitly control when replacement Pods are created, improving the reliability, performance, and efficiency of batch workloads.
When a Pod belonging to a Job fails (for example, due to a node drain, eviction, an OOM kill, or a hardware issue), Kubernetes creates a replacement Pod. However, users have had no say in when that replacement is created. For workloads that depend on node affinity or cached state, or that must never run two Pods for the same task, this can be a real problem.
Current behavior:
By default, the Job controller replaces Pods as soon as they start terminating, which can lead to multiple Pods running for the same task at the same time, especially in Indexed Jobs. This causes issues for workloads that require exactly one Pod per task, such as certain machine learning frameworks.
Starting replacement Pods before the old Pods have fully terminated also causes other problems, such as extra cluster resources being consumed while old and new Pods briefly run side by side.
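To make the exactly-one-Pod-per-task case concrete, here is a minimal sketch of an Indexed Job; the name, image, and shard count are illustrative and not taken from the demo below:

```yaml
# indexed-training-job.yaml (illustrative example)
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-training-job
spec:
  completionMode: Indexed   # each Pod gets a unique completion index (0..3)
  completions: 4
  parallelism: 4
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: trainer
          image: busybox
          # JOB_COMPLETION_INDEX is injected automatically for Indexed Jobs
          command: ["sh", "-c", "echo training shard $JOB_COMPLETION_INDEX; sleep 60"]
```

With the default replacement behavior, a deleted Pod for, say, index 2 can briefly coexist with its replacement for the same index; the Failed policy described next avoids this overlap.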
Feature: Pod Replacement Policy
With this feature, Kubernetes Jobs have two Pod replacement policies to choose from:
TerminatingOrFailed (default): creates a replacement Pod as soon as the old one starts terminating or fails.
Failed: waits until the old Pod fully terminates and reaches the Failed phase before creating a new one.
Using podReplacementPolicy: Failed ensures that at most one Pod runs for a given task at a time.
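The policy is set on the Job spec as podReplacementPolicy, and Jobs also report how many of their Pods are currently terminating in their status. A quick way to inspect both, assuming a placeholder Job named my-job:

```bash
# which replacement policy the Job is using (replace my-job with your Job's name)
kubectl get job my-job -o jsonpath='{.spec.podReplacementPolicy}'

# how many of the Job's Pods are currently terminating
kubectl get job my-job -o jsonpath='{.status.terminating}'
```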
:::info Quick Demo
We will demo the Pod Replacement Policy for Jobs feature under both policies: the default TerminatingOrFailed and Failed.
:::
Scenario 1: Default behavior with TerminatingOrFailed (demo steps)
```bash
# install minikube and start a local cluster
brew install minikube
minikube start --kubernetes-version=v1.34.0

# verify the cluster is running and reports Kubernetes v1.34.0
kubectl get nodes
```
```yaml
# worker-job.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: worker-job
spec:
  completions: 2
  parallelism: 1
  podReplacementPolicy: TerminatingOrFailed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo Running; sleep 30"]
```
```bash
kubectl apply -f worker-job.yaml

# monitor the Job's Pods
kubectl get pods -l job-name=worker-job
```
```bash
# delete the Pods associated with the worker-job Job
kubectl delete pod -l job-name=worker-job
```

Behavior: because the policy is TerminatingOrFailed, the Job controller creates replacement Pods as soon as the old Pods start terminating, so for a short time the old and new Pods overlap.
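To watch the overlap as it happens, you can run a watch in a second terminal before deleting the Pods; this sketch assumes the worker-job created above:

```bash
# in a second terminal: watch the Job's Pods being replaced.
# With TerminatingOrFailed, the old Pod shows up as Terminating
# alongside its freshly created replacement for a short time.
kubectl get pods -l job-name=worker-job -w
```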
Scenario 2: Delayed replacement with the Failed policy (demo steps)
```yaml
# worker-job-failed.yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: worker-job-failed
spec:
  completions: 2
  parallelism: 1
  podReplacementPolicy: Failed
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: worker
          image: busybox
          command: ["sh", "-c", "echo Running; sleep 1000"]
```
```bash
kubectl apply -f worker-job-failed.yaml

# monitor the Job's Pods
kubectl get pods -l job-name=worker-job-failed
```
```bash
# delete the Pods associated with the worker-job-failed Job
kubectl delete pod -l job-name=worker-job-failed
```
Behavior: the replacement Pod worker-job-failed-q98qx is created only after the old Pod worker-job-failed-sg42q has fully terminated; there is no overlap between the old and new Pods.
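To double-check this from the Job's status (a sketch, assuming the worker-job-failed Job above), look at the terminating count right after deleting the Pod; with the Failed policy it should report the old Pod as terminating while no replacement exists yet. The last two commands clean up the demo resources:

```bash
# the old Pod is counted as terminating, and no replacement has been created yet
kubectl get job worker-job-failed -o jsonpath='{.status.terminating}'
kubectl get pods -l job-name=worker-job-failed

# clean up the demo Jobs and the local cluster
kubectl delete job worker-job worker-job-failed
minikube delete
```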
Better User Experience: For developers, running jobs becomes less error-prone. Teams can focus on business logic instead of constantly monitoring for pod failures.
Pod Replacement Policy gives you control over Pod creation timing to avoid overlaps, optimizes cluster resources by preventing temporary extra Pods, and offers the flexibility to choose the right policy for your Job workloads based on your requirements and resource constraints.