Kubernetes in Production: What Engineers Actually Do to Keep Clusters Healthy
Kubernetes has become the default operating system of the cloud. As of 2026, over 96% of organizations that have adopted containers are running Kubernetes to orchestrate them, according to the CNCF Annual Survey. It powers everything from weekend side projects to the infrastructure behind some of the largest platforms on the internet. But there is a gap that nobody talks about enough: the gap between getting Kubernetes working and running it well in production.
Getting a cluster up with a managed service like Amazon EKS, Google GKE, or Azure AKS takes less than 15 minutes. Deploying your first application takes another 10. But the real work of Kubernetes, the work that separates engineers who have been paged at 2 AM from those who sleep through the night, is everything that comes after. This post is about that work.
Why Most Kubernetes Problems Are Self-Inflicted
The majority of production incidents in Kubernetes environments are not caused by bugs in Kubernetes itself. They are caused by misconfigurations that seemed reasonable at the time. A pod without resource limits gets scheduled onto a node, consumes all available memory, and triggers an OOM kill cascade that takes down half the cluster. A deployment rolls out without a readiness probe, so Kubernetes sends traffic to pods that are not actually ready to serve requests. A namespace has no network policies, so a compromised pod can reach any service in the cluster.
These are not edge cases. They are the most common failure modes in real production environments, and they are entirely preventable. The engineers who avoid them do not have superhuman instincts. They have built habits and guardrails that make misconfigurations hard to commit and easy to catch.
Resource Requests and Limits Are Not Optional
Every container in a production Kubernetes cluster should have CPU and memory requests and limits defined. No exceptions. This is not a preference or a style choice. It is how Kubernetes makes scheduling decisions and how it protects workloads from each other on shared nodes.
Resource requests tell the scheduler how much CPU and memory a pod needs to be placed on a node. Limits tell the kubelet the maximum a container is allowed to consume. Without requests, the scheduler has no information to work with and makes poor placement decisions. Without limits, a single misbehaving container can starve every other workload on the same node.
A practical starting point is to set CPU requests conservatively based on actual observed usage, and set CPU limits slightly higher to allow for bursts. For memory, set requests and limits to the same value, because memory is not compressible the way CPU is. A container that hits its CPU limit is throttled and keeps running; a container that exceeds its memory limit is OOM-killed by the kernel. If you leave headroom between request and limit for memory, you are gambling that the container will not use that headroom when the node is under pressure. That is a bet you will lose eventually.
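As a rough sketch of that guidance, the snippet below builds a container's resources stanza in Python and prints it as JSON, which kubectl accepts alongside YAML. The container name, image, and every number are hypothetical placeholders, not recommendations for any particular workload.

```python
import json


def container_resources(cpu_request: str, cpu_limit: str, memory: str) -> dict:
    """Build a `resources` stanza where the memory request equals the memory limit."""
    return {
        "requests": {"cpu": cpu_request, "memory": memory},
        "limits": {"cpu": cpu_limit, "memory": memory},
    }


# Hypothetical values: derive real ones from observed usage of your own workload.
resources = container_resources(cpu_request="250m", cpu_limit="500m", memory="512Mi")

container = {
    "name": "api",                              # hypothetical container name
    "image": "registry.example.com/api:1.0.0",  # hypothetical image
    "resources": resources,
}

print(json.dumps(container, indent=2))
```

The CPU limit sits above the CPU request to leave room for bursts, while memory uses a single value for both fields.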
LimitRange and ResourceQuota objects at the namespace level are the enforcement mechanism. LimitRange sets defaults and constraints for individual containers. ResourceQuota sets aggregate caps for an entire namespace. With both in place, any pod that does not explicitly define its own resource settings still gets sensible defaults, and no namespace can consume unlimited cluster resources.
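A minimal sketch of both objects for one namespace might look like the following. The namespace name, object names, and all quantities are assumptions chosen for illustration; the field names follow the core v1 LimitRange and ResourceQuota schemas.

```python
import json

NAMESPACE = "payments"  # hypothetical namespace

# Defaults and ceilings applied to each container that omits its own settings.
limit_range = {
    "apiVersion": "v1",
    "kind": "LimitRange",
    "metadata": {"name": "container-defaults", "namespace": NAMESPACE},
    "spec": {
        "limits": [
            {
                "type": "Container",
                "defaultRequest": {"cpu": "100m", "memory": "256Mi"},  # default requests
                "default": {"cpu": "500m", "memory": "256Mi"},         # default limits
                "max": {"cpu": "2", "memory": "2Gi"},                  # per-container ceiling
            }
        ]
    },
}

# Aggregate caps across every pod in the namespace.
resource_quota = {
    "apiVersion": "v1",
    "kind": "ResourceQuota",
    "metadata": {"name": "namespace-caps", "namespace": NAMESPACE},
    "spec": {
        "hard": {
            "requests.cpu": "10",
            "requests.memory": "20Gi",
            "limits.cpu": "20",
            "limits.memory": "20Gi",
            "pods": "50",
        }
    },
}

for manifest in (limit_range, resource_quota):
    print(json.dumps(manifest, indent=2))
```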
Probes: Liveness, Readiness, and Startup
Kubernetes has three types of health probes and all three serve a different purpose. Understanding the distinction is critical to avoiding a whole class of production incidents.
A readiness probe tells Kubernetes whether a pod is ready to receive traffic. If the readiness probe fails, the pod is removed from the endpoints of every Service that selects it, so it stops receiving traffic. It is not restarted. This is exactly what you want during application startup, during database migrations, or when a downstream dependency is temporarily unavailable. Without a readiness probe, Kubernetes routes traffic to pods as soon as their containers start, which often means routing to pods that are still initializing and will return errors.
A liveness probe tells Kubernetes whether a container is still alive. If the liveness probe fails repeatedly, Kubernetes restarts the container. This is the right tool for detecting deadlocks or hung processes that are running but not making progress. However, liveness probes are frequently misconfigured with thresholds that are too aggressive. A liveness probe with a 5-second interval and a failure threshold of 1 restarts the container the first time a single check fails or exceeds its timeout (1 second by default), turning a slow service into a restarting one.
A startup probe was added specifically for applications that need a long time to initialize. It disables liveness and readiness probes until the startup probe succeeds, preventing premature restarts of legitimate slow-starting containers. Any application with a startup time longer than 30 seconds should use a startup probe.
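To make the three probes concrete, here is a hedged sketch of a probe configuration for one container, plus a small helper that estimates how long a startup probe will wait before the container is restarted. The endpoint paths, port, and thresholds are assumptions; the field names are the standard probe fields.

```python
import json


def startup_budget_seconds(probe: dict) -> int:
    """Approximate time a probe tolerates failures: initial delay plus failureThreshold * periodSeconds."""
    return probe.get("initialDelaySeconds", 0) + probe["failureThreshold"] * probe["periodSeconds"]


# Hypothetical endpoints and port; the application must actually serve them.
probes = {
    # Gives a slow-starting app up to about 300 seconds before any restart.
    "startupProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 30,
    },
    # Failing this removes the pod from Service endpoints without restarting it.
    "readinessProbe": {
        "httpGet": {"path": "/ready", "port": 8080},
        "periodSeconds": 5,
        "failureThreshold": 3,
    },
    # Deliberately lenient: three consecutive failures, each with a 3-second timeout.
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "timeoutSeconds": 3,
        "failureThreshold": 3,
    },
}

print(json.dumps(probes, indent=2))
print("startup budget (s):", startup_budget_seconds(probes["startupProbe"]))
```

Merging these keys into a container spec is all that is needed; the helper exists only to make the arithmetic behind the thresholds visible.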
RBAC: The Principle of Least Privilege Is Not Negotiable
Role-Based Access Control in Kubernetes is the mechanism that governs which users and workloads can perform which actions inside your cluster. By default, a ServiceAccount in Kubernetes has minimal permissions. The problem is that many teams assign the cluster-admin ClusterRole to application service accounts because it is the path of least resistance when debugging permission issues. Once that happens in a non-production environment, it often gets copied to production.
The correct approach is to create a dedicated ServiceAccount for each application, write a Role that grants only the exact permissions the application needs, and bind the Role to the ServiceAccount with a RoleBinding. If your application only reads ConfigMaps in one namespace, its ServiceAccount should be able to do exactly that and nothing else. If the application is compromised, the blast radius is contained to the permissions it was granted.
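The ConfigMap example above can be sketched as three small manifests. The application name and namespace are hypothetical; the API groups, kinds, and field names are the standard RBAC ones.

```python
import json

NAMESPACE = "payments"   # hypothetical namespace
APP = "config-reader"    # hypothetical application name

service_account = {
    "apiVersion": "v1",
    "kind": "ServiceAccount",
    "metadata": {"name": APP, "namespace": NAMESPACE},
}

# Read-only access to ConfigMaps in this namespace, and nothing else.
role = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "Role",
    "metadata": {"name": f"{APP}-role", "namespace": NAMESPACE},
    "rules": [
        {"apiGroups": [""], "resources": ["configmaps"], "verbs": ["get", "list", "watch"]}
    ],
}

role_binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {"name": f"{APP}-binding", "namespace": NAMESPACE},
    "subjects": [{"kind": "ServiceAccount", "name": APP, "namespace": NAMESPACE}],
    "roleRef": {
        "apiGroup": "rbac.authorization.k8s.io",
        "kind": "Role",
        "name": role["metadata"]["name"],
    },
}

for manifest in (service_account, role, role_binding):
    print(json.dumps(manifest, indent=2))
```

The pod then opts in by setting serviceAccountName to the ServiceAccount's name in its spec.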
Cluster-scoped objects like ClusterRole and ClusterRoleBinding should be used only for cluster-wide infrastructure components such as monitoring agents, ingress controllers, and node-level daemons. Human access to clusters should be managed through short-lived credentials where possible, and access to production namespaces should require explicit justification. Tools like the kubectl access-matrix plugin, along with the built-in kubectl auth can-i --list command, make it easier to audit what permissions exist in a cluster and identify over-privileged accounts.
Autoscaling: Horizontal, Vertical, and Cluster
Kubernetes has three autoscaling mechanisms and they operate at different levels. Understanding how they interact is essential to building a system that handles variable load without wasting money during quiet periods.
The Horizontal Pod Autoscaler scales the number of pod replicas based on observed metrics. CPU utilization is the most commonly used signal, but in 2026 most mature teams have moved beyond CPU-only scaling. Custom and external metrics, served to the HPA through the custom.metrics.k8s.io and external.metrics.k8s.io APIs by an adapter such as Prometheus Adapter or KEDA, allow it to scale on queue depth, request latency, active connections, or any application-level metric exposed via Prometheus. Scaling based on queue depth is particularly powerful for async workloads where CPU utilization is a lagging indicator of actual load.
The Vertical Pod Autoscaler adjusts the resource requests and limits of individual containers based on observed usage. It solves the problem of setting initial resource values by continuously analyzing actual consumption and recommending or automatically applying changes. VPA's commonly used update modes are Off (recommendations only), Initial (applies recommendations only when a pod is created), and Auto (applies recommendations to running workloads, which typically means evicting pods so they are recreated with the new values). Most teams run VPA in Off or Initial mode and review recommendations manually before promoting them, especially in production.
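The core of the HPA's decision is a simple ratio, documented as desiredReplicas = ceil(currentReplicas * currentMetricValue / desiredMetricValue). The sketch below is a simplified model of that calculation with the default 10% tolerance band; it ignores stabilization windows, scaling policies, and missing-metric handling, and the function name and example numbers are illustrative.

```python
import math


def desired_replicas(current_replicas: int, current_value: float, target_value: float,
                     min_replicas: int = 1, max_replicas: int = 10,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA calculation: scale replicas by the metric-to-target ratio."""
    ratio = current_value / target_value
    # Inside the tolerance band the HPA leaves the replica count alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Clamp to the configured bounds.
    return max(min_replicas, min(max_replicas, desired))


# Hypothetical async worker: target of 100 queued messages per pod.
print(desired_replicas(4, current_value=150, target_value=100))  # scale out
print(desired_replicas(4, current_value=105, target_value=100))  # within tolerance
print(desired_replicas(4, current_value=30, target_value=100))   # scale in
```

With a per-pod queue depth target, the same arithmetic that drives CPU-based scaling reacts directly to backlog instead of waiting for CPU to rise.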
Cluster Autoscaler adds and removes nodes from the underlying node group based on whether pods are pending due to insufficient resources or nodes have been underutilized for a sustained period. It is the mechanism that makes Kubernetes genuinely elastic at the infrastructure level. For it to work correctly, node groups need to be configured with appropriate minimum and maximum sizes, and PodDisruptionBudgets need to be in place to ensure that when a node is drained, the applications running on it maintain their availability guarantees.
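A PodDisruptionBudget for this situation can be sketched as below. The names, labels, and numbers are hypothetical, and the helper is only a simplified model of how many voluntary evictions a minAvailable budget permits at a given moment.

```python
import json


def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Voluntary evictions permitted right now under an integer minAvailable budget."""
    return max(0, healthy_pods - min_available)


pdb = {
    "apiVersion": "policy/v1",
    "kind": "PodDisruptionBudget",
    "metadata": {"name": "api-pdb", "namespace": "payments"},  # hypothetical names
    "spec": {
        "minAvailable": 3,
        "selector": {"matchLabels": {"app": "api"}},  # must match the Deployment's pod labels
    },
}

print(json.dumps(pdb, indent=2))
print("evictions allowed with 4 healthy pods:",
      allowed_disruptions(4, pdb["spec"]["minAvailable"]))
```

A budget that allows zero disruptions blocks node drains entirely, so minAvailable should stay below the replica count.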
Zero-Downtime Deployments and Rollout Strategy
The default deployment strategy in Kubernetes is RollingUpdate, which replaces pods gradually to maintain availability. But the defaults are not always right for production workloads. The default maxSurge is 25% and maxUnavailable is 25%, which means Kubernetes can simultaneously terminate 25% of existing pods and add 25% extra. Percentages are resolved against the replica count, with maxSurge rounded up and maxUnavailable rounded down. For a deployment with 4 replicas, that means potentially going down to 3 available pods during a rollout. For traffic-sensitive services, that may not be acceptable.
Setting maxUnavailable to 0 and maxSurge to 1 ensures that the number of available pods never drops below the desired count during a deployment. New pods are added one at a time, and an old pod is removed only after its replacement passes its readiness probe. This is slower but guarantees no capacity reduction. Combined with a properly configured readiness probe and a preStop lifecycle hook that introduces a short sleep before the pod terminates, this pattern achieves genuine zero-downtime deployments for the vast majority of HTTP workloads.
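The arithmetic is easy to check by hand. This small sketch resolves percentage values the way the Deployment documentation describes, rounding maxSurge up and maxUnavailable down; the function name and the replica counts are illustrative.

```python
def rollout_bounds(replicas: int, max_surge_pct: int = 25, max_unavailable_pct: int = 25) -> dict:
    """Resolve percentage-based rollout settings into absolute pod counts."""
    surge = -((-replicas * max_surge_pct) // 100)        # ceiling division
    unavailable = (replicas * max_unavailable_pct) // 100  # floor division
    return {
        "max_surge": surge,
        "max_unavailable": unavailable,
        "min_available": replicas - unavailable,
        "max_total": replicas + surge,
    }


print(rollout_bounds(4))
print(rollout_bounds(10))
```

For 10 replicas the same defaults allow 2 pods to be unavailable and 3 extra pods to be created, so the impact of the defaults changes with scale.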
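Here is a sketch of the relevant Deployment fields for this pattern. The container name, the 10-second sleep, and the 30-second grace period are assumptions, and the exec-based preStop hook assumes the image ships a sleep binary.

```python
import json

# Only the fields relevant to the rollout pattern; merge into a full Deployment spec.
deployment_patch = {
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {"maxSurge": 1, "maxUnavailable": 0},
        },
        "template": {
            "spec": {
                # Must be longer than the preStop sleep plus the app's own shutdown time.
                "terminationGracePeriodSeconds": 30,
                "containers": [
                    {
                        "name": "api",  # hypothetical container name
                        "lifecycle": {
                            # Keeps serving briefly while endpoint removal propagates.
                            "preStop": {"exec": {"command": ["sleep", "10"]}}
                        },
                    }
                ],
            }
        },
    }
}

print(json.dumps(deployment_patch, indent=2))
```

The sleep covers the window in which proxies and load balancers may still route to a pod that has already been told to terminate.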
For workloads that require more control, blue-green deployments using separate Deployments and a Service selector switch, or canary deployments using tools like Argo Rollouts or Flagger, give engineering teams the ability to shift traffic gradually, observe metrics, and automatically roll back if error rates or latency exceed defined thresholds.
Observability: Logs, Metrics, and Traces Are All Required
A Kubernetes cluster without observability is a black box. You cannot debug what you cannot see, and in a distributed system where a single request may touch dozens of pods across multiple namespaces, visibility is not optional.
The standard observability stack in 2026 for Kubernetes environments is Prometheus for metrics, Grafana for dashboards, and an OpenTelemetry-compatible collector for traces and logs. The kube-state-metrics exporter exposes cluster-level state (replica counts, pod phases, deployment rollout status) as Prometheus metrics. Node Exporter exposes node-level metrics (CPU, memory, disk, network). Your applications should expose their own custom metrics through the Prometheus client library for their language.
Distributed tracing with OpenTelemetry has become the standard for understanding request flows across microservices. Every request that enters your system should carry a trace context header. Each service should create and propagate spans. When an incident occurs, traces allow you to identify exactly which service in a chain introduced latency or returned an error, reducing the mean time to diagnosis from hours to minutes.
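In practice the OpenTelemetry SDK handles propagation for you, but it helps to see what travels on the wire. The sketch below hand-rolls the W3C traceparent header (version, trace ID, parent span ID, flags) purely for illustration; the function names are hypothetical and this is not the OpenTelemetry API.

```python
import re
import secrets

# W3C Trace Context, version 00: 32 hex trace ID, 16 hex parent span ID, 2 hex flags.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the edge of the system."""
    flags = "01" if sampled else "00"
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-{flags}"


def child_traceparent(incoming: str) -> str:
    """Header for an outgoing call: same trace ID, fresh span ID as the new parent."""
    match = TRACEPARENT.match(incoming or "")
    if match is None:
        return new_traceparent()  # missing or malformed context: start a new trace
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
outgoing = child_traceparent(incoming)
print(outgoing)
```

Because every hop keeps the trace ID and replaces only the span ID, a tracing backend can stitch the spans from different services back into one request timeline.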
Alerting should be built on symptoms, not causes. Alert when error rates are high, when latency exceeds SLO thresholds, when a deployment is stalled, or when node memory is running out. Do not alert on individual pod restarts unless they become persistent. The goal is to be paged when a user is experiencing a problem, not when a normal operational event is occurring.
Security Hardening That Engineers Actually Implement
Security in Kubernetes is a layered discipline. No single configuration makes a cluster secure. The following are the controls that experienced engineers treat as baseline requirements in 2026.
Pod Security Standards replace PodSecurityPolicy, which was removed in Kubernetes 1.25, and are enforced through the built-in Pod Security Admission controller. The Restricted policy requires containers to run as non-root, disallows privilege escalation, requires dropping all Linux capabilities, and requires a seccomp profile. It does not require a read-only root filesystem, so set readOnlyRootFilesystem separately where the application allows it. Applying the Restricted standard to production namespaces closes off many common container escape paths.
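Pod Security Admission is switched on per namespace with labels. The sketch below shows a namespace enforcing the Restricted standard and a container securityContext intended to pass it; the namespace name is hypothetical, and readOnlyRootFilesystem is set explicitly here as extra hardening.

```python
import json

namespace = {
    "apiVersion": "v1",
    "kind": "Namespace",
    "metadata": {
        "name": "payments",  # hypothetical namespace
        "labels": {
            "pod-security.kubernetes.io/enforce": "restricted",  # reject violating pods
            "pod-security.kubernetes.io/warn": "restricted",     # warn on kubectl apply
            "pod-security.kubernetes.io/audit": "restricted",    # record in audit logs
        },
    },
}

# Container-level settings aimed at satisfying the Restricted standard.
security_context = {
    "runAsNonRoot": True,
    "allowPrivilegeEscalation": False,
    "capabilities": {"drop": ["ALL"]},
    "seccompProfile": {"type": "RuntimeDefault"},
    "readOnlyRootFilesystem": True,  # extra hardening, set explicitly
}

print(json.dumps(namespace, indent=2))
print(json.dumps(security_context, indent=2))
```

Starting with only the warn and audit labels on an existing namespace is a low-risk way to find violating workloads before turning on enforce.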
Network policies restrict traffic between pods. By default, all pods in a Kubernetes cluster can communicate with all other pods. Adding a default-deny NetworkPolicy to every namespace and then explicitly allowing only required communication paths implements a zero-trust network model inside the cluster. Cilium and Calico are the most widely used CNI plugins that support NetworkPolicy with low operational overhead.
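A default-deny policy is short. This sketch pairs it with one explicit allow rule; the namespace, labels, and port are hypothetical, while the structure follows the networking.k8s.io/v1 NetworkPolicy schema.

```python
import json

NAMESPACE = "payments"  # hypothetical namespace

# An empty podSelector selects every pod; with no rules listed, all traffic is denied.
default_deny = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "default-deny-all", "namespace": NAMESPACE},
    "spec": {"podSelector": {}, "policyTypes": ["Ingress", "Egress"]},
}

# Explicitly allow the frontend to reach the api pods on one port.
allow_frontend_to_api = {
    "apiVersion": "networking.k8s.io/v1",
    "kind": "NetworkPolicy",
    "metadata": {"name": "allow-frontend-to-api", "namespace": NAMESPACE},
    "spec": {
        "podSelector": {"matchLabels": {"app": "api"}},
        "policyTypes": ["Ingress"],
        "ingress": [
            {
                "from": [{"podSelector": {"matchLabels": {"app": "frontend"}}}],
                "ports": [{"protocol": "TCP", "port": 8080}],
            }
        ],
    },
}

for manifest in (default_deny, allow_frontend_to_api):
    print(json.dumps(manifest, indent=2))
```

Denying egress also blocks DNS lookups, so a real rollout usually needs an additional rule allowing traffic to the cluster DNS service.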
Image security is frequently underestimated. Every container image in production should be built from a minimal base image, scanned for known vulnerabilities before deployment, and signed with a tool like Cosign from the Sigstore project. Admission controllers like Kyverno or OPA Gatekeeper can enforce policies that reject images from unverified registries, images without vulnerability scans, or images with critical CVEs that have available patches.
Secrets management deserves particular attention. Kubernetes Secrets are base64-encoded, not encrypted, and are stored unencrypted in etcd unless encryption at rest is explicitly configured. Most production teams integrate an external secrets manager, such as HashiCorp Vault, AWS Secrets Manager, or Google Secret Manager, using the External Secrets Operator to sync secrets into Kubernetes without storing sensitive values in Git or unencrypted in etcd.
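The difference between encoding and encryption is worth seeing once. In this short sketch, with a made-up password, the value stored in a Secret's data field is recovered with a single standard-library call and no key.

```python
import base64

password = "s3cr3t-password"  # made-up value for illustration

# This is what ends up in the `data` field of a Secret manifest.
stored = base64.b64encode(password.encode()).decode()

# Anyone who can read the Secret (or the manifest in Git) can reverse it.
recovered = base64.b64decode(stored).decode()

print("stored:   ", stored)
print("recovered:", recovered)
```

Base64 exists so that arbitrary bytes fit in a text field. It offers no confidentiality, which is why access control, encryption at rest, and an external secrets manager all matter.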
GitOps: The Deployment Model That Scales
GitOps is the practice of using a Git repository as the single source of truth for cluster state. Every change to cluster configuration, every deployment update, every policy modification goes through a pull request. A GitOps operator running inside the cluster, typically Argo CD or Flux, continuously reconciles the live cluster state with what is declared in Git and applies changes automatically.
The operational benefits of GitOps are significant. Every change is auditable because it goes through version control. Rollbacks are as simple as reverting a commit. Drift detection ensures that manual changes to the cluster are detected and either reverted or flagged. Disaster recovery becomes more straightforward because the declared cluster state can be reconstructed from a Git repository, although persistent data still needs its own backups. Teams that have adopted GitOps commonly report fewer production incidents and faster mean time to recovery compared to imperative deployment workflows.
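Drift detection boils down to comparing what Git declares with what the cluster reports. The toy sketch below illustrates the idea on plain dictionaries; it is not how Argo CD or Flux are implemented, and the function name and sample objects are hypothetical.

```python
def find_drift(desired, live, path=""):
    """Return dotted paths where the live object differs from the declared one.

    Only fields present in `desired` are checked, because live objects carry
    extra server-populated fields (status, defaults) that are not drift.
    """
    drift = []
    if isinstance(desired, dict) and isinstance(live, dict):
        for key, value in desired.items():
            child = f"{path}.{key}" if path else key
            if key not in live:
                drift.append(child)
            else:
                drift.extend(find_drift(value, live[key], child))
    elif desired != live:
        drift.append(path)
    return drift


# Declared in Git (hypothetical) versus what the cluster reports after a manual scale.
desired = {"spec": {"replicas": 3, "template": {"image": "api:1.4.2"}}}
live = {
    "spec": {"replicas": 5, "template": {"image": "api:1.4.2"}},
    "status": {"readyReplicas": 5},
}

print(find_drift(desired, live))
```

A reconciler then either reapplies the declared value at each drifted path or raises it for review, depending on how the team has configured it.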
The Discipline Behind Reliable Kubernetes
Kubernetes gives you the tools. The discipline of using them correctly is what separates clusters that are a source of confidence from clusters that are a source of anxiety. Resource limits, health probes, RBAC, autoscaling, observability, security hardening, and GitOps are not advanced topics reserved for platform engineering teams at large companies. They are the baseline of production-grade Kubernetes, and in 2026, there is no good reason to operate below that baseline.
Every practice described in this post is implementable today using open-source tooling, documented CNCF projects, and the built-in capabilities of managed Kubernetes services. The investment is measured in engineering hours, not budget. And the return is clusters that scale predictably, fail gracefully, and let your team ship with confidence instead of fear.