We ran our first Kubernetes cluster in 2023. Within a week, someone accidentally deleted the production namespace. Within a month, the team understood why everyone says Kubernetes is powerful and painful in equal measure.
Here's what we learned running K8s for HMD Developments' production workloads.
## Why Kubernetes
Before Kubernetes, our deployment story was: SSH into a server, pull the latest code, restart the process, hope nothing breaks. It worked for one or two services. By the time we had six services across three projects, manual deployments were consuming entire afternoons.
Kubernetes solved three problems simultaneously:
- Declarative deployments - describe the desired state, let the system figure out how to get there
- Self-healing - if a container crashes, it restarts automatically
- Scaling - handle traffic spikes without manual intervention
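The declarative model is easiest to see in a manifest: you state how many replicas you want, and the controller reconciles reality toward that state. A minimal sketch (the image name and labels are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3              # desired state: three pods, always
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v1.2.3
```

Apply it, kill a pod by hand, and the controller recreates it - that's the self-healing bullet in practice.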
## The Namespace Incident
Two weeks into production Kubernetes, I ran `kubectl delete namespace production` thinking I was in a staging context. I wasn't. Every pod, service, and deployment in the production namespace was gone instantly.
Recovery took four hours. We had manifests in Git (thank god), but the database's PersistentVolumeClaims lived in the namespace. The data was technically still on disk, but Kubernetes had released the volume claims.
Lesson 1: Use `kubectl config use-context` religiously. Better yet, set your shell prompt to display the current context. I now use a ZSH plugin that shows the active cluster and namespace in my terminal prompt. That visual reminder has kept the mistake from happening again.
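If you don't want a full plugin, even a one-liner gets you the visual reminder. A minimal sketch for `~/.zshrc`, assuming `kubectl` is on your PATH (the kube-ps1 plugin does this properly, with namespace and coloring):

```shell
# ~/.zshrc - show the active kubectl context in the prompt
# (a bare-bones stand-in for the kube-ps1 plugin)
setopt PROMPT_SUBST
PROMPT='[$(kubectl config current-context 2>/dev/null)] %~ %# '
```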
Lesson 2: Keep persistent volume reclaim policies set to `Retain`, not `Delete`. When a volume claim is released, the data persists on disk instead of being wiped.
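The reclaim policy lives on the PersistentVolume (or is defaulted by its StorageClass), and you can flip it on an existing volume in place. The PV name here is a placeholder:

```shell
# Change an existing PV from Delete to Retain (PV name is hypothetical)
kubectl patch pv pvc-0a1b2c3d \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```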
## Resource Limits: The Silent Killer
The first production outage that wasn't my fault was caused by a container without resource limits. A memory leak in one service consumed all available memory on the node, triggering the OOM killer, which took down other pods on the same node.
```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
```

Every container gets resource limits. No exceptions. `requests` tells the scheduler how much to reserve. `limits` tells the kernel when to throttle (CPU) or kill (memory). Without limits, one misbehaving container can take down an entire node.
## Health Checks Save Lives
Kubernetes doesn't know if your application is healthy unless you tell it. Liveness and readiness probes are the mechanism:
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```

The liveness probe answers: "Is this container alive?" If it fails, Kubernetes restarts the container.
The readiness probe answers: "Is this container ready to receive traffic?" If it fails, Kubernetes removes the container from the service's load balancer but doesn't restart it.
The distinction matters. A container that's alive but initializing a database connection shouldn't be killed - it should just be kept out of the load balancer until it's ready.
## Helm Charts: Use Them
I started without Helm, writing raw YAML manifests. For one service with one deployment, one service, and one ConfigMap, raw YAML is fine. For six services, each with their own deployments, services, ConfigMaps, secrets, and ingress rules, raw YAML becomes unmanageable.
Helm charts template your Kubernetes manifests:
```yaml
# values.yaml
replicaCount: 3
image:
  repository: registry.example.com/api
  tag: v1.2.3
```

```yaml
# templates/deployment.yaml
spec:
  replicas: {{ .Values.replicaCount }}
  containers:
    - image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

One chart, multiple environments. Change the values file, get a different deployment. Staging gets 1 replica; production gets 3. The template is the same.
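The per-environment workflow is then one values file per environment. The file names and release name here are a common convention, not something Helm mandates:

```shell
# Same chart, different values per environment
helm upgrade --install api ./charts/api -f values-staging.yaml -n staging
helm upgrade --install api ./charts/api -f values-production.yaml -n production
```

`upgrade --install` installs the release if it doesn't exist and upgrades it if it does, so the same command works for first deploys and updates.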
## Secrets Management
Kubernetes secrets are base64-encoded, not encrypted. Anyone with cluster access can decode them. For real secret management:
- Sealed Secrets - encrypt secrets client-side; only the cluster can decrypt them
- External Secrets Operator - sync secrets from AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault
- Never commit plain secrets to Git - even in private repos. Secret scanning tools exist, but prevention is better than detection.
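It's worth seeing just how thin the base64 layer is - this is the entirety of the "protection" a plain Kubernetes secret gives you (the password is a dummy value):

```shell
# base64 is an encoding, not encryption - anyone can reverse it
printf %s 'hunter2' | base64         # aHVudGVyMg==
printf %s 'aHVudGVyMg==' | base64 -d # hunter2
```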
We use External Secrets Operator with AWS Secrets Manager. Secrets are managed in AWS, synced to Kubernetes automatically, and rotated on a schedule.
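The sync is driven by an ExternalSecret resource, roughly like this sketch - the names, the store reference, and the Secrets Manager path are all illustrative, and the apiVersion tracks the operator release you run:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h            # re-sync (and pick up rotations) hourly
  secretStoreRef:
    name: aws-secrets-manager    # a SecretStore configured with AWS credentials
    kind: ClusterSecretStore
  target:
    name: db-credentials         # the Kubernetes Secret the operator creates
  data:
    - secretKey: password
      remoteRef:
        key: prod/db             # path in AWS Secrets Manager
        property: password
```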
## Monitoring: Prometheus + Grafana
Kubernetes generates a wealth of metrics. The standard stack:
- Prometheus for metric collection and alerting
- Grafana for dashboards and visualization
- Alertmanager for routing alerts to Slack/PagerDuty
The dashboards I check daily:
- Pod restart count (indicator of crashes or OOM kills)
- CPU and memory utilization per namespace
- HTTP error rates (5xx) per service
- Request latency p99
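In PromQL terms, those dashboards boil down to queries roughly like these - the `kube_*` metrics come from kube-state-metrics and cAdvisor, while the `http_*` names depend on what your services actually export:

```promql
# Pod restarts over the last hour (crashes / OOM kills)
sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace, pod)

# Memory utilization per namespace
sum(container_memory_working_set_bytes) by (namespace)

# 5xx error rate per service (assumes an http_requests_total counter)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

# p99 request latency (assumes an http_request_duration_seconds histogram)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```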
## What I'd Tell Past Me
- Start with a managed service. Don't run your own control plane unless you have a dedicated platform team. EKS, GKE, or AKS handles the hard parts.
- Learn `kubectl` deeply. It's your primary debugging tool. `kubectl logs`, `kubectl describe`, `kubectl exec` - master these before anything else.
- GitOps from day one. Every cluster change should be a Git commit. Manual `kubectl apply` in production is a recipe for configuration drift.
- Kubernetes is not always the answer. For simple services, a container on a managed runtime (ECS, Cloud Run, Vercel) is often better. Kubernetes shines when you need the orchestration - multiple services, complex networking, custom scheduling.
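For reference, the three commands from the `kubectl` bullet (the pod name is a placeholder):

```shell
kubectl logs api-7d4f8b9c6-x2k4p --previous   # logs from the last crashed run
kubectl describe pod api-7d4f8b9c6-x2k4p      # events: probe failures, OOM kills
kubectl exec -it api-7d4f8b9c6-x2k4p -- sh    # a shell inside the container
```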
Running Kubernetes in production taught me more about infrastructure in one year than the previous three years combined. It's powerful. It's complex. And if you respect that complexity, it's remarkably reliable.
Getting started with Kubernetes? Use a managed service, set resource limits on every container, and display your kubectl context in your terminal prompt. Those three things prevent 80% of common mistakes.