---
title: "DevOps Lessons: Running Kubernetes in Production"
description: "Hard-won lessons from running Kubernetes clusters for HMD Developments - from initial setup mistakes to production-ready configurations."
date: 2025-03-02T00:00:00.000Z
category: DevOps
readingTime: "4 min read"
---


We ran our first Kubernetes cluster in 2023. Within two weeks, someone accidentally deleted the production namespace. Within a month, the team understood why everyone says Kubernetes is powerful and painful in equal measure.

Here's what we learned running K8s for HMD Developments' production workloads.

## Why Kubernetes

Before Kubernetes, our deployment story was: SSH into a server, pull the latest code, restart the process, hope nothing breaks. It worked for one or two services. By the time we had six services across three projects, manual deployments were consuming entire afternoons.

Kubernetes solved three problems simultaneously:

1. **Declarative deployments** - describe the desired state, let the system figure out how to get there
2. **Self-healing** - if a container crashes, it restarts automatically
3. **Scaling** - handle traffic spikes without manual intervention
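Declarative deployment is easiest to see in a manifest. A minimal sketch (the service name and image are placeholders, not our actual setup):

```yaml
# deployment.yaml - names and image are illustrative
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 2            # desired state: two copies, always
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v1.0.0
```

Apply it with `kubectl apply -f deployment.yaml`. If a pod crashes, the controller notices the actual state no longer matches the declared state and recreates it - that's points 1 and 2 in one mechanism.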

## The Namespace Incident

Two weeks into production Kubernetes, I ran `kubectl delete namespace production` thinking I was in a staging context. I wasn't. Every pod, service, and deployment in the production namespace was gone instantly.

Recovery took four hours. We had manifests in Git (thank god), but the database persistent volumes were in the namespace. The data was technically still on the disk, but Kubernetes had released the volume claims.

**Lesson 1:** Use `kubectl config use-context` religiously. Better yet, display the active cluster and namespace directly in your shell prompt - I use a ZSH plugin for this. That constant visual reminder has kept the mistake from recurring.
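One common way to get that prompt (what the kube-ps1 plugin provides - the install path below assumes a Homebrew install and will differ on your machine):

```shell
# ~/.zshrc - show the active cluster/namespace in the prompt
# (assumes kube-ps1 is installed via Homebrew; adjust the path otherwise)
source "$(brew --prefix)/share/kube-ps1.sh"
PROMPT='$(kube_ps1)'$PROMPT
```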

**Lesson 2:** Keep persistent volume reclaim policies set to `Retain`, not `Delete`. When a volume claim is released, the data persists on disk instead of being wiped.
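For dynamically provisioned volumes the reclaim policy comes from the StorageClass, but you can also set it on a PersistentVolume directly. A sketch (the volume name is illustrative):

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: db-data
spec:
  persistentVolumeReclaimPolicy: Retain   # keep data when the claim is released
  # capacity, accessModes, and the volume source omitted for brevity
```

Or patch an existing volume in place: `kubectl patch pv db-data -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'`.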

## Resource Limits: The Silent Killer

The first production outage that wasn't my fault was caused by a container without resource limits. A memory leak in one service consumed all available memory on the node, triggering the OOM killer, which took down other pods on the same node.

```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
```

**Every container gets resource limits.** No exceptions. `requests` tell the scheduler how much to reserve. `limits` tell the kernel when to throttle (CPU) or kill (memory). Without limits, one misbehaving container can take down an entire node.
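To make "no exceptions" enforceable rather than a team convention, a LimitRange applies defaults to any container that omits them (the namespace name is ours; the values mirror the snippet above):

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-limits
  namespace: production
spec:
  limits:
    - type: Container
      defaultRequest:      # applied as `requests` when a container specifies none
        cpu: 100m
        memory: 128Mi
      default:             # applied as `limits` when a container specifies none
        cpu: 500m
        memory: 512Mi
```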

## Health Checks Save Lives

Kubernetes doesn't know if your application is healthy unless you tell it. Liveness and readiness probes are the mechanism:

```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```

The **liveness probe** answers: "Is this container alive?" If it fails, Kubernetes restarts the container.

The **readiness probe** answers: "Is this container ready to receive traffic?" If it fails, Kubernetes removes the container from the service's load balancer but doesn't restart it.

The distinction matters. A container that's alive but initializing a database connection shouldn't be killed - it should just be kept out of the load balancer until it's ready.

## Helm Charts: Use Them

I started without Helm, writing raw YAML manifests. For one service with one deployment, one service, and one ConfigMap, raw YAML is fine. For six services, each with their own deployments, services, ConfigMaps, secrets, and ingress rules, raw YAML becomes unmanageable.

Helm charts template your Kubernetes manifests:

```yaml
# values.yaml
replicaCount: 3
image:
  repository: registry.example.com/api
  tag: v1.2.3

# templates/deployment.yaml (excerpt)
spec:
  replicas: {{ .Values.replicaCount }}
  template:
    spec:
      containers:
        - image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

One chart, multiple environments. Change the values file, get a different deployment. Staging gets 1 replica; production gets 3. The template is the same.
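In practice that means one values file per environment (the file names are our convention, not a Helm requirement):

```yaml
# values-staging.yaml
replicaCount: 1

# values-production.yaml
replicaCount: 3
```

Deploy with `helm upgrade --install api ./chart -f values-production.yaml` - same chart, different values file.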

## Secrets Management

Kubernetes secrets are base64-encoded, not encrypted. Anyone with cluster access can decode them. For real secret management:

- **Sealed Secrets** - encrypt secrets client-side; only the cluster can decrypt them
- **External Secrets Operator** - sync secrets from AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault
- **Never commit plain secrets to Git** - even in private repos. Secret scanning tools exist, but prevention is better than detection.

We use External Secrets Operator with AWS Secrets Manager. Secrets are managed in AWS, synced to Kubernetes automatically, and rotated on a schedule.
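A sketch of what that sync looks like as an ExternalSecret resource (store name and secret paths are illustrative, and the CRD version may differ by operator release):

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: api-credentials
spec:
  refreshInterval: 1h            # re-sync (and pick up rotations) hourly
  secretStoreRef:
    name: aws-secrets-manager    # a ClusterSecretStore configured for AWS
    kind: ClusterSecretStore
  target:
    name: api-credentials        # the Kubernetes Secret the operator creates
  data:
    - secretKey: DATABASE_URL
      remoteRef:
        key: prod/api/database-url   # path in AWS Secrets Manager
```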

## Monitoring: Prometheus + Grafana

Kubernetes generates a wealth of metrics. The standard stack:

- **Prometheus** for metric collection and alerting
- **Grafana** for dashboards and visualization
- **Alertmanager** for routing alerts to Slack/PagerDuty

The dashboards I check daily:
- Pod restart count (indicator of crashes or OOM kills)
- CPU and memory utilization per namespace
- HTTP error rates (5xx) per service
- Request latency p99
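The restart-count dashboard pairs naturally with an alert. A sketch using the Prometheus Operator's PrometheusRule CRD (the threshold and labels are illustrative; `kube_pod_container_status_restarts_total` is exposed by kube-state-metrics):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: pod-restarts
spec:
  groups:
    - name: pods
      rules:
        - alert: PodRestartingFrequently
          expr: increase(kube_pod_container_status_restarts_total[1h]) > 3
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "{{ $labels.pod }} restarted more than 3 times in the last hour"
```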

## What I'd Tell Past Me

1. **Start with a managed service.** Don't run your own control plane unless you have a dedicated platform team. EKS, GKE, or AKS handles the hard parts.
2. **Learn `kubectl` deeply.** It's your primary debugging tool. `kubectl logs`, `kubectl describe`, `kubectl exec` - master these before anything else.
3. **GitOps from day one.** Every cluster change should be a Git commit. Manual `kubectl apply` in production is a recipe for configuration drift.
4. **Kubernetes is not always the answer.** For simple services, a container on a managed runtime (ECS, Cloud Run, Vercel) is often better. Kubernetes shines when you need the orchestration - multiple services, complex networking, custom scheduling.

Running Kubernetes in production taught me more about infrastructure in one year than the previous three years combined. It's powerful. It's complex. And if you respect that complexity, it's remarkably reliable.

---

*Getting started with Kubernetes? Use a managed service, set resource limits on every container, and display your kubectl context in your terminal prompt. Those three things prevent 80% of common mistakes.*
