We ran our first Kubernetes cluster in 2023. Within a week, someone accidentally deleted the production namespace. Within a month, the team understood why everyone says Kubernetes is powerful and painful in equal measure.
Here's what we learned running K8s for HMD Developments' production workloads.
## Why Kubernetes
Before Kubernetes, our deployment story was: SSH into a server, pull the latest code, restart the process, hope nothing breaks. It worked for one or two services. By the time we had six services across three projects, manual deployments were consuming entire afternoons.
Kubernetes solved three problems simultaneously:
- Declarative deployments - describe the desired state, let the system figure out how to get there
- Self-healing - if a container crashes, it restarts automatically
- Scaling - handle traffic spikes without manual intervention
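The declarative model is easiest to see in a manifest: you state how many replicas you want, and the controller reconciles reality toward that state. A minimal sketch (the image name and labels are illustrative):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3              # desired state: three pods, always
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: registry.example.com/api:v1.2.3
```

Apply it, kill a pod by hand, and the controller recreates it - that's the self-healing bullet in practice.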
## The Namespace Incident
Two weeks into production Kubernetes, I ran `kubectl delete namespace production` thinking I was in a staging context. I wasn't. Every pod, service, and deployment in the production namespace was gone instantly.
Recovery took four hours. We had manifests in Git (thank god), but the database's PersistentVolumeClaims lived in the namespace. The data was technically still on disk, but Kubernetes had released the volume claims.
Lesson 1: Use `kubectl config use-context` religiously. Better yet, set your shell prompt to display the current context. I now use a ZSH plugin that shows the active cluster and namespace in my terminal prompt. That visual reminder has kept the mistake from happening again.
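If you don't want a full plugin, even a one-liner gets you the visual reminder. A minimal sketch for `~/.zshrc`, assuming `kubectl` is on your PATH (the kube-ps1 plugin does this properly, with namespace and coloring):

```shell
# ~/.zshrc - show the active kubectl context in the prompt
# (a bare-bones stand-in for the kube-ps1 plugin)
setopt PROMPT_SUBST
PROMPT='[$(kubectl config current-context 2>/dev/null)] %~ %# '
```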
Lesson 2: Keep persistent volume reclaim policies set to `Retain`, not `Delete`. When a volume claim is released, the data persists on disk instead of being wiped.
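The reclaim policy lives on the PersistentVolume (or is defaulted by its StorageClass), and you can flip it on an existing volume in place. The PV name here is a placeholder:

```shell
# Change an existing PV from Delete to Retain (PV name is hypothetical)
kubectl patch pv pvc-0a1b2c3d \
  -p '{"spec":{"persistentVolumeReclaimPolicy":"Retain"}}'
```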
## Resource Limits: The Silent Killer
The first production outage that wasn't my fault was caused by a container without resource limits. A memory leak in one service consumed all available memory on the node, triggering the OOM killer, which took down other pods on the same node.
```yaml
resources:
  requests:
    cpu: 100m
    memory: 128Mi
  limits:
    cpu: 500m
    memory: 512Mi
```

Every container gets resource limits. No exceptions. `requests` tells the scheduler how much to reserve. `limits` tells the kernel when to throttle (CPU) or kill (memory). Without limits, one misbehaving container can take down an entire node.
## Health Checks Save Lives
Kubernetes doesn't know if your application is healthy unless you tell it. Liveness and readiness probes are the mechanism:
```yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
```

The liveness probe answers: "Is this container alive?" If it fails, Kubernetes restarts the container.
The readiness probe answers: "Is this container ready to receive traffic?" If it fails, Kubernetes removes the container from the service's load balancer but doesn't restart it.
The distinction matters. A container that's alive but initializing a database connection shouldn't be killed - it should just be kept out of the load balancer until it's ready.
## Helm Charts: Use Them
I started without Helm, writing raw YAML manifests. For one service with one deployment, one service, and one ConfigMap, raw YAML is fine. For six services, each with their own deployments, services, ConfigMaps, secrets, and ingress rules, raw YAML becomes unmanageable.
Helm charts template your Kubernetes manifests:
```yaml
# values.yaml
replicaCount: 3
image:
  repository: registry.example.com/api
  tag: v1.2.3
```

```yaml
# templates/deployment.yaml
spec:
  replicas: {{ .Values.replicaCount }}
  containers:
    - image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```

One chart, multiple environments. Change the values file, get a different deployment. Staging gets 1 replica; production gets 3. The template is the same.
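The per-environment workflow is then one values file per environment. The file names and release name here are a common convention, not something Helm mandates:

```shell
# Same chart, different values per environment
helm upgrade --install api ./charts/api -f values-staging.yaml -n staging
helm upgrade --install api ./charts/api -f values-production.yaml -n production
```

`upgrade --install` installs the release if it doesn't exist and upgrades it if it does, so the same command works for first deploys and updates.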
## Secrets Management
Kubernetes secrets are base64-encoded, not encrypted. Anyone with cluster access can decode them. For real secret management:
- Sealed Secrets - encrypt secrets client-side; only the cluster can decrypt them
- External Secrets Operator - sync secrets from AWS Secrets Manager, GCP Secret Manager, or HashiCorp Vault
- Never commit plain secrets to Git - even in private repos. Secret scanning tools exist, but prevention is better than detection.
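It's worth seeing just how thin the base64 layer is - this is the entirety of the "protection" a plain Kubernetes secret gives you (the password is a dummy value):

```shell
# base64 is an encoding, not encryption - anyone can reverse it
printf %s 'hunter2' | base64         # aHVudGVyMg==
printf %s 'aHVudGVyMg==' | base64 -d # hunter2
```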
We use External Secrets Operator with AWS Secrets Manager. Secrets are managed in AWS, synced to Kubernetes automatically, and rotated on a schedule.
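The sync is driven by an ExternalSecret resource, roughly like this sketch - the names, the store reference, and the Secrets Manager path are all illustrative, and the apiVersion tracks the operator release you run:

```yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: db-credentials
spec:
  refreshInterval: 1h            # re-sync (and pick up rotations) hourly
  secretStoreRef:
    name: aws-secrets-manager    # a SecretStore configured with AWS credentials
    kind: ClusterSecretStore
  target:
    name: db-credentials         # the Kubernetes Secret the operator creates
  data:
    - secretKey: password
      remoteRef:
        key: prod/db             # path in AWS Secrets Manager
        property: password
```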
## Monitoring: Prometheus + Grafana
Kubernetes generates a wealth of metrics. The standard stack:
- Prometheus for metric collection and alerting
- Grafana for dashboards and visualization
- Alertmanager for routing alerts to Slack/PagerDuty
The dashboards I check daily:
- Pod restart count (indicator of crashes or OOM kills)
- CPU and memory utilization per namespace
- HTTP error rates (5xx) per service
- Request latency p99
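In PromQL terms, those dashboards boil down to queries roughly like these - the `kube_*` metrics come from kube-state-metrics and cAdvisor, while the `http_*` names depend on what your services actually export:

```promql
# Pod restarts over the last hour (crashes / OOM kills)
sum(increase(kube_pod_container_status_restarts_total[1h])) by (namespace, pod)

# Memory utilization per namespace
sum(container_memory_working_set_bytes) by (namespace)

# 5xx error rate per service (assumes an http_requests_total counter)
sum(rate(http_requests_total{status=~"5.."}[5m])) by (service)

# p99 request latency (assumes an http_request_duration_seconds histogram)
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket[5m])) by (le, service))
```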
## What I'd Tell Past Me
- Start with a managed service. Don't run your own control plane unless you have a dedicated platform team. EKS, GKE, or AKS handles the hard parts.
- Learn `kubectl` deeply. It's your primary debugging tool. `kubectl logs`, `kubectl describe`, `kubectl exec` - master these before anything else.
- GitOps from day one. Every cluster change should be a Git commit. Manual `kubectl apply` in production is a recipe for configuration drift.
- Kubernetes is not always the answer. For simple services, a container on a managed runtime (ECS, Cloud Run, Vercel) is often better. Kubernetes shines when you need the orchestration - multiple services, complex networking, custom scheduling.
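For reference, the three commands from the `kubectl` bullet (the pod name is a placeholder):

```shell
kubectl logs api-7d4f8b9c6-x2k4p --previous   # logs from the last crashed run
kubectl describe pod api-7d4f8b9c6-x2k4p      # events: probe failures, OOM kills
kubectl exec -it api-7d4f8b9c6-x2k4p -- sh    # a shell inside the container
```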
Running Kubernetes in production taught me more about infrastructure in one year than the previous three years combined. It's powerful. It's complex. And if you respect that complexity, it's remarkably reliable.
Getting started with Kubernetes? Use a managed service, set resource limits on every container, and display your kubectl context in your terminal prompt. Those three things prevent 80% of common mistakes.