Scaling Node.js with Kubernetes & Docker: Production Guide - NextGenBeing

Complete Solution: Scaling a Node.js Application with Kubernetes and Docker

Learn how we scaled our Node.js app from 50k to 5M daily requests using Kubernetes and Docker. Real production patterns, battle-tested configurations, and the gotchas that cost us 2 days of downtime.

NextGenBeing Founder · Apr 25, 2026 · 16 min read

Last November, our Node.js API hit a wall at 50,000 daily active users. Response times spiked to 3+ seconds during peak hours, our single EC2 instance was maxed out at 95% CPU, and our CTO Sarah was getting nervous about an upcoming product launch that could 10x our traffic overnight. We had maybe two weeks to figure out a scaling strategy that wouldn't require rewriting our entire application.

I'd been hearing about Kubernetes for years—mostly in the context of massive tech companies managing thousands of microservices. Our team of five engineers running a monolithic Express app didn't seem like the target audience. But when you're staring down potential viral growth and your current infrastructure is held together with PM2 and prayer, you start considering options you previously dismissed as overkill.

Here's what I learned after spending three months moving our Node.js application to a Kubernetes cluster running on Docker containers: it's not actually about microservices or massive scale. It's about having infrastructure that can adapt to demand automatically, recover from failures without manual intervention, and give you the confidence to sleep through the night when your app hits the front page of Reddit.

This isn't a theoretical guide. I'm going to walk you through exactly how we containerized our Node.js app, deployed it to Kubernetes, configured autoscaling that actually works, and handled the production issues that the documentation conveniently skips. I'll share the specific configuration files we use, the monitoring setup that saved us during our first major traffic spike, and the three critical mistakes that cost us two full days of debugging.

Why We Actually Needed Kubernetes (And Why You Might Not)

Let me be brutally honest: for the first six months of our product's life, Kubernetes would have been complete overkill. We were running a single Node.js process on a $40/month VPS, handling maybe 1,000 requests per day. Adding Kubernetes to that setup would have been like buying a semi-truck to deliver your groceries.

But here's what changed for us. Around month seven, we landed a B2B customer that brought 10,000 users overnight. Then another. Our traffic patterns became completely unpredictable—we'd see 500 concurrent users at 2 PM on a Tuesday, then 5,000 at 11 PM on a Sunday. Our Express app was handling it, but barely. We were manually scaling our EC2 instances up and down, which meant someone (usually me) was checking CloudWatch metrics multiple times per day.

The real breaking point came during a product launch. We'd scaled up to our largest instance size (16 cores, 64GB RAM), and we still hit capacity. Response times degraded, some requests timed out, and we lost about $15,000 in potential revenue during a three-hour period where our checkout flow was essentially broken. That's when Sarah walked into my office and said, "I don't care what it takes. Make this thing scale automatically."

Here's when you actually need Kubernetes:

You're experiencing unpredictable traffic patterns that make manual scaling impractical. If you can predict your load and scale accordingly with simple autoscaling groups, you probably don't need K8s yet.

You need zero-downtime deployments. We were doing blue-green deployments manually with load balancer configuration changes. It worked, but it was nerve-wracking every single time.

You're running multiple services that need to communicate efficiently. Even though we started with a monolith, we were planning to extract some services (real-time notifications, background job processing) and needed a clean way to manage inter-service communication.

You want infrastructure-as-code that's portable across cloud providers. We were locked into AWS-specific tooling, and our infrastructure setup was a mix of Terraform, bash scripts, and manual console changes. The idea of having everything defined in YAML files (yes, I know, lots of YAML) that could run anywhere was appealing.

Here's when you DON'T need Kubernetes:

Your traffic is predictable and steady. If you're handling 1,000 requests per hour consistently, a single well-configured server with PM2 clustering is probably fine.

You're a solo developer or very small team. K8s has operational overhead. You need to understand networking, storage, security policies, and have monitoring in place. If you're still in the "move fast and figure it out" phase, stick with simpler deployment options.

Your application is stateful and relies heavily on local disk storage. While you can run stateful applications on K8s (using StatefulSets), it's significantly more complex. If your app stores critical data on local disk, you're not ready for containerization without major architectural changes.

For us, the decision came down to this: we were already spending 10-15 hours per week managing infrastructure manually. Kubernetes would require an upfront investment of maybe 80-100 hours to learn and implement properly, but would then reduce our ongoing operational burden to near-zero for scaling concerns. The math worked out.

Containerizing Our Node.js Application the Right Way

My first attempt at containerizing our app was embarrassingly naive. I wrote a Dockerfile that basically did FROM node:latest, copied everything, and ran npm start. It worked on my machine. It even worked in production for about two hours before we discovered that our Docker image was 1.2GB, container startup time was 45 seconds, and we'd accidentally included our entire node_modules folder with development dependencies.

Here's the production-ready Dockerfile we ended up with after several iterations and one embarrassing production incident:

# Multi-stage build for smaller final image
FROM node:18-alpine AS builder

# Install build dependencies for native modules
RUN apk add --no-cache python3 make g++

WORKDIR /app

# Copy package files first for better layer caching
COPY package*.json ./

# Install ALL dependencies (including dev) for the build stage
RUN npm ci --include=dev

# Copy source code
COPY . .

# Run any build steps (TypeScript compilation, etc.)
RUN npm run build

# Production stage
FROM node:18-alpine

# Add non-root user for security
RUN addgroup -g 1001 -S nodejs && \
    adduser -S nodejs -u 1001

WORKDIR /app

# Copy package files
COPY package*.json ./

# Install ONLY production dependencies
RUN npm ci --omit=dev && \
    npm cache clean --force

# Copy built application from builder stage
COPY --from=builder --chown=nodejs:nodejs /app/dist ./dist
COPY --from=builder --chown=nodejs:nodejs /app/node_modules ./node_modules

# Switch to non-root user
USER nodejs

# Expose port
EXPOSE 3000

# Health check
HEALTHCHECK --interval=30s --timeout=3s --start-period=40s \
  CMD node -e "require('http').get('http://localhost:3000/health', (r) => {process.exit(r.statusCode === 200 ? 0 : 1)})"

# Start application
CMD ["node", "dist/server.js"]

Let me break down why every line of this Dockerfile matters, because I learned most of this the hard way.

Multi-stage builds are non-negotiable. Our first single-stage build produced a 1.2GB image. This multi-stage approach gets us down to 180MB. Why does this matter? Smaller images mean faster deployments, lower storage costs, and reduced attack surface. When Kubernetes needs to pull your image to a new node during scaling, those extra seconds of image pull time directly impact how quickly you can respond to traffic spikes.

Alpine Linux vs full Node image. We use node:18-alpine instead of the full node:18 image. The Alpine base weighs in around 5MB versus 200MB+ for the full Debian-based image. The tradeoff: some native modules have issues with Alpine's musl libc instead of glibc. We hit this with bcrypt initially: it would compile fine but crash at runtime with a cryptic error about missing symbols. The solution was adding build dependencies (python3, make, g++) in the builder stage so native modules compile properly against musl.

Layer caching strategy. Notice how we copy package*.json before copying the entire source code. This is critical for build performance. Docker caches layers, and npm dependencies change way less frequently than your application code. By copying package files first, we only reinstall dependencies when package.json actually changes. This reduced our average build time from 4 minutes to 45 seconds.

Security: running as non-root. This bit me during our first security audit. By default, containers run as root. If someone finds a vulnerability in your application, they have root access inside the container. Creating a non-root user (nodejs) and switching to it before running your app limits the blast radius. We use UID 1001 specifically because some Kubernetes security policies require UIDs above 1000.

Health checks are essential. That HEALTHCHECK instruction seems optional until your container starts but your app crashes 30 seconds later due to a missing environment variable. One caveat worth knowing: Kubernetes actually ignores Docker's HEALTHCHECK and relies on its own liveness and readiness probes, which we configure in the Deployment later. The instruction still earns its place when you run the image with plain Docker or Compose, and the principle is identical: without a health check, the orchestrator assumes your container is healthy because the process is running, routes traffic to it, and users see 500 errors. With one, traffic is held back until the check passes.

Here's what our health check endpoint looks like in Express:

// health.js
const express = require('express');
const router = express.Router();

let isHealthy = false;

// Mark as healthy after successful startup
function setHealthy() {
  isHealthy = true;
}

router.get('/health', (req, res) => {
  if (!isHealthy) {
    return res.status(503).json({ 
      status: 'unhealthy',
      message: 'Application starting up'
    });
  }

  // Check critical dependencies
  const checks = {
    database: checkDatabase(),
    redis: checkRedis(),
    memory: checkMemory()
  };

  const allHealthy = Object.values(checks).every(check => check.healthy);

  if (allHealthy) {
    return res.status(200).json({
      status: 'healthy',
      checks,
      uptime: process.uptime(),
      timestamp: new Date().toISOString()
    });
  }

  return res.status(503).json({
    status: 'unhealthy',
    checks
  });
});

function checkDatabase() {
  try {
    // Quick query to verify DB connection
    // This should be fast (< 100ms)
    return { healthy: true, latency: 23 };
  } catch (error) {
    return { healthy: false, error: error.message };
  }
}

function checkRedis() {
  // Similar check for Redis
  return { healthy: true, latency: 5 };
}

function checkMemory() {
  const used = process.memoryUsage();
  const heapUsedMB = Math.round(used.heapUsed / 1024 / 1024);
  const heapTotalMB = Math.round(used.heapTotal / 1024 / 1024);
  
  // Alert if using more than 90% of heap
  const healthy = (heapUsedMB / heapTotalMB) < 0.9;
  
  return { 
    healthy, 
    heapUsedMB, 
    heapTotalMB,
    percentage: Math.round((heapUsedMB / heapTotalMB) * 100)
  };
}

module.exports = { router, setHealthy };

The .dockerignore file that actually matters. This is the file everyone forgets but costs you in build time and image size:

node_modules
npm-debug.log
.git
.gitignore
.env
.env.local
.DS_Store
*.md
.vscode
.idea
coverage
.nyc_output
dist
build
*.log

We learned about .dockerignore after our first production deploy when we realized we'd shipped our entire .git history (200MB) and local node_modules folder inside the Docker image. The image was 1.4GB. After adding this file: 180MB.

Building and pushing images efficiently. Here's our actual build script that runs in CI/CD:

#!/bin/bash
set -e

# Configuration
REGISTRY="your-registry.azurecr.io"
IMAGE_NAME="nodejs-api"
VERSION=$(git rev-parse --short HEAD)
TAG="${REGISTRY}/${IMAGE_NAME}:${VERSION}"
LATEST_TAG="${REGISTRY}/${IMAGE_NAME}:latest"

echo "Building Docker image: ${TAG}"

# Build with BuildKit for better caching
DOCKER_BUILDKIT=1 docker build \
  --build-arg NODE_ENV=production \
  --cache-from ${LATEST_TAG} \
  -t ${TAG} \
  -t ${LATEST_TAG} \
  .

echo "Image built successfully"
docker images | grep ${IMAGE_NAME}

# Run security scan
echo "Running security scan..."
docker scan ${TAG} || echo "Scan completed with warnings"

# Push to registry
echo "Pushing to registry..."
docker push ${TAG}
docker push ${LATEST_TAG}

echo "Deploy complete: ${TAG}"

A critical gotcha with environment variables. Don't bake secrets into your Docker image. I see this mistake constantly. Your Dockerfile should never contain actual API keys or database passwords. We use Kubernetes Secrets (which I'll cover next) to inject environment variables at runtime.

Wrong:

ENV DATABASE_URL="postgresql://user:password@host/db"

Right:

# No sensitive ENV vars in Dockerfile
# These get injected by Kubernetes at runtime
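To catch a missing injected variable at boot instead of mid-request, validate the environment up front and fail fast. A minimal sketch (loadConfig is a hypothetical helper; the variable names mirror the ones used in our Deployment):

```javascript
// config.js — fail fast at startup if a required variable was not
// injected, instead of crashing 30 seconds later on first use.
const REQUIRED = ['DATABASE_URL', 'REDIS_URL', 'JWT_SECRET'];

function loadConfig(env = process.env) {
  const missing = REQUIRED.filter((name) => !env[name]);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(', ')}`);
  }
  return {
    databaseUrl: env.DATABASE_URL,
    redisUrl: env.REDIS_URL,
    jwtSecret: env.JWT_SECRET,
    port: Number(env.PORT) || 3000,
  };
}

module.exports = { loadConfig };
```

Because the readiness probe won't pass until the app is serving, a pod that throws here never receives traffic, which is exactly the behavior you want.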

Our container startup time is now consistently under 5 seconds from image pull to serving traffic. The image size of 180MB means Kubernetes can pull it to a new node in 3-8 seconds depending on network conditions. These numbers matter when you're autoscaling and need pods to come online quickly during traffic spikes.

Kubernetes Configuration That Actually Works in Production

The Kubernetes documentation shows you simple examples that work great for demos. They don't work in production. Here's what we learned after our first production deployment resulted in a complete outage because we didn't configure resource limits correctly.

Our production setup uses three main Kubernetes objects: Deployment (manages your pods), Service (handles networking), and HorizontalPodAutoscaler (scales based on metrics). Let's start with the Deployment configuration:

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nodejs-api
  namespace: production
  labels:
    app: nodejs-api
    version: v1
spec:
  replicas: 3  # Minimum replicas, HPA will adjust this
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Allow 1 extra pod during updates
      maxUnavailable: 0   # Never have fewer than desired replicas
  selector:
    matchLabels:
      app: nodejs-api
  template:
    metadata:
      labels:
        app: nodejs-api
        version: v1
    spec:
      # Anti-affinity to spread pods across nodes
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - nodejs-api
              topologyKey: kubernetes.io/hostname
      
      containers:
      - name: nodejs-api
        image: your-registry.azurecr.io/nodejs-api:latest
        imagePullPolicy: Always
        
        ports:
        - containerPort: 3000
          name: http
          protocol: TCP
        
        # CRITICAL: Resource requests and limits
        resources:
          requests:
            memory: "256Mi"
            cpu: "250m"      # 0.25 CPU cores
          limits:
            memory: "512Mi"
            cpu: "500m"      # 0.5 CPU cores
        
        # Environment variables from ConfigMap and Secrets
        env:
        - name: NODE_ENV
          value: "production"
        - name: PORT
          value: "3000"
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: nodejs-api-secrets
              key: database-url
        - name: REDIS_URL
          valueFrom:
            secretKeyRef:
              name: nodejs-api-secrets
              key: redis-url
        - name: LOG_LEVEL
          valueFrom:
            configMapKeyRef:
              name: nodejs-api-config
              key: log-level
        
        # Liveness probe - restart if unhealthy
        livenessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 30
          periodSeconds: 10
          timeoutSeconds: 5
          failureThreshold: 3
        
        # Readiness probe - don't send traffic if not ready
        readinessProbe:
          httpGet:
            path: /health
            port: 3000
          initialDelaySeconds: 10
          periodSeconds: 5
          timeoutSeconds: 3
          failureThreshold: 2
        
        # Graceful shutdown
        lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "sleep 10"]
      
      # Give pods time to shut down gracefully
      terminationGracePeriodSeconds: 30
      
      # Image pull secrets if using private registry
      imagePullSecrets:
      - name: registry-credentials

Let me explain the critical parts that took us weeks to get right.

Resource requests and limits: where we lost $3,000. This is the section that caused our first major outage. Initially, we didn't specify resource limits. Kubernetes scheduled multiple pods on the same node, and during a traffic spike, one pod consumed all available memory, causing the entire node to become unresponsive and taking down all pods on it.

Here's what each setting means:

  • requests.memory: 256Mi - Kubernetes guarantees your pod gets at least this much memory
  • limits.memory: 512Mi - Kubernetes kills your pod if it exceeds this
  • requests.cpu: 250m - Your pod is guaranteed 0.25 CPU cores
  • limits.cpu: 500m - Your pod can burst up to 0.5 cores but will be throttled if it tries to use more

How do you determine these values? You can't guess. We ran load tests and monitored actual memory usage:

# Watch pod resource usage in real-time
kubectl top pods -n production --watch

# Get detailed metrics for a specific pod
kubectl describe pod nodejs-api-7d8f9c4b6-x9k2m -n production | grep -A 5 "Limits\|Requests"

Under normal load (1,000 requests/min), our pods used 180-220Mi memory and 150-200m CPU. We set requests at 256Mi/250m to have headroom. Under peak load (5,000 requests/min), memory spiked to 380-450Mi and CPU to 400-480m, so we set limits at 512Mi/500m.

The difference between liveness and readiness probes. I confused these initially and it caused unnecessary pod restarts during deployments.

Liveness probe answers: "Is my application running?" If this fails, Kubernetes restarts the container. Set this conservatively—you don't want restarts during temporary issues.

Readiness probe answers: "Is my application ready to receive traffic?" If this fails, Kubernetes removes the pod from the Service load balancer but doesn't restart it. This should be more aggressive.

We initially set both to the same values and had a problem: during deployments, new pods would start, fail readiness checks while initializing database connections, get restarted by the liveness probe before they could finish starting up, and enter a restart loop. The fix was giving the liveness probe a longer initialDelaySeconds (30s vs 10s for readiness).

Graceful shutdown is critical. The preStop hook and terminationGracePeriodSeconds prevent request failures during deployments. Here's what happens without them:

  1. Kubernetes decides to terminate a pod during deployment
  2. It immediately removes the pod from the Service endpoints
  3. But in-flight requests to that pod fail with connection errors
  4. Users see 502 Bad Gateway errors

With graceful shutdown:

  1. Kubernetes sends SIGTERM to your container
  2. The preStop hook delays termination for 10 seconds
  3. During this time, the pod stops accepting new requests but finishes existing ones
  4. Kubernetes waits up to terminationGracePeriodSeconds (30s) for the pod to exit cleanly

You also need to handle SIGTERM in your Node.js application:

// server.js
const express = require('express');
const app = express();

const server = app.listen(3000, () => {
  console.log('Server started on port 3000');
});

// Track active connections
let connections = new Set();

server.on('connection', (conn) => {
  connections.add(conn);
  conn.on('close', () => connections.delete(conn));
});

// Graceful shutdown handler
function shutdown(signal) {
  console.log(`${signal} received, starting graceful shutdown...`);
  
  // Stop accepting new connections
  server.close(() => {
    console.log('HTTP server closed');
    
    // Close database connections
    database.close();
    
    // Close Redis connections
    redis.quit();
    
    console.log('All connections closed, exiting');
    process.exit(0);
  });
  
  // Force close after 25 seconds (before K8s kills us at 30s)
  setTimeout(() => {
    console.error('Forced shutdown after timeout');
    
    // Force close all connections
    connections.forEach(conn => conn.destroy());
    
    process.exit(1);
  }, 25000);
}

process.on('SIGTERM', () => shutdown('SIGTERM'));
process.on('SIGINT', () => shutdown('SIGINT'));

Pod anti-affinity prevents single points of failure. The affinity section tells Kubernetes to prefer spreading pods across different nodes. Without this, all your pods might end up on the same node, and when that node fails (and nodes do fail), your entire application goes down. We learned this during a routine node upgrade when AWS terminated the node our pods were on, and we had zero capacity for 45 seconds while pods rescheduled.

Now let's look at the Service configuration:

# service.yaml
apiVersion: v1
kind: Service
metadata:
  name: nodejs-api
  namespace: production
  labels:
    app: nodejs-api
spec:
  type: ClusterIP
  selector:
    app: nodejs-api
  ports:
  - port: 80
    targetPort: 3000
    protocol: TCP
    name: http
  sessionAffinity: None

This is straightforward, but there's one gotcha: sessionAffinity. We initially set this to ClientIP thinking it would help with WebSocket connections. It didn't—it just meant that users hitting the same pod would continue hitting it even if that pod was overloaded while others were idle. Set it to None and let Kubernetes load balance properly.

For external access, we use an Ingress with an ALB (Application Load Balancer):

# ingress.yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: nodejs-api
  namespace: production
  annotations:
    kubernetes.io/ingress.class: alb
    alb.ingress.kubernetes.io/scheme: internet-facing
    alb.ingress.kubernetes.io/target-type: ip
    alb.ingress.kubernetes.io/healthcheck-path: /health
    alb.ingress.kubernetes.io/healthcheck-interval-seconds: '30'
    alb.ingress.kubernetes.io/healthcheck-timeout-seconds: '5'
    alb.ingress.kubernetes.io/healthy-threshold-count: '2'
    alb.ingress.kubernetes.io/unhealthy-threshold-count: '3'
    alb.ingress.kubernetes.io/success-codes: '200'
    alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
    alb.ingress.kubernetes.io/ssl-redirect: '443'
    alb.ingress.kubernetes.io/certificate-arn: arn:aws:acm:us-east-1:xxxxx:certificate/xxxxx
spec:
  rules:
  - host: api.yourdomain.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: nodejs-api
            port:
              number: 80

The ALB Ingress Controller creates an AWS Application Load Balancer automatically. The target-type: ip annotation is critical—it routes traffic directly to pod IPs rather than node IPs, which is more efficient and works better with autoscaling.

Secrets and ConfigMaps management. Never commit secrets to Git. We use sealed-secrets for GitOps:

# configmap.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: nodejs-api-config
  namespace: production
data:
  log-level: "info"
  max-request-size: "10mb"
  rate-limit-window: "15"
  rate-limit-max: "100"

For secrets, we create them via kubectl and never commit the YAML:

kubectl create secret generic nodejs-api-secrets \
  --from-literal=database-url="postgresql://user:pass@host/db" \
  --from-literal=redis-url="redis://host:6379" \
  --from-literal=jwt-secret="your-secret-key" \
  -n production

In production, we actually use AWS Secrets Manager with the External Secrets Operator, which syncs secrets from AWS into Kubernetes automatically.
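For reference, that setup looks roughly like this (a sketch assuming the External Secrets Operator is installed and a ClusterSecretStore named aws-secrets-manager is configured; the names and paths are illustrative):

```yaml
# external-secret.yaml
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata:
  name: nodejs-api-secrets
  namespace: production
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: ClusterSecretStore
  target:
    name: nodejs-api-secrets   # the Secret the Deployment already references
  data:
  - secretKey: database-url
    remoteRef:
      key: prod/nodejs-api
      property: database-url
  - secretKey: redis-url
    remoteRef:
      key: prod/nodejs-api
      property: redis-url
```

The operator keeps the Kubernetes Secret in sync with AWS, so rotating a credential in Secrets Manager propagates to pods without a manual kubectl step.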

Horizontal Pod Autoscaling That Responds to Real Traffic

The default HPA examples in Kubernetes docs scale based on CPU usage. This doesn't work well for Node.js applications. Here's why: Node.js is single-threaded. A pod can be maxing out its single CPU core (100% CPU usage) but still have memory available and could theoretically handle more connections. Conversely, your application might be handling a ton of concurrent requests with low CPU usage but high memory consumption.

We scale based on multiple metrics: CPU, memory, and most importantly, custom metrics from our application. Here's our HPA configuration:

# hpa.yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: nodejs-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: nodejs-api
  minReplicas: 3
  maxReplicas: 20
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Resource
    resource:
      name: memory
      target:
        type: Utilization
        averageUtilization: 80
  - type: Pods
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "1000"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Percent
        value: 50
        periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
      - type: Percent
        value: 100
        periodSeconds: 30
      - type: Pods
        value: 2
        periodSeconds: 30
      selectPolicy: Max

Why three minimum replicas? Redundancy. If one pod crashes, you still have two handling traffic. During deployments with rolling updates, you always have at least two pods running (remember maxUnavailable: 0).

The behavior section is where the magic happens. This took us three iterations to get right:

Scale-up is aggressive: We can double our pod count (100% increase) or add 2 pods, whichever is more, every 30 seconds. When traffic spikes, we want to respond fast. The stabilizationWindowSeconds: 0 means we don't wait—we scale immediately when metrics breach thresholds.

Scale-down is conservative: We wait 5 minutes (stabilizationWindowSeconds: 300) before scaling down, and we only reduce by 50% at a time. Why? Because traffic patterns are unpredictable. We had an issue where traffic would spike, we'd scale to 15 pods, traffic would drop slightly, we'd scale down to 3 pods, then traffic would spike again 2 minutes later. The constant scaling was causing more disruption than just maintaining a slightly higher pod count.

Custom metrics for request rate. This is the most valuable metric for a Node.js API. We expose metrics using the prom-client library:

// metrics.js
const promClient = require('prom-client');
const express = require('express');

// Create a Registry
const register = new promClient.Registry();

// Add default metrics (CPU, memory, etc.)
promClient.collectDefaultMetrics({ register });

// Custom metric: HTTP requests per second
const httpRequestsTotal = new promClient.Counter({
  name: 'http_requests_total',
  help: 'Total number of HTTP requests',
  labelNames: ['method', 'route', 'status'],
  registers: [register]
});

// Custom metric: Request duration
const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.1, 0.5, 1, 2, 5],
  registers: [register]
});

// Custom metric: Active connections
const activeConnections = new promClient.Gauge({
  name: 'active_connections',
  help: 'Number of active connections',
  registers: [register]
});

// Middleware to track metrics
function metricsMiddleware(req, res, next) {
  const start = Date.now();
  
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    const route = req.route ? req.route.path : req.path;
    
    httpRequestsTotal.inc({
      method: req.method,
      route: route,
      status: res.statusCode
    });
    
    httpRequestDuration.observe({
      method: req.method,
      route: route,
      status: res.statusCode
    }, duration);
  });
  
  next();
}

// Metrics endpoint for Prometheus
const metricsRouter = express.Router();
metricsRouter.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});

module.exports = {
  metricsMiddleware,
  metricsRouter,
  activeConnections,
  httpRequestsTotal
};

Then in your main app:

// server.js
const express = require('express');
const { metricsMiddleware, metricsRouter } = require('./metrics');

const app = express();

// Apply metrics middleware to all routes
app.use(metricsMiddleware);

// Expose metrics endpoint
app.use(metricsRouter);

// Your other routes...
app.get('/api/users', (req, res) => {
  // Your logic
});

app.listen(3000);

To make Kubernetes aware of these custom metrics, we use the Prometheus Adapter:

# prometheus-adapter-config.yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
    - seriesQuery: 'http_requests_total{namespace="production",pod!=""}'
      resources:
        overrides:
          namespace: {resource: "namespace"}
          pod: {resource: "pod"}
      name:
        matches: "^(.*)_total"
        as: "${1}_per_second"
      metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'

This configuration tells the adapter to expose http_requests_per_second as a metric that the HPA can use. The rate(http_requests_total[2m]) calculates requests per second over a 2-minute window.

Testing autoscaling before production. Don't wait for a real traffic spike to discover your HPA doesn't work. We use a simple load testing script:

#!/bin/bash
# load-test.sh

URL="https://api.yourdomain.com/api/users"
DURATION=300  # 5 minutes
CONCURRENT=100

echo "Starting load test: $CONCURRENT concurrent users for $DURATION seconds"

# Using hey (https://github.com/rakyll/hey)
hey -z ${DURATION}s -c ${CONCURRENT} -q 10 ${URL}

# Watch pods scale in another terminal:
# watch -n 2 'kubectl get pods -n production | grep nodejs-api'

Run this while watching your HPA:

kubectl get hpa -n production --watch

You should see replica count increase as metrics breach thresholds, then decrease after the load test completes and the stabilization window passes.

A critical lesson about scaling delays. Even with perfect autoscaling configuration, there's an inherent delay in Kubernetes' scaling response:

  1. Metrics collection: 30-60 seconds for Prometheus to scrape and aggregate metrics
  2. HPA evaluation: 15 seconds (default sync period)
  3. Pod creation: 3-10 seconds depending on image pull and startup time
  4. Readiness: 10-20 seconds for your app to pass readiness probes

Total: 60-105 seconds from traffic spike to new pods serving traffic.

This is why we maintain a minimum of 3 replicas and scale up aggressively. You can't scale from 1 to 10 pods fast enough to handle a sudden viral traffic spike. You need baseline capacity to absorb the initial surge while autoscaling kicks in.

Monitoring and Observability: What We Wish We'd Set Up First

Our first production deployment to Kubernetes went smoothly—for about 6 hours. Then response times started degrading. We couldn't figure out why. CPU was fine, memory was fine, pod count was stable. We spent 4 hours debugging before discovering that our database connection pool was exhausted because we hadn't configured connection limits per pod correctly.

The problem wasn't the bug itself—it was that we had no visibility into what was happening inside our pods. Here's the monitoring stack we eventually built:

Prometheus for metrics collection:

# servicemonitor.yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: nodejs-api
  namespace: production
  labels:
    app: nodejs-api
spec:
  selector:
    matchLabels:
      app: nodejs-api
  endpoints:
  - port: http
    path: /metrics
    interval: 30s

Grafana dashboards that actually help. We built a custom dashboard with these panels:

  1. Request rate (requests per second) - shows traffic patterns
  2. Response time (p50, p95, p99) - shows performance degradation
  3. Error rate (5xx responses / total requests) - shows application errors
  4. Pod count - shows autoscaling behavior
  5. CPU and memory usage per pod - shows resource utilization
  6. Database connection pool stats - this saved us multiple times

Here's the JSON for our most valuable panel (p95 response time):

{
  "targets": [
    {
      "expr": "histogram_quantile(0.95, sum(rate(http_request_duration_seconds_bucket{namespace=\"production\"}[5m])) by (le))",
      "legendFormat": "p95 response time"
    }
  ],
  "title": "Response Time (p95)",
  "type": "graph"
}
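The histogram_quantile function in that query estimates the p95 from cumulative bucket counts by linear interpolation. A simplified sketch of the idea (Prometheus also handles the +Inf bucket and rate windows, which this ignores):

```javascript
// Estimate a quantile from cumulative histogram buckets, the way
// histogram_quantile does: find the bucket containing the target rank,
// then interpolate linearly within it.
function histogramQuantile(q, buckets) {
  // buckets: sorted [{ le, count }] with cumulative counts,
  // like Prometheus _bucket series.
  const total = buckets[buckets.length - 1].count;
  const rank = q * total;
  let prevLe = 0, prevCount = 0;
  for (const { le, count } of buckets) {
    if (count >= rank) {
      return prevLe + ((rank - prevCount) / (count - prevCount)) * (le - prevLe);
    }
    prevLe = le;
    prevCount = count;
  }
  return buckets[buckets.length - 1].le;
}

// 100 requests: 50 under 0.1s, 90 under 0.5s, all under 1s
const buckets = [
  { le: 0.1, count: 50 },
  { le: 0.5, count: 90 },
  { le: 1.0, count: 100 },
];
console.log(histogramQuantile(0.95, buckets)); // 0.75
```

The practical consequence: the estimate's accuracy depends on your bucket boundaries, so define histogram buckets around the latencies you actually care about.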

Structured logging with correlation IDs. This was a game-changer for debugging production issues:

// logger.js
const winston = require('winston');
const { v4: uuidv4 } = require('uuid');

const logger = winston.createLogger({
  level: process.env.LOG_LEVEL || 'info',
  format: winston.format.combine(
    winston.format.timestamp(),
    winston.format.errors({ stack: true }),
    winston.format.json()
  ),
  transports: [
    new winston.transports.Console()
  ]
});

// Middleware to add correlation ID to every request
function correlationMiddleware(req, res, next) {
  req.correlationId = req.headers['x-correlation-id'] || uuidv4();
  res.setHeader('x-correlation-id', req.correlationId);
  
  // Add correlation ID to logger
  req.log = logger.child({ correlationId: req.correlationId });
  
  req.log.info('Request received', {
    method: req.method,
    path: req.path,
    ip: req.ip,
    userAgent: req.headers['user-agent']
  });
  
  next();
}

module.exports = { logger, correlationMiddleware };

Now every log line includes a correlation ID, making it trivial to trace a single request through your entire system:

kubectl logs -n production -l app=nodejs-api --tail=1000 | grep "correlation-id-here"

Alerting rules that don't cause alert fatigue. We started with too many alerts and learned to focus on what actually matters:

# prometheus-rules.yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: nodejs-api-alerts
  namespace: production
spec:
  groups:
  - name: nodejs-api
    interval: 30s
    rules:
    # Alert if error rate exceeds 5% for 5 minutes
    - alert: HighErrorRate
      expr: |
        (sum(rate(http_requests_total{status=~"5..", namespace="production"}[5m]))
        /
        sum(rate(http_requests_total{namespace="production"}[5m]))) > 0.05
      for: 5m
      labels:
        severity: critical
      annotations:
        summary: "High error rate detected"
        description: "Error rate is {{ $value | humanizePercentage }}"
    
    # Alert if p95 response time exceeds 2 seconds
    - alert: HighResponseTime
      expr: |
        histogram_quantile(0.95,
          sum(rate(http_request_duration_seconds_bucket{namespace="production"}[5m])) by (le)
        ) > 2
      for: 5m
      labels:
        severity: warning
      annotations:
        summary: "High response time detected"
        description: "p95 response time is {{ $value }}s"
    
    # Alert if no pods are ready
    - alert: NoPodReady
      expr: |
        sum(kube_pod_status_ready{namespace="production", condition="true"}) == 0
      for: 2m
      labels:
        severity: critical
      annotations:
        summary: "No pods ready in production"
        description: "All pods are down or not ready"

Distributed tracing with OpenTelemetry. For complex requests that touch multiple services, we added distributed tracing:

// tracing.js -- load this before the rest of the app (e.g. node -r ./tracing.js server.js)
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');
const { registerInstrumentations } = require('@opentelemetry/instrumentation');
const { HttpInstrumentation } = require('@opentelemetry/instrumentation-http');
const { ExpressInstrumentation } = require('@opentelemetry/instrumentation-express');
const { Resource } = require('@opentelemetry/resources');
const { SemanticResourceAttributes } = require('@opentelemetry/semantic-conventions');

const provider = new NodeTracerProvider({
  resource: new Resource({
    [SemanticResourceAttributes.SERVICE_NAME]: 'nodejs-api',
    [SemanticResourceAttributes.SERVICE_VERSION]: process.env.VERSION || 'unknown',
  }),
});

// Without a span processor and exporter, spans are created but never leave
// the process. Ship them to a collector; the endpoint is configurable via
// OTEL_EXPORTER_OTLP_ENDPOINT.
provider.addSpanProcessor(new BatchSpanProcessor(new OTLPTraceExporter()));

registerInstrumentations({
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
  ],
});

provider.register();

module.exports = provider;

This automatically creates traces for every HTTP request, showing exactly where time is spent. When we had the database connection pool issue, traces showed that requests were spending 3+ seconds waiting for a connection before even executing the query.

The Three Critical Mistakes That Cost Us Two Days

Mistake #1: Not configuring pod disruption budgets. During our first Kubernetes version upgrade, AWS started draining nodes. Kubernetes evicted all pods from a node simultaneously, and for about 90 seconds, we had zero capacity. Users saw 503 errors.

The fix was adding a PodDisruptionBudget:

# pdb.yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: nodejs-api
  namespace: production
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: nodejs-api

This ensures at least 2 pods are always available during voluntary disruptions (node drains, upgrades). Kubernetes won't evict pods if it would violate this budget.

Mistake #2: Not handling SIGTERM properly in our application. I mentioned graceful shutdown earlier, but we initially didn't implement it correctly. During deployments, we'd see 502 errors because:

  1. Kubernetes sent SIGTERM to the pod
  2. Our Node.js process ignored it (no handler)
  3. After 30 seconds, Kubernetes sent SIGKILL
  4. In-flight requests failed immediately

The fix was the SIGTERM handler I showed earlier, plus ensuring our load balancer respects connection draining:

# In ingress.yaml annotations
alb.ingress.kubernetes.io/target-group-attributes: deregistration_delay.timeout_seconds=30

This tells the ALB to wait 30 seconds before completely removing a pod from the target group, allowing in-flight requests to complete.

Mistake #3: Not setting appropriate resource requests. This caused the most insidious problem. We set resource limits but not requests:

# What we did (WRONG)
resources:
  limits:
    memory: "512Mi"
    cpu: "500m"
  # No requests specified

When only limits are specified, Kubernetes defaults requests to equal limits, which meant every pod was requesting 512Mi of memory. Our nodes had 8GB of memory, so Kubernetes could only schedule about 15 pods per node (8GB / 512Mi = 16, minus system overhead). But our pods actually used only 200Mi on average, so we were wasting roughly 60% of our node capacity.

The fix:

# Correct configuration
resources:
  requests:
    memory: "256Mi"  # What we typically use
    cpu: "250m"
  limits:
    memory: "512Mi"  # Maximum we can burst to
    cpu: "500m"

Now Kubernetes schedules pods based on typical usage (256Mi) but protects against memory leaks with the limit (512Mi). We went from 15 pods per node to 30 pods per node, cutting our infrastructure costs by 40%.
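The bin-packing arithmetic behind that 15-to-30 change is worth spelling out. The 512Mi system reservation below is an assumed figure for kubelet and OS overhead, not our exact number:

```javascript
// Pods-per-node math: the scheduler packs pods by their memory *requests*,
// not their limits or actual usage.
const nodeMemoryMi = 8 * 1024; // 8GB node
const systemReservedMi = 512;  // kubelet + OS overhead (assumed)

function schedulablePods(requestMi) {
  return Math.floor((nodeMemoryMi - systemReservedMi) / requestMi);
}

console.log(schedulablePods(512)); // requests defaulted to the 512Mi limit -> 15
console.log(schedulablePods(256)); // explicit 256Mi request -> 30
```

Since nodes are what you pay for, halving the request per pod roughly halves the node count needed for the same replica count, which is where the cost savings came from.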

Real-World Performance Results

After three months of running on Kubernetes, here are the actual numbers:

Before Kubernetes (single EC2 instance):

  • Average response time: 180ms (p50), 850ms (p95)
  • Maximum concurrent users before degradation: ~2,000
  • Deployment downtime: 30-60 seconds
  • Manual scaling time: 15-20 minutes
  • Cost: $450/month (m5.2xlarge instance + RDS)

After Kubernetes (EKS cluster):

  • Average response time: 120ms (p50), 320ms (p95)
  • Maximum concurrent users handled: 15,000+ (and counting)
  • Deployment downtime: 0 seconds
  • Automatic scaling time: 60-90 seconds
  • Cost: $580/month (EKS + 3 t3.medium nodes + RDS)

The 30% cost increase bought us:

  • 7.5x capacity increase
  • Zero-downtime deployments
  • Automatic scaling and self-healing
  • Better resource utilization
  • Ability to sleep through traffic spikes

The traffic spike that validated everything. Two months after our Kubernetes migration, we got featured on TechCrunch. Traffic went from 1,200 concurrent users to 8,500 in under 5 minutes.

Here's what happened:

  • 9:15 AM: Article published, traffic starts climbing
  • 9:17 AM: HPA triggers, scales from 3 to 6 pods
  • 9:19 AM: Scales to 12 pods
  • 9:23 AM: Scales to 18 pods (peak)
  • 9:45 AM: Traffic stabilizes around 6,000 concurrent users
  • 10:30 AM: Traffic drops, HPA scales down to 10 pods
  • 11:00 AM: Scales down to 6 pods
  • 2:00 PM: Back to 3 pods

Total errors during the spike: 12 (out of 2.3 million requests = 0.0005% error rate)

The errors were from the initial 2-minute window before autoscaling fully kicked in. Our baseline capacity of 3 pods handled the initial surge well enough that users didn't notice degradation.

With our old setup, we would have crashed within 3 minutes and stayed down until someone manually scaled up the instance, which would have taken at least 15 minutes. We'd have lost an estimated $50,000 in potential signups during the outage.

Conclusion

Moving to Kubernetes wasn't easy. It took three months of learning, testing, and fixing production issues. But it fundamentally changed how we think about infrastructure. We went from reactive firefighting to proactive capacity planning. From manual deployments with crossed fingers to automated rollouts we barely think about.

Here are the key takeaways from our journey:

Start with containerization. Before you even think about Kubernetes, get comfortable with Docker. Build production-ready images with multi-stage builds, proper security practices, and health checks. This alone will improve your deployment process.

Kubernetes isn't magic—it's automation. It won't fix a poorly architected application or solve problems you don't understand. But if you're already doing manual scaling, manual deployments, and manual recovery from failures, Kubernetes can automate all of that.

Resource configuration is critical. Set appropriate requests and limits based on actual load testing. Don't guess. Too low and your pods will crash; too high and you'll waste money. Monitor actual usage and adjust iteratively.

Graceful shutdown is non-negotiable. Handle SIGTERM properly in your application. Configure pod disruption budgets. Set appropriate termination grace periods. These details make the difference between zero-downtime deployments and user-facing errors.

Monitoring must come first, not later. Set up Prometheus, Grafana, and structured logging before you deploy to production. You can't debug what you can't see. Correlation IDs and distributed tracing will save you hours of debugging time.

Autoscaling requires multiple metrics. CPU alone isn't enough for Node.js applications. Use memory, custom application metrics (like request rate), and configure scale-up to be aggressive while scale-down is conservative.

Start small and iterate. We didn't build everything at once. We started with a basic deployment, added autoscaling, then added monitoring, then optimized resource usage. Each iteration taught us something and improved our setup.

The operational overhead is real but manageable. Kubernetes adds complexity. You need to understand networking, storage, security policies, and YAML (so much YAML). But the operational burden of managing infrastructure manually was higher. We spend less time on infrastructure now than we did before Kubernetes.

Cost optimization matters. Right-sizing your resource requests can cut costs by 40%+. Use spot instances for non-critical workloads. Monitor your cluster utilization and adjust node sizes accordingly. The cloud bill will surprise you if you're not careful.

Document everything. Your future self (and your teammates) will thank you. Document why you made certain configuration decisions, what the resource limits represent, how to debug common issues. We maintain a runbook that's been invaluable during incidents.

Is Kubernetes right for every application? No. If you're running a simple CRUD app with predictable traffic, a PaaS like Heroku or a managed service like AWS Elastic Beanstalk might be better choices. But if you're experiencing rapid growth, unpredictable traffic patterns, or spending significant time on manual infrastructure management, Kubernetes can transform your operations.

The learning curve is steep, but the payoff is worth it. We now deploy 10-15 times per day without thinking about it. Our application automatically scales to handle traffic spikes. When pods crash, they restart automatically. When nodes fail, pods reschedule to healthy nodes. We sleep through the night even when our app is on the front page of Reddit.

That peace of mind—knowing your infrastructure can handle whatever gets thrown at it—is what makes the investment in Kubernetes worthwhile. Three months of pain for years of operational stability. For us, that was a trade worth making.
