Scaling Laravel to 100k Users: Production Guide & Real Metrics - NextGenBeing

Scaling a Laravel Application to 100,000 Users: Battle-Tested Strategies from Production

Learn the hard-won lessons from scaling a Laravel app to 100k+ users. Real performance numbers, database optimization, caching strategies, and infrastructure decisions that actually worked.

Operating Systems 28 min read
Daniel Hartwell

May 11, 2026
Photo by gautam sharma on Unsplash


You're sitting at your desk on a Monday morning, coffee in hand, when your phone explodes with notifications. Your Laravel app just hit 50,000 concurrent users—double what you planned for—and response times have climbed from 200ms to 8 seconds. The database connection pool is maxed out, Redis is throwing timeout errors, and your CEO is asking why the app feels "sluggish."

I've been there. Twice, actually.

The first time was in 2022 when our SaaS platform unexpectedly went viral on Product Hunt. We scaled from 5,000 users to 80,000 in 72 hours. Our carefully architected Laravel 9 application, which had been humming along beautifully, suddenly became a liability. Database queries that took 50ms were now timing out. Our single Redis instance was pegged at 100% CPU. Queue workers couldn't keep up with the job backlog.

The second time was last year when we onboarded a major enterprise client who brought 120,000 users to our platform in one migration weekend. This time, we were ready. We'd learned from our mistakes, implemented proper scaling strategies, and the migration went smoothly. Response times stayed under 150ms, database CPU never exceeded 60%, and we handled the load without breaking a sweat.

Here's everything I learned about scaling Laravel applications to handle 100,000+ users, complete with the mistakes we made, the strategies that worked, and the real performance numbers from production.

The Reality Check: What "100,000 Users" Actually Means

Before diving into solutions, let's get specific about what we're dealing with. When I say "100,000 users," I don't mean 100,000 registered accounts sitting idle in your database. I mean active users generating real load.

In our case, 100,000 users translated to:

  • 8-12 million requests per day during normal operation
  • Peak load of 1,500-2,000 requests per second during business hours
  • Database: 15,000-20,000 queries per second at peak
  • Queue jobs: 500,000-800,000 jobs processed daily
  • Cache hits: 50 million per day with a 92% hit rate
  • Storage: 2TB of user-generated content growing at 50GB/week

Your numbers will vary based on your application's nature. A real-time chat app will have different characteristics than an e-commerce platform or a content management system. But these figures give you a baseline for what "scale" actually looks like in production.

The Database Layer: Where Most Scaling Problems Start

When we first hit scaling issues, 80% of our problems traced back to the database. Not because PostgreSQL (we use Postgres, though MySQL faces similar issues) couldn't handle the load—it absolutely can—but because we were using it wrong.

The N+1 Query Problem That Cost Us $4,000/Month

Our biggest database bottleneck was embarrassingly simple: N+1 queries everywhere. We had code like this running thousands of times per minute:

// This innocent-looking code was killing us
public function getUserDashboard(User $user)
{
    $projects = $user->projects; // 1 query
    
    $projectData = [];
    foreach ($projects as $project) {
        $projectData[] = [
            'name' => $project->name,
            'team' => $project->team->name, // N queries here
            'tasks' => $project->tasks->count(), // N more queries
            'latest_activity' => $project->activities()->latest()->first() // And N more
        ];
    }
    
    return $projectData;
}

For a user with 50 projects, this generated 151 queries. Multiply that by hundreds of concurrent users, and our database was drowning.

Here's what we changed it to:

public function getUserDashboard(User $user)
{
    $projects = $user->projects()
        ->with(['team', 'activities' => function ($query) {
            $query->latest()->limit(1);
        }])
        ->withCount('tasks')
        ->get();
    
    return $projects->map(function ($project) {
        return [
            'name' => $project->name,
            'team' => $project->team->name,
            'tasks' => $project->tasks_count,
            'latest_activity' => $project->activities->first()
        ];
    });
}

This reduced it to 3 queries total, regardless of how many projects the user had. Response time dropped from 3.2 seconds to 180ms for users with large project lists.

But here's the gotcha: Laravel's query log records every query but won't flag which ones are duplicates. You need to use Laravel Debugbar or Telescope in development to actually see N+1 problems. In production, we use Blackfire to profile real user requests.

The command I run to find N+1 issues:

php artisan telescope:install
php artisan migrate

# Then in your browser, visit /telescope/queries
# Sort by "Duplicates" column - anything over 10 is suspicious
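If you're on Laravel 8.43 or newer, you can also make N+1s impossible to miss during development by turning every lazy load into an exception. A minimal sketch for your service provider:

```php
// app/Providers/AppServiceProvider.php
use Illuminate\Database\Eloquent\Model;

public function boot()
{
    // Throw a LazyLoadingViolationException whenever a relationship is
    // lazy-loaded outside production, so N+1s fail loudly in dev and CI
    // instead of silently degrading performance (Laravel 8.43+).
    Model::preventLazyLoading(! app()->isProduction());
}
```

Your test suite will then fail on the exact line that triggers a lazy load, which is far faster than eyeballing Telescope's duplicates.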

Database Indexing: The Difference Between 50ms and 5000ms

We had a tasks table with 8 million rows. A simple query to find incomplete tasks for a project was taking 4.8 seconds:

SELECT * FROM tasks 
WHERE project_id = 12345 
AND status != 'completed' 
ORDER BY due_date ASC 
LIMIT 20;

Running EXPLAIN ANALYZE showed a sequential scan through 2.1 million rows:

Seq Scan on tasks  (cost=0.00..89234.00 rows=2100000 width=1024) (actual time=4782.234..4782.234 rows=20 loops=1)
  Filter: ((project_id = 12345) AND ((status)::text <> 'completed'::text))
  Rows Removed by Filter: 2099980
Planning Time: 0.234 ms
Execution Time: 4782.456 ms

We added a composite index:

// In your migration
Schema::table('tasks', function (Blueprint $table) {
    $table->index(['project_id', 'status', 'due_date']);
});

After the index (which took 12 minutes to build on our production database):

Index Scan using tasks_project_status_date_idx on tasks  (cost=0.43..8.45 rows=20 width=1024) (actual time=0.234..0.456 rows=20 loops=1)
  Index Cond: ((project_id = 12345) AND ((status)::text <> 'completed'::text))
Planning Time: 0.123 ms
Execution Time: 0.678 ms

Query time: 0.68ms. That's a 7,000x improvement.

Here's my workflow for finding missing indexes:

# Install pg_stat_statements (PostgreSQL)
# Add to postgresql.conf: shared_preload_libraries = 'pg_stat_statements'

# Then query for slow queries (note: on PostgreSQL 13+ these columns are
# named total_exec_time, mean_exec_time, and max_exec_time)
SELECT 
    query,
    calls,
    total_time,
    mean_time,
    max_time
FROM pg_stat_statements
WHERE mean_time > 100  -- queries averaging over 100ms
ORDER BY mean_time DESC
LIMIT 20;

For each slow query, run EXPLAIN ANALYZE and look for:

  • Sequential scans on large tables
  • High cost estimates (> 10,000)
  • Long execution times

Common indexing mistakes we made:

  1. Over-indexing: We initially added indexes on every column. Bad idea. Each index slows down writes and takes up space. We had 47 indexes on our users table. We cut that down to 12 and saw INSERT performance improve by 40%.

  2. Wrong column order in composite indexes: The order matters. For WHERE project_id = X AND status = Y, you want index(project_id, status), not index(status, project_id). The leftmost column should be the most selective.

  3. Not using partial indexes: For queries like WHERE deleted_at IS NULL, a partial index is more efficient:

DB::statement('CREATE INDEX tasks_active_idx ON tasks (project_id, status) WHERE deleted_at IS NULL');

Connection Pool Exhaustion: The Silent Killer

At 60,000 concurrent users, we started seeing this error randomly:

SQLSTATE[08006] [7] FATAL: sorry, too many clients already

Our database was configured with max_connections = 100, and we were hitting that limit during traffic spikes. But here's the thing: increasing max_connections isn't always the answer.

We were using Laravel's default database configuration, which creates a new connection for every request. With 200 concurrent PHP-FPM workers, we could theoretically need 200 database connections. But most of those connections sat idle most of the time.

Solution 1: Connection pooling with PgBouncer

We deployed PgBouncer in transaction pooling mode:

# pgbouncer.ini
[databases]
myapp = host=postgres-primary.internal port=5432 dbname=myapp

[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3

This allows 1,000 application connections but only maintains 25 actual database connections. When a query completes, the connection is immediately returned to the pool.

In Laravel's .env:

DB_HOST=pgbouncer.internal
DB_PORT=6432

Result: Database connection count dropped from 180 average to 28 average. We eliminated connection timeout errors completely.
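One gotcha with transaction pooling: server-side prepared statements don't carry across pooled connections, and pdo_pgsql prepares statements by default. If you start seeing "prepared statement ... does not exist" errors behind PgBouncer, the usual fix is emulated prepares. A sketch (verify against your driver version):

```php
// config/database.php - pgsql connection
'pgsql' => [
    // ... existing connection settings ...
    'options' => [
        // Transaction pooling hands the same server connection to many
        // clients, so a named prepared statement created on one client's
        // turn may be gone on the next. Emulated prepares sidestep this.
        PDO::ATTR_EMULATE_PREPARES => true,
    ],
],
```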

Solution 2: Read replicas for read-heavy workloads

Our application is 85% reads, 15% writes. We set up two read replicas and configured Laravel to use them:

// config/database.php
'pgsql' => [
    'read' => [
        'host' => [
            'postgres-replica-1.internal',
            'postgres-replica-2.internal',
        ],
    ],
    'write' => [
        'host' => 'postgres-primary.internal',
    ],
    'driver' => 'pgsql',
    'database' => env('DB_DATABASE'),
    'username' => env('DB_USERNAME'),
    'password' => env('DB_PASSWORD'),
    // ... other config
],

Laravel automatically routes reads to replicas and writes to the primary. But watch out for replication lag. We've seen lag spike to 5-10 seconds during heavy write periods, causing users to not see their own updates.

Our solution for replication lag:

// Force reads onto the primary for the current user's own data.
// useWritePdo() runs this SELECT on the write connection instead of a replica.
public function getUserProjects(User $user)
{
    return DB::connection('pgsql')->table('projects')
        ->where('user_id', $user->id)
        ->useWritePdo()
        ->get();
}

// Or enable the "sticky" option on the connection so that any request
// that performs a write keeps reading from the primary afterwards.
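Laravel's built-in answer for request-level read-after-write consistency is the sticky connection option: with 'sticky' => true, any request that performs a write routes its subsequent reads to the primary for the remainder of that request, so users always see their own updates:

```php
// config/database.php
'pgsql' => [
    'read' => [
        'host' => [
            'postgres-replica-1.internal',
            'postgres-replica-2.internal',
        ],
    ],
    'write' => [
        'host' => 'postgres-primary.internal',
    ],
    // After any write in the current request cycle, reads are "stuck"
    // to the write connection for the rest of the request.
    'sticky' => true,
    'driver' => 'pgsql',
    // ... other config
],
```

This doesn't help across requests (a write in one request, a read in the next), so the useWritePdo() approach is still useful for data that must always be fresh.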

Database Query Optimization: Real Examples from Production

Example 1: Subquery optimization

We had a dashboard query that was taking 2.3 seconds:

$users = User::whereHas('projects', function ($query) {
    $query->where('status', 'active')
          ->whereHas('tasks', function ($q) {
              $q->where('completed', false);
          });
})->get();

This generated a nested subquery that Postgres couldn't optimize well. We rewrote it using joins:

$users = User::join('projects', 'users.id', '=', 'projects.user_id')
    ->join('tasks', 'projects.id', '=', 'tasks.project_id')
    ->where('projects.status', 'active')
    ->where('tasks.completed', false)
    ->select('users.*')
    ->distinct()
    ->get();

Query time: 340ms. Still not great, but 7x faster.

Then we realized we were loading full user objects when we only needed IDs for the next step. Final version:

$userIds = DB::table('users')
    ->join('projects', 'users.id', '=', 'projects.user_id')
    ->join('tasks', 'projects.id', '=', 'tasks.project_id')
    ->where('projects.status', 'active')
    ->where('tasks.completed', false)
    ->distinct()
    ->pluck('users.id');

$users = User::whereIn('id', $userIds)->get();

Query time: 68ms. Now we're talking.

Example 2: Batch operations instead of loops

We had code that updated task statuses one at a time:

foreach ($taskIds as $taskId) {
    Task::where('id', $taskId)->update(['status' => 'completed']);
}

For 100 tasks, this generated 100 UPDATE queries. We changed it to:

Task::whereIn('id', $taskIds)->update(['status' => 'completed']);

One query. 100x reduction in database round trips.
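When each row needs a different value, upsert() (Laravel 8+) still keeps it to a single statement. A sketch:

```php
// One INSERT ... ON CONFLICT DO UPDATE instead of N separate UPDATEs.
// The second argument names the unique key, the third lists the
// columns to update when a matching row already exists.
Task::upsert(
    [
        ['id' => 101, 'status' => 'completed'],
        ['id' => 102, 'status' => 'in_progress'],
    ],
    ['id'],
    ['status']
);
```

Note that upsert() bypasses Eloquent model events, so any observers on Task won't fire.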

Caching Strategy: The 10x Performance Multiplier

After optimizing our database, we still had response times around 150-300ms for common pages. That's acceptable, but we wanted faster. Caching got us there.

Redis Configuration: What the Docs Don't Tell You

We started with a single Redis instance using Laravel's default cache configuration. It worked fine until about 40,000 users, then we started seeing timeout errors:

RedisException: read error on connection

Problem 1: Default timeout too low

Laravel's default Redis timeout is 0 (no timeout), which sounds good but actually causes problems under load. We set explicit timeouts:

// config/database.php
'redis' => [
    'client' => 'phpredis',
    'options' => [
        'cluster' => env('REDIS_CLUSTER', 'redis'),
        'prefix' => env('REDIS_PREFIX', Str::slug(env('APP_NAME', 'laravel'), '_').'_database_'),
    ],
    'default' => [
        'url' => env('REDIS_URL'),
        'host' => env('REDIS_HOST', '127.0.0.1'),
        'password' => env('REDIS_PASSWORD', null),
        'port' => env('REDIS_PORT', '6379'),
        'database' => env('REDIS_DB', '0'),
        'read_timeout' => 2,
        'timeout' => 2,
        'retry_interval' => 100,
    ],
],

Problem 2: Memory eviction policy

Our Redis instance kept running out of memory because we didn't set an eviction policy. When memory was full, Redis started refusing writes. We configured:

# redis.conf
maxmemory 4gb
maxmemory-policy allkeys-lru

This tells Redis to evict least-recently-used keys when memory is full, rather than refusing writes.

Problem 3: Persistence slowing down writes

We had Redis configured with both RDB snapshots and AOF (append-only file) persistence. During snapshot writes, Redis would block, causing timeout errors. For cache data (which can be regenerated), we disabled persistence entirely:

# redis.conf
save ""
appendonly no

If Redis crashes, we lose the cache, but that's fine—it rebuilds automatically from the database.

Cache Warming: The Strategy That Saved Our Launch

When we onboarded that 120,000-user enterprise client, we knew the initial flood of requests would hammer our database as caches were empty. We implemented cache warming:

// app/Console/Commands/WarmCache.php
class WarmCache extends Command
{
    protected $signature = 'cache:warm';
    
    public function handle()
    {
        $this->info('Warming user caches...');
        
        // Get most active users (by request count from logs)
        $activeUserIds = DB::table('request_logs')
            ->where('created_at', '>', now()->subDays(7))
            ->select('user_id', DB::raw('COUNT(*) as request_count'))
            ->groupBy('user_id')
            ->orderByDesc('request_count')
            ->limit(10000)
            ->pluck('user_id');
        
        $bar = $this->output->createProgressBar($activeUserIds->count());
        
        foreach ($activeUserIds as $userId) {
            $user = User::find($userId);
            
            // Warm dashboard cache
            Cache::remember("user.{$userId}.dashboard", 3600, function () use ($user) {
                return $this->generateDashboard($user);
            });
            
            // Warm projects cache
            Cache::remember("user.{$userId}.projects", 3600, function () use ($user) {
                return $user->projects()->with('team')->get();
            });
            
            $bar->advance();
        }
        
        $bar->finish();
        $this->info("\nCache warming complete!");
    }
}

We ran this command 30 minutes before the migration cutover:

php artisan cache:warm

It took 18 minutes to warm caches for the top 10,000 users. When traffic hit, our cache hit rate was 78% immediately instead of starting at 0%. Database load stayed manageable.

Cache Invalidation: The Hard Part

Cache invalidation is famously difficult. We made every mistake in the book before landing on a strategy that works.

Mistake 1: Never invalidating

Our first approach was to set long TTLs (24 hours) and never invalidate. Users would see stale data until the cache expired. Not great for a collaborative app where changes need to be visible immediately.

Mistake 2: Invalidating too aggressively

Then we swung the other way and invalidated everything related to a user whenever anything changed:

// This ran on every task update
public function updated(Task $task)
{
    Cache::forget("user.{$task->user_id}.dashboard");
    Cache::forget("user.{$task->user_id}.projects");
    Cache::forget("project.{$task->project_id}.tasks");
    // ... 10 more cache keys
}

Our cache hit rate dropped to 45%. We were invalidating too much.

Our current strategy: Granular cache keys with selective invalidation

// Cache keys are specific to the data they contain
Cache::remember("project.{$projectId}.tasks.incomplete", 3600, function () use ($projectId) {
    return Task::where('project_id', $projectId)
        ->where('status', '!=', 'completed')
        ->get();
});

// Only invalidate what actually changed
public function updated(Task $task)
{
    if ($task->wasChanged('status')) {
        Cache::forget("project.{$task->project_id}.tasks.incomplete");
        Cache::forget("project.{$task->project_id}.tasks.completed");
    }
    
    // Don't invalidate the entire dashboard
    // Just invalidate the task count
    Cache::forget("project.{$task->project_id}.task_count");
}

This gave us a 91% cache hit rate while keeping data fresh.

Cache Tags: The Feature We Should Have Used Earlier

Laravel supports cache tags with Redis, which makes invalidation so much easier:

// Store with tags
Cache::tags(['projects', "user:{$userId}"])->put("projects.{$userId}", $projects, 3600);

// Invalidate all caches for a user
Cache::tags("user:{$userId}")->flush();

// Invalidate all project caches
Cache::tags('projects')->flush();

We refactored our caching to use tags, which simplified our invalidation logic significantly:

public function getUserProjects(User $user)
{
    return Cache::tags(['projects', "user:{$user->id}"])
        ->remember("projects.{$user->id}", 3600, function () use ($user) {
            return $user->projects()->with('team')->get();
        });
}

// When a project is updated, invalidate all related caches
public function updated(Project $project)
{
    Cache::tags(['projects', "user:{$project->user_id}"])->flush();
}

Gotcha: Cache tags only work with Redis and Memcached. They don't work with file or database cache drivers.

Queue Architecture: Handling Background Jobs at Scale

At 100,000 users, we were processing 800,000 queue jobs per day. Our queue architecture went through three major iterations before we got it right.

Iteration 1: Single Queue (Failed at 30,000 Users)

Initially, we had one queue handling everything:

// Everything went to the default queue
SendWelcomeEmail::dispatch($user);
ProcessLargeReport::dispatch($reportId);
SendNotification::dispatch($notification);

Problems:

  • Long-running jobs (report processing) blocked quick jobs (notifications)
  • No prioritization
  • If one job type failed repeatedly, it clogged the entire queue

Iteration 2: Multiple Queues by Job Type

We split into multiple queues:

// config/queue.php
'connections' => [
    'redis' => [
        'driver' => 'redis',
        'connection' => 'default',
        'queue' => env('REDIS_QUEUE', 'default'),
        'retry_after' => 90,
        'block_for' => null,
    ],
],

// Different queues for different job types
SendWelcomeEmail::dispatch($user)->onQueue('emails');
ProcessLargeReport::dispatch($reportId)->onQueue('reports');
SendNotification::dispatch($notification)->onQueue('notifications');

Then we ran separate workers for each queue:

php artisan queue:work --queue=notifications --tries=3 --timeout=30
php artisan queue:work --queue=emails --tries=3 --timeout=60
php artisan queue:work --queue=reports --tries=2 --timeout=300

This worked much better. Fast jobs weren't blocked by slow jobs.

Iteration 3: Priority Queues and Horizon

We adopted Laravel Horizon for queue management, which gave us:

  • Real-time queue monitoring
  • Automatic worker balancing
  • Failed job management
  • Metrics and insights

Our Horizon configuration:

// config/horizon.php
'environments' => [
    'production' => [
        'supervisor-1' => [
            'connection' => 'redis',
            'queue' => ['critical', 'high', 'default', 'low'],
            'balance' => 'auto',
            'processes' => 20,
            'tries' => 3,
            'timeout' => 300,
        ],
    ],
],

Jobs are processed in priority order:

// Critical: User-facing actions (< 1 second)
SendNotification::dispatch($notification)->onQueue('critical');

// High: Important but not immediate (< 10 seconds)
SendWelcomeEmail::dispatch($user)->onQueue('high');

// Default: Regular background work (< 1 minute)
ProcessAnalytics::dispatch($data)->onQueue('default');

// Low: Cleanup, maintenance (can take minutes)
CleanupOldLogs::dispatch()->onQueue('low');

Queue Optimization: Real Performance Gains

1. Batch jobs to reduce overhead

We had a notification system that sent individual emails:

foreach ($users as $user) {
    SendEmail::dispatch($user, $message);
}

For 10,000 users, this created 10,000 jobs. Queue processing overhead was significant. We switched to batching:

Bus::batch(
    $users->chunk(100)->map(function ($userChunk) use ($message) {
        return new SendBulkEmail($userChunk, $message);
    })
)->dispatch();

Now we create 100 jobs instead of 10,000. Queue processing time dropped from 45 minutes to 8 minutes.
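For reference, a chunked job like SendBulkEmail is nothing exotic. This is a sketch rather than our exact class (BulkMessage stands in for whatever Mailable you send), but the shape matters: jobs dispatched through Bus::batch() need the Batchable trait.

```php
use Illuminate\Bus\Batchable;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Collection;
use Illuminate\Support\Facades\Mail;

class SendBulkEmail implements ShouldQueue
{
    // Batchable is required for jobs dispatched via Bus::batch().
    use Batchable, Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(
        private Collection $users,   // one chunk of ~100 users
        private string $message
    ) {}

    public function handle(): void
    {
        // Skip the remaining work if the batch was cancelled mid-flight.
        if ($this->batch()?->cancelled()) {
            return;
        }

        // One job pays the queue overhead once per chunk instead of
        // once per recipient. BulkMessage is a hypothetical Mailable.
        foreach ($this->users as $user) {
            Mail::to($user->email)->send(new BulkMessage($this->message));
        }
    }
}
```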

2. Job-specific retry strategies

Not all jobs should retry the same way. We implemented custom retry logic:

class ProcessPayment implements ShouldQueue
{
    public $tries = 5;
    public $backoff = [60, 300, 900, 3600]; // 1min, 5min, 15min, 1hour
    
    public function handle()
    {
        // Payment processing logic
    }
    
    public function failed(Throwable $exception)
    {
        // Notify admin, log to external service
        Log::error('Payment processing failed', [
            'job_id' => $this->job->getJobId(),
            'exception' => $exception->getMessage(),
        ]);
    }
}

3. Monitoring queue health

We built a dashboard showing:

  • Jobs processed per minute
  • Average job duration by queue
  • Failed job rate
  • Queue depth (jobs waiting)

When queue depth exceeds 10,000, we auto-scale workers:

// Monitoring script (runs every minute)
$queueDepth = Redis::llen('queues:default');

if ($queueDepth > 10000) {
    // Scale up workers (using AWS Auto Scaling)
    $this->scaleWorkers(40);
} elseif ($queueDepth < 1000) {
    // Scale down to save costs
    $this->scaleWorkers(20);
}
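One caveat with reading the raw Redis list: LLEN only counts jobs that are immediately ready, while delayed and reserved jobs live in separate sorted sets. Laravel's Queue facade accounts for all three, so a safer depth check looks like:

```php
use Illuminate\Support\Facades\Queue;

// Counts pending + delayed + reserved jobs for the queue, which is a
// truer backlog figure than LLEN on the raw 'queues:default' list.
$queueDepth = Queue::size('default');
```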

Application-Level Optimizations

Eager Loading Relationships: Beyond the Basics

We've covered N+1 queries, but there are more subtle performance issues with Eloquent relationships.

Nested eager loading:

// Bad: Loads all tasks for all projects, even if we only need incomplete ones
$user->load('projects.tasks');

// Better: Constrain the relationship
$user->load(['projects.tasks' => function ($query) {
    $query->where('status', '!=', 'completed')
          ->orderBy('due_date')
          ->limit(10);
}]);

Lazy eager loading to avoid memory issues:

When dealing with large datasets, eager loading everything at once can exhaust memory. We use chunking:

// Bad: Loads all 100,000 users into memory
User::with('projects')->get();

// Better: Process in chunks
User::with('projects')->chunk(1000, function ($users) {
    foreach ($users as $user) {
        // Process user
    }
});
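When you don't need the rows in batches, lazy() (Laravel 8+) gives the same flat memory profile with a simpler loop. Unlike cursor(), it chunks behind the scenes, so eager loading still works. A sketch:

```php
// lazy() pages through results internally (like chunk) but exposes them
// as a single LazyCollection, keeping roughly one chunk in memory.
User::with('projects')->lazy(1000)->each(function ($user) {
    // Process one user at a time.
});
```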

Response Caching with HTTP Cache Headers

We implemented response caching for API endpoints that don't change frequently:

Route::get('/api/projects/{project}', function (Project $project) {
    return response()->json($project)
        ->header('Cache-Control', 'public, max-age=300') // 5 minutes
        ->header('ETag', md5($project->updated_at));
});

This allows browsers and CDNs to cache responses, reducing load on our servers.

For pages that change based on user state, we use conditional requests:

public function show(Request $request, Project $project)
{
    $etag = md5($project->updated_at . $request->user()->id);
    
    if ($request->header('If-None-Match') === $etag) {
        return response()->noContent(304);
    }
    
    return response()->json($project)
        ->header('ETag', $etag);
}

When the client sends an If-None-Match header matching the ETag, we return 304 Not Modified instead of the full response. This saved us approximately 40% of bandwidth.

Session Management at Scale

Laravel's default session driver is file, which doesn't work well at scale. We switched to Redis:

// .env
SESSION_DRIVER=redis
SESSION_CONNECTION=sessions

// config/database.php
'redis' => [
    'sessions' => [
        'host' => env('REDIS_SESSION_HOST', '127.0.0.1'),
        'password' => env('REDIS_SESSION_PASSWORD', null),
        'port' => env('REDIS_SESSION_PORT', '6379'),
        'database' => 1, // Separate database from cache
    ],
],

Why a separate Redis database for sessions?

Sessions have different access patterns than cache. Cache can be evicted; sessions cannot. We run two Redis instances:

  • Cache Redis: LRU eviction, no persistence
  • Session Redis: No eviction, AOF persistence

This prevents sessions from being evicted when cache memory fills up.
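Concretely, the session instance's redis.conf inverts the cache instance's settings shown earlier (sizes illustrative):

```ini
# redis.conf (session instance)
maxmemory 2gb
maxmemory-policy noeviction   # never silently drop a session
appendonly yes                # AOF persistence so sessions survive restarts
appendfsync everysec
```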

Infrastructure and Deployment

Horizontal Scaling: The Multi-Server Setup

At 100,000 users, a single server isn't enough. Our production infrastructure:

Load Balancer (AWS ALB):

  • Distributes traffic across web servers
  • SSL termination
  • Health checks every 30 seconds

Web Servers (4x AWS EC2 c5.2xlarge):

  • Nginx + PHP-FPM
  • 32 PHP-FPM workers per server (128 total)
  • Auto-scaling: scales to 8 servers during peak traffic

Database (AWS RDS PostgreSQL):

  • db.r5.4xlarge (16 vCPU, 128GB RAM)
  • 2 read replicas for read-heavy queries
  • Automated backups, point-in-time recovery

Cache (AWS ElastiCache Redis):

  • cache.r5.2xlarge (8 vCPU, 52GB RAM)
  • Redis 7.x in cluster mode (3 shards)
  • Separate instance for sessions

Queue Workers (4x AWS EC2 c5.xlarge):

  • 20 Horizon workers per server (80 total)
  • Auto-scales based on queue depth

Total monthly cost: ~$8,500/month (varies with auto-scaling)

PHP-FPM Configuration for High Concurrency

Default PHP-FPM settings don't work at scale. Here's our production configuration:

; /etc/php/8.2/fpm/pool.d/www.conf

[www]
user = www-data
group = www-data
listen = /run/php/php8.2-fpm.sock

; Process manager
pm = dynamic
pm.max_children = 50
pm.start_servers = 10
pm.min_spare_servers = 10
pm.max_spare_servers = 20
pm.max_requests = 500

; Timeouts
request_terminate_timeout = 30s
request_slowlog_timeout = 10s
slowlog = /var/log/php-fpm-slow.log

; Resource limits
php_admin_value[memory_limit] = 256M
php_admin_value[max_execution_time] = 30

Key settings explained:

  • pm.max_children = 50: Maximum 50 concurrent requests per server. With 4 servers, that's 200 concurrent requests.
  • pm.max_requests = 500: Restart workers after 500 requests to prevent memory leaks.
  • request_terminate_timeout = 30s: Kill requests taking longer than 30 seconds.

How we calculated max_children:

Available RAM: 16GB
RAM per PHP process: 128MB average (measured with top)
Reserved for system: 2GB

max_children = (16GB - 2GB) / 128MB = 14,336MB / 128MB = 112

We set it to 50 to leave headroom for spikes.

Nginx Configuration for Performance

# /etc/nginx/sites-available/myapp

upstream php-fpm {
    server unix:/run/php/php8.2-fpm.sock;
}

server {
    listen 80;
    server_name myapp.com;
    root /var/www/myapp/public;
    
    index index.php;
    
    # Gzip compression
    gzip on;
    gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
    gzip_min_length 1000;
    
    # Security headers
    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header X-XSS-Protection "1; mode=block" always;
    
    # Static file caching
    location ~* \.(jpg|jpeg|png|gif|ico|css|js|svg|woff|woff2|ttf|eot)$ {
        expires 1y;
        add_header Cache-Control "public, immutable";
    }
    
    # PHP requests
    location / {
        try_files $uri $uri/ /index.php?$query_string;
    }
    
    location ~ \.php$ {
        fastcgi_pass php-fpm;
        fastcgi_index index.php;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        include fastcgi_params;
        
        # Timeouts
        fastcgi_connect_timeout 60s;
        fastcgi_send_timeout 60s;
        fastcgi_read_timeout 60s;
        
        # Buffering
        fastcgi_buffer_size 32k;
        fastcgi_buffers 8 16k;
    }
}

Zero-Downtime Deployments

We use Laravel Envoy for deployments with a blue-green strategy:

@servers(['web' => ['deploy@web1.myapp.com', 'deploy@web2.myapp.com']])

@task('deploy', ['on' => 'web'])
    cd /var/www
    
    # Clone into new release directory
    git clone --depth 1 --branch {{ $branch }} git@github.com:myapp/myapp.git release-{{ $release }}
    
    cd release-{{ $release }}
    
    # Install dependencies
    composer install --no-dev --optimize-autoloader --no-interaction
    
    # Link storage
    ln -s /var/www/storage storage
    ln -s /var/www/.env .env
    
    # Run migrations (only on first server)
    @if ($loop->first)
        php artisan migrate --force
    @endif
    
    # Optimize
    php artisan config:cache
    php artisan route:cache
    php artisan view:cache
    
    # Switch symlink (atomic operation)
    ln -nfs /var/www/release-{{ $release }} /var/www/current
    
    # Reload PHP-FPM gracefully
    sudo systemctl reload php8.2-fpm
    
    # Clean up old releases (keep last 5)
    cd /var/www
    ls -t | grep release- | tail -n +6 | xargs rm -rf
@endtask

Deploy with:

envoy run deploy --branch=main --release=$(date +%Y%m%d%H%M%S)

The key to zero downtime is the atomic symlink switch (ln -nfs). Nginx serves from /var/www/current, which is a symlink. When we update the symlink, Nginx starts serving the new code immediately without restarting.

Monitoring and Observability

You can't scale what you can't measure. Here's our monitoring stack:

Application Performance Monitoring (APM)

We use Blackfire for production profiling:

# Install Blackfire probe
wget -O - https://packages.blackfire.io/gpg.key | sudo apt-key add -
echo "deb http://packages.blackfire.io/debian any main" | sudo tee /etc/apt/sources.list.d/blackfire.list
sudo apt-get update
sudo apt-get install blackfire-agent blackfire-php

# Configure with your credentials
sudo blackfire-agent --register
sudo systemctl restart blackfire-agent

We sample roughly 1 in 100 requests automatically:

// app/Http/Middleware/BlackfireProfile.php
public function handle($request, Closure $next)
{
    if (rand(1, 100) === 1) {
        $probe = new \BlackfireProbe();
        $probe->enable();
    }
    
    $response = $next($request);
    
    if (isset($probe)) {
        $probe->close();
    }
    
    return $response;
}

This gives us continuous profiling data without impacting performance.

Database Monitoring

We monitor:

  • Query execution time (p50, p95, p99)
  • Connection count
  • Cache hit ratio
  • Replication lag
  • Disk I/O

Query performance tracking:

// app/Providers/AppServiceProvider.php
public function boot()
{
    if (app()->environment('production')) {
        DB::listen(function ($query) {
            if ($query->time > 1000) { // Queries over 1 second
                Log::warning('Slow query detected', [
                    'sql' => $query->sql,
                    'bindings' => $query->bindings,
                    'time' => $query->time,
                    'url' => request()->url(),
                ]);
            }
        });
    }
}
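The p50/p95/p99 metrics listed above come from our APM in practice, but the computation itself is simple. Here's a minimal sketch using the nearest-rank method; the `percentile` function name and the sample data are illustrative:

```php
<?php
// Nearest-rank percentile over a window of query times (milliseconds).
function percentile(array $samples, float $p)
{
    sort($samples);
    $index = (int) ceil(($p / 100) * count($samples)) - 1;

    return $samples[max(0, $index)];
}

// A window of recent query times in milliseconds:
$queryTimesMs = [12, 48, 50, 51, 55, 60, 95, 120, 800, 1500];

echo percentile($queryTimesMs, 50) . "ms p50\n"; // 55ms p50
echo percentile($queryTimesMs, 95) . "ms p95\n";
echo percentile($queryTimesMs, 99) . "ms p99\n";
```

Note how a single 1,500ms outlier dominates the p95/p99 while leaving the median untouched — which is exactly why averages alone hide slow-query problems.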

Custom Metrics with StatsD

We send custom metrics to StatsD/Graphite:

use Domnikl\Statsd\Client;
use Domnikl\Statsd\Connection\UdpSocket;

$connection = new UdpSocket('statsd.internal', 8125);
$statsd = new Client($connection);

// Track API response times
$start = microtime(true);
$response = $this->processRequest($request);
$duration = (microtime(true) - $start) * 1000;

$statsd->timing('api.response_time', $duration);
$statsd->increment('api.requests');

// Track business metrics
$statsd->increment('projects.created');
$statsd->gauge('users.active', $activeUserCount);

We graph these in Grafana to spot trends and anomalies.

Real-World Scaling Scenarios and Solutions

Scenario 1: The Viral Product Hunt Launch

The situation: We launched on Product Hunt and hit the front page. Traffic went from 500 concurrent users to 5,000 in 30 minutes.

What broke:

  • Database connections maxed out (hit the 100 connection limit)
  • Redis ran out of memory (default 1GB wasn't enough)
  • Queue workers couldn't keep up with welcome emails

What we did (in order of impact):

  1. Immediately: Increased database max_connections from 100 to 300
  2. 5 minutes later: Scaled Redis from 1GB to 4GB
  3. 10 minutes later: Spun up 4 additional queue worker servers
  4. 30 minutes later: Implemented PgBouncer to pool database connections
  5. Next day: Added database read replicas
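Step 4's PgBouncer setup can be sketched as a minimal config — the database name, paths, and pool sizes here are illustrative, not our production values:

```ini
; /etc/pgbouncer/pgbouncer.ini
[databases]
app = host=127.0.0.1 port=5432 dbname=app

[pgbouncer]
listen_addr = 0.0.0.0
listen_port = 6432
auth_type = md5
auth_file = /etc/pgbouncer/userlist.txt
; transaction pooling lets hundreds of client connections share a small server pool
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 50
```

Point Laravel's DB_PORT at 6432 instead of 5432. One caveat worth testing: with pool_mode = transaction, session state (such as named prepared statements) doesn't survive across transactions, so check the behavior with your PDO settings before rolling it out.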

Result: Stabilized within 45 minutes. Learned to have auto-scaling configured before launches.

Scenario 2: The Enterprise Client Migration

The situation: Enterprise client with 120,000 users migrating over a weekend.

What we prepared:

  • Scaled infrastructure in advance (8 web servers, 6 queue workers)
  • Warmed caches for expected user patterns
  • Ran load tests simulating 2,000 requests/second
  • Set up real-time monitoring dashboard
  • Had team on standby

What actually happened:

  • Migration went smoothly for first 50,000 users
  • At 80,000 users, we noticed queue depth growing (email sending was bottleneck)
  • Scaled email queue workers from 20 to 60
  • At 100,000 users, database CPU spiked to 80% (lots of first-time logins)
  • Temporarily disabled non-critical background jobs
  • By Monday morning, everything stabilized

Lessons learned:

  • Load testing doesn't catch everything (we didn't simulate realistic login patterns)
  • Have a plan to disable non-critical features under load
  • Over-provision for migrations; scale down after
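The "disable non-critical features" lesson can be as simple as a flag checked before dispatching work. A sketch — in Laravel the flag would live in Redis or config, but here a plain array stands in, and `feature_enabled` is an illustrative helper:

```php
<?php
// Kill switch for non-critical work during load spikes.
function feature_enabled(array $flags, string $name): bool
{
    // Absent flags default to enabled, so forgetting to seed one is safe.
    return $flags[$name] ?? true;
}

// During an incident, ops flips the flag off:
$flags = ['non_critical_jobs' => false];

if (feature_enabled($flags, 'non_critical_jobs')) {
    // dispatch(new RecalculateStatsJob());
    echo "dispatching non-critical job\n";
} else {
    echo "skipping non-critical job under load\n"; // printed here
}
```

The important part is wiring the check in before you need it; adding a kill switch mid-incident is how 2am deploys happen.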

Scenario 3: The Slow Memory Leak

The situation: Over 3 weeks, we noticed memory usage slowly climbing on web servers. Eventually, servers would run out of memory and crash.

Debugging process:

  1. Week 1: Assumed it was normal growth. Increased server RAM from 8GB to 16GB.
  2. Week 2: Memory filled up again. Realized it was a leak.
  3. Week 3: Used Blackfire to profile memory usage. Found this code:
// In a middleware that ran on every request
public function handle($request, Closure $next)
{
    $this->userCache[] = $request->user(); // Accumulating users in memory!
    return $next($request);
}

The middleware was storing users in a class property that never got cleared. After 100,000 requests, we had 100,000 user objects in memory.

Fix:

public function handle($request, Closure $next)
{
    // Don't store in class property
    $user = $request->user();
    return $next($request);
}

Lesson: Memory leaks are insidious at scale. Profile regularly and watch for slow memory growth.

Advanced Performance Techniques

Query Result Caching with Rememberable

We built a query result cache that's smarter than basic caching:

// app/Traits/Rememberable.php
trait Rememberable
{
    // Exposed as a query scope so it can be chained onto the builder
    public function scopeRemember($query, $seconds = 3600)
    {
        return Cache::remember($this->getCacheKey($query), $seconds, function () use ($query) {
            return $query->get();
        });
    }

    protected function getCacheKey($query)
    {
        return md5($query->toSql() . serialize($query->getBindings()));
    }
}

// Use it in models
class Project extends Model
{
    use Rememberable;
}

// Usage
$projects = Project::where('status', 'active')->remember(3600);

This automatically caches any query result based on the SQL and bindings.

Preloading Data on the Client

For our dashboard, we preload data the user is likely to need:

// In our Vue.js app
export default {
  mounted() {
    // Load current page data
    this.loadDashboard();
    
    // Preload likely next pages in the background
    setTimeout(() => {
      this.preloadProjects();
      this.preloadTeam();
    }, 1000);
  },
  
  methods: {
    preloadProjects() {
      axios.get('/api/projects').then(response => {
        this.$store.commit('cacheProjects', response.data);
      });
    }
  }
}

When the user navigates to the projects page, data is already loaded. This makes the app feel instant.

Database Connection Multiplexing

For read-heavy workloads, we multiplex read queries across replicas:

// app/Services/DatabaseMultiplexer.php
class DatabaseMultiplexer
{
    // Connection names defined in config/database.php
    protected $replicas = ['replica-1', 'replica-2', 'replica-3'];

    // Under PHP-FPM this counter resets on every request, so the
    // round-robin is per-request; across many requests the load still
    // spreads evenly.
    protected $currentReplica = 0;
    
    public function query($sql, $bindings = [])
    {
        $replica = $this->getNextReplica();
        
        return DB::connection($replica)->select($sql, $bindings);
    }
    
    protected function getNextReplica()
    {
        $replica = $this->replicas[$this->currentReplica];
        $this->currentReplica = ($this->currentReplica + 1) % count($this->replicas);
        return $replica;
    }
}

This distributes reads evenly across replicas, preventing any single replica from becoming a bottleneck.

The Scaling Checklist

Here's the checklist we use before expecting traffic spikes:

Database:

  • All queries under 100ms (check with slow query log)
  • Indexes on all WHERE, ORDER BY, and JOIN columns
  • N+1 queries eliminated (verify with Telescope)
  • Connection pooling configured (PgBouncer)
  • Read replicas for read-heavy queries
  • Database backed up and recovery tested

Cache:

  • Redis scaled to handle expected load
  • Cache hit rate above 85%
  • Cache warming strategy for critical data
  • Cache tags implemented for easy invalidation
  • Separate Redis for sessions

Queues:

  • Horizon configured with proper queue priorities
  • Queue workers auto-scale based on depth
  • Failed job monitoring and alerting
  • Job batching for bulk operations
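The auto-scaling item above maps onto Horizon's balance settings. A sketch of the relevant config — queue names and process counts are placeholders, not our production values:

```php
// config/horizon.php (excerpt)
'environments' => [
    'production' => [
        'supervisor-1' => [
            'connection' => 'redis',
            'queue' => ['high', 'default', 'emails'],
            'balance' => 'auto',        // shift workers toward busy queues
            'minProcesses' => 5,
            'maxProcesses' => 60,
            'balanceMaxShift' => 5,     // workers added/removed per cycle
            'balanceCooldown' => 3,     // seconds between scaling decisions
        ],
    ],
],
```

With 'balance' => 'auto', Horizon watches queue depth and wait times and reallocates processes between queues automatically, which is exactly the "auto-scale based on depth" behavior on the checklist.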

Application:

  • Response times under 200ms for 95th percentile
  • Static assets on CDN
  • Gzip compression enabled
  • HTTP caching headers configured
  • Session storage on Redis
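The compression and HTTP-caching items above come down to a few nginx directives. A sketch — the /build/ path and MIME types are illustrative and depend on your asset pipeline:

```nginx
# Compress text responses
gzip on;
gzip_types text/css application/javascript application/json image/svg+xml;

# Long-lived, immutable caching for fingerprinted build assets
location /build/ {
    expires 30d;
    add_header Cache-Control "public, immutable";
}
```

The 30-day TTL is only safe because the build tool fingerprints filenames; never set immutable caching on assets whose URLs don't change between deploys.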

Infrastructure:

  • Auto-scaling configured and tested
  • Load balancer health checks working
  • Zero-downtime deployment process
  • Monitoring and alerting configured
  • Load testing completed

Monitoring:

  • APM (Blackfire/New Relic) configured
  • Database monitoring (query times, connections)
  • Queue monitoring (depth, processing rate)
  • Custom business metrics tracked
  • Error tracking (Sentry/Bugsnag)

Cost Optimization at Scale

Scaling isn't just about performance—it's also about cost. Here's how we keep costs reasonable:

Right-Sizing Instances

We started with oversized instances "to be safe." After monitoring for a month, we found we were using only 40% of CPU on average. We downsized:

  • Web servers: c5.2xlarge → c5.xlarge (saved $800/month)
  • Queue workers: c5.xlarge → c5.large (saved $400/month)

Total savings: $1,200/month without any performance impact.

Reserved Instances

For baseline capacity that runs 24/7, we use AWS Reserved Instances:

  • 2 web servers (always running): 1-year reserved (30% discount)
  • 1 database instance: 1-year reserved (30% discount)
  • 1 Redis instance: 1-year reserved (30% discount)

Savings: ~$2,000/month

Spot Instances for Queue Workers

Queue workers can be interrupted without user impact. We use spot instances:

# Launch template for spot-backed queue workers
# (attached to the workers' Auto Scaling group)
aws ec2 create-launch-template \
  --launch-template-name queue-workers \
  --launch-template-data '{
    "InstanceMarketOptions": {
      "MarketType": "spot",
      "SpotOptions": {
        "MaxPrice": "0.10",
        "SpotInstanceType": "one-time"
      }
    }
  }'

Savings: 70% off on-demand pricing (~$1,500/month)

Aggressive Cache TTLs

We increased cache TTLs for data that doesn't change frequently:

  • User profiles: 1 hour → 6 hours
  • Project lists: 30 minutes → 2 hours
  • Public pages: 5 minutes → 1 hour

This reduced database queries by 30% and cut our database instance size (and cost) by 25%.

What We'd Do Differently Next Time

Looking back, here's what we'd change:

  1. Start with read replicas earlier. We waited until we had performance problems. Should have set them up from day one.

  2. Implement proper monitoring before scaling issues. We added monitoring reactively. Should have had Blackfire and custom metrics from the start.

  3. Use Laravel Horizon from the beginning. We migrated from basic queue workers to Horizon after hitting scaling issues. The migration was painful.

  4. Load test regularly, not just before launches. We only load tested before big events. Should have been testing monthly to catch regressions.

  5. Document our scaling strategies. When issues happened at 2am, we scrambled to remember what to do. Now we have runbooks.

  6. Budget for scaling costs. We were surprised by infrastructure costs as we scaled. Should have modeled costs earlier.

The Real Numbers: Before and After

Here's what scaling from 10,000 to 100,000 users looked like for us:

Response Times:

  • Before: 850ms average, 3.2s p95
  • After: 140ms average, 280ms p95

Database Performance:

  • Before: 12,000 queries/sec peak, 45% CPU average
  • After: 18,000 queries/sec peak, 35% CPU average (with read replicas)

Cache Hit Rate:

  • Before: 62%
  • After: 91%

Queue Processing:

  • Before: 200,000 jobs/day, 2-hour backlog during peaks
  • After: 800,000 jobs/day, 5-minute backlog maximum

Infrastructure Costs:

  • Before: $2,800/month (single-server setup)
  • After: $8,500/month (scaled infrastructure)
  • Per-user cost: $0.28 → $0.085

Incident Frequency:

  • Before: 2-3 performance incidents per week
  • After: 1 incident per month

Final Thoughts

Scaling Laravel to 100,000 users isn't about rewriting your application or switching to a "more scalable" framework. It's about identifying bottlenecks, optimizing systematically, and implementing proper infrastructure.

The biggest lesson? Measure everything. We wasted weeks optimizing things that didn't matter because we were guessing instead of measuring. Once we had proper monitoring, we could focus on the 20% of issues causing 80% of problems.

Laravel is absolutely capable of handling 100,000+ users. We're now at 150,000 users and still using the same core architecture. The framework isn't the bottleneck—your database queries, caching strategy, and infrastructure are.

Start with the database. Fix your N+1 queries and add proper indexes. That alone will get you to 50,000 users. Then add Redis caching and read replicas. That'll get you to 100,000. After that, it's about horizontal scaling and optimization.

And remember: scale when you need to, not before. Premature optimization is real. We spent months preparing for scale we didn't hit for another year. That time could have been spent building features users actually wanted.

Daniel Hartwell

Author

Senior backend engineer focused on distributed systems and database performance. Previously at fintech and SaaS scale-ups. Writes about the boring-but-critical infrastructure that keeps systems running.