Daniel Hartwell
Scaling a Laravel Application to 100,000 Users: Battle-Tested Strategies from Production
You're sitting at your desk on a Monday morning, coffee in hand, when your phone explodes with notifications. Your Laravel app just hit 50,000 concurrent users—double what you planned for—and response times have climbed from 200ms to 8 seconds. The database connection pool is maxed out, Redis is throwing timeout errors, and your CEO is asking why the app feels "sluggish."
I've been there. Twice, actually.
The first time was in 2022 when our SaaS platform unexpectedly went viral on Product Hunt. We scaled from 5,000 users to 80,000 in 72 hours. Our carefully architected Laravel 9 application, which had been humming along beautifully, suddenly became a liability. Database queries that took 50ms were now timing out. Our single Redis instance was pegged at 100% CPU. Queue workers couldn't keep up with the job backlog.
The second time was last year when we onboarded a major enterprise client who brought 120,000 users to our platform in one migration weekend. This time, we were ready. We'd learned from our mistakes, implemented proper scaling strategies, and the migration went smoothly. Response times stayed under 150ms, database CPU never exceeded 60%, and we handled the load without breaking a sweat.
Here's everything I learned about scaling Laravel applications to handle 100,000+ users, complete with the mistakes we made, the strategies that worked, and the real performance numbers from production.
The Reality Check: What "100,000 Users" Actually Means
Before diving into solutions, let's get specific about what we're dealing with. When I say "100,000 users," I don't mean 100,000 registered accounts sitting idle in your database. I mean active users generating real load.
In our case, 100,000 users translated to:
- 8-12 million requests per day during normal operation
- Peak load of 1,500-2,000 requests per second during business hours
- Database: 15,000-20,000 queries per second at peak
- Queue jobs: 500,000-800,000 jobs processed daily
- Cache hits: 50 million per day with a 92% hit rate
- Storage: 2TB of user-generated content growing at 50GB/week
Your numbers will vary based on your application's nature. A real-time chat app will have different characteristics than an e-commerce platform or a content management system. But these figures give you a baseline for what "scale" actually looks like in production.
The Database Layer: Where Most Scaling Problems Start
When we first hit scaling issues, 80% of our problems traced back to the database. Not because PostgreSQL (we use Postgres, though MySQL faces similar issues) couldn't handle the load—it absolutely can—but because we were using it wrong.
The N+1 Query Problem That Cost Us $4,000/Month
Our biggest database bottleneck was embarrassingly simple: N+1 queries everywhere. We had code like this running thousands of times per minute:
// This innocent-looking code was killing us
public function getUserDashboard(User $user)
{
$projects = $user->projects; // 1 query
$projectData = [];
foreach ($projects as $project) {
$projectData[] = [
'name' => $project->name,
'team' => $project->team->name, // N queries here
'tasks' => $project->tasks->count(), // N more queries
'latest_activity' => $project->activities()->latest()->first() // And N more
];
}
return $projectData;
}
For a user with 50 projects, this generated 151 queries. Multiply that by hundreds of concurrent users, and our database was drowning.
Here's what we changed it to:
public function getUserDashboard(User $user)
{
// latestActivity is a one-of-many relation on Project:
// hasOne(Activity::class)->latestOfMany(). An eager-loaded limit(1)
// would apply across all projects combined, not one per project.
$projects = $user->projects()
    ->with(['team', 'latestActivity'])
    ->withCount('tasks')
    ->get();
return $projects->map(function ($project) {
    return [
        'name' => $project->name,
        'team' => $project->team->name,
        'tasks' => $project->tasks_count,
        'latest_activity' => $project->latestActivity,
    ];
});
}
This reduced it to 3 queries total, regardless of how many projects the user had. Response time dropped from 3.2 seconds to 180ms for users with large project lists.
But here's the gotcha: Laravel's query log shows you every query, but it doesn't flag duplicates. You need Laravel Debugbar or Telescope in development to actually spot N+1 problems. In production, we use Blackfire to profile real user requests.
The command I run to find N+1 issues:
php artisan telescope:install
php artisan migrate
# Then in your browser, visit /telescope/queries
# Sort by "Duplicates" column - anything over 10 is suspicious
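Telescope catches the N+1s you go looking for. If you're on Laravel 8.43 or later, you can also make Eloquent throw whenever a relationship is lazy-loaded outside production, so regressions fail loudly in development and CI. A minimal sketch:
// app/Providers/AppServiceProvider.php
use Illuminate\Database\Eloquent\Model;

public function boot()
{
    // Throws a LazyLoadingViolationException on any lazy load
    // in non-production environments
    Model::preventLazyLoading(! app()->isProduction());
}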
Database Indexing: The Difference Between 50ms and 5000ms
We had a tasks table with 8 million rows. A simple query to find incomplete tasks for a project was taking 4.8 seconds:
SELECT * FROM tasks
WHERE project_id = 12345
AND status != 'completed'
ORDER BY due_date ASC
LIMIT 20;
Running EXPLAIN ANALYZE showed a sequential scan through 2.1 million rows:
Seq Scan on tasks (cost=0.00..89234.00 rows=2100000 width=1024) (actual time=4782.234..4782.234 rows=20 loops=1)
Filter: ((project_id = 12345) AND ((status)::text <> 'completed'::text))
Rows Removed by Filter: 2099980
Planning Time: 0.234 ms
Execution Time: 4782.456 ms
We added a composite index:
// In your migration
Schema::table('tasks', function (Blueprint $table) {
$table->index(['project_id', 'status', 'due_date']);
});
After the index (which took 12 minutes to build on our production database):
Index Scan using tasks_project_status_date_idx on tasks (cost=0.43..8.45 rows=20 width=1024) (actual time=0.234..0.456 rows=20 loops=1)
Index Cond: ((project_id = 12345) AND ((status)::text <> 'completed'::text))
Planning Time: 0.123 ms
Execution Time: 0.678 ms
Query time: 0.68ms. That's a 7,000x improvement.
Here's my workflow for finding missing indexes:
# Install pg_stat_statements (PostgreSQL)
# Add to postgresql.conf: shared_preload_libraries = 'pg_stat_statements'
# Then query for slow queries
SELECT
query,
calls,
total_time,
mean_time,
max_time
FROM pg_stat_statements
WHERE mean_time > 100 -- queries averaging over 100ms
ORDER BY mean_time DESC
LIMIT 20;
For each slow query, run EXPLAIN ANALYZE and look for:
- Sequential scans on large tables
- High cost estimates (> 10,000)
- Long execution times
Common indexing mistakes we made:
- Over-indexing: We initially added indexes on every column. Bad idea. Each index slows down writes and takes up space. We had 47 indexes on our users table. We dropped that to 12 and saw INSERT performance improve by 40%.
- Wrong column order in composite indexes: The order matters. For WHERE project_id = X AND status = Y, you want index(project_id, status), not index(status, project_id). The leftmost column should be the most selective.
- Not using partial indexes: For queries like WHERE deleted_at IS NULL, a partial index is more efficient:
DB::statement('CREATE INDEX tasks_active_idx ON tasks (project_id, status) WHERE deleted_at IS NULL');
Connection Pool Exhaustion: The Silent Killer
At 60,000 concurrent users, we started seeing this error randomly:
SQLSTATE[08006] [7] FATAL: sorry, too many clients already
Our database was configured with max_connections = 100, and we were hitting that limit during traffic spikes. But here's the thing: increasing max_connections isn't always the answer.
We were using Laravel's default database configuration, which creates a new connection for every request. With 200 concurrent PHP-FPM workers, we could theoretically need 200 database connections. But most of those connections sat idle most of the time.
Solution 1: Connection pooling with PgBouncer
We deployed PgBouncer in transaction pooling mode:
# pgbouncer.ini
[databases]
myapp = host=postgres-primary.internal port=5432 dbname=myapp
[pgbouncer]
pool_mode = transaction
max_client_conn = 1000
default_pool_size = 25
reserve_pool_size = 5
reserve_pool_timeout = 3
This allows 1,000 application connections but only maintains 25 actual database connections. When a query completes, the connection is immediately returned to the pool.
In Laravel's .env:
DB_HOST=pgbouncer.internal
DB_PORT=6432
Result: Database connection count dropped from 180 average to 28 average. We eliminated connection timeout errors completely.
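One gotcha if you try this setup: in transaction pooling mode, PgBouncer can't track server-side prepared statements across pooled connections, so Laravel's PDO layer needs to emulate prepares. A minimal sketch of the config change, added to the pgsql connection:
// config/database.php
'pgsql' => [
    // ... existing connection settings
    'options' => [
        // Emulated prepares keep statements client-side, which is
        // required for PgBouncer's transaction pooling mode
        PDO::ATTR_EMULATE_PREPARES => true,
    ],
],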
Solution 2: Read replicas for read-heavy workloads
Our application is 85% reads, 15% writes. We set up two read replicas and configured Laravel to use them:
// config/database.php
'pgsql' => [
'read' => [
'host' => [
'postgres-replica-1.internal',
'postgres-replica-2.internal',
],
],
'write' => [
'host' => 'postgres-primary.internal',
],
'driver' => 'pgsql',
'database' => env('DB_DATABASE'),
'username' => env('DB_USERNAME'),
'password' => env('DB_PASSWORD'),
// ... other config
],
Laravel automatically routes reads to replicas and writes to the primary. But watch out for replication lag: we've seen it spike to 5-10 seconds during heavy write periods, which meant users sometimes couldn't see their own just-saved updates.
Our solution for replication lag:
// Force this read to hit the primary so users always see their own data
public function getUserProjects(User $user)
{
    return DB::connection('pgsql')->table('projects')
        ->where('user_id', $user->id)
        ->useWritePdo() // run this query on the write connection
        ->get();
}
// Or set 'sticky' => true on the connection in config/database.php:
// any request that performs a write reads from the primary afterwards
Database Query Optimization: Real Examples from Production
Example 1: Subquery optimization
We had a dashboard query that was taking 2.3 seconds:
$users = User::whereHas('projects', function ($query) {
$query->where('status', 'active')
->whereHas('tasks', function ($q) {
$q->where('completed', false);
});
})->get();
This generated a nested subquery that Postgres couldn't optimize well. We rewrote it using joins:
$users = User::join('projects', 'users.id', '=', 'projects.user_id')
->join('tasks', 'projects.id', '=', 'tasks.project_id')
->where('projects.status', 'active')
->where('tasks.completed', false)
->select('users.*')
->distinct()
->get();
Query time: 340ms. Still not great, but 7x faster.
Then we realized we were loading full user objects when we only needed IDs for the next step. Final version:
$userIds = DB::table('users')
->join('projects', 'users.id', '=', 'projects.user_id')
->join('tasks', 'projects.id', '=', 'tasks.project_id')
->where('projects.status', 'active')
->where('tasks.completed', false)
->distinct()
->pluck('users.id');
$users = User::whereIn('id', $userIds)->get();
Query time: 68ms. Now we're talking.
Example 2: Batch operations instead of loops
We had code that updated task statuses one at a time:
foreach ($taskIds as $taskId) {
Task::where('id', $taskId)->update(['status' => 'completed']);
}
For 100 tasks, this generated 100 UPDATE queries. We changed it to:
Task::whereIn('id', $taskIds)->update(['status' => 'completed']);
One query. 100x reduction in database round trips.
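When each row needs a different value, you can still avoid the loop. Laravel 8+ ships upsert(), which handles per-row values in a single statement; a minimal sketch (note it will insert rows whose IDs don't already exist):
Task::upsert(
    [
        ['id' => 101, 'status' => 'completed'],
        ['id' => 102, 'status' => 'archived'],
    ],
    ['id'],      // unique column(s) to match on
    ['status']   // columns to update when a match exists
);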
Caching Strategy: The 10x Performance Multiplier
After optimizing our database, we still had response times around 150-300ms for common pages. That's acceptable, but we wanted faster. Caching got us there.
Redis Configuration: What the Docs Don't Tell You
We started with a single Redis instance using Laravel's default cache configuration. It worked fine until about 40,000 users, then we started seeing timeout errors:
RedisException: read error on connection
Problem 1: Default timeout too low
Laravel's default Redis timeout is 0 (no timeout), which sounds good but actually causes problems under load. We set explicit timeouts:
// config/database.php
'redis' => [
'client' => 'phpredis',
'options' => [
'cluster' => env('REDIS_CLUSTER', 'redis'),
'prefix' => env('REDIS_PREFIX', Str::slug(env('APP_NAME', 'laravel'), '_').'_database_'),
],
'default' => [
'url' => env('REDIS_URL'),
'host' => env('REDIS_HOST', '127.0.0.1'),
'password' => env('REDIS_PASSWORD', null),
'port' => env('REDIS_PORT', '6379'),
'database' => env('REDIS_DB', '0'),
'read_timeout' => 2,
'timeout' => 2,
'retry_interval' => 100,
],
],
Problem 2: Memory eviction policy
Our Redis instance kept running out of memory because we didn't set an eviction policy. When memory was full, Redis started refusing writes. We configured:
# redis.conf
maxmemory 4gb
maxmemory-policy allkeys-lru
This tells Redis to evict least-recently-used keys when memory is full, rather than refusing writes.
Problem 3: Persistence slowing down writes
We had Redis configured with both RDB snapshots and AOF (append-only file) persistence. During snapshot writes, Redis would block, causing timeout errors. For cache data (which can be regenerated), we disabled persistence entirely:
# redis.conf
save ""
appendonly no
If Redis crashes, we lose the cache, but that's fine—it rebuilds automatically from the database.
Cache Warming: The Strategy That Saved Our Launch
When we onboarded that 120,000-user enterprise client, we knew the initial flood of requests would hammer our database as caches were empty. We implemented cache warming:
// app/Console/Commands/WarmCache.php
class WarmCache extends Command
{
protected $signature = 'cache:warm';
public function handle()
{
$this->info('Warming user caches...');
// Get most active users (by request count from logs)
$activeUserIds = DB::table('request_logs')
->where('created_at', '>', now()->subDays(7))
->select('user_id', DB::raw('COUNT(*) as request_count'))
->groupBy('user_id')
->orderByDesc('request_count')
->limit(10000)
->pluck('user_id');
$bar = $this->output->createProgressBar($activeUserIds->count());
foreach ($activeUserIds as $userId) {
$user = User::find($userId);
// Warm dashboard cache
Cache::remember("user.{$userId}.dashboard", 3600, function () use ($user) {
return $this->generateDashboard($user);
});
// Warm projects cache
Cache::remember("user.{$userId}.projects", 3600, function () use ($user) {
return $user->projects()->with('team')->get();
});
$bar->advance();
}
$bar->finish();
$this->info("\nCache warming complete!");
}
}
We ran this command 30 minutes before the migration cutover:
php artisan cache:warm
It took 18 minutes to warm caches for the top 10,000 users. When traffic hit, our cache hit rate was 78% immediately instead of starting at 0%. Database load stayed manageable.
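We run it manually before big events, but the same command can keep caches from going cold day to day via Laravel's scheduler. A minimal sketch (the timing is illustrative):
// app/Console/Kernel.php
protected function schedule(Schedule $schedule)
{
    // Re-warm the hottest caches before business hours
    $schedule->command('cache:warm')
        ->dailyAt('05:30')
        ->onOneServer()          // requires a shared cache driver like Redis
        ->withoutOverlapping();
}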
Cache Invalidation: The Hard Part
Cache invalidation is famously difficult. We made every mistake in the book before landing on a strategy that works.
Mistake 1: Never invalidating
Our first approach was to set long TTLs (24 hours) and never invalidate. Users would see stale data until the cache expired. Not great for a collaborative app where changes need to be visible immediately.
Mistake 2: Invalidating too aggressively
Then we swung the other way and invalidated everything related to a user whenever anything changed:
// This ran on every task update
public function updated(Task $task)
{
Cache::forget("user.{$task->user_id}.dashboard");
Cache::forget("user.{$task->user_id}.projects");
Cache::forget("project.{$task->project_id}.tasks");
// ... 10 more cache keys
}
Our cache hit rate dropped to 45%. We were invalidating too much.
Our current strategy: Granular cache keys with selective invalidation
// Cache keys are specific to the data they contain
Cache::remember("project.{$projectId}.tasks.incomplete", 3600, function () use ($projectId) {
return Task::where('project_id', $projectId)
->where('status', '!=', 'completed')
->get();
});
// Only invalidate what actually changed
public function updated(Task $task)
{
if ($task->wasChanged('status')) {
Cache::forget("project.{$task->project_id}.tasks.incomplete");
Cache::forget("project.{$task->project_id}.tasks.completed");
}
// Don't invalidate the entire dashboard
// Just invalidate the task count
Cache::forget("project.{$task->project_id}.task_count");
}
This gave us a 91% cache hit rate while keeping data fresh.
Cache Tags: The Feature We Should Have Used Earlier
Laravel supports cache tags with Redis, which makes invalidation so much easier:
// Store with tags
Cache::tags(['projects', "user:{$userId}"])->put("projects.{$userId}", $projects, 3600);
// Invalidate all caches for a user
Cache::tags("user:{$userId}")->flush();
// Invalidate all project caches
Cache::tags('projects')->flush();
We refactored our caching to use tags, which simplified our invalidation logic significantly:
public function getUserProjects(User $user)
{
return Cache::tags(['projects', "user:{$user->id}"])
->remember("projects.{$user->id}", 3600, function () use ($user) {
return $user->projects()->with('team')->get();
});
}
// When a project is updated, invalidate all related caches
public function updated(Project $project)
{
Cache::tags(['projects', "user:{$project->user_id}"])->flush();
}
Gotcha: Cache tags only work with Redis and Memcached. They don't work with file or database cache drivers.
Queue Architecture: Handling Background Jobs at Scale
At 100,000 users, we were processing 800,000 queue jobs per day. Our queue architecture went through three major iterations before we got it right.
Iteration 1: Single Queue (Failed at 30,000 Users)
Initially, we had one queue handling everything:
// Everything went to the default queue
SendWelcomeEmail::dispatch($user);
ProcessLargeReport::dispatch($reportId);
SendNotification::dispatch($notification);
Problems:
- Long-running jobs (report processing) blocked quick jobs (notifications)
- No prioritization
- If one job type failed repeatedly, it clogged the entire queue
Iteration 2: Multiple Queues by Job Type
We split into multiple queues:
// config/queue.php
'connections' => [
'redis' => [
'driver' => 'redis',
'connection' => 'default',
'queue' => env('REDIS_QUEUE', 'default'),
'retry_after' => 90,
'block_for' => null,
],
],
// Different queues for different job types
SendWelcomeEmail::dispatch($user)->onQueue('emails');
ProcessLargeReport::dispatch($reportId)->onQueue('reports');
SendNotification::dispatch($notification)->onQueue('notifications');
Then we ran separate workers for each queue:
php artisan queue:work --queue=notifications --tries=3 --timeout=30
php artisan queue:work --queue=emails --tries=3 --timeout=60
php artisan queue:work --queue=reports --tries=2 --timeout=300
This worked much better. Fast jobs weren't blocked by slow jobs.
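In practice you don't run queue:work by hand; something like Supervisor keeps workers alive and restarts them after crashes and deploys. A minimal sketch of one pool (paths and process counts are illustrative):
; /etc/supervisor/conf.d/queue-notifications.conf
[program:queue-notifications]
command=php /var/www/current/artisan queue:work --queue=notifications --tries=3 --timeout=30
process_name=%(program_name)s_%(process_num)02d
numprocs=8
autostart=true
autorestart=true
stopwaitsecs=60
user=www-data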
Iteration 3: Priority Queues and Horizon
We adopted Laravel Horizon for queue management, which gave us:
- Real-time queue monitoring
- Automatic worker balancing
- Failed job management
- Metrics and insights
Our Horizon configuration:
// config/horizon.php
'environments' => [
'production' => [
'supervisor-1' => [
'connection' => 'redis',
'queue' => ['critical', 'high', 'default', 'low'],
'balance' => 'auto',
'processes' => 20,
'tries' => 3,
'timeout' => 300,
],
],
],
Jobs are processed in priority order:
// Critical: User-facing actions (< 1 second)
SendNotification::dispatch($notification)->onQueue('critical');
// High: Important but not immediate (< 10 seconds)
SendWelcomeEmail::dispatch($user)->onQueue('high');
// Default: Regular background work (< 1 minute)
ProcessAnalytics::dispatch($data)->onQueue('default');
// Low: Cleanup, maintenance (can take minutes)
CleanupOldLogs::dispatch()->onQueue('low');
Queue Optimization: Real Performance Gains
1. Batch jobs to reduce overhead
We had a notification system that sent individual emails:
foreach ($users as $user) {
SendEmail::dispatch($user, $message);
}
For 10,000 users, this created 10,000 jobs. Queue processing overhead was significant. We switched to batching:
Bus::batch(
$users->chunk(100)->map(function ($userChunk) use ($message) {
return new SendBulkEmail($userChunk, $message);
})
)->dispatch();
Now we create 100 jobs instead of 10,000. Queue processing time dropped from 45 minutes to 8 minutes.
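The SendBulkEmail job referenced above isn't shown in full; here's a minimal sketch of what a batchable bulk job can look like (BulkMessage is a hypothetical mailable):
// app/Jobs/SendBulkEmail.php (illustrative sketch)
use Illuminate\Bus\Batchable;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;
use Illuminate\Support\Collection;
use Illuminate\Support\Facades\Mail;

class SendBulkEmail implements ShouldQueue
{
    use Batchable, Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public function __construct(
        protected Collection $users,
        protected string $message
    ) {}

    public function handle()
    {
        // Skip this chunk if the batch was cancelled while queued
        if ($this->batch()?->cancelled()) {
            return;
        }

        foreach ($this->users as $user) {
            Mail::to($user)->send(new BulkMessage($this->message));
        }
    }
}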
2. Job-specific retry strategies
Not all jobs should retry the same way. We implemented custom retry logic:
class ProcessPayment implements ShouldQueue
{
public $tries = 5;
public $backoff = [60, 300, 900, 3600]; // 1min, 5min, 15min, 1hour
public function handle()
{
// Payment processing logic
}
public function failed(Throwable $exception)
{
// Notify admin, log to external service
Log::error('Payment processing failed', [
'job_id' => $this->job->getJobId(),
'exception' => $exception->getMessage(),
]);
}
}
3. Monitoring queue health
We built a dashboard showing:
- Jobs processed per minute
- Average job duration by queue
- Failed job rate
- Queue depth (jobs waiting)
When queue depth exceeds 10,000, we auto-scale workers:
// Monitoring script (runs every minute)
$queueDepth = Redis::llen('queues:default');
if ($queueDepth > 10000) {
// Scale up workers (using AWS Auto Scaling)
$this->scaleWorkers(40);
} elseif ($queueDepth < 1000) {
// Scale down to save costs
$this->scaleWorkers(20);
}
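The scaleWorkers() helper above is elided; with the AWS SDK for PHP it could look roughly like this (region and group name are illustrative):
use Aws\AutoScaling\AutoScalingClient;

protected function scaleWorkers(int $desired): void
{
    // Hypothetical sketch: adjust the worker fleet's Auto Scaling group
    $client = new AutoScalingClient([
        'region'  => 'us-east-1',
        'version' => 'latest',
    ]);

    $client->setDesiredCapacity([
        'AutoScalingGroupName' => 'queue-workers', // illustrative name
        'DesiredCapacity'      => $desired,
    ]);
}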
Application-Level Optimizations
Eager Loading Relationships: Beyond the Basics
We've covered N+1 queries, but there are more subtle performance issues with Eloquent relationships.
Nested eager loading:
// Bad: Loads all tasks for all projects, even if we only need incomplete ones
$user->load('projects.tasks');
// Better: Constrain the relationship
$user->load(['projects.tasks' => function ($query) {
$query->where('status', '!=', 'completed')
->orderBy('due_date')
->limit(10);
}]);
Lazy eager loading to avoid memory issues:
When dealing with large datasets, eager loading everything at once can exhaust memory. We use chunking:
// Bad: Loads all 100,000 users into memory
User::with('projects')->get();
// Better: Process in chunks
User::with('projects')->chunk(1000, function ($users) {
foreach ($users as $user) {
// Process user
}
});
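Laravel 8+ also wraps the same chunking behind a LazyCollection via lazy(), which reads more naturally when you'd rather iterate than nest callbacks; a minimal sketch:
// Same memory profile as chunk(), one flat iteration
User::with('projects')->lazy(1000)->each(function ($user) {
    // Only ~1,000 models are hydrated at any one time
});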
Response Caching with HTTP Cache Headers
We implemented response caching for API endpoints that don't change frequently:
Route::get('/api/projects/{project}', function (Project $project) {
return response()->json($project)
->header('Cache-Control', 'public, max-age=300') // 5 minutes
->header('ETag', md5($project->updated_at));
});
This allows browsers and CDNs to cache responses, reducing load on our servers.
For pages that change based on user state, we use conditional requests:
public function show(Request $request, Project $project)
{
$etag = md5($project->updated_at . $request->user()->id);
if ($request->header('If-None-Match') === $etag) {
return response()->noContent(304);
}
return response()->json($project)
->header('ETag', $etag);
}
When the client sends an If-None-Match header matching the ETag, we return 304 Not Modified instead of the full response. This saved us approximately 40% of bandwidth.
Session Management at Scale
Laravel's default session driver is file, which doesn't work well at scale. We switched to Redis:
// .env
SESSION_DRIVER=redis
SESSION_CONNECTION=sessions
// config/database.php
'redis' => [
'sessions' => [
'host' => env('REDIS_SESSION_HOST', '127.0.0.1'),
'password' => env('REDIS_SESSION_PASSWORD', null),
'port' => env('REDIS_SESSION_PORT', '6379'),
'database' => 1, // Separate database from cache
],
],
Why a separate Redis database for sessions?
Sessions have different access patterns than cache. Cache can be evicted; sessions cannot. We run two Redis instances:
- Cache Redis: LRU eviction, no persistence
- Session Redis: No eviction, AOF persistence
This prevents sessions from being evicted when cache memory fills up.
Infrastructure and Deployment
Horizontal Scaling: The Multi-Server Setup
At 100,000 users, a single server isn't enough. Our production infrastructure:
Load Balancer (AWS ALB):
- Distributes traffic across web servers
- SSL termination
- Health checks every 30 seconds
Web Servers (4x AWS EC2 c5.2xlarge):
- Nginx + PHP-FPM
- 32 PHP-FPM workers per server (128 total)
- Auto-scaling: scales to 8 servers during peak traffic
Database (AWS RDS PostgreSQL):
- db.r5.4xlarge (16 vCPU, 128GB RAM)
- 2 read replicas for read-heavy queries
- Automated backups, point-in-time recovery
Cache (AWS ElastiCache Redis):
- cache.r5.2xlarge (8 vCPU, 52GB RAM)
- Redis 7.x in cluster mode (3 shards)
- Separate instance for sessions
Queue Workers (4x AWS EC2 c5.xlarge):
- 20 Horizon workers per server (80 total)
- Auto-scales based on queue depth
Total monthly cost: ~$8,500/month (varies with auto-scaling)
PHP-FPM Configuration for High Concurrency
Default PHP-FPM settings don't work at scale. Here's our production configuration:
; /etc/php/8.2/fpm/pool.d/www.conf
[www]
user = www-data
group = www-data
listen = /run/php/php8.2-fpm.sock
; Process manager
pm = dynamic
pm.max_children = 50
pm.start_servers = 10
pm.min_spare_servers = 10
pm.max_spare_servers = 20
pm.max_requests = 500
; Timeouts
request_terminate_timeout = 30s
request_slowlog_timeout = 10s
slowlog = /var/log/php-fpm-slow.log
; Resource limits
php_admin_value[memory_limit] = 256M
php_admin_value[max_execution_time] = 30
Key settings explained:
- pm.max_children = 50: Maximum 50 concurrent requests per server. With 4 servers, that's 200 concurrent requests.
- pm.max_requests = 500: Restart workers after 500 requests to prevent memory leaks.
- request_terminate_timeout = 30s: Kill requests taking longer than 30 seconds.
How we calculated max_children:
Available RAM: 16GB
RAM per PHP process: 128MB average (measured with top)
Reserved for system: 2GB
max_children = (16GB - 2GB) / 128MB = 112
We set it to 50 to leave headroom for spikes.
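We measured that 128MB average with top; a scriptable alternative if you want to re-check it after config changes (the process name varies by distro):
# Average resident memory per PHP-FPM worker, in MB
ps --no-headers -o rss -C php-fpm8.2 \
  | awk '{ sum += $1; n++ } END { printf "%.0f MB avg\n", sum / n / 1024 }'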
Nginx Configuration for Performance
# /etc/nginx/sites-available/myapp
upstream php-fpm {
server unix:/run/php/php8.2-fpm.sock;
}
server {
listen 80;
server_name myapp.com;
root /var/www/myapp/public;
index index.php;
# Gzip compression
gzip on;
gzip_types text/plain text/css application/json application/javascript text/xml application/xml application/xml+rss text/javascript;
gzip_min_length 1000;
# Security headers
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header X-XSS-Protection "1; mode=block" always;
# Static file caching
location ~* \.(jpg|jpeg|png|gif|ico|css|js|svg|woff|woff2|ttf|eot)$ {
expires 1y;
add_header Cache-Control "public, immutable";
}
# PHP requests
location / {
try_files $uri $uri/ /index.php?$query_string;
}
location ~ \.php$ {
fastcgi_pass php-fpm;
fastcgi_index index.php;
fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
include fastcgi_params;
# Timeouts
fastcgi_connect_timeout 60s;
fastcgi_send_timeout 60s;
fastcgi_read_timeout 60s;
# Buffering
fastcgi_buffer_size 32k;
fastcgi_buffers 8 16k;
}
}
Zero-Downtime Deployments
We use Laravel Envoy for deployments with a blue-green strategy:
@servers(['web' => ['deploy@web1.myapp.com', 'deploy@web2.myapp.com']])
@task('deploy', ['on' => 'web'])
cd /var/www
# Clone into new release directory
git clone --depth 1 --branch {{ $branch }} git@github.com:myapp/myapp.git release-{{ $release }}
cd release-{{ $release }}
# Install dependencies
composer install --no-dev --optimize-autoloader --no-interaction
# Link storage
ln -s /var/www/storage storage
ln -s /var/www/.env .env
# Run migrations (only on first server)
@if ($loop->first)
php artisan migrate --force
@endif
# Optimize
php artisan config:cache
php artisan route:cache
php artisan view:cache
# Switch symlink (atomic operation)
ln -nfs /var/www/release-{{ $release }} /var/www/current
# Reload PHP-FPM gracefully
sudo systemctl reload php8.2-fpm
# Clean up old releases (keep last 5)
cd /var/www
ls -t | grep release- | tail -n +6 | xargs rm -rf
@endtask
Deploy with:
envoy run deploy --branch=main --release=$(date +%Y%m%d%H%M%S)
The key to zero downtime is the atomic symlink switch (ln -nfs). Nginx serves from /var/www/current, which is a symlink. When we update the symlink, Nginx starts serving the new code immediately without restarting.
Monitoring and Observability
You can't scale what you can't measure. Here's our monitoring stack:
Application Performance Monitoring (APM)
We use Blackfire for production profiling:
# Install Blackfire probe
wget -O - https://packages.blackfire.io/gpg.key | sudo apt-key add -
echo "deb http://packages.blackfire.io/debian any main" | sudo tee /etc/apt/sources.list.d/blackfire.list
sudo apt-get update
sudo apt-get install blackfire-agent blackfire-php
# Configure with your credentials
sudo blackfire-agent --register
sudo systemctl restart blackfire-agent
We profile every 100th request automatically:
// app/Http/Middleware/BlackfireProfile.php
public function handle($request, Closure $next)
{
if (rand(1, 100) === 1) {
$probe = new \BlackfireProbe();
$probe->enable();
}
$response = $next($request);
if (isset($probe)) {
$probe->close();
}
return $response;
}
This gives us continuous profiling data without impacting performance.
Database Monitoring
We monitor:
- Query execution time (p50, p95, p99)
- Connection count
- Cache hit ratio
- Replication lag
- Disk I/O
Query performance tracking:
// app/Providers/AppServiceProvider.php
public function boot()
{
if (app()->environment('production')) {
DB::listen(function ($query) {
if ($query->time > 1000) { // Queries over 1 second
Log::warning('Slow query detected', [
'sql' => $query->sql,
'bindings' => $query->bindings,
'time' => $query->time,
'url' => request()->url(),
]);
}
});
}
}
Custom Metrics with StatsD
We send custom metrics to StatsD/Graphite:
use Domnikl\Statsd\Client;
use Domnikl\Statsd\Connection\UdpSocket;
$connection = new UdpSocket('statsd.internal', 8125);
$statsd = new Client($connection);
// Track API response times
$start = microtime(true);
$response = $this->processRequest($request);
$duration = (microtime(true) - $start) * 1000;
$statsd->timing('api.response_time', $duration);
$statsd->increment('api.requests');
// Track business metrics
$statsd->increment('projects.created');
$statsd->gauge('users.active', $activeUserCount);
We graph these in Grafana to spot trends and anomalies.
Real-World Scaling Scenarios and Solutions
Scenario 1: The Viral Product Hunt Launch
The situation: We launched on Product Hunt and hit the front page. Traffic went from 500 concurrent users to 5,000 in 30 minutes.
What broke:
- Database connections maxed out (hit the 100 connection limit)
- Redis ran out of memory (default 1GB wasn't enough)
- Queue workers couldn't keep up with welcome emails
What we did (in order of impact):
- Immediately: Increased database max_connections from 100 to 300
- 5 minutes later: Scaled Redis from 1GB to 4GB
- 10 minutes later: Spun up 4 additional queue worker servers
- 30 minutes later: Implemented PgBouncer to pool database connections
- Next day: Added database read replicas
Result: Stabilized within 45 minutes. Learned to have auto-scaling configured before launches.
Scenario 2: The Enterprise Client Migration
The situation: Enterprise client with 120,000 users migrating over a weekend.
What we prepared:
- Scaled infrastructure in advance (8 web servers, 6 queue workers)
- Warmed caches for expected user patterns
- Ran load tests simulating 2,000 requests/second
- Set up real-time monitoring dashboard
- Had team on standby
What actually happened:
- Migration went smoothly for first 50,000 users
- At 80,000 users, we noticed queue depth growing (email sending was bottleneck)
- Scaled email queue workers from 20 to 60
- At 100,000 users, database CPU spiked to 80% (lots of first-time logins)
- Temporarily disabled non-critical background jobs
- By Monday morning, everything stabilized
Lessons learned:
- Load testing doesn't catch everything (we didn't simulate realistic login patterns)
- Have a plan to disable non-critical features under load (see the sketch after this list)
- Over-provision for migrations; scale down after
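One way to make that "disable non-critical work" plan a one-line operation is a cache-backed kill switch; a minimal sketch (the flag name is illustrative):
// Gate non-critical dispatches behind a flag you can flip during an incident
if (! Cache::get('load-shedding:enabled', false)) {
    ProcessAnalytics::dispatch($data)->onQueue('low');
}

// Flip it on from a tinker session or an admin endpoint:
// Cache::forever('load-shedding:enabled', true);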
Scenario 3: The Slow Memory Leak
The situation: Over 3 weeks, we noticed memory usage slowly climbing on web servers. Eventually, servers would run out of memory and crash.
Debugging process:
- Week 1: Assumed it was normal growth. Increased server RAM from 8GB to 16GB.
- Week 2: Memory filled up again. Realized it was a leak.
- Week 3: Used Blackfire to profile memory usage. Found this code:
// In a middleware that ran on every request
public function handle($request, Closure $next)
{
$this->userCache[] = $request->user(); // Accumulating users in memory!
return $next($request);
}
The middleware was storing users in a class property that never got cleared. After 100,000 requests, we had 100,000 user objects in memory.
Fix:
public function handle($request, Closure $next)
{
// Don't store in class property
$user = $request->user();
return $next($request);
}
Lesson: Memory leaks are insidious at scale. Profile regularly and watch for slow memory growth.
Advanced Performance Techniques
Query Result Caching with Rememberable
We built a query result cache that's smarter than basic caching:
// app/Traits/Rememberable.php
trait Rememberable
{
    // A local scope, so it chains on the query builder:
    // Project::where(...)->remember(3600)
    // (a plain trait method would never be reachable from a query chain)
    public function scopeRemember($query, $seconds = 3600)
    {
        $key = 'query.' . md5($query->toSql() . serialize($query->getBindings()));

        return Cache::remember($key, $seconds, function () use ($query) {
            return $query->get();
        });
    }
}
// Use it in models
class Project extends Model
{
use Rememberable;
}
// Usage
$projects = Project::where('status', 'active')->remember(3600);
This automatically caches any query result based on the SQL and bindings.
Preloading Data on the Client
For our dashboard, we preload data the user is likely to need:
// In our Vue.js app
export default {
mounted() {
// Load current page data
this.loadDashboard();
// Preload likely next pages in the background
setTimeout(() => {
this.preloadProjects();
this.preloadTeam();
}, 1000);
},
methods: {
preloadProjects() {
axios.get('/api/projects').then(response => {
this.$store.commit('cacheProjects', response.data);
});
}
}
}
When the user navigates to the projects page, data is already loaded. This makes the app feel instant.
Database Connection Multiplexing
For read-heavy workloads, we multiplex read queries across replicas:
// app/Services/DatabaseMultiplexer.php
class DatabaseMultiplexer
{
protected $replicas = ['replica-1', 'replica-2', 'replica-3'];
protected $currentReplica = 0;
public function query($sql, $bindings = [])
{
$replica = $this->getNextReplica();
return DB::connection($replica)->select($sql, $bindings);
}
protected function getNextReplica()
{
$replica = $this->replicas[$this->currentReplica];
$this->currentReplica = ($this->currentReplica + 1) % count($this->replicas);
return $replica;
}
}
This distributes reads evenly across replicas, preventing any single replica from becoming a bottleneck.
The Scaling Checklist
Here's the checklist we use before expecting traffic spikes:
Database:
- All queries under 100ms (check with slow query log)
- Indexes on all WHERE, ORDER BY, and JOIN columns
- N+1 queries eliminated (verify with Telescope)
- Connection pooling configured (PgBouncer)
- Read replicas for read-heavy queries
- Database backed up and recovery tested
Cache:
- Redis scaled to handle expected load
- Cache hit rate above 85%
- Cache warming strategy for critical data
- Cache tags implemented for easy invalidation
- Separate Redis for sessions
Queues:
- Horizon configured with proper queue priorities
- Queue workers auto-scale based on depth
- Failed job monitoring and alerting
- Job batching for bulk operations
Application:
- Response times under 200ms for 95th percentile
- Static assets on CDN
- Gzip compression enabled
- HTTP caching headers configured
- Session storage on Redis
Infrastructure:
- Auto-scaling configured and tested
- Load balancer health checks working
- Zero-downtime deployment process
- Monitoring and alerting configured
- Load testing completed
Monitoring:
- APM (Blackfire/New Relic) configured
- Database monitoring (query times, connections)
- Queue monitoring (depth, processing rate)
- Custom business metrics tracked
- Error tracking (Sentry/Bugsnag)
Cost Optimization at Scale
Scaling isn't just about performance—it's also about cost. Here's how we keep costs reasonable:
Right-Sizing Instances
We started with oversized instances "to be safe." After monitoring for a month, we found we were using only 40% of CPU on average. We downsized:
- Web servers: c5.2xlarge → c5.xlarge (saved $800/month)
- Queue workers: c5.xlarge → c5.large (saved $400/month)
Total savings: $1,200/month without any performance impact.
Reserved Instances
For baseline capacity that runs 24/7, we use AWS Reserved Instances:
- 2 web servers (always running): 1-year reserved (30% discount)
- 1 database instance: 1-year reserved (30% discount)
- 1 Redis instance: 1-year reserved (30% discount)
Savings: ~$2,000/month
Spot Instances for Queue Workers
Queue workers can be interrupted without user impact. We use spot instances:
# EC2 launch template requesting spot instances
aws ec2 create-launch-template \
  --launch-template-name queue-workers \
  --launch-template-data '{
    "InstanceMarketOptions": {
      "MarketType": "spot",
      "SpotOptions": {
        "MaxPrice": "0.10",
        "SpotInstanceType": "one-time"
      }
    }
  }'
Savings: 70% off on-demand pricing (~$1,500/month)
Aggressive Cache TTLs
We increased cache TTLs for data that doesn't change frequently:
- User profiles: 1 hour → 6 hours
- Project lists: 30 minutes → 2 hours
- Public pages: 5 minutes → 1 hour
This reduced database queries by 30% and cut our database instance size (and cost) by 25%.
What We'd Do Differently Next Time
Looking back, here's what we'd change:
- Start with read replicas earlier. We waited until we had performance problems. Should have set them up from day one.
- Implement proper monitoring before scaling issues. We added monitoring reactively. Should have had Blackfire and custom metrics from the start.
- Use Laravel Horizon from the beginning. We migrated from basic queue workers to Horizon after hitting scaling issues. The migration was painful.
- Load test regularly, not just before launches. We only load tested before big events. Should have been testing monthly to catch regressions.
- Document our scaling strategies. When issues happened at 2am, we scrambled to remember what to do. Now we have runbooks.
- Budget for scaling costs. We were surprised by infrastructure costs as we scaled. Should have modeled costs earlier.
The Real Numbers: Before and After
Here's what scaling from 10,000 to 100,000 users looked like for us:
Response Times:
- Before: 850ms average, 3.2s p95
- After: 140ms average, 280ms p95
Database Performance:
- Before: 12,000 queries/sec peak, 45% CPU average
- After: 18,000 queries/sec peak, 35% CPU average (with read replicas)
Cache Hit Rate:
- Before: 62%
- After: 91%
Queue Processing:
- Before: 200,000 jobs/day, 2-hour backlog during peaks
- After: 800,000 jobs/day, 5-minute backlog maximum
Infrastructure Costs:
- Before: $2,800/month (single-server setup)
- After: $8,500/month (scaled infrastructure)
- Per-user cost: $0.28 → $0.085
Incident Frequency:
- Before: 2-3 performance incidents per week
- After: 1 incident per month
Final Thoughts
Scaling Laravel to 100,000 users isn't about rewriting your application or switching to a "more scalable" framework. It's about identifying bottlenecks, optimizing systematically, and implementing proper infrastructure.
The biggest lesson? Measure everything. We wasted weeks optimizing things that didn't matter because we were guessing instead of measuring. Once we had proper monitoring, we could focus on the 20% of issues causing 80% of problems.
Laravel is absolutely capable of handling 100,000+ users. We're now at 150,000 users and still using the same core architecture. The framework isn't the bottleneck—your database queries, caching strategy, and infrastructure are.
Start with the database. Fix your N+1 queries and add proper indexes. That alone will get you to 50,000 users. Then add Redis caching and read replicas. That'll get you to 100,000. After that, it's about horizontal scaling and optimization.
And remember: scale when you need to, not before. Premature optimization is real. We spent months preparing for scale we didn't hit for another year. That time could have been spent building features users actually wanted.
Daniel Hartwell
Senior backend engineer focused on distributed systems and database performance. Previously at fintech and SaaS scale-ups. Writes about the boring-but-critical infrastructure that keeps systems running.