Monitoring Laravel in Production — CloudWatch, Prometheus & Grafana
Deploying is not the finish line. How many times have you received a message like "Hey, the site is really slow" or "Why can't I place an order?" without having any idea there was a problem until users complained?
Monitoring helps you detect issues before users encounter them. This article walks you through setting up a comprehensive observability system for Laravel, from simple to advanced.
The Three Pillars of Observability
┌──────────────────────────────────────────────┐
│                OBSERVABILITY                 │
├────────────┬────────────┬────────────────────┤
│ LOGS       │ METRICS    │ TRACES             │
│            │            │                    │
│ "What      │ "How       │ "Where did the     │
│ happened"  │ much,      │ request go, how    │
│            │ how long"  │ long at each hop"  │
│            │            │                    │
│ Monolog    │ Prometheus │ OpenTelemetry      │
│ CloudWatch │ Grafana    │ Jaeger/Zipkin      │
└────────────┴────────────┴────────────────────┘
- Logs: Record events. "404 request at 15:30", "Payment failed for user #123".
- Metrics: Aggregated numbers. "CPU 80%", "500 requests/second", "Average response time 200ms".
- Traces: Track a single request's lifecycle across services. "Request → Controller → Database (150ms) → Redis (5ms) → Response".
Part 1: Logging — The Foundation
Configuring Monolog for Production
// config/logging.php
'channels' => [
    'stack' => [
        'driver' => 'stack',
        'channels' => ['daily', 'cloudwatch'],
        'ignore_exceptions' => false,
    ],
    'daily' => [
        'driver' => 'daily',
        'path' => storage_path('logs/laravel.log'),
        'level' => env('LOG_LEVEL', 'info'),
        'days' => 14,
        'replace_placeholders' => true,
    ],
    'cloudwatch' => [
        'driver' => 'custom',
        'via' => App\Logging\CloudWatchLoggerFactory::class,
        'level' => env('LOG_LEVEL', 'info'),
        'retention' => 30,
        'group_name' => env('CLOUDWATCH_LOG_GROUP', '/laravel/production'),
        'stream_name' => env('CLOUDWATCH_LOG_STREAM', 'application'),
    ],
],
Creating the CloudWatch Logger
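composer require aws/aws-sdk-php maxbanton/cwh
Check which Monolog major version your app uses before installing: Laravel 10 and later require Monolog 3, while maxbanton/cwh was written for Monolog 2. If Composer refuses to install it, the phpnexus/cwh fork is a commonly used Monolog 3 alternative (verify its namespace and constructor before adapting the factory).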
// app/Logging/CloudWatchLoggerFactory.php
namespace App\Logging;

use Aws\CloudWatchLogs\CloudWatchLogsClient;
use Maxbanton\Cwh\Handler\CloudWatch;
use Monolog\Formatter\JsonFormatter;
use Monolog\Logger;

class CloudWatchLoggerFactory
{
    public function __invoke(array $config): Logger
    {
        $client = new CloudWatchLogsClient([
            'region' => config('services.aws.region', 'ap-southeast-1'),
            'version' => 'latest',
        ]);

        // Positional arguments: client, log group, log stream, retention in days, batch size.
        // Named arguments would have to match the handler's parameter names exactly.
        $handler = new CloudWatch(
            $client,
            $config['group_name'],
            $config['stream_name'],
            $config['retention'],
            25,
        );

        // JSON format for easy querying on CloudWatch Insights
        $handler->setFormatter(new JsonFormatter());

        $logger = new Logger('cloudwatch');
        $logger->pushHandler($handler);

        return $logger;
    }
}
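Why JSON format? CloudWatch Logs Insights has its own pipe-based query language and discovers the fields of JSON log events automatically. Monolog's JsonFormatter writes the level name to level_name (level holds the numeric value) and normalizes a logged exception into an object with a class field, so you can:
# Find the most recent errors (set the time range to the last hour in the console)
fields @timestamp, message, context.exception.class
| filter level_name = "ERROR"
| sort @timestamp desc
| limit 50
# Count errors by exception class
filter level_name = "ERROR"
| stats count(*) as error_count by context.exception.class
| sort error_count desc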
Structured Logging — Add Context
Don't log bare messages. Add context so each entry can be debugged on its own:
// ❌ Hard to debug
Log::error('Payment failed');
// ✅ Easy to debug
Log::error('Payment failed', [
    'user_id' => $user->id,
    'order_id' => $order->id,
    'amount' => $order->total,
    'gateway' => 'stripe',
    'error_code' => $e->getCode(),
    'error_msg' => $e->getMessage(),
    'trace_id' => request()->header('X-Request-ID'),
]);
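To see why structured context pays off, here is a minimal sketch in plain PHP (no Laravel or Monolog required). The helper names formatLogLine() and filterByContext() are made up for illustration, and the record layout is a simplified stand-in for what a JSON formatter produces, not Monolog's actual implementation.

```php
<?php
declare(strict_types=1);

/**
 * Simplified stand-in for a JSON log formatter: one JSON object per line.
 * The field names are illustrative assumptions.
 */
function formatLogLine(string $levelName, string $message, array $context = []): string
{
    return json_encode([
        'message'    => $message,
        'context'    => $context,
        'level_name' => $levelName,
        'datetime'   => gmdate('c'),
    ], JSON_THROW_ON_ERROR | JSON_UNESCAPED_SLASHES);
}

/** Keep only the lines whose context contains the given key/value pair. */
function filterByContext(array $lines, string $key, mixed $value): array
{
    return array_values(array_filter(
        $lines,
        function (string $line) use ($key, $value): bool {
            $record = json_decode($line, true, 512, JSON_THROW_ON_ERROR);
            return ($record['context'][$key] ?? null) === $value;
        }
    ));
}

$lines = [
    formatLogLine('ERROR', 'Payment failed', ['order_id' => 42, 'gateway' => 'stripe']),
    formatLogLine('ERROR', 'Payment failed', ['order_id' => 43, 'gateway' => 'paypal']),
    formatLogLine('INFO', 'Order placed', ['order_id' => 44, 'gateway' => 'stripe']),
];

$stripeLines = filterByContext($lines, 'gateway', 'stripe');
echo count($stripeLines), " entries mention stripe\n";
```

With a bare 'Payment failed' message, the same question ("which failures involved Stripe?") could only be answered by guessing from timestamps.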
Middleware to Attach Request ID
// app/Http/Middleware/RequestId.php
namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;
use Illuminate\Support\Str;
use Illuminate\Support\Facades\Log;

class RequestId
{
    public function handle(Request $request, Closure $next)
    {
        $requestId = $request->header('X-Request-ID', (string) Str::uuid());

        // Attach to all log entries
        Log::shareContext([
            'request_id' => $requestId,
            'ip' => $request->ip(),
            'url' => $request->fullUrl(),
            'method' => $request->method(),
        ]);

        $response = $next($request);
        $response->headers->set('X-Request-ID', $requestId);

        return $response;
    }
}
Explanation of Log::shareContext(): Available in Laravel 10 and later, this method adds the given context to every subsequent log entry, on all channels, for the rest of the request. You no longer need to pass $requestId everywhere you log.
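One thing to keep in mind: the middleware accepts whatever X-Request-ID the client sends and writes it into every log entry. A possible hardening step, sketched below in plain PHP, is to accept the incoming value only if it looks like an ID and to generate a fresh one otherwise. The function name resolveRequestId(), the allowed alphabet, and the 8 to 64 character limit are assumptions for illustration, not a standard.

```php
<?php
declare(strict_types=1);

/**
 * Return the caller-supplied request ID if it looks safe, otherwise generate one.
 * Allowed alphabet and length limits are illustrative assumptions.
 */
function resolveRequestId(?string $incoming): string
{
    if ($incoming !== null && preg_match('/\A[A-Za-z0-9._-]{8,64}\z/', $incoming) === 1) {
        return $incoming;
    }

    // 16 random bytes rendered as 32 hexadecimal characters
    return bin2hex(random_bytes(16));
}

// A well-formed UUID from an upstream proxy is kept as-is
$kept = resolveRequestId('3f2a9c1e-7b4d-4e2a-9f6b-1c2d3e4f5a6b');

// A value containing a newline (a log-injection attempt) is replaced
$replaced = resolveRequestId("abc\nFAKE LOG LINE");

echo $kept, "\n", $replaced, "\n";
```

Inside the middleware, the result of such a function would take the place of the raw header value before it is shared with the logger.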
Part 2: Metrics with Prometheus
Prometheus is an open-source monitoring system that works on a pull model: the Prometheus server periodically "scrapes" metrics from your application.
Installation
composer require promphp/prometheus_client_php
Creating the Metrics Service
// app/Services/MetricsService.php
namespace App\Services;

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\Redis;

class MetricsService
{
    private CollectorRegistry $registry;

    public function __construct()
    {
        // Use Redis to persist metrics between requests
        // The InMemory storage adapter is only suitable for testing
        $this->registry = new CollectorRegistry(
            new Redis([
                'host' => config('database.redis.default.host'),
                'port' => config('database.redis.default.port'),
            ])
        );
    }

    /**
     * Count total HTTP requests by method, path, status
     */
    public function recordHttpRequest(
        string $method,
        string $path,
        int $statusCode,
        float $duration
    ): void {
        // Counter: only increments, never decreases
        $counter = $this->registry->getOrRegisterCounter(
            'laravel',
            'http_requests_total',
            'Total HTTP requests',
            ['method', 'path', 'status']
        );
        // Label values are strings, so cast the status code explicitly
        $counter->inc([$method, $path, (string) $statusCode]);

        // Histogram: response time distribution
        $histogram = $this->registry->getOrRegisterHistogram(
            'laravel',
            'http_request_duration_seconds',
            'Response time (seconds)',
            ['method', 'path'],
            // Buckets: 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s
            [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
        );
        $histogram->observe($duration, [$method, $path]);
    }

    /**
     * Measure database query count and duration
     */
    public function recordDatabaseQuery(float $duration, string $connection): void
    {
        $histogram = $this->registry->getOrRegisterHistogram(
            'laravel',
            'database_query_duration_seconds',
            'Database query duration',
            ['connection'],
            [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]
        );
        $histogram->observe($duration, [$connection]);
    }

    /**
     * Count queue jobs
     */
    public function recordQueueJob(string $job, string $status): void
    {
        $counter = $this->registry->getOrRegisterCounter(
            'laravel',
            'queue_jobs_total',
            'Total queue jobs',
            ['job', 'status']
        );
        $counter->inc([$job, $status]);
    }

    /**
     * Gauge: current value (can go up/down)
     */
    public function setQueueSize(string $queue, int $size): void
    {
        $gauge = $this->registry->getOrRegisterGauge(
            'laravel',
            'queue_size',
            'Number of pending jobs in queue',
            ['queue']
        );
        $gauge->set($size, [$queue]);
    }

    /**
     * Render metrics in Prometheus text format
     */
    public function render(): string
    {
        $renderer = new RenderTextFormat();

        return $renderer->render($this->registry->getMetricFamilySamples());
    }
}
Explanation of metric types:
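- Counter: Only increments. Example: total requests, total errors. A counter normally resets to 0 when the process restarts; here the values live in Redis, so they survive PHP restarts until the Redis keys are cleared. PromQL's rate() copes with resets either way.
- Histogram: Measures value distribution by counting observations into buckets. Example: response time. Percentiles (p50, p95, p99) are not stored; Prometheus estimates them at query time with histogram_quantile().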
- Gauge: Current value, can go up or down. Example: queue size, memory usage.
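To make the histogram type less abstract, the sketch below mimics in plain PHP how cumulative buckets are filled and how a percentile can be estimated from them by linear interpolation, which is the general approach histogram_quantile() takes. The class SketchHistogram is hypothetical and heavily simplified; it is not the promphp or Prometheus implementation.

```php
<?php
declare(strict_types=1);

final class SketchHistogram
{
    /** @var array<int, float> bucket upper bounds, ascending */
    private array $bounds;

    /** @var array<int, int> cumulative counts: one per bound, plus a final +Inf slot */
    private array $counts;

    public function __construct(array $bounds)
    {
        sort($bounds);
        $this->bounds = array_map('floatval', $bounds);
        $this->counts = array_fill(0, count($bounds) + 1, 0);
    }

    public function observe(float $value): void
    {
        // Cumulative buckets: an observation counts in every bucket whose bound it fits under
        foreach ($this->bounds as $i => $bound) {
            if ($value <= $bound) {
                $this->counts[$i]++;
            }
        }
        $this->counts[count($this->bounds)]++; // the +Inf bucket counts everything
    }

    public function quantile(float $q): float
    {
        $total = $this->counts[count($this->bounds)];
        if ($total === 0) {
            return NAN;
        }

        $rank = $q * $total;
        $prevBound = 0.0;
        $prevCount = 0;
        foreach ($this->bounds as $i => $bound) {
            $count = $this->counts[$i];
            if ($count >= $rank) {
                if ($count === $prevCount) {
                    return $bound; // avoids division by zero for q = 0
                }
                // Assume observations are spread evenly inside the bucket
                return $prevBound + ($bound - $prevBound) * ($rank - $prevCount) / ($count - $prevCount);
            }
            $prevBound = $bound;
            $prevCount = $count;
        }

        return $prevBound; // rank falls into +Inf: report the highest finite bound
    }
}

$histogram = new SketchHistogram([0.1, 0.5, 1.0]);
foreach (array_merge(array_fill(0, 8, 0.05), [0.3, 0.3]) as $seconds) {
    $histogram->observe($seconds);
}

$p95 = $histogram->quantile(0.95);
printf("p95 estimate: %.3f s\n", $p95);
```

The estimate comes out at 0.4 s even though the slowest observation was 0.3 s. The histogram only knows that two requests landed between 0.1 s and 0.5 s, so accuracy depends on how well the bucket boundaries match your real latencies.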
Middleware to Collect Metrics
// app/Http/Middleware/CollectMetrics.php
namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;
use App\Services\MetricsService;

class CollectMetrics
{
    public function __construct(
        private MetricsService $metrics,
    ) {}

    public function handle(Request $request, Closure $next)
    {
        $start = microtime(true);
        $response = $next($request);
        $duration = microtime(true) - $start;

        // Normalize path to avoid cardinality explosion
        // /blog/my-post → /blog/{slug}
        $path = $this->normalizePath($request->route());

        $this->metrics->recordHttpRequest(
            method: $request->method(),
            path: $path,
            statusCode: $response->getStatusCode(),
            duration: $duration,
        );

        return $response;
    }

    private function normalizePath($route): string
    {
        if (!$route) {
            return 'unknown';
        }

        // Use route URI pattern instead of actual path
        // Example: /blog/{slug} instead of /blog/my-actual-post
        return '/' . ltrim($route->uri(), '/');
    }
}
Why normalizePath()? If you label metrics with actual paths (/blog/post-1, /blog/post-2, ...), every distinct path becomes its own time series. Thousands of series mean high memory use and slow queries in Prometheus, and more keys in the Redis storage as well. This is called cardinality explosion. Always label by route pattern: /blog/{slug}.
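The effect is easy to demonstrate with a few lines of plain PHP. The sketch below counts distinct label combinations, which is how many time series a single counter would produce; countSeries() is a hypothetical helper written for this illustration.

```php
<?php
declare(strict_types=1);

/**
 * Count distinct label combinations, i.e. how many time series one metric produces.
 *
 * @param array<int, array<string, string>> $labelSets
 */
function countSeries(array $labelSets): int
{
    $unique = [];
    foreach ($labelSets as $labels) {
        ksort($labels); // label order must not matter
        $unique[json_encode($labels)] = true;
    }

    return count($unique);
}

$raw = [];
$normalized = [];
for ($i = 1; $i <= 1000; $i++) {
    // Labelling with the actual path: one series per blog post
    $raw[] = ['method' => 'GET', 'path' => "/blog/post-$i", 'status' => '200'];
    // Labelling with the route pattern: every post shares one series
    $normalized[] = ['method' => 'GET', 'path' => '/blog/{slug}', 'status' => '200'];
}

echo countSeries($raw), " series with raw paths\n";
echo countSeries($normalized), " series with route patterns\n";
```

For a histogram the gap is wider still, because each label combination yields one series per bucket plus the _sum and _count series.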
Route Endpoint for Prometheus Scraping
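// routes/web.php
use App\Services\MetricsService;
use Illuminate\Support\Facades\Route;

// Protect this endpoint! auth.basic authenticates against the users table
// (email column by default), so Prometheus needs matching credentials.
// Alternatively, restrict access by IP.
Route::get('/metrics', function (MetricsService $metrics) {
    return response($metrics->render(), 200, [
        'Content-Type' => 'text/plain; version=0.0.4',
    ]);
})->middleware('auth.basic');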
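It helps to know what Prometheus actually receives when it scrapes this endpoint. The hand-rolled sketch below renders a counter in the Prometheus text exposition format; in the application this is the job of RenderTextFormat, and renderCounter() and escapeLabelValue() here are hypothetical helpers for illustration only.

```php
<?php
declare(strict_types=1);

/** Escape backslashes, newlines and double quotes inside a label value. */
function escapeLabelValue(string $value): string
{
    return str_replace(['\\', "\n", '"'], ['\\\\', '\\n', '\\"'], $value);
}

/**
 * Render one counter family as text.
 *
 * @param array<int, array{labels: array<string, string>, value: int|float}> $samples
 */
function renderCounter(string $name, string $help, array $samples): string
{
    $lines = ["# HELP $name $help", "# TYPE $name counter"];

    foreach ($samples as $sample) {
        $pairs = [];
        foreach ($sample['labels'] as $label => $value) {
            $pairs[] = $label . '="' . escapeLabelValue($value) . '"';
        }
        $lines[] = $name . '{' . implode(',', $pairs) . '} ' . $sample['value'];
    }

    return implode("\n", $lines) . "\n";
}

$output = renderCounter('laravel_http_requests_total', 'Total HTTP requests', [
    ['labels' => ['method' => 'GET', 'path' => '/blog/{slug}', 'status' => '200'], 'value' => 42],
    ['labels' => ['method' => 'POST', 'path' => '/orders', 'status' => '500'], 'value' => 3],
]);

echo $output;
```

Each line after the HELP and TYPE comments is one time series: the metric name, its label set, and the current value.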
Listening to Database Queries
// app/Providers/AppServiceProvider.php
use Illuminate\Support\Facades\DB;
use App\Services\MetricsService;

// Inside the AppServiceProvider class:
public function boot(): void
{
    // Only enable in production, slight overhead
    if (app()->isProduction()) {
        // Resolve the service once instead of building it again for every query
        $metrics = app(MetricsService::class);

        DB::listen(function ($query) use ($metrics) {
            $metrics->recordDatabaseQuery(
                duration: $query->time / 1000, // ms → seconds
                connection: $query->connectionName,
            );
        });
    }
}
Part 3: Grafana Dashboards
Grafana connects to Prometheus to display metrics as beautiful charts.
Setup with Docker Compose
# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'
  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/dashboards:/etc/grafana/provisioning/dashboards
      - ./monitoring/datasources:/etc/grafana/provisioning/datasources
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
      - GF_INSTALL_PLUGINS=grafana-clock-panel
  # Node Exporter: system metrics (CPU, RAM, Disk)
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
volumes:
  prometheus_data:
  grafana_data:
Prometheus Config
# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
  # Metrics from Laravel app
  - job_name: 'laravel'
    metrics_path: /metrics
    basic_auth:
      username: prometheus
      password: your-metrics-password
    static_configs:
      - targets: ['your-app-server:80']
        labels:
          app: laravel-blog
          env: production
  # System metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']
  # The nginx, mysql and redis jobs below assume exporter services that you run separately
  # Nginx metrics
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']
  # MySQL metrics
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
  # Redis metrics
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
Grafana Datasource Auto-provisioning
# monitoring/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
Useful PromQL Queries for Dashboards
# 1. Request rate (requests/second)
rate(laravel_http_requests_total[5m])
# 2. Error rate (% of requests returning 5xx)
sum(rate(laravel_http_requests_total{status=~"5.."}[5m]))
/
sum(rate(laravel_http_requests_total[5m])) * 100
# 3. Response time percentile 95
histogram_quantile(0.95,
  rate(laravel_http_request_duration_seconds_bucket[5m])
)
# 4. Response time percentile 99
histogram_quantile(0.99,
  rate(laravel_http_request_duration_seconds_bucket[5m])
)
# 5. Slowest endpoints
topk(10,
  histogram_quantile(0.95,
    sum by (path, le) (
      rate(laravel_http_request_duration_seconds_bucket[5m])
    )
  )
)
# 6. Database query rate
rate(laravel_database_query_duration_seconds_count[5m])
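# 7. Slow queries (> 100ms): all queries minus those that finished within 100ms
sum(rate(laravel_database_query_duration_seconds_count[5m]))
-
sum(rate(laravel_database_query_duration_seconds_bucket{le="0.1"}[5m]))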
# 8. Current queue size
laravel_queue_size
# 9. Failed jobs rate
rate(laravel_queue_jobs_total{status="failed"}[5m])
# 10. CPU usage (from node-exporter)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
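The arithmetic behind queries 1 and 2 can be checked by hand. The plain-PHP sketch below computes a per-second rate from two counter readings and turns two rates into an error percentage. It is deliberately simplified: PromQL's rate() also handles counter resets and extrapolates to the window edges, which this sketch ignores, and both function names are made up for illustration.

```php
<?php
declare(strict_types=1);

/** Average per-second increase of a counter between two readings. */
function perSecondRate(float $firstSample, float $lastSample, float $windowSeconds): float
{
    if ($windowSeconds <= 0) {
        throw new InvalidArgumentException('Window must be positive.');
    }

    return ($lastSample - $firstSample) / $windowSeconds;
}

/** Share of failing requests in percent; 0.0 when there was no traffic. */
function errorPercentage(float $errorRate, float $totalRate): float
{
    // PromQL would yield NaN for 0/0; the sketch returns 0.0 instead
    return $totalRate > 0 ? $errorRate / $totalRate * 100 : 0.0;
}

// Over a 5-minute window the request counter went from 10,000 to 13,000
$totalRate = perSecondRate(10000.0, 13000.0, 300.0);

// In the same window the 5xx counter went from 50 to 80
$errorRate = perSecondRate(50.0, 80.0, 300.0);

$errorPct = errorPercentage($errorRate, $totalRate);
printf("%.1f req/s, %.1f%% errors\n", $totalRate, $errorPct);
```

Working from rates rather than raw counter values is what makes the result independent of how long the application has been running.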
Part 4: Alerting — Notifications When Things Go Wrong
Prometheus Alert Rules
# monitoring/alerts.yml
# Prometheus only loads this file if prometheus.yml lists it under rule_files
# and the file is mounted into the Prometheus container.
groups:
  - name: laravel-alerts
    rules:
      # Error rate > 5% for 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(laravel_http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(laravel_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Abnormally high error rate"
          description: "Error rate is at {{ $value | humanizePercentage }}. Threshold: 5%"
      # Response time p95 > 2 seconds
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            rate(laravel_http_request_duration_seconds_bucket[5m])
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
          description: "P95 latency: {{ $value | humanizeDuration }}"
      # Queue size > 1000 jobs
      - alert: QueueBacklog
        expr: laravel_queue_size > 1000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Queue backlog detected"
          description: "{{ $value }} jobs waiting to be processed"
      # Disk usage > 85%
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_size_bytes - node_filesystem_free_bytes)
          / node_filesystem_size_bytes > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk space running low"
          description: "Disk usage: {{ $value | humanizePercentage }}"
      # Server down
      - alert: ServerDown
        expr: up{job="laravel"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Laravel server is not responding"
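The for: field is what keeps a short spike from paging anyone: the expression has to stay true across evaluations for the whole duration before the alert moves from pending to firing. A minimal plain-PHP sketch of that state machine follows; the class PendingAlert is hypothetical and leaves out everything else Prometheus does around alerts.

```php
<?php
declare(strict_types=1);

final class PendingAlert
{
    /** Unix timestamp of the first evaluation in the current streak, or null */
    private ?int $activeSince = null;

    public function __construct(private int $forSeconds)
    {
    }

    /** Returns 'inactive', 'pending' or 'firing' for the evaluation at $now. */
    public function evaluate(bool $conditionTrue, int $now): string
    {
        if (!$conditionTrue) {
            // Any false evaluation resets the streak
            $this->activeSince = null;
            return 'inactive';
        }

        $this->activeSince ??= $now;

        return ($now - $this->activeSince) >= $this->forSeconds ? 'firing' : 'pending';
    }
}

// for: 5m, evaluated once a minute; the condition holds until t=300 and clears at t=360
$alert = new PendingAlert(forSeconds: 300);
$timeline = [0 => true, 60 => true, 120 => true, 180 => true, 240 => true, 300 => true, 360 => false];

$states = [];
foreach ($timeline as $now => $conditionTrue) {
    $states[$now] = $alert->evaluate($conditionTrue, $now);
}

foreach ($states as $now => $state) {
    echo "t=$now: $state\n";
}
```

Had the condition turned false at any point before t=300, the streak would have started over and nothing would have fired.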
Sending Alerts via Slack
# monitoring/alertmanager.yml
# Requires a running Alertmanager that Prometheus points to
# via alerting.alertmanagers in prometheus.yml.
global:
  resolve_timeout: 5m
route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'
receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#monitoring'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts-critical'
        title: '🚨 {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
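The routing tree in this config boils down to one rule: an alert goes to the first child route whose match labels all agree with the alert's labels, and to the top-level receiver otherwise. The plain-PHP sketch below mirrors that rule for this config; pickReceiver() is a hypothetical helper, and it ignores Alertmanager features such as continue, regex matchers, and nested routes.

```php
<?php
declare(strict_types=1);

/**
 * @param array<string, string> $alertLabels
 * @param array<int, array{match: array<string, string>, receiver: string}> $childRoutes
 */
function pickReceiver(array $alertLabels, array $childRoutes, string $defaultReceiver): string
{
    foreach ($childRoutes as $route) {
        $matchesAll = true;
        foreach ($route['match'] as $label => $expected) {
            if (($alertLabels[$label] ?? null) !== $expected) {
                $matchesAll = false;
                break;
            }
        }

        if ($matchesAll) {
            return $route['receiver']; // the first matching child route wins
        }
    }

    return $defaultReceiver; // no child matched: fall back to the top-level receiver
}

$childRoutes = [
    ['match' => ['severity' => 'critical'], 'receiver' => 'slack-critical'],
];

$critical = pickReceiver(
    ['alertname' => 'ServerDown', 'severity' => 'critical'],
    $childRoutes,
    'slack-notifications'
);
$warning = pickReceiver(
    ['alertname' => 'HighLatency', 'severity' => 'warning'],
    $childRoutes,
    'slack-notifications'
);

echo "ServerDown -> $critical\n";
echo "HighLatency -> $warning\n";
```

So ServerDown and HighErrorRate land in #alerts-critical, while the warning-level alerts go to #monitoring.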
Part 5: Health Check Endpoint
Create a simple endpoint to check application health:
// routes/web.php
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Queue;
use Illuminate\Support\Facades\Route;

Route::get('/health', function () {
    $checks = [];
    $healthy = true;

    // Check database
    try {
        DB::connection()->getPdo();
        $checks['database'] = 'ok';
    } catch (\Throwable $e) {
        // Send details to the log instead of exposing them on a public endpoint
        report($e);
        $checks['database'] = 'failed';
        $healthy = false;
    }

    // Check Redis
    try {
        Cache::store('redis')->put('health-check', true, 10);
        $checks['redis'] = 'ok';
    } catch (\Throwable $e) {
        report($e);
        $checks['redis'] = 'failed';
        $healthy = false;
    }

    // Check disk space
    $freeSpace = disk_free_space('/');
    $totalSpace = disk_total_space('/');
    $usagePercent = round((1 - $freeSpace / $totalSpace) * 100, 1);
    $checks['disk'] = $usagePercent . '% used';
    if ($usagePercent > 90) {
        $healthy = false;
    }

    // Check queue
    try {
        $queueSize = Queue::size('default');
        $checks['queue_size'] = $queueSize;
        if ($queueSize > 1000) {
            $healthy = false;
        }
    } catch (\Throwable $e) {
        report($e);
        $checks['queue_size'] = 'failed';
        $healthy = false;
    }

    return response()->json([
        'status' => $healthy ? 'healthy' : 'unhealthy',
        'checks' => $checks,
        'timestamp' => now()->toISOString(),
    ], $healthy ? 200 : 503);
});
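The decision logic of the endpoint can be pulled out of the route closure so it is testable without a database or Redis. Below is a minimal plain-PHP sketch of that idea; diskUsagePercent() and summarizeHealth() are hypothetical helpers, and the 90% threshold simply mirrors the one used in the route.

```php
<?php
declare(strict_types=1);

/** Disk usage in percent, rounded to one decimal place. */
function diskUsagePercent(float $freeBytes, float $totalBytes): float
{
    if ($totalBytes <= 0) {
        throw new InvalidArgumentException('Total disk size must be positive.');
    }

    return round((1 - $freeBytes / $totalBytes) * 100, 1);
}

/**
 * Collapse individual check results into a status and an HTTP code.
 *
 * @param array<string, bool> $checks check name => passed?
 * @return array{status: string, http_code: int, failed: array<int, string>}
 */
function summarizeHealth(array $checks): array
{
    $failed = array_keys(array_filter($checks, fn (bool $passed): bool => !$passed));

    return [
        'status' => $failed === [] ? 'healthy' : 'unhealthy',
        'http_code' => $failed === [] ? 200 : 503,
        'failed' => $failed,
    ];
}

$summary = summarizeHealth([
    'database' => true,
    'redis' => true,
    'disk' => diskUsagePercent(25.0, 100.0) <= 90, // 75% used, below the threshold
    'queue' => false,                               // e.g. backlog above 1000 jobs
]);

echo $summary['status'], ' (HTTP ', $summary['http_code'], ")\n";
```

Returning 503 on any failed check is what lets a load balancer or uptime monitor react without parsing the JSON body.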
Part 6: Laravel Pulse Integration (Laravel 10+)
If you don't want to set up Prometheus + Grafana, Laravel Pulse is a simpler first-party package:
composer require laravel/pulse
php artisan vendor:publish --provider="Laravel\Pulse\PulseServiceProvider"
php artisan migrate
Pulse automatically tracks:
- Slow requests
- Slow queries
- Slow jobs
- Exceptions
- Cache hits/misses
- Queue throughput
Access the dashboard at /pulse.
Comparison:
| | Laravel Pulse | Prometheus + Grafana |
|---|---|---|
| Setup | 5 minutes | 2-4 hours |
| Custom metrics | Limited | Unlimited |
| Long-term storage | Database | Prometheus TSDB |
| Alerting | ❌ | ✅ |
| Multi-server | ⚠️ | ✅ |
| Best for | Small/medium apps | Large apps, teams |
Production Monitoring Checklist
- Logging: JSON format, CloudWatch/ELK, structured context
- Request ID: Every request has a unique ID throughout its lifecycle
- Metrics: Response time, error rate, throughput
- Health check: /health endpoint for ALB/uptime monitor
- Alerting: Slack/email when error rate is high or server is down
- Dashboard: Grafana or Pulse with key panels
- Log rotation: Don't let log files eat all your disk space
- Uptime monitoring: External service (UptimeRobot, Pingdom)
Conclusion
Monitoring isn't "nice to have" — it's mandatory for any production application. Start simple:
- Week 1: Structured logging + CloudWatch/stderr
- Week 2: Health check endpoint + UptimeRobot
- Week 3: Laravel Pulse or basic Prometheus metrics
- Week 4: Grafana dashboards + Slack alerts
You don't need to set up everything on day one. But at minimum, have logging with context and a health check endpoint before going live.
"If you can't measure it, you can't improve it." — Peter Drucker (and every SRE engineer)