Monitoring Laravel in Production: CloudWatch, Prometheus & Grafana

Deploying is not the finish line. How many times have you received a message like "Hey, the site is really slow" or "Why can't I place an order?" and had no idea anything was wrong until a user complained?

Monitoring helps you detect problems before your users run into them. This article walks through setting up a comprehensive observability stack for Laravel, from simple to advanced.

The Three Pillars of Observability

┌─────────────────────────────────────────┐
│            OBSERVABILITY                │
├──────────┬──────────┬───────────────────┤
│  LOGS    │ METRICS  │    TRACES         │
│          │          │                   │
│ "What    │ "How     │  "Where a request │
│ happened"│ many, how│  went, how long   │
│          │ long"    │  at each step"    │
│          │          │                   │
│ Monolog  │Prometheus│  OpenTelemetry    │
│CloudWatch│ Grafana  │  Jaeger/Zipkin    │
└──────────┴──────────┴───────────────────┘

Logs: Records of individual events. "Request returned 404 at 15:30", "Payment failed for user #123".
Metrics: Aggregated numbers. "CPU at 80%", "500 requests/second", "Average response time 200ms".
Traces: The lifecycle of a single request across multiple services. "Request → Controller → Database (150ms) → Redis (5ms) → Response".

Part 1: Logging, the Foundation

Configuring Monolog for Production

// config/logging.php
'channels' => [
    'stack' => [
        'driver' => 'stack',
        'channels' => ['daily', 'cloudwatch'],
        'ignore_exceptions' => false,
    ],

    'daily' => [
        'driver' => 'daily',
        'path' => storage_path('logs/laravel.log'),
        'level' => env('LOG_LEVEL', 'info'),
        'days' => 14,
        'replace_placeholders' => true,
    ],

    'cloudwatch' => [
        'driver' => 'custom',
        'via' => App\Logging\CloudWatchLoggerFactory::class,
        'level' => env('LOG_LEVEL', 'info'),
        'retention' => 30,
        'group_name' => env('CLOUDWATCH_LOG_GROUP', '/laravel/production'),
        'stream_name' => env('CLOUDWATCH_LOG_STREAM', 'application'),
    ],
],

Creating the CloudWatch Logger

composer require aws/aws-sdk-php maxbanton/cwh

// app/Logging/CloudWatchLoggerFactory.php
namespace App\Logging;

use Aws\CloudWatchLogs\CloudWatchLogsClient;
use Maxbanton\Cwh\Handler\CloudWatch;
use Monolog\Formatter\JsonFormatter;
use Monolog\Logger;

class CloudWatchLoggerFactory
{
    public function __invoke(array $config): Logger
    {
        $client = new CloudWatchLogsClient([
            'region'  => config('services.aws.region', 'ap-southeast-1'),
            'version' => 'latest',
        ]);

        // Positional arguments (client, group, stream, retention in days, batch size)
        // avoid depending on the handler's parameter names
        $handler = new CloudWatch(
            $client,
            $config['group_name'],
            $config['stream_name'],
            $config['retention'],
            25,
        );

        // JSON format makes the logs easy to query in CloudWatch Logs Insights
        $handler->setFormatter(new JsonFormatter());

        $logger = new Logger('cloudwatch');
        $logger->pushHandler($handler);

        return $logger;
    }
}

Why JSON? CloudWatch Logs Insights has its own pipe-based query language (not SQL, although it reads similarly) and automatically discovers the fields of JSON log events. With JSON output you can run queries such as:

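# Find the most recent errors (pick the time range, e.g. the last hour, in the console)
# Monolog's JsonFormatter writes the severity name to level_name; level is numeric
fields @timestamp, context.exception.class, message
| filter level_name = "ERROR"
| sort @timestamp desc
| limit 50

# Count errors by exception class
fields context.exception.class
| filter level_name = "ERROR"
| stats count(*) as error_count by context.exception.class
| sort error_count desc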
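
To make the field names used in these queries concrete, here is a rough sketch of the JSON line that ends up in CloudWatch. It is a hand-written approximation, not Monolog's own code: formatLogRecord() is a hypothetical helper, and the real formatter adds more fields (for example extra).

```php
<?php

/**
 * Hypothetical helper: builds a JSON line shaped roughly like a record
 * produced by Monolog's JsonFormatter, so the queryable fields are visible.
 */
function formatLogRecord(string $levelName, string $message, array $context = []): string
{
    // Monolog's numeric severities for a few common levels.
    $levels = ['DEBUG' => 100, 'INFO' => 200, 'WARNING' => 300, 'ERROR' => 400];

    $record = [
        'message'    => $message,
        'context'    => $context,
        'level'      => $levels[$levelName] ?? 0, // numeric severity
        'level_name' => $levelName,               // the string you filter on
        'channel'    => 'cloudwatch',
        'datetime'   => date(DATE_ATOM),
    ];

    return json_encode($record, JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR);
}

echo formatLogRecord('ERROR', 'Payment failed', ['order_id' => 42]), PHP_EOL;
```

Nested keys such as context.order_id become queryable with dot notation once the event is valid JSON.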

Structured Logging: Adding Context

Don't log a bare message. Add context so debugging is easier:

// ❌ Hard to debug
Log::error('Payment failed');

// ✅ Easy to debug
Log::error('Payment failed', [
    'user_id'    => $user->id,
    'order_id'   => $order->id,
    'amount'     => $order->total,
    'gateway'    => 'stripe',
    'error_code' => $e->getCode(),
    'error_msg'  => $e->getMessage(),
    'trace_id'   => request()->header('X-Request-ID'),
]);

Middleware to Attach a Request ID

// app/Http/Middleware/RequestId.php
namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;
use Illuminate\Support\Str;
use Illuminate\Support\Facades\Log;

class RequestId
{
    public function handle(Request $request, Closure $next)
    {
        $requestId = $request->header('X-Request-ID', (string) Str::uuid());

        // Attach to every log entry
        Log::shareContext([
            'request_id' => $requestId,
            'ip'         => $request->ip(),
            'url'        => $request->fullUrl(),
            'method'     => $request->method(),
        ]);

        $response = $next($request);

        $response->headers->set('X-Request-ID', $requestId);

        return $response;
    }
}

About Log::shareContext(): available since Laravel 9.x, this method adds the given context to every log entry written afterwards, on every channel, for the rest of the request. You no longer need to pass $requestId to each logging call.
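
The merging behaviour can be sketched in plain PHP. ContextualLogger below is a hypothetical stand-in, not Laravel's logger; it only illustrates the idea that shared context is merged into each record and that per-call context wins on key collisions.

```php
<?php

/** Hypothetical stand-in for a logger with shared context; not Laravel's class. */
final class ContextualLogger
{
    /** @var array<string, mixed> context attached to every record */
    private array $shared = [];

    /** @var list<array{message: string, context: array<string, mixed>}> */
    public array $records = [];

    public function shareContext(array $context): void
    {
        // Later calls add to (or overwrite) previously shared keys.
        $this->shared = array_merge($this->shared, $context);
    }

    public function error(string $message, array $context = []): void
    {
        // Per-call context overrides shared context when keys collide.
        $this->records[] = [
            'message' => $message,
            'context' => array_merge($this->shared, $context),
        ];
    }
}

$log = new ContextualLogger();
$log->shareContext(['request_id' => 'req-123', 'method' => 'POST']);
$log->error('Payment failed', ['order_id' => 42]);

print_r($log->records[0]['context']);
```

The record carries request_id and method even though the call site only passed order_id, which is what lets you filter a whole request's logs by one ID.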

Part 2: Metrics with Prometheus

Prometheus is an open-source monitoring system built on a pull model: the Prometheus server periodically scrapes metrics from your application.

Installation

composer require promphp/prometheus_client_php

Creating a Metrics Service

// app/Services/MetricsService.php
namespace App\Services;

use Prometheus\CollectorRegistry;
use Prometheus\RenderTextFormat;
use Prometheus\Storage\InMemory;
use Prometheus\Storage\Redis;

class MetricsService
{
    private CollectorRegistry $registry;

    public function __construct()
    {
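        // Use Redis so metrics persist across requests (PHP shares no memory between them).
        // The Redis adapter needs the phpredis extension (ext-redis).
        // InMemory is only suitable for tests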
        $this->registry = new CollectorRegistry(
            new Redis([
                'host' => config('database.redis.default.host'),
                'port' => config('database.redis.default.port'),
            ])
        );
    }

    /**
     * Count HTTP requests by method, path and status, and record their duration
     */
    public function recordHttpRequest(
        string $method,
        string $path,
        int $statusCode,
        float $duration
    ): void {
        // Counter: only ever increases
        $counter = $this->registry->getOrRegisterCounter(
            'laravel',
            'http_requests_total',
            'Total number of HTTP requests',
            ['method', 'path', 'status']
        );
        // Label values are strings, so cast the status code
        $counter->inc([$method, $path, (string) $statusCode]);

        // Histogram: distribution of response times
        $histogram = $this->registry->getOrRegisterHistogram(
            'laravel',
            'http_request_duration_seconds',
            'Response time (seconds)',
            ['method', 'path'],
            // Buckets: 10ms, 25ms, 50ms, 100ms, 250ms, 500ms, 1s, 2.5s, 5s, 10s
            [0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
        );
        $histogram->observe($duration, [$method, $path]);
    }

    /**
     * Record the number and duration of database queries
     */
    public function recordDatabaseQuery(float $duration, string $connection): void
    {
        $histogram = $this->registry->getOrRegisterHistogram(
            'laravel',
            'database_query_duration_seconds',
            'Database query duration (seconds)',
            ['connection'],
            [0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1, 5]
        );
        $histogram->observe($duration, [$connection]);
    }

    /**
     * Count queue jobs by job name and status
     */
    public function recordQueueJob(string $job, string $status): void
    {
        $counter = $this->registry->getOrRegisterCounter(
            'laravel',
            'queue_jobs_total',
            'Total number of queue jobs',
            ['job', 'status']
        );
        $counter->inc([$job, $status]);
    }

    /**
     * Gauge: a current value that can go up or down
     */
    public function setQueueSize(string $queue, int $size): void
    {
        $gauge = $this->registry->getOrRegisterGauge(
            'laravel',
            'queue_size',
            'Number of jobs waiting in the queue',
            ['queue']
        );
        $gauge->set($size, [$queue]);
    }

    /**
     * Render all metrics in the Prometheus text format
     */
    public function render(): string
    {
        $renderer = new RenderTextFormat();
        return $renderer->render($this->registry->getMetricFamilySamples());
    }
}

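The metric types explained:

Counter: Only ever increases. Examples: total requests, total errors. A counter normally restarts from 0 when the process restarts; with the Redis storage used here it persists until the Redis keys are cleared.
Histogram: Records the distribution of values in buckets. Example: response time. Prometheus estimates percentiles (p50, p95, p99) from the buckets at query time with histogram_quantile().
Gauge: A current value that can go up or down. Examples: queue size, memory usage.

Middleware for Collecting Metrics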
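
The semantics of the three types can be sketched in a few lines of plain PHP. The Sketch* classes below are hypothetical teaching aids, not part of promphp/prometheus_client_php, and they ignore labels and storage entirely.

```php
<?php

/** Counter: a total that can only go up. */
final class SketchCounter
{
    private float $value = 0.0;

    public function inc(float $by = 1.0): void
    {
        if ($by < 0) {
            throw new InvalidArgumentException('A counter can only go up.');
        }
        $this->value += $by;
    }

    public function value(): float { return $this->value; }
}

/** Gauge: the latest value, free to rise or fall. */
final class SketchGauge
{
    private float $value = 0.0;

    public function set(float $value): void { $this->value = $value; }

    public function value(): float { return $this->value; }
}

/** Histogram: cumulative bucket counts (sum and count omitted for brevity). */
final class SketchHistogram
{
    /** @var list<float> upper bounds, ascending, ending with INF */
    private array $bounds;

    /** @var list<int> cumulative count per bound */
    private array $counts;

    public function __construct(array $upperBounds)
    {
        sort($upperBounds);
        $this->bounds = [...$upperBounds, INF];
        $this->counts = array_fill(0, count($this->bounds), 0);
    }

    public function observe(float $value): void
    {
        foreach ($this->bounds as $i => $upperBound) {
            // Cumulative: a bucket counts every observation <= its bound.
            if ($value <= $upperBound) {
                $this->counts[$i]++;
            }
        }
    }

    public function cumulativeCounts(): array { return $this->counts; }
}

$requests = new SketchCounter();
$requests->inc();
$requests->inc();

$queueSize = new SketchGauge();
$queueSize->set(12);
$queueSize->set(7);

$latency = new SketchHistogram([0.1, 0.5, 1.0]);
foreach ([0.05, 0.3, 0.3, 2.0] as $seconds) {
    $latency->observe($seconds);
}

print_r($latency->cumulativeCounts());
```

Because the buckets are cumulative, the last (+Inf) bucket always equals the total number of observations, which is what makes quantile estimation from buckets possible.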

// app/Http/Middleware/CollectMetrics.php
namespace App\Http\Middleware;

use Closure;
use Illuminate\Http\Request;
use App\Services\MetricsService;

class CollectMetrics
{
    public function __construct(
        private MetricsService $metrics,
    ) {}

    public function handle(Request $request, Closure $next)
    {
        $start = microtime(true);

        $response = $next($request);

        $duration = microtime(true) - $start;

        // Normalize the path to avoid a cardinality explosion
        // /blog/my-post → /blog/{slug}
        $path = $this->normalizePath($request->route());

        $this->metrics->recordHttpRequest(
            method: $request->method(),
            path: $path,
            statusCode: $response->getStatusCode(),
            duration: $duration,
        );

        return $response;
    }

    private function normalizePath($route): string
    {
        if (!$route) {
            return 'unknown';
        }

        // Use the route URI pattern instead of the actual path
        // Example: /blog/{slug} instead of /blog/my-actual-post
        return '/' . ltrim($route->uri(), '/');
    }
}

Why normalizePath()? If you record the actual path (/blog/post-1, /blog/post-2, ...) as a label, Prometheus creates a separate time series for every distinct value: thousands of series, more memory, slower queries. This is called a cardinality explosion. Always group by the route pattern: /blog/{slug}.

Route Endpoint for Prometheus to Scrape
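
A small sketch makes the difference measurable. countSeries() is a hypothetical helper that counts distinct (method, path) label combinations, which is how many series a single counter would produce for the given traffic.

```php
<?php

/**
 * Counts distinct label combinations, i.e. how many time series a metric
 * labelled with (method, path) would produce for these requests.
 *
 * @param list<array{method: string, path: string}> $requests
 */
function countSeries(array $requests): int
{
    $series = [];
    foreach ($requests as $request) {
        $series[$request['method'] . ' ' . $request['path']] = true;
    }

    return count($series);
}

$raw = [];
$normalized = [];
for ($i = 1; $i <= 1000; $i++) {
    // Same traffic, labelled two different ways.
    $raw[]        = ['method' => 'GET', 'path' => "/blog/post-$i"];
    $normalized[] = ['method' => 'GET', 'path' => '/blog/{slug}'];
}

printf(
    "raw paths: %d series, route patterns: %d series\n",
    countSeries($raw),
    countSeries($normalized)
);
```

For a histogram the effect is multiplied, because every label combination also gets one series per bucket plus the sum and count series.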

// routes/web.php
use App\Services\MetricsService;

Route::get('/metrics', function (MetricsService $metrics) {
    // Protect this endpoint!
    return response($metrics->render(), 200, [
        'Content-Type' => 'text/plain; version=0.0.4',
    ]);
})->middleware('auth.basic'); // Or restrict by IP. auth.basic checks the users table (email column by default)
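
For orientation, this is roughly what Prometheus receives when it scrapes the endpoint. renderCounter() below is a hypothetical, hand-rolled renderer for a single counter family; in the application the library's RenderTextFormat does this work, so treat this purely as an illustration of the text format.

```php
<?php

/**
 * Hypothetical renderer for one counter family in the Prometheus text format.
 *
 * @param list<array{labels: array<string, string>, value: int|float}> $samples
 */
function renderCounter(string $name, string $help, array $samples): string
{
    $lines = ["# HELP $name $help", "# TYPE $name counter"];

    foreach ($samples as $sample) {
        $pairs = [];
        foreach ($sample['labels'] as $label => $value) {
            // Escape backslash, double quote and newline inside label values.
            $escaped = str_replace(['\\', '"', "\n"], ['\\\\', '\\"', '\\n'], $value);
            $pairs[] = sprintf('%s="%s"', $label, $escaped);
        }
        $lines[] = sprintf('%s{%s} %s', $name, implode(',', $pairs), $sample['value']);
    }

    return implode("\n", $lines) . "\n";
}

echo renderCounter('laravel_http_requests_total', 'Total number of HTTP requests', [
    ['labels' => ['method' => 'GET', 'path' => '/blog/{slug}', 'status' => '200'], 'value' => 3],
    ['labels' => ['method' => 'POST', 'path' => '/orders', 'status' => '500'], 'value' => 1],
]);
```

Each line is one time series: the metric name, its label set, and the current value. Prometheus attaches the timestamp itself at scrape time.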

Listening to Database Queries

// app/Providers/AppServiceProvider.php
use Illuminate\Support\Facades\DB;
use App\Services\MetricsService;

public function boot(): void
{
    // Enable only in production; it adds a small overhead
    if (app()->isProduction()) {
        DB::listen(function ($query) {
            app(MetricsService::class)->recordDatabaseQuery(
                duration: $query->time / 1000, // ms → seconds
                connection: $query->connectionName,
            );
        });
    }
}

Phần 3: Grafana Dashboards

Grafana connects to Prometheus and displays the metrics as charts.

Setup with Docker Compose

# docker-compose.monitoring.yml
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./monitoring/prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.retention.time=30d'

  grafana:
    image: grafana/grafana:latest
    volumes:
      - grafana_data:/var/lib/grafana
      - ./monitoring/dashboards:/etc/grafana/provisioning/dashboards
      - ./monitoring/datasources:/etc/grafana/provisioning/datasources
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=your-secure-password
      - GF_INSTALL_PLUGINS=grafana-clock-panel

  # Node Exporter: system metrics (CPU, RAM, disk)
  node-exporter:
    image: prom/node-exporter:latest
    ports:
      - "9100:9100"
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'

volumes:
  prometheus_data:
  grafana_data:

Prometheus Config

# monitoring/prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  # Metrics from the Laravel app
  - job_name: 'laravel'
    metrics_path: /metrics
    basic_auth:
      username: prometheus
      password: your-metrics-password
    static_configs:
      - targets: ['your-app-server:80']
        labels:
          app: laravel-blog
          env: production

  # System metrics
  - job_name: 'node'
    static_configs:
      - targets: ['node-exporter:9100']

  # The three jobs below each need an exporter container,
  # which the compose file above does not define.

  # Nginx metrics
  - job_name: 'nginx'
    static_configs:
      - targets: ['nginx-exporter:9113']

  # MySQL metrics
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']

  # Redis metrics
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']

Grafana Datasource Auto-provisioning

# monitoring/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true

Useful PromQL Queries for Dashboards

# 1. Request rate (requests/second, per series)
rate(laravel_http_requests_total[5m])

# 2. Error rate (% of requests returning 5xx)
sum(rate(laravel_http_requests_total{status=~"5.."}[5m]))
/
sum(rate(laravel_http_requests_total[5m])) * 100

# 3. Response time, 95th percentile (aggregated over all methods and paths)
histogram_quantile(0.95,
  sum by (le) (rate(laravel_http_request_duration_seconds_bucket[5m]))
)

# 4. Response time, 99th percentile
histogram_quantile(0.99,
  sum by (le) (rate(laravel_http_request_duration_seconds_bucket[5m]))
)

# 5. Slowest endpoints (p95 per path)
topk(10,
  histogram_quantile(0.95,
    sum by (path, le) (
      rate(laravel_http_request_duration_seconds_bucket[5m])
    )
  )
)

# 6. Database query rate
rate(laravel_database_query_duration_seconds_count[5m])

# 7. Slow query rate (> 100ms): all queries minus those that finished within 100ms
sum(rate(laravel_database_query_duration_seconds_count[5m]))
-
sum(rate(laravel_database_query_duration_seconds_bucket{le="0.1"}[5m]))

# 8. Current queue size
laravel_queue_size

# 9. Failed jobs rate
rate(laravel_queue_jobs_total{status="failed"}[5m])

# 10. CPU usage (from node-exporter)
100 - (avg(rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)
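
It helps to know how a percentile is derived from buckets, because the result is an estimate whose precision depends on the bucket layout. The sketch below follows the linear-interpolation idea behind histogram_quantile(); estimateQuantile() is a hypothetical function, it assumes at least one finite bucket, and it is not a port of Prometheus' implementation.

```php
<?php

/**
 * Estimates a quantile from cumulative histogram buckets by linear
 * interpolation inside the bucket that contains the target rank.
 *
 * @param list<array{le: float, count: float}> $buckets sorted by le, last le is INF
 */
function estimateQuantile(float $q, array $buckets): float
{
    $total = $buckets[count($buckets) - 1]['count'];
    if ($total <= 0) {
        return NAN; // no observations
    }
    $rank = $q * $total;

    foreach ($buckets as $i => $bucket) {
        if ($bucket['count'] < $rank) {
            continue; // target rank lies in a higher bucket
        }
        if (is_infinite($bucket['le'])) {
            // Rank falls in the +Inf bucket: report the highest finite bound.
            return $buckets[$i - 1]['le'];
        }

        $lower      = $i === 0 ? 0.0 : $buckets[$i - 1]['le'];
        $countBelow = $i === 0 ? 0.0 : $buckets[$i - 1]['count'];
        $inBucket   = $bucket['count'] - $countBelow;
        if ($inBucket <= 0) {
            return $bucket['le'];
        }

        // Assume observations are spread evenly within the bucket.
        return $lower + ($bucket['le'] - $lower) * (($rank - $countBelow) / $inBucket);
    }

    return NAN;
}

$buckets = [
    ['le' => 0.1,  'count' => 50.0],
    ['le' => 0.25, 'count' => 80.0],
    ['le' => 0.5,  'count' => 95.0],
    ['le' => 1.0,  'count' => 99.0],
    ['le' => INF,  'count' => 100.0],
];

printf("p75 is about %.3fs\n", estimateQuantile(0.75, $buckets));
```

A practical consequence: if most requests land in one wide bucket, the reported percentile can be far from the truth, so bucket boundaries should sit near the latencies you care about (for example your alert threshold).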

Part 4: Alerting When Something Goes Wrong

Prometheus Alert Rules

# monitoring/alerts.yml
# Prometheus only loads this file if prometheus.yml lists it under rule_files
# and the file is mounted into the container.
groups:
  - name: laravel-alerts
    rules:
      # Error rate > 5% for 5 minutes
      - alert: HighErrorRate
        expr: |
          sum(rate(laravel_http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(laravel_http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Abnormally high error rate"
          description: "Error rate is at {{ $value | humanizePercentage }}. Threshold: 5%"

      # p95 response time > 2 seconds (aggregated over all methods and paths)
      - alert: HighLatency
        expr: |
          histogram_quantile(0.95,
            sum by (le) (rate(laravel_http_request_duration_seconds_bucket[5m]))
          ) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "High response time"
          description: "P95 latency: {{ $value | humanizeDuration }}"

      # Queue size > 1000 jobs
      - alert: QueueBacklog
        expr: laravel_queue_size > 1000
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Queue is backlogged"
          description: "{{ $value }} jobs waiting to be processed"

      # Disk usage > 85%
      - alert: DiskSpaceLow
        expr: |
          (node_filesystem_size_bytes - node_filesystem_free_bytes)
          / node_filesystem_size_bytes > 0.85
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk almost full"
          description: "Disk usage: {{ $value | humanizePercentage }}"

      # Server down
      - alert: ServerDown
        expr: up{job="laravel"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Laravel server is not responding"
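
The for: clause is what keeps short spikes from paging anyone: the expression has to stay true for the whole duration before the alert fires. The simulation below is a simplified sketch of that state machine; simulateAlertStates() is a hypothetical function and the evaluation interval is an assumed parameter.

```php
<?php

/**
 * Hypothetical simulation of an alert rule's `for:` clause.
 *
 * @param list<bool> $results      outcome of the alert expression at each evaluation
 * @param int $intervalSeconds     time between evaluations
 * @param int $forSeconds          the rule's `for:` duration
 * @return list<string>            state after each evaluation
 */
function simulateAlertStates(array $results, int $intervalSeconds, int $forSeconds): array
{
    $states = [];
    $activeSince = null; // time of the first consecutive "true" evaluation

    foreach ($results as $i => $isTrue) {
        $now = $i * $intervalSeconds;

        if (!$isTrue) {
            // Any false evaluation resets the clock.
            $activeSince = null;
            $states[] = 'inactive';
            continue;
        }

        $activeSince ??= $now;
        $states[] = ($now - $activeSince >= $forSeconds) ? 'firing' : 'pending';
    }

    return $states;
}

// Evaluated every 60s with `for: 2m`; one dip in the middle resets the clock.
$states = simulateAlertStates([true, true, false, true, true, true], 60, 120);
echo implode(' -> ', $states), PHP_EOL;
```

Only firing alerts are sent on for notification; pending ones are visible in the Prometheus UI but stay silent.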

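Sending Alerts to Slack

# monitoring/alertmanager.yml
# Requires a running Alertmanager instance and an alerting section in
# prometheus.yml that points at it; neither is part of the compose file above.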
global:
  resolve_timeout: 5m

route:
  group_by: ['alertname', 'severity']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'slack-notifications'
  routes:
    - match:
        severity: critical
      receiver: 'slack-critical'

receivers:
  - name: 'slack-notifications'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#monitoring'
        title: '{{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'

  - name: 'slack-critical'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK'
        channel: '#alerts-critical'
        title: '🚨 {{ .CommonAnnotations.summary }}'
        text: '{{ .CommonAnnotations.description }}'
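
The routing tree above boils down to "critical alerts go to one channel, everything else to the default". A much-simplified sketch of that decision follows; pickReceiver() is hypothetical and ignores grouping, regex matchers, nested routes and the continue flag.

```php
<?php

/**
 * Hypothetical, simplified route matcher: returns the receiver of the first
 * child route whose `match` labels all equal the alert's labels, otherwise
 * the root receiver.
 *
 * @param array<string, string> $alertLabels
 * @param list<array{match: array<string, string>, receiver: string}> $routes
 */
function pickReceiver(array $alertLabels, string $defaultReceiver, array $routes): string
{
    foreach ($routes as $route) {
        $matches = true;
        foreach ($route['match'] as $label => $expected) {
            if (($alertLabels[$label] ?? null) !== $expected) {
                $matches = false;
                break;
            }
        }
        if ($matches) {
            return $route['receiver']; // first matching child route wins
        }
    }

    return $defaultReceiver;
}

$routes = [
    ['match' => ['severity' => 'critical'], 'receiver' => 'slack-critical'],
];

echo pickReceiver(['alertname' => 'ServerDown', 'severity' => 'critical'], 'slack-notifications', $routes), PHP_EOL;
echo pickReceiver(['alertname' => 'QueueBacklog', 'severity' => 'warning'], 'slack-notifications', $routes), PHP_EOL;
```

This is why the severity label on each alert rule matters: it is the only thing the routing tree has to decide who gets notified.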

Phần 5: Health Check Endpoint

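Create a simple endpoint to check the application's health:

// routes/web.php
use Illuminate\Support\Facades\Cache;
use Illuminate\Support\Facades\DB;
use Illuminate\Support\Facades\Queue;
use Illuminate\Support\Facades\Route;

Route::get('/health', function () {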
    $checks = [];
    $healthy = true;

    // Check database
    try {
        DB::connection()->getPdo();
        $checks['database'] = 'ok';
    } catch (\Exception $e) {
        $checks['database'] = 'failed: ' . $e->getMessage();
        $healthy = false;
    }

    // Check Redis
    try {
        Cache::store('redis')->put('health-check', true, 10);
        $checks['redis'] = 'ok';
    } catch (\Exception $e) {
        $checks['redis'] = 'failed: ' . $e->getMessage();
        $healthy = false;
    }

    // Check disk space
    $freeSpace = disk_free_space('/');
    $totalSpace = disk_total_space('/');
    $usagePercent = round((1 - $freeSpace / $totalSpace) * 100, 1);
    $checks['disk'] = $usagePercent . '% used';
    if ($usagePercent > 90) {
        $healthy = false;
    }

    // Check queue
    try {
        $queueSize = Queue::size('default');
        $checks['queue_size'] = $queueSize;
        if ($queueSize > 1000) {
            $healthy = false;
        }
    } catch (\Exception $e) {
        $checks['queue'] = 'failed';
        $healthy = false;
    }

    return response()->json([
        'status' => $healthy ? 'healthy' : 'unhealthy',
        'checks' => $checks,
        'timestamp' => now()->toISOString(),
    ], $healthy ? 200 : 503);
});
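
As the number of checks grows, the repeated try/catch blocks can be folded into a small aggregator. The following is a minimal sketch under stated assumptions: runHealthChecks() is a hypothetical helper, each check is a callable that returns a status string or throws, and failure details are deliberately kept out of the response so a public endpoint does not leak connection errors.

```php
<?php

/**
 * Runs named checks and aggregates them into a health report.
 *
 * @param array<string, callable(): string> $checks
 * @return array{status: string, http_status: int, checks: array<string, string>}
 */
function runHealthChecks(array $checks): array
{
    $results = [];
    $healthy = true;

    foreach ($checks as $name => $check) {
        try {
            $results[$name] = $check();
        } catch (Throwable $e) {
            // Report only that the check failed; details belong in the logs.
            $results[$name] = 'failed';
            $healthy = false;
        }
    }

    return [
        'status'      => $healthy ? 'healthy' : 'unhealthy',
        'http_status' => $healthy ? 200 : 503,
        'checks'      => $results,
    ];
}

$report = runHealthChecks([
    'database' => fn (): string => 'ok',
    'redis'    => function (): string {
        throw new RuntimeException('connection refused');
    },
]);

echo json_encode($report, JSON_PRETTY_PRINT), PHP_EOL;
```

In the route closure, each callable would wrap one of the existing checks (for example the DB::connection()->getPdo() call) and the returned http_status would be passed to response()->json().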

Part 6: Integrating Laravel Pulse (Laravel 10+)

If you don't want to set up Prometheus + Grafana, Laravel Pulse, a first-party package, is a simpler alternative:

composer require laravel/pulse
php artisan vendor:publish --provider="Laravel\Pulse\PulseServiceProvider"
php artisan migrate

Pulse automatically collects:

Slow requests
Slow queries
Slow jobs
Exceptions
Cache hits/misses
Queue throughput

Access the dashboard at /pulse. Outside the local environment you must authorize access by defining the viewPulse gate.

Comparison:

                    Laravel Pulse        Prometheus + Grafana
Setup               5 minutes            2-4 hours
Custom metrics      Limited              Unlimited
Long-term storage   Database             Prometheus TSDB
Alerting            ❌                   ✅
Multi-server        ⚠️                   ✅
Best suited for     Small/medium apps    Large apps, teams

Production Monitoring Checklist

Logging: JSON format, CloudWatch/ELK, structured context
Request ID: Every request carries a unique ID end to end
Metrics: Response time, error rate, throughput
Health check: /health endpoint for the ALB/uptime monitor
Alerting: Slack/email when the error rate is high or a server is down
Dashboard: Grafana or Pulse with the key panels
Log rotation: Don't let log files fill the disk
Uptime monitoring: An external service (UptimeRobot, Pingdom)

Conclusion

Monitoring is not a "nice to have"; it is mandatory for every production application. Start simple:

Week 1: Structured logging + CloudWatch/stderr
Week 2: Health check endpoint + UptimeRobot
Week 3: Laravel Pulse or basic Prometheus metrics
Week 4: Grafana dashboards + Slack alerts

You don't need to set everything up on day one. But at a minimum, have contextual logging and a health check endpoint in place before you go live.

"If you can't measure it, you can't improve it." (attributed to Peter Drucker, and repeated by every SRE)