Fine-tuning AI Models for PHP/Laravel: Building Your Own Coding Assistant


Introduction

LLMs such as GPT, Claude, and open-source models are all trained on a wide range of programming languages. But what if you want a model that specializes in PHP/Laravel and follows your team's own coding style? That is where fine-tuning comes in.

Why Fine-tune?

| Factor | Base Model | Fine-tuned Model |
|---|---|---|
| Laravel best practices | Basic knowledge | In-depth |
| Team conventions | Unaware | Follows them |
| Domain knowledge | Generic | Specific |
| Response format | Varied | Consistent |
| Inference cost | High (large models) | Lower (smaller models) |

When Should You Fine-tune?

Fine-tune when:

  • You need the model to follow specific coding standards
  • You have specialized domain knowledge
  • You need to reduce latency/cost by using smaller models
  • You want a consistent output format

Don't fine-tune when:

  • You only need a knowledge update (use RAG instead)
  • Your dataset is too small (<1000 examples)
  • Your requirements change frequently

Preparing the Dataset

Data Sources

// app/Services/DatasetPreparation/DatasetCollector.php
namespace App\Services\DatasetPreparation;

class DatasetCollector
{
    public function collectFromCodebase(string $path): array
    {
        $examples = [];
        
        // 1. Collect from code comments/docblocks
        $examples = array_merge($examples, $this->extractDocblocks($path));
        
        // 2. Collect from tests (input/output pairs)
        $examples = array_merge($examples, $this->extractFromTests($path));
        
        // 3. Collect from git commits
        $examples = array_merge($examples, $this->extractFromCommits($path));
        
        return $examples;
    }
    
    protected function extractDocblocks(string $path): array
    {
        $finder = new \Symfony\Component\Finder\Finder();
        $finder->files()->in($path)->name('*.php');
        
        $examples = [];
        
        foreach ($finder as $file) {
            $content = $file->getContents();
            
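            // Match each docblock and the function it documents with a regex.
            // This is a rough heuristic; a real parser such as nikic/PHP-Parser is more robust.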
            preg_match_all('/\/\*\*(.*?)\*\/\s*(public|protected|private)?\s*function\s+(\w+)/s', 
                $content, $matches, PREG_SET_ORDER);
            
            foreach ($matches as $match) {
                $docblock = $this->parseDocblock($match[1]);
                $functionName = $match[3];
                
                if (!empty($docblock['description'])) {
                    $examples[] = [
                        'instruction' => "Write a PHP function named {$functionName}: {$docblock['description']}",
                        'output' => $this->extractFunction($content, $functionName)
                    ];
                }
            }
        }
        
        return $examples;
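    }

    // Note: parseDocblock(), extractFunction(), extractFromTests() and
    // extractFromCommits() are helper methods not shown in this article.
}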

Data Format for Fine-tuning

OpenAI format (JSONL):

{"messages": [{"role": "system", "content": "You are a Laravel expert..."}, {"role": "user", "content": "Create a migration for users table"}, {"role": "assistant", "content": "<?php\n\nuse Illuminate\\Database\\Migrations..."}]}
{"messages": [{"role": "system", "content": "You are a Laravel expert..."}, {"role": "user", "content": "Write a controller for CRUD posts"}, {"role": "assistant", "content": "<?php\n\nnamespace App\\Http\\Controllers..."}]}

Alpaca format (for open-source models):

{
    "instruction": "Create a Laravel migration for a posts table with title, content, and user_id fields",
    "input": "",
    "output": "<?php\n\nuse Illuminate\\Database\\Migrations\\Migration;\nuse Illuminate\\Database\\Schema\\Blueprint;\nuse Illuminate\\Support\\Facades\\Schema;\n\nreturn new class extends Migration\n{\n    public function up(): void\n    {\n        Schema::create('posts', function (Blueprint $table) {\n            $table->id();\n            $table->foreignId('user_id')->constrained()->cascadeOnDelete();\n            $table->string('title');\n            $table->text('content');\n            $table->timestamps();\n        });\n    }\n\n    public function down(): void\n    {\n        Schema::dropIfExists('posts');\n    }\n};"
}
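
Since the two formats carry the same information, one can be derived from the other. The following is a minimal Python sketch, not part of the original tooling, that converts Alpaca records into OpenAI chat-format JSONL. The function names and the placeholder system prompt are assumptions made for illustration.

```python
import json

# Placeholder system prompt (an assumption); substitute your own.
SYSTEM_PROMPT = "You are a Laravel expert."


def alpaca_to_chat(record: dict, system_prompt: str = SYSTEM_PROMPT) -> dict:
    """Convert one Alpaca record into an OpenAI chat-format training example."""
    user_content = record["instruction"]
    # Alpaca keeps optional extra context in "input"; append it when present.
    if record.get("input"):
        user_content += "\n\n" + record["input"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": record["output"]},
        ]
    }


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSONL: one JSON object per line."""
    return "".join(
        json.dumps(alpaca_to_chat(r), ensure_ascii=False) + "\n" for r in records
    )
```

Newlines inside the PHP code are escaped by `json.dumps`, so each example still occupies exactly one line of the output file.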

Building a Dataset Tool

// app/Console/Commands/GenerateTrainingDataset.php
namespace App\Console\Commands;

use Illuminate\Console\Command;
use Illuminate\Support\Facades\File;

class GenerateTrainingDataset extends Command
{
    protected $signature = 'ai:generate-dataset {--source=} {--output=training_data.jsonl}';
    protected $description = 'Generate training dataset from Laravel codebase';

    public function handle(): int
    {
        $source = $this->option('source') ?? base_path();
        $output = $this->option('output');
        
        $this->info('Collecting examples from codebase...');
        
        $examples = [];
        
        // 1. Controllers → CRUD operations
        $examples = array_merge($examples, $this->collectControllerExamples($source));
        
        // 2. Models → Eloquent patterns
        $examples = array_merge($examples, $this->collectModelExamples($source));
        
        // 3. Tests → Expected behavior
        $examples = array_merge($examples, $this->collectTestExamples($source));
        
        // 4. Blade templates → View patterns
        $examples = array_merge($examples, $this->collectBladeExamples($source));
        
        // Convert to JSONL
        $this->info('Converting to JSONL format...');
        
        $jsonl = '';
        foreach ($examples as $example) {
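            // Keep code readable in the dataset and fail loudly on unencodable input
            $jsonl .= json_encode(
                $this->formatForOpenAI($example),
                JSON_UNESCAPED_UNICODE | JSON_UNESCAPED_SLASHES | JSON_THROW_ON_ERROR
            ) . "\n";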
        }
        
        File::put(storage_path($output), $jsonl);
        
        $this->info("Generated " . count($examples) . " examples to {$output}");
        
        return self::SUCCESS;
    }
    
    protected function formatForOpenAI(array $example): array
    {
        return [
            'messages' => [
                [
                    'role' => 'system',
                    'content' => $this->getSystemPrompt()
                ],
                [
                    'role' => 'user', 
                    'content' => $example['instruction']
                ],
                [
                    'role' => 'assistant',
                    'content' => $example['output']
                ]
            ]
        ];
    }
    
    protected function getSystemPrompt(): string
    {
        return <<<PROMPT
You are an expert Laravel developer. Follow these conventions:
- Use PHP 8.3+ features (readonly, typed properties, enums)
- Follow PSR-12 coding standard
- Use strict typing (declare(strict_types=1))
- Prefer dependency injection over facades
- Use descriptive variable and method names
- Write comprehensive PHPDoc when needed
- Follow Laravel best practices and conventions
PROMPT;
    }
    
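    // Note: collectModelExamples(), collectTestExamples() and collectBladeExamples()
    // are not shown in this article.
    protected function collectControllerExamples(string $path): array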
    {
        $examples = [];
        $controllerPath = $path . '/app/Http/Controllers';
        
        if (!is_dir($controllerPath)) {
            return $examples;
        }
        
        foreach (File::allFiles($controllerPath) as $file) {
            $content = $file->getContents();
            $className = pathinfo($file->getFilename(), PATHINFO_FILENAME);
            
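            // Extract methods with their docblocks.
            // Caveat: [^}]+ stops at the first closing brace, so methods containing nested
            // blocks (if/foreach/closures) are truncated; use a real parser for those.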
            preg_match_all(
                '/\/\*\*(.*?)\*\/\s*(public\s+function\s+\w+\([^)]*\)[^{]*\{[^}]+\})/s',
                $content,
                $matches,
                PREG_SET_ORDER
            );
            
            foreach ($matches as $match) {
                $docblock = trim($match[1]);
                $method = $match[2];
                
                // Parse purpose from docblock
                if (preg_match('/@purpose\s+(.+)/', $docblock, $purposeMatch)) {
                    $examples[] = [
                        'instruction' => "In Laravel, " . trim($purposeMatch[1]),
                        'output' => $method
                    ];
                }
            }
        }
        
        return $examples;
    }
}
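
Before uploading, it is worth checking that every generated line really has the expected shape. Below is a minimal Python sketch of such a check. It assumes the exact system/user/assistant layout produced by `formatForOpenAI()` above, which is stricter than what the OpenAI format itself requires, and the function name is hypothetical.

```python
import json

# Role order produced by formatForOpenAI() in the command above.
REQUIRED_ROLES = ["system", "user", "assistant"]


def validate_jsonl(text: str) -> list[str]:
    """Return a list of problems found; an empty list means the data looks fine."""
    problems = []
    for lineno, line in enumerate(text.split("\n"), start=1):
        if not line.strip():
            continue  # ignore blank lines, e.g. the trailing newline
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        messages = record.get("messages") if isinstance(record, dict) else None
        if not isinstance(messages, list):
            problems.append(f"line {lineno}: missing 'messages' list")
            continue
        roles = [m.get("role") for m in messages if isinstance(m, dict)]
        if roles != REQUIRED_ROLES:
            problems.append(
                f"line {lineno}: expected roles {REQUIRED_ROLES}, got {roles}"
            )
        elif any(not str(m.get("content", "")).strip() for m in messages):
            problems.append(f"line {lineno}: empty message content")
    return problems
```

Running this on the generated file and refusing to upload when the list is non-empty catches truncated or empty examples early, before they cost training tokens.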

Data Augmentation

Augment the dataset by generating variations of each example:

// app/Services/DatasetPreparation/DataAugmenter.php
namespace App\Services\DatasetPreparation;

class DataAugmenter
{
    public function augment(array $example): array
    {
        $augmented = [$example];
        
        // 1. Rephrase instruction
        $augmented[] = [
            'instruction' => $this->rephrase($example['instruction']),
            'output' => $example['output']
        ];
        
        // 2. Add context variations
        $augmented[] = [
            'instruction' => "As a Laravel developer, " . lcfirst($example['instruction']),
            'output' => $example['output']
        ];
        
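        // 3. Add error scenario (introduceError() is a helper not shown in this article;
        //    it is expected to return a deliberately broken copy of the code)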
        if (str_contains($example['output'], 'function')) {
            $augmented[] = [
                'instruction' => "Fix this Laravel code: " . $this->introduceError($example['output']),
                'output' => $example['output']
            ];
        }
        
        return $augmented;
    }
    
    protected function rephrase(string $instruction): string
    {
        $replacements = [
            'Create' => ['Write', 'Generate', 'Implement', 'Build'],
            'function' => ['method', 'function'],
            'controller' => ['controller class', 'HTTP controller'],
        ];
        
        foreach ($replacements as $original => $alternatives) {
            if (str_contains($instruction, $original)) {
                $instruction = str_replace(
                    $original, 
                    $alternatives[array_rand($alternatives)], 
                    $instruction
                );
                break;
            }
        }
        
        return $instruction;
    }
}
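
Augmentation has two side effects worth handling. `rephrase()` can return the instruction unchanged (for example when it swaps 'function' for 'function'), which creates exact duplicates, and several variations now share the same output, which can leak between training and validation data. The Python sketch below is one way to deal with both; the function names, the split ratio, and the seed are assumptions.

```python
import hashlib
import random


def dedupe(examples: list[dict]) -> list[dict]:
    """Drop examples whose (instruction, output) pair has already been seen."""
    seen = set()
    unique = []
    for ex in examples:
        raw = ex["instruction"] + "\x00" + ex["output"]
        key = hashlib.sha256(raw.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


def split_by_output(examples: list[dict], validation_ratio: float = 0.1, seed: int = 42):
    """Split so that all variations of one output land on the same side."""
    groups: dict[str, list[dict]] = {}
    for ex in examples:
        groups.setdefault(ex["output"], []).append(ex)

    keys = sorted(groups)               # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(keys)

    # Hold out at least one group whenever there is more than one.
    n_val = max(1, int(len(keys) * validation_ratio)) if len(keys) > 1 else 0
    val_keys = set(keys[:n_val])

    train = [ex for k in keys if k not in val_keys for ex in groups[k]]
    val = [ex for k in keys if k in val_keys for ex in groups[k]]
    return train, val
```

Grouping by output before splitting means the validation set only contains code the model never saw during training, under any phrasing of the instruction.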

Fine-tuning with OpenAI

Upload Dataset

// app/Services/FineTuning/OpenAIFineTuner.php
namespace App\Services\FineTuning;

use OpenAI\Laravel\Facades\OpenAI;
use Illuminate\Support\Facades\Storage;

class OpenAIFineTuner
{
    public function uploadDataset(string $filePath): string
    {
        $response = OpenAI::files()->upload([
            'purpose' => 'fine-tune',
            'file' => fopen($filePath, 'r'),
        ]);
        
        return $response->id;
    }
    
    public function createFineTuneJob(string $fileId, array $options = []): string
    {
        $response = OpenAI::fineTuning()->createJob([
            'training_file' => $fileId,
            'model' => $options['base_model'] ?? 'gpt-4o-mini-2024-07-18',
            'hyperparameters' => [
                'n_epochs' => $options['epochs'] ?? 3,
                'batch_size' => $options['batch_size'] ?? 'auto',
                'learning_rate_multiplier' => $options['learning_rate'] ?? 'auto',
            ],
            'suffix' => $options['suffix'] ?? 'laravel-assistant',
        ]);
        
        return $response->id;
    }
    
    public function checkJobStatus(string $jobId): array
    {
        $job = OpenAI::fineTuning()->retrieveJob($jobId);
        
        return [
            'status' => $job->status,
            'model' => $job->fineTunedModel,
            'trained_tokens' => $job->trainedTokens,
            'error' => $job->error?->message,
        ];
    }
    
    public function listJobs(): array
    {
        $response = OpenAI::fineTuning()->listJobs(['limit' => 10]);
        
        return array_map(fn($job) => [
            'id' => $job->id,
            'status' => $job->status,
            'model' => $job->fineTunedModel,
            'created_at' => $job->createdAt,
        ], $response->data);
    }
}

Artisan Commands

// app/Console/Commands/FineTuneModel.php
namespace App\Console\Commands;

use App\Services\FineTuning\OpenAIFineTuner;
use Illuminate\Console\Command;

class FineTuneModel extends Command
{
    protected $signature = 'ai:fine-tune 
                            {--dataset= : Path to training dataset}
                            {--model=gpt-4o-mini-2024-07-18 : Base model}
                            {--epochs=3 : Number of training epochs}
                            {--suffix=laravel : Model suffix}';
    
    protected $description = 'Start a fine-tuning job for Laravel code assistant';

    public function handle(OpenAIFineTuner $fineTuner): int
    {
        $datasetPath = $this->option('dataset') ?? storage_path('training_data.jsonl');
        
        if (!file_exists($datasetPath)) {
            $this->error("Dataset not found: {$datasetPath}");
            return self::FAILURE;
        }
        
        $this->info('Uploading dataset...');
        $fileId = $fineTuner->uploadDataset($datasetPath);
        $this->info("File uploaded: {$fileId}");
        
        $this->info('Creating fine-tune job...');
        $jobId = $fineTuner->createFineTuneJob($fileId, [
            'base_model' => $this->option('model'),
            'epochs' => (int) $this->option('epochs'),
            'suffix' => $this->option('suffix'),
        ]);
        
        $this->info("Fine-tune job created: {$jobId}");
        $this->info("Monitor with: php artisan ai:fine-tune-status {$jobId}");
        
        return self::SUCCESS;
    }
}

// app/Console/Commands/FineTuneStatus.php
namespace App\Console\Commands;

use App\Services\FineTuning\OpenAIFineTuner;
use Illuminate\Console\Command;

class FineTuneStatus extends Command
{
    protected $signature = 'ai:fine-tune-status {job?}';
    protected $description = 'Check fine-tuning job status';

    public function handle(OpenAIFineTuner $fineTuner): int
    {
        $jobId = $this->argument('job');
        
        if ($jobId) {
            $status = $fineTuner->checkJobStatus($jobId);
            
            $this->table(
                ['Property', 'Value'],
                collect($status)->map(fn($v, $k) => [$k, $v ?? 'N/A'])->toArray()
            );
            
            if ($status['status'] === 'succeeded') {
                $this->info("Model ready: {$status['model']}");
                $this->info("Add to .env: OPENAI_FINE_TUNED_MODEL={$status['model']}");
            }
        } else {
            $jobs = $fineTuner->listJobs();
            
            $this->table(
                ['ID', 'Status', 'Model', 'Created'],
                $jobs
            );
        }
        
        return self::SUCCESS;
    }
}
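
Fine-tuning jobs can take a while, so instead of re-running the status command by hand you may prefer to poll until the job reaches a terminal state. The Python sketch below shows the polling logic with exponential backoff. `fetch_status` is a hypothetical callable standing in for whatever retrieves the job status (it is not a real client API), and the delay values are assumptions.

```python
import time
from typing import Callable

# Terminal statuses reported for OpenAI fine-tuning jobs.
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}


def wait_for_job(
    fetch_status: Callable[[], str],
    initial_delay: float = 5.0,
    max_delay: float = 60.0,
    timeout: float = 3600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll fetch_status() until the job finishes, doubling the delay each time."""
    delay = initial_delay
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"job still '{status}' after {waited:.0f}s")
        sleep(delay)
        waited += delay
        delay = min(delay * 2, max_delay)  # back off, but never beyond max_delay
```

The `sleep` parameter is injectable so the loop can be tested without actually waiting.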

Fine-tuning Open-Source Models

Using Unsloth (Llama, Mistral)

# scripts/finetune_llama.py
from unsloth import FastLanguageModel
from datasets import load_dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Load base model
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="unsloth/llama-3-8b-bnb-4bit",
    max_seq_length=2048,
    load_in_4bit=True,
)

# Setup LoRA
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    lora_alpha=16,
    lora_dropout=0,
    bias="none",
    use_gradient_checkpointing="unsloth",
)

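# Load dataset (Alpaca-style records with "instruction" and "output" fields)
dataset = load_dataset("json", data_files="training_data.json", split="train")

# Format each record into a single "text" field and append the EOS token
# so the model learns where a response ends
def format_prompts(examples):
    texts = [
        f"### Instruction:\n{instruction}\n\n### Response:\n{output}{tokenizer.eos_token}"
        for instruction, output in zip(examples["instruction"], examples["output"])
    ]
    return {"text": texts}

dataset = dataset.map(format_prompts, batched=True)

# Training (argument names follow the older TRL API used here; newer TRL
# releases move some of them into SFTConfig)
trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        warmup_steps=5,
        max_steps=100,
        learning_rate=2e-4,
        fp16=True,  # on GPUs with bfloat16 support, bf16=True is usually the better choice
        logging_steps=1,
        output_dir="outputs",
        optim="adamw_8bit",
    ),
)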

trainer.train()

# Save model
model.save_pretrained_merged("laravel-llama-3-8b", tokenizer, save_method="merged_16bit")
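
Note that `max_steps=100` fixes the amount of training regardless of dataset size. A quick calculation shows how much of the data those steps actually cover. The helper below is a small illustrative sketch; the dataset size of 1,000 is a made-up figure, not a number from this article.

```python
def training_coverage(
    per_device_batch_size: int,
    gradient_accumulation_steps: int,
    max_steps: int,
    dataset_size: int,
    num_devices: int = 1,
) -> dict:
    """Estimate how many examples (and epochs) a step-limited run covers."""
    # One optimizer step consumes batch_size * accumulation_steps examples per device.
    effective_batch = per_device_batch_size * gradient_accumulation_steps * num_devices
    examples_seen = effective_batch * max_steps
    return {
        "effective_batch_size": effective_batch,
        "examples_seen": examples_seen,
        "epochs": examples_seen / dataset_size,
    }


# Settings from the script above, with a hypothetical dataset of 1,000 examples.
print(training_coverage(2, 4, 100, dataset_size=1000))
# → {'effective_batch_size': 8, 'examples_seen': 800, 'epochs': 0.8}
```

With these settings the run would stop before completing a single pass over 1,000 examples, so for a real dataset you would likely raise `max_steps` or switch to `num_train_epochs`.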

Laravel Integration with a Local Model

// app/Services/AI/LocalModelService.php
namespace App\Services\AI;

use Illuminate\Support\Facades\Http;

class LocalModelService
{
    public function __construct(
        private string $endpoint = 'http://localhost:11434/api/generate'
    ) {}
    
    public function generate(string $prompt, array $options = []): string
    {
        $response = Http::timeout(120)->post($this->endpoint, [
            'model' => config('ai.local_model', 'laravel-llama'),
            'prompt' => $this->formatPrompt($prompt),
            'stream' => false,
            'options' => [
                'temperature' => $options['temperature'] ?? 0.7,
                'top_p' => $options['top_p'] ?? 0.9,
                'num_predict' => $options['max_tokens'] ?? 2048,
            ],
        ]);
        
        if ($response->failed()) {
            throw new \RuntimeException('Local model request failed');
        }
        
        return $response->json('response');
    }
    
    public function stream(string $prompt, callable $callback): void
    {
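        $response = Http::withOptions(['stream' => true])
            ->post($this->endpoint, [
                'model' => config('ai.local_model', 'laravel-llama'),
                'prompt' => $this->formatPrompt($prompt),
                'stream' => true,
            ]);

        // Read the body incrementally; calling $response->body() would buffer
        // the whole response and defeat streaming.
        $body = $response->toPsrResponse()->getBody();
        $buffer = '';

        while (!$body->eof()) {
            $buffer .= $body->read(1024);

            // The endpoint streams one JSON object per line
            while (($pos = strpos($buffer, "\n")) !== false) {
                $line = trim(substr($buffer, 0, $pos));
                $buffer = substr($buffer, $pos + 1);

                if ($line === '') continue;

                $data = json_decode($line, true);
                if (isset($data['response'])) {
                    $callback($data['response'], $data['done'] ?? false);
                }
            }
        }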
    }
    
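    // Keep this template aligned with the one used during training; the training
    // script in this article uses only the Instruction/Response sections.
    protected function formatPrompt(string $prompt): string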
    {
        return <<<PROMPT
### System:
You are an expert Laravel developer following PSR-12 and Laravel best practices.

### Instruction:
{$prompt}

### Response:
PROMPT;
    }
}
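
The streaming response is newline-delimited JSON, and network chunks do not necessarily end on line boundaries. The Python sketch below illustrates the buffering logic needed to reassemble complete lines. The function names are hypothetical, and the chunks would come from whatever HTTP client you use.

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed object per complete line, buffering partial lines."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        while b"\n" in buffer:
            line, buffer = buffer.split(b"\n", 1)
            if line.strip():
                yield json.loads(line)
    # Handle a final line that was not newline-terminated.
    if buffer.strip():
        yield json.loads(buffer)


def collect_response(chunks: Iterable[bytes]) -> str:
    """Concatenate the 'response' fragments until an event reports done."""
    parts = []
    for event in iter_ndjson(chunks):
        parts.append(event.get("response", ""))
        if event.get("done"):
            break
    return "".join(parts)
```

Splitting on the raw newline byte before decoding is safe for UTF-8, because that byte never occurs inside a multi-byte character.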

Deploying the Fine-tuned Model

API Service

// app/Services/AI/CodeAssistantService.php
namespace App\Services\AI;

use OpenAI\Laravel\Facades\OpenAI;

class CodeAssistantService
{
    private string $model;
    
    public function __construct()
    {
        $this->model = config('ai.fine_tuned_model', 'gpt-4o-mini');
    }
    
    public function generateCode(string $instruction): string
    {
        $response = OpenAI::chat()->create([
            'model' => $this->model,
            'messages' => [
                [
                    'role' => 'system',
                    'content' => $this->getSystemPrompt()
                ],
                [
                    'role' => 'user',
                    'content' => $instruction
                ]
            ],
            'temperature' => 0.3,
            'max_tokens' => 2000,
        ]);
        
        return $response->choices[0]->message->content;
    }
    
    public function reviewCode(string $code): array
    {
        $response = OpenAI::chat()->create([
            'model' => $this->model,
            'messages' => [
                [
                    'role' => 'system',
                    'content' => 'Review PHP/Laravel code and suggest improvements. Return JSON with: issues, suggestions, improved_code.'
                ],
                [
                    'role' => 'user',
                    'content' => "Review this code:\n\n```php\n{$code}\n```"
                ]
            ],
            'response_format' => ['type' => 'json_object'],
        ]);
        
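        // json_decode() returns null on malformed JSON; fall back to an empty
        // array so the declared array return type is always honored
        return json_decode($response->choices[0]->message->content, true) ?? [];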
    }
    
    public function explainCode(string $code): string
    {
        return $this->generateCode("Explain this Laravel code:\n\n```php\n{$code}\n```");
    }
    
    protected function getSystemPrompt(): string
    {
        return <<<PROMPT
You are a Laravel code assistant fine-tuned on high-quality Laravel codebases.

Guidelines:
- Generate clean, readable PHP 8.3+ code
- Follow PSR-12 coding standard
- Use Laravel conventions and best practices
- Include proper type hints and return types
- Add PHPDoc for complex methods
- Handle errors appropriately
- Consider security implications
PROMPT;
    }
}

Artisan Command for Code Generation

// app/Console/Commands/AIGenerate.php
namespace App\Console\Commands;

use App\Services\AI\CodeAssistantService;
use Illuminate\Console\Command;
use Illuminate\Support\Facades\File;

class AIGenerate extends Command
{
    protected $signature = 'ai:generate 
                            {type : Type of code (controller, model, migration, etc)}
                            {name : Name of the class/file}
                            {--description= : Additional description}';
    
    protected $description = 'Generate Laravel code using AI';

    public function handle(CodeAssistantService $assistant): int
    {
        $type = $this->argument('type');
        $name = $this->argument('name');
        $description = $this->option('description');
        
        $prompt = $this->buildPrompt($type, $name, $description);
        
        $this->info("Generating {$type}...");
        
        $code = $assistant->generateCode($prompt);
        
        // Extract code from markdown if present
        if (preg_match('/```php\n(.*?)\n```/s', $code, $matches)) {
            $code = $matches[1];
        }
        
        $path = $this->getPath($type, $name);
        
        if (File::exists($path)) {
            if (!$this->confirm("File exists. Overwrite?")) {
                return self::FAILURE;
            }
        }
        
        File::ensureDirectoryExists(dirname($path));
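        // ltrim() treats its second argument as a set of characters, not a prefix,
        // so strip a leading "<?php" tag with a regex instead
        File::put($path, "<?php\n\n" . preg_replace('/^\s*<\?php\s*/', '', $code));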
        
        $this->info("Generated: {$path}");
        
        return self::SUCCESS;
    }
    
    protected function buildPrompt(string $type, string $name, ?string $description): string
    {
        $prompts = [
            'controller' => "Create a Laravel resource controller named {$name}Controller",
            'model' => "Create a Laravel Eloquent model named {$name}",
            'migration' => "Create a Laravel migration for {$name} table",
            'request' => "Create a Laravel Form Request named {$name}Request",
            'service' => "Create a Laravel service class named {$name}Service",
            'action' => "Create a Laravel Action class named {$name}",
            'job' => "Create a Laravel Job named {$name}",
            'event' => "Create a Laravel Event named {$name}",
            'listener' => "Create a Laravel Listener named {$name}",
        ];
        
        $prompt = $prompts[$type] ?? "Create a Laravel {$type} named {$name}";
        
        if ($description) {
            $prompt .= ". Description: {$description}";
        }
        
        return $prompt;
    }
    
    protected function getPath(string $type, string $name): string
    {
        $paths = [
            'controller' => app_path("Http/Controllers/{$name}Controller.php"),
            'model' => app_path("Models/{$name}.php"),
            'request' => app_path("Http/Requests/{$name}Request.php"),
            'service' => app_path("Services/{$name}Service.php"),
            'action' => app_path("Actions/{$name}.php"),
            'job' => app_path("Jobs/{$name}.php"),
            'event' => app_path("Events/{$name}.php"),
            'listener' => app_path("Listeners/{$name}.php"),
            'migration' => database_path('migrations/' . date('Y_m_d_His') . "_create_{$name}_table.php"),
        ];
        
        return $paths[$type] ?? app_path("{$name}.php");
    }
}
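
Cleaning up the model's response is the most fragile step in this command, so it helps to see the logic in isolation. The Python sketch below extracts the first fenced block if there is one, strips a leading `<?php` tag as a prefix rather than as a set of characters, and then adds exactly one tag back. The function name is hypothetical.

```python
import re

# Built programmatically so this example contains no literal code fence.
FENCE = "`" * 3

# A fenced block, optionally tagged "php"; DOTALL lets "." span newlines.
_CODE_BLOCK = re.compile(FENCE + r"(?:php)?[ \t]*\r?\n(.*?)\r?\n?" + FENCE, re.DOTALL)
# A leading "<?php" tag, matched as a whole prefix.
_OPEN_TAG = re.compile(r"\A\s*<\?php\s*")


def extract_php(response: str) -> str:
    """Return file contents with exactly one opening tag, whatever the model sent."""
    match = _CODE_BLOCK.search(response)
    code = match.group(1) if match else response
    body = _OPEN_TAG.sub("", code, count=1)
    return "<?php\n\n" + body.strip() + "\n"
```

Matching the tag as a prefix matters: stripping the characters `<`, `?`, `p`, `h` individually would also eat the start of any code that happens to begin with those letters.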

Evaluation and Monitoring

Quality Metrics

// app/Services/AI/ModelEvaluator.php
namespace App\Services\AI;

use Illuminate\Support\Facades\Process;

class ModelEvaluator
{
    // Note: runPhpStan() and runPint() are helper methods not shown in this article.
    public function evaluateGeneration(string $generated, string $expected): array
    {
        return [
            'syntax_valid' => $this->checkSyntax($generated),
            'similarity' => $this->calculateSimilarity($generated, $expected),
            'follows_conventions' => $this->checkConventions($generated),
            'passes_phpstan' => $this->runPhpStan($generated),
        ];
    }
    
    protected function checkSyntax(string $code): bool
    {
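        $tempFile = tempnam(sys_get_temp_dir(), 'php_');

        // Only prepend the opening tag when the generated code lacks one
        if (!str_starts_with(ltrim($code), '<?php')) {
            $code = "<?php\n" . $code;
        }
        file_put_contents($tempFile, $code);

        // Pass the command as an array so the path is never shell-interpreted
        $result = Process::run(['php', '-l', $tempFile]);

        unlink($tempFile);

        return $result->successful();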
    }
    
    protected function checkConventions(string $code): array
    {
        $checks = [
            'has_strict_types' => str_contains($code, 'declare(strict_types=1)'),
            // preg_match() returns an int, so cast to get real booleans
            'has_type_hints' => (bool) preg_match('/function \w+\([^)]*\w+ \$/', $code),
            'has_return_type' => (bool) preg_match('/\): \??\w+/', $code),
            'follows_psr12' => $this->runPint($code),
        ];
        
        return $checks;
    }
    
    protected function calculateSimilarity(string $a, string $b): float
    {
        // Normalize code
        $a = preg_replace('/\s+/', ' ', $a);
        $b = preg_replace('/\s+/', ' ', $b);
        
        similar_text($a, $b, $percent);
        
        return $percent / 100;
    }
}
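
The similarity metric is easy to prototype outside PHP when you evaluate a held-out test set offline. The Python sketch below applies the same idea (normalize whitespace, then compare) using the standard library's `difflib`. Its ratio is computed differently from PHP's `similar_text()`, so the two scores are comparable in spirit only, not numerically identical.

```python
import difflib
import re


def normalize(code: str) -> str:
    """Collapse all whitespace runs so formatting differences do not count."""
    return re.sub(r"\s+", " ", code).strip()


def similarity(a: str, b: str) -> float:
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    matcher = difflib.SequenceMatcher(None, normalize(a), normalize(b), autojunk=False)
    return matcher.ratio()
```

Textual similarity is a weak signal on its own, since two correct implementations can look very different; it is best read alongside the syntax and static-analysis checks above.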

Logging and Analytics

// app/Services/AI/UsageLogger.php
namespace App\Services\AI;

use Illuminate\Support\Facades\DB;

class UsageLogger
{
    public function log(string $model, string $prompt, string $response, array $metadata = []): void
    {
        DB::table('ai_usage_logs')->insert([
            'model' => $model,
            'prompt_tokens' => $metadata['prompt_tokens'] ?? 0,
            'completion_tokens' => $metadata['completion_tokens'] ?? 0,
            'total_tokens' => $metadata['total_tokens'] ?? 0,
            'latency_ms' => $metadata['latency_ms'] ?? 0,
            'prompt_hash' => md5($prompt),
            'response_quality' => $metadata['quality_score'] ?? null,
            'created_at' => now(),
        ]);
    }
    
    public function getStats(string $model, string $period = 'day'): array
    {
        $stats = DB::table('ai_usage_logs')
            ->where('model', $model)
            ->where('created_at', '>=', now()->sub($period, 1))
            ->selectRaw('
                COUNT(*) as requests,
                SUM(total_tokens) as total_tokens,
                AVG(latency_ms) as avg_latency,
                AVG(response_quality) as avg_quality
            ')
            ->first();

        // first() returns a stdClass (or null), so cast to honor the array return type
        return (array) $stats;
    }
}
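
To see what the aggregate query returns, here is a self-contained Python sketch that runs the same kind of query against an in-memory SQLite database. The table schema is an assumption inferred from the columns inserted above, and the sample rows are made up purely for illustration.

```python
import sqlite3


def usage_stats(conn: sqlite3.Connection, model: str, since: str) -> dict:
    """Aggregate request count, tokens, latency and quality for one model."""
    row = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(total_tokens), 0),
               AVG(latency_ms),
               AVG(response_quality)
        FROM ai_usage_logs
        WHERE model = ? AND created_at >= ?
        """,
        (model, since),
    ).fetchone()
    return dict(zip(("requests", "total_tokens", "avg_latency", "avg_quality"), row))


# Assumed schema and made-up sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ai_usage_logs (model TEXT, total_tokens INTEGER, "
    "latency_ms INTEGER, response_quality REAL, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO ai_usage_logs VALUES (?, ?, ?, ?, ?)",
    [
        ("m", 100, 200, 0.8, "2024-01-02 10:00:00"),
        ("m", 300, 400, None, "2024-01-02 11:00:00"),  # no quality score recorded
        ("m", 50, 100, 0.2, "2024-01-01 09:00:00"),    # before the cutoff
        ("other", 10, 10, 1.0, "2024-01-02 10:00:00"),
    ],
)
stats = usage_stats(conn, "m", "2024-01-02 00:00:00")
```

`AVG` skips NULL values, so requests logged without a quality score do not drag the average quality down; they simply do not count toward it.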

Conclusion

Fine-tuning LLMs for PHP/Laravel brings several benefits:

  • Higher code quality: the model learns from your team's best practices
  • Consistency: output stays consistent with your coding standards
  • Domain knowledge: a deeper understanding of your business logic
  • Cost efficiency: smaller fine-tuned models can be competitive with larger ones

The workflow:

  1. Collect - Gather high-quality code from your codebase
  2. Curate - Filter and format the data
  3. Augment - Expand the dataset
  4. Train - Fine-tune with suitable hyperparameters
  5. Evaluate - Measure quality on a test set
  6. Deploy - Integrate the model into your workflow
  7. Monitor - Track quality and iterate
