# Plan 04: Admin Retraining CI/CD
**Status**: Pending Implementation
**Date**: 2026-02-27
---
## Objective
Build an admin-triggered ML model retraining pipeline that:
1. Enables admins to upload new training datasets
2. Triggers Kubernetes Jobs for model training
3. Stores trained models in MinIO
4. Maintains a model registry for versioning
5. Allows promotion of models to production
---
## 1. Architecture Overview
```mermaid
graph TD
  A[Admin Panel] -->|Upload Dataset| B[API]
  B -->|Store| C[MinIO: geocrop-datasets]
  B -->|Trigger Job| D[Kubernetes API]
  D -->|Run| E[Training Job Pod]
  E -->|Read Dataset| C
  E -->|Download Dependencies| F[PyPI/NPM]
  E -->|Train| G[ML Models]
  G -->|Upload| H[MinIO: geocrop-models]
  H -->|Update| I[Model Registry]
  I -->|Promote| J[Production]
```
---
## 2. Current Training Code
### 2.1 Existing Training Script
Location: [`training/train.py`](training/train.py)
Current features:
- Uses XGBoost, LightGBM, CatBoost, RandomForest
- Feature selection with Scout (LightGBM)
- StandardScaler for normalization
- Outputs model artifacts to local directory
### 2.2 Training Configuration
From [`apps/worker/config.py`](apps/worker/config.py:28):
```python
@dataclass
class TrainingConfig:
    # Dataset
    label_col: str = "label"
    junk_cols: list = field(default_factory=lambda: [...])

    # Split
    test_size: float = 0.2
    random_state: int = 42

    # Model hyperparameters
    rf_n_estimators: int = 200
    xgb_n_estimators: int = 300
    lgb_n_estimators: int = 800

    # Artifact upload
    upload_minio: bool = False
    minio_bucket: str = "geocrop-models"
```
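Inside the training pod, these defaults will need to be overridden from the environment variables set in the Job manifest (§3.1). A minimal sketch of that bridge, using a trimmed subset of the dataclass fields above; the `TRAIN_` env-var prefix is an assumption:

```python
import os
from dataclasses import dataclass, fields


@dataclass
class TrainingConfig:
    # Trimmed subset of the fields shown above, for illustration
    label_col: str = "label"
    test_size: float = 0.2
    random_state: int = 42
    upload_minio: bool = False
    minio_bucket: str = "geocrop-models"


def config_from_env(prefix: str = "TRAIN_") -> TrainingConfig:
    """Build a TrainingConfig, overriding defaults from <prefix><FIELD> env vars."""
    overrides = {}
    for f in fields(TrainingConfig):
        raw = os.environ.get(prefix + f.name.upper())
        if raw is None:
            continue
        if isinstance(f.default, bool):
            # bool("false") is truthy, so parse booleans explicitly
            overrides[f.name] = raw.lower() in ("1", "true", "yes")
        else:
            # Coerce to the field's default type (int, float, str)
            overrides[f.name] = type(f.default)(raw)
    return TrainingConfig(**overrides)
```

With `TRAIN_TEST_SIZE=0.3` and `TRAIN_UPLOAD_MINIO=true` set on the container, `config_from_env()` yields a config with those values and the remaining defaults untouched.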
---
## 3. Kubernetes Job Strategy
### 3.1 Training Job Manifest
Create `k8s/jobs/training-job.yaml`:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: geocrop-train-{version}
  namespace: geocrop
  labels:
    app: geocrop-train
    version: "{version}"
spec:
  backoffLimit: 3
  ttlSecondsAfterFinished: 3600
  template:
    metadata:
      labels:
        app: geocrop-train
    spec:
      restartPolicy: OnFailure
      serviceAccountName: geocrop-admin
      containers:
        - name: trainer
          image: frankchine/geocrop-worker:latest
          command: ["python", "training/train.py"]
          env:
            - name: DATASET_PATH
              value: "s3://geocrop-datasets/{dataset_version}/training_data.csv"
            - name: OUTPUT_PATH
              value: "s3://geocrop-models/{model_version}/"
            - name: MINIO_ENDPOINT
              value: "minio.geocrop.svc.cluster.local:9000"
            - name: MODEL_VARIANT
              value: "Scaled"
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: geocrop-secrets
                  key: minio-access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: geocrop-secrets
                  key: minio-secret-key
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
              nvidia.com/gpu: "1"
            limits:
              memory: "8Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: cache
              mountPath: /root/.cache/pip
      volumes:
        - name: cache
          emptyDir: {}
```
### 3.2 Service Account
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: geocrop-admin
  namespace: geocrop
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: geocrop-job-creator
  namespace: geocrop
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: geocrop-admin-job-binding
  namespace: geocrop
subjects:
  - kind: ServiceAccount
    name: geocrop-admin
    namespace: geocrop   # required for ServiceAccount subjects
roleRef:
  kind: Role
  name: geocrop-job-creator
  apiGroup: rbac.authorization.k8s.io
```
---
## 4. API Endpoints for Admin
### 4.1 Dataset Management
```python
# apps/api/admin.py
from fastapi import APIRouter, UploadFile, File, Depends, HTTPException
from minio import Minio
# get_minio_client and get_current_admin_user are app helpers defined elsewhere

router = APIRouter(prefix="/admin", tags=["Admin"])


@router.post("/datasets/upload")
async def upload_dataset(
    version: str,
    file: UploadFile = File(...),
    current_user: dict = Depends(get_current_admin_user)
):
    """Upload a new training dataset version."""
    # Validate file type
    if not file.filename.endswith('.csv'):
        raise HTTPException(400, "Only CSV files supported")

    # Stream the upload to MinIO; part_size is required when length is unknown
    client = get_minio_client()
    client.put_object(
        "geocrop-datasets",
        f"{version}/{file.filename}",
        file.file,
        length=file.size or -1,
        part_size=10 * 1024 * 1024,
    )
    return {"status": "uploaded", "version": version, "filename": file.filename}


@router.get("/datasets")
async def list_datasets(current_user: dict = Depends(get_current_admin_user)):
    """List all available dataset versions."""
    client = get_minio_client()
    objects = client.list_objects("geocrop-datasets", recursive=True)
    # Object keys look like "<version>/<filename>"; group by version prefix
    versions = sorted({obj.object_name.split("/", 1)[0] for obj in objects})
    return {"datasets": versions}
```
### 4.2 Training Triggers
```python
@router.post("/training/start")
async def start_training(
    dataset_version: str,
    model_version: str,
    model_variant: str = "Scaled",
    current_user: dict = Depends(get_current_admin_user)
):
    """Start a training job."""
    # Build and submit a Kubernetes Job
    job_manifest = create_training_job_manifest(
        dataset_version=dataset_version,
        model_version=model_version,
        model_variant=model_variant
    )
    k8s_api.create_namespaced_job("geocrop", job_manifest)
    return {
        "status": "started",
        "job_name": job_manifest["metadata"]["name"],
        "dataset": dataset_version,
        "model_version": model_version
    }


@router.get("/training/jobs")
async def list_training_jobs(current_user: dict = Depends(get_current_admin_user)):
    """List all training jobs."""
    jobs = k8s_api.list_namespaced_job("geocrop", label_selector="app=geocrop-train")
    return {
        "jobs": [
            {
                "name": j.metadata.name,
                "active": j.status.active,
                "succeeded": j.status.succeeded,
                "failed": j.status.failed,
            }
            for j in jobs.items
        ]
    }
```
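The `create_training_job_manifest` helper referenced in the endpoint is not defined elsewhere in this plan. A minimal sketch that fills the placeholders from the §3.1 template as a plain dict; bucket names, labels, and image tag follow the manifest, while the trimmed spec (no secrets, resources, or volumes) is a simplification:

```python
def create_training_job_manifest(dataset_version: str, model_version: str,
                                 model_variant: str = "Scaled") -> dict:
    """Fill the §3.1 training-job template with concrete version strings."""
    env = [
        {"name": "DATASET_PATH",
         "value": f"s3://geocrop-datasets/{dataset_version}/training_data.csv"},
        {"name": "OUTPUT_PATH",
         "value": f"s3://geocrop-models/{model_version}/"},
        {"name": "MODEL_VARIANT", "value": model_variant},
    ]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"geocrop-train-{model_version}",
            "namespace": "geocrop",
            "labels": {"app": "geocrop-train", "version": model_version},
        },
        "spec": {
            "backoffLimit": 3,
            "ttlSecondsAfterFinished": 3600,
            "template": {
                "metadata": {"labels": {"app": "geocrop-train"}},
                "spec": {
                    "restartPolicy": "OnFailure",
                    "serviceAccountName": "geocrop-admin",
                    "containers": [{
                        "name": "trainer",
                        "image": "frankchine/geocrop-worker:latest",
                        "command": ["python", "training/train.py"],
                        "env": env,
                        # secrets, resources, and volumes omitted; see §3.1
                    }],
                },
            },
        },
    }
```

The resulting dict can be passed directly to `BatchV1Api.create_namespaced_job`, which accepts plain dict bodies as well as typed `V1Job` objects.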
### 4.3 Model Registry
```python
@router.get("/models")
async def list_models(current_user: dict = Depends(get_current_admin_user)):
    """List all trained models."""
    # Query model registry (could be in MinIO metadata or separate DB)
    pass


@router.post("/models/{model_version}/promote")
async def promote_model(
    model_version: str,
    current_user: dict = Depends(get_current_admin_user)
):
    """Promote a model to production."""
    # Update the model registry to set the default model;
    # this changes which model inference jobs load
    pass
```
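The promotion step boils down to rewriting `registry.json` so `default_model` points at the new version and exactly one entry carries `is_default`. A sketch of that update against the §5 registry schema; loading and re-uploading the registry from MinIO is left out here:

```python
def promote_in_registry(registry: dict, model_version: str) -> dict:
    """Mark model_version as the default in a registry.json-style dict."""
    versions = {m["version"] for m in registry["models"]}
    if model_version not in versions:
        raise ValueError(f"unknown model version: {model_version}")
    # Exactly one model may be the default at a time
    for m in registry["models"]:
        m["is_default"] = (m["version"] == model_version)
    registry["default_model"] = model_version
    return registry
```

The `promote_model` endpoint would fetch `registry.json`, apply this function, and write it back; a conditional put (ETag match) would guard against two admins promoting concurrently.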
---
## 5. Model Registry
### 5.1 Dataset Versioning
- `datasets/<dataset_name>/vYYYYMMDD/<files>`
### 5.2 Model Registry Storage
Store model metadata in MinIO:
```
geocrop-models/
├── registry.json              # Model registry index
├── v1/
│   ├── metadata.json          # Model details
│   ├── model.joblib           # Trained model
│   ├── scaler.joblib          # Feature scaler
│   ├── label_encoder.json     # Class mapping
│   └── selected_features.json # Feature list
└── v2/
    └── ...
```
### 5.3 Registry Schema
```json
// registry.json
{
  "models": [
    {
      "version": "v1",
      "created": "2026-02-01T10:00:00Z",
      "dataset_version": "v1",
      "features": ["ndvi_peak", "evi_peak", "savi_peak"],
      "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
      "metrics": {
        "accuracy": 0.89,
        "f1_macro": 0.85
      },
      "is_default": true
    }
  ],
  "default_model": "v1"
}
```
### 5.4 Metadata Schema
```json
// v1/metadata.json
{
  "version": "v1",
  "training_date": "2026-02-01T10:00:00Z",
  "dataset_version": "v1",
  "training_samples": 1500,
  "test_samples": 500,
  "features": ["ndvi_peak", "evi_peak", "savi_peak"],
  "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
  "models": {
    "lightgbm": {
      "accuracy": 0.91,
      "f1_macro": 0.88
    },
    "xgboost": {
      "accuracy": 0.89,
      "f1_macro": 0.85
    },
    "catboost": {
      "accuracy": 0.88,
      "f1_macro": 0.84
    }
  },
  "selected_model": "lightgbm",
  "training_params": {
    "n_estimators": 800,
    "learning_rate": 0.03,
    "num_leaves": 63
  }
}
```
---
## 6. Frontend Admin Panel
### 6.1 Admin Page Structure
```tsx
// app/admin/page.tsx
export default function AdminPage() {
  return (
    <div className="p-6">
      <h1 className="text-2xl font-bold mb-6">Admin Panel</h1>
      <div className="grid grid-cols-2 gap-6">
        {/* Dataset Upload */}
        <DatasetUploadCard />
        {/* Training Controls */}
        <TrainingCard />
        {/* Model Registry */}
        <ModelRegistryCard />
      </div>
    </div>
  );
}
```
### 6.2 Dataset Upload Component
```tsx
// components/admin/DatasetUpload.tsx
'use client';
import { useState } from 'react';
import { useMutation } from '@tanstack/react-query';
// `toast` and the auth `token` come from the app's own notification
// and auth modules (not shown here)

export function DatasetUpload() {
  const [version, setVersion] = useState('');
  const [file, setFile] = useState<File | null>(null);

  const upload = useMutation({
    mutationFn: async () => {
      const formData = new FormData();
      formData.append('version', version);
      formData.append('file', file!);
      return fetch('/api/admin/datasets/upload', {
        method: 'POST',
        body: formData,
        headers: { Authorization: `Bearer ${token}` }
      });
    },
    onSuccess: () => {
      toast.success('Dataset uploaded successfully');
    }
  });

  return (
    <div className="card">
      <h2>Upload Dataset</h2>
      <input
        type="text"
        placeholder="Version (e.g., v2)"
        value={version}
        onChange={e => setVersion(e.target.value)}
      />
      <input
        type="file"
        accept=".csv"
        onChange={e => setFile(e.target.files?.[0] || null)}
      />
      <button onClick={() => upload.mutate()} disabled={!file || !version}>
        Upload
      </button>
    </div>
  );
}
```
### 6.3 Training Trigger Component
```tsx
// components/admin/TrainingTrigger.tsx
'use client';
import { useState } from 'react';
import { useMutation } from '@tanstack/react-query';

export function TrainingTrigger() {
  const [datasetVersion, setDatasetVersion] = useState('');
  const [modelVersion, setModelVersion] = useState('');
  const [variant, setVariant] = useState('Scaled');

  const startTraining = useMutation({
    mutationFn: async () => {
      return fetch('/api/admin/training/start', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          dataset_version: datasetVersion,
          model_version: modelVersion,
          model_variant: variant
        })
      });
    }
  });

  return (
    <div className="card">
      <h2>Start Training</h2>
      <select value={datasetVersion} onChange={e => setDatasetVersion(e.target.value)}>
        {/* List available datasets */}
      </select>
      <input
        type="text"
        placeholder="Model version (e.g., v2)"
        value={modelVersion}
        onChange={e => setModelVersion(e.target.value)}
      />
      <button onClick={() => startTraining.mutate()}>
        Start Training Job
      </button>
    </div>
  );
}
```
---
## 7. Training Script Updates
### 7.1 Modified Training Entry Point
```python
# training/train.py
import argparse
import json
import os
from datetime import datetime

import boto3
import pandas as pd


def parse_s3_path(path: str) -> tuple[str, str]:
    """Split s3://bucket/prefix/ into (bucket, prefix)."""
    bucket, _, prefix = path.removeprefix("s3://").partition("/")
    return bucket, prefix.strip("/")


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', required=True, help='Path to training data CSV')
    parser.add_argument('--out', required=True, help='Output directory (s3://...)')
    parser.add_argument('--variant', default='Scaled', choices=['Scaled', 'Raw'])
    args = parser.parse_args()

    # Parse S3 output path into bucket and key prefix
    output_bucket, output_prefix = parse_s3_path(args.out)

    # Load and prepare data
    df = pd.read_csv(args.data)

    # Train models (existing logic)
    results = train_models(df, args.variant)

    # Upload model artifacts to MinIO
    s3 = boto3.client('s3')
    for filename in ['model.joblib', 'scaler.joblib', 'label_encoder.json',
                     'selected_features.json']:
        if os.path.exists(filename):
            s3.upload_file(filename, output_bucket, f"{output_prefix}/{filename}")

    # Upload metadata
    metadata = {
        'version': output_prefix,
        'training_date': datetime.utcnow().isoformat(),
        'metrics': results,
        'features': results.get('selected_features', []),
    }
    s3.put_object(
        Bucket=output_bucket,
        Key=f"{output_prefix}/metadata.json",
        Body=json.dumps(metadata),
    )
    print(f"Training complete. Artifacts saved to s3://{output_bucket}/{output_prefix}")


if __name__ == '__main__':
    main()
```
---
## 8. CI/CD Pipeline
### 8.1 GitHub Actions (Optional)
```yaml
# .github/workflows/train.yml
name: Model Training

on:
  workflow_dispatch:
    inputs:
      dataset_version:
        description: 'Dataset version'
        required: true
      model_version:
        description: 'Model version'
        required: true

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r training/requirements.txt
      - name: Run training
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          python training/train.py \
            --data s3://geocrop-datasets/${{ github.event.inputs.dataset_version }}/training_data.csv \
            --out s3://geocrop-models/${{ github.event.inputs.model_version }}/ \
            --variant Scaled
```
---
## 9. Security
### 9.1 Admin Authentication
- Require admin role in JWT
- Check `user.get('is_admin', False)` before any admin operation
### 9.2 Kubernetes RBAC
- Only admin service account can create training jobs
- Training jobs run with limited permissions
### 9.3 MinIO Policies
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::geocrop-datasets/*",
        "arn:aws:s3:::geocrop-models/*"
      ]
    },
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::geocrop-datasets",
        "arn:aws:s3:::geocrop-models"
      ]
    }
  ]
}
```
Note the second statement: the dataset/model listing endpoints in §4 require `s3:ListBucket` on the bucket ARNs themselves, not the object ARNs.
---
## 10. Implementation Checklist
- [ ] Create Kubernetes ServiceAccount and RBAC for admin
- [ ] Create training job manifest template
- [ ] Update training script to upload to MinIO
- [ ] Create API endpoints for dataset upload
- [ ] Create API endpoints for training triggers
- [ ] Create API endpoints for model registry
- [ ] Implement model promotion logic
- [ ] Build admin frontend components
- [ ] Add dataset upload UI
- [ ] Add training trigger UI
- [ ] Add model registry UI
- [ ] Test end-to-end training pipeline
### 10.1 Promotion Workflow
- "train" produces candidate model version
- "promote" marks it as default for UI
---
## 11. Technical Notes
### 11.1 GPU Support
The §3.1 manifest already requests `nvidia.com/gpu`; on CPU-only clusters, drop those requests/limits or the pod will never schedule. For GPU training:
- Use a CUDA-enabled worker image
- Install GPU-enabled builds of the training libraries (e.g., XGBoost/LightGBM with CUDA support)
### 11.2 Training Timeout
- Kubernetes Jobs have no time limit by default
- Set `activeDeadlineSeconds` on the Job spec to kill runaway jobs
### 11.3 Model Selection
- Store multiple model outputs (XGBoost, LightGBM, CatBoost)
- Select best based on validation metrics
- Allow admin to override selection
---
## 12. Next Steps
After implementation approval:
1. Create Kubernetes RBAC manifests
2. Create training job template
3. Update training script for MinIO upload
4. Implement admin API endpoints
5. Build admin frontend
6. Test training pipeline
7. Document admin procedures