# Plan 04: Admin Retraining CI/CD

**Status**: Pending Implementation
**Date**: 2026-02-27

---

## Objective

Build an admin-triggered ML model retraining pipeline that:

1. Enables admins to upload new training datasets
2. Triggers Kubernetes Jobs for model training
3. Stores trained models in MinIO
4. Maintains a model registry for versioning
5. Allows promotion of models to production

---
## 1. Architecture Overview

```mermaid
graph TD
    A[Admin Panel] -->|Upload Dataset| B[API]
    B -->|Store| C[MinIO: geocrop-datasets]
    B -->|Trigger Job| D[Kubernetes API]
    D -->|Run| E[Training Job Pod]
    E -->|Read Dataset| C
    E -->|Download Dependencies| F[PyPI/NPM]
    E -->|Train| G[ML Models]
    G -->|Upload| H[MinIO: geocrop-models]
    H -->|Update| I[Model Registry]
    I -->|Promote| J[Production]
```

---
## 2. Current Training Code

### 2.1 Existing Training Script

Location: [`training/train.py`](training/train.py)

Current features:

- Uses XGBoost, LightGBM, CatBoost, RandomForest
- Feature selection with Scout (LightGBM)
- StandardScaler for normalization
- Outputs model artifacts to local directory

### 2.2 Training Configuration

From [`apps/worker/config.py`](apps/worker/config.py:28):

```python
@dataclass
class TrainingConfig:
    # Dataset
    label_col: str = "label"
    junk_cols: list = field(default_factory=lambda: [...])

    # Split
    test_size: float = 0.2
    random_state: int = 42

    # Model hyperparameters
    rf_n_estimators: int = 200
    xgb_n_estimators: int = 300
    lgb_n_estimators: int = 800

    # Artifact upload
    upload_minio: bool = False
    minio_bucket: str = "geocrop-models"
```

---
## 3. Kubernetes Job Strategy

### 3.1 Training Job Manifest

Create `k8s/jobs/training-job.yaml`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: geocrop-train-{version}
  namespace: geocrop
  labels:
    app: geocrop-train
    version: "{version}"
spec:
  backoffLimit: 3
  ttlSecondsAfterFinished: 3600
  template:
    metadata:
      labels:
        app: geocrop-train
    spec:
      restartPolicy: OnFailure
      serviceAccountName: geocrop-admin
      containers:
        - name: trainer
          image: frankchine/geocrop-worker:latest
          command: ["python", "training/train.py"]
          env:
            - name: DATASET_PATH
              value: "s3://geocrop-datasets/{dataset_version}/training_data.csv"
            - name: OUTPUT_PATH
              value: "s3://geocrop-models/{model_version}/"
            - name: MINIO_ENDPOINT
              value: "minio.geocrop.svc.cluster.local:9000"
            - name: MODEL_VARIANT
              value: "Scaled"
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: geocrop-secrets
                  key: minio-access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: geocrop-secrets
                  key: minio-secret-key
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
              nvidia.com/gpu: "1"
            limits:
              memory: "8Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: cache
              mountPath: /root/.cache/pip
      volumes:
        - name: cache
          emptyDir: {}
```
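
The `{version}`, `{dataset_version}` and `{model_version}` placeholders have to be filled in before the Job is submitted. A minimal sketch of one way to do that in Python is shown here. It assumes `{version}` maps to the model version, that version strings are limited to lowercase letters, digits, and hyphens (so they are safe in Kubernetes object names), and it adds an `activeDeadlineSeconds` cap whose two-hour default is purely illustrative. The `resources` and `volumes` blocks are omitted for brevity.

```python
import re

# Assumption: versions are restricted to DNS-label-safe characters.
_VERSION_RE = re.compile(r"^[a-z0-9]([a-z0-9-]*[a-z0-9])?$")


def create_training_job_manifest(dataset_version: str, model_version: str,
                                 model_variant: str = "Scaled",
                                 active_deadline_seconds: int = 7200) -> dict:
    """Build a Job manifest dict for one training run (illustrative sketch)."""
    for value in (dataset_version, model_version):
        if not _VERSION_RE.match(value):
            raise ValueError(f"invalid version string: {value!r}")

    def secret_env(name: str, key: str) -> dict:
        # Reference a key in the geocrop-secrets Secret instead of inlining it.
        return {"name": name,
                "valueFrom": {"secretKeyRef": {"name": "geocrop-secrets", "key": key}}}

    env = [
        {"name": "DATASET_PATH",
         "value": f"s3://geocrop-datasets/{dataset_version}/training_data.csv"},
        {"name": "OUTPUT_PATH", "value": f"s3://geocrop-models/{model_version}/"},
        {"name": "MINIO_ENDPOINT", "value": "minio.geocrop.svc.cluster.local:9000"},
        {"name": "MODEL_VARIANT", "value": model_variant},
        secret_env("AWS_ACCESS_KEY_ID", "minio-access-key"),
        secret_env("AWS_SECRET_ACCESS_KEY", "minio-secret-key"),
    ]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"geocrop-train-{model_version}",
            "namespace": "geocrop",
            "labels": {"app": "geocrop-train", "version": model_version},
        },
        "spec": {
            "backoffLimit": 3,
            "ttlSecondsAfterFinished": 3600,
            "activeDeadlineSeconds": active_deadline_seconds,  # assumed cap
            "template": {
                "metadata": {"labels": {"app": "geocrop-train"}},
                "spec": {
                    "restartPolicy": "OnFailure",
                    "serviceAccountName": "geocrop-admin",
                    "containers": [{
                        "name": "trainer",
                        "image": "frankchine/geocrop-worker:latest",
                        "command": ["python", "training/train.py"],
                        "env": env,
                    }],
                },
            },
        },
    }
```

Validating the version strings up front matters because they end up in both the Job name and the object-store paths.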
### 3.2 Service Account

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: geocrop-admin
  namespace: geocrop
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: geocrop-job-creator
  namespace: geocrop
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: geocrop-admin-job-binding
  namespace: geocrop
subjects:
  - kind: ServiceAccount
    name: geocrop-admin
    namespace: geocrop  # required for ServiceAccount subjects
roleRef:
  kind: Role
  name: geocrop-job-creator
  apiGroup: rbac.authorization.k8s.io
```

---
## 4. API Endpoints for Admin

### 4.1 Dataset Management

```python
# apps/api/admin.py

from fastapi import APIRouter, Depends, File, Form, HTTPException, UploadFile
from minio import Minio

# get_current_admin_user and get_minio_client are assumed to be existing
# project helpers; they are not defined in this plan.

router = APIRouter(prefix="/admin", tags=["Admin"])


@router.post("/datasets/upload")
async def upload_dataset(
    version: str = Form(...),  # sent as a multipart form field by the admin UI
    file: UploadFile = File(...),
    current_user: dict = Depends(get_current_admin_user)
):
    """Upload a new training dataset version."""

    # Validate file type
    if not file.filename or not file.filename.endswith('.csv'):
        raise HTTPException(400, "Only CSV files supported")

    # Upload to MinIO. The training job reads `{version}/training_data.csv`,
    # so the uploaded file needs that name to be picked up.
    client = get_minio_client()
    client.put_object(
        "geocrop-datasets",
        f"{version}/{file.filename}",
        file.file,
        file.size
    )

    return {"status": "uploaded", "version": version, "filename": file.filename}


@router.get("/datasets")
async def list_datasets(current_user: dict = Depends(get_current_admin_user)):
    """List all available datasets."""
    # List objects in geocrop-datasets bucket
    pass
```
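
The `list_datasets` stub could group the bucket's object names by their leading version prefix. The sketch below shows only that grouping step as a pure function; the function name and the response shape are illustrative assumptions, not part of the plan. In the handler, the names would presumably come from iterating `client.list_objects("geocrop-datasets", recursive=True)` and reading each item's `object_name`.

```python
from collections import defaultdict


def group_datasets_by_version(object_names):
    """Group 'version/filename' object names into one entry per version."""
    versions = defaultdict(list)
    for name in object_names:
        version, sep, filename = name.partition("/")
        # Skip objects that are not stored under a version prefix,
        # and bare "folder" markers such as "v1/".
        if sep and filename:
            versions[version].append(filename)
    return [
        {"version": version, "files": sorted(files)}
        for version, files in sorted(versions.items())
    ]


print(group_datasets_by_version([
    "v2/training_data.csv",
    "v1/training_data.csv",
    "v1/notes.csv",
]))
```

Keeping the grouping separate from the MinIO call makes it easy to unit-test without a running object store.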
### 4.2 Training Triggers

```python
from kubernetes import client as k8s_client, config as k8s_config
from pydantic import BaseModel

# Assumes the API runs in-cluster under a service account allowed to create Jobs.
k8s_config.load_incluster_config()
k8s_api = k8s_client.BatchV1Api()


class TrainingRequest(BaseModel):
    """JSON body sent by the admin UI."""
    dataset_version: str
    model_version: str
    model_variant: str = "Scaled"


@router.post("/training/start")
async def start_training(
    request: TrainingRequest,
    current_user: dict = Depends(get_current_admin_user)
):
    """Start a training job."""

    # Create Kubernetes Job. create_training_job_manifest is an assumed helper
    # that fills in the manifest template from section 3.1 and returns a dict.
    job_manifest = create_training_job_manifest(
        dataset_version=request.dataset_version,
        model_version=request.model_version,
        model_variant=request.model_variant
    )

    k8s_api.create_namespaced_job("geocrop", job_manifest)

    return {
        "status": "started",
        "job_name": job_manifest["metadata"]["name"],
        "dataset": request.dataset_version,
        "model_version": request.model_version
    }


@router.get("/training/jobs")
async def list_training_jobs(current_user: dict = Depends(get_current_admin_user)):
    """List all training jobs."""
    jobs = k8s_api.list_namespaced_job("geocrop", label_selector="app=geocrop-train")
    return {"jobs": [...]}  # Parse job status
```
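
The "parse job status" step left open in `list_training_jobs` might look roughly like the following. This is a sketch under assumptions: `summarize_job` and the state names are hypothetical, and it relies only on the `active`, `succeeded` and `failed` counters that the Kubernetes Python client exposes on a Job's `status`. A stand-in object is used here so the example runs without a cluster.

```python
from types import SimpleNamespace


def summarize_job(job) -> dict:
    """Reduce a Job object to the fields an admin table needs."""
    status = job.status
    if status.succeeded:
        state = "succeeded"
    elif status.active:
        state = "running"  # includes retries still within backoffLimit
    elif status.failed:
        state = "failed"
    else:
        state = "pending"
    labels = job.metadata.labels or {}
    return {
        "name": job.metadata.name,
        "version": labels.get("version"),
        "state": state,
    }


# Stand-in for one item of list_namespaced_job(...).items
example = SimpleNamespace(
    metadata=SimpleNamespace(name="geocrop-train-v2", labels={"version": "v2"}),
    status=SimpleNamespace(succeeded=None, active=1, failed=1),
)
print(summarize_job(example))
```

Checking `active` before `failed` means a job that is retrying after a failed pod is still reported as running. A stricter version could inspect the Job's `conditions` instead.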
### 4.3 Model Registry

```python
@router.get("/models")
async def list_models(current_user: dict = Depends(get_current_admin_user)):
    """List all trained models."""
    # Query model registry (could be in MinIO metadata or separate DB)
    pass


@router.post("/models/{model_version}/promote")
async def promote_model(
    model_version: str,
    current_user: dict = Depends(get_current_admin_user)
):
    """Promote a model to production."""

    # Update model registry to set default model
    # This changes which model is used by inference jobs
    pass
```

---
## 5. Model Registry

### 5.1 Dataset Versioning

Datasets are stored under date-stamped version prefixes:

- `datasets/<dataset_name>/vYYYYMMDD/<files>`
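
As a small illustration of this naming scheme, an object key could be built as follows. The helper name and its parameters are hypothetical.

```python
from datetime import date
from typing import Optional


def dataset_object_key(dataset_name: str, filename: str,
                       day: Optional[date] = None) -> str:
    """Build a key of the form datasets/<dataset_name>/vYYYYMMDD/<file>."""
    day = day or date.today()
    return f"datasets/{dataset_name}/v{day:%Y%m%d}/{filename}"


print(dataset_object_key("crops", "training_data.csv", date(2026, 2, 27)))
# datasets/crops/v20260227/training_data.csv
```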
### 5.2 Model Registry Storage

Store model metadata in MinIO:

```
geocrop-models/
├── registry.json                  # Model registry index
├── v1/
│   ├── metadata.json              # Model details
│   ├── model.joblib               # Trained model
│   ├── scaler.joblib              # Feature scaler
│   ├── label_encoder.json         # Class mapping
│   └── selected_features.json     # Feature list
└── v2/
    └── ...
```
### 5.3 Registry Schema

`registry.json`:

```json
{
  "models": [
    {
      "version": "v1",
      "created": "2026-02-01T10:00:00Z",
      "dataset_version": "v1",
      "features": ["ndvi_peak", "evi_peak", "savi_peak"],
      "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
      "metrics": {
        "accuracy": 0.89,
        "f1_macro": 0.85
      },
      "is_default": true
    }
  ],
  "default_model": "v1"
}
```

### 5.4 Metadata Schema

`v1/metadata.json`:

```json
{
  "version": "v1",
  "training_date": "2026-02-01T10:00:00Z",
  "dataset_version": "v1",
  "training_samples": 1500,
  "test_samples": 500,
  "features": ["ndvi_peak", "evi_peak", "savi_peak"],
  "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
  "models": {
    "lightgbm": {
      "accuracy": 0.91,
      "f1_macro": 0.88
    },
    "xgboost": {
      "accuracy": 0.89,
      "f1_macro": 0.85
    },
    "catboost": {
      "accuracy": 0.88,
      "f1_macro": 0.84
    }
  },
  "selected_model": "lightgbm",
  "training_params": {
    "n_estimators": 800,
    "learning_rate": 0.03,
    "num_leaves": 63
  }
}
```
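
Promotion then amounts to rewriting the `registry.json` index so that exactly one entry is the default. The following is a minimal sketch of that update on an in-memory dict; the `promote` function is hypothetical, and reading and writing the file in MinIO is left out.

```python
import copy


def promote(registry: dict, model_version: str) -> dict:
    """Return a copy of the registry with model_version as the only default."""
    known = {model["version"] for model in registry["models"]}
    if model_version not in known:
        raise KeyError(f"unknown model version: {model_version}")

    updated = copy.deepcopy(registry)  # leave the caller's copy untouched
    for model in updated["models"]:
        model["is_default"] = model["version"] == model_version
    updated["default_model"] = model_version
    return updated


registry = {
    "models": [
        {"version": "v1", "is_default": True},
        {"version": "v2", "is_default": False},
    ],
    "default_model": "v1",
}
print(promote(registry, "v2")["default_model"])  # v2
```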
---

## 6. Frontend Admin Panel

### 6.1 Admin Page Structure

```tsx
// app/admin/page.tsx
export default function AdminPage() {
  return (
    <div className="p-6">
      <h1 className="text-2xl font-bold mb-6">Admin Panel</h1>

      <div className="grid grid-cols-2 gap-6">
        {/* Dataset Upload */}
        <DatasetUploadCard />

        {/* Training Controls */}
        <TrainingCard />

        {/* Model Registry */}
        <ModelRegistryCard />
      </div>
    </div>
  );
}
```
### 6.2 Dataset Upload Component

```tsx
// components/admin/DatasetUpload.tsx
'use client';

import { useState } from 'react';
import { useMutation } from '@tanstack/react-query';

// `token` and `toast` are assumed to come from the app's auth state and
// notification library; they are not defined in this plan.

export function DatasetUpload() {
  const [version, setVersion] = useState('');
  const [file, setFile] = useState<File | null>(null);

  const upload = useMutation({
    mutationFn: async () => {
      const formData = new FormData();
      formData.append('version', version);
      formData.append('file', file!);

      const res = await fetch('/api/admin/datasets/upload', {
        method: 'POST',
        body: formData,
        headers: { Authorization: `Bearer ${token}` }
      });
      // fetch only rejects on network errors, so surface HTTP errors explicitly
      if (!res.ok) throw new Error(`Upload failed: ${res.status}`);
      return res.json();
    },
    onSuccess: () => {
      toast.success('Dataset uploaded successfully');
    }
  });

  return (
    <div className="card">
      <h2>Upload Dataset</h2>
      <input
        type="text"
        placeholder="Version (e.g., v2)"
        value={version}
        onChange={e => setVersion(e.target.value)}
      />
      <input
        type="file"
        accept=".csv"
        onChange={e => setFile(e.target.files?.[0] || null)}
      />
      <button onClick={() => upload.mutate()} disabled={!file || !version}>
        Upload
      </button>
    </div>
  );
}
```
### 6.3 Training Trigger Component

```tsx
// components/admin/TrainingTrigger.tsx
'use client';

import { useState } from 'react';
import { useMutation } from '@tanstack/react-query';

// `token` is assumed to come from the app's auth state.

export function TrainingTrigger() {
  const [datasetVersion, setDatasetVersion] = useState('');
  const [modelVersion, setModelVersion] = useState('');
  const [variant, setVariant] = useState('Scaled');

  const startTraining = useMutation({
    mutationFn: async () => {
      const res = await fetch('/api/admin/training/start', {
        method: 'POST',
        headers: {
          'Content-Type': 'application/json',
          Authorization: `Bearer ${token}`
        },
        body: JSON.stringify({
          dataset_version: datasetVersion,
          model_version: modelVersion,
          model_variant: variant
        })
      });
      if (!res.ok) throw new Error(`Failed to start training: ${res.status}`);
      return res.json();
    }
  });

  return (
    <div className="card">
      <h2>Start Training</h2>
      <select value={datasetVersion} onChange={e => setDatasetVersion(e.target.value)}>
        {/* List available datasets */}
      </select>
      <input
        type="text"
        placeholder="Model version (e.g., v2)"
        value={modelVersion}
        onChange={e => setModelVersion(e.target.value)}
      />
      <button onClick={() => startTraining.mutate()}>
        Start Training Job
      </button>
    </div>
  );
}
```

---
## 7. Training Script Updates

### 7.1 Modified Training Entry Point

```python
# training/train.py

import argparse
import json
import os
from datetime import datetime, timezone

import boto3
import pandas as pd


def parse_s3_path(path):
    """Split 's3://bucket/some/prefix/' into ('bucket', 'some/prefix')."""
    bucket, _, prefix = path.removeprefix('s3://').partition('/')
    return bucket, prefix.strip('/')


def main():
    parser = argparse.ArgumentParser()
    # Flags fall back to the env vars set by the Kubernetes Job manifest.
    parser.add_argument('--data', default=os.environ.get('DATASET_PATH'), help='Path to training data CSV')
    parser.add_argument('--out', default=os.environ.get('OUTPUT_PATH'), help='Output directory (s3://...)')
    parser.add_argument('--variant', default=os.environ.get('MODEL_VARIANT', 'Scaled'), choices=['Scaled', 'Raw'])
    args = parser.parse_args()
    if not args.data or not args.out:
        parser.error('--data and --out are required (or set DATASET_PATH and OUTPUT_PATH)')

    # Parse S3 path
    output_bucket, output_prefix = parse_s3_path(args.out)

    # Point boto3 at MinIO when MINIO_ENDPOINT is set (plain HTTP is an assumption)
    endpoint = os.environ.get('MINIO_ENDPOINT')
    s3 = boto3.client('s3', endpoint_url=f'http://{endpoint}' if endpoint else None)

    # Load and prepare data, downloading it first if it lives in object storage
    data_path = args.data
    if data_path.startswith('s3://'):
        data_bucket, data_key = parse_s3_path(data_path)
        data_path = os.path.basename(data_key)
        s3.download_file(data_bucket, data_key, data_path)
    df = pd.read_csv(data_path)

    # Train models (existing logic)
    results = train_models(df, args.variant)

    # Upload model files
    for filename in ['model.joblib', 'scaler.joblib', 'label_encoder.json', 'selected_features.json']:
        if os.path.exists(filename):
            s3.upload_file(filename, output_bucket, f"{output_prefix}/{filename}")

    # Upload metadata
    metadata = {
        'version': output_prefix,
        'training_date': datetime.now(timezone.utc).isoformat(),
        'metrics': results,
        # selected_features is assumed to be produced by the existing training logic
        'features': selected_features,
    }
    s3.put_object(
        Bucket=output_bucket,
        Key=f"{output_prefix}/metadata.json",
        Body=json.dumps(metadata)
    )

    print(f"Training complete. Artifacts saved to s3://{output_bucket}/{output_prefix}")


if __name__ == '__main__':
    main()
```

---
## 8. CI/CD Pipeline

### 8.1 GitHub Actions (Optional)

```yaml
# .github/workflows/train.yml
name: Model Training

on:
  workflow_dispatch:
    inputs:
      dataset_version:
        description: 'Dataset version'
        required: true
      model_version:
        description: 'Model version'
        required: true

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r training/requirements.txt

      - name: Run training
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          python training/train.py \
            --data s3://geocrop-datasets/${{ github.event.inputs.dataset_version }}/training_data.csv \
            --out s3://geocrop-models/${{ github.event.inputs.model_version }}/ \
            --variant Scaled
```

---
## 9. Security

### 9.1 Admin Authentication

- Require the admin role in the JWT
- Check `user.get('is_admin', False)` before any admin operation
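
The check in the second bullet can be sketched as a small guard. This version is framework-free so it runs on its own: `require_admin` is a hypothetical name, the `is_admin` claim is the one named above, and in the FastAPI app the same test would presumably live inside `get_current_admin_user` and raise an HTTP 403 instead of `PermissionError`.

```python
def require_admin(user: dict) -> dict:
    """Return the decoded JWT claims if they carry the admin flag, else raise."""
    # A missing claim is treated the same as is_admin=False.
    if not user.get("is_admin", False):
        raise PermissionError("admin role required")
    return user


print(require_admin({"sub": "alice", "is_admin": True}))
```

Defaulting to `False` means a token without the claim is rejected rather than silently accepted.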
### 9.2 Kubernetes RBAC

- Only the admin service account can create training jobs
- Training jobs run with limited permissions

### 9.3 MinIO Policies

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::geocrop-datasets/*",
        "arn:aws:s3:::geocrop-models/*"
      ]
    }
  ]
}
```

---
## 10. Implementation Checklist

- [ ] Create Kubernetes ServiceAccount and RBAC for admin
- [ ] Create training job manifest template
- [ ] Update training script to upload to MinIO
- [ ] Create API endpoints for dataset upload
- [ ] Create API endpoints for training triggers
- [ ] Create API endpoints for model registry
- [ ] Implement model promotion logic
- [ ] Build admin frontend components
  - [ ] Add dataset upload UI
  - [ ] Add training trigger UI
  - [ ] Add model registry UI
- [ ] Test end-to-end training pipeline

### 10.1 Promotion Workflow

- "train" produces a candidate model version
- "promote" marks that version as the default, which the UI and inference jobs then use

---
## 11. Technical Notes

### 11.1 GPU Support

If GPU training is needed:

- Add `nvidia.com/gpu` resource requests and limits (the manifest in section 3.1 already includes them)
- Use a CUDA-enabled image
- Install GPU-enabled TensorFlow/PyTorch

### 11.2 Training Timeout

- Kubernetes Jobs have no timeout by default
- Set `activeDeadlineSeconds` on the Job spec to prevent runaway jobs

### 11.3 Model Selection

- Store multiple model outputs (XGBoost, LightGBM, CatBoost)
- Select the best model based on validation metrics
- Allow the admin to override the selection
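
The selection rule in 11.3 could be sketched as below. The function name, the choice of `f1_macro` as the default metric, and the `override` parameter are illustrative assumptions; the metrics dict has the same shape as the `models` block in `metadata.json`.

```python
from typing import Optional


def select_model(metrics: dict, metric: str = "f1_macro",
                 override: Optional[str] = None) -> str:
    """Pick the model with the highest metric unless an admin override is given."""
    if override is not None:
        if override not in metrics:
            raise KeyError(f"unknown model: {override}")
        return override
    return max(metrics, key=lambda name: metrics[name][metric])


metrics = {
    "lightgbm": {"accuracy": 0.91, "f1_macro": 0.88},
    "xgboost": {"accuracy": 0.89, "f1_macro": 0.85},
    "catboost": {"accuracy": 0.88, "f1_macro": 0.84},
}
print(select_model(metrics))                      # lightgbm
print(select_model(metrics, override="xgboost"))  # xgboost
```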
---

## 12. Next Steps

After implementation approval:

1. Create Kubernetes RBAC manifests
2. Create training job template
3. Update training script for MinIO upload
4. Implement admin API endpoints
5. Build admin frontend
6. Test training pipeline
7. Document admin procedures