Plan 04: Admin Retraining CI/CD
Status: Pending Implementation
Date: 2026-02-27
Objective
Build an admin-triggered ML model retraining pipeline that:
- Enables admins to upload new training datasets
- Triggers Kubernetes Jobs for model training
- Stores trained models in MinIO
- Maintains a model registry for versioning
- Allows promotion of models to production
1. Architecture Overview
```mermaid
graph TD
    A[Admin Panel] -->|Upload Dataset| B[API]
    B -->|Store| C[MinIO: geocrop-datasets]
    B -->|Trigger Job| D[Kubernetes API]
    D -->|Run| E[Training Job Pod]
    E -->|Read Dataset| C
    E -->|Download Dependencies| F[PyPI/NPM]
    E -->|Train| G[ML Models]
    G -->|Upload| H[MinIO: geocrop-models]
    H -->|Update| I[Model Registry]
    I -->|Promote| J[Production]
```
2. Current Training Code
2.1 Existing Training Script
Location: training/train.py
Current features:
- Uses XGBoost, LightGBM, CatBoost, RandomForest
- Feature selection with Scout (LightGBM)
- StandardScaler for normalization
- Outputs model artifacts to local directory
2.2 Training Configuration
From apps/worker/config.py:
```python
from dataclasses import dataclass, field

@dataclass
class TrainingConfig:
    # Dataset
    label_col: str = "label"
    junk_cols: list = field(default_factory=lambda: [...])
    # Split
    test_size: float = 0.2
    random_state: int = 42
    # Model hyperparameters
    rf_n_estimators: int = 200
    xgb_n_estimators: int = 300
    lgb_n_estimators: int = 800
    # Artifact upload
    upload_minio: bool = False
    minio_bucket: str = "geocrop-models"
```
3. Kubernetes Job Strategy
3.1 Training Job Manifest
Create k8s/jobs/training-job.yaml:
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: geocrop-train-{version}
  namespace: geocrop
  labels:
    app: geocrop-train
    version: "{version}"
spec:
  backoffLimit: 3
  ttlSecondsAfterFinished: 3600
  template:
    metadata:
      labels:
        app: geocrop-train
    spec:
      restartPolicy: OnFailure
      serviceAccountName: geocrop-admin
      containers:
        - name: trainer
          image: frankchine/geocrop-worker:latest
          command: ["python", "training/train.py"]
          env:
            - name: DATASET_PATH
              value: "s3://geocrop-datasets/{dataset_version}/training_data.csv"
            - name: OUTPUT_PATH
              value: "s3://geocrop-models/{model_version}/"
            - name: MINIO_ENDPOINT
              value: "minio.geocrop.svc.cluster.local:9000"
            - name: MODEL_VARIANT
              value: "Scaled"
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: geocrop-secrets
                  key: minio-access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: geocrop-secrets
                  key: minio-secret-key
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
              nvidia.com/gpu: "1"
            limits:
              memory: "8Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: cache
              mountPath: /root/.cache/pip
      volumes:
        - name: cache
          emptyDir: {}
```
3.2 Service Account
```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: geocrop-admin
  namespace: geocrop
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: geocrop-job-creator
  namespace: geocrop
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: geocrop-admin-job-binding
  namespace: geocrop
subjects:
  - kind: ServiceAccount
    name: geocrop-admin
    namespace: geocrop
roleRef:
  kind: Role
  name: geocrop-job-creator
  apiGroup: rbac.authorization.k8s.io
```
4. API Endpoints for Admin
4.1 Dataset Management
```python
# apps/api/admin.py
from fastapi import APIRouter, UploadFile, File, Depends, HTTPException
from minio import Minio

router = APIRouter(prefix="/admin", tags=["Admin"])

@router.post("/datasets/upload")
async def upload_dataset(
    version: str,
    file: UploadFile = File(...),
    current_user: dict = Depends(get_current_admin_user),
):
    """Upload a new training dataset version."""
    # Validate file type
    if not file.filename.endswith(".csv"):
        raise HTTPException(400, "Only CSV files supported")
    # Upload to MinIO
    client = get_minio_client()
    client.put_object(
        "geocrop-datasets",
        f"{version}/{file.filename}",
        file.file,
        file.size,
    )
    return {"status": "uploaded", "version": version, "filename": file.filename}

@router.get("/datasets")
async def list_datasets(current_user: dict = Depends(get_current_admin_user)):
    """List all available datasets."""
    # List objects in the geocrop-datasets bucket
    pass
```
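The `list_datasets` stub above mainly needs to group object keys of the form `<version>/<filename>` by version. A minimal sketch of that grouping as a pure helper (the MinIO call that feeds it is shown as a comment; `group_datasets_by_version` is a hypothetical name, not existing code):

```python
from collections import defaultdict

def group_datasets_by_version(object_names: list[str]) -> dict[str, list[str]]:
    """Group object keys of the form '<version>/<filename>' by version."""
    versions: dict[str, list[str]] = defaultdict(list)
    for name in object_names:
        version, _, filename = name.partition("/")
        if filename:  # skip keys without a version prefix
            versions[version].append(filename)
    return dict(versions)

# In the endpoint, the names would come from MinIO, e.g.:
#   names = [o.object_name
#            for o in client.list_objects("geocrop-datasets", recursive=True)]
#   return group_datasets_by_version(names)
```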
4.2 Training Triggers
```python
@router.post("/training/start")
async def start_training(
    dataset_version: str,
    model_version: str,
    model_variant: str = "Scaled",
    current_user: dict = Depends(get_current_admin_user),
):
    """Start a training job."""
    # Create Kubernetes Job
    job_manifest = create_training_job_manifest(
        dataset_version=dataset_version,
        model_version=model_version,
        model_variant=model_variant,
    )
    k8s_api.create_namespaced_job("geocrop", job_manifest)
    return {
        "status": "started",
        "job_name": job_manifest["metadata"]["name"],
        "dataset": dataset_version,
        "model_version": model_version,
    }

@router.get("/training/jobs")
async def list_training_jobs(current_user: dict = Depends(get_current_admin_user)):
    """List all training jobs."""
    jobs = k8s_api.list_namespaced_job("geocrop", label_selector="app=geocrop-train")
    return {"jobs": [...]}  # Parse job status
```
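The endpoint above calls `create_training_job_manifest`, which is not yet defined. A hedged sketch of how it could build the manifest dict to mirror the `k8s/jobs/training-job.yaml` template (trimmed to the essential fields; resources, secrets, and volumes from the template would be filled in the same way):

```python
def create_training_job_manifest(
    dataset_version: str,
    model_version: str,
    model_variant: str = "Scaled",
) -> dict:
    """Build a batch/v1 Job manifest mirroring k8s/jobs/training-job.yaml."""
    env = [
        {"name": "DATASET_PATH",
         "value": f"s3://geocrop-datasets/{dataset_version}/training_data.csv"},
        {"name": "OUTPUT_PATH", "value": f"s3://geocrop-models/{model_version}/"},
        {"name": "MODEL_VARIANT", "value": model_variant},
    ]
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"geocrop-train-{model_version}",
            "namespace": "geocrop",
            "labels": {"app": "geocrop-train", "version": model_version},
        },
        "spec": {
            "backoffLimit": 3,
            "ttlSecondsAfterFinished": 3600,
            "template": {
                "metadata": {"labels": {"app": "geocrop-train"}},
                "spec": {
                    "restartPolicy": "OnFailure",
                    "serviceAccountName": "geocrop-admin",
                    "containers": [{
                        "name": "trainer",
                        "image": "frankchine/geocrop-worker:latest",
                        "command": ["python", "training/train.py"],
                        "env": env,
                    }],
                },
            },
        },
    }
```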
4.3 Model Registry
```python
@router.get("/models")
async def list_models():
    """List all trained models."""
    # Query model registry (could be in MinIO metadata or a separate DB)
    pass

@router.post("/models/{model_version}/promote")
async def promote_model(
    model_version: str,
    current_user: dict = Depends(get_current_admin_user),
):
    """Promote a model to production."""
    # Update the model registry to set the default model.
    # This changes which model is used by inference jobs.
    pass
```
5. Model Registry
5.1 Dataset Versioning
```
datasets/<dataset_name>/vYYYYMMDD/<files>
```
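The `vYYYYMMDD` convention can be generated trivially; a small sketch (`dataset_prefix` is a hypothetical helper, not existing code):

```python
from datetime import date

def dataset_prefix(dataset_name: str, on: date) -> str:
    """Build the object-store prefix for a dataset snapshot (vYYYYMMDD convention)."""
    return f"datasets/{dataset_name}/v{on.strftime('%Y%m%d')}"
```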
5.2 Model Registry Storage
Store model metadata in MinIO:
```
geocrop-models/
├── registry.json               # Model registry index
├── v1/
│   ├── metadata.json           # Model details
│   ├── model.joblib            # Trained model
│   ├── scaler.joblib           # Feature scaler
│   ├── label_encoder.json      # Class mapping
│   └── selected_features.json  # Feature list
└── v2/
    └── ...
```
5.3 Registry Schema
`registry.json`:
```json
{
  "models": [
    {
      "version": "v1",
      "created": "2026-02-01T10:00:00Z",
      "dataset_version": "v1",
      "features": ["ndvi_peak", "evi_peak", "savi_peak"],
      "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
      "metrics": {
        "accuracy": 0.89,
        "f1_macro": 0.85
      },
      "is_default": true
    }
  ],
  "default_model": "v1"
}
```
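Given this shape, the promotion logic in the `/models/{model_version}/promote` endpoint reduces to a small registry update. A sketch as a pure function (the endpoint would read `registry.json` from MinIO, apply this, and write it back; `promote_in_registry` is a hypothetical name):

```python
def promote_in_registry(registry: dict, version: str) -> dict:
    """Mark `version` as the default model in a registry.json-shaped dict."""
    known = {m["version"] for m in registry["models"]}
    if version not in known:
        raise ValueError(f"unknown model version: {version}")
    for m in registry["models"]:
        m["is_default"] = (m["version"] == version)
    registry["default_model"] = version
    return registry
```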
5.4 Metadata Schema
`v1/metadata.json`:
```json
{
  "version": "v1",
  "training_date": "2026-02-01T10:00:00Z",
  "dataset_version": "v1",
  "training_samples": 1500,
  "test_samples": 500,
  "features": ["ndvi_peak", "evi_peak", "savi_peak"],
  "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
  "models": {
    "lightgbm": {
      "accuracy": 0.91,
      "f1_macro": 0.88
    },
    "xgboost": {
      "accuracy": 0.89,
      "f1_macro": 0.85
    },
    "catboost": {
      "accuracy": 0.88,
      "f1_macro": 0.84
    }
  },
  "selected_model": "lightgbm",
  "training_params": {
    "n_estimators": 800,
    "learning_rate": 0.03,
    "num_leaves": 63
  }
}
```
6. Frontend Admin Panel
6.1 Admin Page Structure
```tsx
// app/admin/page.tsx
export default function AdminPage() {
  return (
    <div className="p-6">
      <h1 className="text-2xl font-bold mb-6">Admin Panel</h1>
      <div className="grid grid-cols-2 gap-6">
        {/* Dataset Upload */}
        <DatasetUploadCard />
        {/* Training Controls */}
        <TrainingCard />
        {/* Model Registry */}
        <ModelRegistryCard />
      </div>
    </div>
  );
}
```
6.2 Dataset Upload Component
```tsx
// components/admin/DatasetUpload.tsx
'use client';
import { useState } from 'react';
import { useMutation } from '@tanstack/react-query';

// `token` (auth) and `toast` (notifications) are assumed to come from the
// app's existing auth and toast helpers.
export function DatasetUpload() {
  const [version, setVersion] = useState('');
  const [file, setFile] = useState<File | null>(null);
  const upload = useMutation({
    mutationFn: async () => {
      const formData = new FormData();
      formData.append('version', version);
      formData.append('file', file!);
      return fetch('/api/admin/datasets/upload', {
        method: 'POST',
        body: formData,
        headers: { Authorization: `Bearer ${token}` }
      });
    },
    onSuccess: () => {
      toast.success('Dataset uploaded successfully');
    }
  });
  return (
    <div className="card">
      <h2>Upload Dataset</h2>
      <input
        type="text"
        placeholder="Version (e.g., v2)"
        value={version}
        onChange={e => setVersion(e.target.value)}
      />
      <input
        type="file"
        accept=".csv"
        onChange={e => setFile(e.target.files?.[0] || null)}
      />
      <button onClick={() => upload.mutate()} disabled={!file || !version}>
        Upload
      </button>
    </div>
  );
}
```
6.3 Training Trigger Component
```tsx
// components/admin/TrainingTrigger.tsx
'use client';
import { useState } from 'react';
import { useMutation } from '@tanstack/react-query';

export function TrainingTrigger() {
  const [datasetVersion, setDatasetVersion] = useState('');
  const [modelVersion, setModelVersion] = useState('');
  const [variant, setVariant] = useState('Scaled');
  const startTraining = useMutation({
    mutationFn: async () => {
      return fetch('/api/admin/training/start', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          dataset_version: datasetVersion,
          model_version: modelVersion,
          model_variant: variant
        })
      });
    }
  });
  return (
    <div className="card">
      <h2>Start Training</h2>
      <select value={datasetVersion} onChange={e => setDatasetVersion(e.target.value)}>
        {/* List available datasets */}
      </select>
      <input
        type="text"
        placeholder="Model version (e.g., v2)"
        value={modelVersion}
        onChange={e => setModelVersion(e.target.value)}
      />
      <button onClick={() => startTraining.mutate()}>
        Start Training Job
      </button>
    </div>
  );
}
```
7. Training Script Updates
7.1 Modified Training Entry Point
```python
# training/train.py
import argparse
import json
import os
from datetime import datetime

import boto3
import pandas as pd

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', required=True, help='Path to training data CSV')
    parser.add_argument('--out', required=True, help='Output directory (s3://...)')
    parser.add_argument('--variant', default='Scaled', choices=['Scaled', 'Raw'])
    args = parser.parse_args()

    # Parse S3 path
    output_bucket, output_prefix = parse_s3_path(args.out)

    # Load and prepare data
    df = pd.read_csv(args.data)

    # Train models (existing logic); assumed to also return the selected feature list
    results, selected_features = train_models(df, args.variant)

    # Upload artifacts to MinIO (S3-compatible endpoint from the job's env)
    s3 = boto3.client('s3', endpoint_url=f"http://{os.environ['MINIO_ENDPOINT']}")

    # Upload model files
    for filename in ['model.joblib', 'scaler.joblib', 'label_encoder.json', 'selected_features.json']:
        if os.path.exists(filename):
            s3.upload_file(filename, output_bucket, f"{output_prefix}/{filename}")

    # Upload metadata
    metadata = {
        'version': output_prefix,
        'training_date': datetime.utcnow().isoformat(),
        'metrics': results,
        'features': selected_features,
    }
    s3.put_object(
        Bucket=output_bucket,
        Key=f"{output_prefix}/metadata.json",
        Body=json.dumps(metadata),
    )
    print(f"Training complete. Artifacts saved to s3://{output_bucket}/{output_prefix}")

if __name__ == '__main__':
    main()
```
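The script relies on a `parse_s3_path` helper that is not defined yet. A minimal sketch of what it could look like:

```python
def parse_s3_path(path: str) -> tuple[str, str]:
    """Split 's3://bucket/prefix/' into (bucket, prefix), dropping any trailing slash."""
    if not path.startswith("s3://"):
        raise ValueError(f"not an s3:// path: {path}")
    bucket, _, prefix = path[len("s3://"):].partition("/")
    return bucket, prefix.rstrip("/")
```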
8. CI/CD Pipeline
8.1 GitHub Actions (Optional)
```yaml
# .github/workflows/train.yml
name: Model Training
on:
  workflow_dispatch:
    inputs:
      dataset_version:
        description: 'Dataset version'
        required: true
      model_version:
        description: 'Model version'
        required: true
jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'
      - name: Install dependencies
        run: |
          pip install -r training/requirements.txt
      - name: Run training
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          python training/train.py \
            --data s3://geocrop-datasets/${{ github.event.inputs.dataset_version }}/training_data.csv \
            --out s3://geocrop-models/${{ github.event.inputs.model_version }}/ \
            --variant Scaled
```
9. Security
9.1 Admin Authentication
- Require the admin role in the JWT
- Check `user.get('is_admin', False)` before any admin operation
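The admin check reduces to a small guard. A sketch of the core logic as a pure function (in the API this would live inside a FastAPI dependency such as `get_current_admin_user`, raising `HTTPException(403)` after JWT validation; the function name and `PermissionError` here are illustrative):

```python
def require_admin(user: dict) -> dict:
    """Reject callers whose decoded JWT claims lack the admin role."""
    if not user.get("is_admin", False):
        raise PermissionError("admin role required")
    return user
```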
9.2 Kubernetes RBAC
- Only admin service account can create training jobs
- Training jobs run with limited permissions
9.3 MinIO Policies
```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::geocrop-datasets/*",
        "arn:aws:s3:::geocrop-models/*"
      ]
    }
  ]
}
```
10. Implementation Checklist
- [ ] Create Kubernetes ServiceAccount and RBAC for admin
- [ ] Create training job manifest template
- [ ] Update training script to upload to MinIO
- [ ] Create API endpoints for dataset upload
- [ ] Create API endpoints for training triggers
- [ ] Create API endpoints for model registry
- [ ] Implement model promotion logic
- [ ] Build admin frontend components
- [ ] Add dataset upload UI
- [ ] Add training trigger UI
- [ ] Add model registry UI
- [ ] Test end-to-end training pipeline
10.1 Promotion Workflow
- "train" produces candidate model version
- "promote" marks it as default for UI
11. Technical Notes
11.1 GPU Support
If GPU training needed:
- Add nvidia.com/gpu resource requests
- Use CUDA-enabled image
- Install GPU-enabled TensorFlow/PyTorch
11.2 Training Timeout
- Kubernetes Jobs have no time limit by default
- Set `activeDeadlineSeconds` to prevent runaway jobs
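For illustration, a fragment showing where `activeDeadlineSeconds` would sit in the training Job spec (the 2-hour value is an assumption, not a measured training time):

```yaml
spec:
  backoffLimit: 3
  activeDeadlineSeconds: 7200  # kill the Job if training runs longer than 2 hours
```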
11.3 Model Selection
- Store multiple model outputs (XGBoost, LightGBM, CatBoost)
- Select best based on validation metrics
- Allow admin to override selection
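Selecting the best model from the per-model metrics in `metadata.json` is a one-liner; a sketch (`select_best_model` is a hypothetical helper, and the admin override would simply bypass it):

```python
def select_best_model(models: dict[str, dict], metric: str = "f1_macro") -> str:
    """Return the model name with the highest validation metric.

    `models` follows the metadata.json 'models' shape:
    {"lightgbm": {"accuracy": ..., "f1_macro": ...}, ...}
    """
    return max(models, key=lambda name: models[name][metric])
```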
12. Next Steps
After implementation approval:
- Create Kubernetes RBAC manifests
- Create training job template
- Update training script for MinIO upload
- Implement admin API endpoints
- Build admin frontend
- Test training pipeline
- Document admin procedures