# Plan 04: Admin Retraining CI/CD

**Status**: Pending Implementation
**Date**: 2026-02-27

---

## Objective

Build an admin-triggered ML model retraining pipeline that:

1. Enables admins to upload new training datasets
2. Triggers Kubernetes Jobs for model training
3. Stores trained models in MinIO
4. Maintains a model registry for versioning
5. Allows promotion of models to production

---

## 1. Architecture Overview

```mermaid
graph TD
    A[Admin Panel] -->|Upload Dataset| B[API]
    B -->|Store| C[MinIO: geocrop-datasets]
    B -->|Trigger Job| D[Kubernetes API]
    D -->|Run| E[Training Job Pod]
    E -->|Read Dataset| C
    E -->|Download Dependencies| F[PyPI/NPM]
    E -->|Train| G[ML Models]
    G -->|Upload| H[MinIO: geocrop-models]
    H -->|Update| I[Model Registry]
    I -->|Promote| J[Production]
```

---

## 2. Current Training Code

### 2.1 Existing Training Script

Location: [`training/train.py`](training/train.py)

Current features:

- Uses XGBoost, LightGBM, CatBoost, RandomForest
- Feature selection with Scout (LightGBM)
- StandardScaler for normalization
- Outputs model artifacts to local directory

### 2.2 Training Configuration

From [`apps/worker/config.py`](apps/worker/config.py:28):

```python
@dataclass
class TrainingConfig:
    # Dataset
    label_col: str = "label"
    junk_cols: list = field(default_factory=lambda: [...])

    # Split
    test_size: float = 0.2
    random_state: int = 42

    # Model hyperparameters
    rf_n_estimators: int = 200
    xgb_n_estimators: int = 300
    lgb_n_estimators: int = 800

    # Artifact upload
    upload_minio: bool = False
    minio_bucket: str = "geocrop-models"
```

---
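The split settings in `TrainingConfig` can be exercised with a minimal sketch. This is illustrative only, assuming pandas/scikit-learn are available; the toy frame, the `prepare_split` helper, and the `"id"` junk column are placeholders, not the real dataset or training code:

```python
from dataclasses import dataclass, field

import pandas as pd
from sklearn.model_selection import train_test_split


@dataclass
class TrainingConfig:
    label_col: str = "label"
    junk_cols: list = field(default_factory=lambda: ["id"])  # placeholder junk column
    test_size: float = 0.2
    random_state: int = 42


def prepare_split(df: pd.DataFrame, cfg: TrainingConfig):
    """Drop junk columns, then split features/labels per the config."""
    df = df.drop(columns=[c for c in cfg.junk_cols if c in df.columns])
    X = df.drop(columns=[cfg.label_col])
    y = df[cfg.label_col]
    return train_test_split(
        X, y, test_size=cfg.test_size, random_state=cfg.random_state, stratify=y
    )


# Toy frame: 10 rows, 2 classes -> an 8/2 stratified split
df = pd.DataFrame({
    "id": range(10),
    "ndvi_peak": [0.1 * i for i in range(10)],
    "label": ["cropland", "water"] * 5,
})
X_train, X_test, y_train, y_test = prepare_split(df, TrainingConfig())
print(len(X_train), len(X_test))  # 8 2
```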
## 3. Kubernetes Job Strategy

### 3.1 Training Job Manifest

Create `k8s/jobs/training-job.yaml`:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: geocrop-train-{version}
  namespace: geocrop
  labels:
    app: geocrop-train
    version: "{version}"
spec:
  backoffLimit: 3
  ttlSecondsAfterFinished: 3600
  template:
    metadata:
      labels:
        app: geocrop-train
    spec:
      restartPolicy: OnFailure
      serviceAccountName: geocrop-admin
      containers:
        - name: trainer
          image: frankchine/geocrop-worker:latest
          command: ["python", "training/train.py"]
          env:
            - name: DATASET_PATH
              value: "s3://geocrop-datasets/{dataset_version}/training_data.csv"
            - name: OUTPUT_PATH
              value: "s3://geocrop-models/{model_version}/"
            - name: MINIO_ENDPOINT
              value: "minio.geocrop.svc.cluster.local:9000"
            - name: MODEL_VARIANT
              value: "Scaled"
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: geocrop-secrets
                  key: minio-access-key
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: geocrop-secrets
                  key: minio-secret-key
          resources:
            requests:
              memory: "4Gi"
              cpu: "2"
              nvidia.com/gpu: "1"
            limits:
              memory: "8Gi"
              cpu: "4"
              nvidia.com/gpu: "1"
          volumeMounts:
            - name: cache
              mountPath: /root/.cache/pip
      volumes:
        - name: cache
          emptyDir: {}
```

### 3.2 Service Account

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: geocrop-admin
  namespace: geocrop
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: geocrop-job-creator
  namespace: geocrop
rules:
  - apiGroups: ["batch"]
    resources: ["jobs"]
    verbs: ["create", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: geocrop-admin-job-binding
  namespace: geocrop
subjects:
  - kind: ServiceAccount
    name: geocrop-admin
    namespace: geocrop
roleRef:
  kind: Role
  name: geocrop-job-creator
  apiGroup: rbac.authorization.k8s.io
```

---
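The `{version}` placeholders in the manifest imply a templating step before the Job is submitted. A minimal sketch of that step follows; the helper name `create_training_job_manifest` matches the API code in this plan, but its exact shape here is an assumption (only the fields that vary per run are shown):

```python
def create_training_job_manifest(dataset_version: str, model_version: str,
                                 model_variant: str = "Scaled") -> dict:
    """Fill the training-job template placeholders and return a Job dict
    suitable for the Kubernetes batch/v1 API."""
    return {
        "apiVersion": "batch/v1",
        "kind": "Job",
        "metadata": {
            "name": f"geocrop-train-{model_version}",
            "namespace": "geocrop",
            "labels": {"app": "geocrop-train", "version": model_version},
        },
        "spec": {
            "backoffLimit": 3,
            "ttlSecondsAfterFinished": 3600,
            "template": {
                "spec": {
                    "restartPolicy": "OnFailure",
                    "serviceAccountName": "geocrop-admin",
                    "containers": [{
                        "name": "trainer",
                        "image": "frankchine/geocrop-worker:latest",
                        "command": ["python", "training/train.py"],
                        "env": [
                            {"name": "DATASET_PATH",
                             "value": f"s3://geocrop-datasets/{dataset_version}/training_data.csv"},
                            {"name": "OUTPUT_PATH",
                             "value": f"s3://geocrop-models/{model_version}/"},
                            {"name": "MODEL_VARIANT", "value": model_variant},
                        ],
                    }],
                }
            },
        },
    }


manifest = create_training_job_manifest("v20260301", "v2")
print(manifest["metadata"]["name"])  # geocrop-train-v2
```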
## 4. API Endpoints for Admin

### 4.1 Dataset Management

```python
# apps/api/admin.py
from fastapi import APIRouter, UploadFile, File, Depends, HTTPException
from minio import Minio

router = APIRouter(prefix="/admin", tags=["Admin"])


@router.post("/datasets/upload")
async def upload_dataset(
    version: str,
    file: UploadFile = File(...),
    current_user: dict = Depends(get_current_admin_user),
):
    """Upload a new training dataset version."""
    # Validate file type
    if not file.filename.endswith('.csv'):
        raise HTTPException(400, "Only CSV files supported")

    # Upload to MinIO
    client = get_minio_client()
    client.put_object(
        "geocrop-datasets",
        f"{version}/{file.filename}",
        file.file,
        file.size,
    )

    return {"status": "uploaded", "version": version, "filename": file.filename}


@router.get("/datasets")
async def list_datasets(current_user: dict = Depends(get_current_admin_user)):
    """List all available datasets."""
    # List objects in geocrop-datasets bucket
    pass
```

### 4.2 Training Triggers

```python
@router.post("/training/start")
async def start_training(
    dataset_version: str,
    model_version: str,
    model_variant: str = "Scaled",
    current_user: dict = Depends(get_current_admin_user),
):
    """Start a training job."""
    # Create Kubernetes Job
    job_manifest = create_training_job_manifest(
        dataset_version=dataset_version,
        model_version=model_version,
        model_variant=model_variant,
    )
    k8s_api.create_namespaced_job("geocrop", job_manifest)

    return {
        "status": "started",
        "job_name": job_manifest["metadata"]["name"],
        "dataset": dataset_version,
        "model_version": model_version,
    }


@router.get("/training/jobs")
async def list_training_jobs(current_user: dict = Depends(get_current_admin_user)):
    """List all training jobs."""
    jobs = k8s_api.list_namespaced_job("geocrop", label_selector="app=geocrop-train")
    return {"jobs": [...]}  # Parse job status
```

### 4.3 Model Registry

```python
@router.get("/models")
async def list_models():
    """List all trained models."""
    # Query model registry (could be in MinIO metadata or separate DB)
    pass


@router.post("/models/{model_version}/promote")
async def promote_model(
    model_version: str,
    current_user: dict = Depends(get_current_admin_user),
):
    """Promote a model to production."""
    # Update model registry to set default model
    # This changes which model is used by inference jobs
    pass
```

---

## 5. Model Registry

### 5.1 Dataset Versioning

- `datasets//vYYYYMMDD/`

### 5.2 Model Registry Storage

Store model metadata in MinIO:

```
geocrop-models/
├── registry.json                # Model registry index
├── v1/
│   ├── metadata.json            # Model details
│   ├── model.joblib             # Trained model
│   ├── scaler.joblib            # Feature scaler
│   ├── label_encoder.json       # Class mapping
│   └── selected_features.json   # Feature list
└── v2/
    └── ...
```

### 5.3 Registry Schema

```json
// registry.json
{
  "models": [
    {
      "version": "v1",
      "created": "2026-02-01T10:00:00Z",
      "dataset_version": "v1",
      "features": ["ndvi_peak", "evi_peak", "savi_peak"],
      "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
      "metrics": { "accuracy": 0.89, "f1_macro": 0.85 },
      "is_default": true
    }
  ],
  "default_model": "v1"
}
```

### 5.4 Metadata Schema

```json
// v1/metadata.json
{
  "version": "v1",
  "training_date": "2026-02-01T10:00:00Z",
  "dataset_version": "v1",
  "training_samples": 1500,
  "test_samples": 500,
  "features": ["ndvi_peak", "evi_peak", "savi_peak"],
  "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
  "models": {
    "lightgbm": { "accuracy": 0.91, "f1_macro": 0.88 },
    "xgboost": { "accuracy": 0.89, "f1_macro": 0.85 },
    "catboost": { "accuracy": 0.88, "f1_macro": 0.84 }
  },
  "selected_model": "lightgbm",
  "training_params": {
    "n_estimators": 800,
    "learning_rate": 0.03,
    "num_leaves": 63
  }
}
```

---

## 6. Frontend Admin Panel

### 6.1 Admin Page Structure

```tsx
// app/admin/page.tsx
export default function AdminPage() {
  return (
    <main>
      <h1>Admin Panel</h1>
      {/* Dataset Upload */}
      <DatasetUpload />
      {/* Training Controls */}
      <TrainingTrigger />
      {/* Model Registry */}
      <ModelRegistry />
    </main>
  );
}
```

### 6.2 Dataset Upload Component

```tsx
// components/admin/DatasetUpload.tsx
'use client';

import { useState } from 'react';
import { useMutation } from '@tanstack/react-query';

export function DatasetUpload() {
  const [version, setVersion] = useState('');
  const [file, setFile] = useState<File | null>(null);

  const upload = useMutation({
    mutationFn: async () => {
      const formData = new FormData();
      formData.append('version', version);
      formData.append('file', file!);
      return fetch('/api/admin/datasets/upload', {
        method: 'POST',
        body: formData,
        headers: { Authorization: `Bearer ${token}` },
      });
    },
    onSuccess: () => {
      toast.success('Dataset uploaded successfully');
    },
  });

  return (
    <div>
      <h2>Upload Dataset</h2>
      <input
        type="text"
        placeholder="Dataset version"
        value={version}
        onChange={(e) => setVersion(e.target.value)}
      />
      <input
        type="file"
        accept=".csv"
        onChange={(e) => setFile(e.target.files?.[0] || null)}
      />
      <button onClick={() => upload.mutate()}>Upload</button>
    </div>
  );
}
```

### 6.3 Training Trigger Component

```tsx
// components/admin/TrainingTrigger.tsx
export function TrainingTrigger() {
  const [datasetVersion, setDatasetVersion] = useState('');
  const [modelVersion, setModelVersion] = useState('');
  const [variant, setVariant] = useState('Scaled');

  const startTraining = useMutation({
    mutationFn: async () => {
      return fetch('/api/admin/training/start', {
        method: 'POST',
        headers: { 'Content-Type': 'application/json' },
        body: JSON.stringify({
          dataset_version: datasetVersion,
          model_version: modelVersion,
          model_variant: variant,
        }),
      });
    },
  });

  return (
    <div>
      <h2>Start Training</h2>
      {/* Inputs for dataset/model version and variant */}
      <button onClick={() => startTraining.mutate()}>Start Training</button>
    </div>
  );
}
```

---

## 7. Training Script Updates

### 7.1 Modified Training Entry Point

```python
# training/train.py
import argparse
import os
import json
from datetime import datetime

import boto3
import pandas as pd


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--data', required=True, help='Path to training data CSV')
    parser.add_argument('--out', required=True, help='Output directory (s3://...)')
    parser.add_argument('--variant', default='Scaled', choices=['Scaled', 'Raw'])
    args = parser.parse_args()

    # Parse S3 path
    output_bucket, output_prefix = parse_s3_path(args.out)

    # Load and prepare data
    df = pd.read_csv(args.data)

    # Train models (existing logic)
    results = train_models(df, args.variant)

    # Upload artifacts to MinIO
    s3 = boto3.client('s3')

    # Upload model files
    for filename in ['model.joblib', 'scaler.joblib', 'label_encoder.json', 'selected_features.json']:
        if os.path.exists(filename):
            s3.upload_file(filename, output_bucket, f"{output_prefix}/{filename}")

    # Upload metadata
    metadata = {
        'version': output_prefix,
        'training_date': datetime.utcnow().isoformat(),
        'metrics': results,
        'features': selected_features,
    }
    s3.put_object(
        Bucket=output_bucket,
        Key=f"{output_prefix}/metadata.json",
        Body=json.dumps(metadata),
    )

    print(f"Training complete. Artifacts saved to s3://{output_bucket}/{output_prefix}")


if __name__ == '__main__':
    main()
```

---
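The training entry point calls `parse_s3_path` without defining it. A minimal sketch of what that helper might look like, using only the standard library:

```python
from urllib.parse import urlparse


def parse_s3_path(path: str) -> tuple[str, str]:
    """Split 's3://bucket/prefix/' into (bucket, prefix), dropping slashes."""
    parsed = urlparse(path)
    if parsed.scheme != "s3":
        raise ValueError(f"Expected an s3:// path, got: {path}")
    return parsed.netloc, parsed.path.strip("/")


print(parse_s3_path("s3://geocrop-models/v2/"))  # ('geocrop-models', 'v2')
```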
## 8. CI/CD Pipeline

### 8.1 GitHub Actions (Optional)

```yaml
# .github/workflows/train.yml
name: Model Training

on:
  workflow_dispatch:
    inputs:
      dataset_version:
        description: 'Dataset version'
        required: true
      model_version:
        description: 'Model version'
        required: true

jobs:
  train:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install -r training/requirements.txt

      - name: Run training
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: |
          python training/train.py \
            --data s3://geocrop-datasets/${{ github.event.inputs.dataset_version }}/training_data.csv \
            --out s3://geocrop-models/${{ github.event.inputs.model_version }}/ \
            --variant Scaled
```

---

## 9. Security

### 9.1 Admin Authentication

- Require admin role in JWT
- Check `user.get('is_admin', False)` before any admin operation

### 9.2 Kubernetes RBAC

- Only the admin service account can create training jobs
- Training jobs run with limited permissions

### 9.3 MinIO Policies

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:PutObject", "s3:GetObject"],
      "Resource": [
        "arn:aws:s3:::geocrop-datasets/*",
        "arn:aws:s3:::geocrop-models/*"
      ]
    }
  ]
}
```

---

## 10. Implementation Checklist

- [ ] Create Kubernetes ServiceAccount and RBAC for admin
- [ ] Create training job manifest template
- [ ] Update training script to upload to MinIO
- [ ] Create API endpoints for dataset upload
- [ ] Create API endpoints for training triggers
- [ ] Create API endpoints for model registry
- [ ] Implement model promotion logic
- [ ] Build admin frontend components
- [ ] Add dataset upload UI
- [ ] Add training trigger UI
- [ ] Add model registry UI
- [ ] Test end-to-end training pipeline

### 10.1 Promotion Workflow

- "train" produces a candidate model version
- "promote" marks it as the default for the UI

---
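The promote step described above amounts to rewriting `registry.json` so exactly one version carries the default flag. A sketch under the registry schema earlier in this plan (the function name and in-memory round-trip are assumptions; the real implementation would read/write the object in MinIO):

```python
import json


def promote_model(registry_json: str, model_version: str) -> str:
    """Mark one model version as default and clear the flag on all others."""
    registry = json.loads(registry_json)
    known = {m["version"] for m in registry["models"]}
    if model_version not in known:
        raise ValueError(f"Unknown model version: {model_version}")
    for m in registry["models"]:
        m["is_default"] = (m["version"] == model_version)
    registry["default_model"] = model_version
    return json.dumps(registry, indent=2)


registry = json.dumps({
    "models": [
        {"version": "v1", "is_default": True},
        {"version": "v2", "is_default": False},
    ],
    "default_model": "v1",
})
updated = json.loads(promote_model(registry, "v2"))
print(updated["default_model"])  # v2
```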
## 11. Technical Notes

### 11.1 GPU Support

If GPU training is needed:

- Add `nvidia.com/gpu` resource requests
- Use a CUDA-enabled image
- Install GPU-enabled TensorFlow/PyTorch

### 11.2 Training Timeout

- Default Kubernetes Job timeout: no limit
- Set `activeDeadlineSeconds` to prevent runaway jobs

### 11.3 Model Selection

- Store multiple model outputs (XGBoost, LightGBM, CatBoost)
- Select the best based on validation metrics
- Allow admin to override the selection

---

## 12. Next Steps

After implementation approval:

1. Create Kubernetes RBAC manifests
2. Create training job template
3. Update training script for MinIO upload
4. Implement admin API endpoints
5. Build admin frontend
6. Test training pipeline
7. Document admin procedures
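The model-selection rule in §11.3 ("select best based on validation metrics") could be as simple as a max over `f1_macro` in the per-model metrics block of `metadata.json` (§5). A sketch, with the function name and the choice of `f1_macro` as the ranking metric being assumptions:

```python
def select_best_model(model_metrics: dict) -> str:
    """Pick the model name with the highest f1_macro on the validation set."""
    return max(model_metrics, key=lambda name: model_metrics[name]["f1_macro"])


# Metrics in the shape of the metadata.json "models" block
metrics = {
    "lightgbm": {"accuracy": 0.91, "f1_macro": 0.88},
    "xgboost": {"accuracy": 0.89, "f1_macro": 0.85},
    "catboost": {"accuracy": 0.88, "f1_macro": 0.84},
}
print(select_best_model(metrics))  # lightgbm
```

An admin override (§11.3) would simply bypass this function and write `selected_model` directly.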