Sovereign MLOps Platform: LULC Crop-Mapping Portfolio
Overview
This document outlines the execution plan for restructuring the GeoCrop platform into a GitOps-driven, self-hosted MLOps platform on K3s. It replaces the full Supabase stack with a lightweight Postgres+PostGIS standalone container to conserve RAM while meeting all spatial querying requirements.
Phased Execution Strategy
Phase 1: Infrastructure Setup (The Foundation)
- Terraform (Namespaces & Quotas): Apply Terraform to configure the K3s namespace (geocrop) with explicit ResourceQuotas. We will apply 512MB limits to lightweight services (API, Web) but allocate 2GB to the ML Worker and Jupyter instances to prevent OOM errors.
- Database (Postgres + PostGIS): Deploy a standalone StatefulSet for PostGIS on port 5433 (db.techarvest.co.zw), fully isolated from other apps.
- MLOps Tools (MLflow & Jupyter):
  - Deploy MLflow (ml.techarvest.co.zw) backed by the new PostGIS DB and the existing MinIO artifact store.
  - Deploy a Jupyter Data Science workspace (lab.techarvest.co.zw) configured to pull datasets directly from the MinIO geocrop-datasets bucket, ensuring node-agnostic scheduling.
- GitOps Tools (Gitea & ArgoCD): Initialize Gitea (git.techarvest.co.zw) and ArgoCD (cd.techarvest.co.zw) to take over cluster management.
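As a rough check on the quota sizing above, the sketch below sums the per-service memory limits named in this plan. It assumes one replica of each of the four services and reads "512MB" and "2GB" as binary units (512Mi, 2Gi). The service keys and helper names are illustrative only, and MLflow, PostGIS, Gitea and ArgoCD would need their own allowances on top of this figure.

```python
# Per-service memory limits from the plan, in MiB.
# Assumption: one replica each; "512MB"/"2GB" read as 512Mi/2Gi.
MEMORY_LIMITS_MIB = {
    "api": 512,
    "web": 512,
    "worker": 2048,
    "jupyter": 2048,
}


def namespace_memory_floor(limits_mib: dict[str, int]) -> int:
    """Return the minimum limits.memory (MiB) a namespace quota must allow."""
    return sum(limits_mib.values())


def format_quantity(mib: int) -> str:
    """Format MiB as a Kubernetes-style quantity string (Gi when exact)."""
    return f"{mib // 1024}Gi" if mib % 1024 == 0 else f"{mib}Mi"


total = namespace_memory_floor(MEMORY_LIMITS_MIB)
print(format_quantity(total))  # 5Gi
```

Under these assumptions the four named services alone need a namespace memory quota of at least 5Gi.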
Phase 2: Frontend (React/Vite) Setup & Testing
- Zero-Downtime Requirement: The current live web page at portfolio.techarvest.co.zw MUST remain active and untouched during this transition, as it is actively receiving traffic from job applications.
- Parallel Loading Strategy: Configure the new React frontend components to instantly fetch and render Dynamic World (DW) baselines (2015-2025) via the TiTiler service (tiles.portfolio.techarvest.co.zw) while awaiting ML inference.
- ArgoCD Deployment: Commit the new frontend manifests to the Gitea repository and sync via ArgoCD, carefully routing traffic to avoid disrupting the live welcome page.
- Verification: Test that the new frontend components successfully load and render TiTiler COGs instantly without backend dependency.
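The parallel loading strategy can be illustrated with a small control-flow sketch. It is written in Python with asyncio purely to show the ordering. The real frontend is React, and the function names, delays and return values below are placeholders rather than the project's code.

```python
import asyncio


async def fetch_dw_baseline(events: list[str]) -> str:
    """Stand-in for a fast TiTiler tile request (delay is illustrative)."""
    await asyncio.sleep(0.01)
    events.append("baseline rendered")
    return "dw-baseline"


async def run_inference(events: list[str]) -> str:
    """Stand-in for the slower ML inference job (delay is illustrative)."""
    await asyncio.sleep(0.05)
    events.append("prediction overlaid")
    return "lulc-prediction"


async def load_map() -> list[str]:
    events: list[str] = []
    # Start inference in the background; the baseline does not wait for it.
    inference = asyncio.create_task(run_inference(events))
    await fetch_dw_baseline(events)
    await inference
    return events


events = asyncio.run(load_map())
print(events)
```

The point of the pattern is that the baseline request never blocks on the inference job, so the map is usable before the prediction arrives.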
Phase 3: Backend (API + ML Worker) Setup & CI/CD
- Gitea Actions (CI/CD): Implement .gitea/workflows/build-push.yaml to automatically build apps/worker/Dockerfile and apps/api/Dockerfile, and push them to Docker Hub (frankchine/geocrop-worker:latest, etc.).
- ArgoCD Deployment: Update backend Kubernetes manifests in the GitOps repo to pull from frankchine/..., then sync ArgoCD.
- Worker Tuning: Ensure the ML worker is correctly configured to use the standalone PostGIS database (if spatial logging is needed) and MinIO for models/results.
- Auth & Limits: Implement simple JWT-based auth in the API. Add logic to track job counts per user, enforcing a 5-job limit for Recruiter accounts while allowing unlimited jobs for Admin accounts.
Phase 4: End-to-End System Testing
- Trigger Job: Submit an AOI via the React frontend.
- Verify Instant UX: Ensure the DW baseline renders immediately.
- Verify Inference: Monitor the Redis queue and ML Worker logs to ensure it pulls STAC data, runs the XGBoost/Ensemble model, and writes the output COG to MinIO.
- Verify Result Overlay: Ensure the frontend polls the API and seamlessly overlays the high-resolution LULC prediction once complete.
- Verify MLflow: Check ml.techarvest.co.zw to confirm the run metrics were logged successfully.
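The result-overlay check depends on a polling loop, which can be sketched as below. The status fetcher is injected so the loop can be tested without a running API. The state names (`queued`, `running`, `done`, `failed`) and the `cog_url` field are assumptions, not the API's confirmed response shape.

```python
import time
from typing import Callable


def poll_job(
    fetch_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status until the job is done or failed, or time runs out."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status.get("state") in {"done", "failed"}:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_s)


# Usage with a fake API that finishes on the third poll.
responses = iter([
    {"state": "queued"},
    {"state": "running"},
    {"state": "done", "cog_url": "s3://example-bucket/result.tif"},
])
result = poll_job(lambda: next(responses), interval_s=0.0)
print(result["state"])  # done
```

A fixed interval keeps the sketch short. A backoff schedule would be a reasonable refinement if inference jobs run for minutes.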
Phase 5: Portfolio & Recruiter Experience (New Pages)
Implement the following technical documentation pages within the React frontend to showcase system depth to recruiters:
- GeoCrop System Architecture
  - Visual:

        graph TD
          subgraph "Data Sources"
            STAC[Digital Earth Africa STAC]
          end
          subgraph "Sovereign Cluster (K3s)"
            API[FastAPI Gateway]
            Redis[(Redis Job Queue)]
            Worker[ML Inference Worker]
            MinIO[(MinIO S3 Storage)]
            Tiler[TiTiler]
            DB[(PostGIS Standalone)]
          end
          subgraph "Frontend"
            Web[React/Vite Portfolio]
          end
          STAC --> Worker
          Web --> API
          API --> Redis
          Redis --> Worker
          Worker --> MinIO
          Worker --> DB
          MinIO --> Tiler
          Tiler --> Web

  - Tech Rationale:
    - Why MinIO: local sovereignty + S3 compatibility.
    - Why Argo: lightweight orchestration vs Airflow.
    - Why Supabase/PostGIS: fast Postgres + PostGIS integration for spatial depth.
- Infrastructure Design (K3s Sovereign Cluster)
  - Title: Infrastructure Design (K3s Sovereign Cluster)
  - Visual: Cluster design details (Single-node K3s on Contabo VPS).
  - Resource Strategy: 512MB limits for API/Web; 2GB for Worker/Jupyter.
  - Key Principle: “Designed for low-resource environments while maintaining full MLOps capability.”
  - Terraform Layer: Namespace isolation (geocrop), Resource quotas, Future SSO integration.
  - GitOps Layer: Argo CD as single source of truth (/k8s/base + /overlays). “Everything deployed is version-controlled and reproducible.”
- End-to-End MLOps Workflow
  - Title: End-to-End MLOps Workflow
  - Pipeline Breakdown:
    - Data Ingestion: Zimbabwe CSV batches stored in MinIO.
    - Training: Triggered via Argo Workflows, executed from /training/active.
    - Experiment Tracking: MLflow logs parameters, metrics, and artifacts.
    - Deployment: Model packaged into worker container, deployed via Argo CD.
- Engineering Decisions & Trade-offs (CRITICAL)
  - Title: Engineering Decisions & Trade-offs
  - Argo vs Kubeflow:
    - Decision: Chose Argo Workflows + Argo CD.
    - Why NOT Kubeflow: Too resource-heavy for 512MB constraints; complex deployment overhead.
    - Why Argo: Lightweight, native K8s integration, easier GitOps alignment.
  - Gitea vs GitLab:
    - Decision: Chose Gitea.
    - Why NOT GitLab: High RAM usage; overkill for single-node cluster.
    - Why Gitea: Lightweight, self-hostable in constrained environments, good enough CI/CD via Actions.
  - MLflow vs Alternatives: Simple experiment tracking, easy DB backend integration (Postgres), lightweight vs full ML platforms.
  - MinIO vs Cloud Storage: Full data sovereignty, S3-compatible, works offline / low-connectivity environments.
  - Supabase (Postgres + PostGIS): Spatial queries (critical for geospatial ML), simple API layer, lightweight vs full GIS stacks.
- Observability & Monitoring
  - Title: Observability & System Monitoring
  - Stack: Prometheus (Metrics), Grafana (Visualization), Uptime Kuma (SLA monitoring).
  - Live Endpoints: uptime.techarvest.co.zw, grafana.techarvest.co.zw, prometheus.techarvest.co.zw.
  - Metrics: API latency, container health, resource usage, job execution success.
- Live System Page
  - Title: Live Infrastructure (Production System)
  - Status Table:

    | Service    | Status    |
    |------------|-----------|
    | Monitoring | Live      |
    | Metrics    | Live      |
    | Storage    | Live      |
    | MLflow     | Deploying |
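
The request path in the GeoCrop System Architecture diagram (Web to API, API to Redis, Redis to Worker, Worker to MinIO and PostGIS, MinIO to TiTiler) can be traced with in-memory stand-ins. This is only a sketch of the data flow: the queue, dictionary and list below replace Redis, MinIO and PostGIS, the STAC fetch and model inference are omitted, and every function name, object key and the example coordinate are hypothetical.

```python
import queue
import uuid

job_queue: "queue.Queue[dict]" = queue.Queue()  # stand-in for the Redis job queue
object_store: dict[str, bytes] = {}             # stand-in for MinIO
job_log: list[dict] = []                        # stand-in for a PostGIS table


def api_submit(aoi: dict) -> str:
    """API gateway: accept an AOI and enqueue a job."""
    job_id = uuid.uuid4().hex
    job_queue.put({"id": job_id, "aoi": aoi})
    return job_id


def worker_step() -> str:
    """Worker: take one job, store a placeholder result, and log it."""
    job = job_queue.get()
    key = f"results/{job['id']}.tif"        # hypothetical object key
    object_store[key] = b"fake-cog-bytes"   # a real worker would write a COG
    job_log.append({"job_id": job["id"], "key": key})
    job_queue.task_done()
    return key


def tiler_read(key: str) -> bytes:
    """Tiler: serve bytes for an object already in storage."""
    return object_store[key]


# Example AOI: an arbitrary point, used only to exercise the flow.
job_id = api_submit({"type": "Point", "coordinates": [31.05, -17.83]})
key = worker_step()
print(key == f"results/{job_id}.tif", len(tiler_read(key)))
```

Each stand-in maps to one node in the diagram, so swapping in the real client for any single component leaves the rest of the flow unchanged.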