# Sovereign MLOps Platform: LULC Crop-Mapping Portfolio

## Overview

This document outlines the execution plan for restructuring the GeoCrop platform into a GitOps-driven, self-hosted MLOps platform on K3s. It replaces the full Supabase stack with a lightweight standalone Postgres + PostGIS container, which conserves RAM while still meeting all spatial querying requirements.

## Phased Execution Strategy

### Phase 1: Infrastructure Setup (The Foundation)

1. **Terraform (Namespaces & Quotas):** Apply Terraform to configure the K3s namespace (`geocrop`) with explicit ResourceQuotas. We will apply 512MB limits to lightweight services (API, Web) but allocate 2GB to the ML Worker and Jupyter instances to prevent OOM errors.
2. **Database (Postgres + PostGIS):** Deploy a standalone StatefulSet for PostGIS on port 5433 (`db.techarvest.co.zw`), fully isolated from other apps.
3. **MLOps Tools (MLflow & Jupyter):**
   - Deploy MLflow (`ml.techarvest.co.zw`) backed by the new PostGIS DB and the existing MinIO artifact store.
   - Deploy a Jupyter Data Science workspace (`lab.techarvest.co.zw`) configured to pull datasets directly from the MinIO `geocrop-datasets` bucket, so the workspace does not rely on node-local storage and can be scheduled on any node.
4. **GitOps Tools (Gitea & ArgoCD):** Initialize Gitea (`git.techarvest.co.zw`) and ArgoCD (`cd.techarvest.co.zw`) to take over cluster management.

### Phase 2: Frontend (React/Vite) Setup & Testing

1. **Zero-Downtime Requirement:** The current live web page at `portfolio.techarvest.co.zw` MUST remain active and untouched during this transition because it is actively receiving traffic from job applications.
2. **Parallel Loading Strategy:** Configure the new React frontend components to fetch and render Dynamic World (DW) baselines (2015-2025) immediately via the TiTiler service (`tiles.portfolio.techarvest.co.zw`) while ML inference is still running.
3. **ArgoCD Deployment:** Commit the new frontend manifests to the Gitea repository and sync via ArgoCD, carefully routing traffic to avoid disrupting the live welcome page.
4. **Verification:** Test that the new frontend components load and render TiTiler COGs immediately, without depending on the backend.

### Phase 3: Backend (API + ML Worker) Setup & CI/CD

1. **Gitea Actions (CI/CD):** Implement `.gitea/workflows/build-push.yaml` to automatically build `apps/worker/Dockerfile` and `apps/api/Dockerfile`, and push them to Docker Hub (`frankchine/geocrop-worker:latest`, etc.).
2. **ArgoCD Deployment:** Update backend Kubernetes manifests in the GitOps repo to pull from `frankchine/...`. Sync ArgoCD.
3. **Worker Tuning:** Ensure the ML worker is correctly configured to use the standalone PostGIS database (if spatial logging is needed) and MinIO for models/results.
4. **Auth & Limits:** Implement simple JWT-based auth in the API. Add logic to track job counts per user, enforcing a **5-job limit for Recruiter accounts** while allowing unlimited jobs for Admin accounts.

### Phase 4: End-to-End System Testing

1. **Trigger Job:** Submit an AOI via the React frontend.
2. **Verify Instant UX:** Ensure the DW baseline renders immediately.
3. **Verify Inference:** Monitor the Redis queue and ML Worker logs to confirm that the worker pulls STAC data, runs the XGBoost/Ensemble model, and writes the output COG to MinIO.
4. **Verify Result Overlay:** Ensure the frontend polls the API and overlays the high-resolution LULC prediction once the job completes.
5. **Verify MLflow:** Check `ml.techarvest.co.zw` to confirm the run metrics were logged successfully.

### Phase 5: Portfolio & Recruiter Experience (New Pages)

Implement the following technical documentation pages within the React frontend to showcase system depth to recruiters:

1. **GeoCrop System Architecture**
   - **Visual:**
     ```mermaid
     graph TD
         subgraph "Data Sources"
             STAC[Digital Earth Africa STAC]
         end
         subgraph "Sovereign Cluster (K3s)"
             API[FastAPI Gateway]
             Redis[(Redis Job Queue)]
             Worker[ML Inference Worker]
             MinIO[(MinIO S3 Storage)]
             Tiler[TiTiler]
             DB[(PostGIS Standalone)]
         end
         subgraph "Frontend"
             Web[React/Vite Portfolio]
         end
         STAC --> Worker
         Web --> API
         API --> Redis
         Redis --> Worker
         Worker --> MinIO
         Worker --> DB
         MinIO --> Tiler
         Tiler --> Web
     ```
   - **Tech Rationale:**
     - *Why MinIO*: local sovereignty + S3 compatibility.
     - *Why Argo*: lightweight orchestration vs Airflow.
     - *Why standalone PostGIS (instead of the full Supabase stack)*: fast Postgres + PostGIS integration for spatial depth, at a lower RAM cost.
2. **Infrastructure Design (K3s Sovereign Cluster)**
   - **Title:** Infrastructure Design (K3s Sovereign Cluster)
   - **Visual:** Cluster design details (single-node K3s on Contabo VPS).
   - **Resource Strategy:** 512MB limits for API/Web; 2GB for Worker/Jupyter.
   - **Key Principle:** “Designed for low-resource environments while maintaining full MLOps capability.”
   - **Terraform Layer:** Namespace isolation (`geocrop`), resource quotas, future SSO integration.
   - **GitOps Layer:** Argo CD reconciles the cluster against the Git repository, which is the single source of truth (`/k8s/base` + `/overlays`). “Everything deployed is version-controlled and reproducible.”
3. **End-to-End MLOps Workflow**
   - **Title:** End-to-End MLOps Workflow
   - **Pipeline Breakdown:**
     1. **Data Ingestion:** Zimbabwe CSV batches stored in MinIO.
     2. **Training:** Triggered via Argo Workflows, executed from `/training/active`.
     3. **Experiment Tracking:** MLflow logs parameters, metrics, and artifacts.
     4. **Deployment:** Model packaged into worker container, deployed via Argo CD.
4. **Engineering Decisions & Trade-offs (CRITICAL)**
   - **Title:** Engineering Decisions & Trade-offs
   - **Argo vs Kubeflow**:
     - *Decision*: Chose Argo Workflows + Argo CD.
     - *Why NOT Kubeflow*: Too resource-heavy for 512MB constraints; complex deployment overhead.
     - *Why Argo*: Lightweight, native K8s integration, easier GitOps alignment.
   - **Gitea vs GitLab**:
     - *Decision*: Chose Gitea.
     - *Why NOT GitLab*: High RAM usage; overkill for single-node cluster.
     - *Why Gitea*: Lightweight, self-hostable in constrained environments, good enough CI/CD via Actions.
   - **MLflow vs Alternatives**: Simple experiment tracking, easy DB backend integration (Postgres), lightweight vs full ML platforms.
   - **MinIO vs Cloud Storage**: Full data sovereignty, S3-compatible, works in offline / low-connectivity environments.
   - **Standalone Postgres + PostGIS (replacing the full Supabase stack)**: Spatial queries (critical for geospatial ML), lower RAM footprint than the full Supabase stack, lightweight vs full GIS stacks.
5. **Observability & Monitoring**
   - **Title:** Observability & System Monitoring
   - **Stack:** Prometheus (Metrics), Grafana (Visualization), Uptime Kuma (SLA monitoring).
   - **Live Endpoints:** `uptime.techarvest.co.zw`, `grafana.techarvest.co.zw`, `prometheus.techarvest.co.zw`.
   - **Metrics:** API latency, container health, resource usage, job execution success.
6. **Live System Page**
   - **Title:** Live Infrastructure (Production System)
   - **Status Table:**

     | Service | Status |
     | :--- | :--- |
     | Monitoring | Live |
     | Metrics | Live |
     | Storage | Live |
     | MLflow | Deploying |
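
For reference, the per-role job limit from Phase 3 ("Auth & Limits") could be sketched roughly as follows. This is a minimal illustration, not the platform's actual implementation: the role strings, the `JOB_LIMITS` mapping, the `JobCounter` class, and the `JobLimitExceeded` exception are all hypothetical names, and the in-memory dictionary stands in for whatever persistent store the API ends up using. Only the 5-job cap for Recruiter accounts and the unlimited Admin allowance come from the plan.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical role names. Only the limits themselves come from the plan:
# 5 jobs for Recruiter accounts, unlimited (None) for Admin accounts.
JOB_LIMITS: Dict[str, Optional[int]] = {"recruiter": 5, "admin": None}


class JobLimitExceeded(Exception):
    """Raised when a user has used up their job allowance."""


@dataclass
class JobCounter:
    """In-memory stand-in for a per-user job count kept in a database."""

    counts: Dict[str, int] = field(default_factory=dict)

    def register_job(self, user_id: str, role: str) -> int:
        """Record one job for user_id and return the new count.

        Raises JobLimitExceeded if the role's limit is already reached,
        and ValueError if the role is not known.
        """
        if role not in JOB_LIMITS:
            raise ValueError(f"unknown role: {role}")
        limit = JOB_LIMITS[role]
        used = self.counts.get(user_id, 0)
        # A limit of None means the role is unlimited.
        if limit is not None and used >= limit:
            raise JobLimitExceeded(f"{user_id} has reached the {limit}-job limit")
        self.counts[user_id] = used + 1
        return used + 1
```

In the API, a check like `register_job` would presumably run before a job is pushed to the Redis queue, with the role taken from the verified JWT claims; a rejected request could then be mapped to an HTTP error response instead of being enqueued.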