Sovereign MLOps Platform: LULC Crop-Mapping Portfolio

Overview

This document outlines the execution plan for restructuring the GeoCrop platform into a GitOps-driven, self-hosted MLOps platform on K3s. It replaces the full Supabase stack with a lightweight Postgres+PostGIS standalone container to conserve RAM while meeting all spatial querying requirements.

Phased Execution Strategy

Phase 1: Infrastructure Setup (The Foundation)

  1. Terraform (Namespaces & Quotas): Use Terraform to create the K3s namespace (geocrop) with an explicit ResourceQuota. Lightweight services (API, Web) get 512MB memory limits, while the ML Worker and Jupyter instances get 2GB each to prevent OOM kills.
  2. Database (Postgres + PostGIS): Deploy a standalone StatefulSet for PostGIS on port 5433 (db.techarvest.co.zw), fully isolated from other apps.
  3. MLOps Tools (MLflow & Jupyter):
    • Deploy MLflow (ml.techarvest.co.zw) backed by the new PostGIS DB and the existing MinIO artifact store.
    • Deploy a Jupyter Data Science workspace (lab.techarvest.co.zw) configured to pull datasets directly from the MinIO geocrop-datasets bucket, ensuring node-agnostic scheduling.
  4. GitOps Tools (Gitea & ArgoCD): Initialize Gitea (git.techarvest.co.zw) and ArgoCD (cd.techarvest.co.zw) to take over cluster management.
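The memory budget from step 1 can be sketched as follows. This is a minimal illustration, not the Terraform configuration itself: it expresses the ResourceQuota as a plain Python dictionary, assumes "512MB" and "2GB" mean 512Mi and 2Gi, and uses hypothetical service names and a hypothetical quota name. The total covers only the four workloads named in step 1, so a real quota would also need headroom for PostGIS, MLflow, and anything else scheduled in the namespace.

```python
# Per-service memory limits from step 1, in MiB.
# Service names are illustrative, not taken from the real manifests.
MEMORY_LIMITS_MIB = {
    "api": 512,
    "web": 512,
    "worker": 2048,
    "jupyter": 2048,
}


def to_k8s_quantity(mib: int) -> str:
    """Format MiB as a Kubernetes quantity string (512 -> '512Mi', 2048 -> '2Gi')."""
    if mib % 1024 == 0:
        return f"{mib // 1024}Gi"
    return f"{mib}Mi"


def resource_quota(namespace: str, limits: dict[str, int]) -> dict:
    """Build a ResourceQuota manifest whose limits.memory equals the summed limits."""
    total_mib = sum(limits.values())
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        # The "<namespace>-quota" name is an assumption made for this sketch.
        "metadata": {"name": f"{namespace}-quota", "namespace": namespace},
        "spec": {"hard": {"limits.memory": to_k8s_quantity(total_mib)}},
    }


if __name__ == "__main__":
    quota = resource_quota("geocrop", MEMORY_LIMITS_MIB)
    print(quota["spec"]["hard"]["limits.memory"])
```

Summing the per-service limits first makes it easy to see whether the namespace budget fits on the single node before anything is applied.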

Phase 2: Frontend (React/Vite) Setup & Testing

  1. Zero-Downtime Requirement: The current live web page at portfolio.techarvest.co.zw MUST remain active and untouched during this transition, because it is receiving traffic from active job applications.
  2. Parallel Loading Strategy: Configure the new React frontend components to instantly fetch and render Dynamic World (DW) baselines (2015-2025) via the TiTiler service (tiles.portfolio.techarvest.co.zw) while awaiting ML inference.
  3. ArgoCD Deployment: Commit the new frontend manifests to the Gitea repository and sync via ArgoCD, carefully routing traffic to avoid disrupting the live welcome page.
  4. Verification: Test that the new frontend components load and render TiTiler-served COG tiles immediately, without depending on the API or ML Worker.
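For the verification step, a small script could build the tile URLs the frontend is expected to request. The sketch below assumes the `/cog/tiles/WebMercatorQuad/{z}/{x}/{y}.png` path layout used by recent TiTiler releases (the exact path depends on the deployed version), and the COG location passed in is hypothetical since the plan does not say where the DW baselines are stored.

```python
from urllib.parse import urlencode

# Tile service host from the plan; the path layout below is an assumption.
TILER_BASE = "https://tiles.portfolio.techarvest.co.zw"


def dw_tile_url(cog_url: str, z: int, x: int, y: int) -> str:
    """Build an XYZ tile URL for a COG, passing its location as the `url` query parameter."""
    query = urlencode({"url": cog_url})
    return f"{TILER_BASE}/cog/tiles/WebMercatorQuad/{z}/{x}/{y}.png?{query}"


def baseline_years(start: int = 2015, end: int = 2025) -> list[int]:
    """Years covered by the Dynamic World baselines (inclusive range from the plan)."""
    return list(range(start, end + 1))


if __name__ == "__main__":
    # "s3://example-bucket/dw_<year>.tif" is a placeholder, not a real object key.
    for year in baseline_years():
        print(dw_tile_url(f"s3://example-bucket/dw_{year}.tif", 8, 150, 140))
```

Requesting one such URL per year and checking for an image response would confirm that the baselines render with the API and worker switched off.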

Phase 3: Backend (API + ML Worker) Setup & CI/CD

  1. Gitea Actions (CI/CD): Implement .gitea/workflows/build-push.yaml to automatically build apps/worker/Dockerfile and apps/api/Dockerfile, and push them to Docker Hub (frankchine/geocrop-worker:latest, etc.).
  2. ArgoCD Deployment: Update backend Kubernetes manifests in the GitOps repo to pull from frankchine/.... Sync ArgoCD.
  3. Worker Tuning: Ensure the ML worker is correctly configured to use the standalone PostGIS database (if spatial logging is needed) and MinIO for models/results.
  4. Auth & Limits: Implement simple JWT-based auth in the API. Add logic to track job counts per user, enforcing a 5-job limit for Recruiter accounts while allowing unlimited jobs for Admin accounts.
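The per-user job limit in step 4 could look roughly like the sketch below. It keeps counts in memory purely for illustration; the class and exception names are hypothetical, and a real API would more likely persist counts in the PostGIS database and take the role from the verified JWT claims.

```python
from collections import Counter
from enum import Enum
from typing import Optional


class Role(Enum):
    RECRUITER = "recruiter"
    ADMIN = "admin"


# Maximum jobs per user for each role; None means unlimited.
JOB_LIMITS: dict[Role, Optional[int]] = {Role.RECRUITER: 5, Role.ADMIN: None}


class JobLimitExceeded(Exception):
    """Raised when a user has already used their full job allowance."""


class JobTracker:
    """In-memory per-user job counter (illustrative only)."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def register_job(self, user_id: str, role: Role) -> int:
        """Record one submitted job and return the user's new total."""
        limit = JOB_LIMITS[role]
        if limit is not None and self._counts[user_id] >= limit:
            raise JobLimitExceeded(f"{user_id} reached the {limit}-job limit")
        self._counts[user_id] += 1
        return self._counts[user_id]
```

The check runs before the counter is incremented, so a rejected submission never consumes part of the allowance.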

Phase 4: End-to-End System Testing

  1. Trigger Job: Submit an AOI via the React frontend.
  2. Verify Instant UX: Ensure the DW baseline renders immediately.
  3. Verify Inference: Monitor the Redis queue and ML Worker logs to confirm that the worker pulls STAC data, runs the XGBoost/Ensemble model, and writes the output COG to MinIO.
  4. Verify Result Overlay: Ensure the frontend polls the API and seamlessly overlays the high-resolution LULC prediction once complete.
  5. Verify MLflow: Check ml.techarvest.co.zw to confirm the run metrics were logged successfully.
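The polling behaviour checked in step 4 can be exercised outside the browser with a loop like the one below. This is a sketch under stated assumptions: the status payload shape (`state`, `result_url`, `error`) and the state names are hypothetical, and `get_status` stands in for whatever call fetches the job status from the API.

```python
import time
from typing import Callable


def wait_for_result(
    get_status: Callable[[], dict],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll a job until it finishes and return the location of the result COG.

    Raises RuntimeError if the job reports failure and TimeoutError if it
    has not finished once timeout_s has elapsed.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        status = get_status()
        if status["state"] == "done":
            return status["result_url"]
        if status["state"] == "failed":
            raise RuntimeError(status.get("error", "job failed"))
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish before the timeout")
        sleep(interval_s)
```

Passing `get_status` and `sleep` as parameters lets the same loop run against the live API or against canned responses in a test, so the overlay logic can be checked without waiting on real inference.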

Phase 5: Portfolio & Recruiter Experience (New Pages)

Implement the following technical documentation pages within the React frontend to showcase system depth to recruiters:

  1. GeoCrop System Architecture

    • Visual:
      graph TD
          subgraph "Data Sources"
              STAC[Digital Earth Africa STAC]
          end
      
          subgraph "Sovereign Cluster (K3s)"
              API[FastAPI Gateway]
              Redis[(Redis Job Queue)]
              Worker[ML Inference Worker]
              MinIO[(MinIO S3 Storage)]
              Tiler[TiTiler]
              DB[(PostGIS Standalone)]
          end
      
          subgraph "Frontend"
              Web[React/Vite Portfolio]
          end
      
          STAC --> Worker
          Web --> API
          API --> Redis
          Redis --> Worker
          Worker --> MinIO
          Worker --> DB
          MinIO --> Tiler
          Tiler --> Web
      
    • Tech Rationale:
      • Why MinIO: local sovereignty + S3 compatibility.
      • Why Argo: lightweight orchestration vs Airflow.
      • Why standalone Postgres + PostGIS: spatial query support without the RAM overhead of the full Supabase stack.
  2. Infrastructure Design (K3s Sovereign Cluster)

    • Title: Infrastructure Design (K3s Sovereign Cluster)
    • Visual: Cluster design details (Single-node K3s on Contabo VPS).
    • Resource Strategy: 512MB limits for API/Web; 2GB for Worker/Jupyter.
    • Key Principle: “Designed for low-resource environments while maintaining full MLOps capability.”
    • Terraform Layer: Namespace isolation (geocrop), Resource quotas, Future SSO integration.
    • GitOps Layer: Argo CD as single source of truth (/k8s/base + /overlays). “Everything deployed is version-controlled and reproducible.”
  3. End-to-End MLOps Workflow

    • Title: End-to-End MLOps Workflow
    • Pipeline Breakdown:
      1. Data Ingestion: Zimbabwe CSV batches stored in MinIO.
      2. Training: Triggered via Argo Workflows, executed from /training/active.
      3. Experiment Tracking: MLflow logs parameters, metrics, and artifacts.
      4. Deployment: Model packaged into worker container, deployed via Argo CD.
  4. Engineering Decisions & Trade-offs (CRITICAL)

    • Title: Engineering Decisions & Trade-offs
    • Argo vs Kubeflow:
      • Decision: Chose Argo Workflows + Argo CD.
      • Why NOT Kubeflow: Too resource-heavy for 512MB constraints; complex deployment overhead.
      • Why Argo: Lightweight, native K8s integration, easier GitOps alignment.
    • Gitea vs GitLab:
      • Decision: Chose Gitea.
      • Why NOT GitLab: High RAM usage; overkill for single-node cluster.
      • Why Gitea: Lightweight, self-hostable in constrained environments, good enough CI/CD via Actions.
    • MLflow vs Alternatives: Simple experiment tracking, easy DB backend integration (Postgres), lightweight vs full ML platforms.
    • MinIO vs Cloud Storage: Full data sovereignty, S3-compatible, works in offline or low-connectivity environments.
    • Postgres + PostGIS (standalone, in place of the full Supabase stack): Spatial queries (critical for geospatial ML), lower RAM footprint, lightweight vs full GIS stacks.
  5. Observability & Monitoring

    • Title: Observability & System Monitoring
    • Stack: Prometheus (Metrics), Grafana (Visualization), Uptime Kuma (SLA monitoring).
    • Live Endpoints: uptime.techarvest.co.zw, grafana.techarvest.co.zw, prometheus.techarvest.co.zw.
    • Metrics: API latency, container health, resource usage, job execution success.
  6. Live System Page

    • Title: Live Infrastructure (Production System)
    • Status Table:
      Service     Status
      Monitoring  Live
      Metrics     Live
      Storage     Live
      MLflow      Deploying