# Sovereign MLOps Platform: LULC Crop-Mapping Portfolio
## Overview
This document outlines the execution plan for restructuring the GeoCrop platform into a GitOps-driven, self-hosted MLOps platform on K3s. It replaces the full Supabase stack with a lightweight Postgres+PostGIS standalone container to conserve RAM while meeting all spatial querying requirements.
## Phased Execution Strategy
### Phase 1: Infrastructure Setup (The Foundation)
1. **Terraform (Namespaces & Quotas):** Apply Terraform to configure the K3s namespace (`geocrop`) with explicit ResourceQuotas. Lightweight services (API, Web) get 512MB memory limits, while the ML Worker and Jupyter instances get 2GB to prevent OOM kills.
2. **Database (Postgres + PostGIS):** Deploy a standalone StatefulSet for PostGIS on port 5433 (`db.techarvest.co.zw`), fully isolated from other apps.
3. **MLOps Tools (MLflow & Jupyter):**
   - Deploy MLflow (`ml.techarvest.co.zw`) backed by the new PostGIS DB and the existing MinIO artifact store.
   - Deploy a Jupyter Data Science workspace (`lab.techarvest.co.zw`) configured to pull datasets directly from the MinIO `geocrop-datasets` bucket, so the workspace does not depend on node-local storage and can be scheduled on any node.
4. **GitOps Tools (Gitea & ArgoCD):** Initialize Gitea (`git.techarvest.co.zw`) and ArgoCD (`cd.techarvest.co.zw`) to take over cluster management.
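
The memory tiers from step 1 can be kept as data and summed to size the namespace quota. The following is a minimal sketch, not the project's Terraform: the helper names are hypothetical, and it assumes "512MB" and "2GB" map to the Kubernetes quantities `512Mi` and `2Gi`, with one container per service.

```python
# Memory tiers from the plan. Mapping "512MB"/"2GB" to "512Mi"/"2Gi" and
# giving each listed service its own limit are assumptions of this sketch.
MEMORY_LIMITS = {
    "api": "512Mi",
    "web": "512Mi",
    "worker": "2Gi",
    "jupyter": "2Gi",
}


def container_resources(service: str) -> dict:
    """Return a Kubernetes-style `resources` stanza for one service."""
    return {"limits": {"memory": MEMORY_LIMITS[service]}}


def to_mebibytes(quantity: str) -> int:
    """Convert a quantity such as '512Mi' or '2Gi' to MiB (only these two suffixes)."""
    if quantity.endswith("Gi"):
        return int(quantity[:-2]) * 1024
    if quantity.endswith("Mi"):
        return int(quantity[:-2])
    raise ValueError(f"unsupported quantity: {quantity}")


def total_memory_mib(limits: dict) -> int:
    """Sum all per-service limits; a lower bound for the namespace memory quota."""
    return sum(to_mebibytes(q) for q in limits.values())


if __name__ == "__main__":
    print(container_resources("worker"))
    print(total_memory_mib(MEMORY_LIMITS), "MiB")
```

The total is only a lower bound for the `geocrop` quota, since PostGIS, MLflow, and the other services in this phase also need headroom.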
### Phase 2: Frontend (React/Vite) Setup & Testing
1. **Zero-Downtime Requirement:** The current live web page at `portfolio.techarvest.co.zw` MUST remain active and untouched during this transition as it is actively receiving traffic from job applications.
2. **Parallel Loading Strategy:** Configure the new React frontend components to instantly fetch and render Dynamic World (DW) baselines (2015-2025) via the TiTiler service (`tiles.portfolio.techarvest.co.zw`) while awaiting ML inference.
3. **ArgoCD Deployment:** Commit the new frontend manifests to the Gitea repository and sync via ArgoCD, carefully routing traffic to avoid disrupting the live welcome page.
4. **Verification:** Test that the new frontend components successfully load and render TiTiler COGs instantly without backend dependency.
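
For the verification step, it can help to generate the tile URL templates outside the frontend and request a few tiles by hand. This sketch assumes the TiTiler deployment exposes the `/cog/tiles/{tileMatrixSetId}/{z}/{x}/{y}` route (the path differs between TiTiler versions) and uses a hypothetical bucket and file naming scheme for the DW baselines.

```python
from urllib.parse import urlencode

TILER_BASE = "https://tiles.portfolio.techarvest.co.zw"


def tile_url_template(cog_url: str, tile_matrix_set: str = "WebMercatorQuad") -> str:
    """Build an XYZ tile URL template for one COG.

    The route layout is an assumption; check it against the deployed
    TiTiler version before relying on it.
    """
    query = urlencode({"url": cog_url})
    return f"{TILER_BASE}/cog/tiles/{tile_matrix_set}/{{z}}/{{x}}/{{y}}.png?{query}"


def baseline_templates(years, bucket: str = "example-bucket") -> dict:
    """Map each year to a tile template. Bucket and file names are hypothetical."""
    return {year: tile_url_template(f"s3://{bucket}/dw_{year}.tif") for year in years}


if __name__ == "__main__":
    for year, template in baseline_templates(range(2015, 2026)).items():
        print(year, template)
```

The `{z}/{x}/{y}` placeholders are left in the output so the same string can be handed to a map library as a tile layer source.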
### Phase 3: Backend (API + ML Worker) Setup & CI/CD
1. **Gitea Actions (CI/CD):** Implement `.gitea/workflows/build-push.yaml` to automatically build `apps/worker/Dockerfile` and `apps/api/Dockerfile`, and push them to Docker Hub (`frankchine/geocrop-worker:latest`, etc.).
2. **ArgoCD Deployment:** Update backend Kubernetes manifests in the GitOps repo to pull from `frankchine/...`. Sync ArgoCD.
3. **Worker Tuning:** Ensure the ML worker is correctly configured to use the standalone PostGIS database (if spatial logging is needed) and MinIO for models/results.
4. **Auth & Limits:** Implement simple JWT-based auth in the API. Track job counts per user, enforcing a **5-job limit for Recruiter accounts** while leaving Admin accounts unlimited.
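
The per-user job limit in step 4 could look roughly like the sketch below. It is illustrative only: the `User` class, role strings, and exception name are hypothetical, the counter lives in memory rather than in the database, and JWT validation is assumed to have already produced the user's role. Treating every non-admin role as limited is also an assumption.

```python
from dataclasses import dataclass

RECRUITER_JOB_LIMIT = 5  # limit stated in the plan


class JobLimitExceeded(Exception):
    """Raised when a limited account tries to submit more jobs than allowed."""


@dataclass
class User:
    username: str
    role: str  # hypothetical role strings: "recruiter" or "admin"
    jobs_submitted: int = 0


def register_job(user: User) -> int:
    """Count a new job for the user, or raise if the limit is reached.

    Admins are unlimited; any other role is capped at RECRUITER_JOB_LIMIT.
    """
    if user.role != "admin" and user.jobs_submitted >= RECRUITER_JOB_LIMIT:
        raise JobLimitExceeded(
            f"{user.username} has reached the {RECRUITER_JOB_LIMIT}-job limit"
        )
    user.jobs_submitted += 1
    return user.jobs_submitted


if __name__ == "__main__":
    recruiter = User("demo", "recruiter")
    for _ in range(RECRUITER_JOB_LIMIT):
        register_job(recruiter)
    try:
        register_job(recruiter)
    except JobLimitExceeded as exc:
        print("rejected:", exc)
```

In the API, the check would run before the job is pushed to the queue, so a rejected request never consumes worker time.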
### Phase 4: End-to-End System Testing
1. **Trigger Job:** Submit an AOI via the React frontend.
2. **Verify Instant UX:** Ensure the DW baseline renders immediately.
3. **Verify Inference:** Monitor the Redis queue and ML Worker logs to confirm the worker pulls STAC data, runs the XGBoost/Ensemble model, and writes the output COG to MinIO.
4. **Verify Result Overlay:** Ensure the frontend polls the API and seamlessly overlays the high-resolution LULC prediction once complete.
5. **Verify MLflow:** Check `ml.techarvest.co.zw` to confirm the run metrics were logged successfully.
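
The polling behaviour checked in step 4 can also be exercised from a test script. The sketch below is a generic polling loop with a timeout; the status callback, the `state` field, and its values (`queued`, `running`, `done`, `failed`) are hypothetical and would need to match whatever the API actually returns.

```python
import time
from typing import Callable


def wait_for_result(
    get_status: Callable[[str], dict],
    job_id: str,
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll `get_status(job_id)` until the job finishes or the timeout passes.

    `get_status` stands in for a call to the job-status endpoint; it is
    injected so the loop can be tested without a running API.
    """
    waited = 0.0
    while True:
        status = get_status(job_id)
        if status["state"] in ("done", "failed"):
            return status
        if waited >= timeout_s:
            raise TimeoutError(f"job {job_id} still {status['state']} after {waited}s")
        sleep(interval_s)
        waited += interval_s


if __name__ == "__main__":
    # Fake status sequence standing in for real API responses.
    states = iter([{"state": "queued"}, {"state": "running"}, {"state": "done"}])
    print(wait_for_result(lambda _job_id: next(states), "job-1", sleep=lambda _s: None))
```

Injecting `sleep` keeps the test run instant while leaving the production default at a real delay.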
### Phase 5: Portfolio & Recruiter Experience (New Pages)
Implement the following technical documentation pages within the React frontend to showcase system depth to recruiters:
1. **GeoCrop System Architecture**
- **Visual:**
```mermaid
graph TD
subgraph "Data Sources"
STAC[Digital Earth Africa STAC]
end
subgraph "Sovereign Cluster (K3s)"
API[FastAPI Gateway]
Redis[(Redis Job Queue)]
Worker[ML Inference Worker]
MinIO[(MinIO S3 Storage)]
Tiler[TiTiler]
DB[(PostGIS Standalone)]
end
subgraph "Frontend"
Web[React/Vite Portfolio]
end
STAC --> Worker
Web --> API
API --> Redis
Redis --> Worker
Worker --> MinIO
Worker --> DB
MinIO --> Tiler
Tiler --> Web
```
- **Tech Rationale:**
- *Why MinIO*: local sovereignty + S3 compatibility.
- *Why Argo*: lightweight orchestration vs Airflow.
- *Why standalone Postgres + PostGIS*: spatial query support without the RAM overhead of the full Supabase stack.
2. **Infrastructure Design (K3s Sovereign Cluster)**
- **Title:** Infrastructure Design (K3s Sovereign Cluster)
- **Visual:** Cluster design details (Single-node K3s on Contabo VPS).
- **Resource Strategy:** 512MB limits for API/Web; 2GB for Worker/Jupyter.
- **Key Principle:** “Designed for low-resource environments while maintaining full MLOps capability.”
- **Terraform Layer:** Namespace isolation (geocrop), Resource quotas, Future SSO integration.
- **GitOps Layer:** Git as the single source of truth, synced to the cluster by Argo CD (/k8s/base + /overlays). “Everything deployed is version-controlled and reproducible.”
3. **End-to-End MLOps Workflow**
- **Title:** End-to-End MLOps Workflow
- **Pipeline Breakdown:**
    1. **Data Ingestion:** Zimbabwe CSV batches stored in MinIO.
    2. **Training:** Triggered via Argo Workflows, executed from `/training/active`.
    3. **Experiment Tracking:** MLflow logs parameters, metrics, and artifacts.
    4. **Deployment:** Model packaged into worker container, deployed via Argo CD.
4. **Engineering Decisions & Trade-offs (CRITICAL)**
- **Title:** Engineering Decisions & Trade-offs
- **Argo vs Kubeflow**:
- *Decision*: Chose Argo Workflows + Argo CD.
- *Why NOT Kubeflow*: Too resource-heavy for 512MB constraints; complex deployment overhead.
- *Why Argo*: Lightweight, native K8s integration, easier GitOps alignment.
- **Gitea vs GitLab**:
- *Decision*: Chose Gitea.
- *Why NOT GitLab*: High RAM usage; overkill for single-node cluster.
- *Why Gitea*: Lightweight, self-hostable in constrained environments, good enough CI/CD via Actions.
- **MLflow vs Alternatives**: Simple experiment tracking, easy DB backend integration (Postgres), lightweight vs full ML platforms.
- **MinIO vs Cloud Storage**: Full data sovereignty, S3-compatible, works offline / low-connectivity environments.
- **Standalone Postgres + PostGIS vs Supabase**: Spatial queries (critical for geospatial ML) without the RAM cost of the full Supabase stack; lightweight vs full GIS stacks.
5. **Observability & Monitoring**
- **Title:** Observability & System Monitoring
- **Stack:** Prometheus (Metrics), Grafana (Visualization), Uptime Kuma (SLA monitoring).
- **Live Endpoints:** uptime.techarvest.co.zw, grafana.techarvest.co.zw, prometheus.techarvest.co.zw.
- **Metrics:** API latency, container health, resource usage, job execution success.
6. **Live System Page**
- **Title:** Live Infrastructure (Production System)
- **Status Table:**
| Service | Status |
| :--- | :--- |
| Monitoring | Live |
| Metrics | Live |
| Storage | Live |
| MLflow | Deploying |