# Sovereign MLOps Platform: LULC Crop-Mapping Portfolio

## Overview
This document outlines the execution plan for restructuring GeoCrop into a GitOps-driven, self-hosted MLOps platform on K3s. It replaces the full Supabase stack with a lightweight standalone Postgres + PostGIS container to conserve RAM while still meeting all spatial querying requirements.

## Phased Execution Strategy

### Phase 1: Infrastructure Setup (The Foundation)
1. **Terraform (Namespaces & Quotas):** Apply Terraform to configure the K3s namespace (`geocrop`) with explicit ResourceQuotas. Lightweight services (API, Web) get 512 MB memory limits, while the ML Worker and Jupyter instances get 2 GB each to prevent OOM errors.
2. **Database (Postgres + PostGIS):** Deploy a standalone StatefulSet for PostGIS on port 5433 (`db.techarvest.co.zw`), fully isolated from other apps.
3. **MLOps Tools (MLflow & Jupyter):**
   - Deploy MLflow (`ml.techarvest.co.zw`) backed by the new PostGIS DB and the existing MinIO artifact store.
   - Deploy a Jupyter Data Science workspace (`lab.techarvest.co.zw`) configured to pull datasets directly from the MinIO `geocrop-datasets` bucket, so the workspace can be scheduled on any node.
4. **GitOps Tools (Gitea & ArgoCD):** Initialize Gitea (`git.techarvest.co.zw`) and ArgoCD (`cd.techarvest.co.zw`) to take over cluster management.

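
The memory tiers from step 1 can be written down as data. The plan applies them through Terraform; the Python below is only an illustrative sketch of the same numbers. The service keys, the function names, the choice to treat 512 MB as `512Mi`, and the rule that a request is half the limit are all assumptions, not part of the plan.

```python
# Memory limits per service tier, in MiB.
# Service keys are hypothetical; the plan names API, Web, ML Worker and Jupyter.
MEMORY_LIMITS_MIB = {
    "api": 512,
    "web": 512,
    "worker": 2048,
    "jupyter": 2048,
}


def container_resources(service: str) -> dict:
    """Return a Kubernetes-style `resources` stanza for one container."""
    limit = MEMORY_LIMITS_MIB[service]
    return {
        # Assumption: request half of the limit so the scheduler can pack pods.
        "requests": {"memory": f"{limit // 2}Mi"},
        "limits": {"memory": f"{limit}Mi"},
    }


def namespace_memory_budget(replicas: dict) -> int:
    """Sum memory limits (MiB) over all replicas, e.g. to size a ResourceQuota."""
    return sum(MEMORY_LIMITS_MIB[name] * count for name, count in replicas.items())
```

With one replica of each service, `namespace_memory_budget` gives the minimum `limits.memory` a namespace quota would need to admit all four pods at once.
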
### Phase 2: Frontend (React/Vite) Setup & Testing
1. **Zero-Downtime Requirement:** The current live web page at `portfolio.techarvest.co.zw` MUST remain active and untouched during this transition, because it is actively receiving traffic from job applications.
2. **Parallel Loading Strategy:** Configure the new React frontend components to fetch and render Dynamic World (DW) baselines (2015-2025) immediately via the TiTiler service (`tiles.portfolio.techarvest.co.zw`) while awaiting ML inference.
3. **ArgoCD Deployment:** Commit the new frontend manifests to the Gitea repository and sync via ArgoCD, routing traffic carefully to avoid disrupting the live welcome page.
4. **Verification:** Test that the new frontend components load and render TiTiler COGs immediately, with no dependency on the backend.

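
The parallel loading strategy in step 2 is a concurrency pattern: start the fast baseline fetch and the slow inference wait together, and render each as it arrives. The real frontend is React; the Python sketch below only illustrates the ordering, with `asyncio.sleep` standing in for the TiTiler request and the inference job. All function names and delays are hypothetical.

```python
import asyncio


async def load_baseline(events: list) -> str:
    """Stand-in for fetching a Dynamic World baseline layer from TiTiler."""
    await asyncio.sleep(0.01)  # simulated fast tile fetch
    events.append("baseline rendered")
    return "dw-baseline"


async def await_inference(events: list) -> str:
    """Stand-in for waiting on the ML inference result."""
    await asyncio.sleep(0.05)  # simulated slow inference
    events.append("prediction overlaid")
    return "lulc-prediction"


async def load_map() -> list:
    """Run both loads concurrently and return the order in which they finished."""
    events: list = []
    # Both coroutines start together; the baseline never waits on the prediction.
    await asyncio.gather(load_baseline(events), await_inference(events))
    return events


def run_demo() -> list:
    return asyncio.run(load_map())
```

Because neither task blocks the other, the baseline appears as soon as its own fetch completes, which is the behaviour the verification step checks for.
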
### Phase 3: Backend (API + ML Worker) Setup & CI/CD
1. **Gitea Actions (CI/CD):** Implement `.gitea/workflows/build-push.yaml` to automatically build `apps/worker/Dockerfile` and `apps/api/Dockerfile`, and push them to Docker Hub (`frankchine/geocrop-worker:latest`, etc.).
2. **ArgoCD Deployment:** Update backend Kubernetes manifests in the GitOps repo to pull from `frankchine/...`. Sync ArgoCD.
3. **Worker Tuning:** Ensure the ML worker is correctly configured to use the standalone PostGIS database (if spatial logging is needed) and MinIO for models/results.
4. **Auth & Limits:** Implement simple JWT-based auth in the API. Add logic to track job counts per user, enforcing a **5-job limit for Recruiter accounts** while allowing unlimited jobs for Admin accounts.

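
The per-user job limit in step 4 could look roughly like the sketch below. It keeps counts in memory for illustration only; a real API would presumably persist them in the database and read the role from the verified JWT. The role strings, class name, and exception name are hypothetical.

```python
from collections import Counter

RECRUITER_JOB_LIMIT = 5  # from the plan; Admin accounts are unlimited


class JobLimitExceeded(Exception):
    """Raised when a Recruiter account has used its job allowance."""


class JobQuota:
    """Tracks submitted jobs per user (in memory, for illustration only)."""

    def __init__(self) -> None:
        self._counts = Counter()

    def register_job(self, user_id: str, role: str) -> int:
        """Record one submission and return the user's new job count."""
        # Assumption: roles arrive as the strings "recruiter" and "admin".
        if role == "recruiter" and self._counts[user_id] >= RECRUITER_JOB_LIMIT:
            raise JobLimitExceeded(
                f"{user_id} reached the {RECRUITER_JOB_LIMIT}-job limit"
            )
        self._counts[user_id] += 1
        return self._counts[user_id]
```

Checking the limit before incrementing means a rejected submission does not consume a slot, so the count always equals the number of accepted jobs.
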
### Phase 4: End-to-End System Testing
1. **Trigger Job:** Submit an AOI via the React frontend.
2. **Verify Instant UX:** Ensure the DW baseline renders immediately.
3. **Verify Inference:** Monitor the Redis queue and ML Worker logs to confirm that the worker pulls STAC data, runs the XGBoost/Ensemble model, and writes the output COG to MinIO.
4. **Verify Result Overlay:** Ensure the frontend polls the API and overlays the high-resolution LULC prediction once the job completes.
5. **Verify MLflow:** Check `ml.techarvest.co.zw` to confirm the run metrics were logged successfully.

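
The polling behaviour checked in step 4 can be sketched as a loop with an interval and a timeout. This is a minimal illustration, not the frontend's actual code: the status strings (`"done"`, `"failed"`), the default interval and timeout, and the function name are assumptions, and `fetch_status` stands in for whatever API call returns the job state.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], str],
    interval_s: float = 2.0,
    timeout_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status until it reports a terminal state or the timeout passes."""
    deadline = time.monotonic() + timeout_s
    while True:
        status = fetch_status()
        if status in ("done", "failed"):  # hypothetical terminal states
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError("job did not finish in time")
        sleep(interval_s)  # injectable so tests do not actually wait
```

For example, with a stub that yields `"queued"`, `"running"`, then `"done"`, the function returns `"done"` after three calls; the caller would then request the result COG and add the overlay.
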
### Phase 5: Portfolio & Recruiter Experience (New Pages)
Implement the following technical documentation pages within the React frontend to showcase system depth to recruiters:

1. **GeoCrop System Architecture**
   - **Visual:**
     ```mermaid
     graph TD
         subgraph "Data Sources"
             STAC[Digital Earth Africa STAC]
         end

         subgraph "Sovereign Cluster (K3s)"
             API[FastAPI Gateway]
             Redis[(Redis Job Queue)]
             Worker[ML Inference Worker]
             MinIO[(MinIO S3 Storage)]
             Tiler[TiTiler]
             DB[(PostGIS Standalone)]
         end

         subgraph "Frontend"
             Web[React/Vite Portfolio]
         end

         STAC --> Worker
         Web --> API
         API --> Redis
         Redis --> Worker
         Worker --> MinIO
         Worker --> DB
         MinIO --> Tiler
         Tiler --> Web
     ```
   - **Tech Rationale:**
     - *Why MinIO*: local sovereignty + S3 compatibility.
     - *Why Argo*: lightweight orchestration vs Airflow.
     - *Why standalone PostGIS (replacing Supabase)*: fast Postgres + PostGIS integration for spatial depth, with a smaller RAM footprint than the full Supabase stack.

2. **Infrastructure Design (K3s Sovereign Cluster)**
   - **Title:** Infrastructure Design (K3s Sovereign Cluster)
   - **Visual:** Cluster design details (single-node K3s on a Contabo VPS).
   - **Resource Strategy:** 512 MB limits for API/Web; 2 GB for Worker/Jupyter.
   - **Key Principle:** “Designed for low-resource environments while maintaining full MLOps capability.”
   - **Terraform Layer:** Namespace isolation (`geocrop`), resource quotas, future SSO integration.
   - **GitOps Layer:** Argo CD as single source of truth (`/k8s/base` + `/overlays`). “Everything deployed is version-controlled and reproducible.”

3. **End-to-End MLOps Workflow**
   - **Title:** End-to-End MLOps Workflow
   - **Pipeline Breakdown:**
     1. **Data Ingestion:** Zimbabwe CSV batches stored in MinIO.
     2. **Training:** Triggered via Argo Workflows, executed from `/training/active`.
     3. **Experiment Tracking:** MLflow logs parameters, metrics, and artifacts.
     4. **Deployment:** Model packaged into worker container, deployed via Argo CD.

4. **Engineering Decisions & Trade-offs (CRITICAL)**
   - **Title:** Engineering Decisions & Trade-offs
   - **Argo vs Kubeflow**:
     - *Decision*: Chose Argo Workflows + Argo CD.
     - *Why NOT Kubeflow*: Too resource-heavy for the 512 MB service limits; complex deployment overhead.
     - *Why Argo*: Lightweight, native K8s integration, easier GitOps alignment.
   - **Gitea vs GitLab**:
     - *Decision*: Chose Gitea.
     - *Why NOT GitLab*: High RAM usage; overkill for a single-node cluster.
     - *Why Gitea*: Lightweight, self-hostable in constrained environments, adequate CI/CD via Gitea Actions.
   - **MLflow vs Alternatives**: Simple experiment tracking, easy DB backend integration (Postgres), lightweight vs full ML platforms.
   - **MinIO vs Cloud Storage**: Full data sovereignty, S3-compatible, works in offline or low-connectivity environments.
   - **Standalone Postgres + PostGIS (replacing Supabase)**: Spatial queries (critical for geospatial ML), lower RAM usage than the full Supabase stack, lightweight vs full GIS stacks.

5. **Observability & Monitoring**
   - **Title:** Observability & System Monitoring
   - **Stack:** Prometheus (Metrics), Grafana (Visualization), Uptime Kuma (SLA monitoring).
   - **Live Endpoints:** `uptime.techarvest.co.zw`, `grafana.techarvest.co.zw`, `prometheus.techarvest.co.zw`.
   - **Metrics:** API latency, container health, resource usage, job execution success.

6. **Live System Page**
   - **Title:** Live Infrastructure (Production System)
   - **Status Table:**

     | Service | Status |
     | :--- | :--- |
     | Monitoring | Live |
     | Metrics | Live |
     | Storage | Live |
     | MLflow | Deploying |