
# GeoCrop Portfolio App — End-State Checklist, Architecture, and Next Steps
*Last updated: 27 Feb 2026 (Africa/Harare)*
This document captures:
* What's **already built and verified** in your K3s cluster
* The **full end-state feature checklist** (public + admin)
* The **target architecture** and data flow
* The **next steps** (what to build next, in the order that won't get you stuck)
* Notes to make this **agent-friendly** (Roo / Minimax execution)
---
## 0) Current progress — what you have done so far (verified)
### 0.1 Cluster + networking
* **K3s cluster running** (1 control-plane + 2 workers)
* **NGINX Ingress Controller installed and running**
* Ingress controller exposed on worker `vmi3045103` public IP `167.86.68.48`
* **cert-manager installed**
* **Let's Encrypt prod ClusterIssuer created** (`letsencrypt-prod`) and is Ready=True
### 0.2 DNS
A records pointing to `167.86.68.48`:
* `portfolio.techarvest.co.zw`
* `api.portfolio.techarvest.co.zw`
* `minio.portfolio.techarvest.co.zw`
* `console.minio.portfolio.techarvest.co.zw`
### 0.3 Namespace + core services (geocrop)
Namespace:
* `geocrop`
Running components:
* **Redis** (queue/broker)
* **MinIO** (S3 storage) with PVC (30Gi, local-path)
* Placeholder web + API behind Ingress
* TLS certificates for all subdomains (Ready=True)
### 0.4 Connectivity tests (verified)
* `portfolio.techarvest.co.zw` reachable over HTTPS
* `api.portfolio.techarvest.co.zw` reachable over HTTPS
* `console.minio.portfolio.techarvest.co.zw` loads correctly
### 0.5 What you added recently (major progress)
* Uploaded ML model artifact to **MinIO** (geocrop-models bucket)
* Implemented working **FastAPI backend** with JWT authentication
* Implemented **Python RQ worker** consuming Redis queue
* Verified end-to-end async job submission + dummy inference response
### 0.6 Dynamic World Baseline Migration (Completed)
* Configured **rclone** with Google Drive remote (`gdrive`)
* Successfully copied ~7.9 GiB of Dynamic World seasonal GeoTIFFs (132 files) from Google Drive to server path:
* `~/geocrop/data/dw_baselines`
* Installed `rio-cogeo`, `rasterio`, `pyproj`, and dependencies
* Converted all baseline GeoTIFFs to **Cloud Optimized GeoTIFFs (COGs)**:
* Output directory: `~/geocrop/data/dw_cogs`
> This is a major milestone: your Dynamic World baselines are now local and converted to COG format, which is required for efficient tiling and MinIO-based serving.
> Note: Your earlier `10-redis.yaml` and `20-minio.yaml` edits had some terminal echo corruption, but the K8s objects did apply and are running. We'll clean the manifests into a proper repo layout next.
---
## 1) End-state: what the app should have (complete checklist)
### 1.1 Public user experience
**Auth & access**
* Login for public users (best for portfolio: **invite-only registration** or “request access”)
* JWT auth (already planned)
* Clear “demo limits” messaging
**AOI selection**
* Leaflet map:
* Place a marker OR draw a circle (center + radius)
* Radius slider up to **5 km**
* Optional polygon draw (but enforce max area / vertex count)
* Manual input:
* Latitude/Longitude center
* Radius (meters / km)
**Parameters**
* Year chooser: **2015 → present**
* Season chooser:
* Summer cropping only (Nov 1 → Apr 30) for now
* Model chooser:
* RandomForest / XGBoost / LightGBM / CatBoost / Ensemble
**Job lifecycle UI**
* Submit job
* Loading/progress screen with stages:
* Queued → Downloading imagery → Computing indices → Running model → Smoothing → Exporting GeoTIFF → Uploading → Done
* Results page:
* Map viewer with layer toggles
* Download links (GeoTIFF only)
**Map layers (toggles)**
* ✅ Refined crop/LULC map (final product) at **10m**
* ✅ Dynamic World baseline toggle
* Prefer **Highest Confidence** composite (as you stated)
* ✅ True colour composite
* ✅ Indices toggles:
* Peak NDVI
* Peak EVI
* Peak SAVI
* (Optional later: NDMI, NDRE)
**Outputs**
* Download refined result as **GeoTIFF only**
* Optional downloads:
* Baseline DW clipped AOI (GeoTIFF)
* True colour composite (GeoTIFF)
* Indices rasters (GeoTIFF)
**Legend / key**
* On-map legend showing your refined classes (color-coded)
* Class list includes:
* Your refined crop classes (from your image)
* Plus non-crop landcover classes so it remains full LULC
### 1.2 Processing pipeline requirements
**Validation**
* AOI inside Zimbabwe only
* Radius ≤ 5 km
* Reject overly complex geometries
**Data sources**
* DEA STAC endpoint:
* `https://explorer.digitalearth.africa/stac/search`
* Dynamic World baseline:
* Your pre-exported DW GeoTIFFs per year/season (now local and converted to COGs per §0.6; migrate to MinIO next)
**Core computations**
* Pull imagery from DEA STAC for selected year + summer season window
* Build feature stack:
* True colour
* Indices: NDVI, EVI, SAVI (+ optional NDRE/NDMI)
* “Peak” index logic (seasonal maximum)
* Load DW baseline for the same year/season, clip to AOI
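The index math and the peak step are straightforward in NumPy. This is a sketch only: band names and 0–1 reflectance scaling are assumptions, and EVI/SAVI use the standard coefficients (SAVI with L = 0.5).

```python
import numpy as np

def compute_indices(red, nir, blue):
    """NDVI, EVI, and SAVI from surface-reflectance bands scaled to 0-1."""
    eps = 1e-9  # guard against division by zero over water/shadow pixels
    ndvi = (nir - red) / (nir + red + eps)
    evi = 2.5 * (nir - red) / (nir + 6.0 * red - 7.5 * blue + 1.0 + eps)
    savi = 1.5 * (nir - red) / (nir + red + 0.5 + eps)  # L = 0.5
    return ndvi, evi, savi

def peak_index(stack):
    """Seasonal 'peak' value: per-pixel maximum over the time axis (t, y, x)."""
    return np.nanmax(stack, axis=0)
```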
**ML refinement**
* Take baseline DW + EO features and run selected ML model
* Refine crops into crop-specific classes
* Keep non-crop classes to output full LULC map
**Neighborhood smoothing**
* Majority filter rule:
* Replace each pixel with the majority class of its neighborhood window
* Configurable kernel sizes: 3×3 / 5×5
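One common form of the rule is a moving-window mode (majority) filter. This SciPy sketch assumes class codes are small non-negative integers; it is illustrative, not your final smoothing step.

```python
import numpy as np
from scipy import ndimage

def majority_filter(labels, size=3):
    """Replace each pixel with the most frequent class in its size x size window."""
    def _mode(window):
        # generic_filter hands us the flattened window as floats; recast to ints
        return np.bincount(window.astype(np.int64)).argmax()
    # mode="nearest" repeats edge pixels so borders still get a full window
    return ndimage.generic_filter(labels, _mode, size=size, mode="nearest")
```

The `size` argument maps directly onto the configurable 3×3 / 5×5 kernel sizes above.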
**Export and storage**
* Export refined output as GeoTIFF (prefer **Cloud Optimized GeoTIFF**)
* Save to MinIO
* Provide **signed URLs** for downloads
### 1.3 Admin capabilities
* Admin login (role-based)
* Dataset uploads:
* Upload training CSVs and/or labeled GeoTIFFs
* Version datasets (v1, v2…)
* Retraining:
* Trigger model retraining using Kubernetes Job
* Save trained models to MinIO (versioned)
* Promote a model to “production default”
* Job monitoring:
* See queue/running/failed jobs, timing, logs
* User management:
* Invite/create/disable users
* Per-user limits
### 1.4 Reliability + portfolio safety (high value)
**Compute control**
* Global concurrency cap (cluster-wide): e.g. **2 jobs running**
* Per-user daily limits: e.g. **35 jobs/day**
* Job timeouts: kill jobs > 25 minutes
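The per-user daily cap can be a single Redis counter per user per day. This sketch assumes a redis-py-compatible client and an illustrative key scheme; swap in your real limit values.

```python
from datetime import datetime, timezone

def try_acquire_daily_slot(redis_client, user_id, limit=3):
    """Count a user's jobs for today via INCR; reject once the limit is hit.

    The key expires after 24 h, so counts reset automatically each day.
    `redis_client` is any object with redis-py's incr/expire interface.
    """
    today = datetime.now(timezone.utc).strftime("%Y%m%d")
    key = f"quota:{user_id}:{today}"  # hypothetical key scheme
    count = redis_client.incr(key)
    if count == 1:
        redis_client.expire(key, 86400)  # first job today starts the 24 h window
    return count <= limit
```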
**Caching**
* Deterministic caching:
* If (AOI + year + season + model) repeats → return cached output
**Resilience**
* Queue-based async processing (RQ)
* Retry logic for STAC fetch
* Clean error reporting to user
### 1.5 Security
* HTTPS everywhere (already done)
* JWT auth
* RBAC roles: admin vs user
* K8s Secrets for:
* JWT secret
* MinIO credentials
* DB credentials
* MinIO should not be publicly writable
* Downloads are signed URLs only
### 1.6 Nice-to-have portfolio boosters
* Swipe/slider compare: Refined vs DW baseline
* Confidence raster toggle (if model outputs probabilities)
* Stats panel:
* area per class (ha)
* Metadata JSON (small but very useful even if downloads are “GeoTIFF only”)
* job_id, timestamp, year/season, model version, AOI, CRS, pixel size
---
## 2) Recommendation: “best” login + limiting approach for a portfolio
Because this is a portfolio project on VPS resources:
**Best default**
* **Invite-only accounts** (you create accounts or send invites)
* Simple password login (JWT)
* Hard limits:
* Global: 12 jobs running
* Per user: 3 jobs/day
**Why invite-only is best for portfolio**
* It prevents random abuse from your CV link
* It keeps your compute predictable
* It still demonstrates full auth + quota features
**Optional later**
* Public “Request Access” form (email + reason)
* Or Google OAuth (more work, not necessary for portfolio)
---
## 3) Target architecture (final)
### 3.1 Components
* **Frontend**: React + Leaflet
* Select AOI + params
* Submit job
* Poll status
* Render map layers from tiles
* Download GeoTIFF
* **API**: FastAPI
* Auth (JWT)
* Validate AOI + quotas
* Create job records
* Push job to Redis queue
* Generate signed URLs
* **Worker**: Python RQ Worker
* Pull job
* Query DEA STAC
* Compute features/indices
* Load DW baseline
* Run model inference
* Neighborhood smoothing
* Write outputs as COG GeoTIFF
* Update job status
* **Redis**
* job queue
* **MinIO**
* Baselines (DW)
* Models
* Results (COGs)
* **Database (recommended)**
* Postgres (preferred) for:
* users, roles
* jobs, params
* quotas usage
* model registry metadata
* **Tile server**
* TiTiler or rio-tiler based service
* Serves tiles from MinIO-hosted COGs
### 3.2 Buckets (MinIO)
* `geocrop-baselines` (DW GeoTIFF/COG)
* `geocrop-models` (pkl/onnx + metadata)
* `geocrop-results` (output COGs)
* `geocrop-datasets` (training data uploads)
### 3.3 Subdomains
* `portfolio.techarvest.co.zw` → frontend
* `api.portfolio.techarvest.co.zw` → FastAPI
* `tiles.portfolio.techarvest.co.zw` → TiTiler (recommended add)
* `minio.portfolio.techarvest.co.zw` → MinIO API (private)
* `console.minio.portfolio.techarvest.co.zw` → MinIO Console (admin-only)
---
## 4) What to build next (exact order)
### Phase A — Clean repo + manifests (so you stop fighting YAML)
1. Create a Git repo layout:
* `geocrop/`
* `k8s/`
* `base/`
* `prod/`
* `api/`
* `worker/`
* `web/`
2. Move your current YAML into files with predictable names:
* `k8s/base/00-namespace.yaml`
* `k8s/base/10-redis.yaml`
* `k8s/base/20-minio.yaml`
* `k8s/base/30-api.yaml`
* `k8s/base/40-worker.yaml`
* `k8s/base/50-web.yaml`
* `k8s/base/60-ingress.yaml`
3. Add `kubectl apply -k` using Kustomize later (optional).
### Phase B — Make API real (replace hello-api)
4. Build FastAPI endpoints:
* `POST /auth/register` (admin-only or invite)
* `POST /auth/login`
* `POST /jobs` (create job)
* `GET /jobs/{job_id}` (status)
* `GET /jobs/{job_id}/download` (signed url)
* `GET /models` (list available models)
5. Add quotas + concurrency guard:
* Global running jobs ≤ 2
* Per-user jobs/day ≤ 35
6. Store job status:
* Start with Redis
* Upgrade to Postgres when stable
### Phase C — Worker: “real pipeline v1”
7. Implement DEA STAC search + download clip for AOI:
* Sentinel-2 (s2_l2a) is likely easiest first
* Compute indices (NDVI, EVI, SAVI)
* Compute peak indices (season max)
8. Load DW baseline GeoTIFF for the year:
* Step 1: upload the converted DW COGs (now in `~/geocrop/data/dw_cogs`, per §0.6) to MinIO
* Step 2: clip to AOI
9. Run model inference:
* Load model from MinIO
* Apply to feature stack
* Output refined label raster
10. Neighborhood smoothing:
* Majority filter 3×3 / 5×5 (configurable)
11. Export result as GeoTIFF (prefer COG)
* Write to temp
* Upload to MinIO
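Step 7's STAC query can be outlined with `pystac-client`. The `s2_l2a` collection and the Nov→Apr season window come from this plan; the catalog root URL and helper names are assumptions to verify before use.

```python
import math

def season_window(year):
    """Summer cropping season: 1 Nov of `year` through 30 Apr of `year` + 1."""
    return f"{year}-11-01/{year + 1}-04-30"

def aoi_bbox(lat, lon, radius_m):
    """Approximate the circular AOI as a WGS84 lon/lat bounding box."""
    dlat = radius_m / 111_320.0  # metres per degree of latitude
    dlon = radius_m / (111_320.0 * math.cos(math.radians(lat)))
    return [lon - dlon, lat - dlat, lon + dlon, lat + dlat]

def search_scenes(lat, lon, radius_m, year):
    """Query the DEA STAC API for Sentinel-2 L2A items over the AOI/season."""
    from pystac_client import Client  # lazy import: only needed at query time
    # Assumed catalog root for the /stac/search endpoint named in §1.2
    catalog = Client.open("https://explorer.digitalearth.africa/stac")
    search = catalog.search(
        collections=["s2_l2a"],
        bbox=aoi_bbox(lat, lon, radius_m),
        datetime=season_window(year),
    )
    return list(search.items())
```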
### Phase D — Tiles + map UI
12. Deploy TiTiler service and expose:
* `tiles.portfolio...`
13. Frontend:
* Leaflet selection + coords input
* Submit job + poll
* Add layers from tile URLs
* Legend + downloads
### Phase E — Admin portal + retraining
14. Admin UI:
* Dataset upload
* Model list + promote
15. Retraining pipeline:
* Kubernetes Job that:
* pulls dataset from MinIO
* trains models
* saves artifact to MinIO
* registers new model version
---
## 5) Important “you might forget” items (add now)
### 5.1 Model registry metadata
For each model artifact store:
* model_name
* version
* training datasets used
* training timestamp
* feature list expected
* class mapping
### 5.2 Class mapping (must be consistent)
Create a single `classes.json` used by:
* training
* inference
* frontend legend
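A sketch of what the shared mapping could look like once loaded from `classes.json`. The codes, names, and colours below are placeholders, not your actual refined class list.

```python
import json

# Placeholder entries -- replace with the real refined crop + LULC classes
CLASSES = {
    "0":  {"name": "water", "color": "#419bdf", "crop": False},
    "4":  {"name": "crops (generic)", "color": "#e49635", "crop": True},
    "40": {"name": "maize", "color": "#f0e442", "crop": True},
}

def legend_entries(classes):
    """(code, name, color) tuples sorted by class code, for the frontend legend."""
    return [(int(k), v["name"], v["color"])
            for k, v in sorted(classes.items(), key=lambda kv: int(kv[0]))]

def save_classes(path="classes.json"):
    # One file checked into the repo, read by training, inference, and the web UI
    with open(path, "w") as f:
        json.dump(CLASSES, f, indent=2)
```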
### 5.3 Zimbabwe boundary validation
Use a Zimbabwe boundary polygon in the API/worker to validate AOI.
* Best: store the boundary geometry as GeoJSON in repo.
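Validation might look like this with Shapely, assuming the boundary is a single-feature GeoJSON file in the repo; the metres-to-degrees buffer is a rough approximation, fine at a 5 km scale.

```python
import json
from shapely.geometry import Point, shape

def load_boundary(path="zimbabwe.geojson"):
    """Load the Zimbabwe boundary polygon from a GeoJSON file in the repo."""
    with open(path) as f:
        gj = json.load(f)
    return shape(gj["features"][0]["geometry"])

def aoi_in_zimbabwe(boundary, lat, lon, radius_m):
    """Accept the AOI only if the whole buffered circle lies inside the boundary."""
    buffer_deg = radius_m / 111_320.0  # rough metres-to-degrees conversion
    return boundary.contains(Point(lon, lat).buffer(buffer_deg))
```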
### 5.4 Deterministic job cache key
Hash:
* year
* season
* model_version
* center lat/lon
* radius
If exists → return cached result (huge compute saver).
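A minimal version of that hash, with coordinates rounded so float noise doesn't defeat the cache; the `model_version` string is illustrative.

```python
import hashlib
import json

def job_cache_key(year, season, model_version, lat, lon, radius_m):
    """Deterministic cache key: identical params always hash to the same key."""
    params = {
        "year": year,
        "season": season,
        "model_version": model_version,  # e.g. a hypothetical "rf-v2"
        "lat": round(lat, 5),   # ~1 m precision; absorbs float noise
        "lon": round(lon, 5),
        "radius_m": radius_m,
    }
    # sort_keys makes the serialization, and therefore the hash, stable
    blob = json.dumps(params, sort_keys=True)
    return hashlib.sha256(blob.encode()).hexdigest()
```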
### 5.5 Signed downloads
Never expose MinIO objects publicly.
* API generates signed GET URLs that expire.
---
## 6) Open items to decide (tomorrow)
1. **Frontend framework**: React + Vite (recommended)
2. **Tile approach**: TiTiler vs pre-render PNGs (TiTiler looks much more professional)
3. **DB**: add Postgres now vs later (recommended soon for quotas + user mgmt)
4. **Which DEA collections** to use for the first version:
* Start with Sentinel-2 L2A (s2_l2a)
* Later add Landsat fallback
5. **Model input features**: exact feature vector and normalization rules
---
## 7) Roo/Minimax execution notes (so it doesn't get confused)
* Treat current cluster as **production-like**
* All services live in namespace: `geocrop`
* Ingress class: `nginx`
* ClusterIssuer: `letsencrypt-prod`
* Public IP of ingress node: `167.86.68.48`
* Subdomains already configured and reachable
* Next change should be swapping placeholder services for real deployments
---
## 8) Short summary
You already have the hard part done:
* K3s + ingress + TLS + DNS all work
* MinIO + Redis work
* You proved async jobs can be queued and processed
Next is mostly **application engineering**:
* Replace placeholder web/api with real app
* Add job status + quotas
* Implement DEA STAC fetch + DW baseline clipping + ML inference
* Export COG + tile server + map UI