# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Project Does

GeoCrop is a crop-type classification platform for Zimbabwe. It:
1. Accepts an AOI (lat/lon + radius) and year via REST API
2. Queues an inference job via Redis/RQ
3. Runs a worker that fetches Sentinel-2 imagery from DEA STAC, computes 51 spectral features, loads a Dynamic World baseline, runs an ML model (XGBoost/LightGBM/CatBoost/Ensemble), and uploads COG results to MinIO
4. Serves results via TiTiler (tile server reading COGs directly from MinIO over S3)


## Build & Run Commands

```bash
# API
cd apps/api && pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000

# Worker
cd apps/worker && pip install -r requirements.txt
python worker.py --worker   # start RQ worker
python worker.py --test     # syntax/import self-test only

# Web frontend (React + Vite + TypeScript)
cd apps/web && npm install
npm run dev       # dev server (hot reload)
npm run build     # production build → dist/
npm run lint      # ESLint check
npm run preview   # preview production build locally

# Training
cd training && python train.py --data /path/to/data.csv --out ./artifacts --variant Raw
# With MinIO upload:
MINIO_ENDPOINT=... MINIO_ACCESS_KEY=... MINIO_SECRET_KEY=... \
  python train.py --data /path/to/data.csv --out ./artifacts --variant Raw --upload-minio

# Docker
docker build -t frankchine/geocrop-api:v1 apps/api/
docker build -t frankchine/geocrop-worker:v1 apps/worker/
```


## Kubernetes Deployment

All k8s manifests are in `k8s/` — numbered for apply order:

```bash
kubectl apply -f k8s/00-namespace.yaml
kubectl apply -f k8s/   # apply all in order
kubectl -n geocrop rollout restart deployment/geocrop-api
kubectl -n geocrop rollout restart deployment/geocrop-worker
```

Namespace: `geocrop`. Ingress class: `nginx`. ClusterIssuer: `letsencrypt-prod`.

Exposed hosts:
- `portfolio.techarvest.co.zw` → geocrop-web (nginx static)
- `api.portfolio.techarvest.co.zw` → geocrop-api:8000
- `tiles.portfolio.techarvest.co.zw` → geocrop-tiler:8000 (TiTiler)
- `minio.portfolio.techarvest.co.zw` → MinIO API
- `console.minio.portfolio.techarvest.co.zw` → MinIO Console


## Architecture

```
Web (React/Vite/OL) → API (FastAPI) → Redis Queue (geocrop_tasks) → Worker (RQ)
                                                                        ↓
                                          DEA STAC → feature_computation.py (51 features)
                                          MinIO    → dw_baseline.py (windowed read)
                                          MinIO    → inference.py (model load + predict)
                                                   → postprocess.py (majority filter)
                                                   → cog.py (write COG)
                                                   → MinIO geocrop-results/
                                                                        ↓
                                          TiTiler reads COGs from MinIO via S3 protocol
```

Job status is written to Redis at `job:{job_id}:status` with 24h expiry.

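
The status key convention can be sketched as below. This is an illustration only, not the worker's actual code: `set_job_status`, the JSON payload shape, and `FakeRedis` are hypothetical. `client` is assumed to be a redis-py style client whose `set(name, value, ex=seconds)` call applies the expiry.

```python
import json

STATUS_TTL_SECONDS = 24 * 60 * 60  # 24h expiry, as stated above


def status_key(job_id: str) -> str:
    """Build the Redis key that holds a job's status."""
    return f"job:{job_id}:status"


def set_job_status(client, job_id: str, stage: str) -> None:
    """Write the status with a 24h TTL (payload shape is an assumption)."""
    payload = json.dumps({"job_id": job_id, "stage": stage})
    client.set(status_key(job_id), payload, ex=STATUS_TTL_SECONDS)


class FakeRedis:
    """In-memory stand-in so the sketch runs without a Redis server."""

    def __init__(self):
        self.store = {}

    def set(self, name, value, ex=None):
        # Record the value together with the requested expiry in seconds.
        self.store[name] = (value, ex)


fake = FakeRedis()
set_job_status(fake, "abc123", "fetch_stac")
```
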

**Web frontend** (`apps/web/`): React 19 + TypeScript + Vite. Uses OpenLayers for the map (click-to-set-coordinates). Components: `Login`, `Welcome`, `JobForm`, `StatusMonitor`, `MapComponent`, `Admin`. State is in `App.tsx`; JWT token stored in `localStorage`.

**API user store**: Users are stored in an in-memory dict (`USERS` in `apps/api/main.py`) — lost on restart. Admin panel (`/admin/users`) manages users at runtime. Any user additions must be re-done after pod restarts unless the dict is seeded in code.


## Critical Non-Obvious Patterns

**Season window**: Sept 1 → May 31 of the following year. `year=2022` → 2022-09-01 to 2023-05-31. See `InferenceConfig.season_dates()` in `apps/worker/config.py`.

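
A minimal sketch of the season rule, written as a standalone function for illustration. The real `InferenceConfig.season_dates()` may have a different signature or return type; only the date arithmetic is taken from the rule above.

```python
from datetime import date


def season_dates(year: int) -> tuple[date, date]:
    """Return (start, end) of the season that begins in `year`."""
    return date(year, 9, 1), date(year + 1, 5, 31)


start, end = season_dates(2022)
print(start.isoformat(), end.isoformat())  # 2022-09-01 2023-05-31
```
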

**AOI format**: `(lon, lat, radius_m)` — NOT `(lat, lon)`. Longitude first everywhere in `features.py`.

**Zimbabwe bounds**: Lon 25.2–33.1, Lat -22.5 to -15.6 (enforced in `worker.py` validation).

**Radius limit**: Max 5000m enforced in both API (`apps/api/main.py:90`) and worker validation.

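
The three AOI rules above (longitude first, Zimbabwe bounds, radius cap) could be checked together roughly as follows. `validate_aoi` is a hypothetical helper, not the function in `worker.py`, and the requirement that the radius be positive is an added assumption.

```python
ZIM_LON = (25.2, 33.1)    # min/max longitude from the bounds above
ZIM_LAT = (-22.5, -15.6)  # min/max latitude from the bounds above
MAX_RADIUS_M = 5000


def validate_aoi(aoi):
    """Validate an AOI given as (lon, lat, radius_m), longitude first."""
    lon, lat, radius_m = aoi
    if not ZIM_LON[0] <= lon <= ZIM_LON[1]:
        raise ValueError(f"lon {lon} outside Zimbabwe bounds {ZIM_LON}")
    if not ZIM_LAT[0] <= lat <= ZIM_LAT[1]:
        raise ValueError(f"lat {lat} outside Zimbabwe bounds {ZIM_LAT}")
    # Assumption: a zero or negative radius is rejected as well.
    if not 0 < radius_m <= MAX_RADIUS_M:
        raise ValueError(f"radius {radius_m} m exceeds limit of {MAX_RADIUS_M} m")
    return lon, lat, radius_m


validate_aoi((31.05, -17.83, 2000))  # a point near Harare, lon first
```

Passing `(lat, lon)` by mistake fails fast here, because a Zimbabwean latitude never falls inside the longitude range.
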

**RQ queue name**: `geocrop_tasks`. Redis service: `redis.geocrop.svc.cluster.local`.

**API vs worker function name mismatch**: `apps/api/main.py` enqueues `'worker.run_inference'` but the worker only defines `run_job`. Any new worker entry point must be named `run_inference` (or the API call must be updated) for end-to-end jobs to work.

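
One low-risk way to close the naming gap is a thin alias in the worker module, sketched below. The `run_job` body and its `payload` argument are stand-ins; the real signature in `worker.py` is not shown in this file, so the alias simply forwards whatever it receives.

```python
def run_job(payload: dict) -> dict:
    """Stand-in for the worker's existing entry point (signature is hypothetical)."""
    return {"job_id": payload.get("job_id"), "status": "done"}


def run_inference(*args, **kwargs):
    """Alias so the name the API enqueues, 'worker.run_inference', resolves."""
    return run_job(*args, **kwargs)


result = run_inference({"job_id": "j1"})
```
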

**Smoothing kernel**: Must be odd — 3, 5, or 7 only (`postprocess.py`).

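
For orientation, a pure-Python sketch of a majority filter with the kernel check. It is not the code in `postprocess.py`, which likely works on NumPy arrays and may treat ties, edges, or nodata differently. Here edges are handled by clipping the window, and ties go to the value seen first.

```python
from collections import Counter

ALLOWED_KERNELS = (3, 5, 7)


def majority_filter(grid, kernel=3):
    """Replace each cell with the most common class in its kernel x kernel window."""
    if kernel not in ALLOWED_KERNELS:
        raise ValueError(f"kernel must be one of {ALLOWED_KERNELS}, got {kernel}")
    half = kernel // 2
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            # Collect the neighbourhood, clipped at the raster edges.
            window = [
                grid[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            out[r][c] = Counter(window).most_common(1)[0][0]
    return out


speckled = [[1, 1, 1], [1, 2, 1], [1, 1, 1]]
smoothed = majority_filter(speckled, kernel=3)
```
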

**Feature order**: `FEATURE_ORDER_V1` in `feature_computation.py` — exactly 51 scalar features. Order matters for model inference. Changing this breaks all existing models.

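
A sketch of why order matters: features should be arranged by the canonical list, never by dict iteration order. `FEATURE_ORDER` below is a three-item placeholder, not the real `FEATURE_ORDER_V1`, and `to_feature_vector` is a hypothetical helper.

```python
# Placeholder only; the real list has 51 entries in feature_computation.py.
FEATURE_ORDER = ["ndvi_max", "ndvi_min", "harmonic1_sin"]


def to_feature_vector(features: dict, order=FEATURE_ORDER) -> list:
    """Arrange named features into the exact order the model was trained on."""
    missing = [name for name in order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [features[name] for name in order]


# Input order differs from FEATURE_ORDER; the output follows FEATURE_ORDER.
vector = to_feature_vector({"harmonic1_sin": 0.1, "ndvi_min": 0.2, "ndvi_max": 0.8})
```
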

## MinIO Buckets & Path Conventions

| Bucket | Purpose | Path pattern |
|--------|---------|-------------|
| `geocrop-models` | ML model `.pkl` files | ROOT — no subfolders |
| `geocrop-baselines` | Dynamic World COG tiles | `dw/zim/summer/<season>/<type>/DW_Zim_<Type>_<year>_<year+1>-<row>-<col>.tif` |
| `geocrop-results` | Output COGs | `results/<job_id>/<filename>` |
| `geocrop-datasets` | Training data CSVs | — |

**Model filenames** (ROOT of `geocrop-models`):
- `Zimbabwe_Ensemble_Raw_Model.pkl` — no scaler needed
- `Zimbabwe_XGBoost_Model.pkl`, `Zimbabwe_LightGBM_Model.pkl`, `Zimbabwe_RandomForest_Model.pkl` — require scaler
- `Zimbabwe_CatBoost_Raw_Model.pkl` — no scaler

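
The scaler requirement per model file listed above can be captured in a small lookup. The filenames and flags come from that list; the `needs_scaler` helper itself is hypothetical.

```python
NEEDS_SCALER = {
    "Zimbabwe_Ensemble_Raw_Model.pkl": False,
    "Zimbabwe_CatBoost_Raw_Model.pkl": False,
    "Zimbabwe_XGBoost_Model.pkl": True,
    "Zimbabwe_LightGBM_Model.pkl": True,
    "Zimbabwe_RandomForest_Model.pkl": True,
}


def needs_scaler(filename: str) -> bool:
    """Return True if features must be scaled before calling this model."""
    try:
        return NEEDS_SCALER[filename]
    except KeyError:
        raise ValueError(f"unknown model file: {filename}") from None


print(needs_scaler("Zimbabwe_XGBoost_Model.pkl"))  # True
```
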

**DW baseline tiles**: COGs are 65536×65536 pixel tiles. Worker MUST use windowed reads via presigned URL — never download the full tile. Always transform AOI bbox to tile CRS before computing window.

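
The window arithmetic behind a windowed read can be sketched in plain Python. This assumes a north-up tile with square pixels and a bbox already transformed to the tile CRS. `bbox_to_window` and its parameters are illustrative, and in practice a library call such as `rasterio.windows.from_bounds` would normally do this job.

```python
import math

TILE_SIZE = 65536  # pixels per side, as stated above


def bbox_to_window(bbox, origin_x, origin_y, pixel_size, tile_size=TILE_SIZE):
    """Convert (min_x, min_y, max_x, max_y) in tile CRS to a pixel window.

    origin_x/origin_y is the tile's top-left corner. Returns
    (col_off, row_off, width, height).
    """
    min_x, min_y, max_x, max_y = bbox
    col_start = math.floor((min_x - origin_x) / pixel_size)
    col_stop = math.ceil((max_x - origin_x) / pixel_size)
    # Rows count downward from the top edge, so max_y gives the first row.
    row_start = math.floor((origin_y - max_y) / pixel_size)
    row_stop = math.ceil((origin_y - min_y) / pixel_size)
    # Clamp to the tile so the read never goes out of bounds.
    col_start, row_start = max(0, col_start), max(0, row_start)
    col_stop, row_stop = min(tile_size, col_stop), min(tile_size, row_stop)
    if col_stop <= col_start or row_stop <= row_start:
        raise ValueError("AOI does not intersect this tile")
    return col_start, row_start, col_stop - col_start, row_stop - row_start


window = bbox_to_window((100, 800, 250, 950), origin_x=0, origin_y=1000, pixel_size=10)
print(window)  # (10, 5, 15, 15)
```
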

## Environment Variables

| Variable | Default | Notes |
|----------|---------|-------|
| `REDIS_HOST` | `redis.geocrop.svc.cluster.local` | Also supports `REDIS_URL` |
| `MINIO_ENDPOINT` | `minio.geocrop.svc.cluster.local:9000` | |
| `MINIO_ACCESS_KEY` | `minioadmin` | |
| `MINIO_SECRET_KEY` | `minioadmin123` | |
| `MINIO_SECURE` | `false` | |
| `GEOCROP_CACHE_DIR` | `/tmp/geocrop-cache` | |
| `SECRET_KEY` | (change in prod) | API JWT signing |

TiTiler uses `AWS_S3_ENDPOINT_URL=http://minio.geocrop.svc.cluster.local:9000`, `AWS_HTTPS=NO`, credentials from `geocrop-secrets` k8s secret.

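
A sketch of reading the variables in the table above with their documented defaults. `load_settings` and the dict keys are hypothetical, and the rule for parsing `MINIO_SECURE` into a boolean is an assumption. `SECRET_KEY` and `REDIS_URL` are left out because their defaults are not given here.

```python
import os


def load_settings(env=None) -> dict:
    """Read worker/API settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "redis_host": env.get("REDIS_HOST", "redis.geocrop.svc.cluster.local"),
        "minio_endpoint": env.get("MINIO_ENDPOINT", "minio.geocrop.svc.cluster.local:9000"),
        "minio_access_key": env.get("MINIO_ACCESS_KEY", "minioadmin"),
        "minio_secret_key": env.get("MINIO_SECRET_KEY", "minioadmin123"),
        # Assumption: "1", "true", or "yes" (any case) enable TLS.
        "minio_secure": env.get("MINIO_SECURE", "false").strip().lower() in ("1", "true", "yes"),
        "cache_dir": env.get("GEOCROP_CACHE_DIR", "/tmp/geocrop-cache"),
    }


defaults = load_settings({})  # pass a dict to bypass the real environment
```
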

## Feature Engineering (must match training exactly)

Pipeline in `feature_computation.py`:
1. Compute indices: ndvi, ndre, evi, savi, ci_re, ndwi
2. Fill zeros linearly, then Savitzky-Golay smooth (window=5, polyorder=2)
3. Phenology metrics for ndvi/ndre/evi: max, min, mean, std, amplitude, auc, peak_timestep, max_slope_up, max_slope_down (27 features)
4. Harmonics for ndvi only: harmonic1_sin/cos, harmonic2_sin/cos (4 features)
5. Interactions: ndvi_ndre_peak_diff, canopy_density_contrast (2 features)
6. Window summaries (early=Oct–Dec, peak=Jan–Mar, late=Apr–Jun) for ndvi/ndwi/ndre × mean/max (18 features)

**Total: 51 features** — see `FEATURE_ORDER_V1` for exact ordering.

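
The per-group counts above add up as 27 + 4 + 2 + 18 = 51, which the sketch below checks by enumerating names. The naming pattern (`ndvi_max`, `early_ndvi_mean`, and so on) and the ordering are assumptions for illustration; `FEATURE_ORDER_V1` remains the authoritative list.

```python
PHENOLOGY_INDICES = ["ndvi", "ndre", "evi"]
PHENOLOGY_METRICS = [
    "max", "min", "mean", "std", "amplitude", "auc",
    "peak_timestep", "max_slope_up", "max_slope_down",
]
HARMONICS = ["harmonic1_sin", "harmonic1_cos", "harmonic2_sin", "harmonic2_cos"]
INTERACTIONS = ["ndvi_ndre_peak_diff", "canopy_density_contrast"]
WINDOWS = ["early", "peak", "late"]
WINDOW_INDICES = ["ndvi", "ndwi", "ndre"]
WINDOW_STATS = ["mean", "max"]


def feature_names() -> list:
    """Enumerate feature names group by group (naming pattern is hypothetical)."""
    names = [f"{i}_{m}" for i in PHENOLOGY_INDICES for m in PHENOLOGY_METRICS]  # 3 x 9 = 27
    names += HARMONICS      # 4
    names += INTERACTIONS   # 2
    names += [
        f"{w}_{i}_{s}" for w in WINDOWS for i in WINDOW_INDICES for s in WINDOW_STATS
    ]                       # 3 x 3 x 2 = 18
    return names


print(len(feature_names()))  # 51
```
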

Training junk columns dropped: `.geo`, `system:index`, `latitude`, `longitude`, `lat`, `lon`, `ID`, `parent_id`, `batch_id`, `is_syn`.


## DEA STAC

- Search endpoint: `https://explorer.digitalearth.africa/stac/search`
- Primary collection: `s2_l2a` (falls back to `s2_l2a_c1`, `sentinel-2-l2a`, `sentinel_2_l2a`)
- Required bands: red, green, blue, nir, nir08 (red-edge), swir16, swir22
- Cloud filter: `eo:cloud_cover < 30`

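
A sketch of the JSON body such a search might send. It assumes the endpoint accepts the STAC API `query` extension for the cloud filter and that a simple degrees-per-metre approximation is acceptable for the bbox. `aoi_bbox`, `build_search_body`, and the `limit` value are hypothetical, and no request is made here.

```python
import math
from datetime import date


def aoi_bbox(lon, lat, radius_m):
    """Approximate [min_lon, min_lat, max_lon, max_lat] around a point."""
    dlat = radius_m / 111_320.0  # roughly metres per degree of latitude
    dlon = radius_m / (111_320.0 * math.cos(math.radians(lat)))
    return [lon - dlon, lat - dlat, lon + dlon, lat + dlat]


def build_search_body(lon, lat, radius_m, start, end,
                      collection="s2_l2a", max_cloud=30, limit=100):
    """Build a STAC item-search body for one season and one AOI."""
    return {
        "collections": [collection],
        "bbox": aoi_bbox(lon, lat, radius_m),
        "datetime": f"{start.isoformat()}T00:00:00Z/{end.isoformat()}T23:59:59Z",
        "query": {"eo:cloud_cover": {"lt": max_cloud}},
        "limit": limit,
    }


body = build_search_body(31.05, -17.83, 2000, date(2022, 9, 1), date(2023, 5, 31))
```

If the primary collection returns nothing, the same body can be rebuilt with each fallback collection name in turn.
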

## Worker Pipeline Stages

`fetch_stac → build_features → load_dw → infer → smooth → export_cog → upload → done`

When real DEA STAC data is unavailable, worker falls back to synthetic features (seeded by year+coords) to allow end-to-end pipeline testing.

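
The key property of the synthetic fallback is determinism: the same year and coordinates should always give the same features. The sketch below shows one way to get that. The seed derivation, the uniform value range, and the `synthetic_features` name are all assumptions, not the worker's actual scheme.

```python
import hashlib
import random


def synthetic_features(year, lon, lat, n_features=51):
    """Generate repeatable placeholder features for one (year, lon, lat)."""
    # Derive a stable integer seed from the year and rounded coordinates.
    key = f"{year}:{lon:.5f}:{lat:.5f}".encode()
    seed = int.from_bytes(hashlib.sha256(key).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_features)]


first = synthetic_features(2022, 31.05, -17.83)
second = synthetic_features(2022, 31.05, -17.83)
```
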

## Label Classes (V1 — temporary)

35 classes including Maize, Tobacco, Soyabean, etc. — defined as `CLASSES_V1` in `apps/worker/worker.py`. Extract dynamically from `model.classes_` when available; fall back to this list only if not present.

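
The lookup order described above can be sketched as follows. `CLASSES_V1` is truncated to the three names mentioned in this file, and `resolve_classes` and `DummyModel` are hypothetical. If `model.classes_` holds encoded integers rather than names, a label encoder would still be needed to map them back.

```python
# Truncated placeholder; the real list in worker.py has 35 entries.
CLASSES_V1 = ["Maize", "Tobacco", "Soyabean"]


def resolve_classes(model, fallback=CLASSES_V1):
    """Prefer the model's own classes_; use the static list only as a fallback."""
    classes = getattr(model, "classes_", None)
    if classes is None or len(classes) == 0:
        return list(fallback)
    return [str(c) for c in classes]


class DummyModel:
    """Stand-in for a fitted estimator exposing classes_."""
    classes_ = ["Maize", "Sorghum"]


print(resolve_classes(DummyModel()))  # ['Maize', 'Sorghum']
print(resolve_classes(object()))      # ['Maize', 'Tobacco', 'Soyabean']
```
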

## Training Artifacts

`train.py --variant Raw` produces `artifacts/model_raw/`:
- `model.joblib` — VotingClassifier (soft) over RF + XGBoost + LightGBM + CatBoost
- `label_encoder.joblib` — sklearn LabelEncoder (maps string class → int)
- `selected_features.json` — feature subset chosen by scout RF (subset of FEATURE_ORDER_V1)
- `meta.json` — class names, n_features, config snapshot
- `metrics.json` — per-model accuracy/F1/classification report

`--variant Scaled` also emits `scaler.joblib`. Models uploaded to MinIO via `--upload-minio` go under `geocrop-models` at the ROOT (no subfolders).


## Plans & Docs

`plan/` contains detailed step-by-step implementation plans (01–05) and an SRS. Read these before making significant architectural changes. `ops/` contains MinIO upload scripts and storage setup docs.