# CLAUDE.md

This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.

## What This Project Does

GeoCrop is a crop-type classification platform for Zimbabwe. It:

1. Accepts an AOI (lat/lon + radius) and year via REST API
2. Queues an inference job via Redis/RQ
3. Worker fetches Sentinel-2 imagery from DEA STAC, computes 51 spectral features, loads a Dynamic World baseline, runs an ML model (XGBoost/LightGBM/CatBoost/Ensemble), and uploads COG results to MinIO
4. Results are served via TiTiler (tile server reading COGs directly from MinIO over S3)

## Build & Run Commands

```bash
# API
cd apps/api && pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000

# Worker
cd apps/worker && pip install -r requirements.txt
python worker.py --worker   # start RQ worker
python worker.py --test     # syntax/import self-test only

# Web frontend (React + Vite + TypeScript)
cd apps/web && npm install
npm run dev       # dev server (hot reload)
npm run build     # production build → dist/
npm run lint      # ESLint check
npm run preview   # preview production build locally

# Training
cd training && python train.py --data /path/to/data.csv --out ./artifacts --variant Raw
# With MinIO upload:
MINIO_ENDPOINT=... MINIO_ACCESS_KEY=... MINIO_SECRET_KEY=... \
  python train.py --data /path/to/data.csv --out ./artifacts --variant Raw --upload-minio

# Docker
docker build -t frankchine/geocrop-api:v1 apps/api/
docker build -t frankchine/geocrop-worker:v1 apps/worker/
```

## Kubernetes Deployment

All k8s manifests are in `k8s/`, numbered for apply order:

```bash
kubectl apply -f k8s/00-namespace.yaml
kubectl apply -f k8s/   # apply all in order
kubectl -n geocrop rollout restart deployment/geocrop-api
kubectl -n geocrop rollout restart deployment/geocrop-worker
```

Namespace: `geocrop`. Ingress class: `nginx`. ClusterIssuer: `letsencrypt-prod`.
Exposed hosts:

- `portfolio.techarvest.co.zw` → geocrop-web (nginx static)
- `api.portfolio.techarvest.co.zw` → geocrop-api:8000
- `tiles.portfolio.techarvest.co.zw` → geocrop-tiler:8000 (TiTiler)
- `minio.portfolio.techarvest.co.zw` → MinIO API
- `console.minio.portfolio.techarvest.co.zw` → MinIO Console

## Architecture

```
Web (React/Vite/OL) → API (FastAPI) → Redis Queue (geocrop_tasks) → Worker (RQ)
                                                                        ↓
                                      DEA STAC → feature_computation.py (51 features)
                                      MinIO    → dw_baseline.py (windowed read)
                                      MinIO    → inference.py (model load + predict)
                                               → postprocess.py (majority filter)
                                               → cog.py (write COG)
                                               → MinIO geocrop-results/
                                                                        ↓
                                      TiTiler reads COGs from MinIO via S3 protocol
```

Job status is written to Redis at `job:{job_id}:status` with 24h expiry.

**Web frontend** (`apps/web/`): React 19 + TypeScript + Vite. Uses OpenLayers for the map (click-to-set-coordinates). Components: `Login`, `Welcome`, `JobForm`, `StatusMonitor`, `MapComponent`, `Admin`. State is in `App.tsx`; JWT token stored in `localStorage`.

**API user store**: Users are stored in an in-memory dict (`USERS` in `apps/api/main.py`) and are lost on restart. Admin panel (`/admin/users`) manages users at runtime. Any user additions must be re-done after pod restarts unless the dict is seeded in code.

## Critical Non-Obvious Patterns

**Season window**: Sept 1 → May 31 of the following year. `year=2022` → 2022-09-01 to 2023-05-31. See `InferenceConfig.season_dates()` in `apps/worker/config.py`.

**AOI format**: `(lon, lat, radius_m)`, NOT `(lat, lon)`. Longitude first everywhere in `features.py`.

**Zimbabwe bounds**: Lon 25.2–33.1, Lat -22.5 to -15.6 (enforced in `worker.py` validation).

**Radius limit**: Max 5000m enforced in both API (`apps/api/main.py:90`) and worker validation.

**RQ queue name**: `geocrop_tasks`. Redis service: `redis.geocrop.svc.cluster.local`.

**API vs worker function name mismatch**: `apps/api/main.py` enqueues `'worker.run_inference'` but the worker only defines `run_job`.
Any new worker entry point must be named `run_inference` (or the API call must be updated) for end-to-end jobs to work.

**Smoothing kernel**: Must be odd: 3, 5, or 7 only (`postprocess.py`).

**Feature order**: `FEATURE_ORDER_V1` in `feature_computation.py` lists exactly 51 scalar features. Order matters for model inference. Changing this breaks all existing models.

## MinIO Buckets & Path Conventions

| Bucket | Purpose | Path pattern |
|--------|---------|-------------|
| `geocrop-models` | ML model `.pkl` files | ROOT (no subfolders) |
| `geocrop-baselines` | Dynamic World COG tiles | `dw/zim/summer///DW_Zim___--.tif` |
| `geocrop-results` | Output COGs | `results//` |
| `geocrop-datasets` | Training data CSVs | n/a |

**Model filenames** (ROOT of `geocrop-models`):

- `Zimbabwe_Ensemble_Raw_Model.pkl`: no scaler needed
- `Zimbabwe_XGBoost_Model.pkl`, `Zimbabwe_LightGBM_Model.pkl`, `Zimbabwe_RandomForest_Model.pkl`: require scaler
- `Zimbabwe_CatBoost_Raw_Model.pkl`: no scaler

**DW baseline tiles**: COGs are 65536×65536 pixel tiles. Worker MUST use windowed reads via presigned URL; never download the full tile. Always transform AOI bbox to tile CRS before computing window.

## Environment Variables

| Variable | Default | Notes |
|----------|---------|-------|
| `REDIS_HOST` | `redis.geocrop.svc.cluster.local` | Also supports `REDIS_URL` |
| `MINIO_ENDPOINT` | `minio.geocrop.svc.cluster.local:9000` | |
| `MINIO_ACCESS_KEY` | `minioadmin` | |
| `MINIO_SECRET_KEY` | `minioadmin123` | |
| `MINIO_SECURE` | `false` | |
| `GEOCROP_CACHE_DIR` | `/tmp/geocrop-cache` | |
| `SECRET_KEY` | (change in prod) | API JWT signing |

TiTiler uses `AWS_S3_ENDPOINT_URL=http://minio.geocrop.svc.cluster.local:9000`, `AWS_HTTPS=NO`, credentials from `geocrop-secrets` k8s secret.

## Feature Engineering (must match training exactly)

Pipeline in `feature_computation.py`:

1. Compute indices: ndvi, ndre, evi, savi, ci_re, ndwi
2. Fill zeros linearly, then Savitzky-Golay smooth (window=5, polyorder=2)
3. Phenology metrics for ndvi/ndre/evi: max, min, mean, std, amplitude, auc, peak_timestep, max_slope_up, max_slope_down (27 features)
4. Harmonics for ndvi only: harmonic1_sin/cos, harmonic2_sin/cos (4 features)
5. Interactions: ndvi_ndre_peak_diff, canopy_density_contrast (2 features)
6. Window summaries (early=Oct–Dec, peak=Jan–Mar, late=Apr–Jun) for ndvi/ndwi/ndre × mean/max (18 features)

**Total: 51 features** (see `FEATURE_ORDER_V1` for exact ordering).

Training junk columns dropped: `.geo`, `system:index`, `latitude`, `longitude`, `lat`, `lon`, `ID`, `parent_id`, `batch_id`, `is_syn`.

## DEA STAC

- Search endpoint: `https://explorer.digitalearth.africa/stac/search`
- Primary collection: `s2_l2a` (falls back to `s2_l2a_c1`, `sentinel-2-l2a`, `sentinel_2_l2a`)
- Required bands: red, green, blue, nir, nir08 (red-edge), swir16, swir22
- Cloud filter: `eo:cloud_cover < 30`

## Worker Pipeline Stages

`fetch_stac → build_features → load_dw → infer → smooth → export_cog → upload → done`

When real DEA STAC data is unavailable, worker falls back to synthetic features (seeded by year+coords) to allow end-to-end pipeline testing.

## Label Classes (V1, temporary)

35 classes including Maize, Tobacco, Soyabean, etc., defined as `CLASSES_V1` in `apps/worker/worker.py`. Extract dynamically from `model.classes_` when available; fall back to this list only if not present.

## Training Artifacts

`train.py --variant Raw` produces `artifacts/model_raw/`:

- `model.joblib`: VotingClassifier (soft) over RF + XGBoost + LightGBM + CatBoost
- `label_encoder.joblib`: sklearn LabelEncoder (maps string class → int)
- `selected_features.json`: feature subset chosen by scout RF (subset of FEATURE_ORDER_V1)
- `meta.json`: class names, n_features, config snapshot
- `metrics.json`: per-model accuracy/F1/classification report

`--variant Scaled` also emits `scaler.joblib`.
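
Because the trained model only sees the subset in `selected_features.json`, inference code has to present features in that same order. The following is a minimal sketch of that alignment step, not the repository's actual code: the helper names `load_selected_features` and `build_model_row` are hypothetical, and it assumes `selected_features.json` is a flat JSON list of feature names (the file's real layout is not documented here). The model itself would typically be loaded separately with `joblib.load`.

```python
import json
from pathlib import Path


def load_selected_features(artifact_dir):
    """Read the feature subset chosen at training time.

    Assumption: selected_features.json is a flat JSON list of names.
    """
    path = Path(artifact_dir) / "selected_features.json"
    names = json.loads(path.read_text())
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        raise ValueError(f"{path} is not a flat list of feature names")
    return names


def build_model_row(features, selected):
    """Order one sample's features exactly as listed in `selected`.

    `features` maps feature name -> value; extra keys are ignored,
    missing keys raise so a silent column shift cannot happen.
    """
    missing = [name for name in selected if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(features[name]) for name in selected]
```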
Models uploaded to MinIO via `--upload-minio` go under `geocrop-models` at the ROOT (no subfolders).

## Plans & Docs

`plan/` contains detailed step-by-step implementation plans (01–05) and an SRS. Read these before making significant architectural changes.

`ops/` contains MinIO upload scripts and storage setup docs.
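
## Validation Rules at a Glance

The constraints listed under Critical Non-Obvious Patterns (season window, longitude-first AOI, Zimbabwe bounds, radius limit, odd smoothing kernel) can be condensed into a few checks. This is an illustrative sketch only: the constants and the functions `season_dates`, `validate_aoi`, and `validate_kernel` are hypothetical stand-ins, and the real logic lives in `InferenceConfig.season_dates()`, `worker.py`, and `postprocess.py`.

```python
from datetime import date

# Limits quoted from this file; the constant names are illustrative.
ZIM_LON = (25.2, 33.1)
ZIM_LAT = (-22.5, -15.6)
MAX_RADIUS_M = 5000
ALLOWED_KERNELS = (3, 5, 7)


def season_dates(year):
    """Season runs from Sept 1 of `year` to May 31 of the following year."""
    return date(year, 9, 1), date(year + 1, 5, 31)


def validate_aoi(lon, lat, radius_m):
    """Check an AOI given as (lon, lat, radius_m): longitude first."""
    if not ZIM_LON[0] <= lon <= ZIM_LON[1]:
        raise ValueError(f"lon {lon} outside Zimbabwe bounds {ZIM_LON}")
    if not ZIM_LAT[0] <= lat <= ZIM_LAT[1]:
        raise ValueError(f"lat {lat} outside Zimbabwe bounds {ZIM_LAT}")
    if not 0 < radius_m <= MAX_RADIUS_M:
        raise ValueError(f"radius_m must be in (0, {MAX_RADIUS_M}]")
    return lon, lat, radius_m


def validate_kernel(size):
    """Smoothing kernel must be one of the allowed odd sizes."""
    if size not in ALLOWED_KERNELS:
        raise ValueError(f"kernel must be one of {ALLOWED_KERNELS}")
    return size
```

Passing coordinates in `(lat, lon)` order fails the longitude check for any point inside Zimbabwe, because the two ranges do not overlap, so a swapped AOI is rejected instead of silently shifted.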