CLAUDE.md
This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
What This Project Does
GeoCrop is a crop-type classification platform for Zimbabwe. It:
- Accepts an AOI (lat/lon + radius) and year via REST API
- Queues an inference job via Redis/RQ
- Worker fetches Sentinel-2 imagery from DEA STAC, computes 51 spectral features, loads a Dynamic World baseline, runs an ML model (XGBoost/LightGBM/CatBoost/Ensemble), and uploads COG results to MinIO
- Results are served via TiTiler (tile server reading COGs directly from MinIO over S3)
Build & Run Commands
# API
cd apps/api && pip install -r requirements.txt
uvicorn main:app --host 0.0.0.0 --port 8000
# Worker
cd apps/worker && pip install -r requirements.txt
python worker.py --worker # start RQ worker
python worker.py --test # syntax/import self-test only
# Web frontend (React + Vite + TypeScript)
cd apps/web && npm install
npm run dev # dev server (hot reload)
npm run build # production build → dist/
npm run lint # ESLint check
npm run preview # preview production build locally
# Training
cd training && python train.py --data /path/to/data.csv --out ./artifacts --variant Raw
# With MinIO upload:
MINIO_ENDPOINT=... MINIO_ACCESS_KEY=... MINIO_SECRET_KEY=... \
python train.py --data /path/to/data.csv --out ./artifacts --variant Raw --upload-minio
# Docker
docker build -t frankchine/geocrop-api:v1 apps/api/
docker build -t frankchine/geocrop-worker:v1 apps/worker/
Kubernetes Deployment
All k8s manifests are in k8s/ — numbered for apply order:
kubectl apply -f k8s/00-namespace.yaml
kubectl apply -f k8s/ # apply all in order
kubectl -n geocrop rollout restart deployment/geocrop-api
kubectl -n geocrop rollout restart deployment/geocrop-worker
Namespace: geocrop. Ingress class: nginx. ClusterIssuer: letsencrypt-prod.
Exposed hosts:
- portfolio.techarvest.co.zw → geocrop-web (nginx static)
- api.portfolio.techarvest.co.zw → geocrop-api:8000
- tiles.portfolio.techarvest.co.zw → geocrop-tiler:8000 (TiTiler)
- minio.portfolio.techarvest.co.zw → MinIO API
- console.minio.portfolio.techarvest.co.zw → MinIO Console
Architecture
Web (React/Vite/OL) → API (FastAPI) → Redis Queue (geocrop_tasks) → Worker (RQ)
↓
DEA STAC → feature_computation.py (51 features)
MinIO → dw_baseline.py (windowed read)
MinIO → inference.py (model load + predict)
→ postprocess.py (majority filter)
→ cog.py (write COG)
→ MinIO geocrop-results/
↓
TiTiler reads COGs from MinIO via S3 protocol
Job status is written to Redis at job:{job_id}:status with 24h expiry.
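A minimal sketch of the status convention above. Only the key pattern and the 24h expiry come from this file; the helper names and the JSON payload fields are hypothetical. The client is assumed to be a redis-py style object whose `set(name, value, ex=seconds)` accepts a TTL.

```python
import json

STATUS_TTL_SECONDS = 24 * 60 * 60  # 24h expiry, as stated above


def status_key(job_id: str) -> str:
    """Build the Redis key that holds a job's status."""
    return f"job:{job_id}:status"


def write_status(client, job_id: str, stage: str, message: str = "") -> None:
    """Store the status as JSON with a 24h TTL.

    `client` is assumed to expose set(name, value, ex=seconds), as redis-py does.
    The payload fields ("stage", "message") are hypothetical.
    """
    payload = json.dumps({"stage": stage, "message": message})
    client.set(status_key(job_id), payload, ex=STATUS_TTL_SECONDS)
```

Passing the client in keeps the helper testable without a live Redis instance.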
Web frontend (apps/web/): React 19 + TypeScript + Vite. Uses OpenLayers for the map (click-to-set-coordinates). Components: Login, Welcome, JobForm, StatusMonitor, MapComponent, Admin. State is in App.tsx; JWT token stored in localStorage.
API user store: Users are stored in an in-memory dict (USERS in apps/api/main.py) — lost on restart. Admin panel (/admin/users) manages users at runtime. Any user additions must be re-done after pod restarts unless the dict is seeded in code.
Critical Non-Obvious Patterns
Season window: Sept 1 → May 31 of the following year. year=2022 → 2022-09-01 to 2023-05-31. See InferenceConfig.season_dates() in apps/worker/config.py.
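The season rule can be sketched as a standalone function. This is an illustration of the date arithmetic only; the real InferenceConfig.season_dates() may have a different signature or return type.

```python
from datetime import date


def season_dates(year: int) -> tuple[date, date]:
    """Return (start, end) for the season that begins in `year`.

    The season runs from Sept 1 of `year` to May 31 of the following year.
    """
    return date(year, 9, 1), date(year + 1, 5, 31)


start, end = season_dates(2022)
print(start.isoformat(), end.isoformat())  # 2022-09-01 2023-05-31
```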
AOI format: (lon, lat, radius_m) — NOT (lat, lon). Longitude first everywhere in features.py.
Zimbabwe bounds: Lon 25.2–33.1, Lat -22.5 to -15.6 (enforced in worker.py validation).
Radius limit: Max 5000m enforced in both API (apps/api/main.py:90) and worker validation.
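A sketch of the AOI checks described above (longitude first, Zimbabwe bounds, 5000 m radius cap). The function name, the error messages, and treating the bounds as inclusive are assumptions; the actual validation in worker.py and apps/api/main.py may differ.

```python
ZIM_LON = (25.2, 33.1)    # from "Zimbabwe bounds" above
ZIM_LAT = (-22.5, -15.6)
MAX_RADIUS_M = 5000       # from "Radius limit" above


def validate_aoi(aoi):
    """Validate an AOI given as (lon, lat, radius_m). Longitude comes first."""
    lon, lat, radius_m = aoi
    if not ZIM_LON[0] <= lon <= ZIM_LON[1]:
        raise ValueError(f"lon {lon} outside Zimbabwe bounds {ZIM_LON}")
    if not ZIM_LAT[0] <= lat <= ZIM_LAT[1]:
        raise ValueError(f"lat {lat} outside Zimbabwe bounds {ZIM_LAT}")
    if not 0 < radius_m <= MAX_RADIUS_M:
        raise ValueError(f"radius {radius_m} m must be in (0, {MAX_RADIUS_M}]")
    return lon, lat, radius_m
```

A useful side effect: passing (lat, lon) by mistake for a point in Zimbabwe fails the longitude check, because Zimbabwe's latitudes are negative and fall outside the 25.2 to 33.1 range.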
RQ queue name: geocrop_tasks. Redis service: redis.geocrop.svc.cluster.local.
API vs worker function name mismatch: apps/api/main.py enqueues 'worker.run_inference' but the worker only defines run_job. Any new worker entry point must be named run_inference (or the API call must be updated) for end-to-end jobs to work.
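One low-risk way to reconcile the names, sketched under assumptions: the body and signature of run_job below are placeholders, and the alias is just one option (updating the string the API enqueues is the other).

```python
def run_job(job_params: dict) -> dict:
    """Stand-in for the worker's existing entry point (body is hypothetical)."""
    return {"status": "done", "params": job_params}


# Expose the name the API enqueues ('worker.run_inference') without renaming
# the existing function. A module-level alias is enough for lookup by name.
run_inference = run_job
```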
Smoothing kernel: Must be odd — 3, 5, or 7 only (postprocess.py).
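To illustrate why the kernel must be odd (the window needs a centre pixel), here is a pure-Python majority filter sketch. It is not the postprocess.py implementation: the edge handling (clipped windows) and the tie-breaking (first label encountered) are assumptions.

```python
from collections import Counter

ALLOWED_KERNELS = (3, 5, 7)


def majority_filter(grid, kernel=3):
    """Replace each cell with the most common label in its kernel x kernel window.

    Windows are clipped at the raster edge. On a tie, the label seen first in
    row-major order wins. Both choices are assumptions for this sketch.
    """
    if kernel not in ALLOWED_KERNELS:
        raise ValueError(f"kernel must be one of {ALLOWED_KERNELS}, got {kernel}")
    half = kernel // 2
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            window = [
                grid[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            out[r][c] = Counter(window).most_common(1)[0][0]
    return out
```

A single isolated pixel surrounded by another class is replaced by the surrounding class, which is the speckle removal this stage is meant to provide.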
Feature order: FEATURE_ORDER_V1 in feature_computation.py — exactly 51 scalar features. Order matters for model inference. Changing this breaks all existing models.
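Because the model consumes features by position, a named feature dict has to be flattened in the fixed order. A sketch of that step, using a toy three-name order in place of the real 51-entry FEATURE_ORDER_V1 (the names in the example are placeholders):

```python
def to_feature_vector(features: dict, feature_order: list) -> list:
    """Arrange named features into the positional order the model expects."""
    missing = [name for name in feature_order if name not in features]
    if missing:
        raise KeyError(f"missing features: {missing}")
    return [float(features[name]) for name in feature_order]


# Toy order for illustration only; the real list has exactly 51 entries.
toy_order = ["ndvi_max", "ndvi_min", "ndvi_mean"]
vector = to_feature_vector(
    {"ndvi_mean": 0.5, "ndvi_max": 0.9, "ndvi_min": 0.1}, toy_order
)
print(vector)  # [0.9, 0.1, 0.5]
```

Failing loudly on a missing name is deliberate: a silently shifted vector would still produce predictions, just wrong ones.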
MinIO Buckets & Path Conventions
| Bucket | Purpose | Path pattern |
|---|---|---|
| geocrop-models | ML model .pkl files | ROOT (no subfolders) |
| geocrop-baselines | Dynamic World COG tiles | dw/zim/summer/<season>/<type>/DW_Zim_<Type>_<year>_<year+1>-<row>-<col>.tif |
| geocrop-results | Output COGs | results/<job_id>/<filename> |
| geocrop-datasets | Training data CSVs | - |
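The path patterns in the table can be sketched as key builders. The function and parameter names are hypothetical, and the exact formats of the season, type, row, and col components are not specified here, so they are passed through as strings unchanged.

```python
def result_key(job_id: str, filename: str) -> str:
    """Object key inside geocrop-results."""
    return f"results/{job_id}/{filename}"


def dw_tile_key(season: str, type_dir: str, type_name: str,
                year: int, row: str, col: str) -> str:
    """Object key inside geocrop-baselines.

    `type_dir` fills <type> and `type_name` fills <Type>; how the two relate
    (for example, casing) is an assumption left to the caller.
    """
    return (
        f"dw/zim/summer/{season}/{type_dir}/"
        f"DW_Zim_{type_name}_{year}_{year + 1}-{row}-{col}.tif"
    )
```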
Model filenames (ROOT of geocrop-models):
- Zimbabwe_Ensemble_Raw_Model.pkl: no scaler needed
- Zimbabwe_XGBoost_Model.pkl, Zimbabwe_LightGBM_Model.pkl, Zimbabwe_RandomForest_Model.pkl: require scaler
- Zimbabwe_CatBoost_Raw_Model.pkl: no scaler
DW baseline tiles: COGs are 65536×65536 pixel tiles. Worker MUST use windowed reads via presigned URL — never download the full tile. Always transform AOI bbox to tile CRS before computing window.
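The window computation can be sketched without any raster library. This assumes a north-up tile with square pixels, a top-left origin, and a bbox that has already been transformed to the tile CRS; the function and parameter names are hypothetical.

```python
import math


def bbox_to_window(bbox, origin_x, origin_y, pixel_size, tile_size=65536):
    """Convert a bbox in the tile CRS to (col_off, row_off, width, height).

    Assumes a north-up tile: (origin_x, origin_y) is the top-left corner and
    y decreases as the row index increases.
    """
    min_x, min_y, max_x, max_y = bbox
    col_start = math.floor((min_x - origin_x) / pixel_size)
    col_stop = math.ceil((max_x - origin_x) / pixel_size)
    row_start = math.floor((origin_y - max_y) / pixel_size)
    row_stop = math.ceil((origin_y - min_y) / pixel_size)
    # Clip to the tile so the read never extends past its edges.
    col_start, col_stop = max(0, col_start), min(tile_size, col_stop)
    row_start, row_stop = max(0, row_start), min(tile_size, row_stop)
    if col_start >= col_stop or row_start >= row_stop:
        raise ValueError("bbox does not intersect the tile")
    return col_start, row_start, col_stop - col_start, row_stop - row_start
```

With rasterio, the resulting tuple maps onto `rasterio.windows.Window(col_off, row_off, width, height)`, so only the pixels under the AOI are fetched from the presigned URL.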
Environment Variables
| Variable | Default | Notes |
|---|---|---|
| REDIS_HOST | redis.geocrop.svc.cluster.local | Also supports REDIS_URL |
| MINIO_ENDPOINT | minio.geocrop.svc.cluster.local:9000 | |
| MINIO_ACCESS_KEY | minioadmin | |
| MINIO_SECRET_KEY | minioadmin123 | |
| MINIO_SECURE | false | |
| GEOCROP_CACHE_DIR | /tmp/geocrop-cache | |
| SECRET_KEY | (change in prod) | API JWT signing |
TiTiler uses AWS_S3_ENDPOINT_URL=http://minio.geocrop.svc.cluster.local:9000, AWS_HTTPS=NO, credentials from geocrop-secrets k8s secret.
Feature Engineering (must match training exactly)
Pipeline in feature_computation.py:
- Compute indices: ndvi, ndre, evi, savi, ci_re, ndwi
- Fill zeros linearly, then Savitzky-Golay smooth (window=5, polyorder=2)
- Phenology metrics for ndvi/ndre/evi: max, min, mean, std, amplitude, auc, peak_timestep, max_slope_up, max_slope_down (27 features)
- Harmonics for ndvi only: harmonic1_sin/cos, harmonic2_sin/cos (4 features)
- Interactions: ndvi_ndre_peak_diff, canopy_density_contrast (2 features)
- Window summaries (early=Oct–Dec, peak=Jan–Mar, late=Apr–Jun) for ndvi/ndwi/ndre × mean/max (18 features)
Total: 51 features — see FEATURE_ORDER_V1 for exact ordering.
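Step 2 of the pipeline (linear zero fill, then Savitzky-Golay with window=5, polyorder=2) can be sketched in pure Python. This is an illustration, not the feature_computation.py code: the handling of leading/trailing zeros and of the two end points on each side is an assumption, and a library routine such as scipy.signal.savgol_filter treats the edges differently by default.

```python
def fill_zeros_linear(series):
    """Linearly interpolate zero entries between the nearest non-zero neighbours.

    Leading and trailing zeros take the nearest non-zero value (an assumption).
    """
    valid = [i for i, v in enumerate(series) if v != 0]
    if not valid:
        return list(series)
    out = list(series)
    for i, v in enumerate(series):
        if v != 0:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            out[i] = series[right]
        elif right is None:
            out[i] = series[left]
        else:
            t = (i - left) / (right - left)
            out[i] = series[left] + t * (series[right] - series[left])
    return out


# Standard Savitzky-Golay smoothing coefficients for window=5, polyorder=2.
SG_COEFFS = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)


def savgol_5_2(series):
    """Smooth interior points; the two points at each end are left unchanged."""
    out = list(series)
    for i in range(2, len(series) - 2):
        out[i] = sum(c * series[i + k - 2] for k, c in enumerate(SG_COEFFS))
    return out
```

A quadratic fit reproduces quadratic input exactly, so smoothing a series such as 0, 1, 4, 9, 16 leaves the interior values unchanged, which is a quick sanity check on the coefficients.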
Training junk columns dropped: .geo, system:index, latitude, longitude, lat, lon, ID, parent_id, batch_id, is_syn.
DEA STAC
- Search endpoint: https://explorer.digitalearth.africa/stac/search
- Primary collection: s2_l2a (falls back to s2_l2a_c1, sentinel-2-l2a, sentinel_2_l2a)
- Required bands: red, green, blue, nir, nir08 (red-edge), swir16, swir22
- Cloud filter: eo:cloud_cover < 30
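The settings above could be combined into a search request body along these lines. This is a sketch: it assumes the endpoint accepts a POST JSON body with the STAC API query extension for the cloud filter, and the function name and `limit` value are hypothetical.

```python
STAC_SEARCH_URL = "https://explorer.digitalearth.africa/stac/search"


def build_search_body(bbox, start, end, collection="s2_l2a",
                      max_cloud=30, limit=100):
    """Build a STAC search body for one collection.

    bbox is [min_lon, min_lat, max_lon, max_lat]; start and end are
    RFC 3339 timestamps joined into a closed interval.
    """
    return {
        "collections": [collection],
        "bbox": list(bbox),
        "datetime": f"{start}/{end}",
        "query": {"eo:cloud_cover": {"lt": max_cloud}},  # cloud cover < 30
        "limit": limit,
    }


body = build_search_body(
    (30.9, -17.9, 31.1, -17.7),
    "2022-09-01T00:00:00Z",
    "2023-05-31T23:59:59Z",
)
```

To apply the fallback order, the same body can be rebuilt with each alternative collection name until a search returns items.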
Worker Pipeline Stages
fetch_stac → build_features → load_dw → infer → smooth → export_cog → upload → done
When real DEA STAC data is unavailable, the worker falls back to synthetic features (seeded by year+coords) so the pipeline can still be tested end to end.
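A sketch of what a deterministic fallback seeded by year and coordinates might look like. The seed recipe, the value range, and the function name are assumptions; the worker's actual synthetic generator may differ.

```python
import random


def synthetic_features(year, lon, lat, n_features=51):
    """Return deterministic pseudo-random features for pipeline testing.

    The same (year, lon, lat) always yields the same values. The seed string
    format is hypothetical.
    """
    seed = f"{year}:{lon:.5f}:{lat:.5f}"
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_features)]
```

Seeding from the inputs keeps test runs reproducible: re-submitting the same AOI and year yields the same output raster.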
Label Classes (V1 — temporary)
35 classes including Maize, Tobacco, Soyabean, etc. — defined as CLASSES_V1 in apps/worker/worker.py. Extract dynamically from model.classes_ when available; fall back to this list only if not present.
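The lookup-with-fallback rule can be sketched as below. The fallback list is a truncated placeholder (the real CLASSES_V1 has 35 entries) and the function name is hypothetical.

```python
# Truncated placeholder; the real CLASSES_V1 in apps/worker/worker.py has 35 entries.
CLASSES_V1_FALLBACK = ["Maize", "Tobacco", "Soyabean"]


def resolve_class_names(model, fallback=CLASSES_V1_FALLBACK):
    """Prefer model.classes_ when present; otherwise use the static list."""
    classes = getattr(model, "classes_", None)
    if classes is None or len(classes) == 0:
        return list(fallback)
    return [str(c) for c in classes]
```

If the model was trained on encoded labels, classes_ may hold integers that still need mapping back through the label encoder.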
Training Artifacts
train.py --variant Raw produces artifacts/model_raw/:
- model.joblib: VotingClassifier (soft) over RF + XGBoost + LightGBM + CatBoost
- label_encoder.joblib: sklearn LabelEncoder (maps string class → int)
- selected_features.json: feature subset chosen by scout RF (subset of FEATURE_ORDER_V1)
- meta.json: class names, n_features, config snapshot
- metrics.json: per-model accuracy/F1/classification report
--variant Scaled also emits scaler.joblib. Models uploaded to MinIO via --upload-minio go under geocrop-models at the ROOT (no subfolders).
Plans & Docs
plan/ contains detailed step-by-step implementation plans (01–05) and an SRS. Read these before making significant architectural changes. ops/ contains MinIO upload scripts and storage setup docs.