# Plan: Updated Inference Worker - Training Parity

**Status**: Draft
**Date**: 2026-02-28

---

## Objective

Update the inference worker (`apps/worker/inference.py`, `apps/worker/features.py`, `apps/worker/config.py`) to match the training pipeline in `train.py` exactly, so that features computed during inference are identical to those used during model training.

---

## 1. Gap Analysis

### Current State vs Required

| Component | Current (Worker) | Required (Train.py) | Gap |
|-----------|-----------------|---------------------|-----|
| Feature Engineering | Placeholder (zeros) | Full pipeline | **CRITICAL** |
| Model Loading | Expected bundle format | Individual .pkl files | Medium |
| Indices | ndvi, evi, savi only | + ndre, ci_re, ndwi | Medium |
| Smoothing | Savitzky-Golay (window=5, polyorder=2) | Same | OK |
| Phenology | Not implemented | amplitude, AUC, max_slope, peak_timestep | **CRITICAL** |
| Harmonics | Not implemented | 1st/2nd order sin/cos | **CRITICAL** |
| Seasonal Windows | Not implemented | Early/Peak/Late | **CRITICAL** |

---

## 2. Feature Engineering Pipeline (from train.py)

### 2.1 Smoothing

```python
# From train.py apply_smoothing():
# 1. Replace 0 with NaN
# 2. Linear interpolate across time (axis=1), fillna(0)
# 3. Savitzky-Golay: window_length=5, polyorder=2
```

### 2.2 Phenology Metrics (per index)

- `idx_max`, `idx_min`, `idx_mean`, `idx_std`
- `idx_amplitude` = max - min
- `idx_auc` = trapezoidal integral with dx=10
- `idx_peak_timestep` = argmax index
- `idx_max_slope_up` = max(diff)
- `idx_max_slope_down` = min(diff)

### 2.3 Harmonic Features (per index, normalized)

- `idx_harmonic1_sin` = dot(values, sin_t) / n_dates
- `idx_harmonic1_cos` = dot(values, cos_t) / n_dates
- `idx_harmonic2_sin` = dot(values, sin_2t) / n_dates
- `idx_harmonic2_cos` = dot(values, cos_2t) / n_dates

### 2.4 Seasonal Windows (Zimbabwe: Oct-Jun)

- **Early**: Oct-Dec (months 10, 11, 12)
- **Peak**: Jan-Mar (months 1, 2, 3)
- **Late**: Apr-Jun (months 4, 5, 6)

For each window and each index:

- `idx_early_mean`, `idx_early_max`
- `idx_peak_mean`, `idx_peak_max`
- `idx_late_mean`, `idx_late_max`

### 2.5 Interactions

- `ndvi_ndre_peak_diff` = ndvi_max - ndre_max
- `canopy_density_contrast` = evi_mean / (ndvi_mean + 0.001)

---

## 3. Model Loading Strategy

### Current MinIO Files

```
geocrop-models/
  Zimbabwe_CatBoost_Model.pkl
  Zimbabwe_CatBoost_Raw_Model.pkl
  Zimbabwe_Ensemble_Raw_Model.pkl
  Zimbabwe_LightGBM_Model.pkl
  Zimbabwe_LightGBM_Raw_Model.pkl
  Zimbabwe_RandomForest_Model.pkl
  Zimbabwe_XGBoost_Model.pkl
```

### Mapping to Inference

| Model Name (Job) | MinIO File | Scaler Required |
|------------------|------------|-----------------|
| Ensemble | Zimbabwe_Ensemble_Raw_Model.pkl | No (Raw) |
| Ensemble_Scaled | Zimbabwe_Ensemble_Model.pkl | Yes |
| RandomForest | Zimbabwe_RandomForest_Model.pkl | Yes |
| XGBoost | Zimbabwe_XGBoost_Model.pkl | Yes |
| LightGBM | Zimbabwe_LightGBM_Model.pkl | Yes |
| CatBoost | Zimbabwe_CatBoost_Model.pkl | Yes |

**Note**: The "_Raw" suffix means no scaling is needed; models without "_Raw" require the StandardScaler. (`Zimbabwe_Ensemble_Model.pkl` does not appear in the MinIO listing above, so Ensemble_Scaled cannot be served until it is uploaded.)

### Label Handling

Since the label_encoder is not in MinIO, we need one of the following:

1. Store the label_encoder alongside the model in MinIO (future)
2.
   Hardcode the class mapping based on training data (temporary)
3. Derive the classes from the model's `classes_` attribute, if present

---

## 4. Implementation Plan

### 4.1 Update `apps/worker/features.py`

Add new functions:

- `apply_smoothing(df, indices)` - Savitzky-Golay with 0-interpolation
- `extract_phenology(df, dates, indices)` - Phenology metrics
- `add_harmonics(df, dates, indices)` - Fourier features
- `add_interactions_and_windows(df, dates)` - Seasonal windows + interactions

Update:

- `build_feature_stack_from_dea()` - Full DEA STAC loading + feature computation

### 4.2 Update `apps/worker/inference.py`

Modify:

- `load_model_artifacts()` - Map model name to MinIO filename
- Add scaler detection based on model name (_Raw vs _Scaled)
- Handle label encoder (create default or load from metadata)

### 4.3 Update `apps/worker/config.py`

Add:

- `MinIOStorage` class implementation
- Model name to filename mapping
- MinIO client configuration

### 4.4 Update `apps/worker/requirements.txt`

Add dependencies:

- `scipy` (for savgol_filter, trapezoid)
- `pystac-client`
- `stackstac`
- `xarray`
- `rioxarray`

---

## 5. Data Flow

```mermaid
graph TD
    A[Job: aoi, year, model] --> B[Query DEA STAC]
    B --> C[Load Sentinel-2 scenes]
    C --> D[Compute indices: ndvi, ndre, evi, savi, ci_re, ndwi]
    D --> E[Apply Savitzky-Golay smoothing]
    E --> F[Extract phenology metrics]
    F --> G[Add harmonic features]
    G --> H[Add seasonal window stats]
    H --> I[Add interactions]
    I --> J[Align to target grid]
    J --> K[Load model from MinIO]
    K --> L[Apply scaler if needed]
    L --> M[Predict per-pixel]
    M --> N[Majority filter smoothing]
    N --> O[Upload COG to MinIO]
```

---

## 6. Key Functions to Implement

### features.py

```python
# Smoothing
def apply_smoothing(df, indices=['ndvi', 'ndre', 'evi', 'savi', 'ci_re', 'ndwi']):
    """Apply Savitzky-Golay smoothing with 0-interpolation."""
    # 1. Replace 0 with NaN
    # 2. Linear interpolate across time axis
    # 3. savgol_filter(window_length=5, polyorder=2)

# Phenology
def extract_phenology(df, dates, indices=['ndvi', 'ndre', 'evi']):
    """Extract amplitude, AUC, peak_timestep, max_slope."""

# Harmonics
def add_harmonics(df, dates, indices=['ndvi']):
    """Add 1st and 2nd order harmonic features."""

# Seasonal Windows
def add_interactions_and_windows(df, dates):
    """Add Early/Peak/Late window stats + interactions."""
```

---

## 7. Acceptance Criteria

- [ ] Worker computes the exact same features as the training pipeline
- [ ] All indices (ndvi, ndre, evi, savi, ci_re, ndwi) computed
- [ ] Savitzky-Golay smoothing applied correctly
- [ ] Phenology metrics (amplitude, AUC, peak, slope) computed
- [ ] Harmonic features (sin/cos, 1st and 2nd order) computed
- [ ] Seasonal window stats (Early/Peak/Late) computed
- [ ] Model loads from the current MinIO format (Zimbabwe_*.pkl)
- [ ] Scaler applied only for non-Raw models
- [ ] Results uploaded to MinIO as COG

---

## 8. Files to Modify

| File | Changes |
|------|---------|
| `apps/worker/features.py` | Add feature engineering functions, update `build_feature_stack_from_dea` |
| `apps/worker/inference.py` | Update model loading, add scaler detection |
| `apps/worker/config.py` | Add MinIOStorage implementation |
| `apps/worker/requirements.txt` | Add scipy, pystac-client, stackstac, xarray, rioxarray |
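---

## Appendix A: Sketches

The smoothing, phenology, and harmonic steps (Sections 2.1-2.3) can be sketched as below. This is a minimal illustration, not the final `apps/worker/features.py` API: the `(n_pixels, n_timesteps)` array layout, the single-index signatures, and the day-of-year phase for the harmonics are assumptions to be verified against `train.py`.

```python
import numpy as np
import pandas as pd
from scipy.integrate import trapezoid
from scipy.signal import savgol_filter


def apply_smoothing(df: pd.DataFrame) -> pd.DataFrame:
    """Smoothing per Section 2.1: 0 -> NaN, linear interpolation along
    the time axis (axis=1), fillna(0), then Savitzky-Golay."""
    filled = df.replace(0, np.nan).interpolate(axis=1).fillna(0)
    smoothed = savgol_filter(filled.to_numpy(), window_length=5, polyorder=2, axis=1)
    return pd.DataFrame(smoothed, index=df.index, columns=df.columns)


def extract_phenology(values: np.ndarray, prefix: str) -> dict:
    """Phenology metrics per Section 2.2; values is (n_pixels, n_timesteps)."""
    diffs = np.diff(values, axis=1)
    return {
        f"{prefix}_max": values.max(axis=1),
        f"{prefix}_min": values.min(axis=1),
        f"{prefix}_mean": values.mean(axis=1),
        f"{prefix}_std": values.std(axis=1),
        f"{prefix}_amplitude": values.max(axis=1) - values.min(axis=1),
        f"{prefix}_auc": trapezoid(values, dx=10, axis=1),
        f"{prefix}_peak_timestep": values.argmax(axis=1),
        f"{prefix}_max_slope_up": diffs.max(axis=1),
        f"{prefix}_max_slope_down": diffs.min(axis=1),
    }


def add_harmonics(values: np.ndarray, dates: pd.DatetimeIndex, prefix: str) -> dict:
    """Harmonic features per Section 2.3. The day-of-year phase here is an
    assumption; confirm the exact time base against train.py."""
    t = 2 * np.pi * dates.dayofyear.to_numpy() / 365.0
    n = len(dates)
    return {
        f"{prefix}_harmonic1_sin": values @ np.sin(t) / n,
        f"{prefix}_harmonic1_cos": values @ np.cos(t) / n,
        f"{prefix}_harmonic2_sin": values @ np.sin(2 * t) / n,
        f"{prefix}_harmonic2_cos": values @ np.cos(2 * t) / n,
    }
```

`scipy.integrate.trapezoid` is used instead of `np.trapz`/`np.trapezoid` because it has a stable name across NumPy versions and matches the scipy dependency in Section 4.4.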
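The seasonal-window and interaction features (Sections 2.4-2.5) reduce to month masks over the date axis plus two arithmetic combinations. A sketch under the same `(n_pixels, n_dates)` assumption; `window_stats` and `interactions` are illustrative helper names, not the plan's `add_interactions_and_windows` signature.

```python
import numpy as np
import pandas as pd

# Zimbabwe season windows (Oct-Jun) from Section 2.4.
WINDOWS = {"early": (10, 11, 12), "peak": (1, 2, 3), "late": (4, 5, 6)}


def window_stats(values: np.ndarray, dates: pd.DatetimeIndex, prefix: str) -> dict:
    """Per-window mean/max for one index; values is (n_pixels, n_dates)."""
    months = dates.month.to_numpy()
    feats = {}
    for name, window_months in WINDOWS.items():
        sub = values[:, np.isin(months, window_months)]
        feats[f"{prefix}_{name}_mean"] = sub.mean(axis=1)
        feats[f"{prefix}_{name}_max"] = sub.max(axis=1)
    return feats


def interactions(ndvi_max, ndre_max, evi_mean, ndvi_mean) -> dict:
    """Interaction features from Section 2.5."""
    return {
        "ndvi_ndre_peak_diff": ndvi_max - ndre_max,
        "canopy_density_contrast": evi_mean / (ndvi_mean + 0.001),
    }
```

Note that a window with no observations would make `sub.mean(axis=1)` raise; the real implementation should decide whether empty windows yield 0, NaN, or an error.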
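The model-name mapping and scaler detection from Section 3 can be captured in a small lookup table inside `load_model_artifacts()`. In this sketch, `MODEL_FILES`, `resolve_model`, and `infer_labels` are hypothetical helper names; the filenames and the "_Raw" convention come from the tables above.

```python
# Job model name -> (MinIO object key, needs_scaler), per the Section 3 table.
MODEL_FILES = {
    "Ensemble": ("Zimbabwe_Ensemble_Raw_Model.pkl", False),
    "Ensemble_Scaled": ("Zimbabwe_Ensemble_Model.pkl", True),
    "RandomForest": ("Zimbabwe_RandomForest_Model.pkl", True),
    "XGBoost": ("Zimbabwe_XGBoost_Model.pkl", True),
    "LightGBM": ("Zimbabwe_LightGBM_Model.pkl", True),
    "CatBoost": ("Zimbabwe_CatBoost_Model.pkl", True),
}


def resolve_model(model_name: str):
    """Map a job's model name to (MinIO object key, needs_scaler)."""
    try:
        filename, needs_scaler = MODEL_FILES[model_name]
    except KeyError:
        raise ValueError(f"unknown model name: {model_name!r}")
    # Sanity check: "_Raw_" in the filename must agree with the scaler flag.
    assert ("_Raw_" in filename) == (not needs_scaler)
    return filename, needs_scaler


def infer_labels(model):
    """Label-handling fallback (option 3): read classes_ off the model if present."""
    classes = getattr(model, "classes_", None)
    return None if classes is None else list(classes)
```

`infer_labels` returning `None` signals that the worker must fall back to the hardcoded mapping (option 2) until the label_encoder is stored in MinIO (option 1).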