Plan: Updated Inference Worker - Training Parity

Status: Draft
Date: 2026-02-28


Objective

Update the inference worker (apps/worker/inference.py, apps/worker/features.py, apps/worker/config.py) so that its feature computation matches the training pipeline in train.py exactly. Features computed at inference time must be identical to those the models were trained on; otherwise the models receive inputs they never saw during training.


1. Gap Analysis

Current State vs Required

Component           | Current (Worker)                        | Required (train.py)                         | Gap
--------------------|-----------------------------------------|---------------------------------------------|---------
Feature Engineering | Placeholder (zeros)                     | Full pipeline                               | CRITICAL
Model Loading       | Expected bundle format                  | Individual .pkl files                       | Medium
Indices             | ndvi, evi, savi only                    | + ndre, ci_re, ndwi                         | Medium
Smoothing           | Savitzky-Golay (window=5, polyorder=2)  | Implemented                                 | OK
Phenology           | Not implemented                         | amplitude, AUC, max_slope, peak_timestep    | CRITICAL
Harmonics           | Not implemented                         | 1st/2nd order sin/cos                       | CRITICAL
Seasonal Windows    | Not implemented                         | Early/Peak/Late                             | CRITICAL

2. Feature Engineering Pipeline (from train.py)

2.1 Smoothing

# From train.py apply_smoothing():
# 1. Replace 0 with NaN
# 2. Linear interpolate across time (axis=1), fillna(0)
# 3. Savitzky-Golay: window_length=5, polyorder=2
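
The three steps above could be sketched as follows. This is a minimal illustration, not the code from train.py: the function name smooth_index and the DataFrame layout (one row per sample, one column per timestep of a single index) are assumptions, and pandas is used only because the notes mention axis=1 and fillna(0).

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter


def smooth_index(wide: pd.DataFrame) -> pd.DataFrame:
    """Smooth one index; rows are samples, columns are ordered timesteps.

    Hypothetical helper: the real layout in train.py may differ.
    """
    # 1. Treat exact zeros as missing observations.
    filled = wide.astype(float).replace(0, np.nan)
    # 2. Linear interpolation along time, then zero-fill whatever is still
    #    NaN (for example, leading gaps with no earlier observation).
    filled = filled.interpolate(method="linear", axis=1).fillna(0)
    # 3. Savitzky-Golay filter along the time axis.
    smoothed = savgol_filter(
        filled.to_numpy(), window_length=5, polyorder=2, axis=1
    )
    return pd.DataFrame(smoothed, index=wide.index, columns=wide.columns)


# One sample with a dropout (0.0) at the third timestep.
demo = pd.DataFrame([[0.1, 0.2, 0.0, 0.4, 0.5, 0.6, 0.7]])
result = smooth_index(demo)
```

Because a degree-2 Savitzky-Golay filter reproduces linear trends, the interpolated demo row comes back unchanged, which makes it a convenient parity fixture.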

2.2 Phenology Metrics (per index)

  • idx_max, idx_min, idx_mean, idx_std
  • idx_amplitude = max - min
  • idx_auc = trapezoidal-rule integral of the series over time, with dx=10
  • idx_peak_timestep = argmax index
  • idx_max_slope_up = max(diff)
  • idx_max_slope_down = min(diff)
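
As a dependency-free sketch of the metrics listed above, the following computes them for a single smoothed series. The function name phenology_metrics is hypothetical, and the use of the population standard deviation is an assumption; train.py may use a different convention (for example the pandas default, ddof=1), which would need to be matched.

```python
from statistics import fmean, pstdev


def phenology_metrics(values, prefix, dx=10.0):
    """Return the per-index phenology metrics for one time series."""
    # Differences between consecutive timesteps.
    diffs = [b - a for a, b in zip(values, values[1:])]
    # Trapezoidal rule with a constant spacing of dx between timesteps.
    auc = sum((a + b) / 2.0 * dx for a, b in zip(values, values[1:]))
    return {
        f"{prefix}_max": max(values),
        f"{prefix}_min": min(values),
        f"{prefix}_mean": fmean(values),
        # Assumption: population standard deviation (ddof=0).
        f"{prefix}_std": pstdev(values),
        f"{prefix}_amplitude": max(values) - min(values),
        f"{prefix}_auc": auc,
        # Index of the first maximum, equivalent to argmax.
        f"{prefix}_peak_timestep": values.index(max(values)),
        f"{prefix}_max_slope_up": max(diffs),
        f"{prefix}_max_slope_down": min(diffs),
    }


metrics = phenology_metrics([0.2, 0.4, 0.8, 0.6, 0.3], "ndvi")
```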

2.3 Harmonic Features (per index, normalized)

  • idx_harmonic1_sin = dot(values, sin_t) / n_dates
  • idx_harmonic1_cos = dot(values, cos_t) / n_dates
  • idx_harmonic2_sin = dot(values, sin_2t) / n_dates
  • idx_harmonic2_cos = dot(values, cos_2t) / n_dates
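
The projections above can be illustrated with plain Python. The plan does not define sin_t or cos_t, so this sketch assumes t_i = 2π·i / n_dates (evenly spaced timesteps covering one full cycle); the definition in train.py must be checked before reuse. The name harmonic_features is hypothetical.

```python
import math


def harmonic_features(values, prefix):
    """Project a series onto 1st- and 2nd-order sine and cosine terms."""
    n_dates = len(values)
    # Assumption: timesteps are evenly spaced over one full cycle.
    t = [2.0 * math.pi * i / n_dates for i in range(n_dates)]
    features = {}
    for order in (1, 2):
        sin_k = [math.sin(order * ti) for ti in t]
        cos_k = [math.cos(order * ti) for ti in t]
        # Dot product with each basis vector, normalized by n_dates.
        features[f"{prefix}_harmonic{order}_sin"] = (
            sum(v * s for v, s in zip(values, sin_k)) / n_dates
        )
        features[f"{prefix}_harmonic{order}_cos"] = (
            sum(v * c for v, c in zip(values, cos_k)) / n_dates
        )
    return features


# A pure first-order cosine: only harmonic1_cos should be non-zero.
series = [math.cos(2.0 * math.pi * i / 8) for i in range(8)]
feats = harmonic_features(series, "ndvi")
```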

2.4 Seasonal Windows (Zimbabwe: Oct-Jun)

  • Early: Oct-Dec (months 10,11,12)
  • Peak: Jan-Mar (months 1,2,3)
  • Late: Apr-Jun (months 4,5,6)

For each window and each index:

  • idx_early_mean, idx_early_max
  • idx_peak_mean, idx_peak_max
  • idx_late_mean, idx_late_max
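
A minimal sketch of the window statistics, assuming each timestep carries a calendar date and is assigned to a window by month. The name window_stats and the 0.0 fallback for an empty window are assumptions, not behavior confirmed from train.py.

```python
from datetime import date
from statistics import fmean

# Month sets for the Zimbabwe season (Oct-Jun), as listed above.
SEASON_WINDOWS = {
    "early": (10, 11, 12),
    "peak": (1, 2, 3),
    "late": (4, 5, 6),
}


def window_stats(values, dates, prefix):
    """Mean and max of one index within each seasonal window."""
    features = {}
    for name, months in SEASON_WINDOWS.items():
        window = [v for v, d in zip(values, dates) if d.month in months]
        # Assumption: a window with no observations yields 0.0.
        features[f"{prefix}_{name}_mean"] = fmean(window) if window else 0.0
        features[f"{prefix}_{name}_max"] = max(window) if window else 0.0
    return features


obs_dates = [
    date(2023, 10, 15),
    date(2023, 12, 15),
    date(2024, 2, 15),
    date(2024, 5, 15),
]
stats = window_stats([0.2, 0.4, 0.8, 0.3], obs_dates, "ndvi")
```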

2.5 Interactions

  • ndvi_ndre_peak_diff = ndvi_max - ndre_max
  • canopy_density_contrast = evi_mean / (ndvi_mean + 0.001)
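
For illustration, the two interaction terms can be derived from an already-computed feature dictionary. The function name and the dictionary-based interface are assumptions; the 0.001 offset is the one given above and keeps the ratio finite when ndvi_mean is zero.

```python
def interaction_features(feats, eps=0.001):
    """Derive the two interaction features from per-index statistics."""
    return {
        "ndvi_ndre_peak_diff": feats["ndvi_max"] - feats["ndre_max"],
        # eps guards against division by zero when ndvi_mean is 0.
        "canopy_density_contrast": feats["evi_mean"] / (feats["ndvi_mean"] + eps),
    }


out = interaction_features(
    {"ndvi_max": 0.8, "ndre_max": 0.5, "evi_mean": 0.3, "ndvi_mean": 0.0}
)
```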

3. Model Loading Strategy

Current MinIO Files

geocrop-models/
  Zimbabwe_CatBoost_Model.pkl
  Zimbabwe_CatBoost_Raw_Model.pkl
  Zimbabwe_Ensemble_Raw_Model.pkl
  Zimbabwe_LightGBM_Model.pkl
  Zimbabwe_LightGBM_Raw_Model.pkl
  Zimbabwe_RandomForest_Model.pkl
  Zimbabwe_XGBoost_Model.pkl

Mapping to Inference

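Model Name (Job) | MinIO File                       | Scaler Required
-----------------|----------------------------------|----------------
Ensemble         | Zimbabwe_Ensemble_Raw_Model.pkl  | No (Raw)
Ensemble_Scaled  | Zimbabwe_Ensemble_Model.pkl      | Yes
RandomForest     | Zimbabwe_RandomForest_Model.pkl  | Yes
XGBoost          | Zimbabwe_XGBoost_Model.pkl       | Yes
LightGBM         | Zimbabwe_LightGBM_Model.pkl      | Yes
CatBoost         | Zimbabwe_CatBoost_Model.pkl      | Yes

Note: A "_Raw" suffix in the filename means the model takes unscaled features. Models without "_Raw" need a StandardScaler. Zimbabwe_Ensemble_Model.pkl does not appear in the current MinIO listing above, so its availability must be confirmed before Ensemble_Scaled can be served. The Raw CatBoost and LightGBM files in the listing are not mapped to any job model name.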
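
The mapping and the scaler rule could be expressed as a small lookup. This is a sketch: the names MODEL_FILES and resolve_model are hypothetical, and the scaler decision simply checks the filename for "_Raw".

```python
# Job model name -> MinIO object name, copied from the mapping table.
MODEL_FILES = {
    "Ensemble": "Zimbabwe_Ensemble_Raw_Model.pkl",
    # This file is not in the current MinIO listing; confirm before use.
    "Ensemble_Scaled": "Zimbabwe_Ensemble_Model.pkl",
    "RandomForest": "Zimbabwe_RandomForest_Model.pkl",
    "XGBoost": "Zimbabwe_XGBoost_Model.pkl",
    "LightGBM": "Zimbabwe_LightGBM_Model.pkl",
    "CatBoost": "Zimbabwe_CatBoost_Model.pkl",
}


def resolve_model(model_name):
    """Return (filename, needs_scaler) for a job's model name."""
    try:
        filename = MODEL_FILES[model_name]
    except KeyError:
        raise ValueError(f"Unknown model name: {model_name!r}") from None
    # Files with "_Raw" in the name were trained on unscaled features.
    needs_scaler = "_Raw" not in filename
    return filename, needs_scaler


ensemble = resolve_model("Ensemble")
xgboost = resolve_model("XGBoost")
```

Keying the scaler decision on the filename rather than the job name avoids a second table that could drift out of sync with the first.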

Label Handling

Since label_encoder is not in MinIO, we need to either:

  1. Store label_encoder alongside model in MinIO (future)
  2. Hardcode class mapping based on training data (temporary)
  3. Derive from model if it has classes_ attribute
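
One way to combine the three options is a fallback chain, sketched below. The function name and its parameters are hypothetical, and the ordering (stored encoder, then the model's classes_, then a hardcoded list) is an assumption about priority rather than something the plan specifies.

```python
def resolve_class_labels(model, label_encoder=None, fallback_classes=None):
    """Pick class labels from the best available source."""
    # Option 1: an encoder stored alongside the model.
    if label_encoder is not None:
        return list(label_encoder.classes_)
    # Option 3: the model exposes its own classes_ attribute.
    classes = getattr(model, "classes_", None)
    if classes is not None:
        return list(classes)
    # Option 2: a temporary hardcoded mapping supplied by the caller.
    if fallback_classes is not None:
        return list(fallback_classes)
    raise ValueError("No label source available for this model")


class _StubModel:
    """Stand-in for a fitted estimator; the labels are placeholders."""
    classes_ = ["class_a", "class_b"]


from_model = resolve_class_labels(_StubModel())
from_fallback = resolve_class_labels(object(), fallback_classes=("class_x",))
```

If the models were trained on integer-encoded labels, classes_ will contain those integers rather than crop names, so option 3 alone would not recover human-readable labels.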

4. Implementation Plan

4.1 Update apps/worker/features.py

Add new functions:

  • apply_smoothing(df, indices) - Savitzky-Golay smoothing after interpolating over zero values
  • extract_phenology(df, dates, indices) - Phenology metrics
  • add_harmonics(df, dates, indices) - Fourier features
  • add_interactions_and_windows(df, dates) - Seasonal windows + interactions

Update:

  • build_feature_stack_from_dea() - Full DEA STAC loading + feature computation

4.2 Update apps/worker/inference.py

Modify:

  • load_model_artifacts() - Map model name to MinIO filename
  • Add scaler detection based on the resolved MinIO filename ("_Raw" in the filename means no scaler; see section 3)
  • Handle label encoder (create default or load from metadata)

4.3 Update apps/worker/config.py

Add:

  • MinIOStorage class implementation
  • Model name to filename mapping
  • MinIO client configuration
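
A possible shape for the storage wrapper is sketched below. Everything except the bucket name geocrop-models is an assumption: the class layout, the method names, and the environment variable names (MINIO_ENDPOINT, MINIO_ACCESS_KEY, MINIO_SECRET_KEY, MINIO_SECURE) are hypothetical. The client is injected so the class can be exercised without a running MinIO server; from_settings builds a real client with the minio package.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class MinIOSettings:
    endpoint: str
    access_key: str
    secret_key: str
    secure: bool = False
    bucket: str = "geocrop-models"

    @classmethod
    def from_env(cls, env=None):
        # Hypothetical variable names; adjust to the deployment.
        env = os.environ if env is None else env
        return cls(
            endpoint=env.get("MINIO_ENDPOINT", "localhost:9000"),
            access_key=env.get("MINIO_ACCESS_KEY", ""),
            secret_key=env.get("MINIO_SECRET_KEY", ""),
            secure=env.get("MINIO_SECURE", "false").lower() == "true",
        )


class MinIOStorage:
    def __init__(self, client, bucket):
        self._client = client
        self._bucket = bucket

    @classmethod
    def from_settings(cls, settings):
        from minio import Minio  # imported lazily; needs the minio package

        client = Minio(
            settings.endpoint,
            access_key=settings.access_key,
            secret_key=settings.secret_key,
            secure=settings.secure,
        )
        return cls(client, settings.bucket)

    def read_bytes(self, object_name):
        """Download one object fully into memory."""
        response = self._client.get_object(self._bucket, object_name)
        try:
            return response.read()
        finally:
            # Release the underlying HTTP connection.
            response.close()
            response.release_conn()


class _FakeResponse:
    def __init__(self, payload):
        self._payload = payload

    def read(self):
        return self._payload

    def close(self):
        pass

    def release_conn(self):
        pass


class _FakeClient:
    """In-memory stand-in for a MinIO client, for local testing only."""

    def get_object(self, bucket_name, object_name):
        return _FakeResponse(f"{bucket_name}/{object_name}".encode())


storage = MinIOStorage(_FakeClient(), "geocrop-models")
payload = storage.read_bytes("Zimbabwe_XGBoost_Model.pkl")
```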

4.4 Update apps/worker/requirements.txt

Add dependencies:

  • scipy (for savgol_filter, trapezoid)
  • pystac-client
  • stackstac
  • xarray
  • rioxarray

5. Data Flow

graph TD
    A[Job: aoi, year, model] --> B[Query DEA STAC]
    B --> C[Load Sentinel-2 scenes]
    C --> D[Compute indices: ndvi, ndre, evi, savi, ci_re, ndwi]
    D --> E[Apply Savitzky-Golay smoothing]
    E --> F[Extract phenology metrics]
    F --> G[Add harmonic features]
    G --> H[Add seasonal window stats]
    H --> I[Add interactions]
    I --> J[Align to target grid]
    J --> K[Load model from MinIO]
    K --> L[Apply scaler if needed]
    L --> M[Predict per-pixel]
    M --> N[Majority filter smoothing]
    N --> O[Upload COG to MinIO]
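
The majority filter step (M to N) is not specified further in this plan. As an illustration only, a 3x3 majority filter over a grid of predicted class IDs might look like this; the window size, the edge handling (windows clipped at the border), and the tie rule (keep the original value if it is among the winners) are all assumptions.

```python
from collections import Counter


def majority_filter(grid, size=3):
    """Replace each cell with the most common class in its window."""
    half = size // 2
    rows, cols = len(grid), len(grid[0])
    out = [row[:] for row in grid]
    for r in range(rows):
        for c in range(cols):
            # Window is clipped at the grid border.
            window = [
                grid[rr][cc]
                for rr in range(max(0, r - half), min(rows, r + half + 1))
                for cc in range(max(0, c - half), min(cols, c + half + 1))
            ]
            counts = Counter(window)
            top = max(counts.values())
            winners = [k for k, n in counts.items() if n == top]
            # Keep the original class on ties; otherwise pick the smallest ID.
            out[r][c] = grid[r][c] if grid[r][c] in winners else min(winners)
    return out


# A single isolated pixel of class 2 is absorbed by its neighbours.
classes = [[1, 1, 1], [1, 2, 1], [1, 1, 1]]
filtered = majority_filter(classes)
```

A production worker would more likely apply an array-based filter to the full raster; the pure-Python loop here only demonstrates the rule.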

6. Key Functions to Implement

features.py

# Defaults are tuples to avoid Python's shared mutable default pitfall.

# Smoothing
def apply_smoothing(df, indices=('ndvi', 'ndre', 'evi', 'savi', 'ci_re', 'ndwi')):
    """Apply Savitzky-Golay smoothing after interpolating over zero values."""
    # 1. Replace 0 with NaN
    # 2. Linear interpolate across time axis, then fillna(0)
    # 3. savgol_filter(window_length=5, polyorder=2)

# Phenology
def extract_phenology(df, dates, indices=('ndvi', 'ndre', 'evi')):
    """Extract max/min/mean/std, amplitude, AUC, peak_timestep, max_slope_up/down."""

# Harmonics
def add_harmonics(df, dates, indices=('ndvi',)):
    """Add 1st and 2nd order harmonic (sin/cos) features."""

# Seasonal Windows
def add_interactions_and_windows(df, dates):
    """Add Early/Peak/Late window stats + interactions."""

7. Acceptance Criteria

  • Worker computes exact same features as training pipeline
  • All indices (ndvi, ndre, evi, savi, ci_re, ndwi) computed
  • Savitzky-Golay smoothing applied correctly
  • Phenology metrics (amplitude, AUC, peak, slope) computed
  • Harmonic features (sin/cos 1st and 2nd order) computed
  • Seasonal window stats (Early/Peak/Late) computed
  • Model loads from current MinIO format (Zimbabwe_*.pkl)
  • Scaler applied only for non-Raw models
  • Results uploaded to MinIO as COG
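
The first criterion could be checked with a small comparison helper run against a reference row exported from the training pipeline. This is a sketch under assumptions: the function name, the dictionary-based interface, and the tolerances are hypothetical, and the sample values are placeholders rather than real outputs.

```python
import math


def feature_mismatches(expected, actual, rel_tol=1e-6, abs_tol=1e-9):
    """List differences between training and worker feature values."""
    problems = []
    for name in sorted(set(expected) | set(actual)):
        if name not in actual:
            problems.append(f"missing in worker: {name}")
        elif name not in expected:
            problems.append(f"unexpected in worker: {name}")
        elif not math.isclose(
            expected[name], actual[name], rel_tol=rel_tol, abs_tol=abs_tol
        ):
            problems.append(
                f"value differs: {name} "
                f"(train={expected[name]}, worker={actual[name]})"
            )
    return problems


# Placeholder values; a real check would use one pixel from train.py.
train_row = {"ndvi_max": 0.8, "ndvi_auc": 20.5}
worker_row = {"ndvi_max": 0.8, "ndvi_auc": 0.0}
issues = feature_mismatches(train_row, worker_row)
```

An empty list means the worker reproduces the reference features within tolerance. Column order should be verified separately, since most model libraries consume features positionally.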

8. Files to Modify

File                         | Changes
-----------------------------|------------------------------------------------------------------------
apps/worker/features.py      | Add feature engineering functions, update build_feature_stack_from_dea
apps/worker/inference.py     | Update model loading, add scaler detection
apps/worker/config.py        | Add MinIOStorage implementation
apps/worker/requirements.txt | Add scipy, pystac-client, stackstac, xarray, rioxarray