Plan: Updated Inference Worker - Training Parity
Status: Draft
Date: 2026-02-28
Objective
Update the inference worker (apps/worker/inference.py, apps/worker/features.py, apps/worker/config.py) to match the training pipeline in train.py exactly, so that the features computed at inference time are identical to those used during model training.
1. Gap Analysis
Current State vs Required
| Component | Current (Worker) | Required (Train.py) | Gap |
|---|---|---|---|
| Feature Engineering | Placeholder (zeros) | Full pipeline | CRITICAL |
| Model Loading | Expected bundle format | Individual .pkl files | Medium |
| Indices | ndvi, evi, savi only | + ndre, ci_re, ndwi | Medium |
| Smoothing | Savitzky-Golay (window=5, polyorder=2) | Same parameters | OK |
| Phenology | Not implemented | amplitude, AUC, max_slope, peak_timestep | CRITICAL |
| Harmonics | Not implemented | 1st/2nd order sin/cos | CRITICAL |
| Seasonal Windows | Not implemented | Early/Peak/Late | CRITICAL |
2. Feature Engineering Pipeline (from train.py)
2.1 Smoothing
# From train.py apply_smoothing():
# 1. Replace 0 with NaN
# 2. Linear interpolate across time (axis=1), fillna(0)
# 3. Savitzky-Golay: window_length=5, polyorder=2
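The three steps above can be sketched as follows (a minimal sketch assuming per-pixel series arrive as rows of a 2-D array; the exact train.py signature may differ):

```python
import numpy as np
import pandas as pd
from scipy.signal import savgol_filter

def apply_smoothing(values: np.ndarray) -> np.ndarray:
    """Smooth per-pixel time series (pixels x timesteps), mirroring train.py."""
    df = pd.DataFrame(values)
    df = df.replace(0, np.nan)             # 1. treat 0 (nodata) as missing
    df = df.interpolate(axis=1).fillna(0)  # 2. linear interpolation over time
    # 3. Savitzky-Golay filter along the time axis
    return savgol_filter(df.to_numpy(), window_length=5, polyorder=2, axis=1)

series = np.array([[0.1, 0.0, 0.3, 0.4, 0.5, 0.6]])  # one pixel, one 0 gap
smoothed = apply_smoothing(series)
```

Note that `window_length=5` requires at least five timesteps per pixel, which Sentinel-2 seasonal stacks comfortably exceed.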
2.2 Phenology Metrics (per index)
- `idx_max`, `idx_min`, `idx_mean`, `idx_std`
- `idx_amplitude` = max - min
- `idx_auc` = trapezoidal integral with dx=10
- `idx_peak_timestep` = argmax index
- `idx_max_slope_up` = max(diff)
- `idx_max_slope_down` = min(diff)
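For a single smoothed series, these metrics could be computed as below (an illustrative per-series helper; train.py may operate on whole DataFrames):

```python
import numpy as np
from scipy.integrate import trapezoid

def extract_phenology(values: np.ndarray, prefix: str = "ndvi") -> dict:
    """Phenology metrics for one smoothed time series, matching the list above."""
    diffs = np.diff(values)  # timestep-to-timestep slopes
    return {
        f"{prefix}_max": float(values.max()),
        f"{prefix}_min": float(values.min()),
        f"{prefix}_mean": float(values.mean()),
        f"{prefix}_std": float(values.std()),
        f"{prefix}_amplitude": float(values.max() - values.min()),
        f"{prefix}_auc": float(trapezoid(values, dx=10)),  # dx=10 as in train.py
        f"{prefix}_peak_timestep": int(values.argmax()),
        f"{prefix}_max_slope_up": float(diffs.max()),
        f"{prefix}_max_slope_down": float(diffs.min()),
    }

feats = extract_phenology(np.array([0.2, 0.4, 0.7, 0.6, 0.3]))
```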
2.3 Harmonic Features (per index, normalized)
- `idx_harmonic1_sin` = dot(values, sin_t) / n_dates
- `idx_harmonic1_cos` = dot(values, cos_t) / n_dates
- `idx_harmonic2_sin` = dot(values, sin_2t) / n_dates
- `idx_harmonic2_cos` = dot(values, cos_2t) / n_dates
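A sketch of the normalized projections, assuming `t` encodes acquisition day-of-year on an annual cycle (the exact time encoding used by train.py is not shown here and must be verified):

```python
import numpy as np

def add_harmonics(values: np.ndarray, day_of_year: np.ndarray,
                  prefix: str = "ndvi") -> dict:
    """1st/2nd-order harmonic projections, normalized by the number of dates."""
    n_dates = len(day_of_year)
    t = 2 * np.pi * day_of_year / 365.0  # assumed annual-cycle encoding
    return {
        f"{prefix}_harmonic1_sin": float(values @ np.sin(t)) / n_dates,
        f"{prefix}_harmonic1_cos": float(values @ np.cos(t)) / n_dates,
        f"{prefix}_harmonic2_sin": float(values @ np.sin(2 * t)) / n_dates,
        f"{prefix}_harmonic2_cos": float(values @ np.cos(2 * t)) / n_dates,
    }

doy = np.arange(0, 360, 30)                         # 12 synthetic dates
vals = 0.5 + 0.3 * np.sin(2 * np.pi * doy / 365.0)  # sinusoidal test signal
h = add_harmonics(vals, doy)
```

On the sinusoidal test signal, the first-harmonic sine coefficient dominates the second, as expected for an annual cycle.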
2.4 Seasonal Windows (Zimbabwe: Oct-Jun)
- Early: Oct-Dec (months 10,11,12)
- Peak: Jan-Mar (months 1,2,3)
- Late: Apr-Jun (months 4,5,6)
For each window and each index:
- `idx_early_mean`, `idx_early_max`
- `idx_peak_mean`, `idx_peak_max`
- `idx_late_mean`, `idx_late_max`
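The window statistics might look like this (illustrative helper; the zero fallback for empty windows is an assumption not confirmed by train.py):

```python
import numpy as np
import pandas as pd

def window_stats(values: np.ndarray, dates: pd.DatetimeIndex,
                 prefix: str = "ndvi") -> dict:
    """Early/Peak/Late window mean and max for one index (Zimbabwe season)."""
    windows = {"early": (10, 11, 12), "peak": (1, 2, 3), "late": (4, 5, 6)}
    out = {}
    for name, months in windows.items():
        mask = dates.month.isin(months)
        sel = values[mask] if mask.any() else np.zeros(1)  # guard empty windows
        out[f"{prefix}_{name}_mean"] = float(sel.mean())
        out[f"{prefix}_{name}_max"] = float(sel.max())
    return out

dates = pd.date_range("2024-10-01", "2025-06-30", freq="MS")  # Oct..Jun, monthly
vals = np.linspace(0.2, 0.8, len(dates))                      # rising greenness
stats = window_stats(vals, dates)
```

Because the season straddles a calendar-year boundary, selecting by month number (rather than by date range) keeps the logic independent of the specific season year.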
2.5 Interactions
- `ndvi_ndre_peak_diff` = ndvi_max - ndre_max
- `canopy_density_contrast` = evi_mean / (ndvi_mean + 0.001)
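Both interactions are simple arithmetic over already-computed statistics; the 0.001 epsilon keeps the ratio finite over bare pixels. A minimal sketch:

```python
def add_interactions(feats: dict) -> dict:
    """Interaction features from previously computed per-index statistics."""
    feats["ndvi_ndre_peak_diff"] = feats["ndvi_max"] - feats["ndre_max"]
    # epsilon avoids division by zero on no-vegetation pixels
    feats["canopy_density_contrast"] = feats["evi_mean"] / (feats["ndvi_mean"] + 0.001)
    return feats

f = add_interactions({"ndvi_max": 0.8, "ndre_max": 0.5,
                      "evi_mean": 0.4, "ndvi_mean": 0.6})
```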
3. Model Loading Strategy
Current MinIO Files
geocrop-models/
Zimbabwe_CatBoost_Model.pkl
Zimbabwe_CatBoost_Raw_Model.pkl
Zimbabwe_Ensemble_Raw_Model.pkl
Zimbabwe_LightGBM_Model.pkl
Zimbabwe_LightGBM_Raw_Model.pkl
Zimbabwe_RandomForest_Model.pkl
Zimbabwe_XGBoost_Model.pkl
Mapping to Inference
| Model Name (Job) | MinIO File | Scaler Required |
|---|---|---|
| Ensemble | Zimbabwe_Ensemble_Raw_Model.pkl | No (Raw) |
| Ensemble_Scaled | Zimbabwe_Ensemble_Model.pkl | Yes |
| RandomForest | Zimbabwe_RandomForest_Model.pkl | Yes |
| XGBoost | Zimbabwe_XGBoost_Model.pkl | Yes |
| LightGBM | Zimbabwe_LightGBM_Model.pkl | Yes |
| CatBoost | Zimbabwe_CatBoost_Model.pkl | Yes |
Note: a "_Raw" suffix means no scaling is needed; models without "_Raw" require the training StandardScaler. Also note that Zimbabwe_Ensemble_Model.pkl does not appear in the current bucket listing above, so the Ensemble_Scaled mapping cannot be served until that file is uploaded.
Label Handling
Since label_encoder is not in MinIO, we need to either:
- Store label_encoder alongside model in MinIO (future)
- Hardcode class mapping based on training data (temporary)
- Derive from model if it has classes_ attribute
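The filename mapping, scaler detection, and `classes_` fallback could be sketched as below (hypothetical helper names; the real logic belongs in `load_model_artifacts()`):

```python
MODEL_FILES = {
    "Ensemble": "Zimbabwe_Ensemble_Raw_Model.pkl",
    "Ensemble_Scaled": "Zimbabwe_Ensemble_Model.pkl",
    "RandomForest": "Zimbabwe_RandomForest_Model.pkl",
    "XGBoost": "Zimbabwe_XGBoost_Model.pkl",
    "LightGBM": "Zimbabwe_LightGBM_Model.pkl",
    "CatBoost": "Zimbabwe_CatBoost_Model.pkl",
}

def resolve_model(job_model_name: str) -> tuple[str, bool]:
    """Map a job's model name to its MinIO object and scaler requirement."""
    try:
        filename = MODEL_FILES[job_model_name]
    except KeyError:
        raise ValueError(f"Unknown model: {job_model_name}")
    needs_scaler = "_Raw_" not in filename  # "_Raw" models skip StandardScaler
    return filename, needs_scaler

def class_labels(model) -> list:
    """Label fallback: prefer the fitted model's own classes_ attribute."""
    return list(getattr(model, "classes_", []))

class _Dummy:  # stands in for a fitted sklearn-style model
    classes_ = ["maize", "tobacco"]

filename, needs_scaler = resolve_model("Ensemble")
labels = class_labels(_Dummy())
```

Deriving the scaler flag from the filename keeps the table and the flag from drifting apart; the class labels here are placeholders, not the actual training classes.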
4. Implementation Plan
4.1 Update apps/worker/features.py
Add new functions:
- `apply_smoothing(df, indices)` - Savitzky-Golay with 0-interpolation
- `extract_phenology(df, dates, indices)` - phenology metrics
- `add_harmonics(df, dates, indices)` - Fourier features
- `add_interactions_and_windows(df, dates)` - seasonal windows + interactions
Update:
- `build_feature_stack_from_dea()` - full DEA STAC loading + feature computation
4.2 Update apps/worker/inference.py
Modify:
- `load_model_artifacts()` - map model name to MinIO filename
- Add scaler detection based on model name ("_Raw" vs scaled)
- Handle label encoder (create default or load from metadata)
4.3 Update apps/worker/config.py
Add:
- `MinIOStorage` class implementation
- Model name to filename mapping
- MinIO client configuration
4.4 Update apps/worker/requirements.txt
Add dependencies:
- `scipy` (for savgol_filter, trapezoid)
- `pystac-client`
- `stackstac`
- `xarray`
- `rioxarray`
5. Data Flow
graph TD
A[Job: aoi, year, model] --> B[Query DEA STAC]
B --> C[Load Sentinel-2 scenes]
C --> D[Compute indices: ndvi, ndre, evi, savi, ci_re, ndwi]
D --> E[Apply Savitzky-Golay smoothing]
E --> F[Extract phenology metrics]
F --> G[Add harmonic features]
G --> H[Add seasonal window stats]
H --> I[Add interactions]
I --> J[Align to target grid]
J --> K[Load model from MinIO]
K --> L[Apply scaler if needed]
L --> M[Predict per-pixel]
M --> N[Majority filter smoothing]
N --> O[Upload COG to MinIO]
6. Key Functions to Implement
features.py
# Smoothing
def apply_smoothing(df, indices=['ndvi', 'ndre', 'evi', 'savi', 'ci_re', 'ndwi']):
"""Apply Savitzky-Golay smoothing with 0-interpolation."""
# 1. Replace 0 with NaN
# 2. Linear interpolate across time axis
# 3. savgol_filter(window_length=5, polyorder=2)
# Phenology
def extract_phenology(df, dates, indices=['ndvi', 'ndre', 'evi']):
"""Extract amplitude, AUC, peak_timestep, max_slope."""
# Harmonics
def add_harmonics(df, dates, indices=['ndvi']):
"""Add 1st and 2nd order harmonic features."""
# Seasonal Windows
def add_interactions_and_windows(df, dates):
"""Add Early/Peak/Late window stats + interactions."""
7. Acceptance Criteria
- Worker computes exact same features as training pipeline
- All indices (ndvi, ndre, evi, savi, ci_re, ndwi) computed
- Savitzky-Golay smoothing applied correctly
- Phenology metrics (amplitude, AUC, peak, slope) computed
- Harmonic features (sin/cos 1st and 2nd order) computed
- Seasonal window stats (Early/Peak/Late) computed
- Model loads from current MinIO format (Zimbabwe_*.pkl)
- Scaler applied only for non-Raw models
- Results uploaded to MinIO as COG
8. Files to Modify
| File | Changes |
|---|---|
| apps/worker/features.py | Add feature engineering functions, update build_feature_stack_from_dea |
| apps/worker/inference.py | Update model loading, add scaler detection |
| apps/worker/config.py | Add MinIOStorage implementation |
| apps/worker/requirements.txt | Add scipy, pystac-client, stackstac |