# Storage Contract

## Overview

This document defines the storage layout, naming conventions, and metadata requirements for the GeoCrop project MinIO buckets.

## Bucket Structure

| Bucket | Purpose | Example Path |
|--------|---------|--------------|
| `geocrop-baselines` | Dynamic World baseline COGs | `dw/zim/summer/YYYY_YYYY/` |
| `geocrop-datasets` | Training datasets | `datasets/{name}/{version}/` |
| `geocrop-models` | Trained ML models | `models/{name}/{version}/` |
| `geocrop-results` | Inference output COGs | `jobs/{job_id}/` |

---

## 1. geocrop-baselines

### Path Structure

```
geocrop-baselines/
└── dw/
    └── zim/
        └── summer/
            ├── {season}/
            │   ├── agreement/
            │   │   └── DW_Zim_Agreement_{season}-{tileX}-{tileY}.tif
            │   ├── highest_conf/
            │   │   └── DW_Zim_HighestConf_{season}-{tileX}-{tileY}.tif
            │   └── mode/
            │       └── DW_Zim_Mode_{season}-{tileX}-{tileY}.tif
            └── manifests/
                └── dw_baseline_keys.txt
```

### Naming Convention

- **Season format**: `YYYY_YYYY` (e.g., `2015_2016`, `2025_2026`)
- **Tile format**: `{tileX}-{tileY}` (e.g., `0000000000-0000000000`)
- **Composite types**: `Agreement`, `HighestConf`, `Mode`

### Example Object Keys

```
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000000000-0000000000.tif
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000000000-0000065536.tif
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000065536-0000000000.tif
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000065536-0000065536.tif
```

---

## 2. geocrop-datasets

### Path Structure

```
geocrop-datasets/
└── datasets/
    └── {dataset_name}/
        └── {version}/
            ├── data/
            │   └── *.csv
            └── metadata.json
```

### Naming Convention

- **Dataset name**: Lowercase, alphanumeric with hyphens (e.g., `zimbabwe-full`, `augmented-v2`)
- **Version**: Semantic versioning (e.g., `v1`, `v2.0`, `v2.1.0`)

### Required Metadata File (`metadata.json`)

```json
{
  "version": "v1",
  "created": "2026-02-27",
  "description": "Augmented training dataset for GeoCrop crop classification",
  "source": "Manual labeling from high-resolution imagery + augmentation",
  "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
  "features": ["ndvi_peak", "evi_peak", "savi_peak"],
  "total_samples": 25000,
  "spatial_extent": "Zimbabwe",
  "batches": 23
}
```

---

## 3. geocrop-models

### Path Structure

```
geocrop-models/
└── models/
    └── {model_name}/
        └── {version}/
            ├── model.joblib
            ├── label_encoder.joblib
            ├── scaler.joblib (optional)
            ├── selected_features.json
            └── metadata.json
```

### Naming Convention

- **Model name**: Lowercase, alphanumeric with hyphens (e.g., `xgboost-crop`, `ensemble-v1`)
- **Version**: Semantic versioning

### Required Metadata File

```json
{
  "name": "xgboost-crop",
  "version": "v1",
  "created": "2026-02-27",
  "model_type": "XGBoost",
  "features": ["ndvi_peak", "evi_peak", "savi_peak"],
  "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
  "training_samples": 20000,
  "accuracy": 0.92,
  "scaler": "StandardScaler"
}
```

---

## 4. geocrop-results

### Path Structure

```
geocrop-results/
└── jobs/
    └── {job_id}/
        ├── output.tif
        ├── metadata.json
        └── thumbnail.png (optional)
```

### Naming Convention

- **Job ID**: UUID format (e.g., `a1b2c3d4-e5f6-7890-abcd-ef1234567890`)

### Required Metadata File

```json
{
  "job_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "created": "2026-02-27T10:30:00Z",
  "status": "completed",
  "aoi": {
    "lon": 29.0,
    "lat": -19.0,
    "radius_m": 5000
  },
  "season": "2024_2025",
  "model": {
    "name": "xgboost-crop",
    "version": "v1"
  },
  "output": {
    "format": "COG",
    "bounds": [25.0, -22.0, 33.0, -15.0],
    "resolution": 10,
    "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"]
  }
}
```

---

## Metadata Requirements Summary

| Resource | Required Metadata Files |
|----------|----------------------|
| Baselines | `manifests/dw_baseline_keys.txt` (optional) |
| Datasets | `metadata.json` |
| Models | `metadata.json` + model files |
| Results | `metadata.json` |

---

## Access Patterns

### Worker Access (Internal)

- Read from: `geocrop-baselines/`
- Read from: `geocrop-models/`
- Write to: `geocrop-results/`

### API Access

- Read from: `geocrop-results/`
- Generate signed URLs for downloads

### Frontend Access

- Request signed URLs from API for downloads
- Never access MinIO directly

---

**Date**: 2026-02-28

**Status**: ✅ Structure Implemented

---

## Implementation Status (2026-02-28)

### ✅ geocrop-baselines

- **Structure**: `dw/zim/summer/{season}/` directories created for seasons 2015_2016 through 2025_2026
- **Status**: Partial - Agreement files exist but need reorganization to `{season}/agreement/` subdirectory
- **Files**: 12 Agreement TIF files in `dw/zim/summer/`
- **Needs**: Reorganization script at [`ops/reorganize_storage.sh`](ops/reorganize_storage.sh)

### ✅ geocrop-datasets

- **Structure**: `datasets/zimbabwe-full/v1/data/` + `metadata.json`
- **Status**: Partial - CSV files exist at root level
- **Files**: 30 CSV batch files in root
- **Metadata**: ✅ metadata.json uploaded

### ✅ geocrop-models

- **Structure**: `models/xgboost-crop/v1/` with metadata
- **Status**: Partial - .pkl files exist at root level
- **Files**: 9 model files in root
- **Metadata**: ✅ metadata.json + selected_features.json uploaded

### ✅ geocrop-results

- **Structure**: `jobs/` directory created
- **Status**: Empty (ready for inference outputs)
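
## Example: Constructing Object Keys

The naming conventions in sections 1 to 4 can be sketched as a few small Python helpers that build object keys and reject malformed inputs. This is an illustrative sketch, not part of the GeoCrop codebase: the function names (`check_season`, `check_name`, `baseline_key`, `versioned_prefix`, `result_prefix`) are hypothetical. It also assumes two things the contract only implies through its examples: that a season always spans two consecutive years, and that tile offsets are zero-padded to 10 digits.

```python
import re
import uuid

# Composite type -> subdirectory name, as listed in section 1.
COMPOSITE_DIRS = {
    "Agreement": "agreement",
    "HighestConf": "highest_conf",
    "Mode": "mode",
}

SEASON_RE = re.compile(r"(\d{4})_(\d{4})")
# Lowercase alphanumeric words joined by single hyphens.
NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")


def check_season(season: str) -> str:
    """Validate a YYYY_YYYY season string and return it unchanged."""
    match = SEASON_RE.fullmatch(season)
    # Assumption: the second year is always the first year plus one.
    if not match or int(match.group(2)) != int(match.group(1)) + 1:
        raise ValueError(f"invalid season: {season!r}")
    return season


def check_name(name: str) -> str:
    """Validate a dataset or model name and return it unchanged."""
    if not NAME_RE.fullmatch(name):
        raise ValueError(f"invalid name: {name!r}")
    return name


def baseline_key(season: str, composite: str, tile_x: int, tile_y: int) -> str:
    """Build an object key inside the geocrop-baselines bucket."""
    if composite not in COMPOSITE_DIRS:
        raise ValueError(f"unknown composite type: {composite!r}")
    if tile_x < 0 or tile_y < 0:
        raise ValueError("tile offsets must be non-negative")
    season = check_season(season)
    # Assumption: offsets are zero-padded to 10 digits, as in the example keys.
    tile = f"{tile_x:010d}-{tile_y:010d}"
    subdir = COMPOSITE_DIRS[composite]
    return f"dw/zim/summer/{season}/{subdir}/DW_Zim_{composite}_{season}-{tile}.tif"


def versioned_prefix(kind: str, name: str, version: str) -> str:
    """Build the prefix for a dataset or model version.

    The version string is passed through unchecked, because the contract
    accepts several shapes (v1, v2.0, v2.1.0) without fixing a pattern.
    """
    if kind not in ("datasets", "models"):
        raise ValueError(f"unknown kind: {kind!r}")
    return f"{kind}/{check_name(name)}/{version}/"


def result_prefix(job_id: str) -> str:
    """Build the prefix for a job in geocrop-results; job_id must parse as a UUID."""
    # uuid.UUID raises ValueError on malformed input and normalizes to lowercase.
    return f"jobs/{uuid.UUID(job_id)}/"


if __name__ == "__main__":
    print(baseline_key("2020_2021", "HighestConf", 0, 65536))
    print(versioned_prefix("models", "xgboost-crop", "v1"))
    print(result_prefix("a1b2c3d4-e5f6-7890-abcd-ef1234567890"))
```

Keeping key construction in one place like this means workers and the API derive paths from the same rules, so a reorganized layout only needs to change in a single module.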