# Storage Contract
## Overview
This document defines the storage layout, naming conventions, and metadata requirements for the GeoCrop project MinIO buckets.
## Bucket Structure
| Bucket | Purpose | Example Path |
|--------|---------|--------------|
| `geocrop-baselines` | Dynamic World baseline COGs | `dw/zim/summer/YYYY_YYYY/` |
| `geocrop-datasets` | Training datasets | `datasets/{dataset_name}/{version}/` |
| `geocrop-models` | Trained ML models | `models/{model_name}/{version}/` |
| `geocrop-results` | Inference output COGs | `jobs/{job_id}/` |
---
## 1. geocrop-baselines
### Path Structure
```
geocrop-baselines/
└── dw/
    └── zim/
        └── summer/
            ├── {season}/
            │   ├── agreement/
            │   │   └── DW_Zim_Agreement_{season}-{tileX}-{tileY}.tif
            │   ├── highest_conf/
            │   │   └── DW_Zim_HighestConf_{season}-{tileX}-{tileY}.tif
            │   └── mode/
            │       └── DW_Zim_Mode_{season}-{tileX}-{tileY}.tif
            └── manifests/
                └── dw_baseline_keys.txt
```
### Naming Convention
- **Season format**: `YYYY_YYYY` (e.g., `2015_2016`, `2025_2026`)
- **Tile format**: `{tileX}-{tileY}` (e.g., `0000000000-0000000000`)
- **Composite types**: `Agreement`, `HighestConf`, `Mode`
### Example Object Keys
```
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000000000-0000000000.tif
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000000000-0000065536.tif
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000065536-0000000000.tif
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000065536-0000065536.tif
```
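
The naming convention above can be captured in a small key builder. This is a minimal sketch rather than project code: the function name `baseline_key` is hypothetical, and two rules are assumptions read off the examples, namely that the two season years are consecutive and that tile offsets are zero-padded to 10 digits.

```python
import re

SEASON_RE = re.compile(r"^(\d{4})_(\d{4})$")

# Composite type -> subdirectory name, as shown in the path structure.
COMPOSITE_DIRS = {
    "Agreement": "agreement",
    "HighestConf": "highest_conf",
    "Mode": "mode",
}


def baseline_key(season: str, composite: str, tile_x: int, tile_y: int) -> str:
    """Build the object key of one baseline tile inside geocrop-baselines."""
    match = SEASON_RE.match(season)
    # Assumption: a season spans two consecutive years (e.g. 2020_2021).
    if match is None or int(match.group(2)) != int(match.group(1)) + 1:
        raise ValueError(f"season must look like YYYY_YYYY, got {season!r}")
    if composite not in COMPOSITE_DIRS:
        raise ValueError(f"unknown composite type: {composite!r}")
    if tile_x < 0 or tile_y < 0:
        raise ValueError("tile offsets must be non-negative")
    # Assumption: offsets are zero-padded to 10 digits, as in the examples.
    filename = f"DW_Zim_{composite}_{season}-{tile_x:010d}-{tile_y:010d}.tif"
    return f"dw/zim/summer/{season}/{COMPOSITE_DIRS[composite]}/{filename}"


print(baseline_key("2020_2021", "HighestConf", 0, 65536))
```

Under those assumptions, the call at the end reproduces the second example key listed above.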
---
## 2. geocrop-datasets
### Path Structure
```
geocrop-datasets/
└── datasets/
    └── {dataset_name}/
        └── {version}/
            ├── data/
            │   └── *.csv
            └── metadata.json
```
### Naming Convention
- **Dataset name**: Lowercase, alphanumeric with hyphens (e.g., `zimbabwe-full`, `augmented-v2`)
- **Version**: `v`-prefixed, semantic-versioning style; the minor and patch parts may be omitted (e.g., `v1`, `v2.0`, `v2.1.0`)
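
One way to enforce these two rules before uploading is a pair of regular expressions. The sketch below is illustrative only: `dataset_prefix` is a hypothetical helper, and the patterns are an interpretation of the examples (hyphen-separated lowercase alphanumeric words; a `v` followed by one to three dot-separated numbers), not a published grammar.

```python
import re

# Assumption: hyphens separate non-empty lowercase alphanumeric parts.
NAME_RE = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
# Assumption: "v" plus 1 to 3 numeric parts, covering v1, v2.0 and v2.1.0.
VERSION_RE = re.compile(r"v\d+(?:\.\d+){0,2}")


def dataset_prefix(name: str, version: str) -> str:
    """Return the key prefix for a dataset version, validating both parts."""
    if NAME_RE.fullmatch(name) is None:
        raise ValueError(f"invalid dataset name: {name!r}")
    if VERSION_RE.fullmatch(version) is None:
        raise ValueError(f"invalid version: {version!r}")
    return f"datasets/{name}/{version}/"


print(dataset_prefix("zimbabwe-full", "v1"))
```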
### Required Metadata File (`metadata.json`)
```json
{
  "version": "v1",
  "created": "2026-02-27",
  "description": "Augmented training dataset for GeoCrop crop classification",
  "source": "Manual labeling from high-resolution imagery + augmentation",
  "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
  "features": ["ndvi_peak", "evi_peak", "savi_peak"],
  "total_samples": 25000,
  "spatial_extent": "Zimbabwe",
  "batches": 23
}
```
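
A lightweight check can catch incomplete metadata before a dataset is uploaded. The sketch below assumes that every field in the example is required and has the type shown there; the contract does not state this explicitly, so treat `REQUIRED_FIELDS` and the function name as hypothetical.

```python
import json

# Assumption: all fields from the example are required, with these types.
REQUIRED_FIELDS = {
    "version": str,
    "created": str,
    "description": str,
    "source": str,
    "classes": list,
    "features": list,
    "total_samples": int,
    "spatial_extent": str,
    "batches": int,
}


def validate_dataset_metadata(raw: str) -> list:
    """Return a list of problems found in a metadata.json document."""
    meta = json.loads(raw)
    problems = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in meta:
            problems.append(f"missing field: {field}")
        elif not isinstance(meta[field], expected_type):
            problems.append(f"wrong type for {field}: expected {expected_type.__name__}")
    return problems
```

An empty list means the document passed; otherwise each entry names one missing or mistyped field.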
---
## 3. geocrop-models
### Path Structure
```
geocrop-models/
└── models/
    └── {model_name}/
        └── {version}/
            ├── model.joblib
            ├── label_encoder.joblib
            ├── scaler.joblib (optional)
            ├── selected_features.json
            └── metadata.json
```
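
Given a listing of object keys from the bucket, the layout above can be checked for completeness. This is a sketch under the assumption that every file not marked optional is required; `missing_model_files` is a hypothetical helper, and how the key listing is obtained (for example from a MinIO client) is left out.

```python
# Files every model version must contain, per the tree above.
REQUIRED_FILES = (
    "model.joblib",
    "label_encoder.joblib",
    "selected_features.json",
    "metadata.json",
)
# scaler.joblib is marked optional, so its absence is not reported.
OPTIONAL_FILES = ("scaler.joblib",)


def missing_model_files(object_keys, model_name: str, version: str) -> list:
    """List required files absent from models/{model_name}/{version}/."""
    prefix = f"models/{model_name}/{version}/"
    present = {key[len(prefix):] for key in object_keys if key.startswith(prefix)}
    return [name for name in REQUIRED_FILES if name not in present]


keys = [
    "models/xgboost-crop/v1/model.joblib",
    "models/xgboost-crop/v1/metadata.json",
]
print(missing_model_files(keys, "xgboost-crop", "v1"))
```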
### Naming Convention
- **Model name**: Lowercase, alphanumeric with hyphens (e.g., `xgboost-crop`, `ensemble-v1`)
- **Version**: Semantic versioning, in the same `v`-prefixed form used for datasets (e.g., `v1`)
### Required Metadata File
```json
{
  "name": "xgboost-crop",
  "version": "v1",
  "created": "2026-02-27",
  "model_type": "XGBoost",
  "features": ["ndvi_peak", "evi_peak", "savi_peak"],
  "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
  "training_samples": 20000,
  "accuracy": 0.92,
  "scaler": "StandardScaler"
}
```
---
## 4. geocrop-results
### Path Structure
```
geocrop-results/
└── jobs/
    └── {job_id}/
        ├── output.tif
        ├── metadata.json
        └── thumbnail.png (optional)
```
### Naming Convention
- **Job ID**: UUID format (e.g., `a1b2c3d4-e5f6-7890-abcd-ef1234567890`)
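
Result keys follow directly from the job ID, so the ID is worth validating before anything is written. The sketch below uses the standard-library `uuid` module; `result_keys` is a hypothetical helper, and requiring the canonical hyphenated form is an assumption based on the example.

```python
import uuid


def result_keys(job_id: str) -> dict:
    """Return the object keys for one job inside geocrop-results."""
    parsed = uuid.UUID(job_id)  # raises ValueError for malformed input
    # Assumption: only the canonical hyphenated form is accepted.
    if str(parsed) != job_id.lower():
        raise ValueError(f"job_id is not in canonical UUID form: {job_id!r}")
    prefix = f"jobs/{parsed}/"
    return {
        "output": prefix + "output.tif",
        "metadata": prefix + "metadata.json",
        "thumbnail": prefix + "thumbnail.png",  # optional per the layout
    }


print(result_keys("a1b2c3d4-e5f6-7890-abcd-ef1234567890")["output"])
```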
### Required Metadata File
```json
{
  "job_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "created": "2026-02-27T10:30:00Z",
  "status": "completed",
  "aoi": {
    "lon": 29.0,
    "lat": -19.0,
    "radius_m": 5000
  },
  "season": "2024_2025",
  "model": {
    "name": "xgboost-crop",
    "version": "v1"
  },
  "output": {
    "format": "COG",
    "bounds": [25.0, -22.0, 33.0, -15.0],
    "resolution": 10,
    "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"]
  }
}
```
---
## Metadata Requirements Summary
| Resource | Required Metadata Files |
|----------|----------------------|
| Baselines | `manifests/dw_baseline_keys.txt` (optional) |
| Datasets | `metadata.json` |
| Models | `metadata.json` + model files |
| Results | `metadata.json` |
---
## Access Patterns
### Worker Access (Internal)
- Read from: `geocrop-baselines/`
- Read from: `geocrop-models/`
- Write to: `geocrop-results/`
### API Access
- Read from: `geocrop-results/`
- Generate signed URLs for downloads
### Frontend Access
- Request signed URLs from API for downloads
- Never access MinIO directly
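
The access rules above amount to a small permission matrix. The sketch below encodes them as data with a default-deny lookup; the role names and the `is_allowed` helper are hypothetical, and default-deny is an assumption, since the contract only lists what each component may do. In a real deployment the same rules would more likely live in MinIO bucket policies.

```python
# Role -> action -> buckets, transcribed from the access patterns above.
ACCESS = {
    "worker": {
        "read": {"geocrop-baselines", "geocrop-models"},
        "write": {"geocrop-results"},
    },
    "api": {
        "read": {"geocrop-results"},
        "write": set(),
    },
    # The frontend never talks to MinIO directly; it uses signed URLs.
    "frontend": {"read": set(), "write": set()},
}


def is_allowed(role: str, action: str, bucket: str) -> bool:
    """Default-deny check: anything not listed in ACCESS is refused."""
    return bucket in ACCESS.get(role, {}).get(action, set())


print(is_allowed("worker", "write", "geocrop-results"))
print(is_allowed("frontend", "read", "geocrop-results"))
```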
---
- **Date**: 2026-02-28
- **Status**: ✅ Structure Implemented
---
## Implementation Status (2026-02-28)
### ✅ geocrop-baselines
- **Structure**: `dw/zim/summer/{season}/` directories created for seasons `2015_2016` through `2025_2026`
- **Status**: Partial. Agreement files exist but still need to be moved into the `{season}/agreement/` subdirectories
- **Files**: 12 Agreement TIF files directly under `dw/zim/summer/`
- **Needs**: Reorganization script at [`ops/reorganize_storage.sh`](ops/reorganize_storage.sh)
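
The move itself reduces to mapping each flat key to its target key. The sketch below shows that mapping in Python for illustration; it is not the contents of `ops/reorganize_storage.sh`. It assumes the flat files already follow the `DW_Zim_{Composite}_{season}-{tileX}-{tileY}.tif` naming convention, and `target_key` is a hypothetical name.

```python
import re

# Matches a tile stored directly under dw/zim/summer/ (the current state).
# Assumption: flat files already use the contract's filename convention.
FLAT_RE = re.compile(
    r"^dw/zim/summer/(DW_Zim_(Agreement|HighestConf|Mode)_(\d{4}_\d{4})-\d+-\d+\.tif)$"
)
SUBDIRS = {"Agreement": "agreement", "HighestConf": "highest_conf", "Mode": "mode"}


def target_key(flat_key: str):
    """Return the reorganized key, or None if the key should not move."""
    match = FLAT_RE.match(flat_key)
    if match is None:
        return None  # already organized, or not a baseline tile
    filename, composite, season = match.groups()
    return f"dw/zim/summer/{season}/{SUBDIRS[composite]}/{filename}"


print(target_key("dw/zim/summer/DW_Zim_Agreement_2020_2021-0000000000-0000000000.tif"))
```

Keys that are already in their season subdirectory return `None`, so running the mapping twice should leave them untouched.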
### ✅ geocrop-datasets
- **Structure**: `datasets/zimbabwe-full/v1/data/` + `metadata.json`
- **Status**: Partial. CSV files still sit at the bucket root rather than under `data/`
- **Files**: 30 CSV batch files at the bucket root
- **Metadata**: ✅ metadata.json uploaded
### ✅ geocrop-models
- **Structure**: `models/xgboost-crop/v1/` with metadata
- **Status**: Partial. `.pkl` files still sit at the bucket root, not yet in the `.joblib` layout defined in section 3
- **Files**: 9 model files at the bucket root
- **Metadata**: ✅ metadata.json + selected_features.json uploaded
### ✅ geocrop-results
- **Structure**: `jobs/` directory created
- **Status**: Empty (ready for inference outputs)