# Storage Contract

## Overview

This document defines the storage layout, naming conventions, and metadata requirements for the GeoCrop project MinIO buckets.

## Bucket Structure

| Bucket | Purpose | Example Path |
|--------|---------|--------------|
| `geocrop-baselines` | Dynamic World baseline COGs | `dw/zim/summer/YYYY_YYYY/` |
| `geocrop-datasets` | Training datasets | `datasets/{name}/{version}/` |
| `geocrop-models` | Trained ML models | `models/{name}/{version}/` |
| `geocrop-results` | Inference output COGs | `jobs/{job_id}/` |

---

## 1. geocrop-baselines

### Path Structure

```
geocrop-baselines/
└── dw/
    └── zim/
        └── summer/
            ├── {season}/
            │   ├── agreement/
            │   │   └── DW_Zim_Agreement_{season}-{tileX}-{tileY}.tif
            │   ├── highest_conf/
            │   │   └── DW_Zim_HighestConf_{season}-{tileX}-{tileY}.tif
            │   └── mode/
            │       └── DW_Zim_Mode_{season}-{tileX}-{tileY}.tif
            └── manifests/
                └── dw_baseline_keys.txt
```

### Naming Convention

- **Season format**: `YYYY_YYYY` (e.g., `2015_2016`, `2025_2026`)
- **Tile format**: `{tileX}-{tileY}` (e.g., `0000000000-0000000000`)
- **Composite types**: `Agreement`, `HighestConf`, `Mode`

### Example Object Keys

```
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000000000-0000000000.tif
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000000000-0000065536.tif
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000065536-0000000000.tif
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000065536-0000065536.tif
```

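
As a minimal sketch of the layout above, a helper along these lines could assemble a baseline object key from its parts. The function name `baseline_key`, the 10-digit zero padding (inferred from the example keys), and the check that a season spans two consecutive years are assumptions for illustration, not part of the contract.

```python
import re

# Season strings look like YYYY_YYYY (e.g. 2020_2021).
_SEASON_RE = re.compile(r"^(\d{4})_(\d{4})$")

# Composite type -> subdirectory name, as shown in the path structure.
_COMPOSITE_DIRS = {
    "Agreement": "agreement",
    "HighestConf": "highest_conf",
    "Mode": "mode",
}


def baseline_key(season: str, composite: str, tile_x: int, tile_y: int) -> str:
    """Build the object key for one Dynamic World baseline tile (hypothetical helper)."""
    match = _SEASON_RE.match(season)
    if match is None:
        raise ValueError(f"season must look like YYYY_YYYY, got {season!r}")
    # Assumption: a season always spans two consecutive years.
    if int(match.group(2)) != int(match.group(1)) + 1:
        raise ValueError(f"season years must be consecutive, got {season!r}")
    if composite not in _COMPOSITE_DIRS:
        raise ValueError(f"unknown composite type: {composite!r}")
    if tile_x < 0 or tile_y < 0:
        raise ValueError("tile offsets must be non-negative")
    # Assumption: tile offsets are zero-padded to 10 digits, as in the example keys.
    filename = f"DW_Zim_{composite}_{season}-{tile_x:010d}-{tile_y:010d}.tif"
    return f"dw/zim/summer/{season}/{_COMPOSITE_DIRS[composite]}/{filename}"
```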
---

## 2. geocrop-datasets

### Path Structure

```
geocrop-datasets/
└── datasets/
    └── {dataset_name}/
        └── {version}/
            ├── data/
            │   └── *.csv
            └── metadata.json
```

### Naming Convention

- **Dataset name**: Lowercase, alphanumeric with hyphens (e.g., `zimbabwe-full`, `augmented-v2`)
- **Version**: Semantic versioning with a `v` prefix; shortened forms are accepted (e.g., `v1`, `v2.0`, `v2.1.0`)

### Required Metadata File (`metadata.json`)

```json
{
  "version": "v1",
  "created": "2026-02-27",
  "description": "Augmented training dataset for GeoCrop crop classification",
  "source": "Manual labeling from high-resolution imagery + augmentation",
  "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
  "features": ["ndvi_peak", "evi_peak", "savi_peak"],
  "total_samples": 25000,
  "spatial_extent": "Zimbabwe",
  "batches": 23
}
```

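
The naming rules and metadata file above could be checked before upload with something like the following sketch. The regular expressions are one possible reading of the naming convention, and treating every field of the example `metadata.json` as required is an assumption; the names `dataset_prefix` and `missing_dataset_fields` are hypothetical.

```python
import json
import re

# Assumed reading of "lowercase, alphanumeric with hyphens".
NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
# Assumed reading of the version examples: v1, v2.0, v2.1.0.
VERSION_RE = re.compile(r"^v\d+(?:\.\d+){0,2}$")

# Assumption: every field in the example metadata.json is required.
REQUIRED_DATASET_FIELDS = {
    "version", "created", "description", "source", "classes",
    "features", "total_samples", "spatial_extent", "batches",
}


def dataset_prefix(name: str, version: str) -> str:
    """Return the key prefix for a dataset version, validating both parts."""
    if not NAME_RE.match(name):
        raise ValueError(f"invalid dataset name: {name!r}")
    if not VERSION_RE.match(version):
        raise ValueError(f"invalid version: {version!r}")
    return f"datasets/{name}/{version}/"


def missing_dataset_fields(metadata_text: str) -> list[str]:
    """Return the sorted names of required fields absent from a metadata.json document."""
    metadata = json.loads(metadata_text)
    return sorted(REQUIRED_DATASET_FIELDS - set(metadata))
```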
---

## 3. geocrop-models

### Path Structure

```
geocrop-models/
└── models/
    └── {model_name}/
        └── {version}/
            ├── model.joblib
            ├── label_encoder.joblib
            ├── scaler.joblib (optional)
            ├── selected_features.json
            └── metadata.json
```

### Naming Convention

- **Model name**: Lowercase, alphanumeric with hyphens (e.g., `xgboost-crop`, `ensemble-v1`)
- **Version**: Semantic versioning

### Required Metadata File

```json
{
  "name": "xgboost-crop",
  "version": "v1",
  "created": "2026-02-27",
  "model_type": "XGBoost",
  "features": ["ndvi_peak", "evi_peak", "savi_peak"],
  "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
  "training_samples": 20000,
  "accuracy": 0.92,
  "scaler": "StandardScaler"
}
```

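
To illustrate the model layout, the sketch below lists the object keys a model version is expected to contain and reports which required ones are absent from a bucket listing. The function names are hypothetical, and the split into required and optional files simply follows the "(optional)" marker in the path structure.

```python
from typing import Iterable

# File names taken from the path structure; scaler.joblib is marked optional there.
REQUIRED_MODEL_FILES = (
    "model.joblib",
    "label_encoder.joblib",
    "selected_features.json",
    "metadata.json",
)
OPTIONAL_MODEL_FILES = ("scaler.joblib",)


def model_keys(name: str, version: str, include_optional: bool = False) -> list[str]:
    """Return the object keys expected under models/{name}/{version}/."""
    prefix = f"models/{name}/{version}/"
    files = REQUIRED_MODEL_FILES + (OPTIONAL_MODEL_FILES if include_optional else ())
    return [prefix + filename for filename in files]


def missing_model_files(name: str, version: str, existing_keys: Iterable[str]) -> list[str]:
    """Return required keys that do not appear in an existing bucket listing."""
    present = set(existing_keys)
    return [key for key in model_keys(name, version) if key not in present]
```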
---

## 4. geocrop-results

### Path Structure

```
geocrop-results/
└── jobs/
    └── {job_id}/
        ├── output.tif
        ├── metadata.json
        └── thumbnail.png (optional)
```

### Naming Convention

- **Job ID**: UUID format (e.g., `a1b2c3d4-e5f6-7890-abcd-ef1234567890`)

### Required Metadata File

```json
{
  "job_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "created": "2026-02-27T10:30:00Z",
  "status": "completed",
  "aoi": {
    "lon": 29.0,
    "lat": -19.0,
    "radius_m": 5000
  },
  "season": "2024_2025",
  "model": {
    "name": "xgboost-crop",
    "version": "v1"
  },
  "output": {
    "format": "COG",
    "bounds": [25.0, -22.0, 33.0, -15.0],
    "resolution": 10,
    "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"]
  }
}
```

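
A worker writing results might derive its object keys as in the sketch below, which validates the job ID with the standard-library `uuid` module. Requiring the canonical lowercase, hyphenated spelling is an assumption based on the example ID, and the function names are hypothetical.

```python
import uuid


def job_prefix(job_id: str) -> str:
    """Return the key prefix for a job, rejecting IDs that are not canonical UUIDs."""
    try:
        parsed = uuid.UUID(job_id)
    except ValueError as exc:
        raise ValueError(f"job_id is not a UUID: {job_id!r}") from exc
    # uuid.UUID also accepts braces, missing hyphens and uppercase hex.
    # Assumption: only the canonical lowercase form is used in object keys.
    if str(parsed) != job_id:
        raise ValueError(f"job_id is not in canonical form: {job_id!r}")
    return f"jobs/{job_id}/"


def result_keys(job_id: str, with_thumbnail: bool = False) -> dict[str, str]:
    """Return the object keys for one job's outputs, following the path structure."""
    prefix = job_prefix(job_id)
    keys = {
        "output": prefix + "output.tif",
        "metadata": prefix + "metadata.json",
    }
    if with_thumbnail:
        keys["thumbnail"] = prefix + "thumbnail.png"  # optional per the layout
    return keys
```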
---

## Metadata Requirements Summary

| Resource | Required Metadata Files |
|----------|----------------------|
| Baselines | `manifests/dw_baseline_keys.txt` (optional) |
| Datasets | `metadata.json` |
| Models | `metadata.json` + model files |
| Results | `metadata.json` |

---

## Access Patterns

### Worker Access (Internal)

- Read from: `geocrop-baselines/`
- Read from: `geocrop-models/`
- Write to: `geocrop-results/`

### API Access

- Read from: `geocrop-results/`
- Generate signed URLs for downloads

### Frontend Access

- Request signed URLs from API for downloads
- Never access MinIO directly

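
The access rules above can be written down as a small lookup table, which may help when reviewing bucket policies. This is only an illustrative sketch: the role names (`worker`, `api`, `frontend`), the action names, and the function `is_allowed` are assumptions, and real enforcement would live in MinIO policies rather than application code.

```python
# (role, bucket) -> allowed actions, transcribed from the access patterns above.
# The frontend has no entries because it never talks to MinIO directly.
ACCESS_RULES = {
    ("worker", "geocrop-baselines"): {"read"},
    ("worker", "geocrop-models"): {"read"},
    ("worker", "geocrop-results"): {"write"},
    ("api", "geocrop-results"): {"read"},
}


def is_allowed(role: str, bucket: str, action: str) -> bool:
    """Return True if the role may perform the action on the bucket."""
    return action in ACCESS_RULES.get((role, bucket), set())
```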
---

**Date**: 2026-02-28
**Status**: ✅ Structure Implemented

---

## Implementation Status (2026-02-28)

### ✅ geocrop-baselines

- **Structure**: `dw/zim/summer/{season}/` directories created for seasons 2015_2016 through 2025_2026
- **Status**: Partial - Agreement files exist but need reorganization to `{season}/agreement/` subdirectory
- **Files**: 12 Agreement TIF files in `dw/zim/summer/`
- **Needs**: Reorganization script at [`ops/reorganize_storage.sh`](ops/reorganize_storage.sh)

### ✅ geocrop-datasets

- **Structure**: `datasets/zimbabwe-full/v1/data/` + `metadata.json`
- **Status**: Partial - CSV files exist at root level
- **Files**: 30 CSV batch files in root
- **Metadata**: ✅ metadata.json uploaded

### ✅ geocrop-models

- **Structure**: `models/xgboost-crop/v1/` with metadata
- **Status**: Partial - .pkl files exist at root level
- **Files**: 9 model files in root
- **Metadata**: ✅ metadata.json + selected_features.json uploaded

### ✅ geocrop-results

- **Structure**: `jobs/` directory created
- **Status**: Empty (ready for inference outputs)