Storage Contract

Overview

This document defines the storage layout, naming conventions, and metadata requirements for the GeoCrop project's MinIO buckets.

Bucket Structure

| Bucket | Purpose | Example Path |
|---|---|---|
| geocrop-baselines | Dynamic World baseline COGs | dw/zim/summer/YYYY_YYYY/ |
| geocrop-datasets | Training datasets | datasets/{name}/{version}/ |
| geocrop-models | Trained ML models | models/{name}/{version}/ |
| geocrop-results | Inference output COGs | jobs/{job_id}/ |

1. geocrop-baselines

Path Structure

geocrop-baselines/
└── dw/
    └── zim/
        └── summer/
            ├── {season}/
            │   ├── agreement/
            │   │   └── DW_Zim_Agreement_{season}-{tileX}-{tileY}.tif
            │   ├── highest_conf/
            │   │   └── DW_Zim_HighestConf_{season}-{tileX}-{tileY}.tif
            │   └── mode/
            │       └── DW_Zim_Mode_{season}-{tileX}-{tileY}.tif
            └── manifests/
                └── dw_baseline_keys.txt

Naming Convention

  • Season format: YYYY_YYYY (e.g., 2015_2016, 2025_2026)
  • Tile format: {tileX}-{tileY} (e.g., 0000000000-0000000000)
  • Composite types: Agreement, HighestConf, Mode

Example Object Keys

dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000000000-0000000000.tif
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000000000-0000065536.tif
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000065536-0000000000.tif
dw/zim/summer/2020_2021/highest_conf/DW_Zim_HighestConf_2020_2021-0000065536-0000065536.tif
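
The naming convention above can be turned into a small key builder and parser. The following is a minimal sketch, not the project's actual code: the function names `baseline_key` and `parse_baseline_key` are hypothetical, and the 10-digit zero padding of tile offsets is an assumption inferred from the example keys.

```python
import re

# Composite type -> folder name, as shown in the path structure above.
COMPOSITE_FOLDERS = {
    "Agreement": "agreement",
    "HighestConf": "highest_conf",
    "Mode": "mode",
}

SEASON_RE = re.compile(r"^\d{4}_\d{4}$")
KEY_RE = re.compile(
    r"^dw/zim/summer/(?P<season>\d{4}_\d{4})/(?P<folder>[a-z_]+)/"
    r"DW_Zim_(?P<composite>[A-Za-z]+)_(?P=season)"
    r"-(?P<tile_x>\d{10})-(?P<tile_y>\d{10})\.tif$"
)


def baseline_key(season, composite, tile_x, tile_y):
    """Build an object key inside the geocrop-baselines bucket."""
    if not SEASON_RE.match(season):
        raise ValueError(f"season must look like YYYY_YYYY, got {season!r}")
    if composite not in COMPOSITE_FOLDERS:
        raise ValueError(f"unknown composite type: {composite!r}")
    if tile_x < 0 or tile_y < 0:
        raise ValueError("tile offsets must be non-negative")
    folder = COMPOSITE_FOLDERS[composite]
    # Assumption: tile offsets are zero-padded to 10 digits, as in the examples.
    return (
        f"dw/zim/summer/{season}/{folder}/"
        f"DW_Zim_{composite}_{season}-{tile_x:010d}-{tile_y:010d}.tif"
    )


def parse_baseline_key(key):
    """Return the key's components as a dict, or None if it does not conform."""
    m = KEY_RE.match(key)
    if m is None or COMPOSITE_FOLDERS.get(m["composite"]) != m["folder"]:
        return None
    return {
        "season": m["season"],
        "composite": m["composite"],
        "tile_x": int(m["tile_x"]),
        "tile_y": int(m["tile_y"]),
    }
```

Calling `baseline_key("2020_2021", "HighestConf", 0, 65536)` reproduces the second example key listed above. The parser also rejects keys whose folder does not match the composite type in the file name.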

2. geocrop-datasets

Path Structure

geocrop-datasets/
└── datasets/
    └── {dataset_name}/
        └── {version}/
            ├── data/
            │   └── *.csv
            └── metadata.json

Naming Convention

  • Dataset name: Lowercase, alphanumeric with hyphens (e.g., zimbabwe-full, augmented-v2)
  • Version: v-prefixed, semantic-versioning style; minor and patch components may be omitted (e.g., v1, v2.0, v2.1.0)

Required Metadata File (metadata.json)

{
  "version": "v1",
  "created": "2026-02-27",
  "description": "Augmented training dataset for GeoCrop crop classification",
  "source": "Manual labeling from high-resolution imagery + augmentation",
  "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
  "features": ["ndvi_peak", "evi_peak", "savi_peak"],
  "total_samples": 25000,
  "spatial_extent": "Zimbabwe",
  "batches": 23
}
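
A dataset upload could be checked against this contract before it is written to the bucket. The sketch below is illustrative only: it assumes every field in the example above is required (the contract does not say which are optional), and the names `validate_dataset_metadata` and `dataset_prefix` are hypothetical.

```python
import re

# Lowercase alphanumeric segments separated by single hyphens.
NAME_RE = re.compile(r"^[a-z0-9]+(?:-[a-z0-9]+)*$")
# "v" followed by one to three dot-separated numbers: v1, v2.0, v2.1.0.
VERSION_RE = re.compile(r"^v\d+(?:\.\d+){0,2}$")

# Assumption: all fields from the example metadata.json are mandatory.
REQUIRED_FIELDS = {
    "version": str,
    "created": str,
    "description": str,
    "source": str,
    "classes": list,
    "features": list,
    "total_samples": int,
    "spatial_extent": str,
    "batches": int,
}


def dataset_prefix(name, version):
    """Return the key prefix for a dataset version, validating both parts."""
    if not NAME_RE.match(name):
        raise ValueError(f"invalid dataset name: {name!r}")
    if not VERSION_RE.match(version):
        raise ValueError(f"invalid version: {version!r}")
    return f"datasets/{name}/{version}/"


def validate_dataset_metadata(meta):
    """Return a list of human-readable problems; an empty list means valid."""
    problems = []
    for field, expected in REQUIRED_FIELDS.items():
        if field not in meta:
            problems.append(f"missing field: {field}")
        elif not isinstance(meta[field], expected):
            problems.append(f"{field} should be of type {expected.__name__}")
    version = meta.get("version")
    if isinstance(version, str) and not VERSION_RE.match(version):
        problems.append(f"version does not follow the naming convention: {version}")
    return problems
```

Returning a list of problems rather than raising on the first one lets an upload script report every issue in a single pass.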

3. geocrop-models

Path Structure

geocrop-models/
└── models/
    └── {model_name}/
        └── {version}/
            ├── model.joblib
            ├── label_encoder.joblib
            ├── scaler.joblib (optional)
            ├── selected_features.json
            └── metadata.json

Naming Convention

  • Model name: Lowercase, alphanumeric with hyphens (e.g., xgboost-crop, ensemble-v1)
  • Version: v-prefixed, semantic-versioning style, as for datasets (e.g., v1)

Required Metadata File

{
  "name": "xgboost-crop",
  "version": "v1",
  "created": "2026-02-27",
  "model_type": "XGBoost",
  "features": ["ndvi_peak", "evi_peak", "savi_peak"],
  "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"],
  "training_samples": 20000,
  "accuracy": 0.92,
  "scaler": "StandardScaler"
}
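
Before a worker loads a model, it may help to confirm that the version directory is complete. The following sketch works on a plain list of object keys, so it makes no assumption about which MinIO client is used; the helper name `missing_model_files` is hypothetical.

```python
# File names taken from the path structure above; scaler.joblib is optional.
REQUIRED_MODEL_FILES = (
    "model.joblib",
    "label_encoder.joblib",
    "selected_features.json",
    "metadata.json",
)


def missing_model_files(name, version, existing_keys):
    """Return the required files absent under models/{name}/{version}/.

    existing_keys is any iterable of object keys from the geocrop-models
    bucket, for example the result of a bucket listing.
    """
    prefix = f"models/{name}/{version}/"
    present = {
        key[len(prefix):] for key in existing_keys if key.startswith(prefix)
    }
    return [f for f in REQUIRED_MODEL_FILES if f not in present]


listing = [
    "models/xgboost-crop/v1/model.joblib",
    "models/xgboost-crop/v1/metadata.json",
    "models/xgboost-crop/v1/selected_features.json",
]
print(missing_model_files("xgboost-crop", "v1", listing))
# prints ['label_encoder.joblib']
```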

4. geocrop-results

Path Structure

geocrop-results/
└── jobs/
    └── {job_id}/
        ├── output.tif
        ├── metadata.json
        └── thumbnail.png (optional)

Naming Convention

  • Job ID: UUID format (e.g., a1b2c3d4-e5f6-7890-abcd-ef1234567890)

Required Metadata File

{
  "job_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "created": "2026-02-27T10:30:00Z",
  "status": "completed",
  "aoi": {
    "lon": 29.0,
    "lat": -19.0,
    "radius_m": 5000
  },
  "season": "2024_2025",
  "model": {
    "name": "xgboost-crop",
    "version": "v1"
  },
  "output": {
    "format": "COG",
    "bounds": [25.0, -22.0, 33.0, -15.0],
    "resolution": 10,
    "classes": ["cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"]
  }
}
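
As an illustration of how a worker might assemble this file, the sketch below builds the metadata dictionary and the object keys for one job. The function names and parameters are hypothetical, and the only status value used is "completed" because it is the only one shown in the example.

```python
import json
import uuid
from datetime import datetime, timezone


def result_keys(job_id):
    """Object keys for one job inside the geocrop-results bucket."""
    prefix = f"jobs/{job_id}/"
    return {
        "output": prefix + "output.tif",
        "metadata": prefix + "metadata.json",
        "thumbnail": prefix + "thumbnail.png",  # optional per the contract
    }


def new_job_metadata(aoi, season, model, output, job_id=None, created=None):
    """Build a metadata dict shaped like the example above."""
    job_id = job_id or str(uuid.uuid4())
    uuid.UUID(job_id)  # raises ValueError if job_id is not a valid UUID
    created = created or datetime.now(timezone.utc)
    return {
        "job_id": job_id,
        # ISO 8601 in UTC with a trailing "Z", as in the example.
        "created": created.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "status": "completed",
        "aoi": aoi,
        "season": season,
        "model": model,
        "output": output,
    }


meta = new_job_metadata(
    aoi={"lon": 29.0, "lat": -19.0, "radius_m": 5000},
    season="2024_2025",
    model={"name": "xgboost-crop", "version": "v1"},
    output={"format": "COG", "resolution": 10},
)
print(json.dumps(meta, indent=2))
print(result_keys(meta["job_id"])["metadata"])
```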

Metadata Requirements Summary

| Resource | Required Metadata Files |
|---|---|
| Baselines | manifests/dw_baseline_keys.txt (optional) |
| Datasets | metadata.json |
| Models | metadata.json + model files |
| Results | metadata.json |

Access Patterns

Worker Access (Internal)

  • Read from: geocrop-baselines/
  • Read from: geocrop-models/
  • Write to: geocrop-results/

API Access

  • Read from: geocrop-results/
  • Generate signed URLs for downloads

Frontend Access

  • Request signed URLs from API for downloads
  • Never access MinIO directly
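
The access rules above can be written down as a small lookup table, for instance for use in tests or a configuration check. This is a sketch under one stated assumption: anything the contract does not list is denied. The role names and the `is_allowed` helper are hypothetical.

```python
# Role -> bucket -> permitted actions, transcribed from the lists above.
ACCESS_POLICY = {
    "worker": {
        "geocrop-baselines": {"read"},
        "geocrop-models": {"read"},
        "geocrop-results": {"write"},
    },
    "api": {
        "geocrop-results": {"read"},
    },
    # The frontend never touches MinIO; it only receives signed URLs from the API.
    "frontend": {},
}


def is_allowed(role, bucket, action):
    """Return True if the role may perform the action on the bucket.

    Assumption: deny by default for any role, bucket, or action not listed.
    """
    return action in ACCESS_POLICY.get(role, {}).get(bucket, set())


print(is_allowed("worker", "geocrop-results", "write"))   # prints True
print(is_allowed("frontend", "geocrop-results", "read"))  # prints False
```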

Date: 2026-02-28
Status: Structure Implemented


Implementation Status (2026-02-28)

geocrop-baselines

  • Structure: dw/zim/summer/{season}/ directories created for seasons 2015_2016 through 2025_2026
  • Status: Partial. Agreement files exist but still need to be moved into the {season}/agreement/ subdirectories
  • Files: 12 Agreement TIF files in dw/zim/summer/
  • Needs: Reorganization script at ops/reorganize_storage.sh
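
The key mapping that such a reorganization has to perform can be sketched as follows. This is not the content of ops/reorganize_storage.sh: it assumes the existing files sit directly under dw/zim/summer/ and already use the DW_Zim_{Composite}_{season}-{tileX}-{tileY}.tif file name, and the function name `target_key` is hypothetical.

```python
import re

COMPOSITE_FOLDERS = {
    "Agreement": "agreement",
    "HighestConf": "highest_conf",
    "Mode": "mode",
}

# Matches a file stored flat under dw/zim/summer/ (assumed current layout).
FLAT_KEY_RE = re.compile(
    r"^dw/zim/summer/"
    r"(DW_Zim_(Agreement|HighestConf|Mode)_(\d{4}_\d{4})-\d+-\d+\.tif)$"
)


def target_key(old_key):
    """Map a flat key to its contract location, or return None if the key
    is not a flat baseline file (for example, if it is already organized)."""
    m = FLAT_KEY_RE.match(old_key)
    if m is None:
        return None
    filename, composite, season = m.groups()
    return f"dw/zim/summer/{season}/{COMPOSITE_FOLDERS[composite]}/{filename}"


old = "dw/zim/summer/DW_Zim_Agreement_2020_2021-0000000000-0000000000.tif"
print(target_key(old))
# prints dw/zim/summer/2020_2021/agreement/DW_Zim_Agreement_2020_2021-0000000000-0000000000.tif
```

Because already-organized keys map to None, running the mapping twice over the same listing would not move anything a second time.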

geocrop-datasets

  • Structure: datasets/zimbabwe-full/v1/data/ + metadata.json
  • Status: Partial - CSV files exist at root level
  • Files: 30 CSV batch files in root
  • Metadata: metadata.json uploaded

geocrop-models

  • Structure: models/xgboost-crop/v1/ with metadata
  • Status: Partial - .pkl files exist at root level
  • Files: 9 model files in root
  • Metadata: metadata.json + selected_features.json uploaded

geocrop-results

  • Structure: jobs/ directory created
  • Status: Empty (ready for inference outputs)