Plan 00: Data Migration & Storage Setup
Status: CRITICAL PRIORITY
Date: 2026-02-27
Objective
Configure MinIO buckets and migrate existing Dynamic World Cloud Optimized GeoTIFFs (COGs) from local storage to MinIO for use by the inference pipeline.
1. Current State Assessment
1.1 Existing Data in Local Storage
| Directory | File Count | Description |
|---|---|---|
| data/dw_cogs/ | 132 TIF files | DW COGs (Agreement, HighestConf, Mode) for years 2015-2026 |
| data/dw_baselines/ | ~50 TIF files | Partial baseline set |
1.2 DW COG File Naming Convention
DW_Zim_{Type}_{StartYear}_{EndYear}-{TileX}-{TileY}.tif
Types:
- Agreement - Agreement composite
- HighestConf - Highest confidence composite
- Mode - Mode composite
Years: 2015_2016 through 2025_2026 (11 seasons)
Tiles: 2x2 grid (0000000000-0000000000, 0000000000-0000065536, 0000065536-0000000000, 0000065536-0000065536)
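The convention above can be parsed mechanically rather than by ad-hoc string slicing; a minimal sketch (the helper name and regex are illustrative, not part of the pipeline):

```python
import re

# Matches e.g. DW_Zim_HighestConf_2015_2016-0000000000-0000065536.tif
DW_COG_PATTERN = re.compile(
    r"DW_Zim_(?P<type>Agreement|HighestConf|Mode)"
    r"_(?P<start>\d{4})_(?P<end>\d{4})"
    r"-(?P<tile_x>\d{10})-(?P<tile_y>\d{10})\.tif$"
)

def parse_dw_cog_name(filename: str) -> dict:
    """Split a DW COG filename into its convention fields."""
    m = DW_COG_PATTERN.match(filename)
    if m is None:
        raise ValueError(f"not a DW COG filename: {filename}")
    return m.groupdict()
```

Anything that does not match the convention raises immediately, which surfaces stray files before they reach the bucket.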
1.3 Training Dataset Available
The project already has training data in the training/ directory:
| Directory | File Count | Description |
|---|---|---|
| training/ | 23 CSV files | Zimbabwe_Full_Augmented_Batch_*.csv |
Dataset File Sizes:
- Zimbabwe_Full_Augmented_Batch_1.csv - 11 MB
- Zimbabwe_Full_Augmented_Batch_2.csv - 10 MB
- Zimbabwe_Full_Augmented_Batch_10.csv - 11 MB
- ... (total ~250 MB of training data)
These files should be uploaded to geocrop-datasets/ for use in model retraining.
1.4 MinIO Status
| Bucket | Status | Purpose |
|---|---|---|
| geocrop-models | ✅ Created + populated | Trained ML models |
| geocrop-baselines | ❌ Needs creation | DW baseline COGs |
| geocrop-results | ❌ Needs creation | Output COGs from inference |
| geocrop-datasets | ❌ Needs creation + dataset upload | Training datasets |
2. MinIO Access Method
2.1 Option A: MinIO Client (Recommended)
Use the MinIO client (mc) from the control-plane node for bulk uploads.
Step 1 — Get MinIO root credentials
On the control-plane node:
1. Check how MinIO is configured:
kubectl -n geocrop get deploy minio -o yaml | sed -n '1,200p'
Look for env vars (e.g., MINIO_ROOT_USER, MINIO_ROOT_PASSWORD) or a Secret reference. If MinIO is still running with its defaults, the credentials are user: minioadmin / pass: minioadmin123.
2. If credentials are stored in a Secret:
kubectl -n geocrop get secret | grep -i minio
kubectl -n geocrop get secret <secret-name> -o jsonpath='{.data.MINIO_ROOT_USER}' | base64 -d; echo
kubectl -n geocrop get secret <secret-name> -o jsonpath='{.data.MINIO_ROOT_PASSWORD}' | base64 -d; echo
Step 2 — Install mc (if missing)
curl -fsSL https://dl.min.io/client/mc/release/linux-amd64/mc -o /usr/local/bin/mc
chmod +x /usr/local/bin/mc
mc --version
Step 3 — Add MinIO alias
Use in-cluster DNS so you don't rely on public ingress:
mc alias set geocrop-minio http://minio.geocrop.svc.cluster.local:9000 minioadmin minioadmin123
Note: minioadmin/minioadmin123 are the default credentials; substitute the values recovered in Step 1 if they differ.
2.2 Create Missing Buckets
# Verify existing buckets
mc ls geocrop-minio
# Create any missing buckets
mc mb geocrop-minio/geocrop-baselines || true
mc mb geocrop-minio/geocrop-datasets || true
mc mb geocrop-minio/geocrop-results || true
mc mb geocrop-minio/geocrop-models || true
# Verify
mc ls geocrop-minio/geocrop-baselines
mc ls geocrop-minio/geocrop-datasets
2.3 Set Bucket Policies (Portfolio-Safe Defaults)
Principle: No public access to baselines/results/models. Downloads happen via signed URLs generated by the API.
# Set buckets to private
mc anonymous set none geocrop-minio/geocrop-baselines
mc anonymous set none geocrop-minio/geocrop-results
mc anonymous set none geocrop-minio/geocrop-models
mc anonymous set none geocrop-minio/geocrop-datasets
# Verify
mc anonymous get geocrop-minio/geocrop-baselines
3. Object Path Layout
3.1 geocrop-baselines
Store DW baseline COGs under:
dw/zim/summer/<season>/highest_conf/<filename>.tif
Where:
- <season> = YYYY_YYYY (e.g., 2015_2016)
- <filename> = original filename (e.g., DW_Zim_HighestConf_2015_2016.tif)
Example object key:
dw/zim/summer/2015_2016/highest_conf/DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif
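The key layout above can be derived from the filename in one place; a minimal sketch (the function name is illustrative, and it assumes the DW_Zim naming convention and the highest_conf layout described here):

```python
def baseline_key(filename: str) -> str:
    """Build the geocrop-baselines object key for a DW baseline COG.

    Assumes DW_Zim_{Type}_{StartYear}_{EndYear}-{TileX}-{TileY}.tif naming
    and the dw/zim/summer/<season>/highest_conf/ layout.
    """
    stem = filename[: -len(".tif")]
    # e.g. DW_Zim_HighestConf_2015_2016-0000000000-0000000000
    name_part = stem.split("-")[0]          # DW_Zim_HighestConf_2015_2016
    start, end = name_part.split("_")[-2:]  # 2015, 2016
    season = f"{start}_{end}"
    return f"dw/zim/summer/{season}/highest_conf/{filename}"
```

Keeping this mapping in a single helper avoids the worker and the migration tooling drifting apart on key layout.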
3.2 geocrop-datasets
datasets/<dataset_name>/<version>/...
For example:
datasets/zimbabwe_full/v1/Zimbabwe_Full_Augmented_Batch_1.csv
datasets/zimbabwe_full/v1/Zimbabwe_Full_Augmented_Batch_2.csv
...
datasets/zimbabwe_full/v1/metadata.json
3.3 geocrop-models
models/<model_name>/<version>/...
3.4 geocrop-results
results/<job_id>/...
4. Upload DW COGs into geocrop-baselines
4.1 Verify Local Source Folder
On control-plane node:
ls -lh ~/geocrop/data/dw_cogs | head
file ~/geocrop/data/dw_cogs/*.tif | head
Optional sanity checks:
- Ensure each COG has internal overviews (if gdalinfo is installed):
gdalinfo -json <file> | jq '.bands[0].overviews'
4.2 Dry-Run: Compute Count and Size
find ~/geocrop/data/dw_cogs -maxdepth 1 -type f -name '*.tif' | wc -l
du -sh ~/geocrop/data/dw_cogs
4.3 Upload with Mirroring
This keeps the bucket prefix in sync with the local folder:
mc mirror --overwrite --remove --json \
~/geocrop/data/dw_cogs \
geocrop-minio/geocrop-baselines/dw/zim/summer/ \
> ~/geocrop/logs/mc_mirror_dw_baselines.jsonl
Notes:
- --remove deletes objects in the bucket that aren't in the local folder (safe if you only use this prefix for DW baselines).
- For a safer first run, omit --remove.
4.4 Verify Upload
mc ls geocrop-minio/geocrop-baselines/dw/zim/summer/ | head
Spot-check hashes:
mc stat geocrop-minio/geocrop-baselines/dw/zim/summer/<somefile>.tif
4.5 Record Baseline Index
Create a manifest for the worker to quickly map year -> key.
Generate on control-plane:
mc find geocrop-minio/geocrop-baselines/dw/zim/summer --name '*.tif' --json \
| jq -r '.key' \
| sort \
> ~/geocrop/data/dw_baseline_keys.txt
Commit a copy into repo later (or store in MinIO as manifests/dw_baseline_keys.txt).
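The worker can load that manifest once and index it by season; a minimal sketch (the grouping assumes keys relative to the bucket, i.e. dw/zim/summer/<season>/..., with the season in the fourth path segment):

```python
from collections import defaultdict
from pathlib import Path

def index_baseline_keys(manifest_path: str) -> dict:
    """Group manifest object keys by season.

    Assumes each line is a bucket-relative key such as
    dw/zim/summer/2015_2016/highest_conf/<file>.tif.
    """
    by_season = defaultdict(list)
    for line in Path(manifest_path).read_text().splitlines():
        key = line.strip()
        if not key:
            continue
        season = key.split("/")[3]  # dw / zim / summer / <season> / ...
        by_season[season].append(key)
    return dict(by_season)
```

The year -> key lookup then becomes a dictionary access rather than a bucket listing per request.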
4.6 Script Implementation Requirements
# scripts/migrate_dw_to_minio.py
import os
import glob
import hashlib
import argparse
from concurrent.futures import ThreadPoolExecutor

from minio import Minio
from minio.error import S3Error


def calculate_md5(filepath):
    """Calculate the MD5 checksum of a file (for optional spot-checks)."""
    hash_md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def upload_file(client, bucket, source_path, dest_object):
    """Upload a single file to MinIO."""
    try:
        client.fput_object(bucket, dest_object, source_path)
        print(f"✅ Uploaded: {dest_object}")
        return True
    except S3Error as e:
        print(f"❌ Failed: {source_path} - {e}")
        return False


def main():
    parser = argparse.ArgumentParser(description="Migrate DW COGs to MinIO")
    parser.add_argument("--source", default="data/dw_cogs/", help="Source directory")
    parser.add_argument("--bucket", default="geocrop-baselines", help="MinIO bucket")
    parser.add_argument("--workers", type=int, default=4, help="Parallel workers")
    args = parser.parse_args()

    # Initialize MinIO client (in-cluster endpoint is plain HTTP)
    client = Minio(
        "minio.geocrop.svc.cluster.local:9000",
        access_key=os.getenv("MINIO_ACCESS_KEY"),
        secret_key=os.getenv("MINIO_SECRET_KEY"),
        secure=False,
    )

    # Find all TIF files
    tif_files = glob.glob(os.path.join(args.source, "*.tif"))
    print(f"Found {len(tif_files)} TIF files to migrate")

    # Upload with parallel workers
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        futures = []
        for tif_path in tif_files:
            filename = os.path.basename(tif_path)
            # Parse filename to create the object prefix
            # e.g., DW_Zim_Agreement_2015_2016-0000000000-0000000000.tif
            type_year = filename.replace(".tif", "").split("-")[0]  # DW_Zim_Agreement_2015_2016
            dest_object = f"{type_year}/{filename}"
            futures.append(executor.submit(upload_file, client, args.bucket, tif_path, dest_object))

        # Wait for completion
        results = [f.result() for f in futures]

    success = sum(results)
    print(f"\nMigration complete: {success}/{len(tif_files)} files uploaded")


if __name__ == "__main__":
    main()
5. Upload Training Dataset to geocrop-datasets
5.1 Training Data Already Available
The project already has training data in the training/ directory (23 CSV files, ~250 MB total):
| File | Size |
|---|---|
| Zimbabwe_Full_Augmented_Batch_1.csv | 11 MB |
| Zimbabwe_Full_Augmented_Batch_2.csv | 10 MB |
| Zimbabwe_Full_Augmented_Batch_3.csv | 11 MB |
| ... | ... |
5.2 Upload Training Data
# Note: object storage has no real directories; the zimbabwe_full/v1 prefix
# is created implicitly by the uploads below (the bucket itself was created in 2.2)
# Upload all training batches
mc cp training/Zimbabwe_Full_Augmented_Batch_*.csv \
geocrop-minio/geocrop-datasets/zimbabwe_full/v1/
# Upload metadata
cat > /tmp/metadata.json << 'EOF'
{
"version": "v1",
"created": "2026-02-27",
"description": "Augmented training dataset for GeoCrop crop classification",
"source": "Manual labeling from high-resolution imagery + augmentation",
"classes": [
"cropland",
"grass",
"shrubland",
"forest",
"water",
"builtup",
"bare"
],
"features": [
"ndvi_peak",
"evi_peak",
"savi_peak"
],
"total_samples": 25000,
"spatial_extent": "Zimbabwe",
"batches": 23
}
EOF
mc cp /tmp/metadata.json geocrop-minio/geocrop-datasets/zimbabwe_full/v1/metadata.json
5.3 Verify Dataset Upload
mc ls geocrop-minio/geocrop-datasets/zimbabwe_full/v1/
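Beyond listing objects, the metadata can be cross-checked against what was actually uploaded; a minimal sketch (the required field names follow the metadata.json above, and the batch-count check is an assumption about how the dataset is validated):

```python
REQUIRED_FIELDS = {"version", "created", "description", "classes", "features", "batches"}

def validate_dataset_metadata(metadata: dict, batch_count: int) -> list:
    """Return a list of problems; an empty list means the metadata passes."""
    problems = [f"missing field: {f}" for f in sorted(REQUIRED_FIELDS - metadata.keys())]
    if metadata.get("batches") != batch_count:
        problems.append(
            f"batches={metadata.get('batches')} but {batch_count} CSV files were uploaded"
        )
    return problems
```

Running this after upload catches a stale metadata.json before retraining ever starts.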
6. Acceptance Criteria (Must Be True Before Phase 1)
- Buckets exist: geocrop-baselines, geocrop-datasets (and geocrop-models, geocrop-results)
- Buckets are private (anonymous access disabled)
- DW baseline COGs available under geocrop-baselines/dw/zim/summer/...
- Training dataset uploaded to geocrop-datasets/zimbabwe_full/v1/
- A baseline manifest exists (text file listing object keys)
7. Common Pitfalls
- Uploading to the wrong bucket or root prefix → fix by mirroring into a single authoritative prefix
- Leaving MinIO public → fix with mc anonymous set none
- Mixing season windows (Nov–Apr vs Sep–May) → store DW as "summer season" per filename, but keep model season config separate
8. Next Steps
After this plan is approved:
- Execute bucket creation commands
- Run migration script for DW COGs
- Upload sample dataset
- Verify worker can read from MinIO
- Proceed to Plan 01: STAC Inference Worker
9. Technical Notes
7.1 MinIO Access from Worker
The worker uses internal Kubernetes DNS:
MINIO_ENDPOINT = "minio.geocrop.svc.cluster.local:9000"
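The worker can resolve its whole MinIO connection from the environment; a minimal sketch (variable names other than MINIO_ENDPOINT, and the default values, are assumptions):

```python
import os

def minio_config() -> dict:
    """Read MinIO connection settings from the environment, falling back
    to the in-cluster defaults used throughout this plan."""
    return {
        "endpoint": os.getenv("MINIO_ENDPOINT", "minio.geocrop.svc.cluster.local:9000"),
        "access_key": os.environ["MINIO_ACCESS_KEY"],  # fail fast if unset
        "secret_key": os.environ["MINIO_SECRET_KEY"],
        "secure": os.getenv("MINIO_SECURE", "false").lower() == "true",
    }
```

Failing fast on missing credentials surfaces misconfigured Secrets at pod start rather than mid-job.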
7.2 Bucket Naming Convention
Per AGENTS.md:
- geocrop-models - trained ML models
- geocrop-results - output COGs
- geocrop-baselines - DW baseline COGs
- geocrop-datasets - training datasets
7.3 File Size Estimates
| Dataset | File Count | Avg Size | Total |
|---|---|---|---|
| DW COGs | 132 | ~60 MB | ~7.9 GB |
| Training Data | 23 | ~10 MB | ~250 MB |