# Plan 00: Data Migration & Storage Setup

**Status**: CRITICAL PRIORITY
**Date**: 2026-02-27

---

## Objective

Configure MinIO buckets and migrate existing Dynamic World Cloud Optimized GeoTIFFs (COGs) from local storage to MinIO for use by the inference pipeline.

---

## 1. Current State Assessment

### 1.1 Existing Data in Local Storage

| Directory | File Count | Description |
|-----------|------------|-------------|
| `data/dw_cogs/` | 132 TIF files | DW COGs (Agreement, HighestConf, Mode) for years 2015-2026 |
| `data/dw_baselines/` | ~50 TIF files | Partial baseline set |

### 1.2 DW COG File Naming Convention

```
DW_Zim_{Type}_{StartYear}_{EndYear}-{TileX}-{TileY}.tif
```

**Types**:
- `Agreement` - Agreement composite
- `HighestConf` - Highest confidence composite
- `Mode` - Mode composite

**Years**: 2015_2016 through 2025_2026 (11 seasons)

**Tiles**: 2x2 grid (`0000000000-0000000000`, `0000000000-0000065536`, `0000065536-0000000000`, `0000065536-0000065536`)

### 1.3 Training Dataset Available

The project already has training data in the `training/` directory:

| Directory | File Count | Description |
|-----------|------------|-------------|
| `training/` | 23 CSV files | Zimbabwe_Full_Augmented_Batch_*.csv |

**Dataset File Sizes**:
- Zimbabwe_Full_Augmented_Batch_1.csv - 11 MB
- Zimbabwe_Full_Augmented_Batch_2.csv - 10 MB
- Zimbabwe_Full_Augmented_Batch_10.csv - 11 MB
- ... (total ~250 MB of training data)

These files should be uploaded to `geocrop-datasets/` for use in model retraining.

### 1.4 MinIO Status

| Bucket | Status | Purpose |
|--------|--------|---------|
| `geocrop-models` | ✅ Created + populated | Trained ML models |
| `geocrop-baselines` | ❌ Needs creation | DW baseline COGs |
| `geocrop-results` | ❌ Needs creation | Output COGs from inference |
| `geocrop-datasets` | ❌ Needs creation + dataset | Training datasets |

---

## 2. MinIO Access Method

### 2.1 Option A: MinIO Client (Recommended)

Use the MinIO client (`mc`) from the control-plane node for bulk uploads.
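The naming convention above can be parsed programmatically, which later steps (object key layout, migration script) rely on. A minimal sketch, assuming the convention holds exactly as written; the `DW_PATTERN` regex and `parse_dw_filename` helper are illustrative, not existing project code:

```python
import re

# Matches e.g. DW_Zim_Agreement_2015_2016-0000000000-0000065536.tif
DW_PATTERN = re.compile(
    r"DW_Zim_(?P<type>Agreement|HighestConf|Mode)_"
    r"(?P<start>\d{4})_(?P<end>\d{4})"
    r"-(?P<tile_x>\d{10})-(?P<tile_y>\d{10})\.tif$"
)


def parse_dw_filename(name):
    """Return the components of a DW COG filename, or None if it doesn't match."""
    m = DW_PATTERN.match(name)
    if m is None:
        return None
    d = m.groupdict()
    # Season string in the YYYY_YYYY form used throughout this plan
    d["season"] = f"{d['start']}_{d['end']}"
    return d


info = parse_dw_filename("DW_Zim_HighestConf_2015_2016-0000000000-0000065536.tif")
print(info["type"], info["season"], info["tile_x"], info["tile_y"])
# → HighestConf 2015_2016 0000000000 0000065536
```

Rejecting non-matching names (returning `None`) lets a migration pass skip stray files instead of uploading them under a malformed key.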
**Step 1 — Get MinIO root credentials**

On the control-plane node:

1. Check how MinIO is configured:

   ```bash
   kubectl -n geocrop get deploy minio -o yaml | sed -n '1,200p'
   ```

   Look for env vars (e.g., `MINIO_ROOT_USER`, `MINIO_ROOT_PASSWORD`) or a Secret reference. In this setup the defaults are `minioadmin` / `minioadmin123`.

2. If credentials are stored in a Secret:

   ```bash
   kubectl -n geocrop get secret | grep -i minio
   kubectl -n geocrop get secret <minio-secret-name> -o jsonpath='{.data.MINIO_ROOT_USER}' | base64 -d; echo
   kubectl -n geocrop get secret <minio-secret-name> -o jsonpath='{.data.MINIO_ROOT_PASSWORD}' | base64 -d; echo
   ```

**Step 2 — Install mc (if missing)**

```bash
curl -fsSL https://dl.min.io/client/mc/release/linux-amd64/mc -o /usr/local/bin/mc
chmod +x /usr/local/bin/mc
mc --version
```

**Step 3 — Add MinIO alias**

Use in-cluster DNS so you don't rely on public ingress:

```bash
mc alias set geocrop-minio http://minio.geocrop.svc.cluster.local:9000 minioadmin minioadmin123
```

> Note: Substitute the actual root credentials retrieved in Step 1 if they differ from the defaults.

### 2.2 Create Missing Buckets

```bash
# Verify existing buckets
mc ls geocrop-minio

# Create any missing buckets
mc mb geocrop-minio/geocrop-baselines || true
mc mb geocrop-minio/geocrop-datasets || true
mc mb geocrop-minio/geocrop-results || true
mc mb geocrop-minio/geocrop-models || true

# Verify
mc ls geocrop-minio/geocrop-baselines
mc ls geocrop-minio/geocrop-datasets
```

### 2.3 Set Bucket Policies (Portfolio-Safe Defaults)

**Principle**: No public access to baselines/results/models. Downloads happen via signed URLs generated by the API.

```bash
# Set buckets to private
mc anonymous set none geocrop-minio/geocrop-baselines
mc anonymous set none geocrop-minio/geocrop-results
mc anonymous set none geocrop-minio/geocrop-models
mc anonymous set none geocrop-minio/geocrop-datasets

# Verify
mc anonymous get geocrop-minio/geocrop-baselines
```
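The same idempotent bucket setup can be done from Python with the `minio` SDK instead of `mc`, which is useful if a bootstrap job runs in-cluster. A minimal sketch; the `REQUIRED_BUCKETS` list mirrors the table in 1.4, and the `ensure_buckets` helper is illustrative, not an existing script:

```python
REQUIRED_BUCKETS = [
    "geocrop-models",
    "geocrop-baselines",
    "geocrop-results",
    "geocrop-datasets",
]


def buckets_to_create(existing, required=REQUIRED_BUCKETS):
    """Return the required buckets not yet present, preserving order."""
    existing = set(existing)
    return [b for b in required if b not in existing]


def ensure_buckets(client, required=REQUIRED_BUCKETS):
    """Create any missing buckets via a minio.Minio client (idempotent).

    Expects a client like:
        Minio("minio.geocrop.svc.cluster.local:9000",
              access_key=..., secret_key=..., secure=False)
    """
    existing = [b.name for b in client.list_buckets()]
    for name in buckets_to_create(existing, required):
        client.make_bucket(name)
    return buckets_to_create(existing, required)


# With only geocrop-models present (as in section 1.4), three buckets are missing:
print(buckets_to_create(["geocrop-models"]))
# → ['geocrop-baselines', 'geocrop-results', 'geocrop-datasets']
```

Keeping the pure `buckets_to_create` helper separate from the client call makes the logic testable without a live MinIO endpoint.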
## 3. Object Path Layout

### 3.1 geocrop-baselines

Store DW baseline COGs under:

```
dw/zim/summer/<season>/highest_conf/<filename>.tif
```

Where:
- `<season>` = `YYYY_YYYY` (e.g., `2015_2016`)
- `<filename>` = original filename (e.g., `DW_Zim_HighestConf_2015_2016.tif`)

**Example object key**:

```
dw/zim/summer/2015_2016/highest_conf/DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif
```

### 3.2 geocrop-datasets

```
datasets/<dataset_name>/<version>/...
```

For example:

```
datasets/zimbabwe_full/v1/Zimbabwe_Full_Augmented_Batch_1.csv
datasets/zimbabwe_full/v1/Zimbabwe_Full_Augmented_Batch_2.csv
...
datasets/zimbabwe_full/v1/metadata.json
```

### 3.3 geocrop-models

```
models/<model_name>/<version>/...
```

### 3.4 geocrop-results

```
results/<job_id>/...
```

---

## 4. Upload DW COGs into geocrop-baselines

### 4.1 Verify Local Source Folder

On the control-plane node:

```bash
ls -lh ~/geocrop/data/dw_cogs | head
file ~/geocrop/data/dw_cogs/*.tif | head
```

Optional sanity checks:
- Ensure each COG has overviews (if GDAL is installed):

  ```bash
  gdalinfo -json <path-to-cog>.tif | jq '.bands[0].overviews'
  ```

### 4.2 Dry-Run: Compute Count and Size

```bash
find ~/geocrop/data/dw_cogs -maxdepth 1 -type f -name '*.tif' | wc -l
du -sh ~/geocrop/data/dw_cogs
```

### 4.3 Upload with Mirroring

This keeps the bucket prefix in sync with the local folder:

```bash
mkdir -p ~/geocrop/logs
mc mirror --overwrite --remove --json \
  ~/geocrop/data/dw_cogs \
  geocrop-minio/geocrop-baselines/dw/zim/summer/ \
  > ~/geocrop/logs/mc_mirror_dw_baselines.jsonl
```

> Notes:
> - `--remove` deletes objects in the bucket that aren't in the local folder (safe if you only use this prefix for DW baselines).
> - For a safer first run, omit `--remove`.

### 4.4 Verify Upload

```bash
mc ls geocrop-minio/geocrop-baselines/dw/zim/summer/ | head
```

Spot-check an object's size and ETag:

```bash
mc stat geocrop-minio/geocrop-baselines/dw/zim/summer/<object-name>.tif
```

### 4.5 Record Baseline Index

Create a manifest for the worker to quickly map `year -> key`.
Generate on the control-plane node:

```bash
mc find geocrop-minio/geocrop-baselines/dw/zim/summer --name '*.tif' --json \
  | jq -r '.key' \
  | sort \
  > ~/geocrop/data/dw_baseline_keys.txt
```

Commit a copy into the repo later (or store in MinIO as `manifests/dw_baseline_keys.txt`).

### 4.6 Script Implementation (Alternative to `mc mirror`)

```python
# scripts/migrate_dw_to_minio.py
import os
import glob
import hashlib
import argparse
from concurrent.futures import ThreadPoolExecutor

from minio import Minio
from minio.error import S3Error


def calculate_md5(filepath):
    """Calculate MD5 checksum of a file (for optional integrity spot-checks)."""
    hash_md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def upload_file(client, bucket, source_path, dest_object):
    """Upload a single file to MinIO."""
    try:
        client.fput_object(bucket, dest_object, source_path)
        print(f"✅ Uploaded: {dest_object}")
        return True
    except S3Error as e:
        print(f"❌ Failed: {source_path} - {e}")
        return False


def main():
    parser = argparse.ArgumentParser(description="Migrate DW COGs to MinIO")
    parser.add_argument("--source", default="data/dw_cogs/", help="Source directory")
    parser.add_argument("--bucket", default="geocrop-baselines", help="MinIO bucket")
    parser.add_argument("--workers", type=int, default=4, help="Parallel workers")
    args = parser.parse_args()

    # Initialize MinIO client (plain HTTP inside the cluster, hence secure=False)
    client = Minio(
        "minio.geocrop.svc.cluster.local:9000",
        access_key=os.getenv("MINIO_ACCESS_KEY"),
        secret_key=os.getenv("MINIO_SECRET_KEY"),
        secure=False,
    )

    # Find all TIF files
    tif_files = glob.glob(os.path.join(args.source, "*.tif"))
    print(f"Found {len(tif_files)} TIF files to migrate")

    # Upload with parallel workers
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        futures = []
        for tif_path in tif_files:
            filename = os.path.basename(tif_path)
            # Parse filename to create directory structure
            # e.g., DW_Zim_Agreement_2015_2016-0000000000-0000000000.tif
            parts = filename.replace(".tif", "").split("-")
            type_year = parts[0]  # DW_Zim_Agreement_2015_2016
            dest_object = f"{type_year}/{filename}"
            futures.append(executor.submit(upload_file, client, args.bucket, tif_path, dest_object))

        # Wait for completion
        results = [f.result() for f in futures]
        success = sum(results)
        print(f"\nMigration complete: {success}/{len(tif_files)} files uploaded")


if __name__ == "__main__":
    main()
```

---

## 5. Upload Training Dataset to geocrop-datasets

### 5.1 Training Data Already Available

The project already has training data in the `training/` directory (23 CSV files, ~250 MB total):

| File | Size |
|------|------|
| Zimbabwe_Full_Augmented_Batch_1.csv | 11 MB |
| Zimbabwe_Full_Augmented_Batch_2.csv | 10 MB |
| Zimbabwe_Full_Augmented_Batch_3.csv | 11 MB |
| ... | ... |

### 5.2 Upload Training Data

```bash
# Object prefixes are created implicitly on upload; no mkdir-equivalent is needed

# Upload all training batches
mc cp training/Zimbabwe_Full_Augmented_Batch_*.csv \
  geocrop-minio/geocrop-datasets/zimbabwe_full/v1/

# Upload metadata
cat > /tmp/metadata.json << 'EOF'
{
  "version": "v1",
  "created": "2026-02-27",
  "description": "Augmented training dataset for GeoCrop crop classification",
  "source": "Manual labeling from high-resolution imagery + augmentation",
  "classes": [
    "cropland", "grass", "shrubland", "forest", "water", "builtup", "bare"
  ],
  "features": [
    "ndvi_peak", "evi_peak", "savi_peak"
  ],
  "total_samples": 25000,
  "spatial_extent": "Zimbabwe",
  "batches": 23
}
EOF
mc cp /tmp/metadata.json geocrop-minio/geocrop-datasets/zimbabwe_full/v1/metadata.json
```

### 5.3 Verify Dataset Upload

```bash
mc ls geocrop-minio/geocrop-datasets/zimbabwe_full/v1/
```

---
## 6. Acceptance Criteria (Must Be True Before Phase 1)

- [ ] Buckets exist: `geocrop-baselines`, `geocrop-datasets` (and `geocrop-models`, `geocrop-results`)
- [ ] Buckets are private (anonymous access disabled)
- [ ] DW baseline COGs available under `geocrop-baselines/dw/zim/summer/...`
- [ ] Training dataset uploaded to `geocrop-datasets/zimbabwe_full/v1/`
- [ ] A baseline manifest exists (text file listing object keys)

## 7. Common Pitfalls

- Uploading to the wrong bucket or root prefix → fix by mirroring into a single authoritative prefix
- Leaving MinIO public → fix with `mc anonymous set none`
- Mixing season windows (Nov–Apr vs Sep–May) → store DW as "summer season" per filename, but keep **model season** config separate

---

## 8. Next Steps

After this plan is approved:

1. Execute bucket creation commands
2. Run migration script for DW COGs
3. Upload training dataset
4. Verify worker can read from MinIO
5. Proceed to Plan 01: STAC Inference Worker

---

## 9. Technical Notes

### 9.1 MinIO Access from Worker

The worker uses internal Kubernetes DNS:

```python
MINIO_ENDPOINT = "minio.geocrop.svc.cluster.local:9000"
```

### 9.2 Bucket Naming Convention

Per AGENTS.md:
- `geocrop-models` - trained ML models
- `geocrop-results` - output COGs
- `geocrop-baselines` - DW baseline COGs
- `geocrop-datasets` - training datasets

### 9.3 File Size Estimates

| Dataset | File Count | Avg Size | Total |
|---------|------------|----------|-------|
| DW COGs | 132 | ~60 MB | ~7.9 GB |
| Training Data | 23 | ~11 MB | ~250 MB |
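Finally, once the worker can reach MinIO, it needs to turn the baseline manifest from section 4.5 into a quick season-to-key lookup. A minimal sketch, assuming the `dw/zim/summer/<season>/...` layout from section 3.1; the `index_baseline_keys` helper is illustrative, not existing worker code:

```python
def index_baseline_keys(keys):
    """Group baseline manifest object keys by season (YYYY_YYYY)."""
    index = {}
    for key in keys:
        parts = key.split("/")
        # Only index keys under the authoritative dw/zim/summer/ prefix
        if len(parts) >= 4 and parts[:3] == ["dw", "zim", "summer"]:
            index.setdefault(parts[3], []).append(key)
    return index


# Example keys as they would appear in dw_baseline_keys.txt
keys = [
    "dw/zim/summer/2015_2016/highest_conf/DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif",
    "dw/zim/summer/2016_2017/highest_conf/DW_Zim_HighestConf_2016_2017-0000000000-0000000000.tif",
]
print(sorted(index_baseline_keys(keys)))
# → ['2015_2016', '2016_2017']
```

In the worker this would be fed either from the `dw_baseline_keys.txt` manifest or from a live `list_objects` call against `geocrop-baselines`, and the resulting dict gives O(1) lookup of all tiles for a given season.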