
Plan 00: Data Migration & Storage Setup

Status: CRITICAL PRIORITY
Date: 2026-02-27


Objective

Configure MinIO buckets and migrate existing Dynamic World Cloud Optimized GeoTIFFs (COGs) from local storage to MinIO for use by the inference pipeline.


1. Current State Assessment

1.1 Existing Data in Local Storage

| Directory | File Count | Description |
|---|---|---|
| data/dw_cogs/ | 132 TIF files | DW COGs (Agreement, HighestConf, Mode) for years 2015-2026 |
| data/dw_baselines/ | ~50 TIF files | Partial baseline set |

1.2 DW COG File Naming Convention

DW_Zim_{Type}_{StartYear}_{EndYear}-{TileX}-{TileY}.tif

Types:

  • Agreement - Agreement composite
  • HighestConf - Highest confidence composite
  • Mode - Mode composite

Years: 2015_2016 through 2025_2026 (11 seasons)

Tiles: 2x2 grid (0000000000-0000000000, 0000000000-0000065536, 0000065536-0000000000, 0000065536-0000065536)
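The convention above can be checked programmatically when validating a batch of files. A minimal parsing sketch; the regex and the `parse_dw_filename` helper are illustrative, not part of the pipeline:

```python
import re

# Regex for the DW COG naming convention described above.
DW_RE = re.compile(
    r"^DW_Zim_(?P<type>Agreement|HighestConf|Mode)"
    r"_(?P<start>\d{4})_(?P<end>\d{4})"
    r"-(?P<tile_x>\d{10})-(?P<tile_y>\d{10})\.tif$"
)

def parse_dw_filename(name: str) -> dict:
    """Split a DW COG filename into type, season years, and tile offsets."""
    m = DW_RE.match(name)
    if m is None:
        raise ValueError(f"not a DW COG filename: {name}")
    return m.groupdict()
```

Running every file in data/dw_cogs/ through a check like this before upload catches stray or misnamed files early.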

1.3 Training Dataset Available

The project already has training data in the training/ directory:

| Directory | File Count | Description |
|---|---|---|
| training/ | 23 CSV files | Zimbabwe_Full_Augmented_Batch_*.csv |

Dataset File Sizes:

  • Zimbabwe_Full_Augmented_Batch_1.csv - 11 MB
  • Zimbabwe_Full_Augmented_Batch_2.csv - 10 MB
  • Zimbabwe_Full_Augmented_Batch_10.csv - 11 MB
  • ... (total ~250 MB of training data)

These files should be uploaded to geocrop-datasets/ for use in model retraining.

1.4 MinIO Status

| Bucket | Status | Purpose |
|---|---|---|
| geocrop-models | Created + populated | Trained ML models |
| geocrop-baselines | Needs creation | DW baseline COGs |
| geocrop-results | Needs creation | Output COGs from inference |
| geocrop-datasets | Needs creation + dataset upload | Training datasets |

2. MinIO Access Method

Use the MinIO client (mc) from the control-plane node for bulk uploads.

2.1 Configure mc Access

Step 1 — Get MinIO root credentials

On the control-plane node:

  1. Check how MinIO is configured:

kubectl -n geocrop get deploy minio -o yaml | sed -n '1,200p'

Look for env vars (e.g., MINIO_ROOT_USER, MINIO_ROOT_PASSWORD) or a Secret reference. If the deployment uses the defaults, the credentials are minioadmin / minioadmin123.

  2. If credentials are stored in a Secret:

kubectl -n geocrop get secret | grep -i minio
kubectl -n geocrop get secret <secret-name> -o jsonpath='{.data.MINIO_ROOT_USER}' | base64 -d; echo
kubectl -n geocrop get secret <secret-name> -o jsonpath='{.data.MINIO_ROOT_PASSWORD}' | base64 -d; echo

Step 2 — Install mc (if missing)

curl -fsSL https://dl.min.io/client/mc/release/linux-amd64/mc -o /usr/local/bin/mc
chmod +x /usr/local/bin/mc
mc --version

Step 3 — Add MinIO alias

Use in-cluster DNS so you don't rely on public ingress:

mc alias set geocrop-minio http://minio.geocrop.svc.cluster.local:9000 <MINIO_ROOT_USER> <MINIO_ROOT_PASSWORD>

Note: Substitute the credentials retrieved in Step 1 (defaults: minioadmin / minioadmin123).

2.2 Create Missing Buckets

# Verify existing buckets
mc ls geocrop-minio

# Create any missing buckets
mc mb geocrop-minio/geocrop-baselines || true
mc mb geocrop-minio/geocrop-datasets || true
mc mb geocrop-minio/geocrop-results || true
mc mb geocrop-minio/geocrop-models || true

# Verify
mc ls geocrop-minio/geocrop-baselines
mc ls geocrop-minio/geocrop-datasets

2.3 Set Bucket Policies (Portfolio-Safe Defaults)

Principle: No public access to baselines, results, or models. Downloads happen via signed URLs generated by the API.

# Set buckets to private
mc anonymous set none geocrop-minio/geocrop-baselines
mc anonymous set none geocrop-minio/geocrop-results
mc anonymous set none geocrop-minio/geocrop-models
mc anonymous set none geocrop-minio/geocrop-datasets

# Verify
mc anonymous get geocrop-minio/geocrop-baselines

3. Object Path Layout

3.1 geocrop-baselines

Store DW baseline COGs under:

dw/zim/summer/<season>/highest_conf/<filename>.tif

Where:

  • <season> = YYYY_YYYY (e.g., 2015_2016)
  • <filename> = original (e.g., DW_Zim_HighestConf_2015_2016.tif)

Example object key:

dw/zim/summer/2015_2016/highest_conf/DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif
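The filename-to-key mapping can be sketched as a small helper. The type-to-prefix mapping (e.g. HighestConf -> highest_conf) is an assumption inferred from the example key above, and `baseline_key` is a hypothetical name:

```python
# Assumed mapping from composite type (per section 1.2) to the key prefix
# style used in the section 3.1 example.
TYPE_PREFIX = {"Agreement": "agreement", "HighestConf": "highest_conf", "Mode": "mode"}

def baseline_key(filename: str) -> str:
    """Map a DW COG filename to its geocrop-baselines object key."""
    # e.g. "DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif"
    base = filename.removesuffix(".tif").split("-")[0]  # DW_Zim_HighestConf_2015_2016
    _, _, dw_type, start, end = base.split("_")
    return f"dw/zim/summer/{start}_{end}/{TYPE_PREFIX[dw_type]}/{filename}"
```

Keeping this mapping in one function means the upload script and the worker can share a single source of truth for key layout.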

3.2 geocrop-datasets

datasets/<dataset_name>/<version>/...

For example:

datasets/zimbabwe_full/v1/Zimbabwe_Full_Augmented_Batch_1.csv
datasets/zimbabwe_full/v1/Zimbabwe_Full_Augmented_Batch_2.csv
...
datasets/zimbabwe_full/v1/metadata.json

3.3 geocrop-models

models/<model_name>/<version>/...

3.4 geocrop-results

results/<job_id>/...

4. Upload DW COGs into geocrop-baselines

4.1 Verify Local Source Folder

On control-plane node:

ls -lh ~/geocrop/data/dw_cogs | head
file ~/geocrop/data/dw_cogs/*.tif | head

Optional sanity checks:

  • Ensure each COG has overviews:
gdalinfo -json <file> | jq '.metadata'  # if gdalinfo installed

4.2 Dry-Run: Compute Count and Size

find ~/geocrop/data/dw_cogs -maxdepth 1 -type f -name '*.tif' | wc -l
du -sh ~/geocrop/data/dw_cogs

4.3 Upload with Mirroring

This keeps bucket in sync with folder:

mc mirror --overwrite --remove --json \
  ~/geocrop/data/dw_cogs \
  geocrop-minio/geocrop-baselines/dw/zim/summer/ \
  > ~/geocrop/logs/mc_mirror_dw_baselines.jsonl

Notes:

  • --remove deletes objects in the bucket that aren't in the local folder (safe only if this prefix is used exclusively for DW baselines).
  • For a safer first run, omit --remove.
  • mc mirror preserves the flat local directory layout; to produce the per-season key layout from section 3.1, use the Python migration script instead.

4.4 Verify Upload

mc ls geocrop-minio/geocrop-baselines/dw/zim/summer/ | head

Spot-check hashes:

mc stat geocrop-minio/geocrop-baselines/dw/zim/summer/<somefile>.tif

4.5 Record Baseline Index

Create a manifest for the worker to quickly map year -> key.

Generate on control-plane:

mc find geocrop-minio/geocrop-baselines/dw/zim/summer --name '*.tif' --json \
  | jq -r '.key' \
  | sort \
  > ~/geocrop/data/dw_baseline_keys.txt

Commit a copy into repo later (or store in MinIO as manifests/dw_baseline_keys.txt).
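Once the manifest exists, the worker-side season-to-key lookup can be sketched as below. This assumes the manifest format produced above (one object key per line) and the filename convention from section 1.2; `season_index` is an illustrative name:

```python
import re
from collections import defaultdict

# Matches the season portion of a DW filename, e.g. "_2015_2016-".
SEASON_RE = re.compile(r"_(\d{4}_\d{4})-")

def season_index(manifest_lines):
    """Group manifest object keys by season, e.g. '2015_2016' -> [keys...]."""
    index = defaultdict(list)
    for line in manifest_lines:
        key = line.strip()
        m = SEASON_RE.search(key)
        if m:
            index[m.group(1)].append(key)
    return dict(index)
```

The worker can load dw_baseline_keys.txt at startup and resolve a requested season to its tile keys without listing the bucket.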

4.6 Script Implementation (Python Alternative to mc mirror)

# scripts/migrate_dw_to_minio.py

import os
import glob
import hashlib
import argparse
from concurrent.futures import ThreadPoolExecutor
from minio import Minio
from minio.error import S3Error

def calculate_md5(filepath):
    """Calculate MD5 checksum of a file (for optional integrity spot-checks)."""
    hash_md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()

def upload_file(client, bucket, source_path, dest_object):
    """Upload a single file to MinIO."""
    try:
        client.fput_object(bucket, dest_object, source_path)
        print(f"✅ Uploaded: {dest_object}")
        return True
    except S3Error as e:
        print(f"❌ Failed: {source_path} - {e}")
        return False

def main():
    parser = argparse.ArgumentParser(description="Migrate DW COGs to MinIO")
    parser.add_argument("--source", default="data/dw_cogs/", help="Source directory")
    parser.add_argument("--bucket", default="geocrop-baselines", help="MinIO bucket")
    parser.add_argument("--workers", type=int, default=4, help="Parallel workers")
    args = parser.parse_args()
    
    # Initialize MinIO client (in-cluster endpoint is plain HTTP, so secure=False)
    client = Minio(
        "minio.geocrop.svc.cluster.local:9000",
        access_key=os.getenv("MINIO_ACCESS_KEY"),
        secret_key=os.getenv("MINIO_SECRET_KEY"),
        secure=False,
    )
    
    # Find all TIF files
    tif_files = glob.glob(os.path.join(args.source, "*.tif"))
    print(f"Found {len(tif_files)} TIF files to migrate")
    
    # Upload with parallel workers
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        futures = []
        type_prefix = {"Agreement": "agreement", "HighestConf": "highest_conf", "Mode": "mode"}
        for tif_path in tif_files:
            filename = os.path.basename(tif_path)
            # Build the object key per the section 3.1 layout, e.g.
            # DW_Zim_Agreement_2015_2016-0000000000-0000000000.tif
            #   -> dw/zim/summer/2015_2016/agreement/<filename>
            base = filename.replace(".tif", "").split("-")[0]  # DW_Zim_Agreement_2015_2016
            _, _, dw_type, start, end = base.split("_")
            dest_object = f"dw/zim/summer/{start}_{end}/{type_prefix[dw_type]}/{filename}"
            futures.append(executor.submit(upload_file, client, args.bucket, tif_path, dest_object))
        
        # Wait for completion
        results = [f.result() for f in futures]
        success = sum(results)
        print(f"\nMigration complete: {success}/{len(tif_files)} files uploaded")

if __name__ == "__main__":
    main()

5. Upload Training Dataset to geocrop-datasets

5.1 Training Data Already Available

The project already has training data in the training/ directory (23 CSV files, ~250 MB total):

| File | Size |
|---|---|
| Zimbabwe_Full_Augmented_Batch_1.csv | 11 MB |
| Zimbabwe_Full_Augmented_Batch_2.csv | 10 MB |
| Zimbabwe_Full_Augmented_Batch_3.csv | 11 MB |
| ... | ... |

5.2 Upload Training Data

# Create dataset directory structure
mc mb geocrop-minio/geocrop-datasets/zimbabwe_full/v1 || true

# Upload all training batches
mc cp training/Zimbabwe_Full_Augmented_Batch_*.csv \
  geocrop-minio/geocrop-datasets/zimbabwe_full/v1/

# Upload metadata
cat > /tmp/metadata.json << 'EOF'
{
  "version": "v1",
  "created": "2026-02-27",
  "description": "Augmented training dataset for GeoCrop crop classification",
  "source": "Manual labeling from high-resolution imagery + augmentation",
  "classes": [
    "cropland",
    "grass",
    "shrubland", 
    "forest",
    "water",
    "builtup",
    "bare"
  ],
  "features": [
    "ndvi_peak",
    "evi_peak", 
    "savi_peak"
  ],
  "total_samples": 25000,
  "spatial_extent": "Zimbabwe",
  "batches": 23
}
EOF

mc cp /tmp/metadata.json geocrop-minio/geocrop-datasets/zimbabwe_full/v1/metadata.json

5.3 Verify Dataset Upload

mc ls geocrop-minio/geocrop-datasets/zimbabwe_full/v1/
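Beyond listing objects, the metadata file can be sanity-checked locally before or after upload. A minimal sketch; the required-field set is an assumption based on the metadata.json shown above, and `validate_metadata` is an illustrative helper:

```python
import json

# Assumed required fields, taken from the metadata.json example above.
REQUIRED = {"version", "created", "classes", "features", "batches"}

def validate_metadata(text: str) -> dict:
    """Parse metadata.json and check it has the expected fields."""
    meta = json.loads(text)
    missing = REQUIRED - meta.keys()
    if missing:
        raise ValueError(f"metadata.json missing fields: {sorted(missing)}")
    if not isinstance(meta["batches"], int) or meta["batches"] <= 0:
        raise ValueError("batches must be a positive integer")
    return meta
```

A check like this can also compare meta["batches"] against the number of CSV objects listed under the v1/ prefix.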

6. Acceptance Criteria (Must Be True Before Phase 1)

  • Buckets exist: geocrop-baselines, geocrop-datasets (and geocrop-models, geocrop-results)
  • Buckets are private (anonymous access disabled)
  • DW baseline COGs available under geocrop-baselines/dw/zim/summer/...
  • Training dataset uploaded to geocrop-datasets/zimbabwe_full/v1/
  • A baseline manifest exists (text file listing object keys)

7. Common Pitfalls

  • Uploading to the wrong bucket or root prefix → fix by mirroring into a single authoritative prefix
  • Leaving MinIO public → fix with mc anonymous set none
  • Mixing season windows (Nov-Apr vs Sep-May) → store DW as "summer season" per filename, but keep model season config separate

8. Next Steps

After this plan is approved:

  1. Execute bucket creation commands
  2. Run migration script for DW COGs
  3. Upload sample dataset
  4. Verify worker can read from MinIO
  5. Proceed to Plan 01: STAC Inference Worker

9. Technical Notes

9.1 MinIO Access from Worker

The worker uses internal Kubernetes DNS:

MINIO_ENDPOINT = "minio.geocrop.svc.cluster.local:9000"

9.2 Bucket Naming Convention

Per AGENTS.md:

  • geocrop-models - trained ML models
  • geocrop-results - output COGs
  • geocrop-baselines - DW baseline COGs
  • geocrop-datasets - training datasets

9.3 File Size Estimates

| Dataset | File Count | Avg Size | Total |
|---|---|---|---|
| DW COGs | 132 | ~60 MB | ~7.9 GB |
| Training Data | 23 | ~11 MB | ~250 MB |