# Plan 00: Data Migration & Storage Setup
**Status**: CRITICAL PRIORITY
**Date**: 2026-02-27
---
## Objective
Configure MinIO buckets and migrate existing Dynamic World Cloud Optimized GeoTIFFs (COGs) from local storage to MinIO for use by the inference pipeline.
---
## 1. Current State Assessment
### 1.1 Existing Data in Local Storage
| Directory | File Count | Description |
|-----------|------------|-------------|
| `data/dw_cogs/` | 132 TIF files | DW COGs (Agreement, HighestConf, Mode) for years 2015-2026 |
| `data/dw_baselines/` | ~50 TIF files | Partial baseline set |
### 1.2 DW COG File Naming Convention
```
DW_Zim_{Type}_{StartYear}_{EndYear}-{TileX}-{TileY}.tif
```
**Types**:
- `Agreement` - Agreement composite
- `HighestConf` - Highest confidence composite
- `Mode` - Mode composite
**Years**: 2015_2016 through 2025_2026 (11 seasons)
**Tiles**: 2x2 grid (`0000000000-0000000000`, `0000000000-0000065536`, `0000065536-0000000000`, `0000065536-0000065536`)
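A quick way to sanity-check filenames against this convention (a sketch; `parse_dw_name` and the regex are illustrative helpers, not part of the pipeline):

```python
import re

# Pattern for DW_Zim_{Type}_{StartYear}_{EndYear}-{TileX}-{TileY}.tif
DW_NAME_RE = re.compile(
    r"^DW_Zim_(?P<type>Agreement|HighestConf|Mode)_"
    r"(?P<start>\d{4})_(?P<end>\d{4})"
    r"-(?P<tile_x>\d{10})-(?P<tile_y>\d{10})\.tif$"
)

def parse_dw_name(filename: str) -> dict:
    """Split a DW COG filename into its named components."""
    m = DW_NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"Not a DW COG filename: {filename}")
    return m.groupdict()
```

Running it over `data/dw_cogs/` before upload catches any file that would not sort into the expected layout.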
### 1.3 Training Dataset Available
The project already has training data in the `training/` directory:
| Directory | File Count | Description |
|-----------|------------|-------------|
| `training/` | 23 CSV files | Zimbabwe_Full_Augmented_Batch_*.csv |
**Dataset File Sizes**:
- Zimbabwe_Full_Augmented_Batch_1.csv - 11 MB
- Zimbabwe_Full_Augmented_Batch_2.csv - 10 MB
- Zimbabwe_Full_Augmented_Batch_10.csv - 11 MB
- ... (total ~250 MB of training data)
These files should be uploaded to `geocrop-datasets/` for use in model retraining.
### 1.4 MinIO Status
| Bucket | Status | Purpose |
|--------|--------|---------|
| `geocrop-models` | ✅ Created + populated | Trained ML models |
| `geocrop-baselines` | ❌ Needs creation | DW baseline COGs |
| `geocrop-results` | ❌ Needs creation | Output COGs from inference |
| `geocrop-datasets` | ❌ Needs creation + dataset | Training datasets |
---
## 2. MinIO Access Method
### 2.1 MinIO Client (Recommended)
Use the MinIO client (`mc`) from the control-plane node for bulk uploads.
**Step 1 — Get MinIO root credentials**
On the control-plane node:
1. Check how MinIO is configured:
```bash
kubectl -n geocrop get deploy minio -o yaml | sed -n '1,200p'
```
Look for env vars (e.g., `MINIO_ROOT_USER`, `MINIO_ROOT_PASSWORD`) or a Secret reference.
Or fall back to the cluster defaults: user `minioadmin`, password `minioadmin123`.
2. If credentials are stored in a Secret:
```bash
kubectl -n geocrop get secret | grep -i minio
kubectl -n geocrop get secret <secret-name> -o jsonpath='{.data.MINIO_ROOT_USER}' | base64 -d; echo
kubectl -n geocrop get secret <secret-name> -o jsonpath='{.data.MINIO_ROOT_PASSWORD}' | base64 -d; echo
```
**Step 2 — Install mc (if missing)**
```bash
curl -fsSL https://dl.min.io/client/mc/release/linux-amd64/mc -o /usr/local/bin/mc
chmod +x /usr/local/bin/mc
mc --version
```
**Step 3 — Add MinIO alias**
Use in-cluster DNS so you don't rely on public ingress:
```bash
mc alias set geocrop-minio http://minio.geocrop.svc.cluster.local:9000 minioadmin minioadmin123
```
> Note: Replace the default credentials (`minioadmin` / `minioadmin123`) with the values recovered in Step 1 if they differ.
### 2.2 Create Missing Buckets
```bash
# Verify existing buckets
mc ls geocrop-minio
# Create any missing buckets
mc mb geocrop-minio/geocrop-baselines || true
mc mb geocrop-minio/geocrop-datasets || true
mc mb geocrop-minio/geocrop-results || true
mc mb geocrop-minio/geocrop-models || true
# Verify
mc ls geocrop-minio/geocrop-baselines
mc ls geocrop-minio/geocrop-datasets
```
### 2.3 Set Bucket Policies (Portfolio-Safe Defaults)
**Principle**: No public access to baselines/results/models. Downloads happen via signed URLs generated by the API.
```bash
# Set buckets to private
mc anonymous set none geocrop-minio/geocrop-baselines
mc anonymous set none geocrop-minio/geocrop-results
mc anonymous set none geocrop-minio/geocrop-models
mc anonymous set none geocrop-minio/geocrop-datasets
# Verify
mc anonymous get geocrop-minio/geocrop-baselines
```
## 3. Object Path Layout
### 3.1 geocrop-baselines
Store DW baseline COGs under:
```
dw/zim/summer/<season>/highest_conf/<filename>.tif
```
Where:
- `<season>` = `YYYY_YYYY` (e.g., `2015_2016`)
- `<filename>` = original filename including the tile suffix (e.g., `DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif`)
**Example object key**:
```
dw/zim/summer/2015_2016/highest_conf/DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif
```
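Deriving the object key from a local filename can be sketched as follows (`baseline_key` is a hypothetical helper, assuming filenames follow the convention in Section 1.2):

```python
def baseline_key(filename: str) -> str:
    """Build the geocrop-baselines object key for a DW COG filename."""
    # Strip extension, then split off the two 10-digit tile parts
    stem = filename.removesuffix(".tif")
    name, _tile_x, _tile_y = stem.rsplit("-", 2)
    # Season is the trailing YYYY_YYYY of the base name, e.g. 2015_2016
    season = "_".join(name.split("_")[-2:])
    return f"dw/zim/summer/{season}/highest_conf/{filename}"
```

Applied to the example filename above, this reproduces the example object key exactly.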
### 3.2 geocrop-datasets
```
datasets/<dataset_name>/<version>/...
```
For example:
```
datasets/zimbabwe_full/v1/Zimbabwe_Full_Augmented_Batch_1.csv
datasets/zimbabwe_full/v1/Zimbabwe_Full_Augmented_Batch_2.csv
...
datasets/zimbabwe_full/v1/metadata.json
```
### 3.3 geocrop-models
```
models/<model_name>/<version>/...
```
### 3.4 geocrop-results
```
results/<job_id>/...
```
---
## 4. Upload DW COGs into geocrop-baselines
### 4.1 Verify Local Source Folder
On control-plane node:
```bash
ls -lh ~/geocrop/data/dw_cogs | head
file ~/geocrop/data/dw_cogs/*.tif | head
```
Optional sanity checks:
- Ensure each COG has overviews:
```bash
gdalinfo -json <file> | jq '.bands[0].overviews'  # if gdalinfo is installed
```
### 4.2 Dry-Run: Compute Count and Size
```bash
find ~/geocrop/data/dw_cogs -maxdepth 1 -type f -name '*.tif' | wc -l
du -sh ~/geocrop/data/dw_cogs
```
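The same dry-run numbers can be computed in Python when they need to be logged programmatically (a sketch; `summarize_tifs` is a hypothetical helper):

```python
from pathlib import Path

def summarize_tifs(folder: str) -> tuple[int, int]:
    """Return (file count, total bytes) for *.tif directly under folder."""
    files = [p for p in Path(folder).glob("*.tif") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)
```

Comparing this count against `mc ls` output after the upload is a cheap completeness check.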
### 4.3 Upload with Mirroring
This keeps the bucket prefix in sync with the local folder:
```bash
mkdir -p ~/geocrop/logs
mc mirror --overwrite --remove --json \
  ~/geocrop/data/dw_cogs \
  geocrop-minio/geocrop-baselines/dw/zim/summer/ \
  > ~/geocrop/logs/mc_mirror_dw_baselines.jsonl
```
> Notes:
> - `--remove` removes objects in bucket that aren't in local folder (safe if you only use this prefix for DW baselines).
> - If you want safer first run, omit `--remove`.
### 4.4 Verify Upload
```bash
mc ls geocrop-minio/geocrop-baselines/dw/zim/summer/ | head
```
Spot-check hashes:
```bash
mc stat geocrop-minio/geocrop-baselines/dw/zim/summer/<somefile>.tif
```
### 4.5 Record Baseline Index
Create a manifest for the worker to quickly map `year -> key`.
Generate on control-plane:
```bash
mc find geocrop-minio/geocrop-baselines/dw/zim/summer --name '*.tif' --json \
| jq -r '.key' \
| sort \
> ~/geocrop/data/dw_baseline_keys.txt
```
Commit a copy into the repo later (or store it in MinIO as `manifests/dw_baseline_keys.txt`).
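The worker-side `season -> keys` lookup can be built from this manifest (a sketch assuming keys follow the Section 3.1 layout; `keys_by_season` is a hypothetical helper):

```python
from collections import defaultdict

def keys_by_season(manifest_lines):
    """Group baseline object keys by season.

    Assumes keys look like: dw/zim/summer/<season>/highest_conf/<file>.tif
    """
    grouped = defaultdict(list)
    for line in manifest_lines:
        key = line.strip()
        if not key:
            continue
        # Season is the 4th path segment (index 3)
        season = key.split("/")[3]
        grouped[season].append(key)
    return dict(grouped)
```

Loading `dw_baseline_keys.txt` through this gives the worker its `year -> key` map without listing the bucket at runtime.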
### 4.6 Script Implementation Requirements
```python
# scripts/migrate_dw_to_minio.py
import argparse
import glob
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

from minio import Minio
from minio.error import S3Error


def calculate_md5(filepath):
    """Calculate MD5 checksum of a file (useful for spot-checking uploads)."""
    hash_md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def upload_file(client, bucket, source_path, dest_object):
    """Upload a single file to MinIO."""
    try:
        client.fput_object(bucket, dest_object, source_path)
        print(f"✅ Uploaded: {dest_object}")
        return True
    except S3Error as e:
        print(f"❌ Failed: {source_path} - {e}")
        return False


def main():
    parser = argparse.ArgumentParser(description="Migrate DW COGs to MinIO")
    parser.add_argument("--source", default="data/dw_cogs/", help="Source directory")
    parser.add_argument("--bucket", default="geocrop-baselines", help="MinIO bucket")
    parser.add_argument("--workers", type=int, default=4, help="Parallel workers")
    args = parser.parse_args()

    # Initialize MinIO client (plain HTTP inside the cluster)
    client = Minio(
        "minio.geocrop.svc.cluster.local:9000",
        access_key=os.getenv("MINIO_ACCESS_KEY"),
        secret_key=os.getenv("MINIO_SECRET_KEY"),
        secure=False,
    )

    # Find all TIF files
    tif_files = glob.glob(os.path.join(args.source, "*.tif"))
    print(f"Found {len(tif_files)} TIF files to migrate")

    # Upload with parallel workers
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        futures = []
        for tif_path in tif_files:
            filename = os.path.basename(tif_path)
            # Parse filename to create the destination prefix
            # e.g., DW_Zim_Agreement_2015_2016-0000000000-0000000000.tif
            #   -> DW_Zim_Agreement_2015_2016/<filename>
            type_year = filename.replace(".tif", "").split("-")[0]
            dest_object = f"{type_year}/{filename}"
            futures.append(
                executor.submit(upload_file, client, args.bucket, tif_path, dest_object)
            )
        # Wait for completion
        results = [f.result() for f in futures]

    success = sum(results)
    print(f"\nMigration complete: {success}/{len(tif_files)} files uploaded")


if __name__ == "__main__":
    main()
```
---
## 5. Upload Training Dataset to geocrop-datasets
### 5.1 Training Data Already Available
The project already has training data in the `training/` directory (23 CSV files, ~250 MB total):
| File | Size |
|------|------|
| Zimbabwe_Full_Augmented_Batch_1.csv | 11 MB |
| Zimbabwe_Full_Augmented_Batch_2.csv | 10 MB |
| Zimbabwe_Full_Augmented_Batch_3.csv | 11 MB |
| ... | ... |
### 5.2 Upload Training Data
```bash
# Ensure the bucket exists (object prefixes are created implicitly on upload)
mc mb geocrop-minio/geocrop-datasets || true
# Upload all training batches
mc cp training/Zimbabwe_Full_Augmented_Batch_*.csv \
geocrop-minio/geocrop-datasets/zimbabwe_full/v1/
# Upload metadata
cat > /tmp/metadata.json << 'EOF'
{
"version": "v1",
"created": "2026-02-27",
"description": "Augmented training dataset for GeoCrop crop classification",
"source": "Manual labeling from high-resolution imagery + augmentation",
"classes": [
"cropland",
"grass",
"shrubland",
"forest",
"water",
"builtup",
"bare"
],
"features": [
"ndvi_peak",
"evi_peak",
"savi_peak"
],
"total_samples": 25000,
"spatial_extent": "Zimbabwe",
"batches": 23
}
EOF
mc cp /tmp/metadata.json geocrop-minio/geocrop-datasets/zimbabwe_full/v1/metadata.json
```
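Before uploading, the metadata file can be sanity-checked (a sketch; the required-field set here is an assumption, not a project schema):

```python
import json

# Fields the metadata.json above is expected to carry (an assumption)
REQUIRED_FIELDS = {"version", "created", "description", "classes", "features", "batches"}

def validate_metadata(raw: str) -> dict:
    """Parse metadata JSON and fail loudly if a required field is missing."""
    meta = json.loads(raw)
    missing = REQUIRED_FIELDS - meta.keys()
    if missing:
        raise ValueError(f"metadata.json missing fields: {sorted(missing)}")
    return meta
```

Running this over `/tmp/metadata.json` before the `mc cp` catches a truncated heredoc early.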
### 5.3 Verify Dataset Upload
```bash
mc ls geocrop-minio/geocrop-datasets/zimbabwe_full/v1/
```
---
## 6. Acceptance Criteria (Must Be True Before Phase 1)
- [ ] Buckets exist: `geocrop-baselines`, `geocrop-datasets` (and `geocrop-models`, `geocrop-results`)
- [ ] Buckets are private (anonymous access disabled)
- [ ] DW baseline COGs available under `geocrop-baselines/dw/zim/summer/...`
- [ ] Training dataset uploaded to `geocrop-datasets/zimbabwe_full/v1/`
- [ ] A baseline manifest exists (text file listing object keys)
## 7. Common Pitfalls
- Uploading to the wrong bucket or root prefix → fix by mirroring into a single authoritative prefix
- Leaving MinIO public → fix with `mc anonymous set none`
- Mixing season windows (NovApr vs SepMay) → store DW as "summer season" per filename, but keep **model season** config separate
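One way to keep the model season window separate from the DW "summer" label (all names and month values here are illustrative, not project config):

```python
# DW baseline label as encoded in object keys vs. the model's season window.
# The window values below are placeholders, not taken from project config.
DW_SEASON_LABEL = "summer"
MODEL_SEASON = {"start_month": 11, "end_month": 4}  # e.g. a Nov-Apr window

def model_window_str(cfg: dict) -> str:
    """Render a season window config as a human-readable month range."""
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
    return f"{months[cfg['start_month'] - 1]}-{months[cfg['end_month'] - 1]}"
```

Keeping the two in separate constants makes it harder to accidentally filter imagery by the storage label instead of the model window.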
---
## 8. Next Steps
After this plan is approved:
1. Execute bucket creation commands
2. Run migration script for DW COGs
3. Upload sample dataset
4. Verify worker can read from MinIO
5. Proceed to Plan 01: STAC Inference Worker
---
## 9. Technical Notes
### 9.1 MinIO Access from Worker
The worker uses internal Kubernetes DNS:
```python
MINIO_ENDPOINT = "minio.geocrop.svc.cluster.local:9000"
```
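For local testing outside the cluster, the endpoint can be made overridable (a sketch; the `MINIO_ENDPOINT` env-var override is an assumption, not existing worker code):

```python
import os

def minio_endpoint() -> str:
    """Resolve the MinIO endpoint, preferring an env override for local runs."""
    return os.getenv("MINIO_ENDPOINT", "minio.geocrop.svc.cluster.local:9000")
```

In-cluster the default applies; a developer can export `MINIO_ENDPOINT=localhost:9000` against a port-forwarded service.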
### 9.2 Bucket Naming Convention
Per AGENTS.md:
- `geocrop-models` - trained ML models
- `geocrop-results` - output COGs
- `geocrop-baselines` - DW baseline COGs
- `geocrop-datasets` - training datasets
### 9.3 File Size Estimates
| Dataset | File Count | Avg Size | Total |
|---------|------------|----------|-------|
| DW COGs | 132 | ~60 MB | ~7.9 GB |
| Training Data | 23 | ~11 MB | ~250 MB |