# Plan 00: Data Migration & Storage Setup
**Status**: CRITICAL PRIORITY
**Date**: 2026-02-27
---
## Objective
Configure MinIO buckets and migrate existing Dynamic World Cloud Optimized GeoTIFFs (COGs) from local storage to MinIO for use by the inference pipeline.
---
## 1. Current State Assessment
### 1.1 Existing Data in Local Storage
| Directory | File Count | Description |
|-----------|------------|-------------|
| `data/dw_cogs/` | 132 TIF files | DW COGs (Agreement, HighestConf, Mode) for years 2015-2026 |
| `data/dw_baselines/` | ~50 TIF files | Partial baseline set |
### 1.2 DW COG File Naming Convention
```
DW_Zim_{Type}_{StartYear}_{EndYear}-{TileX}-{TileY}.tif
```
**Types**:
- `Agreement` - Agreement composite
- `HighestConf` - Highest confidence composite
- `Mode` - Mode composite
**Years**: 2015_2016 through 2025_2026 (11 seasons)
**Tiles**: 2x2 grid (`0000000000-0000000000`, `0000000000-0000065536`, `0000065536-0000000000`, `0000065536-0000065536`)
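A quick way to sanity-check filenames against this convention (a sketch; `parse_dw_name` and the regex are illustrative helpers, not part of the pipeline):

```python
import re

# Pattern for DW_Zim_{Type}_{StartYear}_{EndYear}-{TileX}-{TileY}.tif
DW_NAME_RE = re.compile(
    r"^DW_Zim_(?P<type>Agreement|HighestConf|Mode)_"
    r"(?P<start>\d{4})_(?P<end>\d{4})"
    r"-(?P<tile_x>\d{10})-(?P<tile_y>\d{10})\.tif$"
)

def parse_dw_name(filename: str) -> dict:
    """Split a DW COG filename into its named components."""
    m = DW_NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"Not a DW COG filename: {filename}")
    return m.groupdict()
```

Running it over `data/dw_cogs/` before upload catches any file that would not sort into the expected layout.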
### 1.3 Training Dataset Available
The project already has training data in the `training/` directory:
| Directory | File Count | Description |
|-----------|------------|-------------|
| `training/` | 23 CSV files | Zimbabwe_Full_Augmented_Batch_*.csv |
**Dataset File Sizes**:
- Zimbabwe_Full_Augmented_Batch_1.csv - 11 MB
- Zimbabwe_Full_Augmented_Batch_2.csv - 10 MB
- Zimbabwe_Full_Augmented_Batch_10.csv - 11 MB
- ... (total ~250 MB of training data)
These files should be uploaded to `geocrop-datasets/` for use in model retraining.
### 1.4 MinIO Status
| Bucket | Status | Purpose |
|--------|--------|---------|
| `geocrop-models` | ✅ Created + populated | Trained ML models |
| `geocrop-baselines` | ❌ Needs creation | DW baseline COGs |
| `geocrop-results` | ❌ Needs creation | Output COGs from inference |
| `geocrop-datasets` | ❌ Needs creation + dataset | Training datasets |
---
## 2. MinIO Access Method
### 2.1 MinIO Client (Recommended)
Use the MinIO client (`mc`) from the control-plane node for bulk uploads.
**Step 1 — Get MinIO root credentials**
On the control-plane node:
1. Check how MinIO is configured:
```bash
kubectl -n geocrop get deploy minio -o yaml | sed -n '1,200p'
```
Look for env vars (e.g., `MINIO_ROOT_USER`, `MINIO_ROOT_PASSWORD`) or a Secret reference.
Or fall back to the cluster defaults: user `minioadmin`, password `minioadmin123`.
2. If credentials are stored in a Secret:
```bash
kubectl -n geocrop get secret | grep -i minio
kubectl -n geocrop get secret <secret-name> -o jsonpath='{.data.MINIO_ROOT_USER}' | base64 -d; echo
kubectl -n geocrop get secret <secret-name> -o jsonpath='{.data.MINIO_ROOT_PASSWORD}' | base64 -d; echo
```
**Step 2 — Install mc (if missing)**
```bash
curl -fsSL https://dl.min.io/client/mc/release/linux-amd64/mc -o /usr/local/bin/mc
chmod +x /usr/local/bin/mc
mc --version
```
**Step 3 — Add MinIO alias**
Use in-cluster DNS so you don't rely on public ingress:
```bash
mc alias set geocrop-minio http://minio.geocrop.svc.cluster.local:9000 minioadmin minioadmin123
```
> Note: Replace the default credentials (`minioadmin` / `minioadmin123`) with the values recovered in Step 1 if they differ.
### 2.2 Create Missing Buckets
```bash
# Verify existing buckets
mc ls geocrop-minio
# Create any missing buckets
mc mb geocrop-minio/geocrop-baselines || true
mc mb geocrop-minio/geocrop-datasets || true
mc mb geocrop-minio/geocrop-results || true
mc mb geocrop-minio/geocrop-models || true
# Verify
mc ls geocrop-minio/geocrop-baselines
mc ls geocrop-minio/geocrop-datasets
```
### 2.3 Set Bucket Policies (Portfolio-Safe Defaults)
**Principle**: No public access to baselines/results/models. Downloads happen via signed URLs generated by the API.
```bash
# Set buckets to private
mc anonymous set none geocrop-minio/geocrop-baselines
mc anonymous set none geocrop-minio/geocrop-results
mc anonymous set none geocrop-minio/geocrop-models
mc anonymous set none geocrop-minio/geocrop-datasets
# Verify
mc anonymous get geocrop-minio/geocrop-baselines
```
## 3. Object Path Layout
### 3.1 geocrop-baselines
Store DW baseline COGs under:
```
dw/zim/summer/<season>/highest_conf/<filename>.tif
```
Where:
- `<season>` = `YYYY_YYYY` (e.g., `2015_2016`)
- `<filename>` = original filename including the tile suffix (e.g., `DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif`)
**Example object key**:
```
dw/zim/summer/2015_2016/highest_conf/DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif
```
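Deriving the object key from a local filename can be sketched as follows (`baseline_key` is a hypothetical helper, assuming filenames follow the convention in Section 1.2):

```python
def baseline_key(filename: str) -> str:
    """Build the geocrop-baselines object key for a DW COG filename."""
    # Strip extension, then split off the two 10-digit tile parts
    stem = filename.removesuffix(".tif")
    name, _tile_x, _tile_y = stem.rsplit("-", 2)
    # Season is the trailing YYYY_YYYY of the base name, e.g. 2015_2016
    season = "_".join(name.split("_")[-2:])
    return f"dw/zim/summer/{season}/highest_conf/{filename}"
```

Applied to the example filename above, this reproduces the example object key exactly.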
### 3.2 geocrop-datasets
```
datasets/<dataset_name>/<version>/...
```
For example:
```
datasets/zimbabwe_full/v1/Zimbabwe_Full_Augmented_Batch_1.csv
datasets/zimbabwe_full/v1/Zimbabwe_Full_Augmented_Batch_2.csv
...
datasets/zimbabwe_full/v1/metadata.json
```
### 3.3 geocrop-models
```
models/<model_name>/<version>/...
```
### 3.4 geocrop-results
```
results/<job_id>/...
```
---
## 4. Upload DW COGs into geocrop-baselines
### 4.1 Verify Local Source Folder
On control-plane node:
```bash
ls -lh ~/geocrop/data/dw_cogs | head
file ~/geocrop/data/dw_cogs/*.tif | head
```
Optional sanity checks:
- Ensure each COG has overviews:
```bash
gdalinfo -json <file> | jq '.bands[0].overviews'  # if gdalinfo is installed
```
### 4.2 Dry-Run: Compute Count and Size
```bash
find ~/geocrop/data/dw_cogs -maxdepth 1 -type f -name '*.tif' | wc -l
du -sh ~/geocrop/data/dw_cogs
```
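The same dry-run numbers can be computed in Python when they need to be logged programmatically (a sketch; `summarize_tifs` is a hypothetical helper):

```python
from pathlib import Path

def summarize_tifs(folder: str) -> tuple[int, int]:
    """Return (file count, total bytes) for *.tif directly under folder."""
    files = [p for p in Path(folder).glob("*.tif") if p.is_file()]
    return len(files), sum(p.stat().st_size for p in files)
```

Comparing this count against `mc ls` output after the upload is a cheap completeness check.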
### 4.3 Upload with Mirroring
This keeps the bucket prefix in sync with the local folder:
```bash
mkdir -p ~/geocrop/logs
mc mirror --overwrite --remove --json \
  ~/geocrop/data/dw_cogs \
  geocrop-minio/geocrop-baselines/dw/zim/summer/ \
  > ~/geocrop/logs/mc_mirror_dw_baselines.jsonl
```
> Notes:
> - `--remove` removes objects in bucket that aren't in local folder (safe if you only use this prefix for DW baselines).
> - If you want safer first run, omit `--remove`.
### 4.4 Verify Upload
```bash
mc ls geocrop-minio/geocrop-baselines/dw/zim/summer/ | head
```
Spot-check hashes:
```bash
mc stat geocrop-minio/geocrop-baselines/dw/zim/summer/<somefile>.tif
```
### 4.5 Record Baseline Index
Create a manifest for the worker to quickly map `year -> key`.
Generate on control-plane:
```bash
mc find geocrop-minio/geocrop-baselines/dw/zim/summer --name '*.tif' --json \
| jq -r '.key' \
| sort \
> ~/geocrop/data/dw_baseline_keys.txt
```
Commit a copy into the repo later (or store it in MinIO as `manifests/dw_baseline_keys.txt`).
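The worker-side `season -> keys` lookup can be built from this manifest (a sketch assuming keys follow the Section 3.1 layout; `keys_by_season` is a hypothetical helper):

```python
from collections import defaultdict

def keys_by_season(manifest_lines):
    """Group baseline object keys by season.

    Assumes keys look like: dw/zim/summer/<season>/highest_conf/<file>.tif
    """
    grouped = defaultdict(list)
    for line in manifest_lines:
        key = line.strip()
        if not key:
            continue
        # Season is the 4th path segment (index 3)
        season = key.split("/")[3]
        grouped[season].append(key)
    return dict(grouped)
```

Loading `dw_baseline_keys.txt` through this gives the worker its `year -> key` map without listing the bucket at runtime.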
### 4.6 Script Implementation Requirements
```python
# scripts/migrate_dw_to_minio.py
import argparse
import glob
import hashlib
import os
from concurrent.futures import ThreadPoolExecutor

from minio import Minio
from minio.error import S3Error


def calculate_md5(filepath):
    """Calculate MD5 checksum of a file (useful for spot-checking uploads)."""
    hash_md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def upload_file(client, bucket, source_path, dest_object):
    """Upload a single file to MinIO."""
    try:
        client.fput_object(bucket, dest_object, source_path)
        print(f"✅ Uploaded: {dest_object}")
        return True
    except S3Error as e:
        print(f"❌ Failed: {source_path} - {e}")
        return False


def main():
    parser = argparse.ArgumentParser(description="Migrate DW COGs to MinIO")
    parser.add_argument("--source", default="data/dw_cogs/", help="Source directory")
    parser.add_argument("--bucket", default="geocrop-baselines", help="MinIO bucket")
    parser.add_argument("--workers", type=int, default=4, help="Parallel workers")
    args = parser.parse_args()

    # Initialize MinIO client (plain HTTP inside the cluster)
    client = Minio(
        "minio.geocrop.svc.cluster.local:9000",
        access_key=os.getenv("MINIO_ACCESS_KEY"),
        secret_key=os.getenv("MINIO_SECRET_KEY"),
        secure=False,
    )

    # Find all TIF files
    tif_files = glob.glob(os.path.join(args.source, "*.tif"))
    print(f"Found {len(tif_files)} TIF files to migrate")

    # Upload with parallel workers
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        futures = []
        for tif_path in tif_files:
            filename = os.path.basename(tif_path)
            # Parse filename to create the destination prefix
            # e.g., DW_Zim_Agreement_2015_2016-0000000000-0000000000.tif
            #   -> DW_Zim_Agreement_2015_2016/<filename>
            type_year = filename.replace(".tif", "").split("-")[0]
            dest_object = f"{type_year}/{filename}"
            futures.append(
                executor.submit(upload_file, client, args.bucket, tif_path, dest_object)
            )
        # Wait for completion
        results = [f.result() for f in futures]

    success = sum(results)
    print(f"\nMigration complete: {success}/{len(tif_files)} files uploaded")


if __name__ == "__main__":
    main()
```
---
## 5. Upload Training Dataset to geocrop-datasets
### 5.1 Training Data Already Available
The project already has training data in the `training/` directory (23 CSV files, ~250 MB total):
| File | Size |
|------|------|
| Zimbabwe_Full_Augmented_Batch_1.csv | 11 MB |
| Zimbabwe_Full_Augmented_Batch_2.csv | 10 MB |
| Zimbabwe_Full_Augmented_Batch_3.csv | 11 MB |
| ... | ... |
### 5.2 Upload Training Data
```bash
# Ensure the bucket exists (object prefixes are created implicitly on upload)
mc mb geocrop-minio/geocrop-datasets || true
# Upload all training batches
mc cp training/Zimbabwe_Full_Augmented_Batch_*.csv \
geocrop-minio/geocrop-datasets/zimbabwe_full/v1/
# Upload metadata
cat > /tmp/metadata.json << 'EOF'
{
"version": "v1",
"created": "2026-02-27",
"description": "Augmented training dataset for GeoCrop crop classification",
"source": "Manual labeling from high-resolution imagery + augmentation",
"classes": [
"cropland",
"grass",
"shrubland",
"forest",
"water",
"builtup",
"bare"
],
"features": [
"ndvi_peak",
"evi_peak",
"savi_peak"
],
"total_samples": 25000,
"spatial_extent": "Zimbabwe",
"batches": 23
}
EOF
mc cp /tmp/metadata.json geocrop-minio/geocrop-datasets/zimbabwe_full/v1/metadata.json
```
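Before uploading, the metadata file can be sanity-checked (a sketch; the required-field set here is an assumption, not a project schema):

```python
import json

# Fields the metadata.json above is expected to carry (an assumption)
REQUIRED_FIELDS = {"version", "created", "description", "classes", "features", "batches"}

def validate_metadata(raw: str) -> dict:
    """Parse metadata JSON and fail loudly if a required field is missing."""
    meta = json.loads(raw)
    missing = REQUIRED_FIELDS - meta.keys()
    if missing:
        raise ValueError(f"metadata.json missing fields: {sorted(missing)}")
    return meta
```

Running this over `/tmp/metadata.json` before the `mc cp` catches a truncated heredoc early.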
### 5.3 Verify Dataset Upload
```bash
mc ls geocrop-minio/geocrop-datasets/zimbabwe_full/v1/
```
---
## 6. Acceptance Criteria (Must Be True Before Phase 1)
- [ ] Buckets exist: `geocrop-baselines`, `geocrop-datasets` (and `geocrop-models`, `geocrop-results`)
- [ ] Buckets are private (anonymous access disabled)
- [ ] DW baseline COGs available under `geocrop-baselines/dw/zim/summer/...`
- [ ] Training dataset uploaded to `geocrop-datasets/zimbabwe_full/v1/`
- [ ] A baseline manifest exists (text file listing object keys)
## 7. Common Pitfalls
- Uploading to the wrong bucket or root prefix → fix by mirroring into a single authoritative prefix
- Leaving MinIO public → fix with `mc anonymous set none`
- Mixing season windows (NovApr vs SepMay) → store DW as "summer season" per filename, but keep **model season** config separate
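One way to keep the model season window separate from the DW "summer" label (all names and month values here are illustrative, not project config):

```python
# DW baseline label as encoded in object keys vs. the model's season window.
# The window values below are placeholders, not taken from project config.
DW_SEASON_LABEL = "summer"
MODEL_SEASON = {"start_month": 11, "end_month": 4}  # e.g. a Nov-Apr window

def model_window_str(cfg: dict) -> str:
    """Render a season window config as a human-readable month range."""
    months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
              "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
    return f"{months[cfg['start_month'] - 1]}-{months[cfg['end_month'] - 1]}"
```

Keeping the two in separate constants makes it harder to accidentally filter imagery by the storage label instead of the model window.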
---
## 8. Next Steps
After this plan is approved:
1. Execute bucket creation commands
2. Run migration script for DW COGs
3. Upload sample dataset
4. Verify worker can read from MinIO
5. Proceed to Plan 01: STAC Inference Worker
---
## 9. Technical Notes
### 9.1 MinIO Access from Worker
The worker uses internal Kubernetes DNS:
```python
MINIO_ENDPOINT = "minio.geocrop.svc.cluster.local:9000"
```
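For local testing outside the cluster, the endpoint can be made overridable (a sketch; the `MINIO_ENDPOINT` env-var override is an assumption, not existing worker code):

```python
import os

def minio_endpoint() -> str:
    """Resolve the MinIO endpoint, preferring an env override for local runs."""
    return os.getenv("MINIO_ENDPOINT", "minio.geocrop.svc.cluster.local:9000")
```

In-cluster the default applies; a developer can export `MINIO_ENDPOINT=localhost:9000` against a port-forwarded service.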
### 9.2 Bucket Naming Convention
Per AGENTS.md:
- `geocrop-models` - trained ML models
- `geocrop-results` - output COGs
- `geocrop-baselines` - DW baseline COGs
- `geocrop-datasets` - training datasets
### 9.3 File Size Estimates
| Dataset | File Count | Avg Size | Total |
|---------|------------|----------|-------|
| DW COGs | 132 | ~60 MB | ~7.9 GB |
| Training Data | 23 | ~11 MB | ~250 MB |