# Plan 00: Data Migration & Storage Setup
**Status**: CRITICAL PRIORITY

**Date**: 2026-02-27

---

## Objective

Configure MinIO buckets and migrate existing Dynamic World Cloud Optimized GeoTIFFs (COGs) from local storage to MinIO for use by the inference pipeline.

---
## 1. Current State Assessment

### 1.1 Existing Data in Local Storage

| Directory | File Count | Description |
|-----------|------------|-------------|
| `data/dw_cogs/` | 132 TIF files | DW COGs (Agreement, HighestConf, Mode) for years 2015-2026 |
| `data/dw_baselines/` | ~50 TIF files | Partial baseline set |
### 1.2 DW COG File Naming Convention

```
DW_Zim_{Type}_{StartYear}_{EndYear}-{TileX}-{TileY}.tif
```

**Types**:
- `Agreement` - Agreement composite
- `HighestConf` - Highest confidence composite
- `Mode` - Mode composite

**Years**: 2015_2016 through 2025_2026 (11 seasons)

**Tiles**: 2x2 grid (`0000000000-0000000000`, `0000000000-0000065536`, `0000065536-0000000000`, `0000065536-0000065536`)
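The convention above can be parsed mechanically, which is handy for validating filenames before upload. A minimal sketch; the `parse_dw_filename` helper is illustrative, not part of the repo:

```python
import re

# Matches e.g. DW_Zim_HighestConf_2015_2016-0000000000-0000065536.tif
DW_PATTERN = re.compile(
    r"DW_Zim_(?P<type>Agreement|HighestConf|Mode)"
    r"_(?P<start>\d{4})_(?P<end>\d{4})"
    r"-(?P<tile_x>\d{10})-(?P<tile_y>\d{10})\.tif$"
)

def parse_dw_filename(name):
    """Split a DW COG filename into its components; raises ValueError on mismatch."""
    m = DW_PATTERN.match(name)
    if m is None:
        raise ValueError(f"Not a DW COG filename: {name}")
    return m.groupdict()

print(parse_dw_filename("DW_Zim_Mode_2024_2025-0000065536-0000000000.tif"))
# {'type': 'Mode', 'start': '2024', 'end': '2025', 'tile_x': '0000065536', 'tile_y': '0000000000'}
```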
### 1.3 Training Dataset Available

The project already has training data in the `training/` directory:

| Directory | File Count | Description |
|-----------|------------|-------------|
| `training/` | 23 CSV files | Zimbabwe_Full_Augmented_Batch_*.csv |

**Dataset File Sizes**:
- Zimbabwe_Full_Augmented_Batch_1.csv - 11 MB
- Zimbabwe_Full_Augmented_Batch_2.csv - 10 MB
- Zimbabwe_Full_Augmented_Batch_10.csv - 11 MB
- ... (total ~250 MB of training data)

These files should be uploaded to `geocrop-datasets/` for use in model retraining.
### 1.4 MinIO Status

| Bucket | Status | Purpose |
|--------|--------|---------|
| `geocrop-models` | ✅ Created + populated | Trained ML models |
| `geocrop-baselines` | ❌ Needs creation | DW baseline COGs |
| `geocrop-results` | ❌ Needs creation | Output COGs from inference |
| `geocrop-datasets` | ❌ Needs creation + dataset | Training datasets |

---
## 2. MinIO Access Method

### 2.1 Option A: MinIO Client (Recommended)

Use the MinIO client (`mc`) from the control-plane node for bulk uploads.

**Step 1 — Get MinIO root credentials**

On the control-plane node:

1. Check how MinIO is configured:
   ```bash
   kubectl -n geocrop get deploy minio -o yaml | sed -n '1,200p'
   ```
   Look for env vars (e.g., `MINIO_ROOT_USER`, `MINIO_ROOT_PASSWORD`) or a Secret reference. The current deployment defaults are user `minioadmin` / password `minioadmin123`.
2. If credentials are stored in a Secret:
   ```bash
   kubectl -n geocrop get secret | grep -i minio
   kubectl -n geocrop get secret <secret-name> -o jsonpath='{.data.MINIO_ROOT_USER}' | base64 -d; echo
   kubectl -n geocrop get secret <secret-name> -o jsonpath='{.data.MINIO_ROOT_PASSWORD}' | base64 -d; echo
   ```
**Step 2 — Install mc (if missing)**

```bash
curl -fsSL https://dl.min.io/client/mc/release/linux-amd64/mc -o /usr/local/bin/mc
chmod +x /usr/local/bin/mc
mc --version
```

**Step 3 — Add MinIO alias**

Use in-cluster DNS so you don't rely on public ingress:

```bash
mc alias set geocrop-minio http://minio.geocrop.svc.cluster.local:9000 "$MINIO_ROOT_USER" "$MINIO_ROOT_PASSWORD"
```

> Note: Use the credentials recovered in Step 1 (deployment default: `minioadmin` / `minioadmin123`).
### 2.2 Create Missing Buckets

```bash
# Verify existing buckets
mc ls geocrop-minio

# Create any missing buckets
mc mb geocrop-minio/geocrop-baselines || true
mc mb geocrop-minio/geocrop-datasets || true
mc mb geocrop-minio/geocrop-results || true
mc mb geocrop-minio/geocrop-models || true

# Verify
mc ls geocrop-minio/geocrop-baselines
mc ls geocrop-minio/geocrop-datasets
```
### 2.3 Set Bucket Policies (Portfolio-Safe Defaults)

**Principle**: No public access to baselines, results, or models. Downloads happen via signed URLs generated by the API.

```bash
# Set buckets to private
mc anonymous set none geocrop-minio/geocrop-baselines
mc anonymous set none geocrop-minio/geocrop-results
mc anonymous set none geocrop-minio/geocrop-models
mc anonymous set none geocrop-minio/geocrop-datasets

# Verify
mc anonymous get geocrop-minio/geocrop-baselines
```
## 3. Object Path Layout

### 3.1 geocrop-baselines

Store DW baseline COGs under:

```
dw/zim/summer/<season>/highest_conf/<filename>.tif
```

Where:
- `<season>` = `YYYY_YYYY` (e.g., `2015_2016`)
- `<filename>` = original (e.g., `DW_Zim_HighestConf_2015_2016.tif`)

**Example object key**:

```
dw/zim/summer/2015_2016/highest_conf/DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif
```
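The layout above can be encoded in a small helper that derives the season from the year pair embedded in the filename. A sketch; `baseline_key` is a hypothetical name:

```python
def baseline_key(filename):
    """Map a DW HighestConf COG filename to its geocrop-baselines object key."""
    stem = filename.rsplit(".tif", 1)[0]
    # The season is the YYYY_YYYY pair at the end of the prefix, before the tile suffix
    prefix = stem.split("-")[0]                 # DW_Zim_HighestConf_2015_2016
    season = "_".join(prefix.split("_")[-2:])   # 2015_2016
    return f"dw/zim/summer/{season}/highest_conf/{filename}"

print(baseline_key("DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif"))
# dw/zim/summer/2015_2016/highest_conf/DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif
```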
### 3.2 geocrop-datasets

```
datasets/<dataset_name>/<version>/...
```

For example:

```
datasets/zimbabwe_full/v1/Zimbabwe_Full_Augmented_Batch_1.csv
datasets/zimbabwe_full/v1/Zimbabwe_Full_Augmented_Batch_2.csv
...
datasets/zimbabwe_full/v1/metadata.json
```
### 3.3 geocrop-models

```
models/<model_name>/<version>/...
```

### 3.4 geocrop-results

```
results/<job_id>/...
```

---
## 4. Upload DW COGs into geocrop-baselines

### 4.1 Verify Local Source Folder

On the control-plane node:

```bash
ls -lh ~/geocrop/data/dw_cogs | head
file ~/geocrop/data/dw_cogs/*.tif | head
```

Optional sanity checks:
- Ensure each COG has overviews:
  ```bash
  # If gdalinfo is installed; a non-empty array means overviews exist
  gdalinfo -json <file> | jq '.bands[0].overviews'
  ```
### 4.2 Dry-Run: Compute Count and Size

```bash
find ~/geocrop/data/dw_cogs -maxdepth 1 -type f -name '*.tif' | wc -l
du -sh ~/geocrop/data/dw_cogs
```
### 4.3 Upload with Mirroring

This keeps the bucket prefix in sync with the local folder:

```bash
mc mirror --overwrite --remove --json \
  ~/geocrop/data/dw_cogs \
  geocrop-minio/geocrop-baselines/dw/zim/summer/ \
  > ~/geocrop/logs/mc_mirror_dw_baselines.jsonl
```

> Notes:
> - `--remove` deletes objects in the bucket that aren't in the local folder (safe if you only use this prefix for DW baselines).
> - For a safer first run, omit `--remove`.
### 4.4 Verify Upload

```bash
mc ls geocrop-minio/geocrop-baselines/dw/zim/summer/ | head
```

Spot-check hashes:

```bash
mc stat geocrop-minio/geocrop-baselines/dw/zim/summer/<somefile>.tif
```
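When comparing a local MD5 against the ETag that `mc stat` reports, remember that ETags from multipart uploads (they contain a `-`) are an MD5-of-parts, not the object's MD5, so only single-part objects compare directly. A stdlib sketch with hypothetical helper names:

```python
import hashlib

def local_md5(path, chunk_size=1 << 20):
    """MD5 of a local file, streamed in chunks to bound memory use."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def etag_matches(md5_hex, etag):
    """Compare an MD5 against an S3/MinIO ETag.

    Returns None (inconclusive) for multipart ETags like "abc123-7".
    """
    etag = etag.strip('"')
    if "-" in etag:
        return None
    return etag.lower() == md5_hex.lower()

# Single-part ETags are directly comparable; multipart ones are not
assert etag_matches("d41d8cd98f00b204e9800998ecf8427e", '"d41d8cd98f00b204e9800998ecf8427e"') is True
assert etag_matches("d41d8cd98f00b204e9800998ecf8427e", '"abc123-7"') is None
```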
### 4.5 Record Baseline Index

Create a manifest so the worker can quickly map `year -> key`.

Generate it on the control-plane node:

```bash
mc find geocrop-minio/geocrop-baselines/dw/zim/summer --name '*.tif' --json \
  | jq -r '.key' \
  | sort \
  > ~/geocrop/data/dw_baseline_keys.txt
```

Commit a copy into the repo later (or store it in MinIO as `manifests/dw_baseline_keys.txt`).
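The worker can load that manifest into a season → keys lookup. A sketch, assuming keys follow the Section 3.1 layout; `index_baseline_keys` is a hypothetical helper:

```python
from collections import defaultdict

def index_baseline_keys(lines):
    """Group manifest keys by season: dw/zim/summer/<season>/... -> <season>."""
    by_season = defaultdict(list)
    for key in (ln.strip() for ln in lines):
        if not key:
            continue
        parts = key.split("/")
        # dw/zim/summer/<season>/highest_conf/<file>.tif
        if len(parts) >= 4:
            by_season[parts[3]].append(key)
    return dict(by_season)

keys = [
    "dw/zim/summer/2015_2016/highest_conf/DW_Zim_HighestConf_2015_2016-0000000000-0000000000.tif",
    "dw/zim/summer/2016_2017/highest_conf/DW_Zim_HighestConf_2016_2017-0000000000-0000000000.tif",
]
index = index_baseline_keys(keys)
print(sorted(index))  # ['2015_2016', '2016_2017']
```

In practice the `lines` would come from reading `dw_baseline_keys.txt`.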
### 4.6 Script Implementation Requirements

```python
# scripts/migrate_dw_to_minio.py

import os
import glob
import hashlib
import argparse
from concurrent.futures import ThreadPoolExecutor

from minio import Minio
from minio.error import S3Error


def calculate_md5(filepath):
    """Calculate MD5 checksum of a file (available for post-upload spot checks)."""
    hash_md5 = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(4096), b""):
            hash_md5.update(chunk)
    return hash_md5.hexdigest()


def upload_file(client, bucket, source_path, dest_object):
    """Upload a single file to MinIO."""
    try:
        client.fput_object(bucket, dest_object, source_path)
        print(f"✅ Uploaded: {dest_object}")
        return True
    except S3Error as e:
        print(f"❌ Failed: {source_path} - {e}")
        return False


def main():
    parser = argparse.ArgumentParser(description="Migrate DW COGs to MinIO")
    parser.add_argument("--source", default="data/dw_cogs/", help="Source directory")
    parser.add_argument("--bucket", default="geocrop-baselines", help="MinIO bucket")
    parser.add_argument("--workers", type=int, default=4, help="Parallel workers")
    args = parser.parse_args()

    # Initialize MinIO client (in-cluster endpoint speaks plain HTTP, so secure=False)
    client = Minio(
        "minio.geocrop.svc.cluster.local:9000",
        access_key=os.getenv("MINIO_ACCESS_KEY"),
        secret_key=os.getenv("MINIO_SECRET_KEY"),
        secure=False,
    )

    # Find all TIF files
    tif_files = glob.glob(os.path.join(args.source, "*.tif"))
    print(f"Found {len(tif_files)} TIF files to migrate")

    # Upload with parallel workers
    with ThreadPoolExecutor(max_workers=args.workers) as executor:
        futures = []
        for tif_path in tif_files:
            filename = os.path.basename(tif_path)
            # Group objects by composite type + season, taken from the filename
            # e.g., DW_Zim_Agreement_2015_2016-0000000000-0000000000.tif
            parts = filename.replace(".tif", "").split("-")
            type_year = parts[0]  # DW_Zim_Agreement_2015_2016
            dest_object = f"{type_year}/{filename}"
            futures.append(executor.submit(upload_file, client, args.bucket, tif_path, dest_object))

        # Wait for completion
        results = [f.result() for f in futures]
        success = sum(results)
        print(f"\nMigration complete: {success}/{len(tif_files)} files uploaded")


if __name__ == "__main__":
    main()
```

---
## 5. Upload Training Dataset to geocrop-datasets

### 5.1 Training Data Already Available

The project already has training data in the `training/` directory (23 CSV files, ~250 MB total):

| File | Size |
|------|------|
| Zimbabwe_Full_Augmented_Batch_1.csv | 11 MB |
| Zimbabwe_Full_Augmented_Batch_2.csv | 10 MB |
| Zimbabwe_Full_Augmented_Batch_3.csv | 11 MB |
| ... | ... |
### 5.2 Upload Training Data

```bash
# Create dataset directory structure
mc mb geocrop-minio/geocrop-datasets/zimbabwe_full/v1 || true

# Upload all training batches
mc cp training/Zimbabwe_Full_Augmented_Batch_*.csv \
  geocrop-minio/geocrop-datasets/zimbabwe_full/v1/

# Upload metadata
cat > /tmp/metadata.json << 'EOF'
{
  "version": "v1",
  "created": "2026-02-27",
  "description": "Augmented training dataset for GeoCrop crop classification",
  "source": "Manual labeling from high-resolution imagery + augmentation",
  "classes": [
    "cropland",
    "grass",
    "shrubland",
    "forest",
    "water",
    "builtup",
    "bare"
  ],
  "features": [
    "ndvi_peak",
    "evi_peak",
    "savi_peak"
  ],
  "total_samples": 25000,
  "spatial_extent": "Zimbabwe",
  "batches": 23
}
EOF

mc cp /tmp/metadata.json geocrop-minio/geocrop-datasets/zimbabwe_full/v1/metadata.json
```
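Before uploading, the metadata can be sanity-checked against the local batch files so the declared `batches` count never drifts from reality. A minimal sketch; `check_dataset_metadata` is a hypothetical helper, and the paths follow the example above:

```python
import glob
import json

def check_dataset_metadata(metadata_path, batch_glob):
    """Verify the declared batch count matches the CSVs about to be uploaded."""
    with open(metadata_path) as f:
        meta = json.load(f)
    batch_files = glob.glob(batch_glob)
    ok = meta.get("batches") == len(batch_files)
    if not ok:
        print(f"Mismatch: metadata says {meta.get('batches')}, found {len(batch_files)} files")
    return ok

# e.g. check_dataset_metadata("/tmp/metadata.json",
#                             "training/Zimbabwe_Full_Augmented_Batch_*.csv")
```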
### 5.3 Verify Dataset Upload

```bash
mc ls geocrop-minio/geocrop-datasets/zimbabwe_full/v1/
```

---
## 6. Acceptance Criteria (Must Be True Before Phase 1)

- [ ] Buckets exist: `geocrop-baselines`, `geocrop-datasets` (and `geocrop-models`, `geocrop-results`)
- [ ] Buckets are private (anonymous access disabled)
- [ ] DW baseline COGs available under `geocrop-baselines/dw/zim/summer/...`
- [ ] Training dataset uploaded to `geocrop-datasets/zimbabwe_full/v1/`
- [ ] A baseline manifest exists (text file listing object keys)
## 7. Common Pitfalls

- Uploading to the wrong bucket or root prefix → fix by mirroring into a single authoritative prefix
- Leaving MinIO public → fix with `mc anonymous set none`
- Mixing season windows (Nov–Apr vs Sep–May) → store DW as "summer season" per filename, but keep the **model season** config separate

---
## 8. Next Steps

After this plan is approved:

1. Execute bucket creation commands
2. Run the migration script for DW COGs
3. Upload the training dataset
4. Verify the worker can read from MinIO
5. Proceed to Plan 01: STAC Inference Worker

---
## 9. Technical Notes

### 9.1 MinIO Access from Worker

The worker uses internal Kubernetes DNS:

```python
MINIO_ENDPOINT = "minio.geocrop.svc.cluster.local:9000"
```
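In the worker, the endpoint and credentials would typically come from env vars injected by the Deployment. A sketch under that assumption; the variable names and the `minio_settings` helper are illustrative:

```python
import os

def minio_settings(env=os.environ):
    """Resolve MinIO connection settings, falling back to in-cluster defaults."""
    return {
        "endpoint": env.get("MINIO_ENDPOINT", "minio.geocrop.svc.cluster.local:9000"),
        "access_key": env.get("MINIO_ACCESS_KEY", ""),
        "secret_key": env.get("MINIO_SECRET_KEY", ""),
        # Plain HTTP inside the cluster unless explicitly overridden
        "secure": env.get("MINIO_SECURE", "false").lower() == "true",
    }

print(minio_settings({})["endpoint"])  # minio.geocrop.svc.cluster.local:9000
```

The resulting dict maps directly onto the `Minio(...)` constructor arguments used in the migration script.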
### 9.2 Bucket Naming Convention

Per AGENTS.md:
- `geocrop-models` - trained ML models
- `geocrop-results` - output COGs
- `geocrop-baselines` - DW baseline COGs
- `geocrop-datasets` - training datasets
### 9.3 File Size Estimates

| Dataset | File Count | Avg Size | Total |
|---------|------------|----------|-------|
| DW COGs | 132 | ~60 MB | ~7.9 GB |
| Training Data | 23 | ~11 MB | ~250 MB |