Data Preparation
Proper data preparation is crucial for successful MarrmotFlow workflows. This guide covers how to prepare catchment data, forcing data, and other inputs.
Catchment Data
Format Requirements
MarrmotFlow accepts catchment data in various spatial formats:
Shapefiles (.shp)
GeoJSON (.geojson)
GeoPackage (.gpkg)
Any format supported by GeoPandas/GDAL
Loading Catchment Data
import geopandas as gpd
# From shapefile
catchments = gpd.read_file("catchments.shp")
# From GeoJSON
catchments = gpd.read_file("catchments.geojson")
# From GeoPackage
catchments = gpd.read_file("catchments.gpkg")
Required Attributes
Your catchment data should include:
Unique identifiers: Each catchment should have a unique ID
Geometry: Polygon geometries defining catchment boundaries
Optional metadata: Name, area, elevation, etc.
# Example catchment structure
print(catchments.columns)
# Index(['id', 'name', 'area_km2', 'mean_elev', 'geometry'], dtype='object')
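The attribute requirements above can be checked programmatically. The following is a minimal sketch, assuming the identifier column is called `id` as in the example structure; the helper names `find_duplicate_ids` and `all_polygons` are hypothetical and not part of MarrmotFlow.

```python
import geopandas as gpd
from shapely.geometry import box

# Toy catchments for illustration; 'c2' is deliberately duplicated
catchments = gpd.GeoDataFrame(
    {"id": ["c1", "c2", "c2"]},
    geometry=[box(0, 0, 1, 1), box(1, 0, 2, 1), box(2, 0, 3, 1)],
    crs="EPSG:4326",
)

def find_duplicate_ids(gdf, id_column="id"):
    """Return the sorted identifier values that occur more than once."""
    ids = gdf[id_column]
    return sorted(ids[ids.duplicated(keep=False)].unique().tolist())

def all_polygons(gdf):
    """Return True if every geometry is a Polygon or MultiPolygon."""
    return bool(gdf.geom_type.isin(["Polygon", "MultiPolygon"]).all())

duplicate_ids = find_duplicate_ids(catchments)
print(f"Duplicate IDs: {duplicate_ids}")
print(f"All polygon geometries: {all_polygons(catchments)}")
```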
Coordinate Reference Systems
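Ensure your catchment data has a defined CRS:
# Check CRS
print(catchments.crs)
# Set CRS if missing (this only labels the data; coordinates are not transformed)
if catchments.crs is None:
    catchments = catchments.set_crs('EPSG:4326')
# Reproject if needed (this transforms the coordinates)
catchments = catchments.to_crs('EPSG:3857')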
Data Validation
Validate your catchment data before using it:
# Check for valid geometries
invalid_geoms = catchments[~catchments.is_valid]
if not invalid_geoms.empty:
    print(f"Warning: {len(invalid_geoms)} invalid geometries found")
# Check for missing data
missing_data = catchments.isnull().sum()
print("Missing data per column:")
print(missing_data[missing_data > 0])
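# Basic statistics
# Note: area is reported in squared CRS units (square degrees for EPSG:4326),
# so reproject to a projected CRS first if you need a physical area
print(f"Total catchments: {len(catchments)}")
print(f"Total area: {catchments.area.sum():.2f} square units")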
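If you need the total area in physical units, one option is to compute it in an equal-area projection. The sketch below assumes EPSG:6933 (a global equal-area CRS with metre units) is acceptable for your region; a regional equal-area CRS may be more appropriate. Note that EPSG:3857 (Web Mercator) distorts areas and is not a good choice for this step.

```python
import geopandas as gpd
from shapely.geometry import box

# One toy catchment: a 1 x 1 degree cell touching the equator (illustrative only)
catchments = gpd.GeoDataFrame(
    {"id": ["c1"]},
    geometry=[box(10.0, 0.0, 11.0, 1.0)],
    crs="EPSG:4326",
)

# Reproject a copy to an equal-area CRS (assumed choice) and convert m^2 to km^2
equal_area = catchments.to_crs("EPSG:6933")
catchments["area_km2"] = equal_area.geometry.area / 1e6

total_area_km2 = float(catchments["area_km2"].sum())
print(f"Total area: {total_area_km2:.1f} km2")
```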
Forcing Data
Supported Formats
MarrmotFlow supports various forcing data formats:
NetCDF (.nc, .nc4)
HDF5 (.h5, .hdf5)
Zarr stores
Any format supported by xarray
Data Structure
Your forcing data should be structured as multidimensional arrays with:
Time dimension: Temporal coordinate
Spatial dimensions: Latitude/longitude or other spatial coordinates
Variables: Precipitation, temperature, and other meteorological variables
import xarray as xr
# Load forcing data
forcing = xr.open_dataset("climate_data.nc")
# Check structure
print(forcing)
print(forcing.coords)
print(forcing.data_vars)
Required Variables
At minimum, you need:
Precipitation: Any units that can be converted to mm/day
Temperature: Any units that can be converted to Celsius
# Example forcing data structure
forcing_vars = {
    "precip": "precipitation",  # Variable name in your data
    "temp": "temperature"       # Variable name in your data
}
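Before passing such a mapping on, it can help to confirm that the named variables actually exist in the dataset. The following sketch uses a small synthetic dataset and plain xarray calls; it illustrates a check you can run yourself and does not describe how MarrmotFlow resolves the mapping internally.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Small synthetic dataset standing in for a real forcing file
times = pd.date_range("2000-01-01", periods=3, freq="D")
forcing = xr.Dataset(
    {
        "precipitation": (("time",), np.array([0.0, 2.5, 1.0])),
        "temperature": (("time",), np.array([271.0, 273.5, 275.0])),
        "wind": (("time",), np.array([3.0, 4.0, 2.0])),
    },
    coords={"time": times},
)

forcing_vars = {"precip": "precipitation", "temp": "temperature"}

# Fail early if a mapped variable is absent from the data
missing = [name for name in forcing_vars.values() if name not in forcing.data_vars]
if missing:
    raise KeyError(f"Variables not found in forcing data: {missing}")

# Keep only the mapped variables and rename them to the short keys
selected = forcing[list(forcing_vars.values())].rename(
    {source: target for target, source in forcing_vars.items()}
)
print(list(selected.data_vars))
```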
Time Handling
Ensure proper time coordinates:
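import pandas as pd
# Check time coordinate
print(forcing.time)
# Convert time if needed
forcing['time'] = pd.to_datetime(forcing.time.values)
# xarray stores timezone-naive datetime64 values, so keep times in UTC.
# If the source times are in a local zone, convert to UTC and drop the tz info
# ('America/Denver' is only an example zone):
local_times = forcing.indexes['time'].tz_localize('America/Denver')
forcing = forcing.assign_coords(
    time=local_times.tz_convert('UTC').tz_localize(None)
)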
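Beyond the data type of the time coordinate, it is worth checking that the time axis is sorted, free of duplicates, and regularly spaced. The helper below is a hypothetical sketch using pandas only; `check_time_index` is not a MarrmotFlow function.

```python
import pandas as pd

def check_time_index(times):
    """Summarize ordering, duplicates, and step regularity of a time axis."""
    index = pd.DatetimeIndex(times)
    # Distinct step sizes between consecutive time stamps
    steps = index.to_series().diff().dropna().unique()
    return {
        "is_sorted": bool(index.is_monotonic_increasing),
        "has_duplicates": bool(index.has_duplicates),
        "is_regular": len(steps) <= 1,
    }

# Example: a daily series with one missing day (2000-01-03)
times = pd.to_datetime(["2000-01-01", "2000-01-02", "2000-01-04"])
report = check_time_index(times)
print(report)
```

With an xarray dataset, the same helper can be called as `check_time_index(forcing.time.values)`.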
Spatial Alignment
Your forcing data should cover your catchment areas:
# Check spatial bounds
lon_min, lat_min, lon_max, lat_max = catchments.total_bounds
forcing_lon_range = [forcing.lon.min().item(), forcing.lon.max().item()]
forcing_lat_range = [forcing.lat.min().item(), forcing.lat.max().item()]
print(f"Catchment bounds: {lon_min:.2f}, {lat_min:.2f}, {lon_max:.2f}, {lat_max:.2f}")
print(f"Forcing lon range: {forcing_lon_range}")
print(f"Forcing lat range: {forcing_lat_range}")
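The printed ranges can also be compared automatically. The sketch below assumes both datasets use geographic coordinates in degrees; the helpers `covers` and `to_signed_longitude` are hypothetical. The longitude conversion matters because some climate datasets store longitudes as 0 to 360 rather than -180 to 180.

```python
def covers(catchment_bounds, lon_range, lat_range):
    """Return True if the forcing ranges fully contain the catchment bounds."""
    lon_min, lat_min, lon_max, lat_max = catchment_bounds
    return (
        lon_range[0] <= lon_min
        and lon_max <= lon_range[1]
        and lat_range[0] <= lat_min
        and lat_max <= lat_range[1]
    )

def to_signed_longitude(lon):
    """Map a longitude from the 0..360 convention to -180..180."""
    return ((lon + 180.0) % 360.0) - 180.0

# Illustrative values: catchments in western Europe, global forcing grid
bounds = (-5.0, 40.0, 10.0, 50.0)
print(covers(bounds, [-180.0, 180.0], [-90.0, 90.0]))
print(covers(bounds, [0.0, 360.0], [-90.0, 90.0]))
print(to_signed_longitude(350.0))
```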
Unit Conversion
Precipitation Units
Common precipitation unit conversions:
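# Pick the entry that matches your data; a dict holds only one value per key
forcing_units = {"precip": "mm/day"}        # Direct use
# forcing_units = {"precip": "mm/hour"}     # Will be converted
# forcing_units = {"precip": "m/day"}       # Will be converted
# forcing_units = {"precip": "kg m-2 s-1"}  # CMIP6 standard, will be converted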
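For reference, the arithmetic behind these conversions is simple. The sketch below lists the multiplicative factors to mm/day; it illustrates the unit relationships only and is not MarrmotFlow's conversion code. The `kg m-2 s-1` factor assumes liquid water at 1000 kg/m3, so that 1 kg/m2 corresponds to a 1 mm depth.

```python
# Multiplicative factors that convert each unit to mm/day
TO_MM_PER_DAY = {
    "mm/day": 1.0,
    "mm/hour": 24.0,         # 24 hours per day
    "m/day": 1000.0,         # 1000 mm per metre
    "kg m-2 s-1": 86400.0,   # 1 kg/m2 = 1 mm of water; 86400 seconds per day
}

def precip_to_mm_per_day(value, units):
    """Convert a precipitation value in the given units to mm/day."""
    return value * TO_MM_PER_DAY[units]

print(precip_to_mm_per_day(0.5, "mm/hour"))
print(precip_to_mm_per_day(1e-5, "kg m-2 s-1"))
```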
Temperature Units
Common temperature unit conversions:
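# Pick the entry that matches your data; a dict holds only one value per key
forcing_units = {"temp": "celsius"}         # Direct use
# forcing_units = {"temp": "kelvin"}        # Will be converted
# forcing_units = {"temp": "fahrenheit"}    # Will be converted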
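The corresponding temperature formulas are shown below as a small sketch. The function name `temp_to_celsius` is hypothetical and the code only illustrates the standard formulas, not MarrmotFlow internals.

```python
def temp_to_celsius(value, units):
    """Convert a temperature in the given units to degrees Celsius."""
    if units == "celsius":
        return value
    if units == "kelvin":
        return value - 273.15
    if units == "fahrenheit":
        return (value - 32.0) * 5.0 / 9.0
    raise ValueError(f"Unsupported temperature units: {units}")

print(temp_to_celsius(273.15, "kelvin"))
print(temp_to_celsius(212.0, "fahrenheit"))
```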
Data Quality Checks
Missing Data
Check for and handle missing data:
# Check for NaN values
precip_missing = forcing.precipitation.isnull().sum()
temp_missing = forcing.temperature.isnull().sum()
print(f"Missing precipitation values: {precip_missing.item()}")
print(f"Missing temperature values: {temp_missing.item()}")
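How to handle the gaps depends on your use case. One possible approach, sketched below on synthetic data, is to interpolate temperature linearly in time and to drop time steps where precipitation is missing, since interpolated precipitation can be misleading. Both choices are assumptions for illustration, not recommendations from MarrmotFlow.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic daily series with one gap in each variable
times = pd.date_range("2000-01-01", periods=5, freq="D")
forcing = xr.Dataset(
    {
        "precipitation": (("time",), np.array([0.0, 2.0, np.nan, 1.0, 0.0])),
        "temperature": (("time",), np.array([1.0, np.nan, 3.0, 4.0, 5.0])),
    },
    coords={"time": times},
)

# Fraction of missing values per variable
missing_fraction = {
    name: float(forcing[name].isnull().mean()) for name in forcing.data_vars
}
print(missing_fraction)

# Temperature: linear interpolation along the time dimension
temp_filled = forcing["temperature"].interpolate_na(dim="time", method="linear")

# Precipitation: drop the time steps where it is missing
cleaned = forcing.assign(temperature=temp_filled).dropna(
    dim="time", subset=["precipitation"]
)
print(cleaned.sizes["time"])
```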
Outliers
Identify potential outliers:
# Precipitation outliers (negative values or extremely high)
negative_precip = (forcing.precipitation < 0).sum()
extreme_precip = (forcing.precipitation > 1000).sum() # > 1000 mm/day
print(f"Negative precipitation values: {negative_precip.item()}")
print(f"Extreme precipitation values: {extreme_precip.item()}")
# Temperature outliers
extreme_cold = (forcing.temperature < -50).sum() # < -50°C
extreme_hot = (forcing.temperature > 60).sum() # > 60°C
print(f"Extremely cold values: {extreme_cold.item()}")
print(f"Extremely hot values: {extreme_hot.item()}")
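Once suspicious values are identified, one option is to mask them so that they are treated as missing data in later steps. The sketch below reuses the precipitation thresholds from the checks above (0 and 1000 mm/day) on a synthetic array; whether masking is appropriate depends on your data.

```python
import numpy as np
import xarray as xr

# Synthetic precipitation in mm/day with one negative and one extreme value
precip = xr.DataArray(
    np.array([0.0, -1.0, 12.0, 1500.0]), dims="time", name="precipitation"
)

# Keep values inside the plausible range; everything else becomes NaN
valid = (precip >= 0) & (precip <= 1000)
precip_clean = precip.where(valid)

n_masked = int((~valid).sum())
print(f"Masked values: {n_masked}")
```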
Data Preprocessing Workflow
Complete preprocessing example:
import geopandas as gpd
import xarray as xr
import pandas as pd
def prepare_data(catchment_file, forcing_file):
    """Complete data preparation workflow."""
    # Load catchment data
    catchments = gpd.read_file(catchment_file)
    # Validate catchments
    if catchments.crs is None:
        catchments = catchments.set_crs('EPSG:4326')
    # Load forcing data
    forcing = xr.open_dataset(forcing_file)
    # Standardize time
    forcing['time'] = pd.to_datetime(forcing.time.values)
    # Check spatial coverage
    # (slice-based selection assumes lon and lat are stored in ascending order)
    lon_min, lat_min, lon_max, lat_max = catchments.total_bounds
    forcing_subset = forcing.sel(
        lon=slice(lon_min, lon_max),
        lat=slice(lat_min, lat_max)
    )
    # Quality checks
    print("Data quality summary:")
    print(f"Catchments: {len(catchments)} features")
    # Use .values here: .item() on nanosecond datetimes returns a raw integer
    print(f"Forcing time range: {forcing.time.min().values} to {forcing.time.max().values}")
    print(f"Forcing subset dimensions: {dict(forcing_subset.sizes)}")
    return catchments, forcing_subset
# Use the function
catchments, forcing = prepare_data("catchments.shp", "climate_data.nc")
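One caveat with slice-based selection: many gridded products store latitude in descending order, in which case `slice(lat_min, lat_max)` silently returns an empty selection. The sketch below sorts the coordinates first. The helper `subset_to_bounds` is hypothetical, and the coordinate names `lon` and `lat` are assumptions carried over from the example above.

```python
import numpy as np
import xarray as xr

def subset_to_bounds(ds, bounds, lon_name="lon", lat_name="lat"):
    """Select the grid cells inside (lon_min, lat_min, lon_max, lat_max)."""
    lon_min, lat_min, lon_max, lat_max = bounds
    # Sort so that slicing works regardless of the stored coordinate order
    ds = ds.sortby([lon_name, lat_name])
    return ds.sel(
        {lon_name: slice(lon_min, lon_max), lat_name: slice(lat_min, lat_max)}
    )

# Synthetic grid with descending latitude, as found in many climate files
forcing = xr.Dataset(
    {"precipitation": (("lat", "lon"), np.zeros((4, 4)))},
    coords={"lat": [2.0, 1.0, 0.0, -1.0], "lon": [9.0, 10.0, 11.0, 12.0]},
)

subset = subset_to_bounds(forcing, (9.5, -0.5, 11.5, 1.5))
print(dict(subset.sizes))
```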
Best Practices
Always validate your data before creating workflows
Use consistent coordinate systems across all spatial data
Document your data sources and preprocessing steps
Check temporal alignment between different datasets
Handle missing data appropriately for your use case
Use meaningful variable names in your forcing data mapping
Test with small datasets before processing large volumes
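As an illustration of the temporal alignment check, the following sketch computes the period shared by two time axes, for example forcing data and observed streamflow. The function name `common_period` and the example date ranges are hypothetical.

```python
import pandas as pd

def common_period(times_a, times_b):
    """Return (start, end) of the overlap between two time axes, or None."""
    start = max(times_a.min(), times_b.min())
    end = min(times_a.max(), times_b.max())
    if start > end:
        return None
    return start, end

# Illustrative daily time axes for two datasets
forcing_times = pd.date_range("2000-01-01", "2000-12-31", freq="D")
observed_times = pd.date_range("2000-06-01", "2001-05-31", freq="D")

overlap = common_period(forcing_times, observed_times)
print(overlap)
```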