Machine learning has revolutionized many fields of science, and climate science is no exception. In this post, I’ll share insights from my recent research and provide a practical introduction to applying ML techniques in atmospheric sciences.
Why Machine Learning in Climate Science?
Climate data is characterized by:
- High dimensionality: Satellite observations, reanalysis data, and model outputs contain millions of data points
- Complex patterns: Non-linear relationships between variables that traditional statistical methods struggle to capture
- Temporal dependencies: Time series with multiple scales of variability
- Spatial correlations: Geographic patterns that span different scales
Machine learning excels at finding patterns in such complex, high-dimensional datasets.
Getting Started: Essential Tools
Python Libraries
import numpy as np
import pandas as pd
import xarray as xr
import matplotlib.pyplot as plt
import cartopy.crs as ccrs
import cartopy.feature as cfeature
# Machine learning libraries
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
import tensorflow as tf
from tensorflow import keras
Data Sources
Popular climate datasets for ML applications:
- ERA5 Reanalysis: High-resolution atmospheric reanalysis
- CHIRPS: Precipitation data for tropical regions
- MODIS: Satellite observations of clouds, vegetation, etc.
- CMIP6: Climate model ensemble data
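If you don't have data-access credentials handy, a small synthetic stand-in with an ERA5-like layout is enough to follow the examples below. This is a sketch, not real ERA5 data; the variable names and the tropical-Pacific grid are assumptions chosen to match the rest of this post.

```python
import numpy as np
import pandas as pd
import xarray as xr

rng = np.random.default_rng(0)
times = pd.date_range("2010-01-01", periods=30, freq="D")
lats = np.arange(-20.0, 21.0, 5.0)    # 9 latitude points
lons = np.arange(120.0, 281.0, 10.0)  # 17 longitude points
dims = ("time", "latitude", "longitude")
shape = (len(times), len(lats), len(lons))

def field(mean, spread):
    """Random field with an ERA5-style (time, lat, lon) layout."""
    return (dims, mean + spread * rng.standard_normal(shape))

ds = xr.Dataset(
    {
        "temperature_2m": field(290.0, 5.0),          # K
        "relative_humidity": field(70.0, 10.0),       # %
        "sea_level_pressure": field(101325.0, 500.0), # Pa
        "wind_speed_10m": field(5.0, 2.0),            # m/s
        "total_precipitation": field(2.0, 1.0),       # synthetic units
    },
    coords={"time": times, "latitude": lats, "longitude": lons},
)
# ds.to_netcdf("era5_data.nc")  # optional: write the stand-in to disk
```

Swapping this `ds` in for the file load below lets every later snippet run end to end.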
Practical Example: Precipitation Prediction
Let’s build a simple model to predict precipitation using atmospheric variables.
1. Data Preparation
import xarray as xr
# Load ERA5 reanalysis data
ds = xr.open_dataset('era5_data.nc')
# Select relevant variables
variables = ['temperature_2m', 'relative_humidity', 'sea_level_pressure',
             'wind_speed_10m', 'total_precipitation']
# Extract data for a specific region (e.g., tropical Pacific)
# Note: ERA5 stores latitude north-to-south, so depending on the file
# you may need the reversed slice(20, -20)
region_data = ds.sel(
    latitude=slice(-20, 20),
    longitude=slice(120, 280),
    time=slice('2010-01-01', '2020-12-31')
)[variables]
print(f"Data shape: {region_data.dims}")
2. Feature Engineering
def create_features(data):
    """Create additional features from basic meteorological variables."""
    # Horizontal temperature gradients
    data['temp_gradient_lat'] = data['temperature_2m'].differentiate('latitude')
    data['temp_gradient_lon'] = data['temperature_2m'].differentiate('longitude')

    # Relative vorticity (assumes u/v wind components are available;
    # calculate_vorticity is a user-defined helper)
    data['vorticity'] = calculate_vorticity(data['wind_u'], data['wind_v'])

    # Time-based features
    data['day_of_year'] = data['time.dayofyear']
    data['hour'] = data['time.hour']

    # Lagged features (previous time steps)
    data['temp_lag1'] = data['temperature_2m'].shift(time=1)
    data['precip_lag1'] = data['total_precipitation'].shift(time=1)

    return data
# Apply feature engineering
enhanced_data = create_features(region_data)
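The `calculate_vorticity` helper above is not defined in this post. One possible sketch, assuming `u` and `v` are xarray DataArrays in m/s on a regular lat/lon grid in degrees, is a centred finite-difference estimate of dv/dx - du/dy:

```python
import numpy as np
import xarray as xr

EARTH_RADIUS = 6.371e6  # metres

def calculate_vorticity(u, v):
    """Relative vorticity (dv/dx - du/dy) on a regular lat/lon grid.

    Assumes 'latitude'/'longitude' coordinates in degrees and winds
    in m/s; gradients per degree are converted to per metre.
    """
    lat_rad = np.deg2rad(u["latitude"])
    metres_per_deg = np.pi * EARTH_RADIUS / 180.0
    dv_dx = v.differentiate("longitude") / (metres_per_deg * np.cos(lat_rad))
    du_dy = u.differentiate("latitude") / metres_per_deg
    return dv_dx - du_dy
```

This ignores map-projection correction terms, which is usually acceptable for feature engineering at these scales.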
3. Model Training
# Prepare training data
def prepare_ml_data(data, target_var='total_precipitation'):
    """Convert an xarray dataset to sklearn-compatible format."""
    # Stack spatial and temporal dimensions into a single sample dimension
    stacked = data.stack(sample=('time', 'latitude', 'longitude'))

    # Convert to DataFrame and drop rows with missing values
    df = stacked.to_dataframe().dropna()

    # Separate features and target
    feature_cols = [col for col in df.columns if col != target_var]
    X = df[feature_cols]
    y = df[target_var]
    return X, y
X, y = prepare_ml_data(enhanced_data)
# Split data (note: a purely random split ignores temporal ordering;
# see the time-series cross-validation section below)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# Train Random Forest model
rf_model = RandomForestRegressor(
    n_estimators=100,
    max_depth=10,
    random_state=42,
    n_jobs=-1
)
rf_model.fit(X_train, y_train)
# Make predictions
y_pred = rf_model.predict(X_test)
# Evaluate
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
r2 = r2_score(y_test, y_pred)
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")
4. Feature Importance Analysis
# Analyze feature importance
feature_importance = pd.DataFrame({
    'feature': X.columns,
    'importance': rf_model.feature_importances_
}).sort_values('importance', ascending=False)
plt.figure(figsize=(10, 6))
plt.barh(feature_importance['feature'][:10], feature_importance['importance'][:10])
plt.xlabel('Feature Importance')
plt.title('Top 10 Most Important Features for Precipitation Prediction')
plt.tight_layout()
plt.show()
Deep Learning for Climate Data
For more complex patterns, deep learning can be powerful:
# Build a simple neural network
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(X_train.shape[1],)),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dropout(0.3),
    keras.layers.Dense(32, activation='relu'),
    keras.layers.Dense(1)
])

model.compile(
    optimizer='adam',
    loss='mse',
    metrics=['mae']
)

# Train the model
history = model.fit(
    X_train, y_train,
    validation_split=0.2,
    epochs=50,
    batch_size=32,
    verbose=1
)
# Evaluate
nn_pred = model.predict(X_test)
nn_rmse = np.sqrt(mean_squared_error(y_test, nn_pred))
print(f"Neural Network RMSE: {nn_rmse:.4f}")
Challenges and Best Practices
1. Handling Missing Data
# Interpolation for spatial gaps
data_filled = data.interpolate_na(dim='latitude', method='linear')
# Forward fill for temporal gaps (at most 3 consecutive steps)
data_filled = data_filled.ffill(dim='time', limit=3)
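To make the spatial gap-filling concrete, here is a tiny self-contained example with synthetic values; note that linear interpolation only fills gaps that lie between valid points, while gaps at the edges of the dimension remain NaN by default:

```python
import numpy as np
import xarray as xr

# Synthetic 3x3 field with NaN gaps along latitude
da = xr.DataArray(
    [[1.0, np.nan, 3.0],
     [np.nan, 5.0, np.nan],
     [7.0, 8.0, 9.0]],
    coords={"time": [0, 1, 2], "latitude": [0.0, 1.0, 2.0]},
    dims=("time", "latitude"),
)

# Interior gap at time=0 is filled; edge gaps at time=1 stay NaN
filled = da.interpolate_na(dim="latitude", method="linear")
```

The same pattern applies along time; xarray's `ffill`/`bfill` handle temporal gaps (these may require the optional bottleneck dependency).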
2. Cross-Validation with Time Series
from sklearn.model_selection import TimeSeriesSplit

tscv = TimeSeriesSplit(n_splits=5)
cv_scores = []

for train_idx, val_idx in tscv.split(X):
    X_train_cv, X_val_cv = X.iloc[train_idx], X.iloc[val_idx]
    y_train_cv, y_val_cv = y.iloc[train_idx], y.iloc[val_idx]

    # Refit the random forest on each fold
    rf_model.fit(X_train_cv, y_train_cv)
    score = rf_model.score(X_val_cv, y_val_cv)
    cv_scores.append(score)

print(f"Cross-validation scores: {cv_scores}")
print(f"Mean CV score: {np.mean(cv_scores):.4f} ± {np.std(cv_scores):.4f}")
3. Dealing with Scale Differences
from sklearn.preprocessing import StandardScaler
# Normalize features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
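One caveat worth making explicit: fit the scaler on the training split only, then reuse those statistics for the test split, so no test-set information leaks into preprocessing. A minimal sketch with stand-in arrays:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_train = rng.normal(5.0, 2.0, size=(100, 3))  # stand-in feature matrices
X_test = rng.normal(5.0, 2.0, size=(20, 3))

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # statistics from training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same statistics
```

The test split will then have a mean close to, but not exactly, zero, which is expected and correct.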
Advanced Applications
Convolutional Neural Networks for Spatial Data
For gridded climate data, CNNs can capture spatial patterns:
# Reshape gridded data for a CNN (samples, height, width, channels);
# X_spatial and feature_vars are placeholders for your gridded inputs
X_cnn = X_spatial.values.reshape(-1, 64, 128, len(feature_vars))
# Build CNN model (here assuming 5 input channels)
cnn_model = keras.Sequential([
    keras.layers.Conv2D(32, (3, 3), activation='relu', input_shape=(64, 128, 5)),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.MaxPooling2D((2, 2)),
    keras.layers.Conv2D(64, (3, 3), activation='relu'),
    keras.layers.Flatten(),
    keras.layers.Dense(64, activation='relu'),
    keras.layers.Dense(1)
])
LSTM for Time Series Prediction
For capturing temporal dependencies:
# Prepare sequence data
def create_sequences(data, seq_length=10):
    """Split a time series into (input window, next value) pairs."""
    X, y = [], []
    for i in range(len(data) - seq_length):
        X.append(data[i:(i + seq_length)])
        y.append(data[i + seq_length])
    return np.array(X), np.array(y)

# time_series_data is a placeholder for a (time, features) array
X_seq, y_seq = create_sequences(time_series_data)
# LSTM model (seq_length and n_features must match the sequences above)
lstm_model = keras.Sequential([
    keras.layers.LSTM(50, return_sequences=True, input_shape=(seq_length, n_features)),
    keras.layers.LSTM(50, return_sequences=False),
    keras.layers.Dense(25),
    keras.layers.Dense(1)
])
Evaluation and Interpretation
Model Explainability
import shap
# SHAP values for model interpretation
explainer = shap.TreeExplainer(rf_model)
shap_values = explainer.shap_values(X_test[:100])
# Plot SHAP summary
shap.summary_plot(shap_values, X_test[:100])
Physical Consistency Checks
Always validate that your ML model results make physical sense:
# Check if predictions follow physical laws
def validate_physics(predictions, inputs):
    """Check basic physical consistency."""
    # Example: precipitation should correlate positively with humidity
    humidity = inputs['relative_humidity']
    precip = predictions
    correlation = np.corrcoef(humidity, precip)[0, 1]
    assert correlation > 0, "Precipitation should correlate positively with humidity"
    return True
Resources and Next Steps
Recommended Reading
- “Deep Learning for the Earth Sciences” by Reichstein et al.
- “Machine Learning in Python for Climate Science” by Huntingford et al.
- “AI for Earth” Microsoft documentation
Conclusion
Machine learning offers powerful tools for climate science research, but success requires:
- Understanding both the climate science and ML fundamentals
- Careful data preprocessing and feature engineering
- Appropriate validation strategies
- Physical interpretation of results
The combination of domain expertise and ML techniques can lead to breakthrough insights in our understanding of the climate system.
In my next post, I’ll dive deeper into specific applications like tropical cyclone tracking and seasonal forecasting. Stay tuned!
Have questions about applying ML to your climate research? Feel free to reach out via email or connect with me on Twitter.