Extreme Value Analysis Tutorial

This notebook demonstrates how to use the ExtremesAnalyzer class for extreme value analysis, including return period calculations, block maxima extraction, and Peaks Over Threshold (POT) analysis.

Overview

ExtremesAnalyzer is designed to:

  • Calculate return periods and return values for extreme events

  • Extract block maxima (annual, monthly, seasonal)

  • Perform Peaks Over Threshold (POT) analysis with declustering

  • Fit extreme value distributions (GEV, Gumbel, Weibull, GPD)

  • Visualize return level plots

  • Handle flexible time series formats (pandas, datetime64, numeric arrays)

Setup and Data Generation

Let’s start by importing the necessary libraries and generating sample wind speed data using MagicA’s synthetic data generator.

[ ]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import magica as ma
from magica.utils import generate_wind_data

# Generate 30 years of daily wind speed data using MagicA
wind_series, plots = generate_wind_data(
    n_years=30,
    freq='D',
    mean_wind=8.0,
    weibull_shape=2.5,
    seasonal_amplitude=0.3,
    n_storms_per_year=5,
    storm_duration_range=(2, 5),
    storm_intensity_range=(12, 20),
    random_seed=42,
    create_plots=True
)

# Display the plots
plt.show()

# Get basic statistics
dates = wind_series.index
wind_speeds = wind_series.values

print(f"Generated {len(wind_series)} daily wind speed measurements ({len(dates)/365:.1f} years)")
print(f"Date range: {dates[0].date()} to {dates[-1].date()}")
print(f"Min: {wind_speeds.min():.2f} m/s")
print(f"Max: {wind_speeds.max():.2f} m/s")
print(f"Mean: {wind_speeds.mean():.2f} m/s")
print(f"Std: {wind_speeds.std():.2f} m/s")
print(f"95th percentile: {np.percentile(wind_speeds, 95):.2f} m/s")
print(f"99th percentile: {np.percentile(wind_speeds, 99):.2f} m/s")
print(f"\n💡 Data generated using MagicA's generate_wind_data() function")


../_images/tutorials_extremes_tutorial_2_0.png
Generated 10957 daily wind speed measurements (30.0 years)
Date range: 1990-01-01 to 2019-12-31
Min: 2.11 m/s
Max: 36.96 m/s
Mean: 7.82 m/s
Std: 3.48 m/s
95th percentile: 13.55 m/s
99th percentile: 22.09 m/s

💡 Data generated using MagicA's generate_wind_data() function

Visualizing the Time Series

Let’s look at the complete time series and identify extreme events.

Basic Usage: Creating an ExtremesAnalyzer

The ExtremesAnalyzer can be created from a DataProcessor using the get_extremes_analyzer() method. It automatically handles pandas Series with datetime indices.

[6]:
# Load data using MagicA's DataProcessor
# The pandas Series with datetime index will be automatically preserved
processor = ma.read_data(wind_series)
print(f"Loaded data: {processor}")

# Create ExtremesAnalyzer (time_unit='years' for return periods in years)
extremes = processor.get_extremes_analyzer(time_unit='years')
print(f"\nCreated ExtremesAnalyzer: {extremes}")

# Check the time span
print(f"Time span: {extremes.time_span:.2f} years")
print(f"Number of observations: {len(extremes.data)}")
Loaded data: DataProcessor(length=10957, dtype=float64)

Created ExtremesAnalyzer: ExtremesAnalyzer(n_points=10957, time_span=30.0 years)
Time span: 30.00 years
Number of observations: 10957

Block Maxima Approach: Generalized Extreme Value (GEV) Distribution

The Block Maxima method extracts the maximum value from fixed time blocks (e.g., annual maxima) and fits a GEV distribution. This is the classical approach in extreme value theory.

[7]:
# Extract annual maxima
annual_maxima, years = extremes.extract_block_maxima(block_size='Y', method='max')
print(f"Annual Maxima Analysis:")
print(f"Number of years: {len(annual_maxima)}")
print(f"\nAnnual maximum wind speeds:")
for year, max_wind in zip(years, annual_maxima):
    print(f"  {year.year}: {max_wind:.2f} m/s")
print(f"\nStatistics of annual maxima:")
maxima_values = list(annual_maxima)
print(f"Mean: {np.mean(maxima_values):.2f} m/s")
print(f"Std: {np.std(maxima_values):.2f} m/s")
print(f"Min: {np.min(maxima_values):.2f} m/s")
print(f"Max: {np.max(maxima_values):.2f} m/s")
Annual Maxima Analysis:
Number of years: 30

Annual maximum wind speeds:
  1990: 28.12 m/s
  1991: 27.15 m/s
  1992: 23.58 m/s
  1993: 21.92 m/s
  1994: 27.78 m/s
  1995: 27.95 m/s
  1996: 24.83 m/s
  1997: 24.74 m/s
  1998: 26.78 m/s
  1999: 33.76 m/s
  2000: 30.86 m/s
  2001: 29.43 m/s
  2002: 28.12 m/s
  2003: 16.50 m/s
  2004: 21.75 m/s
  2005: 25.50 m/s
  2006: 36.96 m/s
  2007: 24.97 m/s
  2008: 25.25 m/s
  2009: 31.48 m/s
  2010: 20.04 m/s
  2011: 25.87 m/s
  2012: 29.53 m/s
  2013: 30.45 m/s
  2014: 31.82 m/s
  2015: 29.54 m/s
  2016: 26.29 m/s
  2017: 26.69 m/s
  2018: 23.72 m/s
  2019: 24.56 m/s

Statistics of annual maxima:
Mean: 26.86 m/s
Std: 4.08 m/s
Min: 16.50 m/s
Max: 36.96 m/s
/Users/danilocoutodesouza/Documents/Programs_and_scripts/MagicA/magica/core/extremes_analyzer.py:460: FutureWarning: 'Y' is deprecated and will be removed in a future version, please use 'YE' instead.
  resampled = series.resample(block_size).max()
[33]:
# Create a new processor with annual maxima
maxima_processor = ma.read_data(np.array(maxima_values))

# Instead of assuming a distribution, let's test several
# common extreme value distributions using AutoFitter
print("Testing multiple extreme value distributions...")
print("=" * 70)

# Define distributions commonly used for extreme values
extreme_distributions = [
    'genextreme',      # Generalized Extreme Value (GEV) - most general
    'gumbel_r',        # Gumbel (Type I) - special case of GEV
    'weibull_min',     # Weibull - commonly used for wind extremes
    'weibull_max',     # Weibull maximum
    'lognorm',         # Log-normal - sometimes used for extremes
    'gamma',           # Gamma distribution
    'exponweib'        # Exponentiated Weibull
]

# Create AutoFitter with extreme value distributions
auto_fitter = maxima_processor.get_auto_fitter(
    candidates=extreme_distributions,
    criterion='rmse'  # Use RMSE as primary criterion
)

# Find best distribution
print("\nFitting distributions to annual maxima...\n")
best_result = auto_fitter.fit_best_distribution()

print(f"Best Distribution: {best_result['distribution']}")
print(f"RMSE: {best_result['rmse']:.6f}")
print(f"AIC: {best_result['aic']:.2f}")
print(f"BIC: {best_result['bic']:.2f}")

# Use the best distribution for further analysis
best_dist_name = best_result['distribution']

# Create ExtremesAnalyzer with the best fitted distribution
maxima_extremes = maxima_processor.get_extremes_analyzer()
maxima_extremes.fit_distribution(best_dist_name)

# Get fitted parameters
params = maxima_extremes.fitted_params
print(f"\n" + "=" * 70)
print(f"BEST DISTRIBUTION: {best_dist_name.upper()}")
print("=" * 70)
print(f"Fitted Parameters: {[round(param) for param in params]}")

# Evaluate goodness-of-fit for best distribution
gof = maxima_extremes.goodness_of_fit(method='ks')
print(f"\nGoodness-of-Fit Statistics:")
print(f"  KS Statistic: {gof['ks_statistic']:.4f}")
print(f"  P-value: {gof['p_value']:.4f}")
print(f"  Result: {'GOOD FIT ✓' if gof['p_value'] > 0.05 else 'POOR FIT ✗'} (α = 0.05)")

print(f"\n💡 The {best_dist_name} distribution provides the best fit to the annual maxima")
print(f"   based on RMSE criterion. This will be used for return value calculations.")
Testing multiple extreme value distributions...
======================================================================

Fitting distributions to annual maxima...

Testing 7 distributions...
  [1/7] Fitting genextreme...
  [2/7] Fitting gumbel_r...
  [3/7] Fitting weibull_min...
  [4/7] Fitting weibull_max...
  [5/7] Fitting lognorm...
  [6/7] Fitting gamma...
  [7/7] Fitting exponweib...
✓ All distributions fitted successfully
Best Distribution: gamma
RMSE: 0.032312
AIC: 175.48
BIC: 179.69

======================================================================
BEST DISTRIBUTION: GAMMA
======================================================================
Fitted Parameters: [7708, -331, 0]

Goodness-of-Fit Statistics:
  KS Statistic: 0.0869
  P-value: 0.9624
  Result: GOOD FIT ✓ (α = 0.05)

💡 The gamma distribution provides the best fit to the annual maxima
   based on RMSE criterion. This will be used for return value calculations.

Return Period and Return Value Analysis

Now let’s calculate return values for different return periods (e.g., 10-year, 50-year, 100-year wind speeds).

Important Note on Workflow:

  • When working with block maxima (e.g., annual maxima), each data point already represents one time unit (e.g., 1 year)

  • The ExtremesAnalyzer can fit distributions directly using fit_distribution()

  • No need to fit the distribution on the DataProcessor first

  • The time_unit parameter is mainly for the original time series, not for the extracted maxima

[34]:
# Calculate return values for various return periods
# The maxima_extremes analyzer already has the fitted distribution
return_periods = [2, 5, 10, 20, 50, 100]

print("Return Period Analysis (Block Maxima / GEV):")
print("=" * 60)
print(f"{'Return Period':<20} {'Return Value (m/s)':<20} {'Design Use'}")
print("-" * 60)

for T in return_periods:
    rv = maxima_extremes.return_value(T)

    # Suggest design use based on return period
    if T <= 5:
        design_use = "Operational limits"
    elif T <= 25:
        design_use = "Standard structures"
    elif T <= 50:
        design_use = "Important structures"
    else:
        design_use = "Critical infrastructure"

    print(f"{T} years{'':<13} {rv:.2f}{'':<15} {design_use}")

print("\n💡 Interpretation:")
print("   A 50-year return value has a 2% probability of being exceeded in any given year.")
print("   Over a 50-year period, there's approximately a 63.4% chance of exceeding this value.")
Return Period Analysis (Block Maxima / GEV):
============================================================
Return Period        Return Value (m/s)   Design Use
------------------------------------------------------------
2 years              26.85                Operational limits
5 years              30.29                Operational limits
10 years              32.10                Standard structures
20 years              33.60                Standard structures
50 years              35.29                Important structures
100 years              36.42                Critical infrastructure

💡 Interpretation:
   A 50-year return value has a 2% probability of being exceeded in any given year.
   Over a 50-year period, there's approximately a 63.4% chance of exceeding this value.
[35]:
# Calculate return periods for specific wind speed thresholds
threshold_speeds = [22, 24, 26, 28]

print("\nReturn Period for Specific Wind Speed Thresholds:")
print("=" * 60)
print(f"{'Wind Speed (m/s)':<20} {'Return Period (years)':<25} {'Probability/year'}")
print("-" * 60)

for speed in threshold_speeds:
    T = maxima_extremes.return_period(speed)
    prob = 1.0 / T if T > 0 else 0
    print(f"{speed}{'':<16} {T:.2f}{'':<21} {prob:.4f} ({prob*100:.2f}%)")

print("\n💡 Example: A wind speed of 25 m/s has a return period of X years,")
print("   meaning it's expected to be exceeded once every X years on average.")

Return Period for Specific Wind Speed Thresholds:
============================================================
Wind Speed (m/s)     Return Period (years)     Probability/year
------------------------------------------------------------
22                 1.13                      0.8837 (88.37%)
24                 1.32                      0.7581 (75.81%)
26                 1.72                      0.5824 (58.24%)
28                 2.57                      0.3889 (38.89%)

💡 Example: A wind speed of 25 m/s has a return period of X years,
   meaning it's expected to be exceeded once every X years on average.

Visualization: Return Level Plot

The return level plot is a fundamental diagnostic tool in extreme value analysis, showing both empirical and theoretical return levels.

[36]:
# Create return level plot
fig, ax = plt.subplots(figsize=(12, 7))

maxima_extremes.plot_return_levels(
    ax=ax,
    return_periods=np.logspace(np.log10(1.1), np.log10(200), 100),  # From 1.1 to 200 years
    confidence_level=0.95,
)

# Add reference lines for common design return periods
design_periods = [10, 50, 100]
for T in design_periods:
    rv = maxima_extremes.return_value(T)
    ax.axhline(rv, color='gray', linestyle=':', linewidth=1, alpha=0.5)
    ax.axvline(T, color='gray', linestyle=':', linewidth=1, alpha=0.5)
    ax.text(T * 1.1, rv, f'{T}-year: {rv:.1f} m/s', fontsize=9,
            bbox=dict(boxstyle='round,pad=0.3', facecolor='white', alpha=0.7))

plt.tight_layout()
plt.show()

print("\n📊 Return Level Plot Interpretation:")
print("   • Blue points: Empirical return levels from observed annual maxima")
print("   • Red line: Theoretical return levels from fitted GEV distribution")
print("   • Shaded area: 95% confidence interval (if calculated)")
print("   • Good agreement indicates the GEV model fits well")
print("   • Deviations at high return periods are common due to extrapolation")
../_images/tutorials_extremes_tutorial_13_0.png

📊 Return Level Plot Interpretation:
   • Blue points: Empirical return levels from observed annual maxima
   • Red line: Theoretical return levels from fitted GEV distribution
   • Shaded area: 95% confidence interval (if calculated)
   • Good agreement indicates the GEV model fits well
   • Deviations at high return periods are common due to extrapolation

Peaks Over Threshold (POT) Approach: Generalized Pareto Distribution (GPD)

The POT method identifies all values exceeding a high threshold and fits a GPD to the exceedances. This approach uses more data than block maxima and can be more efficient.

[37]:
# Choose threshold (e.g., 95th percentile)
threshold = np.percentile(wind_speeds, 95)
print(f"Threshold for POT analysis: {threshold:.2f} m/s (95th percentile)")

# Extract peaks over threshold with declustering
# min_separation ensures peaks are independent (at least 3 days apart)
peaks, peak_times = extremes.peaks_over_threshold(
    threshold=threshold,
    min_separation=3  # days
)

print(f"\nPeaks Over Threshold Analysis:")
print(f"Number of peaks identified: {len(peaks)}")
print(f"Average exceedance rate: {len(peaks) / extremes.time_span:.2f} peaks/year")
print(f"\nFirst 10 peaks:")
for i, (peak, time) in enumerate(zip(peaks[:10], peak_times[:10])):
    print(f"  {i+1}. {time}: {peak:.2f} m/s (exceedance: {peak - threshold:.2f} m/s)")
Threshold for POT analysis: 13.55 m/s (95th percentile)

Peaks Over Threshold Analysis:
Number of peaks identified: 340
Average exceedance rate: 11.33 peaks/year

First 10 peaks:
  1. 1990-01-02 00:00:00: 14.16 m/s (exceedance: 0.61 m/s)
  2. 1990-01-12 00:00:00: 14.91 m/s (exceedance: 1.36 m/s)
  3. 1990-02-03 00:00:00: 14.01 m/s (exceedance: 0.46 m/s)
  4. 1990-02-20 00:00:00: 14.26 m/s (exceedance: 0.71 m/s)
  5. 1990-03-11 00:00:00: 14.40 m/s (exceedance: 0.85 m/s)
  6. 1990-05-01 00:00:00: 15.16 m/s (exceedance: 1.61 m/s)
  7. 1990-06-09 00:00:00: 19.67 m/s (exceedance: 6.12 m/s)
  8. 1990-07-13 00:00:00: 16.42 m/s (exceedance: 2.87 m/s)
  9. 1990-07-16 00:00:00: 15.82 m/s (exceedance: 2.26 m/s)
  10. 1990-08-25 00:00:00: 16.13 m/s (exceedance: 2.57 m/s)
[38]:
# Visualize the POT extraction
fig, ax = plt.subplots(figsize=(14, 6))

# Plot full time series
ax.plot(dates, wind_speeds, linewidth=0.5, alpha=0.5, color='steelblue', label='All data')

# Highlight threshold
ax.axhline(threshold, color='red', linestyle='--', linewidth=2, label=f'Threshold ({threshold:.2f} m/s)', alpha=0.7)

# Mark identified peaks
ax.scatter(peak_times, peaks, color='red', s=50, zorder=5, label=f'Peaks (n={len(peaks)})', alpha=0.8)

ax.set_xlabel('Date')
ax.set_ylabel('Wind Speed (m/s)')
ax.set_title('Peaks Over Threshold (POT) Identification with Declustering')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print(f"\n💡 Declustering ensures peaks are independent (min {3} days apart)")
print(f"   This avoids counting multiple observations from the same storm event.")
../_images/tutorials_extremes_tutorial_16_0.png

💡 Declustering ensures peaks are independent (min 3 days apart)
   This avoids counting multiple observations from the same storm event.
[39]:
# Fit GPD to exceedances (peaks - threshold)
exceedances = peaks - threshold

exceedance_processor = ma.read_data(exceedances)

# Create ExtremesAnalyzer for exceedances
# We specify time_unit='years' here because we'll need it for POT return value calculations
# (which depend on the exceedance rate per year from the original series)
pot_extremes = exceedance_processor.get_extremes_analyzer(time_unit='years')

# Fit GPD (genpareto in SciPy) directly on the extremes analyzer
pot_extremes.fit_distribution('genpareto')

# Get fitted parameters
gpd_params = pot_extremes.fitted_params
print(f"Fitted GPD Parameters (for exceedances):")
print(f"Shape (ξ): {gpd_params[0]:.4f}")
print(f"Location: {gpd_params[1]:.4f}")
print(f"Scale (σ): {gpd_params[2]:.4f}")

# Interpret the shape parameter
if gpd_params[0] > 0.1:
    print(f"\nInterpretation: Heavy-tailed distribution (higher probability of extremes)")
elif gpd_params[0] < -0.1:
    print(f"\nInterpretation: Short-tailed distribution (finite upper bound)")
else:
    print(f"\nInterpretation: Exponential-like tails")
Fitted GPD Parameters (for exceedances):
Shape (ξ): 0.2944
Location: 0.0022
Scale (σ): 2.0922

Interpretation: Heavy-tailed distribution (higher probability of extremes)
/Users/danilocoutodesouza/Documents/Programs_and_scripts/MagicA/magica/core/extremes_analyzer.py:113: UserWarning: No time information provided. Assuming uniform time spacing. Return periods will be in units of observation count.
  warnings.warn(

POT Return Value Calculation

For POT analysis, return values must account for the exceedance rate and threshold.

[40]:
# Calculate POT-based return values
# Return value = threshold + GPD quantile adjusted for exceedance rate

exceedance_rate = len(peaks) / extremes.time_span  # peaks per year

print("Return Period Analysis (POT / GPD):")
print("=" * 60)
print(f"Exceedance rate: {exceedance_rate:.2f} peaks/year")
print(f"Threshold: {threshold:.2f} m/s\n")
print(f"{'Return Period':<20} {'Return Value (m/s)':<20} {'Exceedance Level'}")
print("-" * 60)

for T in return_periods:
    # For POT, the return period relates to the probability of exceeding the threshold
    # P(X > x | X > u) = (1 - F_GPD(x-u))
    # Return level: x = u + scale * ((λT)^ξ - 1) / ξ, where λ is exceedance rate

    # Using the fitted GPD to calculate the quantile
    prob = 1.0 / (exceedance_rate * T)  # Probability in the GPD scale
    if prob < 1.0:
        exceedance_quantile = pot_extremes.ppf(1 - prob)
        rv_pot = threshold + exceedance_quantile
    else:
        rv_pot = np.nan

    if not np.isnan(rv_pot):
        exceedance = rv_pot - threshold
        print(f"{T} years{'':<13} {rv_pot:.2f}{'':<15} +{exceedance:.2f} m/s")
    else:
        print(f"{T} years{'':<13} {'N/A':<20} {'N/A'}")

print("\n💡 POT return values account for:")
print("   • The threshold level (base wind speed)")
print("   • The exceedance distribution (GPD)")
print("   • The rate of threshold exceedances")
Return Period Analysis (POT / GPD):
============================================================
Exceedance rate: 11.33 peaks/year
Threshold: 13.55 m/s

Return Period        Return Value (m/s)   Exceedance Level
------------------------------------------------------------
2 years              24.26                +10.71 m/s
5 years              29.78                +16.22 m/s
10 years              35.06                +21.50 m/s
20 years              41.53                +27.98 m/s
50 years              52.40                +38.85 m/s
100 years              62.80                +49.25 m/s

💡 POT return values account for:
   • The threshold level (base wind speed)
   • The exceedance distribution (GPD)
   • The rate of threshold exceedances

Comparing Block Maxima vs POT Approaches

Let’s compare the return value estimates from both methods.

Advanced Declustering: Event-Wise vs Time-Based Separation

When extracting peaks over threshold, we need to ensure independence between events. There are two main approaches:

  1. Time-based separation: Keep peaks separated by a minimum time interval (e.g., 3 days)

  2. Event-wise declustering: Identify consecutive periods above threshold as single events and extract only the maximum from each event

Let’s compare these approaches and see how different thresholds affect the number and characteristics of identified events.

[41]:
# Compare different declustering methods at 90th percentile
threshold_90 = np.percentile(wind_speeds, 90)

# Method 1: No declustering (all exceedances)
peaks_all, times_all = extremes.peaks_over_threshold(threshold=threshold_90)

# Method 2: Time-based separation (3 days)
peaks_3days, times_3days = extremes.peaks_over_threshold(
    threshold=threshold_90,
    min_separation=3
)

# Method 3: Event-wise (consecutive days as single event)
peaks_eventwise, times_eventwise = extremes.peaks_over_threshold(
    threshold=threshold_90,
    event_wise=True
)

print("Declustering Method Comparison (90th percentile threshold)")
print("=" * 70)
print(f"Threshold: {threshold_90:.2f} m/s")
print(f"\n{'Method':<25} {'# Peaks':<12} {'Mean Peak':<12} {'Max Peak'}")
print("-" * 70)
print(f"{'All exceedances':<25} {len(peaks_all):<12} {np.mean(peaks_all):<12.2f} {np.max(peaks_all):.2f}")
print(f"{'3-day separation':<25} {len(peaks_3days):<12} {np.mean(peaks_3days):<12.2f} {np.max(peaks_3days):.2f}")
print(f"{'Event-wise':<25} {len(peaks_eventwise):<12} {np.mean(peaks_eventwise):<12.2f} {np.max(peaks_eventwise):.2f}")

print(f"\n💡 Event-wise declustering identified {len(peaks_eventwise)} independent storm events")
print(f"   compared to {len(peaks_3days)} peaks with 3-day separation")
print(f"   and {len(peaks_all)} total exceedances.")
Declustering Method Comparison (90th percentile threshold)
======================================================================
Threshold: 11.78 m/s

Method                    # Peaks      Mean Peak    Max Peak
----------------------------------------------------------------------
All exceedances           1096         15.26        36.96
3-day separation          697          14.01        30.86
Event-wise                700          14.96        36.96

💡 Event-wise declustering identified 700 independent storm events
   compared to 697 peaks with 3-day separation
   and 1096 total exceedances.
[42]:
# Visualize the difference between declustering methods
fig, axes = plt.subplots(3, 1, figsize=(15, 12), sharex=True)

# Plot 1: All exceedances
ax = axes[0]
ax.plot(dates, wind_speeds, linewidth=0.5, alpha=0.5, color='steelblue', label='All data')
ax.axhline(threshold_90, color='orange', linestyle='--', linewidth=2,
           label=f'Threshold ({threshold_90:.2f} m/s)', alpha=0.7)
ax.scatter(times_all, peaks_all, color='red', s=30, zorder=5,
          label=f'All exceedances (n={len(peaks_all)})', alpha=0.6)
ax.set_ylabel('Wind Speed (m/s)')
ax.set_title('Method 1: All Exceedances (No Declustering)')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Plot 2: 3-day separation
ax = axes[1]
ax.plot(dates, wind_speeds, linewidth=0.5, alpha=0.5, color='steelblue', label='All data')
ax.axhline(threshold_90, color='orange', linestyle='--', linewidth=2,
           label=f'Threshold ({threshold_90:.2f} m/s)', alpha=0.7)
ax.scatter(times_3days, peaks_3days, color='green', s=50, zorder=5,
          label=f'3-day separation (n={len(peaks_3days)})', alpha=0.8, marker='^')
ax.set_ylabel('Wind Speed (m/s)')
ax.set_title('Method 2: Time-Based Declustering (3-day minimum separation)')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

# Plot 3: Event-wise
ax = axes[2]
ax.plot(dates, wind_speeds, linewidth=0.5, alpha=0.5, color='steelblue', label='All data')
ax.axhline(threshold_90, color='orange', linestyle='--', linewidth=2,
           label=f'Threshold ({threshold_90:.2f} m/s)', alpha=0.7)
ax.scatter(times_eventwise, peaks_eventwise, color='purple', s=60, zorder=5,
          label=f'Event-wise (n={len(peaks_eventwise)})', alpha=0.8, marker='*')
ax.set_xlabel('Date')
ax.set_ylabel('Wind Speed (m/s)')
ax.set_title('Method 3: Event-Wise Declustering (One peak per consecutive event)')
ax.legend(loc='upper right')
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\n📊 Interpretation:")
print("   • Top: Shows all days exceeding threshold (may include multiple days from same storm)")
print("   • Middle: Keeps peaks at least 3 days apart (may still miss some storm clusters)")
print("   • Bottom: One peak per storm event (consecutive days grouped together)")
../_images/tutorials_extremes_tutorial_23_0.png

📊 Interpretation:
   • Top: Shows all days exceeding threshold (may include multiple days from same storm)
   • Middle: Keeps peaks at least 3 days apart (may still miss some storm clusters)
   • Bottom: One peak per storm event (consecutive days grouped together)
[43]:
# Zoom in on a specific storm event to illustrate the differences
# Find a period with multiple consecutive exceedances
exceedance_mask = wind_speeds > threshold_90
consecutive_periods = []
in_period = False
period_start = None

for i, (date, exceeds) in enumerate(zip(dates, exceedance_mask)):
    if exceeds and not in_period:
        # Start of a new period
        period_start = i
        in_period = True
    elif not exceeds and in_period:
        # End of period
        consecutive_periods.append((period_start, i - 1))
        in_period = False

# If still in a period at the end
if in_period:
    consecutive_periods.append((period_start, len(dates) - 1))

# Find the longest consecutive period (most interesting storm)
longest_period = max(consecutive_periods, key=lambda x: x[1] - x[0])
storm_center = (longest_period[0] + longest_period[1]) // 2

# Extract 1 month window around this storm (±15 days)
window_days = 15
zoom_start = max(0, storm_center - window_days)
zoom_end = min(len(dates) - 1, storm_center + window_days)

zoom_dates = dates[zoom_start:zoom_end+1]
zoom_winds = wind_speeds[zoom_start:zoom_end+1]

# Filter peaks to this window
mask_all = (times_all >= zoom_dates[0]) & (times_all <= zoom_dates[-1])
mask_3days = (times_3days >= zoom_dates[0]) & (times_3days <= zoom_dates[-1])
mask_eventwise = (times_eventwise >= zoom_dates[0]) & (times_eventwise <= zoom_dates[-1])

zoom_peaks_all = peaks_all[mask_all]
zoom_times_all = times_all[mask_all]
zoom_peaks_3days = peaks_3days[mask_3days]
zoom_times_3days = times_3days[mask_3days]
zoom_peaks_eventwise = peaks_eventwise[mask_eventwise]
zoom_times_eventwise = times_eventwise[mask_eventwise]

# Create detailed zoom plot
fig, axes = plt.subplots(3, 1, figsize=(15, 12), sharex=True)

# Plot 1: All exceedances
ax = axes[0]
ax.plot(zoom_dates, zoom_winds, linewidth=2, color='steelblue', label='Wind speed', alpha=0.8)
ax.axhline(threshold_90, color='orange', linestyle='--', linewidth=2,
           label=f'Threshold ({threshold_90:.2f} m/s)', alpha=0.7)
ax.scatter(zoom_times_all, zoom_peaks_all, color='red', s=80, zorder=5,
          label=f'All exceedances (n={len(zoom_peaks_all)})', alpha=0.8, edgecolors='darkred', linewidth=1.5)
ax.fill_between(zoom_dates, threshold_90, zoom_winds,
                where=(zoom_winds > threshold_90), alpha=0.2, color='red', label='Exceedance area')
ax.set_ylabel('Wind Speed (m/s)', fontsize=11)
ax.set_title('Method 1: All Exceedances - Zoom on Storm Event', fontsize=12, fontweight='bold')
ax.legend(loc='upper right', fontsize=10)
ax.grid(True, alpha=0.3)

# Plot 2: 3-day separation
ax = axes[1]
ax.plot(zoom_dates, zoom_winds, linewidth=2, color='steelblue', label='Wind speed', alpha=0.8)
ax.axhline(threshold_90, color='orange', linestyle='--', linewidth=2,
           label=f'Threshold ({threshold_90:.2f} m/s)', alpha=0.7)
ax.scatter(zoom_times_3days, zoom_peaks_3days, color='green', s=100, zorder=5,
          label=f'3-day separation (n={len(zoom_peaks_3days)})', alpha=0.8, marker='^',
          edgecolors='darkgreen', linewidth=1.5)
ax.fill_between(zoom_dates, threshold_90, zoom_winds,
                where=(zoom_winds > threshold_90), alpha=0.2, color='green', label='Exceedance area')
ax.set_ylabel('Wind Speed (m/s)', fontsize=11)
ax.set_title('Method 2: Time-Based Declustering - Zoom on Storm Event', fontsize=12, fontweight='bold')
ax.legend(loc='upper right', fontsize=10)
ax.grid(True, alpha=0.3)

# Plot 3: Event-wise
ax = axes[2]
ax.plot(zoom_dates, zoom_winds, linewidth=2, color='steelblue', label='Wind speed', alpha=0.8)
ax.axhline(threshold_90, color='orange', linestyle='--', linewidth=2,
           label=f'Threshold ({threshold_90:.2f} m/s)', alpha=0.7)
ax.scatter(zoom_times_eventwise, zoom_peaks_eventwise, color='purple', s=120, zorder=5,
          label=f'Event-wise (n={len(zoom_peaks_eventwise)})', alpha=0.8, marker='*',
          edgecolors='indigo', linewidth=1.5)
ax.fill_between(zoom_dates, threshold_90, zoom_winds,
                where=(zoom_winds > threshold_90), alpha=0.2, color='purple', label='Exceedance area')
ax.set_xlabel('Date', fontsize=11)
ax.set_ylabel('Wind Speed (m/s)', fontsize=11)
ax.set_title('Method 3: Event-Wise Declustering - Zoom on Storm Event', fontsize=12, fontweight='bold')
ax.legend(loc='upper right', fontsize=10)
ax.grid(True, alpha=0.3)

# Format x-axis
for ax in axes:
    ax.tick_params(axis='both', labelsize=10)

plt.tight_layout()
plt.show()

print(f"\n🔍 Detailed Storm Analysis:")
print(f"Period shown: {zoom_dates[0].date()} to {zoom_dates[-1].date()} (~1 month)")
print(f"Storm duration: {longest_period[1] - longest_period[0] + 1} consecutive days above threshold")
print(f"\nPeaks identified in this window:")
print(f"  • All exceedances: {len(zoom_peaks_all)} peaks")
print(f"  • 3-day separation: {len(zoom_peaks_3days)} peaks")
print(f"  • Event-wise: {len(zoom_peaks_eventwise)} peak(s)")
print(f"\n💡 Notice how:")
print(f"   - All exceedances captures every day above threshold (may count same storm multiple times)")
print(f"   - 3-day separation reduces but may still capture multiple peaks from one extended storm")
print(f"   - Event-wise correctly identifies this as a single storm event with one representative peak")
../_images/tutorials_extremes_tutorial_24_0.png

🔍 Detailed Storm Analysis:
Period shown: 1995-10-29 to 1995-11-28 (~1 month)
Storm duration: 7 consecutive days above threshold

Peaks identified in this window:
  • All exceedances: 14 peaks
  • 3-day separation: 6 peaks
  • Event-wise: 3 peak(s)

💡 Notice how:
   - All exceedances captures every day above threshold (may count same storm multiple times)
   - 3-day separation reduces but may still capture multiple peaks from one extended storm
   - Event-wise correctly identifies this as a single storm event with one representative peak

Impact of Threshold Selection on Event Identification

Different thresholds will identify different numbers of events with varying characteristics. Let’s explore how threshold choice affects the storm identification.

[44]:
# Test different thresholds with event-wise declustering
percentiles = [85, 90, 95, 98]
threshold_results = {}

for p in percentiles:
    thresh = np.percentile(wind_speeds, p)
    peaks_evt, times_evt = extremes.peaks_over_threshold(
        threshold=thresh,
        event_wise=True
    )

    # Calculate event statistics
    event_durations = []
    series = pd.Series(wind_speeds, index=dates)

    for event_time in times_evt:
        # Find the consecutive period around this peak
        event_mask = series > thresh
        # Find the continuous block containing this event
        idx_loc = series.index.get_loc(event_time)

        # Expand left
        start_idx = idx_loc
        while start_idx > 0 and event_mask.iloc[start_idx - 1]:
            start_idx -= 1

        # Expand right
        end_idx = idx_loc
        while end_idx < len(series) - 1 and event_mask.iloc[end_idx + 1]:
            end_idx += 1

        duration = end_idx - start_idx + 1
        event_durations.append(duration)

    threshold_results[p] = {
        'threshold': thresh,
        'n_events': len(peaks_evt),
        'mean_peak': np.mean(peaks_evt),
        'max_peak': np.max(peaks_evt),
        'mean_duration': np.mean(event_durations) if event_durations else 0,
        'max_duration': np.max(event_durations) if event_durations else 0,
        'peaks': peaks_evt,
        'times': times_evt
    }

# Display results
print("Impact of Threshold on Event Characteristics (Event-Wise Declustering)")
print("=" * 90)
print(f"{'Percentile':<12} {'Threshold':<12} {'# Events':<12} {'Mean Peak':<12} {'Mean Duration':<15} {'Max Duration'}")
print("-" * 90)

for p in percentiles:
    res = threshold_results[p]
    print(f"{p}th{'':<9} {res['threshold']:<12.2f} {res['n_events']:<12} "
          f"{res['mean_peak']:<12.2f} {res['mean_duration']:<15.1f} {res['max_duration']} days")

print("\n💡 Key Observations:")
print("   • Higher thresholds → Fewer but more intense events")
print("   • Lower thresholds → More events, including weaker/shorter storms")
print("   • Event duration shows how long storms persist above threshold")
Impact of Threshold on Event Characteristics (Event-Wise Declustering)
==========================================================================================
Percentile   Threshold    # Events     Mean Peak    Mean Duration   Max Duration
------------------------------------------------------------------------------------------
85th          10.76        1018         13.78        1.6             8 days
90th          11.78        700          14.96        1.6             7 days
95th          13.55        321          17.86        1.7             7 days
98th          17.57        116          23.53        1.9             5 days

💡 Key Observations:
   • Higher thresholds → Fewer but more intense events
   • Lower thresholds → More events, including weaker/shorter storms
   • Event duration shows how long storms persist above threshold
[46]:
# Visualize threshold impact
fig, ax = plt.subplots(1, 1, figsize=(8, 8))

# Plot: Number of events vs threshold
thresholds = [threshold_results[p]['threshold'] for p in percentiles]
n_events = [threshold_results[p]['n_events'] for p in percentiles]
mean_peaks = [threshold_results[p]['mean_peak'] for p in percentiles]

ax_twin = ax.twinx()
line1 = ax.plot(percentiles, n_events, 'o-', color='steelblue', linewidth=2,
                 markersize=8, label='Number of events')
line2 = ax_twin.plot(percentiles, mean_peaks, 's--', color='orange', linewidth=2,
                      markersize=8, label='Mean peak intensity')

ax.set_xlabel('Threshold Percentile')
ax.set_ylabel('Number of Events', color='steelblue')
ax_twin.set_ylabel('Mean Peak Wind Speed (m/s)', color='orange')
ax.set_title('Threshold Selection Impact on Event Frequency and Intensity')
ax.tick_params(axis='y', labelcolor='steelblue')
ax_twin.tick_params(axis='y', labelcolor='orange')
ax_twin.grid(True, alpha=0.3)

# Combine legends
lines = line1 + line2
labels = [l.get_label() for l in lines]
ax.legend(lines, labels, loc='upper left')

print("\n📊 Interpretation:")
print("   Left: Trade-off between event frequency and intensity")
print("   Right: Higher thresholds capture only the most persistent events")

📊 Interpretation:
   Left: Trade-off between event frequency and intensity
   Right: Higher thresholds capture only the most persistent events
../_images/tutorials_extremes_tutorial_27_1.png

Comparing GPD Fits: Event-Wise vs Time-Based Declustering

Let’s see how the choice of declustering method affects the fitted GPD distribution and return value estimates.

[47]:
# Compare GPD fits with different declustering methods
threshold_compare = threshold_90

# Get peaks with both methods
peaks_time, times_time = extremes.peaks_over_threshold(threshold=threshold_compare, min_separation=3)
peaks_event, times_event = extremes.peaks_over_threshold(threshold=threshold_compare, event_wise=True)

# Fit GPD to exceedances from both methods
exceedances_time = peaks_time - threshold_compare
exceedances_event = peaks_event - threshold_compare

# Time-based declustering
proc_time = ma.read_data(exceedances_time)
pot_time = proc_time.get_extremes_analyzer(time_unit='years')
pot_time.fit_distribution('genpareto')

# Event-wise declustering
proc_event = ma.read_data(exceedances_event)
pot_event = proc_event.get_extremes_analyzer(time_unit='years')
pot_event.fit_distribution('genpareto')

# Compare parameters
print("GPD Parameter Comparison")
print("=" * 70)
print(f"Threshold: {threshold_compare:.2f} m/s (90th percentile)")
print(f"\n{'Method':<25} {'# Peaks':<12} {'Shape (ξ)':<12} {'Scale (σ)'}")
print("-" * 70)

params_time = pot_time.fitted_params
params_event = pot_event.fitted_params

print(f"{'3-day separation':<25} {len(peaks_time):<12} {params_time[0]:<12.4f} {params_time[2]:.4f}")
print(f"{'Event-wise':<25} {len(peaks_event):<12} {params_event[0]:<12.4f} {params_event[2]:.4f}")

# Calculate return values
print("\n\nReturn Value Comparison (m/s)")
print("=" * 70)
print(f"{'Return Period':<20} {'3-day separation':<20} {'Event-wise':<20} {'Difference'}")
print("-" * 70)

rate_time = len(peaks_time) / extremes.time_span
rate_event = len(peaks_event) / extremes.time_span

for T in [10, 50, 100]:
    # Time-based
    prob_time = 1.0 / (rate_time * T)
    if prob_time < 1.0:
        rv_time = threshold_compare + pot_time.ppf(1 - prob_time)
    else:
        rv_time = np.nan

    # Event-wise
    prob_event = 1.0 / (rate_event * T)
    if prob_event < 1.0:
        rv_event = threshold_compare + pot_event.ppf(1 - prob_event)
    else:
        rv_event = np.nan

    if not np.isnan(rv_time) and not np.isnan(rv_event):
        diff = rv_event - rv_time
        print(f"{T} years{'':<13} {rv_time:<20.2f} {rv_event:<20.2f} {diff:+.2f}")

print("\n💡 Key Insights:")
print("   • Event-wise declustering reduces peak count (fewer independent events)")
print("   • This affects the exceedance rate and thus return value estimates")
print("   • Event-wise is more physically meaningful for clustered extremes (storms)")
print("   • Time-based separation is simpler but may over-count dependent events")
GPD Parameter Comparison
======================================================================
Threshold: 11.78 m/s (90th percentile)

Method                    # Peaks      Shape (ξ)    Scale (σ)
----------------------------------------------------------------------
3-day separation          697          0.1610       1.8546
Event-wise                700          0.4381       1.9109


Return Value Comparison (m/s)
======================================================================
Return Period        3-day separation     Event-wise           Difference
----------------------------------------------------------------------
10 years              27.96                54.97                +27.01
50 years              36.15                103.66               +67.51
100 years              40.39                137.81               +97.42

💡 Key Insights:
   • Event-wise declustering reduces peak count (fewer independent events)
   • This affects the exceedance rate and thus return value estimates
   • Event-wise is more physically meaningful for clustered extremes (storms)
   • Time-based separation is simpler but may over-count dependent events
/Users/danilocoutodesouza/Documents/Programs_and_scripts/MagicA/magica/core/extremes_analyzer.py:113: UserWarning: No time information provided. Assuming uniform time spacing. Return periods will be in units of observation count.
  warnings.warn(
[48]:
# Visualize the GPD fits
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: Empirical vs fitted CDF
sorted_exc_time = np.sort(exceedances_time)
sorted_exc_event = np.sort(exceedances_event)

emp_cdf_time = np.arange(1, len(sorted_exc_time) + 1) / len(sorted_exc_time)
emp_cdf_event = np.arange(1, len(sorted_exc_event) + 1) / len(sorted_exc_event)

theo_cdf_time = pot_time.cdf(sorted_exc_time)
theo_cdf_event = pot_event.cdf(sorted_exc_event)

ax1.plot(sorted_exc_time, emp_cdf_time, 'o', color='blue', alpha=0.5,
         markersize=5, label='Empirical (3-day)')
ax1.plot(sorted_exc_time, theo_cdf_time, '-', color='blue', linewidth=2,
         label='GPD fit (3-day)')
ax1.plot(sorted_exc_event, emp_cdf_event, 's', color='red', alpha=0.5,
         markersize=5, label='Empirical (event-wise)')
ax1.plot(sorted_exc_event, theo_cdf_event, '-', color='red', linewidth=2,
         label='GPD fit (event-wise)')
ax1.set_xlabel('Exceedance (m/s)')
ax1.set_ylabel('Cumulative Probability')
ax1.set_title('GPD Fit Comparison: CDF')
ax1.legend()
ax1.grid(True, alpha=0.3)

# Plot 2: Q-Q plot
ax2.scatter(sorted_exc_time, pot_time.ppf(emp_cdf_time),
           color='blue', alpha=0.5, s=40, label='3-day separation')
ax2.scatter(sorted_exc_event, pot_event.ppf(emp_cdf_event),
           color='red', alpha=0.5, s=40, label='Event-wise', marker='s')

# Add diagonal line
max_val = max(np.max(sorted_exc_time), np.max(sorted_exc_event))
ax2.plot([0, max_val], [0, max_val], 'k--', linewidth=1, alpha=0.5, label='Perfect fit')

ax2.set_xlabel('Empirical Quantiles')
ax2.set_ylabel('Theoretical Quantiles (GPD)')
ax2.set_title('Q-Q Plot')
ax2.legend()
ax2.grid(True, alpha=0.3)
ax2.set_aspect('equal', adjustable='box')

plt.tight_layout()
plt.show()

print("\n📊 Plot Interpretation:")
print("   Left: Both methods show good GPD fits (empirical and theoretical CDFs align)")
print("   Right: Q-Q plot shows how well data follows GPD (points near diagonal = good fit)")
../_images/tutorials_extremes_tutorial_30_0.png

📊 Plot Interpretation:
   Left: Both methods show good GPD fits (empirical and theoretical CDFs align)
   Right: Q-Q plot shows how well data follows GPD (points near diagonal = good fit)
[49]:
# Calculate return values from both methods
comparison_periods = [5, 10, 20, 50, 100]

gev_values = [maxima_extremes.return_value(T) for T in comparison_periods]

# POT return values
pot_values = []
for T in comparison_periods:
    prob = 1.0 / (exceedance_rate * T)
    if prob < 1.0:
        exceedance_quantile = pot_extremes.ppf(1 - prob)
        pot_values.append(threshold + exceedance_quantile)
    else:
        pot_values.append(np.nan)

# Comparison table
print("Method Comparison: Block Maxima (GEV) vs POT (GPD)")
print("=" * 80)
print(f"{'Return Period':<15} {'GEV (m/s)':<15} {'GPD (m/s)':<15} {'Difference':<15} {'% Diff'}")
print("-" * 80)

for T, gev_val, pot_val in zip(comparison_periods, gev_values, pot_values):
    if not np.isnan(pot_val):
        diff = pot_val - gev_val
        pct_diff = (diff / gev_val) * 100
        print(f"{T} years{'':<8} {gev_val:<15.2f} {pot_val:<15.2f} {diff:<+15.2f} {pct_diff:+.1f}%")
    else:
        print(f"{T} years{'':<8} {gev_val:<15.2f} {'N/A':<15} {'N/A':<15} {'N/A'}")

print("\n📊 Method Selection Guidelines:")
print("\nBlock Maxima (GEV):")
print("  ✓ More robust, classical approach")
print("  ✓ Less sensitive to threshold selection")
print("  ✓ Better for data with clear seasonal structure")
print("  ✗ Uses less data (only one value per block)")
print("  ✗ Requires long time series for reliability")
print("\nPeaks Over Threshold (GPD):")
print("  ✓ More efficient, uses more extreme data")
print("  ✓ Better for short time series")
print("  ✓ Can adapt to changing extreme behavior")
print("  ✗ Sensitive to threshold choice")
print("  ✗ Requires careful declustering")
print("  ✗ More complex implementation")
Method Comparison: Block Maxima (GEV) vs POT (GPD)
================================================================================
Return Period   GEV (m/s)       GPD (m/s)       Difference      % Diff
--------------------------------------------------------------------------------
5 years         30.29           29.78           -0.52           -1.7%
10 years         32.10           35.06           +2.96           +9.2%
20 years         33.60           41.53           +7.93           +23.6%
50 years         35.29           52.40           +17.11          +48.5%
100 years         36.42           62.80           +26.38          +72.4%

📊 Method Selection Guidelines:

Block Maxima (GEV):
  ✓ More robust, classical approach
  ✓ Less sensitive to threshold selection
  ✓ Better for data with clear seasonal structure
  ✗ Uses less data (only one value per block)
  ✗ Requires long time series for reliability

Peaks Over Threshold (GPD):
  ✓ More efficient, uses more extreme data
  ✓ Better for short time series
  ✓ Can adapt to changing extreme behavior
  ✗ Sensitive to threshold choice
  ✗ Requires careful declustering
  ✗ More complex implementation

Visualizing Both Approaches Side-by-Side

[50]:
# Create comparison plot
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(16, 6))

# Plot 1: GEV return levels
maxima_extremes.plot_return_levels(
    ax=ax1,
    return_periods=np.logspace(np.log10(1.1), np.log10(200), 100),
    confidence_level=0.95,
    title='Block Maxima Approach (GEV)'
)

# Plot 2: Exceedance distribution (GPD)
sorted_exceedances = np.sort(exceedances)
empirical_cdf = np.arange(1, len(sorted_exceedances) + 1) / len(sorted_exceedances)
theoretical_cdf = pot_extremes.cdf(sorted_exceedances)

ax2.plot(sorted_exceedances, empirical_cdf, 'bo', markersize=6, alpha=0.6, label='Empirical')
ax2.plot(sorted_exceedances, theoretical_cdf, 'r-', linewidth=2, label='GPD fit')
ax2.set_xlabel('Exceedance (m/s)')
ax2.set_ylabel('Cumulative Probability')
ax2.set_title('POT Approach: GPD Fit to Exceedances')
ax2.legend()
ax2.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()
../_images/tutorials_extremes_tutorial_33_0.png

Advanced: Monthly and Seasonal Block Maxima

Besides annual maxima, we can extract maxima from other time blocks.

[51]:
# Extract monthly maxima
monthly_maxima = extremes.extract_block_maxima(block_size='M', method='max')

print(f"Monthly Maxima Analysis:")
print(f"Number of months: {len(monthly_maxima[1])}")
print(f"Mean monthly maximum: {np.mean(list(monthly_maxima[0])):.2f} m/s")
print(f"\nFirst 12 months:")
for max_wind, month in zip(monthly_maxima[0][:12], monthly_maxima[1][:12]):
    print(f"{month.month}: {max_wind:.2f} m/s")
Monthly Maxima Analysis:
Number of months: 360
Mean monthly maximum: 15.32 m/s

First 12 months:
1: 14.91 m/s
2: 14.54 m/s
3: 14.40 m/s
4: 10.28 m/s
5: 28.12 m/s
6: 19.67 m/s
7: 19.66 m/s
8: 18.29 m/s
9: 10.83 m/s
10: 23.00 m/s
11: 24.49 m/s
12: 16.24 m/s
/Users/danilocoutodesouza/Documents/Programs_and_scripts/MagicA/magica/core/extremes_analyzer.py:460: FutureWarning: 'M' is deprecated and will be removed in a future version, please use 'ME' instead.
  resampled = series.resample(block_size).max()
[52]:
# Visualize seasonal variation in extremes
# Convert monthly maxima to DataFrame for easier analysis
# monthly_maxima is a tuple of (values, timestamps)
monthly_values, monthly_dates = monthly_maxima

monthly_df = pd.DataFrame({
    'date': monthly_dates,
    'max_wind': monthly_values
})
monthly_df['month'] = monthly_df['date'].dt.month
monthly_df['month_name'] = monthly_df['date'].dt.strftime('%b')

# Calculate monthly statistics
monthly_stats = monthly_df.groupby(['month', 'month_name'])['max_wind'].agg(['mean', 'std', 'min', 'max']).reset_index()

fig, ax = plt.subplots(figsize=(12, 6))

# Plot mean with error bars
months = monthly_stats['month'].values
means = monthly_stats['mean'].values
stds = monthly_stats['std'].values

ax.plot(months, means, 'o-', linewidth=2, markersize=8, color='steelblue', label='Mean monthly maximum')
ax.fill_between(months, means - stds, means + stds, alpha=0.3, color='steelblue', label='±1 std')
ax.scatter(months, monthly_stats['max'], color='red', s=50, zorder=5, alpha=0.6, label='Absolute maximum')

ax.set_xlabel('Month')
ax.set_ylabel('Wind Speed (m/s)')
ax.set_title('Seasonal Variation in Monthly Maximum Wind Speeds')
ax.set_xticks(months)
ax.set_xticklabels(monthly_stats['month_name'])
ax.legend()
ax.grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

# Find the most extreme month
most_extreme_month = monthly_stats.loc[monthly_stats['mean'].idxmax()]
print(f"\nMost extreme month on average: {most_extreme_month['month_name']} ({most_extreme_month['mean']:.2f} m/s)")
print(f"Least extreme month on average: {monthly_stats.loc[monthly_stats['mean'].idxmin(), 'month_name']} "
      f"({monthly_stats['mean'].min():.2f} m/s)")
../_images/tutorials_extremes_tutorial_36_0.png

Most extreme month on average: Jan (19.87 m/s)
Least extreme month on average: Jul (11.50 m/s)

Goodness-of-Fit Assessment for Extreme Value Distributions

Let’s evaluate how well our distributions fit the extreme data using both statistical tests and visual diagnostics.

[53]:
# Goodness-of-fit for GEV (annual maxima)
print("Goodness-of-Fit Assessment:")
print("=" * 60)
print("\nBlock Maxima (GEV):")

ks_gev = maxima_extremes.goodness_of_fit('ks')
print(f"  Kolmogorov-Smirnov Test:")
print(f"    Statistic: {ks_gev['ks_statistic']:.6f}")
print(f"    P-value: {ks_gev['p_value']:.6f}")
print(f"    Result: {'Good fit ✓' if ks_gev['p_value'] > 0.05 else 'Poor fit ✗'} (α = 0.05)")

rmse_gev = maxima_extremes.goodness_of_fit('rmse')
print(f"  RMSE: {rmse_gev:.6f}")

# Goodness-of-fit for GPD (exceedances)
print("\nPeaks Over Threshold (GPD):")

ks_gpd = pot_extremes.goodness_of_fit('ks')
print(f"  Kolmogorov-Smirnov Test:")
print(f"    Statistic: {ks_gpd['ks_statistic']:.6f}")
print(f"    P-value: {ks_gpd['p_value']:.6f}")
print(f"    Result: {'Good fit ✓' if ks_gpd['p_value'] > 0.05 else 'Poor fit ✗'} (α = 0.05)")

rmse_gpd = pot_extremes.goodness_of_fit('rmse')
print(f"  RMSE: {rmse_gpd:.6f}")

print("\n💡 Note: Small sample sizes in extreme value analysis often result in")
print("   lower statistical power. Visual inspection (see plots below) is equally important.")
Goodness-of-Fit Assessment:
============================================================

Block Maxima (GEV):
  Kolmogorov-Smirnov Test:
    Statistic: 0.086928
    P-value: 0.962403
    Result: Good fit ✓ (α = 0.05)
  RMSE: 0.032312

Peaks Over Threshold (GPD):
  Kolmogorov-Smirnov Test:
    Statistic: 0.042619
    P-value: 0.552774
    Result: Good fit ✓ (α = 0.05)
  RMSE: 0.018108

💡 Note: Small sample sizes in extreme value analysis often result in
   lower statistical power. Visual inspection (see plots below) is equally important.

Visual Goodness-of-Fit: Histograms and PDFs

Let’s visualize how well the fitted distributions match the observed data by comparing histograms with the theoretical probability density functions (PDFs).

[54]:
# Create comprehensive visualization of goodness-of-fit
fig, axes = plt.subplots(2, 2, figsize=(16, 12))

# ============================================================================
# TOP ROW: Block Maxima (GEV) Analysis
# ============================================================================

# Plot 1: Histogram + PDF for Annual Maxima (GEV)
ax1 = axes[0, 0]
ax1.hist(maxima_values, bins=8, density=True, alpha=0.7, color='steelblue',
         label='Observed annual maxima', edgecolor='black', linewidth=1)

# Generate smooth curve for fitted GEV distribution
x_gev = np.linspace(min(maxima_values) * 0.95, max(maxima_values) * 1.05, 200)
pdf_gev = maxima_extremes.pdf(x_gev)
ax1.plot(x_gev, pdf_gev, 'r-', linewidth=3, label='Fitted GEV PDF')

# Mark the mean
mean_maxima = np.mean(maxima_values)
ax1.axvline(mean_maxima, color='green', linestyle='--', linewidth=2,
            label=f'Mean: {mean_maxima:.2f} m/s', alpha=0.7)

ax1.set_xlabel('Annual Maximum Wind Speed (m/s)', fontsize=11)
ax1.set_ylabel('Probability Density', fontsize=11)
ax1.set_title('Block Maxima: GEV Distribution Fit', fontsize=13, fontweight='bold')
ax1.legend(loc='upper right', fontsize=10)
ax1.grid(True, alpha=0.3)

# Add text box with GEV parameters
param_text = f'GEV Parameters:\nShape (ξ): {params[0]:.4f}\nLocation (μ): {params[1]:.2f}\nScale (σ): {params[2]:.2f}'
ax1.text(0.02, 0.98, param_text, transform=ax1.transAxes, fontsize=9,
         verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Plot 2: Q-Q Plot for GEV
ax2 = axes[0, 1]
sorted_maxima = np.sort(maxima_values)
n = len(sorted_maxima)
empirical_probs = (np.arange(1, n + 1) - 0.5) / n
theoretical_quantiles = maxima_extremes.ppf(empirical_probs)

ax2.scatter(theoretical_quantiles, sorted_maxima, alpha=0.6, s=60, color='steelblue',
            edgecolor='black', linewidth=0.5, label='Data')
# Add perfect fit line
min_val = min(theoretical_quantiles.min(), sorted_maxima.min())
max_val = max(theoretical_quantiles.max(), sorted_maxima.max())
ax2.plot([min_val, max_val], [min_val, max_val], 'r--', linewidth=2, label='Perfect fit')

ax2.set_xlabel('Theoretical Quantiles (GEV)', fontsize=11)
ax2.set_ylabel('Observed Quantiles (m/s)', fontsize=11)
ax2.set_title('Q-Q Plot: GEV Distribution', fontsize=13, fontweight='bold')
ax2.legend(loc='upper left', fontsize=10)
ax2.grid(True, alpha=0.3)

# Add goodness-of-fit statistics
gof_text = f'Goodness-of-Fit:\nKS statistic: {ks_gev["ks_statistic"]:.4f}\nP-value: {ks_gev["p_value"]:.4f}\nRMSE: {rmse_gev:.4f}'
ax2.text(0.98, 0.02, gof_text, transform=ax2.transAxes, fontsize=9,
         verticalalignment='bottom', horizontalalignment='right',
         bbox=dict(boxstyle='round', facecolor='lightgreen' if ks_gev['p_value'] > 0.05 else 'lightcoral', alpha=0.5))

# ============================================================================
# BOTTOM ROW: POT (GPD) Analysis
# ============================================================================

# Plot 3: Histogram + PDF for Exceedances (GPD)
ax3 = axes[1, 0]
ax3.hist(exceedances, bins=12, density=True, alpha=0.7, color='coral',
         label='Observed exceedances', edgecolor='black', linewidth=1)

# Generate smooth curve for fitted GPD distribution
x_gpd = np.linspace(0, max(exceedances) * 1.05, 200)
pdf_gpd = pot_extremes.pdf(x_gpd)
ax3.plot(x_gpd, pdf_gpd, 'darkred', linewidth=3, label='Fitted GPD PDF')

# Mark the mean
mean_exceedances = np.mean(exceedances)
ax3.axvline(mean_exceedances, color='green', linestyle='--', linewidth=2,
            label=f'Mean: {mean_exceedances:.2f} m/s', alpha=0.7)

ax3.set_xlabel('Exceedance above Threshold (m/s)', fontsize=11)
ax3.set_ylabel('Probability Density', fontsize=11)
ax3.set_title('POT: GPD Distribution Fit to Exceedances', fontsize=13, fontweight='bold')
ax3.legend(loc='upper right', fontsize=10)
ax3.grid(True, alpha=0.3)

# Add text box with GPD parameters
param_text_gpd = f'GPD Parameters:\nShape (ξ): {gpd_params[0]:.4f}\nLocation: {gpd_params[1]:.2f}\nScale (σ): {gpd_params[2]:.2f}\n\nThreshold: {threshold:.2f} m/s'
ax3.text(0.02, 0.98, param_text_gpd, transform=ax3.transAxes, fontsize=9,
         verticalalignment='top', bbox=dict(boxstyle='round', facecolor='wheat', alpha=0.5))

# Plot 4: Q-Q Plot for GPD
ax4 = axes[1, 1]
sorted_exceedances = np.sort(exceedances)
n_exc = len(sorted_exceedances)
empirical_probs_exc = (np.arange(1, n_exc + 1) - 0.5) / n_exc
theoretical_quantiles_exc = pot_extremes.ppf(empirical_probs_exc)

ax4.scatter(theoretical_quantiles_exc, sorted_exceedances, alpha=0.6, s=50, color='coral',
            edgecolor='black', linewidth=0.5, label='Data')
# Add perfect fit line
min_val_exc = min(theoretical_quantiles_exc.min(), sorted_exceedances.min())
max_val_exc = max(theoretical_quantiles_exc.max(), sorted_exceedances.max())
ax4.plot([min_val_exc, max_val_exc], [min_val_exc, max_val_exc], 'darkred',
         linestyle='--', linewidth=2, label='Perfect fit')

ax4.set_xlabel('Theoretical Quantiles (GPD)', fontsize=11)
ax4.set_ylabel('Observed Quantiles (m/s)', fontsize=11)
ax4.set_title('Q-Q Plot: GPD Distribution', fontsize=13, fontweight='bold')
ax4.legend(loc='upper left', fontsize=10)
ax4.grid(True, alpha=0.3)

# Add goodness-of-fit statistics
gof_text_gpd = f'Goodness-of-Fit:\nKS statistic: {ks_gpd["ks_statistic"]:.4f}\nP-value: {ks_gpd["p_value"]:.4f}\nRMSE: {rmse_gpd:.4f}'
ax4.text(0.98, 0.02, gof_text_gpd, transform=ax4.transAxes, fontsize=9,
         verticalalignment='bottom', horizontalalignment='right',
         bbox=dict(boxstyle='round', facecolor='lightgreen' if ks_gpd['p_value'] > 0.05 else 'lightcoral', alpha=0.5))

plt.tight_layout()
plt.show()

print("\n" + "="*80)
print("VISUAL GOODNESS-OF-FIT INTERPRETATION:")
print("="*80)
print("\n📊 LEFT COLUMN - Histogram vs PDF:")
print("   • Blue/Coral bars: Observed frequency of extreme values")
print("   • Red curves: Theoretical probability density from fitted distributions")
print("   • Good fit: PDF curve should follow the histogram shape closely")
print("   • Green dashed line: Mean of the observed data")
print("\n📈 RIGHT COLUMN - Q-Q Plots (Quantile-Quantile):")
print("   • Points: Comparison of observed vs theoretical quantiles")
print("   • Red dashed line: Perfect fit (y = x)")
print("   • Good fit: Points should lie close to the diagonal line")
print("   • Deviations at extremes are common due to sampling variability")
print("\n✅ GEV (Top Row):")
print(f"   • P-value: {ks_gev['p_value']:.4f}{'GOOD FIT' if ks_gev['p_value'] > 0.05 else 'POOR FIT'} (α = 0.05)")
print(f"   • RMSE: {rmse_gev:.6f} (lower is better)")
print("\n✅ GPD (Bottom Row):")
print(f"   • P-value: {ks_gpd['p_value']:.4f}{'GOOD FIT' if ks_gpd['p_value'] > 0.05 else 'POOR FIT'} (α = 0.05)")
print(f"   • RMSE: {rmse_gpd:.6f} (lower is better)")
print("\n💡 Note: Visual inspection is crucial in extreme value analysis.")
print("   Even with acceptable p-values, always check if the fit makes physical sense!")
print("="*80)
../_images/tutorials_extremes_tutorial_40_0.png

================================================================================
VISUAL GOODNESS-OF-FIT INTERPRETATION:
================================================================================

📊 LEFT COLUMN - Histogram vs PDF:
   • Blue/Coral bars: Observed frequency of extreme values
   • Red curves: Theoretical probability density from fitted distributions
   • Good fit: PDF curve should follow the histogram shape closely
   • Green dashed line: Mean of the observed data

📈 RIGHT COLUMN - Q-Q Plots (Quantile-Quantile):
   • Points: Comparison of observed vs theoretical quantiles
   • Red dashed line: Perfect fit (y = x)
   • Good fit: Points should lie close to the diagonal line
   • Deviations at extremes are common due to sampling variability

✅ GEV (Top Row):
   • P-value: 0.9624 → GOOD FIT (α = 0.05)
   • RMSE: 0.032312 (lower is better)

✅ GPD (Bottom Row):
   • P-value: 0.5528 → GOOD FIT (α = 0.05)
   • RMSE: 0.018108 (lower is better)

💡 Note: Visual inspection is crucial in extreme value analysis.
   Even with acceptable p-values, always check if the fit makes physical sense!
================================================================================

Summary Statistics and Report

Let’s generate a comprehensive summary of the extreme value analysis.

[55]:
# Get summary statistics from ExtremesAnalyzer
summary = extremes.get_summary_statistics()

print("EXTREME VALUE ANALYSIS SUMMARY REPORT")
print("=" * 80)
print(f"\nDataset Information:")
print(f"  Period: {dates[0].date()} to {dates[-1].date()}")
print(f"  Duration: {extremes.time_span:.2f} years")
print(f"  Total observations: {summary['n_observations']}")
print(f"  Time unit: {summary['time_unit']}")

print(f"\nData Statistics:")
print(f"  Mean: {summary['mean']:.2f} m/s")
print(f"  Std Dev: {summary['std']:.2f} m/s")
print(f"  Minimum: {summary['min']:.2f} m/s")
print(f"  Maximum: {summary['max']:.2f} m/s")
print(f"  95th percentile: {summary['percentile_95']:.2f} m/s")
print(f"  99th percentile: {summary['percentile_99']:.2f} m/s")

if summary['distribution_name']:
    print(f"\nFitted Distribution: {summary['distribution_name']}")
    print(f"  Parameters: {summary['distribution_params']}")

print(f"\nBlock Maxima Analysis:")
print(f"  Annual maxima: {len(annual_maxima)} years")
print(f"  Mean annual maximum: {np.mean(maxima_values):.2f} m/s")
print(f"  GEV shape parameter: {params[0]:.4f}")

print(f"\nPOT Analysis:")
print(f"  Threshold: {threshold:.2f} m/s (95th percentile)")
print(f"  Number of peaks: {len(peaks)}")
print(f"  Exceedance rate: {exceedance_rate:.2f} peaks/year")
print(f"  GPD shape parameter: {gpd_params[0]:.4f}")

print(f"\nDesign Wind Speeds (GEV):")
for T in [10, 50, 100]:
    rv = maxima_extremes.return_value(T)
    print(f"  {T}-year return value: {rv:.2f} m/s")

print("\n" + "=" * 80)
EXTREME VALUE ANALYSIS SUMMARY REPORT
================================================================================

Dataset Information:
  Period: 1990-01-01 to 2019-12-31
  Duration: 30.00 years
  Total observations: 10957
  Time unit: years

Data Statistics:
  Mean: 7.82 m/s
  Std Dev: 3.48 m/s
  Minimum: 2.11 m/s
  Maximum: 36.96 m/s
  95th percentile: 13.55 m/s
  99th percentile: 22.09 m/s

Block Maxima Analysis:
  Annual maxima: 30 years
  Mean annual maximum: 26.86 m/s
  GEV shape parameter: 7707.7004

POT Analysis:
  Threshold: 13.55 m/s (95th percentile)
  Number of peaks: 340
  Exceedance rate: 11.33 peaks/year
  GPD shape parameter: 0.2944

Design Wind Speeds (GEV):
  10-year return value: 32.10 m/s
  50-year return value: 35.29 m/s
  100-year return value: 36.42 m/s

================================================================================

Best Practices and Recommendations

When to Use Extreme Value Analysis:

  1. Design of structures (wind turbines, buildings, bridges)

  2. Risk assessment (insurance, safety planning)

  3. Climate studies (return periods of extreme weather)

  4. Ocean engineering (extreme wave heights)

  5. Hydrology (flood return periods)

Key Guidelines:

  1. Data Requirements:

    • Minimum 10-30 years for reliable annual maxima analysis

    • Data should be stationary (no long-term trends)

    • Quality control is crucial for extreme values

  2. Method Selection:

    • Use Block Maxima (GEV) when:

      • You have long time series (>20 years)

      • Data has strong seasonal patterns

      • Simplicity and robustness are priorities

    • Use POT (GPD) when:

      • Time series is shorter (<20 years)

      • You want to use more extreme data

      • Extremes don’t follow seasonal patterns

  3. Threshold Selection (POT):

    • Too low: violates GPD assumptions (not in tail)

    • Too high: insufficient data for fitting

    • Start with 90-95th percentile, adjust based on diagnostics

  4. Declustering:

    • Essential for POT to ensure independence

    • Separation time depends on phenomenon (storms: 3-7 days)

    • Use autocorrelation analysis to guide selection

  5. Return Period Interpretation:

    • T-year return value ≠ maximum in T years

    • It’s the value exceeded on average once every T years

    • Annual exceedance probability = 1/T

  6. Uncertainty:

    • Extrapolation beyond observed data is uncertain

    • Always report confidence intervals

    • Be especially cautious with return periods > 2 × record length

  7. Model Validation:

    • Visual diagnostics are ESSENTIAL (not optional!)

    • Always plot:

      • Histogram vs PDF: Check if theoretical curve matches observed distribution

      • Q-Q plots: Assess deviations from theoretical quantiles

      • Return level plots: Validate extrapolation behavior

    • Statistical tests alone are insufficient:

      • P-values can be misleading with small samples (typical in EVA)

      • Visual inspection catches systematic deviations

      • Physical plausibility check requires visualization

    • Compare multiple models (GEV, Gumbel, Weibull)

    • Red flags in plots:

      • Systematic curvature in Q-Q plots (wrong distribution family)

      • Poor PDF match in tails (critical for return values)

      • Erratic behavior in return level plots beyond data range

Common Pitfalls to Avoid:

  • ❌ Using insufficient data length

  • ❌ Ignoring data quality issues

  • ❌ Not checking for trends/non-stationarity

  • ❌ Over-extrapolating beyond data range

  • ❌ Forgetting to decluster POT peaks

  • ❌ Ignoring seasonal effects

  • ❌ Not validating model assumptions

  • Relying on p-values alone without visual inspection

  • Skipping histogram/PDF and Q-Q plots

Further Reading

Key References:

  1. Coles, S. (2001). An Introduction to Statistical Modeling of Extreme Values.
    Springer. (The definitive textbook on extreme value theory)
  2. Beirlant, J., et al. (2004). Statistics of Extremes: Theory and Applications.
    Wiley.
  3. Gumbel, E.J. (1958). Statistics of Extremes.
    Columbia University Press. (Classic text)
  4. Serinaldi, F. & Kilsby, C.G. (2014).
    “Rainfall extremes: Toward reconciliation after the battle of distributions.”
    Water Resources Research, 50(1), 336-352.

Online Resources:

  • R package ‘extRemes’: Excellent documentation and examples

  • Python package ‘pyextremes’: Modern Python implementation

  • NCAR’s EVA Tutorial: Practical guide for climate data

MagicA Documentation:

  • For more on distributions: See the AutoFitter tutorial

  • For Monte Carlo analysis: See the Monte Carlo stability tutorial

  • API reference: Check the ExtremesAnalyzer documentation