Tutorials

This section contains comprehensive tutorials and examples for using MagicA for statistical distribution fitting and Monte Carlo stability analysis.

Tip

New to MagicA? Start with the Quick Start guide, then explore the interactive Jupyter notebooks below.

Interactive Jupyter Notebooks

MagicAdjuster Tutorial

Comprehensive guide to single distribution fitting with Monte Carlo CPS (Coefficient/P-value/Sample size) methods.

This tutorial covers:

  • Distribution fitting and parameter estimation

  • Goodness-of-fit testing (KS, Chi-square, RMSE)

  • Different binning strategies for histogram-based tests

  • Monte Carlo stability analysis for determining minimum sample sizes

  • Large Sample Size Effect - why p-values become unreliable with large datasets (>10,000 samples)

  • RMSE as the most reliable stability indicator

  • Comparing different sampling strategies (random, bootstrap, disjoint)

  • CPS (Coefficient/P-value/Sample size) method from Lin et al. (2011)

  • Real-world example with INMET weather station data

Important

This tutorial demonstrates why RMSE is the preferred metric for assessing fit quality and stability, especially with large datasets where p-values can be misleading.

Key Learning Outcomes:

  • Master Monte Carlo stability analysis

  • Understand when p-values are reliable vs. unreliable

  • Learn to use RMSE as primary fit quality indicator

  • Determine optimal sample sizes for your data

AutoFitter Tutorial

Learn how to automatically find the best distribution from 113+ scipy.stats options.

This tutorial covers:

  • Automatic distribution testing and selection

  • RMSE-based selection (recommended for all data types)

  • Testing default curated list (16 distributions) vs. comprehensive set (113+)

  • Creating domain-specific distribution lists (e.g., wind, rainfall)

  • Comparison tables and ranking by different criteria

  • Filtering by p-value for synthetic data (with warnings for real-world use)

  • Understanding when each selection criterion is appropriate

  • Using the best-fitted distribution for further analysis

Note

This tutorial emphasizes RMSE as the primary criterion for distribution selection, with clear guidance on when p-value filtering is appropriate vs. problematic.

Key Learning Outcomes:

  • Automate distribution selection with confidence

  • Choose appropriate selection criteria for your data

  • Understand trade-offs between speed (default list) and thoroughness (comprehensive)

  • Apply best practices for real-world vs. synthetic data

Extreme Value Analysis Tutorial

Comprehensive guide to analyzing extreme events with return period calculations and extreme value distributions.

This tutorial covers:

  • Creating ExtremesAnalyzer from time series data

  • Block Maxima approach - GEV (Generalized Extreme Value) distribution

  • Peaks Over Threshold (POT) - GPD (Generalized Pareto Distribution)

  • Return period and return value calculations

  • Declustering techniques for independent extreme events

  • Monthly and seasonal block maxima extraction

  • Return level plots and visualization

  • Comparing GEV vs. GPD approaches

  • Flexible time series handling (pandas, datetime64, numeric arrays)

  • Best practices for extreme value analysis

Important

This tutorial demonstrates real-world applications in wind engineering, climate studies, and structural design using both classical (Block Maxima) and modern (POT) extreme value methods.

Key Learning Outcomes:

  • Master both Block Maxima (GEV) and POT (GPD) methods

  • Calculate design values for infrastructure (10-year, 50-year, 100-year return periods)

  • Understand when to use each extreme value approach

  • Extract and analyze seasonal patterns in extremes

  • Apply proper declustering for POT analysis

API Usage Tutorial

Quick reference guide for common MagicA operations and API patterns.

MagicA API Usage

Basic Example: Weibull Fit and Goodness-of-Fit Evaluation

import magica as ma
import numpy as np

# Example wind speed data
wind_data = [2.1, 5.4, 8.7, 12.3, 6.8, 9.1, 15.2, 3.4, 7.6, 11.0, 4.5, 13.2, 8.9, 6.7, 10.5]

# 1. Load data
processor = ma.read_data(wind_data)

# 2. Fit Weibull distribution
fitted = processor.fit_distribution('weibull')
params = fitted.get_fitted_params()

# 3. Evaluate goodness-of-fit
chi2 = fitted.goodness_of_fit('chi2')
ks = fitted.goodness_of_fit('ks')
rms = fitted.goodness_of_fit('rms')

print('Weibull parameters:', params)
print('Chi-square:', chi2)
print('Kolmogorov-Smirnov:', ks)
print('Root Mean Square:', rms)

Monte Carlo Stability Analysis

Determine the minimum sample size needed for stable parameter estimation:

import numpy as np
from magica.core import MagicAdjuster

# Generate larger dataset for Monte Carlo analysis
data = np.random.weibull(2, 1000)

# Create adjuster and fit distribution
adjuster = MagicAdjuster(data)
adjuster.fit_distribution('weibull_min')

# Run Monte Carlo stability analysis
results = adjuster.monte_carlo_fit(
  n_repeats=100,
  tests=['chi2', 'ks']  # no figure by default
)

# Work with xarray Dataset
ks_pvalues = results['ks_pvalue']
param_medians = results['param_0'].median(dim='repeats')

# Check stability points
stability = results.attrs['stability_points']
print(f"Parameter 0 stabilizes at: {stability['param_0']['size']}")
Advanced Monte Carlo Usage
# Custom sizes with fitting constraints
results = adjuster.monte_carlo_fit(
  sizes=[50, 100, 200, 400],
  n_repeats=200,
  tests=['chi2', 'ks', 'rmse'],
  plot_type='boxplots',
  fig_output_path='stability_boxplots.png',
  bins='scott',
  fit_kwargs={'floc': 0}
)

# Using pre-calculated parameters (bypass fitting)
known_params = (2.0, 0.0, 1.0)  # shape, loc, scale for Weibull
results = adjuster.monte_carlo_fit(
    distribution_params=known_params,
    n_repeats=150,
    tests=['chi2', 'ks']
)

# Easy data manipulation with xarray
size_200_data = results.sel(sizes=200)
param_convergence = results['param_0'].std(dim='repeats')

# Visualize parameter stability
results['param_0'].plot(x='sizes', hue='repeats', alpha=0.3)

API Reference

read_data(data)

Loads and validates data for analysis.

  • Parameters:

    • data: array-like (list, numpy array, pandas Series/DataFrame)

  • Returns: DataProcessor instance

DataProcessor.fit_distribution(distribution)

Fits a statistical distribution to the data.

  • Parameters:

    • distribution: str or scipy.stats distribution (e.g., ‘weibull’, ‘norm’, ‘gamma’)

  • Returns: DataProcessor (with fitted distribution)

DataProcessor.get_fitted_params()

Returns fitted parameters of the distribution.

  • Returns: tuple

DataProcessor.goodness_of_fit(method, bins='doane')

Evaluates the fit using a statistical test.

  • Parameters:

    • method: ‘chi2’, ‘ks’, or ‘rms’

    • bins: binning method (default ‘doane’)

  • Returns: dict with test results

MagicAdjuster.monte_carlo_fit(sizes=None, n_repeats=20, tests=['ks'], fig_output_path=None, plot_type='series', sampling='random', distribution_params=None, **kwargs)

Performs Monte Carlo stability analysis to determine minimum sample size for reliable parameter estimation.

  • Parameters:

    • sizes: list, optional - Explicit list of sample sizes

    • n_repeats: int - Repetitions per size (default 20)

    • tests: list - GOF tests to perform [‘chi2’,’ks’,’rmse’]

    • fig_output_path: str, optional - Save 2x3 summary figure if provided (includes red dashed stability lines)

    • plot_type: ‘series’ or ‘boxplots’ - Panel style when saving figure

    • sampling: ‘random’,’bootstrap’,’disjoint’ - subsampling strategy

    • distribution_params: tuple, optional - Use fixed params (no refitting)

    • bins: (kwarg) binning method for chi2 (default ‘doane’)

    • fit_kwargs: (kwarg) constraints passed to fit_distribution

    • plot: str or bool - Plotting option (‘series’, ‘boxplots’, True, or False) (default True)

    • sampling: str - Sampling strategy (‘random’ or ‘bootstrap’) (default ‘random’)

    • distribution_params: tuple, optional - Pre-calculated distribution parameters

    • bins: str or int - Binning strategy for chi-square test (default ‘doane’)

    • fit_kwargs: dict - Additional arguments passed to fit_distribution()

  • Returns: xarray.Dataset with dimensions [‘sizes’, ‘repeats’] and variables for parameters and test results

    • method: ‘chi2’, ‘ks’, or ‘rms’

    • bins: binning method (default ‘doane’)

  • Returns: dict with test results

DataProcessor.get_basic_stats()

Returns basic statistics of the data.

  • Returns: dict

DataProcessor.get_distribution_info()

Returns info about the fitted distribution.

  • Returns: dict

Working with xarray Results

The monte_carlo_fit method returns results as an xarray Dataset, providing powerful data manipulation capabilities:

# Run Monte Carlo analysis
results = adjuster.monte_carlo_fit(n_repeats=50, tests=['ks', 'chi2'])

# Dataset overview
print(results)
print(results.dims)  # Dimensions: sizes, repeats
print(results.data_vars)  # Variables: param_0, ks_statistic, ks_pvalue, etc.

# Select data for specific sample size
size_200 = results.sel(sizes=200)
print(size_200['param_0'].values)  # All parameter values for size 200

# Calculate statistics across repeats
param_means = results['param_0'].mean(dim='repeats')
param_stds = results['param_0'].std(dim='repeats')
ks_medians = results['ks_pvalue'].median(dim='repeats')

# Select size range
large_sizes = results.sel(sizes=slice(200, None))

# Convert to pandas DataFrame for further analysis
df = results.to_dataframe()

# Plot with built-in xarray methods
results['ks_pvalue'].plot(x='sizes', hue='repeats', alpha=0.5)

# Access metadata and stability points
stability = results.attrs['stability_points']
figure = results.attrs['figure']

Goodness-of-Fit Methods

  • 'chi2': Chi-square test

  • 'ks': Kolmogorov-Smirnov test

  • 'rms': Root Mean Square error between observed and estimated PDF


See also: example_magic_adjuster.py

Monte Carlo Stability Tutorial

Detailed explanation of Monte Carlo methods for stability analysis and sample size determination.

Monte Carlo Stability Tutorial

This tutorial explains why Monte Carlo stability analysis is useful and how the monte_carlo_fit routine in MagicA is designed and used.

Why Monte Carlo stability?

Large-sample-size effects occur when very large datasets make statistical tests extremely powerful: p-values become very small even for effects that are practically negligible. In other words, with enough data a test can reject the null hypothesis for differences that have no practical importance. The Monte Carlo stability workflow therefore looks for a sampling size where the chosen goodness-of-fit tests start to “pass” in a practical sense and parameter estimates stop changing much across repeats — that sample size is reported as the stability point.

Methodology (what we do)

  1. Build a grid of sample sizes to test (or use a provided list).

  2. For each size, generate multiple subsamples (repeats) from the original data.

  3. For each subsample, fit the distribution (unless fixed parameters are supplied).

  4. Compute goodness-of-fit metrics (e.g., KS p-value, chi-square p-value, RMSE).

  5. Store all results in an xarray Dataset with dimensions sizes x repeats.

  6. Detect stability points using a moving-window criterion on variability (e.g., CV).

  7. Optionally save a 2x3 summary figure with red dashed vertical lines marking each variable’s detected stability sample size.

Inputs and options

  • sizes: list or auto-generated grid of sample sizes.

  • n_repeats: number of subsamples per size.

  • tests: list of metrics to compute: [‘ks’,’chi2’,’rmse’].

  • stability_method: method for detecting stability points (‘kneedle’ [default], ‘cv’, ‘plateau’).

  • sampling: sampling strategy (random, bootstrap, disjoint).

  • fit_kwargs: passed to the underlying fit routine (e.g., floc=0).

  • distribution_params: optional fixed parameters (skip fitting when supplied).

  • fig_output_path: optional path to save the 2x3 summary figure.

  • plot_type: ‘series’ (median + IQR) or ‘boxplots’.

Stability Detection Methods

MagicA provides three methods for detecting when parameters and goodness-of-fit metrics have stabilized. Each method has different strengths depending on the metric behavior.

CV Method (Coefficient of Variation)

Principle: Monitors the coefficient of variation (CV = std/mean) across repeats. Stability is detected when CV remains below a threshold for a consecutive window of sample sizes.

Mathematical definition:

\[CV_n = \frac{\sigma_n}{\mu_n}\]

Where \(\sigma_n\) is the standard deviation and \(\mu_n\) is the mean across repeats at sample size n.

When to use:

  • Erratic p-values (KS, Chi-square tests)

  • When you need quantitative variability assessment

  • Conservative estimates preferred

Parameters (via kwargs):

  • window_size: Number of consecutive sizes for validation (default: 25% of n_sizes)

  • cv_threshold: Maximum allowed CV (default: 0.1)

Kneedle Method (Elbow Detection)

Principle: Based on the “Kneedle” algorithm (Satopää et al., 2011). Finds the “elbow” or “knee” point in a curve by:

  1. Normalizing the curve to [0, 1]

  2. Computing a reference line from first to last point

  3. Finding the point with maximum perpendicular distance from the reference line

Mathematical definition:

The knee point is where:

\[\text{knee} = \arg\max_i |y_i - (y_0 + (y_n - y_0) \cdot \frac{i}{n})|\]

Where \(y_i\) is the normalized curve value at point i.

When to use:

  • RMSE and converging parameters (RECOMMENDED)

  • Smooth, monotonic convergence patterns

  • Finding “point of diminishing returns”

  • Production analyses requiring objective detection

Parameters (via kwargs):

  • smooth: Whether to smooth curves before detection (default: True)

  • smoothing_method: ‘savgol’ (preserves features) or ‘spline’ (default: ‘savgol’)

Reference: Satopää, V., Albrecht, J., Irwin, D., & Raghavan, B. (2011). Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior. 2011 31st International Conference on Distributed Computing Systems Workshops, 166-171.

Plateau Method (Relative Gain Heuristic)

Principle: Detects when relative improvement between consecutive points becomes negligible. Stability is found when the relative change stays below a threshold for L consecutive points.

Mathematical definition:

Relative change at point i:

\[\Delta_i = \frac{|y_{i-1} - y_i|}{|y_{i-1}|}\]

Stability when:

\[\Delta_i < \epsilon \text{ for } L \text{ consecutive points}\]

When to use:

  • Limited computational budget

  • Early “good enough” detection

  • Iterative analyses with early stopping

Parameters (via kwargs):

  • consecutive_points: Number of points below tolerance (default: 3)

  • relative_tolerance: Maximum relative change (default: 0.01, i.e., 1%)

Method Selection Guide

Method

Best For

Typical Use Case

Kneedle

RMSE, parameters

Production RMSE analysis (RECOMMENDED)

CV

P-values (KS, Chi²)

Conservative p-value stability

Plateau

Early detection, limited budget

Quick exploration, iterative tuning

Recommended workflow:

# Use Kneedle for RMSE (most reliable)
results = adjuster.monte_carlo_fit(
    tests=['rmse'],
    stability_method='kneedle',
    smooth=True
)

recommended_n = results.attrs['recommended_size']
print(f"RMSE stable at n = {recommended_n}")

Sampling strategies

The helper _generate_subsample_indices() produces integer index lists that are used by the Monte Carlo routines to build subsamples from the original dataset. Below is a concise, practical explanation of the three supported strategies and how to choose between them.

  • random (without replacement) - What: Each subsample contains size unique indices drawn randomly from the

    original dataset (no duplicates inside the same subsample).

    • Constraint: size must be less than or equal to the original sample size N.

    • When to use: Typical default for studying how estimates change with smaller sample sizes. Keeps each subsample representative and avoids internal duplication.

  • bootstrap (with replacement) - What: Each subsample is drawn with replacement, so the same original row

    can appear multiple times in a single subsample.

    • Constraint: None — size may be larger than N because indices can repeat.

    • When to use: Use when you need to estimate uncertainty (variance, CIs) via resampling, or when you want to allow size >= N for simulation purposes.

  • disjoint (non-overlapping partitions) - What: The original indices are shuffled and partitioned into non-overlapping

    blocks of length size. Each block is a subsample with no shared indices.

    • Constraint: size must be <= N. The number of blocks per shuffle is N // size; for n_repeats larger than that, additional shuffles are used.

    • When to use: Use when you want independent partitions (similar to simple cross-validation) and want to avoid overlap between subsamples within the same shuffle.

Reproducibility and seed

  • A random seed controls the pseudo-random generator used to create indices. Providing the same seed reproduces the same sequence of subsamples. This is recommended for experiments where results must be repeatable.

Practical guidance

  • Use random as a safe default when size <= N and you want unbiased subsamples without duplicates.

  • Use bootstrap when you need to estimate uncertainty from resampling or when you want to allow size >= N.

  • Use disjoint when you need non-overlapping partitions to compare independent fits or to maximize coverage of the original dataset without duplication.

See _generate_subsample_indices() source for exact behaviour and edge-case handling.

Outputs

  • An xarray.Dataset with dimensions: - sizes: tested sample sizes - repeats: repetition index

  • Data variables: param_0, param_1, …, ks_statistic, ks_pvalue, chi2_statistic, chi2_pvalue, rmse (depending on tests).

  • Attributes: - stability_points: dict of detected sizes per variable with method info - recommended_size: sample size where primary metric stabilizes - primary_metric: the metric used for recommended_size (typically ‘rmse’) - stable_pvalue_ks, stable_pvalue_chi2, stable_rmse: metric values at stable point - figure_path: path to saved figure (if generated) - Metadata about sampling, bins, and configuration

Quick example

import numpy as np
from magica.core import MagicAdjuster

data = np.random.weibull(2, 1000)
adjuster = MagicAdjuster(data)
adjuster.fit_distribution('weibull_min')

# Use Kneedle method for RMSE stability detection
ds = adjuster.monte_carlo_fit(
    sizes=[50, 100, 200, 400, 600, 800],
    n_repeats=50,
    tests=['ks', 'chi2', 'rmse'],
    stability_method='kneedle',  # Recommended for RMSE
    smooth=True,                 # Smooth curves before detection
    sampling='random',
    fit_kwargs={'floc': 0},
    fig_output_path='mc_summary.png',
    plot_type='series'
)

# Access recommended size
recommended_n = ds.attrs['recommended_size']
print(f"Recommended sample size: {recommended_n}")

# Check all stability points
print(ds.attrs['stability_points'])

Interpretation tips

  • Look at medians/boxplots of parameters across sizes — convergence indicates stable estimation.

  • Check ks_pvalue / chi2_pvalue behavior: rising p-values toward larger sizes suggest better fit at those sizes, but beware of the large-sample-size effect where p-values can become unreliable.

  • RMSE is the most reliable indicator for stability: it shows smooth, monotonic decrease and clear convergence. Always include RMSE in your tests.

  • Use recommended_size (based on primary metric, typically RMSE) as the recommended minimum sample size for robust parameter estimation.

  • The generated figure shows: - Parameter convergence across sample sizes - Goodness-of-fit evolution (KS, Chi-square p-values, RMSE) - Red vertical lines marking detected stability points - Orange dashed lines showing smoothed curves (for Kneedle method) - Info boxes with metric values at stable points

  • Compare stability points across different methods to understand metric behavior.

Method Comparison

When analyzing your results, compare how different metrics stabilize:

stability = ds.attrs['stability_points']

# RMSE typically stabilizes earliest and most reliably
rmse_stable = stability['rmse']['size']
rmse_method = stability['rmse']['method']

# P-values may stabilize later or show erratic behavior
ks_stable = stability['ks_pvalue']['size']

print(f"RMSE stable at n = {rmse_stable} (method: {rmse_method})")
print(f"KS p-value stable at n = {ks_stable}")

# Use RMSE-based recommendation for production
recommended = ds.attrs['recommended_size']
print(f"\n⭐ Recommended sample size: {recommended}")

Further reading

See Core Module for function signature and options, and the example notebook in this folder for a hands-on run.

Which Tutorial Should I Use?

Choose MagicAdjuster Tutorial if:

  • ✓ You know which distribution to fit (e.g., Weibull for wind speed)

  • ✓ You want to perform detailed stability analysis

  • ✓ You need to understand parameter uncertainty

  • ✓ You’re working with large datasets (>10,000 samples)

  • ✓ You want to learn about the large sample size effect

Choose AutoFitter Tutorial if:

  • ✓ You’re unsure which distribution fits your data best

  • ✓ You want to compare multiple distributions quickly

  • ✓ You need a comprehensive distribution search

  • ✓ You’re exploring a new dataset

  • ✓ You want to validate your distribution choice

Choose Extreme Value Analysis Tutorial if:

  • ✓ You need to calculate return periods and design values

  • ✓ You’re working with extreme events (storms, floods, heat waves)

  • ✓ You need 10-year, 50-year, or 100-year design values

  • ✓ You want to compare Block Maxima (GEV) vs. POT (GPD) approaches

  • ✓ You’re doing wind/structural engineering or climate risk assessment

All tutorials emphasize:

  • RMSE as the most reliable metric for fit quality

  • 📊 When p-values are trustworthy vs. misleading

  • 🎯 Best practices for real-world data analysis

Legacy Examples

Weibull Fit Example

Note

The Jupyter notebook MagicA_Weibull_Fit_Example.ipynb can be found in this directory. You can download and run it locally to follow along with the examples.

Note: This is a legacy example. We recommend using the updated MagicAdjuster tutorial instead.

Magic Adjuster Example

Note

The Python script example_magic_adjuster.py contains example usage of the MagicAdjuster class. You can find it in this directory and run it as a standalone script.

Note: This is a legacy example. We recommend using the updated tutorials instead.

Downloadable Files

Download the following files to run locally:

Main Tutorials:

Legacy Materials:

  • MagicA_Weibull_Fit_Example.ipynb - Legacy Weibull fitting examples

  • example_magic_adjuster.py - Python script with MagicAdjuster examples

  • api_usage.md - API usage guide in Markdown format

Best Practices Summary

Based on the tutorials, here are the key best practices:

Distribution Selection:

  1. Use RMSE as primary criterion for real-world data

  2. Start with AutoFitter’s default list for quick screening

  3. Use comprehensive search when you need the absolute best fit

  4. Create domain-specific lists for faster, targeted analysis

Monte Carlo Analysis:

  1. Always include ‘rmse’ in tests - most reliable stability indicator

  2. Use 100+ repeats for robust stability detection

  3. Generate summary figures with fig_output_path for visual inspection

  4. Check RMSE stability first, then validate with other metrics

Large Sample Size Effect:

  1. Avoid p-value-only selection with large datasets (>10,000 samples)

  2. Use RMSE for datasets of any size - it’s always reliable

  3. Understand that p-values can reject even excellent fits with large N

  4. See MagicAdjuster tutorial for detailed examples

Next Steps

After completing the tutorials: