Tutorials

This section contains comprehensive tutorials and examples for using MagicA for statistical distribution fitting and Monte Carlo stability analysis.

Tip

New to MagicA? Start with the Quick Start guide, then explore the interactive Jupyter notebooks below.

Interactive Jupyter Notebooks

MagicAdjuster Tutorial

Comprehensive guide to single distribution fitting with Monte Carlo CPS (Coefficient/P-value/Sample size) methods.

This tutorial covers:

Distribution fitting and parameter estimation
Goodness-of-fit testing (KS, Chi-square, RMSE)
Different binning strategies for histogram-based tests
Monte Carlo stability analysis for determining minimum sample sizes
Large Sample Size Effect - why p-values become unreliable with large datasets (>10,000 samples)
RMSE as the most reliable stability indicator
Comparing different sampling strategies (random, bootstrap, disjoint)
CPS (Coefficient/P-value/Sample size) method from Lin et al. (2011)
Real-world example with INMET weather station data

Important

This tutorial demonstrates why RMSE is the preferred metric for assessing fit quality and stability, especially with large datasets where p-values can be misleading.

MagicAdjuster Tutorial

Key Learning Outcomes:

Master Monte Carlo stability analysis
Understand when p-values are reliable vs. unreliable
Learn to use RMSE as primary fit quality indicator
Determine optimal sample sizes for your data

AutoFitter Tutorial

Learn how to automatically find the best distribution from 113+ scipy.stats options.

This tutorial covers:

Automatic distribution testing and selection
RMSE-based selection (recommended for all data types)
Testing default curated list (16 distributions) vs. comprehensive set (113+)
Creating domain-specific distribution lists (e.g., wind, rainfall)
Comparison tables and ranking by different criteria
Filtering by p-value for synthetic data (with warnings for real-world use)
Understanding when each selection criterion is appropriate
Using the best-fitted distribution for further analysis

Note

This tutorial emphasizes RMSE as the primary criterion for distribution selection, with clear guidance on when p-value filtering is appropriate vs. problematic.

AutoFitter Tutorial

Key Learning Outcomes:

Automate distribution selection with confidence
Choose appropriate selection criteria for your data
Understand trade-offs between speed (default list) and thoroughness (comprehensive)
Apply best practices for real-world vs. synthetic data

Extreme Value Analysis Tutorial

Comprehensive guide to analyzing extreme events with return period calculations and extreme value distributions.

This tutorial covers:

Creating ExtremesAnalyzer from time series data
Block Maxima approach - GEV (Generalized Extreme Value) distribution
Peaks Over Threshold (POT) - GPD (Generalized Pareto Distribution)
Return period and return value calculations
Declustering techniques for independent extreme events
Monthly and seasonal block maxima extraction
Return level plots and visualization
Comparing GEV vs. GPD approaches
Flexible time series handling (pandas, datetime64, numeric arrays)
Best practices for extreme value analysis

Important

This tutorial demonstrates real-world applications in wind engineering, climate studies, and structural design using both classical (Block Maxima) and modern (POT) extreme value methods.

Extreme Value Analysis Tutorial

Key Learning Outcomes:

Master both Block Maxima (GEV) and POT (GPD) methods
Calculate design values for infrastructure (10-year, 50-year, 100-year return periods)
Understand when to use each extreme value approach
Extract and analyze seasonal patterns in extremes
Apply proper declustering for POT analysis

API Usage Tutorial

Quick reference guide for common MagicA operations and API patterns.

MagicA API Usage

Basic Example: Weibull Fit and Goodness-of-Fit Evaluation

import magica as ma
import numpy as np

# Example wind speed data
wind_data = [2.1, 5.4, 8.7, 12.3, 6.8, 9.1, 15.2, 3.4, 7.6, 11.0, 4.5, 13.2, 8.9, 6.7, 10.5]

# 1. Load data
processor = ma.read_data(wind_data)

# 2. Fit Weibull distribution
fitted = processor.fit_distribution('weibull')
params = fitted.get_fitted_params()

# 3. Evaluate goodness-of-fit
chi2 = fitted.goodness_of_fit('chi2')
ks = fitted.goodness_of_fit('ks')
rms = fitted.goodness_of_fit('rms')

print('Weibull parameters:', params)
print('Chi-square:', chi2)
print('Kolmogorov-Smirnov:', ks)
print('Root Mean Square:', rms)

Monte Carlo Stability Analysis

Determine the minimum sample size needed for stable parameter estimation:

import numpy as np
from magica.core import MagicAdjuster

# Generate larger dataset for Monte Carlo analysis
data = np.random.weibull(2, 1000)

# Create adjuster and fit distribution
adjuster = MagicAdjuster(data)
adjuster.fit_distribution('weibull_min')

# Run Monte Carlo stability analysis
results = adjuster.monte_carlo_fit(
  n_repeats=100,
  tests=['chi2', 'ks']  # no figure by default
)

# Work with xarray Dataset
ks_pvalues = results['ks_pvalue']
param_medians = results['param_0'].median(dim='repeats')

# Check stability points
stability = results.attrs['stability_points']
print(f"Parameter 0 stabilizes at: {stability['param_0']['size']}")

Advanced Monte Carlo Usage

# Custom sizes with fitting constraints
results = adjuster.monte_carlo_fit(
  sizes=[50, 100, 200, 400],
  n_repeats=200,
  tests=['chi2', 'ks', 'rmse'],
  plot_type='boxplots',
  fig_output_path='stability_boxplots.png',
  bins='scott',
  fit_kwargs={'floc': 0}
)

# Using pre-calculated parameters (bypass fitting)
known_params = (2.0, 0.0, 1.0)  # shape, loc, scale for Weibull
results = adjuster.monte_carlo_fit(
    distribution_params=known_params,
    n_repeats=150,
    tests=['chi2', 'ks']
)

# Easy data manipulation with xarray
size_200_data = results.sel(sizes=200)
param_convergence = results['param_0'].std(dim='repeats')

# Visualize parameter stability
results['param_0'].plot(x='sizes', hue='repeats', alpha=0.3)

API Reference

`read_data(data)`

Loads and validates data for analysis.

Parameters:
- data: array-like (list, numpy array, pandas Series/DataFrame)
Returns: DataProcessor instance

`DataProcessor.fit_distribution(distribution)`

Fits a statistical distribution to the data.

Parameters:
- distribution: str or scipy.stats distribution (e.g., ‘weibull’, ‘norm’, ‘gamma’)
Returns: DataProcessor (with fitted distribution)

`DataProcessor.get_fitted_params()`

Returns fitted parameters of the distribution.

Returns: tuple

`DataProcessor.goodness_of_fit(method, bins='doane')`

Evaluates the fit using a statistical test.

Parameters:
- method: ‘chi2’, ‘ks’, or ‘rms’
- bins: binning method (default ‘doane’)
Returns: dict with test results

`MagicAdjuster.monte_carlo_fit(sizes=None, n_repeats=20, tests=['ks'], fig_output_path=None, plot_type='series', sampling='random', distribution_params=None, **kwargs)`

Performs Monte Carlo stability analysis to determine minimum sample size for reliable parameter estimation.

Parameters:
- sizes: list, optional - Explicit list of sample sizes
- n_repeats: int - Repetitions per size (default 20)
- tests: list - GOF tests to perform [‘chi2’,’ks’,’rmse’]
- fig_output_path: str, optional - Save 2x3 summary figure if provided (includes red dashed stability lines)
- plot_type: ‘series’ or ‘boxplots’ - Panel style when saving figure
- sampling: ‘random’,’bootstrap’,’disjoint’ - subsampling strategy
- distribution_params: tuple, optional - Use fixed params (no refitting)
- bins: (kwarg) binning method for chi2 (default ‘doane’)
- fit_kwargs: (kwarg) constraints passed to fit_distribution
- plot: str or bool - Plotting option (‘series’, ‘boxplots’, True, or False) (default True)
- sampling: str - Sampling strategy (‘random’ or ‘bootstrap’) (default ‘random’)
- distribution_params: tuple, optional - Pre-calculated distribution parameters
- bins: str or int - Binning strategy for chi-square test (default ‘doane’)
- fit_kwargs: dict - Additional arguments passed to fit_distribution()
Returns: xarray.Dataset with dimensions [‘sizes’, ‘repeats’] and variables for parameters and test results
- method: ‘chi2’, ‘ks’, or ‘rms’
- bins: binning method (default ‘doane’)
Returns: dict with test results

`DataProcessor.get_basic_stats()`

Returns basic statistics of the data.

Returns: dict

`DataProcessor.get_distribution_info()`

Returns info about the fitted distribution.

Returns: dict

Working with xarray Results

The monte_carlo_fit method returns results as an xarray Dataset, providing powerful data manipulation capabilities:

# Run Monte Carlo analysis
results = adjuster.monte_carlo_fit(n_repeats=50, tests=['ks', 'chi2'])

# Dataset overview
print(results)
print(results.dims)  # Dimensions: sizes, repeats
print(results.data_vars)  # Variables: param_0, ks_statistic, ks_pvalue, etc.

# Select data for specific sample size
size_200 = results.sel(sizes=200)
print(size_200['param_0'].values)  # All parameter values for size 200

# Calculate statistics across repeats
param_means = results['param_0'].mean(dim='repeats')
param_stds = results['param_0'].std(dim='repeats')
ks_medians = results['ks_pvalue'].median(dim='repeats')

# Select size range
large_sizes = results.sel(sizes=slice(200, None))

# Convert to pandas DataFrame for further analysis
df = results.to_dataframe()

# Plot with built-in xarray methods
results['ks_pvalue'].plot(x='sizes', hue='repeats', alpha=0.5)

# Access metadata and stability points
stability = results.attrs['stability_points']
figure = results.attrs['figure']

Goodness-of-Fit Methods

'chi2': Chi-square test
'ks': Kolmogorov-Smirnov test
'rms': Root Mean Square error between observed and estimated PDF

Monte Carlo Stability Tutorial

Detailed explanation of Monte Carlo methods for stability analysis and sample size determination.

Monte Carlo Stability Tutorial

This tutorial explains why Monte Carlo stability analysis is useful and how the monte_carlo_fit routine in MagicA is designed and used.

Why Monte Carlo stability?

Large-sample-size effects occur when very large datasets make statistical tests extremely powerful: p-values become very small even for effects that are practically negligible. In other words, with enough data a test can reject the null hypothesis for differences that have no practical importance. The Monte Carlo stability workflow therefore looks for a sampling size where the chosen goodness-of-fit tests start to “pass” in a practical sense and parameter estimates stop changing much across repeats — that sample size is reported as the stability point.

Methodology (what we do)

Build a grid of sample sizes to test (or use a provided list).
For each size, generate multiple subsamples (repeats) from the original data.
For each subsample, fit the distribution (unless fixed parameters are supplied).
Compute goodness-of-fit metrics (e.g., KS p-value, chi-square p-value, RMSE).
Store all results in an xarray Dataset with dimensions sizes x repeats.
Detect stability points using a moving-window criterion on variability (e.g., CV).
Optionally save a 2x3 summary figure with red dashed vertical lines marking each variable’s detected stability sample size.

Inputs and options

sizes: list or auto-generated grid of sample sizes.
n_repeats: number of subsamples per size.
tests: list of metrics to compute: [‘ks’,’chi2’,’rmse’].
stability_method: method for detecting stability points (‘kneedle’ [default], ‘cv’, ‘plateau’).
sampling: sampling strategy (random, bootstrap, disjoint).
fit_kwargs: passed to the underlying fit routine (e.g., floc=0).
distribution_params: optional fixed parameters (skip fitting when supplied).
fig_output_path: optional path to save the 2x3 summary figure.
plot_type: ‘series’ (median + IQR) or ‘boxplots’.

Stability Detection Methods

MagicA provides three methods for detecting when parameters and goodness-of-fit metrics have stabilized. Each method has different strengths depending on the metric behavior.

CV Method (Coefficient of Variation)

Principle: Monitors the coefficient of variation (CV = std/mean) across repeats. Stability is detected when CV remains below a threshold for a consecutive window of sample sizes.

Mathematical definition:

\[CV_n = \frac{\sigma_n}{\mu_n}\]

Where \(\sigma_n\) is the standard deviation and \(\mu_n\) is the mean across repeats at sample size n.

When to use:

Erratic p-values (KS, Chi-square tests)
When you need quantitative variability assessment
Conservative estimates preferred

Parameters (via kwargs):

window_size: Number of consecutive sizes for validation (default: 25% of n_sizes)
cv_threshold: Maximum allowed CV (default: 0.1)

Kneedle Method (Elbow Detection)

Principle: Based on the “Kneedle” algorithm (Satopää et al., 2011). Finds the “elbow” or “knee” point in a curve by:

Normalizing the curve to [0, 1]
Computing a reference line from first to last point
Finding the point with maximum perpendicular distance from the reference line

Mathematical definition:

The knee point is where:

\[\text{knee} = \arg\max_i |y_i - (y_0 + (y_n - y_0) \cdot \frac{i}{n})|\]

Where \(y_i\) is the normalized curve value at point i.

When to use:

RMSE and converging parameters (RECOMMENDED)
Smooth, monotonic convergence patterns
Finding “point of diminishing returns”
Production analyses requiring objective detection

Parameters (via kwargs):

smooth: Whether to smooth curves before detection (default: True)
smoothing_method: ‘savgol’ (preserves features) or ‘spline’ (default: ‘savgol’)

Reference: Satopää, V., Albrecht, J., Irwin, D., & Raghavan, B. (2011). Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior. 2011 31st International Conference on Distributed Computing Systems Workshops, 166-171.

Plateau Method (Relative Gain Heuristic)

Principle: Detects when relative improvement between consecutive points becomes negligible. Stability is found when the relative change stays below a threshold for L consecutive points.

Mathematical definition:

Relative change at point i:

\[\Delta_i = \frac{|y_{i-1} - y_i|}{|y_{i-1}|}\]

Stability when:

\[\Delta_i < \epsilon \text{ for } L \text{ consecutive points}\]

When to use:

Limited computational budget
Early “good enough” detection
Iterative analyses with early stopping

Parameters (via kwargs):

consecutive_points: Number of points below tolerance (default: 3)
relative_tolerance: Maximum relative change (default: 0.01, i.e., 1%)

Method Selection Guide

Method	Best For	Typical Use Case
Kneedle	RMSE, parameters	Production RMSE analysis (RECOMMENDED)
CV	P-values (KS, Chi²)	Conservative p-value stability
Plateau	Early detection, limited budget	Quick exploration, iterative tuning

Recommended workflow:

# Use Kneedle for RMSE (most reliable)
results = adjuster.monte_carlo_fit(
    tests=['rmse'],
    stability_method='kneedle',
    smooth=True
)

recommended_n = results.attrs['recommended_size']
print(f"RMSE stable at n = {recommended_n}")

Sampling strategies

The helper _generate_subsample_indices() produces integer index lists that are used by the Monte Carlo routines to build subsamples from the original dataset. Below is a concise, practical explanation of the three supported strategies and how to choose between them.

random (without replacement) - What: Each subsample contains size unique indices drawn randomly from the

original dataset (no duplicates inside the same subsample).
- Constraint: size must be less than or equal to the original sample size N.
- When to use: Typical default for studying how estimates change with smaller sample sizes. Keeps each subsample representative and avoids internal duplication.
bootstrap (with replacement) - What: Each subsample is drawn with replacement, so the same original row

can appear multiple times in a single subsample.
- Constraint: None — size may be larger than N because indices can repeat.
- When to use: Use when you need to estimate uncertainty (variance, CIs) via resampling, or when you want to allow size >= N for simulation purposes.
disjoint (non-overlapping partitions) - What: The original indices are shuffled and partitioned into non-overlapping

blocks of length size. Each block is a subsample with no shared indices.
- Constraint: size must be <= N. The number of blocks per shuffle is N // size; for n_repeats larger than that, additional shuffles are used.
- When to use: Use when you want independent partitions (similar to simple cross-validation) and want to avoid overlap between subsamples within the same shuffle.

Reproducibility and seed

A random seed controls the pseudo-random generator used to create indices. Providing the same seed reproduces the same sequence of subsamples. This is recommended for experiments where results must be repeatable.

Practical guidance

Use random as a safe default when size <= N and you want unbiased subsamples without duplicates.
Use bootstrap when you need to estimate uncertainty from resampling or when you want to allow size >= N.
Use disjoint when you need non-overlapping partitions to compare independent fits or to maximize coverage of the original dataset without duplication.

See _generate_subsample_indices() source for exact behaviour and edge-case handling.

Outputs

An xarray.Dataset with dimensions: - sizes: tested sample sizes - repeats: repetition index
Data variables: param_0, param_1, …, ks_statistic, ks_pvalue, chi2_statistic, chi2_pvalue, rmse (depending on tests).
Attributes: - stability_points: dict of detected sizes per variable with method info - recommended_size: sample size where primary metric stabilizes - primary_metric: the metric used for recommended_size (typically ‘rmse’) - stable_pvalue_ks, stable_pvalue_chi2, stable_rmse: metric values at stable point - figure_path: path to saved figure (if generated) - Metadata about sampling, bins, and configuration

Quick example

import numpy as np
from magica.core import MagicAdjuster

data = np.random.weibull(2, 1000)
adjuster = MagicAdjuster(data)
adjuster.fit_distribution('weibull_min')

# Use Kneedle method for RMSE stability detection
ds = adjuster.monte_carlo_fit(
    sizes=[50, 100, 200, 400, 600, 800],
    n_repeats=50,
    tests=['ks', 'chi2', 'rmse'],
    stability_method='kneedle',  # Recommended for RMSE
    smooth=True,                 # Smooth curves before detection
    sampling='random',
    fit_kwargs={'floc': 0},
    fig_output_path='mc_summary.png',
    plot_type='series'
)

# Access recommended size
recommended_n = ds.attrs['recommended_size']
print(f"Recommended sample size: {recommended_n}")

# Check all stability points
print(ds.attrs['stability_points'])

Interpretation tips

Look at medians/boxplots of parameters across sizes — convergence indicates stable estimation.
Check ks_pvalue / chi2_pvalue behavior: rising p-values toward larger sizes suggest better fit at those sizes, but beware of the large-sample-size effect where p-values can become unreliable.
RMSE is the most reliable indicator for stability: it shows smooth, monotonic decrease and clear convergence. Always include RMSE in your tests.
Use recommended_size (based on primary metric, typically RMSE) as the recommended minimum sample size for robust parameter estimation.
The generated figure shows: - Parameter convergence across sample sizes - Goodness-of-fit evolution (KS, Chi-square p-values, RMSE) - Red vertical lines marking detected stability points - Orange dashed lines showing smoothed curves (for Kneedle method) - Info boxes with metric values at stable points
Compare stability points across different methods to understand metric behavior.

Method Comparison

When analyzing your results, compare how different metrics stabilize:

stability = ds.attrs['stability_points']

# RMSE typically stabilizes earliest and most reliably
rmse_stable = stability['rmse']['size']
rmse_method = stability['rmse']['method']

# P-values may stabilize later or show erratic behavior
ks_stable = stability['ks_pvalue']['size']

print(f"RMSE stable at n = {rmse_stable} (method: {rmse_method})")
print(f"KS p-value stable at n = {ks_stable}")

# Use RMSE-based recommendation for production
recommended = ds.attrs['recommended_size']
print(f"\n⭐ Recommended sample size: {recommended}")

Which Tutorial Should I Use?

Choose MagicAdjuster Tutorial if:

✓ You know which distribution to fit (e.g., Weibull for wind speed)
✓ You want to perform detailed stability analysis
✓ You need to understand parameter uncertainty
✓ You’re working with large datasets (>10,000 samples)
✓ You want to learn about the large sample size effect

Choose AutoFitter Tutorial if:

✓ You’re unsure which distribution fits your data best
✓ You want to compare multiple distributions quickly
✓ You need a comprehensive distribution search
✓ You’re exploring a new dataset
✓ You want to validate your distribution choice

Choose Extreme Value Analysis Tutorial if:

✓ You need to calculate return periods and design values
✓ You’re working with extreme events (storms, floods, heat waves)
✓ You need 10-year, 50-year, or 100-year design values
✓ You want to compare Block Maxima (GEV) vs. POT (GPD) approaches
✓ You’re doing wind/structural engineering or climate risk assessment

All tutorials emphasize:

⭐ RMSE as the most reliable metric for fit quality
📊 When p-values are trustworthy vs. misleading
🎯 Best practices for real-world data analysis

Legacy Examples

Weibull Fit Example

Note

The Jupyter notebook MagicA_Weibull_Fit_Example.ipynb can be found in this directory. You can download and run it locally to follow along with the examples.

Note: This is a legacy example. We recommend using the updated MagicAdjuster tutorial instead.

Magic Adjuster Example

Note

The Python script example_magic_adjuster.py contains example usage of the MagicAdjuster class. You can find it in this directory and run it as a standalone script.

Note: This is a legacy example. We recommend using the updated tutorials instead.

Downloadable Files

Download the following files to run locally:

Main Tutorials:

magic_adjuster_tutorial.ipynb - Complete MagicAdjuster tutorial with Monte Carlo CPS methods
auto_fitter_tutorial.ipynb - Comprehensive AutoFitter tutorial with RMSE emphasis
extremes_tutorial.ipynb - Extreme value analysis with GEV and GPD distributions

Legacy Materials:

MagicA_Weibull_Fit_Example.ipynb - Legacy Weibull fitting examples
example_magic_adjuster.py - Python script with MagicAdjuster examples
api_usage.md - API usage guide in Markdown format

Best Practices Summary

Based on the tutorials, here are the key best practices:

Distribution Selection:

Use RMSE as primary criterion for real-world data
Start with AutoFitter’s default list for quick screening
Use comprehensive search when you need the absolute best fit
Create domain-specific lists for faster, targeted analysis

Monte Carlo Analysis:

Always include ‘rmse’ in tests - most reliable stability indicator
Use 100+ repeats for robust stability detection
Generate summary figures with fig_output_path for visual inspection
Check RMSE stability first, then validate with other metrics

Large Sample Size Effect:

Avoid p-value-only selection with large datasets (>10,000 samples)
Use RMSE for datasets of any size - it’s always reliable
Understand that p-values can reject even excellent fits with large N
See MagicAdjuster tutorial for detailed examples

Next Steps

After completing the tutorials:

Apply MagicA to your own data
Explore the Core Module, AutoFitter - Automatic Distribution Selection, and Extreme Values Analysis documentation
Read about Monte Carlo Stability Analysis for advanced stability analysis
Check Contributing to contribute to the project

Tutorials

Interactive Jupyter Notebooks

MagicAdjuster Tutorial

AutoFitter Tutorial

Extreme Value Analysis Tutorial

API Usage Tutorial

MagicA API Usage

Basic Example: Weibull Fit and Goodness-of-Fit Evaluation

Monte Carlo Stability Analysis

Advanced Monte Carlo Usage

API Reference

read_data(data)

DataProcessor.fit_distribution(distribution)

DataProcessor.get_fitted_params()

DataProcessor.goodness_of_fit(method, bins='doane')

MagicAdjuster.monte_carlo_fit(sizes=None, n_repeats=20, tests=['ks'], fig_output_path=None, plot_type='series', sampling='random', distribution_params=None, **kwargs)

DataProcessor.get_basic_stats()

DataProcessor.get_distribution_info()

Working with xarray Results

Goodness-of-Fit Methods

Monte Carlo Stability Tutorial

Monte Carlo Stability Tutorial

Why Monte Carlo stability?

Methodology (what we do)

Inputs and options

Stability Detection Methods

CV Method (Coefficient of Variation)

Kneedle Method (Elbow Detection)

Plateau Method (Relative Gain Heuristic)

Method Selection Guide

Sampling strategies

Reproducibility and seed

Practical guidance

Outputs

Quick example

Interpretation tips

Method Comparison

Further reading

Which Tutorial Should I Use?

Legacy Examples

Weibull Fit Example

Magic Adjuster Example

Downloadable Files

Best Practices Summary

Next Steps

`read_data(data)`

`DataProcessor.fit_distribution(distribution)`

`DataProcessor.get_fitted_params()`

`DataProcessor.goodness_of_fit(method, bins='doane')`

`MagicAdjuster.monte_carlo_fit(sizes=None, n_repeats=20, tests=['ks'], fig_output_path=None, plot_type='series', sampling='random', distribution_params=None, **kwargs)`

`DataProcessor.get_basic_stats()`

`DataProcessor.get_distribution_info()`