Monte Carlo Stability Tutorial
This tutorial explains why Monte Carlo stability analysis is useful and how the monte_carlo_fit routine in MagicA is designed and used.
Why Monte Carlo stability?
Large-sample-size effects occur when very large datasets make statistical tests extremely powerful: p-values become very small even for effects that are practically negligible. In other words, with enough data a test can reject the null hypothesis for differences that have no practical importance. The Monte Carlo stability workflow therefore looks for a sampling size where the chosen goodness-of-fit tests start to “pass” in a practical sense and parameter estimates stop changing much across repeats — that sample size is reported as the stability point.
Methodology (what we do)
Build a grid of sample sizes to test (or use a provided list).
For each size, generate multiple subsamples (repeats) from the original data.
For each subsample, fit the distribution (unless fixed parameters are supplied).
Compute goodness-of-fit metrics (e.g., KS p-value, chi-square p-value, RMSE).
Store all results in an xarray Dataset with dimensions sizes x repeats.
Detect stability points using a moving-window criterion on variability (e.g., CV).
Optionally save a 2x3 summary figure with red dashed vertical lines marking each variable’s detected stability sample size.
Inputs and options
sizes: list or auto-generated grid of sample sizes.
n_repeats: number of subsamples per size.
tests: list of metrics to compute: [‘ks’,’chi2’,’rmse’].
stability_method: method for detecting stability points (‘kneedle’ [default], ‘cv’, ‘plateau’).
sampling: sampling strategy (random, bootstrap, disjoint).
fit_kwargs: passed to the underlying fit routine (e.g., floc=0).
distribution_params: optional fixed parameters (skip fitting when supplied).
fig_output_path: optional path to save the 2x3 summary figure.
plot_type: ‘series’ (median + IQR) or ‘boxplots’.
Stability Detection Methods
MagicA provides three methods for detecting when parameters and goodness-of-fit metrics have stabilized. Each method has different strengths depending on the metric behavior.
CV Method (Coefficient of Variation)
Principle: Monitors the coefficient of variation (CV = std/mean) across repeats. Stability is detected when CV remains below a threshold for a consecutive window of sample sizes.
Mathematical definition:
Where \(\sigma_n\) is the standard deviation and \(\mu_n\) is the mean across repeats at sample size n.
When to use:
Erratic p-values (KS, Chi-square tests)
When you need quantitative variability assessment
Conservative estimates preferred
Parameters (via kwargs):
window_size: Number of consecutive sizes for validation (default: 25% of n_sizes)
cv_threshold: Maximum allowed CV (default: 0.1)
Kneedle Method (Elbow Detection)
Principle: Based on the “Kneedle” algorithm (Satopää et al., 2011). Finds the “elbow” or “knee” point in a curve by:
Normalizing the curve to [0, 1]
Computing a reference line from first to last point
Finding the point with maximum perpendicular distance from the reference line
Mathematical definition:
The knee point is where:
Where \(y_i\) is the normalized curve value at point i.
When to use:
RMSE and converging parameters (RECOMMENDED)
Smooth, monotonic convergence patterns
Finding “point of diminishing returns”
Production analyses requiring objective detection
Parameters (via kwargs):
smooth: Whether to smooth curves before detection (default: True)
smoothing_method: ‘savgol’ (preserves features) or ‘spline’ (default: ‘savgol’)
Reference: Satopää, V., Albrecht, J., Irwin, D., & Raghavan, B. (2011). Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior. 2011 31st International Conference on Distributed Computing Systems Workshops, 166-171.
Plateau Method (Relative Gain Heuristic)
Principle: Detects when relative improvement between consecutive points becomes negligible. Stability is found when the relative change stays below a threshold for L consecutive points.
Mathematical definition:
Relative change at point i:
Stability when:
When to use:
Limited computational budget
Early “good enough” detection
Iterative analyses with early stopping
Parameters (via kwargs):
consecutive_points: Number of points below tolerance (default: 3)
relative_tolerance: Maximum relative change (default: 0.01, i.e., 1%)
Method Selection Guide
Method |
Best For |
Typical Use Case |
|---|---|---|
Kneedle |
RMSE, parameters |
Production RMSE analysis (RECOMMENDED) |
CV |
P-values (KS, Chi²) |
Conservative p-value stability |
Plateau |
Early detection, limited budget |
Quick exploration, iterative tuning |
Recommended workflow:
# Use Kneedle for RMSE (most reliable)
results = adjuster.monte_carlo_fit(
tests=['rmse'],
stability_method='kneedle',
smooth=True
)
recommended_n = results.attrs['recommended_size']
print(f"RMSE stable at n = {recommended_n}")
Sampling strategies
The helper _generate_subsample_indices() produces integer index lists that are used by the Monte Carlo routines to build subsamples from the original dataset. Below is a concise, practical explanation of the three supported strategies and how to choose between them.
random (without replacement) - What: Each subsample contains size unique indices drawn randomly from the
original dataset (no duplicates inside the same subsample).
Constraint: size must be less than or equal to the original sample size N.
When to use: Typical default for studying how estimates change with smaller sample sizes. Keeps each subsample representative and avoids internal duplication.
bootstrap (with replacement) - What: Each subsample is drawn with replacement, so the same original row
can appear multiple times in a single subsample.
Constraint: None — size may be larger than N because indices can repeat.
When to use: Use when you need to estimate uncertainty (variance, CIs) via resampling, or when you want to allow size >= N for simulation purposes.
disjoint (non-overlapping partitions) - What: The original indices are shuffled and partitioned into non-overlapping
blocks of length size. Each block is a subsample with no shared indices.
Constraint: size must be <= N. The number of blocks per shuffle is N // size; for n_repeats larger than that, additional shuffles are used.
When to use: Use when you want independent partitions (similar to simple cross-validation) and want to avoid overlap between subsamples within the same shuffle.
Reproducibility and seed
A random seed controls the pseudo-random generator used to create indices. Providing the same seed reproduces the same sequence of subsamples. This is recommended for experiments where results must be repeatable.
Practical guidance
Use random as a safe default when size <= N and you want unbiased subsamples without duplicates.
Use bootstrap when you need to estimate uncertainty from resampling or when you want to allow size >= N.
Use disjoint when you need non-overlapping partitions to compare independent fits or to maximize coverage of the original dataset without duplication.
See _generate_subsample_indices() source for exact behaviour and edge-case handling.
Outputs
An xarray.Dataset with dimensions: - sizes: tested sample sizes - repeats: repetition index
Data variables: param_0, param_1, …, ks_statistic, ks_pvalue, chi2_statistic, chi2_pvalue, rmse (depending on tests).
Attributes: - stability_points: dict of detected sizes per variable with method info - recommended_size: sample size where primary metric stabilizes - primary_metric: the metric used for recommended_size (typically ‘rmse’) - stable_pvalue_ks, stable_pvalue_chi2, stable_rmse: metric values at stable point - figure_path: path to saved figure (if generated) - Metadata about sampling, bins, and configuration
Quick example
import numpy as np
from magica.core import MagicAdjuster
data = np.random.weibull(2, 1000)
adjuster = MagicAdjuster(data)
adjuster.fit_distribution('weibull_min')
# Use Kneedle method for RMSE stability detection
ds = adjuster.monte_carlo_fit(
sizes=[50, 100, 200, 400, 600, 800],
n_repeats=50,
tests=['ks', 'chi2', 'rmse'],
stability_method='kneedle', # Recommended for RMSE
smooth=True, # Smooth curves before detection
sampling='random',
fit_kwargs={'floc': 0},
fig_output_path='mc_summary.png',
plot_type='series'
)
# Access recommended size
recommended_n = ds.attrs['recommended_size']
print(f"Recommended sample size: {recommended_n}")
# Check all stability points
print(ds.attrs['stability_points'])
Interpretation tips
Look at medians/boxplots of parameters across sizes — convergence indicates stable estimation.
Check ks_pvalue / chi2_pvalue behavior: rising p-values toward larger sizes suggest better fit at those sizes, but beware of the large-sample-size effect where p-values can become unreliable.
RMSE is the most reliable indicator for stability: it shows smooth, monotonic decrease and clear convergence. Always include RMSE in your tests.
Use recommended_size (based on primary metric, typically RMSE) as the recommended minimum sample size for robust parameter estimation.
The generated figure shows: - Parameter convergence across sample sizes - Goodness-of-fit evolution (KS, Chi-square p-values, RMSE) - Red vertical lines marking detected stability points - Orange dashed lines showing smoothed curves (for Kneedle method) - Info boxes with metric values at stable points
Compare stability points across different methods to understand metric behavior.
Method Comparison
When analyzing your results, compare how different metrics stabilize:
stability = ds.attrs['stability_points']
# RMSE typically stabilizes earliest and most reliably
rmse_stable = stability['rmse']['size']
rmse_method = stability['rmse']['method']
# P-values may stabilize later or show erratic behavior
ks_stable = stability['ks_pvalue']['size']
print(f"RMSE stable at n = {rmse_stable} (method: {rmse_method})")
print(f"KS p-value stable at n = {ks_stable}")
# Use RMSE-based recommendation for production
recommended = ds.attrs['recommended_size']
print(f"\n⭐ Recommended sample size: {recommended}")
Further reading
See Core Module for function signature and options, and the example notebook in this folder for a hands-on run.