Monte Carlo Stability Tutorial

This tutorial explains why Monte Carlo stability analysis is useful and how the monte_carlo_fit routine in MagicA is designed and used.

Why Monte Carlo stability?

Large-sample-size effects occur when very large datasets make statistical tests extremely powerful: p-values become very small even for effects that are practically negligible. In other words, with enough data a test can reject the null hypothesis for differences that have no practical importance. The Monte Carlo stability workflow therefore looks for a sampling size where the chosen goodness-of-fit tests start to “pass” in a practical sense and parameter estimates stop changing much across repeats — that sample size is reported as the stability point.

Methodology (what we do)

  1. Build a grid of sample sizes to test (or use a provided list).

  2. For each size, generate multiple subsamples (repeats) from the original data.

  3. For each subsample, fit the distribution (unless fixed parameters are supplied).

  4. Compute goodness-of-fit metrics (e.g., KS p-value, chi-square p-value, RMSE).

  5. Store all results in an xarray Dataset with dimensions sizes x repeats.

  6. Detect stability points using a moving-window criterion on variability (e.g., CV).

  7. Optionally save a 2x3 summary figure with red dashed vertical lines marking each variable’s detected stability sample size.

Inputs and options

  • sizes: list or auto-generated grid of sample sizes.

  • n_repeats: number of subsamples per size.

  • tests: list of metrics to compute: [‘ks’,’chi2’,’rmse’].

  • stability_method: method for detecting stability points (‘kneedle’ [default], ‘cv’, ‘plateau’).

  • sampling: sampling strategy (random, bootstrap, disjoint).

  • fit_kwargs: passed to the underlying fit routine (e.g., floc=0).

  • distribution_params: optional fixed parameters (skip fitting when supplied).

  • fig_output_path: optional path to save the 2x3 summary figure.

  • plot_type: ‘series’ (median + IQR) or ‘boxplots’.

Stability Detection Methods

MagicA provides three methods for detecting when parameters and goodness-of-fit metrics have stabilized. Each method has different strengths depending on the metric behavior.

CV Method (Coefficient of Variation)

Principle: Monitors the coefficient of variation (CV = std/mean) across repeats. Stability is detected when CV remains below a threshold for a consecutive window of sample sizes.

Mathematical definition:

\[CV_n = \frac{\sigma_n}{\mu_n}\]

Where \(\sigma_n\) is the standard deviation and \(\mu_n\) is the mean across repeats at sample size n.

When to use:

  • Erratic p-values (KS, Chi-square tests)

  • When you need quantitative variability assessment

  • Conservative estimates preferred

Parameters (via kwargs):

  • window_size: Number of consecutive sizes for validation (default: 25% of n_sizes)

  • cv_threshold: Maximum allowed CV (default: 0.1)

Kneedle Method (Elbow Detection)

Principle: Based on the “Kneedle” algorithm (Satopää et al., 2011). Finds the “elbow” or “knee” point in a curve by:

  1. Normalizing the curve to [0, 1]

  2. Computing a reference line from first to last point

  3. Finding the point with maximum perpendicular distance from the reference line

Mathematical definition:

The knee point is where:

\[\text{knee} = \arg\max_i |y_i - (y_0 + (y_n - y_0) \cdot \frac{i}{n})|\]

Where \(y_i\) is the normalized curve value at point i.

When to use:

  • RMSE and converging parameters (RECOMMENDED)

  • Smooth, monotonic convergence patterns

  • Finding “point of diminishing returns”

  • Production analyses requiring objective detection

Parameters (via kwargs):

  • smooth: Whether to smooth curves before detection (default: True)

  • smoothing_method: ‘savgol’ (preserves features) or ‘spline’ (default: ‘savgol’)

Reference: Satopää, V., Albrecht, J., Irwin, D., & Raghavan, B. (2011). Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior. 2011 31st International Conference on Distributed Computing Systems Workshops, 166-171.

Plateau Method (Relative Gain Heuristic)

Principle: Detects when relative improvement between consecutive points becomes negligible. Stability is found when the relative change stays below a threshold for L consecutive points.

Mathematical definition:

Relative change at point i:

\[\Delta_i = \frac{|y_{i-1} - y_i|}{|y_{i-1}|}\]

Stability when:

\[\Delta_i < \epsilon \text{ for } L \text{ consecutive points}\]

When to use:

  • Limited computational budget

  • Early “good enough” detection

  • Iterative analyses with early stopping

Parameters (via kwargs):

  • consecutive_points: Number of points below tolerance (default: 3)

  • relative_tolerance: Maximum relative change (default: 0.01, i.e., 1%)

Method Selection Guide

Method

Best For

Typical Use Case

Kneedle

RMSE, parameters

Production RMSE analysis (RECOMMENDED)

CV

P-values (KS, Chi²)

Conservative p-value stability

Plateau

Early detection, limited budget

Quick exploration, iterative tuning

Recommended workflow:

# Use Kneedle for RMSE (most reliable)
results = adjuster.monte_carlo_fit(
    tests=['rmse'],
    stability_method='kneedle',
    smooth=True
)

recommended_n = results.attrs['recommended_size']
print(f"RMSE stable at n = {recommended_n}")

Sampling strategies

The helper _generate_subsample_indices() produces integer index lists that are used by the Monte Carlo routines to build subsamples from the original dataset. Below is a concise, practical explanation of the three supported strategies and how to choose between them.

  • random (without replacement) - What: Each subsample contains size unique indices drawn randomly from the

    original dataset (no duplicates inside the same subsample).

    • Constraint: size must be less than or equal to the original sample size N.

    • When to use: Typical default for studying how estimates change with smaller sample sizes. Keeps each subsample representative and avoids internal duplication.

  • bootstrap (with replacement) - What: Each subsample is drawn with replacement, so the same original row

    can appear multiple times in a single subsample.

    • Constraint: None — size may be larger than N because indices can repeat.

    • When to use: Use when you need to estimate uncertainty (variance, CIs) via resampling, or when you want to allow size >= N for simulation purposes.

  • disjoint (non-overlapping partitions) - What: The original indices are shuffled and partitioned into non-overlapping

    blocks of length size. Each block is a subsample with no shared indices.

    • Constraint: size must be <= N. The number of blocks per shuffle is N // size; for n_repeats larger than that, additional shuffles are used.

    • When to use: Use when you want independent partitions (similar to simple cross-validation) and want to avoid overlap between subsamples within the same shuffle.

Reproducibility and seed

  • A random seed controls the pseudo-random generator used to create indices. Providing the same seed reproduces the same sequence of subsamples. This is recommended for experiments where results must be repeatable.

Practical guidance

  • Use random as a safe default when size <= N and you want unbiased subsamples without duplicates.

  • Use bootstrap when you need to estimate uncertainty from resampling or when you want to allow size >= N.

  • Use disjoint when you need non-overlapping partitions to compare independent fits or to maximize coverage of the original dataset without duplication.

See _generate_subsample_indices() source for exact behaviour and edge-case handling.

Outputs

  • An xarray.Dataset with dimensions: - sizes: tested sample sizes - repeats: repetition index

  • Data variables: param_0, param_1, …, ks_statistic, ks_pvalue, chi2_statistic, chi2_pvalue, rmse (depending on tests).

  • Attributes: - stability_points: dict of detected sizes per variable with method info - recommended_size: sample size where primary metric stabilizes - primary_metric: the metric used for recommended_size (typically ‘rmse’) - stable_pvalue_ks, stable_pvalue_chi2, stable_rmse: metric values at stable point - figure_path: path to saved figure (if generated) - Metadata about sampling, bins, and configuration

Quick example

import numpy as np
from magica.core import MagicAdjuster

data = np.random.weibull(2, 1000)
adjuster = MagicAdjuster(data)
adjuster.fit_distribution('weibull_min')

# Use Kneedle method for RMSE stability detection
ds = adjuster.monte_carlo_fit(
    sizes=[50, 100, 200, 400, 600, 800],
    n_repeats=50,
    tests=['ks', 'chi2', 'rmse'],
    stability_method='kneedle',  # Recommended for RMSE
    smooth=True,                 # Smooth curves before detection
    sampling='random',
    fit_kwargs={'floc': 0},
    fig_output_path='mc_summary.png',
    plot_type='series'
)

# Access recommended size
recommended_n = ds.attrs['recommended_size']
print(f"Recommended sample size: {recommended_n}")

# Check all stability points
print(ds.attrs['stability_points'])

Interpretation tips

  • Look at medians/boxplots of parameters across sizes — convergence indicates stable estimation.

  • Check ks_pvalue / chi2_pvalue behavior: rising p-values toward larger sizes suggest better fit at those sizes, but beware of the large-sample-size effect where p-values can become unreliable.

  • RMSE is the most reliable indicator for stability: it shows smooth, monotonic decrease and clear convergence. Always include RMSE in your tests.

  • Use recommended_size (based on primary metric, typically RMSE) as the recommended minimum sample size for robust parameter estimation.

  • The generated figure shows: - Parameter convergence across sample sizes - Goodness-of-fit evolution (KS, Chi-square p-values, RMSE) - Red vertical lines marking detected stability points - Orange dashed lines showing smoothed curves (for Kneedle method) - Info boxes with metric values at stable points

  • Compare stability points across different methods to understand metric behavior.

Method Comparison

When analyzing your results, compare how different metrics stabilize:

stability = ds.attrs['stability_points']

# RMSE typically stabilizes earliest and most reliably
rmse_stable = stability['rmse']['size']
rmse_method = stability['rmse']['method']

# P-values may stabilize later or show erratic behavior
ks_stable = stability['ks_pvalue']['size']

print(f"RMSE stable at n = {rmse_stable} (method: {rmse_method})")
print(f"KS p-value stable at n = {ks_stable}")

# Use RMSE-based recommendation for production
recommended = ds.attrs['recommended_size']
print(f"\n⭐ Recommended sample size: {recommended}")

Further reading

See Core Module for function signature and options, and the example notebook in this folder for a hands-on run.