Monte Carlo Stability Analysis

The Monte Carlo stability analysis is a powerful method to determine the minimum sample size needed for reliable parameter estimation and to assess how statistical tests behave with different sample sizes.

Overview

The monte_carlo_fit() method performs repeated subsampling and distribution fitting to:

Identify the minimum sample size where parameters stabilize
Track how goodness-of-fit tests evolve with sample size
Detect the “large sample size effect” where p-values become unreliable
Provide empirical distributions for robust inference

Method Signature

def monte_carlo_fit(
    self,
    sizes: Optional[List[int]] = None,
    n_repeats: int = 20,
    tests: List[str] = ['ks'],
    stability_method: str = 'aggregate',
    fig_output_path: Optional[str] = None,
    plot_type: str = 'series',
    sampling: str = 'random',
    seed: Optional[int] = None,
    min_size: int = 50,
    max_size: Optional[int] = None,
    n_sizes: int = 10,
    distribution_params: Optional[Tuple] = None,
    **kwargs
) -> xr.Dataset

Parameters

Sample Size Configuration

sizesList[int], optional

Explicit list of sample sizes to test. If not provided, sizes are automatically generated based on min_size, max_size, and n_sizes.

Example: [100, 200, 500, 1000, 2000]

min_sizeint, default=50

Minimum sample size to test (used only if sizes is not provided).

max_sizeint, optional

Maximum sample size to test. Defaults to the size of the original dataset.

n_sizesint, default=10

Number of different sample sizes to test (used only if sizes is not provided).

Monte Carlo Configuration

n_repeatsint, default=20

Number of independent subsamples to draw for each sample size. Higher values provide better statistical estimates but increase computation time.

For quick exploration: 10-20 repeats
For production analysis: 30-50 repeats
For publication-quality results: 50-100 repeats

samplingstr, default=’random’

Sampling strategy for creating subsamples. Options:

'random': Independent random draws without replacement (most common)
'bootstrap': Random draws with replacement (useful for uncertainty quantification)
'disjoint': Non-overlapping partitions (best for temporal/spatial data)

seedint, optional

Random seed for reproducibility. Always set this for reproducible analyses.

Goodness-of-Fit Tests

testsList[str], default=[‘ks’]

List of goodness-of-fit tests to perform. Options:

'ks': Kolmogorov-Smirnov test (p-value based)
'chi2': Chi-square test (p-value based)
'rmse': Root Mean Square Error (distance metric)

⭐ Recommendation: Always include 'rmse' for stability detection, as it shows clearer convergence behavior than p-value based tests.

Example: tests=['ks', 'chi2', 'rmse']

Stability Detection

stability_methodstr, default=’kneedle’

Method for detecting when parameters and tests have stabilized. Options:

'kneedle': Kneedle algorithm for elbow detection (default, best for RMSE)
'cv': Coefficient of Variation method (robust for erratic metrics)
'plateau': Relative gain heuristic (early “good enough” detection)
'aggregate': Legacy median-based method (still supported)
None: No stability detection

⭐ Default: ‘kneedle’ - Recommended for RMSE stability detection as it excels at finding the “point of diminishing returns” in smooth, converging metrics.

See Stability Detection Methods for detailed explanations and selection guidance.

Method-Specific Parameters (via kwargs):

For CV method:

window_size: Number of consecutive sizes for validation (default: 25% of n_sizes)

cv_threshold: Maximum allowed coefficient of variation (default: 0.1)

For Kneedle method:

smooth: Whether to smooth curves before detection (default: True)

smoothing_method: ‘savgol’ (default) or ‘spline’

For Plateau method:

consecutive_points: Number of consecutive points below tolerance (default: 3)

relative_tolerance: Maximum relative change threshold (default: 0.01, i.e., 1%)

Visualization

fig_output_pathstr, optional

Path where the summary figure should be saved. If provided, automatically generates a 2×3 grid showing parameter and test evolution.

Example: 'monte_carlo_results.png'

plot_typestr, default=’series’

Style of plots to generate:

'series': Line plots with median and IQR shading
'boxplots': Box plots showing full distribution per size

Advanced Parameters

distribution_paramsTuple, optional: If provided, uses these fixed parameters instead of fitting. Useful for testing how well the method recovers known parameters.
kwargs: Additional keyword arguments passed to the goodness-of-fit tests (e.g., bins='doane' for chi-square test).

Return Value

Returns an xarray.Dataset containing:

Dimensions

sizes: Sample sizes tested (length = n_sizes)
repeats: Repetition index (length = n_repeats)

Data Variables

param_0, param_1, ...: Fitted distribution parameters for each (size, repeat) combination
ks_statistic, ks_pvalue: Kolmogorov-Smirnov test results (if 'ks' in tests)
chi2_statistic, chi2_pvalue: Chi-square test results (if 'chi2' in tests)
rmse: Root mean square error values (if 'rmse' in tests)

Attributes

distribution: Name of the fitted distribution
original_data_size: Size of the original dataset
sampling_method: Sampling strategy used
bins_method: Binning method used for chi-square test
stability_points: Dictionary with detected stability information
recommended_size: Sample size where primary metric stabilizes (None if not detected)
primary_metric: The metric used for recommended_size (typically ‘rmse’)
stable_pvalue_ks: KS test p-value at stability point (if ‘ks’ in tests)
stable_pvalue_chi2: Chi-square p-value at stability point (if ‘chi2’ in tests)
stable_rmse: RMSE value at stability point (if ‘rmse’ in tests)
n_repeats: Number of repetitions performed
min_size: Minimum sample size tested
max_size: Maximum sample size tested
n_sizes: Number of sample sizes tested
figure_path: Path to saved figure (if fig_output_path was provided)
created_by: Identifier for the method

Stability Detection Methods

MagicA provides three advanced methods for detecting when parameters and goodness-of-fit metrics have stabilized. Each method has specific strengths depending on the metric behavior.

CV Method (Coefficient of Variation)

How it works:

For each parameter/test, computes the coefficient of variation (CV = std/mean) across repeats at each sample size
Uses a sliding window (default: 25% of n_sizes) to check for stability
Checks if CV stays below a threshold (default: 0.1) for all sizes in the window
Returns the first size where this condition is met

Mathematical formula:

\[CV_n = \frac{\sigma_n}{\mu_n}\]

Where: - \(\sigma_n\) = standard deviation across repeats at size n - \(\mu_n\) = mean value across repeats at size n

Advantages:

Robust to erratic metrics (especially p-values)
Provides quantitative measure of variability
Good for assessing precision of estimates
Works well with any number of repeats

Disadvantages:

May be conservative (detects stability later than other methods)
Less sensitive to smooth convergence patterns

Parameters:

window_size: Number of consecutive sizes for validation (default: max(2, n_sizes // 4))
cv_threshold: Maximum allowed CV (default: 0.1)

Result format:

stability_points = {
    'rmse': {
        'size': 800,
        'index': 4,
        'cv_at_stability': 0.087,
        'smoothed_curve': None,
        'method': 'cv'
    }
}

When to use: - When working with erratic p-values (KS, Chi-square) - When you need quantitative variability assessment - When conservative estimates are preferred

Kneedle Method (Elbow Detection)

How it works:

Normalizes the curve to [0, 1] range
Computes the reference line connecting first and last points
Calculates perpendicular distance from each point to the reference line
The knee/elbow is at the point with maximum distance
Optional: smooths the curve before detection using Savitzky-Golay filter or spline

Mathematical formula:

\[D_i = |y_i - (y_0 + (y_n - y_0) \cdot \frac{i}{n})|\]

Where: - \(D_i\) = distance from reference line at point i - \(y_i\) = normalized curve value at point i - \(y_0, y_n\) = first and last curve values

Advantages:

Optimal for RMSE: Excels at finding “point of diminishing returns” in smooth curves
Clear mathematical definition of “elbow”
Automatic curve direction detection
Smoothing reduces noise sensitivity
Based on published algorithm (Satopää et al., 2011)

Disadvantages:

Requires relatively smooth convergence
May not work well with highly erratic metrics

Parameters:

smooth: Whether to smooth before detection (default: True, recommended)
smoothing_method: ‘savgol’ (default, preserves features) or ‘spline’

Result format:

stability_points = {
    'rmse': {
        'size': 600,
        'index': 3,
        'cv_at_stability': None,
        'smoothed_curve': array([...]),  # If smoothing was used
        'method': 'kneedle'
    }
}

When to use: - ⭐ RECOMMENDED for RMSE stability detection - When metrics show smooth, monotonic convergence - When you want to find the “sweet spot” of diminishing returns - For production analyses requiring objective elbow detection

Reference:

Satopää, V., Albrecht, J., Irwin, D., & Raghavan, B. (2011). Finding a “Kneedle” in a Haystack: Detecting Knee Points in System Behavior. 2011 31st International Conference on Distributed Computing Systems Workshops, 166-171.

Plateau Method (Relative Gain Heuristic)

How it works:

Computes relative change between consecutive points:

\[\begin{split}\\Delta_i = \\frac{|y_{i-1} - y_i|}{|y_{i-1}|}\end{split}\]
Checks if relative change stays below tolerance for L consecutive points
Returns the first point where this “plateau” condition is met

Mathematical formula:

Stability detected when:

\[\Delta_i < \epsilon \text{ for } L \text{ consecutive points}\]

Where: - \(\Delta_i\) = relative change at point i - \(\epsilon\) = relative tolerance (default: 0.01, i.e., 1%) - \(L\) = consecutive points threshold (default: 3)

Advantages:

Early detection of “good enough” stabilization
Intuitive interpretation (% improvement)
Fast - no smoothing required
Good for early stopping in iterative analyses

Disadvantages:

May detect premature plateaus
Sensitive to parameter tuning
Less robust to noise than other methods

Parameters:

consecutive_points: Number of consecutive points below tolerance (default: 3)
relative_tolerance: Maximum relative change threshold (default: 0.01)

Result format:

stability_points = {
    'rmse': {
        'size': 500,
        'index': 2,
        'cv_at_stability': None,
        'smoothed_curve': None,
        'method': 'plateau'
    }
}

When to use: - When computational budget is limited - For early “good enough” detection - When slight improvements beyond plateau are acceptable - In iterative analyses where early stopping is beneficial

Method Selection Guide

Choose the appropriate method based on your metric and goals:

Method	Best For	Strengths	Typical Use Case
Kneedle	RMSE, converging parameters	Objective elbow detection, smooth curves	Production RMSE analysis (RECOMMENDED)
CV	P-values (KS, Chi-square)	Robust to noise, quantifies variability	P-value stability, conservative estimates
Plateau	Early detection, limited budget	Fast, intuitive, early stopping	Quick exploration, iterative tuning
Aggregate	Legacy analyses	Simple, robust	Backward compatibility

Recommended Workflow:

# 1. Use Kneedle for RMSE (most reliable)
results_rmse = processor.monte_carlo_fit(
    tests=['rmse'],
    stability_method='kneedle',
    smooth=True
)

# 2. Use CV for p-values (robust to noise)
results_pvalues = processor.monte_carlo_fit(
    tests=['ks', 'chi2'],
    stability_method='cv',
    cv_threshold=0.15  # Can relax for p-values
)

# 3. Compare stability points
rmse_stable = results_rmse.attrs['recommended_size']
print(f"RMSE stable at n = {rmse_stable} (MOST RELIABLE)")

Interpreting Stability Points

Each stability point contains:

size: The sample size where stability is first detected (None if not detected)
index: The index in the sizes list (None if not detected)
cv_at_stability: Coefficient of variation at stability (only for 'cv' method)
smoothed_curve: Smoothed curve data (only for 'kneedle' method with smoothing)
method: Name of the detection method used

Accessing stability information:

# Get recommended size (based on primary metric, usually RMSE)
recommended_n = results.attrs['recommended_size']
primary_metric = results.attrs['primary_metric']

if recommended_n:
    print(f"✓ {primary_metric.upper()} stable at n = {recommended_n}")

    # Get quality metrics at stable point
    if 'stable_rmse' in results.attrs:
        print(f"  RMSE at stable point: {results.attrs['stable_rmse']:.4f}")
    if 'stable_pvalue_ks' in results.attrs:
        print(f"  KS p-value at stable point: {results.attrs['stable_pvalue_ks']:.4f}")
else:
    print("✗ No stability detected - try larger sample sizes")

If stability is not detected:

The tested sample sizes are too small - increase max_size
The metric has high variability - increase n_repeats
The threshold is too strict - relax method-specific parameters
The data distribution may not be appropriate

Recommended Practices

Choosing Sample Sizes

# For small to medium datasets (< 5000 samples)
sizes = [50, 100, 200, 400, 600, 800]

# For large datasets (> 10000 samples) - CPS method
sizes = [100, 200, 400, 600, 800, 1000, 1500, 2000, 3000, 4000]

Using RMSE for Stability Detection

⭐ Always include RMSE in your tests for stability detection:

results = adjuster.monte_carlo_fit(
    sizes=[100, 200, 500, 1000, 2000],
    n_repeats=50,
    tests=['ks', 'chi2', 'rmse'],  # RMSE is crucial!
    sampling='random',
    seed=42
)

Why RMSE is recommended:

Monotonic decrease: RMSE decreases smoothly as sample size increases
Clear convergence: Stabilization point is visually obvious
Robust: Less affected by sampling variability than p-values
Direct measure: Measures actual fit quality, not hypothesis test power
No inflation: Unlike p-values, doesn’t become artificially small with large samples

P-values (KS, Chi-square) can be erratic and are subject to the “large sample size effect” where they become very small even for good fits when the sample is large.

Accessing Stability Information

# Get recommended size (based on primary metric, typically RMSE)
recommended_n = results.attrs['recommended_size']
primary_metric = results.attrs['primary_metric']

if recommended_n is not None:
    print(f"✓ {primary_metric.upper()} stabilizes at n = {recommended_n}")
    print(f"  This is the recommended minimum sample size")

    # Get quality metrics at the stable point
    print(f"\nMetrics at stable point (n = {recommended_n}):")
    if 'stable_rmse' in results.attrs:
        print(f"  RMSE: {results.attrs['stable_rmse']:.4f}")
    if 'stable_pvalue_ks' in results.attrs:
        print(f"  KS p-value: {results.attrs['stable_pvalue_ks']:.4f}")
    if 'stable_pvalue_chi2' in results.attrs:
        print(f"  Chi-square p-value: {results.attrs['stable_pvalue_chi2']:.4f}")
else:
    print("✗ No clear stability detected - try larger sample sizes or more repeats")

# Access individual stability points for all metrics
stability = results.attrs['stability_points']
for metric, info in stability.items():
    size = info['size']
    method = info['method']
    if size:
        print(f"{metric}: stable at n = {size} (method: {method})")
        if info['cv_at_stability'] is not None:
            print(f"  CV at stability: {info['cv_at_stability']:.3f}")

Complete Example

import numpy as np
import magica as ma

# Load data
data = np.random.weibull(2, 10000) * 8 + 2
processor = ma.read_data(data)
processor.fit_distribution('weibull_min')

# Run comprehensive Monte Carlo analysis with Kneedle method
results = processor.monte_carlo_fit(
    sizes=[100, 200, 400, 600, 800, 1000, 1500, 2000, 3000, 4000],
    n_repeats=50,
    tests=['ks', 'chi2', 'rmse'],  # Include RMSE!
    stability_method='kneedle',     # Use Kneedle for RMSE
    smooth=True,                    # Smooth curves before detection
    sampling='random',
    seed=42,
    fig_output_path='monte_carlo_analysis.png'
)

# Check recommended size (based on RMSE with Kneedle)
recommended_n = results.attrs['recommended_size']
primary_metric = results.attrs['primary_metric']

print(f"⭐ {primary_metric.upper()} stabilizes at n = {recommended_n} (RECOMMENDED)")
print(f"   RMSE at stable point: {results.attrs['stable_rmse']:.4f}")
print(f"   KS p-value at stable point: {results.attrs['stable_pvalue_ks']:.4f}")

# Check all stability points
stability = results.attrs['stability_points']
print("\nAll stability points:")
for metric, info in stability.items():
    size = info['size']
    method = info['method']
    if size:
        print(f"   {metric}: n = {size} (method: {method})")
    else:
        print(f"   {metric}: no clear stability detected")

# Use the recommended size for robust inference
print(f"\nRecommended subsample size for production: {recommended_n}")

Sampling Strategies

Random Sampling (default)

Each subsample is drawn independently without replacement from the original data.

When to use:

General purpose analysis
Independent observations
Standard stability analysis

Example:

results = processor.monte_carlo_fit(
    sampling='random',
    seed=42
)

Bootstrap Sampling

Each subsample is drawn with replacement, allowing the same observation to appear multiple times.

When to use:

Uncertainty quantification
Confidence interval estimation
When subsample size > original data size

Example:

results = processor.monte_carlo_fit(
    sampling='bootstrap',
    seed=42
)

Disjoint Sampling

Data is divided into non-overlapping partitions. Each observation is used exactly once per sample size.

When to use:

Time series data
Spatially correlated data
When you want lower variance between repeats
More representative of data structure

Example:

results = processor.monte_carlo_fit(
    sampling='disjoint',
    seed=42
)

Large Sample Size Effect

For very large datasets (>10,000 samples), statistical tests become extremely powerful, causing:

P-values become very small even for negligible deviations
Difficulty distinguishing statistical from practical significance
Over-rejection of perfectly adequate distributions

Solution: Use the Monte Carlo CPS (Coefficient/P-value/Sample size) method to find the optimal subsample size where:

Parameters have converged (stable estimates)
P-values haven’t yet inflated (interpretable tests)
RMSE has stabilized (good fit quality)

See the MagicAdjuster Tutorial for a complete demonstration.

Monte Carlo Stability Analysis

Overview

Method Signature

Parameters

Sample Size Configuration

Monte Carlo Configuration

Goodness-of-Fit Tests

Stability Detection

Visualization

Advanced Parameters

Return Value

Stability Detection Methods

CV Method (Coefficient of Variation)

Kneedle Method (Elbow Detection)

Plateau Method (Relative Gain Heuristic)

Method Selection Guide

Interpreting Stability Points

Recommended Practices

Choosing Sample Sizes

Using RMSE for Stability Detection

Accessing Stability Information

Complete Example

Sampling Strategies

Random Sampling (default)

Bootstrap Sampling

Disjoint Sampling

Large Sample Size Effect

See Also