AutoFitter - Automatic Distribution Selection

The AutoFitter class provides automatic distribution fitting and selection capabilities, testing multiple probability distributions and selecting the best one based on specified criteria.

Overview

AutoFitter automatically:

  • Tests multiple probability distributions (default: 16 curated distributions)

  • Can test all 113+ SciPy continuous distributions

  • Selects the best distribution based on RMSE, AIC, BIC, or p-values

  • Uses lazy initialization for memory efficiency

  • Provides comprehensive comparison tables

Important

Best Practice: Always use RMSE as the primary selection criterion. RMSE is robust across all sample sizes and avoids the “large sample size effect” that makes p-values unreliable with large datasets (>10,000 samples).
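This page does not spell out how RMSE is computed. One common definition, assumed here purely for illustration, is the root mean square difference between the empirical CDF and the fitted model's CDF, evaluated at the sorted sample points. The sketch below uses only the Python standard library, with a normal distribution standing in for the fitted model. `cdf_rmse` is a hypothetical helper and is not part of magica.

```python
import math
import random
from statistics import NormalDist

def cdf_rmse(sample, model_cdf):
    """RMSE between the empirical CDF and a model CDF (assumed definition)."""
    xs = sorted(sample)
    n = len(xs)
    # Plotting positions (i + 0.5) / n keep the empirical CDF away from 0 and 1.
    sq_err = [((i + 0.5) / n - model_cdf(x)) ** 2 for i, x in enumerate(xs)]
    return math.sqrt(sum(sq_err) / n)

rng = random.Random(0)
sample = [rng.gauss(10.0, 2.0) for _ in range(2000)]

good = NormalDist(10.0, 2.0)   # matches the generating distribution
bad = NormalDist(12.0, 2.0)    # deliberately shifted mean

rmse_good = cdf_rmse(sample, good.cdf)
rmse_bad = cdf_rmse(sample, bad.cdf)
print(f"RMSE (matching model): {rmse_good:.4f}")
print(f"RMSE (shifted model):  {rmse_bad:.4f}")
```

Under this definition the value is an average distance between two curves bounded by 0 and 1, so it can be compared directly across sample sizes.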

Class Reference

class magica.core.AutoFitter(data_processor: DataProcessor, candidates: List[str] | None = None, criterion: str = 'rmse')

Bases: object

Automatic distribution fitting class with model selection capabilities.

This class automatically tests multiple probability distributions and selects the best-fitting one based on specified criteria (default: RMSE).

Uses lazy initialization pattern - MagicAdjuster instances are created only when needed for each distribution candidate.
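The lazy initialization pattern described above can be pictured with a small stand-alone sketch. It is not magica's implementation: `LazyFitterSketch`, `make_adjuster`, and the placeholder dictionary are hypothetical names used only to show per-candidate objects being built on first request and then cached.

```python
class LazyFitterSketch:
    """Toy stand-in showing lazy, cached creation of per-distribution objects."""

    def __init__(self, candidates, factory):
        self.candidates = list(candidates)
        self._factory = factory      # callable that builds one adjuster-like object
        self._adjusters = {}         # filled on demand, one entry per distribution

    def get_adjuster(self, name):
        if name not in self.candidates:
            raise ValueError(f"unknown candidate: {name!r}")
        # Build the object only on first request, then reuse the cached one.
        if name not in self._adjusters:
            self._adjusters[name] = self._factory(name)
        return self._adjusters[name]

created = []

def make_adjuster(name):
    created.append(name)             # record each construction
    return {"distribution": name}    # placeholder for a real adjuster

fitter = LazyFitterSketch(["norm", "gamma", "rice"], make_adjuster)
first = fitter.get_adjuster("gamma")
second = fitter.get_adjuster("gamma")
print(created)          # only 'gamma' was ever constructed
print(first is second)  # the cached instance is reused
```

With a long candidate list, this keeps memory proportional to the number of distributions actually requested rather than the number configured.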

Parameters:
  • data_processor (DataProcessor) – Processor instance with loaded data

  • candidates (list of str, optional) – List of distribution names to test. If None, uses default set.

  • criterion (str, default 'rmse') – Selection criterion (‘rmse’, ‘aic’, ‘bic’, ‘ks_pvalue’, ‘chi2_pvalue’)

Examples

>>> import magica as ma
>>> import numpy as np
>>>
>>> # Generate synthetic wind-speed-like data (Weibull, shape 2)
>>> data = np.random.weibull(2, 1000) * 8 + 2
>>> processor = ma.read_data(data)
>>>
>>> # Auto-fit best distribution
>>> auto_fitter = processor.get_auto_fitter()
>>> best_result = auto_fitter.fit_best_distribution()
>>>
>>> print(f"Best distribution: {best_result['distribution']}")
>>> print(f"RMSE: {best_result['rmse']:.4f}")

__init__(data_processor: DataProcessor, candidates: List[str] | None = None, criterion: str = 'rmse')

Initialize AutoFitter with data processor and configuration.

Parameters:
  • data_processor (DataProcessor) – The data processor containing the dataset to fit

  • candidates (list of str, optional) – Distribution names to test. Default includes common distributions.

  • criterion (str, default 'rmse') – Selection criterion for best distribution

fit_single_distribution(distribution: str, **fit_kwargs) → Dict[str, Any]

Fit a single distribution and calculate all metrics.

Parameters:
  • distribution (str) – Distribution name to fit

  • **fit_kwargs (dict) – Additional arguments passed to fit method

Returns:

Comprehensive results for this distribution

Return type:

dict

fit_all_distributions(**fit_kwargs) → Dict[str, Dict[str, Any]]

Fit all candidate distributions and return comprehensive results.

Parameters:

**fit_kwargs (dict) – Additional arguments passed to all fit methods

Returns:

Results for all distributions, keyed by distribution name

Return type:

dict

fit_best_distribution(**fit_kwargs) → Dict[str, Any]

Automatically find and fit the best distribution based on criterion.

Parameters:

**fit_kwargs (dict) – Additional arguments passed to fit methods

Returns:

Results for the best-fitting distribution

Return type:

dict

get_comparison_table(sort_by: str | None = None) → Dict[str, Dict[str, Any]]

Get a formatted comparison table of all fitted distributions.

Parameters:

sort_by (str, optional) – Metric to sort by. If None, uses the selection criterion.

Returns:

Sorted results table

Return type:

dict

get_best_adjuster() → MagicAdjuster

Get the MagicAdjuster instance for the best-fitting distribution.

Returns:

The adjuster fitted with the best distribution

Return type:

MagicAdjuster

static get_all_available_distributions()

Get all distribution names available in MagicAdjuster.

Returns:

Sorted list of all available distribution names

Return type:

list

__repr__() → str

String representation of the AutoFitter.

Quick Start

Basic Usage

import numpy as np
import magica as ma

# Generate synthetic Weibull data and load it
data = np.random.weibull(2, 1000) * 8 + 2
processor = ma.read_data(data)

# Create AutoFitter with RMSE criterion (recommended)
auto_fitter = processor.get_auto_fitter(criterion='rmse')

# Find best distribution
best_result = auto_fitter.fit_best_distribution()

print(f"Best distribution: {best_result['distribution']}")
print(f"RMSE: {best_result['rmse']:.6f}")
print(f"Parameters: {best_result['parameters']}")

Testing All Distributions

from magica.core import AutoFitter

# Get all available distributions
all_dists = AutoFitter.get_all_available_distributions()
print(f"Total distributions available: {len(all_dists)}")

# Test all distributions (takes longer)
auto_fitter = processor.get_auto_fitter(
    candidates=all_dists,
    criterion='rmse'
)

best = auto_fitter.fit_best_distribution()
print(f"Best from all 113+: {best['distribution']}")

Custom Distribution List

# Define domain-specific distributions
wind_distributions = [
    'weibull_min',    # Most common for wind
    'weibull_max',
    'rayleigh',       # Theoretical wind model
    'lognorm',
    'gamma',
    'rice'
]

auto_fitter = processor.get_auto_fitter(
    candidates=wind_distributions,
    criterion='rmse'
)

best = auto_fitter.fit_best_distribution()

Working with Results

Comparison Table

Get a comprehensive comparison of all tested distributions:

# Fit all distributions first
auto_fitter.fit_all_distributions()

# Get sorted comparison table
comparison = auto_fitter.get_comparison_table(sort_by='rmse')

# Filter successful fits
successful = {d: r for d, r in comparison.items() if r['success']}

# For synthetic data, optionally filter by p-value
good_fits = {d: r for d, r in successful.items()
             if r['ks_pvalue'] > 0.05}

# Display top 5
for i, (dist, result) in enumerate(list(good_fits.items())[:5], 1):
    print(f"{i}. {dist}: RMSE={result['rmse']:.6f}")

Using the Best Distribution

Once you’ve found the best distribution, you can use it like a regular MagicAdjuster:

# Get the adjuster for best distribution
best_adjuster = auto_fitter.get_best_adjuster()

# Calculate statistics
mean = best_adjuster.stats(moments='m')
p95 = best_adjuster.ppf(0.95)  # 95th percentile

# Perform goodness-of-fit tests
ks_result = best_adjuster.goodness_of_fit('ks')
rmse_result = best_adjuster.goodness_of_fit('rmse')

# Monte Carlo stability analysis
mc_results = best_adjuster.monte_carlo_fit(
    tests=['ks', 'chi2', 'rmse'],
    n_repeats=100,
    fig_output_path='stability.png'
)

Selection Criteria

Available Criteria

The criterion parameter accepts:

  • ‘rmse’ (recommended): Root Mean Square Error - robust for all sample sizes

  • ‘aic’: Akaike Information Criterion - balances fit and complexity

  • ‘bic’: Bayesian Information Criterion - penalizes complexity more heavily than AIC (for any sample of 8 or more observations)

  • ‘ks_pvalue’: Kolmogorov-Smirnov p-value - statistical significance

  • ‘chi2_pvalue’: Chi-square p-value - histogram-based test
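For reference, the two information criteria follow the standard formulas AIC = 2k - 2 ln(L) and BIC = k ln(n) - 2 ln(L), where k is the number of fitted parameters, n the sample size, and L the maximized likelihood. The sketch below evaluates both for a maximum-likelihood normal fit using only the standard library. It illustrates the formulas and is not a statement about how magica computes them internally; the helper names `aic` and `bic` are assumptions.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev

def aic(log_likelihood, k):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(log_likelihood, k, n):
    """Bayesian Information Criterion: k ln(n) - 2 ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

rng = random.Random(1)
sample = [rng.gauss(5.0, 1.5) for _ in range(500)]
n = len(sample)

# Maximum-likelihood normal fit: sample mean and population standard deviation.
model = NormalDist(fmean(sample), pstdev(sample))
log_l = sum(math.log(model.pdf(x)) for x in sample)

k = 2  # two fitted parameters: mean and standard deviation
aic_value = aic(log_l, k)
bic_value = bic(log_l, k, n)
print(f"AIC: {aic_value:.2f}")
print(f"BIC: {bic_value:.2f}")
# The BIC penalty k*ln(n) exceeds the AIC penalty 2k whenever n > e**2 (about 7.4).
print(f"penalty gap: {bic_value - aic_value:.2f}")
```

Lower values are better for both criteria. Because the gap between them depends only on k and n, the two can rank candidates differently only when the candidates have different numbers of parameters.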

When to Use Each

Use RMSE when:

  • ✅ Working with real-world data

  • ✅ Sample size is large (>10,000)

  • ✅ You want consistent, reliable results

  • ✅ Practical fit quality matters

Use p-values when:

  • ⚠️ Working with synthetic/controlled data

  • ⚠️ Sample size is moderate (<1,000)

  • ⚠️ Statistical significance is required

  • ⚠️ Educational/demonstration purposes

Use AIC/BIC when:

  • 📊 Comparing model complexity

  • 📊 Theoretical model selection

  • 📊 Need to balance fit vs. parameters

Warning

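Large Sample Size Effect: With large datasets (>10,000 observations), goodness-of-fit tests (KS, Chi-square) gain enough statistical power to detect deviations that are too small to matter in practice. They therefore tend to reject even excellent fits, and their p-values stop being a useful guide to fit quality. Always use RMSE for large datasets.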

See the MagicAdjuster tutorial section on “Large Sample Size Effect” for detailed explanation.
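The effect can be reproduced without any data. The sketch below holds the KS statistic fixed at a small value (a maximum CDF discrepancy of 0.02) and evaluates the asymptotic Kolmogorov p-value, p = 2 Σ (-1)^(k-1) exp(-2 k² n D²), for growing sample sizes. This is the textbook large-sample approximation and is not necessarily the exact routine magica calls; `ks_pvalue_asymptotic` is a hypothetical helper.

```python
import math

def ks_pvalue_asymptotic(d_stat, n, terms=100):
    """Asymptotic Kolmogorov p-value for KS statistic d_stat and sample size n."""
    lam = math.sqrt(n) * d_stat
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * (k * lam) ** 2)
                for k in range(1, terms + 1))
    # Clamp to [0, 1] to absorb floating-point round-off in the alternating sum.
    return min(1.0, max(0.0, 2.0 * total))

d_stat = 0.02  # same small CDF discrepancy at every sample size
p_values = {n: ks_pvalue_asymptotic(d_stat, n)
            for n in (100, 1_000, 10_000, 100_000)}
for n, p in p_values.items():
    print(f"n={n:>7,}: p = {p:.3g}")
```

The quality of the fit is identical in every row, yet the p-value falls from close to 1 to effectively 0 as n grows, which is why a size-independent measure such as RMSE is the safer ranking criterion for large samples.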

Result Dictionary

Each distribution’s results contain:

{
    'distribution': str,      # Distribution name
    'success': bool,          # Whether fitting succeeded
    'parameters': tuple,      # Fitted parameters
    'rmse': float,           # Root mean square error
    'aic': float,            # Akaike Information Criterion
    'bic': float,            # Bayesian Information Criterion
    'ks_statistic': float,   # KS test statistic
    'ks_pvalue': float,      # KS test p-value
    'chi2_statistic': float, # Chi-square statistic
    'chi2_pvalue': float,    # Chi-square p-value
    'adjuster': MagicAdjuster  # Fitted adjuster instance
}
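Given dictionaries of this shape, choosing a winner comes down to skipping failed fits and then minimizing or maximizing the chosen metric. The sketch below assumes that 'rmse', 'aic', and 'bic' are minimized and that the p-value criteria are maximized, and it assumes a failed fit may lack the metric keys. `pick_best` is a hypothetical helper, not a magica function, and the numbers are made up for illustration.

```python
# Assumed direction for each documented criterion: errors and information
# criteria are minimized, p-values are maximized.
MINIMIZE = {"rmse", "aic", "bic"}
MAXIMIZE = {"ks_pvalue", "chi2_pvalue"}

def pick_best(results, criterion="rmse"):
    """Return the name of the best successful fit under the given criterion."""
    if criterion not in MINIMIZE | MAXIMIZE:
        raise ValueError(f"unsupported criterion: {criterion!r}")
    # Ignore distributions whose fit did not succeed.
    ok = {name: r for name, r in results.items() if r.get("success")}
    if not ok:
        raise RuntimeError("no distribution was fitted successfully")
    chooser = min if criterion in MINIMIZE else max
    return chooser(ok, key=lambda name: ok[name][criterion])

# Made-up numbers for illustration only; not output from magica.
results = {
    "weibull_min": {"success": True, "rmse": 0.004, "aic": 5120.0, "ks_pvalue": 0.61},
    "gamma":       {"success": True, "rmse": 0.009, "aic": 5101.0, "ks_pvalue": 0.78},
    "pareto":      {"success": False},
}

print(pick_best(results, "rmse"))
print(pick_best(results, "aic"))
print(pick_best(results, "ks_pvalue"))
```

In this toy table the criteria disagree, which is the situation the advice about checking several criteria is meant to surface.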

Default Distributions

The default candidate list includes 16 stable, commonly used distributions:

default_distributions = [
    'norm',           # Normal
    'lognorm',        # Log-normal
    'expon',          # Exponential
    'weibull_min',    # Weibull (minimum)
    'gamma',          # Gamma
    'beta',           # Beta
    'chi2',           # Chi-square
    'rayleigh',       # Rayleigh
    'uniform',        # Uniform
    'logistic',       # Logistic
    'gumbel_r',       # Gumbel (right)
    'exponweib',      # Exponentiated Weibull
    'genextreme',     # Generalized Extreme Value
    'pareto',         # Pareto
    'maxwell',        # Maxwell
    'rice'            # Rice
]

To test all 113+ available distributions, use:

from magica.core import AutoFitter

all_dists = AutoFitter.get_all_available_distributions()

Examples

Finding Best Distribution

import numpy as np
import magica as ma

# Generate wind speed data
wind_data = np.random.weibull(2.5, 5000) * 10 + 2
processor = ma.read_data(wind_data)

# Auto-fit with RMSE criterion
auto_fitter = processor.get_auto_fitter(criterion='rmse')
best = auto_fitter.fit_best_distribution()

print(f"Best distribution: {best['distribution']}")
print(f"RMSE: {best['rmse']:.6f}")
print(f"AIC: {best['aic']:.2f}")
print(f"KS p-value: {best['ks_pvalue']:.6f}")

Comparing Multiple Criteria

# Test with different criteria
criteria = ['rmse', 'aic', 'bic']

for criterion in criteria:
    fitter = processor.get_auto_fitter(criterion=criterion)
    best = fitter.fit_best_distribution()
    print(f"{criterion.upper()}: {best['distribution']}")

Filtering by P-value (Synthetic Data Only)

# For synthetic data, you can filter by p-value
auto_fitter = processor.get_auto_fitter(criterion='rmse')
auto_fitter.fit_all_distributions()

comparison = auto_fitter.get_comparison_table(sort_by='rmse')

# Get successful fits
successful = [(d, r) for d, r in comparison.items() if r['success']]

# Keep fits with p-value > 0.05 (KS test does not reject them at the 5% level)
good_fits = [(d, r) for d, r in successful if r['ks_pvalue'] > 0.05]

print(f"Distributions with p > 0.05: {len(good_fits)}")
print("Top 3 by RMSE (p > 0.05):")
for i, (dist, result) in enumerate(good_fits[:3], 1):
    print(f"  {i}. {dist}: RMSE={result['rmse']:.6f}, p={result['ks_pvalue']:.4f}")

Best Practices

  1. Always use RMSE as the primary criterion for real-world data

  2. Start with default distributions (faster), then try comprehensive if needed

  3. Create custom lists for domain-specific applications (e.g., wind, rainfall)

  4. Filter by p-value only for synthetic data with moderate sample sizes

  5. Check multiple criteria to verify consistency in distribution selection

  6. Use the best adjuster for further analysis (Monte Carlo, goodness-of-fit)

See Also