AutoFitter - Automatic Distribution Selection
==============================================

The AutoFitter class provides automatic distribution fitting and selection. It fits multiple probability distributions to the data and selects the best one according to a specified criterion.

Overview
--------

AutoFitter automatically:

- Tests multiple probability distributions (default: 16 curated distributions)
- Can test all 113+ SciPy continuous distributions
- Selects the best distribution based on RMSE, AIC, BIC, or p-values
- Uses lazy initialization for memory efficiency
- Provides comprehensive comparison tables

.. important::

   **Best Practice**: Always use **RMSE** as the primary selection criterion. RMSE is robust across all sample sizes and avoids the "large sample size effect" that makes p-values unreliable with large datasets (>10,000 samples).

Class Reference
---------------

.. autoclass:: magica.core.AutoFitter
   :members:
   :undoc-members:
   :show-inheritance:

Quick Start
-----------

Basic Usage
~~~~~~~~~~~

.. code-block:: python

   import numpy as np
   import magica as ma

   # Load data
   data = np.random.weibull(2, 1000) * 8 + 2
   processor = ma.read_data(data)

   # Create AutoFitter with RMSE criterion (recommended)
   auto_fitter = processor.get_auto_fitter(criterion='rmse')

   # Find best distribution
   best_result = auto_fitter.fit_best_distribution()

   print(f"Best distribution: {best_result['distribution']}")
   print(f"RMSE: {best_result['rmse']:.6f}")
   print(f"Parameters: {best_result['parameters']}")

Testing All Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   from magica.core import AutoFitter

   # Get all available distributions
   all_dists = AutoFitter.get_all_available_distributions()
   print(f"Total distributions available: {len(all_dists)}")

   # Test all distributions (takes longer)
   auto_fitter = processor.get_auto_fitter(
       candidates=all_dists,
       criterion='rmse'
   )

   best = auto_fitter.fit_best_distribution()
   print(f"Best from all 113+: {best['distribution']}")

Custom Distribution List
~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Define domain-specific distributions
   wind_distributions = [
       'weibull_min',  # Most common for wind
       'weibull_max',
       'rayleigh',     # Theoretical wind model
       'lognorm',
       'gamma',
       'rice'
   ]

   auto_fitter = processor.get_auto_fitter(
       candidates=wind_distributions,
       criterion='rmse'
   )

   best = auto_fitter.fit_best_distribution()

Working with Results
--------------------

Comparison Table
~~~~~~~~~~~~~~~~

Get a comprehensive comparison of all tested distributions:

.. code-block:: python

   # Fit all distributions first
   auto_fitter.fit_all_distributions()

   # Get sorted comparison table
   comparison = auto_fitter.get_comparison_table(sort_by='rmse')

   # Filter successful fits
   successful = {d: r for d, r in comparison.items() if r['success']}

   # For synthetic data, optionally filter by p-value
   good_fits = {d: r for d, r in successful.items()
                if r['ks_pvalue'] > 0.05}

   # Display top 5
   for i, (dist, result) in enumerate(list(good_fits.items())[:5], 1):
       print(f"{i}. {dist}: RMSE={result['rmse']:.6f}")

Using the Best Distribution
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once you have found the best distribution, you can use it like a regular MagicAdjuster:

.. code-block:: python

   # Get the adjuster for best distribution
   best_adjuster = auto_fitter.get_best_adjuster()

   # Calculate statistics
   mean = best_adjuster.stats(moments='m')
   p95 = best_adjuster.ppf(0.95)  # 95th percentile

   # Perform goodness-of-fit tests
   ks_result = best_adjuster.goodness_of_fit('ks')
   rmse_result = best_adjuster.goodness_of_fit('rmse')

   # Monte Carlo stability analysis
   mc_results = best_adjuster.monte_carlo_fit(
       tests=['ks', 'chi2', 'rmse'],
       n_repeats=100,
       fig_output_path='stability.png'
   )

Selection Criteria
------------------

Available Criteria
~~~~~~~~~~~~~~~~~~

The ``criterion`` parameter accepts:

- **'rmse'** (recommended): Root Mean Square Error - robust for all sample sizes
- **'aic'**: Akaike Information Criterion - balances fit and complexity
- **'bic'**: Bayesian Information Criterion - penalizes complexity more than AIC
- **'ks_pvalue'**: Kolmogorov-Smirnov p-value - statistical significance
- **'chi2_pvalue'**: Chi-square p-value - histogram-based test

When to Use Each
~~~~~~~~~~~~~~~~

**Use RMSE when:**

- ✅ Working with real-world data
- ✅ Sample size is large (>10,000)
- ✅ You want consistent, reliable results
- ✅ Practical fit quality matters

**Use p-values when:**

- ⚠️ Working with synthetic/controlled data
- ⚠️ Sample size is moderate (<1,000)
- ⚠️ Statistical significance is required
- ⚠️ Educational/demonstration purposes

**Use AIC/BIC when:**

- 📊 Comparing model complexity
- 📊 Theoretical model selection
- 📊 Need to balance fit vs. number of parameters

.. warning::

   **Large Sample Size Effect**: With large datasets (>10,000 observations), goodness-of-fit tests (KS, Chi-square) tend to reject even excellent fits, so their p-values become unreliable. **Always use RMSE for large datasets.**

   See the MagicAdjuster tutorial section on "Large Sample Size Effect" for a detailed explanation.

Result Dictionary
-----------------

Each distribution's results contain:

.. code-block:: python

   {
       'distribution': str,        # Distribution name
       'success': bool,            # Whether fitting succeeded
       'parameters': tuple,        # Fitted parameters
       'rmse': float,              # Root mean square error
       'aic': float,               # Akaike Information Criterion
       'bic': float,               # Bayesian Information Criterion
       'ks_statistic': float,      # KS test statistic
       'ks_pvalue': float,         # KS test p-value
       'chi2_statistic': float,    # Chi-square statistic
       'chi2_pvalue': float,       # Chi-square p-value
       'adjuster': MagicAdjuster   # Fitted adjuster instance
   }

Default Distributions
---------------------

The default candidate list includes 16 stable, commonly used distributions:

.. code-block:: python

   default_distributions = [
       'norm',         # Normal
       'lognorm',      # Log-normal
       'expon',        # Exponential
       'weibull_min',  # Weibull (minimum)
       'gamma',        # Gamma
       'beta',         # Beta
       'chi2',         # Chi-square
       'rayleigh',     # Rayleigh
       'uniform',      # Uniform
       'logistic',     # Logistic
       'gumbel_r',     # Gumbel (right)
       'exponweib',    # Exponentiated Weibull
       'genextreme',   # Generalized Extreme Value
       'pareto',       # Pareto
       'maxwell',      # Maxwell
       'rice'          # Rice
   ]

To test all 113+ available distributions, use:

.. code-block:: python

   from magica.core import AutoFitter

   all_dists = AutoFitter.get_all_available_distributions()

Examples
--------

Finding Best Distribution
~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   import numpy as np
   import magica as ma

   # Generate wind speed data
   wind_data = np.random.weibull(2.5, 5000) * 10 + 2
   processor = ma.read_data(wind_data)

   # Auto-fit with RMSE criterion
   auto_fitter = processor.get_auto_fitter(criterion='rmse')
   best = auto_fitter.fit_best_distribution()

   print(f"Best distribution: {best['distribution']}")
   print(f"RMSE: {best['rmse']:.6f}")
   print(f"AIC: {best['aic']:.2f}")
   print(f"KS p-value: {best['ks_pvalue']:.6f}")

Comparing Multiple Criteria
~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # Test with different criteria
   criteria = ['rmse', 'aic', 'bic']

   for criterion in criteria:
       fitter = processor.get_auto_fitter(criterion=criterion)
       best = fitter.fit_best_distribution()
       print(f"{criterion.upper()}: {best['distribution']}")

Filtering by P-value (Synthetic Data Only)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

   # For synthetic data, you can filter by p-value
   auto_fitter = processor.get_auto_fitter(criterion='rmse')
   auto_fitter.fit_all_distributions()

   comparison = auto_fitter.get_comparison_table(sort_by='rmse')

   # Get successful fits
   successful = [(d, r) for d, r in comparison.items() if r['success']]

   # Filter by p-value > 0.05 (good statistical fit)
   good_fits = [(d, r) for d, r in successful if r['ks_pvalue'] > 0.05]

   print(f"Distributions with p > 0.05: {len(good_fits)}")
   print("Top 3 by RMSE (p > 0.05):")
   for i, (dist, result) in enumerate(good_fits[:3], 1):
       print(f"  {i}. {dist}: RMSE={result['rmse']:.6f}, p={result['ks_pvalue']:.4f}")

Best Practices
--------------

1. **Always use RMSE** as the primary criterion for real-world data
2. **Start with default distributions** (faster), then try the full list if needed
3. **Create custom lists** for domain-specific applications (e.g., wind, rainfall)
4. **Filter by p-value only for synthetic data** with moderate sample sizes
5. **Check multiple criteria** to verify consistency in distribution selection
6. **Use the best adjuster** for further analysis (Monte Carlo, goodness-of-fit)

See Also
--------

- :doc:`core` - MagicAdjuster and DataProcessor documentation
- :doc:`monte_carlo` - Monte Carlo stability analysis
- :doc:`/tutorials/auto_fitter_tutorial` - Complete AutoFitter tutorial
- :doc:`/tutorials/magic_adjuster_tutorial` - MagicAdjuster tutorial with large sample size effect
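
As a supplement to the "Large Sample Size Effect" warning under Selection Criteria, the following sketch reproduces the effect with NumPy and SciPy only. It does not use magica. The helper ``ks_pvalue_and_cdf_rmse`` is hypothetical, and the RMSE here is computed between the empirical CDF and the fitted CDF, which is an assumed definition and may differ from how AutoFitter computes its ``rmse`` score. Gamma-distributed data is fitted with a two-parameter Weibull, which is a close but not exact model.

```python
import numpy as np
from scipy import stats


def ks_pvalue_and_cdf_rmse(sample):
    """Fit a two-parameter Weibull; return (KS p-value, CDF RMSE).

    Hypothetical helper for illustration only, not part of magica.
    """
    # Fix the location at 0 so only shape and scale are estimated.
    shape, loc, scale = stats.weibull_min.fit(sample, floc=0)
    fitted = stats.weibull_min(shape, loc=loc, scale=scale)

    # KS test of the sample against the fitted distribution.
    p_value = stats.kstest(sample, fitted.cdf).pvalue

    # RMSE between the empirical CDF and the fitted CDF
    # (an assumed definition, see the text above).
    x = np.sort(sample)
    ecdf = np.arange(1, x.size + 1) / x.size
    rmse = float(np.sqrt(np.mean((ecdf - fitted.cdf(x)) ** 2)))
    return p_value, rmse


rng = np.random.default_rng(0)
results = {}
for n in (300, 100_000):
    # Gamma data: similar in shape to a Weibull, but not identical.
    sample = rng.gamma(shape=2.0, scale=1.0, size=n)
    results[n] = ks_pvalue_and_cdf_rmse(sample)
    p_value, rmse = results[n]
    print(f"n={n:>7}: KS p-value={p_value:.4g}, CDF RMSE={rmse:.4f}")
```

Under these assumptions, the large sample should produce a KS p-value far below 0.05 even though its CDF RMSE stays small, because the test gains enough power to detect a mismatch that has little practical importance.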