Quick Start
===========

MagicA (Magic Adjustment) is a Python package for statistical data adjustment with a focus on distribution fitting and Monte Carlo stability analysis.

Installation
------------

Install MagicA using pip:

.. code-block:: bash

    pip install magica

Basic Usage
-----------

Distribution Fitting
~~~~~~~~~~~~~~~~~~~~

Start by fitting a distribution to your data:

.. code-block:: python

    import numpy as np
    import magica as ma
    
    # Generate sample data (Weibull distribution)
    data = np.random.weibull(2, 1000)
    
    # Create processor and fit distribution
    processor = ma.read_data(data)
    processor.fit_distribution('weibull_min')
    
    # Get fitted parameters
    params = processor.get_fitted_params()
    print(f"Fitted parameters: {params}")

Goodness-of-Fit Testing
~~~~~~~~~~~~~~~~~~~~~~~

Evaluate how well your distribution fits the data:

.. code-block:: python

    # Perform Kolmogorov-Smirnov test
    ks_result = processor.goodness_of_fit('ks')
    print(f"KS p-value: {ks_result['p_value']:.4f}")
    
    # Perform chi-square test
    chi2_result = processor.goodness_of_fit('chi2')
    print(f"Chi-square p-value: {chi2_result['p_value']:.4f}")
    
    # Calculate RMSE (recommended)
    rmse_result = processor.goodness_of_fit('rmse')
    print(f"RMSE: {rmse_result['rmse']:.6f}")

.. tip::
   **RMSE is the most reliable metric** for assessing fit quality, especially with large datasets where p-values can be misleading due to the "large sample size effect".

Automatic Distribution Selection
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Use AutoFitter to automatically find the best distribution:

.. code-block:: python

    # Create AutoFitter with RMSE criterion (recommended)
    auto_fitter = processor.get_auto_fitter(criterion='rmse')
    
    # Find best distribution
    best = auto_fitter.fit_best_distribution()
    
    print(f"Best distribution: {best['distribution']}")
    print(f"RMSE: {best['rmse']:.6f}")
    print(f"AIC: {best['aic']:.2f}")

Testing Multiple Distributions
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Compare all available distributions:

.. code-block:: python

    # Get all 113+ available distributions
    all_dists = auto_fitter.get_all_available_distributions()
    
    # Test comprehensive set
    auto_fitter_all = processor.get_auto_fitter(
        candidates=all_dists,
        criterion='rmse'
    )
    best_comprehensive = auto_fitter_all.fit_best_distribution()
    
    # Get comparison table
    comparison = auto_fitter_all.get_comparison_table(sort_by='rmse')
    
    # Show top 5
    for i, (dist, result) in enumerate(list(comparison.items())[:5], 1):
        if result['success']:
            print(f"{i}. {dist}: RMSE={result['rmse']:.6f}")

Extreme Value Analysis
~~~~~~~~~~~~~~~~~~~~~~~

Analyze extreme values and calculate return periods for time series data:

.. code-block:: python

    import pandas as pd
    
    # Create time series with datetime index
    dates = pd.date_range('1980-01-01', '2023-12-31', freq='D')
    wind_speeds = np.random.weibull(2.5, len(dates)) * 15 + 5
    series = pd.Series(wind_speeds, index=dates)
    
    # Load data and create extremes analyzer
    processor = ma.read_data(series)
    extremes = processor.get_extremes_analyzer(time_unit='years')
    
    # Fit GEV distribution (common for extremes)
    extremes.fit_distribution('genextreme')
    
    # Calculate return values
    rv_50 = extremes.return_value(50)   # 50-year return value
    rv_100 = extremes.return_value(100) # 100-year return value
    
    print(f"50-year return value: {rv_50:.2f} m/s")
    print(f"100-year return value: {rv_100:.2f} m/s")
    
    # Calculate return period for specific value
    rp = extremes.return_period(30.0)
    print(f"A value of 30 m/s has a return period of {rp:.1f} years")

.. tip::
   **Common extreme value distributions**:
   
   - ``'genextreme'``: Generalized Extreme Value (GEV) - most flexible
   - ``'gumbel_r'``: Gumbel distribution - common for environmental extremes
   - ``'weibull_max'``: Weibull maximum - for maximum extremes

Monte Carlo Stability Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Determine the minimum sample size needed for stable parameter estimation:

.. code-block:: python

    # Fit a distribution first
    processor.fit_distribution('weibull_min')
    
    # Run Monte Carlo analysis with automatic figure generation
    results = processor.monte_carlo_fit(
        n_repeats=100,
        tests=['ks', 'chi2', 'rmse'],  # Always include RMSE!
        fig_output_path='stability_analysis.png'
    )
    
    # Access results with xarray
    ks_pvalues = results['ks_pvalue']
    rmse_values = results['rmse']
    param_medians = results['param_0'].median(dim='repeats')
    
    # Check stability points
    stability = results.attrs['stability_points']
    
    # RMSE is the most reliable stability indicator
    if 'rmse' in stability:
        print(f"Recommended minimum size (RMSE): {stability['rmse']['size']}")

.. important::
   **Always include 'rmse' in your tests** - RMSE provides the most reliable stability detection with smooth, monotonic convergence, unlike p-values which can be erratic.

Working with xarray Results
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Monte Carlo analysis returns an xarray Dataset for easy data manipulation:

.. code-block:: python

    # Select data for specific sample size
    size_200_results = results.sel(sizes=200)
    
    # Calculate statistics across repeats
    param_std = results['param_0'].std(dim='repeats')
    rmse_median = results['rmse'].median(dim='repeats')
    
    # Plot results directly
    import matplotlib.pyplot as plt
    results['rmse'].plot(x='sizes')
    plt.title('RMSE Convergence')
    plt.show()
    
    # Convert to pandas for further analysis
    df = results.to_dataframe()

Advanced Examples
-----------------

Custom Binning Strategy
~~~~~~~~~~~~~~~~~~~~~~~

Control the binning strategy for chi-square tests:

.. code-block:: python

    # Use Scott's rule for binning
    results = processor.monte_carlo_fit(
        n_repeats=200,
        tests=['chi2', 'rmse'],
        bins='scott',
        fig_output_path='chi2_stability.png'
    )
    
    # Access chi-square test results
    chi2_pvalues = results['chi2_pvalue']

Different Sampling Strategies
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Choose between random, bootstrap, or disjoint sampling:

.. code-block:: python

    # Random sampling (default)
    results_random = processor.monte_carlo_fit(
        sampling='random',
        n_repeats=100,
        tests=['ks', 'rmse']
    )
    
    # Bootstrap sampling (with replacement)
    results_bootstrap = processor.monte_carlo_fit(
        sampling='bootstrap',
        n_repeats=100,
        tests=['ks', 'rmse']
    )
    
    # Disjoint sampling (no overlap)
    results_disjoint = processor.monte_carlo_fit(
        sampling='disjoint',
        n_repeats=20,  # Limited by data size
        tests=['ks', 'rmse']
    )

Pre-calculated Parameters
~~~~~~~~~~~~~~~~~~~~~~~~~

Use known distribution parameters instead of fitting:

.. code-block:: python

    # Use known Weibull parameters (shape=2, loc=0, scale=1)
    known_params = (2.0, 0.0, 1.0)
    
    results = processor.monte_carlo_fit(
        distribution_params=known_params,
        n_repeats=150,
        tests=['chi2', 'ks', 'rmse']
    )

Block Maxima for Extreme Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Extract block maxima from time series for GEV analysis:

.. code-block:: python

    import pandas as pd
    
    # Create daily time series
    dates = pd.date_range('1980-01-01', '2023-12-31', freq='D')
    values = np.random.weibull(2.5, len(dates)) * 15 + 5
    series = pd.Series(values, index=dates)
    
    # Create extremes analyzer
    processor = ma.read_data(series)
    extremes = processor.get_extremes_analyzer()
    
    # Extract annual maxima
    annual_max, times = extremes.extract_block_maxima(block_size='A')
    print(f"Extracted {len(annual_max)} annual maxima")
    
    # Create new analyzer with annual maxima
    processor_annual = ma.read_data(pd.Series(annual_max, index=times))
    extremes_annual = processor_annual.get_extremes_analyzer()
    extremes_annual.fit_distribution('genextreme')
    
    # Calculate design values
    design_values = extremes_annual.return_value([10, 50, 100])
    print(f"10-year: {design_values[0]:.2f}")
    print(f"50-year: {design_values[1]:.2f}")
    print(f"100-year: {design_values[2]:.2f}")
    
    # Plot return levels
    fig, ax = extremes_annual.plot_return_levels()

Peaks Over Threshold Analysis
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Analyze extreme values using peaks over threshold (POT):

.. code-block:: python

    import pandas as pd
    
    # Create hourly time series
    dates = pd.date_range('2010-01-01', '2023-12-31', freq='H')
    wave_heights = np.random.weibull(2, len(dates)) * 3 + 0.5
    series = pd.Series(wave_heights, index=dates)
    
    # Create extremes analyzer
    processor = ma.read_data(series)
    extremes = processor.get_extremes_analyzer(time_unit='years')
    
    # Define threshold (90th percentile)
    threshold = np.percentile(wave_heights, 90)
    
    # Extract peaks with declustering
    peaks, peak_times = extremes.peaks_over_threshold(
        threshold=threshold,
        min_separation='12H'  # Minimum 12 hours between peaks
    )
    
    print(f"Found {len(peaks)} independent peaks above {threshold:.2f} m")
    
    # Fit GPD to excesses
    excesses = peaks - threshold
    processor_pot = ma.read_data(excesses)
    processor_pot.fit_distribution('genpareto')

.. code-block:: python

    # Use known Weibull parameters (shape=2, loc=0, scale=1)
    known_params = (2.0, 0.0, 1.0)
    
    results = processor.monte_carlo_fit(
        distribution_params=known_params,
        n_repeats=150,
        tests=['chi2', 'ks', 'rmse']
    )

Custom Fitting Constraints
~~~~~~~~~~~~~~~~~~~~~~~~~~

Apply parameter constraints during fitting:

.. code-block:: python

    # Fix location parameter for Weibull distribution
    results = processor.monte_carlo_fit(
        n_repeats=100,
        tests=['ks', 'rmse'],
        fit_kwargs={'floc': 0}  # Force location = 0
    )
    
    # Multiple constraints
    results = processor.monte_carlo_fit(
        n_repeats=100,
        tests=['ks', 'rmse'],
        fit_kwargs={'floc': 0, 'method': 'MLE'}
    )

Understanding Results
---------------------

Monte Carlo Results Structure
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The `monte_carlo_fit` method returns an xarray Dataset with:

**Dimensions:**

- **sizes**: Sample sizes tested (e.g., [50, 100, 150, 200, ...])
- **repeats**: Repetition index for each size (e.g., [0, 1, 2, ..., 99])

**Data Variables:**

- **param_0, param_1, ...**: Fitted distribution parameters for each size/repeat
- **ks_statistic, ks_pvalue**: Kolmogorov-Smirnov test results (if requested)
- **chi2_statistic, chi2_pvalue**: Chi-square test results (if requested)
- **rmse**: Root mean square error values (if requested)

**Attributes:**

- **distribution**: Distribution name
- **original_data_size**: Size of original dataset
- **sampling_method**: Sampling strategy used
- **stability_points**: Detected stability points for each variable
- **figure_path**: Path to saved summary figure (if generated)

**Easy Data Access:**

.. code-block:: python

    # Get all RMSE values
    rmse_data = results['rmse']
    
    # Select specific size
    size_100_data = results.sel(sizes=100)
    
    # Calculate median across repeats
    param_medians = results['param_0'].median(dim='repeats')
    
    # Check when parameters stabilize
    stability = results.attrs['stability_points']
    if 'param_0' in stability:
        print(f"Parameter 0 stabilizes at size: {stability['param_0']['size']}")
    
    # RMSE stability (most reliable)
    if 'rmse' in stability:
        print(f"RMSE stabilizes at size: {stability['rmse']['size']}")

AutoFitter Results
~~~~~~~~~~~~~~~~~~

AutoFitter returns a dictionary with comprehensive fit information:

.. code-block:: python

    best_result = {
        'distribution': 'weibull_min',
        'success': True,
        'parameters': (2.1, 0.0, 5.3),
        'rmse': 0.012345,
        'aic': 1234.5,
        'bic': 1245.6,
        'ks_statistic': 0.023,
        'ks_pvalue': 0.456,
        'chi2_statistic': 12.3,
        'chi2_pvalue': 0.234,
        'adjuster': <MagicAdjuster instance>
    }

Generating Summary Figures
---------------------------

Automatic Summary Plots
~~~~~~~~~~~~~~~~~~~~~~~

By default no figure is created (faster). Provide a `fig_output_path` to save a
2×3 summary figure (first row: up to 3 parameters; second row: test metrics):

.. code-block:: python

    # Generate and save figure with series style panels
    results = processor.monte_carlo_fit(
        tests=['ks', 'chi2', 'rmse'],
        n_repeats=100,
        fig_output_path='stability_summary.png'
    )

    # Boxplot style
    results = processor.monte_carlo_fit(
        tests=['ks', 'rmse'],
        n_repeats=100,
        plot_type='boxplots',
        fig_output_path='stability_boxplots.png'
    )

    # Path stored in attributes
    print(results.attrs.get('figure_path'))  # saved file path

The red dashed vertical line in each panel marks the first sample size where
the corresponding parameter or test metric meets the stability criterion.

Custom Visualizations
~~~~~~~~~~~~~~~~~~~~~~

You can also create custom visualizations directly with xarray/matplotlib:

.. code-block:: python

    import matplotlib.pyplot as plt
    
    # Plot RMSE convergence
    fig, ax = plt.subplots(figsize=(10, 6))
    
    # Plot each repeat as a thin line
    for i in range(results.sizes['repeats']):
        results['rmse'].isel(repeats=i).plot(ax=ax, alpha=0.3, color='blue')
    
    # Plot median as thick line
    results['rmse'].median(dim='repeats').plot(ax=ax, linewidth=3, color='red', label='Median')
    
    # Mark stability point
    if 'rmse' in results.attrs['stability_points']:
        stable_size = results.attrs['stability_points']['rmse']['size']
        ax.axvline(stable_size, color='green', linestyle='--', linewidth=2, label=f'Stable at {stable_size}')
    
    ax.set_title('RMSE Convergence Analysis')
    ax.set_xlabel('Sample Size')
    ax.set_ylabel('RMSE')
    ax.legend()
    ax.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.savefig('custom_rmse_plot.png', dpi=150)
    plt.show()

Synthetic Data Generation
--------------------------

MagicA provides synthetic data generators for creating realistic test datasets. Perfect for tutorials, testing, and method validation.

Generate Wind Speed Data
~~~~~~~~~~~~~~~~~~~~~~~~

Create non-directional wind speed time series with storms and seasonal variations:

.. code-block:: python

    from magica.utils import generate_wind_data
    import matplotlib.pyplot as plt
    
    # Generate 30 years of daily wind data with plots
    wind_series, plots = generate_wind_data(
        n_years=30,
        freq='D',
        mean_wind=8.0,
        weibull_shape=2.5,
        seasonal_amplitude=0.3,
        n_storms_per_year=5,
        storm_duration_days=(2, 5),
        storm_intensity_range=(12, 20),
        random_seed=42,
        create_plots=True
    )
    
    plt.show()
    
    # Use with MagicA
    processor = ma.read_data(wind_series)
    extremes = processor.get_extremes_analyzer(time_unit='years')
    
    print(f"Generated {len(wind_series)} observations")
    print(f"Max wind speed: {wind_series.max():.2f} m/s")

Generate Directional Wind Data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create wind data with directional characteristics:

.. code-block:: python

    from magica.utils import generate_directional_wind_data
    
    # Generate 10 years of hourly directional wind data
    wind_data, plots = generate_directional_wind_data(
        n_years=10,
        freq='H',
        mean_wind=8.0,
        prevailing_direction=270,  # West
        directional_concentration=1.5,
        directional_speed_factors={
            'W': 1.4,   # Higher speeds from West (fetch effect)
            'SW': 1.4,
            'N': 0.8,   # Lower speeds from North (land effect)
            'NE': 0.8
        },
        storm_directions=[250, 270, 290],  # Storms from W/SW
        random_seed=42,
        create_plots=True
    )
    
    plt.show()
    
    # Access data
    print(wind_data.head())
    print(f"Columns: {wind_data.columns.tolist()}")

**Directional Speed Factors** accept three formats:

- **Sector names**: ``{'W': 1.4, 'SW': 1.4, 'N': 0.8}`` (8 sectors: N, NE, E, SE, S, SW, W, NW)
- **Degree ranges**: ``{(225, 315): 1.4, (0, 90): 0.8}``
- **Numeric centers**: ``{270: 1.4, 0: 0.8}`` (auto-creates ±22.5° range)

Intelligent POT Threshold Selection
------------------------------------

Automatically find optimal POT thresholds that balance extreme values with sample size.

Basic Threshold Search
~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python

    from magica.utils import generate_wind_data
    import magica as ma
    
    # Generate sample data
    wind_series = generate_wind_data(n_years=30, freq='D', random_seed=42)
    
    # Create extremes analyzer
    processor = ma.read_data(wind_series)
    extremes = processor.get_extremes_analyzer(time_unit='years')
    
    # Find optimal threshold automatically
    result = extremes.find_optimal_pot_threshold(
        min_samples=50,           # Target minimum independent samples
        percentile_min=90,        # Start at 90th percentile
        percentile_max=99,        # Stop at 99th percentile
        min_separation_hours=48,  # Minimum time between events
        max_separation_hours=120,
        verbose=True              # Show search progress
    )
    
    # Use the results
    if result['success']:
        print(f"Optimal threshold: {result['threshold']:.2f} m/s")
        print(f"Independent exceedances: {result['n_independent']}")
        
        # Access exceedances for further analysis
        exceedances = result['exceedances']
        exceedance_times = result['exceedance_times']

Custom Search Strategies
~~~~~~~~~~~~~~~~~~~~~~~~~

Control the search order with ``vary_first`` parameter:

.. code-block:: python

    # Strategy 1: Vary percentile first (default)
    # Tests each separation with decreasing percentiles
    result1 = extremes.find_optimal_pot_threshold(
        min_samples=50,
        vary_first='percentile'  # Default
    )
    
    # Strategy 2: Vary separation first
    # Tests each percentile with increasing separations
    result2 = extremes.find_optimal_pot_threshold(
        min_samples=50,
        vary_first='separation'
    )

Directional POT Analysis
~~~~~~~~~~~~~~~~~~~~~~~~

Apply to each directional sector:

.. code-block:: python

    from magica.utils import generate_directional_wind_data
    
    # Generate directional data
    wind_data = generate_directional_wind_data(n_years=10, freq='H')
    
    # Analyze each sector
    sectors = ['N', 'NE', 'E', 'SE', 'S', 'SW', 'W', 'NW']
    sector_results = {}
    
    for sector in sectors:
        # Filter by sector
        sector_mask = wind_data['sector_name'] == sector
        sector_series = wind_data.loc[sector_mask, 'wind_speed']
        
        # Find optimal threshold
        processor = ma.read_data(sector_series)
        extremes = processor.get_extremes_analyzer(time_unit='years')
        
        result = extremes.find_optimal_pot_threshold(
            min_samples=50,
            min_separation_hours=48,
            max_separation_hours=120
        )
        
        sector_results[sector] = result
        print(f"{sector}: threshold={result['threshold']:.2f}, "
              f"n={result['n_independent']}")

Directional Return Value Visualization
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Create polar plots of return values by direction:

.. code-block:: python

    # After fitting distributions for each sector
    directional_results = {}
    
    for sector in sectors:
        # ... (fit distribution and calculate return values) ...
        directional_results[sector] = {
            'center_deg': center_angle,
            'return_values': {10: rv10, 50: rv50, 100: rv100},
            'success': True
        }
    
    # Create polar plots (separate subplots)
    fig, axes = extremes.plot_directional_return_values(
        directional_results=directional_results,
        return_periods=[10, 50, 100]
    )
    
    # Or create overlay plot
    fig, ax = extremes.plot_directional_return_values(
        directional_results=directional_results,
        return_periods=[10, 50, 100],
        overlay=True
    )

Supported Distributions
-----------------------

MagicA supports all continuous distributions from scipy.stats, including:

**Common Distributions:**

- `'norm'` - Normal (Gaussian) distribution
- `'weibull_min'` - Weibull distribution (most common form)
- `'weibull_max'` - Weibull maximum distribution
- `'gamma'` - Gamma distribution
- `'lognorm'` - Log-normal distribution
- `'expon'` - Exponential distribution
- `'rayleigh'` - Rayleigh distribution (wind speed)
- `'beta'` - Beta distribution
- `'chi2'` - Chi-square distribution
- `'uniform'` - Uniform distribution

**Extreme Value Distributions:**

- `'gumbel_r'` - Gumbel (right) distribution
- `'gumbel_l'` - Gumbel (left) distribution  
- `'genextreme'` - Generalized Extreme Value
- `'exponweib'` - Exponentiated Weibull

**Environmental Data Distributions:**

- `'weibull_min'` - Wind speed (most common)
- `'lognorm'` - Rainfall, pollutant concentrations
- `'gamma'` - Rainfall amounts
- `'maxwell'` - Molecular speeds, wave heights
- `'rice'` - Wind with persistent component

**113+ Total Distributions Available** - See ``AutoFitter.get_all_available_distributions()`` for complete list.

Best Practices
--------------

Distribution Selection
~~~~~~~~~~~~~~~~~~~~~~

1. **Use RMSE as primary criterion** for real-world data
2. **Start with AutoFitter's default list** (16 stable distributions) for quick analysis
3. **Test comprehensive set** (113+ distributions) when you need the absolute best fit
4. **Create domain-specific lists** (e.g., wind, rainfall) for faster, targeted analysis
5. **Avoid p-value-only selection** with large datasets (>10,000 samples)

Monte Carlo Analysis
~~~~~~~~~~~~~~~~~~~~~

1. **Always include 'rmse' in tests** - most reliable stability indicator
2. **Use 100+ repeats** for robust stability detection
3. **Choose appropriate sampling**:
   - `'random'` - General purpose, allows overlap
   - `'bootstrap'` - With replacement, good for uncertainty
   - `'disjoint'` - No overlap, limited by data size
4. **Generate summary figures** with `fig_output_path` for visual inspection
5. **Check RMSE stability first**, then validate with other metrics

Large Sample Size Effect
~~~~~~~~~~~~~~~~~~~~~~~~~

.. warning::
   With large datasets (>10,000 observations), goodness-of-fit tests (KS, Chi-square) tend to reject even excellent fits. **Use RMSE for large datasets.**

See the :doc:`tutorials/magic_adjuster_tutorial` section on "Large Sample Size Effect" for detailed explanation and examples.

Extreme Value Analysis
~~~~~~~~~~~~~~~~~~~~~~~

1. **Use GEV (genextreme) for block maxima** - most flexible for annual/monthly maxima
2. **Use GPD (genpareto) for POT** - peaks over threshold analysis
3. **Require sufficient data**: Minimum 20-30 years for annual maxima
4. **Check stationarity**: Extreme value theory assumes stationary data
5. **Decluster POT data**: Use `min_separation` to ensure independence

**Time Series Requirements:**

- Use pandas Series with datetime index for automatic time handling
- Or provide times separately via `get_extremes_analyzer(times=...)`
- Specify appropriate `time_unit` ('years', 'days', 'hours', 'months')

Next Steps
----------

- Explore the :doc:`tutorials/index` for detailed examples
- Learn about :doc:`api/auto_fitter` for automatic distribution selection
- Read :doc:`api/monte_carlo` for comprehensive Monte Carlo documentation
- Check :doc:`api/extremes` for complete extreme value analysis documentation
- See :doc:`api/utils` for synthetic data generation and utilities
- Check :doc:`api/core` for complete API reference
- See :doc:`contributing` for development guidelines