Core Module

The core module contains the primary classes for statistical data processing and distribution fitting.

Core module - Main functionalities of MagicA

class magica.core.DataProcessor(data: ndarray | list | Series | DataFrame = None)[source]

Bases: object

Simple class for loading and basic processing of numerical data.

This class provides basic methods to load and validate numerical data for statistical analysis, converting data to numpy arrays.

__init__(data: ndarray | list | Series | DataFrame = None)[source]

Initialize the data processor.

Parameters:

data (array-like, optional) – Data to be processed. Can be: - numpy array - list of values - pandas Series - pandas DataFrame (will be flattened)

load_data(data: ndarray | list | Series | DataFrame) DataProcessor[source]

Load data and convert to numpy array.

Parameters:

data (array-like) – Data to be loaded. Can be: - numpy array - list of values - pandas Series - pandas DataFrame (will be flattened)

Returns:

Processor instance with loaded data

Return type:

DataProcessor

get_data_array() ndarray[source]

Return data as 1D numpy array.

Returns:

1D numpy array with the data

Return type:

ndarray

get_basic_stats() Dict[str, Any][source]

Return basic descriptive statistics of the data.

Returns:

Dictionary with basic statistics

Return type:

dict

fit_distribution(distribution: str | object, **kwargs) DataProcessor[source]

Fit a statistical distribution to the data.

Parameters:
  • distribution (str or scipy.stats distribution) – Distribution to fit. Can be: - String: ‘weibull’, ‘gamma’, ‘lognorm’, ‘norm’, etc. - SciPy distribution object: stats.weibull_min, stats.gamma, etc.

  • **kwargs (dict) – Additional arguments passed to the distribution’s fit method

Returns:

Processor instance with fitted distribution

Return type:

DataProcessor

get_fitted_params() Tuple[source]

Get the fitted distribution parameters.

Returns:

Fitted parameters of the distribution

Return type:

tuple

get_distribution_info() Dict[str, Any][source]

Get information about the fitted distribution.

Returns:

Dictionary with distribution information

Return type:

dict

get_auto_fitter(candidates=None, criterion='rmse')[source]

Create an AutoFitter instance for automatic distribution selection.

This method follows the factory pattern - it creates an AutoFitter instance when needed, allowing automatic testing of multiple distributions to find the best fit based on specified criteria.

Parameters:
  • candidates (list of str, optional) – List of distribution names to test. If None, uses default set including: ‘weibull_min’, ‘lognorm’, ‘gamma’, ‘norm’, ‘expon’, ‘rayleigh’, ‘chi2’, ‘beta’

  • criterion (str, default 'rmse') – Selection criterion (‘rmse’, ‘aic’, ‘bic’, ‘ks_pvalue’, ‘chi2_pvalue’)

Returns:

AutoFitter instance configured with this data

Return type:

AutoFitter

Examples

>>> import magica as ma
>>> import numpy as np
>>>
>>> # Load data
>>> data = np.random.weibull(2, 1000)
>>> processor = ma.read_data(data)
>>>
>>> # Get auto fitter and find best distribution
>>> auto_fitter = processor.get_auto_fitter()
>>> best_result = auto_fitter.fit_best_distribution()
>>> print(f"Best distribution: {best_result['distribution']}")
get_extremes_analyzer(times: ndarray | Series | DatetimeIndex | None = None, time_unit: str = 'years')[source]

Create an ExtremesAnalyzer instance for extreme value analysis.

This method provides access to return period and return value analysis for extreme events in time series data.

Parameters:
  • times (array-like, optional) – Time values corresponding to data points. Can be: - pandas DatetimeIndex - pandas Series with datetime values - numpy array of datetime64 - numpy array of numeric values (e.g., years) - None if data is pandas Series with datetime index

  • time_unit (str, default='years') – Unit for return period calculations. Options: ‘years’, ‘days’, ‘hours’, ‘months’

Returns:

ExtremesAnalyzer instance configured with this data

Return type:

ExtremesAnalyzer

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import magica as ma
>>>
>>> # Using pandas Series with datetime index
>>> dates = pd.date_range('2000-01-01', periods=1000, freq='D')
>>> values = np.random.weibull(2, 1000) * 10
>>> series = pd.Series(values, index=dates)
>>>
>>> processor = ma.read_data(series)
>>> extremes = processor.get_extremes_analyzer()
>>> extremes.fit_distribution('genextreme')
>>>
>>> # Calculate 100-year return value
>>> rv_100 = extremes.return_value(100)
>>>
>>> # Or provide times separately
>>> processor2 = ma.read_data(values)
>>> extremes2 = processor2.get_extremes_analyzer(times=dates)
__repr__() str[source]

String representation of the object.

__len__() int[source]

Return the length of the data.

__getattr__(name)[source]

Delegate method calls to the internal MagicAdjuster.

This allows direct access to all scipy.stats methods like: cdf, pdf, ppf, sf, isf, rvs, stats, etc.

Parameters:

name (str) – Name of the method/attribute to access

Returns:

Method or attribute from the fitted distribution

Return type:

Any

class magica.core.MagicAdjuster(data_processor: DataProcessor)[source]

Bases: object

Simple class for statistical distribution fitting using SciPy.

This class takes processed data and fits statistical distributions, designed to be extended with goodness-of-fit tests and advanced features.

__init__(data_processor: DataProcessor)[source]

Initialize the adjuster with processed data.

Parameters:

data_processor (DataProcessor) – Processor instance with loaded data

fit_distribution(distribution: str | object, **kwargs) MagicAdjuster[source]

Fit a statistical distribution to the data.

Parameters:
  • distribution (str or scipy.stats distribution) – Distribution to fit. Can be: - String: ‘weibull’, ‘gamma’, ‘lognorm’, ‘norm’, etc. - SciPy distribution object: stats.weibull_min, stats.gamma, etc.

  • **kwargs (dict) – Additional arguments passed to the distribution’s fit method

Returns:

Adjuster instance with fitted distribution

Return type:

MagicAdjuster

__getattr__(name)[source]

Delegate method calls to the fitted scipy distribution with smart defaults.

This allows direct access to all scipy.stats methods like: cdf, pdf, ppf, sf, isf, rvs, stats, etc.

For methods that commonly evaluate distributions at data points (pdf, cdf, sf, logpdf, logcdf, logsf), if no input is provided, the original data will be used automatically.

get_fitted_params() Tuple[source]

Get the fitted distribution parameters.

Returns:

Fitted parameters of the distribution

Return type:

tuple

get_distribution_info() Dict[str, Any][source]

Get information about the fitted distribution.

Returns:

Dictionary with distribution information

Return type:

dict

__repr__() str[source]

String representation of the object.

get_bin_number_sturges()[source]

Calculate the optimal number of bins using Sturges’ rule.

Sturges’ rule is a simple heuristic for determining the number of bins in a histogram. It assumes that the data follows a normal distribution and is best suited for smaller datasets.

Returns:

Number of bins calculated using Sturges’ rule

Return type:

int

get_bin_number_rice()[source]

Calculate the optimal number of bins using Rice’s rule.

Rice’s rule suggests a bin count that scales with the cube root of the dataset size. It is a simple alternative to Sturges’ rule and works well for larger datasets.

Returns:

Number of bins calculated using Rice’s rule

Return type:

int

get_bin_number_freedman_diaconis()[source]

Calculate the number of bins based on the Freedman-Diaconis rule.

This rule uses the interquartile range (IQR) to calculate bin width and is robust for skewed distributions.

Returns:

Number of bins calculated using Freedman-Diaconis rule

Return type:

int

get_bin_number_scott()[source]

Calculate the bin width based on Scott’s rule.

Scott’s rule minimizes the integrated mean squared error for normal distributions, but can also work for large datasets.

Returns:

Number of bins calculated using Scott’s rule

Return type:

int

get_bin_number_doane()[source]

Calculate the optimal number of bins using Doane’s formula.

Doane’s formula is an extension of Sturges’ rule that accounts for the skewness of the data distribution. This method is particularly useful when dealing with non-normal data distributions, as it adjusts the bin count based on the sample skewness.

Returns:

Number of bins calculated using Doane’s rule

Return type:

int

get_num_bins(bins='doane')[source]

Determines the number of bins for histogram plotting based on a chosen method.

Parameters:

bins (int or str, optional) –

The binning method to use. Default is ‘doane’.

  • Integer (e.g., 30): A fixed number of bins.

  • ’sturges’: Sturges’ rule (log-based, best for normal distributions).

  • ’freedman-diaconis’: Uses Freedman-Diaconis rule (best for skewed distributions).

  • ’rice’: Uses Rice’s rule (scales with cube root of dataset size).

  • ’scott’: Uses Scott’s rule (minimizes IMSE for normal distributions).

  • ’doane’: Uses Doane’s rule (extension of Sturges’ rule that accounts for the skewness of the data distribution).

Returns:

The computed number of bins.

Return type:

int

Raises:

ValueError – If an unsupported binning method is provided.

goodness_of_fit(method: str, bins: str | int = 'doane', warn_on_normalization: bool = True)[source]

Perform goodness-of-fit test on the fitted distribution.

Parameters:
  • method (str) – Test method: ‘chi2’, ‘ks’, ‘rmse’, ‘aic’, or ‘bic’

  • bins (int or str, default='doane') – Binning method for chi2 and rmse tests

  • warn_on_normalization (bool, default=True) – Whether to warn about frequency normalization in chi2 test

Returns:

Test results (dict for statistical tests, float for RMSE/AIC/BIC)

Return type:

dict or float

monte_carlo_fit(sizes: List[int] | None = None, n_repeats: int = 20, tests: List[str] = ['ks'], stability_method: str = 'kneedle', fig_output_path: str | None = None, plot_type: str = 'series', sampling: str = 'random', seed: int | None = None, min_size: int = 50, max_size: int | None = None, n_sizes: int = 10, distribution_params: Tuple | None = None, **kwargs)[source]

Perform Monte Carlo stability analysis for distribution fitting.

This method evaluates how stable distribution parameters and goodness-of-fit statistics are across different sample sizes, helping determine the minimum sample size needed for reliable parameter estimation.

Parameters:
  • sizes (list[int], optional) – Explicit list of sample sizes to test. If None, a grid of sizes between min_size and max_size is created.

  • n_repeats (int, default=20) – Number of subsamples (repetitions) for each size.

  • tests (list[str], default=['ks']) – Goodness-of-fit tests to perform. Options: ‘ks’, ‘chi2’, ‘rmse’. Recommendation: Always include ‘rmse’ for best stability detection.

  • stability_method (str, default='cv') – Method for detecting stability points. Options: - ‘cv’: Coefficient of variation approach (default, robust) - ‘kneedle’: Kneedle algorithm on median curves (best for RMSE) - ‘plateau’: Relative gain heuristic (good for converging parameters) - ‘aggregate’: Median-based detection with tolerance windows (legacy) - ‘detect’: Alias for ‘cv’ (backward compatibility) - None or ‘none’: Skip stability detection

  • fig_output_path (str, optional) – When provided, a 2x3 summary figure is generated and saved to this path. The Dataset attribute ‘figure_path’ will contain the path. If None (default) no figure is created (saves time in large runs).

  • plot_type ({'series','boxplots'}, default='series') – Visual style per panel. ‘series’ plots medians with IQR shading; ‘boxplots’ draws boxplots per size.

  • sampling ({'random','bootstrap','disjoint'}, default='random') – Sampling strategy for subsamples.

  • seed (int, optional) – Random seed for reproducibility (propagated to subsamples).

  • min_size (int, default=50) – Minimum sample size when auto-generating sizes.

  • max_size (int, optional) – Maximum sample size. Defaults to full dataset if None.

  • n_sizes (int, default=10) – Number of sizes in automatically generated grid.

  • distribution_params (tuple, optional) – If provided, these fixed parameters are used for all subsamples (no refitting). Use only for scenarios evaluating GOF with known parameters. If None (default), each subsample is refitted.

  • **kwargs

    Additional parameters:

    • binsint or str, default=’doane’

      Binning for chi-square test.

    • fit_kwargsdict

      Passed to fit_distribution for parameter constraints (e.g., fit_kwargs={‘floc’: 0}).

    Stability detection parameters (method-specific):

    For ‘cv’ method:
    • window_sizeint, default=len(sizes)//4

      Window size for CV consistency check

    • cv_thresholdfloat, default=0.1

      Maximum acceptable coefficient of variation (10%)

    For ‘kneedle’ method:
    • smoothbool, default=True

      Whether to smooth curves before detection

    • smoothing_methodstr, default=’savgol’

      Smoothing method: ‘savgol’ or ‘spline’

    For ‘plateau’ method:
    • consecutive_pointsint, default=3

      Number of consecutive points for plateau (L parameter)

    • relative_tolerancefloat, default=0.01

      Maximum relative change threshold (Δ parameter, 1%)

Returns:

Monte Carlo results with dimensions:

  • sizes : Sample sizes tested

  • repeats : Repetition index for each size

Data variables include:

  • param_0, param_1, … : Fitted distribution parameters

  • ks_statistic, ks_pvalue : Kolmogorov-Smirnov test results

  • chi2_statistic, chi2_pvalue : Chi-square test results

  • rmse : Root mean square error values

Attributes include:

  • distribution : Distribution name

  • original_data_size : Size of original dataset

  • sampling_method : Sampling strategy used

  • bins_method : Binning method for chi-square test

  • stability_method : Detection method used

  • stability_points : Detected stability points per variable

  • recommended_size : Recommended sample size (from primary metric)

  • primary_metric : Metric used for recommendation (typically ‘rmse’)

  • stable_pvalue_ks : KS p-value at stable size (if applicable)

  • stable_pvalue_chi2 : Chi-square p-value at stable size (if applicable)

  • stable_rmse : RMSE at stable size (if applicable)

  • figure_path : Path to saved 2x3 summary figure (only if generated)

  • created_by : Method identifier

Return type:

xarray.Dataset

Examples

>>> # Basic usage with recommended settings (includes RMSE!)
>>> results = adjuster.monte_carlo_fit(
...     n_repeats=50,
...     tests=['ks', 'chi2', 'rmse'],
...     stability_method='kneedle'  # Best for RMSE
... )
>>> # Access stability information
>>> stability = results.attrs['stability_points']
>>> recommended_size = results.attrs['recommended_size']
>>> print(f"RMSE stabilizes at n={stability['rmse']['size']}")
>>> # Use fit constraints (e.g., fix location for Weibull)
>>> results = adjuster.monte_carlo_fit(
...     n_repeats=100,
...     tests=['rmse', 'ks'],
...     stability_method='plateau',
...     fit_kwargs={'floc': 0}
... )
>>> # Use pre-calculated parameters (bypass fitting entirely)
>>> known_params = (2.0, 0.0, 1.0)  # shape, loc, scale
>>> results = adjuster.monte_carlo_fit(
...     distribution_params=known_params,
...     n_repeats=50,
...     tests=['chi2', 'ks']
... )
>>> # Generate figure with automatic stability detection
>>> results = adjuster.monte_carlo_fit(
...     tests=['ks', 'rmse'],
...     stability_method='kneedle',
...     fig_output_path='mc_analysis.png'
... )

Notes

Fitting Strategy:

By default, each subsample gets an independent fit to evaluate how parameter estimates change with sample size. This is the correct approach for stability analysis since parameter estimation quality depends on sample size.

Stability Detection Methods:

  1. CV (Coefficient of Variation): Default method, robust and works for all variables. Looks for when std/mean becomes consistently small.

  2. Kneedle: Best for RMSE and converging parameters. Finds the “elbow” point where improvement rate changes. Requires curve smoothing.

  3. Plateau: Good for metrics with clear convergence. Uses relative gain heuristic (Δᵢ = |y(i-1) - y(i)| / |y(i-1)|).

Choosing the Right Method:

  • Use ‘kneedle’ with RMSE for clearest stability detection

  • Use ‘cv’ for general-purpose analysis

  • Use ‘plateau’ when you expect clear convergence behavior

Parameter Constraints:

Use fit_kwargs to impose constraints during fitting (e.g., floc=0 for Weibull distributions). These constraints apply to all subsample fits.

Pre-calculated Parameters:

Use distribution_params only when you want to evaluate goodness-of-fit with known/fixed parameters across all sample sizes, bypassing fitting entirely.

class magica.core.AutoFitter(data_processor: DataProcessor, candidates: List[str] | None = None, criterion: str = 'rmse')[source]

Bases: object

Automatic distribution fitting class with model selection capabilities.

This class automatically tests multiple probability distributions and selects the best-fitting one based on specified criteria (default: RMSE).

Uses lazy initialization pattern - MagicAdjuster instances are created only when needed for each distribution candidate.

Parameters:
  • data_processor (DataProcessor) – Processor instance with loaded data

  • candidates (list of str, optional) – List of distribution names to test. If None, uses default set.

  • criterion (str, default 'rmse') – Selection criterion (‘rmse’, ‘aic’, ‘bic’, ‘ks_pvalue’, ‘chi2_pvalue’)

Examples

>>> import magica as ma
>>> import numpy as np
>>>
>>> # Load wind speed data
>>> data = np.random.weibull(2, 1000) * 8 + 2
>>> processor = ma.read_data(data)
>>>
>>> # Auto-fit best distribution
>>> auto_fitter = processor.get_auto_fitter()
>>> best_result = auto_fitter.fit_best_distribution()
>>>
>>> print(f"Best distribution: {best_result['distribution']}")
>>> print(f"RMSE: {best_result['rmse']:.4f}")
__init__(data_processor: DataProcessor, candidates: List[str] | None = None, criterion: str = 'rmse')[source]

Initialize AutoFitter with data processor and configuration.

Parameters:
  • data_processor (DataProcessor) – The data processor containing the dataset to fit

  • candidates (list of str, optional) – Distribution names to test. Default includes common distributions.

  • criterion (str, default 'rmse') – Selection criterion for best distribution

fit_single_distribution(distribution: str, **fit_kwargs) Dict[str, Any][source]

Fit a single distribution and calculate all metrics.

Parameters:
  • distribution (str) – Distribution name to fit

  • **fit_kwargs (dict) – Additional arguments passed to fit method

Returns:

Comprehensive results for this distribution

Return type:

dict

fit_all_distributions(**fit_kwargs) Dict[str, Dict[str, Any]][source]

Fit all candidate distributions and return comprehensive results.

Parameters:

**fit_kwargs (dict) – Additional arguments passed to all fit methods

Returns:

Results for all distributions, keyed by distribution name

Return type:

dict

fit_best_distribution(**fit_kwargs) Dict[str, Any][source]

Automatically find and fit the best distribution based on criterion.

Parameters:

**fit_kwargs (dict) – Additional arguments passed to fit methods

Returns:

Results for the best-fitting distribution

Return type:

dict

get_comparison_table(sort_by: str | None = None) Dict[str, Dict[str, Any]][source]

Get a formatted comparison table of all fitted distributions.

Parameters:

sort_by (str, optional) – Metric to sort by. If None, uses the selection criterion.

Returns:

Sorted results table

Return type:

dict

get_best_adjuster() MagicAdjuster[source]

Get the MagicAdjuster instance for the best-fitting distribution.

Returns:

The adjuster fitted with the best distribution

Return type:

MagicAdjuster

static get_all_available_distributions()[source]

Get all distribution names available in MagicAdjuster.

Returns:

Sorted list of all available distribution names

Return type:

list

__repr__() str[source]

String representation of the AutoFitter.

class magica.core.ExtremesAnalyzer(data_processor: DataProcessor, times: ndarray | Series | DatetimeIndex | None = None, time_unit: str = 'years')[source]

Bases: object

Extreme values analysis with return period and return value calculations.

This class analyzes extreme values in time series data, calculating return periods and return values using statistical distributions fitted to the data.

The analyzer supports multiple input formats: - Pandas Series with datetime index - Pandas DataFrame with time column - Paired numpy arrays (times, values) - Simple numpy array (assumes uniform time spacing)

Parameters:
  • data_processor (DataProcessor) – Processor instance with loaded data

  • times (array-like, optional) – Time values corresponding to data points. Can be: - datetime array/Series - numeric array (e.g., years, days) - None if data is pandas Series with datetime index

  • time_unit (str, default='years') – Time unit for return period calculations (‘years’, ‘days’, ‘hours’, ‘months’)

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import magica as ma
>>>
>>> # Using pandas Series with datetime index
>>> dates = pd.date_range('2000-01-01', periods=1000, freq='D')
>>> values = np.random.weibull(2, 1000) * 10
>>> series = pd.Series(values, index=dates)
>>>
>>> processor = ma.read_data(series)
>>> extremes = processor.get_extremes_analyzer()
>>> extremes.fit_distribution('genextreme')
>>>
>>> # Calculate 100-year return value
>>> rv_100 = extremes.return_value(100)
>>> print(f"100-year return value: {rv_100:.2f}")
__init__(data_processor: DataProcessor, times: ndarray | Series | DatetimeIndex | None = None, time_unit: str = 'years')[source]

Initialize ExtremesAnalyzer with data and time information.

Parameters:
  • data_processor (DataProcessor) – The data processor containing values to analyze

  • times (array-like, optional) – Time values or datetime index

  • time_unit (str, default='years') – Unit for return period calculations

fit_distribution(distribution: str | object, **kwargs) ExtremesAnalyzer[source]

Fit a statistical distribution for extreme value analysis.

This method can be called directly on ExtremesAnalyzer to fit distributions to the extreme data, without requiring a prior fit on the underlying DataProcessor.

Common distributions for extremes: - ‘genextreme’ - Generalized Extreme Value (GEV) - ‘gumbel_r’ - Gumbel distribution (Type I extreme) - ‘gumbel_l’ - Gumbel left (minimum extremes) - ‘weibull_min’ - Weibull (minimum extremes) - ‘weibull_max’ - Weibull (maximum extremes) - ‘genpareto’ - Generalized Pareto Distribution (GPD, for POT)

Parameters:
  • distribution (str or scipy.stats distribution) – Distribution to fit

  • **kwargs (dict) – Additional arguments passed to fit method

Returns:

Self for method chaining

Return type:

ExtremesAnalyzer

Examples

>>> # Direct fitting on extremes analyzer
>>> extremes = processor.get_extremes_analyzer()
>>> annual_max, times = extremes.extract_block_maxima('Y')
>>>
>>> # Create new processor with maxima and fit
>>> maxima_proc = ma.read_data(annual_max)
>>> maxima_extremes = maxima_proc.get_extremes_analyzer()
>>> maxima_extremes.fit_distribution('genextreme')
>>> rv_100 = maxima_extremes.return_value(100)
return_value(return_period: float | ndarray) float | ndarray[source]

Calculate return value for given return period(s).

The return value is the value expected to be exceeded once every T time units on average, where T is the return period.

Parameters:

return_period (float or array-like) – Return period(s) in time_unit units (e.g., years)

Returns:

Return value(s) corresponding to the return period(s)

Return type:

float or ndarray

Examples

>>> # Single return value
>>> rv_100 = extremes.return_value(100)  # 100-year return value
>>>
>>> # Multiple return values
>>> periods = [10, 50, 100, 500]
>>> rv = extremes.return_value(periods)
return_period(value: float | ndarray) float | ndarray[source]

Calculate return period for given value(s).

The return period is the average time interval between exceedances of the given value.

Parameters:

value (float or array-like) – Value(s) for which to calculate return period

Returns:

Return period(s) in time_unit units

Return type:

float or ndarray

Examples

>>> # Return period for specific value
>>> rp = extremes.return_period(25.0)
>>> print(f"A value of 25 has a return period of {rp:.1f} years")
>>>
>>> # Return periods for multiple values
>>> values = [20, 25, 30, 35]
>>> rp = extremes.return_period(values)
ppf(q: float | ndarray) float | ndarray[source]

Percent point function (inverse of CDF) of the fitted distribution.

Parameters:

q (float or array-like) – Probability value(s) between 0 and 1

Returns:

Quantile(s) corresponding to the probability value(s)

Return type:

float or ndarray

cdf(x: float | ndarray) float | ndarray[source]

Cumulative distribution function of the fitted distribution.

Parameters:

x (float or array-like) – Value(s) at which to evaluate the CDF

Returns:

Probability value(s) corresponding to x

Return type:

float or ndarray

pdf(x: float | ndarray) float | ndarray[source]

Probability density function of the fitted distribution.

Parameters:

x (float or array-like) – Value(s) at which to evaluate the PDF

Returns:

Probability density value(s) corresponding to x

Return type:

float or ndarray

goodness_of_fit(method: str, **kwargs)[source]

Perform goodness-of-fit test for the fitted distribution.

Parameters:
  • method (str) – Test method: ‘ks’ (Kolmogorov-Smirnov), ‘chi2’ (Chi-square), ‘ad’ (Anderson-Darling), or ‘rmse’ (Root Mean Square Error)

  • **kwargs (dict) – Additional arguments passed to the goodness-of-fit method

Returns:

Test results (dict for statistical tests, float for RMSE)

Return type:

dict or float

Examples

>>> extremes.fit_distribution('genextreme')
>>> ks_test = extremes.goodness_of_fit('ks')
>>> print(f"KS statistic: {ks_test['ks_statistic']:.4f}")
>>> print(f"P-value: {ks_test['p_value']:.4f}")
extract_block_maxima(block_size: str = 'YE', method: str = 'max') Tuple[ndarray, DatetimeIndex | None][source]

Extract block maxima (or minima) from time series.

This is commonly used for GEV analysis where annual maxima are extracted from the data.

Parameters:
  • block_size (str, default='YE') – Block size for resampling. Uses pandas offset aliases: - ‘YE’ or ‘Y’: Annual - ‘QE’: Quarterly - ‘ME’: Monthly - ‘W’: Weekly - ‘D’: Daily

  • method (str, default='max') – Aggregation method: ‘max’ or ‘min’

Returns:

  • values (ndarray) – Block maxima/minima values

  • times (DatetimeIndex or None) – Block center times (if datetime available)

Examples

>>> # Extract annual maxima
>>> annual_max, times = extremes.extract_block_maxima(block_size='YE')
>>>
>>> # Extract monthly minima
>>> monthly_min, times = extremes.extract_block_maxima(block_size='M', method='min')
peaks_over_threshold(threshold: float, min_separation: str | Timedelta | int | float | None = None, event_wise: bool = False) Tuple[ndarray, DatetimeIndex | ndarray | None][source]

Extract peaks over threshold (POT) from time series.

This method is used for GPD (Generalized Pareto Distribution) analysis.

Parameters:
  • threshold (float) – Threshold value for peak detection

  • min_separation (str, Timedelta, int, or float, optional) – Minimum time separation between peaks to avoid clustering. Can be: - String: ‘1D’, ‘12H’, etc. (pandas time string) - pd.Timedelta: pd.Timedelta(days=1) - Numeric: interpreted as days (e.g., 3 means 3 days) - ‘event’: Use event-wise declustering (requires event_wise=True) If None, all exceedances are returned.

  • event_wise (bool, default=False) – If True, identifies consecutive periods above threshold as single events and returns only the maximum value from each event. This is useful for identifying storm events where multiple consecutive days exceed the threshold. When True, min_separation is ignored.

Returns:

  • exceedances (ndarray) – Values exceeding the threshold

  • times (DatetimeIndex or ndarray or None) – Times of exceedances

Examples

>>> # Extract all peaks over threshold
>>> peaks, times = extremes.peaks_over_threshold(threshold=20.0)
>>>
>>> # Extract peaks with minimum 1-day separation (numeric)
>>> peaks, times = extremes.peaks_over_threshold(threshold=20.0, min_separation=1)
>>>
>>> # Extract peaks with minimum 1-day separation (string)
>>> peaks, times = extremes.peaks_over_threshold(
...     threshold=20.0,
...     min_separation='1D'
... )
>>>
>>> # Extract one peak per consecutive event (storm-wise)
>>> peaks, times = extremes.peaks_over_threshold(threshold=20.0, event_wise=True)
get_summary_statistics() Dict[str, Any][source]

Get summary statistics for extreme value analysis.

Returns:

Dictionary with summary statistics including: - n_observations: Number of data points - time_span: Total time span in time_unit units - time_unit: Time unit used for calculations - max: Maximum value in dataset - min: Minimum value in dataset - mean: Mean value - std: Standard deviation - percentile_95: 95th percentile - percentile_99: 99th percentile - has_datetime: Whether datetime information is available - distribution_name: Fitted distribution name (if fitted) - distribution_params: Fitted parameters (if fitted)

Return type:

dict

plot_return_levels(ax=None, return_periods: ndarray | None = None, empirical: bool = True, confidence_level: float | None = None, title: str | None = None)[source]

Plot return level plot (return value vs return period).

Parameters:
  • ax (matplotlib.axes.Axes, optional) – Axes object to plot on. If None, creates a new figure and axes.

  • return_periods (array-like, optional) – Return periods to plot. If None, uses logarithmic spacing.

  • empirical (bool, default=True) – Whether to include empirical return values

  • confidence_level (float, optional) – Confidence level for confidence intervals (e.g., 0.95)

  • title (str, optional) – Plot title. If None, uses default title.

Returns:

  • fig (matplotlib.figure.Figure) – Figure object (None if ax was provided)

  • ax (matplotlib.axes.Axes) – Axes object

Examples

>>> # Create standalone plot
>>> fig, ax = extremes.plot_return_levels()
>>>
>>> # Plot on existing axes
>>> fig, ax = plt.subplots()
>>> extremes.plot_return_levels(ax=ax, title='Custom Title')
find_optimal_pot_threshold(min_samples: int = 50, percentile_min: float = 90, percentile_max: float = 99, percentile_step: float = 1.0, min_separation_hours: float = 48, max_separation_hours: float = 120, separation_step_hours: float = 24, vary_first: str = 'percentile', max_iterations: int = 200, verbose: bool = False) Dict[str, Any][source]

Find optimal POT threshold ensuring sufficient independent samples.

This function systematically searches for a threshold that provides at least min_samples independent exceedances after declustering. The search can prioritize varying percentiles or time separation first.

Parameters:
  • min_samples (int, default=50) – Minimum number of independent exceedances required

  • percentile_min (float, default=90) – Minimum percentile to test (lower bound)

  • percentile_max (float, default=99) – Maximum percentile to test (upper bound, starting point)

  • percentile_step (float, default=1.0) – Step size for decreasing percentile

  • min_separation_hours (float, default=48) – Minimum time separation between peaks in hours (starting point)

  • max_separation_hours (float, default=120) – Maximum time separation to test in hours

  • separation_step_hours (float, default=24) – Step size for increasing separation in hours

  • vary_first (str, default='percentile') – Which parameter to vary first: ‘percentile’ or ‘separation’ - ‘percentile’: Varies percentile from max to min, then increases separation - ‘separation’: Varies separation from min to max, then decreases percentile

  • max_iterations (int, default=200) – Maximum total search iterations to prevent infinite loops

  • verbose (bool, default=False) – Print progress information during search

Returns:

Dictionary with keys: - ‘threshold’: Selected threshold value (float) - ‘percentile’: Percentile of threshold (float) - ‘separation_hours’: Time separation used in hours (float) - ‘n_raw_exceedances’: Total exceedances before declustering (int) - ‘n_independent’: Independent exceedances after declustering (int) - ‘exceedances’: Array of independent exceedance values (ndarray) - ‘exceedance_times’: Times of independent exceedances (DatetimeIndex or ndarray) - ‘success’: Whether minimum samples achieved (bool) - ‘iterations’: Number of iterations performed (int) - ‘warning’: Warning message if any (str, optional)

Return type:

dict

Examples

>>> # Default search (vary percentile first, 99->90, then increase separation)
>>> result = extremes.find_optimal_pot_threshold(min_samples=50)
>>> print(f"Threshold: {result['threshold']:.2f}")
>>> print(f"Independent samples: {result['n_independent']}")
>>>
>>> # Vary separation first, then percentile
>>> result = extremes.find_optimal_pot_threshold(
...     min_samples=50,
...     vary_first='separation'
... )
>>>
>>> # Custom percentile and separation ranges
>>> result = extremes.find_optimal_pot_threshold(
...     min_samples=30,
...     percentile_min=85,
...     percentile_max=98,
...     min_separation_hours=24,
...     max_separation_hours=96,
...     verbose=True
... )

Notes

Search Strategy:

When vary_first=’percentile’ (default): 1. Start with min_separation_hours and percentile_max 2. Decrease percentile by percentile_step until percentile_min 3. If not enough samples, increase separation by separation_step_hours 4. Repeat percentile search with new separation 5. Continue until max_separation_hours or min_samples achieved

When vary_first=’separation’: 1. Start with percentile_max and min_separation_hours 2. Increase separation by separation_step_hours until max_separation_hours 3. If not enough samples, decrease percentile by percentile_step 4. Repeat separation search with new percentile 5. Continue until percentile_min or min_samples achieved

Best Practices: - For synoptic events (storms): min_separation_hours >= 48 - For sub-daily events: min_separation_hours >= 12 - Typical percentile range: 90-99 for extremes - Ensure min_samples >= 30 for reliable fitting (50+ recommended)

static plot_directional_return_values(directional_results: Dict[str, Dict[str, Any]], return_periods: list = None, colors: list = None, overlay: bool = False, show_values: bool = True, figsize: tuple = None, title: str = None) tuple[source]

Create polar plots of return values by direction.

This method creates polar plots showing how return values vary by wind direction for different return periods. Can create separate subplots for each return period or overlay all periods in a single plot.

Parameters:
  • directional_results (dict) – Dictionary with sector names as keys and result dictionaries as values. Each result dictionary must contain: - ‘center_deg’: Center angle of sector in degrees - ‘return_values’: Dictionary mapping return periods to values - ‘success’: Boolean indicating if fit was successful

  • return_periods (list of int, optional) – Return periods to plot (max 4). If None, uses [10, 20, 50, 100]. Examples: [10, 50], [1, 5, 10, 20], [100]

  • colors (list of str, optional) – Colors for each return period. If None, uses default palette. Must have same length as return_periods.

  • overlay (bool, default=False) – If True, plot all return periods on single polar plot. If False, create separate subplot for each return period.

  • show_values (bool, default=True) – If True, display numeric values on the plot.

  • figsize (tuple, optional) – Figure size (width, height). If None, automatically determined.

  • title (str, optional) – Main figure title. If None, uses default title.

Returns:

  • fig (matplotlib.figure.Figure) – The figure object

  • axes (matplotlib.axes.Axes or array of Axes) – The axes object(s)

Raises:

ValueError – If return_periods has more than 4 values If colors length doesn’t match return_periods length If directional_results is empty or invalid

Examples

>>> # Assuming you have fitted distributions for each sector
>>> directional_results = {}
>>> for sector in sectors:
...     # ... fit distribution and calculate return values ...
...     directional_results[sector] = {
...         'center_deg': center_angle,
...         'return_values': {10: rv10, 50: rv50, 100: rv100},
...         'success': True
...     }
>>>
>>> # Create separate subplots for each return period
>>> fig, axes = extremes.plot_directional_return_values(
...     directional_results,
...     return_periods=[10, 50, 100]
... )
>>>
>>> # Create single overlay plot
>>> fig, ax = extremes.plot_directional_return_values(
...     directional_results,
...     return_periods=[10, 50, 100],
...     overlay=True
... )
>>>
>>> # Customize colors and appearance
>>> fig, axes = extremes.plot_directional_return_values(
...     directional_results,
...     return_periods=[20, 100],
...     colors=['#FF6B6B', '#4ECDC4'],
...     figsize=(12, 6)
... )
>>> plt.show()

Notes

  • Maximum 4 return periods can be plotted

  • If return_periods <= 2 and not overlay, creates single row of subplots

  • If return_periods == 1 and not overlay, creates single subplot

  • In overlay mode, all periods shown on one polar plot with legend

  • Automatically handles missing/failed sectors

__repr__() str[source]

String representation of the analyzer.

DataProcessor

The DataProcessor class handles data loading, validation, and provides access to distribution fitting capabilities.

class magica.core.DataProcessor(data: ndarray | list | Series | DataFrame = None)[source]

Bases: object

Simple class for loading and basic processing of numerical data.

This class provides basic methods to load and validate numerical data for statistical analysis, converting data to numpy arrays.

__init__(data: ndarray | list | Series | DataFrame = None)[source]

Initialize the data processor.

Parameters:

data (array-like, optional) – Data to be processed. Can be: - numpy array - list of values - pandas Series - pandas DataFrame (will be flattened)

load_data(data: ndarray | list | Series | DataFrame) DataProcessor[source]

Load data and convert to numpy array.

Parameters:

data (array-like) – Data to be loaded. Can be: - numpy array - list of values - pandas Series - pandas DataFrame (will be flattened)

Returns:

Processor instance with loaded data

Return type:

DataProcessor

get_data_array() ndarray[source]

Return data as 1D numpy array.

Returns:

1D numpy array with the data

Return type:

ndarray

get_basic_stats() Dict[str, Any][source]

Return basic descriptive statistics of the data.

Returns:

Dictionary with basic statistics

Return type:

dict

fit_distribution(distribution: str | object, **kwargs) DataProcessor[source]

Fit a statistical distribution to the data.

Parameters:
  • distribution (str or scipy.stats distribution) – Distribution to fit. Can be: - String: ‘weibull’, ‘gamma’, ‘lognorm’, ‘norm’, etc. - SciPy distribution object: stats.weibull_min, stats.gamma, etc.

  • **kwargs (dict) – Additional arguments passed to the distribution’s fit method

Returns:

Processor instance with fitted distribution

Return type:

DataProcessor

get_fitted_params() Tuple[source]

Get the fitted distribution parameters.

Returns:

Fitted parameters of the distribution

Return type:

tuple

get_distribution_info() Dict[str, Any][source]

Get information about the fitted distribution.

Returns:

Dictionary with distribution information

Return type:

dict

get_auto_fitter(candidates=None, criterion='rmse')[source]

Create an AutoFitter instance for automatic distribution selection.

This method follows the factory pattern - it creates an AutoFitter instance when needed, allowing automatic testing of multiple distributions to find the best fit based on specified criteria.

Parameters:
  • candidates (list of str, optional) – List of distribution names to test. If None, uses default set including: ‘weibull_min’, ‘lognorm’, ‘gamma’, ‘norm’, ‘expon’, ‘rayleigh’, ‘chi2’, ‘beta’

  • criterion (str, default 'rmse') – Selection criterion (‘rmse’, ‘aic’, ‘bic’, ‘ks_pvalue’, ‘chi2_pvalue’)

Returns:

AutoFitter instance configured with this data

Return type:

AutoFitter

Examples

>>> import magica as ma
>>> import numpy as np
>>>
>>> # Load data
>>> data = np.random.weibull(2, 1000)
>>> processor = ma.read_data(data)
>>>
>>> # Get auto fitter and find best distribution
>>> auto_fitter = processor.get_auto_fitter()
>>> best_result = auto_fitter.fit_best_distribution()
>>> print(f"Best distribution: {best_result['distribution']}")
get_extremes_analyzer(times: ndarray | Series | DatetimeIndex | None = None, time_unit: str = 'years')[source]

Create an ExtremesAnalyzer instance for extreme value analysis.

This method provides access to return period and return value analysis for extreme events in time series data.

Parameters:
  • times (array-like, optional) – Time values corresponding to data points. Can be: - pandas DatetimeIndex - pandas Series with datetime values - numpy array of datetime64 - numpy array of numeric values (e.g., years) - None if data is pandas Series with datetime index

  • time_unit (str, default='years') – Unit for return period calculations. Options: ‘years’, ‘days’, ‘hours’, ‘months’

Returns:

ExtremesAnalyzer instance configured with this data

Return type:

ExtremesAnalyzer

Examples

>>> import numpy as np
>>> import pandas as pd
>>> import magica as ma
>>>
>>> # Using pandas Series with datetime index
>>> dates = pd.date_range('2000-01-01', periods=1000, freq='D')
>>> values = np.random.weibull(2, 1000) * 10
>>> series = pd.Series(values, index=dates)
>>>
>>> processor = ma.read_data(series)
>>> extremes = processor.get_extremes_analyzer()
>>> extremes.fit_distribution('genextreme')
>>>
>>> # Calculate 100-year return value
>>> rv_100 = extremes.return_value(100)
>>>
>>> # Or provide times separately
>>> processor2 = ma.read_data(values)
>>> extremes2 = processor2.get_extremes_analyzer(times=dates)
__repr__() str[source]

String representation of the object.

__len__() int[source]

Return the length of the data.

__getattr__(name)[source]

Delegate method calls to the internal MagicAdjuster.

This allows direct access to all scipy.stats methods like: cdf, pdf, ppf, sf, isf, rvs, stats, etc.

Parameters:

name (str) – Name of the method/attribute to access

Returns:

Method or attribute from the fitted distribution

Return type:

Any

Automatic Distribution Selection

DataProcessor provides the get_auto_fitter() method to create an AutoFitter instance for automatic distribution selection:

import numpy as np
import magica as ma

# Load data
data = np.random.weibull(2, 1000)
processor = ma.read_data(data)

# Get AutoFitter for automatic distribution selection
auto_fitter = processor.get_auto_fitter(criterion='rmse')

# Find best distribution
best = auto_fitter.fit_best_distribution()
print(f"Best distribution: {best['distribution']}")

For complete AutoFitter documentation, see AutoFitter - Automatic Distribution Selection.

Extreme Value Analysis

DataProcessor also provides the get_extremes_analyzer() method to create an ExtremesAnalyzer instance for return period and return value analysis:

import pandas as pd
import magica as ma

# Create time series with datetime index
dates = pd.date_range('1980-01-01', '2023-12-31', freq='D')
wind_speeds = [...]  # your data
series = pd.Series(wind_speeds, index=dates)

# Load data and create extremes analyzer
processor = ma.read_data(series)
extremes = processor.get_extremes_analyzer(time_unit='years')

# Fit GEV distribution and calculate return values
extremes.fit_distribution('genextreme')
rv_100 = extremes.return_value(100)  # 100-year return value
print(f"100-year return value: {rv_100:.2f}")

For complete ExtremesAnalyzer documentation, see Extreme Values Analysis.

MagicAdjuster

The MagicAdjuster class is the central component for statistical distribution fitting and Monte Carlo stability analysis.

class magica.core.MagicAdjuster(data_processor: DataProcessor)[source]

Bases: object

Simple class for statistical distribution fitting using SciPy.

This class takes processed data and fits statistical distributions, designed to be extended with goodness-of-fit tests and advanced features.

__init__(data_processor: DataProcessor)[source]

Initialize the adjuster with processed data.

Parameters:

data_processor (DataProcessor) – Processor instance with loaded data

fit_distribution(distribution: str | object, **kwargs) MagicAdjuster[source]

Fit a statistical distribution to the data.

Parameters:
  • distribution (str or scipy.stats distribution) – Distribution to fit. Can be: - String: ‘weibull’, ‘gamma’, ‘lognorm’, ‘norm’, etc. - SciPy distribution object: stats.weibull_min, stats.gamma, etc.

  • **kwargs (dict) – Additional arguments passed to the distribution’s fit method

Returns:

Adjuster instance with fitted distribution

Return type:

MagicAdjuster

__getattr__(name)[source]

Delegate method calls to the fitted scipy distribution with smart defaults.

This allows direct access to all scipy.stats methods like: cdf, pdf, ppf, sf, isf, rvs, stats, etc.

For methods that commonly evaluate distributions at data points (pdf, cdf, sf, logpdf, logcdf, logsf), if no input is provided, the original data will be used automatically.

get_fitted_params() Tuple[source]

Get the fitted distribution parameters.

Returns:

Fitted parameters of the distribution

Return type:

tuple

get_distribution_info() Dict[str, Any][source]

Get information about the fitted distribution.

Returns:

Dictionary with distribution information

Return type:

dict

__repr__() str[source]

String representation of the object.

get_bin_number_sturges()[source]

Calculate the optimal number of bins using Sturges’ rule.

Sturges’ rule is a simple heuristic for determining the number of bins in a histogram. It assumes that the data follows a normal distribution and is best suited for smaller datasets.

Returns:

Number of bins calculated using Sturges’ rule

Return type:

int

get_bin_number_rice()[source]

Calculate the optimal number of bins using Rice’s rule.

Rice’s rule suggests a bin count that scales with the cube root of the dataset size. It is a simple alternative to Sturges’ rule and works well for larger datasets.

Returns:

Number of bins calculated using Rice’s rule

Return type:

int

get_bin_number_freedman_diaconis()[source]

Calculate the number of bins based on the Freedman-Diaconis rule.

This rule uses the interquartile range (IQR) to calculate bin width and is robust for skewed distributions.

Returns:

Number of bins calculated using Freedman-Diaconis rule

Return type:

int

get_bin_number_scott()[source]

Calculate the bin width based on Scott’s rule.

Scott’s rule minimizes the integrated mean squared error for normal distributions, but can also work for large datasets.

Returns:

Number of bins calculated using Scott’s rule

Return type:

int

get_bin_number_doane()[source]

Calculate the optimal number of bins using Doane’s formula.

Doane’s formula is an extension of Sturges’ rule that accounts for the skewness of the data distribution. This method is particularly useful when dealing with non-normal data distributions, as it adjusts the bin count based on the sample skewness.

Returns:

Number of bins calculated using Doane’s rule

Return type:

int

get_num_bins(bins='doane')[source]

Determines the number of bins for histogram plotting based on a chosen method.

Parameters:

bins (int or str, optional) –

The binning method to use. Default is ‘doane’.

  • Integer (e.g., 30): A fixed number of bins.

  • ’sturges’: Sturges’ rule (log-based, best for normal distributions).

  • ’freedman-diaconis’: Uses Freedman-Diaconis rule (best for skewed distributions).

  • ’rice’: Uses Rice’s rule (scales with cube root of dataset size).

  • ’scott’: Uses Scott’s rule (minimizes IMSE for normal distributions).

  • ’doane’: Uses Doane’s rule (extension of Sturges’ rule that accounts for the skewness of the data distribution).

Returns:

The computed number of bins.

Return type:

int

Raises:

ValueError – If an unsupported binning method is provided.

goodness_of_fit(method: str, bins: str | int = 'doane', warn_on_normalization: bool = True)[source]

Perform goodness-of-fit test on the fitted distribution.

Parameters:
  • method (str) – Test method: ‘chi2’, ‘ks’, ‘rmse’, ‘aic’, or ‘bic’

  • bins (int or str, default='doane') – Binning method for chi2 and rmse tests

  • warn_on_normalization (bool, default=True) – Whether to warn about frequency normalization in chi2 test

Returns:

Test results (dict for statistical tests, float for RMSE/AIC/BIC)

Return type:

dict or float

monte_carlo_fit(sizes: List[int] | None = None, n_repeats: int = 20, tests: List[str] = ['ks'], stability_method: str = 'kneedle', fig_output_path: str | None = None, plot_type: str = 'series', sampling: str = 'random', seed: int | None = None, min_size: int = 50, max_size: int | None = None, n_sizes: int = 10, distribution_params: Tuple | None = None, **kwargs)[source]

Perform Monte Carlo stability analysis for distribution fitting.

This method evaluates how stable distribution parameters and goodness-of-fit statistics are across different sample sizes, helping determine the minimum sample size needed for reliable parameter estimation.

Parameters:
  • sizes (list[int], optional) – Explicit list of sample sizes to test. If None, a grid of sizes between min_size and max_size is created.

  • n_repeats (int, default=20) – Number of subsamples (repetitions) for each size.

  • tests (list[str], default=['ks']) – Goodness-of-fit tests to perform. Options: ‘ks’, ‘chi2’, ‘rmse’. Recommendation: Always include ‘rmse’ for best stability detection.

  • stability_method (str, default='cv') – Method for detecting stability points. Options: - ‘cv’: Coefficient of variation approach (default, robust) - ‘kneedle’: Kneedle algorithm on median curves (best for RMSE) - ‘plateau’: Relative gain heuristic (good for converging parameters) - ‘aggregate’: Median-based detection with tolerance windows (legacy) - ‘detect’: Alias for ‘cv’ (backward compatibility) - None or ‘none’: Skip stability detection

  • fig_output_path (str, optional) – When provided, a 2x3 summary figure is generated and saved to this path. The Dataset attribute ‘figure_path’ will contain the path. If None (default) no figure is created (saves time in large runs).

  • plot_type ({'series','boxplots'}, default='series') – Visual style per panel. ‘series’ plots medians with IQR shading; ‘boxplots’ draws boxplots per size.

  • sampling ({'random','bootstrap','disjoint'}, default='random') – Sampling strategy for subsamples.

  • seed (int, optional) – Random seed for reproducibility (propagated to subsamples).

  • min_size (int, default=50) – Minimum sample size when auto-generating sizes.

  • max_size (int, optional) – Maximum sample size. Defaults to full dataset if None.

  • n_sizes (int, default=10) – Number of sizes in automatically generated grid.

  • distribution_params (tuple, optional) – If provided, these fixed parameters are used for all subsamples (no refitting). Use only for scenarios evaluating GOF with known parameters. If None (default), each subsample is refitted.

  • **kwargs

    Additional parameters:

    • binsint or str, default=’doane’

      Binning for chi-square test.

    • fit_kwargsdict

      Passed to fit_distribution for parameter constraints (e.g., fit_kwargs={‘floc’: 0}).

    Stability detection parameters (method-specific):

    For ‘cv’ method:
    • window_sizeint, default=len(sizes)//4

      Window size for CV consistency check

    • cv_thresholdfloat, default=0.1

      Maximum acceptable coefficient of variation (10%)

    For ‘kneedle’ method:
    • smoothbool, default=True

      Whether to smooth curves before detection

    • smoothing_methodstr, default=’savgol’

      Smoothing method: ‘savgol’ or ‘spline’

    For ‘plateau’ method:
    • consecutive_pointsint, default=3

      Number of consecutive points for plateau (L parameter)

    • relative_tolerancefloat, default=0.01

      Maximum relative change threshold (Δ parameter, 1%)

Returns:

Monte Carlo results with dimensions:

  • sizes : Sample sizes tested

  • repeats : Repetition index for each size

Data variables include:

  • param_0, param_1, … : Fitted distribution parameters

  • ks_statistic, ks_pvalue : Kolmogorov-Smirnov test results

  • chi2_statistic, chi2_pvalue : Chi-square test results

  • rmse : Root mean square error values

Attributes include:

  • distribution : Distribution name

  • original_data_size : Size of original dataset

  • sampling_method : Sampling strategy used

  • bins_method : Binning method for chi-square test

  • stability_method : Detection method used

  • stability_points : Detected stability points per variable

  • recommended_size : Recommended sample size (from primary metric)

  • primary_metric : Metric used for recommendation (typically ‘rmse’)

  • stable_pvalue_ks : KS p-value at stable size (if applicable)

  • stable_pvalue_chi2 : Chi-square p-value at stable size (if applicable)

  • stable_rmse : RMSE at stable size (if applicable)

  • figure_path : Path to saved 2x3 summary figure (only if generated)

  • created_by : Method identifier

Return type:

xarray.Dataset

Examples

>>> # Basic usage with recommended settings (includes RMSE!)
>>> results = adjuster.monte_carlo_fit(
...     n_repeats=50,
...     tests=['ks', 'chi2', 'rmse'],
...     stability_method='kneedle'  # Best for RMSE
... )
>>> # Access stability information
>>> stability = results.attrs['stability_points']
>>> recommended_size = results.attrs['recommended_size']
>>> print(f"RMSE stabilizes at n={stability['rmse']['size']}")
>>> # Use fit constraints (e.g., fix location for Weibull)
>>> results = adjuster.monte_carlo_fit(
...     n_repeats=100,
...     tests=['rmse', 'ks'],
...     stability_method='plateau',
...     fit_kwargs={'floc': 0}
... )
>>> # Use pre-calculated parameters (bypass fitting entirely)
>>> known_params = (2.0, 0.0, 1.0)  # shape, loc, scale
>>> results = adjuster.monte_carlo_fit(
...     distribution_params=known_params,
...     n_repeats=50,
...     tests=['chi2', 'ks']
... )
>>> # Generate figure with automatic stability detection
>>> results = adjuster.monte_carlo_fit(
...     tests=['ks', 'rmse'],
...     stability_method='kneedle',
...     fig_output_path='mc_analysis.png'
... )

Notes

Fitting Strategy:

By default, each subsample gets an independent fit to evaluate how parameter estimates change with sample size. This is the correct approach for stability analysis since parameter estimation quality depends on sample size.

Stability Detection Methods:

  1. CV (Coefficient of Variation): Default method, robust and works for all variables. Looks for when std/mean becomes consistently small.

  2. Kneedle: Best for RMSE and converging parameters. Finds the “elbow” point where improvement rate changes. Requires curve smoothing.

  3. Plateau: Good for metrics with clear convergence. Uses relative gain heuristic (Δᵢ = |y(i-1) - y(i)| / |y(i-1)|).

Choosing the Right Method:

  • Use ‘kneedle’ with RMSE for clearest stability detection

  • Use ‘cv’ for general-purpose analysis

  • Use ‘plateau’ when you expect clear convergence behavior

Parameter Constraints:

Use fit_kwargs to impose constraints during fitting (e.g., floc=0 for Weibull distributions). These constraints apply to all subsample fits.

Pre-calculated Parameters:

Use distribution_params only when you want to evaluate goodness-of-fit with known/fixed parameters across all sample sizes, bypassing fitting entirely.

Monte Carlo Fitting

The monte_carlo_fit method performs stability analysis to determine minimum sample sizes for reliable parameter estimation.

Note

For complete documentation including all parameters, stability detection methods, and best practices, see Monte Carlo Stability Analysis.

Quick Example:

import numpy as np
from magica.core import MagicAdjuster

# Generate sample data
data = np.random.weibull(2, 1000)

# Create adjuster and fit distribution
adjuster = MagicAdjuster(data)
adjuster.fit_distribution('weibull_min')

# Run Monte Carlo stability analysis
results = adjuster.monte_carlo_fit(
    sizes=[100, 200, 500, 1000],
    n_repeats=30,
    tests=['ks', 'chi2', 'rmse'],  # Always include RMSE!
    sampling='random',
    seed=42,
    fig_output_path='stability.png'
)

# Check RMSE stability (most reliable indicator)
rmse_size = results.attrs['stability_points']['rmse']['size']
print(f"Recommended minimum sample size: {rmse_size}")

Return Value: Return Value:

xarray.Dataset with:

  • Dimensions: - sizes: Sample sizes tested - repeats: Repetition index for each size

  • Data Variables: - param_0, param_1, …: Fitted distribution parameters - ks_statistic, ks_pvalue: Kolmogorov-Smirnov test results (if requested) - chi2_statistic, chi2_pvalue: Chi-square test results (if requested) - rmse: Root mean square error values (if requested)

  • Attributes: - distribution: Distribution name - original_data_size: Size of original dataset - sampling_method: Sampling strategy used - stability_points: Detected stability points for each variable (dict with ‘size’, ‘index’, ‘cv_at_stability’) - figure_path: Path to saved figure (if generated)

Tip

Always include ‘rmse’ in your tests - it provides the most reliable stability detection because it shows smooth, monotonic convergence unlike p-values which can be erratic.

For complete parameter documentation, stability methods, and best practices, see Monte Carlo Stability Analysis.

Goodness-of-Fit Testing

The class provides comprehensive goodness-of-fit testing with multiple methods:

Chi-Square Test: Tests the hypothesis that data follows the fitted distribution using histogram-based comparison.

Kolmogorov-Smirnov Test: Non-parametric test comparing empirical and theoretical cumulative distribution functions.

Root Mean Square Error (RMSE): Measures the average deviation between empirical and theoretical distributions.

Utility Methods

Binning Strategies:

The class supports multiple binning strategies for histogram-based tests:

  • _calculate_sturges_bins(): Sturges’ rule (log-based)

  • _calculate_rice_bins(): Rice rule (cube root)

  • _calculate_freedman_diaconis_bins(): Freedman-Diaconis rule (IQR-based)

  • _calculate_scott_bins(): Scott’s rule (standard deviation-based)

  • _calculate_doane_bins(): Doane’s rule (skewness-adjusted)

Subsampling:

  • _generate_subsample_indices(): Creates index lists for different sampling strategies

Sampling strategies

For a didactic, longer discussion of sampling strategies (random, bootstrap, and disjoint) and practical advice on when to use each, see the Monte Carlo tutorial: Monte Carlo Stability Tutorial.