mathematical.stats

Functions for calculating statistics.

Functions:

absolute_deviation(x[, axis, center, nan_policy])

Compute the absolute deviations from the median of the data along the given axis.

absolute_deviation_from_median(x[, axis, …])

Compute the absolute deviation from the median of each point in the data along the given axis, given in terms of the MAD.

d_cohen(sample1, sample2[, which, tail, pooled])

Calculates and returns Cohen’s effect size index d.

g_durlak_bias(g, n)

Application of Durlak’s bias correction to the Hedges’ g statistic.

g_hedge(sample1, sample2)

Calculates and returns the Hedges’ g statistic.

interpret_d(d_or_g)

Interpret Cohen’s d or Hedges’ g values using Table 1 from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3444174/

iqr_none(dataset)

Calculate the interquartile range, excluding NaN, strings, boolean values, and zeros.

mean_none(dataset)

Calculate the mean, excluding NaN, strings, boolean values, and zeros.

median_absolute_deviation(x[, axis, center, …])

Compute the median absolute deviation of the data along the given axis.

median_none(dataset)

Calculate the median, excluding NaN, strings, boolean values, and zeros.

percentile_none(dataset, percentage)

Calculate the given percentile, excluding NaN, strings, boolean values, and zeros.

pooled_sd(sample1, sample2[, weighted])

Returns the pooled standard deviation.

std_none(dataset[, ddof])

Calculate the standard deviation, excluding NaN, strings, boolean values, and zeros.

within1min(value1, value2)

Returns whether value2 is within one minute of value1.

absolute_deviation(x, axis=0, center=numpy.median, nan_policy='propagate')

Compute the absolute deviations from the median of the data along the given axis.

Parameters
  • x (array_like) – Input array or object that can be converted to an array.

  • axis (Optional[int]) – Axis along which the deviations are computed. If None, compute over the entire array. Default 0.

  • center (Callable) – A function that returns the central value. Any user-defined function must have the signature func(arr, axis). Default numpy.median.

  • nan_policy (Literal['propagate', 'raise', 'omit']) – Defines how to handle input that contains nan. ‘propagate’ returns nan, ‘raise’ throws an error, and ‘omit’ performs the calculations ignoring nan values. Default 'propagate'.

Returns

If axis=None, a scalar is returned. If the input contains integers or floats of smaller precision than numpy.float64, then the output data-type is numpy.float64. Otherwise, the output data-type is the same as that of the input.

Return type

scalar or ndarray

Note

The center argument only affects the calculation of the central value around which the MAD is calculated. That is, passing in center=numpy.mean will calculate the MAD around the mean - it will not calculate the mean absolute deviation.
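
The computation described above can be illustrated with a minimal pure-Python sketch for a one-dimensional list. This is not the library's implementation: the helper name absolute_deviations is hypothetical, and axis and nan_policy handling are left out.

```python
from statistics import median


def absolute_deviations(values, center=median):
    """Return |v - c| for each value, where c = center(values).

    Hypothetical 1-D helper for illustration only.
    """
    c = center(values)
    return [abs(v - c) for v in values]


data = [1, 2, 3, 4, 100]
deviations = absolute_deviations(data)  # deviations from the median, 3
print(deviations)
```

The outlier (100) produces one very large deviation, while the other deviations stay small.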

absolute_deviation_from_median(x, axis=0, center=numpy.median, nan_policy='propagate')

Compute the absolute deviation from the median of each point in the data along the given axis, given in terms of the MAD.

Parameters
  • x (array_like) – Input array or object that can be converted to an array.

  • axis (Optional[int]) – Axis along which the deviations are computed. If None, compute over the entire array. Default 0.

  • center (Callable) – A function that returns the central value. Any user-defined function must have the signature func(arr, axis). Default numpy.median.

  • nan_policy (Literal['propagate', 'raise', 'omit']) – Defines how to handle input that contains nan. ‘propagate’ returns nan, ‘raise’ throws an error, and ‘omit’ performs the calculations ignoring nan values. Default 'propagate'.

Returns

If axis=None, a scalar is returned. If the input contains integers or floats of smaller precision than numpy.float64, then the output data-type is numpy.float64. Otherwise, the output data-type is the same as that of the input.

Return type

scalar or ndarray

Note

The center argument only affects the calculation of the central value around which the MAD is calculated. That is, passing in center=numpy.mean will calculate the MAD around the mean - it will not calculate the mean absolute deviation.
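
One way to read "in terms of the MAD" is that each point's absolute deviation is divided by the MAD of the data. The sketch below follows that reading for a one-dimensional list. It assumes an unscaled MAD (no 1.4826 factor), which may differ from what the library does, and the helper name is hypothetical.

```python
from statistics import median


def deviations_in_mad_units(values):
    """Express each point's absolute deviation from the median in MAD units.

    Assumption: the MAD is used unscaled; the library may scale it.
    """
    c = median(values)
    abs_dev = [abs(v - c) for v in values]
    mad = median(abs_dev)
    return [d / mad for d in abs_dev]


scores = deviations_in_mad_units([1, 2, 3, 4, 100])
print(scores)
```

A point with a score far above the rest (here the last one) lies many MADs from the median, which is a common way to flag outliers.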

d_cohen(sample1, sample2, which=1, tail=1, pooled=False)

Calculates and returns Cohen’s effect size index d.

See also

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd Edition). Hillsdale, NJ: Lawrence Erlbaum Associates

Parameters
  • sample1 (Sequence[float]) – datapoints for first sample

  • sample2 (Sequence[float]) – datapoints for second sample

  • which (Literal[1, 2]) – Use the standard deviation of the first sample (1) or the second sample (2). Default 1.

  • tail (Literal[1, 2]) – The number of tails to consider. Default 1.

  • pooled (bool) – Whether to use the pooled standard deviation. Default False.

Return type

float
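
As a rough illustration of the effect size, Cohen's d is the difference between the sample means divided by a standard deviation. The sketch below uses the standard deviation of one sample, selected by which. The order of subtraction and the omission of tail and pooled are assumptions made for brevity, and cohens_d_sketch is a hypothetical name rather than the library function.

```python
from statistics import mean, stdev


def cohens_d_sketch(sample1, sample2, which=1):
    """Difference of means divided by the SD of sample1 (which=1) or sample2 (which=2)."""
    sd = stdev(sample1) if which == 1 else stdev(sample2)
    return (mean(sample1) - mean(sample2)) / sd


a = [2, 4, 6, 8]
b = [1, 3, 5, 7]
d = cohens_d_sketch(a, b)
print(round(d, 4))
```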

g_durlak_bias(g, n)

Application of Durlak’s bias correction to the Hedges’ g statistic.

Here n = n1 + n2, the combined size of the two samples.

Parameters
  • g (float) – The Hedges’ g statistic, calculated using g_hedge().

  • n (float) – The total number of samples in both datasets.

Return type

float
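
The correction is commonly cited in the form g × ((n − 3) / (n − 2.25)) × sqrt((n − 2) / n). The sketch below applies that form. Whether g_durlak_bias uses exactly this expression is an assumption, and durlak_correction is a hypothetical name.

```python
import math


def durlak_correction(g, n):
    """Apply the commonly cited small-sample correction to an effect size g.

    n is the combined sample size, n1 + n2.
    """
    return g * ((n - 3) / (n - 2.25)) * math.sqrt((n - 2) / n)


corrected = durlak_correction(0.5, 20)
print(round(corrected, 4))
```

The correction shrinks the estimate towards zero, and the shrinkage becomes negligible as n grows.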

g_hedge(sample1, sample2)

Calculates and returns the Hedges’ g statistic.

Formula from https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/hedgeg.htm

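Parameters
  • sample1 (Sequence[float]) – datapoints for first sample

  • sample2 (Sequence[float]) – datapoints for second sample
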
Return type

float
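
The referenced formula divides the difference of the sample means by the pooled standard deviation, where the pooled variance weights each sample variance by its degrees of freedom. A minimal sketch of that formula follows. The name hedges_g_sketch is hypothetical, and the library's exact implementation may differ.

```python
import math
from statistics import mean, variance


def hedges_g_sketch(sample1, sample2):
    """(mean1 - mean2) / pooled SD, with variances weighted by n - 1."""
    n1, n2 = len(sample1), len(sample2)
    pooled_var = (
        (n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)
    ) / (n1 + n2 - 2)
    return (mean(sample1) - mean(sample2)) / math.sqrt(pooled_var)


g = hedges_g_sketch([2, 4, 6, 8], [1, 3, 5, 7])
print(round(g, 4))
```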

interpret_d(d_or_g)

Interpret Cohen’s d or Hedges’ g values using Table 1 from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3444174/

Parameters

d_or_g (float)

Return type

str

iqr_none(dataset)

Calculate the interquartile range, excluding NaN, strings, boolean values, and zeros.

Parameters

dataset (Sequence[Union[float, bool, None]]) – A list to calculate the interquartile range from.

Return type

float

Returns

The interquartile range.

mean_none(dataset)

Calculate the mean, excluding NaN, strings, boolean values, and zeros.

Parameters

dataset (Sequence[Union[float, bool, None]]) – list to calculate mean from

Return type

float

Returns

mean
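
The filtering shared by the *_none functions can be sketched as follows: drop anything that is not a real number (None, strings, booleans), then drop NaN and zero, and compute the statistic on what remains. The helper names clean and mean_of_clean are hypothetical, and the library's own filtering rules may differ in detail.

```python
import math


def clean(dataset):
    """Keep only non-zero, non-NaN ints and floats (booleans excluded)."""
    kept = []
    for item in dataset:
        # bool is a subclass of int, so it must be rejected explicitly
        if isinstance(item, bool) or not isinstance(item, (int, float)):
            continue
        if math.isnan(item) or item == 0:
            continue
        kept.append(item)
    return kept


def mean_of_clean(dataset):
    """Mean of the values that survive clean()."""
    kept = clean(dataset)
    return sum(kept) / len(kept)


raw = [1.0, 0, None, "n/a", True, float("nan"), 3.0]
print(clean(raw), mean_of_clean(raw))
```

Note that this sketch raises ZeroDivisionError when nothing survives the filter; how the library treats that case is not stated here.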

median_absolute_deviation(x, axis=0, center=numpy.median, scale=1.4826, nan_policy='propagate')

Compute the median absolute deviation of the data along the given axis. The median absolute deviation (MAD, [1]) computes the median over the absolute deviations from the median. It is a measure of dispersion similar to the standard deviation, but is more robust to outliers [2]. The MAD of an empty array is numpy.nan.

Parameters
  • x (array_like) – Input array or object that can be converted to an array.

  • axis (Optional[int]) – Axis along which the MAD is computed. If None, compute the MAD over the entire array. Default 0.

  • center (Callable) – A function that returns the central value. Any user-defined function must have the signature func(arr, axis). Default numpy.median.

  • scale (float) – The scaling factor applied to the MAD. The default scale (1.4826) ensures consistency with the standard deviation for normally distributed data. Default 1.4826.

  • nan_policy (Literal['propagate', 'raise', 'omit']) – Defines how to handle input that contains nan. ‘propagate’ returns nan, ‘raise’ throws an error, and ‘omit’ performs the calculations ignoring nan values. Default 'propagate'.

Returns

If axis=None, a scalar is returned. If the input contains integers or floats of smaller precision than numpy.float64, then the output data-type is numpy.float64. Otherwise, the output data-type is the same as that of the input.

Return type

scalar or ndarray

Note

The center argument only affects the calculation of the central value around which the MAD is calculated. That is, passing in center=numpy.mean will calculate the MAD around the mean - it will not calculate the mean absolute deviation.
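
The distinction in this note can be made concrete with a small pure-Python sketch. It uses an unscaled MAD (equivalent to scale=1) on a one-dimensional list and does not call the library. With center set to the mean, the result is still a median of deviations, not their mean.

```python
from statistics import mean, median

data = [1, 2, 3, 4, 100]

# Deviations measured from the mean (22) instead of the median
c = mean(data)
abs_dev = [abs(v - c) for v in data]

mad_around_mean = median(abs_dev)  # median of the deviations
mean_abs_dev = mean(abs_dev)       # mean absolute deviation, a different statistic

print(mad_around_mean, mean_abs_dev)
```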

References

[1] “Median absolute deviation” https://en.wikipedia.org/wiki/Median_absolute_deviation

[2] “Robust measures of scale” https://en.wikipedia.org/wiki/Robust_measures_of_scale

Examples

Changing a single value of an array to an outlier strongly affects the standard deviation (numpy.std), while the MAD hardly changes:

>>> import scipy.stats
>>> import mathematical.stats
>>> x = scipy.stats.norm.rvs(size=100, scale=1, random_state=123456)
>>> x.std()
0.9973906394005013
>>> mathematical.stats.median_absolute_deviation(x)
1.2280762773108278
>>> x[0] = 345.6
>>> x.std()
34.42304872314415
>>> mathematical.stats.median_absolute_deviation(x)
1.2340335571164334

Axis handling example:

>>> import numpy
>>> x = numpy.array([[10, 7, 4], [3, 2, 1]])
>>> x
array([[10,  7,  4],
       [ 3,  2,  1]])
>>> mathematical.stats.median_absolute_deviation(x)
array([5.1891, 3.7065, 2.2239])
>>> mathematical.stats.median_absolute_deviation(x, axis=None)
2.9652

median_none(dataset)

Calculate the median, excluding NaN, strings, boolean values, and zeros.

Parameters

dataset (Sequence[Union[float, bool, None]]) – list to calculate median from

Return type

float

Returns

median

percentile_none(dataset, percentage)

Calculate the given percentile, excluding NaN, strings, boolean values, and zeros.

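Parameters
  • dataset (Sequence[Union[float, bool, None]]) – list to calculate the percentile from

  • percentage (float) – the percentile to calculate
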
Raises

ValueError if dataset contains fewer than two values

Return type

float

Returns

The requested percentile.

pooled_sd(sample1, sample2, weighted=False)

Returns the pooled standard deviation.

Parameters
  • sample1 (Sequence[float]) – datapoints for first sample

  • sample2 (Sequence[float]) – datapoints for second sample

  • weighted (bool) – True for weighted pooled SD. Default False.

Return type

float
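
The two variants are commonly defined as follows: the unweighted pooled SD is the square root of the average of the two sample variances, and the weighted version weights each variance by its degrees of freedom (n − 1). The sketch below uses those textbook definitions. That pooled_sd matches them exactly is an assumption, and pooled_sd_sketch is a hypothetical name.

```python
import math
from statistics import variance


def pooled_sd_sketch(sample1, sample2, weighted=False):
    """Pooled standard deviation of two samples (textbook definitions)."""
    v1, v2 = variance(sample1), variance(sample2)
    if weighted:
        n1, n2 = len(sample1), len(sample2)
        return math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return math.sqrt((v1 + v2) / 2)


a = [2, 4, 6, 8]
b = [1, 3, 5, 7, 9]
unweighted = pooled_sd_sketch(a, b)
weighted = pooled_sd_sketch(a, b, weighted=True)
print(round(unweighted, 4), round(weighted, 4))
```

The two results coincide when both samples have the same size and diverge as the sizes become more unequal.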

std_none(dataset, ddof=1)

Calculate the standard deviation, excluding NaN, strings, boolean values, and zeros.

Parameters
  • dataset (Sequence[Union[float, bool, None]]) – list to calculate the standard deviation from.

  • ddof (int) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N is the number of elements. Default 1.

Return type

float

Returns

standard deviation
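
The effect of ddof on the divisor can be seen in a short pure-Python sketch. It operates on an already filtered list, and std_with_ddof is a hypothetical name rather than the library function.

```python
import math


def std_with_ddof(values, ddof=1):
    """Standard deviation with divisor N - ddof."""
    n = len(values)
    m = sum(values) / n
    squared_dev = sum((v - m) ** 2 for v in values)
    return math.sqrt(squared_dev / (n - ddof))


values = [2, 4, 4, 4, 5, 5, 7, 9]
population_sd = std_with_ddof(values, ddof=0)  # divisor N
sample_sd = std_with_ddof(values, ddof=1)      # divisor N - 1
print(population_sd, round(sample_sd, 4))
```

With ddof=1 the divisor is smaller, so the sample estimate is always at least as large as the population value.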

within1min(value1, value2)

Returns whether value2 is within one minute of value1.

Parameters
  • value1 (float) – A time in minutes.

  • value2 (float) – Another time in minutes.

Return type

bool
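
A check of this kind amounts to comparing the absolute difference of the two times against one minute. The sketch below uses a strict comparison, so a difference of exactly one minute counts as outside. Whether the library treats that boundary the same way is an assumption, and within_one_minute is a hypothetical name.

```python
def within_one_minute(value1, value2):
    """Return True if value2 differs from value1 by less than one minute."""
    return abs(value2 - value1) < 1


print(within_one_minute(10.0, 10.5))
print(within_one_minute(10.0, 11.5))
```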