`mathematical.stats`

Functions for calculating statistics.

Functions:

`absolute_deviation`(x[, axis, center, nan_policy])	Compute the absolute deviations from the median of the data along the given axis.
`absolute_deviation_from_median`(x[, axis, ...])	Compute the absolute deviation from the median of each point in the data along the given axis, given in terms of the MAD.
`d_cohen`(sample1, sample2[, which, tail, pooled])	Calculates and returns Cohen's effect size index d.
`g_durlak_bias`(g, n)	Application of Durlak's bias correction to the Hedges' g statistic.
`g_hedge`(sample1, sample2)	Calculates and returns Hedges' g statistic.
`interpret_d`(d_or_g)	Interpret Cohen's d or Hedges' g values using Table 1 from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3444174/
`iqr_none`(dataset)	Calculate the interquartile range, excluding NaN, strings, boolean values, and zeros.
`mean_none`(dataset)	Calculate the mean, excluding NaN, strings, boolean values, and zeros.
`median_absolute_deviation`(x[, axis, center, ...])	Compute the median absolute deviation of the data along the given axis.
`median_none`(dataset)	Calculate the median, excluding NaN, strings, boolean values, and zeros.
`percentile_none`(dataset, percentage)	Calculate the given percentile, excluding NaN, strings, boolean values, and zeros.
`pooled_sd`(sample1, sample2[, weighted])	Returns the pooled standard deviation.
`std_none`(dataset[, ddof])	Calculate the standard deviation, excluding NaN, strings, boolean values, and zeros.
`within1min`(value1, value2)	Returns whether `value2` is within one minute of `value1`.

absolute_deviation(x, axis=0, center=numpy.median, nan_policy='propagate')

Compute the absolute deviations from the median of the data along the given axis.

Parameters

x (array_like) – Input array or object that can be converted to an array.
axis (Optional[int]) – Axis along which the deviations are computed. If None, compute over the entire array. Default 0.
center (Callable) – A function that returns the central value. Any user-defined function must have the signature func(arr, axis). Default numpy.median.
nan_policy (Literal['propagate', 'raise', 'omit']) – Defines how to handle when input contains nan. ‘propagate’ returns nan, ‘raise’ throws an error, ‘omit’ performs the calculations ignoring nan values. Default 'propagate'.

Returns

If axis=None, a scalar is returned. If the input contains integers or floats of smaller precision than numpy.float64, then the output data-type is numpy.float64. Otherwise, the output data-type is the same as that of the input.

Return type

scalar or ndarray

Overloads

absolute_deviation(x, axis: None, center = …, nan_policy = … ) -> float
absolute_deviation(x, axis: int = …, center = …, nan_policy = … ) -> ndarray

Note

The center argument only affects the calculation of the central value around which the MAD is calculated. That is, passing in center=numpy.mean will calculate the MAD around the mean - it will not calculate the mean absolute deviation.
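The effect of the center argument can be illustrated with a minimal pure-Python sketch. This is not the library's implementation: `abs_dev` is a hypothetical helper that assumes one-dimensional input and a `center` callable that takes only the data (no axis argument).

```python
import statistics


def abs_dev(data, center=statistics.median):
    """Hypothetical helper: absolute deviation of each point from center(data)."""
    c = center(data)
    return [abs(v - c) for v in data]


data = [1, 2, 3, 4, 100]

# Deviations measured from the median (3).
around_median = abs_dev(data)  # [2, 1, 0, 1, 97]

# Deviations measured from the mean (22); only the reference point changes.
around_mean = abs_dev(data, center=statistics.mean)  # [21, 20, 19, 18, 78]
```

Swapping the center changes the point the deviations are measured from, not the way they are summarised afterwards.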

absolute_deviation_from_median(x, axis=0, center=numpy.median, nan_policy='propagate')

Compute the absolute deviation from the median of each point in the data along the given axis, given in terms of the MAD.

Parameters

x (array_like) – Input array or object that can be converted to an array.
axis (Optional[int]) – Axis along which the deviations are computed. If None, compute over the entire array. Default 0.
center (Callable) – A function that returns the central value. Any user-defined function must have the signature func(arr, axis). Default numpy.median.
nan_policy (Literal['propagate', 'raise', 'omit']) – Defines how to handle when input contains nan. ‘propagate’ returns nan, ‘raise’ throws an error, ‘omit’ performs the calculations ignoring nan values. Default 'propagate'.

Returns

If axis=None, a scalar is returned. If the input contains integers or floats of smaller precision than numpy.float64, then the output data-type is numpy.float64. Otherwise, the output data-type is the same as that of the input.

Return type

scalar or ndarray

Overloads

absolute_deviation_from_median(x, axis: None, center = …, nan_policy = … ) -> float
absolute_deviation_from_median(x, axis: int = …, center = …, nan_policy = … ) -> ndarray

Note

The center argument only affects the calculation of the central value around which the MAD is calculated. That is, passing in center=numpy.mean will calculate the MAD around the mean - it will not calculate the mean absolute deviation.
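One plausible reading of "given in terms of the MAD" is that each point's absolute deviation from the median is divided by the MAD. The sketch below shows that idea in pure Python. It assumes an unscaled MAD and one-dimensional input; `deviation_in_mads` is a hypothetical name, and whether the library applies a scale factor is not stated here.

```python
import statistics


def deviation_in_mads(data):
    """Hypothetical helper: |x - median| expressed as multiples of the MAD."""
    med = statistics.median(data)
    abs_devs = [abs(v - med) for v in data]
    mad = statistics.median(abs_devs)  # unscaled MAD; assumed to be non-zero
    return [d / mad for d in abs_devs]


# median = 3, MAD = 1, so the outlier 100 sits 97 MADs from the median.
scores = deviation_in_mads([1, 2, 3, 4, 100])  # [2.0, 1.0, 0.0, 1.0, 97.0]
```

Expressed this way, the values can be compared against a fixed threshold to flag outliers.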

d_cohen(sample1, sample2, which=1, tail=1, pooled=False)

Calculates and returns Cohen’s effect size index d.

See also

Cohen, J. (1988). Statistical power analysis for the behavioral sciences (2nd Edition). Hillsdale, NJ: Lawrence Erlbaum Associates

Parameters

sample1 (Sequence[float]) – datapoints for first sample
sample2 (Sequence[float]) – datapoints for second sample
which (Literal[1, 2]) – Use the standard deviation of the first sample (1) or the second sample (2). Default 1.
tail (Literal[1, 2]) – The number of tails to consider. Default 1.
pooled (bool) – Whether to use the pooled standard deviation. Default False.

Return type

float
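As a rough illustration of how the which and pooled parameters might select the standardiser, here is a minimal sketch of Cohen's d as the difference in means divided by a standard deviation. It is not the library's code: `cohen_d_sketch` is a hypothetical name, sample standard deviations (divisor n - 1) and the root-mean-square pooling are assumptions, and the tail parameter is not modelled.

```python
import math
import statistics


def cohen_d_sketch(sample1, sample2, which=1, pooled=False):
    """Hypothetical illustration of Cohen's d; not the library's implementation."""
    m1, m2 = statistics.mean(sample1), statistics.mean(sample2)
    s1, s2 = statistics.stdev(sample1), statistics.stdev(sample2)
    if pooled:
        # Assumed pooling: root mean square of the two sample SDs.
        sd = math.sqrt((s1 ** 2 + s2 ** 2) / 2)
    else:
        # Standardise by the SD of the first (1) or second (2) sample.
        sd = s1 if which == 1 else s2
    return (m1 - m2) / sd


a = [2.0, 4.0, 6.0]  # mean 4, sample SD 2
b = [1.0, 2.0, 3.0]  # mean 2, sample SD 1

d_first = cohen_d_sketch(a, b)             # (4 - 2) / 2 = 1.0
d_second = cohen_d_sketch(a, b, which=2)   # (4 - 2) / 1 = 2.0
d_pooled = cohen_d_sketch(a, b, pooled=True)
```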

g_durlak_bias(g, n)

Application of Durlak’s bias correction to the Hedges’ g statistic.

n = n1 + n2

Parameters

g (float) – Hedges’ g statistic, calculated using g_hedge().
n (float) – The total number of samples in both datasets.

Return type: float
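For orientation, a commonly cited form of Durlak's small-sample correction multiplies g by (n - 3) / (n - 2.25) and by the square root of (n - 2) / n. The sketch below implements that expression. Whether the library uses exactly this form is an assumption, and `durlak_correction_sketch` is a hypothetical name.

```python
import math


def durlak_correction_sketch(g, n):
    """Hypothetical illustration of Durlak's bias correction for total size n."""
    return g * ((n - 3) / (n - 2.25)) * math.sqrt((n - 2) / n)


# With n = n1 + n2 = 20, the corrected value is pulled towards zero.
corrected = durlak_correction_sketch(0.5, 20)
```

Both factors approach 1 as n grows, so the correction matters mainly for small samples.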

g_hedge(sample1, sample2)

Calculates and returns Hedges’ g statistic.

Formula from https://www.itl.nist.gov/div898/software/dataplot/refman2/auxillar/hedgeg.htm

Parameters

sample1 (Sequence[float]) – datapoints for first sample
sample2 (Sequence[float]) – datapoints for second sample

Return type

float

interpret_d(d_or_g)

Interpret Cohen’s d or Hedges’ g values using Table 1 from https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3444174/

Parameters: d_or_g (float)
Return type: str

iqr_none(dataset)

Calculate the interquartile range, excluding NaN, strings, boolean values, and zeros.

Parameters: dataset (Sequence[Union[float, bool, None]]) – A list to calculate the IQR from.
Return type: float
Returns: The interquartile range.

mean_none(dataset)

Calculate the mean, excluding NaN, strings, boolean values, and zeros.

Parameters: dataset (Sequence[Union[float, bool, None]]) – list to calculate mean from
Return type: float
Returns: mean
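The exclusion rule shared by the *_none functions can be sketched as a filter applied before the statistic is computed. This is only an illustration of the documented behaviour, not the library's code; `keep_value` is a hypothetical helper.

```python
import math
import statistics


def keep_value(v):
    """Hypothetical filter: drop None, booleans, strings, NaN and zeros."""
    # bool is checked explicitly because it is a subclass of int.
    if v is None or isinstance(v, (bool, str)):
        return False
    if isinstance(v, float) and math.isnan(v):
        return False
    return v != 0


raw = [4.0, 0, None, True, "n/a", float("nan"), 6.0]
cleaned = [v for v in raw if keep_value(v)]  # [4.0, 6.0]
mean_of_cleaned = statistics.mean(cleaned)   # 5.0
```

Note that zeros are discarded along with the non-numeric entries, so the result differs from a plain mean whenever the data contains genuine zero values.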

median_absolute_deviation(x, axis=0, center=numpy.median, scale=1.4826, nan_policy='propagate')

Compute the median absolute deviation of the data along the given axis. The median absolute deviation (MAD, [1]) computes the median over the absolute deviations from the median. It is a measure of dispersion similar to the standard deviation, but is more robust to outliers [2]. The MAD of an empty array is numpy.nan.

Parameters

x (array_like) – Input array or object that can be converted to an array.
axis (Optional[int]) – Axis along which the MAD is computed. If None, compute the MAD over the entire array. Default 0.
center (Callable) – A function that returns the central value. Any user-defined function must have the signature func(arr, axis). Default numpy.median.
scale (float) – The scaling factor applied to the MAD. The default scale (1.4826) ensures consistency with the standard deviation for normally distributed data. Default 1.4826.
nan_policy (Literal['propagate', 'raise', 'omit']) – Defines how to handle when input contains nan. ‘propagate’ returns nan, ‘raise’ throws an error, ‘omit’ performs the calculations ignoring nan values. Default 'propagate'.

Returns

If axis=None, a scalar is returned. If the input contains integers or floats of smaller precision than numpy.float64, then the output data-type is numpy.float64. Otherwise, the output data-type is the same as that of the input.

Return type

scalar or ndarray

Overloads

median_absolute_deviation(x, axis: None, center = …, scale = …, nan_policy = … ) -> float
median_absolute_deviation(x, axis: int = …, center = …, scale = …, nan_policy = … ) -> ndarray

Note

The center argument only affects the calculation of the central value around which the MAD is calculated. That is, passing in center=numpy.mean will calculate the MAD around the mean - it will not calculate the mean absolute deviation.

References

[1] “Median absolute deviation” https://en.wikipedia.org/wiki/Median_absolute_deviation
[2] “Robust measures of scale” https://en.wikipedia.org/wiki/Robust_measures_of_scale

Examples

Compared with numpy.std, median_absolute_deviation is far less sensitive to outliers: replacing a single value of an array with an outlier changes the standard deviation drastically, while the MAD hardly changes:

>>> import scipy.stats
>>> import mathematical.stats
>>> x = scipy.stats.norm.rvs(size=100, scale=1, random_state=123456)
>>> x.std()
0.9973906394005013
>>> mathematical.stats.median_absolute_deviation(x)
1.2280762773108278
>>> x[0] = 345.6
>>> x.std()
34.42304872314415
>>> mathematical.stats.median_absolute_deviation(x)
1.2340335571164334

Axis handling example:

>>> import numpy
>>> x = numpy.array([[10, 7, 4], [3, 2, 1]])
>>> x
array([[10,  7,  4],
       [ 3,  2,  1]])
>>> mathematical.stats.median_absolute_deviation(x)
array([5.1891, 3.7065, 2.2239])
>>> mathematical.stats.median_absolute_deviation(x, axis=None)
2.9652

median_none(dataset)

Calculate the median, excluding NaN, strings, boolean values, and zeros.

Parameters: dataset (Sequence[Union[float, bool, None]]) – list to calculate median from
Return type: float
Returns: median

percentile_none(dataset, percentage)

Calculate the given percentile, excluding NaN, strings, boolean values, and zeros.

Parameters

dataset (Sequence[Union[float, bool, None]]) – Sequence to calculate the percentile from.
percentage (float)

Raises

ValueError if dataset contains fewer than two values

Return type

float

Returns

The given percentile.

pooled_sd(sample1, sample2, weighted=False)

Returns the pooled standard deviation.

Parameters

sample1 (Sequence[float]) – datapoints for first sample
sample2 (Sequence[float]) – datapoints for second sample
weighted (bool) – True for weighted pooled SD. Default False.

Return type: float
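The difference between a weighted and an unweighted pooled standard deviation can be sketched as follows. The formulas are the textbook ones and are assumptions about what the weighted flag selects; `pooled_sd_sketch` is a hypothetical name, and the use of sample variances (divisor n - 1) is also an assumption.

```python
import math
import statistics


def pooled_sd_sketch(sample1, sample2, weighted=False):
    """Hypothetical illustration of pooled SD; not the library's implementation."""
    v1, v2 = statistics.variance(sample1), statistics.variance(sample2)
    if weighted:
        # Each variance is weighted by its degrees of freedom.
        n1, n2 = len(sample1), len(sample2)
        return math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    # Unweighted: simple average of the two variances.
    return math.sqrt((v1 + v2) / 2)


a = [2.0, 4.0, 6.0]            # sample variance 4
b = [1.0, 2.0, 3.0, 4.0, 5.0]  # sample variance 2.5

unweighted = pooled_sd_sketch(a, b)               # sqrt(3.25)
weighted = pooled_sd_sketch(a, b, weighted=True)  # sqrt(3)
```

The two results agree when the samples are the same size and diverge as the sizes become more unequal.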

std_none(dataset, ddof=1)

Calculate the standard deviation, excluding NaN, strings, boolean values, and zeros.

Parameters

dataset (Sequence[Union[float, bool, None]]) – list to calculate the standard deviation from.
ddof (int) – Delta degrees of freedom. The divisor used in calculations is N - ddof, where N represents the number of elements. Default 1.

Return type

float

Returns

standard deviation
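
To make the role of ddof concrete, the sketch below computes a standard deviation with the divisor N - ddof described above. It covers only the divisor, not the filtering step; `std_with_ddof` is a hypothetical helper, not part of the library.

```python
import math


def std_with_ddof(data, ddof=1):
    """Hypothetical helper: standard deviation with divisor N - ddof."""
    n = len(data)
    mean = sum(data) / n
    sum_sq = sum((v - mean) ** 2 for v in data)
    return math.sqrt(sum_sq / (n - ddof))


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # mean 5, sum of squares 32

sample_sd = std_with_ddof(data)              # divisor N - 1 = 7
population_sd = std_with_ddof(data, ddof=0)  # divisor N = 8, giving 2.0
```

With ddof=1 the result is the sample standard deviation, which is slightly larger than the population value obtained with ddof=0.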

within1min(value1, value2)

Returns whether value2 is within one minute of value1.

Parameters

value1 (float) – A time in minutes.
value2 (float) – Another time in minutes.

Return type

bool
