Skip to main content
Module

x/simplestatistics/index.js

simple statistics for node & browser javascript
Go to Latest
import * as simplestatistics from "https://deno.land/x/simplestatistics@v7.7.5/index.js";

Classes

This is a single-layer Perceptron Classifier that takes arrays of numbers and predicts whether they should be classified as either 0 or 1 (negative or positive examples).

This is a single-layer Perceptron Classifier that takes arrays of numbers and predicts whether they should be classified as either 0 or 1 (negative or positive examples).

Variables

Percentage Points of the χ2 (Chi-Squared) Distribution

We use ε, epsilon, as a stopping criterion when we want to iterate until we're "close enough". Epsilon is a very small number: for simple statistics, that number is 0.0001

A standard normal table, also called the unit normal table or Z table, is a mathematical table for the values of Φ (phi), which are the values of the cumulative distribution function of the normal distribution. It is used to find the probability that a statistic is observed below, above, or between values on the standard normal distribution, and by extension, any normal distribution.

Functions

When adding a new value to a list, one does not have to necessary recompute the mean of the list in linear time. They can instead use this function to compute the new mean by providing the current mean, the number of elements in the list that produced it and the new value to add.

Approximate equality.

The mean, also known as average, is the sum of all values over the number of values. This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

The mean, also known as average, is the sum of all values over the number of values. This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

The Bernoulli distribution is the probability discrete distribution of a random variable which takes value 1 with success probability p and value 0 with failure probability q = 1 - p. It can be used, for example, to represent the toss of a coin, where "1" is defined to mean "heads" and "0" is defined to mean "tails" (or vice versa). It is a special case of a Binomial Distribution where n = 1.

The Binomial Distribution is the discrete probability distribution of the number of successes in a sequence of n independent yes/no experiments, each of which yields success with probability probability. Such a success/failure experiment is also called a Bernoulli experiment or Bernoulli trial; when trials = 1, the Binomial Distribution is a Bernoulli Distribution.

Bisection method is a root-finding method that repeatedly bisects an interval to find the root.

The χ2 (Chi-Squared) Goodness-of-Fit Test uses a measure of goodness of fit which is the sum of differences between observed and expected outcome frequencies (that is, counts of observations), each squared and divided by the number of observations expected given the hypothesized distribution. The resulting χ2 statistic, chiSquared, can be compared to the chi-squared distribution to determine the goodness of fit. In order to determine the degrees of freedom of the chi-squared distribution, one takes the total number of observed frequencies and subtracts the number of estimated parameters. The test statistic follows, approximately, a chi-square distribution with (k − c) degrees of freedom where k is the number of non-empty cells and c is the number of estimated parameters for the distribution.

Split an array into chunks of a specified size. This function has the same behavior as PHP's array_chunk function, and thus will insert smaller-sized chunks at the end if the input size is not divisible by the chunk size.

Ckmeans clustering is an improvement on heuristic-based clustering approaches like Jenks. The algorithm was developed in Haizhou Wang and Mingzhou Song as a dynamic programming approach to the problem of clustering numeric data into groups with the least within-group sum-of-squared-deviations.

Thecoefficient of variation_ is the ratio of the standard deviation to the mean. .._coefficient of variation: https://en.wikipedia.org/wiki/Coefficient_of_variation

Implementation of Combinations Combinations are unique subsets of a collection - in this case, k x from a collection at a time. https://en.wikipedia.org/wiki/Combination

Implementation of Combinations with replacement Combinations are unique subsets of a collection - in this case, k x from a collection at a time. 'With replacement' means that a given element can be chosen multiple times. Unlike permutation, order doesn't matter for combinations.

When combining two lists of values for which one already knows the means, one does not have to necessary recompute the mean of the combined lists in linear time. They can instead use this function to compute the combined mean by providing the mean & number of values of the first list and the mean & number of values of the second list.

When combining two lists of values for which one already knows the variances, one does not have to necessary recompute the variance of the combined lists in linear time. They can instead use this function to compute the combined variance by providing the variance, mean & number of values of the first list and the variance, mean & number of values of the second list.

Given an array of x, this will find the extent of the x and return an array of breaks that can be used to categorize the x into a number of classes. The returned array will always be 1 longer than the number of classes because it includes the minimum value.

This computes the minimum & maximum number in an array.

The extent is the lowest & highest number in the array. With a sorted array, the first element in the array is always the lowest while the last element is always the largest, so this calculation can be done in one step, or constant time.

A Factorial, usually written n!, is the product of all positive integers less than or equal to n. Often factorial is implemented recursively, but this iterative approach is significantly faster and simpler.

Compute the gamma function of a value using Nemes' approximation. The gamma of n is equivalent to (n-1)!, but unlike the factorial function, gamma is defined for all real n except zero and negative integers (where NaN is returned). Note, the gamma function is also well-defined for complex numbers, though this implementation currently does not handle complex numbers as input values. Nemes' approximation is defined here as Theorem 2.2. Negative values use Euler's reflection formula for computation.

Compute the logarithm of the gamma function of a value using Lanczos' approximation. This function takes as input any real-value n greater than 0. This function is useful for values of n too large for the normal gamma function (n > 165). The code is based on Lanczo's Gamma approximation, defined here.

The Geometric Mean is a mean function that is more useful for numbers in different ranges.

The Harmonic Mean is a mean function typically used to find the average of rates. This mean is calculated by taking the reciprocal of the arithmetic mean of the reciprocals of the input numbers.

The Interquartile range is a measure of statistical dispersion, or how scattered, spread, or concentrated a distribution is. It's computed as the difference between the third quartile and first quartile.

The Inverse Gaussian error function returns a numerical approximation to the value that would have caused errorFunction() to return x.

The Interquartile range is a measure of statistical dispersion, or how scattered, spread, or concentrated a distribution is. It's computed as the difference between the third quartile and first quartile.

Kernel density estimation is a useful tool for, among other things, estimating the shape of the underlying probability distribution from a sample.

Kernel density estimation is a useful tool for, among other things, estimating the shape of the underlying probability distribution from a sample.

Perform k-means clustering.

Simple linear regression is a simple way to find a fitted line between a set of coordinates. This algorithm finds the slope and y-intercept of a regression line using the least sum of squares.

Given the output of linearRegression: an object with m and b values indicating slope and intercept, respectively, generate a line function that translates x values into y values.

The log average is an equivalent way of computing the geometric mean of an array suitable for large or small products.

The Logit is the inverse of cumulativeStdLogisticProbability, and is also known as the logistic quantile function.

The Median Absolute Deviation is a robust measure of statistical dispersion. It is more resilient to outliers than the standard deviation.

This computes the maximum number in an array.

The maximum is the highest number in the array. With a sorted array, the last element in the array is always the largest, so this calculation can be done in one step, or constant time.

The mean, also known as average, is the sum of all values over the number of values. This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

The mean, also known as average, is the sum of all values over the number of values. This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

The median is the middle number of a list. This is often a good indicator of 'the middle' when there are outliers that skew the mean() value. This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

The Median Absolute Deviation is a robust measure of statistical dispersion. It is more resilient to outliers than the standard deviation.

The median is the middle number of a list. This is often a good indicator of 'the middle' when there are outliers that skew the mean() value. This is a measure of central tendency: a method of finding a typical or central value of a set of numbers.

The min is the lowest number in the array. This runs in O(n), linear time, with respect to the length of the array.

The minimum is the lowest number in the array. With a sorted array, the first element in the array is always the smallest, so this calculation can be done in one step, or constant time.

The mode is the number that appears in a list the highest number of times. There can be multiple modes in a list: in the event of a tie, this algorithm will return the most recently seen mode.

The mode is the number that appears in a list the highest number of times. There can be multiple modes in a list: in the event of a tie, this algorithm will return the most recently seen mode.

The mode is the number that appears in a list the highest number of times. There can be multiple modes in a list: in the event of a tie, this algorithm will return the most recently seen mode.

Sort an array of numbers by their numeric value, ensuring that the array is not changed in place.

Implementation of Heap's Algorithm for generating permutations.

Conducts a permutation test to determine if two data sets are significantly different from each other, using the difference of means between the groups as the test statistic. The function allows for the following hypotheses:

  • two_tail = Null hypothesis: the two distributions are equal.
  • greater = Null hypothesis: observations from sampleX tend to be smaller than those from sampleY.
  • less = Null hypothesis: observations from sampleX tend to be greater than those from sampleY. Learn more about one-tail vs two-tail tests.

The Poisson Distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time and/or space if these events occur with a known average rate and independently of the time since the last event.

The Probit is the inverse of cumulativeStdNormalProbability(), and is also known as the normal quantile function.

The product of an array is the result of multiplying all numbers together, starting using one as the multiplicative identity.

The quantile: this is a population quantile, since we assume to know the entire dataset in this library. This is an implementation of the Quantiles of a Population algorithm from wikipedia.

This function returns the quantile in which one would find the given value in the given array. It will copy and sort your array before each run, so if you know your array is already sorted, you should use quantileRankSorted instead.

This function returns the quantile in which one would find the given value in the given array. With a sorted array, leveraging binary search, we can find this information in logarithmic time.

This is the internal implementation of quantiles: when you know that the order is sorted, you don't need to re-sort it, and the computations are faster.

Rearrange items in arr so that all items in [left, k] range are the smallest. The k-th element will have the (k - left + 1)-th smallest value in [left, right].

Relative error.

The Root Mean Square (RMS) is a mean function used as a measure of the magnitude of a set of numbers, regardless of their sign. This is the square root of the mean of the squares of the input numbers. This runs in O(n), linear time, with respect to the length of the array.

The Root Mean Square (RMS) is a mean function used as a measure of the magnitude of a set of numbers, regardless of their sign. This is the square root of the mean of the squares of the input numbers. This runs in O(n), linear time, with respect to the length of the array.

The R Squared value of data compared with a function f is the sum of the squared differences between the prediction and the actual value.

Create a simple random sample from a given array of n elements.

The correlation is a measure of how correlated two datasets are, between -1 and 1

Sample covariance of two datasets: how much do the two datasets move together? x and y are two datasets, represented as arrays of numbers.

Kurtosis is a measure of the heaviness of a distribution's tails relative to its variance. The kurtosis value can be positive or negative, or even undefined.

The rank correlation is a measure of the strength of monotonic relationship between two arrays

Skewness is a measure of the extent to which a probability distribution of a real-valued random variable "leans" to one side of the mean. The skewness value can be positive or negative, or even undefined.

The sample standard deviation is the square root of the sample variance.

The sample variance is the sum of squared deviations from the mean. The sample variance is distinguished from the variance by the usage of Bessel's Correction: instead of dividing the sum of squared deviations by the length of the input, it is divided by the length minus one. This corrects the bias in estimating a value from a set that you don't know if full.

Sampling with replacement is a type of sampling that allows the same item to be picked out of a population more than once.

A Fisher-Yates shuffle is a fast way to create a random permutation of a finite set. This is a function around shuffle_in_place that adds the guarantee that it will not modify its input.

A Fisher-Yates shuffle in-place - which means that it will change the order of the original array by reference.

Sign is a function that extracts the sign of a real number

Calculate the silhouette values for clustered data.

Calculate the silhouette metric for a set of N-dimensional points arranged in groups. The metric is the largest individual silhouette value for the data.

The standard deviation is the square root of the variance. This is also known as the population standard deviation. It's useful for measuring the amount of variation or dispersion in a set of values.

When removing a value from a list, one does not have to necessary recompute the mean of the list in linear time. They can instead use this function to compute the new mean by providing the current mean, the number of elements in the list that produced it and the value to remove.

Our default sum is the Kahan-Babuska algorithm. This method is an improvement over the classical Kahan summation algorithm. It aims at computing the sum of a list of numbers while correcting for floating-point errors. Traditionally, sums are calculated as many successive additions, each one with its own floating-point roundoff. These losses in precision add up as the number of numbers increases. This alternative algorithm is more accurate than the simple way of calculating sums by simple addition.

The sum of deviations to the Nth power. When n=2 it's the sum of squared deviations. When n=3 it's the sum of cubed deviations.

The simple sum of an array is the result of adding all numbers together, starting from zero.

This is to compute a one-sample t-test, comparing the mean of a sample to a known value, x.

This is to compute two sample t-test. Tests whether "mean(X)-mean(Y) = difference", ( in the most common case, we often have difference == 0 to test if two samples are likely to be taken from populations with the same mean value) with no prior knowledge on standard deviations of both samples other than the fact that they have the same standard deviation.

For a sorted input, counting the number of unique values is possible in constant time and constant memory. This is a simple implementation of the algorithm.

The variance is the sum of squared deviations from the mean.

This function calculates the Wilcoxon rank sum statistic for the first sample with respect to the second. The Wilcoxon rank sum test is a non-parametric alternative to the t-test which is equivalent to the Mann-Whitney U test. The statistic is calculated by pooling all the observations together, ranking them, and then summing the ranks associated with one of the samples. If this rank sum is sufficiently large or small we reject the hypothesis that the two samples come from the same distribution in favor of the alternative that one is shifted with respect to the other.