68 Concentration

68.1 Entropy (Shannon 1948)

68.1.1 Definition

In applied statistics, concentration measures quantify how unevenly a total is distributed across categories (for example market shares, income shares, or portfolio weights). A low concentration means the shares are spread out; a high concentration means a few categories dominate.

Entropy is often described as the amount of information contained in an object (or the amount of disorder in a physical system). It is an inverse measure of concentration: the more evenly the shares are spread, the higher the entropy. For shares \(p_i\) it is defined as

\[ H = - \sum_{i=1}^{n} p_i \ln p_i \]

where \(0 \leq H \leq \ln n\), \(p_i = \frac{x_i}{X}\), and \(X = \sum_{i=1}^{n} x_i\). Terms with \(p_i = 0\) are taken to contribute zero (the usual convention \(0 \ln 0 = 0\)).
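The two bounds of the definition can be illustrated with a minimal sketch in base R. The helper name `shannon_entropy` and the two data vectors are illustrative assumptions, not part of the original module; zero shares are dropped on the assumption that \(0 \ln 0 = 0\).

```r
# Shannon entropy of a vector of nonnegative totals (hypothetical helper)
shannon_entropy <- function(x) {
  p <- x / sum(x)        # shares p_i = x_i / X
  p <- p[p > 0]          # assumed convention: zero shares contribute nothing
  -sum(p * log(p))       # natural logarithm, as in the definition
}

equal_shares  <- c(25, 25, 25, 25)   # illustrative data: no concentration
single_winner <- c(100, 0, 0, 0)     # illustrative data: full concentration

shannon_entropy(equal_shares)    # equals log(4), the upper bound ln n
shannon_entropy(single_winner)   # equals 0, the lower bound
```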

68.2 Maximum Entropy

\[ H_{max} = - \sum_{i=1}^{n} \frac{1}{n} \ln \frac{1}{n} = - \frac{n}{n} \ln \frac{1}{n} = - \ln \frac{1}{n} = \ln n \]

68.3 Normalized Entropy

\[ H_o = \frac{H}{H_{max}} = \frac{H}{\ln n} \]

where \(0 \leq H_o \leq 1\), \(H = - \sum_{i=1}^{n} p_i \ln p_i\) (for \(0 \leq H \leq \ln n\)), and \(H_{max} = \ln n\).
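As a rough sketch, the normalization can be written in base R as follows. The function name `normalized_entropy` is a hypothetical helper; the data are the series used in the R Module section of this chapter, and at least two categories are assumed so that \(\ln n > 0\).

```r
# Normalized entropy H / ln(n) (hypothetical helper, assumes length(x) >= 2)
normalized_entropy <- function(x) {
  p <- x / sum(x)                 # shares
  p_pos <- p[p > 0]               # assumed convention: 0 * ln(0) = 0
  h <- -sum(p_pos * log(p_pos))   # entropy H
  h / log(length(x))              # divide by H_max = ln n
}

x <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 115)
normalized_entropy(x)   # about 0.998: the shares are almost equal
```

A value close to 1 indicates that the total is spread almost evenly over the categories.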

68.4 Exponential Index

68.4.1 Definition

\[ e^{-H} = \prod_{i=1}^{n} p_i^{p_i} \]

where \(H = - \sum_{i=1}^{n} p_i \ln p_i\) (for \(0 \leq H \leq \ln n\)), \(p_i = \frac{x_i}{X}\), and \(X = \sum_{i=1}^{n} x_i\).

68.4.2 Property

The relationship between Entropy and the Exponential Index can be written as

\[ e^{-H} = \prod_{i=1}^{n} p_i^{p_i} \]

\[ \ln \left( e^{-H} \right) = \ln \prod_{i=1}^{n} \left( p_i^{p_i} \right) \]

\[ -H = \sum_{i=1}^{n} \ln \left( p_i^{p_i} \right) = \sum_{i=1}^{n} p_i \ln p_i \]
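The identity can also be checked numerically. The following is a small sketch in base R with illustrative, strictly positive totals (the variable names are assumptions made for this example).

```r
x <- c(10, 20, 30, 40)                       # illustrative totals, all positive
p <- x / sum(x)                              # shares p_i

entropy_value          <- -sum(p * log(p))   # H
exp_index_from_entropy <- exp(-entropy_value)
exp_index_from_product <- prod(p^p)          # product of p_i^p_i

all.equal(exp_index_from_entropy, exp_index_from_product)   # TRUE
```

Because \(0 \leq H \leq \ln n\), the Exponential Index lies between \(\frac{1}{n}\) (equal shares) and 1 (a single dominant category).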

68.5 Herfindahl Measure (Herfindahl 1950)

68.5.1 Definition

\[ H_e = \sum_{i=1}^{n} p_i^2 \]

where \(\frac{1}{n} \leq H_e \leq 1\), \(\sum_{i=1}^{n} p_i^2 = \sum_{i=1}^{n} \frac{x_i^2}{X^2}\), and \(X = \sum_{i=1}^{n} x_i\).

68.6 Normalized Herfindahl

\[ H_e^* = \frac{H_e - \frac{1}{n}}{1 - \frac{1}{n}} \]

where \(0 \leq H_e^* \leq 1\), \(H_e = \sum_{i=1}^{n} p_i^2 = \sum_{i=1}^{n} \frac{x_i^2}{X^2}\), \(X = \sum_{i=1}^{n} x_i\), and \(\frac{1}{n} \leq H_e \leq 1\).
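A minimal base R sketch of both measures is given below. The function names `herfindahl` and `normalized_herfindahl` and the two data vectors are hypothetical and chosen only to show the bounds; at least two categories are assumed.

```r
# Herfindahl measure: sum of squared shares (hypothetical helper)
herfindahl <- function(x) {
  p <- x / sum(x)
  sum(p^2)
}

# Normalized Herfindahl: rescaled to [0, 1] (assumes length(x) >= 2)
normalized_herfindahl <- function(x) {
  n <- length(x)
  (herfindahl(x) - 1 / n) / (1 - 1 / n)
}

equal_shares <- c(25, 25, 25, 25)   # illustrative data
monopoly     <- c(100, 0, 0, 0)     # illustrative data

herfindahl(equal_shares)             # 0.25, i.e. the lower bound 1/n
normalized_herfindahl(equal_shares)  # 0
herfindahl(monopoly)                 # 1
normalized_herfindahl(monopoly)      # 1
```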

68.6.1 Property

\[ H_e^* \propto CV^2 \]

\[ H_e^* = \frac{CV^2}{n-1} = \frac{s^2}{\bar{x}^2 (n-1)} \]

where \(CV = \frac{s}{\bar{x}}\), \(s^2 = \frac{1}{n} \sum_{i=1}^{n} \left( x_i - \bar{x} \right)^2\), and \(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\).
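The property can be verified numerically with a short base R sketch (variable names are assumptions made for this example). Note that the variance must use the divisor \(n\), as in the definition of \(s^2\) above, whereas R's built-in `var()` divides by \(n - 1\).

```r
x <- c(112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118, 115)
n <- length(x)
p <- x / sum(x)

# Normalized Herfindahl from its definition
h_star <- (sum(p^2) - 1 / n) / (1 - 1 / n)

# Squared coefficient of variation with the population variance (divisor n)
pop_var    <- mean((x - mean(x))^2)
cv_squared <- pop_var / mean(x)^2

all.equal(h_star, cv_squared / (n - 1))   # TRUE
```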

68.7 Gini Coefficient (Gini 1912)

68.7.1 Definition 1

For the Gini formulas below, the observations must be ordered in nondecreasing order, i.e. \(x_{(1)} \leq x_{(2)} \leq \cdots \leq x_{(n)}\) (so \(x_i\) denotes the \(i^{\text{th}}\) ordered value).

\[ G_1 = \left( \frac{2}{n^2 \bar{x}} \right) \sum_{i=1}^{n} \left( \left( i - \frac{n+1}{2} \right) x_i \right) \]

where \(0 \leq G_1 \leq 1\), and \(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\).

68.7.2 Definition 2

\[ G_2 = \left( \frac{2}{n^2 \bar{x}} \right) \sum_{i=1}^{n} (i x_i) - \frac{n+1}{n} \]

where \(0 \leq G_2 \leq 1\), and \(\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i\).

68.7.3 Proof 1

\[ G_1 = \left( \frac{2}{n^2 \bar{x}} \right) \sum_{i=1}^{n} \left( \left( i - \frac{n+1}{2} \right) x_i \right) \]

\[ G_1 = \left( \frac{2}{n^2 \bar{x}} \right) \sum_{i=1}^{n} i x_i - \left( \frac{2}{n^2 \bar{x}} \right) \frac{n+1}{2} \sum_{i=1}^{n} x_i \]

\[ \left( \frac{2}{n^2 \bar{x}} \right) \frac{n+1}{2} \sum_{i=1}^{n} x_i = \left( \frac{2}{n^2 \bar{x}} \right) \frac{n+1}{2} (n \bar{x}) = \frac{n+1}{n} \]

\[ G_1 = \left( \frac{2}{n^2 \bar{x}} \right) \sum_{i=1}^{n} (i x_i) - \frac{n+1}{n} = G_2 \]
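The equivalence of the first two definitions can be illustrated with a base R sketch. The function names `gini_def1` and `gini_def2` are hypothetical helpers; both sort the data first, since the formulas assume nondecreasing order.

```r
# Definition 1 (hypothetical helper)
gini_def1 <- function(x) {
  x <- sort(x)                    # nondecreasing order is required
  n <- length(x)
  i <- seq_len(n)
  2 / (n^2 * mean(x)) * sum((i - (n + 1) / 2) * x)
}

# Definition 2 (hypothetical helper)
gini_def2 <- function(x) {
  x <- sort(x)
  n <- length(x)
  i <- seq_len(n)
  2 / (n^2 * mean(x)) * sum(i * x) - (n + 1) / n
}

x_small <- c(3, 1)                # illustrative data, deliberately unsorted
gini_def1(x_small)                # 0.25
gini_def2(x_small)                # 0.25
```

For the two values 1 and 3 the weighted sum in Definition 1 is \(-0.5 \cdot 1 + 0.5 \cdot 3 = 1\), which is multiplied by \(\frac{2}{4 \cdot 2} = 0.25\).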

68.7.4 Definition 3

For Definitions 3 and 4, the shares \(p_i\) (equivalently the values \(x_i\)) must be in nondecreasing order before computing the cumulative shares \(v_i\).

\[ G_3 = 1 - \sum_{i=1}^{n} \frac{v_i + v_{i-1}}{n} \]

where \(0 \leq G_3 \leq 1\), \(v_i = \sum_{j=1}^{i} p_j = \sum_{j=1}^{i} \frac{x_j}{\sum_{l=1}^{n} x_l}\) for \(i = 1, 2, \ldots, n\), and \(v_0 = 0\).

68.7.5 Definition 4

\[ G_4 = \frac{n + 1 - 2V}{n} \]

where \(0 \leq G_4 \leq 1\), \(V = \sum_{i=1}^{n} v_i\), \(v_i = \sum_{j=1}^{i} p_j = \sum_{j=1}^{i} \frac{x_j}{\sum_{l=1}^{n} x_l}\) for \(i = 1, 2, \ldots, n\), and \(v_0 = 0\).

68.7.6 Proof 2

\[ \begin{align*}G_3 &= 1 - \sum_{i=1}^{n} \frac{v_i + v_{i-1}}{n} & \\\sum_{i=1}^{n} \frac{v_i + v_{i-1}}{n} &= 1 - G_3 & \\&= \sum_{i=1}^{n} \frac{v_i}{n} + \sum_{i=1}^{n} \frac{v_{i-1}}{n} & v_0 = 0 \\&= \frac{1}{n} \frac{1}{X} \sum_{i=1}^{n} \sum_{j=1}^{i} x_j + \frac{1}{n} \frac{1}{X} \sum_{i=2}^{n} \sum_{j=1}^{i-1} x_j & v_i = \sum_{j=1}^{i} \frac{x_j}{X} \\&= \frac{1}{n} \frac{1}{n \bar{x}} \left( \sum_{i=1}^{n} ((n-i+1)x_i) + \sum_{i=1}^{n} ((n-i) x_i) \right) & X = n \bar{x} \\&= \frac{1}{n^2 \bar{x}} 2 \sum_{i=1}^{n} \left( n - i + \frac{1}{2} \right) x_i & \\&= \frac{2}{n^2 \bar{x}} \left( n \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} i x_i + \frac{1}{2} \sum_{i=1}^{n} x_i \right) & \\&= \frac{2}{n^2 \bar{x}} \left( n^2 \bar{x} - \sum_{i=1}^{n} i x_i + \frac{1}{2} n \bar{x} \right) & \\&= - \left( \frac{2}{n^2 \bar{x}} \right) \sum_{i=1}^{n} i x_i + 2 + \frac{1}{n} & \\G_3 &= 1 + \left( \frac{2}{n^2 \bar{x}} \right) \sum_{i=1}^{n} i x_i - 2 - \frac{1}{n} & \\&= \left( \frac{2}{n^2 \bar{x}} \right) \sum_{i=1}^{n} i x_i - 1 - \frac{1}{n} & \\&= \left( \frac{2}{n^2 \bar{x}} \right) \sum_{i=1}^{n} i x_i - \frac{n+1}{n} = G_2 &\end{align*} \]

68.7.7 Proof 3

\[ \begin{align*}G_4 &= \frac{n+1-2V}{n} = \frac{n+1}{n} - \frac{2}{n}V \\V &= \sum_{i=1}^{n} v_i = \sum_{i=1}^{n} \sum_{j=1}^{i} \frac{x_j}{X} = \frac{1}{X} \sum_{i=1}^{n} \sum_{j=1}^{i} x_j \\&= \frac{1}{n \bar{x}} \left( \sum_{i=1}^{n} (n-i+1) x_i \right) \\&= \frac{1}{n\bar{x}} \left( n \sum_{i=1}^{n} x_i - \sum_{i=1}^{n} i x_i + n \bar{x} \right) \\&= n - \frac{1}{n \bar{x}} \sum_{i=1}^{n} i x_i + 1 \\G_4 &= \frac{n+1}{n} - \frac{2}{n} \left( n - \frac{1}{n \bar{x}} \sum_{i=1}^{n} i x_i + 1 \right) \\&= \frac{n+1}{n} - 2 + \frac{2}{n^2 \bar{x}} \sum_{i=1}^{n} i x_i - \frac{2}{n} \\&= \left( \frac{2}{n^2 \bar{x}} \right) \sum_{i=1}^{n} i x_i + \frac{n+1-2n-2}{n} \\&= \left( \frac{2}{n^2 \bar{x}} \right) \sum_{i=1}^{n} i x_i - \frac{n+1}{n} = G_2\end{align*} \]
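Definitions 3 and 4 can be sketched in base R using cumulative shares. The function names `gini_def3` and `gini_def4` are hypothetical helpers, and the data vector is illustrative.

```r
# Definition 3: based on v_i + v_{i-1} (hypothetical helper)
gini_def3 <- function(x) {
  p <- sort(x) / sum(x)           # shares in nondecreasing order
  n <- length(p)
  v <- cumsum(p)                  # cumulative shares v_1, ..., v_n
  v_lag <- c(0, v[-n])            # v_0, ..., v_{n-1}, with v_0 = 0
  1 - sum((v + v_lag) / n)
}

# Definition 4: based on V, the sum of the cumulative shares (hypothetical helper)
gini_def4 <- function(x) {
  p <- sort(x) / sum(x)
  n <- length(p)
  big_v <- sum(cumsum(p))
  (n + 1 - 2 * big_v) / n
}

x_small <- c(3, 1)                # illustrative data, deliberately unsorted
gini_def3(x_small)                # 0.25
gini_def4(x_small)                # 0.25
```

For these data the cumulative shares are \(v_1 = 0.25\) and \(v_2 = 1\), so \(V = 1.25\) and \(G_4 = \frac{3 - 2.5}{2} = 0.25\).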

68.7.8 Property

The Gini Coefficient is closely related to the Lorenz Curve (Lorenz 1905), the graphical representation of the cumulative distribution of wealth or income. Typically the cumulative % of households is shown on the horizontal axis and the cumulative % of income on the vertical axis.

Here again, the cumulative shares \(v_i\) are computed from values/shares ordered in nondecreasing order.

The surface under the Lorenz curve is

\[ \frac{1}{2} \sum_{i=1}^{n} \frac{1}{n} \left( v_i + v_{i-1} \right) \]

The surface between the diagonal and the Lorenz curve is

\[ \frac{1}{2} - \frac{1}{2} \sum_{i=1}^{n} \frac{1}{n} (v_i + v_{i-1}) \]

The Gini Coefficient is defined as the surface between the diagonal and the Lorenz curve, relative to the total surface under the diagonal (which equals \(\frac{1}{2}\)):

\[ G = \frac{\frac{1}{2} - \frac{1}{2} \sum_{i=1}^{n}\frac{1}{n}(v_i + v_{i-1})}{\frac{1}{2}} \]

\[ G = 1 - \sum_{i=1}^{n} \frac{v_i + v_{i-1}}{n} = G_3 \]
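As a small worked example (with illustrative data that are not part of the original text), take \(n = 2\) ordered values \(x_1 = 1\) and \(x_2 = 3\). The shares are \(p_1 = \frac{1}{4}\) and \(p_2 = \frac{3}{4}\), so the cumulative shares are

\[ v_0 = 0, \quad v_1 = \frac{1}{4}, \quad v_2 = 1 \]

The surface under the Lorenz curve is

\[ \frac{1}{2} \cdot \frac{1}{2} \left( (v_1 + v_0) + (v_2 + v_1) \right) = \frac{1}{4} \left( \frac{1}{4} + \frac{5}{4} \right) = \frac{3}{8} \]

and the surface between the diagonal and the Lorenz curve is \(\frac{1}{2} - \frac{3}{8} = \frac{1}{8}\), so that

\[ G = \frac{\frac{1}{8}}{\frac{1}{2}} = \frac{1}{4} \]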

68.8 Coefficient of Concentration

68.8.1 Definition

\[ C = \frac{n}{n-1} G \]

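where \(G\) is the Gini Coefficient. For nonnegative data the Gini Coefficient cannot exceed \(\frac{n-1}{n}\) (the value reached when a single category holds the entire total), so the factor \(\frac{n}{n-1}\) rescales it such that \(0 \leq C \leq 1\).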
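The effect of the correction factor can be seen in the extreme case where one category holds everything. The base R sketch below uses the hypothetical helper `concentration_coefficient` and illustrative data, and assumes at least two categories.

```r
# Coefficient of Concentration C = n / (n - 1) * G (hypothetical helper)
concentration_coefficient <- function(x) {
  x <- sort(x)                                   # nondecreasing order
  n <- length(x)                                 # assumes n >= 2
  gini <- 2 / (n^2 * mean(x)) * sum(seq_len(n) * x) - (n + 1) / n
  n / (n - 1) * gini
}

monopoly <- c(0, 0, 0, 100)                      # illustrative data
concentration_coefficient(monopoly)              # 1, up to floating-point rounding
```

Here the Gini Coefficient itself is \(\frac{n-1}{n} = 0.75\), and multiplying by \(\frac{4}{3}\) gives 1.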

68.9 R Module

68.9.1 Public website

The Concentration module can be found on the public website:

https://compute.wessa.net/rwasp_concentration.wasp

68.9.2 RFC

The Concentration module is available in RFC under the menu item “Descriptive / Concentration”.

If you prefer to compute the Concentration measures on your local machine, the following script can be used in the R console:

library(ineq)

# Example data: one total per category
x <- c(112,118,132,129,121,135,148,148,136,119,104,118,115)

# Entropy-based measures
myLength <- length(x)
myMaximumEntropy <- log(myLength)
mySum <- sum(x)
myProportion <- x/mySum
myEntropy <- -sum(myProportion * log(myProportion))
myNormalizedEntropy <- myEntropy / myMaximumEntropy
myDifference <- myMaximumEntropy - myEntropy
myTheilEntropyIndex <- entropy(x,parameter=1,na.rm=TRUE)
myExponentialIndex <- exp(-myEntropy)
# Herfindahl-type measures
myHerfindahlMeasure <- sum(myProportion^2)
myHerfindahl <- conc(x,type='Herfindahl',na.rm=TRUE)
myRosenbluth <- conc(x,type='Rosenbluth',na.rm=TRUE)
myNormalizedHerfindahlMeasure <- (myHerfindahlMeasure - 1/myLength) / (1 - 1/myLength)
# Gini and related inequality measures
myGini <- Gini(x,na.rm=TRUE)
myConcentrationCoefficient <- myLength/(myLength -1)*myGini
myRS <- RS(x,na.rm=TRUE)
myAtkinson <- Atkinson(x,na.rm=TRUE)
myKolm <- Kolm(x,na.rm=TRUE)
# Coefficient of variation and its square
myCoefficientOfVariation <- var.coeff(x,square=FALSE,na.rm=TRUE)
mySquaredCoefficientOfVariation <- var.coeff(x,square=TRUE,na.rm=TRUE)
#Number of Categories
myLength
#Maximum Entropy
myMaximumEntropy
#Entropy
myEntropy
#Normalised Entropy
myNormalizedEntropy
#Max. Entropy - Entropy
myDifference
#Theil Entropy Index
myTheilEntropyIndex
#Exponential Index
myExponentialIndex
#Herfindahl
myHerfindahl
#Normalised Herfindahl
myNormalizedHerfindahlMeasure
#Rosenbluth
myRosenbluth
#Gini
myGini
#Concentration
myConcentrationCoefficient
#Ricci-Schutz (Pietra)
myRS
#Atkinson
myAtkinson
#Kolm
myKolm
#Coefficient of Variation
myCoefficientOfVariation
#Squared Coefficient of Variation
mySquaredCoefficientOfVariation

[1] 13
[1] 2.564949
[1] 2.559644
[1] 0.9979317
[1] 0.005304969
[1] 0.005304969
[1] 0.07733224
[1] 0.07774467
[1] 0.000890061
[1] 0.08166425
[1] 0.05805693
[1] 0.06289501
[1] 0.04488356
[1] 0.002645568
[1] 19.20464
[1] 0.1033476
[1] 0.01068073

The Lorenz curves can be obtained as follows:

plot(Lc(x))
grid()

plot(Lc(x),general=T)
grid()

To compute the Concentration measures, the R code uses several functions from the ineq library: entropy, conc, Gini, RS (the Ricci-Schutz or Pietra index; Pietra (1915)), Atkinson (Atkinson 1970), Kolm, and var.coeff. The Theil entropy index (Theil 1967) is also computed, through the entropy function. Note that some functions have a parameter, na.rm, which eliminates missing data before the actual computation takes place. It is generally a good idea to set this parameter to TRUE (or its abbreviation T). If, however, this parameter is not available, one can instead use the command x = na.omit(x) before any computation takes place.
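The effect of removing missing data first can be sketched in base R as follows. The data vector and variable names are illustrative assumptions, and the entropy is computed by hand rather than with a library function.

```r
x_with_gaps <- c(112, NA, 132, 129, NA, 135)   # illustrative data with missing values

# Without cleaning, the missing values propagate and the result is NA
p_raw <- x_with_gaps / sum(x_with_gaps)
entropy_raw <- -sum(p_raw * log(p_raw))
entropy_raw                                    # NA

# After na.omit() only the complete observations remain
x_clean <- na.omit(x_with_gaps)
p_clean <- x_clean / sum(x_clean)
entropy_clean <- -sum(p_clean * log(p_clean))
entropy_clean                                  # a finite value, at most log(4)
```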

68.10 Purpose

Concentration measures are used for a wide variety of purposes. For instance, in Economics they are used to study income/wealth inequality, and in Biology they have been employed as statistics for biodiversity. In addition, Concentration measures are often used in other types of statistical analysis such as machine learning algorithms (for example, entropy is a common splitting criterion in decision trees).