
Statistical analysis of numerical data (non-parametric statistics). The normal law of probability distribution

In probability theory and mathematical statistics, various parametric families of distributions of numerical random variables are considered: families of normal, lognormal, exponential, gamma, Weibull-Gnedenko and other distributions. All of them depend on one, two or three parameters, so to describe a distribution completely it is enough to know or estimate one, two or three numbers. This is very convenient. For this reason, the parametric theory of mathematical statistics, in which the distributions of the observation results are assumed to belong to one or another parametric family, is widely developed.

Unfortunately, parametric families exist only in the minds of authors of textbooks on probability theory and mathematical statistics. They don't exist in real life. Therefore, econometrics mainly uses non-parametric methods, in which the distributions of the results of observations can have an arbitrary form.

First, using the example of a normal distribution, we will discuss in more detail the impossibility of the practical use of parametric families to describe the distributions of specific economic data. Then we will analyze parametric methods for rejecting outlier observations and demonstrate the impossibility of practical use of a number of methods of parametric statistics, the fallacy of the conclusions they lead to. Then we will analyze non-parametric methods of confidence estimation of the main characteristics of numerical random variables - mathematical expectation, median, variance, standard deviation, coefficient of variation. The lecture will conclude with methods for checking the homogeneity of two samples, independent or related.

Is the distribution of observations often normal?

Econometric and economic-mathematical models used, in particular, in the study and optimization of marketing and management processes, enterprise and regional management, the accuracy and stability of technological processes, in problems of reliability and safety (including environmental safety) of the functioning of technical devices and objects, and in the development of organizational charts, often apply the concepts and results of probability theory and mathematical statistics. In doing so, certain parametric families of probability distributions are often used. The most popular is the normal distribution; the lognormal, exponential, gamma and Weibull-Gnedenko distributions, among others, are also used.

Obviously, it is always necessary to check the conformity of models to reality. There are two questions. Do the actual distributions differ from those used in the model? To what extent does this difference affect the conclusions?

Below, using the example of the normal distribution and the outlier-rejection methods based on it (methods for discarding sharply deviating observations), it is shown that real distributions almost always differ from those included in the classical parametric families, and that the existing deviations from these families make the conclusions based on the use of these families (in the case under consideration, the conclusions about rejecting outliers) incorrect.

Is there any reason to assume a priori the normality of the measurement results?

It is sometimes argued that when a measurement error (or another random variable) is formed as the combined result of many small factors, then, by virtue of the Central Limit Theorem (CLT) of probability theory, this variable is well approximated (in distribution) by a normal random variable. This statement is true if the small factors act additively and independently of each other. If they act multiplicatively, then, by the same CLT, the approximation should be a lognormal distribution. In applied problems it is usually impossible to justify additivity rather than multiplicativity of the action of small factors. If the dependence is of a general nature and does not reduce to an additive or multiplicative form, and there are no grounds for accepting models that give exponential, Weibull-Gnedenko, gamma or other distributions, then practically nothing is known about the distribution of the resulting random variable apart from intra-mathematical properties such as regularity.
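The contrast between additive and multiplicative action is easy to see in a short simulation. The sketch below is my own illustration (not part of the lecture): it combines many small uniform factors additively and multiplicatively and checks the results against the normal and lognormal models.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_factors, n_samples = 200, 10_000

# Many small independent factors acting additively vs. multiplicatively.
additive = rng.uniform(-0.01, 0.01, size=(n_samples, n_factors)).sum(axis=1)
multiplicative = rng.uniform(0.99, 1.01, size=(n_samples, n_factors)).prod(axis=1)

# The additive combination is close to normal, the multiplicative one to lognormal.
# (The p-values are only indicative: the parameters are estimated from the same data.)
print(stats.kstest(additive, "norm",
                   args=(additive.mean(), additive.std(ddof=1))))
log_mult = np.log(multiplicative)
print(stats.kstest(log_mult, "norm",
                   args=(log_mult.mean(), log_mult.std(ddof=1))))
```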

When processing specific data, it is sometimes assumed that measurement errors have a normal distribution. The classical models of regression analysis, analysis of variance and factor analysis, as well as metrological models, are built on the assumption of normality; they can still be found both in domestic normative and technical documentation and in international standards. The models for calculating the maximum attainable levels of certain characteristics, used in the design of systems for ensuring the safe functioning of economic structures, technical devices and objects, are based on the same assumption. However, there is no theoretical basis for such an assumption; the distribution of errors must be studied experimentally.

What do the experimental results show? The summary given in the monograph allows us to state that in most cases the distribution of measurement errors differs from the normal one. Thus, at the Machine and Electrotechnical Institute (Varna, Bulgaria), the distribution of calibration errors for the scales of analog electrical measuring instruments was studied. Devices manufactured in Czechoslovakia, the USSR and Bulgaria were examined. The error distribution law turned out to be the same for all of them, and its density is not that of a normal distribution.

We analyzed data on the parameters of 219 actual error distributions, studied by different authors when measuring both electrical and non-electrical quantities with a wide variety of (electrical) devices. This study showed that 111 distributions, i.e. approximately 50%, belong to the class of distributions with density

f(x) = β / (2σΓ(1/β)) · exp( −|x − m|^β / σ^β ),

where β is the degree (shape) parameter, m is the shift parameter, σ is the scale parameter, and Γ(1/β) is the gamma function of the argument 1/β.
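For readers who want to try this family concretely: up to notation, it is what SciPy calls the generalized normal distribution (gennorm), with m and σ playing the role of loc and scale. A minimal sketch of mine, assuming SciPy is available:

```python
import numpy as np
from scipy import stats

x = np.linspace(-4, 4, 9)
m, sigma = 0.0, 1.0   # shift and scale parameters

# scipy.stats.gennorm implements exactly this family:
# pdf(z; beta) = beta / (2 * Gamma(1/beta)) * exp(-|z|**beta), z = (x - m) / sigma.
for beta in (1.0, 1.5, 2.0):   # beta = 1 gives the Laplace shape, beta = 2 the Gaussian one
    pdf = stats.gennorm.pdf(x, beta, loc=m, scale=sigma)
    print(f"beta={beta}:", np.round(pdf, 4))
```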

The Applied Mathematics Laboratory of Tartu State University analyzed 2,500 samples from an archive of real statistical data. In 92% of the samples, the normality hypothesis had to be rejected.

The experimental data described above show that measurement errors in most cases have distributions that differ from normal ones. This means, in particular, that most applications of Student's t-test, of classical regression analysis and of other statistical methods based on normal theory are, strictly speaking, not justified, since the underlying axiom of normality of the distributions of the corresponding random variables is incorrect.

Obviously, in order to justify or reasonably change the current practice of analyzing statistical data, it is necessary to study the properties of data-analysis procedures in such "illegal" applications. The study of rejection procedures has shown that they are extremely unstable to deviations from normality, and it is therefore inadvisable to use them for processing real data (see below); and one cannot assert that an arbitrarily chosen procedure is stable against deviations from normality.
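To show the kind of instability meant here, the following sketch of mine uses the simple 3-sigma rule as a stand-in for parametric rejection rules and compares how often it flags "outliers" in clean normal samples versus clean heavy-tailed Student t(3) samples:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 100, 2000

def rejection_rate(sampler):
    """Share of clean samples in which the 3-sigma rule flags at least one 'outlier'."""
    flagged = 0
    for _ in range(reps):
        x = sampler(n)
        z = np.abs(x - x.mean()) / x.std(ddof=1)
        flagged += np.any(z > 3.0)
    return flagged / reps

# Both sets of samples are "clean", yet the rule flags observations far more
# often under the heavy-tailed t(3) law than under the normal one.
print("normal :", rejection_rate(lambda size: rng.normal(size=size)))
print("t(3)   :", rejection_rate(lambda size: rng.standard_t(3, size=size)))
```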

It is sometimes suggested that, before applying, say, Student's test for the homogeneity of two samples, one should check the normality of the data. Although there are many criteria for this, testing for normality is a more complex and time-consuming statistical procedure than testing for homogeneity (whether with Student-type statistics or with non-parametric tests). A fairly large number of observations is required to establish normality with sufficient reliability: to guarantee that the distribution function of the observation results differs from some normal one by no more than 0.01 (for any value of the argument), about 2,500 observations are required. In most economic, technical, biomedical and other applied studies the number of observations is significantly smaller. This is especially true for data used in studying problems related to ensuring the safe functioning of economic structures and technical objects.
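A minimal sketch of this two-step practice (my illustration, not a prescription from the lecture): check normality first, then choose between Student's test and a distribution-free alternative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
sample_a = rng.lognormal(mean=0.0, sigma=0.7, size=60)   # skewed "real" data
sample_b = rng.lognormal(mean=0.2, sigma=0.7, size=60)

# Step 1: normality check (Shapiro-Wilk); with only 60 observations the test
# has limited power, which is exactly the problem discussed above.
normal_a = stats.shapiro(sample_a).pvalue > 0.05
normal_b = stats.shapiro(sample_b).pvalue > 0.05

# Step 2: homogeneity test -- Student's t if normality was not rejected,
# otherwise a non-parametric (distribution-free) alternative.
if normal_a and normal_b:
    result = stats.ttest_ind(sample_a, sample_b)
else:
    result = stats.mannwhitneyu(sample_a, sample_b, alternative="two-sided")
print(result)
```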

Sometimes the CLT is used to bring the distribution of the error closer to normal by including special adders in the technological scheme of the measuring device. Let us evaluate the usefulness of this measure. Let Y1, Y2, ..., Yk be independent identically distributed random variables (the summed elementary errors) with distribution function H(x), and let Y = Y1 + Y2 + ... + Yk be the output of the adder.

An indicator of the proximity to normality provided by the adder is

δ(k, H) = sup over x of | P( (Y − k·m) / (σ·√k) < x ) − Φ(x) |,

where m and σ are the expectation and standard deviation of a single term and Φ is the standard normal distribution function.

The right-hand inequality in the corresponding two-sided bound for δ(k, H) follows from estimates of the constant in the Berry-Esseen inequality obtained in the book, and the left-hand one from the example in the monograph. For the normal law δ(k, H) = 0; for the uniform and two-point distributions it is strictly positive, the two-point case giving the lower bound. Therefore, to guarantee a distance (in the Kolmogorov metric) to the normal distribution of no more than 0.01 for such "unsuccessful" distributions, an impractically large number of terms is needed. Moreover, the readings of a real device are recorded with a finite number of decimal places and therefore have a discrete distribution, whereas for a continuous (in particular, normal) random variable the probability of falling into the discrete set of decimal numbers with a given number of decimal places is equal to 0.
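As a rough numerical companion to this argument, here is a Monte Carlo estimate of mine of the Kolmogorov distance δ(k, H) for the "unsuccessful" two-point error distribution; the point is how slowly it decreases with the number of terms k:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def kolmogorov_distance_to_normal(k, n_sims=100_000):
    """Rough Monte Carlo estimate of delta(k, H) for symmetric two-point (+1/-1) errors."""
    # The sum of k independent +/-1 terms equals 2 * Binomial(k, 1/2) - k.
    sums = 2.0 * rng.binomial(k, 0.5, size=n_sims) - k
    standardized = sums / np.sqrt(k)                      # zero mean, unit variance
    return stats.kstest(standardized, "norm").statistic   # sup |F_k - Phi|

# Monte Carlo noise is of the order of a few thousandths, so the values for
# large k are slight over-estimates; the slow decrease with k is still clear.
for k in (1, 10, 100, 1000):
    print(k, round(kolmogorov_distance_to_normal(k), 3))
```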

From what has been said above, it follows that measurement results, and statistical data in general, have properties that require them to be modeled by random variables whose distributions differ, to a greater or lesser extent, from normal ones. In most cases the distributions differ significantly from normal; in others a normal distribution can apparently be regarded as a kind of approximation, but there is never complete coincidence. This implies both the need to study the properties of classical statistical procedures in non-classical probabilistic models (similar to what is done below for Student's t-test) and the need to develop stable (allowing for deviations from normality) and non-parametric, including distribution-free, procedures and to introduce them widely into the practice of statistical data processing.

The considerations omitted here for other parametric families lead to similar conclusions. The result can be formulated as follows. Real data distributions almost never belong to any particular parametric family. Real distributions are always different from those included in the parametric families. The differences can be big or small, but they always exist. Let's try to understand how important these differences are for econometric analysis.

The normal distribution (Gaussian distribution) has always played a central role in probability theory, since it arises very often as the result of the influence of many factors, the contribution of each of which is negligible. The Central Limit Theorem (CLT) finds application in virtually all applied sciences, making the apparatus of statistics universal. However, there are very frequent cases when its application is impossible, and researchers try in every possible way to force their results to fit a Gaussian. I will now describe an alternative approach to the case when a distribution is shaped by many factors.

A brief history of the CLT. While Newton was still alive, Abraham de Moivre proved a theorem on the convergence of the centered and normalized number of occurrences of an event in a series of independent trials to the normal distribution. Throughout the 19th and early 20th centuries this theorem served as a model for further generalizations. Laplace proved the case of the uniform distribution, Poisson the local theorem for the case of unequal probabilities. Poincaré, Legendre and Gauss developed a rich theory of observational errors and the method of least squares, based on the convergence of errors to the normal distribution. Chebyshev proved an even stronger theorem for sums of random variables by developing the method of moments. In 1900 Lyapunov, relying on Chebyshev and Markov, proved the CLT in its current form, but only under the assumption that third-order moments exist. And only in 1934 did Feller settle the matter, showing that the existence of second-order moments is both a necessary and a sufficient condition.

The CLT can be formulated as follows: if random variables are independent, identically distributed and have a finite non-zero variance, then the (centered and normalized) sums of these variables converge to the normal law. It is in this form that the theorem is taught in universities and is so often used by observers and researchers who are not professional mathematicians. What is wrong with it? In fact, the theorem has excellent applications in the fields that Gauss, Poincaré, Chebyshev and the other geniuses of the 19th century worked on: the theory of observational errors, statistical physics, least squares, demographic studies, and perhaps a few more. But scientists who lack the originality to make discoveries of their own generalize and apply this theorem to everything, or simply drag the normal distribution in by the ears where it simply cannot be. If you want examples, I have them.

The intelligence quotient, IQ. It is assumed from the start that people's intelligence is normally distributed. A test is then compiled in advance in such a way that outstanding abilities are not taken into account; instead, several components are counted separately with equal fractional weights: logical thinking, spatial reasoning, computational ability, abstract thinking and a few others. The ability to solve problems beyond the reach of most people, or completing the test in ultra-fast time, is not counted at all, while having taken the test earlier improves the result (but not the intelligence) later on. And then the philistines believe that "no one can be twice as smart as they are" and that "we should take it away from the wise men and share it".

The second example: changes in financial indicators. The study of changes in stock prices, currency quotes and commodity options requires the apparatus of mathematical statistics, and here it is especially important not to make a mistake about the type of distribution. A case in point: in 1997 the Nobel Prize in Economics was awarded for the Black-Scholes model, which is based on the assumption of a normal distribution of the increments of stock indicators (so-called white noise). The authors themselves explicitly stated that the model needed refinement, but all that most subsequent researchers decided to do was simply to add a Poisson component to the normal distribution. This obviously leads to inaccuracies in the study of long time series, since the Poisson distribution satisfies the CLT too well: already with 20 terms it is indistinguishable from the normal distribution. Look at the picture below (taken from a very serious economic journal): it shows that, despite a fairly large number of observations and obvious distortions, the distribution is assumed to be normal.
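To back up the remark about 20 terms, here is a small calculation of my own (not the journal's analysis): the exact Kolmogorov distance between a standardized sum of 20 Poisson(1) variables, i.e. a standardized Poisson(20) variable, and the standard normal law.

```python
import numpy as np
from scipy import stats

k, lam = 20, 1.0                       # a sum of 20 Poisson(1) terms is Poisson(20)
mean, std = k * lam, np.sqrt(k * lam)

# Exact Kolmogorov distance between the standardized Poisson(20) law and the
# standard normal one, checked just before and exactly at every lattice point.
j = np.arange(0, 61)
cdf_at = stats.poisson.cdf(j, mean)
cdf_before = stats.poisson.cdf(j - 1, mean)
phi = stats.norm.cdf((j - mean) / std)
distance = np.max(np.maximum(np.abs(cdf_at - phi), np.abs(cdf_before - phi)))
print(round(distance, 3))              # about 0.06 -- hard to see on a histogram
```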


It is quite obvious that the distribution of wages among the population of a city, of file sizes on a disk, or of the populations of cities and countries will not be normal.

What the distributions in these examples have in common is a so-called "heavy tail", that is, values far from the mean, and a noticeable asymmetry, usually to the right. Let us consider what other distributions, besides the normal one, these could be. Let us start with the Poisson distribution mentioned earlier: it has a tail, but we want the law to be preserved when it is repeated over a set of groups in each of which it is observed (calculating file sizes for a whole enterprise, or salaries for several cities), or when it is rescaled (arbitrarily increasing or decreasing the time interval of the Black-Scholes model). As observations show, the tails and the asymmetry do not disappear, whereas the Poisson distribution, by the CLT, should become normal. For the same reason the Erlang, beta, lognormal and all other distributions with finite variance will not do. It remains to rule out the Pareto distribution, but it does not fit either, because its mode coincides with its minimum value, which almost never happens in real sample data.

Distributions with the required properties do exist; they are called stable distributions. Their history is also very interesting: the main theorem was proved a year after Feller's work, in 1935, by the joint efforts of the French mathematician Paul Lévy and the Soviet mathematician A.Ya. Khinchin. The CLT was generalized: the requirement that the variance exist was removed from it. Unlike the normal case, neither the density nor the distribution function of stable random variables can be expressed in closed form (with rare exceptions, discussed below); all that is known about them is the characteristic function (the inverse Fourier transform of the density, though one need not know this to understand what follows).
So, the theorem: if random variables are independent and identically distributed, then their (suitably centered and normalized) sums converge to a stable law.

Now the definition. A random variable X is stable if and only if the logarithm of its characteristic function φ(t) can be represented as

ln φ(t) = iμt − |σt|^α · (1 − iβ·sign(t)·ω(t, α)),

where ω(t, α) = tan(πα/2) if α ≠ 1 and ω(t, α) = −(2/π)·ln|t| if α = 1; here 0 < α ≤ 2, −1 ≤ β ≤ 1, σ > 0, and μ is a real number.

In fact, there is nothing very complicated here; one only needs to explain the meaning of the four parameters. The parameters sigma and mu are the usual scale and shift, as in the normal distribution; mu equals the expectation if it exists, and it exists when alpha is greater than one. The parameter beta controls the asymmetry; when it equals zero the distribution is symmetric. Alpha is the characteristic exponent, which indicates the order of the moments that exist: the closer it is to two, the more the distribution resembles the normal one; when it equals two the distribution becomes normal, and only in this case do moments of higher orders exist, and the asymmetry parameter loses its effect. When alpha equals one and beta equals zero we get the Cauchy distribution, and when alpha equals one half and beta equals one, the Lévy distribution; in the other cases there is no expression in quadratures for the density of such variables.
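As a sanity check of the definition, the log-characteristic function is easy to code directly; the sketch and the function name log_cf_stable below are mine, written under the parameterization given above.

```python
import numpy as np

def log_cf_stable(t, alpha, beta, sigma=1.0, mu=0.0):
    """Logarithm of the characteristic function of a stable law in the
    parameterization written above (0 < alpha <= 2, -1 <= beta <= 1, sigma > 0)."""
    t = np.asarray(t, dtype=float)
    if alpha != 1.0:
        omega = np.tan(np.pi * alpha / 2.0)
    else:
        omega = -(2.0 / np.pi) * np.log(np.abs(t))
    return (1j * mu * t
            - np.abs(sigma * t) ** alpha * (1 - 1j * beta * np.sign(t) * omega))

t = np.array([-2.0, -0.5, 0.5, 2.0])
print(log_cf_stable(t, alpha=2.0, beta=0.0))   # alpha = 2: the normal case, -(sigma*t)**2
print(log_cf_stable(t, alpha=1.0, beta=0.0))   # alpha = 1, beta = 0: the Cauchy case, -|sigma*t|
```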
In the 20th century a rich theory of stable random variables and processes (called Lévy processes) was developed: their connection with fractional integrals was shown, various parameterizations and simulation methods were introduced, the parameters were estimated in several ways, and the consistency and stability of the estimates were proved. Look at the picture: it shows a simulated trajectory of a Lévy process with a fragment enlarged 15 times.


It was while working with such processes and their applications in finance that Benoit Mandelbrot came up with fractals. However, things were not good everywhere. The second half of the 20th century passed under the general trend towards applied and cybernetic sciences, which meant a crisis for pure mathematics: everyone wanted to produce but not to think, and the humanities invaded the mathematical sphere with their journalism. An example: the book "Fifty Challenging Problems in Probability with Solutions" by the American Mosteller, problem number 11:


The author's solution to this problem is simply an affront to common sense:

The same goes for problem 25, where THREE contradictory answers are given.

But back to stable distributions. In the rest of the article I will try to show that there should be no additional difficulties in working with them: there are numerical and statistical methods that make it possible to estimate the parameters, compute the distribution function and simulate such variables, that is, to work with them just as with any other distribution.
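For instance, SciPy already ships a stable distribution (levy_stable); the sketch below is mine and is not the implementation discussed in this article, but it shows that the parameters can be estimated, and the density, distribution function and samples obtained, just as for any other law.

```python
import numpy as np
from scipy import stats

alpha, beta, loc, scale = 1.7, 0.0, 0.0, 1.0

# Sample from a stable law, estimate the parameters back, and evaluate the
# density and distribution function at a few points.  The fit step may be
# slow, since the likelihood has no closed form.
sample = stats.levy_stable.rvs(alpha, beta, loc=loc, scale=scale,
                               size=500, random_state=4)
print("estimated (alpha, beta, loc, scale):",
      np.round(stats.levy_stable.fit(sample), 2))

x = np.array([-3.0, 0.0, 3.0])
print("pdf:", np.round(stats.levy_stable.pdf(x, alpha, beta), 4))
print("cdf:", np.round(stats.levy_stable.cdf(x, alpha, beta), 4))
```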

Modeling stable random variables. Since everything is learned by comparison, let me first recall the method of generating a normal variable that is most convenient from a computational point of view (the Box-Muller method): if U1 and U2 are basic random variables (uniformly distributed on (0, 1]), then

Z1 = sqrt(−2·ln U1)·cos(2π·U2) and Z2 = sqrt(−2·ln U1)·sin(2π·U2)

are independent standard normal random variables.
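A direct transcription of this transform in code (a minimal sketch of mine):

```python
import numpy as np

def box_muller(n, rng=None):
    """Generate n standard normal variates from uniform (0, 1] pairs
    via the Box-Muller transform."""
    rng = rng or np.random.default_rng()
    m = (n + 1) // 2
    u1 = 1.0 - rng.random(m)                 # values in (0, 1], safe to take log
    u2 = rng.random(m)
    r = np.sqrt(-2.0 * np.log(u1))
    z = np.concatenate([r * np.cos(2.0 * np.pi * u2),
                        r * np.sin(2.0 * np.pi * u2)])
    return z[:n]

sample = box_muller(100_000, np.random.default_rng(5))
print(round(sample.mean(), 3), round(sample.std(), 3))   # close to 0 and 1
```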