
Research methods: dispersion and correlation analysis. ANOVA methods

Analysis of variance

1. The concept of analysis of variance

Analysis of variance is the analysis of the variability of a trait under the influence of controlled variable factors. In foreign literature, analysis of variance is often referred to as ANOVA (Analysis of Variance).

The task of analysis of variance is to isolate, within the overall variability of the trait, variability of three different kinds:

a) variability due to the action of each of the studied independent variables;

b) variability due to the interaction of the studied independent variables;

c) random variation due to all other unknown variables.

The variability due to the action of the studied variables and their interaction is compared with the random variability. The indicator of this comparison is Fisher's F test.

The formula for calculating the F criterion includes estimates of variances, that is, parameters of the distribution of the trait, so the F criterion is a parametric one.

The more of the trait's variability is due to the studied variables (factors) or their interaction, the higher the empirical values of the criterion.

The null hypothesis in the analysis of variance states that the mean values of the studied response trait are the same in all gradations.

The alternative hypothesis states that the mean values of the response trait differ across the gradations of the studied factor.

Analysis of variance allows us to establish that a trait changes, but does not indicate the direction of these changes.

Let us begin the analysis of variance with the simplest case, when the action of only one variable (a single factor) is studied.

2. One-way analysis of variance for unrelated samples

2.1. Purpose of the method

The method of one-way analysis of variance is used in cases where changes in the response trait are studied under the influence of changing conditions or gradations of a factor. In this version of the method, each of the gradations of the factor is represented by a different sample of subjects. There must be at least three gradations of the factor. (There may be two gradations, but in that case we will not be able to establish nonlinear dependencies, and it seems more reasonable to use simpler methods.)

A non-parametric variant of this type of analysis is the Kruskal-Wallis H test.
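As a quick illustration, the Kruskal-Wallis H test is available in scipy. The sketch below uses hypothetical scores for three unrelated groups (the data are assumptions, not part of the original example):

```python
# A minimal sketch of the Kruskal-Wallis H test, the non-parametric
# counterpart of one-way ANOVA for unrelated samples (data are illustrative).
from scipy import stats

group1 = [7, 9, 6, 8, 10, 7]   # scores under gradation 1 of the factor
group2 = [5, 6, 4, 7, 6, 5]    # gradation 2
group3 = [3, 4, 2, 5, 4, 3]    # gradation 3

h_stat, p_value = stats.kruskal(group1, group2, group3)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")  # small p: the groups differ
```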

Hypotheses

H0: Differences between the factor gradations (different conditions) are no more pronounced than the random differences within each group.

H1: Differences between the factor gradations (different conditions) are more pronounced than the random differences within each group.

2.2. Limitations of one-way analysis of variance for unrelated samples

1. One-way analysis of variance requires at least three gradations of the factor and at least two subjects in each gradation.

2. The response trait must be normally distributed in the study sample.

True, it is usually not specified whether this refers to the distribution of the trait in the entire surveyed sample or in the part of it that makes up the dispersion complex.

3. An example of solving a problem by the method of one-way analysis of variance for unrelated samples:

Three different groups of six subjects each received lists of ten words. Words were presented to the first group at a low rate (1 word per 5 seconds), to the second group at an average rate (1 word per 2 seconds), and to the third group at a high rate (1 word per second). Reproduction performance was predicted to depend on the speed of word presentation. The results are presented in Table 1.

Table 1 - Number of words reproduced (columns: subject number, low speed, average speed, high speed; bottom row: total)

H0: Differences in the volume of word reproduction between the groups are no more pronounced than the random differences within each group.

H1: Differences in the volume of word reproduction between the groups are more pronounced than the random differences within each group.

Using the experimental values presented in Table 1, we will establish some quantities that will be needed to calculate the F criterion.

The calculation of the main quantities for one-way analysis of variance is presented in the table:

Table 2

Table 3

Sequence of Operations in One-Way ANOVA for Unrelated Samples

The designation SS, frequently used in this and the following tables, is an abbreviation of "sum of squares". This abbreviation is most often used in translated sources.

SS_fact means the variability of the trait due to the action of the studied factor;

SS_total is the overall variability of the trait;

SS_random is the variability due to unaccounted-for factors, the "random" or "residual" variability;

MS is the "mean square", the mean of the sum of squares, i.e. the average value of the corresponding SS;

df is the number of degrees of freedom, which, when considering nonparametric criteria, we denoted by the Greek letter ν.

Conclusion: H0 is rejected and H1 is accepted. The differences in the volume of word reproduction between the groups are more pronounced than the random differences within each group (α = 0.05). Thus, the speed of presentation of the words affects the volume of their reproduction.

An example of solving the problem in Excel is presented below:

Initial data:

Using the command Tools -> Data Analysis -> One-Way ANOVA, we obtain the following results:
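The same analysis can be reproduced outside Excel; a sketch with scipy, where the word-recall counts are illustrative stand-ins (the original Table 1 values are not reproduced here), but the design matches the example: three groups of six subjects each:

```python
# One-way ANOVA for the word-presentation-speed example (illustrative data).
from scipy import stats

low_speed  = [8, 7, 9, 5, 6, 8]   # words recalled at 1 word per 5 s
mid_speed  = [7, 8, 5, 4, 6, 7]   # 1 word per 2 s
high_speed = [4, 5, 3, 6, 2, 4]   # 1 word per 1 s

f_stat, p_value = stats.f_oneway(low_speed, mid_speed, high_speed)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# If p < 0.05, H0 is rejected: presentation speed affects recall volume.
```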

The methods considered above for testing statistical hypotheses about the significance of the differences between two means are of limited use in practice. This is due to the fact that, in order to identify the effect of all possible conditions and factors on a response trait, field and laboratory experiments are, as a rule, carried out using not two but a larger number of samples.

Researchers often compare the means of several samples combined into a single complex. For example, when studying the effect of various kinds and doses of fertilizers on crop yields, the experiments are repeated in different variants. In these cases pairwise comparisons become cumbersome, and the statistical analysis of the whole complex requires a special method. Such a method, developed in mathematical statistics, is called analysis of variance. It was first applied by the English statistician R. Fisher when processing the results of agronomic experiments (1938).

Analysis of variance is a method of statistical evaluation of the reliability of the dependence of a response trait on one or more factors. Using the method of analysis of variance, statistical hypotheses are tested regarding the means of several general populations that have a normal distribution.

Analysis of variance is one of the main methods of statistical evaluation of the results of an experiment. It is also increasingly used in the analysis of economic information. The analysis of variance makes it possible to establish whether the sample indicators of the relationship between the response and the factor traits are sufficient for extending the data obtained from the sample to the general population. The advantage of this method is that it gives fairly reliable conclusions from small samples.

By examining the variation of the response trait under the influence of one or several factors, analysis of variance provides, in addition to general estimates of the significance of the dependences, estimates of the differences between the mean values formed at the different levels of the factors, and of the significance of the interaction of the factors. Analysis of variance is used to study the dependences of both quantitative and qualitative traits, as well as their combinations.

The essence of this method is the statistical study of the probability of the influence of one or several factors, as well as of their interaction, on the response trait. Accordingly, three main tasks are solved with the help of dispersion analysis: 1) an overall assessment of the significance of the differences between the group means; 2) an assessment of the probability of interaction of the factors; 3) an assessment of the significance of the differences between pairs of means. Most often researchers have to solve such problems when conducting field and zootechnical experiments, when the influence of several factors on the response trait is studied.

The general scheme of dispersion analysis includes: establishing the main sources of variation of the response trait and determining the volumes of variation (the sums of squared deviations) by the sources of its formation; determining the numbers of degrees of freedom corresponding to the components of the total variation; calculating the variances as the ratios of the corresponding volumes of variation to their numbers of degrees of freedom; analyzing the relations between the variances; assessing the reliability of the differences between the means and formulating conclusions.

This scheme is preserved both in simple ANOVA models, where the data are grouped by one attribute, and in complex models, where the data are grouped by two or more attributes. However, with an increase in the number of grouping attributes, the process of decomposing the overall variation by the sources of its formation becomes more complicated.

Schematically, the analysis of variance can be represented as five successive steps:

1) definition and decomposition of variation;

2) determination of the number of degrees of freedom of variation;

3) calculation of dispersions and their ratios;

4) analysis of dispersions and their ratios;

5) assessment of the reliability of the difference between the means and the formulation of conclusions on testing the null hypothesis.

The most time-consuming part of the analysis of variance is the first stage - the determination and decomposition of the variation by the sources of its formation. The order of decomposition of the total volume of variation was discussed in detail in Chapter 5.

The basis for solving problems of dispersion analysis is the law of decomposition (addition) of variation, according to which the total variation (fluctuation) of the response trait is divided into two parts: the variation due to the action of the studied factor(s), and the variation caused by the action of random causes, that is

SS_total = SS_fact + SS_random.

Let us assume that the population under study is divided according to a factor attribute into several groups, each of which is characterized by its own mean value of the response trait. The variation of these values can be explained by two types of causes: those that act on the response trait systematically and can be adjusted in the course of the experiment, and those that cannot be adjusted. It is obvious that the intergroup (factorial, or systematic) variation depends mainly on the action of the studied factor, while the intragroup (residual, or random) variation depends mainly on the action of random factors.

To assess the significance of the differences between the group means, it is necessary to determine the intergroup and intragroup variations. If the intergroup (factorial) variation considerably exceeds the intragroup (residual) variation, then the factor influenced the response trait, significantly changing the values of the group means. But the question arises: what ratio between the intergroup and intragroup variations can be considered sufficient for the conclusion about the reliability (significance) of the differences between the group means?

To assess the significance of the differences between the means and to formulate conclusions on testing the null hypothesis (H0: x̄1 = x̄2 = … = x̄n), the analysis of variance uses a kind of standard - the F-criterion, whose distribution law was established by R. Fisher. This criterion is the ratio of two variances: the factorial one, generated by the action of the factor under study, and the residual one, due to the action of random causes:

F = s²_fact / s²_resid.

The American statistician Snedecor proposed denoting the variance ratio F = s²_fact / s²_resid by the letter F in honor of the inventor of the analysis of variance, R. Fisher.

The variances s²_fact and s²_resid are estimates of the variance of the general population. If samples with these variances are drawn from the same general population, where the variation of the values is of a random character, then the discrepancy between the values of s²_fact and s²_resid is also random.

If the experiment tests the influence of several factors (A, B, C, etc.) on the response trait simultaneously, then the variance due to the action of each of them should be compared with the residual variance, i.e.

F_A = s²_A / s²_resid, F_B = s²_B / s²_resid, etc.

If the value of the factor variance is considerably greater than the residual one, then the factor significantly influenced the response trait, and vice versa.

In multifactorial experiments, in addition to the variation due to the action of each factor, there is almost always variation due to the interaction of the factors. The essence of the interaction is that the effect of one factor changes significantly at different levels of the second (for example, the effectiveness of soil quality at different doses of fertilizers).

The interaction of the factors should also be assessed by comparing the interaction variance with the residual variance:

F_AB = s²_AB / s²_resid.

When calculating the actual value of the F-criterion, the larger of the variances is taken as the numerator, therefore F ≥ 1. Obviously, the larger the F-criterion, the more significant the differences between the variances. If F = 1, the question of assessing the significance of the differences between the variances is removed.

To determine the limits of random fluctuations of the ratio of variances, R. Fisher developed special tables of the F-distribution (Appendices 4 and 5). The F-criterion is functionally related to probability and depends on the numbers of degrees of freedom k1 and k2 of the two compared variances. Two tables are usually used to draw conclusions about the limiting value of the criterion, for the significance levels 0.05 and 0.01. A significance level of 0.05 (or 5%) means that only in 5 cases out of 100 can the F-criterion take a value equal to or higher than that indicated in the table. A decrease in the significance level from 0.05 to 0.01 leads to an increase in the tabular value of the criterion that can still arise between two variances due to the action of random causes alone.

The value of the criterion also depends directly on the numbers of degrees of freedom of the two compared dispersions. If the number of degrees of freedom tends to infinity (k → ∞), the ratio of the two variances tends to unity.

The tabular value of the F-criterion shows the possible random value of the ratio of two variances at a given significance level and the corresponding numbers of degrees of freedom for each of the compared variances. These tables give the value of F for samples drawn from the same general population, where the causes of the variation of the values are only random.

The value of F is found from the tables (Appendices 4 and 5) at the intersection of the corresponding column (the number of degrees of freedom for the larger variance, k1) and row (the number of degrees of freedom for the smaller variance, k2). Thus, if the larger variance (the numerator of F) has k1 = 4 and the smaller one (the denominator of F) has k2 = 9, then Fα at the significance level α = 0.05 is 3.63 (Appendix 4). So, as a result of the action of random causes alone, since the samples are small, the variance of one sample can, at the 5% significance level, exceed the variance of the second sample by a factor of 3.63. When the significance level is decreased from 0.05 to 0.01, the tabular value of the criterion, as noted above, increases: with the same degrees of freedom k1 = 4 and k2 = 9 and α = 0.01, the tabular value of the F-criterion is 6.99 (Appendix 5).
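The tabulated critical values quoted above can be reproduced programmatically instead of from Appendices 4 and 5; a minimal sketch with scipy:

```python
# Critical F values for k1 = 4, k2 = 9 at the 0.05 and 0.01 significance levels.
from scipy import stats

print(round(stats.f.ppf(1 - 0.05, dfn=4, dfd=9), 2))  # 3.63, as in Appendix 4
print(round(stats.f.ppf(1 - 0.01, dfn=4, dfd=9), 2))  # 6.99, as in Appendix 5
```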

Let us consider the procedure for determining the number of degrees of freedom in the analysis of variance. The number of degrees of freedom corresponding to the total sum of squared deviations is decomposed into components in the same way as the sums of squared deviations themselves: the total number of degrees of freedom (k0) is decomposed into the numbers of degrees of freedom for the intergroup (k1) and intragroup (k2) variations.

Thus, if a sample consisting of N observations is divided into m groups (the number of variants of the experiment) and n subgroups (the number of replications), then the numbers of degrees of freedom k are, respectively:

a) for the total sum of squared deviations: k0 = N − 1;

b) for the intergroup sum of squared deviations: k1 = m − 1;

c) for the intragroup sum of squared deviations: k2 = N − m.

According to the addition rule of variation:

k0 = k1 + k2.

For example, if four variants of the experiment were formed (m = 4) in five replications each (n = 5), and the total number of observations is N = m·n = 4 × 5 = 20, then the numbers of degrees of freedom are, respectively, k0 = 19, k1 = 3 and k2 = 16.

Knowing the sums of squared deviations and the numbers of degrees of freedom, one can determine unbiased (adjusted) estimates of the three variances:

s²_total = SS_total/k0, s²_fact = SS_fact/k1, s²_resid = SS_random/k2.

The null hypothesis H0 is tested by the F-criterion in the same way as by Student's t-test. To make a decision on H0, it is necessary to calculate the actual value of the criterion and compare it with the tabular value Fα for the accepted significance level α and the numbers of degrees of freedom k1 and k2 of the two variances.

If F_fact > Fα, then, in accordance with the accepted level of significance, it can be concluded that the differences between the sample variances are determined not only by random factors: they are significant. In this case the null hypothesis is rejected, and there is reason to believe that the factor significantly affects the response trait. If F_fact < Fα, the null hypothesis is accepted, and there is reason to assert that the differences between the compared variances lie within the limits of possible random fluctuations: the effect of the factor on the response trait is not significant.

The use of one or another model of analysis of variance depends both on the number of factors studied and on the method of sampling.

Depending on the number of factors that determine the variation of the response trait, the samples can be formed by one, two or more factors. Accordingly, analysis of variance is divided into single-factor and multifactor analysis; the corresponding designs are also called single-factor and multifactor dispersion complexes.

The scheme of decomposition of the overall variation depends on how the groups are formed. The grouping can be random (the observations of one group are not related to the observations of the second group) or non-random (the observations of the two samples are interconnected by the common conditions of the experiment). Accordingly, independent or dependent samples are obtained. Independent samples can be formed with both equal and unequal sizes; the formation of dependent samples assumes equal sizes.

If the groups are formed in a non-random order, then the total variation of the response trait includes, along with the factorial (intergroup) and residual variation, the variation of the replications, that is

SS_total = SS_fact + SS_repl + SS_random.

In practice, in most cases it is necessary to deal with dependent samples, when the conditions for the groups and subgroups are equalized. Thus, in a field experiment the whole site is divided into blocks with conditions as equalized as possible. Each variant of the experiment then gets an equal opportunity to be represented in all blocks, which equalizes the conditions for all tested variants of the experiment. This method of constructing an experiment is called the method of randomized blocks. Experiments with animals are carried out in a similar way.

When processing socio-economic data by the method of dispersion analysis, it must be borne in mind that, owing to the large number of factors and their interrelation, it is difficult, even with the most careful equalization of conditions, to establish the degree of the objective influence of each individual factor on the response trait. Therefore the level of the residual variation is determined not only by random causes but also by essential factors that were not taken into account when building the ANOVA model. As a result, the residual variance as a basis of comparison sometimes becomes inadequate for its purpose: it is clearly overestimated in magnitude and cannot act as a criterion of the significance of the influence of the factors. In this regard, when building models of variance analysis, the selection of the essential factors and the equalization of the conditions for the manifestation of the action of each of them becomes an urgent problem. Besides, the use of analysis of variance assumes a normal or close-to-normal distribution of the studied statistical aggregates. If this condition is not met, the estimates obtained in the analysis of variance will be exaggerated.

Analysis of variance (from the Latin dispersio - dispersion; in English, Analysis of Variance - ANOVA) is used to study the influence of one or several qualitative variables (factors) on one dependent quantitative variable (the response).

The analysis of variance is based on the assumption that some variables can be considered as causes (factors, independent variables) and others as consequences (dependent variables). Independent variables are sometimes called adjustable factors precisely because in the experiment the researcher has the opportunity to vary them and to analyze the resulting outcome.

The main goal of analysis of variance (ANOVA) is to study the significance of the differences between means by comparing (analyzing) the variances. Dividing the total variance into several sources makes it possible to compare the variance due to intergroup differences with the variance due to within-group variability. If the null hypothesis is true (that the means of the several groups of observations selected from the general population are equal), the estimate of the variance associated with within-group variability should be close to the estimate of the intergroup variance. If two sample means are simply being compared, analysis of variance gives the same result as an ordinary independent-samples t-test (when two independent groups of objects or observations are compared) or a dependent-samples t-test (when two variables are compared on one and the same set of objects or observations).

The essence of analysis of variance lies in dividing the total variance of the studied trait into separate components, due to the influence of specific factors, and in testing hypotheses about the significance of the influence of these factors on the studied trait. By comparing the components of the variance with one another using Fisher's F-test, it is possible to determine what proportion of the total variability of the response trait is due to the action of the adjustable factors.

The source material for analysis of variance is the data from the study of three or more samples, which can be either equal or unequal in size, either connected or disconnected. According to the number of identified adjustable factors, analysis of variance can be one-factor (the influence of one factor on the results of the experiment is studied), two-factor (the influence of two factors is studied) and multifactor (it allows one to evaluate not only the influence of each of the factors separately, but also their interaction).

Analysis of variance belongs to the group of parametric methods, and therefore it should be used only when it has been shown that the distribution is normal.

Analysis of variance is used if the dependent variable is measured on a ratio, interval or ordinal scale, while the influencing variables are of a non-numeric (nominal) nature.

Task examples

In problems solved by analysis of variance there is a response of a numerical nature, which is affected by several variables of a nominal nature: for example, several types of livestock fattening rations, or two ways of keeping the animals, etc.

Example 1: Several pharmacy kiosks operated for a week in three different locations. In the future, only one can be kept. It is necessary to determine whether there is a statistically significant difference between the volumes of drug sales in the kiosks. If there is, we will select the kiosk with the highest average daily sales volume. If the difference in sales volume turns out to be statistically insignificant, then other indicators should be the basis for choosing a kiosk.
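A sketch of Example 1 in Python; the daily sales figures below are hypothetical, and the decision logic follows the text:

```python
# One-way ANOVA comparing mean daily sales of three kiosks (illustrative data).
import numpy as np
from scipy import stats

kiosk_a = [38, 42, 40, 37, 45, 41, 39]   # hypothetical daily sales, one week
kiosk_b = [35, 36, 34, 38, 33, 37, 35]
kiosk_c = [41, 44, 43, 40, 46, 42, 45]

f_stat, p_value = stats.f_oneway(kiosk_a, kiosk_b, kiosk_c)
if p_value < 0.05:
    means = [np.mean(k) for k in (kiosk_a, kiosk_b, kiosk_c)]
    print("Significant difference; keep kiosk", "ABC"[int(np.argmax(means))])
else:
    print("No significant difference; choose by other indicators.")
```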

Example 2: Comparison of contrasts of group means. The seven political affiliations are ordered from extremely liberal to extremely conservative, and linear contrast is used to test whether there is a non-zero upward trend in group means—i.e., whether there is a significant linear increase in mean age when considering groups ordered in the direction from liberal to conservative.

Example 3: Two-way analysis of variance. The number of product sales, in addition to the size of the store, is often affected by the location of the shelves with the product. This example contains weekly sales figures characterized by four shelf layouts and three store sizes. The results of the analysis show that both factors - the location of the shelves with the goods and the size of the store - affect the number of sales, but their interaction is not significant.

Example 4: One-way ANOVA in a randomized complete block design. The influence on bread baking of all possible combinations of three fats and three dough raising agents is studied. Four flour samples, taken from four different sources, served as the blocking factor. The significance of the fat-raiser interaction needs to be identified, after which the various options for choosing contrasts are determined, making it possible to find out which combinations of factor levels differ.

Example 5: Model of a hierarchical (nested) design with mixed effects. The influence of four randomly selected heads mounted in a machine tool on the deformation of manufactured glass cathode holders is studied. (The heads are built into the machine, so the same head cannot be used on different machines.) The head effect is treated as a random factor. The ANOVA statistics show that there are no significant differences between the machines, but there are indications that the heads may differ. The difference between all the machines is not significant, but for two of them the difference between the types of heads is significant.

Example 6: Univariate repeated-measures analysis using a split-plot design. This experiment was conducted to determine the effect of an individual's anxiety rating on exam performance in four consecutive attempts. The data are organized so that they can be considered as groups of subsets of the whole data set ("whole plot"). The effect of anxiety turned out to be not significant, while the effect of the attempt was significant.

List of methods

  • Models of factorial experiment. Examples: factors affecting the success of solving mathematical problems; factors influencing sales volumes.

The data consist of several series of observations (treatments), which are considered as realizations of independent samples. The initial hypothesis is that there is no difference between the treatments, i.e. it is assumed that all observations can be considered as one sample from a common population:

  • One-factor parametric model: Scheffé's method.
  • One-factor non-parametric model [Lagutin M.B., 237]: Kruskal-Wallis test [Hollender M., Wolf D.A., 131], Jonckheere's criterion [Lagutin M.B., 245].
  • General case of a model with constant factors, Cochran's theorem [Afifi A., Eisen S., 234].

The data are two-way repeated observations:

  • Two-factor non-parametric model: Friedman's criterion [Lapach, 203], Page's criterion [Lagutin M.B., 263]. Examples: comparison of the effectiveness of production methods, agricultural practices.
  • Two-factor non-parametric model for incomplete data.

History

Where did the name analysis of variance come from? It may seem strange that a procedure for comparing means is called analysis of variance. In fact, this is because, when examining the statistical significance of the difference between the means of two (or several) groups, we are actually comparing (analyzing) the sample variances. The fundamental concept of analysis of variance was proposed by Fisher in 1920. Perhaps a more natural term would be sum-of-squares analysis or analysis of variation, but by tradition the term analysis of variance is used. Initially analysis of variance was developed for processing data obtained in the course of specially designed experiments and was considered the only method that correctly investigates causal relationships. The method was used to evaluate experiments in crop production. Later the general scientific significance of dispersion analysis for experiments in psychology, pedagogy, medicine, etc., became clear.

Literature

  1. Scheffé H. The Analysis of Variance. - M., 1980.
  2. Ahrens H., Läuter J. Multivariate Analysis of Variance.
  3. Kobzar A.I. Applied Mathematical Statistics. - M.: Fizmatlit, 2006.
  4. Lapach S.N., Chubenko A.V., Babich P.N. Statistics in Science and Business. - Kyiv: Morion, 2002.
  5. Lagutin M.B. Visual Mathematical Statistics. In two volumes. - M.: P-center, 2003.
  6. Afifi A., Eisen S. Statistical Analysis: A Computer-Assisted Approach.
  7. Hollender M., Wolf D.A. Nonparametric Statistical Methods.


The mean squares s²_R and s² are unbiased estimates of the variance of the dependent variable due, respectively, to the regression (the explanatory variable X) and to the impact of unaccounted-for random factors and errors; m is the number of estimated regression parameters, n is the number of observations. In the absence of a linear relationship between the dependent and the explanatory (factor) variable, the random variables associated with s²_R and s² have a χ²-distribution with m − 1 and n − m degrees of freedom, respectively, and their ratio has an F-distribution with the same degrees of freedom. Therefore the regression equation is significant at level α if the actually observed value of the statistic exceeds the table value:

F = s²_R / s² > F_α;k1;k2, (5.11)

where F_α;k1;k2 is the tabular value of the Fisher-Snedecor F-criterion, determined at significance level α with k1 = m − 1 and k2 = n − m degrees of freedom.

Given the meaning of s²_R and s², we can say that the value of F shows to what extent the regression estimates the value of the dependent variable better than its mean does.

In the case of paired linear regression m = 2, and the regression equation is significant at level α if

F = r²·(n − 2) / (1 − r²) > F_α;1;n−2. (5.12)

The following ratio can serve as a measure of the significance of the regression line:

F = (Σ (ŷ_i − ȳ)² / (m − 1)) / (Σ (y_i − ŷ_i)² / (n − m)),

where ŷ_i is the i-th fitted (equalized) value; ȳ is the arithmetic mean of the values y_i; σ_y.x is the root mean square error (approximation error) of the regression equation, calculated by the well-known formula; n is the number of compared pairs of values of the traits; m is the number of factor traits.

Indeed, the relationship is the stronger, the more the measure of the scatter of the trait due to the regression exceeds the measure of the scatter of the deviations of the actual values from the fitted ones.

This ratio makes it possible to resolve the question of the significance of the regression equation as a whole, that is, of the existence of a real statistical dependence between the variables. The regression equation is significant, i.e. there is a statistical relationship between the traits, if, for a given significance level, the calculated value of the Fisher criterion F exceeds the critical value F_cr found at the intersection of the corresponding column and row of a special statistical table called the "Table of Fisher F-criterion values".

Example. Let us use Fisher's criterion to assess the significance of the regression equation constructed in the previous lecture, that is, the equation expressing the relationship between the harvest and the sowing per capita.

Substituting the data of the previous example into the formula for calculating the Fisher criterion, we get

Referring to the F-distribution table for P = 0.95 (α = 1 − P = 0.05) and taking into account that n − 2 = 21 and m − 1 = 1, at the intersection of the 1st column and the 21st row of the table of F-test values we find the critical value F_cr = 4.32 with reliability P = 0.95. Since the calculated value of the F-criterion considerably exceeds F_cr, the discovered linear relationship is significant, i.e. the a priori hypothesis about the presence of a linear relationship is confirmed. The conclusion is made with reliability P = 0.95. It can be checked that the conclusion remains the same if the reliability is raised to P = 0.99 (the corresponding value F_cr = 8.02 for significance level α = 0.01).


Determination coefficient. With the help of the F-criterion we have established that there is a linear dependence between the amount of grain harvest and the amount of sowing per capita. Therefore it can be argued that the amount of grain harvest per capita depends linearly on the amount of sowing per capita. Now it is appropriate to pose a clarifying question: to what extent does the amount of sowing per capita determine the amount of grain harvest per capita? This question can be answered by calculating what part of the variation of the resulting trait can be explained by the influence of the factor trait. This purpose is served by the index (or coefficient) of determination R², which makes it possible to estimate the share of the scatter accounted for by the regression in the total scatter of the resulting trait. The determination coefficient, equal to the ratio of the factorial variation to the total variation of the trait, makes it possible to judge how "successfully" the form of the function describing the real statistical dependence has been chosen.

If the coefficient of determination R² is known, then the criterion of significance of the regression equation, or of the determination coefficient itself (Fisher's criterion), can be written as:

F = (R² / (1 − R²)) · ((n − m) / (m − 1)).

Fisher's criterion also makes it possible to evaluate the usefulness of including additional factors in the model for a multiple linear regression equation.

In econometrics, apart from the overall Fisher criterion, the concept of a partial criterion is also used. The partial F-criterion shows the degree of influence of an additional independent variable on the resulting trait and can be used when deciding whether to add this independent variable to the equation or to exclude it from it.

The scatter of the trait explained by the two-factor regression equation constructed earlier can be decomposed into two parts: 1) the scatter of the trait due to the independent variable x1, and 2) the scatter of the trait due to the independent variable x2 when x1 is already included in the equation. The first component corresponds to the scatter of the trait explained by the equation that includes only the variable x1. The difference between the scatter of the trait due to the two-factor linear regression equation and the scatter due to the paired linear regression equation determines the part of the scatter explained by the additional independent variable x2.

The ratio of this difference to the scatter of the trait not explained by the regression is the value of the partial criterion. The partial F-criterion is also called sequential if the statistical characteristics are constructed by successively adding variables to the regression equation.

Example. Evaluate the usefulness of including the additional variable "yield" in the regression equation (according to the data and results of the previously considered examples).

The scatter of the trait explained by the multiple regression equation and calculated as the sum of the squared differences between the fitted values and their mean is equal to 1623.8815. The scatter of the trait explained by the simple regression equation is 1545.1331.

The scatter of the trait not explained by the regression is determined by the square of the mean square error of the equation and is equal to 10.9948.

Using these characteristics, we calculate the partial F-criterion:

F = (1623.8815 − 1545.1331) / 10.9948 ≈ 7.16.

With a reliability level of 0.95 (α = 0.05), the tabular value F(1, 20), i.e. the value at the intersection of the 1st column and the 20th row of Appendix Table 4A, equals 4.35. The calculated value of the F-criterion considerably exceeds the tabular one, and therefore the inclusion of the variable "yield" in the equation makes sense.
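The arithmetic of this partial F-test is easy to check directly; a sketch using the numbers quoted above:

```python
# Partial F-test for adding the variable "yield" to the regression equation.
from scipy import stats

ss_two_factor = 1623.8815   # scatter explained by the two-factor regression
ss_one_factor = 1545.1331   # scatter explained by the simple regression
ms_residual   = 10.9948     # scatter not explained by the regression

f_partial = (ss_two_factor - ss_one_factor) / ms_residual
f_table = stats.f.ppf(0.95, dfn=1, dfd=20)
print(f"F_partial = {f_partial:.2f}, F_table = {f_table:.2f}")
# 7.16 > 4.35, so including the additional variable makes sense.
```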

Thus, the conclusions made earlier regarding the regression coefficients are quite legitimate.

4th study question. Assessing the significance of individual parameters of the regression equation using Student's t-test.

Very often in econometrics it is required to evaluate the significance of the correlation coefficient r, that is, to determine how significantly it differs from zero (for example, when analyzing multicollinearity and estimating the paired correlation coefficients between the factors of a multiple regression equation).

It is assumed that, in the absence of correlation, the statistic

t = r·√(n − 2) / √(1 − r²)

has Student's t-distribution with (n − 2) degrees of freedom.

The correlation coefficient r_xy is significant at level α (that is, the hypothesis H0 that the general correlation coefficient equals zero is rejected) if

|t| > t_α;n−2, (5.13)

where t_α;n−2 is the tabular value of Student's t-criterion, determined at significance level α with (n − 2) degrees of freedom.

In linear regression, the significance of not only the equation as a whole but also of its individual parameters is usually evaluated. For this purpose, a standard error is determined for each of the parameters. The procedure for assessing the significance of a parameter does not differ from the one considered above for the regression coefficient: the value of the t-criterion is calculated and compared with the tabular value for (n − 2) degrees of freedom. Testing hypotheses about the significance of the regression and correlation coefficients is equivalent to testing the hypothesis about the significance of the linear regression equation.
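A short sketch of this t-test for a correlation coefficient; the values of r and n are illustrative assumptions:

```python
# Significance test for a correlation coefficient via Student's t-test.
import math
from scipy import stats

r, n, alpha = 0.68, 23, 0.05                      # illustrative inputs
t_stat = r * math.sqrt(n - 2) / math.sqrt(1 - r * r)
t_table = stats.t.ppf(1 - alpha / 2, df=n - 2)    # two-sided critical value
print(f"t = {t_stat:.2f}, t_table = {t_table:.2f}")
print("r is significant:", abs(t_stat) > t_table)
```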

Conclusion. In this lecture we have considered the general rules for testing statistical hypotheses and their practical use in assessing the significance of regression equations and their individual parameters by means of the Fisher and Student criteria.

Analysis of variance

Course work in the discipline "System Analysis"

Performed by V.V. Zhbanov, student of group 99 ISE-2

Orenburg State University

Faculty of Information Technologies

Department of Applied Informatics

Orenburg-2003

Introduction

The purpose of the work: to get acquainted with such a statistical method as analysis of variance.

Dispersion analysis (from the Latin dispersio - scattering) is a statistical method that makes it possible to analyze the influence of various factors on the variable under study. The method was developed by the biologist R. Fisher in 1925 and was originally used to evaluate experiments in crop production. Later the general scientific significance of dispersion analysis for experiments in psychology, pedagogy, medicine, etc., became clear.

The purpose of analysis of variance is to test the significance of the difference between means by comparing variances. The variance of the measured trait is decomposed into independent terms, each of which characterizes the influence of a particular factor or of their interaction. The subsequent comparison of these terms makes it possible to evaluate the significance of each factor under study, as well as of their combination /1/.

If the null hypothesis is true (about the equality of means in several groups of observations selected from the general population), the estimate of the variance associated with intragroup variability should be close to the estimate of intergroup variance.

When conducting market research, the question of the comparability of results often arises. For example, when conducting surveys about the consumption of a product in different regions of a country, it is necessary to draw conclusions as to how much the survey data differ or do not differ from one another. It does not make sense to compare individual indicators, and therefore the procedure of comparison and subsequent evaluation is carried out using some averaged values and deviations from this averaged estimate. The variation of the trait is studied. Variance can be taken as a measure of variation. Dispersion σ² is a measure of variation defined as the average of the squared deviations of a trait from its mean.

In practice, problems of a more general kind often arise: problems of testing the significance of differences between the means of several sample populations. For example, it may be required to evaluate the effect of different raw materials on the quality of products, or to solve the problem of the effect of the amount of fertilizer on the yield of agricultural products.

Sometimes analysis of variance is used to establish the homogeneity of several populations (the variances of these populations are assumed to be the same; if the analysis of variance shows that the mathematical expectations are also the same, then the populations are homogeneous in this sense). Homogeneous populations can be combined into one and thereby obtain fuller information about it and, hence, more reliable conclusions /2/.

1 Analysis of variance

1.1 Basic concepts of analysis of variance

In the process of observing the object under study, the qualitative factors change arbitrarily or in a predetermined way. A specific realization of a factor (for example, a certain temperature regime, a selected piece of equipment or material) is called a level of the factor or a processing method. An ANOVA model with fixed levels of the factors is called model I; a model with random factors is called model II. By varying a factor, one can investigate its effect on the magnitude of the response. At present the general theory of analysis of variance has been developed for models I.

Depending on the number of factors that determine the variation of the resulting feature, analysis of variance is divided into single-factor and multi-factor.

The main schemes for organizing initial data with two or more factors are:

Cross-classification, characteristic of models I, in which each level of one factor is combined with each gradation of another factor when planning an experiment;

Hierarchical (nested) classification, characteristic of model II, in which each randomly chosen value of one factor corresponds to its own subset of values ​​of the second factor.

If the dependence of the response on qualitative and quantitative factors simultaneously, i.e. on factors of a mixed nature, is investigated, then analysis of covariance is used /3/.

Thus, these models differ from each other in the way the levels of the factor are chosen, which, obviously, primarily affects the possibility of generalizing the experimental results obtained. For the analysis of variance of single-factor experiments the difference between these two models is not so significant, but in multivariate analysis of variance it can turn out to be very important.

When conducting an analysis of variance, the following statistical assumptions must be met: regardless of the level of the factor, the response values have a normal (Gaussian) distribution and the same variance. This equality of variances is called homogeneity. Thus, changing the processing method affects only the position of the response random variable, which is characterized by its mean value or median. Therefore all response observations belong to a shift family of normal distributions.

The ANOVA technique is said to be "robust". This term, used by statisticians, means that the above assumptions can be violated to some extent, and the technique can nevertheless still be used.

When the law of distribution of response values ​​is unknown, nonparametric (most often rank) methods of analysis are used.

The analysis of variance is based on the division of the variance into parts or components. The variation due to the influence of the factor underlying the grouping is characterized by the intergroup dispersion δ². It is a measure of the variation of the group means around the common mean and is determined by the formula:

δ² = Σ_j n_j (x̄_j − x̄)² / Σ_j n_j, j = 1, …, k,

where k is the number of groups;

n_j is the number of units in the j-th group;

x̄_j is the group (partial) mean for the j-th group;

x̄ is the overall mean over the whole population of units.

The variation due to the influence of the other factors is characterized in each group by the intragroup dispersion σ_j²:

σ_j² = Σ_i (x_ij − x̄_j)² / n_j.

Between the total variance σ0², the mean of the intragroup variances σ̄², and the intergroup variance δ² there is the relation:

σ0² = δ² + σ̄².

The intragroup variance reflects the influence of factors not taken into account in the grouping, and the intergroup variance reflects the influence of the grouping factors on the group means /2/.
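The addition rule can be verified numerically; a sketch with illustrative grouped data (the group values are assumptions):

```python
# Numerical check of the variance addition rule: sigma0^2 = delta^2 + mean intragroup variance.
import numpy as np

groups = [np.array([3.0, 4.0, 5.0]),
          np.array([6.0, 7.0, 8.0, 9.0]),
          np.array([2.0, 3.0, 4.0])]

all_x   = np.concatenate(groups)
grand   = all_x.mean()                             # overall mean
n_j     = np.array([len(g) for g in groups])       # group sizes
means_j = np.array([g.mean() for g in groups])     # group means

delta2    = np.sum(n_j * (means_j - grand) ** 2) / n_j.sum()               # intergroup
sigma_bar = np.sum(n_j * np.array([g.var() for g in groups])) / n_j.sum()  # mean intragroup
sigma0    = all_x.var()                                                    # total

print(sigma0, delta2 + sigma_bar)   # both print the same number
```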

1.2 One-way analysis of variance

The one-factor dispersion model has the form:

x_ij = μ + F_i + ε_ij, (1)

where x_ij is the value of the variable under study obtained at the i-th level of the factor (i = 1,2,…,m) with the j-th serial number (j = 1,2,…,n);

F_i is the effect due to the influence of the i-th level of the factor;

ε_ij is the random component, or disturbance, caused by the influence of uncontrollable factors, i.e. the variation within a single level.

Basic prerequisites for analysis of variance:

The mathematical expectation of the perturbation ε_ij is zero for any i, i.e.

M(ε_ij) = 0; (2)

the perturbations ε_ij are mutually independent;

the variance of the variable x_ij (or of the perturbation ε_ij) is constant for any i, j, i.e.

D(ε_ij) = σ²; (3)

the variable x_ij (or the perturbation ε_ij) follows the normal law of distribution N(0; σ²).

The influence of the levels of the factor can be either fixed, or systematic (model I), or random (model II).

Suppose, for example, that it is necessary to find out whether there are significant differences between batches of products in terms of some quality indicator, i.e. to check the effect on quality of a single factor - the product batch. If all batches of raw materials are included in the study, then the influence of the level of such a factor is systematic (model I), and the conclusions obtained are applicable only to the individual batches involved in the study; if only a randomly selected part of the batches is included, then the influence of the factor is random (model II). In multifactorial complexes a mixed model III is possible, in which some factors have random levels while others are fixed.

Let there be m batches of products. From each batch, n1, n2, …, nm products were selected, respectively (for simplicity it is assumed that n1 = n2 = … = nm = n). The values of the quality indicator of these products are presented in the observation matrix:

x_11 x_12 … x_1n
x_21 x_22 … x_2n
…
x_m1 x_m2 … x_mn

i.e. (x_ij), i = 1,2,…,m; j = 1,2,…,n.

It is necessary to check the significance of the influence of batches of products on their quality.

If we assume that the elements of the rows of the observation matrix are numerical values of random variables X1, X2, …, Xm expressing the quality of the products and having a normal distribution law with mathematical expectations a1, a2, …, am respectively and identical variances σ², then the given task reduces to testing the null hypothesis H0: a1 = a2 = … = am, which is what is done in the analysis of variance.

Averaging over some index is indicated by an asterisk (or a dot) in place of that index; then the average quality of the products of the i-th batch, or the group average for the i-th level of the factor, takes the form:

x̄_i* = (1/n) Σ_j x_ij, (4)

where x̄_i* is the average over the columns (for the i-th row);

x_ij is an element of the observation matrix;

n is the sample size.

And the overall average:

x̄_** = (1/(mn)) Σ_i Σ_j x_ij. (5)

The sum of the squared deviations of the observations x_ij from the overall mean x̄_** looks like this:

Σ_i Σ_j (x_ij − x̄_**)² = n Σ_i (x̄_i* − x̄_**)² + Σ_i Σ_j (x_ij − x̄_i*)² + 2 Σ_i Σ_j (x̄_i* − x̄_**)(x_ij − x̄_i*), (6)

or Q = Q1 + Q2 + Q3.

The last term is zero, since the sum of the deviations of the values of a variable from its mean is zero:

Σ_j (x_ij − x̄_i*) = 0,

so that 2 Σ_i (x̄_i* − x̄_**) Σ_j (x_ij − x̄_i*) = 0.

The first term can be written as:

Q1 = n Σ_i (x̄_i* − x̄_**)². (7)

The result is the identity:

Q = Q1 + Q2, (8)

where Q = Σ_i Σ_j (x_ij − x̄_**)² is the total sum of squared deviations;

Q1 = n Σ_i (x̄_i* − x̄_**)² is the sum of squared deviations of the group means from the overall mean, i.e. the intergroup (factorial) sum of squared deviations;

Q2 = Σ_i Σ_j (x_ij − x̄_i*)² is the sum of squared deviations of the observations from the group means, i.e. the intragroup (residual) sum of squared deviations.

The decomposition (8) contains the main idea of the analysis of variance. As applied to the problem under consideration, equality (8) shows that the overall variation of the quality indicator, measured by the sum Q, consists of two components, Q1 and Q2, characterizing the variability of this indicator between the batches (Q1) and within the batches (Q2), the latter characterizing the variation common to all batches under the influence of unaccounted-for factors.

In the analysis of variance it is not the sums of squared deviations themselves that are analyzed, but the so-called mean squares, which are unbiased estimates of the corresponding variances and are obtained by dividing the sums of squared deviations by the corresponding numbers of degrees of freedom.

The number of degrees of freedom is defined as the total number of observations minus the number of equations connecting them. Therefore, for the mean square s1², which is an unbiased estimate of the intergroup variance, the number of degrees of freedom is k1 = m − 1, since m group means connected by one equation (5) are used in its calculation. And for the mean square s2², which is an unbiased estimate of the intragroup variance, the number of degrees of freedom is k2 = mn − m, because it is calculated using all mn observations connected by m equations (4).

Thus:

s1² = Q1/(m − 1), s2² = Q2/(mn − m).

If we find the mathematical expectations of the mean squares s1² and s2², substituting into their formulas the expression (1) for x_ij in terms of the model parameters, we get:

M(s1²) = σ² + n Σ_i (F_i − F*)² / (m − 1), (9)

and, taking into account the properties of mathematical expectation,

M(s2²) = σ². (10)

For model I with fixed levels of the factor, F_i (i = 1,2,…,m) are non-random values, therefore

M(s1²) = n Σ_i (F_i − F*)² / (m − 1) + σ².

The hypothesis H0 takes the form F_i = F* (i = 1,2,…,m), i.e. the influence of all levels of the factor is the same. If this hypothesis is true,

M(s1²) = M(s2²) = σ².

For the random model II, the term F_i in expression (1) is a random variable. Denoting its variance by σ_F², we get from (9)

M(s1²) = σ² + n σ_F², (11)

and, as in model I,

M(s2²) = σ².

Table 1.1 presents the general form of the calculations in one-way analysis of variance.

Table 1.1 - Basic table of analysis of variance

Variance component | Sum of squares | Degrees of freedom | Mean square | Expectation of mean square
Intergroup | Q1 | m − 1 | s1² = Q1/(m − 1) | σ² + n·σ_F² (model II; for model I, σ² + n·Σ_i (F_i − F*)²/(m − 1))
Intragroup | Q2 | mn − m | s2² = Q2/(mn − m) | σ²

For model II the hypothesis H0 takes the form σ_F² = 0. If this hypothesis is true,

M(s1²) = M(s2²) = σ².

In the case of a one-factor complex, for both model I and model II the mean squares s1² and s2² are unbiased and independent estimates of the same variance σ².

Therefore, testing the null hypothesis H0 reduces to testing the significance of the difference between the unbiased sample estimates s1² and s2² of the variance σ².

The hypothesis H0 is rejected if the actually computed value of the statistic F = s1²/s2² is greater than the critical value F_α;k1;k2 determined at significance level α with k1 = m − 1 and k2 = mn − m degrees of freedom, and accepted if F < F_α;k1;k2.

The Fisher F-distribution (for x > 0) has the following density function (for k1, k2 = 1, 2, …):

f(x) = [Γ((k1 + k2)/2) / (Γ(k1/2)·Γ(k2/2))] · (k1/k2)^(k1/2) · x^(k1/2 − 1) · (1 + k1·x/k2)^(−(k1 + k2)/2),

where k1, k2 are the degrees of freedom;

Γ is the gamma function.

As applied to this problem, the refutation of the hypothesis H0 means the presence of significant differences in the quality of the products of different batches at the significance level under consideration.

To calculate the sums of squares Q1, Q2, Q it is often convenient to use the following formulas:

Q = Σ_i Σ_j x_ij² − (Σ_i Σ_j x_ij)² / (mn), (12)

Q1 = (1/n) Σ_i (Σ_j x_ij)² − (Σ_i Σ_j x_ij)² / (mn), (13)

Q2 = Q − Q1, (14)

i.e. it is generally not necessary to find the averages themselves.

Thus, the procedure of one-way analysis of variance consists in testing the hypothesis H0 that there is one group of homogeneous experimental data against the alternative that there is more than one such group. Homogeneity means sameness of the means and variances in any subset of the data. The variances may be either known or unknown in advance. If there is reason to believe that the known or unknown variance of the measurements is the same over the entire data set, then the task of one-way analysis of variance reduces to studying the significance of the difference between the means in the groups of data /1/.
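A sketch implementing the computational formulas (12)-(14) and the F-test for a balanced one-factor complex; the batch data are illustrative:

```python
# One-way ANOVA "by hand" via formulas (12)-(14): m batches, n observations each.
import numpy as np
from scipy import stats

x = np.array([[12.0, 11.5, 12.3, 11.8],   # batch 1 (illustrative quality values)
              [12.9, 13.1, 12.7, 13.4],   # batch 2
              [11.2, 11.0, 11.6, 11.3]])  # batch 3
m, n = x.shape

total = x.sum()
Q  = (x ** 2).sum() - total ** 2 / (m * n)                    # (12) total SS
Q1 = (x.sum(axis=1) ** 2).sum() / n - total ** 2 / (m * n)    # (13) intergroup SS
Q2 = Q - Q1                                                   # (14) intragroup SS

s1_sq = Q1 / (m - 1)          # intergroup mean square, k1 = m - 1
s2_sq = Q2 / (m * n - m)      # intragroup mean square, k2 = mn - m
F = s1_sq / s2_sq
F_crit = stats.f.ppf(0.95, m - 1, m * n - m)
print(f"F = {F:.2f}, F_crit = {F_crit:.2f}, reject H0: {F > F_crit}")
```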

1.3 Multivariate dispersion analysis

It should be noted at once that there is no fundamental difference between multivariate and univariate analysis of variance. Multivariate analysis does not change the general logic of analysis of variance but only somewhat complicates it, since, in addition to taking into account the influence of each of the factors on the dependent variable separately, their joint action should also be evaluated. Thus, what multivariate analysis of variance adds to data analysis concerns mainly the ability to assess the interfactor interaction. Nevertheless it remains possible to evaluate the influence of each factor separately. In this sense the procedure of multivariate analysis of variance (in its computer version) is undoubtedly more economical, since in a single run it solves two problems at once: the influence of each of the factors and their interaction are estimated /3/.

The general scheme of a two-factor experiment whose data are processed by analysis of variance has the form:



Figure 1.1 - Scheme of a two-factor experiment

Data subjected to multivariate analysis of variance are often labeled according to the number of factors and their levels.

Suppose that, in the problem considered above about the quality of m different batches, the products were manufactured on l different machines, and it is required to find out whether there are significant differences in product quality for each factor:

A - a batch of products;

B - machine.

The result is a transition to the problem of two-factor analysis of variance.

All the data are presented in Table 1.2, in which the rows correspond to the levels A_i of factor A, the columns to the levels B_j of factor B; the corresponding cells of the table contain the values of the product quality indicator x_ijk (i = 1,2,…,m; j = 1,2,…,l; k = 1,2,…,n).

Table 1.2 - Product quality indicators

      B1               B2               …    Bl
A1    x_111,…,x_11n    x_121,…,x_12n    …    x_1l1,…,x_1ln
A2    x_211,…,x_21n    x_221,…,x_22n    …    x_2l1,…,x_2ln
…     …                …                …    …
Am    x_m11,…,x_m1n    x_m21,…,x_m2n    …    x_ml1,…,x_mln

The two-factor dispersion model has the form:

x_ijk = μ + F_i + G_j + I_ij + ε_ijk, (15)

where x_ijk is the value of observation number k in cell (i, j);

μ is the general mean;

F_i is the effect due to the influence of the i-th level of factor A;

G_j is the effect due to the influence of the j-th level of factor B;

I_ij is the effect due to the interaction of the two factors, i.e. the deviation of the mean of the observations in cell (i, j) from the sum of the first three terms in model (15);

ε_ijk is the perturbation due to the variation of the variable within a single cell.

It is assumed that ε_ijk has the normal distribution N(0; σ²), and all the mathematical expectations F*, G*, I_i*, I_*j are equal to zero.

Group averages are found by the formulas:

in a cell: x̄_ij = (1/n) Σ_k x_ijk;

by row: x̄_i** = (1/(l·n)) Σ_j Σ_k x_ijk;

by column: x̄_*j* = (1/(m·n)) Σ_i Σ_k x_ijk;

overall average: x̄ = (1/(m·l·n)) Σ_i Σ_j Σ_k x_ijk.

Table 1.3 presents the general form of the calculations in two-way analysis of variance.

Table 1.3 - Basic table of analysis of variance

Variance component | Sum of squares | Degrees of freedom | Mean square
Intergroup (factor A) | Q1 | m − 1 | s1² = Q1/(m − 1)
Intergroup (factor B) | Q2 | l − 1 | s2² = Q2/(l − 1)
Interaction | Q3 | (m − 1)(l − 1) | s3² = Q3/((m − 1)(l − 1))
Residual | Q4 | m·l·(n − 1) | s4² = Q4/(m·l·(n − 1))

Checking the null hypotheses HA, HB, HAB about the absence of influence on the considered variable of factors A, B and their interaction AB is carried out by comparing the ratios , , (for model I with fixed levels of factors) or relations , , (for a random model II) with the corresponding table values F - Fisher-Snedecor criterion. For the mixed model III, the testing of hypotheses regarding factors with fixed levels is performed in the same way as in model II, and for factors with random levels, as in model I.

If n=1, i.e. with one observation in the cell, then not all null hypotheses can be tested, since the Q3 component falls out of the total sum of squared deviations, and with it the mean square, since in this case there can be no question of the interaction of factors.

From the computational point of view, it is more expedient to find the sums of squares Q1, Q2, Q4 and Q directly from their computational formulas and to obtain the interaction sum of squares as the remainder:

Q3 = Q - Q1 - Q2 - Q4.

Deviations from the basic prerequisites of analysis of variance - normality of the distribution of the variable under study and equality of the variances in the cells - do not significantly affect the results when the number of observations per cell is equal, provided the deviations are not excessive, but can affect them strongly when the cell counts are unequal. Moreover, with unequal cell counts the complexity of the ANOVA apparatus increases sharply. It is therefore recommended to plan the design with an equal number of observations in the cells and, if data are missing, to replace them with the averages of the other observations in the same cell. Such artificially introduced values, however, should not be counted when calculating the number of degrees of freedom /1/.
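As an illustration of how model (15) and the decomposition of Table 1.3 are obtained in practice, here is a minimal Python sketch using statsmodels; the factor levels and observations are synthetic and purely illustrative, not the batches/machines data discussed above.

```python
# Minimal two-way ANOVA sketch of model (15): x_ijk = mu + F_i + G_j + I_ij + eps.
# All effects and data are synthetic, for illustration only.
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(0)
m, l, n = 3, 4, 5  # levels of factor A, levels of factor B, observations per cell

rows = []
for i in range(m):
    for j in range(l):
        cell_mean = 10 + 0.5 * i - 0.3 * j + 0.2 * i * j  # synthetic F_i, G_j, I_ij
        for x in cell_mean + rng.normal(0, 1, size=n):     # eps_ijk ~ N(0, 1)
            rows.append({"A": f"A{i}", "B": f"B{j}", "x": x})
df = pd.DataFrame(rows)

model = ols("x ~ C(A) + C(B) + C(A):C(B)", data=df).fit()
# The ANOVA table lists the sums of squares Q1, Q2, Q3, Q4 with their F tests
print(sm.stats.anova_lm(model, typ=2))
```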

2 Application of ANOVA in various processes and research

2.1 Using analysis of variance in the study of migration processes

Migration is a complex social phenomenon that largely determines the economic and political aspects of society. The study of migration processes is associated with identifying factors of interest and of satisfaction with working conditions, and with assessing the influence of the obtained factors on the intergroup movement of the population. Intergroup movement can be described by transition intensities of the form:

λ_ij = c_i q_ij a_j,

where λ_ij is the intensity of transitions from the original group i (exit) to the new group j (entry);

c_i is the possibility and ability to leave group i (c_i ≥ 0);

q_ij is the attractiveness of the new group compared to the original one (0 ≤ q_ij ≤ 1);

a_j is the availability of group j (a_j ≥ 0).

The expected number of transitions from group i to group j is then

ν_ij ≈ n_i λ_ij = n_i c_i q_ij a_j.   (16)

In practice, for an individual person the probability p of moving to another group is small, while the size n of the group under consideration is large. In this case the law of rare events applies, i.e. in the limit ν_ij follows the Poisson distribution with parameter μ = np:

P(ν_ij = k) = (μ^k / k!) e^(-μ), k = 0, 1, 2, …

As μ increases, the distribution approaches the normal one, and the transformed value √ν_ij can be considered normally distributed.

If we take the logarithm of expression (16) and make the necessary changes of variables, we obtain an analysis-of-variance model:

ln √ν_ij = ½ ln ν_ij = ½ (ln n_i + ln c_i + ln q_ij + ln a_j) + ε_ij,

X_ij = 2 ln √ν_ij - ln n_i - ln q_ij,

X_ij = C_i + A_j + ε_ij,

where C_i = ln c_i and A_j = ln a_j.

The values C_i and A_j make it possible to obtain a two-way ANOVA model with one observation per cell; the coefficients c_i and a_j are then recovered from C_i and A_j by the inverse (exponential) transformation.

When conducting the analysis of variance, the values X_ij should be taken as the values of the effective feature Y, with grand mean

X̄ = (X_{1,1} + X_{1,2} + … + X_{mi,mj}) / (mi · mj),

where X̄ is the estimate of the mathematical expectation of X_ij;

mi and mj are, respectively, the numbers of exit and entry groups.

The levels of factor I will be the mi exit groups, and the levels of factor J the mj entry groups; it is assumed that mi = mj = m. The problem is to test the hypotheses H_I and H_J about the equality of the mathematical expectations of Y at the levels I_i and at the levels J_j, i, j = 1,…,m. The testing of H_I is based on comparing the unbiased variance estimates s_I² and s_o². If H_I is true, the statistic F^(I) = s_I²/s_o² has the Fisher distribution with k_1 = m - 1 and k_2 = (m - 1)(m - 1) degrees of freedom. For a given significance level α the right-hand critical point x_cr,α is found; if the numerical value of F^(I) falls in the interval (x_cr,α; +∞), the hypothesis H_I is rejected and factor I is considered to affect the effective feature. The degree of this influence, according to the results of observations, is measured by the sample coefficient of determination, which shows what share of the variance of the effective feature in the sample is due to the influence of factor I.
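The migration model can be sketched in code as follows; the group sizes and the coefficients c_i, q_ij, a_j are invented here, and the small constant added before taking logarithms is only a numerical safeguard against empty cells, not part of the source model.

```python
# Sketch of the migration ANOVA: simulate Poisson transition counts nu_ij,
# transform to X_ij = 2 ln sqrt(nu_ij) - ln n_i - ln q_ij, and fit the
# additive two-way model X_ij = C_i + A_j + eps (one observation per cell).
import numpy as np
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

rng = np.random.default_rng(1)
m = 4                                    # mi = mj = m groups
n_i = rng.integers(500, 2000, size=m)    # sizes of the exit groups (invented)
c = rng.uniform(0.01, 0.05, size=m)      # ability to leave group i
a = rng.uniform(0.5, 2.0, size=m)        # availability of entry group j
q = rng.uniform(0.2, 1.0, size=(m, m))   # attractiveness of j relative to i

lam = c[:, None] * q * a[None, :]        # lambda_ij = c_i q_ij a_j
nu = rng.poisson(n_i[:, None] * lam)     # nu_ij ~ Poisson(n_i lambda_ij)
X = np.log(nu + 0.5) - np.log(n_i)[:, None] - np.log(q)  # 2 ln sqrt(nu) = ln nu

df = pd.DataFrame({
    "X": X.ravel(),
    "I": np.repeat([f"I{i}" for i in range(m)], m),   # exit-group factor
    "J": np.tile([f"J{j}" for j in range(m)], m),     # entry-group factor
})
fit = ols("X ~ C(I) + C(J)", data=df).fit()   # no interaction: one obs per cell
print(sm.stats.anova_lm(fit, typ=2))
```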

2.2 Principles of mathematical and statistical analysis of biomedical research data

Depending on the task, the volume and nature of the material, and the type of data and their relationships, methods of mathematical processing are chosen both at the preliminary stage (to assess the nature of the distribution in the study sample) and at the final analysis stage, in accordance with the objectives of the study. An extremely important aspect is verifying the homogeneity of the selected observation groups, including the control groups, which can be done either by an expert or by multivariate statistical methods (for example, cluster analysis). The first step, however, is compiling a questionnaire that provides a standardized description of the characteristics. This is especially important in epidemiological studies, where different doctors need a common understanding and description of the same symptoms, including the ranges of their variation (severity). If there are significant differences in the registration of the initial data (subjective assessment of the nature of pathological manifestations by different specialists) and it is impossible to bring them to a single form at the data-collection stage, a so-called covariate correction can be carried out, which involves normalizing the variables, i.e. eliminating abnormalities of the indicators in the data matrix. The "coordination of opinions" takes into account the specialty and experience of the doctors, which then makes it possible to compare the examination results they obtain with one another. Multivariate analysis of variance and regression analysis can be used for this.

Features can be either of the same type, which is rare, or of different types; this term refers to their different metrological evaluation. Quantitative, or numerical, features are those measured on interval and ratio scales (group I). Qualitative, rank or score features are used to express medical terms and concepts that have no numerical values (for example, severity of the condition) and are measured on an ordinal scale (group II). Classification, or nominal, features (for example, profession, blood group) are measured on a scale of names (group III).

In many cases an attempt is made to analyze an extremely large number of features, which is supposed to increase the informativeness of the sample. However, selecting the useful information, that is, performing feature selection, is an absolutely necessary operation, since to solve any classification problem the information useful for that problem must be singled out. If for some reason the researcher does not do this independently, or there are no sufficiently substantiated substantive criteria for reducing the dimension of the feature space, the fight against information redundancy is carried out by formal methods, by assessing informativeness.

Analysis of variance makes it possible to determine the influence of various factors (conditions) on the trait (phenomenon) under study by decomposing the total variability (the dispersion, expressed as the sum of squared deviations from the overall mean) into individual components caused by different sources of variability.

With the help of analysis of variance, threats of disease are examined in the presence of risk factors. The concept of relative risk considers the relationship between patients with a particular disease and those without it; the value of the relative risk makes it possible to determine how many times the probability of falling ill increases in the presence of the factor, which can be estimated by the following simplified formula:

r' = [a / (a + b)] / [c / (c + d)],

where a is the presence of a trait in the study group;

b - the absence of a trait in the study group;

c - the presence of a sign in the comparison group (control);

d - absence of a sign in the comparison group (control).

The attribute risk score (rA) is used to assess the proportion of morbidity associated with a given risk factor:

r_A = Q(r' - 1) / [Q(r' - 1) + 1],

where Q is the frequency of the risk-marking trait in the population;

r" - relative risk.

Identification of factors contributing to the occurrence (manifestation) of a disease, i.e. risk factors, can be carried out in various ways, for example by assessing informativeness with subsequent ranking of features; this, however, does not reveal the cumulative effect of the selected parameters, in contrast to regression and factor analyses and the methods of pattern-recognition theory, which make it possible to obtain "symptom complexes" of risk factors. In addition, more sophisticated methods make it possible to analyze indirect relationships between risk factors and diseases /5/.
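The two risk measures just described can be computed directly; in the sketch below the 2x2 counts are hypothetical, and the formulas follow the standard definitions of relative risk and attributable risk used to reconstruct the garbled expressions above.

```python
# Relative risk and attributable risk from a 2x2 table (hypothetical counts).
def relative_risk(a: int, b: int, c: int, d: int) -> float:
    """a/b: trait present/absent in the study group; c/d: same in controls."""
    return (a / (a + b)) / (c / (c + d))

def attributable_risk(q: float, rr: float) -> float:
    """Share of morbidity attributable to the factor, given the population
    frequency q of the risk-marking trait and the relative risk rr."""
    return q * (rr - 1) / (q * (rr - 1) + 1)

rr = relative_risk(a=40, b=60, c=10, d=90)
print(f"relative risk r' = {rr:.2f}")
print(f"attributable risk rA = {attributable_risk(0.3, rr):.2f}")
```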

2.3 Soil bioassay

Diverse pollutants entering an agrocenosis can undergo various transformations in it, and in doing so increase their toxic effect. For this reason, methods for an integral assessment of the quality of agrocenosis components have become necessary. The studies were carried out on the basis of multivariate analysis of variance in an 11-field grain-grass-row crop rotation. The experiment studied the influence of the following factors: soil fertility (A), the fertilizer system (B) and the plant protection system (C). Soil fertility, the fertilizer system and the plant protection system were studied at levels 0, 1, 2 and 3. The basic options were represented by the following combinations:

000 - the initial level of fertility, without the use of fertilizers and plant protection products from pests, diseases and weeds;

111 - the average level of soil fertility, the minimum dose of fertilizer, the biological protection of plants from pests and diseases;

222 - an increased level of soil fertility, the average dose of fertilizers, chemical protection of plants from weeds;

333 - a high level of soil fertility, a high dose of fertilizers, chemical protection of plants from pests and diseases.

We studied options where only one factor is present:

200 - fertility;

020 - fertilizers;

002 - plant protection products.

As well as options with a different combination of factors - 111, 131, 133, 022, 220, 202, 331, 313, 311.

The aim of the study was to study the inhibition of chloroplasts and the coefficient of instantaneous growth, as indicators of soil pollution, in various variants of a multifactorial experiment.

The inhibition of phototaxis of duckweed chloroplasts was studied in different soil horizons: 0–20, 20–40 cm. The share in the total dispersion of soil fertility was 39.7%, fertilizer systems - 30.7%, plant protection systems - 30.7%.

To study the combined effect of factors on the inhibition of chloroplast phototaxis, various combinations of experimental variants were used: in the first case - 000, 002, 022, 222, 220, 200, 202, 020; in the second case - 111, 333, 331, 313, 133, 311, 131.

The results of a two-way analysis of variance indicate a significant effect of the interacting fertilizer and plant protection systems on differences in phototaxis for the first case (the share in the total variance was 10.3%). For the second case, a significant influence of the interacting soil fertility and fertilizer system (53.2%) was found.

Three-way analysis of variance showed in the first case a significant influence of the interaction of all three factors. The share in the total dispersion was 47.9%.

The instantaneous growth coefficient was studied in experiment variants 000, 111, 222, 333, 002, 200, 220. The first stage of testing was before the application of herbicides to winter wheat crops (April), the second stage after herbicide application (May), and the last at harvest time (July). The preceding crops were sunflower and maize for grain.

The appearance of new fronds was observed after a short lag phase with a period of total doubling of the fresh weight of 2-4 days.

In the control and in each variant, on the basis of the obtained results, the coefficient of instantaneous population growth r was calculated, and then the time of doubling the number of fronds (t doubling) was calculated.

t_doubling = ln 2 / r.
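These growth indicators are straightforward to compute; in the sketch below the estimate of r from frond counts is an assumption (the usual exponential-growth formula), since the source gives only the doubling-time relation.

```python
# Instantaneous growth coefficient r and doubling time of a duckweed culture.
import math

def instantaneous_growth(n0: float, nt: float, days: float) -> float:
    # assumed exponential growth: N(t) = N0 * exp(r * t)
    return math.log(nt / n0) / days

def doubling_time(r: float) -> float:
    return math.log(2) / r  # t_doubling = ln 2 / r, as in the text

r = instantaneous_growth(n0=10, nt=38, days=4)   # hypothetical frond counts
print(f"r = {r:.3f} per day, doubling time = {doubling_time(r):.2f} days")
```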

These indicators were calculated over time, together with the analysis of soil samples. Analysis of the data showed that the doubling time of the duckweed population before tillage was the shortest, compared with the data after tillage and at harvest time. In the dynamics of observations, the response of the soil after herbicide application and at harvest time is of greater interest, first of all its interaction with fertilizers and with the fertility level.

Sometimes obtaining a direct response to the application of chemical preparations is complicated by the interaction of the preparation with fertilizers, both organic and mineral. The data obtained made it possible to trace the dynamics of the response to the applied preparations: in all variants with chemical means of protection, a suspension of the growth of the indicator was observed.

The data of the one-way analysis of variance showed a significant effect of each factor on the growth rate of duckweed at the first stage. At the second stage, the effect of differences in soil fertility was 65.0%, and in the fertilizer system and the plant protection system 65.0% each. The factors showed significant differences in the mean instantaneous growth coefficient between variant 222 and variants 000, 111 and 333. At the third stage, the share in the total dispersion of soil fertility was 42.9%, and of the fertilizer system and the plant protection system 42.9% each. A significant difference was noted in the mean values of variants 000 and 111, and of variants 333 and 222.

The studied soil samples from the field-monitoring variants differ from one another in phototaxis inhibition. The influence of the fertility factor, the fertilizer system and plant protection products was noted, with shares of 30.7-39.7% in the one-factor analysis; in the two-factor and three-factor analyses the joint influence of the factors was registered.

Analysis of the experimental results showed insignificant differences between the soil horizons in the phototaxis-inhibition indicator; the differences appear only in the mean values.

In all variants that include plant protection products, changes in the position of chloroplasts and arrest of the growth of lesser duckweed (Lemna minor) are observed /6/.

2.4 Influenza causes increased production of histamine

Researchers at the Children's Hospital of Pittsburgh (USA) have obtained the first evidence that histamine levels increase in acute respiratory viral infections, even though it had previously been suggested that histamine plays a role in the onset of symptoms of acute upper respiratory tract infections.

The scientists were interested in why many people use antihistamines, which in many countries belong to the OTC category, i.e. are available without a doctor's prescription, for self-treatment of "colds" and runny nose.

The aim of this study was to determine whether histamine production is increased during experimental influenza A virus infection.

Fifteen healthy volunteers were inoculated intranasally with influenza A virus and then observed for the development of infection. Every day during the course of the disease, a morning urine sample was collected from the volunteers; histamine and its metabolites were then determined, and the total amount of histamine and its metabolites excreted per day was calculated.

The disease developed in all 15 volunteers. Analysis of variance confirmed a significantly higher level of histamine in the urine on days 2-5 of the viral infection (p < 0.02), the period when the "cold" symptoms are most pronounced. Paired analysis showed that the histamine level rises most significantly on day 2 of the disease. In addition, it turned out that the daily amount of histamine and its metabolites in the urine during influenza is approximately the same as during an exacerbation of an allergic disease.

The results of this study provide the first direct evidence that histamine levels are elevated in acute respiratory infections /7/.

2.5 Dispersion analysis in chemistry

Dispersion analysis is a set of methods for determining dispersion, i.e., characteristics of particle sizes in disperse systems. Dispersion analysis includes various methods for determining the size of free particles in liquid and gaseous media, the size of pore channels in finely porous bodies (in this case, the equivalent concept of porosity is used instead of the concept of dispersion), as well as the specific surface area. Some of the methods of dispersion analysis make it possible to obtain a complete picture of the distribution of particles by size (volume), while others give only an average characteristic of dispersion (porosity).

The first group includes, for example, methods for determining the size of individual particles by direct measurement (sieve analysis, optical and electron microscopy) or by indirect data: the settling rate of particles in a viscous medium (sedimentation analysis in a gravitational field and in centrifuges), the magnitude of electric current pulses, arising from the passage of particles through a hole in a non-conductive partition (conductometric method).

The second group of methods combines the estimation of the average sizes of free particles and the determination of the specific surface area of powders and porous bodies. The average particle size is found from the intensity of scattered light (nephelometry), with an ultramicroscope, by diffusion methods, etc.; the specific surface area is found from the adsorption of gases (vapors) or dissolved substances, from gas permeability, from the dissolution rate, and by other methods. Below are the limits of applicability of the various methods of dispersion analysis (particle sizes in meters):

Sieve analysis - 10^(-2) to 10^(-4)

Sedimentation analysis in a gravitational field - 10^(-4) to 10^(-6)

Conductometric method - 10^(-4) to 10^(-6)

Microscopy - 10^(-4) to 10^(-7)

Filtration method - 10^(-5) to 10^(-7)

Centrifugation - 10^(-6) to 10^(-8)

Ultracentrifugation - 10^(-7) to 10^(-9)

Ultramicroscopy - 10^(-7) to 10^(-9)

Nephelometry - 10^(-7) to 10^(-9)

Electron microscopy - 10^(-7) to 10^(-9)

Diffusion method - 10^(-7) to 10^(-10)

Dispersion analysis is widely used in various fields of science and industry to assess the dispersity of systems (suspensions, emulsions, sols, powders, adsorbents, etc.) with particle sizes from several millimeters (10^(-3) m) to several nanometers (10^(-9) m) /8/.

2.6 The use of direct intentional suggestion in the waking state in the methodology of developing physical qualities

Physical training is the fundamental side of sports training, since, to a greater extent than other aspects of training, it is characterized by physical loads affecting the morphological and functional properties of the body. The success of technical training, the content of an athlete's tactics, and the realization of personal properties in training and competition depend on the level of physical fitness.

One of the main tasks of physical training is the development of physical qualities. Hence the need to develop pedagogical means and methods that take into account the age characteristics of young athletes, preserve their health, require no additional time, and at the same time stimulate the growth of physical qualities and, as a result, of sportsmanship. The use of verbal hetero-influence in the training process of beginner-level groups is one of the promising directions of research on this problem.

An analysis of the theory and practice of suggestive verbal hetero-influence revealed the following main contradictions:

Between the evidence for the effective use of specific methods of verbal hetero-influence in the training process and the practical impossibility of their use by a coach;

Between the recognition of direct intentional suggestion in the waking state (hereinafter PPV) as one of the main methods of verbal hetero-influence in the pedagogical activity of a coach and the lack of a theoretical justification of the methodological features of its use in sports training, in particular in the development of physical qualities.

In view of the identified contradictions and the insufficient development of the problem of using a system of verbal hetero-influence methods in developing athletes' physical qualities, the purpose of the study was defined as follows: to develop rational targeted methods of PPV in the waking state that improve the development of physical qualities, based on an assessment of the mental state and of the manifestation and dynamics of the physical qualities of judoists in beginner training groups.

To test and determine the effectiveness of the experimental PPV methods in developing the physical qualities of judo wrestlers, a comparative pedagogical experiment was conducted in which four groups took part: three experimental and one control. In the first experimental group (EG) the PPV M1 technique was used, in the second the PPV M2 technique, and in the third the PPV M3 technique. In the control group (CG) no PPV methods were used.

To determine the effectiveness of the pedagogical impact of the PPV methods on the development of physical qualities in the judoists, a one-way analysis of variance was carried out.
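The "degree of influence" percentages reported below are the shares of total variance attributable to the factor. A minimal sketch of how such a share (eta squared) is obtained from a one-way layout follows; the group scores are invented.

```python
# Eta squared: between-group sum of squares over total sum of squares.
import numpy as np

def eta_squared(*groups: np.ndarray) -> float:
    all_obs = np.concatenate(groups)
    grand_mean = all_obs.mean()
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_total = ((all_obs - grand_mean) ** 2).sum()
    return ss_between / ss_total

eg1 = np.array([7.1, 7.4, 6.9, 7.8, 7.5])   # hypothetical test scores
eg2 = np.array([7.9, 8.2, 8.0, 8.6, 8.3])
cg = np.array([6.8, 7.0, 6.7, 7.2, 6.9])
print(f"degree of influence = {eta_squared(eg1, eg2, cg):.1%}")
```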

The degree of influence of the PPV M1 methodology in developing the following qualities:

Endurance:

a) after the third month was 11.1%;

Speed abilities:

a) after the first month - 16.4%;

b) after the second - 26.5%;

c) after the third - 34.8%;

Strength:

a) after the second month - 26.7%;

b) after the third - 35.3%;

Flexibility:

a) after the third month - 20.8%;

Coordination abilities:

a) after the second month of the main pedagogical experiment, the degree of influence of the methodology was 6.4%;

b) after the third - 10.2%.

Consequently, significant changes in the indicators of the level of development of physical qualities under the PPV M1 method were found for speed abilities and strength, where the degree of influence of the method is greatest. The least degree of influence was found in developing endurance, flexibility and coordination abilities, which gives grounds to speak of the insufficient effectiveness of the PPV M1 method for these qualities.

The degree of influence of the PPV M2 methodology in developing the following qualities:

Endurance

a) after the first month of the experiment - 12.6%;

b) after the second - 17.8%;

c) after the third - 20.3%.

Speed abilities:

a) after the third month of training sessions - 28%.

Strength:

a) after the second month - 27.9%;

b) after the third - 35.9%.

Flexibility:

a) after the third month of training sessions - 14.9%;

Coordination abilities - 13.1%.

The result of the one-way ANOVA for this EG allows us to conclude that the PPV M2 method is most effective in developing endurance and strength, and less effective in developing flexibility, speed and coordination abilities.

The degree of influence of the PPV M3 methodology in developing the following qualities:

Endurance:

a) after the first month of the experiment 16.8%;

b) after the second - 29.5%;

c) after the third - 37.6%.

Speed abilities:

a) after the first month - 26.3%;

b) after the second - 31.3%;

c) after the third - 40.9%.

Strength:

a) after the first month - 18.7%;

b) after the second - 26.7%;

c) after the third - 32.3%.

Flexibility:

a) after the first - there are no changes;

b) after the second - 16.9%;

c) after the third - 23.5%.

Coordination abilities:

a) there are no changes after the first month;

b) after the second - 23.8%;

c) after the third - 91%.

Thus, the one-way analysis of variance showed that using the PPV M3 technique in the preparatory period is most effective for developing physical qualities, since the degree of its influence grows after each month of the pedagogical experiment /9/.

2.7 Relief of acute psychotic symptoms in patients with schizophrenia with an atypical antipsychotic

The purpose of the study was to examine the possibility of using Rispolept for the relief of acute psychosis in patients diagnosed with schizophrenia (paranoid type according to ICD-10) and schizoaffective disorder. The duration of persistence of psychotic symptoms under pharmacotherapy with Rispolept (main group) and with classical antipsychotics served as the main criterion under study.

The main objective of the study was to determine the duration of psychosis (so-called net psychosis), understood as the persistence of productive psychotic symptoms from the start of antipsychotic use and expressed in days. This indicator was calculated separately for the risperidone group and for the classical-antipsychotic group.

Along with this, the task was set to determine the proportion of reduction of productive symptoms under the influence of risperidone in comparison with classical antipsychotics at different periods of therapy.

A total of 89 patients (42 men and 47 women) with acute psychotic symptoms within the paranoid form of schizophrenia (49 patients) and schizoaffective disorder (40 patients) were studied.

The first episode and disease duration up to 1 year were registered in 43 patients, while in other cases at the time of the study, subsequent episodes of schizophrenia were noted with a disease duration of more than 1 year.

Rispolept therapy was received by 29 people, of whom 15 patients had a so-called first episode. Therapy with classical neuroleptics was received by 60 people, of whom 28 had a first episode. The dose of Rispolept varied from 1 to 6 mg per day, averaging 4±0.4 mg/day; risperidone was taken exclusively orally, after meals, once a day in the evening.

Therapy with classical antipsychotics included trifluoperazine (triftazine) at a daily dose of up to 30 mg intramuscularly, haloperidol at a daily dose of up to 20 mg intramuscularly, and triperidol at a daily dose of up to 10 mg orally. The vast majority of patients took classical antipsychotics as monotherapy during the first two weeks, after which, if necessary (when delusional, hallucinatory or other productive symptoms persisted), they switched to a combination of several classical antipsychotics. A neuroleptic with a pronounced elective anti-delusional and anti-hallucinatory effect (for example, haloperidol or triftazin) remained the main drug, and a drug with a distinct hypnosedative effect (chlorpromazine, tizercin, chlorprothixene at doses up to 50-100 mg/day) was added to it in the evening.

In the group taking classical antipsychotics, it was planned to take anticholinergic correctors (Parkopan, Cyclodol) at doses up to 10-12 mg/day. Correctors were prescribed in case of distinct extrapyramidal side effects in the form of acute dystonia, drug-induced parkinsonism and akathisia.

Table 2.1 presents data on the duration of psychosis in the treatment of rispolept and classical antipsychotics.

Table 2.1 - Duration of psychosis ("net psychosis") in the treatment of rispolept and classical antipsychotics

As follows from the data in the table, comparing the duration of psychosis under classical antipsychotics and under risperidone shows an almost twofold reduction in the duration of psychotic symptoms under Rispolept. Notably, neither the serial number of the attack nor the nature of the leading syndrome affected this duration. In other words, the duration of psychosis was determined solely by the therapy factor, i.e. it depended on the type of drug used, regardless of the serial number of the attack, the duration of the disease and the nature of the leading psychopathological syndrome.

To confirm the obtained regularities, a two-way analysis of variance was carried out, taking into account in turn the interaction of the therapy factor with the serial number of the attack (stage 1) and the interaction of the therapy factor with the nature of the leading syndrome (stage 2). The results of the analysis of variance confirmed the influence of the therapy factor on the duration of psychosis (F = 18.8) in the absence of influence of the attack-number factor (F = 2.5) and of the syndrome-type factor (F = 1.7). The joint influence of the therapy factor and the attack number on the duration of psychosis was likewise absent, as was the joint influence of the therapy factor and the syndrome factor.

Thus, the results of the analysis of variance confirmed the effect of only one factor, the antipsychotic used: Rispolept led to an approximately twofold reduction in the duration of psychotic symptoms compared with traditional antipsychotics. Importantly, this effect was achieved despite the oral administration of Rispolept, whereas the classical antipsychotics were used parenterally in most patients /10/.

2.8 Warping of fancy yarns with roving effect

The Kostroma State Technological University has developed a new fancy-thread structure with variable geometric parameters, which raises the problem of processing fancy yarn in preparatory production. This study addressed two questions of the warping process: the choice of the type of tensioning device giving the minimum spread of tension, and the equalization of the tension of threads of different linear densities across the width of the warping beam.

The object of research is a linen shaped thread of four variants of linear density from 140 to 205 tex. The work of tension devices of three types was studied: porcelain washer, two-zone NS-1P and single-zone NS-1P. An experimental study of the tension of warping threads was carried out on a warping machine SP-140-3L. The warping speed, the weight of the brake discs corresponded to the technological parameters of the warping of the yarn.

To study the dependence of the shaped thread's tension on its geometric parameters during warping, a two-factor analysis was carried out: X_1 - the diameter of the effect, X_2 - the length of the effect. The output parameters are the tension Y_1 and the tension fluctuation Y_2.

The resulting regression equations are adequate to the experimental data at a significance level of 0.95, since the calculated Fisher criterion for all equations is less than the tabular one.

To determine the degree of influence of the factors X_1 and X_2 on the parameters Y_1 and Y_2, an analysis of variance was carried out; it showed that the diameter of the effect has the greater influence on the level and fluctuation of tension.

A comparative analysis of the obtained tensograms showed that the minimum spread of tension during warping of this yarn is provided by a two-zone tension device NS-1P.

It has been established that with an increase in linear density from 105 to 205 tex, the NS-1P device gives an increase in the tension level by only 23%, while the porcelain washer - by 37%, single-zone NS-1P - by 53%.

When forming warping beams that include both shaped and "smooth" threads, the tensioning device must be adjusted individually, using the traditional method /11/.

2.9 Concomitant pathology with complete loss of teeth in elderly and senile people

The epidemiology of complete loss of teeth, and the concomitant pathology, of the elderly population living in nursing homes in Chuvashia was studied. The examination was carried out by means of a dental examination and the completion of statistical cards for 784 people. The results showed a high percentage of complete loss of teeth, aggravated by general pathology of the body. This characterizes the examined category of the population as a group of increased dental risk and requires a revision of the entire system of their dental care.

In the elderly, the incidence rate is two times, and in senile age six times, higher than the incidence rate in younger people.

The main diseases of elderly and senile people are diseases of the circulatory system, nervous system and sensory organs, respiratory organs, digestive organs, bones and organs of movement, neoplasms and injuries.

The purpose of the study is to develop and obtain information about concomitant diseases, the effectiveness of prosthetics and the need for orthopedic treatment of elderly and senile people with complete loss of teeth.

A total of 784 people aged 45 to 90 were examined. The ratio of women and men is 2.8:1.

Evaluation of the statistical relationship using Pearson's rank correlation coefficient made it possible to establish the mutual influence of the absence of teeth and concomitant morbidity at a reliability level of p = 0.0005. Elderly patients with complete loss of teeth suffer from diseases characteristic of old age, namely cerebral atherosclerosis and hypertension.

Analysis of variance showed that the specificity of the disease plays a decisive role under the conditions under study. The role of nosological forms in different age periods ranges from 52-60%. The greatest statistically significant impact on the absence of teeth is caused by diseases of the digestive system and diabetes mellitus.

In general, the group of patients aged 75-89 years was characterized by a larger number of pathologies.

In this study, a comparative study was carried out of the incidence of comorbidity among elderly and senile patients with complete loss of teeth living in nursing homes. A high percentage of missing teeth among people of this age group was revealed. Patients with complete adentia show the comorbidities characteristic of this age; atherosclerosis and hypertension were the most common among those examined. The impact of diseases of the gastrointestinal tract and diabetes mellitus on the state of the oral cavity is statistically significant, while the share of the other nosological forms was in the range of 52-60%. The analysis of variance did not confirm a significant role of gender or place of residence in the indicators of the state of the oral cavity.

Thus, in conclusion, it should be noted that the analysis of the distribution of concomitant diseases in elderly and senile persons with a complete absence of teeth showed that this category of citizens belongs to a special group of the population that should receive adequate dental care within the existing dental-care systems /12/.

3 Analysis of variance in the context of statistical methods

Statistical analysis methods are a methodology for measuring the results of human activity, that is, converting qualitative characteristics into quantitative ones.

The main steps in the statistical analysis:

Drawing up a plan for collecting the initial data: the values of the input variables (X_1,…,X_p) and the number of observations n. This step is performed when the experiment is actively planned.

Obtaining the initial data and entering them into a computer. At this stage, the arrays of numbers (x_1i,…,x_pi; y_1i,…,y_qi), i = 1,…,n, are formed, where n is the sample size.

Primary statistical data processing. At this stage, a statistical description of the considered parameters is formed:

a) construction and analysis of statistical dependencies;

b) correlation analysis is designed to evaluate the significance of the influence of the factors (X_1,…,X_p) on the response Y;

c) analysis of variance is used to evaluate the influence of non-quantitative factors (X_1,…,X_p) on the response Y in order to select the most important among them;

d) regression analysis is designed to determine the analytical dependence of the response Y on the quantitative factors X;

Interpretation of the results in terms of the task set /13/.

Table 3.1 shows the statistical methods by which analytical problems are solved. The corresponding cells of the table contain the frequencies of applying statistical methods:

Label "-" - the method is not applied;

Label "+" - the method is applied;

Label "++" - the method is widely used;

Label "+++" - the application of the method is of particular interest /14/.

Analysis of variance, like Student's t-test, makes it possible to evaluate differences between sample means; however, unlike the t-test, it imposes no restriction on the number of means compared. Thus, instead of asking whether two sample means differ, one can test whether two, three, four, five or, in general, k means differ.

ANOVA allows dealing with two or more independent variables (features, factors) simultaneously, evaluating not only the effect of each of them separately, but also the effects of interaction between them /15/.
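For example, a single call compares k = 3 group means at once; the samples below are invented.

```python
# One-way ANOVA across three groups with scipy.
from scipy import stats

g1 = [24, 27, 21, 25, 26]   # hypothetical scores under condition 1
g2 = [29, 31, 28, 33, 30]   # condition 2
g3 = [22, 20, 24, 23, 21]   # condition 3

f_stat, p_value = stats.f_oneway(g1, g2, g3)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```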


Table 3.1 - Application of statistical methods in solving analytical problems

The rows of the table are the analytical tasks arising in the field of business, finance and management: horizontal (temporal) analysis; vertical (structural) analysis; trend analysis and forecasting; analysis of relative indicators; comparative (spatial) analysis; factor analysis.

The columns are the groups of methods: descriptive statistics; testing of statistical hypotheses; regression analysis; analysis of variance; multivariate analysis; discriminant analysis; cluster analysis; survival analysis; time-series analysis and forecasting.

For most complex systems the Pareto principle applies, according to which 20% of the factors determine 80% of the system's properties. Therefore, the primary task of the researcher of a simulation model is to eliminate the insignificant factors, which reduces the dimension of the model-optimization problem.

Analysis of variance evaluates the deviation of observations from the overall mean. The variation is then broken down into parts, each of which has its own cause; the residual part of the variation, which cannot be related to the conditions of the experiment, is considered its random error. Significance is confirmed by a special test, the F statistic.

Analysis of variance determines if there is an effect. Regression analysis allows you to predict the response (the value of the objective function) at some point in the parameter space. The immediate task of regression analysis is to estimate the regression coefficients /16/.

Too large sample sizes make statistical analyzes difficult, so it makes sense to reduce the sample size.

By applying analysis of variance, it is possible to identify the significance of the influence of various factors on the variable under study. If the influence of a factor turns out to be insignificant, then this factor can be excluded from further processing.

3.1 Vector autoregressions

Macroeconometricians must be able to solve four logically distinct problems:

Description of data;

Macroeconomic forecast;

Structural inference;

Policy analysis.

Describing data means describing the properties of one or more time series and communicating these properties to a wide range of economists. Macroeconomic forecasting means predicting the course of the economy, usually two to three years or less (mainly because it is too difficult to forecast over longer horizons). Structural inference means checking whether macroeconomic data is consistent with a particular economic theory. Macroeconometric policy analysis proceeds in several directions: on the one hand, the impact on the economy of a hypothetical change in policy instruments (for example, a tax rate or short-term interest rate) is assessed, on the other hand, the impact of a change in policy rules (for example, a transition to a new monetary policy regime) is assessed. An empirical macroeconomic research project may include one or more of these four tasks. Each problem must be solved in such a way that correlations between time series are taken into account.

In the 1970s these problems were solved using a variety of methods which, assessed from today's standpoint, were inadequate for several reasons. To describe the dynamics of an individual series it was sufficient to use univariate time-series models, and to describe the joint dynamics of two series, spectral analysis; however, there was no common language suitable for the systematic description of the joint dynamic properties of several time series. Economic forecasts were made either with simplified autoregressive moving average (ARMA) models or with the large structural econometric models popular at the time. Structural inference was based either on small single-equation models or on large models whose identification was achieved through ill-founded exclusionary constraints and which usually did not include expectations. Policy analysis with structural models depended on these identifying assumptions.

Finally, the rise in prices in the 1970s was seen by many as a major setback for the big models that were being used to make policy recommendations at the time. That is, it was the right time for the emergence of a new macroeconometric construct that could solve these many problems.

In 1980, such a construction was created - vector autoregressions (VAR). At first glance, VAR is nothing more than a generalization of univariate autoregression to the multivariate case, and each equation in VAR is nothing more than a simple least squares regression of one variable on the lagged values ​​of itself and other variables in VAR. But this seemingly simple tool made it possible to systematically and internally consistently capture the rich dynamics of multivariate time series, and the statistical toolkit that accompanies VAR proved to be convenient and, very importantly, easy to interpret.

There are three different VAR models:

Reduced VAR form;

Recursive VAR;

Structural VAR.

All three are dynamic linear models that relate the current and past values ​​of the Y t vector of an n-dimensional time series. The reduced form and recursive VARs are statistical models that do not use any economic considerations other than the choice of variables. These VARs are used to describe data and forecast. Structural VAR includes constraints derived from macroeconomic theory and this VAR is used for structural inference and policy analysis.

The reduced form of the VAR expresses Y_t as a distributed lag of its past values plus a serially uncorrelated error term; that is, it generalizes univariate autoregression to the vector case. Mathematically, the reduced form of the VAR model is a system of n equations, which can be written in matrix form as follows:

Y_t = η + A_1 Y_(t-1) + A_2 Y_(t-2) + … + A_p Y_(t-p) + ε_t,   (17)

where η is an n×1 vector of constants;

A_1, A_2, …, A_p are n×n coefficient matrices;

ε_t is an n×1 vector of serially uncorrelated errors, which are assumed to have zero mean and covariance matrix Σ.

The errors ε_t in (17) represent the unexpected dynamics in Y_t that remain after the linear distributed lag of past values has been taken into account.

Estimating the parameters of the reduced VAR form is easy. Each equation contains the same regressors (Y_(t-1),…,Y_(t-p)), and there are no cross-equation restrictions, so efficient estimation (full-information maximum likelihood) reduces to ordinary least squares applied to each equation. The error covariance matrix can be reasonably estimated by the sample covariance matrix of the OLS residuals.

The only subtlety is determining the lag length p, which can be done using an information criterion such as AIC or BIC.
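A minimal sketch of reduced-form VAR estimation with statsmodels follows, including lag-length selection by AIC; the two series are simulated rather than real macroeconomic data.

```python
# Estimate a reduced-form VAR; the lag length p is chosen by an
# information criterion, as described in the text.
import numpy as np
import pandas as pd
from statsmodels.tsa.api import VAR

rng = np.random.default_rng(2)
T = 200
y = np.zeros((T, 2))
for t in range(1, T):
    # stable VAR(1) used only to generate example data
    y[t] = [0.5 * y[t - 1, 0] + 0.1 * y[t - 1, 1],
            0.2 * y[t - 1, 0] + 0.4 * y[t - 1, 1]]
    y[t] += rng.normal(0, 1, size=2)

data = pd.DataFrame(y, columns=["y1", "y2"])
results = VAR(data).fit(maxlags=8, ic="aic")  # OLS equation by equation
print(results.summary())
```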

At the level of matrix equations, recursive and structural VARs look the same. These two VAR models explicitly take into account simultaneous interactions between the elements of Y_t, which amounts to adding a simultaneous term to the right-hand side of equation (17). Accordingly, recursive and structural VARs are both represented in the following general form:

Y_t = γ + B_0 Y_t + B_1 Y_(t-1) + … + B_p Y_(t-p) + η_t,

where γ is a vector of constants;

B_0, …, B_p are coefficient matrices;

η_t is a vector of errors.

The presence of the matrix B_0 in the equation means that simultaneous interaction between the n variables is possible; that is, B_0 allows variables relating to the same point in time to be determined jointly.

A recursive VAR can be estimated in two ways. The recursive structure gives a set of recursive equations that can be estimated by least squares. An equivalent estimation method is to multiply the system of reduced-form equations (17) from the left by an appropriate lower triangular matrix.

The method of estimating the structural VAR depends on how exactly B 0 is identified. The partial information approach entails the use of single equation estimation methods such as two-step least squares. The complete information approach entails the use of multi-equation estimation methods such as three-step least squares.

Note how many different VARs there are. The reduced form of the VAR is unique. Each ordering of the variables in Y_t corresponds to one recursive VAR, but there are n! such orderings, i.e. n! different recursive VARs. The number of structural VARs, that is, of sets of assumptions identifying the simultaneous relationships between the variables, is limited only by the researcher's ingenuity.

Since the matrices of estimated VAR coefficients are difficult to interpret directly, the results of VAR estimation are usually represented by some function of these matrices. Such statistics include the decomposition of forecast errors.

Forecast error variance decompositions are computed mainly for recursive or structural systems. This decomposition of the variance shows how important the error in the j-th equation is for explaining unexpected changes in the i-th variable. When the VAR errors are uncorrelated across equations, the variance of the h-step-ahead forecast error can be written as the sum of components arising from each of these errors /17/.
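Continuing the sketch above, the decomposition is available directly from the fitted model; the ordering of the columns plays the role of the recursive identification discussed in the text.

```python
# Continues the VAR sketch above (uses the fitted `results` object).
fevd = results.fevd(10)   # decomposition for forecast horizons 1..10
print(fevd.summary())     # share of each equation's shock in each
                          # variable's forecast-error variance
```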

3.2 Factor analysis

In modern statistics, factor analysis is understood as a set of methods that, on the basis of real-life relationships of features (or objects), make it possible to identify latent generalizing characteristics of the organizational structure and development mechanism of the phenomena and processes under study.

The concept of latency is key in this definition; it means that the characteristics revealed by factor-analysis methods are implicit. We deal first with a set of elementary features X_j, whose interaction presupposes the existence of certain causes and special conditions, i.e. of some hidden factors. These are established by generalizing the elementary features and act as integrated characteristics, or features, of a higher level. Naturally, not only the elementary features X_j can correlate, but also the observed objects N_i themselves, so the search for latent factors is theoretically possible from both feature data and object data.

If the objects are characterized by a sufficiently large number of elementary features (m > 3), then another assumption is also logical: the existence of dense clusters of points (features) in the space of n objects. In this case the new axes generalize not the features X_j but the objects n_i; the latent factors F_r are then recognized from the composition of the observed objects:

F_r = c_1 n_1 + c_2 n_2 + … + c_N n_N,

where c_i is the weight of object n_i in factor F_r.

Depending on which of the types of correlation considered above is studied in factor analysis, correlation of elementary features or of observed objects, one distinguishes the R-technique and the Q-technique of data processing.

The R-technique is the analysis of the data over m features, yielding r linear combinations (groups) of features: F_r = f(X_j), r = 1,…,m. Analysis based on the proximity (connection) of the n observed objects is called the Q-technique and makes it possible to determine r linear combinations (groups) of objects: F = f(n_i), i = 1,…,N.

Currently, in practice, more than 90% of problems are solved using R-techniques.

The set of factor analysis methods is currently quite large, it includes dozens of different approaches and data processing techniques. In order to focus on the correct choice of methods in research, it is necessary to present their features. We divide all methods of factor analysis into several classification groups:

Principal component method. Strictly speaking, it is not classified as factor analysis, although it has much in common with it. Specific is, firstly, that in the course of computational procedures all the main components are simultaneously obtained and their number is initially equal to the number of elementary features. Secondly, the possibility of a complete decomposition of the dispersion of elementary features is postulated, in other words, its complete explanation through latent factors (generalized features).

Factor analysis methods. Here the variance of the elementary features is not explained in full; it is recognized that part of the variance remains unrecognized, as characteristic variance. Factors are usually extracted sequentially: the first explains the largest share of the variation of the elementary features; the second explains a smaller share, the largest remaining after the first latent factor; the third a smaller share still, and so on. The extraction of factors can be stopped at any step, once a decision is made that the share of explained variance is sufficient, or in view of the interpretability of the latent factors.

It is advisable to further divide the factor analysis methods into two classes: simplified and modern approximating methods.

Simple factor analysis methods are mainly associated with initial theoretical developments. They have limited capabilities in identifying latent factors and approximating factorial solutions. These include:

One factor model. It allows you to select only one general latent and one characteristic factors. For possibly existing other latent factors, an assumption is made about their insignificance;

Bifactor model. It allows the variation of elementary features to be influenced not by one but by several latent factors (usually two) and one characteristic factor;

Centroid method. Correlations between variables are considered as a bundle of vectors, and the latent factor is geometrically represented as a balancing vector passing through the center of this bundle. The method makes it possible to identify several latent and characteristic factors, and for the first time it becomes possible to correlate the factor solution with the original data, i.e. to solve the approximation problem in its simplest form.

Modern approximating methods often assume that the first, approximate solution has already been found by some of the methods, and this solution is optimized by subsequent steps. The methods differ in the complexity of calculations. These methods include:

Group method. The solution is based on groups of elementary features selected in advance in some way;

Principal factors method. It is closest to the principal component method; the difference lies in the assumption that characteristic factors exist;

The maximum likelihood, minimum residuals, alpha-factor analysis and canonical factor analysis methods, all of them optimizing.

These methods make it possible to consistently improve previously found solutions based on the use of statistical techniques for estimating a random variable or statistical criteria, and require a large amount of time-consuming calculations. The most promising and convenient for work in this group is the maximum likelihood method.

The main task solved by the various methods of factor analysis, including the principal component method, is the compression of information: the transition from a set of values of m elementary features, with information volume n × m, to a limited set of elements of the factor mapping matrix (m × r), or of the matrix of latent factor values for each observed object, of dimension n × r, where usually r < m.
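A small sketch of this compression step using principal components (the first group of methods listed above); the n x m data matrix is simulated from r hidden factors purely for illustration.

```python
# Compress an n x m feature matrix to an n x r matrix of factor values.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n, m, r = 100, 8, 3
latent = rng.normal(size=(n, r))                        # hidden factors
loadings = rng.normal(size=(r, m))                      # factor mapping
X = latent @ loadings + 0.3 * rng.normal(size=(n, m))   # observed features

pca = PCA(n_components=r)
scores = pca.fit_transform(X)          # n x r latent factor values
print(scores.shape)                    # (100, 3)
print(pca.explained_variance_ratio_)   # share of variance per component
```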

Factor analysis methods also make it possible to visualize the structure of the phenomena and processes under study, which means determining their state and predicting their development. Finally, the factor analysis data provide grounds for identifying the object, i.e. solving the problem of image recognition.

Factor analysis methods have properties that are very attractive for their use as part of other statistical methods, most often in correlation-regression analysis, cluster analysis, multivariate scaling, etc. /18/.

3.3 Paired regression. Probabilistic nature of regression models.

If we consider the problem of analyzing food expenditure in groups with the same income, for example $10,000 (x), then x is a deterministic value, whereas Y, the share of this money spent on food, is random and can change from year to year. Therefore, for each i-th individual:

y_i = α + βx_i + ε_i,

where ε_i is a random error;

α and β are constants (theoretically), although they may vary from model to model.

Prerequisites for pairwise regression:

- X and Y are linearly related;

- X is a non-random variable with fixed values;

- the errors ε are normally distributed, N(0, σ²);

- the errors corresponding to different observations are independent: E(ε_i ε_j) = 0 for i ≠ j.

Figure 3.1 shows a pairwise regression model.

Figure 3.1 - Paired regression model

These assumptions describe the classical linear regression model.

If the error has a non-zero mean, the original model is equivalent to a new model with a different intercept but a zero mean for the error.

If these prerequisites are satisfied, then the least squares estimators α̂ and β̂ are efficient linear unbiased estimators.

Denote S_x = Σ(x_i - x̄)². Then the mathematical expectations and variances of the coefficients are:

E(α̂) = α, E(β̂) = β,

Var(β̂) = σ²/S_x, Var(α̂) = σ² Σx_i² / (n S_x).

Covariance of the coefficients:

Cov(α̂, β̂) = -x̄ σ²/S_x.

If the errors ε are normally distributed, then α̂ and β̂ are also normally distributed:

α̂ ~ N(α, Var(α̂)), β̂ ~ N(β, Var(β̂)).

From this it follows that:

The variance of β̂ is completely determined by the variance of ε;

The higher the variance of X, the better the estimate of β.

The variance of the deviations from the regression is determined by the formula:

s² = Σ(y_i - ŷ_i)² / (n - 2).

In this form the variance of the deviations is an unbiased estimate; its square root is called the standard error of the regression, and n - 2 can be interpreted as the number of degrees of freedom.

Analysis of deviations from the regression line can provide a useful measure of how well the estimated regression reflects the real data. A good regression is one that explains a significant proportion of the variance in Y, and vice versa, a bad regression does not track most of the fluctuations in the original data. It is intuitively clear that any additional information will improve the model, that is, reduce the unexplained share of variation Y. To analyze the regression model, the variance is decomposed into components, and the coefficient of determination R 2 is determined.

The ratio of two variances follows the F-distribution; that is, by testing the statistical significance of the difference between the variance explained by the model and the variance of the residuals, we can conclude whether R² is significant.

The hypothesis of the equality of the variances of two samples is tested with the statistic

F = s_1² / s_2².

If the hypothesis H_0 (equality of the sample variances) is true, this statistic has an F-distribution with (m_1, m_2) = (n_1 - 1, n_2 - 1) degrees of freedom.

Having calculated the F-ratio as the ratio of the two variances and compared it with the tabulated value, we can conclude whether R² is statistically significant /2/, /19/.
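A brief sketch of the paired regression and this F-based check; the data are simulated under the classical assumptions (fixed x, normal errors), and statsmodels reports the F statistic and its p-value directly.

```python
# Fit y = alpha + beta * x by OLS and test the significance of R^2.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(4)
x = np.linspace(0, 10, 50)                        # fixed, non-random regressor
y = 2.0 + 0.7 * x + rng.normal(0, 1, size=x.size)

X = sm.add_constant(x)                            # adds the intercept column
fit = sm.OLS(y, X).fit()
print(f"alpha, beta = {fit.params}")
print(f"R^2 = {fit.rsquared:.3f}, F = {fit.fvalue:.1f}, p = {fit.f_pvalue:.2e}")
```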

Conclusion

Modern applications of analysis of variance cover a wide range of problems in economics, biology, and technology and are usually interpreted in terms of the statistical theory of revealing systematic differences between the results of direct measurements performed under certain changing conditions.

Thanks to the automation of analysis of variance, a researcher can conduct a variety of statistical studies on a computer, spending less time and effort on calculations. Many software packages now implement the apparatus of analysis of variance, and most statistical methods are available in modern statistical software products. With the development of algorithmic programming languages it has also become possible to create additional blocks for processing statistical data.

ANOVA is a powerful modern statistical method for processing and analyzing experimental data in psychology, biology, medicine and other sciences. It is very closely related to the specific methodology for planning and conducting experimental studies.

Analysis of variance is used in all areas of scientific research, where it is necessary to analyze the influence of various factors on the variable under study.

Bibliography

1 Kremer N.Sh. Probability Theory and Mathematical Statistics. M.: Unity-Dana, 2002. 343 p.

2 Gmurman V.E. Probability Theory and Mathematical Statistics. M.: Higher School, 2003. 523 p.

4 www.conf.mitme.ru

5 www.pedklin.ru

6 www.webcenter.ru

7 www.infections.ru

8 www.encycl.yandex.ru

9 www.infosport.ru

10 www.medtrust.ru

11 www.flax.net.ru

12 www.jdc.org.il

13 www.big.spb.ru

14 www.bizcom.ru

15 Gusev A.N. Analysis of Variance in Experimental Psychology. M.: Educational-Methodological Collector "Psychology", 2000. 136 p.

17 www.econometrics.exponenta.ru

18 www.optimizer.by.ru