Analysis of Variance

One-way analysis of variance. Multiple comparison: the Tukey-Kramer procedure

Suppose that on an automatic line, several machines perform the same operation in parallel. For proper planning of subsequent processing, it is important to know how uniform the average dimensions of parts obtained on parallel machines are. There is only one factor that affects the size of the parts, and this is the machines on which they are made. It is necessary to find out how significant the influence of this factor is on the dimensions of the parts. Assume that the sets of sizes of parts produced on each machine have a normal distribution and equal variances.

We have m machines and therefore m populations, or levels, at which n1, n2, ..., nm observations are made. For simplicity of reasoning, let us assume that n1 = n2 = ... = nm = n. The dimensions of the parts that make up the n observations at the i-th level are denoted x_i1, x_i2, ..., x_in. Then all observations can be arranged in a table called the matrix of observations (Table 3.1).

Table 3.1

Levels | Observation results
       | 1      2      ...  j      ...  n
1      | x_11   x_12   ...  x_1j   ...  x_1n
2      | x_21   x_22   ...  x_2j   ...  x_2n
3      | x_31   x_32   ...  x_3j   ...  x_3n
...    |
i      | x_i1   x_i2   ...  x_ij   ...  x_in
...    |
m      | x_m1   x_m2   ...  x_mj   ...  x_mn

We will assume that the n observations at the i-th level have a mean β_i equal to the sum of the overall mean µ and its variation due to the i-th level of the factor, i.e. β_i = µ + γ_i. Then a single observation can be represented in the following form:

    x_ij = µ + γ_i + ε_ij = β_i + ε_ij,   (3.1)

where µ is the overall mean, γ_i is the effect due to the i-th level of the factor, and ε_ij is the variation of the results within a particular level.

The term ε_ij characterizes the influence of all factors not taken into account by model (3.1). In line with the general problem of analysis of variance, we must evaluate how significant the influence of the factor γ is on the dimensions of the parts. The total variation of the variable x_ij can be decomposed into two parts, one characterizing the influence of the factor γ, the other the influence of unaccounted factors. To do this we need an estimate of the overall mean µ and estimates of the level means β_i. The obvious estimate of β_i is the arithmetic mean of the n observations at the i-th level:

    x̄_i* = (1/n) Σ_{j=1..n} x_ij

The asterisk in the index of x̄ indicates that the observations are averaged with the level i fixed. The arithmetic mean of the entire set of observations is an estimate of the overall mean µ:

    x̄ = (1/(mn)) Σ_{i=1..m} Σ_{j=1..n} x_ij

Consider the sum of squared deviations of x_ij from x̄. Writing x_ij − x̄ = (x̄_i* − x̄) + (x_ij − x̄_i*) and squaring, we obtain

    Σ_i Σ_j (x_ij − x̄)² = n Σ_i (x̄_i* − x̄)² + 2 Σ_i (x̄_i* − x̄) Σ_j (x_ij − x̄_i*) + Σ_i Σ_j (x_ij − x̄_i*)²   (3.2)

But Σ_j (x_ij − x̄_i*) = 0, since this is the sum of the deviations of the observations of one group from the arithmetic mean of that same group. Hence the middle term of (3.2) vanishes, and

    SS = SS1 + SS2.

The term

    SS1 = n Σ_i (x̄_i* − x̄)²

is the sum of the squared differences between the level means and the mean of the entire set of observations. It is called the between-group sum of squared deviations and characterizes the discrepancy between levels. SS1 is also called the factor variation, i.e. the variation due to the studied factor.

The term

    SS2 = Σ_i Σ_j (x_ij − x̄_i*)²

is the sum of the squared differences between the individual observations and the mean of the i-th level. It is called the within-group sum of squared deviations and characterizes the scatter of the observations within each level. SS2 is also called the residual variation, i.e. the variation due to unaccounted factors.

The quantity

    SS = Σ_i Σ_j (x_ij − x̄)²

is called the total sum of squared deviations of the individual observations from the overall mean.
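The decomposition above can be checked numerically. Below is a minimal Python sketch for a balanced layout (m levels, n observations per level); the data values are made up purely for illustration:

```python
# One-way layout with m levels and n observations per level.
# Verifies the decomposition SS = SS1 (between groups) + SS2 (within groups).

data = [  # rows = levels (machines), columns = observations; illustrative numbers
    [10.1, 9.8, 10.3, 10.0],
    [10.6, 10.4, 10.7, 10.5],
    [9.9, 10.0, 9.7, 10.2],
]
m, n = len(data), len(data[0])

grand_mean = sum(sum(row) for row in data) / (m * n)
level_means = [sum(row) / n for row in data]

SS  = sum((x - grand_mean) ** 2 for row in data for x in row)                # total
SS1 = n * sum((mi - grand_mean) ** 2 for mi in level_means)                  # between
SS2 = sum((x - mi) ** 2 for row, mi in zip(data, level_means) for x in row)  # within

assert abs(SS - (SS1 + SS2)) < 1e-9  # the cross term vanishes
```

The final assertion is exactly identity (3.2): the cross term sums to zero, so the total scatter splits cleanly into the factor and residual parts.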

Knowing the sums of squares SS, SS1 and SS2, one can obtain unbiased estimates of the corresponding variances: total, between-group and within-group (Table 3.2).

If the influence of all levels of the factor γ is the same, then both s1² = SS1/(m − 1) and s2² = SS2/(m(n − 1)) are estimates of the total variance.

Then, to assess the significance of the influence of the factor γ, it suffices to test the null hypothesis H0: s1² = s2².

To do this, calculate the Fisher statistic F_B = s1²/s2² with k1 = m − 1 and k2 = m(n − 1) degrees of freedom. Then, from the table of the F-distribution (see the table of the Fisher criterion), the critical value F_cr is found for the significance level α.

Table 3.2

If F_B > F_cr, the null hypothesis is rejected and the influence of the factor γ is judged significant.

If F_B < F_cr, there is no reason to reject the null hypothesis, and the influence of the factor γ can be considered insignificant.



By comparing the between-group and residual variances, the magnitude of their ratio shows how strongly the influence of the factor manifests itself.

Example 3.1. There are four batches of workwear fabric. Five samples were selected from each batch and tested to determine the breaking load. The test results are given in Table 3.3.

Table 3.3

Batch number | Breaking load of the samples, t

It is required to find out whether the influence of different batches of raw materials on the magnitude of the breaking load is significant.

Solution.

In this case m = 4, n = 5. The arithmetic mean of each row is calculated by the formula

    x̄_i* = (1/n) Σ_{j=1..n} x_ij

We have: x̄_1* = (200 + 140 + 170 + 145 + 165)/5 = 164; x̄_2* = 170; x̄_3* = 202; x̄_4* = 164.

Find the arithmetic mean of the entire population:

    x̄ = (164 + 170 + 202 + 164)/4 = 175.

Let us calculate the quantities needed to build Table 3.4:

the between-group sum of squared deviations SS1, with k1 = m − 1 = 4 − 1 = 3 degrees of freedom:

    SS1 = 5·[(164 − 175)² + (170 − 175)² + (202 − 175)² + (164 − 175)²] = 5·996 = 4980;

the within-group sum of squared deviations SS2, with k2 = mn − m = 20 − 4 = 16 degrees of freedom:

    SS2 = Σ_i Σ_j (x_ij − x̄_i*)² = 7270;

the total sum of squares SS, with k = mn − 1 = 20 − 1 = 19 degrees of freedom:

    SS = SS1 + SS2 = 4980 + 7270 = 12250.

Using the values found, we estimate the variances by the formulas of Table 3.2 and compose Table 3.4 for the example under consideration.

Table 3.4

Let us carry out the statistical analysis using the Fisher criterion. Calculate F_B = s1²/s2² = (4980·1/3)/(7270·1/16) = 1660/454.4 = 3.65.

From the F-distribution table (see appendices), we find the value F_cr for k1 = 3 and k2 = 16 degrees of freedom at significance level α = 0.01: F_cr = 5.29.

The calculated value F_B is less than the table value, so the null hypothesis is not rejected: the difference between the fabric batches does not significantly affect the breaking load.
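The arithmetic of Example 3.1 can be reproduced in a few lines of Python. The four row means and the value SS2 = 7270 are taken from the text (only the first batch's raw data is reproduced there, so SS2 is used as given):

```python
# Example 3.1 recomputed from the quantities reported in the text.
m, n = 4, 5
row_means = [164, 170, 202, 164]
grand = sum(row_means) / m                            # 175, since all rows have n = 5
SS1 = n * sum((mi - grand) ** 2 for mi in row_means)  # between-batch sum of squares
SS2 = 7270                                            # within-batch sum, as given in the text
F_B = (SS1 / (m - 1)) / (SS2 / (m * (n - 1)))
print(SS1, round(F_B, 2))   # 4980.0 3.65, below F_cr = 5.29 at alpha = 0.01
```

Since 3.65 < 5.29, the code agrees with the conclusion above: the null hypothesis is not rejected.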

In the Data Analysis package, the One-Way ANOVA tool is used to test the hypothesis that the means of two or more samples drawn from the same population are equal. Let us consider how the package performs one-way analysis of variance.

Let's solve Example 3.1 using the One-Way ANOVA tool.

The use of statistics in this note is illustrated with a running example. Suppose you are a production manager at Perfect Parachute. Parachutes are made from synthetic fibers supplied by four different suppliers. One of the main characteristics of a parachute is its strength. You need to make sure that all the fibers supplied have the same strength. To answer this question, you must design an experiment in which the strength of parachutes woven from synthetic fibers from the different suppliers is measured. The information obtained from this experiment will determine which supplier provides the strongest parachutes.

Many applications involve experiments in which several groups, or levels, of one factor are considered. Some factors, such as ceramic firing temperature, may have numerical levels (e.g., 300°, 350°, 400° and 450°). Other factors, such as the location of goods in a supermarket, may have categorical levels (e.g., first supplier, second supplier, third supplier, fourth supplier). Single-factor experiments in which experimental units are randomly assigned to groups or factor levels are called completely randomized.

Using the F-test to evaluate differences between several mathematical expectations

If the numerical measurements of the factor in the groups are continuous and some additional conditions are met, analysis of variance (ANOVA, Analysis of Variance) is used to compare the mathematical expectations of several groups. Analysis of variance based on completely randomized designs is called one-way ANOVA. In a sense the term analysis of variance is misleading, because it compares differences between the group means, not between the variances. However, the comparison of mathematical expectations is carried out precisely through an analysis of data variation. In the ANOVA procedure, the total variation of the measurement results is divided into between-group and within-group parts (Fig. 1). The within-group variation is explained by experimental error, while the between-group variation is explained by the effects of the experimental conditions. The symbol c denotes the number of groups.

Fig. 1. Partition of the variation in a completely randomized experiment


Assume that the c groups are drawn from independent populations that have a normal distribution and the same variance. The null hypothesis is that the mathematical expectations of the populations are equal: H0: μ1 = μ2 = ... = μc. The alternative hypothesis states that not all mathematical expectations are the same: H1: not all μj are equal (j = 1, 2, ..., c).

Fig. 2 illustrates a true null hypothesis about the mathematical expectations of the five compared groups, provided that the populations are normally distributed with the same variance. The five populations associated with the different factor levels are identical: they are superimposed on one another, having the same mathematical expectation, variation and shape.

Fig. 2. Five populations have the same mathematical expectation: μ1 = μ2 = μ3 = μ4 = μ5

On the other hand, suppose that in fact the null hypothesis is false, and the fourth level has the largest mathematical expectation, the first level has a slightly lower mathematical expectation, and the remaining levels have the same and even smaller mathematical expectations (Fig. 3). Note that, with the exception of the magnitude of the mean, all five populations are identical (i.e., have the same variability and shape).

Fig. 3. The effect of the experimental conditions is observed: μ4 > μ1 > μ2 = μ3 = μ5

When testing the hypothesis of equality of the mathematical expectations of several populations, the total variation is divided into two parts: between-group variation, due to differences between groups, and within-group variation, due to differences between elements belonging to the same group. The total variation is expressed by the total sum of squares (SST, sum of squares total). Since under the null hypothesis the mathematical expectations of all c groups are equal, the total variation equals the sum of squared differences between the individual observations and the grand mean (the mean of means) calculated over all samples. Total variation:

    SST = Σ_{j=1..c} Σ_{i=1..n_j} (X_ij − X̄)²

where X̄ is the grand mean, X_ij is the i-th observation in the j-th group (level), n_j is the number of observations in the j-th group, n is the total number of observations in all groups (i.e. n = n1 + n2 + ... + nc), and c is the number of groups (levels) studied.

The between-group variation, usually called the sum of squares among groups (SSA), equals the sum of squared differences between the sample mean of each group X̄_j and the grand mean X̄, each weighted by the size of the corresponding group n_j:

    SSA = Σ_{j=1..c} n_j (X̄_j − X̄)²

where c is the number of groups (levels) studied, n_j is the number of observations in the j-th group, X̄_j is the mean of the j-th group, and X̄ is the grand mean.

The within-group variation, usually called the sum of squares within groups (SSW), equals the sum of squared differences between the elements of each group and the sample mean of that group X̄_j:

    SSW = Σ_{j=1..c} Σ_{i=1..n_j} (X_ij − X̄_j)²

where X_ij is the i-th element of the j-th group and X̄_j is the mean of the j-th group.

Since c factor levels are compared, the between-group sum of squares has c − 1 degrees of freedom. Each of the c levels contributes n_j − 1 degrees of freedom, so the within-group sum of squares has n − c degrees of freedom, and

    (c − 1) + (n − c) = n − 1.

The total sum of squares has n − 1 degrees of freedom, since each observation X_ij is compared with the grand mean calculated over all n observations. Dividing each of these sums by its number of degrees of freedom yields three kinds of variance: between-group (mean square among, MSA), within-group (mean square within, MSW) and total (mean square total, MST):

    MSA = SSA/(c − 1),   MSW = SSW/(n − c),   MST = SST/(n − 1).
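The formulas for SSA, SSW, MSA and MSW work for groups of unequal size as well. A minimal Python sketch (the function name and the sample data are made up for illustration, not part of any library):

```python
# One-way ANOVA sums of squares, mean squares and F statistic for groups
# of possibly unequal size, following the SST = SSA + SSW decomposition.

def one_way_anova(groups):
    n = sum(len(g) for g in groups)                     # total observations
    c = len(groups)                                     # number of groups
    grand = sum(x for g in groups for x in g) / n       # grand mean
    means = [sum(g) / len(g) for g in groups]           # group means
    SSA = sum(len(g) * (mj - grand) ** 2 for g, mj in zip(groups, means))
    SSW = sum((x - mj) ** 2 for g, mj in zip(groups, means) for x in g)
    MSA = SSA / (c - 1)
    MSW = SSW / (n - c)
    return SSA, SSW, MSA, MSW, MSA / MSW

SSA, SSW, MSA, MSW, F = one_way_anova([[1, 2, 3], [2, 4, 6, 8], [5, 5]])
```

For these toy data SSA = 18 and SSW = 22, and their sum equals the directly computed SST of 40, confirming the degrees-of-freedom bookkeeping above.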

Although the main purpose of analysis of variance is to compare the mathematical expectations of c groups in order to reveal the effect of the experimental conditions, its name is due to the fact that the main tool is the analysis of variances of different types. If the null hypothesis is true and there are no significant differences between the mathematical expectations of the c groups, all three variances (MSA, MSW and MST) are estimates of the variance σ² inherent in the data. Thus, to test the null hypothesis H0: μ1 = μ2 = ... = μc against the alternative H1: not all μj are equal (j = 1, 2, ..., c), one calculates the F statistic, which is the ratio of the two variances MSA and MSW:

    F = MSA / MSW.

The F statistic follows an F-distribution with c − 1 degrees of freedom in the numerator (MSA) and n − c degrees of freedom in the denominator (MSW). For a given significance level α, the null hypothesis is rejected if the computed F statistic exceeds F_U, the upper critical value of the F-distribution with c − 1 and n − c degrees of freedom. Thus, as shown in Fig. 4, the decision rule is: reject H0 if F > F_U; otherwise do not reject it.

Fig. 4. Critical region of the analysis of variance when testing the hypothesis H0

If the null hypothesis H0 is true, the computed F statistic is close to 1, since its numerator and denominator estimate the same quantity, the variance σ² inherent in the data. If H0 is false (and there is a significant difference between the mathematical expectations of the groups), the computed F statistic will be much greater than one, because its numerator MSA estimates, in addition to the natural variability of the data, the effect of the experimental conditions (the differences between groups), while the denominator MSW estimates only the natural variability. Thus, the ANOVA procedure is an F test in which, at a given significance level α, the null hypothesis is rejected if the computed F statistic is greater than the upper critical value F_U of the F-distribution with c − 1 degrees of freedom in the numerator and n − c degrees of freedom in the denominator, as shown in Fig. 4.

To illustrate one-way analysis of variance, let us return to the scenario outlined at the beginning of the note. The purpose of the experiment is to determine whether parachutes woven from synthetic fibers obtained from different suppliers have the same strength. Five parachutes are woven for each group. The groups are labeled by supplier: Supplier 1, Supplier 2, Supplier 3 and Supplier 4. The strength of the parachutes is measured with a special device that pulls the fabric apart from both sides; the force required to tear a parachute is read on a special scale. The higher the breaking force, the stronger the parachute. Excel computes the F statistics in one step: go to Data → Data Analysis, select One-Way ANOVA, and fill in the window that opens (Fig. 5). The results of the experiment (breaking strength), some descriptive statistics, and the results of the one-way analysis of variance are shown in Fig. 6.

Fig. 5. The One-Way ANOVA window of the Excel Analysis ToolPak

Fig. 6. Strength of parachutes woven from synthetic fibers obtained from different suppliers: descriptive statistics and results of one-way analysis of variance

An analysis of Fig. 6 shows that there is some difference between the sample means. The average strength of the fibers obtained from the first supplier is 19.52, from the second 24.26, from the third 22.84 and from the fourth 21.16. Is this difference statistically significant? The distribution of the breaking force is shown in the scatter plot (Fig. 7). It clearly shows the differences both between groups and within them. If each group were larger, the data could also be analyzed with a stem-and-leaf plot, a box plot, or a normal probability plot.

Fig. 7. Scatter plot of the strength of parachutes woven from synthetic fibers obtained from four suppliers

The null hypothesis states that there are no significant differences between the mean strength values: H0: μ1 = μ2 = μ3 = μ4. The alternative hypothesis is that there is at least one supplier whose average fiber strength differs from the others: H1: not all μj are equal (j = 1, 2, ..., c).

Grand mean (see Fig. 6) = AVERAGE(D12:D15) = 21.945; it can also be obtained by averaging all 20 original numbers: =AVERAGE(A3:D7). The sums of squares are computed by the Analysis package and reported in the ANOVA table (see Fig. 6): SSA = 63.286, SSW = 97.504, SST = 160.790 (see the SS column of the ANOVA table in Fig. 6). The mean squares are obtained by dividing these sums of squares by the appropriate numbers of degrees of freedom. Since c = 4 and n = 20, the degrees of freedom are: for SSA, c − 1 = 3; for SSW, n − c = 16; for SST, n − 1 = 19 (see the df column). Thus: MSA = SSA/(c − 1) = 21.095; MSW = SSW/(n − c) = 6.094; MST = SST/(n − 1) = 8.463 (see the MS column). F statistic = MSA/MSW = 3.462 (see the F column).
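The mean squares and F statistic above follow directly from the sums of squares in the ANOVA table. A quick Python check (SSA, SSW and SST are taken from Fig. 6):

```python
# Mean squares and F statistic recomputed from the sums of squares in Fig. 6.
SSA, SSW, SST = 63.286, 97.504, 160.790
c, n = 4, 20
MSA = SSA / (c - 1)   # between-group mean square
MSW = SSW / (n - c)   # within-group mean square
MST = SST / (n - 1)   # total mean square
F = MSA / MSW
print(round(MSA, 3), round(MSW, 3), round(MST, 3), round(F, 3))
# 21.095 6.094 8.463 3.462
```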

The upper critical value F_U of the F-distribution is determined by the formula =F.INV(0.95; 3; 16) = 3.239 (F.ОБР in the Russian version of Excel). The parameters of =F.INV() correspond to α = 0.05 with three degrees of freedom in the numerator and 16 in the denominator. Since the computed F statistic, 3.462, exceeds the upper critical value F_U = 3.239, the null hypothesis is rejected (Fig. 8).

Fig. 8. Critical region of the analysis of variance at a significance level of 0.05 with three degrees of freedom in the numerator and 16 in the denominator

The p-value, i.e. the probability of obtaining an F statistic of at least 3.46 when the null hypothesis is true, equals 0.041, or 4.1% (see the p-value column of the ANOVA table in Fig. 6). Since this value does not exceed the significance level α = 5%, the null hypothesis is rejected. The p-value indicates that the probability of finding such a difference, or a larger one, between the mathematical expectations of the populations when they are in fact identical is 4.1%.

To summarize: there is a difference between the four sample means. The null hypothesis was that all the mathematical expectations of the four populations are equal. Under these conditions, a measure of the total variability (the total variation SST) of the strength of all parachutes is calculated by summing the squared differences between each observation X_ij and the grand mean X̄. The total variation was then divided into two components (see Fig. 1): the between-group variation SSA and the within-group variation SSW.

What explains the variability in the data? In other words, why are the observations not all the same? One reason is that different firms supply fibers of different strengths. This partly explains why the groups have different expected values: the stronger the effect of the experimental conditions, the greater the difference between the group means. Another reason for data variability is the natural variability of any process, in this case the production of parachutes. Even if all fibers were purchased from the same supplier, their strength would not be identical, other conditions being equal. Since this effect appears within each group, it is called within-group variation.

The differences between the sample means make up the between-group variation SSA. Part of this variation, as already mentioned, is explained by the fact that the data belong to different groups. However, even if the groups were actually identical (i.e., the null hypothesis were true), some between-group variation would still exist. The reason lies in the natural variability of the parachute manufacturing process: since the samples differ, their sample means differ from one another. Therefore, if the null hypothesis is true, both the between-group and the within-group variability estimate the population variability. If the null hypothesis is false, the between-group variability will be larger. It is this fact that underlies the F test for comparing the differences between the mathematical expectations of several groups.

After performing one-way ANOVA and finding significant differences between the firms, it remains unknown which supplier differs significantly from the others. We only know that the mathematical expectations of the populations are not all equal; in other words, at least one of them differs significantly from the rest. To determine which supplier differs from the others, one can use the Tukey procedure, based on pairwise comparisons between suppliers. This procedure was developed by John Tukey; later he and C. Y. Kramer independently modified it for situations in which the sample sizes differ.

Multiple Comparison: Tukey-Kramer procedure

In our scenario, one-way analysis of variance was used to compare the strength of the parachutes. Having found significant differences between the mathematical expectations of the four groups, we must determine which groups differ from each other. Although there are several ways to solve this problem, we will describe only the Tukey-Kramer multiple comparison procedure. This method is an example of a post hoc comparison procedure, since the hypothesis being tested is formulated after the data analysis. The Tukey-Kramer procedure compares all pairs of groups simultaneously. At the first stage, the differences X̄_j − X̄_j′, where j ≠ j′, between the means of the c(c − 1)/2 pairs of groups are calculated. The critical range of the Tukey-Kramer procedure is calculated by the formula:

    Critical range = Q_U · √( (MSW/2) · (1/n_j + 1/n_j′) )

where Q_U is the upper critical value of the studentized range distribution with c degrees of freedom in the numerator and n − c degrees of freedom in the denominator.

If the sample sizes differ, the critical range is calculated for each pair of mathematical expectations separately. At the last stage, each of the c(c − 1)/2 pairs of means is compared with its critical range. The elements of a pair are considered significantly different if the absolute difference |X̄_j − X̄_j′| between them exceeds the critical range.

Let us apply the Tukey-Kramer procedure to the parachute strength problem. Since the company has four suppliers, 4(4 − 1)/2 = 6 pairs of suppliers should be tested (Fig. 9).

Fig. 9. Pairwise comparisons of sample means

Since all groups have the same size (all n_j are equal), it is sufficient to calculate a single critical range. From the ANOVA table (Fig. 6) we take MSW = 6.094. Then we find the value Q_U for α = 0.05, c = 4 (the number of degrees of freedom in the numerator) and n − c = 20 − 4 = 16 (the number of degrees of freedom in the denominator). Unfortunately, I did not find a corresponding function in Excel, so I used the table (Fig. 10).

Fig. 10. Critical value of the studentized range Q_U

We get:

    Critical range = 4.05 · √( (6.094/2) · (1/5 + 1/5) ) = 4.05 · √1.219 = 4.47.

Since only 4.74 > 4.47 (see the bottom table in Fig. 9), a statistically significant difference exists only between the first and second suppliers. For all other pairs the difference between the sample means does not exceed the critical range, so we cannot speak of a difference. Consequently, the average strength of parachutes woven from fibers purchased from the first supplier is significantly less than that of the second.
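The whole Tukey-Kramer computation fits in a few lines of Python. The group means and MSW come from Fig. 6; Q_U = 4.05 is the value read from the studentized-range table (α = 0.05, c = 4, n − c = 16), an input here rather than something the code derives:

```python
from math import sqrt

# Tukey-Kramer pairwise comparison for the four suppliers.
means = {1: 19.52, 2: 24.26, 3: 22.84, 4: 21.16}   # from Fig. 6
MSW, n_j, Q_U = 6.094, 5, 4.05                     # Q_U taken from the table (Fig. 10)

# Equal group sizes, so a single critical range serves all pairs.
critical_range = Q_U * sqrt(MSW / 2 * (1 / n_j + 1 / n_j))   # about 4.47

significant = [(a, b) for a in means for b in means
               if a < b and abs(means[a] - means[b]) > critical_range]
print(round(critical_range, 2), significant)   # 4.47 [(1, 2)]
```

Only the (1, 2) pair exceeds the critical range, matching the conclusion above.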

Necessary conditions for one-way analysis of variance

When solving the parachute strength problem, we did not check whether the conditions for using the one-way F test are met. How do you know whether the one-way F test can be applied to a specific set of experimental data? The one-way F test can be applied only if three basic assumptions hold: the experimental data must be random and independent, have a normal distribution, and have equal variances.

The first assumption, randomness and independence of the data, must always hold, since the validity of any experiment depends on random selection and/or the randomization process. To avoid distorting the results, the data must be drawn from the c populations randomly and independently of one another. Similarly, experimental units should be randomly assigned to the c levels of the factor of interest (the experimental groups). Violation of these conditions can seriously distort the results of the analysis of variance.

The second assumption, normality, means that the data are drawn from normally distributed populations. Like the t test, one-way analysis of variance based on the F test is relatively insensitive to violations of this condition. If the distribution is not too far from normal, the significance level of the F test changes little, especially when the sample size is large. If the normality condition is seriously violated, a nonparametric procedure such as the Kruskal-Wallis rank test should be applied instead.

The third assumption, homogeneity of variance, means that the variances of the populations are equal to each other (i.e. σ1² = σ2² = ... = σc²). This assumption determines whether within-group variances should be kept separate or pooled. If the group sizes are equal, violation of the homogeneity condition has little effect on the conclusions drawn from the F test. However, if the sample sizes differ, violation of the equality of variances can seriously distort the results of the analysis of variance. Thus, one should strive to keep the sample sizes equal. One method for checking the homogeneity-of-variance assumption is Levene's test, described below.

If, of the three conditions, only the homogeneity-of-variance condition is violated, a procedure analogous to the t test with separate variances can be applied. However, if the assumptions of normality and homogeneity of variance are violated simultaneously, the data must be normalized to reduce the differences between the variances, or a nonparametric procedure must be applied.

Levene's test for homogeneity of variance

Although the F test is relatively robust to violations of the condition of equal variances across groups, a gross violation of this assumption significantly affects the significance level and power of the test. Perhaps one of the most powerful tests of this assumption is Levene's test. To check the equality of the variances of the c populations, we test the following hypotheses:

H0: σ1² = σ2² = ... = σc²

H1: not all σj² are equal (j = 1, 2, ..., c)

The modified Levene test is based on the following: if the variability in the groups is the same, the null hypothesis of equality of variances can be tested by applying analysis of variance to the absolute values of the differences between the observations and the group medians. So, first compute the absolute differences between the observations and the medians within each group, then perform a one-way analysis of variance on these absolute differences. To illustrate Levene's test, let us return to the scenario outlined at the beginning of the note. Using the data presented in Fig. 6, we carry out a similar analysis, but on the absolute differences between the original data and the median of each sample (Fig. 11).
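The first step of the modified Levene test, the transformation of each observation into its absolute deviation from the group median, can be sketched in Python (the function name and the toy data are illustrative):

```python
from statistics import median

# Modified Levene test, step one: replace each observation by the absolute
# deviation from its group median. The transformed groups are then fed
# into an ordinary one-way ANOVA.
def levene_transform(groups):
    return [[abs(x - median(g)) for x in g] for g in groups]

print(levene_transform([[1, 2, 4], [3, 5, 9]]))   # [[1, 0, 2], [2, 0, 4]]
```

A significant F statistic on the transformed data is evidence against H0, i.e. against equal variances.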

Analysis of variance is a set of statistical methods designed to test hypotheses about the relationship between certain features and the factors under study, which have no quantitative description, as well as to establish the degree of influence of the factors and their interaction. In the specialized literature it is often called ANOVA (from the English Analysis of Variance). The method was first developed by R. Fisher in 1925.

Types and criteria for analysis of variance

This method is used to investigate the relationship between qualitative (nominal) features and a quantitative (continuous) variable. In essence, it tests the hypothesis of equality of the arithmetic means of several samples; it can thus be considered a parametric criterion for comparing the centers of several samples at once. If the method is applied to two samples, the results of the analysis of variance are identical to the results of Student's t-test. However, unlike other criteria, this analysis allows the problem to be studied in more detail.

Analysis of variance in statistics is based on the law that the sum of squared deviations of the combined sample equals the sum of squares of the within-group deviations plus the sum of squares of the between-group deviations. Fisher's test is then used to establish the significance of the difference between the between-group and within-group variances. The necessary prerequisites for this are normality of the distributions and homoscedasticity (equality of variances) of the samples. One distinguishes one-way (single-factor) analysis of variance from multi-way (multi-factor) analysis. The first considers the dependence of the value under study on one attribute, the second on several at once; the latter also makes it possible to reveal interactions between them.

Factors

Factors are controlled circumstances that affect the final result. A factor's level, or method of processing, is the value characterizing a specific manifestation of this condition. These values are usually given on a nominal or ordinal measurement scale. Output values are often measured on quantitative or ordinal scales. Then there arises the problem of grouping the output data into series of observations corresponding to approximately equal numerical values. If the number of groups is too large, the number of observations in them may be insufficient to obtain reliable results. If the number is too small, essential features of the influence on the system may be lost. The specific way of grouping the data depends on the volume and nature of the variation in the values. The number and size of the intervals in univariate analysis are most often determined by the principle of equal intervals or by the principle of equal frequencies.
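The two grouping principles mentioned above can be sketched in Python. Both function names and the sample data are made up for illustration:

```python
# Two common ways to group a continuous output into k intervals:
# equal-width intervals vs. equal-frequency intervals.

def equal_width_edges(values, k):
    """Boundaries of k intervals of equal width over the data range."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / k
    return [lo + i * step for i in range(k + 1)]

def equal_frequency_groups(values, k):
    """Split the sorted values into k groups of (nearly) equal size."""
    s = sorted(values)
    q, r = divmod(len(s), k)
    groups, start = [], 0
    for i in range(k):
        size = q + (1 if i < r else 0)   # spread the remainder over the first groups
        groups.append(s[start:start + size])
        start += size
    return groups

edges = equal_width_edges([0, 1, 3, 7, 10], 2)          # [0.0, 5.0, 10.0]
groups = equal_frequency_groups([5, 1, 9, 3, 7, 2], 3)  # [[1, 2], [3, 5], [7, 9]]
```

Equal widths keep the interval boundaries interpretable; equal frequencies keep the observation counts per group balanced, which matters for the reliability concern discussed above.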

Tasks of dispersion analysis

So, there are cases when two or more samples must be compared. It is then advisable to use analysis of variance. The name of the method indicates that conclusions are drawn from a study of the components of the variance: the overall variation of the indicator is divided into components corresponding to the action of each individual factor. Consider several problems that a typical analysis of variance solves.

Example 1

The workshop has a number of machine tools - automatic machines that produce a specific part. The size of each part is a random value, which depends on the settings of each machine and random deviations that occur during the manufacturing process of the parts. It is necessary to determine from the measurements of the dimensions of the parts whether the machines are set up in the same way.

Example 2

In the manufacture of an electrical apparatus, various types of insulating paper are used: capacitor, electrical, etc. The apparatus can be impregnated with various substances: epoxy resin, varnish, ML-2 resin, etc. Voids can be eliminated under vacuum or at elevated pressure, with heating. Impregnation can be done by immersion in varnish, under a continuous stream of varnish, etc. The electrical apparatus as a whole is encapsulated with a certain compound, of which there are several variants. Quality indicators include the dielectric strength of the insulation, the overheating temperature of the winding in operating mode, and a number of others. When developing the technological process for manufacturing the devices, it is necessary to determine how each of the listed factors affects the performance of the device.

Example 3

The trolleybus depot serves several routes, operates trolleybuses of various types, and employs 125 conductors who collect fares. The depot management is interested in the following questions: how can the economic performance (revenue) of each conductor be compared, given the different routes and different types of trolleybuses? How can the economic feasibility of running trolleybuses of a certain type on a particular route be determined? How can reasonable requirements be set for the revenue a conductor brings in on each route with each type of trolleybus?

The task is to choose a method that yields the maximum information about the impact of each factor on the final result, determines the numerical characteristics of that impact and their reliability, and does so at minimum cost and in the shortest time. Methods of dispersion analysis make it possible to solve such problems.

Univariate analysis

Univariate (one-way) analysis aims to assess the magnitude of the impact of a particular factor on the response being analyzed. Another task of univariate analysis is to compare two or more factor levels with each other in order to determine the difference in their influence on the response. If the null hypothesis is rejected, the next step is to quantify the effect and build confidence intervals for the obtained characteristics. When the null hypothesis cannot be rejected, it is usually accepted and a conclusion is drawn about the nature of the influence.

The Kruskal-Wallis rank test can serve as a non-parametric analogue of one-way analysis of variance. It was developed by the American mathematician William Kruskal and economist W. Allen Wallis in 1952. The test checks the null hypothesis that the effects of the influence on the studied samples are equal, that is, that all samples come from the same distribution, without assuming normality. The number of samples must be more than two.
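A minimal sketch of the test in Python, assuming SciPy is available (the sample values are invented for illustration):

```python
from scipy.stats import kruskal

# Hypothetical part sizes from three parallel machines (illustrative values).
machine_a = [10.1, 10.3, 9.9, 10.2, 10.0]
machine_b = [10.4, 10.6, 10.5, 10.7, 10.3]
machine_c = [9.8, 9.7, 10.0, 9.9, 9.6]

# H0: all three samples come from the same distribution.
h_stat, p_value = kruskal(machine_a, machine_b, machine_c)
print(f"H = {h_stat:.3f}, p = {p_value:.4f}")
```

A small p-value would suggest rejecting the null hypothesis of equal machine effects.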

The Jonckheere (Jonckheere-Terpstra) test was proposed independently by the Dutch mathematician T. J. Terpstra in 1952 and the British psychologist A. R. Jonckheere in 1954. It is used when it is known in advance that the available groups of results are ordered by increasing influence of the factor under study, which is measured on an ordinal scale.

Bartlett's M test, proposed by the British statistician Maurice Stevenson Bartlett in 1937, is used to test the null hypothesis of equality of the variances of several normal populations from which the studied samples are drawn; in the general case the samples have different sizes (each sample must contain at least four observations).
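A hedged sketch using SciPy (illustrative samples of different sizes, each with at least four values):

```python
from scipy.stats import bartlett

# Invented samples of unequal size, assumed drawn from normal populations.
sample1 = [8.9, 9.1, 9.0, 9.2, 8.8, 9.1]
sample2 = [9.5, 8.4, 9.9, 8.1, 10.2]
sample3 = [9.0, 9.3, 8.7, 9.1]

# H0: the variances of the underlying normal populations are equal.
stat, p = bartlett(sample1, sample2, sample3)
print(f"M = {stat:.3f}, p = {p:.4f}")
```

A small p-value would indicate unequal variances, violating a core assumption of classical ANOVA.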

Cochran's G test, proposed by the American statistician William Gemmell Cochran in 1941, is used to test the null hypothesis of equality of the variances of normal populations for independent samples of equal size.

The nonparametric Levene test, proposed by the American mathematician Howard Levene in 1960, is an alternative to Bartlett's test when there is no certainty that the samples under study follow a normal distribution.

In 1974, the American statisticians Morton B. Brown and Alan B. Forsythe proposed a test (the Brown-Forsythe test) that differs somewhat from Levene's test.
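Both tests are exposed through SciPy's `levene` function via its `center` parameter; a minimal sketch with invented data:

```python
from scipy.stats import levene

# Two illustrative samples with visibly different spread (values invented).
g1 = [12.0, 11.5, 12.2, 11.8, 11.9, 12.1]
g2 = [14.1, 10.2, 15.3, 9.8, 13.0, 8.9]

# Levene's original test centers each group at its mean ...
stat_mean, p_mean = levene(g1, g2, center='mean')
# ... while the Brown-Forsythe variant centers at the median,
# which is more robust when normality is doubtful.
stat_median, p_median = levene(g1, g2, center='median')
print(f"Levene p = {p_mean:.4f}, Brown-Forsythe p = {p_median:.4f}")
```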

Two-way analysis

Two-way analysis of variance is used for linked, normally distributed samples. In practice, complex tables are often used with this method, in particular tables in which each cell contains a set of data (repeated measurements) corresponding to fixed level values. If the assumptions necessary for two-way analysis of variance are not met, the non-parametric rank test of Friedman (Friedman, Kendall and Smith), developed by the American economist Milton Friedman in the late 1930s, is used. This test does not depend on the type of distribution.

It is only assumed that the distribution of the quantities is the same and continuous, and that the quantities themselves are independent of one another. When testing the null hypothesis, the data are given in the form of a rectangular matrix in which the rows correspond to the levels of factor B and the columns to the levels of factor A. Each cell of the table (block) can be the result of measurements of parameters on one object, or on a group of objects with constant values of the levels of both factors; in the latter case, the corresponding data are presented as the average values of the parameter over all measurements or objects of the sample under study. To apply the test, it is necessary to move from the direct measurement results to their ranks. The ranking is carried out for each row separately, that is, the values are ordered within each fixed row.
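A hedged sketch of the Friedman test in SciPy, with each column of the matrix passed as a separate sample (values invented; the function ranks within each block internally):

```python
from scipy.stats import friedmanchisquare

# Rows of the rectangular matrix = blocks (objects); each list below is one
# treatment level (column), measured on the same ten objects. Values invented.
level_a = [4, 6, 3, 4, 3, 2, 2, 7, 6, 5]
level_b = [5, 6, 8, 7, 7, 8, 4, 6, 4, 5]
level_c = [2, 2, 5, 3, 2, 2, 1, 4, 3, 2]

# The test ranks values within each block (row) and compares rank sums.
chi2, p = friedmanchisquare(level_a, level_b, level_c)
print(f"chi2 = {chi2:.3f}, p = {p:.4f}")
```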

The Page test (L-test), proposed by the American statistician E. B. Page in 1963, is designed to test the null hypothesis against ordered alternatives. For large samples a normal approximation is used: under the corresponding null hypothesis, the test statistic follows the standard normal distribution. When rows of the source table contain tied values, average ranks must be used; the accuracy of the conclusions deteriorates as the number of such ties grows.

Cochran's Q test, proposed by W. G. Cochran in 1950, is used in cases where groups of homogeneous subjects are exposed to more than two influences, each with two possible responses: conditionally negative (0) and conditionally positive (1). The null hypothesis is the equality of the influence effects. Two-way analysis of variance makes it possible to establish the existence of treatment effects, but not to determine for which columns the effect exists. To solve this problem, Scheffé's method of multiple comparisons for linked samples is used.

Multivariate analysis

The problem of multivariate analysis of variance arises when it is necessary to determine the influence of two or more conditions on a certain random variable. The study presupposes one dependent random variable, measured on an interval or ratio scale, and several independent variables, each expressed on a nominal or rank scale. Analysis of variance is a well-developed branch of mathematical statistics with many variants. The concept of the study is common to both the univariate and the multivariate versions: the total variance is divided into components corresponding to a certain grouping of the data, and each grouping of the data has its own model. Here we consider only the main provisions necessary for understanding and practical use of the most widely used variants.

Factor analysis of variance requires careful attention to the collection and presentation of the input data, and especially to the interpretation of the results. Unlike the one-factor case, whose results can be conditionally arranged in a certain sequence, two-factor results require a more complex presentation. The situation becomes even more difficult when there are three, four, or more conditions; because of this, a model rarely includes more than three or four of them. Examples include the occurrence of resonance at certain values of capacitance and inductance in an electric circuit; the manifestation of a chemical reaction with a certain set of elements from which the system is built; and the occurrence of anomalous effects in complex systems under a certain coincidence of circumstances. The presence of interaction can radically change the model of the system and sometimes lead to a rethinking of the nature of the phenomena with which the experimenter is dealing.

Multivariate analysis of variance with repeated experiments

Measurement data can often be grouped not by two but by more factors. For example, if we consider the analysis of variance of the service life of tires for trolleybus wheels, taking into account the manufacturer and the route on which the tires are operated, then the season during which the tires are used (winter versus summer operation) can be singled out as a separate condition. As a result, we have a three-factor problem.

With more conditions, the approach is the same as in two-way analysis. In all cases one tries to simplify the model. Interaction between two factors does not appear very often, and triple interaction occurs only in exceptional cases; include only those interactions for which there is prior information and good reason to take them into account in the model. The process of isolating individual factors and accounting for them is relatively simple, so there is often a desire to single out more of them. One should not get carried away: the more conditions there are, the less reliable the model becomes and the greater the chance of error. A model that includes a large number of independent variables becomes quite difficult to interpret and inconvenient for practical use.

General idea of ​​analysis of variance

Analysis of variance in statistics is a method of analyzing the results of observations that depend on various concurrent circumstances and of assessing their influence. A controlled variable that corresponds to the method of influencing the object of study and takes on a certain value in a given period of time is called a factor. Factors can be qualitative or quantitative. The levels of quantitative factors take on certain values on a numerical scale; examples are temperature, pressing pressure, and the amount of a substance. Qualitative factors are different substances, different technological methods, devices, and fillers; their levels correspond to a nominal scale.

Qualitative factors also include the type of packaging material and the storage conditions of the dosage form. It is also rational to include the degree of grinding of raw materials and the fractional composition of granules, which have quantitative values but are difficult to control on a quantitative scale. The number of qualitative factors depends on the type of dosage form and on the physical and technological properties of the medicinal substances. For example, tablets can be obtained from crystalline substances by direct compression; in this case it is sufficient to select the gliding and lubricating agents.

Examples of quality factors for different types of dosage forms

  • Tinctures. Extractant composition, type of extractor, raw material preparation method, production method, filtration method.
  • Extracts (liquid, thick, dry). The composition of the extractant, the extraction method, the type of installation, the method of removing the extractant and ballast substances.
  • Tablets. Composition of excipients: fillers, disintegrants, binders, gliding agents, and lubricants. The method of obtaining tablets, the type of technological equipment. Type of shell and its components: film formers, pigments, dyes, plasticizers, solvents.
  • Injection solutions. Type of solvent, filtration method, nature of stabilizers and preservatives, sterilization conditions, method of filling ampoules.
  • Suppositories. The composition of the suppository base, the method of obtaining suppositories, fillers, packaging.
  • Ointments. Base composition, structural components, method of preparing the ointment, type of equipment, packaging.
  • Capsules. Type of shell material, method of obtaining capsules, type of plasticizer, preservative, dye.
  • Liniments. Production method, composition, type of equipment, type of emulsifier.
  • Suspensions. Type of solvent, type of stabilizer, dispersion method.

Examples of quality factors and their levels studied in the tablet manufacturing process

  • Disintegrant. Potato starch, white clay, a mixture of sodium bicarbonate with citric acid, basic magnesium carbonate.
  • Binding solution. Water, starch paste, sugar syrup, methylcellulose solution, hydroxypropyl methylcellulose solution, polyvinylpyrrolidone solution, polyvinyl alcohol solution.
  • Gliding agent. Aerosil, starch, talc.
  • Filler. Sugar, glucose, lactose, sodium chloride, calcium phosphate.
  • Lubricant. Stearic acid, polyethylene glycol, paraffin.

Models of dispersion analysis in the study of the level of competitiveness of the state

One of the most important criteria for assessing the condition of a state, used to judge its welfare and socio-economic development, is competitiveness: the set of properties inherent in the national economy that determine the state's ability to compete with other countries. Having determined the state's place and role in the world market, it is possible to establish a clear strategy for ensuring economic security on an international scale, since this is the key to positive relations between Russia and all players in the world market: investors, creditors, and governments.

To compare the competitiveness of states, countries are ranked using composite indices built from various weighted indicators. These indices are based on key factors that affect the economic and political situation, among others. The complex of models for studying the competitiveness of the state provides for the use of methods of multidimensional statistical analysis (in particular, analysis of variance, econometric modeling, and decision-making methods) and includes the following main stages:

  1. Formation of a system of indicators-indicators.
  2. Evaluation and forecasting of indicators of the competitiveness of the state.
  3. Comparison of indicators-indicators of competitiveness of states.

And now let's consider the content of the models of each of the stages of this complex.

At the first stage, using expert-assessment methods, a reasoned set of economic indicators for assessing the competitiveness of the state is formed, taking into account the specifics of its development, on the basis of international ratings and data from statistical agencies, reflecting the state of the system as a whole and of its processes. The choice of indicators is justified by the need to select those that, from a practical standpoint, most fully determine the level of the state, its investment attractiveness, and the possibility of relative localization of existing potential and actual threats.

The main indicators-indicators of international rating systems are indices:

  1. Global Competitiveness Index (GCI).
  2. Index of Economic Freedom (IEF).
  3. Human Development Index (HDI).
  4. Corruption Perceptions Index (CPI).
  5. Index of internal and external threats (IVZZ).
  6. Index of potential for international influence (IPIP).

The second stage provides for the assessment and forecasting of indicators of the state's competitiveness according to international ratings for the 139 states of the world under study.

The third stage provides for a comparison of the competitiveness conditions of states using the methods of correlation and regression analysis.

Using the results of the study, it is possible to determine the nature of the processes in general and for individual components of the competitiveness of the state; test the hypothesis about the influence of factors and their relationship at the appropriate level of significance.

The implementation of the proposed set of models will make it possible not only to assess the current level of competitiveness and investment attractiveness of states, but also to analyze management shortcomings, avoid wrong decisions, and prevent the development of a crisis in the state.

Analysis of variance

1. The concept of analysis of variance

Analysis of variance is the analysis of the variability of a trait under the influence of controlled variable factors. In foreign literature it is often referred to as ANOVA (Analysis of Variance).

The task of analysis of variance is to isolate, from the general variability of the trait, variability of three different kinds:

a) variability due to the action of each of the studied independent variables;

b) variability due to the interaction of the studied independent variables;

c) random variation due to all other unknown variables.

The variability due to the action of the studied variables and their interaction is compared with the random variability. The indicator of this ratio is Fisher's F statistic.

The formula for calculating the F criterion includes estimates of variances, i.e. of the distribution parameters of the trait; therefore the F criterion is a parametric one.

The more of the trait's variability is due to the studied variables (factors) or their interaction, the higher the empirical values of the criterion.
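As an illustrative sketch (group values invented), SciPy's `f_oneway` computes the empirical F and its p-value:

```python
from scipy.stats import f_oneway

# Three illustrative groups; the group means differ noticeably while the
# spread within each group is small, so F should be large.
group1 = [5.1, 4.9, 5.3, 5.0]
group2 = [5.8, 6.1, 5.9, 6.2]
group3 = [4.4, 4.6, 4.3, 4.5]

# F is the ratio of the between-group to the within-group mean square.
f_stat, p_value = f_oneway(group1, group2, group3)
print(f"F = {f_stat:.2f}, p = {p_value:.5f}")
```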

The null hypothesis in the analysis of variance states that the average values of the studied effective feature are the same in all gradations.

The alternative hypothesis states that the average values of the effective feature differ across the gradations of the studied factor.

Analysis of variance allows us to establish that a trait changes, but does not indicate the direction of these changes.

Let's start the analysis of variance with the simplest case, when we study the action of only one variable (single factor).

2. One-way analysis of variance for unrelated samples

2.1. Purpose of the method

The method of single-factor analysis of variance is used when changes in the effective attribute are studied under the influence of changing conditions or gradations of a factor. In this version of the method, each gradation of the factor acts on a different sample of subjects. There must be at least three gradations of the factor. (There may be two, but then nonlinear dependencies cannot be established, and it seems more reasonable to use simpler methods.)

A non-parametric variant of this type of analysis is the Kruskal-Wallis H test.

Hypotheses

H 0: Differences between factor grades (different conditions) are no more pronounced than random differences within each group.

H 1: Differences between factor gradations (different conditions) are more pronounced than random differences within each group.

2.2. Limitations of univariate analysis of variance for unrelated samples

1. Univariate analysis of variance requires at least three gradations of the factor and at least two subjects in each gradation.

2. The resultant trait must be normally distributed in the study sample.

True, it is usually not specified whether this refers to the distribution of the trait in the entire surveyed sample or in the part of it that makes up the dispersion complex.

2.3. An example of solving a problem by the method of single-factor analysis of variance for unrelated samples

Three different groups of six subjects received lists of ten words. Words were presented to the first group at a low rate of 1 word per 5 seconds, to the second group at an average rate of 1 word per 2 seconds, and to the third group at a high rate of 1 word per second. Reproduction performance was predicted to depend on the speed of word presentation. The results are presented in Table 1.

Table 1

Number of words reproduced

(The table's columns are: subject number, low speed, average speed, high speed, and the total; the original cell values were not preserved in this text.)

H0: Differences in word volume between groups are no more pronounced than random differences within each group.

H1: Differences in word volume between groups are more pronounced than random differences within each group. Using the experimental values presented in Table 1, we will compute some quantities needed to calculate the F criterion.
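Since the table values were lost in this copy, a hedged sketch with hypothetical recall counts shows how the same F test could be computed in Python (SciPy's `f_oneway`); the printed conclusion depends entirely on the invented data:

```python
from scipy.stats import f_oneway

# The original Table 1 values are not preserved here, so the recall counts
# below are hypothetical (6 subjects per group, 10 words presented).
low_speed    = [8, 7, 9, 5, 6, 8]   # 1 word per 5 s
medium_speed = [7, 6, 5, 7, 9, 6]   # 1 word per 2 s
high_speed   = [4, 5, 3, 6, 2, 4]   # 1 word per 1 s

f_stat, p = f_oneway(low_speed, medium_speed, high_speed)
alpha = 0.05
if p < alpha:
    print(f"F = {f_stat:.2f}, p = {p:.4f}: H0 rejected, speed matters")
else:
    print(f"F = {f_stat:.2f}, p = {p:.4f}: H0 retained")
```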

The calculation of the main quantities for one-way analysis of variance is presented in the table:

Table 2

Table 3

Sequence of operations in one-way ANOVA for unrelated samples

The designation SS, frequently used in this and subsequent tables, is an abbreviation of "sum of squares"; it is the form most often used in translated sources.

SS_fact is the variability of the trait due to the action of the studied factor;

SS_total is the overall variability of the trait;

SS_random is the variability due to unaccounted-for factors ("random" or "residual" variability);

MS ("mean square") is the average of the corresponding sum of squares, i.e. SS divided by its number of degrees of freedom;

df is the number of degrees of freedom, which we denoted by the Greek letter ν when considering nonparametric criteria.

Conclusion: H 0 is rejected. H 1 is accepted. Differences in the volume of word reproduction between groups are more pronounced than random differences within each group (α=0.05). So, the speed of presentation of words affects the volume of their reproduction.

An example of solving the problem in Excel is presented below:

Initial data:

Using the menu command Tools -> Data Analysis -> Anova: Single Factor (the Analysis ToolPak add-in), we obtain the following results:

Coursework in mathematics

Introduction

The concept of analysis of variance

One-way analysis of variance (Practical implementation in IBM SPSS Statistics 20)

One-way analysis of variance (Practical implementation in Microsoft Office 2013)

Conclusion

List of sources used

Introduction

Relevance of the topic. The development of mathematical statistics began with the work of the famous German mathematician Carl Friedrich Gauss in 1795 and continues to this day. Statistical analysis includes the parametric method of one-way analysis of variance. It is currently used in economics when conducting market research to compare results (for example, in surveys about the consumption of a product in different regions of the country, it is necessary to determine how much the survey data differ from one another), in psychology in various kinds of research, in compiling scientific comparison tests, in studying social groups, and in solving problems in statistics.

Objective: to become acquainted with the statistical method of one-way analysis of variance, as well as with its implementation on a PC in various programs, and to compare these programs.

To study the theory of one-way analysis of variance.

To study programs for solving problems of single-factor analysis.

To conduct a comparative analysis of these programs.

Work achievements. The practical part of the work was done entirely by the author: the selection of programs, the selection of tasks, and their solution on a PC, after which a comparative analysis was carried out. In the theoretical part, a classification of ANOVA groups was carried out. This work was presented as a report at the student scientific session "Selected Questions of Higher Mathematics and Methods of Teaching Mathematics".

Structure and scope of work. The work consists of an introduction, the main content, a conclusion, and a list of sources comprising 4 titles. The total volume of the work is 25 printed pages. The work contains 1 example solved with 2 programs.

The concept of analysis of variance

Often there is a need to investigate the influence of one or more independent variables (factors) on one or more dependent variables (effective features); such problems are solved by the methods of analysis of variance, developed by R. Fisher.

ANOVA (analysis of variance) is a set of statistical data processing methods that make it possible to analyze the variability of one or more effective features under the influence of controlled factors (independent variables). Here a factor is understood as a quantity that determines the properties of the object or system under study, i.e. the cause of the end result. When conducting analysis of variance it is important to choose the right source and object of influence, i.e. to identify the dependent and independent variables.

Depending on the signs of classification, several classification groups of analysis of variance are distinguished (Table 1).

By the number of factors taken into account:

  • Univariate analysis: the influence of one factor is studied.
  • Multivariate analysis: the simultaneous influence of two or more factors is studied.

By the presence of a connection between the samples of values:

  • Analysis of unrelated (different) samples: carried out when there are several groups of research objects placed in different conditions. (The null hypothesis H0 is tested: the average value of the dependent variable is the same under different measurement conditions, i.e. does not depend on the factor under study.)
  • Analysis of related (same) samples: carried out for two or more measurements taken on the same group of studied objects under different conditions. Here the influence of an unaccounted-for factor is possible, which can be erroneously attributed to a change in conditions.

By the number of dependent variables affected by the factors:

  • Univariate analysis (ANOVA, or ANCOVA for covariance analysis): one dependent variable is affected by the factors.
  • Multivariate analysis (MANOVA, multivariate analysis of variance, or MANCOVA, multivariate analysis of covariance): several dependent variables are affected by the factors.

By the purpose of the study:

  • Deterministic (fixed effects): the levels of all factors are fixed in advance and it is their influence that is tested (the hypothesis H0 of no differences between the level means is checked).
  • Random (random effects): the levels of each factor are obtained as a random sample from the general population of factor levels (the hypothesis H0 that the dispersion of the mean response values calculated for different factor levels is zero is tested).

One-way analysis of variance checks the statistical significance of differences between the sample means of two or more populations; for this, hypotheses are formulated in advance.

Null hypothesis H0: the average values of the effective feature are the same under all conditions (gradations) of the factor.

Alternative hypothesis H1: the average values of the effective feature are not all the same across the conditions of the factor.

ANOVA methods can be used for normally distributed populations (multivariate analogues of parametric tests) and for populations without a definite distribution (multivariate analogues of nonparametric tests). In the first case it must first be established that the distribution of the effective feature is normal. To check the normality of the distribution, one can use the skewness and kurtosis indicators

A = (1/n) Σ (x_i − x̄)³ / σ³,  E = (1/n) Σ (x_i − x̄)⁴ / σ⁴ − 3,

where x_i and x̄ are the values of the effective feature and its mean, σ is the standard deviation of the effective feature, and n is the number of observations. The representativeness errors of A and E are approximately

m_A ≈ √(6/n),  m_E ≈ 2√(6/n).

If the skewness and kurtosis indicators do not exceed their representativeness errors by more than a factor of 3, i.e. |A| < 3·m_A and |E| < 3·m_E, the distribution can be considered normal. For normal distributions, A and E are equal to zero.
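This normality check can be sketched in Python; the sample below is invented, and the error formulas m_A ≈ √(6/n) and m_E ≈ 2√(6/n) are rough classical approximations:

```python
import math
from scipy.stats import skew, kurtosis

# Illustrative sample; a real check would use the study's effective feature.
x = [4.1, 4.3, 3.9, 4.0, 4.2, 4.4, 3.8, 4.1, 4.0, 4.2]
n = len(x)

A = skew(x)        # skewness; zero for a normal distribution
E = kurtosis(x)    # excess kurtosis; zero for a normal distribution

# Rough classical approximations of the representativeness errors:
m_A = math.sqrt(6.0 / n)
m_E = 2.0 * math.sqrt(6.0 / n)

normal_enough = abs(A) < 3 * m_A and abs(E) < 3 * m_E
print(f"A = {A:.3f}, E = {E:.3f}, normal enough: {normal_enough}")
```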

Data relating to one condition of the factor (one gradation) is called a dispersion complex. When conducting analysis of variance, equality of variances between the complexes must hold, and the elements must be selected randomly.

In the second case, when the sample populations have arbitrary distributions, non-parametric (rank) analogues of one-way analysis of variance are used (the Kruskal-Wallis and Friedman criteria).

Consider a graphical illustration of the dependence of the rate of return on shares on the state of the country's economy (Fig. 1, a). Here the factor under study is the level of the state of the economy (more precisely, three levels of its state), and the effective feature is the rate of return. The distribution shows that this factor has a significant impact on profitability: as the economy improves, the return on stocks also grows, which does not contradict common sense.

Note that the chosen factor has gradations, i.e. its value changes during the transition from one gradation to another (from one state of the economy to another).

Fig. 1. The ratio of the influence of the factor to the intra-group spread: a) significant influence of the factor; b) insignificant influence of the factor

A set of factor gradations is only a special case; moreover, a factor can have gradations presented even on a nominal scale. Therefore one more often speaks not of the gradations of a factor, but of the various conditions of its action.

Let us now consider the idea of analysis of variance, which is based on the rule of adding variances: the total variance is equal to the sum of the intergroup variance and the average of the intragroup variances:

σ²_total = σ²_between + σ̄²_within,

where σ²_total is the total variance arising from the influence of all factors; σ²_between is the intergroup variance, caused by the influence of the grouping attribute (the factor); and σ̄²_within is the average intragroup variance, caused by the influence of all other factors.

The influence of the grouping trait is clearly visible in Fig. 1, a: since the influence of the factor is significant compared with the intragroup spread, the intergroup variance is greater than the intragroup one (σ²_between > σ̄²_within). In Fig. 1, b the opposite picture is observed: the intragroup spread prevails and the influence of the factor is practically absent.
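The addition rule can be checked numerically; the sketch below uses invented groups and plain Python (dividing each sum of squares by the total number of observations gives the corresponding variances):

```python
# Numerical check of the addition rule: SS_total = SS_between + SS_within.
groups = [[5.0, 6.0, 7.0], [8.0, 9.0, 10.0], [2.0, 3.0, 4.0]]
values = [v for g in groups for v in g]
grand_mean = sum(values) / len(values)

# Total sum of squared deviations from the grand mean.
ss_total = sum((v - grand_mean) ** 2 for v in values)
# Between-group: group sizes times squared deviations of group means.
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# Within-group: deviations of each value from its own group mean.
ss_within = sum(sum((v - sum(g) / len(g)) ** 2 for v in g) for g in groups)

print(ss_total, ss_between, ss_within)  # 60.0 54.0 6.0
assert abs(ss_total - (ss_between + ss_within)) < 1e-9
```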

The analysis of variance is built on the same principle, only it uses not the variances themselves but the corresponding mean squares (MS_total, MS_between, MS_within), which are unbiased estimates of the corresponding variances. They are obtained by dividing the sums of squared deviations by the corresponding numbers of degrees of freedom:

MS_total = SS_total / (N − 1) for the aggregate as a whole;

MS_within = SS_within / (N − m) for the intragroup averages;

MS_between = SS_between / (m − 1) for the intergroup averages,

where N is the total number of measurements, m is the number of groups (factor gradations), x̄ is the overall average for all measurements (all groups), and x̄_j is the group average for the j-th gradation of the factor.

For the fixed-factor model, the mathematical expectations of the intragroup and intergroup mean squares are, respectively,

E(MS_within) = σ²,  E(MS_between) = σ² + (1 / (m − 1)) Σ_j n_j (μ_j − μ)².

If E(MS_between) = E(MS_within) = σ², then the null hypothesis H0 of no differences between the means is confirmed, and the factor under study has no significant effect (see Fig. 1, b). If the actual value of Fisher's F test, F = MS_between / MS_within, exceeds the critical value F_cr(α; m − 1; N − m), then the null hypothesis H0 is rejected at significance level α and the alternative hypothesis H1 is accepted: the factor has a significant impact (Fig. 1, a).
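As a hedged sketch of this decision rule (the degrees of freedom and the observed F below are invented for illustration), SciPy's F distribution gives the critical value:

```python
from scipy.stats import f

# Decision rule sketch: compare the observed F with the critical value
# F_cr(alpha; m - 1; N - m). Degrees of freedom assume m = 3 groups and
# N = 18 observations; the observed F is a hypothetical number.
alpha = 0.05
df_between, df_within = 2, 15
f_critical = f.ppf(1 - alpha, df_between, df_within)

f_observed = 7.45  # hypothetical MS_between / MS_within
reject_h0 = f_observed > f_critical
print(f"F_cr = {f_critical:.2f}, reject H0: {reject_h0}")
```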

One-way analysis of variance

An analysis of variance that considers only one factor is called one-way ANOVA.

There is a group of n objects of observation with measured values x of some variable under study. The variable is influenced by a certain qualitative factor F with several levels (gradations) of impact. The measured values of the variable at the different levels of the factor are given in Table 2 (they can also be presented in matrix form).

Table 2.

Tabular form of specifying initial data for univariate analysis

Observation object number | F_1 (lowest) | F_2 | … | F_k (highest)
1 | x_11 | x_12 | … | x_1k
2 | x_21 | x_22 | … | x_2k
… | … | … | … | …
n | x_n1 | x_n2 | … | x_nk

Each level may contain a different number of measured responses; in that case each column has its own number of observations n_j. It is required to evaluate the significance of the influence of this factor on the variable under study. To solve this problem, the one-factor model of analysis of variance can be used:

x_ij = μ + F_j + ε_ij,

where:

x_ij is the value of the variable under study for the i-th object of observation at the j-th level of the factor;

x̄_j is the group average for the j-th level of the factor;

F_j is the effect due to the influence of the j-th level of the factor;

ε_ij is the random component, or perturbation, caused by the influence of uncontrollable factors. Let us highlight the main limitations of applying analysis of variance:

Equality to zero of the mathematical expectation of the random component: E(ε_ij) = 0.

The random components ε_ij, and hence the observed values x_ij, have a normal distribution.

The number of gradations of factors must be at least three.

Depending on the levels of the factor, this model allows one to test the corresponding null hypotheses using Fisher's F test.

When performing analysis of variance for related samples, one more null hypothesis H0(u) can be tested: individual differences between the objects of observation are no more pronounced than differences due to random causes.

One-way analysis of variance

(Practical implementation in IBM SPSS Statistics 20)

The researcher is interested in how a certain attribute changes under different conditions of action of a variable (factor). The effect of only one variable (factor) on the trait under study is examined. We have already considered an example from economics; now let us give examples from psychology: how the time to solve a problem changes under different conditions of motivation of the subjects (low, medium, high motivation), with different ways of presenting the task (orally, in writing, or as text with graphs and illustrations), or under different conditions of working on the task (alone, in a room with a teacher, in a classroom). In the first case the factor is motivation, in the second the degree of visual support, in the third the factor of publicity.

In this version of the method, different samples of subjects are exposed to the influence of each of the gradations. There must be at least three gradations of the factor.

Example 1. Three different groups of six subjects were given lists of ten words. The words were presented to the first group at a low rate (1 word per 5 seconds), to the second group at a medium rate (1 word per 2 seconds), and to the third group at a high rate (1 word per second). It was predicted that reproduction performance would depend on the rate of presentation of the words (Table 3).

Table 3

Number of words reproduced

Subject   Group 1 (low speed)   Group 2 (medium speed)   Group 3 (high speed)

We formulate the hypotheses. H0: differences in the volume of word reproduction between the groups are no more pronounced than random differences within each group. H1: differences in the volume of word reproduction between the groups are more pronounced than random differences within each group.

We will carry out the solution in the SPSS environment according to the following algorithm:

Let's run the SPSS program

Enter the numerical values in the Data window.

Figure 1. Entering values in SPSS

In the Variables window, describe all the initial data according to the conditions of the task.

Figure 2. The Variables window

For clarity, in the Label column we describe the names of the variables.

In the Values column, we describe the code of each group.

Figure 3 Value Labels

All this is done for clarity, i.e. these settings can be skipped.

In the Measure column, set the value Nominal for the grouping variable.

In the Data window, run a one-way analysis of variance via the menu: Analyze → Compare Means → One-Way ANOVA…

Figure 4 One-Way ANOVA Function

In the One-Way ANOVA dialog box that opens, select the dependent variable and add it to the Dependent List, and move the factor variable to the Factor field.

Figure 5. Selecting the dependent list and the factor

Configure additional options for more informative output.

Figure 6. Options for informative output

The calculations for the selected one-way ANOVA algorithm start after clicking OK.

At the end of the calculations, the results are displayed in the Viewer window.

Table 2. Descriptive statistics

Group   N   Mean   Std. Deviation   Std. Error   95% confidence interval for mean   Minimum   Maximum

The Descriptive Statistics table shows the main indicators for the speeds in the groups and their totals.

N - the number of observations in each group and in total.

Mean - arithmetic mean of observations in each group and for all groups together

Std. Deviation, Std. Error - the standard deviation and the standard error of the mean.

95% confidence interval for the mean - the range that, with 95% confidence, covers the true mean for each group and for all groups together.

Minimum, Maximum - the minimum and maximum number of reproduced words in each group.


Test of homogeneity of variances

Levene statistic   df1   df2   Sig.

Levene's test is used to check the variances for homogeneity. In this case it confirms that the differences between the variances are insignificant, since the significance value 0.915 is clearly greater than 0.05. Therefore, the results obtained by the analysis of variance are recognized as valid.
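The same homogeneity check can be sketched with SciPy's `levene` function (assuming SciPy is available; the three samples below are illustrative values, not the data from Table 3):

```python
from scipy import stats

# Three groups of word-recall scores (illustrative values)
g1 = [8, 7, 9, 5, 6, 8]
g2 = [7, 9, 6, 8, 9, 7]
g3 = [4, 6, 5, 3, 7, 5]

stat, p = stats.levene(g1, g2, g3)

# If p > 0.05, the differences between the group variances are
# insignificant and the subsequent ANOVA results can be trusted.
print(f"Levene statistic = {stat:.3f}, p = {p:.3f}")
if p > 0.05:
    print("Variances are homogeneous; ANOVA is applicable")
```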

The One-way ANOVA table shows the results of the one-way analysis of variance.

The "between groups" sum of squares is the sum of the squares of the differences between the overall mean and the means in each group, weighted by the number of objects in the group

"Within groups" is the sum of the squared differences between the mean of each group and each value of that group

The df column contains the number of degrees of freedom v:

Between groups (v = number of groups - 1);

Within groups (v = number of objects - number of groups).

"mean square" contains the ratio of the sum of squares to the number of degrees of freedom.

Column "F" shows the ratio of the mean square between groups to the mean square within groups.

The Sig. column contains the probability that the observed differences are random.
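The columns described above can be computed directly from the formulas. A minimal Python sketch (the three samples are illustrative values, not the data from Table 3; SciPy is assumed for the tail probability):

```python
from scipy import stats

groups = [[8, 7, 9, 5, 6, 8], [7, 9, 6, 8, 9, 7], [4, 6, 5, 3, 7, 5]]

N = sum(len(g) for g in groups)              # total number of objects
grand_mean = sum(sum(g) for g in groups) / N

# "Between groups": squared deviations of the group means from the
# grand mean, weighted by the number of objects in each group
ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
# "Within groups": squared deviations of each value from its group mean
ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)

df_between = len(groups) - 1                 # v = number of groups - 1
df_within = N - len(groups)                  # v = number of objects - number of groups
ms_between = ss_between / df_between         # mean square = SS / df
ms_within = ss_within / df_within
F = ms_between / ms_within                   # F = MS_between / MS_within
p = stats.f.sf(F, df_between, df_within)     # probability the differences are random

print(f"F = {F:.3f}, p = {p:.4f}")
```

The same result can be cross-checked with `scipy.stats.f_oneway(*groups)`, which implements exactly this computation.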

Table 4 Formulas

Graphs of averages

The graph shows that the number of reproduced words decreases as the presentation speed increases. From the table of the F-distribution, for k1 = 2 and k2 = 15 the tabular (critical) value of the statistic is 3.68. By the rule, if F is less than the critical value, the null hypothesis is accepted; otherwise the alternative hypothesis is accepted. For our example 7.45 > 3.68, hence the alternative hypothesis is accepted. Thus, returning to the condition of the problem, we conclude that the null hypothesis is rejected and the alternative is accepted: differences in the volume of word reproduction between the groups are more pronounced than random differences within each group. Therefore, the speed of presentation of words affects the volume of their reproduction.
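The tabular value used in this decision rule can be obtained programmatically instead of from a printed table; a sketch assuming SciPy is available:

```python
from scipy import stats

alpha = 0.05
F = 7.45               # observed value from the ANOVA table
df1, df2 = 2, 15       # k1 = 2, k2 = 15

# Critical (tabular) value of the F distribution, about 3.68
F_crit = stats.f.ppf(1 - alpha, df1, df2)

if F > F_crit:
    print("Reject H0: the factor has a significant effect")
else:
    print("Accept H0: the factor effect is insignificant")
```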

One-way analysis of variance

(Practical implementation in Microsoft Office 2013)

Using the same example, let us consider one-way analysis of variance in Microsoft Office 2013.

Solving a problem in Microsoft Excel

Let's open Microsoft Excel.


Figure 1. Writing data to Excel

Let's convert the data to number format. To do this, on the Home tab there is a Format item with a Format Cells subitem. The Format Cells window appears on the screen (Fig. 2). Select the Number format, and the entered data are converted, as shown in Fig. 3.

Figure 2 Convert to Numeric Format

Figure 3 Result after conversion

On the Data tab there is a Data Analysis item; click on it.

Let's choose One-way analysis of variance

Figure 6 Data analysis

The One-way analysis of variance window will appear on the screen for conducting dispersion analysis of data (Fig. 7). Let's configure the parameters

Figure 7. Setting parameters for one-way analysis

Click in the Input Range field and select the range of cells B2:F9 containing the data to be analyzed. The specified range appears in the Input Range field of the Input control group.

If the Rows switch is not set in the Grouped By control group, select it so that Excel treats each row as a data group.

Optionally, select the Labels in First Column check box in the Input control group if the first column of the selected data range contains row names.

The Alpha input field in the Input control group displays the default value 0.05: the significance level, i.e. the admissible probability of an error in the analysis of variance.

If the Output Range switch is not set in the Output Options group, set it, or select the New Worksheet switch so that the results are placed on a new sheet.

Click the OK button to close the One-Way ANOVA window. The results of the analysis of variance will appear (Fig. 8).

Figure 8 Data output

The range of cells A4:E7 contains the results of descriptive statistics. Row 4 contains the parameter names; rows 5-7 contain the statistics calculated for each group. The Count column shows the number of measurements, the Sum column the sums of the values, the Average column the arithmetic means, and the Variance column the variances.

The results obtained show which group has the highest mean number of reproduced words and in which groups the spread of the values is largest.

The range of cells A10:G15 displays information about the significance of differences between the data groups. Row 11 contains the names of the analysis of variance parameters, row 12 the between-group results, row 13 the within-group results, and row 15 the sum of these two rows.

The SS column contains the variation values, i.e. sums of squares over all deviations. Variation, like dispersion, characterizes the spread of data.

The df column contains the numbers of degrees of freedom, i.e. the numbers of independent deviations used to calculate the variance. For example, the between-group number of degrees of freedom equals the number of data groups minus one. The greater the number of degrees of freedom, the higher the reliability of the variance estimates. The degrees of freedom in the table show that the within-group results are more reliable than the between-group parameters.

The MS column contains the variance values, determined as the ratio of the variation (SS) to the number of degrees of freedom. The variance characterizes the degree of scatter of the data but, unlike the variation, does not tend to increase with the number of degrees of freedom. The table shows that the between-group variance is much larger than the within-group variance.

Column F contains the value of the F-statistic, calculated by the ratio of the intergroup and intragroup variances.

The F critical column contains the critical value calculated from the numbers of degrees of freedom and the value of Alpha. The F-statistic is compared with the critical value according to the Fisher-Snedecor test.

If the F-statistic is greater than the F critical value, then it can be argued that the differences between the data groups are not random, i.e., at the significance level α = 0.05 (with reliability 0.95) the null hypothesis is rejected and the alternative is accepted: the speed of presentation of words affects the volume of their reproduction. The P-value column contains the probability that the difference between the groups is random. Since this probability is very small in the table, the differences between the groups are not random.
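The P-value itself can be cross-checked as the upper-tail probability of the F distribution beyond the observed statistic (a sketch assuming SciPy; the F value and degrees of freedom are taken from the ANOVA table of this example):

```python
from scipy import stats

F = 7.447            # F-statistic = MS_between / MS_within
df1, df2 = 2, 15     # between-group and within-group degrees of freedom

p_value = stats.f.sf(F, df1, df2)   # upper-tail probability

# A small p-value means the between-group differences are unlikely
# to be random at the 0.05 significance level
print(f"p = {p_value:.3f}")
print(p_value < 0.05)
```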

Comparison of IBM SPSS Statistics 20 and Microsoft Office 2013


Let us compare the outputs of the two programs; to do this, look at the screenshots again.

One-way ANOVA

Source           Sum of Squares   df   Mean Square   F       Sig.
Between groups   31.444           2    15.722        7.447   .006
Within groups    31.667           15   2.111
Total            63.111           17

Thus, IBM SPSS Statistics 20 produces more polished output: it rounds numbers, builds a visual graph (see the complete solution) from which the answer can be determined, and describes both the conditions of the problem and their solution in more detail. Microsoft Office 2013 has its own advantages: first of all its prevalence, since it is installed on almost every computer; it also displays F critical, which SPSS Statistics does not provide, and calculations in it are simple and convenient. Both programs are well suited for solving one-way analysis of variance problems, each with its pros and cons, but for large tasks with more conditions SPSS Statistics is preferable.

Conclusion

Analysis of variance is applied in all areas of scientific research where it is necessary to analyze the influence of various factors on a variable under study. In the modern world there are many problems for one-way analysis of variance in economics, psychology, and biology. As a result of studying the theoretical material, it was found that the basis of analysis of variance is the theorem on the addition of variances; of the many software packages in which the apparatus of variance analysis is implemented, the best were selected and included in this work. Thanks to the advent of new technologies, each of us can conduct such research while spending less time and effort on calculations by using computers. In the course of the work, the goals and tasks that were set have been achieved.
