
Basic statistical categories. Relationship coefficients of qualitative features

The grouping method makes it possible to study the state and relationships of economic phenomena, provided that the groups are characterized by indicators that reveal the most significant aspects of the phenomenon under study.

In analysis and planning one must rely not on random facts but on indicators that express what is main, typical, and fundamental. Such a characterization is given by the various kinds of mean values, as well as by the mode and the median.

The question of the homogeneity of a population should not be decided formally, from the shape of its distribution. Like the question of whether an average is typical, it must be decided on the basis of the causes and conditions that form the population. A population is homogeneous if its units are formed under the influence of common main causes and conditions that determine the general level of the feature, typical of the entire population.

According to the theory of typological groupings, what is decisive in assessing the homogeneity of a population is not the shape of the distribution but the size of the variation and the conditions under which it is formed. A qualitatively homogeneous population is characterized by variation within certain limits, beyond which a new quality begins. To assess qualitative homogeneity, these limits must be approached from the substance of the matter and not formally, since the same quantity expresses a different quality under different conditions. For example, with the same number of workers, enterprises in some branches of industry count as large, while in others they count as small.

For a comprehensive and in-depth study of phenomena, and for an objective characterization of their types, their relationships, and the processes driven by the development of the system as a whole, group averages must be combined with general averages. The combination of such averages is one of the main elements of the analysis of complex systems. It joins two organically complementary statistical methods into one whole: the method of averages and the method of grouping.

When the average is calculated, the individual values that vary within a group are replaced by a single average value. The random upward and downward deviations of individual units balance and cancel each other, and the average shows the typical size of the feature for the group. The average characterizes the population as a whole and at the same time refers to its individual element, the carrier of the qualitative features of the phenomenon. The meaning of the average is quite concrete and at the same time abstract: it is obtained by abstracting from what is random and individual in each unit in order to bring out what is general and typical for all units and forms the population. For the average to be meaningful, the number of population units should be large enough.

The average is defined as the ratio of the total volume of the feature to the number of population units in the group. For ungrouped data this is the simple arithmetic mean:

$$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n},$$

and for grouped data, where each feature value has its own frequency, the weighted arithmetic mean:

$$\bar{X} = \frac{\sum_i X_i f_i}{\sum_i f_i},$$

where $X_i$ is the value of the feature, $f_i$ is the frequency of that value, and $n$ is the number of population units.

Since the arithmetic mean is calculated as the ratio of the sum of the feature values to their total number, it never goes beyond the range of these values: it lies between the smallest and the largest of them. The arithmetic mean has a number of properties that are widely used to simplify calculations.
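The two formulas can be checked against each other with a minimal Python sketch. The data below are hypothetical and chosen only so that the grouped and ungrouped forms describe the same observations.

```python
def simple_mean(values):
    """Arithmetic mean of ungrouped data: sum of values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, freqs):
    """Arithmetic mean of grouped data: each value is weighted by its frequency."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)


# Hypothetical data: the value 2 occurs three times, 5 once, 8 once.
ungrouped = [2, 2, 2, 5, 8]
values, freqs = [2, 5, 8], [3, 1, 1]

print(simple_mean(ungrouped))        # 3.8
print(weighted_mean(values, freqs))  # 3.8
```

Both forms give the same result, and the result lies between the smallest and the largest value.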

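1. The sum of the deviations of the individual feature values from the mean is always zero:

$$\sum_{i=1}^{n} (X_i - \bar{X}) = 0.$$

Proof. Expanding the sum gives

$$\sum_{i=1}^{n} (X_i - \bar{X}) = \sum_{i=1}^{n} X_i - n\bar{X}.$$

Dividing the left and right sides by $n$, we get $\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X}) = \bar{X} - \bar{X} = 0$, so the sum itself is zero.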

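2. If all feature values $X_i$ are changed by a factor of $k$, the arithmetic mean changes by the same factor $k$.

Proof.

Suppose every value is reduced by a factor of $k$, and denote the arithmetic mean of the new values by $\bar{X}'$. Then:

$$\bar{X}' = \frac{\sum_{i=1}^{n} \frac{X_i}{k}}{n}.$$

The constant $1/k$ can be taken outside the summation sign, which gives:

$$\bar{X}' = \frac{1}{k} \cdot \frac{\sum_{i=1}^{n} X_i}{n} = \frac{\bar{X}}{k}.$$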

3. If the same constant $A$ is subtracted from or added to every feature value $X_i$, the arithmetic mean decreases or increases by that same amount.

Proof.

The average of the deviations of the feature values from the constant $A$ is:

$$\frac{\sum_{i=1}^{n} (X_i - A)}{n} = \frac{\sum_{i=1}^{n} X_i}{n} - \frac{nA}{n} = \bar{X} - A.$$

The case of adding a constant is proved in the same way.

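4. If the frequencies of all feature values are reduced or increased by the same factor $c$, the average does not change:

$$\frac{\sum_i X_i \frac{f_i}{c}}{\sum_i \frac{f_i}{c}} = \frac{\frac{1}{c}\sum_i X_i f_i}{\frac{1}{c}\sum_i f_i} = \frac{\sum_i X_i f_i}{\sum_i f_i}.$$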
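The four properties can be verified numerically. The sketch below uses hypothetical values and frequencies and exact fractions, so the equalities hold without rounding. For grouped data, property 1 is checked in its weighted form: the deviations are multiplied by their frequencies.

```python
from fractions import Fraction


def weighted_mean(values, freqs):
    """Weighted arithmetic mean of grouped data."""
    return sum(x * f for x, f in zip(values, freqs)) / sum(freqs)


values = [Fraction(v) for v in (10, 20, 40)]  # hypothetical feature values
freqs = [2, 5, 3]                             # hypothetical frequencies
mean = weighted_mean(values, freqs)           # (20 + 100 + 120) / 10 = 24

# Property 1: frequency-weighted deviations from the mean sum to zero.
dev_sum = sum((x - mean) * f for x, f in zip(values, freqs))

# Property 2: dividing every value by k divides the mean by k.
k = 4
scaled_mean = weighted_mean([x / k for x in values], freqs)

# Property 3: subtracting a constant a from every value lowers the mean by a.
a = 7
shifted_mean = weighted_mean([x - a for x in values], freqs)

# Property 4: multiplying all frequencies by the same factor c changes nothing.
c = 10
rescaled_mean = weighted_mean(values, [f * c for f in freqs])

print(dev_sum == 0, scaled_mean == mean / k,
      shifted_mean == mean - a, rescaled_mean == mean)  # True True True True
```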

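If the total volume of the feature and the feature values are known but the frequencies are not, the frequencies are first recovered as the ratio of volume to feature value, and the weighted arithmetic mean formula is then applied. This is equivalent to the weighted harmonic mean.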

For example, data are available on sales prices for cabbage and total revenue for various sales periods (Table 1).

Table 1.

Sales price of cabbage and total revenue for various sales periods


Since the average price is the ratio of total revenue to the total quantity of cabbage sold, we must first determine the quantity sold in each sales period as the ratio of revenue to price, and then determine the average price of the cabbage sold.

In our example, the average price would be:

If the average selling price were calculated here as a simple arithmetic mean, the result would be different: it would distort the true situation and overstate the average selling price, because it ignores the fact that a large share of sales falls on late cabbage, which has a lower price.
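The calculation can be sketched in Python. The figures below are hypothetical and are not the values of Table 1; they only illustrate the order of the steps and the way a simple mean overstates the price.

```python
# Hypothetical figures (not the data of Table 1):
# (sales period, price per kg, total revenue).
periods = [
    ("early cabbage", 20.0, 4000.0),
    ("late cabbage", 10.0, 8000.0),
]

# Step 1: quantity sold in each period = revenue / price.
quantities = [revenue / price for _, price, revenue in periods]

# Step 2: average price = total revenue / total quantity.
total_revenue = sum(revenue for _, _, revenue in periods)
average_price = total_revenue / sum(quantities)

# For comparison: the simple mean of the two prices ignores the quantities.
naive_price = sum(price for _, price, _ in periods) / len(periods)

print(average_price)  # 12.0
print(naive_price)    # 15.0
```

With these assumed figures most of the quantity is sold at the lower price, so the simple mean (15.0) overstates the true average price (12.0).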

Sometimes the average has to be determined when the feature values are given as fractions, i.e., as reciprocals of integers (for example, when labor productivity is studied through its inverse indicator, labor intensity). In such cases it is advisable to use the harmonic mean formula:

$$\bar{X}_{\text{harm}} = \frac{n}{\sum_{i=1}^{n} \frac{1}{X_i}}.$$

Thus, the average time required to manufacture a unit of output is a harmonic mean. If $X_1 = 1/4$ hour, $X_2 = 1/2$ hour, $X_3 = 1/3$ hour, then the harmonic mean of these numbers is:

$$\bar{X}_{\text{harm}} = \frac{3}{4 + 2 + 3} = \frac{1}{3} \text{ hour}.$$
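The same example can be reproduced with exact fractions. This is a minimal sketch; the function name `harmonic_mean` is chosen here for illustration.

```python
from fractions import Fraction


def harmonic_mean(values):
    """Harmonic mean: number of values divided by the sum of their reciprocals."""
    return len(values) / sum(1 / x for x in values)


# Time per unit of output, in hours, as in the example above.
times = [Fraction(1, 4), Fraction(1, 2), Fraction(1, 3)]

print(harmonic_mean(times))  # 1/3
```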

To calculate the average of ratios of two like quantities, for example growth rates, the geometric mean is used:

$$\bar{X}_{\text{geom}} = \sqrt[n]{X_1 \cdot X_2 \cdots X_n},$$

where $X_1, X_2, \ldots, X_n$ are ratios of two like quantities, for example chain growth rates, and $n$ is the number of such ratios.

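The means considered obey the majorant property:

$$\bar{X}_{\text{harm}} \le \bar{X}_{\text{geom}} \le \bar{X}_{\text{arithm}}.$$

For example, for the values $X = (20;\ 40)$ the means considered above are:

$$\bar{X}_{\text{harm}} = \frac{2}{\frac{1}{20} + \frac{1}{40}} \approx 26.67, \qquad \bar{X}_{\text{geom}} = \sqrt{20 \cdot 40} \approx 28.28, \qquad \bar{X}_{\text{arithm}} = \frac{20 + 40}{2} = 30.$$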
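The ordering of the three means for the values 20 and 40 can be checked with a short Python sketch. The function names are illustrative.

```python
import math


def arithmetic_mean(xs):
    return sum(xs) / len(xs)


def geometric_mean(xs):
    # n-th root of the product of n values
    return math.prod(xs) ** (1 / len(xs))


def harmonic_mean(xs):
    return len(xs) / sum(1 / x for x in xs)


xs = [20, 40]
h, g, a = harmonic_mean(xs), geometric_mean(xs), arithmetic_mean(xs)

print(f"{h:.2f} {g:.2f} {a:.2f}")  # 26.67 28.28 30.00
print(h <= g <= a)                 # True
```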

When studying the composition of the population, the typical size of a feature can be judged by the so-called structural averages - mode and median.

The mode is the most frequently occurring value of the feature in the population. In an interval variation series the modal interval is found first, and within it the mode is calculated by the formula:

$$Mo = X_0 + d \cdot \frac{f_2 - f_1}{(f_2 - f_1) + (f_2 - f_3)},$$

where $X_0$ is the lower limit of the modal interval; $d$ is the interval width; $f_1$, $f_2$, $f_3$ are the frequencies of the premodal, modal, and postmodal intervals.
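A minimal sketch of this calculation follows. It assumes the usual interpolation formula for the mode of an interval series with equal interval widths, and the interval and frequencies are hypothetical.

```python
def interval_mode(lower, width, f_prev, f_modal, f_next):
    """Mode inside the modal interval by linear interpolation.

    lower   - lower limit of the modal interval (X0)
    width   - interval width (d)
    f_prev  - frequency of the premodal interval (f1)
    f_modal - frequency of the modal interval (f2)
    f_next  - frequency of the postmodal interval (f3)
    """
    left = f_modal - f_prev
    right = f_modal - f_next
    return lower + width * left / (left + right)


# Hypothetical series: modal interval 20-30 with neighbouring frequencies 10 and 15.
print(interval_mode(20, 10, f_prev=10, f_modal=25, f_next=15))  # 26.0
```

The mode is pulled toward the neighbouring interval with the higher frequency, here the postmodal one.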

The mode of an interval series is quite easy to find from a graph. In the tallest column of the histogram, two lines are drawn from its upper corners to the upper corners of the two adjacent columns: from the upper right corner of the modal column to the upper right corner of the preceding column, and from its upper left corner to the upper left corner of the following column. From the point where these lines intersect, a perpendicular is dropped to the abscissa axis. The value of the feature at that point on the abscissa is the mode (Fig. 2).


Fig. 2

For practical purposes, the mode expressed as an interval is usually of greater interest than the mode expressed as a single number. This follows from the purpose of the mode, which is to reveal the most common size of the phenomenon.

The average is a value typical of all units of a homogeneous population. The mode is also a typical value, but it directly gives the size of the feature that is characteristic of a significant part of the population, though not of all of it. It is of great importance for some problems, for example for predicting which sizes of shoes or clothes should be mass-produced.

The median is the value of the feature located in the middle of the ranked series. It marks the center of the distribution of the population units and divides the population into two equal parts.

The median is the best characteristic of central tendency when the extreme intervals are open-ended. It is also the more suitable characteristic of the distribution level when the series contains excessively large or excessively small values, which strongly influence the mean but not the median. In addition, the median has the linear minimum property: the sum of the absolute deviations of the feature values of all population units from the median is minimal, i.e.

$$\sum_{i=1}^{n} |X_i - Me| = \min.$$

This property is of great importance for some practical problems: for example, for calculating the shortest total distance for different kinds of transport, or for locating a service station so that the total distance to all the cars it serves is minimal.
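The service-station example can be illustrated with a short sketch. The car positions along a road are hypothetical; the code compares the total distance from the median with the total distance from the mean and from every integer position on the road.

```python
import statistics


def total_distance(point, positions):
    """Sum of absolute deviations of all positions from the given point."""
    return sum(abs(p - point) for p in positions)


cars = [1, 2, 3, 10, 50]  # hypothetical positions of cars along a road, km

station_at_median = statistics.median(cars)
station_at_mean = statistics.mean(cars)

print(station_at_median)                        # 3
print(total_distance(station_at_median, cars))  # 57
print(total_distance(station_at_median, cars)
      < total_distance(station_at_mean, cars))  # True
```

The single distant car at 50 km pulls the mean far to the right, but it does not move the median.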

To find the median, its ordinal number in the distribution series is determined first:

$$N_{Me} = \frac{n + 1}{2},$$

where $n$ is the number of units in the series.

Then, using this ordinal number, the median itself is found from the cumulative frequencies of the series. In a discrete series this requires no calculation. In an interval series the cumulative frequencies are used to find the median interval, and the median within it is determined by simple interpolation:

$$Me = X_0 + d \cdot \frac{\frac{1}{2}\sum f - f_{-1}}{f},$$

where $X_0$ is the lower limit of the median interval; $d$ is the interval width; $f_{-1}$ is the frequency accumulated up to the median interval; $f$ is the frequency of the median interval; and $\sum f$ is the sum of all frequencies.
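The search for the median interval and the interpolation inside it can be sketched as follows. The sketch assumes equal interval widths and the common form of the formula in which half of the total frequency is compared with the cumulative frequencies. The series is hypothetical.

```python
def interval_median(lowers, width, freqs):
    """Median of an interval series with equal interval widths.

    lowers - lower limits of the intervals, in ascending order
    width  - common interval width (d)
    freqs  - frequencies of the intervals
    """
    total = sum(freqs)
    if total <= 0:
        raise ValueError("the series must contain at least one unit")
    half = total / 2
    cumulative = 0  # frequency accumulated before the current interval
    for lower, f in zip(lowers, freqs):
        if cumulative + f >= half:
            # The median interval is found: interpolate inside it.
            return lower + width * (half - cumulative) / f
        cumulative += f


# Hypothetical series: intervals 0-10, 10-20, 20-30, 30-40.
print(interval_median([0, 10, 20, 30], 10, [5, 15, 20, 10]))  # 22.5
```

Here half of the 50 units is 25. The cumulative frequency reaches 20 before the third interval, so the median lies in the interval 20-30.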

Let us calculate the mean, mode, and median for an interval distribution. The data are given in Table 2.


Thus, various indicators can serve as the center of a distribution: the mean, the mode, and the median, and each of them has its own properties. For the mean, it is characteristic that the deviations of the individual feature values from it cancel each other out, i.e.

$$\sum_{i=1}^{n} (X_i - \bar{X}) = 0.$$

The median is characterized by the fact that the sum of the absolute deviations of the individual feature values from it (ignoring signs) is minimal. The mode is the most frequently occurring value of the feature. Therefore one of these characteristics should be chosen depending on which property is of interest to the researcher. In some cases all three are calculated.

Comparing them and identifying the relationships between them helps to clarify the features of the distribution of a given variation series. In symmetrical series, as in our case, all three characteristics (mean, mode, and median) approximately coincide. The greater the discrepancy between the mode and the mean, the more asymmetric the series. It has been established that for moderately asymmetric series the difference between the mode and the arithmetic mean is approximately three times the difference between the median and the arithmetic mean:

$$|Mo - \bar{X}| \approx 3\,|Me - \bar{X}|.$$

This ratio can be used to determine one indicator from two known ones. It follows from this that the combination of mode, median, and mean is also important for characterizing the type of distribution.

Computer Science and Mathematics - Theoretical materials for the first colloquium

1. The subject of mathematical statistics and its main sections. The concept of a statistical distribution. The normal distribution. Under what conditions is a random variable normally distributed?

Statistics is a science that studies aggregate mass phenomena in order to identify regularities in them and to study these regularities with the help of generalized indicators.

All methods of mathematical statistics belong to one of its two main sections: the theory of statistical estimation of parameters and the theory of testing statistical hypotheses.

Sections:

1. descriptive statistics

2. sampling method, confidence intervals

3. correlation analysis

4. regression analysis

5. analysis of qualitative features

6. multivariate statistical analysis:

a) cluster analysis

b) factor analysis

7. time series analysis

8. differential equations

9. mathematical modeling of historical processes

Distribution:

Theoretical (infinitely many objects and they behave perfectly)

Empirical (real data that can be plotted in a histogram)

The normal distribution arises when the character of the distribution is influenced by many factors and none of them is decisive. It is used especially often in practice.


The normal distribution can be represented graphically as a bell-shaped, symmetric, single-peaked curve. The height (ordinate) of each point on this curve indicates how often the corresponding value occurs.

2. Descriptive statistics. Mean values: arithmetic mean, median, mode. In what situations do these three measures give similar values, and in what situations do they differ greatly?

Descriptive statistics are summary indicators that describe a set of data.

The arithmetic mean, median, and mode are measures of the average level: coefficients that characterize a set of objects.

· mean (arithmetic) value: the sum of all values divided by the total number of observations (accepted designations: Mean or $\bar{x}$), i.e. the arithmetic mean of a feature is the value

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i,$$

where $x_i$ is the value of the feature for the $i$-th object and $n$ is the number of objects in the population.

· mode - the most frequently occurring value of the variable (M)

· median: the middle value (accepted designations: Median, m). The median is the "middle" value of the feature in the sense that half of the objects in the population have a smaller value of the feature and the other half a larger one. The median can be found approximately by arranging all values of the feature in ascending (or descending) order and taking the element of this variation series with number $(n+1)/2$ if $n$ is odd, or the midpoint between the elements with numbers $n/2$ and $n/2 + 1$ if $n$ is even.

Not all of these characteristics can be calculated for qualitative features. If the feature is qualitative and nominal, only the mode can be found for it (its value is the name of the most frequent category). If the feature is ordinal (ranked), the median can be found in addition to the mode. The arithmetic mean can be calculated only for quantitative features.

In the case of quantitative data, all characteristics of the average level are measured in the same units as the original attribute itself.

The three measures coincide when the distribution curve is symmetric and single-peaked; the more asymmetric the distribution, the more they differ.
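The difference between the symmetric and the asymmetric case can be seen with Python's standard `statistics` module. Both data sets below are hypothetical.

```python
import statistics


def average_measures(data):
    """Return (mean, median, mode) of a data set."""
    return (statistics.mean(data),
            statistics.median(data),
            statistics.mode(data))


# Hypothetical symmetric, single-peaked data: all three measures equal 3.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Hypothetical right-skewed data: two large values pull the mean upward
# (mean 6), while the median (2) and the mode (1) stay low.
skewed = [1, 1, 1, 2, 2, 3, 4, 10, 30]

print(average_measures(symmetric))
print(average_measures(skewed))
```

In the skewed set the mean is the measure most affected by the extreme values, which is why the three measures drift apart.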


3. Heterogeneity indicators: variance, root mean square (standard) deviation, coefficient of variation. In what units are they measured? Why introduce the concept of the coefficient of variation?

· root mean square or standard deviation: a measure of the spread of the feature values around the arithmetic mean (accepted designations: Std.Dev. (standard deviation), $\sigma$ or $s$). It is calculated by the formula

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

(for a sample, $n - 1$ is often used in the denominator instead of $n$).

· variance of the feature ($\sigma^2$ or $s^2$): the square of the standard deviation.

· coefficient of variation: the ratio of the standard deviation to the arithmetic mean, expressed as a percentage (denoted in statistics by the letter $V$). It is calculated by the formula $V = \dfrac{\sigma}{\bar{x}} \cdot 100\%$.

All these measures can be calculated only for quantitative features. They all show how strongly the values of the feature (more precisely, their deviations from the mean) vary in a given population. The smaller the value of a measure of spread, the closer the feature values of all objects are to their mean, and hence to each other. If a measure of spread equals zero, the feature values are the same for all objects.

The most commonly used measure is the root mean square (standard) deviation $\sigma$. Like the arithmetic mean, it is measured in the same units as the original feature. If all values of the feature are multiplied by some factor, the standard deviation changes by the same factor; but if all values are increased (or decreased) by the same amount, the standard deviation does not change. Along with the standard deviation, the variance (its square) is often used, but in practice it is a less convenient measure, because its units are the squares of the units of the feature.

The point of the coefficient of variation is that, unlike $\sigma$, it measures not the absolute but the relative spread of the feature values in the population. Being dimensionless, it allows the spread of features measured in different units to be compared.

The more V , the less homogeneous the population.

Population        Coefficient of variation
Homogeneous       V = 0 - 30%
Transitional      V = 30 - 50%
Heterogeneous     V = 50 - 100%

V may also exceed 100% (an extremely heterogeneous population).
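The three measures of spread and the homogeneity scale can be put together in a short sketch. It assumes the population form of the variance (division by n), and the assignment of the boundary values 30% and 50% to the lower class is an arbitrary choice made here. The data are hypothetical.

```python
import math


def dispersion_measures(xs):
    """Return (mean, variance, standard deviation, coefficient of variation in %)."""
    n = len(xs)
    mean = sum(xs) / n
    variance = sum((x - mean) ** 2 for x in xs) / n  # population form (divide by n)
    std = math.sqrt(variance)
    cv = 100 * std / mean
    return mean, variance, std, cv


def homogeneity(cv):
    """Classify a population by its coefficient of variation (boundaries assumed)."""
    if cv <= 30:
        return "homogeneous"
    if cv <= 50:
        return "transitional"
    return "heterogeneous"


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values of a quantitative feature
mean, variance, std, cv = dispersion_measures(data)

print(mean, variance, std, cv)  # 5.0 4.0 2.0 40.0
print(homogeneity(cv))          # transitional
```

Adding a constant to every value leaves `std` unchanged but changes `cv`, because the mean in the denominator shifts.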


4. The concept of the sampling method. The representative sample and methods of forming it. Two types of sampling errors. Confidence probability.

Sample:

Representative

Random

Mechanical sampling: similar to random sampling (every 10th, 20th, etc. unit is taken).

Natural sample: what has survived of the general population over time.

Representative sample: one that accurately reflects the properties of the general population.

For the sample to reflect correctly the basic properties of the general population, it must be random, i.e. all units of the general population must have an equal chance of being included in the sample.

Samples are formed using special techniques. The simplest is random selection, for example by ordinary drawing of lots (for small populations) or with tables of random numbers. For larger but fairly homogeneous populations, mechanical selection is used (as it was in Zemstvo statistics). For heterogeneous populations with a definite structure, typical selection is more common. There are other methods as well, including combinations of different kinds of selection at several stages of sampling.
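Random and mechanical selection can be sketched with the standard `random` module. The population of 100 numbered units, the seed, and the helper `mechanical_sample` are all hypothetical and serve only as an illustration.

```python
import random

# Hypothetical general population: 100 numbered units.
population = list(range(1, 101))

# Random selection: every unit has the same chance of being chosen.
# A seeded generator plays the role of a table of random numbers.
rng = random.Random(42)
random_sample = rng.sample(population, 10)


def mechanical_sample(units, step, start=0):
    """Mechanical selection: take every step-th unit, beginning at index start."""
    return units[start::step]


every_tenth = mechanical_sample(population, 10)

print(len(random_sample))  # 10
print(every_tenth)         # [1, 11, 21, 31, 41, 51, 61, 71, 81, 91]
```

Mechanical selection gives a sample close to a random one only if the order of the units in the list is unrelated to the feature being studied.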

Sample results always contain errors. These errors fall into two classes: random and systematic. Random errors are the random deviations of the sample characteristics from those of the general population that are due to the very nature of the sampling method. The size of a random error can be calculated (estimated). Systematic errors, by contrast, are not random; they arise when the structure of the sample deviates from the real structure of the general population. Systematic errors appear when the basic rule of random selection is violated, namely that all objects have equal chances of being included in the sample. Statistics cannot estimate errors of this kind.

The main sources of systematic errors are: a) a sample that does not fit the objectives of the study; b) ignorance of the nature of the distribution in the general population and, as a result, a distortion of the structure of the general population in the sample; c) deliberate selection of the most convenient and advantageous elements of the general population.

Confidence probability -


5. Confidence probability. Mean (standard) and marginal sampling error. Confidence interval for estimating the mean in the general population. Testing the hypothesis of the statistical significance of the difference between two sample means.

Confidence interval: the interval around the value calculated from the sample within which, we believe, the corresponding value for the general population should fall.

Confidence probability: the probability that the value of the coefficient for the general population falls within the confidence interval. The higher the confidence probability, the wider the confidence interval.

The inevitable scatter of the sample means around the general mean (i.e. the standard deviation of the sample means) is called the standard sampling error $\mu$ and is expressed by the formula $\mu = \sigma / \sqrt{n}$ ($\sigma$ is the standard deviation, $n$ is the sample size). The standard sampling error is smaller, the smaller the value of $\sigma$ (which characterizes the spread of the feature values) and the larger the sample size $n$.

If the sampling method is applied to non-quantitative data, the role of the arithmetic mean is played by the share, or relative frequency, $q$ of the feature. The share is calculated as the ratio of the number of objects that have the feature ($n_1$) to the number of objects in the whole population: $q = n_1 / n$. The role of the measure of spread is played by the quantity $\sqrt{q(1 - q)}$.

In this case the standard sampling error $\mu$ is calculated by the formula:

$$\mu = \sqrt{\frac{q(1 - q)}{n}}.$$

The accuracy and the reliability of estimating the parameters of the general population from a sample are inversely related: the greater the accuracy (i.e. the smaller the marginal error and the narrower the confidence interval), the lower the reliability of the estimate (the degree of confidence), and vice versa: the lower the accuracy of the estimate, the higher its reliability. A confidence interval is often built for 95% reliability, in which case the marginal sampling error is usually taken to be twice the mean error $\mu$.

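Confidence interval for estimating the mean in the general population:

$$\bar{X}_{\text{gen}} = \bar{x}_{\text{sample}} \pm \Delta = \bar{x}_{\text{sample}} \pm t\mu = \bar{x}_{\text{sample}} \pm t\,\frac{\sigma}{\sqrt{n}},$$

where $\Delta$ is the marginal sampling error and $t$ is the multiplier that corresponds to the chosen confidence probability ($t \approx 2$ for 95%).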
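The interval can be computed with a minimal sketch. It assumes that the standard deviation of the general population is unknown and is estimated from the sample itself, and it uses the multiplier t = 2 for roughly 95% reliability. The sample is hypothetical.

```python
import math


def mean_confidence_interval(sample, t=2.0):
    """Return (lower, upper) limits of the confidence interval for the mean.

    t = 2 corresponds to roughly 95% reliability (marginal error = 2 * mean error).
    """
    n = len(sample)
    mean = sum(sample) / n
    # Assumption: the sample standard deviation stands in for the general one.
    sigma = math.sqrt(sum((x - mean) ** 2 for x in sample) / n)
    mu = sigma / math.sqrt(n)   # standard (mean) sampling error
    delta = t * mu              # marginal sampling error
    return mean - delta, mean + delta


# Hypothetical sample of 16 observations with mean 5 and standard deviation 2.
sample = [2, 4, 4, 4, 5, 5, 7, 9] * 2

print(mean_confidence_interval(sample))  # (4.0, 6.0)
```

Quadrupling the sample size halves the standard error, so the interval becomes half as wide.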

Test for the difference between means

A common problem is to compare two sample means in order to test the hypothesis that the samples were drawn from the same general population and that the observed discrepancy between the sample means is explained by the randomness of the samples.

The hypothesis under test can be formulated as follows: the difference between the sample means is random, i.e. the general means are equal in both cases. The test statistic is again $t$, which is the difference between the sample means divided by the combined standard error of the means for the two samples.

The actual value of the test statistic is compared with the critical value corresponding to the chosen significance level. If the actual value exceeds the critical value, the hypothesis under test is rejected, i.e. the difference between the means is considered significant.
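The comparison can be sketched as follows. The text does not give the formula for the combined standard error, so the sketch assumes the common choice: the square root of the sum of the squared standard errors of the two means. The critical value 2 (about 95% for large samples) and both samples are also assumptions.

```python
import math


def mean_and_standard_error(sample):
    """Return the sample mean and its standard error sigma / sqrt(n)."""
    n = len(sample)
    mean = sum(sample) / n
    variance = sum((x - mean) ** 2 for x in sample) / n
    return mean, math.sqrt(variance / n)


def t_statistic(sample_a, sample_b):
    """Difference of the means divided by the combined standard error (assumed form)."""
    mean_a, err_a = mean_and_standard_error(sample_a)
    mean_b, err_b = mean_and_standard_error(sample_b)
    return (mean_a - mean_b) / math.sqrt(err_a ** 2 + err_b ** 2)


# Hypothetical samples: the second is the first shifted upward by 3.
sample_a = [2, 4, 4, 4, 5, 5, 7, 9] * 2
sample_b = [x + 3 for x in sample_a]

t = t_statistic(sample_a, sample_b)
critical_value = 2.0  # assumed critical value for about 95% reliability

print(round(t, 2))              # -4.24
print(abs(t) > critical_value)  # True: the difference is significant
```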


7. Correlation. The linear correlation coefficient, its formula, the limits of its values. The coefficient of determination and its substantive meaning. The concept of statistical significance of the correlation coefficient.

The correlation coefficient shows how closely two variables are related.

The correlation coefficient $r$ takes values in the range from -1 to +1. If $r = 1$, there is a functional positive linear relationship between the two variables, i.e. the points of the scatterplot lie on a straight line with a positive slope. If $r = -1$, there is a functional negative linear relationship between the two variables. If $r = 0$, the variables are linearly independent, i.e. the point cloud of the scatterplot is "stretched horizontally".

It is advisable to calculate the regression equation and the correlation coefficient only when the relationship between the variables can be considered at least approximately linear. Otherwise the results may be completely wrong; in particular, the correlation coefficient may be close to zero in the presence of a strong relationship. This applies especially to cases where the dependence is clearly non-linear (for example, when it is approximately described by a sinusoid or a parabola). In many cases the problem can be circumvented by transforming the original variables. However, in order to see that such a transformation is needed, i.e. to find out that the data may contain complex forms of dependence, it is desirable to "see" them. That is why the study of relationships between quantitative variables should usually include viewing scatterplots.

Correlation coefficients can be calculated without first constructing a regression line. In that case the question of interpreting the features as resulting and factorial, i.e. dependent and independent, is not raised, and correlation is understood as the consistency, or synchronism, of the changes in the values of the features from object to object.

If objects are characterized by a whole set of quantitative features, one can immediately build a so-called correlation matrix, i.e. a square table in which the number of rows and columns equals the number of features and the cell at the intersection of each row and column contains the correlation coefficient of the corresponding pair of features.

The correlation coefficient itself has no direct substantive interpretation. Its square, however, called the coefficient of determination ($R^2$), does.

The coefficient of determination ($R^2$) is an indicator of the extent to which changes in the dependent feature are explained by changes in the independent one. More precisely, it is the proportion of the variance of the dependent feature that is explained by the influence of the independent feature.

If two variables are functionally linearly dependent (the points of the scatterplot lie on one straight line), then the change in the variable $y$ is completely explained by the change in the variable $x$, and this is exactly the case when the coefficient of determination equals one (the correlation coefficient may then be either 1 or -1). If two variables are linearly independent (the least squares method gives a horizontal line), then the variable $y$ "owes" none of its variation to the variable $x$, and the coefficient of determination equals zero. In intermediate cases the coefficient of determination shows what part of the changes in $y$ is explained by the changes in $x$ (it is sometimes convenient to express this value as a percentage).
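The limiting cases can be reproduced with a short sketch. The section names the formula of the linear correlation coefficient but does not reproduce it, so the sketch assumes the standard Pearson form: the sum of products of deviations divided by the square root of the product of the sums of squared deviations. All data are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Linear (Pearson) correlation coefficient of two equally long series."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


xs = [1, 2, 3, 4, 5]
rising = [3, 5, 7, 9, 11]    # y = 2x + 1, exact positive linear relationship
falling = [10, 8, 6, 4, 2]   # y = 12 - 2x, exact negative linear relationship

sym_xs = [-2, -1, 0, 1, 2]
parabola = [4, 1, 0, 1, 4]   # y = x^2, strong but non-linear dependence

r_up = pearson_r(xs, rising)
r_down = pearson_r(xs, falling)
r_par = pearson_r(sym_xs, parabola)

print(r_up, r_down, r_par)                 # 1.0 -1.0 0.0
print(r_up ** 2, r_down ** 2, r_par ** 2)  # 1.0 1.0 0.0
```

The parabola shows why scatterplots matter: y is fully determined by x, yet both the correlation coefficient and the coefficient of determination are zero.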


8. Pairwise and multiple linear regression. The multiple correlation coefficient. The substantive meaning of the regression coefficient, its significance, the concept of the t-statistic. The substantive meaning of the coefficient of determination $R^2$.

Regression analysis - A statistical method that allows you to build explanatory models based on the interaction of features.

The simplest case of a relationship is the pairwise relationship, i.e. the relationship between two features. It is assumed that the relationship between the two variables is, as a rule, causal, i.e. one of them depends on the other. The first (dependent) variable is called the resulting variable in regression analysis, the second (independent) one the factor variable. It is not always possible to determine unambiguously which of the two variables is independent and which is dependent. The relationship can often be regarded as two-way.

Pair Regression Equation : y = kx + b.

Most often several factors act on the dependent variable at once, and it is difficult to single out the only or the main one. For example, the income of an enterprise depends simultaneously on two factors of production: the number of workers and the power supply. Moreover, these two factors are not independent of each other.

The multiple regression equation: $y = k_1 x_1 + k_2 x_2 + \ldots + b$,

where $x_1, x_2, \ldots$ are the independent variables on which the studied (resulting) variable $y$ depends to one degree or another;

$k_1, k_2, \ldots$ are the coefficients of the corresponding variables (regression coefficients), which show by how much the value of the resulting variable changes when the corresponding independent variable changes by one unit while the others are held fixed.

The multiple regression equation defines a regression model that explains the behavior of the dependent variable. No regression model can tell which variable is dependent (the effect) and which are independent (the causes).

$R$, the multiple correlation coefficient, measures the combined influence of the independent features: the closeness of the relationship between the resulting feature and the whole set of independent features.

$R^2$, expressed as a percentage, shows what share of the variation of the resulting feature is accounted for by the model, i.e. by what percentage the variation of the feature $y$ is explained by the variation of the considered features $x_1, x_2, x_3$.

The t-statistic shows the level of statistical significance of each regression coefficient, i.e. its stability with respect to the sample.

$t = k / \Delta k$, where $k$ is the regression coefficient and $\Delta k$ is its standard error.

A coefficient is statistically significant if $|t| > 2$. The larger $|t|$, the better.

From $R^2$ we conclude what percentage of the variation in the result is explained by the features taken into account.
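For the pairwise case y = kx + b, the coefficient, its t-statistic and the coefficient of determination can be obtained with a minimal least squares sketch. It assumes the textbook formulas for ordinary least squares, including the standard error of the slope computed from the residual variance with n - 2 degrees of freedom. The data are hypothetical.

```python
import math


def pair_regression(xs, ys):
    """Least squares fit of y = k*x + b; returns (k, b, r_squared, t_of_k)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    syy = sum((y - mean_y) ** 2 for y in ys)

    k = sxy / sxx               # regression coefficient (slope)
    b = mean_y - k * mean_x     # intercept

    # Residual sum of squares: the part of the variation of y left unexplained.
    sse = sum((y - (k * x + b)) ** 2 for x, y in zip(xs, ys))
    r_squared = 1 - sse / syy

    # Standard error of k (assumed textbook form) and its t-statistic.
    std_err_k = math.sqrt(sse / (n - 2) / sxx)
    return k, b, r_squared, k / std_err_k


# Hypothetical observations lying close to the line y = 2x.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

k, b, r_squared, t = pair_regression(xs, ys)
print(round(k, 2), round(b, 2))      # 1.99 0.05
print(r_squared > 0.99, abs(t) > 2)  # True True
```

For multiple regression the same quantities are normally obtained from a statistical package; the pairwise case is shown because it can be written out by hand.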


9. Methods of multivariate statistical analysis. Cluster analysis. The concept of the hierarchical method and the k-means method. Multivariate classification using fuzzy sets.

MSA (multivariate statistical analysis) methods:

Cluster analysis

Factor analysis

Multidimensional scaling

Cluster analysis: grouping objects into classes on the basis of many features at once.

Cluster analysis methods:

1. Hierarchical method (hierarchical clustering tree):

The main idea of the hierarchical method is the sequential merging of the objects being grouped: first the closest ones, then ones that are farther and farther apart. The classification is built in successive steps, at each of which the two nearest groups of objects (clusters) are merged.
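The sequence of merges can be illustrated with a small sketch. The text does not say how the distance between two groups is measured, so the sketch assumes the nearest-neighbour (single linkage) rule and objects described by a single hypothetical feature.

```python
def hierarchical_merges(points):
    """Merge the two nearest clusters step by step.

    Returns a list of (merged cluster, distance at which it was formed).
    Distance between clusters: distance between their closest members
    (single linkage, an assumption of this sketch).
    """
    clusters = [[p] for p in points]  # start: every object is its own cluster
    history = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merged = clusters[i] + clusters[j]
        history.append((sorted(merged), d))
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return history


# Hypothetical values of one feature for five objects.
for cluster, distance in hierarchical_merges([1, 2, 6, 7, 20]):
    print(cluster, distance)
# [1, 2] 1
# [6, 7] 1
# [1, 2, 6, 7] 4
# [1, 2, 6, 7, 20] 13
```

The distances at which the merges happen are what a hierarchical tree displays: a large jump (here from 4 to 13) suggests where to cut the tree into groups.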

2. The k-means method.

It requires the number of classes (clusters) to be specified in advance, based on a hypothesis about the most probable number of classes, and it aims to minimize the variance within classes. The task of the method is to build the given number of clusters so that they differ from each other as much as possible.

The classification procedure begins with the construction of a given number of clusters obtained by random grouping of objects. Each cluster should consist of maximally "similar" objects, and the clusters themselves should be maximally "dissimilar" to each other.

The results of this method give the centers of all classes (as well as other descriptive statistics) for each of the original features, and also a graphical representation of how, and in which parameters, the resulting classes differ.
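The procedure can be sketched for objects described by a single feature. To keep the example reproducible, the initial random grouping is replaced by an arbitrary fixed one (object i goes to cluster i mod k); this substitution, the function name and the data are assumptions of the sketch.

```python
def k_means(points, k, max_iter=100):
    """Simple k-means for one-dimensional data; returns (labels, centers)."""
    # Arbitrary starting grouping that stands in for a random one.
    labels = [i % k for i in range(len(points))]
    centers = []
    for _ in range(max_iter):
        # Step 1: the center of each cluster is the mean of its members.
        centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            centers.append(sum(members) / len(members) if members else points[c])
        # Step 2: every object moves to the cluster with the nearest center.
        new_labels = [min(range(k), key=lambda c: abs(p - centers[c]))
                      for p in points]
        if new_labels == labels:  # nothing changed: the grouping is stable
            break
        labels = new_labels
    return labels, centers


# Hypothetical values of one feature for six objects.
labels, centers = k_means([1, 2, 3, 10, 11, 12], k=2)
print(labels)   # [0, 0, 0, 1, 1, 1]
print(centers)  # [2.0, 11.0]
```

The returned centers are the class means mentioned above; with several features each center would be a vector of means, one per feature.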

If the classifications obtained by different methods coincide, this confirms that the groups really exist (i.e. that the classification is reliable).


10. Methods of multivariate statistical analysis. Factor analysis and the purpose of its use. The concept of factor weights and the limits of their values; the share of the total variance explained by the factors.

Multivariate statistical analysis. Its purpose is to build a simplified, aggregated description of the set of objects.

MSA methods:

Cluster analysis

Factor analysis

Multidimensional scaling

Factor analysis rests on the idea that behind the complex relationships among the explicitly given features there is a relatively simpler structure that reflects the most essential aspects of the phenomenon under study, and that the "external" features are functions of hidden common factors that define this structure.

Purpose: to pass from a large number of features to a small number of factors.

In factor analysis all quantities included in the factor model are standardized, i.e. they are dimensionless quantities with arithmetic mean 0 and standard deviation 1.

The coefficient of the relationship between a feature and a common factor, which expresses the degree of influence of the factor on the feature, is called the factor loading of this feature on this common factor. It is a number between -1 and 1. The farther it is from 0, the stronger the relationship. A factor loading close to zero indicates that the factor has practically no effect on the feature.

The value (degree of manifestation) of a factor for an individual object is called the factor weight of the object for this factor. Factor weights make it possible to rank the objects by each factor. The greater the factor weight of an object, the more strongly it shows the aspect of the phenomenon or the pattern that the factor reflects. Factor weights are standardized values, so their mean is zero. Factor weights close to zero indicate an average degree of manifestation of the factor, positive weights indicate that this degree is above average, and negative weights that it is below average.

The table of factor weights has $n$ rows, one per object, and $k$ columns, one per common factor. The position of the objects on the axis of each factor shows, on the one hand, the order in which they are ranked by this factor and, on the other hand, how evenly or unevenly they are spaced and whether there are clusters of points representing objects, which makes it possible to pick out more or less homogeneous groups visually.


11. Types of qualitative signs. Nominal features, examples from historical sources. Contingency table. The coefficient of connection of nominal features, the limits of its values.

Nominal data are represented by categories whose order does not matter at all. No way of comparing them is defined other than a literal match or mismatch.

Examples of nominal variables:

· Nationality: English, Belarusian, German, Russian, Japanese, etc.

· Occupation: employee, doctor, military, teacher, etc.

· Profile of education: humanitarian, technical, medical, legal, etc.

In the case of the level of education we could still compare people in terms of "better-worse" or "higher-lower"; with nominal data we are deprived of even this possibility. The only correct comparison is to say that these persons "are all historians" or "are all not lawyers".

Contingency tables

A contingency table is a rectangular table whose rows correspond to the categories of one feature (for example, different social groups) and whose columns correspond to the categories of another (for example, party affiliation). Each object of the population falls into one cell of the table according to its category on each of the two features. The cells therefore contain the frequencies of the joint occurrence of the categories of the two features (the number of people who belong to a particular social group and to a particular party). The distribution of these frequencies within the table shows whether there is a relationship between the features. What does a connection between social status and party affiliation mean? Here the existence of a connection would indicate that members of different social groups have certain political preferences. Formally, the connection is understood as a more frequent (or, conversely, rarer) joint occurrence of particular combinations of categories than would be expected if objects fell into the cells purely at random (for example, a higher proportion of peasants in the Trudovik Party, and of nobles in the Cadet Party, than the shares of these social groups among all Duma deputies).
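A minimal sketch of how such a table and its expected frequencies might be built. The deputy counts are hypothetical and serve only to illustrate the comparison of observed with expected frequencies; the function names are likewise illustrative.

```python
from collections import Counter

def contingency_table(pairs):
    """Count joint occurrences of (row_category, column_category) pairs."""
    counts = Counter(pairs)
    rows = sorted({r for r, _ in pairs})
    cols = sorted({c for _, c in pairs})
    table = [[counts[(r, c)] for c in cols] for r in rows]
    return rows, cols, table

def expected_counts(table):
    """Cell counts expected if the two features were independent."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    return [[rs * cs / total for cs in col_sums] for rs in row_sums]

# Hypothetical data: one (social group, party) pair per deputy.
deputies = (
    [("peasant", "Trudovik")] * 30 + [("peasant", "Cadet")] * 10
    + [("noble", "Trudovik")] * 5 + [("noble", "Cadet")] * 35
)

rows, cols, observed = contingency_table(deputies)
expected = expected_counts(observed)

for name, obs_row, exp_row in zip(rows, observed, expected):
    print(name, dict(zip(cols, obs_row)), dict(zip(cols, exp_row)))
```

In this invented example the observed count of nobles among the Cadets (35) is well above the count expected under independence (22.5), which is the kind of excess the text describes as a connection.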


12. Types of qualitative signs. Rank signs, examples from historical sources. What are the limits of the values of the rank correlation coefficient? What coefficients should be used to assess the relationship between rank and nominal signs?

Qualitative (or categorical) data are divided into two types: ranked and nominal.

Rank data are represented by categories for which an order can be specified, i.e. the categories are comparable according to the principle "more-less" or "better-worse".

Examples of rank variables:

· Examination grades have a pronounced rank nature and are expressed in categories such as "excellent", "good", "satisfactory", etc.

· The level of education can be represented as a set of categories: "higher", "secondary", etc.

Of course, we can introduce a ranking scale and use it to rank all the people for whom we know their level of education or test score. However, is it true that "good" is as much worse than "excellent" as "satisfactory" is worse than "good"? Despite the fact that formally, in the case of grades, you can get a difference in points, it is hardly correct to measure the distance from "excellent" to "good" using the same rules as for the distance from Moscow to St. Petersburg. In the case of the level of education, it is especially clear that simple calculations are impossible, because there is no single rule for subtracting "secondary" education from "higher", even if we assign the code "3" to the higher education and the code "2" to the secondary.

The peculiarity of qualitative data does not mean that they cannot be analyzed using mathematical and statistical methods.

A series of objects ordered according to the degree to which they exhibit a certain property is called ranked; each member of such a series is assigned a rank.

Measures of the relationship between a pair of features, each of which ranks the studied set of objects, are called rank correlation coefficients in statistics.

These coefficients are built on the basis of the following three properties:

· if the ranked series for both features coincide completely (i.e., each object occupies the same place in both series), the rank correlation coefficient should be equal to +1, which means a complete positive correlation;

· if the objects in one series are arranged in reverse order compared with the other, the coefficient is -1, which means a complete negative correlation;

· in other situations, the values of the coefficient lie in the interval [-1, +1]; an increase in the absolute value of the coefficient from 0 to 1 characterizes an increase in the correspondence between the two ranked series.

The Spearman rank correlation coefficient ρ and the Kendall coefficient τ have these properties.

The Kendall coefficient gives a more conservative estimate of the correlation than the Spearman coefficient (the numerical value of τ is usually smaller in absolute value than that of ρ).
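The three properties can be checked on a small example. The sketch below uses the textbook formulas for rankings without ties (an assumption; tied ranks need corrections that are not shown), and the two rankings are hypothetical.

```python
from itertools import combinations

def spearman_rho(rank_x, rank_y):
    """Spearman's rho for two rankings of the same objects, no ties."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

def kendall_tau(rank_x, rank_y):
    """Kendall's tau for two rankings of the same objects, no ties."""
    n = len(rank_x)
    s = 0  # concordant pairs minus discordant pairs
    for i, j in combinations(range(n), 2):
        product = (rank_x[i] - rank_x[j]) * (rank_y[i] - rank_y[j])
        s += 1 if product > 0 else -1
    return 2 * s / (n * (n - 1))

# Hypothetical ranks of five objects on two features.
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

print(round(spearman_rho(x, y), 3))  # 0.8
print(round(kendall_tau(x, y), 3))   # 0.6
print(spearman_rho(x, x), kendall_tau(x, x))              # 1.0 1.0
print(spearman_rho(x, x[::-1]), kendall_tau(x, x[::-1]))  # -1.0 -1.0
```

For this pair of rankings the Kendall value (0.6) is smaller than the Spearman value (0.8), in line with the remark that it is the more conservative of the two.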

Relationship coefficients of qualitative features

To assess the relationship between qualitative features, we need a coefficient that reaches a definite maximum when the relationship is strongest and that allows different tables to be compared by the strength of the relationship between their features. Cramér's coefficient V is suitable for this purpose.

Based on the value of the chi-square statistic, Cramér's coefficient measures the strength of the relationship between two categorical variables by a number from 0 to 1, i.e. from a complete absence of association to the strongest possible association. The coefficient makes it possible to compare the dependencies between different features in order to reveal stronger and weaker relationships.
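A minimal sketch of the calculation, assuming the usual definition V = sqrt(chi2 / (n * (min(r, c) - 1))), where r and c are the numbers of rows and columns. The tables are hypothetical, and every row and column is assumed to have a non-zero total.

```python
from math import sqrt

def chi_square(table):
    """Pearson chi-square statistic of a contingency table."""
    total = sum(sum(row) for row in table)
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2

def cramers_v(table):
    """Cramer's V: 0 means no association, 1 the strongest association."""
    n = sum(sum(row) for row in table)
    k = min(len(table), len(table[0]))
    return sqrt(chi_square(table) / (n * (k - 1)))

# Hypothetical 2 x 2 tables.
print(cramers_v([[10, 10], [10, 10]]))          # 0.0, no association
print(cramers_v([[10, 0], [0, 10]]))            # 1.0, complete association
print(round(cramers_v([[35, 5], [10, 30]]), 3))
```

Because V is normalized by the table size, values from tables of different dimensions can be compared directly, which is the property the text asks for.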


13. Mathematical modeling of historical processes and phenomena. Definition of the term "model". Three types of models, examples of their use in historical research.

14. Differential equations as the main tool for constructing mathematical models of the theoretical type. Their features in comparison with models of the simulation and statistical types. An example of such a model.


Posted on http://www.allbest.ru/

Task 1

In a certain region, 12,390 crimes were committed in the current year and 11,800 crimes in the previous year. Calculate (in %) the growth rate and the rate of increase of the number of crimes registered in the current year relative to the previous one. Also calculate the crime rate for each year if the region's population was 1,475,000 at the end of the previous year and 1,770,000 at the end of the current year. Draw conclusions about the dynamics of crime in the region.

Solution: To obtain an accurate picture of crime, its dynamics, that is, its change over time, is of great importance. The dynamics of crime are characterized by the absolute growth (or decline) and by the growth rate and the rate of increase of crime, which are calculated according to certain formulas.

Crime growth rates are calculated on the basis of base indicators of dynamics, which involves comparing data for a number of years (and sometimes decades, if a wide coverage of material is needed) with a constant base, understood as the level of crime in the initial period of the analysis. This calculation allows criminologists to guarantee, to a large extent, the comparability of relative indicators calculated as percentages, which show how the crime of subsequent periods relates to that of the previous one.

In the calculation, the data of the base year are taken as 100%; the indicators obtained for subsequent years reflect only the percentage of growth, which makes the calculation accurate and the picture more objective. When operating with relative data, it is possible to exclude the effect that an increase or decrease in the number of residents who have reached the age of criminal responsibility has on the decrease or increase in crime.

The rate of increase in crime is calculated as a percentage. It shows by how much the subsequent crime level has increased or decreased compared with the previous period. By convention, the direction of the rate of increase is marked by a sign: plus if the percentage increases, minus if it decreases.

Under the conditions of our task, we apply the appropriate formulas and calculate the growth rate and the rate of increase of crime.

1) The growth rate of crime is calculated by the formula:

Tr = U / U2 * 100%,

where U is the crime level of the current period and U2 is the crime level of the previous period. Under the conditions of the problem, the growth rate of crime is 12390 / 11800 * 100% = 105%.

2) The rate of increase in crime is calculated by the following formula:

Tpr = Tr - 100%.

Under the conditions of the problem, the rate of increase is 105% - 100% = 5%.

The crime rate is a specific summary indicator of the total number of recorded crimes relative to the population. It expresses the number of crimes per 100,000, 10,000 or 1,000 of the population and is an objective measure of crime that allows its levels to be compared across regions and years.

The crime rate helps to more adequately assess the dynamics of the level of crime calculated per capita.

The crime rate is calculated by the formula:

KP = (P * 100000) / N,

where P is the absolute number of recorded crimes and N is the absolute number of the total population.

Both indicators are taken for the same territory and the same period. The number of crimes is usually calculated per 100,000 population. With small numbers of crimes and a small population (in a city, district or enterprise), the crime rate can be calculated per 10,000 or per 1,000 inhabitants. In any case, these numbers define the dimension of the coefficient, which must be stated: the number of crimes per 100,000 or per 10,000 population.

Let's calculate the crime rate in relation to the conditions of our problem:

1) KP = (12390 * 100000) / 1,770,000 = 700 (current year).

2) KP = (11800 * 100000) / 1,475,000 = 800 (previous year).

Crime in the region is decreasing in relative terms: although the absolute number of crimes grew by 5%, the population grew by 20% over the same period, so the crime rate fell from 800 to 700 crimes per 100,000 population (a decrease of 12.5%).
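The calculations of Task 1 can be reproduced with a short sketch. The function and variable names are illustrative; the input figures are those given in the task. With the percentage conversion applied, the growth rate comes to 105% and the rate of increase to 5%.

```python
def growth_rate(current, previous):
    """Growth rate in percent: current level relative to the previous one."""
    return current / previous * 100

def increase_rate(current, previous):
    """Rate of increase in percent: growth rate minus 100%."""
    return growth_rate(current, previous) - 100

def crime_rate(crimes, population, per=100_000):
    """Number of recorded crimes per `per` inhabitants."""
    return crimes * per / population

crimes_prev, crimes_cur = 11_800, 12_390
pop_prev, pop_cur = 1_475_000, 1_770_000

tr = growth_rate(crimes_cur, crimes_prev)
tpr = increase_rate(crimes_cur, crimes_prev)
kp_prev = crime_rate(crimes_prev, pop_prev)
kp_cur = crime_rate(crimes_cur, pop_cur)

print(round(tr, 2), round(tpr, 2))                     # 105.0 5.0
print(kp_prev, kp_cur)                                 # 800.0 700.0
print(round(increase_rate(pop_cur, pop_prev), 2))      # 20.0 (population)
print(round(increase_rate(kp_cur, kp_prev), 2))        # -12.5 (crime rate)
```

The last two lines show why the absolute and relative pictures differ: the population grew faster than the number of crimes, so the rate per 100,000 inhabitants went down.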

Task 2

The ages of 11 young specialists accepted into service at the institution in the current year were 19, 25, 21, 23, 23, 23, 25, 20, 18, 20 and 21 years, respectively. Summarize and group the data in a statistical frequency table. For clarity, build a frequency polygon, and find the modal, median and average age of the hired employees.

Solution: Grouping is the division of the population into groups that are homogeneous in some respect. From the point of view of the individual units of the population, grouping is the union of individual units into groups that are homogeneous in some respect.

The grouping method is based on the following categories: the grouping attribute, the grouping interval and the number of groups.

The grouping attribute is the attribute by which the individual units of the population are united into homogeneous groups.

The interval outlines the quantitative boundaries of the groups. As a rule, it is the gap between the maximum and minimum values of the attribute in the group.

Determining the number of groups.

The number of groups k is determined approximately by the Sturges formula, where n is the number of observations:

k = 1 + 3.2 * lg(n) = 1 + 3.2 * lg(11) ≈ 4.

The width of the interval is:

h = (Xmax - Xmin) / k = (25 - 18) / 4 = 1.75,

where Xmax is the maximum value of the grouping attribute in the population and Xmin is its minimum value. Let us define the boundaries of the groups.

Group number    Lower bound    Upper bound
1               18             19.75
2               19.75          21.5
3               21.5           23.25
4               23.25          25

The same feature value serves as the upper and lower boundaries of two adjacent (previous and subsequent) groups.

For each value of the series, we calculate how many times it falls into a particular interval. To do this, sort the series in ascending order.

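Sorted series: 18, 19, 20, 20, 21, 21, 23, 23, 23, 25, 25.

Group number    Interval         Frequency fi
1               18 - 19.75       2
2               19.75 - 21.5     4
3               21.5 - 23.25     3
4               23.25 - 25       2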

The frequency polygon is a graph of the distribution of a random variable: a broken line connecting the points whose abscissas are the midpoints of the grouping intervals and whose ordinates are the frequencies of these intervals.

Mean:

Mode. The mode is the most common value of a feature among the units of a given population. For an interval series it is calculated by the formula:

Mo = x0 + h * (f2 - f1) / ((f2 - f1) + (f2 - f3)),

where x0 is the beginning of the modal interval; h is the interval width; f2 is the frequency of the modal interval; f1 is the premodal frequency; f3 is the postmodal frequency.

We choose 19.75 as the beginning of the modal interval, since this interval has the highest frequency: Mo = 19.75 + 1.75 * (4 - 2) / ((4 - 2) + (4 - 3)) = 20.92.

The most common value of the series is 20.92.

Median. The median divides the sample into two parts: half of the values are less than the median and half are greater.

In an interval distribution series, one can immediately specify only the interval in which the mode or median lies. The median corresponds to the value in the middle of the ranked series. The median interval is 19.75-21.5, because it is the first interval whose accumulated frequency S exceeds half of the total sum of frequencies.

Thus, 50% of the population units are less than 21.28.
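The whole of Task 2 can be retraced with a short sketch. It assumes the standard interpolation formulas for the mode and median of an interval series, including Me = x0 + h * (n/2 - S) / f for the median, where S is the accumulated frequency before the median interval and f is the frequency of that interval. The source's own figure for the mean is not preserved, so the value printed here is simply the ungrouped arithmetic mean of the eleven ages.

```python
from math import log10
from statistics import mean

ages = [19, 25, 21, 23, 23, 23, 25, 20, 18, 20, 21]

# Number of groups (Sturges formula as written in the text) and interval width.
k = round(1 + 3.2 * log10(len(ages)))
h = (max(ages) - min(ages)) / k
bounds = [min(ages) + i * h for i in range(k + 1)]

# Frequency of each interval; the maximum value goes into the last interval.
freqs = [0] * k
for age in ages:
    index = min(int((age - bounds[0]) / h), k - 1)
    freqs[index] += 1

# Mode by interpolation inside the modal interval.
m = freqs.index(max(freqs))
f_prev = freqs[m - 1] if m > 0 else 0
f_next = freqs[m + 1] if m < k - 1 else 0
mode = bounds[m] + h * (freqs[m] - f_prev) / (
    (freqs[m] - f_prev) + (freqs[m] - f_next)
)

# Median by interpolation inside the first interval whose accumulated
# frequency reaches half of the total.
half = sum(freqs) / 2
accumulated = 0
for i, f in enumerate(freqs):
    if accumulated + f >= half:
        median = bounds[i] + h * (half - accumulated) / f
        break
    accumulated += f

print(k, h, bounds)       # 4 1.75 [18.0, 19.75, 21.5, 23.25, 25.0]
print(freqs)              # [2, 4, 3, 2]
print(round(mode, 2))     # 20.92
print(round(median, 2))   # 21.28
print(round(mean(ages), 2))
```

Plotting the interval midpoints against `freqs` and joining the points with straight segments gives the frequency polygon the task asks for.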

Task 3

Determine the required sample size for studying the average age of certified employees of the Federal Penitentiary Service of Russia, provided that the standard deviation is 10 years, and the maximum allowable sampling error should not exceed 5%.

We look for the solution using the formula for the sample size under sampling with replacement.

Φ(t) = γ/2 = 0.95/2 = 0.475; according to the table of the Laplace function, this value corresponds to t = 1.96.

Estimated standard deviation s = 10; sampling error e = 5.
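The sample size itself is not preserved in the text. A minimal sketch, assuming the standard formula for sampling with replacement, n = t^2 * s^2 / e^2, and the values stated above (t = 1.96, s = 10, e = 5); the result is rounded up because a sample size must be a whole number.

```python
from math import ceil

def sample_size_with_replacement(t, s, e):
    """Sample size for estimating a mean under sampling with replacement.

    t: confidence coefficient, s: standard deviation,
    e: maximum allowable sampling error (same units as s).
    """
    return ceil((t * s / e) ** 2)

n = sample_size_with_replacement(t=1.96, s=10, e=5)
print(n)  # 16
```

Under these assumptions about 16 employees would need to be surveyed; a smaller allowable error or a larger standard deviation increases the required sample quadratically.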

Task 4

The following table provides official departmental statistics on the distribution of convicts by terms of imprisonment (punishment) for 2002-2011, posted on the official website of the Federal Penitentiary Service of Russia: www.fsin.su. Find the range and coefficient of variation of the number of convicts for each calendar year and draw conclusions about the homogeneity of the structure of this statistical feature.

The main indicator characterizing the homogeneity of data is the coefficient of variation. In statistics, it is generally accepted that if the value of the coefficient is less than 33%, then the data set is homogeneous, if more than 33%, then it is heterogeneous.

The coefficient of variation is the ratio of the standard deviation to the mean, expressed as a percentage: v = (σ / x̄) * 100%.

Since v ≤ 30%, the population is homogeneous and the variation is weak. The results obtained can be trusted.

[Table, values not preserved. Headings recovered: Term of punishment: 1 to 3 years; 3 to 5 years; 5 to 10 years; 10 to 15 years; over 15 years. Summary indicators: maximum value (MAX function); minimum value (MIN function); range of variation; average value (AVERAGE function); standard deviation (STDEV function); coefficient of variation.]

Simple arithmetic mean:

Mode

Median

We find the middle of the ranked series: h = (n + 1) / 2 = (7 + 1) / 2 = 4. This position corresponds to the value 70580. Therefore, the median Me = 70580.

Measures of variation.

R = X max - X min.

R = 295916 - 2250 = 293666.

Mean linear deviation

Each value of the series differs from the mean by an average of 90895.71.

Variance

Standard deviation (mean sampling error).

Each value of the series differs from the average value of 103008 by an average of 107169.83.

Coefficient of variation: a measure of the relative spread of population values; it shows what proportion of the mean the average spread represents.

Because v>

or

Oscillation coefficient

Simple arithmetic mean:

Mode

There is no mode (all values of the series are unique).

Median. The median is the value of the feature that divides the units of the ranked series into two equal parts. The median corresponds to the value in the middle of the ranked series.

We find the middle of the ranked series: h = (n + 1) / 2 = (7 + 1) / 2 = 4. This position corresponds to the value 76186. Therefore, the median Me = 76186.

Measures of variation. Absolute measures of variation.

The range of variation is the difference between the maximum and minimum values of the attribute in the primary series.

R = X max - X min

R = 291112 - 3101 = 288011.

Mean linear deviation: calculated to take into account the differences of all units of the studied population.

Each value of the series differs from the mean by an average of 83422.69.

Variance: characterizes the spread of values around the mean (a measure of dispersion, i.e. of deviation from the mean).

Standard deviation (mean sampling error).

Each value of the series differs from the average value of 97334.29 by an average of 100750.25.

Relative measures of variation. The relative measures of variation include the oscillation coefficient, the linear coefficient of variation and the relative linear deviation.

Coefficient of variation: a measure of the relative spread of population values; it shows what proportion of the mean the average spread represents.

Since v > 70%, the population approaches the limit of heterogeneity, and the variation is strong.

In such cases, practical studies use various statistical methods to bring the population to a homogeneous form.

Linear coefficient of variation (relative linear deviation): the ratio of the mean absolute deviation from the mean to the mean itself.

Oscillation coefficient: reflects the relative fluctuation of the extreme values of the feature around the mean.
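The relative measures can be obtained directly from the summary values reported above for this series (mean 97334.29, standard deviation 100750.25, mean linear deviation 83422.69, range 288011). The sketch below assumes the usual definitions, each measure being the corresponding absolute measure divided by the mean and expressed as a percentage; the function names are illustrative, and the 33% threshold is the one stated in the text.

```python
def relative_variation(mean, std_dev, mean_linear_dev, value_range):
    """Relative measures of variation, in percent of the mean."""
    return {
        "coefficient_of_variation": std_dev / mean * 100,
        "linear_coefficient_of_variation": mean_linear_dev / mean * 100,
        "oscillation_coefficient": value_range / mean * 100,
    }

def is_homogeneous(coefficient_of_variation, threshold=33.0):
    """Homogeneity rule from the text: v below the threshold (33%)."""
    return coefficient_of_variation < threshold

measures = relative_variation(
    mean=97334.29,
    std_dev=100750.25,
    mean_linear_dev=83422.69,
    value_range=288011,
)

for name, value in measures.items():
    print(f"{name}: {value:.2f}%")
print(is_homogeneous(measures["coefficient_of_variation"]))  # False
```

With a coefficient of variation above 100% the series is far beyond the 33% threshold, which matches the conclusion that the population is heterogeneous.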

Simple arithmetic mean:

Mode. The mode is the most common value of a feature among the units of a given population.

There is no mode (all values of the series are unique).

Median. The median is the value of the feature that divides the units of the ranked series into two equal parts. The median corresponds to the value in the middle of the ranked series.

We find the middle of the ranked series: h = (n + 1) / 2 = (7 + 1) / 2 = 4. This position corresponds to the value 71093. Therefore, the median Me = 71093.

Measures of variation. Absolute measures of variation.

The range of variation is the difference between the maximum and minimum values of the attribute in the primary series.

R = X max - X min

R = 243852 - 3856 = 239996.

Mean linear deviation: calculated to take into account the differences of all units of the studied population.

Each value of the series differs from the mean by an average of 68998.08.

Variance: characterizes the spread of values around the mean (a measure of dispersion, i.e. of deviation from the mean).

Standard deviation (mean sampling error).

Each value of the series differs from the average value of 85765.57 by an average of 82541.55.

Relative measures of variation. The relative measures of variation include the oscillation coefficient, the linear coefficient of variation and the relative linear deviation.

Coefficient of variation: a measure of the relative spread of population values; it shows what proportion of the mean the average spread represents.

Since v > 70%, the population approaches the limit of heterogeneity, and the variation is strong.

The coefficient of variation is much greater than 33%. Consequently, the population under consideration is heterogeneous and its mean is not sufficiently typical. In such cases, practical studies use various statistical methods to bring the population to a homogeneous form.

Linear coefficient of variation (relative linear deviation): the ratio of the mean absolute deviation from the mean to the mean itself.

Oscillation coefficient: reflects the relative fluctuation of the extreme values of the feature around the mean.

Simple arithmetic mean:

Mode. The mode is the most common value of a feature among the units of a given population.

There is no mode (all values of the series are unique).

Median. The median is the value of the feature that divides the units of the ranked series into two equal parts. The median corresponds to the value in the middle of the ranked series.

We find the middle of the ranked series: h = (n + 1) / 2 = (7 + 1) / 2 = 4. This position corresponds to the value 74588. Therefore, the median Me = 74588.

Measures of variation. Absolute measures of variation.

The range of variation is the difference between the maximum and minimum values of the attribute in the primary series.

R = X max - X min

R = 242984 - 5304 = 237680.

Mean linear deviation: calculated to take into account the differences of all units of the studied population.

Each value of the series differs from the mean by an average of 73148.73.

Variance: characterizes the spread of values around the mean (a measure of dispersion, i.e. of deviation from the mean).

Standard deviation (mean sampling error).

Each value of the series differs from the average value of 92104.14 by an average of 82873.1.

Relative measures of variation. The relative measures of variation include the oscillation coefficient, the linear coefficient of variation and the relative linear deviation.

Coefficient of variation: a measure of the relative spread of population values; it shows what proportion of the mean the average spread represents.

Since v > 70%, the population approaches the limit of heterogeneity, and the variation is strong.

The coefficient of variation is much greater than 33%. Consequently, the population under consideration is heterogeneous and its mean is not sufficiently typical. In such cases, practical studies use various statistical methods to bring the population to a homogeneous form.

Linear coefficient of variation (relative linear deviation): the ratio of the mean absolute deviation from the mean to the mean itself.

Oscillation coefficient: reflects the relative fluctuation of the extreme values of the feature around the mean.

Simple arithmetic mean:

Mode. The mode is the most common value of a feature among the units of a given population.

There is no mode (all values of the series are unique).

Median. The median is the value of the feature that divides the units of the ranked series into two equal parts. The median corresponds to the value in the middle of the ranked series.

We find the middle of the ranked series: h = (n + 1) / 2 = (7 + 1) / 2 = 4. This position corresponds to the value 76678. Therefore, the median Me = 76678.

Measures of variation. Absolute measures of variation.

The range of variation is the difference between the maximum and minimum values of the attribute in the primary series.

R = X max - X min

R = 249346 - 6536 = 242810.

Mean linear deviation: calculated to take into account the differences of all units of the studied population.

Each value of the series differs from the mean by an average of 79680.53.

Variance: characterizes the spread of values around the mean (a measure of dispersion, i.e. of deviation from the mean).

Standard deviation (mean sampling error).

Each value of the series differs from the average value of 99551.71 by an average of 87389.04.

Relative measures of variation. The relative measures of variation include the oscillation coefficient, the linear coefficient of variation and the relative linear deviation.

Coefficient of variation: a measure of the relative spread of population values; it shows what proportion of the mean the average spread represents.

Since v > 70%, the population approaches the limit of heterogeneity, and the variation is strong.

The coefficient of variation is much greater than 33%. Consequently, the population under consideration is heterogeneous and its mean is not sufficiently typical. In such cases, practical studies use various statistical methods to bring the population to a homogeneous form.

Linear coefficient of variation (relative linear deviation): the ratio of the mean absolute deviation from the mean to the mean itself.

Oscillation coefficient: reflects the relative fluctuation of the extreme values of the feature around the mean.

Simple arithmetic mean:

Mode. The mode is the most common value of a feature among the units of a given population.

There is no mode (all values of the series are unique).

Median. The median is the value of the feature that divides the units of the ranked series into two equal parts. The median corresponds to the value in the middle of the ranked series.

We find the middle of the ranked series: h = (n + 1) / 2 = (7 + 1) / 2 = 4. This position corresponds to the value 76461. Therefore, the median Me = 76461.

Measures of variation. Absolute measures of variation.

The range of variation is the difference between the maximum and minimum values of the attribute in the primary series.

R = X max - X min

R = 254722 - 6704 = 248018.

Mean linear deviation: calculated to take into account the differences of all units of the studied population.

Each value of the series differs from the mean by an average of 82302.82.

Variance: characterizes the spread of values around the mean (a measure of dispersion, i.e. of deviation from the mean).

Standard deviation (mean sampling error).

Each value of the series differs from the average value of 102346.71 by an average of 89787.88.

Relative measures of variation. The relative measures of variation include the oscillation coefficient, the linear coefficient of variation and the relative linear deviation.

Coefficient of variation: a measure of the relative spread of population values; it shows what proportion of the mean the average spread represents.

Since v > 70%, the population approaches the limit of heterogeneity, and the variation is strong.

The coefficient of variation is much greater than 33%. Consequently, the population under consideration is heterogeneous and its mean is not sufficiently typical. In such cases, practical studies use various statistical methods to bring the population to a homogeneous form.

Linear coefficient of variation (relative linear deviation): the ratio of the mean absolute deviation from the mean to the mean itself.

Oscillation coefficient: reflects the relative fluctuation of the extreme values of the feature around the mean.

Simple arithmetic mean:

Mode. The mode is the most common value of a feature among the units of a given population.

There is no mode (all values of the series are unique).

Median. The median is the value of the feature that divides the units of the ranked series into two equal parts. The median corresponds to the value in the middle of the ranked series.

We find the middle of the ranked series: h = (n + 1) / 2 = (7 + 1) / 2 = 4. This position corresponds to the value 78959. Therefore, the median Me = 78959.

Measures of variation. Absolute measures of variation.

The range of variation is the difference between the maximum and minimum values of the attribute in the primary series.

R = X max - X min

R = 261334 - 7635 = 253699.

Mean linear deviation: calculated to take into account the differences of all units of the studied population.

Each value of the series differs from the mean by an average of 83791.55.

Variance: characterizes the spread of values around the mean (a measure of dispersion, i.e. of deviation from the mean).

Standard deviation (mean sampling error).

Each value of the series differs from the average value of 104898.86 by an average of 91616.15.

Relative measures of variation. The relative measures of variation include the oscillation coefficient, the linear coefficient of variation and the relative linear deviation.

Coefficient of variation: a measure of the relative spread of population values; it shows what proportion of the mean the average spread represents.

Since v > 70%, the population approaches the limit of heterogeneity, and the variation is strong.

The coefficient of variation is much greater than 33%. Consequently, the population under consideration is heterogeneous and its mean is not sufficiently typical. In such cases, practical studies use various statistical methods to bring the population to a homogeneous form.

Linear coefficient of variation (relative linear deviation): the ratio of the mean absolute deviation from the mean to the mean itself.

Oscillation coefficient: reflects the relative fluctuation of the extreme values of the feature around the mean.

Simple arithmetic mean:

Mode. The mode is the most common value of a feature among the units of a given population.

There is no mode (all values of the series are unique).

Median. The median is the value of the feature that divides the units of the ranked series into two equal parts. The median corresponds to the value in the middle of the ranked series.

We find the middle of the ranked series: h = (n + 1) / 2 = (7 + 1) / 2 = 4. This position corresponds to the value 75916. Therefore, the median Me = 75916.

Measures of variation. Absolute measures of variation.

The range of variation is the difference between the maximum and minimum values of the attribute in the primary series.

R = X max - X min

R = 263863 - 8145 = 255718.

Mean linear deviation: calculated to take into account the differences of all units of the studied population.

Each value of the series differs from the mean by an average of 82767.96.

Variance: characterizes the spread of values around the mean (a measure of dispersion, i.e. of deviation from the mean).

Standard deviation (mean sampling error).

Each value of the series differs from the average value of 103440.71 by an average of 91207.92.

Relative measures of variation. The relative measures of variation include the oscillation coefficient, the linear coefficient of variation and the relative linear deviation.

Coefficient of variation: a measure of the relative spread of population values; it shows what proportion of the mean the average spread represents.

Since v > 70%, the population approaches the limit of heterogeneity, and the variation is strong.

The coefficient of variation is much greater than 33%. Consequently, the population under consideration is heterogeneous and its mean is not sufficiently typical. In such cases, practical studies use various statistical methods to bring the population to a homogeneous form.

Linear coefficient of variation (relative linear deviation): the ratio of the mean absolute deviation from the mean to the mean itself.

Oscillation coefficient: reflects the relative fluctuation of the extreme values of the feature around the mean.

simple arithmetic mean:

Fashion. Mode is the most common value of a feature in units of a given population.

There is no mode (all values ​​of the series are individual).

Median. The median is the value of the feature that divides the units of the ranked series into two parts. The median corresponds to the option in the middle of the range.

We find the middle of the ranked series: h = (n + 1) / 2 = (7 + 1) / 2 = 4. This number corresponds to the value of the series 78019. Therefore, the median Me = 78019.

Variation indicators. Absolute Variation Rates.

The range of variation is the difference between the maximum and minimum values ​​of the attribute of the primary series.

R = X max - X min

R = 260094-7798 = 252296.

Average linear deviation: calculated in order to take into account the differences of all units of the population under study.

On average, each value of the series differs from the mean by 77827.76.

Variance: characterizes the spread of the values around the mean (a measure of dispersion, i.e. of deviation from the mean).

Standard deviation (mean sampling error).

Each value of the series differs from the average value of 99212.29 by an average of 88081.39.

Relative measures of variation. The relative indicators of variation include: oscillation coefficient, linear coefficient of variation, relative linear deviation.

Coefficient of variation: a measure of the relative spread of the population values. It shows what proportion of the mean the average spread amounts to.

Since v > 70%, the population approaches the edge of heterogeneity, and the variation is strong.

The coefficient of variation is much greater than 33%. Consequently, the population under consideration is heterogeneous, and its mean is not sufficiently typical. In practical studies, various statistical methods are then used to bring the population to a homogeneous form.

Linear coefficient of variation (relative linear deviation): characterizes the ratio of the average absolute deviation from the mean to the mean itself.

Oscillation coefficient: reflects the relative fluctuation of the extreme values of the attribute around the mean.

simple arithmetic mean:

Mode. The mode is the most common value of the attribute among the units of a given population.

There is no mode (all values of the series are distinct).

Median. The median is the value of the attribute that divides the units of the ranked series into two equal parts. It corresponds to the variant in the middle of the ranked series.

We find the middle of the ranked series: h = (n + 1) / 2 = (7 + 1) / 2 = 4. This position corresponds to the value 72248 of the series. Therefore, the median Me = 72248.

Variation indicators. Absolute Variation Rates.

The range of variation is the difference between the maximum and minimum values ​​of the attribute of the primary series.

R = X max - X min

R = 242137-7173 = 234964.

Average linear deviation: calculated in order to take into account the differences of all units of the population under study.

On average, each value of the series differs from the mean by 70459.02.

Variance: characterizes the spread of the values around the mean (a measure of dispersion, i.e. of deviation from the mean).

Standard deviation (mean sampling error).

Each value of the series differs from the average value of 91375.14 by an average of 80674.43.

Relative measures of variation. The relative indicators of variation include: oscillation coefficient, linear coefficient of variation, relative linear deviation.

Coefficient of variation: a measure of the relative spread of the population values. It shows what proportion of the mean the average spread amounts to.

Since v > 70%, the population approaches the edge of heterogeneity, and the variation is strong.

The coefficient of variation is much greater than 33%. Consequently, the population under consideration is heterogeneous, and its mean is not sufficiently typical. In practical studies, various statistical methods are then used to bring the population to a homogeneous form.

Linear coefficient of variation (relative linear deviation): characterizes the ratio of the average absolute deviation from the mean to the mean itself.

Oscillation coefficient: reflects the relative fluctuation of the extreme values of the attribute around the mean.
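The absolute and relative measures of variation listed above can be sketched in a few lines of Python. This is only an illustration: the function name `variation_summary` and the seven-value series are hypothetical (they are not the data of the task), and the standard deviation is assumed to be the population form (division by n).

```python
from statistics import fmean, median, pstdev


def variation_summary(values):
    """Return absolute and relative variation measures for an ungrouped series."""
    n = len(values)
    mean = fmean(values)
    spread = max(values) - min(values)                       # range of variation R
    mean_abs_dev = sum(abs(x - mean) for x in values) / n    # average linear deviation
    std = pstdev(values)                                     # population standard deviation
    return {
        "mean": mean,
        "median": median(values),
        "range": spread,
        "mean_linear_deviation": mean_abs_dev,
        "variance": std ** 2,
        "std": std,
        "coef_variation_pct": 100 * std / mean,                  # coefficient of variation
        "linear_coef_variation_pct": 100 * mean_abs_dev / mean,  # linear coefficient of variation
        "oscillation_coef_pct": 100 * spread / mean,             # oscillation coefficient
    }


# Hypothetical series, chosen only so that the arithmetic is easy to follow.
series = [10, 20, 30, 40, 50, 60, 70]
summary = variation_summary(series)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

With a coefficient of variation above 33%, the same reading as in the text applies: the mean of such a series would not be considered typical.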

Task 5

In the conditions of the previous task, regroup the given intervals of sentences in order to improve the relative indicators of the variation of the attribute in 2010. Build histograms of the distribution of convicts by terms of imprisonment (punishment) for 2010 before and after the grouping of data and draw conclusions about the homogeneity of the structure of the statistical feature under study.

Solution:

Since v > 30% but v < 70%, the variation is moderate.

The coefficient of variation is much more than 33%. Consequently, the considered set is heterogeneous and the average for it is insufficiently typical.

Let's rearrange the data as follows:

Group 1 combines the intervals up to a year, a year, and from 1 to 3 years: 156978 in total.

Group 2 includes the whole group over 3 to 5 years and 1/5 of the group over 5 to 10 years: 1/5 * 260094 + 168651 = 220669.8.

Group 3 includes 3/5 of the group over 5 to 10 years: 3/5 * 260094 = 156056.4.

Group 4: (1/5 * 260094) + (1/5 * 78019) = 67622.6.

Group 5: 3/5 * 78019 = 46811.4.

Group 6: 30744 + (1/5 * 78019) = 46347.8.
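The regrouping arithmetic can be checked with a short sketch. The counts are taken from the text; the variable names are hypothetical, and the labels of the last two source intervals are not given here, so they are named generically.

```python
# Counts from the text; variable names are illustrative only.
up_to_3_years = 156978     # up to a year, a year, and 1-3 years combined
over_3_to_5 = 168651
over_5_to_10 = 260094
interval_a = 78019         # next source interval (label not given in the text)
interval_b = 30744         # last source interval (label not given in the text)

# New groups, built from whole intervals and fifths of intervals.
groups = [
    up_to_3_years,
    over_3_to_5 + over_5_to_10 / 5,
    3 * over_5_to_10 / 5,
    over_5_to_10 / 5 + interval_a / 5,
    3 * interval_a / 5,
    interval_b + interval_a / 5,
]

original_total = up_to_3_years + over_3_to_5 + over_5_to_10 + interval_a + interval_b
regrouped_total = sum(groups)

print([round(g, 1) for g in groups])
print(original_total, round(regrouped_total, 1))
```

The regrouped counts add up to the original total, so the regrouping only redistributes units between intervals.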

Histogram. To draw a conclusion about the homogeneity of the statistical attribute under study, we calculate the coefficient of variation:

Since v > 30% but v < 70%, the variation is moderate.

The coefficient of variation is much more than 33%. Consequently, the considered set is heterogeneous and the average for it is insufficiently typical.

Task 6

Briefly state (abstract, on 1-2 pages) the content and results of a recent official statistical study in the social and legal sphere (topics - of your choice, links to Internet resources - are required), draw conclusions and put forward appropriate statistical hypotheses for the short term perspective.

As an official statistical study, a study on overdue wage arrears as of December 1, 2015 was taken.

As of December 1, 2015, according to data from organizations (other than small businesses), the total wage debt for the observed types of economic activity amounted to 3,900 million rubles and, compared to November 1, 2015, increased by 395 million rubles (by 11.3%).

Overdue wages due to the organizations' lack of own funds amounted to 3,818 million rubles as of December 1, 2015, or 97.9% of the total overdue debt. Compared to November 1, 2015, this debt increased by 389 million rubles (by 11.3%). Debt due to the untimely receipt of funds from budgets of all levels amounted to 82 million rubles and increased by 6 million rubles (by 7.7%) compared to November 1, 2015. Within it, debt from the federal budget amounted to 62 million rubles and decreased by 6 million rubles (by 8.6%) compared to November 1, 2015; debt from the budgets of the subjects of the Russian Federation amounted to 1.1 million rubles (an increase of 0.2 million rubles, or 20.7%); and debt from local budgets amounted to 19 million rubles (an increase of 12 million rubles, or 2.5 times).

In mining, manufacturing, healthcare and social services, fisheries and fish farming, 100% of overdue wage arrears are formed due to the lack of own funds of organizations.

In the total amount of overdue wage arrears, 37% falls on manufacturing, 29% - on construction, 9% - on the production and distribution of electricity, gas and water, 7% - on transport, 6% - on mining, 5% - for agriculture, hunting and the provision of services in these areas, logging.

The volume of arrears on wages as of December 1, 2015 amounted to less than 1% of the monthly wage bill of workers in the observed types of economic activity.

Wage arrears for the last month, for which accruals were made, in the total amount of overdue debts amounted to an average of 29%: production and distribution of electricity, gas and water - 75%, activities in the field of education - 37%, healthcare and the provision of social services - 35%, scientific research and development - 32%, construction - 29%, transport - 23%, manufacturing - 22%.

Of the total amount of unpaid wages, debts formed in 2014 account for 457 million rubles (11.7%), and debts formed in 2013 and earlier account for 657 million rubles (16.8%).

In general, observing the dynamics of wage arrears (http://www.gks.ru/bgd/free/B04_03/IssWWW.exe/Stg/d06/Image 5258.gif), we can conclude that a significant decline will occur in January and February 2016.

The largest shares of the debt fall on manufacturing (37%) and construction (29%). Most likely this is due to a decrease in consumer demand for products and a corresponding decrease in profits.

Let us put forward a hypothesis. From January 2016, the debt will decrease owing to the distribution of the annual budget for the next year, which takes into account the partial repayment of wage arrears, and will amount to 2,700 million rubles.

To test the hypothesis, we take this table as a basis: http://www.gks.ru/bgd/free/B04_03/IssWWW.exe/Stg/d06/Image5258.gif.

Let us construct a discrete variational series. To do this, sort the series in ascending order and count the number of repetitions for each element of the series.

Let's calculate the average:

Let's calculate the variance. The variance characterizes the spread of the values around the mean (a measure of dispersion, i.e. of deviation from the mean).

Using a two-tailed test with α = 0.05, test this hypothesis if, in a sample of n = 24 months, the mean was 2741.25 and the variance is known and equal to σ² = 193469.27.

Solution. Standard deviation: S = √193469.27 = 439.851.

The null hypothesis H 0 is put forward that the mathematical expectation of the general population equals the number m 0 = 2700.

Alternative hypothesis:

H 1: m ≠ 2700, the critical region is two-sided.

To test the null hypothesis, the following random variable is used:

T = (x - m 0) / (S / √n),

where x is the sample mean and S is the standard deviation of the general population.

If the null hypothesis is true, the random variable T has a standard normal distribution. The critical value of the statistic T is determined by the type of the alternative hypothesis:

P(|T| < t cr) = 1 - α.

Let us find the experimental value of the statistic T:

T = (2741.25 - 2700) / (439.851 / √24) ≈ 0.459.

In place of the true value of the standard deviation, its estimate S = 439.851 is used. Strictly speaking, this substitution is justified for large samples (n > 30), while here n = 24, so the result is approximate.

Φ(t cr) = (1 - α) / 2 = (1 - 0.05) / 2 = 0.475.

From the table of the Laplace function, we find the t cr for which Φ(t cr) = 0.475: t cr = 1.96.

The experimental value of the criterion T did not fall into the critical region |T| ≥ t cr, so the null hypothesis is not rejected. The mathematical expectation of the general population can be taken to be 2700.
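The test can be reproduced with a minimal Python sketch. It assumes a two-sided z-test with known variance, which matches the Laplace-function value 0.475 used in the solution; the variable names are illustrative, and the numbers are those given in the task.

```python
from math import sqrt
from statistics import NormalDist

m0 = 2700.0             # hypothesised mean, million rubles
sample_mean = 2741.25   # mean over the sample
variance = 193469.27    # variance, treated as known
n = 24                  # number of months in the sample
alpha = 0.05            # significance level

std = sqrt(variance)                           # about 439.851
t_obs = (sample_mean - m0) / (std / sqrt(n))   # observed value of the statistic
t_cr = NormalDist().inv_cdf(1 - alpha / 2)     # two-sided critical value, about 1.96
reject_h0 = abs(t_obs) >= t_cr

print(f"S = {std:.3f}, T = {t_obs:.3f}, t_cr = {t_cr:.2f}, reject H0: {reject_h0}")
```

Here `NormalDist().inv_cdf(1 - alpha / 2)` plays the role of the table lookup: the Laplace function value 0.475 corresponds to a cumulative probability of 0.975.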



abstract

Mean values ​​and indicators of variation

1. Essence of averages in statistics

2. Types of averages and methods for their calculation

3. Main indicators of variation and their significance in statistics

1. Essence of averages in statistics

In studying mass socio-economic phenomena, it becomes necessary to identify their common properties, typical sizes, and characteristic features. The need for a generalizing average arises when the attributes that characterize the units of the population under study vary quantitatively. For example, the daily output of weavers in a textile factory depends on the general conditions of production: the weavers use the same raw materials, work on the same machines, and so on. At the same time, the hourly output of individual weavers varies, since it depends on the individual characteristics of each weaver (qualifications, professional experience, etc.). To characterize the daily output of all the weavers of the enterprise, the average daily output must be calculated, since only this indicator reflects the conditions of production common to all the weavers.

Thus, the calculation of average generalizing indicators means distraction (abstraction) from the features reflected in the magnitude of the trait in individual units, and the identification of typical features and properties that are common to a given set.

Thus, the average value in statistics is a generalized quantitative characteristic of an attribute in a statistical population. It expresses the characteristic, typical value of the attribute in the units of the population, formed under given conditions of place and time under the influence of the whole set of factors. The action of various factors generates fluctuation, or variation, of the averaged attribute. The average value is the general measure of their action, the resultant of all these factors. The average value characterizes the population according to the averaged attribute but refers to a single unit of the population. For example, the average output per worker of a given enterprise is the ratio of all output (for any period of time) to the total (average for the same period) number of its workers. It characterizes the labor productivity of the given population but refers to one worker.

In the average value of a mass phenomenon, individual differences between the units of the statistical population in the values of the averaged attribute, caused by random circumstances, cancel out. As a result of this mutual cancellation, the general, regular property of the given statistical population manifests itself in the average. There is a dialectical connection between the average and the individual values of the averaged attribute, as between the general and the individual.

The average is the most important category of statistical science and the most important form of generalizing indicators. Many phenomena of social life become clear and definite only when they are generalized in the form of averages. Such, for example, are the labor productivity of a body of workers mentioned above, the yield of agricultural crops, and so on. The average is the most important method of scientific generalization in statistics. In this sense, one speaks of the method of averages, which is widely used in economics. Many categories of economic science are defined using the concept of the average.

The main condition for the correct application of the average is the homogeneity of the statistical population with respect to the averaged attribute. A homogeneous statistical population is one whose constituent elements (units) are similar to each other in the attributes essential for the given study and belong to the same type of phenomena. A population that is homogeneous in some respects may be heterogeneous in others. Only in the averages for such populations do the specific features and patterns of development of the analyzed phenomenon manifest themselves. An average calculated for a heterogeneous statistical population, i.e. one in which qualitatively different phenomena are combined, loses its scientific significance. Such averages are fictitious: they not only fail to give an idea of reality but also distort it. To form homogeneous statistical populations, an appropriate grouping is carried out. With the help of groupings, quantitatively characteristic groups can also be distinguished within a qualitatively homogeneous population. For each of them its own average can be calculated, called the group (partial) average, in contrast to the general average (for the population as a whole).

2. Types of averages

Of great importance in the methodology of averages are the choice of the form of the average, i.e. the formula by which the average value can be calculated correctly, and the choice of the weights of the average. The averages most commonly used in statistics are the aggregate mean, arithmetic mean, harmonic mean, geometric mean, root mean square, mode, and median. The use of a particular formula depends on the content of the averaged attribute and the specific data from which it must be calculated. To choose the form of the average, one can use the so-called initial ratio of the average.

2.1 Arithmetic mean

The arithmetic mean is one of the most common forms of the average. It is calculated as the quotient of the sum of the individual values (variants) of the varying attribute divided by their number. The arithmetic mean is used when the total volume of the varying attribute in a homogeneous statistical population is formed by summing the attribute values of all units of the population. The following arithmetic means are distinguished:

1) The simple arithmetic mean, which is determined by summing the quantitative values of the varying attribute and dividing this sum by the number of variants. It is calculated using the following formula:

X = Σ x i / n

X - the average value of the statistical population,

Σ x i - the sum of the individual varying variants of the phenomena of the statistical population,

n - the number of varying variants of the phenomena of the statistical population.

2) The weighted arithmetic mean: the average value of the attribute, calculated taking the weights into account. The weights are the frequencies with which the individual values of the averaged attribute are taken into account when its average is calculated. The choice of weights depends on the nature of the averaged attribute and on the data available for calculating the average. The weights may be the number of units or the sizes of the parts of the statistical population (as absolute or relative values) that have a given variant (value) of the averaged attribute, as well as the values of an indicator associated with the averaged attribute. The weighted arithmetic mean is calculated using the following formula:

X = Σ x f / Σ f

X - the weighted arithmetic mean,

x - the values of the individual varying variants of the phenomena of the statistical population,

f - the weights (frequencies) of the variants.

The purpose of both the simple and the weighted arithmetic mean is to determine the mean value of a varying attribute. If each variant of the attribute occurs once in the population under study, or all variants have the same weight, the simple arithmetic mean is applied. If the variants occur several times or have different weights, the weighted arithmetic mean is used.
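The difference between the two arithmetic means can be sketched as follows. The function names and the sample values are hypothetical and serve only to illustrate the definitions above.

```python
def simple_mean(values):
    """Simple arithmetic mean: the sum of the variants divided by their number."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum of x * f divided by the sum of the weights f."""
    return sum(x * f for x, f in zip(values, weights)) / sum(weights)


# Hypothetical variants and frequencies.
values = [10, 20, 30]
freqs = [5, 3, 2]

print(simple_mean(values))                # each variant counted once
print(weighted_mean(values, freqs))       # variants counted with their frequencies
print(weighted_mean(values, [1, 1, 1]))   # equal weights reduce to the simple mean
```

With equal weights the weighted mean coincides with the simple mean, which is why the simple form is sufficient when every variant occurs once.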

2.2 Harmonic mean

The harmonic mean is used to calculate the average when there are no direct data on the weights, but the variants of the averaged attribute (x) and the products of the variant values by the number of units having that value, w (w = xf), are known.

This average is calculated using the following formulas:

1) Simple harmonic mean:

X = n / Σ (1 / x)

X - the simple harmonic mean,

x - the individual varying variants of the phenomena of the statistical population,

n - the number of varying variants of the phenomena of the statistical population.

2) Weighted harmonic mean:

X = Σ w / Σ (w / x)

X - the weighted harmonic mean,

x - the individual varying variants of the phenomena of the statistical population,

w - the products of the variant values by their frequencies (w = xf).

When the weighted harmonic mean is used, the weights are recovered, and thus the same result is obtained as the weighted arithmetic mean would give if all the data needed for it were known.
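A short sketch can illustrate why the weighted harmonic mean reproduces the weighted arithmetic mean when only the products w = xf are known. The function names and numbers are hypothetical.

```python
def harmonic_mean_simple(values):
    """Simple harmonic mean: n divided by the sum of reciprocals."""
    return len(values) / sum(1 / x for x in values)


def harmonic_mean_weighted(values, totals):
    """Weighted harmonic mean: sum of w divided by the sum of w / x, where w = x * f."""
    return sum(totals) / sum(w / x for x, w in zip(values, totals))


# Hypothetical data: variants x and hidden frequencies f.
values = [10, 20, 30]
freqs = [5, 3, 2]
totals = [x * f for x, f in zip(values, freqs)]   # w = x * f, the only known volumes

arithmetic = sum(totals) / sum(freqs)             # needs the frequencies
harmonic = harmonic_mean_weighted(values, totals) # recovers them as w / x

print(arithmetic, harmonic)
print(harmonic_mean_simple(values))
```

Dividing each w by its x recovers the frequency f, so the denominator of the harmonic form equals the sum of the weights.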

2.3 Aggregate mean

The aggregate mean is calculated by the formula:

X = Σ w / Σ f

X - the aggregate mean,

Σ w - the total volume of the attribute (the numerator of the initial ratio of the average),

Σ f - the total number of units (the denominator of the initial ratio of the average).

The aggregate mean is calculated when the values of both the numerator and the denominator of the initial ratio of the average are known (available).

2.4 Geometric mean

The geometric mean is one of the forms of the average. It is calculated as the nth root of the product of the n individual values (variants) of the attribute (x) and is determined by the following formula:

X = ⁿ√(x 1 · x 2 · … · x n)

The geometric mean is mainly used in calculating average growth rates.
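As an illustration of the use for growth rates, the following sketch computes the geometric mean of several growth coefficients. The function name and the coefficients are hypothetical.

```python
from math import prod


def geometric_mean(values):
    """Geometric mean: the nth root of the product of n positive values."""
    return prod(values) ** (1 / len(values))


# Hypothetical chain growth coefficients for three consecutive periods.
growth = [1.05, 1.10, 1.20]
average_growth = geometric_mean(growth)

# Applying the average coefficient three times gives the same overall growth.
print(average_growth)
print(average_growth ** len(growth), prod(growth))
```

The arithmetic mean of the same coefficients would overstate the average growth, because growth compounds multiplicatively rather than additively.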

2.5 Mode and median

Along with the averages considered above, the so-called structural averages, the mode and the median, are also used.

The mode (Mo) is the most frequently occurring value of the attribute among the units of the population. For discrete series, it is the variant with the highest frequency.

In interval variation series, it is possible to determine, first of all, the interval in which the mode is located, i.e. the so-called modal interval. In a variation series with equal intervals, the modal interval is determined by the highest frequency; in series with unequal intervals, by the highest distribution density.

To determine the mode in series with equal intervals, the following formula is used:

Mo = X n + h · (f 2 - f 1) / ((f 2 - f 1) + (f 2 - f 3))

X n - the lower limit of the modal interval,

h - the width of the interval,

f 1, f 2, f 3 - the frequencies (or relative frequencies) of the pre-modal, modal, and post-modal intervals, respectively.
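The interpolation for the mode of an equal-interval series can be sketched as below. The sketch assumes the standard formula Mo = lower limit + h · (f2 − f1) / ((f2 − f1) + (f2 − f3)), treats a missing neighbouring interval as having frequency zero, and uses a hypothetical function name and data.

```python
def interval_mode(lower_bounds, width, freqs):
    """Mode of an interval series with equal intervals, by interpolation."""
    k = max(range(len(freqs)), key=freqs.__getitem__)    # index of the modal interval
    f_modal = freqs[k]
    f_prev = freqs[k - 1] if k > 0 else 0                # pre-modal frequency
    f_next = freqs[k + 1] if k < len(freqs) - 1 else 0   # post-modal frequency
    step = (f_modal - f_prev) / ((f_modal - f_prev) + (f_modal - f_next))
    return lower_bounds[k] + width * step


# Hypothetical series: intervals 0-10, 10-20, 20-30, 30-40.
lower_bounds = [0, 10, 20, 30]
freqs = [5, 10, 30, 20]
mode = interval_mode(lower_bounds, 10, freqs)
print(round(mode, 2))
```

The mode is pulled toward the neighbouring interval with the higher frequency, here the post-modal one.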

In the interval series, the mode can be found graphically. To do this, two lines are drawn from the boundaries of two adjacent columns in the highest column of the histogram. Then, from the point of their intersection, a perpendicular is lowered to the abscissa axis. The feature value on the abscissa corresponding to the perpendicular will be the mode.

In many cases, when the population is characterized by a generalized indicator, preference is given to the mode rather than the arithmetic mean.

For example, when prices in a market are studied, it is not the average price of a certain product that is recorded and tracked over time but the modal price. When the demand of the population for a certain size of shoes or clothes is studied, it is the modal size that is of interest, while the average size as such has no meaning here at all. The mode is not only of independent interest but also serves as an auxiliary indicator for the average, characterizing its typicality. If the arithmetic mean is close in value to the mode, then it is typical.

The median (Me) is the value of the attribute in the middle unit of the ranked series. (A ranked series is a series in which the attribute values are written in ascending or descending order.)

To find the median, its ordinal number is first determined. With an odd number of units, one is added to the sum of all frequencies and the result is divided by two. With an even number of units there are two middle units in the series, and strictly the median should be determined as the average of the values of these two units. In practice, however, with an even number of units the median is found as the value of the attribute of the unit whose ordinal number equals the sum of the frequencies divided by two. Knowing the ordinal number of the median, it is easy to find its value from the accumulated frequencies.

In an interval series, after the ordinal number of the median has been determined from the cumulative frequencies (relative frequencies), the median interval is found, and then the value of the median itself is determined by simple interpolation. This calculation is expressed by the following formula:

Me = X n + h · (N Me - S Me-1) / f Me

X n - the lower limit of the median interval,

h - the width of the median interval,

N Me - the ordinal number of the median,

S Me-1 - the frequency (relative frequency) accumulated up to the median interval,

f Me - the frequency (relative frequency) of the median interval.
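The interpolation for the median of an interval series can be sketched in Python. The sketch follows the rule described in the text for the ordinal number (half the sum of frequencies, with one added first when the sum is odd); the function name and the data are hypothetical.

```python
from itertools import accumulate


def interval_median(lower_bounds, width, freqs):
    """Median of an interval series with equal intervals, by linear interpolation."""
    total = sum(freqs)
    # Ordinal number of the median: (total + 1) / 2 if total is odd, else total / 2.
    position = (total + 1) / 2 if total % 2 else total / 2
    cumulative = list(accumulate(freqs))
    # The median interval is the first one whose cumulative frequency reaches the position.
    k = next(i for i, s in enumerate(cumulative) if s >= position)
    before = cumulative[k - 1] if k > 0 else 0   # frequency accumulated before it
    return lower_bounds[k] + width * (position - before) / freqs[k]


# Hypothetical series: intervals 0-10, 10-20, 20-30, 30-40.
lower_bounds = [0, 10, 20, 30]
freqs = [5, 10, 30, 15]
median_value = interval_median(lower_bounds, 10, freqs)
print(median_value)
```

The calculation assumes that units are spread evenly inside the median interval, which is the usual premise of this interpolation.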

According to this formula, the lower limit of the median interval is increased by the part of the interval width that corresponds to the share of the units of this group still missing to reach the ordinal number of the median. In other words, the calculation of the median is based on the assumption that the attribute grows evenly among the units of each group. On this basis, the median can also be calculated in another way. Having determined the median interval, one can subtract from its upper limit (X b) the part of the interval that corresponds to the share of units exceeding the ordinal number of the median, i.e. according to the following formula:

Me = X b - h · (S Me - N Me) / f Me,

where h is the width of the median interval, N Me is the ordinal number of the median, S Me is the frequency accumulated up to and including the median interval, and f Me is the frequency of the median interval.

The median can also be determined graphically. To do this, a cumulative curve is built, and from the point on the scale of accumulated frequencies (relative frequencies) that corresponds to the ordinal number of the median, a straight line is drawn parallel to the x-axis until it intersects the cumulative curve. Then, from this point of intersection, a perpendicular is dropped to the x-axis. The value of the attribute on the x-axis at the foot of the perpendicular is the median.

By the same principle, it is easy to find the value of a feature for any unit of the ranked series.

Thus, a whole set of indicators can be used to calculate the average value of the variation series.

3. Main indicators of variation and their significance in statistics

When studying a variable trait in units of a population, one cannot limit oneself to only calculating the average value from individual options, since the same average can refer to populations that are far from identical in composition. This can be illustrated by the following conditional example, which reflects data on the number of households in the agricultural enterprises of two districts:

The average number of households in the farms of the two districts is the same - 160. At the same time, the composition of these farms in the two districts is far from the same. Therefore, it becomes necessary to measure the variation of a trait in the population.

For this purpose, a number of characteristics, i.e. indicators, are calculated in statistics. The most elementary indicator of the variation of an attribute is the range of variation R, which is the difference between the maximum and minimum values of the attribute in the given variation series, i.e. R = Xmax - Xmin. In our example, in the first district R = 300 - 80 = 220, and in the second district R = 180 - 145 = 35.

The range of variation is not always applicable, since it takes into account only the extreme values of the attribute, which may differ greatly from all the other units. Sometimes the ratio of the range of variation to the arithmetic mean is found and used under the name of the oscillation coefficient.

The variation in a series can be determined more accurately using indicators that take into account the deviations of all variants from the arithmetic mean. There are two such indicators in statistics: the mean linear deviation and the standard (mean square) deviation.

The mean linear deviation (d) is the arithmetic mean of the absolute values of the deviations of the variants from the mean. The signs of the deviations are ignored, since otherwise the sum of all deviations would equal zero. This indicator is calculated by the formulas:

a) for ungrouped data:

d = Σ |x - X| / n

b) for a variation series:

d = Σ |x - X| f / Σ f

It should be borne in mind that the mean linear deviation is minimal if the deviations are calculated from the median, i.e. according to the formula:

d = Σ |x - Me| / n

The standard deviation (σ) is calculated as follows: each deviation from the mean is squared, all the squares are summed (taking the weights into account), the sum of squares is divided by the number of members of the series, and the square root of the quotient is taken.

All these actions are expressed by the following formulas:

a) for ungrouped data:

σ = √( Σ (x - X)² / n )

b) for a variation series:

σ = √( Σ (x - X)² f / Σ f )

That is, the standard deviation is the square root of the arithmetic mean of the squared deviations from the mean. The expression under the root is called the variance. The variance has independent significance in statistics and is one of the most important indicators of variation.
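For a variation series (values with frequencies), the mean linear deviation, variance, and standard deviation can be sketched as follows. The function name and the data are hypothetical, and the divisor is the sum of the frequencies, as in the description above.

```python
from math import sqrt


def grouped_variation(values, freqs):
    """Mean, mean linear deviation, variance and standard deviation of a variation series."""
    total = sum(freqs)
    mean = sum(x * f for x, f in zip(values, freqs)) / total
    mean_linear_dev = sum(abs(x - mean) * f for x, f in zip(values, freqs)) / total
    variance = sum((x - mean) ** 2 * f for x, f in zip(values, freqs)) / total
    return mean, mean_linear_dev, variance, sqrt(variance)


# Hypothetical variation series: variants and their frequencies.
values = [1, 2, 3]
freqs = [1, 2, 1]
mean, mean_linear_dev, variance, std = grouped_variation(values, freqs)
print(mean, mean_linear_dev, variance, round(std, 4))
```

The standard deviation is never smaller than the mean linear deviation, because squaring gives more weight to the larger deviations.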

where - respectively, the maximum and minimum value of the attribute in the aggregate;

is the number of groups.

The distribution series can be visualized using their graphic representation. For this purpose, a polygon, a histogram, a cumulative curve, an ogive are built.

THEME 4. Absolute and relative values

The concept of a statistical indicator and its types

A statistical indicator is a quantitative and qualitative generalizing characteristic of some property of a group of units or of the population as a whole under specific conditions of place and time. Unlike an attribute, a statistical indicator is obtained by calculation. This can be a simple count of population units, a summation of attribute values, a comparison of two or more values, or more complex calculations.

1. According to the coverage of population units, statistical indicators are subdivided:


2. According to the method of calculation, statistical indicators are divided into:

3. According to spatial certainty, statistical indicators are divided into:


4. According to the form of expression, statistical indicators are divided into:

Absolute values

Absolute value (indicator)- this is a number that expresses the size, volume of the phenomenon in specific conditions of place and time. Absolute values ​​are always named values, that is, they have some unit of measure. Depending on the chosen unit of measure, the following are distinguished: types of absolute values:

1. Natural indicators characterize the volume and size of a phenomenon in units of length, weight, volume, number of units, or number of events. They describe the volume of individual types of products of the same name, and therefore their use is limited.

2. Conditionally natural indicators are used when different varieties of a homogeneous product must be expressed in one conditional unit. The conditionally natural indicator is calculated by multiplying the natural indicator by a conversion (recalculation) coefficient. Conversion coefficients are taken from reference books or calculated independently. Because these indicators apply only to homogeneous products, their use is also limited.

3. Labor indicators are measured in units such as the man-hour and the man-day. They are used to determine the expenditure of working time and to calculate wages and labor productivity.

4. Cost (universal) indicators are measured in the currency of the respective country: cost indicator = quantity of products in physical terms × unit price. Cost indicators are universal because they make it possible to determine the total volume of different types of products.
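The conversion to conditional units and the cost calculation described above can be sketched in Python. This is only an illustration: the product names, conversion coefficients, and prices below are invented and are not taken from any reference book.

```python
# Hypothetical data for three varieties of one homogeneous product.
natural_volume = {"grade_a": 500.0, "grade_b": 300.0, "grade_c": 200.0}  # tonnes
conversion = {"grade_a": 1.0, "grade_b": 1.5, "grade_c": 1.8}  # assumed coefficients
unit_price = {"grade_a": 10.0, "grade_b": 14.0, "grade_c": 17.0}  # currency per tonne


def conditional_natural_total(volume, coefficient):
    """Sum of natural volumes, each multiplied by its conversion coefficient."""
    return sum(volume[name] * coefficient[name] for name in volume)


def cost_total(volume, price):
    """Cost indicator: quantity in physical terms times unit price, summed."""
    return sum(volume[name] * price[name] for name in volume)


conditional_total = conditional_natural_total(natural_volume, conversion)
value_total = cost_total(natural_volume, unit_price)
print(conditional_total, value_total)
```

With these assumed figures the three varieties add up to 1,310 conditional tonnes and a total cost of 12,600 currency units.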

Disadvantage of absolute indicators: they cannot characterize the qualitative features and structure of the phenomenon under study. For this purpose, relative indicators are used, which are calculated on the basis of absolute indicators.

Relative values

A relative indicator is the quotient of dividing one absolute indicator by another; it gives a numerical measure of the relationship between them.

Unnamed relative indicators take the following forms:

1. A coefficient is obtained if the comparison base is taken as 1. If the coefficient is greater than 1, it shows how many times the compared value exceeds the comparison base. If it is less than 1, it shows what fraction of the comparison base the compared value represents.

2. A percentage (%) is obtained if the comparison base is taken as 100, that is, by multiplying the coefficient by 100.

3. Per mille (‰) is obtained if the comparison base is taken as 1,000, that is, by multiplying the coefficient by 1,000. Per mille values are used to avoid small fractional values of indicators. They are widely used in demographic statistics, where death, birth, and marriage rates are determined per 1,000 people.

4. Prodecimille (‱) is obtained if the comparison base is taken as 10,000, that is, by multiplying the coefficient by 10,000. For example, the number of doctors or hospital beds per 10,000 people.
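The four forms differ only in the scale applied to the same quotient. A minimal Python sketch, assuming invented demographic figures and a hypothetical helper named relative_indicator:

```python
def relative_indicator(compared, base, scale=1):
    """Quotient of two absolute indicators, rescaled to the chosen comparison base."""
    if base == 0:
        raise ValueError("the comparison base must be non-zero")
    return compared / base * scale


# Hypothetical figures: births during a year and the population size.
births, population = 1_200, 80_000

coefficient = relative_indicator(births, population)               # base 1
percent = relative_indicator(births, population, 100)              # base 100
per_mille = relative_indicator(births, population, 1_000)          # base 1,000
per_ten_thousand = relative_indicator(births, population, 10_000)  # base 10,000
print(coefficient, percent, per_mille, per_ten_thousand)
```

Here the coefficient 0.015 is awkward to read, while the per mille form gives 15 births per 1,000 people, which is why demographic rates are usually quoted that way.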

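Types of relative values (indicators):

1. Relative indicator of structure:

$$\text{relative indicator of structure} = \frac{\text{level of a part of the population}}{\text{level of the population as a whole}}$$

This indicator is calculated from grouped data and shows the share of individual parts in the total volume of the population. It can be expressed as a coefficient (share) or as a percentage (specific weight). For example, 0.4 is a share and 40% is a specific weight. The sum of all shares equals 1, and the sum of all specific weights equals 100%.

2. Relative indicator of dynamics:

$$\text{relative indicator of dynamics} = \frac{\text{level of the current period}}{\text{level of the previous (base) period}}$$

This indicator shows the change in the phenomenon over time. Expressed as a coefficient it is called the growth factor, and expressed as a percentage it is called the growth rate.

3. Relative indicator of plan performance:

$$\text{relative indicator of plan performance} = \frac{\text{actual level of the current period}}{\text{planned level of the current period}} \cdot 100\%$$

This indicator shows the degree to which the plan has been fulfilled and is expressed as a percentage.

4. Relative indicator of the plan target:

$$\text{relative indicator of the plan target} = \frac{\text{planned level of the current period}}{\text{actual level of the previous period}} \cdot 100\%$$

This indicator shows how the indicator is planned to change in the coming period compared with the previous period and is expressed as a percentage.

Relationship between the indicators (all expressed as coefficients): relative indicator of dynamics = relative indicator of the plan target × relative indicator of plan performance.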

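5. Relative indicator of coordination:

$$\text{relative indicator of coordination} = \frac{\text{level of one part of the population}}{\text{level of another part taken as the base}}$$

This indicator can be calculated per 1, 10, or 100 units and shows how many units of one part correspond, on average, to 1, 10, or 100 units of another part. For example, the number of urban residents per 1, 10, or 100 rural residents.

6. Relative indicator of intensity:

$$\text{relative indicator of intensity} = \frac{\text{level characterizing the phenomenon}}{\text{level characterizing the environment in which the phenomenon spreads}}$$

This indicator is calculated by comparing unlike indicators that are in a certain relationship with each other. It can be calculated per 1, 10, or 100 units and is a named indicator. For example, population density: people per 1, 10, or 100 km².

7. Relative indicator of comparison:

$$\text{relative indicator of comparison} = \frac{\text{indicator for object A}}{\text{the same indicator for object B}}$$

This indicator is calculated by comparing like indicators that relate to the same period of time but to different objects or territories. It is expressed as a coefficient or as a percentage.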
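The relative indicators listed above can be illustrated with a short Python sketch. All figures are invented, and the definitions assumed here are the usual ones: plan target = planned level / previous actual level, plan performance = actual level / planned level, dynamics = actual level / previous actual level.

```python
# Hypothetical figures for one region (illustrative only).
urban, rural = 600_000, 400_000      # population, people
area_km2 = 25_000                    # territory
previous_actual = 200.0              # output of the previous period
planned = 220.0                      # planned output of the current period
current_actual = 231.0               # actual output of the current period
other_region_actual = 210.0          # output of another region, same period

total = urban + rural
structure = {"urban": urban / total, "rural": rural / total}  # shares, sum to 1
coordination = urban / rural * 100   # urban residents per 100 rural residents
intensity = total / area_km2         # people per km2 (a named indicator)
comparison = current_actual / other_region_actual  # region vs. other region

plan_target = planned / previous_actual
plan_performance = current_actual / planned
dynamics = current_actual / previous_actual

# The growth factor equals plan target times plan performance (as coefficients).
print(dynamics, plan_target * plan_performance)
```

With these figures there are 150 urban residents per 100 rural residents, the density is 40 people per km², and the growth factor 1.155 is the product of the plan target 1.10 and the plan performance 1.05.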

THEME 5. Mean values and indicators of variation

1. Average value: concept and types

The average value is a generalizing indicator that characterizes the typical level of a varying quantitative attribute per unit of the population under specific conditions of place and time.

Conditions for calculating the average value:

1. The population on which the average value is calculated must be large enough, otherwise random deviations in the value of the attribute will not be canceled out and the average will not show the patterns inherent in this process.

2. The population for which the average value is calculated must be qualitatively homogeneous; otherwise the averages will not only have no scientific value but may also be harmful, distorting the true nature of the phenomenon under study.

3. The overall average should be supplemented by group averages. The overall average shows the typical size for the entire population, and the group averages show the typical sizes for its individual parts with their specific properties.

4. For a comprehensive description of the phenomenon, a system of averages should be calculated for its most significant attributes.

The average value is always named, it has the same dimension as the averaged feature.

Types of averages:

1. Power means (these include the arithmetic mean, harmonic mean, quadratic mean, and geometric mean);

2. Structural averages (mode and median).

Power means are calculated by the general formula (the root of degree $k$ of the mean of all options raised to the power $k$):

$$\bar{x} = \sqrt[k]{\frac{\sum x_i^k}{n}}$$

where $\bar{x}$ is the power mean value of the feature under study;

$x_i$ is an individual value of the averaged feature;

$k$ is the exponent of the mean;

$n$ is the number of feature values (population units);

$\sum$ is the summation sign.

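Depending on the value of the exponent $k$, different types of simple averages are obtained:

$k = -1$: simple harmonic mean, $\bar{x}_{harm} = \dfrac{n}{\sum \frac{1}{x_i}}$

$k = 0$: simple geometric mean, $\bar{x}_{geom} = \sqrt[n]{\prod x_i}$, where $\prod$ is the product sign

$k = 1$: simple arithmetic mean, $\bar{x}_{arith} = \dfrac{\sum x_i}{n}$

$k = 2$: simple quadratic mean, $\bar{x}_{quad} = \sqrt{\dfrac{\sum x_i^2}{n}}$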

The higher the exponent $k$ of the power mean, the greater the value of the mean itself. If all of these averages are calculated for the same data, the following relationship holds:

$$\bar{x}_{harm} \le \bar{x}_{geom} \le \bar{x}_{arith} \le \bar{x}_{quad}$$

This property of power means, to increase as the exponent of the defining function increases, is called the rule of majorance of means.
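The rule of majorance can be checked numerically. The sketch below assumes strictly positive data and treats the exponent 0 as the geometric-mean limit; the function name power_mean and the sample values are illustrative only.

```python
import math


def power_mean(values, k):
    """Simple power mean of positive values; k = 0 gives the geometric mean."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("values must be non-empty and strictly positive")
    n = len(values)
    if k == 0:
        # Limit of the power mean as k -> 0: exp of the mean logarithm.
        return math.exp(sum(math.log(v) for v in values) / n)
    return (sum(v ** k for v in values) / n) ** (1 / k)


data = [2.0, 4.0, 8.0]  # hypothetical observations

harmonic = power_mean(data, -1)
geometric = power_mean(data, 0)
arithmetic = power_mean(data, 1)
quadratic = power_mean(data, 2)
print(harmonic, geometric, arithmetic, quadratic)
```

For this sample the four means come out in increasing order (roughly 3.43, 4, 4.67, 5.29), as the rule predicts. They coincide only when all values are equal.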

Of these types of averages, the most commonly used are the arithmetic average and the harmonic average. The choice of the type of average depends on the initial information.

2. Arithmetic mean: methods of calculation and its properties

The arithmetic mean is the quotient of dividing the sum of the individual values of a feature over all population units by the number of population units.

The arithmetic mean is used in the form of a simple mean and a weighted mean. The simple arithmetic mean is calculated by the formula:

$$\bar{x} = \frac{\sum x_i}{n}$$

where $\bar{x}$ is the average value of the feature;

$x_i$ are the individual values of the feature (options);

$n$ is the number of population units (options).

The simple arithmetic mean is used in two cases:

when each variant occurs only once in the distribution series;

when all frequencies are equal.

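The weighted arithmetic mean is used when the frequencies are not equal to each other:

$$\bar{x} = \frac{\sum x_i f_i}{\sum f_i}$$

where $x_i$ are the individual values of the feature and $f_i$ are the frequencies or weights (numbers showing how many times the individual values of the feature occur).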
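A minimal Python sketch of the simple and weighted arithmetic means follows. The wage levels and worker counts are invented for illustration.

```python
def simple_mean(values):
    """Simple arithmetic mean: sum of the values divided by their number."""
    if not values:
        raise ValueError("at least one value is required")
    return sum(values) / len(values)


def weighted_mean(values, frequencies):
    """Weighted arithmetic mean: sum of x_i * f_i divided by the sum of f_i."""
    if len(values) != len(frequencies):
        raise ValueError("values and frequencies must have the same length")
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("the sum of frequencies must be positive")
    return sum(x * f for x, f in zip(values, frequencies)) / total


wages = [300.0, 400.0, 500.0]  # hypothetical wage levels
workers = [5, 3, 2]            # hypothetical number of workers at each level

print(simple_mean(wages))             # ignores how many workers earn each wage
print(weighted_mean(wages, workers))  # accounts for the frequencies
```

With unequal frequencies the two results differ (400 versus 370 here). When all frequencies are equal, the weighted formula reduces to the simple one.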

Properties of the arithmetic mean (without proof), where $x_i$ are the options, $f_i$ their frequencies, and $A$ a constant:

1. The average of a constant is equal to the constant itself: $\bar{A} = A$.

2. The product of the average value and the sum of the frequencies is equal to the sum of the products of the options and their frequencies: $\bar{x} \sum f_i = \sum x_i f_i$.

3. If each option is increased or decreased by the same amount, the average value increases or decreases by the same amount: $\overline{x \pm A} = \bar{x} \pm A$.

4. If each option is increased or decreased by the same factor, the average value increases or decreases by the same factor: $\overline{A x} = A \bar{x}$ and $\overline{x / A} = \bar{x} / A$.

5. If all frequencies are increased or decreased by the same factor, the average value does not change: $\dfrac{\sum x_i (A f_i)}{\sum A f_i} = \bar{x}$.

6. The average of a sum is equal to the sum of the averages: $\overline{x + y} = \bar{x} + \bar{y}$.

7. The sum of the deviations of all feature values from the average value is zero: $\sum (x_i - \bar{x}) f_i = 0$.
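Several of these properties can be verified numerically. The sketch below uses a small invented series and a hypothetical helper named weighted_mean. It is a check on one example, not a proof.

```python
def weighted_mean(values, frequencies):
    """Weighted arithmetic mean: sum of x_i * f_i divided by the sum of f_i."""
    return sum(x * f for x, f in zip(values, frequencies)) / sum(frequencies)


x = [2.0, 4.0, 9.0]  # hypothetical options
f = [1, 3, 2]        # hypothetical frequencies

mean_x = weighted_mean(x, f)

# Property 2: mean times the sum of frequencies equals the sum of x_i * f_i.
weighted_sum = sum(v * w for v, w in zip(x, f))
# Property 3: adding a constant to every option shifts the mean by it.
shifted = weighted_mean([v + 10 for v in x], f)
# Property 4: multiplying every option by a factor scales the mean by it.
scaled = weighted_mean([v * 3 for v in x], f)
# Property 5: multiplying every frequency by a factor leaves the mean unchanged.
rescaled = weighted_mean(x, [w * 5 for w in f])
# Property 7: frequency-weighted deviations from the mean sum to zero.
deviation_sum = sum((v - mean_x) * w for v, w in zip(x, f))

print(mean_x, shifted, scaled, rescaled, deviation_sum)
```

Because of floating-point rounding, the deviation sum may print as a value very close to zero rather than exactly zero.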

3. Methods for calculating the harmonic mean

In some cases, the nature of the initial data is such that calculating the arithmetic mean loses its meaning, and the only suitable generalizing indicator is the harmonic mean.

Types of harmonic mean:

1. The simple harmonic mean is calculated by the formula:

$$\bar{x} = \frac{n}{\sum \frac{1}{x_i}}$$

The simple harmonic mean is used very rarely, only to calculate the average time spent on manufacturing a unit of production, provided that the frequencies of all options are equal.

2. The weighted harmonic mean is calculated by the formula:

$$\bar{x} = \frac{\sum M_i}{\sum \frac{M_i}{x_i}}$$

where $M_i = x_i f_i$ is the total volume of the phenomenon for the $i$-th option.

The weighted harmonic mean is used when the total volume of the phenomenon is known but the frequencies are not. It is used to calculate averages of qualitative indicators: average wages, average price, average cost, average yield, average labor productivity.
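As an illustration of the weighted harmonic mean, the Python sketch below computes an average price when only the prices and the revenue at each price are known, not the quantities sold. The data and the function name harmonic_weighted_mean are assumptions made for the example.

```python
def harmonic_weighted_mean(values, volumes):
    """Weighted harmonic mean: sum of M_i divided by the sum of M_i / x_i.

    values  -- x_i, e.g. the price per unit
    volumes -- M_i = x_i * f_i, e.g. the revenue obtained at each price
    """
    if len(values) != len(volumes):
        raise ValueError("values and volumes must have the same length")
    if any(x <= 0 for x in values):
        raise ValueError("values must be strictly positive")
    return sum(volumes) / sum(m / x for x, m in zip(values, volumes))


prices = [10.0, 20.0, 25.0]         # hypothetical price per unit
revenue = [1000.0, 1000.0, 500.0]   # hypothetical revenue at each price

average_price = harmonic_weighted_mean(prices, revenue)

# Cross-check: recover the unknown frequencies (quantities) and use the
# weighted arithmetic mean instead; both routes must agree.
quantities = [m / x for x, m in zip(prices, revenue)]
check = sum(x * q for x, q in zip(prices, quantities)) / sum(quantities)
print(average_price, check)
```

Both calculations give 2,500 / 170, about 14.71. The harmonic form simply avoids reconstructing the quantities explicitly.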

4. Structural averages: mode and median

Structural averages (mode, median) are used to study the internal structure of a distribution series of attribute values.

The mode is the most common value of the attribute among the units of the population. In a distribution series where each option occurs once, the mode is not calculated. In a discrete series, the mode is the option with the highest frequency. For an interval series with equal intervals, the mode is calculated by the formula:

$$Mo = x_{Mo} + h \cdot \frac{f_{Mo} - f_{Mo-1}}{(f_{Mo} - f_{Mo-1}) + (f_{Mo} - f_{Mo+1})}$$

where $x_{Mo}$ is the initial (lower) boundary of the modal interval;

$h$ is the width of the modal interval (with equal intervals, the premodal and postmodal intervals have the same width);

$f_{Mo}$, $f_{Mo-1}$, $f_{Mo+1}$ are the frequencies of the modal, premodal, and postmodal intervals, respectively.

The modal interval is the interval that has the highest frequency.
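The mode of an interval series with equal intervals can be sketched in Python as follows. The standard interpolation formula is assumed, and two conventions are adopted here that the text does not specify: a missing neighbouring interval is treated as having zero frequency, and the first interval wins in case of a tie. The series is invented.

```python
def interval_mode(lower_bounds, width, frequencies):
    """Mode of an interval series with equal interval width (interpolation formula)."""
    if not frequencies or len(lower_bounds) != len(frequencies):
        raise ValueError("lower_bounds and frequencies must be non-empty and equal in length")
    # Modal interval: highest frequency (first one if several are tied).
    i = max(range(len(frequencies)), key=lambda j: frequencies[j])
    f_mo = frequencies[i]
    # Assumption: a missing premodal or postmodal interval has zero frequency.
    f_prev = frequencies[i - 1] if i > 0 else 0
    f_next = frequencies[i + 1] if i + 1 < len(frequencies) else 0
    denominator = (f_mo - f_prev) + (f_mo - f_next)
    if denominator == 0:
        raise ValueError("mode is undefined: neighbouring frequencies equal the modal one")
    return lower_bounds[i] + width * (f_mo - f_prev) / denominator


# Hypothetical series: intervals 10-20, 20-30, 30-40, 40-50.
bounds = [10.0, 20.0, 30.0, 40.0]
freqs = [4, 10, 6, 2]

mode_value = interval_mode(bounds, 10.0, freqs)
print(mode_value)
```

Here the modal interval is 20-30, and the mode is 20 + 10 · 6 / (6 + 4) = 26.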

The median is the value of the feature that lies in the middle of the ranked series and divides it into two parts equal in number of units: in one part the feature values are less than the median, and in the other they are greater.

A ranked series is an arrangement of the feature values in ascending or descending order.

In a discrete ranked series where each option occurs once and the number of options is odd, the position number of the median is determined by the formula:

$$N_{Me} = \frac{n + 1}{2}$$

where $n$ is the number of terms in the series.

In a discrete ranked series where each option occurs once and the number of options is even, the median is the arithmetic mean of the two options located in the middle of the ranked series.

In a discrete ranked series where the options occur several times, the position number of the median is determined by the formula:

$$N_{Me} = \frac{\sum f_i + 1}{2}$$

where $f_i$ are the frequencies of the options. Then, starting from the first option, the frequencies are summed sequentially until the cumulative frequency reaches or exceeds $N_{Me}$; the option at which this happens is the median.

For an interval series, the median is calculated by the formula:

$$Me = x_{Me} + h \cdot \frac{\frac{\sum f_i}{2} - S_{Me-1}}{f_{Me}}$$

where $x_{Me}$ is the lower limit of the median interval;

$h$ is the width of the median interval;

$\sum f_i$ is the total number of population units;

$S_{Me-1}$ is the cumulative frequency up to the median interval;

$f_{Me}$ is the frequency of the median interval.

The median interval is the first interval whose cumulative frequency is equal to or greater than half the sum of all frequencies in the series.
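A short Python sketch of the interval-series median, assuming the standard interpolation formula and equal interval widths. The function name interval_median and the sample series are illustrative only.

```python
def interval_median(lower_bounds, width, frequencies):
    """Median of an interval series by linear interpolation inside the median interval."""
    if len(lower_bounds) != len(frequencies):
        raise ValueError("lower_bounds and frequencies must have the same length")
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("the series must contain at least one unit")
    half = total / 2
    cumulative = 0  # cumulative frequency before the current interval
    for x0, f in zip(lower_bounds, frequencies):
        # The median interval is the first one whose cumulative frequency
        # reaches or exceeds half of the total.
        if cumulative + f >= half:
            return x0 + width * (half - cumulative) / f
        cumulative += f
    raise ValueError("median interval not found")


# Hypothetical series: intervals 10-20, 20-30, 30-40, 40-50.
bounds = [10.0, 20.0, 30.0, 40.0]
freqs = [4, 10, 6, 2]

median_value = interval_median(bounds, 10.0, freqs)
print(median_value)
```

The total frequency is 22, so half is 11. The cumulative frequency first reaches 11 in the interval 20-30, which gives a median of 20 + 10 · (11 − 4) / 10 = 27.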

5. Indicators of variation

Variation of a feature is the difference in the individual values of the feature within the studied population. It is characterized by variation indicators, which complement the average values and characterize the degree of homogeneity of the statistical population with respect to a given feature and the boundaries of its variation. The ratio of variation indicators determines the relationship between features.

Variation indicators are divided into:

1) Absolute: range of variation, mean linear deviation, standard deviation, and variance. All of them except the variance are expressed in the same units as the feature values; the variance is expressed in squared units.

2) Relative: coefficient of oscillation, coefficient of variation, relative linear deviation.

The range of variation shows by how much the value of the feature changes:

$$R = x_{max} - x_{min}$$

where $x_{max}$ is the maximum value of the feature;

$x_{min}$ is the minimum value of the feature.

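The mean linear deviation and the standard (mean square) deviation show how much the individual values of a feature differ, on average, from its mean value. In the formulas below, $x_i$ are the individual values, $\bar{x}$ is their mean, $n$ is the number of units, and $f_i$ are the frequencies.

The mean linear deviation is defined as:

$\bar{d} = \dfrac{\sum |x_i - \bar{x}|}{n}$ (simple); $\bar{d} = \dfrac{\sum |x_i - \bar{x}| f_i}{\sum f_i}$ (weighted).

The variance and the standard deviation are defined as:

$\sigma^2 = \dfrac{\sum (x_i - \bar{x})^2}{n}$ (simple); $\sigma^2 = \dfrac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}$ (weighted);

$\sigma = \sqrt{\dfrac{\sum (x_i - \bar{x})^2}{n}}$ (simple); $\sigma = \sqrt{\dfrac{\sum (x_i - \bar{x})^2 f_i}{\sum f_i}}$ (weighted).

If the mean value of the attribute was calculated as a simple arithmetic mean, the simple formulas are used; if it was calculated as a weighted mean, the weighted formulas are used.

The variance can also be calculated by a different formula, as the mean of the squares minus the square of the mean; the standard deviation is then its square root:

$\sigma^2 = \dfrac{\sum x_i^2}{n} - \left(\dfrac{\sum x_i}{n}\right)^2$ (simple); $\sigma^2 = \dfrac{\sum x_i^2 f_i}{\sum f_i} - \left(\dfrac{\sum x_i f_i}{\sum f_i}\right)^2$ (weighted).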

To compare the variation of different attributes in the same population, or of the same attribute in different populations, a relative indicator of variation is calculated, called the coefficient of variation:

$$V = \frac{\sigma}{\bar{x}} \cdot 100\%$$

where $\sigma$ is the standard deviation and $\bar{x}$ is the mean value of the attribute.

The greater the coefficient of variation, the greater the spread of the attribute values around the average, the less homogeneous the population, and the less representative the average. The population is considered homogeneous if the coefficient of variation does not exceed 33%.
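The variation indicators of this section can be computed together, as in the Python sketch below. It assumes population (not sample) formulas, as in the text, and uses an invented weighted series. The function name and dictionary keys are illustrative.

```python
import math


def variation_indicators(values, frequencies=None):
    """Range, mean linear deviation, variance, standard deviation and
    coefficient of variation for a simple or weighted series."""
    if frequencies is None:
        frequencies = [1] * len(values)  # simple (unweighted) formulas
    n = sum(frequencies)
    if n <= 0:
        raise ValueError("the series must contain at least one unit")
    mean = sum(x * f for x, f in zip(values, frequencies)) / n
    if mean == 0:
        raise ValueError("the coefficient of variation is undefined for a zero mean")
    variance = sum((x - mean) ** 2 * f for x, f in zip(values, frequencies)) / n
    std_dev = math.sqrt(variance)
    return {
        "mean": mean,
        "range": max(values) - min(values),
        "mean_linear_deviation": sum(abs(x - mean) * f for x, f in zip(values, frequencies)) / n,
        "variance": variance,
        # Shortcut formula: mean of the squares minus the square of the mean.
        "variance_shortcut": sum(x * x * f for x, f in zip(values, frequencies)) / n - mean ** 2,
        "std_dev": std_dev,
        "cv_percent": std_dev / mean * 100,
    }


result = variation_indicators([10.0, 20.0, 30.0], [2, 5, 3])  # hypothetical series
print(result)
```

For this series the mean is 21, the variance is 49 by either formula, the standard deviation is 7, and the coefficient of variation is about 33.3%, marginally above the 33% threshold for homogeneity.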

6. Types of variances and the rule of addition of variances

If the population under study consists of several groups formed on the basis of some attribute, then in addition to the total variance, the intragroup and intergroup variances are also determined.

According to the variance addition rule, the total variance is equal to the sum of the average of the intragroup variances and the intergroup variance:

$$\sigma^2 = \overline{\sigma_i^2} + \delta^2$$

where $\sigma^2$ is the total variance, $\overline{\sigma_i^2}$ is the average of the intragroup variances, and $\delta^2$ is the intergroup variance.

Using the variance addition rule, a third, unknown variance can always be determined from the two known ones, and the strength of the influence of the grouping attribute can be judged.

The empirical coefficient of determination shows the share of the total variation of the studied attribute that is due to the variation of the grouping attribute:

$$\eta^2 = \frac{\delta^2}{\sigma^2}$$

The empirical correlation ratio shows the strength of the influence of the attribute underlying the grouping on the variation of the resulting attribute:

$$\eta = \sqrt{\frac{\delta^2}{\sigma^2}}$$

The empirical correlation ratio varies from 0 to 1. If $\eta = 0$, there is no connection; if $\eta = 1$, the connection is complete. Intermediate values are evaluated according to their proximity to the limit values.
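The variance addition rule can be checked on a small grouped data set. The Python sketch below assumes population variances, with group sizes used as weights for the intragroup and intergroup variances. The two groups are invented.

```python
import math


def mean(values):
    return sum(values) / len(values)


def variance(values):
    """Population variance: mean squared deviation from the mean."""
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


# Hypothetical values of the resulting attribute, split by a grouping attribute.
groups = {"A": [2.0, 4.0, 6.0], "B": [8.0, 10.0, 12.0, 14.0]}

all_values = [v for group in groups.values() for v in group]
n = len(all_values)
overall_mean = mean(all_values)

total_variance = variance(all_values)
# Average of the intragroup variances, weighted by group size.
within = sum(variance(g) * len(g) for g in groups.values()) / n
# Intergroup variance: spread of the group means around the overall mean.
between = sum((mean(g) - overall_mean) ** 2 * len(g) for g in groups.values()) / n

eta_squared = between / total_variance  # empirical coefficient of determination
eta = math.sqrt(eta_squared)            # empirical correlation ratio
print(total_variance, within, between, eta_squared, eta)
```

Here the total variance of 16 splits into 4 (average intragroup) plus 12 (intergroup), so the grouping attribute accounts for 75% of the total variation and the correlation ratio is about 0.87.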

THEME 6. Series of dynamics

1. Series of dynamics: concept and types

A series of dynamics (chronological series, dynamic series, time series) is a series of numerical values of a statistical indicator arranged in chronological order. A series of dynamics consists of two elements (columns):

1. Time (t): the moments (dates) or periods (years, quarters, months, days) to which the statistical indicators (levels of the series) refer.

2. Level of the series (y): the values of the statistical indicator that characterize the state of the phenomenon at a specified moment or over a specified period of time.


Types of dynamics series:

1. By time:

A) Interval series: series whose levels characterize the size of the phenomenon over a period of time (day, month, quarter, year). Examples are data on the dynamics of production output, the number of man-days worked, and so on. The absolute levels of an interval series can be summed; the sum is meaningful and yields series of dynamics for longer periods.

B) Moment series: series whose levels characterize the size of the phenomenon as of a date (moment) of time. Examples are data on the dynamics of population, livestock numbers, inventories, the value of fixed assets, current assets, and so on. The levels of a moment series cannot be summed; the sum is meaningless, since each subsequent level fully or partially includes the previous level.

2. According to the form of presentation (method of expression) of the levels:

A) Series of absolute values.

B) Series of relative values. Relative values characterize, for example, the dynamics of the shares of the urban and rural population (%) and the unemployment rate.