
The marginal sampling error level. Calculation of average and marginal sampling errors for various types of selection

Population: a set of units characterized by mass character, typicality, qualitative homogeneity and the presence of variation.

A statistical population consists of materially existing objects (employees, enterprises, countries, regions) and is itself an object of study.

Population unit: each individual unit of the statistical population.

The same statistical population can be homogeneous in one feature and heterogeneous in another.

Qualitative homogeneity: the similarity of all units of the population with respect to some feature and their dissimilarity with respect to all the others.

In a statistical population, the differences between one unit and another are most often quantitative. Quantitative changes in the values of a feature across different units of the population are called variation.

Feature variation: the quantitative change of a feature (for a quantitative feature) in passing from one unit of the population to another.

Feature (attribute): a property or characteristic of units, objects and phenomena that can be observed or measured. Features are divided into quantitative and qualitative. The diversity and variability of the value of a feature across individual units of the population is called variation.

Attributive (qualitative) features cannot be expressed numerically (for example, the composition of the population by sex). Quantitative features have a numerical expression (for example, the composition of the population by age).

Indicator: a generalizing quantitative and qualitative characteristic of some property of units or of the population as a whole under specific conditions of time and place.

A system of indicators is a set of indicators that comprehensively reflect the phenomenon under study.

For example, consider salary:
  • Feature: wages
  • Statistical population: all employees
  • Population unit: each employee
  • Qualitative homogeneity: accrued salary
  • Feature variation: the series of salary values

General population and sample from it

The basis of a statistical study is a set of data obtained by measuring one or more features. The actually observed set of objects, statistically represented by a series of observations of a random variable X, is the sample, and the hypothetically existing (conceptual) set is the general population. The general population can be finite (number of observations N = const) or infinite (N = ∞), while a sample from the general population is always the result of a limited number of observations. The number of observations that make up a sample is called the sample size. If the sample size is large enough (n → ∞), the sample is considered large; otherwise it is called a sample of limited size. A sample is considered small if, when a one-dimensional random variable is measured, the sample size does not exceed 30 (n ≤ 30), or if, when several (k) features are measured simultaneously in a multidimensional space, the ratio of n to k is less than 10 (n/k < 10). The sample forms a variation series if its members are order statistics, i.e. the sample values of the random variable X are sorted in ascending order (ranked); the values of the feature are then called variants.

Example. Essentially the same randomly selected set of objects, the commercial banks of one administrative district of Moscow, can be considered a sample from the general population of all commercial banks in this district, a sample from the general population of all commercial banks in Moscow, a sample of the commercial banks in the country, and so on.

Basic sampling methods

The reliability of statistical conclusions and the meaningful interpretation of results depend on the representativeness of the sample, i.e. on how completely and adequately it represents the properties of the general population with respect to which the sample can be considered representative. The study of the statistical properties of a population can be organized in two ways: by complete or by incomplete observation. Complete observation examines all units of the population under study, while incomplete (sample) observation examines only part of it.

There are five main ways to organize sampling:

1. simple random selection, in which objects are drawn at random from the general population (for example, using a table of random numbers or a random number generator) and each possible sample has an equal probability. Such samples are called proper random samples;

2. simple selection by a regular procedure, carried out using a mechanical component (for example, dates, days of the week, apartment numbers, letters of the alphabet, etc.). Samples obtained in this way are called mechanical;

3. stratified selection, in which the general population of size $N$ is subdivided into subsets or layers (strata) of sizes $N_1, N_2, \ldots, N_k$ so that $N_1 + N_2 + \ldots + N_k = N$. Strata consist of objects that are homogeneous in their statistical characteristics (for example, the population is divided into strata by age group or social class; enterprises by industry). In this case the samples are called stratified (also typical or zoned);

4. serial selection, used to form serial or nested (cluster) samples. It is convenient when a "block" or series of objects has to be examined at once (for example, a consignment of goods, products of a certain series, or the population of a territorial-administrative unit of the country). The series can be selected randomly or mechanically. Within each selected series, a complete survey is carried out of the batch of goods or of the entire territorial unit (a residential building or a block);

5. combined (multistage) selection, which can combine several selection methods at once (for example, stratified and random, or random and mechanical); such a sample is called combined.

Selection types

By type, selection is individual, group or combined. In individual selection, individual units of the general population are selected into the sample; in group selection, qualitatively homogeneous groups (series) of units are selected; combined selection involves a combination of the first two types.

By method, selection is repeated or non-repeated.

Non-repeated selection is selection in which a unit that has fallen into the sample is not returned to the original population and does not take part in further selection; the number of units of the general population N decreases during selection. In repeated selection, a unit that has fallen into the sample is returned to the general population after registration and so retains an equal chance, along with the other units, of being chosen again; the number of units of the general population N remains unchanged (this method is rarely used in socio-economic studies). However, for large N (N → ∞) the formulas for non-repeated selection approach those for repeated selection, and the latter are used more often in practice (N = const).
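The difference between the two methods can be sketched in a few lines of Python. This is only an illustration: the population of 20 numbered units, the sample size and the seed are assumptions made for the example, not values from the text.

```python
import random

# Hypothetical general population of N = 20 numbered units (an assumption).
population = list(range(1, 21))
n = 5
rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Repeated selection: each drawn unit is returned, so it may appear again.
repeated = rng.choices(population, k=n)

# Non-repeated selection: a drawn unit is not returned, so all units differ.
non_repeated = rng.sample(population, n)

print("repeated:    ", repeated)
print("non-repeated:", non_repeated)
```

In the standard library, `random.choices` draws with replacement and `random.sample` draws without replacement, which matches the two schemes described above.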

The main characteristics of the parameters of the general and sample population

Statistical conclusions are based on the distribution of a random variable X, and the observed values (x_1, x_2, ..., x_n) are called realizations of the random variable X (n is the sample size). The distribution of a random variable in the general population is theoretical, ideal in nature, and its sample analogue is the empirical distribution. Some theoretical distributions are specified analytically, i.e. their parameters determine the value of the distribution function at each point in the space of possible values of the random variable X. For a sample it is difficult, and sometimes impossible, to determine the distribution function, so the parameters are estimated from empirical data and then substituted into the analytical expression describing the theoretical distribution. The assumption (or hypothesis) about the type of distribution may be statistically correct or erroneous. In any case, the empirical distribution reconstructed from the sample characterizes the true one only approximately. The most important distribution parameters are the expected value (mathematical expectation) and the variance.

By nature, distributions are continuous or discrete. The best-known continuous distribution is the normal distribution. The sample analogues of its parameters, the general mean $\bar{X}$ and the variance $\sigma^2$, are the sample mean $\bar{x}$ and the empirical variance $s^2$. Among discrete distributions, the one most commonly used in socio-economic studies is the alternative (dichotomous) distribution. The expectation parameter of this distribution expresses the relative value (or share) of units of the population that have the feature under study (it is denoted by the letter p); the share of the population that does not have this feature is denoted by the letter q (q = 1 - p). The variance of the alternative distribution also has an empirical analogue.

Depending on the type of distribution and on the method of selecting population units, the characteristics of the distribution parameters are calculated differently. The main ones for the theoretical and empirical distributions are given in Table 1.

The sampling fraction k_n is the ratio of the number of units in the sample to the number of units in the general population:

k_n = n/N.

The sample share w is the ratio of the number of sample units that have the feature under study, n_n, to the sample size n:

w = n_n / n.

Example. In a batch of goods containing 1000 units, a 5% sampling fraction k_n corresponds to a sample of 50 units in absolute terms (n = N*0.05); if 2 defective products are found in this sample, the sample share w is 0.04 (w = 2/50 = 0.04, or 4%).

Since the sample population is different from the general population, there are sampling errors.

Table 1. Main parameters of the general and sample populations

Sampling errors

In any observation (complete or sample), errors of two types can occur: registration errors and representativeness errors. Registration errors can be random or systematic. Random errors arise from many different uncontrollable causes, are unintentional, and usually balance each other out in aggregate (for example, changes in instrument readings due to temperature fluctuations in the room).

Systematic errors are biased, because they violate the rules for selecting objects into the sample (for example, deviations in measurements after the settings of the measuring device are changed).

Example. To assess the social status of the population of a city, it is planned to survey 25% of families. If every fourth apartment is selected by its number, there is a danger of selecting apartments of only one type (for example, one-room apartments), which introduces a systematic error and distorts the results; choosing the apartment number by lot is preferable, since the error will then be random.

Representativeness errors are inherent only in sample observation. They cannot be avoided, and they arise because the sample does not fully reproduce the general population. The values of indicators obtained from the sample differ from the values of the same indicators in the general population (or obtained by complete observation).

Sampling error is the difference between the value of a parameter in the general population and its sample value. For the mean of a quantitative feature it is $\varepsilon_{\bar{x}} = \bar{X} - \bar{x}$, and for the share (alternative feature) it is $\varepsilon_w = p - w$.

Sampling errors are inherent only in sample observations. The larger these errors, the more the empirical distribution differs from the theoretical one. The parameters of the empirical distribution $\bar{x}$ and $w$ are random variables, so sampling errors are also random variables: they can take different values for different samples, and it is therefore customary to calculate the average error.

The average sampling error is a value expressing the standard deviation of the sample mean from the mathematical expectation. Provided the principle of random selection is observed, this value depends primarily on the sample size and on the degree of variation of the feature: the greater $n$ and the smaller the variation of the feature (hence, the value of $\sigma^2$), the smaller the average sampling error $m$. The relation between the variances of the general and sample populations is expressed by the formula:

$$\sigma^2 = s^2 \cdot \frac{n}{n-1},$$

i.e. for sufficiently large $n$ we can assume that $s^2 \approx \sigma^2$. The average sampling error shows the possible deviations of a parameter of the sample population from the corresponding parameter of the general population. Table 2 gives expressions for calculating the average sampling error for different methods of organizing observation.

Table 2. Mean error (m) of the sample mean and proportion for different types of sample

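where $\overline{s^2}$ is the average of the intragroup sample variances for a continuous feature;

$\overline{w(1-w)}$ is the average of the intragroup variances of the share;

$r$ is the number of series selected, $R$ is the total number of series;

$$\delta_{\bar{x}}^2 = \frac{\sum_{i=1}^{r} (\bar{x}_i - \bar{x})^2}{r},$$

where $\bar{x}_i$ is the average of the $i$-th series;

$\bar{x}$ is the general average over the entire sample for a continuous feature;

$$\delta_w^2 = \frac{\sum_{i=1}^{r} (w_i - w)^2}{r},$$

where $w_i$ is the share of the feature in the $i$-th series;

$w$ is the total share of the feature over the entire sample.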

However, the magnitude of the average error can only be judged with a certain probability P (P ≤ 1). A. M. Lyapunov proved that the distribution of sample means, and hence of their deviations from the general mean, for a sufficiently large number of observations $n$ approximately obeys the normal distribution law, provided that the general population has a finite mean and a limited variance.

Mathematically, this statement for the mean is expressed as:

$$P\left(|\bar{x} - \bar{X}| \le t\,m\right) = \Phi(t), \quad (1)$$

and for the share, expression (1) takes the form:

$$P\left(|w - p| \le t\,m_w\right) = \Phi(t),$$

where $\Delta = t\,m$ is the marginal sampling error, which is a multiple of the average sampling error $m$, and the multiplicity factor $t$ is Student's criterion ("confidence factor"), proposed by W. S. Gosset (pseudonym "Student"); its values for different sample sizes are given in a special table.

The values of the function Φ(t) for some values of t are:

t = 1: Φ(t) = 0.683; t = 2: Φ(t) = 0.954; t = 3: Φ(t) = 0.997.

Therefore, expression (3) can be read as follows: with probability P = 0.683 (68.3%) it can be stated that the difference between the sample mean and the general mean will not exceed one mean error m (t = 1); with probability P = 0.954 (95.4%), that it will not exceed two mean errors m (t = 2); with probability P = 0.997 (99.7%), that it will not exceed three mean errors m (t = 3). Thus, the probability that this difference exceeds three times the mean error determines the error level and is no more than 0.3%.
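The three probabilities quoted above can be checked numerically. The sketch below assumes that Φ(t) denotes the probability that a standard normal variable falls within ±t, which can be written through the error function as erf(t/√2); the function name `phi` is chosen only for this illustration.

```python
import math

def phi(t):
    """Probability that a standard normal variable lies within [-t, t]."""
    return math.erf(t / math.sqrt(2))

for t in (1, 2, 3):
    print(f"t = {t}: P = {phi(t):.3f}")
```

Under this assumption the loop reproduces the values 0.683, 0.954 and 0.997 used in the text.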

Table 3 gives the formulas for calculating the marginal sampling error.

Table 3. Marginal sampling error (Δ) for the mean and the share (p) for different types of sample observation

Extending Sample Results to the Population

The ultimate goal of sample observation is to characterize the general population. For small sample sizes, the empirical estimates of the parameters ($\bar{x}$ and $w$) may deviate significantly from their true values ($\bar{X}$ and $p$). It therefore becomes necessary to establish the boundaries within which the true values ($\bar{X}$ and $p$) lie for the given sample values of the parameters ($\bar{x}$ and $w$).

The confidence interval of a parameter θ of the general population is a random range of values of this parameter which, with a probability close to 1 (the reliability), contains the true value of the parameter.

The marginal sampling error Δ makes it possible to determine the limit values of the characteristics of the general population and their confidence intervals, which are equal to:

$$\bar{X} = \bar{x} \pm \Delta_{\bar{x}}, \qquad p = w \pm \Delta_w.$$

The lower bound of the confidence interval is obtained by subtracting the marginal error from the sample mean (share), and the upper bound by adding it.

The confidence interval for the mean uses the marginal sampling error and, for a given confidence level, is determined by the formula:

$$\bar{x} - \Delta_{\bar{x}} \le \bar{X} \le \bar{x} + \Delta_{\bar{x}}.$$

This means that with a given probability P, which is called the confidence level and is uniquely determined by the value of t, it can be stated that the true value of the mean lies in the range from $\bar{x} - \Delta_{\bar{x}}$ to $\bar{x} + \Delta_{\bar{x}}$, and the true value of the share lies in the range from $w - \Delta_w$ to $w + \Delta_w$.

When the confidence interval is calculated for the three standard confidence levels P = 95%, P = 99% and P = 99.9%, the value of t is selected from the Student table in the Appendix according to the number of degrees of freedom. If the sample size is large enough, the values of t corresponding to these probabilities are 1.96, 2.58 and 3.29. Thus, the marginal sampling error allows us to determine the limit values of the characteristics of the general population and their confidence intervals: $\bar{x} - \Delta_{\bar{x}} \le \bar{X} \le \bar{x} + \Delta_{\bar{x}}$ for the mean and $w - \Delta_w \le p \le w + \Delta_w$ for the share.
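For large samples, the values of t for the standard confidence levels can be obtained from the inverse normal distribution function. The following is a small sketch using the Python standard library; the helper name `confidence_factor` is an assumption made for this illustration, and the two-sided interpretation of P is assumed as well.

```python
from statistics import NormalDist

def confidence_factor(p):
    """Large-sample t such that a standard normal lies within [-t, t] with probability p."""
    return NormalDist().inv_cdf((1 + p) / 2)

for p in (0.95, 0.99, 0.999):
    print(f"P = {p}: t = {confidence_factor(p):.2f}")
```

For small samples the normal quantile is only an approximation, and the Student table with n - 1 degrees of freedom should be used instead, as the text indicates.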

Extending the results of sample observation to the general population in socio-economic studies has its own peculiarities, since it requires that all types and groups of the population be fully represented. The basis for such an extension is the calculation of the relative error:

$$\Delta_{\%} = \frac{\Delta_{\bar{x}}}{\bar{x}} \cdot 100\%, \qquad \Delta_{\%} = \frac{\Delta_w}{w} \cdot 100\%,$$

where $\Delta_{\%}$ is the relative marginal sampling error for the mean and for the share, respectively.

There are two main methods for extending a sample observation to the population: direct conversion and the method of coefficients.

The essence of direct conversion is to multiply the sample mean $\bar{x}$ by the size of the population $N$.

Example. Suppose the average number of toddlers per young family in the city is estimated by a sampling method as $\bar{x} = 1.2$ persons. If there are 1000 young families in the city, the number of places required in the municipal nursery is obtained by multiplying this average by the size of the general population N = 1000, i.e. 1200 places.

It is advisable to use the method of coefficients when sample observation is carried out in order to refine the data of complete observation.

In doing so, the formula is used:

where all variables are the size of the population:

Required sample size

Table 4. Required sample size (n) for different types of sampling organization

When planning a sample survey with a predetermined allowable sampling error, it is necessary to estimate the required sample size correctly. This size can be determined from the allowable error of the sample observation for a given probability that guarantees an acceptable error level (taking into account how the observation is organized). Formulas for the required sample size n can be obtained directly from the formulas for the marginal sampling error. Thus, from the expression for the marginal error:

$$\Delta = t\sqrt{\frac{\sigma^2}{n}}$$

the sample size n is determined directly:

$$n = \frac{t^2 \sigma^2}{\Delta^2}.$$

This formula shows that as the marginal sampling error Δ decreases, the required sample size increases substantially; it is proportional to the variance and to the square of Student's t.

For a specific method of organizing observation, the required sample size is calculated by the formulas given in Table 4.

Practical Calculation Examples

Example 1. Calculation of the mean value and confidence interval for a continuous quantitative characteristic.

To assess the speed of settlement with creditors, a bank drew a random sample of 10 payment documents. Their values (in days) were: 10; 3; 15; 15; 22; 7; 8; 1; 19; 20.

It is required to determine, with probability P = 0.954, the marginal error Δ of the sample mean and the confidence limits of the average settlement time.

Solution. The mean is calculated by the formula from Table 1 for the sample population:

$$\bar{x} = \frac{\sum x_i}{n} = \frac{120}{10} = 12.0 \text{ days}.$$

The variance is calculated by the formula from Table 1:

$$s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{478}{9} \approx 53.1.$$

The standard deviation is $s = \sqrt{53.1} \approx 7.3$ days.

The error of the mean is calculated by the formula:

$$m = \frac{s}{\sqrt{n}} = \frac{7.3}{\sqrt{10}} \approx 2.3 \text{ days},$$

i.e. the mean value is $\bar{x} \pm m = 12.0 \pm 2.3$ days.

The reliability of the mean was

$$t = \frac{\bar{x}}{m} = \frac{12.0}{2.3} \approx 5.2.$$

The marginal error is calculated by the formula from Table 3 for repeated selection, since the size of the population is unknown, at the confidence level P = 0.954:

$$\Delta = t\,m = 2 \cdot 2.3 = 4.6 \text{ days}.$$

Thus, the mean value is $\bar{x} \pm \Delta = \bar{x} \pm 2m = 12.0 \pm 4.6$, i.e. its true value lies in the range from 7.4 to 16.6 days.

Using the Student table in the Appendix, we conclude that for n - 1 = 10 - 1 = 9 degrees of freedom the obtained value is reliable at a significance level α ≤ 0.001, i.e. the resulting mean differs significantly from 0.
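The arithmetic of Example 1 can be reproduced with a short script. It is a sketch that assumes the variance is computed with the divisor n - 1 (which matches the reported error of 2.3 days) and takes t = 2 for P = 0.954; the variable names are chosen for this illustration.

```python
import math
import statistics

days = [10, 3, 15, 15, 22, 7, 8, 1, 19, 20]  # sample from Example 1
n = len(days)

mean = statistics.mean(days)        # sample mean
s = statistics.stdev(days)          # standard deviation with divisor n - 1
m = s / math.sqrt(n)                # average error of the mean
t = 2                               # confidence factor for P = 0.954
delta = t * m                       # marginal error
lower, upper = mean - delta, mean + delta

print(f"mean = {mean:.1f}, m = {m:.1f}, delta = {delta:.1f}")
print(f"interval: {lower:.1f} .. {upper:.1f} days")
```

Rounded to one decimal place, the script gives the same mean, error and interval as the worked solution.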

Example 2. Estimate of the probability (general share) p.

In a mechanical sample survey of the social status of 1000 families, the share of low-income families was found to be w = 0.3 (30%) (the sample was 2%, i.e. n/N = 0.02). It is required to determine, at the confidence level P = 0.997, the share p of low-income families in the whole region.

Solution. From the values of the function Φ(t) given above, for the confidence level P = 0.997 we find t = 3 (see formula 3). The marginal error of the share w is determined by the formula from Table 3 for non-repeated sampling (mechanical sampling is always non-repeated):

$$\Delta_w = t\sqrt{\frac{w(1-w)}{n}\left(1 - \frac{n}{N}\right)}.$$

The marginal relative sampling error in % is:

$$\Delta_{\%} = \frac{\Delta_w}{w} \cdot 100\%.$$

The probability (general share) of low-income families in the region is p = w ± Δw, and the confidence limits of p are calculated from the double inequality:

w - Δw ≤ p ≤ w + Δw, i.e. the true value of p lies within:

0.3 - 0.014 < p < 0.3 + 0.014, namely from 28.6% to 31.4%.

Thus, with a probability of 0.997, it can be stated that the share of low-income families among all families in the region ranges from 28.6% to 31.4%.

Example 3. Calculation of the mean value and confidence interval for a discrete feature specified by an interval series.

Table 5 gives the distribution of production orders by the time the enterprise takes to fulfil them.

Table 5. Distribution of observations by time of occurrence

Solution. The average order completion time is calculated by the formula:

$$\bar{x} = \frac{\sum x_i n_i}{\sum n_i}.$$

The average time will be:

$\bar{x}$ = (3*20 + 9*80 + 24*60 + 48*20 + 72*20)/200 = 23.1 months

We get the same answer if we use the data on p_i from the penultimate column of Table 5 and the formula:

$$\bar{x} = \sum x_i p_i.$$

Note that the midpoint of the interval for the last gradation is found by artificially closing it with the width of the previous interval, equal to 60 - 36 = 24 months.

The variance is calculated by the formula

where x_i is the midpoint of the i-th interval of the series.

Therefore $\sigma^2 = \frac{20^2 + 14^2 + 1 + 25^2 + 49^2}{4} \approx 905.8$, and the standard deviation is $\sigma \approx 30.1$ months.

The error of the mean is calculated by the formula $m = \sigma/\sqrt{n} \approx 13.4$ months, i.e. the mean is $\bar{x} \pm m = 23.1 \pm 13.4$.

The marginal error is calculated by the formula from Table 3 for repeated selection, because the population size is unknown, at the confidence level 0.954:

$$\Delta = t\,m = 2 \cdot 13.4 = 26.8 \text{ months}.$$

So the mean is:

$$\bar{x} \pm \Delta = 23.1 \pm 26.8,$$

i.e. its true value lies in the range from 0 to 50 months.
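The weighted mean of Example 3 can be checked with the interval midpoints and frequencies that appear in the calculation above (3, 9, 24, 48, 72 months with 20, 80, 60, 20, 20 orders). The sketch below only reproduces the mean, once from frequencies and once from shares; the variable names are chosen for this illustration.

```python
# Interval midpoints (months) and frequencies taken from the calculation in Example 3.
midpoints = [3, 9, 24, 48, 72]
frequencies = [20, 80, 60, 20, 20]

total = sum(frequencies)

# Mean weighted by frequencies n_i.
mean_by_freq = sum(x * f for x, f in zip(midpoints, frequencies)) / total

# The same mean weighted by shares p_i = n_i / total.
shares = [f / total for f in frequencies]
mean_by_share = sum(x * p for x, p in zip(midpoints, shares))

print(f"mean by frequencies: {mean_by_freq:.1f} months")
print(f"mean by shares:      {mean_by_share:.1f} months")
```

Both variants give the same value, which illustrates the remark that the share column of the table can be used instead of the frequencies.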

Example 4. To determine the speed of settlements with creditors for the N = 500 enterprises of a corporation, a commercial bank needs to conduct a sample study by random non-repeated selection. Determine the required sample size n so that, with probability P = 0.954, the error of the sample mean does not exceed 3 days, given that trial estimates showed a standard deviation s of 10 days.

Solution. To determine the required number of observations n, we use the formula for non-repeated selection from Table 4:

$$n = \frac{t^2 s^2 N}{\Delta_{\bar{x}}^2 N + t^2 s^2}.$$

In it, the value of t is determined from the values of the function Φ(t) for the confidence level P = 0.954; it equals 2. The standard deviation is s = 10, the population size is N = 500, and the marginal error of the mean is $\Delta_{\bar{x}} = 3$. Substituting these values into the formula, we get:

$$n = \frac{2^2 \cdot 10^2 \cdot 500}{3^2 \cdot 500 + 2^2 \cdot 10^2} = \frac{200000}{4900} \approx 41,$$

i.e. a sample of 41 enterprises is enough to estimate the required parameter, the speed of settlements with creditors.
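The sample-size calculation can be wrapped in a small helper. This is a sketch that assumes the usual formulas for simple random sampling, n = t²s²/Δ² for repeated selection and n = t²s²N/(Δ²N + t²s²) for non-repeated selection, rounded up to a whole unit; the function name and signature are hypothetical.

```python
import math

def required_sample_size(t, s, delta, population_size=None):
    """Required n for a given confidence factor t, standard deviation s and marginal error delta.

    If population_size is None, the formula for repeated selection is used;
    otherwise the formula for non-repeated selection from a population of that size.
    """
    ts2 = (t * s) ** 2
    if population_size is None:
        n = ts2 / delta ** 2
    else:
        n = ts2 * population_size / (delta ** 2 * population_size + ts2)
    return math.ceil(n)

# Example 4: t = 2, s = 10 days, delta = 3 days, N = 500 enterprises.
print(required_sample_size(2, 10, 3, population_size=500))
# The same requirements under repeated selection.
print(required_sample_size(2, 10, 3))
```

With the inputs of Example 4 the non-repeated formula gives 41 enterprises, while the repeated-selection formula asks for slightly more, since it ignores the finite size of the population.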

The discrepancies between the value of an indicator found through statistical observation and its actual size are called observation errors. Depending on their causes, registration errors and representativeness errors are distinguished.

Registration errors arise from incorrect establishment of facts or erroneous recording during observation or interview. They are random or systematic. Random registration errors can be made both by interviewees in their responses and by registrars. Systematic errors can be intentional or unintentional. Intentional errors are a conscious, tendentious distortion of the actual state of affairs. Unintentional errors are caused by various accidental reasons (negligence, inattention).

Representativeness errors arise because the survey is incomplete and the surveyed set does not fully reproduce the general population. They can be random or systematic. Random representativeness errors are deviations that occur in incomplete observation because the set of selected observation units (the sample) does not fully reproduce the population as a whole. Systematic representativeness errors are deviations resulting from violations of the principles of random selection of units. Representativeness errors are inherent in sample observation. They cannot be avoided; however, using the methods of probability theory based on the limit theorems of the law of large numbers, these errors can be reduced to minimum values whose bounds are established with sufficiently high accuracy.

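Sampling error is the difference between the characteristics of the sample and of the general population. For the mean, the error is determined by the formula

$$\Delta_{\bar{x}} = |\bar{x} - \bar{X}|,$$

where $\bar{x}$ is the sample mean and $\bar{X}$ is the general mean.

The value $\Delta_{\bar{x}}$ is called the marginal sampling error.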

The marginal sampling error is a random value. Limit theorems of the law of large numbers are devoted to the study of patterns of random sampling errors. These patterns are most fully disclosed in the theorems of P. L. Chebyshev and A. M. Lyapunov.

The theorem of P. L. Chebyshev, as applied to the method under consideration, can be formulated as follows: with a sufficiently large number of independent observations, it can be stated with a probability close to one (i.e. almost with certainty) that the deviation of the sample mean from the general mean will be arbitrarily small. Chebyshev's theorem proves that the error should not exceed $t\mu$. In turn, the value $\mu$, which expresses the standard deviation of the sample mean from the general mean, depends on the variability of the feature in the general population $\sigma$ and on the number of selected units $n$. This dependence is expressed by the formula

$$\mu = \sqrt{\frac{\sigma^2}{n}}, \quad (7.2)$$

where $\mu$ also depends on the sampling method.

The value $\mu = \sqrt{\sigma^2 / n}$ is called the average sampling error. In this expression $\sigma^2$ is the general variance and $n$ is the sample size.

Let us consider how the number of selected units n affects the value of the average error. It is easy to see that when more units are selected, the discrepancies between the means become smaller, i.e. there is an inverse relationship between the average sampling error and the number of selected units. Moreover, this is not simply an inverse relationship: the square of the discrepancy between the means is inversely proportional to the number of selected units.
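The inverse relationship can be illustrated numerically. The sketch below assumes the usual form of the average error for repeated selection, the square root of the variance divided by n; the variance of 100 and the sample sizes are arbitrary values chosen for the illustration, and the function name is hypothetical.

```python
import math

def average_sampling_error(variance, n):
    """Average sampling error for repeated selection: sqrt(variance / n)."""
    return math.sqrt(variance / n)

variance = 100  # arbitrary general variance (an assumption for the example)
for n in (100, 400, 1600):
    mu = average_sampling_error(variance, n)
    print(f"n = {n:5d}: mu = {mu:.3f}, mu^2 = {mu ** 2:.4f}")
```

Each fourfold increase in n halves the average error and divides its square by four, which is the proportionality described above.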

An increase in the variability of a feature entails an increase in the standard deviation and, consequently, in the error. If we assume that all units have the same value of the feature, the standard deviation becomes zero and the sampling error disappears; sampling is then unnecessary. However, it should be borne in mind that the variability of the feature in the general population is unknown, since the values of its units are unknown. Only the variability of the feature in the sample population can be calculated. The relation between the variances of the general and sample populations is expressed by the formula

$$\sigma^2 = s^2 \cdot \frac{n}{n-1}.$$

Since the value $\frac{n}{n-1}$ is close to one for sufficiently large $n$, we can approximately assume that the sample variance is equal to the general variance, i.e. $s^2 \approx \sigma^2$.

Therefore, the average sampling error shows the possible deviations of the characteristics of the sample population from the corresponding characteristics of the general population. However, the magnitude of this error can be judged only with a certain probability. The multiplier $t$ indicates the value of this probability.

The theorem of A. M. Lyapunov. A. M. Lyapunov proved that the distribution of sample means (and hence of their deviations from the general mean) for a sufficiently large number of independent observations is approximately normal, provided that the general population has a finite mean and a limited variance.

Mathematically, Lyapunov's theorem can be written as follows:

$$P\left(|\bar{x} - \bar{X}| \le \Delta\right) = \Phi(t), \quad (7.3)$$

where

$$\Phi(t) = \frac{2}{\sqrt{2\pi}} \int_0^t e^{-z^2/2}\,dz, \quad (7.4)$$

where $\pi \approx 3.14$ is a mathematical constant;

$\Delta = t\mu$ is the marginal sampling error, which makes it possible to find out within what limits the value of the general mean lies.

The values of this integral for different values of the confidence coefficient t have been calculated and are given in special mathematical tables. In particular:

t = 1: Φ(t) = 0.683; t = 2: Φ(t) = 0.954; t = 3: Φ(t) = 0.997.

Since t indicates the probability of the discrepancy $|\bar{x} - \bar{X}|$, i.e. the probability of how much the general mean will differ from the sample mean, this can be read as follows: with a probability of 0.683 it can be stated that the difference between the sample mean and the general mean does not exceed one mean sampling error. In other words, in 68.3% of cases the representativeness error will not go beyond $\pm\mu$. With a probability of 0.954 it can be stated that the representativeness error does not exceed $\pm 2\mu$ (i.e. in about 95% of cases). With a probability of 0.997, i.e. quite close to one, one can expect that the difference between the sample mean and the general mean will not exceed three times the mean sampling error, and so on.

The logic of this connection is clear: the wider the limits within which a possible error is allowed, the higher the probability with which its magnitude can be judged.

Knowing the sample mean of the feature $\bar{x}$ and the marginal sampling error $\Delta_{\bar{x}}$, it is possible to determine the boundaries (limits) that contain the general mean: $\bar{x} - \Delta_{\bar{x}} \le \bar{X} \le \bar{x} + \Delta_{\bar{x}}$.

1. Proper random sampling. This method selects units from the general population without any division into parts or groups. To comply with the basic principle of sampling, an equal chance for every unit of the general population to be selected, units are drawn at random by lot (lottery) or using a table of random numbers. Both repeated and non-repeated selection of units are possible.

The mean error of a proper random sample is the standard deviation of the possible values of the sample mean from the general mean. The average sampling errors for the proper random selection method are presented in Table 7.2.

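Table 7.2

Average sampling error μ

For the mean, repeated selection: $\mu = \sqrt{\frac{\sigma^2}{n}}$

For the mean, non-repeated selection: $\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)}$

For the share, repeated selection: $\mu = \sqrt{\frac{w(1-w)}{n}}$

For the share, non-repeated selection: $\mu = \sqrt{\frac{w(1-w)}{n}\left(1 - \frac{n}{N}\right)}$

The following designations are used in the table:

$\sigma^2$ is the sample variance;

$n$ is the sample size;

$N$ is the size of the general population;

$w = m/n$ is the sample share of units that have the feature under study;

$m$ is the number of units that have the feature under study;

$n$ is the sample size.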

To increase accuracy, the multiplier $\frac{N-n}{N-1}$ is taken instead of the multiplier $\left(1 - \frac{n}{N}\right)$, but for large N the difference between these expressions is of no practical importance.

The marginal error of proper random sampling $\Delta$ is calculated by the formula

$$\Delta = t\mu, \quad (7.6)$$

where t, the confidence coefficient, depends on the value of the probability.

Example. When one hundred product samples selected at random from a batch were examined, 20 turned out to be non-standard. With a probability of 0.954, determine the limits within which the share of non-standard products in the batch lies.

Solution. The general share (P) is calculated as $P = w \pm \Delta_w$.

Share of non-standard products: $w = \frac{20}{100} = 0.2$.

The marginal error of the sample share with a probability of 0.954 is calculated by formula (7.6) using the formula from Table 7.2 for the share:

$$\Delta_w = t\sqrt{\frac{w(1-w)}{n}} = 2\sqrt{\frac{0.2 \cdot 0.8}{100}} = 0.08.$$

With a probability of 0.954, it can be stated that the share of non-standard products in the batch is within 12% ≤ P ≤ 28%.
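The interval for the share in this example can be reproduced with a short function. It is a sketch that assumes repeated selection (no finite-population correction) and t = 2 for the probability 0.954; the function name is hypothetical.

```python
import math

def share_confidence_interval(m, n, t):
    """Confidence limits for a share under repeated selection.

    m is the number of units with the feature, n the sample size, t the confidence factor.
    """
    w = m / n
    mu = math.sqrt(w * (1 - w) / n)   # average error of the share
    delta = t * mu                    # marginal error
    return w - delta, w + delta

low, high = share_confidence_interval(m=20, n=100, t=2)
print(f"{low:.0%} <= P <= {high:.0%}")
```

For 20 non-standard items out of 100 the limits are 12% and 28%, matching the result above.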

When sample observation is designed, it is necessary to determine the sample size required to ensure a given accuracy in the calculation of general averages. The marginal sampling error and its probability are given in this case. The required sample size is established from the formula $\Delta = t\mu$ and the formulas for the mean sampling errors. The formulas for the sample size (n) depend on the selection method. The calculation of the sample size for a proper random sample is given in Table 7.3.

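Table 7.3. Required sample size n for simple random sampling

Type of selection | For the mean | For the share
Repeated | $n = \frac{t^2\sigma^2}{\Delta^2}$ | $n = \frac{t^2 w(1-w)}{\Delta^2}$
Non-repeated | $n = \frac{t^2\sigma^2 N}{N\Delta^2 + t^2\sigma^2}$ | $n = \frac{t^2 w(1-w) N}{N\Delta^2 + t^2 w(1-w)}$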

2. Mechanical sampling. This method relies on some ordering of the objects in the general population (by list, number, or alphabet). Objects are selected at a fixed interval (every 10th or 20th, say). The interval is set by the ratio $N/n$, where $n$ is the sample size and $N$ is the size of the general population. So, if a 2% sample is to be drawn from a population of 500,000 units, i.e. 10,000 units are to be selected, the selection proportion is $n/N = 10{,}000/500{,}000 = 1/50$, so every 50th unit is taken. Units are selected in accordance with this proportion at regular intervals. If the objects are arranged randomly in the general population, mechanical sampling is equivalent in substance to random selection. Mechanical selection is always non-repeated.

The average error and the sample size for mechanical selection are calculated with the simple random sampling formulas (see Tables 7.2 and 7.3).
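A minimal sketch of mechanical selection from an ordered list follows. The function name, the list of 500 units and the random starting point within the first interval are assumptions made for illustration; the text itself only specifies selection at a fixed interval.

```python
import random

def mechanical_sample(units, n, seed=None):
    """Select every k-th unit, k = N // n, starting at a random position in the first interval."""
    N = len(units)
    k = N // n                                 # selection interval
    start = random.Random(seed).randrange(k)   # assumed: random start within the first interval
    return units[start::k][:n]

units = list(range(1, 501))                    # hypothetical ordered list of 500 units
sample = mechanical_sample(units, n=10, seed=0)
print(sample)
```

Each unit can enter the sample at most once, which is why only the non-repeated formulas apply.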

3. Typical (stratified) sampling. The general population is divided by some essential feature into typical groups that are homogeneous in that respect, and units are then selected from each group. The question reduces to determining the size of the sample from each group. One option is uniform sampling, in which the same number of units is selected from each typical group. Such an approach is justified only if the typical groups are of equal size. With this selection, which is disproportionate to group size, the total number of selected units is divided by the number of typical groups, and the result is the number of units selected from each group.

A more advanced form of selection is proportional sampling, in which the number of units taken from each typical group is proportional to the group sizes, to the group dispersions, or to both combined. Suppose the sample size is set at 100 units and units are selected from the groups in one of three ways:

in proportion to the size of each group in the general population (Table 7.4). The table indicates:

$N_i$ is the size of a typical group;

$d_i$ is its share ($N_i / N$);

$N$ is the size of the general population;

$n_i$ is the sample size from a typical group, calculated as

$n_i = n \cdot d_i = n\frac{N_i}{N}$, (7.7)

where $n$ is the size of the sample from the general population.

Table 7.4 (columns: $N_i$, $d_i$, $n_i$)

in proportion to the standard deviation (Table 7.5). Here $\sigma_i$ is the standard deviation of a typical group, and the sample size $n_i$ from a typical group is calculated by the formula

$n_i = n\frac{\sigma_i}{\sum \sigma_i}$ (7.8)

Table 7.5 (columns: $N_i$, $n_i$)

in proportion to both combined (Table 7.6). The sample size is calculated by the formula

$n_i = n\frac{\sigma_i N_i}{\sum \sigma_i N_i}$. (7.9)

Table 7.6 (column: $\sigma_i N_i$)

When conducting a typical sample, direct selection from each group is carried out by random selection.
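The three ways of dividing a total sample among typical groups (by group size, by standard deviation, or by their product) can be compared side by side. The sketch below is illustrative only: the function `allocate`, the group sizes and the standard deviations are hypothetical, and the results are left unrounded.

```python
def allocate(n, sizes, sigmas=None, combined=False):
    """Split a total sample of n units among typical groups.

    sizes only            -> proportional to N_i
    sigmas, not combined  -> proportional to sigma_i
    sigmas, combined      -> proportional to sigma_i * N_i
    """
    if sigmas is None:
        weights = sizes
    elif combined:
        weights = [s * N for s, N in zip(sigmas, sizes)]
    else:
        weights = sigmas
    total = sum(weights)
    return [n * wgt / total for wgt in weights]

sizes = [300, 500, 200]        # hypothetical group sizes N_i
sigmas = [5.0, 15.0, 30.0]     # hypothetical group standard deviations
by_size = allocate(100, sizes)
by_sigma = allocate(100, sizes, sigmas)
by_both = allocate(100, sizes, sigmas, combined=True)
print(by_size, by_sigma, by_both)
```

In practice the allocated numbers would be rounded to whole units so that they still add up to the total sample size.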

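Average sampling errors are calculated using the formulas in Table 7.7, depending on the method of selection from the typical groups.

Table 7.7. Average error of a typical sample

Selection method | Repeated, for the mean | Repeated, for the share | Non-repeated, for the mean | Non-repeated, for the share
Disproportionate to group size | $\mu = \frac{1}{N}\sqrt{\sum \frac{\sigma_i^2 N_i^2}{n_i}}$ | $\mu = \frac{1}{N}\sqrt{\sum \frac{w_i(1-w_i) N_i^2}{n_i}}$ | $\mu = \frac{1}{N}\sqrt{\sum \frac{\sigma_i^2 N_i^2}{n_i}\left(1 - \frac{n_i}{N_i}\right)}$ | $\mu = \frac{1}{N}\sqrt{\sum \frac{w_i(1-w_i) N_i^2}{n_i}\left(1 - \frac{n_i}{N_i}\right)}$
Proportional to group size | $\mu = \sqrt{\frac{\bar{\sigma}^2}{n}}$ | $\mu = \sqrt{\frac{\overline{w(1-w)}}{n}}$ | $\mu = \sqrt{\frac{\bar{\sigma}^2}{n}\left(1 - \frac{n}{N}\right)}$ | $\mu = \sqrt{\frac{\overline{w(1-w)}}{n}\left(1 - \frac{n}{N}\right)}$
Proportional to the variability in groups (the most efficient) | $\mu = \frac{1}{N}\frac{\sum \sigma_i N_i}{\sqrt{n}}$ | $\mu = \frac{1}{N}\frac{\sum \sqrt{w_i(1-w_i)}\, N_i}{\sqrt{n}}$ | $\mu = \frac{1}{N}\sqrt{\frac{\left(\sum \sigma_i N_i\right)^2}{n} - \sum \sigma_i^2 N_i}$ | $\mu = \frac{1}{N}\sqrt{\frac{\left(\sum \sqrt{w_i(1-w_i)}\, N_i\right)^2}{n} - \sum w_i(1-w_i) N_i}$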

Here

$\bar{\sigma}^2$ is the average of the intragroup variances of the typical groups;

$w_i$ is the proportion of units in the $i$-th group that have the trait under study;

$\overline{w(1-w)}$ is the average of the intragroup variances for the share;

$\sigma_i$ is the standard deviation in the sample from the $i$-th typical group;

$n_i$ is the sample size from a typical group;

$n$ is the total sample size;

$N_i$ is the size of a typical group;

$N$ is the size of the general population.

With selection proportional to variability, the sample size from each typical group should be proportional to the standard deviation in that group. The numbers $n_i$ are calculated according to the formulas given in Table 7.8.

Table 7.8

4. Serial sampling. This method is convenient when the units of the population are grouped into small groups, or series. The population is divided into series of equal size, and whole series are selected into the sample. The essence of serial sampling is the random or mechanical selection of series, within which every unit is surveyed. The average error of a serial sample with equal series depends only on the interseries (intergroup) variance. The average errors are summarized in Table 7.9.

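Table 7.9. Average error of a serial sample

Series selection method | For the mean | For the share
Repeated | $\mu = \sqrt{\frac{\delta^2}{r}}$ | $\mu = \sqrt{\frac{\delta_w^2}{r}}$
Non-repeated | $\mu = \sqrt{\frac{\delta^2}{r}\left(1 - \frac{r}{R}\right)}$ | $\mu = \sqrt{\frac{\delta_w^2}{r}\left(1 - \frac{r}{R}\right)}$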

Here $R$ is the number of series in the general population;

$r$ is the number of selected series;

$\delta^2$ is the interseries (intergroup) variance of the means;

$\delta_w^2$ is the interseries (intergroup) variance of the share.
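For equal-sized series, the average error depends only on how much the series means differ from one another. The sketch below assumes the form $\mu = \sqrt{\frac{\delta^2}{r}\left(1 - \frac{r}{R}\right)}$ for non-repeated selection (a correction of $(R - r)/(R - 1)$ is an alternative) and uses hypothetical series means; the function name is illustrative.

```python
import math

def serial_error(series_means, R, repeated=False):
    """Average error of a serial sample with equal-sized series."""
    r = len(series_means)                        # number of selected series
    overall = sum(series_means) / r              # overall sample mean
    delta_sq = sum((m - overall) ** 2 for m in series_means) / r  # interseries variance
    mu_sq = delta_sq / r
    if not repeated:
        mu_sq *= 1 - r / R                       # assumed form of the non-repeated correction
    return math.sqrt(mu_sq)

mu = serial_error([3.0, 4.0], R=10)              # hypothetical means of two selected series
print(round(mu, 3))
```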

With serial selection, the required number of selected series is determined in the same way as with the proper random selection method.

The calculation of the number of serial samples is made according to the formulas given in Table. 7.10.

Table 7.10

Example. The machine shop of a plant employs 100 workers in ten teams (brigades). To study the workers' qualifications, a 20% serial non-repeated sample was drawn, which included two teams. The following distribution of surveyed workers by rank was obtained:

(Table: ranks of workers in brigade 1 and in brigade 2)

It is necessary to determine with a probability of 0.997 the limits in which the average category of the workers of the machine shop is located.

Solution. We find the sample means for the teams and the overall mean as the weighted average of the group means:

We determine the interseries variance by formula (5.25):

We calculate the average sampling error using the formula in Table. 7.9:

Let's calculate the marginal sampling error with a probability of 0.997:

With a probability of 0.997, it can be argued that the average rank of workers in a machine shop is within

To characterize the reliability of sample indicators, a distinction is made between the average and the marginal sampling error, both of which are specific to sample observation. These indicators reflect the difference between the sample indicators and the corresponding general ones.

The average sampling error is determined primarily by the sample size and also depends on the structure and degree of variation of the trait under study.

The meaning of the average sampling error is as follows. The calculated values of the sample share ($w$) and the sample mean ($\tilde{x}$) are random variables by nature. They take different values depending on which specific units of the general population fall into the sample. For example, if, when determining the average age of an enterprise's employees, one sample includes more young people and another more older workers, the sample means and sampling errors will differ. The average sampling error is determined by the formula:

$\mu = \frac{\sigma}{\sqrt{n}}$ (27) or $\mu = \sqrt{\frac{\sigma^2}{n}}$ for repeated sampling, (28)

Where: μ is the average sampling error;

σ is the standard deviation of a trait in the general population;

n is the sample size.

The error $\mu$ shows how much, on average, the mean value of the feature established from the sample differs from its true value in the general population.

It follows from the formula that the sampling error is directly proportional to the standard deviation and inversely proportional to the square root of the number of units in the sample. This means, for example, that the greater the spread of the feature's values in the general population, that is, the greater the variance, the larger the sample must be for the results of a sample survey to be trusted. Conversely, with a small variance a small sample is enough for the sampling error to stay within acceptable limits.

In non-repeated selection the part of the general population still available for selection shrinks as units are drawn, so an additional factor $\left(1 - \frac{n}{N}\right)$ is included in the formula for the average sampling error, which takes the following form:

$\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)}$

The average error is smaller for non-repeated sampling, which is why it is more widely used.

Practical conclusions require a characterization of the general population based on sample results. Sample means and proportions are applied to the general population, taking into account the limit of their possible error, and with a level of probability that guarantees it. Given a specific level of probability, the value of the normalized deviation is chosen and the marginal sampling error is determined.

The reliability (confidence probability) of the estimate of $X$ by $X^*$ is the probability $\gamma$ with which the inequality

$|X - X^*| < \delta$ (30)

holds, where $\delta$ is the marginal sampling error, which characterizes the width of the interval in which the value of the studied parameter of the general population is found with probability $\gamma$.

A confidence interval is the interval $(X^* - \delta;\ X^* + \delta)$ that covers the parameter $X$ (that is, the value of $X$ lies inside this interval) with a given reliability $\gamma$.

Usually, the reliability of the estimate is set in advance, and a number close to one is taken as γ: 0.95; 0.99 or 0.999.

The marginal error $\delta$ is related to the average error $\mu$ as follows: $\delta = t\mu$, (31)

where $t$ is the confidence coefficient, which depends on the probability $P$ with which it can be stated that the marginal error $\delta$ will not exceed $t$ times the average error $\mu$ (its values are also called critical points or quantiles of Student's distribution).

As the relation shows, the marginal error is directly proportional to the average sampling error and to the confidence coefficient, which depends on the chosen level of reliability.

From the formula for the average sampling error and the relation between the marginal and average errors, we obtain: $\delta = t\frac{\sigma}{\sqrt{n}}$

Taking into account the confidence probability, this formula will take the form.

The average sampling error shows how much a parameter of the sample deviates on average from the corresponding parameter of the general population. If we average the errors of all possible samples of a given type and size ($n$) drawn from the same general population, we obtain their generalizing characteristic, the average sampling error ($\mu$).

Sampling theory provides formulas for $\mu$ that differ by selection method (repeated or non-repeated), by the type of sample, and by the statistical indicator being estimated.

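For example, for repeated random sampling, $\mu$ is defined as:

$\mu = \sqrt{\frac{\sigma^2}{n}}$ when estimating the mean value of a feature;

$\mu = \sqrt{\frac{w(1-w)}{n}}$ if the feature is alternative and the share is estimated.

For non-repeated random selection, the formulas are corrected by the factor $(1 - n/N)$:

$\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)}$ for the mean value of the feature;

$\mu = \sqrt{\frac{w(1-w)}{n}\left(1 - \frac{n}{N}\right)}$ for a share.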

The probability that the actual error does not exceed the average error is 0.683. In practice, results are wanted with a higher probability, but this increases the size of the sampling error.

The marginal sampling error ($\Delta$) equals $t$ average sampling errors (in sampling theory the coefficient $t$ is called the confidence coefficient): $\Delta = t\mu$.

If the average error is doubled ($t = 2$), the probability that the error will not exceed this limit (twice the average error) is much higher: 0.954. With $t = 3$ the confidence level is 0.997, practically certainty.

The marginal sampling error level depends on the following factors:

  • the degree of variation of units of the general population;
  • sample size;
  • the selection scheme used (non-repeated selection gives a smaller error);
  • confidence level.

If the sample size is more than 30, the value of $t$ is taken from the normal distribution table; if it is smaller, from the Student's distribution table.

Some values of the confidence coefficient from the normal distribution table: $t = 1$ for $P = 0.683$, $t = 1.96$ for $P = 0.95$, $t = 2$ for $P = 0.954$, $t = 3$ for $P = 0.997$.
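The link between the confidence coefficient and the confidence probability under the normal law can be reproduced without a printed table. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper name `confidence` is illustrative.

```python
from statistics import NormalDist

def confidence(t):
    """Probability that a standard normal deviation stays within +/- t."""
    return 2 * NormalDist().cdf(t) - 1

for t in (1, 1.96, 2, 3):
    print(t, round(confidence(t), 3))
```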

The confidence interval for the mean value of the attribute and for the share in the general population is set as follows: $\tilde{x} - \Delta \le \bar{x} \le \tilde{x} + \Delta$ and $w - \Delta \le p \le w + \Delta$.

So, the definition of the boundaries of the general average and share consists of the following steps:

Sampling errors for different types of selection

  1. Simple random and mechanical sampling. The average errors of simple random and mechanical sampling are found using the formulas presented in Table 11.3.

Example 11.2. To study the level of return on assets, a sample survey of 90 enterprises out of 225 was carried out by repeated random sampling, producing the data presented in the table.

In this example, we have a 40% sample (90 : 225 = 0.4, or 40%). Let us determine its marginal error and the limits for the mean value of the feature in the general population, following the steps of the algorithm:

  1. Based on the results of the sample survey, we calculate the mean value and the variance in the sample:

Table 11.5. Observation results and estimated values

Return on assets, rub., x_i | Number of enterprises, f_i | Interval midpoint, x_i′ | x_i′ f_i | x_i′² f_i
Up to 1.4 | 13 | 1.3 | 16.9 | 21.97
1.4-1.6 | 15 | 1.5 | 22.5 | 33.75
1.6-1.8 | 17 | 1.7 | 28.9 | 49.13
1.8-2.0 | 15 | 1.9 | 28.5 | 54.15
2.0-2.2 | 16 | 2.1 | 33.6 | 70.56
2.2 and up | 14 | 2.3 | 32.2 | 74.06
Total | 90 | - | 162.6 | 303.62

Sample mean: $\tilde{x} = \frac{\sum x_i' f_i}{\sum f_i} = \frac{162.6}{90} \approx 1.807$

Sample variance of the trait under study: $\sigma^2 = \frac{\sum x_i'^2 f_i}{\sum f_i} - \tilde{x}^2 = \frac{303.62}{90} - 1.807^2 \approx 0.1095$

We determine the marginal sampling error with, say, a probability of 0.954. From the table of values of the normal distribution function (see the excerpt in Appendix 1), the confidence coefficient $t$ corresponding to a probability of 0.954 is 2, so $\Delta = t\sqrt{\frac{\sigma^2}{n}} = 2\sqrt{\frac{0.1095}{90}} \approx 0.07$.

Thus, in 954 cases out of 1000, the average return on assets will be no less than 1.74 rubles and no more than 1.88 rubles.

A repeated random selection scheme was used above. Let us see how the results change if we assume that selection was non-repeated. In this case, the average error is calculated using the formula

$\mu = \sqrt{\frac{\sigma^2}{n}\left(1 - \frac{n}{N}\right)} = \sqrt{\frac{0.1095}{90}(1 - 0.4)} \approx 0.027$

Then, with a probability of 0.954, the marginal sampling error will be: $\Delta = 2 \cdot 0.027 = 0.054$

The confidence limits for the mean value of the feature under non-repeated random selection are then approximately 1.75 and 1.86 rubles.

Comparing the two selection schemes, we can conclude that non-repeated random sampling gives more accurate results than repeated selection at the same confidence level. The larger the share of the general population included in the sample, the more the limits of the mean narrow when moving from repeated to non-repeated selection.
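The comparison of the two schemes in Example 11.2 can be reproduced from the grouped data of Table 11.5. This is a sketch, not the source's own calculation: it assumes the midpoints and frequencies from the table, $t = 2$ for a probability of 0.954, and the standard grouped-data formulas for the mean and variance.

```python
import math

midpoints = [1.3, 1.5, 1.7, 1.9, 2.1, 2.3]   # interval midpoints, rubles
freqs = [13, 15, 17, 15, 16, 14]             # number of enterprises
n, N, t = sum(freqs), 225, 2                 # t = 2 for P = 0.954

mean = sum(x * f for x, f in zip(midpoints, freqs)) / n
variance = sum(x * x * f for x, f in zip(midpoints, freqs)) / n - mean ** 2

mu_repeated = math.sqrt(variance / n)
mu_non_repeated = math.sqrt(variance / n * (1 - n / N))

for label, mu in (("repeated", mu_repeated), ("non-repeated", mu_non_repeated)):
    delta = t * mu
    print(f"{label}: {mean - delta:.2f} .. {mean + delta:.2f}")
```

For the repeated scheme this gives the limits 1.74 and 1.88 rubles quoted in the example; the non-repeated limits are narrower.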

Using the same example, we determine the limits of the share of enterprises in the general population whose return on assets does not exceed 2.0 rubles:

  1. We calculate the sample share. The number of enterprises in the sample with a return on assets not exceeding 2.0 rubles is 60. Then

m = 60, n = 90, w = m/n = 60 : 90 = 0.667;

  2. we calculate the variance of the share in the sample: $w(1 - w) = 0.667 \cdot 0.333 \approx 0.222$;
  3. the average sampling error under repeated selection is $\mu = \sqrt{\frac{0.222}{90}} \approx 0.05$.

Assuming non-repeated selection, the average sampling error, corrected for the finiteness of the population, is $\mu = \sqrt{\frac{0.222}{90}(1 - 0.4)} \approx 0.04$;

  4. we set the confidence probability and determine the marginal sampling error.

For $P = 0.997$, the normal distribution table (see the extract in Appendix 1) gives the confidence coefficient $t = 3$, so $\Delta = 3 \cdot 0.04 = 0.12$.

Thus, with a probability of 0.997, it can be stated that the share of enterprises in the general population with a return on assets not exceeding 2.0 rubles is no less than 54.7% and no more than 78.7%.
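The share calculation can be repeated without intermediate rounding. The sketch assumes the non-repeated formula and $t = 3$; the variable names are illustrative. Carried at full precision, the limits come out at roughly 55.1% and 78.2%, slightly narrower than the quoted 54.7% and 78.7%, which correspond to a marginal error of 0.12.

```python
import math

m, n, N, t = 60, 90, 225, 3          # t = 3 for P = 0.997
w = m / n                            # sample share
share_variance = w * (1 - w)
mu_repeated = math.sqrt(share_variance / n)
mu_non_repeated = math.sqrt(share_variance / n * (1 - n / N))
delta = t * mu_non_repeated          # marginal error, non-repeated selection
print(f"w = {w:.3f}, mu = {mu_non_repeated:.4f}, "
      f"limits: {w - delta:.3f} .. {w + delta:.3f}")
```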

  2. Typical sample. With a typical sample, the general population of objects is divided into k groups, so that

N 1 + N 2 + ... + N i + ... + N k = N.

The volume of units extracted from each typical group depends on the method of selection adopted; their total number forms the required sample size

n 1 + n 2 + … + n i + … + n k = n.

Selection within typical groups can be organized in two ways: in proportion to the size of the typical groups, or in proportion to the variability of the attribute among the units in each group. We consider the first, as the more commonly used.

Selection proportional to the size of the typical groups means that the following number of units is selected from each group:

$n_i = n\frac{N_i}{N}$

where n i is the number of extractable units for a sample from the i-th typical group;

n is the total sample size;

N i - the number of units of the general population that made up the i-th typical group;

N is the total number of units in the general population.

The selection of units within groups occurs in the form of random or mechanical sampling.

Formulas for estimating the average sampling error for the mean and the share are presented in Table 11.6.

Here, $\bar{\sigma}^2$ is the average of the group variances of the typical groups.

Example 11.3. A sample survey of students was conducted at a Moscow university to determine the average number of visits to the university library per student per semester. A 5% non-repeated typical sample was used, with typical groups corresponding to the year of study (course). With selection proportional to the size of the typical groups, the following data were obtained:

Table 11.7.

Course number | Total students, persons, N_i | Surveyed in the sample, persons, n_i | Average number of library visits per student per semester, x_i | Intragroup sample variance
1 | 650 | 33 | 11 | 6
2 | 610 | 31 | 8 | 15
3 | 580 | 29 | 5 | 18
4 | 360 | 18 | 6 | 24
5 | 350 | 17 | 10 | 12
Total | 2 550 | 128 | 8 | -

The number of students to be surveyed in each course is calculated as follows:

$n_1 = 128 \cdot \frac{650}{2550} \approx 33$ (people);

and similarly for the other groups:

n_2 = 31 (people);

n_3 = 29 (people);
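One way to carry the example through to an average error is sketched below. It assumes the formula for selection proportional to group size under non-repeated selection, $\mu = \sqrt{\frac{\bar{\sigma}^2}{n}\left(1 - \frac{n}{N}\right)}$, with $\bar{\sigma}^2$ taken as the average of the intragroup variances weighted by the group sample sizes; the weighting and the variable names are assumptions, and the data are those of Table 11.7.

```python
import math

N_i = [650, 610, 580, 360, 350]      # students per course
n_i = [33, 31, 29, 18, 17]           # students surveyed per course
means = [11, 8, 5, 6, 10]            # average visits per student
variances = [6, 15, 18, 24, 12]      # intragroup sample variances

n, N = sum(n_i), sum(N_i)
sample_mean = sum(x * k for x, k in zip(means, n_i)) / n
avg_within_var = sum(v * k for v, k in zip(variances, n_i)) / n   # weighted by n_i (assumed)
mu = math.sqrt(avg_within_var / n * (1 - n / N))                  # non-repeated selection
print(f"mean = {sample_mean:.2f}, mu = {mu:.3f}")
```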

The distribution of sample means follows the normal law (or approaches it) for n > 100, regardless of the distribution of the general population. For small samples, however, a different law applies: Student's distribution. In that case the confidence coefficient is found from the Student's t-distribution table, depending on the confidence probability P and the sample size n. Appendix 1 gives a fragment of the Student's t-distribution table, relating the confidence probability to the sample size and the confidence coefficient t.

Example 11.4. Suppose that a sample survey of eight students of the academy showed that they spent the following number of hours preparing for a test in statistics: 8.5; 8.0; 7.8; 9.0; 7.2; 6.2; 8.4; 6.6.

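Let us estimate the sample average time spent and build a confidence interval for the mean value of the attribute in the general population, taking the confidence probability equal to 0.95. The sample mean is $\tilde{x} = 61.7/8 \approx 7.7$ hours, the sample standard deviation (with $n - 1$ in the denominator) is $s \approx 0.97$, and the average error is $\mu = s/\sqrt{n} \approx 0.34$. For $P = 0.95$ and $n - 1 = 7$ degrees of freedom, Student's table gives $t = 2.365$, so $\Delta = t\mu \approx 0.8$.

That is, with a probability of 0.95, it can be stated that the average time a student spends preparing for the test is in the range from 6.9 to 8.5 hours.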
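The small-sample interval of Example 11.4 can be checked with the standard library alone. The sketch assumes the usual small-sample form $\mu = s/\sqrt{n}$ with $s$ computed on $n - 1$ degrees of freedom, and takes the tabulated Student value $t = 2.365$ (two-sided, 0.95, 7 degrees of freedom) as a constant rather than computing it.

```python
import math
import statistics

hours = [8.5, 8.0, 7.8, 9.0, 7.2, 6.2, 8.4, 6.6]
n = len(hours)
mean = statistics.fmean(hours)
s = statistics.stdev(hours)      # sample standard deviation, n - 1 in the denominator
mu = s / math.sqrt(n)            # average error for a small sample
t = 2.365                        # Student's t for P = 0.95 and 7 degrees of freedom (table value)
delta = t * mu
print(f"{mean - delta:.1f} .. {mean + delta:.1f}")
```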

11.2.2. Determining the size of the sample

Before direct sampling, the question is always decided how many units of the population under study should be selected for the survey. The formulas for determining the sample size are derived from the formulas for the marginal sampling errors in accordance with the following assumptions (Table 11.7):

  1. the type of intended sample;
  2. selection method (repeated or non-repeated);
  3. choice of the estimated parameter (average value of a feature or share).

In addition, it is necessary to determine in advance the value of the confidence level that suits the consumer of information, and the size of the allowable marginal sampling error.

Note: when using the formulas given in the table, it is recommended that the resulting sample size be rounded up to provide some margin of accuracy.

Example 11.5. Let us calculate how many of 507 industrial enterprises the tax inspectorate should check in order to determine the share of enterprises with tax violations with a probability of 0.997. According to a previous similar survey, the standard deviation was 0.15; the sampling error is expected to be no higher than 0.05.

With repeated random selection, it is necessary to check $n = \frac{t^2\sigma^2}{\Delta^2} = \frac{3^2 \cdot 0.15^2}{0.05^2} = 81$ enterprises.

With non-repeated random selection, it is necessary to check $n = \frac{t^2\sigma^2 N}{\Delta^2 N + t^2\sigma^2} = \frac{3^2 \cdot 0.15^2 \cdot 507}{0.05^2 \cdot 507 + 3^2 \cdot 0.15^2} \approx 69.8$, that is, 70 enterprises.

As can be seen, non-repeated sampling makes it possible to survey fewer objects.

Example 11.6. A survey of wages at the enterprises of an industry is planned using random non-repeated selection. What should the sample size be if 100,000 people were employed in the industry at the time of the survey? The marginal sampling error should not exceed 100 rubles with a probability of 0.954. Previous wage surveys in the industry show a standard deviation of 500 rubles.

With $t = 2$, $n = \frac{t^2\sigma^2 N}{\Delta^2 N + t^2\sigma^2} = \frac{2^2 \cdot 500^2 \cdot 100{,}000}{100^2 \cdot 100{,}000 + 2^2 \cdot 500^2} \approx 99.9$. Therefore, at least 100 people must be included in the sample.
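Both sample-size examples follow from the same pair of formulas, $n = t^2\sigma^2/\Delta^2$ for repeated selection and $n = \frac{t^2\sigma^2 N}{\Delta^2 N + t^2\sigma^2}$ for non-repeated selection. A sketch follows; the function name is an assumption, and the result is rounded up as the text recommends, with a small guard against floating-point noise.

```python
import math

def sample_size(t, variance, delta, N=None):
    """Required sample size; pass N for non-repeated selection."""
    if N is None:
        n = t**2 * variance / delta**2
    else:
        n = t**2 * variance * N / (delta**2 * N + t**2 * variance)
    return math.ceil(round(n, 6))   # round up; round() first absorbs float noise

# Example 11.5: t = 3 (P = 0.997), sigma = 0.15, delta = 0.05, N = 507
n_repeated = sample_size(3, 0.15**2, 0.05)
n_non_repeated = sample_size(3, 0.15**2, 0.05, N=507)
# Example 11.6: t = 2 (P = 0.954), sigma = 500, delta = 100, N = 100000
n_wages = sample_size(2, 500**2, 100, N=100_000)
print(n_repeated, n_non_repeated, n_wages)
```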

Average sampling error

A sample can be formed on the basis of a quantitative characteristic or an alternative (attributive) one. In the first case the generalizing characteristic of the sample is the sample mean, denoted $\tilde{x}$, and in the second the sample share, denoted $w$. In the general population the corresponding quantities are the general mean $\bar{x}$ and the general share $p$.

The differences $\tilde{x} - \bar{x}$ and $w - p$ are called the sampling error, which is divided into registration error and representativeness error. The first arises from incorrect or inaccurate information caused by misunderstanding the question, carelessness of the registrar in filling out questionnaires and forms, and so on. It is fairly easy to detect and correct. The second arises from systematic or accidental violation of the principle of random selection. It is difficult to detect and eliminate and is much larger than the first, so it receives the main attention.

The size of the sampling error depends on the composition of the sample. For example, if, when determining the average grade of a faculty's students, one sample includes more excellent students and another more weak students, the sample averages and the sampling errors will differ.

Therefore, in statistics the average error of repeated and non-repeated sampling is determined as its standard deviation, according to the formulas

$\mu = \sqrt{\frac{D_v}{n}}$ for repeated sampling; (1.35)

$\mu = \sqrt{\frac{D_v}{n}\left(1 - \frac{n}{N}\right)}$ for non-repeated sampling; (1.36)

where $D_v$ is the sample variance, determined for a quantitative characteristic by the usual formulas from Chapter 2.

For an alternative or attributive characteristic, the sample variance is determined by the formula

$D_v = w(1 - w)$. (1.37)

It can be seen from formulas (1.35) and (1.36) that the average error is smaller for a non-repeated sample, which explains its wider use.

Marginal sampling error

Since a sample survey cannot give the exact value of the parameter under study (for example, the mean) in the general population, the limits within which it lies have to be found. In a particular sample, the difference $|\tilde{x} - \bar{x}|$ can be greater than, less than or equal to $\mu$, and each of these deviations has a certain probability. In a sample survey the true value in the general population is unknown. Knowing the average sampling error, one can estimate with a certain probability the deviation of the sample mean from the general mean and establish the limits within which the parameter under study (here, the mean) lies in the general population. The deviation of the sample characteristic from the general one is called the marginal sampling error. It is defined as a multiple of the average error for a given probability, i.e.

$\Delta = t\mu$, (1.38)

where $t$ is the confidence coefficient, which depends on the probability with which the marginal sampling error is determined.

The probability of a given sampling error is found using theorems of probability theory. According to the theorem of P. L. Chebyshev, with a sufficiently large sample and a bounded population variance, the probability that the difference between the sample mean and the general mean is arbitrarily small is close to one:

$\lim_{n \to \infty} P\left(|\tilde{x} - \bar{x}| < \varepsilon\right) = 1$

A. M. Lyapunov proved that, regardless of the distribution of the general population, the probability distribution of the sample mean approaches the normal distribution as the sample size grows. This is the so-called central limit theorem. Therefore, the probability of a deviation of the sample mean from the general mean, i.e. the probability of a given marginal error, also obeys this law and can be found as a function of $t$ using the Laplace probability integral:

$P\left(|\tilde{x} - \bar{x}| \le t\mu\right) = \Phi(t) = \frac{2}{\sqrt{2\pi}}\int_0^t e^{-z^2/2}\,dz$

where $t = \frac{\tilde{x} - \bar{x}}{\mu}$ is the normalized deviation of the sample mean from the general mean.

The values of the Laplace integral for different $t$ have been calculated and are available in special tables, of which the following combinations are widely used in statistics:

Probability | 0.683 | 0.950 | 0.954 | 0.997
t | 1 | 1.96 | 2 | 3

Given a specific level of probability, the value of the normalized deviation $t$ is chosen and the marginal sampling error is determined by formula (1.38).

Most often $P = 0.95$ and $t = 1.96$ are used, i.e. with a probability of 95% the marginal sampling error is taken to be about twice the average. For this reason, $t$ is sometimes called the multiplicity factor of the marginal error relative to the average.
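Going the other way, from a chosen probability to the normalized deviation $t$, amounts to inverting the normal distribution function. The sketch below uses `statistics.NormalDist.inv_cdf` from the Python standard library; the helper name `t_for_confidence` is illustrative.

```python
from statistics import NormalDist

def t_for_confidence(p):
    """Normalized deviation t such that P(|Z| <= t) = p for a standard normal Z."""
    return NormalDist().inv_cdf((1 + p) / 2)

print(round(t_for_confidence(0.95), 2))
```

The argument $(1 + p)/2$ converts the two-sided probability into the one-sided cumulative probability that `inv_cdf` expects.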

After the marginal error has been calculated, the confidence interval of the generalizing characteristic of the general population is found. For the general mean this interval has the form

$(\tilde{x} - \Delta) \le \bar{x} \le (\tilde{x} + \Delta)$, (1.39)

and similarly for the general share

$(w - \Delta) \le p \le (w + \Delta)$. (1.40)

Consequently, sample observation does not yield one exact value of the generalizing characteristic of the general population, but only its confidence interval at a given level of probability. This is a serious shortcoming of the sampling method.

Determining the sample size

When a program of sample observation is developed, a specific value of the marginal error and a probability level are sometimes set in advance. The minimum sample size that provides the required accuracy is then unknown. It can be obtained from the formulas for the average and marginal errors, depending on the type of sample. Substituting first (1.35) and then (1.36) into formula (1.38) and solving for the sample size, we obtain the following formulas:

for repeated sampling: $n = \frac{t^2 D_v}{\Delta^2}$

for non-repeated sampling: $n = \frac{t^2 D_v N}{\Delta^2 N + t^2 D_v}$

In addition, for quantitative characteristics the sample variance must be known, but at the start of the calculations it is not. It is therefore approximated in one of the following ways:

taken from previous sample observations;

by the rule that the range of variation $R$ spans about six standard deviations ($R/\sigma = 6$, or $\sigma = R/6$; hence $D = R^2/36$);

by the "three sigma" rule, according to which about three standard deviations fit into the mean value ($\bar{x}/\sigma = 3$; hence $\sigma = \bar{x}/3$, or $D = \bar{x}^2/9$).

For non-numerical characteristics, when there is not even approximate information about the sample share, $w = 0.5$ is assumed, which by formula (1.37) corresponds to the sample variance $D_v = 0.5(1 - 0.5) = 0.25$.
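The three rough ways of guessing the variance before any data are collected can be written as one-line helpers. The function names and the inputs in the usage lines are hypothetical; the rules themselves are the ones listed in the text (range of about six standard deviations, mean of about three standard deviations, and $w = 0.5$ for a share).

```python
def variance_from_range(value_range):
    """Range taken as about six standard deviations: D = R^2 / 36."""
    return value_range ** 2 / 36

def variance_from_mean(mean):
    """Mean taken as about three standard deviations: D = mean^2 / 9."""
    return mean ** 2 / 9

def share_variance(w=0.5):
    """Variance of an alternative characteristic, w * (1 - w); largest at w = 0.5."""
    return w * (1 - w)

print(variance_from_range(60), variance_from_mean(30), share_variance())
```

Taking $w = 0.5$ is the cautious choice: no other share gives a larger variance, so the resulting sample size cannot be too small.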