Sample from the general population. General and sample populations

A set of homogeneous objects is often studied with respect to some feature that characterizes them, which may be measured quantitatively or qualitatively.

For example, for a batch of parts, the size of a part as specified by GOST can serve as a quantitative feature, and whether the part conforms to the standard can serve as a qualitative feature.

When objects must be checked for compliance with a standard, every object is sometimes examined (a complete survey), but in practice this is rare. If the general population contains a very large number of objects, a complete survey is practically impossible. In that case a limited number of objects (elements) is selected from the whole population and examined. This gives rise to the distinction between the general population and the sample population.

The general population is the totality of all objects that are subject to examination or study. As a rule it contains a finite number of elements, but if that number is very large, the population is assumed to consist of infinitely many objects in order to simplify the mathematical calculations.

A sample, or sample population, is the part of the general population that has been selected for examination. Sampling can be repeated or non-repeated. In the first case each selected element is returned to the general population before the next one is drawn; in the second it is not. In practice, non-repeated random selection is used more often.
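The difference between the two schemes can be sketched in a few lines of Python. This is only an illustration: the batch of ten numbered parts and the function names are assumptions made for the example, not something taken from the text.

```python
import random


def repeated_sample(population, n, rng):
    """Repeated selection: each drawn element is returned before the next draw."""
    return rng.choices(population, k=n)


def non_repeated_sample(population, n, rng):
    """Non-repeated selection: a drawn element is not returned."""
    return rng.sample(population, n)


rng = random.Random(0)          # fixed seed so the run is reproducible
parts = list(range(1, 11))      # hypothetical batch of 10 numbered parts

# With return, the sample may even be larger than the batch, so repeats are certain here.
with_return = repeated_sample(parts, 15, rng)
# Without return, every part appears at most once.
without_return = non_repeated_sample(parts, 5, rng)
```

Note that `random.sample` would raise `ValueError` if more elements were requested than the batch contains, which is exactly the constraint of non-repeated selection.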

The population and the sample must be related by representativeness. For the characteristics of the sample to support confident conclusions about the characteristics of the whole population, the elements of the sample must reflect the population as accurately as possible. In other words, the sample must be representative.

By the law of large numbers, a sample can be expected to be representative if it is drawn at random from the general population, that is, if all elements have an equal probability of being included in the sample.

Various methods of selection are used in practice. In principle they fall into two groups:

Option 1. Selection that does not require dividing the general population into parts. This group includes simple random repeated and simple random non-repeated selection.
Option 2. Selection in which the general population is first divided into parts. This group includes typical, mechanical and serial selection.

Simple random selection is selection in which elements are drawn one at a time, at random, from the entire population.

Typical selection is selection in which elements are drawn not from the population as a whole but from each of its “typical” parts.

Mechanical selection is selection in which the entire population is divided into as many groups as there are elements to be included in the sample, and one element is selected from each group. For example, if 25% of the parts made by a machine must be selected, every fourth part is taken; if 4% of the parts are required, every twenty-fifth part is taken, and so on. It should be noted that mechanical selection sometimes may not provide sufficient representativeness of the sample.
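The arithmetic of the example (25% means every fourth part, 4% means every twenty-fifth) can be written out as a short Python sketch. The function name and the run of 100 numbered parts are hypothetical, and the sketch assumes that the required percentage divides 100 evenly.

```python
def mechanical_selection(items, fraction_percent):
    """Select every k-th item, where k = 100 / fraction_percent."""
    if 100 % fraction_percent != 0:
        raise ValueError("this sketch assumes the percentage divides 100 evenly")
    step = 100 // fraction_percent
    # items[step - 1] is the step-th item; after it every step-th item is taken.
    return items[step - 1::step]


parts = list(range(1, 101))                 # hypothetical run of 100 numbered parts
quarter = mechanical_selection(parts, 25)   # every 4th part: 4, 8, 12, ...
few = mechanical_selection(parts, 4)        # every 25th part: 25, 50, 75, 100
```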

Serial selection is selection in which elements are taken from the population not one at a time but in “series”, each of which is surveyed completely. For example, when parts are produced by a large number of automatic machines, the output of only a few machines is surveyed in full. Serial selection is used when the feature under study varies little from series to series.

The sample is used to estimate the characteristics of the general population, and the selection method is chosen so as to reduce the estimation error. Sampling control can be either single-stage or multi-stage, which increases the reliability of the survey.

In mathematical statistics, two fundamental concepts are distinguished: the general population and the sample.
A population is a practically countable set of objects or elements that are of interest to the researcher.
A property of a population is a real or imaginary quality inherent in some or all of its elements. The property can be random or non-random.
A population parameter is a property that can be quantified as a constant or a variable.
A simple population is characterized by:
a single property (for example, all students in Russia);
a single parameter in the form of a constant or a variable (for example, all female students);
a system of non-overlapping (mutually exclusive) properties (for example, all teachers and students of schools in Vladivostok).
A complex population is characterized by:
a system of at least partially overlapping properties (for example, students of the psychology and mathematics faculties of the Far Eastern State University who graduated from school with a gold medal);
a system of independent and dependent parameters taken together (for example, in a comprehensive study of personality).
A population is called homogeneous if all of its characteristics are inherent in each of its elements.
A population is called heterogeneous if its characteristics are concentrated in separate subsets of elements.
An important parameter is the size of the population, that is, the number of elements that form it. The size depends on how the population is defined and on the specific questions of interest. Suppose we are interested in the emotional state of a single first-year student while sitting a particular exam during the session. Then the population is exhausted within half an hour. If we are interested in the emotional state of all first-year students, the population is much larger, and larger still if we take all first-year students of a given university, and so on. Clearly, populations of large size can only be studied by sampling.
A sample is a certain part of the general population, something that is directly studied.
Samples are classified according to representativeness, size, sampling method and test design.
Representative - a sample that adequately reflects the general population in both qualitative and quantitative terms. If it does not, the results will not meet the objectives of the study.
Representativeness depends on sample size: the larger the sample, the more representative it is.
By selection method, samples are random or non-random.
Random - the elements are selected at random. Since most methods of mathematical statistics are based on the concept of a random sample, the sample should naturally be random.
Non-random samples are obtained by:
mechanical selection - the entire population is divided into as many parts as there are units planned for the sample, and one element is then selected from each part;
typical selection - the population is divided into homogeneous parts, and a random sample is drawn from each;
serial selection - the population is divided into a large number of series of different sizes, and one of the series is then taken as the sample;
combined selection - the types of selection described above are combined at different stages.
By test design, samples can be independent or dependent. By size, samples are divided into small and large. Small samples are those with fewer than 30 elements and large samples those with more than 200; a medium sample has between 30 and 200 elements. Small samples are used in statistical control of known properties of populations that have already been studied.
Large samples are used to set unknown properties and population parameters.

General population - the set of people about whom the sociologist seeks to obtain information in a study. The broader the research topic, the broader the general population.

Sample population - a reduced model of the general population: the people to whom the sociologist distributes questionnaires, who are called respondents and who are the actual object of the sociological study.

Who belongs to the general population is determined by the goals of the study; who is included in the sample population is decided by mathematical methods. If a sociologist intends to look at the war in Afghanistan through the eyes of its participants, the general population includes all veterans of that war, but only a small part of them, the sample population, will be interviewed. For the sample to reflect the general population accurately, the sociologist follows the rule that every veteran, regardless of place of residence, place of work, state of health and other circumstances, must have the same probability of being included in the sample.

Once the sociologist has decided whom to interview, the sampling frame has been determined. The next question is the type of sample.

Samples are divided into three large classes:

a) complete (censuses, referendums): all units of the general population are surveyed;

b) random;

c) non-random.

Random and non-random types of sampling, in turn, are divided into several types.

The random ones are:

1) probabilistic;

2) systematic;

3) zoned (stratified);

4) nested (cluster).

The non-random ones are:

1) "spontaneous";

2) quota;

3) the "main array" method.

A complete and accurate list of selection units forms the sampling frame. The items to be selected are called selection units. Selection units can coincide with units of observation, since the unit of observation is the element of the general population from which information is collected directly. Usually the unit of observation is an individual. Selection from a list is best done by numbering the units and using a table of random numbers, although a quasi-random method, in which every nth element of a simple list is taken, is often used.

While the sampling frame is a list of selection units, the sample structure groups them by important characteristics, such as the distribution of individuals by profession, qualification, sex or age. If the general population consists of, say, 30% young people, 50% middle-aged people and 20% elderly people, the same percentage proportions of the three age groups should hold in the sample population. Social class, gender, nationality and so on can be added to age. For each characteristic, percentage proportions are established in the general population and in the sample. Thus the sample structure is the set of percentage proportions of the characteristics of the object on the basis of which the sample is compiled.

If the sample type tells how people get into the sample, then the sample size tells how many of them got there.

Sample size - the number of sample units. Since the sample population is a part of the general population selected by special methods, its size is always smaller than the size of the general population. It is therefore important that the part does not distort the picture of the whole, that is, that it represents it.

The reliability of the data depends not so much on the quantitative characteristic of the sample (its size) as on a qualitative characteristic of the general population: its degree of homogeneity. The discrepancy between the general population and the sample is called the representativeness error; an error of up to 5% is considered acceptable.

Here are some ways to avoid the error:

each unit in the population must have an equal probability of being included in the sample;

it is desirable to select from homogeneous populations;

the characteristics of the general population must be known;

random and systematic errors must be taken into account when compiling the sample.

If the sample is compiled correctly, the sociologist obtains reliable results that characterize the entire general population.

What are the main sampling methods?

Mechanical sampling method. The required number of respondents is selected from the general list of the general population at regular intervals (for example, every 10th).

Serial sampling method. The general population is divided into homogeneous parts, and units of analysis are selected from each part proportionally (for example, 20% of the men and 20% of the women at an enterprise).

Nested sampling method. The selection units are not individual respondents but groups, each of which is then surveyed completely. This sample is representative if the groups are similar in composition (for example, one group of students from each stream of a university faculty).

Main array method. A survey of 60–70% of the general population.

Quota sampling method. The most complex method, requiring the determination of at least four characteristics, according to which the selection of respondents is carried out. It is usually used with a large general population.

General population (in English, population) - the totality of all objects (units) about which the researcher intends to draw conclusions when studying a specific problem.

The general population consists of all objects that are subject to study. Its composition depends on the objectives of the study. Sometimes the general population is the entire population of a region (for example, when the attitude of potential voters to a candidate is studied), but more often several criteria are set that define the object of study, for example men aged 30-50 who use a certain brand of razor at least once a week and have an income of at least $100 per family member.

Sample, or sample population - a set of cases (subjects, objects, events, specimens) selected from the general population by a certain procedure to take part in the study.

Sample characteristics:

· Qualitative characteristics of the sample - who exactly we choose and what methods of sample construction we use for this.

· The quantitative characteristic of the sample is how many cases we select, in other words, the sample size.

Need for sampling

· The object of study is very broad. For example, consumers of the products of a global company are a huge number of geographically dispersed markets.

· There is a need to collect primary information.

Sample size

Sample size - the number of cases included in the sample. For statistical reasons, it is recommended that the number of cases be at least 30-35.

Dependent and independent samples

When two (or more) samples are compared, an important parameter is whether they are dependent. If a homomorphic pair can be established for every case in the two samples (that is, one case from sample X corresponds to one and only one case from sample Y, and vice versa), and this basis for pairing is relevant to the feature measured in the samples, the samples are called dependent. Examples of dependent samples:

· pairs of twins,

· two measurements of a feature before and after experimental exposure,

· husbands and wives,

· etc.

If there is no such relationship between the samples, they are considered independent, for example:

· men and women,

· psychologists and mathematicians.

Accordingly, dependent samples always have the same size, while the size of independent samples may differ.

Samples are compared using various statistical criteria:

· Student's t-test

· Wilcoxon test

· Mann-Whitney U test

· Sign test

· etc.

Representativeness

The sample may be considered representative or non-representative.

An example of a non-representative sample

In the United States, one of the best-known historical examples of a non-representative sample comes from the 1936 presidential election. The Literary Digest, which had successfully predicted the outcomes of several previous elections, got its forecast wrong after sending out ten million test ballots to its subscribers, to people selected from telephone directories across the country, and to people on car registration lists. In the 25% of ballots that were returned (nearly 2.5 million), the votes were distributed as follows:

· 57% preferred the Republican candidate, Alf Landon

· 40% chose the incumbent Democratic president, Franklin Roosevelt

As is well known, Roosevelt won the actual election with more than 60% of the vote. The Literary Digest's mistake was this: wanting to make the sample more representative, since they knew that most of their subscribers considered themselves Republicans, they expanded the sample with people selected from telephone directories and registration lists. However, they failed to take the realities of the time into account and in fact recruited even more Republicans: during the Great Depression it was mostly the middle and upper classes (that is, mostly Republicans rather than Democrats) who could afford telephones and cars.

Types of plan for building groups from samples

There are several main types of group building plan:

1. Study with experimental and control groups, which are placed in different conditions.

2. Study with experimental and control groups using a paired selection strategy.

3. Study using only one group - experimental.

4. A study using a mixed (factorial) plan - all groups are placed in different conditions.

Sample types

Samples are divided into two types:

· probabilistic

· non-probability

Probability samples

1. Simple probability sampling:

· Simple repeated sampling. This sample rests on the assumption that every respondent is equally likely to be included in the sample. Cards bearing the respondents' numbers are prepared from the list of the general population. They are put into a deck and shuffled, a card is drawn at random, its number is written down, and the card is returned to the deck. The procedure is repeated as many times as the required sample size. Drawback: selection units can be repeated.

The procedure for constructing a simple random sample includes the following steps:

1. obtain a complete list of the members of the general population and number it. Such a list, recall, is called the sampling frame;

2. determine the intended sample size, that is, the intended number of respondents;

3. take as many numbers from a table of random numbers as there are sample units required. If the sample is to include 100 people, 100 random numbers are taken from the table. The random numbers can also be generated by a computer program;

4. select from the list those observations whose numbers match the random numbers drawn.
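The four steps above can be sketched in Python, with a pseudo-random generator standing in for the table of random numbers. The frame of 500 labelled respondents and the function name are hypothetical, and the sketch assumes the non-repeated variant, in which all drawn numbers are distinct.

```python
import random


def simple_random_sample(frame, sample_size, seed=None):
    """Draw a simple random sample (without repeats) from a numbered sampling frame."""
    rng = random.Random(seed)
    # Step 1: number the list of the general population from 1 to N.
    numbered = dict(enumerate(frame, start=1))
    # Steps 2-3: draw as many distinct random numbers as sample units are needed.
    numbers = rng.sample(range(1, len(frame) + 1), sample_size)
    # Step 4: pick the entries whose numbers were drawn.
    return [numbered[i] for i in numbers]


frame = [f"respondent_{i:03d}" for i in range(1, 501)]   # hypothetical sampling frame
sample = simple_random_sample(frame, 100, seed=42)
```

Fixing the seed makes the draw reproducible, which is convenient when the selection has to be documented.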

· Simple random sampling has obvious advantages. The method is extremely easy to understand. The results of the study can be generalized to the population studied. Most approaches to statistical inference assume that the information is collected by simple random sampling. However, the method has at least four significant limitations:

1. It is often difficult to create a sampling frame that would allow a simple random sample to be drawn.

2. A simple random sample may turn out to be large, or to be spread over a large geographical area, which significantly increases the time and cost of data collection.

3. Simple random sampling often yields lower precision and a larger standard error than other probability sampling methods.

4. Simple random sampling may produce an unrepresentative sample. Although samples obtained by simple random selection represent the population adequately on average, some of them represent it very poorly. This is especially likely when the sample is small.

· Simple non-repeated sampling. The procedure is the same, except that the cards with the respondents' numbers are not returned to the deck.

1. Systematic probability sampling. This is a simplified version of simple probability sampling. Respondents are selected from the list of the general population at a fixed interval (K). The value of K is determined randomly. The most reliable results are obtained with a homogeneous general population; otherwise the step size may coincide with some internal cyclic pattern in the list and bias the sample (sample mixing). Drawbacks: the same as for simple probability sampling.
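The danger of the step coinciding with a cyclic pattern can be seen in a small Python sketch. Everything here is an assumption made for illustration: the list has an artificial cycle of length 7, the step K is given, and only the starting position is chosen at random.

```python
import random


def systematic_sample(frame, k, rng):
    """Take every k-th element, starting from a random position in the first interval."""
    start = rng.randrange(k)
    return frame[start::k]


# Hypothetical list with a cycle of length 7: every 7th record is 10, the rest are 0.
frame = [10 if i % 7 == 0 else 0 for i in range(700)]
rng = random.Random(1)

aligned = systematic_sample(frame, 7, rng)    # step coincides with the cycle
offset = systematic_sample(frame, 10, rng)    # step does not coincide with the cycle

population_mean = sum(frame) / len(frame)     # 10/7, about 1.43
aligned_mean = sum(aligned) / len(aligned)    # either 0 or 10, never close to 10/7
offset_mean = sum(offset) / len(offset)       # equals the population mean here
```

With K = 7 every selected record sits at the same position in the cycle, so the sample contains only one kind of value; with K = 10 the selected positions run through the whole cycle evenly.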

2. Serial (nested) sampling. The selection units are statistical series (a family, a school, a team, etc.). Every element of a selected series is surveyed. The series themselves can be selected by random or systematic sampling. Drawback: the sample may be more homogeneous than the general population.
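A minimal Python sketch of this scheme follows, assuming the series are chosen by simple random selection. The six school classes, the pupil labels and the function name are hypothetical.

```python
import random


def serial_sample(series, n_series, rng):
    """Choose whole series at random and survey every element of each chosen series."""
    chosen = rng.sample(sorted(series), n_series)
    return {name: list(series[name]) for name in chosen}


# Hypothetical population: 6 school classes of 4 pupils each.
series = {
    f"class_{c}": [f"pupil_{c}{i}" for i in range(1, 5)]
    for c in "ABCDEF"
}
rng = random.Random(3)
surveyed = serial_sample(series, 2, rng)
respondents = [p for pupils in surveyed.values() for p in pupils]
```

Randomness enters only at the level of the series; inside a chosen series nobody is skipped.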

3. Zoned (stratified) sample. When the general population is heterogeneous, it is recommended to divide it into homogeneous parts before applying probability sampling with any selection technique; such a sample is called zoned. The zoning groups can be natural formations (for example, city districts) or can be defined by any feature underlying the study. The feature on which the division is based is called the stratification (zoning) feature.

4. "Convenience" sampling. The procedure consists in contacting "convenient" selection units: a group of students, a sports team, friends and neighbors. If the aim is to learn how people react to a new concept, such a sample is quite reasonable. "Convenience" sampling is often used for preliminary testing of questionnaires.

Non-probability samples

The selection in such a sample is carried out not according to the principles of chance, but according to subjective criteria - accessibility, typicality, equal representation, etc.

1. Quota sampling - the sample is built as a model that reproduces the structure of the general population in the form of quotas (proportions) of the characteristics studied. The number of sample elements with each combination of the characteristics is set so that it corresponds to their share (proportion) in the general population. For example, if the general population consists of 5,000 people, of whom 2,000 are women and 3,000 are men, the quota sample will contain 20 women and 30 men, or 200 women and 300 men. Quota samples are most often based on demographic criteria: gender, age, region, income, education and others. Cons: such samples are usually not representative, because it is impossible to take several social parameters into account at once. Pros: easily accessible material.
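The quota arithmetic of the example can be reproduced with a short Python sketch. The function name is hypothetical, and the sketch deliberately refuses sample sizes for which a quota would not be a whole number, since the text gives no rounding rule.

```python
from fractions import Fraction


def quotas(population_counts, sample_size):
    """Split sample_size between groups in the same proportions as the population."""
    total = sum(population_counts.values())
    result = {}
    for group, count in population_counts.items():
        share = Fraction(count, total) * sample_size
        if share.denominator != 1:
            raise ValueError("quota is not a whole number; a rounding rule is needed")
        result[group] = int(share)
    return result


population = {"women": 2000, "men": 3000}
small = quotas(population, 50)     # {'women': 20, 'men': 30}
large = quotas(population, 500)    # {'women': 200, 'men': 300}
```

Exact fractions are used instead of floating-point numbers so that the whole-number check is reliable.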

2. Snowball method. The sample is constructed as follows. Each respondent, starting with the first, is asked to put the researcher in touch with friends, colleagues and acquaintances who fit the selection conditions and could take part in the study. Thus, with the exception of the first step, the sample is formed with the participation of the objects of study themselves. The method is often used when hard-to-reach groups of respondents must be found and interviewed (for example, respondents with a high income, respondents belonging to the same professional group, respondents with similar hobbies or passions, etc.).

3. Spontaneous sampling - sampling of the so-called "first comer". Often used in television and radio polls. The size and composition of spontaneous samples is not known in advance, and is determined by only one parameter - the activity of the respondents. Disadvantages: it is impossible to establish what kind of general population the respondents represent, and as a result, it is impossible to determine representativeness.

4. Route survey - often used when the unit of study is the family. All streets are numbered on the map of the settlement in which the survey will be carried out. Large numbers are selected using a table (generator) of random numbers. Each large number is treated as consisting of 3 components: the street number (the first 2-3 digits), the house number and the apartment number. For example, in the number 14832, 14 is the street number on the map, 8 is the house number and 32 is the apartment number.

5. Zoned sampling with selection of typical objects. If, after zoning, a typical object is selected from each group, i.e. an object that, according to most of the characteristics studied in the study, approaches the average, such a sample is called zoned with the selection of typical objects.

Group Building Strategies

The selection of groups for their participation in a psychological experiment is carried out using various strategies that are needed in order to ensure the greatest possible compliance with internal and external validity.

· Randomization (random selection)

· Pairwise selection

· Stratometric selection

· Approximate modeling

· Engaging Real Groups

Randomization, or random selection, is used to create simple random samples. The use of such a sample is based on the assumption that each member of the population is equally likely to be included in the sample. For example, to make a random sample of 100 university students, you can put pieces of paper with the names of all university students in a hat, and then take 100 pieces of paper out of it - this will be random selection (Goodwin J., p. 147).

Pairwise selection - a strategy for constructing sample groups in which the groups are made up of subjects who are equivalent with respect to secondary parameters that matter for the experiment. The strategy is effective for experiments with experimental and control groups; the best option is to recruit twin pairs (mono- and dizygotic), as it allows you to create ...

Stratometric selection - randomization with the allocation of strata (or clusters). With this method of sampling, the general population is divided into groups (strata) that have certain characteristics (sex, age, political preferences, education, income level, etc.), and subjects with the corresponding characteristics are selected.
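Selection by strata can be sketched in Python as follows. The sketch assumes that the same percentage is drawn at random from every stratum; the age groups, their sizes and the function name are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(units, stratum_of, percent, rng):
    """Draw the same percentage of units, at random, from every stratum."""
    strata = defaultdict(list)
    for unit in units:
        strata[stratum_of(unit)].append(unit)
    sample = []
    for name in sorted(strata):
        members = strata[name]
        k = len(members) * percent // 100   # whole number of units from this stratum
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical population: 300 young, 500 middle-aged and 200 elderly people.
units = (
    [("young", i) for i in range(300)]
    + [("middle", i) for i in range(500)]
    + [("elderly", i) for i in range(200)]
)
rng = random.Random(7)
sample = stratified_sample(units, lambda u: u[0], 10, rng)
```

Because each stratum contributes in proportion to its size, the sample of 100 keeps the 30/50/20 age structure of the population.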

Approximate modeling - drawing up limited samples and generalizing the conclusions about such a sample to a wider population. For example, when second-year university students take part in a study, its findings are extended to "people aged 17 to 21 years." The admissibility of such generalizations is extremely limited.

Approximate modeling is the formation of a model that, for a clearly defined class of systems (processes), describes its behavior (or desired phenomena) with acceptable accuracy.

Lecture 6. Elements of mathematical statistics

Questions to control knowledge and summarize the lecture

1. Define a random variable.

2. Write formulas for the mathematical expectation and variance of discrete and continuous random variables.

3. State Laplace's local and integral limit theorems.

4. Write formulas for the binomial distribution, hypergeometric distribution, Poisson distribution, uniform distribution, and normal distribution.

Purpose: To study the basic concepts of mathematical statistics

1. Population and sample

2. Statistical distribution of the sample. Polygon. Histogram.

3. Estimates of the parameters of the general population based on its sample

4. General and sample averages. Methods for their calculation.

5. General and sample variances.

6. Questions to control knowledge and summarize the lecture

We begin the study of the elements of mathematical statistics, which develops scientifically grounded methods for collecting and processing statistical data.

1. General population and sample. Suppose we need to study a set of homogeneous objects (such a set is called a statistical population) with respect to some qualitative or quantitative feature that characterizes these objects. For example, for a batch of parts, whether a part is standard can serve as a qualitative feature, and the controlled dimension of the part can serve as a quantitative feature.

The best approach is a complete survey, that is, examining every object. In most cases, however, this is impossible for one reason or another. A complete survey may be prevented by the large number of objects or by their inaccessibility. It is also ruled out when examination destroys the object: if, for example, we need to know the average depth of the crater produced by the explosion of a shell from an experimental batch, a complete survey would destroy the entire batch.

If a complete survey is not possible, then a part of the objects is selected for study from the entire population.

The statistical population from which some of the objects are selected is called the general population. A set of objects randomly selected from the general population is called a sample.

The number of objects in the general population and in the sample is called the size of the general population and the size of the sample, respectively.

Example 10.1. The fruits of one tree (200 in all) are examined for the taste characteristic of the variety. For this purpose 10 fruits are selected. Here 200 is the size of the general population and 10 is the sample size.

If objects are drawn one at a time and each object is examined and returned to the general population before the next draw, the sample is called repeated. If the selected objects are not returned to the general population, the sample is called non-repeated.

In practice, non-repeated sampling is used more often. If the sample size is a small fraction of the population size, the difference between repeated and non-repeated sampling is negligible.
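The last remark can be made quantitative with a standard result from sampling theory, given here as background and not as part of the lecture itself. Assume a general population of size $N$ with variance $\sigma^2$ (computed with divisor $N$) and a sample of size $n$ with mean $\bar{x}$; these symbols are introduced only for this illustration.

```latex
\operatorname{Var}(\bar{x})_{\text{repeated}} = \frac{\sigma^2}{n},
\qquad
\operatorname{Var}(\bar{x})_{\text{non-repeated}} = \frac{\sigma^2}{n}\cdot\frac{N-n}{N-1}.
```

For the fruit example ($N = 200$, $n = 10$) the correction factor is $190/199 \approx 0.955$. As $N$ grows with $n$ fixed, the factor tends to 1 and the two schemes become practically indistinguishable.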

The properties of the objects in the sample must correctly reflect the properties of the objects in the general population, or, as they say, the sample must be representative. A sample is considered representative if all objects of the general population have the same probability of being included in it, that is, if the selection is made at random. For example, to estimate a future harvest, one can take a sample from the general population of fruits that have not yet ripened and examine their characteristics (weight, quality, etc.). If the entire sample is taken from one tree, it will not be representative. A representative sample should consist of randomly selected fruits from randomly selected trees.

2. Statistical distribution of the sample. Polygon. Histogram. Let a sample be drawn from the general population in which the value $x_1$ was observed $n_1$ times, $x_2$ was observed $n_2$ times, ..., $x_k$ was observed $n_k$ times, and $n_1 + n_2 + \dots + n_k = n$ is the sample size. The observed values $x_1, x_2, \dots, x_k$ are called variants, and the sequence of variants written in ascending order is called a variation series. The numbers of observations $n_1, n_2, \dots, n_k$ are called frequencies, and their ratios to the sample size, $n_1/n, n_2/n, \dots, n_k/n$, are called relative frequencies. Note that the sum of the relative frequencies is equal to one: $n_1/n + n_2/n + \dots + n_k/n = 1$.
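These definitions translate directly into a short Python sketch that builds the variation series with frequencies and relative frequencies. The ten observations and the function name are hypothetical.

```python
from collections import Counter
from fractions import Fraction


def statistical_distribution(sample):
    """Return (variant, frequency, relative frequency) rows in ascending order of variant."""
    counts = Counter(sample)
    n = len(sample)
    return [(x, counts[x], Fraction(counts[x], n)) for x in sorted(counts)]


sample = [2, 5, 2, 7, 5, 2, 7, 5, 5, 2]    # hypothetical observations, n = 10
table = statistical_distribution(sample)
# [(2, 4, Fraction(2, 5)), (5, 4, Fraction(2, 5)), (7, 2, Fraction(1, 5))]

total_relative = sum(row[2] for row in table)   # equals 1
```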

The statistical distribution of a sample is the list of variants and their corresponding frequencies or relative frequencies. The statistical distribution can also be given as a sequence of intervals and their corresponding frequencies (continuous distribution). The frequency assigned to an interval is the sum of the frequencies of the variants that fall into it. Polygons and histograms are used to represent the statistical distribution graphically.

To construct a polygon, the variant values $x_i$ are plotted on the $Ox$ axis and the frequencies $n_i$ (or the relative frequencies $n_i/n$) on the $Oy$ axis, and the resulting points are joined by line segments.

Example 10.2. Fig. 10.1 shows the polygon of the following distribution

The polygon is usually used when the number of variants is small. When the number of variants is large, or when the feature has a continuous distribution, a histogram is more often constructed. To do this, the interval containing all observed values of the feature is divided into several partial intervals of length $h$, and for each partial interval the sum $n_i$ of the frequencies of the variants falling into the $i$-th interval is found. Rectangles are then constructed on these intervals as bases, with heights $n_i/h$ (or $n_i/(nh)$, where $n$ is the sample size).

The area of the $i$-th partial rectangle is $h \cdot \frac{n_i}{h} = n_i$ (or $h \cdot \frac{n_i}{nh} = \frac{n_i}{n}$).

Therefore, the area of the histogram is equal to the sum of all frequencies (or relative frequencies), i.e. to the sample size (or to one).
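The construction can be checked numerically with a small Python sketch. The sample, the interval boundaries and the function name are hypothetical; the sketch assumes that all observations lie inside the partitioned interval and that an observation on a boundary is counted in the interval to its right, except the right end, which goes into the last interval.

```python
def histogram_heights(sample, low, width, bins):
    """Count observations per partial interval and return the heights n_i / h."""
    counts = [0] * bins
    for x in sample:
        # Index of the partial interval; the right end is put into the last interval.
        i = min(int((x - low) // width), bins - 1)
        counts[i] += 1
    heights = [c / width for c in counts]
    return counts, heights


sample = [1, 2, 2, 3, 4, 4, 4, 6, 7, 9]    # hypothetical observations, n = 10
counts, heights = histogram_heights(sample, low=0, width=2, bins=5)
# counts  -> [1, 3, 3, 2, 1]
# heights -> [0.5, 1.5, 1.5, 1.0, 0.5]

area = sum(height * 2 for height in heights)   # 10.0, the sample size
```

Dividing the heights additionally by the sample size would give the histogram of relative frequencies, whose total area is one.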

Example 10.3. Fig. 10.2 shows the histogram of a continuous distribution of size $n = 100$ given in the following table.