
Data clustering algorithms. Algorithms based on graph theory


Good day. I have great respect for people who are passionate about their work.

My friend Maxim belongs to this category. He constantly works with figures, analyzes them, and prepares the relevant reports.

Yesterday we had lunch together, and for almost half an hour he told me about cluster analysis: what it is and in what cases its application is reasonable and worthwhile. And me?

I have a good memory, so I will pass all of this on to you in its original and most informative form.

Cluster analysis is designed to divide a set of objects into homogeneous groups (clusters or classes). This is a task of multivariate data classification.

There are about 100 different clustering algorithms; however, the most commonly used are hierarchical cluster analysis and k-means clustering.

Where is cluster analysis used? In marketing, this is the segmentation of competitors and consumers.

In management: dividing personnel into groups with different levels of motivation, classifying suppliers, and identifying similar production situations in which defects occur.

In medicine, the classification of symptoms, patients, drugs. In sociology, the division of respondents into homogeneous groups. In fact, cluster analysis has proven itself well in all spheres of human life.

The beauty of this method is that it works even when there is little data and the requirements for the normality of distributions of random variables and other requirements of classical methods of statistical analysis are not met.

Let us explain the essence of cluster analysis without resorting to strict terminology:
Let's say you conducted a survey of employees and want to determine how you can most effectively manage your staff.

That is, you want to divide employees into groups and select the most effective control levers for each of them. At the same time, the differences between groups should be obvious, and within the group, the respondents should be as similar as possible.

To solve the problem, it is proposed to use hierarchical cluster analysis.

As a result, we will get a tree, looking at which we must decide how many classes (clusters) we want to split the staff into.
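For readers who like to see the mechanics, here is a minimal sketch of such a hierarchical clustering in Python with SciPy; the survey data, the choice of Ward linkage and the cut into three groups are illustrative assumptions, not Maxim's actual calculation.

```python
# A minimal sketch (not the original study): hierarchical clustering of a hypothetical
# employee survey with SciPy. The data, the Ward linkage and the cut into three
# groups are illustrative assumptions.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
answers = rng.integers(1, 11, size=(20, 5)).astype(float)  # 20 respondents, 5 questions, 10-point scale

tree = linkage(answers, method="ward", metric="euclidean")  # build the hierarchical tree

dendrogram(tree)   # the tree we inspect to decide how many clusters to keep
plt.show()

labels = fcluster(tree, t=3, criterion="maxclust")  # e.g. cut the tree into three groups
print(labels)
```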

Suppose we decide to divide the staff into three groups; then, to study the respondents who fell into each cluster, we get a table with the following content:


Let us explain how the above table is formed. The first column contains the number of the cluster — the group whose data is reflected in the row.

For example, the first cluster is 80% male. 90% of the first cluster fall into the age group from 30 to 50 years old, and 12% of respondents believe that benefits are very important. And so on.

Let's try to make portraits of respondents of each cluster:

  1. The first group consists mainly of men of mature age occupying leadership positions. The social package (MED, LGOTI, TIME - free time) does not interest them. They prefer to receive a good salary rather than help from the employer.
  2. Group two, on the contrary, prefers the social package. It consists mainly of older people occupying low positions. Salary is certainly important to them, but there are other priorities.
  3. The third group is the "youngest". Unlike the previous two, it shows an obvious interest in learning and professional growth opportunities. This category of employees has a good chance of joining the first group before long.

Thus, when planning a campaign to introduce effective personnel management methods, it is clear that in our situation the social package for the second group can be increased at the expense of, for example, wages.

If we talk about which specialists should be sent for training, then we can definitely recommend paying attention to the third group.

Source: http://www.nickart.spb.ru/analysis/cluster.php

Features of cluster analysis

A cluster is the price of an asset in a certain period of time during which transactions were made. The resulting volume of purchases and sales is indicated by a number within the cluster.

A bar of any timeframe (TF) contains, as a rule, several clusters. This allows you to see in detail the volumes of purchases and sales and their balance in each individual bar, at each price level.


A change in the price of one asset inevitably entails a chain of price movements on other instruments as well.

Attention!

In most cases, a trend movement is recognized only when it is already developing rapidly, and entering the market with the trend at that point risks getting caught in a corrective wave.

For successful trades, it is necessary to understand the current situation and be able to anticipate future price movements. This can be learned by analyzing the cluster graph.

With the help of cluster analysis, you can see the activity of market participants inside even the smallest price bar. This is the most accurate and detailed analysis, as it shows the point distribution of transaction volumes for each asset price level.

In the market there is a constant confrontation between the interests of sellers and buyers, and every smallest price movement (tick) is a move toward a compromise, a price level that suits both parties at that moment.

But the market is dynamic, the number of sellers and buyers is constantly changing. If at one point in time the market was dominated by sellers, then the next moment, most likely, there will be buyers.

The number of completed transactions at neighboring price levels is also not the same. And yet, first, the market situation is reflected in the total volume of transactions, and only then on the price.

If you see the actions of the dominant market participants (sellers or buyers), then you can predict the price movement itself.

To successfully apply cluster analysis, you first need to understand what a cluster and a delta are.


A cluster is a price movement divided into levels at which transactions of known volume were made. The delta shows the difference between the buying and the selling that took place in each cluster.

Each cluster, or group of deltas, allows you to figure out whether buyers or sellers dominate the market at a given time.

It is enough to calculate the total delta as the difference between purchases and sales. If the delta is negative, the market is oversold and sell transactions predominate. When the delta is positive, buyers clearly dominate the market.
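A rough sketch of the calculation (the price levels and volumes below are invented; no real data feed is assumed):

```python
# Delta of one bar of a cluster chart: buy volume minus sell volume, summed over
# the price levels of the bar. All numbers are made up for illustration.
from collections import defaultdict

trades = [            # (price level, buy volume, sell volume)
    (1.1050, 120, 80),
    (1.1051, 60, 140),
    (1.1052, 200, 90),
]

levels = defaultdict(lambda: {"buy": 0, "sell": 0})
for price, buy, sell in trades:
    levels[price]["buy"] += buy
    levels[price]["sell"] += sell

bar_delta = sum(v["buy"] - v["sell"] for v in levels.values())
print(bar_delta)  # positive: buyers dominate, negative: sellers dominate
```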

The delta itself can take on a normal or critical value. The value of the delta volume above the normal value in the cluster is highlighted in red.

If the delta is moderate, then this characterizes a flat state in the market. With a normal delta value, a trend movement is observed in the market, but a critical value is always a harbinger of a price reversal.

Forex trading with cluster analysis

To get the maximum profit, you need to be able to determine the transition of the delta from a moderate level to a normal one. Indeed, in this case, you can notice the very beginning of the transition from a flat to a trend movement and be able to get the most profit.

The cluster chart is more visual, on it you can see significant levels of accumulation and distribution of volumes, build support and resistance levels. This allows the trader to find the exact entry to the trade.

Using the delta, one can judge the predominance of sales or purchases in the market. Cluster analysis allows you to observe transactions and track their volumes inside the bar of any TF.

This is especially important when approaching significant support or resistance levels. Cluster judgments are the key to understanding the market.

Source: http://orderflowtrading.ru/analitika-rynka/obemy/klasternyy-analiz/

Areas and features of application of cluster analysis

The term cluster analysis (first introduced by Tryon, 1939) actually includes a set of different classification algorithms.

A common question asked by researchers in many fields is how to organize observed data into visual structures, i.e., how to develop taxonomies.

According to the modern system accepted in biology, man belongs to primates, mammals, amniotes, vertebrates and animals.

Note that in this classification, the higher the level of aggregation, the less similarity between members in the corresponding class.

Man has more similarities with other primates (i.e., apes) than with "distant" members of the mammal family (i.e., dogs), and so on.

Note that the previous discussion refers to clustering algorithms, but does not mention anything about testing for statistical significance.

In fact, cluster analysis is not so much an ordinary statistical method as a “set” of various algorithms for “distributing objects into clusters”.

There is a point of view that, unlike many other statistical procedures, cluster analysis methods are used in most cases when you do not have any a priori hypotheses about classes, but are still in the descriptive stage of research.

Attention!

It should be understood that cluster analysis determines the "most meaningful solution possible".

Therefore, testing for statistical significance is not really applicable here, even in cases where p-levels are known (as, for example, in the K-means method).

The clustering technique is used in a wide variety of fields. Hartigan (1975) has provided an excellent overview of the many published studies containing results obtained by cluster analysis methods.

For example, in the field of medicine, the clustering of diseases, treatment of diseases, or symptoms of diseases leads to widely used taxonomies.

In the field of psychiatry, the correct diagnosis of symptom clusters such as paranoia, schizophrenia, etc. is critical to successful therapy. In archeology, using cluster analysis, researchers are trying to establish taxonomies of stone tools, funeral objects, etc.

There are wide applications of cluster analysis in marketing research. In general, whenever it is necessary to classify "mountains" of information into groups suitable for further processing, cluster analysis turns out to be very useful and effective.

Tree Clustering

The example given earlier explains the purpose of the joining (tree clustering) algorithm.

The purpose of this algorithm is to combine objects (for example, animals) into sufficiently large clusters using some measure of similarity or distance between objects. A typical result of such clustering is a hierarchical tree.

Consider a horizontal tree diagram. The diagram starts with each object as a separate class (on the left side of the diagram).

Now imagine that gradually (in very small steps) you "weaken" your criterion for what objects are unique and what are not.

In other words, you lower the threshold related to the decision to combine two or more objects into one cluster.

As a result, you link more and more objects together and aggregate (combine) more and more clusters of increasingly different elements.

Finally, in the last step, all objects are merged together. In these charts, the horizontal axes represent the pooling distance (in vertical dendrograms, the vertical axes represent the pooling distance).

So, for each node in the graph (where a new cluster is formed), you can see the amount of distance for which the corresponding elements are linked into a new single cluster.

When the data has a clear "structure" in terms of clusters of objects that are similar to each other, then this structure is likely to be reflected in the hierarchical tree by various branches.

As a result of successful analysis by the join method, it becomes possible to detect clusters (branches) and interpret them.

The joining or tree clustering method uses the dissimilarities or distances between objects when forming clusters. These distances can be defined in one-dimensional or multidimensional space.

For example, if you have to cluster the types of food in a cafe, you can take into account the number of calories they contain, their price, a subjective assessment of taste, etc.

The most direct way to calculate distances between objects in a multidimensional space is to calculate Euclidean distances.

If you have a 2D or 3D space, then this measure is the actual geometric distance between objects in space (as if the distances between objects were measured with a tape measure).

However, the joining algorithm does not "care" whether the distances "provided" to it are real geometric distances or some other derived distance measure that is more meaningful to the researcher; the challenge for researchers is to select the right measure for their specific application.

Euclidean distance. This seems to be the most common type of distance. It is simply a geometric distance in multidimensional space and is calculated as follows:
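In standard notation, writing x_i and y_i for the coordinates of objects x and y:

distance(x, y) = \sqrt{\sum_i (x_i - y_i)^2}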

Note that the Euclidean distance (and its square) is calculated from the original data, not from the standardized data.

This is the usual way of calculating it, which has certain advantages (for example, the distance between two objects does not change when a new object is introduced into the analysis, which may turn out to be an outlier).

Attention!

However, distances can be greatly affected by differences between the axes from which the distances are calculated. For example, if one of the axes is measured in centimeters, and then you convert it to millimeters (by multiplying the values ​​by 10), then the final Euclidean distance (or the square of the Euclidean distance) calculated from the coordinates will change dramatically, and, as a result, the results of the cluster analysis can be very different from the previous ones.

The square of the Euclidean distance. Sometimes you may want to square the standard Euclidean distance to give more weight to more distant objects.

This distance is calculated as follows:
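In the same notation:

distance(x, y) = \sum_i (x_i - y_i)^2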

City block distance (Manhattan distance). This distance is simply the sum of the absolute differences over the coordinates.

In most cases, this distance measure leads to results similar to those of the ordinary Euclidean distance.

However, note that for this measure the influence of individual large differences (outliers) decreases (because they are not squared). Manhattan distance is calculated using the formula:
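In the same notation:

distance(x, y) = \sum_i |x_i - y_i|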

Chebyshev distance. This distance can be useful when one wishes to define two objects as "different" if they differ in any one coordinate (any one dimension). The Chebyshev distance is calculated by the formula:
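In the same notation:

distance(x, y) = \max_i |x_i - y_i|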

Power distance. It is sometimes desired to progressively increase or decrease the weight related to a dimension for which the corresponding objects are very different.

This can be achieved using a power-law distance. The power distance is calculated by the formula:
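In the same notation:

distance(x, y) = \left( \sum_i |x_i - y_i|^p \right)^{1/r}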

where r and p are user-defined parameters. A few examples of calculations can show how this measure "works".

The p parameter is responsible for the gradual weighting of differences in individual coordinates, and the r parameter is responsible for the progressive weighting of large distances between objects. If both parameters, r and p, are equal to two, this distance coincides with the Euclidean distance.
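A small sketch of how these parameters behave (the function name and test values are mine, not from the source):

```python
# Power distance with user-defined parameters p and r; with p = r = 2 it coincides
# with the Euclidean distance.
import numpy as np

def power_distance(x, y, p=2.0, r=2.0):
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.sum(np.abs(x - y) ** p) ** (1.0 / r)

a, b = [1.0, 4.0, 2.0], [3.0, 0.0, 1.0]
print(power_distance(a, b, p=2, r=2))             # ~4.583
print(np.linalg.norm(np.array(a) - np.array(b)))  # same value: the Euclidean distance
```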

The percentage of disagreement. This measure is used when the data is categorical. This distance is calculated by the formula:
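In the same notation, for objects described by n categorical coordinates:

distance(x, y) = \frac{\#\{ i : x_i \neq y_i \}}{n}

that is, the share of coordinates on which the two objects disagree.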

Amalgamation or linkage rules

At the first step, when each object is a separate cluster, the distances between these objects are determined by the chosen measure.

However, when several objects are linked together, the question arises, how should the distances between clusters be determined?

In other words, you need a join or link rule for two clusters. There are various possibilities here: for example, you can link two clusters together when any two objects in the two clusters are closer to each other than the corresponding link distance.

In other words, you use the "nearest neighbor rule" to determine the distance between clusters; this method is called the single link method.

This rule builds "fibrous" clusters, i.e. clusters "linked together" only by individual elements that happen to be closer to each other than the others.

Alternatively, you can use the neighbors in two clusters that are farthest apart among all pairs of objects. This method is called the complete link method.

There are also many other methods for joining clusters, similar to those that have been discussed.

Single connection (nearest neighbor method). As described above, in this method, the distance between two clusters is determined by the distance between the two closest objects (nearest neighbors) in different clusters.

This rule must, in a sense, string objects together to form clusters, and the resulting clusters tend to be represented by long "strings".

Full connection (method of the most distant neighbors). In this method, the distances between clusters are defined as the largest distance between any two objects in different clusters (i.e. "most distant neighbors").

Unweighted pairwise mean. In this method, the distance between two different clusters is calculated as the average distance between all pairs of objects in them.

The method is effective when the objects actually form distinct "clumps", but it works equally well for elongated ("chain" type) clusters.

Note that in their book Sneath and Sokal (1973) introduce the abbreviation UPGMA to refer to this method as the unweighted pair-group method using arithmetic averages.

Weighted pairwise mean. The method is identical to the unweighted pairwise average method, except that the size of the respective clusters (ie, the number of objects they contain) is used as a weighting factor in the calculations.

Therefore, the proposed method should be used (rather than the previous one) when unequal cluster sizes are assumed.

Sneath and Sokal (1973) introduce the abbreviation WPGMA to refer to this method as the weighted pair-group method using arithmetic averages.

Unweighted centroid method. In this method, the distance between two clusters is defined as the distance between their centers of gravity.

Attention!

Sneath and Sokal (1973) use the acronym UPGMC to refer to this method as the unweighted pair-group method using the centroid average.

Weighted centroid method (median). This method is identical to the previous one, except that weights are used in the calculations to take into account the difference between cluster sizes (i.e., the number of objects in them).

Therefore, if there are (or are suspected) significant differences in cluster sizes, this method is preferable to the previous one.

Sneath and Sokal (1973) used the abbreviation WPGMC to refer to it as the weighted pair-group method using the centroid average.

Ward method. This method is different from all other methods because it uses ANOVA methods to estimate distances between clusters.

The method minimizes the sum of squares (SS) for any two (hypothetical) clusters that can be formed at each step.

Details can be found in Ward (1963). In general, the method seems to be very efficient, but it tends to create small clusters.
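For orientation, the rules above correspond to the method names of SciPy's hierarchical clustering; the data here are random and serve only to show the call (a sketch, not a recommendation of any particular rule):

```python
# Comparing linkage rules on the same random data with scipy.cluster.hierarchy.linkage.
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(1)
objects = rng.normal(size=(30, 4))   # 30 objects described by 4 features

# single = nearest neighbor, complete = furthest neighbor, average = UPGMA,
# weighted = WPGMA, centroid = UPGMC, median = WPGMC, ward = Ward's method
for method in ["single", "complete", "average", "weighted", "centroid", "median", "ward"]:
    tree = linkage(objects, method=method, metric="euclidean")
    print(method, "-> distance of the final merge:", round(float(tree[-1, 2]), 3))
```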

Two-way joining

Earlier, this method was discussed in terms of "objects" to be clustered. In all other types of analysis, the question of interest to the researcher is usually expressed in terms of observations or variables.

It turns out that clustering, both by observations and by variables, can lead to quite interesting results.

For example, imagine that a medical researcher is collecting data on various characteristics (variables) of patients' conditions (observations) with heart disease.

The investigator may wish to cluster observations (of patients) to identify clusters of patients with similar symptoms.

At the same time, the researcher may want to cluster variables to identify clusters of variables that are associated with a similar physical state.

After this discussion regarding whether to cluster observations or variables, one might ask, why not cluster in both directions?

The Cluster Analysis module contains an efficient two-way join procedure to do just that.

However, two-way pooling is used (relatively rarely) in circumstances where both observations and variables are expected to contribute simultaneously to the discovery of meaningful clusters.

So, returning to the previous example, we can assume that a medical researcher needs to identify clusters of patients that are similar in relation to certain clusters of physical condition characteristics.

The difficulty in interpreting the results obtained arises from the fact that the similarities between different clusters may come from (or be the cause of) some difference in the subsets of variables.

Therefore, the resulting clusters are inherently heterogeneous. Perhaps it seems a bit hazy at first; indeed, compared to other cluster analysis methods described, two-way pooling is probably the least commonly used method.

However, some researchers believe that it offers a powerful tool for exploratory data analysis (for more information, see Hartigan's description of this method (Hartigan, 1975)).

K-means method

This clustering method differs significantly from agglomerative methods such as joining (tree clustering) and two-way joining. Suppose you already have hypotheses about the number of clusters (by observations or by variables).

You can tell the system to form exactly three clusters so that they are as different as possible.

This is exactly the type of problem that the K-Means algorithm solves. In general, the K-means method builds exactly K distinct clusters spaced as far apart as possible.

In the physical condition example, a medical researcher may have a "hunch" from their clinical experience that their patients generally fall into three different categories.

Attention!

If so, then the means of the various measures of physical parameters for each cluster would provide a quantitative way of representing the investigator's hypotheses (e.g., patients in cluster 1 have a high value of parameter 1, a lower value of parameter 2, etc.).

From a computational point of view, you can think of this method as an analysis of variance "in reverse". The program starts with K randomly selected clusters, and then changes the belonging of objects to them in order to:

  1. minimize variability within clusters,
  2. maximize variability between clusters.

This method is similar to reverse analysis of variance (ANOVA) in that the significance test in ANOVA compares between-group versus within-group variability in testing the hypothesis that group means differ from each other.

In K-means clustering, the program moves objects (i.e., observations) from one group (cluster) to another in order to obtain the most significant result when performing analysis of variance (ANOVA).

Typically, once the results of a K-means cluster analysis are obtained, one can calculate the means for each cluster for each dimension to assess how the clusters differ from each other.

Ideally, you should get very different means for most, if not all, of the measurements used in the analysis.
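A minimal K-means sketch in Python with scikit-learn, on made-up data, showing the per-cluster means discussed above:

```python
# K-means into three clusters, then the mean of every measurement within each cluster.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 4))                     # 100 observations, 4 measurements (made up)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

for k in range(3):
    cluster_means = X[kmeans.labels_ == k].mean(axis=0)
    print(f"cluster {k}: means per measurement = {np.round(cluster_means, 2)}")
```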

Source: http://www.biometrica.tomsk.ru/textbook/modules/stcluan.html

Classification of objects according to their characteristics

Cluster analysis is a set of multidimensional statistical methods for classifying objects according to their characteristics, for dividing a set of objects into homogeneous groups that are close in terms of the defining criteria, and for selecting the objects of a particular group.

A cluster is a group of objects identified as a result of cluster analysis based on a given measure of similarity or difference between objects.

Objects are the specific subjects of study that need to be classified. The objects in a classification are, as a rule, observations: for example, consumers of products, countries or regions, products, etc.

Although it is possible to carry out cluster analysis by variables. Classification of objects in multidimensional cluster analysis occurs according to several criteria simultaneously.

These can be both quantitative and categorical variables, depending on the method of cluster analysis. So, the main goal of cluster analysis is to find groups of similar objects in the sample.

The set of multidimensional statistical methods of cluster analysis can be divided into hierarchical methods (agglomerative and divisive) and non-hierarchical (k-means method, two-stage cluster analysis).

However, there is no generally accepted classification of methods, and sometimes cluster analysis methods also include methods for constructing decision trees, neural networks, discriminant analysis, and logistic regression.

The scope of cluster analysis, due to its versatility, is very wide. Cluster analysis is used in economics, marketing, archeology, medicine, psychology, chemistry, biology, public administration, philology, anthropology, sociology and other areas.

Here are some examples of applying cluster analysis:

  • medicine - classification of diseases, their symptoms, methods of treatment, classification of patient groups;
  • marketing - the tasks of optimizing the company's product line, segmenting the market by groups of goods or consumers, identifying a potential consumer;
  • sociology - division of respondents into homogeneous groups;
  • psychiatry - correct diagnosis of symptom groups is crucial for successful therapy;
  • biology - classification of organisms by group;
  • economy - classification of subjects of the Russian Federation by investment attractiveness.

Source: http://www.statmethods.ru/konsalting/statistics-methody/121-klasternyj-analyz.html

General information about cluster analysis

Cluster analysis includes a set of different classification algorithms. A common question asked by researchers in many fields is how to organize observed data into visual structures.

For example, biologists aim to break down animals into different species in order to meaningfully describe the differences between them.

The task of cluster analysis is to divide the initial set of objects into groups of similar, close objects. These groups are called clusters.

In other words, cluster analysis is one of the ways to classify objects according to their characteristics. It is desirable that the classification results have a meaningful interpretation.

The results obtained by cluster analysis methods are used in various fields. In marketing, it is the segmentation of competitors and consumers.

In psychiatry, the correct diagnosis of symptoms such as paranoia, schizophrenia, etc. is crucial for successful therapy.

In management, the classification of suppliers is important, as is the identification of similar production situations in which defects occur. In sociology, the division of respondents into homogeneous groups. In portfolio investing, it is important to group securities by the similarity of their return trends in order to compile, based on the information obtained about the stock market, an optimal investment portfolio that maximizes return for a given level of risk.

In general, whenever it is necessary to classify a large amount of information of this kind and present it in a form suitable for further processing, cluster analysis turns out to be very useful and effective.

Cluster analysis allows considering a fairly large amount of information and greatly compressing large arrays of socio-economic information, making them compact and visual.

Attention!

Cluster analysis is of great importance in relation to sets of time series characterizing economic development (for example, general economic and commodity conditions).

Here it is possible to single out the periods when the values ​​of the corresponding indicators were quite close, as well as to determine the groups of time series, the dynamics of which are most similar.

In the problems of socio-economic forecasting, it is very promising to combine cluster analysis with other quantitative methods (for example, with regression analysis).

Advantages and disadvantages

Cluster analysis allows for an objective classification of any objects that are characterized by a number of features. There are a number of benefits to be derived from this:

  1. The resulting clusters can be interpreted, that is, one can describe what kinds of groups actually exist.
  2. Individual clusters can be culled. This is useful in cases where certain errors were made in the data set, as a result of which the values ​​of indicators for individual objects deviate sharply. When applying cluster analysis, such objects fall into a separate cluster.
  3. For further analysis, only those clusters that have the characteristics of interest can be selected.

Like any other method, cluster analysis has certain disadvantages and limitations. In particular, the composition and number of clusters depends on the selected partitioning criteria.

When reducing the initial data array to a more compact form, certain distortions may occur, and the individual features of individual objects may also be lost due to their replacement by the characteristics of the generalized values ​​of the cluster parameters.

Methods

Currently, more than a hundred different clustering algorithms are known. Their diversity is explained not only by different computational methods, but also by different concepts underlying clustering.

The Statistica package implements the following clustering methods.

  • Hierarchical algorithms - tree clustering. Hierarchical algorithms are based on the idea of ​​sequential clustering. At the initial step, each object is considered as a separate cluster. At the next step, some of the clusters closest to each other will be combined into a separate cluster.
  • K-means method. This method is the most commonly used. It belongs to the group of so-called reference methods of cluster analysis. The number of clusters K is set by the user.
  • Two way association. When using this method, clustering is carried out simultaneously both by variables (columns) and by observation results (rows).

The two-way join procedure is performed when it can be expected that simultaneous clustering on variables and observations will provide meaningful results.

The results of the procedure are descriptive statistics on variables and cases, as well as a two-dimensional color chart on which data values ​​are color-coded.

From the distribution of color, you can get an idea of homogeneous groups.
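In Python, a comparable two-way picture can be sketched with seaborn's clustermap, which clusters rows and columns at the same time and color-codes the values; the data below are random and purely illustrative:

```python
# Simultaneous clustering of observations (rows) and variables (columns) with a
# color-coded heatmap, in the spirit of two-way joining.
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(5)
data = rng.normal(size=(15, 6))   # 15 observations, 6 variables (made up)

sns.clustermap(data, method="average", metric="euclidean")
plt.show()
```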

Normalization of variables

The division of the initial set of objects into clusters is associated with the calculation of distances between objects and the choice of objects, the distance between which is the smallest of all possible.

The most commonly used is the Euclidean (geometric) distance familiar to all of us. This metric corresponds to intuitive ideas about the proximity of objects in space (as if the distances between objects were measured with a tape measure).

But for a given metric, the distance between objects can be strongly affected by changes in scales (units of measurement). For example, if one of the features is measured in millimeters and then its value is converted to centimeters, the Euclidean distance between objects will change dramatically. This will lead to the fact that the results of cluster analysis may differ significantly from the previous ones.

If the variables are measured in different units of measurement, then their preliminary normalization is required, that is, the transformation of the initial data, which converts them into dimensionless quantities.

Normalization strongly distorts the geometry of the original space, which can change the results of clustering.

In the Statistica package, any variable x is normalized according to the formula:
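In standard notation, with \bar{x} the mean of the variable and s its standard deviation:

z = \frac{x - \bar{x}}{s}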

To do this, right-click on the variable name and select the following sequence of commands from the menu that opens: Fill / Standardize Block / Standardize Columns. The mean values of the normalized variables will become equal to zero, and the variances will become equal to one.

K-means method in Statistica

The K-means method splits a set of objects into a given number K of different clusters located at the greatest possible distance from each other.

Typically, once the results of a K-means cluster analysis are obtained, one can calculate the averages for each cluster for each dimension to assess how the clusters differ from each other.

Ideally, you should get very different means for most of the measurements used in the analysis.

The F-statistic values ​​obtained for each dimension are another indicator of how well the corresponding dimension discriminates between clusters.
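A sketch of this check in Python on random stand-in data: per-cluster means and a one-way ANOVA F statistic for each variable (the data and cluster count are assumptions):

```python
# After a K-means run, compute per-cluster means and an F statistic for every variable
# to see how well that variable separates the clusters.
import numpy as np
from scipy.stats import f_oneway
from sklearn.cluster import KMeans

rng = np.random.default_rng(3)
X = rng.normal(size=(40, 5))                      # 40 respondents, 5 questionnaire scores
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

for j in range(X.shape[1]):
    groups = [X[labels == k, j] for k in np.unique(labels)]
    f_stat, p_value = f_oneway(*groups)
    means = [round(float(g.mean()), 2) for g in groups]
    print(f"variable {j}: cluster means = {means}, F = {f_stat:.2f}, p = {p_value:.3f}")
```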

As an example, consider the results of a survey of 17 employees of an enterprise on satisfaction with career quality indicators. The table contains the answers to the questionnaire questions on a ten-point scale (1 is the minimum score, 10 is the maximum).

The variable names correspond to the answers to the following questions:

  1. SLT - a combination of personal goals and the goals of the organization;
  2. OSO - a sense of fairness in wages;
  3. TBD - territorial proximity to the house;
  4. PEW - a sense of economic well-being;
  5. CR - career growth;
  6. ZhSR - the desire to change jobs;
  7. OSB - a sense of social well-being.

Using this data, it is necessary to divide the employees into groups and select the most effective control levers for each of them.

At the same time, the differences between groups should be obvious, and within the group, the respondents should be as similar as possible.

To date, most sociological surveys report only percentages of votes: either the number of positive answers or the percentage of those who are dissatisfied is considered, but this question is not examined systematically.

Most often, the survey does not show trends in the situation. In some cases, it is necessary to count not the number of people who are “for” or “against”, but the distance, or the measure of similarity, that is, to determine groups of people who think about the same.

Cluster analysis procedures can be used to identify, on the basis of survey data, some really existing relationships of features and generate their typology on this basis.

Attention!

The presence of any a priori hypotheses of a sociologist when working with cluster analysis procedures is not a necessary condition.

In the Statistica program, cluster analysis is performed as follows.

When choosing the number of clusters, be guided by the following: the number of clusters, if possible, should not be too large.

The distance at which the objects of a given cluster were joined should, if possible, be much less than the distance at which something else joins this cluster.

When choosing the number of clusters, most often there are several correct solutions at the same time.

We are interested, for example, in how the answers to the questions of the questionnaire correlate with ordinary employees and the management of the enterprise. Therefore, we choose K=2. For further segmentation, you can increase the number of clusters.

The initial cluster centers can be chosen in one of three ways:

  1. select observations with the maximum distance between cluster centers;
  2. sort distances and select observations at regular intervals (default setting);
  3. take the first observation centers and attach the rest of the objects to them.

Option 1 is suitable for our purposes.

Many clustering algorithms often “impose” a structure that is not inherent in the data and disorient the researcher. Therefore, it is extremely necessary to apply several cluster analysis algorithms and draw conclusions based on a general assessment of the results of the algorithms.

The results of the analysis can be viewed in the dialog box that appears:

If you select the Graph of means tab, a graph of the coordinates of the cluster centers will be plotted:


Each broken line on this graph corresponds to one of the clusters. Each division of the horizontal axis of the graph corresponds to one of the variables included in the analysis.

The vertical axis corresponds to the average values ​​of the variables for the objects included in each of the clusters.

It can be noted that there are significant differences in the attitude of the two groups of people to a service career on almost all issues. Only in one issue is there complete unanimity - in the sense of social well-being (OSB), or rather, the lack of it (2.5 points out of 10).

It can be assumed that cluster 1 represents workers and cluster 2 represents management. Managers are more satisfied with career growth (CR) and with the combination of personal goals and the goals of the organization (SLT).

They have a higher sense of economic well-being (PEW) and a stronger sense of fairness in wages (OSO).

They are less concerned about proximity to home than workers, probably because of fewer transportation problems. Also, managers have less desire to change jobs (ZhSR).

Despite the fact that workers are divided into two categories, they give relatively the same answers to most questions. In other words, if something does not suit the general group of employees, the same does not suit senior management, and vice versa.

The agreement between the graphs allows us to conclude that the well-being of one group is reflected in the well-being of the other.

Cluster 1 is not satisfied with the territorial proximity to the house. This group is the main part of the workers who mainly come to the enterprise from different parts of the city.

Therefore, it is possible to offer the top management to allocate part of the profits to the construction of housing for the employees of the enterprise.

Significant differences are seen in the attitude of the two groups of people to a service career. Those employees who are satisfied with career growth, who have a high coincidence of personal goals and the goals of the organization, do not have a desire to change jobs and feel satisfaction with the results of their work.

Conversely, employees who want to change jobs and are dissatisfied with the results of their work are not satisfied with the above indicators. Senior management should pay special attention to the current situation.

The results of the analysis of variance for each attribute are displayed by pressing the Analysis of variance button.

The sums of squares of deviations of objects from cluster centers (SS Within) and the sums of squares of deviations between cluster centers (SS Between), F-statistics values ​​and p significance levels are displayed.

Attention!

For our example, the significance levels for the two variables are quite large, which is explained by the small number of observations. In the full version of the study, which can be found in the paper, the hypotheses about the equality of means for cluster centers are rejected at significance levels less than 0.01.

The Save classifications and distances button displays the numbers of objects included in each cluster and the distances of objects to the center of each cluster.

The table shows the case numbers (CASE_NO) that make up the clusters with CLUSTER numbers and the distances from the center of each cluster (DISTANCE).

Information about objects belonging to clusters can be written to a file and used in further analysis. In this example, a comparison of the results obtained with the questionnaires showed that cluster 1 consists mainly of ordinary workers, and cluster 2 - of managers.

Thus, it can be seen that when processing the results of the survey, cluster analysis turned out to be a powerful method that allows drawing conclusions that cannot be reached by constructing a histogram of averages or by calculating the percentage of those satisfied with various indicators of the quality of working life.

Tree clustering is an example of a hierarchical algorithm, the principle of which is to sequentially cluster first the closest, and then more and more distant elements from each other into a cluster.

Most of these algorithms start from a matrix of similarity (distances), and each individual element is considered at first as a separate cluster.

After loading the cluster analysis module and selecting Joining (tree clustering), you can change the following parameters in the clustering parameters entry window:

  • Initial data (Input). The data can be in the form of a matrix of the studied data (Raw data) or in the form of a distance matrix (Distance matrix).
  • Clustering (Cluster): of observations (Cases (raw)) or of variables (Variable (columns)) describing the state of the object.
  • Distance measures. Here you can select the following measures: Euclidean distances, Squared Euclidean distances, City-block (Manhattan) distance, Chebychev distance metric, Power distance, and the percentage of disagreement (Percent disagreement).
  • Clustering method (Amalgamation (linkage) rule). The following options are possible: Single Linkage, Complete Linkage, Unweighted pair-group average, Weighted pair-group average, Unweighted pair-group centroid, Weighted pair-group centroid (median), and Ward's method.

As a result of clustering, a horizontal or vertical dendrogram is built - a graph on which the distances between objects and clusters are determined when they are sequentially combined.

The tree structure of the graph allows you to define clusters depending on the selected threshold - a given distance between clusters.
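A minimal sketch of such a threshold cut with SciPy (random data, threshold chosen arbitrarily):

```python
# Build a tree, then cut it at a chosen linkage distance to obtain cluster labels.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(4)
X = rng.normal(size=(10, 3))

tree = linkage(X, method="complete")
labels = fcluster(tree, t=2.5, criterion="distance")  # threshold on the distance between clusters
print(labels)
```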

In addition, the matrix of distances between the original objects (Distance matrix) is displayed, as well as the means and standard deviations for each original object (Descriptive statistics).

For the considered example, we will carry out a cluster analysis of variables with default settings. The resulting dendrogram is shown in the figure.


The vertical axis of the dendrogram plots the distances between objects and between objects and clusters. So, the distance between the variables SEB and OSD is equal to five. These variables at the first step are combined into one cluster.

The horizontal segments of the dendrogram are drawn at levels corresponding to the threshold distances selected for a given clustering step.

It can be seen from the graph that the question "desire to change jobs" (ZhSR) forms a separate cluster. In general, the desire to leave for somewhere else is felt by everyone to roughly the same degree. Further, the question of territorial proximity to home (TBD) forms a separate cluster.

In terms of importance, it is in second place, which confirms the conclusion about the need for housing construction, made according to the results of the study using the K-means method.

The sense of economic well-being (PEW) and the sense of fairness in wages (OSO) are combined - this is a block of economic issues. Career growth (CR) and the combination of personal goals and the goals of the organization (SLT) are also combined.

Other clustering methods, as well as the choice of other types of distances, do not lead to a significant change in the dendrogram.

Results:

  1. Cluster analysis is a powerful tool for exploratory data analysis and statistical research in any subject area.
  2. The Statistica program implements both hierarchical and structural methods of cluster analysis. The advantages of this statistical package are its graphical capabilities. Two-dimensional and three-dimensional graphical representations of the obtained clusters in the space of the studied variables are provided, as well as the results of the hierarchical procedure for grouping objects.
  3. It is necessary to apply several cluster analysis algorithms and draw conclusions based on a general assessment of the results of the algorithms.
  4. Cluster analysis can be considered successful if it is performed in different ways, the results are compared and common patterns are found, and stable clusters are found regardless of the clustering method.
  5. Cluster analysis allows you to identify problem situations and outline ways to solve them. Therefore, this method of non-parametric statistics can be considered as an integral part of system analysis.




Cluster analysis

Most researchers are inclined to believe that the term "cluster analysis" (from the English cluster: bunch, clot, bundle) was first proposed by the mathematician R. Tryon. Subsequently, a number of terms arose that are now considered synonyms of the term "cluster analysis": automatic classification, botryology.

Cluster analysis is a multidimensional statistical procedure that collects data containing information about a sample of objects and then arranges the objects into relatively homogeneous groups (clusters) (Q-clustering, or Q-technique, cluster analysis proper). A cluster is a group of elements characterized by a common property; the main goal of cluster analysis is to find groups of similar objects in the sample. The range of applications of cluster analysis is very wide: it is used in archeology, medicine, psychology, chemistry, biology, public administration, philology, anthropology, marketing, sociology and other disciplines. However, this universality of application has led to the emergence of a large number of incompatible terms, methods and approaches, which makes it difficult to use cluster analysis unambiguously and to interpret it consistently. A. I. Orlov, among others, has proposed ways of drawing such distinctions.

Tasks and conditions

Cluster analysis pursues the following main goals:

  • Development of a typology or classification.
  • Exploring useful conceptual schemes for grouping objects.
  • Generation of hypotheses based on data exploration.
  • Hypothesis testing or research to determine whether types (groups) identified in one way or another are actually present in the available data.

Regardless of the subject of study, the use of cluster analysis involves the following steps:

  • Sampling for clustering. It is understood that it makes sense to cluster only quantitative data.
  • Definition of a set of variables by which objects in the sample will be evaluated, that is, a feature space.
  • Calculation of the values ​​of one or another measure of similarity (or difference) between objects.
  • Application of the cluster analysis method to create groups of similar objects.
  • Validation of the results of the cluster solution.

Cluster analysis imposes the following requirements on the data:

  1. indicators should not correlate with each other;
  2. indicators should not contradict the theory of measurements;
  3. the distribution of indicators should be close to normal;
  4. indicators must meet the requirement of "stability", which means the absence of influence on their values ​​by random factors;
  5. the sample should be homogeneous, not contain "outliers".

You can find a description of two fundamental requirements for data - uniformity and completeness:

Homogeneity requires that all entities represented in the table be of the same nature. The completeness requirement is that the sets I and J provide a complete description of the manifestations of the phenomenon under consideration. If we consider a table in which I is a collection of objects and J is the set of variables describing this collection, then I should be a representative sample from the population under study, and the system of characteristics J should give a satisfactory vector representation of the individuals i from the researcher's point of view.

If cluster analysis is preceded by factor analysis, then the sample does not need to be "repaired": the stated requirements are met automatically by the factor modeling procedure itself (there is another advantage: z-standardization without negative consequences for the sample; if it is carried out directly for cluster analysis, it can lead to a decrease in the clarity of the separation of groups). Otherwise, the sample must be adjusted.

Typology of clustering problems

Input Types

Two main types of input data are used. The first is a feature description of objects: each object is described by a set of its characteristics, called features, which may be numeric or non-numeric. The second is a distance matrix between objects: each object is described by its distances to all other objects in the training set. Analysis that compares objects on the basis of features (most common in the biological sciences) is called Q-type analysis, and analysis that compares features on the basis of objects is called R-type analysis. There are attempts to use hybrid types of analysis (for example, RQ-analysis), but this methodology has not yet been properly developed.

Goals of clustering

  • Understanding data by identifying cluster structure. Dividing the sample into groups of similar objects makes it possible to simplify further data processing and decision-making by applying its own analysis method to each cluster (the “divide and conquer” strategy).
  • Data compression. If the initial sample is excessively large, then it can be reduced, leaving one of the most typical representatives from each cluster.
  • Novelty detection. Atypical objects are identified that cannot be assigned to any of the clusters.

In the first case, they try to make the number of clusters smaller. In the second case, it is more important to ensure a high degree of similarity of objects within each cluster, and there can be any number of clusters. In the third case, individual objects that do not fit into any of the clusters are of greatest interest.

In all these cases, hierarchical clustering can be applied, when large clusters are split into smaller ones, which, in turn, are split into even smaller ones, etc. Such tasks are called taxonomy tasks. The result of a taxonomy is a tree-like hierarchical structure, and each object is characterized by an enumeration of all the clusters to which it belongs, usually from large to small. A classic example of a taxonomy based on similarity is the binomial nomenclature of living beings proposed by Carl Linnaeus in the middle of the 18th century; similar systematizations are built in many fields of knowledge in order to organize information about a large number of objects.

Clustering methods

There is no generally accepted classification of clustering methods, but a solid attempt by V. S. Berikov and G. S. Lbov can be noted. If we generalize the various classifications of clustering methods, we can distinguish a number of groups (some methods can be attributed to several groups at once, and therefore it is proposed to consider this typification as some approximation to the real classification of clustering methods):

  1. Probabilistic approach. It is assumed that each object under consideration belongs to one of the k classes. Some authors (for example, A. I. Orlov) believe that this group does not belong to clustering at all and oppose it under the name "discrimination", that is, the choice of assigning objects to one of the known groups (training samples).
  2. Approaches based on artificial intelligence systems. A very conditional group, since there are a lot of AI methods and methodically they are very different.
  3. Logical approach. The construction of a dendrogram is carried out using a decision tree.
  4. Graph-theoretic approach.
    • Graph clustering algorithms
  5. Hierarchical approach. The presence of nested groups (clusters of different orders) is assumed. Algorithms, in turn, are divided into agglomerative (unifying) and divisive (separating). According to the number of features, monothetic and polythetic methods of classification are sometimes distinguished.
    • Hierarchical divisional clustering or taxonomy. Clustering problems are considered in quantitative taxonomy.
  6. Other Methods. Not included in the previous groups.
    • Statistical clustering algorithms
    • Ensemble of clusterers
    • Algorithms of the KRAB family
    • Algorithm based on the sifting method
    • DBSCAN etc.

Approaches 4 and 5 are sometimes combined under the name of the structural or geometric approach, which has a more formalized concept of proximity. Despite significant differences between the listed methods, they all rely on the original "compactness hypothesis": in object space, all close objects must belong to the same cluster, and all distinct objects, accordingly, must be in different clusters.

Formal Statement of the Clustering Problem

Let X be a set of objects and Y a set of cluster numbers (names, labels). A distance function ρ(x, x') between objects is given, and a finite training sample of objects X^m = {x₁, …, x_m} ⊂ X is available. It is required to split the sample into non-overlapping subsets, called clusters, so that each cluster consists of objects that are close in the metric ρ, while objects from different clusters differ significantly. Each object x_i is then assigned a cluster number y_i.

A clustering algorithm is a function a: X → Y that assigns a cluster number y ∈ Y to any object x ∈ X. In some cases the set Y is known in advance, but more often the task is to determine the optimal number of clusters from the point of view of one or another clustering quality criterion.

Clustering (unsupervised learning) differs from classification (supervised learning) in that the labels of the original objects are not set initially, and the set Y itself may even be unknown.

The solution of the clustering problem is fundamentally ambiguous, and there are several reasons for this (according to a number of authors):

  • there is no uniquely best criterion for the quality of clustering. A number of heuristic criteria are known, as well as a number of algorithms that have no clearly defined criterion but carry out a fairly reasonable clustering "by construction"; all of them can give different results (one such heuristic criterion is sketched after this list). Therefore, to judge the quality of a clustering, an expert in the subject area is required who can assess how meaningful the selected clusters are.
  • the number of clusters is usually unknown in advance and is set according to some subjective criterion. This is true only for discrimination methods, since in clustering methods, clusters are selected using a formalized approach based on proximity measures.
  • the clustering result significantly depends on the metric, the choice of which, as a rule, is also subjective and is determined by an expert. But it is worth noting that there are a number of recommendations for choosing proximity measures for various tasks.
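
To illustrate the first two points, below is a minimal sketch (assuming scikit-learn; the data are synthetic) that compares several candidate numbers of clusters with the silhouette score, one common heuristic quality criterion. It only ranks the candidates; it does not remove the need for expert judgment about whether the clusters are meaningful.

```python
# Sketch: choosing the number of clusters with one heuristic criterion (silhouette).
# Assumes scikit-learn; make_blobs just generates illustrative data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, cluster_std=1.0, random_state=42)

for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    print(k, round(silhouette_score(X, labels), 3))
# The k with the highest silhouette is a reasonable candidate,
# but by no means a uniquely "best" answer.
```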

Application

In biology

In biology, clustering has many applications in a wide variety of fields. For example, in bioinformatics, it is used to analyze complex networks of interacting genes, sometimes consisting of hundreds or even thousands of elements. Cluster analysis allows you to identify subnets, bottlenecks, hubs and other hidden properties of the system under study, which ultimately allows you to find out the contribution of each gene to the formation of the phenomenon under study.

In the field of ecology, it is widely used to identify spatially homogeneous groups of organisms, communities, etc. Less commonly, cluster analysis methods are used to study communities over time. The heterogeneity of the structure of communities leads to the emergence of non-trivial methods of cluster analysis (for example, the Czekanowski method).

In general, it is worth noting that historically, similarity measures are more often used as proximity measures in biology, rather than difference (distance) measures.

In sociology

When analyzing the results of sociological research, it is recommended to perform the analysis with methods of the hierarchical agglomerative family, namely Ward's method, which minimizes the within-cluster variance and, as a result, produces clusters of approximately equal size. Ward's method is the most successful for analyzing sociological data. As a measure of dissimilarity, the squared Euclidean distance works better, as it increases the contrast between clusters. The main result of hierarchical cluster analysis is a dendrogram, or "icicle diagram". When interpreting it, researchers face a problem of the same kind as with the interpretation of factor-analysis results: the absence of unambiguous criteria for identifying clusters. Two main techniques are recommended: visual analysis of the dendrogram and comparison of clusterings performed by different methods.

Visual analysis of the dendrogram involves "cutting" the tree at the optimal level of similarity of the sample elements. The "vine branch" (terminology of M. S. Oldenderfer and R. K. Blashfield) should be "cut off" at the 5 mark of the Rescaled Distance Cluster Combine scale, which corresponds to an 80% level of similarity. If selecting clusters at this mark is difficult (several small clusters merge into one large one there), another mark can be chosen. This technique was proposed by Oldenderfer and Blashfield.
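
A minimal sketch of this workflow, assuming SciPy and Matplotlib are installed; the answers matrix is a hypothetical stand-in for normalized survey responses, and the cut height is an arbitrary illustrative value on the raw linkage-distance scale (not the rescaled 0-25 scale mentioned above).

```python
# Sketch: Ward hierarchical clustering of survey data and a "cut" of the dendrogram.
# Assumes SciPy and Matplotlib; `answers` is hypothetical respondent data.
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

rng = np.random.default_rng(0)
answers = rng.random((40, 5))            # 40 respondents x 5 normalized answers

Z = linkage(answers, method="ward")      # Ward's method minimizes within-cluster variance
dendrogram(Z)                            # the tree to be inspected visually
plt.show()

# "Cutting" the tree at a chosen height gives every respondent a cluster label.
labels = fcluster(Z, t=1.5, criterion="distance")
print("number of clusters at this cut:", labels.max())
```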

Now the question of the stability of the obtained cluster solution arises. In essence, checking the stability of a clustering comes down to checking its reliability. There is a rule of thumb here: a stable typology is preserved when the clustering method is changed. The results of hierarchical cluster analysis can be verified by iterative k-means cluster analysis. If the compared classifications of the groups of respondents coincide in more than 70% of cases (more than 2/3 of coincidences), the cluster solution is accepted.
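
The rule of thumb above could be sketched as follows, assuming scikit-learn and SciPy; the respondent data and the number of clusters are hypothetical, and the share of coincidences is estimated roughly by matching each hierarchical cluster to its best-overlapping k-means cluster.

```python
# Sketch: stability check - compare a hierarchical partition with a k-means partition.
# Assumes scikit-learn and SciPy; `answers` is hypothetical respondent data.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import confusion_matrix

rng = np.random.default_rng(1)
answers = rng.random((40, 5))
k = 3

hier = fcluster(linkage(answers, method="ward"), t=k, criterion="maxclust") - 1
km = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(answers)

# Rough share of coinciding assignments: each hierarchical cluster is matched
# to the k-means cluster it overlaps most.
cm = confusion_matrix(hier, km)
share = cm.max(axis=1).sum() / len(answers)
print("share of coincidences:", round(share, 2), "(the text suggests accepting above 0.7)")
```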

It is impossible to check the adequacy of a solution without resorting to a different kind of analysis; at least theoretically, this problem has not been solved. Oldenderfer and Blashfield's classic Cluster Analysis discusses, and ultimately rejects, five additional methods for testing robustness.

In computer science

  • Clustering of search results - used for "intelligent" grouping of results when searching for files, websites or other objects, allowing the user to navigate quickly, select a clearly more relevant subset and exclude a clearly less relevant one, which can improve the usability of the interface compared with output as a plain list sorted by relevance.
    • Clusty - Vivísimo's clustering search engine
    • Nigma - Russian search engine with automatic results clustering
    • Quintura - visual clustering in the form of a cloud of keywords
  • Image segmentation - clustering can be used to split a digital image into separate regions for the purpose of edge detection or object recognition (see the sketch after this list).
  • Data mining - clustering becomes valuable in data mining when it serves as one of the stages of analysis in building a complete analytical solution. It is often easier for an analyst to identify groups of similar objects, study their features and build a separate model for each group than to create one general model for all the data. This technique is constantly used in marketing to identify groups of customers, buyers and goods and to develop a separate strategy for each of them.
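
As promised in the image-segmentation item, here is a minimal sketch, assuming scikit-learn; the "image" is a synthetic NumPy array standing in for a real picture, and four segments is an arbitrary choice.

```python
# Sketch: image segmentation by clustering pixel colours with k-means.
# Assumes scikit-learn; the random array stands in for a loaded RGB image.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 64, 3)).astype(float)   # hypothetical 64x64 RGB image

pixels = image.reshape(-1, 3)                                   # one row per pixel
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(pixels)
segments = labels.reshape(64, 64)                               # per-pixel region labels

print("pixels per segment:", np.bincount(labels))
```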


We know that the Earth is one of the 8 planets that revolve around the Sun. The Sun is just one star among roughly 200 billion stars in the Milky Way galaxy, and that number is very hard to grasp. Knowing this, one can estimate the number of stars in the universe at approximately 4×10^22. We can see about a million stars in the sky, although this is only a small fraction of the actual number of stars. So we have two questions:

  1. What is a galaxy?
  2. And what is the connection between galaxies and the topic of this article (cluster analysis)?


A galaxy is a collection of stars, gas, dust, planets, and interstellar clouds. Usually galaxies resemble a spiral or an elliptical figure. In space, galaxies are separated from each other, and huge black holes most often sit at the centers of galaxies.

As we will discuss in the next section, there are many similarities between galaxies and cluster analysis: galaxies exist in three-dimensional space, while cluster analysis is a multivariate analysis carried out in n-dimensional space.

Note: a black hole is the center of a galaxy. We will use a similar idea for centroids in cluster analysis.

Cluster analysis

Let's say you're the head of marketing and customer relations at a telecommunications company. You understand that all customers are different and that you need different strategies to reach different customers, and you will appreciate the power of a tool such as customer segmentation for optimizing costs. To brush up on your knowledge of cluster analysis, consider the following example, which illustrates 8 customers and their average call duration (local and international). Below is the data:

For better perception, let's draw a graph where the x-axis will be the average duration of international calls, and the y-axis - the average duration of local calls. Below is the chart:

Note: this is similar to analyzing the positions of stars in the night sky (here the stars are replaced by consumers). In addition, instead of a 3D space we have a 2D one, defined by the durations of local and international calls as the x and y axes.
Now, speaking in terms of galaxies, the problem is formulated as follows: find the positions of the black holes; in cluster analysis they are called centroids. To detect the centroids, we will start by taking arbitrary points as their positions.

Euclidean Distance for Finding Centroids for Clusters

In our case, we will randomly place two centroids (C1 and C2) at the points with coordinates (1, 1) and (3, 4). Why did we choose these two centroids? A visual inspection of the points on the graph suggests that there are two clusters to analyze. However, we will see later that the answer to this question is not so simple for a large data set.
Next, we will measure the distance between the centroids (C1 and C2) and all points on the graph using the Euclidean formula for the distance between two points: d = √((x₁ - x₂)² + (y₁ - y₂)²).

Note: the distance can also be calculated using other formulas (a small sketch of several of them follows this list), for example,

  1. the squared Euclidean distance - to give more weight to objects that are farther apart
  2. the Manhattan distance - to reduce the influence of outliers
  3. the power distance - to increase or decrease the influence of specific coordinates
  4. percent disagreement - for categorical data
  5. etc.
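
Below are small sketches of these distance measures in plain NumPy. The coordinate (2, 2) for the first consumer is an assumption; it is chosen only because it reproduces the distances 1.41 and 2.24 quoted below.

```python
# Sketch: several distance measures in plain NumPy.
import numpy as np

def euclidean(p, q):
    return float(np.sqrt(np.sum((np.asarray(p) - np.asarray(q)) ** 2)))

def squared_euclidean(p, q):            # gives more weight to objects that are far apart
    return float(np.sum((np.asarray(p) - np.asarray(q)) ** 2))

def manhattan(p, q):                    # less sensitive to outliers
    return float(np.sum(np.abs(np.asarray(p) - np.asarray(q))))

# Assumed position of the first consumer, against the centroids C1 (1, 1) and C2 (3, 4):
print(round(euclidean((2, 2), (1, 1)), 2))   # 1.41
print(round(euclidean((2, 2), (3, 4)), 2))   # 2.24
```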
Columns 3 and 4 (Distance from C1 and C2) contain the distances calculated with this formula. For example, for the first consumer the distance to C1 is 1.41 and to C2 is 2.24.

The assignment to a centroid (last column) is made according to proximity to the centroids (C1 and C2). The first consumer is closer to centroid #1 (1.41 versus 2.24) and therefore belongs to the cluster with centroid C1.

Below is a graph illustrating the C1 and C2 centroids (depicted as a blue and orange diamond). Consumers are shown in the color of the corresponding centroid to which they were assigned.

Since we chose the centroids arbitrarily, the second step is to make this choice iterative. The new position of each centroid is the mean of the points of its cluster. For example, the first cluster consists of consumers 1, 2 and 3, so the new x-coordinate of centroid C1 is the average of their x-coordinates: (2 + 1 + 1)/3 = 1.33. We get new coordinates C1 (1.33, 2.33) and C2 (4.4, 4.2). The new plot is below:

After a few more iterations, the centroids finally settle at the centers of their respective clusters. The chart is below:

The positions of our black holes (cluster centers) in our example are C1 (1.75, 2.25) and C2(4.75, 4.75). The two clusters above are like two galaxies separated in space from each other.
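
The whole procedure can be written as a small two-centroid k-means loop. Below is a sketch in plain NumPy; since the original customer table is not reproduced in this text, the coordinates are reconstructed guesses, chosen only so that they agree with the numbers quoted above (starting centroids (1, 1) and (3, 4), intermediate centroids (1.33, 2.33) and (4.4, 4.2), final centroids (1.75, 2.25) and (4.75, 4.75)).

```python
# Sketch: the two-centroid k-means loop described above, in plain NumPy.
# The 8 customer coordinates are reconstructed guesses, not the original data.
import numpy as np

points = np.array([[2, 2], [1, 3], [1, 2], [3, 2],
                   [4, 5], [5, 4], [5, 5], [5, 5]], dtype=float)
centroids = np.array([[1.0, 1.0], [3.0, 4.0]])            # C1 and C2 from the text

for _ in range(10):
    # assignment step: each customer goes to the nearest centroid (Euclidean distance)
    dist = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    assign = dist.argmin(axis=1)
    # update step: each centroid moves to the mean of its customers
    new_centroids = np.array([points[assign == c].mean(axis=0) for c in range(2)])
    if np.allclose(new_centroids, centroids):              # the "black holes" have settled
        break
    centroids = new_centroids

print("final centroids:", centroids)                       # ~ (1.75, 2.25) and (4.75, 4.75)
print("cluster of each customer:", assign)
```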

Now let's look at a further example. Suppose we face the task of segmenting consumers by two parameters: age and income. Say we have two consumers aged 37 and 44 with incomes of $90,000 and $62,000 respectively. If we measure the Euclidean distance between the points (37, 90000) and (44, 62000), the income variable "dominates" the age variable, and changes in income strongly affect the distance. We need some strategy to deal with this problem, otherwise our analysis will give an incorrect result. The solution is to bring our values to comparable scales; normalization is the answer.

Data normalization

There are many approaches to normalizing data, for example min-max normalization, which uses the following formula:

X* = (X - min) / (max - min),

where X* is the normalized value and min and max are the minimum and maximum values over the entire set X.
(Note that this formula maps all values into the segment [0, 1].)
Consider our example: let the maximum income be $130,000 and the minimum $45,000. The normalized income of consumer A is then (90,000 - 45,000) / (130,000 - 45,000) ≈ 0.53.

We will do this exercise for every point and every variable (coordinate). The income of the second consumer ($62,000) becomes 0.2 after normalization. Additionally, let the minimum and maximum ages be 23 and 58 respectively; after normalization, the ages of our two consumers become 0.4 and 0.6.

It's easy to see that now all of our data is between 0 and 1. Therefore, we now have normalized datasets on comparable scales.

Remember, before the cluster analysis procedure, it is necessary to perform normalization.
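
A small sketch of min-max normalization with the figures from this example (plain NumPy, nothing else assumed):

```python
# Sketch: min-max normalization, X* = (X - min) / (max - min).
import numpy as np

def min_max(values, lo, hi):
    return (np.asarray(values, dtype=float) - lo) / (hi - lo)   # maps values into [0, 1]

incomes = [90_000, 62_000]                 # consumers A and B
ages = [37, 44]

print(min_max(incomes, 45_000, 130_000))   # -> [0.529..., 0.2]
print(min_max(ages, 23, 58))               # -> [0.4, 0.6]
```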

Input Types

  • Feature description of objects. Each object is described by a set of its characteristics, called features. Features can be numerical or non-numerical.
  • Distance matrix between objects. Each object is described by distances to all other objects in the training set.

A distance matrix can be computed from the matrix of feature descriptions of objects in an infinite number of ways, depending on how the distance function (metric) between feature descriptions is introduced. The Euclidean metric is often used, but this choice is in most cases heuristic and dictated only by convenience.

The inverse problem - the restoration of feature descriptions by the matrix of pairwise distances between objects - in the general case has no solution, and the approximate solution is not unique and may have a significant error. This problem is solved by multidimensional scaling methods.

Thus, the formulation of the problem of clustering by distance matrix is more general. On the other hand, in the presence of feature descriptions, it is often possible to build more efficient clustering methods.
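
A minimal sketch of turning feature descriptions into a distance matrix, assuming SciPy; the feature rows are hypothetical, and the Euclidean metric is simply the heuristic default mentioned above (it can be swapped for another).

```python
# Sketch: pairwise distance matrix from a matrix of feature descriptions.
# Assumes SciPy; the three objects below are hypothetical normalized feature vectors.
import numpy as np
from scipy.spatial.distance import pdist, squareform

features = np.array([[0.40, 0.53],
                     [0.60, 0.20],
                     [0.10, 0.90]])

D = squareform(pdist(features, metric="euclidean"))   # symmetric object-to-object distances
print(np.round(D, 2))
```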

Goals of clustering

  • Understanding data by identifying cluster structure. Dividing the sample into groups of similar objects makes it possible to simplify further data processing and decision-making by applying its own analysis method to each cluster (the “divide and conquer” strategy).
  • Data compression. If the initial sample is excessively large, then it can be reduced, leaving one of the most typical representatives from each cluster.
  • Novelty detection. Atypical objects are selected that cannot be attached to any of the clusters.

In the first case, they try to make the number of clusters smaller. In the second case, it is more important to ensure a high (or fixed) degree of similarity of objects within each cluster, and there can be any number of clusters. In the third case, individual objects that do not fit into any of the clusters are of greatest interest.

In all these cases, hierarchical clustering can be applied, when large clusters are split into smaller ones, which, in turn, are split even smaller, etc. Such tasks are called taxonomy tasks.

The result of taxonomy is a tree-like hierarchical structure. In addition, each object is characterized by an enumeration of all clusters to which it belongs, usually from large to small. Visually, taxonomy is represented as a graph called a dendrogram.

A classic example of taxonomy based on similarity is the binomial nomenclature of living beings proposed by Carl Linnaeus in the middle of the 18th century. Similar systematizations are built in many fields of knowledge in order to organize information about a large number of objects.
