Classification

  1. Exploratory Classification
  2. Classification Into Given Classes

When studying a number of objects or observations, you can sometimes reveal an underlying structure of the material by grouping the observations into classes. As the basis of categorization you have to select a property or attribute which you can record for each of the studied specimens or cases. All the cases that have this attribute go into one class; those that have it in different degrees perhaps go into classes of their own, as do, finally, those cases (if any) that do not have the property at all.
                   Books in English   Books in Spanish   Other books
Number of books    168                151                6
You could, for example, arrange your books on the basis of their language, as in the table above.

From where do you get the basis for categorization?

Classifications are usually presented as tables. At the top of the table, in the table heading, you state the basis of classification. In delimiting the classes, you can employ any of the customary scales and other languages of description, such as verbal description, visual shape, coding, ordinal scale, interval scale or ratio scale; or, if you feel like it, you can invent a new, original classification of your own.

Customary symbols in tables
(empty box) = the table has not been completed
. . = no information obtained
* = preliminary data
- = 0.000 (exactly zero)
0 = rounded to zero
The rest of your table consists of boxes where you put the specimens or cases that you are classifying. Each box usually contains a number which shows how many specimens or cases fulfill the conditions defined in the column heading. This number is also called frequency. Some symbols conventionally used in table boxes are listed above.
If the frequency is divided by the total number of cases, relative frequency is obtained. It can be indicated as a percentage or fraction. All these statistics are variables of arithmetic scale regardless of the scale type of the original measurements.
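As a small illustration of the calculation just described, the following sketch derives relative frequencies from the book counts in the language table. The variable names are only illustrative.

```python
# Book counts taken from the language table in the text.
frequencies = {"English": 168, "Spanish": 151, "Other": 6}

total = sum(frequencies.values())

# Relative frequency = class frequency / total number of cases.
relative = {label: count / total for label, count in frequencies.items()}

# Show each class with its frequency and its share as a percentage.
for label, share in relative.items():
    print(f"{label:8s} {frequencies[label]:4d}  {share:6.1%}")
```

The relative frequencies always sum to 1 (or 100 %), which is a quick check that no case has been left out of the table.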

Moreover, you can place additional data in the boxes and even pictures of the specimens, if it does not take up too much space.

If you have only one criterion for classifying your material you will get only one row of boxes in your table. However, you will often want to make a secondary grouping of the same material, using another criterion for this new classification. In this case you will get as many rows in the table as you have classes in your second taxonomy, and the boxes become arranged in columns as well as in rows. Such a grouping of data is called cross tabulation and it is a powerful tool in searching for invariant structures in large sets of cases or specimens.
                  Books in English   Books in Spanish   Other books   Total
Books on music    100                11                 4             115
Books on yoga     60                 91                 2             153
Other topics      8                  49                 -             57
Total             168                151                6             325
If you cross tabulated your books in this way, according to their language and topic, you might get a table like the one above. Note that tables often include an extra column for row totals and a bottom row for column totals.
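The cross tabulation of the books can be sketched in a few lines of code. The raw records below are hypothetical: they are generated from the counts in the table, one (topic, language) pair per book, so that the result can be checked against the table.

```python
from collections import Counter

# Hypothetical raw records, rebuilt from the counts in the table:
# one (topic, language) pair per book.
table_counts = {
    ("music", "English"): 100, ("music", "Spanish"): 11, ("music", "Other"): 4,
    ("yoga", "English"): 60, ("yoga", "Spanish"): 91, ("yoga", "Other"): 2,
    ("other", "English"): 8, ("other", "Spanish"): 49, ("other", "Other"): 0,
}
books = [pair for pair, n in table_counts.items() for _ in range(n)]

# Cross tabulation: count the cases for every combination of the two criteria.
cells = Counter(books)
topics = ["music", "yoga", "other"]
languages = ["English", "Spanish", "Other"]

# Extra column for row totals and bottom row for column totals.
row_totals = {t: sum(cells[(t, l)] for l in languages) for t in topics}
col_totals = {l: sum(cells[(t, l)] for t in topics) for l in languages}
grand_total = sum(row_totals.values())

print(f"{'':8s}" + "".join(f"{l:>9s}" for l in languages) + f"{'Total':>9s}")
for t in topics:
    row = "".join(f"{cells[(t, l)]:9d}" for l in languages)
    print(f"{t:8s}" + row + f"{row_totals[t]:9d}")
print(f"{'Total':8s}" + "".join(f"{col_totals[l]:9d}" for l in languages) + f"{grand_total:9d}")
```

A Counter returns zero for a combination that never occurs, so empty boxes need no special handling.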

A two-dimensional table is perfect for presenting a classification with two criteria, but it is a little awkward if you need more than two dimensions of grouping. Possible methods for this purpose are:

Exploratory Classification

When doing exploratory research with material from a great number of objects or cases, your first tool is not necessarily classification because it can be difficult to sort the material into a meaningful order if you have no names for the pigeonholes to put it in. To be sure, it is always possible to experiment with different categorizations in the hope that a fertile principle of grouping eventually turns up. Such experimentation is easy once the material has been loaded into a computer. It often happens that you find it useful to regroup your material many times and redefine your class boundaries when your knowledge of the object increases. The initial number of classes might be, for example, equal to the cube root of the number of the population, until you find a more justifiable classification. It is practical to start with a greater number of classes than you intend to use finally, because during analysis it is much easier to combine a few classes than to divide a class.
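The cube-root rule of thumb mentioned above can be written as a one-line function. The function name is hypothetical, and rounding to the nearest whole number is an assumption of this sketch.

```python
def initial_class_count(n_cases: int) -> int:
    """Rule of thumb from the text: about the cube root of the number of cases."""
    return max(1, round(n_cases ** (1 / 3)))

# For the 325 books of the earlier example this suggests starting with 7 classes.
print(initial_class_count(325))
```

Because combining classes is easier than dividing them, rounding upwards would be an equally defensible choice.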

The goal in classification is always the same: to reveal the systematic structure, invariance, that exists in all the cases that you study. This goal is fully achieved only if all the cases fit into classes. If you have to add a category named "Others" it seems to show that the criteria for your classification are somehow deficient. Moreover, any classification works best when the boundaries between the categories are clear-cut and there are few in-between cases; it will clarify the invariance that we try to extract from the data.

Normally you would start by studying the literature on the topic, if there is any. If this leaves you with no suitable theoretical classification, you can try to find the structure from the material itself. Sometimes the people that you are studying already use habitual groupings of people or of things; such indigenous categories are often serviceable in research, too. People normally organise themselves into families, tribes and work groups. For example, "Americans" and "the Impressionists" are well-known groups that you do not even need to define before using them in your study. Often the names of human groups, especially geographic or temporal divisions, can also be used to classify the products that these people have made.
Boat types

Sometimes people already have names for various arrays of products like furniture and tools. Sailors, for example, habitually sort the innumerable existing boats into a small number of boat types, some of which are shown on the right. The significant differences between the yacht types concern the number of masts, position of the mast in the aft, number of sails in the fore, and presence of gaff. For the purposes of research the types can be defined as drawn outlines as in the picture, or alternatively as verbal explanations such as, for example, "a schooner = a boat with two masts and the largest sail in the aft".

More common in research, however, is that the researcher has to define from scratch the groups that he needs to create. Typical of exploratory research is that you will not have any ready-made classification at hand. You have to examine your material and try to arrange it into groups of similar cases.
Rug knots
If you are lucky, each specimen or case that you are studying fits into one and only one box of your classification. Such seems to be the case in classifying hand-made rugs. There is only a limited number of possible rug knots, and each type is unequivocally distinguishable from all the others, as can be seen in the illustration made by Geijer in the book A History of Textile Art on the right. Alas, such definite classes are not frequent in research.

Regarding most products, there is an endless number of variations. Often, only a few cases are exactly alike, and once you have placed them into the appropriate classes the majority still falls in between. Often the objects or cases resemble a continuum where each case is unique even if it differs only slightly from its neighbours. Moreover, there are often several attributes of the cases which seem significant for classifying, but none of them is of such paramount importance that it could be used as the exclusive criterion for sorting. In such a "fuzzy" situation you can try one of the following methods.

Fuzzy classification is a method which aims at placing all the cases or specimens in one or other of the classes even if the "fit" is not perfect. The method allows the simultaneous use of several criteria for assorting, that is, several attributes of the objects or cases are taken into consideration. Every member of the class complies with most of them, but not necessarily with all of them. What is common to all the members of a class is no specific attribute but family resemblance, which means that several but not necessarily all of the attributes match. Characteristic of fuzzy classification is that there will be no surplus class for those specimens or cases which would not fit.
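One way to make the idea of family resemblance concrete is to count how many of a class's attributes a case matches and to put the case in the class with the highest share. This is only a minimal sketch: the classes, attributes and function names are hypothetical, and counting matches is just one of several possible measures of resemblance.

```python
# Hypothetical class definitions: each class is a set of attributes,
# none of which is mandatory for membership.
CLASS_ATTRIBUTES = {
    "rustic": {"wood", "hand-made", "unpainted", "heavy"},
    "modern": {"steel", "machine-made", "painted", "light"},
}

def resemblance(case: set, attributes: set) -> float:
    """Share of the class attributes that the case matches (0.0 to 1.0)."""
    return len(case & attributes) / len(attributes)

def classify(case: set) -> str:
    """Put the case in the class it resembles most; there is no 'Others' class."""
    return max(CLASS_ATTRIBUTES,
               key=lambda name: resemblance(case, CLASS_ATTRIBUTES[name]))

chair = {"wood", "hand-made", "painted", "heavy"}        # matches 3 of 4 rustic attributes
lamp = {"steel", "machine-made", "unpainted", "light"}   # matches 3 of 4 modern attributes

print(classify(chair), classify(lamp))
```

In this sketch a tie goes to the class listed first; a real study would need an explicit rule for such in-between cases.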

Typology is a method of classification where each class is formed around a "typical" or "pure" exemplar. Each object in the collected material is then compared to these exemplars and assigned to the class whose exemplar it most resembles. The exemplar can be pre-defined, but in exploratory study it can simply be selected among the specimens that are being studied, or it can be defined as an agglomeration of typical features of the "family".
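When the objects can be measured, assignment to the most similar exemplar can be sketched as a nearest-exemplar search. The exemplars, the two measurements and the use of Euclidean distance are all assumptions made for this illustration.

```python
import math

# Hypothetical exemplars: (seat height in cm, seat width in cm) of a "pure" type.
EXEMPLARS = {
    "stool": (45.0, 35.0),
    "bar stool": (75.0, 35.0),
    "bench": (45.0, 120.0),
}

def nearest_exemplar(item):
    """Return the type whose exemplar lies closest to the measured item."""
    return min(EXEMPLARS, key=lambda name: math.dist(item, EXEMPLARS[name]))

print(nearest_exemplar((48.0, 40.0)))   # close to the "stool" exemplar
print(nearest_exemplar((50.0, 100.0)))  # close to the "bench" exemplar
```

Measurements on very different scales would normally be standardized first, otherwise the variable with the largest numbers dominates the distance.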

It can be difficult to make a typology of large and many-detailed artifacts like buildings. A successful example is found in the book Suomen kirkot (Churches in Finland) by Carolus Lindberg. He defined 38 traditional church types using both simple depictions and verbal characterization. Below are three samples of his definitions.

Church types

Class XXIX. Wooden churches in the shape of a Greek cross. The central space is widened by orthogonal extensions which are usually lower than the rest of the church. This type was popular in Eastern Finland during the 18C.

Class XXX. Wooden churches in the shape of a Greek cross. Orthogonal extensions at the four corners of the central space. The tent-shaped roof of the midpoint space rises high over the roofs of the wings and is decorated with a pinnacle. These pyramid-roofed central churches were built in eastern Finland from the late 18C to the early 19C.

Class XXXI. A variation of type XXX, with the difference that a vertical segment interrupts the central tent-shaped roof.

Artistic style is sometimes used for assorting products, though it is often very difficult to define and the risk of subjective variation is great. On the left is a classical pictorial exposition of architectural styles from Regole generali di architettura, by Sebastiano Serlio, 16c.

Stylistic classification is almost always a fuzzy one, because the criteria of style are usually defined as a set of qualifications, all of which need not be fulfilled at the same time. Here again, "family resemblance" to the exemplars determines if the given object belongs to a given style or not.

Stylistic classification (like any other classification) can be used to sort objects that belong to the same point of time, e.g. products from various artists, manufacturers, users, cultures or geographical areas. Note that styles are often studied not as classifications but instead as historical sequences.

Weaknesses of typologies. It is usually difficult to measure the goodness of fit of a case; thus typologies and classifications often remain rather subjective. Especially when a single person, i.e., the researcher, both defines the classes and places objects in them, it can become too easy for him to demonstrate the validity of any concocted typology simply by disregarding (as "non typical") all those objects or cases which do not fit into the classification. To reduce subjectivity the exemplars should be defined as explicitly as possible and preferably in both verbal and visual presentation (and also with the help of measurement, if applicable).

The correct method for pointing out the average case or the typical cases in the population is first to define the population unambiguously, then to select a proper sample of it, classify all items of the sample, and note the most frequent type(s).
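The procedure described above (define the population, draw a sample, classify every item, note the most frequent type) can be sketched as follows. The population and its boat types are invented for the illustration, and simple random sampling is an assumption.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so that the sketch is repeatable

# Hypothetical, unambiguously defined population; each item is represented
# by the type that the classification would assign to it.
population = ["sloop"] * 600 + ["ketch"] * 250 + ["schooner"] * 150

# Select a simple random sample and classify all of its items.
sample = random.sample(population, 100)
type_counts = Counter(sample)

# The most frequent type in the sample is reported as the typical case.
most_frequent_type, frequency = type_counts.most_common(1)[0]
print(most_frequent_type, frequency)
```

Because every item of the sample is counted, no object can be set aside as "non typical" after the fact.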

Cluster Analysis is a suitable method for constructing a classification for cases or specimens for which you have a lot of measurements but you do not know which of the variables are best suited for the basis of classification. It is a tool for finding groups of individual cases which resemble each other to the highest degree. The researcher can either select the number of groups in advance, or leave it open, in which case the analysis program will propose several alternative groupings.
Clusters
The figure on the right shows an example where the computer program has analyzed 30 cases and found that the ones that most resemble each other are the cases numbered 8 and 29, and likewise 6 and 22, etc.
Moreover, the cluster analysis program has created a logical tree where the number of groups gradually drops. It will then be up to the researcher to decide how many clusters shall be accepted as the final result of the analysis. In the figure, three final groups: A, B and C have been selected, but two groups would also have been possible: A and (B+C). Each researcher has to make the decision on the basis of his theory, as is usual in all statistical analysis. To make the decision, you should consider the theoretical meaning and interpretation of each cluster.
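The tree in which the number of groups gradually drops can be imitated with a small agglomerative procedure: start with every case as its own group and repeatedly merge the two most similar groups. The sketch below is not the program used for the figure. The six cases are invented, and measuring similarity by the distance between the closest members (single linkage) is an assumption.

```python
import math

def agglomerate(points):
    """Merge the two closest clusters step by step and record every grouping.

    Single linkage (distance between the closest members) is an assumption
    of this sketch; other linkage rules give different trees.
    """
    clusters = [[i] for i in range(len(points))]
    history = [[c[:] for c in clusters]]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(math.dist(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
        history.append([sorted(c) for c in clusters])
    return history

# Hypothetical measurements: two variables for each of six cases.
cases = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (5.2, 4.9), (9.0, 1.0), (9.1, 1.3)]
tree = agglomerate(cases)

# The researcher decides at which level of the tree to stop.
three_groups = next(level for level in tree if len(level) == 3)
two_groups = next(level for level in tree if len(level) == 2)
print(three_groups)
print(two_groups)
```

With these invented cases the three-group level separates the pairs, while the two-group level joins the second and third groups, which corresponds to the choice between A, B, C and A, (B+C) discussed above.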

Cross Tabulation. As was noted above, in exploratory study it is often difficult enough to find the basis for a single classification, let alone for a cross tabulation of two or more properties of the objects or cases. However, with modern computers it is easy to experiment with tentative cross tabulations on the basis of the various attributes that you have recorded from the cases or objects, and it often helps to detect hidden structures in the material.

Below is an example of exploratory cross tabulation where over 3000 specimens of ancient Egyptian pottery are arranged according to their ornamentation. Each row in the table contains the pots which have a specific shape of pattern, as defined below, from A to F:
Pot decorations

  A. Line
  B. Waving line
  C. Dotted line
  D. Crossing lines
  E. Row of rectangles
  F. Row of triangles
The division between columns (1...4) is based on the lines that the ancient artist had drawn over the patterns:
  1. No line
  2. Lines on both sides of the patterns
  3. Line through the pattern
  4. Lines through and on both sides

On the basis of this exploratory table the author, Rostislav Holthoer (1977), continued his study by recording the number of specimens in each class and recognizing the region and period they came from, and in this way he could gradually construct a theoretical structure of all the material.

The most usual method of searching for a structure in cross-tabulated data is to look for an association between two characteristics of the objects, which is often indicated by an unusually high (or low) frequency of cases where a certain combination of characteristics occurs.

Table #1    Stool A   Stool B
Pains       10        5
No pains    90        45
As an example, let us look at the (invented) table #1 above, in which the employees of an office have been classified on the basis of whether or not they had experienced occasional back pains. The stools used in the office were of two different types ("A" and "B"), and the purpose of the analysis was to find out whether back pains had some relationship with the type of seat that the person was using.

In table #1 the percentage of pains is equal (10%) for both stools, which does not indicate any connection between back pains and type of stool.

Table #2    Stool A   Stool B
Pains       20        5
No pains    80        45
If, instead, the result of the tabulation had been like table #2, the difference in complaint percentages (20% and 10%) would have suggested an association between frequency of pains and type of stool.
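The complaint percentages quoted for the two invented tables can be recomputed as follows. The function and variable names are only illustrative.

```python
def pain_percentages(table):
    """Percentage of people reporting pains, separately for each stool type."""
    return {stool: 100 * pains / (pains + no_pains)
            for stool, (pains, no_pains) in table.items()}

# Counts from the text: stool type -> (pains, no pains).
table_1 = {"A": (10, 90), "B": (5, 45)}
table_2 = {"A": (20, 80), "B": (5, 45)}

print(pain_percentages(table_1))  # equal shares for both stools
print(pain_percentages(table_2))  # a higher share for stool A
```

Comparing percentages within each column, rather than raw counts, is what makes the two stool types comparable even though twice as many people used stool A.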

How great must the differences in frequency be to be statistically significant? There are algorithms that can give an answer to this question, notably the Chi test (chi-square test), which is explained on another page.
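As a rough sketch of what such an algorithm does, the code below computes the standard Pearson chi-square statistic for the two invented stool tables. It is not necessarily the exact procedure given on the page referred to. The statistic is compared with 3.841, the usual critical value for one degree of freedom at the 5 % level, and no continuity correction is applied.

```python
def chi_square(observed):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, count in enumerate(row):
            # Frequency expected if rows and columns were unrelated.
            expected = row_totals[i] * col_totals[j] / n
            statistic += (count - expected) ** 2 / expected
    return statistic

# Critical value of chi-square for 1 degree of freedom at the 5 % level.
CRITICAL_5_PERCENT_DF1 = 3.841

table_1 = [[10, 5], [90, 45]]   # rows: pains, no pains; columns: stool A, stool B
table_2 = [[20, 5], [80, 45]]

for name, table in (("table #1", table_1), ("table #2", table_2)):
    value = chi_square(table)
    verdict = "significant" if value > CRITICAL_5_PERCENT_DF1 else "not significant"
    print(name, round(value, 2), verdict)
```

For table #1 the statistic is zero, since the percentages are identical. For table #2 it is 2.4, which stays below 3.841: with only 150 people, a difference of 20% against 10% could still be due to chance at the 5 % level.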

Once a statistical association has been found, the study is often continued by trying to determine the exact structure of this invariance and perhaps by finding its explanation.

Of course, mere correlation between two variables does not prove that causality exists, because the correlation can be due to other possible explanations. Therefore you should complement the tabulation with other methods like interviewing (if people are involved in the activity to be studied) or experiment where you exclude the non-desirable factors.

The examples of exploratory classification given above show that the classes that are finally selected seldom emerge purely from empirical data. Normally they also have some theoretical justification, borrowed from elsewhere. There is thus no sharp contrast between the approaches of using, or not using, classes that have been defined earlier. Using them is the theme of the following section.

Classification Into Given Classes

Quite often the target of the study, i.e. the question that you are studying or the intended use of the results, will define the population to be studied and to some extent also the classes into which the population or the sample of it should be divided. Two ambitions of research which often give grounds for defining both the population and its classification are:

  1. If the aim is only to expand the scope of validity of an existing theory, it will often be possible to use the classifications of earlier studies and fill the table with data from a new, enlarged population. If, instead, your project aims to add a new aspect to existing knowledge, this new aspect becomes a novel dimension in your table, while many other dimensions of classification can remain the same as in earlier studies.
  2. In applied research, where your theme arises from practical needs, these often dictate both the population and the criteria of classification.

Beginning of the phase     Phase of family
Marriage etc.              Establishing family
First baby born            Growing family
Last baby born             Static period
First child leaves home    Shrinking family
Last child leaves home     Post-parental phase
First parent dies          Disintegration of family
Second parent dies         -
Classification of people. When studying the users of products (people, families and organizations), it is often practical to borrow the categorization from sociology, the science that traditionally studies human groups. Well-established group definitions are available on the basis of profession, education, income and other demographic characteristics. Families are often classified according to the "phase" of the family, as in the table above.

In everyday life, people are habitually characterized and classified by selecting a suitable attribute from a pair of opposites, like old/young or urban/rural. Similar logic has been used in many scientific classifications of people, too. Sometimes the classification has been so successful that it has even been adopted into the common language, as has happened with the class "yuppie" (from YUP = young urban professional, as a contrast to persons that are old or rural, or both).

However, you should always keep in mind that being able to define two or four classes does not guarantee that those classes exist in reality.

For example, when studying the users and potential buyers of industrial products, you might wish to divide these people into a few groups on the basis of their tastes, so that your company could then design a specific "styling" of products for each of these groups. Such a hypothetical division into four "taste groups" has been tentatively drafted in the upper diagram on the right. Each person is there represented by a dot.
If you test the above "taste group" hypothesis with any natural population of people, you will very probably find that their opinions are distributed more or less like the lower diagram on the right, and it would definitely distort the truth to present the population as consisting of four groups. The reason is that opinions, like most properties of people, nearly always follow the distribution depicted in the upper diagram on the left, which researchers have therefore begun to call the "normal distribution". Distributions like the lower diagram on the left are quite exceptional.

Averages of the four clusters
The credibility of the above mentioned study would not improve - on the contrary - if the researcher tried to prove the existence of his clusters by calculating their averages, as has been done in the third diagram (on the right). The averages would certainly differ, but this would just be a result of the researcher's original willful classification of the cases - it would tell nothing of the empirical data itself.
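The point about averages can be checked with a small simulation. The opinions below are invented: they are drawn from a single normal distribution, so by construction the data contain no real groups. They are then cut willfully into four "taste groups" at the midpoint of each scale.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so that the sketch is repeatable

# Hypothetical opinions on two scales, all drawn from ONE normal distribution.
people = [(random.gauss(0, 1), random.gauss(0, 1)) for _ in range(2000)]

def quadrant(person):
    """Willful classification: cut each scale at its midpoint."""
    x, y = person
    return ("high" if x >= 0 else "low", "high" if y >= 0 else "low")

groups = {}
for person in people:
    groups.setdefault(quadrant(person), []).append(person)

# Average opinion of each "group" on both scales.
averages = {q: (mean(p[0] for p in members), mean(p[1] for p in members))
            for q, members in groups.items()}

for q, (avg_x, avg_y) in sorted(averages.items()):
    print(q, round(avg_x, 2), round(avg_y, 2))
```

The four averages differ clearly from each other although the data were generated without any groups, which shows that the difference comes from the cutting rule and not from the material.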

Neither would it improve the result if the researcher decided to add one more group to the center of the diagram, depicting the "average" people with average opinions. Average people do exist, all right, and they can be taken as a proper group in a typology, but it would not give any reason to see that the four extreme "groups" exist.

The above example contains a typology of four groups, but it goes without saying that the same reasoning is valid for any number of groups as well.

Is there a significant difference between classes? After you have defined the classes and filled them with empirical cases, you may still doubt whether the classes differ enough to justify treating them as separate. If all the cases share one or more important characteristics that you have measured in each case, there are mathematical methods for measuring the difference between the sets, and for evaluating whether this difference is so large that chance alone is unlikely to have caused it. Some usual methods for this are the Chi test, the t-test of the means of the classes, and the analysis of variance.

Normative Classification

           Product A   Product B   Product C
Good       81 %        34 %        9 %
Average    4 %         36 %        60 %
Bad        15 %        30 %        31 %
Generally, any cross tabulation where one of the dimensions expresses an evaluation, as in the table above, is normative. The data can be collected, for example, with the help of a consumer survey or with a Customer Feedback system.

Because all evaluation is subjective it is important to consider and define exactly whose point of view is used in the evaluation; this aspect is discussed elsewhere under the titles of Human Subjectivity and Objectivity and Normative Research. Often the most interesting opinions come from people who have been using the product that is now going to be improved, sometimes from the target group of future customers, not forgetting the people with special requirements like the elderly, the visually impaired, etc.


August 3, 2007.

Original location: http://www2.uiah.fi/projects/metodi