When studying a number of objects or observations you can
sometimes reveal an underlying structure of the material by grouping the
observations into classes. As the basis of categorization you have to
select a property or attribute which you can record for each of the studied
specimens or cases. All the cases with this attribute go into
one class; those that have it in different degrees may go into their
own classes, as do, finally, those cases (if any) that do not have the property at all.
From where do you get the basis for categorization?
Classifications are usually presented as tables. At the top of the table, in the table heading, you state the basis of classification. In delimiting the classes, you can employ any of the customary scales and other languages of description, such as verbal description, visual shape, coding, ordinal scale, interval scale or ratio scale, or, if you wish, you can invent an original classification of your own.
|Customary symbols in tables|
|(empty box)||= the table has not been completed|
|. .||= no information obtained|
|*||= preliminary data|
|-||= 0.000 (exactly zero)|
|0||= rounded to zero|
Moreover, you can place additional data in the boxes and even pictures of the specimens, if it does not take up too much space.
If you have only one criterion for classifying your material you will get only one row of boxes in your table. However, you will often want to make a secondary grouping of the same material, using another criterion for this new classification. In this case you will get as many rows
in the table as you have classes in your second taxonomy, and the boxes become arranged in columns as well as in rows. Such a
grouping of data is called cross tabulation and it is a powerful tool in searching for invariant structures in
large sets of cases or specimens.
|Books on music:||100||11||4||115|
|Books on yoga:||60||91||2||153|
A two-dimensional table is perfect for presenting a classification with two criteria, but it is a little awkward if you need more than two dimensions of grouping. Possible methods for this purpose are:
When doing exploratory research with material from a great number of objects or cases, your first tool is not necessarily classification, because it can be difficult to sort the material into a meaningful order if you have no names for the pigeonholes to put it in. To be sure, it is always possible to experiment with different categorizations in the hope that a fertile principle of grouping eventually turns up. Such experimentation is easy once the material has been loaded into a computer. You will often find it useful to regroup your material many times and redefine your class boundaries as your knowledge of the object increases. The initial number of classes might be, for example, roughly equal to the cube root of the number of cases in the population, until you find a more justifiable classification. It is practical to start with a greater number of classes than you intend to use finally, because during analysis it is much easier to combine a few classes than to divide a class.
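The cube-root rule of thumb and the advice to start with many classes and combine them later can be sketched as follows. This is only an illustration: the function names, the equal-width class boundaries and the sample values are assumptions made for the example, not part of any established procedure.

```python
def initial_class_count(population_size):
    """Rule of thumb: start with about the cube root of the number of cases."""
    if population_size < 1:
        raise ValueError("population_size must be at least 1")
    return max(1, round(population_size ** (1 / 3)))


def assign_classes(values, class_count):
    """Sort numeric values into equal-width classes numbered 0..class_count-1."""
    lowest, highest = min(values), max(values)
    # Fall back to width 1.0 when all values are equal, to avoid dividing by zero.
    width = (highest - lowest) / class_count or 1.0
    return [min(int((v - lowest) / width), class_count - 1) for v in values]


def merge_adjacent(labels, factor=2):
    """Combine every `factor` neighbouring classes into one broader class."""
    return [label // factor for label in labels]


# Hypothetical measurements, one per case.
heights = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
fine = assign_classes(heights, 5)
coarse = merge_adjacent(fine)
print(initial_class_count(1000), fine, coarse)
```

Merging is a simple relabelling of classes that already exist, whereas dividing a class would mean going back to the original measurements, which is why it pays to begin with the finer grouping.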
The goal in classification is always the same: to reveal the systematic structure, invariance, that exists in all the cases that you study. This goal is fully achieved only if all the cases fit into classes. If you have to add a category named "Others" it seems to show that the criteria for your classification are somehow deficient. Moreover, any classification works best when the boundaries between the categories are clear-cut and there are few in-between cases; it will clarify the invariance that we try to extract from the data.
Normally you would start by studying the literature on the topic, if there is any. If this leaves you with no suitable theoretical classification you can try to find the structure from the material itself. Sometimes the people that you are studying already use habitual groupings of people or of things; such indigenous categories are often serviceable in research, too. People normally organise themselves into families, tribes and work groups. For example, "Americans" and "The Impressionists" are well-known groups that you do not even need to define before using them in your study. Often the names of human groups, especially geographic or temporal divisions, can also be used to classify the products that these people have made.
Sometimes people already have names for various arrays of products like furniture and tools. Sailors, for example, habitually sort the innumerable existing boats into a small number of boat types, some of which are shown on the right. The significant differences between the yacht types concern the number of masts, the position of the mast in the aft, the number of sails in the fore, and the presence of a gaff. For the purposes of research the types can be defined as drawn outlines as in the picture, or alternatively as verbal explanations such as, for example, "a schooner = a boat with two masts and the largest sail in the aft".
More common in research, however, is that the researcher has to
define from scratch the groups that he needs to create. Typical of
exploratory research is that you will not have any ready-made
classification at hand. You have to examine your material and try to
arrange it into groups of similar cases.
If you are lucky, each specimen or case that you are studying fits into one and only one box of your classification. Such seems to be the case in classifying hand-made rugs. There is only a limited number of possible rug knots, and each type is unequivocally distinguishable from all the others, as can be seen in the illustration on the right, made by Geijer in the book A History of Textile Art. Alas, such definite classes are not frequent in research.
Regarding most products, there is an endless number of variations. Often only a few cases are exactly alike, and once you have placed them into the appropriate classes the majority still falls in between. Often the objects or cases resemble a continuum where each case is unique even though it differs only slightly from its neighbours. Moreover, there are often several attributes of the cases which seem to be significant for classification, yet none of them is of such paramount importance that it could be used as the exclusive criterion for sorting. In such a "fuzzy" situation you can try one of the following methods.
Fuzzy classification is a method which aims at placing all the cases or specimens in one or other of the classes even if the "fit" is not perfect. The method allows the simultaneous use of several criteria for sorting, that is, several attributes of the objects or cases are taken into consideration. Every member of the class complies with most of them, but not necessarily with all of them. What is common to all the members of a class is no specific attribute but family resemblance, which means that several but not necessarily all of the attributes match. Characteristic of fuzzy classification is that there will be no surplus class for those specimens or cases which would not fit.
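One possible way to express family resemblance in a computer is to count, for each class, how large a share of its criteria a case fulfils, and to place the case in the class with the largest share. The sketch below does just that. The class names, the attributes and the rule of equal weight for every criterion are assumptions made for the illustration.

```python
# Hypothetical classes, each described by a set of criteria (attribute: value).
CLASS_CRITERIA = {
    "class_1": {"material": "wood", "legs": 4, "backrest": True},
    "class_2": {"material": "steel", "legs": 4, "backrest": False},
    "class_3": {"material": "wood", "legs": 3, "backrest": False},
}


def match_share(case, criteria):
    """Share of a class's criteria that the case fulfils (0.0 to 1.0)."""
    hits = sum(1 for attribute, value in criteria.items()
               if case.get(attribute) == value)
    return hits / len(criteria)


def fuzzy_classify(case, class_criteria):
    """Place the case in the class whose criteria it matches best.

    Every case gets a class, so no surplus class "Others" is needed.
    When several classes tie, the first one listed is chosen.
    """
    return max(class_criteria,
               key=lambda name: match_share(case, class_criteria[name]))


# This specimen fits no class perfectly, yet it is still assigned to one.
specimen = {"material": "plastic", "legs": 4, "backrest": True}
print(fuzzy_classify(specimen, CLASS_CRITERIA))
```

The tie rule deserves attention in real work: a case that matches two classes equally well is exactly the in-between case that the researcher should inspect by hand.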
Typology is a method of classification where each class is formed around a "typical" or "pure" exemplar. Each object in the collected material is then compared to these exemplars and assigned to the class whose exemplar it most resembles. The exemplar can be pre-defined, but in an exploratory study it can simply be selected from among the specimens that are being studied, or it can be defined as an agglomeration of typical features of the "family".
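When the exemplars and the specimens can be described with measurements, "most resembles" can be read, as one assumption among many possible ones, as "lies at the shortest distance". The following sketch assigns each specimen to its nearest exemplar. The type names, the two measured dimensions and the use of plain Euclidean distance are all hypothetical choices for the example.

```python
import math

# Hypothetical exemplars: (height in cm, width in cm) of the "pure" type.
EXEMPLARS = {
    "low_wide": (40.0, 90.0),
    "tall_narrow": (110.0, 45.0),
}


def assign_to_type(measurements, exemplars):
    """Return the name of the exemplar closest to the measured specimen."""
    return min(exemplars,
               key=lambda name: math.dist(measurements, exemplars[name]))


print(assign_to_type((50.0, 80.0), EXEMPLARS))
print(assign_to_type((100.0, 50.0), EXEMPLARS))
```

If the attributes are measured in different units, they would normally have to be rescaled first, otherwise the attribute with the largest numbers dominates the comparison.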
It can be difficult to make a typology of large and many-detailed artifacts like buildings. A successful example is found in the book Suomen kirkot (Churches in Finland) by Carolus Lindberg. He defined 38 traditional church types using both simple depictions and verbal characterization. Below are three samples of his definitions.
Class XXIX. Wooden churches in the shape of a Greek cross. The central space is widened by orthogonal extensions which are usually lower than the rest of the church. This type was popular in Eastern Finland during the 18C.
Class XXX. Wooden churches in the shape of a Greek cross. Orthogonal extensions at the four corners of the central space. The tent-shaped roof of the midpoint space rises high over the roofs of the wings and is decorated with a pinnacle. These pyramid-roofed central churches were built in eastern Finland from the late 18C to the early 19C.
Class XXXI. A variation of type XXX, with the difference that a vertical segment interrupts the central tent-shaped roof.
Artistic style is sometimes used for sorting products, though it is often very difficult to define and the risk of subjective variation is great. On the left is a classical pictorial exposition of architectural styles from Regole generali di architettura, by Sebastiano Serlio, 16C.
Stylistic classification is almost always a fuzzy one, because the criteria of style are usually defined as a set of qualifications, all of which need not be fulfilled at the same time. Here again, "family resemblance" to the exemplars determines whether the given object belongs to a given style or not.
Stylistic classification (like any other classification) can be used to sort objects that belong to the same point of time, e.g. products from various artists, manufacturers, users, cultures or geographical areas. Note that styles are often studied not as classifications but instead as historical sequences.
Weaknesses of typologies. It is usually difficult to measure the goodness of fit of a case; thus typologies and classifications often remain rather subjective. Especially when a single person, i.e., the researcher, both defines the classes and places objects in them, it can become too easy for him to demonstrate the validity of any concocted typology simply by disregarding (as "non typical") all those objects or cases which do not fit into the classification. To reduce subjectivity the exemplars should be defined as explicitly as possible and preferably in both verbal and visual presentation (and also with the help of measurement, if applicable).
The correct method for pointing out the average case or the typical cases in the population is first to define the population unambiguously, then to select a proper sample of it, classify all items of the sample, and note the most frequent type(s).
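The procedure just described (define the population, draw a sample, classify every item, note the most frequent type or types) might be sketched as follows. The population, the size limits of the classes and the simple random sampling are assumptions made for the illustration.

```python
import random
from collections import Counter


def most_frequent_types(labels):
    """Return the type or types that occur most often among the labels."""
    counts = Counter(labels)
    highest = max(counts.values())
    return sorted(label for label, n in counts.items() if n == highest)


def typical_types(population, classify, sample_size, seed=0):
    """Draw a simple random sample, classify each item, report the mode(s)."""
    sample = random.Random(seed).sample(population, sample_size)
    return most_frequent_types(classify(item) for item in sample)


def size_class(length_cm):
    """Hypothetical classification rule with explicit class boundaries."""
    if length_cm <= 20:
        return "small"
    if length_cm <= 70:
        return "medium"
    return "large"


# Hypothetical population: one measured length per object.
population = list(range(1, 101))
print(typical_types(population, size_class, sample_size=30))
```

Because the classification rule is written down before the sample is drawn, none of the objects can afterwards be set aside as "non typical".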
Cluster Analysis is a suitable method for constructing a classification for cases or specimens for
which you have a lot of measurements but you do not know which of the variables are best suited for the basis of classification. It is a tool
for finding groups of individual cases which resemble each other
to the highest degree. The researcher can either select the number of
groups in advance, or leave it open, in which case the analysis program
will propose several alternative groupings.
The figure on the right shows an example where the computer program has analyzed 30 cases and found that the ones that most resemble each other are cases numbered 8 and 29, and also respectively 6 and 22 etc.
Moreover, the cluster analysis program has created a logical tree where the number of groups gradually drops. It will then be up to the researcher to decide how many clusters shall be accepted as the final result of the analysis. In the figure, three final groups: A, B and C have been selected, but two groups would also have been possible: A and (B+C). Each researcher has to make the decision on the basis of his theory, as is usual in all statistical analysis. To make the decision, you should consider the theoretical meaning and interpretation of each cluster.
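The principle of such a program can be sketched in a few lines. The version below is a minimal agglomerative clustering that repeatedly joins the two groups whose closest members are nearest to each other (so-called single linkage) and records every merge, which corresponds to the logical tree. It is an illustration under stated assumptions, not the program used for the figure: the measurements are hypothetical, and statistical packages offer several other linkage rules and distance measures.

```python
import math
from itertools import combinations


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until only n_clusters remain.

    Returns the final clusters (lists of case indices) and the list of
    merges in the order they were made, i.e. the tree from the bottom up.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda pair: linkage(clusters[pair[0]], clusters[pair[1]]))
        merges.append((clusters[a], clusters[b]))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return clusters, merges


# Hypothetical cases, each with two measurements.
cases = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
three_groups, tree = agglomerate(cases, 3)
two_groups, _ = agglomerate(cases, 2)
print(sorted(sorted(c) for c in three_groups))
print(sorted(sorted(c) for c in two_groups))
```

Running the function with different values of n_clusters shows the same choice that the researcher faces in the figure: the program will produce any number of groups, and only theory can tell which number is meaningful.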
Cross Tabulation. As was noted above, in exploratory study it is often difficult enough to find the basis for a single classification, let alone for a cross tabulation of two or more properties of the objects or cases. However, with modern computers it is easy to experiment with tentative cross tabulations on the basis of the various attributes that you have recorded from the cases or objects, and it often helps to detect hidden structures in the material.
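A tentative cross tabulation of two recorded attributes needs very little programming. The sketch below counts the cases in every combination of two attributes. The records and the attribute names "pattern" and "region" are hypothetical and serve only to show the mechanics.

```python
from collections import Counter


def cross_tabulate(cases, row_attr, col_attr):
    """Count the cases in every combination of two recorded attributes."""
    counts = Counter((case[row_attr], case[col_attr]) for case in cases)
    row_labels = sorted({row for row, _ in counts})
    col_labels = sorted({col for _, col in counts})
    # A Counter returns 0 for combinations that never occur.
    table = [[counts[(row, col)] for col in col_labels] for row in row_labels]
    return row_labels, col_labels, table


# Hypothetical records, one dictionary per specimen.
pots = [
    {"pattern": "A", "region": "north"},
    {"pattern": "A", "region": "north"},
    {"pattern": "A", "region": "south"},
    {"pattern": "B", "region": "south"},
    {"pattern": "B", "region": "south"},
]

rows, cols, table = cross_tabulate(pots, "pattern", "region")
print("pattern", *cols)
for label, row_counts in zip(rows, table):
    print(label, *row_counts)
```

Since any pair of attribute names can be passed to the function, it is easy to try one tentative tabulation after another and keep only those that reveal some structure.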
Below is an example of exploratory cross tabulation where over 3000
specimens of ancient Egyptian pottery are arranged according to their
ornamentation. Each row in the table contains the pots which have a
specific shape of pattern, as defined below, from A to F:
On the basis of this exploratory table the author, Rostislav Holthoer (1977), continued his study by recording the number of specimens in each class and recognizing the region and period they came from, and in this way he could gradually construct a theoretical structure of all the material.
The most usual method of searching for structure in cross-tabulated data is to look for an association between two characteristics of the objects, which is often indicated by an unusually high (or low) frequency of cases where a certain combination of characteristics occurs.
|Table #1||Stool A||Stool B|
In table #1 the percentage of pains is equal (10%) for both stools, which does not indicate any connection between back pains and type of stool.
|Table #2||Stool A||Stool B|
How large must a difference in frequency be to be statistically significant? There are algorithms that can give an answer to this question, notably the Chi test (the chi-square test), which is explained on another page.
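As a rough sketch of how such a test works, the chi-square statistic compares the observed frequencies of a table with the frequencies that would be expected if the two characteristics were unrelated. The counts below are hypothetical (100 users per stool) and were chosen only to mimic the two situations discussed above. The calculation uses no continuity correction, and the p-value formula applies only to a table with two rows and two columns.

```python
import math


def chi_square(table):
    """Pearson chi-square statistic and degrees of freedom for a count table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Frequency expected if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return statistic, dof


def p_value_one_dof(statistic):
    """Probability of a statistic at least this large by chance (1 degree of freedom)."""
    return math.erfc(math.sqrt(statistic / 2))


# Rows: stool A, stool B. Columns: back pains, no back pains. Hypothetical counts.
equal_rates = [[10, 90], [10, 90]]
different_rates = [[10, 90], [30, 70]]

for name, counts in (("equal", equal_rates), ("different", different_rates)):
    statistic, dof = chi_square(counts)
    print(name, round(statistic, 2), dof, round(p_value_one_dof(statistic), 4))
```

A small p-value says only that chance is an unlikely explanation for the difference. It says nothing about why the difference exists.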
Once a statistical association has been found, the study is often continued by trying to determine the exact structure of this invariance and perhaps by finding its explanation.
Of course, mere correlation between two variables does not prove that causality exists, because the correlation can be due to other possible explanations. Therefore you should complement the tabulation with other methods like interviewing (if people are involved in the activity to be studied) or experiment where you exclude the non-desirable factors.
The examples of exploratory classification given above show that the classes that are finally selected will seldom emerge purely from empirical data. Normally they will also have some theoretical justification, borrowed from somewhere. There is thus no sharp contrast between the approaches of using, or not using, classes that have been defined earlier. Using them is the theme of the following paragraph.
Quite often the target of the study, i.e. the question that you are studying or the intended use of the results, will define the population to be studied and to some extent also the classes into which the population or the sample of it should be divided. Two ambitions of research which often give grounds for defining both the population and its classification are:
|Beginning of the Phase||Phase of Family|
|Marriage etc.||Establishing family|
|First baby born||Growing family|
|Last baby born||Static period|
|First child leaves home||Shrinking family|
|Last child leaves home||Post-parental phase|
|First parent dies||Disintegration of family|
|Second parent dies||-|
In everyday life, people are habitually characterized and classified by selecting a suitable attribute from a pair of opposites, like old/young or urban/rural. Similar logic has been used in many scientific classifications of people, too. Sometimes the classification has been so successful that it has even been adopted into the common language, as has happened with the class "yuppie" (from YUP = Young Urban Professional, as a contrast to persons that are old or rural, or both).
However, you should always keep in mind that being able to define two or four classes does not guarantee that those classes exist in reality.
For example, when studying the users and potential buyers of industrial products, you might wish to divide these people into a few groups on the basis of their tastes, so that your company could then design a specific "styling" of products for each of these groups. Such a hypothetical division into four "taste groups" has been tentatively drafted in the upper diagram on the right. Each person is there represented by a dot.
If you test the above "taste group" hypothesis with any natural population of people, you will very probably find that their opinions are distributed more or less like the lower diagram on the right, and it would definitely distort the truth to present the population as consisting of four groups. The reason is that opinions, like most properties of people, nearly always follow the distribution depicted in the upper diagram on the left, which researchers have therefore begun to call the "normal distribution". Distributions like the lower diagram on the left are quite exceptional.
The credibility of the above-mentioned study would not improve - on the contrary - if the researcher tried to prove the existence of his clusters by calculating their averages, as has been done in the third diagram (on the right). The averages would certainly differ, but this would just be a result of the researcher's original willful classification of the cases - it would tell nothing of the empirical data itself.
Neither would it improve the result if the researcher decided to add one more group to the center of the diagram, depicting the "average" people with average opinions. Average people do exist, all right, and they can be taken as a proper group in a typology, but it would not give any reason to see that the four extreme "groups" exist.
The above example contains a typology of four groups, but it goes without saying that the same reasoning is valid for any number of groups as well.
Is there a significant difference between classes? After you have defined the classes and filled them with empirical cases, you may still be in doubt whether the classes differ enough to justify treating them as separate. If all the cases share one or more important characteristics that you have measured in each case, there are mathematical methods for measuring the difference between the sets, and for evaluating whether this difference is so large that it is unlikely to have been caused by chance alone. Some usual methods for this are the Chi test, the t-test of the means of the classes, and the analysis of variance.
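For the t-test of the means of two classes, the test statistic itself is easy to calculate. The sketch below uses the variant that does not assume equal variances in the two classes (usually called Welch's t-test), which is an assumption of this example, and the measurements are hypothetical. Turning the statistic into a probability requires the t distribution, for which a statistical table or a statistics package is normally used.

```python
import math
from statistics import mean, variance


def welch_t(class_a, class_b):
    """t statistic and approximate degrees of freedom for two class means."""
    # Squared standard error of each class mean (sample variance / n).
    se_a = variance(class_a) / len(class_a)
    se_b = variance(class_b) / len(class_b)
    t = (mean(class_a) - mean(class_b)) / math.sqrt(se_a + se_b)
    dof = (se_a + se_b) ** 2 / (
        se_a ** 2 / (len(class_a) - 1) + se_b ** 2 / (len(class_b) - 1)
    )
    return t, dof


# Hypothetical measurements of the same characteristic in two classes.
class_a = [1, 2, 3, 4, 5]
class_b = [2, 3, 4, 5, 6]
t, dof = welch_t(class_a, class_b)
print(round(t, 3), round(dof, 1))
```

The further the statistic lies from zero, the less plausible it is that the two classes differ only by chance.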
| ||Product A||Product B||Product C|
|Good||81 %||34 %||9 %|
|Average||4 %||36 %||60 %|
|Bad||15 %||30 %||31 %|
Because all evaluation is subjective it is important to consider and define exactly whose point of view is used in the evaluation; this aspect is discussed elsewhere under the titles of Human Subjectivity and Objectivity and Normative Research. Often the most interesting opinions come from people who have been using the product that is now going to be improved, sometimes from the target group of future customers, not forgetting the people with special requirements like the elderly, the visually impaired, etc.
August 3, 2007.
Original location: http://www2.uiah.fi/projects/metodi