Before you submit data to analysis, it will often be useful to perform some preliminary operations. These may include:

- **Removal** of data which are obviously erroneous or irrelevant. This should be done with caution: *outliers*, or data which are *anomalous* and do not harmonize with your hypotheses, are perhaps not faulty after all. They may equally well show that your hypothesis is defective.
- **Normalizing** or **reducing** your data means eliminating the influence of some well-known but uninteresting factor. For example, you may remove the effect of inflation by dividing all prices by the price index at the date of purchase.

In the analysis itself, the target usually is to extract an invariance, an interesting structure in the data. This does not mean feeding the data into a computer and expecting the machine to report the structures that can be found in them. Computers are not intelligent enough for that.

Instead, it is common that already early in the project the investigator has a mathematical model that he tries to apply to the data. This model also provides the possible hypotheses, if any, for the investigation project, or at least it functions as an initially inexact working hypothesis that will be refined during the analysis.

In other words, the investigator first arranges the data into the pattern given by the model, and then assesses whether the model gives a plausible picture of the data, or whether it will still be necessary to search for another, better model.

The general pattern of the assumed invariance thus gives a starting point for the analysis. When choosing the method of analysis you have to ask yourself whether you want to analyse single variables separately, or the relations between several variables. Or do you just want to use the measured variables for classifying or sorting individuals or cases?

Another important decision concerns the final purpose of your project. Do you want to describe what the present (or past) state of your object *is*, or do you wish to find out what the object *should be*: which degree of the measured attributes would be optimal? This latter type of analysis is discussed under the title Adding a normative dimension to a descriptive analysis.

The following table lists some usual methods of statistical analysis of **one** single variable. The methods have been arranged according to the scale of measurement of the variable.

| | Nominal scale | Ordinal scale | Interval scale | Ratio scale |
|---|---|---|---|---|
| Methods of presenting data | Tabulation; graphical presentation | | | |
| Averages | The mode | The median | The arithmetic mean | |
| Measures of dispersion | | The quartile deviation; the range | Standard deviation | |

(Each method is applicable on the scale in whose column it first appears, and on all scales to the right of it.)

A simple way of presenting a distribution of values is to present each value as a dot on a scale. If there is a great number of values, it may be better first to classify them and then present the frequency of each class as a **histogram** (fig. on the right).

If your studies involve people, your measurements quite often turn out to be distributed according to a certain curve, the so called *Gauss curve* (on the left) which is therefore called the **normal distribution**. One of its properties is that 68% of all measurements will differ from the mean (in the figure: **M**) by no more than the standard deviation, and 95% by two standard deviations.
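The 68% and 95% figures can be checked from the normal cumulative distribution. A minimal sketch using only Python's standard library (the helper name `within_k_sd` is our own):

```python
import math

def within_k_sd(k):
    """Fraction of a normal distribution lying within k standard
    deviations of the mean, via the Gauss error function."""
    return math.erf(k / math.sqrt(2))

print(round(within_k_sd(1), 3))  # about 0.683 -> the "68%" rule
print(round(within_k_sd(2), 3))  # about 0.954 -> the "95%" rule
```

The second value shows that "95% within two standard deviations" is a slightly rounded convention; the exact figure is closer to 95.4%.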

Sometimes you will wish to emphasize not the *absolute* distribution but the *proportional* or percentage
distribution. A suitable diagram for this is the **pie chart** (on the
right):

The usual averages are:

- the mode
- the median
- the arithmetic mean.

**Mode** is the value which occurs most frequently in the selection.

**Median** is the value in the middle of the selection, when all the values are first arranged from the smallest to the largest.

**(Arithmetic) mean** is the sum of all the values divided by their number, or

M = (x_{1} + x_{2} + ... + x_{n}) / n

From the averages presented above, the researcher can usually choose the one that best shows the typical value of the variable. The arithmetic mean is the most popular one, but it can give a wrong picture, e.g. in data which include one value that greatly differs from the others (see the picture below).

The same happens if the distribution is **skewed**, as in the picture on the right. In the example, the minutes that different subjects spent carrying out a certain task have been listed. The fastest subjects needed 5 minutes, but the most common performance (= the *mode*) was seven minutes. The value in the middle, i.e. the *median*, is shown as a red letter M in the picture; here the median has the value 11.

What about the *mean?* As the performance of the slowest subject took as long as 34 minutes, the mean went up to 11.98 minutes, which does not give a very accurate picture of the average performance in this case. This shows that if the data are skewed, the type of average must be chosen with care. A graphic presentation would often be more illustrative than a single statistic.
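The effect is easy to reproduce. The data below are hypothetical (not the exact values behind the figure), but they are built to have the same character: mode 7, median 11, and one slow outlier that drags the mean upwards.

```python
import statistics

# Hypothetical task times in minutes, skewed to the right by one outlier.
times = [5, 7, 7, 7, 10, 11, 12, 14, 18, 25, 34]

print(statistics.mode(times))            # 7  - the most common value
print(statistics.median(times))          # 11 - the middle value
print(round(statistics.mean(times), 2))  # 13.64 - pulled upwards by the outlier
```

With a symmetric distribution the three averages would roughly coincide; here the gap between the mode and the mean is itself a sign of skew.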

The distribution shown in the picture is **positively** skewed: the measurements that are larger than the median (= 11) spread out over a large range (from 11 to 34), while the measurements below the median are concentrated in a narrow range (from 5 to 11).

A statistic can also be found to describe the amount of skew, if necessary.

When selecting the most suitable average, you should consider the scale which was used in the collection of the data. If the scale was nominal, the only possible average is the mode. If the scale was ordinal, you may use either the median or the mode. Note, however, that the categories of scales are not always quite distinct; for example, the common worded scale

beautiful / - / - / - / - / - / ugly

should actually be regarded as ordinal, because the intervals between the markings are not truly equal (people prefer to put their ticks near the middle, because the intervals near the ends are perceived as greater). However, many researchers prefer using the mean, not the median, as a summary for this type of questionnaire, which means that they treat this scale as arithmetical.

Finally, if the average was calculated from a sample, you should test its statistical significance: how probable is it that the same average holds in the population from which the sample was drawn? A suitable test for this is the t-test.
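The t-test itself is a standard procedure described in statistics handbooks; the sketch below only computes the one-sample t statistic, which would then be compared with a critical value from a t table at n − 1 degrees of freedom. The data and the helper name `one_sample_t` are hypothetical.

```python
import math
import statistics

def one_sample_t(data, mu0):
    """t = (sample mean - hypothesized mean) / (s / sqrt(n)),
    where s is the sample standard deviation."""
    n = len(data)
    s = statistics.stdev(data)  # sample formula, divisor n - 1
    return (statistics.mean(data) - mu0) / (s / math.sqrt(n))

sample = [12, 14, 11, 15, 13, 12, 16, 11]   # hypothetical measurements
print(round(one_sample_t(sample, 12.0), 3))  # positive: sample mean exceeds 12
```

The farther the statistic is from zero, the less plausible it is that the population mean equals the hypothesized value.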

Once you have calculated the average value, it would sometimes be interesting to describe how far the singular values are scattered around the average. To this end, you may choose between a variety of statistics. The selection depends on the type of average that you have used:

- In connection with the mode, the dispersion of values is seldom interesting.
- If you have calculated a median, you will often want to indicate the spread of values around it. A suitable measure for that is the **quartile deviation**. The "higher quartile" is the value which is surpassed by 25% of all measurements; likewise, 25% of all values are lower than the "lower quartile". (The quartiles are marked with a green Q in the diagram above.) The average deviation of the quartiles from the median is called the quartile deviation, and it is easily calculated by halving the difference of the quartiles.
- An alternative and very simple statistic is the **range**: the difference between the greatest and the smallest value.
- In connection with the arithmetic mean you will often want to calculate the **standard deviation**. If the values have been measured from a whole population, the formula is

  s = √( Σ (x − M)² / n )

  However, if the standard deviation concerns just a random sample, the formula is

  s = √( Σ (x − M)² / (n − 1) )

  In both formulas, **n** is the number of the values, **M** is the mean, and the values of the variable are substituted for **x** one at a time.
Hardly any researcher bothers to perform the calculation himself because
the necessary algorithm for it now exists even in pocket calculators.
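The dispersion statistics above can be sketched with Python's standard library. Note that `statistics.pstdev` uses the population formula (divisor n) and `statistics.stdev` the sample formula (divisor n − 1), and that quartiles can be computed by several slightly different conventions; the data are hypothetical.

```python
import statistics

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Range: the difference of the greatest and the smallest value.
value_range = max(data) - min(data)

# Quartile deviation: half the difference of the quartiles.
q1, q2, q3 = statistics.quantiles(data, n=4)  # default "exclusive" convention
quartile_deviation = (q3 - q1) / 2

print(value_range)                       # 7
print(quartile_deviation)
print(statistics.pstdev(data))           # 2.0 - population formula
print(round(statistics.stdev(data), 3))  # about 2.138 - sample formula
```

The sample standard deviation is always slightly larger than the population one, compensating for the fact that a sample tends to understate the spread of the whole population.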

The square of the standard deviation is called **variance**, and it, too, is often used to describe and to analyse the dispersion.

If the statistic of dispersion has been calculated from a sample, its statistical significance should also be calculated in the end. The t-test is suited for this.

If two variables vary in such a way that they follow each other to some extent, we say that there is an *association* or *covariation* between the variables. For example, the height and weight of people are statistically associated: although nobody's weight is caused by his height, nor the height by the weight, it is nevertheless usual for a tall person to weigh more than a short person. On the other hand, the data usually include exceptions as well, which means that a statistical association is inherently *stochastic*.

The science of statistics offers numerous methods for revealing and presenting the associations between two or more variables. The simplest means are the methods of graphic presentation and tabulation. The strength of an association between the variables can also be measured with the help of special statistics, such as contingence or correlation.

If, when analysing the data, an association between two variables is discovered, the researcher often would like to know the reason for this association in the empirical world; in other words, he would like to *explain* it. Usual types of explanations are enumerated on the page Description and Explanation. Common to all of them is that they give the *cause* of the phenomenon that is being studied. When measurements have been made of a series of these phenomena, one series of measurements, called the **independent variable**, is thus usually made of the presumed cause, and another series of measurements, the **dependent variable**, of the presumed effect.

Note that when analyzing variables, no method of mathematical analysis can find out the causal explanation for a statistical association, nor even which variable measures the cause and which the effect. Indeed, a strong covariation between two variables, say A and B, can be due to any of four alternative reasons:

- A is the cause of B.
- B is the cause of A.
- Both A and B are caused by C.
- A and B have nothing to do with each other. Their association in the analyzed data is a coincidence.

The researcher must thus find the causality or other explanation for the association of variables somewhere else than in the measurements. In many cases, the original theory of the researcher can provide an explanation; if not, the researcher must use his common sense to clarify the causal relationships.

In the following, we mention some usual methods of statistical analysis which can be used when studying the interdependence between two or more variables. The methods have been arranged according to the type of scale that has been used in the measurement.

| Target of analysis | Nominal scale | Ordinal scale | Interval scale | Ratio scale |
|---|---|---|---|---|
| Presenting data and its rough structure | Tabulation; graphics | | | |
| Measuring the strength of association between two variables | Coefficient of contingence | Ordinal correlation | Pearson's r correlation | |
| Finding which variables among several are associated | Calculating pairwise contingences or correlations for all the variables; factor analysis | | | |
| Transcribing a statistical association into a mathematical function | | | Regression analysis | |

(Each method is applicable on the scale in whose column it first appears, and on all scales to the right of it.)

Some abbreviations conventionally used in tables are presented on the page Classification.

Products, as objects of study, are often presented as pictures, which
are one kind of graphical presentation. (Examples of pictorial presentations.)

If the researcher wishes to highlight some common traits or
general patterns he has found in a group of objects, he can combine
several objects in one graphic, like in the figure on the left. In the
diagram, Sture Balgård shows how the old buildings in
Härnösand follow uniform proportions of width and height (the
red line) with just a few exceptions. In inventing illustrative methods of
presenting the findings of the study of products, the most serious
restriction is the imagination of the researcher.

Often, however, the appearance of the object itself is not important and only the *numerical values* of the measurements are of interest. In that case, the first question to consider when selecting the type of graphics is what structure in the data you wish to show. Of course, you must not "lie with the help of statistics", but it is always admissible to select a style of presentation which highlights the important patterns by eliminating or diminishing the uninteresting relations and structures.

If your data consist of only a few measurements, it is possible to show all of them as a **scattergram.** You may exhibit the values of two variables on the two axes x and y, and one or two additional variables by using the colours or shapes of the dots. In the diagram to the right, the variable **z** has two values, which are indicated by a square and a plus sign.

If the variation is too small to appear clearly, you may emphasize it by cutting off parts of one or both of the scales, see examples. You simply cut off the uninteresting part, either from the top or the bottom. The discarded part must be empty of empirical values. To make sure that the reader notices the operation, it is better to show it not only in the scales but also in the background grid of the diagram.

On the other hand, if the variation range of your data is very
large, you may consider using a **logarithmic scale** on one or both
of the axes, see diagram on the left. Logarithmic scaling is appropriate on
a ratio scale only.

If you have hundreds of measurements, you will probably not want to
show each of them as a scattergram. One possibility in this case is to
classify the cases and present them as a histogram.

The histogram may be adapted to present up to four or five
variables. You can do this by varying the widths of the columns, their
colours, background patterns and by a three dimensional presentation
(fig. on the left). All these variations are easily created by a spreadsheet
program like Excel, but they should not be used for decoration only.

The **patterns** filling or making up the histogram columns may be chosen so that they symbolize one of the variables. For example, the columns describing the number of cars may be formed by piling up pictures of cars one above the other. This is all right, provided that you do not vary the **size of the symbols** used in a histogram; otherwise the interpretation becomes difficult for the reader (does the number of cars relate to the length, the area, or the volume of the car symbols?).

The researcher is often interested in the **relations** of two or more variables rather than in the detached pairs of measurements as such. The normal way of presenting two or more interdependent variables is the **curve.** It implies a **continuous** variable (i.e. one where the number of possible values is infinite).

You should not fabricate a curve from
measurements which are not values of the **same** variable. For
instance, the attributes of an object are different variables. Examples are
the personal evaluations that researchers often gather with the help of
semantic differential scales of the type below:

Estimate the characteristics of your bedroom. Tick one box on every line.

| Light | _ | _ | _ | _ | _ | _ | _ | Dark |
|---|---|---|---|---|---|---|---|---|
| Noisy | _ | _ | _ | _ | _ | _ | _ | Quiet |
| Clean | _ | _ | _ | _ | _ | _ | _ | Dirty |
| Big | _ | _ | _ | _ | _ | _ | _ | Small |

It would now be pointless to present the various evaluations of the bedroom as a single "profile", as in the diagram on the left (although you often find this type of illogical presentation in research reports).

If you absolutely want to stress that the variables belong together
(e.g. because all of them are evaluations of the same object), an
appropriate method would be e.g. a horizontal histogram (on the right).

All of the above diagrams can be combined with **maps** and
other topological presentations. For
example, the variation in the different areas of the country is often shown
as a **cartogram** by distinguishing the different districts with different colours or shades. Another technique is the **cartopictogram** in which small pie or column diagrams have been placed on the map. If you need to show associations between areas this can be done with arrows whose thickness indicates the intensity of the connection. (Example.)

How close a relation two variables have with each other can be studied by means of tabulation or graphical presentation, and it can be measured with special statistics, too. The statistics available for analysing the links between two variables depend on the type of scale by which the variables have been measured (see the table presented earlier).

- The **contingence quotient** is appropriate for all types of variables, regardless of the type of their scale (classification, ordinal or arithmetic). An alternative statistic for this task is Chi squared, presented on the page *Assessing the Findings*.
- **Ordinal correlation** is suitable when at least one variable has been measured with an ordinal scale. The other can be either ordinal or arithmetical.
- The standard **correlation**, or more precisely the *product moment correlation*, is available when both variables are measured on an arithmetic scale.

Formulas for calculating the statistics of contingence are not shown here because performing the calculations manually would be awkward and researchers usually do them with computers.

The product moment correlation, or Pearson's correlation, which is usually abbreviated with the letter **r**, measures how closely the association between two variables resembles the linear equation
*y = ax + b.* If the correlation coefficient is high, in other words if its value approaches either +1 or -1, the relation between the two variables approaches this equation. If the correlation is low, e.g. something between -0.3 and +0.3, the two variables do not have much to do with each other (more exactly, they have almost no *linear* covariation). The sign of the correlation coefficient is not important in itself; it is always identical with the sign of the coefficient *a* in the above equation.
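Pearson's r can be computed directly from its definition: the covariation of the two variables divided by the product of their spreads. A minimal sketch (the helper name `pearson_r` and the data are our own):

```python
import math

def pearson_r(xs, ys):
    """Product moment correlation of two equal-length value lists."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # close to +1: exactly linear, rising
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # close to -1: exactly linear, falling
```

In practice a statistics program or spreadsheet performs this calculation, but the formula shows why the sign of r follows the sign of the slope a.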

Below you can see three scattergrams which show three different sets of data from two variables, each set consisting of eight pairs of values. The correlations between the two variables have been calculated and are shown under each scattergram. It can be seen that there is no correlation between the variables in the set on the left, and the other two sets show correlations of 0.5 and 1.0.

Although correlation analysis can handle only two variables at a time, it is an excellent tool for the initial analysis of a large number of variables, when you have no clear idea of their mutual relations. A computer can quickly calculate the correlations between all possible pairs of variables, finally constructing a *correlation matrix* from the results. You can then select those pairs that have the strongest correlation and continue by examining these pairs with other, more refined tools of analysis.
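The pairwise approach can be sketched as follows; the variable names and values are hypothetical, and `pearson_r` is our own helper implementing the usual product moment formula.

```python
import math
from itertools import combinations

def pearson_r(xs, ys):
    """Product moment correlation of two equal-length value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: three variables measured on the same five cases.
variables = {
    "height": [160, 165, 170, 175, 180],
    "weight": [55, 62, 66, 74, 79],
    "shoe":   [37, 39, 40, 43, 44],
}

# Correlations of all possible pairs, as in a correlation matrix.
for a, b in combinations(variables, 2):
    print(f"{a} / {b}: r = {pearson_r(variables[a], variables[b]):.2f}")
```

A full correlation matrix is symmetric with ones on the diagonal, so listing each pair once, as here, conveys the same information.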

A weakness of correlation analysis is that it cannot detect other than linear relations between the variables. E.g., a relation that obeys the equation
*y = ax^{2}* could pass unnoticed. However, some of the newer analysis programs are able to detect this and some other common relationships between variables. Besides, you can try to:

- replace the values of a variable with its values squared, its square root or with some other modification (the computer takes care of the calculation) and redo the correlation matrix, or
- make a scattergram of the two variables which you think might have a relation, and see whether the resulting pattern follows the shape of any appropriate mathematical function.
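The weakness, and the transformation remedy suggested above, are easy to demonstrate: for a U-shaped relation y = x² measured symmetrically around zero, the linear correlation is exactly zero even though y is fully determined by x. The data are hypothetical and `pearson_r` is our own helper.

```python
import math

def pearson_r(xs, ys):
    """Product moment correlation of two equal-length value lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]   # a perfect, but non-linear, dependence

print(round(pearson_r(xs, ys), 6))                    # 0.0 - passes unnoticed
print(round(pearson_r([x ** 2 for x in xs], ys), 6))  # 1.0 - after squaring x
```

Replacing x with its square, as the first bullet point suggests, turns the hidden relation into a perfectly linear, and thus detectable, one.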

Once you have found a pair of variables with a strong correlation (or contingence) you can continue, for example, with the following operations:

- consider on the basis of your theory: which variable of the pair is independent (i.e. the cause) and which is the dependent one (the effect), and whether the relationship can involve still more variables.
- find out the exact pattern of the relationship. Possible methods for this include the analysis of time series and of regression.

Finally, remember that when a correlation has been calculated from a sample, you should examine its statistical significance with the t-test.

The researcher has often theoretical or practical reasons to believe that a certain variable is causally dependent on one or more other variables. If there are enough empirical data on these variables, the classical or "multivariate" regression analysis is a suitable method for revealing the exact pattern of this association.

Regression analysis finds the linear equation which deviates as little as possible from the empirical observations. For example, in the diagram on the right, the dots symbolize the observations where two variables have been measured, and the line represents the equation y = 8x + 45, obtained with regression analysis so that the sum of the squared differences from the measured values of y becomes minimal.

The diagram contains only four observations, which is much too few to produce a plausible equation, because the observations could quite well be the result of chance only, without any real dependence between the variables. If you want plausible or "statistically significant" results, you will need many more observations, perhaps 40 multiplied by the number of explaining variables.

The algorithm of regression analysis constructs an equation of the following pattern, which can have several independent variables. It gives the parameters a_{1}, a_{2} etc. and the constant b such values that the equation corresponds to the empirical values as closely as possible.

y = a_{1}x_{1} + a_{2}x_{2} + a_{3}x_{3} + ... + b

In the equation,

y = the dependent variable

x_{1} , x_{2} etc. = independent variables

a_{1} , a_{2} etc. = parameters

b = constant.
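For the simplest case of a single independent variable, the least-squares parameters have a closed form: a = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)², and b = ȳ − a·x̄. A sketch, with hypothetical data chosen to lie exactly on the example line y = 8x + 45:

```python
def linear_regression(xs, ys):
    """Least-squares fit of y = a*x + b for one independent variable."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    b = my - a * mx
    return a, b

xs = [1, 2, 3, 4]
ys = [53, 61, 69, 77]   # hypothetical values, exactly y = 8x + 45
a, b = linear_regression(xs, ys)
print(a, b)             # 8.0 45.0
```

Real data would scatter around the line instead of lying on it; the formula would then return the line minimizing the sum of the squared vertical deviations. For several independent variables the same principle applies, but the calculation is normally left to a statistics program.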

A disadvantage of the algorithm of regression analysis is that it can detect only linear relationships between the variables. Thus it cannot directly handle such common forms of equation as *y = ax^{2} + bx + c*. This difficulty, however, can be avoided by temporarily replacing the non-linear variable with a suitable transformation of it, such as its square, square root, inverse, or logarithm.

If you have extensive data with many variables, and no plausible hypothesis about their relationships, you are perhaps at the beginning of the analysis not sure which variables should be included in the equation. You might first study this by making a correlation matrix. Another alternative is to let the regression analysis program select the "right" variables (x_{1}, x_{2} etc) to the equation. "Right" are those variables which have the best ability to explain the behavior of the dependent variable, in other words which improve the closeness of fit between the equation and the empirical values.

When one of the independent variables is time, and especially when we have a series of measurements at equal intervals, this series goes by the name *time series*. Regression analysis is a suitable tool for revealing a *trend* or a long-term development in a time series, see Historical Study. This trend can often be used for forecasting the future development of the dependent variable.

In classical regression analysis the sought-after equation contains only one dependent variable. In the case that more than one dependent variable seem to be involved, a suitable tool for their analysis is *canonical correlation,* not discussed here.

A suitable procedure for assessing the statistical significance of the equation created by regression analysis is the t-test.

All the questions in a questionnaire can be seen as variables, the values of which are found by studying the answers that each question receives. Usually most of the questions are more or less related to a given topic, and it is therefore normal that some of these variables turn out to have a high mutual correlation. The researcher now might want to find out whether one or more new, artificial variables could be constructed so that each of them has a high correlation with a group of the original variables. These "background" variables or **latent factors** could thus be said to "contain" approximately a group of the original variables, and data contained in the questionnaires would be greatly compressed and become easier to comprehend. Factor analysis is the normal method of finding these background variables.

For example, in a study about young Finnish people's style of dress, Sinikka Ruohonen (2001, p. 97) examined with a questionnaire the respondents' activities of leisure, and found out that there was a high correlation between spending time in concerts, art galleries, theaters, libraries and reading books, and these activities had a negative correlation with watching TV or athletics games. Ruohonen gave the name "cultural factor" to the factor that associated with these activities. It also correlated with a high education of mother and father, and with independence from others' opinions when buying clothes.

Another factor that Ruohonen found and named "aesthetic-social", contained such objectives of selecting clothes as highlighting one's good looks, portraying self-confidence and own personality, attracting attention, showing camaraderie, common values and ideologies. This factor correlated also, in lesser degree, with making clothes oneself, not using furs, and being interested in ecology.

A third factor, "spending", contained several indicators of using money for clothing, cosmetics and jewels, as well as the appreciation of style, quality and fashion.

With the computer program of factor analysis, the factors hidden behind the empirically measurable variables can be detected and specified, and the analysis also tells how closely these factors are linked with the original variables. The researcher has also the option of placing an extra condition on the factors, namely that they must not correlate with each other at all, and they can hence be said to be "perpendicular" to each other (= "orthogonal rotation" of the factors during analysis). You should not use this option if you want the factors to have as high a correlation as possible with the original variables.

A drawback of the method of factor analysis is that a formally correct but thoughtless use of it can easily produce a set of elegant, mathematically exact factors which however have no sensible empirical meaning. In the study quoted above, Ruohonen avoided this dead end by continuing her study with interviews with a few respondents that had scored high on one or the other extremity of a factor and who thus were able to clarify their attitudes and life styles, and give grounds for their opinions that differ from the average.

It is possible to continue factor analysis by grouping the respondents (or other empirical cases) into groups on the basis of their scores on one or several of the factors that were found in analysis. This operation suffers from the same inconvenience as above: it is difficult to give an empirical explanation to these artificial groups and find any trace of their real existence in empiria. Besides, the dispersion of cases along each factor nearly always follows the normal distribution of Gauss, which means that the majority of cases are quite near the middle point and the researcher cannot find any distinct division into groups. This researcher's trap is discussed also on the page Classification.

August 3, 2007.


Original location: http://www2.uiah.fi/projects/metodi