Once the results are obtained, before reporting them, the researcher should dedicate some time to a special phase of study: evaluation. He should assess the results: are they what he wanted them to be, is it possible to improve something in the report, should something be left out or is the report worth publishing at all?
The normal method when assessing a project is to compare the results to the initial objectives. The target of descriptive study is to find out how things about the object of study are (or how they have been, when studying the past). The first task in the evaluation is thus to inspect whether the desired pieces of information exist in the report. This is normally a trivial task.
The second and much more difficult question concerns the reliability of the findings: how high is the risk of them being false, or how large is their probable error.
In normative research evaluation of the final report is often relatively simple: you (and the other stakeholders in the project) read the final proposals and then everyone declares whether they agree, or not. The situation is different in descriptive research. Just looking at the summary does not suffice for assessing the value of the work - the results normally seem absolutely reliable and accurate. Only seldom is there another reliable source that you could compare the findings with. Normally the only way to assess the reliability is to go deeper into the report and inspect all the tasks that the researcher has executed before arriving at the final results. These evaluations are discussed below under the titles Assessing Theoretical Input, Assessing Input Data and Assessing Correctness of Analysis.
Only after the process of the project has been approved in the above inspection is it the right moment to take its reported findings seriously and start evaluating them directly. The two most usual points of view in this examination are Assessing Theoretical Output and Assessing Practical Consequences. In these final evaluations you do not need to care much about the initial targets of the project - it is normal that a project attains more (or less) than was planned.
"Garbage in, garbage out." In other words, a research project cannot give meaningful output if there are controversies or absurdities in the theoretical models that have been used as starting points in defining the problem or in selecting the phenomena to be observed. Of course, these questions should have been put in order already before starting to collect data, but the sad truth is that sometimes the researcher knows quite little about the problem initially, and only the study itself can make him competent in judging relevant theoretical models. Anyway, better late than never.
A symptom which can arouse suspicions about weaknesses in the theory that you have used as basis for your work is a large number of anomalies, empirical cases that do not fit into the model, i.e. the model cannot explain them. These can have appeared in earlier projects or they turn up when you collect your new data.
Surprising final results in your study or in its application, should any turn up, could be another indication of a deficient input model. The methods of assessing either theoretical or practical output of a research project are explained later.
In the history of many branches of science it has now and then happened that an important theory has been superseded by a new one, and a great number of studies that have relied on the old theory have become just historical curiosities. In the research of products this thing occurred, albeit gradually, during the 19th century, when Baumgarten's theory of 'beauty' as a process of subjective perception replaced Plato's doctrine on beauty as a property of objects. These so-called scientific revolutions, however, happen so infrequently that normally researchers cannot be expected to take them into account when evaluating their material.
The procedures of acquiring data for research consist usually of three distinct operations which should be evaluated separately:
The researcher will have to consider the delimitation of his study in several phases of the project and in several locations of its report. Typical such situations are:
Consistency of demarcation means simply that during the project all the definitions for the population of study must be identical or at least compatible. Failing this, the logical structure of the study runs a risk of collapsing.
In itself, there are no "right" or "wrong" demarcations - the researcher has the right to choose whichever he feels useful or interesting. An appropriate criterion, instead, is rationality of the demarcation. In studies which include practical application a serviceable population is often those people who will benefit from the project, for example the target customers that have been defined for the project.
As a contrast, in theoretical basic research you would often want to use a wide demarcation which, for example, includes all historical periods or all comparable cases in the universe. However, very wide delimitations often bring about difficulties when designing a sample to be studied or when recording data (see below), and these complications in turn can damage the credibility of the results.
When you have gathered the results from a sample from a population, you will often wish to generalize your results. Generalizing means that you declare that the results are true not only in the sample but also concerning the population. Is it possible to evaluate the credibility of such a declaration?
The crucial question in the evaluation is whether the sample deviates from the population in relevant issues. Relevant are the facts that are mentioned in the goals of the project and which you want to record from the sample.
If the sample is non-random there is always the risk that the method of sampling has been somehow biased and the sample therefore contains systematic error which is often quite difficult to detect without studying the entire population. One method is to study whatever material can be found about the population, such as public files on the demographic, age or sex structure of the population, and compare these figures to the existing sample. If the sample deviates from the population in these respects, you can suspect deviations also in the variables that are important in your project.
To assist the consideration, you might calculate the contingency or correlation between the deviating demographic variable and your "relevant" variables (if they are numerical). For example, if the sex distribution of your sample is unequal to the sex distribution of the population, you calculate the contingency between sex and your "relevant" variables in the sample. High contingency means that the bias in your sample will probably affect the results, too.
Another method for evaluating the representativeness of a non-random sample would be to investigate another sample drawn from the same population with another method of sampling.
If you have gathered your empirical results from a properly made random sample, the difference between sample and population cannot be due to bias. However, there is normally more or less difference that has been caused by chance when picking out the sample. You will often want to evaluate how large this difference is, and indeed its probable value can be calculated. Two usual methods for it are:
Finding the confidence interval. Note that despite the promising name of "margin of error", this method only measures the difference between population and sample, exactly as all the other procedures of studying statistical significance. It ignores all errors in registering facts, and likewise all variation that may later occur in the object of study, for example that customers do not in reality behave or vote as they have said in a survey.
When you have a random sample and you have measured or calculated from it a statistic such as a mean or a percentage, it is usually possible to calculate the confidence interval, i.e. the range of values around the sample result which will include the corresponding value of the population with a given probability that you may choose. If you select 95% probability, it means that there is a 5% risk that the statistic in the real population is outside of this range.
The formula for calculating the margin of error m - i.e. half of the confidence interval - with a risk of 5%, for a simple variable or for a mean is:
$$ m = 1.96 \cdot \frac{s}{\sqrt{n}} $$
where
s = standard deviation of the population
n = sample size
The diagram below on the left shows the dispersion of a certain variable in the population P, and also in two random samples from the population. The researcher is interested in the mean of this variable in the population P. Assuming that the dispersion of this variable in the population is not too far from normal and the population is not smaller than about one hundred, we can calculate the margin of error (m) which defines the limits of the confidence interval which includes 95% of the means of all the random samples from this population. The random samples R1 and R2 have here been drawn at the two extremes of the confidence interval for the mean.
In the diagram on the right the reasoning goes into the opposite direction in order to solve a problem common in empirical research. Here you only have one random sample, and you would like to know the mean of the population from which the sample originates. For getting the confidence interval where this mean will reside, you can use the same formula as above, though there is the difficulty that now you do not know the standard deviation in the population. However, you can use as a substitute the standard deviation of the sample, which is usually nearly the same.
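The calculation for a mean can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `margin_of_error_mean` and the sample values are invented here, and the standard deviation of the sample is used as a substitute for that of the population, as described above.

```python
import math
import statistics


def margin_of_error_mean(sample, coefficient=1.96):
    """Margin of error of a mean: coefficient * s / sqrt(n).

    s is estimated with the standard deviation of the sample itself.
    """
    s = statistics.stdev(sample)
    n = len(sample)
    return coefficient * s / math.sqrt(n)


# Hypothetical measurements, invented for illustration only
sample = [12.1, 11.4, 13.0, 12.6, 11.9, 12.3, 12.8, 11.7, 12.2, 12.5]
mean = statistics.mean(sample)
m = margin_of_error_mean(sample)
print(f"mean = {mean:.2f}, 95% confidence interval = {mean - m:.2f} ... {mean + m:.2f}")
```

The interval is reported as the sample mean minus and plus the margin of error.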
The formulas for calculating the margin of error are a little different for various statistics. The formula for handling a percentage is:
$$ m = 1.96 \cdot \sqrt{\frac{p\,(100 - p)}{n}} $$
where
p = percentage calculated from a sample (for example, the percentage of customers that are satisfied)
n = sample size.
In both cases the coefficient, here 1.96, depends on the desired probability, for example for a risk of 1% it would be 2.58 and for a 10% risk 1.64.
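As a rough sketch, the coefficients quoted above can be derived from the normal distribution, and the margin of error of a percentage computed with them. The function names and the survey figures (86% of 400 respondents) are assumptions made for this illustration only.

```python
import math
from statistics import NormalDist


def coefficient(risk):
    """Two-sided coefficient of the normal distribution for a given risk."""
    return NormalDist().inv_cdf(1 - risk / 2)


def margin_of_error_percentage(p, n, risk=0.05):
    """Margin of error, in percentage points, of a percentage p from n cases."""
    return coefficient(risk) * math.sqrt(p * (100 - p) / n)


for risk in (0.10, 0.05, 0.01):
    print(f"risk {risk:.0%}: coefficient {coefficient(risk):.2f}")

# Hypothetical survey: 86% of 400 respondents were satisfied
m = margin_of_error_percentage(86, 400)
print(f"86% plus or minus {m:.1f} percentage points")
```

Choosing a smaller risk widens the interval, because the coefficient grows.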
Calculating statistical significance. The basis of probability calculus is the same as in the margin of error method, but here the level of probability and risk are not regarded as constants. Instead, the target is to find how probable it is that the results from the sample are true also in the original population. For this examination, there are methods called statistical tests.
When you have analyzed data from a sample and obtained the results, statistical tests help you to choose between two alternative explanations for these results:
Now it is possible to calculate the probability of getting, by chance only, certain results from the sample. If this probability is very small, e.g. less than 0.1%, you have good reasons to reject the null hypothesis and believe that the same results are true in the population. Such results are called statistically highly significant.
However, if the probability of receiving the result by chance alone is large, say over 5%, you should not assert that your results are necessarily true in the population. In this case your results are called statistically not significant.
The customarily used significance levels of research findings are given below. The percentages mean the probability of getting the result in the sample by chance alone, even when the result is not true in the population.
The abbreviation is used so that you place one or more stars after the research finding you have tested.
The appellations for the significance levels vary from country to country. That is why, to be on the safe side, it is better to write in the research report that (e.g.) "the result is significant on the level of 5%", meaning that the probability of the result being produced by accident is under 5%.
How significant a result should then be achieved in a study? Generally, the purpose of statistical testing is to help the researcher to avoid a "type 1 error", wrongly accepting the research hypothesis and discarding the null hypothesis, while in reality the research hypothesis is untrue. In spite of this risk, the researcher should not set the significance target unnecessarily high, because then there is the threat of so-called "type 2 error", in which the researcher accepts the null hypothesis and wrongly discards the research hypothesis although it is true in reality.
In practice, the significance of a study usually depends on what kind of data the researcher has managed to gather. A research report is often deemed fit for being made public if at least in some of the studied questions the level of at least 5 % significance is reached.
There is no universal formula for statistical testing. Instead, there are a number of special tests for almost every different type of statistic (for the mean, the variance, etc.). However, the general procedure in the tests is always the same:
|DATA TO BE TESTED:||SUITABLE TEST:|
|Statistics which describe one arithmetic variable (e.g. mean):||t-test|
|Variables measured on a nominal scale:||Cochran Q test|
|Variables measured on an ordinal scale:||Wilcoxon test|
|Correlation (arithmetic scale):||t-test|
|Dissimilarity of groups:||Analysis of variance|
Chi test can be used to assess how the objects or subjects in a random sample are distributed into classes.
An invented example:
A manufacturer sells bathroom water faucets in Finland. These can be either
chrome plated or gilded. The business will soon start marketing these products in
Germany and needs to know if German customers are relatively more interested in
gilded ones than Finnish customers.
A questionnaire has been sent to 150 randomly chosen Germans and as many Finns. 100 questionnaires were not returned. The obtained 200 responses were distributed as the T marked numbers indicate:
|.||Prefers chrome finish||Prefers gold finish||Total|
|Finns||T = 50||T = 40||90|
|Germans||T = 50||T = 60||110|
In this (invented) case, the majority of Germans preferred gilded taps while the
majority of Finns preferred chromed ones. A question now arises: Is this difference
valid among all Germans and Finns, or is it possible that such a divergence of the
samples was caused by chance only? We must consider the fact that we received
only 200 responses, and it is quite possible that accidentally we have, in such a small
group, several people that are not typical in their opinions.
The probability of such an accidental result can be calculated with the Chi test.
To accomplish the test, we first want to investigate how these 200 responses
would most probably be distributed, if there was no difference between the two
populations; in other words, if all the Germans and all the Finns had identical views on
faucet finishes. This hypothetical distribution is called the expected
distribution. In our example it would be as follows. (Class frequencies in this
distribution are marked with the letter V):
|.||Prefers chrome finish||Prefers gold finish||Total|
|Finns||V = 45||V = 45||90|
|Germans||V = 55||V = 55||110|
Now we need to construct a measure to indicate how much discrepancy there is between the real distribution and the expected distribution. This measure is called Chi squared, and it is calculated as follows:
$$ \chi^2 = \sum \frac{(T - V)^2}{V} $$
($\chi^2$ being the discrepancy, T the real frequency and V the expected frequency of those preferring either chrome or gold, and $\sum$ meaning addition)
In the formula we have to substitute for T subsequently each T value in the "real
distribution" table, and likewise, for V in turn, each value of V in the "expected
distribution" table.
In our example, the Chi squared gets the following value:
$$ \chi^2 = \frac{(50-45)^2}{45} + \frac{(40-45)^2}{45} + \frac{(50-55)^2}{55} + \frac{(60-55)^2}{55} = 0.56 + 0.56 + 0.45 + 0.45 = 2.02 $$
The next step is to figure out the probability of getting, by accident only, the above
result (or, which is the same thing, the earlier mentioned divergence between
Germans and Finns).
We do not need to calculate this probability, as it is already displayed, for a great number of Chi values, in statistics handbooks. These tables tell us, e.g., that there is a 5% probability of obtaining a Chi squared value of 3.84 or more for a pair of four-cell tables, when chance only is operating and there is no difference between the populations.
In our example, we obtained a Chi squared value of 2.02, which is less than 3.84. This means that the probability of getting such a Chi squared value by chance is more than 5%. In other words, the results of our questionnaire are statistically not significant. Thus, our questionnaire does not allow us to make any assertions on differences between Germans and Finns.
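The faucet example can be retraced with a short Python sketch. It is an illustration only: the function name `chi_squared` is an assumption, and the closed-form probability on the last lines applies only to four-cell tables with one degree of freedom.

```python
import math


def chi_squared(observed):
    """Chi squared of a two-way table: the sum of (T - V)^2 / V over all cells."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, t in enumerate(row):
            # Expected frequency V if both populations had identical views
            v = row_totals[i] * col_totals[j] / grand_total
            chi2 += (t - v) ** 2 / v
    return chi2


# Real distribution: rows = Finns, Germans; columns = chrome, gold
responses = [[50, 40], [50, 60]]
chi2 = chi_squared(responses)

# Probability of reaching this Chi squared by chance (one degree of freedom only)
p_chance = math.erfc(math.sqrt(chi2 / 2))

print(f"Chi squared = {chi2:.2f}, probability by chance = {p_chance:.3f}")
print("significant on the 5% level" if chi2 > 3.84 else "not significant on the 5% level")
```

The expected frequencies are derived from the row and column totals, so the sketch reproduces the V values of the table above without them being typed in.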
When the Chi test is applied to distributions that consist of more than four classes,
Chi squared is likely to become larger for the simple reason that the formula of Chi
squared will then include more than four terms which are going to be added together.
To neutralize this increase, the Chi test requires you to give a measure of the extent
of your table. The measure for this has a peculiar name, degree of freedom.
The name indicates the number of those cells in your table which might be able to
change when the total number of cases is constant.
For example, if you are studying distributions where exactly one hundred people are classified in six groups, the degree of freedom in any of those distributions is five. The explanation for this is that five cells of the six are always free to receive any number of subjects (between 0 and 100), but after the five cells have received their contents the sixth cell has no more freedom to change; it will be determined by the total of 100.
A table of 2 x 2 cells has a degree of freedom of exactly one: if any one of the cells changed, all the other cells would have to change accordingly; they have collectively only a single degree of freedom.
In the table below are given the degrees of freedom of some small tables.
|Size of the table||Degree of freedom f|
|2 x 2||1|
|2 x 3||2|
|2 x 4||3|
|2 x 5||4|
|3 x 3||4|
|3 x 4||6|
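The values in the table follow a simple rule for two-way tables: the number of rows minus one, multiplied by the number of columns minus one. A minimal sketch, with a function name chosen here for illustration:

```python
def degrees_of_freedom(rows, columns):
    """Degrees of freedom of a two-way table whose row and column totals are fixed."""
    return (rows - 1) * (columns - 1)


for size in [(2, 2), (2, 3), (2, 4), (2, 5), (3, 3), (3, 4)]:
    print(f"{size[0]} x {size[1]}: {degrees_of_freedom(*size)}")
```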
The following table gives the values which Chi squared
reaches under the influence of chance only, with a probability of either 5%, 1% or
0.1%. This table includes only small distributions with a degree of freedom of up to 6.
You can find larger tables in the handbooks.
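|Degree of freedom||Probability|
|f||5 %||1 %||0.1 %|
|1||3.841||6.635||10.828|
|2||5.991||9.210||13.816|
|3||7.815||11.345||16.266|
|4||9.488||13.277||18.467|
|5||11.070||15.086||20.515|
|6||12.592||16.812||22.458|

The following table belongs to an example of the Cochran Q test, in which each of twelve subjects gave a two-valued judgment (1 or 0) on four types of seating.
|1||Subject:||1||2||3||4||5||6||7||8||9||10||11||12||Total||Square of the total|
|2||Seating of type I:||1||1||1||1||1||1||1||0||0||1||0||0||8||64|
|3||Seating of type II:||1||1||1||1||0||1||1||0||1||1||1||1||10||100|
|4||Seating of type III:||0||0||1||1||0||1||1||0||0||1||0||0||5||25|
|5||Seating of type IV:||0||1||1||1||0||1||1||0||1||1||1||1||9||81|
|6||Totals by each subject = SH||2||3||4||4||1||4||4||0||2||4||2||2||Total = 32|
|7||Squares of the preceding row = SH2||4||9||16||16||1||16||16||0||4||16||4||4||Total = 106|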
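In the next step, the table values are placed in the following formula, which then gives the value of Q:
$$ Q = \frac{(k-1)\left[\,k \sum ST^2 - \left(\sum ST\right)^2\right]}{k \sum SH - \sum SH^2} $$
Here ST denotes the total of each seating type (the totals at the end of rows 2 to 5), SH the total of each subject (row 6) and SH2 its square (row 7). All the parameters in this formula are found in the table except k, which is the number of alternatives (here = 4).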
The value of Q becomes greater if there is statistical association between the variables. If there is no association and only chance is operating, Q reaches exactly the same values as Chi squared. This means that when assessing the value of Q that we have received in a test, we can use the earlier shown Chi squared table.
In a distribution table of the above type, the degree of freedom equals k - 1; in the
example above it is 3.
In the above example, the obtained value of Q was 7.63. By consulting the table of Chi squared we find that the above results would be significant only when the Q had a value of at least 7.815. The Cochran Q test thus showed that the empirically found difference between the seat belt alternatives was statistically not significant.
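The value of Q can be checked with a short sketch. It uses the usual textbook form of Cochran's Q, Q = (k - 1)(k ΣG² - (ΣG)²) / (k ΣL - ΣL²), where G are the totals of the alternatives and L the totals of the subjects; the function and variable names are assumptions made for this illustration.

```python
def cochran_q(table):
    """Cochran's Q for a table with one row per alternative and one 0/1 column per subject."""
    k = len(table)                                         # number of alternatives
    alternative_totals = [sum(row) for row in table]
    subject_totals = [sum(col) for col in zip(*table)]
    numerator = (k - 1) * (
        k * sum(t * t for t in alternative_totals) - sum(alternative_totals) ** 2
    )
    denominator = k * sum(subject_totals) - sum(s * s for s in subject_totals)
    return numerator / denominator


# Rows 2 to 5 of the table: seating types I to IV, twelve subjects each
seating = [
    [1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1],
    [0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1],
]
q = cochran_q(seating)
limit_5_percent = 7.815  # Chi squared limit quoted in the text for 3 degrees of freedom

print(f"Q = {q:.3f}")
print("significant on the 5% level" if q >= limit_5_percent else "not significant on the 5% level")
```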
With the Wilcoxon test we can test the significance of preferences between two (or more) alternatives. The scale in giving preferences is ordinal.
In his study Seat and Seat-Belt Comfort in Heavy Commercial Vehicles in
Finland, Raimo Nikkanen wanted to compare two different truck seats. Twelve
subjects tried both seats and rated their comfort on a customary 7-point scale. The
obtained ratings are given on the rows 2 and 3 in the following table.
Row 5, Rank order of the differences (ignoring the sign) means that the smallest difference on this row receives the value 1, the second smallest value 2 etc. The cases where there is no difference are dropped out completely. The value 6.5 means that the differences ranked as sixth and seventh were equal; so both of them were given the same rank of 6.5.
Row 6, The cases of the less frequent sign means that we pick from row 5 those cases where the sign of the difference was of the less common type. In this example there was a minority of negative differences. From those cases we just pick their ranks, from row 5. In the example, the only number to be picked is 3.
|1||Subject:||1||2||3||4||5||6||7||8||9||10||11||12|
|2||Comfort of the standard seat:||3||5||5||6||5||5||3||5||4||5||3||3|
|3||Comfort of the anti-vibration seat:||2||3||2||2||2||2||2||3||3||2||2||4|
|4||Difference of the above values:||1||2||3||4||3||3||1||2||1||3||1||-1|
|5||Rank order of the differences (ignoring the sign):||3||6.5||9.5||12||9.5||9.5||3||6.5||3||9.5||3||3|
|6||The cases of the less frequent sign:||-||-||-||-||-||-||-||-||-||-||-||3|
In the next step, we add all the numbers on row 6, ignoring their signs. We call
this sum T. In the example, it will be 3.
Now it is time to look up the Wilcoxon test table in your handbook. The rows in the table are for different numbers of pairs (= N, here 12). In the correct row, we find Tmax, the value which T must not exceed, or the results will not be significant on the 5% level. Below, there is a portion of the table.
In our case, the empirical value (3) is clearly under the permissible value, and we may thus call our results significant. It would be highly improbable to get such a large difference in preferences by chance only.
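The ranking procedure of the seat example can be sketched as follows. The function name is an assumption, and the sketch takes the smaller of the two rank sums as T, which in this example coincides with the sum over the less frequent sign. The limit Tmax still has to be looked up in a handbook table.

```python
def wilcoxon_t(first, second):
    """Return (T, N) for paired ratings: the smaller rank sum and the number of non-zero pairs."""
    # Pairs without any difference are dropped completely
    diffs = [a - b for a, b in zip(first, second) if a != b]
    ordered = sorted(abs(d) for d in diffs)

    def rank(value):
        # Equal differences share the mean of the rank positions they occupy
        positions = [i + 1 for i, x in enumerate(ordered) if x == value]
        return sum(positions) / len(positions)

    positive = sum(rank(abs(d)) for d in diffs if d > 0)
    negative = sum(rank(abs(d)) for d in diffs if d < 0)
    return min(positive, negative), len(diffs)


# Rows 2 and 3 of the table
standard_seat = [3, 5, 5, 6, 5, 5, 3, 5, 4, 5, 3, 3]
anti_vibration_seat = [2, 3, 2, 2, 2, 2, 2, 3, 3, 2, 2, 4]

t, n = wilcoxon_t(standard_seat, anti_vibration_seat)
print(f"T = {t}, N = {n}")
```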
Above, we have tested distributions and measurements on an ordinal scale. These types of tests are well explained in Sidney Siegel: Nonparametric statistics for the behavioral sciences.
The effect of chance must always be considered when we calculate a descriptive statistic, e.g. a mean or a correlation. These statistics always come out from the algorithm apparently exact and dependable, even in cases in which the underlying material is both scarce and unreliable.
This can be illustrated by an example (invented, on the right) where two variables measured from four randomly chosen students in the UIAH university are marked with red dots.
When looking at the scattergram, you might think that there is an association between the two variables. It seems that the higher the class of the student is, the more he or she weighs. The normal method of measuring the strength of such an association is to calculate the correlation between these two variables. In this case its value would be .956, which is very high for a correlation.
A high correlation has here little importance, however, because the sample has been very small, only four cases. The probability of getting by chance a high correlation from so small a sample is quite large. It is easy to calculate this probability, the value of which depends on the size of the random sample, and you can find typical values for it in any handbook of statistical analysis. A portion of such a table is presented below.
|If the correlation is at least:|
|...||... then the correlation is significant on the 5% level.||... then the correlation is significant on the 1% level.|
The table indicates that when a correlation of .956 has been obtained from only four pairs of measurements it is significant only on the 5% level. In other words, if you have studied twenty samples of this size, one of them probably shows such a correlation even in the case that there is no such association of variables in the population.
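The effect of chance on so small a sample can also be made visible by simulation. The sketch below draws many random samples of four pairs from two unrelated, normally distributed variables and counts how often the correlation reaches .956 in either direction. The function names, the number of trials and the random seed are assumptions chosen for this illustration.

```python
import math
import random


def correlation(xs, ys):
    """Product-moment correlation of two equally long lists of numbers."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def chance_probability(r_observed, n_pairs, trials=100_000, seed=1):
    """Share of random samples whose correlation is at least r_observed in absolute value."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n_pairs)]
        ys = [rng.gauss(0, 1) for _ in range(n_pairs)]  # unrelated to xs
        if abs(correlation(xs, ys)) >= r_observed:
            hits += 1
    return hits / trials


p = chance_probability(0.956, 4)
print(f"probability of a correlation of .956 or more by chance: {p:.3f}")
```

Running the same function with a larger `n_pairs` shows how quickly this chance probability falls as the sample grows.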
The more measurements we have, the lower correlation is significant. Sometimes the intended use of the research project dictates how high a significance must be obtained, and the method of improving it then is to use a larger sample.
The above table is a simple example of a t-test. In addition to the correlation coefficient, the t-test can be used to estimate the significance of many other statistical parameters. In most cases, however, the values for the parameter t cannot be obtained directly from a table as above but they have to be calculated with special formulas which are unfortunately all different.
The t-test can also be used to estimate the significance of the difference between the parameters obtained from two different samples, i.e., how probable it is that also the corresponding populations differ from each other in a corresponding way. The principle of the test is always the same. One limitation of the t-test is that it can be used to estimate only one or two parameters at a time.
The ANOVA method is based on the mathematically proven fact that there is a difference between the groups only if the between-groups variance is greater than the within-group variance.
The analysis is initiated by calculating the within-group variance for each group, and the mean of all these group variances.
The next step is to calculate the mean for each group, and then the variance of these means. Multiplied by the number of cases in each group (assuming that the groups are of equal size), it is the between-groups variance.
Then you calculate the ratio of the above two figures, which is called F. In other words,
F = (number of cases in a group x variance of the group means) / (mean of the group variances).
Finally you refer to the table (in statistics handbooks) which shows how high values the coefficient F may reach when only chance operates. If the F received from the ANOVA is higher than the table value, there is a difference between the groups which is as significant as the table reports.
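The steps above can be sketched for groups of equal size. The function name and the three groups of measurements are invented for this illustration. The sketch multiplies the variance of the group means by the number of cases in each group, which is the scaling that the usual F tables presuppose.

```python
import statistics


def anova_f(groups):
    """F ratio of a one-way analysis of variance for groups of equal size."""
    n = len(groups[0])
    if any(len(group) != n for group in groups):
        raise ValueError("this sketch assumes groups of equal size")
    # Within-group variance: the mean of the variances of the groups
    within = statistics.mean([statistics.variance(group) for group in groups])
    # Between-groups variance: group size times the variance of the group means
    between = n * statistics.variance([statistics.mean(group) for group in groups])
    return between / within


# Hypothetical measurements from three groups, invented for illustration only
groups = [
    [4.0, 5.0, 6.0, 5.0],
    [6.0, 7.0, 8.0, 7.0],
    [9.0, 8.0, 10.0, 9.0],
]
f_ratio = anova_f(groups)
print(f"F = {f_ratio:.2f}")
```

The resulting F is then compared with the handbook table, using the number of groups minus one and the total number of cases minus the number of groups as the two degrees of freedom.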
Empirical observations, which give the basis for all the results in a research project, are mostly presented as seemingly exact assertions: "86% of the customers were satisfied"; "The weight of the telephone was 85 g". Now, it is possible that among the many measurements made in a study there are some which are not true, or are only approximately true, so there should be a method of assessing the factuality of the collected information.
In the course of time, researchers have interpreted the words "factual" and "true" in a number of ways, most of which state essentially that the assertion must correspond with reality, but not necessarily with the theory or the authorities in the field of study, for example.
The difficulty is that an assertion resides in the world of concepts and theories, not
in the world of the empirical things it tells about. Its definition purports to serve as a bridge between these two
worlds; however, it is often difficult to construct a perfectly valid method of observation or measurement that would
record exactly the desired theoretical concept and not another nearby concept.
Besides, there is always the risk of deficient reliability, which leads to errors in measurements. Moreover, it is often impossible to reach or record some cases or specimens in the group that should be studied.
Because of all these obstacles, researchers now agree that in the factual sciences you can never reach 100% certainty of any assertion's correspondence with empirical reality. The case is different in formal sciences such as mathematics where you can say that the area of a circle is exactly pi times radius squared. But this is only true if you are not describing an empirical circle because, if you measure a real circle, you will probably find that the last decimals in your measurements are wrong.
Thus, we have to accept the fact that we can never have absolute faith in empirical observations. However, even when we know that the results contain small errors, they can still be quite useful for many practical purposes. For example, if we believe that 99% of all the measurements are correct, we can often take the risk of using them.
Consequently, we need methods to measure and evaluate the reliability and credibility of observations. Some such methods are listed below, and they are somewhat different depending on the nature of the observations (qualitative or quantitative).
1. Quantitative Observations. The dispersion of data, which can be measured with e.g. their variance, often gives a good indication of their reliability as well. See also the discussion on the errors in measurement. The powerful methods of statistical testing of observations are discussed below.
2. Qualitative Observations. An examination of their credibility might include some of the following questions.
Ethics of data gathering. If there are people (other than the researchers) or animals involved in the data gathering situation these must not be exposed to excessive stress or inconvenience. There is a separate page on the Ethical Considerations in research.
Evaluating the correctness of analysis is important especially in basic research which aims at enlarging theoretical knowledge, because the primary targets for evaluation, the research findings, are often difficult to evaluate directly. In development projects you often focus the assessment predominantly on the practical outcome of the project, and only if this assessment does not give an unequivocal conclusion do you start scrutinizing also the analysis methods that have been used.
The correctness of analysis can be assessed by studying the following questions:
A third possible explanation for anomalies is simply that the phenomenon was a weak one and it had many irregularities, which perhaps can diminish the practical value of your work. According to some philosophers, even a single case where your hypothesis has been found false would render useless the entire hypothesis. Others say that a small percentage of anomalies are acceptable, because other researchers can perhaps later explain them. In order to help the work of later researchers of the topic you should give full details about the anomalies in your report.
By definition, theoretical results, i.e. the benefits of a research project to its branch of science, are a central issue in projects of descriptive research, and to some extent also in normative research. The benefits can be of three kinds:
Note that the above benefits are possible only on the condition that the new research project has unequivocal connections to earlier theory on the appropriate field of research. For this reason it is very important that you should use such definitions which are as far as possible identical or similar to those used in earlier research. Then it will be easy for you (and for your public) to estimate whether your results are coherent with earlier theory or not. Note that coherence is not a goal in itself -- it just indicates that your project is either enlarging our prevailing knowledge or it is connecting previously detached pieces of existing theory to one larger theory (alternatives 1 and 2 of the list above).
The third alternative in the list, disagreement with earlier reports, means that either these earlier reports or your new findings are faulty. If you find yourself facing such a situation it is best to start by verifying your findings once more and being prepared to defend your work against heavy attacks. The reason is that in most fields of science the influential persons tend to value the existing mosaic of theory very highly, even when they know that it may contain some weaknesses. Richard Milton has given in the book Forbidden Science (1994) many spectacular examples of this phenomenon which is based on natural psycho-sociological mechanisms of the human teams working in scientific institutions. Pierre Bourdieu has also shed some light on them in the book Homo Academicus (1988).
Despite some notorious historical refutations of courageous proposals (like Galileo's) it is clear that truthfulness and reliability of published reports must be guarded in any branch of science because progress of science would be impossible if researchers could not rely on the results of their earlier colleagues. That is why all modern scientific communities use certain conventional procedures in order to verify the factuality of published findings. These often include some of the following events, where quite a number of fellow researchers participate:
While none of the above mentioned people will necessarily find every fault in your report, your findings will, however, gain credibility during the process. Colleagues and professionals will gradually start regarding your findings as a reliable basis for their own work, in the same way as they use their own observations, tables that are found in handbooks, or anything that everybody accepts as factual and true. This gradual process is sometimes called the "consensus" method of confirmation. While it can give no absolute certainty, in the practice of science it works quite effectively. As Karl R. Popper, in The Logic of Scientific Discovery, 1959, p. 111, put it:
"Science does not rest upon rock-bottom. The bold structure of its theories rises, as it were, above a swamp. It is like a building erected on piles. The piles are driven down from above into the swamp, but not down to any natural or 'given' base; and when we cease our attempts to drive our piles into a deeper layer, it is not because we have reached firm ground. We simply stop when we are satisfied that they are firm enough to carry the structure, at least for the time being".
The theoretical value of research findings will thus eventually receive its final evaluation from other people than the researcher himself.
Many descriptive projects aim at finding knowledge for a practical purpose, though by definition a descriptive project does not go so far as developing proposals for changing things in practice, as does the normative approach.
The normal method for assessing the practical success of a project is to compare its results to the initial targets of the project. Besides these intended results, it often happens that additional, unexpected areas for applying the findings have turned up during the project. In any case, the closing chapter of the research report is the right place for the researcher's own assessment of all these possible practical benefits (and inconveniences, if any) from the project. Because of the great variation of these benefits it is difficult to name any methods or give checklists for the work, but some ideas for it can perhaps be found from the pages Evaluating Normative Proposals or Ethics of Application.
August 3, 2007.
Original location: http://www2.uiah.fi/projects/metodi