Evaluating Rating Scales for Sensory Testing with Children

Sensory testing with children is becoming increasingly important to the food industry, but little research on appropriate methodology has been conducted.
Beverley J. Kroll

AS THE NUMBER of food products aimed at the children's market increases and the role of children in purchase decisions expands, sensory testing with children becomes increasingly important to the food processing industry. However, sensory research has not kept pace with this need.

Testing with children is in an embryonic stage. Over the years, a few sensory researchers have considered the problems involved in applying their science to this special population, but for the most part the field has been static. The need for serious investigation is pointed up by how little research has been done in this area.

As a way of focusing on the specific needs for this kind of research, a thumbnail sketch of certain key questions the literature considers is presented in the box on p. 80.

One thing is very noticeable, not only in the literature but also in word-of-mouth, unpublished material about children's testing. The methods used have been intuitive, even granted that the investigator may have had a rationale. Once a method has been selected, there has been no serious investigation of possible alternatives. It is as if the researchers said, "We planned this, we tried it, it seemed to work, and there was no time to bother with what might have worked better."

We therefore undertook a basic research project designed to help establish a solid foundation for future investigations. This article describes the procedures, analysis, and conclusions of research intended to evaluate the relative merit of rating scales that might be used when testing with children. In this study, we used two methods of questioning – one-on-one interviewing (Fig. 1) and self-administered questionnaire (Fig. 2) – and three types of rating scale (Fig. 3).


Fig. 1 -- Children ages 5-7 and 8-10 were tested using one-on-one interviews.

Fig. 2 -- Children ages 8-10 were also tested using self-administered questionnaires in standard sensory testing booths.

Fig. 3 -- Three types of rating scale were used: the traditional hedonic scale, the P&K scale developed for this study, and the typical face scale. After testing, scale values of 1 to 9 were assigned (starting with 1 at the top) for the purposes of analysis.

The author is President, Peryam & Kroll, Marketing and Sensory Research, 6323 N. Avondale Ave., Suite 121, Chicago, IL 60631.

Variables Selected

A great many variables could be considered. Hence, it was necessary to be selective and try to choose the more important ones.

Test Products. The test product was not really a source of variation, but remained constant throughout the main series of experiments. We settled on a sweetness difference in an orange drink. One can reliably predict that children will like a sweeter drink, at least within the normal range. This proved to be the case. Preliminary testing of drinks with various sweetness differences indicated the adjustments needed. For example, a drink sweetened with the recommended amount of sugar compared to one made with only 50% of that amount produced highly significant differences no matter what rating scale was used. What was needed was a difference that was definite but not overwhelming, so that the possible effects of the variations of interest could emerge. The final choice was an orange-flavored drink sweetened with the recommended amount of sugar, compared to a drink with 80% of that amount.

Scale Type. Differences in scale type were the main issue addressed in these experiments. After preliminary work with older children, we concentrated on three scale types (Fig. 3) – the standard hedonic scale with the usual verbal categories, a pictorial or face scale, and a child-oriented verbal scale we developed. Over the years, researchers have investigated test language suitable for children. After reviewing child-oriented word scales designed by others, we decided to develop our own scale, with more nearly equal intervals (although exact equality probably cannot be achieved with scales of this type). The result was dubbed the Peryam & Kroll or P&K scale.

It was imperative that the study include a picture scale. Testing with children is overrun with picture scales, the rationale being that younger people may not understand words and phrases but can more accurately deal with facial expressions. Besides, pictures are entertaining and should inspire closer attention to the task.

There are many such caricature scales around, but all have the same general characteristics, representing degrees of pleasantness ranging from high to low. The question is how well successive pictures communicate the basic idea. Some preliminary work was done with a scale from an earlier published study, which used the Snoopy cartoon character, but the results were disappointing. Scales using children's faces with variations in degree of detail were also tried. Eventually a series of simplified people faces was selected as probably best and certainly representative.

Scale Length. There is a school of thought, bolstered by intuition, that longer scales tend to create confusion because there are lots of words to understand and choices to make. The implication is that this problem should be more serious with younger children. On the other hand, there is evidence that longer scales can be more discriminating and produce more reliable results.

Certainly, this factor was of enough importance to be included in the study. Starting with the frequently used 9 points, how far down should one go? To 7 points? 5? 3? Or even to just 2 points, which would be paired comparison? The study addressed this variable in subdued fashion by trying 7 points, using the same three scale types as before but eliminating one good category and one bad category from each scale.

Age. For what ages might special techniques be required? Our initial work was with children over 10 years of age, most of whom seemed to handle self-administered questionnaires fairly well, with no problems that are not encountered to some degree with adults. To address the real issue, therefore, we defined two age groups based on suppositions about ability to handle verbal input: the preliterate, ages 5 – 7, where most can be expected to read very little if at all and not understand big words; and the semiliterate, ages 8 – 10, where most can read at some level but still may not understand words such as "extremely" or "moderately." No attempt was made to extend the investigation to preschoolers.

Mode of Presentation. Most of the experiments employed a straightforward approach, where the successive categories were read one after another, always starting at the good end.

Another approach sometimes used by investigators is what may be called "bifurcated" – the interviewer first asks the subject to place the stimulus into either the good/ like or the bad/dislike category, then tries to get the child to scale degree of like or dislike by presenting the successive categories. The categories were presented starting in the middle and proceeding to the ends. This seemed logical, but that could be open to debate. If the subject failed to make a choice in response to the initial question, the result was recorded as "maybe good/maybe bad" or "neither like nor dislike" (but was not read to the subject). This phase of testing included only the hedonic and P&K scales because the face scale is inappropriate to this approach.

The question of which was the better procedure – the bifurcated or the straightforward – was addressed in a side experiment.

Another side issue that seemed worth testing was one-on-one interviewing vs a self-administered questionnaire. This experiment used the 9-point hedonic scale and P&K scales and involved only children 8 – 10 years old, i.e., the semiliterate group. Again, the face scale was excluded because the concern was mainly with ability to read with understanding.

Testing Procedure

The test subjects were prerecruited from families on our extensive roster of consumer panelists. Usually, the computer knows which families have children and their ages. All had to like orange drinks, which was no problem. Otherwise the only concern was age, sex, and availability to fit into the schedule. An important proviso was that no child should be invited to participate in more than one test, which would raise questions about training effect.

In all cases, a subject tried the pair of samples, high sweet vs low sweet, twice, using a different scale for each pair, then made a paired-comparison choice after each pair. Except for those on the mode of presentation, the experiments included all three scale types – hedonic, P&K, and face. The design required that the scales be used equally often and appear equally often as the first or second pair. Furthermore, for each scale type the high-sweet and low-sweet samples were served first or second equally often.
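The balancing requirements described above can be illustrated with a short Python sketch. It is not the scheduling procedure actually used in the study, which the article does not describe; the function names, the cycling assignment rule, and the example of 48 subjects are assumptions made only for illustration. The sketch enumerates every session layout (two different scales, each with its own serving order) and confirms that each scale appears equally often in each position and that each drink is served first equally often.

```python
from collections import Counter
from itertools import permutations, product

SCALES = ["hedonic", "P&K", "face"]      # scale types named in the article
SAMPLES = ["high-sweet", "low-sweet"]    # the two orange drinks


def build_cells():
    """List every session layout: two different scales, each with a serving order."""
    cells = []
    for scale_1, scale_2 in permutations(SCALES, 2):
        # Each pair of drinks can be served high-first or low-first.
        for order_1, order_2 in product(permutations(SAMPLES), repeat=2):
            cells.append({"pair_1": (scale_1, order_1), "pair_2": (scale_2, order_2)})
    return cells


def assign(subject_ids, cells):
    """Cycle through the cells so each is used equally often (hypothetical rule)."""
    return {sid: cells[i % len(cells)] for i, sid in enumerate(subject_ids)}


cells = build_cells()
schedule = assign(range(48), cells)  # 48 subjects is an arbitrary example size

# Balance checks: position of each scale, and which drink is served first per scale.
position_counts = Counter((c["pair_1"][0], "first") for c in cells) + Counter(
    (c["pair_2"][0], "second") for c in cells
)
first_served = Counter(
    (scale, order[0]) for c in cells for scale, order in (c["pair_1"], c["pair_2"])
)
print(len(cells), dict(position_counts), dict(first_served))
```

With three scales there are 24 distinct layouts, so any number of subjects that is a multiple of 24 keeps all of the balance conditions exact under this assumed rule.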

Sex differences did not seem important in the context of this investigation, but our recruiters attempted to have equal numbers of girls and boys in each of the age groups. This was not achieved exactly, but it was close. They also tried to get an even distribution of ages within each age group. Again, this was not exact but was very close.

The drinks were prepared in quantity ahead of time, chilled to refrigerator temperature, and held at that temperature throughout testing. They were poured just before serving. A sample as served was about 1½ oz of drink in a small plastic glass. The samples were identified by code number, but only for the convenience of the operators and to avoid errors. If a subject even saw the codes, it was accidental.

All interviewing was conducted one-on-one, except for the sessions using the regular written questionnaires. The interviewers were carefully briefed on the protocol to be followed for each variation.

The interviewer met the subject and parent in a reception area. Leaving the parent there, the interviewer took the child to the testing area while chatting in a friendly manner to establish rapport and relieve possible tension. The test itself was not discussed except in a very general way.

In the test room, the child was seated at a table across from the interviewer (Fig. 1) and told that he or she would get some samples of orange drink and would be asked questions about them. The first sample was brought and the child invited to try it. When the child was finished, the interviewer began the questioning procedure according to the set protocol. After a rating was made, the child was told to drink some water while the interviewer got the next sample. The waiting period was about 2 minutes. The second sample of the pair was then tried and rated. This was followed by the question, "Which did you like better, the first sample you tried or the second one?"

Then the child was told there were more drinks to be tried and had a drink of water while waiting another 2 minutes. The second pair was handled like the first, and the child was escorted back to his or her parent. The whole sequence took about 10 minutes.

Analyses

There is a qualification to note here. Some findings, in the sense of the objectives of the research, rely on what may be called soft data; however, they were derived from hard data.

Hard Data. For the paired comparison, the significance of the proportions of choice was determined by the z-test. For the scalar measures, the significance of the difference between the average rating for the high-sweet and low-sweet drinks was determined using the t-by-difference test, which was natural, since each subject had tried both samples. Using the variances of the distributions was also considered, but the figures were volatile and hard to interpret. With scales of this kind, the variance is highly dependent on the average rating, being quite low when the upper end of the scale is approached, but increasing as the average drops toward the midpoint.
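As an illustration of the t-by-difference (paired t) test mentioned above, the following Python sketch computes the statistic from each subject's two ratings. The ratings are invented for the example and are not data from the study; the function name is likewise hypothetical. The sign of the result depends on which end of the scale is coded as 1, so only its magnitude matters here.

```python
import math
from statistics import mean, stdev


def paired_t(high_sweet, low_sweet):
    """Return (t, degrees of freedom) for ratings given by the same subjects."""
    if len(high_sweet) != len(low_sweet) or len(high_sweet) < 2:
        raise ValueError("need two equal-length lists with at least two subjects")
    # Work on the per-subject differences, not on the two group means.
    diffs = [h - l for h, l in zip(high_sweet, low_sweet)]
    n = len(diffs)
    standard_error = stdev(diffs) / math.sqrt(n)
    return mean(diffs) / standard_error, n - 1


# Hypothetical 9-point ratings from eight children (illustration only).
high = [8, 7, 9, 6, 8, 7, 9, 8]
low = [7, 7, 8, 6, 6, 7, 8, 7]
t_value, df = paired_t(high, low)
print(f"t = {t_value:.2f} on {df} degrees of freedom")
```

Because each child rated both drinks, the differences remove child-to-child variation in overall liking, which is why this test suits the design better than a comparison of two independent groups.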

Soft Data. The tables of results show significance levels ranging from 1% to 15%. These figures were compared among scales, between age groups, between test orders, between orders of serving, and so on.

How legitimate, or how useful, is this approach? There is no routine, accepted statistical procedure for determining whether one level of significance is or is not significantly different from another. Perhaps a method for this purpose could be devised, but its possible utilization has not been explored. An example of the questions to be resolved would be, how much more important is the 1% level than the 2% level? Probably not very important, since both are near certainty. But one is easily convinced that the 1% level shows more discrimination than the 10% level. These are the kinds of decisions that served as the basis for most of the conclusions in this study.

Results

What, if anything, was discovered in this study? Are any conclusions definitive, settling certain points once and for all? Not likely! But there are results that can direct future research on the subject.

Paired-Comparison. The paired comparisons were always made after the pair of drinks had been presented and rated. The results, summarized across all tests, are shown in Table 1.

Table 1 -- Paired-Comparison Test Results between high- and low-sweet orange drinks. Values are preference (%). Numbers in parentheses are N

| Sample preferred | Total (1,032) | By sex: Male (518) | By sex: Female (514) | By age: 5-7 (424) | By age: 8-10 (608) | By test order: First (516) | By test order: Second (412) | By scale type: Hedonic (412) | By scale type: P&K (412) | By scale type: Face (208) |
|---|---|---|---|---|---|---|---|---|---|---|
| High-sweet | 59 | 59 | 60 | 53 | 64 | 60 | 61 | 61 | 60 | 55 |
| Low-sweet | 41 | 41 | 40 | 47 | 36 | 40 | 39 | 39 | 40 | 45 |
| Significance level | 1% | 1% | 1% | NSª | 1% | 1% | 1% | 1% | 1% | NSª |

ªNS = not significant

Overall, there was a highly significant difference – well below the 0.1% level – which was due in part to the large number of subjects (N). As expected, the high-sweet sample was preferred, which validated the product variable. Other conclusions come from comparing different subgroups.
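The z-test on a proportion of choice can be sketched in Python as follows. This is an approximate check rather than a reproduction of the original analysis: it starts from the rounded percentages in Table 1, tests against an even 50/50 split, and assumes a two-sided test, which the article does not specify. The function names are hypothetical.

```python
import math


def z_vs_even_split(p_hat, n):
    """z statistic for an observed preference proportion against 0.5."""
    standard_error = math.sqrt(0.5 * 0.5 / n)
    return (p_hat - 0.5) / standard_error


def two_sided_p(z):
    """Two-sided tail probability under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# (proportion preferring high-sweet, N), read from Table 1; percentages are rounded.
groups = {
    "Total": (0.59, 1032),
    "Age 5-7": (0.53, 424),
    "Age 8-10": (0.64, 608),
    "Face scale": (0.55, 208),
}
results = {}
for name, (p_hat, n) in groups.items():
    z = z_vs_even_split(p_hat, n)
    results[name] = (z, two_sided_p(z))
    print(f"{name:10s} z = {z:5.2f}  p = {results[name][1]:.4f}")
```

Under these assumptions the total gives a z of about 5.8, far beyond the 0.1% level, while the 5-7 age group and the face scale give z values of roughly 1.2 and 1.4, which is consistent with their NS entries in the table.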

Test order, whether the first or second pair of the session, made no difference.

There was no difference in discrimination between boys and girls.

Children 8 – 10 years old were definitely more discriminating than the younger kids, who failed to establish a significant difference. Their failure might have been due to interference by the scaling task. The difference between ages might have been expected.

Scale type may also have made a difference, although evidence is borderline. When the comparison was made after the hedonic and P&K scales, discrimination was about the same as overall; but when it was made after the face scale, it dropped to the level of nonsignificance. This might be a chance effect, or there may be something about the face scale which later interfered with the paired comparison.
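One way to examine subgroup differences like these directly, rather than by comparing significance levels, is a two-proportion z-test. This test was not part of the study; the Python sketch below is offered only as an illustration, and it rests on assumptions worth stating: the proportions are the rounded percentages from Table 1, and the choices are treated as independent even though each child contributed two paired comparisons. The function names are hypothetical.

```python
import math


def two_proportion_z(p1, n1, p2, n2):
    """z statistic for the difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    standard_error = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p1 - p2) / standard_error


def two_sided_p(z):
    """Two-sided tail probability under the standard normal distribution."""
    return math.erfc(abs(z) / math.sqrt(2))


# Preference for the high-sweet drink, from Table 1 (rounded percentages).
z_age = two_proportion_z(0.64, 608, 0.53, 424)      # ages 8-10 vs ages 5-7
z_scale = two_proportion_z(0.61, 412, 0.55, 208)    # hedonic vs face scale

print(f"age:   z = {z_age:.2f}, p = {two_sided_p(z_age):.4f}")
print(f"scale: z = {z_scale:.2f}, p = {two_sided_p(z_scale):.4f}")
```

On these figures the age difference gives a z of about 3.5, while the hedonic vs face comparison gives a z of about 1.4, which agrees with the reading that the age effect is definite and the scale-type effect is borderline.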

Scale Length. Scale-length results (Table 2) tend to lay to rest the belief that children need simplicity and shouldn't be presented with too much because they will get confused. Within the context of these experiments, that did not prove to be the case. Quite the contrary – the 9-point scales were as good as, if not better than, the 7-point versions. Definitely, the 7-point scales were not better. Whether the 9-point scales were actually better for discrimination rests on comparison of the 5% vs 1% levels of significance, but the 7-point scales offer no advantage.

Table 2 -- Effect of Scale Length on Discrimination (combined across scale type). Values are significance levels. Numbers in parentheses are N

| Scale length | Total (312) | By sex: Male (154) | By sex: Female (158) | By age: 5-7 (156) | By age: 8-10 (156) | By test order: First (156) | By test order: Second (156) |
|---|---|---|---|---|---|---|---|
| 9-point | 1% | 1% | 8% | 3% | 1% | 15% | 1% |
| 7-point | 5% | 14% | NS | NS | 1% | NS | 2% |

With the 9-point scales, all subgroups showed significant discrimination, granted that at one point it dropped to a questionable 15% level; whereas with the 7-point scales, three subgroups showed nonsignificance.

The boys did slightly better than the girls, although this was not consistent. It is probably trivial, and not indicative of any meaningful trend.

The age effect is definite and hardly unexpected. The children 8 – 10 years old showed good discrimination with both scale lengths, whereas the children 5 – 7 years old showed significant discrimination only with the 9-point scales, completely failing the task with the shorter version. On the supposition that simpler scales should be easier for younger children, one might have expected the younger group's result to be the other way around.

It is often noted in sequential monadic testing that there is better discrimination when only the second-served samples are considered. In this study, there was significant discrimination with the second-served samples for both scale lengths, but almost none with the first-served samples. Is this due to some kind of contrast? Is it a training effect, where the ratings of the second sample have the benefit of experience with the first? This research could not address such questions in all of their complexity. Besides, such effects pertain to all testing, not just when children are concerned.

Scale Type. The crux of the research is the comparative evaluation of the three scale types. Overall, with N = 208 for each scale, all scales significantly discriminated at better than the 10% level. However, the P&K scale (1% significance level) was better than the hedonic scale (8% significance level) and the face scale (7% significance level). We think this is an important finding, but remember the qualification about soft data – it is based on comparison of the 1% vs the 7% or 8% level of significance. In addition, the face scale, which typified the kind alleged to be better for children, failed to emerge as better than the other scales.

Table 3 -- Interactions of Scale Type with other variables (combined across scale length). Values are significance levels. Numbers in parentheses are N

| Scale type | Total (208) | By age: 5-7 (104) | By age: 8-10 (104) | By serving order: First (104) | By serving order: Second (104) | By test order: First (104) | By test order: Second (104) |
|---|---|---|---|---|---|---|---|
| Hedonic | 8% | NS | 7% | NS | NS | NS | 7% |
| P&K | 1% | 13% | 1% | NS | 1% | 12% | 1% |
| Face | 7% | NS | 5% | NS | 2% | NS | 8% |

In a way, Table 3 is repetitive, exhibiting effects shown in the other tables, but now separately for each scale type. However, it may add further emphasis to the following conclusions: The P&K scale gave better overall discrimination; older children showed better discrimination with all scales; and no scale discriminated when just the first-served samples were considered, but the P&K and face scales did with the second-served samples.

The second pair of drinks tested was consistently better for discrimination than the first pair, no matter the scale type. Does this mean that there is a learning effect, even from the brief first exposure to the task? If so, it is both bad news and good news. The bad news is that one does not have a pure measure. But who believes that is possible anyway? The good news is that kids quickly learn to do a good job, and that the testing of multiple pairs is acceptable.

Mode of Presentation. Table 4 shows the results of the side study designed to help answer the question, Is there any advantage in using the two-stage, bifurcated approach? The study was limited to the 9-point scale.

Table 4 -- Effect of Straightforward vs Bifurcated Presentation (combined hedonic and P&K scales, 9-point). Values are significance levels. Numbers in parentheses are N

| Mode of presentation | Total | By age: 5-7 | By age: 8-10 | By serving order: First | By serving order: Second | By test order: First | By test order: Second |
|---|---|---|---|---|---|---|---|
| Straightforward | 2% (208) | NS (104) | 1% (104) | NS (104) | 8% (104) | 6% (104) | NS (104) |
| Bifurcated | 10% (224) | 14% (112) | NS (112) | 2% (112) | NS (112) | NS (112) | 2% (112) |

Overall, the bifurcated approach seems to offer no advantage over the straightforward. Even for the children 5 – 7 years old – the age group for whom the method was designed – the bifurcated scale was little better than the straightforward approach.

The self-administration phase of the study was an embellishment done as an afterthought. It was limited in scope, utilizing only the hedonic and P&K scales, and excluding children 5 – 7 years old for the obvious reason that they are preliterate.

The results (Table 5) showed that children 8 – 10 years old can handle written questionnaires effectively. Overall, the results were significant at the 1% level.

Table 5 -- Effect of Mode of Presentation: one-on-one interviewing vs self-administration (combined hedonic and P&K scales, 9-point, children 8-10 years old). Values are significance levels. Numbers in parentheses are N

| Mode of presentation | Total | By sex: Male | By sex: Female | By serving order: First | By serving order: Second | By test order: First | By test order: Second |
|---|---|---|---|---|---|---|---|
| One-on-one | 1% (104) | 2% (54) | 9% (50) | 1% (52) | NS (52) | NS (52) | 1% (52) |
| Self-administered | 1% (184) | 1% (90) | 7% (94) | 10% (92) | 3% (92) | 1% (92) | NS (92) |

Although not shown in the table, the effect of self-administration was more pronounced with the hedonic scale, whereas discrimination with the P&K scale was about the same with both approaches (one-on-one interviewing and self-administration). This finding should cheer sensory specialists. It makes things easier. If children of this age are sufficiently knowledgeable that big words do not defeat the purpose, why bother with expensive one-on-one interviewing?

Further Studies Needed

The results of this study can be summarized as follows: The P&K scale performs better than the hedonic or face scale. Reducing scale length from 9 points to 7 offers no advantage. Children 5 – 7 years old do not perform any better with the face scale than with the other two scales. The bifurcated approach does not discriminate as well as the straightforward method. And older children perform as well using written questionnaires as when interviewed one-on-one. The study, as noted earlier, was not intended to be the be-all and end-all. Rather, it was intended as a foundation for further studies. A review of the variables will show that many need further attention. While there are problems involved, there is a great deal to be gained.

References

--Birch, L.L. 1979. Dimensions of preschool children's food preferences. J. Nutr. Educ. 2(2): 77.
--Colwill, J.S. 1987. Sensory analysis by consumers. Food Mfr., Feb., p. 53.
--Morse, R.L.D. 1953. Exploratory studies of preschool children's taste discrimination and preference for selected juices. Proc. of Florida State Horticultural Soc., Daytona Beach.
--Moskowitz, H.R. 1985. Product testing with children. In "New Direction for Product Testing and Sensory Analysis of Foods," p. 147. Food and Nutrition Press, Inc., Westport, Conn.
--Peryam, D.R. 1989. Personal communication. Peryam & Kroll Marketing and Sensory Research, Chicago.
--Wells, W.D. 1965. Communicating with children. J. Adv. Res., p. 2.

Based on a paper presented at the Spring Meeting of ASTM, San Francisco, Calif., May 24, 1990.

– Edited by Neil H. Mermelstein, Senior Associate Editor

Reprinted from Food Technology 44(11) 78-80, 82, 84, & 86
©1990 Institute of Food Technologists