One apparently confused redditor has made the following claims about the attractiveness assessments used in research into preferences:

https://cdn-images-1.medium.com/max/2000/0*aiEOj6bJOf5mZX_z.png Look at the male messaging curve.

Now again look at the woman's curve.

http://cdn.okcimg.com/blog/your_looks_and_inbox/Female-Messaging-Curve.png Why would men be messaging women they mostly find attractive while women seem to be messaging men they on average find unattractive?

Here's a break down of how this works:

Let's say there are 3 ice cream flavors: A B C, and subjects are to each rate them 1 - 5. And this happened:

Subject 1

A 1 B 3 C 5

Subject 2

A 5 B 3 C 1

Subject 3

A 1 B 5 C 1

Subject 4

A 1 B 5 C 3

So our results are:

5 1s 3 3s 3 5s

3 good flavors

8 less than good flavors

The subjects would be rating 80 percent of ice cream flavors less desirable. Yet they each still individually PREFER ice cream flavors that are on average rated as less than desirable by the group.

Black pillers along with LMSers deliberately ignore the messaging curve while pretending that women all have the same tastes and judge 80 percent of men as unattractive and so the 20 percent that remains must all be the same guys.

The messaging curve easily debunks that and reveals what's really happening.

The power of stats.

Side-stepping the utterly questionable (a.k.a. wrong) math and the implicit assumptions involved in interpreting the count of all sub-5 ratings on three ice cream flavors as the subjects rating "80 percent of (three!) ice cream flavors less desirable," let's focus on the crux of this post: the claim that the ratings are too "variegated" to be reliable.

First, I'll elaborate on something I mentioned here in response to this redditor's concerns. An excerpt:

The argument you're trying to make is that some subgroup or diffuse heterogeneity precludes any statistical analysis. Except that if this were true, then:

  1. the ratings of the different independent observers used in these studies would correlate too poorly for a single final rating (usually a central-tendency metric such as the mean) to be useful (this is measured by the alpha index, by the way)

By alpha index, I'm referring to Cronbach's α, a.k.a. tau-equivalent reliability, used here as a measure of inter-rater reliability. Nearly all research involving attractiveness ratings shows a Cronbach's α > 0.80, and often > 0.9 when ratings are limited to heterosexual raters evaluating opposite-sex targets. Hitsch 2006 and 2010 (in the sidebar) had a mixed-sex group of 100 different raters for their massive dataset, yielding 12 ratings per photo, with a Cronbach's α of 0.80. Here's a commonly used scheme for interpreting the value:

| Cronbach's alpha | Internal consistency |
|:--|:--|
| 0.9 ≤ α | Excellent |
| 0.8 ≤ α < 0.9 | Good |
| 0.7 ≤ α < 0.8 | Acceptable |
| 0.6 ≤ α < 0.7 | Questionable |
| 0.5 ≤ α < 0.6 | Poor |
| α < 0.5 | Unacceptable |
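For readers who want to compute α themselves, here is a minimal Python sketch of the classic formula α = k/(k−1) · (1 − Σ item variances / variance of totals), treating each rater as an "item"; the function name and data layout are my own, not from any of the studies cited here:

```python
def cronbach_alpha(ratings):
    """Cronbach's alpha for a matrix with one row per rated target
    and one column per rater (each rater treated as an "item")."""
    k = len(ratings[0])                        # number of raters

    def svar(xs):                              # sample variance (ddof = 1)
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = sum(svar([row[j] for row in ratings]) for j in range(k))
    total_var = svar([sum(row) for row in ratings])
    return k / (k - 1) * (1 - item_vars / total_var)

# Four raters in perfect agreement on three targets:
print(round(cronbach_alpha([[5, 5, 5, 5], [3, 3, 3, 3], [1, 1, 1, 1]]), 6))  # -> 1.0
```

Feeding it the hypothetical dataset below (rows transposed so each row is one ice cream) returns α = 0 up to floating-point rounding.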

Which brings us to the heart of the matter:

What's the Cronbach's α of the neomancr hypothetical ratings dataset?

First, his data, re-presented in clearer table form:

| Rater | Ice cream A | Ice cream B | Ice cream C |
|:--|--:|--:|--:|
| Subject 1 | 1 | 3 | 5 |
| Subject 2 | 5 | 3 | 1 |
| Subject 3 | 1 | 5 | 1 |
| Subject 4 | 1 | 5 | 3 |

The next steps can be performed in your stats software of choice, or in Excel:

Anova: Two-Factor Without Replication

| SUMMARY | Count | Sum | Average | Variance |
|:--|--:|--:|--:|--:|
| Subject 1 | 3 | 9 | 3 | 4 |
| Subject 2 | 3 | 9 | 3 | 4 |
| Subject 3 | 3 | 7 | 2.333333 | 5.333333 |
| Subject 4 | 3 | 9 | 3 | 4 |
| Ice cream A | 4 | 8 | 2 | 4 |
| Ice cream B | 4 | 16 | 4 | 1.333333 |
| Ice cream C | 4 | 10 | 2.5 | 3.666667 |

| Source of Variation | SS | df | MS | F | P-value | F crit |
|:--|--:|--:|--:|--:|--:|--:|
| Rows | 1 | 3 | 0.333333 | 0.076923 | 0.970184 | 4.757063 |
| Columns | 8.666667 | 2 | 4.333333 | 1 | 0.421875 | 5.143253 |
| Error | 26 | 6 | 4.333333 | | | |
| Total | 35.66667 | 11 | | | | |

Cronbach's α = 0

The Cronbach's α of the neomancr dataset is ZERO.
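As a cross-check, α can be read straight off the ANOVA table via the identity α = 1 − MS_error / MS_columns, "Columns" being the between-flavors term here; a quick sketch using the MS values from the table:

```python
# Alpha from the two-factor ANOVA output above:
# alpha = 1 - MS_error / MS_columns ("Columns" = the rated flavors)
ms_columns = 4.333333   # MS for Columns (between ice cream flavors)
ms_error = 4.333333     # MS for Error (rater disagreement)
alpha = 1 - ms_error / ms_columns
print(alpha)  # -> 0.0
```

Because the rater-disagreement variance is exactly as large as the between-flavor variance, the reliability bottoms out at zero.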

Slightly more "variegated" than what actual studies show, eh?

Given there hasn't been a single study that I'm aware of with a Cronbach's α below 0.75 for looks ratings, we can probably rest assured that the hypothetical dataset neomancr envisioned, with such marked variation between raters, exists nowhere except his own imagination.

To see how Cronbach's α tracks how "variegated" the ratings are, compare the two additional cases below.


Case 2: Perfect agreement between raters:

| Rater | Ice cream A | Ice cream B | Ice cream C |
|:--|--:|--:|--:|
| Subject 1 | 5 | 3 | 1 |
| Subject 2 | 5 | 3 | 1 |
| Subject 3 | 5 | 3 | 1 |
| Subject 4 | 5 | 3 | 1 |

Anova: Two-Factor Without Replication

| SUMMARY | Count | Sum | Average | Variance |
|:--|--:|--:|--:|--:|
| Subject 1 | 3 | 9 | 3 | 4 |
| Subject 2 | 3 | 9 | 3 | 4 |
| Subject 3 | 3 | 9 | 3 | 4 |
| Subject 4 | 3 | 9 | 3 | 4 |
| Ice cream A | 4 | 20 | 5 | 0 |
| Ice cream B | 4 | 12 | 3 | 0 |
| Ice cream C | 4 | 4 | 1 | 0 |

| Source of Variation | SS | df | MS | F | P-value | F crit |
|:--|--:|--:|--:|--:|--:|--:|
| Rows | 0 | 3 | 0 | 65535 | #DIV/0! | 4.757063 |
| Columns | 32 | 2 | 16 | 65535 | #DIV/0! | 5.143253 |
| Error | 0 | 6 | 0 | | | |
| Total | 32 | 11 | | | | |

Cronbach's α = 1
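The degenerate F and P-value entries (65535, #DIV/0!) are just Excel's way of flagging a zero error variance; the same α identity used above still gives exactly 1:

```python
# Case 2: perfect agreement, so the Error term vanishes
ms_columns = 16.0   # MS for Columns in Case 2
ms_error = 0.0      # zero error variance: raters agree perfectly
alpha = 1 - ms_error / ms_columns
print(alpha)  # -> 1.0
```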


Case 3: Less than perfect agreement between raters:

| Rater | Ice cream A | Ice cream B | Ice cream C |
|:--|--:|--:|--:|
| Subject 1 | 4 | 2 | 1 |
| Subject 2 | 3 | 3 | 2 |
| Subject 3 | 5 | 3 | 1 |
| Subject 4 | 4 | 2 | 1 |

Anova: Two-Factor Without Replication

| SUMMARY | Count | Sum | Average | Variance |
|:--|--:|--:|--:|--:|
| Subject 1 | 3 | 7 | 2.333333 | 2.333333 |
| Subject 2 | 3 | 8 | 2.666667 | 0.333333 |
| Subject 3 | 3 | 9 | 3 | 4 |
| Subject 4 | 3 | 7 | 2.333333 | 2.333333 |
| Ice cream A | 4 | 16 | 4 | 0.666667 |
| Ice cream B | 4 | 10 | 2.5 | 0.333333 |
| Ice cream C | 4 | 5 | 1.25 | 0.25 |

| Source of Variation | SS | df | MS | F | P-value | F crit |
|:--|--:|--:|--:|--:|--:|--:|
| Rows | 0.916667 | 3 | 0.305556 | 0.647059 | 0.612811 | 4.757063 |
| Columns | 15.16667 | 2 | 7.583333 | 16.05882 | 0.0039 | 5.143253 |
| Error | 2.833333 | 6 | 0.472222 | | | |
| Total | 18.91667 | 11 | | | | |

Cronbach's α = 0.937729
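The same check for Case 3, this time from the raw ratings using the classic variance formulation (the layout is mine: rows are flavors, columns are raters):

```python
# Case 3 ratings: rows = ice creams A..C, columns = Subjects 1..4
rows = [[4, 3, 5, 4],
        [2, 3, 3, 2],
        [1, 2, 1, 1]]
k = len(rows[0])                           # number of raters

def svar(xs):                              # sample variance (ddof = 1)
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

rater_vars = sum(svar([r[j] for r in rows]) for j in range(k))
totals_var = svar([sum(r) for r in rows])
alpha = k / (k - 1) * (1 - rater_vars / totals_var)
print(round(alpha, 6))  # -> 0.937729
```

Imperfect but broadly consistent raters still land well inside the "Excellent" band, which is exactly the point.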