Since my last blog entry involving statistics was much better received than I had ever anticipated, I thought I'd do another one for you fine folks. This time I'll be addressing the question of reviewer bias, an accusation made very often in the forums and comments sections. Despite the malice held by some who make such accusations, I think it is nonetheless a perfectly good and valid question that is well suited to the realm of statistics. The two questions I will be addressing within this article are as follows:
1. Is there evidence of reviewer bias in any of the major reviewing sites?
2. If so, which sites show evidence of bias, and towards which consoles is there evidence of bias?
As before, if you don't want to wade through all of the statistics and data, you can click here to skip straight to the conclusions. But please read the article in its entirety if you want to contest the conclusions.
Necessary concepts
Before I get into the data, I'd first like to explain, as best and as quickly as I can, a few concepts needed for this article. The overarching concept is that of the "null hypothesis test". Within such a test, we basically have two hypotheses: one which is the "default" or "null" hypothesis, so to speak, and one which is the alternative hypothesis that will be accepted only if there is sufficient evidence to do so. In this case, we will assume that there is no bias, and then determine whether or not such an idea is credible given the data we have accumulated - if not, we will instead accept the alternative hypothesis and find there to be sufficient evidence that there is bias. This is similar to the rule of "innocent until proven guilty" - we will consider sites to be unbiased until proven biased.
I'll also be touching on something called a "standard deviation". In layman's terms, the standard deviation is effectively the size of the range around the average within which one can expect to find most of the data points. For example, if the average of the data points was 1, and if the standard deviation of the data points was 0.5, then we would expect most of the data points to be between 0.5 and 1.5. This is not a rigorous definition, but it is one that is good enough for this space.
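To make that example concrete, here is a minimal Python sketch. The five data points are made up purely for illustration; they were chosen so that the average is 1 and the standard deviation is 0.5.

```python
import statistics

# Hypothetical data points, chosen so that the average is 1 and the
# (population) standard deviation is 0.5.
data = [0.25, 0.75, 1.0, 1.25, 1.75]

average = statistics.mean(data)
spread = statistics.pstdev(data)

# Points that fall within one standard deviation of the average.
within = [x for x in data if average - spread <= x <= average + spread]

print("average:", average)
print("standard deviation:", spread)
print("points between 0.5 and 1.5:", within)
```

Here three of the five points land between 0.5 and 1.5, which is the "most of the data points" idea described above.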
Methodology
To collect my data and ensure its completeness, I wrote a very simple computer program that mined Metacritic for data (just parsing publicly available webpages, nothing fancy), and retrieved both the Metacritic average and the review scores from the top five review sites for every single game on the PS3, Xbox 360, and Wii that at least one of those sites had reviewed. This gives us a total of 1,456 games that were reviewed (351 PS3 games, 656 Xbox 360 games, and 449 Wii games), so we can be confident that the data seen herein are representative. Since this test does not concern itself with popularity, I have selected the top five sites based on interest from people here at GameSpot rather than by internet traffic. Consequently, those sites which I will address are GameSpot, IGN, GameTrailers, 1UP, and Giant Bomb.
Before we can conduct the statistical test, however, we must quantify exactly what we are testing. I said above that our assumption to be rejected is that there is no bias, but in order to test that based on the data rather than simply eyeballing it, we must ask ourselves just what it means for there to be no bias.
First, we must account for what I have termed the "house effect" in each of the reviewing sites. It is well known that GameSpot, for example, on average reviews games more strictly than IGN. Thus, a GameSpot review score that is less than an IGN review score is not necessarily an indication of bias on either end; instead, it could simply be a manifestation of the two sites' different standards. In order to be able to compare one site's reviews to another's, we must remove this house effect.
So, what I have done is the following. First, I calculated the difference between each site's review score and the Metacritic average. I then took the average of these differences across all consoles to determine each site's house effect. Across all games, the sites in question had an average difference from the Metacritic score as follows:
GameSpot: -0.16
IGN: +0.10
GameTrailers: +0.24
1UP: -0.38
Giant Bomb: -0.66
In other words, on average, GameSpot reviews games about 0.16 points lower than average, whereas IGN reviews games about 0.10 points higher than average.
(Note that 1UP's numerical score comes from Metacritic's conversion of 1UP's letter grades into a numerical value. Since they have not complained to Metacritic and requested that the conversion be altered or removed altogether - something Metacritic explicitly gives them the right to do - we can assume that this conversion is an accurate enough representation of the 1UP reviewers' intent.)
Once we have this house effect, we can then subtract it from the difference between each site's review score and the Metacritic score. This has the effect of removing each site's house effect from the resulting adjusted differences - thus, the adjusted numbers become comparable across the sites.
Now that we have those adjusted differences, we can quantitatively define the quality of "no bias". If there is no bias in a site towards or against a console, then the average adjusted difference should be 0 across review scores on that site of games on that console. In other words, reviews on any given site should diverge from the average no more on games for a specific console than they diverge from the average on all games. That is to say, if a site on average reviews games 0.1 points above the Metacritic average, then it should not on average review a single console's games significantly more or less than 0.1 points above the Metacritic average.
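The adjustment steps described above can be sketched in a few lines of Python. This is only an illustration: the rows in `reviews` are invented numbers, not values from the actual data set, and the tuple layout is an assumption made for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (site, console, site_score, metacritic_average).
# These numbers are invented for illustration only.
reviews = [
    ("GameSpot", "PS3", 8.0, 8.2),
    ("GameSpot", "Xbox 360", 7.5, 7.4),
    ("GameSpot", "Wii", 6.0, 6.5),
    ("IGN", "PS3", 8.5, 8.2),
    ("IGN", "Xbox 360", 7.6, 7.4),
    ("IGN", "Wii", 6.4, 6.5),
]

# Step 1: raw difference between each review score and the Metacritic average.
raw_diffs = defaultdict(list)
for site, console, score, meta in reviews:
    raw_diffs[site].append(score - meta)

# Step 2: house effect = a site's average raw difference across all consoles.
house_effect = {site: mean(diffs) for site, diffs in raw_diffs.items()}

# Step 3: adjusted difference = raw difference minus the site's house effect,
# grouped by (site, console).
adjusted = defaultdict(list)
for site, console, score, meta in reviews:
    adjusted[(site, console)].append((score - meta) - house_effect[site])

# Step 4: average adjusted difference per site and console.
# Under the "no bias" hypothesis, each of these should be close to 0.
for (site, console), diffs in sorted(adjusted.items()):
    print(f"{site:9s} {console:9s} {mean(diffs):+.2f}")
```

By construction, a site's adjusted differences average out to 0 across all consoles combined; the question is whether they also stay near 0 within each individual console.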
Thus, we now have our null hypothesis (that the average adjusted difference is 0 for each console), and we are ready to test whether or not the data finds it to be a credible hypothesis. To do this, we will first find the average adjusted difference from the Metacritic score across each console.
However, any set of data includes in it random variation that will inevitably make it not precisely in accordance with the null hypothesis, even if the null hypothesis is actually true. Therefore, we must go a little further than that and establish the statistical likelihood that the data we have received comes from a source in which there is no bias. So, we will also find one other value to accompany each average adjusted difference, which is the standard deviation of the adjusted differences. A higher standard deviation indicates more variance in scores, which means that a larger average adjusted difference is more likely to be caused simply by random chance than by actual bias.
Once we have the average and the standard deviation, as well as the number of reviews, we can then perform what is known as a z-test. In layman's terms, this is a test that will give us the probability of seeing an average adjusted difference at least as far from 0 as the one we found if the data really came from a source with no bias. For brevity, I will refer to this probability below as the likelihood of no bias. Because a certain amount of deviation from the expected average is just due to random chance, we apply a threshold of 5%: if this probability is less than 5%, we will reject the null hypothesis, because we will deem it too unlikely to be reasonable. If it is equal to or greater than 5%, then we will instead fail to reject the null hypothesis, having found insufficient evidence to do so.
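For readers who want to see the arithmetic, here is a minimal Python sketch of such a test. It assumes a two-sided, one-sample z-test that treats the observed standard deviation as the true one; the exact variant behind the figures in this article is not spelled out, so take this as an approximation rather than the precise calculation. The function name `z_test_p_value` is made up for the example.

```python
import math

def z_test_p_value(avg_adjusted_diff, std_dev, n_reviews):
    """Two-sided p-value for the null hypothesis that the true average is 0."""
    # Standard error of the average: it shrinks as the number of reviews grows.
    standard_error = std_dev / math.sqrt(n_reviews)
    z = avg_adjusted_diff / standard_error
    # Probability, under a standard normal curve, of landing at least |z| from 0.
    return math.erfc(abs(z) / math.sqrt(2))

# GameSpot's Wii figures from the all-games results:
# average -0.12, standard deviation 0.82, 292 reviews.
p = z_test_p_value(-0.12, 0.82, 292)
print(f"p = {p:.3f}")
print("reject the no-bias hypothesis" if p < 0.05 else "fail to reject")
```

With these inputs the probability comes out a little above 1%, which falls below the 5% threshold. Because the published averages are rounded to two decimal places, recomputed values will not always match the reported percentages exactly.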
Results, all games
First, here are the average adjusted differences from the Metacritic scores for all games:
PS3
GameSpot: -0.00 (standard deviation 0.73, 296 reviews)
IGN: -0.02 (standard deviation 0.71, 345 reviews)
GameTrailers: +0.04 (standard deviation 0.59, 141 reviews)
1UP: -0.03 (standard deviation 1.28, 236 reviews)
Giant Bomb: +0.12 (standard deviation 1.33, 60 reviews)
Xbox 360
GameSpot: +0.07 (standard deviation 0.74, 568 reviews)
IGN: -0.01 (standard deviation 0.71, 637 reviews)
GameTrailers: +0.01 (standard deviation 0.61, 219 reviews)
1UP: +0.05 (standard deviation 1.32, 404 reviews)
Giant Bomb: +0.02 (standard deviation 1.29, 110 reviews)
Wii
GameSpot: -0.12 (standard deviation 0.82, 292 reviews)
IGN: +0.00 (standard deviation 0.77, 424 reviews)
GameTrailers: -0.09 (standard deviation 0.65, 118 reviews)
1UP: -0.07 (standard deviation 1.30, 194 reviews)
Giant Bomb: -0.49 (standard deviation 1.31, 19 reviews)
And, based on this data, we can calculate the following percentage probabilities that there is no bias in each case, as well as whether or not the hypothesis of no bias is rejected:
PS3
GameSpot: 91% likelihood of no bias (no evidence of bias found)
IGN: 68% likelihood of no bias (no evidence of bias found)
GameTrailers: 37% likelihood of no bias (no evidence of bias found)
1UP: 76% likelihood of no bias (no evidence of bias found)
Giant Bomb: 48% likelihood of no bias (no evidence of bias found)
Xbox 360
GameSpot: 3% likelihood of no bias (evidence of positive bias found)
IGN: 84% likelihood of no bias (no evidence of bias found)
GameTrailers: 80% likelihood of no bias (no evidence of bias found)
1UP: 47% likelihood of no bias (no evidence of bias found)
Giant Bomb: 88% likelihood of no bias (no evidence of bias found)
Wii
GameSpot: 1% likelihood of no bias (evidence of negative bias found)
IGN: 91% likelihood of no bias (no evidence of bias found)
GameTrailers: 14% likelihood of no bias (no evidence of bias found)
1UP: 46% likelihood of no bias (no evidence of bias found)
Giant Bomb: N/A (not enough games reviewed to make test results valid)
Results, top 100 reviewed games
Astute readers will no doubt have noticed a point of interest in the results above: the evidence found of positive bias towards the Xbox 360 and negative bias against the Wii in GameSpot's reviews. However, before you begin to either celebrate or dread, let me remind the readers that bias is an exceptionally serious charge when levelled at an organization whose main form of capital is trust - trust that would be broken if there truly were bias present. So, before we draw any conclusions, let us continue.
I realized in looking at the above that it includes reviews for very poorly reviewed games - games that people are unlikely to actually care about, and games that reviewers review basically solely because they have to. So, as a second test, I thought it would be worthwhile to consider only the games with the highest average ratings, to see if the above results hold up even among the games that people are actually likely to care about. First, however, since this is a new body of reviews, we have different house effects for each reviewer. These were determined to be as follows:
GameSpot: -0.22
IGN: +0.13
GameTrailers: +0.13
1UP: -0.04
Giant Bomb: -0.27
And, now, we can get the new average adjusted differences in each console for each site for the top 100 reviewed games (according to Metacritic):
PS3
GameSpot: +0.02 (standard deviation 0.50, 93 reviews)
IGN: +0.00 (standard deviation 0.39, 100 reviews)
GameTrailers: +0.04 (standard deviation 0.34, 57 reviews)
1UP: +0.05 (standard deviation 0.80, 82 reviews)
Giant Bomb: +0.11 (standard deviation 1.24, 33 reviews)
Xbox 360
GameSpot: +0.02 (standard deviation 0.42, 93 reviews)
IGN: -0.05 (standard deviation 0.33, 98 reviews)
GameTrailers: -0.05 (standard deviation 0.34, 59 reviews)
1UP: +0.12 (standard deviation 0.83, 88 reviews)
Giant Bomb: +0.19 (standard deviation 0.99, 34 reviews)
Wii
GameSpot: -0.05 (standard deviation 0.70, 78 reviews)
IGN: +0.05 (standard deviation 0.44, 98 reviews)
GameTrailers: +0.01 (standard deviation 0.40, 43 reviews)
1UP: -0.21 (standard deviation 1.16, 68 reviews)
Giant Bomb: -0.79 (standard deviation 1.31, 13 reviews)
And, based on this data, we can calculate the following percentage probabilities that there is no bias in each case, as well as whether or not the hypothesis of no bias is rejected:
PS3
GameSpot: 82% likelihood of no bias (no evidence of bias found)
IGN: 79% likelihood of no bias (no evidence of bias found)
GameTrailers: 35% likelihood of no bias (no evidence of bias found)
1UP: 46% likelihood of no bias (no evidence of bias found)
Giant Bomb: 62% likelihood of no bias (no evidence of bias found)
Xbox 360
GameSpot: 57% likelihood of no bias (no evidence of bias found)
IGN: 13% likelihood of no bias (no evidence of bias found)
GameTrailers: 30% likelihood of no bias (no evidence of bias found)
1UP: 18% likelihood of no bias (no evidence of bias found)
Giant Bomb: 26% likelihood of no bias (no evidence of bias found)
Wii
GameSpot: 51% likelihood of no bias (no evidence of bias found)
IGN: 26% likelihood of no bias (no evidence of bias found)
GameTrailers: 90% likelihood of no bias (no evidence of bias found)
1UP: 13% likelihood of no bias (no evidence of bias found)
Giant Bomb: N/A (not enough games reviewed to make test results valid)
Though there appeared to be initial evidence of bias in GameSpot, that quickly faded away when we restricted our test to only the games that were on average reviewed fairly highly. In other words, while there may have been a statistically significant overscoring of Xbox 360 games and underscoring of Wii games in total, that appears to have only been the case for games that were on average reviewed poorly - when critically well-received games are considered, there is no evidence of bias. In addition, no evidence of bias was ever detected in any of the other four sites.
It should also be noted that this sort of test does not prove anything; it merely provides evidence in favor of something to a certain level of confidence. While a probability of less than 5% is conventionally taken as sufficient evidence against the default hypothesis, that cutoff is arbitrary.
I should point out, however, that this only analyzes bias in the reviews that were actually written. There do appear to be substantially fewer Wii games reviewed than games for other consoles, though I should caution that other factors are certainly at play there as well. I am also unable to think of a way to statistically test whether that gap is evidence of bias, given that there is no collection of data points to work with. Thus, while this is a possible point of interest, I do not believe it to be sufficient grounds for the charge of bias, either.
Given the above facts, I am forced to conclude that there is insufficient evidence across the board to support the often-leveled charges of systemic bias on the part of any of the reviewing sites in question (GameSpot, IGN, GameTrailers, 1UP, and Giant Bomb). It may be the case that GameSpot slightly overscores poorly received Xbox 360 games and slightly underscores poorly received Wii games, but that is as far as it seems to go. I thus find such charges of systemic bias without merit, and I would advise all parties in the future to refrain from making such charges unless some extremely compelling evidence can be presented that would negate the above analysis.
Further study
It should be noted that, like the previous blog entry I made, this does not separate games in each console into separate categories, which is an action that might have merit. For example, one might separate games into those targeted towards casual gamers and those targeted towards hardcore gamers. Or, one might separate games into those which are retail games and those which are downloadable games. I will not be doing so for the purposes of this article, but this is something that I have considered for a future article as a possible more in-depth analysis.
In addition, though I cannot currently think of any way to statistically test the difference in the number of games reviewed for each console for bias, if I later discover a way to do so, that would be another point that could be subjected to statistical analysis as well.
Comments
Ah, everyone always notes what I haven't done...
Nah, I kid; I think you're quite right that one should not simply ignore the interesting 3% and 1% figures when all games are considered, so I've added that to the conclusion. I don't think it changes anything - I still believe that claims of pervasive bias are unfounded - but I do agree that it should be given mention as it's quite relevant.
As for the comment about the top five developers, that's actually an interesting question. Perhaps if I do a followup I might take a look at that as well.
Where are you getting the data that other reviewers thought the 360 and PS3 versions were superior? Metacritic says the opposite: Wii 6.6, Xbox 360 6.0, PS3 5.4.
Regarding the Soapbox, you just need to put a blog post you deem worthy into the "Editorial" category. That will put it in the queue of blog posts to be looked at, and if the guy from GameSpot responsible for that thinks your blog is good enough, you'll get the emblem.
@bacchus2:
I would so love to get my hands on the marketing budget for every video game, but I just plain can find no sources for that whatsoever. Everything I can find just lists total company marketing budgets, not what specific games it's spent on. I've thought of other ways I could approximately get the same thing, such as number of hours of commercial airtime or something, but I can find no sources for that either. It's a tough life for an amateur statistician...
Anyway, I play PC games so I really don't give an airborne copulation about how the console games are scored
Well, like you alluded to, this article isn't discussing the issue of specifically biased reviews, but rather the specific issue of a pervasive bias towards or against one or more of the three current-gen consoles, which is something that reviewers are often accused of. The question of whether there is such a thing as a truly unbiased review is an entirely different (and certainly good) question.
@masterlu:
I don't really disagree with what you say, but I think you're touching on the same distinction as 55592. This isn't discussing the insertion of personal opinion and taste in reviews; it's discussing the charge of pervasive site-wide bias towards or against a console. That would be a problem because, if it were true, at least one of the consoles out there would probably not be getting a fair shake from the reviewing site, and the site should address that issue. Fortunately, however, this does not appear to be the case for any of the sites.
I'm still not sure where you're getting that IGN thought the Xbox 360 version was better... check the Metacritic pages - IGN gave the Wii version a 7.2 and the Xbox 360 and PS3 versions a 4.5. That's pretty much in line with what GameSpot gave them, really.
I respect the attempt, and I REALLY respect the general principles you present at the outset. Casually, you've shown trends in scoring, but no proof of/against bias unless you also accept the hypothesis that the source of the bias is the reviewer or site, and not something more global such as an economic force that affects ALL sites and sources. In short, this is only something that you can analyze through statistics AFTER you've constructed a study that eliminates more variables.
You have shown, however, that within a fair degree of accuracy given the sites and sources you listed, the source of the Review Score is essentially meaningless, as the conclusions are often similar. That said, it's a very good read on its own, if you simply appreciate the data and analysis, and not the overall conclusions.
But not every game is for every gamer.
Could you perhaps post the formula for the z-test used to calculate the % chance of bias?
I was under the impression that the percentage was based on a site's +/- value for a console, with some accommodation given for its standard deviation. So if a site had a greater +/- value, but a low standard deviation, that would be indicative of a bias.
But while GameSpot's +0.07 with a standard deviation of 0.74 got it 3%, GameTrailers' -0.09 with a standard deviation of 0.65 got it 14%. So is there something I'm missing?
Also I like what PresidentDman said about the idea of doing something like this but for developers; if you do do a followup like that, be sure to include Eidos.
X-Play is SUPER biased when it comes to the 360
What I'm specifically addressing is the claim often heard in System Wars where, for example, GameSpot rates game X much lower than average and everyone's like "OMG GameSpot is biased!!!1". The data is pretty clear that this is nonsense. The question of whether everyone is biased is, of course, a different question.
@WaddaWaddaWadda:
The difference between the two is the sample size: GameTrailers had fewer reviews than GameSpot. The fewer data points there are, the more likely it is that a divergence from an adjusted average of 0 is due just to random chance rather than to actual bias. If you really want to read about the methodology behind z-tests, you can do so here, but I warn you that it's quite involved. There's a reason why I didn't go into it in this space.
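A rough Python sketch of that sample-size effect, using the two sets of figures quoted in the question. It assumes a two-sided, one-sample z-test; the exact variant behind the published percentages is not spelled out, so treat the output as an approximation. The helper names are made up for the example.

```python
import math

def standard_error(std_dev, n_reviews):
    # The uncertainty of an average shrinks with the square root of the sample size.
    return std_dev / math.sqrt(n_reviews)

def two_sided_p(avg, std_dev, n_reviews):
    z = avg / standard_error(std_dev, n_reviews)
    # Probability, under a standard normal curve, of landing at least |z| from 0.
    return math.erfc(abs(z) / math.sqrt(2))

# (average adjusted difference, standard deviation, number of reviews),
# taken from the all-games results.
cases = {
    "GameSpot, Xbox 360": (0.07, 0.74, 568),
    "GameTrailers, Wii": (-0.09, 0.65, 118),
}

for name, (avg, sd, n) in cases.items():
    se = standard_error(sd, n)
    p = two_sided_p(avg, sd, n)
    print(f"{name}: standard error {se:.3f}, z {avg / se:+.2f}, p {p:.3f}")
```

GameTrailers' average is further from 0, but with only 118 reviews its standard error is roughly twice GameSpot's, so the same test gives it a much larger probability. Since the published averages are rounded to two decimals, the recomputed values will not match the reported percentages exactly.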
The thing about this sort of thing is that a small amount of divergence from an average of 0 is expected purely due to random chance. You can't take the slightest divergence and then conclude, "aha, bias!" - it needs to be substantial enough a divergence that it's impossible to explain it any other way than by accepting the alternative hypothesis that there exists bias.
@MajorWaffle:
...You just read the article until you found something you liked and stopped there, didn't you?
I suppose there's nothing for it if someone really wants to be convinced of something...
@polsci1503:
Well, even if there is bias I highly doubt it's actually going to cause people to make too many purchases that they wouldn't otherwise. Even with all Xbox 360 games, the average deviation from the mean was only 0.07 - not exactly massive.