Comments/Ratings for a Single Item
Michael,

The purpose of a rating system is to measure relative differences in playing strength. I can't emphasize the word relative enough. The best way to measure relative playing strength is a holistic method that regularly takes into account all games in its database. One consequence of this is that ratings may change even when someone stops playing games. This makes the method more accurate. The Elo and CXR methods have not been holistic, because a holistic method is not feasible on the scale those systems are designed for. They have to settle for strictly sequential changes. Because GCR works in a closed environment with complete access to game logs, it does not have to settle for strictly sequential changes. It has the luxury of making global assessments of relative playing strength on the basis of how everyone is doing.

A separate issue you raised is that of a 3000-rated player losing fewer points than a 1500-rated player. Since last night, I have rethought how to use and calculate stability. Instead of basing stability on a player's rating, I can keep track of how many games have so far factored into the estimate of each player's rating. One thought is to just count the games whose results have so far directly factored into a player's rating. Another thought is to also keep track of each opponent's stability, keep a running total of this, and divide it by the number of opponents a player has so far been compared with. I'm thinking of adding these two figures together, or maybe averaging them, to recalculate the stability score of each player after each comparison. Thus, stability would be a measure of how reliable an indicator a player's past games have been of his present rating.

That covers my new thoughts on recalculating stability. As for using it, I am thinking of using both players' stability scores to weigh how much ratings may change in each direction. I am still trying to work out the details on this.
The main change is that both stability scores would affect the change in rating of both players being compared. In contrast, the present method factors in only a player's own stability score in recalculating his rating. One consequence of this is that if a mid-range rated player defeats a high-rated player, and the mid-range player has so far accumulated the higher stability score, the change will be more towards his rating than towards the high-rated player's rating. The overall effect will be to adjust ratings toward the better established ratings, making all ratings in general more accurate.
Roberto,
As you know, there is a language barrier between us. It sometimes gets in the way of understanding you, as it has with your misuse of the word notorious. I am simply asking for clarification on what you are trying to say.
Anyway, I am in agreement with the last points you have raised. A single game should not have a great effect on a player's score, and the formulas need more tweaking, but it's not an easy task.
well if you don't play games Michael, your rating will drop :) looks like mine will be dropping too he he. (i'm kinda a little shocked by that) not that i really care, but i must be bored, so: doesn't that mean that if you have two players with a 'true' rating (played many rated games) of 1500, and one of them is inactive for a bit, so his rating drops, then when these players play, it will be a game between two players where one is higher rated than the other, when in reality it should be a game between equals? wouldn't that distort ratings after the outcome?

another thing: a fair amount of the games played here are more in the spirit of TESTING OUT A VARIANT than anything else. i agree with those who said that only 'tournament games' should be rated, unless people agree otherwise beforehand.

as far as 1500 vs 3000 goes, if the 1500 rises 750 points when he wins, surely that is too much. i agree that the 3000 player should not drop 'heavily'.

finally (yawn), are we going to see people less likely to put up a challenge for fear of someone much lower rated accepting? will this lead to 'behind the scenes' arranging of games? if a vote were taken, would more people want ratings than not? sorry for the length, just adding food for thought.
Good questions. I'll defer them to Fergus, as he understands what is going on here far better than I do, and I could end up giving a wrong answer. But I do know a player who was rated about 2000. Unfortunately, he has a mental condition; he is now about 1400 and getting weaker in all cognitive areas. It is now a strain for him just to walk. Understandably, he could have quit playing chess while at 2000... but he still plays. Anyway, if he had quit at 2000, his frozen 2000 rating would certainly be false. Of course, if he quit and his rating climbed, that too would be false. It would need to drop over time to reflect reality. Would this happen with the equations Fergus is using? I don't know... We can shoot all kinds of rating situations around and argue one way or the other, but what is the point? Does it really matter? Why should we get so wrapped up in these values? They are just a means of comparison. Before we had nothing. Now we will have something. If we do not like that 'something', then we can choose the 'unrated game' option once it is implemented. We can also play in USCF tournaments, where our ratings will freeze once we quit playing.
yeah, no need to get wrapped up in it, but it would be good to get the best rating system in place. i am sure it would save Fergus a lot of hassle in the future too, if people complain, say 'other sites have a better system', etc. it will be kinda fun too, to see people have ratings; then you can see who is the 'favorite' and the 'underdog' in games. high drama :)
Michael Howe asks:
What, therefore, is the refutation of my concern that a player's rating may be retroactively affected by the performance of a past opponent whose real playing strength has increased or decreased since that player last played him?
A system that offers estimates instead of measurements is always going to be prone to making one kind of error or another. This is as true of Elo as it is of GCR. Keeping ratings static may avoid some errors, but it will introduce others. The question to ask, then, is not how to avoid this kind of error or that kind of error. The more important question is which sort of system will be the most accurate overall. Part of the answer to that question is a holistic system. Given that the system estimates relative differences in playing strength, the best way to estimate these differences is with a method that bases every rating on all available data. Because of its monodirectional, chronological nature, Elo does not do this. But the GCR method does. This allows it to provide a much more globally consistent set of ratings than Elo can with its piecemeal approach to calculating ratings. Since ratings have no meaning in themselves and mean something only in relation to each other, a high level of global consistency is the most important way in which a set of ratings can be described as accurate. And since a holistic method is the most important means of achieving this, if not outright necessary for it, a holistic method is the way to go, regardless of whatever conceivable errors might still be allowed.
The testdata field takes data in a line-by-line grid form like this:
1500 0 1 0
1500 0 0 1
1500 1 0 0
It automatically names players with letters of the alphabet. Each line begins with a rating and is followed by a series of numbers of wins against each player. The above form means that A beat B once, B beat C once, and C beat A once.
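For illustration, here is one way such a grid could be parsed. This is a hypothetical sketch in Python; the function name and data structures are mine, not part of Game Courier:

```python
import string

def parse_testdata(text):
    """Parse the testdata grid: each line is a starting rating followed
    by win counts against players A, B, C, ... in order."""
    ratings, wins = {}, {}
    for i, line in enumerate(text.strip().splitlines()):
        name = string.ascii_uppercase[i]  # players are auto-named A, B, C, ...
        fields = line.split()
        ratings[name] = int(fields[0])    # first field is the rating
        wins[name] = {string.ascii_uppercase[j]: int(w)
                      for j, w in enumerate(fields[1:])}
    return ratings, wins

sample = "1500 0 1 0\n1500 0 0 1\n1500 1 0 0"
ratings, wins = parse_testdata(sample)
print(wins["A"]["B"], wins["B"]["C"], wins["C"]["A"])  # 1 1 1
```

With the sample grid above, the parser confirms the reading given in the text: A beat B once, B beat C once, and C beat A once.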
A weighted history would work on the assumption that anyone who isn't actively playing games is deteriorating in ability. I'm not sure this is an accurate assumption to make. Also, the factors you list as causing performance to drop are going to have less effect on games played on Game Courier, because these are correspondence games played over a long period of time, and a person may take time off for illness or disinterest without it affecting his game. When it does affect someone's games, it will generally affect fewer games than it would for someone active in Chess tournaments. Also, the large timeframe over which games are played is going to make it even harder to determine what the weights should be. For these reasons, I am not convinced that a weighted history should be used here. Anyway, if you do want ratings that better reflect the current level of play among active players, you already have the option of using the Age Filter to filter out older games. I think that should suffice for this purpose.
I want to draw attention to the main change I made today. You may notice that the list of ratings now uses various background colors. Each background color identifies a different network of players. The predominant network is yellow, and the rest are other colors. Everyone in a network is connected by a chain of opponents, all of whom are also in the network. Regarding weighted histories, they probably work better for the world of competitive Chess, in which active players normally play several rated games at regular intervals. This frequency and regularity provides a basis for weighting games. But Game Courier lacks anything like this. Games here are played at the leisure of players.
I've always thought the best implementation of ratings would be an 'open-source' approach: make public the raw data that go into calculating the ratings, and allow many people to set up their own algorithms for processing the data into ratings. So users would have a 'Duniho rating' and a 'FIDE rating' and a 'McGillicuddy rating' and so on. Then users could choose to pay attention to whichever rating they think is most significant. Over time, studies would become available as to which ratings most accurately predict the outcomes of games, and certain ratings would outcompete others: a free market of ideas.

I also like the open-source approach (maybe make the raw data XML, plain-text, or both), but there should also be one built in to this site as well, so if you don't have your own implementation you can view your own.
'I also like the open-source approach (maybe make the raw data XML, plain-text, or both), but there should also be one built-in to this site as well, so if you don't have your own implementation you can view your own.'

Sure, the site should have its own 'brand' of ratings. But I mean it would be good to make ratings from many user-defined systems available here also. Just as the system allows users to design their own webpages (subject to editorial review) and their own game implementations, there could be a system whereby users could design their own rating systems, and any or all of these systems could be available here at CVP to anyone who wants to view them, study their predictive value, use them for tournament matchings, etc. Of course, it's much easier to suggest a system of multiple user-defined rating schemes (hey, we could call it MUDRATS) than to do the work of implementing it. But if enough people consider the idea and feel it has merit, eventually someone will set it up someplace and it will catch on.
No, I have no intention of doing anything along the lines of Mark Thompson's 'open source' suggestion. While some people might like to play with their own ratings system, most people are simply going to want one standard ratings system without the fuss and bother of choosing among many. Also, if I set something up to freely let people create their own ratings systems, there would soon be many bad ratings systems for people to choose from. As for letting a free market choose the best one, it wouldn't work like that. Without the benefit of serious investigation into them, there would be little basis for informed decisions. A bad system could become popular as easily as a good one. Consider how well things like astrology and numerology fare under a free market. A free market is no substitute for scientific investigation. If people are interested in Game Courier having as good a ratings system as it can, then they are free to offer comments and suggestions. I have described the method in algorithmic detail on this page, and I have also further discussed what it does and have compared it with Elo.
What happens when a game you've won, and it says 'You have won' in the game log, doesn't show up in the calculations, even though you can call it up by name from the game logs with your password, and it shows as a win when you list all your games? The game in question is Omega Chess oncljan-joejoyce-2005-96-245. Admittedly, it's not a good win, but it balances out one of the almost-won games where my opponent disappeared just before the end. (I see the value of timed games now.) Actually, I hadn't brought it up before because it is such a poor win that I didn't feel I deserved it, but I realized that if it were included, I just might get up to 1500 briefly, before I lose to Carlos, David, Gary..., and that'd be a kick for someone who's only been playing a year or so after a 30-40 year break, depending on how you wish to count time off.

I will say the ratings have brought out everyone's competitive spirits. As for me, I'll happily carry a general rating that takes in all my games: playtests, coffee-house, and tournament. But since people are asking for so many things, I'd like to add one more. Would it be possible or practical to allow people to choose one or more subsets of games for a specific rating? For example, I am currently playing several variants of shatranj, one of which is 'grand shatranj'. Could I be allowed to put any number of game names into a 'Rate these games only' field, so I could get a combined rating for, say, 6 shatranj variants plus Grand Chess? And then another for the 'big board' games, and so on?
Joe, GCR reads data only from public games, not from private games. That's why one of your private games is not factoring into the calculations. I have now added a group filter that lets users select certain groups, and I have extended the Game Filter to recognize comma separated lists. To list games in the Game Filter, separate each game by a single comma, nothing more or less, and don't use any wildcards.
I have now modified the reliability and stability formulas to these:
$r1 = ($n / ($n + 9)) - (($gamessofar[$p1] + 1) / (10*$gamessofar[$p1] + 100));
$r2 = ($n / ($n + 9)) - (($gamessofar[$p2] + 1) / (10*$gamessofar[$p2] + 100));
$s1 = (($gamessofar[$p1] + 1) / ($gamessofar[$p1] + 5)) - ($n / (5*$n + 100));
$s2 = (($gamessofar[$p2] + 1) / ($gamessofar[$p2] + 5)) - ($n / (5*$n + 100));
$n is the number of games two players have played together, and $gamessofar holds the number of games that have so far factored into the rating of each player. I have modified each formula by subtracting a small fraction based on what determines the other. The fraction subtracted from reliability has a limit of .1, which is otherwise the lower boundary of reliability. The fraction subtracted from stability has a limit of .2, which is otherwise the lower boundary of stability. These have been introduced as erosion factors. A very high stability or reliability erodes the other, and may do so to a limit of zero. Thus, as one person wins more games against another, the winner's point gain approaches ever closer to a limit of 400. Likewise, winning single games against many separate individuals also allows a player's point gain to get closer to a limit of 400. Also, these changes have increased the accuracy slightly.
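For readers who don't follow PHP, here is a direct Python translation of the two formulas above (the function names are mine). The sample values just illustrate the erosion limits described: as $gamessofar grows, (g+1)/(10g+100) approaches 1/10, and as $n grows, n/(5n+100) approaches 1/5.

```python
def reliability(n, games_so_far):
    # n: games the two players have played together
    # games_so_far: games already factored into this player's rating
    return n / (n + 9) - (games_so_far + 1) / (10 * games_so_far + 100)

def stability(n, games_so_far):
    return (games_so_far + 1) / (games_so_far + 5) - n / (5 * n + 100)

# A new player's first game against a new opponent:
print(round(reliability(1, 0), 4))  # 0.09   (1/10 - 1/100)
print(round(stability(1, 0), 4))    # 0.1905 (1/5 - 1/105)
```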
Fergus, I suggest you use a different rating system, especially considering how your current one is pretty arbitrary (we can nitpick about the 400 point difference as opposed to a 500 or 600 point difference, but we would do so knowing in advance that one number is just as arbitrary as another), and how it appears to be designed to judge people's 'future performance' based upon observations of previous games that users were told wouldn't count. (Although that's really not /that/ big of a deal.) And if you encouraged users to add their computer programs to the fray, the ratings, as such, would add an extra dimension of utility.
By necessity, any rating system is going to have a degree of arbitrariness to it, for some unit of measurement must be used, and there is no hard and fast reason for preferring one over another. But that is no reason at all against any particular method. As for the 400 figure, that is at least rooted in tradition. This same figure is used by the Elo method, which has already established itself as the most popular rating method. As for including computer opponents, you are free to play games in which you enter the moves made by a computer. If you do that, it would be best to create a separate userid for the computer opponent. But Game Courier does not provide any computer opponents, and I don't consider their inclusion in the ratings important. Finally, the filters let you filter out games that are not officially rated. So it's a moot point whether the calculations factor in unrated games. They factor them in only if you choose not to filter them out.
I have a suggestion. Is it possible to have a maximum number of points that a player can gain or lose per game? I am thinking of a maximum change per game of around 10 or 20 points, because there are many players listed here who have only played one or two games but they have highly inflated or deflated ratings. Hats off to Jeremy Good who apparently has completed more games here than anyone else, looks like 250 completed games, and counting!
The good news is that the reason this didn't work sometimes was not too many files but a monkey wrench thrown into one of the log files. With that file renamed, it's not being read, and this page generates ratings even when set to give ratings for all public games. The bad news is that it seems to be undercounting the games played by people. I checked out a player it said had played only one game, and the logs page listed 23 games he has finished playing. I was also skeptical that I had played only 62 games; I counted and saw that I had played several more. So that has to be fixed. And since I have made the new FinishedGames database table, I will eventually rewrite this to use that instead of reading the files directly.
This script now reads the database instead of individual logs, and some bugs have been fixed. For one thing, it shouldn't be missing games anymore, as I was complaining about in a previous comment. Also, I found some functions for dealing with mixed character encodings in the database. Some years ago, I tried to start converting everything to UTF-8, but I never finished that. This led to multiple character encodings in the database. By using one function to detect the character encoding and another to convert whatever was detected to UTF-8, I'm now getting everyone's name to show up correctly. One of the practical changes is the switch from Unix wildcards to SQL wildcards. Basically, use % instead of *, and use _ instead of ?. One more thing. I moved this script from play/pbmlogs/ to play/pbm/. It was in the former only because it had to read the logs. Now that it doesn't, it seems more logical to put it in play/pbm/. The old script is still at its old location if you want to compare.
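As a rough illustration of the two-step detect-and-convert approach described above, here is a simplified Python stand-in (this is not the actual PHP code, and the function name is mine). Latin-1 serves as the example fallback encoding because it accepts any byte sequence:

```python
def to_utf8(raw: bytes) -> str:
    """Try UTF-8 first; if the bytes aren't valid UTF-8, fall back to
    a legacy single-byte encoding and convert from that instead."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

# Both encodings of the same name come out the same way:
print(to_utf8("Hernández".encode("utf-8")))    # Hernández
print(to_utf8("Hernández".encode("latin-1")))  # Hernández
```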
I have also modified groups to work with mysql, and one new feature that helps with groups is that it shows the sql of the search it does. This lets you see what Chess variants are in a group. Most of the groups are based on the tiers I made in the Recognized variants. These may not be too helpful, since they are not of related games. The Capablanca group, which I just expanded, seems to be the most useful group here, since it groups together similar games. What I would like to do is add more groups of related games. I'm open to suggestions.
It looks like everything's been fixed! Well done, Fergus, and thank you! I see that the Finished Games database also allowed for the creation of a page listing Game Courier's top 50 most-played games, which is a very nice addition. Now I guess I have to see if I can catch Hexa Sakk... ; )
It's now possible to list multiple games in the Game Filter field. Just comma-separate them and don't use wildcards.
It is now possible to use wildcards within comma-separated lists of games. Also, Unix style wildcards are now converted to SQL style wildcards. So you can use either.
I'm thinking of tweaking the way the GCR is calculated. As it is right now, the value that is going to grow the quickest is a player's past games. This affects the stability value, which is already designed to near the limit of one more quickly than reliability ever will. Even if games with the current opponent and one's past games remained equal in number, stability would grow more quickly than reliability. But after the first opponent, one's past games will usually outnumber one's games with the current opponent. Besides this, gravity is based on stability scores, and as stability scores for both opponents quickly near the limit of one, gravity becomes fairly insignificant. Given that past games will usually outnumber games played against the current opponent, it makes sense for reliability to grow more quickly than stability.
I'm rethinking this even more. I was reading about Elo, and I realized its main feature is a self-correcting mechanism, sort of like evolution. Having written about evolution fairly extensively in recent years, I'm aware of how it's a simple self-correcting process that gets results. So I want a ratings system that is more modeled after evolution, using self-correction to get closer to accurate results.

So let's start with a comparison between expectations and results. The ratings for two players serve as a basis for predicting the percentage of games each should win against the other. Calculate this and compare it to the actual results. The GCR currently does it backward from this. Given the results, it estimates new ratings, then figures out how much to adjust present ratings toward the new ratings. The problem with this is that different pairs of ratings can predict the same results, whereas any pair of ratings predicts only one outcome. It is better to go with known factors predicting a single outcome. Going the other way requires some arbitrary decision making.

If there is no difference between predicted outcome and actual outcome, adjustments should be minimal, perhaps even zero. If there is a difference, ratings should be adjusted more. The maximum difference is if one player is predicted to win every time, and the other player wins every time. Let's call this 100% difference. This would be the case if one rating was 400 points or more higher than the other. The maximum change to their scores should be 400 points, raising the lower by 400 points and decreasing the higher by 400. So the actual change may be expressed as a limit that approaches 400. Furthermore, the change should never be greater than the discrepancy between predictions and outcomes. The discrepancy can always be measured as a percentage between 0% and 100%. The maximum change should be that percentage of 400. But it wouldn't be fair to give the maximum change for only a single game.
The actual change should be a function of the games played together. This function may be described as a limit that reaches the maximum change as they play more games together. This is a measure of the reliability of the results. At this point, the decision concerning where to set different levels of reliability seems arbitrary. Let's say that at 10 games, it is 50% reliable, and at 100 games it is near 100% reliable. So, Games/(Games + 10) works for this. At 10, 10/20 is .5, and at 100, 100/110 is .90909090909. This would give 1 game a reliability of .090909090909, which is almost 10%.

So, for one game with 100% difference between predictions and results, the change would be 36.363636363636. This is a bit over half of what the change currently is for two players with ratings of 1500 when one wins and the other loses. Currently, the winner's rating rises to 1564, while the loser's goes down to 1435. With both players at 1500, the predicted outcome would be that both win equally as many games or draw a game. Any outcome where someone won all games would differ from the predicted outcome by 50%, making the maximum change only 200, and for a single game, that change would be 18.1818181818. This seems like a more reasonable change for a single game between 1500-rated players.

Now the question comes up whether anything like stability or gravity should factor into how the scores change. Apparently the USCF uses something called a K-factor, which is a measure of how many games one's current rating is based on. This corresponds to what I have called stability. Let's start with maximums. What should be the maximum amount that stability should minimize the change to a score? Again, this seems like an arbitrary call. Perhaps 50% would be a good maximum. And at what point should a player's rating receive that much protection? Or, since this may be a limit, at what point should change to a player's rating be minimized by half as much, which is 25%? Let's say 200 games.
So, Games/(Games + 600) works for this. At 200, it gives 200/800. At 400, it gives 400/1000. And what about gravity? Since gravity is a function of stability, maybe it adds nothing significant. If one player has high stability and the other doesn't, the one whose rating is less stable will already change more. So, gravity can probably be left out of the calculation.
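Putting the numbers from the last few comments together, the proposed change calculation can be sketched like this (the function names are mine, and stability is left out, as just discussed for gravity). The two printed values reproduce the worked examples above: one game with 100% discrepancy, and one game between two 1500-rated players (50% discrepancy):

```python
def reliability(games):
    # Approaches 1 as two players play more games together; .5 at 10 games.
    return games / (games + 10)

def max_change(discrepancy, games):
    # discrepancy: difference between predicted and actual outcome, 0.0 to 1.0.
    # The change is capped at that fraction of 400, scaled by reliability.
    return 400 * discrepancy * reliability(games)

print(round(max_change(1.0, 1), 4))  # 36.3636 -- one game, 100% discrepancy
print(round(max_change(0.5, 1), 4))  # 18.1818 -- two 1500s, one game won
```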
So far, the current method is still getting higher accuracy scores than the new method I described. Maybe gravity does matter. This is the idea that if one player's rating is based on several games, and the other player's rating isn't, the rating of the player with fewer games should change even more than it would if their past number of games were equal. This allows the system to get a better fix on a player's ability by adjusting his rating more when he plays against opponents with better established ratings.
I've been more closely comparing different approaches to the ratings. One is the new approach I described at length earlier, and one is tweaking the stability value. In tweaking the stability value, I could increase the accuracy measurement by raising the number of past games required for a high stability score. But this came at a cost. I noticed that some players who had played only a few games quickly got high ratings. Perhaps they had played a few games against high-rated players and won them all. Still, this seemed unfair. Maybe the rating really was reflective of their playing abilities, but it's hard to be sure about this, and their high ratings for only a few games seemed unearned.

The new rating method put a stop to this. It made high ratings something to be earned through playing many games. Its highest rated players were all people who had played several games. Its highest rating for someone whose games numbered in the single digits was 1621, for someone who had won 8.5 out of 9 games. In contrast, the tweaked system gave 1824 to someone who won 4 out of 4 games, placing him 5th in the overall rankings. The current system, which has been in place for years, gave 1696 and 1679 to the people who won 8.5/9 and 4/4 respectively.

In the ratings for all games, the new system gets a lower accuracy score by less than 2%. That's not much of a difference. In Chess, it gets the higher accuracy score. In some other games, it gets a lower score by a few percentage points. Generally, it's close enough, but it has the advantage of reducing unearned high ratings, which gives it a greater appearance of fairness. So I may switch over to it soon.
I have switched the ratings system to the new method, because it is fairer. Details on the new system can be found on the page. I have included a link to the old ratings system, which will let you compare them.
Hi Fergus, I lost a game of Sac Chess to Carlos quite some time ago. I thought that it was to be rated, but as far as I can tell my rating is based on only 1 game (a win at Symmetric Glinski's Hexagonal Chess vs. Carlos). I don't know if the ratings have been updated to take into account my Sac Chess loss, but I thought I'd let you know, even though I don't plan to play on Game Courier again anytime soon, most likely.
Kevin,
I just recreated the FinishedGames table, and your Sac Chess game against Carlos is now listed there. I'm not sure why it didn't get in before, but I have been fixing up the code for entering finished games into this table, and hopefully something like this won't happen again. But if it does, let me know.
Things are getting weird. When I looked at Kevin Pacey's rating, I noticed it was still based on one game, not two. For some reason, the game he won was not getting added to the database. At the time, I was using the REPLACE command to populate the database, and it was failing silently. So, I Truncated the table, changed REPLACE to INSERT, and recreated the table. This time, the game he won got in, but the game he lost did not. Maybe this game didn't make it in originally because of some mysterious problem with how INSERT works. It is frustrating that the MySQL commands are not performing reliably and are failing silently. If it weren't for noticing these specific logs, I would be unaware of the problem.
I changed INSERT back to REPLACE and ran the script for creating the FinishedGames table again. This time, the log for the game Kevin lost got in, and the log for the game he won vanished even though I did not Truncate the table prior to doing this. Also, the total number of rows in the table did not change.
I finally figured out the problem and got the logs for both of Kevin's games into the FinishedGames table. The problem was that both logs had the same name, and the table was set up to require each log to have a unique name. So I ALTERed the table to remove all keys, then I made the primary key the combination of Log + Game. Different logs had been recorded with INSERT and REPLACE, because INSERT would go with the first log it found with the same name, and REPLACE would replace any previous entries for the same log name with the last one. This change increased the size of the table from 4773 rows to 4883 rows.
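The failure mode and the fix can be demonstrated with a small sketch. SQLite's INSERT OR REPLACE stands in here for MySQL's REPLACE, and the log name is hypothetical; the point is only how the choice of primary key changes what survives:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# With a unique key on Log alone, two games that share a log name
# collide: REPLACE silently keeps only the last one inserted.
cur.execute("CREATE TABLE FinishedGames (Log TEXT, Game TEXT, PRIMARY KEY (Log))")
cur.execute("INSERT OR REPLACE INTO FinishedGames VALUES ('kevin-carlos-1', 'Sac Chess')")
cur.execute("INSERT OR REPLACE INTO FinishedGames VALUES ('kevin-carlos-1', 'Hexagonal Chess')")
rows_single = cur.execute("SELECT COUNT(*) FROM FinishedGames").fetchone()[0]
print(rows_single)  # 1 -- one game silently vanished

# With the primary key on the combination (Log, Game), both survive.
cur.execute("DROP TABLE FinishedGames")
cur.execute("CREATE TABLE FinishedGames (Log TEXT, Game TEXT, PRIMARY KEY (Log, Game))")
cur.execute("INSERT OR REPLACE INTO FinishedGames VALUES ('kevin-carlos-1', 'Sac Chess')")
cur.execute("INSERT OR REPLACE INTO FinishedGames VALUES ('kevin-carlos-1', 'Hexagonal Chess')")
rows_composite = cur.execute("SELECT COUNT(*) FROM FinishedGames").fetchone()[0]
print(rows_composite)  # 2 -- both games are kept
```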
The rating system could be off. I'm not sure if ratings should change instantly, meaning that once any game is finished, the rating is recalculated for the 2 players in question :)! Anyway, yesterday a few games of mine (I think 3) finished, and the ratings have not changed. Mine should probably have ended up a bit below 1530.
Ratings are calculated holistically, and they are designed to become more stable the more games you play. You can read the details on the ratings page for more on how they work differently than Elo ratings.
I did read the rules, but I have not understood them. It seemed to me that they do not look like Elo ratings, though. Anyway, Fergus, are you saying that they work fine?
@Fergus,
I think I know what is going on with my ratings. By now I already have quite a few games played, and losing to a very high rated opponent or winning against a very low rated opponent does not mean much to the algorithm in terms of correcting my rating. I think this is how it is supposed to work.
So, are you using a system of equations where the unknowns are the ratings, and the coefficients are based on the results :)?!...
Aurelian, I have moved this discussion to the relevant page.
By now I already have quite a few games played, and losing to a very high rated opponent or winning against a very low rated opponent does not mean much to the algorithm in terms of correcting my rating. I think this is how it is supposed to work.
Yes, it is supposed to work that way.
So, are you using a system of equations where the unknowns are the ratings, and the coefficients are based on the results :)?!...
I'm using an algorithm, which is a series of instructions, not a system of equations, and the ratings are never treated as unknowns that have to be solved for. Everyone starts out with a rating of 1500, and the algorithm fine-tunes each player's rating as it processes the outcomes of the games. Instead of processing every game chronologically, as Elo does, it processes all games between the same two players at once.
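The grouping step described here can be sketched as follows, with made-up player names and results (this is illustrative Python, not the actual GCR code):

```python
from collections import defaultdict

# Hypothetical game records: (white, black, result for white: 1 win, 0 loss).
games = [
    ("alice", "bob", 1), ("bob", "alice", 0),
    ("alice", "carol", 1), ("carol", "alice", 1),
]

# Unlike Elo, which walks games one by one in chronological order, the
# described algorithm handles all games between the same two players at once.
by_pair = defaultdict(list)
for white, black, result in games:
    by_pair[frozenset((white, black))].append((white, black, result))

for pair, pair_games in by_pair.items():
    print(sorted(pair), len(pair_games))
```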
Ok, I'm starting to understand it, thanks for the clarifications :)!
Perhaps the Game Courier rating system could someday be altered to somehow take into account the number of times a particular chess variant has been played by a particular player, and/or between him and a particular opponent, when calculating the overall public (or rated) games played rating for a particular player.
I used to think that players that play a more diverse assortment of variants are disadvantaged by the current system, but probably it is not a big deal.
Also, maybe larger games with more pieces should matter more, as they are definitely more demanding.
But both these things are quite difficult to do without a wide variety of statistics, which we cannot have at this time :)!
Actually, I found that when I played competitively a few years ago, the more different games I played, the better I played in all of them, in general. This did not extend to games like Ultima or Latrunculi, but it did apply to all the chesslike variants, as far as I can tell.
There probably is something akin to general understanding :)!
This script has just been converted from mysql to PDO. One little tweak in the conversion is that if you mix wildcards with names, the SQL will use LIKE or = for each item where appropriate. So, if you enter "%Chess,Shogi", it will use "AND (Game LIKE '%Chess' OR Game = 'Shogi' )" instead of "AND (Game LIKE '%Chess' OR Game LIKE 'Shogi' )".
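The described behavior can be sketched like this (illustrative Python, not the site's actual PHP/PDO code; the function name is made up):

```python
def game_filter(spec):
    """Build a WHERE fragment for a comma-separated list of game names,
    using LIKE for items containing a SQL wildcard and = otherwise.
    A sketch of the behavior described, not the real script."""
    clauses = []
    for item in spec.split(","):
        item = item.strip()
        op = "LIKE" if "%" in item else "="
        clauses.append(f"Game {op} '{item}'")
    return "AND (" + " OR ".join(clauses) + ")"

print(game_filter("%Chess,Shogi"))
# AND (Game LIKE '%Chess' OR Game = 'Shogi')
```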
Perhaps the Game Courier rating system could someday be altered to somehow take into account the number of times a particular chess variant has been played by a particular player, and/or between him and a particular opponent, when calculating the overall public (or rated) games played rating for a particular player.
First of all, the ratings script can be used for a specific game. When used this way, all its calculations will pertain to that particular game. But when it is used with a wildcard or with a list of games, it will base calculations on all included games without distinguishing between them.
Assuming it is being used for a specific game, the number of times two players have played that game together will be factored into the calculation. The general idea is that the more times two players play together, the larger the effect that their results will have on the calculation. After a maximum amount of change to their ratings is calculated, it is recalculated further by multiplying it by n/(n+10) where n is the number of games they have played together. As n increases, n/(n+10) will increase too, ever getting nearer to the limit of 1. For example n=1 gives us 1/11, n=2 gives us 2/12 or 1/6, n=3 gives us 3/13, ... n=90 gives us 90/100 or 9/10, and so on.
During the calculation, pairs of players are gone through sequentially. At any point in this calculation, it remembers how many games it has gone through for each player. The more games a player has played so far, the more stable his rating becomes. After the maximum change to each player's rating is calculated as described above, it is further modified by the number of games each player has played. Using p for the number of games a player has played, the maximum change gets multiplied by 1-(p/(p+800)). As p increases, so does p/(p+800), getting gradually closer to 1. But since this gets subtracted from 1, that means that 1-(p/(p+800)) keeps getting smaller as p increases, ever approaching but never reaching the limit of zero. So, the more games someone has played, the less his rating gets changed by results between himself and another player.
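The two multipliers described above can be computed as follows (a simple illustration, not the actual ratings script):

```python
def pair_trust(n):
    """n/(n+10): grows toward 1 as two players play more games together."""
    return n / (n + 10)

def stability_damper(p):
    """1-(p/(p+800)): shrinks toward 0 as a player accumulates games,
    so well-established ratings move less."""
    return 1 - p / (p + 800)

print(pair_trust(1))          # ≈ 0.0909 (1/11)
print(pair_trust(90))         # 0.9 (90/100)
print(stability_damper(0))    # 1.0
print(stability_damper(800))  # 0.5
```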
Since p is a value that keeps increasing as the calculation of ratings proceeds, and its maximum value is not known until the calculations are finished, the entire set of calculations is done again in reverse order, and the two sets of results are averaged. This irons out the advantages any player gains from the order of calculations, and it ensures that every player's rating is based on every game he played.
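A structural sketch of the forward/reverse two-pass idea. The `toy_adjust` update rule here is a deliberately simplified placeholder, not GCR's actual adjustment:

```python
def run_pass(pairs, adjust):
    """Process pair results in the given order, returning final ratings.
    `adjust` maps (rating_a, rating_b, score_a) to a rating delta for a."""
    ratings = {}
    for a, b, score_a in pairs:
        ra = ratings.setdefault(a, 1500)
        rb = ratings.setdefault(b, 1500)
        delta = adjust(ra, rb, score_a)
        ratings[a] = ra + delta
        ratings[b] = rb - delta
    return ratings

def toy_adjust(ra, rb, score_a):
    # Placeholder rule: nudge ratings toward values that would have
    # predicted score_a. The real algorithm is more involved.
    expected = 1 / (1 + 10 ** ((rb - ra) / 400))
    return 32 * (score_a - expected)

pairs = [("alice", "bob", 1.0), ("bob", "carol", 0.5), ("alice", "carol", 1.0)]

# Run once forward, once in reverse, and average the two results to cancel
# out order-of-processing effects.
forward = run_pass(pairs, toy_adjust)
backward = run_pass(list(reversed(pairs)), toy_adjust)
averaged = {p: (forward[p] + backward[p]) / 2 for p in forward}
print({p: round(r, 1) for p, r in sorted(averaged.items())})
```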
As I examine my description of the algorithm, the one thing that seems to be missing is that a player who has played more games should have a destabilizing effect on the other player's rating, not just a stabilizing effect on his own rating. So, if P1 has played 100 games, and P2 has played 10, this should cause P2's rating to change even more than it would if P1 had also played only 10 games. At present, it looks like the number of games one's opponent has played has no effect on one's own rating. I'll have to examine the code and check whether it really matches the text description I was referring to while writing this response.
I was wondering along the lines of do we want a Game Courier rating system that rewards players for trying out a greater number of chess variants with presets. There are many presets that have barely been tried, if at all. However, conversely this could well 'punish' players who choose to specialize in playing only a small number of chess variants, perhaps for their whole Game Courier 'playing career'. [edit: in any case it seems, if I'm understanding right, the current GC rating system may 'punish' the winner (a player, or a userid at least) of a given game between 2 particular players who have already played each other many times, by not awarding what might otherwise be a lot of rating points for winning the given game in question.]
I was wondering along the lines of do we want a Game Courier rating system that rewards players for trying out a greater number of chess variants with presets.
Such a ratings system would be more complicated and work differently than the current one. The present system can work for a single game or for a set of games, but when it does work with a set of games, it treats them all as though they were the same game.
However, conversely this could well 'punish' players who choose to specialize in playing only a small number of chess variants, perhaps for their whole Game Courier 'playing career'.
Yes, that would be the result. Presently, someone who specializes in a small set of games, such as Francis Fahys, can gain a high general GCR by doing well in those games.
in any case it seems, if I'm understanding right, the current GC rating system may 'punish' the winner of a given game between 2 particular players who have already played each other many times, by not awarding what might otherwise be a lot of rating points for winning the given game in question.
Game Courier ratings are not calculated on a game-by-game basis. For each pair of players, all the games played between them factor into the calculation simultaneously. Also, it is not designed to "award" points to players. It works very differently than Elo, and if you begin with Elo as your model for how a ratings system works, you could get some wrong ideas about how GCR works. GCR works through a trial-and-error method of adjusting ratings between two players to better match the ratings that would accurately predict the outcome of the games played between them. The number of games played between two players affects the size of this adjustment. Given the same outcome, a smaller adjustment is made when they have played few games together, and a larger adjustment is made when they have played several games together.
Getting back to your suggestion, one thought I'm having is to put greater trust in results that come from playing the same game and to put less trust in results that come from playing different games together. More trust would result in a greater adjustment, while less trust would result in a smaller adjustment. The rationale behind this is that results for the same game are more predictive of relative playing ability, whereas results from different games are more independent of each other. But it is not clear that this would reward playing many variants. If someone played only a few games, the greater adjustments would lead to more extreme scores. This would reward people who do well in the few variants they play, though it would punish people who do poorly in those games. However, if someone played a wide variety of variants, smaller adjustments would keep his rating from rising as fast if he is doing well, and they would keep it from sinking as fast if he is not doing well. So, while this change would not unilaterally reward players of many variants over players of fewer variants, it would decrease the cost of losing in multiple variants.
How would you measure the diversity of games played between two players? Suppose X1 and Y1 play five games of Chess, 2 of Shogi, and 1 each of Xiang Qi, Smess, and Grand Chess. Then we have X2 and Y2, who play 3 games of Chess, 3 of Shogi, 2 of Xiang Qi, and 1 each of Smess and Grand Chess. Finally, X3 and Y3 have played two games each of the five games the other pairs have played. Each pair of players has played ten games of the same five games. For each pair, I want to calculate a trust value between a limit of 0 and a limit of 1, which I would then multiply by the maximum adjustment value to get a lower adjustment value.
Presently, the formula n/(n+10) is used, where n is the number of games played between them. In this case, n is 10, and the value of n/(n+10) is 10/20 or 1/2. One thought is to add up fractions that use the number of games played of each game.
X1 and Y1
5/(5+10)+2/(2+10)+1/(1+10)+1/(1+10)+1/(1+10) = 5/15+2/12+1/11+1/11+1/11 = 17/22 = 0.772727272
X2 and Y2
3/13 + 3/13 + 2/12 + 1/11 + 1/11 = 6/13 + 2/12 + 2/11 = 695/858 = 0.81002331
X3 and Y3
2/12 * 5 = 10/12 = 0.833333333
The result of this is to put greater trust in a diverse set of games than in a less diverse set, yet this is the opposite of what I was going for.
How would this change if I changed the constant 10 to a different value? Let's try 5.
X1 and Y1
5/(5+5)+2/(2+5)+1/(1+5)+1/(1+5)+1/(1+5) = 5/10+2/7+3/6 = 1 2/7 = 1.285714286
Since this raises the value above 1, it's not acceptable. Let's try changing 10 to 20.
X1 and Y1
5/(5+20)+2/(2+20)+1/(1+20)+1/(1+20)+1/(1+20) = 5/25+2/22+1/21+1/21+1/21 = 167/385 = 0.433766233
X2 and Y2
3/23 + 3/23 + 2/22 + 1/21 + 1/21 = 6/23 + 2/22 + 2/21 = 2375/5313 = 0.447016751
X3 and Y3
2/22 * 5 = 10/22 = 0.454545454
This follows the same pattern, though the trust values are lower. To clearly see the difference, look at X2 and Y2, and compare 2/22, which is for two games of Xiang Qi, with 2/21, which is for one game each of Smess and Grand Chess. 2/22 is the smaller number, which indicates that it is giving lower trust scores for the same game played twice.
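These trust sums can be checked with exact fractions (illustrative Python; the per-variant game counts are those of the examples above):

```python
from fractions import Fraction

def trust(counts, k=10):
    """Sum n_i/(n_i+k) over the per-game counts for one pair of players."""
    return sum(Fraction(n, n + k) for n in counts)

print(trust([5, 2, 1, 1, 1]))      # 17/22  (X1 and Y1)
print(trust([3, 3, 2, 1, 1]))      # 695/858 (X2 and Y2)
print(trust([2, 2, 2, 2, 2]))      # 5/6 (= 10/12, X3 and Y3)
print(trust([5, 2, 1, 1, 1], 20))  # 167/385 (X1 and Y1 with k = 20)
```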
Since it is late, I'll think about this more later. In the meantime, maybe somebody else has another suggestion.
Presently, the more games two players play together, the greater the amount of trust that is given to the outcome of their games, but each additional game they play together adds a smaller amount of trust. This is why players playing the same game together would produce a smaller increase in trust than players playing different games together in the calculations I was trying out in my previous comment. Since this is how the calculations naturally fall, is there a rationale for doing it this way instead of what I earlier proposed? If one player does well against another in multiple games, this could be more indicative of general Chess variant playing ability, whereas if one does well against another mainly in one particular game but plays that game a lot, this may merely indicate mastery in that one game instead of general skill in playing Chess variants, and that may be due to specialized knowledge rather than general intelligence. The result of doing it this way is that players who played lots of different games could see a faster rise in their ratings than a player who specialized in only a few games. However, people who specialized in only a few games would also see slower drops in their ratings if they do poorly. For each side, there would be some give and take. But if we want to give higher ratings to people who do well in many variants, then this might be the way to do it.
Hi Fergus
Note I did put a late, second, edit to my previous post, mentioning the small distinction that we're talking about specific userids rather than specific players. I made this distinction since it's possible (and evident in some cases already on GC's ratings list) that people can have more than one userid, hence more than one Game Courier rating. While presumably it would be tough to prevent this if desired, starting a new rating from scratch at least does not guarantee a player that he will get a higher one after many games (a long time ago, it may be worth noting, the Chess Federation of Canada allowed a given player to at least once effectively destroy his existing rating and begin again from scratch, perhaps for a $ price).
It goes without saying that I am talking about userids. The script is unable to distinguish players by anything other than userid, and it has no confirmed data on which players are using multiple userids. All I can do about this is discourage the use of multiple userids so that this doesn't become much of a factor. But if someone wants to play games with multiple userids, he presumably has a reason for wanting to keep separate ratings for different games.
One concern I had was that adding up fractions for the number of times two players played each separate game could eventually add up to a value greater than 1. For example, if two players played 12 different games together, the total would be 12 * (1/11) or 12/11, which is greater than 1. One way to get around this is to divide the total by the number of different games played. Let's see how this affects my original scenarios:
X1 and Y1
5/(5+10)+2/(2+10)+1/(1+10)+1/(1+10)+1/(1+10) = 5/15+2/12+1/11+1/11+1/11 = 17/22 = 0.772727272
17/22 * 1/5 = 17/110 = 0.154545454
X2 and Y2
3/13 + 3/13 + 2/12 + 1/12 + 1/11 = 6/13 + 2/12 + 2/11 = 695/858 = 0.81002331
695/858 * 1/5 = 695/4290 = 0.162004662
X3 and Y3
2/12 * 5 = 10/12 = 0.833333333
10/12 * 1/5 = 10/60 = 0.1666666666
As before, these values are greater where the diversity is more evenly spread out, which is to say more homogeneous.
However, the number of different games played was fixed at 5 in these examples, and the number of total games played was fixed at 10. Other examples need to be tested.
Consider two players who play 20 individual games once each and two others who play 10 individual games twice each. Each pair has played 20 games total.
Scenario 1: 20 different games
(20 * 1/11) / 20 = 20/11 * 1/20 = 1/11
Scenario 2: 10 different games twice
(10 * 2/12)/10 = 20/12 * 1/10 = 2/12 = 1/6
Applying the same formula to these two scenarios, the 20 different games have no more influence than a single game, which is very bad. This would severely limit the ratings of people who are playing a variety of games. So, if diversity of games played is to be factored in, something else will have to be done.
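The normalized calculation used in these scenarios, sketched with exact fractions (illustrative only):

```python
from fractions import Fraction

def normalized_trust(counts, k=10):
    """Average of n_i/(n_i+k): dividing by the number of distinct games
    keeps the total below 1, at the cost of diluting diverse play."""
    return sum(Fraction(n, n + k) for n in counts) / len(counts)

print(normalized_trust([1] * 20))         # 1/11: 20 distinct games count like one
print(normalized_trust([2] * 10))         # 1/6: 10 games played twice each
print(normalized_trust([5, 2, 1, 1, 1]))  # 17/110 (X1 and Y1)
```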
The problem is that the importance of diversity is not as clear as the importance of quantity. It is clear that the more games two players have played together, the more likely it is that the outcome of their games is representative of their relative playing abilities. But whether those games are mixed or the same does not bear so clearly on how likely it is that the outcome of the games played reflects their relative playing abilities. With quantity as a single factor, it is easy enough to use a formula that returns a value that gets closer to 1 as the quantity increases. But with two factors, quantity and diversity, it becomes much less clear how they should interact. Furthermore, diversity is not simply about how many different games are played but also about how evenly the diversity is distributed, what I call the homogeneity of diversity. When I think about it, homogeneity of diversity sounds like a paradoxical concept. The X3 and Y3 example has a greater homogeneity of diversity than the other two, but an example where X4 and Y4 play Chess 10 times has an even greater homogeneity of diversity but much less diversity. Because of these complications in measuring diversity, I'm feeling inclined to not factor it in.
The most important part of the GCR method is the use of trial-and-error. Thanks to the self-correcting nature of trial-and-error, the difference that factoring in diversity could make is not going to have a large effect on the final outcome. So, unless someone can think of a better way to include a measure of diversity, it may be best to leave it out.
If left as is, is the current rating system at least somewhat kind to a player who suddenly improves a lot (e.g. through study or practice), but who has already played a lot of games on Game Courier? I'm not so sure, even if said player from then on plays much more often vs. players he hasn't much played against before on GC.
I was thinking that older results (old both in time and in number of games since) should maybe fade away. Is that very difficult to implement, Fergus? It seems fairer, but the trouble is that you need many games at the "same time" to make the ratings meaningful, and with the current population that cannot easily be done :)!
The script includes an age filter. If you don't want to include old ratings, you can filter them out.
I was also thinking the results from old games should be less trusted than the results from new games. A recent game is a better indication of a player's ability than a 10-year-old game.
Off-topic, there is much a chess variants player might do to improve in a relatively short period of time (aside from suddenly improved real-life conditions partly or wholly beyond his control, such as recovering from poor health or personal problems, or acquiring more free time than before). Besides any sort of intuition/experience acquired through sustained practice, there's general or specific study he might do on his own, as I alluded to previously. As Joe alluded to, there are many variants that are rather like standard chess, and study and practice of chess probably can only help playing chess variants generally.
Unlike for many chess variants, there is an abundance of chess literature that can help improvement, even at many variants, and hiring a chess teacher, coach or trainer will probably help those who play chess variants too. A chess trainer can help with any physical fitness regime, which also can help those who play chess variants. Similar avenues for improvement might be available to those into other major chess variants with literature of their own, such as Shogi and Chinese Chess, though these two are perhaps less generally applicable for overall improvement at chess variants than using chess would be (not sure).
@Fergus,
I think it is not only about "old" in the calendar sense, but also "old" in the many games ago sense.
Also, I think fading away is a nicer way of putting things than cut offs :)!
Since different people play games at different rates, the last n games of each player would not return a single set of games that would work for everyone. A chronological cut-off point works better, because it can be used to select a uniform set of games.
I was envisioning a system where older games are given lesser weight by some formula, down to some minimum. A game should have, say, at least half the weight of a recent game no matter how old (for example).
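One hypothetical way to realize such a floored fade-out, purely as a sketch (the half-life and floor values are made up, and nothing like this is part of the actual GCR system):

```python
import math

def game_weight(age_years, half_life=5.0, floor=0.5):
    """Hypothetical fade-out: a game's weight decays exponentially with age
    but never drops below `floor`, so old results still count for at least
    half as much as recent ones."""
    return max(floor, math.exp(-math.log(2) * age_years / half_life))

print(game_weight(0))   # 1.0
print(game_weight(5))   # ≈ 0.5 (one half-life)
print(game_weight(50))  # 0.5 (clamped at the floor)
```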
I can play around with formulas if interested. Beyond the question of age, however, I think the system is good as-is.
I have two problems with discounting the results of older games. One is that the decision concerning what age to start reducing game results is arbitrary. The other is that the result of a game is zero points for a loss and one point for a win. While one point for a win could be reduced to a smaller value, zero points for a loss could not without introducing negative values. The alternative would then be to reduce the 1 point for a win and increase the zero points for a loss, making the results of the older game more drawish. This would distort the results of a game and possibly result in less accurate ratings. I prefer using the age filter to set a clean cut-off point at which older results just aren't included in the calculation.