Comments/Ratings for a Single Item

Game Courier Ratings. Calculates ratings for players from Game Courier logs. Experimental.
Roberto Lavieri wrote on Wed, Jan 11, 2006 01:13 AM UTC:
Reliability and stability may be tuned, but the tuning must be based on ideals suited to the purpose. These ideals are not so easy to establish quickly, and perhaps some probabilistic considerations may help. For me, at first appreciation, when a 3000 player loses against a 1500, the rating loss must be greater than when a 1500 loses against another 1500. I can't say in five seconds how much it must be in either case, but the difference between the first and second case must be relatively notorious.

🕸📝Fergus Duniho wrote on Wed, Jan 11, 2006 01:20 AM UTC:
Regarding Roberto's recent comments, let me mention that GCR does not work with the same paradigm as Elo. A player does not have a fixed GCR that eventually becomes too stable to change. Each time someone reloads this page, each player gets an initial rating of 1500, and his GCR is freshly recalculated on the basis of all available data. If he has previously held a very high rating but has started to do poorly, his freshly recalculated rating may come out lower than the rating he used to hold. Furthermore, each GCR is based on all available data of everyone in the same playing network. Even if someone has previously attained a very high rating, it could drop as other players start to do better, even without him playing another game.

🕸📝Fergus Duniho wrote on Wed, Jan 11, 2006 01:27 AM UTC:
When two players play only one game together, GCR works with the assumption that the results are more due to chance than to skill. That is why I think a 3000 rated player who loses to a 1500 rated player in their only game together should lose fewer points than another 1500 rated player would. Furthermore, since 1500 is everyone's initial rating, whereas a 3000 rating must be earned, it is assumed that the 3000 rating is more reliable than the 1500 rating, and the main change in rating should be to bring the 1500 rated player up in rating. As things work right now, the 1500 rated player who beat a 3000 rated player would rise to a rating of 2250.

🕸📝Fergus Duniho wrote on Wed, Jan 11, 2006 03:29 AM UTC:
Gary, the CXR method works on a game-by-game basis, and it generally assumes that players won't be playing other players more than 400 points apart. GCR works on an opponent-by-opponent basis, and since it freshly calculates all ratings in a non-chronological order, it cannot make the assumption that players will usually be less than 400 points apart.

🕸📝Fergus Duniho wrote on Wed, Jan 11, 2006 03:45 AM UTC:
Roberto, I don't think you expressed yourself well with the word notorious. It means infamous, as in being famous for something questionable, unworthy, or downright awful, and it generally describes people. What did you mean to say?

Roberto Lavieri wrote on Wed, Jan 11, 2006 12:49 PM UTC:
Fergus, I meant 'clear, by argued reasons'. But returning to the point, I am now against drastic changes in a rating after a single game, even when a low-rated player beats a very high-rated one. There are many factors that can produce such a result, including forfeit or stopping in the middle for any reason, and one game can't be so decisive. I agree with tuning the modifiers, but I believe this is not a very easy task.

🕸📝Fergus Duniho wrote on Wed, Jan 11, 2006 03:54 PM UTC:
Michael,

The purpose of a rating system is to measure relative differences between
playing strength. I can't emphasize the word relative enough. The best
way to measure relative playing strength is a holistic method that
regularly takes into account all games in its database. One consequence of
this is that ratings may change even when someone stops playing games. This
makes the method more accurate. The Elo and CXR methods have not been
holistic, because a holistic method is not feasible on the scale these
systems are designed for. They have to settle for strictly sequential
changes. Because GCR works in a closed environment with complete access to
game logs, it does not have to settle for strictly sequential changes. It
has the luxury of making global assessments of relative playing strength
on the basis of how everyone is doing.

A separate issue you raised is of a 3000 rated player losing fewer points
than a 1500 rated player. Since last night, I have rethought how to use
and calculate stability. Instead of basing stability on a player's
rating, I can keep track of how many games have so far factored into the
estimate of each player's rating. One thought is to just count the games
whose results have so far directly factored into a player's rating.
Another thought is to also keep track of each opponent's stability, keep
a running total of this, and divide it by the number of opponents a player
has so far been compared with. I'm thinking of adding these two figures
together, or maybe averaging them, to recalculate the stability score of
each player after each comparison. Thus, stability would be a factor of
how reliable an indicator a player's past games have been of his present
rating.

That covers my new thoughts on recalculating stability. As for using it, I
am thinking of using both players' stability scores to weigh how much
ratings may change in each direction. I am still trying to work out the
details on this. The main change is that both stability scores would
affect the change in rating of both players being compared. In contrast,
the present method factors in only a player's own stability score in
recalculating his rating.

One consequence of this is that if a mid-range rated player defeats a
high-rated player, and the mid-range player has so far accumulated the
higher stability score, the change will be more towards his rating than
towards the high-rated player's rating. The overall effect will be to
adjust ratings toward the better established ratings, making all ratings
in general more accurate.

🕸📝Fergus Duniho wrote on Wed, Jan 11, 2006 03:59 PM UTC:

Roberto,

As you know, there is a language barrier between us. It sometimes gets in the way of understanding you, as it has with your misuse of the word notorious. I am simply asking for clarification on what you are trying to say.

Anyway, I am in agreement with the last points you have raised. A single game should not have a great effect on a player's score, and the formulas need more tweaking, but it's not an easy task.


Christine Bagley-Jones wrote on Wed, Jan 11, 2006 04:05 PM UTC:
well if you don't play games Michael, your rating will drop :)
looks like mine will be dropping too he he. (i'm kinda a little shocked
by that)
not that i really care but, i must be bored, but doesn't that mean, if
you have two players that have a 'true' rating (played many rated games)
of 1500, and one of them is inactive for a bit, therefore rating drops, now
if these players play, it will be a game between 2 players where one is
higher rated than the other, where in reality, it should be a game between
equals ... wouldn't that distort ratings after outcome?
another thing, fair amount of games played are more in the spirit of
TESTING OUT A VARIANT, more than anything else. i agree with those that
said that only 'tournament games' should be rated, unless people agree
otherwise beforehand.
as far as 1500 vs 3000, and 1500 rises 750 points if wins, surely that is
too much. i agree that 3000 player should not drop 'heavily'
finally (yawn), are we going to see people less likely to put up a
challenge because of fear of someone much less rated accepting?
will this lead to 'behind the scenes' arranging of games? 
if a vote was taken, would more people want ratings than not?
sorry for length, just adding food for thought.

🕸📝Fergus Duniho wrote on Wed, Jan 11, 2006 04:20 PM UTC:
Your rating could also rise without playing any more games. Your rating in GCR is a holistic measure of your relative performance against everyone else in your playing network. It depends on how everyone is doing relative to each other. There is no such thing as a true rating, because all ratings are relative. The method is not trying to measure your playing strength in some absolute terms that can be given a specific number with a specific meaning. The one meaningful constant here is the difference between two players' scores. Each time it recalculates ratings, this page will be using all the data at hand to give the best estimates of relative playing strength that it can. It would not do a better job of this by keeping ratings static when people aren't playing.

Gary Gifford wrote on Wed, Jan 11, 2006 05:28 PM UTC:
Originally I wrote, in part: 'For a player's rating to rise or fall while sitting on his (or her) laurels seems terrible to me.' However, in the 4 hours that have since then passed I have reversed my hasty opinion (obviously biased by years under the USCF Rating system). Anyway, since we are talking about a player's playing strength in relation to other players' playing strengths [an ever changing field of relative values], then in that light a static [or frozen] rating just isn't realistic... and is not a valid number for comparison. So, by further thought I have crossed the fence to Fergus's side of the ratings camp. I still don't like the idea of fun-games and experimental games getting thrown into the equation, but I guess we have to start somewhere. So, in closing, thank you Fergus for all the effort you are putting into this. I am sure it will turn out well and be valued over time.

Gary Gifford wrote on Wed, Jan 11, 2006 11:37 PM UTC:
Good questions.  I'll defer them to Fergus as he understands what is
going on here far better than I do and I could end up giving a wrong
answer.  But I do know a player who was about 2000.  Unfortunately he has
a mental condition; he is now about 1400 and getting weaker in all
cognitive areas.  It is now a strain for him just to walk. 
Understandably, he could have quit playing chess while at 2000... but he
still plays.  Anyway, if he quit at 2000 his frozen 2000 rating would
certainly be false.  Of course, if he quit and his rating climbed, that
too would be false.  It would need to drop over time to reflect reality. 
Would this happen with the equations Fergus is using?  I don't know... 
We can shoot all kinds of rating situations around and argue one way or
the other, but what is the point?  Does it really matter?

Why should we get so wrapped up in these values?  They are just a means of
comparison.  Before we had nothing.  Now we will have something.  If we do
not like that 'something' then we can choose the 'unrated game' option
once implemented.  We can also play in USCF tournaments where our ratings
will freeze once we quit playing.

Christine Bagley-Jones wrote on Thu, Jan 12, 2006 01:55 AM UTC:
yeah no need to get wrapped up in it, but it would be good to get the best
rating system in place, i am sure it would save Fergus a lot of hassle in
the future also if people complain, say 'other sites have a better
system' etc etc.
will be kinda fun too, to see people have ratings, then you can see who is
like the 'favorite' and the 'underdog' in games etc etc
high drama :)

Roberto Lavieri wrote on Thu, Jan 12, 2006 01:02 PM UTC:
I don't dislike Fergus's method, but it needs some tuning and, perhaps, one or a couple of new modifiers added, although that can make the method unnecessarily complicated. I have not had enough time to go deep into it, but I'll try at some point.

🕸📝Fergus Duniho wrote on Thu, Jan 12, 2006 04:57 PM UTC:
I have changed the method again. The changes are along the lines of what I was describing yesterday. Stability is now a factor of how many of each player's games have already factored into the calculations for his ratings. This will cause the ratings of players who have played more games to stabilize more than the ratings of other players. I have also added a gravity factor. This is a function of stability. It goes to the player with the higher stability, and it diminishes with distance. It increases the reliability of the scores in the direction of the player with greater stability. Thus, the players who have played the most games become gravitational centers around which ratings of less experienced players gravitate. The specific details are given in the description of the method. I expect it will still need some tweaking, and I plan to add a form field for entering test data.

🕸📝Fergus Duniho wrote on Thu, Jan 12, 2006 05:16 PM UTC:

Michael Howe asks:

what, therefore, is the refutation to my concern that a player's rating be retroactively affected by the performance of a past opponent whose real playing strength has increased or decreased since that player last played him?

A system that offers estimates instead of measurements is always going to be prone to making one kind of error or the other. This is as true of Elo as it is of GCR. Keeping ratings static may avoid some errors, but it will introduce others. The question to ask then is not how to avoid this kind of error or that kind of error. The more important question to ask is which sort of system will be the most accurate overall. Part of the answer to that question is a holistic system. Given that the system estimates relative differences in playing strength, the best way to estimate these differences is a method that bases every rating on all available data. Because of its monodirectional chronological nature, Elo does not do this. But the GCR method does do this. This allows it to provide a much more globally consistent set of ratings than Elo can with its piecemeal approach of calculating ratings. Since ratings have no meanings in themselves and mean something only in relation to each other, a high level of global consistency is the most important way in which a set of ratings can be described as accurate. Since a holistic method is the most important means, if not actually necessary, for achieving this, a holistic method is the way to go, regardless of whatever conceivable errors might still be allowed.


🕸📝Fergus Duniho wrote on Thu, Jan 12, 2006 07:03 PM UTC:

The testdata field takes data in a line by line grid form like this:

1500 0 1 0
1500 0 0 1
1500 1 0 0

It automatically names players with letters of the alphabet. Each line begins with a rating and is followed by a series of numbers of wins against each player. The above form means that A beat B once, B beat C once, and C beat A once.
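
As a rough sketch of how lines in this format might be parsed (illustrative PHP only; the variable names and parsing code here are assumptions, not the actual script):

$lines = explode("\n", trim($testdata));
$players = array();
foreach ($lines as $i => $line) {
    $fields = preg_split('/\s+/', trim($line));
    $name = chr(ord('A') + $i);                   // players are named A, B, C, ...
    $players[$name] = array(
        'rating' => (int) array_shift($fields),   // the first number is the starting rating
        'wins'   => array_map('intval', $fields), // wins against each player, in order
    );
}
// For the data above, $players['A']['wins'] is array(0, 1, 0): A beat B once.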


Roberto Lavieri wrote on Thu, Jan 12, 2006 07:11 PM UTC:
I agree with the Gravity modifier, but you need to tune all the modifiers. Although I am not sure what is going to be best, we need good arguments for the decisions, and these are not entirely clear. I think the method, as it is now, has some weaknesses, basically due to the weight of the modifiers. If you have some trouble with my English, please tell me. I'm trying to write orthodox English, as far as possible.

Roberto Lavieri wrote on Thu, Jan 12, 2006 09:48 PM UTC:
There is a very small error in my all-time performance. The result in the LOG rlavieri2003-cvgameroom2004-318-638 was not counted (it says 'has won', but no one is mentioned). I believe there is another error, but I can't find it; my own record gives 38.0/75.

🕸📝Fergus Duniho wrote on Thu, Jan 12, 2006 10:30 PM UTC:
Because Game Courier hasn't kept track of winners and losers by userid, I have had to resort to checking the status string to find the winner. It reads the name from there, compares it with the name on record for each player, and determines from that who has won. If there is a problem with a particular status string, it has to be fixed in the log. You or your opponent may be able to do this by updating who has won in the log in question.

🕸📝Fergus Duniho wrote on Thu, Jan 12, 2006 11:34 PM UTC:
After testing some different values, I increased the value of stability, and reduced the values of reliability and gravity. One thing I was trying to do was bring all the ratings in a perfect cycle to 1500. This is where A beats B, B beats C, C beats D, D beats E, and E beats A, and they all start at 1500. I didn't succeed at this, but I did manage to bring them closer. The overall effect of the changes I made was to bring all ratings closer together. Among all the games, the greatest difference is now little more than 400. This may be a fair estimate, given that not a lot of games have been played yet, and many people have still played very few games. Also, I removed the testdata field for the time being.

🕸📝Fergus Duniho wrote on Fri, Jan 13, 2006 03:37 AM UTC:
I've just done some more tweaking of values. After calculating the ratings, I used them to predict the scores, and I measured their success rate. I used this measure to help me tweak values. I got it up to about 75% accuracy, but it was at the expense of too rapidly raising the ratings of players who had played few games. I let it drop to about 72% to prevent the ratings of less experienced players from changing too quickly. In general, the final ratings have been slightly more accurate than the two separately calculated sets of ratings that were averaged to get them. The main change I made was to increase reliability. I left stability and gravity at what I tweaked them to earlier today.

🕸📝Fergus Duniho wrote on Fri, Jan 13, 2006 04:19 AM UTC:
I brought the accuracy on the current data up to almost 75% by changing the order in which the players are sorted. They are now ordered by how many games they have played. The most accurate sequence of calculations is the one that begins with the players who have played the most games. Although including the reverse sequence of calculations has reduced the overall accuracy score, it has also corrected one error I spotted in the first series of calculations. Although the accuracy rate is an important measure, it is not the only way to assess accuracy, and the data I have used it on is not enough to draw any firm conclusions. I remain convinced that using two sequences of calculations in opposite orders is overall better than using only one.

Roberto Lavieri wrote on Fri, Jan 13, 2006 01:46 PM UTC:
I think that a 'weighted history' makes sense in every rating system. Recently played games must have more importance in the rating calculations than old ones. This may help to reflect drastic changes in a player's real strength. Illness, temporary disinterest, and other factors can make a player's skills decline, and experience, progressive knowledge of a game, high interest, and other factors can help a rating increase quickly in some cases.

🕸📝Fergus Duniho wrote on Fri, Jan 13, 2006 05:39 PM UTC:
A weighted history would work with the assumption that anyone who isn't
actively playing games is deteriorating in ability. I'm not sure this is
an accurate assumption to make. Also, the factors you list as causing
performance to drop are going to have less effect on games played on Game
Courier, because these are correspondence games played over a long period
of time, and a person may take time off for illness or disinterest without
it affecting his game. When it does affect someone's games, it will
generally affect fewer games than it would for someone active in Chess
tournaments. Also, the large timeframe on which games are played is going
to make it even harder to determine what the weights should be. For these
reasons, I am not convinced that a weighted history should be used here.

Anyway, if you do want ratings that better reflect the current level of
play among active players, you already have the option of using the Age
Filter to filter out older games. I think that should suffice for this
purpose.

Roberto Lavieri wrote on Fri, Jan 13, 2006 07:24 PM UTC:
You are right about the use of the age filter to reflect 'current' ratings (this is not entirely true, but it can be a better approximation), although I still disagree with you about the weighted history. I think it can be good for our purposes, but I recognize it is not easy to give the weights in every case. This site contains many games for which people are learning and constructing some basis for better play by experience, and this is a step-by-step process, perhaps long in time; all of us must be considered real novices in many games, and this is a reason to consider weighted history, precisely because of the nature of this site. The case is different if we are talking about old, popular games widely played for a long time, but TCVP contains many new games, and the list is expected to grow in the future. I insist on another claim: not all games must be rated, or the rating system can become a tool which mainly reflects how good someone is at playing in an inedit scenario. The list of 'rated' games can grow, but with games that become 'relatively popular' with time.

Roberto Lavieri wrote on Fri, Jan 13, 2006 07:37 PM UTC:
The Age filter and some other filters don't work yet.

Roberto Lavieri wrote on Fri, Jan 13, 2006 08:11 PM UTC:
I used 'inedit' in a past comment; this is not an English word. Use 'new' instead.

🕸📝Fergus Duniho wrote on Fri, Jan 13, 2006 08:11 PM UTC:
I have fixed the Age Filter. So it now works. I have tested the other filters, and they all work. If you see a bunch of warnings when you try to view only rated games, that's because there are none, and the program is reporting problems with an empty array.

🕸📝Fergus Duniho wrote on Fri, Jan 13, 2006 08:28 PM UTC:
I want to draw attention to the main change I made today. You may notice
that the list of ratings now uses various background colors. Each
background color identifies a different network of players. The
predominant network is yellow, and the rest are other colors. Everyone in
a network is connected by a chain of opponents, all of whom are also in
the network.

Regarding weighted histories, they probably work better for the world of
competitive Chess, in which active players normally play several rated
games at regular intervals. This frequency and regularity provides a basis
for weighting games. But Game Courier lacks anything like this. Games here
are played at the leisure of players.

🕸📝Fergus Duniho wrote on Fri, Jan 13, 2006 11:57 PM UTC:
I have updated the 'GCR vs Elo' text and rearranged how sections of this page are displayed.

Mark Thompson wrote on Sat, Jan 14, 2006 02:33 AM UTC:
I've always thought the best implementation of ratings would be an 'open-source' approach: make public the raw data that go into calculating the ratings, and allow many people to set up their own algorithms for processing the data into ratings. So users would have a 'Duniho rating' and a 'FIDE rating' and 'McGillicuddy rating' and so on. Then users could choose to pay attention to whichever rating they think is most significant. Over time, studies would become available as to which ratings most accurately predict the outcomes of games, and certain ratings would outcompete others: a free market of ideas.

(zzo38) A. Black wrote on Sat, Jan 14, 2006 04:33 AM UTC:
Quote:
I've always thought the best implementation of ratings would be an 'open-source' approach: make public the raw data that go into calculating the ratings, and allow many people to set up their own algorithms for processing the data into ratings. So users would have a 'Duniho rating' and a 'FIDE rating' and 'McGillicuddy rating' and so on. Then users could choose to pay attention to whichever rating they think is most significant. Over time, studies would become available as to which ratings most accurately predict the outcomes of games, and certain ratings would outcompete others: a free market of ideas.
I also like the open-source approach (maybe make the raw data XML, plain-text, or both), but there should also be one built-in to this site as well, so if you don't have your own implementation you can view your own.

Mark Thompson wrote on Sat, Jan 14, 2006 05:13 AM UTC:
'I also like the open-source approach (maybe make the raw data XML,
plain-text, or both), but there should also be one built-in to this site
as well, so if you don't have your own implementation you can view your
own.'

Sure, the site should have its own 'brand' of ratings. But I mean, it
would be good to make ratings from many user-defined systems available
here also. Just as the system allows users to design their own webpages
(subject to editorial review) and their own game implementations, there
could be a system whereby users could design their own ratings systems,
and any or all these systems could be available here at CVP to anyone who
wants to view them, study their predictive value, use them for tournament
matchings, etc.

Of course, it's much easier to suggest a system of multiple user-defined
rating schemes (hey, we could call it MUDRATS) than to do the work of
implementing it. But if enough people consider the idea and feel it has
merit, eventually someone will set it up someplace and it will catch on.

🕸📝Fergus Duniho wrote on Sat, Jan 14, 2006 04:52 PM UTC:
No, I have no intention of doing anything along the lines of Mark
Thompson's 'open source' suggestion. While some people might like to
play with their own ratings system, most people are simply going to want
one standard ratings system without the fuss and bother of choosing one
among many. Also, if I set something up to freely let people create their
own ratings systems, there would soon be many bad ratings systems for
people to choose from. As for letting a free market choose the best one,
it wouldn't work like that. Without the benefit of serious investigation
into them, there would be little basis for informed decisions. A bad
system could become popular as easily as a good one. Consider how well
things like astrology and numerology fare under a free market. A free
market is no substitute for scientific investigation.

If people are interested in Game Courier having as good a ratings system
as it can, then they are free to offer comments and suggestions. I have
described the method in algorithmic detail on this page, and I have also
further discussed what it does and have compared it with Elo.

Roberto Lavieri wrote on Sat, Jan 14, 2006 06:40 PM UTC:
I think GCR is a good alternative method, although it has its weaknesses, as Elo also has. Neither is very sensitive to drastic changes in a person's game play; I know that is unusual, but not impossible. But I insist that weighted history must be considered. Weighted history (for each game, I mean) can reflect some evolution in a player's playing strength, and this is expected to happen on our site, because many of the games we play are new games, all of us are gaining experience with little theory to help, and results are less indicative in the first contacts with a game. GCR's main weakness is that it does not reflect the current real strength with the best accuracy; rather, it tends toward an average over all time.

Roberto Lavieri wrote on Sat, Jan 14, 2006 07:03 PM UTC:
Another weakness I see is that you don't know how many games are needed to consider a rating 'somewhat confident'. It is very possible that a player with only a few games played, say less than ten, but with an almost perfect score against 'well rated' players, shows a rating that does not reflect the player's strength, the rating being, perhaps, much lower than that of another player with a lot of games played but a lower average and a relatively worse record against others. It has been said that the rating must stabilize with time, but I'm not sure how many games are needed, and the disparity in number of games may introduce a bias that can give ratings that are not so easy to compare with accuracy. But once 'stabilized', the whole history introduces another bias, a product of very old games being considered with the same weight as new ones; this is the main reason I insist on the weighted history idea.

🕸📝Fergus Duniho wrote on Sun, Jan 15, 2006 09:52 PM UTC:
Roberto, I was reading about the Glicko method the other day. This is an improvement on Elo that takes into consideration each player's activity. As I was reading about it, it seemed to me that it was addressing some of the same concerns as a weighted system is supposed to address. But instead of weighting the point value of games, it was treating the ratings of more active players as more stable than the ratings of less active players. GCR already does this. So consider a player who initially does poorly at a new game, then gets it and starts doing a lot better. So long as he actively plays the game with others, his initial games won't count for as much. If they were against the same opponents he continues to play, each new game he wins against them will lessen the effect of his initial losses. If they were against opponents he no longer plays, they will be considered as less stable than scores against players he plays against more often. Furthermore, if his old opponents don't improve as much as he does, his losses against them won't count as much as losses against stronger players. Although a weighted Elo method might be an improvement on Elo, GCR already comes with features that address the concerns that weighting Elo is supposed to meet. So there seems to be less, if any, need for weighting GCR.

🕸📝Fergus Duniho wrote on Sun, Jan 15, 2006 10:05 PM UTC:
I'll draw attention to the change I made today. Previously, when two players had ratings more than 400 points apart, GCR would calculate their provisional ratings by adding the lower rating to each player's percentage of games won times the full distance between them. Now, when two players have ratings more than 400 points apart, GCR calculates the midpoint between them, and calculates each player's provisional rating in a range between his current rating and 200 points past the midpoint. For example, if it compares two players at 1500 and 2000, the 1500 rated player's provisional rating would fall between 1500 and 1950, and the other's would fall between 1550 and 2000. The higher rated player's provisional rating is now calculated by subtracting the product of his opponent's score times the range. Since both scores add up to one, this is simply the same as using 1 minus his own score. The advantage of doing it this way is that the provisional scores for both players are only 400 points apart when the lower-rated player wins all games. Previously, this would give each player his opponent's rating as a provisional rating in this event, and that would be too much. After I made this change and fixed the bugs, the calculated ratings became slightly more accurate at predicting the original scores. So it seems to be an improvement.
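
A rough PHP sketch of the new provisional-rating rule described above (the function name and variables are illustrative, not taken from the actual script; it assumes the two ratings are more than 400 points apart):

function provisional_ratings($low, $high, $lowScore) {
    // $lowScore is the lower-rated player's fraction of points won (0..1)
    $mid = ($low + $high) / 2;
    $range = ($mid + 200) - $low;             // e.g. 1950 - 1500 = 450 for ratings 1500 and 2000
    $provLow  = $low  + $lowScore * $range;   // falls between 1500 and 1950
    $provHigh = $high - $lowScore * $range;   // falls between 1550 and 2000
    return array($provLow, $provHigh);
}
// provisional_ratings(1500, 2000, 1.0) gives 1950 and 1550, which are only 400 points apart.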

Joe Joyce wrote on Mon, Jan 16, 2006 05:18 AM UTC:
What happens when a game you've won, and it says 'You have won' in the
game log, doesn't show up in the calculations, even though you can call
it up by name from the game logs with your password, and it shows as a win
when you list all your games? The game in question is
Omega Chess oncljan-joejoyce-2005-96-245
Admittedly, it's not a good win, but it balances out one of the
almost-won games where my opponent disappeared just before the end. (I see
the value of timed games now.) Actually, I hadn't brought it up before
because it is such a poor win that I didn't feel I deserve it, but I
realized that if it was included, I just might get up to 1500 briefly,
before I lose to Carlos, David, Gary..., and that'd be a kick for someone
who's only been playing a year or so after, depending on how you wish to
count time off, 30-40 years.  
I will say the ratings have brought out everyone's competitive spirits.
As for me, I'll happily carry a general rating that takes in all my
games: playtests, coffee-house, and tournament; but, since people are
asking for so many things, I'd like to add one more. Would it be possible
or practical to allow people to choose one or more subsets of games for a
specific rating? For example, I am currently playing several variants of
shatranj now, one of which is 'grand shatranj'. Could I be allowed to
put any number of game names into a 'Rate these games only' field, so I
could get a combined rating for say 6 shatranj variants plus Grand Chess?
And then another for the 'big board' games, and so on?

🕸📝Fergus Duniho wrote on Mon, Jan 16, 2006 04:04 PM UTC:
Joe,

GCR reads data only from public games, not from private games. That's why
one of your private games is not factoring into the calculations. 

I have now added a group filter that lets users select certain groups, and
I have extended the Game Filter to recognize comma separated lists. To list
games in the Game Filter, separate each game by a single comma, nothing
more or less, and don't use any wildcards.

Joe Joyce wrote on Mon, Jan 16, 2006 04:56 PM UTC:Excellent ★★★★★
Thank you very much.

🕸📝Fergus Duniho wrote on Mon, Jan 16, 2006 05:24 PM UTC:

I have now modified the reliability and stability formulas to these:

$r1 = ($n / ($n + 9)) - (($gamessofar[$p1] + 1) / (10*$gamessofar[$p1] + 100));
$r2 = ($n / ($n + 9)) - (($gamessofar[$p2] + 1) / (10*$gamessofar[$p2] + 100));
$s1 = (($gamessofar[$p1] + 1) / ($gamessofar[$p1] + 5)) - ($n / (5*$n + 100));
$s2 = (($gamessofar[$p2] + 1) / ($gamessofar[$p2] + 5)) - ($n / (5*$n + 100));

$n is the number of games two players have played together, and $gamessofar holds the number of games that have so far factored into the ratings of each player. I have modified each formula by subtracting a small fraction based on what determines the other. The fraction subtracted from reliability has a limit of .1, which is otherwise the lower boundary of reliability. The fraction subtracted from stability has a limit of .2, which is otherwise the lower boundary of stability. These have been introduced as erosion factors. A very high stability or reliability erodes the other, and may do so to a limit of zero. Thus, the more games one person wins against another, the closer the winner's point gain gets to a limit of 400. Likewise, the more single games someone wins against separate individuals, the closer his point gain gets to the same limit of 400. Also, these changes have increased the accuracy slightly.
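
As a quick numeric check of these formulas (with made-up values: 5 games between the two players, and 20 games already factored into the player's rating):

$n = 5;        // games these two players have played together
$games = 20;   // games already factored into this player's rating (a $gamessofar value)
$r = ($n / ($n + 9)) - (($games + 1) / (10 * $games + 100));
$s = (($games + 1) / ($games + 5)) - ($n / (5 * $n + 100));
// $r works out to about 0.287, and $s works out to 0.8.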


🕸📝Fergus Duniho wrote on Thu, Feb 16, 2006 05:00 PM UTC:
I have corrected an inaccuracy in the description of the method used. It used to say that a higher rated player is expected to win a percentage of games equal to one quarter of the point difference between the ratings, capped at 100%. Although this works for a 400 point difference, it is inaccurate for other point differences. In particular, this formula predicts that the lower rated player would win more games for any point difference below 200, which is just crazy. Anyway, the examples I gave to illustrate the formula did not illustrate it, and examination of my code indicates that I did not use it. The actual formula, which the examples did illustrate, and which I did use in the code, is that a higher rated player may be expected to win a percentage of games equal to 50% plus one eighth of the point difference, capped at 100%.
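
A minimal PHP sketch of the corrected formula (the function name is illustrative, not from the actual script):

function expected_win_percentage($own, $opponent) {
    // the higher-rated player is expected to win 50% plus one eighth
    // of the point difference, capped between 0% and 100%
    $diff = $own - $opponent;
    return max(0, min(100, 50 + $diff / 8));
}
// expected_win_percentage(1700, 1500) gives 75; expected_win_percentage(1900, 1500) gives 100.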

Matthew Montchalin wrote on Thu, Feb 16, 2006 11:03 PM UTC:
Fergus, I suggest you use a different rating system, especially considering
how your current one is pretty arbitrary (we can nitpick about the 400
point difference as opposed to a 500 or 600 point difference, but we would
do so knowing in advance that one number is just as arbitrary as another),
and how it appears to be designed to judge people's 'future
performance' based upon observations of previous games that users were
told wouldn't count.  (Although that's really not /that/ big of a deal.)
 And if you encouraged users to add their computer programs to the fray,
the ratings, as such, would add an extra dimension of utility.

🕸📝Fergus Duniho wrote on Fri, Feb 17, 2006 12:41 AM UTC:
By necessity, any rating system is going to have a degree of arbitrariness
to it, for some unit of measurement must be used, and there is no hard and
fast reason for preferring one over another. But that is no reason at all
against any particular method. As for the 400 figure, that is at least
rooted in tradition. This same figure is used by the Elo method, which has
already established itself as the most popular rating method.

As for including computer opponents, you are free to play games in which
you enter the moves made by a computer. If you do that, it would be best
to create a separate userid for the computer opponent. But Game Courier
does not provide any computer opponents, and I don't consider their
inclusion in the ratings important.

Finally, the filters let you filter out games that are not officially
rated. So it's a moot point whether the calculations factor in unrated
games. They factor them in only if you choose not to filter them out.

Thomas McElmurry wrote on Wed, Feb 22, 2006 06:19 AM UTC:
When I view the ratings for all tournament games by using '?*' as the tournament filter, exactly one player is displayed in a different color than the others. How is this possible? Does it indicate an error in the code, or in my understanding of what the colors indicate?

🕸📝Fergus Duniho wrote on Wed, Feb 22, 2006 02:31 PM UTC:
Since I have won a rated game against the person in question, I know his row should be yellow like all the rest. There must be a bug.

🕸📝Fergus Duniho wrote on Wed, Feb 22, 2006 03:43 PM UTC:
Okay, the bug should now be fixed. Thanks for reporting it.

Stephen Stockman wrote on Mon, May 8, 2006 08:41 PM UTC:Excellent ★★★★★
WOW!! this ratings page is super cool! Now I see why people are playing more games, they're working on their ratings. Thank You Fergus

Jeremy Good wrote on Mon, May 8, 2006 09:46 PM UTC:
Never noticed this before. Hey, Joe (Joyce) you and I have a very similar rating at this time. We're a good match.

Stephen Stockman wrote on Fri, Jul 28, 2006 07:20 AM UTC:Excellent ★★★★★
I have a suggestion. Is it possible to have a maximum number of points that
a player can gain or lose per game? I am thinking of a maximum change per
game of around 10 or 20 points, because there are many players listed here
who have only played one or two games but they have highly inflated or
deflated ratings.

Hats off to Jeremy Good who apparently has completed more games here than
anyone else, looks like 250 completed games, and counting!

🕸📝Fergus Duniho wrote on Fri, Jul 28, 2006 04:58 PM UTC:
So far, the ratings for all public games fall within a 500 point range. Except for the top rating, all fall within a 400 point range. Most fall within a 200 point range. Ratings of people who have played only two games fall within a 300 point range. Ratings of people who have played only one game fall within a 200 point range. So there does not appear to be any deflation or inflation of ratings. There is a range of variability among players who have played few games, but you cannot get your rating very high or low without playing many games.

🕸📝Fergus Duniho wrote on Tue, Apr 7, 2015 02:11 PM UTC:
The good news is that the reason this didn't work sometimes was not because of too many files but because of a monkey wrench thrown into one of the log files. With that file renamed, it's not being read, and this page generates ratings even when set to give ratings for all public games. The bad news is that it seems to be undercounting the games played by people. I checked out a player it said had played only one game, and the logs page listed 23 games he has finished playing. I was also skeptical that I had played only 62 games. I counted more than that and saw that I had still played several more. So that has to be fixed. And since I have made the new FinishedGames database table, I will eventually rewrite this to use that instead of reading the files directly.

🕸📝Fergus Duniho wrote on Sat, Apr 11, 2015 01:02 AM UTC:
This script now reads the database instead of individual logs, and some bugs have been fixed. For one thing, it shouldn't be missing games anymore, as I was complaining about in a previous comment. Also, I found some functions for dealing with mixed character encodings in the database. Some years ago, I tried to start converting everything to UTF-8, but I never finished that. This led to multiple character encodings in the database. By using one function to detect the character encoding and another to convert whatever was detected to UTF-8, I'm now getting everyone's name to show up correctly.

One of the practical changes is the switch from Unix wildcards to SQL wildcards. Basically, use % instead of *, and use _ instead of ?.

One more thing. I moved this script from play/pbmlogs/ to play/pbm/. It was in the former only because it had to read the logs. Now that it doesn't, it seems more logical to put it in play/pbm/. The old script is still at its old location if you want to compare.

🕸📝Fergus Duniho wrote on Sat, Apr 11, 2015 01:36 AM UTC:
I have also modified groups to work with mysql, and one new feature that helps with groups is that it shows the sql of the search it does. This lets you see what Chess variants are in a group. Most of the groups are based on the tiers I made in the Recognized variants. These may not be too helpful, since they are not of related games. The Capablanca group, which I just expanded, seems to be the most useful group here, since it groups together similar games. What I would like to do is add more groups of related games. I'm open to suggestions.

Cameron Miles wrote on Sat, Apr 11, 2015 01:37 AM UTC:
It looks like everything's been fixed! Well done, Fergus, and thank you!

I see that the Finished Games database also allowed for the creation of a page listing Game Courier's top 50 most-played games, which is a very nice addition.

Now I guess I have to see if I can catch Hexa Sakk...  ; )

🕸📝Fergus Duniho wrote on Sat, Apr 11, 2015 03:26 AM UTC:
It's now possible to list multiple games in the Game Filter field. Just comma-separate them and don't use wildcards.

🕸📝Fergus Duniho wrote on Sat, Apr 11, 2015 10:40 AM UTC:
It is now possible to use wildcards within comma-separated lists of games. Also, Unix style wildcards are now converted to SQL style wildcards. So you can use either.
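
A guess at how the wildcard conversion might be done in PHP (illustrative only; the actual script's code isn't shown here):

$sqlPattern = strtr($unixPattern, array('*' => '%', '?' => '_'));
// e.g. 'Grand*' becomes 'Grand%', and 'Chess?' becomes 'Chess_'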

🕸📝Fergus Duniho wrote on Sat, Apr 11, 2015 09:20 PM UTC:
I'm thinking of tweaking the way the GCR is calculated. As it is right now, the value that is going to grow the quickest is a player's past games. This affects the stability value, which is already designed to near the limit of one more quickly than reliability ever will. Even if games with the current opponent and one's past games remained equal in number, stability would grow more quickly than reliability. But after the first opponent, one's past games will usually outnumber one's games with the current opponent. Besides this, gravity is based on stability scores, and as stability scores for both opponents quickly near the limit of one, gravity becomes fairly insignificant. Given that past games will usually outnumber games played against the current opponent, it makes sense for reliability to grow more quickly than stability.

🕸📝Fergus Duniho wrote on Sat, Apr 11, 2015 11:50 PM UTC:
I'm rethinking this even more. I was reading about Elo, and I realized its main feature is a self-correcting mechanism, sort of like evolution. Having written about evolution fairly extensively in recent years, I'm aware of how it's a simple self-correcting process that gets results. So I want a ratings system that is more modeled after evolution, using self-correction to get closer to accurate results.

So let's start with a comparison between expectations and results. The ratings for two players serve as a basis for predicting the percentage of games each should win against the other. Calculate this and compare it to the actual results. The GCR currently does it backward from this. Given the results, it estimates new ratings, then figures out how much to adjust present ratings to the new ratings. The problem with this is that different pairs of ratings can predict the same results, whereas any pair of ratings predicts only one outcome. It is better to go with known factors predicting a single outcome. Going the other way requires some arbitrary decision making.

If there is no difference between predicted outcome and actual outcome, adjustments should be minimal, perhaps even zero. If there is a difference,  ratings should be adjusted more. The maximum difference is if one player is predicted to win every time, and the other player wins every time. 
Let's call this 100% difference. This would be the case if one rating was 400 points or more higher than another.  The maximum change to their scores should be 400 points, raising the lower by 400 points and decreasing the higher by 400. So the actual change may be expressed as a limit that approaches 400. Furthermore, the change should never be greater than the discrepancy between predictions and outcomes. The discrepancy can always be measured as a percentage between 0% and 100%. The maximum change should be that percentage of 400.

But it wouldn't be fair to give the maximum change for only a single game. The actual change should be a function of the games played together. This function may be described as a limit that reaches the maximum change as they play more games together. This is a measure of the reliability of the results. At this point, the decision concerning where to set different levels of reliability seems arbitrary. Let's say that at 10 games, it is 50% reliable, and at 100 games near 100% reliable. So, Games/(Games + 10) works for this. At 10, 10/20 is .5 and at 100, 100/110 is .90909090909. This would give 1 game a reliability of .090909090909, which is almost 10%. So, for one game with 100% difference between predictions and results, the change would be 36.363636363636. This is a bit over half of what the change currently is for two players with ratings of 1500 when one wins and the other loses. Currently, the winner's rating rises to 1564, while the loser's goes down to 1435. With both players at 1500, the predicted outcome would be that both win equally as many games or draw a game. Any outcome where someone won all games would differ from the predicted outcome by 50%, making the maximum change only 200, and for a single game, that change would be 18.1818181818. This seems like a more reasonable change for a single game between 1500 rated players.

Now the question comes in whether anything like stability or gravity should factor into how the scores change. Apparently the USCF uses something called a K-factor, which is a measure of how many games one's current rating is based on. This corresponds to what I have called stability. Let's start with maximums. What should be the maximum amount that stability should minimize the change to a score? Again, this seems like an arbitrary call. Perhaps 50% would be a good maximum. And at what point should a player's rating receive that much protection? Or, since this may be a limit, at what point should change to a player's rating be minimized by half as much, which is 25%? Let's say 200 games. So, Games/(Games + 600) works for this. At 200, it gives 200/800. At 400, it gives 400/1000.

And what about gravity? Since gravity is a function of stability, maybe it adds nothing significant. If one player has high stability and the other doesn't, the one whose rating is less stable will already change more. So, gravity can probably be left out of the calculation.
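
A rough PHP sketch of how the pieces of this proposal could fit together (the function name and the exact way the factors are multiplied are assumptions for illustration, not the final code):

function proposed_change($ratingA, $ratingB, $scoreA, $gamesTogether, $pastGamesA) {
    // predicted score for player A: 50% plus 1/8 of the rating difference, capped at 0..1
    $predictedA = max(0.0, min(1.0, 0.5 + ($ratingA - $ratingB) / 800));
    // discrepancy between the predicted and the actual outcome, from 0 to 1
    $discrepancy = abs($scoreA - $predictedA);
    // the maximum change is that fraction of 400 points
    $maxChange = 400 * $discrepancy;
    // reliability: approaches 1 as the two players play more games together
    $reliability = $gamesTogether / ($gamesTogether + 10);
    // stability: protects ratings that are already based on many past games
    $stability = $pastGamesA / ($pastGamesA + 600);
    return $maxChange * $reliability * (1 - $stability);
}
// proposed_change(1500, 1500, 1.0, 1, 0) gives about 18.18, matching the example above.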

🕸📝Fergus Duniho wrote on Sun, Apr 12, 2015 02:43 AM UTC:
So far, the current method is still getting higher accuracy scores than the new method I described. Maybe gravity does matter. This is the idea that if one player's rating is based on several games, and the other player's rating isn't, the rating of the player with fewer games should change even more than it would if their past number of games were equal. This allows the system to get a better fix on a player's ability by adjusting his rating more when he plays against opponents with better established ratings.

🕸📝Fergus Duniho wrote on Mon, Apr 13, 2015 01:48 AM UTC:
I've been more closely comparing different approaches to the ratings. One is the new approach I described at length earlier, and one is tweaking the stability value. In tweaking the stability value, I could increase the accuracy measurement by raising the number of past games required for a high stability score. But this came at a cost. I noticed that some players who had played only a few games quickly got high ratings. Perhaps they had played a few games against high rated players and won them all. Still, this seemed to be unfair. Maybe the rating really was reflective of their playing abilities, but it's hard to be sure about this, and their high ratings for only a few games seemed unearned. In contrast to this, the new rating method put a stop to this. It made high ratings something to be earned through playing many games. Its highest rated players were all people who had played several games. Its highest rating for someone who played games in the single digits was 1621 for someone who had won 8.5 out of 9 games. In contrast, the tweaked system gave 1824 to someone who won 4 out of 4 games, placing him 5th in the overall rankings. The current system, which has been in place for years, gave 1696 and 1679 to people who won 8.5/9 and 4/4 respectively.

In the ratings for all games, the new system gets a lower accuracy score by less than 2%. That's not much of a difference. In Chess, it gets the higher accuracy score. In some other games, it gets a lower score by a few percentage points. Generally, it's close enough but has the advantage of reducing unearned high ratings, which gives it a greater appearance of fairness. So I may switch over to it soon.

🕸📝Fergus Duniho wrote on Mon, Apr 13, 2015 11:43 PM UTC:
I have switched the ratings system to the new method, because it is fairer. Details on the new system can be found on the page. I have included a link to the old ratings system, which will let you compare them.

Kevin Pacey wrote on Fri, Apr 15, 2016 02:08 AM UTC:
Hi Fergus

I lost a game of Sac Chess to Carlos quite some time ago. I thought that it was to be rated, but as far as I can tell my rating is based on only 1 game (a win at Symmetric Glinski's Hexagonal Chess vs. Carlos). I don't know if the ratings have been updated to take into account my Sac Chess loss, but I thought I'd let you know, even though I don't plan to play on Game Courier, likely at least anytime soon.

🕸📝Fergus Duniho wrote on Fri, Apr 15, 2016 02:21 AM UTC:
Your game is marked as rated, but for some reason it didn't make it into the FinishedGames database table. I will have to look into whether this problem is isolated or more systemic. Just as a quick check, the last two games I finished are in the database. I will give this more attention soon.

🕸📝Fergus Duniho wrote on Fri, Jun 3, 2016 06:03 PM UTC:

Kevin,

I just recreated the FinishedGames table, and your Sac Chess game against Carlos is now listed there. I'm not sure why it didn't get in before, but I have been fixing up the code for entering finished games into this table, and hopefully something like this won't happen again. But if it does, let me know.


🕸📝Fergus Duniho wrote on Fri, Jun 3, 2016 08:58 PM UTC:

Things are getting weird. When I looked at Kevin Pacey's rating, I noticed it was still based on one game, not two. For some reason, the game he won was not getting added to the database. At this time, I was using the REPLACE command to populate the database. Also, it was failing silently. So, I Truncated the table, changed REPLACE to INSERT and recreated the table. This time, the game he won got in, but the game he lost did not. Maybe this game didn't make it in originally because of some mysterious problem with how INSERT works. It is frustrating that the MySQL commands are not performing reliably, and they are failing silently. So if it wasn't for noticing these specific logs, I would be unaware of the problem.


🕸📝Fergus Duniho wrote on Fri, Jun 3, 2016 09:12 PM UTC:

I changed INSERT back to REPLACE and ran the script for creating the FinishedGames table again. This time, the log for the game Kevin lost got in, and the log for the game he won vanished even though I did not Truncate the table prior to doing this. Also, the total number of rows in the table did not change.


🕸📝Fergus Duniho wrote on Fri, Jun 3, 2016 10:06 PM UTC:

I finally figured out the problem and got the logs for both of Kevin's games into the FinishedGames table. The problem was that both logs had the same name, and the table was set up to require each log to have a unique name. So I ALTERed the table to remove all keys, then I made the primary key the combination of Log + Game. Different logs had been recorded with INSERT and REPLACE, because INSERT would go with the first log it found with the same name, and REPLACE would replace any previous entries for the same log name with the last one. This change increased the size of the table from 4773 rows to 4883 rows.


Aurelian Florea wrote on Mon, Dec 11, 2017 10:58 AM UTC:

The rating system could be off. I'm not sure if ratings should change instantly, meaning once any game is finished, then the rating is recalculated for the 2 players in question :)! Anyway, yesterday a few games of mine (I think 3) finished, and the ratings have not changed. My rating should probably have ended up a bit below 1530.


🕸📝Fergus Duniho wrote on Mon, Dec 11, 2017 02:59 PM UTC:

Ratings are calculated holistically, and they are designed to become more stable the more games you play. You can read the details on the ratings page for more on how they work differently than Elo ratings.


Aurelian Florea wrote on Mon, Dec 11, 2017 03:48 PM UTC:

I did read the rules, but I have not understood them. It seemed to me that they do not look like Elo ratings, though. Anyway, Fergus, are you saying that they work fine?


🕸📝Fergus Duniho wrote on Mon, Dec 11, 2017 04:56 PM UTC:

As far as I'm aware, they do.


Aurelian Florea wrote on Tue, Dec 12, 2017 03:29 AM UTC:

@Fergus,

I think I know what is going on with my ratings. By now I already have quite a few games played, and losing to a very high rated opponent or winning against a very low rated opponent does not mean much for the algorithm, in terms of correcting my rating. I think this is how it is supposed to work.

So, are you using a system of equations where the unknowns are the ratings, and the coefficients are based on the results :)?!...


🕸📝Fergus Duniho wrote on Tue, Dec 12, 2017 05:04 PM UTC:

Aurelian, I have moved this discussion to the relevant page.

By now I already have quite a few games played, and losing to a very high rated opponent or winning against a very low rated opponent does not mean much for the algorithm, in terms of correcting my rating. I think this is how it is supposed to work.

Yes, it is supposed to work that way.

So, are you using a system of equations where the unknowns are the ratings, and the coefficients are based on the results :)?!...

I'm using an algorithm, which is a series of instructions, not a system of equations, and the ratings are never treated as unknowns that have to be solved for. Everyone starts out with a rating of 1500, and the algorithm fine-tunes each player's rating as it processes the outcomes of the games. Instead of processing every game chronologically, as Elo does, it processes all games between the same two players at once.


Aurelian Florea wrote on Wed, Dec 13, 2017 06:59 AM UTC:

Ok, I'm starting to understand it, thanks for the clarifications :)!


Kevin Pacey wrote on Mon, Apr 23, 2018 04:17 PM UTC:

Perhaps the Game Courier rating system could someday be altered to somehow take into account the number of times a particular chess variant has been played by a particular player, and/or between him and a particular opponent, when calculating the overall public (or rated) games played rating for a particular player.


Aurelian Florea wrote on Mon, Apr 23, 2018 04:58 PM UTC:

I used to  think that players that play a more diverse assortment of variants are disadvantaged by the current system, but probably it is not a big deal.

Also maybe larger games with more pieces should matter more as they are definitely more demanding.

But both of these things are quite difficult to do without a wide variety of statistics, which we cannot have at this time :)!


Joe Joyce wrote on Mon, Apr 23, 2018 05:42 PM UTC:

Actually, I found that when I played competitively a few years ago, the more different games I played, the better I played in all of them, in general. This did not extend to games like Ultima or Latrunculi, but it did apply to all the chesslike variants, as far as I can tell.


Aurelian Florea wrote on Tue, Apr 24, 2018 12:13 AM UTC:

There probably is something akin to general understanding :)!


🕸📝Fergus Duniho wrote on Tue, Apr 24, 2018 04:21 PM UTC:

This script has just been converted from mysql to PDO. One little tweak in the conversion is that if you mix wildcards with names, the SQL will use LIKE or = for each item where appropriate. So, if you enter "%Chess,Shogi", it will use "AND (Game LIKE '%Chess' OR Game = 'Shogi' )" instead of "AND (Game LIKE '%Chess' OR Game LIKE 'Shogi' )".
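
For anyone curious how such a clause might be put together, here is a rough sketch of the idea in Python (the real script is PHP, and the function below is hypothetical; in the actual code the values would presumably be bound as parameters rather than interpolated):

    # Sketch: use LIKE for entries containing the % wildcard, = otherwise.
    def game_clause(game_list):
        parts = []
        for name in game_list.split(','):
            name = name.strip()
            op = 'LIKE' if '%' in name else '='
            parts.append("Game {} '{}'".format(op, name))
        return "AND (" + " OR ".join(parts) + ")"

    print(game_clause("%Chess,Shogi"))
    # AND (Game LIKE '%Chess' OR Game = 'Shogi')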


🕸📝Fergus Duniho wrote on Wed, Apr 25, 2018 04:44 PM UTC:

Perhaps the Game Courier rating system could someday be altered to somehow take into account the number of times a particular chess variant has been played by a particular player, and/or between him and a particular opponent, when calculating the overall public (or rated) games played rating for a particular player.

First of all, the ratings script can be used for a specific game. When used this way, all its calculations will pertain to that particular game. But when it is used with a wildcard or with a list of games, it will base calculations on all included games without distinguishing between them.

Assuming it is being used for a specific game, the number of times two players have played that game together will be factored into the calculation. The general idea is that the more times two players play together, the larger the effect their results will have on the calculation. After a maximum amount of change to their ratings is calculated, it is reduced further by multiplying it by n/(n+10), where n is the number of games they have played together. As n increases, n/(n+10) increases too, getting ever nearer to the limit of 1. For example, n=1 gives us 1/11, n=2 gives us 2/12 or 1/6, n=3 gives us 3/13, ... n=90 gives us 90/100 or 9/10, and so on.

During the calculation, pairs of players are gone through sequentially. At any point in this calculation, it remembers how many games it has gone through for each player. The more games a player has played so far, the more stable his rating becomes. After the maximum change to each player's rating is calculated as described above, it is further modified by the number of games each player has played. Using p for the number of games a player has played, the maximum change gets multiplied by 1-(p/(p+800)). As p increases, so does p/(p+800), getting gradually closer to 1. But since this gets subtracted from 1, that means that 1-(p/(p+800)) keeps getting smaller as p increases, ever approaching but never reaching the limit of zero. So, the more games someone has played, the less his rating gets changed by results between himself and another player.
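
As a quick check, here is a small Python sketch reproducing the two damping factors just described (the names are illustrative, not the script's):

    # n = games these two players have played together
    def pair_factor(n):
        return n / (n + 10)

    # p = games this player has been through so far in the calculation
    def stability_factor(p):
        return 1 - (p / (p + 800))

    print(pair_factor(1), pair_factor(3), pair_factor(90))  # 0.0909..., 0.2307..., 0.9
    print(stability_factor(0), stability_factor(800))       # 1.0, 0.5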

Since p is a value that keeps increasing as the ratings are calculated, and its maximum value is not known until the calculations are finished, the entire set of calculations is done again in reverse order, and the two sets of results are averaged. This irons out the advantages any player gains from the order of calculations, and it ensures that every player's rating is based on every game he played.
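
The forward-and-backward averaging might be sketched like this (again only an illustration; calculate_pass stands in for the pairwise processing described above):

    # Run the pairwise calculation in both orders and average the results,
    # so no player benefits from where he happens to fall in the sequence.
    def rate_with_two_passes(pairs, calculate_pass):
        forward = calculate_pass(pairs)
        backward = calculate_pass(list(reversed(pairs)))
        return {player: (forward[player] + backward[player]) / 2
                for player in forward}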

As I examine my description of the algorithm, the one thing that seems to be missing is that a player who has played more games should have a destabilizing effect on the other player's rating, not just a stabilizing effect on his own rating. So, if P1 has played 100 games, and P2 has played 10, this should cause P2's rating to change even more than it would if P1 had also played only 10 games. At present, it looks like the number of games one's opponent has played has no effect on one's own rating. I'll have to examine the code and check whether it really matches the text description I was referring to while writing this response.


Kevin Pacey wrote on Wed, Apr 25, 2018 08:14 PM UTC:

I was wondering along the lines of whether we want a Game Courier rating system that rewards players for trying out a greater number of chess variants with presets. There are many presets that have barely been tried, if at all. However, conversely this could well 'punish' players who choose to specialize in playing only a small number of chess variants, perhaps for their whole Game Courier 'playing career'. [edit: in any case it seems, if I'm understanding right, the current GC rating system may 'punish' the winner (a player, or a userid at least) of a given game between 2 particular players who have already played each other many times, by not awarding what might otherwise be a lot of rating points for winning the given game in question.]


🕸📝Fergus Duniho wrote on Wed, Apr 25, 2018 11:18 PM UTC:

I was wondering along the lines of whether we want a Game Courier rating system that rewards players for trying out a greater number of chess variants with presets.

Such a ratings system would be more complicated and work differently than the current one. The present system can work for a single game or for a set of games, but when it does work with a set of games, it treats them all as though they were the same game.

However, conversely this could well 'punish' players who choose to specialize in playing only a small number of chess variants, perhaps for their whole Game Courier 'playing career'.

Yes, that would be the result. Presently, someone who specializes in a small set of games, such as Francis Fahys, can gain a high general GCR by doing well in those games.

in any case it seems, if I'm understanding right, the current GC rating system may 'punish' the winner of a given game between 2 particular players who have already played each other many times, by not awarding what might otherwise be a lot of rating points for winning the given game in question.

Game Courier ratings are not calculated on a game-by-game basis. For each pair of players, all the games played between them factor into the calculation simultaneously. Also, it is not designed to "award" points to players. It works very differently than Elo, and if you begin with Elo as your model for how a ratings system works, you could get some wrong ideas about how GCR works. GCR works through a trial-and-error method of adjusting ratings between two players to better match the ratings that would accurately predict the outcome of the games played between them. The number of games played between two players affects the size of this adjustment. Given the same outcome, a smaller adjustment is made when they have played few games together, and a larger adjustment is made when they have played several games together.

Getting back to your suggestion, one thought I'm having is to put greater trust in results that come from playing the same game and to put less trust in results that come from playing different games together. More trust would result in a greater adjustment, while less trust would result in a smaller adjustment. The rationale behind this is that results for the same game are more predictive of relative playing ability, whereas results from different games are more independent of each other. But it is not clear that this would reward playing many variants. If someone played only a few different games, the greater adjustments would lead to more extreme scores. This would reward people who do well in the few variants they play, though it would punish people who do poorly in those games. However, if someone played a wide variety of variants, smaller adjustments would keep his rating from rising as fast if he is doing well, and they would keep it from sinking as fast if he is not doing well. So, while this change would not unilaterally reward players of many variants over players of fewer variants, it would decrease the cost of losing in multiple variants.


🕸📝Fergus Duniho wrote on Thu, Apr 26, 2018 12:09 AM UTC:

How would you measure the diversity of games played between two players? Suppose X1 and Y1 play five games of Chess, 2 of Shogi, and 1 each of Xiang Qi, Smess, and Grand Chess. Then we have X2 and Y2, who play 3 games of Chess, 3 of Shogi, 2 of Xiang Qi, and 1 each of Smess and Grand Chess. Finally, X3 and Y3 have played two games each of the five games the other pairs have played. Each pair of players has played ten games of the same five games. For each pair, I want to calculate a trust value between a limit of 0 and a limit of 1, which I would then multiply by the maximum adjustment value to get a lower adjustment value.

Presently, the formula n/(n+10) is used, where n is the number of games played between them. In this case, n is 10, and the value of n/(n+10) is 10/20 or 1/2. One thought is to add up fractions that use the number of games played of each game.

X1 and Y1

5/(5+10)+2/(2+10)+1/(1+10)+1/(1+10)+1/(1+10) = 5/15+2/12+1/11+1/11+1/11 = 17/22 = 0.772727272

X2 and Y2

3/13 + 3/13 + 2/12 + 1/11 + 1/11 = 6/13 + 2/12 + 2/11 = 695/858 = 0.81002331

X3 and Y3

2/12 * 5 = 10/12 = 0.833333333

The result of this is to put greater trust in a diverse set of games than in a less diverse set, yet this is the opposite of what I was going for.

How would this change if I changed the constant 10 to a different value? Let's try 5.

X1 and Y1

5/(5+5)+2/(2+5)+1/(1+5)+1/(1+5)+1/(1+5) = 5/10+2/7+3/6 = 9/7 = 1.285714286

Since this raises the value above 1, it's not acceptable. Let's try changing 10 to 20.

X1 and Y1

5/(5+20)+2/(2+20)+1/(1+20)+1/(1+20)+1/(1+20) = 5/25+2/22+1/21+1/21+1/21 = 167/385 = 0.433766233

X2 and Y2

3/23 + 3/23 + 2/22 + 1/21 + 1/21 = 6/23 + 2/22 + 2/21 = 2375/5313 = 0.447016751

X3 and Y3

2/22 * 5 = 10/22 = 0.454545454

This follows the same pattern, though the trust values are lower. To clearly see the difference, look at X2 and Y2, and compare 2/22, which is for two games of Xiang Qi, with 2/21, which is for one game each of Smess and Grand Chess. 2/22 is the smaller number, which indicates that it is giving lower trust scores for the same game played twice.
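
These sums are easy to reproduce; the following Python sketch computes the trust value being explored here for any mix of games and any constant (this is the proposal under discussion, not what the current script does):

    from fractions import Fraction

    # counts: how many times each distinct game was played by the pair
    def trust_sum(counts, c=10):
        return sum(Fraction(n, n + c) for n in counts)

    print(float(trust_sum([5, 2, 1, 1, 1])))        # 0.7727... (X1 and Y1)
    print(float(trust_sum([3, 3, 2, 1, 1])))        # 0.8100... (X2 and Y2)
    print(float(trust_sum([2, 2, 2, 2, 2])))        # 0.8333... (X3 and Y3)
    print(float(trust_sum([5, 2, 1, 1, 1], c=20)))  # 0.4337... (X1 and Y1, constant 20)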

Since it is late, I'll think about this more later. In the meantime, maybe somebody else has another suggestion.


🕸📝Fergus Duniho wrote on Thu, Apr 26, 2018 12:43 AM UTC:

Presently, the more games two players play together, the greater the amount of trust that is given to the outcome of their games, but each additional game they play together adds a smaller amount of trust. This is why players playing the same game together would produce a smaller increase in trust than players playing different games together in the calculations I was trying out in my previous comment. Since this is how the calculations naturally fall, is there a rationale for doing it this way instead of what I earlier proposed? If one player does well against another in multiple games, this could be more indicative of general Chess variant playing ability, whereas if one does well against another mainly in one particular game but plays that game a lot, this may merely indicate mastery in that one game instead of general skill in playing Chess variants, and that may be due to specialized knowledge rather than general intelligence. The result of doing it this way is that players who played lots of different games could see a faster rise in their ratings than a player who specialized in only a few games. However, people who specialized in only a few games would also see slower drops in their ratings if they do poorly. For each side, there would be some give and take. But if we want to give higher ratings to people who do well in many variants, then this might be the way to do it.


Kevin Pacey wrote on Thu, Apr 26, 2018 09:56 AM UTC:

Hi Fergus

Note I did put a late, second edit to my previous post, mentioning the small distinction that we're talking about specific userids rather than specific players. I made this distinction since it's possible (and evident in some cases already on GC's ratings list) that people can have more than one userid, hence more than one Game Courier rating. While presumably it would be tough to prevent this if desired, starting a new rating from scratch at least does not guarantee that a player will get a higher one after many games (a long time ago, it may be worth noting, the Chess Federation of Canada allowed a given player to at least once effectively destroy his existing rating and begin again from scratch, perhaps for a fee).


🕸📝Fergus Duniho wrote on Thu, Apr 26, 2018 11:07 AM UTC:

It goes without saying that I am talking about userids. The script is unable to distinguish players by anything other than userid, and it has no confirmed data on which players are using multiple userids. All I can do about this is discourage the use of multiple userids so that this doesn't become much of a factor. But if someone wants to play games with multiple userids, he presumably has a reason for wanting to keep separate ratings for different games.


🕸📝Fergus Duniho wrote on Thu, Apr 26, 2018 12:46 PM UTC:

One concern I had was that adding up fractions for the number of times two players played each separate game could eventually add up to a value greater than 1. For example, if two players played 12 different games together, the total would be 12 * (1/11) or 12/11, which is greater than 1. One way to get around this is to divide the total by the number of different games played. Let's see how this affects my original scenarios:

X1 and Y1

5/(5+10)+2/(2+10)+1/(1+10)+1/(1+10)+1/(1+10) = 5/15+2/12+1/11+1/11+1/11 = 17/22 = 0.772727272

17/22 * 1/5 = 17/110 = 0.154545454

X2 and Y2

3/13 + 3/13 + 2/12 + 1/11 + 1/11 = 6/13 + 2/12 + 2/11 = 695/858 = 0.81002331

695/858 * 1/5 = 695/4290 = 0.162004662

X3 and Y3

2/12 * 5 = 10/12 = 0.833333333

10/12 * 1/5 = 10/60 = 0.1666666666

As before, these values are greater where the diversity is more evenly spread out, which is to say more homogeneous.

However, the number of different games played was fixed at 5 in these examples, and the number of total games played was fixed at 10. Other examples need to be tested.

Consider two players who play 20 individual games once each and two others who play 10 individual games twice each. Each pair has played 20 games total.

Scenario 1: 20 different games

(20 * 1/11) / 20 = 20/11 * 1/20 = 1/11

Scenario 2: 10 different games twice

(10 * 2/12)/10 = 20/12 * 1/10 = 2/12 = 1/6

Applying the same formula to these two scenarios, the 20 different games have no more influence than a single game, which is very bad. This would severely limit the ratings of people who are playing a variety of games. So, if diversity of games played is to be factored in, something else will have to be done.
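
A quick Python sketch makes the problem concrete (again, this only illustrates the formula being rejected here):

    from fractions import Fraction

    # Per-game trust sum divided by the number of distinct games played.
    def normalized_trust(counts, c=10):
        return sum(Fraction(n, n + c) for n in counts) / len(counts)

    print(float(normalized_trust([1] * 20)))  # 0.0909... = 1/11, no better than one game
    print(float(normalized_trust([2] * 10)))  # 0.1666... = 1/6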

The problem is that the importance of diversity is not as clear as the importance of quantity. It is clear that the more games two players have played together, the more likely it is that the outcome of their games is representative of their relative playing abilities. But whether those games are mixed or the same does not bear so clearly on how likely it is that the outcome of the games played reflects their relative playing abilities. With quantity as a single factor, it is easy enough to use a formula that returns a value that gets closer to 1 as the quantity increases. But with two factors, quantity and diversity, it becomes much less clear how they should interact. Furthermore, diversity is not simply about how many different games are played but also about how evenly the diversity is distributed, what I call the homogeneity of diversity. When I think about it, homogeneity of diversity sounds like a paradoxical concept. The X3 and Y3 example has a greater homogeneity of diversity than the other two, but an example where X4 and Y4 play Chess 10 times has an even greater homogeneity of diversity but much less diversity. Because of these complications in measuring diversity, I'm feeling inclined to not factor it in.

The most important part of the GCR method is the use of trial-and-error. Thanks to the self-correcting nature of trial-and-error, the difference that factoring in diversity could make is not going to have a large effect on the final outcome. So, unless someone can think of a better way to include a measure of diversity, it may be best to leave it out.


Kevin Pacey wrote on Thu, Apr 26, 2018 12:54 PM UTC:

If left as is, is the current rating system at least somewhat kind to a player who suddenly improves a lot (e.g. through study or practice), but who has already played a lot of games on Game Courier? I'm not so sure, even if said player from then on plays much more often against players he hasn't played much against before on GC.


Aurelian Florea wrote on Thu, Apr 26, 2018 01:14 PM UTC:

I was thinking that older results (older both in time and in number of games since) should maybe fade away. Is that very difficult to implement, Fergus? It seems fairer, but the trouble is that you need many games at the "same time" to make the ratings meaningful, and with the current player population that cannot be easily done :)!


🕸📝Fergus Duniho wrote on Thu, Apr 26, 2018 01:52 PM UTC:

The script includes an age filter. If you don't want to include old ratings, you can filter them out.


Greg Strong wrote on Thu, Apr 26, 2018 02:02 PM UTC:

I was also thinking the results from old games should be less trusted than the results from new games.  A recent game is a better indication of a player's ability than a 10-year-old game.

 

Kevin Pacey wrote on Thu, Apr 26, 2018 02:32 PM UTC:

Off-topic, there is much a chess variants player might do to improve in a relatively short period of time (aside from suddenly improved real-life conditions partly or wholly beyond his control, such as recovering from poor health or personal problems, or acquiring more free time than before). Besides any sort of intuition/experience acquired through sustained practice, there's general or specific study he might do on his own, as I alluded to previously. As Joe alluded to, there are many variants that are rather like standard chess, and study and practice of chess probably can only help playing chess variants generally.

Unlike for many chess variants, there is an abundance of chess literature that can help improvement, even at many variants, and hiring a chess teacher, coach or trainer will probably help those who play chess variants too. A chess trainer can help with any physical fitness regime, which also can help those who play chess variants. Similar avenues for improvement might also be available to those into other major chess variants with their own literature, such as Shogi and Chinese Chess, though these two are perhaps less generally applicable for overall improvement at chess variants than chess would be (not sure).


Greg Strong wrote on Thu, Apr 26, 2018 02:34 PM UTC:

That certainly is off-topic

 

Aurelian Florea wrote on Thu, Apr 26, 2018 04:59 PM UTC:

@Fergus,

I think it is not only about "old" in the calendar sense, but also "old" in the many games ago sense.

Also, I think fading away is a nicer way of putting things than cut-offs :)!


🕸📝Fergus Duniho wrote on Thu, Apr 26, 2018 06:31 PM UTC:

Since different people play games at different rates, the last n games of each player would not return a single set of games that would work for everyone. A chronological cut-off point works better, because it can be used to select a uniform set of games.


Greg Strong wrote on Thu, Apr 26, 2018 07:00 PM UTC:

I was envisioning a system where older games are considered with lesser weight by some formula, down to some minimum. A game should have, say, at least half the weight of a recent game no matter how old (for example).

I can play around with formulas if interested. Beyond the question of age, however, I think the system is good as-is.
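
One hypothetical formula along these lines (purely an illustration of the suggestion, not anything the script currently does) could be an exponential decay with a floor:

    # Weight halves every half_life years but never drops below 0.5.
    def age_weight(age_years, half_life=5.0):
        return max(0.5, 0.5 ** (age_years / half_life))

    print(age_weight(0))   # 1.0  (recent game, full weight)
    print(age_weight(5))   # 0.5  (five-year-old game, half weight)
    print(age_weight(20))  # 0.5  (the floor: never less than half)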


🕸📝Fergus Duniho wrote on Thu, Apr 26, 2018 07:46 PM UTC:

I have two problems with discounting the results of older games. One is that the decision concerning at what age to start reducing game results is arbitrary. The other is that the results for a game are zero points for a loss and one point for a win. While one point for a win could be reduced to a smaller value, zero points for a loss could not be reduced without introducing negative values. The alternative would then be to reduce the 1 point for a win and increase the zero points for a loss, making the results of the older game more drawish. This would distort the results of a game and possibly result in less accurate ratings. I prefer using the age filter to set a clean cut-off point at which older results just aren't included in the calculation.

