Comments by HGMuller

Piece Values
H. G. Muller wrote on Sat, May 3, 2008 06:59 AM UTC:
Ha, finally my registration could be processed manually, as all automatic procedures consistently failed. So this thread is now also open to me for posting. Let me start with some remarks on the ongoing discussion.

* I tried Reinhard's 4A vs 8N setup. In a 100-game match of 40/1' games with Joker80, the Knights are crushed by the Archbishops 80-20. So although in principle I agree with Reinhard that such extreme tests, with setups that make the environment for the pieces very alien compared to normal Chess, could be unreliable, I certainly would not take it for granted that his claim that 8 Knights beat 4 Archbishops is actually true.

Possible reasons for the discrepancy could be:

1) Reinhard did not base his conclusion on enough games. In my experience, using anything less than 100 games is equivalent to making the decision by throwing dice. It often happens that after 30 games the side that is leading by 60% will eventually lose by 45%.

2) Smirf does not handle the Archbishop well, because it is programmed to underestimate its value, and is prepared to trade it too easily for two Knights to avoid or postpone a Pawn loss, while Joker80 just gives the Pawn and saves its Archbishops until it can get 3 Knights for it.

3) The shorter time control used restricts search depth so much that Joker80 cannot recognize some higher, unnatural strategy (which has no parallel in normal Chess) where all Knights can be kept defending each other multiple times, because they all have identical moves, and so it judges the pieces more on the tactical merits that would be relevant for normal Chess.

* The arguments Reinhard gives against more realistic 'asymmetrical playtesting':

| Let me point to a repeatedly written detail: if a piece will be
| captured, then not only its average piece exchange value is taken
| from the material balance, but also its positional influence from
| the final detail evaluation. Thus it is impossible to create
| 'balanced' different armies by simply manipulating their pure material
| balance to become nearly equal - their positional influences probably
| would not be balanced as need be.

seem invalid. For one, all of us are good enough Chess players that we can recognize for ourselves, in the initial setup we use for playtesting, whether the Archbishop or Knight or whatever piece is part of the imbalance is an exceptionally strong or poor one, or just an average one. So we don't put a white Knight on e5 defended by Pf4, while the black d- and f-pawn already passed it, and we don't put it on a1 with white pawns on b3, c2 and black pawns on b4, c3.

In particular, I always test from opening positions, where none of the pieces is on a particularly good square, but they can be easily developed, as the opponent does not interdict access to any of the good squares either. So after a few opening moves, the pieces get to places that, almost by definition, are the average where you can get them.

Secondly, when setting up the position, we get the evaluation of the engine for that position, telling us if the engine does consider one of the sides highly favored positionally (by taking the difference between the engine evaluation and the known material difference for the piece values we know the engine is using). Although I would trust this less than my own judgement, it can be used as additional confirmation.

Like Derek says, averaging over many positions (like I always do: all my matches are played starting from 432 different CRC opening positions) will tend to have every piece on the average in an average position. If a certain piece, like A, would always have a +200cP 'positional' contribution (e.g. calculated as its contribution to mobility), no matter where you put it, then that contribution is not positional at all, but a hidden part of the piece value. Positional contributions should average to zero, when averaged over all plausible positions.
Furthermore, in Chess positional contributions are usually small compared to material ones, if they do not have to do with King safety or advanced passers. And none of the latter play a role in the opening positions I use.

* Symmetrical playtesting between engines with different piece-value sets is known to be a notoriously unreliable method. Dozens of people have reported trying it, often with quite advanced algorithms to step through search space (e.g. genetic algorithms, or annealing). The result was always the same: in the end (sometimes after months of testing) they obtained piece values that, when pitted against the original hand-tuned values, would consistently lose.

The reason is most likely that the method works in principle, but requires too many games in practice. Derek mentioned before that if two engines value certain piece combinations differently, they often exchange them for each other, creating a material imbalance, which then affects their winning chances. Well, 'often' is not the same as 'always'. For very large errors, like putting AR the undervaluation of A only can lead to much more complicated bad trades, as you have to have at least two pieces for A. The probability that this occurs is far smaller, and only 10-20% of the games will see such a trade.

Now the problem is that the games in which the bad trades do NOT happen will not be affected by the wrong piece value. So this subset of games will have a 50-50 outcome, pushing the total score average towards 50%. If A vs R+N gives you 60% winning chance (so 10% excess), if it is the only bad trade that happens (because you set A slightly under 8), and it happens in only 20% of the cases, the total effect you would see (and on which you would have to conclude the A value is suboptimal) would be 52%. But the 80% of games that did not contribute to learning anything about the A value, because in the end A was traded for A, will contribute to the statistical noise!

To recognize a 2% excess score instead of a 10% excess score you need a 5 times lower statistical error. But statistical errors only decrease as the SQUARE ROOT of the number of games. So to get it down a factor 5, you need 25 times as many games. You could not conclude anything before you had 2500 games!

Symmetrical playtesting MIGHT work if you first discard all the games that traded A for A (to eliminate the noise they produce, as they can't say anything about the correctness of the A value), and make sure you have about 100 games left. Otherwise, the result will be garbage.
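
The dilution argument in the post above can be checked with a few lines of Python. This is only a sketch of the arithmetic, not something from the original post: the function names are made up for illustration, and the 100-game baseline is the figure the post uses for a "clean" sample.

```python
def diluted_score(trade_score, trade_fraction, neutral_score=0.5):
    """Expected overall score when only a fraction of the games see the bad trade.

    Games without the trade are assumed to score 50-50, as in the post.
    """
    return trade_fraction * trade_score + (1.0 - trade_fraction) * neutral_score


def games_needed(base_games, base_excess_pct, observed_excess_pct):
    """Games required to keep the same relative precision on a smaller excess.

    The statistical error falls as 1/sqrt(games), so an excess that is
    k times smaller needs k**2 times as many games.
    """
    ratio = base_excess_pct / observed_excess_pct
    return round(base_games * ratio ** 2)


if __name__ == "__main__":
    print(diluted_score(0.60, 0.20))   # 60% in the 20% of games with the trade
    print(games_needed(100, 10, 2))    # 10% excess diluted to 2% excess
```

With the post's numbers this reproduces the 52% overall score and the 2500-game requirement.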

H. G. Muller wrote on Sat, May 3, 2008 07:14 AM UTC:
Well, this is exactly the kind of games I played. Plus that I do not play
from a single position, but shuffle the pieces in the backrank to have 432
different initial positions. This to minimize the risk that I am putting too
much emphasis on a position that inadvertently contained hidden tactics,
biasing the score.

If there are such positions, sometimes one side should be favored,
sometimes the other, and the effect will average out. If the possession of
one piece as opposed to another (or a set of others) would systematically
lead to more tactics in favor of that piece even from an opening position,
I think that is a valid contribution to the piece value of such a piece.

Of course I did all games on a 10x8 board, as I wanted to have piece
values for Capablanca Chess. If I were to do it on 8x8, I would use a
setup like yours, but with Q next to K for both sides, to make the piece
mix to which it is exposed even more natural. (Of course there always is a
problem introducing A and C in 8x8 Chess that they don't fit naturally on
the board, so you have to kick out some other pieces at their expense.) But
you don't have to kick out the same pieces all the time. It is perfectly
valid to sometimes give both sides an A on d1/d8, sometimes a Q,
sometimes a C, or sometimes Q+C at the expense of a Bishop. The total mix of
pieces in the game should be ON AVERAGE close to what it will be in real
games, or you cannot be sure that results are meaningful.

I never went more extreme than giving one side two A and the other two C
(or similarly AA vs QQ and CC vs QQ), by substituting A->C for one side of
the Capablanca array, and C->A for the other. For the total list of
combinations I tried, see:
http://z13.invisionfree.com/Gothic_Chess_Forum/index.php?showtopic=389&st=1
(For clarity: the pieces mentioned in that list were in general the
pieces I deleted from the opening array.)

H. G. Muller wrote on Sat, May 3, 2008 08:58 AM UTC:
For completeness, I listed the combinations that are relevant for
comparison of the Q, A and C value here:

Q-BNN    (172+ 186- 75=) 48.4%
Q-BBN    (143+ 235- 54=) 39.4%
C-BNN    (130+ 231- 71=) 38.3%
C-BBN    ( 39+  86- 11=) 32.7%
A-BNN    (124+ 241- 67=) 36.5%
RR-Q     (174+ 194- 64=) 47.7%
RR-CP    (131+ 227- 74=) 38.9%
RR-AP    (166+ 199- 67=) 46.2%
RR-C     (188+ 170- 74=) 52.1%
RR-A     (197+ 162- 73=) 54.1%
QQ-CC    (131+ 55-  30=) 67.6%
QQ-AA    (117+ 60-  39=) 63.2%
QQ-CCP   (112+ 72-  32=) 59.3%
QQ-AAP   (112+ 78-  26=) 57.9%
CC-AA    (102+ 89-  25=) 53.0%
Q-CP     (164+ 191- 77=) 46.9%
Q-AP     (191+ 186- 55=) 50.6%
Q-C      (215+ 161- 56=) 56.3%
Q-A      (219+ 138- 75=) 59.4%
C-A      (187+ 182- 63=) 50.6%
A-RN     (261+ 122- 49=) 66.1%
C-RN     (273+ 101- 58=) 69.9%
A-RNP    (247+ 121- 64=) 64.6%
C-RNP    (242+ 144- 46=) 61.3%

So it is not only that C and A have been tried against each other, alone or
in pairs. They have also been tested against Q (alone or in pairs, with or
without Pawn odds for the latter), BNN, RR and RN (with or without Pawn
odds). On the average, C does only slightly better than A, by
2-3%, where giving Pawn odds makes a difference of ~12%.

The A-RNP result seems a statistical fluke, as it is almost the same as
A-RN, while the extra Pawn obviously should help, and the A even does
better there than C-RNP. Note the statistical error in 432 games is 2.2%,
so that 32% of the results (so eight) should be off by more than 2.2%, and
5% (1 or 2) should be off by more than 4.5%. And A-RNP is most likely to be
that latter one.
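
The error estimate quoted above can be reproduced from any line of the table. The following Python sketch is not from the original post; it assumes each game is an independent result worth 1, 0.5 or 0 points and uses a normal approximation for the tail fractions. The function names are illustrative.

```python
import math


def match_stats(wins, losses, draws):
    """Return (score, standard_error) of a match, both as fractions of 1."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game results are 1, 0.5 or 0, so E[x^2] = (wins + 0.25 * draws) / games.
    mean_square = (wins + 0.25 * draws) / games
    variance = mean_square - score ** 2
    return score, math.sqrt(variance / games)


def tail_fraction(k):
    """Fraction of normally distributed results more than k standard errors off."""
    return math.erfc(k / math.sqrt(2.0))


if __name__ == "__main__":
    score, err = match_stats(187, 182, 63)   # the C-A line of the table
    print(f"{100 * score:.1f}% +/- {100 * err:.1f}%")
    for k in (1, 2):
        print(k, f"{24 * tail_fraction(k):.1f} of 24 results expected beyond {k} sigma")
```

For the C-A line this gives a score of about 50.6% with a standard error of about 2.2%, and the tail fractions come out near 32% and 5%, matching the figures in the post.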

Aberg variation of Capablanca's Chess. Different setup and castling rules. (10x8, Cells: 80)
H. G. Muller wrote on Sat, May 3, 2008 09:15 AM UTC:
Note that a Nash equilibrium in a symmetric zero-sum game must be the globally optimum strategy. If it weren't, the player scoring negative could unilaterally change his strategy to be the same as the one his opponent applies, and by symmetry then raise his score to 0, showing that the earlier situation could not have been a Nash equilibrium.

Piece Values
H. G. Muller wrote on Sat, May 3, 2008 10:14 AM UTC:
Sorry my original long post got lost.

As this is not a position where you can expect piece values to work, and
my computers are actually engaged in useful work, why don't YOU set it
up?

Aberg variation of Capablanca's Chess. Different setup and castling rules. (10x8, Cells: 80)
H. G. Muller wrote on Sat, May 3, 2008 10:34 AM UTC:
As piece values are only useful as strategic guidelines for quiet positions, they cannot be sensitive to who has the move. A position where it matters who has the move is by definition not quiet, as one ply later that characteristic will have essentially changed. So at the level of piece-value strategies, Chess is a perfectly symmetric game.

Piece Values
H. G. Muller wrote on Sat, May 3, 2008 10:36 AM UTC:
It seems to me that that is bad strategy. If you fail you should keep
trying until you succeed. Only when you succeed you can stop trying...

Aberg variation of Capablanca's Chess. Different setup and castling rules. (10x8, Cells: 80)
H. G. Muller wrote on Sat, May 3, 2008 03:22 PM UTC:
Sure, this is what people do and have done for ages. It is well known that the advantage of having the move is worth 1/6 of a Pawn (corresponding in normal Chess to a white score of 53-54%), and that, by inference, wasting a full move is equivalent to 1/3 of a Pawn.

But the point is that this does not alter the piece values. It just adds to them, like every positional advantage adds to them. In my test the advantage of having the lead move is neutralized by playing every position both with white to move and black to move.

Piece Values
H. G. Muller wrote on Sat, May 3, 2008 04:15 PM UTC:
To summarize the state of affairs, we now seem to have sets of piece
values for Capablanca Chess by:

Hans Aberg (1)
Larry Kaufman (1)
Reinhard Scharnagl (2)
H.G. Muller (3)
Derek Nalls (4)

1) Educated guessing based on known 8x8 piece values and assumptions on
synergy values of compound pieces
2) Based on board-averaged piece mobilities
3) Obtained as best-fit of computer-computer games with material
imbalance
4) Based on mobilities and more complex arguments, fitted to experimental
results ('playtesting')

I think we can safely dismiss method (1) as unreliable, as the (clearly
stated) assumptions on which they are based were never tested in any way,
and appear to be invalid.
Method (3) and (4) now are basically in agreement. 
Method (2) produces substantially different results for the Archbishop.

One problem I see with method (2) is that plain averaging over the board
does not seem to be the relevant thing to do, and even inconsistent at
places: suppose we apply it to a piece that has no moves when standing in
a corner, the corner squares would suppress the mobility. If otoh, the
same piece would not be allowed to move into the corner at all, the
average would be taken over the part of the board that it could access
(like for the Bishop), and would be higher than for the piece that could
go there, but not leave it (if there weren't too many moves to step into
the corner). While the latter is clearly upward compatible, and thus must
be worth more.

The moral lesson is that a piece that has very low mobility on certain
squares does not lose as much value because of that as the averaging
suggests, as in practice you will avoid putting the piece there. The SMIRF
theory does not take that into account at all.

Focussing on mobility only also makes you overlook disastrous handicaps a
certain combination of moves can have. A piece that has two forward
diagonal moves and one forward orthogonal (fFfW in Betza notation) has
exactly the same mobility as that with forward diagonal and backward
orthogonal moves (fFbW). But the former is restricted to a small (and ever
smaller) part of the board, while the latter can reach every point from
every other point. My guess is that the latter piece would be worth much
more than the former, although in general forward moves are worth more
than backward moves. (So fWbF should be worth less than fFbW.) But I have
not tested any of this yet.
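
The reachability claim about fFfW and fFbW can be checked by a breadth-first search on an empty board. This Python sketch is not part of the original post: the step lists encode one reading of the Betza strings (an assumption), the 10x8 board size is taken from the thread's Capablanca context, and blocking pieces are ignored.

```python
from collections import deque

# Steps as (dx, dy), with +y meaning "forward". These encode an assumed
# reading of the Betza notation used in the post.
F_FORWARD = [(-1, 1), (1, 1)]   # fF: the two forward diagonal steps
W_FORWARD = [(0, 1)]            # fW: forward orthogonal step
W_BACKWARD = [(0, -1)]          # bW: backward orthogonal step

fFfW = F_FORWARD + W_FORWARD
fFbW = F_FORWARD + W_BACKWARD


def reachable(start, steps, files=10, ranks=8):
    """Squares reachable from start in any number of moves on an empty board."""
    seen = {start}
    queue = deque([start])
    while queue:
        x, y = queue.popleft()
        for dx, dy in steps:
            nxt = (x + dx, y + dy)
            if 0 <= nxt[0] < files and 0 <= nxt[1] < ranks and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


if __name__ == "__main__":
    squares = [(x, y) for x in range(10) for y in range(8)]
    print("moves per piece:", len(fFfW), len(fFbW))
    print("fFbW worst case:", min(len(reachable(s, fFbW)) for s in squares))
    print("fFfW from a1:   ", len(reachable((0, 0), fFfW)))
    print("fFfW from a8:   ", len(reachable((0, 7), fFfW)))
```

Both pieces have three moves, but under these assumptions fFbW reaches all 80 squares from any starting square, while fFfW is confined to a shrinking forward cone.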

I am not sure how much of the agreement between (3) and (4) can be
ascribed to the playtesting, and how much to the theoretical arguments:
the playtesting methods and results are not extensively published and not
open to verification, and it is not clear how well the theoretical
arguments are able to PREdict piece values rather than POSTdict them. IMO
it is not possible to make an all-encompassing theory with just 4 or 6
empirical piece values as input, as any elaborate theory will have many
more than 6 adjustable parameters.

So I think it is crucial to get accurate piece values for more different
pieces. One keystone piece could be the Lion. This piece can make all leaps
to targets in a 5x5 square centered on it (and is thus a compound of Ferz,
Wazir, Alfil, Dabbabah and Knight). This piece seems to be 1.25 Pawn
stronger than a Queen (1075 on my scale). This reveals a very interesting
approximate law for piece values of short-range leapers with N moves:

value = (30+5/8*N)*N

For N=8 this would produce 280, and indeed the pieces I tested fall in the
range 265 (Commoner) to 300 (Knight), with FA (Modern Elephant), WD (Modern
Dabbabah) and FD in between. For N=16 we get 640, and I found WDN
(Minister) = 625 and FAN (High Priestess) and FAWD (Sliding General) 650.
And for the Lion, with N=24, the formula predicts 1080.
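
As a quick check of the approximate law above, here is a minimal Python sketch (not from the original post; the function name is illustrative) that evaluates the formula for the three move counts mentioned.

```python
def leaper_value(n_moves):
    """Approximate value of a short-range leaper with n_moves targets,
    using the empirical law value = (30 + 5/8*N)*N quoted in the post."""
    return (30 + 5 / 8 * n_moves) * n_moves


if __name__ == "__main__":
    for n in (8, 16, 24):
        print(n, leaper_value(n))
```

This reproduces the 280, 640 and 1080 quoted for N = 8, 16 and 24.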

My interpretation is that adding moves to a piece does not only add the
value of the move itself (as described by the second factor, N), but also
increases the value of all pre-existing moves, by allowing the piece to
better manoeuvre into place for aiming them at the enemy. I would therefore
expect it is mainly the captures that contribute to the second factor,
while the non-captures contribute to the first factor.

The first refinement I want to make is to disable all Lion moves one at a
time, as captures or as non-captures, to see how much that move
contributes to the total strength. The simple counting (as expressed by
the appearance of N in the formula) can then be replaced by a weighted
counting, the weights expressing the relative importance of the moves. (So
that forward captures might be given a much bigger weight than forward
non-captures, or backward captures along a similar jump.) This will
require a lot of high-precision testing, though.

H. G. Muller wrote on Sat, May 3, 2008 04:21 PM UTC:
Oh Yes, I forgot about:

[name removed] (5)

5) Based on safe checking

I am not sure that safe checking is of any relevance. Most games are not
won by checkmating the opponent King in an equal-material position, but
by annihilating the opponent's forces. So mainly by threatening Pawns and
other Pieces, not Kings. A problem is that safe checking seems to predict
zero value for pieces like Ferz, Wazir and Commoner, while the latter is
not that much weaker than the Knight. (And, averaged over all game
stages, might even be stronger than a Knight.) This directly seems to
falsify the method.

[The above has been edited to remove a name and/or site reference. It is
the policy of cv.org to avoid mention of that particular name and site to
remove any threat of lawsuits. Sorry to have to do that, but we must
protect ourselves. -D. Howe]

H. G. Muller wrote on Sat, May 3, 2008 04:46 PM UTC:
Reinhard, why do you attach such importance to the 4A-9N position? I think
that example is totally meaningless. If it would prove anything, it is
that you cannot get the value of 9 Knights by taking 9 times the Knight
value. It will prove _nothing_ about the Archbishop value. Chancellor and
Queen will encounter exactly the same problems facing an army of 9
Knights.

The problem is that there is a positional bonus for identical pieces
defending each other. This is well known (e.g. connected Rooks). Problem
is that such pair interactions grow as the square of the number of pieces,
and thus start to dominate the total evaluation if the number of identical
pieces gets extremely high (as it never will in real games).

Pieces like A, C and Q (or in particular the highest-valued pieces on the
board) will not get such bonuses, as the bonus is associated with the
safety of mutually defending each other, and tactical security in case the
piece is traded, because the recapture then replaces it by an identical
one, preserving all defensive moves it had. In absence of equal or higher
pieces, defending pieces is a useless exercise, as recapture will not
offer compensation. If you are attacked, you will have to withdraw. So the
mutual-defence bonus is also dependent on the piece makeup of the opponent,
and is zero for Archbishops when the opponent only has Knights, and very
high for Knights when the opponent has only Archbishops.

If you want to playtest material imbalances, the positional value of the
position has to be as equal as possible. The 4A-9N position violates that
requirement to an extreme extent. It thus cannot tell us anything about
piece values. Just like deleting the white Queen and all 8 black Pawns
cannot tell us anything about the value of Q vs P.

H. G. Muller wrote on Sat, May 3, 2008 05:18 PM UTC:
Well, Reinhard, there could be many explanations for the 'surprising'
strength of an all-Knight army, and we could speculate forever on it. But
it would only mean anything if we could actually find ways to test it. I
think the mutual defence is a real effect, and I expect an army of all
different 8-target leapers to do significantly worse than an army of all
Knights, even though all 8-target leapers are almost equally strong. But
it would have to be tested.

Defending each other for Archbishops is useless (in the absence of opponent
Q, C or A), as defending an Archbishop in the face of Knight attacks is of
zero use. So the fact they can do it is not worth anything.

Nevertheless, the Archbishops do not do as badly as you want to make us
believe, and I think they still would have a fighting chance against 9
Knights. So perhaps I will run this test (on the Battle-of-the-Goths
port, so that everyone can watch) if I have nothing better to do. But
currently I have more important and urgent things to do on my Chess PC. I
have a great idea for a search enhancement in Joker, and would like to
implement and test it before ICT8.

The Pizza Kings. An experimental army for Chess with Different Armies, with lots of calories.
H. G. Muller wrote on Sat, May 3, 2008 05:24 PM UTC:
I thought this piece (W+D+A+F+N) was called a Lion, but it seems I was misinformed. I playtested this piece in a Capablanca Chess environment, and it is not that excessively strong. It is about 1.25 pawn stronger than a Queen, 1075 on my scale (on 10x8 board).

Piece Values
H. G. Muller wrote on Sat, May 3, 2008 06:32 PM UTC:
Well, I got that from the beginning. But the problem is not that the A
cannot be defended. It is strong and mobile enough to care for itself. The
problem is that the Knights cannot be threatened (by A), because they all
defend each other, and can do so multiple times. So you can build a
cluster of Knights that is totally unassailable. That would be much more
difficult for a collection of all different pieces. Such a collection is
likely to always have some weak spots, which the extremely agile
Archbishops then seek out and attack with deadly precision.

But I don't see this as a fundamental problem of pitting different armies
against each other. After an unequal trade, any Chess game becomes a game
between different armies. But to define piece values that can be helpful
to win games, it is only important to test positions that could occur in
games, or at least are not fundamentally different in character from what
you might encounter in games. And the 4A-9N position definitely does not
qualify as such.

I think this is valid criticism against what Derek has done (testing
super-pieces only against each other, without any lighter pieces being
present), but has no bearing on what I have done. I never went further
than playing each side with two copies of the same super-piece, by
replacing another super-piece (which was then absent in that army). This
is slightly unnatural, but I don't expect it to lead to qualitatively
different games, as the super-pieces are similar in value and mobility.
And unlike super-pieces share already some moves, so like and unlike
super-pieces can cooperate in very similar ways (e.g. forming batteries).
It did not essentially change the distribution of piece values, as all
lower pieces were present in normal copy numbers.

I understand that Derek likes to magnify the effect by playing several
copies of the piece under test, but perhaps using 8 or 9 is overdoing it.
To test a difference in piece value as large as 200cP, 3 copies should be
more than enough. This can still be done in a reasonably realistic mix of
pieces, e.g. replacing Q and C on one side by A, and Q and A on the other
side by C, so that you play 3C vs 3A, and then give additional Knight
odds to the Chancellors. This would predict about +3 for the Chancellors
with the SMIRF piece values, and -2.25 according to my values. Both
imbalances are large enough to cause 80-90% win percentages, so that just
a few games should make it obvious which value is very wrong.

H. G. Muller wrote on Sat, May 3, 2008 06:42 PM UTC:
Derek Nalls:
| Given enough years (working with only one server), this quantity of 
| well-played games may eventually become adequate.

I never found any effect of the time control on the scores I measure for
some material imbalance. Within statistical error, the combinations I
tried produced the same score at 40/15', 40/20', 40/30', 40/40',
40/1', 40/2', 40/5'. Going to even longer TC is very expensive, and I
did not consider it worth doing just to prove that it was a waste of
time...

The way I see it, piece-values are a quantitative measure for the amount
of control that a piece contributes to steering the game tree in the
direction of the desired evaluation. He who has more control, can
systematically force the PV in the direction of better and better
evaluation (for him). This is a strictly local property of the tree. The
only advantage of deeper searches is that you average out this control
(which highly fluctuates on a ply-by-ply basis) over more ply. But in
playing the game, you average over all plies anyway.

H. G. Muller wrote on Sat, May 3, 2008 08:18 PM UTC:
| And by that this would create just the problem I have tried to 
| demonstrate. The three Chancellors could impossibly be covered, 
| thus disabling their potential to risk their own existence by 
| entering squares already influenced by the opponent's side.

You make it sound like it is a disadvantage to have a stronger piece,
because it cannot go on squares attacked by the weaker piece. To a certain
extent this is true, if the difference in capabilities is not very large.
Then you might be better off ignoring the difference in some cases, as
respecting the difference would actually deteriorate the value of the
stronger piece to the point where it was weaker than the weak piece. (For
this reason I set the B and N value in my 1980 Chess program Usurpator to
exactly the same value.) But if the difference between the pieces is
large, then the fact that the stronger one can be interdicted by the
weaker one is simply an integral part of its piece value.

And IMO this is not the reason the 4A-9N example is so biased. The problem
there is that the pieces of one side are all worth more than TWICE that of
the other. Rooks against Knights would not have the same problem, as they
could still engage in R vs 2N trades, capturing a singly defended Knight,
in a normal exchange on a single square. But 3 vs 1 trades are almost
impossible to enforce, and require very special tactics.

It is easy enough to verify by playtesting that playing CCC vs AAA (as
substitutes for the normal super-pieces) will simply produce 3 times the
score excess of playing a normal setup with on one side a C deleted, and
at the other an A. The A side will still have only a single A to harass
every C. Most squares on enemy territory will be covered by R, B, N or P
anyway, in addition to A, so the C could not go there anyway. And it is
not true that anything defended by A would be immune to capture by C, as
A+anything > C (and even 2A+anything > 2C). So defending by A will not
exempt the opponent from defending as many times as there is attack, by
using A as defenders. And if there was one other piece amongst the
defenders, the C had no chance anyway. 

The effect you point out does not nearly occur as easily as you think.
And, as you can see, only 5 of my different armies did have duplicated
superpieces. All the other armies were just what you would get if you
traded the mentioned pieces, thus detecting if such a trade would enhance
or deteriorate your winning chances or not.

H. G. Muller wrote on Sat, May 3, 2008 09:31 PM UTC:
Reinhard, if I understand you correctly, what you basically want to introduce
in the evaluation is terms of the type w_ij*N_i*N_j, where N_i is the
number of pieces of type i of one side, and N_j is the number of pieces of
type j of the opponent, and w_ij is a tunable weight.

So that, if type i = A and type j = N, a negative w_ij would describe a
reduction of the value of each Archbishop by the presence of the enemy
Knights, through the interdiction effect. Such a term would for instance
provide an incentive to trade A in a QA vs ABNN for the QA side, as his A
is suppressed in value by the presence of the enemy N (and B), while the
opponent's A would not be similarly suppressed by our Q. On the contrary,
our Q value would be suppressed by the opponent's A as well, so
trading A also benefits him there.

I guess it should be easy enough to measure if terms of this form have
significant values, by playing Q-BNN imbalances in the presence of 0, 1
and 2 Archbishops, and deducing from the score whose Archbishops are worth
more (i.e. add more winning probability). And similarly for 0, 1, 2
Chancellors each, or extra Queens. And then the same thing with a Q-RR
imbalance, to measure the effect of Rooks on the value of A, C or Q.

In fact, every second-order term can be measured this way. Not only for
cross products between own and enemy pieces, but also cooperative effects
between own pieces of equal or different type. With 7 piece types for each
side (14 in total) there would be 14*13/2 = 91 terms of this type possible.
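
The second-order terms w_ij*N_i*N_j discussed above can be sketched in Python as follows. This is an illustration only, not code from any of the engines mentioned: the choice of the seven piece types and the example weight of -5 are assumptions made for the example.

```python
from itertools import combinations

# Assumed set of 7 non-royal piece types per side in Capablanca Chess.
PIECE_TYPES = ["P", "N", "B", "R", "A", "C", "Q"]

# One label per (side, type): 14 kinds of piece in total.
KINDS = [(side, t) for side in ("own", "enemy") for t in PIECE_TYPES]


def pair_terms():
    """All unordered pairs of distinct kinds, one per possible weight w_ij."""
    return list(combinations(KINDS, 2))


def second_order_eval(counts, weights):
    """Sum of w_ij * N_i * N_j over all pairs; missing weights count as 0."""
    return sum(
        weights.get((a, b), 0.0) * counts.get(a, 0) * counts.get(b, 0)
        for a, b in pair_terms()
    )


if __name__ == "__main__":
    print(len(pair_terms()))  # 14*13/2 cross terms
    # Hypothetical weight: each enemy Knight lowers the value of our Archbishop.
    weights = {(("own", "A"), ("enemy", "N")): -5.0}
    counts = {("own", "A"): 1, ("enemy", "N"): 2}
    print(second_order_eval(counts, weights))
```

The pair count matches the 91 terms given in the post. Terms between two pieces of the same kind (such as a Knight pair bonus) would need an extra N_i*(N_i-1)/2 term, which this sketch leaves out.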

H. G. Muller wrote on Sun, May 4, 2008 08:57 AM UTC:
Derek Nalls: | The additional time I normally give to playtesting games to improve | the move quality is partially wasted because I can only control the | time per move instead of the number of plies completed using most | chess variant programs. Well, on Fairy-Max you won't have that problem, as it always finishes an iteration once it decides to start it. But although Fairy-Max might be stronger than most other variant-playing AIs you use, it is not stronger than SMIRF, so using it for 10x8 CVs would still be a waste of time. Joker80 tries to minimize the time wastage you point out by attempting only to start iterations when it has time to finish them. It cannot always accurately guess the required time, though, so unlike Fairy-Max it has built in some emergency breaks. If they are triggered, you would have an incomplete iteration. Basically, the mechanism works by stopping to search new moves in the root if there already is a move with a similar score as on the previous iteration, once it gets in 'overtime'. In practice, these unexpectedly long iterations mainly occur when the previously best move runs into trouble that so far was just beyond the horizon. As the tree for that move will then look completely different from before, it takes a long time to search (no useful information in the hash), and the score will have a huge drop. It then continues searching new moves even in overtime in a desparate attempt to find one that avoids the disaster. Usually this is time well spent: even if there is no guarantee it finds the best move of the new iteration, if it aborts it early, it at least has found a move that was significantly better than that found in the previous iteration. Of course both Joker80 and Fairy-Max support the WinBoard 'sd' command, allowing you to limit the depth to a certain number of plies, although I never use that. I don't like to fix the ply depth, as it makes the engine play like an idiot in the end-game. 
| Can you explain to me in a way I can understand how and why 
| you are able to successfully obtain valuable results using this 
| method?
Well, to start with, Joker80 at 1 sec per move still reaches a depth of 8-9 ply in the middle-game, and would probably still beat most Humans at that level. My experience is that, if I immediately see an obvious error, it is usually because the engine makes a strategic mistake, not a tactical one. And such strategic mistakes are awfully persistent, as they are a result of faulty evaluation, not search. If it makes them at 8 ply, it is very likely to make that same error at 20 ply, as even 20 ply is usually not enough to get the resolution of the strategic feature within the horizon.

That being said, I really think that an important reason I can afford fast games is a statistical one: by playing so many games I can be reasonably sure that I get a representative number of gross errors in my sample, and they more or less cancel each other out on average. Suppose at a certain level of play 2% of the games contain a gross error that turns a totally won position into a loss. If I play 10 games, there is a ~20% probability that one game contains such an error (affecting my result by 10%), and only ~2% probability of two such errors (that then in half the cases would cancel, but in other cases would put the result off by 20%). If, OTOH, I would play 1000 faster games, with an increased 'blunder rate' of 5% because of the lower quality, I would expect 50 blunders. But the probability that they were all made by the same side would be negligible. In most cases the imbalance would be around sqrt(50) ~ 7. That would impact the 1000-game result by only 0.7%. So virtually all results would be off, but only by about 0.7%, so I don't care too much.

Another way of visualizing this would be to imagine the game state-space as a 2-dimensional plane, with two evaluation terms determining the x- and y-coordinate. Suppose these terms can both run from -5 to +5 (so the state space is a square), and the game is won if we end in the unit circle (x^2 + y^2 < 1), but that we don't know that. Now suppose we want to know how large the probability of winning is if we start within the square with corners (0,0) and (1,1) (say this is the possible range of the evaluation terms when we possess a certain combination of pieces). This should be the area of a quarter circle, PI/4, divided by the area of the square (1), so PI/4 = 79%.

We try to determine this empirically by randomly picking points in the square (by setting up the piece combination in some shuffled configuration), and letting the engines play the game. The engines know that getting closer to or farther away from (0,0) is associated with changing the game result, and are programmed to maximize or minimize this distance to the origin. If they both play perfectly, they should by definition succeed in doing this. They don't care about the 'polar angle' of the game state, so the point representing the game state will make a random walk on a circle around the origin. When the game ends, it will still be in the same region (inside or outside the unit circle), and games starting in the won region will all be won.

Now with imperfect play, the engines will not conserve the distance to the origin, but their tug of war will sometimes change it in favor of one or the other (i.e. towards the origin, or away from it). If the engines are still equally strong, by definition on average this distance will not change. But its probability distribution will now spread out over a ring with finite width during the game. This might lead to won positions close to the boundary (the unit circle) now ending up outside it, in the lost region. But if the ring of final game states is narrow (width << 1), there will be a comparable number of initial game states that diffuse from within the unit circle to the outside, as in the other direction. In other words, the game score as a function of the initial evaluation terms is no longer an absolute all or nothing, but the circle is radially smeared out a little, making a smooth transition from 100% to 0% in a narrow band centered on the original circle.

This will hardly affect the averaging, and in particular, making the ring wider by decreasing playing accuracy will initially hardly have any effect. Only when play gets so wildly inaccurate that the final positions (where win/loss is determined) diverge so far from the initial point that they could cross the entire circle will you start to see effects on the score. In the extreme case where the radial diffusion is so fast that you could end up anywhere in the 10x10 square when the game finishes, the result score will only be PI/100 = 3%. So it all depends on how much the imperfections in the play spread out the initial positions in the game-state space. If this is only small compared to the measures of the won and lost areas, the result will be almost independent of it.
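
Both arguments in this comment can be checked numerically. The sketch below is only an illustration under stated assumptions: the blunder part uses the exact binomial formula (which gives somewhat lower values than the rounded 20% and 2%), and the geometric part models imperfect play as Gaussian noise of width `sigma` on the distance to the origin, which is a simplification chosen here and not something the engines literally do. All function names are hypothetical.

```python
import math
import random


def binom_pmf(n, k, p):
    """Probability that exactly k of n games contain a gross error."""
    return math.comb(n, k) * p**k * (1 - p)**(n - k)


def typical_imbalance(n_games, blunder_rate):
    """Expected number of blunders, and the typical net imbalance when each
    blunder falls on either side with equal probability (its square root)."""
    expected = n_games * blunder_rate
    return expected, math.sqrt(expected)


def win_fraction(sigma, n=50_000, seed=1):
    """Start uniformly in the square (0,0)-(1,1), smear the distance to the
    origin with Gaussian noise of width sigma (assumed model of imperfect
    play), and count the games that finish inside the unit circle."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        r = math.hypot(rng.random(), rng.random())
        r_final = abs(r + rng.gauss(0.0, sigma))
        if r_final < 1.0:
            wins += 1
    return wins / n


def uniform_square_fraction(n=50_000, seed=1):
    """Extreme case: the final state lands anywhere in the 10x10 square."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n):
        x, y = rng.uniform(-5, 5), rng.uniform(-5, 5)
        if x * x + y * y < 1.0:
            wins += 1
    return wins / n


if __name__ == "__main__":
    print("10 games, 2%: P(1) =", round(binom_pmf(10, 1, 0.02), 3),
          " P(2) =", round(binom_pmf(10, 2, 0.02), 3))
    blunders, imbalance = typical_imbalance(1000, 0.05)
    print("1000 games, 5%:", blunders, "blunders, imbalance ~",
          round(imbalance, 1), "games =", round(100 * imbalance / 1000, 1), "%")
    for sigma in (0.0, 0.05, 3.0):
        print("sigma", sigma, "-> win fraction", round(win_fraction(sigma), 3))
    print("anywhere in the 10x10 square ->", round(uniform_square_fraction(), 3))
```

With no noise the win fraction should come out near PI/4, a narrow ring should change it only slightly, and only a very wide ring should pull it far away, in line with the reasoning above.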

Simplified Chess. Missing description (8x7, Cells: 56)
H. G. Muller wrote on Thu, May 8, 2008 02:44 PM UTC:
I don't think that 'promoting to a captured piece only' is a simplification of the rules. 'Always promote to Queen' would be a simplification. This just adds a complex rule.

H. G. Muller wrote on Thu, May 8, 2008 04:15 PM UTC:
Well, I do not consider the stalemate rule essential to Chess, and there are many variants where stalemate = loss.

You won't get rid of many draws, though, when you abolish it. To get rid of draws entirely, you could add some kind of tie-break to the game, like penalty shootouts in soccer: in a position where FIDE rules would declare a draw (50-move rule, 3-fold repetition, insufficient material) you could trigger this tie-break. From the moment it is triggered, the opponent can do two moves in a row, then you can do three moves, then he 4, etc. This would even work for King vs King, as in the end there will be no place you can hide without his King being able to capture you.

H. G. Muller wrote on Fri, May 9, 2008 10:24 AM UTC:
Rich Hutnik:
| Do you want a flipped rook to become a 'Jester' piece that can 
| represent any other piece on the board?  Guess what the rook does 
| now.  It is that.  
This has never been a problem when I was playing OTB games. In most variants the choice of promotion piece is a rather academic one anyway, as in practice almost always the strongest piece is chosen. After the promotion, if the piece is then not captured, the game is over in 5 or 6 moves...

Even in Capablanca Chess, where there are 3 nearly equivalent pieces available (Q, C and A), it took me months before I discovered that 'underpromotion' to C or A was not properly implemented in my engine Joker80. Although it was considering promotions other than Q in its search, the MakeMove routine at game level always promoted to Q, overruling the choice. This never changed the game result, and I only discovered it when Joker80 announced mate-in-1 on a promotion move, and then the game continued a few more moves before it actually was checkmate.

People that want to play variants should have a wider choice of piece equipment anyway. An inverted Rook is a warning sign that whatever it is, it is not a Rook. But nothing is more annoying to a Chess player than having a Knight on the board that doesn't move like a Knight, but as a Camel or Zebra. The solution to that is easy enough:

http://home.hccnet.nl/h.g.muller/ultima.html

Piece Values
H. G. Muller wrote on Mon, May 12, 2008 05:57 AM UTC:
To Derek:

I am aware that the empirical Rook value I get is suspiciously low. OTOH,
it is an OPENING value, and Rooks get their value in the game only late.
Furthermore, this only is the BASE VALUE of the Rook; most pieces have a
value that depends on the position on the board where it actually is, or
where you can quickly get it (in an opening situation, where the opponent
is not yet able to interdict your moves, because his pieces are in
inactive places as well). But Rooks only increase their value on open
files, and initially no open files are to be seen. In a practical game, by
the time you get to trade a Queen for 2 Rooks, there usually are open
files. So by that time, the value of the Q vs 2R trade will have gone up
by two times the open-file bonus. You hardly have the possibility of
trading it before there are open files. So it stands to reason that you
might as well use the higher value during the entire game.

In 8x8 Chess, the Larry Kaufman piece values include the rule that a Rook
should be devalued by 1/8 Pawn for each Pawn on the board in excess of
five. In the case of 8 Pawns that is a really large penalty of 37.5cP for
having no open files. If I add that to my opening value, the late
middle-game / end-game value of the Rook gets to 512, which sounds a lot
more reasonable.
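
The arithmetic behind this correction can be written out as a small
sketch. The function name is hypothetical, and the base value of 475 is
an assumption taken from the R=475 figure quoted elsewhere in this
thread; only the 1/8 Pawn per Pawn over five comes from the rule as
stated above.

```python
def rook_pawn_penalty_cp(pawns, per_pawn_cp=12.5, threshold=5):
    """Rook devaluation in centipawns: 1/8 Pawn for each Pawn over five."""
    return max(0, pawns - threshold) * per_pawn_cp


opening_rook_cp = 475                   # assumed opening (base) value
penalty = rook_pawn_penalty_cp(8)       # all 8 Pawns still on the board
print(penalty)                          # 37.5
print(opening_rook_cp + penalty)        # 512.5
```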

There are two different issues here:
1) The winning chances of a Q vs 2R material imbalance game
2) How to interpret that result as a piece value

All I say above has no bearing on (1): if we both play a Q-2R match from
the opening, it is a serious problem if we don't get the same result. But
you have played only 2 games. Statistically, 2 games mean NOTHING. I don't
even look at results before I have at least 100 games, because before they
are about as likely to be the reverse from what they will eventually be,
as not. The standard deviation of the result of a single Gothic Chess game
is ~0.45 (it would be 0.5 point if there were no draws possible, and in
Gothic Chess the draw percentage is low). This error goes down as the
square root of the number of games. In the case of 2 games this is
45%/sqrt(2) = 32%. The Pawn-odds advantage is only 12%. So this standard
error corresponds to 2.66 Pawns. That is 1.33 Pawns per Rook. So with this
test you could not possibly see if my value is off by 25, 50 or 75. If you
find a discrepancy, it is enormously more likely that the result of your
2-game match is off from the true win probability.

Play 100 games, and the error in the observed score is reasonably certain
(68% of the cases) to be below 4.5% ~ 1/3 Pawn, so 16 cP per Rook. Only then
can you see with reasonable confidence if your observations differ from
mine.
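
The error estimate used here is easy to reproduce. The sketch below
only restates the numbers from this post (0.45 standard deviation per
game, 12% score advantage per Pawn); the function names are
hypothetical, and treating score as linear in Pawn advantage is the
same simplification the post makes.

```python
import math

SIGMA_PER_GAME = 0.45   # standard deviation of a single game result
SCORE_PER_PAWN = 0.12   # excess score produced by a one-Pawn advantage


def score_std_error(n_games):
    """Standard error of the match score (as a fraction) after n games."""
    return SIGMA_PER_GAME / math.sqrt(n_games)


def error_in_pawns(n_games):
    """The same error expressed as an equivalent material difference."""
    return score_std_error(n_games) / SCORE_PER_PAWN


for n in (2, 100):
    pawns = error_in_pawns(n)
    print(n, "games:", round(100 * score_std_error(n), 1), "% =",
          round(pawns, 2), "Pawns =", round(100 * pawns / 2), "cP per Rook")
```

For 2 games this gives about 32% and for 100 games 4.5%. The post
rounds the latter to 1/3 Pawn, which is where the 16 cP per Rook comes
from.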

H. G. Muller wrote on Mon, May 12, 2008 06:06 AM UTC:
Note that you can also use WinBoard as a FEN editor. There are commands
(with shortcut keys) to copy FENs from and to the clipboard. And there is
an edit-position mode that allows you to conveniently drag and drop pieces
over the board, and add new ones from a popup menu when right-clicking a
square.

http://home.hccnet.nl/h.g.muller/winboardF.html

Knightmate. Win by mating the knight. (8x8, Cells: 64)
H. G. Muller wrote on Mon, May 12, 2008 09:31 PM UTC:
Note that a WinBoard-compatible version of the variant-capable engine Dabbaba by Jens Baek Nielsen has just been released. One of the games it knows is Knightmate. You can currently watch it play a Knightmate match live against my own engine Fairy-Max, on my Chess-Live! webserver

http://80.100.28.169/gothic/knightmate.html

for the next one or two days.

If anyone knows any other WinBoard engines that can play Knightmate, let me know; then I could hold a tournament.

Piece Values
H. G. Muller wrote on Mon, May 12, 2008 10:12 PM UTC:
Derek Nalls:
| They definitely mean something ... although exactly how much is not 
| easily known or quantified (measured) mathematically.
Of course that is easily quantified. The entire mathematical field of
statistics is designed to precisely quantify such things, through
confidence levels and uncertainty intervals. The only thing you proved
with reasonable confidence (say 95%) is that two Rooks are not 1.66 Pawn
weaker than a Queen. So if Q=950, then R > 392. Well, no one claimed
anything different. What we want to see is if Q-RR scores 50% (R=475) or
62% (R=525). That difference just can't be seen with two games. Play 100.
There is no shortcut. Even perfect play doesn't help. We do have perfect
play for all 6-men positions. Can you derive piece values from that, even
end-game piece values???
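
The mapping between a Q vs RR match score and the Rook value implied by
it can be sketched as follows. This assumes the same linear relation
used above (12% of score per Pawn, Q=950) and takes the score from the
side of the two Rooks; the function name is hypothetical.

```python
def implied_rook_value(rr_score, queen_cp=950, score_per_pawn=0.12):
    """Rook value (cP) implied by the score of the 2R side against a Queen.

    Assumes the score rises linearly by score_per_pawn for each Pawn of
    material advantage, which is only a rough approximation.
    """
    pawns_ahead = (rr_score - 0.5) / score_per_pawn
    return (queen_cp + 100 * pawns_ahead) / 2


print(implied_rook_value(0.50))                 # even match
print(implied_rook_value(0.62))                 # 2R one Pawn ahead
print(implied_rook_value(0.50 - 1.66 * 0.12))   # 2R 1.66 Pawn behind
```

The three calls correspond to the R=475, R=525 and R=392 cases
mentioned in this post.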

| Statistically, when dealing with speed chess games populated 
| exclusively with virtually random moves ... YES, I can understand and 
| agree with you requiring a minimum of 100 games.  However, what you 
| are doing is at the opposite extreme from what I am doing via my 
| playtesting method.
Where do you get this nonsense? This is approximately master-level play.
Fact is that results from playing opening-type positions (with 35 pieces
or more) are a stochastic quantity at any level of play we are likely to
see in the next few million years. And even if they weren't, so that you could
answer the question 'who wins' through a 35-men tablebase, you would
still have to make some average over all positions (weighted by relevance)
with a certain material composition to extract piece values. And if you
would do that by sampling, the result would again be a stochastic quantity.
And if you would do it by exhaustive enumeration, you would have no idea
which weights to use.
And if you are sampling a stochastic quantity, the error will be AT LEAST
as large as the statistical error. Errors from other sources could add to
that. But if you have two games, you will have at least 32% error in the
result percentage. Doesn't matter if you play at an hour per move, a week
per move, a year per move, 100 years per move. The error will remain >=
32%. So if you want to play 100 years per move, fine. But you will still
need 100 games.

| Nonetheless, games played at 100 minutes per move (for example) have 
| a much greater probability of correctly determining which player has 
| a definite, significant advantage than games played at 10 seconds per 
| move (for example).
Why do I get the suspicion that you are just making up this nonsense? Can
you show me even one example where you have shown that a certain material
advantage would be more than 3-sigma different for games at 100 min / move
than for games at 1 sec/move? Show us the games, then. Be aware that this
would require at least 100 games at each time control. That seems to make
it a safe guess that you did not do that for 100 min/move.
On the other hand, instead of just making things up, I have actually
done such tests, not with 100 games per TC, but with 432, and for the
faster even with 1728 games per TC. And there was no difference beyond the
expected and unavoidable statistical fluctuations corresponding to those
numbers of games, between playing 15 sec or 5 minutes. 
The advantage that a player has in terms of winning probability is the
same at any TC I ever tried, and can thus equally reliably be determined
with games of any duration. (Provided you have the same number of games.)
If you think it would be different for extremely long TC, show us
statistically sound proof.
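
What a '3-sigma difference' between two time controls would require can
be sketched like this. It assumes the same 0.45 standard deviation per
game at both time controls and independent matches; the function names
and the example scores are hypothetical, not results from my tests.

```python
import math


def difference_std_error(n_a, n_b, sigma_per_game=0.45):
    """Standard error of the difference between two match scores."""
    return sigma_per_game * math.sqrt(1 / n_a + 1 / n_b)


def sigma_difference(score_a, n_a, score_b, n_b):
    """How many standard errors two observed scores are apart."""
    return (score_a - score_b) / difference_std_error(n_a, n_b)


def min_detectable_difference(n_a, n_b, n_sigma=3):
    """Smallest score difference that would count as an n_sigma effect."""
    return n_sigma * difference_std_error(n_a, n_b)


# Hypothetical example: 62% vs 50% with 100 games at each time control.
print(round(sigma_difference(0.62, 100, 0.50, 100), 2))
# Score difference needed for 3 sigma with 432 games at each time control.
print(round(100 * min_detectable_difference(432, 432), 1), "%")
```

Even a 12% gap stays below 2 sigma with 100 games per time control, and
with 432 games each the scores would have to differ by roughly 9%
before it counted as a 3-sigma effect.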

I might comment on the rest of your long posting later, but have to go
now...

25 comments displayed
