Who is responsible for "The Swedish Rating List"?
The Swedish Chess Computer Association (in Swedish "Svenska
schackdatorföreningen", abbreviated SSDF). The rating list is
the result of its members' efforts.
Does a game have to be played with a particular time limit, or is the list
based on games played with various time settings?
All games are played at tournament level (40 moves/2h). A game played at
anything other than an average of 3 minutes per move is not counted.
However, SSDF publishes a separate rating list for blitz games (5 min/game
or 60 moves/5 min) in PLY, the journal of the Association, a couple of times
a year.
Does SSDF have a lab full of computers playing each other?
No. All testing is done in our members' homes and on their own computers.
Also, most vendors are willing to lend us one or two chess computers
for testing when a new model is released.
Can anyone play a few test games and send his results to SSDF?
No, SSDF only accepts results from its members. Furthermore, we do not accept
tests from people having a commercial interest in computer chess. The
person responsible for managing the tests regularly calls all testers,
usually once every few weeks, and collects their latest results. On
those occasions, he also plans upcoming tests and suggests suitable
computers and/or programs to be pitted against each other.
Who is managing the list?
Since 1990, Thoralf Karlsson, chairman of SSDF, has been handling the list. He
took over from Göran Grottling who had been in charge since
the inception of SSDF in 1984.
How do you know that you can trust reported results?
It is mostly a question of confidence. We have known most of the testers for
many years and don't believe they would try to deceive us. Also, all
testers are required to keep a written record of their activity. In
case there are any doubts, those records will surely be of good help.
So, in theory, someone could have sent in false results?
Yes, but not on a large scale. Experience has taught us that a series of 20
games, which is our normal test match between two computers, can
produce some rather unexpected results. Still we'd be very suspicious
if anyone reported that, say, Super Constellation outplayed Genius on a
Pentium-class PC by 15 to 5. You must remember that normally a lot of
people test the same program or computer, and we are therefore able to
compare results from different sources. Likewise, a tester who
consistently reported low scores for, say, Richard Lang's programs
would raise a few eyebrows.
How many people are involved in the testing?
By the end of 1994, SSDF had played well over 40,000 games. This had taken
us eleven years, and all in all 132 testers have been involved, each
contributing anything between 1 and 5,770 games. During 1994 alone,
40 people were doing tests. Our most industrious tester played over 700
games that year; he usually plays three games in parallel. (And yes,
that means he has six computers!)
How are the tests carried out?
Our goal is always to play matches of 20 games between two
computers/programs under test. We also consider it important to
evaluate a computer against various kinds of opponents. For instance, a
new chess program has often been tested by the programmer against
contemporary products, but we also try to find some old programs to run
it against. The reason for this is that the programmer may well have
optimized its opening books so as to get maximum performance against
the major competitors in the market, but he has most likely not had
access to all the software that's hiding in the closets of SSDF's
members. Which games or computers actually get to play each other is
ultimately dependent on what kind of equipment our testers have or are
able to borrow. So in order to let, say, Mephisto Vancouver and MChess
Pro meet, we have to find someone with access to both the Mephisto
machine and an ordinary PC. Rule of thumb number three is that matches
between opponents whose ratings differ by more than 400 points are
meaningless. The outcome of such a match does not provide statistically
significant information (with a 400-point difference, Elo's formula already
predicts that the stronger side will score about 91 percent of the points).
Are all games played until mate?
No, but we don't accept early, grandmaster-style draws. Normally, a game is
allowed to go on a bit further than human players would have done. We
know that strange things can happen in computer games! Many testers
also follow their own rules. Gunnar Blomstrand, one of the more
productive, generally plays on until one of the computers evaluates its
position as -10 or below. Super Expert, Hiarcs and some other computers
are able to resign on their own, and of course we accept that if it
happens.
What is SSDF's opinion on so-called "killer libraries", opening libraries
that are specifically tuned to give good results when playing against
certain other computers?
We don't like them, but there is not much we can do. If we disqualified
results from games played with such a library, then surely someone
would protest against that. The best method is likely to be the one we
described above: make sure each program under test gets to play against
as many different opponents as possible, including older programs.
If it happens that two computers repeat a game they have played before,
are both games included in the results in that case?
Yes, a game is allowed to continue even if the tester can see that it is
going to be a duplicate. Any program that's stupid enough to lose the
same game several times has but itself to blame. Furthermore, from a
statistical point of view this behavior is not very important, since
the program is just as likely to repeat a win as a loss.
How are the ratings calculated?
SSDF uses its own rating program, written by our member Lars Hjorth, but
the basic formulas are derived from Arpad Elo's rating system. Our
program calculates, for each computer, the average rating for its
opponents and how many points it has scored. Given those two numbers,
Professor Elo's formulas produce a rating.
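As a rough illustration of that calculation, here is a minimal sketch in
Python. It uses the generic logistic Elo performance formula; Lars Hjorth's
actual program may well compute things differently, so treat this as an
approximation, not SSDF's method.

    import math

    def performance_rating(avg_opponent_rating, points, games):
        """Estimate a performance rating from the average opponent rating
        and the points scored, using the logistic Elo curve."""
        p = points / games                  # fraction of points scored
        p = min(max(p, 0.01), 0.99)         # a 100% score has no finite rating
        return avg_opponent_rating - 400 * math.log10(1 / p - 1)

    # Example: 12.5 points from 20 games against 2300-rated opposition
    print(round(performance_rating(2300, 12.5, 20)))  # about 2389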
Of course, if all computers are only tested against other computers, all we
get is a relative rating that is valid only among those computers. Therefore,
SSDF has played several hundred games between computers and human
players in serious tournaments and used these results to set a
"correct" absolute level for the rating list according to Swedish
conditions. Different national rating systems are not completely in
accordance though, and that has to be taken into account when reading
our list. For instance, US ratings seem to lie approximately 150
points above the corresponding Swedish ratings (maybe more when below
2000 and less on the other side of the scale). For ourselves we
obviously use the Swedish scale.
We firmly believe that our ratings are correct in the sense that if a
computer were to play a sufficient number of games against Swedish
humans, it would end up with a rating close to what it has on our list.
Unfortunately, as programs get better it becomes increasingly difficult
to arrange meaningful games against human players. Reassuringly, we've
noted that our ratings are fairly consistent with the results from the
yearly AEGON tournament in Holland.
SSDF often uses the term "margin of error". What factors influence the size
of this margin?
More than anything else, the number of games played determines the margin of
error (confidence range, see below). Once upon a time, we thought that
40 games between two computers was a lot. Nowadays, we know more about
statistics. After so few games, you can almost never say for sure which
computer is better. Of course, in most situations your result after 40
games looks similar to what you will see after 1,000 games. But it
happens often enough that the picture is different. Even if the two
are of equal strength, you may well get a result of 28-12 in the first
series of games and 12-28 in the next. The margin of error also depends
on the relative strength of the two computers. A big difference in
strength results in a larger margin of error. From a statistical point
of view, the optimal solution is thus to play a large number of games
against opponents of similar strength.
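For the statistically minded, here is a minimal sketch of how the number of
games drives the margin, treating each game as an independent trial and
converting the binomial standard error of the score into rating points via
the slope of the logistic Elo curve. This is a simplification rather than
SSDF's exact calculation, but it reproduces the margins on the list quite
well:

    import math

    def rating_margin(games, score_fraction=0.5, z=1.96):
        """Approximate 95% margin of error, in rating points."""
        p = min(max(score_fraction, 0.05), 0.95)
        se = math.sqrt(p * (1 - p) / games)         # std. error of the score
        slope = 400 / (math.log(10) * p * (1 - p))  # rating points per unit of score
        return z * slope * se

    print(round(rating_margin(40)))    # ~108 points: 40 games say very little
    print(round(rating_margin(176)))   # ~51 points: close to Gideon's first +59/-53
    print(round(rating_margin(1693)))  # ~17 points: matches Polgar's +/-17

Note that the slope term grows as the score moves away from 50 percent,
which is why a big difference in strength gives a larger margin of error.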
A typical line in your list looks like this: "Genius 3.0/P90/rating
2440/+54/-49". How should I interpret all those numbers?
They tell you that Genius 3.0, when played on a 90 MHz Pentium PC, with 95
percent probability has a rating between 2391 (2440-49) and 2494
(2440+54). The fact that we are using a 95 percent confidence interval
implies that, on the average, 5 percent of our ratings will indeed be
outside the specified range. Therefore, in a list with 60 computers,
three of them are probably erroneously rated. But neither we nor anyone
else knows which ones.
Can you explain why new computers tend to get a high rating, which then
decreases as more games are played?
That claim is simply not true. In early 1994, we studied the change in
rating for the 28 programs that had entered the list since the fall of
1991. Of those, exactly half increased their rating during the period,
while the other half lost points. Admittedly, most of the programs that
lost points were the ones with high ratings, but we regard that as pure
chance. It is true that CM The King has fallen dramatically (72
points), and so have Mephisto RISC, Vancouver 68030 and MChess Pro. But it's
equally true that Zarkov 2.5 has gained 65 points and Chessmaster 3000
51 points during the same period. We have observed that as more games
are played, the list seems to be "squeezed" so that the difference
between the top and bottom computers decreases. We are not sure why this
happens, but it is most likely a deficiency in Prof. Elo's rating system.
But undoubtedly there are cases where SSDF has missed the mark with a new
computer? Mephisto Polgar, for instance.
Yes. In 1989, after 94 games, Mephisto Polgar
was given the rating 2057 +/- 57. Now it has played 1693 games and has
a rating of 1973 +/- 17. Obviously, the first rating was too high and
Polgar was thus one of the computers that lay slightly outside its 95%
confidence range.
And how about Mephisto Gideon?
It was first rated on list number 8/93, after 176 games, and its rating
was given as 2319 (+59, -53). You'll recall that this is to be read
fully as "with 95% probability, the true rating lies between 2378 and
2266". After 393 games, Gideon's rating is 2280 (+37, -35), and we
cannot see that these two results are in any way inconsistent with each
other.
Do the testers use Windows multitasking when playing two PC programs
against each other?
No, definitely not! Even if such a solution could be made to work
technically, it would not produce correct results. Among other things,
it means that the programs would be unable to use the opponent's time,
the so-called "permanent brain" function. To test two PC programs
according to SSDF's rules, you must have two computers.
Are these two machines required to have identical memory configurations?
They usually have. But even if one of them had 8 Mbyte of RAM and the other
just 4, it would not mean a lot. Kathe Spracklen once estimated the net
effect of doubling the size of the hash tables to about 7 rating
points, and we have not found anything to contradict that. It's a fact
of life that not all PCs are the same, not even if they have identical
processors, and we'll have to live with that. With so many factors
besides the processor, like the speed of RAM, size of cache memory,
type of expansion bus, and architecture of the motherboard that affect
performance, there is no way that SSDF can enforce a standard.
Is automated testing being utilized?
Yes, thanks to Dr. Christian Donninger from Vienna, we have been able to run
automatic tests since November 1994. So far, only a few of our testers
have been able to do it; after all, it takes two computers.
Nevertheless, they have produced many results, and autotesting has
brought about a dramatic increase in testing capacity for PC programs,
which means that we are getting more programs onto the list faster.
The other testers will not be superfluous, however. It still takes
humans to maneuver ordinary chess computers, and we will need them for that.
How do you set program preferences, and what opening library do you use?
We use instructions from the manual or, in some cases, straight from the
source - the programmer. Experimenting with various styles of play is
out of the question, since it would require hundreds of games with each
setting to differentiate between them. We do not have the time to do
that. On those occasions when a program has more than one opening
library, we use the "tournament library". Now, we do not believe that
the choice of openings is as important as the programmers tend to
think, but we still have to use optimal settings. If we didn't, someone
would surely come along and blame an unexpected (bad) result on us for
not doing so.
In your rating list it often says that PC programs are played at
"50-66 MHz" and in some cases "25-33 MHz". What does that mean?
Some of our testers have PCs with a 486DX processor running at 50 MHz,
others have a 486DX2 running at 66 MHz. Many tests have
shown that the difference in speed between these two is not more than
5-10%. The 66 MHz processor is somewhat faster for chess programs, but
the difference in playing strength is not more than about 7 rating
points. Similarly, results obtained with 486 PCs running at 25 and 33
MHz have been lumped together. Actually, very few games have been
played on 25 MHz computers, but we still want to give as accurate
information as possible to our readers.
Why isn't TASC R30 on the list? It is definitely a strong chess computer.
SSDF has not had the opportunity to test TASC R30. It is an exclusive
computer that has not sold very well in Sweden, and no retailer has
been willing to lend us a machine. Neither has TASC, although they have
lent us a number of 30 MHz Chess Machine cards. A few members have
bought R30s though, and they have reported some results to SSDF. Those
results have been counted as if they had been played with a Chess
Machine 30-32 MHz.
How important is speed?
If you double the clock speed, you gain about 70 points. That was true ten
years ago, when we evaluated Constellation, Plymate and others at
different speeds, and it still seems to be true in 1995 when we run PC
programs that have a much higher playing strength. Some say that it
varies between different machines or programmers, that some programs
gain more and others less from an increase in speed, but that has never
been proved. You will certainly find differences if you compare all
programs we have tested at different clock frequencies, but such
differences could well be attributed to statistical inaccuracies.
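As a sketch of that rule of thumb, assuming the 70-points-per-doubling
relationship can be extrapolated to arbitrary speed ratios (an assumption,
not an established law):

    import math

    def speed_gain(speed_ratio, points_per_doubling=70.0):
        """Rating gain predicted by the 70-points-per-doubling rule."""
        return points_per_doubling * math.log2(speed_ratio)

    print(round(speed_gain(2.0)))   # 70: one full doubling of speed
    print(round(speed_gain(1.07)))  # ~7: the 5-10% gap between a 486DX/50
                                    #     and a DX2/66
    print(round(speed_gain(1.5)))   # ~41: in line with the 35-40 points
                                    #      between the faster and slower
                                    #      486 groups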
You were late in starting to test Fritz 3.0 last autumn. In your comments
you said that this was because ChessBase did not send you the
diskettes. But couldn't you have bought the program yourselves?
Yes, we could have, but it is a question of financial resources. We would
have needed to buy programs sufficient for 15-20 testers, which would
mean about 10 original diskettes. That would have been a considerable
cost for a small, non-profit association like ours. Fortunately, the
other programmers have not been as slow as ChessBase was last summer
(they eventually sent us two diskettes). Most of them (Lang, Hirsch,
Uniacke, Schroeder, de Koning and Weststrate) are very interested in
SSDF's test results and the position their program will achieve on the
list. Nowadays, they therefore send us a number of diskettes as soon as
possible. We are grateful for this!
Why did you test some PC programs at 66 MHz and others at 33 MHz? Surely,
this is not fair.
Our test work is governed by our resources. During 1993/94, about half of
the PC-testers had the "slower" 486 and half had the faster one. Had we
tested each program at both speeds, the number of games at a given
speed would have been halved and the statistical uncertainty therefore
greater. For example, instead of having played 300 games with Genius
2.0 at 66 MHz, we would have had 150 games at 66 MHz and 150 at 33 MHz.
Of course, the testers have a maximum capacity. By the way, the
different testers play their test games at very different paces.
You must remember that SSDF's rating list is not a commercially oriented
sales list. We assume that it is read by people who know to what degree
different processor speeds affect ratings. The difference between the
faster and slower 486 version is theoretically 35-40 points. We have
two examples of this, which confirm the theory: Genius 1.0 and MChess
Pro 3.12 have been tested at both speeds. Check this out for yourselves
in the rating list.
By now, all our testers have upgraded to faster 486s, and all PC programs
can be tested at the higher speed. But of course the problem has
started again as a few testers have now acquired Pentium 90 MHz
machines, and it will take time before we are able to test all new
programs on this processor.
But wasn't it strange that you chose to test Hiarcs 2 at the lower speed,
as it had sensationally won the World Championship in Munich 1993? And by
the way, why did you never test Hiarcs 2.1?
At the end of 1993, several new programs were released at about the same
time. These were Genius 2, MChess Pro 3.5, Chessmaster 4000, Hiarcs 2
and Socrates. As half of our testers at that time had 486/33 MHz
computers, we had to decide which programs to test on the faster 66 MHz
computers and which on the slower 33 MHz computers. Our guiding
principle was that the strongest programs should be tested at the higher
speed. Hiarcs 2 and Socrates were the two programs to be tested at 33 MHz,
and the
results showed that we made the right choice. Neither of these programs
turned out to be better than the other three programs in question. In
early 1995, Hiarcs 2.0 had 2208 after 229 games. At 50-66 MHz, the
program would have achieved 2250 at the most.
When we received the diskettes with Hiarcs 2.1 from Mark Uniacke, we had
already
completed 150 of the 229 games with the 2.0 version. We definitely had
no possibility to free resources to start all over again with the new
version, which according to Uniacke himself was probably only 10-15
points better. Furthermore, only a few weeks separated the release of
the two versions. For commercial reasons, it was better to market the
exact version that had achieved such a good result in Munich.
In the list 1/95 it says that 41,088 games have been played by 136
computers, but I can only find 59 on the list. In the long list, which
can be downloaded for free from SSDF's BBS, there are only 127
computers. How can this be?
In order to save space, we have over time taken 68 computers out of the
list. We also feel that it makes the list easier to grasp when old
computers, taken off the market many years ago, are removed. Nine
computers don't even show on the long version of the list, as they
haven't played the minimum number of 100 games required to attain a
position. However, all games played by all computers are included in
the calculation of the ratings. The old games contribute to the
stability of the list, and sometimes it is also nice to study the long
list if only for nostalgic reasons.
How is your BBS reached?
If you have a modem you can call +46 31 992301, which is the number for
our BBS, "Grottan BBS". The first time you call, you will be asked to answer a
few questions, and you cannot do much more. When the SysOp,
Göran Grottling, has accepted you as a new user, it becomes
possible to download files and read mail. You will be able to freely
get the latest rating list, including results from all individual
matches between computers, both in the short and long version. Grottan
BBS is managed by SSDF and contains a large amount of chess related
files. It is possible to choose English as the language in the BBS.
Grottan BBS is connected to FidoNet and has the address 2:203/245.
Where can I get the ordinary paper version of "The Swedish Rating List"?
You can't anymore! From now on, the rating list is available through the
Internet,
through our BBS and can be read in several chess computer magazines,
among them PLY.
A final question: do you in SSDF believe that your rating list represents
"The Absolute Truth"?
No, we are quite humble about this. Above all, we want our readers to
realize that the ratings are not exact enough to attach any
significance to rating differences of 10-20 points. You also have to
consider the confidence range for each program. We are also aware that
some hold the opinion that certain programs perform better in play
against humans, while others perform worse. However, we think that this
remains to be confirmed. Of course, the rating list of SSDF only
accurately represents the outcome for machines playing against each
other. Anyway, it must be better to rely on thousands of computer games
than only a few!