This FAQ has been compiled by Göran Grottling.
Q: Who is responsible for "The Swedish Rating List"?
A: The Swedish Chess Computer Association (in Swedish
"Svenska Schackdatorföreningen", abbreviated SSDF). The rating list is the result of its
members' efforts.
Q: Does a game have to be played with a particular
time limit, or is the list based on games played with various time settings?
A: All games are played at tournament level (40 moves/2h).
A game played at anything other than this average of 3 minutes per move is not counted. However,
SSDF publishes a separate rating list for blitz games (5 min/game or 60 moves/5 min)
in PLY, the journal of the Association, a couple of times per year.
Q: Does SSDF have a lab full of computers playing
each other?
A: No. All testing is done in our
members' homes and on their own computers. Also, most vendors are willing to lend us
one or two chess computers for testing when a new model is released.
Q: Can anyone play a few test games and send his
results to SSDF?
A: SSDF only accepts results from
its members. Furthermore, we do not accept tests from people having a commercial
interest in computer chess. The person responsible for managing the tests regularly
calls all testers, usually once every few weeks, and collects their latest results.
On those occasions, he also plans upcoming tests and suggests suitable computers and/or
programs to be pitted against each other.
Q: Who is managing the list?
A: Since 1990, Thoralf Karlsson, chairman of SSDF,
has been handling the list. He took over from Göran Grottling, who had been in
charge since the inception of SSDF in 1984.
Q: How do you know that you can trust reported
results?
A: It's mostly a question of confidence. We have
known most of the testers for many years and don't believe they would try to deceive
us. Also, all testers are required to keep a written record of their activity. If
any doubts arise, those records will be of great help.
Q: But in theory, someone could have sent in
false results?
A: Yes, but not on a large scale. Experience has
taught us that a series of 20 games, which is our normal test match between two
computers, can produce some rather unexpected results. Still we'd be very suspicious
if anyone reported that, say, Super Constellation outplayed Genius on a Pentium-class
PC by 15 to 5. You must remember that normally a lot of people test the same program
or computer, and we are therefore able to compare results from different sources.
Likewise, a tester who consistently reported low scores for, say, Richard Lang's
programs would raise a few eyebrows.
Q: How many people are involved in the testing?
A: At the end of 1994, SSDF had played well over
40,000 games. This had taken us eleven years, and all in all 132 testers had been
involved, each contributing anywhere from 1 to 5,770 games. During 1994 alone,
40 people were doing tests. Our most industrious tester played over 700 games that
year; he usually plays three games in parallel. (And yes, that means he has six
computers!)
Q: How are the tests carried out?
A: Our goal is to always play matches of 20 games
between two computers/programs under test. We also consider it important to evaluate
a computer against various kinds of opponents. For instance, a new chess program
has often been tested by the programmer against contemporary products, but we also
try to find some old programs to run it against. The reason for this is that the
programmer may well have optimized its opening books so as to get maximum performance
against the major competitors in the market, but he has most likely not had access
to all the software that's hiding in the closets of SSDF's members. Which programs or
computers actually get to play each other ultimately depends on what kind of
equipment our testers have or are able to borrow. So in order to let, say, Mephisto
Vancouver and MChess Pro meet, we have to find someone with access to both the
Mephisto machine and an ordinary PC. A third rule of thumb is that we avoid matches
between opponents whose ratings differ by more than 400 points, since the outcome
of such a match does not provide statistically meaningful information.
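As a rough illustration of why a gap of 400 points is treated as the limit, the sketch below uses the widely used logistic approximation of the Elo expected-score curve. The function name and the choice of the logistic form are assumptions made for this example; the FAQ does not say which curve SSDF's rating program uses.

```python
def expected_score(rating_diff):
    """Expected score fraction for the stronger side (logistic Elo approximation)."""
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


if __name__ == "__main__":
    games = 20  # SSDF's normal match length
    for diff in (0, 100, 200, 400):
        e = expected_score(diff)
        print(f"diff {diff:3d}: expected score {e:.3f}, about {e * games:.1f} of {games} points")
```

Under this approximation the stronger side is expected to take roughly 18 of the 20 points at a 400-point gap, so a single game more or less changes the implied rating difference by something like a hundred points.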
Q: Are all games played until mate?
A: No, but we don't accept early, grandmaster-style
draws. Normally, a game is allowed to go on a bit further than human players would
have done. We know that strange things can happen in computer games! Many testers
also follow their own rules. Gunnar Blomstrand, one of the more productive, generally
plays on until one of the computers evaluates its position as -10 or below. Super
Expert, Hiarcs and some other computers are able to resign on their own, and of
course we accept that if it happens.
Q: What is SSDF's opinion on so-called "killer
libraries", opening libraries that are specifically tuned to give good results when
playing against certain other computers?
A: We don't like them, but there is not much we
can do. If we disqualified results from games played with such a library, then
surely someone would protest against that. The best method is likely to be the one
we described above: make sure each program under test gets to play against as many
different opponents as possible, including older programs.
Q: When it happens that two computers repeat a
game they have played before, are both games included in the results in that case?
A: Yes, a game is allowed to continue even if the
tester can see that it is going to be a duplicate. Any program that's stupid enough
to lose the same game several times has only itself to blame. Furthermore, from a
statistical point of view this behavior is not very important, since the program
is just as likely to repeat a win as a loss.
Q: How are the ratings calculated?
A: SSDF uses its own rating program, written by
our member Lars Hjorth, but the basic formulas are derived from Arpad Elo's
rating system. Our program calculates, for each computer, the average rating of
its opponents and how many points it has scored. Given those two numbers, Professor
Elo's formulas produce a rating.
However, if all computers are only tested
against other computers, all we get is a relative rating that is just valid among
those computers. Therefore, SSDF has played several hundred games between computers
and human players in serious tournaments and used these results to set a "correct"
absolute level for the rating list according to Swedish conditions. Different
national rating systems are not completely in accordance though, and that has
to be taken into account when reading our list. For instance, US ratings seem
to lie approximately 150 points above the corresponding Swedish ratings (maybe
more when below 2000 and less on the other side of the scale). For ourselves we
obviously use the Swedish scale.
We firmly believe that our ratings are correct in the sense that if a computer
were to play a sufficient number of games against Swedish humans, it would end up
with a rating close to what it has on our list. Unfortunately, as programs get
better it becomes increasingly difficult to arrange meaningful games against
human players. Reassuringly, we've noted that our ratings are fairly consistent
with the results from the yearly AEGON tournament in Holland.
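The basic calculation described in this answer, a rating derived from the average opponent rating and the points scored, can be sketched as follows. This is a minimal sketch, not Lars Hjorth's program: it assumes the common logistic approximation of Elo's curve, and the function name and the sample numbers are hypothetical.

```python
import math


def performance_rating(avg_opponent_rating, points, games):
    """Rating implied by a score against opponents of a given average rating.

    Assumes the logistic approximation of the Elo curve; a real rating
    program may use Elo's original normal-distribution tables instead.
    """
    p = points / games
    if not 0.0 < p < 1.0:
        raise ValueError("score must be strictly between 0% and 100%")
    return avg_opponent_rating + 400.0 * math.log10(p / (1.0 - p))


if __name__ == "__main__":
    # Hypothetical example: 130 points from 200 games against a 2200 average.
    print(round(performance_rating(2200, 130, 200)))
```

A result like this is only relative to the opponents' ratings, which is why the absolute level has to be fixed separately through games against human players.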
Q: SSDF often uses the term "margin of error".
What factors influence the size of this margin?
A: More than anything else, the number of games
played decides the margin of error (confidence range, see below). Once upon a time,
we thought that 40 games between two computers was a lot. Nowadays, we know more
about statistics. After so few games, you can almost never say for sure which
computer is better. Of course, in most situations your result after 40 games
looks similar to what you will see after 1,000 games. But often enough,
the picture is different. Even if the two are of equal strength, you may well
get a result of 28-12 in the first series of games and 12-28 in the next. The margin
of error also depends on the relative strength of the two computers. A big difference
in strength results in a larger margin of error. From a statistical point of view,
the optimal solution is thus to play a large number of games against opponents of
similar strength.
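How the margin shrinks with the number of games, and grows with a lopsided score, can be illustrated with a small calculation. This is a sketch under stated assumptions, not SSDF's method: every game is treated as an independent win/loss trial (draws are ignored), the score is given a normal approximation, and the logistic Elo curve converts scores to rating differences. The function name is hypothetical.

```python
import math


def rating_margin(score_fraction, games, z=1.96):
    """Approximate (+, -) rating margins for a match result.

    Assumptions: each game is an independent win/loss trial (draws ignored),
    a normal approximation for the score, and the logistic Elo curve.
    """
    def to_rating_diff(p):
        return 400.0 * math.log10(p / (1.0 - p))

    half_width = z * math.sqrt(score_fraction * (1.0 - score_fraction) / games)
    low, high = score_fraction - half_width, score_fraction + half_width
    if low <= 0.0 or high >= 1.0:
        raise ValueError("too few games or too lopsided a score for this approximation")
    centre = to_rating_diff(score_fraction)
    return to_rating_diff(high) - centre, centre - to_rating_diff(low)


if __name__ == "__main__":
    for p, n in ((0.5, 40), (0.5, 1000), (0.75, 40)):
        plus, minus = rating_margin(p, n)
        print(f"score {p:.0%} over {n} games: +{plus:.0f} / -{minus:.0f}")
```

Under these assumptions an even score over 40 games leaves a margin of more than 100 points either way, while 1,000 games bring it down to roughly 20. A 75% score over 40 games gives a wider and asymmetric interval. Draws, which are ignored here, would narrow the margins somewhat.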
Q: A typical line in your list looks like this:
"Genius 3.0/P90/rating 2440/+54/-49". How should I interpret all those numbers?
A: They tell you that Genius 3.0, when run on a
90 MHz Pentium PC, with 95 percent probability has a rating between 2391 (2440-49)
and 2494 (2440+54). The fact that we are using a 95 percent confidence interval implies
that, on average, 5 percent of our ratings will indeed be outside the specified
range. Therefore, in a list with 60 computers, three of them are probably erroneously
rated. But neither we nor anyone else knows which ones.
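The arithmetic in this answer can be written out as a short sketch. The function names are hypothetical and chosen only for this illustration.

```python
def rating_interval(rating, plus, minus):
    """Return the (low, high) bounds for a list entry such as 2440/+54/-49."""
    return rating - minus, rating + plus


def expected_outside(list_size, confidence=0.95):
    """Expected number of entries whose true rating lies outside its interval."""
    return list_size * (1.0 - confidence)


if __name__ == "__main__":
    print(rating_interval(2440, 54, 49))  # bounds for the Genius 3.0 entry
    print(expected_outside(60))           # a list with 60 computers
```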
Q: Can you explain why new computers tend to get
a high rating, which then decreases as more games are played?
A: This claim is simply not true. In early 1994,
we studied the change in rating for the 28 programs that had entered the list since
the fall of 1991. Of those, exactly half increased their rating during the period,
while the other half lost points. Admittedly, most of the programs that lost points
were the ones with high ratings, but we regard that as pure chance. It is true that
CM The King has fallen dramatically (72 points), and so have Mephisto RISC, Vancouver
68030 and MCPro. But it's equally true that Zarkov 2.5 has gained 65 points and
Chessmaster 3000 51 points during the same period. We have observed that as more
games are played, the list seems to be "squeezed" so that the difference between
the top and bottom computers decreases. We are not sure why this happens, but it
is most likely a deficiency in Prof. Elo's rating system.
Q: But undoubtedly there are cases where SSDF
has missed the mark with a new program?
A: Yes, Mephisto Polgar, for instance. In 1989,
after 94 games, Mephisto Polgar was given the rating 2057 +/- 57. Now it has played
1693 games and has a rating of 1973 +/- 17. Obviously, the first rating was too high
and Polgar was thus one of the computers that lay slightly outside its 95% interval.
Q: And how about Mephisto Gideon?
A: Gideon was first rated on list number 8/93, after
176 games, and its rating was given as 2319 (+59, -53). You'll recall that this is to
be read in full as "with 95% probability, the true rating lies between 2266 and 2378".
After 393 games, Gideon's rating is 2280 (+37, -35), and we see no contradiction
between these two results.
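The Polgar and Gideon cases can be compared with a simple check: does the later rating fall inside the interval that was published first? This is an informal sketch rather than a formal statistical test, and the helper name is hypothetical. The numbers are the ones quoted in the two answers above.

```python
def inside(rating, low, high):
    """True if a rating lies within the closed interval [low, high]."""
    return low <= rating <= high


if __name__ == "__main__":
    # (earlier rating, plus, minus, later rating), taken from the FAQ text
    cases = {
        "Mephisto Polgar": (2057, 57, 57, 1973),
        "Mephisto Gideon": (2319, 59, 53, 2280),
    }
    for name, (early, plus, minus, later) in cases.items():
        ok = inside(later, early - minus, early + plus)
        print(f"{name}: later rating {later} inside {early - minus}-{early + plus}? {ok}")
```

Polgar's later rating falls below its first interval, while Gideon's later rating sits inside its first one.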
Q: Do the testers use Windows multitasking when
playing two PC programs against each other?
A: No, definitely not! Even if such a solution could
be made to work technically, it would not produce correct results. Among other things,
it means that the programs would be unable to use the opponent's time, the so-called
"permanent brain" function. To test two PC programs according to SSDF's rules, you must
have two computers.
Q: Are these two machines required to have identical
memory configurations?
A: They usually do. But even if one of them had 8
Mbyte of RAM and the other just 4, it would not matter much. Kathe Spracklen once
estimated the net effect of doubling the size of the hash tables to be about 7 rating
points, and we have not found anything to contradict that. It's a fact of life that
not all PCs are the same, not even if they have identical processors, and we'll have
to live with that. With so many factors besides the processor affecting performance,
like the speed of RAM, size of cache memory, type of expansion bus, and architecture
of the motherboard, there is no way that SSDF can enforce a standard.
Q: Is automated testing being utilized?
A: Yes, thanks to Dr. Christian Donninger from
Vienna, we have been able to run automatic tests since November 1994. So far,
only a few testers have been able to do it. After all, it takes two computers.
Nevertheless, they have produced many results. Autotesting has brought about
a dramatic increase in testing capacity for PC programs, which means that we are
getting more programs on to the list faster. The other testers will not become
superfluous, however. It still takes humans to operate ordinary chess computers,
and we will need them to test those.
Q: How do you set program preferences, and what
opening library do you use?
A: We use instructions from the manual or, in some
cases, straight from the source - the programmer. Experimenting with various styles
of play is out of the question, since it would require hundreds of games with each
setting to differentiate between them. We do not have the time to do that. On those
occasions when a program has more than one opening library, we use the "tournament
library". Now, we do not believe that the choice of openings is as important as the
programmers tend to think, but we still have to use optimal settings. If we didn't,
someone would surely come along and blame an unexpected (bad) result on us for not
doing so.
Q: In your rating list it often says that
PC-programs are played at "50-66 MHz" and in some cases "25-33 MHz". What does
that mean?
A: Some of our testers have PCs with a 486DX
processor running at 50 MHz, others have a 486DX2 processor running at 66 MHz. Many
tests have shown that the difference in speed between these two is not more than
5-10%. The 66 MHz processor is somewhat faster for chess programs, but the difference
in playing strength is not more than about 7 rating points. Similarly, results
obtained with 486 PCs running at 25 and 33 MHz have been lumped together. Actually,
very few games have been played on 25 MHz computers, but we still want to give as
accurate information as possible to our readers.
Q: Why isn't TASC R30 on the list? It is
definitely a strong chess computer.
A: SSDF has not had the opportunity to test TASC
R30. It is an exclusive computer that has not sold very well in Sweden, and no
retailer has been willing to lend us a machine. Neither has TASC, although they
have lent us a number of 30 MHz Chess Machine cards. A few members have bought
R30s though, and they have reported some results to SSDF. Those results have been
counted as if they had been played with a Chess Machine 30-32 MHz.
Q: How important is speed?
A: If you double the clock speed, you gain about
70 points. That was true ten years ago, when we evaluated Constellation, Plymate
and others at different speeds, and it still seems to be true in 1995 when we run
PC programs that have a much higher playing strength. Some say that it varies
between different machines or programmers, that some programs gain more and others
less from an increase in speed, but that has never been proved. You will certainly
find differences if you compare all programs we have tested at different clock
frequencies, but such differences could well be attributed to statistical inaccuracies.
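The rule of thumb of about 70 points per doubling can be turned into a rough estimate for other speed ratios. The sketch below assumes that the gain is the same for every doubling, that is, that it scales with the base-2 logarithm of the speed ratio. The FAQ only states the figure for a doubling, so this extension and the function name are assumptions.

```python
import math

POINTS_PER_DOUBLING = 70  # figure quoted in the answer above


def rating_gain(speed_ratio, points_per_doubling=POINTS_PER_DOUBLING):
    """Estimated rating gain for a given speed-up factor.

    Assumes the gain is the same for every doubling (logarithmic scaling),
    which extends the rule of thumb beyond what the FAQ states.
    """
    if speed_ratio <= 0:
        raise ValueError("speed_ratio must be positive")
    return points_per_doubling * math.log2(speed_ratio)


if __name__ == "__main__":
    for ratio in (1.05, 1.10, 2.0, 4.0):
        print(f"{ratio:.2f}x faster: about {rating_gain(ratio):.0f} points")
```

With this assumption a speed difference of 5-10% works out to roughly 5 to 10 points, in line with the figure of about 7 points given elsewhere in this FAQ for the 50 and 66 MHz 486 machines.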
Q: You were late in starting to test Fritz 3.0
last autumn. In your comments you said that this was because ChessBase did not send
you the diskettes. But couldn't you have bought the program yourselves?
A: That we could have, but it is a question of
financial resources. We would have needed to buy programs sufficient for 15-20
testers, which would mean about 10 original diskettes. That would have been a
considerable cost for a small, idealistic association like ours. Fortunately,
the other programmers have not been as slow as ChessBase was last summer (they
eventually sent us two diskettes). Most of them (Lang, Hirsch, Uniacke, Schroeder,
de Koning and Weststrate) are very interested in SSDF's test results and the position
their program will achieve on the list. Nowadays, they therefore send us a number of
diskettes as soon as possible. We are grateful for this!
Q: Why did you test some PC-programs at 66 MHz
and others at 33 MHz? Surely, this is not fair.
A: Our test work is governed by our resources.
During 1993/94, about half of the PC-testers had the "slower" 486 and half had the
faster one. Had we tested each program at both speeds, the number of games at a
given speed would have been halved and the statistical uncertainty therefore greater.
For example, instead of having played 300 games with Genius 2.0 at 66 MHz, we would
have had 150 games at 66 MHz and 150 at 33 MHz. The testers, of course, have a limited
capacity. By the way, the different testers play their test games at very different
paces.
You must remember that SSDF's rating list is not a commercially oriented sales
list. We assume that it is read by people who know to what degree different processor
speeds affect ratings. The difference between the faster and slower 486 versions is
theoretically 35-40 points. We have two examples of this, which confirm the theory:
Genius 1.0 and MChess Pro 3.12 have been tested at both speeds. You can check this
for yourselves in the rating list.
Almost all our testers have now upgraded to faster 486s, and all PC-programs can
be tested at the higher speed. But of course the problem has started again as a few
testers have now acquired Pentium 90 MHz machines, and it will take time before we
are able to test all new programs on this processor.
Q: But wasn't it strange that you chose to test
Hiarcs 2 at the lower speed, given that it sensationally won the world championships in
Munich in 1993? And by the way, why did you never test Hiarcs 2.1?
A: Towards the end of 1993, several new programs
were released at about the same time. These were Genius 2, MChess Pro 3.5,
Chessmaster 4000, Hiarcs 2 and Socrates. As half of our testers at that time had
486/33 MHz computers, we had to decide which programs to test on the faster 66 MHz
computers and which on the slower 33 MHz computers. Our guiding principle was that
the strongest programs should be tested at the higher speed.
Hiarcs 2 and Socrates were the two programs to be tested at 33 MHz, and the
results showed that we made the right choice. Neither of these programs turned out
to be better than the other three programs in question. In early 1995, Hiarcs 2.0
had a rating of 2208 after 229 games. At 50-66 MHz, the program would have achieved 2250 at
the most.
When we received diskettes with Hiarcs 2.1 from Mark Uniacke, we had already
completed 150 of the 229 games with the 2.0 version. We simply had no way
to free up resources to start all over again with the new version, which according to
Uniacke himself was probably only 10-15 points better. Furthermore, only a few
weeks separated the release of the two versions. For commercial reasons, it was
better to market the exact version that had achieved such a good result in Munich.
Q: In the list 1/95 it says that 41,088 games have
been played by 136 computers, but I can only find 59 on the list. In the long list,
which can be downloaded for free from SSDF's BBS, there are only 127 computers. How
can this be?
A: In order to save space, we have over time taken
68 computers out of the list. We also feel that it makes the list easier to grasp
when old computers, taken off the market many years ago, are removed. Nine computers
don't even show on the long version of the list, as they haven't played the minimum
number of 100 games required to attain a position. However, all games played by all
computers are included in the calculation of the ratings. The old games contribute
to the stability of the list, and sometimes it is also nice to study the long list
if only for nostalgic reasons.
Q: How is your BBS reached?
A: If you have a modem you can call +46 31 992301,
which is the number for our BBS - "Grottan BBS". The first time you will be asked
to answer a few questions, and you cannot do much more. When the SysOp, Göran
Grottling, has accepted you as a new user, it becomes possible to download files
and read mail. You will be able to freely get the latest rating list, including
results from all individual matches between computers, both in the short and long
version. Grottan BBS is managed by SSDF and contains a large amount of chess related
files. It is possible to choose English as the language in the BBS. Grottan BBS is
connected to FidoNet and has the address 2:203/245.
Q: How can I get the ordinary paper version
of "The Swedish Rating list"?
A: You can't anymore! From now on, the rating list
is available on the Internet and through our BBS, and it can be read in several chess
computer magazines, among them PLY.
Q: A final question: Do you in SSDF believe that
your rating list represents "The Absolute Truth"?
A: No, we are quite humble about this. Above all,
we want our readers to realize that the ratings are not exact enough to attach any
significance to rating differences of 10-20 points. You also have to consider the
confidence range for each program. We are also aware that some have the opinion that
certain programs can perform better in play against humans, while others are worse.
However, we think that this remains to be confirmed. Of course, the rating list of
SSDF only accurately represents the outcome for machines playing against each other.
Anyway, it must be better to rely on thousands of computer games than only a few!
Bo Sjögren, bosj@nsc.liu.se