Q: Who is responsible for "The Swedish Rating List"?
A: The Swedish Chess Computer Association (in Swedish "Svenska schackdatorföreningen", abbreviated SSDF). The rating list is the result of its members' efforts.
Q: Does a game have to be played with a particular time limit, or is the list based on games played with various time settings?
A: All games are played at tournament level (40 moves/2h). A game played with anything other than 3 minutes per move is not counted. However, SSDF publishes a separate rating list for blitz games (5 min/game or 60 moves/5 min) in PLY, the journal of the Association, a couple of times per year.
Q: Does SSDF have a lab full of computers playing each other?
A: No. All testing is done in our members' homes and on their own computers. Also, most vendors are willing to lend us one or two chess computers for testing when a new model is released.
Q: Can anyone play a few test games and send his results to SSDF?
A: SSDF only accepts results from its members. Furthermore, we do not accept tests from people having a commercial interest in computer chess. The person responsible for managing the tests regularly calls all testers, usually once every few weeks, and collects their latest results. On those occasions, he also plans upcoming tests and suggests suitable computers and/or programs to be pitted against each other.
Q: Who is managing the list?
A: Since 1990 Thoralf Karlsson, chairman of SSDF, has been handling the list. He took over from Göran Grottling who had been in charge since the inception of SSDF in 1984.
Q: How do you know that you can trust reported results?
A: It's mostly a question of confidence. We have known most of the testers for many years and don't believe they would try to deceive us. Also, all testers are required to keep a written record of their activity. In case there are any doubts, those records will surely be of good help.
Q: But in theory, someone could have sent in false results?
A: Yes, but not on a large scale. Experience has taught us that a series of 20 games, which is our normal test match between two computers, can produce some rather unexpected results. Still we'd be very suspicious if anyone reported that, say, Super Constellation outplayed Genius on a Pentium-class PC by 15 to 5. You must remember that normally a lot of people test the same program or computer, and we are therefore able to compare results from different sources. Likewise, a tester who consistently reported low scores for, say, Richard Lang's programs would raise a few eyebrows.
Q: How many people are involved in the testing?
A: At the end of 1994, SSDF had played well over 40,000 games. This had taken us eleven years, and all in all 132 testers have been involved, each contributing anywhere from 1 to 5,770 games. During 1994 alone, 40 people were doing tests. Our most industrious tester played over 700 games that year; he usually plays three games in parallel. (And yes, that means he has six computers!)
Q: How are the tests carried out?
A: Our goal is to always play matches of 20 games between two computers/programs under test. We also consider it important to evaluate a computer against various kinds of opponents. For instance, a new chess program has often been tested by the programmer against contemporary products, but we also try to find some old programs to run it against. The reason for this is that the programmer may well have optimized its opening book so as to get maximum performance against the major competitors in the market, but he has most likely not had access to all the software that's hiding in the closets of SSDF's members. Which programs or computers actually get to play each other is ultimately dependent on what kind of equipment our testers have or are able to borrow. So in order to let, say, Mephisto Vancouver and MChess Pro meet, we have to find someone with access to both the Mephisto machine and an ordinary PC. A further rule of thumb is that matches between opponents whose ratings differ by more than 400 points are meaningless. The outcome of such a match does not provide statistically meaningful information.
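To illustrate the 400-point rule of thumb, here is a minimal sketch that assumes the widely used logistic approximation to Elo's expected-score curve. The function name `expected_score` is illustrative, and SSDF's own rating program may use a different form of the curve.

```python
def expected_score(rating_diff: float) -> float:
    """Expected score (0..1) of the higher-rated side.

    Assumption: logistic approximation to Elo's curve, not
    necessarily the formula used by SSDF's rating program.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_diff / 400.0))


if __name__ == "__main__":
    # Expected points for the stronger side in a 20-game match.
    for diff in (0, 100, 200, 400):
        e = expected_score(diff)
        print(f"{diff:>3} points apart: about {20 * e:.1f} of 20")
```

Under this assumption, a 400-point favourite is expected to take roughly 18 of the 20 points, so the match result leaves little room to tell one rating estimate from another.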
Q: Are all games played until mate?
A: No, but we don't accept early, grandmaster-style draws. Normally, a game is allowed to go on a bit further than human players would have done. We know that strange things can happen in computer games! Many testers also follow their own rules. Gunnar Blomstrand, one of the more productive, generally plays on until one of the computers evaluates its position as -10 or below. Super Expert, Hiarcs and some other computers are able to resign on their own, and of course we accept that if it happens.
Q: What is SSDF's opinion on so-called "killer libraries", opening libraries that are specifically tuned to give good results when playing against certain other computers?
A: We don't like them, but there is not much we can do. If we disqualified results from games played with such a library, then surely someone would protest against that. The best method is likely to be the one we described above: make sure each program under test gets to play against as many different opponents as possible, including older programs.
Q: When it happens that two computers repeat a game they have played before, are both games included in the results in that case?
A: Yes, a game is allowed to continue even if the tester can see that it is going to be a duplicate. Any program that's stupid enough to lose the same game several times has but itself to blame. Furthermore, from a statistical point of view this behavior is not very important, since the program is just as likely to repeat a win as a loss.
Q: How are the ratings calculated?
A: SSDF uses its own rating program, written by our member Lars Hjorth, but the basic formulas are derived from Arpad Elo's rating system. Our program calculates, for each computer, the average rating of its opponents and how many points it has scored. Given those two numbers, Professor Elo's formulas produce a rating.
However, if all computers are only tested against other computers, all we get is a relative rating that is just valid among those computers. Therefore, SSDF has played several hundred games between computers and human players in serious tournaments and used these results to set a "correct" absolute level for the rating list according to Swedish conditions. Different national rating systems are not completely in accordance though, and that has to be taken into account when reading our list. For instance, US ratings seem to lie approximately 150 points above the corresponding Swedish ratings (maybe more when below 2000 and less on the other side of the scale). For ourselves we obviously use the Swedish scale.
We firmly believe that our ratings are correct in the sense that if a computer were to play a sufficient number of games against Swedish humans, it would end up with a rating close to what it has on our list. Unfortunately, as programs get better it becomes increasingly difficult to arrange meaningful games against human players. Reassuringly, we've noted that our ratings are fairly consistent with the results from the yearly AEGON tournament in Holland.
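The relative part of the calculation described in this answer (average opponent rating plus points scored) can be sketched as follows. This is an illustration only: it assumes the common logistic approximation to Elo's formulas, the name `performance_rating` is hypothetical, and it does not reproduce Lars Hjorth's program or the calibration against human players.

```python
import math


def performance_rating(avg_opponent_rating: float,
                       points: float, games: int) -> float:
    """Rating implied by a score against opponents of a given average.

    Assumption: logistic approximation to Elo's curve. The result is
    relative to the opponents' scale; no absolute calibration is done.
    """
    if games <= 0:
        raise ValueError("games must be positive")
    p = points / games
    if not 0.0 < p < 1.0:
        raise ValueError("undefined for a 0% or 100% score")
    return avg_opponent_rating + 400.0 * math.log10(p / (1.0 - p))


if __name__ == "__main__":
    # Hypothetical example: 12 points from 20 games, opponents averaging 2300.
    print(round(performance_rating(2300, 12, 20)))
```

A 50 percent score returns the opponents' average unchanged, and a 100 percent or 0 percent score has no finite rating under this formula, which is one reason a handful of games says very little.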
Q: SSDF often uses the term "margin of error". What factors influence the size of this margin?
A: More than anything else, the number of games played decides the margin of error (confidence range, see below). Once upon a time, we thought that 40 games between two computers was a lot. Nowadays, we know more about statistics. After so few games, you can almost never say for sure which computer is better. Of course, in most situations your result after 40 games looks similar to what you will see after 1,000 games. But it happens often enough that the picture is different. Even if the two are of equal strength, you may well get a result of 28-12 in the first series of games and 12-28 in the next. The margin of error also depends on the relative strength of the two computers. A big difference in strength results in a larger margin of error. From a statistical point of view, the optimal solution is thus to play a large number of games against opponents of similar strength.
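Both effects mentioned in this answer (more games shrink the margin, lopsided pairings widen it) can be seen in a rough sketch. It assumes a simple binomial model with no draws, a normal approximation for the 95 percent range, and the logistic approximation to Elo's curve; the function names are illustrative, and SSDF's actual method may differ.

```python
import math


def rating_diff(p: float) -> float:
    """Rating difference implied by score fraction p (logistic assumption)."""
    return 400.0 * math.log10(p / (1.0 - p))


def rating_margin(score_fraction: float, games: int, z: float = 1.96):
    """Return (plus, minus) rating margins at roughly 95% confidence.

    Assumptions: every game is a win or a loss (no draws) and the
    score fraction is approximately normally distributed.
    """
    se = math.sqrt(score_fraction * (1.0 - score_fraction) / games)
    low = max(score_fraction - z * se, 1e-6)
    high = min(score_fraction + z * se, 1.0 - 1e-6)
    centre = rating_diff(score_fraction)
    return rating_diff(high) - centre, centre - rating_diff(low)


if __name__ == "__main__":
    for p, n in ((0.5, 40), (0.5, 1000), (0.8, 40)):
        plus, minus = rating_margin(p, n)
        print(f"score {p:.0%} over {n} games: +{plus:.0f} / -{minus:.0f}")
```

Because draws are ignored, the sketch somewhat overstates the uncertainty of real matches, but the trend is the point: the margin falls with the square root of the number of games and grows as the score moves away from 50 percent.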
Q: A typical line in your list looks like this: "Genius 3.0/P90/rating 2440/+54/-49". How should I interpret all those numbers?
A: They tell you that Genius 3.0, when played on a 90 MHz Pentium PC, with 95 percent probability has a rating between 2391 (2440-49) and 2494 (2440+54). The fact that we are using a 95 percent confidence interval implies that, on average, 5 percent of our ratings will indeed be outside the specified range. Therefore, in a list with 60 computers, three of them are probably erroneously rated. But neither we nor anyone else knows which ones.
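The arithmetic in this answer can be checked with a few lines of code. The slash-separated layout is taken from the example quoted in the question and may not match the layout of the actual list files; the function names are illustrative.

```python
def parse_rating_line(line: str) -> dict:
    """Split a line like 'Genius 3.0/P90/rating 2440/+54/-49'.

    Assumption: exactly five slash-separated fields, as in the
    example quoted in the question.
    """
    name, hardware, rating_field, plus_field, minus_field = line.split("/")
    rating = int(rating_field.split()[-1])
    plus = int(plus_field)          # "+54" -> 54
    minus = abs(int(minus_field))   # "-49" -> 49
    return {
        "name": name,
        "hardware": hardware,
        "rating": rating,
        "low": rating - minus,
        "high": rating + plus,
    }


def expected_outside(entries: int, confidence: float = 0.95) -> float:
    """Average number of entries whose true rating lies outside its range."""
    return entries * (1.0 - confidence)


if __name__ == "__main__":
    entry = parse_rating_line("Genius 3.0/P90/rating 2440/+54/-49")
    print(entry["low"], entry["high"])
    print(round(expected_outside(60), 1))
```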
Q: Can you explain why new computers tend to get a high rating, which then decreases as more games are played?
A: This claim is simply not true. In early 1994, we studied the change in rating for the 28 programs that had entered the list since the fall of 1991. Of those, exactly half increased their rating during the period, while the other half lost points. Admittedly, most of the programs that lost points were the ones with high ratings, but we regard that as pure chance. It is true that CM The King has fallen dramatically (72 points), and so have Mephisto RISC, Vancouver 68030 and MCPro. But it's equally true that Zarkov 2.5 has gained 65 points and Chessmaster 3000 51 points during the same period. We have observed that as more games are played, the list seems to be "squeezed" so that the difference between the top and bottom computers decreases. We are not sure why this happens, but it is most likely a deficiency in Prof. Elo's rating system.
Q: But undoubtedly there are cases where SSDF has missed the mark with a new program?
A: Yes, Mephisto Polgar, for instance. In 1989, after 94 games, Mephisto Polgar was given the rating 2057 +/- 57. Now it has played 1693 games and has a rating of 1973 +/- 17. Obviously, the first rating was too high and Polgar was thus one of the computers that lay slightly outside its 95% interval.
Q: And how about Mephisto Gideon?
A: Gideon was first rated on list number 8/93, after 176 games, and its rating was given as 2319 (+59, -53). You'll recall that this is to be read in full as "with 95% probability, the true rating lies between 2266 and 2378". After 393 games, Gideon's rating is 2280 (+37, -35), and we cannot see that these two results are inconsistent with each other.
Q: Do the testers use Windows multitasking when playing two PC programs against each other?
A: No, definitely not! Even if such a solution could be made to work technically, it would not produce correct results. Among other things, it means that the programs would be unable to use the opponent's time, the so-called "permanent brain" function. To test two PC programs according to SSDF's rules, you must have two computers.
Q: Are these two machines required to have identical memory configurations?
A: They usually do. But even if one of them had 8 Mbyte of RAM and the other just 4, it would not mean a lot. Kathe Spracklen once estimated the net effect of doubling the size of the hash tables to be about 7 rating points, and we have not found anything to contradict that. It's a fact of life that not all PCs are the same, not even if they have identical processors, and we'll have to live with that. With so many factors besides the processor that affect performance, like the speed of RAM, size of cache memory, type of expansion bus, and architecture of the motherboard, there is no way that SSDF can enforce a standard.
Q: Is automated testing being utilized?
A: Yes, thanks to Dr. Christian Donninger from Vienna, we have been able to run automatic tests since November 1994. So far, only a few testers have been able to use it, since it takes two computers, but they have produced many results. Autotesting has brought about a dramatic increase in testing capacity for PC programs, which means that we are getting more programs onto the list faster. The other testers will not become superfluous, however. It still takes humans to operate ordinary chess computers, and we will need them to test those.
Q: How do you set program preferences, and what opening library do you use?
A: We use instructions from the manual or, in some cases, straight from the source - the programmer. Experimenting with various styles of play is out of the question, since it would require hundreds of games with each setting to differentiate between them. We do not have the time to do that. On those occasions when a program has more than one opening library, we use the "tournament library". Now, we do not believe that the choice of openings is as important as the programmers tend to think, but we still have to use optimal settings. If we didn't, someone would surely come along and blame an unexpected (bad) result on us for not doing so.
Q: In your rating list it often says that PC-programs are played at "50-66MHz" and in some cases "25-33 MHz". What does that mean?
A: Some of our testers have PCs with the processor 486DX running at 50 MHz, others have the processor 486/DX2 running at 66 MHz. Many tests have shown that the difference in speed between these two is not more than 5-10%. The 66 MHz processor is somewhat faster for chess programs, but the difference in playing strength is not more than about 7 rating points. Similarly, results obtained with 486 PCs running at 25 and 33 MHz have been lumped together. Actually, very few games have been played on 25 MHz computers, but we still want to give as accurate information as possible to our readers.
Q: Why isn't TASC R30 on the list - it is definitely a strong chess computer.
A: SSDF has not had the opportunity to test TASC R30. It is an exclusive computer that has not sold very well in Sweden, and no retailer has been willing to lend us a machine. Neither has TASC, although they have lent us a number of 30 MHz Chess Machine cards. A few members have bought R30s though, and they have reported some results to SSDF. Those results have been counted as if they had been played with a Chess Machine 30-32 MHz.
Q: How important is speed?
A: If you double the clock speed, you gain about 70 points. That was true ten years ago, when we evaluated Constellation, Plymate and others at different speeds, and it still seems to be true in 1995 when we run PC programs that have a much higher playing strength. Some say that it varies between different machines or programmers, that some programs gain more and others less from an increase in speed, but that has never been proved. You will certainly find differences if you compare all programs we have tested at different clock frequencies, but such differences could well be attributed to statistical inaccuracies.
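As a rough illustration, the rule of thumb can be extended to other speed ratios by assuming that the gain is the same for every doubling, that is, logarithmic in speed. That extension is an assumption made here for the sketch, and the names `POINTS_PER_DOUBLING` and `rating_gain` are illustrative.

```python
import math

# From the answer above: roughly 70 rating points per doubling of speed.
POINTS_PER_DOUBLING = 70.0


def rating_gain(speed_ratio: float) -> float:
    """Estimated rating gain for a given speed-up factor.

    Assumption: the gain per doubling is constant, so the gain is
    proportional to log2 of the speed ratio.
    """
    if speed_ratio <= 0:
        raise ValueError("speed_ratio must be positive")
    return POINTS_PER_DOUBLING * math.log2(speed_ratio)


if __name__ == "__main__":
    for ratio in (1.05, 1.10, 2.0, 4.0):
        print(f"{ratio:.2f}x faster: about {rating_gain(ratio):.0f} points")
```

Under this assumption a 5-10% speed difference comes out at roughly 5 to 10 points, which is in line with the figure of about 7 points given for the 50 and 66 MHz machines elsewhere in this FAQ.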
Q: You were late in starting to test Fritz 3.0 last autumn. In your comments you said that this was because ChessBase did not send you the diskettes. But couldn't you have bought the program yourselves?
A: That we could have, but it is a question of financial resources. We would have needed to buy programs sufficient for 15-20 testers, which would mean about 10 original diskettes. That would have been a considerable cost for a small, idealistic association like ours. Fortunately, the other programmers have not been as slow as ChessBase was last summer (they eventually sent us two diskettes). Most of them (Lang, Hirsch, Uniacke, Schroeder, de Koning and Weststrate) are very interested in SSDF's test results and the position their program will achieve on the list. Nowadays, they therefore send us a number of diskettes as soon as possible. We are grateful for this!
Q: Why did you test some PC-programs at 66 MHz and others at 33 MHz? Surely, this is not fair.
A: Our test work is governed by our resources. During 1993/94, about half of the PC-testers had the "slower" 486 and half had the faster one. Had we tested each program at both speeds, the number of games at a given speed would have been halved and the statistical uncertainty therefore greater. For example, instead of having played 300 games with Genius 2.0 at 66 MHz, we would have had 150 games at 66 MHz and 150 at 33 MHz. The testers of course have a limited capacity, and different testers play their test games at very different paces.
You must remember that SSDF's rating list is not a commercially oriented sales list. We assume that it is read by people who know to what degree different processor speeds affect ratings. The difference between the faster and slower 486 version is theoretically 35-40 points. We have two examples of this, which confirm the theory: Genius 1.0 and MChess Pro 3.12 have been tested at both speeds. Check this out by yourselves in the rating list.
Almost all our testers have now upgraded to faster 486's, and all PC-programs can be tested at the higher speed. But of course the problem has started again as a few testers have now acquired Pentium 90 MHz machines, and it will take time before we are able to test all new programs on this processor.
Q: But wasn't it strange that you chose to test Hiarcs 2 at the lower speed, as it sensationally won the world championships in Munich in 1993? And by the way, why did you never test Hiarcs 2.1?
A: Towards the end of 1993, several new programs were released at about the same time. These were Genius 2, MChess Pro 3.5, Chessmaster 4000, Hiarcs 2 and Socrates. As half of our testers at that time had 486/33 MHz computers, we had to decide which programs to test on the faster 66 MHz computers and which on the slower 33 MHz computers. Our guiding principle was that the strongest programs should be tested at the higher speed.
Hiarcs 2 and Socrates were the two programs to be tested at 33 MHz, and the results showed that we made the right choice. Neither of these programs turned out to be better than the other three programs in question. In early 1995, Hiarcs 2.0 had 2208 after 229 games. At 50-66 MHz, the program would have achieved 2250 at the most.
When we received diskettes with Hiarcs 2.1 from Mark Uniacke, we had already completed 150 of the 229 games with the 2.0 version. We definitely had no possibility to free resources to start all over again with the new version, which according to Uniacke himself was probably only 10-15 points better. Furthermore, only a few weeks separated the release of the two versions. For commercial reasons, it was better to market the exact version which made such a good result in Munich.
Q: In the list 1/95 it says that 41,088 games have been played by 136 computers, but I can only find 59 on the list. In the long list, which can be downloaded for free from SSDF's BBS, there are only 127 computers. How can this be?
A: In order to save space, we have over time taken 68 computers out of the list. We also feel that it makes the list easier to grasp when old computers, taken off the market many years ago, are removed. Nine computers don't even show on the long version of the list, as they haven't played the minimum number of 100 games required to attain a position. However, all games played by all computers are included in the calculation of the ratings. The old games contribute to the stability of the list, and sometimes it is also nice to study the long list if only for nostalgic reasons.
Q: How is your BBS reached?
A: If you have a modem you can call +46 31 992301, which is the number for our BBS - "Grottan BBS". The first time you will be asked to answer a few questions, and you cannot do much more. When the SysOp, Göran Grottling, has accepted you as a new user, it becomes possible to download files and read mail. You will be able to freely get the latest rating list, including results from all individual matches between computers, both in the short and long version. Grottan BBS is managed by SSDF and contains a large amount of chess related files. It is possible to choose English as the language in the BBS. Grottan BBS is connected to FidoNet and has the address 2:203/245.
Q: How can I get the ordinary paper version of "The Swedish Rating list"?
A: You can't anymore! From now on, the rating list is available on the Internet and through our BBS, and it can be read in several chess computer magazines, among them PLY.
Q: A final question: Do you in SSDF believe that your rating list represents "The Absolute Truth"?
A: No, we are quite humble about this. Above all, we want our readers to realize that the ratings are not exact enough to attach any significance to rating differences of 10-20 points. You also have to consider the confidence range for each program. We are also aware that some have the opinion that certain programs can perform better in play against humans, while others are worse. However, we think that this remains to be confirmed. Of course, the rating list of SSDF only accurately represents the outcome for machines playing against each other. Anyway, it must be better to rely on thousands of computer games than only a few!