## Wednesday, August 09, 2006

### Elo rating! Rn = Ro + K (W - We)

'I'm 1875. What's your rating?' is a typical greeting from one chessplayer meeting another for the first time. The other player, rated 1750, knows immediately that a game between them will be a tough battle. A draw will be a satisfactory result and a win will be a small upset.

What is a rating and how is it calculated? Perhaps most importantly, a rating is only meaningful relative to other ratings. In July 1989, when Garry Kasparov became the first player to break the 2800 barrier, his accomplishment was by comparison with all other ratings, current and historical.

The rating system may seem mysterious, but it is grounded in statistical theory. Obviously, two players with the same rating should have an equal chance of winning against each other. Less obviously, the same rating difference implies the same chance of winning. A player rated 2400 playing against a player rated 2200 has the same chance of winning as a 1400 against a 1200. The rating difference is 200 points in both cases.

The most widely used rating system is known as the Elo system. Arpad E. Elo, born 1903 in Hungary, emigrated to the United States at age 10. From 1935 to 1965, he was professor of physics and astronomy at Marquette University.

From 1935 to 1937, Elo was administrator of the American Chess Federation, which merged in 1939 with the National Chess Federation to become the USCF. He was nine times champion or co-champion of Wisconsin.

The Elo system was adopted by the USCF in 1960, and by FIDE in 1970. Elo served as Chairman of the USCF Rating Committee from 1959 to 1976. He was inducted into the U.S. Chess Hall of Fame in 1988 and died in 1992.

From The Rating of Chessplayers:

Few chessplayers are totally objective about their positions on the board, and even fewer can be objective about their personal capacities and ratings. Most of them believe they are playing "in form" only when far above normal form, and they tend to forget that an outstanding tournament success is just as likely the result of off form performances by opponents as superior play by themselves. There is truth in the paradox that "every chessplayer believes himself better than his equal".

***

How are ratings calculated? Already in 1959, the USCF rating system arbitrarily used 2000 as the upper level for strong club players and 200 point divisions to assign players to classes. Elo kept these measures because they were 'steeped in tradition'.

The table at the bottom of this page relates expected game results to rating differences. The P column is the expected Percentage for the result of a single game. The D column is the rating Difference corresponding to that expected result.

For example, two players with the same rating (D=0) each have a 50% (P=.50) chance of winning a game. Similarly, a player with a rating 100 points greater than an opponent (D=102 is the closest value in the table) has a 64% (P=.64) chance of winning a game.

Let's say you score +3-2=1 (three wins, two losses and a draw) against opposition with an average rating of 1500. Your score is 3.5-2.5, for a percentage of 58% (P=0.58). The value for P=0.58 in the table corresponds to D=57. Your performance is calculated as 1500 + 57 = 1557. If you had achieved the same score against opponents with an average rating of 2000, your performance would be 2057.
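
The table lookup described above can be approximated in code. The sketch below is not the table itself: it assumes the table follows a normal-distribution model with a standard deviation of 200 points per player (so 200·√2 for the difference between two players) and inverts that curve. The function name `performance_rating` and the constant `SIGMA_DIFF` are illustrative, and because a printed table is rounded, results may differ from it by a point or so.

```python
from math import sqrt
from statistics import NormalDist

# Assumption: rating differences follow a normal curve with a standard
# deviation of 200 points per player, i.e. 200 * sqrt(2) for a pairing.
SIGMA_DIFF = 200 * sqrt(2)


def performance_rating(avg_opponent_rating, wins, losses, draws):
    """Estimate a performance rating from a score against rated opposition."""
    games = wins + losses + draws
    if games == 0:
        raise ValueError("at least one game is required")
    # Percentage score, rounded to two decimals as in the table lookup.
    p = round((wins + 0.5 * draws) / games, 2)
    if p <= 0.0 or p >= 1.0:
        raise ValueError("a 0% or 100% score has no finite rating difference")
    # D is the rating difference whose expected score equals p.
    d = SIGMA_DIFF * NormalDist().inv_cdf(p)
    return avg_opponent_rating + d


perf_1500 = performance_rating(1500, wins=3, losses=2, draws=1)
perf_2000 = performance_rating(2000, wins=3, losses=2, draws=1)
print(round(perf_1500), round(perf_2000))
```

Under these assumptions the two results should land within a point of the 1557 and 2057 quoted in the example.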

This method is used to calculate an initial rating for a previously unrated player. The more games used in the calculation, the more accurate the initial rating will be. Established players are rated using the following formula.

Rn = Ro + K (W - We)

Rn - New rating
Ro - Old rating
K - Value of a single game
W - Score; 1.0 for a win, 0.5 for a draw
We - Expected score based on Ro

The formula says that after an event has finished, a player's new rating is calculated from the old rating adjusted by the result of the event. The adjustment is the difference between the player's actual result and the expected result, which is based on the old rating.

The difference is multiplied by a coefficient ('K'), which is a number between 10 and 40. A lower coefficient gives more weight to previous events and changes the rating at a slower rate. A higher coefficient gives more weight to the most recent events and changes the rating faster.
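
To see the effect of the coefficient, here is a minimal sketch of the update formula applied to the same result with the lowest and highest K values mentioned above. The event itself (a 1500 player scoring 3 points where 2 were expected) is a made-up illustration, and `update_rating` is a hypothetical helper name.

```python
def update_rating(old_rating, k, score, expected_score):
    """Apply Rn = Ro + K (W - We) for one event."""
    return old_rating + k * (score - expected_score)


# Hypothetical event: a 1500 player scores 3 points where 2 were expected.
slow = update_rating(1500, k=10, score=3.0, expected_score=2.0)
fast = update_rating(1500, k=40, score=3.0, expected_score=2.0)
print(slow, fast)
```

The same one-point overperformance moves the rating four times as far with K = 40 as with K = 10.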

No system is perfect, and there are some problems with the rating system. Rating deflation is a natural phenomenon caused by young improving players entering the rating pool and old stable players leaving the pool. Rating manipulation happens when unscrupulous organizers submit fraudulent reports to a rating agency.

These problems are a small price to pay for the great benefits that Elo's rating system has provided to the chess world. His induction into the Chess Hall of Fame was an appropriate expression of his great contribution to chess.

### Comparative ratings

The phrase "Elo rating" is often used to mean a player's chess rating as calculated by FIDE. However, this usage is confusing and often misleading, because Élő's general ideas have been adopted by many different organizations, including the USCF (before FIDE), the Internet Chess Club (ICC), and the now defunct Professional Chess Association (PCA). Each organization has a unique implementation, and none of them precisely follows Élő's original suggestions. It would be more accurate to refer to all of the above ratings as Elo ratings, and none of them as the Elo rating.

Instead one may refer to the organization granting the rating, e.g. "As of August 2002, Gregory Kaidanov had a FIDE rating of 2638 and a USCF rating of 2742." The Elo ratings of these various organizations are not always directly comparable. For example, someone with a FIDE rating of 2500 will generally have a USCF rating near 2600 and an ICC rating in the range of 2500 to 3100.

The following analysis of the January 2006 FIDE rating list gives a rough impression of what a given FIDE rating means:

above 2200 - CM (candidate master) title
2400 to 2499 - IM (international master) or GM (grandmaster) title
2500 to 2699 - GM title
2700 to 2799 - GM title; informally called "Super GM"

Only Garry Kasparov of Russia, Vladimir Kramnik of Russia, Veselin Topalov of Bulgaria, and Viswanathan Anand of India have ever had a rating of 2800 or above. As of July 2006, only Topalov (2813) has a rating over 2800, while Kramnik has a rating of 2743 and Anand has a rating of 2779. Although Kasparov's last rating was 2812, he has been inactive for over a year and has been removed from the FIDE list.
The highest ever FIDE rating was 2851, which Garry Kasparov had on the July 1999 and January 2000 lists.
In the whole history of the FIDE rating system, only 39 players (up to and including the April 2006 list), sometimes called "Super-grandmasters", have achieved a peak rating of 2700 or more. However, due to ratings inflation, nearly all of these are modern players: all but two achieved their peak rating after 1993.

As of April 2006, the Hydra supercomputer was probably the strongest "over the board" chess "player" in the world; its creators estimate its playing strength to be over 3000 on the FIDE scale[1]. This is consistent with its six-game match against Michael Adams in 2005, in which the then seventh-highest-rated player in the world managed to score only a single draw [1]. However, six games hardly constitute statistical evidence. As of June 2006, Deep Junior is the computer chess champion [2], demonstrating that Hydra's superior hardware alone is not enough [3] (thirty-two processors versus a single dual-core AMD machine).

Some computer chess experts, however, believe that a centaur, i.e. a human assisted by a computer, is capable of producing the highest level of chess known to mankind[2]. An effort to decide whether the best chess entity is really Hydra or a centaur is the PAL/CSS Freestyle Tournament, organized by the Hydra Team itself, which allows participation of both centaurs and unassisted engines. The 2005 tournament was won by a centaur, but the 2006 tournament (with a faster time control) was won by the Hydra Team.

### Mathematical details

Performance can't be measured absolutely; it can only be inferred from wins and losses. Ratings therefore have meaning only relative to other ratings, and both the average and the spread of ratings can be chosen arbitrarily. Élő suggested scaling ratings so that a difference of 200 rating points in chess would mean that the stronger player has an expected score of approximately 0.75, and the USCF initially aimed for an average club player to have a rating of 1500.

A player's expected score is his probability of winning plus half his probability of drawing. Thus an expected score of 0.75 could represent a 75% chance of winning, 25% chance of losing, and 0% chance of drawing. At the other extreme it could represent a 50% chance of winning, 0% chance of losing, and 50% chance of drawing. The probability of drawing, as opposed to having a decisive result, is not specified in the Elo system. Instead a draw is considered half a win and half a loss.

If Player A has true strength $R_A$ and Player B has true strength $R_B$, the exact formula (using the logistic curve) for the expected score of Player A is

$$E_A = \frac{1}{1 + 10^{(R_B - R_A)/400}}$$

Similarly, the expected score for Player B is

$$E_B = \frac{1}{1 + 10^{(R_A - R_B)/400}}$$

Note that $E_A + E_B = 1$. In practice, since the true strength of each player is unknown, the expected scores are calculated using the players' current ratings.
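
As a minimal sketch, the logistic formula can be written as a short function. The name `expected_score` is illustrative, and the ratings plugged in are the 1875 and 1750 from the opening greeting.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B on the logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


e_a = expected_score(1875, 1750)
e_b = expected_score(1750, 1875)
# The two expected scores are complementary: they add up to 1.
print(round(e_a, 3), round(e_b, 3), round(e_a + e_b, 3))

# Only the rating difference matters: 2400 vs 2200 matches 1400 vs 1200.
same_gap_high = expected_score(2400, 2200)
same_gap_low = expected_score(1400, 1200)
print(round(same_gap_high, 3), round(same_gap_low, 3))
```

The function depends only on the difference between the two ratings, which is why the same gap gives the same expected score anywhere on the scale.
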
When a player's actual tournament scores exceed his expected scores, the Elo system takes this as evidence that the player's rating is too low and needs to be adjusted upward. Similarly, when a player's actual tournament scores fall short of his expected scores, that player's rating is adjusted downward. Élő's original suggestion, which is still widely used, was a simple linear adjustment proportional to the amount by which a player overperformed or underperformed his expected score. The maximum possible adjustment per game (sometimes called the K-value) was set at K = 16 for masters and K = 32 for weaker players.
Suppose Player A was expected to score $E_A$ points but actually scored $S_A$ points. The formula for updating his rating is

$$R'_A = R_A + K (S_A - E_A)$$

This update can be performed after each game or each tournament, or after any suitable rating period. An example may help clarify. Suppose Player A has a rating of 1613, and plays in a five-round tournament. He loses to a player rated 1609, draws with a player rated 1477, defeats a player rated 1388, defeats a player rated 1586, and loses to a player rated 1720. His actual score is (0 + 0.5 + 1 + 1 + 0) = 2.5. His expected score, calculated according to the formula above, was (0.506 + 0.686 + 0.785 + 0.539 + 0.351) = 2.867. Therefore his new rating is (1613 + 32·(2.5 − 2.867)) = 1601.
Note that while two wins, two losses, and one draw may seem like a par score, it is worse than expected for Player A because his opponents were lower rated on average. Therefore he is slightly penalized. If he had scored two wins, one loss, and two draws, for a total score of three points, that would have been slightly better than expected, and his new rating would have been (1613 + 32·(3 − 2.867)) = 1617.
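
The tournament example can be checked with a short script. This is a sketch using the logistic expected-score formula and K = 32 as in the text. The helper names are illustrative, and since the text does not say which games change in the three-point scenario, the script assumes the first-round loss becomes a draw.

```python
def expected_score(rating_a, rating_b):
    """Expected score of player A against player B on the logistic curve."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / 400.0))


def rate_event(old_rating, results, k):
    """Rate one event; results is a list of (opponent_rating, score) pairs."""
    actual = sum(score for _, score in results)
    expected = sum(expected_score(old_rating, opp) for opp, _ in results)
    return old_rating + k * (actual - expected), actual, expected


# The five rounds from the example: loss, draw, win, win, loss.
results = [(1609, 0.0), (1477, 0.5), (1388, 1.0), (1586, 1.0), (1720, 0.0)]
new_rating, actual, expected = rate_event(1613, results, k=32)
print(actual, round(expected, 3), round(new_rating))

# Assumed variant: the first-round loss becomes a draw, for 3 points in total.
better = [(1609, 0.5)] + results[1:]
new_rating_better, _, _ = rate_event(1613, better, k=32)
print(round(new_rating_better))
```

The expected total depends only on the opponents' ratings, so it is the same in both scenarios and only the actual score changes.
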
This updating procedure is at the core of the ratings used by FIDE, USCF, Yahoo! Games, the ICC, and FICS. However, each organization has taken a different route to deal with the uncertainty inherent in the ratings, particularly the ratings of newcomers, and to deal with the problem of ratings inflation/deflation. New players are assigned provisional ratings, which are adjusted more drastically than established ratings, and various methods (none completely successful) have been devised to inject points into the rating system so that ratings from different eras are roughly comparable.
The principles used in these rating systems can be used for rating other competitions, for instance international football matches.
Elo ratings have also been applied to games without the possibility of draws, and to games in which the result has a magnitude (a small or large margin of victory) in addition to an outcome (win or loss).

There are three main mathematical concerns relating to the original work of Professor Elo: the correct curve, the correct K-factor, and the crude calculations used during the provisional period.

### Most accurate distribution model

The first major mathematical concern addressed by both FIDE and the USCF was the use of the normal distribution. They found that this did not accurately represent the actual results achieved, particularly by lower-rated players. Instead they switched to a logistic distribution model, which seemed to provide a better fit for the actual results achieved.
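
To get a feel for how much the two models differ, the sketch below compares expected scores under each curve. The normal model here is an assumption: a standard deviation of 200 points per player, hence 200·√2 for the difference between two players. The function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def normal_expected(diff, sigma_per_player=200.0):
    """Expected score for a rating advantage `diff` under a normal model."""
    # Assumption: each player's performance varies with the same sigma,
    # so the difference of two performances has sigma * sqrt(2).
    return NormalDist(mu=0.0, sigma=sigma_per_player * sqrt(2)).cdf(diff)


def logistic_expected(diff):
    """Expected score for a rating advantage `diff` on the logistic curve."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


for diff in (0, 100, 200, 400, 600):
    print(diff, round(normal_expected(diff), 3), round(logistic_expected(diff), 3))
```

Under these assumptions the curves nearly coincide for small rating differences and drift apart at large ones, where the logistic curve gives the lower-rated player a somewhat better expected score.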

### Most accurate K-factor

The second major concern is the correct "K-factor" used. The chess statistician Jeff Sonas reckons that the original K=10 value (for players rated above 2400) in Professor Elo's work is inaccurate. If the K-factor coefficient is set too large, ratings will be too sensitive to individual wins, losses and draws, with a large number of points exchanged in each game. If it is set too low, the sensitivity will be minimal, and it will be hard to gain a significant number of points even for a win.

Elo's original K-factor estimate was made without the benefit of huge databases and statistical evidence. Sonas indicates that a K-factor of 24 (for players rated above 2400) may be more accurate as a predictive tool of future performance, and also more sensitive to performance. A key article is Jeff Sonas, "The Sonas Rating Formula - Better than Elo?"

Certain Internet chess sites seem to avoid a three-level K-factor staggering based on rating range. For example, the ICC seems to adopt a global K=32 except when playing against provisionally rated players. The USCF (which uses a logistic distribution curve as opposed to a normal distribution) has staggered the K-factor according to three main rating ranges:
Players below 2100 -> K factor of 32 used
Players between 2100 and 2400 -> K factor of 24 used
Players above 2400 -> K factor of 16 used
FIDE, according to an article on K-factors by Mark Weeks, apparently uses:
Players with fewer than 30 rated games -> K factor of 25 used
Players rated less than 2400 -> K factor of 15 used
Players rated 2400+ who have played 30+ rated games -> K factor of 10 used
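
The two staggered schedules listed above can be sketched as simple lookup functions. The values are those reported in this article, not an official specification. The function names are hypothetical, the treatment of a rating of exactly 2400 in the USCF schedule is an assumption, and the first FIDE tier is read as "fewer than 30 rated games".

```python
def uscf_k_factor(rating):
    """K-factor under the three USCF rating ranges listed above."""
    if rating < 2100:
        return 32
    if rating <= 2400:  # assumption: exactly 2400 stays in the middle range
        return 24
    return 16


def fide_k_factor(rating, rated_games):
    """K-factor under the FIDE tiers listed above."""
    if rated_games < 30:  # assumption: newcomers with fewer than 30 games
        return 25
    if rating < 2400:
        return 15
    return 10


for rating, games in ((1800, 12), (2250, 200), (2450, 200)):
    print(rating, games, uscf_k_factor(rating), fide_k_factor(rating, games))
```

In both schedules the K-factor falls as the rating rises, so established strong players gain or lose fewer points per game.
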
Certainly in over-the-board chess, the staggering of the K-factor is important to help ensure minimal inflation at the top end of the rating spectrum. In theory, this might apply equally to an online chess server as to a standard over-the-board chess organisation such as FIDE or the USCF: it would be harder for players to reach much higher ratings if their K-factor were lowered, for example from 32 to 16, once they pass 2400. However, the ICC's help page on K-factors indicates that it may simply be the choice of opponents that enables 2800+ players to increase their rating further quite easily. This seems to hold true for GM Shirov, who plays on the ICC under the nickname "leon": his games include a string of opponents who are all rated over 3100. In over-the-board chess, GM Shirov would find a steady stream of 2700+ opponents only in very high-level all-play-all events, of at least FIDE category 15. A category 10 FIDE event, by comparison, has an average player rating between 2476 and 2500. If GM Shirov entered normal Swiss-paired open over-the-board tournaments, however, he would likely meet many opponents rated below 2500 FIDE on a regular basis. A single loss or draw against a player <2500>2400 rating.

### Elo ratings in other competitions

A spin-off system, unrelated to chess and known as the Elo football rating, has been adopted to rate the relative strength of national football teams.
In other sports, individuals maintain rankings based on the Elo algorithm. For instance, Jeff Sagarin publishes rankings for American college football and basketball, with "Elo chess" being one of the two rankings he presents.
In the strategy game Tantrix, an Elo rating scored in a tournament changes the overall rating according to the ratio of the games played in the tournament to the overall game count. With each passing year, older ratings are given less weight until they disappear completely, superseded by newer ratings.[5]
National Scrabble organizations compute normally-distributed Elo ratings except in the United Kingdom, where a different system is used. The North American National Scrabble Association has the largest rated population, numbering over 11,000 as of early 2006.