Metaculus

Metaculus Scoring System

Note: This page is about the obsolete Metaculus Points. In November 2023, we switched to using Baseline and Peer scores, but this page has not been updated yet.

Here at Metaculus, we want our players to make the best possible predictions. To that end, we needed to come up with a scoring system that rewards both individuals and the community as a whole for being correct over the long run. This app demonstrates a variety of scoring models, and it shows how the final result can depend upon the distribution of player predictions.

Try playing around with it to design what you think would be the best way to calculate scores, or jump down to our explanation of how the whole thing works, and the scoring system Metaculus uses.

Distribution of player predictions

Distribution 1

mode: ...
width: ...
# players: ...

Distribution 2

mode: ...
width: ...
# players: ...

Output player scores

Constant score function: ...

Relative score function: ...

Real-world probability: ...

  • If the outcome is yes, the average player will get ... points
  • If the outcome is no, the average player will get ... points
  • Given the probability of the event, the expected average is ... points

Explanation

If we were only asking our players to bet for or against an event happening, then our scoring would be straightforward: we'd set up something like a prediction market with payout proportional to the ratio of people making the opposite bet. Instead, we want our players to predict the probability of an outcome, not the outcome itself. This makes the scoring a bit more complicated, but there are a few features that any good system should have:

  1. If an event's real-world probability of happening is \(x\), then the expected score for an individual player (i.e., the average of that player's score over many similar events) should be maximized when the player guesses the probability \(x\). This is known as the proper score criterion.
  2. If the community as a whole predicts that something is likely to happen, and it does happen, then the players on average should get rewarded. Conversely, the community should not get rewarded for being wrong.
  3. The scoring should be dynamic and depend upon what the community thinks. If everyone else thinks that something is sure to happen, but you think it won't, then you should get a big payout if it doesn't happen. And if there is strong consensus in the community, you should only get a limited payout for riding along by agreeing.
  4. The scoring should be relatively robust against outliers. A single person making a crazy prediction shouldn't affect everyone else's score too much.

By playing around with the above sliders, you might be able to get a sense of which scoring functions satisfy which criteria.

The top set of sliders changes the distribution of player guesses (displayed in the first graph), so you can, for example, see what the scores would look like if everyone guessed 99%. The next couple of sliders and drop-downs change the precise functions used for scoring. There is a constant scoring function, which does not depend upon the player distribution, and there is a relative scoring function, in which players are scored against each other. The magnitude of the relative scoring function increases as more players make predictions. You can mix and match these functions, and you can change the relative weights of the two components. The bottom graph shows the resulting score as a function of a player's prediction, both in the case that the outcome is yes and in the case that it is no.

Finally, there's a slider to change the actual probability of the event happening. Of course, generally no one knows ahead of time what the real probability is — that's why we're asking you to help make predictions — so we can't use that in the scoring. But we can use it to see what the expected score is for any given player's prediction. This is shown as the thin black line in the bottom graph.
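
The expected score can be written out explicitly. As a sketch, using \(S_{\rm yes}(p)\) and \(S_{\rm no}(p)\) as shorthand (labels introduced here for illustration, not used elsewhere on this page) for the score that a prediction \(p\) earns under each outcome, and \(x\) for the real-world probability, the expected score is
\[ \mathbb{E}[S(p)] = x \, S_{\rm yes}(p) + (1 - x) \, S_{\rm no}(p). \]
A scoring function satisfies the proper score criterion when this expression is largest at \(p = x\).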

Scoring functions

We go into some mathematical detail here for those who would like to see it. Rest assured, you need to know precisely 0% of this information to use Metaculus and make great predictions.

There are an infinite number of proper score functions, so the task of picking one at first seems a little daunting, but there are a few popular ones from which we can start.
\[ S_{\rm log}(p) = \begin{cases} \log(p) & \text{if the outcome is $yes$} \\ \log(1-p) & \text{if the outcome is $no$} \end{cases} \]
\[ S_{\rm quadratic/Brier}(p) = \begin{cases} -(1-p)^2 & \text{if the outcome is $yes$} \\ -p^2 & \text{if the outcome is $no$} \end{cases} \]
\[ S_{\rm spherical}(p) = \begin{cases} \frac{p}{\sqrt{p^2 + (1-p)^2}} & \text{if the outcome is $yes$} \\ \frac{1-p}{\sqrt{p^2 + (1-p)^2}} & \text{if the outcome is $no$} \end{cases} \]
It's easy to account for the average community prediction \(p_c\) by adding a constant to each of these. For example, \(S_{\rm log}(p, p_c) = S_{\rm log}(p) - S_{\rm log}(p_c)\). This way a player would get precisely zero points if they just go along with the community average.
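
To make the proper score criterion concrete, here is a minimal Python sketch of the log, quadratic/Brier, and spherical functions, together with a brute-force check of where the expected score peaks. The function names, the grid search, and the example probability of 0.7 are illustrative assumptions, not code taken from the Metaculus site.

```python
import math

def log_score(p, outcome):
    """Log score: log(p) on yes, log(1 - p) on no."""
    return math.log(p) if outcome else math.log(1 - p)

def brier_score(p, outcome):
    """Quadratic/Brier score: -(1 - p)^2 on yes, -p^2 on no."""
    return -(1 - p) ** 2 if outcome else -p ** 2

def spherical_score(p, outcome):
    """Spherical score: p or (1 - p), divided by sqrt(p^2 + (1 - p)^2)."""
    norm = math.sqrt(p ** 2 + (1 - p) ** 2)
    return (p if outcome else 1 - p) / norm

def expected_score(score, p, x):
    """Average score for prediction p when the real-world probability is x."""
    return x * score(p, True) + (1 - x) * score(p, False)

def best_guess(score, x, steps=999):
    """Grid search (hypothetical helper) for the prediction with the highest expected score."""
    grid = [i / (steps + 1) for i in range(1, steps + 1)]
    return max(grid, key=lambda p: expected_score(score, p, x))

def relative_log_score(p, p_c, outcome):
    """Log score measured against the community prediction p_c."""
    return log_score(p, outcome) - log_score(p_c, outcome)

if __name__ == "__main__":
    for score in (log_score, brier_score, spherical_score):
        print(score.__name__, best_guess(score, 0.7))
```

For each of the three functions, the grid search should land on the real-world probability itself, which is what the proper score criterion requires. The last helper shows the community-relative variant: a player who matches \(p_c\) gets zero whichever way the question resolves.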

We also introduce a set of betting functions in which a player's score is calculated as if they made a bet with the community setting the odds.
\[ S_{\rm bet}(p, p_c) = \begin{cases} +(1-p_c) & \text{if the outcome is $yes$ and $p > p_c$} \\ -p_c & \text{if the outcome is $no$ and $p > p_c$} \\ -(1-p_c) & \text{if the outcome is $yes$ and $p < p_c$} \\ +p_c & \text{if the outcome is $no$ and $p < p_c$} \\ 0 & \text{if $p = p_c$} \end{cases} \]
This is the constant pool betting function, because the total number of points risked plus the total number of points potentially gained is equal to a constant. Similar functions can be defined where instead the total amount risked is set to a constant, the total amount gained is set to a constant, or anywhere in between (e.g., the sqrt gain function has the player gaining \(\sqrt{1-p_c}\) if the outcome is yes). Even though these functions are largely flat and so have no unique maximum, they are still proper scoring functions in the sense that the player cannot get a better expected score by predicting something other than the real-world probability, no matter what the community prediction happens to be.
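
The claim that a flat betting function is still proper can be checked numerically. The sketch below implements the constant pool betting function as defined above and compares the expected score of an honest prediction against every other prediction on a grid. The names and the example values (\(x = 0.7\), \(p_c = 0.5\)) are assumptions chosen for illustration.

```python
def bet_score(p, p_c, outcome):
    """Constant pool bet: the player backs 'yes' if p > p_c and 'no' if p < p_c."""
    if p == p_c:
        return 0.0
    if p > p_c:
        return (1 - p_c) if outcome else -p_c
    return -(1 - p_c) if outcome else p_c

def expected_bet_score(p, p_c, x):
    """Average bet score for prediction p when the real-world probability is x."""
    return x * bet_score(p, p_c, True) + (1 - x) * bet_score(p, p_c, False)

if __name__ == "__main__":
    x, p_c = 0.7, 0.5
    honest = expected_bet_score(x, p_c, x)
    grid = [i / 100 for i in range(1, 100)]
    best = max(expected_bet_score(p, p_c, x) for p in grid)
    # The honest prediction ties for the best expected score; it is never beaten.
    print(honest, best)
```

Working through the cases, the expected score is \(x - p_c\) for any \(p > p_c\) and \(p_c - x\) for any \(p < p_c\), so every prediction on the same side of \(p_c\) as the real-world probability ties for the maximum, including the honest one.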

Of course, the scoring functions can be scaled by any constant without changing their properties. We have chosen a normalization such that average scores tend to fall in the range of 10-100 points. More specifically, each scoring function will yield exactly 100 points for a positive outcome if the player predicts 99% and the rest of the community predicts 50%. The relative scoring functions are further scaled by \[ \log\left(1 + \frac{n}{20}\right), \] where \(n\) is the total number of predictions (only counting the most recent prediction for each player).
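
As a rough sketch of this normalization, the Python below scales the community-relative log score so that a 99% prediction against a 50% community earns 100 points on a positive outcome, and then applies the \(\log(1 + n/20)\) factor. The function names are hypothetical, and multiplying the two factors together in this order is an assumption; the page does not spell out exactly how they combine on the site.

```python
import math

def relative_log_score(p, p_c, outcome):
    """S_log(p, p_c) = S_log(p) - S_log(p_c)."""
    if outcome:
        return math.log(p) - math.log(p_c)
    return math.log(1 - p) - math.log(1 - p_c)

def normalization_constant(score, p_ref=0.99, p_c_ref=0.5, target=100.0):
    """Constant that makes score(p_ref, p_c_ref, yes) equal to target points."""
    return target / score(p_ref, p_c_ref, True)

def population_factor(n):
    """Extra scaling for relative scores, where n is the number of predictions."""
    return math.log(1 + n / 20)

def scaled_score(p, p_c, outcome, n):
    """Assumed combination: normalization times population factor times raw score."""
    k = normalization_constant(relative_log_score)
    return k * population_factor(n) * relative_log_score(p, p_c, outcome)

if __name__ == "__main__":
    k = normalization_constant(relative_log_score)
    print(k * relative_log_score(0.99, 0.5, True))  # normalized score before the population factor
    for n in (5, 20, 100, 1000):
        print(n, population_factor(n))
```

The population factor is zero when there are no predictions and grows only logarithmically, so under this assumed combination a question with 1000 predictions is worth several times more than one with 20, not fifty times more.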

One nice thing about proper scoring functions is that any linear combination of different proper scoring functions will result in another proper scoring function. So, for example, we can combine the score based on one value of \(p_c\) with the score for another value of \(p_c\). This lets us create a fully relative scoring function \(R(p)\), where each player's score depends on each other player's guess, \[ R(p) = \frac{\sum_i S(p, p_i)}{n}, \] where the sum is over all other players' predictions. This is what is used in the above graph, and, with a combination of log scoring and sqrt gain betting (with values given by the defaults when you load this page), it's what's used to power the scoring on the Metaculus site.
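
The fully relative score can be sketched in a few lines of Python. This example uses the community-relative log score for \(S\) and divides by the number of other predictions; both choices are assumptions for illustration, as are the function names, and the sketch leaves out the sqrt gain betting component and the weights used on the site.

```python
import math

def relative_log_score(p, p_i, outcome):
    """Log score of prediction p measured against another prediction p_i."""
    if outcome:
        return math.log(p) - math.log(p_i)
    return math.log(1 - p) - math.log(1 - p_i)

def fully_relative_score(p, others, outcome, score=relative_log_score):
    """R(p): average of S(p, p_i) over the other players' predictions."""
    if not others:
        raise ValueError("need at least one other prediction")
    return sum(score(p, p_i, outcome) for p_i in others) / len(others)

if __name__ == "__main__":
    others = [0.4, 0.5, 0.6, 0.9]
    for p in (0.3, 0.6, 0.95):
        print(p,
              fully_relative_score(p, others, True),
              fully_relative_score(p, others, False))
```

Because \(R(p)\) is an average of proper scoring functions, it stays proper. A player who agrees with everyone else scores zero, and a player who is more confident than the crowd gains on one outcome and loses on the other.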