Forecasting AI

Created by: ghabs
AI Technical Benchmarks



The SuperGLUE benchmark measures progress on language understanding tasks.

The original benchmark, GLUE (General Language Understanding Evaluation), is a collection of language understanding tasks built on established existing datasets, selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty. The tasks were sourced from a survey of ML researchers, and the benchmark was launched in mid-2018. Several models have since surpassed the GLUE human baseline.

The new SuperGLUE benchmark contains a set of more difficult language understanding tasks. Human-level performance on the SuperGLUE benchmark is 89.8. As of July 19th, 2019, the best-performing ML model is BERT++, with a score of 71.5. Will language model performance have progressed enough that, by next year, a model will achieve superhuman performance on the SuperGLUE benchmark?
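The resolution condition above can be sketched as a simple threshold check. This is a hypothetical illustration using only the figures stated in the question (the human baseline of 89.8 and the BERT++ score of 71.5); it does not query any live leaderboard.

```python
# Human-level performance on the SuperGLUE benchmark, per the question text.
HUMAN_BASELINE = 89.8

def is_superhuman(model_score: float, baseline: float = HUMAN_BASELINE) -> bool:
    """Return True if a SuperGLUE leaderboard score exceeds the human baseline."""
    return model_score > baseline

# BERT++ score as of July 19th, 2019 — well short of the 89.8 baseline.
print(is_superhuman(71.5))
```

Under this reading, the question resolves positively once the top leaderboard score passes 89.8.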