The SuperGLUE benchmark measures progress on language understanding tasks.
The original benchmark, GLUE (General Language Understanding Evaluation), is a collection of language understanding tasks built on established existing datasets, selected to cover a diverse range of dataset sizes, text genres, and degrees of difficulty. The tasks were sourced from a survey of ML researchers, and the benchmark was launched in mid-2018. Several models have since surpassed the GLUE human baseline.
The new SuperGLUE benchmark contains a set of more difficult language understanding tasks. The human baseline on SuperGLUE is 89.8. As of July 19th, 2019, the best-performing ML model is BERT++ with a score of 71.5. Will language model performance have progressed enough that, by next year, a model achieves superhuman performance on the SuperGLUE benchmark?