Benchmarks

  • MATH500 (competition-level math problem solving)
  • SWE-bench (resolving real GitHub issues in software codebases)
  • METR (length of tasks a model can complete independently)