Benchmarks: MATH500 (mathematical problem solving), SWE-bench (software fixes on codebases), METR (length of independent thinking time)