Cross-referencing with the other benchmark: this model claims to be Claude-level, but that looks slightly suspicious. See https://github.com/bigcode-project/bigcodebench/issues/105