Hi LightEval team,
BGPT released REFUTE, an Apache-2.0 Hugging Face benchmark for scientific critique and epistemic calibration:
We have prototype integrations:
Before opening a LightEval contribution, we'd appreciate guidance on:
- Preferred task format for a benchmark that mixes generative (rubric-scored critique) and objective (forced-choice / soundness) axes
- Whether REFUTE fits LightEval's benchmark scope
- Recommended path for Hub eval-results compatibility
Happy to adapt to repo conventions. Thanks!