Guidance on adding REFUTE benchmark task (scientific critique & calibration) #1253

Hi LightEval team,

BGPT has released **REFUTE**, an Apache-2.0-licensed benchmark for scientific critique and epistemic calibration, hosted on Hugging Face:

- Dataset: https://huggingface.co/datasets/BGPT-OFFICIAL/refute
- Technical report: https://huggingface.co/datasets/BGPT-OFFICIAL/refute/blob/main/TECHNICAL_REPORT.md

We already have prototype integrations for:
- Inspect AI: https://huggingface.co/datasets/BGPT-OFFICIAL/refute/tree/main/integrations/inspect_ai
- lm-evaluation-harness: https://huggingface.co/datasets/BGPT-OFFICIAL/refute/tree/main/integrations/lm_eval_harness

Before opening a LightEval contribution, we'd appreciate guidance on:
1. The preferred task format for a benchmark that mixes generative axes (rubric-scored critique) with objective axes (forced-choice / soundness)
2. Whether REFUTE fits within LightEval's benchmark scope
3. The recommended path for compatibility with Hub eval results
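
To make question 1 more concrete, here is a rough illustration of what "mixed axes" means in practice: each sample yields a score on either a rubric-scored axis or an exact-match axis, and results are reported per axis instead of being collapsed into one number. This is plain Python only and does not use LightEval's API. Every name in it (`AxisResult`, `rubric_score`, `exact_match`, `aggregate`, and the axis labels) is hypothetical and is not taken from the REFUTE dataset schema.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class AxisResult:
    """One scored sample. Field names are hypothetical, not REFUTE's schema."""
    axis: str     # e.g. "critique" (rubric-scored) or "forced_choice" (objective)
    score: float  # normalized to the range [0, 1]


def rubric_score(points_awarded: int, points_possible: int) -> float:
    """Normalize a rubric grade (however it was produced) to [0, 1]."""
    if points_possible <= 0:
        raise ValueError("points_possible must be positive")
    return points_awarded / points_possible


def exact_match(prediction: str, gold: str) -> float:
    """Objective axis: 1.0 if the labels match, ignoring case and whitespace."""
    return 1.0 if prediction.strip().lower() == gold.strip().lower() else 0.0


def aggregate(results: list[AxisResult]) -> dict[str, float]:
    """Report the mean score per axis, keeping the axes separate."""
    by_axis: dict[str, list[float]] = {}
    for result in results:
        by_axis.setdefault(result.axis, []).append(result.score)
    return {axis: mean(scores) for axis, scores in by_axis.items()}


results = [
    AxisResult("critique", rubric_score(3, 4)),
    AxisResult("critique", rubric_score(1, 4)),
    AxisResult("forced_choice", exact_match(" B", "b")),
    AxisResult("forced_choice", exact_match("A", "b")),
]
summary = aggregate(results)
```

The open point is whether LightEval would prefer this kind of split as one task with several metrics or as separate tasks per axis.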

Happy to adapt to repo conventions. Thanks!
