Question about Reproducing FactScore for Llama-3-8b-Instruct

Hi, thanks for your inspiring work!

I would like to ask whether "Llama-3-8b-Chat" in Table 1 refers to the original "meta-llama/Meta-Llama-3-8B-Instruct" model. When I tried to reproduce the results in Table 1, I obtained a FactScore of 35.53 for Llama-3-8B-Instruct and 37.62 for Llama-3-8B-Instruct + FactAlign. Both are far below the values reported in Table 1 (Llama-3-8b-Chat = 54.96, Llama-3-8b-Chat + FactAlign = 62.84).

To compute FactScore, I used the evaluation script provided by [1]. To save costs, I evaluated with the "retrieval+llama+npm" model instead of your "retrieval+ChatGPT" approach; based on the FactScore authors' results, this difference alone shouldn't change the score much. I therefore suspect the gap comes from the decoding parameters: I used the default sampling decoding with temperature=1.0. What decoding strategy did you use? What other factors do you think could explain this discrepancy?
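
To illustrate why I suspect the temperature, here is a minimal sketch in plain Python of how temperature reshapes a next-token distribution. The logits are hypothetical values chosen for illustration, not taken from any model, and `softmax_with_temperature` is a helper name I made up for this example. At temperature=1.0 the low-probability tokens keep noticeably more mass than at lower temperatures, which might plausibly lead to more unsupported facts in long generations.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing them by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; greedy decoding is argmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical next-token logits: one well-supported token and three weaker ones.
toy_logits = [4.0, 2.0, 1.0, 0.5]

for t in (1.0, 0.7, 0.3):
    probs = softmax_with_temperature(toy_logits, t)
    print(f"temperature={t}: top-token probability = {probs[0]:.3f}")
```

Lower temperatures concentrate probability on the top token, and greedy decoding is the limiting case that always picks it.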

Thank you!

[1] https://github.com/shmsw25/FActScore
