Question about Reproducing FactScore for Llama-3-8b-Instruct #1

@LuckyyySTA

Hi, thanks for your inspiring work!

Does the Llama-3-8b-Chat in Table 1 refer to the original "meta-llama/Meta-Llama-3-8B-Instruct" model? When I tried to reproduce the results in Table 1, I got a FActScore of 35.53 for Llama-3-8B-Instruct and 37.62 for Llama-3-8B-Instruct + FactAlign. Both are far below the values reported in Table 1 (Llama-3-8b-Chat = 54.96, Llama-3-8b-Chat + FactAlign = 62.84).

To compute FActScore, I used the evaluation script provided by [1]. To save costs, I evaluated with the "retrieval+llama+npm" model. This differs from your "retrieval+ChatGPT" setup, but based on the FActScore authors' results, the difference shouldn't be that large. I therefore suspect the decoding parameters: I used the default sampling decoding with temperature=1.0. What decoding strategy did you use? What other factors do you think could cause this discrepancy?
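To illustrate why decoding parameters are a plausible suspect, here is a minimal, dependency-free sketch of how temperature reshapes the next-token distribution compared with greedy decoding. The logits and the helper names (`softmax_with_temperature`, `greedy_choice`) are hypothetical and chosen only for illustration; this is not the decoding code of either repository, and it does not by itself show that temperature explains the gap.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert raw logits into sampling probabilities at a given temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive; use greedy_choice for argmax decoding")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_choice(logits):
    """Greedy decoding: always pick the index of the highest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Hypothetical logits for three candidate next tokens (illustrative values only).
toy_logits = [2.0, 1.0, 0.0]

probs_t1 = softmax_with_temperature(toy_logits, temperature=1.0)
probs_t03 = softmax_with_temperature(toy_logits, temperature=0.3)

print("temperature=1.0:", [round(p, 3) for p in probs_t1])
print("temperature=0.3:", [round(p, 3) for p in probs_t03])
print("greedy pick    :", greedy_choice(toy_logits))
```

At temperature=1.0 the lower-ranked tokens keep a noticeable share of the probability mass, while a lower temperature or greedy decoding concentrates on the top token. Whether that accounts for the difference in FActScore here is an open question for the authors.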

Thank you!

[1] https://github.com/shmsw25/FActScore
