[BFCL v1] Update Executable Ground Truth for REST Category #708

Merged
merged 3 commits into from
Oct 17, 2024

Conversation

CharlieJCJ

@CharlieJCJ CharlieJCJ commented Oct 17, 2024

Previously, during the sanity check, one of the APIs changed its response structure, so the stored ground truth no longer matched the live response and the sanity check failed.

Example command:

❯ bfcl evaluate --model claude-3-5-sonnet-20240620 --test-category executable -c
Number of models evaluated:   0%|                                                                                                                                                     | 0/56 [00:00<?, ?it/s]🦍 Model: claude-3-5-sonnet-20240620
🔍 Running test: exec_simple
---- Sanity checking API status ----
All placeholders API keys have been replaced. 🦍
API Status Test (REST): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 70/70 [00:57<00:00,  1.22it/s]
API Status Test (Non-REST): 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 28/28 [01:05<00:00,  2.35s/it]
API Status Test (Non-REST):  96%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████     | 27/28 [01:05<00:00,  1.28it/s]
------------------ Executable Categories' Error Bounds Based on API Health Status ------------------

❗️ Warning: Unable to verify health of executable APIs used in executable test category (REST). Please contact API provider.

5 / 70 APIs affected:

  - Test Case: requests.get(url='https://date.nager.at/api/v3/LongWeekend/2023/CA')
    Error Type: executable_checker_rest:wrong_key

  - Test Case: requests.get(url='https://date.nager.at/api/v3/LongWeekend/2023/CA')
    Error Type: executable_checker_rest:wrong_key

  - Test Case: requests.get(url='https://date.nager.at/api/v3/LongWeekend/2023/FR')
    Error Type: executable_checker_rest:wrong_key

  - Test Case: requests.get(url='https://date.nager.at/api/v3/LongWeekend/2023/JP')
    Error Type: executable_checker_rest:wrong_key

  - Test Case: requests.get(url='https://date.nager.at/api/v3/LongWeekend/2023/CA')
    Error Type: executable_checker_rest:wrong_key

----------------------------------------------------------------------------------------------------

This PR updates the ground truth file to be up-to-date with the latest API structure.
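
For illustration, here is a minimal sketch of how a stale ground truth can produce a `wrong_key`-style failure once an API changes its response structure. It assumes the check compares the key structure of the stored response against the live one; this is not the actual BFCL checker code. The function name `structure_mismatch` and the field names in the sample payloads are hypothetical and chosen only for the example.

```python
def structure_mismatch(ground_truth, live, path="$"):
    """Return a list of paths where the key structure of two JSON-like values differs.

    Hypothetical helper for illustration; not part of BFCL.
    """
    problems = []
    if isinstance(ground_truth, dict) and isinstance(live, dict):
        # Keys present on only one side indicate a structural change.
        for key in sorted(set(ground_truth) - set(live)):
            problems.append(f"{path}.{key}: missing in live response")
        for key in sorted(set(live) - set(ground_truth)):
            problems.append(f"{path}.{key}: not in ground truth")
        # Recurse into keys that both sides share.
        for key in sorted(set(ground_truth) & set(live)):
            problems.extend(
                structure_mismatch(ground_truth[key], live[key], f"{path}.{key}")
            )
    elif isinstance(ground_truth, list) and isinstance(live, list):
        # Compare only the first element as a representative of the list shape.
        if ground_truth and live:
            problems.extend(
                structure_mismatch(ground_truth[0], live[0], f"{path}[0]")
            )
    elif type(ground_truth) is not type(live):
        problems.append(
            f"{path}: type changed from {type(ground_truth).__name__} "
            f"to {type(live).__name__}"
        )
    return problems


# Invented payloads: the live response gained a field the stored one lacks.
stored = [{"startDate": "2023-04-07", "dayCount": 4}]
current = [{"startDate": "2023-04-07", "dayCount": 4, "extraField": []}]

for problem in structure_mismatch(stored, current):
    print(problem)
```

Under this assumption, regenerating the stored response from the current API output, as this PR does for the ground truth file, makes the comparison return an empty list again.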

This will change the leaderboard score. We will update the leaderboard in a separate PR.

@HuanzhiMao HuanzhiMao added the BFCL-General (General BFCL Issue) and BFCL-Dataset (BFCL Dataset-Related Issue) labels and removed the BFCL-General (General BFCL Issue) label Oct 17, 2024

@HuanzhiMao HuanzhiMao left a comment

The changes LGTM.
But this PR will affect the leaderboard score. Please update your PR description @CharlieJCJ.
The checker only pulls the latest response for the (normal) executable categories, but not for the REST category. Code here.

@HuanzhiMao HuanzhiMao changed the title from "[BFCL v1 REST category] Updates on executable ground truth for rest category" to "[BFCL v1] Update Executable Ground Truth for REST Category" Oct 17, 2024
@HuanzhiMao HuanzhiMao merged commit 38216dc into ShishirPatil:main Oct 17, 2024
ShishirPatil pushed a commit that referenced this pull request Oct 21, 2024
This PR updates the leaderboard to reflect the change in score due to
the following PR merge:

1. #660 
2. #661
3. #683
4. #679
5. #708 
6. #709
7. #701
8. #657 
9. #658 
10. #640 
11. #653
12. #642 
13. #696 
14. #667

Close #662.

Note: Some models (like `firefunction`, `functionary`,
`microsoft/phi`) are not included in this leaderboard update because we
don't have all the entries generated. We will add them back once we get
the full result generated.
VishnuSuresh27 pushed a commit to VishnuSuresh27/gorilla that referenced this pull request Nov 11, 2024
…til#708)

This PR updates the ground truth file to be up-to-date with the latest
API structure.

This will change the leaderboard score. 

---------

Co-authored-by: Huanzhi (Hans) Mao <huanzhimao@gmail.com>
Labels
BFCL-Dataset BFCL Dataset-Related Issue