[Turing] Guidelines for RLHF Assessment_2025_DS.pptx
We are seeking talented individuals capable of training Large Language Models (LLMs) to solve
real-world problems for the world’s biggest companies.
To ensure a fair evaluation for all candidates, please adhere to the following:
● No online resources or tools (e.g., LeetCode, ChatGPT, search engines) are permitted
🚩 Failure to comply with these guidelines WILL result in immediate disqualification. We appreciate your
honesty and commitment to a fair process.
RLHF Assessment Objective
- You will be provided a short conversation based on interactions between a user and an AI model
assistant.
- At the end of the conversation, there will be two potential responses given by the AI model to
answer the user’s last message/request.
Good Example (1/2)

User: Give me a fast method in python to check if a list contains duplicated items.
Model response A:
Model response B:

Correctness: Consider the factuality, logical flow, contextual relevance, and consistency.

Response A's approach provides code that has no major issues in satisfying the user's ask for python code to check if a list contains duplicated items. It is highly efficient due to leveraging Python's set, which only holds unique elements. This allows for constant-time membership checks (O(1)), ensuring quick operations for most dataset sizes.

Response B's approach provides code that has minor issues in satisfying the user's ask for python code to check if a list contains duplicated items. It is less efficient for larger lists, as it relies on comparing each element to every other, yielding a time complexity of O(n^2). Additionally, there's a minor error in using "for i in range(n+1)" instead of the correct "for i in range(n)".
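The two model responses themselves are not reproduced here, but based on the description above (a set-based check versus an all-pairs comparison with an off-by-one in the outer loop), they might look roughly like the following sketch. Function names are illustrative, and the Response B variant is shown with the `range(n)` fix applied:

```python
def has_duplicates_fast(items):
    """Set-based check in the spirit of Response A.

    A set only holds unique elements, so average membership tests
    are O(1) and the whole scan is O(n).
    """
    seen = set()
    for item in items:
        if item in seen:  # constant-time membership check
            return True
        seen.add(item)
    return False


def has_duplicates_slow(items):
    """Pairwise comparison in the spirit of Response B.

    Every element is compared to every later element, so the
    overall cost is O(n^2). Note the outer loop must use
    range(n), not range(n + 1), or items[i] raises IndexError
    on the final iteration.
    """
    n = len(items)
    for i in range(n):  # NOT range(n + 1)
        for j in range(i + 1, n):
            if items[i] == items[j]:
                return True
    return False
```

Both functions return the same answers; they differ only in how the work scales with the length of the input list.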
Informativeness: Is the information provided relevant to the user’s request? Is the
information complete?
Response A's approach is not informative. Its lack of comments and the absence of a practical usage example might limit its immediate clarity for some users.
Response B's approach is very informative. Its extensive commenting enhances readability, aiding beginner understanding.
Which one would you choose as the best possible model response? Explain why.
Response A is the better response, as it has no correctness issues and is more efficient.
Response A is the preferred way to check for duplicates in most cases. Response B, while
simpler and easier to understand, can be impractically slow for lists with many elements.
The user is looking for a fast solution, so model response A is the best of the two.
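The speed argument above can be made concrete with a quick, rough timing comparison on a list of a few thousand distinct values. The helper names and sizes here are illustrative, not taken from the slides, and the timings are indicative only:

```python
import random
import time


def dup_set(items):
    # Response A style: building a set drops duplicates, so a
    # length change signals that at least one duplicate existed.
    return len(set(items)) != len(items)


def dup_pairs(items):
    # Response B style: compare every pair of elements, O(n^2) overall.
    n = len(items)
    return any(items[i] == items[j]
               for i in range(n) for j in range(i + 1, n))


# 3000 distinct values, so neither function finds a duplicate
# and both must scan their full search space.
data = random.sample(range(10_000_000), 3000)

t0 = time.perf_counter()
r1 = dup_set(data)
t1 = time.perf_counter()
r2 = dup_pairs(data)
t2 = time.perf_counter()

print(f"set-based: {t1 - t0:.6f}s, pairwise: {t2 - t1:.6f}s, "
      f"duplicates found: {r1}, {r2}")
```

On a duplicate-free input of this size, the pairwise version performs roughly 4.5 million comparisons, while the set-based version does a single linear pass, which is why Response A is the faster choice the user asked for.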
Good Example (2/2)

User: Give me a fast method in python to check if a list contains duplicated items.
Model response A:

Solution:
In this solution:
- The answer is comprehensive, demonstrating that aspects of correctness and informativeness of both responses were addressed in the respective input boxes.
- The depth of analysis is commendable, as the two approaches were compared in terms of their time complexity.
Bad Example

User: Give me a fast method in python to check if a list contains duplicated items.
Model response A:
Model response B:

Correctness: Consider the factuality, logical flow, contextual relevance, and consistency.

Response A is good because set a fast function. Only use here 1 for and 1 if. Code is short.
Response B is also good because is return corrected result. FOR loops is correct.

Which one would you choose as the best possible model response? Explain why.

Response A is better than B because using set is the best solution.
In this solution:
- The answer is too short, lacking a comprehensive comparison of the two model responses.
- In "Informativeness", response A is not even discussed.
- The depth of analysis is too superficial.
- Some important differences are missing: comments, test case, incorrect range in
outer FOR loop.
- Spelling/grammar mistakes are glaring.
This is not a good solution, even though model response A is indeed better than B and set() is the better approach.
You’re good to start!