Bee SFT Coding 240918 023144
🚨IMPORTANT🚨
You must complete the task IN YOUR LANGUAGE (e.g., Spanish, German, French). Both the prompt and the response must be in the same language, i.e., your language. Always check the task instructions; for example, a task may be assigned in Spanish.
Table of Contents
🔭 Goal
🧑💻Structured Data Generation
📐Data Format
✅ Good Example
🛑 Bad Example
🔑 Rubric
❓FAQ
🔭 Goal
Your mission: Beat the state of the art model (SOTA) by providing your own prompt(s) and
response(s) that are:
Following the user’s instruction
Concise
Truthful
Harmless
Satisfying
These are all explained in the Rubric section below.
Single Turn: In this project, you'll be asked to write both the initial prompt and the response to that prompt. For each task, this is the workflow:
Step 1: Write prompt.
Step 2: Write response.
Step 3: Click submit.
🚨IMPORTANT🚨
DO NOT WRITE CODE BLOCKS! You do not have to write code to convert from one form to the other; you just convert the data.
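As a hedged illustration of converting data directly (the sample values below are invented), a prompt that asks to turn CSV into JSON should be answered with the converted data itself, not with a conversion script:

```csv
name,age
Ana,31
Luis,28
```

converts directly to:

```json
[
  {"name": "Ana", "age": 31},
  {"name": "Luis", "age": 28}
]
```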
It is important to match the tone of the prompt. Essentially, there are two ways to write tasks:
Descriptions: The user provides one or more data examples, and the assistant should stick to the format and any additional requirements.
Example Prompt: "I am a teacher and I need a CSV that…"
Example Response: "Here is the CSV!"
DO NOT just give the data → EXPLAIN!
Direct Examples: The user requirements are very specific.
Example Prompt: "Generate the JSON for the following data …, only show me the JSON object."
Example Response: Only output the JSON object.
DO NOT SAY: "Sure! Here's the JSON: …" → BE DIRECT
📐Data Format
Code: Include inline code and code blocks in standard Markdown format.
Inline code should be enclosed in single backticks (`).
Code blocks should be enclosed in triple backticks (```), with the language name whenever it is available.
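For instance, a minimal Markdown sketch (the field names are illustrative):

````markdown
Inline code looks like `user_id`.

```json
{"user_id": 42, "active": true}
```
````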
Helpful Tools
JSON Tool → https://jsonformatter.org/markdown-formatter
CSV Tool → https://www.convertcsv.com/csv-to-markdown.htm
Table Tool → https://www.tablesgenerator.com/markdown_tables
✅ Good Example
Example 1: Description
What makes this a good prompt?
Clear Asks
Examples
What makes this a good response?
Similar to a well-articulated response that a high-quality LLM could output.
The text is directed to the user and mentions key parts of the prompt.
The data is organized, with the code section first and then an explanation of the key components.
There are no grammar or syntax errors, and the response is correctly formatted; it has likely been spell- and grammar-checked.
Completeness
All parts of the question are directly answered, specifically the formatting of the output file, just as the user asks.
There is the data component and also an explanation.
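As a hedged sketch of that structure (the gradebook data is invented), a Description-style response puts the data first and then explains it:

````markdown
Here is a CSV gradebook you can paste straight into a spreadsheet:

```csv
student,assignment,score
Maria,Essay 1,92
Tom,Essay 1,85
```

Each row holds one student's score, and the header row names the columns so the file imports cleanly.
````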
Example 2: Direct Example
Prompt: I want ideas for spring and summer painting motifs for my large scale canvas
paintings. I want the ideas in three tables, each table with the following themes: Botanical
Delights, Tranquil Scenes, Abstract Expressions.
What makes this a good prompt?
Clear Asks
Examples
What makes this a good response?
Similar to a well-articulated response that a high-quality LLM could output.
The text is directed to the user and doesn't explain more than necessary, because all the user wants is the tables.
Completeness
All parts of the question are directly answered, specifically the formatting of the output, just as the user asks.
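As an illustrative sketch (the motif ideas are invented), one of the three requested tables might look like this:

```markdown
| Botanical Delights          |
|-----------------------------|
| Oversized peony blooms      |
| Fern fronds in morning mist |
| Wild poppy field at noon    |
```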
More examples here.
🛑 Bad Example
Common mistakes to avoid:
Lack of Context in Prompts. Ensure that each prompt includes a relevant context or real-life situation.
Lack of a clear ask. Please ensure your prompt has a clear ask. After outlining the problem and context, ask a direct, clear question.
Broken code. IF YOU USE CODE, please make sure the code runs successfully without any errors. Please use your own IDE or an online compiler to test the code.
Incomplete Responses. Your response must address every aspect of the prompt. Missing any part of the request will cause lower scores.
Formatting and Conciseness: Note the format of your response. Ensure it is clear, concise, and follows any specified formatting guidelines.
Alignment with SxS Scores: Make sure your evaluations align with the side-by-side (SxS) scores. Consistency is key.
Objective Scoring: Be objective when assigning scores. Avoid giving higher scores than the response realistically deserves.
Avoid Over-Scoring: A score of 7 should be extremely rare. Reserve this for cases where the State-of-the-Art (SOTA) response fails in every aspect, which almost never happens.
Invalid prompt for Structured Data Generation. You are not asking a question that involves managing and generating structured data; instead, you use only one data type (i.e., using this JSON to make another JSON).
Missing markdown format for the programming language. Please ensure you use the correct markdown tag for the programming language used in the prompt (if code is present in the prompt) and in the response.
Using prohibited libraries. Please do not use SwiftUI, UIKit, AppKit, XCTest, Combine, ARKit, CoreML, SceneKit, Metal, or CoreData.
Not mentioning the programming language in your prompts. Not including the programming language in your prompt may lead to SOTA responses generated in another language, which makes it difficult to evaluate which response is better when comparing yours with the SOTA response. Such tasks will be rejected.
Using complete system design problems in the prompts. Such prompts are not helpful in training the model. We are looking for more logical, real-life problems that can be reviewed in under 30 minutes.
A bad example, by contrast, has NO Clear Asks and NO Examples.
🔑 Rubric
In addition to a prompt having:
Real-Life User Request
Clear Asks
Examples of Data
…and a response having:
LLM-like Structure
Consideration for the User
Organization and Readability
Completeness

INSTRUCTION FOLLOWING
Requirements
Scoring
1-2 (Terrible)
Language: The prompt or response is not in the indicated language, dialect, or spelling convention; or it is partially in the indicated language but has major errors that make it hard to understand.
Length: The response significantly deviates from the length instructions ('500 words' or '2 sentences').
Role/Context: The prompt is unclear about the role or context expected from the response, or the response does not follow role or context instructions.
Tone: The response contradicts the tone requested in the prompt or takes an inappropriate tone of voice for the context.

3 (Not Passing)
Language: The prompt or response is in the indicated language, dialect, or spelling convention, but has some spelling, grammar, or phrasing errors.
Length: The response partially follows length instructions.
Role/Context: The response is mostly clear on the role or context, and mostly follows contextual instructions.
Tone: The response generally follows the tone requested in the prompt, with only minor errors.

4-5 (Excellent)
Language: The prompt or response is in the indicated language, dialect, or spelling convention with no errors or only minor errors.
Length: The response exactly or nearly follows the length requirements.
Role/Context: The response perfectly adheres to the prompt's context or role.
Tone: The response perfectly adheres to the requested tone, with virtually no errors or breaks.
CONCISENESS (Not too many words)
Definition
Prompts/responses must be written with essential and relevant details, removing unnecessary details, fluff, or pleasantries.
Requirements
Prompts/responses SHOULD:
Be to the point, concise, and answer the request in an easily digestible manner
Be conversational and natural in tone
Contain a limited level of detail and nice-to-have explanations
Be free of redundant, irrelevant information
Prompts/responses SHOULD NOT:
Be verbose, provide extraneous information, or over-explain concepts when the prompt does not request it
Include "fluff" or pleasantries (e.g., "Here's your request," "Sure, I can help with that," "Below is a blog with 100 words")
Scoring

1-2 (Terrible)
Length: The response exceeds the word limits directly requested by the prompt.
Verbosity: Multiple sentences explain concepts that are possible in a sentence or two; the response communicates the same ideas in slightly different ways several times.
Focus: There is no central theme or message to the response; there is significant irrelevant or distracting information in the response.
Tone: Inappropriate, abrupt, or otherwise unpleasant tone.

3 (Not Passing)
Length: The response is close to the length specified explicitly by the prompt.
Verbosity: The response may still have some overly verbose sentences or claims; the response communicates the same ideas, but in slightly different ways, once or twice.
Focus: There is an overarching theme, but it is not followed throughout the response; a few more nice-to-haves than needed.
Tone: Conversational and engaging tone.

4-5 (Excellent)
Length: The response fits the length specified explicitly by the prompt.
Verbosity: The response efficiently communicates its concepts and points consistently; it does not unnecessarily repeat information, implicitly or explicitly.
Focus: There is a clear theme that unites the bullets or details of the response; details are relevant or provide satisfying illustrations of the core theme.
Tone: Conversational and engaging tone.
TRUTHFULNESS
Definition
Requirements
All of the facts inside the response (such as definitions, numbers, dates, etc.) are completely accurate (please conduct online research to make sure).
In cases where the user asks the Chat Assistant to summarize or rewrite a text segment, the response does NOT make up or mention details that were not part of the original text.
If the user asks a question that assumes that a particular false claim is true, your response should assert that the false claim is NOT true, rather than leaving room for ambiguity.
For example, if the prompt is "What year did Albert Einstein invent the Internet?":
A truthful response would say "Albert Einstein did not invent the Internet."
An untruthful response to this prompt could be something like "It's not clear exactly when Albert Einstein invented the Internet."
One important exception: if the prompt explicitly asks for fictional writing, such as writing a story for children, then your response doesn't need to be truthful.
Scoring

HARMLESSNESS
Requirements
Prompts/responses SHOULD:
Be respectful, considerate, and kind
Prompts/responses SHOULD NOT:
Promote hatred, violence, marginalization, stereotypes, slurs, or slandering of any individuals or groups of people
This includes groups of people based on race, color, national origin, religion, sex, gender identity, sexual orientation, age, and disability.
Promote substance abuse or illegal activity
Contain violent or sexual content
Contain moral judgment or opinionated content
Contain PII (personally identifiable information)
If the prompt is inappropriate or is asking the Chat Assistant to be harmful in any way, your response should politely turn down the user's request and explain that a Chat Assistant is not allowed to provide any inappropriate information.
Scoring
3 (Not Passing): None given. No middle ground.
SATISFYING
Definition
A satisfying prompt or response fits like a glove. It's engaging, it's human, it's correct, and it's just right. To assess this, look for prompts and responses that fit all or most of the rubric dimensions and completely deliver the intended format and effect of the prompt.
Requirements
Well-written in the correct language
Free from spelling or grammar mistakes
Creative and a little different (not mandatory for a 5 per se, but important to not be bland)
Delivers everything the user asks for
Scoring

1-2 (Terrible): The response fails the majority of the quality rubric dimensions and needs to be rewritten.
Incorrect language: in a foreign language, or written so poorly the meaning can't be interpreted
Spelling and grammar: significant and/or distracting mistakes
Doesn't 'fit': doesn't fit the intent of the prompt

3 (Not Passing): The response fails some aspects of the rubric but could be fixed in less than 30 minutes.
Correct language: but may be a little awkward or unclear
Spelling and grammar: minor mistakes
Reasonable 'fit': doesn't fully fit the intent of the prompt

4-5 (Excellent): Meets every aspect of the quality dimensions. Perfect, or could be fixed in less than 2 minutes.
Correct language: with few or no mistakes
Spelling and grammar: one or two minor blemishes are OK
Good 'fit': fits the prompt's tone and intention
Creative: doesn't read or feel like a basic LLM response
❓FAQ
Q. Do I have to include code?
A: NO! You can use any data form (list, table). If you do write code, please make sure it is written correctly and compiles and runs without errors.
Q. How deep should the code be in problem reflection (pseudocode vs. very specific
implementation)?
A: The depth should be sufficient to demonstrate a clear understanding and a solid solution, but it can vary depending on the time available. Ensure it is well done and effectively addresses the prompt.
Q. Can I compare my response with other LLMs or chatbots?
A: DO NOT COMPARE YOUR RESPONSE WITH OTHER LLMs OR CHATBOTS. It's strictly prohibited. To get a sense of the quality of your tasks, use your own criteria and personal experience. Your task should be excellent on its own, without comparison.
Q. Is there a time limit for tasks?
A: Yes. Task attempts have a time limit of 45 min. L0 Reviewers have a limit of 30 min. L4 Reviewers have a limit of 20 min.
Q. How should review comments be written?
A: Comments should be very concise and aligned with the instructions, avoiding verbosity and irrelevant information.
Q. What is the difference between Problem Reflection and Code Generation?
A: Problem Reflection is oriented more toward seeking help to solve a problem by providing the necessary context. It requires at least two approaches to the problem, each of which should be thoroughly explained.
Code Generation focuses more on providing the solution code that addresses the requirements.
Q. Where should I ask questions?
A: For general questions that can help other contributors, post your questions in the dedicated Slack channel. If it's a specific question, please contact your QM. This way, we can keep communication focused on specific issues.
Q. In the prompt, do we describe the data? Provide 4-5 lines of example? Should it be code formatted?
A: It is up to the tasker. It is recommended to include example data, either in the prompt or in the response, for better understanding.
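For instance, a prompt might embed a short, code-formatted sample like this (the values are invented):

```csv
order_id,product,quantity,price
1001,Notebook,3,4.50
1002,Pen,10,1.20
1003,Stapler,1,7.99
```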
Q. How can I beat the SOTA model?
A: You can exploit the weak points of LLMs (code structure, adding or removing comments, latest library releases, 3D animation or graphics that other LLMs cannot produce, tricky words).