confusion matrix metrics #285
Conversation
It first computes the confusion matrix for each metric and then sums across all classes

Args:
    ground_truth : Label containing human annotations or annotations known to be correct
I think we avoided using this name in the label import project because it didn't exist anywhere else in the application. Can you double check that we want to bake this into a public-facing method? I think maybe we can just call them `label`?
Great point, we need to come up with some consistent language. The problem here is that Label means "data and annotations", not "human-made annotations". There needs to be some distinction between human-generated and machine-generated labels. I am open to changing the name, but I already exposed this concept with the IOU metric functions for our first MEA release :). Although these are positional args, so if we want to change this we probably can. We should probably do it soon, though.
I honestly have no stake in it, I guess I just wanted to point it out. I think Gareth asked if we could change the name in label imports, so he might be a person to run it by.
I definitely don't have a problem with it, and honestly, I think ground truth label is a good name, as that's the name in the industry, so I don't know why Labelbox wouldn't go that direction.
f"Found `{type(prediction)}` and `{type(ground_truth)}`") | ||
|
||
if isinstance(prediction.value, Radio): | ||
return radio_confusion_matrix(ground_truth.value, prediction.value) |
I don't follow why we're returning the confusion matrices for just the first prediction/ground truth values. Don't we need to find an average for all of the values?
Because at this point we have grouped annotations by class, there can only be zero or one of a given classification in an image. I updated the docstring to explain this.
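For readers following the thread, here is a minimal, self-contained sketch of the behavior being described: annotations are grouped by class first, so each per-class comparison sees at most one classification, and the per-class [tp, fp, tn, fn] counts are summed into a single vector. The tuple representation and helper names below are illustrative stand-ins, not the PR's actual types.

from collections import defaultdict
from typing import Dict, List, Tuple

# Annotations are modeled here as (class_name, answer) pairs purely for illustration.
Annotation = Tuple[str, str]


def group_by_class(annotations: List[Annotation]) -> Dict[str, List[Annotation]]:
    grouped: Dict[str, List[Annotation]] = defaultdict(list)
    for class_name, answer in annotations:
        grouped[class_name].append((class_name, answer))
    return grouped


def radio_counts(ground_truths: List[Annotation],
                 predictions: List[Annotation]) -> List[int]:
    # [tp, fp, tn, fn] for a single radio classification; after grouping there
    # is at most one ground truth and one prediction per class.
    if not ground_truths and not predictions:
        return [0, 0, 0, 0]
    if ground_truths and not predictions:
        return [0, 0, 0, 1]  # missed classification -> false negative
    if predictions and not ground_truths:
        return [0, 1, 0, 0]  # extra prediction -> false positive
    matched = ground_truths[0][1] == predictions[0][1]
    return [1, 0, 0, 0] if matched else [0, 1, 0, 1]


def summed_confusion_matrix(ground_truths: List[Annotation],
                            predictions: List[Annotation]) -> List[int]:
    gt_by_class = group_by_class(ground_truths)
    pred_by_class = group_by_class(predictions)
    total = [0, 0, 0, 0]
    for class_name in set(gt_by_class) | set(pred_by_class):
        counts = radio_counts(gt_by_class.get(class_name, []),
                              pred_by_class.get(class_name, []))
        total = [t + c for t, c in zip(total, counts)]
    return total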
@@ -158,3 +157,19 @@ def _create_feature_lookup(features: List[FeatureSchema],
    for feature in features:
        grouped_features[getattr(feature, key)].append(feature)
    return grouped_features


def has_no_matching_annotations(ground_truths: List[ObjectAnnotation], |
What is a matching annotation? Wouldn't a matching prediction and label have the same schemaId? Would it make sense to check for that in this function, or is it not that simple? Would it make sense to make sure the lengths are the same?
All annotations passed to this function have the same schema ids. So this function checks if we even need to run the metric calculation code - if there is no ground truth or no prediction then there is nothing to compute.
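Based on that explanation, the check presumably amounts to something like the sketch below (a hypothetical reconstruction, not the PR's exact body): annotations are already bucketed by schema id, so the function only has to notice when one side is empty.

from typing import Any, List


def has_no_matching_annotations(ground_truths: List[Any],
                                predictions: List[Any]) -> bool:
    # One side empty while the other is not: every item is either a miss
    # (false negative) or an extra prediction (false positive), so there is
    # no pairwise matching to compute for this class.
    if ground_truths and not predictions:
        return True
    if not ground_truths and predictions:
        return True
    return False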
expected = [0, 0, 0, 0]
for expected_values in example.expected.values():
    for idx in range(4):
        expected[idx] += expected_values[idx]
Isn't this whole block just re-generating a list, `expected`, which is the same list as `example.expected.values()`? Do we need this block at all? Can we just do the following?
assert score[0].value == tuple(
    example.expected.values()), f"{example.predictions},{example.ground_truths}"
We are actually accumulating all expected values.
expected = [0, 0, 0, 0]
for expected_values in example.expected.values():
    for idx in range(4):
        expected[idx] += expected_values[idx]
assert ...  # Each of the values is added to expected before this assert statement is reached.
If the assert were inside the outer for loop, then what you are suggesting would be true.
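As a side note, if a more compact form of the same accumulation is ever wanted, an elementwise sum over the per-class vectors is equivalent to the loop above (assuming `example.expected` maps class names to `[tp, fp, tn, fn]` lists and is non-empty):

# Equivalent to the accumulation loop: sum each position across all classes.
expected = [sum(values) for values in zip(*example.expected.values())]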
def test_custom_confusion_matrix_metric():
    with open('tests/data/assets/ndjson/custom_confusion_matrix_import.json',
              'r') as file:
        data = json.load(file)

    label_list = NDJsonConverter.deserialize(data).as_list()
    reserialized = list(NDJsonConverter.serialize(label_list))
    assert json.dumps(reserialized,
                      sort_keys=True) == json.dumps(data, sort_keys=True)
How does this test `test_custom_confusion_matrix_metric`? It seems like it just deserializes and re-serializes a JSON file, then asserts that it's equal to the file data.
This tests the serialization and deserialization of the custom metric, not the actual calculation.
Got it. I might recommend naming it something about the serialization then, but that's just a nit!
label = project.create_label(data_row=data_row, label=rand_gen(str))
time.sleep(10) |
That seems like a lot of time. Is there any way to avoid this? I assume that things were not being created in time? Is there a way to "await" (for lack of a Pythonic version of the word) these function calls?
Everything is technically awaited in Python (unless you're using asyncio or some other async abstraction). The problem is the async work on the backend that can't be awaited, like items that are added to the queue to be processed.
Also, once label imports are complete, I can throw out this garbage hack for uploading labels in tests.
Ah, yeah. Dang. 10s is a while to add to a test, but if the processing takes that long, it makes sense!
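If the fixed sleep ever becomes a problem, one common alternative is to poll with a timeout so the test proceeds as soon as the backend catches up. This is a generic sketch; `label_is_visible` is a hypothetical predicate standing in for whatever signal indicates the label has been processed.

import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float = 30.0,
               interval: float = 1.0) -> bool:
    # Poll `predicate` until it returns True or `timeout` seconds elapse.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Hypothetical usage in place of the fixed sleep:
# assert wait_until(lambda: label_is_visible(project, label), timeout=30)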
metric_name: str
aggregation: ConfusionMatrixAggregation

def to_common(self) -> ConfusionMatrixMetric: |
What is `common`?
It is the standard format for annotations in the repo. This terminology is used elsewhere. I am not a huge fan of this naming convention but I didn't have a great alternative. The type hints should make this a little less ambiguous.
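For context, the `to_common` pattern converts a serialization-layer (NDJSON) model into the shared in-memory metric type. The sketch below only illustrates the shape of that conversion; both classes are stand-ins, and any field beyond those visible in the diff is a guess.

from dataclasses import dataclass
from typing import Tuple


@dataclass
class CommonConfusionMatrixMetric:    # stand-in for the shared ("common") type
    metric_name: str
    value: Tuple[int, int, int, int]  # (tp, fp, tn, fn)
    aggregation: str


@dataclass
class NDConfusionMatrixMetricSketch:  # stand-in for the NDJSON-layer model
    metric_name: str
    value: Tuple[int, int, int, int]
    aggregation: str

    def to_common(self) -> CommonConfusionMatrixMetric:
        # Copy fields into the common representation used by the rest of the repo.
        return CommonConfusionMatrixMetric(metric_name=self.metric_name,
                                           value=self.value,
                                           aggregation=self.aggregation)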
return [tps, fps, 0, fns]


def mask_confusion_matrix(ground_truths: List[ObjectAnnotation], |
I don't feel like I was able to review these functions at all, FYI.
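For anyone picking up the review later, one way such a metric can be computed is pixel-wise overlap of boolean masks; the sketch below is a generic illustration in the `[tps, fps, tns, fns]` layout used above (where the object-level functions return 0 for true negatives, presumably because they aren't well defined for detection), not the PR's implementation.

import numpy as np


def binary_mask_confusion_counts(ground_truth: np.ndarray,
                                 prediction: np.ndarray):
    # Pixel-wise [tp, fp, tn, fn] counts for two boolean masks of equal shape.
    gt = ground_truth.astype(bool)
    pred = prediction.astype(bool)
    tps = int(np.sum(pred & gt))    # predicted and labeled
    fps = int(np.sum(pred & ~gt))   # predicted but not labeled
    tns = int(np.sum(~pred & ~gt))  # neither predicted nor labeled
    fns = int(np.sum(~pred & gt))   # labeled but not predicted
    return [tps, fps, tns, fns]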
As badly as I want to say LGTM, I don't feel confident with my review here. I left some comments, but most aren't very useful. This PR is worth being proud of, though. It really shows an accumulation of a lot of knowledge and hard work. I am excited to see people use this; you've really owned custom metrics.