confusion matrix metrics #285
Conversation
It first computes the confusion matrix for each metric and then sums across all classes

Args:
    ground_truth : Label containing human annotations or annotations known to be correct
I think we avoided using this name in the label import project because it didn't exist anywhere else in the application. Can you double check that we want to bake this into a public-facing method? I think maybe we can just call them `label`?
Great point, we need to come up with some consistent language. The problem here is that Label means "data and annotations", not "human-made annotations". There needs to be some distinction between human-generated and machine-generated labels. I am open to changing the name, but I already exposed this concept with the IOU metric functions for our first MEA release :). Although these are positional args, so if we want to change this we probably can. We should probably do it soon, though.
I honestly have no stake in it, I guess I just wanted to point it out. I think Gareth asked if we could change the name in label imports, so he might be a person to run it by.
I definitely don't have a problem with it, and honestly, I think ground truth label is a good name, as that's the name in the industry, so I don't know why Labelbox wouldn't go that direction.
f"Found `{type(prediction)}` and `{type(ground_truth)}`") | ||
|
||
if isinstance(prediction.value, Radio): | ||
return radio_confusion_matrix(ground_truth.value, prediction.value) |
I don't follow why we're returning the confusion matrices for just the first prediction/ground truth values. Don't we need to find an average for all of the values?
Because at this point we have grouped annotations by class, there can only be zero or one of a given classification in an image. I updated the docstring to explain this.
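For readers following the thread, here is a minimal, self-contained sketch of the behavior being described: annotations are grouped by class first, so each per-class comparison sees at most one classification, and the per-class [tp, fp, tn, fn] counts are summed into a single vector. The tuple representation and helper names below are illustrative stand-ins, not the PR's actual types.

from collections import defaultdict
from typing import Dict, List, Tuple

# Annotations are modeled here as (class_name, answer) pairs purely for illustration.
Annotation = Tuple[str, str]


def group_by_class(annotations: List[Annotation]) -> Dict[str, List[Annotation]]:
    grouped: Dict[str, List[Annotation]] = defaultdict(list)
    for class_name, answer in annotations:
        grouped[class_name].append((class_name, answer))
    return grouped


def radio_counts(ground_truths: List[Annotation],
                 predictions: List[Annotation]) -> List[int]:
    # [tp, fp, tn, fn] for a single radio classification; after grouping there
    # is at most one ground truth and one prediction per class.
    if not ground_truths and not predictions:
        return [0, 0, 0, 0]
    if ground_truths and not predictions:
        return [0, 0, 0, 1]  # missed classification -> false negative
    if predictions and not ground_truths:
        return [0, 1, 0, 0]  # extra prediction -> false positive
    matched = ground_truths[0][1] == predictions[0][1]
    return [1, 0, 0, 0] if matched else [0, 1, 0, 1]


def summed_confusion_matrix(ground_truths: List[Annotation],
                            predictions: List[Annotation]) -> List[int]:
    gt_by_class = group_by_class(ground_truths)
    pred_by_class = group_by_class(predictions)
    total = [0, 0, 0, 0]
    for class_name in set(gt_by_class) | set(pred_by_class):
        counts = radio_counts(gt_by_class.get(class_name, []),
                              pred_by_class.get(class_name, []))
        total = [t + c for t, c in zip(total, counts)]
    return total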
@@ -158,3 +157,19 @@ def _create_feature_lookup(features: List[FeatureSchema],
    for feature in features:
        grouped_features[getattr(feature, key)].append(feature)
    return grouped_features


def has_no_matching_annotations(ground_truths: List[ObjectAnnotation], |
What is a matching annotation? Wouldn't a matching prediction and label have the same schemaId? Would it make sense to check for that in this function, or is it not that simple? Would it make sense to make sure the lengths are the same?
All annotations passed to this function have the same schema ids. So this function checks if we even need to run the metric calculation code - if there is no ground truth or no prediction then there is nothing to compute.
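Based on that explanation, the check presumably amounts to something like the sketch below (a hypothetical reconstruction, not the PR's exact body): annotations are already bucketed by schema id, so the function only has to notice when one side is empty.

from typing import Any, List


def has_no_matching_annotations(ground_truths: List[Any],
                                predictions: List[Any]) -> bool:
    # One side empty while the other is not: every item is either a miss
    # (false negative) or an extra prediction (false positive), so there is
    # no pairwise matching to compute for this class.
    if ground_truths and not predictions:
        return True
    if not ground_truths and predictions:
        return True
    return False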
expected = [0, 0, 0, 0]
for expected_values in example.expected.values():
    for idx in range(4):
        expected[idx] += expected_values[idx]
Isn't this whole block just re-generating a list, `expected`, which is the same list as `example.expected.values()`? Do we need this block at all? Can we just do the following?
assert score[0].value == tuple(
    example.expected.values()), f"{example.predictions},{example.ground_truths}"
We are actually accumulating all expected values.
expected = [0, 0, 0, 0]
for expected_values in example.expected.values():
    for idx in range(4):
        expected[idx] += expected_values[idx]
assert ...  # Each of the values is added to expected before this assert statement is reached.
If the assert were inside the outer for loop, then what you are suggesting would be true.
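As a side note, if a more compact form of the same accumulation is ever wanted, an elementwise sum over the per-class vectors is equivalent to the loop above (assuming `example.expected` maps class names to `[tp, fp, tn, fn]` lists and is non-empty):

# Equivalent to the accumulation loop: sum each position across all classes.
expected = [sum(values) for values in zip(*example.expected.values())]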
def test_custom_confusion_matrix_metric():
    with open('tests/data/assets/ndjson/custom_confusion_matrix_import.json',
              'r') as file:
        data = json.load(file)

    label_list = NDJsonConverter.deserialize(data).as_list()
    reserialized = list(NDJsonConverter.serialize(label_list))
    assert json.dumps(reserialized,
                      sort_keys=True) == json.dumps(data, sort_keys=True)
How does this test `test_custom_confusion_matrix_metric`? It seems like it just deserializes and re-serializes a JSON file, then asserts that it's equal to the file data.
This tests the serialization and deserialization of the custom metric, not the actual calculation.
Got it. I might recommend naming it something about the serialization then, but that's just a nit!
label = project.create_label(data_row=data_row, label=rand_gen(str))
time.sleep(10) |
That seems like a lot of time. Is there any way to avoid this? I assume that things were not being created in time? Is there a way to "await" (for lack of a Pythonic version of the word) these function calls?
Everything is technically awaited in Python (unless you're using asyncio or some other async abstraction). The problem is the async work on the backend that can't be awaited, like items that are added to the queue to be processed.
Also, once label imports are complete, I can throw out this garbage hack for uploading labels in tests.
Ah, yeah. Dang. 10s is a while to add to a test, but if the processing takes that long, it makes sense!
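If the fixed sleep ever becomes a problem, one common alternative is to poll with a timeout so the test proceeds as soon as the backend catches up. This is a generic sketch; `label_is_visible` is a hypothetical predicate standing in for whatever signal indicates the label has been processed.

import time
from typing import Callable


def wait_until(predicate: Callable[[], bool],
               timeout: float = 30.0,
               interval: float = 1.0) -> bool:
    # Poll `predicate` until it returns True or `timeout` seconds elapse.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval)
    return False


# Hypothetical usage in place of the fixed sleep:
# assert wait_until(lambda: label_is_visible(project, label), timeout=30)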
metric_name: str
aggregation: ConfusionMatrixAggregation

def to_common(self) -> ConfusionMatrixMetric: |
What is `common`?
It is the standard format for annotations in the repo. This terminology is used elsewhere. I am not a huge fan of this naming convention but I didn't have a great alternative. The type hints should make this a little less ambiguous.
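For context, the `to_common` pattern converts a serialization-layer (NDJSON) model into the shared in-memory metric type. The sketch below only illustrates the shape of that conversion; both classes are stand-ins, and any field beyond those visible in the diff is a guess.

from dataclasses import dataclass
from typing import Tuple


@dataclass
class CommonConfusionMatrixMetric:    # stand-in for the shared ("common") type
    metric_name: str
    value: Tuple[int, int, int, int]  # (tp, fp, tn, fn)
    aggregation: str


@dataclass
class NDConfusionMatrixMetricSketch:  # stand-in for the NDJSON-layer model
    metric_name: str
    value: Tuple[int, int, int, int]
    aggregation: str

    def to_common(self) -> CommonConfusionMatrixMetric:
        # Copy fields into the common representation used by the rest of the repo.
        return CommonConfusionMatrixMetric(metric_name=self.metric_name,
                                           value=self.value,
                                           aggregation=self.aggregation)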
return [tps, fps, 0, fns]


def mask_confusion_matrix(ground_truths: List[ObjectAnnotation], |
I don't feel like I was able to review these functions at all, FYI.
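For anyone picking up the review later, one way such a metric can be computed is pixel-wise overlap of boolean masks; the sketch below is a generic illustration in the `[tps, fps, tns, fns]` layout used above (where the object-level functions return 0 for true negatives, presumably because they aren't well defined for detection), not the PR's implementation.

import numpy as np


def binary_mask_confusion_counts(ground_truth: np.ndarray,
                                 prediction: np.ndarray):
    # Pixel-wise [tp, fp, tn, fn] counts for two boolean masks of equal shape.
    gt = ground_truth.astype(bool)
    pred = prediction.astype(bool)
    tps = int(np.sum(pred & gt))    # predicted and labeled
    fps = int(np.sum(pred & ~gt))   # predicted but not labeled
    tns = int(np.sum(~pred & ~gt))  # neither predicted nor labeled
    fns = int(np.sum(~pred & gt))   # labeled but not predicted
    return [tps, fps, tns, fns]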
As badly as I want to say LGTM, I don't feel confident with my review here. I left some comments, but most aren't very useful. This PR is worth being proud of, though. It really shows an accumulation of a lot of knowledge and hard work. I am excited to see people use this; you've really owned custom metrics.