Document using Scikit-learn models for multilabel datasets with missing labels #3128

j-adamczyk · 2022-11-30T21:23:51Z

🚀 Feature

Currently, it is unclear how multilabel datasets with missing labels are handled by Scikit-learn estimators, many of which are used in MoleculeNet. For neural networks this is trivial, since we can use a masked loss or a weight matrix, but Scikit-learn estimators require targets with no missing values. How this is handled is currently not documented at all.
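To make the neural-network side concrete, here is a minimal sketch of a masked binary cross-entropy in plain Python. It assumes missing labels are encoded as NaN, which is an assumption for illustration only; the function name `masked_bce` is hypothetical and is not part of DeepChem or Scikit-learn.

```python
import math


def masked_bce(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over observed labels only.

    y_true: rows of 0/1 labels, with NaN marking a missing label (assumption).
    y_prob: rows of predicted probabilities, same shape as y_true.
    """
    total, count = 0.0, 0
    for row_true, row_prob in zip(y_true, y_prob):
        for t, p in zip(row_true, row_prob):
            if math.isnan(t):
                continue  # missing label: contributes nothing to the loss
            p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
            count += 1
    return total / count if count else 0.0


nan = float("nan")
y_true = [[1, nan], [0, 1]]
y_prob = [[0.9, 0.2], [0.1, 0.8]]
loss = masked_bce(y_true, y_prob)
```

The prediction paired with the NaN label (0.2 here) has no effect on `loss`, which is the property that a plain Scikit-learn `fit(X, y)` call cannot provide.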

As far as I understand, going through this example using Random Forest:

1. No multitask models are actually used, just as many single-task models as there are tasks. This includes models that can natively handle multilabel classification, like Random Forest; we cannot use them that way, since they cannot handle NaN in targets.
2. The dataset is split into multiple copies, one per task. Samples with a missing label are removed from each copy, creating clean single-task classification datasets.
3. We fit as many models as there are tasks. During prediction on new samples (e.g. on the validation or test set), each model predicts its own task, and the predictions are concatenated.
4. Metrics are calculated separately for each task and then simply averaged.
5. The same set of hyperparameters is used in all models created this way. This also means that hyperparameter optimization selects the hyperparameters that perform best on average across all tasks (due to how metrics are handled).
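The steps above can be sketched in a few lines of plain Python. This is an illustration of the behaviour as described in this issue, not DeepChem's actual implementation. It assumes missing labels are encoded as NaN and that metrics skip missing labels; all names here (`fit_per_task`, `predict_per_task`, `mean_task_metric`, `MajorityClassifier`) are hypothetical. `MajorityClassifier` is only a stand-in with a Scikit-learn-style `fit`/`predict` interface so the sketch runs without extra dependencies.

```python
import math


class MajorityClassifier:
    """Hypothetical stand-in estimator: always predicts the most common label."""

    def fit(self, X, y):
        counts = {}
        for label in y:
            counts[label] = counts.get(label, 0) + 1
        self.label_ = max(counts, key=counts.get)
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def fit_per_task(X, Y, make_model):
    """Fit one single-task model per column of Y, dropping rows with NaN labels."""
    models = []
    for t in range(len(Y[0])):
        keep = [i for i, row in enumerate(Y) if not math.isnan(row[t])]
        X_t = [X[i] for i in keep]
        y_t = [int(Y[i][t]) for i in keep]
        # The same factory is used for every task, so hyperparameters are shared.
        models.append(make_model().fit(X_t, y_t))
    return models


def predict_per_task(models, X):
    """Predict each task with its own model and concatenate column-wise."""
    per_task = [list(model.predict(X)) for model in models]
    return [list(row) for row in zip(*per_task)]


def mean_task_metric(Y_true, Y_pred, metric):
    """Average a single-task metric over tasks, skipping missing labels (assumption)."""
    scores = []
    for t in range(len(Y_true[0])):
        keep = [i for i, row in enumerate(Y_true) if not math.isnan(row[t])]
        scores.append(metric([Y_true[i][t] for i in keep],
                             [Y_pred[i][t] for i in keep]))
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    return sum(int(a == b) for a, b in zip(y_true, y_pred)) / len(y_true)


nan = float("nan")
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[1, nan], [1, 0], [0, 0], [nan, 1]]  # two tasks, one missing label each

models = fit_per_task(X, Y, MajorityClassifier)
Y_pred = predict_per_task(models, X)
score = mean_task_metric(Y, Y_pred, accuracy)
```

Because the helpers rely only on `fit` and `predict`, passing a Scikit-learn factory such as `lambda: RandomForestClassifier(n_estimators=100)` as `make_model` should work the same way. A task with no observed labels at all would need separate handling, which this sketch omits.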

If this is correct, this could be added to the documentation, or at least as a tutorial in examples.

Motivation

This is important for understanding how Scikit-learn models are trained and on which data.

arunppsg · 2022-12-01T12:50:28Z

Yes, you are right. We should document it.

helios2003 · 2023-02-14T08:24:43Z

Hello, Ankit here. I am new to the DeepChem community, and while looking through the issues I found this one quite interesting. Can you assign it to me?
Also, is there any guide on how to approach the issue and get familiar with the community?

divyajot5005 · 2024-10-14T11:59:13Z

Hi!
I am new here and wanted to contribute. Can this issue be assigned to me?

Bpriya42 · 2025-02-18T21:04:46Z

Hey! I am new to the DeepChem community. Can this issue be assigned to me? Thanks :)

JanaviN7 · 2025-03-09T09:37:18Z

Hi, is this issue still open? I want to work on documenting how to use Scikit-learn models for multi-label datasets with missing labels.

arunppsg added the "Good First Contribution" and "Contribution Welcome" labels · Dec 1, 2022
