
Document using Scikit-learn models for multilabel datasets with missing labels #3128

Open
j-adamczyk opened this issue Nov 30, 2022 · 5 comments

Comments

@j-adamczyk

🚀 Feature

Currently it is unclear how multilabel datasets are handled by Scikit-learn estimators, many of which are used in MoleculeNet. For neural networks this is trivial, since we can use a masked loss or a weight matrix, but Scikit-learn estimators require targets with no missing values. How this is handled is currently not documented at all.
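
To make the neural network case concrete, here is a minimal NumPy sketch of a masked binary cross-entropy. It assumes missing labels are encoded as NaN; the function name `masked_bce` is hypothetical and is not a DeepChem API.

```python
import numpy as np


def masked_bce(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy over observed labels only (NaN = missing)."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1 - eps)
    # Weight matrix: 1 where the label is observed, 0 where it is missing.
    w = (~np.isnan(y_true)).astype(float)
    # Fill missing labels with a placeholder; the zero weight removes them.
    y_filled = np.nan_to_num(y_true, nan=0.0)
    loss = -(y_filled * np.log(y_prob) + (1 - y_filled) * np.log(1 - y_prob))
    # Assumes at least one observed label, otherwise this divides by zero.
    return float((w * loss).sum() / w.sum())


y_true = np.array([[1.0, np.nan], [0.0, 1.0]])
y_prob = np.array([[0.9, 0.2], [0.1, 0.8]])
print(masked_bce(y_true, y_prob))
```

The prediction at the missing position has no effect on the loss, which is exactly what a Scikit-learn estimator cannot express through its targets.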

As far as I understand, going through this example using Random Forest:

  1. No multitask models are actually used, just as many single task models as there are tasks. This includes models that can handle multilabel classification, like Random Forest, but we cannot use them that way, since they cannot handle NaN in targets.
  2. The dataset is split into multiple copies, one per task. Missing values are removed from each copy, creating clean single-task classification datasets.
  3. We fit as many models as there are tasks, and then during prediction for new samples (e.g. on validation / test set) we use each model to predict its task, and concatenate the predictions.
  4. To calculate metrics, we simply average the metric value calculated separately for each task.
  5. The same set of hyperparameters is used in all models created this way. This also means that during hyperparameter optimization we will select hyperparams such that we perform the best on average across all tasks (due to how metrics are handled).

If this is correct, this could be added to the documentation, or at least as a tutorial in examples.
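
For illustration, the five steps above could look like this in plain Scikit-learn. This is only a sketch of my understanding, not DeepChem's actual implementation. It assumes missing labels are encoded as NaN, every task is binary with both classes present after filtering, and ROC AUC is the metric. The helper names `fit_per_task`, `predict_proba_per_task` and `mean_task_auc` are hypothetical.

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score


def fit_per_task(estimator, X, Y):
    """Fit one clone of `estimator` per column of Y, dropping rows with NaN labels."""
    models = []
    for t in range(Y.shape[1]):
        observed = ~np.isnan(Y[:, t])
        model = clone(estimator)  # same hyperparameters for every task
        model.fit(X[observed], Y[observed, t].astype(int))
        models.append(model)
    return models


def predict_proba_per_task(models, X):
    """Positive-class probability per task, stacked to shape (n_samples, n_tasks)."""
    columns = []
    for model in models:
        positive = list(model.classes_).index(1)  # assumes class 1 was seen in training
        columns.append(model.predict_proba(X)[:, positive])
    return np.column_stack(columns)


def mean_task_auc(Y_true, Y_score):
    """Average ROC AUC over tasks, scoring each task on its observed labels only."""
    scores = []
    for t in range(Y_true.shape[1]):
        observed = ~np.isnan(Y_true[:, t])
        scores.append(roc_auc_score(Y_true[observed, t], Y_score[observed, t]))
    return float(np.mean(scores))


# Synthetic data: 3 binary tasks, roughly 30% of the labels missing.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.column_stack([X[:, 0] > 0, X[:, 1] > 0, X[:, 2] > 0]).astype(float)
Y[rng.random(Y.shape) < 0.3] = np.nan

base = RandomForestClassifier(n_estimators=50, random_state=0)
models = fit_per_task(base, X[:150], Y[:150])
Y_score = predict_proba_per_task(models, X[150:])
auc = mean_task_auc(Y[150:], Y_score)
print(Y_score.shape, auc)
```

Because every task gets a clone of the same estimator and the metric is averaged across tasks, any hyperparameter search wrapped around this code would pick the setting that works best on average, as described in point 5.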

Motivation

Users need this to understand how Scikit-learn models are trained and on which data.

@arunppsg
Contributor

arunppsg commented Dec 1, 2022

Yes, you are right. We should document it.

@helios2003

Hello, Ankit here. I am new to the DeepChem community and, looking through the issues, I found this one quite interesting. Could you assign it to me?
Also, is there any guide on how to approach the issue and get familiar with the community?

@divyajot5005

Hi!
I am new here and wanted to contribute. Can this issue be assigned to me?

@Bpriya42

Hey! I am new to the DeepChem community. Can this issue be assigned to me? Thanks :)

@JanaviN7

JanaviN7 commented Mar 9, 2025

Hi, is this issue still open? I want to work on documenting how to use Scikit-learn models for multi-label datasets with missing labels.


6 participants