🚀 Feature
It is currently unclear how multilabel (multitask) datasets are handled for Scikit-learn estimators, many of which are used in MoleculeNet. For neural networks this is straightforward, since we can use a masked loss or a weights matrix, but Scikit-learn estimators require targets with no missing values. How this is handled is not documented at all.
As far as I understand, going through this example using Random Forest:
No multitask models are actually used, just as many single-task models as there are tasks. This includes models that can handle multilabel classification, like Random Forest: we cannot use them that way, since they cannot handle NaN in targets.
The dataset is split into multiple copies, one per task. Samples with missing values are removed from each copy, creating clean single-task classification datasets.
We fit as many models as there are tasks. During prediction for new samples (e.g. on the validation or test set), each model predicts its own task, and the predictions are concatenated.
To calculate metrics, we simply average the metric values calculated separately for each task.
The same set of hyperparameters is used in all models created this way. This also means that hyperparameter optimization selects the hyperparameters that perform best on average across all tasks (due to how metrics are handled).
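The steps above can be sketched in plain Python. This is only an illustration of my understanding, not DeepChem's actual implementation: `split_by_task`, `fit_per_task`, `predict_all`, `mean_task_score` and the toy `MajorityClassifier` are all hypothetical names introduced here. Any object with Scikit-learn style `fit` / `predict` methods (for example `RandomForestClassifier`) should be usable in place of the stand-in.

```python
import math
from collections import Counter


class MajorityClassifier:
    """Hypothetical stand-in for a Scikit-learn estimator (fit/predict only)."""

    def fit(self, X, y):
        # Remember the most frequent label seen during training.
        self.label_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.label_ for _ in X]


def split_by_task(X, Y):
    """Return one (X_t, y_t) pair per task, dropping rows whose label is NaN."""
    datasets = []
    for t in range(len(Y[0])):
        rows = [i for i, labels in enumerate(Y) if not math.isnan(labels[t])]
        datasets.append(([X[i] for i in rows], [int(Y[i][t]) for i in rows]))
    return datasets


def fit_per_task(make_model, X, Y):
    """Fit one independent model per task; all share the same hyperparameters."""
    return [make_model().fit(X_t, y_t) for X_t, y_t in split_by_task(X, Y)]


def predict_all(models, X):
    """Predict each task with its own model and concatenate column-wise."""
    columns = [list(model.predict(X)) for model in models]
    return [list(row) for row in zip(*columns)]


def mean_task_score(metric, Y_true, Y_pred):
    """Average a per-task metric, skipping missing labels within each task."""
    scores = []
    for t in range(len(Y_true[0])):
        pairs = [(yt[t], yp[t]) for yt, yp in zip(Y_true, Y_pred)
                 if not math.isnan(yt[t])]
        scores.append(metric([a for a, _ in pairs], [b for _, b in pairs]))
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    return sum(a == b for a, b in zip(y_true, y_pred)) / len(y_true)


# Toy data: 4 samples, 2 tasks, NaN marks a missing label.
nan = float("nan")
X = [[0.0], [1.0], [2.0], [3.0]]
Y = [[1, nan], [1, 0], [0, 0], [nan, 1]]

models = fit_per_task(MajorityClassifier, X, Y)
Y_pred = predict_all(models, X)
score = mean_task_score(accuracy, Y, Y_pred)
```

Under this sketch, comparing hyperparameter settings would mean calling `fit_per_task` with a different `make_model` factory for each setting and keeping the one with the highest `mean_task_score` on the validation set, which is why the selected hyperparameters are the best on average rather than per task.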
If this is correct, it could be added to the documentation, or at least as a tutorial in the examples.
Motivation
This is important for understanding how Scikit-learn models are trained, and on which data.
Hello, Ankit here. I am new to the DeepChem community and, looking through the issues, I found this one quite interesting. Could you assign it to me?
Also, any guidance on how to approach the issue and get familiar with the community would be appreciated.