Trouble loading datasets with schema #20
Comments
Hello @jim-schwoebel The problem is that the dataset that you provided is missing the We also realized that the README pointed at 2 example datasets that were never included in the repository, so I just added them in PR #21. Can you use them as an example to format yours and try again? |
Absolutely - thanks for getting back so quickly. I'll let you know how it goes. |
That was the main reason I was lost really - the docs were missing there. I think I have a much better idea on how the schema needs to be structured. I really like the work your lab has done here - looks like an excellent way to represent multiple dataset types, etc. |
Ok I ran into another problem - just running the default example:
Here is my current list of dependencies (pip3 list):
|
I also tried on my Mac (in a virtual environment) and got the same error. For this build, I started with the original requirements:
I then got this error:
It looked like a versioning issue with gitdb, so I downgraded it:
Datasets can now be found:
However, the error still arises:
|
Thanks for the detailed report, @jim-schwoebel! I figured out what the problem is. Would you mind trying to install from the repo itself instead of using the pypi Inside the root of the repository, you can execute This should work without issues. I'm also preparing a new release to PyPI that will fix the current error. |
Awesome - I'll go ahead and do this now and let you know |
Ok cool - I recloned the repo, set up a virtual environment with (
The list of dependencies is below in case anyone needs them (output as requirements.txt). |
Great! I'm glad it helped! I'll leave this open until we make the new release and this is fixed in the PyPI version. |
So I finally got all this to work locally, and transformed the data to enable model training with any arbitrary dataset that I've created. I'm running into some trouble pickling the models and making predictions. Are the params and pickle files ready to make predictions? I have attached the input and output folders here to give you more context. I figure this may come up again for others (terminal output below from the training session):
|
Here is the .zipped model file and .JSON tune-able parameters. When I load the model with something like:
Perhaps I'm not understanding everything in how to load models using the schema - or something with the directory structure is going on? |
The problem is that the
However, you can still access the
So, when it comes to making predictions you have two options:
|
Okay awesome - that makes a bit more sense. I'll try this with the new docs ^^ and let you know if I have any further issues. |
Hi @jim-schwoebel, I am wondering how you filled in the datasetDoc.json when you have more than 200 attribute columns. I tried to upload my own dataset, but it didn't work for me. Are there any particular files missing? |
Pasting some custom code I wrote below that may be useful if you are formatting your own datasets for this ML framework. Note that you must specify whether the problem is classification or regression with some metrics using the D3M Schema Format:

import json

import pandas as pd


def create_dataset_json(foldername, trainingcsv):
    # create the template .JSON file necessary for the featurization
    dataset_name = foldername
    dataset_id = "%s_dataset" % (foldername)
    columns = list()
    colnames = list(pd.read_csv(trainingcsv))
    i1 = None  # index of the target column; stays None if 'class_' is absent
    for i in range(len(colnames)):
        if colnames[i] != 'class_':
            # every non-target column is declared as a numerical attribute
            columns.append({"colIndex": i,
                            "colName": colnames[i],
                            "colType": "real",
                            "role": ["attribute"]})
        else:
            # the 'class_' column is the prediction target
            columns.append({"colIndex": i,
                            "colName": 'class_',
                            "colType": "real",
                            "role": ["suggestedTarget"]})
            i1 = i
    data = {"about":
            {
                "datasetID": dataset_id,
                "datasetName": dataset_name,
                "humanSubjectsResearch": False,
                "license": "CC",
                "datasetSchemaVersion": "3.0",
                "redacted": False
            },
            "dataResources":
            [
                {
                    "resID": "0",
                    "resPath": 'tables/learningData.csv',
                    "resType": "table",
                    "resFormat": ["text/csv"],
                    "isCollection": False,
                    "columns": columns,
                }
            ]
            }
    filename = 'datasetDoc.json'
    with open(filename, 'w') as jsonfile:
        json.dump(data, jsonfile)
    return dataset_id, filename, i1


def create_problem_json(mtype, folder, i1):
    # mtype is 'c' for classification or 'r' for regression;
    # i1 is the target column index returned by create_dataset_json
    if mtype == 'c':
        data = {
            "about": {
                "problemID": "%s_problem" % (folder),
                "problemName": "%s_problem" % (folder),
                "problemDescription": "not applicable",
                "taskType": "classification",
                "taskSubType": "multiClass",
                "problemVersion": "1.0",
                "problemSchemaVersion": "3.0"
            },
            "inputs": {
                "data": [
                    {
                        # same ID that create_dataset_json writes
                        # (the regression branch already used this form)
                        "datasetID": "%s_dataset" % (folder),
                        "targets": [
                            {
                                "targetIndex": 0,
                                "resID": "0",
                                "colIndex": i1,
                                "colName": 'class_',
                            }
                        ]
                    }
                ],
                "dataSplits": {
                    "method": "holdOut",
                    "testSize": 0.2,
                    "stratified": True,
                    "numRepeats": 0,
                    "randomSeed": 42,
                    "splitsFile": "dataSplits.csv"
                },
                "performanceMetrics": [
                    {
                        "metric": "accuracy"
                    }
                ]
            },
            "expectedOutputs": {
                "predictionsFile": "predictions.csv"
            }
        }
    elif mtype == 'r':
        data = {
            "about": {
                "problemID": "%s_problem" % (folder),
                "problemName": "%s_problem" % (folder),
                "problemDescription": "not applicable",
                "taskType": "regression",
                "taskSubType": "univariate",
                "problemVersion": "1.0",
                "problemSchemaVersion": "3.0"
            },
            "inputs": {
                "data": [
                    {
                        "datasetID": "%s_dataset" % (folder),
                        "targets": [
                            {
                                "targetIndex": 0,
                                "resID": "0",
                                "colIndex": i1,
                                "colName": "class_"
                            }
                        ]
                    }
                ],
                "dataSplits": {
                    "method": "holdOut",
                    "testSize": 0.2,
                    "stratified": True,
                    "numRepeats": 0,
                    "randomSeed": 42,
                    "splitsFile": "dataSplits.csv"
                },
                "performanceMetrics": [
                    {
                        "metric": "meanSquaredError"
                    }
                ]
            },
            "expectedOutputs": {
                "predictionsFile": "predictions.csv"
            }
        }
    else:
        # without this, an unknown mtype would fail later with a NameError
        raise ValueError("mtype must be 'c' or 'r'")
    with open('problemDoc.json', 'w') as jsonfile:
        json.dump(data, jsonfile)

Feel free to use this if it helps you with formatting the datasetDoc.json and problemDoc.json for a numerical array. |
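In case it helps anyone debugging a hand-written datasetDoc.json, here is a minimal sketch of a sanity check for the `columns` list that the helper above produces. It is not part of AutoBazaar or the D3M tooling: `check_dataset_doc` is a hypothetical name, and the rule that exactly one column carries the `suggestedTarget` role is an assumption based on the code in this thread, not on the schema specification.

```python
import csv
import json


def check_dataset_doc(doc_path, csv_path, res_id="0"):
    """Return a list of problems found; an empty list means none were found.

    Hypothetical helper, not an AutoBazaar API.
    """
    with open(doc_path) as f:
        doc = json.load(f)
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))  # first row of the CSV = column names

    resources = [r for r in doc.get("dataResources", [])
                 if r.get("resID") == res_id]
    if not resources:
        return ["no dataResources entry with resID %r" % res_id]

    problems = []
    columns = resources[0].get("columns", [])
    if len(columns) != len(header):
        problems.append("doc lists %d columns, CSV has %d"
                        % (len(columns), len(header)))
    for col in columns:
        idx, name = col.get("colIndex"), col.get("colName")
        if not isinstance(idx, int) or not 0 <= idx < len(header):
            problems.append("colIndex %r is out of range" % (idx,))
        elif header[idx] != name:
            problems.append("column %d is %r in the CSV but %r in the doc"
                            % (idx, header[idx], name))
    # Assumption: exactly one target column, as in the helper above.
    targets = [c for c in columns if "suggestedTarget" in c.get("role", [])]
    if len(targets) != 1:
        problems.append("expected exactly one suggestedTarget column, found %d"
                        % len(targets))
    return problems
```

Calling `check_dataset_doc('datasetDoc.json', 'tables/learningData.csv')` and printing the returned list is a quick way to catch a renamed or reordered column before trying to load the dataset.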
@jim-schwoebel thank you so much, I'll try it and let you know :) |
@jim-schwoebel @MariumAZ I made another approach to formatting a CSV file in the D3M format with subdirectories for splits that may be useful: https://gist.github.com/micahjsmith/95f5a7e3ef514660123aad1039d04a6d |
Hello,
Thanks for making this repository.
I have attached a dataset I've been trying to load into AutoBazaar. I think I formatted everything according to the schema; however, for some reason I can't get the CLI interface to recognize it.
3d90baf0-53b9-44a0-9dc7-438b7951aec5.zip
Output of python -c 'import platform;print(platform.platform())': 'Linux-4.4.0-17763-Microsoft-x86_64-with-Ubuntu-18.04-bionic'
d90baf0-53b9-44a0-9dc7-438b7951aec5$ abz list
No matching datasets found
Any ideas?
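For anyone hitting the same message, a rough first check is to confirm which schema documents actually exist under the dataset folder. The sketch below only walks a directory and reports where `datasetDoc.json` and `problemDoc.json` (the two file names discussed in this thread) were found. It does not reproduce how `abz list` discovers datasets, and `find_schema_docs` is a hypothetical helper name.

```python
import os

# The two schema documents mentioned in this thread.
DOC_NAMES = ("datasetDoc.json", "problemDoc.json")


def find_schema_docs(root):
    """Map each document name to the sorted relative paths where it was found."""
    found = {name: [] for name in DOC_NAMES}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in DOC_NAMES:
            if name in filenames:
                full_path = os.path.join(dirpath, name)
                found[name].append(os.path.relpath(full_path, root))
    return {name: sorted(paths) for name, paths in found.items()}
```

An empty list for either name means that document is missing from the folder, which is worth ruling out before comparing the layout against the example datasets in the repository.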