It appears the positional embedding addition to the embedding model was a winner. GRUs alone deteriorated performance, and so did LayerNorm. We have shown that with these improvements, our model trained from a cold start is well ahead of the GPT architecture, even as a pre-trained model, from all angles on the text classification problem.
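For context on what the positional-embedding addition looks like, here is the standard Keras pattern of a learned positional embedding added element-wise to the token embedding. It is a minimal sketch, not the exact Cerebros layer; VOCABULARY_SIZE and SEQUENCE_LENGTH are placeholder values, and the embedding dimension follows the rule of thumb noted in the next steps below.

```python
import math
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder values; the project's actual tokenizer settings are not shown here.
VOCABULARY_SIZE = 40_000
SEQUENCE_LENGTH = 128
EMBEDDING_DIM = math.ceil(VOCABULARY_SIZE ** 0.25)  # rule-of-thumb dim, ~15

class TokenAndPositionEmbedding(layers.Layer):
    """Token embedding plus a learned positional embedding, added element-wise."""

    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=sequence_length, output_dim=embed_dim)

    def call(self, token_ids):
        seq_len = tf.shape(token_ids)[-1]
        positions = tf.range(start=0, limit=seq_len, delta=1)
        # Same embedding dim for tokens and positions, so they add element-wise.
        return self.token_emb(token_ids) + self.pos_emb(positions)

inputs = layers.Input(shape=(SEQUENCE_LENGTH,), dtype="int32")
embedded = TokenAndPositionEmbedding(SEQUENCE_LENGTH, VOCABULARY_SIZE, EMBEDDING_DIM)(inputs)
embedding_model = tf.keras.Model(inputs, embedded)
```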
We are also training much faster: 3 min / epoch, and 45 min to train the structurally optimal model. GPT2 takes 1 hour and 15 minutes to train for 4 epochs.
Us, from a cold start: val_binary_accuracy: 0.955
Them, from a pre-trained LLM: val_binary_accuracy: 0.93
We are much lighter weight at 88M parameters and a 15-dimensional embedding. GPT2 base-en is at 117 million parameters and an embedding dimensionality of 768.
Given that they need 15 minutes to complete an epoch and we need only 3, this suggests that at inference we could process at least 5x as fast at the ~88M scale as a transformer at the comparable 117M scale. That would translate to the same scale of cost savings at inference (roughly 1/5 the cost) in a production system, in addition to the train-time savings.
This supports devoting resources to further experimentation to scale this up and develop an LLM from this text classifier model.
We are close to an optimal overall architecture. The number of nodes and levels appears optimal. The number of units per node is maxed out in the search range, so this may be lower than optimal, but we have to consider what fits the small-scale hardware we are experimenting with (8 CPUs) ...
Next steps (sub-issues):
- Experiment with max_neurons_per_unit beyond 8 and see whether there is an accuracy benefit and how the 8-CPU environment handles it.
- Experiment with a larger embedding output_dim. The benefit may be marginal, as we are at ceil(VOCABULARY_SIZE ** (1/4)), the "classical rule of thumb optimum", but we do know that GPT-2 uses an output dim of 768 and we are at 15.
- Try dropout instead of bnorm in Cerebros NAS on the optimal positional embedding model. If I recall correctly, this was optimal in past NLP runs.
- Try a model with GRUs after the positional embedding, carrying state forward and concatenating with the GRU output (see the GRU sketch after this list).
- Try a model with GRUs after the positional embedding, not carrying state forward.
- Try a model with a skip connection around a GRU layer after the positional embedding, with GRU state carried forward.
- Try a model with a skip connection around a GRU layer after the positional embedding, with GRU state not carried forward.
- Do studies to determine rules of thumb, or a predictive model, that translate the size of the text corpus and other characteristics of the text into a predicted optimal architecture, reducing the need for future conventional neural architecture search studies.
- Add an API to hard-code an architecture, so we can reproduce a predicted optimal architecture (that way we can use it as a foundation for a generative LLM).
- Update Cerebros to replace the random search on conditional features with Optuna multivariate TPE search over the same conditional features (number of nodes (units) in each layer (level), given the number of layers that was chosen; number of Dense neurons in a given node (unit), given the number of nodes selected for that level; ...) (see the Optuna sketch after this list).
- Integrate the optimal model into the generative model backbone in the NotGPT repo.
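As a concrete reference for the GRU sub-issues above, here is a minimal Keras sketch of what "skip connection around a GRU after the positional embedding" could look like, under one reading of "carrying state forward" as the Keras stateful flag; the unit count and function name are illustrative assumptions, not the Cerebros implementation.

```python
from tensorflow.keras import layers

def gru_block(embedded, units=64, skip_connection=True, carry_state_forward=False):
    """Illustrative sketch only, not the Cerebros code.

    skip_connection     -- concatenate the block's input with the GRU output so the
                           positional-embedding signal bypasses the GRU.
    carry_state_forward -- interpreted here as a stateful GRU, i.e. the hidden state
                           persists across batches (requires a fixed batch size on
                           the model's Input).
    """
    gru_out = layers.GRU(units, return_sequences=True,
                         stateful=carry_state_forward)(embedded)
    if skip_connection:
        # Downstream layers see both the raw embedding and the GRU's view of it.
        return layers.Concatenate()([embedded, gru_out])
    return gru_out
```

Toggling skip_connection and carry_state_forward gives the four GRU variants listed above.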
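For the Optuna item above, a minimal sketch of a multivariate TPE study over the same conditional hierarchy (levels, then units per level, then Dense neurons per unit). Parameter names, ranges, and the build_and_train hook are illustrative assumptions, not Cerebros's actual configuration keys.

```python
import optuna

def build_and_train(architecture):
    """Hypothetical hook: build a model from the nested spec, train it, and
    return validation binary accuracy. Stubbed out in this sketch."""
    raise NotImplementedError

def objective(trial):
    # Conditional search space: the number of levels determines which unit-count
    # parameters exist, and each unit count determines which neuron-count
    # parameters exist, mirroring the hierarchy described above.
    n_levels = trial.suggest_int("n_levels", 2, 5)
    architecture = []
    for level in range(n_levels):
        n_units = trial.suggest_int(f"level_{level}_units", 1, 8)
        architecture.append([
            trial.suggest_int(f"level_{level}_unit_{unit}_neurons", 1, 8)
            for unit in range(n_units)
        ])
    return build_and_train(architecture)

# multivariate=True with group=True lets TPE model the joint, conditional structure
# of these parameters instead of sampling each one independently (as random search does).
sampler = optuna.samplers.TPESampler(multivariate=True, group=True)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=50)
```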
For reference: Optimal Architecture thus far is:
Optimal network:
Level 1:
- Unit 0: 3
- Unit 1: 3
- Unit 2: 2
- Unit 3: 3
Level 2:
- Unit 0: 1
- Unit 1: 3
- Unit 2: 4
- Unit 3: 3
- Unit 4: 3
Level 3:
- Unit 0: 5
- Unit 1: 1
- Unit 2: 4
- Unit 3: 1
- Unit 4: 4
- Unit 5: 1
- Unit 6: 4
- Unit 7: 3
Level 4:
Final Dense: 1
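Toward the "hard-code an architecture" sub-issue, the result above could be captured as a plain nested structure and replayed through whatever builder API gets added. The encoding below is only a sketch of that idea, not an existing Cerebros interface; it simply copies the per-unit values from the listing.

```python
# Per-unit values from the search result above, grouped by level.
OPTIMAL_ARCHITECTURE = [
    [3, 3, 2, 3],              # Level 1
    [1, 3, 4, 3, 3],           # Level 2
    [5, 1, 4, 1, 4, 1, 4, 3],  # Level 3
    [1],                       # Level 4: final Dense
]
```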
Working from #156 ...