optimize positional embedding model #157

Open
2 of 11 tasks
david-thrower opened this issue Mar 26, 2025 · 0 comments
david-thrower commented Mar 26, 2025

Working from #156 ...

It appears the positional embedding addition to the embedding model was a winner. GRUs alone deteriorated performance, and LayerNorm did as well. We have shown that with these improvements, training from a cold start, we are well ahead of the GPT architecture as a pre-trained model, from all angles, on the text classification problem.
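For context, the token-plus-position embedding pattern this refers to looks roughly like the Keras sketch below. The dimensions and names are illustrative placeholders, not the exact Cerebros implementation; only the 15-dim embedding matches the run discussed here.

```python
import tensorflow as tf
from tensorflow.keras import layers

class TokenAndPositionEmbedding(layers.Layer):
    """Token embedding plus a learned positional embedding, added elementwise."""

    def __init__(self, seq_len, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=seq_len, output_dim=embed_dim)
        self.seq_len = seq_len

    def call(self, x):
        # One learned vector per position, broadcast across the batch.
        positions = tf.range(start=0, limit=self.seq_len, delta=1)
        return self.token_emb(x) + self.pos_emb(positions)

# Usage: a 15-dim embedding over a hypothetical 40k vocabulary, 128-token window.
inputs = layers.Input(shape=(128,), dtype="int32")
x = TokenAndPositionEmbedding(seq_len=128, vocab_size=40000, embed_dim=15)(inputs)
```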

  • We are training much faster, at 3 min / epoch and 45 min to train the structurally optimal model. GPT-2 takes 1 hour and 15 minutes to train for 4 epochs.
  • Us: val_binary_accuracy of 0.955 from a cold start; them: val_binary_accuracy of 0.93 from a pre-trained LLM.
  • We are much lighter weight at 88M parameters and a 15-dimension embedding. GPT-2 base-en is at 117 million params and an embedding dimensionality of 768 dims.
  • Given that they are at 15 minutes to complete an epoch and we need only 3 min, at inference we could process at least 5x as fast at the 88M scale as a transformer at the ~comparable 117M scale. This would translate to the same scale of cost savings at inference (1/5 the cost) in a production system, in addition to the train-time savings (see the back-of-envelope sketch after this list).
  • This supports devoting resources to further experimentation to scale this up and develop an LLM from this text classifier model.
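For clarity, the arithmetic behind the throughput and cost claims above; all figures are the ones cited in this issue, not independent measurements:

```python
# Back-of-envelope arithmetic for the comparison above.
ours_min_per_epoch = 3
gpt2_min_per_epoch = 15
speedup = gpt2_min_per_epoch / ours_min_per_epoch  # 5.0 -> "at least 5x"
inference_cost_fraction = 1 / speedup              # ~1/5 the cost

ours_total_min = 45   # structurally optimal model, per this issue
gpt2_total_min = 75   # "1 hour and 15 minutes ... for 4 epochs"
print(speedup, inference_cost_fraction, ours_total_min, gpt2_total_min)
```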

We are close to an optimal overall architecture. The number of nodes and levels seems to be optimal. The number of units per node is maxed out in the search range, so the true optimum may lie beyond it, but we have to consider what fits the small-scale hardware we are experimenting with (8 CPUs) ...

Next steps: (Sub issues)

  • Experiment with max_neurons_per_unit beyond 8 and see whether there is an accuracy benefit and how the 8-CPU env handles it.
  • Experiment with a larger embedding output_dim. The benefit may be marginal, as we are at ceil(VOCABULARY_SIZE ** (1/4)), the "classical rule of thumb" optimum (see the rule-of-thumb sketch after this list), but we do know that GPT-2 uses an output dim of 768 dims, and we are at 15.
  • Try dropout instead of bnorm in the Cerebros NAS on the optimal positional embedding model. If I recall correctly, this was optimal in past NLP runs.
  • Try a model with GRUs after the positional embedding, carrying state forward and concatenating with the GRU output (see the GRU sketch after this list).
  • Try a model with GRUs after the positional embedding, not carrying state forward.
  • Try a model with a skip connection around a GRU layer after the positional embedding, with GRU state carried forward.
  • Try a model with a skip connection around a GRU layer after the positional embedding, with GRU state not carried forward.
  • Do studies to determine rules of thumb, or a predictive model, that translate the size of the text corpus and other characteristics of the text into a predicted optimal architecture, to reduce the need for future conventional neural architecture search studies.
  • Add an API to hard-code an architecture, to reproduce a predicted optimal architecture (that way we can use it as a foundation for a generative LLM).
  • Update Cerebros to replace the random search over conditional features with an Optuna multivariate TPE search over the same conditional features (the number of nodes (units) in each layer (level), given the number of layers that was chosen; the number of Dense neurons in a given node (unit), given the number of nodes that was selected for that level; ...) (see the Optuna sketch after this list).
  • Integrate the optimal model into the generative model backbone in the NotGPT repo.
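The rule of thumb cited in the embedding item above, as a one-liner. VOCABULARY_SIZE here is a placeholder, since the issue does not state the exact value; any vocabulary in roughly the 38k-50k range yields the 15-dim embedding reported here:

```python
from math import ceil

VOCABULARY_SIZE = 40000  # placeholder; the exact value is not stated in this issue
output_dim = ceil(VOCABULARY_SIZE ** (1 / 4))
print(output_dim)  # -> 15, the embedding dimensionality used above
```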
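A minimal Keras sketch of the skip-connection-plus-state-carry variant from the GRU items above. The gru_with_skip helper and all layer sizes are illustrative assumptions, not existing Cerebros code; the other variants fall out of it by dropping the skip concatenation or ignoring the returned state:

```python
from tensorflow.keras import layers

EMBED_DIM = 15  # matches the 15-dim positional embedding above

def gru_with_skip(x, initial_state=None):
    """GRU over the embedded sequence with a skip connection around it.

    Returns the concatenated output and the GRU's final state, so the state
    can optionally be carried forward as the initial state of a later GRU
    (the "state carried forward" variants above).
    """
    seq, state = layers.GRU(
        EMBED_DIM, return_sequences=True, return_state=True
    )(x, initial_state=initial_state)
    out = layers.Concatenate()([x, seq])  # skip connection around the GRU
    return out, state

inputs = layers.Input(shape=(None, EMBED_DIM))  # positional-embedding output
x, state = gru_with_skip(inputs)                # first GRU block
x, _ = gru_with_skip(x, initial_state=state)    # state carried forward
```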
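A sketch of what the Optuna item above could look like. TPESampler(multivariate=True, group=True) is Optuna's sampler configuration for correlated, conditional search spaces; the ranges are placeholders, and train_and_evaluate is a hypothetical helper standing in for a Cerebros training run:

```python
import optuna

def objective(trial: optuna.Trial) -> float:
    # Conditional structure: units per level depend on the sampled level count,
    # and neurons per unit depend on the sampled unit count for that level.
    n_levels = trial.suggest_int("n_levels", 2, 5)
    architecture = []
    for level in range(n_levels):
        n_units = trial.suggest_int(f"level_{level}_units", 1, 8)
        units = [
            trial.suggest_int(f"level_{level}_unit_{u}_neurons", 1, 8)
            for u in range(n_units)
        ]
        architecture.append(units)
    val_accuracy = train_and_evaluate(architecture)  # hypothetical helper
    return val_accuracy

study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.TPESampler(multivariate=True, group=True),
)
study.optimize(objective, n_trials=50)
```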

For reference: Optimal Architecture thus far is:


Optimal network:

Level 1:

- Unit 0: 3
- Unit 1: 3
- Unit 2: 2
- Unit 3: 3

Level 2:

- Unit 0: 1
- Unit 1: 3
- Unit 2: 4
- Unit 3: 3
- Unit 4: 3

Level 3:

- Unit 0: 5
- Unit 1: 1
- Unit 2: 4
- Unit 3: 1
- Unit 4: 4
- Unit 5: 1
- Unit 6: 4
- Unit 7: 3


Level 4:

Final Dense: 1
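For the hard-coding API proposed above, the same architecture could be captured as plain data. A hypothetical encoding, not an existing Cerebros interface:

```python
# Hypothetical encoding of the optimal architecture listed above, as input for
# the proposed hard-coding API (illustrative only; no such API exists yet).
optimal_architecture = {
    "level_1": [3, 3, 2, 3],              # Dense neurons per unit
    "level_2": [1, 3, 4, 3, 3],
    "level_3": [5, 1, 4, 1, 4, 1, 4, 3],
    "level_4": [],                        # listed with no units in the issue as written
    "final_dense": 1,
}
```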