It appears the positional embedding addition to the embedding model was a winner. GRUs alone deteriorated performance, and so did LayerNorm. We have shown that with these improvements, our model trained from a cold start is well ahead of the GPT architecture, even as a pre-trained model, from all angles on the text classification problem.
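For context on what the positional-embedding addition looks like, here is the standard Keras pattern of a learned positional embedding added element-wise to the token embedding. It is a minimal sketch, not the exact Cerebros layer; VOCABULARY_SIZE and SEQUENCE_LENGTH are placeholder values, and the embedding dimension follows the rule of thumb noted in the next steps below.

```python
import math
import tensorflow as tf
from tensorflow.keras import layers

# Placeholder values; the project's actual tokenizer settings are not shown here.
VOCABULARY_SIZE = 40_000
SEQUENCE_LENGTH = 128
EMBEDDING_DIM = math.ceil(VOCABULARY_SIZE ** 0.25)  # rule-of-thumb dim, ~15

class TokenAndPositionEmbedding(layers.Layer):
    """Token embedding plus a learned positional embedding, added element-wise."""

    def __init__(self, sequence_length, vocab_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.token_emb = layers.Embedding(input_dim=vocab_size, output_dim=embed_dim)
        self.pos_emb = layers.Embedding(input_dim=sequence_length, output_dim=embed_dim)

    def call(self, token_ids):
        seq_len = tf.shape(token_ids)[-1]
        positions = tf.range(start=0, limit=seq_len, delta=1)
        # Same embedding dim for tokens and positions, so they add element-wise.
        return self.token_emb(token_ids) + self.pos_emb(positions)

inputs = layers.Input(shape=(SEQUENCE_LENGTH,), dtype="int32")
embedded = TokenAndPositionEmbedding(SEQUENCE_LENGTH, VOCABULARY_SIZE, EMBEDDING_DIM)(inputs)
embedding_model = tf.keras.Model(inputs, embedded)
```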
We are also training much faster: 3 min / epoch, and 45 min to train the structurally optimal model. GPT2 takes 1 hour and 15 minutes to train for 4 epochs.
Us, from a cold start: val_binary_accuracy: 0.955
Them, from a pre-trained LLM: val_binary_accuracy: 0.93
We are much lighter weight at 88M parameters and a 15-dimensional embedding. GPT2 base-en is at 117 million parameters and an embedding dimensionality of 768.
Given that they need 15 minutes to complete an epoch and we need only 3, this suggests that at inference we could process at least 5x as fast at the ~88M scale as a transformer at the comparable 117M scale. That would translate to the same scale of cost savings at inference (roughly 1/5 the cost) in a production system, in addition to the train-time savings.
This supports devoting resources to further experimentation to scale this up and develop an LLM from this text classifier model.
We are close to an optimal overall architecture. The number of nodes and levels appears optimal. The number of units per node is maxed out in the search range, so this may be lower than optimal, but we have to consider what fits the small-scale hardware we are experimenting with (8 CPUs) ...
Next steps (sub-issues):
- Experiment with max_neurons_per_unit beyond 8 and see whether there is an accuracy benefit and how the 8-CPU environment handles it.
- Experiment with a larger embedding output_dim. The benefit may be marginal, as we are at ceil(VOCABULARY_SIZE ** (1/4)), the "classical rule of thumb optimum", but we do know that GPT-2 uses an output dim of 768 and we are at 15.
- Try dropout instead of bnorm in Cerebros NAS on the optimal positional embedding model. If I recall correctly, this was optimal in past NLP runs.
- Try a model with GRUs after the positional embedding, carrying state forward and concatenating with the GRU output (see the GRU sketch after this list).
- Try a model with GRUs after the positional embedding, not carrying state forward.
- Try a model with a skip connection around a GRU layer after the positional embedding, with GRU state carried forward.
- Try a model with a skip connection around a GRU layer after the positional embedding, with GRU state not carried forward.
- Do studies to determine rules of thumb, or a predictive model, that translate the size of the text corpus and other characteristics of the text into a predicted optimal architecture, reducing the need for future conventional neural architecture search studies.
- Add an API to hard-code an architecture, so we can reproduce a predicted optimal architecture (that way we can use it as a foundation for a generative LLM).
- Update Cerebros to replace the random search on conditional features with Optuna multivariate TPE search over the same conditional features (number of nodes (units) in each layer (level), given the number of layers that was chosen; number of Dense neurons in a given node (unit), given the number of nodes selected for that level; ...) (see the Optuna sketch after this list).
- Integrate the optimal model into the generative model backbone in the NotGPT repo.
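As a concrete reference for the GRU sub-issues above, here is a minimal Keras sketch of what "skip connection around a GRU after the positional embedding" could look like, under one reading of "carrying state forward" as the Keras stateful flag; the unit count and function name are illustrative assumptions, not the Cerebros implementation.

```python
from tensorflow.keras import layers

def gru_block(embedded, units=64, skip_connection=True, carry_state_forward=False):
    """Illustrative sketch only, not the Cerebros code.

    skip_connection     -- concatenate the block's input with the GRU output so the
                           positional-embedding signal bypasses the GRU.
    carry_state_forward -- interpreted here as a stateful GRU, i.e. the hidden state
                           persists across batches (requires a fixed batch size on
                           the model's Input).
    """
    gru_out = layers.GRU(units, return_sequences=True,
                         stateful=carry_state_forward)(embedded)
    if skip_connection:
        # Downstream layers see both the raw embedding and the GRU's view of it.
        return layers.Concatenate()([embedded, gru_out])
    return gru_out
```

Toggling skip_connection and carry_state_forward gives the four GRU variants listed above.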
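For the Optuna item above, a minimal sketch of a multivariate TPE study over the same conditional hierarchy (levels, then units per level, then Dense neurons per unit). Parameter names, ranges, and the build_and_train hook are illustrative assumptions, not Cerebros's actual configuration keys.

```python
import optuna

def build_and_train(architecture):
    """Hypothetical hook: build a model from the nested spec, train it, and
    return validation binary accuracy. Stubbed out in this sketch."""
    raise NotImplementedError

def objective(trial):
    # Conditional search space: the number of levels determines which unit-count
    # parameters exist, and each unit count determines which neuron-count
    # parameters exist, mirroring the hierarchy described above.
    n_levels = trial.suggest_int("n_levels", 2, 5)
    architecture = []
    for level in range(n_levels):
        n_units = trial.suggest_int(f"level_{level}_units", 1, 8)
        architecture.append([
            trial.suggest_int(f"level_{level}_unit_{unit}_neurons", 1, 8)
            for unit in range(n_units)
        ])
    return build_and_train(architecture)

# multivariate=True with group=True lets TPE model the joint, conditional structure
# of these parameters instead of sampling each one independently (as random search does).
sampler = optuna.samplers.TPESampler(multivariate=True, group=True)
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=50)
```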
For reference: Optimal Architecture thus far is:
Optimal network:
Level 1:
- Unit 0: 3
- Unit 1: 3
- Unit 2: 2
- Unit 3: 3
Level 2:
- Unit 0: 1
- Unit 1: 3
- Unit 2: 4
- Unit 3: 3
- Unit 4: 3
Level 3:
- Unit 0: 5
- Unit 1: 1
- Unit 2: 4
- Unit 3: 1
- Unit 4: 4
- Unit 5: 1
- Unit 6: 4
- Unit 7: 3
Level 4:
Final Dense: 1
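Toward the "hard-code an architecture" sub-issue, the result above could be captured as a plain nested structure and replayed through whatever builder API gets added. The encoding below is only a sketch of that idea, not an existing Cerebros interface; it simply copies the per-unit values from the listing.

```python
# Per-unit values from the search result above, grouped by level.
OPTIMAL_ARCHITECTURE = [
    [3, 3, 2, 3],              # Level 1
    [1, 3, 4, 3, 3],           # Level 2
    [5, 1, 4, 1, 4, 1, 4, 3],  # Level 3
    [1],                       # Level 4: final Dense
]
```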
Working from #156 ...