Source files for Image Captioning system for the use in my Bachelor thesis: "Design and implementation of image caption generation system" (2022). The model relies on RNN network for the sentence embedding and pretrained networks (like VGG19) for image embedding.
A model for captioning images was constructed by merging the sentence and image representation and then calculating the propability of the most probable next token in the caption.
The Following merging methods were tested:
- Simple Sum (A+B)
- Multiplication (A*B)
- Weighted Sum (A + (WA)*B) Of which the weighted sum was the most proficient.
Captioning an image starts with feeding the model with image embedding and starting off with [START] token, the caption generation ends when model generates [END] token.