Question about the training pipeline of R_Transformer #98

Open
1311894932 opened this issue Dec 29, 2024 · 3 comments
1311894932 commented Dec 29, 2024

Hello dear authors, thanks for sharing the code of MoMask! You're so great!!! But I have a question...

In the paper, it says: "All the tokens in the preceding layers are summed as the token embeddings... the residual transformer is trained to predict the j-th layer tokens... We also share the parameters of the j-th prediction layer and the (j + 1)-th motion token embedding layer."
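
Just to make sure I read that sentence right, here is my toy understanding of "summed as the token embeddings" (all names, shapes, and values below are mine for illustration, not the repo's actual code):

```python
import torch

# My toy reading of "all the tokens in the preceding layers are summed as the token embeddings".
B, T, D, V = 64, 49, 512, 512        # batch, sequence length, code dim, codebook size (all made up)
j = 3                                # the layer whose tokens the residual transformer should predict

token_embed = torch.randn(j, V, D)       # one embedding table per preceding layer (toy values)
codes = torch.randint(0, V, (j, B, T))   # ground-truth code indices for layers 0 .. j-1

# Sum the embeddings of all preceding layers -> the "history_sum" fed to the transformer,
# which is then asked to predict the layer-j code indices.
history_sum = sum(token_embed[k][codes[k]] for k in range(j))    # (B, T, D)
```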

In the code of class ResidualTransformer, in the forward function:
logits = self.trans_forward(history_sum, active_q_layers, cond_vector, ~non_pad_mask, force_mask) # 64,49,512
logits = self.output_project(logits, active_q_layers-1)
I can't understand this: it looks like we have already got the j-th prediction (the first logits) through the sum of the (0, ..., j-1)-th layers, so what does self.output_project mean?

I know we use a new (motion_idx --> motion embedding) mapping in the function process_embed_proj_weight:
self.output_proj_weight = torch.cat([self.embed_proj_shared_weight, self.output_proj_weight_], dim=0)
self.token_embed_weight = torch.cat([self.token_embed_weight_, self.embed_proj_shared_weight], dim=0)
and in self.output_project:
output_proj_weight = self.output_proj_weight[qids]
output = torch.einsum('bnc, bcs->bns', output_proj_weight, logits)
I know we use the new mapping to look up the "qids", but what does the second line mean?
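
To make my question concrete, here is a toy reproduction of the shapes involved (the numbers come from the comments in the snippets above; the tensors are random, and none of this is the repo's actual code):

```python
import torch

B, D, T, V = 64, 512, 49, 513       # batch, latent dim, sequence length, vocab size (from the comments)
Q = 6                               # total number of quantizer layers (my guess at the config)

features = torch.randn(B, D, T)                     # what reaches the einsum, per the 64,512,49 comment
                                                    # (so presumably permuted from the (B, T, D) output)
output_proj_weight = torch.randn(Q - 1, V, D)       # one (V, D) projection matrix per residual layer
qids = torch.randint(0, Q - 1, (B,))                # which residual layer each sample predicts

W = output_proj_weight[qids]                        # (B, V, D): the layer-specific matrix per sample
logits = torch.einsum('bnc,bcs->bns', W, features)  # (64,513,512) x (64,512,49) -> (64,513,49)

# For every time step, the 512-d feature is dotted against each of the 513 codebook rows,
# giving unnormalized scores over the vocabulary of that quantizer layer.
print(logits.shape)                                 # torch.Size([64, 513, 49])
```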

Any response will be appreciated!!!

@EricGuo5513
Owner

EricGuo5513 commented Dec 29, 2024 via email

@1311894932
Author

Thank you Dr. Guo, thanks for your reply!!!!

@1311894932
Author

1311894932 commented Dec 30, 2024

Hi, the logits returned by self.trans_forward are not in the dimension of (B, L, Vocab_size); self.output_project is a linear layer which projects them into the discrete distribution space.

Thanks again and Happy New Year! I finally understand the projection from your reply and the T2M-GPT pipeline. But I still cannot understand why we can share part of the embedding parameters:
self.output_proj_weight = torch.cat([self.embed_proj_shared_weight, self.output_proj_weight_], dim=0)
self.token_embed_weight = torch.cat([self.token_embed_weight_, self.embed_proj_shared_weight], dim=0)
In T2M-GPT, it uses two independent parameters:
self.tok_emb = nn.Embedding(num_vq + 2, embed_dim)
self.head = nn.Linear(embed_dim, num_vq + 1, bias=False)
My feeling is that maybe it is because they both represent the same motion? So there are 6 layers in total, and layers 2-5 are shared? But why can we use that representation to do:
output = torch.einsum('bnc, bcs->bns', output_proj_weight, logits) # 64,513,512 * 64,512,49 --> 64,513,49
I'm confused...
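
To make my confusion concrete, here is a toy version of the two cat lines (the sizes and layer counts are my guess; only the indexing matters):

```python
import torch

Q, V, D = 6, 513, 512                                   # quantizer layers, vocab size, code dim (guessed)

token_embed_weight_      = torch.randn(1, V, D)         # embedding used only by the first layer
output_proj_weight_      = torch.randn(1, V, D)         # projection used only by the last layer
embed_proj_shared_weight = torch.randn(Q - 2, V, D)     # the block shared by everything in between

# The two cat lines quoted above:
output_proj_weight = torch.cat([embed_proj_shared_weight, output_proj_weight_], dim=0)  # (Q-1, V, D)
token_embed_weight = torch.cat([token_embed_weight_, embed_proj_shared_weight], dim=0)  # (Q-1, V, D)

# Because of the ordering of the two cats, output_proj_weight[j] is exactly token_embed_weight[j + 1]:
for j in range(Q - 2):
    assert torch.equal(output_proj_weight[j], token_embed_weight[j + 1])
```

If I read this right, that identity is what the paper means by sharing the j-th prediction layer with the (j + 1)-th token embedding layer, and since the einsum just scores features against codebook rows, the output head for layer j and the input embedding for layer j + 1 seem to live in the same codebook space.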
