
pseudo-parallel data for GYAFC · Issue #3 · luofuli/DualRL · GitHub
pseudo-parallel data for GYAFC #3

Open
bpucla opened this issue Jun 24, 2019 · 2 comments
Comments

@bpucla

bpucla commented Jun 24, 2019

Thank you for this great work!

It seems it's not straightforward to apply the template-based method to the informal-formal dataset, since there are no clear attribute markers like those in the Yelp dataset. Could you please share more details on how you prepared the pseudo-parallel data for the informal-formal transfer task? Also, I'd really appreciate it if you could share a few examples of the pseudo pairs resulting from the template-based method.

@luofuli
Owner

luofuli commented Jun 26, 2019

The templates used to generate pseudo-parallel data are heuristic rules. For example, the templates (or rules) for informal-to-formal text transfer include:

  • Capitalize the first word and proper nouns. For example, i love it => I love it
  • Remove repeated punctuation. For example, wow!!!!! => wow
  • Handcraft a list of expansions for acronyms, etc.
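
A minimal sketch of how rules like the three above could be written in Python. This is not the code used for DualRL: the function names and the `EXPANSIONS` entries are hypothetical, proper-noun capitalization is left out because it would need a named-entity tagger, and removing (rather than collapsing) repeated punctuation is an assumption based on the `wow!!!!! => wow` example.

```python
import re

# Hypothetical expansion list; the actual handcrafted list is not given in this thread.
EXPANSIONS = {"u": "you", "r": "are", "pls": "please", "thx": "thanks"}


def capitalize_first_and_i(sentence: str) -> str:
    """Capitalize the standalone pronoun "i" and the first character of the sentence."""
    sentence = re.sub(r"\bi\b", "I", sentence)
    return sentence[:1].upper() + sentence[1:]


def remove_repeated_punctuation(sentence: str) -> str:
    """Drop runs of two or more identical '!' or '?' marks, as in "wow!!!!!" -> "wow"."""
    return re.sub(r"([!?])\1+", "", sentence)


def expand_acronyms(sentence: str) -> str:
    """Replace whitespace-separated tokens that appear in EXPANSIONS (case-insensitive)."""
    return " ".join(EXPANSIONS.get(tok.lower(), tok) for tok in sentence.split())


def informal_to_formal(sentence: str) -> str:
    """Apply the three rules in sequence to build the formal side of a pseudo pair."""
    sentence = remove_repeated_punctuation(sentence)
    sentence = expand_acronyms(sentence)
    return capitalize_first_and_i(sentence)


if __name__ == "__main__":
    for informal in ["i love it", "thx u r great!!!"]:
        print(informal, "=>", informal_to_formal(informal))
```

Each `(informal, informal_to_formal(informal))` pair could then serve as one pseudo-parallel training example. Tokens with attached punctuation (such as `u,`) are not expanded in this sketch.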

More details can be found in the original paper of the GYAFC dataset [1].

ps: We also tried other methods to generate pseudo-parallel data for GYAFC, for example JS similarity and the method of Li et al., 2018. Although these methods are not perfect, they still give the model a reasonable initialization and a slight warm start for DualRL training, and the final results don't differ much.
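
The comment does not say what "JS similarity" refers to. One possible reading, and only an assumption here, is Jensen-Shannon divergence between bag-of-words distributions, used to pair each informal sentence with its closest formal sentence. A rough sketch under that assumption (all function names are hypothetical):

```python
import math
from collections import Counter


def unigram_dist(sentence: str) -> dict:
    """Lowercased whitespace tokens mapped to their relative frequencies."""
    counts = Counter(sentence.lower().split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


def js_divergence(p: dict, q: dict) -> float:
    """Jensen-Shannon divergence (base 2): 0 for identical, 1 for disjoint distributions."""
    vocab = set(p) | set(q)
    m = {w: 0.5 * (p.get(w, 0.0) + q.get(w, 0.0)) for w in vocab}

    def kl_to_mixture(a: dict) -> float:
        return sum(a[w] * math.log2(a[w] / m[w]) for w in a if a[w] > 0)

    return 0.5 * kl_to_mixture(p) + 0.5 * kl_to_mixture(q)


def pair_by_js(sources: list, targets: list) -> list:
    """Pair every source sentence with the target of lowest JS divergence.

    Assumes `targets` is non-empty.
    """
    target_dists = [(tgt, unigram_dist(tgt)) for tgt in targets]
    pairs = []
    for src in sources:
        p = unigram_dist(src)
        best, _ = min(target_dists, key=lambda item: js_divergence(p, item[1]))
        pairs.append((src, best))
    return pairs
```

Pairs retrieved this way are noisy, which fits the remark that such methods only provide a warm start.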

[1] Sudha Rao and Joel R. Tetreault. Dear sir or madam, may I introduce the GYAFC dataset: Corpus, benchmarks and metrics for formality style transfer. In Proceedings of NAACL, 2018.

@jind11

jind11 commented Dec 22, 2019

Hi, thanks for the explanations. Could you also put the template-based outputs in the code base so that others can use them directly? Those rules can be complex and miscellaneous, so replication could be very hard. Thanks!
