I posted about this on the forum but didn't get any useful feedback - would love to hear from someone who knows the ins and outs of the diffusers codebase!
https://discuss.huggingface.co/t/discrepancies-between-compvis-and-diffuser-fine-tuning/25556
To summarize the post: the train_text_to_image.py script and the original CompVis repo perform very differently when fine-tuning on the same dataset with the same hyperparameters. I'm trying to reproduce the Lambda Labs Pokemon fine-tuning results and having difficulty doing so (picture results in the forum post).
I've been digging into the two implementations and haven't found any obvious differences in how the models are trained, how the losses are computed, etc. - so what explains the large behavioral discrepancy?
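For reference, this is a minimal, self-contained sketch of the noise-prediction objective that, as far as I can tell, both trainers are computing. The shapes, the 0.18215 latent scaling, and the scaled-linear beta schedule are the standard SD v1 values, and the `unet` lambda is just a stand-in so the snippet runs on its own - it's meant as a sanity check of my understanding, not a claim about either codebase. If the two repos deviate from this anywhere (loss weighting, timestep sampling, EMA, schedule endpoints), that would be a great lead.

```python
import torch
import torch.nn.functional as F

# Dummy stand-ins so the sketch runs standalone; in the real scripts these
# come from the VAE encoder, the CLIP text encoder, and the UNet.
batch, channels, size, text_dim = 4, 4, 64, 768
latents = torch.randn(batch, channels, size, size)    # VAE latents (already scaled by 0.18215)
text_emb = torch.randn(batch, 77, text_dim)           # text encoder hidden states
unet = lambda x, t, ctx: torch.randn_like(x)          # placeholder for the UNet forward pass

# DDPM-style forward diffusion with the SD v1 "scaled_linear" beta schedule.
num_train_timesteps = 1000
betas = torch.linspace(0.00085 ** 0.5, 0.012 ** 0.5, num_train_timesteps) ** 2
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

# Sample a random timestep per example and add the corresponding noise.
t = torch.randint(0, num_train_timesteps, (batch,))
noise = torch.randn_like(latents)
sqrt_ac = alphas_cumprod[t].sqrt().view(-1, 1, 1, 1)
sqrt_one_minus_ac = (1.0 - alphas_cumprod[t]).sqrt().view(-1, 1, 1, 1)
noisy_latents = sqrt_ac * latents + sqrt_one_minus_ac * noise

# Both trainers (as I read them) predict the added noise and take a plain MSE.
noise_pred = unet(noisy_latents, t, text_emb)
loss = F.mse_loss(noise_pred, noise)
print(loss.item())
```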
Would really appreciate any insight on what might be causing this.