@ArthurZucker
Contributor

The note2audio model is fairly complex: it uses a T5-style encoder-decoder. During the diffusion process, conditioning can be given to the encoder in two ways: a MIDI file and the previous spectrogram. Two separate networks take care of the concatenation, and then the Spectrogram Decoder generates a spectrogram.

Finally, SoundStream is used as a vocoder to convert the mel spectrogram to raw audio. We only need the decoder part of SoundStream.
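To make the data flow above easier to follow, here is a minimal sketch in plain Python. It is not the code from this PR: every name (`encode_notes`, `encode_context`, `decode_step`, `vocode`, `D_MODEL`, `N_MELS`, `HOP`) is hypothetical, and each function is a toy stand-in with no neural network inside. It only illustrates the order of operations: encode both conditioning signals, concatenate them, run the iterative decoder, then pass the result to a decoder-only vocoder.

```python
import random

D_MODEL = 8   # hypothetical hidden size
N_MELS = 4    # hypothetical number of mel bins per frame
HOP = 3       # hypothetical number of audio samples produced per frame


def encode_notes(tokens):
    # Stand-in for the note (MIDI token) encoder: one D_MODEL vector per token.
    return [[(t % 7) / 7.0] * D_MODEL for t in tokens]


def encode_context(prev_frames):
    # Stand-in for the encoder of the previous spectrogram: one vector per frame.
    return [[sum(f) / len(f)] * D_MODEL for f in prev_frames]


def decode_step(noisy, conditioning, step, num_steps):
    # Stand-in for the spectrogram decoder: nudges every bin toward a value
    # derived from the conditioning. A real decoder would be a learned network.
    target = sum(v[0] for v in conditioning) / len(conditioning)
    weight = 1.0 / (num_steps - step)
    return [[x + weight * (target - x) for x in frame] for frame in noisy]


def vocode(mel_frames):
    # Stand-in for the SoundStream decoder: each frame becomes HOP samples.
    audio = []
    for frame in mel_frames:
        audio.extend([sum(frame) / len(frame)] * HOP)
    return audio


def generate(tokens, prev_frames, num_frames=5, num_steps=10, seed=0):
    rng = random.Random(seed)
    # Both conditioning signals are encoded separately, then concatenated
    # along the sequence axis so the decoder sees all of them.
    conditioning = encode_notes(tokens) + encode_context(prev_frames)
    # Start from Gaussian noise and refine it over num_steps iterations.
    mel = [[rng.gauss(0.0, 1.0) for _ in range(N_MELS)] for _ in range(num_frames)]
    for step in range(num_steps):
        mel = decode_step(mel, conditioning, step, num_steps)
    return conditioning, mel, vocode(mel)


conditioning, mel, audio = generate(tokens=[60, 64, 67], prev_frames=[[0.1] * N_MELS] * 2)
print(len(conditioning), len(mel), len(mel[0]), len(audio))
```

The conditioning sequence has one entry per note token plus one per previous frame, which is the concatenation step; the vocoder is only ever called on the finished spectrogram, which is why only its decoder half is needed.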

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint.

@patrickvonplaten
Contributor

Super cool!

Note that we should add the vocoder in the very last step (it'll require some TF graph/ONNX hacking).

@ArthurZucker
Contributor Author

It will require the conversion from TF's SoundStream 😅

I will focus on the T5v1.1 style encoder decoder now.

BTW tell me if the file where I am putting the model is correct or if it needs changing!

@patil-suraj
Contributor

Very cool! Let me know if you need any help with T5X and weight conversion.

@patrickvonplaten mentioned this pull request Oct 21, 2022
@github-actions
Contributor

github-actions bot commented Nov 7, 2022

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

@github-actions bot added the stale label Nov 7, 2022
@patrickvonplaten
Contributor

Closing in favor of #1044
