tasty multimodal transformer papers that I liked in November 2024
[3/3]
Here I've collected papers on models that process text and image embeddings together. In all of them, the authors use a simple decoder architecture and predict the next token. They differ in how they handle images: normalizing flows, rectified flow, or just an MSE loss between the predicted and the actual next patch.
Multimodal Autoregressive Pre-training of Large Vision Encoders
by Apple
tl;dr: simple yet effective multimodal transformer.
• one simple decoder that predicts both the next image patches and the next text tokens (a minimal sketch follows the link).
• can be used for image understanding and image captioning.
• better than SOTA contrastive models (SigLIP) on multimodal image understanding.
link: https://arxiv.org/abs/2411.14402
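To make the "one decoder, two losses" idea concrete, here's a minimal sketch, not the paper's actual code: MSE on the next image patch plus cross-entropy on the next text token. All names (`decoder`, `patch_head`, `lm_head`, `text_embed`) are hypothetical, and I regress patch embeddings here for simplicity.

```python
import torch
import torch.nn.functional as F

def aimv2_style_loss(decoder, patch_embeds, text_ids, text_embed, lm_head, patch_head):
    """Joint loss: regress the next image patch, classify the next text token."""
    # patch_embeds: (B, P, D) continuous patch embeddings; text_ids: (B, T)
    seq = torch.cat([patch_embeds, text_embed(text_ids)], dim=1)  # image first, then caption
    hidden = decoder(seq)  # causal decoder, returns (B, P+T, D)

    P = patch_embeds.size(1)
    # hidden state at patch i regresses patch i+1 (MSE)
    patch_loss = F.mse_loss(patch_head(hidden[:, : P - 1]), patch_embeds[:, 1:])
    # states from the last patch onward predict the caption tokens (cross-entropy)
    logits = lm_head(hidden[:, P - 1 : -1])  # (B, T, vocab)
    token_loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), text_ids.reshape(-1))
    return patch_loss + token_loss
```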
JetFormer: An Autoregressive Generative Model of Raw Images and Text by DeepMind
tl;dr: use a normalizing flow instead of a VQ-VAE for image embeddings.
- trained from scratch to model text and raw pixels jointly
- the transformer predicts a distribution over the next image latents, so we can sample from it at inference time (sketch after the link)
- the normalizing flow is invertible, so it loses no information; potentially this approach is good for understanding and generation at the same time
link: https://arxiv.org/abs/2411.19722
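The "predict a distribution, then sample" step looks roughly like this: the decoder outputs the parameters of a Gaussian mixture over the next continuous latent. A minimal sampling sketch with hypothetical names, not the paper's code:

```python
import torch

def sample_next_latent(mix_logits, means, log_scales):
    """Sample one continuous image latent from a predicted Gaussian mixture."""
    # mix_logits: (B, K); means, log_scales: (B, K, D)
    comp = torch.distributions.Categorical(logits=mix_logits).sample()  # (B,) component ids
    idx = comp.view(-1, 1, 1).expand(-1, 1, means.size(-1))             # (B, 1, D)
    mu = means.gather(1, idx).squeeze(1)                                # (B, D) chosen mean
    sigma = log_scales.gather(1, idx).squeeze(1).exp()                  # (B, D) chosen scale
    return mu + sigma * torch.randn_like(mu)                            # Gaussian sample
```

Because the flow is invertible, sampled latents can then be mapped back to exact pixels by running the flow in reverse.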
JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation by DeepSeek
tl;dr: combine next-text-token prediction with flow matching.
• the model understands image + text prompts
• generates image embeddings from noise embeddings via rectified flow (sketch after the link)
• uses different image embeddings for understanding and for generation
• understanding: [image → caption]; generation: [prompt → image]
link: https://arxiv.org/abs/2411.07975
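A minimal sketch of the rectified-flow objective on the generation side, assuming a hypothetical `velocity_net` standing in for the text-conditioned decoder: learn a velocity field that moves noise toward data along straight lines.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(velocity_net, x_data, cond):
    """Flow-matching loss on straight noise-to-data paths."""
    # x_data: (B, N, D) clean image embeddings; cond: text conditioning
    noise = torch.randn_like(x_data)
    t = torch.rand(x_data.size(0), 1, 1, device=x_data.device)  # one timestep per sample
    x_t = (1 - t) * noise + t * x_data   # point on the straight line at time t
    target_v = x_data - noise            # constant velocity of that line
    return F.mse_loss(velocity_net(x_t, t, cond), target_v)
```

At inference you start from pure noise and integrate the learned velocity with a few Euler steps to obtain image embeddings.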
my thoughts
Check out this plot twist - like something from an action movie! All the top labs are simultaneously ditching CLIP with its contrastive learning and switching to pure autoregression. And it makes total sense - why have separate encoders for images and text when you can teach one model to do it all?
DeepMind really went for it here - they straight up put a normalizing flow into the core architecture. Meanwhile, DeepSeek took a different route - bolting rectified flow onto the autoregressive decoder, with separate image embeddings for understanding and generation. Both approaches work, and that's amazing! Apple's keeping up too - they built a super simple decoder that predicts both tokens and patches, and it beats SigLIP on multimodal understanding.
You know what's really cool? We're watching a new generation of models being born - universal, powerful, yet elegantly simple. The old CLIP+VQVAE combos will soon be history.
Please vote!!