Heuristics AI

Description
AI research updates
LLMs
Reinforcement learning
Deep learning
GANs
Stable diffusion
Transformers
NLP

Kindly join (⁠☞⁠ ⁠ಠ⁠_⁠ಠ⁠)⁠☞ @heuristics_ai

tasty multimodal transformer papers I liked in November 2024
[3/3]

Here I've collected papers on models that process text and image embeddings jointly. In all of them, the authors use a plain decoder architecture and predict the next token; they differ in how they handle images: normalizing flows, rectified flow, or just an MSE loss between the next and current image tokens.

Multimodal Autoregressive Pre-training of Large Vision Encoders
by Apple
tl;dr: a simple yet effective multimodal transformer.
• one plain decoder that predicts both the next image patches and the next text token.
• can be used for image understanding and image captioning.
• better than SOTA contrastive models (SigLIP) at multimodal image understanding.
link: https://arxiv.org/abs/2411.14402
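
How I'd sketch the objective (my own toy code, not the authors'; all sizes and names are made up): a single causal decoder, MSE for next-patch regression on the image prefix, cross-entropy for next-token prediction on the text suffix.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

D, VOCAB, PATCH_DIM = 512, 32000, 768  # assumed sizes, not from the paper

decoder = nn.TransformerEncoder(  # causal mask turns this into a decoder-only stack
    nn.TransformerEncoderLayer(d_model=D, nhead=8, batch_first=True),
    num_layers=6,
)
patch_head = nn.Linear(D, PATCH_DIM)  # regresses the next image patch
token_head = nn.Linear(D, VOCAB)      # classifies the next text token

def aimv2_style_loss(seq, patch_targets, token_targets):
    """seq: (B, T, D) embedded [image patches ... text tokens]; targets shifted by one."""
    causal = nn.Transformer.generate_square_subsequent_mask(seq.size(1))
    h = decoder(seq, mask=causal)
    n_img = patch_targets.size(1)
    l_img = F.mse_loss(patch_head(h[:, :n_img]), patch_targets)      # next-patch MSE
    l_txt = F.cross_entropy(token_head(h[:, n_img:]).flatten(0, 1),  # next-token CE
                            token_targets.flatten())
    return l_img + l_txt
```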

JetFormer: An Autoregressive Generative Model of Raw Images and Text by DeepMind
tl;dr: uses a normalizing flow instead of a VQ-VAE for image embeddings.
- trained from scratch to model text and raw pixels jointly.
- the transformer predicts a distribution over the next image latents, so we can sample from it at inference.
- the normalizing flow is invertible and loses no information, so this approach could potentially be good for understanding and generation at the same time.
link: https://arxiv.org/abs/2411.19722?s=35
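
How I picture the sampling side (hypothetical names throughout; the paper predicts a Gaussian mixture per latent, I use a single diagonal Gaussian for brevity): the decoder emits distribution parameters, you sample the next latent, and the invertible flow maps latents back to pixels.

```python
import torch
import torch.nn as nn

D, LATENT = 512, 128  # assumed sizes

density_head = nn.Linear(D, 2 * LATENT)  # decoder state -> (mu, log_var) of next latent

@torch.no_grad()
def sample_next_latent(h_last):
    """h_last: (B, D) decoder hidden state at the last position."""
    mu, log_var = density_head(h_last).chunk(2, dim=-1)
    return mu + torch.randn_like(mu) * (0.5 * log_var).exp()  # draw from N(mu, sigma^2)

# Sampling loop, with `decoder`, `embed` and the trained invertible `flow`
# assumed to exist; flow.inverse maps sampled latents back to raw pixels:
#   latents = []
#   for _ in range(num_patches):
#       h = decoder(embed(prompt_tokens, latents))[:, -1]
#       latents.append(sample_next_latent(h))
#   image = flow.inverse(torch.stack(latents, dim=1))
```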

JanusFlow: Harmonizing Autoregression and Rectified Flow for Unified Multimodal Understanding and Generation by DeepSeek

tl;dr: combines next-text-token prediction with flow matching.
• the model easily understands an image plus a text prompt.
• generates image embeddings from noise embeddings via flow matching.
• uses different image embeddings for understanding and for generation:
  understanding: [image → caption]; generation: [prompt → image]
link: https://arxiv.org/abs/2411.07975
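
The generation half is plain rectified flow as far as I can tell: the decoder predicts a velocity field conditioned on the prompt, and you integrate from noise to image latents. A minimal Euler-integration sketch (`velocity_model` is my placeholder, not DeepSeek's API):

```python
import torch

@torch.no_grad()
def generate_latents(velocity_model, prompt_ctx, shape, steps=30):
    z = torch.randn(shape)                      # z_0 ~ N(0, I), noise embeddings
    for i in range(steps):
        t = torch.full((shape[0],), i / steps)  # current time in [0, 1)
        v = velocity_model(z, t, prompt_ctx)    # predicted velocity dz/dt
        z = z + v / steps                       # Euler step toward the data end z_1
    return z                                    # image latents, ready to decode
```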

my thoughts
Check out this tech plot twist - like something out of an action movie! All the top labs are simultaneously ditching CLIP and its contrastive learning and switching to pure autoregression. And it makes total sense: why keep separate encoders for images and text when you can teach one model to do it all?

DeepMind really went for it here: they straight up put a normalizing flow right into the core architecture. Meanwhile, DeepSeek took a different route, mixing flow matching with a VQ-VAE to enhance features. Both approaches work, and that's amazing! Apple is keeping up too: they built a super simple decoder that predicts both tokens and patches, and it just works better than SigLIP.

You know what's really cool? We're watching a new generation of models being born: universal, powerful, yet elegantly simple. The old CLIP+VQ-VAE combos will soon be history.
