Large Model
- Detic: open-vocabulary object detector
- Mamba: a State Space Model (SSM) variant
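Mamba builds on structured state space models. The core of any SSM is a discrete linear recurrence over the sequence; a minimal NumPy sketch of that recurrence follows (toy matrices and shapes are illustrative, not Mamba's actual selective-scan implementation):

```python
import numpy as np

def ssm_scan(A, B, C, x):
    """Run the discrete linear SSM recurrence over an input sequence.

    h_t = A @ h_{t-1} + B @ x_t   (state update)
    y_t = C @ h_t                 (output projection)
    """
    n = A.shape[0]
    h = np.zeros(n)
    ys = []
    for x_t in x:  # sequential scan over time steps
        h = A @ h + B @ np.atleast_1d(x_t)
        ys.append(C @ h)
    return np.array(ys)

# toy 2-state SSM driven by a scalar impulse input
A = np.array([[0.9, 0.0], [0.1, 0.8]])
B = np.array([[1.0], [0.0]])
C = np.array([[1.0, 1.0]])
y = ssm_scan(A, B, C, [1.0, 0.0, 0.0])
```

Because the recurrence is linear, the whole scan can also be computed as a convolution or a parallel scan, which is what makes SSMs fast at long sequence lengths.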
Infrastructure
Transformer
- Transformer: based on the self-attention mechanism (Attention Is All You Need; multi-head attention)
- Vision Transformer (ViT): variants ViT-S (small), ViT-B (base), ViT-L (large), ViT-H (huge)
- DINO / DINOv2: learn stable visual features via self-supervision
  - v1 paper: Emerging Properties in Self-Supervised Vision Transformers; code: https://github.com/facebookresearch/dino
  - v2 papers: DINOv2: Learning Robust Visual Features without Supervision and Vision Transformers Need Registers; code: https://github.com/facebookresearch/dinov2
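The self-attention mechanism from Attention Is All You Need mentioned above can be sketched in a few lines of NumPy (a single head, with the learned Q/K/V projection matrices omitted for brevity; multi-head attention runs several of these in parallel and concatenates the results):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # pairwise similarity, scaled
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                            # weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))                   # 4 tokens, embedding dim 8
out = scaled_dot_product_attention(X, X, X)   # self-attention: Q = K = V = X
```

The score matrix is (sequence length × sequence length), which is exactly why attention cost grows quadratically with input length.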
BERT (Google)
- An encoder-only model built on the Transformer Encoder. Drawback: the attention computation is expensive, growing quadratically with input length.
GPT (OpenAI)
- Generative Pre-trained Transformer, built on the Transformer Decoder
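The Transformer Decoder that GPT builds on differs from the encoder mainly in causal masking: each position may attend only to itself and earlier positions, so the model can generate text left to right. A minimal sketch (single head, no learned projections):

```python
import numpy as np

def causal_attention(X):
    """Self-attention with a causal (lower-triangular) mask, as in decoder-only models."""
    T, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    mask = np.triu(np.ones((T, T), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)          # block attention to future tokens
    scores -= scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                                # exp(-inf) = 0: future weights vanish
    w /= w.sum(axis=-1, keepdims=True)
    return w, w @ X

rng = np.random.default_rng(1)
w, out = causal_attention(rng.normal(size=(5, 4)))
```

Row t of `w` is a distribution over positions 0..t only, which is what makes autoregressive training and sampling consistent.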
Large Language Model (LLM)
Vision-Language Model (VLM)
CLIP
- CLIP (Contrastive Language-Image Pre-training): jointly models images and text via contrastive learning; code: https://github.com/openai/CLIP
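CLIP trains an image encoder and a text encoder so that matching image-text pairs have high cosine similarity in a shared embedding space. A sketch of the zero-shot classification step, with random vectors standing in for the real encoder outputs (the actual encoders come from the linked repo; the temperature value here is illustrative):

```python
import numpy as np

def clip_style_probs(image_feat, text_feats, temperature=0.07):
    """Cosine similarity between one image embedding and N text embeddings,
    turned into a probability over the N candidate labels (CLIP zero-shot style)."""
    img = image_feat / np.linalg.norm(image_feat)
    txt = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = txt @ img / temperature   # scaled cosine similarities
    logits -= logits.max()             # numerical stability
    p = np.exp(logits)
    return p / p.sum()

rng = np.random.default_rng(2)
text_feats = rng.normal(size=(3, 16))  # stand-ins for 3 encoded label prompts
image_feat = text_feats[0] + 0.1 * rng.normal(size=16)  # image close to label 0
probs = clip_style_probs(image_feat, text_feats)
```

At training time the same similarity matrix is computed over a whole batch and optimized with a symmetric cross-entropy, pulling matched pairs together and pushing mismatched pairs apart.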