Speaker: Haiyang Xu (Alibaba - Tongyi Lab)
Time: Wednesday, March 27, 2024, 20:40 (Beijing time)
Title: The Tongyi mPLUG Multimodal Large Model Technology Stack
Speaker Bio:
Haiyang Xu is a Senior Algorithm Expert at Alibaba, where he leads the Tongyi multimodal large model mPLUG family. He has published more than 30 papers in top-tier journals and conferences including ICML, CVPR, ICCV, ACL, EMNLP, ACM MM, TOIS, IJCAI, and AAAI, and serves as an Area Chair, PC member, and reviewer for several top-tier journals and conferences. mPLUG was the first model to surpass human-level performance on the VQA leaderboard, and his work has taken first place on multiple multimodal leaderboards and received a Best Paper award. He leads and contributes to the open-source projects mPLUG, X-PLUG, AliceMind, and DELTA.
Homepage:
https://github.com/orgs/X-PLUG/repositories
Abstract:
OpenAI's GPT-4V and Google's Gemini have both demonstrated very strong multimodal understanding, driving the rapid development of multimodal large language models (MLLMs) and making MLLMs one of the hottest research directions in the field today. The multimodal large model mPLUG couples visual representations with large language models in a modularized fashion, endowing LLMs with multimodal capabilities. Its technology stack includes the multimodal foundation models mPLUG and mPLUG-2, the multimodal dialogue models mPLUG-Owl and mPLUG-Owl2, the multimodal document models mPLUG-DocOwl and mPLUG-PaperOwl, the multimodal agent Mobile-Agent, and the multimodal video models Youku-mPLUG and HiTeA. mPLUG was the first model to surpass human-level performance on the VQA leaderboard and has taken first place on multiple multimodal leaderboards and received a Best Paper award.
GitHub: https://github.com/X-PLUG
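To make the "modularized" coupling concrete, the following is a minimal PyTorch sketch of one common bridge design: a learned set of query tokens cross-attends to frozen visual-encoder features and projects them into the LLM's embedding space, where they are concatenated with text embeddings. The module and parameter names (VisualAbstractor, num_queries, the dimensions) are illustrative assumptions, not the actual mPLUG-Owl implementation.

    import torch
    import torch.nn as nn

    class VisualAbstractor(nn.Module):
        """Compress a variable number of visual tokens into a fixed set of
        query tokens via cross-attention, projected into the LLM embedding
        space. Hypothetical stand-in for a modular visual-to-LLM bridge."""
        def __init__(self, vis_dim, llm_dim, num_queries=64, num_heads=8):
            super().__init__()
            self.queries = nn.Parameter(torch.randn(num_queries, llm_dim) * 0.02)
            self.proj = nn.Linear(vis_dim, llm_dim)  # align encoder dim to LLM dim
            self.attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)

        def forward(self, vis_feats):
            # vis_feats: (batch, num_patches, vis_dim) from a frozen visual encoder
            kv = self.proj(vis_feats)
            q = self.queries.unsqueeze(0).expand(vis_feats.size(0), -1, -1)
            out, _ = self.attn(q, kv, kv)  # (batch, num_queries, llm_dim)
            return out

    # Toy shape check: abstracted visual tokens + text embeddings -> LLM input.
    batch, num_patches, vis_dim, llm_dim = 2, 256, 1024, 4096
    vis_feats = torch.randn(batch, num_patches, vis_dim)  # e.g. ViT patch features
    text_emb = torch.randn(batch, 32, llm_dim)            # embedded text prompt
    abstractor = VisualAbstractor(vis_dim, llm_dim)
    llm_input = torch.cat([abstractor(vis_feats), text_emb], dim=1)
    print(llm_input.shape)  # torch.Size([2, 96, 4096]), fed to the LLM decoder

One appeal of this modular design is that the visual encoder and the LLM can each stay frozen while only the small bridge is trained, keeping multimodal alignment cheap relative to full fine-tuning.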
References:
[1] mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections. EMNLP 2022.
[2] mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video. ICML 2023.
[3] mPLUG-Owl: Modularization Empowers Large Language Models with Multimodality.
[4] mPLUG-Owl2: Revolutionizing Multi-modal Large Language Model with Modality Collaboration. CVPR 2024.
[5] mPLUG-DocOwl: Modularized Multimodal Large Language Model for Document Understanding.
[6] mPLUG-PaperOwl: Scientific Diagram Analysis with the Multimodal Large Language Model.
[7] Mobile-Agent: Autonomous Multi-modal Mobile Device Agent with Visual Perception.