Chengyao Wang (王程钥)
I am a PhD student at the Department of Computer Science and Engineering, The Chinese University of Hong Kong (CUHK), advised by Prof. Jiaya Jia and Prof. Bei Yu.
Prior to that, I obtained my B.E. degree in Computer Science from Sun Yat-Sen University (SYSU).
I am particularly interested in building Human-like Multimodal Intelligence that can actively interact with the physical world, learn from interaction, and maintain long-term memory.
Recently, my research has mainly focused on Multi-modal Large Language Models (MLLMs); representative works include LLaMA-VID, Mini-Gemini, and MGM-Omni.
Prior to that, I also worked on visual perception and representation learning.
I am seeking Research Scientist / Member of Technical Staff opportunities in industry for Fall 2026, working on Multimodal Foundation Models and related applications (Computer Use Agents, Embodied AI), and I am open to any location. Feel free to contact me if you are interested.
Research discussions and collaborations are always welcome; feel free to set up a coffee chat.
Google Scholar  / 
GitHub  / 
X  / 
LinkedIn  / 
Email
News
- [2025-08] We release MGM-Omni, an open-source omni-modal model supporting long-form speech understanding, generation, and zero-shot voice cloning.
- [2025-08] Concerto is accepted at NeurIPS 2025, San Diego.
- [2025-06] Lyra is accepted at ICCV 2025, Hawaii.
- [2025-03] VisionZip and DreamOmni are accepted at CVPR 2025, Nashville.
- [2024-12] We release Lyra, an open-source multi-modal large language model that supports long speech comprehension, omni-modal understanding, and cross-modality efficiency.
- [2024-07] LLaMA-VID is accepted at ECCV 2024, Milan.
- [2024-03] We release Mini-Gemini, an open-source vision-language model that supports high-resolution image understanding and reasoning-driven image generation.
- [2024-02] GroupContrast is accepted at CVPR 2024, Seattle.
- [2023-11] We release LLaMA-VID, an open-source vision-language model that supports hour-long video understanding and reasoning.
Research
* indicates equal contribution
Omni-modal Large Language Models (Omni MLLMs)
|
Vision Language Models (VLMs)
Visual Perception & Representation
Selected Awards
- Gold Medal x 3, International Collegiate Programming Contest (ICPC), Regional
- Gold Medal, Chinese Collegiate Programming Contest (CCPC)
Academic Service
Reviewer / Program Committee Member
- IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
- IEEE International Conference on Computer Vision (ICCV)
- Conference on Neural Information Processing Systems (NeurIPS)
- International Conference on Learning Representations (ICLR)
- Association for the Advancement of Artificial Intelligence (AAAI)
- IEEE Winter Conference on Applications of Computer Vision (WACV)