Junting Pan

I am a final-year Ph.D. student in the Multimedia Lab (MMLab) at the Chinese University of Hong Kong, supervised by Prof. Xiaogang Wang and Prof. Hongsheng Li. Currently, I am a research scientist intern at Meta AI (FAIR).

My research interests lie in computer vision, deep learning, and their applications, with a particular interest in video understanding, video generation, and multimodal representation learning.

Email  /  CV  /  Google Scholar  /  LinkedIn  /  Github

News
09-2023 JourneyDB is accepted to the NeurIPS 2023 Datasets and Benchmarks Track!
07-2023 We release JourneyDB, a large-scale benchmark for multimodal generative image understanding.
05-2023 Starting my internship as a Research Scientist Intern at Meta AI (FAIR).
09-2022 Our paper ST-Adapter on efficient image-to-video transfer learning is accepted to NeurIPS 2022.
Work Experience

May 2023 - Now FAIR, Meta AI, Research Scientist Intern (Multi-Modal Foundation Models).
Sep 2021 - Mar 2022 Samsung Research AI Center, Research Intern (Efficient Mobile Transformers).
Mar 2017 - Dec 2017 DVMM Lab, Columbia University, Research Assistant (Online Action Detection).
Research

JourneyDB: A Benchmark for Generative Image Understanding
J. Pan*, K. Sun*, Y. Ge, H. Li, H. Duan, X. Wu, R. Zhang, A. Zhou, Z. Qin, Y. Wang, J. Dai, Y. Qiao, H. Li
NeurIPS, 2023
arXiv, github

JourneyDB is a large-scale dataset for generated image understanding, containing 4.4M high-resolution generated images annotated with the corresponding text prompts, image captions, and visual question answering pairs.

Retrieving-to-Answer: Zero-Shot Video Question Answering with Frozen Large Language Models
J. Pan, Z. Lin, Y. Ge, X. Zhu, R. Zhang, Y. Wang, Y. Qiao, H. Li
ICCVW, 2023
arXiv, bibtex

We propose a simple yet effective Retrieving-to-Answer (R2A) framework that achieves state-of-the-art zero-shot video question answering without any training.

ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning
J. Pan, Z. Lin, X. Zhu, J. Shao, H. Li
NeurIPS, 2022
arXiv, bibtex, github

We propose a SpatioTemporal Adapter (ST-Adapter) for image-to-video transfer learning. With ~20 times fewer parameters, we achieve on-par or better results compared to the full fine-tuning strategy and state-of-the-art video models.

EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers
J. Pan, A. Bulat, F. Tan, X. Zhu, L. Dudziak, H. Li, G. Tzimiropoulos, B. Martinez
ECCV, 2022
arXiv, bibtex, github

We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs in the trade-off between accuracy and on-device efficiency.

Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization
J. Pan*, S. Chen*, J. Shao, Z. Shou, H. Li
CVPR, 2021
arXiv, bibtex, github

We propose to explicitly model the Actor-Context-Actor Relation, which is the relation between two actors based on their interactions with the context. Notably, our method ranks first in the AVA-Kinetics action localization task of the ActivityNet Challenge 2020, outperforming other entries by a significant margin (+6.71 mAP).

Video Generation from Single Semantic Label Map
J. Pan, C. Wang, X. Jia, J. Shao, L. Sheng, J. Yan, X. Wang
CVPR, 2019
arXiv, bibtex, github

We present a two-stage framework for video synthesis conditioned on a single semantic label map. In the first stage, we generate the starting frame from the semantic label map. Then, we propose a flow prediction network to transform the initial frame into a video sequence.

Online Detection of Action Start in Untrimmed, Streaming Videos
J. Pan*, Z. Shou*, J. Chan, K. Miyazawa, H. Mansour, A. Vetro, X. Giro-i-Nieto, SF. Chang
ECCV, 2018
arXiv, bibtex

We present a novel Online Detection of Action Start (ODAS) task in a practical setting involving untrimmed, unconstrained videos. We propose three training methods to improve the capability of ODAS models to detect action starts in a timely and accurate manner.

Shallow and Deep Convolutional Networks for Saliency Prediction
J. Pan*, K. McGuinness*, NE. O'Connor, E. Sayrol, X. Giro-i-Nieto
CVPR, 2016
arXiv, bibtex

We present the first end-to-end CNN-based saliency prediction network.


Website template credit: Jon Barron.