Junting Pan
I am a final-year Ph.D. student in the Multimedia Lab (MMLab) at the Chinese University of Hong Kong, supervised by Prof. Xiaogang Wang and Prof. Hongsheng Li.
Currently, I am a research scientist intern at Meta AI (FAIR) lab.
My research interests lie in computer vision, deep learning, and their applications, with a particular focus on video understanding, video generation, and multimodal representation learning.
Email / CV / Google Scholar / LinkedIn / Github
09-2023 | JourneyDB is accepted to the NeurIPS 2023 Datasets and Benchmarks Track!
07-2023 | We release JourneyDB, a large-scale benchmark for multimodal generative image understanding.
05-2023 | Starting my internship as a research scientist intern at Meta AI (FAIR).
09-2022 | Our paper ST-Adapter on efficient image-to-video transfer learning is accepted to NeurIPS 2022.
JourneyDB: A Benchmark for Generative Image Understanding
J. Pan*, K. Sun*, Y. Ge, H. Li, H. Duan, X. Wu, R. Zhang, A. Zhou, Z. Qin, Y. Wang, J. Dai, Y. Qiao, H. Li
NeurIPS, 2023
arXiv,
github
JourneyDB is a large-scale generated-image understanding dataset that contains 4.4M high-resolution generated images, each annotated with the corresponding text prompt, an image caption, and visual question answering pairs.
EdgeViTs: Competing Light-weight CNNs on Mobile Devices with Vision Transformers
J. Pan, A. Bulat, F. Tan, X. Zhu, L. Dudziak, H. Li, G. Tzimiropoulos, B. Martinez
ECCV, 2022
arXiv,
bibtex,
github
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs in the tradeoff between accuracy and on-device efficiency.
Actor-Context-Actor Relation Network for Spatio-Temporal Action Localization
J. Pan*, S. Chen*, J. Shao, Z. Shou, H. Li
CVPR, 2021
arXiv,
bibtex,
github
We propose to explicitly model the Actor-Context-Actor Relation, the relation between two actors based on their interactions with the context. Notably, our method ranks first in the AVA-Kinetics action localization task of the ActivityNet Challenge 2020, outperforming the other entries by a significant margin (+6.71 mAP).
Video Generation from Single Semantic Label Map
J. Pan, C. Wang, X. Jia, J. Shao, L. Sheng, J. Yan, X. Wang
CVPR, 2019
arXiv,
bibtex,
github
We present a two-stage framework for video synthesis conditioned on a single semantic label map. In the first stage, we generate the starting frame from the semantic label map. We then propose a flow prediction network to transform the initial frame into a video sequence.
Online Detection of Action Start in Untrimmed, Streaming Videos
J. Pan*, Z. Shou*, J. Chan, K. Miyazawa, H. Mansour, A. Vetro, X. Giro-i-Nieto, SF. Chang
ECCV, 2018
arXiv,
bibtex
We present the novel task of Online Detection of Action Start (ODAS) in a practical setting involving untrimmed, unconstrained videos. We propose three training methods that specifically improve the capability of ODAS models to detect action starts in a timely and accurate manner.