My research interests lie in computer vision, deep learning, and their applications, with particular interests in Video Understanding, Video Generation, and Multimodal Representation Learning.
SAM 2: Segment Anything in Images and Videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr,
Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun,
Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick,
Piotr Dollár, Christoph Feichtenhofer
arXiv, 2024
[paper] [website] [demo] [code]
We present Segment Anything Model 2 (SAM 2), a foundation model towards solving promptable visual segmentation in images and videos.
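As a rough illustration of promptable segmentation, the sketch below clicks a single foreground point and asks the model for masks. The class and checkpoint names follow the public sam2 repository, but treat them as assumptions and consult the linked code for the actual API.

```python
# Minimal sketch of point-prompted image segmentation; names follow the
# public sam2 repository but should be checked against its README.
import numpy as np
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = np.zeros((480, 640, 3), dtype=np.uint8)  # stand-in for a real RGB frame
predictor.set_image(image)

# One positive click (label 1 = foreground) as the prompt.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[320, 240]]),
    point_labels=np.array([1]),
)
print(masks.shape, scores)  # candidate masks and their predicted quality
```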
We present Math-Vision (Math-V), a curated collection of 3,040 high-quality mathematical problems with visual contexts, sourced from real math competitions. It provides a comprehensive and diverse set of challenges for evaluating the mathematical reasoning abilities of large multimodal models (LMMs).
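A benchmark like this is typically consumed as a simple accuracy loop over (image, question, answer) triples. The sketch below is hypothetical: the record fields and the query_lmm stub are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical evaluation loop; the record schema and `query_lmm` stub
# are assumptions for illustration only.
def query_lmm(image_path: str, question: str) -> str:
    """Stand-in for a call to a large multimodal model."""
    return ""  # replace with a real model call

problems = [  # illustrative stand-in for the real benchmark file
    {"image": "img/0001.png", "question": "What is the shaded area?", "answer": "12"},
]

correct = sum(
    query_lmm(p["image"], p["question"]).strip() == p["answer"].strip()
    for p in problems
)
print(f"accuracy: {correct / len(problems):.3f}")
```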
JourneyDB is a large-scale generated-image understanding dataset that contains 4.4M high-resolution generated images, each annotated with the corresponding text prompt, an image caption, and visual question answering pairs.
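Concretely, each sample pairs a generated image with its prompt, a caption, and VQA annotations. The record sketch below is an assumption based on that description, not the dataset's published schema.

```python
# Hypothetical record layout for one JourneyDB-style sample; field names
# are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class JourneyDBRecord:
    image_path: str                       # high-resolution generated image
    prompt: str                           # text prompt used for generation
    caption: str                          # descriptive image caption
    vqa: list[tuple[str, str]] = field(default_factory=list)  # (question, answer)

record = JourneyDBRecord(
    image_path="images/000001.jpg",
    prompt="a watercolor fox in a snowy forest",
    caption="A stylized fox standing among snow-covered trees.",
    vqa=[("What animal is shown?", "A fox.")],
)
```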
We propose a SpatioTemporal Adapter (ST-Adapter) for image-to-video transfer learning.
With ~20 times fewer parameters, we achieve on-par or better results compared to the full fine-tuning strategy and state-of-the-art video models.
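The core of the adapter is a bottleneck around a depthwise 3D convolution, added residually inside a frozen image backbone so that only the adapter weights are trained. Below is a minimal PyTorch sketch of that idea; the bottleneck width and kernel size are illustrative, not the paper's exact configuration.

```python
# Minimal sketch of an ST-Adapter-style module: linear down-projection,
# depthwise 3D conv over (time, height, width), GELU, linear up-projection,
# all wrapped in a residual connection. Hyperparameters are illustrative.
import torch
import torch.nn as nn

class STAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 128, kernel=(3, 1, 1)):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.dwconv = nn.Conv3d(bottleneck, bottleneck, kernel,
                                padding=tuple(k // 2 for k in kernel),
                                groups=bottleneck)  # depthwise: groups = channels
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)

    def forward(self, x, t, h, w):
        # x: (B, T*H*W, C) patch tokens from a ViT block.
        b, n, c = x.shape
        z = self.down(x)
        z = z.view(b, t, h, w, -1).permute(0, 4, 1, 2, 3)  # (B, C', T, H, W)
        z = self.dwconv(z)                                  # temporal mixing
        z = z.permute(0, 2, 3, 4, 1).reshape(b, n, -1)
        return x + self.up(self.act(z))                     # residual adapter

# Example: 8 frames of 14x14 patch tokens with ViT-B width 768.
tokens = torch.randn(2, 8 * 14 * 14, 768)
print(STAdapter(768)(tokens, t=8, h=14, w=14).shape)  # torch.Size([2, 1568, 768])
```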
We introduce EdgeViTs, a new family of light-weight ViTs that, for the first time, enable attention-based vision models to compete with the best light-weight CNNs in the tradeoff between accuracy and on-device efficiency.
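The efficiency comes from a local-global-local pattern: aggregate neighborhoods cheaply, run full self-attention only on a sparse grid of delegate tokens, then diffuse the result back. The block below is a rough, simplified sketch of that idea, not the published architecture.

```python
# Rough sketch of a local-global-local block; the layer choices (avg-pool
# subsampling, transposed-conv propagation) are simplifications.
import torch
import torch.nn as nn

class LGLBlock(nn.Module):
    def __init__(self, dim: int, r: int = 2, heads: int = 4):
        super().__init__()
        self.r = r
        self.local_agg = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.subsample = nn.AvgPool2d(r)                  # pick delegate tokens
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.propagate = nn.ConvTranspose2d(dim, dim, r, stride=r, groups=dim)

    def forward(self, x):                                 # x: (B, C, H, W)
        b, c, h, w = x.shape
        x = x + self.local_agg(x)                         # local aggregation
        d = self.subsample(x)                             # (B, C, H/r, W/r)
        seq = d.flatten(2).transpose(1, 2)                # (B, HW/r^2, C)
        seq, _ = self.attn(seq, seq, seq)                 # attention on delegates only
        d = seq.transpose(1, 2).reshape(b, c, h // self.r, w // self.r)
        return x + self.propagate(d)                      # local propagation

print(LGLBlock(64)(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```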
We propose to explicitly model the Actor-Context-Actor Relation,
which is the relation between two actors based on their interactions with the context. Notably, our method ranked first in the AVA-Kinetics action localization task of the ActivityNet Challenge 2020, outperforming other entries by a significant margin (+6.71 mAP).
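In other words, the model first forms first-order actor-context features and then reasons about relations between those features across actors. The snippet below is a conceptual simplification of that two-step idea, with standard multi-head attention standing in for the paper's relation-reasoning operator.

```python
# Conceptual sketch of actor-context-actor relation modeling; this is a
# simplification, not the paper's exact network.
import torch
import torch.nn as nn

class ActorContextActor(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.pair = nn.Linear(2 * dim, dim)  # actor + context -> first-order feature
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, actors, context):
        # actors: (B, N, C) pooled actor features; context: (B, C) scene feature.
        ctx = context.unsqueeze(1).expand_as(actors)
        first_order = self.pair(torch.cat([actors, ctx], dim=-1))
        # Higher order: relations between actors mediated by their context links.
        higher, _ = self.attn(first_order, first_order, first_order)
        return higher

out = ActorContextActor(256)(torch.randn(2, 3, 256), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 3, 256])
```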
We present a two-stage framework for video synthesis conditioned on a single semantic label map.
In the first stage, we generate the starting frame from the semantic label map. Then, we propose a flow prediction network that transforms this initial frame into a video sequence.
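The second stage is essentially flow-based warping: predict motion and resample the stage-one frame. The function below sketches that warping step with torch.nn.functional.grid_sample; the random flow is a stand-in for the network's prediction.

```python
# Sketch of warping the generated first frame with predicted optical flow.
import torch
import torch.nn.functional as F

def warp(frame: torch.Tensor, flow: torch.Tensor) -> torch.Tensor:
    """frame: (B, C, H, W); flow: (B, 2, H, W) in pixels, (dx, dy) order."""
    b, _, h, w = frame.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid = torch.stack((xs, ys), dim=0).float().unsqueeze(0) + flow
    # Normalize coordinates to [-1, 1] as grid_sample expects.
    grid[:, 0] = 2 * grid[:, 0] / (w - 1) - 1
    grid[:, 1] = 2 * grid[:, 1] / (h - 1) - 1
    return F.grid_sample(frame, grid.permute(0, 2, 3, 1), align_corners=True)

first_frame = torch.rand(1, 3, 64, 64)   # stage-one output from the label map
flow = torch.randn(1, 2, 64, 64)         # stand-in for stage-two predicted flow
print(warp(first_frame, flow).shape)     # torch.Size([1, 3, 64, 64])
```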
We present a novel Online Detection of Action Start (ODAS) task in a practical setting involving untrimmed, unconstrained videos. We propose three training methods that specifically improve the capability of ODAS models to detect action starts in a timely and accurate manner.
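At inference time the task reduces to monitoring a stream and firing the moment per-frame predictions transition from background to an action class. The loop below sketches that online setting with a stub classifier; the feature stream and classifier are illustrative assumptions.

```python
# Sketch of the online detection setting: fire on the first frame where
# the prediction switches from background (0) to an action class (>0).
import torch

def classify_frame(feat: torch.Tensor) -> int:
    """Stand-in per-frame classifier: 0 = background, >0 = action class."""
    return int(feat.sum() > 0)

prev = 0
for t, feat in enumerate(torch.randn(100, 128)):  # stand-in feature stream
    cur = classify_frame(feat)
    if prev == 0 and cur != 0:
        print(f"action start detected at frame {t}, class {cur}")
    prev = cur
```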