Projects about 3D Perception
I develop computation- and data-efficient temporal-3D representations.
NeRF-Det: Learning Geometry-Aware Volumetric Representations for Multi-View Indoor 3D Object Detection
The paper was featured among the top 5 ICCV papers by Meta AI! [Link] 🎉
CV4Metaverse workshop (oral🎉) at ICCV 2023.
Does NeRF only work for 3D reconstruction? 🤔️
🙋♂️ NeRF-Det makes novel use of NeRF to build geometry-aware volumetric representations for 3D detection, achieving large improvements while eliminating the heavy overhead of per-scene optimization.
Quadric Representations for LiDAR Odometry, Mapping, and Localization
How can we efficiently represent a point-cloud scene with thousands of points? 🤔️
🙋♂️ You only need a handful of quadrics. We propose quadric representations to describe complex point-cloud scenes in LiDAR odometry, mapping, and localization. This sparse representation enables better odometry accuracy, 3x faster mapping, and 1000x less localization storage.
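To illustrate the idea (this is a simplified sketch, not the paper's exact formulation), a quadric patch z = ax² + by² + cxy + dx + ey + f can be fit to a cluster of points by least squares, compressing many points into just six coefficients; the function name is hypothetical:

```python
import numpy as np

def fit_quadric(points):
    """Least-squares fit of a quadric patch z = ax^2 + by^2 + cxy + dx + ey + f
    to an (N, 3) array of points; returns the six coefficients [a, b, c, d, e, f]."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    A = np.stack([x**2, y**2, x * y, x, y, np.ones_like(x)], axis=1)
    coeffs, *_ = np.linalg.lstsq(A, z, rcond=None)
    return coeffs

# A synthetic planar patch (a degenerate quadric) is recovered from 100 points.
rng = np.random.default_rng(0)
xy = rng.uniform(-1, 1, size=(100, 2))
z = 0.5 * xy[:, 0] - 0.2 * xy[:, 1] + 1.0
pts = np.column_stack([xy, z])
coeffs = fit_quadric(pts)  # quadratic terms ~0, linear terms 0.5, -0.2, 1.0
```

One hundred 3D points collapse into six numbers, which is the flavor of compression the sparse representation exploits.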
Open-Vocabulary Point-Cloud Object Detection without 3D Annotation
Can point-cloud detectors be trained without 3D labels? 🤔️
🙋♂️ The image domain has shown great generalizability through 2D foundation models. We address open-vocabulary 3D point-cloud detection by leveraging 2D foundation models such as CLIP.
3DiffTection: 3D Object Detection with Geometry-aware Diffusion Features
Chenfeng Xu, Huan Ling, Sanja Fidler, Or Litany. [Project page]
Can the Stable Diffusion model work for 3D detection? 🤔️
🙋♂️ Hmm, maybe. But it is hard because Stable Diffusion lacks 3D awareness. We incorporate 3D awareness into the 2D Stable Diffusion model via a geometric ControlNet.
Time will tell: New Outlooks and A Baseline for Temporal Multi-View 3D Object Detection
Can temporal multi-view 3D detection run efficiently? 🤔️
🙋♂️ We theoretically analyze the effects of the number of time frames, image resolution, and camera rotation and translation. We find that long-term frames can compensate for low resolution. We therefore propose generating a cost volume from a long history of image observations, compensating for a coarse but efficient matching resolution with a more optimal multi-view matching setup.
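The cost-volume idea can be sketched as follows (a toy illustration: it assumes source features have already been warped to the reference view at each depth hypothesis, and omits the warping and the paper's exact matching setup; all names are hypothetical):

```python
import numpy as np

def build_cost_volume(ref_feat, warped_feats):
    """ref_feat: (C, H, W) reference-view features.
    warped_feats: (T, D, C, H, W) features from T history frames, already warped
    to the reference view at D depth hypotheses.
    Returns a (D, H, W) cost volume: cosine similarity averaged over frames."""
    ref = ref_feat / (np.linalg.norm(ref_feat, axis=0, keepdims=True) + 1e-8)
    src = warped_feats / (np.linalg.norm(warped_feats, axis=2, keepdims=True) + 1e-8)
    # dot product over channels, summed over frames, then averaged
    return np.einsum('chw,tdchw->dhw', ref, src) / warped_feats.shape[0]

ref = np.ones((4, 8, 8))             # 4 channels, 8x8 feature map
hist = np.ones((6, 32, 4, 8, 8))     # 6 past frames, 32 depth hypotheses
cv = build_cost_volume(ref, hist)
print(cv.shape)  # (32, 8, 8)
```

Averaging matching evidence over many past frames is what lets a long history make up for a coarse per-frame matching resolution.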
Image2Point: 3D Point-Cloud Understanding with 2D Image Pretrained Models
This is a surprising work 😯!
Images and point clouds have a huge domain gap: images are dense RGB arrays while point clouds are sparse xyz points. We surprisingly found that image-pretrained models can be efficiently tuned (300x fewer tuned parameters) for point-cloud tasks. We also shed light on why this works through neural collapse, i.e., image-pretrained models exhibit neural collapse on point-cloud data.
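A minimal PyTorch-style sketch of this kind of parameter-efficient tuning (the model and recipe here are hypothetical stand-ins, not the paper's actual transfer scheme): freeze the pretrained backbone and tune only the normalization layers plus a new task head.

```python
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # stand-in for an image-pretrained backbone
        self.backbone = nn.Sequential(
            nn.Linear(64, 128), nn.BatchNorm1d(128), nn.ReLU(), nn.Linear(128, 128))
        self.head = nn.Linear(128, 10)  # new task head

def freeze_except_norm_and_head(model):
    """Freeze everything, then re-enable norm layers and the task head."""
    for p in model.parameters():
        p.requires_grad = False
    for name, m in model.named_modules():
        if isinstance(m, (nn.BatchNorm1d, nn.BatchNorm2d, nn.LayerNorm)) or name == "head":
            for p in m.parameters():
                p.requires_grad = True

net = Net()
freeze_except_norm_and_head(net)
total = sum(p.numel() for p in net.parameters())
tuned = sum(p.numel() for p in net.parameters() if p.requires_grad)
print(total, tuned)  # only a small fraction of parameters is tuned
```

Even in this toy model only ~6% of parameters remain trainable; at the scale of real backbones the ratio becomes far more dramatic.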
DetMatch: Two Teachers are Better Than One for Joint 2D and 3D Semi-Supervised Object Detection
This is a useful auto-labeling tool for 2D-3D object detection!
We propose DetMatch, a flexible framework for joint semi-supervised learning on 2D and 3D modalities. By identifying objects detected in both sensors, our pipeline generates a cleaner, more robust set of pseudo-labels that demonstrates stronger performance and stymies single-modality error propagation.
SqueezeSegV3: Spatially-adaptive Convolution for Efficient Point-cloud Segmentation
Traditional convolutions are not suitable for point-cloud processing! The projected point-cloud map does not let convolution learn a useful inductive bias, since different locations have totally different feature distributions. See our paper for a comprehensive analysis. We propose spatially-adaptive convolution to address this.
Projects about Planning
A general temporal-3D representation can work beyond perception; I extend it to motion prediction and planning.
PreTraM: Self-Supervised Pre-training via Connecting Trajectory and Map
Why is this work the first to pre-train for trajectory forecasting? 😯
Trajectory data is too scarce to make trajectory forecasting models data-efficient through pre-training alone. We open up a new path by leveraging far more abundant map data and connecting trajectory representations to strong map representations. Associating the geometric representations of maps with the shapes of trajectories boosts trajectory forecasting performance. We later extend this idea to synthetic data; see Pre-Training on Synthetic Driving Data for Trajectory Prediction.
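The trajectory-map connection can be sketched as a CLIP-style symmetric contrastive loss over paired trajectory and map embeddings (a simplified NumPy illustration, not the paper's exact objective; names and the temperature value are assumptions):

```python
import numpy as np

def contrastive_loss(traj_emb, map_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (trajectory, map) embeddings."""
    t = traj_emb / np.linalg.norm(traj_emb, axis=1, keepdims=True)
    m = map_emb / np.linalg.norm(map_emb, axis=1, keepdims=True)
    logits = t @ m.T / temperature               # (B, B); true pairs on the diagonal
    targets = np.arange(len(logits))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # average the trajectory->map and map->trajectory directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

rng = np.random.default_rng(0)
traj = rng.normal(size=(8, 16))
aligned = contrastive_loss(traj, traj)        # matched pairs: low loss
shuffled = contrastive_loss(traj, traj[::-1]) # broken pairs: high loss
print(aligned < shuffled)  # True
```

Pulling each trajectory toward its own map patch and away from others is how the strong map representation transfers to the scarce trajectory domain.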
What Matters to You? Towards Visual Representation Alignment for Robot Learning
Ran Tian, Chenfeng Xu, Masayoshi Tomizuka, Jitendra Malik, Andrea Bajcsy [Paper]
How can we align visual representations to human preferences? 🤔️
🙋♂️ In this work, we propose that robots should leverage human feedback to align their visual representations with the end-user's and disentangle what matters for the task. We propose Representation-Aligned Preference-based Learning (RAPL), a method that solves the visual representation alignment and visual reward learning problems through the lens of preference-based learning and optimal transport.
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Human-oriented Representation Learning for Robotic Manipulation
How can we train a vision model more suitable for robotic learning? 🤔️
🙋♂️ Train it like training a human! We formalize this idea through human-oriented multi-task fine-tuning on top of pre-trained visual encoders, where each task is a perceptual skill tied to human-environment interactions. We introduce the Task Fusion Decoder, a plug-and-play embedding translator that uses the underlying relationships among these perceptual skills to guide representation learning toward structure that matters for all of them, ultimately empowering downstream robotic manipulation tasks.