My current research focus is 3D multimodal learning. I am working on the following problems: 1) (text-conditioned) 3D generation, 2) single-image 3D reconstruction, 3) 3D representation learning with text supervision, 4) scalable embodied learning, and 5) self-training from simulators, among others.
I previously worked extensively on image-and-text understanding and pre-training, and I continue to investigate this direction. I am especially interested in three problems: 1) what is a scalable way to build universal multimodal models? 2) can we use information from other modalities to help language understanding? 3) how can we build multimodal large language models?