Object-centric Representations in Computer Vision

CVPR 2024 Tutorial

Introduction

This tutorial discusses the evolution of object-centric representations in computer vision and deep learning. Initially inspired by the decomposition of visual scenes into surfaces and objects, recent developments focus on learning causal variables from high-dimensional observations such as images or videos. The tutorial covers the objectives of object-centric learning (OCL), its development, and its connections with other fields of machine learning, emphasizing object-centric approaches, especially in unsupervised segmentation. Advances in encoders, decoders, and self-supervised learning objectives are explored, with a focus on real-world applications and challenges. The tutorial also introduces open-source tools and showcases breakthroughs in video-based object-centric learning. Finally, the broader scope of object-centric representations is discussed, including applications such as amodal video segmentation, Vision-and-Language Navigation (VLN), Facial Appearance Editing (FAE), robotic grasping, and the PaLM-E model's impact across various domains.
The tutorial comprises four talks covering the basic ideas, learning good features for object-centric learning, video-based object-centric representations, and a diverse range of real-world applications. The main contents of each talk are summarized below.

Introduction of Object-Centric Representation, and Beyond

Decomposing visual scenes into sets of descriptive entities is both an old idea and a promising new direction for solving complex downstream tasks on visual data. In the early days of computer vision, scenes were decomposed into surfaces and objects, with features describing reflectance, albedo, and geometry. In the deep learning era, these concepts are being rediscovered with a twist: if a deep network learns to map an image into some of the elements that describe a scene, then we may hope that many tasks pertaining to the scene can be solved. This insight connects to the more general inverse problem of learning causal variables (in this case, high-level variables describing the scene) from high-dimensional observations (images or videos). The first part of the tutorial dives into the high-level objectives of OCL, its original development, and its connections with other fields in machine learning, in particular causality.
In more detail, Francesco will cover the revival of object-centric approaches, which developed from disentangled representations into object discovery architectures, with progress measured concretely on unsupervised segmentation benchmarks. In the past few years, progress has accelerated significantly, building on the scalable slot attention module and its integration into image and video models. The field has since evolved toward scene decomposition models that leverage large visual backbones as well as language models and 3D. At the same time, theoretical understanding from causal representation learning is catching up, with new results on identifiability guarantees for scene decomposition. As slot-based models scale up, radically new ideas are also being proposed, for example a new neuron model that binds to specific information (i.e., it can bind to objects) and allows a scene to be represented in an object-centric format that is also distributed in the classical sense.
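
Since the slot attention module recurs throughout the tutorial, a minimal PyTorch sketch of its iterative, competitive attention update may help fix ideas. It follows the published algorithm of Locatello et al. (2020); module names, sizes, and hyperparameters here are illustrative, not tied to any specific model in the talk:

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Iterative attention in which K slots compete to explain N input features."""

    def __init__(self, num_slots: int, dim: int, iters: int = 3, eps: float = 1e-8):
        super().__init__()
        self.num_slots, self.iters, self.eps = num_slots, iters, eps
        self.scale = dim ** -0.5
        # Learned Gaussian for sampling the initial slots.
        self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
        self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
        self.gru = nn.GRUCell(dim, dim)
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
        self.norm_input, self.norm_slots, self.norm_mlp = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, inputs: torch.Tensor) -> torch.Tensor:
        b, n, d = inputs.shape
        inputs = self.norm_input(inputs)
        k, v = self.to_k(inputs), self.to_v(inputs)
        mu = self.slots_mu.expand(b, self.num_slots, -1)
        sigma = self.slots_logsigma.exp().expand(b, self.num_slots, -1)
        slots = mu + sigma * torch.randn_like(mu)
        for _ in range(self.iters):
            slots_prev = slots
            q = self.to_q(self.norm_slots(slots))
            # Softmax over the slot axis: slots compete for each input feature.
            attn = torch.einsum('bnd,bkd->bnk', k, q).mul(self.scale).softmax(dim=-1)
            attn = attn + self.eps
            attn = attn / attn.sum(dim=1, keepdim=True)  # weighted mean over inputs
            updates = torch.einsum('bnk,bnd->bkd', attn, v)
            slots = self.gru(updates.reshape(-1, d), slots_prev.reshape(-1, d)).reshape(b, -1, d)
            slots = slots + self.mlp(self.norm_mlp(slots))
        return slots  # (B, num_slots, dim)
```

Applied to a grid of encoder features flattened to shape (B, N, dim), the returned slots can then be decoded back to pixel or feature space, which is where the architectures discussed in this talk differ.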

Bridging the gap to real-world object-centric learning

The development of object-centric methods began with relatively simple synthetic datasets as proofs of concept. Scaling these methods to the complexities of the real world is a significant challenge. This talk explores recent advancements aimed at narrowing the gap between object-centric learning and real-world image and video data, as the foundation for real-world impact.
Tianjun will introduce these efforts from three angles: the encoder, the decoder, and the self-supervised learning objectives. Additionally, Tianjun will introduce the open-source toolbox for object-centric learning released by the AWS Shanghai AI Lab.
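
To give a flavor of the decoder and objective side, below is a minimal sketch of a slot-wise spatial-broadcast decoder trained to reconstruct frozen ViT patch features rather than pixels, in the spirit of feature-reconstruction objectives that have proven effective on real-world data (e.g., DINOSAUR-style models). The class name and all sizes are illustrative assumptions, not the toolbox's API:

```python
import torch
import torch.nn as nn

class SpatialBroadcastDecoder(nn.Module):
    """Decode each slot into per-patch features plus an alpha logit, then
    mix the slot predictions with a softmax over slots per patch."""

    def __init__(self, slot_dim: int = 64, feat_dim: int = 768, num_patches: int = 196):
        super().__init__()
        self.num_patches = num_patches
        self.pos = nn.Parameter(torch.randn(1, num_patches, slot_dim))
        self.mlp = nn.Sequential(
            nn.Linear(slot_dim, 1024), nn.ReLU(),
            nn.Linear(1024, feat_dim + 1),  # predicted features + alpha logit
        )

    def forward(self, slots: torch.Tensor):  # slots: (B, K, slot_dim)
        b, k, d = slots.shape
        # Broadcast each slot over all patch positions and add a position code.
        x = slots.unsqueeze(2).expand(b, k, self.num_patches, d) + self.pos.unsqueeze(1)
        out = self.mlp(x)                       # (B, K, P, feat_dim + 1)
        feats, alpha = out[..., :-1], out[..., -1:]
        weights = alpha.softmax(dim=1)          # slots compete per patch
        recon = (weights * feats).sum(dim=1)    # (B, P, feat_dim)
        return recon, weights

# Self-supervised objective (sketch): MSE between recon and frozen
# DINO/ViT patch features of the same image, e.g.
#   loss = torch.nn.functional.mse_loss(recon, target_feats)
```

The per-patch softmax weights double as an unsupervised segmentation, which is how such models are typically evaluated.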

Video-Based Object-centric Learning

Recent breakthroughs in video representation learning and the development of pre-trained vision-language models have paved the way for significant advances in self-supervised, video-based object-centric learning. In this presentation, Tong will showcase a series of works dedicated to video-based open-vocabulary object-centric learning.
Tong's discussion will cover several aspects, starting with localizing objects in videos through slot attention, leveraging semantically named entities with the pre-trained CLIP model (see the sketch below). The exploration extends to the development of a video object-centric model tailored for multiple-object tracking, and addresses challenges in object-centric video amodal tasks.
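
As a hypothetical illustration of the open-vocabulary step, one can score a discovered object region against CLIP text embeddings of candidate entity names. The vocabulary, cropping strategy, and model variant below are assumptions for illustration, not the specifics of the presented works:

```python
import torch
import clip
from PIL import Image

# Load a pre-trained CLIP model (ViT-B/32 is an illustrative choice).
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Hypothetical candidate vocabulary; in practice it could come from
# named entities mined from captions or an external lexicon.
candidate_names = ["a dog", "a car", "a person", "a bicycle"]
text_tokens = clip.tokenize(candidate_names).to(device)

def label_region(region_crop: Image.Image) -> str:
    """Assign the best-matching candidate name to one object region crop."""
    image_input = preprocess(region_crop).unsqueeze(0).to(device)
    with torch.no_grad():
        image_feat = model.encode_image(image_input)
        text_feat = model.encode_text(text_tokens)
        # Cosine similarity between the region and each candidate name.
        image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
        text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
        sims = (image_feat @ text_feat.T).squeeze(0)
    return candidate_names[sims.argmax().item()]
```
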
Moreover, Tong will provide insights into ongoing projects, shedding light on how the Segment Anything Model (SAM) can empower video-based object-centric learning. Join this talk for a comprehensive view of the cutting-edge advancements and innovative approaches driving progress in video-based object-centric learning.
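
One plausible entry point is SAM's automatic mask generator, whose class-agnostic per-frame masks could serve as pseudo-labels or slot initializations for a video object-centric model. This sketch uses the official segment-anything API; the checkpoint path and the downstream use are placeholders, not the specifics of the ongoing projects:

```python
import numpy as np
from segment_anything import sam_model_registry, SamAutomaticMaskGenerator

# Load a SAM checkpoint (the path is a placeholder; weights are available
# from the official repository).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
mask_generator = SamAutomaticMaskGenerator(sam)

def masks_for_frame(frame_rgb: np.ndarray):
    """Return class-agnostic object masks for one HxWx3 uint8 RGB frame.

    Each entry is a dict with keys such as 'segmentation' (a boolean HxW
    array) and 'area'; these masks could seed or supervise slots.
    """
    return mask_generator.generate(frame_rgb)
```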

More Real-world Applications by Object-centric Representations

While object-centric learning had its inception in unsupervised segmentation on the CLEVR dataset, recent years have witnessed a paradigm shift: a variety of object-centric learning methods have emerged, extending their application to a diverse array of real-world challenges. Yanwei Fu will delve into the broader scope of real-world applications enabled by object-centric representations.
In this talk, Yanwei Fu will explore the versatility of object-centric representations, ranging from amodal video segmentation to Vision-and-Language Navigation (VLN). The latter empowers agents to execute actions based on language instructions from humans, opening up new possibilities for human-AI interaction.
The discussion extends to the innovative use of slot attention in Facial Appearance Editing (FAE), which modifies physical attributes such as pose, expression, and lighting in human facial images while preserving crucial attributes like identity and background.
Notably, Yanwei Fu will shed light on the application of object-centric representations to robotic grasping, demonstrating their efficacy in real-world scenarios. The talk will also highlight their pivotal role in the PaLM-E model, showcasing the model's capability and impact across various applications.

Program

Session | Title | Speaker
10 min | Opening Remarks | Yanwei Fu
1 hour + 5 min Q&A | Introduction of Object-Centric Representation, and Beyond | Francesco Locatello
40 min + 5 min Q&A | Bridging the gap to real-world object-centric learning | Tianjun Xiao
15 min | Break
40 min + 5 min Q&A | Video-Based Object-centric Learning | Tong He
1 hour + 5 min Q&A | More Real-world Applications by Object-centric Representations | Yanwei Fu

Speakers and Organizers

Yanwei Fu

Fudan University

Ke Fan

Fudan University

Tianjun Xiao

Amazon Web Services Shanghai AI Lab

Tong He

Amazon Web Services Shanghai AI Lab

Francesco Locatello

Institute of Science and Technology Austria

Contacts

Contact the Organizing Committee: yanweifu@fudan.edu.cn, kfan21@m.fudan.edu.cn