The AC-State algorithm discovers a latent representation of a system from a sequence of sensory observations and actions taken by an agent interacting with the system, while requiring no external supervision (such as rewards or labels). The algorithm works by predicting actions from pairs of observations using the smallest possible representation that supports the prediction. This representation provably captures all of the information necessary for controlling the agent while discarding all irrelevant or distracting details. Previous approaches either fail to capture the full state or fail to discard irrelevant information. We demonstrate this on a robot arm, recovering the position of the arm from only high-resolution video and recorded actions while ignoring complex background distractors.
A person walking along a city street who tries to model all aspects of the world would quickly be overwhelmed by a multitude of shops, cars, and people moving in and out of view, following their own complex and inscrutable dynamics. Yet exploring and navigating such an environment is an everyday task, requiring no vast exertion of mental resources. Is it possible to turn this fire hose of sensory information into a minimal latent state which is necessary and sufficient for an agent to successfully act in the world? We formulate this question concretely and propose the Agent-Controllable State Discovery algorithm (AC-State), which provably, and in practice, discovers the minimal controllable latent state: a representation containing all of the information necessary for controlling the agent and none of the irrelevant information. The algorithm consists of a multi-step inverse model (predicting actions from temporally distant observations) combined with an information bottleneck. AC-State enables localization, exploration, and navigation without rewards or demonstrations. We demonstrate the discovery of controllable latent state in three domains: localizing a robot arm amid distractions (e.g., changing lighting conditions and background), exploring a maze alongside other agents, and navigating in the Matterport house simulator.
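To make the objective concrete, the following is a minimal, illustrative PyTorch sketch of a multi-step inverse model paired with an information bottleneck. It assumes vector observations, discrete actions, and a Gumbel-softmax discretization as the bottleneck; the module names, layer sizes, and discretization choice are our own assumptions for illustration, not the authors' implementation.

```python
# Sketch (not the authors' code): multi-step inverse model with a discrete bottleneck.
# Assumes vector observations and discrete actions; all sizes are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder(nn.Module):
    """Maps an observation to a small discrete latent code (the bottleneck)."""

    def __init__(self, obs_dim: int, num_codes: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, num_codes)
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        logits = self.net(obs)
        # Discretizing to a one-hot code limits the latent's capacity,
        # which pressures the encoder to drop exogenous (uncontrollable) detail.
        return F.gumbel_softmax(logits, tau=1.0, hard=True)


class MultiStepInverse(nn.Module):
    """Predicts the first action a_t from latent codes of x_t and x_{t+k}."""

    def __init__(self, num_codes: int, num_actions: int):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * num_codes, 128), nn.ReLU(), nn.Linear(128, num_actions)
        )

    def forward(self, z_t: torch.Tensor, z_tk: torch.Tensor) -> torch.Tensor:
        return self.head(torch.cat([z_t, z_tk], dim=-1))


def multi_step_inverse_loss(encoder, inverse, obs_t, obs_tk, action_t):
    """Cross-entropy loss for predicting a_t from encodings of x_t and x_{t+k}."""
    z_t, z_tk = encoder(obs_t), encoder(obs_tk)
    logits = inverse(z_t, z_tk)
    return F.cross_entropy(logits, action_t)


if __name__ == "__main__":
    obs_dim, num_actions, batch = 64, 4, 16
    enc = Encoder(obs_dim)
    inv = MultiStepInverse(num_codes=32, num_actions=num_actions)
    obs_t = torch.randn(batch, obs_dim)                  # observation at time t
    obs_tk = torch.randn(batch, obs_dim)                 # observation k steps later
    action_t = torch.randint(0, num_actions, (batch,))   # action taken at time t
    loss = multi_step_inverse_loss(enc, inv, obs_t, obs_tk, action_t)
    loss.backward()
    print(float(loss))
```

In this sketch the bottleneck is enforced by discretizing to a small codebook; other capacity constraints would serve the same role of forcing the representation to keep only the information needed to predict the agent's actions.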
We evaluate AC-State in a house navigation environment (top), where the agent observes high-resolution first-person views in which the vertical position of the camera is random exogenous noise (center). The algorithm discovers a controllable latent state which is parsimonious (bottom left). AC-State captures the position of the agent in the house and discards information about the position of the camera (bottom right). The baselines we consider capture the controllable latent state but fail to discard the exogenous noise.