Jenga is a quintessential example of a contact-rich task in which we must interact with the tower, combining touch and sight, to learn and infer block mechanics and multimodal behavior.
Current learning methodologies struggle with these challenges and have not exploited physics nearly as richly as we believe humans do. Most robotic learning systems still rely on purely visual data, without a sense of touch; this fundamentally limits how quickly and flexibly a robot can learn about the world. Algorithms built on model-free reinforcement learning have little or no ability to exploit knowledge about the physics of objects and actions. Even methods based on model-based reinforcement learning or imitation learning have mostly relied on generic statistical models that do not explicitly represent the knowledge of physical objects, contacts, and forces that humans possess from a very early age. As a consequence, these systems require far more training data than humans do to learn new models or new tasks, and they generalize much less broadly and less robustly.
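To make this contrast concrete, the sketch below works a toy version of the argument. It assumes a Coulomb-friction abstraction (a block slides when the tangential push exceeds μ times the normal load); the abstraction and all names and numbers are illustrative, not taken from any particular system. With that physical structure encoded, a learner needs only a few touch trials to bracket the single coefficient μ, whereas a generic statistical model must fit the entire decision boundary from data.

```python
import numpy as np

def estimate_friction_coefficient(normal_forces, push_forces, slid):
    """Physics-informed learner: the only unknown is the scalar mu.

    Each trial gives a ratio push/normal; mu must lie above the largest
    ratio that failed to produce sliding and below the smallest ratio
    that succeeded, so a handful of pokes brackets it.
    """
    ratios = push_forces / normal_forces
    lower = ratios[~slid].max(initial=0.0)         # stuck trials bound mu from below
    upper = ratios[slid].min(initial=lower + 1.0)  # sliding trials bound mu from above
    return 0.5 * (lower + upper)                   # midpoint of the bracket

def predict_slide(mu, normal_force, push_force):
    """Predict motion for an unseen block from the single learned parameter."""
    return push_force > mu * normal_force

# Five touch trials suffice to bracket mu; a purely statistical model
# would need many more samples to fit the same decision boundary.
normals = np.array([2.0, 3.0, 4.0, 2.5, 5.0])  # normal loads (N), illustrative
pushes = np.array([0.5, 1.0, 1.5, 0.9, 1.2])   # tangential pushes (N), illustrative
slid = pushes > 0.3 * normals                  # outcomes for a true mu of 0.3

mu_hat = estimate_friction_coefficient(normals, pushes, slid)
print(f"estimated mu = {mu_hat:.2f}")          # ~0.29 from five trials
print(predict_slide(mu_hat, normal_force=4.0, push_force=1.5))  # True
```

The point of the sketch is the parameter count: the physics-informed learner estimates one interpretable quantity and generalizes to unseen loads immediately, which is the sample-efficiency gap the paragraph above attributes to human-like physical priors.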