A trained neural network pipeline simulates physical systems of rigid and deformable bodies and environmental conditions

MIT researchers used the RISP method to predict the action sequence, joint stiffness, or movement of an articulated hand, such as this one, from a target image or video. Credit: Massachusetts Institute of Technology

From “Star Wars” to “Happy Feet,” many beloved films contain scenes made possible by motion capture technology, which records the movement of objects or people via video. Moreover, the applications of this tracking, which involves complicated interactions among physics, geometry and perception, extend beyond Hollywood to the military, sports training, medical fields, computer vision and robotics, allowing engineers to understand and simulate actions taking place in real-world environments.

Because this can be a complex and expensive process – often requiring markers to be placed on objects or people and the action sequence to be recorded – researchers are working to shift the burden to neural networks, which could acquire this data from simple video and reproduce it in a model. Work in physics simulation and rendering shows promise in making this more widely useful, since such techniques can characterize realistic, continuous, dynamic motion from images and translate back and forth between a 2D rendering and a 3D scene in the world. However, to do so, current approaches require precise knowledge of the environmental conditions in which the action takes place, as well as the choice of rendering engine, both of which are often unavailable.

Now, a team of researchers from MIT and IBM has developed a trained neural network pipeline that avoids this problem: it can infer the state of the environment and the actions taking place, the physical characteristics of the object or person of interest (the system), and its control parameters. In tests, the technique outperformed other methods in simulations of four physical systems of rigid and deformable bodies, which illustrate different types of dynamics and interactions, under various environmental conditions. Additionally, the methodology enables imitation learning, predicting and reproducing the trajectory of a real-world flying quadcopter from video.

“The high-level research problem addressed in this paper is how to reconstruct a digital twin from a video of a dynamical system,” says Tao Du Ph.D. ’21, a postdoc in the Department of Electrical Engineering and Computer Science (EECS), a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL), and a member of the research team. To do this, Du says, “we have to ignore the rendering variations of the video clips and try to capture the core information about the dynamic system or the dynamic motion.”

One-up on motion capture

This training set was used to train the RISP pipeline, showing how rendering differences can affect texture, lighting and background. Credit: Massachusetts Institute of Technology

Du’s co-authors include lead author Pingchuan Ma, an EECS graduate student and CSAIL member; Josh Tenenbaum, Paul E. Newton Career Development Professor of Cognitive Science and Computation in the Department of Brain and Cognitive Sciences and a CSAIL member; Wojciech Matusik, professor of electrical engineering and computer science and a CSAIL member; and Chuang Gan, a senior research staff member at the MIT-IBM Watson AI Lab. The work was presented this week at the International Conference on Learning Representations (ICLR).

While capturing video of characters, robots, or dynamic systems to infer dynamic motion makes this information more accessible, it also brings a new challenge. “The images, or videos [and how they are rendered], depend very much on the lighting conditions, the background information, the texture information, and the material information of your environment, and these are not necessarily measurable in a real scenario,” says Du. Without this rendering-configuration information, or knowledge of which rendering engine is used, it is currently difficult to glean dynamic information and predict the behavior of the video’s subject. Even when the rendering engine is known, current neural network approaches still require large training datasets. With the team’s new approach, however, this may no longer be an issue. “If you take a video of a leopard running in the morning and in the evening, of course you will get visually different video clips, because the lighting conditions are very different. But what really matters is the dynamic motion: the angles of the leopard’s joints, not whether they look light or dark,” says Du.

To address the problem of rendering domains and image differences, the team developed a pipeline containing a neural network, called a rendering-invariant state prediction (RISP) network. RISP transforms differences in images (pixels) into differences in the states of the system, i.e., the action environment, making their method generalizable and independent of rendering configurations. RISP is trained using random rendering settings and states, which are fed into a differentiable renderer, a type of renderer that measures the sensitivity of pixels to rendering configurations, such as lighting or material colors. This generates a varied set of images and videos from known ground-truth parameters, which later allows RISP to reverse the process, predicting the state of the environment from an input video. The team also minimized RISP’s rendering gradients, so that its predictions were less sensitive to changes in rendering setups, allowing it to learn to ignore visual appearance and focus on learning dynamic states. This is made possible by a differentiable rendering engine.
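
As a rough illustration of that training recipe, here is a minimal sketch, assuming PyTorch, of how such a rendering-invariant state predictor could be trained. The toy differentiable renderer, the network architecture, and the penalty weight `lam` are illustrative assumptions, not the paper's components; the key ideas are the supervised state loss and the penalty on how strongly the predictions respond to the rendering configuration.

```python
# Minimal sketch, assuming PyTorch; the "renderer" is a toy differentiable
# stand-in, not a real rendering engine, and all dimensions are arbitrary.
import torch
import torch.nn as nn

STATE_DIM, RENDER_DIM, IMG_DIM = 4, 3, 64
torch.manual_seed(0)
W_state = torch.randn(STATE_DIM, IMG_DIM)    # fixed "geometry" response
W_render = torch.randn(RENDER_DIM, IMG_DIM)  # fixed "appearance" response

def toy_renderer(state, cfg):
    # The "image" depends on both the dynamic state and the rendering
    # configuration (lighting, material colors, ...), and is differentiable
    # with respect to both.
    return torch.tanh(state @ W_state + cfg @ W_render)

# Hypothetical state-prediction network (pixels -> system state).
risp = nn.Sequential(nn.Linear(IMG_DIM, 128), nn.ReLU(), nn.Linear(128, STATE_DIM))
opt = torch.optim.Adam(risp.parameters(), lr=1e-3)
lam = 0.1  # weight on the rendering-gradient penalty (illustrative value)

for step in range(2000):
    state = torch.randn(32, STATE_DIM)                    # random ground-truth states
    cfg = torch.rand(32, RENDER_DIM, requires_grad=True)  # random rendering settings
    img = toy_renderer(state, cfg)
    pred = risp(img)
    sup_loss = ((pred - state) ** 2).mean()               # predict state from pixels
    # Rendering-gradient penalty: predictions should barely move when only the
    # rendering configuration changes.
    grad_cfg, = torch.autograd.grad(pred.sum(), cfg, create_graph=True)
    inv_loss = (grad_cfg ** 2).mean()
    loss = sup_loss + lam * inv_loss
    opt.zero_grad()
    loss.backward()
    opt.step()
```

In the actual method, a differentiable rendering engine supplies the gradients with respect to the rendering configuration; the analytic toy function above merely plays that role.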

The method then runs two similar pipelines in parallel. One is for the source domain, with known variables: system parameters and actions are fed into a differentiable simulation, the states of the generated simulation are combined with different rendering configurations in a differentiable renderer to generate images, and the images are fed into RISP, which produces predictions about the environmental states. At the same time, a similar target-domain pipeline with unknown variables is run; RISP in this pipeline receives the output images and generates a predicted state. Comparing the predicted states of the source and target domains produces a new loss; this difference is used to adjust and optimize some of the parameters in the source-domain pipeline. The process can then be repeated, further reducing the loss between the pipelines.
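
The loop below is a minimal, self-contained sketch of that two-pipeline setup, again assuming PyTorch rather than the authors' code. A toy differentiable simulator produces states from a guessed parameter, a toy renderer turns them into images under a rendering configuration that deliberately differs from the target's, and the guess is updated until the state predictions of the two pipelines agree. The simulator, the frozen linear layer standing in for a trained RISP network, and all dimensions are assumptions made for illustration.

```python
# Minimal sketch, assuming PyTorch; the simulator, renderer, and "trained"
# RISP network below are toy placeholders, not the paper's components.
import torch

def simulate(v0, steps=20, dt=0.1, damping=0.9):
    # Toy differentiable simulator: a damped point mass whose trajectory is
    # determined by the unknown initial velocity v0 (the parameter to recover).
    x, v, states = torch.zeros(2), v0, []
    for _ in range(steps):
        x = x + dt * v
        v = damping * v
        states.append(x)
    return torch.stack(states)                 # shape (steps, 2)

torch.manual_seed(0)
W_state, W_cfg = torch.randn(2, 32), torch.randn(3, 32)

def render(states, cfg):
    # Toy differentiable renderer: images depend on states and render config.
    return torch.tanh(states @ W_state + cfg @ W_cfg)

risp = torch.nn.Linear(32, 2)                  # placeholder for a trained RISP net
for p in risp.parameters():
    p.requires_grad_(False)                    # frozen during parameter estimation

# Target-domain pipeline: video frames rendered from the true (hidden) parameter
# under a rendering configuration the optimizer never sees.
true_v0 = torch.tensor([1.5, -0.5])
target_imgs = render(simulate(true_v0), torch.tensor([0.9, 0.1, 0.4]))
target_states = risp(target_imgs).detach()

# Source-domain pipeline: adjust the guessed parameter so the predicted states
# of the two pipelines match, even though the render setups differ.
v0_guess = torch.zeros(2, requires_grad=True)
source_cfg = torch.tensor([0.2, 0.7, 0.3])     # a deliberately different setup
opt = torch.optim.Adam([v0_guess], lr=0.05)
for step in range(500):
    source_imgs = render(simulate(v0_guess), source_cfg)
    loss = ((risp(source_imgs) - target_states) ** 2).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

Because the state predictor is meant to ignore appearance, the mismatch between the source and target rendering configurations should not, ideally, leak into the recovered parameter.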

The RISP technique (left) is able to reconstruct the dynamic motion of a flying quadcopter to match the input video, without knowing the exact rendering setup. The lighting and material setups RISP uses here are intentionally different from those of the input video, to demonstrate the method's capability. Credit: Massachusetts Institute of Technology

To gauge the success of their method, the team tested it on four simulated systems: a quadcopter (a flying rigid body with no physical contact), a cube (a rigid body that interacts with its environment, like a die), an articulated hand, and a rod (a deformable body that can move like a snake). The tasks included estimating the state of a system from an image, identifying the system parameters and action-control signals from a video, and discovering control signals from a target image that drive the system to the desired state. Additionally, they created baselines and an oracle, comparing the new RISP process on these systems to similar methods that, for example, lack the rendering-gradient loss, do not train a neural network with any loss, or lack the RISP neural network altogether. The team also examined how the rendering-gradient loss affected the performance of the state-prediction model over time. Finally, the researchers deployed their RISP system to infer the motion of a real-world quadcopter, which has complex dynamics, from video. They compared its performance to other techniques that lacked a loss function and used pixel differences, or that included manual tuning of the renderer's configuration.

In nearly all of the experiments, the RISP procedure outperformed similar or state-of-the-art methods, mimicking or replicating the desired settings or motion and proving to be a data-efficient and generalizable competitor to current motion capture approaches.

For this work, the researchers made two important assumptions: that information about the camera, such as its position and settings, is known, and that the geometry and physics governing the object or person being tracked are known. Future work is planned to address these limitations.

“I think the biggest problem we’re solving here is reconstructing information from one domain to another, without very expensive equipment,” says Ma. Such an approach should be “useful for [applications such as the] metaverse, which aims to reconstruct the physical world in a virtual environment,” adds Gan. “It’s basically an everyday, available, simple and neat solution for cross-domain reconstruction or the inverse dynamics problem,” says Ma.

More information:
RISP: Rendering-Invariant State Predictor with Differentiable Simulation and Rendering for Cross-Domain Parameter Estimation. openreview.net/forum?id=uSE03demja

This story is republished with the kind permission of MIT News (web.mit.edu/newsoffice/), a popular site that covers news about research, innovation, and education at MIT.

Citation: Trained Neural Network Pipeline Simulates Rigid and Deformable Body Physical Systems and Environmental Conditions (2022, May 3), retrieved May 3, 2022, from https://techxplore.com/news/2022-05-neural-network-pipeline-simulates-physical.html

This document is subject to copyright. Except for fair use for purposes of private study or research, no part may be reproduced without written permission. The content is provided for information only.
