will be and determine the action to take
• Learn background and other information from interactions and observations → important aspects of events are important
• Connections between the real world and language

Real-world Understanding and Prediction of Human
• Machine learning for real-world recognition and prediction
  • input (observation) is an image → equivalent to human vision
  • predictions of image features are considered real-world predictions
• ML doesn't make predictions based on physical properties of objects or physical laws, as humans do
BUT…

Purpose
1. To construct a change point prediction model.
2. To connect language with the real world.
3. To generate more detailed sentences based on the characteristics of the environment.
[Figure: overview of the proposed pipeline, steps 1 to 3]
1. Change Point Prediction Model
  • input: image from CLEVRER [Yi+, 19]
  • output: predicted image, and prediction of a graph structure representing physical properties
    • Graph embedding vector
    • Velocity of each object
    • Acceleration of each object
    • Positional relationship between objects
2. Text generation from the predicted graph structure and the object's color, shape
  • generated text: Red cylinder is repulsed by green sphere
3. Text generation model including physical common sense
  • add: condition of physical common sense, (Ex) Condition 29
    • A's mass is large. B's mass is small.
    • A is slow. B is fast.
    • Floor is rough.
  • Object A = red cylinder, Object B = green sphere
  • regenerated text: Red cylinder collides with green sphere with such force that the green sphere is bounced off into the distance.
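[Figure: hierarchical prediction network. Labels: Input $x_t$, Prediction, upsample, subtraction (⊝); units $E^{t}_{\ell}$, $R^{t}_{\ell}$, $\hat{A}^{t}_{\ell}$ at layer $\ell$ and $E^{t}_{\ell+1}$, $R^{t}_{\ell+1}$, $\hat{A}^{t}_{\ell+1}$ at layer $\ell+1$.]
• Inference of the entire input information every time
• Hierarchically infer errors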
Video REpresentation and Reasoning
• Number of videos: 20,000 (train:val:test = 2:1:1)
• Video length: 5 sec
• Number of frames: 128
• Shape: cube, sphere, cylinder
• Material: metal, rubber
• Color: gray, red, blue, green, brown, cyan, purple, yellow
• Event: appear, disappear, collide
• Annotation: object id, position, speed, acceleration
characteristics of the environment
[Figure labels: object recognition; object position, velocity, acceleration; position direction flags between objects; graph structure; embedding vector]
[Figure: pipeline overview (repeated), steps 1 and 2. Image from CLEVRER [Yi+, 19] → Change Point Prediction Model → prediction of graph structure → Language Model (with object's color, shape) → generated text. Additional label: "add physical training data".]
3 (sentence type)
• Object type
  • color, shape
  • ex) blue sphere, gray cylinder, etc.

template (※ A, B: objects)
• before
  • A and B approach
  • A approaches B
  • B approaches A
• collision
  • A and B collide
  • A is repulsed by B
  • B is repulsed by A
• after
  • A and B leave
  • A away from B
  • B away from A
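The template scheme above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the template strings are copied from the slide, while the dictionary layout and the names `TEMPLATES` and `fill_templates` are hypothetical.

```python
# Template strings copied from the slide; the data layout and function
# names are hypothetical and used only for illustration.
TEMPLATES = {
    "before": ["{A} and {B} approach", "{A} approaches {B}", "{B} approaches {A}"],
    "collision": ["{A} and {B} collide", "{A} is repulsed by {B}", "{B} is repulsed by {A}"],
    "after": ["{A} and {B} leave", "{A} away from {B}", "{B} away from {A}"],
}


def fill_templates(phase, obj_a, obj_b):
    """Return the three sentences of one phase with A and B filled in."""
    sentences = []
    for template in TEMPLATES[phase]:
        text = template.format(A=obj_a, B=obj_b)
        # Capitalize the first letter and close the sentence with a period.
        sentences.append(text[0].upper() + text[1:] + ".")
    return sentences


# Objects are described by color and shape, e.g. "green sphere".
collision_sentences = fill_templates("collision", "green sphere", "red cylinder")
for sentence in collision_sentences:
    print(sentence)
```

With A = green sphere and B = red cylinder, the collision phase yields three candidate sentences that differ only in which object is the subject.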
i
Correct
• Green sphere and red cylinder collide.
• Green sphere is repulsed by red cylinder.
• Red cylinder is repulsed by green sphere.
Ours
• Red cylinder is repulsed by green sphere.

ii
Correct
• Brown cube and green cylinder collide.
• Brown cube is repulsed by green cylinder.
• Green cylinder is repulsed by brown cube.
Ours
• Brown cube is repulsed by green cylinder.
Score-en: 90.6 77.1 67.9 78.1 80.3
Score-ja: 88.3 80.6 79.2 80.4 81.2

Discussion
• Sentences describing the environment could be generated with high accuracy
• BLEU@4 evaluation showed lower accuracy
  • The subject of the sentences describing the collision depends on the language generation model.
  • BLEU scores were calculated for each of the three correct sentences and averaged, resulting in lower accuracy.
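The effect described in the discussion can be illustrated with a small sketch. It assumes a simplified, smoothed sentence-level BLEU written here for illustration (`simple_bleu` is a hypothetical helper, not the evaluation script behind the reported scores). A generated sentence that exactly matches one of three valid references still gets a reduced score once the per-reference scores are averaged.

```python
import math
import re
from collections import Counter


def tokens(sentence):
    # Lowercase word tokens; punctuation is dropped.
    return re.findall(r"\w+", sentence.lower())


def ngram_counts(toks, n):
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))


def simple_bleu(candidate, reference, max_n=4):
    """Simplified BLEU with add-one smoothing (illustration only)."""
    cand, ref = tokens(candidate), tokens(reference)
    if not cand:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped overlap: a candidate n-gram counts at most as often
        # as it occurs in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        log_precision += math.log((overlap + 1) / (total + 1)) / max_n
    # Brevity penalty for candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return brevity * math.exp(log_precision)


references = [
    "Green sphere and red cylinder collide.",
    "Green sphere is repulsed by red cylinder.",
    "Red cylinder is repulsed by green sphere.",
]
generated = "Red cylinder is repulsed by green sphere."

scores = [simple_bleu(generated, ref) for ref in references]
averaged = sum(scores) / len(scores)
best = max(scores)
print(f"per-reference: {[round(s, 3) for s in scores]}")
print(f"averaged: {averaged:.3f}, best single reference: {best:.3f}")
```

Taking the best-matching reference instead of the average would not penalize the model for choosing a different but equally valid subject.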
[Figure: pipeline overview (repeated), steps 2 and 3. Generated text from the Language Model, plus an added condition of physical common sense → Text generation model including physical common sense → regenerated text.]
knowledge
State of the environment / objects: list of environmental conditions
1. Floor is slippery. The mass of object A is large. The mass of object B is small. …
2. Floor is rough. The mass of object A is large. The mass of object B is small. …
…
45. …

Select from conditions No.1 – No.45
condition: No.19
• Floor is slippery.
• The mass of object A is large. The mass of object B is small.
• Object A is slow. Object B is fast.

[Figure labels: T5, crowdsourcing]
Green sphere collides with red cylinder with great force, and red cylinder is bounced off into the distance.
State of an object (mass)
1. Object A and object B have equal mass.
2. The mass of object A is large. The mass of object B is small.
3. The mass of object A is small. The mass of object B is large.

State of an object (speed)
1. Object A and object B have equal velocities.
2. Object A is fast. Object B is slow.
3. Object A is slow. Object B is fast.

Environment
1. Floor is slippery.
2. Floor is rough.

• Environment pattern: 1, 2, none
• mass pattern: 1, 2, 3, none
• speed pattern: 1, 2, 3, none
in all 3 × (4 × 4 − 1) = 45*
* Excluding none for both mass and speed
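The count of 45 can be checked by enumerating the combinations directly. The sketch below is an illustration only: the sentences are taken from the slide, `None` stands for the "none" pattern, and the enumeration order is an assumption that does not reproduce the slide's numbering (No.1 to No.45).

```python
from itertools import product

# None stands for the "none" pattern of each factor.
ENVIRONMENT = [None, "Floor is slippery.", "Floor is rough."]
MASS = [
    None,
    "Object A and object B have equal mass.",
    "The mass of object A is large. The mass of object B is small.",
    "The mass of object A is small. The mass of object B is large.",
]
SPEED = [
    None,
    "Object A and object B have equal velocities.",
    "Object A is fast. Object B is slow.",
    "Object A is slow. Object B is fast.",
]


def enumerate_conditions():
    """List every condition, skipping those with neither mass nor speed."""
    conditions = []
    for env, mass, speed in product(ENVIRONMENT, MASS, SPEED):
        if mass is None and speed is None:
            continue  # excluded: "none" for both mass and speed
        conditions.append([s for s in (env, mass, speed) if s is not None])
    return conditions


conditions = enumerate_conditions()
print(len(conditions))  # 3 * (4 * 4 - 1) combinations
```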
• Transformer-based model structure
• Used for various tasks such as translation, question answering, classification, summarization, etc.
• All tasks output text for input text
• Three models used in the experiment (pre-trained in Japanese)
  • sonoisa/t5-base-japanese
  • megagonlabs/t5-base-japanese-web
  • nlp-waseda/comet-t5-base-japanese
• validation: 250
• test: 250

T5 study settings

Input data
Input statement:
- A cube and a cylinder collide.
Condition:
- The floor is smooth.
- The mass of the cube is small.
- The mass of the cylinder is large.
- The speed of the cube is slow.
- The speed of the cylinder is fast.

Output data
1. A cylinder collides with a cube with such force that the cube is thrown far away.
2. The cube is thrown far away when the cylinder collides with the cube.
3. Cylinders collide with cubes with great force, and cubes are thrown away with great force.
4. Cylinder collides with cube and cube falls down.
5. The cylinder collides with the cube with great force, and the cube is thrown far away.

One of the five correct answers is randomly selected during training.
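The input/output setup above can be sketched as follows. This is a minimal sketch under assumptions: the slides do not specify how the statement and conditions are serialized for T5, so the layout simply mirrors the slide, and the names `build_input` and `make_training_pair` are hypothetical.

```python
import random


def build_input(statement, conditions):
    """Join the input statement and its conditions into one source text.

    The exact serialization used in the experiment is not given in the
    slides; this layout mirrors the slide and is an assumption.
    """
    lines = ["Input statement:", f"- {statement}", "Condition:"]
    lines += [f"- {condition}" for condition in conditions]
    return "\n".join(lines)


def make_training_pair(example, rng):
    """Pair the source text with one randomly chosen correct answer."""
    source = build_input(example["statement"], example["conditions"])
    target = rng.choice(example["references"])
    return source, target


example = {
    "statement": "A cube and a cylinder collide.",
    "conditions": [
        "The floor is smooth.",
        "The mass of the cube is small.",
        "The mass of the cylinder is large.",
        "The speed of the cube is slow.",
        "The speed of the cylinder is fast.",
    ],
    "references": [
        "A cylinder collides with a cube with such force that the cube is thrown far away.",
        "The cube is thrown far away when the cylinder collides with the cube.",
        "Cylinders collide with cubes with great force, and cubes are thrown away with great force.",
        "Cylinder collides with cube and cube falls down.",
        "The cylinder collides with the cube with great force, and the cube is thrown far away.",
    ],
}

rng = random.Random(0)  # fixed seed only to make the sketch repeatable
source, target = make_training_pair(example, rng)
print(source)
print("->", target)
```

Sampling a different reference each time the example is drawn exposes the model to all five human-written answers over the course of training.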
Input: A red cylinder is repelled by a green sphere.
Object A: red cylinder, Object B: green sphere

Floor: slippery, mass: A = B, speed: A = B
• Generated text based on physical common sense: Red cylinder and green sphere collide and both are bounced off in opposite directions.
• Example of correct answers by human operators: Red cylinder strikes a green sphere with the same velocity, and the green sphere bounces off into the distance.

Floor: slippery, mass: −, speed: A < B
• Generated text based on physical common sense: Green sphere collides with red cylinder with such force that the red cylinder is bounced off into the distance.
• Example of correct answers by human operators: Red cylinder and green sphere collide, Red cylinder is bounced a little and Green sphere is bounced a little.

Floor: rough, mass: A < B, speed: A > B
• Generated text based on physical common sense: Red cylinders collide with green spheres with such force that the green spheres are bounced away from the red cylinders.
• Example of correct answers by human operators: The red cylinder hits the green sphere with great force, and the red cylinder bounces back just a little.

Floor: −, mass: A > B, speed: A < B
• Generated text based on physical common sense: The green sphere collides with the red cylinder with such force that the green sphere is bounced off.
• Example of correct answers by human operators: Green sphere strikes Red cylinder and the green sphere is bounced.
• G-EVAL-4o
  • the highest accuracy
  • generated sentence: "the scene of a collision between objects" and "the movement of objects after the collision."
• Implication Division
  1. whether the fact of collision and the color and shape of the objects involved in the collision were correct.
  2. whether the content of the movement after the collision was implied in the correct sentences.
these indicators are evaluated based on the degree of word agreement between the correct sentence and the generated sentence after the addition of common sense.

• BLEU@4: 0.55
• BERTScore: 0.82
• BLEURT: 0.49
• ROUGE: 0.56
• Implication (Full): 0.67
• Implication (Division): 0.80
• G-EVAL-4o: 0.92
change points
• Models that can visually and physically predict change points in the observed environment
• Inference content is expressed in language to link the real world to language
• Linguistic generation of inferences based on experimental results
• Add conditions about environment and objects
• Regenerate sentences including human experience and commonsense knowledge

Future work
• Dataset is simple
• Replacement of inference and language generation with an LLM