SCIS-ISIS2024_erikuroda

Verbal Description Focusing on Physical Properties of Real-World Environments
◯Eri Kuroda, Yuki Taya & Ichiro Kobayashi
Ochanomizu University

Background, Purpose

Human Real-world Understanding and Prediction
• Predict what the object's next move will be and determine the action to take
• Learn background and other information from interactions and observations → it is important to capture the key aspects of events
• Connections between the real world and language

Machine learning for real-world recognition and prediction
• input (observation) is an image → equivalent to human vision
• predictions of image features are considered real-world predictions

BUT…
• ML doesn't make predictions based on physical properties of objects or physical laws, as humans do

Purpose
1. To construct a change point prediction model.
2. To connect language with the real world.
3. To express in more detailed sentences based on the characteristics of the environment.

Overview
[Figure: pipeline overview, with stages numbered 1 to 3]
• input: image (CLEVRER [Yi+, 19]) and physical training data
  • Graph embedding vector
  • Velocity of each object
  • Acceleration of each object
  • Positional relationship between objects
• Change Point Prediction Model
  • output: predicted image, Prediction of graph structure representing physical properties
• Language Model
  • input: Prediction of graph structure, object's color, shape
  • generated text: Red cylinder is repulsed by green sphere
• Text generation model including physical common sense
  • add: Condition of physical common sense
    (Ex) Condition 29
    • A's mass is large. B's mass is small.
    • A is slow. B is fast
    • Floor is rough.
  • Object A = red cylinder, Object B = green sphere
  • regenerated text: Red cylinder collides with green sphere with such force that the green sphere is bounced off into the distance.

PreCNet [Straka+, 23]
[Figure: PredNet [Lotter+, 16] and PreCNet side by side. Layers ℓ and ℓ+1 each hold an Error unit E_t^ℓ, a Representation unit R_t^ℓ and a Prediction unit Â_t^ℓ; x_t is the Input; the diagram includes an upsample step.]
• Inference of the entire input information every time
• Hierarchically infer errors

Variational Temporal Abstraction [Kim+, 19]
[Figure: "all events" versus "change points", shown for two cases: when walking on the blue road and when walking on the red road]

Variational Temporal Abstraction [Kim+, 19]
[Figure: Observation (Input), Observation abstraction, Temporal abstraction]
• problem: difficult to decide when to transition Z
  • Human: easy
  • Model: difficult

Variational Temporal Abstraction [Kim+, 19]
• Introduced flags
• Determines the flag m (0 or 1) by the magnitude of the change in latent state compared to the previous observation

PreCNet-based proposed Model
[Figure: two PreCNet-style streams at time t, one for image data (Input x_t^img) and one for physical data (Input x_t^phy). Each stream has Error units E_t^ℓ, E_t^{ℓ+1}, Representation units R_t^ℓ, R_t^{ℓ+1} and Prediction units Â_t^ℓ, Â_t^{ℓ+1}, receives R_{t-1}^ℓ from the previous step, and uses upsampling. Output: img.]

\[ \mathit{diff}_t = \mathit{diff}_t^{img} + \mathit{diff}_t^{phy} \]
\[ m_t = \begin{cases} 0 & : \mathit{diff}_t < \alpha \\ 1 & : \mathit{diff}_t > \alpha \end{cases} \]
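The flag rule on this slide can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the slide does not say how each stream's diff is computed, so the mean absolute difference between the current and previous representation is an assumption, and the function names and the value of alpha are hypothetical. The slide leaves the case diff_t = α undefined; the sketch assigns it to 0.

```python
def mean_abs_diff(current, previous):
    """Mean absolute element-wise difference between two equal-length vectors."""
    return sum(abs(c - p) for c, p in zip(current, previous)) / len(current)


def change_point_flag(r_img, r_img_prev, r_phy, r_phy_prev, alpha):
    """Return (m_t, diff_t), where diff_t = diff_img + diff_phy and m_t = 1 if diff_t > alpha."""
    diff_img = mean_abs_diff(r_img, r_img_prev)  # assumed image-stream change measure
    diff_phy = mean_abs_diff(r_phy, r_phy_prev)  # assumed physical-stream change measure
    diff_t = diff_img + diff_phy
    m_t = 1 if diff_t > alpha else 0
    return m_t, diff_t


# Made-up representations: a nearly static step and a step with a large change.
flag_small, diff_small = change_point_flag(
    [0.1, 0.2], [0.1, 0.25], [1.0, 0.0], [1.0, 0.0], alpha=0.5)
flag_large, diff_large = change_point_flag(
    [0.9, 0.2], [0.1, 0.2], [1.0, 2.0], [1.0, 0.0], alpha=0.5)
```

Summing the two diffs means a large change in either the image stream or the physical stream can raise the flag on its own.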

Dataset: CLEVRER [Yi+, 2020]
• CLEVRER [Yi+, 2020]
  • CoLlision Events for Video REpresentation and Reasoning

Number of videos   20,000 (train:val:test = 2:1:1)
Video Length       5 sec
Number of frames   128 frames
Shape              cube, sphere, cylinder
Material           metal, rubber
Color              gray, red, blue, green, brown, cyan, purple, yellow
Event              appear, disappear, collide
Annotation         object id, position, speed, acceleration

Dataset: physical training dataset
• Dataset created from physical characteristics of the environment
[Figure: object recognition → object position, velocity, acceleration, position direction flags between objects → combination → graph structure → embedding vector]
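As a rough illustration of the per-object quantities listed on this slide, the sketch below derives velocity and acceleration from positions by finite differences and encodes the relative direction between two objects as sign flags. All of this is an assumption for illustration: CLEVRER already ships speed and acceleration annotations, the slide does not define the flag encoding, and the trajectory, function names and uniform frame spacing are made up.

```python
def finite_difference(values, dt):
    """Forward differences of a list of (x, y) tuples, divided by the time step."""
    return [((x1 - x0) / dt, (y1 - y0) / dt)
            for (x0, y0), (x1, y1) in zip(values, values[1:])]


def direction_flags(pos_a, pos_b):
    """Sign of B's offset from A along each axis: -1, 0 or 1 (hypothetical encoding)."""
    def sign(v):
        return (v > 0) - (v < 0)
    return (sign(pos_b[0] - pos_a[0]), sign(pos_b[1] - pos_a[1]))


dt = 5.0 / 128  # seconds per frame, assuming 128 evenly spaced frames in a 5 sec video
positions_a = [(0.0, 0.0), (0.1, 0.0), (0.3, 0.0)]  # made-up trajectory of object A
velocity_a = finite_difference(positions_a, dt)
acceleration_a = finite_difference(velocity_a, dt)
flags_ab = direction_flags(positions_a[-1], (1.0, -2.0))  # made-up position of object B
```

Features like these would then be attached to the nodes and edges of the graph structure before it is embedded.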

Overview
[Figure: pipeline overview, repeated: Change Point Prediction Model → Language Model → Text generation model including physical common sense]

Ex1: Creation of Templates
• nine templates
  • 3 (before・collision・after) × 3 (sentence type)
• Object type
  • color, shape
  • ex) blue sphere, gray cylinder, etc.

template (※ A, B: objects)
• before
  • A and B approach
  • A approaches B
  • B approaches A
• collision
  • A and B collide
  • A is repulsed by B
  • B is repulsed by A
• after
  • A and B leave
  • A away from B
  • B away from A
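The nine templates can be written down as a small lookup table and filled with object descriptions. This is a minimal sketch: the template strings come from the slide, while the dictionary layout and the function name are hypothetical.

```python
TEMPLATES = {
    "before": ["{A} and {B} approach", "{A} approaches {B}", "{B} approaches {A}"],
    "collision": ["{A} and {B} collide", "{A} is repulsed by {B}", "{B} is repulsed by {A}"],
    "after": ["{A} and {B} leave", "{A} away from {B}", "{B} away from {A}"],
}


def fill_templates(obj_a, obj_b):
    """Return {phase: [sentence, ...]} with A and B replaced by object descriptions."""
    return {
        phase: [pattern.format(A=obj_a, B=obj_b) for pattern in patterns]
        for phase, patterns in TEMPLATES.items()
    }


sentences = fill_templates("red cylinder", "green sphere")
total = sum(len(group) for group in sentences.values())  # 3 phases x 3 sentence types
```

Each object description is a color plus a shape, so the same nine patterns cover every pair of CLEVRER objects.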

Ex1: text generating model
[Figure: Decoder model]
• train: text pair data; graph embedding → Linear → Decoder → Softmax
  • decoder input: <bos> w1 w2 … wt
  • decoder output: w1 w2 … wt <eos>
• test: pred graph embedding (input) → Trained Decoder Model → generated text indicating predicted content
• text pair data: 219,303 pieces / 10,965 pieces

Result - Generation example -
Range i [Image]
  Correct:
  • Green sphere and red cylinder collide.
  • Green sphere is repulsed by red cylinder.
  • Red cylinder is repulsed by green sphere.
  Ours:
  • Red cylinder is repulsed by green sphere.
Range ii [Image]
  Correct:
  • Brown cube and green cylinder collide.
  • Brown cube is repulsed by green cylinder.
  • Green cylinder is repulsed by brown cube.
  Ours:
  • Brown cube is repulsed by green cylinder.

Evaluation of text generation

           BLEU@2  BLEU@3  BLEU@4  METEOR  CIDEr
Score-en   90.6    77.1    67.9    78.1    80.3
Score-ja   88.3    80.6    79.2    80.4    81.2

Discussion
• Sentences describing the environment could be generated with high accuracy
• BLEU@4 evaluation showed lower accuracy
  • The subject of the sentences describing the collision depends on the language generation model.
  • BLEU scores were calculated for each of the three correct sentences and averaged, resulting in lower accuracy.
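The averaging effect named in the discussion can be reproduced with a toy metric. The sketch below is a simplified sentence-level BLEU (whitespace tokens, n-grams up to 2, no smoothing) written only for illustration; it is not the scorer used in the deck, and the function names are hypothetical. It scores the Range i output against each correct sentence separately and then against all three together.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, references, max_n=2):
    """Simplified BLEU: clipped n-gram precisions, geometric mean, brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        max_ref = Counter()
        for ref in refs:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(c, max_ref[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if clipped == 0 or total == 0:
            return 0.0
        log_precisions.append(math.log(clipped / total))
    # Reference length closest to the candidate length (ties go to the shorter one).
    closest = min((len(r) for r in refs), key=lambda length: (abs(length - len(cand)), length))
    bp = 1.0 if len(cand) >= closest else math.exp(1 - closest / len(cand))
    return bp * math.exp(sum(log_precisions) / max_n)


references = [
    "Green sphere and red cylinder collide.",
    "Green sphere is repulsed by red cylinder.",
    "Red cylinder is repulsed by green sphere.",
]
ours = "Red cylinder is repulsed by green sphere."

per_reference = [simple_bleu(ours, [ref]) for ref in references]
averaged = sum(per_reference) / len(per_reference)  # scoring scheme described on the slide
multi_reference = simple_bleu(ours, references)     # all three references at once
```

Because the output matches one correct sentence exactly but differs in subject from the other two, the averaged score falls well below the multi-reference score even though the output is a valid description.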

Overview
[Figure: pipeline overview, repeated: Change Point Prediction Model → Language Model → Text generation model including physical common sense]

Text generation model for collision situation based on physical commonsense knowledge

List of environmental conditions (State of the environment / objects)
1. Floor is slippery. The mass of object A is large. The mass of object B is small. …
2. Floor is rough. The mass of object A is large. The mass of object B is small. …
…
45. …

Select from conditions No.1 – No.45
condition: No.19
• Floor is slippery.
• The mass of object A is large. The mass of object B is small.
• Object A is slow. Object B is fast.

crowdsourcing → T5
• Green sphere collides with red cylinder with great force, and red cylinder is bounced off into the distance.

Add common sense externally
Assignment of conditions: 45 types

Environment
1. Floor is slippery.
2. Floor is rough.

State of an object (mass)
1. Object A and object B have equal mass.
2. The mass of object A is large. The mass of object B is small.
3. The mass of object A is small. The mass of object B is large.

State of an object (speed)
1. Object A and object B have equal velocities.
2. Object A is fast. Object B is slow.
3. Object A is slow. Object B is fast.

• Environment pattern: 1, 2, none
• mass pattern: 1, 2, 3, none
• speed pattern: 1, 2, 3, none
• in all 3 × (4 × 4 − 1) = 45*
* Excluding none for both mass and speed
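The count of 45 can be checked by enumerating the patterns directly. This is a minimal sketch: the sentences come from the slide, `None` stands for the "none" pattern, and the resulting order is not claimed to match the authors' numbering (No.1 to No.45).

```python
from itertools import product

ENVIRONMENT = [None, "Floor is slippery.", "Floor is rough."]
MASS = [
    None,
    "Object A and object B have equal mass.",
    "The mass of object A is large. The mass of object B is small.",
    "The mass of object A is small. The mass of object B is large.",
]
SPEED = [
    None,
    "Object A and object B have equal velocities.",
    "Object A is fast. Object B is slow.",
    "Object A is slow. Object B is fast.",
]

# Keep every combination except those where both mass and speed are "none".
conditions = [
    (env, mass, speed)
    for env, mass, speed in product(ENVIRONMENT, MASS, SPEED)
    if not (mass is None and speed is None)
]

# Join the sentences that are present into one condition text per combination.
condition_texts = [" ".join(part for part in combo if part) for combo in conditions]
```

The filter removes one mass/speed pair for each of the three environment patterns, which is where the "4 × 4 − 1" in the formula comes from.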

Language generating model
• T5 (Text-To-Text Transfer Transformer) [Raffel+, 2020]
  • Transformer-based model structure
  • Used for various tasks such as translation, question answering, classification, summarization, etc.
  • All tasks output text for input text
• Three models used in the experiment (pre-trained in Japanese)
  • sonoisa/t5-base-japanese
  • megagonlabs/t5-base-japanese-web
  • nlp-waseda/comet-t5-base-japanese

T5 study settings
• Using data collected through crowdsourcing
  • train: 1,500
  • validation: 250
  • test: 250

Input data
Input statement:
- A cube and a cylinder collide.
Condition:
- The floor is smooth.
- The mass of the cube is small.
- The mass of the cylinder is large.
- The speed of the cube is slow.
- The speed of the cylinder is fast.

Output data
1. A cylinder collides with a cube with such force that the cube is thrown far away.
2. The cube is thrown far away when the cylinder collides with the cube.
3. Cylinders collide with cubes with great force, and cubes are thrown away with great force.
4. Cylinder collides with cube and cube falls down.
5. The cylinder collides with the cube with great force, and the cube is thrown far away.

One of the five correct answers is randomly selected during training.
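Building one training pair from this data might look like the sketch below. The way the input statement and conditions are joined into a single source string is an assumption (the slide only shows the fields), and `build_example` is a hypothetical helper; the only behavior taken from the slide is that one of the five correct answers is picked at random.

```python
import random


def build_example(input_statement, conditions, answers, rng):
    """Return (source, target): a flattened input text and one randomly chosen answer."""
    # Assumed serialization: field labels followed by space-separated sentences.
    source = "Input statement: " + input_statement + " Condition: " + " ".join(conditions)
    target = rng.choice(answers)
    return source, target


input_statement = "A cube and a cylinder collide."
conditions = [
    "The floor is smooth.",
    "The mass of the cube is small.",
    "The mass of the cylinder is large.",
    "The speed of the cube is slow.",
    "The speed of the cylinder is fast.",
]
answers = [
    "A cylinder collides with a cube with such force that the cube is thrown far away.",
    "The cube is thrown far away when the cylinder collides with the cube.",
    "Cylinders collide with cubes with great force, and cubes are thrown away with great force.",
    "Cylinder collides with cube and cube falls down.",
    "The cylinder collides with the cube with great force, and the cube is thrown far away.",
]

rng = random.Random(0)  # seeded only so the sketch is reproducible
source, target = build_example(input_statement, conditions, answers, rng)
```

Calling the helper again for each epoch would expose the model to different correct answers for the same input over the course of training.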

T5 study settings

setting
Learning rate   5 × 10^-5
batch size      32
Epoch           100
optimization    AdamW [Loshchilov+, 17]
loss function   cross entropy
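For orientation, the settings above can be collected in a small config object and combined with the 1,500 training examples to estimate the number of optimizer updates. This is a sketch under stated assumptions: the class and function names are hypothetical, and the step count assumes the last partial batch is kept and no gradient accumulation is used, neither of which the slide specifies.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class T5Settings:
    learning_rate: float = 5e-5
    batch_size: int = 32
    epochs: int = 100
    optimizer: str = "AdamW"
    loss: str = "cross entropy"


def total_update_steps(settings, num_train_examples):
    """Optimizer steps over the whole run, assuming the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_train_examples / settings.batch_size)
    return steps_per_epoch * settings.epochs


settings = T5Settings()
steps = total_update_steps(settings, num_train_examples=1500)  # 47 steps per epoch x 100 epochs
```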

Results using T5
Results with test data

Model                               Epoch  BLEU↑  ROUGE-2↑  ROUGE-L↑
sonoisa/t5-base-japanese            81     95.2   64.2      74.6
megagonlabs/t5-base-japanese-web    93     81.6   56.6      67.7
nlp-waseda/comet-t5-base-japanese   98     80.9   56.2      67.4

Result - Range i -
Generated statement by model: A red cylinder is repelled by a green sphere.
Object A: red cylinder, Object B: green sphere

Floor: slippery / mass: A = B / speed: A = B
  Generation text based on physical common sense: Red cylinder and green sphere collide and both are bounced off in opposite directions.
  Example of correct answer by human operators: Red cylinder strikes a green sphere with the same velocity, and the green sphere bounces off into the distance.

Floor: slippery / mass: − / speed: A < B
  Generation text based on physical common sense: Green sphere collides with red cylinder with such force that the red cylinder is bounced off into the distance.
  Example of correct answer by human operators: Red cylinder and green sphere collide, Red cylinder is bounced a little and Green sphere is bounced a little.

Floor: rough / mass: A < B / speed: A > B
  Generation text based on physical common sense: Red cylinders collide with green spheres with such force that the green spheres are bounced away from the red cylinders.
  Example of correct answer by human operators: The red cylinder hits the green sphere with great force, and the red cylinder bounces back just a little.

Floor: − / mass: A > B / speed: A < B
  Generation text based on physical common sense: The green sphere collides with the red cylinder with such force that the green sphere is bounced off.
  Example of correct answer by human operators: Green sphere strikes Red cylinder and the green sphere is bounced.

Result - Accuracy -

BLEU@4  BERTScore  BLEURT  ROUGE  Implication Full  Implication Division  G-EVAL-4o
0.55    0.82       0.49    0.56   0.67              0.80                  0.92

• G-EVAL-4o
  • the highest accuracy
  • generated sentence: "the scene of a collision between objects" and "the movement of objects after the collision."
• Implication Division
  1. whether the fact of collision and the color and shape of the objects involved in the collision were correct.
  2. whether the content of the movement after the collision was implied in the correct sentences.

Result - Accuracy -

BLEU@4  BERTScore  BLEURT  ROUGE  Implication Full  Implication Division  G-EVAL-4o
0.55    0.82       0.49    0.56   0.67              0.80                  0.92

• BLEU, BLEURT, and ROUGE
  • these indicators are evaluated based on the degree of word agreement between the correct sentence and the generated sentence after the addition of common sense.

Conclusion & Future works
• Predictive inference model to extract change points
• Models that can visually and physically predict change points in the observed environment
• Inference content is expressed as language to link the real world to language
• Linguistic generation of inferences based on experimental results
• Add conditions about environment and objects
• Regenerate sentences including human experience and commonsense knowledge

Future works
• Dataset is simple
• Replacing inference and language generation with an LLM
