[IROS24] Object Segmentation from Open-Vocabulary Manipulation Instructions Based on Optimal Transport Polygon Matching with Multimodal Foundation Models
pillow on the smaller white sofa, the pillow closest to the plant on the small table.”
[Figure: qualitative comparison of segmentation results for this instruction. Panels: Ours; EVF-SAM-2 [Zhang+, 24]]
Even SOTA foundation models struggle with our task.
Main novelty: Polygon Matching Loss based on Optimal Transport
Existing methods: the polygon’s vertex order must be the same in the predicted mask and the ground truth mask.
[Figure: predicted mask and ground truth mask polygons. Panels: Existing methods; Our method]
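To illustrate why vertex order matters, here is a minimal sketch of an optimal-transport-style polygon loss. It is not the authors' implementation: the slides do not give the loss formula, so the Euclidean cost, the uniform vertex weights, the entropic (Sinkhorn) solver, and the names `sinkhorn_plan`, `ot_polygon_loss`, and `ordered_l1_loss` are all assumptions made for this example. It compares an index-by-index L1 loss with an OT loss on the same polygon whose vertex list has been cyclically shifted.

```python
import numpy as np


def sinkhorn_plan(cost, eps=0.05, n_iters=200):
    """Entropic OT plan between two uniform discrete distributions.

    Assumption: uniform weights on vertices; eps and n_iters are
    illustrative values, not taken from the slides.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)  # uniform mass on predicted vertices
    b = np.full(m, 1.0 / m)  # uniform mass on ground-truth vertices
    K = np.exp(-cost / eps)  # Gibbs kernel
    u = np.ones(n)
    v = np.ones(m)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u[:, None] * K * v[None, :]


def ot_polygon_loss(pred, gt, eps=0.05, n_iters=200):
    """Transport cost between two vertex sets of shape (N, 2) and (M, 2).

    Coordinates are assumed to be normalized to [0, 1].
    """
    # Pairwise Euclidean distances between predicted and GT vertices.
    cost = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    plan = sinkhorn_plan(cost, eps, n_iters)
    return float((plan * cost).sum())


def ordered_l1_loss(pred, gt):
    """Index-by-index L1 loss; requires the same vertex count and order."""
    return float(np.abs(pred - gt).mean())


# Unit square, and the same square with its vertex list shifted by one.
gt = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
shifted = np.roll(gt, 1, axis=0)

print("ordered L1 loss:", ordered_l1_loss(shifted, gt))
print("OT polygon loss:", ot_polygon_loss(shifted, gt))
```

Both vertex lists describe the same square, yet the index-by-index loss penalizes the shift, while the OT loss stays near zero because it matches vertices by position rather than by index. In a training setting the same idea would need a differentiable implementation (for example in an autodiff framework), which this NumPy sketch does not provide.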