How to Measure the Quality of AI-Generated Images

LINE Plus, Applied ML Dev.
Jongwoo Han, Heechan Kim, Suman Bae

Jongwoo Han
I joined LINE Plus in 2022 and am currently the lead of Applied ML Dev. My current interests are evaluation and content monitoring using vision models.

Agenda
- Introduction to Generated Image Evaluation
- Proposed Image Generation Pipeline
- Applications
  • Blackbox Optimization
  • Image Translation
  • Image Inpainting

Generated Image Evaluation
Why do we need it?

Beyond Pixels: Challenges in Evaluating Generated Images
Traditional Vision Task: Given Ground Truth
  Ground Truth | Prediction | Results
  Angry        | Angry      | O
  Happy        | Angry      | X
Images Generation Task: No Ground Truth
  Prompt: "Draw a man holding a tennis racket in a LINE-Style illustration."

Criteria for Generated Image Evaluation
• Visual Quality / Aesthetics
• Prompt Alignment
• Originality
• Photo Realism
• Toxicity / Fairness
• etc.
https://techblog.lycorp.co.jp/ko/how-to-evaluate-ai-generated-images-1

Visual Quality / Aesthetics
Distribution Based Approach
- IS (Inception Score)
- FID (Fréchet Inception Distance)
- Diagram: Distribution of Trained Images vs. Distribution of Generated Images
Instance Based Approach
- LAION aesthetics score
- CLIP-IQA
- Q-Align
- Diagram: Evaluation Model Training Using Aesthetics Datasets; Generated Image → Quality Evaluation Model → Evaluation Result
https://techblog.lycorp.co.jp/ko/how-to-evaluate-ai-generated-images-1
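As a rough illustration of the distribution-based approach, the sketch below computes the Fréchet distance that FID is built on: both image sets are summarized by the mean and covariance of their feature vectors, and the two Gaussians are compared. It assumes the features have already been extracted (FID uses Inception features); the function name and the random stand-in features are hypothetical and only serve the example.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Fréchet distance between Gaussians fitted to two feature sets of shape (n, d)."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g, which are real and non-negative in theory.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * trace_sqrt)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.normal(size=(500, 4))           # stand-in for real-image features
    gen = rng.normal(loc=0.5, size=(500, 4))   # stand-in for generated-image features
    print(frechet_distance(real, gen))
```

A lower value means the two feature distributions are closer. The score describes a whole set of generated images, not a single image.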

Prompt Alignment
Direct Approach
- CLIP Score
- Diagram: Generated Image → Image Embedder, Text Prompt → Text Embedder → Evaluation Result
QA (question-answering) Approach
- VQA (visual-question-answering) Score
- Prompt: "Does this figure show {Gen. Prompt}?"
- Diagram: Generated Image → QA Based Evaluation → Evaluation Result
https://techblog.lycorp.co.jp/ko/how-to-evaluate-ai-generated-images-1
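The two approaches on this slide can be sketched as follows. The direct approach compares the image embedding with the text embedding by cosine similarity. The QA approach turns the generation prompt into a yes/no question for a VQA model. This is a minimal sketch that assumes the embeddings come from a CLIP-like image and text embedder. The function names are hypothetical, and `w=2.5` is the rescaling constant commonly used with CLIPScore, not a value stated on the slide.

```python
import numpy as np

def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore-style alignment: w * max(cos(image_emb, text_emb), 0)."""
    image_emb = np.asarray(image_emb, dtype=float)
    text_emb = np.asarray(text_emb, dtype=float)
    cos = image_emb @ text_emb / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb))
    return float(w * max(cos, 0.0))

def vqa_question(gen_prompt):
    """Fill the slide's question template with the generation prompt."""
    return f"Does this figure show {gen_prompt}?"

if __name__ == "__main__":
    # Toy vectors standing in for real embeddings.
    print(clip_score([0.2, 0.9, 0.1], [0.1, 0.8, 0.3]))
    print(vqa_question("a man holding a tennis racket"))
```

In a QA-based score, the model's probability of answering "yes" to the question would serve as the evaluation result.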

Proposed Image Generation Pipeline
Training Phase: Image Generation Model → Generated Image → Auto Evaluation → Loss
Inference Phase: Image Generation Model → Generated Image → Auto Evaluation → Satisfied? (Yes → Output, No → generate again)
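The inference-phase loop can be sketched as below: generate an image, score it automatically, and stop once the score is good enough. The `generate` and `evaluate` callables, the threshold, and the attempt limit are all hypothetical placeholders. The slide only shows the "Satisfied?" decision, not how it is made.

```python
def generate_until_satisfied(generate, evaluate, threshold, max_attempts=5):
    """Regenerate until the auto-evaluation score reaches the threshold.

    generate(attempt) -> image and evaluate(image) -> float are supplied by
    the caller. Returns the best (image, score) pair seen so far.
    """
    best_image, best_score = None, float("-inf")
    for attempt in range(max_attempts):
        image = generate(attempt)      # e.g. use the attempt index as the seed
        score = evaluate(image)
        if score > best_score:
            best_image, best_score = image, score
        if score >= threshold:         # "Satisfied? Yes" -> output
            break
    return best_image, best_score

if __name__ == "__main__":
    # Toy stand-ins: the "image" is just the seed, the score grows with it.
    print(generate_until_satisfied(lambda seed: seed, lambda img: img * 10, threshold=20))
```

Keeping the best candidate means the loop still returns something usable when no attempt reaches the threshold.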

Practical Cases
How to apply the generated image evaluation to improve quality
Applications
• Blackbox Optimization
• Image Translation
• Image Inpainting

Heechan Kim
I joined LINE Plus in 2021. Within the team, I am developing various AI models and services according to the diverse needs within the company. Currently, I am very interested in generative AI models and their evaluation.

Image Gen. with Blackbox Opt.
Applications
• Blackbox Optimization
• Image Translation
• Image Inpainting

How Challenging is Image Gen.?
Same Model, Same Prompt but Different Results with Different Hyperparameters
Model: Finetuned StableDiffusion3.5 Model
Prompt: A woman is holding a picket sign. The sign has the words "hold the LINE" written on it.
Images: #1 Setting (Hyperparameters), #2 Setting (Hyperparameters)

Find Good Starting Point
⎯ Attend-and-Excite
⎯ InitNO
Measure relevance using attention map between noise and prompt
Mining seeds using initial noise and prompt
Prompt: A cat and a rabbit

Measure Goodness of Image
Average out various scores
- CLIPScore
- HPS v2
- PickScore
- VQAScore
Selected Scorer
- #1 Hyperparameters: 0.2395
- #2 Hyperparameters: 0.2243
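"Average out various scores" can be sketched as a simple ensemble: call each scorer on the same image and prompt, then take the mean. The scorer functions below are hypothetical stubs standing in for CLIPScore, HPS v2, PickScore, and VQAScore. Since the real metrics use different scales, they may need normalizing before averaging. That step is an assumption and is not shown on the slide.

```python
def average_score(image, prompt, scorers):
    """Mean of several scorer outputs for one (image, prompt) pair.

    scorers maps a name to a callable scorer(image, prompt) -> float.
    """
    if not scorers:
        raise ValueError("at least one scorer is required")
    values = [scorer(image, prompt) for scorer in scorers.values()]
    return sum(values) / len(values)

if __name__ == "__main__":
    # Stub scorers returning fixed values, for illustration only.
    stub_scorers = {
        "clip_score": lambda image, prompt: 0.31,
        "hps_v2": lambda image, prompt: 0.27,
        "pick_score": lambda image, prompt: 0.21,
        "vqa_score": lambda image, prompt: 0.17,
    }
    print(average_score("image.png", "A cat and a rabbit", stub_scorers))
```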

Apply Blackbox Optimization
Find good hyperparameters based on the evaluation result
Minimize the effort of selecting hyperparameters
Diagram: Seed, Prompt → Image Generation Model → Generated Image → Auto Evaluation → End? (Yes → Summary, No → Blackbox Opt. → Hyperparams → Image Generation Model)
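The loop in this diagram can be sketched with the simplest blackbox optimizer, random search: propose hyperparameters, generate, score the result, and keep the best setting. Random search is a stand-in chosen for brevity, since the slide does not name the optimizer that was actually used. The hyperparameter name `guidance_scale`, the search range, and the toy objective are hypothetical.

```python
import random

def random_search(generate, evaluate, search_space, n_trials=20, seed=0):
    """Blackbox optimization by random search over continuous hyperparameters.

    search_space maps a hyperparameter name to a (low, high) range.
    generate(params) -> image and evaluate(image) -> float come from the caller.
    """
    rng = random.Random(seed)
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.uniform(low, high) for name, (low, high) in search_space.items()}
        score = evaluate(generate(params))   # auto evaluation of the generated image
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

if __name__ == "__main__":
    # Toy objective: pretend the score peaks at guidance_scale = 7.
    space = {"guidance_scale": (1.0, 15.0)}
    best, score = random_search(lambda p: p, lambda img: -(img["guidance_scale"] - 7.0) ** 2, space)
    print(best, score)
```

A model-based optimizer, such as Bayesian optimization, would replace the random proposal step with one that uses the scores observed so far.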

Final Results with One Click
Candidates: #1 Hyperparameters, #2 Hyperparameters, #3 Hyperparameters
We can select the best image from these candidates.

Image Translation
Applications
• Blackbox Optimization
• Image Translation
• Image Inpainting

Can We Measure Quality of Anything?
How good is the translated result?
Images: Original Image, Translated Image

Rubric Mining and Eval. with VLM
Find good criteria, then evaluate the image with the criteria
Mining Phase: Seed rubric + Image → VLM → Refined rubric
Evaluation Phase: Refined rubric + Image → VLM → Score
Rubric excerpts shown on the slide:
  "criteria": "Visual Consistency",
  "description": "Determine how well the visual style and layout are preserved during translation, maintaining a consistent appearance with the original.",
  "scoring": { "5": "Visual style and layout are perfectly…

  "criteria 1": {
    "description": "The translated image should be similar to the reference image in terms of text block position.",
    "score": { "0": "The translated image is not similar to the reference image in terms of text block position.", …
Score excerpt: "Visual Consistency: 5" …
Prometheus-Vision
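The evaluation phase can be sketched as two small helpers: one turns a rubric into a text prompt for the VLM, and one parses replies of the form "Visual Consistency: 5" into numeric scores. The actual VLM call is left out. The prompt wording, the function names, and the assumed reply format are hypothetical and not taken from the slide's implementation.

```python
import re

def build_eval_prompt(rubric):
    """Turn a list of {"criteria", "description"} items into an instruction for a VLM."""
    lines = ["Evaluate the translated image on each criterion and answer as '<criteria>: <score>'."]
    for item in rubric:
        lines.append(f"- {item['criteria']}: {item['description']}")
    return "\n".join(lines)

def parse_scores(vlm_output):
    """Extract '<criteria>: <integer>' lines from the VLM's text reply."""
    pattern = re.compile(r'^\s*"?([^":\n]+?)\s*:\s*(\d+)"?\s*$', re.MULTILINE)
    return {name.strip(): int(score) for name, score in pattern.findall(vlm_output)}

if __name__ == "__main__":
    rubric = [{
        "criteria": "Visual Consistency",
        "description": "Determine how well the visual style and layout are preserved during translation.",
    }]
    print(build_eval_prompt(rubric))
    print(parse_scores('"Visual Consistency: 5"'))
```

Lines that do not match the expected format are skipped, so a reply that drifts from the format yields fewer scores instead of raising an error.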

Refined Rubrics and Evaluation Results
The VLM can evaluate the image using the rubrics, but are these scores acceptable?
- Text Chunking and Blocking: 4
- Cultural Context and Appropriateness: 4
- Text Positioning and Layout Consistency: 5
- Numerical and Symbol Accuracy: 4
- Visual Clarity and Quality: 5

Limitation and Future Plan
We can go further
Image Gen. w. B.O.
- The score is not sensitive to a specific style
- Find a measure which can capture the characteristics of the style
Evaluation with VLM
- The VLM ignores detailed image features
- Explore methods to inject detailed image features into the VLM

Suman Bae
I joined LINE Plus in 2021 and have developed various ML models related to vision. Currently, I am very interested in applications using multimodal and generative models.

Image Inpainting
How to apply the generated image evaluation to improve quality
Applications
• Blackbox Optimization
• Image Translation
• Image Inpainting

Application of Image Inpainting
Background Person Removal
- Original Photo: Unexpected people in the photo
- Photo with background people removed: Clear background

Background Person Removal Pipeline
Original Img → Instance Segmentation + Salient Object Detection → Inpainting Area Selection → Inpainting → Output Img
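One plausible reading of the "Inpainting Area Selection" step is sketched below: a person instance becomes part of the inpainting area when it barely overlaps the salient-object mask, because it is then likely to be a background person and not the main subject. The overlap rule, the threshold, and the function name are assumptions made for this sketch. The slide only names the pipeline stages.

```python
import numpy as np

def select_inpainting_area(person_masks, salient_mask, max_salient_overlap=0.5):
    """Union of person masks that mostly fall outside the salient-object mask.

    person_masks: iterable of boolean (H, W) arrays from instance segmentation.
    salient_mask: boolean (H, W) array from salient object detection.
    """
    salient_mask = np.asarray(salient_mask, dtype=bool)
    area = np.zeros_like(salient_mask, dtype=bool)
    for mask in person_masks:
        mask = np.asarray(mask, dtype=bool)
        size = mask.sum()
        if size == 0:
            continue
        overlap = np.logical_and(mask, salient_mask).sum() / size
        if overlap < max_salient_overlap:   # mostly non-salient -> background person
            area |= mask
    return area

if __name__ == "__main__":
    salient = np.zeros((4, 4), dtype=bool)
    salient[:2, :2] = True                  # main subject in the top-left corner
    subject = salient.copy()                # person instance matching the subject
    passerby = np.zeros((4, 4), dtype=bool)
    passerby[2:, 2:] = True                 # person instance in the background
    print(select_inpainting_area([subject, passerby], salient).astype(int))
```

The resulting mask would then be passed to the inpainting model together with the original image.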

Simple Case
Images: Original Image, LaMa Model (WACV, 2022), HINT Model (TMM, 2024), FLUX.1-Fill-dev Model (Arxiv, 2024)

Challenging Case
Images: Original Image, LaMa Model (WACV, 2022), HINT Model (TMM, 2024), FLUX.1-Fill-dev Model (Arxiv, 2024)

Evaluation Metrics
Single-Image Based Image Quality Assessment
- LAION aesthetics score-v2
- CLIP-IQA (AAAI, 2023)
- Q-Align (ICML, 2024)
Distribution-Based Image Quality Assessment
- FID (NeurIPS, 2017)
- FD-Dino (NeurIPS, 2023)
- CMMD (CVPR, 2023)
Image Quality Assessment with Prompt Alignment
- ImageReward (NeurIPS, 2023)
- HPS v2 (Arxiv, 2023)
- PickScore (NeurIPS, 2024)
https://techblog.lycorp.co.jp/ko/how-to-evaluate-ai-generated-images-3-inpainting

Evaluation Model and Datasets
Evaluation Model
- LaMa (WACV 2022)
- MAT (CVPR 2022)
- CoordFILL (AAAI 2023)
- SCAT (AAAI 2023)
- HINT (TMM 2024)
- MxT (BMVC 2024)
- PUT (TPAMI 2024)
- Latent Codes for Pluralistic Image Inpainting (CVPR 2024)
- FLUX.1-Fill-dev (Arxiv 2024)
Evaluation Dataset
- Places365-Standard validation dataset: 36,500 images
- Challenging case of background person removal dataset: 10 images
https://techblog.lycorp.co.jp/ko/how-to-evaluate-ai-generated-images-3-inpainting

Evaluation Result
Places365-Standard validation dataset
Pearson Correlation Coefficient with Human Evaluation
- CMMD: 0.898
- LAION aesthetics score-v2: 0.877
- FID: 0.877
- Q-Align: 0.843
- PickScore: 0.648
- FD-Dino: 0.604
- ImageReward: 0.428
- HPS v2: 0.387
- CLIP-IQA: 0.063
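For reference, the Pearson correlation coefficient reported on these result slides measures how linearly a metric's scores track the human scores. The sketch below is a minimal version using made-up numbers. It is not the evaluation data, and the slide does not state how the sign is handled for lower-is-better metrics such as FID.

```python
import numpy as np

def pearson_r(metric_scores, human_scores):
    """Pearson correlation between automatic metric scores and human scores."""
    x = np.asarray(metric_scores, dtype=float)
    y = np.asarray(human_scores, dtype=float)
    x_c, y_c = x - x.mean(), y - y.mean()   # center both series
    return float((x_c @ y_c) / np.sqrt((x_c @ x_c) * (y_c @ y_c)))

if __name__ == "__main__":
    # Hypothetical per-model scores, for illustration only.
    metric = [5.1, 4.8, 4.2, 3.9, 3.0]
    human = [4.6, 4.7, 3.8, 3.5, 2.9]
    print(pearson_r(metric, human))
```

Values close to 1 mean the metric ranks and spaces the results much like human evaluators do. Values near 0 mean there is almost no linear relationship.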

Evaluation Result
Challenging case of background person removal dataset
Pearson Correlation Coefficient with Human Evaluation
- LAION aesthetics score-v2: 0.924
- Q-Align: 0.384
- HPS v2: -0.290
- PickScore: 0.282
- ImageReward: 0.279
- CLIP-IQA: 0.187

Summary
• Since various results can be considered correct for image generation models, extensive studies are actively being conducted on how best to mimic image quality as perceived by humans.
• Our team is researching these image generation evaluation methods and applying them to various applications.
• Since image generation models are expected to be used in various fields in the future, this evaluation area is expected to become increasingly important.

Reference
• Tech Blog
  • https://techblog.lycorp.co.jp/en/how-to-evaluate-ai-generated-images-1
    • EN Released, JP Release: 7/16
  • https://techblog.lycorp.co.jp/en/how-to-evaluate-ai-generated-images-2-blackbox-optimization
    • EN Released, JP Release: 7/28
  • https://techblog.lycorp.co.jp/en/how-to-evaluate-ai-generated-images-3-inpainting
    • EN Released, JP Release: 8/8
