Many Main Conference papers focus on "image" generation, while the recent industrial spotlight is on "video" generation ▪ Growing presence of Adobe in the Content Creation area ▪ The Future of Video Generation: Beyond Data and Scale (Tali Dekel, Weizmann Institute) ▪ Listed the open challenges of video generation: physical consistency, plus weak controllability over camera pose, geometry, characters' identity, movements, and emotions ▪ Foundation models vs. task-specific models ▪ Strength of foundation models: space-time priors ▪ Strengths of specialized models: compute cost and controllability ▪ Stressed that combining the two is key Workshops Pickup: Video Generation
The Future of Video Generation: Beyond Data and Scale (Tali Dekel, Weizmann Institute) ▪ Space-Time Features for Text-driven Motion Transfer, CVPR 2024 ▪ Defines a feature descriptor (SMM) that is robust to changes in appearance and shape ▪ Introduces a space-time feature loss so that even temporally high-frequency motion is reproduced (see the sketch below) ▪ Extracts prior information from pretrained Text2Video diffusion models and distills it into a specialized model Workshops Pickup: Video Generation Source: https://diffusion-motion-transfer.github.io/
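As a rough illustration of the space-time feature loss, the hedged sketch below treats the SMM descriptor as a per-frame spatial mean of diffusion features and matches pairwise frame-to-frame differences between the reference and generated videos. The tensor shapes, the spatial mean, and the pairwise-difference loss are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def smm_descriptor(feats: torch.Tensor) -> torch.Tensor:
    """Spatial marginal mean: average diffusion features over space, per frame.
    feats: (T, C, H, W) features of a video -> (T, C) one descriptor per frame."""
    return feats.mean(dim=(2, 3))

def spacetime_feature_loss(gen_feats: torch.Tensor, ref_feats: torch.Tensor) -> torch.Tensor:
    """Match pairwise frame-to-frame differences of SMM descriptors so the
    edited video follows the reference motion (including high-frequency
    temporal changes) without copying its appearance."""
    g, r = smm_descriptor(gen_feats), smm_descriptor(ref_feats)
    g_diff = g[:, None, :] - g[None, :, :]   # (T, T, C) pairwise frame differences
    r_diff = r[:, None, :] - r[None, :, :]
    return (g_diff - r_diff).pow(2).mean()

# toy usage: random tensors stand in for Text2Video diffusion features
T, C, H, W = 8, 64, 16, 16
loss = spacetime_feature_loss(torch.randn(T, C, H, W), torch.randn(T, C, H, W))
```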
Workshop List ▪ Track on 3D Vision ▪ 23612 2nd Workshop on Compositional 3D Vision ▪ 23611 - 3rd Monocular Depth Estimation Challenge ▪ 23609 7th International Workshop on Visual Odometry and Computer Vision Applications... ▪ 23610 Second Workshop for Learning 3D with Multi View Supervision ▪ 23608 ViLMa – Visual Localization and Mapping ▪ Track on Applications ▪ 23687 10th IEEE International Workshop on Computer Vision in Sports ▪ 23681 Agriculture Vision Challenges & Opportunities for Computer Vision in Agriculture ▪ 23685 GAZE 2024 6th Workshop on Gaze Estimation and Prediction in the Wild ▪ 23684 MetaFood Workshop MTF ▪ 23683 RetailVision Field Overview and Amazon Deep Dive ▪ 23688 - Workshop on Virtual Try-On Workshops Topics & Records
Assistive Technology ▪ 23652 VizWiz Grand Challenge Describing Images and Videos Taken by Blind People ▪ Track on Assortment of Recognition Topics ▪ 23583 2nd Workshop on Scene Graphs and Graph Representation Learning ▪ 23584 Image Matching Local Features and Beyond ▪ Track on Autonomous Driving ▪ 23648 7th Workshop on Autonomous Driving WAD ▪ 23649 Data Driven Autonomous Driving Simulation DDASD ▪ 23651 Populating Empty Cities – Virtual Humans for Robotics and Autonomous Driving ▪ 23650 Vision and Language for Autonomous Driving and Robotics VLADR ▪ Track on Biometrics and Forensics ▪ 23641 6th Workshop and Competition on Affective Behavior Analysis in the wild ▪ 23637 The 5th Face Anti Spoofing Workshop Workshops Topics & Records
Computational Photography ▪ 23626 20th Workshop on Perception Beyond the Visible Spectrum ▪ 23627 The 5th Omnidirectional Computer Vision Workshop ▪ 23624 The 7th Workshop and Challenge Bridging the Gap between Computational Photography and Visual... ▪ Track on Contemporary discussions and Community building ▪ 23622 LatinX in Computer Vision Research Workshop ▪ 23621 Women in Computer Vision ▪ Track on Content Creation ▪ 23633 AI for 3D Generation ▪ 23632 AI for Content Creation AI4CC ▪ 23631 The Future of Generative Visual Art ▪ 23635 Workshop on Computer Vision for Fashion, Art, and Design ▪ 23634 Workshop on Graphic Design Understanding and Generation GDUG Workshops Topics & Records
Efficient Methods ▪ 23578 Efficient Large Vision Models ▪ 23576 Fifth Workshop on Neural Architecture Search ▪ Track on Egocentric & Embodied AI ▪ 23596 First Joint Egocentric Vision EgoVis Workshop ▪ 23598 The 5th Annual Embodied AI Workshop ▪ Track on Emerging Learning Paradigms ▪ 23591 1st Workshop on Dataset Distillation for Computer Vision ▪ Track on Emerging Topics ▪ 23572 Equivariant Vision From Theory to Practice ▪ Track on Foundation Models ▪ 23667 2nd Workshop on Foundation Models ▪ 23668 Foundation Models for Autonomous Systems ▪ 23670 Towards 3D Foundation Models Progress and Prospects Workshops Topics & Records
Generative Models ▪ 23672 - 2nd Workshop on Generative Models for Computer Vision ▪ 23675 First Workshop on Efficient and On Device Generation EDGE ▪ 23676 GenAI Media Generation Challenge for Computer Vision Workshop ▪ 23674 ReGenAI First Workshop on Responsible Generative AI ▪ 23673 The First Workshop on the Evaluation of Generative Foundation Models ▪ Track on Human Understanding ▪ 23604 New Challenges in 3D Human Understanding ▪ 23606 Workshop on Human Motion Generation ▪ Track on Medical Vision ▪ 23664 9th Workshop on Computer Vision for Microscopy Image Analysis ▪ 23665 Data Curation and Augmentation in Enhancing Medical Imaging Applications ▪ 23663 Domain adaptation, Explainability and Fairness in AI for Medical Image Analysis Workshops Topics & Records
Mobile and Embedded Vision ▪ 23628 Third Workshop of Mobile Intelligent Photography & Imaging ▪ Track on Multimodal Learning ▪ 23567 7th MUltimodal Learning and Applications ▪ 23568 Multimodal Algorithmic Reasoning Workshop ▪ Track on Open World Learning ▪ 23594 VAND 2 0 Visual Anomaly and Novelty Detection ▪ 23595 Visual Perception via Learning in an Open World ▪ Track on Physics, Graphics, Geometry, AR/VR/MR ▪ 23616 Computer Vision for Mixed Reality ▪ 23618 The Sixth Workshop on Deep Learning for Geometric Computing DLGC 2024 Workshops Topics & Records
Responsible and Explainable AI ▪ 23642 2nd Workshop on Multimodal Content Moderation ▪ 23643 The 3rd Explainable AI for Computer Vision XAI4CV Workshop ▪ 23644 The Fifth Workshop on Fair, Data efficient, and Trusted Computer Vision ▪ 23645 Workshop on Responsible Data ▪ Track on Science Applications ▪ 23658 4th Workshop on CV4Animals Computer Vision for Animal Behavior Tracking and Modeling ▪ 23659 AI4Space 2024 ▪ 23660 Computer Vision for Materials Science Workshop ▪ 23661 The Seventh International Workshop on Computer Vision for Physiological Measurement CVPM ▪ Track on Synthetic Data ▪ 23678 SyntaGen Harnessing Generative Models for Synthetic Visual Datasets ▪ 23677 Synthetic Data for Computer Vision Workshops Topics & Records
Urban Environments ▪ 23654 1st Workshop on Urban Scene Modeling ▪ 23656 8th AI City Challenge ▪ Track on Video Understanding ▪ 23567 7th MUltimodal Learning and Applications ▪ 23568 Multimodal Algorithmic Reasoning Workshop ▪ 23603 Learning from Procedural Videos and Language What is Next? Workshops Topics & Records
Searching YouTube for the titles below will bring up the recordings (ComputerVisionFoundation Videos channel). ▪ 23728 Learning Deep Low dimensional Models ▪ 23735 Efficient Homotopy ▪ 23730 Diffusion based Video ▪ 23726 Unifying Graph Neural Networks ▪ 23736 Computational Design of Diverse Morphologies and Sensors for Vision and Robotics ▪ 23725 All You Need To Know about Point Cloud Understanding ▪ 23724 Machine Unlearning in Computer Vision ▪ 23721 Deep Stereo Matching in the Twenties ▪ 23720 All You Need to Know about Self Driving ▪ 23713 Towards Building AGI in Autonomy and Robotics ▪ 23717 From Multimodal LLM to Human level AI ▪ 23715 Contactless AI Healthcare using Cameras and Wireless Sensors ▪ 23716 Disentanglement and Compositionality in Computer Vision ▪ 27319 Edge Optimized Deep Learning ▪ 23733 Full Stack, GPU based Acceleration Tutorials Topics & Records
Searching YouTube for the titles below will bring up the recordings (ComputerVisionFoundation Videos channel). ▪ 23731 Robustness at Interference ▪ 23729 Object centric Representations in Computer Vision ▪ 23727 Geospatial Computer Vision and Machine Learning for Large Scale Earth Observation Data ▪ 23718 3D 4D Generation and Modeling with Generative Priors ▪ 23734 SCENIC An Open Source Probabilistic Programming System for Data Generation ▪ 23722 End to End Autonomy A New Era of Self Driving ▪ 23714 Edge AI in Action Practical Approaches to Developing and Deploying Optimized Models Tutorials Topics & Records
Google Research ▪ Generates seamless looping videos and interactive dynamics from a single still image ▪ Rich Human Feedback for Text-to-Image Generation (lower-right figure; detailed later) ▪ Google Research et al. ▪ A human-feedback dataset and a feedback-prediction model for improving Text2Image models Main Conference: Awarded Papers
Gaussian Splatting ▪ Suppresses rendering errors of 3DGS across different rendering scales with a 2D Mip filter and related techniques ▪ BioCLIP: A Vision Foundation Model for the Tree of Life ▪ A CLIP specialized for biology: construction of a large-scale dataset and generalization that exploits the taxonomic hierarchy ▪ Best Student Paper Runners-Up ▪ SpiderMatch: 3D Shape Matching with Global Optimality and Geometric Consistency | CVF Open Access ▪ A method that efficiently searches for globally optimal solutions in 3D shape matching while preserving geometric consistency ▪ Image Processing GNN: Breaking Rigidity in Super-Resolution ▪ Uses a GNN to aggregate information flexibly at the pixel level for super-resolution ▪ Objects as volumes: A stochastic geometry view of opaque solids ▪ A theoretical treatment of volumetric representations for 3D reconstruction of opaque solids ▪ Comparing the Decision-Making Mechanisms by Transformers and CNNs via Explanation Methods ▪ Compares the decision-making mechanisms of Transformers and CNNs using explanation methods (highlighting the importance of LayerNorm) Main Conference: Awarded Papers
for Fast and Consistent Video Editing with Diffusion Models ▪ GitHub Stars: 200 / License: MIT / Highlight ▪ Zero-shot video editing (translation) that leverages a diffusion-based Text2Image model Main Conference: Commercially Usable Licenses & Notable Papers
for Images and Videos at Scale ▪ GitHub Stars: 1000 / License: MIT / Highlight ▪ A foundation model that enables zero-shot inference across a variety of image and video tasks Main Conference: Commercially Usable Licenses & Notable Papers
Representation for a Variety of Vision Tasks ▪ Hugging Face Stars: 900 / License: MIT / Oral ▪ A foundation model for a variety of image tasks driven by text prompts Main Conference: Commercially Usable Licenses & Notable Papers
Text-to-Image Generation ▪ Generative Image Dynamics ▪ Orals ▪ FreeU: Free Lunch in Diffusion U-Net ▪ Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models ▪ Style Aligned Image Generation via Shared Attention ▪ Instruct-Imagen: Image Generation with Multi-modal Instruction ▪ Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following ▪ Attention Calibration for Disentangled Text-to-Image Personalization ▪ Alchemist: Parametric Control of Material Properties with Diffusion Models ▪ Analyzing and Improving the Training Dynamics of Diffusion Models ▪ MonoHair: High-Fidelity Hair Modeling from a Monocular Video Main Conference: Image & Video Synthesis
Generation ▪ Generative Image Dynamics ▪ Orals ▪ FreeU: Free Lunch in Diffusion U-Net ▪ Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models ▪ Style Aligned Image Generation via Shared Attention ▪ Instruct-Imagen: Image Generation with Multi-modal Instruction ▪ Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following ▪ Attention Calibration for Disentangled Text-to-Image Personalization ▪ Alchemist: Parametric Control of Material Properties with Diffusion Models ▪ Analyzing and Improving the Training Dynamics of Diffusion Models ▪ MonoHair: High-Fidelity Hair Modeling from a Monocular Video Main Conference: Image & Video Synthesis
scripts: https://github.com/google-research/google-research/tree/master/richhf_18k arxiv: https://arxiv.org/abs/2312.10240 Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, Jiao Sun, Jordi Pont-Tuset, Sarah Young, Feng Yang, Junjie Ke, Krishnamurthy Dj Dvijotham, Katherine M. Collins, Yiwen Luo, Yang Li, Kai J Kohlhoff, Deepak Ramachandran, and Vidhya Navalpakkam Best Paper https://openaccess.thecvf.com/content/CVPR2024/html/Liang_Rich_Human_Feedback_for_Text-to-Image_Generation_CVPR_2024_paper.html
Creation of a rich-human-feedback dataset (RichHF-18K) ▪ A model that predicts rich human feedback for generated images ▪ Methods for refining Text-to-Image generation using the rich human feedback Summary [Summary figure: RichHF-18K dataset → Rich Automatic Human Feedback (RAHF) model → refining Text-to-Image models via model finetuning, universal guidance (w/o guidance vs. w/ score guidance), and region inpainting] Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024.
(RichHF-18K) Data Collection: RichHF-18K [Figure 1. An illustration of our annotation UI] Annotators mark points on the image: 1. artifact/implausible regions (red points) and 2. regions misaligned with the text prompt (blue points); mark words (underline & shading): 3. incorrect keywords; and select scores: 4. plausibility, 5. text-image alignment, 6. aesthetics, 7. overall quality (see the record sketch below). Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024. (16K training, 1K validation, 1K test samples) [Figure 2. Histograms of the average scores of image-text pairs in the training set.] From the Pick-a-Pic dataset (NeurIPS 2023; 35,000 prompts, 500,000 generated images, pairwise preferences), 18K diverse images were selected using PaLI (ICLR 2023) attributes and then annotated. Annotations [Figure 10. Histograms of the PaLI attributes of the images in the training set.] Diverse image selection, both positive and negative examples, and a balanced score distribution
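For illustration only, the hypothetical record type below mirrors the annotation categories listed above (point marks, word marks, and four scores); the field names and types are assumptions, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class RichFeedbackAnnotation:
    """Hypothetical container mirroring the RichHF-18K annotation types."""
    prompt: str
    image_path: str
    artifact_points: List[Tuple[float, float]] = field(default_factory=list)      # red points: implausible regions
    misalignment_points: List[Tuple[float, float]] = field(default_factory=list)  # blue points: regions not matching the prompt
    misaligned_word_indices: List[int] = field(default_factory=list)              # incorrect keywords in the prompt
    plausibility: float = 0.0          # score 4
    text_image_alignment: float = 0.0  # score 5
    aesthetics: float = 0.0            # score 6
    overall_quality: float = 0.0       # score 7

sample = RichFeedbackAnnotation(
    prompt="A snake on a mushroom.",
    image_path="example.png",
    artifact_points=[(0.42, 0.31)],
    misaligned_word_indices=[1],
    plausibility=3.0, text_image_alignment=4.0, aesthetics=3.5, overall_quality=3.0,
)
```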
implausibility heatmaps] [Figure 6. Examples of misalignment heatmaps] Implausibility heatmap / text-misalignment heatmap. Prompt: A snake on a mushroom. Heatmap Prediction [Table 3. Text misalignment heatmap prediction results on the test set.] * GT = 0: empty heatmap (good images: 144/995, 14% of the test set); GT > 0: non-empty heatmap (problematic images: 851/995, 86% of the test set) Prompt: photo of a slim asian little girl ballerina with long hair wearing white tights running on a beach from behind nikon D5 [Figure 5. Examples of implausibility heatmaps] Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024. Metrics are computed on the test set between model-predicted and human-annotated (GT) heatmaps/points (a toy comparison follows below).
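To make the comparison concrete, the hedged toy snippet below scores a predicted heatmap against a ground-truth heatmap; MSE and Pearson correlation are simple stand-ins for the saliency-style metrics reported in the paper, and the array shapes are illustrative.

```python
import numpy as np

def heatmap_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Compare a predicted heatmap with a GT heatmap aggregated from annotator points."""
    pred = pred.astype(np.float64).ravel()
    gt = gt.astype(np.float64).ravel()
    mse = float(np.mean((pred - gt) ** 2))
    if pred.std() == 0 or gt.std() == 0:     # e.g. GT = 0: empty heatmap for a good image
        cc = float("nan")
    else:
        cc = float(np.corrcoef(pred, gt)[0, 1])
    return {"mse": mse, "cc": cc}

pred = np.random.rand(64, 64)                # stand-in for a RAHF heatmap prediction
gt = np.zeros((64, 64)); gt[20:30, 20:30] = 1.0
print(heatmap_metrics(pred, gt))
```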
results on the test set.] Score Prediction * PLCC: Pearson linear correlation coefficient; SRCC: Spearman rank correlation coefficient [Table 4. Text misalignment prediction results on the test set.] Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024. Metrics are computed on the test set between model predictions and human ratings (see the snippet below). Misaligned text prediction [Figure 7. Examples of ratings. "GT" is the ground-truth score (average score from three annotators).] Examples of ratings
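PLCC and SRCC are standard correlation measures; the snippet below computes them with SciPy on made-up score vectors (the numbers are purely illustrative).

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

# predicted scores vs. ground truth (average of three annotators), e.g. plausibility
predicted = np.array([0.61, 0.75, 0.42, 0.90, 0.33])
ground_truth = np.array([0.55, 0.80, 0.50, 0.85, 0.30])

plcc, _ = pearsonr(predicted, ground_truth)   # Pearson linear correlation coefficient
srcc, _ = spearmanr(predicted, ground_truth)  # Spearman rank correlation coefficient
print(f"PLCC={plcc:.3f}  SRCC={srcc:.3f}")
```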
latent diffusion models". In CVPR 2022. Learning from rich human feedback (LHF) Fine-tuning Muse [1] with predicted scores (via sample selection [2]) Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024. Predicted scores are used to select images for model finetuning (see the sketch below). [2] Jiao Sun, et al. "DreamSync: Aligning Text-to-Image Generation with Image Understanding Feedback". arXiv. [1] Huiwen Chang, et al. "Muse: Text-to-image generation via masked generative transformers". In ICML 2023. Human Evaluation Results [Table 5. Human Evaluation Results: Finetuned Muse vs original Muse model preference] Human evaluation (proportions): the finetuned Muse is rated much better (≫), slightly better (>), about the same (≈), slightly worse (<), or much worse (≪) than the original Muse [Top: Figure 8. Examples illustrating the impact of RAHF on generative models.] Prompt: A cat sleeping on the ground using a shoe as a pillow Before finetuning / After finetuning [Bottom: Figure 15. More examples illustrating the impact of RAHF on generative models.] Prompt: Three zebras are standing together in a line Images generated by the finetuned Muse are rated higher than those from the original Muse; since Muse [1] differs from the LDM [3] used to collect the feedback data, this suggests the approach generalizes across models.
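A hedged sketch of the sample-selection idea: generated (prompt, image) pairs are kept for finetuning only if the predicted score clears a threshold. The function `rahf_score_fn`, the threshold, and the data format are illustrative assumptions, not the paper's exact recipe.

```python
def select_finetuning_set(candidates, rahf_score_fn, threshold=0.8):
    """Keep generated (prompt, image) pairs whose predicted score exceeds a threshold;
    the retained set is then used to finetune the generator (DreamSync-style selection)."""
    return [(prompt, image) for prompt, image in candidates
            if rahf_score_fn(prompt, image) >= threshold]

# toy usage with a dummy scorer standing in for the RAHF score predictor
candidates = [("a cat sleeping on a shoe", f"img_{i}.png") for i in range(4)]
dummy_scorer = lambda prompt, image: 0.9 if image.endswith("0.png") else 0.5
finetune_set = select_finetuning_set(candidates, dummy_scorer)
```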
as universal guidance [4] (with Latent DM [3]) Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024. The predicted score is used as universal guidance to steer sampling (a sketch follows below). [4] Arpit Bansal, et al. "Universal guidance for diffusion models". arXiv. [3] Robin Rombach, et al. "High-resolution image synthesis with latent diffusion models". In CVPR 2022. [Figure 8. Examples illustrating the impact of RAHF on generative models.] Prompt: a macro lens closeup of a paperclip (w/o guidance vs. w/ score guidance) Prompt: Kitten sushi stained glass window sunset fog. (w/o guidance vs. w/ score guidance) [Figure 15. More examples illustrating the impact of RAHF on generative models.] Aesthetic score and overall score are used as the universal-guidance signal
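A minimal sketch of score-guided sampling in the spirit of universal guidance: the gradient of a predicted quality score with respect to the noisy sample nudges each denoising step. `denoiser`, `score_model`, the update formula, and the guidance scale are all stand-ins, not the actual method of Bansal et al. [4].

```python
import torch

def guided_step(x_t, t, denoiser, score_model, guidance_scale=0.1):
    """One denoising step nudged by the gradient of a predicted quality score."""
    x_t = x_t.detach().requires_grad_(True)
    x0_hat = denoiser(x_t, t)                  # rough clean-image estimate
    score = score_model(x0_hat).sum()          # predicted aesthetic/overall score
    grad = torch.autograd.grad(score, x_t)[0]  # direction that increases the score
    return (x0_hat + guidance_scale * grad).detach()

# toy usage with dummy stand-ins for the diffusion model and the RAHF scorer
denoiser = lambda x, t: 0.9 * x
score_model = lambda x: -x.pow(2).mean(dim=(1, 2, 3))
x = torch.randn(1, 3, 32, 32)
x = guided_step(x, t=10, denoiser=denoiser, score_model=score_model)
```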
Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024. Region inpainting with predicted heatmaps and score (via Muse inpainting) [Figure 9. Region inpainting with Muse [1] generative model.] [1] Huiwen Chang, et al. "Muse: Text-to-image generation via masked generative transformers". In ICML 2023.
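A hedged sketch of the region-inpainting refinement described above: the predicted implausibility heatmap is thresholded into a mask, the masked region is re-generated several times by an inpainting model (e.g. Muse inpainting), and the candidate with the best predicted score is kept. The function names, threshold, and sample count are assumptions for illustration.

```python
import numpy as np

def inpaint_with_heatmap(image, heatmap, inpaint_fn, score_fn, threshold=0.5, n_samples=4):
    """Threshold the heatmap into a mask, re-synthesize the masked region several
    times, and return the candidate with the highest predicted score."""
    mask = (heatmap >= threshold).astype(np.uint8)    # problematic region to redraw
    candidates = [inpaint_fn(image, mask) for _ in range(n_samples)]
    return max(candidates, key=score_fn)

# toy usage with dummy stand-ins for the inpainter and the score predictor
image = np.zeros((64, 64, 3), dtype=np.float32)
heatmap = np.random.rand(64, 64)
inpaint_fn = lambda img, m: img + np.random.rand(*img.shape).astype(np.float32) * m[..., None]
score_fn = lambda img: -float(img.var())
result = inpaint_with_heatmap(image, heatmap, inpaint_fn, score_fn)
```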
Creation of a rich-human-feedback dataset (RichHF-18K) ▪ A model that predicts rich human feedback for generated images ▪ Methods for refining Text-to-Image generation using the rich human feedback Summary [Summary figure: RichHF-18K dataset → Rich Automatic Human Feedback (RAHF) model → refining Text-to-Image models via model finetuning, universal guidance (w/o guidance vs. w/ score guidance), and region inpainting] Youwei Liang, Junfeng He, Gang Li, Peizhao Li, Arseniy Klimovskiy, Nicholas Carolan, and Jiao Sun, et al. "Rich Human Feedback for Text-to-Image Generation". In CVPR 2024.
arxiv: https://arxiv.org/abs/2309.07906 demo: https://generative-dynamics.github.io/#demo Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski Best Paper
Motion self-guidance: an energy function encourages the positions and velocities of pixels at the start and end frames to be as close as possible (for seamless looping). Interactive dynamics from a single image: the spectral volume is used as a basis of vibration modes to describe the object's physical response; the motion displacement of pixel p at time t is a superposition of the spectral-volume mode shapes of pixel p weighted by the modal coordinates at time t (a toy synthesis sketch follows below). Zhengqi Li, Richard Tucker, Noah Snavely, and Aleksander Holynski. "Generative Image Dynamics". In CVPR 2024. https://generative-dynamics.github.io/
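A toy sketch of how a displacement field could be synthesized under the modal-superposition view above: each frequency slice of the spectral volume acts as a mode shape and is weighted by its modal coordinate at time t. The array shapes and the simple weighted sum are assumptions for illustration, not the paper's exact pipeline.

```python
import numpy as np

def displacement_at_time(spectral_volume: np.ndarray, modal_coords: np.ndarray) -> np.ndarray:
    """Superpose vibration modes into a per-pixel motion displacement field.
    spectral_volume: (F, H, W, 2) complex -- per-pixel x/y mode shapes
    modal_coords:    (F,)         complex -- modal state at time t
    returns:         (H, W, 2)    real    -- displacement of each pixel"""
    return np.real((spectral_volume * modal_coords[:, None, None, None]).sum(axis=0))

# toy usage: random spectral volume and modal coordinates for one time step
F, H, W = 4, 32, 32
S = np.random.randn(F, H, W, 2) + 1j * np.random.randn(F, H, W, 2)
q_t = np.exp(1j * np.random.rand(F))
disp = displacement_at_time(S, q_t)   # (32, 32, 2)
```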
evaluation of text-to-image generation] [Table 4. Quantitative Results of FID and CLIP-score.] [Figure 11. Text-to-image generation results of SD-XL with or without FreeU.] Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/ SDXL + FreeU
× 4096 SD-XL Images generated by ScaleCrafter with or without FreeU.] Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/
images from LCM [36] with and without FreeU enhancement.] Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/ Latent Consistency Model
evaluation of text-to-video generation.] [Figure 12. Text-to-video generation results of ModelScope [37] with or without FreeU. ] Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/
videos from Animatediff with and without FreeU enhancement.] Chenyang Si, Ziqi Huang, Yuming Jiang, and Ziwei Liu. "FreeU: Free Lunch in Diffusion U-Net". In CVPR 2024. https://chenyangsi.top/FreeU/
Models project page: https://dangeng.github.io/visual_anagrams/ arxiv: https://arxiv.org/abs/2311.17919 code: https://github.com/dangeng/visual_anagrams Daniel Geng, Inbum Park, and Andrew Owens Oral https://openaccess.thecvf.com/content/CVPR2024/html/Geng_Visual_Anagrams_Generating_Multi-View_Optical_Illusions_with_Diffusion_Models_CVPR_2024_paper.html
A simple method with fun applications Summary Daniel Geng, Inbum Park, and Andrew Owens. "Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models". In CVPR 2024. generated results [Figure 1. Generating Multi-View Illusions.]
Generating multi-view optical illusions (visual anagrams) with diffusion models ▪ A simple method with fun applications (see the sketch below) Summary Example view transforms: Rotations, Inner Circle, Patch permutation Daniel Geng, Inbum Park, and Andrew Owens. "Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models". In CVPR 2024.
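A hedged sketch of the core multi-view idea: at each denoising step, noise is estimated for each view of the noisy image under that view's prompt, mapped back to the canonical orientation, and averaged. The function signatures and the plain average are illustrative assumptions; see the paper and code for the exact procedure.

```python
import torch

def combined_noise_estimate(x_t, t, prompts, views, noise_pred_fn):
    """Average per-view noise estimates after mapping them back to the canonical view.
    `views` are (forward, inverse) pairs of invertible image transforms
    (e.g. rotation, patch permutation); `noise_pred_fn` stands in for a
    text-conditioned diffusion model."""
    estimates = []
    for prompt, (fwd, inv) in zip(prompts, views):
        eps_view = noise_pred_fn(fwd(x_t), t, prompt)
        estimates.append(inv(eps_view))
    return torch.stack(estimates).mean(dim=0)

# toy usage: identity and 180-degree rotation as the two views, dummy model
views = [
    (lambda x: x, lambda x: x),
    (lambda x: torch.rot90(x, 2, dims=(2, 3)), lambda x: torch.rot90(x, -2, dims=(2, 3))),
]
noise_pred_fn = lambda x, t, p: torch.randn_like(x)
x_t = torch.randn(1, 3, 64, 64)
eps = combined_noise_estimate(x_t, t=10, prompts=["an old man", "a campfire"], views=views, noise_pred_fn=noise_pred_fn)
```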
Generation ▪ Generative Image Dynamics ▪ Orals ▪ FreeU: Free Lunch in Diffusion U-Net ▪ Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models ▪ Style Aligned Image Generation via Shared Attention ▪ Instruct-Imagen: Image Generation with Multi-modal Instruction ▪ Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following ▪ Attention Calibration for Disentangled Text-to-Image Personalization ▪ Alchemist: Parametric Control of Material Properties with Diffusion Models ▪ Analyzing and Improving the Training Dynamics of Diffusion Models ▪ MonoHair: High-Fidelity Hair Modeling from a Monocular Video Main Conference: Image & Video Synthesis
Text-to-Image Generation ▪ Generative Image Dynamics ▪ Orals ▪ FreeU: Free Lunch in Diffusion U-Net ▪ Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models ▪ Style Aligned Image Generation via Shared Attention ▪ Instruct-Imagen: Image Generation with Multi-modal Instruction ▪ Ranni: Taming Text-to-Image Diffusion for Accurate Instruction Following ▪ Attention Calibration for Disentangled Text-to-Image Personalization ▪ Alchemist: Parametric Control of Material Properties with Diffusion Models ▪ Analyzing and Improving the Training Dynamics of Diffusion Models ▪ MonoHair: High-Fidelity Hair Modeling from a Monocular Video Main Conference: Image & Video Synthesis