⊖ xt-1) diff (motion) synthesis Motion-based Synthesis ▸ Synthesis operation: Extrapolate based on motion vectors ▸ Address three challenges: ▹ Handle deformable parts ▹ Filter motion noise ▹ When to inference vs. extrapolate?
⊖ xt-1) diff (motion) synthesis Motion-based Synthesis ▸ Synthesis operation: Extrapolate based on motion vectors ▸ Address three challenges: ▹ Handle deformable parts ▹ Filter motion noise ▹ When to inference vs. extrapolate? ▹ See paper for details!
⊖ xt-1) diff (motion) synthesis Motion-based Synthesis ▸ Synthesis operation: Extrapolate based on motion vectors ▸ Address three challenges: ▹ Handle deformable parts ▹ Filter motion noise ▹ When to inference vs. extrapolate? ▹ See paper for details! Computationally efficient: Extrapolation: 10K operations/frame CNN Inference: 50B operations/frame
Vision Results 66% energy saving & 1% accuracy loss with RTL/measurement. SoC Exploits synergies across IP blocks. Enables task autonomy. Algorithm Motion-based tracking and detection synthesis.
Vision Results 66% energy saving & 1% accuracy loss with RTL/measurement. SoC Exploits synergies across IP blocks. Enables task autonomy. Algorithm Motion-based tracking and detection synthesis.
the SoC ▸ Design decision: transfer MVs through DRAM ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data ▹ Easy to piggyback on the existing SoC communication scheme 10
the SoC ▸ Design decision: transfer MVs through DRAM ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data ▹ Easy to piggyback on the existing SoC communication scheme ▸ Light-weight modification to ISP Sequencer 10
Color Balance ISP Internal Interconnect SoC Interconnect ISP Pipeline Frame Buffer (DRAM) ISP Sequencer Noisy Frame Denoised Frame Prev. Noisy Frame Prev. Denoised Frame ISP Augmentation ▸ Expose motion vectors to the rest of the SoC ▸ Design decision: transfer MVs through DRAM ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data ▹ Easy to piggyback on the existing SoC communication scheme ▸ Light-weight modification to ISP Sequencer 10
Color Balance ISP Internal Interconnect SoC Interconnect ISP Pipeline Frame Buffer (DRAM) ISP Sequencer Noisy Frame Denoised Frame Prev. Noisy Frame Prev. Denoised Frame ISP Augmentation ▸ Expose motion vectors to the rest of the SoC ▸ Design decision: transfer MVs through DRAM ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data ▹ Easy to piggyback on the existing SoC communication scheme ▸ Light-weight modification to ISP Sequencer 10
Color Balance ISP Internal Interconnect SoC Interconnect ISP Pipeline Frame Buffer (DRAM) ISP Sequencer Noisy Frame Denoised Frame Prev. Noisy Frame Prev. Denoised Frame ISP Augmentation ▸ Expose motion vectors to the rest of the SoC ▸ Design decision: transfer MVs through DRAM ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data ▹ Easy to piggyback on the existing SoC communication scheme ▸ Light-weight modification to ISP Sequencer 10
Color Balance ISP Internal Interconnect SoC Interconnect ISP Pipeline Frame Buffer (DRAM) ISP Sequencer Noisy Frame Denoised Frame Prev. Noisy Frame Prev. Denoised Frame ISP Augmentation ▸ Expose motion vectors to the rest of the SoC ▸ Design decision: transfer MVs through DRAM ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data ▹ Easy to piggyback on the existing SoC communication scheme ▸ Light-weight modification to ISP Sequencer 10 MVs
Color Balance ISP Internal Interconnect SoC Interconnect ISP Pipeline Frame Buffer (DRAM) ISP Sequencer Noisy Frame Denoised Frame Prev. Noisy Frame Prev. Denoised Frame ISP Augmentation ▸ Expose motion vectors to the rest of the SoC ▸ Design decision: transfer MVs through DRAM ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data ▹ Easy to piggyback on the existing SoC communication scheme ▸ Light-weight modification to ISP Sequencer 10 MVs
Color Balance ISP Internal Interconnect SoC Interconnect ISP Pipeline Frame Buffer (DRAM) ISP Sequencer Noisy Frame Denoised Frame Prev. Noisy Frame Prev. Denoised Frame ISP Augmentation ▸ Expose motion vectors to the rest of the SoC ▸ Design decision: transfer MVs through DRAM ▹ One 1080p frame: 8KB MV traffic vs. ~6MB pixel data ▹ Easy to piggyback on the existing SoC communication scheme ▸ Light-weight modification to ISP Sequencer 10 MVs
accelerator, but a new IP? ▹Independent of vision algo./arch implementation 11 Extrapolation Unit Motion Vector Buffer DMA Sequencer (FSM) ROI Selection ROI 4-Way SIMD Unit Scalar MVs New ROI MMap Regs ROI Winsize Base Addrs Conf
accelerator, but a new IP? ▹Independent of vision algo./arch implementation ▸ Why not synthesize in CPU, but a new IP? ▹Switch-off CPU to enable “always-on” vision 11 Extrapolation Unit Motion Vector Buffer DMA Sequencer (FSM) ROI Selection ROI 4-Way SIMD Unit Scalar MVs New ROI MMap Regs ROI Winsize Base Addrs Conf
Motion Vector Buffer DMA Sequencer (FSM) ROI Selection ROI 4-Way SIMD Unit Scalar MVs New ROI MMap Regs ROI Winsize Base Addrs Conf ISP SoC Interconnect ▸ Why not directly augment the CNN accelerator, but a new IP? ▹Independent of vision algo./arch implementation ▸ Why not synthesize in CPU, but a new IP? ▹Switch-off CPU to enable “always-on” vision
Vision Algorithm Motion-based tracking and detection synthesis. SoC Exploits synergies across IP blocks. Enables task autonomy. Results 66% energy saving & 1% accuracy loss with RTL/measurement.
Nvidia Tegra X2 ▹ Real board measurement ▸ Develop RTL models for IPs unavailable on TX2 ▹ CNN Accelerator (651 mW, 1.58 mm2) ▹ Motion Controller (2.2 mW, 0.035 mm2) 14 ▸ Evaluate on Object Tracking and Object Detection ▹Important domains that are building blocks for many vision applications ▹IP vendors have started shipping standalone tracking/detection IPs
Nvidia Tegra X2 ▹ Real board measurement ▸ Develop RTL models for IPs unavailable on TX2 ▹ CNN Accelerator (651 mW, 1.58 mm2) ▹ Motion Controller (2.2 mW, 0.035 mm2) 14 ▸ Evaluate on Object Tracking and Object Detection ▹Important domains that are building blocks for many vision applications ▹IP vendors have started shipping standalone tracking/detection IPs
synthesis algorithm. ▸ We must expand our focus from isolated accelerators to holistic SoC architecture. ▸ 66% SoC energy savings with ~1% accuracy loss. More efficient than scaling-down CNNs.