Yuhao Zhu
March 28, 2022

RTNN: Accelerating Neighbor Search Using Hardware Ray Tracing

A long talk for the PPoPP 2022 paper with the same title. The code is at: https://github.com/horizon-research/rtnn.

Transcript

  1. Game Plan: Accelerating Neighbor Search Using Hardware Ray Tracing

    • What is ray tracing?
    • How does hardware support ray tracing?
    • What is neighbor search?
  2. Game Plan: Accelerating Neighbor Search Using Hardware Ray Tracing

    • What is ray tracing?
    • How does hardware support ray tracing?
    • What is neighbor search?
    • How to use hardware ray tracing to accelerate neighbor search?
  4. Mesh (i.e., How a Scene Is Represented)

    Very informally: a 3D piecewise-linear approximation of arbitrary 3D surfaces.
    [Figures: quadrilateral mesh; triangular mesh. Sources: Valette, et al. [TVCG’08]; free3d.com]
  5. Visibility Problem: For each pixel in the image (to be rendered), which point in the scene (i.e., on the mesh) corresponds to it?

    [Diagram labels: Modeling; 3D mesh; Rendering (Visibility, Shading); lighting, camera, material, etc.; 2D image. Image source: cgarena.com]
  6. [Diagram labels: Modeling; 3D mesh; Rendering (Visibility, Shading); lighting, camera, material, etc.; 2D image. Image source: cgarena.com]

    * Usually cast multiple rays for each pixel.
  7. Shading Problem: What’s the color of an intersecting scene point along the ray direction?

    [Diagram labels: Modeling; 3D mesh; Rendering (Visibility, Shading); lighting, camera, material, etc.; 2D image. Image source: cgarena.com]
    * Usually cast multiple rays for each pixel.
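To make "cast multiple rays for each pixel" concrete, here is a minimal sketch of primary-ray generation. Everything in it is an assumption made for illustration and is not taken from the talk: the pinhole camera at the origin looking down the negative z axis, the image plane at z = -1, the random jitter inside each pixel, and the function name `primary_rays`.

```python
import random


def primary_rays(width, height, samples_per_pixel=4, seed=0):
    """Yield (px, py, origin, direction) for every sample of every pixel.

    Assumed setup: pinhole camera at the origin, looking down -z, with the
    image plane at z = -1 spanning [-1, 1] x [-1, 1].
    """
    rng = random.Random(seed)
    origin = (0.0, 0.0, 0.0)
    for py in range(height):
        for px in range(width):
            for _ in range(samples_per_pixel):
                # Jitter the sample position inside the pixel footprint.
                u = (px + rng.random()) / width
                v = (py + rng.random()) / height
                # Map (u, v) in [0, 1)^2 onto the image plane.
                x, y, z = 2.0 * u - 1.0, 1.0 - 2.0 * v, -1.0
                norm = (x * x + y * y + z * z) ** 0.5
                yield px, py, origin, (x / norm, y / norm, z / norm)
```

Each yielded ray would then be intersected with the scene (the visibility problem) and shaded (the shading problem); the per-pixel samples are typically averaged.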
  8. Aside: Other Geometry Primitives

    • Hair and fur are usually modeled using curves (e.g., Catmull–Rom splines). https://developer.nvidia.com/blog/optix-sdk-7-1/
    • Points and spheres. https://www.sciencefocus.com/future-technology/notre-dame-how-faithfully-can-we-rebuild-the-cathedral-with-modern-tech/
  9. Ray-Scene Intersection

    • Goal: calculate the [x, y, z] coordinates of the closest hit between the ray and the mesh.
    • Why closest hit?
    [Figure: a ray hitting a mesh at point [x, y, z], with x, y, z axes. Mesh image: Valette, et al. [TVCG’08]]
  11. Exhaustive Search

    • The simplest solution:
    [Figure: a ray hitting a mesh at point [x, y, z]. Mesh image: Valette, et al. [TVCG’08]]
  12. Exhaustive Search

    • The simplest solution:
        • iterate all triangles
    [Figure: a ray hitting a mesh at point [x, y, z]. Mesh image: Valette, et al. [TVCG’08]]
  13. Exhaustive Search

    • The simplest solution:
        • iterate all triangles
        • test intersection for each triangle
    [Figure: a ray hitting a mesh at point [x, y, z]. Mesh image: Valette, et al. [TVCG’08]]
  14. Exhaustive Search

    • The simplest solution:
        • iterate all triangles
        • test intersection for each triangle
        • return the closest hit, if any
    [Figure: a ray hitting a mesh at point [x, y, z]. Mesh image: Valette, et al. [TVCG’08]]
  15. Exhaustive Search

    • The simplest solution:
        • iterate all triangles
        • test intersection for each triangle
        • return the closest hit, if any
    • Complexity:
    [Figure: a ray hitting a mesh at point [x, y, z]. Mesh image: Valette, et al. [TVCG’08]]
  16. Exhaustive Search

    • The simplest solution:
        • iterate all triangles
        • test intersection for each triangle
        • return the closest hit, if any
    • Complexity:
        • O(# of rays × # of triangles)
    [Figure: a ray hitting a mesh at point [x, y, z]. Mesh image: Valette, et al. [TVCG’08]]
  17. Exhaustive Search

    • The simplest solution:
        • iterate all triangles
        • test intersection for each triangle
        • return the closest hit, if any
    • Complexity:
        • O(# of rays × # of triangles)
    • Slow:
    [Figure: a ray hitting a mesh at point [x, y, z]. Mesh image: Valette, et al. [TVCG’08]]
  18. Exhaustive Search

    • The simplest solution:
        • iterate all triangles
        • test intersection for each triangle
        • return the closest hit, if any
    • Complexity:
        • O(# of rays × # of triangles)
    • Slow:
        • lots of triangles and lots of rays
    [Figure: a ray hitting a mesh at point [x, y, z]. Mesh image: Valette, et al. [TVCG’08]]
  19. Exhaustive Search

    • The simplest solution:
        • iterate all triangles
        • test intersection for each triangle
        • return the closest hit, if any
    • Complexity:
        • O(# of rays × # of triangles)
    • Slow:
        • lots of triangles and lots of rays
        • …and it’s recursive
    [Figure: a ray hitting a mesh at point [x, y, z]. Mesh image: Valette, et al. [TVCG’08]]
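The exhaustive search described on these slides can be sketched as follows. This is an illustrative sketch and not code from the talk or the RTNN repository: the per-triangle test used here is the Möller–Trumbore algorithm (the slides do not name a specific test), and the names `ray_triangle`, `closest_hit`, and the `EPS` tolerance are assumptions.

```python
EPS = 1e-9  # assumed tolerance for parallel rays and self-intersection


def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def ray_triangle(origin, direction, tri):
    """Moller-Trumbore test: return the ray parameter t of the hit, or None."""
    v0, v1, v2 = tri
    e1, e2 = sub(v1, v0), sub(v2, v0)
    p = cross(direction, e2)
    det = dot(e1, p)
    if abs(det) < EPS:
        return None  # ray is parallel to the triangle's plane
    inv = 1.0 / det
    s = sub(origin, v0)
    u = dot(s, p) * inv  # first barycentric coordinate
    if u < 0.0 or u > 1.0:
        return None
    q = cross(s, e1)
    v = dot(direction, q) * inv  # second barycentric coordinate
    if v < 0.0 or u + v > 1.0:
        return None
    t = dot(e2, q) * inv
    return t if t > EPS else None  # hits behind the origin do not count


def closest_hit(origin, direction, triangles):
    """Iterate all triangles, test each one, and return the closest hit, if any."""
    best_t, best_idx = None, None
    for idx, tri in enumerate(triangles):
        t = ray_triangle(origin, direction, tri)
        if t is not None and (best_t is None or t < best_t):
            best_t, best_idx = t, idx
    if best_t is None:
        return None
    point = tuple(o + best_t * d for o, d in zip(origin, direction))
    return best_idx, best_t, point
```

Calling `closest_hit` once per ray gives the O(# of rays × # of triangles) cost noted on the slide, since every ray visits every triangle.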
  20. Aside: Why Recursive Ray Tracing?

    • To implement realistic shading.
    [Figure label: “Color?”. Mesh image: Valette, et al. [TVCG’08]]
  21. Aside: Why Recursive Ray Tracing?

    • To implement realistic shading.
    • The color* of an exiting ray depends on the colors* of all incident rays.
    [Figure labels: “Color?” on four rays. Mesh image: Valette, et al. [TVCG’08]]
  22. Aside: Why Recursive Ray Tracing?

    • To implement realistic shading.
    • The color* of an exiting ray depends on the colors* of all incident rays.
        • color* should technically be radiance; not important for our discussion here.
    [Figure labels: “Color?” on four rays. Mesh image: Valette, et al. [TVCG’08]]
  23. Aside: Why Recursive Ray Tracing?

    • To implement realistic shading.
    • The color* of an exiting ray depends on the colors* of all incident rays.
        • color* should technically be radiance; not important for our discussion here.
        • also depends on the surface material (diffuse vs. specular vs. …); not important for our discussion here.
    [Figure labels: “Color?” on four rays. Mesh image: Valette, et al. [TVCG’08]]
  24. Aside: Why Recursive Ray Tracing?

    • To implement realistic shading.
    • The color* of an exiting ray depends on the colors* of all incident rays.
        • color* should technically be radiance; not important for our discussion here.
        • also depends on the surface material (diffuse vs. specular vs. …); not important for our discussion here.
    • How do we know the color of an incident ray? Cast more rays!
    [Figure labels: “Color?” on four rays. Mesh image: Valette, et al. [TVCG’08]]
  25. Aside: Why Recursive Ray Tracing?

    • To implement realistic shading.
    • The color* of an exiting ray depends on the colors* of all incident rays.
        • color* should technically be radiance; not important for our discussion here.
        • also depends on the surface material (diffuse vs. specular vs. …); not important for our discussion here.
    • How do we know the color of an incident ray? Cast more rays!
    [Figure labels: three rays marked “Secondary Ray”. Mesh image: Valette, et al. [TVCG’08]]
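  26. Aside: Rendering Equation

    $$L_o(x, \omega_o) = \int_{\Omega} f_r(x, \omega_o, \omega_i)\, L_i(x, \omega_i)\, \cos\theta \, d\omega_i$$

    • $L_o(x, \omega_o)$: “color” of the exiting ray $\omega_o$
    • $L_i(x, \omega_i)$: “color” of the incident ray $\omega_i$
    • $\int_{\Omega} \dots \, d\omega_i$: integrate incident rays over the hemisphere
    • $f_r$: “transfer function”
    Source: https://en.wikipedia.org/wiki/Rendering_equation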
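As a sanity check on the integral above, the sketch below evaluates it numerically for one simple case. The setup is an assumption chosen because the answer is known in closed form: a constant (Lambertian) transfer function f_r = albedo / π and uniform incident radiance L_i = 1 give L_o = albedo. The function names and the midpoint-rule quadrature are illustrative; real renderers usually estimate this integral by Monte Carlo sampling, i.e., by casting secondary rays.

```python
import math


def outgoing_radiance(brdf, incident_radiance, n_theta=128, n_phi=128):
    """Midpoint-rule estimate of the hemisphere integral of f_r * L_i * cos(theta).

    Uses spherical coordinates, where the solid-angle element is
    d_omega = sin(theta) d_theta d_phi, theta in [0, pi/2], phi in [0, 2*pi).
    """
    d_theta = (math.pi / 2) / n_theta
    d_phi = (2 * math.pi) / n_phi
    total = 0.0
    for i in range(n_theta):
        theta = (i + 0.5) * d_theta
        for j in range(n_phi):
            phi = (j + 0.5) * d_phi
            total += (brdf(theta, phi) * incident_radiance(theta, phi)
                      * math.cos(theta) * math.sin(theta) * d_theta * d_phi)
    return total


ALBEDO = 0.8  # assumed surface reflectance, for illustration only


def lambertian(theta, phi):
    """Constant transfer function of an ideal diffuse surface."""
    return ALBEDO / math.pi


def uniform_light(theta, phi):
    """Incident radiance that is the same from every direction."""
    return 1.0
```

With these inputs, `outgoing_radiance(lambertian, uniform_light)` comes out very close to `ALBEDO`, because the integral of cos θ over the hemisphere is π.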
  27. Speeding Up Ray-Triangle Intersection Test

    • Prune the search space.
        • Only search the parts of the scene that the ray actually intersects.

    intersect(space, ray) {
      if ray doesn’t intersect space boundary:
        return
      else:
        foreach subspace in space
          if (subspace != empty)
            intersect(subspace, ray)
    }
  30. Speeding Up Ray-Triangle Intersection Test

    • Prune the search space.
        • Only search the parts of the scene that the ray actually intersects.
    • Key: how to partition the space?

    intersect(space, ray) {
      if ray doesn’t intersect space boundary:
        return
      else:
        foreach subspace in space
          if (subspace != empty)
            intersect(subspace, ray)
    }
  31. Space Partition vs. Object Partition

    • Space partition: one object could be in different partitions.
    • Object partition: different partitions could overlap in space.
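  32. Bounding Volume Hierarchy (Object Partition)

    [Diagram: a scene with primitives 1–4 enclosed by bounding volumes A–E, next to the corresponding BVH tree. Tree labels in order: A, B, C, 1, D, 2, 3, E, 4. Legend: root, interior node, leaf node, primitive.]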
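  33. Bounding Volume Hierarchy (Object Partition)

    • A, B, C, D, E are the bounding volumes, which are Axis-Aligned Bounding Boxes (AABBs) here. Other (irregular) bounding volumes are possible.
    [Diagram: a scene with primitives 1–4 enclosed by bounding volumes A–E, next to the corresponding BVH tree. Tree labels in order: A, B, C, 1, D, 2, 3, E, 4. Legend: root, interior node, leaf node, primitive.]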
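A minimal sketch of how AABBs like the ones on this slide can be computed: a leaf's box is the per-axis min/max of its primitives' vertices, and an interior node's box is the union of its children's boxes. The function names and the `(lo, hi)` tuple representation are assumptions for illustration, not an API from the talk.

```python
def aabb_of_points(points):
    """Return (lo, hi): the per-axis min and max corners enclosing all points."""
    xs, ys, zs = zip(*points)
    return (min(xs), min(ys), min(zs)), (max(xs), max(ys), max(zs))


def aabb_union(a, b):
    """Smallest AABB that encloses both boxes a and b (each a (lo, hi) pair)."""
    (alo, ahi), (blo, bhi) = a, b
    lo = tuple(min(p, q) for p, q in zip(alo, blo))
    hi = tuple(max(p, q) for p, q in zip(ahi, bhi))
    return lo, hi
```

Because each box only has to enclose its own objects, sibling boxes may overlap in space, which is the object-partition trade-off noted earlier in the deck.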
  34. Intersection Test Using BVH

    [Diagram: a ray through the scene (primitives 1–4, bounding volumes A–E) next to the BVH tree.]
    • Current stack: A
    • Step: ray-AABB intersection test
    • ClosestHit = NA
  35. Intersection Test Using BVH

    [Diagram: a ray through the scene (primitives 1–4, bounding volumes A–E) next to the BVH tree.]
    • Current stack: B, E
    • Step: ray-AABB intersection test
    • ClosestHit = NA
  36. Intersection Test Using BVH

    [Diagram: a ray through the scene (primitives 1–4, bounding volumes A–E) next to the BVH tree.]
    • Current stack: C, D, E
    • Step: ray-AABB intersection test
    • ClosestHit = NA
  37. Intersection Test Using BVH

    [Diagram: a ray through the scene (primitives 1–4, bounding volumes A–E) next to the BVH tree.]
    • Current stack: D, E
    • Step: ray-AABB intersection test
    • ClosestHit = NA
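  38. Intersection Test Using BVH

    [Diagram: a ray through the scene (primitives 1–4, bounding volumes A–E) next to the BVH tree; primitives 2 and 3 are highlighted.]
    • Current stack: E
    • Step: ray-triangle intersection test
    • ClosestHit = NA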
  39. Intersection Test Using BVH

    [Diagram: a ray through the scene (primitives 1–4, bounding volumes A–E) next to the BVH tree.]
    • Current stack: E
    • Step: ray-AABB intersection test
    • ClosestHit = 2
  40. Intersection Test Using BVH

    [Diagram: a ray through the scene (primitives 1–4, bounding volumes A–E) next to the BVH tree.]
    • Current stack: E
    • Step: ray-AABB intersection test
    • ClosestHit = 2
    • Distance to E > distance to 2; stop!
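The walkthrough on these slides can be sketched as a stack-based traversal. This is an illustrative sketch, not the talk's or OptiX's implementation. Assumptions: the `Node` layout, the slab-method ray-AABB test, the Möller–Trumbore triangle test, and the made-up coordinates in `build_demo_scene`, which only mimic the tree shape A → (B, E), B → (C, D). The "distance to E > distance to 2" cut-off is implemented by passing the current closest hit distance as the upper bound of the ray-AABB test.

```python
import math
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    lo: tuple  # AABB min corner
    hi: tuple  # AABB max corner
    children: list = field(default_factory=list)   # interior node: child Nodes
    triangles: list = field(default_factory=list)  # leaf node: (id, v0, v1, v2)


def ray_aabb(o, d, lo, hi, t_min, t_max):
    """Slab test: entry distance, or None if the box is missed within [t_min, t_max]."""
    for a in range(3):
        if abs(d[a]) < 1e-12:  # ray parallel to this pair of slabs
            if o[a] < lo[a] or o[a] > hi[a]:
                return None
            continue
        t0, t1 = (lo[a] - o[a]) / d[a], (hi[a] - o[a]) / d[a]
        if t0 > t1:
            t0, t1 = t1, t0
        t_min, t_max = max(t_min, t0), min(t_max, t1)
        if t_min > t_max:
            return None
    return t_min


def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1], a[2] * b[0] - a[0] * b[2], a[0] * b[1] - a[1] * b[0])


def ray_triangle(o, d, v0, v1, v2):
    """Moller-Trumbore test: hit distance t, or None."""
    e1, e2 = _sub(v1, v0), _sub(v2, v0)
    p = _cross(d, e2)
    det = _dot(e1, p)
    if abs(det) < 1e-12:
        return None
    s = _sub(o, v0)
    u = _dot(s, p) / det
    q = _cross(s, e1)
    v = _dot(d, q) / det
    t = _dot(e2, q) / det
    return t if (u >= 0 and v >= 0 and u + v <= 1 and t > 1e-9) else None


def closest_hit(root, o, d, t_min=0.0, t_max=math.inf, visited=None):
    """Return (triangle id, distance) of the closest hit, or (None, None)."""
    best_t, best_id = t_max, None
    stack = [root]
    while stack:
        node = stack.pop()
        # Using best_t as the upper bound skips boxes that start beyond the closest hit.
        if ray_aabb(o, d, node.lo, node.hi, t_min, best_t) is None:
            continue
        if visited is not None:
            visited.append(node.name)
        for tri_id, v0, v1, v2 in node.triangles:
            t = ray_triangle(o, d, v0, v1, v2)
            if t is not None and t_min <= t < best_t:
                best_t, best_id = t, tri_id
        stack.extend(reversed(node.children))  # first child ends up on top
    return best_id, (best_t if best_id is not None else None)


def _tri(tri_id, cx, z):
    """Hypothetical helper: a triangle around (cx, 0) at depth z."""
    return (tri_id, (cx - 1, -1, z), (cx + 1, -1, z), (cx, 1, z))


def build_demo_scene():
    """Made-up geometry with the slides' tree shape: A -> (B, E), B -> (C, D)."""
    c = Node("C", (4, -1, -1.1), (6, 1, -0.9), triangles=[_tri(1, 5, -1)])
    d = Node("D", (-1, -1, -4.1), (1, 1, -2.9), triangles=[_tri(2, 0, -3), _tri(3, 0, -4)])
    e = Node("E", (-1, -1, -10.1), (1, 1, -9.9), triangles=[_tri(4, 0, -10)])
    b = Node("B", (-1, -1, -4.1), (6, 1, -0.9), children=[c, d])
    return Node("A", (-1, -1, -10.1), (6, 1, -0.9), children=[b, e])
```

For a ray from the origin along -z in this demo scene, C is rejected by the AABB test, D yields triangle 2 as the closest hit, and E is rejected because its entry distance exceeds the distance to triangle 2. The sketch uses `continue` rather than stopping outright because, in general, other nodes may still be on the stack.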
  41. A Subtle but Critical Case

    Ray: $O + tD$, $t_{min} \le t \le t_{max}$
    [Diagram labels: O, D, $t_{hit}$, $t_{min}$, $t_{max}$]
    Should this be counted as a hit?
  43. A Subtle but Critical Case

    Ray: $O + tD$, $t_{min} \le t \le t_{max}$
    [Diagram labels: O, D, $t_{hit}$, $t_{min}$, $t_{max}$]
    Should this be counted as a hit?
    Yes; any ray segment that’s completely inside an AABB must be treated as intersecting.
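The rule on this slide can be expressed as an interval-overlap test. The sketch below is an assumption-laden illustration, not the hardware's actual logic: it computes the parameter interval over which the infinite line O + tD lies inside the box (slab method), then reports a hit whenever that interval overlaps [t_min, t_max]. The hypothetical `boundary_crossing_only` variant is included purely as a contrast: it misses segments that lie completely inside the box.

```python
import math


def line_box_interval(o, d, lo, hi):
    """Parameter interval (t_enter, t_exit) where the line o + t*d is inside the box."""
    t_enter, t_exit = -math.inf, math.inf
    for a in range(3):
        if d[a] == 0.0:  # line parallel to this pair of slabs
            if not (lo[a] <= o[a] <= hi[a]):
                return None
            continue
        t0, t1 = (lo[a] - o[a]) / d[a], (hi[a] - o[a]) / d[a]
        t_enter = max(t_enter, min(t0, t1))
        t_exit = min(t_exit, max(t0, t1))
    return (t_enter, t_exit) if t_enter <= t_exit else None


def segment_hits_aabb(o, d, t_min, t_max, lo, hi):
    """Hit if [t_min, t_max] overlaps the inside interval, even with no boundary crossing."""
    interval = line_box_interval(o, d, lo, hi)
    if interval is None:
        return False
    t_enter, t_exit = interval
    return t_enter <= t_max and t_exit >= t_min


def boundary_crossing_only(o, d, t_min, t_max, lo, hi):
    """Contrast case: counts a hit only if a box face is crossed within the segment."""
    interval = line_box_interval(o, d, lo, hi)
    if interval is None:
        return False
    return any(t_min <= t <= t_max for t in interval)
```

For a segment that starts and ends inside the box, `segment_hits_aabb` returns `True` while `boundary_crossing_only` returns `False`, which is the distinction the slide draws.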
  44. Aside: Two Terminology Confusions

    • Ray casting vs. ray tracing
        • Technically, finding the intersection of one ray and the scene is called ray casting.
        • Ray tracing refers to recursive ray casting.
    • Acceleration structures
        • Data structures that help speed up ray tracing are called “acceleration structures” (e.g., BVH), not to be confused with hardware accelerators.
  45. Ray Tracing on GPUs Using BVH

    • Build the BVH.
    [Diagram: a scene with primitives 1–4, bounding volumes A–E, and three rays.]
  46. Ray Tracing on GPUs Using BVH

    • Build the BVH.
    • For each ray (thread):
        • Traverse the BVH (manage local stack)
        • Ray-AABB intersection test
        • Ray-primitive intersection test
        • Execute a shading algorithm
    [Diagram: a scene with primitives 1–4, bounding volumes A–E, and three rays.]
  47. Ray Tracing on GPUs Using BVH

    • Build the BVH.
    • For each ray (thread):
        • Traverse the BVH (manage local stack)
        • Ray-AABB intersection test
        • Ray-primitive intersection test
        • Execute a shading algorithm
    • Prior to OptiX (2010):
        • Implemented manually in CUDA.
    [Diagram: a scene with primitives 1–4, bounding volumes A–E, and three rays.]
  48. Ray Tracing on GPUs Using BVH

    • Build the BVH.
    • For each ray (thread):
        • Traverse the BVH (manage local stack)
        • Ray-AABB intersection test
        • Ray-primitive intersection test
        • Execute a shading algorithm
    • Prior to OptiX (2010):
        • Implemented manually in CUDA.
    [Slide annotations: “Fixed-function”, “~Fixed-function”]
    [Diagram: a scene with primitives 1–4, bounding volumes A–E, and three rays.]
  49. Ray Tracing in OptiX and Turing GPU

    • OptiX (2010): a ray tracing-specific programming model.
        • Provides a generic ray tracing pipeline.
        • Some pipeline stages are programmable; others are fixed functions.
    [Figure: first page of Parker, S., Bigler, J., Dietrich, A., Friedrich, H., Hoberock, J., Luebke, D., McAllister, D., McGuire, M., Morley, K., Robison, A., Stich, M. 2010. OptiX: A General Purpose Ray Tracing Engine. ACM Trans. Graph. 29, 4, Article 66 (July 2010), 13 pages. DOI 10.1145/1778765.1778803]
  50. Ray Tracing in OptiX and Turing GPU

    • OptiX (2010): a ray tracing-specific programming model.
        • Provides a generic ray tracing pipeline.
        • Some pipeline stages are programmable; others are fixed functions.
    • Prior to Turing architecture (2018):
        • Everything runs on CUDA cores.
    [Figure: first page of Parker, et al. 2010. OptiX: A General Purpose Ray Tracing Engine. ACM Trans. Graph. 29, 4, Article 66 (July 2010). DOI 10.1145/1778765.1778803]
  51. Ray Tracing in OptiX and Turing GPU

    • OptiX (2010): a ray tracing-specific programming model.
        • Provides a generic ray tracing pipeline.
        • Some pipeline stages are programmable; others are fixed functions.
    • Prior to Turing architecture (2018):
        • Everything runs on CUDA cores.
    • Turing architecture:
        • RT Cores accelerate fixed-function stages.
        • Programmable stages run on the CUDA cores.
    [Figure: first page of Parker, et al. 2010. OptiX: A General Purpose Ray Tracing Engine. ACM Trans. Graph. 29, 4, Article 66 (July 2010). DOI 10.1145/1778765.1778803]
  59. OptiX Programming Model

    • Construct BVH.
    • Ray Generation (RG) Shader.
    • BVH Traversal + Ray-AABB Test (TL).
    • On entering a leaf node: Intersection (IS) Shader.
    • Ray-primitive intersect? Yes: Any-Hit (AH) Shader. No: continue the traversal.
    • Traversal completes. Found a hit? If so, Closest-Hit (CH) Shader; otherwise, Miss Shader.
    • “Shaders” are user-defined functions executing on CUDA cores.
    • The IS shader allows custom primitives (not just triangles).
    • Fixed functions (BVH traversal and the ray-AABB test) are executed on the RT cores.
  60. OptiX Programming Model

    [Figure: the same pipeline (RG, TL, IS, AH, CH, Miss), replicated across many rays.]
    • The whole pipeline runs as one single CUDA kernel.
    • OptiX rays correspond to CUDA threads.
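  61. Life of an OptiX Ray

    [Figure: a BVH with nodes A to E over primitives 1 to 4, and the timeline of one ray.]
    • Timeline: RG, TL (A, B, D), IS (2, 3), TL (C, E), CH.
    • RG, IS, and CH run on the CUDA cores; TL runs on the RT cores.
    • Think of RT cores as special function units for BVH traversal.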
  62. Aside: Other Notable Ray Tracing Engines

    • Intel OSPRay
      • Won 2020 Oscar for Scientific and Technical Achievement.
      • Built on Intel Embree, a collection of ray tracing kernels, which uses the Intel Implicit SPMD Program Compiler (ISPC) for explicit vectorization.
    • PBRT
      • Pedagogical engine.
      • The book won 2014 Oscar for Scientific and Technical Achievement.
  63. Game Plan: Accelerating Neighbor Search Using Hardware Ray Tracing

    • What is ray tracing?
    • How does hardware support ray tracing?
    • What is neighbor search?
      • How does it relate to ray tracing?
  68. Neighbor Search

    • Range search: rangeSearch(query, points, range, K) returns any K points that are within range of query.
      • Usually also limits the total # of neighbors:
        • practical memory constraint,
        • downstream algorithms expect a fixed # of neighbors.
    • KNN search: KNN(query, points, range, K) returns the K nearest points that are within range of query.
      • Usually also limits the range of neighbors:
        • neighbors too far away are of no significance (e.g., force from a remote particle).
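The two definitions above can be pinned down with a brute-force reference. This is a minimal sketch, not the RTNN implementation: the function names mirror the slide's `rangeSearch` and `KNN`, while the linear scan, the tuple point format, and the Euclidean metric are assumptions made for illustration.

```python
import math

def range_search(query, points, r, k):
    """Return up to k points within distance r of query (any k, not the nearest)."""
    found = []
    for p in points:
        if math.dist(query, p) <= r:
            found.append(p)
            if len(found) == k:  # stop early: any k in-range points will do
                break
    return found

def knn_search(query, points, r, k):
    """Return the k nearest points within distance r of query, nearest first."""
    in_range = [p for p in points if math.dist(query, p) <= r]
    in_range.sort(key=lambda p: math.dist(query, p))
    return in_range[:k]
```

The only difference is that KNN must rank every in-range candidate, which is why it is the costlier of the two searches.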
  69. Our Focus: Low-Dimensional Search

    • Low dimension: <= 3D.
      • Prevalent in science and engineering fields (e.g., computational fluid dynamics, graphics, vision).
      • They deal with physical data (e.g., particles, surface samples) that are inherently 2D/3D.
    • High-dimensional search is a completely different game.
      • “Curse of dimensionality” means we need different algorithms and distance metrics.
  70. Turn the Problem Around

    • Original problem: find all points within r from Q.
    • Turned around: find whether Q is within r from other points.
  75. Point-in-AABB Test

    • Recall: any ray that’s within an AABB must be treated as intersecting.
    • Idea: generate a short ray from Q and (ask the RT cores to) perform the ray-AABB test.
    • The ray has an arbitrary direction and a very small length.
    • Why a very small ray length?
    [Figure: a query Q, an AABB of width 2r, and a second point Q’.]
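To make the ray-length question concrete, here is a small CPU sketch of a ray-AABB slab test. It is only an illustration: the RT cores' actual test is not exposed, and `ray_hits_aabb`, `point_in_aabb_via_ray`, and the `eps` default are hypothetical names and values. One plausible reading of the slide's question is that a long ray cast from a point outside the box can still enter it, which would report a false positive.

```python
def ray_hits_aabb(origin, direction, t_max, box_min, box_max):
    """Slab test: does origin + t*direction, t in [0, t_max], overlap the box?"""
    t_near, t_far = 0.0, t_max
    for o, d, lo, hi in zip(origin, direction, box_min, box_max):
        if d == 0.0:
            # Ray is parallel to this slab: it must already lie inside it.
            if o < lo or o > hi:
                return False
            continue
        t0, t1 = (lo - o) / d, (hi - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_near, t_far = max(t_near, t0), min(t_far, t1)
        if t_near > t_far:
            return False
    return True

def point_in_aabb_via_ray(q, center, r, eps=1e-16):
    """Point-in-AABB test phrased as a very short ray cast from q."""
    box_min = tuple(c - r for c in center)
    box_max = tuple(c + r for c in center)
    return ray_hits_aabb(q, (1.0, 0.0, 0.0), eps, box_min, box_max)
```

With `t_max = 1.0`, a ray from `(-1.5, 0, 0)` along +x hits the unit-radius box around the origin even though its origin is outside; with the tiny `eps` it does not.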
  76. Game Plan: Accelerating Neighbor Search Using Hardware Ray Tracing

    • What is ray tracing?
    • How does hardware support ray tracing?
    • What is neighbor search?
    • How to use hardware ray tracing to accelerate neighbor search?
  84. Overall Idea

    rangeSearch(query, points, r, K)
    • Create an AABB of width 2r for every point. Use spheres as primitives, not triangles.
    • Construct a BVH from the AABBs. (No control; hidden behind the OptiX APIs and most likely done in hardware.) See https://forums.developer.nvidia.com/t/bvh-building-algorithm-and-primitive-order/182231/8
    • Generate a ray for each query (RG Shader).
    • Traverse the BVH; skip non-circumscribing AABBs, i.e., AABBs that do not enclose the query. (No control; done in hardware.)
    • At leaf nodes: calculate the distance and collect neighbors (IS Shader).
  85. Another Perspective: Point-in-Sphere Test

    • BVH Traversal + Ray-AABB Test (TL): is Q in the AABB? (Prunes remote points.)
    • Intersection (IS) Shader: if so, is Q in the sphere?
    • The rest of the pipeline (RG, AH, CH, and Miss shaders) is unchanged.
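The two-stage filter above can be emulated on the CPU. This is a sketch under stated assumptions: a linear scan stands in for the hardware BVH traversal, and `in_aabb`, `in_sphere`, and `range_search_two_stage` are hypothetical helper names, not OptiX calls. The counter shows how many candidates survive the cheap AABB stage and reach the (emulated) IS shader.

```python
def in_aabb(q, p, r):
    """Stage 1 (traversal): is q inside the width-2r box centered at p?"""
    return all(abs(qi - pi) <= r for qi, pi in zip(q, p))

def in_sphere(q, p, r):
    """Stage 2 (IS shader): is q inside the radius-r sphere centered at p?"""
    return sum((qi - pi) ** 2 for qi, pi in zip(q, p)) <= r * r

def range_search_two_stage(q, points, r, k):
    """Return (indices of up to k neighbors, number of emulated IS shader calls)."""
    neighbors, is_calls = [], 0
    for idx, p in enumerate(points):
        if not in_aabb(q, p, r):  # pruned during traversal, no shader call
            continue
        is_calls += 1             # reached a leaf: the IS shader runs
        if in_sphere(q, p, r):
            neighbors.append(idx)
            if len(neighbors) == k:
                break
    return neighbors, is_calls
```

A point near a box corner passes the AABB stage but fails the sphere stage, which is exactly the case the IS shader exists to reject.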
  87. Idea: Order Queries Spatially

    • Intuition: group spatially close queries together so that their rays follow similar traversal paths.
      • Improving ray coherence, in graphics parlance.
    • How? A simple heuristic: queries enclosed by the same AABB are spatially close.
  92. Idea: Order Queries Spatially

    • A query might be enclosed by many AABBs, but any AABB will do.
    • How to find one? Cast a ray and immediately terminate the ray once the first IS shader is called.
      • optixTerminateRay()
      • Effectively returning the ID (key) of the first enclosing leaf AABB.
    • Then sort by key.
    [Figure: eight queries, numbered 1 to 8, reordered by their enclosing AABB.]
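The reordering step can be sketched as a key-then-sort pass. Note the assumptions: on the GPU the "first" enclosing AABB is whichever leaf the traversal reaches first, whereas this CPU stand-in simply scans points in index order; `first_enclosing_aabb` and `reorder_queries` are hypothetical names.

```python
def first_enclosing_aabb(q, points, r):
    """Key for a query: index of the first width-2r AABB that encloses it."""
    for idx, p in enumerate(points):
        if all(abs(qi - pi) <= r for qi, pi in zip(q, p)):
            return idx            # emulates terminating the ray at the first IS call
    return len(points)            # sentinel: no AABB encloses this query

def reorder_queries(queries, points, r):
    """Sort queries so that those sharing an enclosing AABB become adjacent."""
    keys = [first_enclosing_aabb(q, points, r) for q in queries]
    order = sorted(range(len(queries)), key=lambda i: keys[i])
    return [queries[i] for i in order]
```

Because `sorted` is stable, queries with the same key keep their original relative order.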
  93. Search Algorithm (So Far)

    bvh ← buildBVH(points, r);
    firstHitAABBs ← traceRays(bvh, queries);
    reorderQueries(queries, firstHitAABBs);
    traceRays(bvh, queries);
  94. Problem: Large AABBs

    rangeSearch(query, points, r, K)
    • Strictly speaking, the AABB width must be 2r.
    • What if we can find K neighbors in a smaller range? We can use a smaller AABB.
    • What’s the benefit?
  98. Benefits of Smaller AABBs

    [Figure: search time (s) as a function of AABB width.]
    • Using smaller AABBs drastically reduces the search time.
    • Smaller AABB means a query is enclosed by fewer AABBs.
    • …which leads to fewer traversals and IS shader calls.
    • Particularly important for KNN search, where the IS shader manipulates a priority queue.
  103. Idea: Query Partitioning

    • For each query, find an AABB size that’s just large enough to ensure correctness.
    • Group queries such that queries in each partition share the same AABB.
    • Build a different BVH for each partition.
    • Essentially trades BVH construction overhead for faster search.
    [Figure: queries q0, q1, q2, q3, … → calculate smallest AABB size (d instead of 2r) → partitions, each searched with its own BVH (BVH 0, BVH 1, …, BVH n).]
  107. Determining AABB Size for Range Search

    • Build a uniform grid.
    • Start from the cell that contains the query, and iteratively grow along all four (2D) or six (3D) directions.
    • Stop when K neighbors are found (or the sphere boundary is reached).
    • We call the final collection of cells the megacell, with a width d.
    • d is the AABB size.
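The megacell growth can be sketched in 2D as follows. This is a simplified illustration, not the paper's procedure: it grows the megacell by one ring of cells in all four directions at once, counts points per cell with a `Counter`, and caps the width at 2r as a stand-in for "the sphere boundary is reached". `build_grid` and `megacell_width` are hypothetical names.

```python
from collections import Counter

def build_grid(points, cell):
    """Count 2D points per uniform grid cell of side length `cell`."""
    return Counter((int(x // cell), int(y // cell)) for x, y in points)

def megacell_width(query, grid, cell, r, k):
    """Grow a square block of cells around the query until it holds k points
    or its width reaches 2r; return the resulting width d."""
    cx, cy = int(query[0] // cell), int(query[1] // cell)
    n = 0  # number of rings grown so far
    while True:
        count = sum(grid.get((cx + dx, cy + dy), 0)
                    for dx in range(-n, n + 1) for dy in range(-n, n + 1))
        width = (2 * n + 1) * cell
        if count >= k or width >= 2 * r:
            return min(width, 2 * r)
        n += 1
```

Queries in dense regions stop after a ring or two and get a small d, while isolated queries fall back to the full width 2r.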
  109. Determining AABB Size for KNN Search

    • Find the megacell (width d), just like in range search.
    • Can we use d as the AABB size?
    • No! Some of the nearest K neighbors might be outside of the megacell.
    [Figure: p1 is inside the megacell and p2 is outside it, yet p2 is closer to the query q: |qp1| > |qp2|.]
  112. A Conservative AABB Size for KNN

    • The circumscribing circle/sphere of the megacell is guaranteed to have the K nearest neighbors.
    • Why? Given a circle centered at the query with N neighbors, those N neighbors are by definition the N nearest neighbors; N is guaranteed to be >= K because the circle encloses the megacell.
    • AABB must be the circumscribing square/cube of that circle/sphere.
    • Width is $\sqrt{2}\,d$ for 2D and $\sqrt{3}\,d$ for 3D.
  114. Can We Do Better?

    • What we really want to find is sphere C, which is the smallest sphere that contains the K nearest neighbors.
    • How? We know cube A has at least K neighbors.
    [Figure: query q, points p1 and p2, the megacell (cube A, width d), and spheres B and C.]
  116. A Better AABB Size for KNN

    • Assumption: point density is locally uniform within and around a megacell.
    • A sphere C that has the same volume as cube A will contain K neighbors, which are guaranteed to be the K nearest neighbors.
    • AABB size is $2\sqrt[3]{\frac{3}{4\pi}}\,d$ for 3D.
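The two KNN widths can be compared numerically. The sketch below follows from the geometry on the slides: the conservative width is the diameter of the megacell's circumscribing circle/sphere, and the equal-volume width comes from solving $\frac{4}{3}\pi R^3 = d^3$ for $R$ and taking $2R$. The function names are hypothetical.

```python
import math

def conservative_knn_aabb_width(d, dims=3):
    """Diameter of the circle/sphere circumscribing a megacell of width d."""
    return math.sqrt(dims) * d

def equal_volume_knn_aabb_width(d):
    """Diameter of the sphere whose volume equals that of a cube of width d."""
    return 2.0 * (3.0 / (4.0 * math.pi)) ** (1.0 / 3.0) * d
```

For d = 1 in 3D this gives roughly 1.73 for the conservative width versus roughly 1.24 for the equal-volume width, so the density assumption buys a noticeably smaller AABB.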
  118. Search Algorithm (So Far)

    foreach q in queries:
      AABBSize ← findSmallestAABBSize(q);
      partitions.add(AABBSize, q); // assuming a hash table
    foreach p in partitions:
      queries ← all queries in p;
      r ← AABBSize of p;
      bvh ← buildBVH(points, r);
      firstHitAABBs ← traceRays(bvh, queries);
      reorderQueries(queries, firstHitAABBs);
      traceRays(bvh, queries);
  122. Bundle Partitions

    • Problem: too many partitions lead to high BVH construction overhead.
      • Especially bad when point density is globally non-uniform (e.g., astrophysics simulation).
    • Bundle partitions to minimize overall search time. Bundling two partitions:
      • eliminates one BVH construction cost,
      • but also increases the search cost. Why?
    [Figure: partitions p1, p2, p3, p4 mapped to bundles b1, b2, b3.]
  132. Cost Model

    • Search cost is dictated by the number of IS shader calls, which
    • …is dictated by the number of AABBs a query resides in, which
    • …is equivalent to the number of points inside an AABB, which
    • …is density × volume ($r^3$), assuming locally-uniform density.
    • Search cost ∝ $r^3$.
    • $T_{search} = k N \rho S^3$, where
      • $N$: # of queries in a partition,
      • $\rho$: point density in a partition,
      • $S$: AABB size of the partition,
      • $k$: a constant regressed offline.
    [Figure: execution time (s) vs. # of IS shader calls (millions).]
  133. Cost Model

    • When combining two partitions, the AABB size of the new partition must be the max of the two, so the search cost goes up:
      $k(N_1 \rho_1 + N_2 \rho_2)[\max(S_1, S_2)]^3 > k(N_1 \rho_1 S_1^3 + N_2 \rho_2 S_2^3)$
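The inequality can be checked with a few lines of arithmetic. This sketch evaluates the slide's cost model directly; the tuple layout `(N, rho, S)`, the function names, and the sample numbers are made up for illustration, and `k_const` stands for the offline-regressed constant $k$.

```python
def search_cost(k_const, n_queries, density, aabb_size):
    """T_search = k * N * rho * S^3 for a single partition."""
    return k_const * n_queries * density * aabb_size ** 3

def bundled_search_cost(k_const, parts):
    """Search cost when all partitions in `parts` share one BVH.

    parts: list of (n_queries, density, aabb_size); the shared AABB size
    must be the maximum over the bundled partitions.
    """
    s_max = max(s for _, _, s in parts)
    return k_const * sum(n * rho for n, rho, _ in parts) * s_max ** 3
```

For example, bundling a large partition of 1000 queries at S = 0.5 with a small one of 10 queries at S = 2.0 forces all 1010 queries to use S = 2.0, which is why the bundled cost dwarfs the sum of the separate costs.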
  137. Optimal Bundling

    • Bundling increases search cost, but reduces BVH construction cost. What’s the optimal bundling?
    • Combinatorial optimization, but we have to solve it at run-time.
    • We leverage an empirical observation to simplify the problem structure, which yields an efficient linear-time solution.
    [Figure: two alternative mappings from partitions p1, p2, p3, p4 to bundles b1, b2, b3.]
  140. Empirical Observation

    • Empirically: AABB size and # of queries are inversely correlated.
      • Intuitively, only a handful of sparsely located queries need a large AABB to find K neighbors.
    • Given this empirical observation, we can derive the optimal bundling in linear time.
      • Proof omitted; see paper.
    [Figure: AABB size (0.3 to 2.3) vs. number of queries (log scale, 10^4 to 10^7).]
  143. Optimal Bundling Algorithm

    • Algorithm:
      • Sort partitions according to the ascending order of their AABB sizes.
      • Start from the last partition and scan backward; at each step, bundle all partitions that have been scanned, leave the rest unbundled.
      • Pick the one with the lowest search cost.
    [Figure: partitions p1, p2, p3, p4 sorted toward larger AABBs and fewer queries, with candidate bundles b1, b2, b3.]
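The backward scan above can be sketched in a single pass. Two assumptions are worth flagging: the candidates are compared by search cost plus a hypothetical per-BVH `build_cost` (the slide only says "search cost", but without a construction term the unbundled option would always win), and the search term uses the $k N \rho S^3$ model from the earlier slide. `best_bundling` and the tuple layout are illustrative names, not the paper's code.

```python
def best_bundling(parts, k_const, build_cost):
    """Pick the cheapest 'bundle the tail, keep the head separate' split.

    parts: list of (n_queries, density, aabb_size).
    Returns (estimated total cost, index where the bundled suffix starts)
    in the list sorted by ascending AABB size.
    """
    parts = sorted(parts, key=lambda p: p[2])
    n = len(parts)
    # prefix[i]: search cost of partitions 0..i-1, each with its own BVH.
    prefix = [0.0]
    for nq, rho, s in parts:
        prefix.append(prefix[-1] + k_const * nq * rho * s ** 3)
    s_max = parts[-1][2]          # the bundle always contains the largest AABB
    best = (float("inf"), n)
    weight = 0.0                  # running sum of N * rho over the bundled suffix
    for i in range(n - 1, -1, -1):
        weight += parts[i][0] * parts[i][1]
        bundled = k_const * weight * s_max ** 3
        total = prefix[i] + bundled + build_cost * (i + 1)  # i separate BVHs + 1 bundle
        if total < best[0]:
            best = (total, i)
    return best
```

With a negligible build cost the scan keeps every partition separate; as the build cost grows, the split point moves toward bundling everything into one BVH.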
  145. Final Search Algorithm

    foreach q in queries:
      AABBSize ← findSmallestAABBSize(q);
      partitions.add(AABBSize, q); // assuming a hash table
    bundle(partitions);
    foreach p in partitions:
      queries ← all queries in p;
      r ← AABBSize of p;
      bvh ← buildBVH(points, r);
      firstHitAABBs ← traceRays(bvh, queries);
      reorderQueries(queries, firstHitAABBs);
      traceRays(bvh, queries);
  146. Experimental Setup

    • OptiX 7.1, CUDA 11; RTX 2080.
    • Baselines:
      • cuNSearch: grid search in CUDA; used in the SPlisHSPlasH fluid simulator.
      • FRNN: grid search in CUDA.
      • PCLOctree: octree search in CUDA (i.e., uses an octree, as opposed to a BVH, to prune the search).
      • FastRNN: KNN search in RT cores without our optimizations.
    • Datasets:
      • KITTI: self-driving car datasets; points are surface samples; mostly confined in 2D (ground).
      • Stanford 3D Scanning Repo: Bunny, Dragon, Buddha.
      • N-body simulation: non-uniform distribution in 3D.
  147. Speedups over Baselines

    [Figure: log-scale speedup on KITTI (1M, 6M, 12M, 25M), N-body (9M, 10M), and 3D scans (Bunny-360K, Dragon-3.6M, Buddha-4.6M). Range search baselines: PCLOctree, cuNSearch. KNN search baselines: FRNN, FastRNN. Some baseline runs are marked OOM or DNF.]
    • Range search speedup: 2.2X to 44.0X.
    • KNN search speedup: 3.5X to 65.0X.
    1. Higher speedups on larger inputs.
    2. Higher speedups on KNN search.
  148. Time Distribution

    [Figure: two stacked-bar panels of time (%) split into Data, Opt, BVH, FS, and Search, for KITTI (1M, 6M, 12M, 25M), N-body (9M, 10M), and 3D scans (Bunny-360K, Dragon-3.6M, Buddha-4.6M).]
    • Range search: much of the time is spent on optimization, data transfer, and BVH construction.
    • KNN search: time is mostly dominated by the actual search.
  149. Time Distribution

    [Figure: the same two panels, with the N-body bars highlighted.]
    • Galaxy (point) distribution in the universe is very non-uniform, so a lot of time is spent on partitioning.
  150. Optimization Effects

    [Figure: log-scale time (s) of KNN and range search on N-body (9M) and KITTI (12M), comparing NoOpt, Sched., Sched. + Partition, Sched. + Partition + Bundle, and Oracle. Annotated values: 18.6% and 161.3 on N-body; 18.8% on KITTI.]
  151. General-Purpose Irregular Processor

    • Conventional GPUs evolved to support general-purpose regular applications; will the same happen to RT cores?
    • A few examples of using RT cores for non-graphics workloads.
      • Key: formulate your problem as a BVH search.
    • But very limited, because RT cores are built to support only BVH search, which has a very specific branching logic (ray-AABB test).
    • Relax the hardware? Does it make sense? Will Nvidia do it?
  152. Approximate Neighbor Search

    • Most often, applications don’t need precise search.
    • Many natural opportunities for approximation in our algorithm.
      • Use a smaller-than-necessary AABB to build the BVH.
      • Elide ray-sphere test (skip IS shader calls); provides an error bound.
    • Even better: many applications that use neighbor search are differentiable (e.g., neural network). We could integrate approximate neighbor search into the training process to tolerate end-to-end accuracy loss.
      • See Yu Feng’s ISCA 2022 paper.