Log-Polar Reconstruction
Log-polar reconstruction is a technique that leverages the properties of logarithmic polar coordinates to efficiently compute the Radon transform and its adjoint, the back-projection operator, which are fundamental in tomographic imaging. In this method, both the Radon transform and back-projection are expressed as convolutions in log-polar coordinates. By resampling data into this coordinate system, fast algorithms based on the Fast Fourier Transform (FFT) can be applied, greatly improving computational efficiency.
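To make the coordinate change concrete, the following NumPy sketch resamples a square image onto a log-polar grid. The function name, grid sizes, and nearest-neighbour interpolation are illustrative choices, not the production code, which uses higher-order (e.g. cubic) interpolation on the GPU. In these coordinates a rotation of the image becomes a circular shift along the angle axis and a uniform scaling becomes a shift along the log-radius axis, which is what turns the Radon transform and back-projection into convolutions.

```python
import numpy as np

def to_log_polar(img, n_rho, n_theta, rho_min=0.05):
    """Resample a square image onto a log-polar grid (nearest neighbour).

    Rows correspond to log-radius samples, columns to angles. The grid
    sizes and rho_min are illustrative, not values from the cited papers.
    """
    n = img.shape[0]
    # log-spaced radii from rho_min to 1, in units of the image half-width
    rho = np.exp(np.linspace(np.log(rho_min), 0.0, n_rho))
    theta = np.linspace(0.0, 2.0 * np.pi, n_theta, endpoint=False)
    # Cartesian position of every (rho, theta) sample
    x = rho[:, None] * np.cos(theta)[None, :]
    y = rho[:, None] * np.sin(theta)[None, :]
    # nearest-neighbour lookup back into the Cartesian image
    ix = np.clip(np.round((x + 1.0) * (n - 1) / 2.0).astype(int), 0, n - 1)
    iy = np.clip(np.round((y + 1.0) * (n - 1) / 2.0).astype(int), 0, n - 1)
    return img[iy, ix]
```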
Implementing log-polar reconstruction on graphics processing units (GPUs) further improves performance, as GPUs are highly optimized for computationally intensive operations such as FFTs and higher-order interpolation schemes, including cubic interpolation. Consequently, for large-scale datasets, GPU-accelerated log-polar reconstruction can outperform GPU implementations of filtered back-projection (FBP). A critical step in log-polar reconstruction is resampling data from log-polar to Cartesian coordinates, which may require a large number of atomic add operations targeting the same memory locations. Because these atomic updates are serialized, they can create significant contention and become a performance bottleneck. This blog post presents an approach that effectively mitigates this issue.
For further information, see "Fast Algorithms and Efficient GPU Implementations for the Radon Transform and the Back-Projection Operator Represented as Convolution Operators".
Log-polar reconstruction steps
The current workflow of the log-polar reconstruction algorithm consists of the following steps:
Initialization and Preparation: Set up reconstruction parameters, allocate memory, and initialize data structures.
Filtering with Frequency-Domain Oversampling: Apply filtering in the frequency domain using oversampling to reduce aliasing and improve accuracy.
Complex Data Formation and FFT: Convert the filtered sinogram to complex-valued representation and perform the Fast Fourier Transform (FFT) to obtain frequency-domain data.
Frequency-Domain Spread: Map non-uniformly sampled frequency data onto a regular 2D Cartesian grid using Gaussian spreading.
Inverse 2D FFT: Perform the inverse two-dimensional FFT to transform the data back to the spatial domain.
Final Processing: Apply post-processing steps such as normalization, cropping, and any required corrections.
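The stages above can be sketched end to end in NumPy. This is a toy walk-through, not the actual GPU implementation: each frequency sample is deposited at its nearest grid point instead of being Gaussian-spread, the ramp filter is a simple illustrative choice, and all names are assumptions.

```python
import numpy as np

def reconstruct_sketch(sino):
    """Toy NumPy walk-through of the pipeline stages. `sino` has shape
    (nproj, n); the real implementation runs on the GPU."""
    nproj, n = sino.shape
    theta = np.arange(nproj) * np.pi / nproj
    # filtering with frequency-domain oversampling (2x zero-padding + ramp)
    pad = np.pad(sino, ((0, 0), (0, n)))
    ramp = np.abs(np.fft.fftfreq(2 * n))
    filt = np.real(np.fft.ifft(np.fft.fft(pad, axis=1) * ramp, axis=1))[:, :n]
    # complex data formation and 1D FFT along the detector axis
    F = np.fft.fftshift(np.fft.fft(np.fft.ifftshift(filt, axes=(1,)), axis=1),
                        axes=(1,))
    # frequency-domain spread onto a regular n x n Cartesian grid
    grid = np.zeros((n, n), dtype=complex)
    t = (np.arange(n) - n // 2) / n                  # normalized frequencies
    for ip, th in enumerate(theta):
        ix = np.round((t * np.cos(th) + 0.5) * n).astype(int) % n
        iy = np.round((-t * np.sin(th) + 0.5) * n).astype(int) % n
        np.add.at(grid, (iy, ix), F[ip])             # scatter-add (cf. atomicAdd)
    # inverse 2D FFT back to the spatial domain
    rec = np.fft.ifft2(np.fft.ifftshift(grid))
    # final processing: keep the real part, recenter the image
    return np.real(np.fft.fftshift(rec))
```

Note that `np.add.at` is an unbuffered scatter-add: like `atomicAdd` on the GPU, it accumulates correctly even when many samples target the same grid cell, which is exactly the collision pattern discussed below.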

Frequency-Domain Spread
This stage allocates a two-dimensional frequency grid for reconstruction and samples the angular frequency data onto a regular 2D Cartesian grid using Gaussian spreading kernels.
Basic Gather Operation
Performs the core gathering (spreading) of frequency-domain projection data onto a regular 2D Cartesian grid. This is the standard implementation without any optimizations. Multiple threads write concurrently to the same memory address, causing atomic operation collisions. Because of the polar coordinate system, atomic adds target the same memory address far more frequently near the center of the grid than further out. The attached axial view illustrates this by counting the atomic add operations at specific memory addresses.

Basic Gather Operation flow
- Thread assignment: Each thread processes one frequency-domain projection sample at position (tx, ty, tz), where:
  - tx: frequency index along the detector dimension (0 to n-1)
  - ty: projection angle index (0 to nproj-1)
  - tz: slice index (0 to nz-1)
- Coordinate transformation: Converts the projection sample to normalized frequency coordinates:
  - Computes the rotation angle theta[ty]
  - Calculates the target position (x0, y0) in 2D frequency space using:
    - x0 = (tx - n/2) / n ⋅ cos(theta)
    - y0 = -(tx - n/2) / n ⋅ sin(theta)
  - Clamps coordinates to the valid range [-0.5, 0.5)
- Gaussian spreading: Distributes the projection sample across a (2m+1) × (2m+1) neighbourhood using a Gaussian kernel. For each grid point (ell0, ell1) in the spreading window:
  - Calculates the distance from (x0, y0) to the grid point
  - Computes the Gaussian weight: w = (π/μ) ⋅ exp(-π²/μ ⋅ distance²)
  - Applies the weight to the complex value g0
  - Atomically adds the weighted contribution to the output grid f using atomicAdd
- Index wrapping: Uses modulo arithmetic to handle periodic boundaries in the frequency domain
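The flow above can be sketched for a single sample in NumPy, with `np.add.at` standing in for CUDA's `atomicAdd` (both are serialized scatter-adds). The function and parameter names are illustrative; `mu` controls the Gaussian width as in the weight formula above.

```python
import numpy as np

def spread_sample(f, g0, tx, ty, theta, n, m, mu):
    """Spread one complex projection sample g0 onto the n x n frequency
    grid f, following the basic gather operation flow (NumPy sketch)."""
    # coordinate transformation to normalized frequency coordinates
    th = theta[ty]
    r = (tx - n / 2) / n
    x0 = np.clip(r * np.cos(th), -0.5, 0.5 - 1.0 / n)
    y0 = np.clip(-r * np.sin(th), -0.5, 0.5 - 1.0 / n)
    # nearest grid indices of (x0, y0)
    cx, cy = int(round((x0 + 0.5) * n)), int(round((y0 + 0.5) * n))
    for l1 in range(-m, m + 1):          # spreading window rows
        for l0 in range(-m, m + 1):      # spreading window columns
            gx, gy = cx + l0, cy + l1
            dx = gx / n - 0.5 - x0       # distance from grid point to (x0, y0)
            dy = gy / n - 0.5 - y0
            w = (np.pi / mu) * np.exp(-np.pi**2 / mu * (dx * dx + dy * dy))
            # index wrapping: periodic boundaries in the frequency domain;
            # np.add.at plays the role of atomicAdd on the GPU
            np.add.at(f, (gy % n, gx % n), w * g0)
    return f
```

Running this for every (tx, ty) pair reproduces the collision pattern described above: samples near r = 0 from all projection angles land on the same few central grid cells.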
The source code can be found at the following link.
Optimized center region spread
Performs the Gaussian spread operation only in the center region using a reverse-gather approach optimized with angle pruning. Instead of spreading each projection sample onto the grid, it pulls contributions from the relevant projections for each output pixel, which avoids atomic add operations entirely.
The center gather kernel has two nested for loops: the outer loop iterates over the theta values and the inner loop over the valid radius points.
Pruning
During the gathering stage, each projection contributes to all pixels within a certain spatial radius. This means that a single ray or projection affects multiple neighboring pixels around its path. If we look at this process inversely, from the perspective of a single pixel, we can observe that the pixel only receives contributions from the subset of projections that pass through a circular neighborhood centered at that pixel.
This observation enables pruning. Instead of looping over all projections for every pixel, we precompute which projections can possibly intersect the pixel's support region. For each pixel, we determine the start and end projection indices that bound the set of projections intersecting the pixel's circle of influence.
All projections outside this interval are guaranteed to have zero contribution and can be safely skipped. This significantly reduces the number of memory accesses and arithmetic operations, improving both runtime and cache efficiency.
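A minimal sketch of the pruning test, using the mapping x0 = t⋅cos(theta), y0 = -t⋅sin(theta) from the spread step: the frequency line at angle theta passes within distance r of the pixel (px, py) exactly when |px⋅sin(theta) + py⋅cos(theta)| ≤ r. The function returns a boolean mask for clarity; the actual kernel would store only the first and last index of the contributing run. Names are illustrative.

```python
import numpy as np

def prune_angles(px, py, r, theta):
    """Return a mask over projection angles whose frequency line passes
    within radius r of the pixel (px, py), all in normalized coordinates."""
    if np.hypot(px, py) <= r:
        # central pixel: every projection line intersects the support
        return np.ones_like(theta, dtype=bool)
    # perpendicular distance from (px, py) to the line through the origin
    # with direction (cos(theta), -sin(theta))
    dist = np.abs(px * np.sin(theta) + py * np.cos(theta))
    return dist <= r
```

For an off-center pixel the mask is true only on a narrow angular band around the pixel's azimuth (and its antipode), which is exactly the interval bounded by the precomputed start and end indices; for a central pixel the mask is all true, matching the symmetric case in figure 3.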
Figure 3 shows the geometric intuition behind pruning: red lines represent all possible projection rays over the full angular range, the gray rectangular region represents the reconstruction domain (the image grid), the blue circle marks the local support (radius of influence) around a chosen pixel, and green rays highlight only the projections that intersect this circle.
Only the rays that pass through the blue circle contribute to that pixel. The remaining red rays do not intersect the support and therefore can be ignored. The top-left and top-right panels illustrate how the contributing angular interval changes as the pixel location moves, while the bottom panel shows the symmetric case for a central pixel where all the projections contribute.
Figure 4 shows a heatmap of the number of contributing projections per pixel: yellow regions indicate pixels that receive contributions from many projections (typically near the center), while purple regions indicate fewer contributing projections (toward the edges).
The naive and optimized prune kernel source code can be found at the following link.

Optimized Gather Kernel
The gather kernel reconstructs the value of each output pixel by directly accumulating contributions from only the projections that geometrically intersect its spatial support. Each thread is responsible for computing a single center pixel (tx, ty, tz), which eliminates write conflicts and avoids atomic operations.
For every pixel, the kernel first loads precomputed angular intervals that specify which projections may contribute. The pixel location is converted to normalized coordinates, and the algorithm iterates only over the pruned projection ranges. For each candidate projection, a geometric distance test determines whether the projection line lies within the spreading radius. Non-intersecting projections are skipped early.
For intersecting projections, the kernel computes the detector radial range where the pixel’s circular support overlaps the projection line. Only those detector samples are processed. Samples are handled in small SIMD-style chunks, where complex projection values are loaded, mapped to frequency coordinates, weighted with a Gaussian kernel, and accumulated into a local accumulator.
After all valid contributions are gathered, the final value is written directly to the output grid. Because each thread exclusively updates its own pixel, the method avoids synchronization overhead while reducing memory traffic and improving cache locality through pruned, sequential projection access.
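A minimal per-pixel sketch of this gather, assuming `F` holds the 1D FFTs of the filtered projections and `mask` is the pruned set of contributing projections for this pixel. The SIMD-style chunking and precomputed radial ranges of the real kernel are omitted; here the detector samples inside the support are simply selected with a distance test. All names are illustrative.

```python
import numpy as np

def gather_pixel(F, theta, px, py, n, m, mu, mask):
    """Atomic-free gather for one output pixel at normalized coordinates
    (px, py). F has shape (nproj, n); mask flags contributing projections."""
    acc = 0.0 + 0.0j                          # private accumulator, no atomics
    rad = m / n                               # spreading radius, normalized
    t = (np.arange(n) - n / 2) / n            # normalized detector frequencies
    for ip in np.nonzero(mask)[0]:            # pruned projections only
        c, s = np.cos(theta[ip]), np.sin(theta[ip])
        # squared distance from the pixel to each frequency sample (t*c, -t*s)
        d2 = (t * c - px) ** 2 + (-t * s - py) ** 2
        sel = d2 <= rad * rad                 # detector range inside the support
        w = (np.pi / mu) * np.exp(-np.pi**2 / mu * d2[sel])
        acc += np.sum(w * F[ip, sel])         # weighted accumulation
    return acc                                # written once, by one thread
```

Because each pixel owns its accumulator and performs a single final write, no synchronization is needed between threads, which is the core of the optimization.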
Key advantages
- No atomic operations (one thread per pixel)
- Fewer projection accesses via angular pruning
- Improved cache locality along detector samples
- SIMD-style vectorized accumulation for higher throughput
The optimized gather kernel source code can be found at the following link.
Benchmark Results
The optimization performance for different input configurations is summarized in figure 5. All experiments were conducted using datasets with 64 slices, while varying the number of projections and detector width. The benchmarks were run on an NVIDIA V100 as well as an NVIDIA L40S accelerator.
For the smallest input case [723,362], the optimization slightly degrades performance, by -12% and -5% on the V100 and L40S respectively. However, as the input size increases, the performance gains grow consistently on both accelerators. At the second smallest input [1447,724], the gains are 17% and 41% respectively. The speedup continues to increase with size until input size [3619,1840], where it peaks at 88% for the V100 and 146% for the L40S. At the largest benchmarked input size [5790,2896], the optimization still shows major improvements of 83% and 89% respectively, demonstrating stable and scalable performance gains for large problem sizes.

To place the results in the context of other reconstruction algorithms, the advantage of the optimized LPRec over the FBP reconstruction method is summarized in figure 6. In a setup matching the previous one, the optimized LPRec is significantly faster than the toolkit's FBP method in all measured cases, on both the V100 and L40S accelerators.
Interestingly, the two accelerators show different patterns. On the V100, the advantage generally grows with size: the lowest advantage, 42%, is measured at the small-to-medium input size [1447,724], while the highest, 273%, is at the largest benchmarked size [5790,2896]. Conversely, on the L40S the advantage is highest at the smallest input size [723,362], at 292%, and generally diminishes with size, arriving at 92% for the largest input case [5790,2896].

Conclusion
The overall performance of log-polar reconstruction is often limited less by arithmetic cost than by the atomic additions in the frequency-domain spreading step. Near the center of the grid, many threads compete for the same memory locations. This contention becomes the dominant bottleneck and prevents the GPU from reaching its full computational throughput.
By reformulating the operation as a gather instead of a scatter, this bottleneck can be removed entirely. Assigning one thread to each output pixel eliminates write conflicts and removes the need for atomic operations. Geometric pruning further reduces the workload by restricting each pixel to only the projections that can actually contribute. Together with improved cache locality and SIMD style accumulation, the spreading stage becomes a predictable and compute efficient kernel rather than a synchronization bound one.
The outcome is clear in the benchmarks. Runtime reductions of roughly 55 to 60 percent are achieved for large datasets, and the gains scale with problem size. At the same time, numerical accuracy is preserved while memory traffic and synchronization overhead are significantly reduced.
Ultimately, avoiding atomic operations is not just a small optimization. It reshapes the algorithm to better match the GPU architecture and unlocks the true performance potential of log-polar reconstruction. With these atomic-free gather strategies, log-polar methods become not only elegant in theory but also fast and practical for large-scale tomographic imaging.
References
- Nikitin, V. et al., 2017. Fast hyperbolic Radon transform represented as convolutions in log-polar coordinates. Computers & Geosciences, 105, pp.21-33.
- Nikitin, V., 2023. TomocuPy: efficient GPU-based tomographic reconstruction with asynchronous data processing. Journal of Synchrotron Radiation, 30(1), pp.179-191.
- https://github.com/tomography/tomocupy/