Parallelism, Vectorization and Multi-Threading
21 min read.

XX WIP - This series of lessons is being written (July-December 2022) XX

XX TODO need to speak about scalability here, including the fact that with network rendering you can expand beyond the capabilities of a single computer (both in terms of memory and CPU) aka distributed systems. XX TODO you don't mention vectorization. Fix this please. XX

Foreword

Ray-tracing and 3D rendering in general are slow. It doesn't take much to get render times greater than a few minutes. We had an example of that with the last lesson from the basic section, which was devoted to rendering a volumetric sphere with a single light. When dealing with a more complex scene, more lights, complex shading techniques, motion blur, etc., render times can easily go through the roof. Some frames on movies produced by studios such as Pixar, Disney or Weta Digital can take up to 400+ hours per frame per core (as of 2022). Yep, 400+ hours. We are far, far away from real-time.

So, if our goal is to write a modern renderer with which we can create frames of reasonable "beauty", the first thing you should be concerned with is finding whatever means you can to speed up that process. If ray-tracing is the method of your choice, the two techniques that will offer you the most immediate and greatest benefits in terms of speed are multi-threading (including SIMD instructions, which we will touch upon in this lesson) and ray-tracing acceleration structures, which are the topic of the next lessons. Let's be fast first, before making pretty pictures (because waiting for an image, particularly when it's buggy, is not a lot of fun).

The other reason why we chose to start with multi-threading is first because it's fun and almost easy, and second because adding multi-threading support to our renderer will have a significant impact on how the rest of the program will be structured. Therefore it's better to do it now than later. Laying down the foundation of your house and building the walls will eventually define how you will be able to move through it. The same idea is true of multi-threading support: get it done first, as it will set the framework you will be left to work with for the rest of the project.

Finally, there will be little to no mathematics in this lesson. Multi-threading is essentially a programming/engineering task. That will be a nice change, as constantly dealing with calculus can be a pain.

In this first version of the lesson, we will essentially focus on learning how to implement multi-threading using a tiling strategy. We will, however, briefly touch on some other techniques such as ray packets and SIMD instructions. We will develop these two topics in future versions of this lesson (once the advanced section is completed).

Ready?

Multi-Threading

Most (if not all) modern CPUs now have more than a single core (or processor): typically 2 on laptops, 4, 8 or more, each of which can work on a different task. This has, in effect, a huge impact on performance. The number of cores a CPU has shouldn't be confused with its number of threads. Most modern CPUs can also have each core 'split' into two threads. This is called simultaneous multithreading or hyper-threading. For instance, the AMD Ryzen 7 (1800X) CPU is an 8-core processor but uses simultaneous multithreading and so effectively delivers 16 threads (most Intel processors use it too).

Applied to the problem of rendering an image of a 3D scene, the idea is that if it takes 16 minutes to render a frame on a single thread, then it shall take 1 minute if we put 16 threads to work on the problem. That's the theory; in practice you never get a x16 speed-up, because there's always a bit of time lost in the process of synchronizing work between threads, your computer is also busy running other tasks, etc. The speed-up you gain also depends on the scene itself. But more on that in the next lessons. Bottom line: rendering on 16 threads should be faster than rendering on a single thread. And that's what matters for now.
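As an aside, a classic way of quantifying this loss is Amdahl's law: if a fraction p of the work can be parallelized, the best speed-up you can hope for on N threads is 1 / ((1 - p) + p / N). Even with p = 0.95, 16 threads give you a speed-up of roughly x9 at best, not x16.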

The good thing about rendering a frame on multiple threads is that it has more to do with programming than with engineering a special kind of data structure such as the acceleration structures we will study in the next lessons. So in essence it's slightly less brainwork-intensive than developing an acceleration structure, and relies instead on your knowledge of how multi-threading is supported by the programming language.

Yet we still need a strategy. How do we apply multi-threading to accelerate ray-tracing?

Tiling

While there can be several strategies, in this first revision of the lesson we will use a technique that has the benefit of being super simple to explain and implement. The idea is to "divide and conquer". We will split the frame into small square chunks called tiles (sometimes also abusively called buckets, in reference to a mechanism used by old versions of Pixar's rendering software) and let the threads render these tiles until all tiles are processed.

Typically, these tiles have a resolution that's a power of 2, such as 16, 32 or 64 pixels. For a frame with a resolution of 640x480 pixels, for instance, we would get 20 tiles along the horizontal dimension and 15 tiles along the vertical dimension for a tile of 32x32 pixels. That makes a total of 300 tiles. So it's like having 16 "computers" (assuming our CPU has 16 threads available) rendering 300 small 32x32 images; all we have to do in the end to get the final picture is assemble the tiles together. Done!

How do we implement that?

We are going to use the thread library from the C++ standard library. More specifically, we are going to use jthread objects (jthread is more robust than thread, but the two classes are essentially the same). This library lets us create individual threads of execution within a program. It provides a function hardware_concurrency() that conveniently returns the number of concurrent threads your CPU supports:

unsigned int num_threads = std::jthread::hardware_concurrency();
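One caveat worth knowing: the standard allows hardware_concurrency() to return 0 when the value cannot be determined, so defensive code (a small sketch of ours, with an arbitrary fallback value) typically guards against that:

unsigned int num_threads = std::jthread::hardware_concurrency();
if (num_threads == 0)
    num_threads = 4; // arbitrary but safe fallback when the CPU can't be queried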

Next, we need to compute the number of horizontal and vertical tiles for a particular frame. This task is trivial:

struct render_info_t
{
    unsigned int width { 640 };
    unsigned int height { 480 };
    unsigned int tile_size { 32 };
    unsigned int num_tiles_x{}, num_tiles_y{};
};

int main()
{
    ...
    ri.num_tiles_x = (ri.width + ri.tile_size - 1) / ri.tile_size;
    ri.num_tiles_y = (ri.height + ri.tile_size - 1) / ri.tile_size;
    ...
}

The image width or height is not necessarily a multiple of the tile resolution, and the code above takes care of that by adding tile_size - 1 pixels to the image width and height before dividing by tile_size (integer division rounds down, so this trick rounds up instead). For example, with a width of 1000 pixels and 32x32 tiles, (1000 + 31) / 32 = 32 tiles, where a plain 1000 / 32 would give 31 and miss the last, partially covered tile.

Next, we are ready to kick off our threads. The strategy we will be using in this particular implementation is to create a single global variable initially set to the total number of tiles to render. This variable, a simple integer, will be defined with an atomic type; we will explain why shortly. Let's give the code first and explain what it does:

void render(std::atomic_int& count)
{
    int curr_tile {};
    while ((curr_tile = --count) >= 0) {
        // render the current tile (see below)
    }
}

int main()
{
    ...
    std::atomic_int num_tiles = ri.num_tiles_x * ri.num_tiles_y;
    for (unsigned int n = 0; n < num_threads; ++n) {
        threads.emplace_back(std::jthread(render, std::ref(num_tiles)));
    }
    ...
}

The idea is rather simple (so is the implementation, which is why we like this approach for teaching multi-threading in rendering). We create threads by calling the std::jthread constructor. It takes as its first argument the function that the thread will run, followed by a sequence of arguments that will be passed on to that thread function (which in our example is called render). For now, the only argument that we are going to pass to render() is the num_tiles variable. This variable is passed by reference (hence the use of the std::ref function). The parameter count in render() references the num_tiles variable declared in the main() function of the program (which, remember, is originally set to 300 in our current example). Within the render function, each thread decrements the count variable (which, remember, is a reference to num_tiles) and assigns the result to the thread-local variable curr_tile. This becomes the index of the tile whose pixels the thread will calculate in the while loop. Imagine we have 2 threads for now. The sequence of execution could look like this (the order may change depending on how long the threads take to render each tile in the sequence):

Thread 1: rendering tile #299
Thread 2: rendering tile #298
Thread 1: rendering tile #297
Thread 2: rendering tile #296
...
Thread 2: rendering tile #1
Thread 1: rendering tile #0

Next we convert the current tile index number to pixel coordinates:

void render(const render_info_t* ri, std::atomic_int& count)
{
    int curr_tile {};
    while ((curr_tile = --count) >= 0) {
        unsigned int curr_tile_y = curr_tile / ri->num_tiles_x;
        unsigned int curr_tile_x = curr_tile - curr_tile_y * ri->num_tiles_x;
        unsigned int x0 = curr_tile_x * ri->tile_size;
        unsigned int x1 = std::min((curr_tile_x + 1) * ri->tile_size, ri->width);
        unsigned int y0 = curr_tile_y * ri->tile_size;
        unsigned int y1 = std::min((curr_tile_y + 1) * ri->tile_size, ri->height);
        for (unsigned int y = y0; y < y1; ++y) {
            for (unsigned int x = x0; x < x1; ++x) {
                // render pixel with frame coordinate x,y ... voodoo goes here
            }
        }
    }
}

The only subtlety here is to clip the x1 and y1 variables to the image width and height respectively. If the width of the image is 1000 and the tile size 32, you will get 32 horizontal tiles. 32x32 is equal to 1024, which is greater than the image width. We don't want to render these extra pixels, and writing their results into the framebuffer is likely to make the program crash anyway (as the buffer will likely be dimensioned for a 1000-pixel-wide image, and writing past the end of a buffer generally causes a program to crash). Beyond these two small details, everything else is really trivial. All we are now left with is calculating the color of each pixel in the tile.

By the way, now is a good time to say that tiles don't need to have a resolution that's a power of 2 such as 16, 32 or 64, but it's generally best, for at least two good reasons. First, the width (at least) of most formats used in production divided by 16, 32, ... gives a round number (e.g. 640, 1920, 2048). Second, data whose size is a power of 2 is generally well aligned in the computer's memory, leading to more efficient use and access of that memory (this memory alignment issue will be touched on again later in this lesson). Anyway, while you can choose any number you like, it's better to stick to these power-of-2 numbers.

Thread concurrency and race conditions

Now that we have an overview of the strategy, let's explain why we declared the variable num_tiles with an atomic type.

In the world of parallelism and multi-threading, two or more threads may access and manipulate the same variable at the same time. And this is bad. Imagine two threads T1 and T2 incrementing a variable and reading its content after it has been incremented, to do something important with that number (like rendering a tile). T1 could be first to increment the variable v, but then T2 could also increment v before T1 got a chance to read its content. So imagine that v=0 to start with: T1 increments v, then T2 does too, so when T1 reads v it gets 2 instead of 1 as it should be (from T1's perspective). Imagine that number is the index of the tile a thread should render. Then in this example, you'd have two threads rendering tile number 2, while tile number 1 would have been skipped. Not what you want.

This is called a race condition, and it is what makes parallelism challenging at times. When a thread enters what is called a critical section, that is, a part of the code in which the thread is about to access and change the content of variables that other threads might also access and change at the same time, C++ offers at least two possible solutions:

  1. Protect the critical section with a lock such as std::mutex, so that only one thread at a time can execute it.
  2. Use an atomic type such as std::atomic_int, whose operations (incrementing, decrementing, ...) are guaranteed to be performed as a single, indivisible step.

Therefore, setting the type of the variable num_tiles to std::atomic_int (the second solution) guarantees that the decrement and the read of the count variable in the thread function happen as one indivisible operation: no two threads can ever obtain the same tile index, which avoids the problems we mentioned earlier.
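If you want to convince yourself that this works, here is a small standalone test (a sketch of ours, not part of the renderer): a few threads claim "tiles" by decrementing a shared atomic counter, and we verify at the end that every tile index was claimed exactly once.

#include <atomic>
#include <iostream>
#include <thread>
#include <vector>

int main()
{
    constexpr int total_tiles = 100000;
    std::atomic_int count = total_tiles;
    std::vector<char> claimed(total_tiles, 0); // one slot per tile index

    std::vector<std::jthread> threads;
    for (int n = 0; n < 8; ++n) {
        threads.emplace_back([&count, &claimed] {
            int curr_tile;
            // --count is a single indivisible read-modify-write operation:
            // no two threads can ever obtain the same tile index
            while ((curr_tile = --count) >= 0)
                claimed[curr_tile]++;
        });
    }
    for (auto& t : threads) t.join();

    int missed = 0, duplicated = 0;
    for (char c : claimed) { if (c == 0) ++missed; if (c > 1) ++duplicated; }
    // with std::atomic_int both counts are always 0; replace the counter
    // with a plain int and they generally won't be
    std::cout << "missed: " << missed << ", claimed twice: " << duplicated << "\n";
    return 0;
}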

Finally, all we are left with is to store the pixels of each tile rendered by the threads into the final image buffer. We personally like to create a small buffer managed by the thread into which the pixels of the current tile are stored. Once all the pixels of a tile have been rendered, we then copy the pixels from the thread's buffer into the image's buffer. Many different strategies are possible here, but that's the one we've chosen for this particular version of the lesson. Since the threads are guaranteed to always write to a different part of the image buffer, we don't need to lock/unlock this part of the code with a mutex. Here is the code:

#include <algorithm>
#include <atomic>
#include <cstdlib>
#include <cstring>
#include <fstream>
#include <functional>
#include <iostream>
#include <random>
#include <thread>
#include <vector>

struct render_info_t
{
    unsigned int width { 640 };
    unsigned int height { 480 };
    unsigned int tile_size { 32 };
    unsigned int num_tiles_x{}, num_tiles_y{};
    unsigned char* buffer { nullptr };
};

struct thread_info_t
{
    unsigned int id {};
    const render_info_t* render_info { nullptr };
};

void render(const thread_info_t& thread_info, std::atomic_int& count)
{
    int curr_tile {};
    const render_info_t* ri = thread_info.render_info;
    thread_local std::mt19937 gen(std::hash<std::jthread::id>()(std::this_thread::get_id()));
    std::uniform_real_distribution<float> dist(0.0f, 1.f);
    // small per-thread buffer into which the pixels of the current tile are stored
    unsigned char* buffer = (unsigned char*)malloc(ri->tile_size * ri->tile_size * 3);
    while ((curr_tile = --count) >= 0) {
        float r = dist(gen);
        float g = dist(gen);
        float b = dist(gen);
        unsigned int curr_tile_y = curr_tile / ri->num_tiles_x;
        unsigned int curr_tile_x = curr_tile - curr_tile_y * ri->num_tiles_x;
        unsigned int x0 = curr_tile_x * ri->tile_size;
        unsigned int x1 = std::min((curr_tile_x + 1) * ri->tile_size, ri->width);
        unsigned int y0 = curr_tile_y * ri->tile_size;
        unsigned int y1 = std::min((curr_tile_y + 1) * ri->tile_size, ri->height);
        for (unsigned int y = y0; y < y1; ++y) {
            for (unsigned int x = x0; x < x1; ++x) {
                buffer[((y - y0) * ri->tile_size + (x - x0)) * 3    ] = (unsigned char)(r * 255);
                buffer[((y - y0) * ri->tile_size + (x - x0)) * 3 + 1] = (unsigned char)(g * 255);
                buffer[((y - y0) * ri->tile_size + (x - x0)) * 3 + 2] = (unsigned char)(b * 255);
            }
        }
        // copy the tile into the final image buffer, row by row. The data pointed
        // to by the pointer is not part of the struct, so it can be changed even
        // though ri is const. Since threads always write to different tiles, no
        // mutex is needed here.
        for (unsigned int y = y0; y < y1; ++y) {
            std::memcpy(ri->buffer + (y * ri->width + x0) * 3,
                        buffer + (y - y0) * ri->tile_size * 3,
                        (x1 - x0) * 3);
        }
    }
    free(buffer);
}

int main(int argc, char **argv)
{
    unsigned int num_threads = std::jthread::hardware_concurrency();
    std::cout << "Rendering with " << num_threads << " threads" << std::endl;
    std::vector<std::jthread> threads;
    render_info_t ri;
    ri.num_tiles_x = (ri.width + ri.tile_size - 1) / ri.tile_size;
    ri.num_tiles_y = (ri.height + ri.tile_size - 1) / ri.tile_size;
    ri.buffer = (unsigned char*)malloc(ri.width * ri.height * 3);
    std::atomic_int num_tiles = ri.num_tiles_x * ri.num_tiles_y;
    for (unsigned int n = 0; n < num_threads; ++n) {
        thread_info_t thread_info;
        thread_info.id = n;
        thread_info.render_info = &ri;
        threads.emplace_back(std::jthread(render, std::move(thread_info), std::ref(num_tiles)));
    }
    for (auto& thread : threads)
        thread.join();
    std::ofstream ofs;
    ofs.open("./output.ppm", std::ios::binary);
    ofs << "P6\n" << ri.width << " " << ri.height << "\n255\n";
    ofs.write((char*)ri.buffer, ri.width * ri.height * 3);
    ofs.close();
    free(ri.buffer);
    return EXIT_SUCCESS;
}
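If you want to try the program, note that std::jthread requires C++20, so you will need a C++20-capable compiler; with GCC on Linux, for instance, something like g++ -std=c++20 -O2 -pthread main.cpp (the file name is ours) does the trick.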

Let's conclude with a few additional comments:

  1. What is join used for? You want the main calling thread (your main program) to hold its execution until all the threads it spawned are done with their task. That's what join() is for: it blocks the calling thread until the thread it is called on finishes its execution.
  2. Why is emplace_back used instead of the more common push_back? That's rather a detail, but a thread can be moved and can't be copied. Using emplace_back allows the thread to be created "in-place" inside the vector (so it's more efficient, but hey, there's no small gain). Note that you can also pass the constructor parameters you would have passed to std::jthread directly to the emplace_back() function, like so:
    threads.emplace_back(render, std::move(thread_info), std::ref(num_tiles));
    They are forwarded to the thread that is constructed.
  3. To be sure our code is working properly and to get something to look at, a random color is assigned to all the pixels of each rendered tile. To do so, we use the random number generators provided by the random C++ standard library. We won't spend a lot of time detailing this part of the code, but note that random number generators within threads need to be handled with special care. The way you use these functions (and the type of function you use) may or may not be thread-safe. More on that in future versions of this lesson. Let's progress with the rest of this section's content first.

    Note that, regarding random number generators within threads, you might want the generators to produce the exact same sequence of random numbers each time you render the same frame. This needs a special solution. For now, we are just seeding each thread's random number generator with a hash key generated from the thread id (the thread_local line at the top of the render() function), so the result will be different on each run.
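    If you do want reproducible renders, one simple option (a sketch of ours, in the spirit of the render() function above, not the approach used in this lesson's code) is to re-seed the generator per tile from the tile index, so that the random sequence no longer depends on which thread happens to pick up which tile:

    std::mt19937 gen; // per-thread generator, seeded inside the loop
    while ((curr_tile = --count) >= 0) {
        // hypothetical: a fixed base seed plus the tile index makes the noise
        // identical on every run, whichever thread renders the tile
        gen.seed(0x12345678u + (unsigned int)curr_tile);
        float r = dist(gen), g = dist(gen), b = dist(gen);
        // ... render the tile as before ...
    }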

    Declaring the random number generator thread_local is not necessary here, but we wanted to introduce this concept in this lesson as we will be using it in future lessons. When a variable is declared thread_local, either globally or inside a thread function, its state is maintained throughout recursive calls to the thread function and can be accessed by functions called from within the thread function, as shown in the following examples. Let's first look at recursion:
    #include <thread>
    #include <mutex>
    #include <iostream>

    void thread_func()
    {
        thread_local int a { 0 };
        int b { 0 };
        {
            static std::mutex m;
            std::lock_guard<std::mutex> lock { m };
            std::cout << "a: " << a++ << " b: " << b++ << std::endl;
        }
        if (a <= 2) thread_func();
    }

    int main()
    {
        std::jthread a(thread_func);
        a.join();
        return 0;
    }
    Possible outcome:
    a: 0 b: 0
    a: 1 b: 0
    a: 2 b: 0
    The state of the variable `a` remains "global" (within the context of the thread) across the successive recursive calls to the thread function: you can see in the outcome that `a` gets incremented on each call while `b` keeps its initial value, since `b` is re-initialized each time the thread function is called.

    Now let's look at another use case:
    #include <thread>
    #include <mutex>
    #include <iostream>

    thread_local int c { 0 };

    void another_func()
    {
        c++;
    }

    void thread_func(int id)
    {
        thread_local int a { 0 };
        int b { 0 };
        {
            static std::mutex m;
            std::lock_guard<std::mutex> lock { m };
            std::cout << "id: " << id << " Results -> a: " << a++ << " b: " << b++ << " c: " << c << std::endl;
        }
        another_func();
        if (a <= 2) thread_func(id);
    }

    int main()
    {
        std::jthread a(thread_func, 1);
        std::jthread b(thread_func, 2);
        a.join();
        b.join();
        std::cout << "Goodbye: " << c << std::endl;
        return 0;
    }
    Possible outcome:
    id: 2 Results -> a: 0 b: 0 c: 0
    id: 2 Results -> a: 1 b: 0 c: 1
    id: 2 Results -> a: 2 b: 0 c: 2
    id: 1 Results -> a: 0 b: 0 c: 0
    id: 1 Results -> a: 1 b: 0 c: 1
    id: 1 Results -> a: 2 b: 0 c: 2
    Goodbye: 0
    Here the variable `c` is declared thread_local at the global scope of the program, so it can be accessed and used by the main function; yet each thread has its own copy of the variable and maintains its state independently from the other threads (including the main one). And this state is maintained throughout the functions that the thread function may call (in this example, thread_func calls another_func).

    Sorry for the long, boring explanation, but the use of thread_local is poorly documented and we will be using it quite often.

    Here is an example of the image the program will generate.

To conclude: there's way more to multi-threading, but this is a first fully functional approach. There are more features we will rely on as we develop the lessons. There are also other strategies besides tiling for working with threads, such as creating a pool of tasks that we stack into some vector and letting the threads run until the pool is empty (see the sketch below). But most of these approaches are more complex to code, so for now, as our goal is just to speed up the render by taking advantage of multi-threading, this is the simplest, most direct possible approach.
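For the curious, here is a minimal sketch of what such a task pool could look like (our illustration only, assuming a shared queue protected by a mutex; the names are ours and this is not the approach used in this lesson's source code):

#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

int main()
{
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    // stack up one task per tile (here the "work" is just a placeholder)
    for (int tile = 0; tile < 300; ++tile)
        tasks.push([tile] { /* render tile `tile` */ });

    std::vector<std::jthread> workers;
    for (unsigned int n = 0; n < std::jthread::hardware_concurrency(); ++n) {
        workers.emplace_back([&tasks, &m] {
            for (;;) {
                std::function<void()> task;
                {
                    // only one thread at a time may pop from the shared pool
                    std::lock_guard<std::mutex> lock(m);
                    if (tasks.empty()) return; // pool is empty: this thread is done
                    task = std::move(tasks.front());
                    tasks.pop();
                }
                task(); // run the task outside the lock
            }
        });
    }
    return 0; // jthreads join automatically on destruction
}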

Ready to move on...

Memory alignment

TODO

Ray packet

TODO

SIMD instructions

TODO

What's next

Rendering a mesh and seeing how slow it is. Next we will add an acceleration structure. We will start with a kd-tree before moving on to a modern BVH (not the version we already introduced in the basic section).

Source Code

TODO