GPU Computation

First of all, sorry for the long delay in posting.

As I mentioned in Multi-threading for improved render times!, I had an i5-4430 with its stock cooler. I have since upgraded my rig with a more powerful processor, an i7-7700K (with liquid cooling), and also installed a GTX 1080.

Calculating a single pixel is a very simple task, and the outcome for one pixel cannot affect another pixel’s calculation in any way. This makes the task ideal for a system with a very high number of cores, such as a GPU, which has hundreds of processing cores. Thus, I wanted to shift the computation to my GPU and see how the CPU fares against it.
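To make that independence concrete, here is a minimal CPU-side sketch of the per-pixel escape-time calculation (the function name and iteration cap are illustrative, not from my actual code). On the GPU, each CUDA thread simply runs this loop for its own pixel:

```cpp
#include <complex>

// Escape-time iteration count for one pixel at point c.
// Each pixel depends only on its own coordinate, so millions of
// these calls can run in parallel with no communication at all.
int escape_time(std::complex<double> c, int max_iter) {
    std::complex<double> z = 0.0;
    for (int n = 0; n < max_iter; ++n) {
        if (std::norm(z) > 4.0)   // |z| > 2 implies divergence
            return n;
        z = z * z + c;
    }
    return max_iter;              // treated as "inside the set"
}
```

Since no pixel reads another pixel's result, the GPU can schedule one thread per pixel with no synchronization.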


I want to keep this short, so here is the code: GitHub – mandelbrot_gpu (uses CUDA 8.0). Note that I enabled maximum optimization during compilation using -O3.


I used two zoom sequences of 1000 frames each. One of them spans a large area and is mostly empty space, so it is easier to calculate. The other requires heavy computation, since it stays mostly near the boundary of the set.

Platform | Seq 1 (avg per frame) | Seq 2 (avg per frame)
CPU      | 486 ms                | 2158 ms
GPU      | 15 ms                 | 43 ms


This is the finished video (with a lot of compression artifacts):


The GPU blew the CPU out of the water. So why do we still have CPUs? Why isn’t all computation done on GPUs? Because most tasks cannot be parallelized to this extent. Most tasks require an elaborate instruction set and sequential execution of code, which is best done by a few high-powered cores, as in a CPU. Tasks which can be parallelized, though (such as ML workloads), run faster and more efficiently on GPUs.

CHIP-8 Emulator – inspired by MITx Computation Structures (6.004)!

I am currently enrolled in MITx’s Computation Structures (6.004) on edX. I have learnt about registers, stacks, procedures, compilers, opcodes, etc.

To put my new knowledge to work, I decided to program an emulator. I read about many systems and decided that CHIP-8 would be perfect as my first emulation project. It has just 36 instructions, which cover all the essential features – math, branching, graphics, sound and input – and each instruction is easy to implement.

CHIP-8 is not actually a computer system. It is a low-level interpreted programming language – it deals directly with registers, stack pointers, sprite data and RAM – that ran on 1970s computers such as the COSMAC VIP and the Telmac 1800. When handheld programmable calculators became powerful enough in the 1990s, CHIP-8 presented an easy way to quickly program games for them.

So far, I have completed the core work – the emulator can load programs, execute them and take input. There is no sound yet – I am working on it. The emulator uses SDL 2.0 for I/O.
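To give a flavour of what the core loop looks like, here is a stripped-down sketch of the fetch/decode step for two real CHIP-8 instructions, 6XNN (set VX to NN) and 7XNN (add NN to VX). The struct layout here is simplified, not my emulator’s actual structure:

```cpp
#include <cstdint>

// Minimal sketch of a CHIP-8 fetch/decode step (not a full emulator).
// Each instruction is two bytes, big-endian; the high nibble selects
// the instruction family.
struct Chip8 {
    uint8_t  V[16] = {};        // registers V0..VF
    uint8_t  memory[4096] = {};
    uint16_t pc = 0x200;        // programs are loaded at 0x200

    void step() {
        // Fetch the two-byte opcode and advance the program counter.
        uint16_t opcode = (memory[pc] << 8) | memory[pc + 1];
        pc += 2;
        uint8_t x  = (opcode >> 8) & 0x0F;  // register index
        uint8_t nn = opcode & 0xFF;         // 8-bit immediate
        switch (opcode >> 12) {
            case 0x6: V[x] = nn;  break;    // 6XNN: VX = NN
            case 0x7: V[x] += nn; break;    // 7XNN: VX += NN (no carry flag)
            // ... the remaining 34 instructions go here
        }
    }
};
```

The entire instruction set decodes with a handful of nibble masks like these, which is what makes CHIP-8 such a friendly first target.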


A few minor bugs remain in the code.

Screenshot of CONNECT4

Stay tuned for updates! 😀

Thanks to Thomas P. Greene; I used his page as a reference.

Multi-threading for improved render times!

Currently, I am experimenting with a Newton-Raphson convergent fractal plotter I wrote in C++. I came up with the initial program about 2 weeks ago, and at that time, it was very sluggish: it took about 24 seconds to calculate one full-HD (1920×1080) bitmap image.

I was looking to generate a 1000-frame video as the roots move in a circle of unit radius. With my initial program, rendering the frames would’ve taken 1000 × 24 = 24000 seconds – 6 hours and 40 minutes. I knew I could do better.

I optimized the program in all possible ways I could think of. Some of what I did:

  • Implemented a simple complex number class myself (no pun intended).
  • The program writes each pixel to the file as soon as it is calculated, instead of saving the data to a vector<vector<vector<char> > > and writing all the pixels after the frame has been calculated.
  • If this were a chemical reaction, then calculating \frac{f'(z)}{f(z)} would be the rate-determining step. Previously, the program calculated each term of the derivative of the polynomial and added those together, and then did the same for the function value. Now the program uses the fact that \frac{f'(z)}{f(z)} = \frac{\alpha}{z-a} + \frac{\beta}{z-b} + \cdots if f(z) = (z-a)^\alpha (z-b)^\beta \cdots. This gives a much-needed speed boost.
  • Many more small optimizations in code structure.
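The logarithmic-derivative trick can be sketched like this (a simplified version assuming all exponents are 1; the general case weights each term by its exponent):

```cpp
#include <complex>
#include <vector>

using cd = std::complex<double>;

// f'(z)/f(z) computed directly from the roots: for
// f(z) = (z - a)(z - b)..., it equals 1/(z-a) + 1/(z-b) + ...
// One pass over the roots; no polynomial evaluation at all.
cd log_derivative(cd z, const std::vector<cd>& roots) {
    cd sum = 0.0;
    for (const cd& r : roots)
        sum += 1.0 / (z - r);
    return sum;
}
```

The Newton step z - f(z)/f'(z) then becomes z - 1/\text{log\_derivative}(z), so a single reciprocal replaces two full polynomial evaluations.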

After rewriting major chunks of the program, I achieved more than a 10x speed boost. Now the program could compute and write each frame in 2.2 seconds, and it used much less memory (770 KB compared to 10 MB). You can find the program here:

I knew I could do better. Individual frames could easily be calculated in parallel. Since I had never experimented with multi-threading in my programs, I decided this was a good opportunity to give it a try. The simplest way to do this was the standard library’s threading facility, std::thread. Here are some of the results I obtained with my i5-4430 (quad core @ 3 GHz):



Single Thread CPU Usage

Single Thread CPU Usage – ~33%

Double Thread CPU Usage

CPU Usage with 2 threads – ~60%

Quadruple Thread CPU Usage 2

CPU is maxed out with anything more than 4 threads.

Thermal Throttling Thermal Throttling 2

Since I do not have a CPU cooler other than the stock one, I run into thermal throttling pretty quickly. So in practice, the maximum number of threads I use is 2.

Finally, with 4 threads, I get about 0.61 seconds per image – about 40 times faster than my first program!
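The thread layout itself is straightforward: each worker takes every N-th frame. A minimal sketch, assuming a thread-safe render_frame that writes its own output file (the function names here are illustrative):

```cpp
#include <functional>
#include <thread>
#include <vector>

// Distribute frames across worker threads by static interleaving:
// thread t renders frames t, t + num_threads, t + 2*num_threads, ...
// Frames are independent, so no locking is needed between workers.
void render_all(int total_frames, int num_threads,
                const std::function<void(int)>& render_frame) {
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&, t] {
            for (int i = t; i < total_frames; i += num_threads)
                render_frame(i);
        });
    }
    for (std::thread& w : workers)
        w.join();  // wait for every worker to finish its frames
}
```

Static interleaving is a good fit here because adjacent frames in a zoom sequence cost roughly the same, so the threads stay naturally balanced.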

I have now uploaded the 1000 frame video to my YouTube channel:

Stay tuned! 🙂

Digital Rangoli! (Newton fractals, actually)

This year, on the occasion of Diwali, I decided I will make Rangoli digitally!

x^6 + x^3 - 1 (30 iterations)

I wrote a program to calculate and draw the generalized Newton fractal for a given polynomial. The fractal pattern arises from the fact that, for some initial values, the Newton-Raphson method is very sensitive. The image is a map of the behaviour of each point on the complex plane (when chosen as the starting point) under this method.

Let’s go over the details part by part.

Newton-Raphson Method:

Given a function f(z) and a complex number z_0, the series of numbers z_{n+1} = z_n - \frac{f(z_n)}{f'(z_n)} either converges towards a root of the function (called an attractor) or does not show convergent behaviour.

Most complex numbers produce a convergent series. Numbers in each other’s vicinity usually converge to the same root, but in some areas, nearby points reach different roots. When each complex number is coloured by the root it is attracted to, interesting plots are obtained. These plots display fractal-like boundaries, and similar patterns appear on zooming in further.
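As a concrete sketch, the per-point iteration for f(z) = z^3 - 1 (one of the polynomials plotted below) might look like this. I use the step size |z_{n+1} - z_n| as the convergence test here, a common stand-in for the distance to the root:

```cpp
#include <cmath>
#include <complex>

using cd = std::complex<double>;

// Newton-Raphson iteration for f(z) = z^3 - 1. Returns the number of
// steps taken to converge within eps, or max_iter if the point never
// converged (or the derivative vanished). On success, *root_out holds
// the attractor the point reached.
int newton_steps(cd z, double eps, int max_iter, cd* root_out) {
    for (int n = 0; n < max_iter; ++n) {
        cd f  = z * z * z - 1.0;
        cd df = 3.0 * z * z;
        if (std::abs(df) == 0.0) break;     // derivative is zero: give up
        cd z_next = z - f / df;
        if (std::abs(z_next - z) < eps) {   // step small enough: converged
            if (root_out) *root_out = z_next;
            return n + 1;
        }
        z = z_next;
    }
    return max_iter;                        // no convergence: coloured black
}
```

Running this for every pixel’s starting point, and recording which root it reaches and in how many steps, produces exactly the data the colouring rule below needs.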

Colouring Criteria:

The hue (h), saturation (s) and lightness (l) of any point on the graph are determined by:

h = 180\frac{\theta + \pi}{\pi}, \quad s = 0.75, \quad l = \frac{1}{2} - \frac{n}{2m}, where \theta is the argument of the root the point reaches, n is the number of steps in which the point got within \epsilon of the root, and m is the maximum number of iterations allowed. If the point does not converge within the maximum number of iterations, or the derivative becomes zero, the point is coloured black.

Corollary: points that have the same colour converge to the same root.

\epsilon is around 0.0001 and m is 30 for the following images. The plotting area is -3.2 to +3.2 on the real axis and -1.8 to 1.8 on the imaginary axis.
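In code, the colouring rule above is just a couple of arithmetic expressions (M_PI is assumed available from <cmath>; on MSVC you may need _USE_MATH_DEFINES):

```cpp
#include <cmath>

struct HSL { double h, s, l; };

// HSL colour for a converged point, following the formulas above:
//   h = 180 (theta + pi) / pi,  s = 0.75,  l = 1/2 - n/(2m)
// theta: argument of the attracting root, n: steps to converge,
// m: maximum allowed iterations.
HSL colour_for(double theta, int n, int m) {
    HSL c;
    c.h = 180.0 * (theta + M_PI) / M_PI;  // maps (-pi, pi] onto (0, 360]
    c.s = 0.75;
    c.l = 0.5 - static_cast<double>(n) / (2.0 * m);
    return c;
}
```

So the hue encodes which root attracted the point, while slow-converging points fade towards black (l approaches 0 as n approaches m).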

Click on images to zoom in.

x^8 + 15x^4 - 16 (30 iterations)

f(z) = z^8 + 15z^4 - 16

x^8 + 15x^4 - 16 (100 iterations) (zoom)

A closeup of an intricate region of the above graph.

x^6 + x^3 - 1 (30 iterations)

f(z) = z^6 + z^3 - 1

x^3 - 2x + 2 (30 iterations)

f(z) = z^3 - 2z + 2

x^3 - 1 (30 iterations)

f(z) = z^3 - 1

The program is written in C++. You can find the source code here:

Currently, you need to update the source code with your plotting parameters (look for comments in the main function on how to do so) and then compile it (I used Visual Studio ’15, but GCC works as well). I am working on a way to enter the parameters through a text file (like the Mandelbrot Set Plotter) or through the command line. Have fun with the program!

If you have any feedback, leave a comment or contact me at

Wishing you a Happy and Safe Diwali! 🙂