Introduction

Frustum culling is process of discarding objects not visible on the screen. As we don't see them – we don't need to spend resources on computer to prepare it for rendering and rendering itself.

In this paper I will cover next themes:

culling of: Bounding Spheres, Axis-Aligned Bounding Boxes (AABB), Oriented Bounding Boxes (OBB)
culling of huge amount of objects
using SSE
multithreaded culling
GPU culling
comparison of approaches efficiency, working speed

What I will not cover:

using hierarchical structures, trees. We can unite objects in groups according to world positions and first check visibility of whole group.
optimizations of one object, like using last 'successful' culling plane
visibility test taking into account scene depth buffer. Object can be inside frustum, but can be completely blocked by another, closer to viewer object. Hence we also might discard this object from rendering
software culling. We may perform blocking of one objects by another on CPU side.
Culling for shadows

Simple culling

We have area of visibility, which is set by frustum of viewing pyramid. Objects that are not inside this area should be discarded from rendering process. Frustum one usually set with 6 planes.

Image may be NSFW.
Clik here to view.

We have objects in the scene. Each object might be aproxinated with simple geometry such as sphere or box. All object's geometry lies inside this primitive.

Image may be NSFW.
Clik here to view.

Visibility test of such simple geometry performs very fast. Our aim is to understand if this object visible in frustum.
Consider the definition of visibility: spheres and boxes. There are different kinds of boxes: world axis aligned (AABB) and aligned according local object's axis (OBB). One could clearly see that OBB beter aproximates object, but it's visibility test performs harder than AABB.

Sphere-frustum

Algorithm: for object's center we find distance to each frustum plane. If point is behind any plane more than sphere radius, then sphere not in frustum. And found one is splitting plane.

__forceinline bool SphereInFrustum(vec3 &pos, float &radius, vec4 *frustum_planes)
{
	bool res = true;
	//test all 6 frustum planes
	for (int i = 0; i < 6; i++)
	{
		//calculate distance from sphere center to plane.
		//if distance larger then sphere radius - sphere is outside frustum
		if (frustum_planes[i].x * pos.x + frustum_planes[i].y * pos.y + frustum_planes[i].z * pos.z + frustum_planes[i].w <= -radius)
			res = false;
			//return false; //with flag works faster
	}
	return res;
}

AABB-frustum

Bounding sphere sometimes not great choise to approximate object. For more precise test one often use boxes. World Axis-Aligned or Oriented Bounding Boxes.
Basic idea to test the box visibility in frustum: if all 8 box points lie behind one of frustum planes then box is not in frustum. In next example implemented AABB-fustum test. But if we place OBB world-space points in equations – we get OBB-frustum test.

__forceinline bool RightParallelepipedInFrustum2(vec4 &Min, vec4 &Max, vec4 *frustum_planes)
{
//this is just example of basic idea - how BOX culling works, both AABB and OBB
//Min & Max are 2 world space box points. For AABB-drustum culling
//We may use transformed (by object matrix) to world space 8 box points. Replace Min & Max in equations and we get OBB-frustum.
	//test all 6 frustum planes
	for (int i = 0; i<6; i++)
	{
	//try to find such plane for which all 8 box points behind it
	//test all 8 box points against frustum plane
	//calculate distance from point to plane
	//if point infront of the plane (dist > 0) - this is not separating plane
		if (frustum_planes[i][0] * Min[0] + frustum_planes[i][1] * Max[1] + frustum_planes[i][2] * Min[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Min[0] + frustum_planes[i][1] * Max[1] + frustum_planes[i][2] * Max[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Max[0] + frustum_planes[i][1] * Max[1] + frustum_planes[i][2] * Max[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Max[0] + frustum_planes[i][1] * Max[1] + frustum_planes[i][2] * Min[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Max[0] + frustum_planes[i][1] * Min[1] + frustum_planes[i][2] * Min[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Max[0] + frustum_planes[i][1] * Min[1] + frustum_planes[i][2] * Max[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Min[0] + frustum_planes[i][1] * Min[1] + frustum_planes[i][2] * Max[2] + frustum_planes[i][3]>0)
			continue;
		if (frustum_planes[i][0] * Min[0] + frustum_planes[i][1] * Min[1] + frustum_planes[i][2] * Min[2] + frustum_planes[i][3]>0)
			continue;
		return false;
    }
    return true;
}

AABB-frustum test might be implemented more optimal.
Algorithm: from 8 point find closest one to the plane ant test if it is behind the plane. If yes – the object is not in frustum.

__forceinline bool RightParallelepipedInFrustum(vec4 &Min, vec4 &Max, vec4 *frustum_planes)
{
	bool inside = true;
	//test all 6 frustum planes
	for (int i = 0; i<6; i++)
	{
	//pick closest point to plane and check if it behind the plane
	//if yes - object outside frustum
		float d = max(Min.x * frustum_planes[i].x, Max.x * frustum_planes[i].x) +
				  max(Min.y * frustum_planes[i].y, Max.y * frustum_planes[i].y) +
				  max(Min.z * frustum_planes[i].z, Max.z * frustum_planes[i].z) +
				  frustum_planes[i].w;
		inside &= d > 0;
		//return false; //with flag works faster
	}
	return inside;
}

OBB-frustum

Algorithm: transform 8 box points from local space to clip-space. It is easy to test points against plane in such space as frustum becomes the unit cube [-1..1] (but we should take into account that for DirectX for Z axis we have another size [0..1]). If for one of axis all 8 vertexes <-1 or >1, then box outside the frustum.

__forceinline bool OBBInFrustum(const vec3 &Min, const vec3 &Max, mat4 &obj_transform_mat, mat4 &cam_modelview_proj_mat)
{
	//transform all 8 box points to clip space
	//clip space because we easily can test points outside required unit cube
	//NOTE: for DirectX we should test z coordinate from 0 to w (-w..w - for OpenGL), look for      transformations / clipping box differences
	//matrix to transfrom points to clip space
	mat4 to_clip_space_mat = cam_modelview_proj_mat * obj_transform_mat;

	//transform all 8 box points to clip space
	vec4 obb_points[8];
	obb_points[0] = to_clip_space_mat * vec4(Min[0], Max[1], Min[2], 1.f);
	obb_points[1] = to_clip_space_mat * vec4(Min[0], Max[1], Max[2], 1.f);
	obb_points[2] = to_clip_space_mat * vec4(Max[0], Max[1], Max[2], 1.f);
	obb_points[3] = to_clip_space_mat * vec4(Max[0], Max[1], Min[2], 1.f);
	obb_points[4] = to_clip_space_mat * vec4(Max[0], Min[1], Min[2], 1.f);
	obb_points[5] = to_clip_space_mat * vec4(Max[0], Min[1], Max[2], 1.f);
	obb_points[6] = to_clip_space_mat * vec4(Min[0], Min[1], Max[2], 1.f);
	obb_points[7] = to_clip_space_mat * vec4(Min[0], Min[1], Min[2], 1.f);

	bool outside = false, outside_positive_plane, outside_negative_plane;

	//we have 6 frustum planes, which in clip space is unit cube (for GL) with -1..1 range
	for (int i = 0; i < 3; i++) //3 because we test positive & negative plane at once
	{
	//if all 8 points outside one of the plane
	//actually it is vertex normalization xyz / w, then compare if all 8points coordinates <      -1 or > 1
		outside_positive_plane =
			obb_points[0][i] > obb_points[0].w &&
			obb_points[1][i] > obb_points[1].w &&
			obb_points[2][i] > obb_points[2].w &&
			obb_points[3][i] > obb_points[3].w &&
			obb_points[4][i] > obb_points[4].w &&
			obb_points[5][i] > obb_points[5].w &&
			obb_points[6][i] > obb_points[6].w &&
			obb_points[7][i] > obb_points[7].w;

		outside_negative_plane =
			obb_points[0][i] < -obb_points[0].w &&
			obb_points[1][i] < -obb_points[1].w &&
			obb_points[2][i] < -obb_points[2].w &&
			obb_points[3][i] < -obb_points[3].w &&
			obb_points[4][i] < -obb_points[4].w &&
			obb_points[5][i] < -obb_points[5].w &&
			obb_points[6][i] < -obb_points[6].w &&
			obb_points[7][i] < -obb_points[7].w;

		outside = outside || outside_positive_plane || outside_negative_plane;
		//if (outside_positive_plane || outside_negative_plane)
			//return false;
	}
	return !outside;
	//return true;
}

Table 1: culing results of 100к objects. Intel Core I5. Single Thread.

Simple Culling	Sphere	AABB	OBB
Just cullung	0,92	1,42	9,14
Whole frame	1,94	2,5	10,3

The results are obvious. The harder calculations the slower it works. OBB test much slower than Spheres or AABB tests. But we get more precise culling with OBB.
May be, optimal solution is spiting objects into groups, For each group depending on distance to camera we use appropriate primitive. For closest groups use OBB, for middle one groups use ABB and Spheres for the rest.
Also should be notices than whole frame time is larger than just culling. 1 ms. in average. Because of transferring data about visible objects to gpu has cost, couple of dips and API commands. But it is necessary actions.

SSE

SSE (Streaming SIMD Extensions) – with one instructions we perform calculations on group of operands. SSE includes in it's architecture eight 128 bit registers and set of instructions to perform any operations on them.

Theoretically we might speedup code execution 4 times as we make operations with 4 operands simultaneously. Offcourse on practice perfomance win will be less because of SSE drawbacks.

Not all algorithms could be easily rewrited in SSE
data should be packed according to SSE requirements in registers to perform calculations
SSE has some restrictions with vertical operations like dot products
there are no conditions. One use so called static branching, when we execute bot 2 parts of condition an take just one interesting to us result.
Loading data in registers and storing results back into memory
don't forget about sse data striding

Algorithm of SSE Spheres-frustum and SSE AABB-frustum culling almost identical to simple implementation. In exception of that we perform calculations on 4 objects simultaneously.

void sse_culling_spheres(BSphere *sphere_data, int num_objects, int *culling_res, vec4 *frustum_planes)
{
	float *sphere_data_ptr = reinterpret_cast<float*>(&sphere_data[0]);
	int *culling_res_sse = &culling_res[0];

	//to optimize calculations we gather xyzw elements in separate vectors
	__m128 zero_v = _mm_setzero_ps();
	__m128 frustum_planes_x[6];
	__m128 frustum_planes_y[6];
	__m128 frustum_planes_z[6];
	__m128 frustum_planes_d[6];

	int i, j;
	for (i = 0; i < 6; i++)
	{
		frustum_planes_x[i] = _mm_set1_ps(frustum_planes[i].x);
		frustum_planes_y[i] = _mm_set1_ps(frustum_planes[i].y);
		frustum_planes_z[i] = _mm_set1_ps(frustum_planes[i].z);
		frustum_planes_d[i] = _mm_set1_ps(frustum_planes[i].w);
	}

	//we process 4 objects per step
	for (i = 0; i < num_objects; i += 4)
	{
	//load bounding sphere data
		__m128 spheres_pos_x = _mm_load_ps(sphere_data_ptr);
		__m128 spheres_pos_y = _mm_load_ps(sphere_data_ptr + 4);
		__m128 spheres_pos_z = _mm_load_ps(sphere_data_ptr + 8);
		__m128 spheres_radius = _mm_load_ps(sphere_data_ptr + 12);
		sphere_data_ptr += 16;

	//but for our calculations we need transpose data, to collect x, y, z and w coordinates in separate vectors
		_MM_TRANSPOSE4_PS(spheres_pos_x, spheres_pos_y, spheres_pos_z, spheres_radius);
		__m128 spheres_neg_radius = _mm_sub_ps(zero_v, spheres_radius); // negate all elements

		__m128 intersection_res = _mm_setzero_ps();
		for (j = 0; j < 6; j++) //plane index
		{
		//1. calc distance to plane dot(sphere_pos.xyz, plane.xyz) + plane.w
		//2. if distance < sphere radius, then sphere outside frustum
			__m128 dot_x = _mm_mul_ps(spheres_pos_x, frustum_planes_x[j]);
			__m128 dot_y = _mm_mul_ps(spheres_pos_y, frustum_planes_y[j]);
			__m128 dot_z = _mm_mul_ps(spheres_pos_z, frustum_planes_z[j]);

			__m128 sum_xy = _mm_add_ps(dot_x, dot_y);
			__m128 sum_zw = _mm_add_ps(dot_z, frustum_planes_d[j]);
			__m128 distance_to_plane = _mm_add_ps(sum_xy, sum_zw);

			__m128 plane_res = _mm_cmple_ps(distance_to_plane, spheres_neg_radius); //dist < -sphere_r ?
			intersection_res = _mm_or_ps(intersection_res, plane_res); //if yes - sphere behind the plane & outside frustum
		}

		//store result
		__m128i intersection_res_i = _mm_cvtps_epi32(intersection_res);
		_mm_store_si128((__m128i *)&culling_res_sse[i], intersection_res_i);
	}
}

void sse_culling_aabb(AABB *aabb_data, int num_objects, int *culling_res, vec4 *frustum_planes)
{
	float *aabb_data_ptr = reinterpret_cast<float*>(&aabb_data[0]);
	int *culling_res_sse = &culling_res[0];

	//to optimize calculations we gather xyzw elements in separate vectors
	__m128 zero_v = _mm_setzero_ps();
	__m128 frustum_planes_x[6];
	__m128 frustum_planes_y[6];
	__m128 frustum_planes_z[6];
	__m128 frustum_planes_d[6];

	int i, j;
	for (i = 0; i < 6; i++)
	{
		frustum_planes_x[i] = _mm_set1_ps(frustum_planes[i].x);
		frustum_planes_y[i] = _mm_set1_ps(frustum_planes[i].y);
		frustum_planes_z[i] = _mm_set1_ps(frustum_planes[i].z);
		frustum_planes_d[i] = _mm_set1_ps(frustum_planes[i].w);
	}

	__m128 zero = _mm_setzero_ps();
	//we process 4 objects per step
	for (i = 0; i < num_objects; i += 4)
	{
	//load objects data
	//load aabb min
		__m128 aabb_min_x = _mm_load_ps(aabb_data_ptr);
		__m128 aabb_min_y = _mm_load_ps(aabb_data_ptr + 8);
		__m128 aabb_min_z = _mm_load_ps(aabb_data_ptr + 16);
		__m128 aabb_min_w = _mm_load_ps(aabb_data_ptr + 24);

	//load aabb max
		__m128 aabb_max_x = _mm_load_ps(aabb_data_ptr + 4);
		__m128 aabb_max_y = _mm_load_ps(aabb_data_ptr + 12);
		__m128 aabb_max_z = _mm_load_ps(aabb_data_ptr + 20);
		__m128 aabb_max_w = _mm_load_ps(aabb_data_ptr + 28);
		aabb_data_ptr += 32;

	//for now we have points in vectors aabb_min_x..w, but for calculations we need to xxxx yyyy zzzz vectors representation - just transpose data
		_MM_TRANSPOSE4_PS(aabb_min_x, aabb_min_y, aabb_min_z, aabb_min_w);
		_MM_TRANSPOSE4_PS(aabb_max_x, aabb_max_y, aabb_max_z, aabb_max_w);

		__m128 intersection_res = _mm_setzero_ps();
		for (j = 0; j < 6; j++) //plane index
		{
		//this code is similar to what we make in simple culling
		//pick closest point to plane and check if it begind the plane. if yes - object outside frustum
		//dot product, separate for each coordinate, for min & max aabb points
			__m128 aabbMin_frustumPlane_x = _mm_mul_ps(aabb_min_x, frustum_planes_x[j]);
			__m128 aabbMin_frustumPlane_y = _mm_mul_ps(aabb_min_y, frustum_planes_y[j]);
			__m128 aabbMin_frustumPlane_z = _mm_mul_ps(aabb_min_z, frustum_planes_z[j]);

			__m128 aabbMax_frustumPlane_x = _mm_mul_ps(aabb_max_x, frustum_planes_x[j]);
			__m128 aabbMax_frustumPlane_y = _mm_mul_ps(aabb_max_y, frustum_planes_y[j]);
			__m128 aabbMax_frustumPlane_z = _mm_mul_ps(aabb_max_z, frustum_planes_z[j]);

		//we have 8 box points, but we need pick closest point to plane. Just take max
			__m128 res_x = _mm_max_ps(aabbMin_frustumPlane_x, aabbMax_frustumPlane_x);
			__m128 res_y = _mm_max_ps(aabbMin_frustumPlane_y, aabbMax_frustumPlane_y);
			__m128 res_z = _mm_max_ps(aabbMin_frustumPlane_z, aabbMax_frustumPlane_z);

		//dist to plane = dot(aabb_point.xyz, plane.xyz) + plane.w
			__m128 sum_xy = _mm_add_ps(res_x, res_y);
			__m128 sum_zw = _mm_add_ps(res_z, frustum_planes_d[j]);
			__m128 distance_to_plane = _mm_add_ps(sum_xy, sum_zw);

			__m128 plane_res = _mm_cmple_ps(distance_to_plane, zero); //dist from closest point to plane < 0 ?
			intersection_res = _mm_or_ps(intersection_res, plane_res); //if yes - aabb behind the plane & outside frustum
		}

		//store result
		__m128i intersection_res_i = _mm_cvtps_epi32(intersection_res);
		_mm_store_si128((__m128i *)&culling_res_sse[i], intersection_res_i);
	}
}

OBB culling is a bit harder. We perform calculations on one object at once. But make calculations for three xyz axes simultaneously. It is not optimal but it reflects basic idea of algorithm. Besides, vector math (matrix multiplications and point transformations) with SSE perform faster.

void sse_culling_obb(int firs_processing_object, int num_objects, int *culling_res, mat4 &cam_modelview_proj_mat)
{
	mat4_sse sse_camera_mat(cam_modelview_proj_mat);
	mat4_sse sse_clip_space_mat;

	//box points in local space
	__m128 obb_points_sse[8];
	obb_points_sse[0] = _mm_set_ps(1.f, box_min[2], box_max[1], box_min[0]);
	obb_points_sse[1] = _mm_set_ps(1.f, box_max[2], box_max[1], box_min[0]);
	obb_points_sse[2] = _mm_set_ps(1.f, box_max[2], box_max[1], box_max[0]);
	obb_points_sse[3] = _mm_set_ps(1.f, box_min[2], box_max[1], box_max[0]);
	obb_points_sse[4] = _mm_set_ps(1.f, box_min[2], box_min[1], box_max[0]);
	obb_points_sse[5] = _mm_set_ps(1.f, box_max[2], box_min[1], box_max[0]);
	obb_points_sse[6] = _mm_set_ps(1.f, box_max[2], box_min[1], box_min[0]);
	obb_points_sse[7] = _mm_set_ps(1.f, box_min[2], box_min[1], box_min[0]);

	ALIGN_SSE int obj_culling_res[4];
	__m128 zero_v = _mm_setzero_ps();

	int i, j;
	//process one object per step
	for (i = firs_processing_object; i < firs_processing_object+num_objects; i++)
	{
	//clip space matrix = camera_view_proj * obj_mat
		sse_mat4_mul(sse_clip_space_mat, sse_camera_mat, sse_obj_mat[i]);
		__m128 outside_positive_plane = _mm_set1_ps(0xffffffff);
		__m128 outside_negative_plane = _mm_set1_ps(0xffffffff);

	//for all 8 box points
		for (j = 0; j < 8; j++)
		{
		//transform point to clip space
			__m128 obb_transformed_point = sse_mat4_mul_vec4(sse_clip_space_mat, obb_points_sse[j]);

		//gather w & -w
			__m128 wwww = _mm_shuffle_ps(obb_transformed_point, obb_transformed_point, _MM_SHUFFLE(3, 3, 3, 3)); //get w
			__m128 wwww_neg = _mm_sub_ps(zero_v, wwww); // negate all elements

		//box_point.xyz > box_point.w || box_point.xyz < -box_point.w ?
		//similar to point normalization: point.xyz /= point.w; And compare: point.xyz > 1 && point.xyz < -1
			__m128 outside_pos_plane = _mm_cmpge_ps(obb_transformed_point, wwww);
			__m128 outside_neg_plane = _mm_cmple_ps(obb_transformed_point, wwww_neg);

		//if at least 1 of 8 points in front of the plane - we get 0 in outside_* flag
			outside_positive_plane = _mm_and_ps(outside_positive_plane, outside_pos_plane);
			outside_negative_plane = _mm_and_ps(outside_negative_plane, outside_neg_plane);
		}

		//all 8 points xyz < -1 or > 1 ?
		__m128 outside = _mm_or_ps(outside_positive_plane, outside_negative_plane);

		//store result
		__m128i outside_res_i = _mm_cvtps_epi32(outside);
		_mm_store_si128((__m128i *)&obj_culling_res[0], outside_res_i);

		//for now we have separate result separately for each axis
		//combine results. If outside any plane, then objects is outside frustum
		culling_res[i] = (obj_culling_res[0] != 0 || obj_culling_res[1] != 0 ||  obj_culling_res[2] != 0) ? 1 : 0;
	}
}

Table 2. SSE culling result of 100k objects. Intel Core I5. Single Thread. SSE.

SSE Culling	Sphere	AABB	OBB
Just culling	0,26	0,46	3,48
Whole frame	1,2	1,43	4,6

SSE implementation in average 3 times faster than simple one in C++.

Multithreading

Nowadays processors has several cores. Calculations might be performed simultaneously on all cores.

Architecture on new games should be planned taking into account multithreading, i.e. split work on independent parts/tasks and solve them simultaneously, loading evenly all the processor cores. The design should be flexible. Too large amount of small tasks leads to overhead of synchronizing work and switching between tasks. Too small abound of big tasks leads to uneavenly loading of cores. Need a balance. In current games there might be from several hundreds to thousand tasks per frame.

In our case of frustum culling each object is independent from the rest. Thats why we easily could split work into equal groups and cull them simultaneously with different cores of processor. After running jobs execution we need to wait threads to do their job and gather results.
Off course we should not ask results right after execution start.

Worker::Worker() : first_processing_oject(0), num_processing_ojects(0)
{
	//create 2 events: 1. to signal that we have a job 2.signal that we finished job
	has_jobs_event = CreateEvent(NULL, false, false, NULL);
	jobs_finished_event = CreateEvent(NULL, false, true, NULL);
}

void Worker::doJob()
{
	//make our part of work
	cull_objects(first_processing_oject, num_processing_ojects);
}

unsigned __stdcall thread_func(void* arguments)
{
	printf("In thread...\n");
	Worker *worker = static_cast<worker*>(arguments);

	//each worker has endless loop untill we signal to quit (stop_work flag)
	while (true)
	{
	//wait for starting jobs
	//if we have no job - just wait (has_jobs_event event). We do not wasting cpu work. Events designed for this.
		WaitForSingleObject(worker->has_jobs_event, INFINITE);

	//if we have signal to break - exit endless loop
		if (worker->stop_work)
			break;

	//do job
		worker->doJob();

	//signal that we finished the job
		SetEvent(worker->jobs_finished_event);
	}
	_endthreadex(0);
	return 0;
}

void create_threads()
{
	//create the threads
	//split the work into parts between threads
	int worker_num_processing_ojects = MAX_SCENE_OBJECTS / num_workers;
	int first_processing_oject = 0;

	int i;
	for (i = 0; i < num_workers; i++)
	{
	//create threads
		workers[i].thread_handle = (HANDLE)_beginthreadex(NULL, 0, &thread_func, &workers[i], CREATE_SUSPENDED, &workers[i].thread_id);
		thread_handles[i] = workers[i].thread_handle;

	//set threads parameters
		workers[i].first_processing_oject = first_processing_oject;
		workers[i].num_processing_ojects = worker_num_processing_ojects;
		first_processing_oject += worker_num_processing_ojects;
	}
	//run workers to do their jobs
	for (int i = 0; i < num_workers; i++)
		ResumeThread(workers[i].thread_handle);
}

void process_multithreading_culling()
{
	//signal workers that they have the job
	for (int i = 0; i < num_workers; i++)
		SetEvent(workers[i].has_jobs_event);
}

void wate_multithreading_culling_done()
{
	//wait threads to do their jobs
	HANDLE wait_events[num_workers];
	for (int i = 0; i < num_workers; i++)
		wait_events[i] = workers[i].jobs_finished_event;
	WaitForMultipleObjects(num_workers, &wait_events[0], true, INFINITE);
}

Table 3. Culling results of 100k objects. Intel Core I5 (4 cores). In brackets – speedup relatively to simple c++ implementation.

Method	Sphere	AABB	OBB
Simple c++	0,92 (1)	1,42 (1)	9,14 (1)
SSE	0,26 (3,54)	0,46 (3,08)	3,48 (2,62)
Simple c++, Mulithreaded	0,25 (3,68)	0,4 (3,55)	2,5 (3,65)
SSE, Multithreaded	0,1 (9,2)	0,18 (7,89)	1 (9,14)

Multithreaded version faster than single threaded in 3,6 times in average.
Using SSE gieves us 3 times speedup, relatively to simple c++ implementation.
Both using SSE and Multithreading gives us 8,7 times speedup!
I.e. we optimize our calculations by almost 9 times, depending on used culling primitive type.

GPU culling

GPU designed to perform the same operation on huge amount of data. GPU has a lot more parallel threads (thousands) than in CPU (2-8 in most desktop cases). But culling on gpu not allwas comfortably:

this assumes special graphics engine architecture
there is unpleasant moment that we evaluate dip execution on cpu side. For this we need to know the amout of generated primitives by GPU (visible in frustum objects in our case). Thats why we need to ask feedback from GPU. There are special commands for this purpose.
The problem is if we want get result in the same frame with culling and rendering we get GPU-stall, because we need to wait the result. This is bad for perfomance. If read result from previous frame – we get bugs. Full solution to this problem is using DrawIndirect commands and preparing information about dip on GPU side. This is available since DirectX11 and Opengl 4.

Implementation on gpu culling consist from next steps:

Pack all instances data in vertex buffer. Assume that one vertex is one object for culling. Amount of atributes for vertex equal to amount of data per one object.
Enable transform feedback. Send prepared vertex buffer on render. All results redirect to another vertex buffer with visible instances data.
In vertex shader check visibility on the object
In geometry shader discard object / kill the vertex if instance is not visible in frustum.
Thus, we formed buffer with just visible instances data.
But now we need to get information amout amount of visible objects from GPU to make the dip on CPU side. In this case we do this with transform feedback from previous frame (just for code simplicity).

void do_gpu_culling()
{
	culling_shader.bind();

	int cur_frame = frame_index % 2;
	int prev_frame = (frame_index + 1) % 2;

	//enable transform feedback & query
	glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, dips_texture_buffer);
	glBeginQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN,
	num_visible_instances_query[cur_frame]);
	glBeginTransformFeedback(GL_POINTS);

	//render cloud of points which we interprent as objects data
	glBindVertexArray(all_instances_data_vao);
	glDrawArrays(GL_POINTS, 0, MAX_SCENE_OBJECTS);
	glBindVertexArray(0);

	//disable all
	glEndTransformFeedback();
	glEndQuery(GL_TRANSFORM_FEEDBACK_PRIMITIVES_WRITTEN);
	glBindBufferBase(GL_TRANSFORM_FEEDBACK_BUFFER, 0, 0);
	glDisable(GL_RASTERIZER_DISCARD);

	//get feedback from prev frame
	num_visible_instances = 0;
	glGetQueryObjectiv(num_visible_instances_query[prev_frame], GL_QUERY_RESULT,      &num_visible_instances);

	//next frame
	frame_index++;
}

Vertex shader from 3rd step:

#version 330 core
in vec4 s_attribute_0;
in vec4 s_attribute_1;

out vec4 instance_data1;
out vec4 instance_data2;
out int visible;

uniform mat4 ModelViewProjectionMatrix;
uniform vec4 frustum_planes[6];

int InstanceCloudReduction()
{
	//sphere - frustum test
	bool inside = true;
	for (int i = 0; i < 6; i++)
	{
		if (dot(frustum_planes[i].xyz, s_attribute_0.xyz) + frustum_planes[i].w <= -s_attribute_0.w)
			inside = false;
	}
	return inside ? 1 : 0;
}

void main()
{
//read instance data
	instance_data1 = s_attribute_0;
	instance_data2 = s_attribute_1;

//visibility
	visible = InstanceCloudReduction();
	gl_Position = ModelViewProjectionMatrix * vec4(s_attribute_0.xyz,1);
}

Geometry shader from 4th step:

#version 400 core
layout (points) in;
layout (points, max_vertices = 1) out;

in vec4 instance_data1[];
in vec4 instance_data2[];
in int visible[];

out vec4 output_instance_data1;
out vec4 output_instance_data2;

void main( )
{
	//if primitive is not visible - discard it !
	//visible comes from vertex shader
	if (visible[0] == 1)
	{
	//just transfer data
		output_instance_data1 = instance_data1[0];
		output_instance_data2 = instance_data2[0];
		gl_Position = vec4(0.0, 0.0, 0.0, 1.0);
		EmitVertex();
		EndPrimitive();
	}
}

Whole frame time with GPU culling is about 1,19 ms. Allmost the same as with the fastest CPU variant, SSE multithreaded sphere-frustum culling.

Perfomance comparision of all methods

We compared culling speed CPU methods, but mention only culling time. For now we are going to measure whole frame time, so measurements will differs a bit as we need take into account data transferring and some other work.

Table 4: Perfomance comparition of all culling methods. Whole Frame time in ms.

Method	Sphere	AABB	OBB
Simple c++	1,94	2,46	10,3
SSE	1,2	1,45	4,63
Simple c++, multithreading	1,23	1,42	3,6
SSE, multithreading	1,18	1,2	2,02
GPU	1,19	-	-

There might be some inaccuracy in measurements (in 2nd sigh after coma). Even averaged time always differs from frame to frame.

Conclusion

Using both SSE and multithreading gives us 9 times perfomance win in comparition with simple c++ implementation. If one can use Opengl 4 (map buffers with GL_MAP_PERSISTENT_BIT and GL_MAP_COHERENT_BIT flags), then data transfering to gpu becomes really fast. Plus we able to optimise culling using hierarchial structures and different trics like remembering last frustum plane which culled the object.

GPU culling works as fast as the most optimized CPU method. And has couple advantages:

no need to transfer visible instances data to GPU. They are already there. If using DrawIndirect (DX11+, OpenGL 4+, new API: DX12, Vulkan, Metal) and form information about dip on CPU side, then we can get even better performance.
Relatively easily able to make additional culling taking into account self blocking/hiding/inner visibility (Hierarchical-Z map based occlusion culling)

So – what to use CPU or GPU culling?
Depends on several things:

project and amount of changes to make GPU culling
Do we able to use DrawIndirect
what is bottle neck: CPU or GPU?
Do we want make additional culling according to depth complexity of the scene. On GPU side it will be much easier and faster.

If there is such opportunity – one should use GPU culling.

Source code of all examples.

Links:

Gerlits Anatoliy.
February, 2017 year.

Frustum Culling

Introduction

Simple culling

Sphere-frustum

AABB-frustum

OBB-frustum

SSE

Multithreading

GPU culling

Perfomance comparision of all methods

Conclusion

Trending Articles

Scuffham Amps - S-GEAR 2.6.0 VST, AAX, STANDALONE x86 x64 (R2R NO iLok2, +NO...

Practice Sheet of Right form of verbs for HSC Students

VHSE First (1st) Allotment 2025 - vhscap.kerala.gov.in

UNIVERSE LEAGUE – UNIVERSE LEAGUE – WAR (We Are Ready) – EP [iTunes Plus M4A]

City Hunter Teledrama – Episode 18 – 07th May 2016

Comment on Proposed Criteria for Identifying Predatory Conferences by Luke...

Bureau of Internal Revenue: Regional Offices (Directory)

Kendrick Lamar – Not Like Us (2024) [24Bit-88.2kHz] [PMEDIA] ⭐️

Inception 2010 Hindi Dual Audio 650MB BRRip 720p ESubs HEVC

East Hull MD admits sexual assaults after another victim comes forward

Download: FK ft Shenky – Nakuyewa ”Prod by: Shenky”

R. v. Sargeant, 2023 ONSC 6406 (CanLII)

Rajasthan Board 10th Result 2016 Roll No wise & Name Wise

Who’s been sentenced at Northampton Magistrates’ Court

मतलबी दोस्त स्टेट्स | Matlabi Dost Status in Hindi – Selfish Friends Status

Family cries out as traditional ruler allegedly abducts brother, extorts N2.5m

Long-Running Conflict In Springfield (MA) Gangland Sphere Has Manzi Family &...

Wondershare Filmora X v10.1.20.16 x64

Man arrested after fracas in flat

Man charged in ongoing Sexual Assault Investigation Derek Nyilas, 46, Faces...