Draw Call Performance, OpenCL Batching & Shader Questions

Offline
User avatar
*sensei*
Posts: 316
Joined: 12 Aug 2013, 18:55
Location: Scotland

Draw Call Performance, OpenCL Batching & Shader Questions

Hullo lads.

For the past wee while I've been trying to do a bit of learning, about draw calls 'n' general graphics performance. Got a couple questions that I didn't find any information on, so I figured here would be a good place ta ask.

Anyhow!


1. Supposedly, combining textures into an atlas and adjusting the UVs of your meshes is a common method to reduce draw calls. Is there an average number/percentage for how many draw calls this saves?

E.g., a simple scene has 3,000 unique (texture, geometry) meshes. There's no lighting, shadows, specular shaders, etc. Each mesh has one diffuse and one normal map. That would make 9,000 draw calls, right? 3,000 meshes, 3,000 diffuse maps, 3,000 normal maps.

If the diffuse and normal maps were made into two texture atlases, one for the diffuse maps & one for the normal maps, how many draw calls would we now have? 3,000 meshes + 1 diffuse map + 1 normal map for 3,002 draw calls? Or is it not so simple?


2. One aspect of draw call performance that I've read about is "context switching". From what little I could find, I think a context switch occurs when a different texture is drawn. Due to driver heuristics, fail-safes and whatnot, draw calls that invoke context switches are far worse for performance.

If you have 8,998 meshes and two texture atlases (a total of 9,000 draw calls?), and compare the overall performance (CPU usage? framerate? frametimes?) to having 3,000 meshes, each with its own diffuse & normal map, is the former "more efficient", or are all draw calls equally draining?


3. With the advent of OpenCL, I did some searching to see if static/dynamic batching can be done with it. Didn't find any discussions or questions on the matter, sadly.

So aye. Seeing as how OpenCL allows you to access the GPU, surely it would be best used to batch geometry together, instead of using the CPU as is done traditionally? What with GPUs being made for transforming triangles.


4. Having learned a bit about geometry instancing, it seems that its generic features are:

  • Only works if geometry, UV maps & textures are the same.
  • Can colour-tint the object.
  • Supports skeletal deformation (changing geometry shape via skinned bones).

But I couldn't find an implementation that supports vertex morphs (like in 3DS Max: have one mesh, change its shape, and then you can interpolate the changes in vertex positions). It should be possible, no? Seeing as how skinning does pretty much the same thing.


5. Was doing a bit of reading about the use of large shaders to avoid if statements; branching is fairly serial and kills GPU performance, and uber shaders are the workaround.

But damned if I can find any concrete information on uber shaders. How do they work? How do you program shaders in such a way that you avoid the use of if statements?



...Aye, that seems to be all I've got on my mind right now. Tah in advance, broskis.
_________________
Intel i7 6700k | AMD Vega 56 8GB | 2x16GB DDR4 @ 3000mhz | Windows 7 64bit | Creative Soundblaster X-Fi Titanium Fatal1ty Pro | Asus z170 Pro Gaming

Offline
User avatar
*blah-blah-blah maniac*
Posts: 17427
Joined: 27 Dec 2011, 08:53
Location: Rather not to say

Re: Draw Call Performance, OpenCL Batching & Shader Question

3,000 meshes, 3,000 diffuse maps, 3,000 normal maps
Something is wrong with your maths. That count would only be valid for some inefficient semi-deferred rendering mode. Every mesh has 1 diffuse and 1 normal map, which means 3,000 draw calls. Without lighting and specular, normal maps aren't needed at all, btw.
With DIPs (draw calls) things are not that simple. It's not just the draw function call that costs a lot of performance, but every command sent to the driver between draw calls. You may (on old NVIDIA drivers at least) issue a crazy number of draw calls in a loop without modifying anything, and performance will be awesome. Insert any change to an object -> bottleneck. So packing textures into atlases is a good thing, but it eliminates only a few percent of the bottleneck, and it has a rather negative impact of its own, because handling atlases means extra VRAM and slower shaders. Fully batching objects into statics is the best possible thing to improve performance. And there are tricks to minimise the negative side effects of statics by not making them fully static.
Context switching, as opposed to switching textures, is not the thing you need to worry about. Any change to device state is bad, and mostly by about the same amount. Assigning textures is not the slowest code, unless they still have to be moved into VRAM. Developers ignore the cost of the functions that set shader constants, because they are assumed to be cheap. That's not true: with the complexity of modern shaders, the amount of data transferred to the video card as constants can be much greater than the cost of textures or many other functions.
Batching using OpenCL is a bad idea, because data is not shared between it and DirectX; you will be stuck with CPU-GPU-CPU-GPU data transfers, which is much slower than just computing on the CPU, especially in separate threads. And I don't see the problem with batching meshes while the level is loading.
Vertex morphing officially doesn't exist as a feature of realtime graphics; implementing it is a task for the developer. Add to every source vertex the position of the corresponding vertex of the destination object, and interpolate between the values. It's compatible with instancing, because it doesn't require much data per object. Instancing as a hardware feature is not very useful in practice; optimising the simplest objects is not as important as optimising complex shadowed and lit models. And that's where tricky instancing algorithms come to life: the use of vertex textures for storing data.
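The interpolation described above can be sketched as a D3D9-style vertex shader; all names here are hypothetical, and it assumes the morph-target positions were packed into the vertex data at export time:

```hlsl
// Each vertex stores its base position and the matching vertex
// position of the morph target (added at export time, as described).
struct VS_INPUT
{
    float3 basePos   : POSITION0;
    float3 targetPos : POSITION1;
};

float4x4 g_WorldViewProj;
float    g_MorphWeight;   // 0 = base shape, 1 = target shape

float4 MorphVS(VS_INPUT v) : POSITION
{
    // Interpolate between the two stored shapes.
    float3 morphed = lerp(v.basePos, v.targetPos, g_MorphWeight);
    return mul(float4(morphed, 1.0), g_WorldViewProj);
}
```

Since only the single weight varies per object, it could also come from a per-instance stream, which is what makes the technique instancing-friendly.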
Uber shaders are the worst thing to happen to graphics, because assholes don't know how to use them properly. They ruin many performance optimisations and produce CPU bottlenecks. To use them without performance degradation you need to clearly understand when execution on the GPU is slower than switching shaders. Everything is very individual. Uber shaders are generated automatically in a loop from big base shaders, with only the preprocessor definitions changed on each pass. For example:
void somefunc()
{
    //...
#ifdef APPLYSPECULAR
    // compute specular here
#endif // APPLYSPECULAR
    //...
}
If in the first compiled shader you don't define anything, it will not contain the specular code. On the second pass of the loop, add the APPLYSPECULAR definition and call the compile function; now you have the second uber shader variant. The problems start when the number of definitions increases and the number of their combinations rises to crazy values (3,000 shaders generated in a modern game is "normal", but some have much more). Because of these uber shaders my mod has a long startup time; every one of them needs to be processed. How much memory such shit costs the driver I don't know, but Deus Ex 3 had too many shaders (7k if I remember) and the driver crashed before a fix. Current drivers may handle more shaders, "thanks" to modern developers.

Sorry, I can't answer the way you'd like, because the questions lead you in the wrong direction.
_________________
i9-9900k, 64Gb RAM, RTX 3060 12Gb, Win7

Offline
User avatar
*sensei*
Posts: 316
Joined: 12 Aug 2013, 18:55
Location: Scotland

Re: Draw Call Performance, OpenCL Batching & Shader Question

ENBSeries wrote: Something is wrong with your maths. That count would only be valid for some inefficient semi-deferred rendering mode. Every mesh has 1 diffuse and 1 normal map, which means 3,000 draw calls. Without lighting and specular, normal maps aren't needed at all, btw.
Oooh! So different rendering methods (forward rendering, deferred, forward+, etc.) utilize fewer/more draw calls than others? Didn't know that; thought it was just 1 draw call for each mesh & 1 for each texture, regardless of the renderer.

Is there a guideline for how many draw calls each rendering method generates? As in, 1 mesh + 1 diffuse + 1 normal + 1 parallax + 1 specular + 1 light + 1 shadow cast + 1 shadow on object = x draw calls in forward, y draw calls in generic deferred, z draw calls in tiled deferred, etc.?

With DIPs (draw calls) things are not that simple. It's not just the draw function call that costs a lot of performance, but every command sent to the driver between draw calls. You may (on old NVIDIA drivers at least) issue a crazy number of draw calls in a loop without modifying anything, and performance will be awesome. Insert any change to an object -> bottleneck.
Ooh, this sounds pretty damn cool. So if you have many of the same object without variation (bone transform, object-specific shader, animated normals, different diffuse, etc.), even without instancing, performance won't be horrid?

So packing textures into atlases is a good thing, but it eliminates only a few percent of the bottleneck, and it has a rather negative impact of its own, because handling atlases means extra VRAM and slower shaders.
More VRAM usage is to be expected, I suppose. And when you say shaders will be slower, is that just for object shaders (e.g., object cel-shading outlines, geometry instancing, Skyrim's ghost shader effect)?

Or does that apply to all shaders in general (tonemapping, bloom, geometry instancing, ambient lighting, etc.)?

Fully batching objects into statics is the best possible thing to improve performance. And there are tricks to minimise the negative side effects of statics by not making them fully static.
Not making the statics entirely static? That sounds like dirty-hack territory. Is it possible to get a brief example, or is it a pain in the ass due to complexity?

Developers ignore the cost of the functions that set shader constants, because they are assumed to be cheap. That's not true: with the complexity of modern shaders, the amount of data transferred to the video card as constants can be much greater than the cost of textures or many other functions.
Aha! This sounds like performance gold.

Basically, you should define shader constants (variables that don't change) at the start of the shader, before the vertex and pixel processing code?

Batching using OpenCL is a bad idea, because data is not shared between it and DirectX; you will be stuck with CPU-GPU-CPU-GPU data transfers, which is much slower than just computing on the CPU, especially in separate threads. And I don't see the problem with batching meshes while the level is loading.
Yeah, the latency and bandwidth would be murder for performance. Didn't think of that. Perhaps that is where HSA (basically OpenCL that runs directly on the iGPU?) could save the day.

Ah! Didn't think about loading screens...Wouldn't be as simple for open world (Skyrim-esque open world) games, I take it? Unless it's possible to process a set amount of vertices at once; so the software slowly batches over time, without hampering performance.

Vertex morphing officially doesn't exist as a feature of realtime graphics; implementing it is a task for the developer. Add to every source vertex the position of the corresponding vertex of the destination object, and interpolate between the values. It's compatible with instancing, because it doesn't require much data per object. Instancing as a hardware feature is not very useful in practice; optimising the simplest objects is not as important as optimising complex shadowed and lit models. And that's where tricky instancing algorithms come to life: the use of vertex textures for storing data.
Oh my. Being able to instance morphs could lead to some ingenious draw-call wizardry. Use morphs to change the appearance of armour (raise/flatten spikes, stretch platemail, grow/shrink helmet visors), heads (face shapes), bodies (muscular/thin, tall/short, pregnant/not pregnant, fat/skinny, etc.) and then instance the meshes.

Damn, that sounds pretty fucking cool.


Vertex textures? Found this NVidia .pdf on the subject. Doesn't really explain how it works. ftp://download.nvidia.com/developer/Pap ... xtures.pdf

So, basically, you create a 3D texture (basically a huge HLSL float3 array?) and store the positions of each vertex in it?

For example, each pixel would have an R colour value (x position), G colour value (y position) and B colour value (z position). Then the shader moves the vertex according to the RGB value?

Uber shaders are the worst thing to happen to graphics, because assholes don't know how to use them properly. They ruin many performance optimisations and produce CPU bottlenecks. To use them without performance degradation you need to clearly understand when execution on the GPU is slower than switching shaders. Everything is very individual. Uber shaders are generated automatically in a loop from big base shaders, with only the preprocessor definitions changed on each pass. For example:
void somefunc()
{
    //...
#ifdef APPLYSPECULAR
    // compute specular here
#endif // APPLYSPECULAR
    //...
}
If in the first compiled shader you don't define anything, it will not contain the specular code. On the second pass of the loop, add the APPLYSPECULAR definition and call the compile function; now you have the second uber shader variant. The problems start when the number of definitions increases and the number of their combinations rises to crazy values (3,000 shaders generated in a modern game is "normal", but some have much more). Because of these uber shaders my mod has a long startup time; every one of them needs to be processed. How much memory such shit costs the driver I don't know, but Deus Ex 3 had too many shaders (7k if I remember) and the driver crashed before a fix. Current drivers may handle more shaders, "thanks" to modern developers.
Bloody hell, uber shaders are convoluted beyond belief. Unless you absolutely know what you're doing, it's probably best to stick to 'traditional' shaders.
Sorry, I can't answer the way you'd like, because the questions lead you in the wrong direction.
Don't worry dude, you're giving some pretty damn great info. Much appreciated.


Got another question, but it's about learning how to code shaders.

So far, my experience with shaders has been porting ReShade effects over to Fallout 4's ENB. It's not terribly exciting, and there isn't much to learn about coding shaders just by looking at color.rgb multiplication/division, as far as I can tell.

Is there a good way to get to grips with shaders, or rendering in general, that isn't some insanely boring, condescending, 20-year-old step-by-step lesson?

I suppose downloading an engine (Amazon Lumberyard might be good?) and hacking away at it would be a start, but that would require some decent experience to avoid drowning in a non-intuitive nightmare.

Offline
User avatar
*blah-blah-blah maniac*
Posts: 17427
Joined: 27 Dec 2011, 08:53
Location: Rather not to say

Re: Draw Call Performance, OpenCL Batching & Shader Question

just 1 draw call for each mesh & 1 for each texture
No. A texture is "attached" to a mesh and drawn together with it. Rendering a mesh without its texture is only useful for some deferred implementations. There are methods to draw meshes first and then apply textures, but nobody uses them for drawing objects (I use one only to increase the quality of reflections in Skyrim).
Is there a guideline for how many draw calls each rendering method generates?
Don't think so. Everyone decides how the render pipeline will look, based on requirements and knowledge. In general, any renderer has about the same number of draw calls: one per mesh, plus the meshes drawn into shadow maps, plus a few extra calls, mostly for screen-space operations. On top of that there are reflections drawn to cubemaps, non-optimised deferred rendering, or extending the limits of the hardware by drawing extras. There is also raising the performance of some stage with a prepass, which is not guaranteed to work as expected and depends heavily on the hardware and display resolution. There are too many possible deferred rendering modes to classify them, more than 10, not counting very perverted ones that are perfect for certain situations.
even without instancing, performance won't be horrid?
Yes, but you won't see any difference on screen (except with alpha blending, if enabled), so there is no sense in using pure DIPs.
when you say shaders will be slower
Reading an area from an atlas properly, without artifacts under anisotropic filtering, adds a lot of extra code to shaders.
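To give a feel for that extra code, here is a minimal sketch of sampling one tile of an atlas with half-texel clamping to reduce bleeding (names hypothetical; a truly anisotropic-safe version needs considerably more, e.g. gradient handling around the frac() seam):

```hlsl
// Remap a mesh's 0..1 UVs into one tile of an atlas, clamping a
// half-texel inside the tile so neighbouring tiles don't bleed in.
sampler2D g_Atlas;
float2 g_TileOrigin;   // tile's corner in atlas UV space
float2 g_TileSize;     // tile's extent in atlas UV space
float2 g_HalfTexel;    // half a texel in atlas UV space

float4 SampleAtlas(float2 uv)
{
    // frac() emulates wrap addressing inside the tile, but it also
    // breaks UV derivatives at the seam - one source of the artifacts.
    float2 tiled = frac(uv) * g_TileSize + g_TileOrigin;
    float2 lo = g_TileOrigin + g_HalfTexel;
    float2 hi = g_TileOrigin + g_TileSize - g_HalfTexel;
    return tex2D(g_Atlas, clamp(tiled, lo, hi));
}
```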
Not making the statics entirely static? That sounds like dirty-hack territory. Is it possible to get a brief example, or is it a pain in the arse due to complexity?
Pack the meshes into a few big vertex buffers, and add data to each vertex inside them (or in an external stream) which identifies the individual mesh, for example one float value. Use that value to index into any per-mesh data, for example transformation matrices, which are selected in the shader much like bone matrices. This works fine with dynamic objects; all you need is to remove deleted meshes from the large buffers or add new ones. It can be done in hardware too via other tricks (don't ask about that, please).
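A sketch of that per-mesh index trick in a D3D9-style vertex shader; the names and the register budget are illustrative assumptions:

```hlsl
// A few big vertex buffers hold many meshes; each vertex carries one
// float identifying which mesh it belongs to, used to pick a matrix
// exactly the way a bone matrix is picked in skinning.
static const int MAX_MESHES = 32;   // limited by constant registers
float4x4 g_MeshTransforms[MAX_MESHES];
float4x4 g_ViewProj;

struct VS_INPUT
{
    float3 pos    : POSITION;
    float  meshId : TEXCOORD1;  // per-mesh index stored per vertex
};

float4 BatchedVS(VS_INPUT v) : POSITION
{
    // Select this vertex's transform by its mesh index.
    float4x4 world = g_MeshTransforms[(int)v.meshId];
    float4 worldPos = mul(float4(v.pos, 1.0), world);
    return mul(worldPos, g_ViewProj);
}
```

With this, all the packed meshes render in a single draw call while each can still move independently through its own matrix, which is what makes the "static" batch not entirely static.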
Basically, you should define shader constants (variables that don't change) at the start of the shader; before the vertex and pixel processing code?
That's not what I wanted to say. Constants which change for every object are the problem: instead of writing code which needs a minimal amount of constants with different data, developers compute everything on the CPU and send huge amounts of constants. If PCI-e speed is 16 GB/s, it's easy to compute how much constants decrease performance when Skyrim sends 100-400 bytes per object, with 2k+ objects drawn per frame at 60 fps. Those are just figures that came to mind; in practice, the games I've modded send around 10-300 MB/s of data in shader constants.
Didn't think about loading screens...Wouldn't be as simple for open world...
Man, look at the CPU usage of games; there is huge room for batching even while the game is running, not just during loading screens. Only games with software renderers (emulators) are able to utilise all cores fully.
Vertex textures are standard textures, but assigned to vertex shaders. You can read data from them the same way as in pixel shaders, with the exception of hardware- and DirectX-specific limitations. Store whatever you want in such textures: positions, matrices, lights, heights, etc. Instancing data too.
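A minimal sketch of a vertex-texture fetch in SM 3.0-style HLSL, along the lines of the RGB-as-XYZ guess above (names hypothetical; note that vertex shaders have no derivatives, so an explicit LOD is required):

```hlsl
// Per-vertex offsets stored in a floating-point texture:
// R = x, G = y, B = z displacement for that vertex.
sampler2D g_VertexTex;      // e.g. a 32-bit float RGBA texture
float4x4  g_WorldViewProj;

struct VS_INPUT
{
    float3 pos     : POSITION;
    float2 texelUV : TEXCOORD0; // UV pointing at this vertex's texel
};

float4 DisplaceVS(VS_INPUT v) : POSITION
{
    // tex2Dlod with an explicit mip level 0, since vertex shaders
    // cannot compute derivatives for automatic mip selection.
    float3 offset = tex2Dlod(g_VertexTex, float4(v.texelUV, 0, 0)).rgb;
    return mul(float4(v.pos + offset, 1.0), g_WorldViewProj);
}
```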
I can't recommend anything regarding graphics/shader programming, because I learned it myself from examples, the official DirectX docs, information on forums, papers, and practice. Shader programming is badly documented by M$ (the DirectX 11 docs are much the same :lol: ), but I've seen online books on Google which look good enough for beginners to pay attention to. But I don't recommend anybody enter the world of 3D graphics; it won't make you rich, and most of the time you'll be doing other things. Programming graphics itself takes up a small part of the game development cycle, and graphics-only programmers need to look for another job or simulate activity.

Offline
User avatar
*sensei*
Posts: 316
Joined: 12 Aug 2013, 18:55
Location: Scotland

Re: Draw Call Performance, OpenCL Batching & Shader Question

Yo dude, this stuff is pretty damn interesting. Thanks for sharing your insight.

And naw, don't get me wrong, there's no way I'm looking to make a job out of shaders and all that. I'd lose interest day 2 of working on generic game Tits 'N' Guns 'N' Elves 5000.

It's some sort of wannabe hobby of messing around in game engines; pretty much like playing Bethesda's games really, just one step further and without all the crash-happy nonsense and shit code in the way. It results in an interest in figuring out what causes bad performance, what fixes exist, and then how to actually apply them.


So, yeah man. Quite interestin' stuff.

Offline
User avatar
*sensei*
Posts: 316
Joined: 12 Aug 2013, 18:55
Location: Scotland

Re: Draw Call Performance, OpenCL Batching & Shader Question

Felt this thread was worthy of having another one o' me questions, rather than just making a new thread. Gotta keep that knowledge all together.


Supposing that Skyrim is CPU-limited because of draw calls (shadows being a big contributor), on a lower-end CPU/platform (for example, me own AMD Phenom II CPU + AMD GPU), how much of a benefit would texture atlases have on performance? As in, having all the weapons on one atlas, all the armours and clothing on 2 or 3 atlases, compiling all the architecture textures into another small group of atlases, etc.

The reason for asking is that TES5LODGen uses atlasing for the LOD meshes like a hero, and it's possible to extend this functionality (making texture atlases and modifying the .nif UV coordinates) to the full-detail meshes, such as weaponry, armour, buildings, etc.

If there's a decent performance boost to be had, I'll go and pitch it to Sheson, 'n' hope the prospect makes him salivate.

Offline
User avatar
*blah-blah-blah maniac*
Posts: 17427
Joined: 27 Dec 2011, 08:53
Location: Rather not to say

Re: Draw Call Performance, OpenCL Batching & Shader Question

Nothing will change, because making atlases is not enough; a lot of complex programming is required on the engine side, and it can't be done on the mod side (like ENBSeries), because rendering time is comparable to the optimisation time, and frequently it's not possible at all (e.g. objects have different lighting).

Offline
User avatar
*sensei*
Posts: 316
Joined: 12 Aug 2013, 18:55
Location: Scotland

Re: Draw Call Performance, OpenCL Batching & Shader Question

ENBSeries wrote: Nothing will change, because making atlases is not enough; a lot of complex programming is required on the engine side, and it can't be done on the mod side (like ENBSeries), because rendering time is comparable to the optimisation time, and frequently it's not possible at all (e.g. objects have different lighting).
Damn, was hoping I was on to something there. So atlases can only be of any benefit when they're used in conjunction with, say, instancing or batching?

Offline
User avatar
*blah-blah-blah maniac*
Posts: 17427
Joined: 27 Dec 2011, 08:53
Location: Rather not to say

Re: Draw Call Performance, OpenCL Batching & Shader Question

Yes

Offline
User avatar
*sensei*
Posts: 316
Joined: 12 Aug 2013, 18:55
Location: Scotland

Re: Draw Call Performance, OpenCL Batching & Shader Question

Got another lil' question!


Seeing as how deferred rendering typically uses per pixel lighting, are there deferred implementations that support vertex lighting? Or is that only possible via forward rendering (maybe forward+ rendering is the solution)?