19/04/2010 OpenGL 3.4 / 4.1: Expectations / Wish-list (major)

Separate shader programs and generalized explicit locations / indices

When Cg was introduced in 2002, it was already designed around a concept called 'semantics'. Semantics are like a contract signed between the shader programs and the C++ program, defining how attributes should be linked between them. The names of these semantics were POSITION, PSIZE, FOG, COLOR0, COLOR1, TEXCOORD0–TEXCOORD7. It was a contract signed by the C++ program and the shader programs, but its details were written by nVidia... For years, a common practice was to use texture coordinate entries to pass tangents because we didn't have the choice... What the heck? Oo

When GLSL was released in 2004, a fairly different approach was taken. Actually, two ways were possible: either querying the attribute locations with glGetAttribLocation or setting the location values with glBindAttribLocation, a method that took a long time to become reliable on both nVidia and AMD.

With OpenGL 3.3 and GL_ARB_explicit_attrib_location, a third, really nice way is available: directly setting the attribute locations in the shader. The semantics idea finally done right! Developers can define a contract between the C++ program and the shader programs without any setting or querying step. The first Cg implementation defined the contract itself; with GLSL 3.30, programmers themselves can define this contract, which avoids having to use TEXCOORD2 for tangents.
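As a minimal sketch of such a contract, the same semantic values can live in one C++ table and be turned into the #define lines a GLSL 3.30 shader expects. The names definition and buildPreamble are hypothetical, chosen for this illustration only; the values mirror the ATTR_* defines used in this post.

```cpp
#include <cassert>
#include <string>
#include <vector>

// C++ side of the contract. The values mirror the ATTR_* defines used in
// this post; the namespace layout is only an illustration.
namespace semantic
{
    namespace attr
    {
        enum type
        {
            POSITION = 0,
            COLOR = 3,
            TEXCOORD = 4
        };
    }
}

// One "#define NAME VALUE" line of the GLSL preamble (hypothetical helper type).
struct definition
{
    std::string name;
    int value;
};

// Build the text to place in front of a GLSL 3.30 shader body so that the
// shader sees exactly the same numbers as the C++ code.
std::string buildPreamble(std::vector<definition> const& definitions)
{
    std::string preamble = "#version 330 core\n";
    for (definition const& d : definitions)
    {
        assert(!d.name.empty()); // a macro needs a name
        preamble += "#define " + d.name + " " + std::to_string(d.value) + "\n";
    }
    return preamble;
}
```

Since glShaderSource accepts several strings, such a preamble could be passed as the first string and the shader body (without its own #version line) as the second one.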

The idea with generalized explicit locations and indices is to extend the possibility of defining a contract to uniform variables and uniform blocks, but also to varying variables, so that the contract can define the communication between shaders. nVidia OpenGL 3.3 drivers already allow specifying the location of varying variables, to define varying 'semantics'.

The concept of 'blocks' was introduced in GLSL 1.40 with OpenGL 3.1, where blocks were only available for uniform variables. In a way, blocks follow the idea of a contract, initially just at the uniform variable level. With GLSL 1.50 and OpenGL 3.2, blocks were generalized to varying variables and, by that fact, extend a bit the idea of a contract to the communication between shader stages. nVidia has done some work to extend this concept to vertex shader inputs and fragment shader outputs. I have experimented with it: vertex attribute blocks work well, but I got a link error when trying a fragment shader output block. This interest from nVidia feels like a good hint of where we are going, and it is one that really pleases me!

OpenGL 3.3 explicit locations:
    #version 330 core

    // Declare all the semantics
    #define ATTR_POSITION 0
    #define ATTR_COLOR 3
    #define ATTR_TEXCOORD 4
    #define FRAG_COLOR 0

    uniform transform
    {
        mat4 MVP;
    } Transform;

    layout(location = ATTR_POSITION) in vec2 Position;
    layout(location = ATTR_TEXCOORD) in vec2 Texcoord;

    out vert
    {
        vec2 Texcoord;
    } Vert;

    void main()
    {
        Vert.Texcoord = Texcoord;
        gl_Position = Transform.MVP * vec4(Position, 0.0, 1.0);
    }
OpenGL 3.4 generalized blocks, explicit locations and indices?
    #version 340 core

    // Declare all the semantics
    #define UNIF_TRANSFORM 0
    #define UNIF_MATERIAL 1
    #define ATTR_POSITION 0
    #define ATTR_COLOR 3
    #define ATTR_TEXCOORD 4
    #define VERT_POSITION 0
    #define VERT_COLOR 3
    #define VERT_TEXCOORD 4
    #define VERT_INSTANCE 7
    #define FRAG_COLOR 0

    layout(index = UNIF_TRANSFORM) uniform transform
    {
        mat4 MVP;
    } Transform;

    in attrib
    {
        layout(location = ATTR_POSITION) vec2 Position;
        layout(location = ATTR_TEXCOORD) vec2 Texcoord;
    } Attrib;

    out vert
    {
        vec4 gl_Position;
        layout(location = VERT_TEXCOORD) vec2 Texcoord;
    } Vert;

    void main()
    {
        Vert.Texcoord = Attrib.Texcoord;
        gl_Position = Transform.MVP * vec4(Attrib.Position, 0.0, 1.0);
    }

I would be really surprised if all of the generalized blocks, explicit locations and indices were actually part of OpenGL 3.4. However, the chances of seeing explicit varying locations are actually high: nVidia already supports them in their OpenGL 3.3 drivers! The idea of a communication contract between shaders might bring a huge benefit: if we know how shaders communicate, then the linking step of a GLSL program might become moot.

Currently, at link time, the compiler needs to link the output variables of the vertex shader to the input variables of the fragment shader using the variable names (strings). With explicit varying locations, this task would disappear: a vertex shader output variable with location 0 would automatically communicate with the fragment shader input variable with location 0. We could even imagine a new function to replace glTransformFeedbackVaryings, taking location (semantic) arguments. Such a function actually already exists (glTransformFeedbackVaryingsNV) within the nVidia GL_NV_transform_feedback extension.
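To picture the difference, here is a small C++ sketch of the two matching strategies. It is not how any driver is actually implemented: varying, matchByName and matchByLocation are hypothetical names, and a real linker also checks types and qualifiers.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical description of one variable at a shader stage interface.
struct varying
{
    std::string name; // GLSL identifier, used by name-based linking
    int location;     // explicit location, as in layout(location = N)
};

// Name-based matching: for each fragment input, the index of the vertex
// output with the same name, or -1 when there is none.
std::vector<int> matchByName(std::vector<varying> const& outputs, std::vector<varying> const& inputs)
{
    std::map<std::string, int> byName;
    for (std::size_t i = 0; i < outputs.size(); ++i)
        byName[outputs[i].name] = static_cast<int>(i);

    std::vector<int> matches;
    for (varying const& input : inputs)
    {
        std::map<std::string, int>::const_iterator it = byName.find(input.name);
        matches.push_back(it == byName.end() ? -1 : it->second);
    }
    return matches;
}

// Location-based matching: names are ignored, only the contract matters.
std::vector<int> matchByLocation(std::vector<varying> const& outputs, std::vector<varying> const& inputs)
{
    std::map<int, int> byLocation;
    for (std::size_t i = 0; i < outputs.size(); ++i)
    {
        // Two outputs of the same stage must not share a location.
        assert(byLocation.count(outputs[i].location) == 0);
        byLocation[outputs[i].location] = static_cast<int>(i);
    }

    std::vector<int> matches;
    for (varying const& input : inputs)
    {
        std::map<int, int>::const_iterator it = byLocation.find(input.location);
        matches.push_back(it == byLocation.end() ? -1 : it->second);
    }
    return matches;
}
```

With locations, a vertex shader writing 'Texcoord' at location 4 and a fragment shader reading 'UV' at location 4 still connect, even though the names differ and the two shaders never saw each other.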

Separate shader programs are a highly requested feature because they will allow mixing and matching shaders from different stages. Chances are that the same vertex shader could be used with different fragment shaders (for different materials), but so far with GLSL, the vertex shader has to be attached to several GLSL programs. Moreover, it will make it easier to port Direct3D software whose design relies on shader stage independence.

VAO evolution through 'bindless graphics'

When VAOs became available with OpenGL 3.0, the community was at first quite enthusiastic. However, we then realized that VAOs didn't satisfy the performance expectations they had raised. Basically, VAOs reduce all the vertex attribute configuration calls to a single call at draw time; all the usual calls are moved to VAO creation time. Unfortunately, as the number of VAOs increases, performance decreases to the point that it drops below that of using VBOs only.

The way VAOs are designed, one VAO matches one mesh. If we don't use a semantics-oriented software design, it becomes worse because we need as many VAOs as there are shaders used by the mesh: multiple VAOs per mesh. To be efficient, a VAO needs to store several meshes, which is possible but requires a quite complicated software design for what is actually a small performance increase anyway. Is it worth it? On AMD, it's up to +20% for fewer than 2000 VAOs. On nVidia, the gain is higher than +20% but only holds up to 200 VAOs...

The lazy 'one of everything per mesh' approach is actually quite common among developers who can't afford to develop a more sophisticated design. In this case, the VAO API fits easily, but it's really inefficient anyway. OpenGL becomes efficient when assets are shared and draw calls can be sorted to take advantage of this sharing to reduce state and object changes.
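The effect of sorting can be sketched without any OpenGL call by counting how often the bound program or vertex layout changes over a frame. This is only an illustration under simple assumptions: drawCall, countStateChanges and sortByState are hypothetical names, identifiers are non-negative integers, and every change is given the same cost.

```cpp
#include <algorithm>
#include <cassert>
#include <tuple>
#include <vector>

// Hypothetical draw call record: which program and vertex layout it needs.
struct drawCall
{
    int program;
    int vertexLayout;
    int mesh;
};

// Count how many times the program or the vertex layout has to be changed
// when the calls are submitted in the given order. Nothing is bound at start.
int countStateChanges(std::vector<drawCall> const& calls)
{
    int changes = 0;
    int program = -1;
    int vertexLayout = -1;
    for (drawCall const& call : calls)
    {
        assert(call.program >= 0 && call.vertexLayout >= 0);
        if (call.program != program)
        {
            program = call.program;
            ++changes;
        }
        if (call.vertexLayout != vertexLayout)
        {
            vertexLayout = call.vertexLayout;
            ++changes;
        }
    }
    return changes;
}

// Group the calls that share a program, then a vertex layout. The stable
// sort keeps the submission order of calls that use the same state.
void sortByState(std::vector<drawCall>& calls)
{
    std::stable_sort(calls.begin(), calls.end(), [](drawCall const& a, drawCall const& b)
    {
        return std::tie(a.program, a.vertexLayout) < std::tie(b.program, b.vertexLayout);
    });
}
```

For four draw calls alternating between two program / layout pairs, this count drops from 8 changes to 4 once the calls are sorted. A real renderer would also have to respect ordering constraints such as blending, which this sketch ignores.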

A fairly common scenario is that a lot of different meshes (a lot of buffer sets) in a single application share the same vertex layout / format. This is a sharing scenario where VAOs don't fit because every different buffer combination requires a new VAO or a VAO update.
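A rough way to see the mismatch is to count objects. The sketch below assumes a hypothetical mesh record holding integer identifiers; it compares the number of VAOs needed when each buffer combination gets its own VAO with the number of distinct vertex layouts actually in use.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <tuple>
#include <vector>

// Hypothetical mesh record: a vertex layout plus the buffers that feed it.
struct mesh
{
    int vertexLayout;
    int arrayBuffer;
    int elementBuffer;
};

// One VAO per distinct (layout, array buffer, element buffer) combination.
std::size_t countVertexArrays(std::vector<mesh> const& meshes)
{
    std::set<std::tuple<int, int, int> > combinations;
    for (mesh const& m : meshes)
        combinations.insert(std::make_tuple(m.vertexLayout, m.arrayBuffer, m.elementBuffer));
    return combinations.size();
}

// Number of distinct vertex layouts, whatever buffers they are used with.
std::size_t countVertexLayouts(std::vector<mesh> const& meshes)
{
    std::set<int> layouts;
    for (mesh const& m : meshes)
        layouts.insert(m.vertexLayout);
    return layouts.size();
}

// VAOs that exist only because the buffers differ, not the layout.
std::size_t countRedundantVertexArrays(std::vector<mesh> const& meshes)
{
    std::size_t const vertexArrays = countVertexArrays(meshes);
    std::size_t const layouts = countVertexLayouts(meshes);
    assert(layouts <= vertexArrays); // each layout appears in at least one combination
    return vertexArrays - layouts;
}
```

With three meshes sharing one layout and a fourth using another, this gives 4 VAOs for only 2 layouts, so 2 of the VAOs repeat a format description that already exists.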

I have been told, and I have read, that the benefit of VAOs comes from avoiding chasing buffer pointers, but there are actually a lot of other savings. Here is an idea I advocate: what if we could share the vertex layout across several different buffer combinations?

OpenGL 3.0 VAOs:
    // Create the VAO and record the vertex format together with the buffers
    glGenVertexArrays(1, &this->VertexArrayName);
    glBindVertexArray(this->VertexArrayName);

    // Attributes sourced from the dynamic buffer
    glBindBuffer(GL_ARRAY_BUFFER, this->DynamicBufferName);
    glVertexAttribPointer(glf::semantic::attr::POSITION, 3, GL_FLOAT, GL_FALSE, sizeof(vertex),
        GLF_BUFFER_OFFSET(0));
    glVertexAttribPointer(glf::semantic::attr::NORMAL, 4, GL_INT_2_10_10_10_REV, GL_FALSE, sizeof(vertex),
        GLF_BUFFER_OFFSET(sizeof(glm::vec3)));
    glVertexAttribPointer(glf::semantic::attr::TANGENT, 4, GL_INT_2_10_10_10_REV, GL_FALSE, sizeof(vertex),
        GLF_BUFFER_OFFSET(sizeof(glm::vec3) + sizeof(i2i10vec4)));

    // Attributes sourced from the static buffer
    glBindBuffer(GL_ARRAY_BUFFER, this->StaticBufferName);
    glVertexAttribPointer(glf::semantic::attr::ALPHA, 1, GL_FLOAT, GL_FALSE, sizeof(vertex),
        GLF_BUFFER_OFFSET(0));
    glVertexAttribPointer(glf::semantic::attr::TEXCOORD, 2, GL_FLOAT, GL_FALSE, sizeof(vertex),
        GLF_BUFFER_OFFSET(sizeof(float)));
    glBindBuffer(GL_ARRAY_BUFFER, 0);

    glEnableVertexAttribArray(glf::semantic::attr::POSITION);
    glEnableVertexAttribArray(glf::semantic::attr::NORMAL);
    glEnableVertexAttribArray(glf::semantic::attr::TANGENT);
    glEnableVertexAttribArray(glf::semantic::attr::ALPHA);
    glEnableVertexAttribArray(glf::semantic::attr::TEXCOORD);

    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, this->ElementBufferName);
    glBindVertexArray(0);
    ...
    // Use the VAO and draw
    glBindVertexArray(VertexArrayName);
    glDrawElements(GL_TRIANGLES, ElementCount, IndicesType, 0);

    // Use another VAO and draw, reset the entire vertex format / layout and buffers
    glBindVertexArray(VertexArrayName2);
    glDrawElements(GL_TRIANGLES, ElementCount2, IndicesType, 0);
VAO alternative idea with explicit binding points and layout sharing:
    // Proposed API, not part of OpenGL 3.3: the first argument of
    // glVertexAttribFormat is the buffer binding point the attribute reads from
    glGenVertexArrays(1, &this->VertexArrayName);
    glBindVertexArray(this->VertexArrayName);

    // Format of the attributes read from binding point 0
    glVertexAttribFormat(0, glf::semantic::attr::POSITION, 3, GL_FLOAT, GL_FALSE, sizeof(vertex),
        GLF_BUFFER_OFFSET(0));
    glVertexAttribFormat(0, glf::semantic::attr::NORMAL, 4, GL_INT_2_10_10_10_REV, GL_FALSE, sizeof(vertex),
        GLF_BUFFER_OFFSET(sizeof(glm::vec3)));
    glVertexAttribFormat(0, glf::semantic::attr::TANGENT, 4, GL_INT_2_10_10_10_REV, GL_FALSE, sizeof(vertex),
        GLF_BUFFER_OFFSET(sizeof(glm::vec3) + sizeof(i2i10vec4)));
    glBindBufferBase(GL_ARRAY_BUFFER, 0, this->DynamicArrayBufferName);

    // Format of the attributes read from binding point 1
    glVertexAttribFormat(1, glf::semantic::attr::ALPHA, 1, GL_FLOAT, GL_FALSE, sizeof(vertex),
        GLF_BUFFER_OFFSET(0));
    glVertexAttribFormat(1, glf::semantic::attr::TEXCOORD, 2, GL_FLOAT, GL_FALSE, sizeof(vertex),
        GLF_BUFFER_OFFSET(sizeof(float)));
    glBindBufferBase(GL_ARRAY_BUFFER, 1, this->StaticArrayBufferName);

    glEnableVertexAttribArray(glf::semantic::attr::POSITION);
    glEnableVertexAttribArray(glf::semantic::attr::NORMAL);
    glEnableVertexAttribArray(glf::semantic::attr::TANGENT);
    glEnableVertexAttribArray(glf::semantic::attr::ALPHA);
    glEnableVertexAttribArray(glf::semantic::attr::TEXCOORD);

    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, this->ElementBufferName);
    glBindVertexArray(0);
    ...
    // Use the VAO and draw
    glBindVertexArray(VertexArrayName);
    glDrawElements(GL_TRIANGLES, ElementCount, IndicesType, 0);

    // Change the buffers but keep the vertex format / layout. Only one buffer could be changed if relevant
    glBindBufferBase(GL_ARRAY_BUFFER, 0, NewDynamicArrayBufferName);
    glBindBufferBase(GL_ARRAY_BUFFER, 1, NewStaticArrayBufferName);
    glBindBuffer(GL_ELEMENT_ARRAY_BUFFER, NewElementBufferName);
    glDrawElements(GL_TRIANGLES, ElementCount2, IndicesType, 0);

Currently with VAOs, when we change a buffer we need to call all the glVertexAttribPointer functions again. With this proposal, we would just need to change the buffers, and I would expect sorting the draw calls per VAO to bring an extra benefit. This proposal also still allows using VAOs as static OpenGL 3.0 VAOs... but nVidia released something following this idea and going beyond!

After the early community debate on VAOs, nVidia's answer was what they call 'bindless graphics', which involves two extensions: GL_NV_shader_buffer_load and GL_NV_vertex_buffer_unified_memory. At release, nVidia announced that 'bindless graphics' is 7X faster, which no one I know has reached, but a solid 2X has been measured, which is already absolutely amazing. Considering that with VAOs we get up to +20% so far and with 'bindless graphics' we get +100%, that's a gain five times larger!

With bindless graphics, nVidia follows the principle of separating the vertex format from the buffers thanks to GL_NV_vertex_buffer_unified_memory, but also handles the problem of chasing buffer pointers by giving access to GPU addresses. Moreover, GL_NV_shader_buffer_load allows reading from as many buffers as we want inside shaders, and GL_NV_shader_buffer_store even allows writing from shaders into buffers on OpenGL 4 hardware. A new way to do transform feedback? On top of that, this extension provides mechanisms to perform atomic operations on these buffers...

'Bindless graphics' in OpenGL 4.1 (maybe OpenGL 3.4)? I think there is a good chance, even if I have the feeling that AMD is a bit against it. One issue is that we get access to GPU addresses, and from that we can imagine all the Windows blue screens and hardware resets we want (or not). If 'bindless graphics' remains optional, then careful developers who need this performance benefit (2X easily!) and the extra features would use it; others would ignore it. The ARB needs to agree and, as I have been told, it was already a contentious topic during OpenGL 3.0 development...

'Bindless graphics' is some kind of Direct3D 11 feature, with the concepts of RWBuffer and RWTexture. I have the impression that the ARB really wants to catch up with all Direct3D 11 features as soon as possible, which, I guess, increases the probability of this feature being adopted.

From 'texture barrier' to image and buffer 'load and store'

One of the most missed features from the OpenGL 3.3 and OpenGL 4.0 releases is nVidia's GL_NV_texture_barrier. Texture barrier allows reading from a texture that we are writing to in a safe manner, instead of using a texture ping-pong method.
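The two approaches can be compared with a CPU-side sketch, where a std::vector stands in for a texture and shade is a made-up per-texel operation. The sketch assumes the simple case where each texel is read and written only at its own position, which is my understanding of the case the extension makes safe; it does not model GPU caches or the barrier call itself.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

typedef std::vector<float> texture;

// Hypothetical per-texel operation standing in for a fragment shader that
// only reads the texel it is about to write.
float shade(float texel)
{
    return texel * 0.5f + 1.0f;
}

// Ping-pong: read from one texture, write to a second one, then swap roles.
// It needs twice the memory and a render target switch per pass.
texture runPingPong(texture source, int passes)
{
    assert(passes >= 0);
    texture target(source.size());
    for (int pass = 0; pass < passes; ++pass)
    {
        for (std::size_t i = 0; i < source.size(); ++i)
            target[i] = shade(source[i]);
        std::swap(source, target); // the texture just written feeds the next pass
    }
    return source;
}

// In place: read and write the same texture, one texel at a time. On the
// GPU, a texture barrier would be issued between two passes.
texture runInPlace(texture image, int passes)
{
    assert(passes >= 0);
    for (int pass = 0; pass < passes; ++pass)
        for (std::size_t i = 0; i < image.size(); ++i)
            image[i] = shade(image[i]);
    return image;
}
```

Both functions return the same values for the same input, but the in-place version never allocates or binds a second texture. As soon as a pass reads neighbouring texels, as a blur does, this equivalence no longer holds and ping-pong is needed again.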

With their OpenGL 4.0 drivers, nVidia has released the specifications of extensions called GL_EXT_shader_image_load_store and GL_NV_shader_buffer_store, which extend both texture barrier and bindless graphics to a more generic kind of RWBuffer and RWTexture where as many buffers and textures as needed can be used in shaders to load and store data... an amazing feature. Actually, I don't know how this is going to be used yet, but the possibilities brought by this flexibility feel huge!

GL_NV_shader_buffer_store is actually a 'bindless graphics' write feature, which allows us to think about a wonderfully flexible transform feedback method: writing into buffers from any stage, from as many stages as needed, and with the data types and sizes we want.

Is GL_EXT_shader_image_load_store actually the famous blend shader (integrated in the fragment shader) that has been advocated by a lot of developers? Probably!

Both nVidia and AMD have worked on GL_EXT_shader_image_load_store, which definitely puts it among the features I think we will see in the OpenGL 4.1 specification, probably the OpenGL 4.1 'selling feature'.

Copyright Christophe Riccio 2002-2016 all rights reserved