Re: Another rendering update
Posted: Thu Jan 10, 2013 2:44 pm
Quick follow-up on entities. This is an interesting item because of how the hardware is designed. Most of the time you have a bunch of identical objects on the screen, all with different positions that move around, and the usual answer is something called instancing, if it's available. What this does is let you store the model data once and keep the per-instance data in another buffer. Typically that's a custom transformation matrix which moves the model into the correct world coordinates, but it can also be more advanced stuff, like entire skeleton poses. Once that is stored, you make a single draw call to generate X instances.
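To make the idea concrete, here's a quick Python sketch of what instancing does conceptually (the real thing would be OpenGL calls from Java; the function names and data here are made up for illustration): one copy of the model, a separate list of per-instance matrices, and one "draw" that expands into N instances.

```python
# Conceptual sketch of hardware instancing (hypothetical names, not real GL).

def mat_apply(m, v):
    """Apply a 4x4 row-major matrix (16 floats) to a 3D point (ignoring w)."""
    x, y, z = v
    return (m[0]*x + m[1]*y + m[2]*z  + m[3],
            m[4]*x + m[5]*y + m[6]*z  + m[7],
            m[8]*x + m[9]*y + m[10]*z + m[11])

def draw_instanced(model_verts, instance_matrices):
    """The GPU-side expansion: every instance reuses the same model data,
    transformed by its own per-instance matrix."""
    return [[mat_apply(m, v) for v in model_verts] for m in instance_matrices]

# One triangle, stored once; two instances, the second shifted 5 units along x:
tri = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
ident   = [1,0,0,0,   0,1,0,0, 0,0,1,0, 0,0,0,1]
shifted = [1,0,0,5.0, 0,1,0,0, 0,0,1,0, 0,0,0,1]
out = draw_instanced(tri, [ident, shifted])
```

The point is that `tri` exists once no matter how many matrices you queue up, which is exactly the memory saving instancing buys you.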
The old-school way to do this is to push matrix transformations onto a dedicated matrix stack and call a display list to draw the model each time. The problems are twofold. First, I'm pretty sure editing the transformation matrix counts as a state change, which causes the render pipeline to flush (in other words, it finishes drawing everything it was working on before applying the new matrix, because the new matrix would affect everything still in the queue. More importantly, it won't put anything else on the queue until the matrix has changed, so all the time the pipelines sit empty is wasted. Once or twice is acceptable, but a few thousand times and it starts to show). It's worth noting this doesn't always happen when you put matrix-change instructions into a compiled display list, because OpenGL can get clever and pre-transform the input coordinates so that the current matrix doesn't need to change.
The second problem is that issuing too many draw calls can make you hit your batch limit. OpenGL can only send a limited number of commands to the GPU per second, and that limit is independent of the PCI bandwidth. Once you hit it, no other optimization can help your fps. The fix is to put more stuff into a single "batch" and send it all off at once. For instance, rather than drawing triangles one at a time, you draw a whole array of triangles. Display lists are good for this, too, because they can store all of those commands on the GPU and execute them there. You would think from all of that that display lists are pretty great, and while it's true they are a little faster than vertex buffer objects, there is a reason they were deprecated. The reasoning is kind of complicated and not everyone agrees with it, but the main thing is that display lists were designed for a fixed-function pipeline, while modern games all use programmable shaders and higher-poly models, and the two systems don't fit together quite right. For the sake of consistency, we're going to try to keep the use of display lists to a minimum, even though Minecraft falls more in line with old-school fixed-function-pipeline games.
So back to instancing. Hardware instancing is nice, but it does have a cost. Specifically, a per-instance cost. In fact, you need about 1K triangles in your instance model to break even. In the case of Minecraft, I don't care how fancy your entities get, you're probably not going to go past 20. So that's a waste. Well, if not that, and not display-list style, then what do we do? The best answer I've found is that you do it yourself. Specifically, instead of storing a model once in a vertex buffer, you store it once for every time that guy is going to show up. I know this whole time we've been trying to ration every byte of video memory we can, but this is worth it, and as a matter of fact, the display-list style would compile into roughly the same data (since it is trying to cancel out all the matrix changes, it needs independent copies of the geometry). Instead of storing a per-instance transformation matrix, you just precalculate each vertex into its proper world coordinates. These are low-poly models, so it's not a big deal to do this, and even the display lists are doing the vector math on the CPU, so there's no difference there.
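A minimal sketch of that "do it yourself" baking step, in Python for clarity (the names are made up; the real version would be Java writing into a float buffer): every entity gets its own world-space copy of the model, packed back to back into one flat array.

```python
def build_entity_buffer(model_verts, entity_positions):
    """Bake a world-space copy of the model for every entity, back to back,
    into one flat float list ready to upload into a single VBO."""
    buf = []
    for (ex, ey, ez) in entity_positions:      # one copy per entity
        for (vx, vy, vz) in model_verts:       # pre-translate into world space
            buf.extend((vx + ex, vy + ey, vz + ez))
    return buf

# A two-triangle quad, with three entities standing at different spots:
quad = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0),
        (0.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
vbo_data = build_entity_buffer(quad, [(0.0, 0.0, 0.0),
                                      (10.0, 0.0, 0.0),
                                      (0.0, 0.0, 10.0)])
```

The sketch only handles translation; a full version would apply each entity's rotation too, but the layout (identical-size blocks, one per entity) is the part that matters for everything below.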
But this is where it gets fun. Once you have these guys all nicely lined up in your VBO, you can edit them at will. To move one, you recalculate the vertex coordinates for that instance and simply overwrite the section of memory containing it. No rebinding required, and only the blocks of memory you change need to be sent to the GPU. Display lists, on the other hand, are read-only. Once you compile one, the only way you are changing anything is by recompiling, which gets too expensive if you have a lot of entities. Performance-wise, the VBO method wins in pretty much every respect. About the only thing display lists do is save on cheap and plentiful CPU memory.
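Here's roughly what that in-place update looks like, again as a hypothetical Python sketch: recompute one entity's slice and overwrite just that region. The (offset, length) pair it returns is the kind of thing you'd hand to OpenGL's glBufferSubData so only that region gets re-uploaded.

```python
def move_entity(buf, index, model_verts, new_pos):
    """Recompute one entity's vertices at its new position and overwrite
    only that entity's slice of the CPU-side copy of the VBO."""
    stride = len(model_verts) * 3            # floats per entity slot
    offset = index * stride
    ex, ey, ez = new_pos
    fresh = []
    for (vx, vy, vz) in model_verts:
        fresh.extend((vx + ex, vy + ey, vz + ez))
    buf[offset:offset + stride] = fresh      # touch nothing outside the slice
    return offset, stride                    # x4 for byte offsets (32-bit floats)

# Two-vertex "model", two entity slots; move entity 1 to x = 5:
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
buf = [0.0] * 12
off, length = move_entity(buf, 1, model, (5.0, 0.0, 0.0))
```

Entity 0's slot is untouched, which is the whole point: no rebuild, no rebind, minimal upload.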
While changing data in place in a VBO is simple, resizing it is less so. You more or less need to generate a new VBO and transfer everything over to it. So instead, what programmers do is set the VBO up as a geometry pool. Each instance allocation is the same size, so it doesn't really matter which entity it represents. With that in mind, you can leave sufficient extra slots at the end of your VBO for adding entities later on. When you draw, you simply draw the range of your VBO that is actively in use. If multiple models have the same vertex format, you can even store multiple model types in a single geometry pool, though it gets a bit more complicated to manage. I would probably put two types in a pool and start one list at one end and the other list at the other, working towards the center.
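That two-ends-of-one-pool idea can be sketched like this (a toy allocator in Python; class and method names are invented for illustration): type A fills slots from the front, type B from the back, and the pool is full when they meet.

```python
class GeometryPool:
    """Fixed-size slots in one buffer; type A grows from the front,
    type B grows from the back, meeting in the middle."""
    def __init__(self, slot_count):
        self.slot_count = slot_count
        self.a_used = 0   # occupies slots 0 .. a_used-1
        self.b_used = 0   # occupies slots slot_count-b_used .. slot_count-1

    def alloc_a(self):
        if self.a_used + self.b_used >= self.slot_count:
            raise MemoryError("pool full")
        self.a_used += 1
        return self.a_used - 1                 # slot index for the new entity

    def alloc_b(self):
        if self.a_used + self.b_used >= self.slot_count:
            raise MemoryError("pool full")
        self.b_used += 1
        return self.slot_count - self.b_used

    def draw_ranges(self):
        """The slot ranges actively in use: one draw per end of the pool."""
        return [(0, self.a_used),
                (self.slot_count - self.b_used, self.slot_count)]
```

Two model types cost you at most two draw ranges this way, and neither type needs to know ahead of time how many slots the other will want.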
There's also the issue of how to sort entities so you render only what is in rendering range and on screen. Depending on how you are doing on batch calls, you might just want to draw several ranges from the VBO (aside from the batch-call quota, additional draw calls don't hurt the pipeline). If you are running out of batch calls, you may want to either "defrag" your entity list so you can draw it with fewer calls, taking a one-time bandwidth hit, or accept that some of the entities drawn will be off-screen and run them through the vertex shader anyway (they won't make it past the rasterizer stage, though).
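For the "several ranges" option, the bookkeeping is just collapsing the set of live slots into contiguous runs, one draw call each. A sketch (hypothetical helper, Python for brevity):

```python
def coalesce_ranges(active_slots):
    """Collapse a set of live slot indices into contiguous (start, end)
    ranges. Each range becomes one glDrawArrays-style call; the fewer
    gaps in the pool, the fewer calls you spend."""
    ranges = []
    for s in sorted(active_slots):
        if ranges and ranges[-1][1] == s:
            ranges[-1][1] = s + 1      # extend the current run
        else:
            ranges.append([s, s + 1])  # start a new run
    return [tuple(r) for r in ranges]
```

If this returns too many ranges for your batch budget, that's your signal to defrag (or to just draw one big range, off-screen entities and all).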
That's entities in a nutshell.
----
There's one other thing I didn't mention, and that's animation. There are two main methods of animation, and they are related. The first is manual matrix transformations. This is normally what you see in Minecraft. Things like limbs and spinny things are all rendered as separate pieces, and this is where nested matrix transformations and rotations come in handy, because you can rotate an object in its model space and then place it in world space, all relative to its parent object. For the most part, each entity has a dedicated class which programmatically edits the positions based on whatever animation it is performing and where it is in the sequence. The good thing about this is that the animation is always smooth and precisely placed for whatever point in time it lands on. More complicated animations tend to use positions from pre-made keyframes, which are then interpolated to the correct time by some curve function.
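The keyframe interpolation at the end of that paragraph is simple enough to sketch. Here it is with plain linear interpolation (a fancier curve just swaps the blend function; the function name is made up):

```python
def lerp_keyframes(keyframes, t):
    """keyframes: sorted list of (time, value) pairs. Returns the value at
    time t, linearly interpolated between the surrounding keyframes and
    clamped outside the animation's time range."""
    if t <= keyframes[0][0]:
        return keyframes[0][1]
    if t >= keyframes[-1][0]:
        return keyframes[-1][1]
    for (t0, v0), (t1, v1) in zip(keyframes, keyframes[1:]):
        if t0 <= t <= t1:
            f = (t - t0) / (t1 - t0)   # 0 at the first keyframe, 1 at the next
            return v0 + (v1 - v0) * f

# Swing a limb from 0 degrees up to 90 and back over two seconds:
swing = [(0.0, 0.0), (1.0, 90.0), (2.0, 0.0)]
```

In practice the value would be a rotation per joint rather than one number, but the time-to-blend-factor math is the same.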
The other way to animate is with a skeleton. A skeleton is simply a tree-like structure of joints and connections which move according to a hierarchy. For instance, if you lift your arm, your hand lifts with it. Instead of doing complicated calculations on each vertex of the model, or storing it in pieces, you do the spatial calculations on the handful of control points on the skeleton. Each vertex, then, is painted with a weight corresponding to how much influence each control point exerts on that particular vertex. This allows the entire mesh to deform. When rendering, you run a custom vertex shader which is given the model-space coordinates of each control point as constants (or more accurately, their displacement from the model's starting pose; for humanoids, that is typically standing with feet together and arms out to the sides). Each vertex processed is then displaced from its starting point by a weighted average of each influencing control point.
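The per-vertex math the shader runs is just a weighted sum. Here's a Python sketch of it (one vertex, a couple of control points; this simplified version only handles translation displacements, whereas a real skinning shader blends full joint transforms):

```python
def skin_vertex(rest_pos, displacements, weights):
    """Displace one vertex from its rest-pose position by the weighted
    average of its control points' displacements. 'weights' should sum
    to 1.0 so fully-weighted vertices track their joints exactly."""
    x, y, z = rest_pos
    for (dx, dy, dz), w in zip(displacements, weights):
        x += dx * w
        y += dy * w
        z += dz * w
    return (x, y, z)

# A vertex halfway between two joints, each joint moved differently:
bent = skin_vertex((0.0, 0.0, 0.0),
                   [(2.0, 0.0, 0.0), (0.0, 2.0, 0.0)],
                   [0.5, 0.5])
```

Because the weights blend the influences, vertices near a joint deform smoothly instead of creasing where two rigid pieces meet.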
Skeletal animations are good both for static animations (as there is much less data to store) and for dynamic animation, like terrain-accurate walking and ragdoll physics, which is why they are so popular in modern games. However, the drawback is that you are limited by processing power. On the CPU, the bottleneck is typically things like collision physics, and on the GPU it is the detail of the meshes which need to be deformed. The physics limitation is beginning to be solved by brand-new GPU hardware called compute shaders, which can process general non-graphical operations in a massively parallel manner. The vertex shading is getting help from hardware tessellation: only the base mesh needs to be run through the vertex shader, and the tessellated vertices are interpolated from those.
There are all sorts of other kinds of animation as well: lighting and texture changes, facial animations, per-vertex mathematical functions (there are a lot of those in the Minecraft shader pack), keeping full static meshes of each keyframe and switching between them like a cartoon, particle functions, and all manner of specialty techniques, none of which I will go into because they have little to do with Futurecraft and I am already talking too much as it is. But this is the end. I think I've covered everything I wanted to.