2011-01-06

More Optimizations

In my previous post I discussed some challenges and interesting compromises involved in simulating just as much physics as my game needs, given that I want the game to behave in ways that conflict with the real world. In this post I revisit optimization and discuss some recent work I've put into that area.

Introduction

I'll start with a brief explanation of why it has been so long since my last entry. Partly this is because I've been busy with other things, and the recent holidays have also taken up some of my time, but the main reason is that this round of optimizations has been quite a bit more complex than I anticipated. Interestingly, it hasn't been the actual code changes that have been challenging; it has been carefully analyzing my code's slow areas and fully understanding the optimizations I've made.

In some ways this is related to my previous post on optimizations. That post ended with the conclusion that you need to be careful when optimizing, to make sure you understand whether and how the changes are really improving performance. To that end, I tried to be more diligent in this round of optimizations.

My initial goal in this round was to optimize my game's calculations when advancing the game state frame by frame. I'm currently targeting 30 FPS, but would like the per-frame work to be as efficient as possible in order to preserve battery life. Even though the iPhone can do graphics up to 60 FPS, I examined what my game does and found very little perceptible visual improvement moving from 30 FPS to 60 FPS, which is why I decided to target the lower rate.

To be able to properly analyze the performance of my code's calculations, as opposed to the time taken by the actual graphics chip's rendering, I created a benchmark mode for my game that does all calculations as normal but doesn't make any of the OpenGL API calls. I also run this outside of the top-level game loop, since the OpenTK framework imposes its own FPS limitations in that loop.
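In sketch form, the benchmark mode amounts to something like the following (the names Game and UpdateFrame here are stand-ins for my actual types, not the real API):

    // A sketch of the benchmark mode: advance the game state for a fixed
    // number of frames without issuing any OpenGL calls, then report FPS.
    // Game and UpdateFrame are hypothetical stand-ins.
    static void RunBenchmark(Game game)
    {
        const int frameCount = 100000;
        const float frameStep = 1f / 30f;   // simulate at the 30 FPS target

        var timer = System.Diagnostics.Stopwatch.StartNew();
        for (int i = 0; i < frameCount; i++)
        {
            game.UpdateFrame(frameStep);    // the normal per-frame calculations
            // Rendering is deliberately skipped: no OpenGL API calls here.
        }
        timer.Stop();

        double seconds = timer.Elapsed.TotalSeconds;
        Console.WriteLine("{0} frames in {1:F2}s = {2:F2} FPS",
            frameCount, seconds, frameCount / seconds);
    }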

Sample Size

I should mention that all of this optimization work was done on my PC in Visual Studio. Since my goal is to optimize my algorithms, I'm working on the assumption that the improvements will also be reflected in code when running on an actual iPhone.

When I started testing this benchmark mode I tried running for a few thousand frames. I got results around 1200 FPS. I initially assumed this number of frames was enough to produce useful results and started the process of running under various conditions both for generating performance statistics to measure improvements and for running within VS' performance analysis tools to see what areas of the code were slowest and the best target for optimization efforts.

The problem I found was that the performance differed wildly from run to run. I ended up wasting quite a lot of time doing analysis with these too-short runs that produced inconsistent results. I finally concluded that the only real way to get consistent results was to simply increase the number of frames rendered, and therefore the test time. Eventually I settled on running for 100,000 frames, which takes roughly a minute. At this length I found that I could get multiple runs to finish within a few milliseconds of one another, which, over a minute-long run, is quite consistent.

First Optimization - Math

Using this sample size, my baseline analysis produced a run of 78.05 seconds at 1281.23 FPS. I then ran this through the VS performance analysis tools. This revealed a lot of information, but I decided to start with one particular method I thought I could quickly optimize.

I have an object called Shape. This represents a set of vertices that are connected to define a 2D shape. So far I'm using it only for triangles and quads, but it is flexible up to any number of points. This object also supports transform operations like moving it in the 2D plane, resizing it, and rotating it (both around its own center point and around some other point). The way I've implemented these transform operations is to record, at the time they're called, what the operations are, but not to actually do the transform calculations until the result is needed. This optimizes the case where multiple transforms occur at the same time, such as an offset, resize, and rotation.

When the final transformed shape is needed, a method called MakeClean performs the calculations if anything has changed. In my first performance analysis I found that this method was taking 20.8% of total time.
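To make the pattern concrete, here is a simplified sketch of the idea, not my actual Shape code: each transform call just records its parameters and sets a dirty flag, and MakeClean does the work. Vector2 here is OpenTK's struct; everything else is illustrative.

    // Simplified sketch of the deferred-transform pattern in Shape.
    // Transform calls only record state; MakeClean applies it all at once.
    class Shape
    {
        Vector2[] points;        // the untransformed vertices
        Vector2[] transformed;   // cached result of the last MakeClean
        Vector2 offset = Vector2.Zero;
        float scale = 1f;
        float rotation = 0f;     // radians; around the model origin, for brevity
        bool dirty = true;

        public Shape(Vector2[] points)
        {
            this.points = points;
            this.transformed = new Vector2[points.Length];
        }

        public void Offset(Vector2 delta) { offset += delta;  dirty = true; }
        public void Resize(float factor)  { scale *= factor;  dirty = true; }
        public void Rotate(float angle)   { rotation += angle; dirty = true; }

        // The naive version: sin/cos recomputed and temporary vectors
        // created on every iteration, the kind of thing the profiler flagged.
        public void MakeClean()
        {
            if (!dirty) return;
            for (int i = 0; i < points.Length; i++)
            {
                Vector2 p = points[i] * scale;
                float sin = (float)Math.Sin(rotation);
                float cos = (float)Math.Cos(rotation);
                transformed[i] = new Vector2(
                    p.X * cos - p.Y * sin,
                    p.X * sin + p.Y * cos) + offset;
            }
            dirty = false;
        }
    }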

I rewrote this to remove temporary object creation (actually Vector2 is a struct, not an object, but same idea), reduce repeated property accesses, combine multiple calculations, and so on. This reduced the method to 9.8% of total time.
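Again as an illustrative sketch rather than my exact code, the rewritten method hoists the sin/cos out of the loop, folds the scale factor into those terms, and writes each result in place without temporaries:

    // Optimized MakeClean sketch: trigonometry computed once, scale folded
    // into the sin/cos factors, fields read once, no per-point temporaries.
    public void MakeClean()
    {
        if (!dirty) return;

        float sinScale = (float)Math.Sin(rotation) * scale;
        float cosScale = (float)Math.Cos(rotation) * scale;
        float offsetX = offset.X;   // read the offset fields once
        float offsetY = offset.Y;

        for (int i = 0; i < points.Length; i++)
        {
            float x = points[i].X;
            float y = points[i].Y;
            transformed[i].X = x * cosScale - y * sinScale + offsetX;
            transformed[i].Y = x * sinScale + y * cosScale + offsetY;
        }
        dirty = false;
    }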

The result of this was the 100,000 frames finished in 68.9 seconds at 1451.38 FPS. That's 88% of the original time, and 113% of the original FPS.

Second Optimization - LINQ

After the above optimization I did another performance analysis in VS. This time the summary report showed a LINQ method high on the list.

I was not surprised to see the LINQ entry there since when I was originally writing that area of the code I knew it wasn't particularly efficient. At the time, however, I wrote it as simply as possible and decided I would optimize it if necessary later. Well, now it's later and time to optimize.

For this area, however, I can't do any simple localized calculation optimizations. That LINQ code is all in the .NET Framework and out of my control, and in the VS performance analyzer, if I dig into that method, no source code is shown at all, since it is part of the core libraries. In fact, determining where this code is called from is slightly complicated, because VS shows it only indirectly: it reports the nearest caller in my own code, and in that caller there's no ConcatIterator to be seen anywhere. Fortunately, it isn't too hard to guess that it comes from somewhere in ElementsToRender, which was originally a long chain of Concat and Cast calls.
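Reconstructed in sketch form (the element collection names here are stand-ins, not my actual lists), it looked something like this:

    // Reconstruction of the original ElementsToRender, with hypothetical
    // collection names. Each subclass list has to be Cast up to GameElement
    // before it can be Concat'ed, since .NET 3.5 has no covariance.
    public IEnumerable<GameElement> ElementsToRender
    {
        get
        {
            // TODO: this is slow, optimize it later
            return this.Blocks.Cast<GameElement>()
                .Concat(this.Balls.Cast<GameElement>())
                .Concat(this.Walls.Cast<GameElement>())
                .Concat(this.Effects.Cast<GameElement>());
        }
    }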

Hah, see, I told you I wasn't surprised to see this taking a lot of time. I even wrote a comment to myself about it. One thing the VS performance analysis doesn't show very well is why this is slow. Although the code clearly has a lot of strung-together Concat calls, I suspect that the embedded Cast operations are also quite a performance hit.

The reason my original code was written this way is that I'm developing against .NET 3.5, which does not support covariance and contravariance and therefore doesn't allow a collection of a subclass to be treated as a collection of its common superclass. This is supported in .NET 4.0, however.
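For example (with Ball and GetBalls as hypothetical stand-ins), the direct assignment below doesn't compile under .NET 3.5, which is what forced all of those Cast calls:

    List<Ball> balls = GetBalls();   // Ball derives from GameElement

    // Fails to compile in C# 3 / .NET 3.5: IEnumerable<T> is not covariant.
    // IEnumerable<GameElement> elements = balls;

    // In .NET 4.0, IEnumerable<out T> makes that assignment legal. Under
    // 3.5 the workaround is an explicit (and costly) Cast:
    IEnumerable<GameElement> elements = balls.Cast<GameElement>();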

My solution to this was to create simple collection classes that hold my GameElement objects and can provide those objects efficiently both in their subclass-typed form and in their superclass form. The basic idea is to have the collections detect when they are modified (added to, removed from, etc.) and mark themselves as dirty in that case. Then, when I request the collection in its superclass form, that list is regenerated and cached only if necessary. This allows me to reduce the above code to little more than handing back the cached lists.
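In sketch form, with hypothetical names once more, the collection and the reduced property look roughly like this:

    // Sketch of a caching collection. Modifications set a dirty flag;
    // the superclass-typed view is rebuilt only when next requested.
    class ElementCollection<T> where T : GameElement
    {
        readonly List<T> items = new List<T>();
        List<GameElement> asElements;   // cached superclass view
        bool dirty = true;

        public void Add(T item)    { items.Add(item);    dirty = true; }
        public void Remove(T item) { items.Remove(item); dirty = true; }

        public List<T> Items { get { return items; } }

        public List<GameElement> AsGameElements()
        {
            if (dirty)
            {
                asElements = new List<GameElement>(items.Count);
                foreach (T item in items)
                    asElements.Add(item);
                dirty = false;
            }
            return asElements;
        }
    }

    // ElementsToRender then reduces to returning the cached views.
    public IEnumerable<List<GameElement>> ElementsToRender
    {
        get
        {
            yield return this.Blocks.AsGameElements();
            yield return this.Balls.AsGameElements();
            yield return this.Walls.AsGameElements();
            yield return this.Effects.AsGameElements();
        }
    }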

At a different area in the code I do a SelectMany to combine all of these collections together, as sketched below. The result of this was that the 100,000 frames finished in 55.59 seconds at 1799.04 FPS. That's 81% of the previous time, and 124% of the previous FPS.
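That combining step amounts to something like this (again with hypothetical names; SelectMany needs a using System.Linq directive):

    // Flatten the per-type lists into one sequence of GameElements.
    foreach (GameElement element in game.ElementsToRender.SelectMany(list => list))
    {
        // ... per-element work here ...
    }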

Third Optimization

I also spent a bit of time on a third optimization. The end result of that was no performance improvement at all; in fact, it was 3% slower than the previous version. I won't describe this optimization in detail now, but I'm actually keeping the slower version. The reason is that I think it lays a good foundation for possibly making a more significant optimization in the future, and in some ways it is cleaner than the earlier version. If I do that optimization later, I'll describe this one in more detail.

Summary

The end result of all of the above was that my original performance of 78.05 seconds at 1281.23 FPS was improved to 57.08 seconds at 1752.04 FPS. This is 73% of the original time and 137% of the original FPS.

All in all I'm pretty happy with that result. I'm now able to continue feature development and will hopefully be able to blog more regularly again.
