Project update 5 Conclusion



After many weeks of testing and reviewing the source code, a small improvement has been made.
The previous tests revealed that the hot function was butteraugli::convolution, and that most of the work in the entire application was performed in butteraugli.cc.

I will first cover some of my failed attempts and then move on to explain the actual improvement made.

One of the first things I tried, before even touching the code, was compiler optimizations. The package already came set to the O3 level. I played around with all the settings available in both GCC and Visual Studio, but anything I did on that front seemed to be notably worse, so I quickly abandoned those attempts. It was quite clear someone had already worked out the best configuration for this.

The next thing I tried was modifying the algorithm within the convolution function. I wanted to convert the 4 separate loops into a single pass over the data passed in. This ended horribly, as it became clear that my understanding of the code I was looking at was incorrect. Performance degraded and I started noticing issues with the output, so I rolled back those changes right away.

Implementing multithreading for the loop-intensive code is not something I considered attempting because, from my understanding of the code, each loop iteration depends on the previous iteration, which makes multithreading difficult to implement.
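To illustrate the kind of dependency meant here, the following is a minimal sketch and not code from butteraugli.cc. The function names RunningSum and Scale are hypothetical. The first loop carries a value from one iteration to the next, so its iterations cannot simply be handed to different threads. The second loop has independent iterations and could be split up.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: each output depends on the previous iteration's
// accumulator, so iteration i cannot start before iteration i - 1 is done.
void RunningSum(const std::vector<float>& in, std::vector<float>& out) {
  assert(in.size() == out.size());
  float acc = 0.0f;
  for (std::size_t i = 0; i < in.size(); ++i) {
    acc += in[i];  // loop-carried dependency on acc
    out[i] = acc;
  }
}

// Hypothetical contrast: every iteration only reads in[i] and writes out[i],
// so the iterations could run in any order or on different threads.
void Scale(const std::vector<float>& in, float factor, std::vector<float>& out) {
  assert(in.size() == out.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = in[i] * factor;
  }
}
```

Whether the loops in butteraugli.cc really look like the first case is my reading of the code and was not verified further.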

I also noticed that many loops were converting floats to ints or vice versa, so I tried switching variable types around so that no casting or conversion was required. However, this was either not significant or the compiler already had it handled. The only real reason I even attempted it was that the code was in a hot loop.
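As a rough illustration of this kind of change, here is a sketch with hypothetical function names. It is not taken from butteraugli.cc. The first version converts the int loop counter to float on every iteration. The second keeps a float counter alongside the index so no conversion is needed inside the loop.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical "before": an int -> float conversion on every iteration.
float WeightedSumCast(const std::vector<float>& v) {
  assert(v.size() < (1u << 24));  // indices stay exactly representable as float
  float sum = 0.0f;
  for (int i = 0; i < static_cast<int>(v.size()); ++i) {
    sum += v[i] * static_cast<float>(i);
  }
  return sum;
}

// Hypothetical "after": a float counter is advanced in step with the index,
// so the loop body contains no int/float conversion.
float WeightedSumNoCast(const std::vector<float>& v) {
  assert(v.size() < (1u << 24));  // same exactness limit as above
  float sum = 0.0f;
  float fi = 0.0f;
  for (std::size_t i = 0; i < v.size(); ++i, fi += 1.0f) {
    sum += v[i] * fi;
  }
  return sum;
}
```

An optimizing compiler may well generate similar machine code for both versions, which would match the lack of any measurable difference described above.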



Now to the attempt that actually yielded some results. The next and final thing I did was enable verbose output for GCC's tree vectorizer (-ftree-vectorize) so it would tell me which loops had not been vectorized.
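For anyone wanting to try the same thing, here is a small sketch. The exact flag used for this post is not recorded above, so the build command in the comment is an assumption. In current GCC versions, -fopt-info-vec-missed reports loops that were not vectorized and -fopt-info-vec-optimized reports the ones that were. The function SumFloats and the file name are hypothetical.

```cpp
// Example build command (an assumption, not the project's own build settings):
//   g++ -O3 -fopt-info-vec-missed -c sum_example.cc
// GCC then prints a note for each loop it decided not to vectorize.
#include <cassert>
#include <cstddef>

// Hypothetical test loop. A float reduction like this is a common candidate
// for a "missed" note, because vectorizing it would reorder the additions.
float SumFloats(const float* data, std::size_t n) {
  assert(data != nullptr || n == 0);
  float total = 0.0f;
  for (std::size_t i = 0; i < n; ++i) {
    total += data[i];
  }
  return total;
}
```

The wording of the report differs between GCC versions, so the notes are best read as hints about where to look rather than exact diagnoses.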

At that point I was presented with a very long list of potential optimizations, including loops in the functions I had identified as hot. However, I had difficulty getting those loops to vectorize, since they required algorithm changes I no longer had the time to experiment with. I decided to move on to other areas, with the rule that I would not work on code outside of butteraugli.cc, as I did not think it would be worth the time to deviate any further from where most of the CPU time was being spent.

The changes I ended up making were not special or significant. In some cases it was just setting up a local variable for a parameter instead of using the parameter directly. The task was mainly tedious, since I had to make sure the correct names were being used everywhere. A Pastebin link to the code I modified is at the end of this post. In total I vectorized about 6 loops that previously were not.
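A minimal sketch of the "local variable for a parameter" idea follows. It is not the actual code from butteraugli.cc, and the names ApplyGainByRef and ApplyGainLocal are hypothetical. When a value is read through a reference inside a loop that also writes to memory, the compiler has to allow for the reference pointing into that memory. Copying the value into a local first removes that possibility.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical "before": gain is read through a reference. It could refer to
// an element of data[], so a store to data[i] might change it.
void ApplyGainByRef(float* data, std::size_t n, const float& gain) {
  assert(data != nullptr || n == 0);
  for (std::size_t i = 0; i < n; ++i) {
    data[i] *= gain;
  }
}

// Hypothetical "after": the value is copied once into a local, which stores
// to data[] cannot modify, so the loop body is simpler to vectorize.
void ApplyGainLocal(float* data, std::size_t n, const float& gain) {
  assert(data != nullptr || n == 0);
  const float g = gain;
  for (std::size_t i = 0; i < n; ++i) {
    data[i] *= g;
  }
}
```

The two versions only behave identically when gain does not refer to an element of data, which is exactly why the compiler cannot always make this change on its own. Some compilers handle the first version with a runtime overlap check instead, so the benefit depends on the compiler and the loop.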

Now for the results. The relative time between the different parts of the program did not seem to change at all, but the total time did improve. To test this accurately I needed to set up another environment on one of my own computers, so that I could make sure other programs and processes had minimal impact on the numbers. I had previously discovered that exact performance varies from run to run, usually by 1 to 2 seconds.

That being said, I did not feel the need to run an insane number of tests, so I only ran 4 tests per build: 4 for my modified version and 4 for the original build. All the tests were run in the same session, in sequence: first the old build and then the new one.
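The times in this post are for the whole program from start to finish. For timing a piece of work from inside a C++ program instead, a repeated-run harness could look like the sketch below. The function name TimeRuns is hypothetical and this is not the setup used for the numbers in this post.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <vector>

// Hypothetical helper: runs `work` the requested number of times and returns
// the wall-clock duration of each run in seconds.
std::vector<double> TimeRuns(const std::function<void()>& work, int runs) {
  assert(runs > 0);
  std::vector<double> seconds;
  seconds.reserve(static_cast<std::size_t>(runs));
  for (int i = 0; i < runs; ++i) {
    // steady_clock is monotonic, so system clock adjustments do not skew it.
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    seconds.push_back(std::chrono::duration<double>(stop - start).count());
  }
  return seconds;
}
```

Keeping the individual run times, rather than only a total, makes it possible to look at the spread between runs afterwards.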

The test results are the following. All times are the seconds taken for the program to finish.



Original build




run 1: 15.314
run 2: 14.708
run 3: 14.744
run 4: 14.942

Total time taken: 59.708
Average time taken: 14.927


New build






run 1: 14.970
run 2: 14.770
run 3: 15.480
run 4: 14.696


Total time taken: 58.787
Average time taken: 14.696


So, going by the totals above (58.787 against 59.708 seconds), the new build is about 1.5% faster. Was this gain worthwhile for the small amount of effort required to implement it once the opportunity was found? I think so. If I had started going down this route sooner I think the result would be even better, as there are still many loops I did not attempt to vectorize that could potentially be improved. However, it should be noted that for the majority of the loops the vectorizer highlights, it won't be possible to get them vectorized, for simple reasons like if statements being unavoidable to achieve the correct behavior. In addition, there are other opportunities still available to pursue, such as bugs within the code that could be costing performance. I won't talk too much about those here, as they are already documented within the code.

I only have one concern with the optimization: my new build had the lowest low but also the highest high, which may indicate that it has lost some performance stability. I investigated this further using my profiling data and found (see the charts above) that the CPU usage went up and down a lot more over the lifetime of the program than before.
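The "lowest low, highest high" observation can be put into numbers by looking at the range of each set of runs. The following is a small sketch. The names Spread and MeasureSpread are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical summary of how far apart the run times of one build are.
struct Spread {
  double fastest;
  double slowest;
  double range;  // slowest - fastest
};

Spread MeasureSpread(const std::vector<double>& runs) {
  assert(!runs.empty());
  // minmax_element returns iterators to the smallest and largest element.
  const auto mm = std::minmax_element(runs.begin(), runs.end());
  return Spread{*mm.first, *mm.second, *mm.second - *mm.first};
}
```

Fed with the run times listed above, this gives a range of about 0.606 seconds for the original build and about 0.784 seconds for the new build. With only 4 runs each, that difference is a hint and not proof of reduced stability.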






I think the reason for this might be that I optimized some parts of the code while leaving others untouched, so I don't see it as a very big issue. It might be even less of an issue if I had more time to find and perform the same optimization in every possible location in that file.


The following is a Pastebin link to the code I modified. Replacing this one file in the solution Google provided should allow you to reproduce what I have done.

Original GitHub link
https://github.com/google/guetzli


Link to modified code
https://pastebin.com/8AwmmhiW




























 
