Monday, December 12, 2016

The Inherent Problems in Speed Test Videos and Their Validity as Performance Benchmarks

Clarkson, Hammond, and May line up, each of them driving one of the world's fastest "hypercars." Millions of dollars of metal, aluminum and carbon fiber wait to barrel down the track, each car vying to take the throne as the fastest of the litter.

If you aren't familiar with the scene described above, it took place in the first episode of the new series from the Top Gear trio, "The Grand Tour". If you haven't seen it yet, you will soon see how it relates to this article, and if you have, you probably already see the connection.


When users want to measure how their device performs, even if generally and unscientifically, they typically download a popular benchmark application like Geekbench 4 or PCMark. These applications stress the device using a preset method and then spit out a result, usually a number, that you can use to assert your phone's prowess in relation to other devices on the market.

If you are a particularly savvy user, you might use an application like DiscoMark to test the opening and closing speeds of applications, or, if you are interested in storage speeds, run a storage benchmark. One thing remains consistent though, and that is the objectivity of the test being run. Geekbench 4 will run my Pixel through the same tests as my LG V20. As long as I am running the same version of the application, these tests are mostly uniform, and the same goes for PCMark and the storage benchmark. There are background services sucking up processing power and other variables to control if one is to get a truly accurate result, but for the most part, the number will give you a general idea of your device's "tier." However, DiscoMark and other real-world benchmarks can throw quite a curveball into the mix, which is why DiscoMark scores need to be heavily vetted for accuracy and are not readily used; in fact, they shouldn't be used without carefully controlling for the confounding variables that can throw off the results.

Finally, these applications have been designed to do just this: benchmark the overall or component-specific performance of your device. One could argue that these tests don't reflect actual real-world performance, that they are more like a 0-60 time done at sea level in controlled conditions by an experienced driver than a real-world test. Those arguments do have some validity to them, and they speak to the rise in popularity of "speed test" videos, in which a reviewer pits device A against device B and throws at them a bevy of real-world applications opened and closed in quick succession. But is there any sort of validity to these tests? And what substantial knowledge, if any, can be gained from them?


Variance is the Spice of Life

As we mentioned above, applications like Geekbench 4 run their tests in a specific testing environment. They are self-contained (although some draw on system resources), and thus are less likely to be directly affected by the environment (although the hardware still has other tasks to deal with while running the benchmark). In contrast, a speed test is open to a host of variables like touch response times, background processes, the amount of user data synced, which side of Google's beloved A/B testing a phone happens to be on, the application state, the unavoidable human error… I could go on, but the point stands: with speed tests, there are more variables that can, and do, affect the outcome.

For example, I did some very unscientific testing of my own between an LG V20 and a Google Pixel (5"), not unlike your typical speed test video, which shows one sample of the differences you can observe under various setups and starting states. Note that the V20 and Pixel tests were done with roughly the same user data installed, as both were running my daily driver account with my typical setup of 150+ applications installed and signed in.

Test (times in seconds)               LG V20    Google Pixel    LG V20 (Clean Setup)
Clear All                             38.79     27.62           31.86
Reboot                                47.70     33.71           36.34
Reboot + Clear All                    33.49     30.42           28.22
Cache Cleared + Reboot                32.67     36.40           30.25
Cache Cleared + Reboot + Clear All    26.91     25.93           24.78

All in all, there is a gap of roughly 10 to 21 seconds between the fastest and slowest times of each phone. Beyond that, there are a few other things to be gleaned from these results. The first is that you will not always find your fastest results right after a reboot; instead, the fastest times came after the cache was cleared, the phone was rebooted, and the "clear all" button in recents was used to kill everything running. If I were to put out a video of only the first or only the last of those tests, with no context or details about the test environment and starting conditions, this is probably what the comments section would look like:

"TL;DW – The Pixel beat the V20 by over 10 seconds in a 40 second test. LG needs to debloat and switch to NVME."

Or

"TL;DW – The Pixel and its "optimization" only beat the V20 by 1 second even with its bloat. Why do they even charge so much?"

The crazy thing is that both of those TL;DWs would be correct depending on which test I decided to show, whether to fit a narrative or simply because I did not do enough testing, control enough variables, or make sure that the starting conditions were as similar as possible. While both phones showed relatively similar levels of improvement through the tests, it is easy to see how these results could be taken out of context, and proper context is something few, if any, speed testers actually provide.
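To make that arithmetic concrete, here is a minimal Python sketch that recomputes both comparisons from the table above. The numbers are copied from my results; the `spread` and `gap` helper names are just illustrative, not part of any tool we use.

```python
# Times in seconds, copied from the results table above.
TIMES = {
    "Clear All": {"LG V20": 38.79, "Google Pixel": 27.62, "LG V20 (Clean Setup)": 31.86},
    "Reboot": {"LG V20": 47.70, "Google Pixel": 33.71, "LG V20 (Clean Setup)": 36.34},
    "Reboot + Clear All": {"LG V20": 33.49, "Google Pixel": 30.42, "LG V20 (Clean Setup)": 28.22},
    "Cache Cleared + Reboot": {"LG V20": 32.67, "Google Pixel": 36.40, "LG V20 (Clean Setup)": 30.25},
    "Cache Cleared + Reboot + Clear All": {"LG V20": 26.91, "Google Pixel": 25.93, "LG V20 (Clean Setup)": 24.78},
}


def spread(times, phone):
    """Difference between the slowest and fastest run of one phone."""
    values = [row[phone] for row in times.values()]
    return round(max(values) - min(values), 2)


def gap(times, test, phone_a, phone_b):
    """How much slower phone_a was than phone_b in one test (negative: phone_a was faster)."""
    return round(times[test][phone_a] - times[test][phone_b], 2)


if __name__ == "__main__":
    for phone in ("LG V20", "Google Pixel", "LG V20 (Clean Setup)"):
        print(f"{phone}: {spread(TIMES, phone)} s between fastest and slowest run")
    for test in TIMES:
        print(f"{test}: V20 minus Pixel = {gap(TIMES, test, 'LG V20', 'Google Pixel')} s")
```

The same table gives the Pixel an 11.17-second lead in one row and a 0.98-second lead in another, and in the "Cache Cleared + Reboot" row the V20 actually comes out ahead. Which row ends up in the video decides the headline.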

Further, why did I choose those applications? What if I used Facebook Messenger instead of Hangouts, Spotify instead of Google Play Music, or GTA III instead of Mikey Shorts? Could the results have been different? Would I suddenly hate my Pixel because the V20 may beat it in that scenario, instead of this one? As I said earlier, there are too many variables involved to reach a final verdict when things as simple as clearing the cache and rebooting can completely change the outcome.


Extraneous Factors & Common Mistakes

When we do our benchmarks and gather app opening speed data, we make sure to strip the phone clean of the elements we know are likely to cause interference, such as bloatware. Some phones are understandably worse than others when gathering data; for example, it was really hard for us to find reproducible DiscoMark results on the Galaxy Note 7, whereas the numbers we got for the OnePlus 3 were extremely consistent across factory resets, multiple devices, and even on different Google accounts, given the same initial conditions. With Samsung's 2016 phones, we found insane variance not just in app opening speeds, but also regular benchmarks — more than on other devices like the Pixel XL. We understand the importance of gathering data that's reproducible and as consistent as possible, even if this is often an unattainable goal when shooting for perfection. We often disclose the conditions of our tests so that the reader can get a feel for what we did to get our numbers: we disclose initial temperature, whether bloatware was removed, that we ran them after a factory reset, on the same Google account, WiFi network and surface, etc. Even then, it's hard to be fully satisfied with the testing environment, as there will always be some degree of variance.

Over the last year, we saw speed test videos becoming even more prominent than before as the number of tech YouTubers grew. They are an easy vehicle for quasi-technical insight, but many people running the speed tests show little understanding of the factors at play. For instance, most focus on the processor as the main contributor to the speed differences, when the storage is arguably a much bigger (if not the main) factor when it comes to app opening speeds (especially since many Snapdragon 820/821 phones max out all clockspeeds on all cores specifically when opening applications). This isn't to mention the misconceptions brought by the Snapdragon 821 revision. Another important aspect is the filesystem employed, something we found to make a significant difference in particular when opening heavy games, which are usually one of the biggest deciding factors of most speed test videos. For the "RAM round", the memory management solution employed is also really important: for example, you might recall how the Note 5 and OnePlus 3 were initially blasted for being terrible at this part of speed tests; at XDA, we focused on finding the root cause and a simple fix dramatically improved the situation. After OnePlus improved memory management in a following patch, the OnePlus 3 became one of the top performers in these speed tests. Out-of-memory / Low-Memory-Killer values, as well as ceilings to the amount of background apps, are things to take into account when evaluating memory management capabilities, and these can change from one build to the next. Thus, it's of great value to be able to spot whether the issue holding back performance can be addressed through a software fix, or a user modification.

Which leads us to software updates: it's very important to disclose the OS version that the tests are running on, because these can bring dramatic performance changes and essentially redefine the entire result ladder. You might recall, for example, how the Nexus 6 originally had real-world performance constraints due to forced encryption, which was promptly disabled not long after. The Nexus 6 also received a significant kernel patch early into its life that altered performance by leveraging its four cores better. Another example is that a tester running a community build of the OnePlus 3's software would unknowingly have an advantage over the default software branch should he be testing the phone with the F2FS improvements on board. So on and so forth.

Then there is the question of the applications themselves; ideally, neither phone should feature OEM applications not available on the other device, as these are coded and optimized differently, possibly yielding significant differences. There are also included services (like on Samsung devices) that run in the background for no apparent reason and can also contribute to speed differences, so minimizing background interference should be a priority for a sound test environment. While it can be argued that these should be left alone to mimic a real-world environment, I'd say it's too volatile to control for. One example we noted in a review was how Samsung's Text-to-Speech engine was somehow taking up to 12% of the CPU load while playing Asphalt 8, a completely unrelated task. Apps syncing can have a dramatic effect on the resulting performance as well; a simple way to assess whether the device is ready for testing is to look at the CPU clockspeeds when idling on the homescreen, to make sure there is nothing unduly influencing the test environment.
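As a rough illustration of that idle check, the sketch below reads the per-core frequencies that the Linux cpufreq interface exposes under sysfs and compares them against each core's minimum. It is not the tooling we use: it assumes a Python interpreter is available on the device (for example through a terminal app) and that the files are readable, which some builds restrict. The 1.25 tolerance is an arbitrary illustrative threshold, and the function names are hypothetical.

```python
from pathlib import Path


def read_cpu_freqs(root="/sys/devices/system/cpu"):
    """Return {core_name: (current_khz, min_khz)} for every core exposing cpufreq files."""
    freqs = {}
    for cur_file in sorted(Path(root).glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        min_file = cur_file.with_name("cpuinfo_min_freq")
        if not min_file.exists():
            continue  # skip cores that don't report a minimum frequency
        core = cur_file.parent.parent.name  # e.g. "cpu0"
        freqs[core] = (int(cur_file.read_text()), int(min_file.read_text()))
    return freqs


def looks_idle(freqs, tolerance=1.25):
    """True if every core sits at or near its minimum frequency.

    The tolerance is an assumed threshold, not a measured one.
    An empty reading counts as "not idle" because nothing could be verified.
    """
    return bool(freqs) and all(cur <= low * tolerance for cur, low in freqs.values())


if __name__ == "__main__":
    readings = read_cpu_freqs()
    for core, (cur, low) in readings.items():
        print(f"{core}: {cur} kHz (minimum {low} kHz)")
    print("Ready for testing" if looks_idle(readings) else "Something is still keeping the CPU busy")
```

A single reading can catch a momentary spike, so in practice you would want to sample a few times over several seconds before starting a run.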

Then there is human error: it's rare to find a tester with both perfect vision and iron reflexes, so it's hard to really know whether there is a tacit delay affecting the results; when the test stretches for minutes on end, these little errors can add up. A small but significant factor is that touch latency varies across devices, and even across software versions, something which isn't taken into account across the string of tests. Also, ROMs can ramp up the CPU frequency for the next activity after a screen tap, introducing behavior that favors some tests over others (those run in close succession). Even worse, I've seen some speed tests effectively butcher the results due to their system settings: for example, a popular Latin American YouTuber recently had to redo an entire speed test as he had not noticed that the device's home button was set to open the camera upon a second tap; the minor delay while waiting for that second input on every home button press added up to several seconds.

These are only some of the issues we have with these tests, specifically the mistakes we've seen manifested in videos over and over. Consider that most of these problems likely show up in every video in some form or another, although not always terribly so. These issues can add up unpredictably, and without specific comments on the tester's methodology they are often enough to tarnish the legitimacy of the results. It's not rare, for example, to find users in comment sections complaining that their device does not behave that way, sometimes even providing concrete examples. And it's very hard to judge the behavior of these devices from the single sample that the videomaker decided to show the public; we can't even know whether the creator bothered to test the devices multiple times before recording, to make sure that the results were consistent and reproducible. When we gather our data, we make sure to get at least 100 data points per application (it sounds like more work than it is, but remember this is mostly automated). Now imagine if we were to show you a single result at random instead of a boxplot displaying the interquartile ranges and median. It's really impossible for the viewer to assess whether the results shown in the video are mere outliers.
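For readers unfamiliar with what a boxplot summarizes, here is a minimal sketch of the numbers behind one, using only Python's standard library. The launch times are randomly generated placeholders rather than real measurements, and the 1.5 × IQR outlier rule is the common boxplot convention, not necessarily the exact one applied to our data.

```python
import random
import statistics


def summarize(samples):
    """Median and quartiles of a list of timings (the numbers a boxplot draws)."""
    q1, _, q3 = statistics.quantiles(samples, n=4)
    return {"q1": q1, "median": statistics.median(samples), "q3": q3, "iqr": q3 - q1}


def is_outlier(value, summary, k=1.5):
    """Flag a value lying more than k * IQR outside the quartiles."""
    low = summary["q1"] - k * summary["iqr"]
    high = summary["q3"] + k * summary["iqr"]
    return value < low or value > high


if __name__ == "__main__":
    # Hypothetical data: 100 app launch times in seconds, NOT real measurements.
    rng = random.Random(0)
    launches = [rng.gauss(1.2, 0.15) for _ in range(100)]

    summary = summarize(launches)
    print(f"median {summary['median']:.2f} s, IQR {summary['iqr']:.2f} s")

    # A speed test video effectively shows you one draw like this:
    single_run = rng.choice(launches)
    print(f"single run: {single_run:.2f} s, outlier: {is_outlier(single_run, summary)}")
```

With the full set you can say whether a given run is typical. With only the single run, neither the tester nor the viewer can.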


Quite Entertaining, Not Quite Insightful

So what data can be gleaned from these tests and this discussion? Are we saying that you should not bother viewing these speed test videos? That's not what this article is about at all, but we need to realize that these tests are highly volatile. As mentioned earlier, even if you use the "same apps" in the comparison, things like A/B testing on a developer's side and using the built-in, device-specific applications such as the calendar, dialer, clock and camera impact results. Software update improvements also weigh heavily on results, as OEMs can patch early bugs or make major revisions in early beta and community builds. Further, as we demonstrated, there is a tangible difference in performance depending on whether you rebooted the device, cleared its cache, or had it set up as a personal device with apps and data loaded rather than merely as a test unit. Finally, there is the human error element. As fast as we may be, the delay between seeing an app load and tapping the key changes from run to run and from device to device, and human error of some sort can be seen in almost every speed test video. All of those things can impact the final results of these speed tests more than the actual hardware, not to mention cross-platform tests, where these variables increase substantially.

Clarkson, Hammond and May ran the test we spoke about in the intro time and time again, each time with a different result. May failed the launch control on the LaFerrari a few times, Clarkson forgot to engage his wing and Hammond miscounted his 3 seconds. So what is the lesson? Like speed tests, this drag race was not a reliable method of determining which car was faster, as there were too many variables involved. Instead, "The Grand Tour" settled the comparison with an all-out race around the track, using multiple laps to find the quickest time and a single driver to compare the cars on equal terms, and thus the most controlled environment they could have. The final winner was decided by less than a tenth of a second, far closer than any of the drag races and with a different outcome than most of them.

But just because the drag race had no bearing on deciding which car was faster doesn't mean it was any less fun to watch. It was just that: fun to watch, but with little reliable data one could draw a conclusion from. The same is true of speed tests. Yes, they can be entertaining to watch, and some data can be gleaned from them in very general terms, more so when the deltas are large, but largely they are too tied to the reviewer's decisions and the settings chosen prior to and during the test to support any conclusive comparisons… with the main issue being that a large chunk of context is often missing. However, we really appreciate the medium's ability to shine a light on certain performance issues, namely the memory management problems we mentioned above. Such "internet drama" can prompt OEMs to act and fix these issues sooner rather than later, which ultimately benefits us all. And we must add that we do recognize that some testers clearly do a better job than others.

Finally, maybe that isn't the only lesson to be had from our friends at The Grand Tour. The fact that they ran multiple drag races, each with a different outcome, and that the final result was then decided by such a close margin may teach us that many of today's flagships are within a margin of error of each other in real-world use. It also suggests that regardless of which device you purchase, the applications you use and the decisions you make likely have a larger impact than the hardware alone.



from xda-developers http://ift.tt/2hqz3E7
via IFTTT
