First off - What is a follow-up experiment?
Google says:
"
When the results of an experiment suggest a winning combination, you can choose to stop that experiment and run another where the only two combinations are the original and the winning combination. The winning combination will get most of the traffic while the original gets the remaining. This way, you can effectively install the winner and check to see how it performs against the original to verify your previous results."And why should I run a follow-up experiment?
"Running a follow-up experiment will give you two benefits. First, it will enable you to verify the results of your original experiment by running a winning combination alongside the original. Second, it will maximize conversions, by delivering the winning combination to the majority of your users. We encourage you to run follow-up experiments to get the best, most confident results for any changes you make to your site."
But what happens when a follow-up experiment delivers contradictory results?
The screenshot below shows the original MVT test results....
I commenced a follow-up test running the winning variant from this test in a head-to-head with the original default. And this is what happened...
The blue line is the original design beating the winning variant from the first test. This has happened time & again with my follow-up experiments. Then I noticed something. When you set up a follow-up experiment it's easy to overlook the weightings setting, i.e. the 'choose the percentage of visitors that will see your selected combination' option. By default it's set to 95% for your selected combination.
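One side effect of that 95/5 default is worth spelling out, and a rough sketch of my own (nothing to do with Google's tool, and the 5% true conversion rate and 10,000 visitors are made-up numbers) illustrates it: the original only gets a sliver of the traffic, so its measured conversion rate is based on far fewer visitors and bounces around much more from run to run than it would under a 50/50 split. That noise alone can make a follow-up look like it contradicts the first test.

```python
# Rough simulation: how much the measured conversion rates wobble under a
# 95/5 split versus a 50/50 split, when both combinations truly convert
# at the same rate. Assumed numbers only.
import random

TRUE_RATE = 0.05          # assume both combinations truly convert at 5%
TOTAL_VISITORS = 10_000   # assumed traffic in the follow-up experiment

def simulate(winner_share):
    """Return measured conversion rates (winner, original) for one run."""
    winner_n = int(TOTAL_VISITORS * winner_share)
    original_n = TOTAL_VISITORS - winner_n
    winner_conv = sum(random.random() < TRUE_RATE for _ in range(winner_n))
    original_conv = sum(random.random() < TRUE_RATE for _ in range(original_n))
    return winner_conv / winner_n, original_conv / original_n

random.seed(1)
for share in (0.95, 0.50):
    runs = [simulate(share) for _ in range(5)]
    print(f"winner share {share:.0%}: " +
          ", ".join(f"winner {w:.3f} vs original {o:.3f}" for w, o in runs))
```

Run it a few times and you'll see the original's figure swing far more under the 95/5 split than under 50/50, simply because it has so few visitors behind it.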
Now I can't offer any explanation, but from previous testing with other tools such as Maxymiser we've seen that when you up-weight a particular variant in a test in favour of another, invariably its conversion performance goes down, sometimes radically so. I recommend only ever using a 50/50 weighting in any follow-up experiment, because for whatever reason an unequal weighting seems to skew performance. Just be aware of this possibility and you'll be fine : )
If anyone can offer me a scientific explanation for this behaviour I'm all ears!
By the way, the screenshot below shows the test after the weightings are reset to a 50/50 split. Bit different from the original follow-up experiment, no?