Effective testing strategies for low-traffic apps

Is A/B testing off the table? Let’s rethink experimentation.

Daphne Tideman

By now, who hasn’t heard of A/B testing? I am a huge advocate of A/B testing and the many benefits it brings. A/B testing is an incredible tool for apps to learn what is and, just as importantly, isn’t working in a less risky manner.

However, it does require a base level of traffic, much more than most apps have, especially when starting out. Without this traffic, you end up running non-significant tests, meaning you can’t tell whether your change made any difference at all. Or, worse, you’ll draw false conclusions from too little data. Either way, it can be a waste of time, resources, and budget.

So, what options exist for subscription apps that don’t have the luxury of massive traffic? The answer, “Well, you just can’t A/B test,” is rather frustrating and unhelpful. You don’t want to end up making endless guesses about potential features and changes, hoping you’ll grow.

Luckily, there are many alternatives, ones that are often underappreciated and undervalued. Each has advantages that can help you gain confidence in the improvements you implement and understand the opportunity areas. But then it becomes a matter of too many choices, so which one will you pick?

Which one you use depends on the change and what matters to you, e.g., confidence, insights, etc. Most companies end up relying on a combination of methods, using these alternative methods to research and test even when they have low traffic. 

Here are the alternatives for subscription apps I usually recommend and the considerations for each one. Before we get into that, I’ve also included a little guide to help you understand once and for all: Can you run A/B tests? If you are just starting out with your app or have checked this before, feel free to skip the next section.

Do you have enough traffic to A/B test?

Whether you can or can’t depends on whether you have a small enough minimum detectable effect (MDE) to monitor a change. The MDE is the smallest true effect size that your A/B test can detect with sufficient statistical power. Some of you are vehemently nodding, and others are probably blinking at the screen in utter confusion. So, let me break that down for you.

In non-stats language, let’s say I have an MDE of 5%; that means if the improvement during the test period is smaller than 5%, I won’t be able to measure it reliably. So if I am making a small change, I’m unlikely to be able to tell whether it has a positive or negative effect. It’s not so confusing now, is it?

I’ll show you how to calculate the MDE yourself, but first, what is the right MDE to use?

Most advice recommends working with an MDE of 2% to 5% that you can reach within 4 weeks (do not run a test if you expect a lift smaller than your MDE). However, this is based on larger, more optimized companies. For startups with much to optimize, you can work with an MDE of up to 10%, maybe even 15%, if you make more significant changes, e.g., revamping your onboarding flow.

How to work out your MDE

Here is how I work out the MDE myself. First, I usually examine the last 12 weeks of data to find the average daily number of users/conversions for key app screens, such as sign-up flow steps. I avoid periods with an unusual drop or increase due to seasonality or a boost in traffic (e.g., an app store feature or a big marketing campaign). Whilst these big traffic boosts can help you reach an MDE faster, they may give you false hope that you have enough data when you don’t.

Second, I decide on the other key inputs:

  • Confidence Level: 95%
  • Statistical Power: 80%
  • Number of variants: 2
  • Max test duration: 2 to 3 weeks

These are pretty industry-standard guidelines. You can increase the test duration up to 4 weeks, but you start to run into reliability issues when you extend it beyond that. Even if that weren’t the case, it also just takes the momentum out of learning.

The reason we go with two variants rather than more is that each extra variant substantially increases the traffic needed to avoid what is known as a false positive (when the test suggests there is a difference but in truth there isn’t one).
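To put a rough number on that risk, here is a minimal Python sketch. It assumes each extra variant is compared against the control independently at a 5% significance level, and it uses the Bonferroni correction as one common (and conservative) way to compensate. The function name is purely illustrative.

```python
def chance_of_any_false_positive(comparisons, alpha=0.05):
    """Chance that at least one of several independent comparisons
    shows a difference that isn't really there."""
    return 1 - (1 - alpha) ** comparisons


for variants in (2, 3, 4, 5):
    comparisons = variants - 1  # every challenger is compared with the control
    risk = chance_of_any_false_positive(comparisons)
    # Bonferroni correction: split the 5% allowance across the comparisons.
    stricter_alpha = 0.05 / comparisons
    print(f"{variants} variants: {risk:.1%} chance of a false positive, "
          f"or test each comparison at alpha = {stricter_alpha:.4f}")
```

A stricter threshold per comparison needs more users per variant, while each variant also receives a smaller share of the traffic, which is why extra variants get expensive quickly.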

I then usually use Speero’s free online calculator to input this data. Start by looking at subscribers or trial users and noting the different MDEs for the chosen duration. For example, you fill in:

  • Users that reach the checkout page as the number of daily visitors
  • % users that started a trial as the baseline conversion rate

Work through this for different parts of your app and note it down. Do you see an MDE like the following example?

Example using the Speero calculator

This means that you just don’t have enough traffic to A/B test, and that is completely fine. Here, we see that even if we ran the test for four weeks, we’d need a lift of 15.19% to be able to measure it. So even if the variant had an improvement of 15%, we couldn’t measure it, suggesting it’s probably not worth A/B testing.

Don’t forget that we’re ideally looking for an MDE of 5% to 10% at most after two to three weeks of testing, which allows us to get into a good flow of A/B testing.
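If you want a feel for what a calculator like this does with your inputs, the following is a minimal Python sketch. It assumes the common normal-approximation formula for comparing two conversion rates with an even traffic split and a two-sided test; it is not necessarily the exact formula Speero uses, so expect small differences. The function name and the example numbers are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def relative_mde(daily_visitors, baseline_rate, days,
                 variants=2, confidence=0.95, power=0.80):
    """Approximate relative MDE for a two-sided test on conversion rates."""
    # Traffic is split evenly, so each variant only sees its share.
    n_per_variant = daily_visitors * days / variants
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # ~1.96 at 95%
    z_power = NormalDist().inv_cdf(power)                     # ~0.84 at 80%
    # Standard error of the difference between two conversion rates,
    # assuming both variants convert at roughly the baseline rate.
    std_error = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_variant)
    absolute_mde = (z_alpha + z_power) * std_error
    return absolute_mde / baseline_rate


# Hypothetical inputs: 200 users a day reach the checkout, 30% start a trial.
for weeks in (2, 3, 4):
    mde = relative_mde(daily_visitors=200, baseline_rate=0.30, days=weeks * 7)
    print(f"{weeks} weeks: MDE is roughly {mde:.1%}")
```

Running it for a few screens and durations gives the same kind of overview as noting down the calculator results by hand: the MDE shrinks as daily traffic, baseline conversion rate, or test duration go up.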

Run this analysis for a few different parts of the journey and, if you can, combine similar screens in your app where you could run the same test to increase the volume (e.g., for a workout app, counting all workouts as one type of screen for A/B tests).

Again, it isn’t an issue if you don’t have enough traffic for now, as long as you know. That way you won’t waste time setting up an A/B testing tool and tests, and can consider alternative approaches instead. In all honesty, most of my clients don’t have the traffic, and we still experiment and learn constantly. Let’s get into an alternative approach.

A/B testing through micro-conversions

Before we move further and further away from the classic A/B testing, there is still one other option. You can also check micro-conversions, such as moving on to the next step of the onboarding. If the correlation is strong enough, and that number improves, you also tend to see more trials or subscriptions started. 

The advantage of this is there tends to be far higher volume, making it quicker to reach the same MDE and allowing you to still A/B test. However, it might not perfectly correlate with the end goal.
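One way to sanity-check that correlation before relying on a micro-conversion is to compare the two rates over past weeks. The sketch below is a minimal example with made-up weekly figures; the variable names and numbers are hypothetical, and a Pearson correlation over a handful of weeks is only a rough signal, not proof that improving one will improve the other.

```python
from math import sqrt
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = mean(xs), mean(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("Correlation is undefined when a series never changes")
    return covariance / (spread_x * spread_y)


# Hypothetical weekly rates for the same eight weeks.
onboarding_step_rate = [0.61, 0.64, 0.58, 0.66, 0.70, 0.63, 0.68, 0.72]
trial_start_rate = [0.081, 0.086, 0.077, 0.088, 0.095, 0.084, 0.090, 0.097]

correlation = pearson(onboarding_step_rate, trial_start_rate)
print(f"Correlation between micro-conversion and trial starts: {correlation:.2f}")
```

A value close to 1 suggests the micro-conversion moves together with trial starts; a weak or negative value is a hint to pick a different micro-conversion.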

Alternatively, you can do the same calculations for A/B testing through marketing channels and the App Store. The volumes there will be even higher than in your app. This method is especially great for testing your app’s messaging.

Again, it is higher up the funnel, meaning that test reliability can be an issue. To solve this, I suggest running a few tests, e.g. via Meta or Google ads and the app store, to ensure consistent results. If you have a large enough email database, you could even A/B test messaging through emails:

Example of the structure for testing this via email

I do this across these options by setting up two variants that are identical in design and only change the messaging. If it’s a paid ad, I’ll make sure this change in messaging is on the creative itself, as that’s what people will notice. 
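Once a messaging test like this has run, a two-proportion z-test is one standard way to check whether the gap between the two variants is larger than chance would explain. This is a minimal sketch using only the Python standard library; the function name and the send and click counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conversions_a, users_a, conversions_b, users_b):
    """Two-sided p-value for the difference between two conversion rates."""
    rate_a = conversions_a / users_a
    rate_b = conversions_b / users_b
    pooled_rate = (conversions_a + conversions_b) / (users_a + users_b)
    std_error = sqrt(pooled_rate * (1 - pooled_rate) * (1 / users_a + 1 / users_b))
    if std_error == 0:
        return 1.0  # nobody (or everybody) converted: no measurable difference
    z_score = (rate_b - rate_a) / std_error
    return 2 * (1 - NormalDist().cdf(abs(z_score)))


# Hypothetical email test: 5,000 recipients per message variant.
p_value = two_proportion_p_value(conversions_a=100, users_a=5000,
                                 conversions_b=140, users_b=5000)
print(f"p-value: {p_value:.4f}")
print("Significant at 95% confidence" if p_value < 0.05 else "Not significant")
```

A p-value below 0.05 corresponds to the 95% confidence level used earlier. Because these tests sit higher up the funnel, it is still worth repeating the test on another channel before trusting the result.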

A/B testing through both app stores has also never been easier. App Radar has a great guide on A/B testing in the Apple App Store and a separate one for the Google Play Store, so this is also worth considering.

When considering any of these methods I still run the above calculations first.

Quantitative analysis first

Even if you are A/B testing, the alternative methods are still invaluable because they help to improve the quality of your tests and/or the insights that drive changes. I always recommend the same process:

  1. Use quantitative research to find out where the issues are
  2. Then use various qualitative methods to explore the why behind it and what is causing the dropoff

The quantitative analysis only tells you where the issue is, but not why you lose customers in that step of the journey. Basically, it allows you to work out:

  • Where you are losing people in the customer journey
  • Where you could make the most significant impact
  • Where there might be technical issues impacting performance (e.g., comparing device/screen size performance)
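As a rough illustration of the first two points, a few lines of Python are enough to turn funnel counts into step-by-step drop-off rates. The step names and user counts below are hypothetical; in practice they would come from your analytics tool for a single, consistent time period.

```python
def step_drop_offs(funnel):
    """Return (from_step, to_step, share of users lost) for each transition."""
    drop_offs = []
    for (from_step, from_users), (to_step, to_users) in zip(funnel, funnel[1:]):
        lost_share = 1 - to_users / from_users
        drop_offs.append((from_step, to_step, lost_share))
    return drop_offs


# Hypothetical number of users reaching each step over the same period.
funnel = [
    ("Onboarding started", 5000),
    ("Onboarding completed", 3400),
    ("Paywall viewed", 3100),
    ("Trial started", 620),
]

drop_offs = step_drop_offs(funnel)
for from_step, to_step, lost_share in drop_offs:
    print(f"{from_step} -> {to_step}: {lost_share:.0%} drop off")

biggest = max(drop_offs, key=lambda step: step[2])
print(f"Biggest drop-off: {biggest[0]} -> {biggest[1]}")
```

Running the same calculation per device or screen size is a simple way to spot the technical issues mentioned in the last point.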

Because the numbers alone can’t tell you why, you want to triangulate research methods to get more reliable results. This is just a fancy way of saying that several methods showing the same results are more trustworthy than a single method.

First, we’ll cover several qualitative methods you could use, and then we’ll consider how to prioritize ideas further:

  1. App user testing
  2. User interviews
  3. Customer feedback (surveys, customer care)
  4. Heatmaps and session recordings

From there, we’ll look at how to turn all those ideas into a prioritized roadmap. Let’s get started.

App user testing

App user testing is one of the most reliable qualitative methods because it is the closest you’ll get to seeing people actually using the app. It’s extremely effective in helping you understand your users and improve your app. 

Every time I run app user tests, I uncover actionable insights that I never expected. It is truly invaluable. I once ran user tests for a workout app focused on an older target audience with a web-to-app signup flow. It helped us realize that even though we’d used a larger font they were still struggling to read it because of how much they increased the font size on their phone normally. We also discovered that even though the trainer was their age, they didn’t believe it. She looked too young and that was reducing the credibility of the app in their eyes—they were sick of the young trainers at the gym not understanding what they needed. These were insights we never would have gotten just from analyzing the quantitative data or looking at heatmaps and screen recordings.

That’s not to say that app user testing doesn’t come with unique challenges, as you’ll need to: 

  • Test across operating systems and platforms (iOS, Android, web, tablet)
  • Account for behavioral differences
  • Have them install it upfront and walk them through it as part of the test
  • Ideally, test in person to get closer to a real-life situation

The last one is key. Real-life scenarios and environments provide more accurate insights than remote testing, which can be tricky for less tech-savvy users. However, if you can’t do that, remote user testing or even unmoderated testing (where users go off and walk through the task themselves) is definitely an option.

You can run these with potential customers to get an unbiased view of your app or with existing customers to get feedback on both new and existing features.

My biggest tip is to keep it simple. Start with basic tools, like using a voice recorder if you’re meeting in person; fancy equipment truly isn’t necessary for impactful results.

My second tip would be to follow the step-by-step process we share here; good user testing starts in the preparation stage. You need to: 

  1. Set goals
  2. Recruit relevant users with incentives
  3. Prepare questions to ask beforehand and the task you want them to run through
  4. Create a document for noting down findings, using tools like rainbow sheets
  5. Develop prototypes (mock-ups of the changes) for further testing

Prototypes are another crucial part of your non-A/B testing toolkit as they allow you to build features far cheaper than implementing them directly and de-risk them further. However, they usually come a bit later in the process once you’ve conducted the qualitative research.

User interviews

While user testing involves walking through a specific scenario with a potential or existing customer, user interviews take the form of a conversation with your customers. This is especially valuable in understanding what matters to them, how they experience your app, and the role it plays in their life. 

Let’s go back to the previous example of the workout app for the older target audience. We only touched on their struggles with regular gyms by chance during the user testing; it was only with user interviews that we uncovered many more valuable insights into what mattered to them. We were originally talking all about building strength and balance, but through user interviews we uncovered that the real reason they were signing up was to keep up with their grandkids. They wanted to be able to play with and lift up their grandkids without worrying about whether they had the physical strength and stamina to keep up with them. This led to A/B tests on messaging via ads and, once we saw this messaging win, further testing of the landing page.

I call these Jobs to be Done interviews: you are trying to uncover what users are trying to achieve with your app. I share some suggested questions here.

Now this doesn’t have to be just with customers. You can speak to previous app users to understand why they stopped using your app and their feedback. You can also speak to potential app users to understand the broader market, what they are comparing your app to, and why. 

I do find that companies shy away from these conversations, finding them time-consuming or expensive, but they are honestly worth it. Ideally, every quarter you should be speaking with customers and getting their feedback, as well as utilizing other forms of customer feedback.

Customer care and surveys

I grouped these together because they are both more ongoing forms of feedback. Often you have standard surveys running that you don’t look at often enough. So, you also want to build an approach for reviewing feedback and implementing it. 

Imagine you find out during a user test that customers are struggling to understand a key feature that you’ve seen from quantitative data correlates with high retention. Many customers seem to view the screen of that feature, but few are actively using it. You then also see in your cancellation survey that one of the main reasons customers are canceling is they aren’t getting enough value out of the app. Now, you don’t know for sure that this is referencing that feature, but all these data points are building a stronger argument that you need to explore this area further.

You want to ensure that any insights you have from customer care, cancellation surveys, etc., are being fed back into the process. In particular, ensure you consistently monitor this feedback over time and note down changes.

I always keep in close contact with customer care, both through regular meetings (usually once a month, to understand what customers are enquiring about and why) and through one-off meetings to discuss specific topics. In those regular meetings, I’m trying to understand:

  • What are the main issues users are reaching out about?
  • What is coming through more or less frequently than before?
  • What do they (customer care) wish could be changed or improved on the app?

Let’s take the example of a wellness app I previously worked with. We were seeing a sudden increase in the number of refunds, so I went to the customer care team for insights. After speaking to them, I understood that the renewal reminders lacked clarity and that, even though we were already reminding users, we could increase the number of reminders. Making those changes reduced the number of refund requests significantly, leading to an overall improvement in profitability even with the slightly higher cancellation rate.

Heatmaps and screen recordings

Heatmaps and screen recordings are great tools if you can’t run user tests or want additional validation and insights into issues. They are also helpful in identifying potential technical problems or bottlenecks in the customer journey.

They are quite high in terms of reliability, especially heatmaps that show patterns across users. However, you don’t know the why behind a specific action, and you can’t ask a customer the way you can in a user test or user interview.

I use heatmaps to understand what information users are and aren’t focusing on in a certain screen and what they are tapping on (including elements that may not be tappable). I use screen recordings to understand where people might be struggling in the customer journey and how they navigate through the app.

You do need to be conscious of platform-specific gestures (e.g., Android back button vs. iOS swipe gestures) that might affect usability. The app will appear differently depending on the device they are using.

I’ve rounded up a range of app-specific tools that can be used for this, all of which come with unique benefits.

Tool: Smartlook
Benefits:
  • New heatmaps instantly get populated with historical data (from the moment you installed it)
  • Always-on recording for all app sessions without manual triggers
  • Funnel visualization to find bottlenecks in user flows
  • Heatmaps for both static and dynamic app screens
  • API integration for connecting insights to other analytics tools
Ideal for you when: You have a lot of data but not a lot of time to set everything up

Tool: UXCam
Benefits:
  • Heatmaps are optimized specifically for touch gestures
  • It can combine with events to watch specific replays
  • Crash analytics to diagnose and fix bugs
  • User segmentation for custom insights into different audiences
Ideal for you when: You have frequent technical issues with your app

Tool: Fullstory
Benefits:
  • Automatic event tagging for specific interactions, e.g., taps or scrolls
  • Identifies friction points, e.g., dead clicks or rage clicks, so you can watch session replays
  • Funnel analysis to track where users drop off
  • Filter by segments like device to see how different cohorts are navigating
  • Tiny SDK, which keeps the impact on app speed low
Ideal for you when: You are looking at a longer customer journey (e.g., a lengthy onboarding)

Tool: Sprig
Benefits:
  • AI analysis of heatmaps to suggest actionable insights
  • It can combine with events to watch specific replays
  • Filter on particular behavior based on user attributes or in-product actions
Ideal for you when: You don’t have much experience with heatmaps/user recordings

One final consideration is what else you want to use the tool for. Most app tools that offer heatmaps and screen recordings also offer other uses, e.g., product analytics, journey mapping, and in-app surveying.

Which qualitative method should you use?

You’ve now learned about various qualitative methods, but how do you choose between them all? As mentioned before, a mix is ideal. But just between us, here is a little cheat sheet to keep in mind.

What you want to learn | Recommended method
How do customers interact with a certain step or new feature? | Heatmaps, session recordings, user testing
Why are customers dropping off in the customer journey? | User testing, surveys, session recordings
What are common challenges they face? | Customer care, session recordings, user testing
Why are users canceling their subscriptions? | Customer care, user testing, user surveys, cancelation interviews
Should you change your monetization model? | User surveys, user interviews

I appreciate that it doesn’t cover every possible scenario, but hopefully, it can help get you started with most of the bases. Again, the more you use, the more reliable your results. So, especially with bigger changes, like changing your pricing model, use several.

What do you do with all the new ideas?

Let’s imagine you’ve been using this approach, you’ve conducted quantitative analysis, and you’ve understood the reason behind certain behaviors through qualitative research. Now what? 

Whether you can or can’t test those hypotheses through micro-conversions, it’s still worth prioritizing those ideas rather than blindly implementing them.

Now it’s easy to just focus on the low-effort, high-impact ideas, which is what most companies do.

However, this doesn’t consider the validation behind it. I would prioritize a high-effort, high-impact idea with more research behind it over a low-effort, high-impact idea that we are less confident in (especially with apps, where changes often require more effort than expected).

For this, a prioritization framework is helpful. It allows you to further arrange your ideas and bring some structure to your backlog. Which prioritization framework you use will depend on what you find important. If you aren’t sure, I personally like the PXL framework as a starting point as it weights ideas with more validation higher, e.g. if you saw this was an issue across several research techniques.

Credit & image rights: CXL

However, there’s no doubt that it was built for the web, because the first point specifies “above the fold.” While you can, of course, scroll in an app, that behavior is less common. Also, pages become screens, of course.

So, I would suggest the following changes for apps when using this framework:

  • Remove “above the fold” as a criterion
  • Change high-traffic “page(s)” to “screens” and use a relative scale (e.g., 0 to 3, depending on how essential the screen is)

From there, adjust it depending on what matters to you. For example, if you are very developer-constrained, you might weight ease more heavily to reflect that constraint. Or if you value impact, add an additional factor that weighs up potential impact; you could define a score from 0 to 3 based on the expected increase in new subscribers if you are improving something earlier on in the funnel.

Keep in mind that you are also grouping and prioritizing ideas per area based on what your quantitative research showed. Let’s say you have two areas you are working on: getting to the aha moment in the first session and getting users to find key content quicker while using the app. These are two very different areas; you shouldn’t compare an idea for one area with an idea for the other. Instead, you should be working with two separate backlogs.
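As an illustration of how such an adapted scoring could work, here is a minimal Python sketch. It is not the official PXL sheet: the criteria, scales, weights, and example ideas are hypothetical and only mirror the adjustments described above (a 0 to 3 screen score, credit for validation across research methods, an adjustable weight for ease, an impact score, and one backlog per area).

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Idea:
    name: str
    area: str               # the opportunity area the idea belongs to
    research_methods: int   # how many research methods surfaced the issue
    screen_importance: int  # 0-3: how essential the screen is
    ease: int               # 0-3: higher means less development effort
    expected_impact: int    # 0-3: expected lift in new subscribers


def score(idea, ease_weight=1.0):
    """Sum the criteria; raise ease_weight if developers are the bottleneck."""
    return (idea.research_methods + idea.screen_importance
            + ease_weight * idea.ease + idea.expected_impact)


def backlogs_by_area(ideas, ease_weight=1.0):
    """Return one backlog per area, highest-scoring ideas first."""
    backlogs = defaultdict(list)
    for idea in ideas:
        backlogs[idea.area].append(idea)
    for backlog in backlogs.values():
        backlog.sort(key=lambda idea: score(idea, ease_weight), reverse=True)
    return dict(backlogs)


# Hypothetical ideas for the two areas used as an example above.
ideas = [
    Idea("Shorten onboarding quiz", "aha moment", 3, 3, 1, 2),
    Idea("Reword welcome screen", "aha moment", 1, 3, 3, 1),
    Idea("Add search to library", "content discovery", 2, 2, 1, 2),
]

for area, backlog in backlogs_by_area(ideas, ease_weight=2.0).items():
    print(area)
    for idea in backlog:
        print(f"  {score(idea, ease_weight=2.0):>4.1f}  {idea.name}")
```

With the default weight, the better-validated onboarding idea ranks first in its area; doubling the ease weight moves the easier copy change ahead of it, which is the kind of shift you would expect when developer time is the constraint.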

Cohort analysis after implementing changes

It can be worth monitoring cohorts with high-risk changes, especially with changes that may impact long-term behavior, e.g., a pricing model change. You can use version-based cohorts for this. 

This is a variation of before/after testing: comparing the results before and after a change. I haven’t talked much about this as an option because of the noise around it, especially as you often implement multiple fixes or changes per version, and other factors, such as traffic source changes, may be involved.

So use it as a double check, but even if you see an improvement or drop, dive deeper into the data before jumping to conclusions.
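A version-based cohort comparison can be as simple as grouping users by the app version they signed up on and comparing a retention metric. The sketch below assumes a hypothetical export of (version, still subscribed after 30 days) pairs; the version numbers, the 30-day window, and the data are all made up for illustration.

```python
from collections import defaultdict


def retention_by_version(users):
    """Share of users per signup version who are still subscribed."""
    totals = defaultdict(int)
    retained = defaultdict(int)
    for version, still_subscribed in users:
        totals[version] += 1
        retained[version] += int(still_subscribed)
    return {version: retained[version] / totals[version] for version in totals}


# Hypothetical export: (app version at signup, still subscribed after 30 days)
users = [
    ("2.3", True), ("2.3", False), ("2.3", True), ("2.3", False),
    ("2.4", True), ("2.4", True), ("2.4", False), ("2.4", True),
]

for version, rate in sorted(retention_by_version(users).items()):
    print(f"Version {version}: {rate:.0%} retained")
```

With only a handful of users per version, as in this toy data, any gap is noise. Real cohorts need enough users per version, and a check that the traffic mix did not change between releases.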

The traffic might not be there, but the experiments are

If nothing else, I hope you’ve learned that when you lack the traffic to run an A/B test, the solution is certainly not blindly implementing something. That is never the solution in growth, I promise you that. A lack of traffic is a hurdle, not a roadblock, and you’ll find ways around it. There are always alternative ways to run A/B tests:

  • Focusing on micro-conversions
  • Through the App Store
  • Through marketing channels 

Even if you use those methods, 90% of A/B testing should be focused on proper research. Consistently look at the quantitative data to identify where the dropoffs are, and then use qualitative methods to understand why. This can be done through:

  • App user testing
  • User interviews
  • Customer feedback (surveys or customer care)
  • Heatmaps and session recordings

Then, systematically prioritize and either test or implement the changes, using cohort analysis where needed to monitor changes over time. If anything, sometimes not having the traffic makes you better at choosing and deciding priorities because you are forced to take the time to look at the data. What may initially seem like a limitation might, in fact, be the thing that propels your growth. All you need is the right mindset.
