Apple gave indie developers proper screenshot A/B testing back in 2021 with Product Page Optimization. Four years later, most apps still have not run a single test. That is a quietly expensive mistake. The gap between "screenshots that look fine" and "screenshots that actually convert" is routinely a 1.5x to 2x difference in install rate from the same product page traffic. If you are paying anything for installs — search ads, paid social, content marketing — you are funding the gap.

This guide walks through how Product Page Optimization actually works in 2026, how to design a test that produces a real signal, what to test first when you have one shot at a 90-day window, and how to read the results without fooling yourself. The principles apply to Google Play's experiment tool too, with one important caveat I will cover at the end.

what Product Page Optimization actually is

Product Page Optimization (PPO) is Apple's built-in A/B testing tool for your App Store product page. It lives inside App Store Connect under the "Product Page Optimization" tab and lets you run experiments with up to three treatments against your original product page. A treatment is a complete variant: a different app icon, a different set of screenshots, or a different app preview video. The traffic to your product page is split between the original and the treatments at percentages you control, and Apple measures conversion — what percentage of users who view the page go on to install.

The mechanics that constrain how you should think about it:

One test at a time. You cannot run multiple PPO tests in parallel.
Up to three treatments per test, plus the original control. Most useful tests use one or two treatments, not three.
Tests run for up to 90 days. You can stop a test early if a treatment is decisively winning, but you cannot pause and resume.
Per-device-size screenshot sets. If you only change the iPhone 6.7-inch screenshots, Apple uses the original for every other device size in that treatment.
The treatment's icon must be included in your shipped binary. You cannot test an icon that isn't already an alternate icon in your app.

the constraints that shape your test design

The 90-day cap and the one-test-at-a-time rule are why most indie teams get PPO wrong. They treat each test like a quick A/B test on a website: fast iteration, lots of small experiments, optimize as you go. PPO is the opposite. If every test runs the full 90 days, you get four shots a year. Maybe five if you stop a test early.

That changes the math. With unlimited tests you optimize for speed. With four tests a year you optimize for learning per test. The bias should be toward bigger differences between treatments, not smaller ones. If your treatment changes one word of headline copy on the first screenshot, you will burn 90 days to learn that one word didn't matter and you will not know whether anything else might have. If your treatment swaps the entire visual concept for the first three screenshots, you learn whether the concept matters — and if it does, you have a direction for the next test.

The other thing the 90-day window does is set a traffic floor. PPO's confidence depends on volume. An app with a few hundred product page views per day will not produce a statistically meaningful result inside 90 days unless the treatment effect is enormous. A useful rule of thumb: if you cannot get to roughly 100,000 impressions across the test period, you should be running fewer treatments and bigger differences, or skipping PPO entirely and making the change you think is right.
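To make the traffic floor concrete, here is a rough sketch of the arithmetic. It uses a standard two-proportion sample-size approximation, which is not necessarily how Apple computes confidence inside PPO. The function name, the 3% baseline conversion rate, and the 5% significance and 80% power defaults are all my assumptions for illustration.

```python
import math
from statistics import NormalDist


def views_needed_per_arm(baseline_rate, relative_lift, alpha=0.05, power=0.80):
    """Approximate page views each arm needs to detect a given relative lift.

    Normal-approximation formula for comparing two proportions (two-sided).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


# Hypothetical inputs: 3% baseline conversion, hoping to detect a 10% relative lift.
print(views_needed_per_arm(0.03, 0.10))
```

With those hypothetical inputs the answer lands a little over 50,000 views per arm, so a control plus one treatment already needs about 100,000 views in total, which is in the same range as the rule of thumb above. Every extra treatment adds another arm that needs the same volume.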

what to test, in order of expected impact

If you have one PPO test to run, here is the priority order I would suggest based on what consistently produces the largest deltas across indie apps:

The first screenshot. Specifically: the image, the headline copy on the image, and the visual hierarchy. The first screenshot is the only one most users will look at before deciding whether to scroll. It does the heaviest lifting in your entire listing. Optimize this before anything else.
The full screenshot set, treated as a story. Once the first screenshot is reasonable, the next-biggest lever is whether the next two screenshots back up the promise. Tests that move from "feature tour" to "outcome story" frequently outperform.
App preview video. A good preview video can move conversion 5–15%. A bad one usually does nothing. The asymmetry is real, so the cost of trying is mostly production effort.
App icon. Icon changes are powerful for discoverability in search but produce smaller conversion lifts on the product page itself, because by the time users are on the product page they have already decided to look. Test the icon when you suspect it is filtering out users at the search-results stage.

how to design a test that produces a real signal

The mistake every team makes the first time: running a treatment that differs from the control along three different dimensions at once. New first-screenshot image, new headline copy, new color palette. The treatment wins by 18%. You have no idea why, and no second test that would let you attribute the win to any one change.

The discipline is to vary one variable at a time in the first two treatments, then use the third slot for the combination so you can triangulate. A clean test design looks like this:

Original (control): your existing first screenshot.
Treatment 1: same image as the original, new headline copy.
Treatment 2: new image, original headline copy.
Treatment 3: new image and new headline copy.

Now if Treatment 3 wins big and Treatments 1 and 2 are flat, you know the combination matters and the individual elements don't. If Treatment 2 wins and Treatment 3 wins by the same amount, the image is the lever and the copy is irrelevant. You learn something either way.
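The reading of that design can be sketched in a few lines. This is a minimal illustration that works on point estimates only and ignores the confidence intervals, so treat it as a way to organize the numbers rather than a verdict. The function name and the conversion rates are hypothetical.

```python
def factorial_effects(control, copy_only, image_only, both):
    """Split the four conversion rates into copy, image, and interaction effects.

    All values are conversion rates (0.03 means 3%); results are in rate points.
    """
    copy_effect = copy_only - control
    image_effect = image_only - control
    # Whatever the combination adds beyond the two separate effects.
    interaction = both - control - copy_effect - image_effect
    return {"copy": copy_effect, "image": image_effect, "interaction": interaction}


# Hypothetical result: Treatment 2 and Treatment 3 win by the same amount.
effects = factorial_effects(control=0.030, copy_only=0.030, image_only=0.034, both=0.034)
print(effects)
```

In that example the image effect carries the whole lift and the copy and interaction terms are zero, which is the "image is the lever" case. A large interaction term with small individual effects is the "combination matters" case.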

The other test-design discipline most teams skip: define your minimum detectable effect before you start. If your daily traffic means you need a 12% lift to be confident in 90 days, and you don't believe any of your treatments could realistically deliver 12%, the test will most likely end inconclusive and you will have wasted the slot. Better to skip it and ship the change you think is right.
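A back-of-the-envelope version of that calculation, as a sketch. It assumes traffic is split evenly across all arms, uses a normal approximation with 5% significance and 80% power, and treats the baseline rate and daily views as inputs you supply. None of this is Apple's formula, and the names are mine.

```python
import math
from statistics import NormalDist


def minimum_detectable_lift(daily_views, baseline_rate, arms, days=90,
                            alpha=0.05, power=0.80):
    """Smallest relative lift a test of this size can plausibly detect.

    `arms` counts the original plus every treatment; an even split is assumed.
    """
    views_per_arm = daily_views * days / arms
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    absolute = z * math.sqrt(2 * baseline_rate * (1 - baseline_rate) / views_per_arm)
    return absolute / baseline_rate


# Hypothetical app: 500 product page views a day, 3% baseline conversion.
for arms in (4, 2):
    print(arms, round(minimum_detectable_lift(500, 0.03, arms), 3))
```

With those hypothetical inputs, a control plus three treatments can only detect a lift of a little over 20%. Dropping to a control plus one treatment brings that down to about 15%, which is the arithmetic behind running fewer treatments when traffic is low.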

interpreting the result without fooling yourself

PPO reports conversion rate per treatment with a confidence interval. The instinct is to look at the headline number — "Treatment 2 has +14% conversion" — and stop. Three things to do instead:

First, look at the confidence interval. If the lower bound is barely above zero, the lift is fragile. If it is below zero, the data cannot rule out that the treatment is no better than the original. A treatment showing +14% with an interval of [-2%, +30%] is not a winner; it's an unproven result with a wide spread.
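If you want to sanity-check an interval yourself from raw install and view counts, here is one way to sketch it. It uses the textbook log-ratio interval for two proportions. PPO reports its own interval and I am not claiming it is computed this way, so expect the numbers to differ somewhat. The function name and the counts are hypothetical.

```python
import math
from statistics import NormalDist


def relative_lift_interval(control_installs, control_views,
                           treat_installs, treat_views, confidence=0.95):
    """Return (low, point_estimate, high) for the treatment's relative lift."""
    p_control = control_installs / control_views
    p_treat = treat_installs / treat_views
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Standard error of log(p_treat / p_control), delta-method approximation.
    se_log = math.sqrt((1 - p_control) / control_installs
                       + (1 - p_treat) / treat_installs)
    log_ratio = math.log(p_treat / p_control)
    low = math.exp(log_ratio - z * se_log) - 1
    high = math.exp(log_ratio + z * se_log) - 1
    return low, p_treat / p_control - 1, high


# Hypothetical counts: same +14% point estimate, very different certainty.
print(relative_lift_interval(300, 10_000, 342, 10_000))
print(relative_lift_interval(3_000, 100_000, 3_420, 100_000))
```

The first call gives a +14% point estimate whose interval still crosses zero. The second, with ten times the traffic at the same rates, pulls the lower bound clear of zero. Same headline number, different decision.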

Second, look at the result by device size and storefront. The aggregate result can hide a treatment that wins in the US and loses in your three biggest non-US markets. PPO splits traffic globally by default, but the underlying behavior is not global. If your top markets diverge, ship a per-storefront localization rather than a global change.
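A small sketch of that breakdown, assuming you have already assembled install and view counts per storefront for the original and one treatment. The data shape, the function names, and the numbers are all made up for illustration, and the sketch ignores per-storefront confidence intervals, which will be wide for small markets.

```python
def lifts_by_storefront(results):
    """results maps storefront -> (control_installs, control_views,
    treatment_installs, treatment_views). Returns relative lift per storefront."""
    lifts = {}
    for storefront, (c_inst, c_views, t_inst, t_views) in results.items():
        lifts[storefront] = (t_inst / t_views) / (c_inst / c_views) - 1
    return lifts


def diverging_storefronts(results):
    """Storefronts whose lift points the opposite way from the aggregate."""
    c_inst, c_views, t_inst, t_views = (sum(col) for col in zip(*results.values()))
    overall = (t_inst / t_views) / (c_inst / c_views) - 1
    return [s for s, lift in lifts_by_storefront(results).items() if lift * overall < 0]


# Hypothetical counts: the aggregate is positive, driven entirely by the US.
example = {
    "US": (300, 10_000, 400, 10_000),
    "DE": (120, 4_000, 100, 4_000),
    "JP": (90, 3_000, 80, 3_000),
}
print(lifts_by_storefront(example))
print(diverging_storefronts(example))
```

Here the aggregate lift is positive while DE and JP both lose, which is the pattern that argues for a per-storefront change.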

Third, and most importantly, check what happened to your downstream metrics during the test window. Pull the App Store Connect Analytics dashboard and look at day-1 retention, in-app purchase rate, and crash-free sessions for the cohort that installed during the test. A first screenshot that lifts conversion by 14% but pulls in users whose churn is 20% higher is not a win. It is an expensive mistake disguised as a metric improvement.
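The arithmetic behind that warning, as a sketch. I am reading the churn penalty as a day-1 churn rate that is 20% higher in relative terms, and the 3% conversion and 60% day-1 churn baselines are hypothetical numbers picked to make the example readable.

```python
def retained_installs_per_1000_views(conversion_rate, day1_churn):
    """Installs still around after day 1, per 1,000 product page views."""
    return 1000 * conversion_rate * (1 - day1_churn)


# Hypothetical baseline: 3% conversion, 60% of installs gone after day 1.
control = retained_installs_per_1000_views(0.030, 0.60)

# Treatment: conversion up 14%, churn rate up 20% (0.60 -> 0.72).
treatment = retained_installs_per_1000_views(0.030 * 1.14, 0.60 * 1.20)

print(round(control, 2), round(treatment, 2))
```

Under those assumptions the treatment produces more installs but fewer retained users per 1,000 views than the original. How bad the trade is depends heavily on the baseline churn, so plug in your own numbers.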

beyond Apple: variant generation as the bottleneck

The reason most indie teams don't run PPO tests is not that they don't believe in A/B testing. It's that producing the variants is genuinely painful. The first screenshot for an iPhone 6.7-inch device is 1290×2796. The 6.5-inch is 1242×2688. iPad Pro 13-inch is 2064×2752. If your test has three treatments and you need to localize the headline copy into your top six storefronts, you are looking at producing somewhere on the order of 50–100 individual screenshot files for one test. By hand, in Figma, with a designer, that is half a sprint.
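The file count is easy to underestimate, so here is a quick sketch that enumerates it. The pixel sizes are the ones quoted above. The device labels, storefront codes, and file naming scheme are placeholders I made up.

```python
from itertools import product

# Pixel sizes from the text above; the keys are made-up labels.
DEVICE_SIZES = {
    "iphone-6.7": (1290, 2796),
    "iphone-6.5": (1242, 2688),
    "ipad-13": (2064, 2752),
}


def variant_files(treatments, storefronts, screenshots_per_set=1):
    """List every screenshot file one test needs, under a hypothetical naming scheme."""
    files = []
    for treatment, storefront, device in product(treatments, storefronts, DEVICE_SIZES):
        width, height = DEVICE_SIZES[device]
        for index in range(1, screenshots_per_set + 1):
            files.append(
                f"{treatment}/{storefront}/{device}_{width}x{height}_{index:02d}.png"
            )
    return files


files = variant_files(
    treatments=["t1", "t2", "t3"],
    storefronts=["us", "gb", "de", "fr", "jp", "br"],  # hypothetical top six
)
print(len(files))
```

Changing only the first screenshot across three treatments, six storefronts, and three device sizes comes to 54 files. Changing the first three screenshots in each treatment triples that.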

This is the part of the workflow that breaks most often. Teams either ship under-designed variants because they ran out of time, or they don't run the test at all.

practical takeaways

If you are about to run your first PPO test, here is the short version of everything above:

Plan for a PPO test that can run the full 90 days. You cannot pause and resume it, so don't stop it early unless a treatment is decisively winning, and don't second-guess it mid-test.
Vary the first screenshot before anything else. Image and headline copy.
Use the three-treatment slot to triangulate one image change, one copy change, and the combination — not three random ideas.
Estimate your minimum detectable effect from your traffic before you start. If the math doesn't work, ship the change directly and skip the test.
When you read the result, check the confidence interval, the per-storefront breakdown, and downstream retention.
Treat variant production as the constraint, not the design.

where Stora fits

Producing the screenshot variants for a PPO test is the part of the workflow Stora was built to absorb. The screenshot generator takes a single source frame for each screen in your app and produces every device size — 6.9-inch and 6.5-inch iPhone, 13-inch and 11-inch iPad, the Android tablet sizes Google quietly added — for every storefront localization, in seconds. If your PPO test has three treatments across your top six storefronts, that's not a half-sprint of design work; it's a single run.

The point is not that Stora replaces a designer. It's that the bottleneck of variant production stops being the reason you skip a test. Run the test. Read the result. Ship the change. Do that four times a year and the compounding effect on conversion is the difference between an indie app that grinds and an indie app that grows.

How to A/B test App Store screenshots in 2026: a complete Product Page Optimization guide