Mapping to QALYs: how do I know if that’s good enough?
In HTA, QALYs are the least-bad measure of the value of health outcomes that we have, so we have to make them work. The best way, we think, is by measuring outcomes prospectively in a clinical trial. When we don’t have that, we reach for a ‘second best’, such as taking values from a previously published study, or using a disease-specific outcome and then estimating a conversion equation to derive quality of life values (utilities) to use as the weights in QALYs. This is called mapping.
But this poses a challenge to people like me tasked with reviewing the quality of an economic evaluation: what should we be looking out for in mapping studies and what standards should we accept?
Most of the literature I could find on ‘good practice’ focused on the derivation of the relationship between the disease-specific scale and utilities in the first place; less attention has been paid to how these estimates are then used by other researchers.
Useful references were as follows:
‘A review of studies mapping (or cross walking) non-preference based measures of health to generic preference-based measures.’ Brazier JE, Yang Y, Tsuchiya A, Rowen DL. Eur J Health Econ. 2010 Apr;11(2):215-25.
‘Do estimates of cost-utility based on the EQ-5D differ from those based on the mapping of utility scores?’ Barton GR, Sach TH, Jenkinson C, Avery AJ, Doherty M, Muir KR. Health Qual Life Outcomes. 2008 Jul 14;6:51.
In terms of the studies estimating the disease-specific to utility relationship, a reviewer should bear in mind the following:
1. Mapping must be based on data, not opinion
There is general agreement that using the opinion of clinicians to predict how a patient would have answered an EQ-5D questionnaire based on their responses to a disease-specific questionnaire is no longer acceptable.
2. The minimum expectation is linear regression analysis.
Linear regression analysis is commonly used, but because the utility scale is bounded the results may be biased and inconsistent, especially if a large proportion of subjects are in full health. Better options may be Tobit, censored least absolute deviations (CLAD) or restricted maximum likelihood (Brazier).
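To illustrate why boundedness matters, here is a minimal simulated sketch in Python (the data, coefficients and noise are invented, not taken from any cited study): when observed utilities are capped at 1 (full health), ordinary least squares fitted to the observed values attenuates the slope of the true underlying relationship — the kind of bias that Tobit-type models are designed to address.

```python
import numpy as np

# Simulated illustration (invented data, not from any cited study):
# utilities are capped at 1 (full health), so OLS fitted to the
# observed values attenuates the slope of the true latent relationship.
rng = np.random.default_rng(0)
n = 500
score = rng.uniform(0, 100, n)          # hypothetical disease-specific score (high = worse)
true_slope = -0.008
latent = 1.1 + true_slope * score + rng.normal(0, 0.1, n)
utility = np.minimum(latent, 1.0)       # ceiling effect: many respondents at full health

# OLS via least squares on the censored observations
X = np.column_stack([np.ones(n), score])
beta, *_ = np.linalg.lstsq(X, utility, rcond=None)

print(f"true slope: {true_slope}, OLS slope on censored data: {beta[1]:.4f}")
```

In practice a Tobit or CLAD model, which treats observations at the ceiling as censored rather than exact, would be the better choice here, as the Brazier review suggests.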
3. Different models should be tested.
Some evidence suggests simple methods are best, with added complexity having little value; other studies have found that squared terms and interactions, as well as patient characteristics, improve predictions. For example, Barton et al developed five models to predict utilities in osteoarthritis:
Model A: total WOMAC score only
Model B: pain, stiffness, functioning (sub-scales of WOMAC)
Model C: total WOMAC and total WOMAC²
Model D: pain, functioning, pain*functioning, pain*stiffness, stiffness*functioning, pain², stiffness², functioning²
Model E: best of models A through D plus age and sex of patient
The final model, E, was found to have the best fit.
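As a sketch of what testing these competing specifications can look like, here is a hedged Python example with simulated sub-scale data. The sub-scale names follow the text, but the data and coefficients are invented, and a stiffness main effect is included in model D for simplicity. Note that in-sample R² can never fall as regressors are added, which is one reason the out-of-sample validation in point 5 matters.

```python
import numpy as np

# Simulated sketch of Barton-style model specifications; sub-scale
# names follow the text, but the data and coefficients are invented.
rng = np.random.default_rng(1)
n = 300
pain, stiffness, functioning = (rng.uniform(0, 1, n) for _ in range(3))
age = rng.uniform(40, 80, n)
sex = rng.integers(0, 2, n).astype(float)
total = pain + stiffness + functioning
utility = (0.9 - 0.3 * pain - 0.2 * functioning
           - 0.1 * pain * functioning + rng.normal(0, 0.05, n))

def r_squared(X, y):
    """OLS fit with intercept; return in-sample R^2."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

models = {
    "A": np.column_stack([total]),                        # total score only
    "B": np.column_stack([pain, stiffness, functioning]), # sub-scales
    "C": np.column_stack([total, total**2]),              # total + squared term
    "D": np.column_stack([pain, stiffness, functioning,   # squares + interactions
                          pain*stiffness, pain*functioning,
                          stiffness*functioning,
                          pain**2, stiffness**2, functioning**2]),
}
models["E"] = np.column_stack([models["D"], age, sex])    # D + patient characteristics

r2 = {name: r_squared(X, utility) for name, X in models.items()}
for name, value in r2.items():
    print(f"model {name}: R^2 = {value:.3f}")
```

Because each model here nests the simpler ones, in-sample R² rises mechanically from A through E; only prediction on fresh data can tell you whether the extra terms earn their keep.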
4. A goodness-of-fit test should be reported.
Brazier et al’s review found R² was the most commonly used measure: when mapping one generic measure to another (e.g. SF-36 to EQ-5D) a value of 0.5 can be achieved. Of the 30 studies they identified, one of the lowest R² values for a disease-specific to generic mapping was 0.17, so if your R² is at this level you have a problem.
5. The key test is the ability to predict.
The main reason for estimating the quantitative relationship between the disease-specific measure and utilities is to be able to predict. An essential part of validating a model is therefore to test its predictions: by splitting the original sample, estimating the relationship on one half and testing it on the other; by obtaining a second dataset that uses the same disease-specific outcome; or by a similar approach.
Brazier et al propose that a measure of prediction error such as mean absolute error (MAE) should be used; Barton et al, for example, defined it as the average absolute difference between actual and predicted values. Barton et al briefly reviewed other studies and found MAEs ranging from 0.13, the lowest observed (where low is good), to 0.19, the highest. Brazier et al report lower MAEs, but it is unclear whether these were for disease-specific to generic mapping studies.
Plotting errors against EQ-5D-based utilities can also be helpful; there may be a tendency for utilities to be underestimated for those in better health and over-estimated for those in poorer health.
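The split-sample validation and MAE calculation described above can be sketched as follows (simulated data; the mapping coefficients and noise are invented):

```python
import numpy as np

# Split-sample validation sketch: estimate the mapping on one half of
# the sample and compute the mean absolute error (MAE) of predictions
# on the other half. Data are simulated for illustration only.
rng = np.random.default_rng(2)
n = 400
score = rng.uniform(0, 100, n)                 # hypothetical disease-specific score
utility = np.clip(0.95 - 0.006 * score + rng.normal(0, 0.08, n), -0.2, 1.0)

half = n // 2
X = np.column_stack([np.ones(n), score])
beta, *_ = np.linalg.lstsq(X[:half], utility[:half], rcond=None)  # estimation half
pred = X[half:] @ beta                                            # validation half
mae = np.mean(np.abs(utility[half:] - pred))                      # Barton-style MAE
print(f"out-of-sample MAE: {mae:.3f}")
```

The same validation-half errors can then be plotted against observed utilities to check for the over- and under-estimation pattern mentioned above; a simulated MAE, of course, says nothing about how a real mapping algorithm performs.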
So that gives me some idea of what to look for in the source study, but what about how it was applied in the submission I am reading?
My first issue is: what is the hierarchy of sources for utility values? Presumably we place most faith in direct measurement with an instrument such as the EQ-5D in a trial, but how does mapping compare with – say – time trade-off (TTO) valuation of descriptions of health states? The balance would seem to lie with mapping, because the disease-specific scale was measured in the trial, whereas the TTO values are for descriptions which are then applied retrospectively to how the patients MAY have been feeling, which seems to introduce potential biases.
This raises another interesting question: suppose the study in front of me used TTO (or similar) but I know that a secondary outcome measure could have been mapped to estimate QALYs using an existing study – as a reviewer, should I insist this is carried out? On the one hand I know mapping studies are imperfect; on the other, they are at least based on patients’ self-assessed health. The answer is that I would probably request a sensitivity analysis using mapping, though I realise that is a way of ducking the question!
The second issue is how I would know there WAS an existing mapping study in the first place. The only published review I could find, by Brazier and colleagues, covered 30 studies up to 2007, but several were establishing relationships between generic instruments and others were unpublished studies; none used cancer-specific scales (the topic I was interested in). So do I have to conduct a literature search each time I see a utility that isn’t based on the gold standard method? Should the onus be on the pharma company to establish whether a mapping study is available? A great solution would be to construct a database of mapping studies. I’ve made a start; please e-mail me if you are interested.
The third issue is: if there are several mapping studies available, which should I use? For example, I am interested in the EORTC QLQ-C30 measure, and even a brief literature search identified three different studies, with a search of the references identifying as many again. The QLQ-C30 is intended to be used across many different types of cancer, but does this mean a mapping algorithm is equally applicable across them? If the medicine I am reviewing is for colorectal cancer, is a QLQ-C30 model derived from oesophageal cancer patients and validated on breast cancer patients applicable or not? Should I be influenced by where the sample of patients for the mapping came from? For example, one of the QLQ-C30 studies was in Greek cancer patients, so should I discount it because I work in Scotland? The most advice I could find was that patients in the original mapping study should have similar QLQ-C30 scores to the ones receiving the treatment I am interested in, but what does that mean in practice? With six different mapping studies (at least) I can’t even use my normal trick of running a sensitivity analysis, as there seems a good chance at least one of the algorithms will give a different answer to the base case. Ideally I’d like to pick the one with the most robust statistical method, but I don’t think there is currently guidance to help me do that.
Our approach to mapping seems to be evolving in an ad hoc way. Some bits of the jigsaw are available but there are a lot of gaps. I’ve started to piece some of them together but would like to hear from anyone who thinks I’ve got anything wrong or who can fill in the gaps.
Thoughts on a database of mapping studies are very welcome. (Stop press: I’ve just found another review of published algorithms:
‘Comparing the Incomparable? A Systematic Review of Competing Techniques for Converting Descriptive Measures of Health Status into QALY-Weights.’ Mortimer D, Segal L. Med Decis Making. 2008;28:66.)