Episode 10

Published on: 7th Dec 2023

What’s so Great About the P-value? A Statistician’s Point of View

Dr Stephen Hibbs invites HemaSphere Associate Editor Prof Robert K Hills to discuss statistics in scientific writing in an entertaining and informative episode: What’s so Great About the P-value? A Statistician’s Point of View. The discussion covers the overlap of statistics and hematology, with themes like the importance of the P value and the use of statistics in clinical trials.

What’s so Great About the P-value? A Statistician’s Point of View is available on our website, all major podcast platforms (Spotify, Apple Podcasts, etc.), and YouTube.

Listen and enjoy casual, insightful discussions about #hematology research. HemaSphere articles are always fully open access.

Recent HemaSphere articles you may find interesting: 

Interim Positron Emission Tomography During Frontline Chemoimmunotherapy for Follicular Lymphoma. 

Merryman R, Michaud L, Redd R, et al.

Definition and Prognostic Value of Ph-like and IKZF1plus Status in Children With Down Syndrome and B-cell Precursor Acute Lymphoblastic Leukemia.

Palmi C, Bresolin S, Junk S, et al.

Transcript

00:00 Welcome to the podcast of HemaSphere - the official journal of the European Hematology Association. HemaSphere's podcast presents insightful, expert discussions about recent hematology publications. We hope you enjoy.

::

So I'm joined today by Prof. Robert Hills, who is a professor of medical statistics. He has been very involved in leukaemia trials over the last twenty years - and is one of the associate editors for HemaSphere. We're going to be exploring today a few topics about where statistics, hematology and trials overlap.

Robert, I wonder if (to start with) I could just ask you about your own journey; because I noticed that your doctoral research was initially in a field with applications to superconductors. So I'm interested to know - how did you end up moving into health - and did you notice any difference about how health researchers think about statistics, compared to those in physics?

::

So, that's a very interesting question - and I'm afraid it gets even worse than that - because I wasn't a physicist and I didn't do any experiments. I was a mathematician. You know, mathematics is all about proof and truth... whatever it is. You know... 1+1 is always equal to 2. You start with a set of beliefs, then everything else follows logically.

That isn't true in statistics. And the reason, really, it isn't true is that you're dealing with people - and there is a variability in people. So your experiment, if you like, of giving a drug to a person, isn't remotely replicable because every person is different. Every person's leukaemia is different. Everybody is going to respond slightly differently and other things in the world are going to intervene.

So you have this idea of uncertainty; which is anathema to a mathematician who is looking at proof and looking at truth and looking at an equation - and the equation is true or not. And even experimental physicists tend to have very, very tightly controlled experiments.

So they take their data and they'll fit their curves to it... but they're not too worried about the variability in the experiment, because they've got their tightly controlled conditions. It's even more controlled than the best controlled lab experiment in medicine, to be honest with you.

So there is a bit of a change. It's all about embracing uncertainty.

But the one thing that I think is important - and it does come from the mathematical training - is the idea of being a professional sceptic.

So, you never believe anything that anyone ever tells you. You're asking yourself "Where is the evidence?"

But I got into it purely by accident - and I got into it because I did a lot of programming in my younger days in a language called Fortran, which is, I think, now virtually extinct. But in the days before having databases, a lot of the trials units wrote their own software to do the analyses and to store the data - and the language of choice was Fortran.

So I answered an advert for Fortran programmers and came to work at the then Clinical Trial Service Unit in Oxford and started working on a large trial of bowel cancer called QUASAR, which is still producing results even thirty years later. I started working on that - and I found the idea of trying to sort of, as it were, disprove everything, far more interesting than the idea of trying to prove a theorem.

So it was a bit of a change in approach. It was great fun and it still is great fun!

::

For those of us who don't have the same mathematical underpinnings or foundations - do you find that there are certain concepts around statistics that we find particularly difficult to grasp?

::

You know, if you say "This drug is likely to improve your five year chance of surviving from 50% to 60%"... Well, that's fine, but actually half of the people would have survived anyway without the drug and don't need it, and the 40% of the people who died with the drug didn't necessarily get any benefit out of it either. So it's that sort of idea of dealing with chances and probabilities. It's quite difficult, I think.
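[Editor's note: a minimal Python sketch of the arithmetic above, using the hypothetical 50% and 60% survival figures from the discussion.]

```python
# Hypothetical figures from the discussion: a drug that raises
# five-year survival from 50% to 60%, illustrated per 100 patients.
survival_without = 0.50
survival_with = 0.60
patients = 100

survive_anyway = patients * survival_without          # 50 survive without the drug
die_regardless = patients * (1 - survival_with)       # 40 die even with the drug
helped = patients - survive_anyway - die_regardless   # 10 actually benefit

absolute_benefit = survival_with - survival_without   # 0.10
nnt = 1 / absolute_benefit                            # number needed to treat = 10

print(f"Per {patients} patients: {survive_anyway:.0f} survive anyway, "
      f"{die_regardless:.0f} die regardless, {helped:.0f} are saved by the drug.")
print(f"Number needed to treat to save one life: {nnt:.0f}")
```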

And then, to actually understand that statistics can be funny things. There's a great tendency to sort of want to drill down into the detail of the data and look at subgroups of patients and things like that. And of course, the experiments are not designed to be able to do that. They're designed to give an average.

So I think the two big concepts are the idea that we're dealing with averages and that we have uncertainty. We could go on and we could talk for days about P values, but one of the things about a P value is that it's not the strength of the effect of the drug. It's the amount of evidence that you've got.

So you can get a small P value from a modest effect in a very large trial, versus a large effect in quite a small trial giving exactly the same P value. Same drug. Just a different size of trial.
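[Editor's note: a small illustration of this point, not from the episode - a standard two-proportion z-test (normal approximation) applied to the same hypothetical effect at two trial sizes.]

```python
import math

def two_sided_p(p1, p2, n):
    """Two-sided P value for a difference in proportions,
    with n patients per arm (pooled normal approximation)."""
    pooled = (p1 + p2) / 2
    se = math.sqrt(pooled * (1 - pooled) * (2 / n))
    z = (p2 - p1) / se
    return math.erfc(abs(z) / math.sqrt(2))

# Same hypothetical effect (50% vs 60% survival), different trial sizes.
for n in (100, 1000):
    print(f"n = {n:4d} per arm: P = {two_sided_p(0.50, 0.60, n):.2g}")
# n =  100 per arm: P ≈ 0.16   - "not significant"
# n = 1000 per arm: P ≈ 7e-06  - "highly significant"; same drug, same effect
```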

Take something like the ISIS-2 trial of aspirin after a heart attack. That's a really good trial to look at from that point of view. It's not hematology, but it's a very large simple trial in the sense that: you have a heart attack, you either take an aspirin or a placebo, and at the end of the day all that people have done is count the number of deaths. It's very similar to these very large trials like RECOVERY that have been done during COVID as well. It's simple to do. It can be done in the emergency setting and the like.

But one thing that was done when it was written up - and it was actually at the request of The Lancet - was that a subgroup analysis was put in to show the dangers of trying to look into subgroups and say "Things work in this group and not in that group".

The subgroup they chose was "star sign". And they found that in two star signs there was no benefit of aspirin, but in the other ten there was a large benefit of aspirin. So fundamentally, either this is a chance finding or astrology really does work. You know... I'll leave everybody to decide what they think is the more likely solution there.
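[Editor's note: a quick illustration, not from the episode, of how easily twelve subgroup tests produce a spurious "significant" result when there is no real effect at all.]

```python
import random

# If each of 12 truly null subgroups is tested at P < 0.05, the chance
# that at least one comes up "significant" purely by chance is:
print(1 - 0.95 ** 12)  # ≈ 0.46 - nearly a coin flip

# The same point by simulation: repeat "12 null subgroup tests" many times.
random.seed(1)
trials = 10_000
false_positive_runs = sum(
    any(random.random() < 0.05 for _ in range(12)) for _ in range(trials)
)
print(false_positive_runs / trials)  # ≈ 0.46 again
```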

::

I remember reading an article. I think it was The Science of Nature or something like that, a few years ago, that was actually suggesting that maybe we get rid of P values altogether, because they lead to so much kind of "magical thinking" about this kind of arbitrary threshold - and saying "Why don't, actually, we just go for 95% confidence intervals", which (in some ways) gets people closer to perhaps what's really going on.

I just wonder if you've got any comments on that and on this kind of idea of the kind of P ≤ 0.05 threshold as this magical truth-telling device?

::

The first thing I probably ought to say is that for some statistical tests you are largely stuck with a P value. And that is the strength of the evidence.

Ronald Fisher, in a book in the 1920s, said:

"Personally, the writer prefers to set a low standard of significance at the 5% and ignore entirely all results which fail to reach this level".

So, he is saying a low standard of significance actually is quite a large P value. He's using P ≤ 0.05 and not P ≤ 0.01. But he's also saying that if P is bigger than 0.05, then you should ignore the result.

[inaudible]

But his idea here is that P ≤ 0.05 is a starting point for negotiations, rather than the end result, I think.

And that, I think, is important; that you shouldn't start stretching things. You know, if you like, getting P ≤ 0.05 - in a trial of M&Ms for whatever disease you want - is more likely than taking two dice and throwing a double six. You know, that's not that uncommon.
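[Editor's note: the dice comparison, worked through - my numbers, not the speaker's.]

```python
double_six = 1 / 36   # probability of throwing a double six ≈ 0.028
threshold = 0.05      # the conventional P value cut-off

print(f"Double six: {double_six:.3f} vs P threshold: {threshold:.3f}")
# A chance finding at P = 0.05 is nearly twice as likely as a double six -
# and nobody is astonished when a double six comes up.
```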

But I do think that actually there is an awful lot to be said for that. And I think, certainly as time has gone on and as outcomes have improved, it's not just whether something works. It's whether it works well enough. And "well enough" is something you probably need to think about in a number of different ways.

There's obviously the cost in toxicity to the patient. The balance between early toxicity and sometimes even early mortality - and late benefit. So, you know, there is a balance of benefits and risks going on; which is sort of masked by a P value.

There is also, I think, particularly in places like the UK with a finite health budget, an issue of what is actually worth doing in terms of value for money as well as everything else. I think it's really important to understand all of those things. And this is where estimation comes in.

So you can say that something is significant. Well it could actually have an effect that is too small to worry about. What you want to say is that this effect is going to be big enough even -- You know... What you really want is a slam dunk. You know?

The smallest possible estimate of the effect here is big enough to be worth doing. And now you're convincing the payers to actually do it.
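[Editor's note: one way to express the "slam dunk" idea in code - a sketch with hypothetical numbers, where the decision rests on the confidence interval's lower bound rather than on the P value.]

```python
import math

# Hypothetical trial: survival 50% (control) vs 60% (drug), 500 per arm.
p_control, p_drug, n = 0.50, 0.60, 500

diff = p_drug - p_control
se = math.sqrt(p_control * (1 - p_control) / n + p_drug * (1 - p_drug) / n)
lower, upper = diff - 1.96 * se, diff + 1.96 * se  # 95% confidence interval

# Assumed smallest benefit worth paying for (e.g. set by cost-effectiveness).
minimal_worthwhile = 0.05

print(f"Estimated benefit: {diff:.2f} (95% CI {lower:.2f} to {upper:.2f})")
print("Significant (CI excludes zero)?    ", lower > 0)
print("Worth doing even in the worst case?", lower >= minimal_worthwhile)
# Here the result is "significant", but the lower bound (≈ 0.04) falls short
# of the assumed worthwhile effect - significant, yet not a slam dunk.
```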

::

So I like estimation. I think estimation is a very good way of looking at things. It tells you sort of, you know, how many people you need to treat to save a life; to give so many life years. And I also like the idea of breaking down these sorts of crude composites, like overall survival, into understanding what's - in hematology - likely to be an early dis-benefit because of giving a toxic drug versus a later benefit.

Other diseases - it's the other way around. So, you know, we give anthracyclines in AML; and the risk of the anthracycline is induction death. You give anthracyclines in breast cancer - and the way it's given - there's not a huge rate of induction death, but there is actually a linked risk of developing AML and heart disease with it.

So the different time windows - the horizons - differ, according to your setting. That's important to understand. I think as we get better and better and better and the drugs get more and more expensive for more and more marginal benefits, understanding what you can do with a basic drug and what you can do with an expensive drug is very important.

Not least because you'd like to deploy the expensive drug - and the toxic drug, in particular - as infrequently as you have to.

And this is why there are still debates over the value of transplant in certain areas of hematology. It's because obviously the benefits are not gigantic. They're moderate. There are disbenefits - quality of life, transplant-related mortality and morbidity. And you have to say to yourself: "Well, where is the balance point?", in order to give somebody a transplant and not do more harm than good.

[inaudible]

::

The good trend is that trials are much more likely to have a proper sample size calculation, and we are much more likely to see hazard ratios and confidence intervals, so that you can actually see the effect size that's going on.

There are some trends that are much more difficult. One of them is, as it were, not the fault of the trialists at all, but the fault of the funders - and that's the ability to get long-term follow-up. Long-term follow-up is absolutely crucial, I think, because we want to know that there are no late effects, or how long these effects last for.

You know, if you are merely delaying a recurrence by six months, but the outcome after the recurrence is the same, that's very different from preventing a recurrence or turning an early recurrence (which is bad risk) into a late recurrence (which is good risk). So there are all sorts of things there that long-term follow-up helps with.

I think we're much better in hematology, and I think hematology has probably led the way in terms of getting samples associated with clinical trials. I think they're very good. And of course, outcomes have improved an awful lot.

So that changes the question, again, from one of purely survival into survivorship. At what cost are you actually living that little bit longer?

So that means that issues like quality of life become ever more important. Quality of life, you could argue, is much less important when hardly anybody is surviving, because being alive, even in a less healthy state, is better than being dead.

But when you're actually looking at marginal improvements, quality of life becomes very important, I think, there.

The other thing, of course, is with all these samples we're discovering that the conditions that we sort of lump together under headings like "AML" and "ALL" are not just one thing. They are a variety of things.

So we now think of APL - Acute Promyelocytic Leukaemia - as different from AML... and it is. From a statistician's point of view, it's different because you treat it differently. It's a different condition - because actually the approach you go in with is different. So it needs a different look. You know, the introduction of ATRA and arsenic has transformed it from a very bad risk subtype into a very, very good risk group.

And I think that there is an understandable desire to try and look at these sort of targeted therapies and to try and look at what mutations are actually driving the leukaemia and what mutations drive the response to therapy.

And that becomes very difficult because we're not looking at breast cancer or lung cancer or bowel cancer or heart attacks or childbirth; where you've got tens or even hundreds of thousands of cases you can get hold of every year. We're looking at, generally for AML, three thousand cases a year. A lot of those are quite old. And so, the treatment options are limited.

So you end up dividing what's already quite a small cake into smaller and smaller pieces - and it becomes very difficult then to run randomised trials. So you're really relying on a single-arm study, which is in essence a case series - and the decision to enter the trial is now something that is important.

You know, a 40% response rate in good risk patients is going to be interpreted very differently from 40% in bad risk patients. So until you know what risk these patients are, you can't interpret it. And that, of course, was the benefit of randomization. It doesn't matter about the selection going in. What you're actually looking at is the difference between them.

Generally speaking, it's quite rare that you find that the decision to enter a trial actually affects the effectiveness of any treatment. Treatments are generally effective in a much wider population than may necessarily have been in the first trial.

In these small case series of patients treated with, you know, a targeted agent, it becomes much harder to understand what's going on. And I think it's then very easy to either find a comparative group that does very badly, or a comparative group that does quite well.

So, which one do you believe? It becomes very hard - and I think that's a real challenge going forward: if we don't all band together, then we're not going to be able to get reliable evidence that we can all understand.

::

Perhaps just to kind of set the scene on that, I'd be interested to know: if you look back on an early career version of yourself - when you were first new to trials and healthcare - and your own understanding of statistics then, is there anything that you've learned that's changed your approach to understanding how statistics works over that time?

::

Again, there's a lovely quote by G. O. Ashley: "Like other occult techniques of divination, the statistical method has a private jargon deliberately contrived to obscure its methods from non-practitioners".

I think this is probably the in-built sceptic in me. Often, the more jargon there is, the less intellectual content there is behind it.

Therefore, it's all about being able to have a conversation and have a conversation in, you know, real words with people.

You know, if I were to name the thing that I learned, it's that, as a statistician, you've got to talk to the clinician and understand as much as possible of what they're doing. It is teamwork.

Again, it's important to get the statistician involved early on in the idea; because, you know, if the experiment has gone wrong, you're just doing a postmortem. But the statistician doesn't necessarily understand the clinical relevance of what's going on.

So being able to understand that enables the right question to be asked. And the important thing is to ask the right important question and to answer it reliably, and in a way that people are going to be convinced.

So, yes. You're absolutely right. The jargon of statistics puts a lot of people off. I think this is one of the reasons where the idea of significant and not significant is very, very attractive to people, because it's a straightforward yes/no.

I think we've all seen examples where things have been written up in such an obscure and complicated way. The ultimate aim here is not to get a number at the end of it. It's to actually save lives and improve peoples' outcomes.

So if you do something that nobody believes, what's the point of doing it? You've taken the goodwill of the participants in your study and said "Well, you know, I don't care" or "I'm just going to do what I want to do with this and I'll come up with some numbers".

If it doesn't change practice, it doesn't matter.

So the simple methods are always the best from that point of view, because they can be explained. But I do think it's important that every clinician, really, should have a statistician that they can go to and talk to.

Alan Burnett always talked of it as a "walk on the beach" - and that's absolutely right. Rather than doing this by email or anything like that, the more informal the way of looking at it, the better.

I have learned an awful lot about hematology and the like from some very, very important hematologists over the years. I think that what's also important here is that the statistician justifies their existence to the clinician or the lab scientist as to what's going on.

You know, it's a specialist occupation being a statistician. So you know, you don't expect me to see a patient. So I shouldn't expect you to have to analyze data. To be honest with you, you probably shouldn't analyze data, because it's a question of experience. And that's the whole thing for me. It's a partnership and it's going forward.

You know, the thing that I learned - and I was very, very fortunate to be exposed to - is people who will talk about their subject.

::

One thing I am aware of is, when I'm reading a clinical trial report, I've often got this fear that there's some sort of trickery going on. There's something that's kind of happening...

But it's not a scepticism in a clever way, where you can kind of really unearth it. It's more this cynicism that's like "Ah, can I trust this? Can't I trust this?"

And obviously, there's lots of different parts to reading clinical trials - and it's not all about the statistics. But I guess I'm just kind of wondering what your advice would be to the clinician who fears that the wool is being pulled over their eyes with stats in some way or another.

It doesn't feel like they really know exactly what they should and shouldn't be looking for. They don't know how much they can trust that due diligence has been done with the statistics before they're reading it. They kind of want to get beyond just that general sense of doubt, to something a bit more solid.

::

I would go back to various series - you know, there are some in the old BMJ, there are some in various places - and ask: "How would I cheat if I was doing this thing?". And if it's not explained why you can't cheat, then start to distrust it. It's okay to be sceptical. I mean, you know, that's what a statistician does.

You're the optimists and we're the pessimists.

::

--Even if it's just having better conversations with the statistician they're working with. Where is a good place to start? Where is a good place to go first with wanting to take their learning forward?

::

I'd say there are some very good articles by Doug Altman and Martin Bland, written over the course of about the last thirty years, called the "Statistics Notes", in The British Medical Journal.

There are other ones available in other journals as well. There are some quite nice little books like Statistics at Square One and Statistics with Confidence.

I genuinely think that the best way is to actually question somebody over and over again. If you don't get it, just find somebody and latch onto them. Statisticians are great.

We'll always work for alcohol. So you'd buy us a beer and we'll do anything, basically. I mean, food, beer, coffee. You know, we're very good; if you give us something like that, then we'll talk for hours and hours and hours.

::

We hope you will join us for future podcast episodes.


About the Podcast

HemaSphere Podcast
The HemaSphere Podcast focuses on casual and insightful discussions about select HemaSphere publications and hematology research.
HemaSphere is the official online, open access journal of the European Hematology Association (EHA). HemaSphere publishes exciting basic, translational, and clinical research in hematology. We are pleased to introduce our new podcast series where our host and guest speakers exchange ideas about select HemaSphere publications. Spend some time with us and enjoy casual and insightful discussions about hematology research. Whether or not you have read the publications, now hear the stories. HemaSphere is always accessible; our full range of open access publications is available at www.hemaspherejournal.com.

About your host


European Hematology Association

The European Hematology Association promotes excellence in patient care, research, and education in hematology. We serve medical professionals, researchers, and scientists with an active interest in hematology. We are proud to be the largest European-based organization connecting hematologists worldwide to support career development and research, harmonize hematology education, and advocate for hematologists and hematology.