If John Carlisle had a cat flap, scientific fraudsters could rest easier at night. Carlisle routinely gets up at 4:30 a.m. to let out Wizard, the family cat. Unable to get back to sleep, he reaches for his laptop and begins typing up data from published reports of clinical trials. Before his wife's alarm sounds 90 minutes later, he has usually managed to fill a spreadsheet with the ages, weights and heights of hundreds of people.
By day, Carlisle is an anesthesiologist working for England's National Health Service in the coastal town of Torquay. In his free time, however, he combs the scientific literature for suspicious data in clinical research. Over the past decade, he has scrutinized trials covering a wide range of health topics, from the benefits of specific diets to guidelines for hospital treatment. His work has led to hundreds of papers being retracted or corrected because of misconduct or error. And it has helped end the careers of some prolific fraudsters: of the six scientists with the most retractions worldwide, three were brought down using variants of Carlisle's data analysis.
"His technique has proven incredibly useful. Says Paul Myles, director of anesthesia and perioperative medicine at Alfred Hospital in Melbourne, Australia, who has worked with Carlisle to investigate research that includes shady statistics. "He used it to show some important examples of fraud."
Carlisle's statistical sleuthing is not popular with everyone. Critics argue that it has sometimes called into question papers with no obvious flaws, casting unwarranted suspicion on their authors.
But Carlisle believes the work helps to protect patients, so he keeps spending his free time poring over other people's studies. "I do it because my curiosity motivates me," he says, not out of an overwhelming zeal to expose misconduct: "It's important not to become a crusader against wrongdoing."
Together with the work of other researchers, his efforts suggest that science's gatekeepers – journals and institutions – could do much more to spot errors. In medical research, the field Carlisle focuses on, the stakes can be a matter of life and death.
Anesthetists who misbehave
Torquay looks like many traditional English seaside towns, with pretty flower displays on the roundabouts and just enough pastel-colored beach huts to stand out. Carlisle has lived in the area for 18 years and works at the town's general hospital. In an empty operating room, shortly after a patient has been stitched up and wheeled away, he explains how he began hunting for faked data in medical research.
More than a decade ago, Carlisle and other anesthetists began discussing results published by a Japanese researcher, Yoshitaka Fujii. In a series of randomized controlled trials (RCTs), Fujii, then working at Toho University in Tokyo, claimed to have examined the effects of various drugs on preventing vomiting and nausea in patients after surgery. But the data looked too clean to be true. Carlisle, one of many who were concerned, decided to check the numbers, using statistical tests to identify unlikely patterns in the data. In 2012, he showed that the probability of the patterns having arisen by chance was, in many cases, "infinitesimally small"1. Prompted in part by this analysis, the editors of journals that had published Fujii's work asked his current and former universities to investigate. Fujii was dismissed from Toho University in 2012 and has had 183 of his papers retracted – an all-time record. Four years later, Carlisle published an analysis of results from another Japanese anesthesiologist, Yuhji Saitoh – a frequent co-author of Fujii's – showing that his data, too, were deeply suspicious2. Saitoh currently has 53 retractions.
Other researchers soon cited Carlisle's work in their own analyses, using variants of his approach. In 2016, for example, researchers in New Zealand and the United Kingdom reported problems in publications by Yoshihiro Sato, a bone researcher at a hospital in southern Japan3. That analysis eventually led to 27 retractions; in total, 66 of Sato's articles have been withdrawn.
The field of anesthesia had been rocked by several fraud cases before Fujii and Saitoh came to light – including that of German anesthesiologist Joachim Boldt, who has had more than 90 papers retracted. But Carlisle began to wonder whether the problem extended beyond his own specialty. So he selected eight leading journals and, in his spare time, reviewed thousands of randomized trials they had published.
In 2017, he published an analysis in the journal Anaesthesia reporting suspect data in 90 of more than 5,000 trials published over a 16-year period4. At least ten of those papers have since been retracted and six corrected, including a high-profile study published in the New England Journal of Medicine on the health benefits of the Mediterranean diet. In that case, there was no suggestion of fraud: the authors had made an error in randomizing participants. After the authors removed the erroneous data, the paper was republished with similar conclusions5.
Carlisle has kept going. This year, he raised concerns about dozens of anesthesia studies by an Italian surgeon, Mario Schietroma, at the University of L'Aquila in central Italy, warning that they are not a reliable basis for clinical practice6. Myles, who worked with Carlisle on that report, raised the alarm last year after noticing similarities in the raw data for control and patient groups in five of Schietroma's papers.
The challenges to Schietroma's claims have had impacts on hospitals around the world. The World Health Organization (WHO) cited Schietroma's work when, in 2016, it recommended that anesthetists routinely increase the level of oxygen they deliver to patients during and after surgery, to reduce infection. That was a controversial call: anesthesiologists know that in some procedures too much oxygen can be associated with an increased risk of complications – and the recommendation would have forced hospitals in poorer countries to spend more money on expensive bottled oxygen, Myles says.

The five papers Myles had flagged were quickly retracted, and the WHO downgraded its recommendation from "strong" to "conditional", meaning that clinicians have more freedom to make different choices for different patients. Schietroma says his calculations were assessed by an independent statistician and through peer review, and that he deliberately selected similar groups of patients, so it is not surprising that the data sets match closely. He also says he lost raw data and documents relating to the trials when L'Aquila was struck by an earthquake in 2009. A spokesperson for the university said it had passed inquiries to "the investigating authorities", but did not explain which bodies these were or whether any investigation is under way.
Recognizing Unnatural Data
The essence of Carlisle's approach is nothing new, he says: genuine data contain natural patterns that fabricated data struggle to replicate. Such phenomena were first noticed in the 1880s, were popularized by US electrical engineer and physicist Frank Benford in 1938, and have been used by statistical checkers ever since. Political scientists, for example, have long applied a similar approach to survey data, using a technique known as Stouffer's method after the sociologist Samuel Stouffer, who popularized it in the 1950s. In the case of RCTs, Carlisle examines the baseline measurements that describe the characteristics of the groups of volunteers in a trial – typically a control group and an intervention group. These include height, weight and relevant physiological characteristics, usually reported in the first table of a paper.
In a genuine RCT, volunteers are assigned at random to the control group or to (one or more) intervention groups. Consequently, the mean and standard deviation of each characteristic should be roughly the same across the groups – but not identical: that would be suspiciously perfect.
Carlisle first computes a P value for each characteristic: a statistical measure of how likely the reported baseline figures are, assuming volunteers really were assigned to the groups at random. He then combines all of these P values to get a sense of how plausibly random the measurements are overall. A combined P value that is too high suggests the data are suspiciously well balanced; one that is too low could indicate that patients were not properly randomized.
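The idea can be sketched in a few lines of Python. This is a simplified illustration, not Carlisle's published method (which handles many complications, including rounding and non-normal variables): it uses a normal approximation for each baseline comparison and Stouffer's method to combine the P values. The group sizes, means and standard deviations below are invented.

```python
from math import sqrt
from statistics import NormalDist

ND = NormalDist()

def baseline_p(mean_a, sd_a, n_a, mean_b, sd_b, n_b):
    """Two-sided P value for the difference between two group means,
    using a normal approximation to the two-sample t-test."""
    se = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)
    z = abs(mean_a - mean_b) / se
    return 2 * (1 - ND.cdf(z))

def stouffer_combined_p(p_values):
    """Combine per-variable P values with Stouffer's method: map each
    P to a z score, sum, rescale, and map back to a P value. Under
    genuine randomization each P is uniform on (0, 1), so the combined
    P should sit near neither extreme."""
    z_sum = sum(ND.inv_cdf(p) for p in p_values)
    return ND.cdf(z_sum / sqrt(len(p_values)))

# Invented baseline table, n = 50 per arm: the two groups are
# implausibly similar on every variable.
table = [
    # (mean_ctrl, sd_ctrl, mean_trt, sd_trt)
    (54.20, 8.0, 54.21, 8.1),    # age, years
    (71.50, 9.5, 71.48, 9.4),    # weight, kg
    (169.30, 7.2, 169.32, 7.3),  # height, cm
]
ps = [baseline_p(m1, s1, 50, m2, s2, 50) for m1, s1, m2, s2 in table]
combined = stouffer_combined_p(ps)
print("per-variable P values:", [round(p, 3) for p in ps])
print(f"combined P: {combined:.5f}")  # near 1: suspiciously well balanced
```

A combined P this close to 1 would not prove fraud, but it is exactly the kind of "too good to be true" balance that prompts a closer look at a paper.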
The method is not foolproof. The statistical checks require the variables in a table to be truly independent – which is often not the case in reality. (Height and weight, for example, are related.) In practice, this means that some papers get flagged as problematic when they are not – and that is why some statisticians have criticized Carlisle's work.
But Carlisle argues that the method is a good first step, one that can highlight studies deserving closer scrutiny – for example, by requesting the individual patient data behind a paper.
"She can put a red flag. Or a yellow flag or five or ten red flags that say it's most likely not real data, "says Myles.
Mistakes vs. Rogues
Carlisle says he is careful not to ascribe causes to the possible problems he identifies. But in 2017, when his analysis of 5,000 studies appeared in Anaesthesia – of which he is an editor – an accompanying editorial by anesthetists John Loadsman and Tim McCulloch of the University of Sydney, Australia, took a more provocative line7.
It spoke of "dishonest authors" and "serial offenders", and suggested that "more authors of already published RCTs will eventually feel a tap on the shoulder". It also said: "A strong argument could be made that every journal in the world now needs to apply Carlisle's method to all the RCTs they have ever published."
That prompted a sharp response from the editors of one journal, Anesthesiology, which had published 12 of the papers Carlisle flagged as most problematic. "The Carlisle article is ethically questionable and does a disservice to the authors of the previously published articles it impugns," wrote the journal's editor-in-chief, Evan Kharasch, an anesthesiologist at Duke University in Durham, North Carolina8. His editorial, co-written with anesthesiologist Timothy Houle of Massachusetts General Hospital in Boston, who serves as statistical consultant to Anesthesiology, highlighted issues such as the method's potential to produce false positives. "A valid forgery-detection method (analogous to software for detecting plagiarism) would be welcome. The Carlisle method is not such," they wrote in correspondence to Anaesthesia9.
In May, Anesthesiology corrected one of the papers Carlisle had highlighted, noting that it had reported "systematically incorrect" P values in two tables, and that the authors had lost the original data and could not recalculate the values. Kharasch, however, says he stands by the views in his editorial. Carlisle says Loadsman and McCulloch's editorial was "reasonable", and that the criticism of his work does not undermine its value. "I'm sure it's worth the effort, while others may not be," he says.
The data checkers
Carlisle's method is not the only approach used in recent years to check published data. Michèle Nuijten, who studies analytical methods at Tilburg University in the Netherlands, has developed what amounts to a spellchecker for statistics, called statcheck, which scans journal articles to verify that the statistics they describe are internally consistent. For example, statcheck tests whether the figures reported in the results section match the stated P values. It has been used to identify mistakes – usually typing errors – in decades' worth of journal articles.
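The underlying consistency idea is simple. statcheck itself parses full APA-formatted results (t, F, r, χ² and z tests); as a rough stdlib-only illustration, the sketch below recomputes the two-sided P value implied by a reported z statistic and flags a mismatch with the reported P, allowing for rounding. The numbers are invented.

```python
from statistics import NormalDist

def implied_p(z):
    """Two-sided P value implied by a reported z statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def inconsistent(z, reported_p, decimals=2):
    """True if the reported P cannot be the implied P rounded to
    `decimals` places -- the core of statcheck's consistency test."""
    return abs(round(implied_p(z), decimals) - reported_p) > 10**-decimals / 2

flag_a = inconsistent(2.05, 0.04)  # implied P ~ 0.040: consistent, not flagged
flag_b = inconsistent(1.20, 0.01)  # implied P ~ 0.230: flagged as inconsistent
```

A flagged result is not proof of wrongdoing; most mismatches turn out to be transcription errors rather than fabrication.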
And Nick Brown, a graduate student in psychology at the University of Groningen, also in the Netherlands, and James Heathers, who studies scientific methods at Northeastern University in Boston, Massachusetts, have used a program called GRIM to check reported statistical means and flag suspicious data.
Neither technique works for papers describing RCTs of the kind Carlisle has assessed. statcheck relies on the strict data-reporting format of the American Psychological Association, and GRIM works only when the data are integers, such as the discrete values generated by psychology questionnaires in which participants rate items from 1 to 5.
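GRIM's core arithmetic is easy to sketch: the mean of n integer responses must be some integer total divided by n, so many means reported to two decimal places are simply impossible for a given sample size. A minimal Python version (the sample values below are invented for illustration):

```python
def grim_consistent(reported_mean, n, decimals=2):
    """GRIM check: can a mean reported to `decimals` places arise
    from n integer-valued responses? True if some integer total,
    divided by n, rounds to the reported mean."""
    nearest = int(round(reported_mean * n))
    for total in (nearest - 1, nearest, nearest + 1):
        if abs(round(total / n, decimals) - reported_mean) < 1e-9:
            return True
    return False

# With n = 28 participants answering on an integer scale:
possible = grim_consistent(3.46, 28)    # 97 / 28 = 3.464... rounds to 3.46
impossible = grim_consistent(3.48, 28)  # no integer total / 28 rounds to 3.48
```

A single impossible mean might be a typo; several in one paper suggest the summary statistics were never computed from real responses.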
Interest in such checks is growing, says John Ioannidis of Stanford University in California, who studies scientific methods and advocates better use of statistics to improve reproducibility in science. "They are wonderful tools and very ingenious." But he cautions against jumping to conclusions about the reasons for the problems identified. "It's a completely different landscape when it's fraud versus when it's a typo," he says.
Brown, Nuijten and Carlisle agree that their tools can only flag potential problems, which then need to be investigated. "I really do not want to associate statcheck with fraud," says Nuijten. For Ioannidis, the real value of such tools will be in checking papers for problematic data before publication – preventing fraud or mistakes from reaching the literature in the first place.
Carlisle says a growing number of journal editors have approached him about using his technique in this way. For now, most of that effort happens unofficially, on an ad hoc basis, and only when editors are already suspicious.
At least two journals have gone further and now use statistical checks as part of the publication process for all papers. Carlisle's own journal, Anaesthesia, applies them routinely, as does the NEJM. "We're trying to prevent a rare but potentially serious adverse event," says a spokesperson for the NEJM. "It's worth the extra time and extra cost."
Carlisle is impressed that a journal with the status of the NEJM has introduced such checks, which he knows first-hand to be laborious, time-consuming and not universally popular. But automation will be essential to run them at the scale needed to screen even a fraction of the roughly two million papers published worldwide each year. He believes it can be done: statcheck works that way and, according to Nuijten, is routinely used by several psychology journals to screen submissions. And researchers have used text-mining techniques to extract the P values from thousands of papers, for instance to investigate P-hacking – tweaking data until it yields significant P values.
One problem, say several researchers in the field, is that funders, journals and much of the scientific community give such checks relatively low priority. "It's not a very rewarding job," says Nuijten. "You are trying to find errors in other people's work, and that will not make you very popular."
And even when a study is found to be problematic, the matter is not always resolved. In 2012, researchers in South Korea submitted to Anesthesia & Analgesia a report of a trial examining how the tone of the facial muscles indicates the best time to insert breathing tubes into the throat. Asked unofficially to take a look, Carlisle spotted discrepancies between the patient data and the summary data, and the paper was rejected.
Remarkably, it was then submitted to Carlisle's own journal with different patient data – but Carlisle recognized the paper. It was rejected again, and the editors of both journals raised their concerns with the authors and their institutions. To Carlisle's astonishment, the paper appeared a few months later – unchanged from the version he had last seen – in the European Journal of Anaesthesiology. After Carlisle shared the paper's dubious history with that journal's editor, it was retracted in 2017, for "irregularities in their data, including misrepresentation of results"10.
Having seen so many cases of fraud sit alongside simple typos and mistakes, Carlisle has developed his own theory about what leads some researchers to fabricate data. "They believe that, this time, random chance has got in the way of the truth they know about how the universe really works," he says. "So they change the result to what they think it should have been."