When the Patient Protection and Affordable Care Act was being drafted, the wonks behind it were excited by, among other things, the opportunity to experiment with new ways to deliver services more cheaply and effectively. After all, isn’t that what we all want -- cheaper, better health care?

So the law contained provisions that were supposed to help us get there by trying new approaches and seeing what worked. Two of them hit the news this weekend: the accountable-care organization provisions, which were supposed to achieve better and cheaper health care through consolidating and coordinating care, and the Centers for Medicare and Medicaid Services' Innovation Center, which runs pilot projects on delivery system reform. But two articles, one in the New York Times and the other in the Washington Post, suggest that the administration’s approach to data analysis may be too weak to get good results from these experiments.

The Innovation Center, for example, has unaccountably decided not to use its resources for randomized controlled trials, writes Gina Kolata:

The studies that are regarded as the most reliable randomly assign people or institutions to participate in a program or to go on as usual, and then compare outcomes for the two groups to see if the intervention had an effect.

Instead, the Innovation Center has so far mostly undertaken demonstration projects; about 40 of them are now underway. Those projects test an idea, like a new payment system that might encourage better medical care -- with all of a study’s participants, and then rely on mathematical modeling to judge the results.

Dr. Patrick Conway, the director of the center, defended its reliance on demonstration projects, saying they allowed researchers to evaluate programs in the real world and regularly adapt them. “Does it look like it is working?” he asked. “If it does not look like it is working, we can stop.”

He said that the center has had trouble getting such studies to yield solid results because those in the control groups -- who do not get the innovation being tested -- tend to drop out.

This is fantastically disappointing. The whole point of a randomized controlled trial is that it’s the best way to know whether something is working. That’s why we use such trials to determine which drugs are safe, rather than running a demonstration project in which we give everyone the drug and then see whether they look unusually sick.

The administration’s approach, by contrast, is excessively prone to the perils of the promising pilot program: The effect you think you’re picking up can too easily be spurious, and even if it isn’t spurious, you can’t necessarily scale it. Of course, that’s also potentially true of a randomized controlled trial; something that really does work well in a small group may be a disaster in a large one.

But at least you’re starting from a solid base, with something you believe probably works. Approaches that rely on mathematical modeling are inherently more vulnerable to researchers finding the result they want to see -- not because the researchers are dishonest, but because the more freedom you give yourself to specify your analytic methods, the more danger there is of picking the method that validates what you want to find.
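To put a rough number on that danger, here is a minimal sketch, not anything the Innovation Center actually runs. It assumes the intervention has no effect at all, that a researcher can choose among k analytic methods, and (a simplifying assumption of mine) that each method's result is independent noise judged at the usual 5 percent significance level. The function names and parameters are hypothetical.

```python
import random
from statistics import NormalDist


def analytic_rate(k, alpha=0.05):
    """Chance that at least one of k independent null analyses looks significant."""
    return 1 - (1 - alpha) ** k


def simulated_rate(k, alpha=0.05, trials=20000, seed=0):
    """Estimate the same chance by simulation, with a true effect of zero."""
    rng = random.Random(seed)
    # Two-sided cutoff for a standard normal test statistic (about 1.96).
    cutoff = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(trials):
        # Each analysis yields a z-statistic that is pure noise; the
        # researcher reports whichever one looks most impressive.
        best = max(abs(rng.gauss(0, 1)) for _ in range(k))
        if best > cutoff:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    for k in (1, 5, 20):
        print(k, round(analytic_rate(k), 3), round(simulated_rate(k), 3))
```

Under those assumptions, a single pre-specified analysis produces a spurious "success" about 5 percent of the time, while freedom to pick among 10 methods raises that to roughly 40 percent. Real analytic choices are correlated rather than independent, so the true inflation would differ, but the direction is the same.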

Of course, randomized controlled trials are more expensive and difficult to conduct. And yet, that doesn’t make me want to say: “You know, it’s hard to find candidates for Phase I trials. Let’s just skip them and model how this new drug works on the computer.” It is better to do fewer trials with better data quality than a lot of trials that don’t tell us much.

The administration may have chosen even weaker methods for evaluating the success of accountable-care organizations, according to Jenny Gold. It has announced that the organizations have so far saved the government $380 million. But what that number means is unclear:

“If these results and savings continue, this will be a phenomenal success story for the Medicare program,” said Jon Blum, principal deputy administrator at CMS.

Of the 114 Shared Savings Program ACOs in the first year, 54 had lower spending than projected. But just 29 generated enough savings to qualify to keep some of it, which totaled $126 million for the provider networks and an additional $128 million for the Medicare Trust Fund.

The Pioneer ACOs, high-performing organizations which take greater risks, were measured separately. Of the 23 Pioneer ACOs in the program, nine had significantly lower spending growth and also did well on their quality measurements. The Pioneer program generated a total of $147 million in gross savings.

Those numbers add up to $401 million, not $380 million, but CMS did not explain the discrepancy. It’s also unclear whether the savings figures factored in any losses from some of the ACOs that did not do well. And the agency did not release information about which ACOs saved money and which did not.
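The sum is easy to check. This short sketch uses only the three figures quoted above, in millions of dollars; the variable names are mine.

```python
# Figures in millions of dollars, as quoted in the article above.
shared_savings_to_providers = 126
shared_savings_to_trust_fund = 128
pioneer_gross_savings = 147
announced_total = 380

components_total = (
    shared_savings_to_providers
    + shared_savings_to_trust_fund
    + pioneer_gross_savings
)
gap = components_total - announced_total

print(f"components: ${components_total} million")  # components: $401 million
print(f"unexplained gap: ${gap} million")          # unexplained gap: $21 million
```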

The way you want to evaluate this program is to look at savings and losses by everyone in it, and then see whether the difference between what they spent in the program and out of it is statistically significant. In other words, you want to look at total net savings.

That’s because any random group of hospitals is likely to contain significant variance in how much individual hospitals spend on patients during the year -- some will randomly end up with extra-expensive patients, some will end up with an unusually hale group who hop out of bed the day after surgery and go home without complaint. If you just look at the “successful” group, you may end up with the mistaken impression that it is “achieving” savings, when it is really just lucking out on this random variance.
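A small simulation shows how this plays out. This is a sketch under made-up assumptions: 114 groups (the number of Shared Savings Program organizations cited above), a program that truly does nothing, and each group's spending landing above or below its projection by random noise with a hypothetical standard deviation of $5 million. None of these noise figures come from CMS.

```python
import random


def simulate(n_groups=114, noise_sd=5.0, seed=1):
    """Return (number of 'winners', gross savings, net savings) in $ millions."""
    rng = random.Random(seed)
    # Savings relative to projection for each group. The true effect is
    # zero, so every number here is luck: positive means spent less.
    results = [rng.gauss(0, noise_sd) for _ in range(n_groups)]
    winners = [r for r in results if r > 0]
    gross = sum(winners)  # counts only the groups that came out ahead
    net = sum(results)    # counts everyone, winners and losers alike
    return len(winners), gross, net


if __name__ == "__main__":
    n_winners, gross, net = simulate()
    print(f"groups that 'saved' money: {n_winners} of 114")
    print(f"gross savings (winners only): ${gross:.0f} million")
    print(f"net savings (everyone): ${net:.0f} million")
```

Because about half of the noise draws are positive, adding up only the winners produces a sizable savings figure even though the program did nothing, while the total across all groups hovers around zero. Changing the seed changes the exact numbers but not that pattern.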

Gold’s article implies that the administration is looking at gross savings -- which is to say, it's just reporting the amount of money saved by the accountable-care organizations that ended up on the positive side of the ledger, even though those organizations make up less than half the total.

Statisticians have a term for this: the Texas sharpshooter fallacy. David McRaney has written one of my favorite explanations:

The fallacy gets its name from imagining a cowboy shooting at a barn. Over time, the side of the barn becomes riddled with holes. In some places there are lots of them, in others there are few. If the cowboy later paints a bullseye over a spot where his bullet holes clustered together it looks like he is pretty good with a gun.

By painting a bullseye over a bullet hole the cowboy places artificial order over natural random chance.

If you have a human brain, you do this all of the time. Picking out clusters of coincidence is a predictable malfunction of normal human logic.

When you are dazzled by the idea of Nostradamus predicting Hitler, you ignore how he wrote almost 1,000 ambiguous predictions, and most of them make no sense at all. He seems even less interesting when you find out Hister is the Latin name for the Danube River.

I’m told that this kind of program evaluation is not unusual for civil servants. It’s not exactly unknown in the business world, either, though the folks in finance tend to get pretty beady-eyed and disparaging. It’s apparently endemic among would-be entrepreneurs.

Common, and understandable, perhaps, but still bunk. I’m frankly surprised to see this sort of thing coming out of such a wonky administration.

To contact the writer of this article: Megan McArdle at mmcardle3@bloomberg.net.

To contact the editor responsible for this article: James Gibney at +1-202-624-1863 or jgibney5@bloomberg.net.