Matthew Yglesias has a fantastic post about what’s wrong with data-mining programs like the one apparently being deployed by the NSA:
The problem is that when you’re searching for a rare condition, like being a terrorist, even a very precise statistical tool is going to overwhelmingly give you false positives. Ordinarily, when people are doing statistical analyses they take 95 percent confidence to constitute a statistically meaningful result. But there are 200 million people in the NSA pool and only a handful of terrorists. How many? Let’s be generous and say there are 200 al-Qaeda sleeper agents in the USA. Then you apply a 95 percent accurate statistical filter to 200 million people. What you’re going to wind up with are 10 terrorists labeled non-terrorists, 190 terrorists labeled terrorists, and a whopping 10 million non-terrorists labeled terrorists. That’s a process that works. You’ve reduced the size of your search pool by an order of magnitude. The program “works.” But what does it really accomplish? In practice, nothing. The NSA can’t hand the FBI the names of 10 million Americans and ask them to investigate–that would be a silly waste of time. Now what you can do is that if in addition to your secret, illegal, oversight-free call records database you’re also running a secret, illegal, oversight-free wiretapping operation is start listening to the content of everyone in the 10 million group’s conversations. Obviously, the manpower’s not going to exist to actually listen to all that, but maybe you have another data-mining algorithm that can run on the content. Say this one is also 95 percent accurate. That means 10 more terrorists will get away. And 7.5 million innocent people will be off the hook. But you’re still left with a pool of 2.5 million innocent people and only 180 terrorists left under suspicion. What you would do with that information just isn’t clear to me. There’s still not enough manpower to do serious investigations into all those people. 
And it would be insanely abusive anyway to subject such a huge group to invasive investigations when over 99.9 percent of them are totally innocent. Trying to compile a list of “people with Arab-sounding names” would be about as effective as these two computer algorithms.
So you’re not likely to catch many terrorists with a program like that. What such a database would be useful for is harassment and blackmail. Want to know who’s been spilling White House gossip to the New York Times? All you need is the reporter’s phone number and you can dramatically narrow down the list of likely leakers. Want to find out if a political opponent has a mistress? Pull up a list of his phone calls over the previous 6 months and you’ll have a short list in a matter of minutes.
In a lot of ways, that’s the most troubling aspect of this. You have a program that would be much more effective for abusive uses than it would be for its ostensible purpose. The people ultimately in charge of the program have a well-earned reputation for dishonesty and a well-earned reputation for hardball politics. They’ve gone out of their way to make sure that the program operates in total secrecy and is subject to no meaningful oversight. Why on earth would you want a program like that?
Update: Obviously 5% of 10 million is 500,000, not 2.5 million. I don’t think that really affects his argument, though.
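For anyone who wants to check the numbers, here is a rough sketch of the two-stage arithmetic from the quoted passage. It rests on assumptions that are mine, not Yglesias’s: “95 percent accurate” is read as meaning the filter correctly labels 95 percent of terrorists and 95 percent of non-terrorists, and the second filter is treated as independent of the first. The function name `screen` and the variable names are made up for illustration.

```python
def screen(terrorists, innocents, accuracy_pct=95):
    """Apply one filter pass; return (terrorists flagged, innocents flagged).

    Assumption: the filter is right about accuracy_pct percent of people
    in each group, so it wrongly flags the remaining percent of innocents.
    """
    flagged_terrorists = terrorists * accuracy_pct / 100
    flagged_innocents = innocents * (100 - accuracy_pct) / 100
    return flagged_terrorists, flagged_innocents


# Figures taken from the quoted passage.
population = 200_000_000
terrorists = 200
innocents = population - terrorists

# First pass: the call-records filter.
t1, i1 = screen(terrorists, innocents)

# Second pass: a content filter run only on the people flagged by the
# first pass (assumed independent of the first filter).
t2, i2 = screen(t1, i1)

# Share of the final suspect pool that is innocent.
share_innocent = i2 / (t2 + i2)

print(f"After pass 1: {t1:,.1f} terrorists, {i1:,.1f} innocents flagged")
print(f"After pass 2: {t2:,.1f} terrorists, {i2:,.1f} innocents flagged")
print(f"Innocent share of final pool: {share_innocent:.4%}")
```

Under those assumptions the second pass leaves roughly 500,000 innocent people in the pool, which matches the corrected figure in the update, alongside about 180 terrorists. The pool is smaller than the quoted 2.5 million, but it is still more than 99.9 percent innocent.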