According to documents leaked by former NSA contractor Edward Snowden, the NSA can automatically recognize and flag content in phone calls, using what it calls "Google for Voice." And though the technology, which the agency has been developing for at least a decade, is still imperfect, it can significantly aid human analysts in sifting through vast amounts of data.
The time it takes a human to listen to audio and transcribe it has always made collecting audio for surveillance prohibitively labor-intensive.
Thomas Drake, an NSA whistleblower who worked as a voice-processing crypto-linguist at the agency, told the Intercept that, despite the post-9/11 push to collect more audio communications, the limiting factor was having enough people to listen to it all.
"There weren’t enough ears," he said.
But recent explosive advances in voice recognition technology could change all that, with the age of "bulk listening" on the horizon.
"I think people don’t understand that the economics of surveillance have totally changed," Jennifer Granick, civil liberties director at the Stanford Center for Internet and Society, told the Intercept.
"Once you have this capability, then the question is: How will it be deployed?" she said. "Can you temporarily cache all American phone calls, transcribe all the phone calls, and do text searching of the content of the calls? It may not be what they are doing right now, but they’ll be able to do it."
Storable & Searchable, Across Languages & Accents
Though the Defense Advanced Research Projects Agency (DARPA) began funding voice recognition research in the 1970s, it is only in the last decade or so that the technology has really taken off.
However, Dan Kaufman, director of DARPA’s Information Innovation Office, insists that the technology — especially as applied to phone call audio — is far from perfect, because "there’s a lot of noise on the signal" and "it’s informal as hell."
One of the earlier tools, launched in 2004, was called RHINEHART; it allowed analysts either to filter content as it came in against a pre-determined set of keywords or to search the communications later, according to a 2006 document entitled "For Media Mining, the Future is Now!"
The document also indicated that RHINEHART was being used "across a wide variety of missions and languages."
Shortly thereafter, the NSA’s Human Language Technology (HLT) program introduced VoiceRT, first used in Baghdad, and which let analysts "index, tag, and graph" communications so they could "sort through millions of cuts per day and focus on only the small percentage that is relevant."
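To make concrete what keyword filtering over transcribed audio looks like, here is a minimal, purely illustrative sketch in Python. The watch-list terms and transcript "cuts" are invented for this example, and nothing here reflects the actual RHINEHART or VoiceRT implementations, which remain classified.

```python
# Toy keyword filter over a stream of transcribed "cuts."
# Illustrative only: keyword list and transcripts are invented examples.

KEYWORDS = {"shipment", "meeting", "transfer"}  # hypothetical watch list


def flag_transcript(transcript: str, keywords: set[str]) -> set[str]:
    """Return the watch-list terms that appear in a transcript."""
    words = {w.strip(".,!?").lower() for w in transcript.split()}
    return words & keywords


# Filter incoming cuts, keeping only the small fraction that match.
cuts = [
    "The meeting is moved to Friday.",
    "No news on my end.",
    "Confirm the transfer went through.",
]
flagged = {cut: flag_transcript(cut, KEYWORDS) for cut in cuts}
flagged = {cut: hits for cut, hits in flagged.items() if hits}
```

The point of even a crude filter like this is volume reduction: analysts only ever see the cuts that match, which is how a system could "sort through millions of cuts per day and focus on only the small percentage that is relevant."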
By 2009, the NSA's British analogue, GCHQ, noted that the NSA "have invested heavily in producing their own corpora of transcribed Sigint in both American English and an increasing range of other languages," while GCHQ itself was accumulating communications in "Northern Irish Accented English."
The latest tool mentioned in the Snowden documents, called SPIRITFIRE, was launched in 2013 and is described as "a more robust voice processing capability based on speech-to-text keyword search and paired dialogue transcription."
"Try Your Best to Behave Just Like Everyone Else"
Though a 2011 document, "Finding Nuggets — Quickly — in a Heap of Voice Collection, From Mexico to Afghanistan," describes the successful use of the technology in war zones and among communications from Latin America, the extent to which the voice-to-text tools have been used domestically remains unclear.
Also unclear is how this technology is regulated by the government, whose debates about mass surveillance have so far not included discussion of voice collection and automated transcription.
The technology isn't mentioned at all in the USA Freedom Act, which Congress is currently debating. The bulk data collection program addressed there targets phone metadata: information about who called whom and when.
The Privacy and Civil Liberties Oversight Board (PCLOB) — appointed by the president — has also never mentioned this type of technology in its reports.
The PCLOB's chairman, David Medine, told the Intercept that he could neither confirm nor deny whether the technology was mentioned in any information the board has seen but could not declassify, but said that such technology, if it existed, "would also allow the government to listen in on more calls, which would raise more of the kind of privacy issues that the board has raised in the past."
Most worrying to Phillip Rogaway, a professor of computer science at the University of California, Davis, however, was that the mechanisms by which voice communications were searched or flagged were being developed in the dark. Sophisticated predictive technologies would mean that keyword searches would be "the least of our problems."
A 2006 memo describes tools with the "capability to predict what intercepted data might be of interest to analysts based on the analysts’ past behavior."
"When the NSA identifies someone as ‘interesting’ based on contemporary NLP [Natural Language Processing] methods, it might be that there is no human-understandable explanation as to why beyond: ‘his corpus of discourse resembles those of others whom we thought interesting'; or the conceptual opposite: ‘his discourse looks or sounds different from most people’s.'"
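Rogaway's point can be illustrated with a toy example: scoring how much one speaker's "corpus of discourse resembles" another's using bag-of-words cosine similarity. This is a deliberately simplistic stand-in, not the NSA's method; real NLP pipelines are far more sophisticated and opaque, and all text below is invented.

```python
# Toy "discourse resemblance" score: cosine similarity between two
# speakers' word-frequency vectors. Illustrative only -- a stand-in for
# the opaque similarity scoring Rogaway describes, not an actual system.
import math
from collections import Counter


def cosine_similarity(text_a: str, text_b: str) -> float:
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0
```

Even in this crude form, the score offers no human-readable reason for a match: a high number says only that two word distributions look alike, which is precisely the kind of unexplainable judgment Rogaway warns about.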
Beyond the chilling effect this may have on people worried about eavesdropping, the obscurity of the processes means "it will be impossible to understand the contours of the surveillance apparatus by which one is judged. All that people will be able to do is to try your best to behave just like everyone else."