The pair, journalist Svea Eckert and data scientist Andreas Dewes, decided to team up and see both how easy it would be to acquire personal user data, and what they could glean from it.
Presenting their findings at the July 27-30 DefCon hacking conference in Las Vegas, the pair revealed they had secured a database containing three billion URLs from three million German users, spread over nine million different sites.
Some were intermittent users, with only a couple of dozen sites visited in the 30-day period the duo examined; other users' troves offered tens of thousands of data points, which provided a full record of their online lives.
Getting hold of the information was easier than buying it off the shelf: the pair simply created a fake company, complete with its own website, a LinkedIn page for its chief executive and a careers site (which garnered a few applications from individuals taken in by the ruse).
They crammed the bogus company website full of stock pictures and "marketing buzzwords," and contacted almost 100 companies, saying they had developed a machine-learning algorithm capable of marketing goods and services more effectively to potential customers, but required a large amount of data for the task.
The companies were asked whether they would turn over their raw data on German web surfers — and most of those contacted were only too happy to oblige, without charge.
Dewes stated there are a number of methods by which individuals can be identified in the data morass, simply from a long list of URLs and timestamps. For instance, anyone who visits their own analytics page on Twitter ends up with a URL in their browsing record that contains their Twitter username. Because that page is visible only to the account holder, finding the URL connects the anonymous data directly to a specific person.
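The username-in-URL technique can be illustrated with a short Python sketch. It is not the researchers' actual tooling: the URL shape `analytics.twitter.com/user/<username>/...`, the function name and the sample record are all assumptions made for illustration.

```python
import re
from urllib.parse import urlparse

# Assumed URL shape (not confirmed by the source):
#   https://analytics.twitter.com/user/<username>/...
ANALYTICS_PATH = re.compile(r"^/user/([A-Za-z0-9_]{1,15})(?:/|$)")


def find_twitter_usernames(records):
    """Map each pseudonymous ID to the Twitter usernames found in its URLs.

    `records` is an iterable of (pseudonym_id, url) pairs.
    """
    hits = {}
    for pseudonym, url in records:
        parsed = urlparse(url)
        if parsed.hostname != "analytics.twitter.com":
            continue
        match = ANALYTICS_PATH.match(parsed.path)
        if match:
            hits.setdefault(pseudonym, set()).add(match.group(1))
    return hits


# Hypothetical sample data: one pseudonymous user, two visited URLs.
sample = [
    ("user-17", "https://example.org/news/article-1"),
    ("user-17", "https://analytics.twitter.com/user/example_handle/tweets"),
]
print(find_twitter_usernames(sample))
```

A single hit of this kind is enough to attach a real account name to every other URL recorded under the same pseudonymous ID.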
Other techniques are less direct. A mere 10 URLs can be enough to create a unique fingerprint that identifies someone in anonymous data, if it is compared against URLs posted on public platforms such as social media. Known fans of a particular band, newspaper and the like can potentially be found by whittling down the links shared by anonymous users.
A similar strategy was used in 2008 to de-anonymize a set of ratings published by Netflix to help computer scientists improve its recommendation algorithm: by comparing "anonymous" ratings of films with public profiles on IMDb, researchers were able to unmask Netflix users, including one individual who subsequently sued the streaming giant for violation of privacy.
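A minimal Python sketch of this kind of fingerprint matching follows. It assumes each anonymous history and each public profile has already been reduced to a set of URLs; the function name, the overlap threshold and the data are hypothetical, and the pair's own method may have differed.

```python
def match_fingerprints(anonymous_histories, public_shares, min_overlap=10):
    """Link pseudonymous histories to public profiles by shared URLs.

    anonymous_histories: {pseudonym: set of visited URLs}
    public_shares:       {public_name: set of publicly posted URLs}
    A pseudonym is matched only if exactly one public profile shares
    at least `min_overlap` URLs with it.
    """
    matches = {}
    for pseudonym, visited in anonymous_histories.items():
        candidates = [
            name
            for name, shared in public_shares.items()
            if len(visited & shared) >= min_overlap
        ]
        if len(candidates) == 1:  # ambiguous or empty results are skipped
            matches[pseudonym] = candidates[0]
    return matches


# Hypothetical toy data with a lowered threshold for readability.
histories = {
    "anon-1": {"https://a.example/1", "https://b.example/2", "https://c.example/3"},
    "anon-2": {"https://d.example/4"},
}
shares = {
    "alice": {"https://a.example/1", "https://b.example/2"},
    "bob": {"https://z.example/9"},
}
print(match_fingerprints(histories, shares, min_overlap=2))
```

Requiring exactly one candidate keeps the sketch conservative: a history that overlaps with several public profiles is left unmatched rather than guessed at.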
Another discovery came via Google Translate, which stores the text of every query put through it in its URL. From this, the team uncovered operational details about a German cybercrime investigation, since the detective involved was translating requests for assistance to foreign police forces.
The data the pair obtained came from a number of browser plugins, with the prime offender being the "safe surfing" tool Web of Trust.
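To show how query text can be read straight out of such a URL, here is a hedged Python sketch. It assumes the text travels in a `text` query parameter on `translate.google.com`; that layout, the function name and the sample URL are illustrative assumptions, and the URL format at the time of the research may have been different.

```python
from urllib.parse import urlparse, parse_qs


def extract_translate_text(url):
    """Return the translated-query text embedded in a URL, or None.

    Assumes (hypothetically) that the text is carried in a `text`
    query parameter on translate.google.com.
    """
    parsed = urlparse(url)
    if parsed.hostname != "translate.google.com":
        return None
    values = parse_qs(parsed.query).get("text")
    return values[0] if values else None


# Hypothetical sample URL.
sample_url = "https://translate.google.com/?sl=de&tl=en&text=Guten%20Tag&op=translate"
print(extract_translate_text(sample_url))
```

Anyone holding the browsing record needs nothing beyond standard URL parsing to recover what the user typed.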
"What would you think if somebody showed up at your door saying, 'Hey, I have your complete browsing history — every day, every hour, every minute, every click you did on the web for the last month'? How would you think we got it: some shady hacker? No. It was much easier: you can just buy it," concluded Eckert.