Researcher Nabs Details from 35 Million Google Profiles

Thursday, May 26, 2011



In an effort to demonstrate the relative ease with which large amounts of personally identifiable information can be amassed into a single database, a researcher has collected a cornucopia of information from online Google profiles over a one-month period.

The database, created by University of Amsterdam Ph.D. student Matthijs R. Koot, now contains the names, educational backgrounds, work histories, Twitter conversations, links to Picasa photo albums, and other details of over thirty-five million people.

Koot was also able to collect over eleven million account usernames. All of the data collected could potentially be exploited by hackers, scammers, identity thieves, spear-phishers, social engineers, private investigators and the government.

“I wrote a small bash script to download all the sitemap-NNN(N).txt files mentioned in that file and attempted to download 10k, then 100k, then 1M and then, utterly surprised that my connection wasn't blocked or throttled or CAPTCHA'd, the rest of them,” Koot said.

Koot was able to collect all of the data over a one-month period via a single IP address because the Google permissions file for the Google Profiles URL does not prevent indexing of the information.
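
To illustrate how such a permissions file governs crawling, here is a minimal Python sketch. It assumes the "permissions file" refers to a robots.txt file, which the article does not state outright. The rule sets, the example.com URL and the is_crawlable helper are hypothetical illustrations, not Google's actual configuration.

```python
from urllib.robotparser import RobotFileParser


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt rules permit fetching `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rule sets, for illustration only.
PERMISSIVE_RULES = "User-agent: *\nAllow: /profiles/\n"
RESTRICTIVE_RULES = "User-agent: *\nDisallow: /profiles/\n"

# Hypothetical profile URL.
PROFILE_URL = "https://example.com/profiles/12345"

if __name__ == "__main__":
    print(is_crawlable(PERMISSIVE_RULES, PROFILE_URL))
    print(is_crawlable(RESTRICTIVE_RULES, PROFILE_URL))
```

Even a restrictive rule set of this kind is only advisory: it tells well-behaved crawlers what to skip, but does not by itself block or throttle a client that ignores it.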

Google has no technical measure in place to prevent the "scraping" of this information, which is available through an Extensible Markup Language (XML) file called profiles-sitemap.xml.
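
As a rough illustration of how a sitemap file of this kind exposes further files to download, the following Python sketch extracts the listed locations from a sitemap index. It assumes the file follows the standard sitemaps.org index format, which the article does not confirm. The sample XML, the example.com URLs and the list_sitemap_files helper are hypothetical, and this is not Koot's script, which he describes as a bash script.

```python
import xml.etree.ElementTree as ET
from typing import List

# Namespace used by the sitemaps.org protocol (assumed to apply here).
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def list_sitemap_files(index_xml: str) -> List[str]:
    """Return every <loc> URL listed in a sitemap index document."""
    root = ET.fromstring(index_xml)
    return [
        loc.text.strip()
        for loc in root.iter(SITEMAP_NS + "loc")
        if loc.text
    ]


# Hypothetical sample index, for illustration only.
SAMPLE_INDEX = (
    '<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
    "<sitemap><loc>https://example.com/sitemap-000.txt</loc></sitemap>"
    "<sitemap><loc>https://example.com/sitemap-001.txt</loc></sitemap>"
    "</sitemapindex>"
)

if __name__ == "__main__":
    for url in list_sitemap_files(SAMPLE_INDEX):
        print(url)
```

Each listed file could then be fetched in turn, which matches the general shape of the process Koot describes.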

“I'm curious about whether there are any implications to the fact that it is completely trivial for a single individual to do this – possibly there aren't. That's something worth knowing too. I'm curious whether Google will apply some measures to protect against mass downloading of profile data, or that this is a non-issue for them too,” Koot explained.

While Google has not addressed the security and privacy implications of Koot's demonstration, the company is investigating whether Koot may have violated its terms of service.

“Public profiles are usually discovered when people use search engines, and sitemap information makes it possible for search engines to index these public profiles so that people can find them. The sitemap does not reveal any information that is not already designated to be public,” a Google spokesperson said in a statement.
