Confidential Information Of 1.2 Billion People Discovered In Massive Data Leak

Admin  |  2019-11-25

Unprecedented 4TB Data Leak Exposes 1.2 Billion People

On October 16, 2019, security researchers discovered a wide-open Elasticsearch server containing 4 billion user accounts across more than 4 terabytes of data. This included names, email addresses, phone numbers, and social media profiles, making it one of the largest data leaks from a single source in history.

How Does Data Enrichment Work?

For a very low price, data enrichment companies let you take a single piece of information about a person (such as a name or email address) and expand, or enrich, that profile with hundreds of additional data points.

Each time a company chooses to “enrich” a user profile, it also agrees to share what it knows about that person with the enriching organization. The resulting data keeps compounding with no oversight, ultimately allowing all of a person’s social and personal information to be easily downloaded.
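
The two-way flow described above can be sketched in a few lines of Python. This is an illustration only, not how any particular enrichment provider works: the function names `enrich` and `contribute`, the choice of email as the lookup key, and all sample data are assumptions made for the example.

```python
def enrich(profile, provider_records, key="email"):
    """Return a copy of `profile` expanded with the provider's fields for the same key."""
    match = provider_records.get(profile.get(key), {})
    enriched = dict(profile)
    for field, value in match.items():
        # Keep what the customer already knows; only fill in missing fields.
        enriched.setdefault(field, value)
    return enriched


def contribute(profile, provider_records, key="email"):
    """Model the reverse flow: the customer's fields are folded into the provider's record."""
    record = provider_records.setdefault(profile[key], {})
    for field, value in profile.items():
        record.setdefault(field, value)


# Hypothetical provider data, invented for illustration.
provider = {
    "jane@example.com": {
        "email": "jane@example.com",
        "name": "Jane Doe",
        "phone": "+1-555-0100",
    },
}
customer_profile = {"email": "jane@example.com", "employer": "Example Corp"}

enriched = enrich(customer_profile, provider)   # customer gains name and phone
contribute(customer_profile, provider)          # provider gains employer
```

After both calls, the customer holds the provider's name and phone number, and the provider's record now includes the employer it did not have before. Repeating this across many customers is what makes the data compound.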

The Open Elasticsearch Server

The discovered Elasticsearch server containing all of the information was unprotected and accessible via web browser. No password or authentication of any kind was needed to access or download all of the data. The data spanned 4 separate data indexes, labelled “PDL” and “OXY”.
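
For operators who want to confirm that their own cluster does not answer anonymous requests, a minimal sketch follows. It assumes an HTTP endpoint that you operate and are authorized to test, and it assumes that a 401 or 403 reply means authentication is being enforced while a 2xx reply means the request was served without credentials. The function names and the URL in the usage note are hypothetical.

```python
import urllib.error
import urllib.request


def classify_status(status):
    """Map the HTTP status of an unauthenticated request to a rough exposure label."""
    if status in (401, 403):
        return "protected"   # the server demanded credentials or refused access
    if 200 <= status < 300:
        return "open"        # the server answered without any credentials
    return "unknown"


def check_own_cluster(base_url, timeout=5):
    """Send one unauthenticated GET to a cluster you operate and classify the reply."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            return classify_status(response.status)
    except urllib.error.HTTPError as err:
        # urllib raises HTTPError for 4xx/5xx replies; the status is in err.code.
        return classify_status(err.code)
    except (urllib.error.URLError, OSError):
        return "unreachable"
```

A call such as `check_own_cluster("http://localhost:9200")` (a placeholder address) returning `"open"` would indicate the kind of misconfiguration described above.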

Company 1: People Data Labs (PDL)

Based on our analysis, we believe the data in the “PDL” indexes originated from People Data Labs, a data aggregator and enrichment company. De-duplicating the nearly 3 billion PDL user records revealed roughly 1.2 billion unique people and 650 million unique email addresses. According to their website, their application can be used to search for:

  • Over 1.5 Billion unique people
  • Over 1 billion personal email addresses
  • Over 420 million LinkedIn URLs
  • Over 1 billion Facebook URLs and IDs
  • Over 400 million phone numbers
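
The de-duplication step mentioned above, which reduced nearly 3 billion records to roughly 1.2 billion unique people and 650 million unique email addresses, can be illustrated with a small sketch. The exact method used is not described here; this version assumes each record carries a stable person identifier (the hypothetical field `person_id`) and a list of emails, and it treats emails as equal after trimming whitespace and lowercasing.

```python
def normalize_email(email):
    """Trim whitespace and lowercase so trivially different spellings compare equal."""
    return email.strip().lower()


def count_unique(records):
    """Return (unique people, unique emails) across possibly duplicated records."""
    people = set()
    emails = set()
    for record in records:
        people.add(record["person_id"])           # assumed stable identifier
        for email in record.get("emails", []):
            emails.add(normalize_email(email))
    return len(people), len(emails)


# Invented sample data: two records for the same person, one for another.
records = [
    {"person_id": "p1", "emails": ["Jane@Example.com"]},
    {"person_id": "p1", "emails": ["jane@example.com ", "jd@example.org"]},
    {"person_id": "p2", "emails": []},
]
unique_people, unique_emails = count_unique(records)
print(unique_people, unique_emails)  # 2 2
```

At the scale of billions of records an in-memory set would not be practical, so a real pipeline would more likely rely on an aggregation inside the data store or an approximate cardinality estimate.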

Confirming the Data Match

After being notified, PDL stated the server did not belong to them. To test if the data was theirs, we compared data from the open server to data returned by the official PDL API. The data was an almost complete match.

The only difference was that the PDL API returned education histories, which were absent from the leaked data. Everything else was exactly the same, including accounts with multiple email addresses and phone numbers. This consistency across random user checks strongly indicated the data originated from PDL.
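
A field-by-field comparison of this kind can be sketched as follows. The record layout, field names, and sample values are invented for illustration and do not reflect the actual PDL schema; `diff_records` is a hypothetical helper, not part of any API.

```python
def diff_records(leaked, api):
    """Report which fields exist on only one side and which shared fields disagree."""
    shared = set(leaked) & set(api)
    return {
        "only_in_api": sorted(set(api) - set(leaked)),
        "only_in_leak": sorted(set(leaked) - set(api)),
        # List values are compared in order; sort them first if order is not meaningful.
        "mismatched": sorted(k for k in shared if leaked[k] != api[k]),
    }


# Invented sample: the API record has one extra field, everything else matches.
leaked_record = {
    "name": "Jane Doe",
    "emails": ["jane@example.com", "jd@example.org"],
    "phones": ["+1-555-0100"],
}
api_record = {**leaked_record, "education": ["Example University"]}

result = diff_records(leaked_record, api_record)
```

In this sample, `result["only_in_api"]` is `["education"]` and the other two lists are empty, which mirrors the pattern described above: identical records apart from the missing education history.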

Company 2: OxyData.io (OXY)

Analysis of the “OXY” index revealed an almost complete scrape of LinkedIn data, including recruiter information. When we contacted OxyData, the company also stated that the server did not belong to it, but confirmed that a sample record from the leak appeared to match its data.

The Problem of Accountability and Attribution

This is an incredibly tricky situation. The data appears to originate from PDL and OxyData, but the server was not owned by them. This raises questions about how a third party, likely a mutual customer, obtained the data and then failed to secure it.

Identifying the owner of an exposed server is one of the most difficult parts of an investigation. Cloud providers will not share customer information without a legal process, which makes that route a dead end. This incident raises questions about the effectiveness of current privacy and breach notification laws when the source of a leak is a customer who has misused legally obtained data.

About Data Viper

Data Viper is a next-generation threat intelligence platform, providing organizations with the ability to search across thousands of data breaches and hacker channels.

Source: dataviper.io