The perils of metadata and privacy

Many modern approaches to data analysis rely on crunching metadata, without peering into the actual content of a customer's record. For example, you might look at whether the person provided their zip code, without actually looking at the zip code, and use that as an input into some kind of mathematical model.

There are plenty of risks in even this level of analysis, however. Here's an example, based on some rough math (estimates vary widely by gender and location).

In an Ontario survey, 45 percent of transgender respondents said that they had attempted suicide. In Ontario, 7.86 men in 100,000 die by suicide, and the World Health Organization estimates that completed suicides are only 1/20th of all attempts. That makes 157.2 attempts per 100,000 people, or about 0.16% of people. By that rough measure, a transgender person is over 286 times more likely to attempt suicide.
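
For readers who want to check the rough math, here is a minimal Python sketch of the arithmetic. It uses only the figures quoted in this post; the variable names are mine, and the result is no more precise than the inputs.

```python
# Figures quoted in the post (rough, as the post itself warns).
deaths_per_100k = 7.86        # Ontario men, suicide deaths per 100,000
attempts_per_death = 20       # WHO estimate: deaths are about 1/20th of attempts
trans_attempt_share = 0.45    # share of survey respondents reporting an attempt

# Scale deaths up to attempts, then express as a share of the population.
attempts_per_100k = deaths_per_100k * attempts_per_death
baseline_share = attempts_per_100k / 100_000

# How many times higher the surveyed share is than the baseline share.
relative_likelihood = trans_attempt_share / baseline_share

print(f"{attempts_per_100k:.1f} attempts per 100,000")
print(f"{baseline_share:.2%} of people")
print(f"about {relative_likelihood:.0f} times the baseline")
```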

Putting aside the horrible human cost of this for a moment, it turns out that the financial costs of treating attempted suicide are high, too. According to the NIH, a suicide attempt has medical costs of US$13,536.

Now imagine it's your job to identify insurance applicants who might cost your company more money. A transgender person often changes their name, or has a first name that doesn't match legal documentation.

Set an algorithm loose on the metadata, and it could find a correlation between people whose first names have been changed in a database—or differ from their names in official records—and those who will cost more to insure.
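
To make that concrete, here is a small Python sketch of the kind of comparison such an algorithm might make. Everything in it is hypothetical: the records are made-up toy data, and the field names `name_changed` and `claim_cost` and the helper `mean_cost_by_flag` are my own inventions for illustration, not any insurer's actual system.

```python
from statistics import mean

# Entirely made-up toy records; no real applicants or real costs.
# Each record holds one piece of metadata (was the first name ever changed?)
# and the outcome an insurer would model (claim cost in dollars).
records = [
    {"name_changed": False, "claim_cost": 0},
    {"name_changed": False, "claim_cost": 200},
    {"name_changed": False, "claim_cost": 0},
    {"name_changed": False, "claim_cost": 100},
    {"name_changed": True, "claim_cost": 0},
    {"name_changed": True, "claim_cost": 13536},
]

def mean_cost_by_flag(rows, flag):
    """Return average claim cost for rows with the flag set, then without."""
    with_flag = [r["claim_cost"] for r in rows if r[flag]]
    without_flag = [r["claim_cost"] for r in rows if not r[flag]]
    return mean(with_flag), mean(without_flag)

flagged, unflagged = mean_cost_by_flag(records, "name_changed")
print(f"name changed: {flagged:.0f}, unchanged: {unflagged:.0f}")
```

The code never looks at anyone's name, gender, or medical history. The flag alone is enough to separate the two groups, which is how a harmless-looking piece of metadata ends up standing in for something deeply private.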

That's what the algorithms at Orbitz did when they decided Mac users had more money to spend, and served up higher-priced travel offerings based on the visitor's web browser. It wasn't a human decision—it was an algorithm converging on a particular set of attributes that suggested someone was likely to spend more.

The ethics of big data are murky, partly because algorithms are opaque and often confusing. You may think that by only using metadata, you're being respectful of users' privacy. But metadata is leaky, and relying on it to protect user data is risky at best.