Figures similar to the following are coming to a slide show near you: “It has been estimated that Wal-Mart records more than 1 million customer transactions each hour, resulting in more than 2.5 petabytes of data being stored, the equivalent of 167 times the data stored in the Library of Congress. Facebook is reported to store 40 billion photographs. And an estimated 35 hours of video content is uploaded to YouTube every minute.”
This is interesting stuff, but it mixes apples and oranges. Wal-Mart cranks out a lot of data – but it's structured data: a huge number of small records. Pictures and video are another matter. These are large, unstructured data objects – what I'd call genuine big data.
Genuine big data poses novel problems of interpretation and combination. Pseudo big data – that is, large transaction volumes – might tax the processing capabilities of traditional databases, but it doesn't challenge the analysis model the way unstructured data does.
So, when someone raises the problems of big data, it's worth asking whether they're talking about high volumes of structured data, unstructured data, or both. These are different topics and they call for different strategies.