Monday, March 29, 2010

Forget BigData, Think MacroData & MicroData

Recently I've been hearing a lot of talk about the potential of so-called "BigData", an idea that has emerged out of Google's approach to storing large, dispersed data sets in what they call "BigTable". Google's BigTable is a distributed storage system for managing structured data that is designed to scale to a very large (Big) size: petabytes of data across thousands of commodity servers. Basically, it is a way for Google's engineers to think about working in big ways -- a data mantra.

Many of the approaches to BigData have grown from roots in the traditional grid and parallel computing realms, such as non-relational, column-oriented database systems (NoSQL), distributed filesystems and distributed memory caching systems (Memcache). These platforms typically form the basis for many of the products and services found within the broader BigData & NoSQL trends and the associated ecosystem of startups (some of which have been grouped into the broad Cloud computing category). The one thing that seems consistent among all the BigData applications & approaches is that the concept emerged from within very large, data-intensive companies such as Google, Yahoo, Microsoft and Facebook. Because of their scale, these companies were forced to rethink how they managed an ever-increasing deluge of user-created data.

To summarize the trend in the simplest terms, (to me) it seems to be a kind of Googlification for IT. In a sense we are all now expected to run Google-scale infrastructure without having a physical Google infrastructure. The problem is that the infrastructure Google and other large web-centric companies have put in place has less to do with any particular technology and more to do with handling a massive and continually evolving global user base. I believe this trend has to do with the methodology these companies apply to their infrastructure or, more importantly, the way they think about applying that methodology to their technology.

I think this trend towards BigData may also miss the bigger opportunities, mostly because of the word BIG. For most companies, size in terms of raw storage is less important than scale. And when I say scale (relative magnitude), mostly what I mean is the time it takes me to get a job done (think logarithmic scale). Again, the problem with BigData is the relative nature of the word "Big": how big is big? Is my notion of big the same as yours? I'm just not sure. And just because I have petabytes of data doesn't mean I should work on petabyte workloads. I believe a better and more descriptive terminology would be one that is less subjective yet still broad enough to describe the problem. Think MacroData or MicroData.

Macro = Very large in scale or scope or capability
Micro = Extremely small in scale or scope or capability

The bigger question is what happens when you start to think about handling data as many smaller workloads (Micro), which collectively may be distributed across many geographies and environments (Macro). On a macro level the data could be very large, but a single item on a micro level could be just a few bytes in size (think of a Twitter status update). The benefit of thinking about data from a micro or macro standpoint is that you start to think about the metric that matters most: how fast I can achieve my particular goals. For me the goal is getting data analyzed as close to real time as possible. This means data workloads that are processed in smaller, byte-sized pieces when they are created and (if possible) before they are stored / warehoused. It's easier to read data than it is to process data into information.
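
To make the micro idea a little more concrete, here is a minimal Python sketch of what I mean by processing data as it is created rather than after it is warehoused. Everything in it is hypothetical and purely for illustration: the function name `analyze_stream`, the sample status updates, and the choice of counting hashtags as the "analysis" are my own assumptions, not any particular product's API.

```python
from collections import Counter
from typing import Iterable, Iterator


def analyze_stream(updates: Iterable[str]) -> Iterator[Counter]:
    """Yield running hashtag counts after each small (micro) update arrives."""
    counts: Counter = Counter()
    for text in updates:
        # Each update is only a few bytes, so it is analyzed immediately,
        # before anything is written to long-term storage.
        for word in text.split():
            if word.startswith("#") and len(word) > 1:
                counts[word.lower()] += 1
        # A snapshot is available after every update, i.e. in near real time.
        yield counts.copy()


# Hypothetical sample data standing in for a live feed of status updates.
updates = [
    "Reading about #BigData today",
    "Is #bigdata really about size?",
    "Think #MacroData and #MicroData",
]

snapshots = list(analyze_stream(updates))
latest = snapshots[-1]
print(latest.most_common(1))  # [('#bigdata', 2)]
```

The total feed (Macro) could be enormous, but the work done per update (Micro) stays tiny, and the answer is always current as of the last update rather than the last batch job.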

Thinking about data in terms of scale and time will result in significantly more useful ways of solving many more real-world problems. I'm very interested in hearing other opinions on this -- am I just getting caught up on the semantics of the English language?
