One of the big topics these days is “Big Data” and the rising interest in it across the enterprise. The term “Big Data” is no different from the term “Cloud Computing” in that it’s a general label that simplifies something much more complex once you dig into the details. In its simplest form, “Big Data” is the practice of gathering, storing, and analyzing very large amounts of data. The first thing most people think of when “Big Data” comes up is Hadoop. Simply put, Hadoop, along with MapReduce and HDFS, is the framework or platform that makes consuming and analyzing “Big Data” possible in a scalable way.
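To make the MapReduce idea concrete, here’s a minimal sketch of the map/shuffle/reduce pattern Hadoop popularized, written in plain Python rather than against Hadoop itself. The classic word-count example: the map phase emits a (word, 1) pair per word, a shuffle groups pairs by key, and the reduce phase sums each group. Hadoop’s value is running exactly this pattern across many machines and HDFS blocks; the function names below are just illustrative.

```python
from collections import defaultdict

# Map phase: emit a (word, 1) pair for every word in every input record.
def map_phase(records):
    for record in records:
        for word in record.split():
            yield (word, 1)

# Shuffle: group the intermediate pairs by key, as Hadoop does
# between the map and reduce phases.
def shuffle(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

# Reduce phase: sum the counts for each word.
def reduce_phase(groups):
    return {word: sum(counts) for word, counts in groups.items()}

records = ["big data is big", "data at scale"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts)  # {'big': 2, 'data': 2, 'is': 1, 'at': 1, 'scale': 1}
```

In a real Hadoop job the records would be split across HDFS blocks and the map and reduce tasks would run on many nodes in parallel, but the shape of the computation is the same.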
Getting useful analytics from the data is the end goal, but one must not forget that there needs to be a framework and a well-thought-out infrastructure foundation that includes networking, storage, and systems. As an enterprise IT architect, this is where I think challenges will arise in the enterprise. First is the concept of using non-redundant commodity servers in a scale-out model similar to what you see at Google or Facebook: no RAID or other hardware redundancy, and lots of physical servers, is the general rule. In the enterprise this is somewhat of a mind shift. Companies like Google and Facebook can afford their own data centers with plenty of floor space, power, and cooling, which suits this scale-out design. Not so much for a lot of enterprises. The scale-out model also introduces additional physical hardware to manage, cable, and so on. In fact, didn’t the enterprise spend years consolidating as much as possible to reduce its footprint in the hopes of reducing cost? Then there is the talk of using virtualization with Hadoop, which has gotten me into a couple of heated debates with the data folks (i.e., DBAs and data architects).
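The no-RAID approach is less alarming than it first sounds because HDFS provides the redundancy in software: by default every block is replicated to three nodes, so losing a disk or a whole server doesn’t lose data. A typical hdfs-site.xml fragment shows the setting (the value shown is the stock default):

```xml
<!-- hdfs-site.xml: HDFS replicates each block across nodes in software,
     which is why the scale-out model skips RAID at the hardware level. -->
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value> <!-- three copies of every block -->
  </property>
</configuration>
```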
All of these concerns depend on what your plans for Big Data are, but what is the enterprise to do? This question is probably on the minds of other enterprises as well, which brings up another area of concern, the lack of experienced resources, but I won’t get into that here. The original infrastructure design for Big Data has to change if it’s to be adopted in the enterprise. If you plan to do it on your own, design the infrastructure around what makes the most sense for your environment. Take a look at the Hortonworks Data Platform. It’s one of the most complete free open-source platforms I’ve worked with, but there are others. There has also been testing of Hadoop in virtualized environments that proves promising. One such vendor is VMware, and you can find the report they’ve done using vSphere 5 here. EMC has its Greenplum hardware and software solutions.
I wonder what other enterprises are doing in this space. Are you building your own systems and going open source? Buying prepackaged solutions? Hiring others to do it for you? All of the above?