Hadoop Buyer's Guide
At Strata+Hadoop World today I stopped by MapR's booth to chat with their reps, and I was presented with a most curious little book. In big letters it proclaims "Hadoop Buyer's Guide", with a smaller assurance that these 27 pages are "everything you need to know about choosing the right Hadoop distribution for production". The front and back cover both have the Ubuntu logo, but there's no mention of MapR - the Author is Robert D. Schneider, who I was assured (emphatically, by the MapR rep) is an independent consultant.
...this guide is specifically designed to be incorporated into your RFP when it comes to evaluating Hadoop platforms. - Hadoop Buyer's Guide, page 1
The Guide makes some bold promises right from page one. Not only will it literally write your RFP, but it will also explain "... why selecting a Hadoop platform is so vital". Ostensibly the alternative, a Hadoop quantum superposition, is difficult and costly to maintain at room temperature. There's some vague mention about choosing operating systems, but I suppose Canonical realized the ethical quandary of paying someone to write an "objective" comparison, so The Guide has only a fleeting mention that hey, Juju can do this configuration stuff, or whatever.
Big Data, MapReduce, and Hadoop
The original implementation of Hadoop was in Java and used the Linux file system... later [causing] difficulties for enterprises - Hadoop Buyer's Guide, page 7
So we get your standard history of Hadoop, which every presenter in history has used to pad their "Introduction to Hadoop" deck / book intro. The first six pages are breathlessly exuberant about the revolutionary potential of MapReduce, greedily hoarded by the gods of Google until it was stolen away by Doug "Prometheus" Cutting. The tale takes a tragic turn when we discover that Doug used Java, the bane of enterprises everywhere, leading to the quote above. Exactly what difficulties, and which enterprises, are left unclear, but The Guide assures us that this is a giant downside of the "original" - which is in fact referred to as the "Apache" variant in polite conversation, since it's still alive.
We're introduced in this section to the "three models" of Hadoop vendor-dom: "bare-bones" open-source like Hortonworks, "management innovations" like Cloudera, and "adding value through architectural innovations" like MapR. The Guide conflates "closed-source" with "adding value" - perhaps it's my pinko commie heart talking, but since Cloudera created technologies like Sqoop, Flume and Oozie, and both Cloudera and Hortonworks are involved in little things like Hive, YARN, and defining the actual MapReduce interface, it seems like they add a lot of value. In fact, they add value even if you buy a MapR license, which seems like a swell thing to do.
Critical Considerations When Selecting A Hadoop Platform
Using C/C++ is consistent with almost all other enterprise-grade software - Hadoop Buyer's Guide, page 13
In my mind "enterprise software" is a bad thing. It's lampooned in examples like Enterprise Fizzbuzz, and typified by things like the longest class name. According to The Guide, Java's "unpredictability" is "well documented". I've never really thought of Java as the wild and crazy sort of language - I figured it stays in on Friday nights and reads, while Objective-C and Scala go out to a hip club, and Lisp tries to sneak past the bouncer. We're actually treated to a number of unsupported assertions in easy-to-read table format: implementing your own filesystem is more reliable (ext3 just adds "moving parts"), many Hadoop installations require "separate instances" (whatever those are), and it's important to be able to use Google Compute Engine or EC2 to augment your cluster (whoops, that one's true!).
Moving on to the more prolix portions of the chapter, The Guide assures us that Flume and Scribe are "complex and cumbersome", and introduce "deep-seated inefficiencies". The only useful Hadoop distribution is one with full NFS read/write. Which one has that again? Likewise, we should avoid a distribution with the "NameNode bottleneck", since they're limited to a "paltry 100 million to 150 million files". At this point it's pretty apparent that nobody has ever done anything good or useful with one of those crippled, open-source Hadoop distributions. HBase is in a similar situation: the open source community has screwed up so badly that only "a number of innovations ... can transform HBase" to make it useful. These innovations are all suspiciously similar to the ones that can make Hadoop useful! And they're listed yet again under the "Dependability" heading!
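For scale, that "paltry" figure is at least in the right ballpark: the usual back-of-the-envelope rule (mine, not The Guide's) is that each namespace object costs the NameNode roughly 150 bytes of heap, which puts 150 million single-block files somewhere around 42 GiB:

```python
# Back-of-the-envelope NameNode heap estimate. The ~150 bytes/object
# figure is a common rule of thumb, not a measured value.
BYTES_PER_OBJECT = 150          # rough heap cost per file or block entry
files = 150_000_000             # The Guide's upper "paltry" figure
blocks = files                  # simplifying assumption: one block per file
heap_bytes = (files + blocks) * BYTES_PER_OBJECT
print(f"~{heap_bytes / 2**30:.0f} GiB of NameNode heap")  # ≈ 42 GiB
```

A big heap, sure, but hardly exotic for a dedicated master node, which rather undercuts the "paltry".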
The High Availability section asserts that "it shouldn't be necessary to perform any special steps to take advantage of HA". I did think it was a bit odd when I had to smear goat blood all over the NameNode with Apache Hadoop, but the manual assured me it would bring me favour from the Gods of Big Data(tm). On the other hand, a MapR representative did warn me once that "enabling HA in Cloudera takes a 90 page document!", which leads me to believe that basic literacy may be considered "special" by some. Either way, The Guide will have none of your high-falutin "configuration".
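For what it's worth, the "special steps" for stock HDFS NameNode HA boil down to a handful of hdfs-site.xml properties rather than goat blood. A condensed sketch, with the quorum journal manager; the nameservice, NameNode IDs, and hostnames here are illustrative placeholders, not anything from The Guide:

```xml
<!-- Condensed hdfs-site.xml sketch of HDFS NameNode HA with QJM.
     "mycluster", nn1/nn2, and the example.com hosts are placeholders. -->
<property>
  <name>dfs.nameservices</name>
  <value>mycluster</value>
</property>
<property>
  <name>dfs.ha.namenodes.mycluster</name>
  <value>nn1,nn2</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn1</name>
  <value>namenode1.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.rpc-address.mycluster.nn2</name>
  <value>namenode2.example.com:8020</value>
</property>
<property>
  <name>dfs.namenode.shared.edits.dir</name>
  <value>qjournal://jn1.example.com:8485;jn2.example.com:8485;jn3.example.com:8485/mycluster</value>
</property>
<property>
  <name>dfs.client.failover.proxy.provider.mycluster</name>
  <value>org.apache.hadoop.hdfs.server.namenode.ha.ConfiguredFailoverProxyProvider</value>
</property>
```

(You also need fencing configured and, for automatic failover, ZooKeeper failover controllers - still rather closer to one page than ninety.)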
The Winner Is!
Cash rules everything around me, CREAM, get the money, dolla dolla bill yo - the Author, somewhere, probably
We're not going to tell you outright what vendor to buy from - this is a complex, multi-faceted decision which requires an assessment of your needs, your existing codebase and your development resources. Surely every vendor has pros and cons, and is suited to a different business case. Of course, MapR does have a lot of 'Yes' in the matrix on the final page. Hey, they have all the things I'm supposed to put in my RFP! What's that, Hadoop Buyer's Guide? Hortonworks hates freedom and devours the hearts of kittens?