Big Data

I want to do Big Data!

You have just spent a lot of time analyzing whether you should adopt Big Data, and the outlook is positive. You will create lots of business value, gain new insights, and find the hidden gems. Everything looks great! You get started and identify vendors that can supply the software. Often they even give you the software for free and only charge for support. The business case looks good and the cost savings enormous; you just need to get in touch with the infrastructure people to find out how much it will all cost...

Disaster strikes

Master nodes, worker nodes, horizontal scalability, network ports, bare-metal deployments, 10 Gigabit networking, network switches, SLAs, cost per node. You will be confronted with some, if not all, of these terms. Suddenly the cost looks a lot higher. And it doesn't stop there. Who is going to maintain all that new infrastructure? How many months will it take to deliver? How is this going to fit into your current IT strategy? Companies have spent the last 10 years virtualizing their infrastructure, and suddenly people want to run Big Data technology like Hadoop on bare metal again?

New machines mean more people are needed to maintain the software and hardware. Every machine has an operating system and services, and needs to be patched, upgraded, and so on. Maintaining this beast requires specialized skills.
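To make the cost question concrete, here is a minimal back-of-the-envelope sketch. It is an illustration only: the function names, the parameters, and the simple linear cost model are assumptions made for this example, and you would plug in your own quotes for hardware, operations, and hourly rates.

```python
def on_prem_monthly_cost(nodes, hardware_cost_per_node,
                         amortization_months, ops_cost_per_node_month):
    """Rough monthly cost of a cluster you own (hypothetical linear model)."""
    # Spread the purchase price of each node over its expected lifetime.
    hardware = nodes * hardware_cost_per_node / amortization_months
    # Patching, upgrades, power, and people, expressed per node per month.
    operations = nodes * ops_cost_per_node_month
    return hardware + operations


def on_demand_monthly_cost(nodes, hourly_rate, hours_used_per_month):
    """Rough monthly cost of renting the same number of nodes by the hour."""
    # You only pay for the hours the cluster is actually running.
    return nodes * hourly_rate * hours_used_per_month
```

Comparing the two results for a few realistic usage patterns (always on versus a few hours per day) quickly shows how sensitive the business case is to utilization and to the operations cost per node.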

But I want my Big Data project to succeed

What data will you process? Will the data be anonymized? What is your company policy on data in the cloud?

Often companies may process data outside their own datacenters as long as it is encrypted or anonymized. If that is the case for you, why bother buying expensive hardware and having your own teams manage exotic software? The (real-time) analytics processes you want to run on premises are available as a service from the big Cloud providers like Amazon and Google.

Do you want to do real-time data processing with Flume, Kafka, Storm, and HBase? Take a look at AWS, and in particular at Kinesis or AWS Lambda in combination with DynamoDB. More of a Google fan? Take a look at Google Cloud Pub/Sub, Cloud Dataflow, and BigQuery.
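As a rough idea of how little code the Kinesis plus Lambda route needs, here is a minimal sketch of a Lambda function triggered by a Kinesis stream. Kinesis delivers each record's payload base64-encoded under `Records[].kinesis.data`. The assumption that producers send JSON, and the helper name `decode_kinesis_records`, are made up for this example.

```python
import base64
import json


def decode_kinesis_records(event):
    """Return the payloads carried by a Kinesis-triggered Lambda event.

    Assumption for this sketch: every producer writes a JSON document.
    """
    payloads = []
    for record in event.get("Records", []):
        # Kinesis hands the record data to Lambda as a base64 string.
        raw = base64.b64decode(record["kinesis"]["data"])
        payloads.append(json.loads(raw))
    return payloads


def handler(event, context):
    """Lambda entry point: decode the batch and report its size."""
    items = decode_kinesis_records(event)
    # A real pipeline would persist `items` here, for example in DynamoDB.
    return {"processed": len(items)}
```

There are no servers, ports, or switches in this picture: the provider runs the function for every batch of records, and you pay per invocation.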

Do you just want a data lake where you can store all your data cheaply? Take a look at Amazon S3. It integrates very well with Amazon's other products. Google has Google Cloud Storage. Microsoft has Azure Storage.

You want the yellow elephant? Amazon has EMR (Elastic MapReduce), which gives you an on-demand Hadoop cluster. It integrates with S3 as well.

Still in doubt? Take a look at our website or contact us.

Think twice

Think twice before adopting complicated on-premises software. In recent years the focus has been on virtualizing everything and consuming Software as a Service. Don't go back to the past; make sure you benefit from the Cloud as much as possible. There is a lot out there, so do your analysis, compare the products, and run a Proof of Concept on different platforms.

About the Author

Ward has been a system administrator for more than a decade and has been working as a Consultant and Trainer for the last few years. Besides DevOps he is also into the latest Big Data technologies. Originally Belgian, currently enjoying life in London.