Big Data

1. Cloud Storage

Use Cloud Storage as your data lake. Anonymize or encrypt your data to satisfy the policies that govern moving data outside the enterprise. Storing 1 TB in a European Amazon data center costs as little as $30 per month.
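As a minimal sketch of the anonymization step, the identifying fields of each record can be replaced with a one-way hash before the data is uploaded. The field names and record shape here are hypothetical, and a production setup would add a secret salt; this only illustrates the idea:

```python
import hashlib

def pseudonymize(record, pii_fields=("email", "name")):
    """Replace PII fields with a SHA-256 hash so the record can leave
    the enterprise. Unsalted hashing is for illustration only; a real
    pipeline should use a keyed/salted hash."""
    out = dict(record)
    for field in pii_fields:
        if field in out:
            out[field] = hashlib.sha256(out[field].encode("utf-8")).hexdigest()
    return out

record = {"email": "jane@example.com", "purchases": 3}
safe = pseudonymize(record)  # "email" is now a hex digest, "purchases" untouched
```

The scrubbed records can then be uploaded to S3 (for example with boto3's `upload_file`) without exposing raw identifiers.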

2. Processing as a Service

You can either run open source software on EC2 machines (IaaS) or go for one of the many processing services that cloud providers offer:

- AWS Lambda, optionally with the (new) AWS API Gateway
- Real-time processing with AWS Kinesis
- Data pipelines with AWS Data Pipeline
- AWS Redshift if you want to query big data, or Hadoop as a service with AWS EMR
- Google Cloud Dataflow
- Google BigQuery
- Messaging queues with Google Pub/Sub
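To make the first option concrete, here is a minimal sketch of a Lambda handler sitting behind API Gateway. The event shape follows the API Gateway proxy integration (request body as a JSON string); the payload fields are hypothetical:

```python
import json

def handler(event, context):
    """Minimal Lambda handler for an API Gateway proxy integration.
    Parses a JSON body and returns a computed result."""
    body = json.loads(event.get("body") or "{}")
    total = sum(body.get("values", []))  # hypothetical payload: {"values": [...]}
    return {"statusCode": 200, "body": json.dumps({"sum": total})}
```

You pay only while the function runs, which fits the pay-per-use theme of the other services listed above.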

3. Shutdown Machines

Don't keep any services running that you don't need. You can easily process a couple of TBs of data using generic EC2 machines, write the results to a persistent data store (a database, S3), and then shut down the machines. Use provisioning tools like Puppet, Chef (or OpsWorks), or Ansible to build your machines, or boot generic images with the tools already in place.
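A sketch of the shutdown step, assuming the job's instances carry a `role` tag (the tag name and value are assumptions for illustration). The selection logic is a pure function over the shape of an EC2 `describe_instances()` response; the actual boto3 calls are shown commented out:

```python
def instances_to_stop(reservations, job_tag="batch-job"):
    """Pick running instances tagged for the finished job from the
    'Reservations' list of an EC2 describe_instances() response."""
    ids = []
    for res in reservations:
        for inst in res["Instances"]:
            tags = {t["Key"]: t["Value"] for t in inst.get("Tags", [])}
            if inst["State"]["Name"] == "running" and tags.get("role") == job_tag:
                ids.append(inst["InstanceId"])
    return ids

# With boto3 installed and credentials configured, this would be:
# import boto3
# ec2 = boto3.client("ec2")
# ids = instances_to_stop(ec2.describe_instances()["Reservations"])
# if ids:
#     ec2.terminate_instances(InstanceIds=ids)
```

Running this as the last step of the job (after results are persisted to S3 or a database) means you stop paying the moment the work is done.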

4. Use Cloud Integration

Big data processing tools like Apache Spark have cloud support built in. You can specify your AWS credentials when submitting your Spark job to run it directly against cloud storage.
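As a sketch of what that looks like in practice: credentials for Spark's S3A connector can be passed as configuration at submit time. The bucket, paths, class, and jar below are placeholders:

```shell
# Submit a Spark job that reads input from and writes results to S3.
# Bucket, paths, and the application jar are placeholders; credentials
# can also come from the environment or an IAM role instead.
spark-submit \
  --conf spark.hadoop.fs.s3a.access.key=$AWS_ACCESS_KEY_ID \
  --conf spark.hadoop.fs.s3a.secret.key=$AWS_SECRET_ACCESS_KEY \
  --class com.example.WordCount \
  my-job.jar s3a://my-bucket/input/ s3a://my-bucket/output/
```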

5. Use Spot Instances

Amazon AWS has a nice feature that lets you bid on spare computing capacity. You can get an EC2 instance at a lower price, with the downside that it can be terminated at any time. This makes things a little trickier, but processing can be structured so that if the instance gets terminated, the job simply resumes where it left off. Persist your data outside the instance so it is not lost.
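The resume-where-it-left-off pattern can be sketched as a checkpointed loop. This toy version writes progress to a local file after each item; on a real spot instance the checkpoint (and results) would live in S3 or a database, and the "work" here is a stand-in:

```python
import json
import os
import tempfile

CHECKPOINT = os.path.join(tempfile.gettempdir(), "job_checkpoint.json")

def process(items, checkpoint_path=CHECKPOINT):
    """Process items idempotently, persisting progress after each one so a
    terminated spot instance can resume from the checkpoint on restart."""
    done = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = json.load(f)["done"]
    results = []
    for i in range(done, len(items)):
        results.append(items[i] * 2)  # stand-in for the real work
        with open(checkpoint_path, "w") as f:
            json.dump({"done": i + 1}, f)
    return results
```

If the instance dies mid-run, the next instance reads the checkpoint and skips the items already processed, so a spot termination costs you at most one item's worth of work.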

About the Author

Ward has been a system administrator for more than a decade and has worked as a consultant and trainer for the last few years. Besides DevOps, he is also into the latest big data technologies. Originally from Belgium, he currently enjoys life in London.