Thursday, March 26, 2015

List of essential books to read.....

These are the books I referenced while I was studying. They helped me a lot, as many of the exam questions in the Cloudera practice tests reference sections in these books.


  Hadoop: The Definitive Guide

Hadoop Operations

I'm currently trying to learn and master the ecosystem components by importing and exporting data using Sqoop and Flume. So I'm now referencing these books as well as helpful posts on the internet. My immediate blog posts will most likely document my struggles with getting my Flume sinks to work.
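To make that concrete, a first Sqoop import from a relational database into HDFS looks roughly like the sketch below. The MySQL host, database, table, and paths are made-up placeholders, not my actual setup:

```shell
# Sketch of a Sqoop import pulling a MySQL table into HDFS.
# Host, database, table, and paths are placeholders; substitute your own.
SQOOP_CMD='sqoop import
  --connect jdbc:mysql://dbhost:3306/sales
  --username hadoop_user
  --password-file /user/hadoop_user/.db.password
  --table orders
  --target-dir /user/cloudera/orders
  --num-mappers 4'

if command -v sqoop >/dev/null 2>&1; then
  $SQOOP_CMD                      # run the import on a real cluster node
else
  echo "sqoop not found; the command to run would be:"
  echo "$SQOOP_CMD"
fi
```

Each mapper opens its own database connection, so --num-mappers controls how parallel (and how hard on the source database) the import is.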

Using Flume: Flexible, Scalable, and Reliable Data Streaming

Apache Flume: Distributed Log Collection for Hadoop (What You Need to Know)
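Since a Flume agent is driven entirely by a properties file, here is a minimal sketch of the kind of source/channel/sink configuration I'm wrestling with. The agent name, log path, and HDFS URL are assumptions, not my actual setup:

```properties
# Minimal Flume agent sketch: tail a log file into HDFS.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

# Source: follow a local log file
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/messages
a1.sources.r1.channels = c1

# Channel: in-memory buffer between source and sink
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000

# Sink: write events into HDFS as plain text
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://localhost:8020/user/cloudera/flume/events
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.channel = c1
```

You would launch it with something like: flume-ng agent --conf conf --conf-file flume-hdfs.conf --name a1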

More to come....

Setting up a mini cluster to train with on your PC.


So you may have already downloaded the Cloudera VM and are getting familiar with the single node environment setup.  This is great for running the tutorials and getting familiar with all the ecosystem components; however, if you are trying to understand Hadoop and earn your admin certificate, then you need to know how to build and set up a cluster from scratch.

The good news is you can do this right on your laptop if you don't have more than one PC. Practicing setting up and tearing down your very own cluster for training purposes is pretty easy once you do your research.

What's even better is that I went through this growing pain many, many months ago when I first started, so in my next few posts I will share the links and resources from the many blogs and forums I used to get my system up and running.

Hopefully with my lessons learned your training will go smoothly.

Here's what you need to get started if you want to mirror what I did:

  1. At least one PC with 16GB of memory. You will run 4 VMs, each taking up a portion of your RAM, to mimic a cluster.
  2. I used Oracle VirtualBox as my hypervisor to install my OS and Hadoop.
  3. Download an open source Linux OS to use. I chose CentOS.
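If you prefer the command line over clicking through the VirtualBox GUI, the steps above can be sketched with the VBoxManage CLI. The node names and memory split here are my assumptions, so adjust for your own laptop:

```shell
# Sketch: carve one 16GB laptop into four CentOS VMs with VirtualBox's
# VBoxManage CLI. Node names and the ~3GB-per-VM split are assumptions.
NODES="master node1 node2 node3"
MEM_MB=3072   # leaves ~4GB for the host OS on a 16GB machine

for NODE in $NODES; do
  if command -v VBoxManage >/dev/null 2>&1; then
    # Register the VM and give it memory, CPUs, and a NAT adapter
    VBoxManage createvm --name "$NODE" --ostype RedHat_64 --register
    VBoxManage modifyvm "$NODE" --memory "$MEM_MB" --cpus 2 --nic1 nat
  else
    echo "plan: $NODE with ${MEM_MB}MB RAM"   # VirtualBox not installed here
  fi
done
```

You would still attach the CentOS ISO and install the OS in each VM afterwards; this only registers the empty machines.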

From there I followed the instructions in a post I found on the Cloudera blog, which you may have already come across if you found this blog.

http://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/ 



Hortonworks Bigdata Sandbox

If Cloudera or Pivotal is not the distribution you need to learn for your clients, then go get this VM. (I have them all!)
Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. Sandbox includes many of the most exciting developments from the latest HDP distribution, packaged up in a virtual environment that you can get up and running in 15 minutes!

Pivotal HD Virtual Machine

Cloudera is not your only choice for learning big data on Hadoop. Pivotal HD also has a single-node virtual machine you can download and use to learn and play with, currently version 2.1, a 2.39GB download.

The core features are common, but you can gain experience with their platform-specific features such as HAWQ and GemFire XD, and use their data loader.
 
http://pivotal.io/big-data/pivotal-hd
 

Once you get it downloaded and running, the readme on the server desktop contains the password you need as well as the link to run the tutorials.

Have fun.

Getting Started: Hadoop 0 to 60 using Cloudera

I like to stay on top of things and try and learn new technologies when I have spare time.  My new personal project has been all things BigData.   This includes reading RSS feeds, industry reports and of course playing with the tools.
My goal with this blog is to share my process and the resources I used to learn Hadoop and BigData with other like-minded people.
Let's get started.
The first thing you will need is a VM (virtual machine).  I personally use Oracle VM VirtualBox; it's free and can be found here (http://www.oracle.com/technetwork/community/developer-vm/index.html) along with many pre-built VMs for learning all things Oracle.  I was playing with the OBIEE one, but that would be another blog.
Next thing you will need to do is find a Hadoop VM image.
Since everything is available as open source, you can, depending on your needs, configure your VirtualBox image from scratch: download and install your Linux version, Hadoop, etc., and try to configure all your software.  However, since I see my role as a consultant using a pre-configured running instance, I grabbed a working vendor VM.
Most vendors now supply them; pick the vendor you need or want to learn and download the VirtualBox version if they have one.
I downloaded my VM’s from Cloudera.
The Cloudera QuickStart VM (Make sure you choose the correct VM version or you will download a large file for nothing)
I then stumbled upon the link below, which is a free intro course on MapReduce that uses another functional VM with the training materials.
And that is where I began.  Follow the course videos, start up your instance, and you're in business running MapReduce jobs on your own Cloudera VM server.
Lastly…  I'm not a Java guy, and I had some reservations about my ability to write mappers and reducers, so I chose to use Python. The tutorial above also uses Python, so if you have any scripting knowledge you will be fine.
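To give a feel for what that looks like, here is a minimal word-count sketch in the Hadoop Streaming style. The script name and the map/reduce argument convention are my own, not the course's; the key idea is that Hadoop just pipes text through stdin and stdout:

```python
import sys
from itertools import groupby

def mapper(lines):
    """Emit (word, 1) pairs as tab-separated lines, one per word."""
    for line in lines:
        for word in line.strip().split():
            yield "%s\t1" % word.lower()

def reducer(pairs):
    """Sum counts per word; Hadoop sorts mapper output by key first."""
    split_pairs = (p.strip().split("\t") for p in pairs)
    for word, group in groupby(split_pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))

if __name__ == "__main__":
    # Pick a phase by argument ("wordcount.py map" or "wordcount.py reduce");
    # Hadoop Streaming pipes the data through stdin/stdout.
    phase = mapper if sys.argv[1:2] == ["map"] else reducer
    for line_out in phase(sys.stdin):
        print(line_out)
```

With Hadoop Streaming you pass the same kind of script as both -mapper and -reducer, and Hadoop handles the sort-and-shuffle between the two phases for you.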
Other nice to have software:
Notepad++, a great free tool: http://notepad-plus-plus.org/
If you're writing Java or Python, get Eclipse.  It's the tool used on the VMs, and it's nice to have a local copy for when you don't want your VM up and running, taking up memory and eating your laptop's resources.


First post

I plan on using this blog to aid those new to Hadoop who wish to learn and master all things within the Hadoop ecosystem.  Hopefully the information I publish and reference will be of some assistance to you as you start your journey.

I started mine about two years ago and did reach my goal of CDH4 certification, but the journey is not over yet.  I continue to learn and refine my skills daily, as everything in this domain changes so quickly.

I find I have a lot of bookmarks and snippets I searched for and used while trying to solve problems with my setup.

So where do you start?  You have to ask yourself... do you want to download all the open source components individually and build your own Hadoop cluster from scratch? Or do you want to use an available vendor package like Cloudera, Pivotal HD, or Hortonworks and get certified on those?

This blog will focus on the Cloudera ecosystem, as that is the one I chose to learn and the one I'm most familiar with.