Wednesday, April 1, 2015

Playing with Flume



One thing is certain: reading all the materials and taking the courses does not prepare you for using Cloudera Manager to configure and maintain your cluster.

Case in point: using Flume.

I’ve been trying to understand how to configure and run multiple Flume agents in Cloudera using the manager.

It was pretty easy to set up one agent. I just had to change the configuration on the default tier1 agent to the settings I created and restart. But the documentation I was reading was not clear.

I replaced tier1 with my a1 configuration of a spoolDir source feeding a memory channel, and it worked almost as intended. I do have some issues to work out: I’m playing with the rollover size and interval, as my small 250K test file was being imported as many 10-row Flume files.

So I adjusted the settings and dropped in a few more files, all around 250K and maybe 2MB in total, to generate a larger file in HDFS. But after restarting the cluster and running again I was surprised to see a 48MB file. So what was happening in the background to generate such a large file from a few small data sources? Next step, at a later date, is to find out why… (I’ve sketched the roll settings I plan to try after the config below.)

a1.sources = ExternalFileScrDir
a1.channels = memoryChannel
a1.sinks = flumeHDFS

# Setting the source to spool directory where the file exists
a1.sources.ExternalFileScrDir.type = spooldir
a1.sources.ExternalFileScrDir.spoolDir = /usr/local/flume/live
a1.sources.ExternalFileScrDir.deletePolicy = immediate

# Setting the channel to memory
a1.channels.memoryChannel.type = memory
# Max number of events stored in the memory channel
a1.channels.memoryChannel.capacity = 1000
a1.channels.memoryChannel.batchSize = 250
a1.channels.memoryChannel.transactionCapacity = 500

# Setting the sink to HDFS
a1.sinks.flumeHDFS.type = hdfs
a1.sinks.flumeHDFS.hdfs.path = hdfs://quickstart.cloudera/user/hive/warehouse/flumeimport
a1.sinks.flumeHDFS.hdfs.fileType = DataStream

# Write format can be text or writable
a1.sinks.flumeHDFS.hdfs.writeFormat = Text

#a1.sinks.flumeHDFS.hdfs.rollCount = 0
#a1.sinks.flumeHDFS.hdfs.rollInterval = 0
#a1.sinks.flumeHDFS.hdfs.rollSize = 0
a1.sinks.flumeHDFS.hdfs.batchSize = 1000

# use a single csv file at a time
a1.sinks.flumeHDFS.hdfs.maxOpenFiles = 1

# Connect source and sink with channel
a1.sources.ExternalFileScrDir.channels = memoryChannel
a1.sinks.flumeHDFS.channel = memoryChannel
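
With the roll lines above commented out, the sink falls back to its defaults, which as far as I know are rollCount = 10, rollInterval = 30 and rollSize = 1024; that would explain why a 250K file came out as lots of 10-row files. Setting all three to 0 disables rolling entirely, so files only close when the agent stops or an idle timeout fires. Below is a minimal sketch of the roll settings I plan to try next to get fewer, larger files; the numbers are untested guesses for my small VM, not something I have verified:

# Hedged sketch (untested values): roll at roughly 64MB or every 5 minutes,
# never roll on event count, and close files left idle for 60 seconds
a1.sinks.flumeHDFS.hdfs.rollSize = 67108864
a1.sinks.flumeHDFS.hdfs.rollInterval = 300
a1.sinks.flumeHDFS.hdfs.rollCount = 0
a1.sinks.flumeHDFS.hdfs.idleTimeout = 60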


Then I assumed I could just append another configuration, “a2”, with another spoolDir source using a file channel, and both would be active, running and waiting for files… But that’s not what happened: only the initial agent was working, because the second configuration was under its own agent name.

After some searching it hit me… and the config below worked: two flows under the single agent name, one using a memory channel and one using a file channel, both waiting for files.

# Initialize agent's source, channel and sink
a1.sources = ExternalFileScrDir ExternalFileScrDir1
a1.channels = memoryChannel fileChannel
a1.sinks = flumeHDFS flumeHDFS1

# Setting the source to spool directory where the file exists
a1.sources.ExternalFileScrDir.type = spooldir
a1.sources.ExternalFileScrDir.spoolDir = /usr/local/flume/live
#a1.sources.ExternalFileScrDir.deletePolicy = immediate
a1.sources.ExternalFileScrDir1.type = spooldir
a1.sources.ExternalFileScrDir1.spoolDir = /usr/local/flume/files


# Setting the channel to memory
a1.channels.memoryChannel.type = memory

a1.channels.memoryChannel.capacity = 10000
a1.channels.memoryChannel.batchSize = 250
a1.channels.memoryChannel.transactionCapacity = 5000
a1.channels.memoryChannel.checkpointInterval = 3000
a1.channels.memoryChannel.maxFileSize = 5242880

# Setting the second channel to file
a1.channels.fileChannel.type = file
a1.channels.fileChannel.capacity = 10000
a1.channels.fileChannel.batchSize = 250
a1.channels.fileChannel.transactionCapacity = 5000
a1.channels.fileChannel.checkpointInterval = 3000
a1.channels.fileChannel.maxFileSize = 5242880

# Setting the sink to HDFS
a1.sinks.flumeHDFS.type = hdfs
a1.sinks.flumeHDFS.hdfs.path = hdfs://quickstart.cloudera/user/hive/warehouse/flumeimport
a1.sinks.flumeHDFS.hdfs.fileType = DataStream

a1.sinks.flumeHDFS1.type = hdfs
a1.sinks.flumeHDFS1.hdfs.path = hdfs://quickstart.cloudera/user/hive/warehouse/flumeimport
a1.sinks.flumeHDFS1.hdfs.fileType = DataStream

# Write format can be text or writable
a1.sinks.flumeHDFS.hdfs.writeFormat = Text
a1.sinks.flumeHDFS1.hdfs.writeFormat = Text

# earlier rollover experiments (left commented out): disable time-based rolls, close idle files after 10 minutes
#a1.sinks.flumeHDFS.hdfs.rollInterval = 0
#a1.sinks.flumeHDFS.hdfs.idleTimeout = 600

a1.sinks.flumeHDFS.hdfs.rollCount = 0
a1.sinks.flumeHDFS.hdfs.rollInterval = 0
a1.sinks.flumeHDFS.hdfs.rollSize = 0
a1.sinks.flumeHDFS.hdfs.batchSize = 1000

# use a single csv file at a time
a1.sinks.flumeHDFS.hdfs.maxOpenFiles = 1

# Connect source and sink with channel
a1.sources.ExternalFileScrDir.channels = memoryChannel
a1.sinks.flumeHDFS.channel = memoryChannel
a1.sources.ExternalFileScrDir1.channels = fileChannel
a1.sinks.flumeHDFS1.channel = fileChannel
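
One more tweak I may try later, since both sinks write into the same HDFS directory: giving each sink its own file prefix would at least make it obvious which flow produced which file. A hypothetical sketch, not part of the config I actually ran (the default prefix is FlumeData):

# Hypothetical: label each flow's output files
a1.sinks.flumeHDFS.hdfs.filePrefix = memflow
a1.sinks.flumeHDFS1.hdfs.filePrefix = fileflow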



I’m still perfecting the file sizes and debugging the “channel full” errors, but some of this may be attributable to my tiny VM instance and the low resources on my laptop.
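
For anyone hitting the same “channel full” errors, the knobs that seem to matter are the channel capacity and transactionCapacity relative to the batch sizes: as I understand it, the sink batchSize must not exceed the channel’s transactionCapacity, and the capacity has to be large enough to absorb bursts while the sink drains. A minimal sketch of the kind of change I’m experimenting with; the numbers are guesses for a low-memory VM:

# Hedged sketch (untested): give the memory channel more headroom
a1.channels.memoryChannel.capacity = 100000
a1.channels.memoryChannel.transactionCapacity = 1000
# keep the HDFS sink batch no larger than the channel's transactionCapacity
a1.sinks.flumeHDFS.hdfs.batchSize = 1000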

Thursday, March 26, 2015

List of essential books to read.....

These are the books I referenced while I was studying. They helped me a lot, as many of the questions in the Cloudera practice tests reference sections of these books.


Hadoop: The Definitive Guide

Hadoop Operations
I'm currently trying to learn and perfect the ecosystem components by importing and exporting data using Sqoop and Flume, so I'm now referencing these books as well as helpful posts on the internet. The focus of my immediate blog posts will most likely document my struggles with getting my Flume sinks to work.

Using Flume: Flexible, Scalable, and Reliable Data Streaming

Apache Flume: Distributed Log Collection for Hadoop (What You Need to Know)

More to come....

Setting up a mini cluster to train with on your PC.


So you may have already downloaded the Cloudera VM and are getting familiar with the single-node environment. This is great for running the tutorials and getting familiar with all the ecosystem components; however, if you are trying to understand Hadoop and get your admin certification, then you need to know how to build and set up a cluster from scratch.

The good news is you can do this right on your laptop if you don't have more than one PC. Practicing setting up and tearing down your very own cluster for training purposes is pretty easy once you do your research.

What's even better is that I went through this growing pain many, many months ago when I first started, so in my next few posts I will share the links and resources from the many blogs and forums I used to get my system up and running.

Hopefully with my lessons learned your training will go smoothly.

Here is what you need to get started and mirror what I did:

  1. At least one PC with 16GB of memory. You will run 4 VMs, each taking up a portion of your RAM, to mimic your cluster.
  2. I used Oracle VirtualBox as my hypervisor to install my OS and Hadoop.
  3. Download an open source Linux OS to use. I chose CentOS.

From there I followed the instructions in a post I found on the Cloudera blog, which you may have already come across if you found this blog.

http://blog.cloudera.com/blog/2014/01/how-to-create-a-simple-hadoop-cluster-with-virtualbox/ 



Hortonworks Bigdata Sandbox

If Cloudera or Pivotal is not the distribution you need to learn for your clients, then go get this VM. (I have them all!)
Sandbox is a personal, portable Hadoop environment that comes with a dozen interactive Hadoop tutorials. It includes many of the most exciting developments from the latest HDP distribution, packaged up in a virtual environment that you can get up and running in 15 minutes.

Pivotal HD Virtual Machine

Cloudera is not your only choice for learning big data on Hadoop. Pivotal HD also has a single-node virtual machine you can download and use to learn and play with, currently version 2.1 at a 2.39GB download.

The core features are common, but you can gain experience with their platform-specific features such as HAWQ and GemFire XD, and use their data loader.
 
http://pivotal.io/big-data/pivotal-hd
 

Once you get it downloaded and running, the readme on the server desktop contains the password you need as well as the link to the tutorials.

Have fun.

Getting Started: Hadoop 0 to 60 using Cloudera

I like to stay on top of things and try to learn new technologies when I have spare time. My new personal project has been all things big data. This includes reading RSS feeds and industry reports and, of course, playing with the tools.
My goal with this blog is to share with other like-minded people the process and resources I used to learn Hadoop and big data.
Let's get started.
The first thing you will need is a VM (virtual machine). I personally use Oracle VM VirtualBox; it's free and can be found here (http://www.oracle.com/technetwork/community/developer-vm/index.html), along with many pre-built VMs for learning all things Oracle. I was playing with OBIEE, but that would be another blog post.
Next thing you will need to do is find a Hadoop VM image.
Since everything is available as open source you can, depending on your needs, configure your VirtualBox image from scratch: download and install your Linux version, Hadoop, etc., and try to configure all the software yourself. However, since I see my role as a consultant using a pre-configured running instance, I grabbed a working vendor VM.
Most vendors are now supplying them; pick the vendor you need or want to learn and download the VirtualBox version if they have one.
I downloaded my VM’s from Cloudera.
The Cloudera QuickStart VM (Make sure you choose the correct VM version or you will download a large file for nothing)
I then stumbled upon the link below, which is a free intro course to MapReduce and uses another functional VM along with the training materials.
And that is where I began. Follow the course videos, start up your instance, and you're in business running MapReduce jobs on your own Cloudera VM server.
Lastly… I'm not a Java guy, and I had some reservations about my ability to write mappers and reducers, so I chose to use Python. The tutorial above also uses Python, so if you have any scripting knowledge you will be fine.
Other nice-to-have software:
Notepad++ is a great free tool: http://notepad-plus-plus.org/
If you're writing Java or Python, get Eclipse. It's the tool used on the VMs, and it's nice to have a local copy for when you don't want your VM up and running, taking up memory and eating your laptop's resources.


First post

I plan on using this blog to aid those new to Hadoop who wish to learn and master all things within the Hadoop ecosystem. Hopefully the information I publish and reference will be of some assistance to you as you start your journey.

I started mine about two years ago and did reach my goal of CDH4 certification, but the journey is not over yet. I continue to learn and refine my skills daily, as everything in this domain changes so quickly.

I find I have a lot of bookmarks and snippets I searched for and used while trying to solve problems with my setup.

So where do you start? You have to ask yourself: do you want to download all the open source components individually and build your own Hadoop cluster from scratch? Or do you want to use an available vendor package like Cloudera, Pivotal HD or Hortonworks and get certified on those?

This blog will focus on the Cloudera ecosystem, as that is the one I chose to learn and the one I'm most familiar with.