Is HDP different from Hadoop?


HDP, the Hortonworks Data Platform, is an enterprise Apache Hadoop distribution from Hortonworks. It is popularly called HDP, is currently at version 2.3.2, and supports big data for enterprises. So, is HDP different from Apache Hadoop? No. Then why do I need HDP at all, instead of downloading Hadoop directly from the Apache Software Foundation?
The Apache Hadoop ecosystem is a set of tools that address big data challenges. Essential components include HDFS, Flume, Spark, Sqoop, HBase, Hive, MapReduce, and YARN (the resource scheduler that evolved out of classic MapReduce), to name a few. Each component is a separate project in its own right, with many versions released at different points in time.
This lack of synchronization can cause compatibility issues among components, can cause one or more components in a Hadoop cluster to break during an upgrade, and can affect performance and functionality. As such, there is a need to bundle these Hadoop components, make sure they function properly together, test them, and release them to the field as stable, reliable enterprise distributions. That is where vendors like Hortonworks and Cloudera come into the picture. Hence, HDP is the Hadoop flavor bundled and shipped to enterprises by Hortonworks. Another popular Hadoop distribution is CDH from Cloudera.
Start learning Hadoop to enter the big data space
Hadoop, the open source Apache Software Foundation project written in Java and modeled on the Google File System paper, forms the framework that supports big data.
All of us say big data. So far we have been processing terabytes of data using existing relational database management systems. So, what exactly is big data?
Let us first take a look at the three major characteristics addressed by Hadoop, popularly called the 3 V's: Velocity, Volume, and Variety.
Yep – these three V's form the basics of big data.
Big data, by its basic properties, is data sets that:
1) Grow at a spectacular rate. Good examples include data collected from office sensors, RFID tags, mobile phones, etc.
2) Are voluminous in nature
3) Come in different forms and varieties – structured, unstructured, or semi-structured
Relational databases may not be sufficient to handle this kind of data. This is where the Hadoop framework comes into the picture.
Hadoop Framework – A Quick Overview:
To kick-start a move into the big data arena, it is essential to know the ABCs of Hadoop. Apache Hadoop is a framework that took shape around 2005 and was hardened in production by a team of Yahoo engineers. The project is written in Java.
This is an open source project supported by Apache. Anyone can download the binaries and practise with them for free. As with many popular frameworks, Apache Hadoop is also available in commercial flavors from Hortonworks, Cloudera, etc.
Let's take a quick look at the pieces that make the Hadoop framework tick:
1) Apache Hadoop – The core framework on which big data is supported; considered the Hadoop data management layer
2) Hadoop Pig Latin – The scripting language used to process big data; since it works directly on data, it is considered a big data management tool
3) Apache Hadoop HBase – The NoSQL database of the Hadoop ecosystem; the database for big data
4) Apache Hadoop HDFS – The Hadoop Distributed File System that hosts big data; a data storage and management tool
5) Apache Hadoop Ambari – The monitoring and management tool, classified as a Hadoop operational tool
6) Apache Hadoop ZooKeeper – A big data operational tool used for coordination and configuration management across the Hadoop framework
7) Hadoop Sqoop – Used to migrate data from relational databases onto Hadoop HDFS
8) Hadoop Flume – Handles big data aggregation, such as collecting logs into a central repository
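To give a flavor of how these pieces are used from code, below is a minimal sketch of writing and reading a file on HDFS through the Java FileSystem API. The NameNode address hdfs://namenode:8020, the path /user/demo/hello.txt, and the class name HdfsHelloWorld are placeholder values chosen for illustration, not part of any particular distribution.

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHelloWorld {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode address; on a real cluster this usually comes from core-site.xml
        conf.set("fs.defaultFS", "hdfs://namenode:8020");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/user/demo/hello.txt");

        // Write a small file into HDFS (overwrite if it already exists)
        try (FSDataOutputStream out = fs.create(file, true)) {
            out.write("hello big data".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back and print its contents
        try (FSDataInputStream in = fs.open(file)) {
            byte[] buffer = new byte[1024];
            int bytesRead = in.read(buffer);
            System.out.println(new String(buffer, 0, bytesRead, StandardCharsets.UTF_8));
        }

        fs.close();
    }
}

On a real cluster the same operations are usually done with the hdfs command-line client; the Java API shown here is what higher-level tools build on.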
Big data: a promising career for Java developers
If you are an aspiring Java programmer who feels exhausted and is looking for a career change that builds on your prior Java development experience while offering better compensation, big data is the way to go.
Many vendors are in the process of developing and implementing tools to support big data. One of the most popular vendors offering technology that supports big data is Cloudera.
Now, let's take a look at how a Java programmer can handle the big data challenges thrown at them. As a Java programmer, you can start learning the following skills to grow and earn more in your career:
1) Start learning Hadoop MapReduce. Writing jobs in Java or scripts in Pig is an essential skill (a minimal word-count sketch in Java follows this list). In some organizations the MapReduce functions built into NoSQL databases like MongoDB might also come in handy
2) Experience with Hadoop Hive, Spark, and similar distributed data processing platforms is an essential skill for taking up a job as a big data engineer
3) Experience with tools like Cloudera Manager would be a plus. Some employers prefer certification from Cloudera
4) Experience working with NoSQL databases like HBase and MongoDB would be a plus
5) Development experience with Java is much preferred. However, experience with Python or R might come in handy
6) This is a technology trend, so plenty of attitude, ambition, and self-learning is essential
7) You must be very comfortable working in a Linux environment; shell, Perl, Python, or Ruby scripting comes in handy
8) Some employers prefer cloud knowledge and experience such as AWS, essentially components of AWS including EC2, S3, EMR, etc.
9) Big data development happens in an agile environment, hence SDLC life cycle knowledge is a must
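As a concrete starting point for item 1 above, here is a minimal sketch of the classic word-count job using the Hadoop MapReduce Java API. The class names and the input/output paths passed on the command line are placeholders; treat this as an illustrative sketch rather than a tuned production job.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emit (word, 1) for every word in each input line
    public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            for (String token : value.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Reducer: sum the counts emitted for each word
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setCombinerClass(SumReducer.class);   // summing is associative, so the reducer doubles as a combiner
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. an HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The same logic can be expressed in a few lines of Pig Latin or Hive SQL, which is why those scripting skills appear alongside Java in the list above.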
Cloudera Distribution and Apache Hadoop – A Quick Overview:
Data has grown from paper files to digital CDs, floppy disks, hard disks, and SAN/NAS storage, and now the Hadoop cluster is the trend. The Apache Software Foundation (ASF) runs a set of projects to support data that is generated at an ever faster pace, comes in structured and unstructured forms, and needs to be stored and processed to mine valuable business insights. This is where the Hadoop project came into existence. In simple terms, Apache Hadoop is the framework needed to store and process massive amounts of data.
Hadoop presents a set of machines to the end user as a single cluster. In the real world this is considered a cost-saving measure, as commodity hardware can be used to implement an Apache Hadoop cluster. Here are some interesting facts and features of Apache Hadoop:
1) Fault tolerant – The basic storage unit is HDFS, the Hadoop Distributed File System, which replicates data blocks across nodes and stores big data that can come in many different forms
2) Scalable – It is possible to add more machines to the cluster to meet growing demand
3) Open source – Hadoop is not owned by any firm. Anyone can download, modify, and run the source code. Instead of downloading directly from the Apache website, consider distributions like CDH from Cloudera or HDP from Hortonworks, which are Apache Hadoop flavors bundled with appropriate components, tested, and released for use by enterprises
The projects built around Hadoop comprise the Hadoop ecosystem. Some components include:
1) Spark
2) Scala
3) Kafka
4) Ranger
5) Storm
6) Flume
These tools, which form part of the Apache Hadoop ecosystem, make Hadoop easier to use.
What are Cloudera and CDH, and how are they related to Hadoop?
Cloudera offers enterprise solutions that solve the big data problems of enterprises. Just as Ubuntu, RHEL, and Fedora are distributions of Linux, CDH is Cloudera's distribution of Apache Hadoop for enterprises. The service offerings of Cloudera do not stop there. Cloudera Manager is the graphical user interface that can be used to manage a Hadoop cluster from a UI. It can be thought of as similar to Oracle Enterprise Manager, the GUI from Oracle.
Career as a Cloudera big data analyst
Want to take up a career as a Cloudera big data analyst? Interested in learning the prerequisites? Here is an outline of the requirements to emerge as a Cloudera big data analyst:
1) The primary skill necessary to find a career as a data analyst is SQL
2) In a Hadoop-specific environment it becomes mandatory to learn tools like Pig scripting, Hive, and Impala to analyse big data (see the Hive-over-JDBC sketch after this list)
3) The role deals with high-level analysis; you don't need to be a developer
4) Learn how to get data from other systems like data warehouses and databases
5) Learn how to analyse big data sets
6) Knowledge of basic relational databases comes in handy
7) Basic Unix commands definitely help. Some useful commands include mkdir, rm, cp, mv, etc.
8) Knowledge of a high-level programming language definitely helps. The most preferred languages include Java, Python, and Perl
9) Knowledge of ETL and the Hadoop framework is a plus
10) You get to learn Hadoop data ingestion tools and analysis using Pig scripting, Hive commands, and Impala, to name a few
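To make item 2 above concrete, and keeping with the Java examples used earlier, here is a minimal sketch of running a Hive query over JDBC. The HiveServer2 address hiveserver:10000, the analyst user, and the web_logs table are all assumed placeholders for illustration; the Hive JDBC driver jar (hive-jdbc) must be on the classpath.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
    public static void main(String[] args) throws Exception {
        // Register the Hive JDBC driver (provided by the hive-jdbc jar)
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        // HiveServer2 JDBC URL; host, port, and database are placeholders
        String url = "jdbc:hive2://hiveserver:10000/default";

        try (Connection conn = DriverManager.getConnection(url, "analyst", "");
             Statement stmt = conn.createStatement();
             // Count page hits per day from the assumed web_logs table
             ResultSet rs = stmt.executeQuery(
                     "SELECT log_date, COUNT(*) AS hits "
                     + "FROM web_logs GROUP BY log_date ORDER BY log_date")) {
            while (rs.next()) {
                System.out.println(rs.getString("log_date") + "\t" + rs.getLong("hits"));
            }
        }
    }
}

In day-to-day analyst work the same query would usually be typed into the Hive or Impala shell; the JDBC route simply shows how the SQL skill carries over to programmatic access.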
Building a Hadoop cluster: know-how
Say you are an administrator on the infrastructure team and your manager wants you to come up with a plan for a prospective Hadoop cluster to support an upcoming big data project in your organization. Wondering where to start? Here are some basics that come in handy as a first step in building a Hadoop cluster.
Before looking at installation options, let's see the different ways the cluster can be built. This can be done in one of the following ways.
Choosing hardware and understanding Hadoop architecture
Choose a set of commodity hardware in your datacenter. If none is earmarked, have a meeting with the capacity planner to determine whether you have commodity hardware in place, and use one or more of those machines. With the combined resources available, make sure you can start building the Hadoop cluster on your own. Typically, when it comes to HDFS, the recommendation is to have at least three DataNodes so that the default replication factor of three can be met, which provides redundancy and helps with availability; that is a separate topic we can discuss in detail. Essentially, three commodity machines might be needed to start.
Here are the different ways to build a Hadoop cluster:
1) Utilize the commodity hardware in your organization
2) Rent hardware
3) Make use of cloud services like Amazon Web Services or Microsoft Azure, which make Hadoop cluster creation and hosting a piece of cake. All you need is to buy the appropriate virtual machines from these vendors and create and launch the Hadoop cluster in a shorter timeframe. This comes with the unique advantage of paying as your resource consumption increases. These Infrastructure-as-a-Service offerings make the job easy and simple.
Now, let's look at Hadoop cluster installation options:
Say you choose to build the Hadoop cluster on your own; here are the installation options to consider.
1.1) Apache tarballs – This is one of the most time-consuming approaches, as you need to download the appropriate binary tarballs from Apache Hadoop and the related projects. You have to decide where installation files, configuration files, and log files live in the filesystem, make sure file permissions are set correctly, and so on. Also note that you must verify that the version of Hadoop you download is compatible with Hive and the other components; when you do it all yourself, component compatibility has not been tested and certified for you.
1.2) Apache packages – From the Apache Bigtop project to vendor packages from Hortonworks, Cloudera, and MapR, to name a few, enterprise Hadoop clusters mostly rely on RPM and Debian packages from certified vendors. This ensures component compatibility, such as the proper functioning of Hadoop with Hive, and plays well with tools like Puppet, which eases most of the work.
1.3) Hadoop cluster management tools – From Apache Ambari to Cloudera Manager, many GUI tools make this a piece of cake. They also come with the unique advantage of rolling upgrades, which help upgrade the cluster with zero downtime. When there is a need to add more resources to the cluster, the job becomes easy using these tools. The tools come with heuristics and best-practice recommendations that come in handy while working with the many different components of Hadoop.
