Hadoop error: could not find or load main class fs

As a first step in learning Hadoop, I'm currently following the Hortonworks tutorial on mirroring datasets between Hadoop clusters with Apache Falcon. I stumbled on the error: could not find or load main class fs.
First I tried logging in as the falcon user:
su - falcon
Now I'm trying to create a directory using the HDFS command provided in the tutorial:
hadoop fs -mkdir /apps/falcon/primaryCluster
Error: could not find or load main class fs
As a next step I tried setting HADOOP_PREFIX as follows:
export HADOOP_PREFIX=/usr/hdp/current/hadoop-client
This did not fix the issue either.
Next, instead of fs I tried using dfs:
hadoop dfs -mkdir /apps/falcon/primaryCluster
This worked fine.
To confirm that the folder was created, I issued the following command, which also worked fine:
hdfs dfs -ls /apps/falcon
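For anyone hitting the same error, here is a minimal sketch of the sequence that ends up working, using the hdfs client directly. I'm assuming the hdfs binary is on the PATH; note that the hadoop dfs form is reported as deprecated in current Hadoop releases and hdfs dfs is the recommended replacement (the -p flag simply creates missing parent directories).
# Log in as the falcon user and create the tutorial directory
su - falcon
hdfs dfs -mkdir -p /apps/falcon/primaryCluster
# Verify the directory exists
hdfs dfs -ls /apps/falcon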


Hadoop data load interview question and answer preparation

1) What are the data sources from which we can load data into Hadoop?
Hadoop is an open-source framework for distributed storage and processing of big data. The first step is to pump data into Hadoop. Data sources can come in many different forms, as follows:
1) Traditional relational databases like Oracle
2) Data warehouses
3) The middle tier, including web servers and application servers - server logs are a major source of information
4) Database logs
2) What tools are mainly used in data load scenarios?
Hadoop supports loading data into and out of the cluster from one or more of the above-mentioned data sources. Tools such as Sqoop and Flume are used in data load scenarios. If you come from an Oracle background, think of tools like Data Pump and SQL*Loader that help with data loads; though not exactly the same, they are logically similar.
3) What is a load scenario?
Big data loaded into Hadoop can come from many different data sources. Depending on the origin of the data, there are many different load scenarios, as follows:
1) Data at rest - Information stored in files, directories and sub-directories that is not intended to be modified any further is considered data at rest. To load such information, HDFS shell commands like cp, copyFromLocal and put can be used (see the sketch after this list)
2) Data in motion - Also called streaming data, this is data that is continuously being updated; new information keeps getting added to the data source. Logs from web servers like Apache, application server logs, and database server logs (say, alert.log in the case of an Oracle database) are all examples of data in motion. Note that multiple logs may need to be merged before being uploaded onto Hadoop
3) Data from web servers - web server logs
4) Data from data warehouses - data should be exported from traditional warehouses and imported into Hadoop. Tools like Sqoop, Big SQL load, Jaql and Netezza utilities can be used for this purpose
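As a quick illustration of the data-at-rest case, the commands below copy a local file into HDFS. The file and directory names are placeholders for the example.
# Create a target directory in HDFS and copy a local file into it
hdfs dfs -mkdir -p /data/landing
hdfs dfs -put /tmp/sales.csv /data/landing/
# copyFromLocal is equivalent to put for local sources
hdfs dfs -copyFromLocal /tmp/sales.csv /data/landing/sales_copy.csv
# Verify the upload
hdfs dfs -ls /data/landing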
4) How does Sqoop connect to relational databases?
Information stored in relational databases like Oracle, MySQL, SQL Server, etc. can be loaded into Hadoop using Sqoop. As with any load tool, Sqoop needs some parameters to connect to the RDBMS, pull information, and upload the data into Hadoop. Typically these include the following (a sample invocation follows the list):
4.1) username/password
4.2) connector - a database-specific JDBC driver needed to connect to the different databases
4.3) target-dir - the name of the directory in HDFS into which the information is loaded as CSV files
4.4) WHERE - a subset of rows from a table can be exported and loaded using a WHERE clause
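A minimal Sqoop import might look like the sketch below. The connection string, credentials, table and directory names are illustrative placeholders; the --where and --target-dir options correspond to the parameters above.
# Import one table from a hypothetical Oracle schema into HDFS as CSV
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
  --username scott -P \
  --table SALES \
  --where "REGION = 'APAC'" \
  --target-dir /data/sqoop/sales_apac \
  --fields-terminated-by ','
# -P prompts for the password instead of putting it on the command line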

Start learning Hadoop to enter the big data space

Hadoop, the open-source Apache Foundation project written in Java and modelled on the Google File System, forms the framework to support big data.
All of us talk about big data. So far we have been processing terabytes of data using existing relational database management systems. So, what exactly is big data?
Let us first take a look at the three major things addressed by Hadoop, popularly called the 3 V's - Velocity, Volume, Variety.
Yep - these three V's form the basics of big data.
Big data, as its basic properties define it, is data that:
1) Grows at a spectacular rate. Good examples include data collected from sensors in offices, RFID tags, mobile phones, etc.
2) Is voluminous in nature
3) Can be structured, unstructured or semi-structured - it comes in different forms and varieties
To handle this kind of data a relational database may not be sufficient; this is where the Hadoop framework comes into the picture.
Hadoop Framework - A Quick Overview:
To kick-start a career in the big data arena it becomes mandatory to know the ABCs of Hadoop. Apache Hadoop is a framework that originated around 2005 and was matured into a stable project largely by engineers at Yahoo. It is written in Java; the Google File System and MapReduce implementations that inspired it were C++ systems internal to Google.
This is an open-source project supported by Apache. Anyone can download and practise with the binaries for free. As with any popular framework, Apache Hadoop is available in commercial flavours from Hortonworks, Cloudera, etc.
Let's take a quick look at the pieces that make the Hadoop framework work at big data scale (a few sample command-line invocations follow the list):
1) Apache Hadoop - the framework on which big data is supported; this is considered the Hadoop data management tool
2) Hadoop Pig (Pig Latin) - the scripting language used to process big data; as it works directly with data, it is a big data management tool
3) Apache HBase - the NoSQL database from Hadoop; this is the database for big data
4) Apache Hadoop HDFS - the Hadoop Distributed File System that hosts big data; a data management tool
5) Apache Ambari - the monitoring and management tool, classified as a Hadoop operational tool
6) Apache ZooKeeper - a big data operational tool used for configuration of the Hadoop framework
7) Apache Sqoop - used to migrate data from relational databases onto Hadoop HDFS
8) Apache Flume - used for big data aggregation, such as aggregating logs onto a central repository
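To make the list a little more concrete, here is roughly how a few of these components are driven from a terminal on a cluster node; the script name and JDBC URL are placeholders, and exact commands can vary between distributions.
# HDFS: list the root of the distributed file system
hdfs dfs -ls /
# Pig: run a Pig Latin script (wordcount.pig is a placeholder name)
pig wordcount.pig
# HBase: open the interactive HBase shell
hbase shell
# Sqoop: list databases visible through a JDBC connection (URL is illustrative)
sqoop list-databases --connect jdbc:mysql://dbhost/ --username hadoop -P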


Apache Hadoop - the start of a big data career

Hadoop, the talk of the town, is the most popular open-source framework upon which a big data architecture can be built. There are many technologies out there that support big data; however, Hadoop is the most popular owing to the following reasons:
1) Hadoop is an open-source Apache project
2) It hosts all the tools needed to store, process and support big data
3) It comes with an integral NoSQL database, HBase. This avoids the driver integration issues that come with third-party NoSQL databases like MongoDB, Cassandra, etc.
4) Vendors like Hortonworks have taken up projects to package open-source Apache Hadoop into their own distributions, such as HDP, the Hortonworks Data Platform
5) Hadoop has technologies that offer high availability and low-latency data processing
6) Hadoop is Java based. This makes it possible to run on many different operating systems like UNIX, Linux and Windows without any issues; the platform-independent nature of Java makes this possible
Hadoop is a popular and widely adopted framework to support big data. So, what is the real role of Hadoop in supporting real-world big data projects?
Hadoop comprises many tools developed to store big data in chunks that can easily be accessed and processed. Let us see the tools that form the basis of the Hadoop architecture and the role of every component in supporting a big data project:
1) HDFS - the Hadoop Distributed File System. Any system, starting with our desktop PC, is expected to store data in a file system, and Hadoop offers its own file system that can host data at scale. The default block size in HDFS is 128 MB. HDFS comes with a replication factor, which enables information to be split and distributed across more than one physical machine. Every HDFS cluster needs one namenode, the admin node that stores metadata about all other nodes in the Hadoop cluster; the second set of machines are the datanodes that store the actual data. The replication factor determines the number of copies of the data (a sketch of checking these settings follows this list). The namenode and a datanode can be installed on the same machine for learning purposes, but in production the namenode should be on a machine separate from the datanodes. HDFS stores both structured and unstructured data, and real-time data can also be stored.
2) Flume - used to stream unstructured data onto HDFS
3) Sqoop - used to store and retrieve structured data from HDFS. In real-world deployments Oracle is a popular relational database; if there is a requirement to transfer data from Oracle onto HDFS, Sqoop can be used. Sqoop is also needed when processed information is returned to a client system
4) YARN - we can think of YARN as the operating system of the cluster. It is the heart of the big data architecture and manages the processing of information stored in HDFS
5) HBase - the NoSQL database that comes as part of Hadoop. Tabular information is stored in HBase, while files, images and unstructured data are stored in HDFS
6) MapReduce - Hadoop offers MapReduce, the processing model that runs on top of YARN to process information in HDFS
7) Pig (Latin) - a language developed for the sake of data analysts that makes processing simple; Pig helps with data analysis
8) Hive - the SQL of big data is Hive
9) Oozie - a workflow scheduler tool needed to schedule tasks in an organized fashion. It can be thought of like a build tool such as Ant or Maven in a software development environment
10) Spark - Apache Spark is the in-memory processing engine that can be used with Hadoop as well as non-Hadoop environments like MongoDB. Spark is gaining popularity
11) Mahout - a machine learning library used for statistical analysis
12) Client/Python/CLI - the layer at which the user interacts
13) HUE - Hadoop User Experience, the user interface that supports the Apache Hadoop ecosystem; it makes all of the tools accessible, and the entire ecosystem is accessed via a web browser interface
14) ZooKeeper - a tool used for maintaining configuration information, naming, providing distributed synchronization and providing group services
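To tie the HDFS description above to something concrete, the commands below show one way to inspect the block size and to change replication on a single file; the file path is a placeholder.
# Show the configured default block size in bytes
hdfs getconf -confKey dfs.blocksize
# Change the replication factor of one file to 3 copies and wait for completion
hdfs dfs -setrep -w 3 /data/landing/sales.csv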
Why is hadoop used for big data?
When we talk of big data, the first thing that comes to my mind is Oracle Exadata. Oracle has built a sophisticated machine with a powerful operating system to process data at scale. However, Hadoop is an open-source project developed to cater to the need to store and process both structured and unstructured data. Hence, Hadoop is the preferred architecture in the big data space.
Is hadoop the only architecture available to support big data?
No, but Hadoop is the preferred and most widely accepted choice.


Building a Hadoop cluster - know-how

Say that, as an administrator in an infrastructure team, your manager wants you to come up with a plan for a prospective Hadoop cluster to support an upcoming big data project in your organization. Wondering where to start? Here are some basics that come in handy as a first step in building a Hadoop cluster.
Before looking at installation options, let's see all the ways in which the cluster can be built. This can be done in one of the following ways.
Choosing hardware - understanding Hadoop architecture
Choose a set of commodity hardware in your datacenter. If you are not sure what is available, have a meeting with the capacity planner to determine whether you have commodity hardware in place, and use one or more of those machines. With the combined resource availability, make sure you can start building the Hadoop cluster on your own. Typically, when it comes to HDFS, the recommendation is to have at least three datanodes for redundancy, which matches the default replication factor and helps guarantee high availability (see the sketch below). This is a separate topic that we can discuss in detail; essentially, three commodity machines might be needed.
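As a quick way to relate the three-node recommendation to a running cluster, the commands below check the configured replication factor and list the live datanodes; dfsadmin typically needs HDFS superuser privileges.
# The default number of copies HDFS keeps of each block (3 out of the box)
hdfs getconf -confKey dfs.replication
# Report cluster capacity and the datanodes currently in service
hdfs dfsadmin -report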
Here are the different ways to build a Hadoop cluster:
1) Utilize the commodity hardware in your organization
2) Rent hardware
3) Make use of cloud services like Amazon Web Services or Microsoft Azure, which make Hadoop cluster creation and hosting a piece of cake. All you need is to buy the appropriate virtual machines from these vendors and create and launch the Hadoop cluster in a short timeframe. This comes with the unique advantage of paying as your resource consumption increases. These Infrastructure-as-a-Service offerings make the job easy and simple.
Now, let's look at Hadoop cluster installation options:
Say you choose to build the Hadoop cluster on your own; here are the installation options to be considered (a tarball installation sketch follows the list).
1.1) Apache tarballs - This is one of the most time-consuming approaches, as you need to download the appropriate binary tarballs from Apache Hadoop and the related projects. You have to decide on the location of installation files, configuration files and log files in the file system, make sure file permissions are set correctly, and so on. Also, you need to make sure the version of Hadoop you download is compatible with Hive; component compatibility has not been tested and certified when you do it all yourself
1.2) Apache packages - From the Apache Bigtop project to vendor packages from Hortonworks, Cloudera and MapR, to name a few, enterprise Hadoop clusters mostly rely on RPM and Debian packages from certified vendors. This ensures component compatibility, such as the proper functioning of Hadoop with Hive, and works well with tools like Puppet, which eases most of the work
1.3) Hadoop cluster management tools - From Apache Ambari to Cloudera Manager, many GUI tools make this a piece of cake. These also come with the unique advantage of rolling upgrades, which help upgrade the cluster with zero downtime. When there is a need to add more resources to the cluster, the job becomes easy using these tools. The tools come with heuristics and best-practice recommendations that come in handy while working with the many different components of Hadoop
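For the tarball route, a minimal single-node sketch might look like the following. The release version, download mirror and install directory are assumptions for illustration; check the Apache download page for current releases.
# Download and unpack an Apache Hadoop release (version and paths are examples)
wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
tar -xzf hadoop-2.7.3.tar.gz -C /opt
# Point the environment at the new installation
export HADOOP_HOME=/opt/hadoop-2.7.3
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
# Sanity check
hadoop version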


Hortonworks DataFlow and Data Platform - the real difference

This month Hortonworks released the latest version of Hortonworks DataFlow, version 1.1. Hortonworks Data Platform, popularly called HDP, is the major project and product of Hortonworks and is built on top of the open-source Hadoop ecosystem.
Now, do Hortonworks DataFlow and Data Platform represent the same thing?
No. HDP, the Hortonworks Data Platform, is the bundled version of open-source Hadoop in a packaged format. Using an installer, all the components that form part of the Hadoop project are chosen and bundled correctly. As the many different components in the Hadoop ecosystem have version releases at different points in time and compatibility is not always guaranteed, HDP is a stable solution for enterprises looking to have Hadoop implemented as a customized, stable, tested package that is installed using an installer.
Hortonworks DataFlow, on the other hand, is Apache NiFi. This is a GUI tool used to design dataflows using processors, which are data-extracting engines designed to work with many different data sources, and it is meant for data enrichment. There are around 90 processors in HDF that can get files from the local file system, extract information from Twitter, and so on. This information can be put into HDFS, the Hadoop distributed file system, and the dataflow is designed using relationships. Once the processors are dragged and dropped in the GUI, the appropriate properties are configured and the relationships are established and built appropriately, the dataflow gets initiated.
As such, HDF is for designing dataflows, while HDP is the Apache Hadoop platform supporting enterprise big data projects, starting with HDFS, the Hadoop distributed file system.


Hortonworks Sandbox to start learning Hadoop

Hortonworks, the enterprise Hadoop software vendor, offers world-class software for enterprises with a stable Hadoop suite. The Sandbox is the download offered by Hortonworks. As a first step to learning Hadoop, we need an environment that is easy to play with; as per Hortonworks, the Sandbox is a single-node cluster that comes as a learning environment to start learning Hadoop and its ecosystem of products. It can be installed in VirtualBox or VMware on a desktop. As per Hortonworks, it can be installed in Microsoft Azure as well; based on our personal experience, the Azure installation of the Hortonworks Sandbox demands an A4-sized virtual machine, which is not free. Start learning Hadoop on your desktop now.
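Once the Sandbox VM is imported and running, you typically interact with it over SSH and then use the Hadoop clients inside it. The commands below are a sketch assuming the common setup where the VM forwards SSH to port 2222 on localhost; the port and credentials can vary between Sandbox releases, so check the documentation for your version.
# Connect to the sandbox VM (port forwarding assumed)
ssh root@127.0.0.1 -p 2222
# Inside the sandbox, confirm Hadoop is available
hadoop version
hdfs dfs -ls /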
