Hadoop data load interview question and answer preparation

1) What are all the data sources from which we can load data into Hadoop?
Hadoop is an open-source framework for distributed storage and processing of big data. The first step is to pump data into Hadoop. Data sources can come in many different forms:
1) Traditional relational databases like Oracle
2) Data warehouses
3) Middle tier, including web servers and application servers – server logs are a major source of information
4) Database logs
2) What tools are mainly used in data load scenarios?
Hadoop supports data loads into and out of the cluster from one or more of the above-mentioned data sources. Tools such as Sqoop and Flume are used in data load scenarios. If you come from an Oracle background, think of tools like Data Pump and SQL*Loader that help with data loads; they are not exactly the same logically, but the concept matches.
3) What is a load scenario?
Big data loaded into Hadoop can come from many different data sources. Depending on the origin of the data source, there are many different load scenarios:
1) Data at rest – Information stored in files, directories, and sub-directories that is not intended to be modified any further is considered data at rest. To load such information, HDFS shell commands like cp, copyFromLocal, and put can be used (see the shell sketch after this list)
2) Data in motion – Also called streaming data, this is data that is continuously being updated; new information keeps getting added to the data source. Logs from web servers like Apache, application server logs, and database server logs (say, alert.log in the case of an Oracle database) are all examples of data in motion. Note that multiple logs may need to be merged before being uploaded onto Hadoop
3) Data from web servers – Web server logs
4) Data from data warehouses – Data should be exported from traditional warehouses and imported into Hadoop. Tools like Sqoop, Big SQL load, and Jaql (for Netezza) can be used for this purpose
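
A minimal shell sketch of the data-at-rest case, assuming a local log file access.log and an HDFS target directory /data/weblogs (both names are illustrative placeholders):

    # create the target directory in HDFS and copy the local file into it
    hdfs dfs -mkdir -p /data/weblogs
    hdfs dfs -put access.log /data/weblogs/
    # copyFromLocal behaves the same way for a local source
    hdfs dfs -copyFromLocal access.log /data/weblogs/access_copy.log
    # verify the upload
    hdfs dfs -ls /data/weblogs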
4) How does Sqoop connect to relational databases?
Information stored in relational DBMSs like Oracle, MySQL, SQL Server, etc. can be loaded into Hadoop using Sqoop. As with any load tool, Sqoop needs some parameters to connect to the RDBMS, pull information, and upload the data into Hadoop. Typically these include:
4.1) username/password – credentials used to authenticate against the source database
4.2) connector – a database-specific JDBC driver needed to connect to the different databases
4.3) target-dir – the name of the HDFS directory into which the information is loaded as CSV files
4.4) WHERE – a subset of rows from a table can be exported and loaded using a WHERE clause (see the Sqoop sketch after this list)
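
A minimal Sqoop import sketch, assuming a MySQL database named sales on host dbhost, a table orders, and an HDFS target directory /data/orders; the names and the date filter are illustrative placeholders, not values from this article:

    # import a filtered subset of the orders table into HDFS
    sqoop import \
      --connect jdbc:mysql://dbhost:3306/sales \
      --username etl_user -P \
      --table orders \
      --where "order_date >= '2017-01-01'" \
      --target-dir /data/orders

The -P flag prompts for the password instead of exposing it on the command line, and by default the imported rows land in /data/orders as comma-separated text files.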

Building a Hadoop cluster – know-how

Say you are an administrator in the infrastructure team and your manager wants you to come up with a plan for a prospective Hadoop cluster to support an upcoming big data project in your organization. Wondering where to start? Here are some basics that come in handy as a first step in building a Hadoop cluster.
Before looking at installation options, let's see all the ways to build the cluster. This can be done in one of the following ways.
Choosing hardware and understanding Hadoop architecture
Choose a set of commodity hardware machines in your datacenter. If you are not sure what is available, meet with the capacity planner to determine whether commodity hardware is in place, and use one or more of those machines. With the combined resources available, make sure you can start building the Hadoop cluster on your own. Typically, for HDFS the recommendation is at least three nodes, so that blocks can be replicated for redundancy and high availability; replication is a separate topic we can discuss in detail. Essentially, three commodity machines might be needed, as the sketch below illustrates.
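
A minimal sketch of checking and adjusting the HDFS replication factor, assuming an already running cluster and an existing path /data/weblogs (both assumptions are for illustration only):

    # show the configured default replication factor (commonly 3)
    hdfs getconf -confKey dfs.replication
    # enforce three replicas on an existing path and wait for replication to finish
    hdfs dfs -setrep -w 3 /data/weblogs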
Here are the many different ways to build a Hadoop cluster:
1) Utilize the commodity hardware in your organization
2) Rent hardware
3) Make use of cloud services like Amazon Web Services or Microsoft Azure that make Hadoop cluster creation and hosting a piece of cake. All you need is to buy the appropriate virtual machines from these vendors and create and launch the Hadoop cluster in a short timeframe. This comes with the unique advantage of paying as your resource consumption increases. These Infrastructure-as-a-Service offerings make the job easy and simple (see the CLI sketch after this list)
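
As one hypothetical illustration of the cloud route, Amazon EMR can launch a managed Hadoop cluster from the AWS CLI; the cluster name, release label, instance type, and node count below are placeholder assumptions, not recommendations:

    # launch a small 3-node Hadoop cluster on Amazon EMR (assumes the default EMR roles already exist)
    aws emr create-cluster \
      --name "demo-hadoop-cluster" \
      --release-label emr-5.30.0 \
      --applications Name=Hadoop Name=Hive \
      --instance-type m5.xlarge \
      --instance-count 3 \
      --use-default-roles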
Now, let's look at Hadoop cluster installation options. Say you choose to build the Hadoop cluster on your own; here are the installation options to be considered:
1.1) Apache tarballs – This is one of the most time-consuming options, as you need to download the appropriate binary tarballs from Apache Hadoop and related projects. You have to decide on the location of installation files, configuration files, and log files in the filesystem, make sure file permissions are set correctly, and so on. Also note that you must make sure the version of Hadoop you download is compatible with Hive and other components; component compatibility is not tested and certified for you when you do it all yourself (a bare-bones install sketch follows this list)
1.2) Apache packages – From the Apache Bigtop project to vendor packages from Hortonworks, Cloudera, and MapR, to name a few, enterprise Hadoop clusters mostly rely on RPM and Debian packages from certified vendors. This ensures component compatibility, such as proper functioning of Hadoop with Hive, and provides deployment tooling such as Puppet recipes, which eases most of the work
1.3) Hadoop cluster management tools – From Apache Ambari to Cloudera Manager, many GUI tools make this a piece of cake. They also come with the unique advantage of rolling upgrades, which help upgrade the cluster with zero downtime. When there is a need to add more resources to the cluster, the job becomes easy with these tools. They come with heuristics and best-practice recommendations that come in handy while working with the many different components of Hadoop
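
Referring back to option 1.1, here is a bare-bones sketch of a tarball install on a single node; the version, download mirror, and install paths are assumptions chosen for illustration:

    # download and unpack an Apache Hadoop release (version and paths are placeholders)
    wget https://archive.apache.org/dist/hadoop/common/hadoop-2.7.3/hadoop-2.7.3.tar.gz
    tar -xzf hadoop-2.7.3.tar.gz -C /opt
    # point the environment at the new install and verify it
    export HADOOP_HOME=/opt/hadoop-2.7.3
    export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
    hadoop version
    # configuration files then live under $HADOOP_HOME/etc/hadoop (core-site.xml, hdfs-site.xml, ...)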


Cloudera Distribution and Apache Hadoop quick overview

Data has grown from paper files to digital CDs, floppy disks, hard disks, and SAN/NAS storage, and now the Hadoop cluster is the trend. The Apache Software Foundation (ASF) runs a set of projects to support data that is generated at an ever faster pace, comes in structured and unstructured forms, and needs to be stored and processed to mine valuable business insights. This is where the Hadoop project came into existence. In simple terms, Apache Hadoop is the framework needed to store and process massive amounts of data.
A set of machines is presented to the end user as a single cluster. In the real world this is considered a cost-saving measure, as commodity hardware can be used to implement an Apache Hadoop cluster. Here are some interesting facts and features of Apache Hadoop:
1) Fault tolerant – The basic unit of data storage is HDFS, the Hadoop Distributed File System, which is used to store big data that can come in many different forms; blocks are replicated across nodes, so the loss of a single machine does not lose data
2) Scalable – It is possible to add more machines to the cluster to meet growing demand
3) Open source – Hadoop is not owned by any firm. Anyone can download the source code, modify it, and run it. Instead of downloading directly from the Apache website, look at distributions like CDH from Cloudera or HDP from Hortonworks, which are Apache Hadoop flavors bundled with appropriate components, tested, and released for use by enterprises
Projects built around Hadoop comprise the Hadoop ecosystem. Some components include:
1) Spark
2) Scala
3) Kafka
4) Ranger
5) Storm
6) Flume
These tools, which form part of the Apache Hadoop ecosystem, make Hadoop easier to use.
Give details on Cloudera and CDH. How are they related to Hadoop?
Cloudera offers enterprise solutions to solve the big data problems of enterprises. Just like Ubuntu, RHEL, Fedora, or any other Linux distribution, CDH is Cloudera's distribution of Apache Hadoop for enterprises. The service offerings of Cloudera don't stop there. Cloudera Manager is the graphical user interface that can be used to manage the Hadoop cluster from a UI. It can be treated as similar to Oracle Enterprise Manager, the GUI from Oracle.
