1) What are the data sources from which we can load data into Hadoop?
Hadoop is an open-source framework for distributed storage and processing of big data. The first step is to pump data into Hadoop. Data sources can come in many different forms:
1) Traditional relational databases like Oracle
2) Data warehouses
3) The middle tier, including web servers and application servers – server logs are a major source of information
4) Database logs
2) What tools are mainly used in data load scenarios?
Hadoop offers tools to load data into and out of the cluster from one or more of the above data sources. Tools such as Sqoop and Flume are used in data load scenarios. If you come from an Oracle background, think of tools like Data Pump and SQL*Loader that help with data loads. Though not exactly the same, they are logically comparable.
3) What is a load scenario?
Big data loaded into Hadoop can come from many different data sources. Depending on the origin of the data, there are several load scenarios:
1) Data at rest – Static information stored in files, directories, and sub-directories is considered data at rest. These files are not expected to be modified any further. To load such information, HDFS shell commands like cp, copyFromLocal, and put can be used
2) Data in motion – Also called streaming data, this is data that is continuously being updated; new information keeps getting added at the source. Logs from web servers like Apache, application server logs, and database server logs (say, alert.log in the case of an Oracle database) are all examples of data in motion. Note that multiple logs may need to be merged before being uploaded onto Hadoop
3) Data from web servers – web server logs
4) Data from data warehouses – Data should be exported from traditional warehouses and imported into Hadoop. Tools like Sqoop, Big SQL load, and Jaql/Netezza utilities can be used for this purpose
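A data-at-rest load using the HDFS shell commands mentioned above might look like this; the file and directory names are hypothetical, and a running HDFS cluster is assumed:

```shell
# Create a target directory in HDFS (path is illustrative)
hdfs dfs -mkdir -p /user/demo/sales

# "put" copies a local file that is at rest into HDFS
hdfs dfs -put /tmp/sales_2023.csv /user/demo/sales/

# copyFromLocal is an equivalent way to perform the same load
hdfs dfs -copyFromLocal /tmp/returns_2023.csv /user/demo/sales/

# Verify the files landed
hdfs dfs -ls /user/demo/sales
```

Note that `hdfs dfs -cp` copies files between HDFS locations, whereas `put` and `copyFromLocal` bring data from the local file system into HDFS.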
4) How does Sqoop connect to relational databases?
Information stored in relational DBMSs like Oracle, MySQL, SQL Server, etc. can be loaded into Hadoop using Sqoop. As with any load tool, Sqoop needs some parameters to connect to the RDBMS, pull the information, and upload the data into Hadoop. These typically include:
1) connector – a database-specific JDBC driver needed to connect to the different databases
2) target-dir – the name of the directory in HDFS into which the information is loaded as CSV files
3) WHERE – a subset of rows from a table can be exported and loaded using a WHERE clause
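Putting these parameters together, a typical Sqoop import might look like the following; the host, database, table, and credentials are hypothetical:

```shell
# Hypothetical MySQL source; assumes Sqoop and the MySQL JDBC
# connector are installed and a Hadoop cluster is available
sqoop import \
  --connect jdbc:mysql://dbhost:3306/sales \
  --username loaduser \
  --password-file /user/demo/.dbpass \
  --table orders \
  --where "order_date >= '2023-01-01'" \
  --target-dir /user/demo/orders_2023
```

By default the rows land in the target directory as comma-separated text files; the `--where` clause restricts the import to a subset of rows, as described above.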
Hadoop, the open-source Apache Software Foundation project written in Java and modeled on the Google File System, forms the framework to support big data.
All of us say "big data". So far we have been processing TBs of data using existing relational database management systems. So, what exactly is big data?
Let us first take a look at the three major things addressed by Hadoop – popularly called the 3 V's: Velocity, Volume, Variety.
Yep – these three V's form the basics of big data.
Big data, as its basic properties define it, is data that:
1) Grows at a spectacular rate – good examples include data collected from sensors in offices, RFID tags, mobile phones, etc.
2) Is voluminous in nature
3) Can be structured, unstructured, or semi-structured – it comes in different forms and varieties
To handle this kind of data, a relational database may not be sufficient. This is where the Hadoop framework comes into the picture.
Hadoop Framework – A Quick Overview:
To kick-start work in the big data arena, it becomes mandatory to know the ABCs of Hadoop. Apache Hadoop is a framework that grew out of work by a team of Yahoo engineers around 2005 and is written in Java.
This is an open-source project supported by Apache. Anyone can download and practise with the binaries for free. As with many popular frameworks, Apache Hadoop is also available in commercial flavours from Hortonworks, Cloudera, etc.
Let's take a quick look at the pieces that make the Hadoop framework work:
1) Apache Hadoop – the framework on which big data is supported; considered the Hadoop data management tool
2) Hadoop Pig Latin – the scripting language used to process big data; as it works directly with data, it is a big data management tool
3) Apache Hadoop HBase – the NoSQL database from Hadoop; the database for big data
4) Apache Hadoop HDFS – the Hadoop Distributed File System that hosts big data; a data management tool
5) Apache Hadoop Ambari – the monitoring and management tool, classified as a Hadoop operational tool
6) Apache Hadoop Zookeeper – this big data operational tool is used for configuration of the Hadoop framework
7) Hadoop Sqoop – used to migrate data from relational databases onto Hadoop HDFS
8) Hadoop Flume – performs big data aggregation, including aggregation of logs into a central repository
Hadoop, the talk of the town, is the most popular open-source framework upon which a big data architecture can be built. There are many technologies out there that support big data. However, Hadoop is the most popular owing to the following reasons:
1) Hadoop is an open-source Apache project
2) It hosts all the tools needed to store, process, and support big data
3) It comes with an integral NoSQL database, HBase. This avoids driver-integration issues with third-party NoSQL databases like MongoDB, Cassandra, etc.
4) Vendors like Hortonworks have taken up projects to spice up open-source Apache Hadoop into distributions such as HDP, the Hortonworks Data Platform
5) Hadoop has technologies that offer high availability and low-latency data processing
6) Hadoop is Java-based. This makes it possible to run on many different operating systems like UNIX, Linux, and Windows without any issues; the platform-independent nature of Java makes this possible
Hadoop is a popular and widely adopted framework for supporting big data. So, what is the real role of Hadoop in supporting real-world big data projects?
Hadoop comprises many tools developed to store big data in chunks that can easily be accessed and processed. Let us see the tools that form the basis of the Hadoop architecture and the role of each component in supporting a big data project:
1) HDFS – the Hadoop Distributed File System. Any system, starting with our desktop PC, is expected to store data in a file system. Hadoop offers its own file system that can host data that is big; the default block size in HDFS is 128 MB. HDFS comes with a replication factor, which enables information to be split and distributed across more than one physical machine. Every HDFS cluster needs one namenode, the admin node that stores metadata about all the other nodes in the Hadoop cluster. The second set of machines, called datanodes, store the actual data. The replication factor determines the number of copies of the data. The namenode and a datanode can be installed on the same machine for learning purposes; in production, the namenode should be on a machine different from the datanodes. HDFS stores both structured and unstructured data, and real-time data can also be stored.
2) Flume – used to stream unstructured data into HDFS
3) Sqoop – used to store and retrieve structured data from HDFS. In the real world, Oracle is a popular relational database; if there is a requirement to transfer data from Oracle onto HDFS, Sqoop can be used. Sqoop is also needed when processed information is returned to a relational client.
4) Yarn – we can think of YARN as an operating system. It is the heart of the big data architecture: YARN manages the processing of information stored in HDFS.
5) HBase – the NoSQL database that comes as part of Hadoop. Tabular information is stored in HBase, while files, images, and unstructured data are stored in HDFS.
6) Mapreduce – Hadoop offers MapReduce, the processing model that interacts with YARN to process information in HDFS.
7) Pig (Latin) – a language developed for the sake of data analysts that makes processing simple. Pig helps with data analysis.
8) Hive – The SQL of big data is hive
9) Oozie – a workflow scheduler tool needed to schedule tasks in an organized fashion. It can be thought of as similar to build tools like Ant or Maven in a software development environment.
10) Spark – Apache Spark is an in-memory processing engine that can be used with Hadoop as well as non-Hadoop environments like MongoDB. Spark is gaining popularity.
11) Mahout – a machine learning library used for statistical analysis.
12) Client/Python/CLI – the layer at which the user interacts
13) HUE – Hadoop User Experience, the user interface that supports the Apache Hadoop ecosystem, making all of its tools accessible through a web browser
14) Zookeeper – a tool used for maintaining configuration information, naming, providing distributed synchronization, and providing group services
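As a quick sanity check on the HDFS numbers above (a 128 MB block size plus a replication factor, commonly 3), the raw storage cost of a file can be estimated with simple shell arithmetic; the 1 GiB file size is just an illustrative assumption:

```shell
FILE_MB=1024        # hypothetical file size in MiB
BLOCK_MB=128        # default HDFS block size
REPLICATION=3       # default replication factor

# Number of HDFS blocks, rounding up for a partial last block
BLOCKS=$(( (FILE_MB + BLOCK_MB - 1) / BLOCK_MB ))

# Raw storage consumed across the datanodes
RAW_MB=$(( FILE_MB * REPLICATION ))

echo "blocks=$BLOCKS raw_mb=$RAW_MB"
```

So a 1 GiB file occupies 8 blocks and, with three replicas, consumes 3 GiB of raw cluster storage.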
Why is Hadoop used for big data?
When we talk of big data, the first thing that comes to mind is Oracle Exadata. Oracle has built a sophisticated machine with a powerful operating system to process data that is big. However, Hadoop is an open-source project developed to cater to the needs of storing and processing both structured and unstructured data. Hence, Hadoop is the preferred architecture in the big data space.
Is Hadoop the only architecture available to support big data?
No, but Hadoop is the preferred choice and is widely accepted.
As an administrator in an infrastructure team, say your manager wants you to come up with a plan for a prospective Hadoop cluster to support an upcoming big data project in your organization. Wondering where to start? Here are some basics that come in handy as a first step in building a Hadoop cluster.
Before looking at installation options, let's see the ways the cluster can be built. This can be done in one of the following ways.
Choosing hardware and understanding the Hadoop architecture
Choose a set of commodity hardware in your datacenter. If you are not sure what is available, have a meeting with the capacity planner to determine whether you have commodity hardware in place. Use one or more of these machines; with the combined resource availability, make sure you can start building the Hadoop cluster on your own. Typically, when it comes to HDFS, the recommendation is to have at least three nodes for redundancy purposes, which helps guarantee high availability. That is a separate topic we can discuss in detail; essentially, three commodity machines might be needed.
Here are the different ways to build a Hadoop cluster:
1) Utilize the commodity hardware in your organization
2) Rent hardware
3) Make use of cloud services like Amazon Web Services or Microsoft Azure, which make Hadoop cluster creation and hosting a piece of cake. All you need is to buy the appropriate virtual machines from these vendors and create and launch the Hadoop cluster in a short timeframe. This comes with the unique advantage of paying as and when your resource consumption increases. These Infrastructure-as-a-Service offerings make the job easy and simple.
Now, let's look at Hadoop cluster installation options.
Say you choose to build the Hadoop cluster on your own; here are the installation options to be considered:
1) Apache tarballs – this is one of the most time-consuming options, as you need to download the appropriate binary tarballs from Apache Hadoop and related projects. You have to decide on the location of installation files, configuration files, and log files in the file system, make sure file permissions are set correctly, and so on. Also, a unique thing to note is that you must make sure the version of Hadoop you download is compatible with Hive; component compatibility has not been tested and certified when you do it all yourself.
2) Apache packages – from the Apache Bigtop project to vendor packages from Hortonworks, Cloudera, and MapR, to name a few, enterprise Hadoop clusters mostly rely on RPM and Debian packages from certified vendors. This ensures component compatibility – for example, the proper functioning of Hadoop with Hive or Puppet – which eases most of the work.
3) Hadoop cluster management tools – from Apache Ambari to Cloudera Manager, many GUI tools make this a piece of cake. They also come with the unique advantage of rolling upgrades, which help upgrade the cluster with zero downtime. When there is a need to add more resources to the cluster, the job becomes easy using these tools. They come with heuristics and best-practice recommendations that come in handy while working with the many different components of Hadoop.
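For the tarball route, a minimal sketch looks like the following; the version number and install paths are assumptions, so check the Apache download page for a current release:

```shell
# Hypothetical version and install location
HADOOP_VERSION=3.3.6
INSTALL_DIR=/opt/hadoop

# Download and unpack the binary tarball from an Apache mirror
wget https://downloads.apache.org/hadoop/common/hadoop-${HADOOP_VERSION}/hadoop-${HADOOP_VERSION}.tar.gz
tar -xzf hadoop-${HADOOP_VERSION}.tar.gz -C /opt
ln -s /opt/hadoop-${HADOOP_VERSION} ${INSTALL_DIR}

# Environment variables the Hadoop scripts expect
export HADOOP_HOME=${INSTALL_DIR}
export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin

# Sanity check
hadoop version
```

Everything beyond this point – editing the configuration files under `${HADOOP_HOME}/etc/hadoop`, fixing permissions, and verifying Hive compatibility – is still on you, which is why the package and management-tool options are usually preferred.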
Data has grown from paper files to digital CDs, floppy disks, hard disks, SAN/NAS storage, and now the Hadoop cluster is the trend. The Apache Software Foundation (ASF) runs a set of projects to support data that is generated at a fast pace, comes in structured and unstructured forms, and needs to be stored and processed to mine valuable business insights. This is where the Hadoop project comes into existence. In simple terms, Apache Hadoop is the framework needed to store and process massive amounts of data.
A set of machines is presented to the end user as a cluster. In the real world this is considered a cost-saving measure, as commodity hardware can be made use of for implementing an Apache Hadoop cluster. Here are some interesting facts and features of Apache Hadoop:
1) Fault tolerant – the basic unit of data storage is HDFS, the Hadoop Distributed File System, used to store big data that can come in many different forms; replication of blocks across nodes lets the cluster tolerate machine failures
2) Scalable – it is possible to add more machines to the cluster to meet growing demand
3) Open source – Hadoop is not owned by any firm. Anyone can download the source code, modify it, and run it. Instead of downloading directly from the Apache website, look for distributions like CDH from Cloudera or HDP from Hortonworks, which are Apache Hadoop flavors bundled with the appropriate components, tested, and released for use by enterprises.
Projects built around Hadoop comprise the Hadoop ecosystem. These tools, which form part of the Apache Hadoop ecosystem, make Hadoop easier to use.
Give details on Cloudera and CDH. How are they related to Hadoop?
Cloudera offers enterprise solutions to solve the big data problems of enterprises. Just like Ubuntu, RHEL, Fedora, or any other Linux distribution, CDH is Cloudera's distribution of Apache Hadoop for enterprises. The service offerings of Cloudera don't stop there: Cloudera Manager is the graphical user interface that can be used to manage a Hadoop cluster from a UI. It can be treated as similar to Oracle Enterprise Manager, the GUI from Oracle.