1) What are the data sources from which we can load data into Hadoop?
Hadoop is an open source framework for distributed storage and processing of big data. The first step is to load data into Hadoop. Data sources come in many different forms, as follows:
1) Traditional relational databases such as Oracle
2) Data warehouses
3) Middle tier, including web servers and application servers – server logs form a major source of information
4) Database logs
2) What tools are mainly used in data load scenarios?
The Hadoop ecosystem provides tools for moving data into and out of Hadoop from one or more of the data sources mentioned above. Sqoop and Flume are commonly used in data load scenarios. If you come from an Oracle background, think of tools like Data Pump and SQL*Loader, which help with data loads. They are not exactly the same, but they serve a similar purpose.
3) What is a load scenario?
Big data loaded into Hadoop can come from many different data sources. Depending on where the data originates, there are several load scenarios:
1) Data at rest – Information stored in files, directories, and sub-directories that is not intended to be modified any further. To load such information, HDFS shell commands like cp, copyFromLocal, and put can be used
2) Data in motion – Also called streaming data. This is data that is continuously being updated, with new information constantly added to the data source. Logs from web servers like Apache, logs from application servers, and database server logs (say alert.log in the case of an Oracle database) are all examples of data in motion. Note that multiple logs need to be merged before being uploaded to Hadoop
3) Data from web server – Web server logs
4) Data from data warehouse – Data should be exported from the traditional warehouse and imported into Hadoop. Tools like Sqoop, Big SQL LOAD, and the Jaql Netezza module can be used for this purpose
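The first two scenarios can be sketched in a few lines of Python. This is only an illustration under stated assumptions: every log line is assumed to start with an ISO-8601 timestamp (so sorting lines as strings gives chronological order), each input log is assumed to be already sorted, and the file name merged.log and the HDFS directory /data/logs are hypothetical. The function only builds the argument list for the HDFS shell put command; it does not run it.

```python
import heapq
from typing import Iterable, Iterator, List


def merge_logs(*sources: Iterable[str]) -> Iterator[str]:
    """Merge already-sorted log streams into a single sorted stream.

    Assumption: each line begins with an ISO-8601 timestamp, so plain
    string comparison matches chronological order.
    """
    return heapq.merge(*sources)


def hdfs_put_command(local_path: str, hdfs_dir: str) -> List[str]:
    """Build the argument list for `hdfs dfs -put <local_path> <hdfs_dir>`."""
    return ["hdfs", "dfs", "-put", local_path, hdfs_dir]


if __name__ == "__main__":
    # Hypothetical sample lines standing in for a web server and an app server log.
    web_log = [
        "2024-01-01T10:00:00 web GET /index.html",
        "2024-01-01T10:00:05 web GET /about.html",
    ]
    app_log = ["2024-01-01T10:00:02 app user login"]

    for line in merge_logs(web_log, app_log):
        print(line)

    # Hypothetical paths; the command is printed here rather than executed.
    print(" ".join(hdfs_put_command("merged.log", "/data/logs")))
```

On a machine with a Hadoop client installed, the returned list could be passed to subprocess.run(cmd, check=True) after the merged lines have been written to the local file.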
3) How does Sqoop connect to relational databases?
Information stored in a relational DBMS like Oracle, MySQL, or SQL Server can be loaded into Hadoop using Sqoop. As with any load tool, Sqoop needs some parameters to connect to the RDBMS, pull the information, and upload the data into Hadoop. Typically these include:
3.2) connector – a database-specific JDBC driver that Sqoop uses to connect to a given database
3.3) target-dir – the name of the directory in HDFS into which the data is loaded, by default as comma-delimited text files
3.4) WHERE – a WHERE clause can be used to select only a subset of rows from a table for loading
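As a rough sketch of how these parameters fit together, the Python helper below assembles a sqoop import command line. The host, database, table, driver class, and directory names are all hypothetical. The --connect JDBC URL is not among the items listed above; it is included as an assumption because Sqoop needs it to locate the database. Credentials such as --username are left out for brevity. The helper only builds the argument list and does not run Sqoop.

```python
from typing import List, Optional


def sqoop_import_command(
    connect: str,
    driver: str,
    table: str,
    target_dir: str,
    where: Optional[str] = None,
) -> List[str]:
    """Build the argument list for a `sqoop import` invocation."""
    cmd = [
        "sqoop", "import",
        "--connect", connect,        # JDBC connection string (assumed parameter)
        "--driver", driver,          # database-specific JDBC driver class
        "--table", table,            # source table in the RDBMS
        "--target-dir", target_dir,  # HDFS directory that receives the files
    ]
    if where:
        # Restrict the import to rows matching this condition.
        cmd += ["--where", where]
    return cmd


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    cmd = sqoop_import_command(
        connect="jdbc:mysql://dbhost/sales",
        driver="com.mysql.jdbc.Driver",
        table="orders",
        target_dir="/data/orders",
        where="order_date >= '2024-01-01'",
    )
    print(" ".join(cmd))
```

Keeping the command as a list rather than a single string avoids shell quoting problems with the WHERE condition if the list is later handed to subprocess.run.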