Why should Oracle provide support for JSON?
MongoDB, the most popular NoSQL database, is based on a document model in which documents are stored in JSON/BSON (a binary form of JSON) format. As the demand for big data grows, it becomes inevitable for players in the RDBMS space to look for ways to support the JSON format.
How does Oracle Database 12c handle this?
Oracle Database 12c stores such information using the CLOB datatype. Let's take a simple example:
create table json_support(id int, json_data clob);
Is this optimal?
Oracle makes this optimal using functions like json_value(). Function-based indexes on such expressions extend the performance benefits.
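As a hedged sketch of how json_value() and a function-based index might be applied to the table above (run via sqlplus; the connection string, JSON path and index name are made up for illustration):

```shell
# Query and index JSON stored in the CLOB column (Oracle Database 12c and later)
sqlplus -s demo/demo@ORCL <<'SQL'
-- Pull a scalar value out of each stored JSON document
select json_value(json_data, '$.name') from json_support;
-- A function-based index so such lookups can avoid full table scans
create index json_name_idx on json_support (json_value(json_data, '$.name'));
SQL
```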
Can Oracle Database support big data?
Oracle Database, the legacy database supporting a variety of systems in healthcare, finance, retail and many other sectors, has been widely popular for OLTP and data warehousing applications. Big data is the talk of this decade, and many of us would like to determine whether Oracle can be used as a big data database.
First of all, what is big data?
Big data as such is a platform onto which we pull data from data warehouses, OLAP systems, tweets and many disparate data sources, covering structured and unstructured data. The data warehouse can be an Oracle database; hence an Oracle database can be a portion of a big data platform.
Oracle, the master of databases, is the only vendor in the industry to come up with the Oracle Big Data Appliance and Big Data Connectors, which cater to the needs of enterprise big data. In contrast to other databases, Oracle adopts the principle of leveraging the existing enterprise database architecture, including Oracle Exadata, and incorporating big data into it to deliver big value to the business.
Want to take up a career as a Cloudera big data analyst? Interested in learning the prerequisites? Here is an outline of the requirements to emerge as a Cloudera big data analyst:
1) The primary skill necessary to find a career as a data analyst is SQL
2) In a Hadoop environment it becomes mandatory to learn tools like Pig scripting, Hive and Impala to analyse big data
3) The work involves high-level analysis; you don't need to be a developer
4) Learn how to get data from other systems like data warehouses and databases
5) Learn how to analyse big data sets
6) Knowledge of basic relational databases comes in handy
7) Basic UNIX commands definitely help. Some interesting commands include mkdir, rm, cp and mv
8) Knowledge of a high-level programming language definitely helps. The most preferred languages include Java, Python and Perl
9) Knowledge of ETL and the Hadoop framework is a plus
10) You get to learn Hadoop data ingestion tools and analysis using Pig scripting, Hive commands and Impala, to name a few
Data science, the rapidly evolving stream that employs different strategies to come up with an answer to an existing business problem, is a methodical approach involving the following phases, starting with understanding the business issue and finishing with an answer to the business problem.
Here are the different stages of the data science methodology:
1) Understand the business issue
2) Determine the analytical approach
3) Determine the data requirements for building analytical models
4) Collect data from many different data sources
5) Understand the type of data. This can come from relational databases or website logs, and can be structured as well as unstructured
6) Prepare the data
7) Model the data and continuously evaluate the models
8) Deploy the models designed to predict outcomes
9) Get feedback from customers and implement changes to the models as needed
What does a typical data analyst do? A data analyst works closely with customers to understand their business requirements and utilizes data science to solve their business issues.
A data analyst helps clients understand and improve the user experience with their online properties. Collecting data requires a data source; this can typically be the firm's website, or the web-based or desktop applications associated with storing data, such as an EMR system in a typical hospital environment.
Clients look for automated analytical solutions that let them view metrics in the form of dashboards, reports, etc. A data analyst works with developers to automate this process using reporting solutions like SSRS or Tableau, depending on the solution being used for the product.
A data analyst is sometimes expected to do SAS programming that automates insight retrieval based on statistical modelling techniques like significance testing, t-tests and regression analysis. These techniques help with pre-treatment, post-treatment and control analysis to identify the best-performing location in the case of website projects; this can be the header, a top banner parallel to the header or a side banner, to name a few. Location prominence helps clients place important information based on user behavior for maximum conversion, which is the primary business issue. This is typically referred to as user experience.
Propensity matching is a technique used to predict a set of customers likely to have characteristics similar to another customer group. It is typically used in projects demanding market segmentation, as opposed to projects that involve yes-or-no type questions.
A model can also be built to predict the customers who are most likely to sign up for and activate a product that will be launched in the future. This is future business prediction that helps clients with decision making on production as well as inventory.
As an aspiring Java programmer, if you are exhausted and looking for a career change that builds on top of your prior Java development experience but offers better compensation, big data is the way to go.
Many vendors are in the process of developing and implementing tools to support big data. One of the most popular vendor technologies supporting big data is Cloudera.
Now, let's take a look at how to handle the big data challenges thrown at a Java programmer. As a Java programmer, you can start learning the following skills to grow big and make more money in your career:
1) Start learning Hadoop MapReduce. Scripting using Pig or Java is an essential skill. In some organizations, the MapReduce functions inherent in NoSQL databases like MongoDB might come in handy
2) Experience with Hive, Spark and similar distributed data processing platforms is essential to take up a job as a big data engineer
3) Experience with tools like Cloudera Manager would be a plus. Some employers prefer certification from Cloudera
4) Experience working with NoSQL databases like HBase and MongoDB would be a plus
5) Development experience with Java is much preferred; however, experience with Python and R might come in handy
6) This is a technology trend, so plenty of attitude, ambition and self-learning is essential
7) You must be very comfortable working in a Linux environment; shell, Perl, Python and Ruby scripting all come in handy
8) Some employers prefer cloud knowledge and experience such as AWS, essentially components of AWS including EC2, S3 and EMR
9) Big data development happens in an agile environment, hence knowledge of the SDLC is a must
Hadoop, the open-source Apache Foundation project written in Java, with the Google File System as the model for its storage layer, forms the framework to support big data.
All of us say big data. So far we have been processing terabytes of data using existing relational database management systems. So, what exactly is big data?
Let us first take a look at the three major things addressed by Hadoop, popularly called the 3 V's: Velocity, Volume and Variety.
Yep, these three terms form the basics of big data.
Big data, as its basic properties define it:
1) Grows at a spectacular rate (velocity). Good examples include data collected from sensors in offices, RFID readers, mobile phones, etc.
2) Is voluminous in nature (volume)
3) Can be structured, unstructured or semi-structured, coming in different forms (variety)
To handle this kind of data, a relational database may not be sufficient. This is where the Hadoop framework comes into the picture.
Hadoop Framework – A Quick Overview:
To kick-start in the big data arena, it is mandatory to know the ABCs of Hadoop. Apache Hadoop is a framework that originated in 2005 and was hardened by a team of Yahoo engineers; the project is written in Java.
This is an open-source project supported by Apache; anyone can download and practise with the binaries for free. As with many popular frameworks, Apache Hadoop is also available in commercial flavours from Hortonworks, Cloudera, etc.
Let's take a quick look at the pieces that make the Hadoop framework work big:
1) Apache Hadoop – this is the framework on which big data is supported; it is considered the Hadoop data management layer
2) Hadoop Pig Latin – the scripting language used to process big data; as it works directly on data, it is a big data management tool
3) Apache HBase – the NoSQL database of the Hadoop ecosystem; the database for big data
4) Apache Hadoop HDFS – the Hadoop Distributed File System that hosts big data; a data management tool
5) Apache Ambari – the monitoring and management tool, classified as a Hadoop operational tool
6) Apache ZooKeeper – this big data operational tool is used for coordination and configuration of the Hadoop framework
7) Hadoop Sqoop – used to migrate data from relational databases onto Hadoop HDFS
8) Hadoop Flume – performs big data aggregation, such as aggregating logs onto a central repository
Hadoop, the talk of the town, is the most popular open-source framework upon which a big data architecture can be built. There are many technologies out there that support big data; however, Hadoop is the most popular owing to the following reasons:
1) Hadoop is an open-source Apache project
2) It hosts all the tools needed to store, process and support big data
3) It comes with an integral NoSQL database, HBase. This avoids the driver-integration issues that come with third-party NoSQL databases like MongoDB and Cassandra
4) Vendors like Hortonworks have taken up projects to spice up open-source Apache Hadoop into proprietary distributions such as HDP, the Hortonworks Data Platform
5) Hadoop has technologies that offer high availability and low-latency data processing
6) Hadoop is Java based, which makes it possible to run on many different operating systems like UNIX, Linux and Windows without any issues; the platform-independent nature of Java makes this possible
Hadoop is a popular and widely adopted framework to support big data. So, what is the real role of Hadoop in supporting real-world big data projects?
Hadoop comprises many tools developed to store big data in chunks that can easily be accessed and processed. Let us see the tools that form the basis of the Hadoop architecture and the role of every component in supporting a big data project:
1) HDFS – the Hadoop Distributed File System. Any system, starting with our desktop PC, is expected to store data in a file system, and Hadoop offers its own file system that can host data that is big. The default block size in HDFS is 128 MB. HDFS comes with a replication factor, which enables information to be split and distributed across more than one physical machine. Every HDFS cluster needs one namenode, the admin node that stores metadata about all other nodes in the Hadoop cluster; the second set of machines, called datanodes, store the actual data. The replication factor determines the number of copies of the data. A namenode and a datanode can be installed on the same machine for learning purposes, but in production the namenode should be on a machine separate from the datanodes. HDFS stores both structured and unstructured data, and real-time data can also be stored.
2) Flume – used to ingest unstructured (streaming) data onto HDFS
3) Sqoop – used to store and retrieve structured data from HDFS. In real life, Oracle is a popular relational database; if there is a requirement to transfer data from Oracle onto HDFS, Sqoop can be used. Sqoop is also needed when processed information is exported back to the client
4) YARN – we can think of YARN as the operating system at the heart of the big data architecture; it allocates the resources used to process information stored in HDFS
5) HBase – the NoSQL database that comes as part of Hadoop. Tabular information is stored in HBase, while files, images and unstructured data are stored in HDFS
6) MapReduce – Hadoop offers MapReduce, the programming model that runs on YARN to process information stored in HDFS
7) Pig (Latin) – a language developed for the sake of data analysts that makes processing simple; Pig helps with data analysis
8) Hive – the SQL of big data
9) Oozie – the workflow scheduler tool needed to schedule tasks in an organized fashion. This can be thought of as a build tool like Ant or Maven in a software development environment
10) Spark – Apache Spark is the in-memory processing engine that can be used with Hadoop as well as non-Hadoop environments like MongoDB. Spark is gaining popularity
11) Mahout – the machine learning library used for statistical analysis
12) Client/Python/CLI – the layer at which the user interacts
13) HUE – Hadoop User Experience, the user interface that supports the Apache Hadoop ecosystem; it makes all of the tools accessible, and the entire ecosystem can be accessed via a web browser
14) ZooKeeper – a tool used for maintaining configuration information, naming, providing distributed synchronization and providing group services
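Several of these layers can be exercised from a single client machine. Here is a hedged sketch (it assumes a configured Hadoop client with hdfs, sqoop and hive on the PATH; the paths, connection string and table names are made up for illustration):

```shell
# Land a local log file in HDFS (storage layer)
hdfs dfs -mkdir -p /user/demo/logs
hdfs dfs -put access.log /user/demo/logs/

# Pull a relational table from Oracle into HDFS with Sqoop (ingestion layer)
sqoop import \
  --connect jdbc:oracle:thin:@dbhost:1521/ORCL \
  --username demo --password-file /user/demo/.pw \
  --table CUSTOMERS --target-dir /user/demo/customers

# Query the ingested data with Hive, the "SQL of big data"
hive -e "SELECT COUNT(*) FROM customers"
```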
Why is hadoop used for big data?
When we talk of big data, the first thing that comes to mind is Oracle Exadata. Oracle has built a sophisticated machine with a powerful operating system to process data that is big. However, Hadoop is an open-source project developed to cater to the need for storing and processing both structured and unstructured data. Hence, Hadoop is the preferred architecture in the big data space.
Is hadoop the only architecture available to support big data?
No, but Hadoop is the preferred choice and is widely accepted.
Data is growing big, and businesses do find opportunity in mining data and performing predictive analysis. Data from different sources, including clicks, is gathered and stored in unstructured form and is often referred to as big data. It is stored in the Hadoop framework and supported on top of NoSQL databases. An essential ingredient of a big data predictive analysis job is learning a statistical language like R to grab the opportunity. Let's start with installing R on Windows:
1) Download, launch R setup installer
2) Choose destination directory
3) Choose components to be installed
4) Select startup options
5) Select start menu folder to create shortcuts
6) Install shortcuts
7) Program starts installing
8) The R installation finishes successfully
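Once the installer completes, the install can be verified from a command prompt (this assumes R's bin directory has been added to PATH):

```shell
# Print the installed R version to confirm the setup
R --version
```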
HDFS, the Hadoop Distributed File System, is the file system project that supports big data in the Hadoop ecosystem. Some big data companies like MapR have their own proprietary file system instead of HDFS. Many organizations make use of HDFS, and questions on the HDFS file system are definitely going to be part of a big data interview. Here are some HDFS interview questions to help crack big data jobs:
1) What is HDFS?
HDFS stands for Hadoop Distributed File System. It is a distributed file system used to manage data distributed across clusters.
2) What are the two major components of HDFS?
The namenode and the datanodes
3) What is the core concept behind design of HDFS filesystem?
HDFS filesystem is designed with the following concepts in mind:
Size of files – the HDFS file system is meant to store files that are gigabytes to terabytes in size. Real-world Hadoop clusters store petabytes worth of data in the file system
Performance with streaming data – in typical big data projects, information from many different sources gets stored in HDFS, and additional information is appended onto the data already there. Hadoop is a write-once, read-many implementation that must also consider the performance of data access; HDFS is designed to support streaming access to data
HDFS failover capabilities – Hadoop clusters are a good alternative to expensive hardware, as they are designed to run on commodity hardware. The term commodity hardware is generic; it means mid-range servers from many different vendors that are reasonably stable. In case of failure there must be a node failover provision, and HDFS supports this via the replication factor
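Replication is driven by the dfs.replication property; a minimal hdfs-site.xml fragment (3 is the default, meaning three copies of every block):

```xml
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```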
4) Is HDFS a good fit for all the applications?
No. HDFS may not be the right choice in the following situations:
Low-latency applications
Lots of small files
Applications that need in-place updates or many concurrent writers, since HDFS files can be written to only in append fashion and multiple applications cannot write to the same file at the same time
5) What is the purpose of distributed filesystem?
Big data refers to data whose volume mandates storage across more than one commodity server. The information is stored across the network, and the set of machines behaves as a single physical entity. This mandates a file system that takes care of file system management across the network. That is where a distributed file system comes into the picture.
6) What is the use of heartbeat in HDFS?
Datanodes send signals to the namenode in an HDFS environment; this is an indication that the datanodes are functioning properly. The default heartbeat interval is 3 seconds. The interval is configurable and can be changed by setting the dfs.heartbeat.interval value in the hdfs-site.xml file
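For reference, the corresponding hdfs-site.xml fragment (the value is in seconds; 3 is the default):

```xml
<property>
  <name>dfs.heartbeat.interval</name>
  <value>3</value>
</property>
```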
7) What command is used to check the status of daemon on HDFS?
The jps command is a Java command that needs a JDK to run. It produces a list of all Hadoop daemons currently running, such as NameNode, TaskTracker and JobTracker. It can be thought of as the equivalent of the Linux ps command
8) Which file should I make use of for changing the block size of HDFS files?
The hdfs-site.xml file holds the default block size parameter, dfs.blocksize. This value can be changed to alter the block size of HDFS files. It should be done during downtime, as the change demands a cluster restart, and it affects only files written after the change
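A minimal hdfs-site.xml fragment setting the block size to its 128 MB default (dfs.blocksize is specified in bytes):

```xml
<property>
  <name>dfs.blocksize</name>
  <value>134217728</value>
</property>
```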
9) I want to modify the files present in HDFS. What should I do?
HDFS works on the concept of write once, read many. All we can do is append data to an existing file; it is not possible to modify files already present in HDFS
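The append-only behaviour can be seen with the hdfs dfs -appendToFile command; a hedged sketch (assumes a running cluster; file names are made up):

```shell
# Append the contents of a local file to an existing HDFS file;
# in-place edits of HDFS files are not possible
hdfs dfs -appendToFile extra_records.txt /user/demo/data.txt
```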
10) What is the use of hadoop archives?
HDFS uses the Hadoop archives concept to minimize the metadata stored in the namenode when there are many small files. This in turn conserves namenode memory
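A hedged sketch of creating an archive (assumes a running cluster; the paths and archive name are made up):

```shell
# Pack the many small files under /user/demo/input into one archive,
# reducing the per-file metadata the namenode must hold
hadoop archive -archiveName demo.har -p /user/demo input /user/demo/archives
# The packed files remain readable through the har:// scheme
hdfs dfs -ls har:///user/demo/archives/demo.har
```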
Big data professionals need a good grip on the UNIX/Linux OS. Here is a synopsis of common commands for them. Note that all UNIX/Linux commands are executed in a shell and should end with a carriage return:
As a Hadoop big data professional, some tasks are common in day-to-day life, and these commands come in handy. A UNIX prompt usually shows $, or # for the root user (aka the superuser), which has maximum system privileges
1) man commandname – provides the help notes on a command
2) Commands to create a directory and navigate
mkdir dirname – create a directory
cd dirname – change to this location
pwd – determine details on the current working directory; also called the present working directory
3) Commands to create files
touch filename – create an empty file
Use editors like vi/emacs to open and edit a file
To get the list of files, use: ls
For a long listing: ls -l
4) Access the files and read file contents: cat filename
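The commands above can be strung together as a small, runnable session:

```shell
# Create a working directory and move into it
mkdir -p bigdata_lab
cd bigdata_lab
# Confirm the current (present) working directory
pwd
# Create a file, then list the directory in long format
echo "hello hadoop" > notes.txt
ls -l
# Read the file contents
cat notes.txt
```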
Test plan – includes everything; also called the test strategy: automated/manual testing, deadlines, resources. The test strategy covers cost as well
Test scenario – a higher level of functionality that we are trying to test
Test case – part of a test scenario; it goes into greater depth, testing individual items. A test case is a breakdown of test scenarios into various combinations