Big data is a term that has captured the imagination of the IT world. Across the web, we see people offering training, looking for training, and searching for big data tutorials.
In this scenario it is worth knowing that you can leverage what you have already learnt, along with your current work experience and knowledge, to find a prospective role in the big data space.
Though big data is often associated with statistical analysis, building models, and predicting business outcomes to solve business problems, it is still possible to find an entry into this lucrative space if you have ever learnt or worked in one or more of the following technologies:
1) UNIX/Linux – This is the most common operating system on which big data systems are built, so knowledge of UNIX, including its commands, comes in handy
2) Programming basics – This can be a compiled language such as Java or C++, or a scripting language such as Python, Perl, Groovy, or shell. Such experience counts even if you are not familiar with R (the statistical programming language), MATLAB, or the Python tools widely used by data analysts
3) Basics of Hadoop and its components – This is not so difficult. A basic understanding of the Hadoop framework and its constituent components, such as HDFS (the Hadoop Distributed File System), Pig, Hive, Apache Spark, NoSQL databases, etc., is sufficient for a starting role
4) Basics of SQL – Technology is an evolution of fundamentals, and data science relies heavily on the SQL programming language with a little bit of massaging. In the big data analytics space this goes beyond the basics and covers complex analytical SQL functions, so learn SQL if you are not already familiar with it
5) Domain knowledge – Though not mandatory, learn the basics of the domain you will be working in, such as finance, banking, healthcare, or life sciences, before taking up the job. Once you have a good grasp of domain-related terminology, it will help you understand the business issues
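As a small illustration of the "complex analytical SQL functions" mentioned above, here is a hedged sketch of a window (analytic) function, using Python's built-in sqlite3 module purely for convenience; the table and data are made up for the example:

```python
import sqlite3  # window functions need SQLite >= 3.25, bundled with modern Python

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sales (region TEXT, amount INTEGER);
    INSERT INTO sales VALUES
        ('east', 100), ('east', 300), ('west', 200), ('west', 50);
""")

# RANK() OVER (...) is an analytic function: it ranks each sale
# within its region without collapsing rows the way GROUP BY does.
rows = conn.execute("""
    SELECT region, amount,
           RANK() OVER (PARTITION BY region ORDER BY amount DESC) AS rnk
    FROM sales
    ORDER BY region, rnk
""").fetchall()

for region, amount, rnk in rows:
    print(region, amount, rnk)
```

The same `PARTITION BY ... ORDER BY` pattern carries over to Hive and Spark SQL, which is why plain SQL experience transfers well into big data analytics.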
To fix HDFS checksum issues and avoid errors in a Hadoop cluster, a block scanner runs on each and every datanode in the cluster. This is HDFS block scanning. By default, scanning happens over a period of 3 weeks. The period can be modified by setting the dfs.datanode.scan.period.hours parameter, whose default value is 504 hours (3 weeks).
Say the value of the parameter dfs.datanode.scan.period.hours is set to zero. What happens in this case?
The default value of 3 weeks (504 hours) is still used for scanning the blocks on the datanode.
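As a sketch, the scan period is configured in hdfs-site.xml on each datanode; the value shown below is simply the default:

```xml
<!-- hdfs-site.xml (example only): 504 hours = 3 weeks, the default -->
<property>
  <name>dfs.datanode.scan.period.hours</name>
  <value>504</value>
</property>
```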
The block scanner uses a throttling mechanism while scanning the blocks on the datanode for checksum errors; this throttle value can be modified. The block scanner checks for corrupt blocks that need to be fixed, and these are reported to the namenode.
Visit http://datanode:50075/blockscannerreport to get the web interface of the block scanner report. More details at block level can be obtained at http://datanode:50075/blockscannerreport?listblocks. Details on blocks and their status of ok or failed are listed as key-value pairs.
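Since the listblocks report is key-value text, it can be post-processed with a few lines of scripting. This is a hedged sketch only: the exact report format varies across Hadoop versions, and here we simply assume each line reads like "blk_<id> : ok" or "blk_<id> : failed":

```python
# Hedged sketch: parse a block scanner report into {block_id: status}.
# Assumes lines of the form "blk_<id> : ok" / "blk_<id> : failed";
# the real format may differ by Hadoop version.

def parse_blockscanner_report(text):
    statuses = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        block, _, status = line.partition(":")
        block, status = block.strip(), status.strip()
        if block.startswith("blk_"):
            statuses[block] = status
    return statuses

sample = """\
blk_1073741825 : ok
blk_1073741826 : failed
blk_1073741827 : ok
"""
report = parse_blockscanner_report(sample)
failed = [b for b, s in report.items() if s == "failed"]
print(failed)  # → ['blk_1073741826']
```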
The scan type is one of the following:
1) local – block scanning performed by the background thread
2) remote – scanning performed by a client or a remote datanode
How to disable the block scanner on datanodes?
The block scanner can be disabled by setting one of the following values:
1) Set the parameter dfs.datanode.scan.period.hours to a negative value. Note that setting the value to zero still scans with the default of 3 weeks (504 hours)
2) Set the parameter dfs.block.scanner.volume.bytes.per.second to zero; the block scanner is then disabled
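The second option above would look like this in hdfs-site.xml (a sketch, example only):

```xml
<!-- hdfs-site.xml (example only): a throttle of 0 bytes/second
     disables the block scanner on this datanode -->
<property>
  <name>dfs.block.scanner.volume.bytes.per.second</name>
  <value>0</value>
</property>
```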
As an administrator in an infrastructure team, say your manager wants you to come up with a plan for a prospective Hadoop cluster to support an upcoming big data project in your organization. Wondering where to start? Here are some basics that come in handy as a first step in building a Hadoop cluster.
Before looking at installation options, let's see all the ways the cluster can be built. This can be done in one of the following ways.
Choosing hardware with an understanding of Hadoop architecture
Choose a set of commodity hardware machines in your datacenter. If you are not sure what is available, have a meeting with the capacity planner to determine whether you have commodity hardware in place. Use one or more of those machines; with the combined resources you can start building a Hadoop cluster on your own. Typically, for HDFS the recommendation is to have at least three nodes for redundancy, which guarantees high availability (a topic worth discussing in detail separately). Essentially, three commodity hardware machines might be needed.
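The three-node recommendation follows from HDFS's block replication factor, which defaults to 3 and is set in hdfs-site.xml (sketch, default value shown):

```xml
<!-- hdfs-site.xml (example only): each block is stored on 3 datanodes
     by default, hence the minimum of three machines for redundancy -->
<property>
  <name>dfs.replication</name>
  <value>3</value>
</property>
```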
Here are the different ways to build a Hadoop cluster:
1) Utilize the commodity hardware in your organization
2) Rent hardware
3) Make use of cloud services like Amazon Web Services or Microsoft Azure that make Hadoop cluster creation and hosting a piece of cake. All you need is to buy the appropriate virtual machines from these vendors, then create and launch the Hadoop cluster in a short timeframe. This comes with the unique advantage of paying as and when your resource consumption increases. Infrastructure as a Service makes the job easy and simple
Now, let's look at Hadoop cluster installation options.
Say you choose to build the Hadoop cluster on your own; here are the installation options to be considered:
1.1) Apache tarballs – This is one of the most time-consuming options, as you need to download the appropriate binary tarballs from Apache Hadoop and related projects. You have to decide on the location of installation files, configuration files, and log files in the filesystem, make sure file permissions are set correctly, and so on. Also note that you must make sure the version of Hadoop you download is compatible with Hive; component compatibility has not been tested and certified when you do it all yourself
1.2) Apache packages – From the Apache Bigtop project to vendor packages from Hortonworks, Cloudera, and MapR, to name a few, enterprise Hadoop clusters mostly rely on RPM and Debian packages from certified vendors. This ensures component compatibility, such as the proper functioning of Hadoop with Hive and Puppet, which eases most of the work
1.3) Hadoop cluster management tools – From Apache Ambari to Cloudera Manager, many GUI tools make this a piece of cake. They also come with the unique advantage of rolling upgrades, which allow a cluster upgrade with zero downtime. When there is a need to add more resources to the cluster, the job becomes easy using these tools. They come with heuristics and recommendations that come in handy while working with the many different components of Hadoop
Big data is a game changer of this decade. With the changing demographics of business and information, the need to transform existing relational models into NoSQL big data models has become inevitable. In addition to the amount of information, the type and format of data has transformed from text to videos, feeds, pictures, etc.; it is more media these days. To keep pace and survive in this world of risks and opportunities, it is better to be safe than sorry: learn faster, be proactive, and get on the trend.
Why will big data find its best role in the healthcare sector?
Healthcare IT is mainly composed of the following silos:
1) Entering patient information using EHR/EMR/HIS – Information on patient details, studies, visits, etc. is stored digitally in a database. As long as the need for healthcare persists, this keeps growing.
2) DICOM/PACS – Picture archiving is the digitization of X-rays, scans, etc. DICOM (Digital Imaging and Communications in Medicine) is the standard adopted for such purposes. As mentioned above, this is media, and as this information grows in size it becomes big data
3) Digitization of insurance, billing, and receipts – With the mandate to implement ICD-10, the digitization of billing and related information is growing big and will never stop. This information needs to be stored in big data systems
4) Healthcare CRM – Customer relationship management is one of the biggest data feeds stored in NoSQL databases. It is already big and keeps growing
As such, healthcare is a sector whose data is big and is expected to keep growing continuously. If you are a healthcare professional in healthcare IT, or planning to take up a job in this ever-growing sector, learn NoSQL with us.
This month Hortonworks released the latest version of Hortonworks DataFlow, version 1.1. Hortonworks Data Platform, popularly called HDP, is the major project and product of Hortonworks, built on top of the open-source Hadoop ecosystem.
Now, do Hortonworks DataFlow and Data Platform represent the same thing?
No. HDP, the Hortonworks Data Platform, is a bundled version of open-source Hadoop in packaged format. Using an installer, all the components that form part of the Hadoop project are chosen and bundled correctly. As the many different components in the Hadoop ecosystem have version releases at different points in time and compatibility is not always guaranteed, HDP is a stable solution for enterprises looking to have Hadoop implemented as a customized, stable, tested package that is installed using an installer.
Hortonworks DataFlow, on the other hand, is Apache NiFi. This is the GUI tool used to design dataflows using processors, which are data-extracting engines designed to work with many different data sources; Hadoop is the destination for data enrichment. There are around 90 processors in HDF that can get files from the local filesystem, extract information from Twitter, etc. This information can be put into HDFS, the Hadoop Distributed File System, and the dataflow is designed using relationships. Once the processors are dragged and dropped in the GUI, the appropriate properties are configured, and the relationships are established and built appropriately, the dataflow gets initiated.
As such, HDF is for designing dataflows, while HDP is the Apache Hadoop platform supporting enterprise big data projects, starting with HDFS, the Hadoop Distributed File System.
Apache NiFi is the data ingestion tool that has been customized as Hortonworks DataFlow. The processor is the basic component that helps collect and aggregate the correct information to be processed and pushed onto HDFS. There are more than 90 processors as of today that come as an integral part of Apache NiFi, so it helps to have an easy method for locating the correct processor:
1) Drag and drop the processor icon from the Apache NiFi web user interface
2) Click on tags to locate processors based on usage. Say, the tag ingest will list processors whose names start with Get
3) Type the processor name in the search box and add it
Hortonworks DataFlow, the GUI that helps in collecting, conducting, and curating distributed data from structured and unstructured data sources and pushing it onto Hadoop for enrichment, has its latest version, Hortonworks DataFlow 1.1.1, released and available for download. This software was released on January 3rd, 2016 and can be downloaded from the following link.
Hortonworks DataFlow is powered by Apache NiFi and has the same GUI as Apache NiFi. It is an integrated platform that helps transport data onto HDFS, from something as small as Twitter feeds to a data source with continuous streaming of information. Data is also transported in a secure fashion.
The basic component that comes out of the box as part of Apache NiFi is the processor. This is the core component of a dataflow. It is possible to choose the appropriate processor from among the 90 processors by a simple search using name or tag in the search box, then drag and drop the processor in the GUI, which is accessed on port 8080. The incoming information or data is referred to as a flowfile. Typically, a relationship is established between the processor and the datastore, which happens to be HDFS in a Hadoop environment.
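To make the processor/flowfile/relationship vocabulary concrete, here is a toy model in Python. This is an illustration only, not NiFi code; the names FlowFile and route are hypothetical, and real NiFi processors are configured in the GUI rather than written like this:

```python
# Toy model (not NiFi code): a processor routes incoming "flowfiles"
# to named relationships (e.g. success/failure), and downstream
# components consume from those relationship queues.

from dataclasses import dataclass, field

@dataclass
class FlowFile:
    content: bytes
    attributes: dict = field(default_factory=dict)

def route(flowfile):
    """Route a flowfile to 'success' if it carries content,
    otherwise to 'failure'."""
    return "success" if flowfile.content else "failure"

# Two relationship queues, as a stand-in for connections in the GUI.
queues = {"success": [], "failure": []}
for ff in [FlowFile(b"tweet data"), FlowFile(b"")]:
    queues[route(ff)].append(ff)

print(len(queues["success"]), len(queues["failure"]))  # → 1 1
```

In a real dataflow the "success" relationship of a Get-style processor would typically connect to a PutHDFS-style processor, which is what lands the data in HDFS.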
With the growing demand for big data and analytics professionals around the world, many cloud firms have decided to come up with free versions of their analytical cloud solutions that give end users a very good feel for the products. One such initiative is the release of the Salesforce Analytics Cloud Playground, as they brand it, the free Salesforce analytics solution from salesforce.com.
If you are a newbie looking to learn how a typical analytics solution works, this playground can be the place to go. Follow these simple steps to explore this interesting analytics solution today:
1) Navigate to the Salesforce Analytics Cloud Playground website
2) Start by taking the free tutorial. This interactive video provides a simple example of how this analytics solution can cater to our needs
3) The Salesforce Analytics Cloud dashboard offers a single view of data from multiple data sources, presented as data visualizations in one view
4) As a next step, try uploading data from your own data sources, which can be a salesforce.com account, CSV files, Google Docs, or spreadsheets, to name a few
As this is a free solution, there is no provision to save the visualizations we create during the demo; the only possibility is to download the visualizations from the demo. Here is the Salesforce Analytics Cloud Playground website:
Hortonworks, the enterprise Hadoop software vendor, offers world-class software for enterprises with a stable Hadoop suite. Sandbox is a download offered by Hortonworks. As a first step to learning Hadoop, we need an environment that is easy to play with. Per Hortonworks, the Sandbox is a single-node cluster that comes as a learning environment for starting to learn Hadoop and its ecosystem of products. It can be installed in VirtualBox or VMware on a desktop, and per Hortonworks it can be installed in Microsoft Azure as well. Based on our personal experience, the Azure installation of the Hortonworks Sandbox demands an A4-sized virtual machine, which is not free. Start learning Hadoop on your desktop now.