This post has been pending for a long time.
After attending a few seminars on ‘Big Data and Hadoop’ at TCS, I thought of compiling this post a few months back, to share my understanding of the topic.
We live in a world full of information, and that information takes the form of data. Data comes in many kinds: video and images, social data, documents, machine-generated data. Much of it is extremely time-sensitive and difficult to extract. The NYSE produces data on the order of terabytes, and every second of HD video holds more than 2,000 times the data required to store a single page of text. The roughly 30 million network sensors in use are growing at a rate of 30% per year. Traditional RDBMSs cannot handle such data. Travelling from Delhi to Bhubaneswar generates about 1 terabyte of data, of which only 2% is utilized. Much of this is unstructured data that is difficult to capture.
What makes Big Data different:
- Jobs distributed across affordable hardware
- Manages and analyzes all kinds of data
- Analyzes data in its native format
- Moves from velocity to value
For example, Google indexes all websites, which requires highly scalable technology to handle terabytes of data, so they moved to distributed computing. Google is working on the top 10 Big Data projects (Galaxy).
Big Data comprises the 4 Vs:
- Volume — the amount of data
- Velocity — the speed at which data arrives
- Variety — the different forms data takes
- Value — the worth that can be extracted from data
The traditional way was ‘structured and repeatable analysis’, while Big Data calls for an ‘iterative and exploratory’ approach.
Hadoop is a scalable, fault-tolerant distributed system for data storage and processing. Its primary storage system is HDFS (Hadoop Distributed File System), and MapReduce is its processing environment for distributed computing.
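To make the MapReduce idea concrete, here is a minimal sketch of the model in plain Python. This is illustrative only: real Hadoop jobs are written against the Hadoop MapReduce API and run across a cluster, whereas this runs the map and reduce phases in a single process.

```python
from collections import defaultdict

def map_phase(line):
    # Map step: emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield (word.lower(), 1)

def reduce_phase(pairs):
    # Reduce step: sum the counts for each word emitted by the mappers.
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    return dict(counts)

lines = ["big data needs big storage", "data is the new oil"]
pairs = [pair for line in lines for pair in map_phase(line)]
print(reduce_phase(pairs))  # {'big': 2, 'data': 2, 'needs': 1, ...}
```

In a real cluster the map calls run in parallel on the machines holding the data blocks, and a shuffle step groups the pairs by key before the reducers run.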
It is supported by various related Apache projects: HBase, Avro, etc.
Hadoop generally runs on heterogeneous commodity hardware. Suppose we have a large amount of data and want to process it faster; on a single machine this would require expensive hardware. Here we can see one of the advantages of Hadoop: we can split gigabytes of data, store the pieces in different locations, and keep copies of the files for safety. We can then process the data where it is rather than moving the data.
HDFS provides self-healing, high-bandwidth clustered storage. Instead of keeping a single copy of the data, HDFS stores several replicas of each block (three by default) spread across different machines, a distributed way of storing files.
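The splitting-and-replication idea above can be sketched in a few lines of Python. This is a hypothetical illustration, not real HDFS code: the names (`split_into_blocks`, `place_replicas`, the tiny block size) are made up for the example, and real HDFS uses rack-aware placement rather than simple round-robin.

```python
BLOCK_SIZE = 4    # bytes per block, tiny for the demo (HDFS default is 128 MB)
REPLICATION = 3   # copies of each block (the HDFS default)
NODES = ["node1", "node2", "node3", "node4", "node5"]

def split_into_blocks(data, block_size=BLOCK_SIZE):
    # Chop the file content into fixed-size blocks.
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_replicas(blocks, nodes=NODES, replication=REPLICATION):
    # Round-robin placement: each block lands on `replication` distinct
    # nodes, so losing any single machine loses no data.
    placement = {}
    for idx in range(len(blocks)):
        placement[idx] = [nodes[(idx + r) % len(nodes)] for r in range(replication)]
    return placement

blocks = split_into_blocks(b"hello big data!")
print(place_replicas(blocks))
```

Because each block lives on multiple machines, the computation can be sent to whichever node already holds a copy, which is exactly the "process the data where it is" idea.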
BigInsights brings Hadoop to the enterprise. It is tightly coupled with Linux platforms, as most of its technologies come from open source.
(These are excerpts from the seminar given by Mr. Sudhir Menon, Principal Solution Architect at TCS Bhubaneswar.)
P.S.: The above post is compiled from the notes I took at the seminar. Please provide your valuable suggestions and feedback.