What is Big data?
It is definitely the most talked-about ‘new kid on the block’ in the analytics fraternity. Everyone seems to be talking about it. So, what exactly is Big data?
Big data consists of data sets that grow so large that they become ‘difficult / awkward’ to work with using existing database management systems (Oracle, Sybase, MySQL, Teradata etc.). Difficulties include capture, storage, search, sharing, and mining the data for analytics. With an explosion in sources of data (internet forms, cookies, sensors, mobile applications, satellite data etc.), the quantum of data is growing and will continue to grow at an astronomical pace. The cost of storing data is falling exponentially too. The cost of a 4 GB pen drive is now 10% of what it was a couple of years ago. Coupled together, these two trends will fuel the growth in the quantum of data that we will have access to.
The world’s technological per capita capacity to store information has roughly doubled every 40 months (a little over 3 years) since the 1980s. Some people say it is now going to double every 1.5 years. Every day, 2.5 quintillion bytes of data are created.
As you can visualize, these new sources of data will mostly produce non-relational data, stored in non-relational DBMS. Thus, transactional data within an organization will, by necessity, be in the traditional relational database, while there is this ‘other’ data, which is where the maximum growth will be. This ‘other’ data will need to be mined and put into MIS and reports, analyzed for trends and used to create probability equations.
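To get a feel for what those doubling periods mean, here is a small back-of-the-envelope sketch. The function name `growth_factor` and the 10-year horizon are my own choices for illustration; only the 40-month and 1.5-year doubling periods come from the figures above.

```python
def growth_factor(elapsed_months, doubling_months):
    """How many times capacity multiplies if it doubles every `doubling_months`."""
    return 2 ** (elapsed_months / doubling_months)


if __name__ == "__main__":
    horizon = 120  # an illustrative 10-year window, in months
    # Doubling every 40 months: three doublings in 10 years.
    print("40-month doubling:", growth_factor(horizon, 40))
    # Doubling every 18 months (1.5 years): far more doublings in the same window.
    print("18-month doubling:", round(growth_factor(horizon, 18), 1))
```

Under the 40-month rate, capacity grows eightfold in a decade; if the doubling period really shrinks to 1.5 years, the same decade would see growth of roughly a hundredfold.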
This ‘other’ data is generally called Big data.
Connecting Big Data
As systems and processes stabilize and mature on capture and storage of Big Data, the focus is shifting to ‘WHAT NEXT’?
Logically, the next step is to mine the data for information – business intelligence and Analytics.
I will specifically look at Apache Hadoop in this context. Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte datasets) in parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner.
Many vendors have caught the Hadoop bug and released their own versions of the software, such as Cloudera, Hortonworks, and Microsoft with HDInsight.
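To make the MapReduce idea concrete, here is a minimal single-machine sketch of the classic word count in plain Python. This is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) and the sample lines are assumptions made purely for illustration. On a real cluster the framework would run the map and reduce steps in parallel across many nodes and handle the grouping in between.

```python
from collections import defaultdict


def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group all values by key, as a framework would between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce step: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle and reduce in sequence on a list of text lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    grouped = shuffle(pairs)
    return dict(reduce_phase(key, values) for key, values in grouped.items())


if __name__ == "__main__":
    sample = ["big data is big", "data is everywhere"]  # hypothetical input
    print(word_count(sample))
```

The point of the split is that each `map_phase` call only needs its own line and each `reduce_phase` call only needs the values for its own key, which is what lets a framework spread the work over many machines.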
Sounds simple, but for a data analyst with no Java coding skills, it is all Latin and Greek. Until you take a look at Pig, a high-level platform for creating MapReduce programs used with Hadoop. The language for this platform is called Pig Latin. It abstracts the programming from the Java MapReduce idiom into a notation which makes MapReduce programming high level, similar to what SQL does for RDBMS systems. A high-level language is closer to natural language and spoken English, which makes it very user-friendly for the non-coder.
Wow, this makes life so much better for data analysts. And it enables us analysts to look forward to many more projects where we will effectively crunch ‘Big Data’.
One piece of software that competes with Hadoop is Google’s BigQuery, and the comparison between these two giants is a story for another day. But in the real world out there, Hadoop is the current favorite Big Data management and analysis system.
Interesting aside: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications. Hadoop is written in the Java programming language. It was created by Doug Cutting and Mike Cafarella, and Doug named it after his son’s toy elephant! All parts of the Hadoop framework have names commonly found in a zoo :)
Since 2002, Subhashini has built a decade of experience across Analytics roles in Retail Finance and Banking. These roles have spanned Risk Management, Collections Strategy, Fraud Control and Marketing at GE Money, Standard Chartered Bank, Tata Motors Finance and Citi GDM. Her area of interest is the integration of the results and outputs of Analytics with Business Decisions, both Tactics and Strategy.
She is currently active in the Analytics Training and Consulting arena.
(Link to LinkedIn profile – https://in.linkedin.com/pub/subhashini-s-tripathi/3/405/77b )
Drop a query if you have any questions regarding Big Data and I will get back to you quickly.
WRITTEN BY CloudThat