Big Data, BI and everything about data

Monday, July 21, 2014

Spark on Hortonworks

This news might not be new, but it is something I would like to put on my blog as a to-do item. Apache Spark is now available on Hortonworks, as announced last month (June). Cloudera had a similar release in February this year. It seems Spark is gaining traction.

Below is the high-level architecture of Spark and the other members of the Hadoop big data stack.

The Future of Data Management: The Enterprise Data Hub

Today I came across Cloudera's solution for enterprise data management, which they call the Enterprise Data Hub (EDH).

Here's the full presentation on the topic by Cloudera.

The Future of Data Management: The Enterprise Data Hub from Cloudera, Inc.

Monday, November 25, 2013

Future of Database

I have come across an interesting article that condenses the history of commercial database technologies and the current development of databases.

Interestingly, the article uses an infographic to show where databases came from and how they are evolving in response to the explosion of data and the way people consume data now. To read more, go here. What do you think?

Friday, November 8, 2013

Presto : Facebook Big Data Query Engine

Facebook has announced its latest open source project, named Presto. You can read the whole introduction from the engineering team on Facebook :-D

Here's a quick look at how it works.

Presto is written in Java, and here is some information from the introduction of Presto.

Presto is 10x better than Hive/MapReduce in terms of CPU efficiency and latency for most queries at Facebook. It currently supports a large subset of ANSI SQL, including joins, left/right outer joins, subqueries, and most of the common aggregate and scalar functions, including approximate distinct counts (using HyperLogLog) and approximate percentiles (based on quantile digest). The main restrictions at this stage are a size limitation on the join tables and cardinality of unique keys/groups. The system also lacks the ability to write output data back to tables (currently query results are streamed to the client).
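To get a feel for what an approximate distinct count does, here is a toy HyperLogLog sketch in Python. It is only my own illustration and not Presto's code (Presto is written in Java); the class name, the SHA-1 based hash and the choice of 12 index bits are all assumptions I picked for the example.

```python
import hashlib
import math


class HyperLogLog:
    """Toy HyperLogLog estimator (illustrative only, not Presto's implementation)."""

    def __init__(self, p=12):
        self.p = p                        # number of bits used to pick a register (assumed value)
        self.m = 1 << p                   # number of registers
        self.registers = [0] * self.m
        # Standard bias-correction constant, valid for m >= 128
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Build a 64-bit hash from the first 8 bytes of SHA-1 (an arbitrary choice)
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        # Harmonic-mean based raw estimate
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting
            return self.m * math.log(self.m / zeros)
        return raw
```

Calling `add()` for every value in a column and then `count()` gives an estimate of the number of distinct values using only a few thousand small registers, no matter how many rows go in. Adding the same value again never changes the estimate, which is why this kind of structure suits distinct counts over very large tables.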

They have open sourced the project and it can be found here. Interesting, right? So go ahead and see if you can build something with it.

Wednesday, October 30, 2013

Cloudera : Center of Universe?

Today GigaOm released an article about Cloudera introducing the enterprise data hub (EDH for short), quoting Mr Tom Reilly, Cloudera CEO:

“We believe the EDH is going to become the center of most enterprise’s data architectures.”

I'm not sure whether this is just another gimmick from a big data company, but big data technologies are going to be really 'big' in the coming future. So the biggest question for myself is what I can do with them.

Here is the Cloudera platform architecture, which I got from the article.

Cloudera Architecture

You can compare this to the Hortonworks way of doing things.

Hortonworks partners

So which platform do you like, or are you thinking of implementing, in your organization?

Tuesday, October 29, 2013

Hortonworks is shipping version 2.0

Hortonworks has a new release, Hortonworks Data Platform 2.0. In summary, the release has a few highlights, as follows:

Enterprise Ready YARN, the Hadoop Operating System

With Hadoop 2, Apache Hadoop YARN serves as the Hadoop operating system, and takes Hadoop from a single-use data platform for batch processing to a multi-use platform that enables batch, interactive, online and stream processing.

Stinger Phase 2; Interactive SQL Queries at Petabyte Scale

The Stinger Initiative was launched at the beginning of 2013 as a broad community-based effort to enhance the speed, scale and breadth of SQL semantics supported by Apache Hive. Hive 0.12 represents phase 2 of the Stinger Initiative and HDP 2.0 is a significant step forward for Hive, the de-facto standard for SQL access in Hadoop.

Reliable NoSQL in Hadoop with HBase

Apache HBase 0.96 is the culmination of more than a year’s worth of effort that’s delivered important enterprise features such as Snapshots and improved MTTR.

Manage & Monitor YARN and a Hadoop 2 cluster

Apache Ambari 1.4.1 allows you to provision, manage and monitor a cluster based on the Hadoop 2 stack. This includes YARN, MapReduce 2 and support for enabling native NameNode High Availability (HA).

You can read the whole report here on the ZDNet site.