Tuesday, January 31, 2012

Agile BI, Simple BI, Self-Serve BI - Okay, What the Hell Is This Thing?

In layman's terms, anyone, including my mom, who is suffering from information overload should be able to analyze any data using simple, easy-to-use data visualization tools, get insights (like the growth in milk usage at our home) and then share the results with my dad, who should then stop feeding expensive organic milk to his two cats.

Wow, that sounds pretty simple, doesn't it? Yes, and precisely for that reason IDC says that this phenomenon presents a big market opportunity:


“We are at the forefront of an evolutionary market that is fraught with opportunity for innovative tools and solutions that can help users handle the information overload plaguing every major organization around the globe.” IDC Market Analysis, Worldwide Interactive Data Visualization Tools Forecast


How big of a market opportunity? $1 billion big by 2013 and $1.6 billion by 2015, says Gartner. See this graphic:

So someone asked me a few weeks ago how I would define simple, self-serve BI, and I gave him the following definition -

Agile BI is a simple yet power-packed solution which is easy to use, cost-effective and offers a full 360-degree experience - and above all, my mom should be able to use it without bugging me...

And here is my definition of a power-packed solution:


There are ZERO products that fully deliver on this vision today. Products like QlikTech, Spotfire and Tableau do a pretty good job and therefore enjoy more than 70% of the market share. Where are the big guys?

"Agile" and "Big" doesn't go together I guess!



Here is how I contrasted Qlik against a large enterprise BI player:



This story is universal, and it gives younger, more agile players a competitive advantage over their older, aging brethren because they offer one single self-serve BI tool that can serve many personas!





Qlik and Tableau have seen pretty solid growth over the past few years as a result of keeping their strategies simple.  Here is an older blog post on Qlik showing its amazing growth: http://goo.gl/cyV7a


The most recent evidence of double-digit growth in the Agile BI market was seen in Tableau's 2011 results (http://apandre.wordpress.com/):
  • Sales doubled year over year to $72M in 2011
  • 104% growth in bookings in Q4'11 and 94% growth YoY
  • Worldwide customer base grew by 40% in 2011
  • More than 7,000 organizations use its analytics product
  • Big growth with customers in Europe, where the base grew by 67%

2011 was the year of Agile (Simple) BI, and the momentum is gaining further strength. Now do you know how Agile BI, a.k.a. Simple BI, a.k.a. self-serve BI, is defined?

Happy Simplifying!

Monday, January 30, 2012

Updated Sentiment Analysis and a Word Cloud for Netflix - The R Way!

The Netflix investors must be happy and cheerful as the stock is up more than 78% since the beginning of the year (YES, 78%! Source: Yahoo Finance).  I am not going to talk about what turned the stock around after the much talked-about and hyped Netflix debacle of late 2011 that earned Reed Hastings quite a few UNWANTED titles and had everyone demanding his resignation from the top post.  Not so fast, Mr. Bear!  Reed Hastings must be smiling!  After a stellar performance this year, including carefully released stats on viewership and streaming hours as well as solid Q4'11 earnings, Netflix is back and, most importantly, viewers are back!

Well, it is not coincidental that the sentiment for Netflix is also improving: 68% of the tweets now carry positive sentiment.  See the table below:


Total Tweets Fetched  Positive Tweets  Negative Tweets  Average Score  Total Tweets  Sentiment
499                   171              80               0.281          251           68%



*Make sure you understand and interpret this analysis correctly. This analysis is not based on NLP. 

I updated the sentiment analysis that I did last year, http://goo.gl/fkfPy (I was then just beginning to play with the Twitter and text mining packages in R), and used the more advanced "tm" and "wordcloud" packages.  The new analysis is based on more than 6,800 words that are most commonly used in various sentiment analysis blogs/books. (Check out Hu and Liu: http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html)

I came across this excellent blog post by Jeffrey Bean, @JeffreyBean (http://goo.gl/RPkFX), and his tutorial. Thank you, Mr. Bean!  Please follow the instructions from Bean's slides and the R code listed there, as well as the R code here:

Here are the updated R code snippets -
# Required packages: twitteR for searchTwitter(), plyr for laply()/ddply()
library(twitteR)
library(plyr)

# Populate the list of sentiment words from Hu and Liu (http://www.cs.uic.edu/~liub/FBS/sentiment-analysis.html)
huliu.pwords <- scan('opinion-lexicon/positive-words.txt', what='character', comment.char=';')
huliu.nwords <- scan('opinion-lexicon/negative-words.txt', what='character', comment.char=';')

# Add some words
huliu.nwords <- c(huliu.nwords,'wtf','wait','waiting','epicfail', 'crash', 'bug', 'bugy', 'bugs', 'slow', 'lie')
#Remove some words
huliu.nwords <- huliu.nwords[!huliu.nwords=='sap']
huliu.nwords <- huliu.nwords[!huliu.nwords=='cloud']
#which('sap' %in% huliu.nwords)

twitterTag <- "@Netflix"
# Get 1500 tweets - an individual search is only allowed to return 1500 tweets
tweets <- searchTwitter(twitterTag, n=1500)
tweets.text <- laply(tweets, function(t) t$getText())
sentimentScoreDF <- getSentimentScore(tweets.text)
sentimentScoreDF$TwitterTag <- twitterTag




# Get rid of tweets that have zero score and separate +ve from -ve tweets
sentimentScoreDF$posTweets <- as.numeric(sentimentScoreDF$SentimentScore >=1)
sentimentScoreDF$negTweets <- as.numeric(sentimentScoreDF$SentimentScore <=-1)

# Summarize findings
summaryDF <- ddply(sentimentScoreDF,"TwitterTag", summarise, 
                 TotalTweetsFetched=length(SentimentScore),
                 PositiveTweets=sum(posTweets), NegativeTweets=sum(negTweets), 
                 AverageScore=round(mean(SentimentScore),3))

summaryDF$TotalTweets <- summaryDF$PositiveTweets + summaryDF$NegativeTweets

#Get Sentiment Score
summaryDF$Sentiment  <- round(summaryDF$PositiveTweets/summaryDF$TotalTweets, 2)
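
The snippet above leans on a getSentimentScore() helper from Bean's tutorial, which I have not reproduced here. For readers who want a self-contained script, below is a minimal sketch of what such a lexicon-based scorer could look like (my approximation, not Bean's exact code); it assumes the huliu.pwords / huliu.nwords vectors loaded earlier and should be defined before running the snippet above.

getSentimentScore <- function(tweets)
{
  # Score each tweet as (# positive lexicon matches) - (# negative lexicon matches)
  scores <- laply(tweets, function(tweet) {
    # Strip punctuation, control characters and digits, then lower-case and split into words
    words <- tolower(gsub('[[:punct:][:cntrl:][:digit:]]', '', tweet))
    words <- unlist(strsplit(words, '\\s+'))
    sum(!is.na(match(words, huliu.pwords))) - sum(!is.na(match(words, huliu.nwords)))
  })
  data.frame(Text=tweets, SentimentScore=scores, stringsAsFactors=FALSE)
}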




Saving the best for last, here is a word cloud (also called a tag cloud) for Netflix, built in R:

I will be putting the full R code for building the word cloud up here after scrubbing it.
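
In the meantime, here is a minimal sketch of how such a cloud can be built with the tm and wordcloud packages. It assumes the tweets.text vector from the sentiment snippet above and is my approximation, not the exact code behind the image.

library(tm)
library(wordcloud)
library(RColorBrewer)

# Build a corpus from the (lower-cased) tweet text gathered above
corpus <- Corpus(VectorSource(tolower(tweets.text)))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removeWords, c(stopwords("english"), "netflix"))

# Count word frequencies via a term-document matrix
tdm <- TermDocumentMatrix(corpus)
wordFreq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)

# Draw the cloud, dropping very rare words
wordcloud(names(wordFreq), wordFreq, min.freq=10, random.order=FALSE,
          colors=brewer.pal(8, "Dark2"))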

Happy Analyzing!

Tuesday, January 24, 2012

Geocode your data using R, JSON and Google Maps' Geocoding API

First and foremost, I absolutely love the topic of Location Analytics (Geo-Spatial Analysis) and see tremendous business potential in the not-so-distant future.  I would go out on a limb to predict that Location Analytics will soon go viral in the enterprise space because it has the capability to WOW us. Look no further than your iPhone or an Android phone and count how many location-aware apps you have. We all have at least one - Google Maps.  Mobile is one of the strongest catalysts for enterprise adoption of location-aware apps. All right, enough of the business talk, let's get dirty with the code.


Over the last year and a half, I have faced numerous challenges with geocoding the data that I have used to showcase my passion for location analytics.  In 2012, I decided to take things into my own hands and turned to R.  Here, I am sharing a simple R script that I wrote to geocode my data whenever I need it, even BIG Data.


To geocode my data, I use Google's Geocoding service, which returns the geocoded data as JSON. I recommend that you register with the Google Maps API and get a key if you have a large amount of data and will be doing repeated geocoding.

Here is a function that can be called repeatedly by other functions:

getGeoCode <- function(gcStr)
{
  library("RJSONIO") # Load library
  gcStr <- gsub(' ', '%20', gcStr) # Encode URL parameters
  # Open connection
  connectStr <- paste('http://maps.google.com/maps/api/geocode/json?sensor=false&address=', gcStr, sep="")
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  # Flatten the received JSON
  data.json <- unlist(data.json)
  lat <- data.json["results.geometry.location.lat"]
  lng <- data.json["results.geometry.location.lng"]
  gcodes <- c(lat, lng)
  names(gcodes) <- c("Lat", "Lng")
  return(gcodes)
}
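
If you do register for a key as suggested above, one way to pass it along is via the key URL parameter of the Geocoding API. This is a sketch only (YOUR_API_KEY is a placeholder, not a real key) and would replace the connectStr line inside getGeoCode():

apiKey <- "YOUR_API_KEY"  # placeholder for your registered key
connectStr <- paste('http://maps.google.com/maps/api/geocode/json?sensor=false',
                    '&key=', apiKey, '&address=', gcStr, sep="")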

Let's put this function to test:
geoCodes <- getGeoCode("Palo Alto,California")

> geoCodes
           Lat            Lng 
  "37.4418834" "-122.1430195" 


You can run this on the entire column of a data frame or a data table:

Here  is my sample data frame with three columns - Opposition, Ground.Country and Toss. Two of the columns, you guessed it right, need geocoding.

> head(shortDS,10)
     Opposition              Ground.Country Toss
1      Pakistan            Karachi,Pakistan  won
2      Pakistan         Faisalabad,Pakistan lost
3      Pakistan             Lahore,Pakistan  won
4      Pakistan            Sialkot,Pakistan lost
5   New Zealand    Christchurch,New Zealand lost
6   New Zealand          Napier,New Zealand  won
7   New Zealand        Auckland,New Zealand  won
8       England              Lord's,England  won
9       England          Manchester,England lost
10      England            The Oval,England  won

To geocode this, here is a simple one-liner I execute:

shortDS <- with(shortDS, data.frame(Opposition, Ground.Country, Toss,
                  laply(Ground.Country, function(val){getGeoCode(val)})))



> head(shortDS, 10)
    Opposition           Ground.Country Toss  Ground.Lat  Ground.Lng
1     Pakistan         Karachi,Pakistan  won   24.893379   67.028061
2     Pakistan      Faisalabad,Pakistan lost   31.408951   73.083458
3     Pakistan          Lahore,Pakistan  won    31.54505   74.340683
4     Pakistan         Sialkot,Pakistan lost  32.4972222  74.5361111
5  New Zealand Christchurch,New Zealand lost -43.5320544 172.6362254
6  New Zealand       Napier,New Zealand  won -39.4928444 176.9120178
7  New Zealand     Auckland,New Zealand  won -36.8484597 174.7633315
8      England           Lord's,England  won     51.5294     -0.1727
9      England       Manchester,England lost   53.479251   -2.247926
10     England         The Oval,England  won   51.369037   -2.378269
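
One small follow-up: getGeoCode() returns the coordinates as character strings (they come straight out of the JSON), so a quick conversion to numeric is handy before plotting. A minimal sketch, using the Ground.Lat / Ground.Lng column names from the output above:

# Convert the geocoded columns from character (or factor) to numeric
shortDS$Ground.Lat <- as.numeric(as.character(shortDS$Ground.Lat))
shortDS$Ground.Lng <- as.numeric(as.character(shortDS$Ground.Lng))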



Happy Demoing and Coding!