Showing posts with label Big Data. Show all posts

Wednesday, September 4, 2013

The Future of Big Data is Cognitive Big Data Apps



Volume, Velocity, Variety and Veracity of your data, the 4V challenge, has become untamable.  Wait, yet another big data blog?  No, not really.  In this blog, I would like to propose a cognitive app approach that can transform your big-data problems into big opportunities at a fraction of the cost.

Everyone is talking about big data problems, but few are helping us understand big data opportunities.  Let's define a big data opportunity in the context of customers, because growing the customer base, customer satisfaction and customer loyalty is everyone's business:

  • you have a large, diverse and growing customer base
  • your customers are more mobile and social than ever before
  • you have engaged with your customers wherever they are: web, mobile, social, local
  • you believe that "more data beats better algorithms" and that big data is all data
  • you wish to collect all data - call center records, web logs, social media, customer transactions and more so that
  • you can understand your customers better and how they speak of and rank you in their social networks
  • you can group (segment) your customers to understand their likes and dislikes
  • you can offer (recommend) them the right products at the right time and at the right price
  • you can preempt customer backlash and prevent customers from leaving (churn) to competitors and taking their social networks with them (negative network effects)
  • all this effort will allow you to forecast sales accurately, run targeted marketing campaigns and cut cost to improve revenues and profitability
  • you wish to do all of this without hiring an army of data analysts, consultants and data scientists
  • and without buying half-dozen or more tools, getting access to several public / social data sets and integrating it all in your architecture
  • and above all, you wish to do it fast and drive changes in real time
  • And most importantly, you wish to rinse and repeat this approach for the foreseeable future
There are hardly any enterprise solutions on the market that can address the challenges listed above.  You have no choice but to build a custom solution, hiring several consultants and striking separate license agreements with public and social data vendors to get a combined lens on public and private data.  This approach will be cost prohibitive for most enterprise customers and, as roughly 90% of IT projects go, will be mired in delays, cost overruns and a truckload of heartache.

Advances in technologies like in-memory databases and graph structures, as well as the democratization of data science concepts, can help address the challenges listed above in a meaningful and cost-effective way.  Intelligent big data apps are the need of the hour.  These apps need to be designed and built from scratch with these challenges and technologies such as cognitive computing[1] in mind.  They will leave 1990s technology paradigms like "data needs to be gathered and modeled (caged) before an app is built" in the dumpster and will achieve the flexibility required of all modern apps: adapting as the underlying data structures and data sources change.  These apps can be deployed right off the shelf with minimal customization and consulting because the app logic will not be anchored to the underlying data schema and will evolve with changing data and behavior.

Enterprise customers will soon be asking for a suite of such cognitive big data apps across all domain functions so that they can put big data opportunities to work and run their businesses better than their competitors.  Without a dynamic cognitive approach in apps, addressing the 4V challenge will be a nightmare and big data will fail to deliver on its promise.

Stay tuned for future blogs on this topic including discussions on a pioneering technology approach.

[1] Cognitive computing is the ability to analyze oceans of data in context with related information and expertise.  Cognitive systems learn from how they're used and adjust their rules and results dynamically.  Google's search engine and Knowledge Graph technology are predicated on this approach.

 This blog has benefited from the infinite wisdom and hard work of my former colleagues Ryan Leask and Harish Butani and that of my current colleagues Sethu M., Jens Doerpmund and Vijay Vijayasankar.

Image courtesy of  MemeGenerator

Sunday, August 25, 2013

Data Science: Definition and Opportunities


Image courtesy of BBC
My thoughts on data science: what it is, what skills data scientists have, the current issues in the Business Intelligence pipeline, how machine learning can automate part of the BI chain, why and how data science should be democratized and made available to everyone including decision makers (business users), how business analysts should build complex data models, how data scientists should be freed from the mundane rinse-and-repeat ETL tasks that precede building models that feed decision making, and how companies can build a business practice around data science.

Key Premise: big data is all data and the big data apps offer the ability to combine all data (public + private) and expand the horizon to discover more meaningful insights.

Data Science is:
  • The art of mining large quantities of data
  • The art of combining disparate data sources and blending public data with corporate data
  • Forming hypotheses to solve hard problems
  • Building models to solve current problems and provide forecasts
  • Anticipating future events (based on historical data) and providing corrective actions (finance, banking, travel, operational runtime)
  • Automating processes to reduce the time to solve future problems
A data scientist has the following minimum set of core skills:
  • Problem solver
  • Creative and able to form a hypothesis
  • Able to program with large quantities of data
  • Can identify the appropriate data sources and bring in and blend their data
  • Stats/math/analytics background to build models and write algorithms
  • Can quickly develop domain knowledge to understand the key factors that influence a business problem
Roles Data Scientists play:
  • Problem description 
  • Hypothesis formation
  • Data assembly, ETL and data integration role
  • Model development (pattern recognition or any other model to provide answers) and training
  • Data visualization 
  • A/B testing
  • Proposing solutions and/or new business ideas
The balance between humans and machines:
  • Current: humans play a significant role in the process (ETL, joins, models, visualization, machine learning) and in repeating and recycling this process as the problem changes
  • Tomorrow: a big portion of the food chain can be automated via machine learning, so machines can take over and data scientists are freed up to build more algorithms and models
  • The process can be automated so that repeating and recycling is cheaper and less time consuming
The Data Science pipeline currently looks like this:
  • From data to insights: this entire process requires mundane skills (IT), specialized skills (data scientists) and elements of human psychology to present the right information at the right time
  • The data needs to be discovered, assembled, semantically enriched and anchored to business logic; this task can be automated through machine learning (a set of harmonized tools with AI) to free up scarce resources
  • Specialized skills today are addressed by open source technologies such as R and expensive solutions like Matlab and SPSS
  • Very few software solutions carefully introduce a human interface that makes their application consumable without customer training
This pipeline needs complete rethinking:
  • Automate mundane tasks that IT gets tagged with 
  • Discover data automatically 
  • Detach business logic from data models
  • Make blending public data with corporate data a second nature
  • Free up scientists so that they can build analytics micro-apps for a domain or a sub-domain
  • Data science need not be a niche (specialized category); it should appeal to the masses (democratization of data, bringing insights to everyone without requiring specialized skills)
Opportunities in Data Science: 
  • Understand the value chain (IT + Business Analyst + Data Scientists + Business Users)
  • Provide something for everyone  - a single integrated platform (ETL + Data Integration + Predictive modeling + in-memory computing +  storage)  for data-scientist so that they can build standard analytical apps and move away from proprietary models and standardize (helps IT)
  • Analytical apps on this platform (think of them as Rapid Deployment Solutions) for business users
  • Help business analysts write basic models (churn, segmentation, correlation etc.) without needing advanced skills
  • Work with consulting companies so that they can consult and build apps for companies that do not have data scientists on their payroll (e.g., Mu-Sigma and Opera Solutions)
  • Partner with public data providers (to help clients), consulting companies (rapid deployment solutions) and the R/Python/ML communities (mind-share and thought leadership)
  • Donate your predictive models to open-source communities
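As a concrete taste of the "basic models" mentioned above (segmentation in particular), here is a minimal customer-segmentation sketch in base R using k-means. The customer data and column names (spend, visits) are synthetic, purely for illustration:

```r
# Minimal customer-segmentation sketch using k-means (base R).
# The customer data below is synthetic and for illustration only.
set.seed(42)
customers <- data.frame(
  spend  = c(rnorm(50, mean = 100, sd = 10),   # low spenders
             rnorm(50, mean = 500, sd = 50)),  # high spenders
  visits = c(rnorm(50, mean = 5,   sd = 1),
             rnorm(50, mean = 20,  sd = 3))
)

# Scale the features so spend doesn't dominate, then cluster into 2 segments
fit <- kmeans(scale(customers), centers = 2, nstart = 25)

# Attach the segment label back to each customer
customers$segment <- fit$cluster
table(customers$segment)
```

A business analyst could then profile each segment (average spend, visit frequency) without any advanced skills, which is exactly the kind of task this section argues should be democratized.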

Friday, April 19, 2013

Democratization of Business Analytics Dashboards

I am super impressed with the following visual dashboard from the IPL T20 tournament: IPL 2013 in Numbers.  For those of you not so familiar with cricket or the IPL, the IPL is the biggest, most extravagant and most lucrative cricket tournament in the world.  I like the way the IPL is bringing sports analytics to the masses.


What is impressive is that each metric (runs, wickets, or tweets) is live, so these numbers update automatically; pretty cool for IPL and cricket fans.  Also, each metric is clickable, so one can drill down to his or her heart's content.  This is a common roll-up analysis, but the visualization and the real-time updates make this dashboard pretty appealing.  IPL team, thanks for not putting any dials on this dashboard (LOL).

For many years I have been influencing, and am now building, analytics products that power these sports dashboards and various other dashboards and reports.  The most fascinating thing is that these dashboards (or let's call them analytics in general) are reaching the masses like never before.  Everyone has heard terms like the democratization of data and the humanization of analytics.  This is it!  The data revolution is underway.

Now, there are many new frontiers to go after and the existing ones need to be reinvented.  Yes, the analytics market is ready for massive disruption.  This is what keeps me excited about Business Analytics space.

Happy Analyzing and Happy Friday!

Friday, April 5, 2013

Tableau IPO: Let The Gold Rush Begin For Enterprise Software IPOs!


The year 2013 is going to be the year of enterprise software IPOs.  That is not a prediction but a well-discussed point in Silicon Valley.  Everybody believes there is pent-up demand from return-hungry investors for enterprise software IPOs.  Consumer software IPOs have failed to live up to their promise over the last couple of years, but enterprise software IPOs have continued to deliver; case in point: WDAY, NOW, SPLK.

In the last couple of days, two of my favorite companies, Marketo and Tableau, have announced plans to go public.  Here are the links to Marketo's S1 and Tableau's S1.  I have had the good fortune to study, evaluate and follow both companies since 2010.  Both have done very well in their respective segments, SaaS marketing automation and on-premise self-serve BI, and both have exceeded expectations on all fronts (employees, customers, analysts, markets, competitors) after a long hard slog.

To all my friends, colleagues, investors and readers of this blog: enterprise software is a hard slog; you are in it for the long haul.  Tableau is a 10-year-old company and Marketo is 7 years old (Source: SEC filings).

Valuation
Since Tableau ("DATA") has announced its plan to IPO this year, I decided to put the stripped-down version of my due diligence, performed in early 2011, on my SlideShare account.  Back then, I used relative valuation with QlikView ("QLIK") as a close proxy to put a number on Tableau.  I used QLIK's PE (earnings multiple) and PS (revenue multiple) and assessed a market value of $380 million based on Tableau's 2010 revenues of $40 million (from their 2011 press release; this number has been revised down to $34 million in the S1. Huh, strange!)

Now, if one were to use QLIK's current revenue multiple of 5.5 (Source: Yahoo Finance), Tableau could be valued between $700 million (based on trailing revenue of $128 million) and $1.4 billion (based on $256 million in expected revenue for 2013, assuming they grow revenue YET AGAIN by 100% in 2013).

I personally don't think the street should use QLIK as a proxy; instead, it should apply Splunk's ("SPLK") lens to value Tableau.  Using SPLK's PS multiple of ~19.7 (Source: Yahoo Finance), Tableau would be valued at about $2.5 billion based on its 2012 revenues.  ServiceNow ("NOW") also has a PS multiple of ~19.

I have strong reasons to believe that the street will value Tableau in this range, based on a great growth story to this point and amazing opportunities ahead as we are just starting to drill into the big data mountain.  I will not be surprised to see the valuation range from $2.5 billion to $5 billion.  Amazing!
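The relative-valuation arithmetic in this section is easy to sanity-check in a couple of lines of R; the multiples and revenue figures below are the ones quoted in this post:

```r
# Relative valuation: market value = price-to-sales multiple x revenue (in $M)
valuation <- function(ps_multiple, revenue_musd) ps_multiple * revenue_musd

# QLIK's PS multiple of 5.5 applied to Tableau's revenues
valuation(5.5, 128)   # trailing 2012 revenue -> 704, i.e. ~$700M
valuation(5.5, 256)   # 2013 revenue at 100% growth -> 1408, i.e. ~$1.4B

# SPLK's PS multiple of ~19.7 applied to 2012 revenue
valuation(19.7, 128)  # -> 2521.6, i.e. ~$2.5B
```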

Tableau's S1
I briefly studied Tableau's S1 filing looking for information on valuation and the number of shares to be offered.  Not much is disclosed there just yet; it will likely appear in subsequent filings as they hit the roadshow to assess demand from institutional investors.  Just like Workday, Tableau will have dual-class shares (Class A and Class B) with different voting rights.  The Class A shares will be offered to investors by converting Class B shares.

The last internal valuation of employee options priced the stock at ~$15.  To raise $150 million, Tableau will be putting at least 10 million Class A shares on the block.  Of course, this will change as demand builds up following the roadshow.  One thing is certain: the stock will be priced above $15.  How many points above $15, we will find out in the next few months.

Let the mad rush begin!!!

Friday, November 9, 2012

Financial Markets and President Elect: Do Financial Markets Favor A Republican Over A Democrat?

US financial markets are said to favor a Republican president over a Democratic one.  Has this sentiment stood the test of time?  Do financial markets care whether the president-elect is a Democrat or a Republican?  How have financial markets behaved in the past after the announcement of the next US president?  And finally, can one spot a pattern in the performance of financial markets based on the president's party affiliation?  More specifically, did financial markets fare better under a Republican president or under a Democratic one?

To answer these questions, I turned to history and generated the historical performance of the S&P 500 since 1952.  I also turned to Wikipedia for a list of presidents and their party affiliations.  Between 1952 and 2012, the US held 16 presidential elections, with Republican wins outnumbering Democratic wins by two (see table below):

To understand whether financial markets favored a Republican president over a Democratic one, I generated 1-day, 1-week, 4-week, 12-week, 52-week and presidency-term ("term") returns from the election date (see table above).  Looking at the one-day return, there is no clear indication that markets favored one party over the other.  Financial markets welcomed Ronald Reagan, a Republican, by sending the S&P 500 up 1.77%, the highest one-day return among all 16 presidential elections.  Markets also cheered Bill Clinton's reelection with a one-day return of 1.46% after the announcement of the president-elect.
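The horizon returns described above can be computed from a daily closing-price series with a small helper.  Here is a sketch in R; the price series and anchor index below are made up purely to show the mechanics, not actual S&P 500 data:

```r
# Percentage return over a horizon of n trading days starting at an anchor
# index (e.g. the election-day close). The prices are a hypothetical series.
horizon_return <- function(prices, anchor, n) {
  (prices[anchor + n] - prices[anchor]) / prices[anchor] * 100
}

sp500  <- c(100, 101.77, 102.5, 101.9, 103.2, 104.0)  # toy closing prices
anchor <- 1  # pretend this is the election-day close

horizon_return(sp500, anchor, 1)  # 1-day return: 1.77%
horizon_return(sp500, anchor, 5)  # 1-week (5 trading days) return: 4%
```

The 4-week, 12-week, 52-week and term returns are just the same calculation with larger n.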

Source: AllThingsAnalytics
With one-day returns of -5.27% and -2.37% in 2008 and 2012 respectively, President Obama is not much favored by financial markets.  One can argue that October 2008 was a terrible time for anyone to be elected president because of the ongoing crash in financial markets that led to the Great Recession (see side chart).  Nonetheless, markets also didn't like Obama's reelection (the S&P 500 was down 2.37% following election day), which preserves the status quo in Washington.  Combine that with the ongoing macro concerns, including the euro debt crisis and the already unraveling fiscal cliff, and investors have become very jittery in the past couple of days.

Now, to overcome the short-term bias in financial markets' reaction, let's review the returns over the other periods (see table above).  There are plenty of interesting observations to make.  For example, under both of Clinton's (Democratic) terms, financial markets boomed, with returns of 56% and 111% over the 200 weeks following election day.  Eisenhower's (Republican) presidency came second, with returns of 95% and 19%.  The Reagan era, followed by Bush Sr.'s term, also produced hefty gains for investors, with returns of 28%, 56% and 49%.  Again, there is no clear indication that financial markets favored one party over the other during a president's term, but they definitely fared well under Republican presidents prior to 2000.

Bush Jr. (Republican) inherited the dot-com crash, oversaw the biggest expansion in US public debt (see chart below) and witnessed the epic housing crisis of 2007-2008.  Financial markets returned -22% and 13% during Bush's two-term presidency, pretty poor for a Republican president who unleashed expansionary policies on the US economy.  Under Bush's 8-year presidency, US public debt doubled from $5.6 trillion to $10 trillion.  Obama added almost the same amount of debt in just 4 years, taking US public debt from $10 trillion to $14.2 trillion by the end of 2011.

Source: AllThingsAnalytics




Table source: Wikipedia




Are we living in times that have no historical precedent?  It took 20 years for US public debt to rise from the $1 trillion level to $5.5 trillion (see side chart).  It then took just 11 short years to reach the $14 trillion level.  From 1980 to 2000, the S&P 500 appreciated by 1276% (from 105 at the start of 1980 to 1455 at the start of 2000).  An astonishing rise!  Also astonishing is the fact that from 2000 till date, the S&P 500 is down 5%.  Has the mammoth economic expansion of the 1980s and 1990s run its course, leaving debt as the only route left to sustain the US economy?  Let's leave that discussion for another blog.


Financial markets care less about which party's candidate is elected to the White House and more about the economic policies the president will enact.  Rhetoric and party ideology do take a toll on financial markets, though, as is evident in the immediate reaction we are observing right now.  Hopefully, Congress and the president will put the rhetoric aside and break the impasse on the already unraveling fiscal cliff.

This blog has benefited from discussions with Jens Doerpmund, Ryan Leask and Rajani Aswani on this topic.

Disclaimer:  All numbers are approximate and the underlying analysis is preliminary.  This blog is not intended for offering any investment advice.

Wednesday, May 23, 2012

If You Are an R Developer, Then You Must Try SAP HANA for Free.


This is a guest blog from Alvaro Tejada Galindo, my colleague and fellow R and SAP HANA enthusiast.  I am thankful to Alvaro for coming and posting on "AllThingsBusinessAnalytics".

Are you an R developer? Have you ever heard of SAP HANA? Would you like to test SAP HANA for free?

SAP HANA is an in-memory database technology that allows developers to analyze big data in real time.

Processes that took hours now take seconds thanks to SAP HANA's ability to keep everything in RAM.

As announced at the SAP Sapphire Now event in Orlando, Florida, SAP HANA is free for developers. You just need to download and install both the SAP HANA Client and the SAP HANA Studio, and create an SAP HANA server on Amazon Web Services as described in the following document:
Get your own SAP HANA DB server on Amazon Web Services - http://scn.sap.com/docs/DOC-28294

Why should this interest you? Easy: SAP HANA is an agent of change, pushing speed to its limits, and it can also be integrated with R as described in the following blog:

Want to know more about SAP HANA? Read everything you need here: http://developers.sap.com

You're convinced but don't want to pay for Amazon Web Services? No problem. Just leave a comment including your name, company and email. We will reach out and send you an Amazon gift card so you can get started. Of course, your feedback would be greatly appreciated. We only have a limited number of gift cards, so be quick or be out.

Author Alvaro Tejada Galindo, mostly known as "Blag", is a Development Expert on the Technology Innovation and Developer Experience team at SAP Labs.  He can be contacted at a.tejada.galindo@sap.com.

Alvaro's background in his own words: I was an ABAP consultant for 11 years. I worked on implementations in Peru and Canada. I'm also a die-hard developer using R, Python, Ruby, PHP, Flex and many more languages. Now I work for SAP Labs, where my main role is to evangelize SAP technologies by writing blogs and articles, helping people on the forums, attending SAP events, and many other "developer engagement" activities.
I maintain a blog called “Blag’s bag of rants” at blagrants.blogspot.com

Monday, April 9, 2012

Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part II

Technologies: SAP HANA, R, HTML5, D3, JQuery and JSON

In my last blog, Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps, I analyzed a historical airline performance data set using R and SAP HANA and put the aggregated analysis on Google Maps.  Undoubtedly, a map is a pretty exciting canvas for viewing and analyzing big data sets.  One can draw shapes (circles, polygons) on the map under a marker pin to provide pin-point information, and display aggregated information in the info window when a marker is clicked.  I enjoyed doing all of that, but I was craving some old-fashioned bubble charts and other chart types to provide comparative information on big data sets.  Ultimately, all big data sets get aggregated into smaller analytical sets for viewing, sharing and reporting.  An old-fashioned chart is the best way to tell a visual story!

Bubble charts can display four-dimensional data for comparative analysis.  In this analysis, I used the same data set, which has 200M data points, and went deeper, looking at finer slices of information.  I leveraged D3, R and SAP HANA for this blog post.  Here I am publishing some of this work:

In this first graphic, the performance of top airlines is compared for 2008.  As expected, Southwest, the largest airline (using total number of flights as a proxy), performed well for its size (1.2M flights, 64 destinations, but an average delay of ~10 minutes).  Some other airlines, like American and Continental, were among the worst performers along with Skywest.  Note that I didn't remove outliers from this analysis.  Click here to interact with this example.


In the second analysis, I replaced the airlines dimension with the airports dimension but kept all other dimensions the same.  To my disbelief, Newark is the worst-performing airport when it comes to departure delays; Chicago O'Hare, SFO and JFK follow.  Atlanta is the largest airport, yet it has the best performance.  What are they doing differently at ATL?  Click here to interact with this example.


It was a hell of a lot of fun playing with D3, R and HANA; good intellectual stimulation if nothing else!  Happy Analyzing, and remember: possibilities are endless!

As always, my R modules are fairly simple and straightforward:
###########################################################################################
# ETL - Read the airport information, extract major airport information and upload this
# transformed dataset into HANA
###########################################################################################
library(data.table)

major.airports <- data.table(read.csv("MajorAirports.csv", header=TRUE, sep=",", stringsAsFactors=FALSE))
setkey(major.airports, iata)

all.airports <- data.table(read.csv("AllAirports.csv", header=TRUE, sep=",", stringsAsFactors=FALSE))
setkey(all.airports, iata)

airports.2008.hp <- data.table(read.csv("2008.csv", header=TRUE, sep=",", stringsAsFactors=FALSE))
setkey(airports.2008.hp, Origin, UniqueCarrier)

# Merge the two datasets
airports.2008.hp <- major.airports[airports.2008.hp,]

###########################################################################################
# Get airport statistics for all airports
###########################################################################################
airports.2008.hp.summary <- airports.2008.hp[major.airports,
    list(AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
         TotalMiles=prettyNum(sum(Distance, na.rm=TRUE), big.mark=","),
         TotalFlights=length(Month),
         TotalDestinations=length(unique(Dest)),
         URL=paste("http://www.fly", Origin, ".com", sep="")),
    by=list(Origin)][order(-TotalFlights)]
setkey(airports.2008.hp.summary, Origin)

# Merge the two data tables
airports.2008.hp.summary <- major.airports[airports.2008.hp.summary,
    list(Airport=airport,
         AvgDepDelay, TotalMiles, TotalFlights, TotalDestinations,
         Address=paste(airport, city, state, sep=", "),
         Lat=lat, Lng=long, URL)][order(-TotalFlights)]

# getRowWiseJson is a helper (defined elsewhere) that serializes each row to JSON
airports.2008.hp.summary.json <- getRowWiseJson(airports.2008.hp.summary)
writeLines(airports.2008.hp.summary.json, "airports.2008.hp.summary.json")
write.csv(airports.2008.hp.summary, "airports.2008.hp.summary.csv", row.names=FALSE)

Saturday, March 17, 2012

Geocode and reverse geocode your data using R, JSON and Google Maps' Geocoding API


(Reposting the previous blog with additional module on reverse geocoding added here.)

First and foremost, I absolutely love the topic of location analytics (geo-spatial analysis) and see tremendous business potential in the not-so-distant future.  I would go out on a limb and predict that location analytics will soon go viral in the enterprise space because it has the capability to WOW us. Look no further than your iPhone or Android phone and count how many location-aware apps you have. We all have at least one: Google Maps.  Mobile is one of the strongest catalysts for enterprise adoption of location-aware apps. All right, enough business talk; let's get dirty with the code.

Over the last year and a half, I have faced numerous challenges geocoding and reverse geocoding the data I use to showcase my passion for location analytics.  In 2012, I decided to take things into my own hands and turned to R.  Here, I am sharing a simple R script that I wrote to geocode my data whenever I need it, even BIG data.

To geocode and reverse geocode my data, I use Google's Geocoding service, which returns the geocoded data as JSON. I recommend registering with the Google Maps API and getting a key if you have a large amount of data and will do repeated geocoding.

Geocode:

getGeoCode <- function(gcStr) {
  library("RJSONIO") # Load library
  gcStr <- gsub(' ', '%20', gcStr) # Encode URL parameters
  # Open connection
  connectStr <- paste('http://maps.google.com/maps/api/geocode/json?sensor=false&address=', gcStr, sep="")
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  # Flatten the received JSON
  data.json <- unlist(data.json)
  if(data.json["status"]=="OK") {
    lat <- data.json["results.geometry.location.lat"]
    lng <- data.json["results.geometry.location.lng"]
    gcodes <- c(lat, lng)
    names(gcodes) <- c("Lat", "Lng")
    return(gcodes)
  }
}
geoCodes <- getGeoCode("Palo Alto,California")


> geoCodes
           Lat            Lng 
  "37.4418834" "-122.1430195" 

Reverse Geocode:
reverseGeoCode <- function(latlng) {
  library("RJSONIO") # Load library
  latlngStr <- gsub(' ', '%20', paste(latlng, collapse=",")) # Collapse and encode URL parameters
  # Open connection
  connectStr <- paste('http://maps.google.com/maps/api/geocode/json?sensor=false&latlng=', latlngStr, sep="")
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  # Flatten the received JSON
  data.json <- unlist(data.json)
  if(data.json["status"]=="OK") {
    address <- data.json["results.formatted_address"]
    return(address)
  }
}
address <- reverseGeoCode(c(37.4418834, -122.1430195))

> address
                    results.formatted_address 
"668 Coleridge Ave, Palo Alto, CA 94301, USA" 

Happy Coding!

Wednesday, March 14, 2012

R and SAP HANA: A Highly Potent Combo for Real Time Analytics on Big Data


Let's Talk Code

SAP DKOM 2012 kicks off in San Jose today, and I couldn't be more excited.  For the past three months, Jens Doerpmund, Chief Development Architect of Analytics at SAP, and I have been working on this topic of R and SAP HANA, and all our hard work (upwards of 400 hours) is about to pay off (fingers crossed).

It has been a stunning journey and an incredible learning experience. Both R and HANA are fascinating technologies, and bringing them together is analogous to bringing Google and Apple together. We are gearing up for our session, and in the true spirit of DKOM, we will be talking only code: yes, code and lots of it.  We just wrapped up our slides with lots of code snippets to share with fellow DKOMers.  Here is a quick sneak preview of what we are going to cover today:

Big Data Analytics (Really Big)
  • The airlines sector of the travel industry
  • 22 years (1987-2008) of on-time performance data for US airlines
  • 123 million records
  • Extract Transform Load: ETL work to combine this data with data on airports and carriers to set up for big data analysis in R and HANA
  • A D20 machine with 96GB of RAM and 24 cores
  • Massive amounts of data crunching using R and HANA

We will be covering lots and lots of topics; here is a short list:
  • Sentiment analysis on #DKOM and a word cloud
  • Cluster analysis using k-means
  • Geocode your data with the Google Maps API
  • SP100: XML parsing and historical stock data
  • R and HANA integration
  • Moving big data from one HANA instance to another (replication)
  • Server-side JavaScript
  • An HTML5 app built with R, HANA and server-side JavaScript



Here is a word cloud straight from R on #DKOM. There will be a lot more to discuss today. Looking forward to meeting all you DKOMers.
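A word cloud like this one boils down to computing term frequencies from the tweet text. Here is a minimal base-R sketch of that counting step; the sample tweets are invented, and the final rendering call (via the `wordcloud` package) is indicative only:

```r
# Count term frequencies from a handful of tweets (invented sample text).
tweets <- c("Great session on R and HANA at DKOM",
            "DKOM keynote was great",
            "R meets HANA big data at DKOM")

words <- unlist(strsplit(tolower(paste(tweets, collapse = " ")), "\\s+"))
words <- words[!words %in% c("on", "and", "at", "was", "the", "a")]  # stopwords
freq  <- sort(table(words), decreasing = TRUE)
head(freq, 3)
# wordcloud::wordcloud(names(freq), as.numeric(freq)) would render the cloud
```

In a real run, the tweets would come from the Twitter search API and a proper stopword list would be used.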


Let's Talk Code, Everyone, and Happy Coding!

 Jitender Aswani
 Jens Doerpmund

Learn more on this session topic in my previous blog:  Advance Analytics with R and HANA at DKOM 2012 San Jose

Wednesday, February 1, 2012

Big Four and the Battle of Sentiments - Oracle, IBM, Microsoft and SAP

In this battle of sentiments, or opinions, among the four software giants - Oracle, IBM, Microsoft and SAP - SAP is generating a lot of positive buzz with its message of "innovation without disruption" and leading the pack with a 95% sentiment score.



Tag          Tweets Fetched   +ve Tweets   -ve Tweets   Avg. Score   Scored Tweets   Sentiment
@IBM               198             49           45          0.081           94           52%
@Microsoft         893            307           78          0.484          385           80%
@Oracle            297             90           17          0.313          107           84%
@SAP                98             55            3          0.673           58           95%
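The Sentiment column is simply the share of positive tweets among the scored (positive plus negative) ones, which is easy to reproduce; the counts below are taken from the table above:

```r
# Sentiment score = positive / (positive + negative), using the
# tweet counts from the table above.
pos <- c(IBM = 49, Microsoft = 307, Oracle = 90, SAP = 55)
neg <- c(IBM = 45, Microsoft =  78, Oracle = 17, SAP =  3)
sentiment <- round(100 * pos / (pos + neg))
sentiment
# SAP 95, Oracle 84, Microsoft 80, IBM 52
```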


A few days ago, I published the blog "Updated Sentiment Analysis and a Word Cloud for Netflix" along with the underlying R code.  I used the same R program to compare sentiments for the four software giants.  Now, technically speaking, IBM and Oracle are not pure software companies anymore, since they both package hardware (servers and storage) along with software, but the rivalry between these four companies persuaded me to put a comparative analysis together here.  I originally included HP in this analysis but then dropped it, as I don't consider HP to be in the same league as these four in the software category.

What surprised me the most was that IBM received the lowest score, lower than Oracle's!  What went wrong here?  I am also surprised to see Oracle occupying the second spot with an 84% sentiment score: despite all the negative publicity Oracle attracts, the sentiment is overwhelmingly positive.

The one improvement I would like to make to this analysis is to fetch more tweets.  The Twitter API restricts the number of tweets one can fetch and doesn't allow fetching older tweets.  I would love to run this analysis over a year's worth of tweets and also show a time series of the sentiment score.  That would be fantastic!

Here are the four histograms, one for each candidate, showing the distribution of opinion scores:

[Histograms of opinion scores for SAP, IBM, Microsoft and Oracle]

Happy Analyzing!


The underlying data can be downloaded here.



Tuesday, January 31, 2012

Agile BI, Simple BI, Self-Serve BI - Okay, What the Hell Is This Thing?

In layman's terms: anyone suffering from information overload, including my mom, should be able to analyze any data using simple, easy-to-use data visualization tools, get insights (like the growth in milk usage at our home) and then share the results with my dad, who should stop feeding expensive organic milk to his two cats.

Wow, that sounds pretty simple, doesn't it? Yes, and precisely for that reason IDC says this phenomenon presents a big market opportunity:


“We are at the forefront of an evolutionary market that is fraught with opportunity for innovative tools and solutions that can help users handle the information overload plaguing every major organization around the globe.” IDC Market Analysis, Worldwide Interactive Data Visualization Tools Forecast


How big of a market opportunity? $1 billion big by 2013 and $1.6 billion by 2015, says Gartner. See this graphic:

So someone asked me a few weeks ago how I would define simple, self-serve BI, and I gave him the following definition -

Agile BI is a simple yet power-packed solution that is easy to use, cost-effective and offers a full 360-degree experience; above all, my mom should be able to use it without bugging me...

And here is my definition of a power-packed solution:


There are ZERO products that fulfill this vision today.  Products like QlikTech, Spotfire and Tableau do a pretty good job and therefore enjoy more than 70% of the market share. Where are the big guys?

"Agile" and "Big" doesn't go together I guess!



Here is how I contrasted Qlik against a large enterprise BI player:



This story is universal, and it gives younger, more agile players a competitive advantage over their older, aging brethren: they offer one single self-serve BI tool that serves many personas!





Qlik and Tableau have seen pretty solid growth over the past few years as a result of keeping their strategies simple.  Here is an older blog on Qlik showing its amazing growth: http://goo.gl/cyV7a


The most recent evidence of double-digit growth in the Agile BI market came from Tableau's 2011 earnings (http://apandre.wordpress.com/):
  • Sales doubled year over year to $72M in 2011
  • 104% growth in bookings in Q4'11 and 94% growth YoY
  • Worldwide customer base grew by 40% in 2011
  • More than 7,000 organizations use its analytics product
  • Big growth with customers in Europe, where the base grew by 67%

2011 was the year of Agile (Simple) BI, and the momentum is gaining further strength. Do you know now how Agile BI, a.k.a. Simple BI, a.k.a. self-serve BI, is defined?

Happy Simplifying!

Tuesday, January 24, 2012

Geocode your data using, R, JSON and Google Maps' Geocoding API

First and foremost, I absolutely love the topic of Location Analytics (geo-spatial analysis) and see tremendous business potential for it in the not-so-distant future.  I would go out on a limb and predict that Location Analytics will soon go viral in the enterprise space, because it has the capability to WOW us. Look no further than your iPhone or Android phone and count how many location-aware apps you have. We all have at least one - Google Maps.  Mobile is one of the strongest catalysts for enterprise adoption of location-aware apps. All right, enough business talk; let's get dirty with the code.


Over the last year and a half, I have faced numerous challenges geocoding the data I use to showcase my passion for location analytics.  In 2012, I decided to take things into my own hands and turned to R.  Here, I am sharing a simple R script that I wrote to geocode my data whenever I need it, even BIG data.


To geocode my data, I use Google's Geocoding service, which returns the geocoded data as JSON. I recommend registering with the Google Maps API and getting a key if you have a large amount of data and will do repeated geocoding.

Here is a function that can be called repeatedly by other functions:

getGeoCode <- function(gcStr) {
  library("RJSONIO")                 # JSON parser
  gcStr <- gsub(' ', '%20', gcStr)   # URL-encode spaces in the address
  # Open a connection to the Geocoding API
  connectStr <- paste('http://maps.google.com/maps/api/geocode/json?sensor=false&address=', gcStr, sep = "")
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse = ""))
  close(con)
  # Flatten the nested JSON so fields are addressable by name
  data.json <- unlist(data.json)
  lat <- data.json["results.geometry.location.lat"]
  lng <- data.json["results.geometry.location.lng"]
  gcodes <- c(lat, lng)
  names(gcodes) <- c("Lat", "Lng")
  return(gcodes)
}

Let's put this function to test:
geoCodes <- getGeoCode("Palo Alto,California")

> geoCodes
           Lat            Lng 
  "37.4418834" "-122.1430195" 


You can run this on an entire column of a data frame or data table:

Here is my sample data frame with three columns - Opposition, Ground.Country and Toss. One of the columns, you guessed it right, needs geocoding.

> head(shortDS,10)
     Opposition              Ground.Country Toss
1      Pakistan            Karachi,Pakistan  won
2      Pakistan         Faisalabad,Pakistan lost
3      Pakistan             Lahore,Pakistan  won
4      Pakistan            Sialkot,Pakistan lost
5   New Zealand    Christchurch,New Zealand lost
6   New Zealand          Napier,New Zealand  won
7   New Zealand        Auckland,New Zealand  won
8       England              Lord's,England  won
9       England          Manchester,England lost
10      England            The Oval,England  won

To geo code this, here is a simple one liner I execute:

library(plyr) # laply comes from the plyr package
shortDS <- with(shortDS, data.frame(Opposition, Ground.Country, Toss,
                  laply(Ground.Country, function(val){getGeoCode(val)})))



> head(shortDS, 10)
    Opposition           Ground.Country Toss  Ground.Lat  Ground.Lng
1     Pakistan         Karachi,Pakistan  won   24.893379   67.028061
2     Pakistan      Faisalabad,Pakistan lost   31.408951   73.083458
3     Pakistan          Lahore,Pakistan  won    31.54505   74.340683
4     Pakistan         Sialkot,Pakistan lost  32.4972222  74.5361111
5  New Zealand Christchurch,New Zealand lost -43.5320544 172.6362254
6  New Zealand       Napier,New Zealand  won -39.4928444 176.9120178
7  New Zealand     Auckland,New Zealand  won -36.8484597 174.7633315
8      England           Lord's,England  won     51.5294     -0.1727
9      England       Manchester,England lost   53.479251   -2.247926
10     England         The Oval,England  won   51.369037   -2.378269
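When geocoding a whole column like this, one bad address or a rate-limit rejection aborts the entire run. A small defensive wrapper around getGeoCode (my own addition, not part of the original script) keeps the run alive:

```r
# Defensive wrapper around getGeoCode (defined earlier): returns NAs
# instead of failing on a bad address, and pauses between requests to
# stay under the geocoding API's rate limit. This wrapper is my own
# addition, not part of the original script.
safeGeoCode <- function(gcStr, pause = 0.2) {
  Sys.sleep(pause)                   # be polite to the API
  tryCatch(getGeoCode(gcStr),
           error = function(e) c(Lat = NA, Lng = NA))
}
```

Swap safeGeoCode in for getGeoCode in the laply one-liner above and the rows that fail to geocode simply come back as NA, ready to be inspected or retried.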



Happy Demoing and Coding!