Wednesday, March 28, 2012

Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps

Technologies: SAP HANA, R, HTML5, D3, Google Maps, JQuery and JSON

For this fun exercise, I analyzed more than 200 million data points using SAP HANA and R, and then brought the aggregated results into HTML5 using D3, JSON and the Google Maps APIs.  The 2008 airlines data is from the data expo, and I have been using this entire data set (123 million rows and 29 columns) for quite some time; see my other blogs.

The results look beautiful:



Each airport icon is clickable; when clicked, it displays an info-window describing the key stats for the selected airport:


I then used D3 to display the aggregated result set in a modal window (lightbox):



Unfortunately, I can't provide a live example due to the restrictions imposed by the Google Maps APIs; I am approaching my free API limits.

Fun fact:  The Atlanta airport was the largest airport in 2008 on many dimensions: Total Flights Departed, Total Miles Flown, Total Destinations.  It also experienced a lower average departure delay in 2008 than Chicago O'Hare. I had always thought Chicago O'Hare was the largest US airport.

As always, I needed just six lines of R code, including the two lines that write the data out to JSON and CSV files:

################################################################################
#Summarize 2008 data by origin airport (data.table join + aggregation)
airports.2008.hp.summary <- airports.2008.hp[major.airports,
    list(AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
    TotalMiles=prettyNum(sum(Distance, na.rm=TRUE), big.mark=","),
    TotalFlights=length(Month),
    TotalDestinations=length(unique(Dest)),
    URL=paste("http://www.fly", Origin, ".com",sep="")), 
                    by=list(Origin)][order(-TotalFlights)]
setkey(airports.2008.hp.summary, Origin)
#merge the two data tables
airports.2008.hp.summary <- major.airports[airports.2008.hp.summary, 
                                                     list(Airport=airport, 
                                                          AvgDepDelay, TotalMiles, TotalFlights, TotalDestinations, 
                                                          Address=paste(airport, city, state, sep=", "), 
                                                          Lat=lat, Lng=long, URL)][order(-TotalFlights)]


airports.2008.hp.summary.json <- getRowWiseJson(airports.2008.hp.summary)
writeLines(airports.2008.hp.summary.json, "airports.2008.hp.summary.json")                 
write.csv(airports.2008.hp.summary, "airports.2008.hp.summary.csv", row.names=FALSE)
##############################################################################
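The getRowWiseJson() helper used above is not defined in this post, so here is a minimal sketch of what such a row-wise serializer could look like. This is my assumption of its behavior, not the original implementation; for simplicity it stringifies every value:

```r
# Hypothetical sketch of a getRowWiseJson() helper -- the real implementation
# is not shown in the post. It serializes each data frame row as a JSON
# object (all values rendered as strings) and wraps the rows in a JSON array.
getRowWiseJson <- function(df) {
  rows <- apply(df, 1, function(r) {
    fields <- paste(sprintf('"%s":"%s"', names(r), r), collapse=",")
    paste0("{", fields, "}")
  })
  paste0("[", paste(rows, collapse=","), "]")
}

# Example: two made-up airport rows
json <- getRowWiseJson(data.frame(Origin=c("ATL", "ORD"),
                                  TotalDestinations=c(168, 132)))
```

A real helper would likely emit numbers unquoted; the version above keeps the sketch short.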

Happy Coding and remember the possibilities are endless!

Thursday, March 22, 2012

Tracking SFO Airport's Performance Using R, HANA and D3

Visualize Big Data Using R, HANA, D3, JSON and HTML5/JavaScript

This is my first introduction to D3 and I am simply blown away.  Mike Bostock (@mbostock), you are a genius; thanks for creating D3!  Combine HANA, R, D3, HTML5 and an iPad, and you've got yourself a KILLER combo!

I have been burning the midnight oil piecing together my big data story using HANA, R, JSON and HTML5.  If you recall, I did a technical session on R and SAP HANA at DKOM, SAP's Development Kickoff Event, last week, where I showcased the supreme powers of R and HANA for analyzing 124 million records in real time:  R and SAP HANA: A Highly Potent Combo for Real Time Analytics on Big Data

Since last week, I have been looking for other creative ways to analyze and then visualize this airlines data, and I was very fortunate to come across D3.  After spending a couple of hours with D3, I decided to build the calendar view for the airlines data I have.  The calendar view is the first example Mike shows on his D3 page. Amazingly awesome!

I created this calendar view capturing the percentage of flights departing SFO daily that were delayed, between 2005 and 2008.  For this analysis, I used HANA to get the data out for SFO (out of 250-plus airports) over this four-year period in seconds, and then did all the aggregation in R, including creating the JSON and .CSV files, again in seconds.  Finally, I moved to HTML5 and D3 to generate this beautiful calendar view showing SFO's performance.  The graphic is presented below:


As expected, December and January are two notorious months for flight delays.  Have fun with the live example hosted in the Amazon cloud.

Once again, my R code is very simple:


## Departure Delay for SFO Airport
ba.hp.sfo <- ba.hp[Origin=="SFO",]


ba.hp.sfo.daily.flights <- ba.hp.sfo[, list(DailyFlights=length(DepDelay)),
                                     by=list(Year, Month, DayofMonth)][order(Year, Month, DayofMonth)]
ba.hp.sfo.daily.flights.delayed <- ba.hp.sfo[DepDelay>15, list(DelayedDailyFlights=length(DepDelay)),
                                             by=list(Year, Month, DayofMonth)][order(Year, Month, DayofMonth)]
setkey(ba.hp.sfo.daily.flights.delayed, Year, Month, DayofMonth)
response <- ba.hp.sfo.daily.flights.delayed[ba.hp.sfo.daily.flights]
response <- response[,list(Date=as.Date(paste(Year, Month, DayofMonth, sep="-"),"%Y-%m-%d"), 
                           #DailyFlights,DelayedDailyFlights,
                           PercentDelayedFlights=round((DelayedDailyFlights/DailyFlights), digits=2))]
objs <- apply(response, 1, toJSON)
res <- paste('{"dailyFlightStats": [', paste(objs, collapse=', '), ']}')
writeLines(res, "dailyFlightStatsForSFO.json")                 
write.csv(response, "dailyFlightStatsForSFO.csv", row.names=FALSE)


For D3 and HTML code, please take a look at this example from D3 website. 

Happy Analyzing and Keep That Midnight Oil Burning!



Saturday, March 17, 2012

Geocode and reverse geocode your data using R, JSON and Google Maps' Geocoding API


(Reposting the previous blog, with an additional module on reverse geocoding added.)

First and foremost, I absolutely love the topic of Location Analytics (geo-spatial analysis) and see tremendous business potential for it in the not-so-distant future.  I would go out on a limb and predict that Location Analytics will soon go viral in the enterprise space because it has the capability to WOW us. Look no further than your iPhone or Android phone and count how many location-aware apps you have. We all have at least one: Google Maps.  Mobile is one of the strongest catalysts for enterprise adoption of location-aware apps. All right, enough of the business talk; let's get dirty with the code.

Over the last year and a half, I have faced numerous challenges geocoding and reverse geocoding the data that I have used to showcase my passion for location analytics.  In 2012, I decided to take things into my own hands and turned to R.  Here, I am sharing a simple R script that I wrote to geocode my data whenever I need it, even BIG data.

To geocode and reverse geocode my data, I use Google's Geocoding service, which returns the geocoded data as JSON. I recommend that you register with the Google Maps API and get a key if you have a large amount of data and will be doing repeated geocoding.

Geocode:

getGeoCode <- function(gcStr)  {
  library("RJSONIO") #Load Library
  gcStr <- gsub(' ','%20',gcStr) #Encode URL parameters
  #Open connection
  connectStr <- paste('http://maps.google.com/maps/api/geocode/json?sensor=false&address=',gcStr, sep="")
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  #Flatten the received JSON
  data.json <- unlist(data.json)
  if(data.json["status"]=="OK")   {
    lat <- data.json["results.geometry.location.lat"]
    lng <- data.json["results.geometry.location.lng"]
    gcodes <- c(lat, lng)
    names(gcodes) <- c("Lat", "Lng")
    return (gcodes)
  }
}
geoCodes <- getGeoCode("Palo Alto,California")


> geoCodes
           Lat            Lng 
  "37.4418834" "-122.1430195" 

Reverse Geocode:
reverseGeoCode <- function(latlng) {
  library("RJSONIO") #Load Library
  #Collapse and encode URL parameters
  latlngStr <- gsub(' ','%20', paste(latlng, collapse=","))
  #Open connection
  connectStr <- paste('http://maps.google.com/maps/api/geocode/json?sensor=false&latlng=',latlngStr, sep="")
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  #Flatten the received JSON
  data.json <- unlist(data.json)
  if(data.json["status"]=="OK") {
    address <- data.json["results.formatted_address"]
    return(address)
  }
}
address <- reverseGeoCode(c(37.4418834, -122.1430195))

> address
                    results.formatted_address 
"668 Coleridge Ave, Palo Alto, CA 94301, USA" 

Happy Coding!

Wednesday, March 14, 2012

R and SAP HANA: A Highly Potent Combo for Real Time Analytics on Big Data


Let's Talk Code

SAP DKOM 2012 kicks off in San Jose today and I couldn't be more excited.  For the past three months, Jens Doerpmund, Chief Development Architect of Analytics at SAP, and I have been working on this topic of R and SAP HANA, and all our hard work (upwards of 400 hours) is about to pay off (fingers crossed).

It has been a stunning journey and an incredible learning experience. Both R and HANA are fascinating technologies, and bringing them together is analogous to bringing Google and Apple together. We are gearing up for our session and, in the true spirit of DKOM, we will be talking only code, yes, code and lots of it.  We just wrapped up our slides, with lots of code snippets to share with fellow DKOMers.  Here is a quick sneak preview of what we are going to cover today:

Big Data Analytics (Really Big)
  • Airlines sector in the travel industry
  • 22 years (1987-2008) of on-time performance data for US airlines
  • 123 million records
  • Extract-Transform-Load (ETL) work to combine this data with data on airports and carriers, to set up for Big Data analysis in R and HANA
  • D20 with 96GB of RAM and 24 Cores
  • Massive amount of data crunching using R and HANA

We will be covering lots and lots of topics; here is a short list:
  • Sentiment Analysis on #DKOM and a WordCloud
  • Cluster Analysis using K-Means
  • Geo Code Your Data – Google Maps API
  • SP100 - XML Parsing and Historical Stock Data
  • R and HANA integration
  • Moving big data from one HANA instance to another (replication)
  • Server-side JavaScript
  • and an HTML5 app built with R, HANA and server-side JavaScript



Here is a wordcloud on #DKOM, straight from R. There will be a lot more to discuss today. Looking forward to meeting all of you DKOMers.
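For the curious, a minimal sketch of how such a wordcloud can be produced in R: count term frequencies, then hand them to the wordcloud package. The tweet text below is invented for illustration; in the original, tweets on #DKOM were fetched from Twitter first (that step is not shown here):

```r
# Toy term-frequency computation; the tweets are made up for illustration.
tweets <- c("R and HANA at DKOM", "HANA HANA big data", "R rocks at DKOM")
words <- unlist(strsplit(tolower(paste(tweets, collapse=" ")), "\\s+"))
freq <- sort(table(words), decreasing=TRUE)

# Render the cloud if the wordcloud package is available.
if (requireNamespace("wordcloud", quietly=TRUE)) {
  wordcloud::wordcloud(names(freq), as.numeric(freq), min.freq=1)
}
```

In practice one would also strip punctuation and stopwords (the tm package is a common choice) before counting.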


Let's Talk Code, Everyone, and Happy Coding!

 Jitender Aswani
 Jens Doerpmund

Learn more on this session's topic in my previous blog:  Advanced Analytics with R and HANA at DKOM 2012 San Jose

Friday, March 2, 2012

Advanced Analytics with R and HANA at DKOM 2012 San Jose


Advanced Analytics with R and SAP HANA

R has become the open-source language of choice for statistical data analysis, data mining, advanced algorithms, credit-risk scoring and other forms of predictive analytics.  R is already an in-memory scripting language and is capable of handling big data: tens of gigabytes and hundreds of millions of rows.  And when combined with SAP's in-memory platform technology, HANA, R offers the potential to take in-memory analytics to a whole new level.  Imagine performing advanced statistical analysis such as decision trees, game theory, linear and multiple regression and much more inside SAP HANA on millions of rows, and turning around critical business insights at the speed of thought.

This is possible now with R and HANA.  The combination has the potential to completely revolutionize and advance the game of analytics in your enterprise.  And that is not all.  Imagine taking the output from R and using the advanced visualization techniques available in the HTML5-based Business Intelligence 4.0 suite to create stunning visualizations for today's business users.

Just to tease you, here is a one-liner in R that processed 120 million records and brought back aggregated data in under 20 seconds:

averageDelay <- dt[,list(AvgArrDelay=round(mean(ArrDelay, na.rm=TRUE), digits=2),
                  AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
                  DistanceTravelled=sum(Distance, na.rm=TRUE),
                  FlightCount=length(Month)), 
                  by=list(UniqueCarrier, Year)][order(Year, -AvgArrDelay)][AvgArrDelay > 10 | AvgDepDelay > 10]

The machine I used for this analysis had 24 cores and 96GB of memory! More to follow over next few days.
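To make the pattern easy to inspect without the full data set, here is the same [i, j, by] chaining on a tiny made-up data.table; the carrier names and delay values below are invented for illustration:

```r
# Toy reproduction of the aggregation above on a small invented data.table,
# so the [j, by][order(...)][filter] chaining can be inspected in isolation.
library(data.table)

dt <- data.table(UniqueCarrier=c("AA","AA","UA","UA"),
                 Year=rep(2008, 4),
                 ArrDelay=c(30, 10, 2, 4),
                 DepDelay=c(25, 5, 1, 3),
                 Distance=c(500, 700, 300, 400),
                 Month=c(1, 2, 1, 2))

averageDelay <- dt[,list(AvgArrDelay=round(mean(ArrDelay, na.rm=TRUE), digits=2),
                  AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
                  DistanceTravelled=sum(Distance, na.rm=TRUE),
                  FlightCount=length(Month)),
                  by=list(UniqueCarrier, Year)][order(Year, -AvgArrDelay)][AvgArrDelay > 10 | AvgDepDelay > 10]
```

Here only the "AA" row survives the final filter (average arrival delay 20, departure delay 15); "UA" averages fall below both thresholds and drop out, exactly as the chained filter intends.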

Join me and my colleagues for this session at DKOM 2012 San Jose (March 14th at 11 AM at the San Jose Convention Center).