Showing posts with label Advanced Analytics. Show all posts

Wednesday, September 4, 2013

The Future of Big Data is Cognitive Big Data Apps



The Volume, Velocity, Variety and Veracity of your data, the 4V challenge, has become untamable.  Wait, yet another big data blog?  No, not really.  In this blog, I would like to propose a cognitive app approach that can transform your big data problems into big opportunities at a fraction of the cost.

Everyone is talking about big data problems, but few are helping us understand big data opportunities.  Let's define a big data opportunity in the context of customers, because growing your customer base, customer satisfaction and customer loyalty is everyone's business:

  • you have a large, diverse and growing customer base
  • your customers are more mobile and social than ever before
  • you have engaged with your customers wherever they are: web, mobile, social, local
  • you believe that "more data beats better algorithms" and that big data is all data
  • you wish to collect all data (call center records, web logs, social media, customer transactions and more) so that
  • you can understand your customers better and how they speak of and rank you in their social networks
  • you can group (segment) your customers to understand their likes and dislikes
  • you can offer (recommend) them the right products at the right time and at the right price
  • you can preempt customer backlash and prevent them from leaving (churning) to competitors and taking their social networks with them (negative network effects)
  • all this effort will allow you to forecast sales accurately, run targeted marketing campaigns and cut costs to improve revenues and profitability
  • you wish to do all of this without hiring an army of data analysts, consultants and data scientists
  • and without buying a half-dozen or more tools, getting access to several public/social data sets and integrating it all into your architecture
  • and above all, you wish to do it fast and drive changes in real time
  • and most importantly, you wish to rinse and repeat this approach for the foreseeable future
There are hardly any enterprise solutions on the market that address the challenges listed above.  You have no choice but to build a custom solution, hiring several consultants and striking separate license agreements with public and social data vendors to get a combined lens on public and private data.  This approach will be cost prohibitive for most enterprise customers and, as 90% of IT projects go, will be mired in delays, cost overruns and a truckload of heartache. 

Advances in technologies like in-memory databases and graph structures, as well as the democratization of data science concepts, can help address the challenges listed above in a meaningful and cost-effective way.  Intelligent big data apps are the need of the hour.  These apps need to be designed and built from scratch with these challenges and technologies such as cognitive computing[1] in mind.  They will leave 1990s technology paradigms like "data needs to be gathered and modeled (caged) before an app is built" in the dumpster, and will achieve the flexibility required of all modern apps: adapting as the underlying data structures and data sources change.  These apps can be deployed right off the shelf with minimal customization and consulting because the app logic will not be anchored to the underlying data schema and will evolve with changing data and behavior.

Enterprise customers will soon be asking for a suite of such cognitive big data apps across all domain functions so that they can put big data opportunities to work and run their businesses better than their competitors.  Without a dynamic cognitive approach in apps, addressing the 4V challenge will be a nightmare and big data will fail to deliver on its promise.

Stay tuned for future blogs on this topic including discussions on a pioneering technology approach.

[1] Cognitive computing is the ability to analyze oceans of data in context with related information and expertise.  Cognitive systems learn from how they're used and adjust their rules and results dynamically.  Google's search engine and Knowledge Graph technology are predicated upon this approach.

This blog has benefited from the infinite wisdom and hard work of my former colleagues Ryan Leask and Harish Butani, and that of my current colleagues Sethu M., Jens Doerpmund and Vijay Vijayasankar.

Image courtesy of  MemeGenerator

Sunday, August 25, 2013

Data Science: Definition and Opportunities


Image courtesy of BBC
My thoughts on data science: what it is and what skills data scientists have; what the current issues in the Business Intelligence pipeline are, and how machine learning can automate part of the BI chain; why and how data science should be democratized and made available to everyone, including decision makers (business users); how business analysts should build complex data models, and how data scientists should be freed from the mundane rinse-and-repeat ETL tasks that precede building the models that inform decision making; and how companies can build a business practice around data science. 

Key Premise: big data is all data, and big data apps offer the ability to combine all data (public + private) and expand the horizon to discover more meaningful insights.

Data Science is:
  • The art of mining large quantities of data 
  • The art of combining disparate data sources and blending public data with corporate data
  • Forming hypotheses to solve hard problems
  • Building models to solve current problems and provide forecasts
  • Anticipating future events (based on historical data) and providing corrective actions (finance, banking, travel, operational runtime)
  • Automating these processes to reduce the time to solve future problems
A data scientist has the following minimum set of core skills:
  • Is a problem solver
  • Is creative and can form a hypothesis
  • Is able to program with large quantities of data
  • Can identify the appropriate data sources and bring in and blend their data 
  • Has a stats/math/analytics background to build models and write algorithms 
  • Can quickly develop the domain knowledge to understand the key factors that influence a business problem
Roles data scientists play:
  • Problem description 
  • Hypothesis formation
  • Data assembly, ETL and data integration
  • Model development (pattern recognition or any other model that provides answers) and training
  • Data visualization 
  • A/B testing 
  • Proposing solutions and/or new business ideas
The balance between humans and machines:
  • Today: humans play a significant role in the process (ETL, joins, models, visualization, machine learning) and in repeating and recycling it as the problem changes
  • Tomorrow: a big portion of this food chain can be automated via machine learning, so machines take over and data scientists are freed up to build more algorithms and models 
  • Once the process is automated, repeating and recycling it becomes cheaper and less time consuming
The data science pipeline currently looks like this:
  • From data to insights: this entire process requires mundane skills (IT), specialized skills (data scientists) and elements of human psychology to present the right information at the right time 
  • The data needs to be discovered, assembled, semantically enriched and anchored to business logic; this task can be automated through machine learning (a set of harmonized tools with AI) to free up scarce resources
  • Specialized skills today are addressed by open source technologies such as R and expensive solutions like Matlab and SPSS
  • Very few software solutions carefully design the human interface to make their application consumable without customer training
This pipeline needs complete rethinking:
  • Automate the mundane tasks that IT gets tagged with 
  • Discover data automatically 
  • Detach business logic from data models
  • Make blending public data with corporate data second nature
  • Free up data scientists so that they can build analytics micro-apps for a domain or a sub-domain
  • Data science need not be a niche (specialized category); it should appeal to the masses (democratization of data, bringing insights to everyone without needing specialized skills)
Opportunities in data science: 
  • Understand the value chain (IT + business analysts + data scientists + business users)
  • Provide something for everyone: a single integrated platform (ETL + data integration + predictive modeling + in-memory computing + storage) for data scientists, so that they can build standard analytical apps, move away from proprietary models and standardize (which helps IT)
  • Build analytical apps on this platform (think of them as Rapid Deployment Solutions) for business users
  • Help business analysts write basic models (churn, segmentation, correlation, etc.) without needing advanced skills
  • Work with consulting companies so that they can consult and build apps for companies that do not have data scientists on their payroll (e.g. Mu-Sigma and Opera Solutions)
  • Partner with public data providers (to help clients), consulting companies (rapid-deployment solutions) and the R/Python/ML communities (mind-share and thought leadership)
  • Donate your predictive models to open-source communities
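To make the "basic models" point above concrete, here is a minimal segmentation sketch in R.  The customer data is simulated and the column names are made up purely for illustration; the point is that a business analyst could run this without advanced skills:

```r
# Simulated customer data: annual spend and store visits (illustrative only)
set.seed(42)
customers <- data.frame(
  spend  = c(rnorm(50, mean = 100, sd = 10), rnorm(50, mean = 500, sd = 50)),
  visits = c(rnorm(50, mean = 2,   sd = 1),  rnorm(50, mean = 12,  sd = 2))
)

# Segment the customers into two groups with k-means on scaled features
km <- kmeans(scale(customers), centers = 2)
customers$segment <- km$cluster

# Profile each segment: average spend and visits per group
print(aggregate(cbind(spend, visits) ~ segment, data = customers, FUN = mean))
```

Churn or correlation models could be run in the same spirit; none of this requires a dedicated data science team.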

Friday, April 19, 2013

Democratization of Business Analytics Dashboards

I am super impressed with the following visual dashboard from the IPL T20 tournament: IPL 2013 in Numbers.  For those of you not so familiar with cricket or the IPL, the IPL is the biggest, most extravagant and most lucrative cricket tournament in the world.  I like the way the IPL is bringing sports analytics to the masses.


What is impressive is that each metric (runs, wickets, or tweets) is live, so the numbers update automatically; pretty cool for IPL and cricket fans.  Each metric is also clickable, so one can drill down to one's heart's content.  This is a common roll-up analysis, but the visualization and the real-time updates make this dashboard very appealing.  IPL team, thanks for not putting any dials on this dashboard (LOL).

For many years I have been influencing, and am now building, the analytics products that power these sports dashboards and various other dashboards and reports.  The most fascinating thing is that these dashboards (or, let's call it analytics in general) are reaching the masses like never before.  Everyone has heard terms like the democratization of data and the humanization of analytics.  This is it!  The data revolution is underway.  

Now, there are many new frontiers to go after and the existing ones need to be reinvented.  Yes, the analytics market is ready for massive disruption.  This is what keeps me excited about Business Analytics space.

Happy Analyzing and Happy Friday!

Friday, November 9, 2012

Financial Markets and President Elect: Do Financial Markets Favor A Republican Over A Democrat?

US financial markets supposedly favor a Republican president over a Democratic one.  Has this sentiment stood the test of time?  Do financial markets care whether the president-elect is a Democrat or a Republican?  How have financial markets behaved in the past after the announcement of the next US president?  And finally, can one spot a pattern in the performance of financial markets based on the president's party affiliation?  More specifically, did financial markets fare better under a Republican president or under a Democratic president?

To answer these questions, I turned to history and generated the historical performance of the S&P 500 since 1952.  I also turned to Wikipedia for the list of presidents and their party affiliations.  Between 1952 and 2012, the US held 16 presidential elections, with Republican presidents outnumbering their Democratic counterparts by 2 in occupying the White House (see table below):

To understand whether financial markets favored a Republican president over a Democratic president, I generated 1-day, 1-week, 4-week, 12-week, 52-week and presidency-term ("term") returns from the election date (see table above.)  Looking at the one-day returns, there is no clear indication that markets favored one party or the other.  Financial markets welcomed Ronald Reagan, a Republican, by sending the S&P 500 up 1.77%, the highest one-day return among all 16 presidential elections.  Markets also cheered the reelection of Bill Clinton with a one-day return of 1.46% after the announcement of the president-elect.
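The horizon returns in the table are simple price ratios anchored at the election-day close.  A short sketch in R; the price series below is a made-up placeholder, not actual S&P 500 closes:

```r
# Illustrative daily close series; index 1 is the election-day close.
# These numbers are placeholders, not real S&P 500 data.
prices <- c(100.00, 101.77, 101.50, 102.30, 103.00, 104.10)

# Percent return 'days' trading days after the election-day close
horizon_return <- function(prices, days) {
  round((prices[1 + days] / prices[1] - 1) * 100, 2)
}

horizon_return(prices, 1)  # one-day return: 1.77
horizon_return(prices, 5)  # one-week (5 trading days) return: 4.1
```

The longer horizons (4-week, 52-week, term) are computed the same way, just with a larger `days` offset into the series.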

Source: AllThingsAnalytics
With one-day returns of -5.27% and -2.37% in 2008 and 2012 respectively, President Obama has not been much favored by financial markets.  Now, one can argue that late 2008 was a terrible period for anyone to be elected president because of the ongoing crash in financial markets that led to the Great Recession (see side chart.)  Nonetheless, markets also didn't like Obama's reelection (the S&P 500 was down 2.37% following election day), which preserves the status quo in Washington.  Combine that with the ongoing macro concerns, including the euro debt crisis and the already unraveling fiscal cliff, and investors have become very jittery in the past couple of days.

Now, to overcome the short-term bias in the financial markets' reaction, let's review the returns over the other periods (see table above.)  There are plenty of interesting observations one can make.  For example, under both of Clinton's (Democratic) terms, financial markets boomed, with returns of 56% and 111% over the 200 weeks following election day.  Eisenhower's (Republican) presidency came second with returns of 95% and 19%.  The Reagan era, followed by Bush Sr.'s term, also produced hefty gains for investors, with returns of 28%, 56% and 49%.  Again, there is no clear indication that financial markets favored one party over the other during a president's term in office, but financial markets definitely fared well under Republican presidents prior to 2000.

Bush Jr. (Republican) inherited the dot-com crash, oversaw the biggest expansion in US public debt (see chart below) and witnessed the epic housing crisis of 2007-2008.  Financial markets returned -22% and 13% during Bush's two-term presidency, pretty poor for a Republican president who unleashed expansionary policies on the US economy.  Under Bush's 8-year presidency, US public debt doubled from $5.6 trillion to $10 trillion.  Obama added almost the same amount of debt in just 4 years, taking US public debt from $10 trillion to $14.2 trillion by the end of 2011.

Source: AllThingsAnalytics




From Wikipedia, click to expand




Are we living in times that have no historical precedent?  It took 20 years for US public debt to rise from the $1 trillion level to $5.5 trillion (see side chart).  It then took just 11 short years for US public debt to reach the $14 trillion level.  From 1980 to 2000, the S&P 500 appreciated by 1276% (from 105 at the start of 1980 to 1455 at the start of 2000).  An astonishing rise!  Also astonishing is the fact that from 2000 till date, the S&P 500 is down 5%.  Has the mammoth economic expansion of the 1980s and 1990s run its course, leaving debt as the only route left to sustain the US economy?  Let's leave that discussion for another blog.


Financial markets care less about which party's candidate is elected to the White House and more about the economic policies that president will enact.  All the rhetoric and party ideology does take a toll on financial markets, though, as is evident in their immediate reaction, similar to the one we are observing right now.  Hopefully, Congress and the president will put the rhetoric aside and break the impasse on the already unraveling fiscal cliff.

This blog has benefited from discussions with Jens Doerpmund, Ryan Leask and Rajani Aswani on this topic.

Disclaimer:  All numbers are approximate and the underlying analysis is preliminary.  This blog is not intended for offering any investment advice.

Wednesday, October 10, 2012

Besides Facebook's Botched IPO, IPO Market Returns 20% in 2012

Facebook (Ticker: FB) is down ~47% since its IPO in May.  It is not, however, the most botched IPO ever; that infamous record belongs to BATS Exchange (Ticker: BATS), which operates an alternative stock exchange to the NYSE and NASDAQ.  (Read the Business Insider story here: 8 Unforgettable IPO Disasters)

FB is not the worst-performing IPO either.  Groupon (Ticker: GRPN) and Zynga (Ticker: ZNGA, proudly led by Mark Pincus) are down 77% and 74% respectively since their IPOs.  In comparison, FB has done OK; it could be worse, and a rapid strategy shift by FB, including the emphasis on mobile and the decision to allow e-commerce transactions (Facebook Gifts) on Facebook, has provided some kind of floor under its stock.  Here is a chart comparing the three (not-so) darlings of Web 2.0.



Anyhow, below is a table of the best IPOs of this year.  Guidewire (Ticker: GWRE) and Demandware (Ticker: DWRE) are the two cloud technology companies in the list; they have done very well, returning 137% and 108% to date.


IPO Top Performers (YTD)

Company           Offer Date  Underwriter  Industry     Deal Size ($mm)  Offer Price  First-Day Close  Closing Price  First-Day Return  Total Return
Supernus Pharmac  4/30/12     Citi         Health Care  $50              $5.00        $5.37            $12.77         7.4%              155.4%
Nationstar Mortg  3/7/12      Merrill      Financial    $233             $14.00       $14.20           $33.29         1.4%              137.8%
Guidewire Softwa  1/24/12     JPM          Technology   $115             $13.00       $17.12           $30.84         31.7%             137.2%
Annies            3/27/12     CS           Consumer     $95              $19.00       $35.92           $44.87         89.1%             136.2%
Demandware        3/14/12     GS           Technology   $88              $16.00       $23.59           $33.31         47.4%             108.2%


Palo Alto Networks (Ticker: PANW) is up 16% since its first day of trading and 48% over its IPO price of $42.  Splunk (Ticker: SPLK) is down about 10% since its first day but is still up 90% over its IPO price of $17.  Neither company made the cut for the table above.

Here is a list of the worst-performing IPOs to date.  If one changes the time period from YTD to 12 months, Zynga shows up in the list; no surprise there.  Social gaming is a fast-changing environment, and ZNGA faces a crisis of confidence with so many departures.


IPO Worst Performers (YTD)

Company    Offer Date  Underwriter  Industry    Deal Size ($mm)  Offer Price  First-Day Close  Closing Price  First-Day Return  Total Return
Envivio    4/24/12     GS           Technology  $70              $9.00        $8.49            $2.15          -5.7%             -76.1%
Audience   5/9/12      JPM          Technology  $90              $17.00       $19.10           $5.65          12.4%             -66.8%
CafePress  3/28/12     JPM          Technology  $86              $19.00       $19.03           $8.07          0.2%              -57.5%
Ceres      2/21/12     GS           Materials   $65              $13.00       $14.80           $5.77          13.8%             -55.6%
Renewable  1/18/12     UBS          Energy      $72              $10.00       $10.10           $5.16          1.0%              -48.4%
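Both return columns in these tables follow directly from the offer price, the first-day close and the latest close.  The Supernus and Envivio rows, for instance, check out in a couple of lines of R:

```r
# Percent return of the first-day close over the offer price
first_day_return <- function(offer, first_close) round((first_close / offer - 1) * 100, 1)

# Percent return of the latest close over the offer price
total_return <- function(offer, last_close) round((last_close / offer - 1) * 100, 1)

first_day_return(5.00, 5.37)   # Supernus:   7.4
total_return(5.00, 12.77)      # Supernus: 155.4
first_day_return(9.00, 8.49)   # Envivio:   -5.7
total_return(9.00, 2.15)       # Envivio:  -76.1
```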


Take a closer look: FB is barely staying out of this infamous list.  On a similar note, LinkedIn (Ticker: LNKD) is up approximately 80% to date.  What a contrasting tale of two social network companies!





So far in 2012, IPOs have returned 20%, better than the -11% return the IPO market yielded in 2011.  Since there are about 2.5 months to go before the curtains drop on 2012, the 2012 IPO return might yet beat the 25% the year 2010 produced.











One very encouraging sign for IPO investors this year has been the 13% average first-day pop in IPOs, in line with what the IPO market observed before the Great Recession (~13%).  And to all the naysayers out there who claim that tech stocks are in a bubble: take a look at the average opening-day pop in 1999 (72%) and 2000 (56%) and compare it to 2012's, and you will hold your peace for a few more years at least!






Workday (Ticker: WDAY) is on deck for this week.  Do your due diligence before investing.

Happy IPO Investing!
Jitender

Source: Renaissance Capital, Greenwich, CT (www.renaissancecapital.com).

Wednesday, May 23, 2012

If You Are an R Developer, Then You Must Try SAP HANA for Free.


This is a guest blog from Alvaro Tejada Galindo, my colleague and fellow R and SAP HANA enthusiast.  I am thankful to Alvaro for posting on "AllThingsBusinessAnalytics".

Are you an R developer? Have you ever heard of SAP HANA? Would you like to test SAP HANA for free?

SAP HANA is an in-memory database technology that allows developers to analyze big data in real time.

Processes that took hours now take seconds thanks to SAP HANA's ability to keep everything in RAM.

As announced at the SAP Sapphire Now event in Orlando, Florida, SAP HANA is free for developers. You just need to download and install both the SAP HANA Client and the SAP HANA Studio, and create an SAP HANA server on Amazon Web Services as described in the following document:
Get your own SAP HANA DB server on Amazon Web Services - http://scn.sap.com/docs/DOC-28294

Why should this interest you? Easy: SAP HANA is an agent of change, pushing speed to its limits, and it can also be integrated with R as described in the following blog:

Want to know more about SAP HANA? Read everything you need here: http://developers.sap.com

You're convinced but don't want to pay for Amazon Web Services? No problem. Just leave a comment including your name, company and email. We will reach out and send you an Amazon gift card so you can get started. Of course, your feedback would be greatly appreciated. We only have a limited number of gift cards, so be quick or be out.

Author Alvaro Tejada Galindo, mostly known as "Blag", is a Development Expert working in the Technology Innovation and Developer Experience team at SAP Labs.  He can be contacted at a.tejada.galindo@sap.com.

Alvaro's background in his own words: I was an ABAP consultant for 11 years. I worked on implementations in Peru and Canada. I'm also a die-hard developer using R, Python, Ruby, PHP, Flex and many more languages. Now I work for SAP Labs, and my main role is to evangelize SAP technologies by writing blogs and articles, helping people on the forums and attending SAP events, among many other "developer engagement" activities.
I maintain a blog called "Blag's bag of rants" at blagrants.blogspot.com

Monday, April 9, 2012

Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part II

Technologies: SAP HANA, R, HTML5, D3, JQuery and JSON

In my last blog, Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps, I analyzed a historical airline performance data set using R and SAP HANA and put the aggregated analysis on Google Maps.  Undoubtedly, a map is a pretty exciting canvas on which to view and analyze big data sets.  One can draw shapes (circles, polygons) on the map under a marker pin to provide pin-point information, and display aggregated information in the info window when a marker is clicked.  I enjoyed doing all of that, but I was craving some old-fashioned bubble charts and other chart types to provide comparative information on big data sets.  Ultimately, all big data sets get aggregated into smaller analytical sets for viewing, sharing and reporting.  An old-fashioned chart is the best way to tell a visual story!

A bubble chart can display four-dimensional data for comparative analysis.  For this blog's analysis, I used the same data set, which has 200M data points, and went deeper, looking at finer slices of the information.  I leveraged D3, R and SAP HANA for this post.  Here I am publishing some of this work:  

In this first graphic, the performance of the top airlines is compared for 2008.  As expected, Southwest, the largest airline (using total number of flights as a proxy), performed well for its size (1.2M flights, 64 destinations, but an average delay of ~10 mins.)  Some of the other airlines, like American and Continental, were among the worst performers, along with SkyWest.  Note that I didn't remove outliers from this analysis.  Click here to interact with this example.


In the second analysis, I replaced the airlines dimension with the airports dimension but kept all the other dimensions the same.  To my disbelief, Newark is the worst-performing airport when it comes to departure delays; Chicago O'Hare, SFO and JFK follow.  Atlanta is the largest airport, yet it has the best performance.  What are they doing differently at ATL?  Click here to interact with this example.


It was a hell of a lot of fun playing with D3, R and HANA; good intellectual stimulation if nothing else!  Happy Analyzing, and remember: the possibilities are endless!

As always, my R modules are fairly simple and straightforward:
###########################################################################################  
# ETL - Read the airport information, extract the major-airport records and upload this
# transformed data set into HANA
###########################################################################################
library(data.table)  # fast keyed joins and grouped aggregation
major.airports <- data.table(read.csv("MajorAirports.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE))
setkey(major.airports, iata)


all.airports <- data.table(read.csv("AllAirports.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE)) 
setkey(all.airports, iata)


airports.2008.hp <- data.table(read.csv("2008.csv",  header=TRUE, sep=",", stringsAsFactors=FALSE)) 
setkey(airports.2008.hp, Origin, UniqueCarrier)


#Merge two datasets
airports.2008.hp <- major.airports[airports.2008.hp,]


###########################################################################################  
# Get airport statistics for all airports
###########################################################################################
airports.2008.hp.summary <- airports.2008.hp[major.airports,     
    list(AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
    TotalMiles=prettyNum(sum(Distance, na.rm=TRUE), big.mark=","),
    TotalFlights=length(Month),
    TotalDestinations=length(unique(Dest)),
    URL=paste("http://www.fly", Origin, ".com",sep="")), 
                    by=list(Origin)][order(-TotalFlights)]
setkey(airports.2008.hp.summary, Origin)
#merge two data tables
airports.2008.hp.summary <- major.airports[airports.2008.hp.summary, 
                                                     list(Airport=airport, 
                                                          AvgDepDelay, TotalMiles, TotalFlights, TotalDestinations, 
                                                          Address=paste(airport, city, state, sep=", "), 
                                                          Lat=lat, Lng=long, URL)][order(-TotalFlights)]




# getRowWiseJson() is a custom helper that serializes a data.table row-wise to JSON
airports.2008.hp.summary.json <- getRowWiseJson(airports.2008.hp.summary)
writeLines(airports.2008.hp.summary.json, "airports.2008.hp.summary.json")                 
write.csv(airports.2008.hp.summary, "airports.2008.hp.summary.csv", row.names=FALSE)

Wednesday, March 28, 2012

Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps

Technologies: SAP HANA, R, HTML5, D3, Google Maps, JQuery and JSON

For this fun exercise, I analyzed more than 200 million data points using SAP HANA and R, and then brought the aggregated results into HTML5 using D3, JSON and the Google Maps APIs.  The 2008 airline data comes from the Data Expo, and I have been using this entire data set (123 million rows and 29 columns) for quite some time; see my other blogs.

The results look beautiful:



Each airport icon is clickable and when clicked displays an info-window describing the key stats for the selected airport:


I then used D3 to display the aggregated result set in a modal window (lightbox):



Unfortunately, I can't provide a live example due to the restrictions in the Google Maps APIs, as I am approaching my free API limits.

Fun fact:  The Atlanta airport was the largest airport in 2008 on many dimensions: total flights departed, total miles flown, total destinations.  It also experienced a lower average departure delay in 2008 than Chicago O'Hare.  I always thought Chicago O'Hare was the largest US airport.

As always, my R code is short and simple, including the two lines that write the data out to JSON and CSV files:

################################################################################
library(data.table)  # the data sets below are keyed data.tables
airports.2008.hp.summary <- airports.2008.hp[major.airports,     
    list(AvgDepDelay=round(mean(DepDelay, na.rm=TRUE), digits=2),
    TotalMiles=prettyNum(sum(Distance, na.rm=TRUE), big.mark=","),
    TotalFlights=length(Month),
    TotalDestinations=length(unique(Dest)),
    URL=paste("http://www.fly", Origin, ".com",sep="")), 
                    by=list(Origin)][order(-TotalFlights)]
setkey(airports.2008.hp.summary, Origin)
#merge the two data tables
airports.2008.hp.summary <- major.airports[airports.2008.hp.summary, 
                                                     list(Airport=airport, 
                                                          AvgDepDelay, TotalMiles, TotalFlights, TotalDestinations, 
                                                          Address=paste(airport, city, state, sep=", "), 
                                                          Lat=lat, Lng=long, URL)][order(-TotalFlights)]


# getRowWiseJson() is a custom helper that serializes a data.table row-wise to JSON
airports.2008.hp.summary.json <- getRowWiseJson(airports.2008.hp.summary)
writeLines(airports.2008.hp.summary.json, "airports.2008.hp.summary.json")                 
write.csv(airports.2008.hp.summary, "airports.2008.hp.summary.csv", row.names=FALSE)
##############################################################################

Happy Coding and remember the possibilities are endless!

Thursday, March 22, 2012

Tracking SFO Airport's Performance Using R, HANA and D3

Visualize Big Data Using R, HANA, D3, JSON and HTML5/JavaScript

This is my first introduction to D3 and I am simply blown away.  Mike Bostock (@mbostock), you are a genius; thanks for creating D3!  With HANA, R, D3, HTML5 and an iPad, you've got yourself a KILLER combo!

I have been burning the midnight oil piecing together my big data story using HANA, R, JSON and HTML5.  If you recall, I did a technical session on R and SAP HANA at DKOM, SAP's Development Kickoff event, last week, where I showcased the supreme powers of R and HANA by analyzing 124 million records in real time: R and SAP HANA: A Highly Potent Combo for Real Time Analytics on Big Data

Since last week, I have been looking for other creative ways to analyze and then visualize this airline data, and I was very fortunate to come across D3.  After spending a couple of hours with D3, I decided to build a calendar view for the airline data I have.  The calendar view is the first example Mike shows on his D3 page.  Amazingly awesome!

I created this calendar view capturing the percentage of flights departing daily from the SFO airport that were delayed, between 2005 and 2008.  For this analysis, I used HANA to pull the data for SFO (out of 250-plus airports) over this four-year period in seconds, and then did all the aggregation in R, including creating a JSON and a .CSV file, again in seconds.  Then I moved to HTML5 and D3 to generate this beautiful calendar view showing SFO's performance.  The graphic is presented below:


As expected, December and January are the two notorious months for flight delays.  Have fun with the live example hosted in the Amazon cloud.

Once again, my R code is very simple:


## Departure delay for the SFO airport
library(data.table)  # keyed subsetting and grouped aggregation
ba.hp.sfo <- ba.hp[Origin=="SFO",]


ba.hp.sfo.daily.flights <- ba.hp.sfo[,list(DailyFlights=length(DepDelay)), by=list(Year, Month, DayofMonth)][order(Year,Month,DayofMonth)]
ba.hp.sfo.daily.flights.delayed <- ba.hp.sfo[DepDelay>15,list(DelayedDailyFlights=length(DepDelay)), by=list(Year, Month, DayofMonth)][order(Year,Month,DayofMonth)]
setkey(ba.hp.sfo.daily.flights.delayed, Year, Month, DayofMonth)
response <- ba.hp.sfo.daily.flights.delayed[ba.hp.sfo.daily.flights]
response <- response[,list(Date=as.Date(paste(Year, Month, DayofMonth, sep="-"),"%Y-%m-%d"), 
                           #DailyFlights,DelayedDailyFlights,
                           PercentDelayedFlights=round((DelayedDailyFlights/DailyFlights), digits=2))]
library("RJSONIO")  # for toJSON()
objs <- apply(response, 1, toJSON)
res <- paste('{"dailyFlightStats": [', paste(objs, collapse=', '), ']}')
writeLines(res, "dailyFlightStatsForSFO.json")                 
write.csv(response, "dailyFlightStatsForSFO.csv", row.names=FALSE)


For the D3 and HTML code, please take a look at this example on the D3 website. 

Happy Analyzing and Keep That Midnight Oil Burning!



Saturday, March 17, 2012

Geocode and reverse geocode your data using, R, JSON and Google Maps' Geocoding API


(Reposting the previous blog with additional module on reverse geocoding added here.)

First and foremost, I absolutely love the topic of location analytics (geo-spatial analysis) and see tremendous business potential in it in the not-so-distant future.  I would go out on a limb and predict that location analytics will soon go viral in the enterprise space because it has the capability to wow us.  Look no further than your iPhone or Android phone and count how many location-aware apps you have; we all have at least one, Google Maps.  Mobile is one of the strongest catalysts for enterprise adoption of location-aware apps.  All right, enough business talk; let's get dirty with the code.

Over the last year and a half, I have faced numerous challenges geocoding and reverse geocoding the data I have used to showcase my passion for location analytics.  In 2012, I decided to take things into my own hands and turned to R.  Here I am sharing a simple R script that I wrote to geocode my data whenever I need it, even BIG data.

To geocode and reverse geocode my data, I use Google's Geocoding service, which returns the geocoded data as JSON.  I recommend registering with the Google Maps API and getting a key if you have a large amount of data and will do repeated geocoding.

Geocode:

getGeoCode <- function(gcStr) {
  library("RJSONIO") # Load library
  gcStr <- gsub(' ', '%20', gcStr) # Encode URL parameters
  # Open connection
  connectStr <- paste('http://maps.google.com/maps/api/geocode/json?sensor=false&address=', gcStr, sep="")
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  # Flatten the received JSON
  data.json <- unlist(data.json)
  if(data.json["status"]=="OK") {
    lat <- data.json["results.geometry.location.lat"]
    lng <- data.json["results.geometry.location.lng"]
    gcodes <- c(lat, lng)
    names(gcodes) <- c("Lat", "Lng")
    return(gcodes)
  }
}
geoCodes <- getGeoCode("Palo Alto,California")


> geoCodes
           Lat            Lng 
  "37.4418834" "-122.1430195" 

Reverse Geocode:
reverseGeoCode <- function(latlng) {
  library("RJSONIO") # Load library
  latlngStr <- gsub(' ', '%20', paste(latlng, collapse=",")) # Collapse and encode URL parameters
  # Open connection
  connectStr <- paste('http://maps.google.com/maps/api/geocode/json?sensor=false&latlng=', latlngStr, sep="")
  con <- url(connectStr)
  data.json <- fromJSON(paste(readLines(con), collapse=""))
  close(con)
  # Flatten the received JSON
  data.json <- unlist(data.json)
  if(data.json["status"]=="OK")
    address <- data.json["results.formatted_address"]
  return(address)
}
address <- reverseGeoCode(c(37.4418834, -122.1430195))

> address
                    results.formatted_address 
"668 Coleridge Ave, Palo Alto, CA 94301, USA" 

Happy Coding!