Showing posts with label HANA. Show all posts

Wednesday, September 4, 2013

The Future of Big Data is Cognitive Big Data Apps



Volume, Velocity, Variety and Veracity of your data, the 4V challenge, has become untamable.  Wait, yet another big data blog?  No, not really.  In this blog, I would like to propose a cognitive app approach that can transform your big-data problems into big opportunities at a fraction of the cost.

Everyone is talking about big data problems, but few are helping us understand big data opportunities.  Let's define a big data opportunity in the context of customers, because growing the customer base, customer satisfaction and customer loyalty is everyone's business:

  • you have a large, diverse and growing customer base
  • your customers are more mobile and social than ever before
  • you have engaged with your customers wherever they are: web, mobile, social, local
  • you believe that "more data beats better algorithms" and that big data is all data
  • you wish to collect all data - call center records, web logs, social media, customer transactions and more so that
  • you can understand your customers better and how they speak of and rank you in their social networks
  • you can group (segment) your customers to understand their likes and dislikes
  • you can offer (recommend) them the right products at the right time and at the right price
  • you can preempt customer backlash and prevent customers from leaving (churning) to competitors and taking their social networks with them (negative network effects)
  • all this effort will allow you to forecast sales accurately, run targeted marketing campaigns and cut cost to improve revenues and profitability
  • you wish to do all of this without hiring an army of data analysts, consultants and data scientists
  • and without buying half-dozen or more tools, getting access to several public / social data sets and integrating it all in your architecture
  • and above all, you wish to do it fast and drive changes in real time
  • and most importantly, you wish to rinse and repeat this approach for the foreseeable future
There are hardly any enterprise solutions on the market that address the challenges listed above.  You have no choice but to build a custom solution, hiring several consultants and striking separate license agreements with public and social data vendors to get a combined lens on public and private data.  This approach will be cost-prohibitive for most enterprise customers and, as "90% of the IT projects go," will be mired in delays, cost overruns and a truckload of heartache.

Advances in technologies like in-memory databases and graph structures, along with the democratization of data science concepts, can address the challenges listed above in a meaningful and cost-effective way.  Intelligent big data apps are the need of the hour.  These apps need to be designed and built from scratch with these challenges and technologies such as cognitive computing[1] in mind.  They will leave 1990s technology paradigms like "data needs to be gathered and modeled (caged) before an app is built" in the dumpster, and will achieve the flexibility required of all modern apps to adapt as the underlying data structures and data sources change.  They can be deployed right off the shelf with minimal customization and consulting, because the app logic will not be anchored to the underlying data schema and will evolve with changing data and behavior.

Enterprise customers will soon be asking for a suite of such cognitive big data apps across all domain functions, so that they can put big data opportunities to work and run their businesses better than their competitors.  Without a dynamic cognitive approach in apps, addressing the 4V challenge will be a nightmare, and big data will fail to deliver on its promise.

Stay tuned for future blogs on this topic including discussions on a pioneering technology approach.

[1] Cognitive computing is the ability to analyze oceans of data in context with related information and expertise.  Cognitive systems learn from how they’re used and adjust their rules and results dynamically.  Google search engine and knowledge graph technology is predicated upon this approach.  

 This blog has benefited from the infinite wisdom and hard work of my former colleagues Ryan Leask and Harish Butani and that of my current colleagues Sethu M., Jens Doerpmund and Vijay Vijayasankar.

Image courtesy of  MemeGenerator

Sunday, August 25, 2013

Data Science: Definition and Opportunities


Image courtesy of BBC
My thoughts on: what data science is; what skills data scientists have; the current issues in the Business Intelligence pipeline; how machine learning can automate part of the BI chain; why and how data science should be democratized and made available to everyone, including decision makers (business users); how business analysts should build complex data models; how data scientists should be freed from the mundane, rinse-and-repeat ETL tasks that precede building the models that inform decision making; and how companies can build a business practice around data science.

Key Premise: big data is all data, and big data apps offer the ability to combine all data (public + private) and expand the horizon to discover more meaningful insights.

Data Science is:
  • The art of mining large quantities of data
  • The art of combining disparate data sources and blending public data with corporate data
  • Forming hypotheses to solve hard problems
  • Building models to solve current problems and provide forecasts
  • Anticipating future events (based on historical data) and providing corrective actions (finance, banking, travel, operational runtime)
  • Automating the process to reduce the time to solve future problems
A data scientist has the following minimum set of core skills:
  • Problem solver
  • Creative and able to form a hypothesis
  • Able to program with large quantities of data
  • Can identify the appropriate data sources and bring in and blend data
  • A stats/math/analytics background to build models and write algorithms
  • Can quickly develop the domain knowledge needed to understand the key factors that influence a business problem
Roles data scientists play:
  • Problem description
  • Hypothesis formation
  • Data assembly, ETL and data integration
  • Model development (pattern recognition or any other model that provides answers) and training
  • Data visualization
  • A/B testing
  • Proposing solutions and/or new business ideas
The balance between humans and machines:
  • Today: humans play a significant role in every step of the process (ETL, joins, models, visualization, machine learning) and in repeating and recycling it as the problem changes
  • Tomorrow: a big portion of this food chain can be automated via machine learning, so machines take over the mundane work and scientists are freed up to build more algorithms/models
  • Once the process is automated, repeating/recycling it becomes cheaper and less time consuming
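The "tomorrow" scenario above can be sketched in a few lines of R: wrap the repeatable steps (load, clean, model, score) into a single function so rerunning the pipeline on new data is one cheap call.  This is only a minimal sketch; the data set (base R's mtcars) and all function and column names here are illustrative stand-ins, not part of the original discussion.

```r
# Sketch: bundle the repeatable parts of the pipeline (minimal "ETL", model fit,
# quality metric) into one function, so rinse-and-repeat is a single call.
run_pipeline <- function(df, target, features) {
  df <- na.omit(df[, c(target, features)])   # minimal "ETL": select and clean
  f  <- reformulate(features, response = target)
  model <- lm(f, data = df)                  # stand-in for any model
  list(model = model,
       rmse = sqrt(mean(residuals(model)^2)))
}

# Rerunning on a new data set, target or feature list is a one-line change
out <- run_pipeline(mtcars, "mpg", c("wt", "hp"))
round(out$rmse, 2)
```

Swapping in a different data set, target or model is then a one-line change, which is the whole point of automating the rinse-and-repeat cycle.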
The data science pipeline currently looks like this:
  • From data to insights: this end-to-end process requires mundane skills (IT), specialized skills (data scientists) and elements of human psychology to present the right information at the right time
  • The data needs to be discovered, assembled, semantically enriched and anchored to business logic; this task can be automated through machine learning (a set of harmonized tools with AI) to free up scarce resources
  • Specialized skills today are addressed by open-source technologies such as R and by expensive solutions like Matlab and SPSS
  • Very few software solutions carefully introduce a human interface that makes their application consumable without customer training
This pipeline needs complete rethinking:
  • Automate the mundane tasks that IT gets tagged with
  • Discover data automatically
  • Detach business logic from data models
  • Make blending public data with corporate data second nature
  • Free up scientists so that they can build analytics micro-apps for a domain or sub-domain
  • Data science need not be a niche (specialized) category; it should appeal to the masses (democratization of data, bringing insights to everyone without specialized skills)
Opportunities in data science:
  • Understand the value chain (IT + business analysts + data scientists + business users)
  • Provide something for everyone: a single integrated platform (ETL + data integration + predictive modeling + in-memory computing + storage) for data scientists, so that they can build standard analytical apps, move away from proprietary models and standardize (helps IT)
  • Analytical apps on this platform (think of them as Rapid Deployment Solutions) for business users
  • Help business analysts write basic models (churn, segmentation, correlation, etc.) without needing advanced skills
  • Work with consulting companies so that they can consult and build apps for companies that do not have data scientists on their payroll (e.g., Mu Sigma and Opera Solutions)
  • Partner with public data providers (to help clients), consulting companies (rapid-deployment solutions) and the R/Python/ML communities (mind-share and thought leadership)
  • Donate your predictive models to open-source communities
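As a taste of how low the bar for a "basic model" can be, here is a minimal segmentation sketch in base R using kmeans, the kind of thing a business analyst could run without advanced skills.  The customer data below is simulated purely for illustration; no real data set, product or API is implied.

```r
# Minimal customer segmentation sketch: simulate two groups of customers and
# recover them with k-means. All numbers and column names are made up.
set.seed(42)
customers <- data.frame(
  annual_spend   = c(rnorm(50, mean = 200,  sd = 30),
                     rnorm(50, mean = 1200, sd = 150)),
  monthly_visits = c(rnorm(50, mean = 2,  sd = 0.5),
                     rnorm(50, mean = 12, sd = 2))
)

# Scale the features, then ask k-means for two segments
seg <- kmeans(scale(customers), centers = 2, nstart = 10)
customers$segment <- seg$cluster
table(customers$segment)   # two segments of roughly 50 customers each
```

With labeled segments in hand, the analyst can inspect each group's averages (spend, visits) to characterize "likes and dislikes" per segment, no data scientist required.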

Wednesday, October 10, 2012

Besides Facebook's Botched IPO, IPO Market Returns 20% in 2012

Facebook (Ticker: FB) is down ~47% since its IPO in May.  It is not, however, the most botched IPO ever; that infamous record unfortunately belongs to BATS Exchange (Ticker: BATS), which operates an alternate stock exchange to the NYSE and NASDAQ.  (Read the Business Insider story here: 8 Unforgettable IPO Disasters)

FB is not the worst-performing IPO either.  Groupon (Ticker: GRPN) and Zynga (Ticker: ZNGA, proudly led by Mark Pincus) are down 77% and 74%, respectively, since their IPOs.  In comparison, FB has done OK; it could be worse, and a rapid strategy shift, including an emphasis on mobile and a decision to allow e-commerce transactions (Facebook Gifts) on Facebook, has provided some kind of floor under its stock.  Here is a chart comparing the three (not-so) darlings of Web 2.0.



Anyhow, below is a table of the best IPOs of this year.  Guidewire (Ticker: GWRE) and Demandware (Ticker: DWRE) are the two cloud technology companies in the list that have done very well, returning 137% and 108% to date.


IPO Top Performers (YTD)

Company           Offer Date  Underwriter  Industry     Deal Size (mm)  Offer Price  First Day Close  Closing Price  First Day Return  Total Return
Supernus Pharmac  4/30/12     Citi         Health Care  $50             $5.00        $5.37            $12.77         7.4%              155.4%
Nationstar Mortg  3/7/12      Merrill      Financial    $233            $14.00       $14.20           $33.29         1.4%              137.8%
Guidewire Softwa  1/24/12     JPM          Technology   $115            $13.00       $17.12           $30.84         31.7%             137.2%
Annies            3/27/12     CS           Consumer     $95             $19.00       $35.92           $44.87         89.1%             136.2%
Demandware        3/14/12     GS           Technology   $88             $16.00       $23.59           $33.31         47.4%             108.2%


Palo Alto Networks (Ticker: PANW) is up 16% since its first-day close, a 48% return over its IPO price of $42.  Splunk (Ticker: SPLK) is down about 10% from its first-day close but is still up 90% over its IPO price of $17.  Neither company made the cut for the table above.

Here is a list of the worst-performing IPOs to date.  If one changes the time period from YTD to 12 months, Zynga shows up in the list; no surprise there.  Social gaming is a fast-changing environment, and ZNGA faces a crisis of confidence amid so many departures.


IPO Worst Performers (YTD)

Company    Offer Date  Underwriter  Industry    Deal Size (mm)  Offer Price  First Day Close  Closing Price  First Day Return  Total Return
Envivio    4/24/12     GS           Technology  $70             $9.00        $8.49            $2.15          -5.7%             -76.1%
Audience   5/9/12      JPM          Technology  $90             $17.00       $19.10           $5.65          12.4%             -66.8%
CafePress  3/28/12     JPM          Technology  $86             $19.00       $19.03           $8.07          0.2%              -57.5%
Ceres      2/21/12     GS           Materials   $65             $13.00       $14.80           $5.77          13.8%             -55.6%
Renewable  1/18/12     UBS          Energy      $72             $10.00       $10.10           $5.16          1.0%              -48.4%


Take a closer look: FB is barely staying off this infamous list.  On a similar note, LinkedIn (Ticker: LNKD) is up approximately 80% to date.  What a contrasting tale of two social network companies!





So far in 2012, IPOs have returned 20%, better than the -11% the IPO market yielded in 2011.  Since there are about 2.5 months to go before the curtains drop on 2012, the 2012 IPO return might yet beat the 25% that 2010 produced.











One very encouraging sign for IPO investors this year has been the 13% average first-day pop, in line with what the IPO market observed before the great recession (~13%).  And to all the naysayers out there who claim that tech stocks are in a bubble: take a look at the average opening-day pop in 1999 (72%) and 2000 (56%), compare it to 2012, and you will hold your peace for a few more years at least!






Workday (Ticker: WDAY) is on deck for this week.  Do your due diligence before investing.

Happy IPO Investing!
Jitender

Source: Renaissance Capital, Greenwich, CT (www.renaissancecapital.com).

Wednesday, May 2, 2012

Why Delta's Foray into the Crude Refining Business is a BAD Move?

When my mentor/guide and company president Sanjay Poonen threw this open challenge on Twitter:

For all u MBAs, what do u think of Delta buying an oil refinery for $150M (formerly $1B) for top-grade jet fuel. Would Michael Porter frown?

how could I have passed on this challenge?  Plus, I have lately been descending deep into my technology roots (most of my blogs are technical, with lots of code snippets, for all intents and purposes - AllThingsR).  So I decided to spend some time sleuthing and analyzing hard facts before replying to @spoonen and (maybe) countering @gkm1's (George Mathew's) arguments.  This way I get back to analyzing business topics for a while.  (After all, the A in MBA stands for analysis, right?  Masters in Business Analytics?)

The original WSJ story covering Delta's decision to buy a refinery from ConocoPhillips is here.

I spent quite some time researching this deal to educate myself.  I started with the prior belief that this is a BAD deal.  After all, the crude refining business is a boom-and-bust business with razor-thin margins, and it is notoriously competitive.  Here is a quote from Bloomberg supporting my argument: "Refiners in the northeastern U.S. are struggling to turn a profit because of the narrow margin between the cost of imported crude and fuel prices." (Source: Bloomberg)

Moreover, not a single new refinery has sprung up in the US in at least 35 years (Source) because no one wants to invest in this business.  In addition, ConocoPhillips had idled this refinery for a few months, and Sunoco, another refiner in that area, is in the process of shutting down two more refineries in the region. (Source: Bloomberg)  "Sunoco...said its refining businesses has been losing $1 million dollars a day for three years running." (Source)

So why is Delta buying this refinery?  Vertical integration, fuel hedging, cost savings, politics, EPS improvement?  Actually, all of the above.

Delta's planes burned 3.9B gallons of jet fuel last year.  At an average 2011 price of $2.86 per gallon, Delta spent $11.8B on fuel, which is 40% of its operating expenses. (Source: NYTimes)  If jet fuel were 40% of your company's operating expenses, you too would be thinking about such dramatic decisions, though you might not execute on them if they were outside your realm.  Delta did.

Delta will pay $150M in cash (it has $3B in cash on its balance sheet, so there is no liquidity issue) and will invest another $100M in retooling the refinery.  Note also that the PA government is chipping in an additional $30M (thank you, taxpayers!).  Retooling is required for reasons self-evident in this table (mainly to crank up jet-fuel production):


Now, looking at this table, why would anyone believe that Delta can earn $300M every year from this?  Also remember, Delta is not bringing its fuel cost down from ~$12B by a whole lot; it is merely trying to save a few cents on the dollar.  A little shift in the numbers above and Delta will be in the red.

Delta said this is a good deal for investors.  Really?  Valero had margins of less than 3% in its last quarter, and it is a pure-play refining company.  Can Delta beat Valero on margins?  I have serious doubts.  This could be a gain, but only for Delta's management as it attempts to boost EPS in the near term.

Also, can you really believe that Delta can retool the refinery and produce more jet fuel than is possible?  The chemistry doesn't support it.  From one barrel of crude, only about 19.5 gallons of gasoline and 4.1 gallons of jet fuel can be produced.  How is Delta going to produce more jet fuel per barrel of crude?
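A quick back-of-the-envelope calculation in R, using only the figures quoted in this post (3.9B gallons of jet fuel a year, ~4.1 gallons of jet fuel per barrel of crude), shows the scale problem:

```r
# Rough arithmetic from the post's figures; illustrative only.
jet_fuel_needed_gal <- 3.9e9   # gallons of jet fuel Delta burned in 2011
jet_fuel_per_barrel <- 4.1     # gallons of jet fuel yielded per barrel of crude

barrels_per_day <- (jet_fuel_needed_gal / jet_fuel_per_barrel) / 365
round(barrels_per_day / 1e6, 2)   # ~2.61 million barrels of crude per day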

Also, FYI, this refinery can only process light sweet crude (with low sulfur), not the heavy Saudi oil that has high sulfur and is gaining prominence due to global oil issues. (Source: ConocoPhillips)

Net net, this is a bad move; Delta will burn itself and get out in a year or two.  And when it sells, it will be a fire sale, since many other refineries in the area are already struggling to make a profit, as mentioned above.  Delta's belief that the future of refining is bright is quite puzzling to me.

Happy Analyzing!

Big Data, R and SAP HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part III

Mash-up Airlines Performance Data with Historical Weather Data to Pinpoint Weather Related Delays

For this exercise, I combined the following four separate blogs that I wrote on big data, R and SAP HANA.  Historical airlines and weather data were used for the underlying analysis.  The aggregated output of this analysis was written out as JSON and visualized in HTML5, D3 and Google Maps.  The previous blogs in this series are:
  1. Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize in HTML5 Using D3 - Part II
  2. Big Data, R and HANA: Analyze 200 Million Data Points and Later Visualize Using Google Maps
  3. Getting Historical Weather Data in R and SAP HANA 
  4. Tracking SFO Airport's Performance Using R, HANA and D3
In this blog, I wanted to mash up disparate data sources in R and HANA, combining airlines data with weather data to understand the reasons behind airport/airline delays.  Why weather?  Because weather is one of the most commonly cited reasons for flight delays in the airline industry.  Fortunately, the airlines data breaks up each delay by cause (weather, security, late aircraft, etc.), so weather-related delays can be isolated and the actual weather data can then be mashed up to validate the airlines' claims.  However, I will not be doing that here; I will just be displaying the mashed-up data.

I have intentionally focused on the three bay-area airports and used the last 4 years of historical data to visualize each airport's performance in an HTML5 calendar built from scratch with D3.js.  One could use all 20 years of data, and all the airports, to extend this example.  I had downloaded historical weather data for the same 2005-2008 period for the SFO and SJC airports, as shown in my previous blog (for some strange reason, there is no weather data for OAK, huh?).  Here is how the final result looks in HTML5:



Click here to interact with the live example.  Hover over any cell and a tooltip with comprehensive analytics will show the breakdown of the performance delay for the selected cell, including weather data and the correct icons* - the result of the mash-up.  Choose a different airport from the drop-down to change the performance calendar.
* Weather icons are properties of Weather Underground.

As anticipated, the SFO airport had more red on the calendar than SJC and OAK.  SJC is definitely the best-performing airport in the bay area.  Contrary to my expectation, weather didn't wreak as much havoc at SFO as one would expect.  Strange?

Creating a mash-up of these two data sets in R was super easy, and a CSV output was produced to work with HTML5/D3.  Here is the R code, and in case it's not clear from all my previous blogs: I just love the data.table package.


###########################################################################################
# Percent of delayed flights from the three bay-area airports, a break-up of the flight
# delays by cause, mashed up with weather data
###########################################################################################
library(data.table)
# baa.hp (airlines data.table), baa.weather (weather data.table, keyed on
# Year, Month, DayofMonth, Origin) and getRowWiseJson() were built in the
# previous blogs in this series.

baa.hp.daily.flights <- baa.hp[,list(TotalFlights=length(DepDelay),
                                     CancelledFlights=sum(Cancelled, na.rm=TRUE)),
                               by=list(Year, Month, DayofMonth, Origin)]
setkey(baa.hp.daily.flights, Year, Month, DayofMonth, Origin)

baa.hp.daily.flights.delayed <- baa.hp[DepDelay>15,
                                     list(DelayedFlights=length(DepDelay),
                                      WeatherDelayed=length(WeatherDelay[WeatherDelay>0]),
                                      AvgDelayMins=round(sum(DepDelay, na.rm=TRUE)/length(DepDelay), digits=2),
                                      CarrierCaused=round(sum(CarrierDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      WeatherCaused=round(sum(WeatherDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      NASCaused=round(sum(NASDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      SecurityCaused=round(sum(SecurityDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2),
                                      LateAircraftCaused=round(sum(LateAircraftDelay, na.rm=TRUE)/sum(DepDelay, na.rm=TRUE), digits=2)),
                                     by=list(Year, Month, DayofMonth, Origin)]
setkey(baa.hp.daily.flights.delayed, Year, Month, DayofMonth, Origin)

# Merge the two data.tables
baa.hp.daily.flights.summary <- baa.hp.daily.flights.delayed[baa.hp.daily.flights,
                           list(Airport=Origin,
                           TotalFlights, CancelledFlights, DelayedFlights, WeatherDelayed,
                           PercentDelayedFlights=round(DelayedFlights/(TotalFlights-CancelledFlights), digits=2),
                           AvgDelayMins, CarrierCaused, WeatherCaused, NASCaused, SecurityCaused, LateAircraftCaused)]
setkey(baa.hp.daily.flights.summary, Year, Month, DayofMonth, Airport)

# Merge with the weather data
baa.hp.daily.flights.summary.weather <- baa.weather[baa.hp.daily.flights.summary]
baa.hp.daily.flights.summary.weather$Date <- as.Date(paste(baa.hp.daily.flights.summary.weather$Year,
                                                           baa.hp.daily.flights.summary.weather$Month,
                                                           baa.hp.daily.flights.summary.weather$DayofMonth,
                                                           sep="-"), "%Y-%m-%d")
# Remove a few columns that are no longer needed
baa.hp.daily.flights.summary.weather <- baa.hp.daily.flights.summary.weather[,
            which(!(colnames(baa.hp.daily.flights.summary.weather) %in% c("Year", "Month", "DayofMonth", "Origin"))), with=FALSE]

# Write the output in both JSON and CSV file formats
objs <- baa.hp.daily.flights.summary.weather[, getRowWiseJson(.SD), by=list(Airport)]
# objs now holds (AirportCode, JSONString) pairs; stitch them back together
row.json <- apply(objs, 1, function(x) paste('{"AirportCode":"', x[1], '","Data":', x[2], '}', sep=""))
json.st <- paste('[', paste(row.json, collapse=', '), ']')
writeLines(json.st, "baa-2005-2008.summary.json")
write.csv(baa.hp.daily.flights.summary.weather, "baa-2005-2008.summary.csv", row.names=FALSE)


Happy Coding!

Thursday, March 22, 2012

Tracking SFO Airport's Performance Using R, HANA and D3

Visualize Big Data Using R, HANA, D3, JSON and HTML5/JavaScript

This is my first introduction to D3 and I am simply blown away.  Mike Bostock (@mbostock), you are a genius; thanks for creating D3!  With HANA, R, D3, HTML5 and an iPad, you have got yourself a KILLER combo!

I have been burning the midnight oil piecing together my big data story using HANA, R, JSON and HTML5.  If you recall, I did a technical session on R and SAP HANA at DKOM, SAP's Development Kickoff event, last week, where I showcased the supreme powers of R and HANA in analyzing 124 million records in real time: R and SAP HANA: A Highly Potent Combo for Real Time Analytics on Big Data

Since last week, I have been looking for other creative ways to analyze and then visualize this airlines data, and I was very fortunate to come across D3.  After spending a couple of hours with D3, I decided to build the calendar view for the airlines data I have.  The calendar view is the first example Mike shows on his D3 page.  Amazingly awesome!

I created this calendar view capturing the percentage of flights departing SFO daily that were delayed, for 2005-2008.  For this analysis, I used HANA to pull the data for SFO (out of 250-plus airports) over this 4-year period in seconds, and then did all the aggregation in R, including creating a JSON and a .CSV file, again in seconds.  Then I moved to HTML5 and D3 to generate this beautiful calendar view showing SFO's performance, presented below:


As expected, December and January are the two notorious months for flight delays.  Have fun with the live example hosted in the Amazon cloud.

Once again, my R code is very simple:


## Departure Delay for the SFO Airport
library(data.table)
library(rjson)  # for toJSON(); RJSONIO works too
# ba.hp is the airlines data.table built in the previous blogs in this series

ba.hp.sfo <- ba.hp[Origin=="SFO",]

ba.hp.sfo.daily.flights <- ba.hp.sfo[,list(DailyFlights=length(DepDelay)),
                                     by=list(Year, Month, DayofMonth)][order(Year, Month, DayofMonth)]
ba.hp.sfo.daily.flights.delayed <- ba.hp.sfo[DepDelay>15,
                                     list(DelayedDailyFlights=length(DepDelay)),
                                     by=list(Year, Month, DayofMonth)][order(Year, Month, DayofMonth)]
setkey(ba.hp.sfo.daily.flights.delayed, Year, Month, DayofMonth)

# Join the delayed-flight counts with the total daily flights
response <- ba.hp.sfo.daily.flights.delayed[ba.hp.sfo.daily.flights]
response <- response[,list(Date=as.Date(paste(Year, Month, DayofMonth, sep="-"), "%Y-%m-%d"),
                           #DailyFlights, DelayedDailyFlights,
                           PercentDelayedFlights=round((DelayedDailyFlights/DailyFlights), digits=2))]

# Write the output in both JSON and CSV formats
objs <- apply(response, 1, toJSON)
res <- paste('{"dailyFlightStats": [', paste(objs, collapse=', '), ']}')
writeLines(res, "dailyFlightStatsForSFO.json")
write.csv(response, "dailyFlightStatsForSFO.csv", row.names=FALSE)


For the D3 and HTML code, please take a look at this example on the D3 website.

Happy Analyzing and Keep That Midnight Oil Burning!