Sunday, August 25, 2013

Data Science: Definition and Opportunities


Image courtesy of BBC
This post collects my thoughts on what data science is and what skills data scientists have; the current issues in the Business Intelligence (BI) pipeline; how machine learning can automate part of the BI chain; why and how data science should be democratized and made available to everyone, including decision makers (business users); how business analysts should build complex data models; how data scientists should be freed from the mundane rinse-and-repeat ETL that precedes building models that inform decisions; and how companies can build a business practice around data science.

Key premise: big data is all data, and big data apps offer the ability to combine all of it (public + private), expanding the horizon for discovering more meaningful insights.

Data Science is:
  • An art of mining large quantities of data 
  • An art of combining disparate data sources and blending public data with corporate data
  • Forming hypotheses to solve hard problems
  • Building models to solve current problems and provide forecasts
  • Anticipating future events (based on historical data) and providing corrective actions (finance, banking, travel, operational runtime)
  • Automating the processes to reduce time to solve future problems
A data scientist has the following minimum set of core skills:
  • Problem-solver
  • Creative and can form a hypothesis
  • Is able to program with large quantities of data
  • Can identify the appropriate data sources, then bring in and blend the data
  • Stats/math/analytics background to build models and write algorithms
  • Can quickly develop domain knowledge to understand the key factors that influence the performance of a business problem
Roles Data Scientists play:
  • Problem description 
  • Hypothesis formation
  • Data assembly, ETL and data integration
  • Model development (pattern recognition or any other model to provide answers) and training
  • Data visualization 
  • A/B testing
  • Proposing solutions and/or new business ideas
The balance between humans and machines:
  • Current: humans play a significant role in the process (ETL, joins, models, visualization, machine learning) and in repeating and recycling this process as the problem changes
  • Tomorrow: a big portion of the food chain can be automated via machine learning, so machines take over and scientists are freed up to build more algorithms/models
  • Once the process is automated, repeating/recycling it becomes cheaper and less time-consuming
The Data Science pipeline currently looks like this:
  • From data to insights: the entire process requires mundane skills (IT), specialized skills (data scientist) and elements of human psychology to present the right information at the right time
  • The data needs to be discovered, assembled, semantically enriched and anchored to business logic. This task can be automated through machine learning (a set of harmonized tools with AI) to free up scarce resources
  • Specialized skills today are addressed by open source technologies such as R and by expensive solutions like Matlab and SPSS
  • Very few software solutions carefully design the human interface so that their applications are consumable without customer training
This pipeline needs complete rethinking:
  • Automate mundane tasks that IT gets tagged with 
  • Discover data automatically 
  • Detach business logic from data models
  • Make blending public data with corporate data second nature
  • Free up scientists so that they can build analytics micro-apps for a domain or a sub-domain
  • Data science need not be a niche (a specialized category); it should appeal to the masses (democratizing data and bringing insights to everyone without requiring specialized skills)
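
To make the idea of blending public data with corporate data concrete, here is a minimal sketch in plain Python. It assumes both sources share a "region" key; the region names, revenue figures, population figures and the blend function are all invented for illustration and are not taken from any real dataset or product.

```python
# Hypothetical corporate data: revenue per sales region (invented figures).
corporate_sales = [
    {"region": "North", "revenue": 120000.0},
    {"region": "South", "revenue": 90000.0},
    {"region": "West", "revenue": 45000.0},
]

# Hypothetical public data: population per region (invented figures).
# "West" is deliberately missing to show how unmatched rows are handled.
public_population = {
    "North": 400000,
    "South": 150000,
}


def blend(sales_rows, population_by_region):
    """Left-join corporate rows to public data on the shared 'region' key."""
    blended = []
    for row in sales_rows:
        population = population_by_region.get(row["region"])
        # Without a public match the derived metric is left empty, not guessed.
        per_capita = row["revenue"] / population if population else None
        blended.append(
            {**row, "population": population, "revenue_per_capita": per_capita}
        )
    return blended


if __name__ == "__main__":
    for blended_row in blend(corporate_sales, public_population):
        print(blended_row)
```

The join itself is trivial; in practice the hard part is discovering the public source and agreeing on the shared key, which is exactly the work this list argues should be automated.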
Opportunities in Data Science: 
  • Understand the value chain (IT + Business Analyst + Data Scientists + Business Users)
  • Provide something for everyone: a single integrated platform (ETL + data integration + predictive modeling + in-memory computing + storage) for data scientists, so that they can build standard analytical apps, move away from proprietary models, and standardize (which helps IT)
  • Analytical apps on this platform (think of them as Rapid Deployment Solutions) for business users
  • Help business analysts write basic models (churn, segmentation, correlation, etc.) without needing advanced skills
  • Work with consulting companies (such as Mu-Sigma and Opera Solutions) so that they can consult and build apps for companies that do not have data scientists on their payroll
  • Partner with public data providers (to help clients), consulting companies (Rapid Deployment Solutions), and R/Python/ML communities (mind-share and thought leadership)
  • Donate your predictive models to open-source communities
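
As an illustration of the kind of basic model a business analyst could write without advanced skills, the sketch below computes a Pearson correlation using only the Python standard library. The pearson helper and the sample figures (support tickets versus months retained) are hypothetical and exist only to show the shape of such a model.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical per-customer figures, invented for illustration.
    support_tickets = [0, 1, 2, 4, 7]
    months_retained = [36, 30, 24, 12, 3]
    print(round(pearson(support_tickets, months_retained), 3))
```

A strongly negative value here would only suggest that customers who file more tickets tend to leave sooner; it says nothing about cause, which is where the data scientist's hypothesis work comes back in.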
