Data Science

From Verify.Wiki
Data Science Definition

Data science involves using automated methods to analyze massive amounts of data (also referred to as big data) and to extract knowledge from it. One way to consider data science is as an evolutionary step in interdisciplinary fields like business analysis that incorporate computer science, modeling, statistics, analytics, and mathematics. Data science is also defined as a field that sits at the intersection of social science and statistics, information and computer science, and design. Data science has emerged to address the explosion in data volumes, which traditional statistics was not designed to handle: statistics was developed to understand small samples, many of which arose from agriculture. The focus of data science is on extracting and storing data, assuring its quality, and understanding and communicating information for better decision making. [1]

As data from social media, sensors, web logs, business applications, and the general web grows rapidly, data science has become the core discipline for extracting “actionable insights” from these datasets to help make informed business decisions. A Harvard Business Review article named data scientist "The Sexiest Job of the 21st Century" [2], and the Mashable website reports that data science provides the best job options for candidates looking for work-life balance. [3] A 2011 study by McKinsey projected that the U.S. would face a shortage of 140,000 to 190,000 data science experts by 2018. [4] The average annual salary for an R (programming language) expert in the USA is $115,000 and the average salary for a Python (programming language) expert is $101,000, according to a 2014 Dice tech salary survey. [5]


Data science is a discipline that arises out of the problem of analyzing and understanding big data sets. It is an interdisciplinary field that employs techniques from disciplines such as mathematics, statistics, computer science, information science and business analytics. Techniques frequently used in data science include data mining, pattern recognition, data analysis and visualization, probability and machine learning. Using these techniques, data science investigates problems in domains such as marketing analytics, risk management, agriculture, public policy, marketing optimization, fraud detection, health care and public transport.

In general, data science is transforming traditional ways of analyzing problems and creating new solutions. Over time, the techniques that data scientists use will evolve and become more sophisticated, allowing data science to tackle age-old challenges in new ways. Health care, urban living and business are areas where data science can be seen in action today. [6]

As cost pressures tighten, the main challenge for hospitals is to treat as many patients as efficiently as they can while improving the quality of care. Instrument and machine data is increasingly being used to track and optimize patient flow, treatment, and equipment use in hospitals. It is estimated that a one percent efficiency gain could yield more than $63 billion in global health care savings.

Use cases of data science include understanding customer churn, customer segmentation, customer relationship channel optimization, demand forecasting, fraud detection, customer support demand forecasting and medication effectiveness. [7]
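Customer churn, the first use case listed above, can be made concrete with a small calculation. The following is a minimal sketch only: the customer records, the segment names and the definition of churn rate used here (customers lost during a period divided by customers at the start of the period) are illustrative assumptions, not taken from the cited sources.

```python
from collections import defaultdict

# Hypothetical records: (customer_id, segment, still_active_at_period_end)
customers = [
    ("c1", "prepaid", True),
    ("c2", "prepaid", False),
    ("c3", "prepaid", True),
    ("c4", "contract", True),
    ("c5", "contract", True),
]


def churn_rate_by_segment(records):
    """Return {segment: fraction of that segment's customers lost in the period}."""
    totals = defaultdict(int)
    lost = defaultdict(int)
    for _customer_id, segment, active in records:
        totals[segment] += 1
        if not active:
            lost[segment] += 1
    return {segment: lost[segment] / totals[segment] for segment in totals}


rates = churn_rate_by_segment(customers)
for segment, rate in sorted(rates.items()):
    print(f"{segment}: {rate:.1%}")
```

Splitting the rate by segment, rather than reporting one overall number, is a design choice that ties churn analysis to the customer segmentation use case in the same list.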

Case studies of data science in use

  • Prudential life insurance in the USA is using a predictive model to automatically classify risk and streamline the process of risk assessment. [8]
  • Telstra in Australia is using data to understand disruptions in its telecommunications network. [9]
  • The 2015 Data Science Bowl challenged data scientists to use MRI data for early diagnosis of heart disease. [10]
  • The FAA in the US is analyzing data from airlines, surveillance, weather, terrain and infrastructure to improve civil aviation. [11]
  • Vodafone and Argyle Data are analyzing network data to detect fraud. [12]
  • Verizon Wireless uses data science to keep customer churn below 1%. [13]
  • NSA PRISM uses data science to analyze social media data, phone calls, emails, financial transactions and other relevant data. This enables US intelligence services to detect and prevent terrorism threats. [14]
  • PayPal uses data science to improve the safety of its payment systems. [15]
  • eBay uses data science to optimize customer experience. [16]
  • Siemens uses data science to improve safety. [17]
  • Uber uses data science in all aspects of its business. They think all companies can benefit from it in all processes, including manufacturing processes, the retail industry, the financial services sector and the travel industry.

They further mention using it in the retail, finance and travel sectors in the following ways. In the retail industry, it is used to analyse customer reviews from call centers, social media and similar channels to gain feedback and enhance performance. In the financial services sector, it is used to innovate credit scoring, identify fraud and build products that satisfy customers. In the travel industry, it is used to predict delays and fuel consumption, organize promotions and make sure the company performs at its maximum capacity. [18]


  • 1960: The term "data science" was first used by Peter Naur as a substitute for Computer Science.[19]
  • 1962: John W. Tukey wrote in “The Future of Data Analysis” that data analysis, and the parts of statistics which adhere to it, must take on the characteristics of science rather than those of mathematics. [20]
  • 1974: Peter Naur published Concise Survey of Computer Methods, which used the term data science in its survey of the contemporary data processing methods that were used in a wide range of applications. [21]
  • 1977: The International Association for Statistical Computing (IASC) was established as a Section of the ISI. Its mission was to link traditional statistical methodology, modern computer technology and the knowledge of domain experts in order to convert data into information and knowledge. [22]
  • 1996: Members of the International Federation of Classification Societies (IFCS) met in Kobe for their biennial conference. Here, for the first time, the term data science was included in the title of the conference ("Data Science, classification, and related methods"). [23]
  • 1997: C. F. Jeff Wu gave the inaugural lecture entitled "Statistics = Data Science?" for his appointment to the H. C. Carver Professorship at the University of Michigan. In this lecture, he characterized statistical work as a trilogy of data collection, data modeling and analysis, and decision making. In his conclusion, he initiated the modern, non-computer science, usage of the term "data science" and advocated that statistics be renamed data science and statisticians data scientists. [24]
  • 1998: C. F. Jeff Wu presented his lecture entitled "Statistics = Data Sciences" as the first of his P. C. Mahalanobis Memorial Lectures. These lectures honor Prasanta Chandra Mahalanobis, an Indian scientist and statistician and founder of the Indian Statistical Institute. [25]
  • 2001: William S. Cleveland introduced data science as an independent discipline, extending the field of statistics to incorporate "advances in computing with data" in his article "Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics," which was published in Volume 69, No. 1, of the April 2001 edition of the International Statistical Review / Revue Internationale de Statistique. In his report, Cleveland establishes six technical areas which he believed to encompass the field of data science: multidisciplinary investigations, models and methods for data, computing with data, pedagogy, tool evaluation, and theory. [26]
  • 2002: The International Council for Science: Committee on Data for Science and Technology (CODATA) started the Data Science Journal, a publication focused on issues such as the description of data systems, their publication on the internet, applications and legal issues. [27]
  • 2003: Columbia University began publishing The Journal of Data Science, which provided a platform for all data workers to present their views and exchange ideas. The journal was largely devoted to the application of statistical methods and quantitative research. [28]
  • 2005: The National Science Board published "Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century", defining data scientists as "the information and computer scientists, database and software engineers and programmers, disciplinary experts, curators and expert annotators, librarians, archivists, and others, who are crucial to the successful management of a digital data collection" whose primary activity is to "conduct creative inquiry and analysis." [29]
  • 2008: DJ Patil and Jeff Hammerbacher used the term "data scientist" to define their jobs at LinkedIn and Facebook, respectively. [2]
  • 2012: Harvard Business Review published an article titled "Data Scientist: The Sexiest Job of the 21st Century". [2]
  • 2014: The Dice Tech Salary Survey reported the average annual salary for an R (programming language) expert to be $115,531 in the USA and the average salary for a Python (programming language) expert to be $94,139, and predicted these to be the hottest jobs for 2015. [30]

Techniques used in data science

Differences in use of the term data science

There is no standardized use of the term data science, and it may mean different things to different people. Other terms closely related to data science include big data, statistics and business analytics. Jeff Wu argues that data science is equivalent to statistics, and that statistics should therefore be re-branded as data science and statisticians as data scientists. William Cleveland views data science as a distinct discipline that has emerged from the field of statistics by incorporating other disciplines. IBM notes that data science complements big data, statistics and business analysis. Training in data science is similar to training in data analysis and business analysis. However, the data scientist is distinguished by strong business understanding and excellent communication skills.

Ethical issues in data science

The primary concern, raised by privacy advocates, is that it is unethical to analyze people's data without their knowledge or consent. This is less of a problem in science and health research, where the regulatory framework is stringent. In other industries ethical standards are not yet mature, and this poses a privacy risk. This is an area where data scientists need to make significant contributions to avert breaches of privacy. Legal frameworks and policies need to be put in place. [31] For example, the NSA PRISM program to gather intelligence has been authorized by federal judges, but it is debatable how misuse of this data will be forestalled.

Tools used in Data Science

  • R - An open source programming language used for statistical modelling and visualization
  • Python - A general purpose, open source, interpreted, object-oriented, high-level programming language
  • SAS - An integrated software environment designed for data extraction, transformation, access, mining, visualization and reporting
  • IBM SPSS - A software package used for statistical and predictive analysis.
  • Apache Hadoop - An open-source software framework for processing very large datasets by utilizing many machines. The framework is able to use from a few to several thousands of machines to cope with growing amounts of data.
  • ETL tools - Tools used for extracting data from different systems, correcting errors and loading the data into a data warehouse. Informatica PowerCenter, IBM InfoSphere Information, Oracle Data Integrator, Microsoft SSIS and Pentaho Data Integration are some of the widely used tools. [32]
  • Business intelligence software - Tools used for analyzing and visualizing data using reports, charts and dashboards. Cognos, BusinessObjects, SQL Server Reporting Services and Tableau are some of the tools used.
  • Weka - An open source project that provides machine learning and data mining software
  • RapidMiner - A predictive analytics software offered under open source and commercial licensing
  • KNIME - An open source data analysis, reporting and integration platform
  • Relational databases - Data management systems for organizing and storing information. Widely used systems include Oracle, SQL Server, DB2, MySQL and PostgreSQL
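The extract, correct and load steps described in the ETL tools entry above can be sketched with Python's standard library alone. This is a minimal illustration under stated assumptions, not how any of the named commercial tools work: the CSV content, the `orders` table, the column names and the rule of skipping rows whose amount fails to parse are all hypothetical, and an in-memory SQLite database stands in for a data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw export from a source system; the third row has a bad amount.
RAW_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,7.25
3,carol,not-a-number
"""


def extract(text):
    """Extract: parse CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: normalise customer names, convert amounts, skip unparseable rows."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # assumption: bad rows are dropped; a real pipeline would log them
        clean.append((int(row["order_id"]), row["customer"].strip().lower(), amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into the (stand-in) warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")  # in-memory database, discarded on exit
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute(
    "SELECT order_id, customer, amount FROM orders ORDER BY order_id"
).fetchall()
print(loaded)
```

Keeping the three stages as separate functions mirrors the extract, transform, load split and lets each stage be tested on its own.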

Companies providing Data Science Technologies and Services

  • Kaggle - Kaggle is a community where data scientists compete with each other to solve Data Science problems
  • Yhat - A Data Science technology company that provides tools and systems that allow enterprises to turn data insights into data-driven products
  • Data Science Inc. - DataScience combines human intellect with machine-powered analysis to create insights from complex data for enterprises
  • Framed - Uses machine learning to predict customer churn
  • Interana - Develops technology to help businesses analyze streaming data in realtime
  • ThoughtSpot - ThoughtSpot's Relational Search Appliance combines data from on-premise, cloud and desktop sources, and provides users with the ability to access that data with a simple search interface.
  • AtScale - AtScale Intelligence Platform software that allows commonly used business intelligence tools to access data stored in Hadoop clusters
  • Confluent - Provides technology and services that help businesses adopt and use the Apache Kafka system
  • Kyvos Insights - OLAP (online analytical processing) software that carries out interactive, multidimensional analysis tasks on huge volumes of structured and unstructured Hadoop data
  • Looker - A cloud-based tool that can connect to a wide range of data sources, including Amazon Redshift, Google BigQuery, HP Vertica, Cloudera Impala, Apache Spark, SQL databases and others
  • DataHero - The DataHero cloud-based service collects data from such disparate sources as Box, Dropbox, Google Drive, Excel, Office 365, Marketo, HubSpot and Eventbrite, and turns it into charts and dashboards
  • Tamr - Tamr develops enterprise data unification software to integrate diverse, siloed data for business analytics tasks and downstream applications
  • Domo - Domo provides business managers with access to information scattered across many disparate sources through a single dashboard
  • Arcadia Data - Arcadia Data develops visual analytics software by directly accessing data stored in Hadoop clusters
  • PwC(Data Science) - Provides consulting services in Data Science
  • Accenture(Data Science) - Provides consulting services in Data Science
  • Palantir - Provides technologies and services in Data Science
  • SAS(Big Data) - Provides software and services to analyze big data
  • Oracle(Big Data) - Provides software, hardware and services to store and analyze big data
  • Teradata(Big Data) - Provides software, hardware and services to store and analyze big data
  • SAP(Big Data) - SAP's HANA platform provides in-memory storage and analytics to crunch big data
  • IBM(Big Data) - Provides hardware, software and services to store and analyze big data. IBM's Watson system is used in many data science projects that involve machine learning
  • Ayasdi - Provides 3-D mapping solutions to unearth trends in big data
  • Splunk - Provides a platform to analyze machine generated operational data such as logs to find trends
  • Alpine Data Labs - Offers a Hadoop-based data analytics platform
  • Alteryx - Provides software that combines structured and unstructured data from multiple sources into one database to conduct predictive, spatial and statistical analysis
  • Attivio - Provides search and discovery technology that integrates structured and unstructured information from various sources.
  • Birst - Offers a Software-as-a-Service business intelligence platform with visual analytics and an automated data warehouse system to store and analyze big data
  • Continuum Analytics - Develops data analytics software based on the Python programming language
  • Datameer - Helps business users of Hadoop integrate, analyze and visualize large volumes of data
  • DataRPM - DataRPM uses machine learning to automatically perform advanced statistical analysis on Hadoop
  • Datawatch - Datawatch develops visual data discovery applications for creating data visualizations in realtime from structured, semistructured and Hadoop-based data
  • Gainsight - develops cloud-based predictive analytics software that's integrated with's CRM application
  • Glassbeam - Develops Software-as-a-Service applications for machine log data analytics
  • GoodData - Develops a cloud-based business intelligence and big data analytics platform
  • Google(Big Data) - Google's BigQuery analytics-as-a-service technology performs SQL-like queries against massive amounts of data
  • Guavus - Develops tools to analyze streaming and stored data
  • H2O - Develops an open-source, in-memory prediction engine for data scientists and developers
  • Information Builders - Develops a software system that accelerates the deployment of master data management and data integration applications
  • Looker Data Sciences - Develops LookML data description language that businesses use to build customer data applications that work with Amazon Redshift, Teradata Aster, HP Vertica, Greenplum, Google BigQuery and other big data systems
  • Luminoso Technologies - Develops text analytics software
  • Metric Insights - Develops "push intelligence technology" to deliver insights and alerts to business users
  • MicroStrategy(Big Data) - Develops business intelligence and visualization tools
  • Panorama Software - Develops data visualization tools
  • ParStream - Develops a distributed, parallel processing columnar database
  • Platfora - Platfora offers a big data analytics toolset that's native to the Apache Hadoop platform
  • Predixion Software - Predixion offers a cloud-based, self-service predictive analytics platform
  • EMC Big Data - EMC's federation of companies that include Pivotal, RSA and VMware provide customized solutions in big data and data science [33]
  • InsightSquared
  • Paxata
  • Trifacta
  • Cloudera
  • Sumo Logic
  • Visier
  • Tableau Software(Big Data)
  • MarkLogic
  • Actifio
  • Hortonworks
  • Informatica(Big Data)
  • Talend(Big Data)
  • Microsoft(Big Data)
  • MongoDB
  • Qlik(Big Data)
  • Data)
  • Datastax
  • Neo Technology
  • Dataguise
  • MapR Technologies
  • Dell(Big Data)
  • 1010Data
  • Amazon Web Services
  • HP (Big Data)
  • Tibco (Big Data)
  • SnapLogic(Big Data)
  • Numerify
  • Logi Analytics
  • Pivotal
  • Syncsort
  • Basho Technologies
  • Recommind
  • Actian
  • Aerospike
  • BlueData Software
  • Citus Data
  • Concurrent
  • Altiscale
  • Attunity
  • Cask
  • Clearstory Data
  • Couchbase
  • Databricks
  • EnterpriseDB

Related Topics

Top Schools that teach data science

There are many US universities that offer Analytics/Data Mining/Data Science degrees. [34] Some of them are listed here: [35]

Data Science in the health care industry

Data science has taken on various dimensions in the health care industry. The branch on which it has had a highly positive impact is the pharmaceutical industry, which uses it to stay compliant with regulations.

Pharmacovigilance (PV) is the practice of monitoring the effects of medications or drugs after they have been licensed for use, especially in order to identify and evaluate previously unreported adverse reactions.

Data science helps in many ways to identify adverse reactions to medications and drugs, and produces insightful data that helps pharmaceutical companies identify previously unreported adverse reactions.



  2. "Data Scientist: The Sexiest Job of the 21st Century"
  3. "Top 10 Jobs With the Highest Work-Life Balance"
  4. Big data McKinsey study
  19. Howard Wainer: "Truth or Truthiness: Distinguishing Fact from Fiction by Learning to Think Like a Data Scientist"
  20. "The Future of Data Analysis"
  21. Peter Naur: "Concise Survey of Computer Methods", 397 p.
  22. IASC Mission
  23. "Data Science, classification, and related methods"
  24. "Statistics = Data Science?"
  29. "Long-lived Digital Data Collections: Enabling Research and Education in the 21st Century"
  30. "Dice Tech Salary Survey 2013-2014"
