Data Science
What it is and why it matters
Data science is a multidisciplinary field that broadly describes the use of data to generate insight. Unlike more specialised data-related fields, such as data mining or data engineering, data science encompasses the complete life cycle of translating raw data into usable information and applying it for productive ends in a wide variety of applications.
The evolution of data science
When tracing the origin of data science, many think back to 1962, when mathematician John Tukey hinted at the discipline in his seminal paper The Future of Data Analysis. In it, he described the existence of an “unrecognised science,” one which involved learning from data.
It’s more helpful, however, to examine data science in the modern world. The advent of big data – made possible by leaps in processing and storage capabilities – has brought about unprecedented opportunities for organisations to reveal hidden patterns in data and use this insight to improve decision making. But to do so, they must first collect, process, analyse and share that data. Managing this data life cycle is the essence of data science.
Today, data science is ubiquitous in the business world – and beyond. So much so that Harvard Business Review dubbed data scientist the sexiest job of the 21st century. If data scientists are the practitioners, data science is the set of techniques and technologies they apply.
Data science in today’s world
Get a glimpse into the modern world of data science.
Who’s using data science?
You’d be hard-pressed to find an industry that doesn’t infuse data science into critical business functions. Here are a few of the most interesting use cases.
Bridging the data science skills gap
The demand for advanced analytical skills has skyrocketed, leaving countries scrambling to bridge the talent gap. By using SAS® Education Analytical Suite and SAS® Viya®, North-West University is providing innovative data science education. This is transforming South Africa's workforce by helping students gain vital firsthand experience in problem formulation, business etiquette and writing, and value delivery.
Data science outputs
To understand the many ways data science can affect an organisation, it’s helpful to examine some of the common data science goals and deliverables; a brief code sketch of one of them follows the list.
- Prediction (when an asset will fail).
- Classification (new or existing customer).
- Recommendations (if you like that, try this).
- Anomaly detection (fraudulent purchases).
- Recognition (image, text, audio, video, etc.).
- Practical insights (dashboards, reports, visualisations).
- Automated processes and decision making (credit card approval).
- Scoring and ranking (credit score).
- Segmentation (targeted marketing).
- Optimisation (manufacturing improvements).
- Forecasts (predicting sales and revenue).
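To make one of these outputs concrete, here is a small, hypothetical sketch of anomaly detection on purchase amounts, assuming Python with numpy and scikit-learn installed; the data and parameters are invented purely for illustration.

```python
# Toy anomaly-detection sketch: flag unusually large purchases.
# Assumes Python with numpy and scikit-learn installed; the data are invented.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=42)

# Simulated purchase amounts: mostly ordinary, a few suspiciously large.
ordinary = rng.normal(loc=50, scale=15, size=(200, 1))
suspicious = rng.normal(loc=400, scale=50, size=(5, 1))
purchases = np.vstack([ordinary, suspicious])

# IsolationForest isolates points that look different from the bulk of the data.
model = IsolationForest(contamination=0.03, random_state=0)
labels = model.fit_predict(purchases)   # -1 = anomaly, 1 = normal

flagged = purchases[labels == -1].ravel()
print(f"Flagged {len(flagged)} purchases as anomalous, e.g. {flagged[:3].round(2)}")
```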
“If you’re looking to augment your data science work with a better grasp of choosing, deploying and managing models, then exploring more training in AI and ML is ideal.”
– Ronald van Loon, Principal Analyst and CEO of Intelligent World
Composite AI
Most AI projects today rely on multiple data science technologies. According to Gartner, using a combination of different AI techniques to achieve the best result is called “composite AI.”
With composite AI, you start with the problem and then apply the right data and tools to solve the problem. This often includes using a combination of data science techniques, including ML, statistics, advanced analytics, data mining, forecasting, optimisation, natural language processing, computer vision and others.
Composite AI is increasingly synonymous with data science. That’s because choosing the right AI technology to use is not always straightforward. It requires a deep understanding of the business problem you’re trying to solve and the data available to solve it. This combination of business and technology expertise is the essence of data science.
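As a small illustration of combining techniques, the hypothetical sketch below pairs a simple forecast with statistical anomaly detection, assuming Python with pandas and numpy installed; the sales series and threshold are invented.

```python
# Composite-style sketch: combine a simple forecast with anomaly detection.
# Assumes Python with pandas and numpy; the sales figures are invented.
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
sales = pd.Series(200 + 2 * np.arange(60) + rng.normal(0, 10, 60),
                  index=pd.date_range("2024-01-01", periods=60, freq="D"))
sales.iloc[45] += 120   # inject one unusual day

# Technique 1 (forecasting): a 7-day rolling mean as a naive expected value.
expected = sales.rolling(window=7, min_periods=7).mean().shift(1)

# Technique 2 (anomaly detection): flag days far from the expectation.
residuals = sales - expected
threshold = 3 * residuals.std()
anomalies = sales[residuals.abs() > threshold]

print("Days flagged as unusual:")
print(anomalies)
```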
How data science works – and data science tools
Data science projects involve the use of multiple tools and technologies to derive meaningful information from structured and unstructured data. Here are some of the common practices data scientists use as part of the data science process to transform raw information into business-changing insight.
Data management is the practice of managing data to unlock its potential for an organisation. Managing data effectively requires having a data strategy and reliable methods to access, integrate, cleanse, govern, store and prepare data for analytics.
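Here is a minimal, hypothetical sketch of the cleanse-and-prepare step, assuming Python with pandas installed; the column names and values are invented.

```python
# Toy data-preparation sketch: access, cleanse and prepare a small dataset.
# Assumes Python with pandas installed; column names and values are invented.
import pandas as pd

raw = pd.DataFrame({
    "customer_id": [101, 102, 102, 103, 104],
    "signup_date": ["2024-01-05", "2024-02-11", "2024-02-11", None, "2024-03-02"],
    "country": ["ZA", "za", "ZA", "GB", "gb"],
    "monthly_spend": ["120.5", "80", "80", "95.2", None],
})

prepared = (
    raw.drop_duplicates(subset="customer_id")          # remove duplicate records
       .dropna(subset=["signup_date"])                 # drop rows missing key fields
       .assign(
           signup_date=lambda d: pd.to_datetime(d["signup_date"]),
           country=lambda d: d["country"].str.upper(),  # standardise codes
           monthly_spend=lambda d: pd.to_numeric(d["monthly_spend"]),
       )
)
print(prepared.dtypes)
print(prepared)
```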
Machine learning – a branch of artificial intelligence – automates analytical model building. With unsupervised machine learning, the technology uses methods from neural networks, statistics, operations research and physics to find hidden insights in data without being explicitly programmed where to look or what to conclude.
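As a small, hypothetical example of unsupervised learning, the sketch below groups customers without any labels, assuming Python with numpy and scikit-learn installed; the features are invented.

```python
# Unsupervised learning sketch: discover customer groups without labels.
# Assumes Python with numpy and scikit-learn; the features are invented.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Two invented features per customer: annual spend and visits per month.
low_value = rng.normal([200, 2], [50, 1], size=(100, 2))
high_value = rng.normal([1200, 10], [200, 2], size=(40, 2))
customers = np.vstack([low_value, high_value])

# Scale the features, then let k-means find two groups on its own.
scaled = StandardScaler().fit_transform(customers)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(scaled)

for label in np.unique(kmeans.labels_):
    group = customers[kmeans.labels_ == label]
    print(f"Cluster {label}: {len(group)} customers, "
          f"mean spend {group[:, 0].mean():.0f}")
```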
A neural network is a kind of machine learning inspired by the workings of the human brain. It’s a computing system made up of interconnected units (like neurons) that process information by responding to external inputs, relaying information between each unit.
Deep learning uses huge neural networks with many layers of processing units, taking advantage of advances in computing power and improved training techniques to learn complex patterns in large amounts of data. Common applications include image and speech recognition.
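To make the neural network and deep learning descriptions above more concrete, here is a hypothetical, toy-scale sketch of a small feed-forward network, assuming Python with PyTorch installed; real deep learning models use far larger networks and datasets, but the training loop has the same shape.

```python
# Toy neural-network sketch: a small feed-forward net learning a simple rule.
# Assumes Python with PyTorch installed; the task is invented for illustration.
import torch
from torch import nn

torch.manual_seed(0)

# Invented task: predict whether the sum of two inputs exceeds 1.
X = torch.rand(1000, 2)
y = (X.sum(dim=1) > 1.0).float().unsqueeze(1)

model = nn.Sequential(            # layers of interconnected units
    nn.Linear(2, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 1),
)
loss_fn = nn.BCEWithLogitsLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(200):          # repeatedly adjust weights to reduce error
    optimizer.zero_grad()
    loss = loss_fn(model(X), y)
    loss.backward()
    optimizer.step()

with torch.no_grad():
    accuracy = ((model(X) > 0) == y.bool()).float().mean().item()
print(f"Training accuracy: {accuracy:.2%}")
```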
Computer vision relies on pattern recognition and deep learning to recognise what’s in a picture or video. When machines can process, analyse and understand images, they can capture images or videos in real time and interpret their surroundings.
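A hypothetical sketch of image recognition with a pretrained network follows, assuming Python with a recent torchvision (0.13 or later) and Pillow installed; "photo.jpg" is a placeholder for any local image file.

```python
# Image-recognition sketch using a pretrained network.
# Assumes a recent torchvision and Pillow; "photo.jpg" is a placeholder file name.
import torch
from PIL import Image
from torchvision import models

weights = models.ResNet18_Weights.DEFAULT
model = models.resnet18(weights=weights)
model.eval()

preprocess = weights.transforms()        # resizing and normalisation the model expects
image = Image.open("photo.jpg").convert("RGB")
batch = preprocess(image).unsqueeze(0)   # shape: (1, channels, height, width)

with torch.no_grad():
    probabilities = torch.softmax(model(batch), dim=1)[0]

top_index = int(probabilities.argmax())
label = weights.meta["categories"][top_index]
confidence = float(probabilities[top_index])
print(f"Predicted: {label} ({confidence:.1%})")
```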
Natural language processing is the ability of computers to analyse, understand and generate human language, including speech. The next stage of NLP is natural language interaction, which allows humans to communicate with computers using everyday language to perform tasks.
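Here is a minimal, hypothetical NLP sketch that turns raw text into numeric features and classifies its sentiment, assuming Python with scikit-learn installed; the sentences and labels are invented.

```python
# Toy NLP sketch: turn raw text into features and classify its sentiment.
# Assumes Python with scikit-learn; the sentences and labels are invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

texts = [
    "The service was quick and the staff were friendly",
    "Great product, works exactly as described",
    "Terrible experience, the delivery was late and damaged",
    "I am very unhappy with the support I received",
]
labels = ["positive", "positive", "negative", "negative"]

# TF-IDF turns each sentence into a numeric vector of word weights,
# then a simple classifier learns which words signal which label.
model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(texts, labels)

print(model.predict(["The staff were friendly and helpful"]))
```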
Data visualisation is the presentation of data in a pictorial or graphical format so it can be easily analysed. This is especially important to enable organisations to make business decisions based on the output of data science efforts.
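A brief, hypothetical visualisation sketch, assuming Python with matplotlib installed; the monthly figures are invented.

```python
# Toy visualisation sketch: present monthly revenue as a simple chart.
# Assumes Python with matplotlib installed; the figures are invented.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun"]
revenue = [120, 135, 128, 150, 162, 171]   # invented revenue, in thousands

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(months, revenue, marker="o")
ax.set_title("Monthly revenue")
ax.set_xlabel("Month")
ax.set_ylabel("Revenue (thousands)")
fig.tight_layout()
fig.savefig("monthly_revenue.png")   # or plt.show() for interactive use
```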
Popular programming languages for data science
Just as humans communicate in a wide variety of languages, data scientists work in many programming languages. With hundreds available today, choosing the right one comes down to what you’re trying to accomplish. Here’s a look at some of the top data science programming languages.
SQL is a domain-specific language used in programming and designed for managing data held in a relational database management system (RDBMS) or for stream processing in a relational data stream management system (RDSMS). It is particularly useful in handling structured data, i.e., data incorporating relations among entities and variables.
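As a small, hypothetical example of SQL over structured data, the sketch below uses Python's built-in sqlite3 module so it runs without a separate database server; the table and values are invented.

```python
# SQL sketch: query structured, relational data.
# Uses Python's built-in sqlite3 module, so no separate database is needed;
# the table and values are invented.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("Thabo", 120.50), ("Aisha", 80.00), ("Thabo", 45.25), ("Lena", 210.00)],
)

# Total spend per customer, highest first.
query = """
    SELECT customer, SUM(amount) AS total_spend
    FROM orders
    GROUP BY customer
    ORDER BY total_spend DESC
"""
for customer, total in conn.execute(query):
    print(customer, round(total, 2))
conn.close()
```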
SAS is a programming language trusted by hundreds of thousands of data scientists worldwide. The SAS Viya platform allows you to combine the benefits of every technology system and programming language in your organisation for better analytical model development and deployment. Read how SAS Viya can help turn your modelling melting pot into smarter business decisions.
Data science solutions
SAS Viya data science offerings feature robust data management, visualisation, advanced analytics and model management capabilities to accelerate data science at any organisation.
SAS Visual Data Mining and Machine Learning enables you to solve the most complex analytical problems with a single, integrated, collaborative solution – now with its own automated modelling API.
SAS Visual Analytics provides you with the means to quickly prepare reports interactively, explore your data through visual displays and perform your analyses on a self-service basis.
These solutions and more are powered by SAS Viya, SAS’ market-leading data science platform that runs on a modern, scalable, cloud-enabled architecture.