SAS for Machine Learning and Deep Learning

Interactive programming in a web-based development environment

  • Visual interface for the entire analytical life cycle process.
  • Drag-and-drop interactive interface requires no coding, though coding is an option.
  • Supports automated code creation at each node in the pipeline.
  • Choose best practice templates (basic, intermediate or advanced) to get started quickly with machine learning tasks or take advantage of our automated modeling process.
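  • Interpretability reports such as partial dependence (PD), local interpretable model-agnostic explanations (LIME), individual conditional expectation (ICE) and Kernel SHAP.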
  • Share modeling insights via a PDF report.
  • Explore data from within Model Studio and launch directly into SAS Visual Analytics.
  • Edit models imported from SAS Visual Analytics in Model Studio.
  • View data within each node in Model Studio.
  • Run SAS® Enterprise Miner™ 14.3 batch code within Model Studio.
  • Provides a collaborative environment for easy sharing of data, code snippets, annotations and best practices among different personas.
  • Create, manage and share content and administer content permissions via SAS Drive.
  • The SAS lineage viewer visually displays the relationships between decisions, models and data.
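
As a rough illustration of two of the interpretability reports listed above, the Python sketch below computes ICE curves (the predictions for one observation as a single feature is swept over a grid) and a PD curve (the pointwise average of those curves). It is a minimal sketch of the general technique, not SAS code: the function names, the dictionary row format and the toy model are assumptions made for the example.

```python
# Hypothetical helper names; rows are plain dicts and predict is any callable model.

def ice_curves(predict, rows, feature, grid):
    """One curve per row: predictions as `feature` is swept over `grid`, all else fixed."""
    return [[predict(dict(row, **{feature: value})) for value in grid] for row in rows]


def partial_dependence(predict, rows, feature, grid):
    """Average the ICE curves pointwise to obtain the PD curve."""
    curves = ice_curves(predict, rows, feature, grid)
    return [sum(curve[i] for curve in curves) / len(curves) for i in range(len(grid))]


if __name__ == "__main__":
    # Toy model used only for illustration.
    toy_model = lambda row: 2 * row["x"] + row["z"]
    data = [{"x": 0, "z": 1}, {"x": 5, "z": 3}]
    print(ice_curves(toy_model, data, "x", [0, 1]))
    print(partial_dependence(toy_model, data, "x", [0, 1]))
```

Because the PD curve is only an average, plotting the individual ICE curves alongside it can reveal observations whose response to the feature differs from the overall trend.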

Intelligent automation with human oversight

  • Public API to automate many of the manual, complex modeling steps to build machine learning models – from data wrangling, to feature engineering, to algorithm selection, to deployment.
  • Automatic Feature Engineering node for automatically cleansing, transforming, and selecting features for models.
  • Automatic Modeling node for automatically selecting the best model using a set of optimization and autotuning routines across multiple techniques.
  • Interactively adjust the pruning and splitting of decision tree nodes.
  • Automated data prep suggestions from meta learning.
  • Automated pipeline generation with complete customization capability.

Natural language generation

  • View results in simple language to facilitate understanding of reports, including model assessment and interpretability.

Embedded support for Python & R languages

  • Embed open source code within an analysis, and call open source algorithms within Model Studio.
  • The Open Source Code node in Model Studio is agnostic to Python or R versions.
  • Manage Python models in a common repository within Model Studio.

Deep learning with Python (DLPy)

  • Build deep learning models for image, text, audio and time-series data using Jupyter Notebook.
  • High-level APIs are available on GitHub for:
    • Deep neural networks for tabular data.
    • Image classification and regression.
    • Object detection.
    • RNN-based tasks – text classification, text generation and sequence labeling.
    • RNN-based time-series processing and modeling.
  • Support for predefined network architectures, such as LeNet, VGG, ResNet, DenseNet, Darknet, Inception, ShuffleNet, MobileNet, YOLO, Tiny YOLO, Faster R-CNN and U-Net.
  • Import and export deep learning models in the ONNX format.
  • Use ONNX models to score new data sets in a variety of environments by taking advantage of Analytic Store (ASTORE).

SAS procedures (PROCs) & CAS actions

  • A programming interface (SAS Studio) lets IT staff or developers access a CAS server and load and save data directly from it, and supports local and remote processing on a CAS server.
  • Python, Java, R, Lua and Scala programmers or IT staff can access data and perform basic data manipulation against a CAS server, or execute CAS actions using PROC CAS.
  • CAS actions support interpretability, feature engineering and modeling.
  • Integrate and add the power of SAS to other applications using REST APIs.

Highly scalable, distributed in-memory analytical processing

  • Distributed, in-memory processing of complex analytical calculations on large data sets provides low-latency answers.
  • Analytical tasks are chained together as a single, in-memory job without having to reload the data or write out intermediate results to disks.
  • Concurrent access to the same data in memory by many users improves efficiency.
  • Data and intermediate results are held in memory as long as required, reducing latency.
  • Built-in workload management ensures efficient use of compute resources.
  • Built-in failover management guarantees submitted jobs always finish.
  • Automated I/O disk spillover for improved memory management.

Model development with modern machine learning algorithms

  • Reinforcement learning:
    • Techniques include Fitted Q-Network (FQN) and Deep Q-Network (DQN).
    • FQN can train a model over precollected data points without the need to communicate with the environment.
    • Uses replay memory and target network techniques to decorrelate the non-i.i.d. data points and stabilize the training process.
    • Ability to specify a custom environment for state-action pairs and rewards.
  • Decision forests:
    • Automated ensemble of decision trees to predict a single target.
    • Automated distribution of independent training runs.
    • Supports intelligent autotuning of model parameters.
    • Automated generation of SAS code for production scoring.
  • Gradient boosting:
    • Automated iterative search for the optimal partition of the data in relation to the selected label variable.
    • Automated resampling of input data several times with adjusted weights based on residuals.
    • Automated generation of weighted average for final supervised model.
    • Supports binary, nominal and interval labels.
    • Ability to customize tree training with a variety of options for the number of trees to grow, splitting criteria to apply, depth of subtrees and compute resources.
    • Automated stopping criteria based on validation data scoring to avoid overfitting.
    • Automated generation of SAS code for production scoring.
    • Access lightGBM, a popular open source modeling package.
  • Neural networks:
    • Automated intelligent tuning of the parameter set to identify the optimal model.
    • Supports modeling of count data.
    • Intelligent defaults for most neural network parameters.
    • Ability to customize neural network architecture and weights.
    • Techniques include deep feedforward neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and autoencoders.
    • Ability to use an arbitrary number of hidden layers to support deep learning.
    • Support for different types of layers, such as convolution and pooling.
    • Automatic standardization of input and target variables.
    • Automatic selection and use of a validation data subset.
    • Automatic out-of-bag validation for early stopping to avoid overfitting.
    • Supports intelligent autotuning of model parameters.
    • Automated generation of SAS code for production scoring.
  • Support vector machines:
    • Models binary target labels.
    • Supports linear and polynomial kernels for model training.
    • Ability to include continuous and categorical in/out features.
    • Automated scaling of input features.
    • Ability to apply the interior-point method and the active-set method.
    • Supports data partition for model validation.
    • Supports cross-validation for penalty selection.
    • Automated generation of SAS code for production scoring.
  • Factorization machines:
    • Supports the development of recommender systems based on sparse matrices of user IDs and item ratings.
    • Ability to apply full pairwise-interaction tensor factorization.
    • Includes additional categorical and numerical input features for more accurate models.
    • Supercharge models with timestamps, demographic data and context information.
    • Supports warm restart (update models with new transactions without full retraining).
    • Automated generation of SAS score code for production scoring.
  • Bayesian networks:
    • Learns different Bayesian network structures, including naive, tree-augmented naive (TAN), Bayesian network-augmented naive (BAN), parent-child Bayesian networks and Markov blanket.
    • Performs efficient variable selection through independence tests.
    • Selects the best model automatically from specified parameters.
    • Generates SAS code or an analytics store to score data.
    • Loads data from multiple nodes and performs computations in parallel.
  • Dirichlet Gaussian mixture models (GMM):
    • Can execute clustering in parallel and is highly multithreaded.
    • Performs soft clustering, which provides not only the predicted cluster score but also the probability distribution over the clusters for each observation.
    • Learns the best number of clusters during the clustering process, which is supported by the Dirichlet process.
    • Uses a parallel variational Bayes (VB) method as the model inference method. This method approximates the (intractable) posterior distribution and then iteratively updates the model parameters until it reaches convergence.
  • Semisupervised learning algorithm:
    • Highly distributed and multithreaded.
    • Returns the predicted labels for both the unlabeled data table and the labeled data table.
  • T-distributed stochastic neighbor embedding (t-SNE):
    • Highly distributed and multithreaded.
    • Returns low-dimensional embeddings that are based on a parallel implementation of the t-SNE algorithm.
  • Generative adversarial networks (GANs):
    • Techniques include StyleGANs for image data and GANs for tabular data.
    • Generate synthetic data for deep learning models.
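
The gradient boosting bullets above describe an iterative fit driven by residuals, a weighted combination of trees and stopping based on validation data. The Python sketch below illustrates that idea for an interval target, using one-split trees (stumps) fitted to squared-error residuals. It swaps in plain residual fitting for the weighted resampling the product describes, and every name and default (learning_rate, patience, max_trees) is hypothetical rather than a SAS option.

```python
# Minimal gradient boosting sketch for one numeric input and an interval target.
# Not SAS code; assumes the training inputs contain at least two distinct values.

def fit_stump(xs, residuals):
    """Find the single split on x that minimizes squared error of the residuals."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]


def predict(model, x):
    """Base value plus the shrunken contribution of every stump."""
    base, learning_rate, stumps = model
    total = base
    for threshold, left_mean, right_mean in stumps:
        total += learning_rate * (left_mean if x <= threshold else right_mean)
    return total


def mse(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def fit_boosting(train_x, train_y, valid_x, valid_y,
                 max_trees=200, learning_rate=0.1, patience=5):
    """Add stumps until validation error stops improving for `patience` rounds."""
    base = sum(train_y) / len(train_y)
    stumps = []
    best_error = mse((base, learning_rate, stumps), valid_x, valid_y)
    best_size = 0
    for _ in range(max_trees):
        model = (base, learning_rate, stumps)
        residuals = [y - predict(model, x) for x, y in zip(train_x, train_y)]
        stumps.append(fit_stump(train_x, residuals))
        error = mse((base, learning_rate, stumps), valid_x, valid_y)
        if error < best_error:
            best_error, best_size = error, len(stumps)
        elif len(stumps) - best_size >= patience:
            break  # validation error has stalled: stop to avoid overfitting
    # Keep only the stumps up to the best validation score.
    return (base, learning_rate, stumps[:best_size])
```

The learning rate shrinks each stump's contribution, so many small corrections accumulate into the final prediction, and the validation check decides how many of them to keep.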

Analytical data preparation

  • Feature engineering best practice pipeline includes best transformations.
  • Distributed data management routines provided via a visual front end.
  • Large-scale data exploration and summarization.
  • Cardinality profiling:
    • Large-scale data profiling of input data sources.
    • Intelligent recommendation for variable measurement and role.
  • Sampling:
    • Supports random and stratified sampling, oversampling for rare events and indicator variables for sampled records.
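
The sampling options above can be illustrated with a short Python sketch: stratified sampling draws the same fraction from every stratum and records the outcome in an indicator variable, while oversampling keeps every rare event and pairs it with a random subset of non-events. This is one simple interpretation under stated assumptions, not the product's implementation; the function names, the `sampled` column and the one-to-one event ratio are all hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction (0 < fraction <= 1) from each stratum and flag the rows."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for index, row in enumerate(rows):
        strata[row[key]].append(index)
    chosen = set()
    for indices in strata.values():
        size = max(1, round(len(indices) * fraction))
        chosen.update(rng.sample(indices, size))
    # Indicator variable: 1 if the row was sampled, 0 otherwise.
    return [dict(row, sampled=int(i in chosen)) for i, row in enumerate(rows)]


def oversample_rare(rows, key, rare_value, seed=0):
    """Keep every rare event plus an equal number of randomly chosen non-events."""
    rng = random.Random(seed)
    events = [row for row in rows if row[key] == rare_value]
    others = [row for row in rows if row[key] != rare_value]
    return events + rng.sample(others, min(len(events), len(others)))


if __name__ == "__main__":
    data = [{"id": i, "target": int(i < 10)} for i in range(100)]
    flagged = stratified_sample(data, "target", 0.2)
    print(sum(row["sampled"] for row in flagged), "rows flagged as sampled")
    print(len(oversample_rare(data, "target", 1)), "rows after oversampling")
```

Keeping the indicator on the full table, rather than returning only the sampled rows, lets later steps use the unsampled rows as holdout data.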

Data exploration, feature engineering & dimension reduction

  • T-distributed stochastic neighbor embedding (t-SNE).
  • Feature binning.
  • High-performance imputation of missing feature values using a user-specified value, the mean, the pseudo median or a random nonmissing value.
  • Feature dimension reduction.
  • Large-scale principal components analysis (PCA), including moving windows and robust PCA.
  • Unsupervised learning with cluster analysis and mixed variable clustering.
  • Segment profiles for clustering.
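
Two of the items above, imputation and feature binning, reduce to a few lines of logic. The Python sketch below fills missing values (represented here as None) with the mean, a user-specified value or a random nonmissing value, and assigns values to equal-width bins. It is a minimal illustration with hypothetical function and parameter names; pseudo median imputation is left out, and equal-width binning is only one of several possible binning schemes.

```python
import random
from statistics import mean


def impute(values, method="mean", fill_value=None, seed=0):
    """Replace None entries using the chosen method: 'mean', 'value' or 'random'."""
    present = [v for v in values if v is not None]
    rng = random.Random(seed)
    result = []
    for v in values:
        if v is not None:
            result.append(v)
        elif method == "mean":
            result.append(mean(present))
        elif method == "value":
            result.append(fill_value)
        elif method == "random":
            result.append(rng.choice(present))  # random draw from nonmissing values
        else:
            raise ValueError(f"unknown method: {method}")
    return result


def equal_width_bins(values, n_bins):
    """Assign each value a bin index from 0 to n_bins - 1 using equal-width intervals."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - low) / width), n_bins - 1) for v in values]


if __name__ == "__main__":
    print(impute([1.0, None, 3.0]))
    print(equal_width_bins([0, 5, 10], 2))
```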

Integrated text analytics

  • Supports 33 native languages out of the box:
    • English
    • Arabic
    • Chinese
    • Croatian
    • Czech
    • Danish
    • Dutch
    • Farsi
    • Finnish
    • French
    • German
    • Greek
    • Hebrew
    • Hindi
    • Hungarian
    • Indonesian
    • Italian
    • Japanese
    • Kazakh
    • Korean
    • Norwegian
    • Polish
    • Portuguese
    • Romanian
    • Russian
    • Slovak
    • Slovenian
    • Spanish
    • Swedish
    • Tagalog
    • Turkish
    • Thai
    • Vietnamese
  • Stop lists are automatically included and applied for all languages.
  • Automated parsing, tokenization, part-of-speech tagging and lemmatization.
  • Predefined concepts extract common entities such as names, dates, currency values, measurements, people, places and more.
  • Automated feature extraction with machine-generated topics (singular value decomposition and latent Dirichlet allocation).
  • Supports machine learning and rules-based approaches within a single project.
  • Automatic rule generation with BoolRule.
  • Classify documents more accurately with deep learning (recurrent neural networks).
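
To make the first stages of the text pipeline above concrete, the Python sketch below tokenizes documents, applies a stop list and counts the remaining terms. It is a deliberately small illustration, not the product's parser: the regular expression, the tiny English stop list and the function names are assumptions, and part-of-speech tagging and lemmatization are not attempted.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop list used only for this example.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "in"}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def term_counts(documents, stop_words=STOP_WORDS):
    """Return one term-frequency Counter per document, with stop words removed."""
    return [Counter(t for t in tokenize(doc) if t not in stop_words)
            for doc in documents]


if __name__ == "__main__":
    for counts in term_counts(["The cat and the hat", "A hat is in the box"]):
        print(dict(counts))
```

Term counts of this kind form the document-by-term matrix that topic extraction methods such as singular value decomposition take as input.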

Model assessment

  • Automatically calculates supervised learning model performance statistics.
  • Produces output statistics for interval and categorical targets.
  • Creates a lift table for interval and categorical targets.
  • Creates a ROC table for categorical targets.
  • Creates Event Classification and Nominal Classification charts for supervised learning models with a class target.
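
The ROC and lift tables mentioned above can be sketched in a few lines of Python for a binary target. The ROC table reports the true positive rate and false positive rate at each score cutoff; the lift table ranks observations by score, splits them into bins and compares each bin's event rate with the overall event rate. Function names, the tuple layout and the binning rule are assumptions made for illustration, and interval targets are not covered.

```python
def roc_table(labels, scores, cutoffs):
    """Return (cutoff, true positive rate, false positive rate) rows for 0/1 labels."""
    positives = sum(labels)
    negatives = len(labels) - positives
    table = []
    for cutoff in cutoffs:
        tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= cutoff)
        fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= cutoff)
        table.append((cutoff, tp / positives, fp / negatives))
    return table


def lift_table(labels, scores, n_bins):
    """Rank rows by descending score, bin them, and return (bin number, lift) rows."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: p[0], reverse=True)]
    overall_rate = sum(ranked) / len(ranked)
    size = len(ranked) // n_bins
    table = []
    for b in range(n_bins):
        # The last bin absorbs any remainder when rows do not divide evenly.
        chunk = ranked[b * size:(b + 1) * size] if b < n_bins - 1 else ranked[b * size:]
        table.append((b + 1, (sum(chunk) / len(chunk)) / overall_rate))
    return table


if __name__ == "__main__":
    y = [1, 1, 0, 1, 0, 0, 0, 0]
    p = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
    print(roc_table(y, p, [0.25, 0.5, 0.75]))
    print(lift_table(y, p, 2))
```

A lift above 1 in the top bins means the model concentrates events among its highest-scored observations, which is what the lift chart is meant to show.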

Model scoring

  • Automatically generates SAS DATA step code for model scoring.
  • Applies scoring logic to training data, holdout data and new data.

SAS Viya in-memory engine

  • CAS (SAS Cloud Analytic Services) performs processing in memory and distributes processing across nodes in a cluster.
  • User requests (expressed in a procedural language) are translated into actions with the parameters needed to process in a distributed environment. The result set and messages are passed back to the procedure for further action by the user.
  • Data is managed in blocks and can be loaded into memory on demand.
  • If tables exceed memory capacity, the server caches the blocks on disk. Data and intermediate results are held in memory as long as required, across jobs and user boundaries.
  • Includes highly efficient node-to-node communication. An algorithm determines the optimal number of nodes for a given job.
  • Communication layer supports fault tolerance and lets you add nodes to or remove nodes from a server while it is running. All components can be replicated for high availability.
  • Support for legacy SAS code and direct interoperability with SAS 9.4M6 clients.
  • Supports multitenancy deployment, allowing for a shared software stack to support isolated tenants in a secure manner.