SAS for Machine Learning and Deep Learning
Interactive programming in a web-based development environment
- Visual interface for the entire analytical life cycle.
- Drag-and-drop interactive interface requires no coding, though coding is an option.
- Supports automated code creation at each node in the pipeline.
- Choose best practice templates (basic, intermediate or advanced) to get started quickly with machine learning tasks or take advantage of our automated modeling process.
- Interpretability reports such as PD, LIME, ICE, and Kernel SHAP.
- Share modeling insights via a PDF report.
- Explore data from within Model Studio and launch directly into SAS Visual Analytics.
- Edit models imported from SAS Visual Analytics in Model Studio.
- View data within each node in Model Studio.
- Run SAS® Enterprise Miner™ 14.3 batch code within Model Studio.
- Provides a collaborative environment for easy sharing of data, code snippets, annotations and best practices among different personas.
- Create, manage and share content and administer content permissions via SAS Drive.
- The SAS lineage viewer visually displays the relationships between decisions, models and data.
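The PD and ICE interpretability reports listed above can be illustrated with a minimal sketch. It assumes a hypothetical scoring function, `toy_model`, standing in for any fitted model, and is not the product's implementation. An ICE curve varies one feature for a single observation while holding the others fixed; the PD curve is the average of the ICE curves.

```python
from statistics import mean

def toy_model(row):
    # Hypothetical scoring function standing in for any fitted model.
    return 2.0 * row["age"] + 0.5 * row["income"] * row["age"]

def ice_curves(model, rows, feature, grid):
    """One curve per observation: vary `feature` over `grid`, hold the rest fixed."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value
            curve.append(model(modified))
        curves.append(curve)
    return curves

def partial_dependence(curves):
    """PD at each grid point is the mean of the ICE curves at that point."""
    return [mean(points) for points in zip(*curves)]

rows = [{"age": 1.0, "income": 2.0}, {"age": 3.0, "income": 4.0}]
grid = [0.0, 1.0, 2.0]
curves = ice_curves(toy_model, rows, "age", grid)
pd_values = partial_dependence(curves)
```

In this toy case the two ICE curves have different slopes because `income` interacts with `age`, which is the kind of heterogeneity a PD curve alone would hide.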
Intelligent automation with human oversight
- Public API to automate many of the manual, complex modeling steps to build machine learning models – from data wrangling, to feature engineering, to algorithm selection, to deployment.
- Automatic Feature Engineering node for automatically cleansing, transforming, and selecting features for models.
- Automatic Modeling node for automatically selecting the best model using a set of optimization and autotuning routines across multiple techniques.
- Interactively adjust the pruning and splitting of decision tree nodes.
- Automated data prep suggestions from meta learning.
- Automated pipeline generation with complete customization capability.
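The idea of autotuning, as in the Automatic Modeling node above, can be sketched as a search over candidate settings scored on validation data. The example below is a plain grid search over `k` for a hypothetical one-dimensional nearest-neighbor classifier. The data, the function names and the search strategy are all assumptions for illustration; the product's own optimization routines are not described in this document.

```python
def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x (1-D feature)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if 2 * votes > k else 0

def validation_accuracy(train, valid, k):
    hits = sum(knn_predict(train, x, k) == y for x, y in valid)
    return hits / len(valid)

def autotune(train, valid, candidates):
    """Return (best_k, best_accuracy); ties go to the smaller k."""
    scored = [(validation_accuracy(train, valid, k), k) for k in candidates]
    best_acc, best_k = max(scored, key=lambda pair: (pair[0], -pair[1]))
    return best_k, best_acc

# Hypothetical data: label 0 on the left, label 1 on the right, one noisy point at 1.5.
train = [(0.0, 0), (1.0, 0), (2.0, 0), (3.0, 1), (4.0, 1), (5.0, 1), (1.5, 1)]
valid = [(0.25, 0), (1.625, 0), (3.5, 1), (4.5, 1)]
best_k, best_acc = autotune(train, valid, [1, 3, 5])
```

With `k = 1` the noisy training point misleads the prediction at 1.625, so the search prefers a larger neighborhood.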
Natural language generation
- View results in simple language to facilitate understanding of reports, including model assessment and interpretability.
Embedded support for Python & R languages
- Embed open source code within an analysis, and call open source algorithms within Model Studio.
- The Open Source Code node in Model Studio is agnostic to Python or R versions.
- Manage Python models in a common repository within Model Studio.
Deep learning with Python (DLPy)
- Build deep learning models for image, text, audio and time-series data using Jupyter Notebook.
- High-level APIs are available on GitHub for:
- Deep neural networks for tabular data.
- Image classification and regression.
- Object detection.
- RNN-based tasks – text classification, text generation and sequence labeling.
- RNN-based time-series processing and modeling.
- Support for predefined network architectures, such as LeNet, VGG, ResNet, DenseNet, Darknet, Inception, ShuffleNet, MobileNet, YOLO, Tiny YOLO, Faster R-CNN and U-Net.
- Import and export deep learning models in the ONNX format.
- Use ONNX models to score new data sets in a variety of environments by taking advantage of Analytic Store (ASTORE).
SAS procedures (PROCs) & CAS actions
- A programming interface (SAS Studio) allows IT or developers to access a CAS server, load and save data directly from a CAS server, and support local and remote processing on a CAS server.
- Python, Java, R, Lua and Scala programmers or IT staff can access data and perform basic data manipulation against a CAS server, or execute CAS actions using PROC CAS.
- CAS actions support for interpretability, feature engineering and modeling.
- Integrate and add the power of SAS to other applications using REST APIs.
Highly scalable, distributed in-memory analytical processing
- Distributed, in-memory processing of complex analytical calculations on large data sets provides low-latency answers.
- Analytical tasks are chained together as a single, in-memory job without having to reload the data or write out intermediate results to disks.
- Concurrent access to the same data in memory by many users improves efficiency.
- Data and intermediate results are held in memory as long as required, reducing latency.
- Built-in workload management ensures efficient use of compute resources.
- Built-in failover management guarantees submitted jobs always finish.
- Automated I/O disk spillover for improved memory management.
Model development with modern machine learning algorithms
- Reinforcement learning:
- Techniques include Fitted Q-Network (FQN) and Deep Q-Network (DQN).
- FQN can train a model over precollected data points without the need to communicate with the environment.
- Uses replay memory and target network techniques to decorrelate the non-i.i.d. data points and stabilize the training process.
- Ability to specify a custom environment for state-action pairs and rewards.
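The reinforcement learning points above (training from precollected data, replay memory, a target network) can be sketched in tabular form. This is a simplified stand-in, not the FQN or DQN implementation: a lookup table replaces the neural network, and the 4-state chain environment, the constants and all names are hypothetical.

```python
import random

GAMMA = 0.9   # discount factor (assumed value)
ALPHA = 0.5   # learning rate (assumed value)

def collect_transitions():
    """Precollected (state, action, reward, next_state, done) tuples from a
    hypothetical 4-state chain: action 1 moves right, action 0 moves left."""
    memory = []
    for state in range(3):
        for action in (0, 1):
            nxt = min(state + 1, 3) if action == 1 else max(state - 1, 0)
            done = nxt == 3
            memory.append((state, action, 1.0 if done else 0.0, nxt, done))
    return memory

def train(memory, steps=2000, batch_size=4, sync_every=20, seed=0):
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in range(4) for a in (0, 1)}
    target = dict(q)  # frozen copy used to compute bootstrapped targets
    for step in range(1, steps + 1):
        # Uniform sampling from replay memory breaks up temporal correlation.
        for s, a, r, s2, done in rng.sample(memory, batch_size):
            bootstrap = 0.0 if done else max(target[(s2, 0)], target[(s2, 1)])
            q[(s, a)] += ALPHA * (r + GAMMA * bootstrap - q[(s, a)])
        if step % sync_every == 0:
            target = dict(q)  # periodic sync keeps the targets stable
    return q

q_values = train(collect_transitions())
policy = {s: max((0, 1), key=lambda a: q_values[(s, a)]) for s in range(3)}
```

Training never calls the environment; it only replays the stored transitions, which is the property the FQN bullet describes.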
- Decision forests:
- Automated ensemble of decision trees to predict a single target.
- Automated distribution of independent training runs.
- Supports intelligent autotuning of model parameters.
- Automated generation of SAS code for production scoring.
- Gradient boosting:
- Automated iterative search for the optimal partition of the data in relation to the selected label variable.
- Automated resampling of input data several times with adjusted weights based on residuals.
- Automated generation of weighted average for final supervised model.
- Supports binary, nominal and interval labels.
- Ability to customize tree training with a variety of options for the number of trees to grow, splitting criteria to apply, depth of subtrees and compute resources.
- Automated stopping criteria based on validation data scoring to avoid overfitting.
- Automated generation of SAS code for production scoring.
- Access LightGBM, a popular open source modeling package.
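The gradient boosting ideas above (iterative fitting driven by residuals, plus a validation-based stopping rule) can be sketched as follows. This is a simplification under stated assumptions: each round fits a one-split regression stump directly to the residuals rather than reweighting resampled data, the feature is one-dimensional, and the data, learning rate and patience setting are hypothetical.

```python
def fit_stump(xs, residuals):
    """Best single-split regression stump on a 1-D feature (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        lmean, rmean = sum(left) / len(left), sum(right) / len(right)
        sse = (sum((r - lmean) ** 2 for r in left)
               + sum((r - rmean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, lmean, rmean)
    _, threshold, lmean, rmean = best
    return lambda x: lmean if x <= threshold else rmean

def predict(base, stumps, rate, x):
    return base + rate * sum(stump(x) for stump in stumps)

def mse(base, stumps, rate, xs, ys):
    return sum((predict(base, stumps, rate, x) - y) ** 2
               for x, y in zip(xs, ys)) / len(xs)

def boost(train_x, train_y, valid_x, valid_y, max_trees=50, rate=0.5, patience=3):
    base = sum(train_y) / len(train_y)
    stumps, best_size = [], 0
    best_err = mse(base, [], rate, valid_x, valid_y)
    for _ in range(max_trees):
        residuals = [y - predict(base, stumps, rate, x)
                     for x, y in zip(train_x, train_y)]
        stumps.append(fit_stump(train_x, residuals))
        err = mse(base, stumps, rate, valid_x, valid_y)
        if err < best_err:
            best_err, best_size = err, len(stumps)
        elif len(stumps) - best_size >= patience:
            break  # validation error has stopped improving
    return base, stumps[:best_size], rate

train_x = [0, 1, 2, 3, 4, 5, 6, 7]
train_y = [1, 1, 1, 1, 5, 5, 5, 5]
valid_x = [0.5, 2.5, 4.5, 6.5]
valid_y = [1, 1, 5, 5]
base, stumps, rate = boost(train_x, train_y, valid_x, valid_y)
```

The model returned keeps only the trees up to the best validation score, so rounds added after that point are discarded.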
- Neural networks:
- Automated intelligent tuning of the parameter set to identify the optimal model.
- Supports modeling of count data.
- Intelligent defaults for most neural network parameters.
- Ability to customize neural network architecture and weights.
- Techniques include deep feedforward neural networks (DNNs), convolutional neural networks (CNNs), recurrent neural networks (RNNs) and autoencoders.
- Ability to use an arbitrary number of hidden layers to support deep learning.
- Support for different types of layers, such as convolution and pooling.
- Automatic standardization of input and target variables.
- Automatic selection and use of a validation data subset.
- Automatic out-of-bag validation for early stopping to avoid overfitting.
- Supports intelligent autotuning of model parameters.
- Automated generation of SAS code for production scoring.
- Support vector machines:
- Models binary target labels.
- Supports linear and polynomial kernels for model training.
- Ability to include continuous and categorical in/out features.
- Automated scaling of input features.
- Ability to apply the interior-point method and the active-set method.
- Supports data partition for model validation.
- Supports cross-validation for penalty selection.
- Automated generation of SAS code for production scoring.
- Factorization machines:
- Supports the development of recommender systems based on sparse matrices of user IDs and item ratings.
- Ability to apply full pairwise-interaction tensor factorization.
- Includes additional categorical and numerical input features for more accurate models.
- Supercharge models with timestamps, demographic data and context information.
- Supports warm restart (update models with new transactions without full retraining).
- Automated generation of SAS score code for production scoring.
- Bayesian networks:
- Learns different Bayesian network structures, including naive, tree-augmented naive (TAN), Bayesian network-augmented naive (BAN), parent-child Bayesian networks and Markov blanket.
- Performs efficient variable selection through independence tests.
- Selects the best model automatically from specified parameters.
- Generates SAS code or an analytics store to score data.
- Loads data from multiple nodes and performs computations in parallel.
- Dirichlet Gaussian mixture models (GMM):
- Can execute clustering in parallel and is highly multithreaded.
- Performs soft clustering, which provides not only the predicted cluster score but also the probability distribution over the clusters for each observation.
- Learns the best number of clusters during the clustering process, which is supported by the Dirichlet process.
- Uses a parallel variational Bayes (VB) method as the model inference method. This method approximates the (intractable) posterior distribution and then iteratively updates the model parameters until it reaches convergence.
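Soft clustering, as described above, returns a probability for every cluster rather than a single label. The sketch below computes those probabilities for a one-dimensional Gaussian mixture whose parameters are already known. It does not perform the Dirichlet process or variational Bayes inference; the weights, means and standard deviations are hypothetical.

```python
import math

def gaussian_pdf(x, mean, std):
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))

def soft_assign(x, weights, means, stds):
    """Return (most likely cluster, probability of each cluster) for x."""
    joint = [w * gaussian_pdf(x, m, s) for w, m, s in zip(weights, means, stds)]
    total = sum(joint)
    probs = [j / total for j in joint]
    return probs.index(max(probs)), probs

weights, means, stds = [0.5, 0.5], [0.0, 4.0], [1.0, 1.0]
cluster_near, probs_near = soft_assign(0.0, weights, means, stds)
cluster_mid, probs_mid = soft_assign(2.0, weights, means, stds)
```

An observation midway between the two components gets roughly equal probabilities, information that a hard cluster label would discard.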
- Semisupervised learning algorithm:
- Highly distributed and multithreaded.
- Returns the predicted labels for both the unlabeled data table and the labeled data table.
- T-distributed stochastic neighbor embedding (t-SNE):
- Highly distributed and multithreaded.
- Returns low-dimensional embeddings that are based on a parallel implementation of the t-SNE algorithm.
- Generative adversarial networks (GANs):
- Techniques include StyleGANs for image data and GANs for tabular data.
- Generate synthetic data for deep learning models.
Analytical data preparation
- Feature engineering best practice pipeline includes best transformations.
- Distributed data management routines provided via a visual front end.
- Large-scale data exploration and summarization.
- Cardinality profiling:
- Large-scale data profiling of input data sources.
- Intelligent recommendation for variable measurement and role.
- Sampling:
- Supports random and stratified sampling, oversampling for rare events and indicator variables for sampled records.
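The sampling options above can be sketched in a few lines. The example assumes a hypothetical table of 100 rows with a binary `event` column; the function names and the rule for oversampling (keep every rare row, downsample the rest) are illustrative choices rather than the product's routines.

```python
import random
from collections import defaultdict

def stratified_sample(rows, key, fraction, seed=0):
    """Sample the same fraction from every stratum and flag sampled rows."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for index, row in enumerate(rows):
        strata[row[key]].append(index)
    chosen = set()
    for indices in strata.values():
        size = max(1, round(fraction * len(indices)))
        chosen.update(rng.sample(indices, size))
    # The `sampled` column is the indicator variable for sampled records.
    return [dict(row, sampled=int(i in chosen)) for i, row in enumerate(rows)]

def oversample_rare(rows, key, rare_value, rare_share, seed=0):
    """Keep all rare-event rows and downsample the others so that rare
    events make up about `rare_share` of the result."""
    rng = random.Random(seed)
    rare = [r for r in rows if r[key] == rare_value]
    common = [r for r in rows if r[key] != rare_value]
    wanted = round(len(rare) * (1 - rare_share) / rare_share)
    return rare + rng.sample(common, min(wanted, len(common)))

rows = [{"id": i, "event": int(i < 10)} for i in range(100)]  # 10% events
flagged = stratified_sample(rows, "event", 0.2)
balanced = oversample_rare(rows, "event", 1, 0.5)
```

Stratifying preserves the 10% event rate in the sample, while the oversampled table raises it to 50%.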
Data exploration, feature engineering & dimension reduction
- T-distributed stochastic neighbor embedding (t-SNE).
- Feature binning.
- High-performance imputation of missing values in features with user-specified values, mean, pseudo median and random value of nonmissing values.
- Feature dimension reduction.
- Large-scale principal components analysis (PCA), including moving windows and robust PCA.
- Unsupervised learning with cluster analysis and mixed variable clustering.
- Segment profiles for clustering.
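Two of the feature engineering steps above, imputation and binning, can be sketched as follows. The example assumes missing values are represented as `None`, uses the ordinary median rather than the pseudo median mentioned above, and bins by equal width; all names are hypothetical.

```python
from statistics import mean, median

def impute(values, strategy="mean", fill=None):
    """Replace None with the mean, the median or a user-specified value."""
    present = [v for v in values if v is not None]
    if strategy == "mean":
        replacement = mean(present)
    elif strategy == "median":
        replacement = median(present)
    elif strategy == "constant":
        replacement = fill
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [replacement if v is None else v for v in values]

def equal_width_bins(values, n_bins):
    """Map each value to a bin index in 0..n_bins-1 using equal-width bins."""
    low, high = min(values), max(values)
    width = (high - low) / n_bins
    if width == 0:
        return [0] * len(values)  # all values identical
    return [min(int((v - low) / width), n_bins - 1) for v in values]

filled_mean = impute([1.0, None, 3.0])
filled_const = impute([1.0, None, 3.0], strategy="constant", fill=0.0)
bins = equal_width_bins([0.0, 2.0, 5.0, 10.0], 2)
```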
Integrated text analytics
- Supports 33 native languages out of the box:
- English
- Arabic
- Chinese
- Croatian
- Czech
- Danish
- Dutch
- Farsi
- Finnish
- French
- German
- Greek
- Hebrew
- Hindi
- Hungarian
- Indonesian
- Italian
- Japanese
- Kazakh
- Korean
- Norwegian
- Polish
- Portuguese
- Romanian
- Russian
- Slovak
- Slovenian
- Spanish
- Swedish
- Tagalog
- Turkish
- Thai
- Vietnamese
- Stop lists are automatically included and applied for all languages.
- Automated parsing, tokenization, part-of-speech tagging and lemmatization.
- Predefined concepts extract common entities such as names, dates, currency values, measurements, people, places and more.
- Automated feature extraction with machine-generated topics (singular value decomposition and latent Dirichlet allocation).
- Supports machine learning and rules-based approaches within a single project.
- Automatic rule generation with BoolRule.
- Classify documents more accurately with deep learning (recurrent neural networks).
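The first parsing steps listed above, tokenization and stop-list filtering, can be sketched as below. The stop list is a tiny hypothetical one, the tokenizer handles only unaccented Latin letters and digits, and no part-of-speech tagging or lemmatization is attempted.

```python
import re
from collections import Counter

STOP_LIST = {"the", "a", "an", "and", "of", "to", "is"}  # hypothetical stop list

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def term_counts(documents, stop_list=STOP_LIST):
    """One term-frequency Counter per document, with stop words removed."""
    return [Counter(t for t in tokenize(doc) if t not in stop_list)
            for doc in documents]

counts = term_counts(["The model scores the data.", "A model of the data"])
```

Term counts of this kind are the usual input to topic extraction methods such as singular value decomposition.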
Model assessment
- Automatically calculates supervised learning model performance statistics.
- Produces output statistics for interval and categorical targets.
- Creates a lift table for interval and categorical targets.
- Creates a ROC table for categorical targets.
- Creates Event Classification and Nominal Classification charts for supervised learning models with a class target.
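The ROC and lift tables mentioned above can be computed from labels and predicted scores as in this minimal sketch for a binary target. The data, the cutoffs, the bin count and the output layout are hypothetical and need not match the product's tables.

```python
def roc_table(labels, scores, cutoffs):
    """True and false positive rates at each score cutoff."""
    positives = sum(labels)
    negatives = len(labels) - positives
    rows = []
    for cutoff in cutoffs:
        tp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 1)
        fp = sum(1 for y, s in zip(labels, scores) if s >= cutoff and y == 0)
        rows.append({"cutoff": cutoff, "tpr": tp / positives, "fpr": fp / negatives})
    return rows

def lift_table(labels, scores, n_bins):
    """Event rate per score-ranked bin, relative to the overall event rate."""
    ranked = [y for _, y in sorted(zip(scores, labels), key=lambda p: -p[0])]
    base_rate = sum(labels) / len(labels)
    size = len(ranked) // n_bins
    table = []
    for b in range(n_bins):
        end = (b + 1) * size if b < n_bins - 1 else len(ranked)
        chunk = ranked[b * size:end]  # last bin absorbs any remainder
        rate = sum(chunk) / len(chunk)
        table.append({"bin": b + 1, "event_rate": rate, "lift": rate / base_rate})
    return table

labels = [1, 1, 0, 1, 0, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
roc = roc_table(labels, scores, [0.5])
lift = lift_table(labels, scores, 2)
```

A lift above 1 in the top bin means the model concentrates events among its highest-scored observations.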
Model scoring
- Automatically generates SAS DATA step code for model scoring.
- Applies scoring logic to training data, holdout data and new data.
SAS Viya in-memory engine
- CAS (SAS Cloud Analytic Services) performs processing in memory and distributes processing across nodes in a cluster.
- User requests (expressed in a procedural language) are translated into actions with the parameters needed to process in a distributed environment. The result set and messages are passed back to the procedure for further action by the user.
- Data is managed in blocks and can be loaded into memory on demand.
- If tables exceed memory capacity, the server caches the blocks on disk. Data and intermediate results are held in memory as long as required, across jobs and user boundaries.
- Includes highly efficient node-to-node communication. An algorithm determines the optimal number of nodes for a given job.
- Communication layer supports fault tolerance and lets you remove or add nodes from a server while it is running. All components can be replicated for high availability.
- Support for legacy SAS code and direct interoperability with SAS 9.4M6 clients.
- Supports multitenancy deployment, allowing for a shared software stack to support isolated tenants in a secure manner.
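The block-based, distributed processing described above can be illustrated with a toy example: a table is split into blocks, each worker summarizes its own block, and the partial results are combined. Threads stand in for cluster nodes here, and the function names, block size and worker count are hypothetical; this does not use or represent the CAS API.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_blocks(values, block_size):
    return [values[i:i + block_size] for i in range(0, len(values), block_size)]

def partial_summary(block):
    """Work done independently on one block: row count and sum."""
    return len(block), sum(block)

def distributed_mean(values, block_size=4, workers=3):
    blocks = split_into_blocks(values, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(partial_summary, blocks))
    # Combine step: only the small per-block summaries are brought together.
    count = sum(n for n, _ in partials)
    total = sum(s for _, s in partials)
    return total / count

result = distributed_mean(list(range(1, 11)))
```

Only the per-block summaries move between workers, which is what keeps communication small compared with moving the data itself.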