SAS Visual Statistics Features List
Visual data exploration & discovery (available through SAS Visual Analytics)
- Quickly interpret complex relationships or key variables that influence modeling outcomes within large data sets.
- Filter observations and understand a variable’s level of influence on overall model lift.
- Detect outliers and/or influence points to help you determine, capture and remove them from downstream analysis (e.g., models).
- Explore data using bar charts, histograms, box plots, heat maps, bubble plots, geographic maps and more.
- Derive predictive outputs or segmentations that can be used directly in other modeling or visualization tasks. Outputs can be saved and passed to those without model-building roles and capabilities.
- Automatically convert measure variables with two levels to category variables when a data set is first opened.
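The automatic conversion in the last item can be pictured with a short sketch. It assumes a table held as a plain dictionary of columns with `None` marking missing values; the function name `two_level_measures` is hypothetical and is not part of any SAS interface.

```python
def two_level_measures(table):
    """Return names of numeric columns that have exactly two distinct levels.

    `table` maps a column name to a list of values; None marks a missing value.
    This only flags candidates; it does not reproduce the product's own rule.
    """
    candidates = []
    for name, values in table.items():
        levels = {v for v in values if v is not None}
        # Treat only int/float values as measures (bool is excluded on purpose).
        is_numeric = all(
            isinstance(v, (int, float)) and not isinstance(v, bool) for v in levels
        )
        if is_numeric and len(levels) == 2:
            candidates.append(name)
    return candidates


example = {
    "flag": [0, 1, 1, None, 0],             # two numeric levels: candidate
    "age": [31, 45, 27, 45, 60],            # many levels: stays a measure
    "region": ["N", "S", "N", "S", "N"],    # already categorical
}
print(two_level_measures(example))
```

Missing values are ignored when counting levels here, which is an assumption of the sketch rather than documented behavior.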
Visual interface access to analytical techniques
- Clustering:
- K-means, k-modes or k-prototypes clustering.
- Parallel coordinate plots to interactively evaluate cluster membership.
- Scatter plots of inputs with cluster profiles overlaid for small data sets and heat maps with cluster profiles overlaid for large data sets.
- Detailed summary statistics (means of each cluster, number of observations in each cluster, etc.).
- Generate on-demand cluster ID as a new column.
- Supports holdout data (training and validation) for model assessment.
- Decision trees:
- Supports classification and regression trees.
- Based on a modified C4.5 algorithm or cost-complexity pruning.
- Interactively grow and prune a tree. Interactively train a subtree.
- Set tree depth, max branch, leaf size, aggressiveness of tree pruning and more.
- Use tree map displays to interactively navigate the tree structure.
- Generate on-demand leaf ID, predicted values and residuals as new columns.
- Supports holdout data (training and validation) for model assessment.
- Supports pruning with holdout data.
- Supports autotuning with options for leaf size.
- Enables manual modification of splitting points for interactive tree.
- Linear regression:
- Influence statistics.
- Supports forward, backward, stepwise and lasso variable selection.
- Iteration plot for variable selection.
- Frequency and weight variables.
- Residual diagnostics.
- Summary table includes overall ANOVA, model dimensions, fit statistics, model ANOVA, Type III test and parameter estimates.
- Generate on-demand predicted values and residuals as new columns.
- Supports holdout data (training and validation) for model assessment.
- Logistic regression:
- Models for binary data with logit and probit link functions.
- Influence statistics.
- Supports forward, backward, stepwise and lasso variable selection.
- Iteration plot for variable selection.
- Frequency and weight variables.
- Residual diagnostics.
- Summary table includes model dimensions, iteration history, fit statistics, convergence status, Type III tests, parameter estimates and response profile.
- Generate on-demand predicted labels and predicted event probabilities as new columns. Adjust the prediction cutoff to label an observation as event or non-event.
- Supports holdout data (training and validation) for model assessment.
- Generalized linear models:
- Distributions supported include beta, normal, binary, exponential, gamma, geometric, Poisson, Tweedie, inverse Gaussian and negative binomial.
- Supports forward, backward, stepwise and lasso variable selection.
- Offset variable support.
- Frequency and weight variables.
- Residual diagnostics.
- Summary table includes model summary, iteration history, fit statistics, Type III test table and parameter estimates.
- Informative missing option for treatment of missing values on the predictor variable.
- Generate on-demand predicted values and residuals as new columns.
- Supports holdout data (training and validation) for model assessment.
- Generalized additive models:
- Distributions supported include normal, binary, gamma, Poisson, Tweedie, inverse Gaussian and negative binomial.
- Supports one- and two-dimensional spline effects.
- GCV, GACV and UBRE methods for selecting the smoothing effects.
- Offset variable support.
- Frequency and weight variables.
- Residual diagnostics.
- Summary table includes model summary, iteration history, fit statistics and parameter estimates.
- Supports holdout data (training and validation) for model assessment.
- Nonparametric logistic regression:
- Models for binary data with logit, probit, log-log and c-log-log link functions.
- Supports one- and two-dimensional spline effects.
- GCV, GACV and UBRE methods for selecting the smoothing effects.
- Offset variable support.
- Frequency and weight variables.
- Residual diagnostics.
- Summary table includes model summary, iteration history, fit statistics and parameter estimates.
- Supports holdout data (training and validation) for model assessment.
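The link functions named above for binary models (logit, probit, log-log and c-log-log) each map a linear predictor to an event probability. The sketch below writes out the inverse of each link using their usual textbook definitions. The log-log convention used here, g(p) = -log(-log(p)), is an assumption, since conventions differ between packages; all function names are illustrative.

```python
import math


def inv_logit(eta):
    """Inverse of logit: g(p) = log(p / (1 - p))."""
    return 1.0 / (1.0 + math.exp(-eta))


def inv_probit(eta):
    """Inverse of probit: the standard normal CDF, written with erf."""
    return 0.5 * (1.0 + math.erf(eta / math.sqrt(2.0)))


def inv_cloglog(eta):
    """Inverse of complementary log-log: g(p) = log(-log(1 - p))."""
    return 1.0 - math.exp(-math.exp(eta))


def inv_loglog(eta):
    """Inverse of log-log, assuming g(p) = -log(-log(p))."""
    return math.exp(-math.exp(-eta))


INVERSE_LINKS = {
    "logit": inv_logit,
    "probit": inv_probit,
    "cloglog": inv_cloglog,
    "loglog": inv_loglog,
}


def event_probabilities(eta):
    """Event probability implied by each link for one linear predictor value."""
    return {name: f(eta) for name, f in INVERSE_LINKS.items()}


for name, p in event_probabilities(0.0).items():
    print(f"{name}: {p:.4f}")
```

The logit and probit links are symmetric around a probability of 0.5, while the two log-log variants are not, which is why the choice of link can matter when events are rare or very common.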
Programming access to analytical techniques
- Programmers and data scientists can access SAS Viya (CAS server) from SAS Studio using SAS procedures (PROCs) and other tasks.
- Programmers can execute CAS actions using PROC CAS or use different programming environments like Python, R, Lua and Java.
- Users can also access SAS Viya (CAS server) from their own applications using public REST APIs.
- Provides native integration with Python pandas DataFrames. Python programmers can upload DataFrames to CAS and fetch results from CAS as DataFrames to interact with Python packages such as pandas, matplotlib, Plotly and Bokeh.
- Includes SAS/STAT® and SAS/GRAPH® software.
- Principal component analysis (PCA):
- Performs dimension reduction by computing principal components.
- Provides the eigenvalue decomposition, NIPALS and ITERGS algorithms.
- Outputs principal component scores across observations.
- Creates scree plots and pattern profile plots.
- Decision trees:
- Supports classification trees and regression trees.
- Supports categorical and numerical features.
- Provides criteria for splitting nodes based on measures of impurity and statistical tests.
- Provides the cost-complexity and reduced-error methods of pruning trees.
- Supports partitioning of data into training, validation and testing roles.
- Supports the use of validation data for selecting the best subtree.
- Supports the use of test data for assessment of final tree model.
- Provides various methods of handling missing values, including surrogate rules.
- Creates tree diagrams.
- Provides statistics for assessing model fit, including model-based (resubstitution) statistics.
- Computes measures of variable importance.
- Outputs leaf assignments and predicted values for observations.
- Clustering:
- Provides the k-means algorithm for clustering continuous (interval) variables.
- Provides the k-modes algorithm for clustering nominal variables.
- Provides various distance measures for similarity.
- Provides the aligned box criterion method for estimating the number of clusters.
- Outputs cluster membership and distance measures across observations.
- Linear regression:
- Supports linear models with continuous and classification variables.
- Supports various parameterizations for classification effects.
- Supports any degree of interaction and nested effects.
- Supports polynomial and spline effects.
- Supports forward, backward, stepwise, least angle regression and lasso selection methods.
- Supports information criteria and validation methods for controlling model selection.
- Offers selection of individual levels of classification effects.
- Preserves hierarchy among effects.
- Supports partitioning of data into training, validation and testing roles.
- Provides a variety of diagnostic statistics.
- Generates SAS code for production scoring.
- Logistic regression:
- Supports binary and binomial responses.
- Supports various parameterizations for classification effects.
- Supports any degree of interaction and nested effects.
- Supports polynomial and spline effects.
- Supports forward, backward, fast backward and lasso selection methods.
- Supports information criteria and validation methods for controlling model selection.
- Offers selection of individual levels of classification effects.
- Preserves hierarchy among effects.
- Supports partitioning of data into training, validation and testing roles.
- Provides a variety of statistics for model assessment.
- Provides a variety of optimization methods for maximum likelihood estimation.
- Generalized linear models:
- Supports responses with a variety of distributions, including binary, normal, Poisson and gamma.
- Supports various parameterizations for classification effects.
- Supports any degree of interaction and nested effects.
- Supports polynomial and spline effects.
- Supports forward, backward, fast backward, stepwise and group lasso selection methods.
- Supports information criteria and validation methods for controlling model selection.
- Offers selection of individual levels of classification effects.
- Preserves hierarchy among effects.
- Supports partitioning of data into training, validation and testing roles.
- Provides a variety of statistics for model assessment.
- Provides a variety of optimization methods for maximum likelihood estimation.
- Nonlinear regression models:
- Fits nonlinear regression models with standard or general distributions.
- Computes analytical derivatives of user-provided expressions for more robust parameter estimation.
- Evaluates user-provided expressions using the ESTIMATE and PREDICT statements (procedure only).
- Requires a data table that contains the CMP item store if not using PROC NLMOD.
- Estimates parameters using the least squares method.
- Estimates parameters using the maximum likelihood method.
- Quantile regression models:
- Supports quantile regression for single or multiple quantile levels.
- Supports multiple parameterizations for classification effects.
- Supports any degree of interactions (crossed effects) and nested effects.
- Supports hierarchical model selection strategy among effects.
- Provides multiple effect-selection methods.
- Provides effect selection based on a variety of selection criteria.
- Supports stopping and selection rules.
- Predictive partial least squares models:
- Provides programming syntax with classification variables, continuous variables, interactions and nestings.
- Provides effect-construction syntax for polynomial and spline effects.
- Supports partitioning of data into training and testing roles.
- Provides test set validation to choose the number of extracted factors.
- Implements the following methods: principal component regression, reduced rank regression and partial least squares regression.
- Generalized additive models:
- Fits generalized additive models based on low-rank regression splines.
- Estimates the regression parameters by using penalized likelihood estimation.
- Estimates the smoothing parameters by using either the performance iteration method or the outer iteration method.
- Estimates the regression parameters by using maximum likelihood techniques.
- Tests the total contribution of each spline term based on the Wald statistic.
- Provides model-building syntax that can include classification variables, continuous variables, interactions and nestings.
- Enables you to construct a spline term by using multiple variables.
- Proportional hazard regression:
- Fits the Cox proportional hazards regression model to survival data and performs variable selection.
- Provides model-building syntax with classification variables, continuous variables, interactions and nestings.
- Provides effect-construction syntax for polynomial and spline effects.
- Performs maximum partial likelihood estimation, stratified analysis and variable selection.
- Partitions data into training, validation and testing roles.
- Provides weighted analysis and grouped analysis.
- Statistical process control:
- Performs Shewhart control chart analysis.
- Analyzes multiple process variables to identify processes that are out of statistical control.
- Adjusts control limits to compensate for unequal subgroup sizes.
- Estimates control limits from the data, computes control limits from specified values for population parameters (known standards) or reads limits from an input data table.
- Performs tests for special causes based on runs patterns (Western Electric rules).
- Estimates the process standard deviation using various methods (variable charts only).
- Saves chart statistics and control limits in output data tables.
- Independent component analysis:
- Extracts independent components (factors) from multivariate data.
- Maximizes non-Gaussianity of the estimated components.
- Supports whitening and dimension reduction.
- Produces an output data table that contains independent components and whitened variables.
- Implements symmetric decorrelation, which calculates all the independent components simultaneously.
- Implements deflationary decorrelation, which extracts the independent components successively.
- Linear mixed models:
- Supports many covariance structures, including variance components, compound symmetry, unstructured, AR(1), Toeplitz and factor analytic.
- Provides specialized dense and sparse matrix algorithms.
- Supports REML and ML estimation methods, which are implemented with a variety of optimization algorithms.
- Provides inference features, including standard errors and t tests for fixed and random effects.
- Supports repeated measures data.
- Model-based clustering:
- Models the observations by using a mixture of multivariate Gaussian distributions.
- Allows for a noise component and automatic model selection.
- Provides posterior scoring and graphical interpretation of results.
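Posterior scoring in model-based clustering assigns each observation a membership probability for every mixture component, plus the noise component when one is present. The sketch below shows that calculation for a deliberately simplified case: univariate Gaussian components with already-estimated parameters and a uniform noise density. The product works with multivariate Gaussians; the function names and the numbers in the example are assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, sd):
    """Density of a univariate normal distribution."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def posterior_scores(x, components, noise_weight=0.0, noise_density=0.0):
    """Posterior membership probabilities for one observation.

    components: list of (weight, mean, sd) tuples for the Gaussian clusters.
    The last entry of the result is the posterior for the noise component.
    """
    weighted = [w * normal_pdf(x, m, s) for (w, m, s) in components]
    weighted.append(noise_weight * noise_density)
    total = sum(weighted)
    if total == 0.0:
        raise ValueError("observation has zero density under every component")
    return [v / total for v in weighted]


# Two clusters centered at 0 and 10, plus 10% uniform noise on a range of width 40.
clusters = [(0.45, 0.0, 1.0), (0.45, 10.0, 1.0)]
for x in (0.0, 5.0, 10.0):
    scores = posterior_scores(x, clusters, noise_weight=0.1, noise_density=1.0 / 40.0)
    print(x, [round(s, 4) for s in scores])
```

An observation halfway between the two clusters is far from both centers, so most of its posterior mass goes to the noise component. That is the practical purpose of allowing noise: outlying points are not forced into a cluster.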
Descriptive statistics
- Distinct counts to understand cardinality.
- Box plots to evaluate centrality and spread, including outliers for one or more variables.
- Correlations to measure Pearson’s correlation coefficient for a set of variables. Supports grouped and weighted analysis.
- Cross-tabulations, including support for weights.
- Contingency tables, including measures of associations.
- Histograms with options to control binning values, maximum value thresholds, outliers and more.
- Multidimensional summaries in a single pass of the data.
- Percentiles for one or more variables.
- Summary statistics, such as number of observations, number of missing values, sum of nonmissing values, mean, standard deviation, standard errors, corrected and uncorrected sums of squares, min and max, and the coefficient of variation.
- Kernel density estimates using normal, tri-cube and quadratic kernel functions.
- One-way to n-way frequency and cross-tabulation tables.
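As a reference for the summary statistics listed above, the following sketch computes them for a single variable using their standard definitions. It assumes `None` marks a missing value, uses the n - 1 divisor for the standard deviation, and reports the coefficient of variation as a percentage; those choices and the function name `summarize` are assumptions, not a statement of how the product computes them.

```python
import math


def summarize(values):
    """Summary statistics for one variable; None marks a missing value."""
    present = [v for v in values if v is not None]
    n = len(present)
    if n < 2:
        raise ValueError("need at least two nonmissing values")
    total = sum(present)
    mean = total / n
    uss = sum(v * v for v in present)             # uncorrected sum of squares
    css = sum((v - mean) ** 2 for v in present)   # corrected sum of squares
    std = math.sqrt(css / (n - 1))
    return {
        "n": n,
        "nmiss": len(values) - n,
        "sum": total,
        "mean": mean,
        "std": std,
        "stderr": std / math.sqrt(n),
        "uss": uss,
        "css": css,
        "min": min(present),
        "max": max(present),
        # Undefined when the mean is zero, so report None in that case.
        "cv": 100.0 * std / mean if mean != 0 else None,
    }


stats = summarize([2, 4, 4, 4, 5, 5, 7, 9, None])
for key, value in stats.items():
    print(key, value)
```

The corrected and uncorrected sums of squares are related by CSS = USS - n * mean^2, which gives a quick consistency check on any implementation.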
Group-by processing
- Build models, compute and process results on the fly for each group or segment without having to sort or index the data each time.
- Build segment-based models instantly (i.e., stratified modeling) from a decision tree or clustering analysis.
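The idea of building a model per group without first sorting or indexing the data can be illustrated with running sums keyed by group. The sketch below fits a simple least-squares line for each group in a single pass over unsorted rows. It is a conceptual illustration only, not a description of how CAS implements group-by processing, and the name `fit_by_group` is hypothetical.

```python
from collections import defaultdict


def fit_by_group(rows):
    """Fit y = intercept + slope * x for each group in one pass.

    rows: iterable of (group, x, y) tuples in any order.
    Returns {group: (intercept, slope)}.
    """
    # Per-group running sums: n, sum x, sum y, sum x^2, sum x*y.
    sums = defaultdict(lambda: [0, 0.0, 0.0, 0.0, 0.0])
    for group, x, y in rows:
        s = sums[group]
        s[0] += 1
        s[1] += x
        s[2] += y
        s[3] += x * x
        s[4] += x * y

    models = {}
    for group, (n, sx, sy, sxx, sxy) in sums.items():
        denom = n * sxx - sx * sx
        if denom == 0:
            continue  # fewer than two distinct x values: slope is undefined
        slope = (n * sxy - sx * sy) / denom
        intercept = (sy - slope * sx) / n
        models[group] = (intercept, slope)
    return models


# Rows from two segments arrive interleaved; no sort is needed.
rows = [
    ("a", 0, 1), ("b", 0, 0), ("a", 1, 3),
    ("b", 1, -1), ("a", 2, 5), ("b", 2, -2),
]
print(fit_by_group(rows))
```

Because each group only needs its own running sums, the groups could equally come from a leaf ID or cluster ID column produced by an earlier model.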
Model comparison, assessment & scoring
- Generate model comparison summaries, such as lift charts, ROC charts, concordance statistics and misclassification tables for one or more models.
- Interactively slide the prediction cutoff for automatic updating of assessment statistics and classification tables.
- Interactively evaluate lift at different percentiles.
- Export models as SAS DATA step code to integrate models with other applications. Score code is automatically concatenated if a model uses derived outputs from other models (leaf ID, cluster ID, etc.).
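Sliding the prediction cutoff changes which observations are labeled as events, and therefore changes the classification table and every statistic derived from it. The sketch below recomputes the table and the misclassification rate at several cutoffs. It assumes an observation is labeled an event when its predicted probability is at or above the cutoff; that rule, the function names and the data are illustrative assumptions.

```python
def classification_table(probs, labels, cutoff=0.5):
    """Count true/false positives and negatives at a given cutoff.

    probs: predicted event probabilities; labels: 1 for event, 0 for non-event.
    """
    table = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, actual in zip(probs, labels):
        predicted = 1 if p >= cutoff else 0
        if predicted == 1 and actual == 1:
            table["tp"] += 1
        elif predicted == 1 and actual == 0:
            table["fp"] += 1
        elif predicted == 0 and actual == 0:
            table["tn"] += 1
        else:
            table["fn"] += 1
    return table


def misclassification_rate(table):
    """Share of observations whose predicted label differs from the actual one."""
    total = sum(table.values())
    return (table["fp"] + table["fn"]) / total


probs = [0.9, 0.8, 0.6, 0.4, 0.3, 0.1]
labels = [1, 1, 0, 1, 0, 0]
for cutoff in (0.35, 0.5, 0.7):
    t = classification_table(probs, labels, cutoff)
    print(cutoff, t, round(misclassification_rate(t), 3))
```

Lowering the cutoff trades false negatives for false positives, so the cutoff that minimizes the misclassification rate is not necessarily the right one when the two kinds of error carry different costs.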
Model scoring
- Export models as SAS DATA step code to integrate models with other applications.
- Score code is automatically concatenated if a model uses derived outputs from other models (leaf ID, cluster ID, etc.).
SAS Viya in-memory runtime engine
- SAS Cloud Analytic Services (CAS) performs processing in memory and distributes processing across nodes in a cluster.
- User requests (expressed in a procedural language) are translated into actions with necessary parameters to process in a distributed environment. The result set and messages are passed back to the procedure for further action by the user.
- Data is managed in blocks and can be loaded in memory on demand. If tables exceed the memory capacity, the server caches the blocks on disk. Data and intermediate results are held in memory as long as required, across jobs and user boundaries.
- An algorithm determines the optimal number of nodes for a given job.
- A communication layer supports fault tolerance and lets you add nodes to or remove nodes from a server while it is running. All components in the architecture can be replicated for high availability.
- Products can be deployed in multitenant mode, allowing for a shared software stack to support securely isolated tenants.