Features List
SAS In-Memory Statistics Features
Interactive, in-memory programming
- Performs all mathematical calculations in memory.
- Uses a dynamic group-by processing operation to compute and process results for each group, partition or segment without having to sort or index data each time.
- Provides a new web-based interface, SAS Studio, for SAS programmers.
- Interactive programming language supports submitting statements, retrieving results and then submitting more statements on the fly.
- Chains together analytical tasks as a single in-memory job without having to reload the data or write intermediate results to disk.
- Lets you update source tables with new column transformations and filter rows, and perform group-by processing.
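As an illustration, a minimal interactive PROC IMSTAT session might look like the following sketch. The LASR library name (lasr) and table and column names are hypothetical, and option spellings vary by release:

    proc imstat;
       /* Bind to a table already loaded into the LASR Analytic Server */
       table lasr.transactions;
       /* Group-by summary computed on the fly, with no prior sort or index */
       summary amount / groupby=(region);
    run;
       /* The session stays open: refine and resubmit without reloading data */
       summary amount / groupby=(region channel);
    quit;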
Analytical data preparation
- Access data more efficiently via intelligent partitioning by one or more variables across the cluster. Repartition or unpartition the data at any time.
- Derive new temporary tables, and promote them for use by other analysts.
- Subset, filter, join and promote tables, and add and drop computed columns through data manipulation.
- Use data access patterns and processing methods – such as group filtering, partitioning, data ordering within partitions and delete/undelete operations.
- Define data sources using update, append, set operations, filter, derive column and aggregate statements.
- Perform transformations on input data – missing value imputation, outlier transformation, functional transformation and binning – yielding output tables and/or score rules.
- Export ODS results tables for client-side graphic development.
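A hedged sketch of that preparation flow, with hypothetical table and column names; the _TEMPLAST_ macro variable holds the name of the most recently created temporary table, and statement options vary by release:

    proc imstat;
       table lasr.orders;
       where discount > 0;          /* filter rows for subsequent statements */
       partition region;            /* repartition, producing a temporary table */
    run;
       table lasr.&_templast_;      /* bind to that temporary result */
       promote orders_by_region;    /* promote it for use by other analysts */
    quit;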
Descriptive statistics
- Distinct counts to understand cardinality.
- Box plots to evaluate centrality and spread, including outliers for one or more variables.
- Correlations to measure Pearson's correlation coefficient for a set of variables.
- Crosstabulations, including support for weights.
- Contingency tables, including measures of association.
- Parallel by-group processing.
- Histograms with options to control binning values, maximum value thresholds, outliers and more.
- Multidimensional summaries in a single pass of the data.
- Percentiles for one or more variables.
- Summary statistics, such as number of observations, number of missing values, sum of nonmissing values, mean, standard deviation, standard errors, corrected and uncorrected sums of squares, min and max, and the coefficient of variation.
- Kernel density estimates using normal, tri-cube and quadratic kernel functions.
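Several of these statistics can be requested in one interactive session; a minimal sketch with hypothetical table and column names:

    proc imstat;
       table lasr.customers;
       distinct state;              /* cardinality of a column */
       summary income age;          /* n, nmiss, mean, std dev, min/max, ... */
       percentile income;           /* percentiles for one or more variables */
       crosstab gender * region;    /* crosstabulation */
    quit;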
Statistical algorithms & machine-learning techniques
- Classification trees and regression trees
- Based on the C4.5 algorithm.
- Control over the splitting criterion (information gain, information gain ratio).
- Greedy split search.
- Set tree depth, max branch, leaf size, pruning, missing value handling and more.
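A hedged sketch of an interactive tree fit; the HMEQ table and the option spellings (INPUT=, MAXLEVEL=, LEAFSIZE=) are assumptions for illustration:

    proc imstat;
       table lasr.hmeq;
       /* Classification tree for binary target BAD, with depth
          and leaf-size controls */
       decisiontree bad / input=(debtinc clage ninq delinq)
                          maxlevel=6 leafsize=50;
    quit;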
- Forecasting
- Automatically identifies the statistical characteristics of a time series and selects the appropriate models.
- Handle outliers, structural changes, unequally spaced data and calendar events.
- Model families include smoothing and intermittent demand models.
- Aggregate the time series using the sum or mean.
- Specify and restrict the length of the forecast horizon.
- Supports goal-seeking forecasts: create a data set that contains the projected values you want to achieve, and the forecast action predicts the values of related variables needed to reach that outcome.
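A minimal forecasting sketch; the FORECAST statement shape and the TIMEID= and LEAD= option names are assumptions for illustration:

    proc imstat;
       table lasr.sales_history;
       /* Automatic model selection over a 12-period horizon */
       forecast sales / timeid=date lead=12;
    quit;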
- General linear models
- Continuous responses (targets).
- Support for interval and class effects.
- Specify interaction terms.
- Stepwise model selection.
- WEIGHT and FREQ variables.
- Generalized linear models
- Generalized linear models and the exponential class of models.
- Supported distributions include beta, exponential, gamma, Poisson, inverse Gaussian, negative binomial, Student's t, and Weibull.
- Rich set of link functions, including identity, logit, probit, log, log-log, complementary log-log, reciprocal, power-2, and power.
- Optimization techniques include conjugate gradient, double-dogleg, Nelder-Mead, Newton-Raphson with and without ridging, quasi-Newton, and trust-region.
- Offset variable support.
- Frequency and weighting (FREQ and WEIGHT) variables.
- Logistic regression
- Models for binary or binomial data with logit, log-log and complementary log-log link functions.
- Multiple optimization techniques.
- FREQ and WEIGHT variables.
- Offset variable support.
- Regression
- Suitable for fitting linear, quadratic or cubic polynomial models for each pair of numeric variables.
- BEST option to return the regression models with the highest coefficient of determination.
- Control over the polynomial order.
- Extensive residual diagnostics, influence and leverage analysis.
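These model families all follow the same interactive pattern; a hedged sketch with hypothetical tables and variables (the DIST=, LINK= and OFFSET= option spellings are assumptions):

    proc imstat;
       table lasr.policies;
       /* Generalized linear model: gamma response with a log link */
       genmodel claimcost / input=(age vehage) dist=gamma link=log;
       /* Logistic regression for a binary target, with an offset variable */
       logistic fraud / input=(claimcost age) offset=exposure;
    quit;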
- Random decision and regression forests
- Supports developing random-forest ensemble models.
- Fast, parallelized growth of the trees in a forest.
- Ensembles trees by majority voting.
- Define the bootstrap size; support sampling with and without replacement.
- Compute variable importance based on a set of trees.
- Control over the forest, including number of trees, number of variables to be evaluated at each split point, pruning, splitting criterion and more.
- Control over trees including leaf size, maximum branches, number of bins to evaluate and more.
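A hedged sketch of forest training, assuming the RANDOMWOODS statement; the NTREE=, M= and BOOTSTRAP= option spellings are assumptions for illustration:

    proc imstat;
       table lasr.hmeq;
       /* 100 trees, ~3 candidate variables per split,
          bootstrap sampling with replacement */
       randomwoods bad / input=(debtinc clage ninq delinq)
                         ntree=100 m=3 bootstrap=0.6;
    quit;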
- Neural networks
- Trains feedforward artificial neural networks (ANN).
- Use the trained networks to score data sets and generate SAS DATA step scoring code.
- Activation functions for the neurons on each hidden layer and the target output nodes include IDENTITY, LOGISTIC, EXP, SIN, and TANH.
- Combination functions for the neurons on each hidden layer and the target output nodes include ADD, LINEAR, and RADIAL.
- Multiple optimization techniques.
- Supports optimized, fixed and omitted bias terms.
- Simulated Annealing global optimization.
- Monte Carlo global optimization.
- Error functions to train the network, including GAMMA, NORMAL, POISSON and ENTROPY.
- Multiple hidden layers.
- Impute missing values.
- Weight decay parameter.
- Resumes training using the previously obtained weights as the new starting weights.
- Lower and upper bounds for network weights.
- Train and ensemble multiple networks.
- Supports standardization of variables.
- Weights the prediction errors for each observation during training.
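A hedged sketch of network training; the HIDDEN= and ACT= option names are assumptions for illustration:

    proc imstat;
       table lasr.hmeq;
       /* Feedforward network: one hidden layer of 10 TANH neurons */
       neural bad / input=(debtinc clage ninq) hidden=10 act=tanh;
    quit;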
- Hypergroups
- Provides a number of analytics on data in which values in two or more columns are used to form a graph (vertices and edges).
- Can be used for data cleansing, fraud detection, social network analysis and more.
- Determines graph layout in 2D or 3D space using force-directed algorithms, including graph-partitioning sub-algorithms that improve layouts.
- Structural analysis – identifies hypergroups and strongly connected subgraphs using the Label Propagation algorithm or the Graph Partitioning algorithm.
- Calculates multiple vertex centrality measures based on shortest path and layout.
- Clustering
- k-means clustering.
- Density-based spatial clustering.
- Control over the cluster size, seed, convergence criterion, number of iterations and more.
- Supported distance measures include squared Euclidean, Manhattan, maximum, cosine, Jaccard and Hamming. k-means uses Euclidean distance.
- Generates DATA step scoring code.
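A hedged k-means sketch; the VARS=, NCLUSTERS= and SEED= option names are assumptions for illustration:

    proc imstat;
       table lasr.customers;
       /* k-means with 5 clusters and a fixed seed for reproducibility */
       cluster / vars=(income age tenure) nclusters=5 seed=1234;
    quit;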
- Association rule mining
- Derive frequent itemsets.
- Mine association rules.
- Sequence mining.
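A hedged sketch of association rule mining; the ITEM= and TRAN= option names are assumptions for illustration:

    proc imstat;
       table lasr.baskets;
       /* Frequent itemsets and rules over transaction data:
          ITEM= names the item column, TRAN= the transaction ID */
       arm / item=product tran=order_id;
    quit;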
Model assessment
- Supports generating model comparison summaries – such as lift charts, ROC charts, concordance statistics and misclassification tables – for one or more models.
Model scoring
- Generation of SAS DATA step code.
- Score statement for applying scoring logic to training, holdout and new data.
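The generated DATA step code can then score new data in any SAS session; a minimal sketch with a hypothetical file path and table names:

    /* Apply model score code that was exported as DATA step statements */
    data work.scored;
       set work.new_applicants;
       %include '/models/tree_score.sas';   /* generated scoring statements */
    run;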
Text analytics
- Parsing and stemming.
- Start and stop lists.
- Term and document frequency.
- Matrix factorization (singular value decomposition).
- Entity extraction and resolution.
- Topic projections of documents.
Recommendation system
- Interactive RECOMMEND procedure – all algorithms can be run interactively in-memory.
- Interactively apply a filter to develop recommendations for specific populations.
- Project-based approach supports loading user, item and rating tables into memory.
- Cold-start support for new users without any history, based on weighted averaging.
- Slope One, a fast regression-based model commonly used as a simple benchmark.
- k-nearest neighbors, with similarity measures including cosine, adjusted cosine and Pearson's correlation.
- Matrix factorization with options for loss functions, regularization factors, optimization methods and more.
- Clustering of users and/or items using other attributes, including term frequency and inverse document frequency weights.
- Hybrid or ensemble models.
- Ability to define a holdout set of users and ratings for training and validation evaluation.
- Ability to use the predict action to score one or more new users or an entire table.
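A hedged sketch of the RECOMMEND procedure pattern; the statement shapes and option names below (ADDTABLE, METHOD, PREDICT, K=, NUM=) are assumptions for illustration:

    proc recommend recom=rs.movies;
       /* Register the in-memory ratings table for this project */
       addtable lasr.ratings / recom=rs.movies type=rating
                vars=(userid itemid rating);
       /* Train two methods: Slope One and k-nearest neighbors */
       method slopeone / label='so';
       method knn / label='knn' k=20;
       /* Top-5 recommendations for a single user */
       predict / method='so' num=5 users=('u1001');
    run;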