## Description

Predictive Analytics shows tech-savvy business managers and data analysts how to use the techniques of predictive analytics to solve practical business problems. It teaches readers the methods, principles, and techniques for conducting predictive analytics projects, from start to finish. The author focuses on best practices, including tips and tricks, that are essential for successful predictive modeling.

The author explains the theory behind the principles of predictive analytics in plain English; readers don't need an extensive background in math and statistics, which makes the book well suited to most tech-savvy business and data analysts. Each of the techniques chapters will begin with a description of the specific technique and how it relates to the overall process model for predictive analytics. The depth of the description will match the complexity of the approach; the intent is to describe each technique in enough depth for a practitioner to understand the effect of the major parameters needed to use the technique effectively and interpret the results.

For example, with decision trees, the primary algorithms (C5, CART and CHAID) will be described in qualitative terms (what are trees, what is a split), how they are similar and different (Gini vs. Entropy vs. chi-square tests), why one might use one technique over another, how one can be fooled by the models built using each algorithm (i.e., their weaknesses), what knobs one can adjust (depth, complexity penalties, priors, costs, etc.), and how to interpret the results. Each of the techniques is illustrated by hands-on examples, either unique to the task or as part of a more comprehensive case study. The companion website will provide all of the data sets used to generate these examples, along with a free trial version of software, so that readers can recreate and explore the examples and case studies. The book concludes with a series of in-depth case studies that apply predictive analytics to common types of business scenarios.
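To give a flavor of the "Gini vs. Entropy" comparison mentioned above, here is a minimal Python sketch of the two impurity measures applied to one candidate split. It is not taken from the book or its companion software; the function names (`gini`, `entropy`, `weighted_impurity`) and the toy labels are assumptions made purely for illustration.

```python
import math
from collections import Counter


def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def entropy(labels):
    """Shannon entropy in bits of the class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((count / n) * math.log2(count / n)
                for count in Counter(labels).values())


def weighted_impurity(left, right, measure):
    """Size-weighted impurity of the two children of a binary split."""
    n = len(left) + len(right)
    return (len(left) / n) * measure(left) + (len(right) / n) * measure(right)


# Toy data (hypothetical): a balanced parent node and one candidate split.
parent = ["yes"] * 5 + ["no"] * 5
left = ["yes"] * 4 + ["no"] * 1
right = ["yes"] * 1 + ["no"] * 4

# A split is attractive when the children are purer than the parent.
gini_gain = gini(parent) - weighted_impurity(left, right, gini)
entropy_gain = entropy(parent) - weighted_impurity(left, right, entropy)
print(f"Gini reduction: {gini_gain:.3f}")
print(f"Entropy reduction: {entropy_gain:.3f}")
```

Both measures are zero for a pure node and largest for an evenly mixed one, so they often rank candidate splits similarly; which measure a given tree algorithm uses by default depends on the implementation.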

## Other

- Author: Dean Abbott
- Edition: 1
- Publication date: 04/2014
- Printing allowed: 2 pages
- Copying allowed: 10 pages
- Format: ePub
- ISBN 13: 9781119030393
- Print ISBN: 9781118727966
- ISBN 10: 1119030390

## Table of Contents

- Front Matter
- Dedication
- About the Author
- About the Technical Editor
- Credits
- Acknowledgments
- Introduction
- How This Book Is Organized
- Who Should Read This Book
- Tools You Will Need
- What's on the Website
- Summary

- CHAPTER 1 Overview of Predictive Analytics
- What Is Analytics?
- What Is Predictive Analytics?
- Figure 1-1: Histogram
- Supervised vs. Unsupervised Learning
- Parametric vs. Non-Parametric Models

- Business Intelligence
- Predictive Analytics vs. Business Intelligence
- Do Predictive Models Just State the Obvious?
- Similarities between Business Intelligence and Predictive Analytics
- Figure 1-2: Timeline for building predictive models

- Predictive Analytics vs. Statistics
- Table 1-1: Statistics vs. Predictive Analytics
- Statistics and Analytics
- Predictive Analytics and Statistics Contrasted

- Predictive Analytics vs. Data Mining
- Who Uses Predictive Analytics?
- Challenges in Using Predictive Analytics
- Obstacles in Management
- Obstacles with Data
- Obstacles with Modeling
- Obstacles in Deployment

- What Educational Background Is Needed to Become a Predictive Modeler?

- CHAPTER 2 Setting Up the Problem
- Predictive Analytics Processing Steps: CRISP-DM
- Table 2-1: CRISP-DM Sequence
- Figure 2-1: The CRISP-DM process model

- Business Understanding
- The Three-Legged Stool
- Business Objectives

- Defining Data for Predictive Modeling
- Defining the Columns as Measures
- Table 2-2: Simple Rectangular Layout of Data
- Table 2-3: Alternative Rectangular Layout of Data
- Table 2-4: Summarized Representation of Visits

- Defining the Unit of Analysis
- Which Unit of Analysis?

- Defining the Target Variable
- Table 2-5: Potential Target Variables
- Temporal Considerations for Target Variable
- Figure 2-2: Timeline for defining target variable

- Defining Measures of Success for Predictive Models
- Success Criteria for Classification
- Success Criteria for Estimation
- Other Customized Success Criteria

- Doing Predictive Modeling Out of Order
- Building Models First
- Early Model Deployment

- Case Study: Recovering Lapsed Donors
- Overview
- Business Objectives
- Data for the Competition
- The Target Variables
- Modeling Objectives
- Model Selection and Evaluation Criteria
- Model Deployment

- Case Study: Fraud Detection
- Overview
- Business Objectives
- Data for the Project
- The Target Variables
- Modeling Objectives
- Model Selection and Evaluation Criteria
- Model Deployment

- Summary

- CHAPTER 3 Data Understanding
- What the Data Looks Like
- Single Variable Summaries
- Mean
- Standard Deviation
- The Normal Distribution
- Figure 3-1: Standard deviation

- Uniform Distribution
- Figure 3-2: Uniform distribution

- Applying Simple Statistics in Data Understanding
- Table 3-1: Summary Statistics for a Subset of Continuous Variables for KDD Cup 1998 Data

- Skewness
- Figure 3-3: Positive and negative skew
- Table 3-2: Skewness Values for Several Variables from KDD Cup 1998

- Kurtosis
- Figure 3-4: Types of skewness
- Figure 3-5: Big-tailed distribution
- Table 3-3: Kurtosis Values for Several Variables from KDD Cup 1998

- Rank-Ordered Statistics
- Table 3-4: Rank-Ordered Metrics
- Table 3-5: Quantile Labels
- Table 3-6: Quartile Measures for Several Variables from KDD Cup 1998

- Categorical Variable Assessment
- Table 3-7: Frequency Counts for Variable RFA_2
- Table 3-8: Frequency Counts for Variable STATE
- Table 3-9: Frequency Count for Variable with One Level
- Table 3-10: Frequency Count Summary Table for Several Variables

- Data Visualization in One Dimension
- Histograms
- Figure 3-6: Standard histogram
- Figure 3-7: Bi-modal distribution
- Figure 3-8: Spike in distribution
- Figure 3-9: Histogram overlaid with Target variable
- Figure 3-10: Box plot of NUMPROM
- Figure 3-11: Box plot with outliers squashing IQR
- Figure 3-12: Box and whiskers plot showing outliers

- Multiple Variable Summaries
- Hidden Value in Variable Interactions: Simpson's Paradox
- Table 3-11: Simpson's Paradox Example, Aggregate Table
- Table 3-12: Simpson's Paradox Example, The Interactions

- The Combinatorial Explosion of Interactions
- Table 3-13: Number of Interactions as Number of Variables Increases

- Correlations
- Spurious Correlations
- Back to Correlations
- Table 3-14: Correlations for Six Variables from the KDD Cup 98 Data

- Crosstabs
- Table 3-15: Crosstab of RFA_2F vs. RFA_2A

- Data Visualization, Two or Higher Dimensions
- Scatterplots
- Figure 3-13: Scatterplot of Band 2 vs. Band 1
- Figure 3-14: NUMPROM vs. CARDPROM scatterplot

- Anscombe's Quartet
- Figure 3-15: Highly skewed data in scatterplot
- Table 3-16: Anscombe's Quartet Data
- Table 3-17: Anscombe's Quartet Statistical Measures
- Figure 3-16: Anscombe's Quartet scatterplots
- Figure 3-17: Data Set 4, four different leverage points

- Scatterplot Matrices
- Figure 3-18: Scatterplot matrix
- Figure 3-19: Parallel coordinates

- Overlaying the Target Variable in Summary
- Figure 3-20: Histogram color coded with target
- Figure 3-21: Box plot for Band 1
- Figure 3-22: Box plot for Band 1 split by a target variable

- Scatterplots in More Than Two Dimensions
- Figure 3-23: Scatterplot with multidimensional overlay

- The Value of Statistical Significance
- Table 3-18: Measures of Significance in Data Understanding

- Pulling It All Together into a Data Audit
- Summary

- CHAPTER 4 Data Preparation
- Variable Cleaning
- Incorrect Values
- Consistency in Data Formats
- Outliers
- Table 4-1: Summary Statistics for MAXRAMNT
- Figure 4-1: Binning of MAXRAMNT

- Multidimensional Outliers
- Missing Values
- Table 4-2: Typical Missing Values Codes
- MCAR, MAR, and MNAR

- Fixing Missing Data
- Listwise and Column Deletion
- Imputation with a Constant
- Mean and Median Imputation for Continuous Variables
- Figure 4-2: Histogram for AGE, including missing values
- Table 4-3: Shrinking Standard Deviation from Mean Imputation

- Imputing with Distributions
- Figure 4-3: Spikes caused by mean imputation
- Table 4-4: Comparison of Standard Deviation for Mean and Random Imputation

- Random Imputation from Own Distributions
- Imputing Missing Values from a Model
- Dummy Variables Indicating Missing Values
- Imputation for Categorical Variables
- How Much Missing Data Is Too Much?
- Software Dependence and Default Imputation

- Feature Creation
- Simple Variable Transformations
- Figure 4-4: Effect of skew on regression models

- Fixing Skew
- Table 4-5: Common Transformations to Reduce Positive Skew
- Table 4-6: Log Unit Conversion
- Figure 4-5: Positively skewed distribution normalized with log transform
- Figure 4-6: Negatively skewed distribution normalized with the power transform
- Table 4-7: Table of Corrective Actions for Positive or Negative Skew

- Binning Continuous Variables
- Figure 4-7: Multi-modal distribution
- Figure 4-8: Binned version of multi-modal variable

- Numeric Variable Scaling
- Table 4-8: Commonly Used Scaling and Normalization Methods
- Table 4-9: Sample Z-Scored Data
- Figure 4-9: AGE versus MAXRAMNT in natural units
- Figure 4-10: AGE versus MAXRAMNT in scaled and normalized units

- Nominal Variable Transformation
- Table 4-10: Exploding Categorical Variables to Dummy Variables

- Ordinal Variable Transformations
- Table 4-11: Thermometer Scale

- Date and Time Variable Features
- ZIP Code Features
- Which Version of a Variable Is Best?
- Table 4-12: Distributions and Possible Corrective Action

- Multidimensional Features
- Domain Experts
- Principal Component Analysis Features
- Clustering Features
- Other Modeling Algorithms
- Tying Records Together
- Time Series Features
- Figure 4-11: Creating features from time series data

- Variable Selection Prior to Modeling
- Removing Irrelevant Variables
- Removing Redundant Variables
- Table 4-13: Correlations for Six Variables from the KDD Cup 98 Data

- Selecting Variables When There Are Too Many
- Table 4-14: A Short List of Single-Variable Selection Techniques
- Table 4-15: Variable Selection Test Using the F-statistic

- Selecting Variable Interactions
- Table 4-16: Number of Two-Way Interaction Combinations

- Sampling
- Figure 4-12: Example of overfitting data
- Partitioning Data
- Table 4-17: Sample Split into Sample Subsets

- The Curse of Dimensionality
- Figure 4-13: Random spread of data in 2D
- Figure 4-14: Increasing number of data points needed to populate data

- Rules of Thumb for Determining Data Size
- Partitioning Data: How Much Data to Include in the Subsets
- Table 4-18: Sample Size of Partitioned Data Based on Rules of Thumb

- Cross-Validation
- Figure 4-15: Cross validation sampling
- Figure 4-16: Errors from cross-validation out-of-sample data
- Table 4-19: Largest Errors in 10-Fold Cross-Validation Models

- Bootstrap Sampling
- Table 4-20: Example of the Bootstrap Sampling Method.

- Temporal Considerations in Sampling
- Figure 4-17: Interpolation of data points
- Figure 4-18: Extrapolation of data points
- Figure 4-19: Sampling for temporal data

- Stratified Sampling
- Figure 4-20: ROC curve for non-stratified sample
- Figure 4-21: Distribution of predicted probabilities
- Table 4-21: Confusion Matrix from 5 Percent Target Rate, Threshold 0.5
- Table 4-22: Confusion Matrix from 5 Percent Target Rate, Threshold 0.05

- Example: Why Normalization Matters for K-Means Clustering
- Table 4-23: Summary Statistics for Four Inputs to K-Means Model
- Figure 4-22: Box plots for four inputs to cluster model
- Table 4-24: Cluster Description by Mean Values of Inputs
- Table 4-25: Summary Statistics of Four Variables after Normalization
- Figure 4-23: Box plots of normalized cluster inputs
- Table 4-26: Cluster Summaries for Normalized Inputs
- Table 4-27: Summary of Clusters Built on Normalized Data with Inputs in Natural Units

- Summary

- CHAPTER 5 Itemsets and Association Rules
- Terminology
- Table 5-1: Sample Supermarket Data
- Condition
- Left-Hand-Side, Antecedent(s)
- Right-Hand-Side, Consequent, Output, Conclusion
- Rule (Item Set)
- Support
- Antecedent Support
- Confidence, Accuracy
- Lift
- Table 5-2: Key Measures of Association Rules for Simple Market Basket Data

- Parameter Settings
- How the Data Is Organized
- Standard Predictive Modeling Data Format
- Transactional Format
- Table 5-3: Sample Supermarket Data in Transactional Format

- Measures of Interesting Rules
- Table 5-4: List of Association Rules Sorted by Confidence
- Table 5-5: List of Association Rules Sorted by Support

- Deploying Association Rules
- Variable Selection
- Interaction Variable Creation

- Problems with Association Rules
- Redundant Rules
- Too Many Rules
- Too Few Rules

- Building Classification Rules from Association Rules
- Table 5-6: Applying Rules in Sequence

- Summary

- CHAPTER 6 Descriptive Modeling
- Data Preparation Issues with Descriptive Modeling
- Principal Component Analysis
- The PCA Algorithm
- Figure 6-1: Nasadata Band1 vs. Band2
- Figure 6-2: Nasadata Band1 vs. Band2 with principal component directions
- Figure 6-3: Scatterplot of Principal Component 2 vs. Principal Component 1
- Table 6-1: Eigenvalues and percent Variance Explained for PCA Computed from Band1 and Band2
- Table 6-2: Eigenvectors for PCA Computed from Band1 and Band2
- Table 6-3: Twelve PCs for Nasadata

- Applying PCA to New Data
- Figure 6-4: Scree plot
- Table 6-4: Top Three Principal Component Values for Ten Data Points

- PCA for Data Interpretation
- Table 6-5: Six Principal Components of Nasadata

- Additional Considerations before Using PCA
- Figure 6-5: Principal components for bi-modal distribution
- Figure 6-6: Histogram of first principal component of bi-modal data

- The Effect of Variable Magnitude on PCA Models
- Table 6-6: Summary Statistics for Nasadata Input Variables
- Table 6-7: Eigenvectors of First Six PCs from Nasadata with Natural Units
- Table 6-8: Principal Component Eigenvalues for Nasadata with Natural Units

- Clustering Algorithms
- Figure 6-7: Distances between points in Scatterplot
- The K-Means Algorithm
- Measuring Cluster Goodness of Fit
- Figure 6-8: K-Means algorithm steps
- Figure 6-9: Scatterplot of data with two obvious groups
- Figure 6-10: Scatterplot of data with one group
- Figure 6-11: Cluster labels for forced two-cluster model

- Selecting Inputs
- Figure 6-12: One standard deviation ring for each of two clusters

- Data Preparation for K-Means
- Data Distributions
- Irrelevant Variables
- When Not to Correct for Distribution Problems

- Selecting the Number of Clusters
- Table 6-9: Sum of Squared Errors vs. Number of Clusters
- Figure 6-13: SSE vs. # clusters to find the “knee”
- The Kinds of Clusters K-Means Finds
- Figure 6-14: Cluster regions from Euclidean distance
- Figure 6-15: Sombrero function

- Other Distance Metrics
- Figure 6-16: 2-, 5-, and 8-cluster models for the sombrero function
- Figure 6-17: Application of 2-, 5-, and 8-cluster models to random data

- Kohonen SOMs
- Figure 6-18: Kohonen Self-Organizing Map, 4 × 4
- Figure 6-19: Kohonen Self-Organizing Map, 6 × 1 Map

- The Kohonen SOM Algorithm
- Figure 6-20: Kohonen Map with dead nodes
- Kohonen Map Parameters

- Visualizing Kohonen Maps
- Figure 6-21: Overlaying mean values of variables on top of the Kohonen Map
- Figure 6-22: 3 × 3 Kohonen Map overlapped on itself

- Similarities with K-Means
- Table 6-10: Comparison of Kohonen SOMs and K-Means

- Summary

- CHAPTER 7 Interpreting Descriptive Models
- Standard Cluster Model Interpretation
- Table 7-1: Variables Included in a 3-Cluster Model
- Table 7-2: Cluster Centers for K-Means 3-Cluster Model
- Figure 7-1: RFA_2F stacked histogram to interpret clusters
- Problems with Interpretation Methods
- Normalized Data
- Within-Cluster Descriptions

- Identifying Key Variables in Forming Cluster Models
- ANOVA
- Table 7-3: ANOVA Interpretation of Cluster Variable Significance

- Hierarchical Clustering
- Decision Trees
- Table 7-4: Decision Tree Rules to Predict Clusters
- Figure 7-2: Visualization of decision tree rules
- Table 7-5: Percentage of Records Matching Rule in Decision Tree

- Irrelevant Variables

- Cluster Prototypes
- Table 7-6: Cluster Prototypes
- Table 7-7: Cluster Prototypes in Natural Units

- Cluster Outliers
- Table 7-8: Cluster Outliers

- Summary

- CHAPTER 8 Predictive Modeling
- Decision Trees
- The Decision Tree Landscape
- Figure 8-1: Simple decision tree built for the iris data
- Figure 8-2: Moderately complex decision tree

- Building Decision Trees
- Figure 8-3: Iris data scatterplot of petal width versus petal length
- Figure 8-4: First two decision tree splits of iris data
- Figure 8-5: Alternative decision tree splits of iris data
- Figure 8-6: Nonlinear decision boundaries for trees

- Decision Tree Splitting Metrics
- Decision Tree Knobs and Options
- Table 8-1: Characteristics of Three Decision Tree Algorithms

- Reweighting Records: Priors
- Reweighting Records: Misclassification Costs
- Table 8-2: Misclassification Cost Matrix for TARGET_B
- Table 8-3: Misclassification Cost Matrix for Nasadata Crop Types
- Figure 8-7: Nasadata decision tree with no misclassification costs
- Figure 8-8: Nasadata decision tree with misclassification costs

- Other Practical Considerations for Decision Trees

- Logistic Regression
- Figure 8-9: Linear decision boundary based on NGIFTALL and LASTGIFT
- Table 8-4: Odds Ratio Values for a Logistic Regression Model
- Figure 8-10: Odds ratio versus model input
- Figure 8-11: Log odds ratio versus model input
- Figure 8-12: Logistic curve
- Interpreting Logistic Regression Models
- Table 8-5: Logistic Regression Model Report
- Table 8-6: Logistic Regression Model after Removing LASTGIFT

- Other Practical Considerations for Logistic Regression
- Interactions
- Figure 8-13: Cross data
- Figure 8-14: Linear separation of cross data
- Figure 8-15: Histogram of interaction variable

- Missing Values
- Dummy Variables
- Table 8-7: Logistic Regression Model with n Dummy Variables
- Table 8-8: Logistic Regression Model with n–1 Dummy Variable

- Multi-Class Classification

- Neural Networks
- Building Blocks: The Neuron
- Figure 8-16: Single neuron
- Figure 8-17: Neural network terminology, part 1
- Figure 8-18: Neural network terminology, part 2

- Neural Network Training
- Figure 8-19: Iterative convergence to the bottom of a quadratic error curve
- Figure 8-20: Local and global minima
- Figure 8-21: Iterating to local and global minima
- Figure 8-22: Local or global minimum?

- The Flexibility of Neural Networks
- Figure 8-23: Sombrero function
- Figure 8-24: Linear classifier for sombrero function
- Figure 8-25: Decision boundaries as neural network learns after 10, 100, 200, 500, 1000, and 5000 epochs

- Neural Network Settings
- Neural Network Pruning
- Interpreting Neural Networks
- Neural Network Decision Boundaries
- Figure 8-26: Neural network decision regions on nasadata

- Other Practical Considerations for Neural Networks

- K-Nearest Neighbor
- The k-NN Learning Algorithm
- Figure 8-27: 1-nearest neighbor solution
- Figure 8-28: 3-NN solution
- Figure 8-29: 1-NN decision regions for nasadata
- Figure 8-30: 3-NN decision regions for nasadata
- Figure 8-31: 7-NN decision regions for nasadata
- Table 8-9: The Number of Nearest Neighbors, K, Scored by AUC

- Distance Metrics for k-NN
- Other Practical Considerations for k-NN
- Distance Metrics
- Table 8-10: Two Inputs from KDD Cup 1998 Data
- Table 8-11: Euclidean Distance between New Data Point and Training Records
- Table 8-12: Two Transformations for Scaling Inputs

- Handling Categorical Variables
- The Curse of Dimensionality
- Weighted Votes

- Naïve Bayes
- Bayes’ Theorem
- Figure 8-32: Conditional probability
- Figure 8-33: Bayes classifier distances

- The Naïve Bayes Classifier
- Interpreting Naïve Bayes Classifiers
- Figure 8-34: Naïve Bayes model for nasadata
- Table 8-13: Naïve Bayes Probabilities for RFA_2F

- Other Practical Considerations for Naïve Bayes

- Regression Models
- Linear Regression
- Figure 8-35: Regression line
- Figure 8-36: Residuals in linear regression
- Figure 8-37: Residuals vs. LASTGIFT
- Linear Regression Assumptions
- Figure 8-38: Linear model fitting nonlinear data
- Figure 8-39: Linear model after transforming nonlinear data

- Variable Selection in Linear Regression
- Figure 8-40: Trading fitting error and complexity

- Interpreting Linear Regression Models
- Table 8-14: Regression Model Coefficients for TARGET_D Model

- Using Linear Regression for Classification
- Figure 8-41: Using linear regression for classification

- Other Regression Algorithms
- Figure 8-42: Linear activation function for regression output nodes

- Summary

- CHAPTER 9 Assessing Predictive Models
- Batch Approach to Model Assessment
- Percent Correct Classification
- Table 9-1: Maximum Lift for Baseline Class Rates
- Table 9-2: Sample Records with Actual and Predicted Class Values
- Confusion Matrices
- Figure 9-1: Confusion Matrix Components
- Table 9-3: Confusion Matrix Measures
- Table 9-4: Comparison Confusion Matrix Metrics for Two Models
- Table 9-5: Confusion Matrix for Model 1
- Table 9-6: Confusion Matrix for Model 2

- Confusion Matrices for Multi-Class Classification
- Table 9-7: Multi-Class Classification Confusion Matrix

- ROC Curves
- Figure 9-2: Sample ROC curve
- Figure 9-3: Comparison of three models

- Rank-Ordered Approach to Model Assessment
- Gains and Lift Charts
- Figure 9-4: Sample gains chart
- Figure 9-5: Perfect gains chart
- Figure 9-6: Sample cumulative lift chart
- Figure 9-7: Sample segment lift chart
- Figure 9-8: Cumulative lift chart for overfit model
- Figure 9-9: Segment lift chart for overfit model

- Custom Model Assessment
- Figure 9-10: Profit chart
- Table 9-8: Custom Cost Function for False Alarms and False Dismissals

- Which Assessment Should Be Used?
- Figure 9-11: Scatterplot of AUC vs. RMS Error

- Assessing Regression Models
- Figure 9-12: R2 for two linear models
- Table 9-9: Batch Metrics for Assessing Regression Models
- Table 9-10: Regression Error Metrics for Four Models
- Table 9-11: Rank-Ordering Regression Models by Decile
- Figure 9-13: Average actual target value by decile

- Summary

- CHAPTER 10 Model Ensembles
- Motivation for Ensembles
- The Wisdom of Crowds
- Figure 10-1: Characteristics of good ensemble decisions

- Bias Variance Tradeoff
- Figure 10-2: Low variance fit to data
- Figure 10-3: Low bias fit to data
- Figure 10-4: Errors on new data

- Bagging
- Table 10-1: Bagging Model AUC
- Figure 10-5: Decision regions for nine bagged trees
- Figure 10-6: Decision region for bagging ensemble
- Figure 10-7: Decision region for actual target variable

- Boosting
- Figure 10-8: AdaBoost reweighting
- Figure 10-9: AdaBoost decision regions for a simple example
- Figure 10-10: Comparison of AUC for individual trees and ensembles
- Figure 10-11: Decision regions for the AdaBoost ensemble

- Improvements to Bagging and Boosting
- Random Forests
- Stochastic Gradient Boosting
- Heterogeneous Ensembles
- Figure 10-12: Heterogeneous ensemble example

- Model Ensembles and Occam's Razor
- Interpreting Model Ensembles
- Table 10-2: AdaBoost Variable Ranking According to ANOVA (F Statistic)
- Figure 10-13: RFA_2F histogram with AdaBoost ensemble overlay
- Table 10-3: Comparison of Variable Ranking for Three Ensemble Methods by ANOVA (F Statistic)

- Summary

- CHAPTER 11 Text Mining
- Motivation for Text Mining
- A Predictive Modeling Approach to Text Mining
- Structured vs. Unstructured Data
- Table 11-1: Short List of Unstructured Data

- Why Text Mining Is Hard
- Text Mining Applications
- Table 11-2: Text Mining Applications

- Data Sources for Text Mining

- Data Preparation Steps
- POS Tagging
- Figure 11-1: Typical text mining preprocessing steps
- Table 11-3: Penn Treebank Parts of Speech (Word Level)

- Tokens
- Stop Word and Punctuation Filters
- Character Length and Number Filters
- Stemming
- Table 11-4: Stemmed Words
- Table 11-5: Comparison of Stemming Algorithms

- Dictionaries
- The Sentiment Polarity Movie Data Set
- Table 11-6: Record Counts and Term Counts after Data Preparation
- Table 11-7: Example of Terms Grouped after Stripping POS

- Text Mining Features
- Table 11-8: Number of Times Terms Appear in Documents
- Term Frequency
- Table 11-9: Boolean TF Values for Five Terms
- Table 11-10: Boolean TF Values for Five Terms after Rolling Up to Document Level
- Table 11-11: Boolean Form of TF Features
- Table 11-12: TF and Log10(1+TF) Values

- Inverse Document Frequency
- Table 11-13: Log Transformed IDF Feature Values

- TF-IDF
- Table 11-14: TF-IDF Values for Five Terms

- Cosine Similarity
- Multi-Word Features: N-Grams
- Reducing Keyword Features
- Grouping Terms

- Modeling with Text Mining Features
- Table 11-15: Five-Cluster Model Results

- Regular Expressions
- Table 11-16: Key Regular Expression Syntax Characters
- Uses of Regular Expressions in Text Mining

- Summary

- CHAPTER 12 Model Deployment
- General Deployment Considerations
- Figure 12-1: CRISP-DM deployment steps
- Deployment Steps
- Figure 12-2: Deployment steps
- Table 12-1: Data Preparation Considerations Prior to Model Scoring
- Where Deployment Occurs
- Table 12-2: Summary of Deployment Location Options
- Deployment in the Predictive Modeling Software
- Deployment in the Predictive Modeling Software Deployment Add-On
- Deployment Using “Headless” Predictive Modeling Software
- Deployment In-Database
- Deployment in the Cloud
- Encoding Models into Other Languages
- Figure 12-3: Simple decision tree to be deployed

- Comparing the Costs

- Post-Processing Model Scores
- Creating Score Segments
- Picking the “Select” Population Directly from Model Score
- Figure 12-4: Histogram with a score cutoff

- Picking the “Select” Population Based on Rank-Ordered Statistics
- Figure 12-5: Gains chart used for creating a select population
- Table 12-3: Model Scores Corresponding to Gain Percent and Lift
- Figure 12-6: Lift equal to 1.5 as the metric for selecting the population to contact
- Figure 12-7: Response Rate equal to 7.5 percent as the metric for selecting the population to contact

- When Should Models Be Rebuilt?
- Figure 12-8: Decaying response rate from model
- Table 12-4: Binomial Significance
- Model Assessment by Comparing Observed vs. Expected

- Sampling Considerations for Rebuilding Models
- Table 12-5: Expected count and response rates from the sampling strategy
- Table 12-6: Sample Sizes for Non-Select Population

- What Is Champion-Challenger?
- Figure 12-9: Champion vs. Challenger average response rate
- Figure 12-10: Champion-Challenger sampling

- Summary

- CHAPTER 13 Case Studies
- Survey Analysis Case Study: Overview
- Business Understanding: Defining the Problem
- Defining the Target Variable
- Table 13-1: Three Questions for Target Variables
- Figure 13-1: Index of Excellence distribution

- Data Understanding
- Table 13-2: Questions and Categories for Model Inputs and Outputs

- Data Preparation
- Missing Data Imputation
- Figure 13-2: Missing value imputation

- Feature Creation and Selection through Factor Analysis
- Table 13-3: Factor Loadings for Five of Six Top Factors
- Table 13-4: Summary of Factors, Questions, and Variance Explained
- Table 13-5: Top Loading Questions for Factor 1, Staff Cares
- Table 13-6: Top Loading Questions for Factor 2, Facilities Clean/Safe
- Table 13-7: Top Loading Questions for Factor 3, Equipment
- Table 13-8: Top Loading Questions for Factor 4, Registration

- Modeling
- Table 13-9: Regression Model with Factors as Inputs
- Table 13-10: Regression Model with Representative Questions as Inputs
- Model Interpretation
- Figure 13-3: Visualization of Drivers of Excellence
- Figure 13-4: Drivers of Excellence, example 2
- Figure 13-5: Drivers of Excellence, example 3
- Figure 13-6: Drivers of Excellence, example 4
- Figure 13-7: Drivers of Excellence, example 5

- Deployment: “What-If” Analysis
- Figure 13-8: What-if scenarios for key questions

- Revisit Models
- Business Understanding
- Data Preparation
- Modeling and Model Interpretation
- Satisfaction Model
- Figure 13-9: Member satisfaction tree
- Table 13-11: Key Variables Included in the Satisfaction Tree
- Table 13-12: Rule Descriptions for the Satisfaction Model
- Table 13-13: Key Terminal Nodes in the Satisfaction Model
- Table 13-14: Key Questions in Top Terminal Nodes

- Recommend to a Friend Model
- Figure 13-10: Recommend to a Friend decision tree
- Table 13-15: Terminal Node Populations for the Recommend to a Friend Model
- Table 13-16: Rule Descriptions for the Recommend to a Friend Model

- Intend to Renew Model
- Figure 13-11: Intend to Renew decision tree
- Table 13-17: Terminal Node Populations for the Intend to Renew Model
- Table 13-18: Rule Descriptions for the Intend to Renew Model

- Summary of Models
- Table 13-19: Key Questions That Differ between Target Variables.

- Deployment
- Summary and Conclusions

- Help Desk Case Study
- Data Understanding: Defining the Data
- Data Preparation
- Problems with the Target Variable
- Feature Creation for Text
- Table 13-20: Data Preparation for Help Desk Text

- Modeling
- Figure 13-12: Typical parts prediction decision tree

- Revisit Business Understanding
- Figure 13-13: Temporal framework for new features
- Modeling and Model Interpretation

- Deployment
- Table 13-21: Sample List of Rules to Fire

- Summary and Conclusions

- Back Matter
- Index

## ABOUT E-BOOKS ON HEIMKAUP.IS

Your bookshelf is your own space, where your books are stored. You can reach your bookshelf anywhere, at any time, from a computer or smart device. Simple and convenient!

**Access your books anywhere**

You can get to all of your (school) e-books in an instant, anywhere and at any time, on your bookshelf. No bag, no e-reader, and no hassle (let alone excess baggage).

**Easy to browse and search**

You can move between pages and chapters however suits you best and jump straight to specific chapters from the table of contents. Search finds words, chapters, or pages in a single click.

**Notes and highlights**

You can highlight passages in different colors and write notes freely in the e-book. You can even see the notes and highlights of classmates and teachers if they allow it. All in one place.

**What do you want to see? / You decide how the page looks**

You adapt the page to your needs. Enlarge or shrink images and text with multi-level zoom to view the page in the way that suits your studies best.

**More benefits**

- You can print pages from the book (within the limits set by the publisher)

- Option to link to other digital and interactive material, such as videos or questions on the content

- Easy to copy and paste content/text for, e.g., homework or essays

- Supports technology that helps students with visual or hearing impairments