R: Data Analysis and Visualization
By Brett Lantz, Jaynal Abedin, Hrishi V. Mittal and
About this ebook
Brett Lantz
"Brett Lantz has spent the past 10 years using innovative data methods to understand human behavior. A sociologist by training, he was first enchanted by machine learning while studying a large database of teenagers' social networking website profiles. Since then, he has worked on interdisciplinary studies of cellular telephone calls, medical billing data, and philanthropic activity, among others. When he's not spending time with family, following college sports, or being entertained by his dachshunds, he maintains dataspelunking.com, a website dedicated to sharing knowledge about the search for insight in data."
Book preview
R - Brett Lantz
Table of Contents
R: Data Analysis and Visualization
Meet Your Course Guide
Course Structure
Course journey
The Course Roadmap and Timeline
I. Module 1: Data Analysis with R
1. RefresheR
Navigating the basics
Arithmetic and assignment
Logicals and characters
Flow of control
Getting help in R
Vectors
Subsetting
Vectorized functions
Advanced subsetting
Recycling
Functions
Matrices
Loading data into R
Working with packages
2. The Shape of Data
Univariate data
Frequency distributions
Central tendency
Spread
Populations, samples, and estimation
Probability distributions
Visualization methods
3. Describing Relationships
Multivariate data
Relationships between a categorical and a continuous variable
Relationships between two categorical variables
The relationship between two continuous variables
Covariance
Correlation coefficients
Comparing multiple correlations
Visualization methods
Categorical and continuous variables
Two categorical variables
Two continuous variables
More than two continuous variables
4. Probability
Basic probability
A tale of two interpretations
Sampling from distributions
Parameters
The binomial distribution
The normal distribution
The three-sigma rule and using z-tables
5. Using Data to Reason About the World
Estimating means
The sampling distribution
Interval estimation
How did we get 1.96?
Smaller samples
6. Testing Hypotheses
Null Hypothesis Significance Testing
One and two-tailed tests
When things go wrong
A warning about significance
A warning about p-values
Testing the mean of one sample
Assumptions of the one sample t-test
Testing two means
Don't be fooled!
Assumptions of the independent samples t-test
Testing more than two means
Assumptions of ANOVA
Testing independence of proportions
What if my assumptions are unfounded?
7. Bayesian Methods
The big idea behind Bayesian analysis
Choosing a prior
Who cares about coin flips
Enter MCMC – stage left
Using JAGS and runjags
Fitting distributions the Bayesian way
The Bayesian independent samples t-test
8. Predicting Continuous Variables
Linear models
Simple linear regression
Simple linear regression with a binary predictor
A word of warning
Multiple regression
Regression with a non-binary predictor
Kitchen sink regression
The bias-variance trade-off
Cross-validation
Striking a balance
Linear regression diagnostics
Second Anscombe relationship
Third Anscombe relationship
Fourth Anscombe relationship
Advanced topics
9. Predicting Categorical Variables
k-Nearest Neighbors
Using k-NN in R
Confusion matrices
Limitations of k-NN
Logistic regression
Using logistic regression in R
Decision trees
Random forests
Choosing a classifier
The vertical decision boundary
The diagonal decision boundary
The crescent decision boundary
The circular decision boundary
10. Sources of Data
Relational Databases
Why didn't we just do that in SQL?
Using JSON
XML
Other data formats
Online repositories
11. Dealing with Messy Data
Analysis with missing data
Visualizing missing data
Types of missing data
So which one is it?
Unsophisticated methods for dealing with missing data
Complete case analysis
Pairwise deletion
Mean substitution
Hot deck imputation
Regression imputation
Stochastic regression imputation
Multiple imputation
So how does mice come up with the imputed values?
Methods of imputation
Multiple imputation in practice
Analysis with unsanitized data
Checking for out-of-bounds data
Checking the data type of a column
Checking for unexpected categories
Checking for outliers, entry errors, or unlikely data points
Chaining assertions
Other messiness
OpenRefine
Regular expressions
tidyr
12. Dealing with Large Data
Wait to optimize
Using a bigger and faster machine
Be smart about your code
Allocation of memory
Vectorization
Using optimized packages
Using another R implementation
Use parallelization
Getting started with parallel R
An example of (some) substance
Using Rcpp
Be smarter about your code
13. Reproducibility and Best Practices
R Scripting
RStudio
Running R scripts
An example script
Scripting and reproducibility
R projects
Version control
Communicating results
II. Module 2: R Graphs
1. R Graphics
Base graphics using the default package
Trellis graphs using lattice
Graphs inspired by Grammar of Graphics
2. Basic Graph Functions
Introduction
Creating basic scatter plots
Getting ready
How to do it...
How it works...
There's more...
A note on R's built-in datasets
See also
Creating line graphs
Getting ready
How to do it...
How it works...
There's more...
See also
Creating bar charts
Getting ready
How to do it...
How it works...
There's more...
See also
Creating histograms and density plots
How to do it...
How it works...
There's more...
See also
Creating box plots
Getting ready
How to do it...
How it works...
There's more...
See also
Adjusting x and y axes' limits
How to do it...
How it works...
There's more...
See also
Creating heat maps
How to do it...
How it works...
There's more...
See also
Creating pairs plots
How to do it...
How it works...
There's more...
See also
Creating multiple plot matrix layouts
How to do it...
How it works...
There's more...
See also
Adding and formatting legends
Getting ready
How to do it...
How it works...
There's more...
See also
Creating graphs with maps
Getting ready
How to do it...
How it works...
There's more...
See also
Saving and exporting graphs
How to do it...
How it works...
There's more...
See also
3. Beyond the Basics – Adjusting Key Parameters
Introduction
Setting colors of points, lines, and bars
Getting ready
How to do it...
How it works...
There's more...
See also
Setting plot background colors
Getting ready
How to do it...
How it works...
There's more...
Setting colors for text elements – axis annotations, labels, plot titles, and legends
Getting ready
How to do it...
How it works...
There's more...
Choosing color combinations and palettes
Getting ready
How to do it...
How it works...
There's more...
See also
Setting fonts for annotations and titles
Getting ready
How to do it...
How it works...
There's more...
See also
Choosing plotting point symbol styles and sizes
Getting ready
How to do it...
How it works...
There's more...
See also
Choosing line styles and width
Getting ready
How to do it...
How it works...
See also
Choosing box styles
Getting ready
How to do it...
How it works...
There's more...
Adjusting axis annotations and tick marks
Getting ready
How to do it...
How it works...
There's more...
See also
Formatting log axes
Getting ready
How to do it...
How it works...
There's more...
Setting graph margins and dimensions
Getting ready
How to do it...
How it works...
See also
4. Creating Scatter Plots
Introduction
Grouping data points within a scatter plot
Getting ready
How to do it...
How it works...
There's more...
See also
Highlighting grouped data points by size and symbol type
Getting ready
How to do it...
How it works...
Labeling data points
Getting ready
How to do it...
How it works...
There's more...
Correlation matrix using pairs plots
Getting ready
How to do it...
How it works...
Adding error bars
Getting ready
How to do it...
How it works...
There's more...
Using jitter to distinguish closely packed data points
Getting ready
How to do it...
How it works...
Adding linear model lines
Getting ready
How to do it...
How it works...
Adding nonlinear model curves
Getting ready
How to do it...
How it works...
Adding nonparametric model curves with lowess
Getting ready
How to do it...
How it works...
Creating three-dimensional scatter plots
Getting ready
How to do it...
How it works...
There's more...
Creating Quantile-Quantile plots
Getting ready
How to do it...
How it works...
There's more...
Displaying the data density on axes
Getting ready
How to do it...
How it works...
There's more...
Creating scatter plots with a smoothed density representation
Getting ready
How to do it...
How it works...
There's more...
5. Creating Line Graphs and Time Series Charts
Introduction
Adding customized legends for multiple-line graphs
Getting ready
How to do it...
How it works...
There's more...
See also
Using margin labels instead of legends for multiple-line graphs
Getting ready
How to do it...
How it works...
There's more...
Adding horizontal and vertical grid lines
Getting ready
How to do it...
How it works...
There's more...
See also
Adding marker lines at specific x and y values using abline
Getting ready
How to do it...
How it works...
There's more...
Creating sparklines
Getting ready
How to do it...
How it works...
Plotting functions of a variable in a dataset
Getting ready
How to do it...
How it works...
There's more...
Formatting time series data for plotting
Getting ready
How to do it...
How it works...
There's more...
Plotting the date or time variable on the x axis
Getting ready
How to do it...
How it works...
There's more...
Annotating axis labels in different human-readable time formats
Getting ready
How to do it...
How it works...
There's more...
Adding vertical markers to indicate specific time events
Getting ready
How to do it...
How it works...
There's more...
Plotting data with varying time-averaging periods
Getting ready
How to do it...
How it works...
Creating stock charts
Getting ready
How to do it...
How it works...
There's more...
6. Creating Bar, Dot, and Pie Charts
Introduction
Creating bar charts with more than one factor variable
Getting ready
How to do it...
How it works...
See also
Creating stacked bar charts
Getting ready
How to do it...
How it works...
There's more...
Adjusting the orientation of bars – horizontal and vertical
Getting ready
How to do it...
How it works...
There's more...
Adjusting bar widths, spacing, colors, and borders
Getting ready
How to do it...
How it works...
There's more...
Displaying values on top of or next to the bars
Getting ready
How to do it...
How it works...
There's more...
See also
Placing labels inside bars
Getting ready
How to do it...
How it works...
There's more...
Creating bar charts with vertical error bars
Getting ready
How to do it...
How it works...
There's more...
Modifying dot charts by grouping variables
Getting ready
How to do it...
How it works...
Making better, readable pie charts with clockwise-ordered slices
Getting ready
How to do it...
How it works...
See also
Labeling a pie chart with percentage values for each slice
Getting ready
How it works...
There's more...
See also
Adding a legend to a pie chart
Getting ready
How to do it...
How it works...
There's more...
7. Creating Histograms
Introduction
Visualizing distributions as count frequencies or probability densities
Getting ready
How to do it...
How it works...
There's more
Setting the bin size and the number of breaks
Getting ready
How to do it...
How it works...
There's more
Adjusting histogram styles – bar colors, borders, and axes
Getting ready
How to do it...
How it works...
There's more
Overlaying a density line over a histogram
Getting ready
How to do it...
How it works...
Multiple histograms along the diagonal of a pairs plot
Getting ready
How to do it...
How it works...
Histograms in the margins of line and scatter plots
Getting ready
How to do it...
How it works...
8. Box and Whisker Plots
Introduction
Creating box plots with narrow boxes for a small number of variables
Getting ready
How to do it...
How it works...
There's more
See also
Grouping over a variable
Getting ready
How to do it...
How it works...
There's more
See also
Varying box widths by the number of observations
Getting ready
How to do it...
How it works...
Creating box plots with notches
Getting ready
How to do it...
How it works...
There's more
Including or excluding outliers
Getting ready
How to do it...
How it works...
See also
Creating horizontal box plots
Getting ready
How to do it...
How it works...
Changing the box styling
Getting ready
How to do it...
How it works...
There's more
Adjusting the extent of plot whiskers outside the box
Getting ready
How to do it...
How it works...
There's more
Showing the number of observations
Getting ready
How to do it...
How it works...
There's more
Splitting a variable at arbitrary values into subsets
Getting ready
How to do it...
How it works...
There's more
9. Creating Heat Maps and Contour Plots
Introduction
Creating heat maps of a single Z variable with a scale
Getting ready
How to do it...
How it works...
There's more
See also
Creating correlation heat maps
Getting ready
How to do it...
How it works...
There's more
Summarizing multivariate data in a single heat map
Getting ready
How to do it...
How it works...
There's more
Creating contour plots
Getting ready
How to do it...
How it works...
There's more
See also
Creating filled contour plots
Getting ready
How to do it...
How it works...
There's more
See also
Creating three-dimensional surface plots
Getting ready
How to do it...
How it works...
There's more
Visualizing time series as calendar heat maps
Getting ready
How to do it...
How it works...
There's more
10. Creating Maps
Introduction
Plotting global data by countries on a world map
Getting ready
How to do it...
How it works...
There's more
See also
Creating graphs with regional maps
Getting ready
How to do it...
How it works...
There's more
Plotting data on Google maps
Getting ready
How to do it...
How it works...
There's more
See also
Creating and reading KML data
Getting ready
How to do it...
How it works...
See also
Working with ESRI shapefiles
Getting ready
How to do it...
How it works...
There's more
11. Data Visualization Using Lattice
Introduction
Creating bar charts
Getting ready
How to do it…
How it works…
There's more…
See also
Creating stacked bar charts
Getting ready
How to do it…
How it works…
There's more…
See also
Creating bar charts to visualize cross-tabulation
Getting ready
How to do it…
How it works…
There's more…
Creating a conditional histogram
Getting ready
How to do it…
How it works…
There's more…
See also
Visualizing distributions through a kernel-density plot
Getting ready
How to do it…
How it works…
There's more…
Creating a normal Q-Q plot
Getting ready
How to do it…
How it works…
There's more…
Visualizing an empirical Cumulative Distribution Function
Getting ready
How to do it…
How it works…
There's more…
Creating a boxplot
Getting ready
How to do it…
How it works…
There's more…
Creating a conditional scatter plot
Getting ready
How to do it…
How it works…
There's more…
12. Data Visualization Using ggplot2
Introduction
Creating bar charts
Getting ready
How to do it…
How it works…
There's more…
See also
Creating multiple bar charts
Getting ready
How to do it…
How it works…
There's more…
See also
Creating a bar chart with error bars
Getting ready
How to do it…
How it works…
There's more…
Visualizing the density of a numeric variable
Getting ready
How to do it...
How it works…
There's more...
Creating a box plot
Getting ready
How to do it...
How it works…
Creating a layered plot with a scatter plot and fitted line
Getting ready
How to do it...
How it works…
There's more...
Creating a line chart
Getting ready
How to do it...
How it works…
There's more...
Graph annotation with ggplot
Getting ready
How to do it...
How it works...
13. Inspecting Large Datasets
Introduction
Multivariate continuous data visualization
Getting ready
How to do it…
How it works…
There's more…
See also
Multivariate categorical data visualization
Getting ready
How to do it…
How it works…
There's more…
Visualizing mixed data
Getting ready
How to do it…
Zooming and filtering
Getting ready
How to do it...
How it works…
There's more...
14. Three-dimensional Visualizations
Introduction
Three-dimensional scatter plots
Getting ready
How to do it…
How it works…
There's more…
See also...
Three-dimensional scatter plots with a regression plane
Getting ready
How to do it…
How it works…
There's more…
Three-dimensional bar charts
Getting ready
How to do it…
How it works…
Three-dimensional density plots
Getting ready
How to do it...
How it works…
15. Finalizing Graphs for Publications and Presentations
Introduction
Exporting graphs in high-resolution image formats – PNG, JPEG, BMP, and TIFF
Getting ready
How to do it...
How it works...
There's more
See also
Exporting graphs in vector formats – SVG, PDF, and PS
Getting ready
How to do it...
How it works...
There's more
Adding mathematical and scientific notations (typesetting)
Getting ready
How to do it...
How it works...
There's more
Adding text descriptions to graphs
Getting ready
How to do it...
How it works...
There's more
Using graph templates
Getting ready
How to do it...
How it works...
There's more
Choosing font families and styles under Windows, Mac OS X, and Linux
Getting ready
How to do it...
How it works...
There's more
See also
Choosing fonts for PostScripts and PDFs
Getting ready
How to do it...
How it works...
There's more
III. Module 3: Learning Data Mining with R
1. Warming Up
Big data
Scalability and efficiency
Data source
Data mining
Feature extraction
Summarization
The data mining process
CRISP-DM
SEMMA
Social network mining
Social network
Text mining
Information retrieval and text mining
Mining text for prediction
Web data mining
Why R?
What are the disadvantages of R?
Statistics
Statistics and data mining
Statistics and machine learning
Statistics and R
The limitations of statistics on data mining
Machine learning
Approaches to machine learning
Machine learning architecture
Data attributes and description
Numeric attributes
Categorical attributes
Data description
Data measuring
Data cleaning
Missing values
Junk, noisy data, or outlier
Data integration
Data dimension reduction
Eigenvalues and Eigenvectors
Principal-Component Analysis
Singular-value decomposition
CUR decomposition
Data transformation and discretization
Data transformation
Normalization data transformation methods
Data discretization
Visualization of results
Visualization with R
2. Mining Frequent Patterns, Associations, and Correlations
An overview of associations and patterns
Patterns and pattern discovery
The frequent itemset
The frequent subsequence
The frequent substructures
Relationship or rules discovery
Association rules
Correlation rules
Market basket analysis
The market basket model
A-Priori algorithms
Input data characteristics and data structure
The A-Priori algorithm
The R implementation
A-Priori algorithm variants
The Eclat algorithm
The R implementation
The FP-growth algorithm
Input data characteristics and data structure
The FP-growth algorithm
The R implementation
The GenMax algorithm with maximal frequent itemsets
The R implementation
The Charm algorithm with closed frequent itemsets
The R implementation
The algorithm to generate association rules
The R implementation
Hybrid association rules mining
Mining multilevel and multidimensional association rules
Constraint-based frequent pattern mining
Mining sequence dataset
Sequence dataset
The GSP algorithm
The R implementation
The SPADE algorithm
The R implementation
Rule generation from sequential patterns
High-performance algorithms
3. Classification
Classification
Generic decision tree induction
Attribute selection measures
Tree pruning
General algorithm for the decision tree generation
The R implementation
High-value credit card customers classification using ID3
The ID3 algorithm
The R implementation
Web attack detection
High-value credit card customers classification
Web spam detection using C4.5
The C4.5 algorithm
The R implementation
A parallel version with MapReduce
Web spam detection
Web key resource page judgment using CART
The CART algorithm
The R implementation
Web key resource page judgment
Trojan traffic identification method and Bayes classification
Estimating
Prior probability estimation
Likelihood estimation
The Bayes classification
The R implementation
Trojan traffic identification method
Identify spam e-mail and Naïve Bayes classification
The Naïve Bayes classification
The R implementation
Identify spam e-mail
Rule-based classification of player types in computer games and rule-based classification
Transformation from decision tree to decision rules
Rule-based classification
Sequential covering algorithm
The RIPPER algorithm
The R implementation
Rule-based classification of player types in computer games
4. Advanced Classification
Ensemble (EM) methods
The bagging algorithm
The boosting and AdaBoost algorithms
The Random forests algorithm
The R implementation
Parallel version with MapReduce
Biological traits and the Bayesian belief network
The Bayesian belief network (BBN) algorithm
The R implementation
Biological traits
Protein classification and the k-Nearest Neighbors algorithm
The kNN algorithm
The R implementation
Document retrieval and Support Vector Machine
The SVM algorithm
The R implementation
Parallel version with MapReduce
Document retrieval
Classification using frequent patterns
The associative classification
CBA
Discriminative frequent pattern-based classification
The R implementation
Text classification using sentential frequent itemsets
Classification using the backpropagation algorithm
The BP algorithm
The R implementation
Parallel version with MapReduce
5. Cluster Analysis
Search engines and the k-means algorithm
The k-means clustering algorithm
The kernel k-means algorithm
The k-modes algorithm
The R implementation
Parallel version with MapReduce
Search engine and web page clustering
Automatic abstraction of document texts and the k-medoids algorithm
The PAM algorithm
The R implementation
Automatic abstraction and summarization of document text
The CLARA algorithm
The CLARA algorithm
The R implementation
CLARANS
The CLARANS algorithm
The R implementation
Unsupervised image categorization and affinity propagation clustering
Affinity propagation clustering
The R implementation
Unsupervised image categorization
The spectral clustering algorithm
The R implementation
News categorization and hierarchical clustering
Agglomerative hierarchical clustering
The BIRCH algorithm
The chameleon algorithm
The Bayesian hierarchical clustering algorithm
The probabilistic hierarchical clustering algorithm
The R implementation
News categorization
6. Advanced Cluster Analysis
Customer categorization analysis of e-commerce and DBSCAN
The DBSCAN algorithm
Customer categorization analysis of e-commerce
Clustering web pages and OPTICS
The OPTICS algorithm
The R implementation
Clustering web pages
Visitor analysis in the browser cache and DENCLUE
The DENCLUE algorithm
The R implementation
Visitor analysis in the browser cache
Recommendation system and STING
The STING algorithm
The R implementation
Recommendation systems
Web sentiment analysis and CLIQUE
The CLIQUE algorithm
The R implementation
Web sentiment analysis
Opinion mining and WAVE clustering
The WAVE cluster algorithm
The R implementation
Opinion mining
User search intent and the EM algorithm
The EM algorithm
The R implementation
The user search intent
Customer purchase data analysis and clustering high-dimensional data
The MAFIA algorithm
The SURFING algorithm
The R implementation
Customer purchase data analysis
SNS and clustering graph and network data
The SCAN algorithm
The R implementation
Social networking service (SNS)
7. Outlier Detection
Credit card fraud detection and statistical methods
The likelihood-based outlier detection algorithm
The R implementation
Credit card fraud detection
Activity monitoring – the detection of fraud involving mobile phones and proximity-based methods
The NL algorithm
The FindAllOutsM algorithm
The FindAllOutsD algorithm
The distance-based algorithm
The Dolphin algorithm
The R implementation
Activity monitoring and the detection of mobile fraud
Intrusion detection and density-based methods
The OPTICS-OF algorithm
The High Contrast Subspace algorithm
The R implementation
Intrusion detection
Intrusion detection and clustering-based methods
Hierarchical clustering to detect outliers
The k-means-based algorithm
The ODIN algorithm
The R implementation
Monitoring the performance of the web server and classification-based methods
The OCSVM algorithm
The one-class nearest neighbor algorithm
The R implementation
Monitoring the performance of the web server
Detecting novelty in text, topic detection, and mining contextual outliers
The conditional anomaly detection (CAD) algorithm
The R implementation
Detecting novelty in text and topic detection
Collective outliers on spatial data
The route outlier detection (ROD) algorithm
The R implementation
Characteristics of collective outliers
Outlier detection in high-dimensional data
The brute-force algorithm
The HilOut algorithm
The R implementation
8. Mining Stream, Time-series, and Sequence Data
The credit card transaction flow and STREAM algorithm
The STREAM algorithm
The single-pass-any-time clustering algorithm
The R implementation
The credit card transaction flow
Predicting future prices and time-series analysis
The ARIMA algorithm
Predicting future prices
Stock market data and time-series clustering and classification
The hError algorithm
Time-series classification with the 1NN classifier
The R implementation
Stock market data
Web click streams and mining symbolic sequences
The TECNO-STREAMS algorithm
The R implementation
Web click streams
Mining sequence patterns in transactional databases
The PrefixSpan algorithm
The R implementation
9. Graph Mining and Network Analysis
Graph mining
Graph
Graph mining algorithms
Mining frequent subgraph patterns
The gPLS algorithm
The GraphSig algorithm
The gSpan algorithm
Rightmost path extensions and their supports
The subgraph isomorphism enumeration algorithm
The canonical checking algorithm
The R implementation
Social network mining
Community detection and the shingling algorithm
The node classification and iterative classification algorithms
The R implementation
10. Mining Text and Web Data
Text mining and TM packages
Text summarization
Topic representation
The multidocument summarization algorithm
The Maximal Marginal Relevance algorithm
The R implementation
The question answering system
Genre categorization of web pages
Categorizing newspaper articles and newswires into topics
The N-gram-based text categorization
The R implementation
Web usage mining with web logs
The FCA-based association rule mining algorithm
The R implementation
IV. Module 4: Mastering R for Quantitative Finance
1. Time Series Analysis
Multivariate time series analysis
Cointegration
Vector autoregressive models
VAR implementation example
Cointegrated VAR and VECM
Volatility modeling
GARCH modeling with the rugarch package
The standard GARCH model
The Exponential GARCH model (EGARCH)
The Threshold GARCH model (TGARCH)
Simulation and forecasting
References and reading list
2. Factor Models
Arbitrage pricing theory
Implementation of APT
Fama-French three-factor model
Modeling in R
Data selection
Estimation of APT with principal component analysis
Estimation of the Fama-French model
References
3. Forecasting Volume
Motivation
The intensity of trading
The volume forecasting model
Implementation in R
The data
Loading the data
The seasonal component
AR(1) estimation and forecasting
SETAR estimation and forecasting
Interpreting the results
References
4. Big Data – Advanced Analytics
Getting data from open sources
Introduction to big data analysis in R
K-means clustering on big data
Loading big matrices
Big data K-means clustering analysis
Big data linear regression analysis
Loading big data
Fitting a linear regression model on large datasets
References
5. FX Derivatives
Terminology and notations
Currency options
Exchange options
Two-dimensional Wiener processes
The Margrabe formula
Application in R
Quanto options
Pricing formula for a call quanto
Pricing a call quanto in R
References
6. Interest Rate Derivatives and Models
The Black model
Pricing a cap with Black's model
The Vasicek model
The Cox-Ingersoll-Ross model
Parameter estimation of interest rate models
Using the SMFI5 package
References
7. Exotic Options
A general pricing approach
The role of dynamic hedging
How R can help a lot
A glance beyond vanillas
Greeks – the link back to the vanilla world
Pricing the Double-no-touch option
Another way to price the Double-no-touch option
The life of a Double-no-touch option – a simulation
Exotic options embedded in structured products
References
8. Optimal Hedging
Hedging of derivatives
Market risk of derivatives
Static delta hedge
Dynamic delta hedge
Comparing the performance of delta hedging
Hedging in the presence of transaction costs
Optimization of the hedge
Optimal hedging in the case of absolute transaction costs
Optimal hedging in the case of relative transaction costs
Further extensions
References
9. Fundamental Analysis
The basics of fundamental analysis
Collecting data
Revealing connections
Including multiple variables
Separating investment targets
Setting classification rules
Backtesting
Industry-specific investment
References
10. Technical Analysis, Neural Networks, and Logoptimal Portfolios
Market efficiency
Technical analysis
The TA toolkit
Markets
Plotting charts - bitcoin
Built-in indicators
SMA and EMA
RSI
MACD
Candle patterns: key reversal
Evaluating the signals and managing the position
A word on money management
Wrapping up
Neural networks
Forecasting bitcoin prices
Evaluation of the strategy
Logoptimal portfolios
A universally consistent, non-parametric investment strategy
Evaluation of the strategy
References
11. Asset and Liability Management
Data preparation
Data source at first glance
Cash-flow generator functions
Preparing the cash-flow
Interest rate risk measurement
Liquidity risk measurement
Modeling non-maturity deposits
A Model of deposit interest rate development
Static replication of non-maturity deposits
References
12. Capital Adequacy
Principles of the Basel Accords
Basel I
Basel II
Minimum capital requirements
Supervisory review
Transparency
Basel III
Risk measures
Analytical VaR
Historical VaR
Monte-Carlo simulation
Risk categories
Market risk
Credit risk
Operational risk
References
13. Systemic Risks
Systemic risk in a nutshell
The dataset used in our examples
Core-periphery decomposition
Implementation in R
Results
The simulation method
The simulation
Implementation in R
Results
Possible interpretations and suggestions
References
V. Module 5: Machine Learning with R module
1. Introducing Machine Learning
The origins of machine learning
Uses and abuses of machine learning
Machine learning successes
The limits of machine learning
Machine learning ethics
How machines learn
Data storage
Abstraction
Generalization
Evaluation
Machine learning in practice
Types of input data
Types of machine learning algorithms
Matching input data to algorithms
Machine learning with R
Installing R packages
Loading and unloading R packages
2. Managing and Understanding Data
R data structures
Vectors
Factors
Lists
Data frames
Matrixes and arrays
Managing data with R
Saving, loading, and removing R data structures
Importing and saving data from CSV files
Exploring and understanding data
Exploring the structure of data
Exploring numeric variables
Measuring the central tendency – mean and median
Measuring spread – quartiles and the five-number summary
Visualizing numeric variables – boxplots
Visualizing numeric variables – histograms
Understanding numeric data – uniform and normal distributions
Measuring spread – variance and standard deviation
Exploring categorical variables
Measuring the central tendency – the mode
Exploring relationships between variables
Visualizing relationships – scatterplots
Examining relationships – two-way cross-tabulations
3. Lazy Learning – Classification Using Nearest Neighbors
Understanding nearest neighbor classification
The k-NN algorithm
Measuring similarity with distance
Choosing an appropriate k
Preparing data for use with k-NN
Why is the k-NN algorithm lazy?
Example – diagnosing breast cancer with the k-NN algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Transformation – normalizing numeric data
Data preparation – creating training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Transformation – z-score standardization
Testing alternative values of k
4. Probabilistic Learning – Classification Using Naive Bayes
Understanding Naive Bayes
Basic concepts of Bayesian methods
Understanding probability
Understanding joint probability
Computing conditional probability with Bayes' theorem
The Naive Bayes algorithm
Classification with Naive Bayes
The Laplace estimator
Using numeric features with Naive Bayes
Example – filtering mobile phone spam with the Naive Bayes algorithm
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – cleaning and standardizing text data
Data preparation – splitting text documents into words
Data preparation – creating training and test datasets
Visualizing text data – word clouds
Data preparation – creating indicator features for frequent words
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
5. Divide and Conquer – Classification Using Decision Trees and Rules
Understanding decision trees
Divide and conquer
The C5.0 decision tree algorithm
Choosing the best split
Pruning the decision tree
Example – identifying risky bank loans using C5.0 decision trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating random training and test datasets
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Boosting the accuracy of decision trees
Making some mistakes more costly than others
Understanding classification rules
Separate and conquer
The 1R algorithm
The RIPPER algorithm
Rules from decision trees
What makes trees and rules greedy?
Example – identifying poisonous mushrooms with rule learners
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
6. Forecasting Numeric Data – Regression Methods
Understanding regression
Simple linear regression
Ordinary least squares estimation
Correlations
Multiple linear regression
Example – predicting medical expenses using linear regression
Step 1 – collecting data
Step 2 – exploring and preparing the data
Exploring relationships among features – the correlation matrix
Visualizing relationships among features – the scatterplot matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Model specification – adding non-linear relationships
Transformation – converting a numeric variable to a binary indicator
Model specification – adding interaction effects
Putting it all together – an improved regression model
Understanding regression trees and model trees
Adding regression to trees
Example – estimating the quality of wines with regression trees and model trees
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Visualizing decision trees
Step 4 – evaluating model performance
Measuring performance with the mean absolute error
Step 5 – improving model performance
7. Black Box Methods – Neural Networks and Support Vector Machines
Understanding neural networks
From biological to artificial neurons
Activation functions
Network topology
The number of layers
The direction of information travel
The number of nodes in each layer
Training neural networks with backpropagation
Example – Modeling the strength of concrete with ANNs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Understanding Support Vector Machines
Classification with hyperplanes
The case of linearly separable data
The case of nonlinearly separable data
Using kernels for non-linear spaces
Example – performing OCR with SVMs
Step 1 – collecting data
Step 2 – exploring and preparing the data
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
8. Finding Patterns – Market Basket Analysis Using Association Rules
Understanding association rules
The Apriori algorithm for association rule learning
Measuring rule interest – support and confidence
Building a set of rules with the Apriori principle
Example – identifying frequently purchased groceries with association rules
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – creating a sparse matrix for transaction data
Visualizing item support – item frequency plots
Visualizing the transaction data – plotting the sparse matrix
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
Sorting the set of association rules
Taking subsets of association rules
Saving association rules to a file or data frame
9. Finding Groups of Data – Clustering with k-means
Understanding clustering
Clustering as a machine learning task
The k-means clustering algorithm
Using distance to assign and update clusters
Choosing the appropriate number of clusters
Example – finding teen market segments using k-means clustering
Step 1 – collecting data
Step 2 – exploring and preparing the data
Data preparation – dummy coding missing values
Data preparation – imputing the missing values
Step 3 – training a model on the data
Step 4 – evaluating model performance
Step 5 – improving model performance
10. Evaluating Model Performance
Measuring performance for classification
Working with classification prediction data in R
A closer look at confusion matrices
Using confusion matrices to measure performance
Beyond accuracy – other measures of performance
The kappa statistic
Sensitivity and specificity
Precision and recall
The F-measure
Visualizing performance trade-offs
ROC curves
Estimating future performance
The holdout method
Cross-validation
Bootstrap sampling
11. Improving Model Performance
Tuning stock models for better performance
Using caret for automated parameter tuning
Creating a simple tuned model
Customizing the tuning process
Improving model performance with meta-learning
Understanding ensembles
Bagging
Boosting
Random forests
Training random forests
Evaluating random forest performance
12. Specialized Machine Learning Topics
Working with proprietary files and databases
Reading from and writing to Microsoft Excel, SAS, SPSS, and Stata files
Querying data in SQL databases
Working with online data and services
Downloading the complete text of web pages
Scraping data from web pages
Parsing XML documents
Parsing JSON from web APIs
Working with domain-specific data
Analyzing bioinformatics data
Analyzing and visualizing network data
Improving the performance of R
Managing very large datasets
Generalizing tabular data structures with dplyr
Making data frames faster with data.table
Creating disk-based data frames with ff
Using massive matrices with bigmemory
Learning faster with parallel computing
Measuring execution time
Working in parallel with multicore and snow
Taking advantage of parallel with foreach and doParallel
Parallel cloud computing with MapReduce and Hadoop
GPU computing
Deploying optimized learning algorithms
Building bigger regression models with biglm
Growing bigger and faster random forests with bigrf
Training and evaluating models in parallel with caret
A. Reflect and Test Yourself Answers
Module 1: Data Analysis with R
Chapter 1: RefresheR
Chapter 2: The Shape of Data
Chapter 3: Describing Relationships
Chapter 4: Probability
Chapter 5: Using Data to Reason About the World
Chapter 6: Testing Hypotheses
Chapter 7: Bayesian Methods
Chapter 8: Predicting Continuous Variables
Chapter 9: Predicting Categorical Variables
Chapter 10: Sources of Data
Chapter 11: Dealing with Messy Data
Chapter 12: Dealing with Large Data
Module 2: R Graphs
Chapter 1: R Graphics
Chapter 2: Basic Graph Functions
Chapter 3: Beyond the Basics – Adjusting Key Parameters
Chapter 4: Creating Scatter Plots
Chapter 5: Creating Line Graphs and Time Series Charts
Chapter 6: Creating Bar, Dot, and Pie Charts
Chapter 7: Creating Histograms
Chapter 8: Box and Whisker Plots
Chapter 9: Creating Heat Maps and Contour Plots
Module 4: Mastering R for Quantitative Finance
Chapter 1: Time Series Analysis
Chapter 3: Forecasting Volume
Chapter 4: Big Data – Advanced Analytics
Chapter 5: FX Derivatives
Chapter 6: Interest Rate Derivatives and Models
Chapter 7: Exotic Options
Chapter 8: Optimal Hedging
Chapter 9: Fundamental Analysis
Module 5: Machine Learning with R
Chapter 1: Introducing Machine Learning
Chapter 2: Managing and Understanding Data
Chapter 3: Lazy Learning – Classification Using Nearest Neighbors
Chapter 4: Probabilistic Learning – Classification Using Naive Bayes
Chapter 5: Divide and Conquer – Classification Using Decision Trees and Rules
Chapter 6: Forecasting Numeric Data – Regression Methods
Chapter 7: Black Box Methods – Neural Networks and Support Vector Machines
Chapter 8: Finding Patterns – Market Basket Analysis Using Association Rules
B. Bibliography
Index
R: Data Analysis and Visualization
A course in five modules
Master the art of building analytical models using R with your Course Guide Edwin Moses
Learn data analysis, data visualization techniques, data mining, and machine learning all using R and also learn to build models in quantitative finance using this powerful language
To contact your Course Guide
Email: <edwinm@packtpub.com>
BIRMINGHAM - MUMBAI
Meet Your Course Guide
Welcome to this course on R, the statistical programming language for data scientists and statisticians. With this course, you'll embark on a journey of learning R for data science.
If you have any questions along the way, you can reach out to me over email and I'll make sure you get everything from the course that we've planned – for you to become a working R developer. Details of how to contact me are included on the first page of this course.
Course Structure
The R learning path created for you has five connected modules. Each of these modules is a mini-course in its own right, and as you complete each one, you'll have gained key skills and be ready for the material in the next module!
Now, let’s look at the pathway these modules create and how they will take you from doing data analysis with R to creating analytical models based on machine learning.
Course journey
This course begins by looking at the Data Analysis with R module. This module will help you navigate the R environment. You'll gain a thorough understanding of statistical reasoning and sampling. Finally, you'll be able to put best practices into effect to make your job easier and facilitate reproducibility.
The second place to explore is R Graphs. This module will help you leverage powerful default R graphics and utilize advanced graphics systems such as lattice and ggplot2, the grammar of graphics. Through inspecting large datasets using tableplot and stunning three-dimensional visualizations, you will know how to produce, customize, and publish advanced visualizations using this popular, and powerful, framework.
With the third module, Learning Data Mining with R, you will learn how to manipulate data with R using code snippets and be introduced to mining frequent patterns, association, and correlations while working with R programs. Discover how to write code for various prediction models, stream data, and time-series data. You will also be introduced to solutions written in R based on RHadoop projects. You will finish this module feeling confident in your ability to know which data mining algorithm to apply in any situation.
The Mastering R for Quantitative Finance module pragmatically introduces both the quantitative finance concepts and their modeling in R, enabling you to build a tailor-made trading system on your own. By the end of the module, you will be well versed with various financial techniques using R and will be able to place good bets while making financial decisions.
Finally, we'll look at the Machine Learning with R module. With this module, you'll discover all the analytical tools you need to gain insights from complex data and learn how to choose the correct algorithm for your specific needs. Through full engagement with the sort of real-world problems data-wranglers face, you'll learn to apply machine learning methods to deal with common tasks, including classification, prediction, forecasting, market analysis, and clustering.
The Course Roadmap and Timeline
Here's a view of the entire course plan before we begin. This grid gives you a topic overview of the whole course and its modules, so you can see how we will move through particular phases of learning to use R, what skills you’ll be learning along the way, and what you can do with those skills at each point. I also offer you an estimate of the time you might want to take for each module, although a lot depends on your learning style and how much time you’re able to give the course each week!
Part I. Module 1: Data Analysis with R
Chapter 1. RefresheR
Before we dive into the (other) fun stuff (sampling multi-dimensional probability distributions, using convex optimization to fit data models, and so on), it would be helpful if we review those aspects of R that all subsequent chapters will assume knowledge of.
If you fancy yourself as an R guru, you should still, at least, skim through this chapter, because you'll almost certainly find the idioms, packages, and style introduced here to be beneficial in following along with the rest of the material.
If you don't care much about R (yet), and are just in this for the statistics, you can heave a heavy sigh of relief that, for the most part, you can run the code given in this book in the interactive R interpreter with very little modification, and just follow along with the ideas. However, it is my belief (read: delusion) that by the end of this book, you'll cultivate a newfound appreciation of R alongside a robust understanding of methods in data analysis.
Fire up your R interpreter, and let's get started!
Navigating the basics
In the interactive R interpreter, any line starting with a > character denotes R asking for input. (If you see a + prompt, it means that you didn't finish typing a statement at the prompt, and R is asking you to provide the rest of the expression.) Striking the return key will send your input to R to be evaluated. R's response is then spit back at you in the line immediately following your input, after which R asks for more input. This is called a REPL (Read-Evaluate-Print-Loop). It is also possible for R to read a batch of commands saved in a file (unsurprisingly called batch mode), but we'll be using the interactive mode for most of the book.
As you might imagine, R supports the same familiar mathematical operators as most other languages:
Arithmetic and assignment
Check out the following example:
> 2 + 2
[1] 4
> 9 / 3
[1] 3
> 5 %% 2 # modulus operator (remainder of 5 divided by 2)
[1] 1
Anything that occurs after the octothorpe or pound sign, #, (or hash-tag for you young'uns), is ignored by the R interpreter. This is useful for documenting the code in natural language. These are called comments.
In a multi-operation arithmetic expression, R will follow the standard order of operations from math. In order to override this natural order, you have to use parentheses flanking the sub-expression that you'd like to be performed first.
> 3 + 2 - 10 ^ 2 # ^ is the exponent operator
[1] -95
> 3 + (2 - 10) ^ 2
[1] 67
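As a small sketch of the same precedence rules (these particular expressions are my own illustrations, not examples from the book), notice that the exponent operator is applied before a leading minus sign, and that multiplication is applied before addition:

```r
-2 ^ 2      # the exponent is applied first, so this is -(2 ^ 2), or -4
(-2) ^ 2    # parentheses force the negation first, giving 4
1 + 2 * 3   # multiplication comes before addition, giving 7
```

When in doubt, adding parentheses costs nothing and makes the intended order obvious to the next reader.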
In practice, almost all compound expressions are split up with intermediate values assigned to variables which, when used in future expressions, are just like substituting the variable with the value that was assigned to it. The (primary) assignment operator is <-.
> # assignments follow the form VARIABLE <- VALUE
> var <- 10
> var
[1] 10
> var ^ 2
[1] 100
> VAR / 2 # variable names are case-sensitive
Error: object 'VAR' not found
Notice that the first and second lines in the preceding code snippet didn't have an output to be displayed, so R just immediately asked for more input. This is because assignments don't print a return value (the assigned value is returned invisibly). Their only job is to give a value to a variable, or to change the existing value of a variable. Generally, operations and functions on variables in R don't change the value of the variable. Instead, they return the result of the operation. If you want to change a variable to the result of an operation using that variable, you have to reassign that variable as follows:
> var # var is 10
[1] 10
> var ^ 2
[1] 100
> var # var is still 10
[1] 10
> var <- var ^ 2 # no return value
> var # var is now 100
[1] 100
Be aware that variable names may contain numbers, underscores, and periods; this is something that trips up a lot of people who are familiar with other programming languages that disallow using periods in variable names. The only further restrictions on variable names are that they must start with a letter (or a period and then a letter), and that they must not be one of the reserved words in R such as TRUE, Inf, and so on.
Although the arithmetic operators that we've seen thus far are functions in their own right, most functions in R take the form: function_name (value(s) supplied to the function). The values supplied to the function are called arguments of that function.
> cos(3.14159) # cosine function
[1] -1
> cos(pi) # pi is a constant that R provides
[1] -1
> acos(-1) # arccosine function
[1] 3.141593
> acos(cos(pi)) + 10
[1] 13.14159
> # functions can be used as arguments to other functions
(If you paid attention in math class, you'll know that the cosine of π is -1, and that arccosine is the inverse function of cosine.)
There are hundreds of such useful functions defined in base R, only a handful of which we will see in this book. Two sections from now, we will be building our very own functions.
Before we move on from arithmetic, it will serve us well to visit some of the odd values that may result from certain operations:
> 1 / 0
[1] Inf
> 0 / 0
[1] NaN
It is common during practical usage of R to accidentally divide by zero. As you can see, this undefined operation yields an infinite value in R. Dividing zero by zero yields the value NaN, which stands for Not a Number.
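Because Inf and NaN are ordinary numeric values, they can be stored in variables and tested for. The following is a minimal sketch using base R's is.finite(), is.infinite(), and is.nan() functions; the variable names are made up for illustration:

```r
pos.inf <- 1 / 0         # positive infinity
neg.inf <- -1 / 0        # negative infinity
not.a.number <- 0 / 0    # NaN

is.finite(pos.inf)       # FALSE
is.infinite(neg.inf)     # TRUE
is.nan(not.a.number)     # TRUE

# comparing NaN with anything, even itself, gives NA rather than TRUE,
# so use is.nan() instead of == to test for it
not.a.number == not.a.number
```

Checks like these are handy for catching an accidental division by zero before it quietly spreads through a longer calculation.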
Logicals and characters
So far, we've only been dealing with numerics, but there are other atomic data types in R. To wit:
> foo <- TRUE # foo is of the logical data type
> class(foo) # class() tells us the type
[1] "logical"
> bar <- "hi!" # bar is of the character data type
> class(bar)
[1] "character"
The logical data type (also called Booleans) can hold the values TRUE or FALSE or, equivalently, T or F. The familiar operators from Boolean algebra are defined for these types:
> foo
[1] TRUE
> foo && TRUE # boolean and
[1] TRUE
> foo && FALSE
[1] FALSE
> foo || FALSE # boolean or
[1] TRUE
> !foo # negation operator
[1] FALSE
In a Boolean expression with a logical value and a number, any number that is not 0 is interpreted as TRUE.
> foo && 1
[1] TRUE
> foo && 2
[1] TRUE
> foo && 0
[1] FALSE
Additionally, there are functions and operators that return logical values such as:
> 4 < 2 # less than operator
[1] FALSE
> 4 >= 4 # greater than or equal to
[1] TRUE
> 3 == 3 # equality operator
[1] TRUE
> 3 != 2 # inequality operator
[1] TRUE
Just as there are functions in R that are only defined for work on the numeric and logical data type, there are other functions that are designed to work only with the character data type, also known as strings:
> lang.domain <- "statistics"
> lang.domain <- toupper(lang.domain)
> print(lang.domain)
[1] "STATISTICS"
> # retrieves substring from first character to fourth character
> substr(lang.domain, 1, 4)
[1] "STAT"
> gsub("I", "1", lang.domain) # substitutes every "I" for "1"
[1] "STAT1ST1CS"
> # combines character strings
> paste("R does", lang.domain, "!!!")
[1] "R does STATISTICS !!!"
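A few more base R string functions come up often enough to be worth a quick sketch. The ones shown here, nchar(), paste() with its sep argument, and paste0(), are my own picks rather than functions the text above relies on, and the variable names are only illustrative:

```r
lang.domain <- "statistics"

# nchar() counts the characters in a string
n.chars <- nchar(lang.domain)                          # 10

# paste() joins with a space by default; sep changes the separator
dashed <- paste("R", "does", lang.domain, sep = "-")   # "R-does-statistics"

# paste0() joins with no separator at all
glued <- paste0("STAT", "1ST", "1CS")                  # "STAT1ST1CS"

print(n.chars)
print(dashed)
print(glued)
```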
Flow of control
The last topic in this section will be flow of control constructs.
The most basic flow of control construct is the if statement. The argument to an if statement (what goes between the parentheses), is an expression that returns a logical value. The block of code following the if statement gets executed only if the expression yields TRUE. For example:
> if(2 + 2 == 4)
+   print("very good")
[1] "very good"
> if(2 + 2 == 5)
+   print("all hail to the thief")
>
It is possible to execute more than one statement if an if condition is triggered; you just have to use curly brackets ({}) to contain the statements.
> if((4/2==2) && (2*2==4)){
+   print("four divided by two is two...")
+   print("and two times two is four")
+ }
[1] "four divided by two is two..."
[1] "and two times two is four"
>
It is also possible to specify a block of code that will get executed if the if conditional is FALSE.
> closing.time <- TRUE
> if(closing.time){
+   print("you don't have to go home")
+   print("but you can't stay here")
+ } else{
+   print("you can stay here!")
+ }
[1] "you don't have to go home"
[1] "but you can't stay here"
> if(!closing.time){
+   print("you don't have to go home")
+   print("but you can't stay here")
+ } else{
+   print("you can stay here!")
+ }
[1] "you can stay here!"
>
There are other flow of control constructs (like while and for), but we won't directly be using them much in this text.
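For the curious, here is a minimal sketch of what those two constructs look like. The variable names are made up for illustration, and the 1:5 notation (the whole numbers from 1 to 5) is explained properly in the section on vectors:

```r
# for loop: run the block once for each value in the sequence
running.total <- 0
for(i in 1:5){
  running.total <- running.total + i
}
print(running.total)   # 1 + 2 + 3 + 4 + 5 is 15

# while loop: keep running the block until the condition is FALSE
countdown <- 3
while(countdown > 0){
  print(countdown)
  countdown <- countdown - 1
}
```

Much of what other languages do with explicit loops is done in R with functions that operate on whole vectors at once, which is one reason loops appear so rarely in this text.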
Getting help in R
Before we go further, it would serve us well to have a brief section detailing how to get help in R. Most R tutorials leave this for one of the last sections—if it is even included at all! In my own personal experience, though, getting help is going to be one of the first things you will want to do as you add more bricks to your R knowledge castle. Learning R doesn't have to be difficult; just take it slowly, ask questions, and get help early. Go you!
It is easy to get help with R right at the console. Running the help.start() function at the prompt will start a manual browser. From here, you can do anything from going over the basics of R to reading the nitty-gritty details on how R works internally.
You can get help on a particular function in R if you know its name, by supplying that name as an argument to the help function. For example, let's say you want to know more about the gsub() function that I sprang on you before. Running the following code:
> help("gsub")
> # or simply
> ?gsub
will display a manual page documenting what the function is, how to use it, and examples of its usage.
This rapid accessibility to documentation means that I'm never hopelessly lost when I encounter a function which I haven't seen before. The downside to this extraordinarily convenient help mechanism is that I rarely bother to remember the order of arguments, since looking them up is just seconds away.
Occasionally, you won't quite remember the exact name of the function you're looking for, but you'll have an idea about what the name should be. For this, you can use the help.search() function.
> help.search("chisquare")
> # or simply
> ??chisquare
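Two more base R helpers fit the same workflow, though they are my additions rather than something the text above covers: apropos() lists the objects whose names contain a pattern, and args() prints a function's arguments without opening the full manual page. A minimal sketch, with illustrative variable names:

```r
# find every visible object with "chisq" somewhere in its name
matches <- apropos("chisq")
print(matches)

# print the argument list of gsub() without opening its help page
args(gsub)

# formals() returns the same information as a list we can inspect
arg.names <- names(formals(gsub))
print(arg.names[1:3])
```

The exact contents of matches depend on which packages are attached, but in a default session it should include chisq.test from the stats package.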
For tougher, more semantic queries, nothing beats a good old-fashioned web search engine. If you don't get relevant results the first time, try adding the term "programming" or "statistics" in there for good measure.
Vectors
Vectors are the most basic data structures in R, and they are ubiquitous indeed. In fact, even the single values that we've been working with thus far were actually vectors of length 1. That's why the interactive R console has been printing [1] along with all of our output.
Vectors are essentially an ordered collection of values of the same atomic data type. Vectors can be arbitrarily large (with some limitations), or they can be just one single value.
The canonical way of building vectors manually is by using the c() function (which stands for combine).
> our.vect <- c(8, 6, 7, 5, 3, 0, 9)
> our.vect
[1] 8 6 7 5 3 0 9
In the preceding example, we created a numeric vector of length 7 (namely, Jenny's telephone number).
Note that if we tried to put character data types into this vector as follows:
> another.vect <- c("8", 6, 7, "-", 3, "0", 9)
> another.vect
[1] "8" "6" "7" "-" "3" "0" "9"
R would convert all the items in the vector (called elements) into character data types to satisfy the condition that all elements of a vector must be of the same type. A similar thing happens when you try to use logical values in a vector with numbers; the logical values would be converted into 1 and 0 (for TRUE and FALSE, respectively). These logicals will turn into "TRUE" and "FALSE" (note the quotation marks) when used in a vector that contains characters.
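These conversion rules are easy to check at the console. The following is a small sketch (the variable names are my own) that uses class() to see which type wins in each mixture:

```r
# logical values mixed with numbers become 1 and 0
mixed.numeric <- c(TRUE, FALSE, 3)
print(mixed.numeric)           # 1 0 3
class(mixed.numeric)           # "numeric"

# logical values mixed with characters become "TRUE" and "FALSE"
mixed.character <- c(TRUE, "x")
print(mixed.character)         # "TRUE" "x"
class(mixed.character)         # "character"

# with all three types present, character wins
class(c(1, "a", TRUE))         # "character"
```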
Subsetting
It is very common to want to extract one or more elements from a vector. For this, we use a technique called indexing or subsetting. After the vector, we put an integer inside square brackets ([]), which are called the subscript operator. This instructs R to return the element at that index. The indices (plural for index, in case you were wondering!) for vectors in R start at 1, and stop at the length of the vector.
> our.vect[1] # to get the first value
[1] 8
> # the function length() returns the length of a vector
> length(our.vect)
[1] 7
> our.vect[length(our.vect)] # get the last element of a vector
[1] 9
Note that in the preceding code, we used a function in the subscript operator. In cases like these, R evaluates the expression in the subscript operator, and uses the number it returns as the index to extract.
If we get greedy, and try to extract an element at an index that doesn't exist, R will respond with NA, meaning not available. We will see this special value cropping up from time to time throughout this text.
> our.vect[10]
[1] NA
One of the most powerful ideas in R is that you can use vectors to subset other vectors:
> # extract the first, third, fifth, and
> # seventh element from our vector
> our.vect[c(1, 3, 5, 7)]
[1] 8 7 3 9
The ability to use vectors to index other vectors may not seem like much now, but its usefulness will become clear soon.
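As a taste of that usefulness, here is a short sketch of three other kinds of index vectors that base R accepts: logical vectors, negative numbers, and repeated positions. The variable names other than our.vect are made up for illustration:

```r
our.vect <- c(8, 6, 7, 5, 3, 0, 9)

# a logical vector keeps only the elements where the index is TRUE
big.digits <- our.vect[our.vect > 5]
print(big.digits)        # 8 6 7 9

# a negative index drops that position instead of keeping it
without.first <- our.vect[-1]
print(without.first)     # 6 7 5 3 0 9

# an index vector may reorder or repeat positions
reordered <- our.vect[c(7, 1, 1)]
print(reordered)         # 9 8 8
```

The first form is the one to remember: the comparison our.vect > 5 produces a vector of TRUE and FALSE values, and that vector is then used to pick out elements.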
Another way to create vectors is by using sequences.
> other.vector <- 1:10
> other.vector
[1] 1 2 3 4 5 6 7 8 9 10
> another.vector <- seq(50, 30, by=-2)
> another.vector
[1] 50 48 46 44 42 40 38 36 34 32 30
Above, the 1:10 statement creates a vector from 1 to 10. 10:1 would have created the same 10 element vector, but in reverse. The seq() function is more general in that it allows sequences to be made using steps (among many other things).
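To give a flavor of those "many other things", here is a minimal sketch of two more ways to build sequences in base R: the length.out argument of seq(), and the rev() function. The variable names are only illustrative:

```r
# step through a range by a fixed amount
evens <- seq(2, 10, by = 2)
print(evens)        # 2 4 6 8 10

# ask for a fixed number of evenly spaced values instead of a step size
quarters <- seq(0, 1, length.out = 5)
print(quarters)     # 0.00 0.25 0.50 0.75 1.00

# rev() reverses any vector, so this is the same as 5:1
countdown <- rev(1:5)
print(countdown)    # 5 4 3 2 1
```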
Combining our knowledge of sequences with our knowledge of using vectors to subset other vectors, we can get the first 5 digits of Jenny's number thusly:
> our.vect[1:5]
[1] 8 6 7 5 3
Vectorized functions
Part of what makes R so powerful is that many of R's functions take vectors as arguments. These vectorized functions are usually extremely fast and efficient. We've already seen one such function, length(), but there are many many others.
> # takes the mean of a vector
> mean(our.vect)
[1] 5.428571
> sd(our.vect) # standard deviation
[1] 3.101459
> min(our.vect)
[1] 0
> max(1:10)
[1] 10
> sum(c(1, 2, 3))
[1] 6
In practical settings, such as when reading data from files, it is common to have NA values in vectors:
> messy.vector <- c(8, 6, NA, 7, 5, NA, 3, 0, 9)
> messy.vector
[1] 8 6 NA 7 5 NA 3 0 9
> length(messy.vector)
[1] 9
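Before doing anything else with a vector like this, it often helps to find out how many values are missing and where they are. The following is a small sketch using base R's is.na() and which() functions; the variable names other than messy.vector are my own:

```r
messy.vector <- c(8, 6, NA, 7, 5, NA, 3, 0, 9)

# is.na() returns TRUE for each missing element
is.na(messy.vector)

# TRUE counts as 1 when summed, so this counts the missing values
na.count <- sum(is.na(messy.vector))
print(na.count)        # 2

# which() gives the positions of the TRUE values
na.positions <- which(is.na(messy.vector))
print(na.positions)    # 3 6
```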
Some vectorized functions will not allow NA values by default. In these cases, an extra keyword argument must be supplied along with the first argument to the function.
> mean(messy.vector)