Data Science with R 2nd Edition

The Data Science Process

Data science is an interdisciplinary practice that draws on methods from data engineering, descriptive statistics, data mining, machine learning, and predictive analytics. Much like operations research, data science focuses on implementing data-driven decisions and managing their results.
 
Feedback and iteration between the data scientist and other stakeholders are a central part of data science work, as the data science project life cycle shows. In reality, the boundaries between stages are fluid, with activities in one stage often overlapping those in another.
 
R is open source software that runs well on multiple platforms, including Unix, Linux, Apple’s macOS, and Microsoft Windows. R is a rich and broad language, and there are usually many ways to accomplish the same task. This creates an initial learning curve, because it is difficult to understand what R programs mean until you are familiar with the notation. However, time spent reviewing the basic notation is well rewarded: it makes it much easier to use R to learn data science methods and practices.
[Figure: The data science project life cycle, from Data Science with R, 2nd Edition]

VECTORS AND LISTS

				
## Builds an example vector. c() is R’s concatenate operator; it builds longer
## vectors and lists from shorter ones without nesting. For example, c(1)
## is just the number 1, and c(1, c(2, 3)) is equivalent to c(1, 2, 3),
## which in turn is the integers 1 through 3 (though stored in a floating-point format).
example_vector <- c(10, 20, 30)
## Builds an example list
example_list <- list(a = 10, b = 20, c = 30)

example_vector[1]
## [1] 10
example_list[1]
## $a
## [1] 10
example_vector[[2]]
## [1] 20
example_list[[2]]
## [1] 20
example_vector[c(FALSE, TRUE, TRUE)]
## [1] 20 30
example_list[c(FALSE, TRUE, TRUE)]
## $b
## [1] 20
##
## $c
## [1] 30
example_list$b
## [1] 20
example_list[["b"]]
## [1] 20
				
			

NULL AND NA (NOT AVAILABLE) VALUES

NULL is just a synonym for the empty, or length-zero, vector formed by calling the concatenate operator c() with no arguments.
NA stands for “not available” and is fairly distinctive to R. NA is very convenient because it lets us mark missing or unavailable values in place, which is critical in data processing.
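
As a minimal sketch of how these two values behave, the base R snippet below uses made-up numbers; the variable names with_null and with_na are purely illustrative and do not come from the book.

```r
## NULL behaves like a length-zero vector
length(NULL)                 # 0
is.null(c())                 # TRUE: c() with no arguments returns NULL

## NULL vanishes when concatenated, while NA holds its place
with_null <- c(1, NULL, 3)   # a vector of length 2
with_na   <- c(1, NA, 3)     # a vector of length 3

is.na(with_na)               # FALSE TRUE FALSE
mean(with_na)                # NA, because one value is missing
mean(with_na, na.rm = TRUE)  # 2, after dropping the missing value
```

Because NA keeps its position, the missing entry stays aligned with the other values in the same row or record, which is what makes it useful for annotating data in place.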

Primary R assignment operators

Operator | Purpose | Example
<- | Assign the value on the right to the symbol on the left. | x <- 5 # assign the value of 5 to the symbol x
= | Assign the value on the right to the symbol on the left. | x = 5 # assign the value of 5 to the symbol x
-> | Assign left to right, instead of the traditional right to left. | 5 -> x # assign the value of 5 to the symbol x
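
A short sketch of the three operators side by side may help; the symbols x, y, and z are arbitrary names chosen for illustration.

```r
x <- 5    # right-to-left assignment with <-
y = 5     # right-to-left assignment with =
5 -> z    # left-to-right assignment with ->

## All three symbols now hold the same value
identical(x, y) && identical(y, z)   # TRUE
```

At the top level all three forms have the same effect. Many R style guides prefer <- for assignment, partly because = is also used to bind arguments inside function calls.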

ORGANIZING INTERMEDIATE VALUES

Long sequences of calculations can become difficult to read, debug, and maintain. To avoid this, we suggest reserving the variable named “.” to store intermediate values. The idea is this: work slow to move fast.

				
					## notional, or example, data
data <- data.frame(revenue = c(2, 1, 2), sort_key = c("b", "c", "a"), stringsAsFactors = FALSE)

## Assign our data to a temporary variable named “.”.
## The original values will remain available in the “data” variable, 
## making it easy to restart the calculation from the beginning if necessary.
. <- data

## Use the order command to sort the rows. drop = FALSE is not strictly needed,
## but it is good to get in the habit of including it.
## For single-column data.frames without the drop = FALSE argument,
## the [,] indexing operator will convert the result to a vector,
## which is rarely the R developer's true intent. The drop = FALSE argument turns off this conversion,
## and it is a good idea to include it “just in case” and a definite requirement
## when either the data.frame has a single column or when we don’t know
## if the data.frame has more than one column.
. <- .[order(.$sort_key), , drop = FALSE]

.$ordered_sum_revenue <- cumsum(.$revenue)
.$fraction_revenue_seen <- .$ordered_sum_revenue/sum(.$revenue)

## Assigns the result away from “.” to
## a more memorable variable name
result <- .
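
The comments above on drop = FALSE can be checked with a small sketch. The single-column data.frame single_col below is a hypothetical example, not data from the book.

```r
## Hypothetical single-column data.frame
single_col <- data.frame(sort_key = c("b", "c", "a"), stringsAsFactors = FALSE)

## Without drop = FALSE, [,] collapses the result to a plain character vector
as_vector <- single_col[order(single_col$sort_key), ]

## With drop = FALSE, the result is still a data.frame
as_frame <- single_col[order(single_col$sort_key), , drop = FALSE]

is.data.frame(as_vector)   # FALSE
is.data.frame(as_frame)    # TRUE
```

Code written for a data.frame (for example, code that uses $ to add a column) would fail on as_vector, which is why including drop = FALSE by habit is a reasonable defensive choice.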
				
			

THE DATA.FRAME CLASS

R uses the data.frame class to store data in a “ready for analysis” format. A data.frame is a two-dimensional, table-like structure in which each column represents a variable, measure, or fact, and each row represents an individual or instance.
For example:
				
					d <- data.frame(col1 = c(1, 2, 3), col2 = c(-1, 0, 1))
d$col3 <- d$col1 + d$col2
print(d)
#   col1 col2 col3
# 1    1   -1    0
# 2    2    0    2
# 3    3    1    4
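
To make the rows-as-instances, columns-as-variables layout concrete, here is a minimal sketch that rebuilds the same two starting columns and inspects them with base R functions. The name positive_rows is illustrative.

```r
d <- data.frame(col1 = c(1, 2, 3), col2 = c(-1, 0, 1))

dim(d)             # 3 2: three rows (instances), two columns (variables)
sapply(d, class)   # both columns are "numeric"

## Select the rows (instances) where col2 is positive,
## keeping the result as a data.frame
positive_rows <- d[d$col2 > 0, , drop = FALSE]
```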
				
			

Using summary statistics to spot problems

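In R, you can use the summary() command to take a first look at the data. The goal is to judge whether the data is good enough to provide useful information, so look for missing values (the NA's counts), invalid values such as a negative income, and ranges that seem too wide or too narrow.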
				
					## Change this to your actual path to the
## directory where you unpacked PDSwR2
setwd("PDSwR2/Custdata")
customer_data = readRDS("custdata.RDS")
summary(customer_data)
## custid                   sex             is_employed             income
## Length:73262         Female:37837        FALSE: 2351             Min. : -6900
## Class :character     Male :35425         TRUE :45137             1st Qu.: 10700
## Mode :character                          NA's :25774             Median : 26200
##                                                                  Mean : 41764
##                                                                  3rd Qu.: 51700
##                                                                  Max. :1257000
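
The kinds of problems visible in the summary above (missing values and a negative minimum income) can also be counted directly. The sketch below uses a small hypothetical data.frame, toy_customers, rather than the real custdata file, so that it runs on its own.

```r
## Hypothetical toy data echoing the kinds of problems seen in the summary
toy_customers <- data.frame(
  income = c(-6900, 10700, 26200, 51700, 1257000),
  is_employed = c(NA, TRUE, FALSE, TRUE, NA)
)

## Number of missing values in each column
na_counts <- colSums(is.na(toy_customers))

## Number of rows with an invalid (negative) income
n_negative_income <- sum(toy_customers$income < 0, na.rm = TRUE)

## Smallest and largest income, to judge whether the range is plausible
income_range <- range(toy_customers$income, na.rm = TRUE)
```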

				
			

Spotting problems using graphics and visualization

We can use the R graphics package ggplot2, along with some prepackaged ggplot2 visualizations in the WVPlots package. You may also want to check out the ggpubr and ggstatsplot packages for more prepackaged ggplot2 graphs.
				
					## Load the ggplot2 library, if you
## haven’t already done so.
library(ggplot2)
ggplot(customer_data, aes(x=gas_usage)) +
    ## The binwidth parameter tells the geom_histogram
    ## call to make bins of $10 intervals (by default,
    ## ggplot2 uses 30 bins). The fill parameter specifies
    ## the color of the histogram bars (default: dark gray).
    geom_histogram(binwidth=10, fill="gray")
				
			

Hopefully, you find this book and its code examples useful.

For more details regarding R, you can visit here:
