The Data Science Process
Data science is an interdisciplinary practice that draws on methods such as data engineering, descriptive statistics, data mining, machine learning, and predictive analytics. Like data science, operations research focuses on executing data-driven decisions and managing their results.
Feedback and iteration between the data scientist and the other stakeholders are a constant feature of a data science project, and the project’s life cycle reflects this. In reality, the boundaries between the phases are fluid, with the activities of one stage often overlapping those of another.
R is open source software that runs well on multiple platforms, including Unix, Linux, Apple’s macOS, and Microsoft Windows. R is a rich and broad language, and there are usually many ways to accomplish the same task. This creates an initial learning curve, because R programs can be hard to read until you are familiar with the notation. However, time spent reviewing the basic notation is well rewarded: it makes R code easier to read and makes it easier to use R to learn data science methods and practices.
VECTORS AND LISTS
## Builds an example vector. c() is R’s concatenate operator: it builds longer
## vectors and lists from shorter ones without nesting. For example, c(1)
## is just the number 1, and c(1, c(2, 3)) is equivalent to c(1, 2, 3),
## which in turn is the integers 1 through 3 (though stored in a floating-point format).
example_vector <- c(10, 20, 30)
## Builds an example list
example_list <- list(a = 10, b = 20, c = 30)
example_vector[1]
## [1] 10
example_list[1]
## $a
## [1] 10
example_vector[[2]]
## [1] 20
example_list[[2]]
## [1] 20
example_vector[c(FALSE, TRUE, TRUE)]
## [1] 20 30
example_list[c(FALSE, TRUE, TRUE)]
## $b
## [1] 20
##
## $c
## [1] 30
example_list$b
## [1] 20
example_list[["b"]]
## [1] 20
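The difference between single and double brackets is easiest to see on the list. The following is a minimal sketch that rebuilds the same example_list and inspects the class of each result:

```r
example_list <- list(a = 10, b = 20, c = 30)

## Single brackets keep the container: the result is a list of length 1.
class(example_list[1])
## [1] "list"

## Double brackets extract the element itself.
class(example_list[[1]])
## [1] "numeric"

## So arithmetic works directly on the [[ ]] result.
example_list[[1]] + 1
## [1] 11
```

On a plain vector the two forms look alike when selecting one element, which is why the distinction is easy to miss until lists are involved.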
NULL AND NA (NOT AVAILABLE) VALUES
NULL is just a synonym for the empty or length-zero vector formed by using the concatenate operator c() with no arguments.
NA stands for “not available” and is fairly unique to R. Having NA is very convenient because it allows us to annotate missing or unavailable values in place, which is critical in data processing.
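A short sketch of both values in use. The incomes vector is hypothetical and exists only for illustration:

```r
## c() with no arguments returns NULL, which has length zero.
empty <- c()
is.null(empty)
## [1] TRUE
length(NULL)
## [1] 0

## Concatenating NULL adds nothing to a vector.
c(1, NULL, 2)
## [1] 1 2

## NA marks a missing value in place, so the vector keeps its length.
incomes <- c(100, NA, 300)   # hypothetical data
length(incomes)
## [1] 3
is.na(incomes)
## [1] FALSE  TRUE FALSE

## Arithmetic propagates NA unless told to skip missing values.
sum(incomes)
## [1] NA
sum(incomes, na.rm = TRUE)
## [1] 400
```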
Primary R assignment operators
Operator | Purpose | Example |
---|---|---|
<- | Assign the value on the right to the symbol on the left. | x <- 5 # assign the value of 5 to the symbol x |
= | Assign the value on the right to the symbol on the left. | x = 5 # assign the value of 5 to the symbol x |
-> | Assign left to right, instead of the traditional right to left. | 5 -> x # assign the value of 5 to the symbol x |
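As a quick check, the three operators in the table leave the same value behind. The variable names below are arbitrary:

```r
x <- 5
y = 5
5 -> z
identical(x, y) && identical(y, z)
## [1] TRUE

## Inside a function call, = names an argument rather than assigning
## a variable, which is one reason many R style guides prefer <-
## for ordinary assignment.
values <- c(3, 1, 2)
sort(values, decreasing = TRUE)
## [1] 3 2 1
```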
ORGANIZING INTERMEDIATE VALUES
Long sequences of calculations can become difficult to read, debug, and maintain. To avoid this, we suggest reserving the variable named “.” to store intermediate values. The idea is this: work slow to move fast.
## notional, or example, data
data <- data.frame(revenue = c(2, 1, 2), sort_key = c("b", "c", "a"), stringsAsFactors = FALSE)
## Assign our data to a temporary variable named “.”.
## The original values will remain available in the “data” variable,
## making it easy to restart the calculation from the beginning if necessary.
. <- data
## Use the order command to sort the rows. drop = FALSE is not strictly needed,
## but it is good to get in the habit of including it.
## For single-column data.frames without the drop = FALSE argument,
## the [,] indexing operator will convert the result to a vector,
## which is rarely the R developer's true intent. The drop = FALSE argument turns off this conversion,
## and it is a good idea to include it “just in case” and a definite requirement
## when either the data.frame has a single column or when we don’t know
## if the data.frame has more than one column.
. <- .[order(.$sort_key), , drop = FALSE]
.$ordered_sum_revenue <- cumsum(.$revenue)
.$fraction_revenue_seen <- .$ordered_sum_revenue/sum(.$revenue)
## Assigns the result away from “.” to
## a more memorable variable name
result <- .
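The effect of drop = FALSE only shows up on a single-column data.frame. A minimal sketch, using a hypothetical one-column frame named single:

```r
## Hypothetical single-column data.frame.
single <- data.frame(revenue = c(2, 1, 2))

## Without drop = FALSE the result collapses to a plain numeric vector.
as_vector <- single[order(single$revenue), ]
class(as_vector)
## [1] "numeric"

## With drop = FALSE the result is still a data.frame.
as_frame <- single[order(single$revenue), , drop = FALSE]
class(as_frame)
## [1] "data.frame"
```

Code that later refers to a column by name, such as as_frame$revenue, works only on the second result.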
THE DATA.FRAME CLASS
R uses the data.frame class to store data in a “ready for analysis” format. The data.frame is a two-dimensional array where each column represents a variable, measure, or fact, and each row represents an individual or instance.
For example:
d <- data.frame(col1 = c(1, 2, 3), col2 = c(-1, 0, 1))
d$col3 <- d$col1 + d$col2
print(d)
#   col1 col2 col3
# 1    1   -1    0
# 2    2    0    2
# 3    3    1    4
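The row and column reading of a data.frame can be checked directly. This sketch rebuilds the same d and uses only base R functions:

```r
d <- data.frame(col1 = c(1, 2, 3), col2 = c(-1, 0, 1))
d$col3 <- d$col1 + d$col2

## Rows are instances, columns are variables.
nrow(d)
## [1] 3
ncol(d)
## [1] 3

## Each column is a vector with its own type.
sapply(d, class)
##      col1      col2      col3
## "numeric" "numeric" "numeric"

## A single row, kept as a one-row data.frame.
d[2, , drop = FALSE]
##   col1 col2 col3
## 2    2    0    2
```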
Using summary statistics to spot problems
In R, you can use the summary() command to glance at the data. The goal is to know whether the quality of the data is good enough to provide useful information.
## Change this to your actual path to the
## directory where you unpacked PDSwR2
setwd("PDSwR2/Custdata")
customer_data = readRDS("custdata.RDS")
summary(customer_data)
##     custid              sex        is_employed       income
##  Length:73262       Female:37837   FALSE: 2351   Min.   :  -6900
##  Class :character   Male  :35425   TRUE :45137   1st Qu.:  10700
##  Mode  :character                  NA's :25774   Median :  26200
##                                                  Mean   :  41764
##                                                  3rd Qu.:  51700
##                                                  Max.   :1257000
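The summary above hints at two kinds of problems: missing values in is_employed and a negative minimum income. The sketch below shows one way to follow up on such hints. The toy data.frame is an invented stand-in, not the real customer data:

```r
## Hypothetical stand-in for customer_data; the values are invented.
toy <- data.frame(
  is_employed = c(TRUE, NA, FALSE, TRUE, NA),
  income      = c(26200, 10700, -6900, 51700, 41764)
)

## How many values are missing in each column?
colSums(is.na(toy))
## is_employed      income
##           2           0

## Which rows have an implausible negative income?
toy[toy$income < 0, , drop = FALSE]
##   is_employed income
## 3       FALSE  -6900
```

Whether a negative income is an error or a meaningful value, such as debt, is a question for the data's owners; the code only locates the rows.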
Spotting problems using graphics and visualization
We can use the ggplot2 graphics package for R, along with some prepackaged ggplot2 visualizations from the WVPlots package. You may also want to check out the ggpubr and ggstatsplot packages for more prepackaged ggplot2 graphs.
## Load the ggplot2 library, if you
## haven’t already done so.
library(ggplot2)
ggplot(customer_data, aes(x=gas_usage)) +
## The binwidth parameter tells the geom_histogram
## call how to make bins of $10 intervals (default is
## datarange/30). The fill parameter specifies the
## color of the histogram bars (default: black).
geom_histogram(binwidth=10, fill="gray")
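To see what a bin width of 10 means in terms of counts, the sketch below uses base R's hist() with plot = FALSE in place of ggplot2, so it runs without extra packages. The gas_usage values are invented, and ggplot2 may place its bin edges differently from the breaks chosen here:

```r
## Hypothetical gas usage values in dollars; invented for illustration.
gas_usage <- c(3, 12, 18, 25, 27, 29, 41, 44, 60)

## Break points every 10 dollars, covering the data range.
breaks <- seq(0, 60, by = 10)

## plot = FALSE returns the bin counts instead of drawing the chart.
h <- hist(gas_usage, breaks = breaks, plot = FALSE)
h$counts
## [1] 1 2 3 0 2 1
```

Each count is the height of one histogram bar; an empty bin, like the fourth one here, appears as a gap in the plot.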
Hopefully, you find this book and its code examples useful.
For more details regarding R, you can visit here: