Start Coding with R

Installing R and RStudio

Personally, I don't much enjoy installing and preparing software (maybe you don't either), but there is no escape. Fortunately, I have kept this part as short as possible, and I promise better stuff will show up very soon.

To start our journey we need some preparation and knowledge of the basics. If data science is divided into two parts, one is statistics/analytics and the other is the computing needed to apply those concepts. Usually we use a programming language to do the job, and on this website we use R. In another post I will explain why. We don't give a traditional introduction to R here; instead, we go directly to using it.

To make the language available on your PC, you need two steps. Both R and RStudio are free, and the free versions can do everything you need for a whole data science project.

  1. Install R from here
  2. Install RStudio (an IDE) from here

After you complete the steps, open RStudio on your PC and you will see something like this:

If you click the green + (new file) button at the top left of RStudio and then choose R Script, a new R script opens where you can write your R code more comfortably. You can also go this way: File > New File > R Script

So you will see something like this (I changed the color theme of RStudio):

From now on, until we say otherwise, we write everything in the script panel. The output appears immediately in the console, and if you make a visualization, it shows up in the output panel. First code in RStudio:

As you can see in the console tab, you can write basic arithmetic statements after the > prompt, and the result is printed right below:

# simple summation
3+4
## [1] 7

If a line starts with #, R does not treat it as code. Such lines are comments and are usually used to explain the code. To run code in the script panel, press Ctrl+Enter on your keyboard. If you run 3+4 with Ctrl+Enter in the script panel, the result is shown in the console panel.

So let's start writing our first code in a script without any further delay (there has been too much already).

Start your R coding:

Arithmetic operators:


# addition
5 + 10
## [1] 15
# subtraction
20 - 5
## [1] 15
# multiplication
20 * 2
## [1] 40
# division
30 / 2
## [1] 15

You may wonder what the [1] at the start of each output is. No worries, we will explain it soon. For now, just know that because the output is a single value (15), [1] shows that it is the first value of the result. It is not useful here, but it will be later.
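To get a feeling for why the bracketed number can help, here is a small sketch with a longer result. The name `long_vector` is made up for this illustration, and how many values fit on one line depends on the width of your console, so the numbers in brackets on the later lines will differ from machine to machine.

```r
# A vector holding the 30 values 101, 102, ..., 130
long_vector <- 101:130
# Printing it wraps over several lines; every line starts with the
# position of its first value in square brackets, beginning with [1]
long_vector
# length() confirms how many values were printed in total
length(long_vector)
## [1] 30
```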

Variable

You saw the four operators above, and you can imagine doing more with them. True, you may have seen the same in Excel or other software. But one thing that sets a programming language apart from such software is the ability to define a variable in which you store data values, which reduces manual work. Defining a variable is like giving a name to a newborn child. Imagine we have a variable called weight and you want to assign your weight to it. Let's say your weight is 80. We want to say weight is equal to 80. We do this in R using <-.

Numeric Variable

# assign 80 to the weight variable
weight <- 80
# To see the value of weight you can write the variable name.
weight
## [1] 80
# you can do the same with print function
print(weight)
## [1] 80

You can also do operations with variables:

# simple operation
weight*2
## [1] 160
# spice it up
(weight-10)/2
## [1] 35
# You can assign the result to another variable. For example:
my.future.weight <- weight-10
# remember print?
print(my.future.weight)
## [1] 70

Variables with more than one value

You can have a variable with more than one value using the function c(). Imagine you have 3 friends whose ages are 20, 24 and 29. Let's put them all in one vector variable called ages.

ages <- c(20, 24, 29)
print(ages)
## [1] 20 24 29
class(ages)
## [1] "numeric"
# you can now apply any calculation, for example multiply the ages variable by 2
ages*2
## [1] 40 48 58
# Or subtract two vectors
newages <- c(24,11,19)
differenceage <- newages-ages
differenceage
## [1]   4 -13 -10
# There are other ways to build a vector. Look at this example
vector <- c(2,3,4,5,6.7)
vector
## [1] 2.0 3.0 4.0 5.0 6.7
# we can also build a sequence of whole numbers using `:`
vector2 <- 2:7
vector2
## [1] 2 3 4 5 6 7
# let's check if vector and vector2 are equal
identical(vector,vector2)
## [1] FALSE
# FALSE, because vector ends in 6.7 while vector2 ends in 6, 7

Notice that when you define a variable (like differenceage <- newages-ages), R does not automatically print it. To see the value, you need to write the variable name explicitly.
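If you want to assign and print in one step, one option is to wrap the whole assignment in parentheses. A minimal sketch, using a made-up variable called `height`:

```r
# Plain assignment: the value is stored, nothing is printed
height <- 175
# The same assignment wrapped in parentheses stores AND prints the value
(height <- 175)
## [1] 175
```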

Extracting part of a variable

What if you want to extract part of a vector? For example, consider the ages variable. It has 3 values: 20, 24, 29. If you want to pull out only the second value (24), you only need to write ages[2].

# newages vector
newages
## [1] 24 11 19
# third value of newages variable
newages[3]
## [1] 19
# Second and third values of newages variable
newages[2:3]
## [1] 11 19
# First and third values(we should use c function)
newages[c(1,3)]
## [1] 24 19

The last one seems a little tricky, but no worries, we will talk about it later.
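Until then, it may help to see that c(1,3) is itself just a small numeric vector of positions, and the brackets pick exactly those positions. The sketch below repeats the example step by step; the name `positions` is only for illustration, and the negative index at the end is an extra that was not covered above.

```r
newages <- c(24, 11, 19)
# Build the vector of positions first, then use it inside the brackets
positions <- c(1, 3)
newages[positions]
## [1] 24 19
# A negative position drops that element instead of selecting it
newages[-2]
## [1] 24 19
```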

Other Types of Variables in R

Besides numeric variables, R has other variable types. The most useful ones are:

  1. Vectors(numeric, character, logical, date/datetime, factors)
  2. Matrices
  3. Data frames
  4. Lists
  5. Tibbles

Let's see the class() of our variables.

class(weight)
## [1] "numeric"
class(my.future.weight)
## [1] "numeric"
class(ages)
## [1] "numeric"
class(differenceage)
## [1] "numeric"
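Alongside class(), the function length() reports how many values a vector holds. A short sketch with two of the variables from above, redefined here so the block runs on its own:

```r
weight <- 80
ages <- c(20, 24, 29)
# length() counts the values stored in a vector
length(weight)
## [1] 1
length(ages)
## [1] 3
# is.numeric() answers a yes/no question about the type
is.numeric(ages)
## [1] TRUE
```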

Now we are talking about vectors. The weight variable that we defined is a vector with class numeric (because it holds a number) and length 1 (because it holds only one value). As you can see in the list above, vectors come in other classes too. Let's try them in practice:

Character Variable

# character variable is defined by putting "" around the text 
City <- "Berlin"
# above you can see I assigned Berlin to a variable called City
print(City)
## [1] "Berlin"
# See the class
class(City)
## [1] "character"
# We can also have a character variable with more than one value using `c()`
cities <- c("Berlin", "Shiraz", "New York")
class(cities)
## [1] "character"
# You can also extract the specific parts of the vector
cities[c(1,3)]
## [1] "Berlin"   "New York"

R is case sensitive, so A and a are different symbols and refer to different variables. As a result, City and city are two completely different variables. The same applies to functions, values and any other part of the code.
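A small sketch of what this means in practice. The lowercase variable `city` and its value are made up for this illustration.

```r
City <- "Berlin"
city <- "Shiraz"
# Two separate variables, each keeping its own value
City
## [1] "Berlin"
city
## [1] "Shiraz"
# Values are case sensitive too: these two texts are not equal
"berlin" == "Berlin"
## [1] FALSE
```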

Date and time

Dates and times are a little trickier than other types of variables, but let's start with a simple example:

# Consider this character variable
mydate <- "2020-10-4"
# See the class
class(mydate)
## [1] "character"
# The `mydate` variable looks like a date, but R treats it as a character because it is written inside "..."
# In R we must explicitly convert a variable to the date type with the as.Date() function
Newmydate <- as.Date(mydate)
class(Newmydate)
## [1] "Date"
# dates vector
datesvector <- as.Date(c("2021-08-22","2017-05-14","2001-04-19"))
datesvector
## [1] "2021-08-22" "2017-05-14" "2001-04-19"
# Now get the maximum of datesvector using max() function
max(datesvector)
## [1] "2021-08-22"
# Also the minimum
min(datesvector)
## [1] "2001-04-19"

Factor Variables

Conceptually, factors are variables in R that take on a limited number of different values; such variables are often called categorical variables. Behind the scenes, a factor is a numeric variable that displays text: R stores it as a vector of integer values together with a corresponding set of character values to show when the factor is displayed. Factors are useful for categorizing (grouping) results and in statistical modeling.


# Create a color vector, which is a character vector
color_vector <- c('blue', 'red', 'blue', 'green', 'red', 'blue')
# Let's see the class
class(color_vector)
## [1] "character"
# Output of color_vector
print(color_vector)
## [1] "blue"  "red"   "blue"  "green" "red"   "blue"
# Convert color_vector to a factor
factor_color <- factor(color_vector)
# Now let's see the class of this new variable
class(factor_color)
## [1] "factor"
# Output of factor_color
print(factor_color)
## [1] blue  red   blue  green red   blue 
## Levels: blue green red

As you can see, the factor vector prints differently from the character vector. It includes an extra line called Levels that lists the distinct values used.

We can also see the number associated with each factor value, and count how often each value occurs:

# Convert Factor to Numeric
as.numeric(factor_color)
## [1] 1 3 1 2 3 1
# the numbers above are the positions of each value among the levels (blue = 1, green = 2, red = 3)
# We can also count each color using the table() function
table(factor_color)
## factor_color
##  blue green   red 
##     3     1     2
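To look at the levels themselves, and at how many there are, base R offers levels() and nlevels(). A short sketch that rebuilds factor_color so it runs on its own:

```r
color_vector <- c('blue', 'red', 'blue', 'green', 'red', 'blue')
factor_color <- factor(color_vector)
# The distinct values, sorted alphabetically by default
levels(factor_color)
## [1] "blue"  "green" "red"
# How many distinct values there are
nlevels(factor_color)
## [1] 3
# The integer codes point into levels(): code 3 is the third level
levels(factor_color)[3]
## [1] "red"
```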

Logical Variables

A logical vector is a vector that only contains TRUE and FALSE values. In R, true values are designated with TRUE, and false values with FALSE.

# logical vector (named flags here, because c is already the name of a function)
flags <- c(TRUE,TRUE,TRUE,FALSE,TRUE,FALSE)

When you index a vector with a logical vector, R will return values of the vector for which the indexing vector is TRUE. If that was confusing, think about it this way: a logical vector, combined with the brackets [ ], acts as a filter for the vector it is indexing. It only lets values of the vector pass through for which the logical vector is TRUE.

# create a logical variable comparing two number
20 > 15
## [1] TRUE
15 > 20
## [1] FALSE
5==5
## [1] TRUE
# create a vector of logical variable
c(17>10, 13==13, 19<17)
## [1]  TRUE  TRUE FALSE
# Now consider this numeric vector
a <- c(13, 100, 7, 21, 1)
# We want to get the 2nd and 4th elements of the vector. We can write this logical vector
logical <- c(FALSE, TRUE, FALSE, TRUE, FALSE)
# combine a with the logical vector to get only the values where the logical vector is TRUE
a[logical]
## [1] 100  21
# As you see, only the 2nd and 4th elements (100, 21) of a are returned, because the logical vector is TRUE only at the 2nd and 4th positions

There is another way to look at logical variables.

# Consider this vector
b <- c(11,5,30,73,9,16,9)
b
## [1] 11  5 30 73  9 16  9
# Let's see which value of b is equal to 5
b==5
## [1] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
# It seems the second value is equal to 5
# what if we want to get the values that are equal to 5?
b[b==5]
## [1] 5
# b==5 simply says which values are TRUE (equal to 5) and which are FALSE (not equal to 5).
# b[...] keeps the TRUE ones and drops the FALSE ones from the b vector
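The same filtering idea works with any comparison, and conditions can be combined. The sketch below reuses the b vector; the thresholds are arbitrary choices for illustration.

```r
b <- c(11, 5, 30, 73, 9, 16, 9)
# Keep only the values greater than 10
b[b > 10]
## [1] 11 30 73 16
# Combine two conditions with & (and): greater than 10 AND less than 50
b[b > 10 & b < 50]
## [1] 11 30 16
# TRUE counts as 1, so sum() counts how many values are equal to 9
sum(b == 9)
## [1] 2
# which() returns the positions where the condition is TRUE
which(b == 9)
## [1] 5 7
```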

We can also do the same with other types of variables:

# Do you remember the factor_color variable?
factor_color
## [1] blue  red   blue  green red   blue 
## Levels: blue green red
# Consider this: the code makes a vector of TRUE and FALSE, with TRUE where the value is blue
factor_color=="blue"
## [1]  TRUE FALSE  TRUE FALSE FALSE  TRUE
# To extract the elements whose value is blue
factor_color[factor_color=="blue"]
## [1] blue blue blue
## Levels: blue green red
# To get the elements that are equal to blue or green
factor_color[factor_color=="blue" | factor_color=="green"]
## [1] blue  blue  green blue 
## Levels: blue green red
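When the list of accepted values grows, chaining many | conditions gets long. One alternative is the %in% operator, which checks each element against a set of values. A sketch that rebuilds factor_color so it runs on its own:

```r
factor_color <- factor(c('blue', 'red', 'blue', 'green', 'red', 'blue'))
# %in% is TRUE wherever the value appears in the given set
factor_color %in% c("blue", "green")
## [1]  TRUE FALSE  TRUE  TRUE FALSE  TRUE
# Used as a filter, it keeps the same elements as the | version
factor_color[factor_color %in% c("blue", "green")]
## [1] blue  blue  green blue 
## Levels: blue green red
```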

Tables

In reality, most of the time when you do a data science project you work with a table that is two-dimensional and has columns (variables) and rows (observations). You have probably worked with tables in spreadsheets like Excel or Google Sheets. A table looks something like this:

Sepal.Length  Sepal.Width  Petal.Length  Petal.Width  Species
5.1           3.5          1.4           0.2          setosa
4.9           3.0          1.4           0.2          setosa
4.7           3.2          1.3           0.2          setosa
4.6           3.1          1.5           0.2          setosa
5.0           3.6          1.4           0.2          setosa

As you see, there are 5 rows (observations) and 5 columns: Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species

In R, such a table is stored as a data frame. There are multiple ways to make a data frame. Let's see one of them:
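The rows shown here match the start of iris, a small example dataset that ships with R, so you can most likely reproduce the table yourself. A minimal sketch; the name `first_rows` is made up for this illustration.

```r
# iris is a built-in example dataset; head() returns its first rows
first_rows <- head(iris, 5)
first_rows
# dim() gives the number of rows, then the number of columns
dim(first_rows)
## [1] 5 5
```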

# imagine we have two vectors, brand and prices

brand <- c("Samsung", "Dell", "Apple", "Asus", "HP")
prices <- c(5, 4, 12, 9, 23)

# Let's put them together in one table
# kable() comes from the knitr package and %>% from the magrittr package
library(knitr)
library(magrittr)
data.frame(brand, prices) %>% kable()
brand    prices
Samsung  5
Dell     4
Apple    12
Asus     9
HP       23
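Once the table exists, the vector tools from earlier carry over to it. The sketch below stores the data frame under a made-up name, `laptops`, pulls out a column, and filters rows with a logical vector; the price threshold is an arbitrary choice.

```r
brand <- c("Samsung", "Dell", "Apple", "Asus", "HP")
prices <- c(5, 4, 12, 9, 23)
laptops <- data.frame(brand, prices)
# Number of rows, then number of columns
dim(laptops)
## [1] 5 2
# $ pulls one column out as a vector
laptops$prices
## [1]  5  4 12  9 23
# Logical filter on the rows: keep brands with a price above 8
laptops[laptops$prices > 8, ]
##   brand prices
## 3 Apple     12
## 4  Asus      9
## 5    HP     23
```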
