
R Lectures PDF

This document provides an overview of the RStudio interface and basic functions in R. It describes the four panes of RStudio - the console, script/source, environment/history, and plots/help/packages panes. It demonstrates how to perform basic math operations in R and store values in variables. It also covers creating and naming vectors, selecting elements from vectors, and loading/importing data. The key aspects covered are the user interface of RStudio and basic data types and operations in R like vectors, variables, and loading data.

R Tutorial

Prepared by: Katrina D. Elizon


Four Panes
R Script or Source Pane (upper left)    Environment and History Pane (upper right)

Console Pane (lower left)               Final Pane (lower right)


Console pane
The console pane is where the
action takes place (located lower
left). From this screenshot you see
that this session was running R
version 3.5.0 called “Joy in
Playing,” which was released April
23, 2018. The console is the heart
of RStudio. You can type
commands directly into the console
whenever you see the flashing
cursor. Output and error messages
are displayed in the console.
R Script or Source pane
The script or source pane (located
upper left) is where you can type and
save your commands and make notes
to yourself about projects. When you
run a command from the source pane,
the command is sent over to the
console pane to be executed. It is
possible to have multiple sources or
scripts appear in the source pane, and
they will each have their own tab at
the top of the pane. More on this topic
later in the Lab.
Environment and History Pane

The environment and history pane (located
upper right) is where you will see the
different objects you create or the different
datasets you import.
Final Pane

The final pane contains everything
else including help, plots, packages,
etc. (located in the lower right).
Calculator in RStudio
One of the simpler uses of RStudio is to use it like a
calculator:
Type basic addition (+), subtraction (-), multiplication (*),
and division (/) procedures directly into the console pane.

RStudio is mostly flexible about using spaces, so you can
include spaces between the characters or not.

Addition +
Subtraction -
Multiplication *
Division /
Exponentiation ^
The command for naming an object:
= or <-
RStudio will perform the mathematic
procedure and return the result in the
console in the line below next to a [1]
that indicates the first and, in this case,
only result from your command.
Notice that when you ask R to
print x, the value 42 appears
in the console.
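
As a minimal sketch of the calculator operations and of naming an object, the lines below can be typed into the console. The numbers used in the arithmetic are illustrative assumptions; only the value 42 stored in x comes from the text.

```r
# Basic arithmetic typed directly into the console
5 + 3    # addition
10 - 4   # subtraction
6 * 7    # multiplication
20 / 4   # division
2 ^ 3    # exponentiation

# Name an object with <- and then print it
x <- 42
x        # [1] 42
```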
Suppose you have a fruit
basket with five apples. As a
data analyst in training, you
want to store the number of
apples in a variable with the
name my_apples.
Example

Every tasty fruit basket needs oranges, so you
decide to add six oranges. As a data analyst,
your reflex is to immediately create the variable
my_oranges and assign the value 6 to it. Next,
you want to calculate how many pieces of fruit
you have in total. Since you have given
meaningful names to these values, you can now
code this.
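
The fruit basket example can be sketched as follows. The variable names my_apples, my_oranges and my_fruit are the ones used in this tutorial.

```r
# Store the number of apples and oranges
my_apples <- 5
my_oranges <- 6

# Add the two variables to get the total number of fruit
my_fruit <- my_apples + my_oranges
my_fruit   # [1] 11
```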
The my_apples and my_oranges variables both
contained a number in the previous example. The +
operator works with numeric variables in R. If you instead
assigned text values such as "apples" and "oranges" to the
variables my_apples and my_oranges, you would be trying
to assign the sum of two character values to the
variable my_fruit. This is not possible.
You will notice that the result is printed
in red. It means there is an error.
Read the error message and make sure
to understand why this did not work.
Basic Data Types in R

R works with numerous data types. Some of the most
basic types to get started are:

Decimal values like 3.6 are called numerics.
Natural numbers like 7 are called integers. Integers are also numerics.
Boolean values (TRUE or FALSE) are called logical.
Text (or string) values are called characters.



What’s that Data Type?

Do you remember that when you added “five” + “six”,
you got an error due to a mismatch in data types?
You can avoid such embarrassing situations by
checking the data type of a variable beforehand. You
can do this with the class( ) function.
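
A short sketch of class( ) on each basic type is shown below. The variable names are illustrative assumptions. Note that R stores a number typed as 7 as a numeric; writing 7L asks R to store it as an integer.

```r
# Illustrative variables, one per basic data type
my_numeric <- 3.6
my_integer <- 7L       # the L suffix stores the value as an integer
my_logical <- TRUE
my_character <- "five"

class(my_numeric)     # "numeric"
class(7)              # "numeric" (no L suffix)
class(my_integer)     # "integer"
class(my_logical)     # "logical"
class(my_character)   # "character"
```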
Loading data
There are different ways you may want to get data into RStudio:
Loading Data from a Google Doc
1. From within the google spreadsheet, click File -> Publish to Web ->
Start Publishing.
2. Type google.doc("key"), where key should be replaced with everything
in between key = and # in the link for the google doc.
Loading Data from the Textbook
1. Find the name of the dataset you want to access as it’s written in bold
in the textbook, for example, tallysheet, and type data(tallysheet).
Loading Data from a Spreadsheet on your Computer
1. From your spreadsheet editing program (Excel, Google Docs, etc.) save
your spreadsheet as a .csv (Comma Separated Values) or .xlsx (Excel
Workbook) file on your computer.
2. In the top right panel, click Import Dataset, From Text File, then
choose the dataset you want to import.
Manually Typing Data

Create a vector. Vectors are one-dimensional arrays
that can hold numeric data, character data, or
logical data. In other words, a vector is a simple tool
to store data. For example, you can store your daily
gains and losses in the casinos.

In R, you create a vector with the combine function
c( ).
You place the vector elements,
separated by commas,
between the parentheses.
The quotation marks
indicate that “a”, “b”, “c”,
“d” are characters.
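
A minimal sketch of creating vectors with c( ) follows. The character elements come from the text; the numeric and logical values, and all three variable names, are assumptions for illustration.

```r
# Character vector: the quotation marks mark each element as text
character_vector <- c("a", "b", "c", "d")

# Numeric and logical vectors (values chosen only for illustration)
numeric_vector <- c(1, 10, 49)
boolean_vector <- c(TRUE, FALSE, TRUE)

character_vector
class(character_vector)   # "character"
```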
Example
After one week in City of Dreams Manila and still zero Ferraris in
your garage, you decide that it is time to start using your data
analytical superpowers.
Before doing a first analysis, you decide to first collect all the
winnings and losses for the last week:
Day         Poker           Roulette
Monday      won P14,000     lost P2,400
Tuesday     lost P5,000     lost P5,000
Wednesday   won P2,000      won P10,000
Thursday    lost P12,000    lost P35,000
Friday      won P24,000     won P1,000
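
The week of results can be stored as two vectors, with winnings as positive numbers and losses as negative numbers. This is a sketch using the names poker_vector and roulette_vector that appear later in this tutorial.

```r
# Winnings (positive) and losses (negative), Monday to Friday
poker_vector <- c(14000, -5000, 2000, -12000, 24000)
roulette_vector <- c(-2400, -5000, 10000, -35000, 1000)

poker_vector
roulette_vector
```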
Naming a vector

In the previous exercise, we created a vector
with your winnings over the week. Each vector
element refers to a day of the week but it is hard
to tell which element belongs to which day. It
would be nice if you could show that in the
vector itself.
You can give a name to the elements of a vector
with the names( ) function.
This code first creates a vector and then
gives the two elements a name. The
first element is assigned the name
Name, while the second element is
labeled Profession. Printing the
contents to the console shows each
name above its value.
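
A sketch of such code is given below. The element names Name and Profession come from the text; the vector name some_vector and its two values are assumptions, since the original screenshot is not available here.

```r
# Create a vector with two elements (values are hypothetical)
some_vector <- c("John Doe", "poker player")

# Give the two elements a name
names(some_vector) <- c("Name", "Profession")

# Print the contents to the console
some_vector
```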
In the previous exercises you probably
experienced that it is boring and frustrating to
type and retype information such as the days of
the week. However, when you look at it from a
higher perspective, there is a more efficient way
to do this, namely, to assign the days of the week
vector to a variable!

Create a variable that contains the days of the
week.
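
One way to do this is sketched below. The name days_vector is an assumption; the poker and roulette values are the ones from the casino example.

```r
poker_vector <- c(14000, -5000, 2000, -12000, 24000)
roulette_vector <- c(-2400, -5000, 10000, -35000, 1000)

# Store the days of the week once and reuse them
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

poker_vector
```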
Now that you have the poker and roulette winnings nicely
as named vectors, you can start doing some data analytical
magic.

You want to find out the following type of information:

How much has been your overall profit or loss per day of
the week?

Have you lost money over the week in total?

Are you winning/losing money on poker or on roulette?

To get the answers, you have to do arithmetic calculations
on vectors.
You can also do the
calculations with variables
that represent vectors.
First, you need to understand what the overall
profit or loss per day of the week was. The total
daily profit is the sum of the profit/loss you
realized on poker per day, and the profit/loss you
realized on roulette per day.

In R, this is just the sum of poker_vector and
roulette_vector.
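
Adding two vectors of the same length adds them element by element, as this sketch shows. The name total_daily is the one used later in this tutorial.

```r
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
poker_vector <- c(14000, -5000, 2000, -12000, 24000)
roulette_vector <- c(-2400, -5000, 10000, -35000, 1000)
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Overall profit or loss per day: element-wise sum of the two vectors
total_daily <- poker_vector + roulette_vector
total_daily
# Monday 11600, Tuesday -10000, Wednesday 12000, Thursday -47000, Friday 25000
```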
Based on the previous analysis, it looks like you
had a mix of good and bad days. This is not
what you expected, and you wonder if there may
be a very tiny chance you have lost money over
the week in total.

A function that helps you answer this question is
sum( ). It calculates the sum of all elements of a
vector.
Now that you have the totals
for roulette and poker, you can
easily calculate total_week
(which is the sum of all gains
and losses of the week).
Another way to compute
the total winnings is to
get the sum of total_daily.
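
Both ways of computing the weekly total are sketched below. The names total_poker and total_roulette are assumptions; total_week and total_daily are the names used in the text.

```r
poker_vector <- c(14000, -5000, 2000, -12000, 24000)
roulette_vector <- c(-2400, -5000, 10000, -35000, 1000)

# Totals per game
total_poker <- sum(poker_vector)        # 23000
total_roulette <- sum(roulette_vector)  # -31400

# Total for the week
total_week <- total_poker + total_roulette
total_week                              # -8400

# Same result from the daily totals
total_daily <- poker_vector + roulette_vector
sum(total_daily)                        # -8400
```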
It seems like you are losing
money. A possible explanation
might be that your skills in roulette
are not as well developed as your
skills in poker, which is why your total
gains in poker are higher than in
roulette.
Vector selection

Our goal is to select specific elements of the
vector. To select elements of a vector you can
use square brackets [ ]. Between the square
brackets, you indicate what elements to select.
We create a new variable
poker_wednesday. Assign the poker
results of Wednesday to the variable
poker_wednesday. Wednesday is the
third element of poker_vector, and can
thus be selected with poker_vector [3].
If you want to select the first element
of the vector, you type poker_vector [1].
To select the last element of the vector,
you type poker_vector [5].
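
These single-element selections can be sketched as follows, using the poker results from the casino example.

```r
poker_vector <- c(14000, -5000, 2000, -12000, 24000)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Wednesday is the third element
poker_wednesday <- poker_vector[3]
poker_wednesday    # Wednesday 2000

poker_vector[1]    # first element: Monday 14000
poker_vector[5]    # last element: Friday 24000
```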
To select multiple
elements from a vector,
add square brackets
after its name and
indicate between the
brackets which elements
should be selected, using
a vector of positions. For
example, to select the
first and the fifth element
of poker_vector, type
poker_vector[c(1, 5)].
Assign the poker results
of Tuesday, Wednesday
and Thursday to the
variable poker_midweek.
Selecting multiple elements of poker_vector
with c(2, 3, 4) is not very convenient. Many
data analysts are lazy people by nature, so they
created an easier way to do this: c(2, 3, 4) can
be abbreviated to 2:4, which generates a
vector with all natural numbers from 2 up to 4.
Assign to roulette_selection_vector
the roulette results from Tuesday
up to Friday.
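
The multiple-element selections described above can be sketched like this, with the variable names used in the text.

```r
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
poker_vector <- c(14000, -5000, 2000, -12000, 24000)
roulette_vector <- c(-2400, -5000, 10000, -35000, 1000)
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# First and fifth element
poker_vector[c(1, 5)]

# Tuesday, Wednesday and Thursday
poker_midweek <- poker_vector[c(2, 3, 4)]

# Tuesday up to Friday, using the colon shortcut
roulette_selection_vector <- roulette_vector[2:5]

poker_midweek
roulette_selection_vector
```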
Another way to
select elements
from a vector is by
using the names of
the vector elements
instead of numeric
position.
Select the first three elements in
poker_vector by using their
names: “Monday”, “Tuesday” and
“Wednesday”. Assign the result of
the selection to poker_start.
To calculate the average of the
elements in poker_start, we
can use the mean ( ) function.
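
Selecting by name and averaging the result can be sketched as follows.

```r
poker_vector <- c(14000, -5000, 2000, -12000, 24000)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Select the first three days by name instead of by position
poker_start <- poker_vector[c("Monday", "Tuesday", "Wednesday")]
poker_start

# Average of the selected elements: (14000 - 5000 + 2000) / 3
mean(poker_start)
```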
The (Logical) Comparison Operators

< for less than
> for greater than
<= for less than or equal
>= for greater than or equal
== for equal to each other
!= not equal to each other
This command tests for every
element of the vector if the condition
stated by the comparison operator is
TRUE or FALSE.
To check which elements in
poker_vector are positive (or > 0)
we can use the logical
comparison operators.
Monday, Wednesday and
Friday are the elements in
poker_vector that are positive.
This command tests which
elements in poker_vector
are positive (or > 0) and
assigns the resulting logical
vector to selection_vector.
The printout tells you whether
you won (TRUE) or lost (FALSE)
any money for each day.
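
A sketch of the comparison on poker_vector is shown below.

```r
poker_vector <- c(14000, -5000, 2000, -12000, 24000)
names(poker_vector) <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")

# Test every element: is it greater than 0?
selection_vector <- poker_vector > 0
selection_vector
# Monday TRUE, Tuesday FALSE, Wednesday TRUE, Thursday FALSE, Friday TRUE
```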
Working with comparisons will make your data
analytical life easier. Instead of selecting a subset
of days to investigate yourself (like before), you
can simply ask R to return only those days where
you realized a positive return for poker.

In the previous exercises you used
selection_vector <- poker_vector > 0 to find the
days on which you had a positive poker return.
Now, you would like to know not only the days on
which you won, but also how much you won on
those days.
R knows what to do when you
pass a logical vector in square
brackets: it will only select the
elements that correspond to
TRUE in selection_vector.
Use selection_vector in square
brackets to assign the amounts that
you won on the profitable days to the
variable poker_winning_days.
Just like you did for
poker, you also want to
know those days where
you realized a positive
return for roulette. Use
selection_vector_r to
indicate that it is for
roulette. Also, assign
the amount that you
won on the profitable
days to the variable
roulette_winning_days.
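
Selection with a logical vector can be sketched as follows for both games, using the variable names given in the text.

```r
days_vector <- c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday")
poker_vector <- c(14000, -5000, 2000, -12000, 24000)
roulette_vector <- c(-2400, -5000, 10000, -35000, 1000)
names(poker_vector) <- days_vector
names(roulette_vector) <- days_vector

# Poker: keep only the elements where the comparison is TRUE
selection_vector <- poker_vector > 0
poker_winning_days <- poker_vector[selection_vector]
poker_winning_days      # Monday 14000, Wednesday 2000, Friday 24000

# Roulette: same idea
selection_vector_r <- roulette_vector > 0
roulette_winning_days <- roulette_vector[selection_vector_r]
roulette_winning_days   # Wednesday 10000, Friday 1000
```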
What’s a Factor?

The term factor refers to a statistical data type used to store
categorical variables. The difference between a categorical
variable and a continuous variable is that a categorical
variable can belong to a limited number of categories. A
continuous variable, on the other hand, can correspond to
an infinite number of values.

It is important that R knows whether it is dealing with a
continuous or a categorical variable, as the statistical
models you will develop in the future treat both types
differently.
To create factors in R, you make use of the
function factor( ). The first thing that you have to
do is create a vector that contains all the
observations that belong to a limited number
of categories. For example, sex_vector
contains the sex of 5 different individuals.
Convert the character vector sex_vector
to a factor with factor( ) and assign the
result to factor_sex_vector. Print out the
result.
It is clear that there are two categories,
or in R-terms 'factor levels', at work
here: Male and Female.
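
A sketch of this conversion follows. The five values in sex_vector are an assumption, because the original slide image is not reproduced in the text.

```r
# Sex of 5 individuals (values assumed for illustration)
sex_vector <- c("Male", "Female", "Female", "Male", "Male")

# Convert the character vector to a factor
factor_sex_vector <- factor(sex_vector)
factor_sex_vector

# The factor levels are sorted alphabetically by default
levels(factor_sex_vector)   # "Female" "Male"
```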
Nominal and Ordinal

There are two types of categorical variables: a nominal
categorical variable and an ordinal categorical variable.
A nominal variable is a categorical variable without an
implied order. This means that it is impossible to say that
'one is worth more than the other'. Here, it is impossible to
say that one stands above or below the other.
In contrast, ordinal variables do have a natural ordering.
Consider for example the categorical variable temperature
with the categories: “Low”, “Medium” and “High”. Here it is
obvious that “Medium” stands above “Low”, and “High”
stands above “Medium”.
To create an ordered factor, you have to add two additional arguments:
ordered and levels. By setting the argument ordered to TRUE in the
function factor ( ), you indicate that the factor is ordered. With the
argument levels you give the values of the factor in the correct order.
I asked 10 students if they like
watching television. Seven of
them answered “Most of the
Time”, two of them answered
“Sometimes” and only one
student answered the question
as “Hardly Ever”. Create an
ordered factor, based on the
information given.
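
One possible solution is sketched below. The names tv_vector and factor_tv_vector are assumptions, and so is the level order, which runs from least to most frequent viewing.

```r
# 7 x "Most of the Time", 2 x "Sometimes", 1 x "Hardly Ever"
tv_vector <- c(rep("Most of the Time", 7), rep("Sometimes", 2), "Hardly Ever")

# Ordered factor: levels listed from lowest to highest
factor_tv_vector <- factor(tv_vector,
                           ordered = TRUE,
                           levels = c("Hardly Ever", "Sometimes", "Most of the Time"))
factor_tv_vector
```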
Summarizing a Factor

Going back to our activity, you would like
to know how many “Male” responses you
have in your study, and how many
“Female” responses. The summary( )
function gives you the answer to this
question.
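
As a sketch, summary( ) applied to a factor returns the count for each level. The values of sex_vector are again an assumption. Note that R is case sensitive, so the function must be typed in lowercase.

```r
# Assumed example data: sex of 5 individuals
sex_vector <- c("Male", "Female", "Female", "Male", "Male")
factor_sex_vector <- factor(sex_vector)

# Count the responses per category
summary(factor_sex_vector)
# Female   Male
#      2      3
```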
Using the data studentsurvey, we can
determine how many “Male” and “Female”
responses we have in our study.

To do this, first we need to import our
data file in RStudio and use the command
attach( ).
To remove a vector or data set from the
environment pane, use the remove( )
function.
To Determine the Structure of your Data Set

A function that gives you a rapid overview of your data is
str( ). The function str( ) shows you the structure of
your data set. For a data frame it tells you:
The total number of observations
The total number of variables
A full list of the variable names
The data type of each variable
The first observations
Investigate the structure of mtcars. Just
type mtcars in RScript and the sample
dataset will appear in the console pane.
Applying the str( ) function will often be the first
thing that you do when receiving a new data set or
data frame. It is a great way to get more insight in
your data set before diving into the real analysis.
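
As a short sketch, str( ) can be applied to mtcars, a sample data set that ships with R.

```r
# Show the structure of the built-in mtcars data frame
str(mtcars)

# The same counts are available separately
nrow(mtcars)   # number of observations: 32
ncol(mtcars)   # number of variables: 11
```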
Creating a Data Frame

You construct a data frame with the data.frame( )
function. As arguments, you pass the vectors from
before: they will become the different columns of
your data frame. Because every column has the same
length, the vectors you pass should also have the
same length. But don't forget that it is possible (and
likely) that they contain different types of data.
Use the function data.frame( ) to construct a data
frame. Call the resulting data frame Description.
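
A minimal sketch follows. The data frame name Description comes from the text; the three column vectors and their values are hypothetical, chosen to show that columns may hold different data types.

```r
# Three vectors of equal length (hypothetical values)
name <- c("Ana", "Ben", "Carla")
age <- c(21, 25, 30)
employed <- c(TRUE, FALSE, TRUE)

# Each vector becomes one column of the data frame
Description <- data.frame(name, age, employed)
Description
str(Description)
```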
Extract a Variable from a Dataset

If you want to extract a particular variable from a
dataset, use dataname$variablename.
Subsetting a Dataset

To take a subset from a dataset, first create a
new dataset and use the subset command
subset(dataname, condition)
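
Both operations are sketched below on the built-in mtcars data set. The choice of the variables mpg and cyl, and the name four_cyl, are assumptions for illustration.

```r
# Extract one variable with dataname$variablename
mtcars$mpg

# Create a new dataset holding only the cars with 4 cylinders
four_cyl <- subset(mtcars, cyl == 4)
nrow(four_cyl)   # 11
```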
Sorting

In data analysis you can sort your data according to
a certain variable in the data set. In R, this is done
with the help of the function order( ).

order( ) is a function that gives you the ranked
position of each element when it is applied on a
variable, such as a vector.
10, which is the second element in money_vector, is
the smallest element, so 2 comes first in the output
of order(money_vector). 100, which is the first
element in money_vector is the second smallest
element, so 1 comes second in the output of
order(money_vector).
Use the output of order(money_vector)
to reshuffle money_vector. Use the
command vector[order( )].
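
A sketch of this sorting step follows. The text only states that the first element of money_vector is 100 and the second is 10; the third value, 1000, is an assumption.

```r
# First two values are from the text; the third is assumed
money_vector <- c(100, 10, 1000)

# Positions of the elements from smallest to largest
order(money_vector)                 # 2 1 3

# Use those positions to reshuffle the vector
money_vector[order(money_vector)]   # 10 100 1000
```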
How to compute MEAN, MEDIAN and MODE?

MEAN
It is calculated by taking the sum of the values and dividing by
the number of values in a data series.

The function mean( ) is used to calculate this in R.

If there are missing values, then the mean function returns NA.

To drop the missing values from the calculation use na.rm =
TRUE, which means remove the NA values.

MEDIAN

The middle most value in a data series is called the median.

The median( ) function is used in R to calculate this value.

To drop the missing values from the calculation use na.rm =
TRUE, which means remove the NA values.

MODE

The number which appears most often in a set of numbers is
called the mode.

There is no function in base R to find the mode of a set of numbers, so
we will use this command, where x is the vector of data values:

names(table(x))[table(x) == max(table(x))]
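
The three measures can be sketched on a small vector. The data values in x are hypothetical and include one missing value to show the effect of na.rm.

```r
# Hypothetical data with one missing value
x <- c(2, 4, 4, 7, NA)

mean(x)                  # NA, because of the missing value
mean(x, na.rm = TRUE)    # 4.25
median(x, na.rm = TRUE)  # 4

# Mode: table() counts each value (NA is ignored by default);
# the result is returned as text because it comes from names()
names(table(x))[table(x) == max(table(x))]   # "4"
```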


How to compute VARIANCE and STANDARD
DEVIATION?

VARIANCE
It is a numerical measure of how the data values are dispersed
around the mean.

The var( ) function is used in R to calculate this value.

To drop the missing values from the calculation use na.rm =
TRUE, which means remove the NA values.

STANDARD DEVIATION
It is the square root of its variance.

The sd ( ) function is used in R to calculate this value.

To drop the missing values from the calculation use na.rm =
TRUE, which means remove the NA values.
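
A short sketch of var( ) and sd( ) follows, using hypothetical data with one missing value. Both functions compute the sample versions, which divide by n - 1.

```r
# Hypothetical data with one missing value
x <- c(2, 4, 4, 7, NA)

var(x)                 # NA, because of the missing value
var(x, na.rm = TRUE)   # 4.25
sd(x, na.rm = TRUE)    # square root of the variance

# Check that the standard deviation is the square root of the variance
all.equal(sd(x, na.rm = TRUE), sqrt(var(x, na.rm = TRUE)))   # TRUE
```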
Inferential Statistics
Parametric Test
Non-Parametric Test

Methods in Assessing Normality

Graphical
Normal Q-Q Plot
Histogram

Numerical
Shapiro-Wilk Test
EXAMPLE:

Construct a normal Q-Q plot and
histogram of these data. Diameters of 36
rivet heads in 1/100 of an inch:
Normal Q-Q Plot

To construct a normal Q-Q plot

qqnorm(x)
qqline(x)

“x” is a numeric vector of data values

Histogram

To construct a histogram

hist(x, probability = TRUE)

“x” is a numeric vector of data values
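
The two graphical checks can be sketched as follows. The 36 rivet diameters are not reproduced in this text, so the vector x below is simulated placeholder data (an assumption), not the original measurements.

```r
# Placeholder data: 36 simulated values standing in for the rivet diameters
set.seed(1)
x <- rnorm(36, mean = 6.7, sd = 0.1)

# Normal Q-Q plot with a reference line
qqnorm(x)
qqline(x)

# Histogram on the density scale
hist(x, probability = TRUE)
```

Points that fall close to the reference line, and a roughly bell-shaped histogram, suggest that the data are approximately normal.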


Shapiro-Wilk Test

To compute the Shapiro-Wilk Test

shapiro.test(x)

“x” is a numeric vector of data values


The hypotheses used are:
Ho: The sample data follow a normal distribution.
Ha: The sample data do not follow a normal distribution.

When we are testing normality:

• If P value > alpha, do not reject Ho: the data can be treated as normal.
• If P value ≤ alpha, reject Ho: the data are NOT normal.
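
The decision rule can be sketched in R as shown below. The data are the ten book-spending amounts from the confidence interval example in this tutorial; the names money, alpha and result are assumptions. Note that the function name is written in lowercase.

```r
# Amount of money spent on books by 10 students
money <- c(125, 225, 154, 150, 125, 220, 195, 90, 123, 145)
alpha <- 0.05

# Shapiro-Wilk test of normality
result <- shapiro.test(money)
result

# Apply the decision rule
if (result$p.value > alpha) {
  print("Do not reject Ho: the data can be treated as normal")
} else {
  print("Reject Ho: the data are not normal")
}
```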
Based on the result of p value, the sample
data follows a normal distribution.
One Sample t-Test

To compute the confidence interval:

t.test(x)

t.test(x, conf.level = 0.95)

# x is a numeric vector of data values
EXAMPLE:

Suppose we would like to estimate the mean amount
of money spent on books by BS Statistics students in
a semester. We have data from 10 randomly selected
students. Construct a 95% confidence interval.

Student No.   Amount of Money
1             125
2             225
3             154
4             150
5             125
6             220
7             195
8             90
9             123
10            145
Based on the result, the p-value is greater
than the level of significance of 0.05, therefore
the sample data follow a normal distribution.
If we use the t.test( ) command listing only the
data name, we get a 95% confidence interval for
the mean after the significance test.
The t.test( ) command can also be used to find
confidence intervals with levels of confidence different
from 95%. We can specify the desired level of
confidence using the conf.level argument.
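
A sketch for the book-spending example follows. The names money, ci95 and ci90 are assumptions, and the 90% level in the second call is only an illustration of the conf.level argument.

```r
# Amount of money spent on books by 10 students
money <- c(125, 225, 154, 150, 125, 220, 195, 90, 123, 145)

# Default call: reports a 95% confidence interval for the mean
t.test(money)
ci95 <- t.test(money)$conf.int
ci95

# A different confidence level, here 90%
ci90 <- t.test(money, conf.level = 0.90)$conf.int
ci90
```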
EXAMPLE:

A corporation monitors time spent by office
workers browsing the web on their computers
instead of working. From a sample of the computer
records of 15 workers, construct a 99.5%
confidence interval for the mean time spent by
selected office workers in browsing the web in
an eight-hour day.

Employee No.            1  2  3  4  5  6  7  8  9 10 11 12 13 14 15
Time Spent (minutes)   43 27 39 28 45 39 25 14 40 27 32 35 42 16 41
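
This example can be sketched as follows. The variable names minutes and ci are assumptions.

```r
# Minutes spent browsing the web by 15 workers
minutes <- c(43, 27, 39, 28, 45, 39, 25, 14, 40, 27, 32, 35, 42, 16, 41)

# 99.5% confidence interval for the mean
result <- t.test(minutes, conf.level = 0.995)
result
ci <- result$conf.int
ci
```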
One Sample t-Test

The one-sample t-test is used to compare the
mean of one sample to a known standard (or
theoretical/hypothetical) mean (μ0). The possible
alternative hypotheses are:

Population mean μ < Specified value μ0 (μ < μ0)
Population mean μ > Specified value μ0 (μ > μ0)
Population mean μ ≠ Specified value μ0 (μ ≠ μ0)
One Sample t-Test

To compute the one sample t-test:

t.test(x, mu = μ0, alternative = "less", conf.level = 0.95)
t.test(x, mu = μ0, alternative = "greater", conf.level = 0.95)
t.test(x, mu = μ0, alternative = "two.sided", conf.level = 0.95)
# x is a numeric vector; replace μ0 with the specified value
EXAMPLE:

1. The researcher is particularly concerned with
the pulse rate of the patients who take the
medication. The researcher wants to test
whether the pulse rate will be different from
the mean pulse rate of 82 beats per minute.
State the relevant null and alternative
hypothesis.
Answer:

1. H0 : μ = 82 and Ha: μ ≠ 82
EXAMPLE:

2. A chemist invents an additive to increase
the life of an automobile battery. The mean
lifetime of the battery is 36 months. State
the relevant null and alternative
hypothesis.
Answer:

2. H0 : μ ≤ 36 and Ha : μ > 36
EXAMPLE:

3. A contractor wishes to lower heating bills
by using a special type of insulation in
houses. If the average of the monthly
heating bills is P780, what is the hypothesis
about heating costs? State the relevant null
and alternative hypothesis.
Answer:

3. H0 : μ ≥ P780 and Ha: μ < P780


EXAMPLE:

A random sample of 10 fifth grade pupils has grades in
English, where marks range from 1 (worst) to 6
(excellent). The grade point average (GPA) of all fifth
grade pupils of the last six years is 4.5. Is the GPA of
the 10 pupils different from the population's GPA?
Use 0.01 level of significance.

Student        1   2   3    4   5   6   7   8   9   10
Grade points   5   6   4.5  5   5   6   5   5   5   5.5
H0 : μ = 4.5
Ha : μ ≠ 4.5
Reject the null hypothesis because the computed p
value is less than the alpha level (0.01). Therefore,
the grade point average of the 10 pupils is different
from the population's GPA.
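
The test for this example can be sketched as follows. The names grades and result are assumptions; conf.level = 0.99 matches the 0.01 level of significance.

```r
# Grade points of the 10 pupils
grades <- c(5, 6, 4.5, 5, 5, 6, 5, 5, 5, 5.5)

# Two-sided one sample t-test against the population GPA of 4.5
result <- t.test(grades, mu = 4.5, alternative = "two.sided", conf.level = 0.99)
result

# Decision at the 0.01 level of significance
result$p.value < 0.01   # TRUE, so reject the null hypothesis
```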
EXAMPLE:

Suppose that the teacher of a school claims that the
average weight of the student population is greater
than 140 lb, and we desire to test the truth of this
claim. We have the weights of a random sample of 6
students from the school's student population. Use a 0.05
level of significance.

Student   1    2    3    4    5    6
Weight    135  119  106  135  180  108
Ho : μ ≤ 140
Ha : μ > 140
Do not reject the null hypothesis because
the computed p value is greater than the
alpha level (0.05). Therefore, there is not
enough evidence to support the teacher's
claim that the average weight of the
students is greater than 140 lb.
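
A sketch of this right-tailed test follows. The names weights and result are assumptions.

```r
# Weights (lb) of the 6 sampled students
weights <- c(135, 119, 106, 135, 180, 108)

# One sample t-test with Ha: mu > 140
result <- t.test(weights, mu = 140, alternative = "greater", conf.level = 0.95)
result

# Decision at the 0.05 level of significance
result$p.value > 0.05   # TRUE, so do not reject the null hypothesis
```

The sample mean is 130.5 lb, which is below 140, so the data cannot support a claim that the mean is greater than 140.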
Independent Sample t - Test
The independent sample t - test allows researchers
to evaluate or to compare the mean difference between
two populations using the data from two separate
samples. It is used to test whether population means
are significantly different from each other, using the
means from randomly drawn samples.
H0 : μ1 − μ2 ≥ 0 and Ha : μ1 − μ2 < 0

H0 : μ1 − μ2 ≤ 0 and Ha : μ1 − μ2 > 0
H0 : μ1 − μ2 = 0 and Ha : μ1 − μ2 ≠ 0
Independent Sample t-Test

To compute the independent sample t-test:

t.test(a, b, mu = 0, alternative = "less",
var.equal = TRUE, conf.level = 0.95)
EXAMPLE:

Suppose we put people on 2 diets: the fruit diet and
the bread diet. Participants are randomly assigned to
either 7-days of eating exclusively fruits or 7-week of
exclusively eating bread. At the end of the day, we
measure weight gain by each participant. Does the bread
diet cause more weight gain than the fruit diet? Test
the claim using 10% level of significance.

Fruit Diet   3 4 4 4 5 6 6
Bread Diet   1 2 2 2 3 4 4
Before we proceed to the t.test( ) command, we must
first check whether the variances are homogeneous.
Use the var.test( ) command for Fisher's F-test.
The hypotheses used are:
Ho: Equal Variances Assumed
Ha: Equal Variances Not Assumed

When we are testing homogeneity of variance:

• If P value > alpha, we can assume equal variances.
• If P value ≤ alpha, we cannot assume equal variances.

We obtained a p-value greater than 0.10, so the two
variances are homogeneous.
H0 : μ1 − μ2 ≥ 0 and Ha : μ1 − μ2 < 0
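
Both steps can be sketched as follows. The variable names are assumptions, and the sketch assumes the fruit diet is sample 1 and the bread diet is sample 2, so that Ha : μ1 − μ2 < 0 expresses the claim that bread causes more weight gain.

```r
# Weight gain per participant
fruit <- c(3, 4, 4, 4, 5, 6, 6)
bread <- c(1, 2, 2, 2, 3, 4, 4)

# Step 1: F-test for equal variances
f_result <- var.test(fruit, bread)
f_result
f_result$p.value > 0.10   # TRUE, so equal variances can be assumed

# Step 2: independent sample t-test, Ha: mean(fruit) - mean(bread) < 0
t_result <- t.test(fruit, bread, mu = 0, alternative = "less",
                   var.equal = TRUE, conf.level = 0.90)
t_result
t_result$p.value > 0.10   # TRUE, so do not reject the null hypothesis
```

In these data the fruit group gained more on average than the bread group, so the sample does not support the claim.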
EXAMPLE:

A random sample of 11 students sat a chemistry
examination consisting of one theory paper and one
practical paper. Their marks out of 100 are given in
the table below.

Theory      30 42 49 50 63 38 43 46 54 42 26
Practical   52 58 42 67 94 68 22 34 55 48 17

Test the hypothesis that there is a difference in mean
mark on the two papers at the 5% level of
significance.
Dependent Sample t - Test
The dependent sample t-test (also called
the paired t-test or paired-samples t-test)
compares the means of two related groups to
determine whether there is a statistically
significant difference between these means.
H0 : μ1 − μ2 ≥ 0 and Ha : μ1 − μ2 < 0

H0 : μ1 − μ2 ≤ 0 and Ha : μ1 − μ2 > 0
H0 : μ1 − μ2 = 0 and Ha : μ1 − μ2 ≠ 0
Dependent Sample t-Test
To compute the dependent sample t-test:

t.test(a, b, mu = 0, alternative = "less",
var.equal = TRUE, paired = TRUE,
conf.level = 0.95)
EXAMPLE:

A researcher is interested in whether a training course
increases or decreases the teaching performance of
the teachers who attended the training courses. The
data are shown below:

Case               1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
Before Training   85 84 86 87 89 82 80 84 86 82 89 87 82 81 86 89 89 84 85 88
After Training    95 98 97 92 96 93 94 95 90 82 97 98 95 95 92 91 94 95 96 97
H0 : μ1 − μ2 = 0 and Ha : μ1 − μ2 ≠ 0
It shows that the results of before and after training
have a significant difference because the p-value (reported
as 0.000, that is, less than 0.001) is less than the level of
significance (0.05).
Furthermore, the result tells us that the training is
effective because the teaching performance of the
teachers was significantly higher after the training
course than before. Therefore, we have sufficient
evidence to support the claim.
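
The paired test for this example can be sketched as follows. The variable names before, after and result are assumptions.

```r
# Teaching performance of the same 20 teachers
before <- c(85, 84, 86, 87, 89, 82, 80, 84, 86, 82,
            89, 87, 82, 81, 86, 89, 89, 84, 85, 88)
after  <- c(95, 98, 97, 92, 96, 93, 94, 95, 90, 82,
            97, 98, 95, 95, 92, 91, 94, 95, 96, 97)

# Two-sided dependent (paired) sample t-test
result <- t.test(before, after, mu = 0, alternative = "two.sided",
                 paired = TRUE, conf.level = 0.95)
result

# Mean of the differences (before minus after)
mean(before - after)    # -8.85
result$p.value < 0.05   # TRUE, so reject the null hypothesis
```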
Correlation Test

The correlation test is used to
evaluate the association between
two or more variables.
cor(x, y)
cor.test(x, y, method = "spearman")
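
As a sketch, both commands are applied below to the theory and practical marks from the chemistry example earlier in this tutorial; using that data here is an illustration choice, not part of the original example. The argument exact = FALSE is added because the marks contain ties, for which an exact Spearman p-value cannot be computed.

```r
# Marks of the 11 students on the two papers
theory    <- c(30, 42, 49, 50, 63, 38, 43, 46, 54, 42, 26)
practical <- c(52, 58, 42, 67, 94, 68, 22, 34, 55, 48, 17)

# Correlation coefficient (Pearson by default)
r <- cor(theory, practical)
r

# Spearman rank correlation test
result <- cor.test(theory, practical, method = "spearman", exact = FALSE)
result
```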
