# R Programming | R Language Tutorial | Data Science with R Course | Intellipaat

hey everyone welcome to the session by

Intellipaat in a world where we are surrounded by data concepts such as data

science and data analytics always take center stage in fact R

programming is one of the biggest assets of a data scientist and data analyst and in this session on R Language tutorial we’re going to check out everything there is to know about R

programming well before we begin with the session make sure to subscribe to

the Intellipaat’s YouTube channel and hit that Bell icon so that you never miss an

update from us here’s the agenda for today we’ll begin by checking out what

R programming actually is and after that we can check out everything there

is to know about variables and the type of operators present in R and followed

by this we can check out all the objects present in R as well and after this we

can take a quick introduction to the flow control statements and finally you

will have a complete hands-on project guide where you can build an entire book

recommendation system from scratch and guys if you have any queries make sure

to head down to the comment section below and do let us know and we’ll be

happy to help you out there also if you guys are looking for end-to-end Course

certification in R programming language Intellipaat provides the R

programming for data science program where you can learn all of these

concepts thoroughly and earn a certificate at the same time well

without further ado let’s begin the class so what is R so R is a language

developed by statisticians for statisticians so if you want to perform

any sort of statistical analysis R should be a go-to language now R is also

a great visualization tool so R provides packages such as ggplot2 and plotly so

that you can create stunning visualizations now R is an open source

cross-platform compatible software so it’s just plug and play all you have to

do is install R once and then you can start having fun with it and since R is

an open source software you can actually modify the code and add your own

innovations to it and being cross-platform compatible you can

actually run the same R code on different operating systems and the best

thing about R is that it is a Turing-complete language that is it can compute anything

which a Turing machine can that’s amazing isn’t it

and this is what makes R such a powerful tool so you can do various

operations such as statistical analysis implement machine learning algorithms

and also create stunning visualizations all with the help of the super language

called as R installing R so you can download R from Cran.r-project.org/

so let’s go to the site right guys so this is the CRAN

network and as you see we have different distributions of R respectively for

Linux Mac and Windows and since I’m using a Windows system I’ll download it

for Windows and I click over here install R for the first time and guys this is

the latest version of R at the time of recording 3.5.1 I click over here

download R 3.5.1 for Windows

right so we get this dialog box over here and I click on run and the download

would start after installing R we also require an IDE to make our tasks

easier so one such IDE is RStudio and we can download RStudio from

rstudio.com right guys so this is rstudio.com I’ll click on download

RStudio over here and this is the free version I’ll click on download again we

have different distributions for Ubuntu Windows Mac and Fedora again since I’m

using a Windows system I’ll install it for Windows I’ll click on run and the

download would start right so I’ve downloaded and installed both R and

RStudio now let’s open RStudio and have a look at the different windows

present right guys so this is RStudio and this is what RStudio looks like so as

you can see RStudio comprises four windows so this first window which you

see over here is the script window as the name states you can do all of your

scripting over here so let’s say I just type a equals three

and then I’ll type print a right so all of your scripting goes over here and if I

want to let’s say add a new script over here I need to click on the plus symbol

over here I’ll select R script and we get a new script similarly if I want to

add another script I just need to click over here and we have another new script

over here ok and if I want to save what I’ve written over here I need to click

over here and this will be saved let me give this a name so I’ll just give it

some random name I’ll save this right so I’ve saved this file as sgds d dot R and

if you were to run or execute the code you have this Run button over here so

let me select these two lines I’ll click on run so when we click on run the code

goes to the console window and gets executed
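
The two lines typed in the script window can be sketched as below. This is a minimal sketch, assuming the second line prints the object `a` that the first line creates.

```r
# assign the value 3 to an object called a
a = 3

# print the object; in RStudio the output shows up in the console window
print(a)
```

Selecting both lines and clicking Run sends them to the console one after the other.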

so guys this was the script window now we can actually use the console window

here directly to run our commands so I will give some commands over here let me

just run some basic mathematical operations so I will type 8 plus 5 right

so this gives us 13 similarly I will type 10 minus 5 which gives us 5 right

so you can directly use the console window to run all of your commands and

over here what we see is the environment window so this gives us a glance of all

the objects you have stored right so till now we just have one object with

the name a and that is what we see over here so we have an object a whose value

is 3 now similarly let’s say I add another object B over here B equals 10

and I’ll print B let me execute these two lines of code right so as soon as I

executed these two lines of code what we see is another object has been added in

this environment window right so initially we just had the object a after

executing these two lines of code another object B has been added along

with its value and now if I want to clear all of these I can do that by

clicking here right so this will remove all of these

objects over here ok and then we have the history section so this gives us a

list of all the commands which we have implemented till now right so this is

just a history of all the commands which we had executed and we have this final

window at the bottom right corner so this window is for installing new

packages visualizing plots and accessing help whenever needed right so let me

give you guys an example so let me make a histogram so I will type hist and I’ll

type in lynx which is a built-in dataset so this is the visualization which we get

over here right similarly if you want to install a new package I’ll click over

here I click on install and let’s say the package which I’d want to install

would be tree I’ll click here and when I click on

install the package installation would start and then you have help so whenever we

need any sort of help we can use this window so let’s say there is a data set

called iris which is built into R if you want to know information about

this data set all I need to do is type the name of the data set and I’ll hit enter

right so you get all of the information with respect to this data over here so

guys this is all about RStudio so now that

we’ve installed R and RStudio it’s finally time to get on with the

practical part so we’ll start by reading data into R now R offers

different functions to read different data formats if you want to read a text

file you can use the read.table function for comma separated files you

can use the read.csv function for JSON files you have the fromJSON

function from packages such as jsonlite and similarly if you want to read data from HTML tables the XML package provides

a function called readHTMLTable so you can read many formats of data into R and

since our customer churn data is of .csv type I’ll be using the read.csv

function so let’s head on to RStudio so this is RStudio guys so I’ll

use the read.csv function to read the customer churn data set all right

I’ll put in the double quotes now this is the data set guys this is the

dot CSV file I click on properties and I’ll copy the path from over here

and I’ll paste the path now you need to keep in mind that this is actually a

forward slash when you’re giving the path right

and after giving the path I’ll give the name of the file which will be

customer churn dot CSV and I will store this in a new object and I’ll name the

object customer churn as well right so this what you see is actually

an assignment operator so I’m reading this path and I’m saving this file into

an object called customer churn right now let me have a glance at this data set
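
The read step can be sketched as follows. This is only an illustration: the real file is not available here, so the sketch writes a tiny made-up table to a temporary .csv first, and the column names, the object name `customer_churn` and the direction of the arrow are assumptions.

```r
# toy stand-in for the customer churn file (all values are made up)
toy <- data.frame(
  customerID = c("A-001", "A-002", "A-003"),
  gender     = c("Female", "Male", "Male"),
  tenure     = c(1L, 34L, 2L)
)

# write it to a temporary .csv so the example is self-contained
csv_path <- file.path(tempdir(), "customer_churn.csv")
write.csv(toy, csv_path, row.names = FALSE)

# on Windows a path copied from Properties needs forward slashes,
# e.g. "C:/some/folder/customer_churn.csv" (hypothetical path)
# the -> operator stores the value on its left in the object on its right
read.csv(csv_path) -> customer_churn

# in RStudio, View(customer_churn) opens the spreadsheet-style viewer;
# print() is used here so the sketch also runs outside RStudio
print(customer_churn)
```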

so what I’ll do is I’ll use the View function so I’ll type View and I’ll type

out customer churn so this is our customer churn data set

guys now let me use the class function to have a look at the type of this

object so I’ll use class and I’ll give the name of the object which is customer

churn so what we see is this object is of type data frame that is this is

actually a data frame now if you’re coming from the SQL background you can

say that a data frame is actually sort of a table so over here each of these

columns holds values of a single data type and each row represents a record of this

data frame right so let’s say if I select this column tenure then this will

be of one particular data type so let me go ahead and see what is the type of

this so I’ll use class and I’ll type customer churn dollar

tenure so we see that this is of integer type that is

this tenure column is of integer type now another thing that you need to keep

in mind is that all of these columns are actually vectors and factors that is a

vector is a set of homogeneous entities and a factor represents categorical

values so now let me type out class and what I’ll do is I’ll type out customer

churn dollar gender so I get that the class of this object is factor
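
The class checks can be reproduced on a toy data frame, as in the sketch below (the values are made up). One caveat worth knowing: since R 4.0.0, `read.csv` and `data.frame` keep text columns as character by default, so `stringsAsFactors = TRUE` is set explicitly here to get the factor column described in the session, which was recorded on R 3.5.1.

```r
# toy data frame with one text column and one integer column
customer_churn <- data.frame(
  gender = c("Female", "Male", "Male", "Female"),
  tenure = c(1L, 34L, 2L, 45L),
  stringsAsFactors = TRUE   # store text columns as factors
)

class(customer_churn)          # "data.frame"
class(customer_churn$tenure)   # "integer"
class(customer_churn$gender)   # "factor"
levels(customer_churn$gender)  # "Female" "Male"
```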

now what we see over here is this just has two

categorical values female and male so whenever we have just categorical values

then most probably the class of that column is a factor right so we’ve

understood what our data frame is and we’ve also had a look at vectors and

factors right now we’ll go ahead and understand how can we access individual

columns from this data frame or access individual elements on this data frame

now let’s say I would want to access the column device protection from this

entire data frame so how can I do that well it’s actually simple so what I’ll do

is I will start off by giving the name of the data frame which will be customer

churn then I’ll use a dollar symbol and I automatically get a list of all of the

columns and since I need the device protection I will type D E V and I’ll

select this and I’ll store this in a new object and I’ll name this new object as

C underscore device protection

right so let me have a glance at this object so I’ll type view of C underscore

device protection so what I’ve basically done is from this entire data set I’ve just

selected this column and stored it into a new object named C underscore device

protection right so similarly if I want to select the payment method column I

can do the same thing so what I’ll do is I’ll type out the name of the data frame

I’ll use the dollar symbol and I’ll type the name of the column which will be

payment method over here and I’ll store this and let’s say C payment now let me

have a glance at this so I’ll type you see underscore payment right so I have

separated this column from this entire dataset
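
Selecting columns with the dollar symbol can be sketched like this. The data is a made-up stand-in, and the exact column spellings (`DeviceProtection`, `PaymentMethod`) and object names are assumptions.

```r
# toy stand-in for the customer churn data frame
customer_churn <- data.frame(
  customerID       = c("A-001", "A-002", "A-003"),
  DeviceProtection = c("No", "Yes", "No"),
  PaymentMethod    = c("Electronic check", "Mailed check", "Mailed check")
)

# data_frame$column pulls one column out as a vector
c_device_protection <- customer_churn$DeviceProtection
c_payment <- customer_churn$PaymentMethod

print(c_device_protection)
print(c_payment)
```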

all right so this is one way by which we can select individual columns we can

also use the second method so the second method is also simple so what we’ll do

is we’ll start off by giving the name of the data set and I’ll give the square

brackets now inside the square brackets the values which are on the left side of the

comma select the rows and the values which are on the right side of

the comma select the columns now let’s say I would want to select

this gender column so what I will do is I’ll give the number of the column

that is at which position the column is present so I’ll type out two after the comma over here

and I’ll save this in let’s say C underscore gender now let me have a look

at this so I’ll type view of C underscore gender right so I have used the square

brackets to select the second column from this customer churn data set now similarly

if I want to select the first column I can do the same thing

what I’ll do is I’ll just put one over here and I’ll name this object as C

underscore ID right so let’s have a look at this view

of C underscore ID this is the column which gives me the customer IDs all

right now you can also select multiple columns in the same way so let’s say I

would want to select the second sixth and seventh column from the customer

churn data set so what I’ll do is I will put a comma over here and I’ll use the

combine function c right and I’ll give all of the columns that I want to select

so I need the second column the sixth column and the seventh column all right

and I’ll store this in C underscore two six seven so let’s see what is the

result now again I’ll type view of C underscore two six seven right so I

have successfully selected second column sixth column and seventh column
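
Column selection by position can be sketched as below. The toy frame only mimics the first seven columns in the order mentioned in the session; the values and exact column spellings are assumptions.

```r
# toy frame with seven columns in the same order as described
customer_churn <- data.frame(
  customerID    = c("A-001", "A-002", "A-003"),
  gender        = c("Female", "Male", "Male"),
  SeniorCitizen = c(0L, 0L, 1L),
  Partner       = c("Yes", "No", "No"),
  Dependents    = c("No", "No", "Yes"),
  tenure        = c(1L, 34L, 2L),
  PhoneService  = c("No", "Yes", "Yes")
)

# nothing before the comma means "all rows"; the number after it is the column position
c_gender <- customer_churn[, 2]           # a single column comes back as a vector
c_id     <- customer_churn[, 1]

# c() combines several positions into one selection
c_2_6_7  <- customer_churn[, c(2, 6, 7)]  # still a data frame, with three columns

names(c_2_6_7)  # "gender" "tenure" "PhoneService"
```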

so let’s actually verify yes this is the second column three four five six so six

is tenure seven is phone service all right so we have sixth and the seventh

column so this is how we can separate out some specific columns from the data

set well let’s say I would want these columns over here so I can actually type

out the names of those columns so what I’ll do is I’ll type monthly charges and I’ll save this in C underscore

month so let me have a look at this view of

C underscore month alright so I have successfully selected out the

monthly charges column now I can also select a set of continuous columns let’s

go into that now let’s say I would want to select all of the columns from senior

citizen to multiple lines that is column numbers three four five six

seven and eight so I would want all of the columns from column number three to

column number eight so let’s go ahead and do that now so for that what I will do

is I’ll type the name of the dataset I’ll put a comma over here I’ll put

three and from three I’ll use colon symbol I’ll just put out eight over here

so this will give me all of the columns starting from the third column to the

eighth column let’s say I will save this in C underscore three eight let me have a look

at this now view of C underscore three eight right so I have successfully

selected out all of the columns from third to eighth so the third column is

senior citizen and the eighth column is multiple lines right so this is how

we can filter out some specific columns now what if we wanted to filter out some

specific rows so let’s go ahead and understand how we can do that now

as I’ve already stated if you want to filter out rows you give

them on the left side of the comma and if you want to filter out specific

columns you would want to give it on the right side of the comma so let’s go

ahead and filter out some rows so now let’s say I would want the second row of

the data set so all I need to do is type in two over here and this will give

me the second row right so let me store it in C underscore 2

and let me have a look at this view of C underscore 2 alright so

this is the second record whose customer ID is double five seven

five so let me just verify all right so this is the second record over here and

this is the same record that we have successfully filtered out similarly let’s

say if I would want the record at row number hundred so

I’ll just type out hundred over here and let me store this in C underscore

hundred let me have a look at the result view of C underscore hundred alright so

this is the record which is placed at the hundredth row whose customer ID is

this now we’ll go ahead and understand how can we filter out multiple rows from

this so let’s say I would want the first fifth and the tenth row so I’ll use the

combine function over here and I’ll give all of the row numbers so I need the

first row fifth row and the tenth row and I’ll store this in let’s say C underscore one

underscore five underscore ten all right so let me have a glance at

this I’ll type view of C underscore one underscore five underscore ten right so I have

successfully filtered out the first fifth and the tenth row and we see that the

gender of the customers at the first and fifth rows is female and the gender

at the tenth row is male now what we’ll do is we’ll filter out a

sequence of rows so let’s say I would want all the rows from let’s say row

number 100 to row number 200 so what I’ll do is I’ll type 100 colon 200 and I’ll

store this in C underscore 100 200 so let me have a glance at this I’ll type view of

C underscore 100 200 right so this gives me hundred and one entries

starting from row number 100 to row number 200 right similarly let’s say if I would

want all of the rows from 5,000 to 7,000 we can also do that I’ll just simply put

over here 5,000 and then I’ll type in 7,000 over here
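
Row selection works the same way, with the positions given before the comma. Here is a minimal sketch on a made-up frame; the 7,043-row size, the IDs and the object names are assumptions chosen so that the row ranges exist.

```r
# toy frame large enough for the row ranges used in the session
n <- 7043
customer_churn <- data.frame(
  customerID = sprintf("ID-%04d", 1:n),
  tenure     = rep_len(1:72, n)
)

c_2         <- customer_churn[2, ]            # a single row
c_100       <- customer_churn[100, ]
c_1_5_10    <- customer_churn[c(1, 5, 10), ]  # several rows picked with c()
c_100_200   <- customer_churn[100:200, ]      # a run of rows with the colon
c_5000_7000 <- customer_churn[5000:7000, ]

nrow(c_100_200)    # 101, both ends of the range are included
nrow(c_5000_7000)  # 2001
```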

right so let me store this in C underscore five thousand seven thousand over here let me

have a glance at this so I’ll type view of C underscore 5,000

7,000 right so there are 2001 entries starting from row number 5,000 to row

number 7,000 right now what we’ll do is we’ll give both the row numbers and

the column numbers so that we filter out some specific rows and some specific

columns so let’s go ahead and do that now let’s say I would want the row

numbers from 50 to 60 and I would want just the second and the third column so

let’s go ahead and do that so the row numbers from 50 to 60 and just the

second and third column so I’ll use c of 2 comma 3 and let’s just store this in C underscore

random one let me have a glance at this view of C underscore random one alright

so this gives me all of the records from row number 50 to row number 60 and I just

selected these two columns gender and senior citizen
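
Rows and columns can be filtered in one step by filling in both sides of the comma. A minimal sketch on made-up data follows; the column spellings and the object name `c_random1` are assumptions.

```r
# toy frame with 100 rows and the first three columns described in the session
customer_churn <- data.frame(
  customerID    = sprintf("ID-%03d", 1:100),
  gender        = rep_len(c("Female", "Male"), 100),
  SeniorCitizen = rep_len(c(0L, 0L, 1L), 100)
)

# rows 50 to 60 before the comma, columns 2 and 3 after it
c_random1 <- customer_churn[50:60, c(2, 3)]

dim(c_random1)    # 11 2
names(c_random1)  # "gender" "SeniorCitizen"
```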

now let me actually go ahead and complicate this a bit so let’s say I

would want row numbers from 100 to 200 and also from thousand to 2,000 and I

would want the columns 2 3 5 and 7 so let’s go ahead and do that now right so

I’ll use c over here the first part would be the row numbers 100 to 200 and

I will also need row numbers from thousand to 2,000 right so this is my

first part where I’m filtering out all of the rows and I would need column

numbers 2 3 5 and 7 so let me go ahead and put down all of these column numbers

and let me store this in C underscore random two so let me have a look at

this view of C underscore random 2 all right so I have thousand one hundred and

two entries in total where I have rows from 100 to 200 and as you see over here

just after row number 200 the row number gets to 1,000 because I have filtered out

all the rows from 100 to 200 and then it starts from thousand to 2,000 and I’ve

selected column number two three five and seven all right so this was an intro

to R data frames vectors and factors and how we can access individual

elements from the data frame we’re

going to look at the various types of operators in R and implement them on

the customer churn data set so broadly speaking these are the operators we have

assignment operators arithmetic operators logical operators and

relational operators so let’s go to RStudio and work with all of these so we

are back to RStudio guys so now to work with assignment operators what I’ll

do is I’ll start off by loading the customer churn data set and I’ll use the

read dot CSV function for this I’ll put in double quotes and let me go ahead and

copy the path of this dataset so this is the data set I’ll go to

properties and copy the path I’ll paste it over here again this needs

to be a forward slash so I’ll change this backward slash to a forward slash now I’ll go ahead and give the name of

the data set which will be customer churn dot CSV now what we’ll do is

we’ll give an assignment operator so this symbol which you see over here is

nothing but an assignment operator which helps us to store values into an object

so using this assignment operator I’ll store this customer churn dot CSV file

into an object and I’ll name that object to be lets say churn one now similarly

there are two other ways by which we can use this assignment operator so I’ll

copy this I’ll just clear this window first right now

I can actually give the name of the object first and then use this

assignment operator right so this is the less than symbol followed by a hyphen and it does the same thing

so I am basically loading this file into this object using this assignment

operator right so let’s have a glance at this I will type view of churn two right so we have the

data set now instead of using these two operators I can also use the equal

to operator so I will change this to churn three now

and when I have a glance at this it will actually be the same

so churn one churn two and churn three are actually the same data sets but I

just used different assignment operators to store them into these objects right
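
The three assignment forms can be put side by side as in the sketch below. The file is a made-up temporary .csv standing in for the real one, and the object names `churn1`, `churn2` and `churn3` are assumed spellings of the spoken names.

```r
# write a tiny made-up file so the example is self-contained
csv_path <- file.path(tempdir(), "customer_churn.csv")
write.csv(data.frame(tenure = c(1, 34, 2)), csv_path, row.names = FALSE)

read.csv(csv_path) -> churn1   # value first, then the object
churn2 <- read.csv(csv_path)   # object first, then the value
churn3 = read.csv(csv_path)    # the equal to operator

# all three objects hold the same data
identical(churn1, churn2) && identical(churn2, churn3)  # TRUE
```

In everyday R code `<-` is the form most style guides prefer, but all three give the same result here.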

so these are the different assignment operators now we’ll go ahead and work

with arithmetic operators as well so arithmetic operators are simply plus

minus division and multiplication so now we’ll be implementing all of those

arithmetic operators on top of this customer churn dataset so you see these

two columns over here monthly charges and total charges so I will be using the

arithmetic operators on top of this so I will take this cell over here so let’s

say this customer’s monthly charges is twenty nine point eight five but actually what happened was there was

some calculation error and his monthly charges were just twenty eight point

eight five so we’ll go ahead and change this value to twenty eight point eight five so

how can we do that so for that all you have to do is subtract one from the cell

value so let’s do that so I will select churn one dollar and

I’ll select the column which is monthly charges since this is the cell which is in

the first row I will give one inside square brackets over here so from this cell value I need to

subtract one right I have subtracted one now I’ll store this result back to the

same cell so I’ll type churn one dollar monthly charges and the cell is

obviously one now let me have a glance at the churn one dataset let me go to the

monthly charges column right so here is the difference

so initially the value was twenty nine point eight five we have rectified the

errors we have subtracted one from it and now the value is twenty eight point

eight five so similarly we will be using the plus operator so plus operator

basically helps us to add something to an existing value so let’s say

the second customer over here whose total charges are 1889.5 but again this was

incorrectly calculated and his charges were actually 1890.5 so we will just add that one to

this using the plus operator so let me select that cell I will type churn

one dollar total charges and the cell number is two and I’ll add one to that

after adding one I’ll store it back to the same cell so I’ll type churn one

dollar total charges and the cell number is obviously two

let me refresh this view of churn one let me go to the total charges column

right so initially it was eighteen eighty nine point five zero after adding

one to it the total charges became 1890.5

now let’s say there is some discount going on and randomly a customer gets a

discount of 10% so let’s say it is this customer over here who gets a discount

of 10% so let me see which row number is this 1 2 3 4 5 6 7 8 9 so this is row

number 9 over here and let’s say this customer gets a discount of 10% so what

we have to do is basically multiply this value by 0.9 and when you multiply

a value by 0.9 the value gets reduced by 10% so let’s go ahead and do

that I’ll type churn one dollar total charges and it is the ninth cell over here I

will multiply it by 0.9 and I’ll store it back to the same cell over here so this

will be churn one dollar total charges and I’ll give the cell number over here

right let me press Enter and let me check the result so view of churn

one so let me go to total charges over here right so initially it was three zero

four six and after giving a discount of 10% the total charges came down to two

seven four one right so let me show it again so it was initially three zero

four six and after a discount of 10% it was two seven four one
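
The three cell updates so far (subtract, add, multiply) can be sketched as follows. The toy frame only carries the cell values quoted in the session; every other value, the column spellings and the object name `churn1` are assumptions.

```r
# nine-row toy frame; rows 1, 2 and 9 hold the values quoted in the session
churn1 <- data.frame(
  MonthlyCharges = c(29.85, 56.95, 53.85, rep(50, 6)),
  TotalCharges   = c(29.85, 1889.50, 108.15, rep(100, 5), 3046)
)

# subtract 1 from the first customer's monthly charges and store it back
churn1$MonthlyCharges[1] <- churn1$MonthlyCharges[1] - 1

# add 1 to the second customer's total charges
churn1$TotalCharges[2] <- churn1$TotalCharges[2] + 1

# a 10% discount is a multiplication by 0.9
churn1$TotalCharges[9] <- churn1$TotalCharges[9] * 0.9

churn1$MonthlyCharges[1]  # 28.85
churn1$TotalCharges[2]    # 1890.5
churn1$TotalCharges[9]    # 2741.4
```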

now similarly let’s say a customer gets a

discount of 50% so all you have to do is divide that value by two so let’s say it

is this third customer over here who gets a discount of 50% on his monthly

charges so let’s go ahead and divide that cell value by two so I’ll select

churn one dollar monthly charges and the cell number is three now I will divide that

value by two and I’ll store it back to the same cell so that will be churn one dollar

monthly charges and the cell number is obviously three so let me refresh this

now view of churn one so initially the monthly charges were 53 over

here and after giving a discount of 50% his monthly charges came down to 26

so from 53 to 26 that is a discount of 50 percent right so these

were arithmetic operators now we’ll go ahead and work with the relational

operators so relational operators basically help us to find out the

relation between two values such as which one is greater

which one is lesser so now let’s say I would want to find out all of those

customers whose tenure is more than 60 so let’s go ahead and do that

so for that I will type churn one dollar and I will select the column which is tenure and

I’ll just use the greater than operator that is churn one dollar tenure is greater than

60 and I’ll store this in let’s say C underscore tenure right so let me have a glance at

this now view of C underscore tenure now these are just FALSE and TRUE values so what

basically this means is wherever you see FALSE this means that the tenure is

not greater than 60 and wherever you see this TRUE value it means that the tenure is

greater than 60 so let’s actually verify this so this is the

tenure over here so we see that this is the only value here where tenure is greater than

60 and we have a TRUE value for that right so now if you actually want to

clearly see the records wherever the tenure is greater than 60 we can use the

subset function for that so I’ll use subset and I’ll give the name of the data

set first right so from this data set I need all of those rows where C underscore

tenure is equal to TRUE and I’ll store this back to C underscore tenure now let me see what is the result right

so we see there are fourteen hundred and seven customers whose tenure is more

than 60 so you see this over here so these are all of the customers whose

tenure is more than 60 and we found that out using the greater than operator
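
The greater than filter can be sketched in two steps, as described: first a TRUE/FALSE vector, then `subset` to keep the matching rows. The data is made up and the object names `churn1` and `c_tenure` are assumptions.

```r
# toy frame with a few tenure values on either side of 60
churn1 <- data.frame(
  customerID = c("A-001", "A-002", "A-003", "A-004", "A-005"),
  tenure     = c(1L, 34L, 62L, 72L, 60L)
)

# the comparison is applied to every element and returns a logical vector
c_tenure <- churn1$tenure > 60
print(c_tenure)  # FALSE FALSE TRUE TRUE FALSE

# keep only the rows where the logical vector is TRUE
c_tenure <- subset(churn1, c_tenure == TRUE)
nrow(c_tenure)   # 2
```

The same rows can be obtained in one step with `subset(churn1, tenure > 60)`; the two-step form just makes the intermediate TRUE/FALSE values visible.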

right now similarly we’ll use the less than operator to find out all of those

customers whose monthly charges are less than $10 right so I will type churn one

dollar monthly charges and this needs to be

less than 10 and I’ll store this in C underscore month

right so let me have a glance at this C underscore month now let me use the

subset function to find out the actual values so I’ll type subset I’ll give the

name of the dataset I’ll type out C underscore month over here and I need all of those values

where this is TRUE I’ll store this back to C underscore month now

let me have a glance at this C underscore month and we see that there

are zero entries that is there is actually no customer whose monthly

charges are less than ten dollars right so this was the greater than operator

and less than operator now we’ll go ahead and also work with logical

operators so logical operators are basically and or and not so these basically

help us to give multiple conditions let’s say I would want to select all of

those customers where gender is male and senior citizen is one we can do that

using the and operator so what I will do is I will type churn one dollar gender

and I’ll keep this double equal to male now I’ll use the and logical operator now I would

need to select all the senior citizens that is where churn one dollar senior

citizen is double equal to one right and I will store this in let’s say C

underscore M S so let me have a look at this so I’ll type view of C underscore M S

now let me use the subset function to find out the real results subset churn

one C underscore M S double equal to TRUE and I will store this back to C

underscore M s now let me have a glance at this right so there are five seventy

four entries or 574 customers whose gender is male and who are senior

citizens or in other words there are 574 senior male citizens right so this was

the and operator now we’ll go ahead and work with the or operator so let’s say I

would want to select all of those customers whose internet service is

either DSL or fiber-optic so let’s use the or operator for that

so what I’ll do is I’ll type churn one dollar

internet service is double equal to DSL then I’ll use the or logical operator and I’ll

type churn one dollar internet service double equal to fiber optic so I will get the list of

all those customers who use either of these internet services and I’ll store

this in let’s say C underscore internet right let me have a glance

at this I’ll type view of C underscore internet now let me use the subset function to

find out the result I’ll give the dataset name then I’ll type

C underscore internet and find out all of those values where the result is TRUE

and I’ll store it back to C underscore internet right so I see that there are

five thousand five hundred and seventeen customers whose internet service is either

DSL or fiber-optic so we are done with the or operator and the and operator now we’ll

also go ahead and work with the not operator the not operator basically gives us the

opposite value so let’s say we’ll work with the senior citizen column and

I would want to select all of those rows where the senior citizen value is zero

so you can use the not operator for that so what I will do is churn one dollar

senior citizen is not equal to one that is I will get all of those

values where this value is not equal to one and since the only other value is

zero I’ll get all of those rows and I’ll store this in let’s say C underscore not

senior and let me have a glance at this right
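
The and, or and not equal to conditions can be sketched together on a small made-up frame. The column spellings, the level names such as "Fiber optic" and the object names are assumptions.

```r
# toy frame with the three columns used in the conditions
churn1 <- data.frame(
  gender          = c("Male", "Female", "Male", "Female", "Male"),
  SeniorCitizen   = c(1L, 0L, 0L, 1L, 1L),
  InternetService = c("DSL", "Fiber optic", "No", "DSL", "Fiber optic")
)

# and: both conditions must hold for a row
c_ms <- churn1$gender == "Male" & churn1$SeniorCitizen == 1

# or: at least one condition must hold
c_internet <- churn1$InternetService == "DSL" |
  churn1$InternetService == "Fiber optic"

# not equal to: TRUE wherever the value differs from 1
c_not_senior <- churn1$SeniorCitizen != 1

sum(c_ms)          # 2 senior male customers
sum(c_internet)    # 4 customers on DSL or fiber optic
sum(c_not_senior)  # 2 customers who are not senior citizens

# subset keeps the rows where the logical vector is TRUE
subset(churn1, c_ms == TRUE)
```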

so let me use the subset function subset of

churn one and I’ll give this to be double equal to TRUE

and I’ll store it back to C underscore not senior let me have a glance at this again view

of C underscore not senior right so there are 5901 customers who

are not senior citizens right so we are done with assignment operators arithmetic

operators relational operators and logical operators we’re going to look at

the various types of inbuilt function in our and implement them on the customer

joined data set so these are some of the inbuilt functions so let’s go to our

studio and work with them so we’ll head back to our studio let me just have a

glance of the data set first so I will type out move off

customer churn right so this is a dataset so we’ll start off by

understanding the structure of this dataset so for that I’ll be using the

structure function STR and I’ll given the name of the dataset which is

customer churn right so this function gives me the entire structure of the

state asset so this basically tells me that I’m working with a dataset where

there are seven thousand 43 observations of 21 variables or in other words there

are 7,000 43 rows and 21 columns and these are all of the columns over here

so we have a customer ID gender senior citizen online backup streaming movies

churn and so on right so now followed by the name of the column we also have the

data type or the class of the column right over here we see that customer ID

is of type factor gender again is of type factor senior citizen is of integer

and these are the values so the values are either zeros or ones right and over

here multiple lines is of type factor with three levels so these are the three

factor levels over here so it could be either no yes or no phone service and

then similarly we have the internet service with the three factor levels so

it could either be DSL fiber-optic or no right again so for contract we have

three factor levels which could be either month-to-month or one year or

two years right so we have found out all of this with the help of this structure

function all right now we’ll go ahead and implement the second inbuilt

function so we’ll be using the head function for this right so head function

basically gives us the top six records of the dataset right so I will just go

ahead and give the name of the data set over here head of customer churn right

so this has given me the first six records so we have the first six records

for all of the columns over here all right now if I want to have a look at

the first 10 records all I’d have to do is give a number over here

so now I have the first ten records of the data set so similarly if I just want

to have a look at the first two records I’ll just give the number two over here

all right so I can have a look at the first two records of the data set so

similar to head we have another function called tail so the tail function gives the

last six records of the dataset so I’ll go ahead and give our customer churn

data set over here so you can see the row numbers over here so they start

from 7038 and they end at 7043 so basically this tail function gives us

the last six records of the data set so similarly if I just want the last one

record of this data set I’ll just give the number one right so this is our last

record so row number 7043 similarly if we want the

last ten records I’ll put 10 over here right so it starts

from 7034 to 7043 so you have the last ten records

of the data set right so we are done with head we are done with tail now we’ll

use nrow and ncol to find out the number of rows and number of columns

right so I’ll type nrow and I’ll give in the name of the dataset right so with

this we find out that there are 7043 rows in this

data set similarly I will type ncol and I’ll give the name of the dataset

customer churn now we find out that there are 21 columns in this data set
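The inspection functions covered so far (str, head, tail, nrow, ncol) can be tried on any data frame. Here is a small sketch on a made-up 12-row table; the column names are assumptions standing in for the real churn columns.

```r
# Toy stand-in for the customer churn data (12 made-up rows)
customer_churn <- data.frame(
  customerID     = sprintf("C%02d", 1:12),
  gender         = factor(rep(c("Female", "Male"), times = 6)),
  MonthlyCharges = seq(20, 75, by = 5)
)

str(customer_churn)                     # column names, types and a preview

first_six <- head(customer_churn)       # default is the first six rows
first_two <- head(customer_churn, 2)    # first two rows
last_six  <- tail(customer_churn)       # default is the last six rows
last_one  <- tail(customer_churn, 1)    # last row only

print(nrow(customer_churn))             # number of rows
print(ncol(customer_churn))             # number of columns
```

Note that tail keeps the original row numbers, which is why the session sees rows 7038 to 7043 on the full data set.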

all right so now we have some numerical columns in our dataset so we have

monthly charges and total charges now what if I were to want to find out the

mean values or the maximum values of monthly charges so let’s go ahead and do

that so let’s say I would want to find out the mean of monthly charges so all I

need to do is type out mean over here and then give the column over here so I

will type customer churn dollar and I’ll select the column which is monthly

charges right so the mean of monthly charges for all of the customers is

around 64 dollars similarly if I’d want to

find out the minimum of monthly charges I’ll type out min I’ll give the name of

the data set and then I’ll type out monthly charges

again over here alright so the minimum monthly charges are $18 similarly if I

want to find out the maximum of monthly charges

I’ll type out max I’ll give the name of the data set which is customer churn

followed by the name of the column which is monthly charges right so the maximum

is 118 dollars so similarly what I’ll do is I’ll find out the mean max and min

for the total charges as well now we also have the range function

which automatically gives us the minimum and maximum values so I’ll give range

over here I’ll type the name of the data set which is customer churn and I’ll

select the column which is monthly charges right so range gives me the range of all

of the values so the minimum value is 18 and the maximum value is 118 now so

let’s say there’s a lucky draw going on and we are selecting five customers

randomly to give a discount so we can use the sample function for that so with

the help of the sample function I’ll be selecting some five random customer IDs

so let me go ahead and do that I’ll type sample and what I’ll do is I’ll

select the customer ID column and I’ll give the number 5 that is from the

entire dataset we are selecting five random customer IDs
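As a rough illustration of the summary functions and the sample call described above, here is a sketch on ten made-up customers. The column names and the `set.seed` call are assumptions added for the example; the session does not mention a seed.

```r
# Toy stand-in for the customer churn data (values are made up)
customer_churn <- data.frame(
  customerID     = sprintf("C%02d", 1:10),
  MonthlyCharges = c(18, 25, 40, 55, 60, 70, 85, 90, 100, 118)
)

avg <- mean(customer_churn$MonthlyCharges)    # average charge
lo  <- min(customer_churn$MonthlyCharges)     # smallest charge
hi  <- max(customer_churn$MonthlyCharges)     # largest charge
rng <- range(customer_churn$MonthlyCharges)   # c(min, max) in one call

# Lucky draw: five customer IDs picked at random without replacement.
# set.seed only makes repeat runs give the same draw.
set.seed(1)
lucky <- sample(customer_churn$customerID, 5)

print(avg)
print(rng)
print(lucky)
```

If a column contains missing values, mean, min and max return NA unless `na.rm = TRUE` is passed.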

right so these values which we see over here so these are the customer IDs so

this is the first second third fourth and fifth so this has randomly given us

five customer IDs from around 7043 customer IDs so again

let’s say if I would want around twenty random customer IDs I’ll give the number

to be twenty over here right so these are all the 20 customer

IDs now next if you would want to find out the distribution of some categorical

variables then we can use the table function so over here we see that we

have a lot of factors over here so gender is a factor column partner is

a factor column internet service is a factor column so most of these are

actually factor columns so now if we have a lot of factor columns we can use

a table function to find out the distribution so now let’s see for this

gender column I would want to find out the number of female customers and also

the number of male customers so all I’d have to do is use a table function for

this so I will type out table I’ll give the name of the dataset and

I’ll just select the column over here which is gender right so this basically

tells me that there are around three thousand four hundred and eighty-eight

female customers and three thousand five hundred and fifty five male customers

all right so similarly if I’d want to find out the distribution for Internet

service I’ll use the table function again I’ll give the name of the dataset

and then I’ll give the column name which is internet service right so around

2421 customers use DSL and 3096 customers use fiber optic and there are

around 1526 customers who don’t use any sort of internet service

right now so let’s say we want to find out the contract of the customers I’ll

again use a table function and I’ll type the name of the dataset followed by the

column right so there are 3875 customers who

have a month-to-month contract 1473 customers who have a contract on a yearly

basis 1695 customers who have the contract on a two-year basis right so

next finally we’ll use the table function on the payment method column

again I’ll type table over here I’ll give the name of the dataset

which is customer churn and I’ll select the payment method column right so

around 1544 customers pay by bank transfer

1522 customers pay by credit card and these are the rest of the customers

who pay via electronic check and mailed check now we’re going to work with flow

control statements and user-defined functions now these flow control

statements basically help us to control the flow of execution so in general the

statements are executed from top to bottom but with the help of flow control

statements we can manipulate the order of execution so these are some of the

flow control statements over here if if-else and switch are something known as

selector statements then we have repeat for and while which are looping statements

we also have some jump statements like next and break so let’s have a

closer look at selector statements as the name suggests these selector

statements help us to select or manipulate data on the basis of a

condition such as if it rains we’ll not play football or if you’re sick you’ll

not eat ice cream so we’ll just go ahead and start working with the selector

statements all right so I’ll start off and have a quick glance at our customer

churn data set so I will type view of customer churn and we have our data set

right in front of us now I will start with the if condition and I will check

if this cell over here the value in this cell is female and if the test condition

comes out to be true then I’ll change this value to be male so let me go ahead

and do that right so I will type if I will give the name of the data set which

is customer churn and the column is gender and the row number is obviously 1

so from this data set I am checking if the value in this column is female so if

this is equal to female then I will give some action over here

so what I’ll do is I’ll change that value to be male

I’ll give the name of the data set I will select the column and in this column

right so this is cell number one and over here I’ll change the value from

female to male all right so initially we had female over here now let me copy and

paste it over here now let me have a glance at the customer

churn data set again alright so we have changed this value from female to male

with the help of the if condition all right similarly we will use the if

clause again to check the tenure over here so we will take this cell value

over here so if the tenure is let’s say greater than fifty months then what

I’ll do is I’ll give this customer a discount of 10% right so over here I see

the monthly charges and over here we have the customer so this is the customer who

is present in row number ten and I will give this guy a discount of 10% for his

monthly charges right so let me go ahead and create another if clause over here

so I will type if customer churn dollar tenure is greater than 50 and this cell

number over here is 10 now this is very important right so this is cell number ten if

that value is greater than 50 then what I’ll do is I will give a corresponding

discount in monthly charges cell number is 10

so customer churn dollar monthly charges cell number ten and I will give this guy

a discount of 10% and that is why I am multiplying this value with 0.9 over

here so you see we have taken out this cell value and I’m multiplying that

value with 0.9 and that is how this guy will get a discount of 10% and I am

storing back the result into the same cell
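The tenure check described above can be sketched like this. The ten rows are made up, and the column names `tenure` and `MonthlyCharges` are assumed spellings; only the tenth row matters for the example.

```r
# Toy stand-in for the customer churn data (values are made up)
customer_churn <- data.frame(
  tenure         = c(1, 34, 2, 45, 2, 8, 22, 10, 28, 62),
  MonthlyCharges = c(30, 57, 54, 42, 70, 100, 89, 30, 105, 56)
)

# if: run the body only when the condition for row 10 is TRUE
if (customer_churn$tenure[10] > 50) {
  # 10% discount: keep 90% of the old value and store it in the same cell
  customer_churn$MonthlyCharges[10] <- customer_churn$MonthlyCharges[10] * 0.9
}

print(customer_churn$MonthlyCharges[10])  # about 50.4 on this toy data
```

The index in square brackets matters here: `if` expects a single TRUE or FALSE, so the condition has to point at one cell rather than the whole column.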

let me copy it and paste it over here so let us just have a quick glance so

the initial value is 56 now after modification let’s have a look

at the monthly charges right right so initially it was 56 then after using the

if Clause we have changed his monthly charges to 50 by giving him a discount

of 10% right so this was if now we will go ahead and also implement the if else

clause so we’ll use the churn column for that and we’ll be using this cell so

let’s say we’ll just check if the value over here is no or in other words it

basically means that the customer will not churn out or the customer will be

using the same network and we’ll just print that thank you for using our

network and if this is yes then we will print please give us a feedback on how

we can improve our network so let’s go ahead and do that
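A minimal sketch of the if-else check just described, on three made-up rows. The column name `Churn` and the variable `msg` are assumptions for the example.

```r
# Toy stand-in for the customer churn data (values are made up)
customer_churn <- data.frame(
  Churn = c("No", "No", "Yes"),
  stringsAsFactors = FALSE
)

# if-else: one branch runs when the test is TRUE, the other when it is FALSE
if (customer_churn$Churn[1] == "Yes") {
  msg <- "Please give us feedback on how we can improve our network"
} else {
  msg <- "Thank you for using our network"
}

print(msg)
```

In R the `else` has to sit on the same line as the closing brace of the `if` block when the code is run at the top level.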

right so if customer churn dollar churn and this is row number one is double

equal to yes and if this comes out to be true I’ll just

print please give us feedback on how we can

improve our network else

I will print thank you for using our network let me paste it over here and let me see

what will be the result so we get thank you for using our network because this

customer does not churn out all right so we are done with if we are done with

if-else now we’ll look at the third selector statement which is switch so with the

help of switch I will give this guy a discount on monthly charges with respect

to the internet service let’s say I will take this customer and I will see if

this guy uses Internet service of DSL then I will be giving him a discount of

10% and if he uses internet service of fiber-optic then I’ll be giving him a

discount of 20% so let me go ahead and do this using the switch statement so I’ll delete all of this I will type

switch over here so over here I need to give the object so the object again now

since this is actually a factor I will change this to a character vector so I

will type as dot character of and I will give the column

over here which will be customer churn dollar internet service

right now the first case would be DSL if the customer’s internet service is DSL

then I will give this guy a discount of 10% right again so let me have a glance

at the cell numbers this will be one two three four and five
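The switch being built here can be sketched as follows, on five made-up rows with assumed column names. The final unnamed argument is a default branch that keeps the old value; that default is an addition for this sketch and is not mentioned in the session.

```r
# Toy stand-in for the customer churn data (values are made up)
customer_churn <- data.frame(
  InternetService = factor(c("DSL", "DSL", "DSL", "DSL", "Fiber optic")),
  MonthlyCharges  = c(30, 57, 54, 42, 70)
)

# switch() matches a character string against the named cases,
# so the factor value is converted with as.character() first
customer_churn$MonthlyCharges[5] <- switch(
  as.character(customer_churn$InternetService[5]),
  "DSL"         = customer_churn$MonthlyCharges[5] * 0.9,  # 10% discount
  "Fiber optic" = customer_churn$MonthlyCharges[5] * 0.8,  # 20% discount
  customer_churn$MonthlyCharges[5]                         # default: no change
)

print(customer_churn$MonthlyCharges[5])  # about 56 on this toy data
```

Without a default, switch returns NULL when no case matches, and assigning that back into the cell would fail, which is why the fallback is included.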

all right so over here let me just give down the cell number which is five

so what I’ll do is I’ll select customer churn dollar monthly charges cell number

is five and I’ll give this guy a discount of 10% I’ll give a comma now

I’ll give the second case and the second case is if this guy uses fiber-optic

so if this guy uses fiber-optic then I will give this guy a discount of 20% so

customer churn dollar monthly charges cell number five and twenty percent

discount so I would have to multiply this value by zero point eight so let me

go to monthly charges so one two three four five so this was the initial value

seventy right now I will select all of this paste it over here and I’ll store

the result back to the same cell so this will be customer churn dollar monthly

charges cell number is five let me have a glance at this now let us see

the result so we have our monthly charges so initially it was seventy and

after giving this guy a discount of 20% his monthly charges came down to 56

right so we are also done with switch then we have looping statements so these

looping statements basically keep on repeating a certain action like keep on

printing your name four thousand times or keep playing the music for the next

one hour right so let’s go ahead and work with this looping statements so I

will start with for loop now we have this gender column over here

and I’ll use the for loop to count the number of male customers right so I will type for and over here I

will give a variable and I’ll name this variable to be i and then the vector so

I’ll give a range over here so for i in 1 is to nrow of customer churn that is

this loop will run starting from 1 to 7043 right now in this entire

loop I need to check the number of male customers so for that I will use the if

condition so if customer churn dollar gender is double equal to male now I

will create another variable count over here and give this to be 0 so if customer churn dollar gender again

I need to give the cell which is i over here

so if customer churn dollar gender is double equal to male then I will

increment the count value so count will be count plus one alright so what is

happening is initially i’s value is one so now this will be evaluated to true

and again over here I am checking if customer churn dollar gender the first

cell value if this is equal to male then I’ll increment count with one again this

loop will come over here i’s value will be two over here we’ll check if customer

churn dollar gender two so this is male again

so count’s value will be increased to two similarly then i’s value will be three

and we will check the cell number three over here

cell number three is male again and again so the count value will be

incremented and will be three now so this is how this loop will go on so let me print this over here let me

print count so we see that there are three thousand five hundred and fifty

six male customers so let me verify this with the help of table function so I

will type table of customer churn dollar gender and we see that this is actually

true there are three thousand five hundred and fifty six male customers
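The counting loop walked through above can be sketched on a made-up five-row column. `seq_len(nrow(...))` is used in place of `1:nrow(...)` as a small safety choice for empty tables; that substitution is an assumption of this sketch, not something from the session.

```r
# Toy stand-in for the customer churn data (values are made up)
customer_churn <- data.frame(
  gender = c("Male", "Female", "Male", "Male", "Female"),
  stringsAsFactors = FALSE
)

count <- 0                                  # running total of male customers
for (i in seq_len(nrow(customer_churn))) {  # i goes 1, 2, ..., number of rows
  if (customer_churn$gender[i] == "Male") {
    count <- count + 1                      # one more male customer found
  }
}

print(count)
print(table(customer_churn$gender))         # cross-check against table()
```

On real data the vectorised form `sum(customer_churn$gender == "Male")` gives the same number without a loop.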

right so this was for loop now we’ll go ahead and understand the while loop now

with the help of while loop we will get a count of the number of customers whose

payment method is electronic check right so now I will delete all of this I will give a new variable which is i

is equal to zero I will create the while loop over here I will give a condition

so I will check if i is less than nrow

of customer churn that is if i is less than 7043 so this

actually needs to be less than or equal to 7043 and if this is true

I’ll go ahead and check my condition if customer churn dollar payment method

here I’ll give the cell number which will be i so if this is equal to

electronic check then I will increase the value of count

with one so count will be count plus one right so after doing this

I will also increment the value of i so i will be i plus one so let us

understand this properly so I am checking if one is less than or equal to

7043 which is evaluated to true and since this is evaluated to true

then I am using the if condition to check if the cell number one the payment

method is electronic check so since this is electronic check the value of count

is incremented now after this if condition is done

I am incrementing the value of i over here now i’s value will be 2 then I

will check if the value over here is electronic check or not similarly this

loop will continue on so let me go ahead and select all of this and print it over

here I’ll type count now let me verify this so table of customer churn dollar

payment method so we see that the number of customers whose payment method is electronic check is

2365 and over here we have got the count to be 2365 alright so we are done with

the while loop then we have user-defined functions so these basically help us to

modularize our entire program let’s say if we wanted to find out the minimum and

maximum values of every column so all we need to do is create two functions min

and max which can be applied on all the columns so let’s go to RStudio and

create some user-defined functions right now again I will create a user-defined

function to get a count of the number of male customers I will name the function to be gender

count now this is the syntax of a user-defined function so I will type

function and this is our parameter over here right now inside this I need to

write the entire code to find out the count of the number of male customers

right so I’ll be using the for loop again to do that so I will type for i in

1 is to length of x right and over here I will check if x of

i is double equal to male and if that is equal to male then I will say count

is equal to count plus one again I will create a new local variable over

here count whose initial value is supposed to be zero right so let’s go

through this function again so this is the syntax of a function and I am naming

this function to be gender count so over here I will send this gender column as

the parameter now once I do that I have initialized local variable where count

is equal to zero and over here the loop starts so for i in 1 is to length of

x so length of x that would be the length of this column which would be

7043 so this loop will go from one to 7043

alright and inside this over here we check for each and every cell so if x of

i so for first iteration it will be x of one so we’ll check for this cell so if

this value is equal to male then count’s value is incremented by one again i’s

value is two and if the value in the cell is male count’s value is also

incremented by one and this goes on after the entire loop is done I’ll also

print the value of count over here all right so I will select all of this and I

will paste it over here so we have our function to be ready right so gender

underscore count and I will send the gender column as the parameter

all right so we get a value of three double five six let me verify this again

table of customer churn dollar gender right so we see that number of male

customers is three double five six so now the best part of functions is we

just need to make a small change over here if we need to find out the number

of female customers so all I’ll do is I will change this to be female and I can

pass in the same column to find out the count of number of female customers

right so I will use this function again gender count and I will say customer

churn dollar gender I am sending this as the parameter now let us see the count

let me verify this so I will type table of customer churn dollar gender and over

here we see that the number of female customers are three thousand four

hundred and eighty seven and that is the same value which we’ve got with the

function right so this was an implementation of user defined function

now we’ll work with the basic data structures in R so we’ll start with one dimensional

data structures which are vectors and lists and then we’ll head on to matrices

and arrays which are multi-dimensional data structures so the most basic data

structure in R is a vector it’s a homogeneous one-dimensional object so

what do I mean by homogeneous well all of its elements must be of the same type like

over here we have a collection of boots linearly arranged now let’s go ahead and

implement this in R right so we are back to RStudio I will start off by

creating a character vector and I’ll name it as bird and I’ll go ahead and

give down some names of birds so the first bird would be eagle

then we have parrot and our final bird would be pigeon now let me print this

right and let me also go ahead and check the class of this vector so I’ll type

class of bird all right so we see that this is a character vector that is all

of these three elements are actually characters now I’ll be creating an

integer vector and I’ll name this as numbers so I’ll just

list down the numbers from 1 to 9 let me print this now which comprises numbers from 1 to 9
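The two vectors created so far might look like this. The bird names are illustrative, and writing the integers as `1:9` is an assumption about how they were typed (the session only says the class comes out as integer).

```r
# Character vector: every element is a string (names are illustrative)
bird <- c("eagle", "parrot", "pigeon")
print(bird)
print(class(bird))

# Integer vector: the colon operator produces integers
numbers <- 1:9
print(numbers)
print(class(numbers))

# Typing the values out with c() gives doubles instead, so the class is numeric
typed_out <- c(1, 2, 3)
print(class(typed_out))
```

The difference between `1:9` and `c(1, 2, 3)` is worth knowing because both print the same way but report different classes.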

let me go ahead and check the class so I’ll type class of numbers so this is integer

then we have a numeric vector type so in numeric we can give

floating-point or decimal values so I’ll name this to be decimal

and I’ll give some floating-point values so I’ll just give some random floating

point values over here let me print this now let me check the class so I’ll type

class of decimal so we see that this is of numeric type right and then finally we

have a logical vector and in logical vector we can just have two values

either true or false so I’ll name this to be logic hundred

and here are some logical values true false and if I’m too lazy I can just

give T and F like this over here right so let me

print this logic hundred right so these are all

of the values of this vector now let me check the class so I’ll type class of

logic hundred so we see that this is of logical type so this was an

implementation of vectors in R so then we have a list so a list is a

heterogeneous collection of elements that is the elements do not have to

be of the same type and each element actually retains its own identity even

when it is present in the list like over here we have a heterogeneous collection

comprising of a board and a pill in the cart so let’s head to R and work with

lists so this is how we create a list I’ll type list and let’s say the first

element is the number one then I’ll give a character value and I’ll name it

to be Nirvana after that I’ll give a logical value and this is true I will

store this in mix bag right so let me print mix bag over here so this is how

our list looks like so we have three elements first is a number next we have a

character and then finally we have a logical value so let me check the class

of this object so I’ll type class of mix bag

and we see that this is a list now I’ll also go ahead and check the class of

individual elements right so I’ll type class of mix bag and I’ll give two square

braces and I want to check the class of the first element right so we see that it

is numeric second is character

and the third element is logical right so let me actually print the list for

you guys so this number one is of numeric type

this element Nirvana is of character type and this element true is of logical

type so we see that all of these three elements retain their original identity

or their original classes so this is how a list functions next in line is a

matrix so matrix is a homogeneous collection of elements in

two-dimensional space so over here all of the elements belong to the same

category namely fruits and they’re arranged in the form of rows and columns

so now let me go ahead and implement a matrix in R

all right so to create a matrix what I’ll do is I’ll actually be using the

same vector first so let me have a glance at this

right so now I’ll be inserting all of these elements into our matrix right and

to create a matrix I will type matrix the first parameter is the data so for

the data I am giving the numbers vector after giving the numbers vector we have

two other parameters where we specify the number of rows and the number of

columns that we want so let’s say since we have nine numbers in total I would

want this to be a 3 cross 3 matrix that is nrow is equal to 3 or number of rows is

3 and similarly ncol is 3 or in other words number of columns is 3 and I

will store this in mat one so let me print mat one now right so this is our matrix over here

where all of these are of the same type one two three four five six seven eight nine

so these are all integers and I’m storing them in the form of rows and

columns and what you see over here is these elements are arranged column wise

so if I want to arrange these by row then what I’ll do is we have the byrow

attribute and I’ll just set it to be true right and I’ll print mat one now

to see the difference over here so now the elements are arranged by row 1 2 3 4

5 6 7 8 9 initially they were arranged with respect to column now they are

arranged with respect to row right so now I’ll create a character matrix so

what I’ll do is I’ll create a character vector first and list down some

characters so let’s say I will give the first six alphabets so a b c d e and f right so I have created a

character vector named alpha and this consists of six elements all right now

I’ll take this vector and create a matrix so matrix the data is coming from the alpha

vector now since there are six elements in

total I want this to be a 2 cross 3 matrix right so number of rows is 2 and

number of columns is 3 and I will store this in let’s say

mat underscore alpha let me print this now mat underscore

alpha alright so this is a matrix over here so two rows and three columns now

again if I want to arrange this with respect to row all I need to do is set

byrow to be true and I’ll be printing mat alpha

right so this was with respect to columns this is with respect to rows

a b c d e f right so we have also created the matrix now what if you wanted to

access the individual elements of the matrix so this is how we can do it so

let’s say I would want to access this element over here so all we have to do

is set the index values so this is present in the first row and second

column right so we’ll give one comma two and we have successfully extracted this

element over here similarly if you wanted to extract this element F over

here so this is present in second row and third column right so mat alpha two

comma 3 and we have successfully extracted the element from this matrix
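The matrix steps above can be sketched as follows. The object names `mat1`, `mat1_by_row` and `mat_alpha` are assumed spellings of the names spoken in the session.

```r
numbers <- 1:9

# Default fill is column by column: the first column holds 1, 2, 3
mat1 <- matrix(numbers, nrow = 3, ncol = 3)

# byrow = TRUE fills row by row: the first row holds 1, 2, 3
mat1_by_row <- matrix(numbers, nrow = 3, ncol = 3, byrow = TRUE)

# Character matrix with 2 rows and 3 columns, filled by row
alpha <- c("a", "b", "c", "d", "e", "f")
mat_alpha <- matrix(alpha, nrow = 2, ncol = 3, byrow = TRUE)

print(mat1)
print(mat1_by_row)
print(mat_alpha)

# Indexing is [row, column]
print(mat_alpha[1, 2])  # first row, second column
print(mat_alpha[2, 3])  # second row, third column
```

Leaving one index empty, as in `mat_alpha[1, ]`, returns the whole row or column instead of a single element.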

right so this was an implementation of matrices and finally we have arrays so

this is just an extension of the matrix that is it is a homogeneous collection of

elements in n-dimensional space so let’s actually go ahead and implement

arrays in R alright so what I’ll do is I’ll create a new integer vector

and give out values from 1 to 9 and I’ll create a second integer vector and in

this I will give the values starting from 10 to 18 so we have created two numeric

vectors over here and we’ll be using these two numeric vectors to create an

array so this is the syntax to create an array I will type array and I’ll give out

the data right so the data is coming from these two vectors I’ll use the

combined function and give out these two vectors over here num1 and num2 right

after this I need to set the dimensions that is the number of rows the number of

columns and the number of matrices all right so the dim will be so in total we

have 18 elements so we have nine elements in num 1 vector and nine

elements in num 2 vector so what I need is actually two matrices of 3 cross 3 so

I will give 3 comma 3 so this is the number of rows and number of columns and

since I need 2 matrices of this type so I’ll type in 2 and I will store this in

array 1 now let me go ahead and print array 1 as a result right so this is all

of the elements from num 1 vector which are stored in this part over here and

then we have all of the elements from the num 2 vector which are stored in the

second dimension over here right now so how can we access individual elements

from this so let’s say I want to access this element number 15 so let’s go ahead

and access this I will type array 1 now let me check where is this present so

this is present in the third row and second column so I’ll type 3 comma 2 and

after this since this is present in the second Matrix or the second dimension I

will give in 2 over here and let me check the result and voila
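A minimal sketch of the array built above, assuming the two vectors are typed as `1:9` and `10:18` and the object is spelled `array1`:

```r
num1 <- 1:9     # fills the first 3 x 3 matrix
num2 <- 10:18   # fills the second 3 x 3 matrix

# dim = c(rows, columns, number of matrices); values fill column by column
array1 <- array(c(num1, num2), dim = c(3, 3, 2))
print(array1)

# Indexing is [row, column, matrix]
print(array1[3, 2, 2])  # third row, second column, second matrix
print(array1[2, 2, 1])  # second row, second column, first matrix
```

Because the fill is column-wise, the second matrix holds 10, 11, 12 in its first column and 13, 14, 15 in its second, which is why 15 sits at row 3, column 2.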

so we have successfully extracted 15 from this so similarly if I wanted to

extract this element 5 so let’s go ahead and do that so let me actually yeah I

need this now so I will type array 1 and this is present in second row and second

column so I’ll type 2 comma 2 and since this is present in the first dimension

itself I’ll give out 1 over here right and I have also extracted 5 now we’ll be

working on a project so this project would be on recommendation engine so

have you ever wondered which book to read next

well I often have and to me book recommendations are a fascinating issue

and that is exactly what we’re going to do today so our data set for the case

study comprises of these four files ratings dot CSV books dot CSV book tags

dot CSV and tags dot CSV so as the name suggests the ratings dot CSV contains

all users’ ratings of the books so there are a total of nine hundred and eighty

thousand ratings for ten thousand books from fifty three thousand four hundred

and twenty four users so the books dot CSV contains more information on the

books such as the author’s name publication year book ID and so on then

we have the book tags dot CSV file so this file comprises all tag IDs users

have assigned to the books and the corresponding tag counts so the tag IDs

basically denote the categories into which the books fall and the counts

denote the number of books belonging to each category and we have the tags

dot CSV file so this file contains all the tag names corresponding to the tag

IDs that is it gives us the labels corresponding to different tag IDs so

these are the tasks which you’d have to perform in this project so in the first

phase we do a bit of data cleaning so we’ll start off by removing the

duplicate ratings that is there are cases where a user has rated the same book

more than one time so we’ll go ahead and remove all these instances after which

we’ll go ahead and remove those users who have rated fewer than three books

right guys so we are in RStudio now so let us go ahead and load all of the

packages required for our case study so these are all of the packages required

right now after which I’ll load the four files from our dataset so these are

the four files so we have books.csv ratings.csv

book_tags.csv and tags.csv and I’m storing these in the objects books

ratings book_tags and tags so we have loaded these four files now

let us have a glance at these four files so I’ll be using the View function to

have a glance at our four files right so these are our data sets guys so we have

the ratings data set which comprises these three columns book ID

user ID and the rating then we have the books data set and these are the columns

so it has ID book ID work ID ISBN the author’s name then we have the original

publication year original title title language code and so on afterwards we

have the book tags and here the columns are goodreads book ID tag ID and the

count and then we have the tags dataset here we have the tag ID and the

corresponding tag name for that tag ID right so as part of our first phase we had

to do a bit of data cleaning and the first task of our first phase was to

remove all of the duplicate ratings and to do that we’d have to find out how many

times a single user has rated one particular book and this would be the

command for that so here what I’m doing is I am grouping

this ratings data set by user ID and book ID afterwards I am using the mutate

function and I’ll add a new column to this and that new column would be given

by the n function from the dplyr package so this basically would give us the

number of times a single user has rated one particular book and I’m giving the

name of the new column to be capital N and I’m storing the result back to

ratings so let me have a glance at ratings now
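
The grouping step described above can be sketched as follows. The toy data frame and the column names `user_id`, `book_id` and `rating` are assumptions standing in for ratings.csv, and the trailing `ungroup()` is an addition here so that later steps start from an ungrouped table.

```r
library(dplyr)

# Toy stand-in for ratings.csv (hypothetical values);
# user 2 has rated book 10 twice
ratings <- data.frame(
  user_id = c(1, 1, 2, 2, 2, 3),
  book_id = c(10, 11, 10, 10, 12, 10),
  rating  = c(5, 3, 4, 2, 5, 1)
)

# N = how many times each (user, book) pair occurs in the table
ratings <- ratings %>%
  group_by(user_id, book_id) %>%
  mutate(N = n()) %>%
  ungroup()

ratings
```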

View of ratings so we see that a new column has been

added so this is the user number so the user number 314 has rated the book

number one only once similarly if we take this case over here the user number

2900 has rated the book number one only once so let me go down

and see if there are some changes over here in the count of N right so let’s

have a glance at these two cases over here so the user number 4206 has

rated the book number 8945 two times over here right so these

are the duplicate ratings which I am talking about

so these records need to be removed right all right so now let me also use

the table function to find out the distribution of these duplicate ratings

so I’ll use the table function and I will give in ratings dollar N over here

which would give me the count of how many times one

particular user has rated one particular book right so this value over here tells us

that there have been five instances where one particular user has rated the

same book five times this tells us that there are twenty-eight instances where

the same user has rated the same book four times this tells us that there have

been 156 instances where the same user has rated the same book three times and

this tells us that there are four thousand two hundred and ninety eight

instances where the same user has rated the same book two times and this is all

of those cases which are not duplicates that is the user has rated that

particular book only one time right so now what I’ll do is from this ratings

data set I will filter out all of those duplicate records and I will store them

in a new object so I will put in the name of the dataset

which would be ratings over here and I will use the filter function to select

all of those records where N is greater than one that is those which have duplicate

ratings and I will store it in a new object so we have successfully created

this new object now let me have a glance at this View of duplicate_ratings right

so there are four thousand four hundred and eighty seven entries in total which

have duplicate ratings or in other words there have been four thousand four

hundred eighty seven instances where the same user has rated the same book more

than one time right now we’ll go ahead and remove all

of these duplicate ratings now there is a very simple command to do that so from

the ratings object all you have to do is keep only those records where

the value of N is equal to 1 so this basically means that we are keeping

only those records where one particular user has rated one particular book only

once and I am storing this result back to the ratings dataset right so we have

done the changes now let me have a glance at it
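
A minimal sketch of the three commands described above: the `table` call, the extraction of duplicates, and their removal. The toy data and the column names are assumptions; `duplicate_ratings` follows the object name spoken in the session.

```r
library(dplyr)

# Toy stand-in for the ratings table with the N column already added
ratings <- data.frame(
  user_id = c(1, 1, 2, 2, 2, 3),
  book_id = c(10, 11, 10, 10, 12, 10),
  rating  = c(5, 3, 4, 2, 5, 1)
) %>%
  group_by(user_id, book_id) %>%
  mutate(N = n()) %>%
  ungroup()

# Distribution of N: how many rows belong to pairs rated once, twice, ...
table(ratings$N)

# Rows that belong to a (user, book) pair rated more than once
duplicate_ratings <- ratings %>% filter(N > 1)

# Keep only the pairs that were rated exactly once
ratings <- ratings %>% filter(N == 1)
```

Note that this drops every copy of a duplicated pair rather than keeping one of them, which matches the filter on N equal to 1 described in the session.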

View of ratings right so these are all of the records

where there are no duplicate ratings so our second task was to remove all of

those users who have rated fewer than three books so for this we’ll have to

start off by grouping the users with respect to user ID first and find out

the number of ratings given by each user

so I will select this command over here and I’ll paste it over here so I have

given ratings over here and I’m grouping these ratings with respect to

user ID after which I am using the mutate function and over here again I am

adding a new column and that new column would be ratings given and I will get

that ratings given column with the help of this n function from the dplyr

package so this n function from the dplyr package would basically give me

the number of ratings given by each user right so I will store this back into the

ratings dataset now let me have a glance at it View of ratings

so this is the user ID so the user number 314 has given 181 ratings in

total the user ID 439 has given 173 ratings in total

similarly the user ID 9246 has given 190 ratings in total
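
The per-user count above, together with the filter that follows it in the session, can be sketched like this. The toy data is hypothetical, and the underscore in `ratings_given` is an assumed spelling of the column name.

```r
library(dplyr)

# Toy stand-in for the de-duplicated ratings table;
# user 2 has only two ratings and should be dropped
ratings <- data.frame(
  user_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  book_id = c(10, 11, 12, 10, 13, 10, 11, 12, 13),
  rating  = c(5, 3, 4, 2, 5, 1, 4, 4, 5)
)

ratings <- ratings %>%
  group_by(user_id) %>%
  mutate(ratings_given = n()) %>%   # number of ratings per user
  ungroup() %>%
  filter(ratings_given > 2)         # keep users with at least three ratings

ratings
```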

so now we’ll go ahead and remove all of those user IDs who have given fewer than

three ratings so this is the command for that I have

again given ratings over here and I am keeping only those records where

the ratings given by each user is greater than two that is each user has

rated at least three books and I am storing the result back to ratings so View of

ratings so this is our final data set so we see that there are nine hundred and

sixty thousand five hundred and ninety five entries in total so we are done with the first phase and in

the second phase we’ll do some data exploration so we’ll start off by

extracting a sample of 2% of the users from the entire dataset then we will

make a bar plot for the distribution of ratings that is we’d want to analyze

the count of the different ratings after which we’ll make a plot to understand

how many times each book has been rated then we’ll make a plot for the

percentage distribution of different genres going ahead we’ll find the top 10

books with the highest ratings and finally we’ll find out the 10 most popular books

right so we are back in RStudio again and it’s time for phase two now so the

first task in our phase two was to select a sample from the entire data set

so I’ll go ahead and set a seed so that if I ever want to run these commands

again I can get the same results so I’ll set the seed value to be 1

and I’ll set a user fraction of 0.02 that is from the entire user base I need

only 2% of the users as a sample so I am assigning this value of 0.02 to a new

vector and naming that new vector user_fraction now after which I will

find out all the unique user IDs so I am using the unique function over

here and I will give in the user ID column from the ratings data set so this will

give me all of the unique user IDs and I am storing the result in the users object

now after this let me have a glance at the number of the unique user IDs so

length of users so we see that there are 45,016

unique user IDs in total so we need 2 percent of these unique user IDs so 2

percent of this would be 0.02 into 45016 so this would give us about 900 users so

from 45,016 users in total we would need 900 users right so we’ll do a

random sampling of 900 users from the entire user base so this is the command

for that so I am using the sample function and this is the list of all

of the users and from all of the users I only need 900 of the users so earlier we

had multiplied the user fraction into the length of users which gave a

value of 900 point something so we are basically rounding that off and I will

store that result in sample_users so now let me have a glance at the length of

sample_users over here length of sample_users so you see that there are 900

sample users in total now let me also have a glance at the number of ratings

so initially the number of ratings which we have is nine lakh sixty

thousand five hundred and ninety five so now what I’ll do is from this ratings

dataset I will be keeping only those user IDs which are present in the

sample_users object that is I would need only the sample users from all of the

users and I will store the result back to ratings right now let me have a glance at the

number of ratings so nrow of ratings
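
The sampling steps above can be sketched in base R as follows. The toy table of 500 users is an assumption, and base subsetting with `%in%` is used here instead of whichever filtering command was run in the session, so that the sketch needs no extra packages.

```r
set.seed(1)             # same seed value as in the session
user_fraction <- 0.02   # keep 2% of the users

# Toy stand-in for the cleaned ratings table: 500 users with 1 to 5 ratings each
ratings <- data.frame(user_id = rep(1:500, times = rep(1:5, 100)))
ratings$rating <- sample(1:5, nrow(ratings), replace = TRUE)

# All distinct users, then a random 2% of them (rounded to a whole number)
users <- unique(ratings$user_id)
sample_users <- sample(users, round(user_fraction * length(users)))
length(sample_users)

# With the 45,016 users quoted in the session the same arithmetic gives 900
round(user_fraction * 45016)

# Keep only the ratings given by the sampled users
nrow(ratings)
ratings <- ratings[ratings$user_id %in% sample_users, ]
nrow(ratings)
```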

so now we see that the number of ratings has reduced to eighteen

thousand eight hundred and thirty-two so initially we had more than 9 lakh

ratings so now after filtering the data set we

have just eighteen thousand eight hundred and thirty-two ratings all

right so our second task was to plot the distribution of these ratings so let me

go ahead and do that so guys this is the command for that
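
The plotting command that is walked through next might look roughly like this. The toy data is hypothetical. Wrapping the fill variable in `factor()` is an assumption made here because a brewer palette needs a discrete variable, and `guides(fill = "none")` is the current spelling of setting the guide to false.

```r
library(ggplot2)

# Toy stand-in for the sampled ratings table
set.seed(1)
ratings <- data.frame(rating = sample(1:5, 200, replace = TRUE))

p <- ggplot(ratings, aes(x = rating, fill = factor(rating))) +
  geom_bar(color = "grey20") +              # bar outline colour
  scale_fill_brewer(palette = "YlGnBu") +   # yellow-green-blue palette
  guides(fill = "none")                     # hide the legend

p
```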

again I am using the ratings dataset and on top of this I am building the ggplot

so here I am mapping the rating column on to the x-axis so this column

over here so we have different ratings 1 2 3 4 and 5 so I am mapping this column

on to the x-axis so the fill color would also be determined by the rating column

and after that since we’d have to make a bar plot I am using the geom_bar

function and the color which I give to the boundary of the bars would be

grey20 and the fill colors for all of these bars would come

from this palette over here so the palette is YlGnBu so this stands for

yellow green and blue so we’ll be giving this inside the scale_fill_brewer

function all right and I am also setting the guides to be

false let me hit enter so this is what you get let me zoom this now so this is

quite an interesting plot isn’t it so let’s have a glance at this bar over

here so this basically tells us that there are more than 6,000 cases where a

rating of 4 stars was given now similarly we see that there are more than 5,000

cases where a rating of 5 stars was given and this bar over here so this tells us

that around 4700 times a rating of 3 stars was given so the count of these 2

bars is quite low so there have been very few cases where a rating of 1 star

was given so maybe not even 500 times a rating of 1 star was given so this is

for the rating of two stars so around a thousand times a rating of two stars would

have been given so guys this is the distribution of the ratings now after which we had to find the

number of ratings for each book so let’s also do that so here again I start off by giving the

ratings dataset and I would have to group this with respect to book ID

because I’d want to find the number of ratings for each book so that is why I

am grouping it with respect to book ID now after grouping it I will use the

summarize function so basically inside the summarize function I will

get the count of the number of ratings for each book so here I am again using the n

function so this n function would give me the number of ratings for each book

and I have also named the result number_of_ratings_per_book after which

I’ll again use the pipe operator and add a ggplot layer on top of it and

I am mapping number_of_ratings_per_book onto the x-axis the fill color

is orange the boundary color is grey20 and the x-axis values would

range from 0 to 40 right guys so this is the plot let me

zoom this now so from this graph we can basically infer that there is not even

one case where a book was rated more than 10 times so let’s have a glance at

this bar over here so this tells us that there are more than 2500 instances where

a single book was rated only by one user so this is for those instances where a

single book was rated by two users or in other words a single book was rated two

times this is for those instances which tell us that a single book was rated by

three users or in other words a single book was rated three times and the count

for this is around 1500 times right so this plot was for the number of

ratings for each book then we had to get the percentage

distribution of the different genres so what we’ll do is we’ll start off by

making a new object and giving it the name genres

so this genres object would basically have a list of different genres in it so I have basically listed down a bunch

of different genres over here and I am storing all of these into the genres

object so the different genres are art biography science thriller travel humor

and comedy and so on so after building the genres object what I’ll do is from

the tags dataset I will be extracting only those tag names which are present

in the genres or in other words I am extracting only those genres which are

listed down over here so this is the command for that so what I’m basically

doing over here is I am finding out if the listed genres are present in the tag

names or not and if they are present I am extracting only those genres and I am

storing them in available_genres so let me hit enter and let me actually see

what are the available genres right so these are all of the available genres in

the tags dataset so there are 27 genres in total and these are Christian

business poetry philosophy science and so on now similarly I will extract all of

the corresponding tag IDs with respect to the tag names so let me find out which are the

available tags so over here I am basically extracting all of those tag

IDs if the tag name is present in one of the available genres right so if the

tag name is present in one of the available genres only those tag IDs I

am extracting and similarly if the tag name is not present in the available

genres then I won’t be extracting those tag IDs and I am storing the result in

available_tags so next we have to make a plot for the percentage of each genre so

let’s go ahead and do that so before we do that let’s actually get the count of

the different genres available so this would be the command for that

let me paste it over here so what I’m basically doing over here is from the

book_tags dataset I am extracting only those tag IDs which are present in

available_tags and then again I am grouping it with respect to the tag ID

after which I’ll use the summarize function and get the number of counts of

each of these tag IDs or in other words I’ll get the count of the different

genres so let me hit enter and let me see what do we get so this is the tag ID

2938 and for this corresponding genre the count is 436 that is there are

four hundred and thirty six books belonging to this genre similarly this

is the tag ID 4605 and the count is 1109 so this means that there are

1109 books present for this particular genre over here and let’s take this over

here so the tag ID is 7778 and the count is 469 so this means that there

are four hundred and sixty nine books present with respect to this genre

now let me go ahead and also find the percentage so let me select all of this

code over here right so we had run the command till here so we

basically got the count of each genre now after getting the count of each

genre I am ungrouping it again after that I will

use the mutate function and find the total count that is the total count of

all of the genres combined and I am also creating a new column percentage so

for this percentage over here I am dividing n by sum of n that is this would give

me the percentage of each genre and after getting the percentage of each

genre I will arrange the data set in descending order and after

arranging the data set in descending order

I will also left join the tags data set to the book tags data set and the

joining would be done by the tag ID column over here so let me store this in

a new object so let’s say book_info now let me have a glance at it View of book_info

right so guys this is the tag ID this is the count of the tag ID that is

the number of times this genre is present and this column gives us the

total count of all of the genres this gives us the percentage of the genre

and this is the tag name which is fantasy so we’ve got a data set ready
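
The genre lookup and the percentage pipeline described above can be sketched end to end as follows. The two toy tables and the short `genres` vector are hypothetical stand-ins for tags.csv, book_tags.csv and the longer genre list used in the session, and the column name `sumN` is an assumed name for the total-count column.

```r
library(dplyr)

# Toy stand-ins for tags.csv and book_tags.csv (hypothetical values)
tags <- data.frame(
  tag_id   = 1:5,
  tag_name = c("fantasy", "poetry", "to-read", "science", "favorites")
)
book_tags <- data.frame(
  goodreads_book_id = c(1, 1, 2, 2, 3, 3, 4, 5),
  tag_id            = c(1, 3, 1, 2, 1, 4, 4, 5),
  count             = c(50, 900, 40, 10, 30, 20, 15, 700)
)

# A short, assumed list of genres of interest
genres <- c("art", "fantasy", "poetry", "science", "travel")

# Genres that actually occur as tag names, and their tag IDs
available_genres <- genres[genres %in% tags$tag_name]
available_tags   <- tags$tag_id[tags$tag_name %in% available_genres]

book_info <- book_tags %>%
  filter(tag_id %in% available_tags) %>%          # keep genre tags only
  group_by(tag_id) %>%
  summarize(n = n()) %>%                          # books per genre tag
  ungroup() %>%
  mutate(sumN = sum(n), percentage = n / sumN) %>% # share of each genre
  arrange(desc(percentage)) %>%
  left_join(tags, by = "tag_id")                  # attach the tag names

book_info
```

As in the session, the column called percentage holds a share between 0 and 1 rather than a value out of 100.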

now we’ll go ahead and make a plot on top of this so now let me go ahead and

make a plot so the object name was actually book_info

so let me change this to book_info over here and on top of this book_info

object I am adding a ggplot layer so here I’ll be mapping the percentage

column on to the y-axis and the tag name column on to the x-axis and the fill

would be determined by the percentage column and since we want to make a bar

plot we’ll be using the geom_bar function and the stat which I’ve used is identity

and I’ll also use the coord_flip function over here because I’d want these bars to

be laid out horizontally and not vertically and the color of these bars

would be determined by this palette over here so this is YlOrRd so this

would be for yellow orange and red and the label which I’ve given for the

y-axis is percentage and the label which I’ve given for the x-axis is genre

let me hit enter right guys so this is the plot
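
A rough sketch of the genre plot described above, on hypothetical data. Two choices here are assumptions rather than something stated in the session: `reorder()` is added so the bars are sorted by size, and `scale_fill_distiller()` is used to apply the YlOrRd palette because the fill variable is continuous.

```r
library(ggplot2)

# Toy stand-in for the book_info table (hypothetical shares)
book_info <- data.frame(
  tag_name   = c("fantasy", "romance", "mystery", "poetry", "cookbooks"),
  percentage = c(0.40, 0.25, 0.20, 0.10, 0.05)
)

p <- ggplot(book_info,
            aes(x = reorder(tag_name, percentage),  # sort bars by share
                y = percentage,
                fill = percentage)) +
  geom_bar(stat = "identity") +              # bar height taken from the data
  coord_flip() +                             # horizontal bars
  scale_fill_distiller(palette = "YlOrRd") + # yellow-orange-red palette
  labs(y = "Percentage", x = "Genre")

p
```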

so here after the flip genre is on the y-axis and percentage is on the x-axis so we

see that fantasy is the most prevalent genre or in other words the largest share of the

books belong to the fantasy genre and the least percentage is of the cookbooks

so this was the distribution of the percentage of different genres so up

next we will go ahead and find the top 10 books with the highest rating so this

would be the command for that so if you have to find out the top ten

books with the highest rating all you have to do is arrange this average rating

column in descending order and that is what we are doing over here so here I

have given the name of the object which is books and I am arranging this data

set in descending order of average rating after which I am selecting just the top

10 records and the columns which I’d be selecting are title ratings count and

average rating so let me store it in top10 let me have a glance at top10 now right guys so these are the top 10

highest rated books so The Complete Calvin and Hobbes is the book with the

highest rating so it has the highest average rating of four point eight two

and then we have Words of Radiance so it has a rating of four point seven seven

third in place is the Harry Potter box set which has a four point seven seven

rating fourth is the ESV Study Bible which has a rating of four point seven six

fifth in the list is the Mark of the Lion trilogy which has an average rating of

four point seven six right guys so we have successfully found out the top ten

books with the highest ratings next we’ll go ahead and also find the

top 10 most popular books so this would be the command for that so to find out

the top 10 most popular books we’ll have to arrange this ratings count column in

descending order that is whichever book has the most number of ratings it would

automatically mean that it is the most popular book right so this is the

command for that let me run it over here so here what we are doing is on the

books data set I am arranging it in descending order of the ratings count

column and then I will be extracting the top 10 records and I’ll be selecting the

title column the ratings count column and the average rating column so let me store

this in top_popular let me have a glance at top_popular
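
Both rankings described above follow the same arrange, take ten, select pattern. A minimal sketch on hypothetical data, assuming the columns are named `title`, `ratings_count` and `average_rating`; the object names `top10` and `top_popular` are assumed spellings of the names spoken in the session.

```r
library(dplyr)

# Toy stand-in for books.csv (hypothetical values)
set.seed(1)
books <- data.frame(
  title          = paste("Book", 1:25),
  ratings_count  = sample(100:100000, 25),
  average_rating = round(runif(25, 3, 5), 2)
)

# Ten best rated books
top10 <- books %>%
  arrange(desc(average_rating)) %>%
  head(10) %>%
  select(title, ratings_count, average_rating)

# Ten most popular books, measured by number of ratings
top_popular <- books %>%
  arrange(desc(ratings_count)) %>%
  head(10) %>%
  select(title, ratings_count, average_rating)

top10
top_popular
```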

right so these are the top 10 most popular books so the most popular book

in the list is the Hunger Games which has the highest ratings count then the

second most popular book is Harry Potter third most popular book is Twilight and

the fourth most popular book is To Kill a Mockingbird in the third phase we’ll

finally do some recommending so we’ll start off by building the user based

collaborative filtering model and then we’ll recommend six new books for two

different readers right guys so it’s finally time to recommend some books

so before we go ahead and do that we would actually have to restructure our

data a bit so our data actually needs to be in the form of a matrix where all the

rows should correspond to the users and all the columns should correspond to the

books so the dimension names would then be nothing but the user IDs and the book

IDs so the user IDs would represent all of the rows and the book IDs would

represent all of the columns so let me go ahead and extract all of the

dimension names first so here with this command what I am doing is I am

extracting all of the unique user IDs and I’m also extracting all of the

unique book IDs and I’m storing them in this object dimension_names so this is the

first step so we have basically got all of our dimension names now we’ll go ahead

and convert the format of our data frame from long format to wide format so with

this command what we are doing is we are actually selecting the book ID the user

ID and the rating columns from this data frame over here and I’ll be splitting

this book ID column that is this value of book ID one would become one column

this value of book ID two would become the second column this value of book ID

3 would become the third column and so on and these rating values over here

they would become the values for the corresponding book IDs so this is how we

can use the spread function to spread out our data frame from long format to

wide format and we’ll also remove the user ID column because it doesn’t serve

a purpose and I will store it in a new object and name that object

ratingmat so let me hit enter right so we have created our rating matrix
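
A small sketch of the long to wide conversion described above, on hypothetical data. The IDs are sorted when building `dimension_names`, which is an assumption added here so that the names line up with the order in which `spread` arranges rows and columns.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the sampled ratings table (long format)
ratings <- data.frame(
  user_id = c(1, 1, 2, 3, 3),
  book_id = c(10, 11, 10, 11, 12),
  rating  = c(5, 3, 4, 2, 5)
)

# Row and column names for the matrix built later
dimension_names <- list(
  user_id = sort(unique(ratings$user_id)),
  book_id = sort(unique(ratings$book_id))
)

# One row per user, one column per book, NA where no rating exists
ratingmat <- ratings %>%
  select(book_id, user_id, rating) %>%
  spread(book_id, rating) %>%
  select(-user_id)

ratingmat
```

`spread` still works but has been superseded in tidyr by `pivot_wider`, which does the same reshaping.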

so we have created the ratingmat object now let me have a glance at the

class of this so class of ratingmat so we see that this is still in the form of

a data frame but we can build our user based collaborative filtering model only

on top of a realRatingMatrix so first we’ll have to go ahead and convert

this data frame into a matrix so let me do that so here with the help of the as.matrix

function I am converting ratingmat from a data frame

to a matrix and I am storing the result back to ratingmat right so now let me

have a glance at the class so class of ratingmat and we see that now it is a

matrix so let me have a glance at the first five rows and the first five

columns of it so these are the first five rows and

these are the first five columns so we see that this user ID column has not

been removed so let us go ahead and manually remove it

so what I’m doing is I am manually removing this first column and I’m

storing it back into ratingmat now let me have a glance at the first five rows and

first five columns right so these are the first five rows and these are the

first five columns so these rows basically correspond to all of the user

IDs and these columns basically correspond to all of the book IDs so

these NA values which you see over here so this basically means that the first

user has not rated the first book the first user has not rated the second book

similarly the fourth user has not rated the third book and so on right so we

have a rating matrix ready now let me also assign the dimension names to the

dimension names of this ratingmat object so here I am assigning all of the

dimension names which I’ve extracted to the dimnames of ratingmat right

so now let me have a glance at the dimnames of ratingmat so dimnames of

ratingmat so these are all of the dimension names

for the rows so this is the name for row number one this is the name for row

number two and this is the name for row number three
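
The conversion to a matrix, the manual removal of the leftover first column and the assignment of dimension names can be sketched in base R. The small wide data frame below is a hypothetical stand-in for the one produced in the session.

```r
# Toy wide table that still carries a user_id column, as seen in the session
ratingmat <- data.frame(
  user_id = c(1, 2, 3),
  `10` = c(5, 4, NA),
  `11` = c(3, NA, 2),
  `12` = c(NA, NA, 5),
  check.names = FALSE
)
dimension_names <- list(user_id = c(1, 2, 3), book_id = c(10, 11, 12))

class(ratingmat)                   # still a data frame at this point
ratingmat <- as.matrix(ratingmat)  # convert to a numeric matrix
ratingmat <- ratingmat[, -1]       # drop the leftover user_id column

# Rows are named by user ID, columns by book ID
dimnames(ratingmat) <- dimension_names

ratingmat
dim(ratingmat)
```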

so this basically signifies that all of the rows are represented by the user IDs

similarly if I go down then we have all of the book IDs so all of the columns

are represented by the book IDs over here now let me use the dim function to

find out the number of rows and columns in the matrix so we see that there are

900 rows and eight thousand four hundred and thirty-one columns or in other words

there are nine hundred users and eight thousand four hundred and thirty one

books so we have got our matrix ready but we can’t just build our user based

collaborative filtering model on top of this matrix so we have something known

as a realRatingMatrix and the recommenderlab package basically works only on

this type of object so I will first store this ratingmat in a new object and

name that object ratingmat0

let me again have a glance at the dimensions of this dim of ratingmat0

so we have the same number of rows and columns so again here the

number of rows is 900 and the number of columns is 8431 now after this what

we’ll do is wherever we have NA values we will replace those NA values with 0

so in ratingmat0 wherever we find NA values I am replacing those NA

values with 0 so now let me again have a glance at the first 5 rows and the

first five columns of this so ratingmat0 1 to 5 and 1 to 5 so we see

that all of the NA values have been replaced with zeros so now we can go

ahead and convert this matrix into a sparse matrix so I’ll be using the as function and I

am converting this ratingmat0 object into a sparse matrix and I will

store this in a new object and name that object sparse_ratings now let me

again have a glance at the first five rows and first five columns of this

so sparse_ratings 1 to 5 and 1 to 5 so this is what a

sparse matrix looks like so basically with the help of a sparse matrix we end

up saving a lot of space now it’s the last part of the transformation so we

will go ahead and convert this sparse matrix into a realRatingMatrix so for

this we would need the new function and with the help of the new function I am

converting this sparse_ratings object into a realRatingMatrix so this over

here takes in two parameters so the first parameter is basically what we are

trying to convert this into and then we give in the data which we are trying to

convert and I’ll store this result in the real_ratings object
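
The three transformation steps above (NA to zero, dense to sparse, sparse to realRatingMatrix) might look like this on a tiny hypothetical matrix. It assumes the Matrix and recommenderlab packages are installed; the object names are assumed spellings of those spoken in the session.

```r
library(Matrix)
library(recommenderlab)

# Toy 3 users x 3 books rating matrix with missing ratings (hypothetical)
ratingmat <- matrix(
  c(5, 4, NA,
    3, NA, 2,
    NA, NA, 5),
  nrow = 3,
  dimnames = list(c("1", "2", "3"), c("10", "11", "12"))
)

# Copy the matrix and replace missing ratings with 0
ratingmat0 <- ratingmat
ratingmat0[is.na(ratingmat0)] <- 0

# Zeros are not stored in a sparse matrix, which saves space
sparse_ratings <- as(ratingmat0, "sparseMatrix")

# Wrap the sparse matrix in the class recommenderlab works with
real_ratings <- new("realRatingMatrix", data = sparse_ratings)

real_ratings
```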

let me print real_ratings now and let’s see what do we get real_ratings so this

is what we get so this is a rating matrix where there are 900 rows and

8431 columns and it is of class realRatingMatrix with 18,832 ratings

right so we finally have our realRatingMatrix ready so now we can go

ahead and build a model on it so what we’ll do is we’ll go ahead and split the

data set into train and test sets so it’ll be an 80/20 split so I’ll be

using the sample function to create this 80/20 split over here so I am using the

sample function and I will generate true or false values over here and the

sampling would be with replacement and the probability is 80/20 that is I would

want to divide the data set into two parts where the first part would comprise

80% of all of the observations and the second part would comprise 20% that is

the rest of the observations and I will store the split criteria in a new

object and name that object split_book so now in real_ratings wherever

the value of split_book is equal to true I will select all of those records and

store those records in rec_train similarly from the real_ratings matrix

wherever the value of split_book is false I am extracting all

of those observations and storing those observations in rec_test so we have

our training and testing sets ready now we can finally go ahead and build our

first model so we’ll be using the Recommender

function and this over here takes in two parameters first is the training set on

which we want to build the model and next is the method or the type of the

recommender model which we’d want to build and since we’d want to build a

user based collaborative filtering model we’ll give in the method as UBCF so if

we wanted an item based collaborative filtering model then the method would be

IBCF but in our case since we want a user based collaborative filtering model

the method would be UBCF and I will store the result in an object which

would be rec_model_ubcf right so I have successfully built the model now

I’ll give a value for the number of books to be recommended so I am

assigning the value 6 to a vector and the name of that vector is

n_recommended_ubcf so this basically lets us know that we’d have to recommend six

books in total right so the model building process is done so let me go

ahead and predict the values so I’d have to use the predict function

for this so over here this takes in three parameters first is the model

which we built next is the data set on which we want to predict the values and

third is the number of values to be recommended right so first we’ll give in

the model which we built which is rec_model_ubcf and then we want to

predict on top of the test set so the test set is rec_test and the number of

books to be recommended is six which is stored in n_recommended_ubcf and I

will store this result in rec_predicted_ubcf right so the prediction

is also done now let me go ahead and find out the item numbers which have

been predicted so now that we have predicted the values

let’s go ahead and recommend some books to user number one so here we will use this object

rec_predicted_ubcf so this is the object which we’ve just built using the predict

function and I want to find out the item numbers which have been recommended to

user number one so this command over here rec_predicted_ubcf at the rate

items 1 so this will give me the column numbers which have been

recommended to user number one so let me have a glance at user 1 book

numbers so this basically means that user 1 has been recommended to read

these books which are present in column number 2177 column number 6343 column

number 423 column number 1908 column number 2274 and column number

2514 now we just have the column numbers so let’s actually find out the labels for

these column numbers right so from rec_predicted_ubcf we

have the item labels and inside the item labels

I will give in the book numbers which have been recommended right so these are the

item labels so the book ID which is under column number 2177

is 2636 the book ID which is under column number

6343 is 7482 similarly the book ID which is under

column number 2074 is 2750 so we have successfully got the

book IDs so now that we have the book IDs ready let’s actually use these book

IDs to extract the name of the book and the name of the author right so what I’ll do is I will extract

the title of the book and the author’s name where the book ID is 549 so the

name of the book is Corduroy and the author’s name is Don Freeman similarly

let me extract the title and the author’s name where the book ID is

2750 so here we see that the book recommended is Mrs. Piggle-Wiggle

and the authors are Betty MacDonald and Alexandra Boiger

similarly let me extract the titles for the book ID 3029 let’s see

what do we get 3029 so the book recommended is Rage of

Angels and the author is Sidney Sheldon right so these are the six books which

have been recommended and this is how we can extract the name of the book and the

author of the book now similarly let’s also go ahead and

recommend six books for the user number five so this is what I’d have to do I

will be using the wreck predicted ubc of object and I’ll be using the at the rate

tag and extracting the items for user number five and I’ll store this result

in user five book numbers so now that we have the item numbers let’s actually

extract the column numbers for each of these items so for this the command

would be direct predicted you BCF at the rate item labels and inside this I will

give in the book numbers right so these are the book IDs which have been

predicted right so similarly let’s extract the title and authors for some

of these book IDs so we are extracting the title and the

author’s name where the book IDs for six to four so though

so the recommended book the name of the recommended bukas the girl who

circumnavigated fairyland in a ship of her own making and the author of

scattering am valentin and anna one similarly let’s extract the name of the

book and the author’s name well the book ID is six eight six seven so here the

book is Malgudi days and it’s been written by RK Narayan and Champa Larry

now let me extract the title and the author’s name for the book ID which is

seven three to six so the books name is doctors and it has been written by Eric

cycle right guys so we have successfully implemented the user based collaborative

filtering model and we have recommended six books to two different users
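
Since the same three steps were repeated for user 1 and user 5, they could be wrapped in a small helper. The sketch below is not the code from the session: `recommend_books()` is a hypothetical function name, `ToyTopN` is a toy stand-in for the prediction object, and the column positions and `books` layout are assumptions made for illustration.

```r
library(methods)  # S4 classes; already attached in an interactive R session

# Hypothetical stand-in exposing the same two slots as the prediction object.
setClass("ToyTopN", representation(items = "list", itemLabels = "character"))

rec_predicted_ubcf <- new(
  "ToyTopN",
  items = list(c(1L, 2L, 3L), integer(0), integer(0), integer(0), c(4L, 5L, 6L)),
  itemLabels = c("549", "2750", "3029", "4624", "6867", "7326")
)

# Assumed books table; IDs, titles, and authors are the ones named in the session.
books <- data.frame(
  book_id = c(549L, 2750L, 3029L, 4624L, 6867L, 7326L),
  title = c(
    "Corduroy", "Mrs. Piggle-Wiggle", "Rage of Angels",
    "The Girl Who Circumnavigated Fairyland in a Ship of Her Own Making",
    "Malgudi Days", "Doctors"
  ),
  authors = c(
    "Don Freeman", "Betty MacDonald, Alexandra Boiger", "Sidney Sheldon",
    "Catherynne M. Valente, Ana Juan", "R.K. Narayan, Jhumpa Lahiri", "Erich Segal"
  ),
  stringsAsFactors = FALSE
)

# Column numbers -> book IDs -> rows of the books table, for one user index.
recommend_books <- function(pred, books, user) {
  book_numbers <- pred@items[[user]]
  book_ids <- as.integer(pred@itemLabels[book_numbers])
  books[match(book_ids, books$book_id), c("book_id", "title", "authors")]
}

print(recommend_books(rec_predicted_ubcf, books, 1))
print(recommend_books(rec_predicted_ubcf, books, 5))
```

A user with no recommendations (user 3 in this toy data) simply yields a data frame with zero rows.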

Just a quick note: in case you are looking for an end-to-end course certification in R programming, Intellipaat provides the R Programming for Data Science program, where you can learn all of these concepts thoroughly and earn a certificate at the same time. The link is given in the description box below, so make sure to check it out.

I hope you took away a lot from this detailed session. If you have any queries, make sure to head down to the comment section below and let us know, and we’ll be happy to help you out there. On that note, thank you for watching, and have a nice day.
