R Programming | R Language Tutorial | Data Science with R Course | Intellipaat


Hey everyone, welcome to the session by Intellipaat. In a world where we are surrounded by data, concepts such as data science and data analytics always take center stage. In fact, R programming is one of the biggest assets of a data scientist or data analyst, and in this session on the R language tutorial we're going to check out everything there is to know about R programming.
Well, before we begin with the session, make sure to subscribe to Intellipaat's YouTube channel and hit that bell icon so that you never miss an update from us.
Here's the agenda for today. We'll begin by checking out what R programming actually is. After that we'll check out everything there is to know about variables and the types of operators present in R, and following this we'll check out all the objects present in R as well. After this we'll take a quick introduction to the flow control statements, and finally you will have a complete hands-on project guide where you can build an entire book recommendation system from scratch.
And guys, if you have any queries, make sure to head down to the comment section below and do let us know, and we'll be happy to help you out there. Also, if you guys are looking for an end-to-end course certification in the R programming language, Intellipaat provides the R programming for data science program, where you can learn all of these concepts thoroughly and earn a certificate at the same time.
Well, without further ado, let's begin the class. So what is R? R is a language developed by statisticians for statisticians, so if you want to perform any sort of statistical analysis, R should be a go-to language. Now R is also a great visualization tool: R provides packages such as ggplot2 and plotly so that you can create stunning visualizations. R is open source, cross-platform compatible software, so it's just plug and play: all you have to do is install R once and then you can start having fun with it. And since R is open source software, you can actually modify the code and add your own innovations to it, and being cross-platform compatible, you can actually run the same R code on different operating systems. And the best thing is that R is a Turing-complete language, that is, it can perform any task which a Turing machine can. That's amazing, isn't it?
And this is what makes R such a powerful tool, so you can do various operations such as statistical analysis, implement machine learning algorithms and also create stunning visualizations, all with the help of this super language called R.
Installing R: you can download R from cran.r-project.org, so let's go to the site. Right guys, so this is the CRAN network, and as you see we have different distributions of R for Linux, Mac and Windows respectively. Since I'm using a Windows system, I'll download it for Windows, and I'll click over here on install R for the first time. And guys, this is the latest version of R, 3.5.1. I'll click over here on download R 3.5.1 for Windows.
Right, so we get this dialog box over here and I'll click on run, and the download will start. After installing R we also require an IDE to make our tasks easier. One such IDE is RStudio, and we can download RStudio from rstudio.com. Right guys, so this is rstudio.com. I'll click on download RStudio over here, and this is the free version, so I'll click on download. Again we have different distributions for Ubuntu, Windows, Mac and Fedora, and again, since I'm using a Windows system, I'll install it for Windows. I'll click on run and the download will start.
Right, so I've downloaded and installed both R and RStudio. Now let's open RStudio and have a look at the different windows present. Right guys, so this is RStudio and this is what RStudio looks like. As you can see, RStudio comprises four windows. This first window which you see over here is the script window; as the name states, you can do all of your scripting over here. So let's say I just type a equals 3 and then I'll type print of a. Right, so all of your scripting goes over here. If I want to, let's say, add a new script, I need to click on the plus symbol over here, I'll select R script, and we get a new script. Similarly, if I want to add another script, I just need to click over here again and we have another new script. OK, and if I want to save what I've written over here, I need to click over here and this will be saved. Let me give this a name, so I'll just give it some random name and I'll save this. Right, so I've saved this file as sgdsd.R. And if you want to run or execute the code, you have this Run button over here. So let me select these two lines and I'll click on run. When we click on run, the code goes to the console window and gets executed.
So guys, this was the script window. Now we can actually use the console window here directly to run our commands, so I will give some commands over here. Let me just run some basic mathematical operations. I will type 8 plus 5, so this gives us 13. Similarly, I will type 10 minus 5, which gives us 5, right.
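For anyone following along without the video, here is a minimal sketch of the script and console steps so far. The second script line is not fully clear from the recording, so print(a) is an assumption, and the object names sum_result and diff_result are made up for illustration.

```r
# Two lines as they might be typed in the script window and run with Run
a = 3       # store the value 3 in an object named a
print(a)    # show the value of a in the console (assumed second line)

# Expressions typed directly at the console prompt are evaluated right away
sum_result  <- 8 + 5    # hypothetical name, just to keep the result
diff_result <- 10 - 5   # hypothetical name, just to keep the result
print(sum_result)
print(diff_result)
```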
So you can directly use the console window to run all of your commands.
And over here what we see is the environment window. This gives us a glance at all the objects you have stored. Right, so till now we just have one object with the name a, and that is what we see over here: we have an object a whose value is 3. Now similarly, let's say I add another object b over here, b equals 10, and I'll print b. Let me execute these two lines of code. Right, so as soon as I executed these two lines of code, what we see is that another object has been added in this environment window. So initially we just had the object a, and after executing these two lines of code another object b has been added along with its value. And now if I want to clear all of these, I can do that by clicking here. Right, so this will remove all of these objects over here.
OK, and then we have the history section. This gives us a list of all the commands which we have executed till now.
And we have this final window at the bottom right corner. This window is for installing new packages, visualizing plots and accessing help whenever needed. So let me give you guys an example. Let me make a histogram: I will type hist and inside it I'll type in lynx, which is a dataset. So this is the visualization which we get over here. Similarly, if I want to install a new package, I'll click over here, I'll click on install, and let's say the package which I'd want to install would be tree. I'll click here, and when I click on install the package installation will start. And then you have help: whenever we need any sort of help we can use this window. So let's say there is a dataset called iris which is inbuilt in R. If I want to know information about this dataset, all I need to do is type the name of the dataset and hit enter. Right, so you get all of the information with respect to this data over here. So
guys, this is all about RStudio. So now that
we've installed R and RStudio, it's finally time to get on with the practical part. So we'll start by reading data into the R console. R offers different functions to read different data formats. If you want to read a text file you can use the read.table function, for comma-separated files you can use the read.csv function, for reading JSON files you have the fromJSON function, and similarly, if you want to read data from HTML tables, R provides a function called readHTMLTable. So you can read all of these formats of data into R, and since our customer churn data is of .csv type, I'll be using the read.csv function. So let's head on to RStudio.
So this is RStudio, guys. I'll use the read.csv function to read the customer churn data set. All right, I'll put in the double quotes. Now this is the data set, guys, this is the .csv file. I'll click on properties and I'll copy the path from over here and I'll paste the path. Now you need to keep in mind that this needs to be a forward slash when you're giving the path, right. And after giving the path I'll give the name of the file, which will be customer churn dot csv, and I will store this in a new object, and I'll name the object customer churn as well. Right, so this which you see is actually an assignment operator, so I'm reading this path and I'm saving this file into an object called customer churn. Right, now let me have a glance at this data set. What I'll do is I'll use the View function, so I'll type View and then customer churn. So this is our customer churn data set.
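As a rough, self-contained sketch of the reading step: the file name and columns below are made-up stand-ins for the customer churn file, written to a temporary folder so the code runs anywhere. It is also worth knowing that read.table and read.csv ship with base R, while fromJSON and readHTMLTable come from add-on packages (for example jsonlite and XML) that have to be installed first.

```r
# Hypothetical sample file standing in for the customer churn CSV
csv_path <- file.path(tempdir(), "customer_churn_sample.csv")
writeLines(c(
  "customerID,gender,tenure",
  "A-001,Female,1",
  "A-002,Male,34"
), csv_path)

# A path copied from Windows Explorer uses backslashes; inside a quoted
# path R wants forward slashes (or doubled backslashes)
csv_path <- gsub("\\", "/", csv_path, fixed = TRUE)

# Read the file and store the result in an object with an assignment operator
customer_churn <- read.csv(csv_path)

# View(customer_churn) opens the spreadsheet-style viewer in RStudio;
# head() prints the first rows in any R session
head(customer_churn)
```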
Guys, now I'll use the class function to have a look at the type of this object. So I'll use class and I'll give the name of the object, which is customer churn. What we see is that this object is of type data frame, that is, this is actually a data frame. Now if you're coming from an SQL background, you can see that a data frame is actually sort of a table. Over here each of these columns holds values of a single data type, and each row represents a record of this data frame. So let's say I select this column tenure; then this will be of one particular data type. Let me go ahead and see what the type of this is. I'll use class and I'll type customer churn dollar tenure. So we see that this is of integer type, that is, this tenure column is of integer type. Now another thing that you need to keep in mind is that all of these columns are actually vectors and factors: a vector is a set of homogeneous entities and a factor represents categorical values. So now let me type out class, and what I'll do is I'll type out customer churn dollar gender. So I get that the class of this object is factor.
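A small sketch of these class checks, using a made-up three-row data frame rather than the real churn file. One caveat: the video uses R 3.5.1, where read.csv turned text columns into factors by default. From R 4.0.0 onward the default is stringsAsFactors = FALSE, so gender would come back as character unless factors are requested explicitly, as done here.

```r
# Hypothetical miniature of the churn data (values are invented)
customer_churn <- data.frame(
  customerID = c("A-001", "A-002", "A-003"),
  gender     = c("Female", "Male", "Female"),
  tenure     = c(1L, 34L, 2L),
  stringsAsFactors = TRUE   # mimic the pre-4.0 default seen in the video
)

class(customer_churn)          # the whole object is a data frame
class(customer_churn$tenure)   # a single column is a vector, here integer
class(customer_churn$gender)   # a text column stored as a factor
levels(customer_churn$gender)  # the categories the factor can take
```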
Now what we see over here is that this just has two
categorical values, female and male. So whenever we have just categorical values, then most probably the class of that column is factor. Right, so we've understood what a data frame is and we've also had a look at vectors and factors. Now we'll go ahead and understand how we can access individual columns or individual elements of this data frame. Let's say I would want to access the column device protection from this entire data frame. So how can I do that? It's actually simple. I will start off by giving the name of the data frame, which will be customer churn, then I'll use a dollar symbol, and I automatically get a list of all of the columns. Since I need device protection, I will start typing device, I'll select this, and I'll store this in a new object, and I'll name this new object C underscore device protection.
Right, so let me have a glance at this object. I'll type View of C underscore device protection. So what I've basically done is, from this entire data set, I've just selected this column and stored it in a new object, and named that object C underscore device protection. Similarly, if I want to select the payment method column, I can do the same thing. I'll type out the name of the data frame, I'll use the dollar symbol, and I'll type the name of the column, which will be payment method over here, and I'll store this in, let's say, C underscore payment. Now let me have a glance at this, so I'll type View of C underscore payment. Right, so I have separated this column from the entire dataset.
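Here is a minimal sketch of the dollar-symbol selection on a made-up data frame. The column spellings DeviceProtection and PaymentMethod and the object names are assumptions for illustration; use whatever names your own data frame shows after the dollar symbol.

```r
# Hypothetical miniature of the churn data (values are invented)
customer_churn <- data.frame(
  gender           = c("Female", "Male", "Male"),
  DeviceProtection = c("No", "Yes", "Yes"),
  PaymentMethod    = c("Electronic check", "Mailed check", "Mailed check")
)

# data_frame$column pulls one column out as a vector
c_device_protection <- customer_churn$DeviceProtection
c_payment           <- customer_churn$PaymentMethod

print(c_device_protection)
print(c_payment)
```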
All right, so this is one way by which we can select individual columns. We can also use a second method, which is also simple. We'll start off by giving the name of the data set and then I'll give the square braces. Now inside the square braces, the values on the left side of the comma represent the rows, and the values on the right side of the comma represent the columns. Let's say I would want to select this gender column. What I will do is give the number of the column, that is, the position at which the column is present, so I'll type out 2 over here, and I'll save this in, let's say, C underscore gender. Now let me have a look at this, so I'll type View of C underscore gender. Right, so I have used the square braces to select the second column from this customer churn data set. Similarly, if I want to select the first column, I can do the same thing. What I'll do is I'll just put 1 over here and I'll name this object C underscore ID. Right, so let's have a look at this: View of C underscore ID. This is the column which gives me the customer IDs. All right, now you can also select multiple columns in the same way. Let's say I would want to select the second, sixth and seventh columns from the customer churn data set. What I'll do is I will put a comma over here, I'll use the combine function, c, and I'll give all of the columns that I want to select. So I need the second column, the sixth column and the seventh column. All right, and I'll store this in C underscore 2 6 7. So let's see what the result is. Again I'll type View of C underscore 2 6 7. Right, so I have successfully selected the second column, sixth column and seventh column.
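The square-brace method can be sketched like this on a made-up data frame whose first seven columns follow the order mentioned in the session (gender second, tenure sixth, phone service seventh). The Partner and Dependents columns, the values and the object names are assumptions for illustration.

```r
# Hypothetical miniature with seven columns in the order described
customer_churn <- data.frame(
  customerID    = c("A-001", "A-002", "A-003"),
  gender        = c("Female", "Male", "Male"),
  SeniorCitizen = c(0L, 0L, 1L),
  Partner       = c("Yes", "No", "No"),
  Dependents    = c("No", "No", "Yes"),
  tenure        = c(1L, 34L, 2L),
  PhoneService  = c("No", "Yes", "Yes")
)

# Left of the comma: rows (empty means all rows); right of the comma: columns
c_gender <- customer_churn[, 2]           # second column, returned as a vector
c_id     <- customer_churn[, 1]           # first column
c_2_6_7  <- customer_churn[, c(2, 6, 7)]  # several columns, still a data frame

names(c_2_6_7)
```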
So let's actually verify. Yes, this is the second column; three, four, five, six, so six is tenure and seven is phone service. All right, so we have the sixth and the seventh column. So this is how we can separate out some specific columns from the data set. Well, let's say I would want these columns over here; I can actually type out the names of those columns. So what I'll do is I'll type monthly charges and I'll save this in C month. So let me have a look at this: View of C month. All right, so I have successfully selected the monthly charges column. Now I can also select a set of contiguous columns. Let's say I would want to select all of the columns from senior citizen to multiple lines, that is, column numbers three, four, five, six, seven and eight. So I would want all of the columns from column number three to column number eight. Let's go ahead and do that now. For that I'll type the name of the dataset, I'll put a comma over here, I'll put 3, and after 3 I'll use the colon symbol and put 8 over here. This will give me all of the columns starting from the third column to the eighth column. I will save this in C underscore 3 8. Let me have a look at this now: View of C underscore 3 8. Right, so I have successfully selected all of the columns from the third to the eighth, so the third column is senior citizen and the eighth column is multiple lines. Right, so this is how we can filter out some specific columns. So what if we wanted to filter out some specific rows? Let's go ahead and understand how we can do that. As I've already stated, if you want to filter out rows, you give them on the left side of the comma, and if you want to filter out specific columns, you give them on the right side of the comma. So let's go
ahead and filter out some rows. So now let's say I would want the second row of the data set. All I need to do is type in 2 over here, before the comma, and this will give me the second row. Right, so let me store it in C underscore 2 and let me have a look at this: View of C underscore 2. All right, so this is the second record, whose customer ID is double five seven five. So let me just verify. All right, so this is the second record over here and this is the same record that we have successfully filtered out. Similarly, let's say I would want the record at row number hundred, so I'll just type out 100 over here and let me store this in C underscore hundred. Let me have a look at the result: View of C underscore hundred. All right, so this is the record which is placed at the hundredth row, whose customer ID is this. Now we'll go ahead and understand how we can filter out multiple rows. Let's say I would want the first, fifth and tenth rows. I'll use the combine function over here and I'll give all of the row numbers, so I need the first row, fifth row and the tenth row, and I'll store this in, let's say, C underscore 1 underscore 5 underscore 10. All right, so let me have a glance at this: I'll type View of C underscore 1 underscore 5 underscore 10. Right, so I have successfully filtered out the first, fifth and tenth rows. We see that the gender of the customers in the first and fifth rows is female and the gender in the tenth row is male. Now what we'll do is we'll filter out a sequence of rows. Let's say I would want all the rows from row number 100 to row number 200. What I'll do is I'll type 100 colon 200 and I'll store this in C underscore 100 200. Let me have a glance at this: I'll type View of C underscore 100 200. Right, so this gives me hundred and one entries starting from row number 100 to row number 200. Similarly, let's say I would want all of the rows from 5,000 to 7,000; we can also do that. I'll just simply put 5,000 over here and then I'll type in 7,000 over here.
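Row filtering can be sketched on a small made-up data frame. Only twelve rows are used here so the example stays short, which means the row numbers are scaled down from the ones in the session; the object names are hypothetical.

```r
# Hypothetical twelve-row stand-in for the churn data
customer_churn <- data.frame(
  customerID = sprintf("ID-%02d", 1:12),
  tenure     = 1:12
)

# Row positions go on the left of the comma; leaving the right side
# empty keeps every column
c_2      <- customer_churn[2, ]            # a single row
c_1_5_10 <- customer_churn[c(1, 5, 10), ]  # several rows via c()
c_3_8    <- customer_churn[3:8, ]          # a run of rows via the colon

# A range a:b always yields b - a + 1 rows, which is why 100:200
# gives 101 entries and 5000:7000 gives 2001
nrow(c_3_8)
```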
Right, so let me store this in C underscore 5,000 7,000 over here. Let me have a glance at this, so I'll type View of C underscore 5,000 7,000. Right, so there are 2,001 entries starting from row number 5,000 to row number 7,000. Now what we'll do is we'll give both the row numbers and the column numbers, so that we filter out some specific rows and some specific columns. Let's go ahead and do that. Let's say I would want the row numbers from 50 to 60 and I would want just the second and the third column. So I'll give the row numbers from 50 to 60, and for just the second and third column I'll use c of 2 comma 3, and let's store this in C underscore random one. Let me have a glance at this: View of C underscore random one. All right, so this gives me all of the records from row number 50 to row number 60, and I have selected just these two columns, gender and senior citizen.
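Giving rows and columns together looks roughly like this. The data frame is a made-up twelve-row stand-in, so rows 5 to 10 play the part of rows 50 to 60 from the session.

```r
# Hypothetical twelve-row stand-in for the churn data
customer_churn <- data.frame(
  customerID    = sprintf("ID-%02d", 1:12),
  gender        = rep(c("Female", "Male"), times = 6),
  SeniorCitizen = rep(c(0L, 0L, 1L), times = 4),
  tenure        = 12:1
)

# Rows on the left of the comma, columns on the right
c_random1 <- customer_churn[5:10, c(2, 3)]

dim(c_random1)    # number of rows, then number of columns
names(c_random1)  # which columns were kept
```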
Now let me actually go ahead and complicate this a bit. Let's say I would want row numbers from 100 to 200 and also from 1,000 to 2,000, and I would want the columns 2, 3, 5 and 7. So let's go ahead and do that now. Right, so I'll use c over here: the first part would be row numbers 100 to 200, and I will also need row numbers from 1,000 to 2,000. Right, so this is my first part, where I'm filtering out all of the rows. And I would need column numbers 2, 3, 5 and 7, so let me go ahead and put down all of these column numbers, and let me store this in C underscore random 2. So let me have a look at this: View of C underscore random 2. All right, so I have one thousand one hundred and two entries in total, where I have rows from 100 to 200, and as you see over here, just after row number 200 the row number jumps to 1,000, because I have taken all the rows from 100 to 200 and then it continues from 1,000 to 2,000. And I've selected column numbers two, three, five and seven. All right, so this was an intro to R data frames, vectors and factors, and how we can access individual
elements from the data frame. We're
going to look at the various types of operators in R and implement them on the customer churn data set. Broadly speaking, these are the operators we have: assignment operators, arithmetic operators, logical operators and relational operators. So let's go to RStudio and work with all of these. We are back in RStudio, guys. Now to work with assignment operators, I'll start off by loading the customer churn data set, and I'll use the read.csv function for this. I'll put in double quotes, and let me go ahead and copy the path of this dataset. So this is the data set; I'll go to properties and copy the path, and I'll paste it over here. Again, this needs to be a forward slash, so I'll change this backward slash to a forward slash. Then I'll go ahead and give the name of the data set, which will be customer churn dot csv. Now what we'll do is we'll give an assignment operator. This symbol which you see over here is nothing but an assignment operator, which helps us to store values into an object. Using this assignment operator I'll store this customer churn dot csv file into an object, and I'll name that object, let's say, churn one. Now there are two other ways by which we can use the assignment operator. So I'll copy this; I'll just clear this window first. Right, now I can actually give the name of the object first and then use this assignment operator, so this is the less than symbol followed by a minus. It's the same thing: I am basically loading this path into this object, churn two, using this assignment operator. Right, so let's have a glance at this. I will type View of churn two. Right, so we have the data set. Now instead of using these two operators, I can also use the equal to operator. I will change the name to churn three now, and when I have a glance at this, it will actually be the same.
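The three assignment forms can be compared side by side. In this sketch a small made-up data frame, sample_data, stands in for the read.csv call so that the code runs without the churn file; the object names churn1, churn2 and churn3 are spellings assumed from the spoken names.

```r
# Hypothetical stand-in for the result of read.csv(...)
sample_data <- data.frame(
  id   = 1:3,
  plan = c("DSL", "Fiber optic", "No")
)

sample_data -> churn1   # value first, arrow pointing right at the name
churn2 <- sample_data   # name first, arrow pointing left at the name
churn3 = sample_data    # name first, equal to sign

# All three objects hold exactly the same data
identical(churn1, churn2)
identical(churn2, churn3)
```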
So churn one, churn two and churn three are actually the same data set, but I just used different assignment operators to store them into these objects. Right, so these are the different assignment operators. Now we'll go ahead and work with arithmetic operators as well. Arithmetic operators are simply plus, minus, division and multiplication, and we'll be implementing all of these arithmetic operators on top of this customer churn dataset. You see these two columns over here, monthly charges and total charges; I will be using the arithmetic operators on these. I will take this cell over here. Let's say this customer's monthly charges are 29.85, but actually there was some calculation error and his monthly charges were just 28.85. So we'll go ahead and change this value to 28.85. How can we do that? All you have to do is subtract one from the cell value. So let's do that. I will select churn one dollar and I'll select the column, which is monthly charges, and since this is the cell in the first row, I will give 1 over here. From this cell value I need to subtract one. Right, I have subtracted one. Now I'll store this result back to the same cell, so I'll type churn one dollar monthly charges and the cell is obviously 1. Now let me have a glance at the churn one dataset and let me go to the monthly charges column. Right, so here is the difference: initially the value was 29.85; we have rectified the error by subtracting one from it, and now the value is 28.85. Similarly, we will be using the plus operator. The plus operator basically helps us to add something to an existing value. Let's see the second customer over here, whose total charges are 1889.50, but again this was incorrectly calculated and his charges were actually 1890.50, so we'll just add one to this using the plus operator. Let me select that cell: I will type churn one dollar total charges and the cell number is 2, and I'll add one to that. After adding one I'll store it back to the same cell, so I'll type churn one dollar total charges and the cell number is obviously 2. Let me refresh this: View of churn one. Let me go to the total charges column. Right, so initially it was 1889.50 and after adding one to it the total charges became 1890.50.
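A minimal sketch of these two cell updates. The data frame is made up; only the values 29.85 and 1889.50 come from the session, and the column spellings MonthlyCharges and TotalCharges are assumptions.

```r
# Hypothetical three-row stand-in (only 29.85 and 1889.50 are from the session)
churn1 <- data.frame(
  MonthlyCharges = c(29.85, 56.95, 53.85),
  TotalCharges   = c(29.85, 1889.50, 108.15)
)

# column[i] picks the i-th cell of that column; the result of the
# arithmetic is written back to the same cell
churn1$MonthlyCharges[1] <- churn1$MonthlyCharges[1] - 1   # 29.85 becomes 28.85
churn1$TotalCharges[2]   <- churn1$TotalCharges[2] + 1     # 1889.50 becomes 1890.50

print(churn1)
```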
Now let's say there is some discount going on and randomly a customer gets a discount of 10%. Let's say it is this customer over here who gets a discount of 10%. Let me see which row number this is: 1, 2, 3, 4, 5, 6, 7, 8, 9, so this is row number 9 over here. What we have to do is basically multiply this value by 0.9; when you multiply a value by 0.9, the value gets reduced by 10%. So let's go ahead and do that. I'll type churn one dollar total charges, and it is the ninth cell over here. I will multiply it by 0.9 and I'll store it back to the same cell, so this will be churn one dollar total charges and I'll give the cell number over here. Right, let me press Enter and let me look at the result: View of churn one. Let me go to total charges over here. Right, so initially it was 3046 and after giving a discount of 10% the total charges came down to 2741. Let me show it again: it was initially 3046 and after the discount of 10% it was 2741.
Now similarly, let's say a customer gets a
discount of 50%. All you have to do is divide that value by two. Let's say it is this third customer over here who gets a discount of 50% on his monthly charges. So let's go ahead and divide that cell value by two. I'll select churn one dollar monthly charges and the cell number is 3. Now I will divide that value by two and I'll store it back to the same cell, so that will be churn one dollar monthly charges and the cell number is obviously 3. Let me refresh this now: View of churn one. Initially the monthly charges were 53 over here, and after giving a discount of 50% his monthly charges came down to 26.
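Both discounts can be sketched on a made-up three-row data frame. The 10% discount is applied to row 2 here instead of row 9 simply because the stand-in data is shorter; the values and column spellings are assumptions.

```r
# Hypothetical three-row stand-in for the churn data
churn1 <- data.frame(
  MonthlyCharges = c(29.85, 56.95, 53.85),
  TotalCharges   = c(29.85, 1889.50, 108.15)
)

# A 10% discount keeps 90% of the value, so multiply by 0.9
churn1$TotalCharges[2] <- churn1$TotalCharges[2] * 0.9

# A 50% discount halves the value, so divide by 2
churn1$MonthlyCharges[3] <- churn1$MonthlyCharges[3] / 2

print(churn1)
```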
So from 53 to 26, that is a discount of 50 percent. Right, so these were arithmetic operators. Now we'll go ahead and work with the relational operators. Relational operators basically help us to find out the relation between two values, such as which one is greater and which one is lesser. Let's say I would want to find out all of those customers whose tenure is more than 60. So let's go ahead and do that. For that I will type churn one dollar, I will select the column, which is tenure, and I'll just use the greater than operator, that is, churn one dollar tenure is greater than 60, and I'll store this in, let's say, C underscore tenure. Right, so let me have a glance at this now: View of C underscore tenure. Now these are just FALSE and TRUE values. What this basically means is that wherever you see FALSE, the tenure is not greater than 60, and where you see a TRUE value, the tenure is greater than 60. Let's actually verify this. This is the tenure column over here, and we see that this is the only value here where the tenure is greater than 60, and we have a TRUE value for that. Right, so now if you actually want to clearly see the records wherever the tenure is greater than 60, we can use the subset function for that. I'll use subset and I'll give the name of the data set first. From this data set I need all of those rows where C underscore tenure is equal to TRUE, and I'll store this back to C underscore tenure. Now let me see what the result is. Right, so we see there are fourteen hundred and seven customers whose tenure is more than 60. You see this over here: these are all of the customers whose tenure is more than 60, and we found that out using the greater than operator.
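A short sketch of the comparison followed by subset, on a made-up four-row data frame. The session stores the filtered rows back into the same object name; a separate name, long_tenure, is used here only to keep the two steps easy to tell apart.

```r
# Hypothetical four-row stand-in for the churn data
churn1 <- data.frame(
  customerID = c("A", "B", "C", "D"),
  tenure     = c(1L, 72L, 61L, 60L)
)

# The comparison returns one TRUE or FALSE per row
c_tenure <- churn1$tenure > 60
print(c_tenure)

# subset() keeps only the rows where the condition is TRUE
long_tenure <- subset(churn1, c_tenure == TRUE)

# The condition can also be written directly inside subset()
same_rows <- subset(churn1, tenure > 60)

nrow(long_tenure)
```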
Right, now similarly we'll use the less than operator to find out all of those customers whose monthly charges are less than $10. I will type churn one dollar monthly charges and this needs to be less than 10, and I'll store this in C underscore mon. Right, so let me have a glance at this C underscore mon. Now let me use the subset function to find out the actual values. I'll type subset, I'll give the name of the dataset, I'll type out C underscore mon over here, and I need all of those rows where this is TRUE. I'll store this back to C underscore mon. Now let me have a glance at this C underscore mon. We see that there are zero entries, that is, there is actually no customer whose monthly charges are less than ten dollars. Right, so this was the greater than operator and the less than operator. Now we'll go ahead and also work with logical operators. Logical operators are basically and and or; these help us to give multiple conditions. Let's say I would want to select all of those customers where gender is male and senior citizen is one. We can do that using the and operator. What I will do is I will type churn one dollar gender and I'll set this to be double equal to male. Now I'll use the and logical operator, and I need to select the senior citizen status, where churn one dollar senior citizen is double equal to one. Right, and I will store this in, let's say, C underscore MS. So let me have a look at this: I'll type View of C underscore MS. Now let me use the subset function to find out the real results: subset of churn one where C underscore MS is double equal to TRUE, and I will store this back to C underscore MS. Now let me have a glance at this. Right, so there are 574 entries, or 574 customers, whose gender is male and who are senior citizens; in other words, there are 574 senior male citizens. Right, so this was the and operator. Now we'll go ahead and work with the or operator. Let's say I would want to select all of those customers whose internet service is either DSL or fiber optic. So let's use the or operator for that.
So what I'll do is I'll type churn one dollar internet service is double equal to DSL, then I'll use the or logical operator, and I'll type churn one dollar internet service is double equal to fiber optic. First I will get the list of all those customers who use either of these internet services, and I'll store this in, let's say, C underscore internet. Right, let me have a glance at this: I'll type View of C underscore internet. Now let me use the subset function to find out the result. I'll give the dataset name, then I'll type C underscore internet to find all of those rows where the result is TRUE, and I'll store it back to C underscore internet. Right, so I see that there are five thousand five hundred and seventeen customers whose internet service is either DSL or fiber optic. So we are done with the or operator and the and operator. Now we'll also go ahead and work with the not operator. The not operator basically gives us the opposite value. Let's say we'll work with the senior citizen column, and I would want to select all of those rows where the senior citizen value is zero. You can use the not operator for that. What I will type is churn one dollar senior citizen is not equal to one, so I will get all of those rows where this value is not equal to one, and since the only other value is zero, I'll get all of those rows. I'll store this in, let's say, C not senior, and let me have a glance at that, right.
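The and, or and not-equal conditions can be sketched together on a made-up five-row data frame. The column spellings, the values and the object names are assumptions for illustration.

```r
# Hypothetical five-row stand-in for the churn data
churn1 <- data.frame(
  gender          = c("Male", "Female", "Male", "Female", "Male"),
  SeniorCitizen   = c(1L, 1L, 0L, 0L, 1L),
  InternetService = c("DSL", "Fiber optic", "No", "DSL", "Fiber optic")
)

# and: both conditions must hold for a row to be TRUE
c_ms <- churn1$gender == "Male" & churn1$SeniorCitizen == 1

# or: at least one of the conditions must hold
c_internet <- churn1$InternetService == "DSL" |
  churn1$InternetService == "Fiber optic"

# not equal: TRUE wherever the value differs from 1
c_not_senior <- churn1$SeniorCitizen != 1

# subset() then keeps the rows flagged TRUE
senior_males <- subset(churn1, c_ms == TRUE)
nrow(senior_males)
```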
So let me use the subset function: subset of churn one, and I'll give this to be double equal to TRUE, and I'll store it back to C not senior. Let me have a glance at this again: View of C not senior. Right, so there are 5,901 customers who are not senior citizens. Right, so we are done with assignment operators, arithmetic operators, relational operators and logical operators. We're going to look at the various types of inbuilt functions in R and implement them on the customer churn data set. These are some of the inbuilt functions, so let's go to RStudio and work with them. We'll head back to RStudio. Let me just have a glance at the data set first, so I will type out View of customer churn. Right, so this is the dataset. We'll start off by understanding the structure of this dataset. For that I'll be using the structure function, str, and I'll give the name of the dataset, which is customer churn. Right, so this function gives me the entire structure of the dataset. This basically tells me that I'm working with a dataset where there are 7,043 observations of 21 variables, or in other words there are 7,043 rows and 21 columns. And these are all of the columns over here: we have customer ID, gender, senior citizen, online backup, streaming movies, churn and so on. Right, so now following the name of the column we also have the
data type or the class of the column right over here we see that customer ID
is of type factor gender again is of type factor senior citizen is of type integer
and these are the values so the values are either zeros or ones right and over
here multiple lines is of type factor with three levels so these are the three
factor levels over here so it could be either no yes or no phone service and
then similarly we have internet service with three factor levels so
it could either be DSL fiber optic or no right again so for contract we have
three factor levels which could be either month to month or one year or
two years right so we have found out all of this with the help of this structure
function all right now we’ll go ahead and implement the second inbuilt
function so we’ll be using the head function for this right so head function
basically gives us the top six records of the dataset right so I will just go
ahead and give the name of the data set over here head of customer churn right
so this has given me the first six records so we have the first six records
for all of the columns over here all right now if I want to have a look at
the first 10 records all I’d have to do is give a number over here
so now I have the first ten records of the data set so similarly if I just want
to have a look at the first two records I’ll just give the number two over here
all right so I can have a look at the first two records of the data set so
similar to head we have another function called tail so the tail function gives the
last six records of the dataset so I’ll go ahead and give our customer churn data
set over here so you can see the row numbers over here so they start
from 7038 and they end at 7043 so basically this tail function gives us
the last six records of the data set so similarly if I just want the last one
record of this data set I’ll just give the number one right so this is our last
record so row number 7043 similarly if we want the
last ten records I’ll put 10 over here right so it starts
from 7034 and goes to 7043 so you have the last ten records
of the data set right so we are done with head we are done with tail now we’ll
use nrow and ncol to find out the number of rows and number of columns
right so I’ll type nrow and I’ll give in the name of the dataset right so with
this we find out that there are 7043 rows in this
data set similarly I will type ncol and I’ll give the name of the dataset
customer churn now we find out that there are 21 columns in this data set
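
The inspection functions covered so far can be tried on any data frame. This is a small sketch on a made-up eight-row table; the column names and values are assumptions, not the real 7043-row data set.

```r
# Hypothetical eight-row stand-in for the customer churn data set.
customer_churn <- data.frame(
  customerID     = sprintf("C%03d", 1:8),
  gender         = factor(c("Female", "Male", "Male", "Male",
                            "Female", "Female", "Male", "Female")),
  MonthlyCharges = c(29.85, 56.95, 53.85, 42.30, 70.70, 99.65, 89.10, 29.75)
)

str(customer_churn)       # column names, types and a preview of the values
head(customer_churn)      # first six rows by default
head(customer_churn, 2)   # first two rows
tail(customer_churn)      # last six rows by default
tail(customer_churn, 1)   # last row only
nrow(customer_churn)      # number of rows
ncol(customer_churn)      # number of columns
```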
all right so now we have some numerical columns in our dataset so we have
monthly charges and total charges now what if I were to want to find out the
mean values or the maximum values of monthly charges so let’s go ahead and do
that so let’s say I would want to find out the mean of monthly charges so all I
need to do is type out mean over here and then give the column over here so I
will type customer churn dollar and I’ll select the column which is monthly
charges right so the mean of monthly charges for all of the customers is
around 64 dollars similarly if I’d want to
find out the minimum of monthly charges I’ll type out min I’ll give the name of
the data set and then I’ll type out monthly charges
again over here alright so the minimum monthly charge is 18 dollars similarly if I
want to find out the maximum of monthly charges
I’ll type out max I’ll give the name of the data set which is customer churn
followed by the name of the column which is monthly charges right so the maximum
is 118 dollars so similarly what I’ll do is I’ll find out the mean max and min
for the total charges as well now we also have the range function
which automatically gives us the minimum and maximum values so I’ll give range
over here I’ll type the name of the data set which is customer churn and I’ll
select the column which is monthly charges right so range gives me the range of all
of the values so the minimum value is 18 and the maximum value is 118
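
A short sketch of the same summary functions on a plain numeric vector. The values are made up for illustration; on the real data the argument would be the monthly charges column.

```r
monthly_charges <- c(29.85, 56.95, 53.85, 42.30, 70.70)  # made-up values

mean(monthly_charges)
min(monthly_charges)
max(monthly_charges)

rng <- range(monthly_charges)  # minimum and maximum in a single call
rng

# If a column contains missing values, these functions return NA
# unless na.rm = TRUE is passed.
with_gap <- c(monthly_charges, NA)
mean(with_gap)
mean(with_gap, na.rm = TRUE)
```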
let’s say there’s a lucky draw going on and we are selecting five customers
randomly to give a discount so we can use the sample function for that so with
the help of the sample function I’ll be selecting some five random customer IDs
so let me go ahead and do that I’ll type sample and what I’ll do is I’ll
select the customer ID column and I’ll give the number 5 that is from the
entire dataset we are selecting five random customer IDs
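
A minimal sketch of the lucky draw. The ID strings are invented placeholders, and `set.seed` is an optional addition that is not part of the session; it only makes the draw repeatable.

```r
customer_ids <- sprintf("ID-%04d", 1:7043)  # hypothetical customer IDs

set.seed(42)                        # optional: repeatable draw
winners <- sample(customer_ids, 5)  # five IDs drawn without replacement
winners
```

By default `sample` draws without replacement, so no customer can be picked twice.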
right so these values which we see over here so these are the customer IDs so
this is the first second third fourth and fifth so this has randomly given us
five customer IDs from around 7043 customers so again
let’s say if I would want around twenty random customer IDs I’ll give the number
to be twenty over here right so these are all the 20 customer
IDs now next if you would want to find out the distribution of some categorical
variables then we can use the table function so over here we see that we
have a lot of factors over here so gender is a factor column partner is
a factor column internet service is a factor column so most of these are
actually factor columns so now if we have a lot of factor columns we can use
the table function to find out the distribution so now let’s say for this
gender column I would want to find out the number of female customers and also
the number of male customers so all I’d have to do is use a table function for
this so I will type out table I’ll give the name of the dataset and
I’ll just select the column over here which is gender right so this basically
tells me that there are 3488
female customers and 3555 male customers
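
A small sketch of `table` on a factor. The five values are made up; on the real data the argument would be the gender column. `prop.table` is an extra step not shown in the session that turns the counts into shares.

```r
gender <- factor(c("Female", "Male", "Male", "Female", "Male"))  # made-up values

counts <- table(gender)  # one count per factor level
counts
counts[["Male"]]         # pull out a single count by level name
prop.table(counts)       # counts as proportions of the total
```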
all right so similarly if I’d want to find out the distribution for Internet
service I’ll use the table function again I’ll give the name of the dataset
and then I’ll give the column name which is internet service right so around
2421 customers use DSL and 3096 customers use fiber optic and there are
around 1526 customers who don’t use any sort of internet service
right now so let’s say we want to find out the contract of the customers I’ll
again use the table function and I’ll type the name of the dataset followed by the
column right so there are 3875 customers who
have a month-to-month contract 1473 customers who have a contract on a yearly
basis and 1695 customers who have a contract on a two-year basis right so
next finally we’ll use the table function on the payment method column
again I’ll type table over here I’ll give the name of the dataset
which is customer churn and I’ll select the payment method column right so
around 1544 customers pay by bank transfer
1522 customers pay by credit card and these are the rest of the customers
who pay via electronic check and mailed check we’re going to work with flow
control statements and user-defined functions now these flow control
statements basically help us to control the flow of execution so in general the
statements are executed from top to bottom but with the help of flow control
statements we can manipulate the order of execution so these are some of the
flow control statements over here if if-else and switch are something known as
selector statements then we have repeat for and while which are looping statements
we also have some jump statements like continue and break so let’s have a
closer look at selector statements as the name suggests these selector
statements help us to select or manipulate data on the basis of a
condition such as if it rains we’ll not play football or if you’re sick you’ll
not eat ice cream so we’ll just go ahead and start working with the selector
statements all right so I’ll start off and have a quick glance at our customer
churn data set so I will type view of customer churn and we have our data set
right in front of us now I will start with the if condition and I will check
if this cell over here the value in this cell is female and if the test condition
comes out to be true then I’ll change this value to be male so let me go ahead
and do that right so I will type if I will give the name of the data set which
is customer churn and the column is gender and the row number is obviously 1
so from this data set I am checking if the value in this column is female so if
this is equal to female then I will give some action over here
so what I’ll do is I’ll change that value to be male
I’ll give the name of the data set I will select the column and in this column
right so this is cell number one and over here I’ll change the value from
female to male all right so initially we had female over here now let me copy and
paste it over here now let me have a glance at the customer
churn data set again alright so we have changed this value from female to male
with the help of the if condition all right similarly we will use the if
clause again to check the tenure over here so we will take this cell value
over here so if the tenure is let’s say greater than 50 months then what
I’ll do is I’ll give this customer a discount of 10% right so over here I see
the monthly charges and over here we have the customer so this is the customer who
is present in row number ten and I will give this guy a discount of 10% on his
monthly charges right so let me go ahead and create another if clause over here
so I will type if customer churn dollar tenure is greater than 50 and this cell
number over here is 10 now this is very important right so this is cell number ten if
that value is greater than 50 then what I’ll do is I will give a corresponding
discount in monthly charges cell number is 10
so customer churn dollar monthly charges cell number ten and I will give this guy
a discount of 10% and that is why I am multiplying this value with 0.9 over
here so you see we have taken out this cell value and I’m multiplying that
value with 0.9 and that is how this guy will get a discount of 10% and I am
storing back the result into the same cell
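
The two if clauses above can be sketched as follows. The two-row data frame, its column names and the use of row 2 in place of row 10 are assumptions made to keep the example small.

```r
# Hypothetical two-row stand-in; row 2 plays the role of row 10 in the session.
customer_churn <- data.frame(
  gender         = c("Female", "Male"),
  tenure         = c(1, 62),
  MonthlyCharges = c(29.85, 56.15),
  stringsAsFactors = FALSE
)

# If the first gender value is "Female", overwrite it with "Male"
if (customer_churn$gender[1] == "Female") {
  customer_churn$gender[1] <- "Male"
}

# If tenure is above 50 months, apply a 10% discount to that row's charges
if (customer_churn$tenure[2] > 50) {
  customer_churn$MonthlyCharges[2] <- customer_churn$MonthlyCharges[2] * 0.9
}

customer_churn
```

The condition inside `if` has to be a single TRUE or FALSE, which is why each check indexes one cell rather than the whole column.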
let me copy it and paste it over here so let us just have a quick glance so
the initial value is 56 now after modification let’s have a look
at the monthly charges right so initially it was 56 then after using the
if clause we have changed his monthly charges to 50 by giving him a discount
of 10% right so this was if now we will go ahead and also implement the if else
clause so we’ll use the churn column for that and we’ll be using this cell so
let’s say we’ll just check if the value over here is no or in other words it
basically means that the customer will not churn out or the customer will be
using the same network and we’ll just print thank you for using our
network and if this is yes then we will print please give us feedback on how
we can improve our network so let’s go ahead and do that
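
A minimal sketch of this if-else check. The `churn` vector is a made-up stand-in for the churn column, and storing the message in `msg` before printing is a small variation so the result can be inspected.

```r
churn <- c("No", "Yes")  # hypothetical values of the churn column

if (churn[1] == "Yes") {
  msg <- "please give us feedback on how we can improve our network"
} else {
  msg <- "thank you for using our network"
}

print(msg)
```

Note that `else` sits on the same line as the closing brace of the `if` block; at the top level of a script R would otherwise treat the `if` as already complete.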
right so if customer churn dollar churn and this is row number one is double
equal to yes and if this comes out to be true I’ll just
print please give us feedback on how we can
improve our network else
I will print thank you for using our network let me place it over here and let me see
what will be the result so we get thank you for using our network because this
customer does not churn out all right so we are done with if we are done with
if-else now we’ll look at the third selector statement which is switch so with the
help of switch I will give this guy a discount on monthly charges with respect
to the internet service let’s say I will take this customer and I will see if
this guy uses internet service of DSL then I will be giving him a discount of
10% and if he uses internet service of fiber optic then I’ll be giving him a
discount of 20% so let me go ahead and do this using the switch statement so I’ll delete all of this I will type
switch over here so over here I need to give the object so the object again now
since this is actually a factor I will change this to a character vector so I
will type as dot character of and I will give the column
over here which will be customer churn dollar internet service
right now the first case would be DSL so if the customer’s internet service is DSL
then I will give this guy a discount of 10% right again so let me have a glance
at the cell numbers this will be one two three four and five
all right so over here let me just give down the cell number which is five
so what I’ll do is I’ll select customer churn dollar monthly charges cell number
is five and I’ll give this guy a discount of 10% I’ll give a comma now
I’ll give the second case and the second case is if this guy uses fiber optic
so if this guy uses fiber optic then I will give this guy a discount of 20% so
customer churn dollar monthly charges cell number five and twenty percent
discount so I would have to multiply this value by 0.8 so let me
go to monthly charges so one two three four five so this was the initial value
70 right now I will select all of this paste it over here and I’ll store
the result back to the same cell so this will be customer churn dollar monthly
charges cell number five let me have a glance at the data set now let us see
the result so we have our monthly charges so initially it was seventy and
after giving this guy a discount of 20% his monthly charges came down to 56
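
A sketch of the switch-based discount. The three-row data frame, the `row` variable and the final default branch (no discount for any other service) are assumptions added for illustration.

```r
# Hypothetical stand-in; row 2 plays the role of row 5 in the session.
customer_churn <- data.frame(
  InternetService = factor(c("DSL", "Fiber optic", "No")),
  MonthlyCharges  = c(50, 70, 20)
)

row <- 2
customer_churn$MonthlyCharges[row] <- switch(
  as.character(customer_churn$InternetService[row]),   # factor -> string
  "DSL"         = customer_churn$MonthlyCharges[row] * 0.9,  # 10% off
  "Fiber optic" = customer_churn$MonthlyCharges[row] * 0.8,  # 20% off
  customer_churn$MonthlyCharges[row]                         # default: unchanged
)

customer_churn
```

The `as.character` call matters because `switch` matches case names only against a character string; given a factor it falls back to the underlying integer code and warns.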
right so we are also done with switch then we have looping statements so these
looping statements basically keep on repeating a certain action like keep on
printing your name four thousand times or keep playing the music for the next
one hour right so let’s go ahead and work with these looping statements so I
will start with the for loop now we have this gender column over here
and I’ll use the for loop to count the number of male customers right so I will type for and over here I
will need a variable and I’ll name this variable to be i and also a vector so
I’ll give a range over here so for i in 1 is to nrow of customer churn that is
this loop will run starting from 1 to 7043 right now in this entire
loop I need to check the number of male customers so for that I will use the if
condition so if customer churn dollar gender is double equal to male now I
will create another variable count over here and give this to be 0 so if customer churn dollar gender again
I need to give the cell which is i over here
so if customer churn dollar gender is double equal to male then I will
increment the count value so count will be count plus one alright so what is
happening is initially i’s value is one so now this will be evaluated to true
and again over here I am checking if customer churn dollar gender the first
cell value if this is equal to male then I’ll increment count with one again this
loop will come over here i’s value will be two over here we’ll check if customer
churn dollar gender 2 so this is male again
so count’s value will be increased to two similarly then i’s value will be three
and we will check the cell number three over here
cell number three is male again so the count value will be
incremented and will be three now so this is how this loop will go on so let me print this over here let me
print count so we see that there are 3556
male customers so let me verify this with the help of the table function so I
will type table of customer churn dollar gender let me see that this is actually
true there are 3556 male customers
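
The counting loop can be sketched on a short vector. The five gender values are made up, and `seq_along(gender)` is used here in place of `1:nrow(...)` because the example works on a plain vector rather than a data frame.

```r
gender <- c("Male", "Female", "Male", "Male", "Female")  # made-up values

count <- 0
for (i in seq_along(gender)) {   # i runs over 1, 2, ..., length(gender)
  if (gender[i] == "Male") {
    count <- count + 1           # one more male customer found
  }
}
count

# Cross-check without a loop: comparisons are vectorised in R
sum(gender == "Male")
```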
right so this was for loop now we’ll go ahead and understand the while loop now
with the help of while loop we will get a count of the number of customers whose
payment method is electronic check right so now I will delete all of this I will give a new variable which is i
is equal to one I will create the while loop over here I will give a condition
so I will check if i is less than nrow
of customer churn that is if 1 is less than 7043 so this
actually needs to be less than or equal to 7043 and if this is true
I’ll go ahead and check my condition if customer churn dollar payment method
here I’ll give the cell number which will be I so if this is equal to
electronic check then I will increase the value of count
with one so count will be count plus one right so after doing this
I will also increment the value of i so i will be i plus one so let us
understand this properly so I am checking if one is less than or equal to
7043 which is evaluated to true and since this is evaluated to true
then I am using the if condition to check if the cell number one the payment
method is electronic check so since this is electronic check the value of count
is incremented now after this if condition is done
I am incrementing the value of i over here now i’s value will be 2 then I
will check if the value over here is electronic check or not similarly this
loop will continue on so let me go ahead and select all of this and print it over
here I’ll type count now let me verify this so table of customer churn dollar
payment method so we see that the number of customers whose payment method is electronic check is
2365 and over here we have got the count to be 2365 alright so we are done with
the while loop then we have user-defined functions so these basically help us to
modularize our entire program let’s say if we wanted to find out the minimum and
maximum values of every column so all we need to do is create two functions min
and max which can be applied on all the columns so let’s go to RStudio and
create some user-defined functions right now again I will create a user-defined
function to get a count of the number of male customers I will name the function to be gender
count now this is the syntax of a user-defined function so I will type
function and this is our parameter over here right now inside this I need to
write the entire code to find out the count of the number of male customers
right so I’ll be using the for loop again to do that so I will type for i in
1 is to length of x right and over here I will check if x of
i is double equal to male and if that is equal to male then I will say count
is equal to count plus one again I will create a new local variable over
here count whose initial value is supposed to be zero right so let’s go
through this function again so this is the syntax of a function and I am naming
this function to be gender count so over here I will send this gender column as
the parameter now once I do that I have initialized a local variable where count
is equal to zero and over here the loop starts so for i in 1 is to length of
x so length of x that would be the length of this column which would be
7043 so this loop will go from 1 to 7043
alright and inside this over here we check for each and every cell so if x of
i so for the first iteration it will be x of one so we’ll check for this cell so if
this value is equal to male then count’s value is incremented by one again i’s
value is two and if the value in the cell is male count’s value is also
incremented by one and this goes on after the entire loop is done I’ll also
print the value of count over here all right so I will select all of this and I
will paste it over here so we have our function ready right so gender
underscore count and I will send the gender column as the parameter
all right so we get a value of 3556 let me verify this again
table of customer churn dollar gender right so we see that the number of male
customers is 3556 so now the best part of functions is we
just need to make a small change over here if we need to find out the number
of female customers so all I’ll do is I will change this to be female and I can
pass in the same column to find out the count of number of female customers
right so I will use this function again gender count and I will say customer
churn dollar gender I am sending this as the parameter now let us see the count
let me verify this so I will type table of customer churn dollar gender and over
here we see that the number of female customers is 3487
and that is the same value which we’ve got with the
function right so this was an implementation of user defined function
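
A sketch of the counting function. In the session the value to count is edited inside the function body; the `level` parameter here is a variation (an assumption, not the presenter's code) that lets the same function count either gender without editing it. The input vector is made up.

```r
# Counts how many elements of x are equal to `level`.
# The `level` argument is a hypothetical addition for illustration.
gender_count <- function(x, level = "Male") {
  count <- 0                     # local variable, reset on every call
  for (i in seq_along(x)) {
    if (x[i] == level) {
      count <- count + 1
    }
  }
  count                          # value of the last expression is returned
}

gender <- c("Male", "Female", "Male", "Male", "Female")  # made-up values
gender_count(gender)
gender_count(gender, "Female")
```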
we’ll work with the basic data structures in R so we’ll start with one dimensional
data structures which are vectors and lists and then we’ll head on to matrices
and arrays which are multi-dimensional data structures so the most basic data
structure in R is a vector it’s a homogeneous uni dimensional object so
what do I mean by homogeneous well all of its elements must be of the same type like
over here we have a collection of birds linearly arranged now let’s go ahead and
implement this in R right so we are back to RStudio I will start off by
creating a character vector and I’ll name it as bird and I’ll go ahead and
give it some names of birds so the first bird would be eagle
then we have parrot and our final bird would be pigeon now let me print this
right and let me also go ahead and take the class of this vector so I’ll type
class of bird all right so we see that this is a character vector that is all
of these three elements are actually characters now I’ll be creating an
integer vector and I’ll name this as numbers so I’ll just
list down the numbers from 1 to 9 let me print this now which comprises numbers from 1 to 9
let me go ahead and check the class so I’ll type class of numbers so this is integer
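
A minimal sketch of the character and integer vectors. The bird names are arbitrary placeholders. One detail worth seeing: the colon operator produces integers, while numbers typed out with `c()` are stored as numeric (double) unless they carry an `L` suffix.

```r
bird <- c("eagle", "pigeon", "owl")  # placeholder names
class(bird)

numbers <- 1:9            # the colon operator yields an integer vector
class(numbers)

class(c(1, 2, 3))         # plain literals are numeric (double)
class(c(1L, 2L, 3L))      # the L suffix marks integer literals
```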
then we have the numeric vector type so
with numeric we can give floating-point or decimal values so I’ll name this to be decimal
and I’ll give some floating-point values so I’ll just give some random floating
point values over here let me print this now let me check the class so I’ll type
class of decimal so we see that this is of numeric type right and then finally we
have a logical vector and in a logical vector we can just have two values
either true or false so I’ll name this to be logic hundred
and here are some logical values true false and if I’m too lazy I can just
give T and F like this over here right so let me
print this logic hundred right so these are all
of the values of this vector now let me take the class so I’ll type class of
logic hundred so we see that this is of logical type so this was an
implementation of vectors in R so then we have a list so a list is a
heterogeneous collection of elements that is the elements do not have to
be of the same type and each element actually retains its own identity even
when it is present in the list like over here we have a heterogeneous collection
comprising of a bird and a pill in the cart so let’s head to R and work with
lists so this is how we create a list I’ll type list and let’s say the first
element is the integer one then I’ll give a character value and I’ll name it
to be Nirvana after that I’ll give a logical value and this is true I will
store this in mix bag right so let me print mix bag over here so this is how
our list looks like so we have three elements the first is integer next we have
character and then finally we have a logical value so let me take the class
of this object so I’ll type class of mix bag
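
A short sketch of the list and the class checks. The name `mix_bag` is an assumed spelling of the object name heard in the session. Note that a bare `1` is stored as numeric, not integer.

```r
mix_bag <- list(1, "Nirvana", TRUE)

class(mix_bag)        # the container itself is a list
class(mix_bag[[1]])   # double brackets extract the element: numeric
class(mix_bag[[2]])   # character
class(mix_bag[[3]])   # logical

class(mix_bag[1])     # single brackets return a one-element list instead
```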
so we see that this is a list now I’ll also go ahead and check the class of
individual elements right so I’ll type class of mix bag and I’ll give two square
brackets and I want to check the class of the first element so we see that it
is numeric the second is character
and the third element is logical right so let me actually print the list for
you guys so this number one is of numeric type
this element Nirvana is of character type and this element true is of logical
type so we see that all of these three elements retain their original identity
or their original classes so this is how a list functions next in line is a
matrix so matrix is a homogeneous collection of elements in
two-dimensional space so over here all of the elements belong to the same
category namely fruits and they’re arranged in the form of rows and columns
so now let me go ahead and implement a matrix in R
all right so to create a matrix what I’ll do is I’ll actually be using the
same vector first so let me have a glance at this
right so now I’ll be inserting all of these elements into our matrix right and
to create a matrix I will type matrix the first parameter is the data so for
the data I am giving the numbers vector after giving the numbers vector we have
two other parameters where we specify the number of rows and the number of
columns that we want so let’s say since we have nine numbers in total I would
want this to be a 3 cross 3 matrix that is nrow is equal to 3 or number of rows is
3 and similarly ncol is 3 or in other words number of columns is 3 and I
will store this in mat 1 so let me print mat 1 now right so this is our matrix over here
where all of these are of the same type one two three four five six seven eight nine
so these are all integers and I’m storing them in the form of rows and
columns and what you see over here is these elements are arranged column wise
so if I want to arrange these by row then what I’ll do is we have the byrow
attribute and I’ll just set it to be true right and I’ll print mat 1 now
to see the difference over here so now the elements are arranged by row 1 2 3 4
5 6 7 8 9 initially they were arranged with respect to column now they are
arranged with respect to row right so now I’ll create a character matrix so
what I’ll do is I’ll create a character vector first and list down some
characters so let’s say I will give the first six alphabets so a b c d e and f right so I have created a
character vector named alpha and this consists of six elements all right now
I’ll take this vector and create a matrix matrix the data is coming from the alpha
vector now since there are six elements in
total I want this to be a 2 cross 3 matrix right so number of rows is 2 and
number of columns is 3 and I will store this in let’s say
mat underscore alpha let me print this now mat underscore
alpha alright so this is a matrix over here so two rows and three columns now
again if I want to arrange this with respect to row all I need to do is set
byrow to be true and I’ll be printing mat alpha
right so this was with respect to columns this is with respect to rows
a b c d e f right so we have also created the matrix now what if you wanted to
access the individual elements of the matrix so this is how we can do it so
let’s say I would want to access this element over here so all we have to do
is set the index values so this is present in the first row and second
column right so I’ll give mat alpha 1 comma 2 and we have successfully extracted this
element over here similarly if you wanted to extract this element f over
here so this is present in the second row and third column right so mat alpha 2
comma 3 and we have successfully extracted the element from this matrix
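
The matrix steps can be sketched as below. The object names follow the narration as closely as it can be made out (`mat1`, `alpha`, `mat_alpha`); `mat1_byrow` is an extra name used here so both layouts can be compared side by side.

```r
numbers <- 1:9

mat1       <- matrix(numbers, nrow = 3, ncol = 3)                # filled column by column
mat1_byrow <- matrix(numbers, nrow = 3, ncol = 3, byrow = TRUE)  # filled row by row
mat1
mat1_byrow

alpha     <- c("a", "b", "c", "d", "e", "f")
mat_alpha <- matrix(alpha, nrow = 2, ncol = 3, byrow = TRUE)
mat_alpha

mat_alpha[1, 2]   # first row, second column
mat_alpha[2, 3]   # second row, third column
```

Indexing is always `[row, column]`, regardless of how the matrix was filled.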
right so this was an implementation of matrix and finally we have arrays so
this is just an extension of matrix that is it is a homogeneous collection of
elements in n-dimensional space so let’s actually go ahead and implement
arrays in R alright so what I’ll do is I’ll create a new integer vector
and give out values from 1 to 9 and I’ll create a second integer vector and in
this I will give the value starting from 10 to 18 so we have created two numeric
vectors over here and we’ll be using these two numeric vectors to create an
array so this is the syntax to create an array I will type array and I’ll give out
the data right so the data is coming from these two vectors I’ll use the
combined function and give out these two vectors over here num1 and num2 right
after this I need to set the dimensions that is the number of rows the number of
columns and the number of dimensions all right so the dem will be so in total we
have 18 elements so we have nine elements in num 1 vector and nine
elements in num 2 vector so what I need us actually two matrices of 3 cross 3 so
I will give 3 comma 3 so this is the number of rows and number of columns and
since I need 2 matrices of this type so I’ll type in 2 and I will store this
array 1 now let me go ahead and print everyone as a result in Eric right so this is all
of the elements from num 1 vector which are stored in this part over here and
then we have all of the elements from the num 2 vector which are stored in the
second dimension over here right now so how can we access individual elements
from this so let’s say I want to access this element number 15 so let’s go ahead
and access this I will type array 1 now let me check where is this present so
this is present in the third row and second column so I’ll type 3 comma 2 and
after this since this is present in the second matrix or the second dimension I
will give in 2 over here and let me check the result and voila
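
A minimal sketch of the array built from the two vectors, using the names `num1`, `num2` and `array1` as they are heard in the narration.

```r
num1 <- 1:9
num2 <- 10:18

# 18 values arranged as two 3 x 3 slices: dim = c(rows, columns, slices)
array1 <- array(c(num1, num2), dim = c(3, 3, 2))
array1

array1[, , 2]     # the whole second slice, holding 10 to 18
array1[3, 2, 2]   # third row, second column, second slice
```

Each slice is filled column by column, which is why 15 ends up in the third row of the second column.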
so we have successfully extracted 15 from this so similarly if I wanted to
extract this element 5 so let’s go ahead and do that so let me actually yeah I
need this now so I will type array 1 and this is present in the second row and second
column so I’ll type 2 comma 2 and since this is present in the first dimension
itself so I’ll give out 1 over here right and I have also extracted 5 we’ll be
working on a project so this project would be on recommendation engine so
have you ever wondered which book to read next
well I often have and to me book recommendations are a fascinating issue
and that is exactly what we’re going to do today so our data set for the case
study comprises these four files ratings dot CSV books dot CSV book tags
dot CSV and tags dot CSV so as the name suggests the ratings dot CSV contains
all users’ ratings of the books so there are a total of nine hundred and eighty
thousand ratings for ten thousand books from fifty three thousand four hundred
and twenty four users so the books dot CSV contains more information on the
books such as the author’s name publication year book ID and so on then
we have the book tags dot CSV file so this file comprises all tag IDs users
have assigned to the books and the corresponding tag counts so the tag IDs
basically denote the categories the books fall into and the counts
denote the number of books belonging to each category and we have the tags
dot CSV file so this file contains all the tag names corresponding to the tag
IDs that is it gives us the labels corresponding to different tag IDs so
these are the tasks which you’d have to perform in this project so in the first
phase we do a bit of data cleaning so we’ll start off by removing the
duplicate ratings that is there are cases where a user has rated the same book
more than one time so we’ll go ahead and remove all these instances after which
we’ll go ahead and remove those users who have rated fewer than three books
right guys so we are in RStudio now so let us go ahead and load all of the
packages required for the case study so these are all the packages required
right now after which I’ll load the four files from the dataset so these are
the four files so we have books dot CSV ratings dot
CSV book tags dot CSV and tags dot CSV and I’m storing these in objects books
ratings book tags and tags so we have loaded these four files now
Let us have a glance at these four files; I'll be using the View function for that. Right, so these are our datasets, guys. We have the ratings dataset, which comprises three columns: book ID, user ID and the rating. Then we have the books dataset, and its columns are ID, book ID, work ID, ISBN, the authors' names, then the original publication year, original title, title, language code and so on. After that we have book_tags, where the columns are the Goodreads book ID, the tag ID and the count. And then we have the tags dataset, where we have the tag ID and the corresponding tag name for that tag ID.

Right, so as part of our first phase we had to do a bit of data cleaning, and the first task of the first phase was to remove all of the duplicate ratings. To do that, we'd have to find out how many times a single user has rated one particular book, and this would be the command for that. Here, what I'm doing is grouping the ratings dataset by user ID and book ID. Afterwards I am using the mutate function to add a new column, and that new column is given by the n() function from dplyr, which basically gives us the number of times a single user has rated one particular book. I'm naming the new column capital N and storing the result back to ratings. So let me have a glance at ratings now.
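As a minimal sketch of the grouping step just described, assuming the dplyr package and a toy data frame in place of the real ratings file (the column names book_id, user_id and rating are assumed from the description of the dataset):

```r
library(dplyr)

# Toy ratings: user 4206 rated book 8945 twice (a duplicate).
ratings <- data.frame(
  book_id = c(1, 1, 8945, 8945),
  user_id = c(314, 2900, 4206, 4206),
  rating  = c(5, 3, 4, 5)
)

# Count how many times each (user, book) pair occurs and keep it as column N.
# The ungroup() at the end is an addition here so later steps start ungrouped.
ratings <- ratings %>%
  group_by(user_id, book_id) %>%
  mutate(N = n()) %>%
  ungroup()

print(ratings)
```

Every row of a duplicated pair gets the same N, so both rows for user 4206 and book 8945 carry N = 2.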
View of ratings. We see that a new column has been added. This is the user number: user number 314 has rated book number 1 only once. Similarly, if we take this case over here, user number 2900 has rated book number 1 only once. Let me go down and see if there are some changes in the count of N. Right, so let's have a glance at these two cases over here: user number 4206 has rated book number 8945 two times. These are the duplicate ratings I am talking about.
So these records need to be removed. All right, now let me also use the table function to find out the distribution of these duplicate ratings. I'll use the table function and give in ratings$N, which gives me the count of the number of ratings given by one particular user to one particular book. Right, so this value over here tells us that there have been 5 instances where one particular user has rated the same book five times. This tells us that there are 28 instances where the same user has rated the same book four times, this tells us that there have been 156 instances where the same user has rated the same book three times, and this tells us that there are 4,298 instances where the same user has rated the same book two times. And these are all of the cases which are not duplicates, that is, where the user has rated that particular book only one time.

Right, so now, from this ratings dataset, I will filter out all of those duplicate records and store them in a new object. I will put in the name of the dataset, which is ratings, and use the filter function to select all of those records where N is greater than 1, that is, those which have duplicate ratings, and store the result in a new object. We have successfully created this new object, so now let me have a glance at it: View of duplicate_ratings. Right, so there are 4,487 entries in total which have duplicate ratings; in other words, there have been 4,487 instances where the same user has rated the same book more than one time.

Now we'll go ahead and remove all of these duplicate ratings, and there is a very simple command to do that. From the ratings object, all you have to do is keep only those records where the value of N equals 1. This basically means that we are keeping those records where one particular user has rated one particular book only once, and I am storing this result back to the ratings dataset. Right, so we have made the changes; now let me have a glance at it.
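The three steps above (tabulating N, setting the duplicates aside, keeping only the non-duplicates) could be sketched as follows. This assumes dplyr and a toy data frame that already carries the N column; the name duplicate_ratings is taken from what is read out in the session:

```r
library(dplyr)

# Toy ratings with the N column (times a user rated the same book) already added.
ratings <- data.frame(
  book_id = c(1, 1, 8945, 8945, 7),
  user_id = c(314, 2900, 4206, 4206, 314),
  rating  = c(5, 3, 4, 5, 2),
  N       = c(1, 1, 2, 2, 1)
)

# Distribution of N: how many rows belong to pairs rated once, twice, and so on.
print(table(ratings$N))

# Keep the duplicated records aside for inspection.
duplicate_ratings <- ratings %>% filter(N > 1)

# Keep only the (user, book) pairs that were rated exactly once.
ratings <- ratings %>% filter(N == 1)

print(nrow(duplicate_ratings))
print(nrow(ratings))
```

Note that this drops every copy of a duplicated rating rather than keeping one of them, which is how the step is described in the session.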
View of ratings. Right, so these are all of the records where there are no duplicate ratings. Our second task was to remove all of those users who have rated fewer than three books. For this, we'll have to start off by grouping the ratings by user ID and finding the number of ratings given by each user.
So I will select this command over here and paste it over here. I have given ratings over here, and I'm grouping the ratings with respect to user ID. After that I am using the mutate function, and again I am adding a new column, which would be ratings_given. I get that ratings_given column with the help of the n() function from the dplyr package, which basically gives me the number of ratings given by each user. Right, so I will store this back into the ratings dataset. Now let me have a glance at it: View of ratings.
So this is the user ID: user number 314 has given 181 ratings in total, user ID 439 has given 173 ratings in total, and similarly user ID 9246 has given 190 ratings in total.
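Here is a small sketch of the per-user count, together with the fewer-than-three filter that the session applies in the next step. It assumes dplyr and uses a toy data frame, so the counts are much smaller than the 181, 173 and 190 seen on screen:

```r
library(dplyr)

# Toy ratings: user 314 rated three books, 439 two books, 9246 one book.
ratings <- data.frame(
  book_id = c(1, 2, 3, 1, 2, 5),
  user_id = c(314, 314, 314, 439, 439, 9246),
  rating  = c(5, 4, 3, 2, 5, 4)
)

# Attach the number of ratings each user has given to every one of their rows.
ratings <- ratings %>%
  group_by(user_id) %>%
  mutate(ratings_given = n()) %>%
  ungroup()

# Keep only users with at least three ratings (ratings_given greater than 2).
ratings <- ratings %>% filter(ratings_given > 2)

print(ratings)
```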
So now we'll go ahead and remove all of those user IDs who have given fewer than three ratings, and this is the command for that. I have again given ratings over here, and I am keeping only those records where the ratings given by each user is greater than 2, that is, where each user has rated at least three books, and I am storing the result back to ratings. View of ratings: this is our final dataset, and we see that there are 960,595 entries in total.

So we are done with the first phase. In the second phase we'll do some data exploration. We'll start off by extracting a sample set of 2% of the records from the entire dataset. Then we will make a bar plot for the distribution of ratings, that is, we'd want to analyze the count of the different ratings. After that, we'll make a plot to understand how many times each book has been rated. Then we'll make a plot for the percentage distribution of the different genres. Going ahead, we'll find the top 10 books with the highest ratings, and finally we'll find the 10 most popular books.
Right, so we are back in RStudio again, and it's time for phase two now. The first task in our phase two was to select a sample from the entire dataset. I'll go ahead and set a seed so that if I ever want to run these commands again, I get the same results. I'll set the seed value to 1.
And I'll set a user fraction of 0.02, that is, from the entire user base I need only 2% of the users as a sample. So I am assigning this value of 0.02 to a new vector and naming that vector user_fraction. After that, I will find all of the unique user IDs. I am using the unique function over here and giving in the user ID column from the ratings dataset; this gives me all of the unique user IDs, and I am storing the result in the users object. Now let me have a glance at the number of unique user IDs: length of users. We see that there are 45,016 unique user IDs in total. We need 2 percent of these unique user IDs, which would be 0.02 × 45,016, or roughly 900 users. So from 45,016 users in total, we would need 900 users.

Right, so we'll do a random sampling of 900 users from the entire user base, and this is the command for that. I am using the sample function; this is the list of all of the users, and from all of the users I need only 900. Earlier we multiplied user_fraction by the length of users, which gives a value of 900 point something (900.32), so we are basically rounding that off, and I will store the result in sample_users. Now let me have a glance at the length of sample_users: you see that there are 900 sample users in total.

Now let me also have a glance at our number of ratings. Initially, the number of ratings which we have is 9 lakh 60 thousand 595 (960,595). What I'll do now is, from this ratings dataset, keep only those user IDs which are present in the sample_users object, that is, I need only the sample users from all of the users, and I will store the result back to ratings. Right, now let me have a glance at the number of ratings: nrow of ratings.
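The sampling steps walked through above can be sketched like this. It assumes dplyr and a toy data frame of 200 users; the object names user_fraction, users and sample_users follow what is read out in the session, written with underscores as an assumption:

```r
library(dplyr)

# Toy ratings: 200 users with 5 ratings each (the real data had 45,016 users).
ratings <- data.frame(
  user_id = rep(1:200, each = 5),
  book_id = rep(1:5, times = 200),
  rating  = rep(c(3, 4, 5, 4, 2), times = 200)
)

set.seed(1)              # make the random sample reproducible
user_fraction <- 0.02    # keep 2% of the users
users <- unique(ratings$user_id)

# Draw round(0.02 * number of users) user IDs, without replacement.
sample_users <- sample(users, round(user_fraction * length(users)))

# Keep only the ratings made by the sampled users.
ratings <- ratings %>% filter(user_id %in% sample_users)

print(length(sample_users))
print(nrow(ratings))
```

With 200 toy users the sample has 4 users; with the 45,016 users in the session the same arithmetic gives round(900.32) = 900.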
So now we see that the number of ratings has reduced to 18,832. Initially we had more than 9 lakh (900,000) ratings, and after filtering the dataset we have just 18,832 ratings. All right, so our second task was to plot the distribution of these ratings, so let me go ahead and do that. So guys, this is the command for that.
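The on-screen commands are not part of this transcript. As a hedged sketch of the two plots discussed in this part of the session (the rating distribution and the number of ratings per book), assuming dplyr, ggplot2 and a toy data frame, they might look like this. Using coord_cartesian to limit the x-axis is an assumption; the session only says the axis runs from 0 to 40:

```r
library(dplyr)
library(ggplot2)

# Toy ratings standing in for the sampled data.
ratings <- data.frame(
  book_id = c(1, 1, 2, 3, 3, 3, 4, 5),
  rating  = c(4, 5, 4, 3, 4, 1, 5, 2)
)

# Bar plot of how often each star rating was given.
rating_plot <- ggplot(ratings, aes(x = rating, fill = factor(rating))) +
  geom_bar(color = "grey20") +
  scale_fill_brewer(palette = "YlGnBu") +
  guides(fill = "none")  # hides the legend; older ggplot2 code used guides(fill = FALSE)

# Count the ratings per book, then plot how many books have each count.
per_book_plot <- ratings %>%
  group_by(book_id) %>%
  summarize(number_of_ratings_per_book = n()) %>%
  ggplot(aes(x = number_of_ratings_per_book)) +
  geom_bar(fill = "orange", color = "grey20") +
  coord_cartesian(xlim = c(0, 40))

print(rating_plot)
print(per_book_plot)
```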
Again, I am using the ratings dataset, and on top of this I am building the ggplot. Here I am mapping the rating column onto the x-axis, that is, this column over here with the different ratings 1, 2, 3, 4 and 5. The fill color is also determined by the rating column. After that, since we have to make a bar plot, I am using the geom_bar function. The color I give to the boundary of the bars is grey20, and the colors of the bars themselves come from this palette over here, YlGnBu, which stands for yellow, green and blue; we give this inside the scale_fill_brewer function. All right, and I am also setting the guides to FALSE. Let me hit enter, and this is what you get. Let me zoom this now.

This is quite an interesting plot, isn't it? Let's have a glance at this bar over here: it basically tells us that there are more than 6,000 cases where a rating of 4 stars was given. Similarly, we see that there are more than 5,000 cases where a rating of 5 stars was given, and this bar over here tells us that a rating of 3 stars was given around 4,700 times. The count of these two bars is quite low: there have been very few cases where a rating of 1 star was given, maybe not even 500 times, and this one is for the rating of 2 stars, which would have been given around a thousand times. So guys, this is the distribution of the ratings.

After that, we had to find the number of ratings for each book, so let's also do that. Here again I start off by giving the ratings dataset, and I have to group it with respect to book ID, because I want to find the number of ratings for each book. After grouping it, I use the summarize function, and inside the summarize function I get the count of the number of ratings for each book. Here I am again using the n() function, which gives me the number of ratings per book, and I'm naming the result number_of_ratings_per_book. After that I again use the pipe operator and add a ggplot layer on top of it. I am assigning number_of_ratings_per_book to the x-axis, the fill color is orange, the boundary color is grey20, and the x-axis values range from 0 to 40.

Right guys, so this is the plot; let me zoom this now. From this graph we can basically infer that there is not even one case where a book was rated more than 10 times. Let's have a glance at this bar over here: it tells us that there are more than 2,500 instances where a single book was rated by only one user. This one is for those instances where a single book was rated by two users, or in other words a single book was rated two times. This one is for those instances where a single book was rated by three users, or in other words three times, and the count for this is around 1,500. Right, so this plot was for the number of ratings per book.

Then we had to get the percentage distribution of the different genres. What we'll do is start off by making a new object and giving it the name genres.
This genres object basically has a list of different genres in it. I have listed down a bunch of different genres over here, and I am storing all of these in the genres object. The different genres are art, biography, science, thriller, travel, humor and comedy, and so on. After building the genres object, what I'll do is, from the tags dataset, extract only those tag names which are present in genres, or in other words, only those genres which are listed down over here. This is the command for that: what I'm basically doing is finding out whether the listed genres are present in the tag names or not, and if they are present, I am extracting only those genres and storing them in available_genres. Let me hit enter and see what the available genres are. Right, so these are all of the available genres in the tags dataset; there are 27 genres in total, and these are Christian, business, poetry, philosophy, science and so on.

Similarly, I will extract all of the tag IDs corresponding to these tag names, so let me find out which are the available tags. Over here I am basically extracting all of those tag IDs whose tag name is one of the available genres. If the tag name is one of the available genres, I extract that tag ID; if the tag name is not among the available genres, I don't. I am storing the result in available_tags.

Next we have to make a plot for the percentage of each genre, so let's go ahead and do that. Before we do that, let's actually get the count of the different genres available; this would be the command for that.
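Before the count, here is a small base R sketch of the genre matching described above. The tags table is a toy stand-in, the genre list is shortened, and using %in% for the matching is an assumption about how the session does it:

```r
# Toy tags table standing in for tags.csv.
tags <- data.frame(
  tag_id   = c(10, 11, 12, 13),
  tag_name = c("fantasy", "to-read", "poetry", "science"),
  stringsAsFactors = FALSE
)

# Hand-written list of genres of interest (shortened here).
genres <- c("art", "biography", "fantasy", "poetry", "science", "travel")

# Genres from the list that actually occur as tag names.
available_genres <- genres[genres %in% tags$tag_name]

# Tag IDs whose tag name is one of the available genres.
available_tags <- tags$tag_id[tags$tag_name %in% available_genres]

print(available_genres)
print(available_tags)
```

Tags such as "to-read" are shelf names rather than genres, which is why the list of genres is written out by hand and matched against the tag names.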
Let me paste it over here. What I'm basically doing is, from the book_tags dataset, keeping only those tag IDs which are present in available_tags, and then grouping with respect to tag ID. After that I use the summarize function and get the count of each of these tag IDs, or in other words, the count of the different genres. Let me hit enter and see what we get. This is tag ID 2938, and for the corresponding genre the count is 436, that is, there are 436 books belonging to this genre. Similarly, this is tag ID 4605 and the count is 1,109, which means that there are 1,109 books present for this particular genre. And let's take this one over here: the tag ID is 7778 and the count is 469, which means that there are 469 books present for this genre.
Now let me go ahead and also find the percentage, so let me select all of this code over here. Right, so we had run the command till here, and we basically got the count of each genre. After getting the count of each genre, I am ungrouping it again. After that, I use the mutate function and find the total count, that is, the total count of all of the genres combined, and I am also creating a new column, percentage. For this percentage I am dividing n by the sum of n, which gives me the share of each genre. After getting the percentage of each genre, I arrange the dataset in descending order, and after that I also left join the tags dataset to it, with the join done on the tag ID column. Let me store this in a new object, say book_info. Now let me have a glance at it: View of book_info. Right, so guys, this is the tag ID, this is the count of the tag ID, that is, the number of times this genre is present, this column gives us the total count of all of the genres, this gives us the percentage of the genre, and this is the tag name, which is fantasy.

So we've got a dataset ready, and now we'll go ahead and make a plot on top of it. The object name was actually book_info, so let me change this to book_info over here. On top of this book_info object I am adding a ggplot layer. Here I'll be mapping the percentage column onto the y-axis and the tag name column onto the x-axis, and the fill is determined by the percentage column. Since we want to make a bar plot, we'll be using the geom_bar function, and the stat I've used is identity. I'll also use the coord_flip function over here, because I want these bars to be laid out horizontally and not vertically. The color of these bars is determined by this palette over here, YlOrRd, which stands for yellow, orange and red. The label I've given for the y-axis is Percentage, and the label I've given for the x-axis is Genre. Let me hit enter. Right guys, so this is the plot.
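The percentage pipeline and the flipped bar plot described above could be sketched as follows, assuming dplyr, ggplot2 and toy inputs. The column name sumN, the reorder() call that sorts the bars, and the use of scale_fill_distiller for the continuous fill are assumptions, not something the transcript confirms:

```r
library(dplyr)
library(ggplot2)

# Toy inputs: counts per genre tag and the tag names to join on.
genre_counts <- data.frame(tag_id = c(10, 12, 13), n = c(6, 3, 1))
tags <- data.frame(
  tag_id   = c(10, 12, 13),
  tag_name = c("fantasy", "poetry", "cookbooks"),
  stringsAsFactors = FALSE
)

# Share of each genre (a fraction between 0 and 1), sorted, with names attached.
book_info <- genre_counts %>%
  mutate(sumN = sum(n), percentage = n / sumN) %>%
  arrange(desc(percentage)) %>%
  left_join(tags, by = "tag_id")

# Horizontal bar plot: stat = "identity" uses the percentage values as bar lengths.
genre_plot <- ggplot(book_info,
                     aes(x = reorder(tag_name, percentage),
                         y = percentage, fill = percentage)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_distiller(palette = "YlOrRd") +
  labs(y = "Percentage", x = "Genre")

print(book_info)
print(genre_plot)
```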
So here, with the coordinates flipped, genre is mapped onto the y-axis and percentage onto the x-axis. We see that fantasy is the most prevalent genre, or in other words, more books belong to the fantasy genre than to any other, and the lowest percentage is for cookbooks. So this was the percentage distribution of the different genres.

Up next, we will go ahead and find the top 10 books with the highest rating, and this would be the command for that. If you have to find the top ten books with the highest rating, all you have to do is arrange the average rating column in descending order, and that is what we are doing over here. I have given the name of the object, which is books, and I am arranging this dataset in descending order of average rating. After that I am selecting just the top 10 records, and the columns I'm selecting are title, ratings count and average rating. Let me store it in top10 and have a glance at top10 now. Right guys, so these are the top 10 highest rated books. The Complete Calvin and Hobbes is the book with the highest rating, with an average rating of 4.82. Then we have Words of Radiance, which has a rating of 4.77. Third in place is the Harry Potter box set, which has a 4.77 rating. Fourth is the ESV Study Bible, which has a rating of 4.76, and fifth in the list is the Mark of the Lion trilogy, which has an average rating of 4.76. Right guys, so we have successfully found the top ten books with the highest ratings.

Next we'll go ahead and also find the top 10 most popular books, and this would be the command for that. To find the top 10 most popular books, we'll have to arrange the ratings count column in descending order, that is, whichever book has the largest number of ratings is automatically the most popular book. Right, so this is the command for that; let me run it over here. What we are doing here is, on the books dataset, arranging it in descending order of the ratings count column, then extracting the top 10 records and selecting the title column, the ratings count column and the average rating column. Let me store this in top_popular and have a glance at top_popular.
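Both rankings follow the same sort-then-take-ten pattern. A sketch assuming dplyr and a toy books table with made-up titles; using head(10) to take the first ten rows is an assumption about how the session selects them:

```r
library(dplyr)

# Toy books table with the three columns used in the session.
books <- data.frame(
  title          = c("Book A", "Book B", "Book C", "Book D"),
  ratings_count  = c(1200, 90000, 450, 30000),
  average_rating = c(4.82, 4.10, 4.77, 3.95),
  stringsAsFactors = FALSE
)

# Highest rated: sort by average rating, best first, and keep up to 10 rows.
top10 <- books %>%
  arrange(desc(average_rating)) %>%
  head(10) %>%
  select(title, ratings_count, average_rating)

# Most popular: sort by the number of ratings instead.
top_popular <- books %>%
  arrange(desc(ratings_count)) %>%
  head(10) %>%
  select(title, ratings_count, average_rating)

print(top10)
print(top_popular)
```

The two lists can differ a lot: a book with a very high average rating may have relatively few ratings, and the most rated books need not have the highest averages.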
Right, so these are the top 10 most popular books. The most popular book in the list is The Hunger Games, which has the highest ratings count. The second most popular book is Harry Potter, the third most popular book is Twilight, and the fourth most popular book is To Kill a Mockingbird.

In the third phase we'll finally do some recommending. We'll start off by building the user-based collaborative filtering model, and then we'll recommend six new books for two different readers. Right guys, so it's finally time to recommend some books.
Before we go ahead and do that, we actually have to restructure our data a bit. Our data needs to be in the form of a matrix where all the rows correspond to the users and all the columns correspond to the books. The dimension names would then be nothing but the user IDs and the book IDs: the user IDs would represent all of the rows and the book IDs would represent all of the columns. So let me go ahead and extract all of the dimension names first. With this command, I am extracting all of the unique user IDs and all of the unique book IDs, and I'm storing them in this object, dimension_names. That is the first step, so we have basically got all of our dimension names.

Now we'll go ahead and convert the data frame from long format to wide format. With this command, we are selecting the book ID, the user ID and the rating columns from this data frame over here, and splitting up the book ID column: the value book ID 1 becomes one column, the value book ID 2 becomes the second column, the value book ID 3 becomes the third column, and so on, and the rating values over here become the values for the corresponding book IDs. This is how we can use the spread function to spread out our data frame from long format to wide format. We'll also remove the user ID column, because it doesn't serve a purpose, and I will store the result in a new object and name that object ratingmat. Let me hit enter. Right, so we have created our rating matrix.
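A small sketch of the long-to-wide step, assuming dplyr, tidyr and a toy ratings table. Sorting the IDs in dimension_names is an assumption made here so that the names line up with the row and column order that spread produces:

```r
library(dplyr)
library(tidyr)

# Toy long-format ratings: one row per (user, book) rating.
ratings <- data.frame(
  book_id = c(1, 2, 1, 3, 2),
  user_id = c(314, 314, 439, 439, 9246),
  rating  = c(5, 4, 3, 2, 4)
)

# Row and column labels for the matrix that will be built later.
dimension_names <- list(
  user_id = sort(unique(ratings$user_id)),
  book_id = sort(unique(ratings$book_id))
)

# One row per user, one column per book; unrated cells become NA.
ratingmat <- ratings %>%
  select(book_id, user_id, rating) %>%
  spread(key = book_id, value = rating) %>%
  select(-user_id)

print(ratingmat)
```

In newer tidyr code, pivot_wider does the same job as spread; spread is used here because it is the function named in the session.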
So we have created the ratingmat object; now let me have a glance at the class of this: class of ratingmat. We see that this is still in the form of a data frame, but we can build our user-based collaborative filtering model only on top of a real rating matrix, so first we'll have to go ahead and convert this data frame into a matrix. Let me do that. Here, with the help of the as.matrix function, I am converting the class of ratingmat from a data frame to a matrix, and I am storing the result back to ratingmat. Right, so now let me have a glance at the class again: class of ratingmat, and we see that now it is a matrix. Let me have a glance at the first five rows and the first five columns of it. These are the first five rows and these are the first five columns, and we see that the user ID column has not been removed, so let us go ahead and remove it manually.
What I'm doing is manually removing this first column and storing the result back into ratingmat. Now let me have a glance at the first five rows and first five columns. Right, so these are the first five rows and these are the first five columns. The rows basically correspond to all of the user IDs and the columns correspond to all of the book IDs. The NA values which you see over here basically mean that the first user has not rated the first book, the first user has not rated the second book, similarly the fourth user has not rated the third book, and so on.

Right, so we have a rating matrix ready. Now let me also assign the dimension names to this ratingmat object: here I am assigning all of the dimension names which I've extracted to the dimnames of ratingmat. Now let me have a glance at the dimnames of ratingmat. These are all of the dimension names for the rows: this is the name for row number one, this is the name for row number two, and this one is the name for row number three. This basically signifies that all of the rows are represented by the user IDs. Similarly, if I go down, we have all of the book IDs, so all of the columns are represented by the book IDs. Now let me use the dim function to find the number of rows and columns in the matrix. We see that there are 900 rows and 8,431 columns, or in other words, there are 900 users and 8,431 books.

So we have got our matrix ready, but we can't build our user-based collaborative filtering model directly on top of this matrix. We have something known as a real rating matrix, and recommenderlab basically works only on this type of object. So I will store this ratingmat in a new object and name that object ratingmat0.
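The conversion and labelling steps above, sketched in base R on a toy wide-format data frame. In this toy example the user ID column is already gone, so the manual removal of the first column appears only as a comment:

```r
# Toy wide-format ratings: rows are users, columns are books, NA = not rated.
ratingmat <- data.frame(
  `1` = c(5, 3, NA),
  `2` = c(4, NA, 4),
  `3` = c(NA, 2, NA),
  check.names = FALSE
)
dimension_names <- list(user_id = c(314, 439, 9246), book_id = c(1, 2, 3))

# Data frame to matrix.
ratingmat <- as.matrix(ratingmat)
print(class(ratingmat))

# If a stray user_id column were still the first column, this would drop it:
# ratingmat <- ratingmat[, -1]

# Label rows with user IDs and columns with book IDs, then check the shape.
dimnames(ratingmat) <- dimension_names
print(dim(ratingmat))

# Working copy that will have its NA values replaced in the next step.
ratingmat0 <- ratingmat
```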
Let me again have a glance at the dimensions of this: dim of ratingmat0. We have the same number of rows and columns, so again the number of rows is 900 and the number of columns is 8,431. After this, wherever we have NA values, we will replace those NA values with 0. So in ratingmat0, wherever we find NA values, I am replacing them with 0. Now let me again have a glance at the first five rows and first five columns of this: ratingmat0[1:5, 1:5]. We see that all of the NA values have been replaced with zeros.

Now we can go ahead and convert this matrix into a sparse matrix. I'll be using the as function to convert this ratingmat0 object into a sparse matrix, and I will store this in a new object and name that object sparse_ratings. Let me again have a glance at the first five rows and first five columns: sparse_ratings[1:5, 1:5]. This is what a sparse matrix looks like; basically, with the help of a sparse matrix, we end up saving a lot of space.

Now it's the last part of the transformation: we will go ahead and convert this sparse matrix into a real rating matrix. For this we need the new function, and with its help I am converting the sparse_ratings object into a real rating matrix. It takes in two parameters here: the first parameter is the class we are trying to convert to, and then we give the data which we are trying to convert. I'll store this result in the real_ratings object.
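The three transformations just described can be sketched as follows, assuming the Matrix and recommenderlab packages are installed and using a 3 x 3 toy matrix. The sparse class name "sparseMatrix" passed to as() is an assumption about the exact call used in the session:

```r
library(Matrix)
library(recommenderlab)

# Toy rating matrix: rows are users, columns are books, NA = not rated.
ratingmat <- matrix(
  c(5, 3, NA,   # book 1
    4, NA, 4,   # book 2
    NA, 2, NA), # book 3
  nrow = 3,
  dimnames = list(user_id = c("314", "439", "9246"),
                  book_id = c("1", "2", "3"))
)

# Replace NA with 0 so that missing ratings become the "empty" sparse cells.
ratingmat0 <- ratingmat
ratingmat0[is.na(ratingmat0)] <- 0

# Sparse storage keeps only the non-zero cells.
sparse_ratings <- as(ratingmat0, "sparseMatrix")

# Wrap the sparse matrix in the class recommenderlab works with.
real_ratings <- new("realRatingMatrix", data = sparse_ratings)

print(real_ratings)
```

Because the zeros are dropped in the sparse form, they end up meaning "not rated" rather than "rated zero", which is why real ratings start at 1.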
Let me print real_ratings now and see what we get. This is what we get: a rating matrix with 900 rows and 8,431 columns, of class realRatingMatrix, with 18,832 ratings. Right, so we finally have our real rating matrix ready, and now we can go ahead and build a model on it.

What we'll do is split the dataset into train and test sets, with an 80/20 split. I'll be using the sample function to create this 80/20 split: I generate TRUE or FALSE values, the sampling is with replacement, and the probabilities are 0.8 and 0.2. That is, I want to divide the dataset into two parts, where the first part comprises about 80% of all of the observations and the second part comprises the remaining 20% or so. I will store the split criteria in a new object and name that object split_book. Now, in real_ratings, wherever the value of split_book is TRUE, I select all of those records and store them in rec_train. Similarly, from the real_ratings matrix, wherever the value of split_book is FALSE, I extract all of those observations and store them in rec_test. So we have our training and testing sets ready.

Now we can finally go ahead and build our first model. We'll be using the Recommender function, and over here it takes in two parameters: first is the training set on which we want to build the model, and next is the method, that is, the type of recommender model we want to build. Since we want to build a user-based collaborative filtering model, we'll give in the method as UBCF. If we wanted an item-based collaborative filtering model, the method would be IBCF, but in our case, since we want a user-based collaborative filtering model, the method is UBCF. I will store the result in an object, rec_model_ubcf. Right, so I have successfully built the model. Now I'll give a value for the number of books to be recommended: I am assigning the value 6 to a vector, and the name of that vector is n_recommended_ubcf. This basically lets us know that we have to recommend six books in total.

Right, so the model building process is done; let me go ahead and predict the values. I have to use the predict function for this, and over here it takes in three parameters: first is the model which we built, next is the dataset on which we want to predict the values, and third is the number of items to be recommended. So first we'll give in the model which we built, which is rec_model_ubcf. Then we want to predict on top of the test set, which is rec_test, and the number of books to be recommended is six, which is stored in n_recommended_ubcf. I will store this result in rec_predicted_ubcf. Right, so the prediction is also done.

Now that we have predicted the values, let's go ahead and recommend some books to user number one. Here we will use this object, rec_predicted_ubcf, which we've just built using the predict function, and I want to find the item numbers which have been recommended to user number one. This command over here, rec_predicted_ubcf@items[[1]], gives me the column numbers which have been recommended to user number one. Let me have a glance at user1_book_numbers. This basically means that user 1 has been recommended the books which are present in column number 2177, column number 6343, column number 423, column number 1908, column number 2274 and column number 2514.

Now, we just have the column numbers, so let's actually find the labels for these column numbers. From rec_predicted_ubcf we have the item labels, and inside the item labels I will give in the book numbers which have been recommended. Right, so these are the item labels: the book ID under column number 2177 is 2636, the book ID under column number 6343 is 7482, and similarly the book ID under column number 2274 is 2750. So we have successfully got the book IDs.

Now that we have the book IDs ready, let's use them to extract the name of the book and the name of the author. What I'll do is extract the title of the book and the author's name where the book ID is 549: the name of the book is Corduroy and the author's name is Don Freeman. Similarly, let me extract the title and the author's name where the book ID is 2750: here we see that the recommended book is Mrs. Piggle-Wiggle, and the authors are Betty MacDonald and Alexandra Boiger.
Similarly, let me extract the title for book ID 3029 and see what we get. For 3029, the recommended book is Rage of Angels and the author is Sidney Sheldon. Right, so these are the six books which have been recommended, and this is how we can extract the name of the book and the author of the book.

Now, similarly, let's also go ahead and recommend six books for user number five. This is what I have to do: I will be using the rec_predicted_ubcf object with the @ operator, extracting the items for user number five, and storing this result in user5_book_numbers. Now that we have the item numbers, let's extract the labels for each of these items. For this, the command would be rec_predicted_ubcf@itemLabels, and inside this I will give in the book numbers. Right, so these are the book IDs which have been predicted. Similarly, let's extract the title and authors for some of these book IDs. We are extracting the title and the author's name where the book ID is 4624.
so the recommended book the name of the recommended bukas the girl who
circumnavigated fairyland in a ship of her own making and the author of
scattering am valentin and anna one similarly let’s extract the name of the
book and the author’s name well the book ID is six eight six seven so here the
book is Malgudi days and it’s been written by RK Narayan and Champa Larry
now let me extract the title and the author’s name for the book ID which is
seven three to six so the books name is doctors and it has been written by Eric
cycle right guys so we have successfully implemented the user based collaborative
filtering model and we have recommended six books to two different users
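
The lookup steps described for user number five can be sketched roughly as follows. This is only an illustration under stated assumptions: the object name `rec_predicted_ubcf`, the variable names, and the `books` columns are guesses at what was typed in the session, the item indices are made up, and a small stand-in S4 class replaces the real prediction object so the snippet runs in base R without the recommender package.

```r
library(methods)

# Stand-in for the prediction object (assumption: the real object exposes
# `items` and `itemLabels` slots, which is what the @ access relies on).
setClass("TopNStandIn", representation(items = "list", itemLabels = "character"))

# Hypothetical predictions: column indices per user, plus the book IDs
# that label the columns of the rating matrix. The indices are invented.
rec_predicted_ubcf <- new(
  "TopNStandIn",
  items = list(c(1L, 3L), c(2L, 4L), c(1L, 2L), c(3L, 4L), c(2L, 3L, 5L)),
  itemLabels = c("549", "4624", "6867", "3029", "7326")
)

# Toy lookup table holding only a few of the books named in the session;
# the column names are assumptions.
books <- data.frame(
  book_id = c(549L, 4624L, 6867L, 7326L),
  title = c(
    "Corduroy",
    "The Girl Who Circumnavigated Fairyland in a Ship of Her Own Making",
    "Malgudi Days",
    "Doctors"
  ),
  authors = c(
    "Don Freeman",
    "Catherynne M. Valente, Ana Juan",
    "R. K. Narayan, Jhumpa Lahiri",
    "Erich Segal"
  ),
  stringsAsFactors = FALSE
)

# Step 1: column indices of the items recommended to user 5.
user5_book_numbers <- rec_predicted_ubcf@items[[5]]

# Step 2: translate those indices into book IDs via the item labels.
user5_book_ids <- as.integer(rec_predicted_ubcf@itemLabels[user5_book_numbers])

# Step 3: look up title and authors for the predicted book IDs.
user5_books <- books[books$book_id %in% user5_book_ids,
                     c("book_id", "title", "authors")]
print(user5_books)
```

Filtering on all predicted IDs at once with `%in%` avoids repeating the lookup once per book ID, which is how it was done one by one in the walkthrough.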
just a quick info in case you guys are looking for end-to-end course
certification in R programming Intellipaat provides the R programming
for data science program where you can learn all of these concepts thoroughly
and earn a certificate at the same time the link is given in the description box
below so make sure to check it out I hope you guys took away a lot from this
detailed session if you have any queries make sure to head down to the comment
section below and do let us know and we’ll be happy to help you out there and
on that note thank you for watching have a nice day

