Machine learning for beginners in R #1: Introduction to KNN & reading data from a dataset
This is, in fact, my first tutorial on here, so I hope it is comprehensive and easy to work with. In this tutorial, I am going to show you the basic steps of machine learning in R. I recommend looking into the basics of R first, so you have an idea of what you are actually working with. But you can still follow along if you are an absolute newbie in R.
I'll show you how to use R to work with the well-known machine learning algorithm called k-nearest neighbors (KNN). KNN is a simple machine learning algorithm and an example of instance-based learning, where new data is classified based on stored, labeled instances. (Which is a pretty good start, in my opinion.)
Specifically, the difference between a new data point and each stored data point is calculated by means of a similarity measure. It is often expressed as the Euclidean distance.
So, TL;DR: the similarity to the data that is already in the system is calculated for every new data input.
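To make the Euclidean distance a bit more concrete, here is a minimal sketch. The function name euclidean_distance and the two example flowers are made up for illustration; they are not part of the iris data or of any package.

```r
# Euclidean distance between two numeric vectors of equal length
# (hypothetical helper, written only for this illustration)
euclidean_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  sqrt(sum((a - b)^2))
}

# Two made-up flowers, each described by four measurements
stored_flower <- c(5.1, 3.5, 1.4, 0.2)
new_flower    <- c(6.3, 2.9, 5.6, 1.8)

# The smaller this number, the more similar the two flowers are
euclidean_distance(stored_flower, new_flower)

# Base R's dist() uses the Euclidean distance by default,
# so this should give the same value
dist(rbind(stored_flower, new_flower))
```

A small distance means "very similar", a large distance means "not similar at all".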
Using that similarity value, we can perform predictive modeling.
Predictive modeling is either classification (assigning a label or a class to the new instance) or regression (assigning a numeric value to the new instance). Which one you use is up to you when you work with KNN later.
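Here is a rough sketch of how both flavors could look if you wrote KNN by hand in base R. All data, labels, and the helper nearest_neighbors are invented for this example; in practice you would probably use a ready-made function such as knn() from the class package instead.

```r
# Toy training data (made up): two features per stored instance
train_x <- rbind(c(1.0, 1.0),
                 c(1.2, 0.8),
                 c(4.0, 4.2),
                 c(4.1, 3.9),
                 c(3.8, 4.0))
train_label <- c("small", "small", "large", "large", "large")  # used for classification
train_value <- c(10, 12, 40, 42, 38)                            # used for regression

# Hypothetical helper: row indices of the k stored instances closest to new_x
nearest_neighbors <- function(train_x, new_x, k) {
  # Subtract new_x from every row, then take the Euclidean distance per row
  distances <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  order(distances)[seq_len(k)]
}

new_x <- c(3.9, 4.1)
neighbors <- nearest_neighbors(train_x, new_x, k = 3)

# Classification: majority vote among the neighbors' labels
votes <- table(train_label[neighbors])
predicted_label <- names(votes)[which.max(votes)]

# Regression: average of the neighbors' values
predicted_value <- mean(train_value[neighbors])

predicted_label
predicted_value
```

The neighbor search is the same in both cases; only the last step (vote or average) changes.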
So! Enough talking, let's get to the code! Finally. In this example, we are going to use the already existing dataset iris. It already exists in R, so we don't need to import anything. Try it out now by typing iris into your R console!
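If printing all 150 rows is a bit much, a few base R functions give you a quicker look. This is just one way to peek at the built-in data, assuming a standard R installation where the datasets package is loaded by default:

```r
# `iris` ships with R in the `datasets` package, so no import is needed
head(iris)   # first six rows
str(iris)    # column types: four numeric measurements plus the Species factor
dim(iris)    # number of rows and columns
```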
Hell yeah, this works great! (Hopefully.)
Now, let's load the whole dataset from its original source using this script:
    # Read in `iris` data
    iris <- read.csv(url("http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data"), header = FALSE)

    # Print first lines
    head(iris)

    # Add column names
    names(iris) <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

    # Check the result
    iris
Nice! You have officially imported a data set! Now, what does this script do?
The read.csv() command reads comma-separated data from a source, here a URL. The header argument has been set to FALSE, which tells R that the first line of the source is data and does not contain the attribute names.
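To see what the header argument changes without needing an internet connection, here is a small sketch. The two lines in raw_text are hand-typed stand-ins in the same shape as the real file, and the text argument of read.csv() is used in place of a URL:

```r
# Two hand-typed lines shaped like the source file: no header row, just values
raw_text <- "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa"

# header = FALSE: every line is data, and R invents placeholder names V1..V5
no_header <- read.csv(text = raw_text, header = FALSE)
names(no_header)
nrow(no_header)

# header = TRUE (the read.csv default): the first line is used up as column names,
# so one row of real data silently disappears
with_header <- read.csv(text = raw_text, header = TRUE)
nrow(with_header)
```

That is why header = FALSE matters for a file like this one: without it, the first flower would be swallowed as column names.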
To simplify working with the data set, I recommend creating the column names yourself. You can do this with the function names(), which gets or sets the names of an object. Define the names of the attributes as you would like them to appear. In the code chunk above, you'll have listed Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, and Species.
These names are in fact not random, but listed in the dataset's description.
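Here is a short sketch of names() in both of its roles, getting and setting. The variables raw_text, column_names, flowers, and flowers_named are made up for this example; the col.names argument at the end is an alternative that read.csv() accepts, not something the script above uses.

```r
# Two hand-typed lines standing in for the real file
raw_text <- "5.1,3.5,1.4,0.2,Iris-setosa\n4.9,3.0,1.4,0.2,Iris-setosa"
column_names <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width", "Species")

flowers <- read.csv(text = raw_text, header = FALSE)

names(flowers)                  # get: the placeholder names R made up
names(flowers) <- column_names  # set: replace them with meaningful ones
names(flowers)                  # get again: now the new names

# Alternative: hand the names over while reading, via col.names
flowers_named <- read.csv(text = raw_text, header = FALSE, col.names = column_names)
identical(names(flowers), names(flowers_named))
```

Both routes end up with the same column names, so pick whichever reads better to you.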
In our next tutorial, we will talk about how to understand your data. We will look into statistics with R, and even some advanced stuff.