- Installation of Julia
- Data Structures Concept in Julia Programming Language
- For Loop
- Basics Of Julia For Data Analysis
- Exploratory Data Analysis With Julia
- Using R and Python Libraries in Julia
- Using Pandas With Julia
- Introduction To DataFrames.jl
- Visualization in Julia Using Plots.jl
- Histogram Chart
- Data Munging In Julia
- Building a Predictive ML Model
- Logistic Regression
- Decision Tree
- Random Forest
- Using ggplot2 in Julia
Julia is a general-purpose programming language, like C or C++, that was developed mainly for numerical computation. Scientific and data-driven work increasingly depends on fast computation: results have to be produced from large-scale data in a fraction of a second. Yet with so many capable, well-supported languages already available, such as C, C++, Java and Python, the obvious question is: why Julia?
Julia was designed for numerical computing and aims to eliminate the performance problems that often come with high-level languages. It provides an environment well suited to developing applications that demand high performance.
Installation of Julia
Here are the steps to download and install Julia on your system:
Step-1: Go to https://julialang.org/downloads/, or search Google for “Download Julia”.
Step-2: Download the installer that matches your machine, i.e. 32-bit or 64-bit.
Step-3: After the download finishes, run the .exe file.
Step-4: Click the Install button and follow the installer prompts.
Step-5: Tick the checkbox to run Julia and click Finish.
Step-6: A command-line prompt now opens. This is the Julia REPL (Read-Eval-Print Loop).
Before moving on to the next topic, let’s set up Julia’s tooling for data analysis and data science projects.
Jupyter Notebook is popular in data science and ML because it gives quick feedback and is easy to work with. Julia has its own IDE, Juno, but if you are already familiar with notebooks you can keep using Jupyter Notebook through the IJulia package. Let’s see how to set it up.
Open the Julia prompt and type the following commands:
julia> using Pkg
julia> Pkg.add("IJulia")
After you run the command, the necessary packages are added or updated.
Once the IJulia package has been installed, type the following to launch it:
julia> using IJulia
julia> notebook()
By default, the notebook “dashboard” opens in your home directory.
To open the dashboard in a different directory, use notebook(dir = "/some/path").
Data Structures Concept in Julia Programming Language
Like every other programming language, Julia has built-in data structures. Let’s look at the ones most often used in data analysis.
- Vector (Array) – A vector is a one-dimensional array. As in most languages, it is written as a list of values separated by commas inside square brackets, for example [1, 2, 3].
One difference from Python is worth remembering: in Julia, indexing starts at 1, whereas in Python it starts at 0.
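As a minimal sketch of the vector behaviour described here (the values are made up for illustration), the following shows 1-based indexing and appending to a vector:

```julia
# A vector is a one-dimensional array of values.
v = [10, 20, 30, 40]

first_item = v[1]      # indexing starts at 1, so this is 10
last_item = v[end]     # `end` refers to the last index, so this is 40

push!(v, 50)           # append a value in place
println(length(v))     # number of elements after the push
```

Asking for v[0] raises a BoundsError, which is the most common surprise for people arriving from Python.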
- Matrix Operations
A matrix is another data structure, widely used in linear algebra. In Julia a matrix is a two-dimensional array. Let’s look at some matrix operations:
A = [1 2 3; 4 5 6; 7 8 9] # a semicolon starts a new row
When we print it, it looks like:
1 2 3
4 5 6
7 8 9
To access an element, index by row and column: A[1,2] returns 2.
The transpose of the matrix is written A' and gives:
1 4 7
2 5 8
3 6 9
- Dictionary
Another data structure is the dictionary, an unordered collection of key-value pairs in which every key is unique.
Let’s look at a dictionary in practice:
D = Dict("string1" => "Hello", "length" => 5) # create a dictionary
The result contains two pairs (the order is not guaranteed):
"string1" => "Hello"
"length" => 5
To read a value, index the dictionary by its key:
D["length"]
o/p: 5
To get the number of entries in a dictionary, use length(D).
Operations on a dictionary:
- Creation: d = Dict("a" => 1, "b" => 2)
- Addition: d["c"] = 3
- Removal: delete!(d, "b")
- Lookup: get(d, "a", 1) # the last argument is the default returned when the key is absent
- Update: d["a"] = 10
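The dictionary operations listed above can be tried together in one short sketch. The key "zzz" is a made-up name used only to show what happens when a key is absent:

```julia
d = Dict("a" => 1, "b" => 2)   # creation
d["c"] = 3                      # addition
delete!(d, "b")                 # removal
d["a"] = 10                     # update

value = get(d, "a", 0)          # lookup with a default; "a" exists, so its value is returned
fallback = get(d, "zzz", 0)     # key absent, so the default 0 is returned

println(length(d))              # number of entries
println(haskey(d, "b"))         # false, because "b" was deleted
```

Using get with a default avoids the KeyError that d["zzz"] would raise.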
Strings
The next data structure is the string. Strings are written inside double quotes (" "). As in Python, strings in Julia are immutable: once created, a string cannot be changed.
Let’s have a look:
text = "Hello world"
print(text[1]) # gives the first character of the string, 'H'
print(length(text)) # gives the length of the string, 11
There are three key phases in data analysis:
- Data Exploration
Finding out more about the data we have.
- Data Munging
Cleaning the data so that it can be used to build better statistical models.
- Predictive Modelling
The final step: run the algorithm and have fun.
Loops, Conditions In Julia
Like other programming languages, Julia provides loops and conditional statements:
For loop
While loop
If condition
These are the most commonly used loop and conditional statements in Julia, as in other programming languages.
If and else
In Julia we need not worry about spaces, indentation, semicolons or brackets; a block is simply closed with the keyword end. Here is the syntax for if and else:
Syntax: if condition
    Statement
else
    Statement
end
if, elseif and else
This follows the same pattern as the if-else block. Let’s look at the syntax:
Syntax: if condition
    Statement
elseif condition
    Statement
else
    Statement
end
Let’s take an example of what we just discussed:
if x > 0
    "Positive"
elseif x < 0
    "Negative"
else
    "Zero"
end
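To make the conditional above runnable, here is a small sketch that wraps it in a function. The name sign_label is an assumption chosen for illustration, and the final branch is labelled "Zero" because it is reached only when x equals 0:

```julia
# Classify a number by its sign using if / elseif / else.
function sign_label(x)
    if x > 0
        return "Positive"
    elseif x < 0
        return "Negative"
    else
        return "Zero"
    end
end

println(sign_label(5))
println(sign_label(-3))
println(sign_label(0))
```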
Let’s talk about loops in Julia.
For Loop
A for loop in Julia iterates over a range or a collection. A range is written start:stop, or start:step:stop when a step is needed.
julia> for i in 0:10:100
    print(i, " ")
end
gives the result: 0 10 20 30 40 50 60 70 80 90 100
julia> for a in ["red", "green", "yellow"]
    print(a, " ")
end
gives the result: red green yellow
julia> for a in Dict("name" => "orange", "size" => 6)
    println(a)
end
prints each key-value pair (in no guaranteed order):
"name" => "orange"
"size" => 6
Similarly, we can iterate over a 2D array:
A = reshape(1:9, (3, 3))
for i in A
    print(i, " ")
end
The result will be: 1 2 3 4 5 6 7 8 9
Loops can also be used inside functions:
function name()
    for item in collection
        Statement
    end
    return
end
A variable created inside a function is local to it: it exists only while the function runs, and it is gone once the function returns.
function multiply()
    k = 2
    for i in 1:10:50
        k = k * i
    end
    return k
end
If a function needs to assign to a variable that is defined outside it, declare that variable with the keyword global inside the function.
continue and break are control statements used inside loops: continue skips to the next iteration, and break exits the loop.
for i in 10:5:20
    print(i, " ")
    continue # jump straight to the next iteration
end
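A slightly fuller sketch shows both statements doing visible work. The thresholds here are arbitrary choices for illustration:

```julia
evens = Int[]
for i in 1:20
    if isodd(i)
        continue        # skip odd numbers and move to the next iteration
    end
    if i > 10
        break           # leave the loop at the first even number above 10
    end
    push!(evens, i)
end
println(evens)
```

Only the even numbers up to 10 are collected, because continue filters out the odd values and break stops the loop early.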
Comprehensions
Like Python, Julia supports comprehensions.
julia> s = Set([a for a in 1:8])
Set([6, 4, 5, 7, 1, 3, 2, 8])
julia> [(a, b) for a in 1:5, b in 1:2]
(1, 1) (1, 2)
(2, 1) (2, 2)
(3, 1) (3, 2)
(4, 1) (4, 2)
(5, 1) (5, 2)
Generator Expressions
Like comprehensions, generator expressions produce values from an iterable, but they do so on demand without building an intermediate array.
Let’s look at an example:
julia> sum(x^2 for x in 1:10)
385
Nested Loops
A nested loop is a loop written inside another loop. In Julia we do not have to write two separate loops: several loop variables can be listed in a single for statement, separated by commas.
The @show macro used below prints an expression together with its value.
Have a look at this piece of code for a better understanding:
for x in 1:10, y in 1:10
    @show (x, y)
end
The result will be:
(x, y) = (1, 1)
(x, y) = (1, 2)
(x, y) = (1, 3)
(x, y) = (1, 4)
(x, y) = (1, 5)
(x, y) = (1, 6)
(x, y) = (1, 7)
(x, y) = (1, 8)
...
(x, y) = (10, 10)
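@show is a macro that prints an expression and its value.
@time reports how long an expression takes to run and how much memory it allocates. The numbers vary between runs and machines.
julia> x = rand(1000);
julia> function sum_x() # named sum_x so that it does not clash with the built-in sum
    a = 0.0
    for i in x
        a += i
    end
    return a
end
julia> @time sum_x()
0.017705 seconds (15.28 k allocations: 694.484 KiB)
496.84883432553846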
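The timing above includes compilation and the cost of reading a global variable. As a hedged sketch of a fairer measurement, the version below passes the data as an argument and times the second call with @elapsed; the function name total is an assumption chosen for illustration:

```julia
# Sum the elements of a collection with an explicit loop.
function total(values)
    acc = 0.0
    for v in values
        acc += v
    end
    return acc
end

data = rand(1000)
total(data)                      # the first call also compiles the function
elapsed = @elapsed total(data)   # later calls measure only the run time
println("elapsed seconds: ", elapsed)
```

Passing the data in as an argument, instead of reading a global, lets the compiler specialize the loop and usually removes most of the allocations that @time reports.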
While Loop
Like a for loop, a while loop repeats a block, but it keeps running only as long as its condition is true. The syntax is:
while condition
    Statements
end
Let’s look at an example:
julia> x = 0
0
julia> while x < 3
    print(x, " ")
    global x += 1
end
Result: 0 1 2
Finally, exceptions: like other programming languages, Julia has try and catch blocks.
julia> s = "apple"
try
    s[1] = 'a' # strings are immutable, so this throws an error
catch e
    print("caught an error: $e")
end
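As a further sketch of try and catch, the helper below (safe_parse is a hypothetical name, not a library function) converts text to an integer and returns nothing when the text is not a number:

```julia
# Try to parse an integer; report the error type and return `nothing` on failure.
function safe_parse(text)
    try
        return parse(Int, text)
    catch err
        println("caught an error: ", typeof(err))
        return nothing
    end
end

println(safe_parse("42"))
println(safe_parse("abc"))
```

Catching the error inside the function keeps a single bad value from stopping a longer data-cleaning loop.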
Basics Of Julia For Data Analysis
Many of us already use Python or R for machine learning and data science. Both perform well and produce results quickly. Julia, in turn, is designed to process large amounts of data and return results in a fraction of a second.
Its syntax is very similar to that of Python or R, so it takes little time to start using Julia for data analysis. Moreover, data scientists spend a lot of time transforming data into a usable format, and Julia provides extensive libraries for turning raw data into a structured format. The basic steps of data analysis are:
- First, explore the given data sets or data tables and apply statistical methods to find patterns in the numbers.
- Second, plot the data for visualization.
As in machine learning with other languages, the data is usually loaded into a data frame first. In Julia the package for this is DataFrames.jl, which holds data in a tabular (rows and columns) format; files such as .csv and .xlsx are loaded into it with companion packages.
julia> Pkg.add("DataFrames")
julia> Pkg.add("CSV")
Let’s take an example to demonstrate data frames in Julia:
using CSV, DataFrames
# read the dataset
df = CSV.read("demo.csv", DataFrame; delim = ',')
# the dataset is now loaded into the variable df, and we can print it
df
This shows the raw contents of the demo dataset, not the data frame view.
Data frame functions let us find the size, the column names and the first n rows of a data frame:
size(df) = number of rows and columns (m × n)
output: (3, 3)
names(df) = column names
output: ["Anthony", "Ball", "Call"]
first(df, 5) = the first five rows (called head in older versions of DataFrames.jl)
output: first five rows
Numerical data: the describe() function gives basic summary statistics such as the mean, median, minimum and maximum.
Categorical data: countmap() (from the StatsBase package) maps each value to the number of times it occurs in the dataset.
Dealing with Missing Data
This is a very important topic, because everything downstream depends on the data: when data is missing, the accuracy of the predictions suffers. To maintain good accuracy we have to handle the missing values in the dataset.
describe(df) = reports, among other things, the number of missing values in each column (older versions of DataFrames.jl offered showcols() for this)
We can then replace the missing values with a suitable value, for example:
df.Anthony = coalesce.(df.Anthony, "some data to replace")
Visualization summarizes the data and the relationships within it.
The chart shows rainfall (in cm) increasing over the time interval.
Points to remember
Histograms divide the data into bins; more bins show the distribution in finer detail.
Data analysis is not limited to data visualization; analysis is also done after modelling.
Exploratory Data Analysis With Julia
Exploratory data analysis is used to understand a dataset in terms of its features, its variables and the relationships among them. The first step is always to understand the dataset properly.
Two kinds of methods are used to explore a given dataset:
- Statistical methods or functions
- Visual plot techniques
First, apply some statistics to the data table.
Step1: Install the DataFrames package
To work with a data table or dataset in Julia we use a data structure called a data frame, which handles many tabular operations quickly and reliably.
The package has to be installed before it can be used.
The following commands install it:
using Pkg
Pkg.add("DataFrames")
Step2: Download the dataset
Step3: Install and load the necessary packages, such as CSV and DataFrames
using DataFrames
using CSV
a = CSV.read("sample.csv", DataFrame)
Step4: Explore the data
Data exploration shows how the variables in the dataset relate to each other, what the column names are, and so on.
using DataFrames
using CSV
a = CSV.read("sample.csv", DataFrame);
size(a)
names(a)
first(a, 10)
Describe Function
The describe function gives basic summary statistics for a dataset, such as the mean and the median. Some terms worth knowing:
Mean: the average of the values in a column.
Mode: the most frequently occurring value in a column.
Median: the middle value of a column when its values are sorted.
using DataFrames
using CSV
a = CSV.read("sample.csv", DataFrame);
describe(a)
describe(a, :all, cols = :SepalLength)
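To make the three statistics concrete, here is a minimal sketch that uses only the Statistics standard library. The measurements are invented for illustration, and simple_mode is a hand-written helper rather than a library function:

```julia
using Statistics

# Hypothetical measurements, for illustration only.
lengths = [5.1, 4.9, 5.1, 6.3, 5.8, 5.1, 6.3]

# Mode: the most frequent value (if several values tie, one of them is returned).
function simple_mode(xs)
    counts = Dict{eltype(xs),Int}()
    for x in xs
        counts[x] = get(counts, x, 0) + 1
    end
    best, best_count = first(xs), 0
    for (value, c) in counts
        if c > best_count
            best, best_count = value, c
        end
    end
    return best
end

println("mean   = ", mean(lengths))
println("median = ", median(lengths))
println("mode   = ", simple_mode(lengths))
```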
Next, apply visual plot techniques to the dataset.
Plotting in Julia is done with libraries such as Plots, StatsPlots and PyPlot.
Plots: a high-level plotting package that interfaces with other plotting packages, called “back-ends”, which act as the graphics engines that do the actual drawing.
StatsPlots: a companion package to Plots that adds statistical plot types (it was formerly named StatPlots).
PyPlot: a package that wraps Matplotlib, the Python plotting library.
These libraries can be installed as follows:
Pkg.add("Plots")
Pkg.add("StatsPlots")
Pkg.add("PyPlot")
Distribution Analysis
Distribution analysis in Julia is performed with plots such as histograms, scatter plots and box plots.
using DataFrames
using CSV
a = CSV.read("sample.csv", DataFrame);
using Plots
Plots.histogram(a.SepalLength, bins = 50, xlabel = "SepalLength",
label = "length in cm")
Other plot types, such as scatter plots and box plots, can be drawn in the same way.
Using R and Python Libraries in Julia
Julia is a powerful language with many libraries and packages of its own, and it also lets you use libraries from other languages.
You may wonder why, if Julia has such powerful libraries, we would need libraries from other languages such as Python and R. The reason is that some Julia libraries are still young, so Julia provides ways to call mature libraries from R and Python.
To call Python libraries from Julia code, use the PyCall package:
julia> Pkg.add("PyCall")
PyCall provides a lot of functionality for working with Python objects in Julia through the type PyObject.
These are the steps to call a Python package:
Step1: using Pkg
Step2: Pkg.add("PyCall")
Step3: using PyCall
Step4: python_library = pyimport("python_library_name")
Let’s see a basic program that imports Python’s math package into Julia:
using Pkg
Pkg.add("PyCall")
using PyCall
math = pyimport("math")
print(math.cos(90)) # the argument is in radians
A second example imports the NumPy package into Julia:
using Pkg
Pkg.add("PyCall")
using PyCall
numpy = pyimport("numpy")
A = numpy.array([2, 1, 4, 3,
5, 7, 6, 8])
print(A)
Output:
[2, 1, 4, 3, 5, 7, 6, 8]
Using Pandas With Julia
If you are familiar with the pandas library in Python, using it from Julia is much the same. With Pandas we can filter and analyze data in many ways, in particular by loading it into the DataFrame type that the library provides.
A DataFrame presents the data as a two-dimensional table of rows and columns.
julia> Pkg.add("Pandas")
Let’s see an example using Pandas with Julia:
using Pandas
df = read_csv("job.csv")
# or build the frame by hand; every column needs the same number of rows
df = DataFrame(Dict(:company => ["Google", "Apple", "Microsoft", "Google"],
:job => ["sales executive", "business manager", "business manager", "computer manager"],
:degree => ["bachelors", "masters", "bachelors", "masters"],
:salary => [0, 1, 0, 1]))
typeof(df)
head(df) # gives the first five rows of data
describe(df)
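# relabel "computer manager" as "manager" (assumes Pandas.jl exposes pandas' replace method under the same name)
df = replace(df, "computer manager", "manager")
# average of the salary column
mean(df[:salary])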
These are some of the basic pandas operations used to clean a dataset.
Cleaning includes removing null values, replacing missing values and correcting data that is inappropriate.
Pandas is a powerful library that is available in Julia as well as in Python.
Introduction To DataFrames.jl
Julia has a library, DataFrames.jl, that handles data transformation much as pandas does in Python and data frames do in R. The approach looks similar to Python or R, but the API differs. For more complex manipulation of data tables, the DataFramesMeta package can be used.
Let’s see how to install and import the library:
- To install the library, use the command Pkg.add("DataFrames")
- To load the library, use the command using DataFrames
After these steps, the next thing is to load the dataset. A data table is read as follows:
using CSV, DataFrames
datatable = CSV.read("sample.csv", DataFrame)
Fruits     Sweet  Sour
Apple      80%    10%
Orange     90%    10%
Pineapple  100%   0%
After loading the CSV file, check for missing values. Column types are detected automatically, so if a column has missing values in its first rows the type may be inferred incorrectly and cause errors later. In that case, specify the column type manually when reading the file.
To declare a column that may contain missing values, pass a dictionary like this one through the types keyword of CSV.read:
types = Dict("Florida" => Union{Missing, Int64})
If you want to edit the values of an imported data frame, do not forget to use copycols = true.
- Read the data from an HTTP stream:
using DataFrames, HTTP, CSV
resp = HTTP.request("GET", "https://somesite@domain.com?accesstyep=Download")
df = CSV.read(IOBuffer(String(resp.body)), DataFrame)
- Or create a df from scratch (every column must have the same number of rows):
df = DataFrame(
Color = ["red", "yellow", "orange", "white"],
Shape = ["circle", "rhombus", "vertical", "square"],
Border = ["line", "dotted", "line", "dotted"],
Area = [1.1, 1.2, 1.3, 2.5])
- There are many other possibilities with a df, such as building one from a matrix by turning each column into a vector:
For example:
df = DataFrame([[mat[:, i]...] for i in 1:size(mat, 2)], Symbol.(headerstrs))
With the DataFrames package we can do a lot more with a dataset or data table. A dataset should always be loaded into a data frame first, so that it can be analyzed properly and its null or missing values can be handled.
Get Some Insights of Data
- first(df, n)
- show(df, allrows = true, allcols = true)
- last(df, n)
- describe(df)
- unique(df.fieldName)
- names(df)
- size(df)
- to iterate over each column: for a in eachcol(df)
- to iterate over each row: for a in eachrow(df)
Filter
There are two ways to refer to a set of columns in a data frame: referencing the stored values directly, or copying them into a new object.
- myobject = df[!, [:cFruits]] # references the stored columns without copying
- newobject = df[:, [:cFruits]] # copies the columns into a new object
We can also query data frames. With the Query.jl package, a query looks like this:
dfresult1 = @from i in df begin
    @where i.col1 > 1
    @select {aNewColName = i.col1, i.col3}
    @collect DataFrame
end
dfresult2 = @from i in df begin
    @where i.value != 1 && i.cat1 in ["red", "yellow"]
    @select i
    @collect DataFrame
end
Replace Data
We can replace the values of a column using a dictionary that maps old values to new ones:
df.col1 = map(key -> mydict[key], df.col1)
Columns can be combined element-wise with dot (broadcast) operations, for example df.c = df.a .* df.b for two string columns.
Appending rows: push!(df, [1, 2, 3])
Deleting rows: deleteat!(df, rowIdx) (called deleterows! in older versions of DataFrames.jl)
Change the structure of data or holding object
A data frame also lets us rename, delete or reorder columns and change a column's data type. Type conversion works between compatible data types:
From int to float: df.a = convert(Array{Float32, 1}, df.a)
Sorting: sort!(df, [:col2, :col1], rev = false)
So, DataFrames is a powerful package for data handling. It handles missing values, which are a common source of errors, and lets us split datasets, recombine them and apply statistical operations such as aggregate functions.
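A few of the operations from this section can be combined into one short sketch. It assumes DataFrames.jl 1.x is installed; the table contents, including the "Lemon" row, are invented for illustration:

```julia
using DataFrames

# Hypothetical table, loosely modelled on the fruit example in the text.
df = DataFrame(fruit = ["Apple", "Orange", "Pineapple"],
               sweet = [80, 90, 100],
               sour = [10, 10, 0])

push!(df, ("Lemon", 20, 90))                       # append a row
sort!(df, :sweet, rev = true)                      # sort in place, sweetest first
sweet_fruit = filter(row -> row.sweet >= 80, df)   # keep rows matching a condition

println(size(df))
println(names(df))
println(first(df, 2))
```

Functions ending in ! (push!, sort!) modify the data frame in place, while filter returns a new data frame and leaves df unchanged.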
Visualization in Julia Using Plots.jl
Visualization is another way to explore and analyze data, using various kinds of plots.
Julia can plot data through libraries. Rather than shipping a single built-in plotting library, it lets you use the plotting libraries of your choice in your programs.
To get this functionality we need to install some packages:
julia> Pkg.add("Plots")
julia> Pkg.add("StatsPlots")
julia> Pkg.add("PyPlot")
Plots.jl acts as a common interface to several plotting libraries, so the same code can draw with any of them.
StatsPlots.jl is a supporting package for Plots.jl (formerly named StatPlots.jl).
PyPlot.jl provides Python’s Matplotlib in Julia.
Now let’s draw some plots with the PyPlot back-end and see what they tell us about the data table.
using CSV, DataFrames
train = CSV.read("Venice.csv", DataFrame)
using Plots, StatsPlots
pyplot() # set the back-end to Matplotlib, i.e. matplotlib.pyplot
Plots.histogram(collect(skipmissing(train.ApplicationTax)), bins = 50, xlabel = "ApplicationTax", label = "Frequency") # plot histogram
The histogram shows values spread over a wide range, which is why we use 50 bins or a similar number.
We can also look at box plots to understand the distribution more clearly.
Let’s see another way of visualizing the data:
boxplot(collect(skipmissing(train.ApplicationTax)), xlabel = "ApplicationTax")
The box plot reveals the presence of extreme values. This can be attributed to differences in tax across society. We can also split the data into groups, here using the Education column:
boxplot(train.Education, train.ApplicationTax, label = "ApplicationTax")
In this plot there is no substantial difference in tax between the groups, whether the tax paid is high or low.
Let’s look at other charts, such as line and pie charts, using rain data by year/month:
using CSV, DataFrames
a = CSV.read("sample.csv", DataFrame)
plot(a.month, a.max)
This graph shows which month had the maximum rain.
Next, a scatter chart of the same rain data:
scatter(a.Rain, label = "y1")
This chart shows that the rainfall varies from year to year, increasing as the years go on.
Similarly, let’s look at a pie chart, drawn here with random values:
x = 1:5; y = rand(5); # plotting data
pie(x, y)
The pie chart shows the share of each category, from the largest rainfall through average to the smallest, per year or month.
Histogram Chart
histogram(a.Rain, label = "Rainfall")
The histogram shows that rainfall is unevenly distributed over the year.
Graphs and charts like these are used to visualize data and spot trends.
That completes the basic chart types available in Julia through the Plots library.
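To show what the bins of a histogram actually count, here is a hand-written sketch in plain Julia. It is not how Plots.jl computes its bins; bin_counts is a hypothetical helper, the rainfall values are invented, and the data is assumed to contain at least two distinct values:

```julia
# Count how many values fall into each of `nbins` equal-width bins.
function bin_counts(data, nbins)
    lo, hi = extrema(data)
    width = (hi - lo) / nbins
    counts = zeros(Int, nbins)
    for x in data
        # min(...) places the maximum value in the last bin
        idx = min(floor(Int, (x - lo) / width) + 1, nbins)
        counts[idx] += 1
    end
    return counts
end

rain = [0.0, 1.5, 2.0, 2.5, 7.0, 9.5, 10.0]   # hypothetical rainfall in cm
println(bin_counts(rain, 2))
println(bin_counts(rain, 5))
```

With two bins the gap in the middle of the data is hidden, while five bins reveal an empty bin, which is why the number of bins matters when reading a histogram.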
Data Munging In Julia
While analyzing the data we ran into problems such as missing and null values, and these have to be dealt with before the analysis can proceed. Data munging is the process of converting raw data into a format that can be used for data analysis, which includes handling the missing values in a data table or dataset. It is also known as data wrangling.
It is one of the most important components of data science.
The following packages are required:
RDatasets: this package provides datasets commonly used in the R language. It can be installed as follows:
julia> Pkg.add("RDatasets")
In Python or R we use data frames to hold a dataset in tabular form. In Julia, DataFrames and DataFramesMeta provide that functionality:
julia> Pkg.add("DataFrames")
julia> Pkg.add("DataFramesMeta")
Let’s load the dataset.
It contains the columns:
company
job
degree
salary
The idea behind analyzing this dataset is that an employee with a bachelor’s degree may be promoted or receive a salary increase, with conditions that vary by company.
using RDatasets
sal = dataset("datasets", "sample")
first(sal, 6)
This returns the first rows of the dataset described above.
Using groupby():
The groupby function splits a data frame into groups of rows that share the same value in the given column or columns. A function can then be applied to each group. The groups are numbered starting from index 1.
The syntax is:
groupby(a, :col_names, sort = false, skipmissing = false)
The parameters are:
a: the data frame
:col_names: the column names on which the dataset is split
sort: whether to return the groups in sorted order; false by default
skipmissing: whether to skip groups with missing values in the grouping columns; false by default
using RDatasets
sal = dataset("datasets", "sample")
groupby(sal, :job)
by() function
The by() function performs the split-apply-combine method: it splits the data frame on the given columns, applies a function to each group and combines the results. The syntax is as follows:
by(a, :col_names, function, sort = false)
The parameters:
a: the data frame
col_names: the columns to split on
function: the function applied to each group
sort: whether to return the data frame in sorted order; false by default
Let’s split the data frame and compute the mean and variance of the salary for each combination of job and degree:
using RDatasets
using Statistics
sal = dataset("datasets", "sample")
by(sal, [:job, :degree]) do a
    DataFrame(Mean_of_Salary = mean(a.salary),
              Variance_of_Salary = var(a.salary))
end
Note that by() was removed in DataFrames.jl 1.0. In current versions the same result is obtained with combine(groupby(sal, [:job, :degree]), :salary => mean, :salary => var).
* Mean of Column Salary
aggregate() function
The aggregate function also follows the split-apply-combine method: the data frame is split on the given column and the function is applied to every other column.
aggregate(a, :col_names, function)
The parameters are:
a: the data frame
col_names: the columns to split on
function: the function applied to each column
using RDatasets
sal = dataset("datasets", "sample")
aggregate(sal, :job, length)
Like by(), aggregate() was removed in DataFrames.jl 1.0 in favor of combine() on a grouped data frame.
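For readers on a current release, here is a sketch of the same split-apply-combine idea written with groupby and combine. It assumes DataFrames.jl 1.x; the salary table is invented for illustration and only borrows the column names used in the text:

```julia
using DataFrames, Statistics

# Hypothetical salary table with the column names used in the text.
sal = DataFrame(
    job = ["manager", "manager", "sales", "sales"],
    degree = ["masters", "masters", "bachelors", "bachelors"],
    salary = [100.0, 120.0, 50.0, 70.0],
)

groups = groupby(sal, [:job, :degree])           # split
stats = combine(groups,                          # apply and combine
    :salary => mean => :mean_salary,
    :salary => var => :var_salary)
println(stats)
```

Each source => function => name triple says which column to read, which function to apply to each group, and what to call the resulting column.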
Missing
In Julia, missing values are represented by the special value missing, which is the only instance of the type Missing.
julia> missing
missing
Let’s look at the type of missing:
julia> typeof(missing)
Missing
The Missing type allows users to create vectors and DataFrame columns that contain missing values.
Let’s see an example:
julia> x = [0, 1, missing]
3-element Array{Union{Missing, Int64},1}:
0
1
missing
julia> eltype(x)
Union{Missing, Int64}
julia> Union{Missing, Int}
Union{Missing, Int64}
julia> eltype(x) == Union{Missing, Int}
true
Missing values can be excluded while performing operations by using skipmissing:
julia> skipmissing(x)
Base.SkipMissing{Array{Union{Missing, Int64},1}}(Union{Missing, Int64}[0, 1, missing])
Suppose we want the average of all the non-missing values:
julia> using Statistics
julia> mean(skipmissing(x))
0.5
julia> collect(skipmissing(x))
2-element Array{Int64,1}:
0
1
coalesce is the function used to replace missing values with another value. It is applied element-wise with a dot:
julia> coalesce.(x, 0)
3-element Array{Int64,1}:
0
1
0
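The three behaviours discussed here can be checked side by side in a short sketch that uses only Base and the Statistics standard library:

```julia
using Statistics

x = [0, 1, missing]

# Any aggregation that touches `missing` yields `missing`.
total = sum(x)

# Skip the missing entries before aggregating.
clean_mean = mean(skipmissing(x))

# Replace missing entries with a default value, element by element.
filled = coalesce.(x, 0)

println((total, clean_mean, filled))
```

The fact that sum(x) is itself missing is deliberate: Julia propagates missing values so that they cannot silently distort a result.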
Rows of a data frame may contain missing values as well. We can use dropmissing and dropmissing! to remove those rows.
julia> df = DataFrame(I = 1:5,
P = [missing, 3, missing, 2, 1],
Q = [missing, missing, "c", "d", "e"])
5×3 DataFrame
Row | I      P        Q
    | Int64  Int64?   String?
------------------------------
1   | 1      missing  missing
2   | 2      3        missing
3   | 3      missing  c
4   | 4      2        d
5   | 5      1        e
julia> dropmissing(df)
2×3 DataFrame
Row | I      P      Q
    | Int64  Int64  String
---------------------------
1   | 4      2      d
2   | 5      1      e
One more point: the Missings.jl package provides a few more functions for working with missing values.
julia> using Missings
julia> Missings.replace(x, 1)
Missings.EachReplaceMissing{Array{Union{Missing, Int64},1},Int64}(Union{Missing, Int64}[0, 1, missing], 1)
These are some of the basic functions used to handle data during analysis, mainly to remove or replace null and missing values in the dataset. This is what data munging is about.
Building a Predictive ML Model
So far we have seen how a dataset should be handled, how to deal with problems such as missing or null values, and how to visualize the data with the Plots and StatsPlots libraries.
Now we will see how to build a machine learning model in Julia.
In Python, scikit-learn is the library that provides all the necessary models. In Julia, the ScikitLearn.jl package does the same.
julia> Pkg.add("ScikitLearn")
This package acts as an interface to Python’s scikit-learn package, since Julia can access Python packages.
Label Encoder
In Python, LabelEncoder is a class in sklearn.preprocessing that converts categorical data into numerical labels [0, 1, 2, ...].
In Julia we convert the data into numerical form in the same way. Those familiar with Python will know why a label encoder is used: models need numerical input, and columns with numerical values are easier to work with.
Let’s encode some sample columns:
using ScikitLearn
@sk_import preprocessing: LabelEncoder
encoder = LabelEncoder()
columns = ["apple", "orange", "papaya"] # names of the categorical columns in train
for col in columns
    train[!, col] = fit_transform!(encoder, train[!, col])
end
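To see what label encoding does without installing anything, here is a hand-written sketch in plain Julia. It is not the scikit-learn implementation; encode_labels is a hypothetical helper that mimics the usual behaviour of sorting the distinct labels and numbering them from 0:

```julia
# Map each distinct label to an integer code, numbering the sorted labels from 0.
function encode_labels(labels)
    classes = sort(unique(labels))
    codes = Dict(c => i - 1 for (i, c) in enumerate(classes))
    return [codes[l] for l in labels], classes
end

encoded, classes = encode_labels(["papaya", "apple", "orange", "apple"])
println(encoded)
println(classes)
```

Returning the sorted classes alongside the codes makes it possible to translate predictions back into the original labels later.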
Now we will define a generic classification function that takes a model as input and reports the accuracy and cross-validation scores.
using ScikitLearn: fit!, predict, @sk_import, fit_transform!
using Statistics: mean
@sk_import preprocessing: LabelEncoder
@sk_import model_selection: cross_val_score
@sk_import metrics: accuracy_score
@sk_import linear_model: LogisticRegression
@sk_import ensemble: RandomForestClassifier
@sk_import tree: DecisionTreeClassifier
function classification_Model(model, predictors)
    y = convert(Array, train[:, 13]) # target: column 13 of the training set
    X = Matrix(train[:, predictors]) # training features
    X_test = Matrix(test[:, predictors]) # test features
    # fit the model
    fit!(model, X, y)
    # predictions on the training data set
    predictions = predict(model, X)
    # accuracy
    accuracy = accuracy_score(predictions, y)
    println("accuracy: ", accuracy)
    # cross-validation
    cross_score = cross_val_score(model, X, y, cv = 5)
    # print cross score
    println("cross score: ", mean(cross_score))
    fit!(model, X, y)
    out = predict(model, X_test)
    return out
end
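The two scores reported by the function can be illustrated without any library. The sketch below is a simplification: accuracy is a one-line helper, and kfold_indices deals row indices into folds round-robin, which is an assumption made for brevity and not how scikit-learn forms its folds:

```julia
# Fraction of predictions that match the true labels.
accuracy(predicted, actual) = count(predicted .== actual) / length(actual)

# Split the row indices 1:n into k folds for cross-validation.
function kfold_indices(n, k)
    folds = [Int[] for _ in 1:k]
    for i in 1:n
        push!(folds[mod1(i, k)], i)   # deal indices round-robin into the folds
    end
    return folds
end

println(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))
println(kfold_indices(10, 5))
```

In cross-validation each fold is held out once for testing while the model is trained on the remaining folds, and the scores from the k rounds are averaged.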
Logistic Regression
Using logistic regression, we are going to calculate the accuracy and cross-validation scores with the classification_Model function defined above.
LogisticRegression is used in Julia much as it is in Python. Logistic regression is a classification algorithm that predicts the probability of a categorical dependent variable. In the binary case the dependent variable takes the value 0 or 1.
Logistic regression can be divided into two types:
- Binary Classification
- Multiclass Classification
The logistic regression curve is S-shaped (a sigmoid).
Mathematical equation for logistic regression: σ(z) = 1 / (1 + e^(-z))
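The logistic equation is easy to try directly. In this sketch, sigmoid implements the formula and to_class applies a 0.5 threshold; both names and the threshold are assumptions made for illustration:

```julia
# Logistic (sigmoid) function: maps any real number into the interval (0, 1).
sigmoid(z) = 1 / (1 + exp(-z))

# Turn the probability into a class label using a 0.5 threshold.
to_class(z) = sigmoid(z) >= 0.5 ? 1 : 0

println(sigmoid(0))
println(to_class(-2), " ", to_class(3))
```

An input of 0 sits exactly on the decision boundary, where the predicted probability is one half.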
Let’s use the model to determine the accuracy of predicting a person’s obesity. This snippet continues from the code above:
model = LogisticRegression()
predict_value = [:Obesity]
classification_Model(model, predict_value)
The result will be:
Accuracy: 80.9% Cross-Validation Score: 80%
The accuracy and cross-validation score are good, but if you need more accuracy, change the columns or variables and apply the model again.
predict_value = [:Obesity, :Age, :Weight]
classification_Model(model, predict_value)
The result will be:
Accuracy: 88% Cross-Validation-Score: 87.9%
This is how logistic regression classifies. It is generally applied to problems where the outcome is a category rather than a continuous quantity.
Decision Tree
Decision Tree is another classification model. A decision tree works on a parent-child structure: the root (parent) node is where decisions start, and the final child (leaf) nodes hold the results. The working process of a decision tree is:
- The decision tree selects the best attribute using an Attribute Selection Measure
- The selected attribute is taken as the root node
- The data is then divided again into sub-nodes until a leaf node is reached
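The mathematical formulae used in a decision tree are:
- Entropy: $H(S) = -\frac{p}{s}\log_2\frac{p}{s} - \frac{n}{s}\log_2\frac{n}{s}$, where p and n are the numbers of positive and negative samples and s = p + n
- Information Gain: $IG(S, A) = H(S) - \sum_{v} \frac{|S_v|}{|S|} H(S_v)$, where $S_v$ is the subset of S for which attribute A takes the value v
- Gini Index: $Gini(S) = 1 - \sum_{i} p_i^2$, where $p_i$ is the proportion of samples in class i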
Information Gain:
Information gain tells us how important an attribute is to the data set. It measures how much splitting on that attribute reduces impurity, which helps identify the relationship between parent and child nodes.
Entropy
Entropy measures the impurity of a data set. Information gain describes how much a split on an attribute reduces that impurity, so the more a split lowers entropy, the higher its information gain.
Consider two classes, A and B. If all samples belong to the same category x, the entropy is 0 and the node is pure. If the samples are split 50-50 between the two classes, the entropy reaches its maximum value of 1.
Gini Index
The Gini index also measures impurity: it is the probability that a randomly selected sample would be misclassified if it were labelled according to the class distribution of the node. If all samples belong to the same class, the node is pure and the Gini index is 0.
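The three measures described above can be sketched in a few lines of plain Julia, using their standard textbook definitions. The function names and the toy labels are assumptions for illustration; they do not come from a decision tree package.

```julia
# Proportion of each class among the labels of a node.
function class_proportions(labels)
    n = length(labels)
    return [count(==(c), labels) / n for c in unique(labels)]
end

# Entropy in bits: 0 for a pure node, 1 for a 50/50 two-class node.
entropy(labels) = -sum(p * log2(p) for p in class_proportions(labels))

# Gini index: 0 for a pure node, 0.5 for a 50/50 two-class node.
gini(labels) = 1 - sum(p^2 for p in class_proportions(labels))

# Information gain: parent entropy minus the size-weighted child entropy.
function information_gain(parent, children)
    n = length(parent)
    weighted = sum(length(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted
end

# Hypothetical node with two classes, split perfectly by some attribute.
parent = ["yes", "yes", "no", "no"]
children = [["yes", "yes"], ["no", "no"]]

println("entropy of parent: ", entropy(parent))
println("gini of parent: ", gini(parent))
println("information gain of split: ", information_gain(parent, children))
```

A split that separates the classes completely, as in this toy example, removes all of the parent's entropy, so it has the largest possible information gain for that node.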
A decision tree can give higher accuracy than logistic regression on some data sets, since it follows the parent-child concept and makes an explicit decision at each node.
Let's see the implementation of a decision tree by considering an example.
We are going to calculate the results, i.e. the accuracy and cross-validation score, for students using the decision tree classifier algorithm. The attributes for a student are Name and Age.
Consider that the Name and Age columns hold about 10 rows of random data, and that we apply the decision tree classifier algorithm to obtain the best accuracy and cross-validation score.
model = DecisionTreeClassifier()
predict_value = [:Student, :Name, :Age]
classification_Model(model, predict_value)
The result will be:
Accuracy: 81.95% Cross-Validation Score: 75.6%
We can increase the accuracy further by changing the input columns until the maximum accuracy is obtained.
“Always find maximum accuracy and score”
predict_value = [:Student, :Name, :Class, :Age]
classification_Model(model, predict_value)
The result will be:
Accuracy: 85.78% Cross-Validation Score: 80.7%
Random Forest
Random Forest is another algorithm, capable of performing both regression and classification tasks. It relies on a technique that combines “Bootstrap” and “Aggregation”, known as bagging.
A random forest uses multiple decision trees as its learning models. Each tree is trained on a random sample of rows and features from the data set; this is called bootstrapping. The predictions of the individual trees are then combined, which is the aggregation step.
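The two halves of bagging can be sketched with the Julia standard library alone. This is a simplified illustration, not how any particular random forest package is implemented: the helper names, the seed, and the five tree votes are assumptions, and features are sampled once per tree here for brevity.

```julia
using Random

# Bootstrap: draw n row indices from 1:n with replacement.
bootstrap_indices(rng, n) = rand(rng, 1:n, n)

# Feature sampling: choose m distinct feature indices at random.
feature_subset(rng, n_features, m) = randperm(rng, n_features)[1:m]

# Aggregation: combine the trees' votes for one sample by majority.
function majority_vote(votes)
    candidates = unique(votes)
    counts = [count(==(c), votes) for c in candidates]
    return candidates[argmax(counts)]
end

rng = MersenneTwister(42)          # fixed seed so the run is repeatable
rows = bootstrap_indices(rng, 10)  # rows one tree would be trained on
cols = feature_subset(rng, 4, 2)   # features that tree would consider

tree_votes = [1, 0, 1, 1, 0]       # hypothetical votes from five trees
println("bootstrap rows: ", rows)
println("feature subset: ", cols)
println("forest prediction: ", majority_vote(tree_votes))
```

Because rows are drawn with replacement, some rows appear more than once in a tree's sample and others not at all, which is what makes the trees differ from one another.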
Let's see the process involved in using the random forest algorithm:
- Frame a relevant question for the given information or data set
- Make sure all the data is in an accessible format, or convert it into one
- Develop a machine learning model
- Split the data set into training data and test data
- Apply the model and find the accuracy or score on the test data
- Repeatedly change the values so that the accuracy reaches its maximum
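The step of splitting the data into training and test sets can be sketched as follows. The function name train_test_split, the 30% test fraction and the seed are assumptions chosen for illustration.

```julia
using Random

# Shuffle the row indices, then hold out a fraction of them for testing.
function train_test_split(rng, n_rows; test_fraction = 0.3)
    shuffled = shuffle(rng, 1:n_rows)
    n_test = round(Int, test_fraction * n_rows)
    test_idx = shuffled[1:n_test]
    train_idx = shuffled[n_test+1:end]
    return train_idx, test_idx
end

rng = MersenneTwister(1)  # fixed seed so the split is repeatable
train_idx, test_idx = train_test_split(rng, 10)

println("training rows: ", train_idx)
println("test rows: ", test_idx)
```

Indexing a data set with train_idx and test_idx gives two disjoint sets of rows, so the score on the test rows reflects data the model did not see during fitting.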
Let's see the implementation of Random Forest.
We are going to calculate the results, i.e. the accuracy and cross-validation score, for bank customers using the RandomForestClassifier algorithm, in order to segregate customers based on loan status. The attributes for a customer are Name, Age, Sex and Loan.
Consider that the Name, Age, Sex and Loan columns hold n rows of random data, and that we apply the RandomForestClassifier algorithm to obtain the best accuracy and cross-validation score.
model = RandomForestClassifier(n_estimators = 100)
predictions = [:Name, :Age, :Sex, :Loan]
classification_Model(model, predictions)
Accuracy : 100% Cross-Validation Score : 80%
Here, we got 100% accuracy on the training data set. This is the problem of overfitting, and it can be resolved in two ways:
- Reducing the number of predictors
- Tuning the model parameters
model = RandomForestClassifier(n_estimators = 100, min_samples_split = 50, max_depth = 20, n_jobs = 1)
classification_Model(model, predictions)
The result will be:
Accuracy : 83% Cross-Validation Score : 80%
Here, even though the accuracy is reduced, the cross-validation score holds at 80%, which suggests the model is no longer overfitting and is doing well. Random forest uses multiple decision trees, each of which may give a different prediction.
As far as possible, avoid using a complex modelling technique as a black box without understanding the underlying concepts.
Using ggplot2 in Julia
ggplot2 is a data visualization package for the statistical programming language R. ggplot2 breaks a plot into semantic components such as scales and layers.
Since Julia can access Python and R libraries, ggplot2 can be installed and used from Julia.
Let's see how to load the R package into Julia:
using RCall
@rlibrary ggplot2
A question may arise: when Julia is so powerful and comes with its own packages, why use R packages for data visualization?
Julia's own plotting packages are powerful, but a user who wants to visualize a plot has many commands to remember, which can make them difficult to use.
That is why R packages, and Python libraries too, are used from Julia for data visualization.
Let's consider an example of this scenario.
Using a Julia plotting package (the syntax shown is Gadfly's):
plot(plot_data_1, x = "a", y = "b", Geom.line,
layer(Geom.line, x = "a", y = "text", Theme(default_color = "red")),
layer(Geom.line, x = "a", y = "a_mc", Theme(default_color = "blue")),
layer(Geom.line, x = "a", y = "a_mf", Theme(default_color = "orange"))
)
Using the R ggplot2 package (called from Julia after @rlibrary ggplot2):
ggplot(plot_data_1, aes(x = :a, y = :b)) +
geom_line(color = "red") +
geom_line(aes(y = :a_mc), color = "green") +
geom_line(aes(y = :a_mf), color = "violet")
If you compare the two pieces of code, the ggplot2 version is simpler than the Julia plotting package version. A user is less likely to get frustrated with the R package, as it is simpler than the Julia package.
The Julia code above might have some issues, since Gadfly does not strictly follow the grammar of graphics in aspects such as font size, data visualization pattern and line color pattern.
Considering all this, we can say that Julia's plotting packages are a bit more complex than those of R or Python. R packages offer good interoperability, and difficult problems can be solved easily with them.
The packages needed to use ggplot2 from Julia are installed as follows:
julia> using Pkg
julia> Pkg.add("RDatasets")
julia> Pkg.add("RCall")
Let's look at a plot visualized using the ggplot2 library:
using RCall, RDatasets
demo = dataset("datasets", "demo")
R"library(ggplot2)"
R"ggplot($demo, aes(x = ASD, y = `AOSI Total Score(Month 12)`)) + geom_point()"
Concluding Thoughts
Finally, Julia is a powerful language that provides access to Python and R packages through PyCall and RCall. Julia's design and syntax compare well with Python's, particularly when writing highly functional code.
We can say that Julia is a strong programming language. One strong reason is that it is well suited to numerical computation.
“Technology Never Stops instead it flows like Water”