Machine Learning Tools and Toolkits in The Explora
2 authors, including:
Afreen Khan
Aligarh Muslim University
All content following this page was uploaded by Afreen Khan on 04 September 2019.
Keywords: Big data, Application Programming Interface (API), Command Line Interface (CLI), Graphical User Interface (GUI), Machine learning, Tool, Toolkits, Platform, Library, Interface
budding realm of ML in the age of AI and Big Data Analytics. The paper is organized as follows: The next section describes the state of Machine Learning as an indispensable technique. Section 3 discusses how to evaluate an ML toolkit. Section 4 presents the various ML tools. Section 5 introduces the comparison of ML tools, followed by a conclusion in Section 6.

II. MACHINE LEARNING: AN INDISPENSABLE TECHNIQUE

Owing to the pervasiveness of data and the scalability of cloud computing power, there has been a massive advancement in the use of AI and ML. ML has grown in significance through its capacity to filter through large datasets, explore and analyze them, interpret them with the aim of discovering effective patterns and, lastly, construct useful predictions based on the results attained.

Intel CEO Brian Krzanich stated in an interview that “Data is the new Oil”; accordingly, it can be said that Machine Learning, a subset of AI, is fuelled by data. It is built on a modelling scheme that not only analyzes data but also learns (gets trained) and improves (gets better) by using different algorithms that provide new and innovative insights [5].

When it comes to Big Data (BD), ML is the best choice for tackling the difficult problems it poses. In a report, Gartner defined Big Data as “high volume, high velocity, and/or high-variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization” [6]. ML and BD are related in that ML is centered on diverse algorithms that learn from data without depending on rules-based programming, while BD is the kind of data loaded into such a system, where analytical practices are carried out, consequently improving the precision of the predictions that the ML model is trained for [7]. Big Data analytics aids in acquiring insights, and hence better decisions are taken as modifications in modelling are applied [8].

In a report published in 2018, Gartner mentioned that “almost half of CIOs are preparing to implement AI in their respective organizations” [9]. The main purpose of this is to obtain insights from the collected data so as to understand the respective model used and to analyse its behaviour in order to make improved decisions [5]. However, success, in the form of enhanced results and effective predictions, is achieved only when the data is used properly, and it depends on how good an understanding one gains from it [5]. ML and statistical models differ in that the aim of ML is to learn the structure of the data when one has no theory of what that structure looks like; well-understood theoretical distributions are then fitted to the data. In a statistical model, by contrast, every model has a theory behind it which is scientifically proven, but the data must meet some strong assumptions [10]. Furthermore, the test for an ML model is a validation error computed on new data, whereas a theoretical test of the kind used to prove a null hypothesis is never carried out [10]. ML uses an iterative methodology to learn and discover from data until a strong pattern is obtained, and this acquisition of knowledge is easily automated.

The defining feature of an ML model is the ability to independently learn and progress as new data is loaded into the system and to convert this data into actionable knowledge. The chief notion of the entire ML concept is the capability of a model to automatically apply sophisticated mathematical computations to Big Data repeatedly until a most probable solution is achieved [10].

III. HOW TO EVALUATE AN ML TOOLKIT

The first thing to consider when ML is to be adopted is the tool on which the work will be performed. Each step in the ML process can be automated if the right tools are used. The importance of selecting the correct tool shows later, when correct predictions and improved results are achieved; it is therefore as essential as working with the finest algorithms [12]. The toolkits on offer differ considerably, so it is necessary to strike a balance between keeping up with the latest developments and the reliability and stability of a project [11]. These tools not only provide implementations of ML algorithms but also offer support at every step while the tasks are being executed, and can be used throughout the ML challenge.

ML advances well only when critical decisions need to be taken, which in turn are built on assumptions generated from the analysis of data [13]. Thus, there is no single criterion for deciding the best toolkit for ML. Each toolkit is developed to focus on the needs as observed by its developer. Presently, there are numerous ML toolkits available, and how to evaluate a specific tool is an essential issue if the problem statement is to be dealt with in the most practicable way. The following criteria are usually used to assess any tool:

1. Language: When developing ML models and writing ML algorithms, the programming language in which the toolkit is written influences the entire modelling. The choice of language depends on the comfort level (i.e. the ability to use it efficiently), the type of problem to be solved, the quantity of data to be processed, and the developer’s expertise and past experience [14]. Factors to consider before selecting a language include speed, concurrency, performance, cost effectiveness, learning curve (i.e. functional or procedural), application development, and community support [15].

2. Type: This covers the categories into which the toolkits are divided: platform, library, interface, and local or remote tool.

3. Documentation: The documentation of a toolkit plays an important role in deciding which one to choose and which to avoid. If a toolkit is well documented in terms of quality and covers a large number of examples resembling the problems one works on, it is easier to build a solution to the particular problem.

4. Integrated Development Environment (IDE): The IDE used for ML is as important as the ML techniques used to solve predictive modeling challenges. Certain toolkits have a graphical IDE, and others include a command line and editor IDE.

neural networks, and more to the list. As more and more computation power is required, interest in exploiting ML is rising alongside the growing access to heaps of data, so as to gain action-driven advantages. Likewise, there exist plenty of tools in ML for when machines need to be trained to work without being explicitly programmed [13]. The nucleus of the ML system consists of the ML storage cluster and its computation power, which usually differs based on the learning method used, its application and the need to automate it [5].

An ML tool can be a platform, a library, an interface, or a local or remote tool. These toolkits enable developers to create ML models more quickly and easily without stepping into the details of the core algorithms, thereby providing a well-defined and concise approach for classifying ML models by applying a set of pre-built and improved modules [16]. The tools can be divided into four branches, as depicted in Figure 1 below.
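The evaluation criteria listed in Section III (language, type, documentation and IDE) and the four tool categories can be sketched as a small data structure with a weighted score. This is only an illustrative sketch and not a method proposed in the paper: the toolkit names, the 1 to 5 documentation ratings, the weights and the `score` function are all hypothetical assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class ToolType(Enum):
    """The four categories of ML tool named in the text."""
    PLATFORM = "platform"
    LIBRARY = "library"
    INTERFACE = "interface"
    LOCAL_OR_REMOTE = "local or remote tool"


@dataclass(frozen=True)
class Toolkit:
    name: str             # hypothetical toolkit name
    tool_type: ToolType   # criterion 2: type
    languages: tuple      # criterion 1: languages the toolkit can be used from
    documentation: int    # criterion 3: hypothetical rating from 1 (poor) to 5 (good)
    graphical_ide: bool   # criterion 4: True for a graphical IDE, False for command line/editor


def score(toolkit, language, wanted_type, want_graphical_ide, weights):
    """Weighted sum of how well a toolkit fits each criterion (each fit is in [0, 1])."""
    fits = {
        "language": 1.0 if language in toolkit.languages else 0.0,
        "type": 1.0 if toolkit.tool_type is wanted_type else 0.0,
        "documentation": toolkit.documentation / 5,
        "ide": 1.0 if toolkit.graphical_ide == want_graphical_ide else 0.0,
    }
    return sum(weights[name] * fit for name, fit in fits.items())


# Hypothetical candidates and weights, chosen only for illustration.
candidates = [
    Toolkit("toolkit_a", ToolType.LIBRARY, ("Python",), 4, False),
    Toolkit("toolkit_b", ToolType.PLATFORM, ("Java",), 5, True),
]
weights = {"language": 0.4, "type": 0.3, "documentation": 0.2, "ide": 0.1}

# A project that needs a Python library and prefers a command line/editor IDE.
best = max(
    candidates,
    key=lambda t: score(t, "Python", ToolType.LIBRARY, False, weights),
)
print(best.name)
```

The weights make the trade-off explicit: a researcher who values documentation over language familiarity would simply change the weight table rather than the scoring logic.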
- They provide the complete facilities needed at every step of ML project development.
- The ML platform interface may include an API, CLI or GUI, or a combination of these, while programming.
- They are used for general-purpose modelling, instead of focusing on accuracy, scalability and speed.
- Features are loosely coupled; it is therefore the task of the user to assemble all the components for the particular project.

2. ML Library: An ML library contains capabilities for completing only a fragment, i.e. one or more steps, of an ML project [12]. It is used to address predictive use-cases. It includes facilities like documentation, configuration and help data, pre-written subroutines and code, and message templates [18]. An ML library also provides modeling algorithms suitable for particular use-cases, each having its own pros and cons. When determining which ML library to apply, several features need to be considered: programming language, performance and hardware features, and ML algorithm [19]. Certain characteristics of the ML library are:
- The ML library interface is usually an API, which involves programming.
- They are designed for a particular use-case or environment.

3. ML Interfaces: The ML interface is another ML tool and is further divided into three parts, namely ML API, ML CLI and ML GUI.

i. ML API (Application Programming Interface): Machine learning tools support an API, which provides the ability to decide what components to work with and how to apply them in ML programs [18]. The characteristics of ML API are as follows:
- It provides the capability of developing our own ML tools.
- An ML API tool can be used to build our own processes, which can then be applied to ML projects so as to automate them in an improved way.
- It gives the flexibility to develop our own methods and merge them with the existing libraries and methods.

ii. ML CLI (Command Line Interface): Machine learning tools provide a CLI environment that focuses on input and output, i.e. it structures ML tasks in terms of the required input and the output to be produced [12]. In addition, it comprises command line parameterization and command line programs. The characteristics of ML CLI are as follows:
- It provides an environment where non-programmers can perform their tasks through ML projects.
- It has the ability to reproduce the results by storing the commands and command line arguments.
- It supports several small programs and many program genres for certain subtasks of an ML project.

iii. ML GUI (Graphical User Interface): Machine learning tools support a GUI, which mainly focuses on the graphical representation of data, i.e. visualization, and also consists of windows and point-and-click controls [18]. The characteristics of ML GUI are as follows:
- For users who are not experts in programming, ML GUI provides an environment where they can complete their tasks easily through ML.
- The chief emphasis of ML GUI is on the process and on how maximum information can be extracted from the ML tools and techniques.

4. A. ML Local Tool: A machine learning local tool is a tool that can be downloaded, installed, and run in the local environment. The characteristics of ML Local tool are as follows:
- It is built for main-memory data and its algorithms.
- This ML tool can be incorporated into our own ML machines so as to model it according to our needs.
- It provides control over the parameters so as to devise predictions on newer data, thereby supporting the run configuration of the system.

4. B. ML Remote Tool: A machine learning remote tool is a tool that runs on a third-party server. It is established on a server, and operations are carried out from the local environment by calling it remotely. Thus, ML Remote tools are called Machine Learning as a Service (MLaaS) [12]. The characteristics of ML Remote tool are as follows:
- These ML tools can handle large datasets even when the data scales up rapidly.
- It provides a setup where the processes can run across multiple machines and numerous cores while sharing memory.
- Because these tools run remotely at scale, they support fewer ML algorithms, since complex modifications are needed.
- It can be incorporated within our local environments through RPCs (Remote Procedure Calls).

V. ILLUSTRATION OF TOOLKITS

The tools described above are illustrated in the following Table 1.
(DMTK) largest and fastest topic model and the biggest word-embedding model around the globe [38].
Microsoft Azure ML | Java | It is a strong cloud-based tool used in analytics that allows predictive management [39].
MLlib for Spark | Usable in Java, Scala, Python, and R. | It consists of a good set of ML algorithms that leverage iteration and produce improved results. It supports feature transformation, development of ML pipelines, hyper-parameter tuning and model evaluation [40].
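The contrast between the ML API and ML CLI interface styles described in Section IV can be illustrated with a short sketch that uses only the Python standard library. It is not taken from any of the toolkits in Table 1: the `fitline` program name, its flags, and the functions `fit_line`, `predict` and `run_cli` are hypothetical, and a one-variable least-squares fit stands in for a real ML algorithm. The same fitting code is called once through a programming interface and once through a command line that structures the task as input and output and stores the command so the result can be reproduced.

```python
import argparse
import shlex


def fit_line(xs, ys):
    """API entry point: least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """API entry point: apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


def build_parser():
    """CLI definition: the task is described purely by its inputs and output."""
    parser = argparse.ArgumentParser(prog="fitline")
    parser.add_argument("--x", type=float, nargs="+", required=True)
    parser.add_argument("--y", type=float, nargs="+", required=True)
    parser.add_argument("--predict-at", type=float, required=True)
    return parser


def run_cli(argv, history):
    """Run one CLI invocation and record the command so it can be replayed."""
    args = build_parser().parse_args(argv)
    history.append(shlex.join(["fitline", *argv]))
    return predict(fit_line(args.x, args.y), args.predict_at)


# CLI style: parameters arrive as command line arguments.
history = []
argv = ["--x", "1", "2", "3", "--y", "2", "4", "6", "--predict-at", "4"]
cli_result = run_cli(argv, history)

# API style: the caller composes the same components directly in code.
api_result = predict(fit_line([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]), 4.0)

# Reproducing the result from the stored command line.
replayed = run_cli(shlex.split(history[0])[1:], [])
print(cli_result, api_result, replayed)
```

The API route leaves the choice and ordering of components to the programmer, while the CLI route fixes them and exposes only parameters, which is what makes the stored command sufficient to repeat the run.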
VI. CONCLUSION

Machine learning is a complicated field, and its growth curve is rising at an elevated speed as we move forward and become technologically stronger. As an ML algorithm is quite difficult to write from scratch, a machine learning toolkit provides a tremendous amount of resources that can be used according to the problem statement to solve any challenge. In this paper, we have illustrated many tools that can be used for applying ML techniques. The best toolkit is selected on the basis of the skills, background and use-case of a researcher. The type of project and the available resources also play an important role in the selection of a tool. Therefore, when a project is started, a certain amount of time should be spent assessing existing toolkits so as to be confident that the chosen toolkit is the best for the situation.

REFERENCES

[1] https://www.simplilearn.com/what-is-machine-learning-and-why-it-matters-article
[2] https://dzone.com/articles/5-open-source-machine-learning-frameworks-and-tool
[3] https://www.forbes.com/sites/ciocentral/2018/02/28/gartner-magic-quadrant-whos-winning-in-the-data-machine-learning-space/
[4] J. V. N. Lakshmi and A. Sheshasaayee, “A Big Data Analytical Approach for Analyzing Temperature Dataset using Machine Learning Techniques,” Int. J. Sci. Res. Comput. Sci. Eng., vol. 5, no. 3, pp. 92–97, 2017.
[5] C. E. Sapp, “Preparing and Architecting for Machine Learning,” 2017.
[6] https://www.gartner.com/it-glossary/big-data/
[7] https://www.quora.com/How-are-big-data-and-machine-learning-related
[8] Rakesh S. Shirsath, Vaibhav A. Desale, Amol D. Potgantwar, “Big Data Analytical Architecture for Real-Time Applications,” International Journal of Scientific Research in Network Security and Communication, vol. 5, no. 4, pp. 1-8, 2017.
[9] https://www.forbes.com/sites/ciocentral/2018/02/28/gartner-magic-quadrant-whos-winning-in-the-data-machine-learning-space/#3995d9407dab
[10] https://www.sas.com/en_us/insights/analytics/machine-learning.html
[11] https://towardsdatascience.com/gui-fying-the-machine-learning-workflow-towards-rapid-discovery-of-viable-pipelines-cab2552c909f
[12] https://machinelearningmastery.com/machine-learning-tools/
[13] https://knowm.org/machine-learning-tools-an-overview/
[14] https://blogs.opentext.com/choosing-the-right-programming-language-for-machine-learning-algorithms-with-apache-spark/amp/
[15] https://medium.com/@UdacityINDIA/machine-learning-programming-languages-why-is-the-best-and-why-56f9f370cb99
[16] https://www.analyticsindiamag.com/machine-learning-framework-10-need-know/
[17] https://searchenterpriseai.techtarget.com/feature/How-to-make-a-wise-machine-learning-platforms-comparison
[18] V. Vinothina, “MACHINE LEARNING TOOLS-AN OVERVIEW,” in International Conference on Recent Trends in Engineering Science, Humanities and Management, 2017, pp. 629–637.
[19] https://www.oreilly.com/ideas/square-off-machine-learning-libraries
[20] https://machinelearningmastery.com/tour-weka-machine-learning-workbench/
[21] https://bookdown.org/rdpeng/rprogdatascience/history-and-overview-of-r.html
[22] https://en.wikipedia.org/wiki/SciPy
[23] https://github.com/scikit-learn/scikit-learn
[24] https://github.com/EdwardRaff/JSAT
[25] http://accord-framework.net/intro.html
[26] Pylearn2 Documentation Release dev, LISA lab, University of Montreal, 2015.
[27] https://www.csie.ntu.edu.tw/~cjlin/libsvm/
[28] Thomas A. Henzinger, Anmol V. Singh, Vasu Singh, Thomas Wies, Damien Zufferey, “Static Scheduling in clouds”
[29] Mike Gashler, “Waffles: A Machine Learning Toolkit,” Journal of Machine Learning Research, 12 (2011), 2383-2387.
[30] G. Holmes, A. Donkin, I. H. Witten, “WEKA: a machine learning workbench,” Proceedings of Second Australian and New Zealand conferences on Intelligent Information System, 1994.
[31] https://www.predictiveanalyticstoday.com/knime/
[32] https://rapidminer.com/products/studio/feature-list/
[33] https://orange.biolab.si/#Orange-Features
[34] http://126kr.com/article/yucgkiovd
[35] Sören Sonnenburg et al., “The SHOGUN Machine Learning Toolbox,” Journal of Machine Learning Research, 11 (2010), 1799-1802.
[36] https://mahout.apache.org/docs/latest/index.html
[37] http://cloudacademy.com/blog/aws-machine-learning/
[38] https://www.microsoft.com/en-us/research/blog/microsoft-open-sources-distributed-machine-learning-toolkit-for-more-efficient-big-data-research/
[39] https://www.predictiveanalyticstoday.com/microsoft-azure-machine-learning/
[40] https://spark.apache.org/mllib/