Python Libraries that Enable Capabilities to Distribute and Parallelize ML Tasks

Image by THAM YUAN YUAN from Pixabay

Modern neural network models are deep and complex, with a very large number of weights to learn, which makes training them challenging. Data scientists need to set up distributed training, checkpointing, and so on, and even then they may not reach the desired performance or convergence rate. Large models are harder still, because they can easily run out of memory during training.

In this article, we will look at a list of Python frameworks that let us distribute and parallelize deep learning models.

1. Elephas

Elephas is an extension of Keras, which allows you to run distributed deep learning models at scale…


These Jupyter Notebook Extensions Make a Data Scientist's Life Easier

Image by Bessi from Pixabay

Data scientists spend most of their time on data visualization, preprocessing, and model tuning based on the results. These steps are demanding, because a good model depends on doing all three carefully. Here are 10 very helpful Jupyter notebook extensions for these tasks.

1. Qgrid

Qgrid is a Jupyter notebook widget which uses SlickGrid to render pandas DataFrames within a Jupyter notebook. This allows you to explore your DataFrames with intuitive scrolling, sorting and filtering controls, as well as edit your DataFrames by double-clicking cells.


Most useful string methods for a Data Scientist during data preprocessing.

Image by Free-Photos from Pixabay

In simple words, machine learning is training algorithms on historical data to predict outputs for unseen data. Much of that data comes in the form of text. When working with text data, it helps to be familiar with Python’s built-in string methods.

In this post, I'll talk about some of the string methods that I personally found very useful while handling text data.

1. split()

split() breaks a string into a list of substrings at a given separator. With no separator, it splits on whitespace.
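As a small illustration of the behaviour described above, the sketch below calls split() in three common ways. The sample strings are made up for this example.

```python
text = "machine learning with python"

# With no arguments, split() breaks on runs of whitespace.
words = text.split()
print(words)  # ['machine', 'learning', 'with', 'python']

# An explicit separator splits on that exact string, keeping empty fields.
row = "id,name,,score"
fields = row.split(",")
print(fields)  # ['id', 'name', '', 'score']

# maxsplit caps the number of splits, counting from the left.
key, value = "path=/usr/bin=x".split("=", 1)
print(key, value)  # path /usr/bin=x
```

Note that an explicit separator preserves empty fields, while the no-argument form collapses consecutive whitespace. This matters when parsing delimited rows with missing values.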


Improve your model performance with features that contribute more to predictions.

Image by Arek Socha from Pixabay

Introduction

When do you say a model is good? When it performs well on unseen data. As data scientists, we perform various operations to build a good machine learning model. These include data pre-processing (handling NAs and outliers, column type conversions, dimensionality reduction, normalization, etc.), exploratory data analysis (EDA), hyperparameter tuning/optimization (finding the set of hyper-parameters of the ML algorithm that delivers the best performance), feature selection, and so on.

“Garbage in, Garbage out.” If…


Tune your Machine Learning models with open-source optimization libraries

Image by Jörg Felix from Pixabay

Introduction

Hyper-parameters are the parameters used to control the behavior of the algorithm while building the model. These parameters cannot be learned from the regular training process. They need to be assigned before training the model.

Examples: n_neighbors (KNN), kernel (SVC), max_depth and criterion (Decision Tree Classifier), etc.

Hyperparameter optimization, or tuning, in machine learning is the process of selecting the combination of hyper-parameters that delivers the best performance.
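To make the idea concrete, here is a minimal sketch of an exhaustive (grid) search written in plain Python. Everything in it is an assumption for illustration: the toy one-dimensional dataset, the hand-written nearest-neighbour classifier, and the hypothetical `weighted` option. It is not taken from any particular optimization library.

```python
from itertools import product

# Toy 1-D dataset (made up for illustration): the label is 1 when x > 5.
train = [(1.0, 0), (2.0, 0), (3.0, 0), (4.0, 0),
         (6.0, 1), (7.0, 1), (8.0, 1), (9.0, 1)]
valid = [(2.5, 0), (4.5, 0), (5.5, 1), (7.5, 1)]


def knn_predict(x, data, n_neighbors, weighted):
    """Classify x by a vote among its n_neighbors nearest training points."""
    nearest = sorted(data, key=lambda point: abs(point[0] - x))[:n_neighbors]
    votes = {}
    for px, label in nearest:
        # Distance weighting lets closer neighbours count for more.
        weight = 1.0 / (abs(px - x) + 1e-9) if weighted else 1.0
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)


def accuracy(data, n_neighbors, weighted):
    """Fraction of rows in data that the classifier labels correctly."""
    hits = sum(knn_predict(x, train, n_neighbors, weighted) == y for x, y in data)
    return hits / len(data)


# Hyper-parameters are fixed before training; the grid lists candidate values.
grid = {"n_neighbors": [1, 3, 5, 7], "weighted": [False, True]}

# Score every combination on held-out validation data.
results = []
for n_neighbors, weighted in product(grid["n_neighbors"], grid["weighted"]):
    score = accuracy(valid, n_neighbors, weighted)
    results.append((score, n_neighbors, weighted))

best_score, best_k, best_weighted = max(results, key=lambda r: r[0])
print(best_score, best_k, best_weighted)
```

The combinations are scored on a separate validation split rather than on the training rows, so the selected values reflect performance on data the classifier has not memorized.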

Various automatic optimization techniques exist, and each has its own strengths and drawbacks when applied to different types of problems.

Example…


A normality test is used to check whether a variable or sample follows a normal distribution.

Image by Author

Before discussing normality tests, let's first look at the normal distribution and why it is so important.

Normal distribution

The normal distribution, also known as the Gaussian distribution, is a probability distribution that describes how the values of a variable are distributed. It is symmetric: most observations fall around the central peak, and the probabilities for values further from the mean taper off equally in both directions, with few extreme values at the high and low ends of the data range.
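The symmetry and the concentration around the mean can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module. Choosing the standard normal (mean 0, standard deviation 1) is an assumption made only to keep the numbers simple.

```python
from statistics import NormalDist

# Standard normal: mean 0, standard deviation 1.
dist = NormalDist(mu=0.0, sigma=1.0)

# Symmetry: the density is the same at equal distances on either side of the mean.
print(dist.pdf(-1.5), dist.pdf(1.5))

# Probability mass within k standard deviations of the mean.
for k in (1, 2, 3):
    within = dist.cdf(k) - dist.cdf(-k)
    print(f"within {k} sd: {within:.4f}")
```

The loop reproduces the familiar 68-95-99.7 rule: roughly 68.27% of values lie within one standard deviation of the mean, 95.45% within two, and 99.73% within three.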

The term “Gaussian distribution” is named after the German mathematician Carl Friedrich Gauss.

A normal distribution…


Hyper-Parameter Tuning sometimes messes up your model and leads to unpredictable results on unseen data.

Image by Author

Introduction

Data leakage occurs when the model has access, during its training phase, to information from the test data. In other words, the data you are using to train your ML algorithm happens to contain the information you are trying to predict.

Data leakage prevents the model from generalizing well, and it can be very difficult for a data scientist to identify. Some common causes of data leakage are:

  • Treating outliers and missing values with central values before splitting
  • Scaling the data before splitting it into training and test sets
  • Training the model on both the training and the test data
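The scaling case in the list above can be sketched in a few lines of plain Python. The feature values are randomly generated for illustration, and `standardize` is a hypothetical helper, not a library function. The point is only the order of operations: split first, then compute the scaling statistics on the training rows alone.

```python
import random
from statistics import mean, pstdev

random.seed(0)
# Hypothetical feature values, made up for illustration.
values = [random.gauss(50.0, 10.0) for _ in range(100)]

# Split first, so the test rows stay unseen during preprocessing.
random.shuffle(values)
train, test = values[:80], values[80:]

# Fit the scaler on the training rows only.
mu, sigma = mean(train), pstdev(train)


def standardize(rows):
    """Apply the training-set mean and standard deviation to any rows."""
    return [(x - mu) / sigma for x in rows]


train_scaled = standardize(train)
test_scaled = standardize(test)

# Leaky alternative, shown only for contrast: statistics computed on all rows
# let information about the test set flow into preprocessing.
leaky_mu = mean(values)
print(mu, leaky_mu)
```

The same pattern applies to imputing missing values: compute the mean or median on the training split, then reuse that value on the test split.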

Hyper-Parameter Tuning is…


Popular hyper-parameter tuning techniques that every Data Scientist should know

Image from SigOpt

Introduction

Wikipedia states that “Hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm”

One of the most challenging parts of an ML workflow is finding the best hyperparameters for the model. The performance of ML models depends directly on their hyper-parameters, and careful tuning usually yields a better model. Tuning hyper-parameters can be tedious and complicated, and is more of an art than a science.

Hyper-parameters

Hyper-parameters are the parameters used to control the behaviour of the algorithm while building the model. These parameters cannot be learned from the normal training process. …


Python and its most popular data wrangling library, Pandas, are soaring in popularity. Compared to competitors like Java, Python and Pandas make data exploration and transformation simple.

But both Python and Pandas are known to have issues around scalability and efficiency.

Python loses some efficiency right off the bat because it’s an interpreted, dynamically typed language. But more importantly, Python has always focused on simplicity and readability over raw power. Similarly, Pandas focuses on offering a simple, high-level API, largely ignoring performance. …


Speed up your string comparisons

Many programmers, beginners included, love Python. It is one of the languages whose popularity keeps growing year after year.

In 2017, Stack Overflow projected that Python would overtake all other programming languages by 2020, as it had become the fastest-growing programming language in the world.
Python is one of the most accessible programming languages because of its readability, its large collection of libraries, and its vibrant community. It can be used for web programming, desktop applications, big data, cloud computing, machine learning, and more.

It’s very important for every Pythonista to understand the difference…

Sivasai Yadav Mudugandla

Data scientist at Tech Mahindra | Post Graduate in AI & ML | Pythonista | Swimmer | Dancer. https://www.linkedin.com/in/sivasai-mudugandla-89a156104/
