Recently I read an article titled Train sklearn 100x faster, about an open-source Python module named sk-dist. The module implements a "distributed scikit-learn" by using Spark to extend the built-in parallelisation of scikit-learn's meta-estimators, such as pipeline.Pipeline, model_selection.GridSearchCV, feature_selection.SelectFromModel, and ensemble.BaggingClassifier.
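To make the idea concrete, here is a small example of the built-in parallelisation that sk-dist distributes. It uses only the standard scikit-learn API: GridSearchCV's `n_jobs` argument fans the candidate fits out over local CPU cores, where sk-dist would instead fan them out over a Spark cluster (the sk-dist constructor details are left out here, as they are beyond this sketch).

```python
# Built-in scikit-learn meta-estimator parallelism: GridSearchCV fits one
# model per parameter combination, and n_jobs=-1 spreads those fits
# across all local CPU cores. sk-dist replaces this local fan-out with
# a Spark-based one.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1.0, 10.0]}

search = GridSearchCV(
    LogisticRegression(max_iter=200),
    param_grid,
    n_jobs=-1,  # parallelise candidate fits over local cores
    cv=3,
)
search.fit(X, y)
```

With three candidate values of `C` and three folds, nine fits run in parallel locally; on a cluster, sk-dist would distribute the same work across executors.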

It was 1 AM. Wise men and women have told me not to stay up late using computers. However, my life is too sedentary to sleep early, I am too bored with Netflix and chill, and I am too sober to dream about the next big thing since TikTok. So, I did the next best thing…

Image by Praha on Pexels.

Machine Learning Engineer (MLE) is a new engineering role that emerged a few years ago as the demand for a systematic and efficient approach to develop Machine Learning (ML) based solutions increased massively.

An MLE ensures that ML models go to production efficiently and effectively. At the same time, they make it easier to measure the model's operational efficiency in terms of cost, uptime, and business ROI, metrics that can determine the project's level of success (or failure). In a way, this role strives to make running an ML project 10 times faster.

MLEs are…

I work with Apache Spark on a regular basis, and I decided to formalize my know-how by taking a certification. At the time of writing, Databricks is the hottest platform for hosting Apache Spark applications, so I chose to take the one that they offer: Databricks Certified Associate Developer for Apache Spark 3.0. Despite being comfortable with the framework, I did a few things that made it easier to pass the test. Furthermore, I learned a few things while taking the exam. This article comprises information that could help you prepare for the certification.

Figure 1: Associate Developer for Apache Spark 3.0 Certificate by Databricks. The figure is extracted from a real certificate given to the Author.

What does it cover?


A deep dive into Spark transformations and actions is essential for writing effective Spark code. This article provides a brief overview of Spark's transformations and actions.


For simplicity, this article focuses on PySpark and the DataFrame API. The concepts apply similarly to other languages in the Spark framework. Furthermore, it is necessary to understand the following concepts to grasp the rest of the material easily.

Resilient Distributed Dataset: Spark jobs are typically executed against Resilient Distributed Datasets (RDDs), which are fault-tolerant partitions of records that can be operated on concurrently. …
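The key property behind transformations and actions is lazy evaluation, and it can be illustrated without a Spark cluster. The sketch below uses plain Python generators as a rough analogy (it is not Spark itself): like Spark transformations, generator expressions only build a plan, and nothing runs until a terminal operation, the analogue of a Spark action, consumes them.

```python
# Analogy for Spark's lazy evaluation using plain Python generators.
# "Transformations" (the generator expressions) only describe work;
# the "action" (list) is what triggers the actual computation.

executed = []

def trace(x):
    executed.append(x)  # record when the work really happens
    return x * 2

data = range(5)
doubled = (trace(x) for x in data)     # "transformation": nothing runs yet
evens = (x for x in doubled if x > 4)  # another lazy "transformation"

assert executed == []                  # no work has been done so far

result = list(evens)                   # "action": forces evaluation
assert executed == [0, 1, 2, 3, 4]
assert result == [6, 8]
```

In real PySpark, the same shape appears as `df.select(...).filter(...)` (transformations, which only extend the query plan) followed by `df.count()` or `df.collect()` (actions, which trigger execution across the cluster).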

People change teams all the time, for many reasons: changing jobs, internal migration, personal time off, etc. Gone are the days when people stayed with a company for a long time, let alone with one team. Embracing this fact and being prepared makes a team robust to changes. A big part of that preparation is a solid onboarding plan. Machine Learning (ML) teams are different, since they involve many more techniques and skill dimensions than typical software products. Onboarding in such teams therefore brings some unique challenges. With this article, we show an onboarding process as part of handling…

Photo by Markus Spiske

If you do not have the time to read the full article, consider reading the 30-second version.


If you have Machine Learning (ML) pipelines in production, you have to worry about the backward compatibility of changes made to the pipeline. It may be tempting to increase test coverage, but high test coverage cannot guarantee that your recent changes have not broken the pipeline or generated low-quality results. For that, you need end-to-end tests that can be executed as part of the continuous integration pipelines. Developing such a test requires sampling the dataset that powers the…
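As a rough illustration of what such an end-to-end check might look like, here is a minimal sketch that could run in a continuous integration job. The iris dataset stands in for a sample of the production data, and `build_pipeline` is a hypothetical stand-in for the real pipeline under test; the point is that the assertion guards result quality, not just the absence of crashes.

```python
# Minimal end-to-end check for an ML pipeline: fit on a fixed sample
# and assert a quality floor, so a change that silently degrades
# results fails the CI build.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def build_pipeline():
    # Hypothetical stand-in for the production pipeline under test.
    return Pipeline([
        ("scale", StandardScaler()),
        ("clf", LogisticRegression(max_iter=200, random_state=0)),
    ])

X, y = load_iris(return_X_y=True)  # stands in for a sampled dataset
pipeline = build_pipeline().fit(X, y)

# Guard against silent quality regressions, not just exceptions:
accuracy = pipeline.score(X, y)
assert accuracy > 0.9
```

A real version would load a frozen sample of production data and compare against previously agreed quality thresholds, but the structure stays the same.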


Photo by Keven Ku on Pexels


If you do not have time, here is the 30-second version:

  • If you or your team is doing data science exploration work that is not a total waste, you need to preserve the work in such a way that you, your team, or someone else can get back to it later without too much trouble. The value of an idea starts from exploration, and the value of exploration starts from sharing it in a way that is easy to reproduce.
  • It may be tempting to refer to a notebook running in a platform accessible to you or your…

Misbah Uddin

Engineering Manager: AI, Analytics and Data @H&M. Opening little boxes, one at a time
