I used seven different types of classification models for this project, and after modelling, the best is the XGBoost model. To the RF model, experience is the most important predictor. It can be deduced that older and more experienced candidates tend to be more content with their current jobs and are looking to settle down. The training data has 19158 observations with 14 features, and the testing dataset has 2129 observations with 13 features. Using model(s) built on the current credentials, demographics, and experience data, you will predict the probability that a candidate will look for a new job or will work for the company, and interpret the factors affecting the employee's decision. Answer: In relation to the question asked initially, the 2 numerical features are not correlated, which makes them good candidates to use as predictors. More than 70% of people have relevant experience. This dataset consists of rows of data science employees who either are searching for a job change (target=1) or not (target=0). Reducing cost and increasing the probability that a candidate is hired can lower the cost per hire and make the recruitment process more efficient. However, at this moment we decided to keep it. The NaN values under gender and company_size were replaced by undefined. MICE (Multiple Imputation by Chained Equations) is a multiple imputation method; it is generally better than a single imputation method like mean imputation. After splitting the data into train and validation, we get the following distribution of class labels, which shows that the data does not meet the imbalance criterion. This content can be referenced for research and education purposes. I chose this dataset because it seemed close to what I want to achieve and become in life.
The task is predicting the probability that a candidate will look for a new job or will work for the company, as well as interpreting the factors affecting the employee's decision. XGBoost is a scalable and accurate implementation of gradient boosting machines; it has proven to push the limits of computing power for boosted trees algorithms, as it was built and developed for the sole purpose of model performance and computational speed. Please refer to the following task for more details. For the full end-to-end ML notebook with the complete codebase, please visit my Google Colab notebook. I got my data for this project from Kaggle. Before jumping into the data visualization, it's good to take a look at what the meaning of each feature is. We can see the dataset includes numerical and categorical features, some of which have high cardinality. The whole data is divided into train and test. The baseline model marks a 0.74 ROC AUC score without any feature engineering steps. To achieve this purpose, we created a model that can be used to predict the probability of a candidate considering working for another company, based on the company's and the candidate's key characteristics. I made a stackplot for each categorical feature and target, but for the clarity of the post I am only showing the stackplot for enrolled_course and target. Abdul Hamid - abdulhamidwinoto@gmail.com. LightGBM is almost 7 times faster than XGBoost and is a much better approach when dealing with large datasets. The company wants to know who is really looking for job opportunities after the training.
The features are: city_development_index: development index of the city (scaled); relevent_experience: relevant experience of the candidate; enrolled_university: type of university course enrolled in, if any; education_level: education level of the candidate; major_discipline: education major discipline of the candidate; experience: candidate's total experience in years; company_size: number of employees in the current employer's company; last_new_job: difference in years between previous job and current job; target: 0 = not looking for a job change, 1 = looking for a job change. I got -0.34 for the coefficient, indicating a moderate negative relationship, which matches the negative relationship we saw from the violin plot. It is still not efficient, because fewer people want to change jobs than not. So we need a new method which can reduce cost (money and time) and increase the probability of success in order to reduce CPH. Each employee is described with various demographic features. Another interesting observation we made was that, as the city development index for a particular city increases, a smaller share of the total workforce is looking to change their job. However, according to a survey, it seems some candidates leave the company once trained. The odds show that candidates with experience or enrolled in university tend to have higher odds of moving, and the weight of evidence shows the same for experience and for those enrolled in university. Three of our columns (experience, last_new_job and company_size) had mostly numerical values, with some exceptions. The relevant_experience column, which had only two kinds of entries (Has relevant experience and No relevant experience), was under debate as to whether it should be dropped, since the experience column contained more detailed information regarding experience.
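The correlation coefficient between a numeric feature such as city_development_index and the binary target can be computed as a plain Pearson correlation (equivalent to the point-biserial correlation when one variable is 0/1). The following is a minimal sketch using only the standard library; the function name `pearson_r` and the six sample values are hypothetical and are not taken from the dataset.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical values: high CDI rows are mostly target=0, low CDI rows target=1.
cdi = [0.92, 0.90, 0.62, 0.55, 0.91, 0.70]
target = [0, 0, 1, 1, 0, 1]

r = pearson_r(cdi, target)
print(f"correlation between CDI and target: {r:.2f}")
```

A negative value means that higher city development goes with fewer candidates looking for a job change, which is the direction described in the text.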
And since these different companies had varying sizes (number of employees), we decided to see if that has an impact on an employee's decision to call it quits at their current place of employment. The dataset has features that are mostly categorical (Nominal, Ordinal, Binary), some with high cardinality. HR Analytics: Job Change of Data Scientists. Introduction: Companies actively involved in big data and analytics spend money on employees to train and hire them for data scientist positions. Simple countplots and histogram plots of features can give us a general idea of how each feature is distributed. A sample submission corresponding to the enrollee_id of the test set is provided too, with columns: enrollee_id, target. The dataset is imbalanced. The dataset was obtained from Kaggle. StandardScaler removes the mean and scales each feature/variable to unit variance. I formulated the problem as a binary classification problem, predicting whether an employee will stay or switch jobs. This is the violin plot for the numeric variable city_development_index (CDI) and target. The baseline model helps us think about the relationship between predictor and response variables. Agatha Putri Algustie - agthaptri@gmail.com. We achieved an accuracy of 66% and an AUC-ROC score of 0.69. Through the above graph, we were able to determine that most people who were satisfied with their job belonged to more developed cities.
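To make the scaling step concrete, here is a minimal sketch of what standardization does to a single numeric column: subtract the mean, then divide by the standard deviation. It uses the population standard deviation, which is also what scikit-learn's StandardScaler uses. The function name `standardize` and the sample values are hypothetical.

```python
import math


def standardize(values):
    """Return values shifted to zero mean and scaled to unit variance."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divide by n, not n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        # A constant column carries no spread; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


scaled = standardize([1.0, 2.0, 3.0, 4.0, 5.0])
print(scaled)
```

In a real pipeline the mean and standard deviation would be computed on the training split only and then reused for the validation and test splits, so that no information leaks from held-out data.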
Group 19 - HR Analytics: Job Change of Data Scientists, by Tan Wee Kiat. For instance, there is a disproportionately large population of employees that belong to the private sector. All data come from the personal information that trainees provided when registering for the training. Create a process in the form of a questionnaire to identify employees who wish to stay versus leave, using a CART model. Since SMOTENC, used for data augmentation, accepts non-label-encoded data, I need to save the fitted label encoders to use for decoding categories after KNN imputation. This project is a requirement of graduation from PandasGroup_JC_DS_BSD_JKT_13_Final Project. The Random Forest classifier performs way better than the Logistic Regression classifier, albeit being more memory-intensive and time-consuming to train. The goal is calculating how likely their employees are to move to a new job in the near future. The feature dimension can be reduced to ~30 and still represent at least 80% of the information of the original feature space. Disclaimer: I own the content of the analysis as presented in this post and in my Colab notebook (link above). This dataset is designed to understand the factors that lead a person to leave their current job, for HR research too, and involves using model(s) to predict the probability that a candidate will look for a new job or will work for the company, as well as interpreting the factors affecting the employee's decision. The Gradient Boost classifier gave us the highest accuracy and AUC-ROC score. This blog intends to explore and understand the factors that lead a Data Scientist to change or leave their current job. We used this final model to increase our AUC-ROC to 0.8. A big advantage of using the gradient boost classifier is that it calculates the importance of each feature for the model and ranks them.
The number of men is higher than the number of women and others. We have seen the rampant demand for data-driven technologies in this era, and one of the key careers that fuels this is that of data scientists, who have gained the title of one of the sexiest jobs out there. Recommendation: This could be due to various reasons. People with more experience (11+ years) are probably good candidates to screen for when hiring for training, as they are more likely to stay and work for the company. In addition, there is a need to explore why people with less than one year or 1-5 years of experience are more likely to leave. At this stage, a brief analysis of the data will be carried out. At this stage, another information analysis will be carried out. At this stage, data preparation and processing will be carried out before the data is used for modelling. At this stage, the machine learning model will be built and optimized. At this stage, there will be an explanation of the decision making of the machine learning model. At this stage, we try to apply machine learning to solve the business problem and reach the business objective. Many people sign up for their training. This is in line with our deduction above. The target isn't included in the test set, but the test target values data file is available for related tasks. Hence, to reduce the cost of training, the company wants to predict which candidates are really interested in working for the company and which candidates may look for new employment once trained. The simplest way to analyse the data is to look into the distribution of each feature. We calculated the distribution of experience among the employees in our dataset for a better understanding of experience as a factor that impacts the employee's decision.
Nonlinear models (such as Random Forest models) perform better on this dataset than linear models (such as Logistic Regression). Task: https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015. Notebook: https://github.com/jubertroldan/hr_job_change_ds/blob/master/HR_Analytics_DS.ipynb. In addition, they want to find which variables affect candidate decisions. Answer: Modelling the data with a logistic regression model, experience is a factor, and the model reaches an AUC of 0.75. Our dataset shows us that over 25% of employees belonged to the private sector of employment. Do years of experience have any effect on the desire for a job change? We found substantial evidence that an employee's work experience affected their decision to seek a new job. Job Change of Data Scientists Using Raw, Encode, and PCA Data, by M Aji Pangestu. The approach to clean up the data had 6 major steps. Besides renaming a few columns for better visualization, there were no more apparent issues with our data. A more detailed and quantified exploration shows an inverse relationship between experience (in number of years) and perpetual job dissatisfaction that leads to job hunting. Variable 1: Experience. It is a great approach for the first step.
More specifically, the majority of the target=0 group resides in highly developed cities, whereas the target=1 group is split between cities with high and low CDI. To summarize our data, we created the following correlation matrix to see whether and how strongly pairs of variables were related. As we can see from this image (and many more that we observed), some of our data is imbalanced. We believed this might help us understand more about why an employee would seek another job. The original dataset can be found on Kaggle, and full details including all of my code are available in a notebook on Kaggle. The hiring process could be time and resource consuming if the company targets all candidates based only on their training participation. This Kaggle competition is designed to understand the factors that lead a person to leave their current job, for HR research too. The relatively small gap in accuracy and AUC scores suggests that the model did not significantly overfit. The company wants to know which of these candidates really want to work for the company after training and which are looking for new employment, because this helps to reduce cost and time, improve the quality of training, and plan the courses and the categorization of candidates. Questionnaire (list of questions to identify candidates who will work for the company or will look for a new job).
The heatmap shows the correlation of missingness between every 2 columns. In our case, company_size and company_type contain the most missing values, followed by gender and major_discipline. A violin plot plays a similar role to a box and whisker plot. Furthermore, after splitting our dataset into a training dataset (75%) and testing dataset (25%) using train_test_split from sklearn, we noticed an imbalance in our label which could have led to bias in the model. Consequently, we used the SMOTE method to over-sample the minority class. These are the 4 most important features of our model. https://www.kaggle.com/arashnic/hr-analytics-job-change-of-data-scientists/tasks?taskId=3015. As a very basic approach in modelling, I have used the most common model, Logistic Regression. Employees with less than one year, 1 to 5 years, and 6 to 10 years of experience tend to leave the job more often than others. Data files: '/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv', '/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_test.csv'. Generally, the higher the AUCROC, the better the model is at predicting the classes. For our second model, we used a Random Forest Classifier. The above bar chart gives you an idea of how many values are available in each column. Thus, an interesting next step might be to try a more complex model to see if higher accuracy can be achieved, while hopefully keeping overfitting from occurring.
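The idea behind SMOTE over-sampling can be sketched in a few lines: each synthetic minority sample is placed at a random point on the line segment between an existing minority sample and one of its nearest minority neighbours. The sketch below is a simplified illustration for purely numeric features, not the implementation used in the original analysis; the function name `smote_sketch`, its parameters, and the sample points are all hypothetical. For mixed categorical and numeric data, a variant such as SMOTENC would be needed.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority points by squared Euclidean distance to base.
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Pick a random position on the segment from base to neighbour.
        gap = rng.random()
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Hypothetical minority-class rows with two numeric features.
minority = [[0.60, 1.0], [0.62, 2.0], [0.55, 1.5], [0.70, 3.0]]
new_points = smote_sketch(minority, n_new=4)
print(new_points)
```

Over-sampling of this kind should be applied to the training split only, so that the validation and test splits keep the original class distribution.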
Most features are categorical (Nominal, Ordinal, Binary), some with high cardinality. The company provides 19158 training observations and 2129 testing observations, with each observation having 13 features excluding the response variable. Third, we can see that multiple features have a significant amount of missing data (~30%). This is therefore one important factor for a company to consider when deciding on a location to begin in or relocate to. Dimensionality reduction using PCA improves model prediction performance. Summarize findings to stakeholders. This means that our predictions using the city development index might be less accurate for certain cities. A company that is active in Big Data and Data Science wants to hire data scientists among people who successfully pass some courses which are conducted by the company. The dataset is imbalanced. What is the effect of company size on the desire for a job change? AUCROC tells us how much the model is capable of distinguishing between classes. The following features and predictor are included in our dataset. So far, the following challenges regarding the dataset are known to us. In my end-to-end ML pipeline, I performed the following steps. From my analysis, I derived the following insights. In this project, I performed an exploratory analysis on the HR Analytics dataset to understand what the data contains, developed an ML pipeline to predict the possibility of an employee changing their job, and visualized my model predictions using a Streamlit web app hosted on Heroku. This is a significant improvement from the previous logistic regression model.
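One way to read the AUCROC score is as the probability that a randomly chosen positive example (target=1) receives a higher predicted score than a randomly chosen negative example (target=0), with ties counted as half. The following minimal sketch computes it by comparing every positive-negative pair; the function name `roc_auc` and the four sample scores are hypothetical, and a library routine would normally be used on real data.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present to compute AUC")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ordered correctly
```

A score of 0.5 corresponds to random ranking and 1.0 to perfect separation, which is why values such as 0.69, 0.74, and 0.8 in the text can be compared directly across models.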
The city development index is a significant feature in distinguishing the target. Insight: Major Discipline is the 3rd most important predictor of an employee's decision. Interpret the model(s) in such a way that illustrates which features affect the candidate's decision. Next, we converted the city attribute to numerical values using the ordinal encode function. Since our purpose is to determine whether a data scientist will change their job or not, we set the looking-for-job variable as the label and the remaining data as training data. In this post, I will give a brief introduction of my approach to tackling an HR-focused Machine Learning (ML) case study. The following models are built and evaluated. This dataset contains a typical example of class imbalance. This problem is handled using SMOTE (Synthetic Minority Oversampling Technique).
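Ordinal encoding of a high-cardinality column such as city can be sketched as a mapping from each distinct category to an integer. The example below is a minimal standard-library illustration, not the encoder used in the original analysis; the function names `fit_ordinal` and `transform_ordinal`, the `unknown` fallback value, and the sample city codes are hypothetical.

```python
def fit_ordinal(values):
    """Map each distinct category to an integer, in sorted order."""
    categories = sorted(set(values))
    return {category: index for index, category in enumerate(categories)}


def transform_ordinal(values, mapping, unknown=-1):
    """Encode values with a fitted mapping; unseen categories get `unknown`."""
    return [mapping.get(v, unknown) for v in values]


# Hypothetical city codes for illustration.
cities = ["city_103", "city_40", "city_21", "city_103"]
mapping = fit_ordinal(cities)
encoded = transform_ordinal(cities, mapping)
print(encoded)  # [0, 2, 1, 0]
```

The integers carry an artificial order (here, the alphabetical order of the codes), which tree-based models tolerate well but linear models may misread as a magnitude. Fitting the mapping on the training split and reusing it on the test split keeps the encoding consistent.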