In recent years, data has been heralded as the world’s most valuable resource, surpassing even oil in importance (Economist 2017). Organizations across all industries are making significant investments in data and analytics, with machine learning (ML) and AI being key drivers of IT spending (Stackpole 2023). While these technologies offer a competitive edge, they also come with risks—mistakes driven by machine learning models can have serious consequences, from damaging an organization’s reputation to impacting revenue or even human lives (Olavsrud 2024). Therefore, it is crucial not only to understand the data but also to ensure the tools used are well-understood and aligned with the organization’s values. A structured, well-defined machine learning process is essential to mitigate risks and maximize the potential of these powerful technologies, ensuring responsible and effective use throughout.
In this tutorial, I will walk through the essential steps of the ML process using logistic regression model on a real-world dataset: the Pima Indians Diabetes Dataset (the diabetes dataset). THe demo is coded in Python and the primary machine learning software for modeling is scikit-learn, which is an open-source library for Python that offers an extensive range of supervised and unsupervised learning algorithms and tools for machine learning tasks. You can find all source code that supports the demo in this GitHub repository and the final research report here.
Step 1: Define the Problem
Before diving into code or models, the first step is always to define the problem clearly. Understanding the business problem or the task we are trying to solve is critical, as it determines the approach and the tools we’ll use.
- What problem are we trying to solve?
- Is it a classification task, regression task, or clustering task?
- What kind of data do we have, and what’s the expected output?
In this guide, our goal is to predict whether a person has diabetes based on clinical features like glucose levels, BMI, age, etc. This is a binary classification problem, where the output variable is whether or not the patient has diabetes.
Step 2: Understand and Validate the Data
Machine learning models rely on high-quality data to learn and make predictions, so it is crucial to understand the data before applying any machine learning techniques.
Source, Collection, and Scope of Data
The initial stage in understanding any dataset is to get a high-level overview of its source, collection process, and scope. This includes addressing the following questions:
- Where is the data coming from?
- How was the data collected?
- How many observations are there?
- What relevant features are available?
By understanding these aspects, we can assess the dataset’s relevance and reliability for the machine learning task at hand. For example, Pima Indians Diabetes Dataset is well-aligned with the goal of predicting diabetes, as it comes from a reliable source, the National Library of Medicine, is based on standardized medical examinations, and provides a sufficient sample size with 768 observations \(\times\) 8 features for meaningful model training and evaluation.
Data Validation
Following the top level summary on the dataset, the next critical step is to ensures the quality, integrity, and consistency of the dataset such as checking for issues like missing values, outliers, duplicates, and adhering to expected data formats, types, and ranges. For instance, are the glucose levels within a reasonable range, or are there any negative values in fields like age or pregnancies?
Tools like the Pandera package can automate this process by allowing us to define validation schemas for the dataset, ensuring that each column meets predefined conditions. It is important to note that data validation focuses on identifying and addressing inherent issues in the dataset, rather than making subjective judgments or imposing interpretations on the data. This means, in the diabetes dataset, observations are removed when they contain implausible values, like nulls or zeros, in features where these values would be unrealistic (e.g., glucose or blood pressure).
Step 3: Split the Data into Training and Testing Sets
Before proceeding any further with data exploration, it’s important to split the data into two subsets: one for training and one for testing.
This division is essential for evaluating how well our model generalizes to unseen data. A common split is 70% for training and 30% for testing, where the training set is used to teach the model, and the test set is reserved to measure its true performance upon the model is properly trained. Keeping the test set “locked” and untouched until the final evaluation is critical for ensuring that the model’s performance is not influenced by prior knowledge of this data.
It is VERY important to perform this split early so we avoid any potential data leakage and adhere to the golden rule of machine learning: the test data cannot influence training the model in anyway.
Step 4: Exploratory Data Analysis (EDA)
Once we’ve split the dataset, we can dive deeper into Exploratory Data Analysis (EDA) on the training data. EDA involves investigating the dataset’s structure and characteristics using both visualizations and statistical methods. By looking at key aspects such as missing values, the distribution of features, and visualizing how different features relate to the target variable, we gain a clear understanding of the data, uncover patterns, and identify any potential issues that might affect model performance.
In our demo, Figure 1 visualized the distributions of each predictor from the diabetes training set, with the distributions color-coded by class (0: blue, 1: orange). This exploration often helps identify patterns or trends that inform feature engineering later in the process.

Moroever, we also examined correlation among the predictors (Figure 2) to identify multicollinearity, which could cause problems when performing logistic regression. No multicollinearity was detected, as the highest correlation between predictors is below the 0.7 threshold.

Step 5: Data Preprocessing
This phase, known as data preprocessing, involves cleaning and transforming the raw data to make it suitable for machine learning algorithms. It includes handling missing values, transforming features, and performing feature engineering. Proper data preprocessing is crucial as it directly impacts the performance and accuracy of the machine learning model.
Handling Missing Data
Addressing missing or invalid data is one of the first preprocessing tasks. Typically missing data is handled by:
- Removing rows with missing values if they are few and don’t significantly affect the dataset.
- Imputing missing values using a statistical method, such as replacing them with the median, mean, or mode of the feature, or more sophisticated techniques such as predictive imputation.
Choosing the right approach for missing data depends on the nature of the dataset and the amount of missing data present.
Feature Transformation
Feature transformation ensures that different types of data are appropriately scaled or encoded so that machine learning algorithms can interpret them correctly. The methods used vary depending on the type of data: numeric, categorical, and textual.
1. Numeric Data
For numeric data, many machine learning algorithms benefit from scaling the features to a similar range. This helps ensure that one feature does not disproportionately influence the model compared to other features. Common approaches to numeric feature transformation include:
- Standardization: Rescales features to have a mean of 0 and a standard deviation of 1, often used for algorithms like logistic regression.
- Normalization: Scales features to a fixed range (e.g., 0 to 1), commonly used in distance-based algorithms like k-nearest neighbors (k-NN) model.
2. Categorical Data (Nominal and Ordinal)
Categorical data consists of non-numeric values that represent discrete categories or labels, which cannot be directly used by machine learning models. Common methods of transformtion include:
- One-Hot Encoding: This method creates binary (0 or 1) columns for each category, turning a single categorical feature with multiple categories into several binary features. This is often used for nominal data, where the categories do not have a meaningful order (e.g., “female” vs. “male”).
- Label Encoding: This approach converts each category into a numeric label, which is suitable for ordinal data where there is a meaningful order (e.g., “low”, “medium”, “high”).
3. Text Data
Raw text data, such as sentences, documents, or reviews, contains valuable information but cannot be directly interpreted by algorithms. Feature transformation techniques are used to convert text into a numerical format for machine learning models to process effectively.
- Bag-of-Words: Transforms text into numerical features by counting word occurrences in documents, creating a sparse matrix.
- TF-IDF (Term Frequency-Inverse Document Frequency): Weighs word frequencies by their importance across the dataset, down-weighting common words like “the” and “is.”
Feature Engineering
Feature engineering refers creating new features from the existing dataset to better represent the underlying patterns in the data. This step is often key to improving the predictive power of the model. This often includes creating interaction terms (e.g. squared or cubic terms), adding polynomial features, and sometimes aggregating multiple features.
In our case, the diabetes dataset already contains structured numeric data that only required scaling, but feature engineering such as interaction terms could have been explored to better capture relationships between features.
Important Note on Data Integrity
The procedures and techniques in this section highlight why it’s crucial to split the data before starting any preprocessing. For example, imagine applying a feature scaling technique like standardization, where we calculate the mean and standard deviation of a feature. If we compute these statistics using both the training and testing data, we would inadvertently introduce information from the test set into the training process, resulting in data leakage by giving the model an unfair advantage and inflating its performance.
By splitting the data first, we ensure the integrity of the test set for unbiased evaluation is preserved. Mechanically, we fit the transformation models (such as scaling or encoding) only on the training data, and then apply the same transformations to both the training and testing data, ensuring consistency without contaminating the test set.
Step 6: Choose a Model
At this stage, choose a machine learning model to solve our problem. The choice of model depends on the problem type, the nature of the data, and the available computational resources.
- Classification problems: Logistic regression, decision trees, random forests, support vector machines (SVM), k-NN, etc.
- Regression problems: Linear regression, decision trees, random forests, etc.
- Clustering problems: K-means, DBSCAN, hierarchical clustering, etc.
For this tutorial, we chose Logistic Regression, a simple yet powerful model for binary classification tasks. It models the probability of a binary outcome (e.g. diabetic vs. non-diabetic) based on the input features.
Step 7: Train the Model
Once we’ve selected the model, it’s time to train it on the training data. Training involves feeding the features and corresponding target values to the model so that it can learn the patterns and relationships between them.
Cross Validation
Typically, we use cross-validation technique during training to improve the generalization ability of a machine learning model. It works by splitting the training data into multiple subsets, training the model on some folds, and validating it on the remaining fold. This process is repeated for each fold, and the average performance (along with standard deviation) is used to assess the model. Cross-validation helps mitigate overfitting and provides a more reliable estimate of the model’s performance on unseen data.
Hyperparameter Tuning
For many machine learning models, there are hyperparameters that need to be tuned to optimize performance. Unlike model parameters, which are learned during training, hyperparameters are set before training and control the model’s learning process. Examples of hyperparameters include the learning rate in gradient descent, the number of trees in a random forest, or the regularization strength in logistic regression. Fine-tuning these hyperparameters can significantly impact the model’s ability to generalize to unseen data.
Common techniques for hyperparameter tuning include:
- Grid Search, where a predefined set of hyperparameters is tested exhaustively
- Random Search, which samples hyperparameters randomly from a specified range.
- Advanced techniques such as Bayesian optimization, genetic algorithms (GA), hyperband, etc.
In the demo, the goal is to optimize the model to make accurate predictions for new data. The model finds the best-fit feature coefficients (see Figure 3) through cross-validation to minimize the cost function (e.g., the log-loss function) for logistic regression. We’ve used RandomizedSearchCV
to find the optimal regularization strength that minimizes the model’s error rate.

Step 8: Evaluate the Model
After training the model, it’s essential to evaluate its performance on the test set. Evaluation allows us to measure how well the model generalizes to new, unseen data.
Evaluation Metrics
Evaluation metrics help quantify this generalization by comparing the model’s predictions against the true outcomes. The choice of evaluation metrics depends on the type of problem and the characteristics of the data.
- Classification Metrics: accuracy, precision, recall, f1-score, fbeta-score, receiver-operating characteristic (ROC) curve and AUC, precision-recall (PR) curve and average precision (AP)
- Regression Metrics: mean squared error (MSE), mean absolute error (MAE), mean absolute percentage error (mape), and R-squared (\(R^2\))
- Clustering Metrics: Clustering is an unsupervised learning task, and its evaluation metrics typically focus on measuring the similarity or compactness of the clusters formed. Evaluation metrics can be categorized into internal metrics, which assess clustering quality based solely on the data itself, and external metrics, which compare the clusters to pre-defined categories or ground truth.
- Internal metrics: silhouette score, Davies-Bouldin index, and Dunn index
- External metrics: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and Fowlkes-Mallows Index (FMI)
Generally, accuracy for classification and mean squared error (MSE) for regression are set as default evaluation metrics in most machine learning software packages. For any classificatoin and regression metrics mentioned above that are unclear, please refer to the Machine Learning Crash Course offered by Google, which is an excellent resource.
For the sample analysis, the model reported an accuracy score of 75%. More in depth evaluation of test set predictions include confusion matrix (Figure 4), PR curve (Figure 5), and (Figure 6).



Step 9: Refine the Model
After evaluating the model, we might find areas for improvement. This could involve:
- Feature Engineering: Trying new features or transformations
- Hyperparameter Tuning: Using more advanced techniques to fine-tune the model’s parameters
- Trying Different Models: Experimenting other models if the chosen doesn’t perform well
Step 10: Deploy the Model
Once our model is refined and performing well, the next step is to deploy it in a real-world scenario, which typically involves the following steps:
- Integrate the model into an application or API: This allows the model to make predictions either in real time or in batch mode, depending on the specific needs of the system.
- Monitor the model’s performance: Continuously tracking the model’s effectiveness on new data helps identify any performance drops or changes in data patterns.
- Periodic model updates: The model may need to be retrained with new data or fine-tuned to maintain its accuracy and relevance as conditions evolve.
Our demo has skipped through step 9 and 10, as the project was designed solely for academic and tutorial purposes. Consequently, the results achieved are not intended to be final or satisfactory for production use.
Conclusion
In this tutorial, we’ve covered the complete machine learning process (Figure 7) using the Pima Indians Diabetes Dataset as a practical example. We explored how to define the problem, collect, split, and preprocess the data, train a model, evaluate its performance, and refine it for improved performance. Each of these steps is essential for building a robust machine learning model capable of solving real-world problems.
Machine learning is an iterative and evolving process that involves continuous experimentation and refinement. As you advance in your data science journey, the principles outlined in this tutorial will guide you in tackling more complex challenges and making data-driven decisions.
