Data & Analytics

Improving Taxi Service with a Predictive Algorithm

Table of Contents

Problem Statement

The main objective is to build a predictive model that estimates the duration of a taxi trip. Accurate duration estimates would in turn help the transportation company match the right cabs with the right customers quickly and efficiently.

Objective

To solve this problem, we developed an algorithm that used real-time traffic data and historical travel patterns to predict how long it would take for a taxi to reach its destination. We also incorporated weather data into our model, as we found that inclement weather had a significant impact on travel times.

We tested our algorithm extensively using historical data and found that it was highly accurate in predicting travel times. We then integrated the algorithm into the transportation company's existing software platform so that drivers could receive real-time updates on predicted travel times and adjust their routes accordingly.

Dataset

Notebook Link

Data Pipeline

  1. Know Your Data: The first step in this project was to examine the various features of the dataset, understand the structure of the data and identify any patterns or trends. We looked at the shape of the data, the data types of each feature, and a statistical summary.

  2. Exploratory Data Analysis: We conducted an exploratory analysis of the data to identify patterns and dependencies, and to draw conclusions that would be useful for further processing.

  3. Data Cleaning: We checked the dataset for duplicate values, then handled null values by imputing empty strings where appropriate and dropping some of the rows that contained nulls. Outliers were addressed in the same step.

  4. Textual Data Preprocessing: We used techniques such as stop word removal, punctuation removal, conversion to lowercase, stemming, tokenization, and word vectorization to prepare the textual data for clustering. We also used Principal Component Analysis (PCA) to handle the curse of dimensionality.

  5. Cluster Implementation: We used K-Means and Agglomerative Hierarchical clustering algorithms to cluster the movies and determine the optimal number of clusters.

  6. Content-Based Recommendation System: We built a content-based recommendation system using the similarity matrix obtained from cosine similarity, which will provide the user with 10 recommendations based on the type of movie/show they have watched.
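Steps 4 and 6 above can be illustrated with a minimal sketch that uses only the Python standard library. It is not the project's actual code: the stop-word list, the function names (`preprocess`, `vectorize`, `recommend`) and the three-title catalog are hypothetical, word counts stand in for whatever vectorizer was actually used, and stemming, PCA and the clustering step are left out for brevity.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "and", "of", "in", "on", "to", "is", "with"}


def preprocess(text):
    """Lowercase, strip punctuation, tokenize, and drop stop words."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def vectorize(text):
    """Bag-of-words vector: token -> count."""
    return Counter(preprocess(text))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def recommend(title, catalog, top_n=10):
    """Return up to top_n other titles, most similar description first."""
    vectors = {name: vectorize(desc) for name, desc in catalog.items()}
    query = vectors[title]
    scored = [
        (cosine_similarity(query, vec), name)
        for name, vec in vectors.items()
        if name != title
    ]
    # Highest similarity first; ties broken alphabetically for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:top_n]]


# Hypothetical catalog, invented purely for illustration.
catalog = {
    "Space Quest": "A crew of astronauts explores a distant planet.",
    "Planet Rescue": "Astronauts race to rescue a crew stranded on a planet.",
    "Baking Stars": "Amateur bakers compete in a baking contest.",
}

print(recommend("Space Quest", catalog, top_n=2))
```

With a real catalog, `top_n=10` would give the ten recommendations mentioned in step 6. On a large dataset, a precomputed similarity matrix, as the step describes, avoids re-vectorizing every description on each call.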

Project Workflow

Code (GitHub)

Data Visualization

Conclusion

Model Performance & Final Model

XGBoost model with K-fold cross-validation; RMSE: 0.37356
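For readers unfamiliar with the evaluation scheme, the sketch below shows how a K-fold cross-validated RMSE can be computed. It is an illustration under stated assumptions, not the project's code: to stay dependency-free, the XGBoost regressor is swapped for a hypothetical mean-predicting stand-in (`fit_mean_model`), and the helper names and the default of five folds are likewise assumptions.

```python
import math
import random


def k_fold_indices(n_samples, k, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def fit_mean_model(X_train, y_train):
    """Hypothetical stand-in for a real regressor: always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda X: [mean for _ in X]


def cross_validate(X, y, fit, k=5, seed=0):
    """Train on k-1 folds, score RMSE on the held-out fold, repeat for each fold."""
    folds = k_fold_indices(len(y), k, seed)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        preds = model([X[j] for j in val_idx])
        scores.append(rmse([y[j] for j in val_idx], preds))
    return scores


# Synthetic data, invented for illustration only.
X = [[float(i)] for i in range(20)]
y = [2.0 * row[0] + 1.0 for row in X]
scores = cross_validate(X, y, fit_mean_model, k=5)
print("per-fold RMSE:", scores)
print("mean RMSE:", sum(scores) / len(scores))
```

Averaging the per-fold scores gives a single figure of the kind reported above. Any regressor can be plugged in through the `fit` argument, provided it returns a callable that maps feature rows to predictions.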

Skills

Data Analysis, Data Science, Machine Learning, Python