NYC Taxi demand prediction
Problem statement
Predict the number of pickups in a query region and its surrounding regions, given location coordinates (latitude and longitude) and a time.
Data sources
Get the 2016 data from NYC.GOV.
This project considers only yellow taxi trips from Jan - Mar 2015 and Jan - Mar 2016, with the following features.
Field Name | Description |
---|---|
VendorID | A code indicating the TPEP provider that provided the record. |
tpep_pickup_datetime | The date and time when the meter was engaged. |
tpep_dropoff_datetime | The date and time when the meter was disengaged. |
Passenger_count | The number of passengers in the vehicle. This is a driver-entered value. |
Trip_distance | The elapsed trip distance in miles reported by the taximeter. |
Pickup_longitude | Longitude where the meter was engaged. |
Pickup_latitude | Latitude where the meter was engaged. |
RateCodeID | The final rate code in effect at the end of the trip. |
Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor,<br> aka “store and forward,” because the vehicle did not have a connection to the server. <br>Y= store and forward trip <br>N= not a store and forward trip |
Dropoff_longitude | Longitude where the meter was disengaged. |
Dropoff_latitude | Latitude where the meter was disengaged. |
Payment_type | A numeric code signifying how the passenger paid for the trip. |
Fare_amount | The time-and-distance fare calculated by the meter. |
Extra | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
MTA_tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
Improvement_surcharge | $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015. |
Tip_amount | Tip amount – This field is automatically populated for credit card tips. Cash tips are not included. |
Tolls_amount | Total amount of all tolls paid in trip. |
Total_amount | The total amount charged to passengers. Does not include cash tips. |
Mapping to a machine learning problem
We need to predict the demand at a given time and region, so we treat this as a time-series forecasting problem with mean squared error (MSE) as the evaluation metric.
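For reference, a minimal sketch of the MSE metric in plain Python. The function name and the sample pickup counts are illustrative assumptions, not taken from the project code.

```python
def mean_squared_error(actual, predicted):
    """Average squared difference between actual and predicted pickup counts."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Made-up pickup counts for three time bins of one region.
mse = mean_squared_error([10, 20, 30], [12, 18, 33])
```

Because errors are squared, a few badly mispredicted time bins weigh on the score much more than many small misses.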
Exploratory Data Analysis
Pickup and Dropoff coordinates
New York is bounded by the location coordinates (lat, long) (40.5774, -74.15) and (40.9176, -73.7004). However, many pickup and dropoff coordinates lie outside this bounding box (see image below), so we removed those data points.
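The bounding-box filter could be sketched as follows in plain Python. The dictionary keys are hypothetical lower-case versions of the field names in the table above, and requiring both pickup and dropoff to be inside the box is an assumption about how the removal was done.

```python
# Bounding box from the text, given as (lat, long) corners.
MIN_LAT, MIN_LON = 40.5774, -74.15
MAX_LAT, MAX_LON = 40.9176, -73.7004


def in_bounds(lat, lon):
    """True if the point lies inside the New York bounding box."""
    return MIN_LAT <= lat <= MAX_LAT and MIN_LON <= lon <= MAX_LON


def filter_trips(trips):
    """Keep trips whose pickup and dropoff both fall inside the box (assumption)."""
    return [
        t for t in trips
        if in_bounds(t["pickup_latitude"], t["pickup_longitude"])
        and in_bounds(t["dropoff_latitude"], t["dropoff_longitude"])
    ]


# Made-up sample rows: the second trip has a pickup at (0, 0), a common bad value.
sample = [
    {"pickup_latitude": 40.75, "pickup_longitude": -73.99,
     "dropoff_latitude": 40.78, "dropoff_longitude": -73.95},
    {"pickup_latitude": 0.0, "pickup_longitude": 0.0,
     "dropoff_latitude": 40.78, "dropoff_longitude": -73.95},
]
kept = filter_trips(sample)
```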
Trip duration
According to NYC Taxi & Limousine Commission regulations, the maximum allowed trip duration in a 24-hour interval is 12 hours. We therefore keep only the data points whose trip time is greater than 1 minute and less than 720 minutes (12 hours). Given below is the PDF of log trip times along with its normal Q-Q plot, which suggests that the log trip times are approximately normally distributed.
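A minimal sketch of the duration filter, using only the standard library. The timestamp format string is an assumption about how tpep_pickup_datetime and tpep_dropoff_datetime are stored, and the function names are illustrative.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"  # assumed format of the datetime columns


def trip_minutes(pickup, dropoff):
    """Trip duration in minutes between two timestamp strings."""
    delta = (datetime.strptime(dropoff, TIMESTAMP_FORMAT)
             - datetime.strptime(pickup, TIMESTAMP_FORMAT))
    return delta.total_seconds() / 60.0


def has_valid_duration(pickup, dropoff, low=1.0, high=720.0):
    """True if the trip lasted more than 1 minute and less than 12 hours."""
    return low < trip_minutes(pickup, dropoff) < high
```

For example, `has_valid_duration("2016-01-01 00:00:00", "2016-01-01 00:15:30")` is true for a 15.5-minute trip, while a 30-second or 13-hour trip is rejected.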
Speed
We take the 99th percentile of speed and replace all outliers with that value, 45.31. The average taxi speed came out to be 12.45 miles/hr.
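Speed is not one of the raw fields, so it has to be derived. A minimal sketch, assuming it is computed as trip distance divided by trip duration (the function names are illustrative):

```python
def speed_mph(distance_miles, duration_minutes):
    """Average trip speed in miles per hour."""
    if duration_minutes <= 0:
        raise ValueError("duration must be positive")
    return distance_miles * 60.0 / duration_minutes


def average_speed(trips):
    """Mean speed over (distance_miles, duration_minutes) pairs."""
    speeds = [speed_mph(d, m) for d, m in trips]
    return sum(speeds) / len(speeds)
```

For instance, a 3-mile trip that takes 15 minutes has an average speed of 12 miles/hr.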
Distance
Outliers are replaced with 23 miles (the 99th percentile).
Total Fare
Outliers are replaced with 1000 (the 99th percentile).
Note: whenever we remove outliers, we also assume that valid values are greater than 0.
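The same percentile-capping step applies to speed, distance and total fare. Below is a minimal sketch in plain Python. It uses the nearest-rank definition of a percentile, which is an assumption (the project may have used a different interpolation rule), and it drops non-positive values as one reading of the note above.

```python
import math


def percentile(values, q):
    """Nearest-rank percentile: smallest value with at least q% of the data at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 < q <= 100:
        raise ValueError("q must be in (0, 100]")
    ordered = sorted(values)
    rank = math.ceil(q / 100 * len(ordered))
    return ordered[rank - 1]


def cap_outliers(values, q=99):
    """Drop non-positive values, then replace anything above the q-th percentile with it."""
    positive = [v for v in values if v > 0]
    cap = percentile(positive, q)
    return [min(v, cap) for v in positive]


# Made-up fares: two invalid values and one extreme outlier.
fares = [0, -3] + list(range(1, 101)) + [5000]
capped = cap_outliers(fares)
```

Capping keeps the affected rows in the data set instead of discarding them, which preserves the pickup counts used later.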
Data preparation
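We create clusters across New York City, keeping the minimum inter-cluster distance at 0.5 miles and the maximum inter-cluster distance at 2 miles, since about 2 miles can be covered in 10 minutes (at the average speed of 12.45 miles/hr, 10 minutes corresponds to roughly 2.1 miles).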
We choose 40 as the optimum number of clusters because with 40 clusters we have:
Avg. number of clusters within the vicinity (i.e. inter-cluster distance < 2 miles): 9.0
Avg. number of clusters outside the vicinity (i.e. inter-cluster distance > 2 miles): 31.0
Min inter-cluster distance = 0.5064095487015858 (as desired)
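The vicinity statistics above can be sketched as follows, given a list of cluster centers as (lat, long) pairs. The haversine formula, the Earth radius constant, and the choice to exclude a cluster from its own neighbour count are assumptions, so this sketch is not guaranteed to reproduce the exact figures above. How the centers themselves are obtained is outside the scope of the sketch.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius, an assumed constant


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, long) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def cluster_stats(centers, radius=2.0):
    """Return (avg neighbours within radius, avg neighbours outside, min distance)."""
    n = len(centers)
    if n < 2:
        raise ValueError("need at least two cluster centers")
    near_total = 0
    min_dist = math.inf
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # a cluster is not counted as its own neighbour
            d = haversine_miles(centers[i], centers[j])
            if d < radius:
                near_total += 1
            min_dist = min(min_dist, d)
    avg_near = near_total / n
    avg_far = (n - 1) - avg_near
    return avg_near, avg_far, min_dist
```

Running this for several candidate cluster counts and comparing the three numbers is one way to check the 0.5 mile and 2 mile constraints.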
Cluster centers
So finally, we have cluster IDs along with 10-minute time bins. In many 10-minute time bins the number of pickups is zero. Such missing values are filled by interpolation and extrapolation.
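A minimal sketch of the binning and filling step for a single cluster. The text does not specify the exact scheme, so linear interpolation between the nearest non-zero bins, and carrying the nearest non-zero value outward at the edges, are assumptions. The function names and the use of timestamps in seconds are also illustrative.

```python
BIN_SECONDS = 600  # 10-minute bins


def count_per_bin(pickup_seconds, start, n_bins):
    """Count pickups per 10-minute bin; timestamps are seconds on the same clock as start."""
    counts = [0] * n_bins
    for t in pickup_seconds:
        idx = int((t - start) // BIN_SECONDS)
        if 0 <= idx < n_bins:
            counts[idx] += 1
    return counts


def fill_zero_bins(counts):
    """Fill zero bins: interpolate between non-zero bins, extend edge values outward."""
    known = [i for i, c in enumerate(counts) if c > 0]
    if not known:
        return list(counts)  # nothing to interpolate from
    filled = [float(c) for c in counts]
    for i in range(known[0]):  # leading zeros take the first non-zero value
        filled[i] = filled[known[0]]
    for i in range(known[-1] + 1, len(counts)):  # trailing zeros take the last one
        filled[i] = filled[known[-1]]
    for left, right in zip(known, known[1:]):  # interior gaps: straight line
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled
```

For example, `fill_zero_bins([0, 4, 0, 0, 10, 0])` gives `[4.0, 4.0, 6.0, 8.0, 10.0, 10.0]`. Note that this changes the total pickup count, so a scheme that redistributes the neighbouring counts may be preferable if totals must be preserved.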
On-Hold: Will write soon.