NYC Taxi demand prediction


Problem statement

Given a location (latitude and longitude) and a time, predict the number of taxi pickups in the query region and its surrounding regions.

Data sources

The 2016 data can be downloaded from NYC.GOV.
In this project we consider only yellow taxi trips from Jan - Mar 2015 and Jan - Mar 2016, with the features listed below.

| Field Name | Description |
| --- | --- |
| VendorID | A code indicating the TPEP provider that provided the record. 1 = Creative Mobile Technologies; 2 = VeriFone Inc. |
| tpep_pickup_datetime | The date and time when the meter was engaged. |
| tpep_dropoff_datetime | The date and time when the meter was disengaged. |
| Passenger_count | The number of passengers in the vehicle. This is a driver-entered value. |
| Trip_distance | The elapsed trip distance in miles reported by the taximeter. |
| Pickup_longitude | Longitude where the meter was engaged. |
| Pickup_latitude | Latitude where the meter was engaged. |
| RateCodeID | The final rate code in effect at the end of the trip. 1 = Standard rate; 2 = JFK; 3 = Newark; 4 = Nassau or Westchester; 5 = Negotiated fare; 6 = Group ride. |
| Store_and_fwd_flag | This flag indicates whether the trip record was held in vehicle memory before sending to the vendor, aka "store and forward," because the vehicle did not have a connection to the server. Y = store and forward trip; N = not a store and forward trip. |
| Dropoff_longitude | Longitude where the meter was disengaged. |
| Dropoff_latitude | Latitude where the meter was disengaged. |
| Payment_type | A numeric code signifying how the passenger paid for the trip. 1 = Credit card; 2 = Cash; 3 = No charge; 4 = Dispute; 5 = Unknown; 6 = Voided trip. |
| Fare_amount | The time-and-distance fare calculated by the meter. |
| Extra | Miscellaneous extras and surcharges. Currently, this only includes the $0.50 and $1 rush hour and overnight charges. |
| MTA_tax | $0.50 MTA tax that is automatically triggered based on the metered rate in use. |
| Improvement_surcharge | $0.30 improvement surcharge assessed on trips at the flag drop. The improvement surcharge began being levied in 2015. |
| Tip_amount | Tip amount. This field is automatically populated for credit card tips. Cash tips are not included. |
| Tolls_amount | Total amount of all tolls paid in trip. |
| Total_amount | The total amount charged to passengers. Does not include cash tips. |

Mapping to a machine learning problem

We need to predict the demand at a given time and in a given region, so we treat it as a time-series forecasting problem with mean squared error (MSE) as the metric.
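To make the metric concrete, here is a minimal sketch of MSE on pickup counts. The function name and the sample counts are made up for illustration and do not come from the project.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted counts."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    if not actual:
        raise ValueError("need at least one observation")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical pickup counts for four consecutive time bins in one region.
actual = [12, 15, 9, 20]
predicted = [10, 15, 12, 18]
mse = mean_squared_error(actual, predicted)
```

Because the errors are squared, a few badly mispredicted bins dominate the score, which is worth keeping in mind for regions with bursty demand.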

Exploratory Data Analysis

Pickup and Dropoff coordinates

New York is bounded by the coordinates (lat, long) (40.5774, -74.15) and (40.9176, -73.7004). However, many pickup and dropoff coordinates lie outside this bounding box (see image below), so we removed those data points.
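The bounding-box filter can be sketched as follows. This is only an illustration: trips are represented as plain dictionaries keyed by the field names from the table above, and the helper names and sample rows are hypothetical.

```python
# Bounding box quoted in the text, as (lat, long) corners.
LAT_MIN, LAT_MAX = 40.5774, 40.9176
LON_MIN, LON_MAX = -74.15, -73.7004


def in_bounds(lat, lon):
    """True if the point lies inside the New York bounding box."""
    return LAT_MIN <= lat <= LAT_MAX and LON_MIN <= lon <= LON_MAX


def keep_trip(trip):
    """Keep a trip only if both its pickup and its dropoff are inside the box."""
    return (
        in_bounds(trip["Pickup_latitude"], trip["Pickup_longitude"])
        and in_bounds(trip["Dropoff_latitude"], trip["Dropoff_longitude"])
    )


# Hypothetical rows: the second one has a pickup far outside the city.
trips = [
    {"Pickup_latitude": 40.75, "Pickup_longitude": -73.99,
     "Dropoff_latitude": 40.78, "Dropoff_longitude": -73.95},
    {"Pickup_latitude": 0.0, "Pickup_longitude": 0.0,
     "Dropoff_latitude": 40.78, "Dropoff_longitude": -73.95},
]
filtered = [t for t in trips if keep_trip(t)]
```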

Trip duration

Moving on to trip duration: according to NYC Taxi & Limousine Commission regulations, the maximum allowed trip duration in a 24-hour interval is 12 hours. We therefore keep only the data points whose trip time is greater than 1 minute and less than 720 minutes (12 hours). Given below is the PDF of the log trip times along with its normal Q-Q plot, which suggests that the log trip times are approximately normally distributed.
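A minimal sketch of the duration filter is shown below. It assumes the pickup and dropoff timestamps are strings in `YYYY-MM-DD HH:MM:SS` format. That format and the helper names are assumptions, not something stated in the post.

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"  # assumed timestamp format


def trip_minutes(pickup, dropoff):
    """Trip duration in minutes between two timestamp strings."""
    delta = datetime.strptime(dropoff, FMT) - datetime.strptime(pickup, FMT)
    return delta.total_seconds() / 60.0


def valid_duration(minutes, low=1.0, high=720.0):
    """Keep trips longer than 1 minute and shorter than 12 hours."""
    return low < minutes < high


# Hypothetical trip: 15 minutes and 30 seconds long.
minutes = trip_minutes("2016-01-01 00:00:00", "2016-01-01 00:15:30")
keep = valid_duration(minutes)
```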

Speed

We take the 99th percentile of speed, 45.31 miles/hr, and replace all outliers with that value. The average speed of taxis came out to be 12.45 miles/hr.
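The post does not say how speed is derived. A natural assumption is trip distance divided by trip duration, which the sketch below illustrates with made-up trips.

```python
def speed_mph(distance_miles, duration_minutes):
    """Average trip speed in miles per hour."""
    if duration_minutes <= 0:
        raise ValueError("duration must be positive")
    return distance_miles * 60.0 / duration_minutes


# Hypothetical trips: (Trip_distance in miles, trip duration in minutes).
trips = [(3.0, 15.0), (1.0, 10.0), (10.0, 30.0)]
speeds = [speed_mph(d, m) for d, m in trips]
average_speed = sum(speeds) / len(speeds)
```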

Distance

Outliers are replaced by 23 miles (the 99th percentile).

Total Fare

Outliers are replaced with 1000 (the 99th percentile).

Note: in every outlier-removal step we also require the value to be greater than 0.
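The capping applied to speed, distance and total fare can be sketched with one helper. Two assumptions are made here: the percentile uses the simple nearest-rank definition (the post does not say which definition was used), and non-positive values are dropped before capping, which is one reading of the note above.

```python
import math


def percentile_nearest_rank(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of the data at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100.0))
    return ordered[rank - 1]


def cap_outliers(values, pct=99.0):
    """Drop non-positive values, then replace anything above the percentile with it."""
    positive = [v for v in values if v > 0]
    cap = percentile_nearest_rank(positive, pct)
    return [min(v, cap) for v in positive], cap


# Hypothetical data: the values 1..100 plus two invalid entries.
values = list(range(1, 101)) + [0, -5]
capped, cap = cap_outliers(values)
```

Capping keeps the row in the dataset and only clips the extreme value. Dropping the row instead would be a different design choice.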

Data preparation

We create clusters across New York City, keeping the minimum inter-cluster distance at 0.5 miles and the maximum at 2 miles, since a taxi at the average speed of 12.45 miles/hr covers about 2 miles in 10 minutes.
We choose 40 as the optimum number of clusters, because with 40 clusters we have:

Avg. Number of Clusters within the vicinity (i.e. intercluster-distance < 2): 9.0   
Avg. Number of Clusters outside the vicinity (i.e. intercluster-distance > 2): 31.0   
Min inter-cluster distance =  0.5064095487015858  (as desired)  
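Statistics like the ones above can be computed from the cluster centers alone. The sketch below assumes the centers are (lat, long) pairs and uses the haversine formula for distance in miles. The post names neither the clustering algorithm nor the distance function, so both the helpers and the sample centers are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(a, b):
    """Great-circle distance in miles between two (lat, long) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(h))


def vicinity_stats(centers, radius_miles=2.0):
    """Average number of other centers within and outside the radius, plus the minimum distance."""
    within, outside = [], []
    min_dist = math.inf
    for i, ci in enumerate(centers):
        near = far = 0
        for j, cj in enumerate(centers):
            if i == j:
                continue
            d = haversine_miles(ci, cj)
            min_dist = min(min_dist, d)
            if d < radius_miles:
                near += 1
            else:
                far += 1
        within.append(near)
        outside.append(far)
    n = len(centers)
    return sum(within) / n, sum(outside) / n, min_dist


# Hypothetical centers: two close together, one far away.
centers = [(40.75, -73.99), (40.76, -73.99), (40.90, -73.80)]
avg_within, avg_outside, min_dist = vicinity_stats(centers)
```

Running this for several candidate cluster counts and comparing the three numbers is one way to arrive at a choice such as 40.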

Cluster centers


Finally, we have cluster IDs along with 10-minute time bins. In many 10-minute time bins the number of pickups is zero. Such missing values are filled by interpolation and extrapolation.
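The binning and filling steps can be sketched as follows. The post does not describe the exact fill scheme, so this example assumes a simple one: linear interpolation between the nearest non-zero bins, and copying the nearest non-zero value at the edges as a stand-in for extrapolation. All function names and sample data are hypothetical.

```python
from collections import Counter
from datetime import datetime

BIN_SECONDS = 600  # 10-minute bins, as in the text


def bin_index(ts, start):
    """Index of the 10-minute bin that a timestamp falls into."""
    return int((ts - start).total_seconds() // BIN_SECONDS)


def pickups_per_bin(pickups, start, n_bins):
    """Count pickups per (cluster ID, time bin); bins with no pickups get 0."""
    counts = Counter((cid, bin_index(ts, start)) for cid, ts in pickups)
    clusters = {cid for cid, _ in counts}
    return {cid: [counts.get((cid, b), 0) for b in range(n_bins)]
            for cid in clusters}


def fill_zero_bins(series):
    """Fill zero bins from the nearest non-zero neighbours (assumed scheme)."""
    known = [i for i, v in enumerate(series) if v > 0]
    if not known:
        return [float(v) for v in series]
    filled = [float(v) for v in series]
    for i, v in enumerate(series):
        if v > 0:
            continue
        left = max((k for k in known if k < i), default=None)
        right = min((k for k in known if k > i), default=None)
        if left is None:
            filled[i] = float(series[right])   # leading edge
        elif right is None:
            filled[i] = float(series[left])    # trailing edge
        else:
            step = (series[right] - series[left]) * (i - left) / (right - left)
            filled[i] = series[left] + step
    return filled


# Hypothetical pickups as (cluster ID, pickup time).
start = datetime(2016, 1, 1)
pickups = [
    (0, datetime(2016, 1, 1, 0, 5)),
    (0, datetime(2016, 1, 1, 0, 9, 59)),
    (0, datetime(2016, 1, 1, 0, 25)),
    (1, datetime(2016, 1, 1, 0, 10)),
]
binned = pickups_per_bin(pickups, start, n_bins=3)
filled = fill_zero_bins([0, 4, 0, 0, 10, 0])
```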

On-Hold: Will write soon.
