#### Motivation

I'm not sure about you, but I'm a huge fan of buying things on the secondhand market. For me, it's immensely rewarding to buy something secondhand because you don't have to pay full price for the ephemeral novelty of owning something new (although we do thank early adopters for using their disposable income to enable the economies of scale that benefit the rest of us). I was recently on the hunt for a dependable Prius, so I thought, why not figure out what a fair price looks like? This project is a Prius price prediction problem that should enable buyers to know whether they are purchasing a Prius within a reasonable price range, and sellers to know whether they are pricing their vehicle reasonably. I want to 1) understand the biggest factors affecting car pricing, and 2) create an accurate pricing algorithm based on the common features you'd find in a listing.

#### Process

For this project, I scraped data from a very common classified advertisement platform using BeautifulSoup. The listing above shows a typical search result for Priuses. Notice that the link contains hyperlink parameters. The keyword 'sso' indicates whether the listings are 'sale by owner' listings, as opposed to listings by dealerships. I scraped both types of listings, tracking the type. In addition, the beginning of the link has 'sfbay', which specifies the city/area for the listing; this is the variable we change to navigate to other regions. The rest of the parameters we keep the same. In total, I scraped every single listing in the US across every available city. On this particular day, there were approximately 12.3k total listings.
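The scraping itself isn't shown here, but the URL construction it relies on can be sketched with the standard library alone. Only the 'sso' flag and the region subdomain come from the description above; the host name and the 's' offset parameter below are hypothetical stand-ins:

```python
from urllib.parse import urlencode

def build_search_url(area, sso, offset=0):
    """Build a search URL for one region: 'area' (e.g. 'sfbay') picks the
    regional site and 'sso' toggles sale-by-owner vs. dealer listings.
    The host and the 's' pagination parameter are assumptions."""
    params = {'sso': int(sso), 's': offset}
    return f"https://{area}.example-classifieds.org/search/prius?{urlencode(params)}"

# One URL per (region, listing type) pair
urls = [build_search_url(a, s) for a in ('sfbay', 'seattle') for s in (True, False)]
```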

#### EDA/Feature Construction

Each search result/post had some pre-existing features, such as a variable number of images, the text of the post itself, and a set of pre-filled car attributes on the right side, just below the location map. These included the name of the listing (which often contained the year of the car, extractable with a regex), the condition of the car, the number of cylinders, the type of drive, the fuel type, a numerical odometer reading, paint color, etc. In addition, I created a few features of my own: the number of pictures in a post (nthumbs), the number of characters in a post (postlen), the age of the car (the current year, 2018 at the time of this project, minus the car's model year), and an indicator for whether the listing was sso. Combining my engineered features with the pre-existing right-hand-column attributes gave 39 features. Furthermore, postings were scraped from 49 states, totaling 344 cities; encoding these as categorical indicators brought the total number of features to 432.
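As a minimal sketch of the year extraction mentioned above (the exact pattern used in the project isn't shown; this one assumes model years 1997-2018):

```python
import re

def extract_year(title):
    """Pull a plausible Prius model year (1997-2018) out of a listing title;
    returns None when no such year appears."""
    m = re.search(r'\b(199[7-9]|20(?:0\d|1[0-8]))\b', title)
    return int(m.group(1)) if m else None

extract_year('2012 Toyota Prius, one owner')  # -> 2012
```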

There were many postings that were made in error, placed in the wrong section, or were plain spam. While we can't remove every one of these postings by hand, we can weed out the majority through sensible filtering. If no right-hand column detailing the car's condition exists for a listing, we exclude the posting. Since no Priuses existed before their market release in 1997, we can remove all listings claiming an earlier year (I'm always reluctant to remove data, but the age of the car will play a large role in our model, and we have a sufficient quantity of data anyway, so we don't need to impute here). Furthermore, we assume listings under $1,000 and over $40,000 are not useful to us. They could have been made in error, like a single Prius wheel listed in the cars-for-sale-by-owner section (unless there's a special-edition gold-plated Prius somewhere someone is not telling me about), but in any case they are simply not within the price range we are sensibly interested in predicting.
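A sketch of this filtering in pandas, on toy rows (the column names here are assumptions; the real scraped frame isn't shown):

```python
import pandas as pd

# Toy rows standing in for scraped listings; column names are hypothetical
df = pd.DataFrame({
    'year':      [2012, 1995, 2008, None, 2010],
    'price':     [9500, 500, 12000, 8000, 55000],
    'condition': ['good', None, 'excellent', None, 'good'],
})

mask = (
    df['year'].notna()                    # listing must state a model year
    & (df['year'] >= 1997)                # no Priuses before the 1997 release
    & df['price'].between(1000, 40000)    # drop implausible prices
)
clean_df = df[mask]
```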

One feature that we know a priori will be important for pricing is the odometer reading. Plotting it against price (above) yields a curve that looks like an exponential decay. Intuitively, this makes sense: the relationship between price and mileage can't be linear, and the price won't cross into negative territory even at very high mileage (the car will always be worth its parts). Because we are going to work with a linear regression, it's immensely helpful to apply a log transformation to the odometer reading (an intuition that can be confirmed in a train-test validation run):
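The transform itself is a one-liner; the toy readings below just illustrate how the log compresses the long right tail:

```python
import numpy as np
import pandas as pd

# Toy odometer readings in miles (the real values came from the listings)
odometer = pd.Series([5_000, 40_000, 120_000, 250_000], dtype=float)

# log keeps the ordering but shrinks a 50x spread in miles to a ~4-unit
# spread, which plays much better with a linear model
log_odo = np.log(odometer)
```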

#### Feature Pruning

The bulk of the features will not be useful to us. With 12.3k rows of data and 432 features, we risk overfitting our model at approximately 28 rows per feature. In addition, we want to create an explainable model, so we will prune aggressively to surface the underlying factors that affect pricing. To prune, we can start with an aggressive method like sklearn's feature_selection module, keeping only the features whose F-test p-value falls below some threshold we determine. The code looks like this:

```python
import numpy as np
from sklearn import feature_selection

def sort_coef_feat(coefs, features, f=False, reverse=True):
    '''Given coefficients and their features, return a sorted list of
    (value, feature) pairs. f is an optional lambda used to filter pairs.'''
    coefs = np.abs(coefs)
    pairs = zip(coefs, features)
    pairs = sorted(pairs, key=lambda x: x[0], reverse=reverse)
    if f:
        pairs = list(filter(f, pairs))
    print("Ranked features-pvals:")
    print(pairs)
    return pairs

# Rank columns by the p-value of an F-test against price;
# return only those below the threshold
def f_rank(df, cols):
    X = df[cols]
    y = df['price']
    f = lambda x: x[0] < 0.001
    F, pval = feature_selection.f_regression(X, y, center=True)
    pairs = sort_coef_feat(pval, cols, f=f, reverse=False)
    return [x[1] for x in pairs]

# play_df is our data; the lists hold the feature column names
top_all = f_rank(play_df, original_feats + states_of_int + cities_of_int)
```

We create a helper function that sorts and returns a list of [(pval, feature1), (pval, feature2), ...]. In the primary f_rank function, we set a filter to keep only features with a p-value < 0.001. This seems extremely aggressive, but we are still left with 69 features (which, at this point, we could technically use, but it would be much better to trim further). Many of these features will appear statistically significant by chance but offer no better predictive insight (we go back to train-test validation for this).

In this next step, each remaining feature is individually fitted in a simple ordinary least squares regression, and any feature that yields a model with an $R^2$ greater than 0.02 (an arbitrary but very liberal threshold) is kept. We reuse our helper sorting function from above, now with a filter that keeps only $R^2$ values greater than 0.02. Here I used patsy to create a design matrix (R-style) to feed into statsmodels, which is aliased as 'sm'.

```python
# Fit each candidate feature alone and record its single-variable R^2
r_squared = []
sig_feat = []
for col in all_cols:
    y, X = patsy.dmatrices(base + 'Q("' + col + '")', data=play_df,
                           return_type='dataframe')
    sm_model = sm.OLS(y, X)
    fit = sm_model.fit()
    pval = fit.pvalues[1]
    r2 = fit.rsquared
    if pval < 0.01:
        sig_feat.append(col)
        r_squared.append(r2)

# Keep only features whose single-variable R^2 exceeds 0.02
f = lambda x: x[0] > 0.02
sort_coef_feat(r_squared, sig_feat, f=f)
```

Finally, we are left with the following features: 'age', 'odometer', 'postlen', 'nthumbs', 'sso', 'state_california', 'cond_good', 'type_wagon', 'state_washington', 'city_seattletacoma', 'cond_unknown'. It is curious that knowing whether a car is in the state of Washington, and specifically whether it is in the Seattle-Tacoma area, would affect price predictions. It could be the result of an ecosystem that favors green driving, incentivizing clean-car purchases and making Washington one of the greenest car-buying states. Similarly, this would explain why knowing whether a car is being sold in California is useful. We can't really prove these assertions with the current data, but I think they are sensible guesses.

To ensure quality features, we can use ElasticNet with GridSearchCV to inspect a preliminary model. ElasticNet is a combination of the L1 (lasso) and L2 (ridge) regularization methods: lasso adds a penalty proportional to the sum of the absolute values of the coefficients, while ridge penalizes the sum of squared coefficients. Lasso is good at driving coefficients to zero, so it acts as a kind of feature selection mechanism, but it is unstable (sensitive to the data); hence the mixture of the two is useful in most cases. It is important here to use StandardScaler to scale the features, because the L1 and L2 penalties are proportional to the size of the beta coefficients, which is hugely misleading if the features are not on a standard scale. Inspecting the coefficients of a grid-search-optimized ElasticNet lets us discard 'cond_unknown' as a nearly useless predictor.
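For reference, the objective ElasticNet minimizes (in sklearn's parameterization, with $\rho$ the l1_ratio) is:

$$\min_{\beta}\ \frac{1}{2n}\lVert y - X\beta\rVert_2^2 \;+\; \alpha\rho\lVert\beta\rVert_1 \;+\; \frac{\alpha(1-\rho)}{2}\lVert\beta\rVert_2^2$$

so $\rho = 1$ recovers the lasso penalty and $\rho = 0$ the ridge penalty.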

We're going to kill two birds with one stone by first creating an ElasticNet model, then using gridsearch to find the optimal alpha and l1 ratio. We'll then run the model once more with the optimized hyperparameters to inspect the coefficients of our model:

```python
# Try ElasticNet with GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

l1_ratio = np.linspace(.75, .95, 20)
alphas = np.linspace(.01, .02, 10)
param_grid = {'alpha': alphas, 'l1_ratio': l1_ratio}

# play_df is a subset of our data to play around with
X = play_df[cols_of_int].copy()
y = play_df['price'].copy()

regr = ElasticNet()
scaler = StandardScaler(copy=True, with_mean=True, with_std=True)
X_scaled = scaler.fit_transform(X)

grid = GridSearchCV(regr, param_grid, return_train_score=True)
grid.fit(X_scaled, y)  # fit on the scaled features so the penalties are comparable
g_scores = grid.cv_results_
print(grid.best_estimator_)
for item in zip(g_scores['mean_train_score'], g_scores['params']):
    print(item)
```

#### Model Comparison

Finally, we arrive at the last step of our modeling. It's important to understand a model's performance relative to other models. Why? Because by varying features from model to model, we can see the explicit trade-offs or gains contributed by individual features. In addition, while we have a kind of ground truth in the test set, we don't actually know the true distribution (that's what we're trying to estimate), so we can't calculate the distance between our model and the true distribution; BUT we can compare models to each other and see which is a better estimate of the data we have. Also, more features are not always better; we will use various information criteria to understand this (some of which penalize or adjust for a high number of parameters to discourage overfitting). We start with the final set of proposed features below and, one by one, create a model excluding a single feature at a time (keeping age and odometer, features we know are extremely important to the model):

```
m1:  price ~ age + odometer + nthumbs + postlen + sso + state_california + cond_good + type_wagon + state_washington + city_seattletacoma
m2:  price ~ age + odometer + nthumbs + postlen + sso + state_california + cond_good + type_wagon + state_washington
m3:  price ~ age + odometer + nthumbs + postlen + sso + state_california + cond_good + type_wagon
m4:  price ~ age + odometer + nthumbs + postlen + sso + state_california + cond_good
m5:  price ~ age + odometer + nthumbs + postlen + sso + state_california
m6:  price ~ age + odometer + nthumbs + postlen + sso
m7:  price ~ age + odometer + nthumbs + postlen
m8:  price ~ age + odometer + nthumbs
m9:  price ~ age + odometer
m10: price ~ age + odometer + nthumbs + postlen + state_california + type_wagon + state_washington
```

Here, I present m10 as our final model; we'll walk through the logic of how we arrive there. Let's first take a look at three different metrics for model comparison. The first is a measure called deviance, a distance measure between two probability distributions. Calculating it in code looks like this:

```python
# Fit a statsmodels OLS regression to our training data
X_test = pd.DataFrame(scaler.transform(X_test), index=X_test.index,
                      columns=X_test.columns)
y_tr, X_tr = patsy.dmatrices(formula, data=X_train, return_type='dataframe')
sm_model = sm.OLS(y_tr, X_tr)
fit = sm_model.fit()

# Prepare the test matrix for prediction and comparison
y_true, X_te = patsy.dmatrices(formula, data=X_test, return_type='dataframe')

# Predictions
y_preds = fit.predict(X_te)
y_true = np.array(y_true).flatten()
y_preds = np.array(y_preds)

# Calculate out-of-sample deviance
out_SSE = np.sum(np.square(y_true - y_preds))
out_std = np.sqrt(out_SSE / len(y_preds))
log_density = pd.DataFrame()
log_density['y_true'] = y_true
log_density['y_preds'] = y_preds
log_density.loc[:, 'std'] = out_std
log_density['likelihood'] = log_density.apply(
    lambda row: norm.logpdf(row['y_true'], row['y_preds'], row['std']), axis=1)

deviance = -2 * sum(log_density['likelihood'])
deviances.append(deviance)  # deviances collects one value per candidate model
```

To calculate the out-of-sample deviance, we first need the log likelihood, i.e. the probability of our data given the parameters we estimated. For this, we ask: given a predicted value, what is the log probability density of the corresponding true value (with the calculated standard deviation)? That's norm.logpdf(row['y_true'], row['y_preds'], row['std']). Summing those log densities over all observations gives the log likelihood, and the deviance is -2 * log_likelihood (see below for a quick explanation of the -2). Of course, we could have just asked for the likelihood from our statsmodels object with fit.llf (or, for GLM fits, fit.deviance), but I wanted to be explicit in these calculations. A second note: you can also use the AIC measure that comes with the statsmodels object (fit.aic) to get an *estimated out-of-sample deviance*. AIC takes the in-sample deviance and adds a 2k penalty term, where k is the number of parameters in the model (in general, every ____ Information Criterion (_IC) adds some specific penalty term to the deviance). It's a poor man's in-sample estimate of the out-of-sample deviance (useful when you have so little data that you can't do a train-test split), but there are strict model assumptions behind AIC, and it fails as the number of parameters grows large relative to the dataset. Each statistical package may also offset its AIC calculation by some constant, since additive constants in the log likelihood cancel in model comparisons and packages differ in which ones they keep.
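Written out, with $D_{\text{in}}$ the in-sample deviance and $k$ the number of parameters:

$$\mathrm{AIC} = -2\ln\hat{L} + 2k = D_{\text{in}} + 2k$$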

*Note: The -2 in front of the likelihood term has to do with the deviance having a Chi-squared distribution under regularity conditions. In OLS, the errors are assumed iid (independent and identically distributed), thus the likelihood can be calculated as follows:
$$ L = \left (\frac{1}{\sqrt{2\pi}\sigma_\epsilon} \right)^n \exp\left[{-\frac{1}{2}\chi^2} \right] $$
$$ \ln L = -n\ln(\sqrt{2\pi}\sigma_\epsilon) - \frac{1}{2}\chi^2 $$
$$ -2\ln L = 2n\ln(\sqrt{2\pi}\sigma_\epsilon) + \chi^2 $$
$$ -2\ln L = n\ln(2\pi\sigma_\epsilon^2) + \chi^2 $$
In actual use, since the -2 is a scaling factor, it doesn't much matter whether we include it for model comparisons, as long as we are consistent.*

The other two metrics are the RMSE (root mean squared error) and the $R^2$ score:

$$RMSE=\sqrt{\frac{SSE}{n}}$$

$$R^2=1-\frac{SSE}{TSS}$$

$$SSE=\sum_{i=1}^n (y_i - \hat{y}_i)^2$$

$$TSS=\sum_{i=1}^n (y_i - \bar{y})^2$$

$$\bar{y}=\frac{1}{n}\sum_{i=1}^n y_i$$

$$R^2=1-\frac{n \times MSE} {\sum_{i=1}^n (y_i - \bar{y} )^2}$$

In short, the RMSE is roughly the average deviation of the estimates from the observations. Its benefit is that its unit is the same as the response variable (in our case, dollars), and it is not restricted to a certain maximum value. Lower is better.

The $R^2$ value, on the other hand, is scaled from 0 to 1 and measures how well the features (independent variables) explain the variability in the target (dependent variable), with 0 being not at all and 1 being completely. There are various problems with using $R^2$ alone (check out Anscombe's quartet). In addition, $R^2$ always improves with additional features, so it can't account for overfitting. To address these problems, we also look at the adjusted $R^2$:

$$ R^2_{adj} = 1 - \frac {(1-R^2)(n-1)}{n-k-1} $$

Here, n is the total number of observations and k is the number of parameters in the model.
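These quantities are mechanical to compute; a small helper, assuming plain numpy arrays:

```python
import numpy as np

def regression_metrics(y_true, y_pred, k):
    """Return (RMSE, R^2, adjusted R^2) for a model with k predictors."""
    n = len(y_true)
    sse = np.sum((y_true - y_pred) ** 2)
    tss = np.sum((y_true - np.mean(y_true)) ** 2)
    rmse = np.sqrt(sse / n)
    r2 = 1 - sse / tss
    r2_adj = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    return rmse, r2, r2_adj
```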

#### Final Model

Here's a scaled view of different model performances across these three metrics:

Moving from m1 -> m2, m4 -> m5, and m6 -> m7, we see that removing 'city_seattletacoma', 'cond_good', and 'sso', respectively, barely moves these metrics. For city_seattletacoma this makes sense: it is highly correlated with state_washington (the correlation only goes one way, but the Seattle-Tacoma area encompasses a huge share of Washington's listings). In addition, even though sso had an impact, according to our metrics the base price was not substantially different between dealer and owner sales (the benefit of private versus dealer vehicle sales tax varies across states and may average out). We are left with seven core features, which form the following model:

```
m10: price ~ age + odometer + nthumbs + postlen + state_california + type_wagon + state_washington
```

This model yields an in-sample $R^2$ of 0.841 and an out-of-sample (holdout set) $R^2$ of 0.836. The closeness of the in-sample and out-of-sample scores suggests we managed not to overfit the model.

Checking the correlation matrix between our variables one last time, we see (as expected) that age and odometer are the only highly correlated pair. Removing one of them is not a good choice here: the train-test scores fall dramatically, which means that even though the multicollinearity will complicate interpretation of their coefficients, each contains novel information that the model finds informative. The rest of the features appear relatively independent.
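The check itself is one call to .corr(); on synthetic data built to mimic the situation (the generating process below is an assumption, not the scraped data), age and odometer come out strongly correlated while nthumbs does not:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
age = rng.uniform(0, 15, n)                          # years
odometer = 12_000 * age + rng.normal(0, 20_000, n)   # miles track age closely
nthumbs = rng.integers(0, 24, n).astype(float)       # unrelated to age

corr = pd.DataFrame({'age': age, 'odometer': odometer,
                     'nthumbs': nthumbs}).corr()
```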

#### Random Forest Approach

As a final note, we can also run a quick non-linear model with a less aggressive feature set. The goal here is purely to optimize the prediction score. The following code creates a RandomForestRegressor with 100 trees. Without much hyperparameter optimization, our ensemble method achieves a 0.94 $R^2$ score, a vast improvement due to its non-linearity. Tree models are robust to monotone feature scaling, so neither scaling nor the log transformation we did earlier is necessary, though I kept a transformation of the odometer values anyway. The trade-off here is clear: ensemble methods can capture non-linearities and thus be more accurate when the data contains them, but it is practically impossible to interpret the ensemble average of 100 trees, each with its own node splits.

```python
from sklearn.ensemble import RandomForestRegressor

cols_of_int = ['age', 'odometer', 'postlen', 'nthumbs', 'sso',
               'state_california', 'cond_good', 'type_wagon',
               'state_washington', 'city_seattletacoma', 'cond_unknown']

forest = RandomForestRegressor(n_estimators=100)

X_train = data[cols_of_int].copy()
y_train = data['price'].copy()
X_holdout = test[cols_of_int].copy()
y_holdout = test['price'].copy()

# Transform the odometer values (standardize, then a negative exponential)
try:
    X_train['odometer'] = X_train['odometer'].apply(float)
    X_holdout['odometer'] = X_holdout['odometer'].apply(float)
    odo_mean = X_train['odometer'].mean()
    odo_sd = X_train['odometer'].std()
    X_train['odometer'] = (X_train['odometer'] - odo_mean) / odo_sd
    X_train['odometer'] = np.exp(-1 * X_train['odometer'])
    X_holdout['odometer'] = (X_holdout['odometer'] - odo_mean) / odo_sd
    X_holdout['odometer'] = np.exp(-1 * X_holdout['odometer'])
except (ValueError, KeyError):
    pass  # leave the odometer untransformed if conversion fails

forest.fit(X_train, y_train)
print(forest.score(X_holdout, y_holdout))
# Holdout result without hyperparameter tuning: 0.93989900926606
```

As usual, all of this code can be found on my GitHub. Send me an e-mail with any questions you may have about my work!