Real Estate Decision Tree Model to Predict Particular Home Sales
Overview
The client is a real estate investment firm that primarily purchases a particular subset of homes to flip and re-sell. The client cold-calls the homeowners and was looking to use a machine learning algorithm to determine which homeowners had the highest likelihood of being interested in selling.
Data was collected from PropStream.
Properties that meet certain criteria are fed into the model, and the model predicts yes or no for each property: is it a good target for a sale?
About the Model
We employed a random forest model. Each “tree” in the forest is a decision tree based on identifying patterns in the data that make a given house more or less likely to have sold in a particular way.
A simplified example might be that any house with more than 2 outstanding loans is a good lead, so this particular tree would make a simple decision: given an input house, if it has 3 or more outstanding loans, call it a 1 or a good lead. If it has 2 or fewer, call it a 0 or a bad lead. The actual decision trees have more steps than this, but that’s the basic idea.
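The single-step tree described above can be sketched in a few lines of Python. This is only an illustration of the idea, not one of the trees from the actual model; the field name `outstanding_loans` is a hypothetical stand-in for whatever the corresponding column is called in the data.

```python
def loan_count_tree(house):
    """Toy one-step decision tree: returns 1 (good lead) or 0 (bad lead)."""
    # "outstanding_loans" is a hypothetical field name used for illustration.
    return 1 if house["outstanding_loans"] >= 3 else 0


print(loan_count_tree({"outstanding_loans": 3}))  # 3 or more loans: good lead
print(loan_count_tree({"outstanding_loans": 2}))  # 2 or fewer loans: bad lead
```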
Next, we take the outputs from all of the trees and average them. Any one tree might make some unusual classification decisions, but by having hundreds of trees we can reduce the impact of any one of them. Finally, if that average is more than 0.5, we identify the house as a good lead.
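As a rough sketch of the averaging step, the snippet below combines three toy trees and applies the 0.5 cutoff. The trees, field names, and thresholds are all made up for illustration; the real forest has hundreds of trees, each with more steps.

```python
def forest_predict(trees, house, threshold=0.5):
    """Average the 0/1 votes of all trees; above the threshold means good lead."""
    votes = [tree(house) for tree in trees]
    average = sum(votes) / len(votes)
    return 1 if average > threshold else 0


# Three hypothetical one-step trees (field names and cutoffs are assumptions).
trees = [
    lambda h: 1 if h["outstanding_loans"] >= 3 else 0,
    lambda h: 1 if h["estimated_equity"] < 50000 else 0,
    lambda h: 1 if h["square_footage"] < 1500 else 0,
]

house = {"outstanding_loans": 3, "estimated_equity": 80000, "square_footage": 1200}
# Votes are 1, 0, 1, so the average is about 0.67 and the house is a good lead.
print(forest_predict(trees, house))
```

Here a single dissenting tree does not change the outcome, which is the point of averaging over many trees.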
The model needs to be “trained” so that it can accurately predict good prospects. Training involves taking about 75% of the data and giving the model the correct “answer” for these houses. In this case, the answer is whether a house sold in a particular way or not.
Then the model applies what it learned to the remaining “test” data (about 25%), where it has to make its own predictions.
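A minimal sketch of how such a split could be made is shown below, using only the Python standard library. The function name, the fixed seed, and the use of a random shuffle are assumptions for illustration; the write-up does not say how the rows were actually divided.

```python
import random


def train_test_split(rows, train_fraction=0.75, seed=0):
    """Shuffle the rows and split them into a training part and a test part."""
    shuffled = list(rows)                  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


rows = list(range(100))  # stand-in for 100 property records
train, test = train_test_split(rows)
print(len(train), len(test))
```

Shuffling before cutting keeps the training and test sets from reflecting whatever order the records happened to be exported in.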
Coding the Model Input
The input into the model is a simplified form of the data output from PropStream. Some fields (address, owner’s name) can be excluded entirely, since they are not numeric values and should have no impact on the decision-making process.
Characteristics of a property that aren’t numeric are turned into variables that take on a value of either 0 or 1. For instance, there are separate variables for certain geographic locations of interest. Characteristics that are already numeric (square footage, estimated equity, etc.) are left alone.
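The encoding described above might look something like the following sketch. The record layout, the `city` field, and the location names are hypothetical; the actual PropStream columns and locations of interest are not listed here.

```python
def encode_property(record, locations_of_interest):
    """Keep numeric fields as-is and add one 0/1 variable per location."""
    features = {}
    for key, value in record.items():
        # Text fields such as address or owner name are dropped here.
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            features[key] = value
    for location in locations_of_interest:
        # "city" is a hypothetical field name for the property's location.
        features[f"in_{location}"] = 1 if record.get("city") == location else 0
    return features


# Made-up example record, for illustration only.
record = {
    "address": "123 Example St",
    "owner_name": "Jane Doe",
    "city": "Springfield",
    "square_footage": 1400,
    "estimated_equity": 90000,
}
print(encode_property(record, ["Springfield", "Shelbyville"]))
```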
Output
When we feed the test data into the model, it will make a prediction for each house in the test data. There are four possible outcomes:
| | Sold In a Particular Way | Didn’t Sell That Way |
| Predict Sell In a Particular Way | Correct Prediction | Incorrect Prediction |
| Predict Didn’t Sell That Way | Incorrect Prediction | Correct Prediction |
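To make the four outcomes concrete, the sketch below tallies them for a handful of made-up labels, where 1 means "sold in the particular way" and 0 means "didn’t sell that way". The data and the function name are illustrative only and are not results from the project.

```python
from collections import Counter


def outcome_counts(actual, predicted):
    """Count each (predicted, actual) pair, one count per cell of the table."""
    return Counter(zip(predicted, actual))


# Made-up labels for six houses, for illustration only.
actual = [1, 0, 1, 0, 0, 1]
predicted = [1, 1, 0, 0, 1, 1]

counts = outcome_counts(actual, predicted)
print("predicted sell, did sell:      ", counts[(1, 1)])
print("predicted sell, didn't sell:   ", counts[(1, 0)])
print("predicted no sell, did sell:   ", counts[(0, 1)])
print("predicted no sell, didn't sell:", counts[(0, 0)])
```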
The houses that the model classified as unlikely to sell in the desired way, but that actually did sell that way, are of no use to us as leads. They help to calibrate the model, but they have already sold, so there is no action we can take on the basis of the prediction.
The houses that we are most interested in are the ones that are predicted to sell in the desired way but haven’t (yet). These are our best guesses for leads, and that is the list of houses that we provided.
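Pulling that list out of the predictions amounts to a simple filter, sketched below with made-up identifiers. The function and variable names are hypothetical; the only logic taken from the description above is "predicted to sell, but has not sold".

```python
def select_leads(houses, predictions, sold_flags):
    """Return houses predicted to sell (1) that have not actually sold (0)."""
    return [
        house
        for house, predicted, sold in zip(houses, predictions, sold_flags)
        if predicted == 1 and sold == 0
    ]


# Made-up identifiers and labels, for illustration only.
houses = ["A", "B", "C", "D"]
predictions = [1, 1, 0, 0]
sold_flags = [1, 0, 1, 0]

print(select_leads(houses, predictions, sold_flags))  # only house B qualifies
```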