What Was The Motivation Behind Real Time?

Recently, housing prices in Toronto have been highly volatile. Prices jumped by 30% over the 2021 calendar year, and by more than 25% between July 2021 and January 2022 alone. With such unpredictability in the market, brokerages and realtors have resorted to giving “hunch” price estimates, while clients worry about overpaying for their dream home. Machine learning offers a promising alternative for housing price estimates: one that is data-driven, accurate, and constantly up-to-date. This is the premise of our project, RealTime.

RealTime and RealValue

RealTime was conceived as a continuation of last year’s project, RealValue, with many new improvements. RealValue consisted of a custom neural network model trained on a dataset of California houses, with transfer learning applied to a handcrafted dataset of 160 Toronto houses that we assembled ourselves.

Clients could fill in a “10-second form” providing basic information (postal code, number of bedrooms and bathrooms, area, and four images of the house), and RealValue’s model would predict the house price with a mean absolute percentage error (MAPE) of 15%. While this is a decent result, the handcrafted dataset we trained on did not include recent houses and was just a snapshot in time.

A constantly updating dataset is needed to provide in-sync, relevant price estimates. To this end, this year’s RealTime project introduces a new dataset workflow that provides extensive, up-to-date data from a variety of sources. In addition, RealTime’s improved machine learning model predicts housing prices with a MAPE of 9%, reducing error by a factor of more than 1.6.
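For reference, MAPE measures the average relative gap between predicted and actual prices. A minimal sketch of the metric (the helper below is illustrative, not RealTime’s actual evaluation code):

```python
def mape(actual_prices, predicted_prices):
    """Mean absolute percentage error, expressed as a percentage."""
    errors = [
        abs(actual - predicted) / actual
        for actual, predicted in zip(actual_prices, predicted_prices)
    ]
    return 100 * sum(errors) / len(errors)

# Hypothetical prices in CAD: a 9% MAPE means predictions are off
# by about 9% of the true sale price on average.
actual = [1_000_000, 800_000]
predicted = [910_000, 872_000]
print(round(mape(actual, predicted), 2))  # → 9.0
```

Going from 15% to 9% MAPE means the average estimate on a $1,000,000 home tightens from roughly $150,000 of error to roughly $90,000.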

RealTime's Workflow

Data collection is an iterative process: we fetch house data through an API request, save it into our database, and periodically enrich each record with more detailed house information and nearby amenities. The pipeline is fully automated and runs every 12 hours, ensuring our database always provides up-to-date, reliable information. To automate it, we use GitHub Actions, with Docker container actions as our method of choice for creating the builds. The actions run on a Google Cloud virtual machine, which keeps our hardware consistent and our database updates reliable. Our database of choice is MongoDB, due to its fast data access for read/write operations.
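The fetch/save/enrich cycle can be sketched as follows. Note that `fetch_listings()` and `fetch_details()` are hypothetical stand-ins for the listing API, and a plain dict stands in for the MongoDB collection; this is an illustrative sketch, not the production pipeline.

```python
# Sketch of one 12-hour update cycle. In production, the upsert into
# MongoDB would look like:
#   collection.update_one({"_id": listing_id}, {"$set": fields}, upsert=True)

def fetch_listings():
    """Placeholder: one API page of basic listing data."""
    return [{"_id": "mls-001", "price": 950_000, "bedrooms": 3}]

def fetch_details(listing_id):
    """Placeholder: slower per-house enrichment call."""
    return {"bathrooms": 2, "amenities": ["school", "park"]}

def run_cycle(db):
    # Step 1: fetch basic listing data and upsert it into the database.
    for listing in fetch_listings():
        db.setdefault(listing["_id"], {}).update(listing)
    # Step 2: enrich stored records with detailed info and amenities.
    for listing_id, record in db.items():
        record.update(fetch_details(listing_id))

db = {}
run_cycle(db)
```

Because each cycle upserts rather than inserts, re-running every 12 hours refreshes existing records instead of duplicating them.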

What Does RealTime's Dataset Consist Of?

RealTime uses a NoSQL database so that records can be stored partially, before every field is filled in. The dataset consists of 36,000 partial homes, a 225x increase over last year, of which 17,000 are mostly complete; there are also over 80 amenity choices. Numerical attributes include bedrooms, bathrooms, longitude and latitude, date, and parking spaces. Categorical attributes include house type, basement type, parking type, and heating type. Finally, the variable-length numerical attributes are the amenity choices, such as nearby restaurants, hospitals, and so on.
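To make the partiality concrete, a document in the collection might look like the following. The field names and values here are hypothetical, chosen to illustrate the attribute types described above rather than to reproduce the exact schema.

```python
# Illustrative partial home document (field names are hypothetical).
partial_home = {
    "_id": "mls-001",
    "price": 950_000,
    # Numerical attributes
    "bedrooms": 3,
    "bathrooms": 2,
    "lat": 43.6532,
    "long": -79.3832,
    "parking_spaces": 1,
    # Categorical attributes
    "house_type": "detached",
    "heating_type": "forced air",
    # Variable-length amenity counts; a "mostly complete" home has
    # most of its 80+ amenity fields filled in.
    "amenities": {"restaurants": 12, "hospitals": 1},
    # Missing fields (e.g. basement_type) are simply absent, which is
    # exactly the partiality a NoSQL schema permits.
}
```

A SQL table would force every one of these columns to exist for every row; the document model lets the 12-hour enrichment passes fill fields in gradually.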