What property features influence the rental rate of homestays in Boston, MA?

Photo by Michael Browning on Unsplash


This blog post is the first project requirement of Udacity’s Data Science Nanodegree program. For this project, I chose to work on the Airbnb Boston data, which is publicly available through Kaggle. The dataset includes the following files:

1. Listings.csv: listing information, host information, and review scores. The listings file contains 3,585 records and 95 columns.

2. Reviews.csv: review comments. The reviews file contains 68,275 records and 6 columns.

3. Calendars.csv: availability and price for each day. The calendar file contains 1,308,890 records and 4 columns.
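As a quick sanity check on the record and column counts above, the files can be loaded and their shapes compared. This is a minimal sketch, assuming pandas and assuming the file names `listings.csv`, `reviews.csv`, and `calendar.csv`; the names in your download may differ, and `load_tables` and `summarize_shapes` are hypothetical helper names.

```python
from pathlib import Path

import pandas as pd

# Assumed file names; adjust them to match the files in your download.
DEFAULT_FILES = {
    "listings": "listings.csv",
    "reviews": "reviews.csv",
    "calendar": "calendar.csv",
}


def load_tables(data_dir, files=DEFAULT_FILES):
    """Read each CSV listed in `files` from `data_dir` into a DataFrame."""
    data_dir = Path(data_dir)
    return {name: pd.read_csv(data_dir / fname) for name, fname in files.items()}


def summarize_shapes(tables):
    """Return one row per table with its record and column counts."""
    rows = [
        {"table": name, "records": df.shape[0], "columns": df.shape[1]}
        for name, df in tables.items()
    ]
    return pd.DataFrame(rows).set_index("table")
```

Calling `summarize_shapes(load_tables("data"))` on a folder holding the three files should reproduce the counts listed above if the download matches the version used here.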

For this project, I analyzed only the Listings and Calendars data to answer the following questions:

1. Which features or variables indicate a popular listing?

2. Which available data would indicate a high review score?

3. Which property features could lead to higher rental rates?

Part 1: Which features or variables indicate a popular listing?

The top 5 neighborhoods with the most listings are Jamaica Plain (9.57%), South End (9.09%), Back Bay (8.42%), Fenway (8.09%) and Dorchester (7.50%).

Apartment is the most common property type, accounting for 72.94% of listings.

Real Bed is the most common bed type, at 96.32% of listings.

72.61% of hosts have a verified identity.

83.43% of the listed properties are not instantly bookable.

44.13% of listings have a strict cancellation policy, while flexible and moderate policies account for 27.87% and 25.63% respectively.
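Percentages like the ones above can be computed as the share of listings in each category. The snippet below is a minimal sketch, assuming pandas; `category_share` is a hypothetical helper, `cancellation_policy` is an assumed column name, and the sample rows are made up purely to show the call.

```python
import pandas as pd


def category_share(df, column):
    """Percentage of rows per category in `column`, largest share first."""
    return (df[column].value_counts(normalize=True) * 100).round(2)


# Tiny made-up sample, only to illustrate the call; this is not the real data.
sample = pd.DataFrame(
    {"cancellation_policy": ["strict", "strict", "flexible", "moderate"]}
)
shares = category_share(sample, "cancellation_policy")
print(shares)
```

The same call can be repeated for each categorical column of interest, such as neighborhood, property type, or bed type, whatever those columns are named in the listings file.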

Part 2: Which available data would indicate a high review score?

The heatmap doesn’t show any significant correlation with high review scores. Before digging into the data for this part, I assumed that price, property features, and host responsiveness would correlate with the review scores, but the data doesn’t support that assumption.

Heatmap of the correlation between review score and property features
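The numbers behind a heatmap like this come from a correlation matrix. The sketch below is one way to pull out the correlations with the review score, assuming pandas and assuming the score lives in a column named `review_scores_rating`; `review_score_correlations` is a hypothetical helper, not the notebook's code.

```python
import pandas as pd


def review_score_correlations(listings, target="review_scores_rating"):
    """Pearson correlation of every numeric column with `target`."""
    # Only numeric columns can be correlated; text columns are skipped.
    numeric = listings.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    # Order by absolute strength so the strongest relationships come first.
    return corr.reindex(corr.abs().sort_values(ascending=False).index)
```

Values near zero for every feature would match the reading above that nothing stands out. The full matrix from `numeric.corr()` is what a plotting call such as `seaborn.heatmap` would take as input.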

Part 3: Which features could lead to higher rental rates?

Looking at the calendar data, the graph shows that the fall season (October, November) has the highest rates. Prices decrease as the weather changes from December to March, then start to pick up again from spring through summer.
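A seasonal curve like this can be derived by averaging the daily prices per month. The following is a minimal sketch, assuming pandas, assuming the calendar file has `date` and `price` columns, and assuming prices are stored as strings such as "$1,250.00"; `monthly_mean_price` is a hypothetical helper name.

```python
import pandas as pd


def monthly_mean_price(calendar):
    """Average nightly price per calendar month (1 = January)."""
    # Days without a price are dropped before averaging.
    cal = calendar.dropna(subset=["price"]).copy()
    # Assumption: prices are strings like "$1,250.00", so strip "$" and ",".
    cal["price"] = (
        cal["price"].astype(str).str.replace(r"[$,]", "", regex=True).astype(float)
    )
    cal["month"] = pd.to_datetime(cal["date"]).dt.month
    return cal.groupby("month")["price"].mean()
```

Plotting the returned series with `.plot()` gives a month-by-month view of the kind described above.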

The graphs below plot the popular attributes shown above against the mean price. It looks like neighborhood, property type, room type, bed type, and cancellation policy are correlated with higher prices, while instant bookability and verified host identity don’t have much effect on price.
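The comparison behind these plots is a grouped mean: average price per category of each attribute. Here is a minimal sketch, assuming pandas and assuming the price column has already been converted to numbers; `mean_price_by` is a hypothetical helper, and the listing count is included so that categories with very few listings are not over-read.

```python
import pandas as pd


def mean_price_by(listings, column, price_col="price"):
    """Mean price and listing count per category, highest mean first."""
    summary = listings.groupby(column)[price_col].agg(
        mean_price="mean", listings="count"
    )
    return summary.sort_values("mean_price", ascending=False)
```

A wide gap in `mean_price` between categories suggests an association with price, while near-equal means are consistent with an attribute having little effect.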

The graphs above point to higher prices, but they are based on correlation only, so I also built a predictive model. The model’s R² score of 0.27 is low, meaning the model does not fit the data well, so its results should be read with caution. Still, the model indicates that room type could influence rental price. Consistent with the plots above, neighborhood, property type, and bed type may also influence rental price.
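For readers who want to try a similar model, here is a rough sketch of a linear regression scored with R². It assumes scikit-learn and pandas, a numeric `price` column, and a 70/30 train/test split; the post does not say which model or settings were used, so `fit_price_model` and its parameters are illustrative assumptions, not the notebook's code.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split


def fit_price_model(listings, features, target="price", seed=42):
    """Fit a linear regression on one-hot encoded features and score it."""
    # One-hot encode categorical features; numeric features pass through.
    X = pd.get_dummies(listings[features], drop_first=True, dtype=float)
    y = listings[target]
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=seed
    )
    model = LinearRegression().fit(X_train, y_train)
    # R² on held-out data: 1.0 is a perfect fit, values near 0 a poor one.
    score = r2_score(y_test, model.predict(X_test))
    # Pair each encoded column with its coefficient, largest effect first.
    coefs = pd.Series(model.coef_, index=X.columns)
    coefs = coefs.reindex(coefs.abs().sort_values(ascending=False).index)
    return model, score, coefs
```

The sorted coefficients are one way to see which encoded features the model associates with higher prices. With an R² as low as 0.27, they are better treated as hints than as firm conclusions.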


This project followed the steps from the lessons on the CRISP-DM process (Cross-Industry Standard Process for Data Mining). The CRISP-DM process has 6 phases.

1. Business Understanding: I posed three questions that relate to the business or real-world context of the data.

2. Data Understanding: I explored the data structure, contents, and data types to see how the data could answer the business questions.

3. Prepare Data: I used several methods to assess and clean up the data. I removed outliers and columns with a large number of missing values, filled in some missing data, and transformed categorical variables so I could use them to build the predictive model.

4. Data Modeling: I followed the thought process and modelling technique from the course lessons.

5. Evaluate the Results: I evaluated the results at each step and documented them in the Jupyter Notebook.

6. Deploy: For this particular project, I haven’t deployed or applied any part of the code or model yet.
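The data preparation described in step 3 could look roughly like the sketch below. It assumes pandas and a numeric `price` column. The 50% missing-value threshold, the 1.5 × IQR outlier rule, and the mean imputation are illustrative choices of mine, since the post does not state which methods were used, and `prepare_listings` is a hypothetical helper name.

```python
import pandas as pd


def prepare_listings(df, target="price", max_missing=0.5):
    """Clean a listings table for modelling; all thresholds are illustrative."""
    out = df.copy()
    # 1. Drop columns where more than `max_missing` of the values are missing.
    out = out.loc[:, out.isna().mean() <= max_missing]
    # 2. Drop rows without a target, then remove outliers with the IQR rule.
    out = out.dropna(subset=[target])
    q1, q3 = out[target].quantile([0.25, 0.75])
    iqr = q3 - q1
    out = out[out[target].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)].copy()
    # 3. Fill the remaining numeric gaps with each column's mean.
    num_cols = out.select_dtypes(include="number").columns
    out[num_cols] = out[num_cols].fillna(out[num_cols].mean())
    # 4. One-hot encode non-numeric columns, with an extra flag for missing values.
    cat_cols = list(out.select_dtypes(exclude="number").columns)
    return pd.get_dummies(out, columns=cat_cols, dummy_na=True, dtype=float)
```

The returned table is fully numeric, which is what most regression models expect as input.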

The analysis covered one year of Airbnb Boston data, from Sept 2016 to Sept 2017. The results showed that room type could influence the rental rate. In addition to room type, other features such as neighborhood, property type, and bed type could also influence the rental rate. Two other variables included in the analysis, instant bookability and verified host identity, don’t appear to have an impact on prices. Finally, I looked at features that could correlate with higher review scores, but the data doesn’t show any significant correlation.

I hope this blog post provides useful insights. Please check out my GitHub here.

Disclaimer: This analysis is a project requirement for a Data Science course and doesn’t represent insights from Airbnb.