Apartment Data Web Scraping

This is a coursework of Programming for Business Analytics I took this semester. The programming language is Python.
———
I want to get name, address, number of bedrooms and price of each apartment which is listed on the first four pages of the website https://www.apartmenthomeliving.com/new-york-city-ny?page= and use DataFrame to represent a table of data with rows and columns. Then, output as a csv file.

The difference in url of each page is only at the page=

On each page, under <div id="property_list">, contains all names, addresses, number of bedrooms and prices of each property I want of that page.

I look for property name by class="property_link"

I look for property address by class="address"

I look for number of bedrooms by class="bedrooms"

I look for prices by class="prices"

Code

# Open the webpage that we are going to scrape.
import webbrowser
webbrowser.open_new('https://www.apartmenthomeliving.com/new-york-city-ny?page=')

# The requests module allows you to send HTTP requests using Python.
# The HTTP request returns a Response Object with all the response data (content, encoding, status, etc).
import requests

# Beautiful Soup is a Python library for pulling data out of HTML and XML files.
# It creates a parse tree for parsed pages that can be used to extract data from HTML.
from bs4 import BeautifulSoup

# Pandas is built on the Numpy package and its key data structure is called the DataFrame.
# DataFrames allow you to store and manipulate tabular data in rows of observations and columns of variables.
import pandas as pd

# The User-Agent request header is a characteristic string that lets servers and network peers 
# identify the application, operating system, vendor, and/or version of the requesting user agent.
# Chrome: inspect --> network --> All --> website address --> Headers --> User-Agent
headers = {'User-Agent':'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_0_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'}

# Create an empty DataFrame with column names.
column_names = ["name", "address", "beds", "price"]
house = pd.DataFrame(columns = column_names)

# From page 1 to 4.
for i in range (1,5):

    # By observation, changes the page number in url changes to different page.
    url = "https://www.apartmenthomeliving.com/new-york-city-ny?page="+str(i)
    
    # To get or retrieve data from the link, and send out the header to the website.
    page = requests.get(url, headers=headers)
    
    # Use BeautifulSoup to creates a parse tree.
    soup = BeautifulSoup(page.content, 'html.parser')
    
    # The value of the id attribute must be unique within the HTML document.
    posts = soup.find(id = "property_list")
    
    # Find all name on page i at one time.
    name_tags = posts.select(".property_link")
    name = [nt.get_text() for nt in name_tags]
    
    # Find all address on page i at one time.
    address_tags = posts.select(".address")
    address = [at.get_text() for at in address_tags]
    
    # Find all beds of the page i at one time.
    beds_tags = posts.select(".bedrooms")
    beds = [pt.get_text() for pt in beds_tags]
    
    # Find all price of the page i at one time.
    price_tags = posts.select(".prices")
    price = [pt.get_text() for pt in price_tags]
    
    # Convert lists to DataFrame.
    df = pd.DataFrame({
        "name":name,
        "address":address,
        "beds":beds,
        "price":price
    })
    
    # Append df to house.
    house = house.append(df, ignore_index=True)
  
# Output a csv file.
house.to_csv('output.csv', mode='w')

Written on December 27, 2020