Python : Using Beautiful Soup

This post is about using the BeautifulSoup and requests modules in a Python application for web scraping. We will write a simple weather application that uses the requests module to download an HTML page from a website and then uses BeautifulSoup to scrape it and capture the weather info at a particular zip code. After reading this post you should have a basic idea of how to use Beautiful Soup and requests in Python.

Note: I have recently updated this post to use BS4 and requests.

Beautiful Soup sits atop an HTML or XML parser, providing Pythonic idioms for iterating, searching, and modifying the parse tree. Requests is a module that makes sending HTTP requests simple. So the first step is to install these packages and make them part of our project. We will use pip to install them.

    pip install beautifulsoup4    
    pip install requests
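To get a quick taste of what "searching and modifying the parse tree" means before we build the app, here is a minimal, self-contained snippet; the HTML in it is made up for illustration:

```python
from bs4 import BeautifulSoup

# a tiny inline document, just for illustration
soup = BeautifulSoup('<ul><li>rain</li><li>sun</li></ul>', 'html.parser')

# searching: find_all() returns every matching tag
items = [li.get_text() for li in soup.find_all('li')]
print(items)  # ['rain', 'sun']

# modifying: the parse tree can be edited in place
soup.li.string = 'snow'
print(soup.ul)  # <ul><li>snow</li><li>sun</li></ul>
```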

import requests
from bs4 import BeautifulSoup
import collections

We will be using Weather Underground to determine the conditions at a particular location. For the sake of our application's goal, instead of using their web service we will download the HTML page and parse it to determine the weather. If you append a zip code to the base URL, Weather Underground will display the weather at that location.

As a first step, let us get the zip code from the user.

def get_zip():
    zip_code = input("Enter your zip : ")
    return zip_code

Now download the HTML from the URL using requests, so we can parse it with BS4.

def get_url_html(url):
    response = requests.get(url)
    # response.status_code is 200 (OK) when the request succeeds
    # response.text holds the raw HTML of the page
    return response.text
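A slightly more defensive sketch of this download step is shown below; the timeout value is arbitrary, and raise_for_status() is the standard requests way to turn 4xx/5xx responses into exceptions instead of silently returning an error page:

```python
import requests

def get_url_html_checked(url):
    # fail fast instead of hanging forever on an unresponsive server
    response = requests.get(url, timeout=10)
    # raises requests.HTTPError for 4xx/5xx status codes
    response.raise_for_status()
    return response.text
```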

I used the Chrome developer tools to inspect the HTML source and determine which HTML element has the required info. The first item I am trying to read is the city name, which resides in an H1 tag inside a div with id="location". Similarly, the current condition is in a div with id='curCond' and class='wx-value', and so on. So based on this HTML and CSS info, let us grab the required information from the web page.

def get_weather_from_html(html):
    soup = BeautifulSoup(html, 'html.parser')
    loc = soup.find(id='location').find('h1').get_text().strip()
    loc = get_city(loc)
    condition = soup.find(id='curCond').find(class_='wx-value').get_text()
    condition = clean_text(condition)
    temp = soup.find(id='curTemp').find(class_='wx-data').find(class_='wx-value').get_text()
    temp = clean_text(temp)
    unit = soup.find(id='curTemp').find(class_='wx-data').find(class_='wx-unit').get_text()
    unit = clean_text(unit)
    # print("Weather @ {0} : Temperature = {1}{2} {3}".format(loc, temp, unit, condition))
    # return the results as a named tuple
    w_report = WeatherReport(location=loc, temperature=temp, scale=unit, cond=condition)
    return w_report
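The find-by-id / find-by-class pattern used above can be tried on a tiny inline snippet. The ids and classes here mirror the ones in the function, but the HTML itself is made up; note that since class is a Python keyword, BS4 uses the class_ argument instead:

```python
from bs4 import BeautifulSoup

html = """
<div id="curTemp">
  <span class="wx-data">
    <span class="wx-value">75.8</span><span class="wx-unit">&deg;F</span>
  </span>
</div>
"""
soup = BeautifulSoup(html, 'html.parser')
# find(id=...) locates a tag by its id attribute,
# find(class_=...) by its CSS class
temp = soup.find(id='curTemp').find(class_='wx-data').find(class_='wx-value').get_text()
print(temp)  # 75.8
```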

The above function returns its results as a named tuple, which is convenient when returning several values at once: callers can access fields by name instead of by index. Named tuples are part of the collections module; this is how we create one.

# create a named tuple for returning multiple values
WeatherReport = collections.namedtuple('WeatherReport', 'location, temperature, scale, cond')
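To see why this is convenient, fields of the named tuple can be read by name, while it still behaves like an ordinary tuple. The values below are just sample data for illustration:

```python
import collections

WeatherReport = collections.namedtuple('WeatherReport', 'location, temperature, scale, cond')

# sample values, for illustration only
report = WeatherReport(location='San Diego, CA', temperature='75.8', scale='F', cond='Clear')
print(report.location)     # access by field name
print(report.temperature)
# it is still a tuple, so indexing and unpacking work too
loc, temp, scale, cond = report
```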

It is also important to note that the results might need some cleanup. For example, the city name we get back contains a newline character, so we need to do some cleanup there.

def clean_text(text: str):
    if not text:
        return text
    return text.strip() 

def get_city(text: str):
    if not text:
        return text
    text = clean_text(text)
    parts = text.split('\n')
    return parts[0]
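To see the cleanup in action, here is the same pair of helpers run on a made-up raw string that mimics the multi-line heading scraped from the page:

```python
def clean_text(text: str):
    if not text:
        return text
    return text.strip()

def get_city(text: str):
    if not text:
        return text
    text = clean_text(text)
    parts = text.split('\n')
    return parts[0]

# a made-up raw heading, similar in shape to what the page returns
raw = '\n  San Diego, CA\nElev 600 ft   \n'
print(get_city(raw))  # San Diego, CA
```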

Given below is the __main__ entry point, which ties all the pieces together.

if __name__ == '__main__':
    # base URL of the Weather Underground page that takes a zip code at the end;
    # adjust this to match the site's current URL scheme
    base_url = 'https://www.wunderground.com/weather'
    zip_code = get_zip()
    url = base_url + '/' + zip_code
    html = get_url_html(url)
    w_report = get_weather_from_html(html)
    print("The weather in {} is {}{} and {}".format(w_report.location, w_report.temperature,
                                                    w_report.scale, w_report.cond))


    Weather App

    Enter your zip : 92127
    The weather in San Diego, CA is 75.8°F and Clear

    Process finished with exit code 0

Coding is fun. Enjoy…