Rudimentary Scraper

I'm toying with the idea of attending a coding camp, so I was looking at a review website. I noticed I couldn't really read all the reviews I was looking at on one page or (as far as I could tell) sort them by ranking. I thought how useful it would be if I could just figure out a way to download all the review data and put it into a database.

Scanned the HTML for a bit, but didn't make any progress figuring out where the data I wanted was. Did a few searches that turned up some complicated-looking stuff (a tutorial from a company that devotes itself exclusively to HTML scraping), gave up, came back, and finally found this. Unfortunately, the tutorial was written for Python 2, so I had to figure out how to update the library imports for Python 3 -- now at least I understand a bit more about why Chuck's class imports urlopen. After that, it was as easy as mousing over the element I wanted on the review page and requesting the text.
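For anyone hitting the same Python 2 versus Python 3 snag, here is a minimal sketch of the import change. It assumes the tutorial pulled urlopen from urllib2, which was the usual Python 2 spelling; I'm guessing at that detail. In Python 3 the same function lives in urllib.request:

```python
# Python 3 moved urlopen into urllib.request; Python 2 kept it in urllib2.
# The try/except lets the same file run under either version.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib2 import urlopen  # Python 2 (assumed to be what the tutorial used)
```

If the script only ever needs to run on Python 3, the first import line alone is enough.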

Needs further development in at least a few directions:

1. Find a way to loop through all the review pages instead of just the first 9, and stop pulling data when I get to the end.
2. Pull other data, including the ranking and review text.
3. Put this all in a database. (Would there be too much text for it to run efficiently?)
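On the database question, a rough sketch of what that might look like with Python's built-in sqlite3 module. The table layout, the column names, the save_reviews and top_reviews helpers, and the sample rows are all made up for illustration; nothing here comes from the review site itself. A few thousand reviews' worth of text is small by SQLite standards, so I would not expect speed to be a problem.

```python
import sqlite3

def save_reviews(conn, reviews):
    """Create the table if needed and insert (title, rating, body) rows."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS reviews ("
        "id INTEGER PRIMARY KEY, title TEXT, rating REAL, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO reviews (title, rating, body) VALUES (?, ?, ?)",
        reviews,
    )
    conn.commit()

def top_reviews(conn, limit=5):
    """Return (title, rating) pairs, highest rating first."""
    cur = conn.execute(
        "SELECT title, rating FROM reviews ORDER BY rating DESC, id LIMIT ?",
        (limit,),
    )
    return cur.fetchall()

# Made-up sample rows, not real scraped data.
sample = [
    ("Great bootcamp", 5.0, "Learned a lot in a short time."),
    ("It was okay", 3.0, "Some sections felt rushed."),
    ("Solid curriculum", 4.5, "Good instructors and projects."),
]

conn = sqlite3.connect(":memory:")  # use a filename such as "reviews.db" to keep the data
save_reviews(conn, sample)
for title, rating in top_reviews(conn):
    print(rating, title)
```

The ORDER BY in top_reviews is the sort-by-ranking feature I couldn't find on the site.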

Still -- scraping the site proved a lot easier than I thought it would. I was afraid the information I wanted was all going to be hidden under layers of JavaScript requests or something like that -- but a lot of it is really right there. And I think (1) may not be that difficult to solve: what if I just keep incrementing i until titles is None, then break?

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Walk pages 1 through 9 of the review listing and print each review title.
i = 1
while i < 10:
    url = "https://www.coursereport.com/schools/coding-dojo?page=" + str(i) + "&sort_by=recent#reviews"
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    # Each review title sits in a <div class="review-title">.
    titles = soup.find_all('div', attrs={'class': 'review-title'})
    for title in titles:
        print(title.text)
    i += 1

UPDATE:

Figured out how to get all of the titles. I just had to understand what would happen to titles once the program ran past the last review page. I thought it might become None, but guarding against None didn't help, and after a few runs I had the bright idea to just print titles on each pass through the loop and see what that gave me. It turns out that when the data runs out, you just get an empty list. The following code quits the loop once the bucket stops pulling up water:

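# This goes inside the while loop, in place of the plain for loop.
# find_all returns an empty list once the pages run out.
if titles != []:
    for title in titles:
        print(title.text)
else:
    print("No more data.")
    break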
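Putting the stop-on-empty idea into one piece, here is a small sketch of the whole loop. To keep it runnable without hitting the site, the fetching step is passed in as a function; collect_titles, fake_fetch, FAKE_SITE, and the max_pages cap are names I invented for this example, and the fake pages stand in for real review data.

```python
def collect_titles(fetch_titles, max_pages=1000):
    """Call fetch_titles(page) for page 1, 2, 3, ... until a page comes back empty.

    fetch_titles is any function that takes a page number and returns a
    list of title strings. max_pages is a safety cap in case the site
    never returns an empty page.
    """
    all_titles = []
    for page in range(1, max_pages + 1):
        titles = fetch_titles(page)
        if not titles:  # an empty list is falsy, so this catches []
            break
        all_titles.extend(titles)
    return all_titles

# Stand-in for the real site: three pages of fake titles, then nothing.
FAKE_SITE = {1: ["A", "B"], 2: ["C"], 3: ["D", "E"]}

def fake_fetch(page):
    """Return the fake titles for a page, or [] past the last page."""
    return FAKE_SITE.get(page, [])

print(collect_titles(fake_fetch))
```

Against the real site, fetch_titles would wrap the urlopen and find_all calls from the script above and return the text of each title. Writing `if not titles` does the same job as comparing against `[]`.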

Coding Journal
