
Rudimentary Scraper

I'm toying with the idea of attending a coding camp, so I was looking at a review website.  I noticed I couldn't really read all the reviews I was looking at on one page or (as far as I could tell) sort them by ranking.  I thought how useful it would be if I could just figure out a way to download all the review data and put it into a database.

Scanned the HTML for a bit, but didn't make any progress figuring out where the data I wanted was.  Did a few searches that turned up some complicated-looking stuff (a tutorial from a company that devotes itself exclusively to HTML scraping), gave up, came back, and finally found this.  Unfortunately, the tutorial was written for Python 2, so I had to figure out how to update the imports -- now at least I understand a bit more about why Chuck's class imports urlopen.  After that, it was as easy as mousing over the element I wanted on the review page and requesting the text.
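For anyone hitting the same wall: as far as I can tell, the only change I actually needed was the import, since Python 2's urlopen moved into urllib.request in Python 3.

```python
# Python 2 (what the tutorial assumed):
#     from urllib import urlopen
# Python 3 moved it into urllib.request:
from urllib.request import urlopen

# Once the import is fixed, the call itself works the same way, e.g.:
# page = urlopen("https://www.coursereport.com/schools/coding-dojo")
```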

Needs further development in at least a few directions:

  1. Find a way to loop through all the review pages instead of the first 10 and stop pulling data when I get to the end.
  2. Pull other data, including the ranking and review text.
  3. Put this all on a database.  (Would there be too much text for it to run efficiently?)
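For (3), a first sqlite3 sketch -- the schema here (a `reviews` table with title, ranking, and body columns) is just my guess at what I'd want, and the inserted row is placeholder data, not anything scraped yet:

```python
import sqlite3

# Guessed schema: one table holding each review's title, ranking, and text.
conn = sqlite3.connect('reviews.db')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS reviews
               (title TEXT, ranking INTEGER, body TEXT)''')

# Placeholder row standing in for a scraped review.
cur.execute('INSERT INTO reviews VALUES (?, ?, ?)',
            ('Great bootcamp', 5, 'I learned a lot.'))
conn.commit()

# Reviews sorted by ranking -- the sort the site wouldn't give me.
for row in cur.execute('SELECT title, ranking FROM reviews ORDER BY ranking DESC'):
    print(row)
conn.close()
```

SQLite routinely handles databases far larger than a review site's worth of text, so I don't think efficiency will be the problem.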

Still -- scraping the site proved a lot easier than I thought it would.  I was afraid the information I wanted was all going to be hidden under layers of Javascript requests or something like that -- but a lot of it is really right there.  And I think (1) may not be that difficult to solve: what if I just keep incrementing i until titles is None, then break?

from urllib.request import urlopen
from bs4 import BeautifulSoup

# Fetch the first ten pages of reviews and print each review title.
i = 1
while i <= 10:
    url = "https://www.coursereport.com/schools/coding-dojo?page=" + str(i) + "&sort_by=recent#reviews"
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'review-title'})
    for title in titles:
        print(title.text)
    i += 1

UPDATE:

Figured out how to get all of the titles.  Just had to understand what was going to happen to titles when the program stopped fetching review pages.  I thought maybe it would become None, but taking precautions against nothing didn't serve me, and after a few runs I had the bright idea to just print titles out each time the loop ran and see what that gave me.  It turns out that when the data runs out, find_all just returns an empty list.  The following code quits the program once the bucket stops pulling up water:

    # inside the while loop, replacing the original print loop:
    if titles != []:
        for title in titles:
            print(title.text)
    else:
        print("No more data.")
        break
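Combining the pieces, here's the whole thing, restructured a little -- the parsing is split into its own function so it can be tried on a snippet of HTML without hitting the network.  The URL and class name are the same ones used above:

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

def titles_on_page(html):
    # Pull the text of every review-title div out of one page of HTML.
    soup = BeautifulSoup(html, 'html.parser')
    return [div.text for div in soup.find_all('div', attrs={'class': 'review-title'})]

def scrape_all_titles():
    # Keep requesting pages until one comes back with no reviews on it.
    i = 1
    while True:
        url = ("https://www.coursereport.com/schools/coding-dojo?page="
               + str(i) + "&sort_by=recent#reviews")
        titles = titles_on_page(urlopen(url))
        if titles != []:
            for title in titles:
                print(title)
        else:
            print("No more data.")
            break
        i += 1

# scrape_all_titles()  # uncomment to run against the live site
```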
