
Rudimentary Scraper

I'm toying with the idea of attending a coding camp, so I was looking at a review website.  I noticed I couldn't really read all the reviews I was looking at on one page or (as far as I could tell) sort them by ranking.  I thought how useful it would be if I could just figure out a way to download all the review data and put it into a database.

Scanned the HTML for a bit, but didn't make any progress figuring out where the data I wanted was.  Did a few searches that turned up some complicated-looking stuff (a tutorial from a company that devotes itself exclusively to HTML scraping), gave up, came back, and finally found this.  Unfortunately, the tutorial was written for use with Python 2, so I had to figure out how to update the imports for Python 3 -- now at least I understand a bit more about why Chuck's class imports urlopen.  After that, it was as easy as mousing over the element I wanted on the review page and requesting the text.

Needs further development in at least a few directions:

  1. Find a way to loop through all the review pages instead of just the first 9 (the loop stops before i reaches 10), and stop pulling data when I get to the end.
  2. Pull other data, including the ranking and review text.
  3. Put this all in a database.  (Would there be too much text for it to run efficiently?)

Still -- scraping the site proved a lot easier than I thought it would.  I was afraid the information I wanted was all going to be hidden under layers of JavaScript requests or something like that -- but a lot of it is really right there.  And I think (1) may not be that difficult to solve: what if I just keep incrementing i until titles is None, then break?

from urllib.request import urlopen
from bs4 import BeautifulSoup

i = 1
while i < 10:  # fetches review pages 1 through 9
    url = "https://www.coursereport.com/schools/coding-dojo?page=" + str(i) + "&sort_by=recent#reviews"
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    # Each review title sits in a <div class="review-title">
    titles = soup.find_all('div', attrs={'class': 'review-title'})
    for title in titles:
        print(title.text)
    i += 1
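For direction (3), here is a minimal sketch of what storing reviews in SQLite could look like, using only the standard library.  The table layout, the column names (title, rating, body), and the helper names save_reviews and top_rated are all made up for illustration; nothing here comes from the review site itself.  On the "too much text" worry: a few thousand plain-text reviews should be a small workload for SQLite.

```python
import sqlite3


def save_reviews(conn, reviews):
    """Store (title, rating, body) tuples in a hypothetical 'reviews' table."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS reviews (
               id INTEGER PRIMARY KEY,
               title TEXT NOT NULL,
               rating REAL,
               body TEXT
           )"""
    )
    # Parameterized inserts keep quotes in review text from breaking the SQL.
    conn.executemany(
        "INSERT INTO reviews (title, rating, body) VALUES (?, ?, ?)",
        reviews,
    )
    conn.commit()


def top_rated(conn, limit=5):
    """Return (title, rating) pairs, highest rating first."""
    cursor = conn.execute(
        "SELECT title, rating FROM reviews ORDER BY rating DESC, title LIMIT ?",
        (limit,),
    )
    return cursor.fetchall()
```

Calling sqlite3.connect("reviews.db") would give a connection backed by a file (the filename is just an example), and top_rated(conn) then covers the sort-by-ranking that the site didn't seem to offer.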

UPDATE:

Figured out how to get all of the titles. Just had to understand what was going to happen to titles when the program stopped fetching review pages.  I thought maybe it would become None, but guarding against None didn't get me anywhere, and after a few runs I had the bright idea to just print titles out each time the loop ran and see what that gave me.  So actually when the data runs out, find_all just gives you an empty list.  The following code, which goes inside the loop in place of the original for loop, quits the program once the bucket stops pulling up water:

    if titles:  # an empty list means this page had no reviews
        for title in titles:
            print(title.text)
    else:
        print("No more data.")
        break
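Putting the update together with the original loop, a complete version might look like the sketch below.  A few things are my own assumptions rather than part of the original script: the function names fetch_page and scrape_titles, the fetch parameter (which lets the loop be tried on canned HTML without hitting the site), the max_pages safety cap, and the .strip() on each title.

```python
from urllib.request import urlopen
from bs4 import BeautifulSoup

BASE_URL = "https://www.coursereport.com/schools/coding-dojo?page={}&sort_by=recent"


def fetch_page(page_number):
    """Download one review page and return its HTML (makes a network request)."""
    with urlopen(BASE_URL.format(page_number)) as response:
        return response.read()


def scrape_titles(fetch=fetch_page, max_pages=500):
    """Collect review titles page by page until a page has none.

    max_pages is an assumed safety cap so the loop cannot run forever
    if the site never returns an empty page.
    """
    all_titles = []
    for page_number in range(1, max_pages + 1):
        soup = BeautifulSoup(fetch(page_number), "html.parser")
        titles = soup.find_all("div", attrs={"class": "review-title"})
        if not titles:  # find_all returns an empty list, not None
            break
        all_titles.extend(title.text.strip() for title in titles)
    return all_titles
```

Running print(scrape_titles()) would fetch the live pages; passing a different fetch function makes it possible to check the stopping logic offline.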
