Rudimentary Scraper

I'm toying with the idea of attending a coding camp, so I was looking at a review website.  I noticed I couldn't read all the reviews on one page or (as far as I could tell) sort them by ranking, and I thought how useful it would be if I could figure out a way to download all the review data and put it into a database.

I scanned the HTML for a bit but didn't make any progress figuring out where the data I wanted was.  I did a few searches that turned up some complicated-looking stuff (a tutorial from a company that devotes itself exclusively to HTML scraping), gave up, came back, and finally found this.  Unfortunately, the tutorial was written for Python 2, so I had to figure out how to adapt the imports -- now at least I understand a bit more about why Chuck's class imports urlopen.  After that, it was as easy as mousing over the element I wanted on the review page and requesting its text.
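For the record, the fix boiled down to where urlopen lives now.  A minimal sketch of the difference (the Python 2 lines are what the tutorial assumed; the URL is just the review page I was working with):

# Python 2 (what the tutorial assumed):
#   from urllib import urlopen        # or: import urllib2; urllib2.urlopen(url)
# Python 3 (what I'm actually running):
from urllib.request import urlopen

page = urlopen("https://www.coursereport.com/schools/coding-dojo")
html = page.read()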

Needs further development in at least a few directions:

  1. Find a way to loop through all the review pages instead of just the first 10, and stop pulling data when I get to the end.
  2. Pull other data, including the ranking and the review text.
  3. Put all of this in a database.  (Would there be too much text for it to run efficiently?)  A rough sketch of how 2 and 3 might fit together follows this list.
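I haven't started on 2 or 3 yet, but here's roughly the shape I imagine they'd take.  The 'review-title' class is the one I actually scraped below; 'review', 'rating', and 'review-body' are placeholders for whatever classes the page really uses, and the database is just SQLite through the sqlite3 module:

from urllib.request import urlopen
from bs4 import BeautifulSoup
import sqlite3

conn = sqlite3.connect('reviews.db')
cur = conn.cursor()
cur.execute('''CREATE TABLE IF NOT EXISTS Reviews
               (title TEXT, rating TEXT, body TEXT)''')

url = "https://www.coursereport.com/schools/coding-dojo?page=1&sort_by=recent#reviews"
soup = BeautifulSoup(urlopen(url), 'html.parser')

# Each review block presumably wraps a title, a rating, and the review text.
# 'review-title' is real; 'review', 'rating', and 'review-body' are guesses.
for review in soup.find_all('div', attrs={'class': 'review'}):
    title = review.find('div', attrs={'class': 'review-title'})
    rating = review.find('span', attrs={'class': 'rating'})
    body = review.find('div', attrs={'class': 'review-body'})
    cur.execute('INSERT INTO Reviews VALUES (?, ?, ?)',
                (title.text if title else None,
                 rating.text if rating else None,
                 body.text if body else None))

conn.commit()
conn.close()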
Still -- scraping the site proved a lot easier than I thought it would.  I was afraid the information I wanted was all going to be hidden under layers of JavaScript requests or something like that -- but a lot of it is really right there.  And I think (1) may not be that difficult to solve: what if I just keep incrementing i until titles is None, then break?

from urllib.request import urlopen
from bs4 import BeautifulSoup

i = 1
while i <= 10:  # just the first 10 review pages for now
    url = "https://www.coursereport.com/schools/coding-dojo?page=" + str(i) + "&sort_by=recent#reviews"
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    # each review title sits in a div with class "review-title"
    titles = soup.find_all('div', attrs={'class': 'review-title'})
    for title in titles:
        print(title.text)
    i += 1

UPDATE:

Figured out how to get all of the titles.  I just had to understand what would happen to titles when the program ran out of review pages to fetch.  I thought maybe it would become None, but guarding against None didn't get me anywhere, so after a few runs I had the bright idea to just print titles each time the loop ran and see what that gave me.  It turns out that when the data runs out, you just get an empty list.  So I swapped the loop condition for while True and added the following check, which quits once the bucket stops pulling up water:

# find_all comes back as an empty list once we're past the last review page
if titles:
    for title in titles:
        print(title.text)
else:
    print("No more data.")
    break
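Putting it all together, the loop now looks like this -- same as before, except the condition is while True and the empty-list check decides when to stop:

from urllib.request import urlopen
from bs4 import BeautifulSoup

i = 1
while True:
    url = "https://www.coursereport.com/schools/coding-dojo?page=" + str(i) + "&sort_by=recent#reviews"
    page = urlopen(url)
    soup = BeautifulSoup(page, 'html.parser')
    titles = soup.find_all('div', attrs={'class': 'review-title'})
    if titles:
        for title in titles:
            print(title.text)
    else:
        print("No more data.")
        break
    i += 1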
